This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue

To: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
Subject: Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue
From: "Vincent, Pradeep" <pradeepv@xxxxxxxxxx>
Date: Thu, 19 May 2011 23:12:25 -0700
Accept-language: en-US
Cc: Jeremy Fitzhardinge <jeremy@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxxxx>, Daniel Stodden <daniel.stodden@xxxxxxxxxx>
Delivery-date: Thu, 19 May 2011 23:15:21 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <20110516152224.GA7195@xxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: AcwWtOQUuLcZEyw4RXeF9ggy9QHSLQ==
Thread-topic: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue
User-agent: Microsoft-MacOutlook/
Hey Konrad, 

Thanks for running the tests. Very useful data.

Re: Experiment to show latency improvement

I never ran anything on ramdisk.

You should be able to see the latency benefit with the 'orion' tool, but I
am sure other tools can be used as well. For a volume backed by a single
disk drive, keep the number of small random I/Os outstanding at 2 (I think
the "num_small" parameter in orion should do the job) with a 50-50 mix of
writes and reads. Measure the latency reported by the guest and by Dom-0
and compare them. For LVM volumes that present multiple drives as a single
LUN (inside the guest), the latency improvement will be highest when the
number of outstanding I/Os is 2X the number of spindles. This is the
'moderate I/O' scenario I was describing, and you should see a significant
improvement in latencies.
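
For example, a fio job along these lines should approximate that scenario
(just a sketch - fio is one option among many, and the device path and
block size are assumptions you'd adjust for your setup):

  ; 'moderate I/O' sketch: 2 outstanding small random I/Os, 50/50 mix.
  ; Run it once inside the guest and once against the backing device in
  ; Dom-0, then compare the completion latencies fio reports.
  [moderate-io]
  filename=/dev/xvdb
  ioengine=libaio
  direct=1
  rw=randrw
  rwmixread=50
  bs=4k
  iodepth=2
  runtime=60
  time_based=1

For the LVM case, raise iodepth to 2X the number of spindles behind the
LUN.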

If you allow the page cache to perform sequential I/O using dd or some
other sequential non-direct I/O generation tool, you should find that the
interrupt rate doesn't go up under high I/O load. Thinking about this, I
think the burstiness of I/O submission as seen by the driver is also a key
factor, particularly in the absence of the I/O coalescing waits introduced
by the I/O scheduler. Page cache draining is notoriously bursty.
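
For example (the paths here are assumptions), something like

  dd if=/dev/zero of=/mnt/data/bigfile bs=1M count=2048

run inside the guest, while watching the blkif line in /proc/interrupts,
should show the interrupt rate staying fairly flat even as the page cache
pushes out large bursts of writes.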

>>queue depth of 256.

What 'queue depth' is this? If I am not wrong, blkfront-blkback is
restricted to ~32 max pending I/Os due to the limit of one page being used
for mailbox entries - no?
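
For reference, here is my back-of-the-envelope version of where the ~32
comes from - a sketch of the __RING_SIZE arithmetic from
xen/interface/io/ring.h, with the struct sizes I believe the classic blkif
ABI uses (treat them as assumptions):

  #include <stdio.h>

  /* __RING_SIZE rounds the slot count down to a power of two. */
  static unsigned rounddown_pow_of_two(unsigned x)
  {
      unsigned p = 1;
      while (p * 2 <= x)
          p *= 2;
      return p;
  }

  int main(void)
  {
      const unsigned page   = 4096; /* one shared ring page */
      const unsigned header = 64;   /* blkif_sring counters + padding */
      const unsigned entry  = 112;  /* union of request/response; the
                                     * 11-segment request is the larger
                                     * member */

      /* (4096 - 64) / 112 = 36, rounded down to a power of two = 32 */
      printf("ring entries = %u\n",
             rounddown_pow_of_two((page - header) / entry));
      return 0;
  }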

>>But to my surprise, in the case where the I/O latency is high, the
>>interrupt generation was quite small

If this patch results in an extra interrupt, it will very likely reduce
the latency of the next I/O. If the increase in interrupt generation is
not high, then the number of I/Os whose latencies this patch has improved
is low. It looks like your workload belonged to this category. Perhaps
that's why you didn't see much of an improvement in overall performance?
I think this is close to the high I/O workload scenario I was describing.

>>But where the I/O latency was very very small (4 microseconds) the
>>interrupt generation was on average about 20K/s.

This is not a scenario I tested, but the results aren't surprising. This
isn't the high I/O load I was describing though (I didn't test ramdisk).
SSD is probably the closest real-world workload.
An increase of 20K/sec means this patch very likely improved the latency
of 20K I/Os per sec, although the absolute value of the latency
improvements would be smaller in this case. A 20K/sec interrupt rate (50
usec between interrupts) is something I would be comfortable with if the
interrupts directly translate to latency improvements for the users. The
graphs seem to indicate a 5% increase in throughput for this case - am I
reading the graphs right?

Overall, very useful tests indeed, and I haven't seen anything too
concerning or unexpected, except that I don't think you have seen the 50+%
latency benefit that the patch got me in my moderate I/O benchmark :-)
Feel free to ping me offline if you aren't able to see the latency impact
using the 'moderate I/O' methodology described above.

About IRQ coalescing: Stepping back a bit, there are a few different use
cases that an irq coalescing mechanism would be useful for:

1. Latency sensitive workloads: wait times of 10s of usecs. Particularly
useful for SSDs.
2. Interrupt rate conscious workloads/environments: wait times of 200+
usecs, which would essentially cap the theoretical interrupt rate at
5K/sec.
3. Excessive CPU consumption mitigation: similar to (2) but includes the
case of malicious guests. Perhaps not a big concern unless you have lots
of drives attached to each guest.

I suspect the implementations for (1) and (2) would be different (spin vs
sleep, perhaps). (3) can't be implemented by manipulating 'req_event',
since a guest can abuse the irq channel regardless of what 'blkback' tries
to tell 'blkfront' via 'req_event' manipulation.
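
To make the spin-vs-sleep distinction concrete, here is a rough sketch of
what a sleep-based wait for (2) could look like on the response path. All
of the names and the lock-free bookkeeping are made up for illustration -
this is not actual blkback code:

  #include <linux/hrtimer.h>
  #include <linux/kernel.h>
  #include <linux/ktime.h>
  #include <xen/events.h>

  /* Hypothetical per-blkif coalescing state. */
  struct coalesce_state {
      struct hrtimer timer;
      bool armed;       /* real code would need locking around this */
      int irq;          /* event channel irq to the frontend */
      ktime_t wait;     /* e.g. 200 usec caps the notify rate at ~5K/sec */
  };

  static enum hrtimer_restart coalesce_fire(struct hrtimer *t)
  {
      struct coalesce_state *c =
          container_of(t, struct coalesce_state, timer);

      c->armed = false;
      notify_remote_via_irq(c->irq); /* one irq covers the whole batch */
      return HRTIMER_NORESTART;
  }

  /* Called after a response is placed on the shared ring, instead of
   * notifying the frontend immediately. */
  static void coalesce_notify(struct coalesce_state *c)
  {
      if (!c->armed) {
          c->armed = true;
          hrtimer_start(&c->timer, c->wait, HRTIMER_MODE_REL);
      }
  }

For (1) the wait would be short enough that spinning, or simply deferring
the 'req_event' update, might be cheaper than the timer machinery.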

(3) could be implemented in the hypervisor as a generic irq throttler that
could be leveraged for all irqs heading to Dom-0 from DomUs, including
blkback/netback. Such a mechanism could potentially solve (1) and/or (2)
as well. Thoughts?

One crude way to address (3) for the 'many disk drive' scenario is to pin
all/most blkback interrupts for an instance to the same CPU core in Dom-0
and throttle down the thread wake-ups (wake_up(&blkif->wq) in
blkif_notify_work) that usually result in IPIs. Not an elegant solution,
but it might be a good crutch.
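
The pinning half of that could be done from userspace (the irq number
here is just a placeholder):

  # route this instance's blkback event-channel irq to CPU2
  echo 4 > /proc/irq/123/smp_affinity

leaving only the wake-up throttling to be done in blkif_notify_work
itself.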

Another angle to (1) and (2) is whether these irq coalescing settings
should be controllable by the guest, perhaps within limits set by Dom-0.

Thoughts? Suggestions?

Konrad, I'd love to help out if you are already working on something
around irq coalescing. Otherwise, when I have irq coalescing functionality
that can be consumed by the community, I will certainly submit it.

Meanwhile, I wouldn't want to deny Xen users the advantage of this patch
just because there is no irq coalescing functionality yet, particularly
since the downside is very minimal on the blkfront-blkback stack. My 2
cents.

Thanks much Konrad,

- Pradeep Vincent

On 5/16/11 8:22 AM, "Konrad Rzeszutek Wilk" <konrad.wilk@xxxxxxxxxx> wrote:

>On Thu, May 12, 2011 at 10:51:32PM -0400, Konrad Rzeszutek Wilk wrote:
>> > >>what were the numbers when it came to high bandwidth numbers
>> > 
>> > Under high I/O workload, where the blkfront would fill up the queue as
>> > blkback works the queue, the I/O latency problem in question doesn't
>> > manifest itself and as a result this patch doesn't make much of a
>> > difference in terms of interrupt rate. My benchmarks didn't show any
>> > significant effect.
>> I have to rerun my benchmarks. Under high load (so 64Kb, four threads
>> writing as much as they can to an iSCSI disk), the IRQ rate for each
>> blkif went from 2-3/sec to ~5K/sec. But I did not do a good
>> job of capturing the submission latency to see if the I/Os get the
>> response back as fast (or the same) as without your patch.
>> And the iSCSI disk on the target side was a RAMdisk, so latency
>> was quite small, which is not fair to your problem.
>> Do you have a program to measure the latency for the workload you
>> had encountered? I would like to run those numbers myself.
>Ran some more benchmarks over this week. This time I tried to run it on:
> - iSCSI target (1GB, and on the "other side" it wakes up every 1msec, so
>   latency is set to 1msec).
> - scsi_debug delay=0 (no delay and as fast as possible. Comes out to be
>   4 microseconds completion with queue depth of one with 32K I/Os).
> - local SATA I 80GB ST3808110AS. Still running as it is quite slow.
>With only one PV guest doing a round (three times) of two threads randomly
>writing I/Os with a queue depth of 256. Then a different round of four
>threads writing/reading (80/20) 512 bytes up to 64K randomly over the disk.
>I used the attached patch against #master
>to gauge how well we are doing (and what the interrupt generation rate is).
>These workloads I think would be considered 'high I/O' and I was expecting
>your patch to not have any influence on the numbers.
>But to my surprise, in the case where the I/O latency is high, the
>interrupt generation was quite small. But where the I/O latency was very
>very small (4 microseconds) the interrupt generation was on average about
>20K/s. And this is with a queue depth of 256 with four threads. I was
>expecting the opposite. Hence quite curious to see your use case.
>What do you consider middle I/O and low I/O cases? Do you use 'fio' for
>that?
>With the high I/O load, the numbers came out to give us about a 1% benefit
>with your patch. However, I am worried (maybe unnecessarily?) about the
>20K interrupt generation when the iometer tests kicked in (this was only
>when using the unrealistic 'scsi_debug' setup).
>The picture of this using iSCSI target:
>And when done on top of local RAMdisk:
