RE: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue

To: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
Subject: RE: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue
From: "Vincent, Pradeep" <pradeepv@xxxxxxxxxx>
Date: Tue, 24 May 2011 15:40:46 -0700
Accept-language: en-US
Cc: Jeremy Fitzhardinge <jeremy@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxxxx>, "Daniel@xxxxxxxxxxxxxxxxxxxx" <Daniel@xxxxxxxxxxxxxxxxxxxx>, Stodden <daniel.stodden@xxxxxxxxxx>
Delivery-date: Tue, 24 May 2011 15:41:26 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <20110524160249.GC29481@xxxxxxxxxxxx>
References: <20110516152224.GA7195@xxxxxxxxxxxx> <C9FAE626.161E7%pradeepv@xxxxxxxxxx> <20110524160249.GC29481@xxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: AcwaLBTc9+36kKUkSD+cmB75H6003wAM6zEQ
Thread-topic: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue
Responses inline, marked "PV:".

-----Original Message-----
From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@xxxxxxxxxx] 
Sent: Tuesday, May 24, 2011 9:03 AM
To: Vincent, Pradeep
Cc: Daniel@xxxxxxxxxxxxxxxxxxxx; Jeremy Fitzhardinge; 
xen-devel@xxxxxxxxxxxxxxxxxxx; Jan Beulich; Stodden
Subject: Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue

On Thu, May 19, 2011 at 11:12:25PM -0700, Vincent, Pradeep wrote:
> Hey Konrad, 
> 
> Thanks for running the tests. Very useful data.
> 
> Re: Experiment to show latency improvement
> 
> I never ran anything on ramdisk.
> 
> You should be able to see the latency benefit with 'orion' tool but I am

Link?

PV: http://www.oracle.com/technetwork/topics/index-089595.html

> sure other tools can be used as well. For a volume backed by a single disk
> drive, keep the number of small random I/O outstanding to 2 (I think
> "num_small" parameter in orion should do the job) with a 50-50 mix of
> write and read. Measure the latency reported by the guest and Dom-0 &
> compare them. For LVM volumes that present multiple drives as a single LUN
> (inside the guest), the latency improvement will be the highest when the
> number of I/O outstanding is 2X the number of spindles. This is the
> 'moderate I/O' scenario I was describing and you should see significant
> improvement in latencies.

Ok.
> 
> 
> If you allow page cache to perform sequential I/O using dd or other
> sequential non-direct I/O generation tool, you should find that the
> interrupt rate doesn't go up for high I/O load. Thinking about this, I
> think burstiness of I/O submission as seen by the driver is also a key
> player particularly in the absence of I/O coalescing waits introduced by
> I/O scheduler. Page cache draining is notoriously bursty.

Sure, .. though most of the tests I've been doing have been bypassing
the page cache.
> 
> >>queue depth of 256.
> 
> What 'queue depth' is this ? If I am not wrong, blkfront-blkback is

The 'request_queue' one. This is the block API one.

PV: Got it.

> restricted to ~32 max pending I/Os due to the limit of one page being used
> for mailbox entries - no ?

This is the frontend's block API queue I was thinking about. As for the
ring buffer .. um, I am not exactly sure of the right number (I would have
to compute it), but I believe it is much bigger.
The ring buffer entries are for 'requests', wherein each request can contain
up to 11 pages of data (nr segments).

PV: I just did a back of the envelope calculation for the size of a
blkif_request that gave me ~78 bytes, primarily dominated by 6 bytes per
segment for 11 segments per request. This would result in a max pending I/O
count of 32. This matches my recollection from a long time back, but I am not
sure if I missed something. Of course, like you said, each I/O req can have
44K of data, but small sized random I/O can't take advantage of it. (If I am
not wrong, netback takes a slightly different approach where each slot is
essentially a 4K page and multiple slots are used for larger sized packets.)
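
The arithmetic above can be sketched as follows. This is only a rough
check, not the actual ring macros: the 78-byte request size is the estimate
from this mail, while the 64-byte shared-ring header, the rounding down to a
power of two, and the 112-byte "padded" request size are assumptions about
how the ring is laid out and compiled.

```python
PAGE_SIZE = 4096          # one shared page holds the whole ring
RING_HEADER_BYTES = 64    # assumption: producer/event indices plus padding
SEGMENTS_PER_REQUEST = 11 # max segments per request, as discussed above


def round_down_pow2(n):
    """Largest power of two that is <= n (n must be >= 1)."""
    return 1 << (n.bit_length() - 1)


def ring_slots(request_bytes, page=PAGE_SIZE, header=RING_HEADER_BYTES):
    """Number of request slots that fit in one page.

    Assumption: the slot count is rounded down to a power of two so that
    ring indices can be masked instead of taken modulo.
    """
    return round_down_pow2((page - header) // request_bytes)


def max_data_per_request_kib(segments=SEGMENTS_PER_REQUEST, page=PAGE_SIZE):
    """Maximum payload of one request if every segment is a full page."""
    return segments * page // 1024


if __name__ == "__main__":
    print(ring_slots(78))              # estimate from this mail
    print(ring_slots(112))             # assumed padded size on a 64-bit build
    print(max_data_per_request_kib())  # KiB per request
```

Both request sizes land on the same slot count, so the estimate is not very
sensitive to the exact structure size.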

> 
> >>But to my surprise the case where the I/O latency is high, the interrupt
> >>generation was quite small
> 
> If this patch results in an extra interrupt, it will very likely result in
> reduction of latency for the next I/O. If the interrupt generation
> increase is not high, then the number of I/Os whose latencies this patch
> has improved is low. Looks like your workload belonged to this category.
> Perhaps that's why you didn't see much of an improvement in overall
> performance ? I think this is close to the high I/O workload scenario I
> described.
Ok
> 
> >>But where the I/O latency was very very small (4 microseconds) the
> >>interrupt generation was on average about 20K/s.
> 
> This is not a scenario I tested but the results aren't surprising.  This
> isn't the high I/O load I was describing though (I didn't test ramdisk).
> SSD is probably the closest real world workload.
> An increase of 20K/sec means this patch very likely improved latency of
> 20K I/Os per sec although the absolute value of latency improvements would
> be smaller in this case. 20K/sec interrupt rate (50usec delay between
> interrupts) is something I would be comfortable with if they directly
> translate to latency improvement for the users. The graphs seem to
> indicate a 5% increase in throughput for this case - Am I reading the

I came up with 1%. But those are a bit unrealistic - and I ordered
an SSD to do some proper testing.

PV: Terrific.

> graphs right ? 
> 
> Overall, Very useful tests indeed and I haven't seen anything too
> concerning or unexpected except that I don't think you have seen the 50+%
> latency benefit that the patch got me in my moderate I/O benchmark :-)

Let me redo the tests again.

PV: Thanks much. Let me know if you need more info on test setup.

> Feel free to ping me offline if you aren't able to see the latency impact
> using the 'moderate I/O' methodology described above.
> 
> About IRQ coalescing: Stepping back a bit, there are a few different use
> cases that an irq coalescing mechanism would be useful for
> 
> 1. Latency sensitive workload: Wait time of 10s of usecs. Particularly
> useful for SSDs. 
> 2. Interrupt rate conscious workload/environment: Wait time of 200+ usecs
> which will essentially cap the theoretical interrupt rate to 5K.
> 3. Excessive CPU consumption Mitigation: This is similar to (2) but
> includes the case of malicious guests. Perhaps not a big concern unless
> you have lots of drives attached to each guest.
> 
> I suspect the implementation for (1) and (2) would be different (spin vs
> sleep perhaps). (3) can't be implemented by manipulation of 'req_event'
> since a guest has the ability to abuse irq channel independent of what
> 'blkback' tries to tell 'blkfront' via 'req_event' manipulation.
> 
> (3) could be implemented in the hypervisor as a generic irq throttler that
> could be leveraged for all irqs heading to Dom-0 from DomUs including
> blkback/netback. Such a mechanism could potentially solve (1) and/or (2)
> as well. Thoughts ?

The hypervisor does have some irq storm avoidance mechanism, though the
number is 100K/sec and it only applies to physical IRQs.

PV: I will take a closer look to see what hypervisor already does here.
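
For reference, the rates and wait times quoted in this thread (a 200 usec
wait capping the rate at 5K/sec, 20K/sec being 50 usec apart, and the
100K/sec storm threshold) are just reciprocals of each other. A minimal
sketch, with hypothetical helper names, assuming at most one interrupt per
wait interval:

```python
USEC_PER_SEC = 1_000_000


def max_interrupt_rate_per_sec(wait_usec):
    """Upper bound on interrupts/sec if at least wait_usec separates them."""
    return USEC_PER_SEC / wait_usec


def interval_usec(rate_per_sec):
    """Average spacing between interrupts at a given rate."""
    return USEC_PER_SEC / rate_per_sec


if __name__ == "__main__":
    print(max_interrupt_rate_per_sec(200))  # case (2): 200 usec wait
    print(interval_usec(20_000))            # the 20K/sec scsi_debug case
    print(interval_usec(100_000))           # the 100K/sec storm threshold
```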

> 
> One crude way to address (3) for 'many disk drive' scenario is to pin
> all/most blkback interrupts for an instance to the same CPU core in Dom-0
> and throttle down the thread wake up (wake_up(&blkif->wq) in
> blkif_notify_work) that usually results in IPIs. Not an elegant solution
> but might be a good crutch.
> 
> Another angle to (1) and (2) is whether these irq coalesce settings should
> be controllable by the guest, perhaps within limits set by the
> administrator. 
> 
> Thoughts ? Suggestions ?
> 
> Konrad, Love to help out if you are already working on something around
> irq coalescing. Or when I have irq coalescing functionality that can be

Not yet. Hence hinting for you to do it :-)

> consumed by the community, I will certainly submit it.
> 
> Meanwhile, I wouldn't want to deny Xen users the advantage of this patch
> just because there is no irq coalescing functionality. Particularly since
> the downside is very minimal on blkfront-blkback stack. My 2 cents..
> 
> Thanks much Konrad,
> 
> - Pradeep Vincent
> 
> 
> 
> 
> On 5/16/11 8:22 AM, "Konrad Rzeszutek Wilk" <konrad.wilk@xxxxxxxxxx> wrote:
> 
> >On Thu, May 12, 2011 at 10:51:32PM -0400, Konrad Rzeszutek Wilk wrote:
> >> > >>what were the numbers when it came to high bandwidth numbers
> >> > 
> >> > Under high I/O workload, where the blkfront would fill up the queue as
> >> > blkback works the queue, the I/O latency problem in question doesn't
> >> > manifest itself and as a result this patch doesn't make much of a
> >> > difference in terms of interrupt rate. My benchmarks didn't show any
> >> > significant effect.
> >> 
> >> I have to rerun my benchmarks. Under high load (so 64Kb, four threads
> >> writing as much as they can to an iSCSI disk), the IRQ rate for each
> >> blkif went from 2-3/sec to ~5K/sec. But I did not do a good
> >> job on capturing the submission latency to see if the I/Os get the
> >> response back as fast (or the same) as without your patch.
> >> 
> >> And the iSCSI disk on the target side was a RAMdisk, so latency
> >> was quite small which is not fair to your problem.
> >> 
> >> Do you have a program to measure the latency for the workload you
> >> had encountered? I would like to run those numbers myself.
> >
> >Ran some more benchmarks over this week. This time I tried to run it on:
> >
> > - iSCSI target (1GB, and on the "other side" it wakes up every 1msec, so
> >   the latency is set to 1msec).
> > - scsi_debug delay=0 (no delay and as fast as possible. Comes out to be
> >   about 4 microseconds completion with queue depth of one with 32K I/Os).
> > - local SATAI 80GB ST3808110AS. Still running as it is quite slow.
> >
> >With only one PV guest doing a round (three times) of two threads randomly
> >writing I/Os with a queue depth of 256. Then a different round of four
> >threads writing/reading (80/20) 512bytes up to 64K randomly over the
> >disk.
> >
> >I used the attached patch against #master
> >(git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git)
> >to gauge how well we are doing (and what the interrupt generation rate
> >is).
> >
> >These workloads I think would be considered 'high I/O' and I was expecting
> >your patch to not have any influence on the numbers.
> >
> >But to my surprise the case where the I/O latency is high, the interrupt
> >generation was quite small. But where the I/O latency was very very small
> >(4 microseconds) the interrupt generation was on average about 20K/s. And
> >this is with a queue depth of 256 with four threads. I was expecting the
> >opposite. Hence quite curious to see your use case.
> >
> >What do you consider a middle I/O and low I/O cases? Do you use 'fio' for
> >your testing?
> >
> >With the high I/O load, the numbers came out to give us about 1% benefit
> >with your patch. However, I am worried (maybe unnecessarily?) about the
> >20K interrupt generation when the iometer tests kicked in (this was only
> >when using the unrealistic 'scsi_debug' drive).
> >
> >The picture of this using iSCSI target:
> >http://darnok.org/xen/amazon/iscsi_target/iometer-bw.png
> >
> >And when done on top of local RAMdisk:
> >http://darnok.org/xen/amazon/scsi_debug/iometer-bw.png
> >
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel