
RE: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue



Response inline..

-----Original Message-----
From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@xxxxxxxxxx] 
Sent: Tuesday, May 24, 2011 9:03 AM
To: Vincent, Pradeep
Cc: Daniel Stodden <Daniel@xxxxxxxxxxxxxxxxxxxx>; Jeremy Fitzhardinge; 
xen-devel@xxxxxxxxxxxxxxxxxxx; Jan Beulich
Subject: Re: [Xen-devel] [PATCH] blkback: Fix block I/O latency issue

On Thu, May 19, 2011 at 11:12:25PM -0700, Vincent, Pradeep wrote:
> Hey Konrad, 
> 
> Thanks for running the tests. Very useful data.
> 
> Re: Experiment to show latency improvement
> 
> I never ran anything on ramdisk.
> 
> You should be able to see the latency benefit with the 'orion' tool, but I am

Link?

PV: http://www.oracle.com/technetwork/topics/index-089595.html

> sure other tools can be used as well. For a volume backed by a single disk
> drive, keep the number of small random I/Os outstanding at 2 (I think the
> "num_small" parameter in orion should do the job) with a 50-50 mix of
> write and read. Measure the latency reported by the guest and Dom-0 &
> compare them. For LVM volumes that present multiple drives as a single LUN
> (inside the guest), the latency improvement will be the highest when the
> number of I/Os outstanding is 2X the number of spindles. This is the
> 'moderate I/O' scenario I was describing and you should see significant
> improvement in latencies.

Ok.
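PV: In case it helps reproduce the 'moderate I/O' case without orion, something
along these lines with fio should be close (illustrative only -- the device
path and runtime are placeholders, adjust for your setup):

    fio --name=moderate-io --filename=/dev/xvdb --direct=1 --ioengine=libaio \
        --rw=randrw --rwmixwrite=50 --bs=4k --iodepth=2 \
        --runtime=60 --time_based

For a single-spindle volume, iodepth=2 is the sweet spot; for an LVM volume
striped over N drives, bump iodepth to roughly 2*N, then compare the latency
the guest reports against what Dom-0 sees for the same device.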
> 
> 
> If you allow the page cache to perform sequential I/O using dd or another
> sequential non-direct I/O generation tool, you should find that the
> interrupt rate doesn't go up for a high I/O load. Thinking about this, I
> think the burstiness of I/O submission as seen by the driver is also a key
> player, particularly in the absence of the I/O coalescing waits introduced by
> the I/O scheduler. Page cache draining is notoriously bursty.

Sure, .. though most of the tests I've been doing have been bypassing
the page cache.
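PV: Right - for the page-cache case even plain buffered dd (no oflag=direct)
against the PV disk is enough to see it, e.g. something like (path is just an
example):

    dd if=/dev/zero of=/mnt/pvdisk/testfile bs=1M count=4096

The writeback flushes arrive at blkfront in large bursts, which is why the
interrupt rate stays flat even as the I/O load climbs.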
> 
> >>queue depth of 256.
> 
> What 'queue depth' is this ? If I am not wrong, blkfront-blkback is

The 'request_queue' one. This is the block API one.

PV: Got it.

> restricted to ~32 max pending I/Os due to the limit of one page being used
> for mailbox entries - no ?

This is the frontend's block API queue I was thinking about. In regard to
the ring buffer .. um, I am not exactly sure of the right number (I would have
to compute it), but I believe it is much bigger.
The ring buffer entries are for 'requests', wherein each request can contain
up to 11 pages of data (nr segments).

PV: I just did a back-of-the-envelope calculation for the size of blkif_request 
that gave me ~78 bytes, dominated by 6 bytes per segment for 11 segments per 
request. That would result in a max pending I/O count of 32. This matches my 
recollection from a long time back, but I am not sure if I missed something. 
Of course, like you said, each I/O request can carry 44K of data, but small 
random I/O can't take advantage of it. (If I am not wrong, netback takes a 
slightly different approach where each slot is essentially a 4K page and 
multiple slots are used for larger packets.)
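PV: For completeness, here is that back-of-the-envelope as a little standalone
program. The field sizes are from the blkif ring ABI as I recall it, without
padding, so my sum comes out slightly higher (~86 bytes) than the ~78 above --
treat the exact byte counts as approximate. The point is that the shared-ring
macros round the slot count down to a power of two, which pins the answer at
32 either way:

    /* Rough, self-contained sketch of blkif ring sizing -- not the real
     * headers.  Per-segment and per-request sizes are approximate; the real
     * ring macro also subtracts the ring header from the page before
     * dividing, which does not change the result here. */
    #include <stdio.h>

    #define SEGS_PER_REQ 11      /* BLKIF_MAX_SEGMENTS_PER_REQUEST */
    #define PAGE_SIZE    4096

    int main(void)
    {
        /* segment: grant ref (4) + first_sect (1) + last_sect (1) */
        unsigned seg = 4 + 1 + 1;
        /* header: operation (1) + nr_segments (1) + handle (2)
         *         + id (8) + sector_number (8) */
        unsigned hdr = 1 + 1 + 2 + 8 + 8;
        unsigned req = hdr + SEGS_PER_REQ * seg; /* ~86 bytes, ~112 w/ padding */

        /* The shared-ring macros round the slot count down to a power of two:
         * clearing the lowest set bit repeatedly leaves the largest power of
         * two <= PAGE_SIZE / req. */
        unsigned slots = PAGE_SIZE / req;
        while (slots & (slots - 1))
            slots &= slots - 1;

        printf("request ~%u bytes -> %u ring slots\n", req, slots); /* 32 */
        printf("max data per request: %u KiB\n", SEGS_PER_REQ * 4u); /* 44 */
        return 0;
    }

So whichever way you slice the padding, one 4K page of ring entries gives 32
outstanding requests, each carrying up to 44K of data.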

> 
> >>But to my surprise the case where the I/O latency is high, the interrupt
> >>generation was quite small
> 
> If this patch results in an extra interrupt, it will very likely result in
> reduction of latency for the next I/O. If the interrupt generation
> increase is not high, then the number of I/Os whose latencies this patch
> has improved is low. Looks like your workload belonged to this category.
> Perhaps that's why you didn't see much of an improvement in overall
> performance ? I think this is close to the high I/O workload scenario I
> described.
Ok
> 
> >>But where the I/O latency was very very small (4 microseconds) the
> >>interrupt generation was on average about 20K/s.
> 
> This is not a scenario I tested but the results aren't surprising.  This
> isn't the high I/O load I was describing though (I didn't test ramdisk).
> SSD is probably the closest real world workload.
> An increase of 20K/sec means this patch very likely improved the latency of
> 20K I/Os per sec, although the absolute value of the latency improvement would
> be smaller in this case. A 20K/sec interrupt rate (50 usec delay between
> interrupts) is something I would be comfortable with if it directly
> translates to a latency improvement for the users. The graphs seem to
> indicate a 5% increase in throughput for this case - Am I reading the

I came up with 1%. But those are a bit unrealistic - and I ordered
an SSD to do some proper testing.

PV: Terrific.

> graphs right ? 
> 
> Overall, very useful tests indeed, and I haven't seen anything too
> concerning or unexpected except that I don't think you have seen the 50+%
> latency benefit that the patch got me in my moderate I/O benchmark :-)

Let me redo the tests again.

PV: Thanks much. Let me know if you need more info on test setup.

> Feel free to ping me offline if you aren't able to see the latency impact
> using the 'moderate I/O' methodology described above.
> 
> About IRQ coalescing: Stepping back a bit, there are a few different use
> cases that an irq coalescing mechanism would be useful for
> 
> 1. Latency-sensitive workloads: wait time of 10s of usecs. Particularly
> useful for SSDs. 
> 2. Interrupt-rate-conscious workloads/environments: wait time of 200+ usecs,
> which will essentially cap the theoretical interrupt rate at 5K.
> 3. Excessive CPU consumption mitigation: this is similar to (2) but
> includes the case of malicious guests. Perhaps not a big concern unless
> you have lots of drives attached to each guest.
> 
> I suspect the implementation for (1) and (2) would be different (spin vs
> sleep perhaps). (3) can't be implemented by manipulation of 'req_event'
> since a guest has the ability to abuse the irq channel independent of what
> 'blkback' tries to tell 'blkfront' via 'req_event' manipulation.
> 
> (3) could be implemented in the hypervisor as a generic irq throttler that
> could be leveraged for all irqs heading to Dom-0 from DomUs including
> blkback/netback. Such a mechanism could potentially solve (1) and/or (2)
> as well. Thoughts ?

The hypervisor does have an irq storm avoidance mechanism, though the
number is 100K/sec and it only applies to physical IRQs.

PV: I will take a closer look to see what the hypervisor already does here.
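PV: And to make option (2) above concrete, the blkback-side variant I had in
mind is roughly the sketch below. The function and struct names are made up
for illustration and the deferred re-kick, per-ring state and locking are
omitted - this is not against any particular tree:

    /* Hypothetical notification rate cap in blkback's completion path.
     * If we already kicked the frontend less than MIN_NOTIFY_DELAY_US ago,
     * skip the event and let the next completion (or a timer) deliver it.
     * Fragment only: per-blkif state and locking are left out. */
    #define MIN_NOTIFY_DELAY_US 200

    static ktime_t last_notify;

    static void maybe_notify_frontend(blkif_t *blkif, int notify)
    {
            ktime_t now = ktime_get();

            if (!notify)
                    return;

            if (ktime_us_delta(now, last_notify) < MIN_NOTIFY_DELAY_US)
                    return;  /* coalesce: a later kick covers this one */

            last_notify = now;
            notify_remote_via_irq(blkif->irq);
    }

With a 200 usec floor that caps the theoretical per-ring interrupt rate at
about 5K/sec, which is the case (2) number above; case (1) would want a much
smaller floor, or a short spin instead of skipping the event.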

> 
> One crude way to address (3) for the 'many disk drive' scenario is to pin
> all/most blkback interrupts for an instance to the same CPU core in Dom-0
> and throttle down the thread wake-ups (wake_up(&blkif->wq) in
> blkif_notify_work) that usually result in IPIs. Not an elegant solution,
> but it might be a good crutch.
> 
> Another angle to (1) and (2) is whether these irq coalesce settings should
> be controllable by the guest, perhaps within limits set by the
> administrator. 
> 
> Thoughts ? Suggestions ?
> 
> Konrad, I'd love to help out if you are already working on something around
> irq coalescing. Or when I have irq coalescing functionality that can be

Not yet. Hence hinting for you to do it :-)

> consumed by the community, I will certainly submit it.
> 
> Meanwhile, I wouldn't want to deny Xen users the advantage of this patch
> just because there is no irq coalescing functionality. Particularly since
> the downside is very minimal on the blkfront-blkback stack. My 2 cents..
> 
> Thanks much Konrad,
> 
> - Pradeep Vincent
> 
> 
> 
> 
> On 5/16/11 8:22 AM, "Konrad Rzeszutek Wilk" <konrad.wilk@xxxxxxxxxx> wrote:
> 
> >On Thu, May 12, 2011 at 10:51:32PM -0400, Konrad Rzeszutek Wilk wrote:
> >> > >>what were the numbers when it came to high bandwidth numbers
> >> > 
> >> > Under high I/O workload, where the blkfront would fill up the queue as
> >> > blkback works the queue, the I/O latency problem in question doesn't
> >> > manifest itself and as a result this patch doesn't make much of a
> >> > difference in terms of interrupt rate. My benchmarks didn't show any
> >> > significant effect.
> >> 
> >> I have to rerun my benchmarks. Under high load (so 64Kb, four threads
> >> writing as much as they can to an iSCSI disk), the IRQ rate for each
> >> blkif went from 2-3/sec to ~5K/sec. But I did not do a good
> >> job of capturing the submission latency to see if the I/Os get the
> >> response back as fast (or the same) as without your patch.
> >> 
> >> And the iSCSI disk on the target side was a RAMdisk, so latency
> >> was quite small which is not fair to your problem.
> >> 
> >> Do you have a program to measure the latency for the workload you
> >> had encountered? I would like to run those numbers myself.
> >
> >Ran some more benchmarks over this week. This time I tried to run it on:
> >
> > - iSCSI target (1GB, and on the "other side" it wakes up every 1msec, so
> >   the latency is set to 1msec).
> > - scsi_debug delay=0 (no delay and as fast as possible. Comes out to be
> >   about 4 microseconds completion with a queue depth of one with 32K I/Os).
> > - local SATA I 80GB ST3808110AS. Still running as it is quite slow.
> >
> >With only one PV guest doing a round (three times) of two threads randomly
> >writing I/Os with a queue depth of 256. Then a different round of four
> >threads writing/reading (80/20) 512 bytes up to 64K randomly over the
> >disk.
> >
> >I used the attached patch against #master
> >(git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git)
> >to gauge how well we are doing (and what the interrupt generation rate
> >is).
> >
> >These workloads I think would be considered 'high I/O' and I was expecting
> >your patch to not have any influence on the numbers.
> >
> >But to my surprise, in the case where the I/O latency is high, the
> >interrupt generation was quite small. But where the I/O latency was very,
> >very small (4 microseconds), the interrupt generation was on average about
> >20K/s. And this is with a queue depth of 256 with four threads. I was
> >expecting the opposite. Hence quite curious to see your use case.
> >
> >What do you consider middle I/O and low I/O cases? Do you use 'fio' for
> >your
> >testing?
> >
> >With the high I/O load, the numbers came out to give us about 1% benefit
> >with your
> >patch. However, I am worried (maybe unnecessarily?) about the 20K
> >interrupt generation
> >when the iometer tests kicked in (this was only when using the
> >unrealistic 'scsi_debug'
> >drive).
> >
> >The picture of this using iSCSI target:
> >http://darnok.org/xen/amazon/iscsi_target/iometer-bw.png
> >
> >And when done on top of local RAMdisk:
> >http://darnok.org/xen/amazon/scsi_debug/iometer-bw.png
> >
> 
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

