Re: [Xen-devel] Xen dom0 network I/O scalability
On Tue, 2011-05-10 at 21:23 +0100, Konrad Rzeszutek Wilk wrote:
> On Mon, May 09, 2011 at 09:13:09PM -0500, Kaushik Kumar Ram wrote:
> > On Apr 27, 2011, at 12:50 PM, Konrad Rzeszutek Wilk wrote:
> > > > So the current implementation of netback does not scale beyond a
> > > > single CPU core, thanks to the use of tasklets, making it a
> > > > bottleneck (please correct me if I am wrong). I remember coming
> > > > across some patches which attempt to use softirqs instead of
> > > > tasklets to solve this issue. But the latest version of the
> > > > linux-2.6-xen.hg repo does not include them. Are they included in
> > > > some other version of dom0 Linux? Or will they be included in
> > > > the future?
> > >
> > > You should be using the 2.6.39 kernel or the 2.6.32 to take
> > > advantage of those patches.
> > Thanks Konrad. I got hold of a pvops dom0 kernel from Jeremy's git
> > repo (xen/stable-2.6.32.x). As you pointed out, it did include those
> > patches. I spent some time studying the new netback design and ran
> > some experiments. I have a few questions regarding them.
> > I am using the latest version of the hypervisor from the
> > xen-unstable.hg repo. I ran the experiments on a dual-socket AMD
> > quad-core Opteron machine (with 8 CPU cores). My experiments simply
> > involved running 'netperf' between 1 or 2 pairs of VMs on the same
> > machine. I allocated 4 vcpus for dom0 and one each for the VMs. None
> > of the vcpus were pinned.
> > - So the new design allows you to choose between tasklets and
> >   kthreads within netback, with tasklets being the default option. Is
> >   there any specific reason for this?
> Not sure where the thread is for this - but when the patches for that
> were posted it showed a big improvement in performance over 10GB. But
> it did require spreading the netback across the CPUs.
Using a tasklet basically allows netback to take the entire CPU under
heavy load (mostly a problem when you only have the same number of VCPUs
assigned to dom0 as you have netback tasklets). Using a thread causes
the network processing to at least get scheduled alongside e.g. your
sshd and toolstack. I seem to remember a small throughput reduction in
thread vs. tasklet mode but this is more than offset by the "can use
dom0" factor. In the upstream version of netback I removed the tasklet
option so threaded is the only choice.
The decision to go multi-thread/tasklet was somewhat orthogonal to this
and was about utilising all of the VCPUs in dom0. Previously netback
would only ever use 1 CPU. In the new design each VIF is statically
assigned to a particular netback thread at start of day. So for a given
VIF there is no real difference other than lower contention with other
VIFs.
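To make the static assignment concrete, here is a minimal sketch in Python. It is not taken from the netback source. The function name and the "pick the least-loaded group" policy are assumptions for illustration only. The point is just that the mapping is fixed once per VIF and never rebalanced afterwards.

```python
def assign_vifs(vif_names, num_groups):
    """Statically map each VIF to one worker group (hypothetical policy).

    Each VIF goes to the group that currently has the fewest VIFs; ties are
    broken in favour of the lowest group index. The mapping is computed once
    and is never changed afterwards.
    """
    load = [0] * num_groups          # number of VIFs per group so far
    assignment = {}
    for name in vif_names:
        group = min(range(num_groups), key=lambda g: load[g])
        assignment[name] = group
        load[group] += 1
    return assignment


# Illustrative VIF names: five guests sharing four netback groups.
vifs = ["vif1.0", "vif2.0", "vif3.0", "vif4.0", "vif5.0"]
assignment = assign_vifs(vifs, 4)
```

With five VIFs and four groups, one group ends up serving two VIFs, and those two contend with each other regardless of how idle the other groups are.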
> > - The inter-VM performance (throughput) is worse using both tasklets
> >   and kthreads as compared to the old version of netback (as in the
> >   linux-2.6-xen.hg repo). I observed about a 50% drop in throughput in
> >   my experiments. Has anyone else observed this? Is the new version
> >   yet to be optimized?
> That is not surprising. The "new" version of netback copies pages. It
> does not "swizzle" or "map" them between domains (so no zero copying).
I think Kaushik is running a xen/2.6.32.x tree, and the copy-only
variant is only in mainline.
A 50% drop in performance between linux-2.6-xen.hg and the xen.git
2.6.32 tree is slightly worrying but such a big drop sounds more like a
misconfiguration, e.g. something like enabling debugging options in the
kernel .config rather than a design or implementation issue in netback.
(I actually have no idea what was in the linux-2.6-xen.hg tree. I don't
recall such a tree ever being properly maintained, the last cset appears
to be from 2006 and I recently cleaned it out of xenbits because no one
knew what it was -- did you mean linux-2.6.18-xen.hg?)
> > - Two tasklets (rx and tx) are created per vcpu within netback. But
> >   in my experiments I noticed that only one vcpu was being used during
> >   the experiments (even with 4 VMs). I also observed that all the
> >   event channel notifications within netback are always sent to vcpu
> >   0. So my conjecture is that since the tasklets are always scheduled
> >   by vcpu 0, all of them are run only on vcpu 0. Is this a BUG?
> Yes. We need to fix 'irqbalance' to work properly. There is something
> not working right.
The fix is to install the "irqbalance" package. Without it no IRQ
balancing will occur in a modern kernel (perhaps this linux-2.6-xen.hg
tree was from a time when the kernel would do balancing on its own?).
You can also manually balance the VIF IRQs under /proc/irq if you are so
inclined.
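As a rough sketch of the manual route: find the IRQs whose name looks like a VIF in /proc/interrupts, then write a hex CPU mask to /proc/irq/<N>/smp_affinity for each one. The helper names below are made up for illustration, and the sample text is invented (the exact columns and chip names in /proc/interrupts vary between kernels). Writing the masks needs root in dom0, so the example only computes and prints the plan.

```python
import re


def find_vif_irqs(interrupts_text):
    """Return {irq_number: vif_name} for lines whose last field is vifX.Y."""
    irqs = {}
    for line in interrupts_text.splitlines():
        m = re.match(r"\s*(\d+):.*\s(vif\d+\.\d+)\s*$", line)
        if m:
            irqs[int(m.group(1))] = m.group(2)
    return irqs


def plan_affinity(irqs, num_cpus):
    """Spread IRQs round-robin over CPUs.

    Values are hex bitmasks with one bit set, the format smp_affinity takes.
    """
    plan = {}
    for i, irq in enumerate(sorted(irqs)):
        plan[irq] = format(1 << (i % num_cpus), "x")
    return plan


def apply_affinity(plan):
    """Write each mask to /proc/irq/<irq>/smp_affinity (requires root)."""
    for irq, mask in plan.items():
        with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
            f.write(mask)


# Invented sample in the general shape of /proc/interrupts.
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
 16:       1200          0          0          0   xen-pirq-ioapic-level  ehci_hcd
 76:      90210          0          0          0   xen-dyn-event     vif1.0
 77:      88123          0          0          0   xen-dyn-event     vif2.0
"""

if __name__ == "__main__":
    # Dry run on the sample; on a real dom0 you would read /proc/interrupts
    # and then call apply_affinity(plan) as root.
    print(plan_affinity(find_vif_irqs(SAMPLE), num_cpus=4))
```

If the irqbalance daemon is running it may later rewrite these masks, so pick one approach or the other.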
> > - Unlike with tasklets, I observed the CPU utilization go up when I
> >   used kthreads and increased the number of VMs. But the performance
> >   never scaled up. On profiling the code (using xenoprof) I observed
> >   significant synchronization overhead due to lock contention. The
> >   main culprit seems to be the per-domain lock acquired inside the
> >   hypervisor (specifically within do_grant_table_op). Further, packets
> >   are copied (inside gnttab_copy) while this lock is held. Seems like
> >   a bad idea?
> Ian was thinking (and he proposed a talk at Linux Plumbers Conference)
> to reintroduce the zero copying functionality. But it is not an easy
> problem because of the way the pages go through the Linux kernel
> > - A smaller source of overhead is when the '_lock' is acquired
> >   within netback in netif_idx_release(). Shouldn't this lock be per
> >   struct xen_netbk instead of being global (declared as static within
> >   the function)? Is this a BUG?
> Ian, what is your thought?
I suspect the _lock could be moved into the netbk. I expect it was just
missed in the switch to multi-threading because it was static in the
function instead of a normal global var located with all the others.
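To illustrate the idea only (this is a user-space Python model, not kernel code, and every name in it is hypothetical): when the lock lives inside each per-thread instance, releases on different instances never serialise against each other, while releases on the same instance are still protected.

```python
import threading


class NetbkGroup:
    """Stand-in for one per-thread netback instance (hypothetical model)."""

    def __init__(self):
        # The lock is a member of the instance, so each group has its own,
        # instead of one lock shared by every caller of the release function.
        self.release_lock = threading.Lock()
        self.released = []

    def idx_release(self, pending_idx):
        # Only callers working on this same group contend here.
        with self.release_lock:
            self.released.append(pending_idx)


def worker(group, base, count):
    for i in range(count):
        group.idx_release(base + i)


groups = [NetbkGroup() for _ in range(4)]
threads = [
    threading.Thread(target=worker, args=(g, n * 1000, 1000))
    for n, g in enumerate(groups)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In the kernel the equivalent change would presumably be a spinlock field in the per-group structure, initialised alongside the other per-group state.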
> > If some (or all) of these points have already been discussed before,
> > I apologize in advance!
> > I appreciate any feedback or pointers.
> > Thanks.
> > --Kaushik