
Re: [Xen-devel] Xen dom0 network I/O scalability

On Wed, 2011-05-11 at 19:43 +0100, Kaushik Kumar Ram wrote:
> On May 11, 2011, at 4:31 AM, Ian Campbell wrote:
> > 
> >>> - The inter-VM performance (throughput) is worse using both tasklets
> >> and kthreads as compared to the old version of netback (as in
> >> linux-2.6-xen.hg repo). I observed about 50% drop in throughput in my
> >> experiments. Has anyone else observed this? Is the new version yet to
> >> be optimized?
> >> 
> >> That is not surprising. The "new" version of netback copies pages. It
> >> does not "swizzle" or "map" them between domains (so no zero-copy).
> > 
> > I think Kaushik is running a xen/2.6.32.x tree and the copying only
> > variant is only in mainline.
> > 
> > A 50% drop in performance between linux-2.6-xen.hg and the xen.git
> > 2.6.32 tree is slightly worrying but such a big drop sounds more like a
> > misconfiguration, e.g. something like enabling debugging options in the
> > kernel .config rather than a design or implementation issue in netback.
> > 
> > (I actually have no idea what was in the linux-2.6-xen.hg tree, I don't
> > recall such a tree ever being properly maintained, the last cset appears
> > to be from 2006 and I recently cleaned it out of xenbits because noone
> > knew what it was -- did you mean linux-2.6.18-xen.hg?)
> I was referring to the single-threaded netback version in linux-2.6.18-xen.hg 
> (which btw also uses copying).

Ah, I think we are talking about different values of copying.

A long time ago the backend->frontend path (guest receive) operated
using a page-flipping mode. At some point a copying mode was added to
this path, and it became the default some time in 2006; you would have
to go out of your way to find a guest which used flipping mode these
days. I think this is the copying you are referring to. It is so long
ago that there was a distinction on this path that I had forgotten all
about it until now.

The frontend->backend path (guest transmit) has used a mapping
(PageForeign) based scheme practically since forever. However, when
netback was upstreamed into 2.6.39 this had to be removed in favour of a
copy-based implementation (PageForeign has fingers in the mm subsystem
which were unacceptable for upstreaming). This is the copying mode
Konrad and I were talking about. We know the performance will suffer
versus mapping mode, and we are working to find ways of reinstating a
mapping-based scheme upstream.

>  I don't believe misconfiguration to be the reason. 
> As I mentioned previously, I profiled the code and found significant 
> synchronization
> overhead due to lock contention. This essentially happens when two vcpus in 
> dom0 perform the grant hypercall and both try to acquire the domain_lock.
> I don't think re-introducing zero-copy in the receive path is a solution to 
> this problem.

As far as I can tell you are running with the zero-copy path. Only
mainline 2.6.39+ has anything different.

I think you need to go into detail about your test setup so we can all
get on the same page and stop confusing ourselves by guessing which
modes netback has available and is running in. Please can you describe
precisely which kernels you are running (tree URL and changeset as well
as the .config you are using). Please also describe your guest
configuration (kernels, cfg file, distro etc) and benchmark methodology
(e.g. netperf options).

I'd also be interested in seeing the actual numbers you are getting,
alongside specifics of the test scenario which produced them.

I'm especially interested in the details of the experiment(s) where you
saw a 50% drop in throughput.

>  I mentioned packet copies only to explain the severity of this
> problem. Let me try to clarify. Consider the following scenario: vcpu 1 
> performs a hypercall, acquires the domain_lock, and starts copying one or 
> more 
> packets (in gnttab_copy). Now vcpu 2 also performs a hypercall, but it cannot 
> acquire the domain_lock until all the copies have completed and the lock is 
> released by vcpu 1. So the domain_lock could be held for a long time before 
> it is released.

But this isn't a difference between the multi-threaded/tasklet and
single-threaded/tasklet versions of netback, is it?

In the single-threaded case the serialisation is explicit due to the
lack of threading. It would obviously be good to avoid it in the
multi-threaded case, but the contention doesn't really explain why
multi-threaded mode would be 50% slower. (I suppose the threaded case
could serialise things in a different order, perhaps one which is
somehow pessimal for e.g. TCP.)

It is quite easy to force the number of tasklets/threads to 1 (by
forcing xen_netbk_group_nr to 1 in netback_init()). This might be an
interesting experiment to see if the degradation is down to contention
between threads or something else which has changed between 2.6.18 and
2.6.32 (there is an extent to which this is comparing apples to oranges
but 50% is pretty severe...).
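
That experiment amounts to a one-line change. A sketch of the kind of
patch meant here (the file path and the replaced expression are
assumptions about the 2.6.32 tree; check your source):

```diff
--- a/drivers/xen/netback/netback.c
+++ b/drivers/xen/netback/netback.c
@@ static int __init netback_init(void)
-	xen_netbk_group_nr = num_online_cpus();
+	xen_netbk_group_nr = 1;	/* single tasklet/kthread group for the test */
```

If throughput recovers with one group then contention between groups is
implicated; if not, the cause lies elsewhere in the 2.6.18 to 2.6.32
delta.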

> I think to properly scale netback we need more fine grained locking.

Quite possibly. It doesn't seem at all unlikely that the domain lock on
the guest-receive grant copy is going to hurt at some point. There are
some plans to rework the guest receive path to do the copy on the guest
side; the primary motivation is to remove load from dom0 and to allow
better accounting of the work to the guests which request it, but a
side-effect of this could be to reduce contention on dom0's domain_lock.

However I would like to get to the bottom of the 50% degradation between
linux-2.6.18-xen.hg and xen.git#xen/stable-2.6.32.x before we move on to
how we can further improve the situation in xen.git.

> >>> - Two tasklets (rx and tx) are created per vcpu within netback. But
> >> in my experiments I noticed that only one vcpu was being used during
> >> the experiments (even with 4 VMs).  I also observed that all the event
> >> channel notifications within netback are always sent to vcpu 0. So my
> >> conjecture is that since the tasklets are always scheduled by vcpu 0,
> >> all of them are run only on vcpu 0. Is this a BUG?
> >> 
> >> Yes. We need to fix 'irqbalance' to work properly. There is something
> >> not working right.
> > 
> > The fix is to install the "irqbalanced" package. Without it no IRQ
> > balancing will occur in a modern kernel. (perhaps this linux-2.6-xen.hg
> > tree was from a time when the kernel would do balancing on its own?).
> > You can also manually balance the VIF IRQs under /proc/irq if you are so
> > inclined.
> Why can't the virq associated with each xen_netbk be bound to a different 
> vcpu during initialization?

An IRQ is associated with a VIF and multiple VIFs can be associated with
a netbk.

I suppose we could bind the IRQ to the same CPU as the associated netbk
thread but this can move around so we'd need to follow it. The tasklet
case is easier since, I think, the tasklet will be run on whichever CPU
scheduled it, which will be the one the IRQ occurred on.

Drivers are not typically expected to behave in this way. In fact I'm
not sure it is even allowed by the IRQ subsystem and I expect upstream
would frown on a driver doing this sort of thing (I expect their answer
would be "why aren't you using irqbalanced?"). If you can make this work
and it shows real gains over running irqbalanced we can of course
consider it.

> Also, which git repo/branch should I be using if I would like to experiment 
> with 
> the latest dom0 networking?

I wouldn't recommend playing with the stuff in mainline right now -- we
know it isn't the best due to the use of copying on the guest receive
path. The xen.git#xen/stable-2.6.32.x tree is probably the best one to
experiment on.

