Re: [Xen-devel] Xen dom0 network I/O scalability
On May 11, 2011, at 4:31 AM, Ian Campbell wrote:
>>> - The inter-VM performance (throughput) is worse using both tasklets
>>> and kthreads as compared to the old version of netback (as in the
>>> linux-2.6-xen.hg repo). I observed about a 50% drop in throughput in my
>>> experiments. Has anyone else observed this? Is the new version yet to
>>> be optimized?
>> That is not surprising. The "new" version of netback copies pages. It
>> does not "swizzle" or "map" them between domains (so no zero copy).
> I think Kaushik is running a xen/2.6.32.x tree and the copying-only
> variant is only in mainline.
> A 50% drop in performance between linux-2.6-xen.hg and the xen.git
> 2.6.32 tree is slightly worrying but such a big drop sounds more like a
> misconfiguration, e.g. something like enabling debugging options in the
> kernel .config rather than a design or implementation issue in netback.
> (I actually have no idea what was in the linux-2.6-xen.hg tree, I don't
> recall such a tree ever being properly maintained, the last cset appears
> to be from 2006 and I recently cleaned it out of xenbits because no one
> knew what it was -- did you mean linux-2.6.18-xen.hg?)
I was referring to the single-threaded netback version in linux-2.6.18-xen.hg
(which, btw, also uses copying). I don't believe misconfiguration to be the reason.
As I mentioned previously, I profiled the code and found significant
overhead due to lock contention. The contention arises when two vcpus in
dom0 perform the grant copy hypercall concurrently and both try to acquire
the domain_lock.
I don't think re-introducing zero-copy in the receive path is a solution to
this problem. I mentioned packet copies only to explain the severity of this
problem. Let me try to clarify. Consider the following scenario: vcpu 1
performs a hypercall, acquires the domain_lock, and starts copying one or more
packets (in gnttab_copy). Now vcpu 2 also performs a hypercall, but it cannot
acquire the domain_lock until all the copies have completed and the lock is
released by vcpu 1. So the domain_lock could be held for a long time before
it is released.
I think that to scale netback properly we need more fine-grained locking.
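To make the scenario concrete, here is a minimal, self-contained sketch. It is
not the actual Xen or netback source; fake_domain, grant_copy_batch and
copy_one_packet are invented names. It only illustrates how one per-domain
lock held across a whole batch of copies serializes the two dom0 vcpus:

/*
 * Sketch only (not Xen code): a single coarse lock held for an entire
 * batch of packet copies, mimicking the gnttab_copy + domain_lock pattern
 * described above.
 */
#include <pthread.h>
#include <string.h>

struct fake_domain {
	pthread_mutex_t domain_lock;	/* stand-in for the per-domain lock */
	char page[4096];		/* stand-in for a granted target page */
};

static void copy_one_packet(struct fake_domain *d, const char *src, size_t len)
{
	if (len > sizeof(d->page))
		len = sizeof(d->page);
	memcpy(d->page, src, len);	/* the gnttab_copy-style data copy */
}

/*
 * One "hypercall": take the domain lock, copy every packet in the batch,
 * then release the lock.  While vcpu 1 is inside the loop, vcpu 2's call
 * blocks on pthread_mutex_lock() for the duration of the whole batch.
 */
static void grant_copy_batch(struct fake_domain *d, const char **pkts,
			     const size_t *lens, unsigned int nr)
{
	unsigned int i;

	pthread_mutex_lock(&d->domain_lock);
	for (i = 0; i < nr; i++)
		copy_one_packet(d, pkts[i], lens[i]);
	pthread_mutex_unlock(&d->domain_lock);
}

With finer-grained locking (for example a lock per receive queue or per copy
batch rather than one per domain), the second vcpu would only block while it
actually touches shared state.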
>>> - Two tasklets (rx and tx) are created per vcpu within netback. But
>>> in my experiments I noticed that only one vcpu was being used during
>>> the experiments (even with 4 VMs). I also observed that all the event
>>> channel notifications within netback are always sent to vcpu 0. So my
>>> conjecture is that since the tasklets are always scheduled by vcpu 0,
>>> all of them are run only on vcpu 0. Is this a BUG?
>> Yes. We need to fix 'irqbalance' to work properly. There is something
>> not working right.
> The fix is to install the "irqbalanced" package. Without it no IRQ
> balancing will occur in a modern kernel. (perhaps this linux-2.6-xen.hg
> tree was from a time when the kernel would do balancing on its own?).
> You can also manually balance the VIF IRQs under /proc/irq if you are so
> inclined.
Why can't the virq associated with each xen_netbk be bound to a different
vcpu during initialization? There is, after all, one struct xen_netbk per
vcpu in dom0. This seems like the simplest fix for this problem.
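In the meantime, the manual balancing Ian mentions can be scripted by writing
a CPU mask to /proc/irq/<irq>/smp_affinity. A rough sketch follows; the IRQ
numbers and masks below are placeholders, and the real VIF/netback IRQ
numbers have to be looked up in /proc/interrupts first:

/*
 * Sketch only: pins IRQs to CPUs by writing a hex CPU mask to
 * /proc/irq/<irq>/smp_affinity.  The IRQ numbers are placeholders.
 */
#include <stdio.h>

static int set_irq_affinity(unsigned int irq, unsigned int cpu_mask)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%u/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%x\n", cpu_mask);	/* e.g. 0x2 == CPU 1 only */
	return fclose(f);
}

int main(void)
{
	/* Hypothetical: spread four VIF IRQs across four dom0 vcpus. */
	unsigned int irqs[] = { 289, 290, 291, 292 };	/* placeholders */
	unsigned int i;

	for (i = 0; i < sizeof(irqs) / sizeof(irqs[0]); i++)
		if (set_irq_affinity(irqs[i], 1u << i))
			perror("smp_affinity");
	return 0;
}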
>>> - A smaller source of overhead is when the '_lock' is acquired
>>> within netback in netif_idx_release(). Shouldn't this lock be per
>>> struct xen_netbk instead of being global (declared as static within
>>> the function)? Is this a BUG?
>> Ian, what is your thought?
> I suspect the _lock could be moved into the netbk; I expect it was just
> missed in the switch to multi-threading because it was static in the
> function instead of a normal global var located with all the others.
Yes, it has to be moved into struct xen_netbk.
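Something along these lines, I think; this is only a sketch of the move (the
field name idx_release_lock is mine), not a tested patch:

/*
 * Sketch of the change.  Surrounding names are illustrative; keep whatever
 * spin_lock variant (irqsave etc.) the current netif_idx_release() uses.
 */
struct xen_netbk {
	/* ... existing fields ... */
	spinlock_t idx_release_lock;	/* replaces the function-local
					 * static DEFINE_SPINLOCK(_lock) */
};

/* in the per-netbk initialisation path: */
	spin_lock_init(&netbk->idx_release_lock);

/* and in netif_idx_release(struct xen_netbk *netbk, ...): */
	spin_lock(&netbk->idx_release_lock);
	/* ... existing body unchanged ... */
	spin_unlock(&netbk->idx_release_lock);

That way each xen_netbk serializes only its own index releases instead of all
of them contending on one global lock.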
Also, which git repo/branch should I be using if I would like to experiment
with the latest dom0 networking?