Hi Jeremy and all,
I'd like to post an update to these patches. The main logic is
unchanged; I only rebased onto the upstream pv-ops kernel.
See the attached patch. The original patches are checked into Jeremy's
netback-tasklet branch.
Let me explain the main idea of the patchset again:
Netback currently uses one pair of tasklets for Tx/Rx data transfer.
A tasklet can run on only one CPU at a time, and this single pair
serves all the netfronts, so it has become a performance
bottleneck. This patchset replaces the single pair in dom0 with
multiple tasklet pairs.
Assuming that Dom0 has CPUNR VCPUs, we define CPUNR tasklet
pairs (CPUNR tasklets for Tx, and CPUNR for Rx). Each pair of tasklets
serves a specific group of netfronts. We also duplicated the global
and static variables for each group, so that no spinlock is needed to
protect them.
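
To make the per-group layout described above more concrete, here is a
minimal user-space sketch in C. It is not code from the patches: the
names netbk_group, netbk_init(), netbk_add_netfront() and
netbk_remove_netfront(), the MAX_GROUPS bound, and the "least loaded
group" assignment policy are all assumptions made for illustration.
The real per-group state lives in 'struct xen_netbk' and also holds
the Tx/Rx tasklets and the formerly global variables.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GROUPS 64 /* hypothetical upper bound for this sketch */

/* Per-group state; stands in for the per-group 'struct xen_netbk'. */
struct netbk_group {
    unsigned int nr_netfronts; /* netfronts currently served by this group */
};

struct netbk {
    struct netbk_group groups[MAX_GROUPS];
    unsigned int group_nr;     /* one group per dom0 VCPU */
};

/* Create one group per VCPU, each starting with no netfronts. */
void netbk_init(struct netbk *nb, unsigned int nr_cpus)
{
    unsigned int i;

    assert(nr_cpus >= 1 && nr_cpus <= MAX_GROUPS);
    nb->group_nr = nr_cpus;
    for (i = 0; i < nr_cpus; i++)
        nb->groups[i].nr_netfronts = 0;
}

/*
 * Assign a new netfront to the group serving the fewest netfronts.
 * Ties go to the lowest group index. This policy is an assumption.
 */
unsigned int netbk_add_netfront(struct netbk *nb)
{
    unsigned int i, best = 0;

    for (i = 1; i < nb->group_nr; i++) {
        if (nb->groups[i].nr_netfronts < nb->groups[best].nr_netfronts)
            best = i;
    }
    nb->groups[best].nr_netfronts++;
    return best;
}

/* Drop one netfront from the given group, e.g. on disconnect. */
void netbk_remove_netfront(struct netbk *nb, unsigned int group)
{
    assert(group < nb->group_nr);
    assert(nb->groups[group].nr_netfronts > 0);
    nb->groups[group].nr_netfronts--;
}
```

Because each netfront belongs to exactly one group, the Tx and Rx work
for that netfront only ever touches that group's private state, which
is why the per-group copies need no shared lock.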
PATCH 01: Generalize static/global variables into 'struct xen_netbk'.
PATCH 02: Multiple tasklets support.
PATCH 03: Use kernel threads to replace the tasklets.
Recently I re-tested the patchset with an Intel 10G multi-queue NIC,
and used 10 external 1G NICs to run netperf tests against that 10G NIC.
Case 1: Dom0 has more than 10 vcpus, each pinned to a physical CPU.
With the patchset, throughput is 2x the original.
Case 2: Dom0 has 4 vcpus pinned to 4 physical CPUs.
With the patchset, throughput is 3.7x the original.
When testing this patchset, we found that the domain_lock in the grant
table operation (gnttab_copy()) becomes a bottleneck. We temporarily
removed the global domain_lock to achieve good performance.
Thanks,
Dongxiao
Jeremy Fitzhardinge wrote:
> On 12/09/09 19:29, Xu, Dongxiao wrote:
>>> Also, is it worth making it a tunable? Presumably it needn't scale
>>> exactly with the number of dom0 cpus; if you only have one or two
>>> gbit interfaces, then you could saturate that pretty quickly with a
>>> small number of cpus, regardless of how many domains you have.
>>>
>> How many CPUs are serving for the NIC interface is determined by how
>> interrupt is delivered. If system only has two gbit interfaces, and
>> they deliver interrupts to CPU0 and CPU1, then the case is: two
>> CPUs handle two tasklets. Other CPUs are idle. The group_nr just
>> defines the max number of tasklets, however it doesn't decide how
>> tasklet is handled by CPU.
>>
>
> So does this mean that a given vcpu will be used to handle the
> interrupt if happens to be running on a pcpu with affinity for the
> device? Or that particular devices will be handled by particular
> vcpus?
>
>>> I've pushed this out in its own branch:
>>> xen/dom0/backend/netback-tasklet; please post any future patches
>>> against this branch.
>>>
>> What's my next step for this netback-tasklet tree merging into
>> xen/master?
>>
>
> Hm, well, I guess:
>
> * I'd like to see some comments Keir/Ian(s)/others that this is
> basically the right approach. It looks OK to me, but I don't
> have much experience with performance in the field.
> o does nc2 make nc1 obsolete?
> * Testing to make sure it really works. Netback is clearly
> critical functionality, so I'd like to be sure we're not
> introducing big regressions
>
> J