Thanks for your comments. Some explanations below.
Ian Campbell wrote:
> Does this change have any impact on the responsiveness of domain 0
> userspace while the host is under heavy network load? We have found
> that the netback tasklets can completely dominate dom0's VCPU to the
> point where no userspace process ever gets the chance to run, since
> this includes sshd and the management toolstack that can be quite
> The issue was probably specific to using a single-VCPU domain 0 in
> XenServer but since your patch introduces a tasklet per VCPU it could
> possibly happen to multi-VCPU domain 0.
The case you describe arises because a single dom0 tasklet serves all the
netfronts, so the one vcpu that runs the tasklet becomes extremely busy and
has no time left for userspace. My patch splits netback's workload across
several tasklets. If those tasklets are bound to different dom0 vcpus (by
irqbalance or by pinning the interrupts manually), the total CPU load is
spread roughly evenly over the vcpus, which should make dom0 more scalable.
Take our test case as an example: the system is under heavy network load,
with throughput close to the available bandwidth (roughly 9.5G out of 10G),
yet it uses only ~460% of dom0's CPU. Dom0 has 10 vcpus handling the
network, so each vcpu spends about 46% on the network workload. A 1G NIC
should therefore be no problem. Most current 10G NICs support multi-queue,
so interrupts are delivered to different CPUs and each dom0 vcpu only has to
handle part of the workload; I believe that case will be fine as well.
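
To make the arithmetic above explicit, here is a small Python sketch that
uses the figures from the test results in my original mail (quoted further
down). The even split across vcpus is an assumption: it only holds if the
interrupts are actually balanced across dom0's vcpus.

```python
# Figures taken from the test results quoted later in this mail.
DOM0_VCPUS = 10                 # dom0 vcpus handling network traffic
DOM0_UTIL_WITH_PATCH = 461.64   # percent, summed over all dom0 vcpus
THROUGHPUT_WITHOUT = 4304.30    # Mbps, single tasklet pair
THROUGHPUT_WITH = 9533.13       # Mbps, one tasklet pair per vcpu
LINK_CAPACITY = 10 * 1000.0     # Mbps, ten 1G NICs

# Assumption: the load is spread evenly over the vcpus.
per_vcpu_util = DOM0_UTIL_WITH_PATCH / DOM0_VCPUS

# Throughput gain of the patched netback over the unpatched one.
speedup = THROUGHPUT_WITH / THROUGHPUT_WITHOUT

# Fraction of the aggregate link bandwidth that is actually used.
link_fraction = THROUGHPUT_WITH / LINK_CAPACITY

print(f"per-vcpu dom0 utilization: {per_vcpu_util:.1f}%")
print(f"throughput gain:           {speedup:.2f}x")
print(f"link utilization:          {link_fraction:.1%}")
```

With an uneven interrupt distribution the busiest vcpu would of course sit
above this average, so the per-vcpu number is a best case.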
> For XenServer we converted the tasklets into a kernel thread, at the
> cost of a small reduction in overall throughput but yielding a massive
> improvement in domain 0 responsiveness. Unfortunately the change was
> made by someone who has since left Citrix and I cannot locate the
> numbers he left behind :-(
> Our patch is attached. A netback thread per domain 0 VCPU might be
> interesting to experiment with?
Adding a kernel thread mechanism to netback is a good way to improve dom0's
responsiveness in the UP case. For a dom0 with multiple vcpus, however, I
think it may not be needed. In any case, it is a separate matter from my
multiple-tasklet approach.
> On Fri, 2009-11-27 at 02:26 +0000, Xu, Dongxiao wrote:
>> Current netback uses one pair of tasklets for Tx/Rx data transactions.
>> A netback tasklet can only run on one CPU at a time, and it is used
>> to serve all the netfronts. Therefore it has become a performance
>> bottleneck. This patch uses multiple tasklet pairs to replace
>> the current single pair in dom0. Assuming that Dom0 has CPUNR
>> VCPUs, we define CPUNR pairs of tasklets (CPUNR for Tx, and
>> CPUNR for Rx). Each pair of tasklets serves a specific group of
>> netfronts. Also, for the global and static variables, we duplicated
>> them for each group in order to avoid the spinlock.
>> Test scenario:
>> We use ten 1G NIC interfaces to talk with 10 VMs (netfront) on the server.
>> So the total bandwidth is 10G.
>> On the host machine, each guest's netfront is bound to one NIC interface.
>> On the client machine, we run netperf against each guest.
>> Test Case   Packet Size   Throughput(Mbps)   Dom0 CPU Util   Guests CPU Util
>> w/o patch   1400          4304.30            400.33%         112.21%
>> w/ patch    1400          9533.13            461.64%         243.81%
>> BTW, when we tested this patch, we found that the domain_lock in the
>> grant table operations became a bottleneck. We temporarily removed the
>> global domain_lock to achieve good performance.
>> Best Regards,
>> -- Dongxiao
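
For readers who want a picture of the grouping described in the quoted patch
description, it can be modelled roughly as below. This is a Python sketch of
the idea only, not the kernel code: the class and function names are
hypothetical, and assigning a netfront to a group by "id modulo number of
groups" is an assumed policy chosen for illustration, since the mail does not
say how the patch picks the group.

```python
from dataclasses import dataclass, field


@dataclass
class TaskletGroup:
    """Stand-in for one Tx/Rx tasklet pair plus its private state."""
    group_id: int
    netfronts: list = field(default_factory=list)
    # Per-group copies of what used to be global/static variables, so
    # groups never touch each other's state and need no shared spinlock.
    tx_pending: int = 0
    rx_pending: int = 0


def build_groups(num_vcpus):
    """Create one group (one tasklet pair) per dom0 vcpu."""
    return [TaskletGroup(group_id=i) for i in range(num_vcpus)]


def assign_netfront(groups, netfront_id):
    """Attach a netfront to a group (hypothetical modulo policy)."""
    group = groups[netfront_id % len(groups)]
    group.netfronts.append(netfront_id)
    return group


# Ten dom0 vcpus and ten guests, as in the test scenario.
groups = build_groups(10)
for netfront_id in range(10):
    assign_netfront(groups, netfront_id)

for g in groups:
    print(f"group {g.group_id}: netfronts {g.netfronts}")
```

In this model each group ends up serving exactly one netfront, which matches
the ten-guest, ten-vcpu test setup; with more guests than vcpus several
netfronts would share a group.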
Xen-devel mailing list