Based on your feedback, I revised my patches and am resending them now.
[PATCH 01]: Use multiple tasklet pairs to replace the current single pair in
netback.
[PATCH 02]: Replace the tasklets with kernel threads. This may hurt
performance, but could improve responsiveness to userspace.
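For reference, the kthread variant in the 02 patch roughly follows the standard kernel pattern sketched below. The names (netbk_action_thread, xen_netbk, rx_work_todo, etc.) are illustrative placeholders, not the patch's actual identifiers:

```c
/* Illustrative sketch only -- not the actual patch. One kthread per
 * netback group replaces the tasklet. Unlike a tasklet, the kthread
 * runs in process context, so the scheduler can preempt it in favour
 * of userspace tasks (hence the better responsiveness, and the
 * potential throughput cost). */
static int netbk_action_thread(void *data)
{
        struct xen_netbk *netbk = data;   /* hypothetical per-group state */

        while (!kthread_should_stop()) {
                wait_event_interruptible(netbk->wq,
                                         rx_work_todo(netbk) ||
                                         tx_work_todo(netbk) ||
                                         kthread_should_stop());
                cond_resched();           /* give userspace a chance to run */

                if (rx_work_todo(netbk))
                        net_rx_action(netbk);
                if (tx_work_todo(netbk))
                        net_tx_action(netbk);
        }
        return 0;
}

/* creation, e.g. at init time: */
netbk->task = kthread_run(netbk_action_thread, netbk, "netback/%u", group);
```

The work functions are placeholders; the point is the wait_event/kthread_should_stop loop, which is the usual replacement for tasklet scheduling.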
We use ten 1G NIC interfaces to talk with 10 VMs (netfront) on the server, so
the total bandwidth is 10G.
On the host machine, each guest's netfront is bound to its own NIC interface.
On the client machine, we run netperf against each guest.
Test Case              Throughput (Mbps)   Dom0 CPU Util   Guests CPU Util
w/o any patch          4304.30             400.33%         112.21%
w/ 01 patch            9533.13             461.64%         243.81%
w/ 01 and 02 patches   7942.68             597.83%         250.53%
From the results we can see that the case "w/ 01 and 02 patches" did not
reach (or come near) the total bandwidth. This is because some vcpus in dom0
are saturated by context switches with other tasks, which hurts performance.
To verify this, I ran an experiment that sets the kernel thread to the
SCHED_FIFO class, so that it cannot be preempted by normal tasks. The result
is shown below, and it achieves good performance. However, like the tasklet,
setting the kernel thread to high priority also hurts userspace
responsiveness, because userspace applications (for example, sshd) cannot
preempt that netback kernel thread.
w/ hi-priority kthread   9535.74             543.56%         241.26%
Netchannel2 omits the grant copy in dom0; I have not tried it yet. But
profiling the current netback system with xenoprofile suggests that the grant
copy accounts for roughly 1/6 of dom0's CPU cycles (including Xen and the
dom0 vmlinux).
BTW, the 02 patch is ported from a patch by Ian Campbell. You can add your
Signed-off-by if you want. :)
Ian Pratt wrote:
>> The domain lock is taken in the grant_op hypercall. If the multiple
>> tasklets fight each other for this big domain lock, it becomes a
>> bottleneck and hurts performance.
>> Our test system has 16 logical processors in total, so we have 16 vcpus
>> in dom0, and 10 of them are used to handle the network load. For our test
>> case, dom0's total vcpu utilization is ~461.64%, so each vcpu occupies
> Having 10 VCPUs for dom0 doesn't seem like a good idea -- it really
> oughtn't to need that many CPUs to handle IO load. Have you got any
> results with e.g. 2 or 4 VCPUs?
> When we switch over to using netchannel2 by default this issue should
> largely go away anyhow as the copy is not done by dom0. Have you done
> any tests with netchannel2?
>> Actually the multiple tasklets in netback already improve the QoS of the
>> system, therefore I think this can also help to get better responsiveness
>> for that vcpu.
>> I think I can try to write another patch which replaces the tasklets
>> with kthreads, because I think it is a different job from the
>> multi-tasklet netback support. (The kthread is used to guarantee the
>> responsiveness of userspace, whereas multi-tasklet netback is used to
>> remove dom0's cpu utilization bottleneck.) However I am not sure whether
>> the improvement in QoS by this change is needed in MP
> Have you looked at the patch that xenserver uses to replace the
> tasklets by kthreads?
Xen-devel mailing list