xen-devel

Re: [Xen-devel] Xen dom0 network I/O scalability

On May 12, 2011, at 3:21 AM, Ian Campbell wrote:

> A long time ago the backend->frontend path (guest receive) operated using
> a page flipping mode. At some point a copying mode was added to this
> path which became the default some time in 2006. You would have to go
> out of your way to find a guest which used flipping mode these days. I
> think this is the copying you are referring to; it's so long ago that
> there was a distinction on this path that I'd forgotten all about it
> until now.

I was not referring to page flipping at all. I was only talking about the copies
in the receive path. 

> The frontend->backend path (guest transmit) has used a mapping-based
> (PageForeign) scheme practically since forever. However, when netback
> was upstreamed into 2.6.39 this had to be removed in favour of a
> copy-based implementation (PageForeign has fingers in the mm subsystem
> which were unacceptable for upstreaming). This is the copying mode
> Konrad and I were talking about. We know the performance will suffer
> versus mapping mode, and we are working to find ways of reinstating
> mapping.

Hmm.. I did not know that the copying mode was introduced in the transmit path. 
But, as I said above, I was only referring to the receive path. 

> As far as I can tell you are running with the zero-copy path. Only
> mainline 2.6.39+ has anything different.

Again, I was only referring to the receive path! I assumed you were talking
about re-introducing zero-copy in the receive path (aka page flipping).
To be clear: 
- xen.git#xen/stable-2.6.32.x uses copying in the RX path and 
mapping (zero-copy) in the TX path. 
- Copying is used in both the RX and TX paths in 2.6.39+ for upstreaming.
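
(To be concrete about what I mean by "copying in the RX path": netback builds
a batch of grant-copy operations and hands them to Xen in a GNTTABOP_copy
hypercall. The helper below is only a sketch of that idea, not the actual
netback code; the struct and flag names follow xen/include/public/grant_table.h,
and rx_copy_frag is a stand-in name of my own.)

/* Sketch only: copy one fragment from a dom0 page into a grant ref
 * supplied by the guest.  The real netback fills an array of these ops
 * and issues a single batched hypercall; one op is shown for clarity. */
#include <linux/errno.h>
#include <linux/mm.h>
#include <xen/interface/grant_table.h>
#include <asm/xen/page.h>
#include <asm/xen/hypercall.h>

static int rx_copy_frag(struct page *page, unsigned int offset,
			unsigned int len, domid_t domid, grant_ref_t gref)
{
	struct gnttab_copy op = {
		.source.u.gmfn	= virt_to_mfn(page_address(page)),
		.source.domid	= DOMID_SELF,
		.source.offset	= offset,
		.dest.u.ref	= gref,
		.dest.domid	= domid,
		.dest.offset	= 0,
		.len		= len,
		.flags		= GNTCOPY_dest_gref,
	};

	HYPERVISOR_grant_table_op(GNTTABOP_copy, &op, 1);
	return op.status == GNTST_okay ? 0 : -EIO;
}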

> I think you need to go into detail about your test setup so we can all
> get on the same page and stop confusing ourselves by guessing which
> modes netback has available and is running in. Please can you describe
> precisely which kernels you are running (tree URL and changeset as well
> as the .config you are using). Please also describe your guest
> configuration (kernels, cfg file, distro etc) and benchmark methodology
> (e.g. netperf options).
> 
> I'd also be interested in seeing the actual numbers you are seeing,
> alongside specifics of the test scenario which produced them.
> 
> I'm especially interested in the details of the experiment(s) where you
> saw a 50% drop in throughput.

I agree. I plan to run the experiments again next week. I will get back to you 
with all the details. 

But these are the versions I am trying to compare:
1. http://xenbits.xensource.com/linux-2.6.18-xen.hg (single-threaded legacy netback)
2. xen.git#xen/stable-2.6.32.x (multi-threaded netback using tasklets)
3. xen.git#xen/stable-2.6.32.x (multi-threaded netback using kthreads)

And (1) outperforms both (2) and (3).

>> I mentioned packet copies only to explain the severity of this
>> problem. Let me try to clarify. Consider the following scenario: vcpu 1
>> performs a hypercall, acquires the domain_lock, and starts copying one or
>> more packets (in gnttab_copy). Now vcpu 2 also performs a hypercall, but it
>> cannot acquire the domain_lock until all the copies have completed and the
>> lock is released by vcpu 1. So the domain_lock could be held for a long time
>> before it is released.
> 
> But this isn't a difference between the multi-threaded/tasklet and
> single-threaded/tasklet version of netback, is it?
> 
> In the single threaded case the serialisation is explicit due to the
> lack of threading, and it would obviously be good to avoid for the
> multithreaded case, but the contention doesn't really explain why
> multi-threaded mode would be 50% slower. (I suppose the threading case
> could serialise things into a different order, perhaps one which is
> somehow pessimal for e.g. TCP)
> 
> It is quite easy to force the number of tasklets/threads to 1 (by
> forcing xen_netbk_group_nr to 1 in netback_init()). This might be an
> interesting experiment to see if the degradation is down to contention
> between threads or something else which has changed between 2.6.18 and
> 2.6.32 (there is an extent to which this is comparing apples to oranges
> but 50% is pretty severe...).

Hmm.. You are right.  I will run the above experiments next week.
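
(For reference, this is the change I plan to try, along the lines you suggest;
a sketch against the 2.6.32-style netback where netback_init() sizes the group
array from num_online_cpus(). The file path and the surrounding code may differ
between trees, so treat it as illustrative only.)

--- a/drivers/xen/netback/netback.c
+++ b/drivers/xen/netback/netback.c
@@ static int __init netback_init(void)
-	xen_netbk_group_nr = num_online_cpus();
+	/* force a single tasklet/kthread group for the contention experiment */
+	xen_netbk_group_nr = 1;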

>> I think to properly scale netback we need more fine grained locking.
> 
> Quite possibly. It doesn't seem at all unlikely that the domain lock on
> the guest-receive grant copy is going to hurt at some point. There are
> some plans to rework the guest receive path to do the copy on the guest
> side; the primary motivation is to remove load from dom0 and to allow
> better accounting of work to the guests that request it, but a side-effect
> of this could be to reduce contention on dom0's domain_lock.
> 
> However I would like to get to the bottom of the 50% degradation between
> linux-2.6.18-xen.hg and xen.git#xen/stable-2.6.32.x before we move on to
> how we can further improve the situation in xen.git.

OK.

> An IRQ is associated with a VIF and multiple VIFs can be associated with
> a netbk.
> 
> I suppose we could bind the IRQ to the same CPU as the associated netbk
> thread, but this can move around so we'd need to follow it. The tasklet
> case is easier since, I think, the tasklet will be run on whichever CPU
> scheduled it, which will be the one the IRQ occurred on.
> 
> Drivers are not typically expected to behave in this way. In fact I'm
> not sure it is even allowed by the IRQ subsystem and I expect upstream
> would frown on a driver doing this sort of thing (I expect their answer
> would be "why aren't you using irqbalanced?"). If you can make this work
> and it shows real gains over running irqbalanced we can of course
> consider it.

OK.
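
(Just to note how I would prototype this if the numbers justify it: something
like the affinity-hint interface, with the caveat you raise that drivers are
not normally expected to do this. A sketch only; the helper name is mine, the
irq and cpu arguments would come from the VIF and its netbk kthread, and
irq_set_affinity_hint() only exists in kernels newer than 2.6.32 and is purely
advisory, so irqbalanced or the admin can still move the IRQ.)

#include <linux/interrupt.h>
#include <linux/cpumask.h>

/* Sketch only: suggest that a VIF's interrupt be delivered on the CPU
 * its netbk kthread currently runs on.  Would need to be refreshed
 * whenever the kthread migrates to another CPU. */
static void netbk_hint_irq_affinity(unsigned int irq, int cpu)
{
	irq_set_affinity_hint(irq, cpumask_of(cpu));
}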

Thanks.

--Kaushik
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel