xen-devel

[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.

On Wed, 2010-11-17 at 13:00 -0500, Jeremy Fitzhardinge wrote:
> On 11/16/2010 01:28 PM, Daniel Stodden wrote:
> >> What's the problem?  If you do nothing then it will appear to the kernel
> >> as a bunch of processes doing memory allocations, and they'll get
> >> blocked/rate-limited accordingly if memory is getting short.  
> > The problem is that just letting the page allocator work through
> > allocations isn't going to scale anywhere.
> >
> > The worst case memory requested under load is <number-of-disks> * (32 *
> > 11 pages). As a (conservative) rule of thumb, N will be 200 or rather
> > better.
> 
> Under what circumstances would you end up needing to allocate that many
> pages?

I don't. Independently running tapdisks would do, on behalf of guests
queuing I/O.

That's why one wouldn't just let them run and allocate their own memory.
The memory space set aside for I/O should be a shared resource.

> > The number of I/O actually in-flight at any point, in contrast, is
> > derived from the queue/sg sizes of the physical device. For a simple
> > disk, that's about a ring or two.
> 
> Wouldn't that be the worst case?

Yes. It's quite small. About 2 or 3 megs per physical backend is usually
sufficient.
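
For scale, the two figures can be put side by side in a rough sketch. The
32 * 11 pages per disk and N = 200 come from the text quoted above; the
4 KiB page size is an assumption that is not stated in this thread, and the
helper names are made up for illustration.

```python
# Back-of-the-envelope comparison: worst-case vs. in-flight I/O memory.
PAGE_SIZE = 4096            # assumption: 4 KiB pages (not stated in the thread)
RING_REQUESTS = 32          # requests per ring, from "32 * 11 pages"
SEGMENTS_PER_REQUEST = 11   # pages per request, from "32 * 11 pages"


def ring_pages(rings=1):
    """Pages needed to back `rings` completely full rings."""
    return rings * RING_REQUESTS * SEGMENTS_PER_REQUEST


def mib(pages):
    """Convert a page count to MiB under the assumed page size."""
    return pages * PAGE_SIZE / (1024 * 1024)


NUM_DISKS = 200  # the "N will be 200" rule of thumb quoted above

# Every tapdisk allocating a full ring's worth of pages on its own.
worst_case_mib = mib(NUM_DISKS * ring_pages())

# "About a ring or two" actually in flight for one simple physical disk.
in_flight_mib = mib(ring_pages(2))

print(f"per-disk ring: {mib(ring_pages()):.3f} MiB")
print(f"worst case for {NUM_DISKS} disks: {worst_case_mib:.0f} MiB")
print(f"two rings in flight: {in_flight_mib:.2f} MiB")
```

Under these assumptions a single full ring is a bit under 1.5 MiB, which is
where the 2 or 3 megs per backend comes from, while the independent worst
case grows linearly with the number of disks.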

> >> There's
> >> plenty of existing mechanisms to control that sort of thing (cgroups,
> >> etc) without adding anything new to the kernel.  Or are you talking
> >> about something other than simple memory pressure?
> >>
> >> And there's plenty of existing IPC mechanisms if you want them to
> >> explicitly coordinate with each other, but I'd tend to think that's
> >> premature unless you have something specific in mind.
> >>
> >>> Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev..
> >>> Can't find it now, what happened? Without, there's presently still no
> >>> zero-copy.
> >> gntdev doesn't need VM_FOREIGN any more - it uses the (relatively
> >> new-ish) mmu notifier infrastructure which is intended to allow a device
> >> to sync an external MMU with usermode mappings.  We're not using it in
> >> precisely that way, but it allows us to wrangle grant mappings before
> >> the generic code tries to do normal pte ops on them.
> > The mmu notifiers were for safe teardown only. They are not sufficient
> > for DIO, which wants gup() to work. If you want zcopy on gntdev, we'll
> > need to back those VMAs with page structs.
> 
> The pages will have struct page, because they're normal kernel pages
> which happen to be backed by mapped granted pages.

And, like all granted frames, dom0 doesn't own them, which implies they
are not resolvable via mfn_to_pfn. So follow_page fails, and with it
gup(), unless the VM_FOREIGN hack is in place.

Correct me if I'm mistaken. I used to be quicker looking up stuff on
arch-xen kernels, but I think fundamental constants of the Xen universe
didn't change since last time.

>   Are you talking
> about the #ifdef CONFIG_XEN code in the middle of __get_user_pages()? 
> Isn't that just there to cope with the nested-IO-on-the-same-page
> problem that the current blktap architecture provokes?  If there's only
> a single IO on each page - the one initiated by usermode - then it
> shouldn't be necessary, right?

No. Jake brought the aliasing in specifically to get blktap2 working
with zero-copy.

VM_FOREIGN is much older. Only blktap2 does the recursive thing, because
it's a blkdev above some physical dev. Blktap1 went from the guest ring
straight down to userland. As would be the case with a gntdev-based
blkback.

> >   Or bounce again (gulp, just
> > mentioning it). As with the blktap2 patches, note there is no difference
> > in the dom0 memory bill, it takes page frames.
> 
> (And perhaps actual pages to substitute for the granted pages.)

Well yes, that's right. Still fine as long as there's some relatively
small constant bound on the worst case. O(n) for large systems would go
into the hundreds of megs. Given that the *reasonable* amount of memory
used simultaneously is pretty small in any case, even going through the
memory allocator can be skipped.

[
Part of the reason why blktap *never* frees those pages, apart from
being slightly greedy, is the deadlock hazard of writing those nodes in
dom0 through the pagecache, as dom0 might. You need memory pools on the
datapath to guarantee progress under pressure. That got pretty ugly
after 2.6.27, btw.
]

In any case, let's skip finding out what happens if a thundering herd of
several hundred userspace disks tries gfp()ing their grant slots out of
dom0 without arbitration.
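
The kind of arbitration meant here can be illustrated with a small userland
sketch: a fixed pool of frames, sized once, that consumers borrow from and
return to, blocking when it runs dry instead of going to the page allocator.
This is only a toy model of the idea, not blktap's code; the FramePool class
and its methods are hypothetical names, and it makes no fairness guarantees
between waiters.

```python
import threading


class FramePool:
    """Toy model of a fixed, shared pool of I/O page frames (hypothetical)."""

    def __init__(self, total_frames):
        self._total = total_frames
        self._free = total_frames
        self._cond = threading.Condition()

    @property
    def free(self):
        with self._cond:
            return self._free

    def acquire(self, count, timeout=None):
        """Take `count` frames, waiting until they are available.

        Returns True on success, False if `timeout` seconds pass first.
        """
        if count > self._total:
            # Such a request could never be satisfied; fail loudly
            # instead of blocking forever.
            raise ValueError("request exceeds pool size")
        with self._cond:
            ok = self._cond.wait_for(lambda: self._free >= count, timeout)
            if not ok:
                return False
            self._free -= count
            return True

    def release(self, count):
        """Return `count` frames and wake up any waiters."""
        with self._cond:
            self._free += count
            self._cond.notify_all()


# Size the pool for two full rings (2 * 32 * 11 frames), regardless of
# how many disks end up sharing it.
pool = FramePool(2 * 32 * 11)

first = pool.acquire(352)              # one disk fills a ring
second = pool.acquire(352)             # a second disk fills another
third = pool.acquire(352, timeout=0)   # a third has to wait: pool is empty

pool.release(352)                      # first disk completes its I/O
retry = pool.acquire(352, timeout=0)   # now the third disk can proceed
```

The point of the design is that the total is a constant chosen up front, so
memory use stays bounded however many tapdisks are queuing, and a consumer
that cannot get frames simply stalls its own ingress.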

> > I guess we've been meaning the same thing here, unless I'm
> > misunderstanding you. Any pfn does, and the balloon pagevec allocations
> > default to order 0 entries indeed. Sorry, you're right, that's not a
> > 'range'. With a pending re-xmit, the backend can find a couple (or all)
> > of the request frames have count>1. It can flip and abandon those as
> > normal memory. But it will need those lost memory slots back, straight
> > away or next time it's running out of frames. As order-0 allocations.
> 
> Right.  GFP_KERNEL order 0 allocations are pretty reliable; they only
> fail if the system is under extreme memory pressure.  And it has the
> nice property that if those allocations block or fail it rate limits IO
> ingress from domains rather than being crushed by memory pressure at the
> backend (ie, the problem with trying to allocate memory in the writeout
> path).
> 
> Also the cgroup mechanism looks like an extremely powerful way to
> control the allocations for a process or group of processes to stop them
> from dominating the whole machine.

Ah. In case it can be put to work to bind processes allocating pagecache
entries for dirtying to some boundary, I'd be really interested. I think
I came across it once but didn't take the time to read the docs
thoroughly. Can it?

Daniel




