> -----Original Message-----
> From: Emmanuel Ackaouy [mailto:ack@xxxxxxxxxxxxx]
> Sent: 16 January 2007 16:14
> To: Petersson, Mats
> Cc: xen-devel; Anthony Liguori; David Pilger; Ryan Harper
> Subject: Re: [Xen-devel] Re: NUMA and SMP
>
> On Jan 16, 2007, at 15:19, Petersson, Mats wrote:
> >> There is a strong argument for making hypervisors and OSes NUMA
> >> aware in the sense that:
> >> 1- They know about system topology
> >> 2- They can export this information up the stack to applications
> >>    and users
> >> 3- They can take in directives from users and applications to
> >>    partition the host and place some threads and memory in specific
> >>    partitions.
> >> 4- They use an interleaved (or random) initial memory placement
> >>    strategy by default.
> >>
> >> The argument that the OS on its own -- without user or application
> >> directives -- can make better placement decisions than round-robin
> >> or random placement is -- in my opinion -- flawed.
> >
> > Debatable - it depends a lot on WHAT applications you expect to run,
> > and how they behave. If you consider an application that frequently
> > allocates and de-allocates memory dynamically in a single threaded
> > process (say compiler), then allocating memory in the local node
> > should be the "first choice".
> >
> > Multithreaded apps can use a similar approach, if a thread is
> > allocating memory, it's often a good chance that the memory is being
> > used by that thread too [although this doesn't work for message
> > passing between threads, obviously, this is again a case where
> > "knowledge from the app" will be the only better solution than
> > "random"].
> >
> > This approach is by far not perfect, but if you consider that
> > applications often do short term allocations, it makes sense to
> > allocate on the local node if possible.
>
> I do not agree.
>
> Just because a thread happens to run on processor X when
> it first faults in a page off the process' heap doesn't give you
> a good indication that the memory will be used mostly by
> this thread or that the thread will continue running on the
> same processor. There are at least as many cases when
> this assumption is invalid as when it is valid. Without any
> solid indication that something else will work better, round
> robin allocation has to be the default strategy.
My guess would be that noticeably more than 50% of all (user-mode) memory
allocations are released within a shorter time than the time quantum used
by the scheduler - which in itself means that the thread is most likely
not going to move from one processor to another in the meantime (although
an interrupt may of course cause a reschedule that moves the thread to
another processor). These memory allocations are also usually small, but
there may be many of them in any second of runtime of the machine. Note
that I haven't made any effort to verify this guess, so if you have data
that contradicts my view, then by all means disregard my thoughts!
>
> Also, if you allow one process to consume a large percentage
> of one node's memory, you are indirectly hurting all competing
> multi-threaded apps which benefit from higher total memory
> bandwidth when they spread their data across nodes.
Yes. That's definitely one of the drawbacks of this method.
>
> I understand your point that if a single threaded process quickly
> shrinks its heap after growing it, it makes it less likely that it
> will migrate to a different processor while it is using this memory.
> I'm not sure how you predict that memory will be quickly released at
> allocation time though. Even if you could, I maintain you would
> still need safeguards in place to balance that process' needs
> with that of competing multi-threaded apps benefiting from the
> memory bandwidth scaling with number of hosting nodes.
See above "guesswork".
>
> You could try and compromise and allocate round robin starting
> locally and perhaps with diminishing strides as the total allocation
> grows (ie allocate local and progressively move towards a page
> round robin scheme as more memory is requested). I'm not sure
> this would do any better than plain old dumb round robin in the
> average case but it's worth a thought.
That's definitely not a bad idea.
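
To make sure I've understood the suggestion, here is a rough sketch in
Python (not Xen code; the function name, the halving rule and the
initial stride are all my own assumptions for illustration). It puts a
run of pages on the local node first, then moves to the next node with
a run half as long, until the run length reaches one page and it
behaves like plain page round robin:

```python
from itertools import islice

def diminishing_stride_nodes(local_node, num_nodes, initial_stride=8):
    """Yield the node to use for each successive page of one allocation stream.

    Hypothetical policy: start on the local node with a run of
    `initial_stride` pages, then halve the run length each time we move
    to the next node, bottoming out at one page per node (round robin).
    """
    stride = initial_stride
    node = local_node
    while True:
        for _ in range(stride):
            yield node
        node = (node + 1) % num_nodes      # move on to the next node
        stride = max(1, stride // 2)       # shorter run each time, never below 1

# First ten pages for a thread faulting on node 2 of a 4-node box.
placement = list(islice(diminishing_stride_nodes(2, 4, initial_stride=4), 10))
print(placement)  # [2, 2, 2, 2, 3, 3, 0, 1, 2, 3]
```

Small, short-lived allocations would stay entirely local with this
scheme, while a process that keeps growing ends up spread over all
nodes. Whether that beats plain round robin in practice would need
measuring.
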
Also, it's probably not a bad idea to have at least two choices:
"Allocate on closest processor" and "Round robin" (or "random" -
apparently random is a better approach than LRU for cache-line
replacement, because LRU works very badly in some cases, so random
placement may beat round robin for the same reason).
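
As a sketch of what I mean by having a choice of policy (again Python
for illustration only; the policy names and the make_allocator helper
are made up, not an existing interface):

```python
import random
from itertools import count

def make_allocator(policy, num_nodes, seed=None):
    """Return a function that maps the faulting CPU's node to the node to allocate from.

    Hypothetical policy names: "local", "round_robin" and "random".
    """
    if policy == "local":
        # Always allocate on the node of the CPU that took the fault.
        return lambda local_node: local_node
    if policy == "round_robin":
        # Ignore the faulting node and cycle through all nodes in order.
        counter = count()
        return lambda local_node: next(counter) % num_nodes
    if policy == "random":
        # Ignore the faulting node and pick any node uniformly at random.
        rng = random.Random(seed)
        return lambda local_node: rng.randrange(num_nodes)
    raise ValueError(f"unknown policy: {policy}")

# Five consecutive page faults taken on node 3 of a 4-node machine.
rr = make_allocator("round_robin", num_nodes=4)
print([rr(3) for _ in range(5)])  # [0, 1, 2, 3, 0]
```

The point is only that the policy is selected per process (or per
guest) rather than hard-wired, so a user who knows the workload can
pick the one that suits it.
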
>
>
> > However, supporting NUMA in the Hypervisor and forwarding arch-info
> > to the guest would make sense. At the least the very basic principle
> > of: If the guest is to run on a limited set of processors (nodes),
> > allocate memory from that (those) node(s) for the guest would make a
> > lot of sense.
>
> I suspect there is widespread agreement on this point.
>
>
>
>
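
For completeness, the basic principle we agree on could look roughly
like this (a Python sketch under my own assumptions: the CPU-to-node
map and both function names are hypothetical, and a real hypervisor
would take the topology from firmware tables rather than a dict):

```python
def nodes_for_cpus(cpu_to_node, guest_cpus):
    """Return the sorted list of nodes that host the guest's allowed CPUs."""
    return sorted({cpu_to_node[cpu] for cpu in guest_cpus})

def place_guest_pages(num_pages, cpu_to_node, guest_cpus):
    """Spread a guest's pages round robin, but only over its own nodes."""
    nodes = nodes_for_cpus(cpu_to_node, guest_cpus)
    return [nodes[i % len(nodes)] for i in range(num_pages)]

# Hypothetical 2-node box: CPUs 0-1 sit on node 0, CPUs 2-3 on node 1.
topology = {0: 0, 1: 0, 2: 1, 3: 1}

print(place_guest_pages(4, topology, guest_cpus=[2, 3]))  # [1, 1, 1, 1]
print(place_guest_pages(4, topology, guest_cpus=[1, 2]))  # [0, 1, 0, 1]
```

A guest pinned to one node gets all its memory there, and a guest that
spans nodes gets its memory interleaved over just those nodes.
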