RE: [Xen-devel] Re: NUMA and SMP
> -----Original Message-----
> From: Emmanuel Ackaouy [mailto:ack@xxxxxxxxxxxxx]
> Sent: 16 January 2007 13:56
> To: Petersson, Mats
> Cc: xen-devel; Anthony Liguori; David Pilger; Ryan Harper
> Subject: Re: [Xen-devel] Re: NUMA and SMP
>
> On the topic of NUMA:
>
> I'd like to dispute the assumption that a NUMA-aware OS can actually
> make good decisions about the initial placement of memory in a
> reasonable hardware ccNUMA system.
I'm not saying that it can ALWAYS make good decisions, but it's got a
better chance than software that just places things in a "first
available" way.
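To make the contrast concrete, here's a minimal userspace sketch using
Linux's libnuma (purely illustrative - these are libnuma calls, not
anything Xen-specific; build with -lnuma):

    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t sz = 64UL * 1024 * 1024;

        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        /* "First available": take whatever the allocator hands back. */
        void *anywhere = malloc(sz);

        /* NUMA-aware first choice: prefer the node the calling
         * thread is currently running on. */
        void *local = numa_alloc_local(sz);

        /* ... use the memory ... */
        free(anywhere);
        numa_free(local, sz);
        return 0;
    }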
>
> How does the OS know on which node a particular chunk of memory
> will be most accessed? The truth is that unless the application or
> person running the application is herself NUMA-aware and can provide
> placement hints or directives, the OS will seldom beat a round-robin /
> interleave or random placement strategy.
I don't disagree with that.
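For reference, the interleaved default you describe is a one-liner
with the same library (again only a sketch, not Xen code):

    #include <stddef.h>
    #include <numa.h>

    /* Spread the pages of a large allocation round-robin across all
     * allowed nodes, so no single node pays the whole remote-access
     * penalty. */
    void *alloc_spread(size_t sz)
    {
        if (numa_available() < 0)
            return NULL;
        return numa_alloc_interleaved(sz);
    }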
>
> To illustrate, consider an app which lays out a bunch of data in
> memory in a single thread and then spawns worker threads to process
> it.
That's a good example of a hard nut to crack. Not easily solved in the
OS, that's for sure.
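One partial trick - assuming Linux's default first-touch placement and
an access pattern the programmer (not the OS!) knows to be partitioned -
is to defer initialisation to the workers:

    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    #define NWORKERS 4
    #define CHUNK    (16UL * 1024 * 1024)

    static char *data;

    static void *worker(void *arg)
    {
        long id = (long)arg;
        /* First touch: these pages fault in on whatever node this
         * worker runs on, so each slice lands near its user. */
        memset(data + id * CHUNK, 0, CHUNK);
        /* ... then process the same slice ... */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NWORKERS];
        long i;

        data = malloc(NWORKERS * CHUNK);
        if (data == NULL)
            return 1;
        for (i = 0; i < NWORKERS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (i = 0; i < NWORKERS; i++)
            pthread_join(t[i], NULL);
        free(data);
        return 0;
    }

But that only works because the programmer knew the partitioning up
front - which is exactly the knowledge the OS lacks.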
>
> Is the OS to place memory close to the initial thread? How can it
> possibly know how many threads will eventually process the data?
>
> Even if the OS knew how many threads will eventually crunch the data,
> it cannot possibly know at placement time if each thread will work on
> an assigned data subset (and if so, which one) or if it will act as a
> pipeline stage with all the data being passed from one thread to the
> next.
>
> If you go beyond initial memory placement or start considering memory
> migration, then it's even harder to win because you have to pay copy
> and stall penalties during migrations. So you have to be real smart
> about predicting the future to do better than your ~10-40% memory
> bandwidth and latency hit associated with doing simple memory
> interleaving on a modern hardware-ccNUMA system.
Sure, I certainly wasn't suggesting memory migration.
However, there is one case where NUMA information COULD be helpful, and
that is paging in: the system could try to find a page on the local
node rather than at "random" [although without knowing what the future
holds, this could be wrong - as any non-future-knowing strategy would
be]. Of course, I wouldn't disagree if you said "the system probably
has too little memory if it's paging"!
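In pseudo-C, the policy I have in mind is no more than this (all the
names here are invented for illustration - this is not an existing Xen
or Linux interface):

    struct frame;                          /* a free machine page frame */
    extern int current_node(void);         /* node of the faulting CPU  */
    extern struct frame *alloc_frame_on_node(int node);
    extern struct frame *alloc_frame_any(void);

    struct frame *alloc_pagein_frame(void)
    {
        /* First choice: a frame on the node of the faulting CPU. */
        struct frame *f = alloc_frame_on_node(current_node());
        if (f == NULL)
            f = alloc_frame_any();  /* fall back to any node */
        return f;
    }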
>
> And it gets worse for you when your app is successfully taking
> advantage of the memory cache hierarchy because its performance is
> less impacted by raw memory latency and bandwidth.
Indeed.
>
> Things also get more difficult on a time-sharing host with competing
> apps.
Agreed.
>
> There is a strong argument for making hypervisors and OSes NUMA
> aware in the sense that:
> 1- They know about system topology
> 2- They can export this information up the stack to applications
> and users
> 3- They can take in directives from users and applications to
> partition the host and place some threads and memory in specific
> partitions.
> 4- They use an interleaved (or random) initial memory placement
> strategy by default.
>
> The argument that the OS on its own -- without user or application
> directives -- can make better placement decisions than round-robin or
> random placement is -- in my opinion -- flawed.
Debatable - it depends a lot on WHAT applications you expect to run, and
how they behave. If you consider an application that frequently
allocates and de-allocates memory dynamically in a single-threaded
process (say, a compiler), then allocating memory on the local node
should be the "first choice".
Multithreaded apps can use a similar approach: if a thread is
allocating memory, there's a good chance that the memory will be used
by that thread too [although this doesn't work for message passing
between threads, obviously - that is again a case where "knowledge from
the app" is the only solution better than "random"].
This approach is far from perfect, but considering that applications
often make short-term allocations, it makes sense to allocate on the
local node if possible.
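And for point 3 in your list, the directive interface already exists
in userspace: a thread can pin itself and its data to one node with
libnuma (illustrative sketch again; build with -lnuma):

    #include <stddef.h>
    #include <numa.h>

    void *alloc_pinned(size_t sz, int node)
    {
        if (numa_available() < 0 || node > numa_max_node())
            return NULL;
        if (numa_run_on_node(node) < 0)      /* keep thread on 'node' */
            return NULL;
        return numa_alloc_onnode(sz, node);  /* and its data with it  */
    }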
>
> I also am skeptical that the complexity associated with page
> migration strategies would be worthwhile: If you got it wrong the
> first time, what makes you think you'll do better this time?
I'm not advocating any page migration, with the possible exception
that page faults which are resolved by paging in should have a
first choice of the local node.
However, supporting NUMA in the hypervisor and forwarding arch-info to
the guest would make sense. At the very least, the basic principle
would make a lot of sense: if the guest is to run on a limited set of
processors (nodes), allocate the guest's memory from that node (or
those nodes).
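In pseudo-C, again with invented names (this is not the real Xen
interface, just the shape of the idea):

    struct domain;                     /* a guest                         */
    struct page;                       /* a machine page                  */
    typedef unsigned long nodemask_t;  /* invented: a set of node numbers */
    extern nodemask_t guest_node_mask(struct domain *d);
    extern struct page *alloc_page_on_nodes(struct domain *d, nodemask_t m);
    extern struct page *alloc_page_any(struct domain *d);

    /* Give a guest its memory from the nodes its VCPUs are confined
     * to, falling back to any node rather than failing outright. */
    int populate_guest_memory(struct domain *d, unsigned long nr_pages)
    {
        nodemask_t nodes = guest_node_mask(d);
        unsigned long i;

        for (i = 0; i < nr_pages; i++) {
            struct page *pg = alloc_page_on_nodes(d, nodes);
            if (pg == NULL)
                pg = alloc_page_any(d);
            if (pg == NULL)
                return -1;              /* genuinely out of memory */
            /* ... map pg into the guest ... */
        }
        return 0;
    }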
[Note that I'm by no means a NUMA expert - I just happen to work for
AMD, which happens to have a ccNUMA architecture].
--
Mats
>
> Emmanuel.
>
>
>
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel