On Wed, Apr 7, 2010 at 3:57 AM, Cui, Dexuan <dexuan.cui@xxxxxxxxx> wrote:
> Keir Fraser wrote:
>> I would like Acks from the people working on HVM NUMA for this patch
>> series. At the very least it would be nice to have a single user
>> interface for setting this up, regardless of whether for a PV or HVM
>> guest. Hopefully code in the toolstack also can be shared. So I'm
> Yes, I strongly agree we should share one interface, e.g., the
> XENMEM_numa_op hypercalls implemented by Dulloor could be re-used in the hvm
> numa case and some parts of the toolstack could be shared, I think. I also
> replied in another thread and supplied some similarities I found in
> Andre/Dulloor's patches.
>
IMO PV NUMA guests and HVM NUMA guests could share most of the
toolstack code - for instance, getting the current state of the machine,
deciding on a strategy for domain memory allocation, selection of
nodes, etc. They diverge only at the actual point of domain
construction. PV NUMA uses enlightenments, whereas HVM would need
to work with hvmloader to export SLIT/SRAT ACPI tables. So, I agree
that we need to converge.
>> cc'ing Dexuan and Andre, as I know they are involved in the HVM NUMA
>> work.
>>
>> Thanks,
>> Keir
>>
>> On 04/04/2010 20:30, "Dulloor" <dulloor@xxxxxxxxx> wrote:
>>
>>> The set of patches implements virtual NUMA-enlightenment to support
>>> NUMA-aware PV guests. In more detail, the patch implements the
>>> following :
>>>
>>> * For the NUMA systems, the following memory allocation strategies
>>> are implemented :
>>> - CONFINE : Confine the VM memory allocation to a
>>> single node. As opposed to the current method of doing this in
>>> python, the patch implements this in libxc(along with other
>>> strategies) and with assurance that the memory actually comes from
>>> the selected node.
>>> - STRIPE : If the VM memory doesn't fit in a
>>> single node and if the VM is not compiled with guest-numa-support,
>>> the memory is allocated striped across a selected max-set of nodes.
>>> - SPLIT : If the VM memory doesn't fit in a single node and if the VM
>>> is compiled with guest-numa-support, the memory is allocated split
>>> (equally for now) from the min-set of nodes. The VM is then made
>>> aware of this NUMA allocation (virtual NUMA enlightenment).
>>> - DEFAULT : This is the existing allocation scheme.
>>>
>>> * If the numa-guest support is compiled into the PV guest, we add
>>> numa-guest-support to xen features elfnote. The xen tools use this to
>>> determine if SPLIT strategy can be applied.
>>>
> I think this looks too complex to allow a real user to easily determine which
> one to use...
I think you misunderstood this. For the first version, I have
implemented an automatic global domain memory allocation scheme, which
(when enabled) applies to all domains on a NUMA machine. I am of the
opinion that users are seldom in a position to determine which strategy
to use. They would want the best possible performance for their VM at
any point of time, and we can only guarantee the best performance
possible given the current state of the system (how the free
memory is scattered across nodes, distance between those nodes, etc).
In that regard, this solution is the simplest.
> About the CONFINE strategy -- it looks like this is not a useful usage model to me --
> do we really think it's a typical usage model to
> ensure a VM's memory can only be allocated on a specified node?
Not all VMs are too large to fit into a single node (note that the
user doesn't specify a node). And, if a VM can fit into a single
node, that is obviously the best possible option for that VM.
> The definitions of STRIPE and SPLIT also don't sound like typical usage
> models to me.
There are only two possibilities. Either the VM fits in a single node
or it doesn't. The mentioned strategies (SPLIT, STRIPE) try to
optimize the solution when the VM doesn't fit in a single node. The
aim is to reduce the number of inter-node accesses (SPLIT) and/or
provide more predictable performance (STRIPE).
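
To illustrate the decision described above, here is a minimal sketch in
Python. It is not the libxc code from the patch: the function name, the
per-node free-memory map, the best-fit choice for CONFINE, and the
fallback to DEFAULT when no equal share fits are all assumptions made
for the example.

```python
from enum import Enum


class Strategy(Enum):
    CONFINE = "confine"
    SPLIT = "split"
    STRIPE = "stripe"
    DEFAULT = "default"


def choose_strategy(vm_mb, free_mb_per_node, guest_numa_support):
    """Pick an allocation strategy and the nodes it would use.

    free_mb_per_node maps a node id to its free memory in MB
    (a hypothetical input, standing in for the machine state).
    """
    total_free = sum(free_mb_per_node.values())
    if len(free_mb_per_node) < 2 or vm_mb > total_free:
        # Not a NUMA machine, or no placement possible: existing scheme.
        return Strategy.DEFAULT, []

    # CONFINE: the VM fits in a single node. Best-fit is an assumption.
    fitting = [n for n, free in free_mb_per_node.items() if free >= vm_mb]
    if fitting:
        node = min(fitting, key=lambda n: (free_mb_per_node[n], n))
        return Strategy.CONFINE, [node]

    # Nodes ordered by free memory, largest first (ties broken by node id).
    by_free = sorted(free_mb_per_node, key=lambda n: (-free_mb_per_node[n], n))

    if guest_numa_support:
        # SPLIT: smallest set of nodes that can each hold an equal share.
        sizes = range(2, len(by_free) + 1)
        strategy = Strategy.SPLIT
    else:
        # STRIPE: largest set of nodes that can each hold an equal share.
        sizes = range(len(by_free), 1, -1)
        strategy = Strategy.STRIPE

    for k in sizes:
        share = -(-vm_mb // k)  # ceiling division
        chosen = by_free[:k]
        if all(free_mb_per_node[n] >= share for n in chosen):
            return strategy, sorted(chosen)

    # Assumption: fall back to the existing scheme if no equal share fits.
    return Strategy.DEFAULT, []
```

With four nodes of 3000 MB free each, a 5000 MB guest with numa support
would be split over two nodes, while the same guest without numa support
would be striped over all four.
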
> Why must tools know if the PV kernel is built with guest numa support or not?
What is the point of arranging the memory to suit the construction of
nodes in the guest if the guest itself is not compiled to use them?
> If a user configures guest numa to "on" for a pv guest, the tools can supply
> the numa info to PV kernel even if the pv kernel is not built with guest
> numa support -- the pv kernel will neglect the info safely;
> If a user configures guest numa to "off" for a pv guest and the tools don't
> supply the numa info to PV kernel, and if the pv kernel is built with guest
> numa support, the pv kernel can easily detect this by your new hypercall and
> will not enable numa.
These error checks are done even now. But, by checking if the PV
kernel is built with guest numa support, we don't require the user to
configure yet another parameter. Wasn't that your concern too in the
very first point?
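
As a rough illustration of that check, the sketch below tests whether a
guest advertises numa support in its features string. It assumes the
string is '|'-separated with an optional '!' prefix on required
features, and that the feature is spelled "numa-guest-support"; the
helper name is hypothetical and this is not the code from the patch.

```python
def guest_supports_numa(features, name="numa-guest-support"):
    """Return True if `name` appears in a '|'-separated features string.

    The separator, the '!' prefix handling, and the default feature
    name are assumptions made for this sketch.
    """
    for token in features.split("|"):
        token = token.strip()
        if token.startswith("!"):
            # '!' is assumed to mark a feature the guest requires.
            token = token[1:]
        if token == name:
            return True
    return False
```

The tools would then consider SPLIT only when this returns True, so the
user does not have to set a separate option.
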
>
> When a user finds the computing capability of a single node can't satisfy the
> actual need and hence wants to use guest numa,
> since the user has specified the amount of guest memory and the number of
> vcpus in guest config file, I think the user only needs
> to specify how many guest nodes (the "guestnodes" option in Andre's patch) the
> guest will see, and the tools and the hypervisor
> should co-work to distribute guest memory and vcpus uniformly among the guest
> nodes (I think we may not want to support non-uniform
> nodes as that doesn't look like a typical usage model) -- of course,
> maybe a specified node doesn't have the expected
> amount of memory -- in this case, the guest can continue to run with a slower
> speed (we can print a warning message to the
> user); or, if the user does care about predictable guest performance, the
> guest creation should fail.
Please observe that the patch does all these things plus some more.
For one, the "guestnodes" option doesn't make sense, since, as you observe,
it requires the user to carefully read the state of the system when
starting the domain, and the user also needs to make sure that the
guest itself is compiled with numa support. The aim should be to
automate this part and provide the best performance, given the current
state. The patch attempts to do that. Secondly, when the guests are
not compiled with numa support, they would still want a more
predictable (albeit average) performance. By striping the memory
across the nodes and by pinning the domain vcpus to the union of those
nodes' processors, applications (of substantial sizes) could be
expected to see more predictable performance.
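
The striping and pinning mentioned above can be sketched as follows.
This is only an illustration under stated assumptions: memory is handed
out in equal-sized chunks in round-robin order, and the node-to-CPU map
is a hypothetical input. Neither function name comes from the patch.

```python
def stripe_chunks(num_chunks, nodes):
    """Assign chunk i to nodes[i % len(nodes)] (round-robin striping)."""
    if not nodes:
        raise ValueError("at least one node is required")
    return [nodes[i % len(nodes)] for i in range(num_chunks)]


def vcpu_affinity(nodes, cpus_of_node):
    """Return the sorted union of the physical CPUs of the selected nodes.

    cpus_of_node maps a node id to the list of CPUs on that node
    (a hypothetical stand-in for the machine topology).
    """
    cpus = set()
    for node in nodes:
        cpus.update(cpus_of_node[node])
    return sorted(cpus)
```

Every vcpu of the domain would be pinned to the same CPU set, so each
vcpu sees roughly the same mix of local and remote memory accesses.
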
>
> How do you like this? My thought is we can make things simple in the first
> step. :-)
Please let me know if my comments are not clear. I agree that we
should shoot for simplicity and also for a common interface. Hope we
will get there :)
>
> Thanks,
> -- Dexuan
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel