WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 
Xen 
 
Home Products Support Community News
 
   
 

xen-devel

Re: [Xen-devel] [PATCH 00/11] PV NUMA Guests

To: "Cui, Dexuan" <dexuan.cui@xxxxxxxxx>
Subject: Re: [Xen-devel] [PATCH 00/11] PV NUMA Guests
From: Dulloor <dulloor@xxxxxxxxx>
Date: Thu, 15 Apr 2010 13:19:48 -0400
Cc: Andre Przywara <andre.przywara@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>
Delivery-date: Thu, 15 Apr 2010 10:20:48 -0700
Dkim-signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:received:message-id:subject:from:to:cc:content-type; bh=q9hFLeFFEWDmzcc6Mwl/X2Fa3Lj4gLExv/fgtJ1Q6zo=; b=hUgPW7ExOIvl3S102KPq6gGSrUua96omUIvWS858QPjaWMee7gTxZRt7jGJuXPs9ak g4ZnwXbo+oTvpTpWz3N2q8ChGEALSAgg+uL5wPTVtGeRVhYs8Dmtof3+WNhaIVUewwQQ qon+xvjaQIwWTyCKHARRFQTDkY7VYf3ZfnDsk=
Domainkey-signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; b=Jxw6bmeE+N0S5UTt/34smMgQVoWhKqxUdh6Vew8FgcXEi2P3XHrawuv6MGJDaQrqP6 JJFtFWh390s1cZrC8UQ3RnruQ0uyFhbvkIago03FWDqTSJuFwWvX0N9Qgki2ifwgXuwa BpKFwfy9SE2yO+UYy4cT8TGLnib/YYS7Q6YKM=
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <ED3036A092A28F4C91B0B4360DD128EABE27581B@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <i2s940bcfd21004041230i36a89d07z81876daa0a344154@xxxxxxxxxxxxxx> <C7DF41BF.F558%keir.fraser@xxxxxxxxxxxxx> <ED3036A092A28F4C91B0B4360DD128EABE1D656C@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx> <q2r940bcfd21004082147ma61dada2ge687f10e6dc75351@xxxxxxxxxxxxxx> <ED3036A092A28F4C91B0B4360DD128EABE27581B@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
On Wed, Apr 14, 2010 at 1:18 AM, Cui, Dexuan <dexuan.cui@xxxxxxxxx> wrote:
> Dulloor wrote:
>> On Wed, Apr 7, 2010 at 3:57 AM, Cui, Dexuan <dexuan.cui@xxxxxxxxx>
>> wrote:
>>> Keir Fraser wrote:
>>>> I would like Acks from the people working on HVM NUMA for this patch
>>>> series. At the very least it would be nice to have a single user
>>>> interface for setting this up, regardless of whether for a PV or HVM
>>>> guest. Hopefully code in the toolstack also can be shared. So I'm
>>> Yes, I strongly agree we should share one interterface, e.g., The
>>> XENMEM_numa_op hypercalls implemented by Dulloor could be >re-used
>>> in the hvm numa case and some parts of the toolstack could be
>>> shared, I think. I also replied in another thead and >supplied some
>>> similarity I found in Andre/Dulloor's patches.
>>>
>> IMO PV NUMA guests and HVM NUMA guests could share most of the code
>> from toolstack - for instance, getting the current state of machine,
>> deciding on a strategy for domain memory allocation, selection of
>> nodes, etc. They diverge only at the actual point of domain
>> construction. PV NUMA uses enlightenments, whereas HVM would need
>> working with hvmloader to export SLIT/SRAT ACPI tables. So, I agree
>> that we need to converge.
> Hi Dulloor,
> In your patches, the toolstack tries to figure out the "best fit nodes" for a 
> PV guest and
>invokes a hypercall set_domain_numa_layout to tell the hypervisor to remember 
>the
>info, and later the PV guest invokes a hypercall get_domain_numa_layout to 
>retrieve the
>info from the hypervisor.
> Can this be changed to: the toolstack writes the guest numa info directly 
> into a new
>field in the start_info(or the share_info) (maybe in the starndard format of 
>the SRAT/SLIT)
>and later PV guest reads the info and uses acpi_numa_init() to parse the info? 
> I think in
>this way the new hypercalls can be avoided and the pv numa enlightenment code 
>in
>guest kernel can be minimized.
> I'm asking  this because this is the way how HVM numa patches of Andure do(the
>toolstack passes the info to hvmloader and the latter builds SRAT/SLIT for 
>guest)
Hi Cui,

In my first version of patches (for making dom0 a numa guest), I had
put this information into start_info
(http://lists.xensource.com/archives/html/xen-devel/2010-02/msg00630.html).
But, after that I thought this new approach is better (for pv numa and
maybe even hvm numa) for following reasons :

- For PV NUMA guests, there are more places where the enlightenment
might be useful. For instance, in the attached (refreshed)patch, I
have used the enlightenment to support ballooning (without changing
node mappings) for PV NUMA guests. Similarly, there are
other places within the hypervisor as well as in the VM where I plan
to use the domain_numa_layout. That's the main reason for choosing
this approach. Although I am not sure, I think this could be useful
for HVM too (maybe with PV on HVM).

- Using the hypercall interface is equally simple. And, also with
start-info, I wasn't sure if it looks clean to add feature-specific
variables (useful only with PV NUMA guests) to start-info (or even
shared info), changing the xen-vm interface, adding (unnecessary)
changes for compat, etc.

Please let me know your thoughts.


>
> xc_select_best_fit_nodes() decides the "min-set" of host nodes that will be 
> used for the
>guest. It only considers the current memory usage of the system. Maybe we 
>should also
>condider the cpu load? And the number of the nodes must be 2^^n? And how to 
>handle >the case #vcpu is < #vnode?
> And looks your patches only consider the guest's memory requirement -- 
> guest's vcpu
>requirement is neglected? e.g., a guest may not need a very large amount of 
>memory
>while it needs many vcpus. xc_select_best_fit_nodes() should consider this when
>determining the number of vnode.

I agree with you. I was planning to consider vcpu load as the next
step. Also, I am looking
for a good heuristic. I looked at the nodeload heuristic (currently in
xen), but found it too naive. But, if you/Andre think it is a good
heuristic, I will add the support. Actually, I think
in future we should do away with strict vcpu-affinities and rely more
on a scheduler with
necessary NUMA support to complement our placement strategies.

As of now, we don't SPLIT, if #vcpu < #vnode. We use STRIPING in that case.

>
>>>> On 04/04/2010 20:30, "Dulloor" <dulloor@xxxxxxxxx> wrote:
>>>>
>>>>> The set of patches implements virtual NUMA-enlightenment to support
>>>>> NUMA-aware PV guests. In more detail, the patch implements the
>>>>> following :
>>>>>
>>>>> * For the NUMA systems, the following memory allocation strategies
>>>>> are implemented : - CONFINE : Confine the VM memory allocation to a
>>>>> single node. As opposed to the current method of doing this in
>>>>> python, the patch implements this in libxc(along with other
>>>>> strategies) and with assurance that the memory actually comes from
>>>>> the selected node. - STRIPE : If the VM memory doesn't fit in a
>>>>> single node and if the VM is not compiled with guest-numa-support,
>>>>> the memory is allocated striped across a selected max-set of nodes.
>>>>> - SPLIT : If the VM memory doesn't fit in a single node and if the
>>>>> VM is compiled with guest-numa-support, the memory is allocated
>>>>> split (equally for now) from the min-set of nodes. The  VM is then
>>>>> made aware of this NUMA allocation (virtual NUMA enlightenment).
>>>>> -DEFAULT : This is the existing allocation scheme.
>>>>>
>>>>> * If the numa-guest support is compiled into the PV guest, we add
>>>>> numa-guest-support to xen features elfnote. The xen tools use this
>>>>> to determine if SPLIT strategy can be applied.
>>>>>
>>> I think this looks too complex to allow a real user to easily
>>> determine which one to use...
>> I think you misunderstood this. For the first version, I have
>> implemented an automatic global domain memory allocation scheme, which
>> (when enabled) applies to all domains on a NUMA machine. I am of
>> opinion that users are seldom in a state to determine which strategy
>> to use. They would want the best possible performance for their VM at
>> any point of time, and we can only guarantee the best possible
>> performance, given the current state of the system (how the free
>> memory is scattered across nodes, distance between those nodes, etc).
>> In that regard, this solution is the simplest.
> Ok, I see.
> BTW: I think actually currently Xen can handle the case CONFINE pretty well, 
> e.g, when
> no vcpu affinity is explicitly specified, the toolstack tries to choose a 
> "best" host node
> for the guest and pins all vcpus of the guest to the host node.
But, currently it is done in python code and also it doesn't use
exact_node interface.
I added this to the libxc toolstack for the sake of completeness
(CONFINE is just a
special case of SPLIT). Also, with libxl catching up, we might anyway
want to do these
things in libxc, where it is accessible to both xm and xl.

>
>>> About the CONFINE stragegy -- looks this is not a useful usage model
>>> to me -- do we really think it's a typical usage model to
>>> ensure a VM's memory can only be allocated on a specified node?
>> Not all VMs are large enough not to fit into a single node (note that
>> user doesn't specify a node). And, if a VM can be fit into a single
>> node, that is obviously the best possible option for a VM.
>>
>>> The definitions of STRIPE and SPLIT also doesn't sound like typical
>>> usage models to me.
>> There are only two possibilities. Either the VM fits in a single node
>> or it doesn't. The mentioned strategies (SPLIT, STRIPE) try to
>> optimize the solution when the VM doesn't fit in a single node. The
>> aim is to reduce the number of inter-node accesses(SPLIT) and/or
>> provide a more predictable performance(STRIPE).
>>
>>> Why must tools know if the PV kernel is built with guest numa
>>> support or not?
>> What is the point of arranging the memory amenable for construction of
>> nodes in guest if the guest itself is not compiled to do so.
> I meant: to simplify the implementation, the toolstack can always supply the 
> numa
> config info to the guest *if necessary*, no matter if the guest kernel is 
> numa-enabled or
> not (even if the guest kernel isn't numa-enabled, the guest performance may 
> be better
> if the toolstack decides to supply a numa config to the guest)
> About the "*if necessary*": Andure and I think the user should supply an 
>  option
> "guestnode" in the guest config file, and you think the toolstack should be 
> able to
> automatically determine a "best" value. I raised some questions about
> xc_select_best_fit_nodes() in the above paragraph.
> Hi Andre, would you like to comment on this?
How about an "automatic"  global option along with a VM-level
"guestnode" option. These options could be work independently or with
each other ("guestnode" would take
preference over global "automatic" option). We can work out finer details.

>
>>
>>> If a user configures guest numa to "on" for a pv guest, the tools
>>> can supply the numa info to PV kernel even if the pv kernel is not >
>>> built with guest numa support -- the pv kernel will neglect the info
>>> safely;
>>> If a user configures guest numa to "off" for a pv guest and the
>>> tools don't supply the numa info to PV kernel, and if the pv kernel
>>> > is built with guest numa support, the pv kernel can easily detect
>>> this by your new hypercall and will not enable numa.
>> These error checks are done even now. But, by checking if the PV
>> kernel is built with guest numa support, we don't require the user to
>> configure yet another parameter. Wasn't that your concern too in the
>> very first point ?
>>
>>>
>>> When a user finds the computing capability of a single node can't
>>> satisfy the actual need and hence wants to use guest numa,
>>> since the user has specified the amount of guest memory and the
>>> number of vcpus in guest config file, I think the user only needs
>>> to specify how many guest nodes (the "guestnodes" option in Andre's
>>> patch) the guest will see, and the tools and the hypervisor
>>> should co-work to distribute guest memory and vcpus uniformly among
>>> the guest nodes(I think we may not want to support non-
>>> uniform nodes as that doesn't look like a typical usage model) -- of
>>> course, maybe a specified node doesn't have the expected
>>> amount of memory -- in this case, the guest can continue to run with
>>> a slower speed (we can print a warning message to the
>>> user); or, if the user does care about predictable guest
>>> performance, the guest creation should fail.
>>
>> Please observe that the patch does all these things plus some more.
>> For one, "guestnodes" option doesn't make sense, since as you observe,
>> it needs the user to carefully read the state of the system when
>> starting the domain and also the user needs to make sure that the
>> guest itself is compiled with numa support. The aim should be to
> I think it's not difficult for a user to specify "guestnodes" and to check if 
> a PV/HVM guest
> kernel is numa-enabled or not(anyway, a user needs to ensure that to achieve 
> the
> optimal peformance). "xm info/list/vcpu-list" should already supply enough 
> info. I think
> it's reasonable to assume a numa user has more knowledge than a preliminary 
> user. :-)
>
> I suppose Andure would argue more for the  "guestnodes" option.
>
> PV guest can use the ELFnote as a hit to the toolstack. This may be used as a 
> kind of optimization.
> HVM guest can't use this.
As mentioned above, I think we have a good case for both global and
VM-level options. What do you think ?

>
>> automate this part and provide the best performance, given the current
>> state. The patch attempts to do that. Secondly, when the guests are
>> not compiled with numa support, they would still want a more
>> predictable (albeit average) performance. And, by striping the memory
>> across the nodes and by pinning the domain vcpus to the union of those
>> nodes' processors, applications (of substantial sizes) could be
>> expected to see more predictable performance.
>>>
>>> How do you like this? My thought is we can make things simple in the
>>> first step. :-)
>> Please let me know if my comments are not clear. I agree that we
>> should shoot for simplicity and also for a common interface. Hope we
>> will get there :)
> Thanks a lot for all the explanation and discussion.
> Yes, we need to agree on a common interface to avoid confusion.
> And I still think the "guestnodes/uniform_nodes" idea is more straightforward 
> and the
> implementatin is simpler. :-)
>
> Thanks,
>  -- Dexuan

thanks
dulloor

Attachment: numa-ballooning.patch
Description: Text Data

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel