RE: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest supp

To:	Andre Przywara <andre.przywara@xxxxxxx>, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>, "Kamble, Nitin A" <nitin.a.kamble@xxxxxxxxx>
Subject:	RE: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support
From:	"Cui, Dexuan" <dexuan.cui@xxxxxxxxx>
Date:	Sat, 6 Feb 2010 00:35:34 +0800
Accept-language:	zh-CN, en-US
Cc:	"xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>
Delivery-date:	Fri, 05 Feb 2010 08:36:56 -0800
Envelope-to:	www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to:	<4B6B4126.2050508@xxxxxxx>
List-help:	<mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id:	Xen developer discussion <xen-devel.lists.xensource.com>
List-post:	<mailto:xen-devel@lists.xensource.com>
List-subscribe:	<http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe:	<http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References:	<4B6B4126.2050508@xxxxxxx>
Sender:	xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index:	Acql4+ASN059+6ptR/eqz6JM2Y/pGAAnBe0g
Thread-topic:	[Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support

Hi Andre,
I'm also looking into NUMA support for HVM guests, and I'd like to share my
thoughts and my understanding of your patches.

1) Besides SRAT, I think we should also build guest SLIT according to host SLIT.

2) I agree we should give the user a way to specify how much memory each
guest node should have, namely the "nodemem" parameter in your patch02.
However, I can't find where it is assigned a value in your patches. I guess
you missed it in image.py.
     And what if xen can't allocate memory from the specified host node (e.g.,
there is not enough free memory on that node)? Currently xen *silently* tries
to allocate memory from other host nodes. This hurts guest performance and
the user doesn't know about it at all! I think we should add an option to the
guest config file: if it's set, guest creation should fail when xen cannot
allocate memory from the specified host node.
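
     The policy behind such an option could be sketched as below. This is
only an illustration: pick_alloc_node(), the free_pages[] array and the
"strict" flag are hypothetical names, and xen's real allocator does not work
on a simple per-node counter like this.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decide which host node to allocate nr_pages from.
 *
 * free_pages[n] : free pages on host node n (hypothetical input)
 * wanted_node   : host node requested for this guest node
 * strict        : non-zero means "fail instead of falling back"
 *
 * Returns the chosen host node, or -1 if the request cannot be met.
 */
int pick_alloc_node(const unsigned long *free_pages, int nr_nodes,
                    int wanted_node, unsigned long nr_pages, int strict)
{
    assert(free_pages != NULL);
    assert(wanted_node >= 0 && wanted_node < nr_nodes);

    if (free_pages[wanted_node] >= nr_pages)
        return wanted_node;

    if (strict)
        return -1;      /* let guest creation fail visibly */

    /* Non-strict: fall back to any other node with enough memory. */
    for (int n = 0; n < nr_nodes; n++)
        if (n != wanted_node && free_pages[n] >= nr_pages)
            return n;

    return -1;
}
```

     With strict set, a shortage on the wanted node is reported to the
caller instead of being hidden by a fallback to a remote node.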

3) In your patch02:
+        for (i = 0; i < numanodes; i++)
+            numainfo.guest_to_host_node[i] = i % 2;
As you said in the mail "[PATCH 5/5]", at present this is "simply round robin
until the code for automatic allocation is in place". I think "simply round
robin" is not acceptable and we should implement "automatic allocation".

4) Your patches try to sort the host nodes using a node load evaluation
algorithm, require the user to specify how many guest nodes the guest
should see, and distribute the guest vcpus equally across the guest nodes.
    I don't think the algorithm can be wise enough every time, and it's not
flexible. Requiring the user to specify the number of guest nodes and
distributing vcpus equally across them doesn't sound wise or flexible enough
either.

   Since guest numa needs vcpu pinning to work as expected, how about my
thoughts below?

   a) ask the user to use the "cpus" option to pin each vcpu to a physical
cpu (or node);
   b) find out how many physical nodes (host nodes) are involved and use that
number as the number of guest nodes;
   c) map each guest node to one of the host nodes found in step b) and use
this info to fill the numainfo.guest_to_host_node[] in 3).
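
   Steps a) to c) could be sketched roughly as follows. This is an
illustration under assumptions, not code from the patches:
build_guest_nodes(), MAX_NODES and the vcpu_to_pcpu[]/pcpu_to_node[] inputs
are hypothetical, and each vcpu is assumed to be pinned to exactly one
physical cpu.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 64    /* illustrative limit, not a xen constant */

/*
 * Derive the guest nodes from the vcpu pinning.
 *
 * vcpu_to_pcpu[v]       : physical cpu that vcpu v is pinned to (step a)
 * pcpu_to_node[p]       : host node that physical cpu p belongs to
 * vcpu_to_gnode[v]      : out, guest node that vcpu v ends up in
 * guest_to_host_node[g] : out, host node backing guest node g
 *                         (must have room for MAX_NODES entries)
 *
 * Returns the number of guest nodes, i.e. the number of distinct host
 * nodes used by the pinning, or -1 if that exceeds MAX_NODES.
 */
int build_guest_nodes(const int *vcpu_to_pcpu, int nr_vcpus,
                      const int *pcpu_to_node,
                      int *vcpu_to_gnode, int *guest_to_host_node)
{
    int nr_gnodes = 0;

    assert(vcpu_to_pcpu != NULL && pcpu_to_node != NULL);
    assert(vcpu_to_gnode != NULL && guest_to_host_node != NULL);

    for (int v = 0; v < nr_vcpus; v++) {
        int hnode = pcpu_to_node[vcpu_to_pcpu[v]];
        int g;

        /* Step b): has this host node been seen already? */
        for (g = 0; g < nr_gnodes; g++)
            if (guest_to_host_node[g] == hnode)
                break;

        if (g == nr_gnodes) {
            if (nr_gnodes == MAX_NODES)
                return -1;
            /* Step c): a new guest node backed by this host node. */
            guest_to_host_node[nr_gnodes++] = hnode;
        }
        vcpu_to_gnode[v] = g;
    }
    return nr_gnodes;
}
```

   For example, on a host with cpus 0-3 on node 0 and cpus 4-7 on node 1,
pinning four vcpus to cpus 0, 1, 4 and 5 yields two guest nodes with two
vcpus each.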


5) I think we also need to present the numa guest with a virtual cpu
topology, e.g., through the initial APIC ID. In current xen, apic_id =
vcpu_id * 2; even if we have guest SRAT support and use 2 guest nodes for a
vcpus=n guest, the guest would still think it's on one package with n cores,
without any knowledge of the vcpu and cache topology, and this would harm
guest performance.
   I think we can treat each guest node as a guest package and give the
guest a proper APIC ID (consisting of guest SMT_ID/Core_ID/Package_ID) to show
the vcpu topology to the guest. This needs changes to the APIC ID fields of
hvmloader's SRAT/MADT and to xen's cpuid/vlapic emulation.
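
   One way to compose such an APIC ID is sketched below, assuming the usual
layout where SMT_ID sits in the lowest bits, Core_ID above it and Package_ID
on top, with each field width rounded up to a whole number of bits.
make_apic_id() and field_width() are hypothetical helpers, not existing xen
functions.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest number of bits that can encode the values 0 .. count-1. */
static unsigned int field_width(unsigned int count)
{
    unsigned int bits = 0;

    while (bits < 31 && (1u << bits) < count)
        bits++;
    return bits;
}

/*
 * Compose an initial APIC ID as Package_ID | Core_ID | SMT_ID, with
 * SMT_ID in the lowest bits.  package_id would be the guest node number.
 */
uint32_t make_apic_id(unsigned int package_id, unsigned int core_id,
                      unsigned int smt_id,
                      unsigned int cores_per_package,
                      unsigned int threads_per_core)
{
    unsigned int smt_bits = field_width(threads_per_core);
    unsigned int core_bits = field_width(cores_per_package);

    assert(smt_id < threads_per_core);
    assert(core_id < cores_per_package);
    assert(core_bits + smt_bits < 32);

    return ((uint32_t)package_id << (core_bits + smt_bits)) |
           ((uint32_t)core_id << smt_bits) | smt_id;
}
```

   With 4 cores per package and no SMT, core 2 of guest node 1 gets APIC ID
6, so the guest can recover the package number by shifting out the two core
bits.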

6) HVM vcpu hot add/remove functionality was added to xen recently. The
guest numa support should take this into consideration.

7) I don't see live migration support in your patches. It looks hard for an
hvm numa guest to do live migration, as the src/dest hosts could be very
different in HW configuration.

Thanks,
-- Dexuan


-----Original Message-----
From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx 
[mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Andre Przywara
Sent: February 5, 2010 5:51
To: Keir Fraser; Kamble, Nitin A
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support

Hi,
to avoid double work in the community on the same topic and to help
syncing on the subject and as I am not in office next week, I would like
to send the NUMA guest support patches I have so far.

These patches introduce NUMA support for guests. This can be handy if 
either the guest's resources (VCPUs and/or memory) exceed one node's 
capacity or the host is already loaded so that the requirement cannot be 
satisfied from one node alone. Some applications may also benefit from 
the aggregated bandwidth of multiple memory controllers.
Even if the guest has only a single node, this code replaces the current 
NUMA placement mechanism by moving it into libxc.

I have changed something lately, so there are some loose ends, but it
should suffice as a discussion base.

The patches are primarily for HVM guests. As I don't deal much with PV, I 
am not sure whether a port would be straightforward or whether the 
complexity is higher. One thing I was not sure about is how to communicate 
the NUMA topology to PV guests. Reusing the existing code base and injecting 
a generated ACPI table seems smart, but this would mean enabling the ACPI 
parsing code in PV Linux, which currently seems to be disabled (?).
If someone wants to step in and implement PV support, I will be glad to 
help.

I have reworked the (guest node to) host node assignment part; this is
currently unfinished. I decided to move the node-rating part from
XendDomainInfo.py:find_relaxed_node() into libxc (should this eventually 
go into libxenlight?) to avoid passing too much information between the 
layers and to include libxl support. This code snippet (patch 5/5) 
basically scans all VCPUs on all domains and generates an array holding 
the node load metric for future sorting. The missing part here is a 
static function in xc_hvm_build.c to pick the <n> best nodes and 
populate the numainfo->guest_to_host_node array with the result. I will 
do this when I am back.
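
The missing selection step could look roughly like the sketch below. It is
not part of the posted patches: pick_best_nodes(), struct node_load and the
assumption that a lower load metric means a better node are all
illustrative.

```c
#include <assert.h>
#include <stdlib.h>

struct node_load {
    int node;               /* host node number */
    unsigned int load;      /* load metric, lower is assumed better */
};

/* Order by ascending load; break ties by node number. */
static int cmp_load(const void *a, const void *b)
{
    const struct node_load *x = a, *y = b;

    if (x->load != y->load)
        return x->load < y->load ? -1 : 1;
    return x->node - y->node;
}

/*
 * Fill guest_to_host_node[0 .. nr_guest_nodes-1] with the least loaded
 * host nodes.  Returns 0 on success, -1 on error.
 */
int pick_best_nodes(const unsigned int *load, int nr_host_nodes,
                    int nr_guest_nodes, int *guest_to_host_node)
{
    struct node_load *sorted;

    assert(load != NULL && guest_to_host_node != NULL);
    if (nr_host_nodes <= 0 || nr_guest_nodes > nr_host_nodes)
        return -1;

    sorted = malloc(nr_host_nodes * sizeof(*sorted));
    if (sorted == NULL)
        return -1;

    for (int n = 0; n < nr_host_nodes; n++) {
        sorted[n].node = n;
        sorted[n].load = load[n];
    }
    qsort(sorted, nr_host_nodes, sizeof(*sorted), cmp_load);

    for (int g = 0; g < nr_guest_nodes; g++)
        guest_to_host_node[g] = sorted[g].node;

    free(sorted);
    return 0;
}
```

With host node loads of 5, 1, 3 and 2 and two guest nodes, this maps guest
node 0 to host node 1 and guest node 1 to host node 3.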

For more details see the following email bodies.

Thanks and Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 488-3567-12
----to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Andrew Bowd; Thomas M. McCoy; Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
