Hi Andre,
I'm also looking into NUMA support for HVM guests, and I'd like to share my
thoughts and give my understanding of your patches.
1) Besides the SRAT, I think we should also build the guest SLIT according to
the host SLIT.
2) I agree we should give the user a way to specify how much memory each guest
node should have, namely the "nodemem" parameter in your patch02. However,
I can't find where it is assigned a value in your patches. I guess you missed
it in image.py.
And what if xen can't allocate memory from the specified host node (e.g.,
there is not enough free memory on that host node)? Currently xen *silently*
tries to allocate memory from other host nodes. This would hurt guest
performance, and the user would not know about it at all! I think we should add
an option to the guest config file: if it's set, guest creation should fail if
xen cannot allocate memory from the specified host node.
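To make the proposed behaviour concrete, here is a rough sketch in Python. It
is not xen code: the "strict" option, allocate_guest_memory() and the
free-memory table are all hypothetical names, and the fallback branch only
imitates the silent spill-over described above.

```python
class NodeAllocationError(Exception):
    """Raised when a request cannot be satisfied."""


def allocate_guest_memory(free_mb, requests, strict):
    """Place each (host_node, size_mb) request; return {host_node: mb_taken}.

    free_mb  -- dict host_node -> free MB (hypothetical stand-in for host state)
    requests -- list of (host_node, size_mb), one entry per guest node
    strict   -- hypothetical config option: fail instead of falling back
    """
    free = dict(free_mb)  # work on a copy; the caller's table is untouched
    taken = {}
    for node, size in requests:
        available = free.get(node, 0)
        got = min(size, available)
        missing = size - got
        if missing and strict:
            raise NodeAllocationError(
                "host node %d has only %d MB free, %d MB requested"
                % (node, available, size))
        if got:
            free[node] -= got
            taken[node] = taken.get(node, 0) + got
        # Non-strict mode: spill the remainder onto other host nodes.
        for other in sorted(free):
            if missing == 0:
                break
            if other == node:
                continue
            spill = min(missing, free[other])
            if spill:
                free[other] -= spill
                taken[other] = taken.get(other, 0) + spill
                missing -= spill
        if missing:
            raise NodeAllocationError("not enough free memory on the host")
    return taken
```

With strict unset, a 2048 MB request for a node with 1024 MB free is split
across two host nodes without any notice; with strict set, the same request
fails and the user learns why.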
3) In your patch02:
+ for (i = 0; i < numanodes; i++)
+ numainfo.guest_to_host_node[i] = i % 2;
As you said in the mail "[PATCH 5/5]", at present it does "simply round robin
until the code for automatic allocation is in place". I think "simply round
robin" is not acceptable and we should implement "automatic allocation".
4) Your patches try to sort the host nodes using a node load evaluation
algorithm, require the user to specify how many guest nodes the guest
should see, and distribute the guest vcpus equally across the guest nodes.
I don't think the algorithm can make the right choice every time, and it's not
flexible. Requiring the user to specify the number of guest nodes and
distributing vcpus equally across the guest nodes also doesn't sound wise or
flexible enough.
Since guest numa needs vcpu pinning to work as expected, how about the
following approach?
a) ask the user to use "cpus" option to pin each vcpu to a physical cpu (or
node);
b) find out how many physical nodes (host nodes) are involved and use that
number as the number of guest nodes;
c) map each guest node to one of the host nodes found in step b) and use
this info to fill the numainfo.guest_to_host_node[] in 3).
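Steps a) to c) could look roughly like the Python sketch below. It assumes
each vcpu is pinned to exactly one physical cpu and that a pcpu-to-node table
is available from the host; build_guest_to_host_node() and both input tables
are hypothetical names, not existing xend or libxc interfaces.

```python
def build_guest_to_host_node(vcpu_to_pcpu, pcpu_to_node):
    """Derive the guest NUMA layout from vcpu pinning.

    vcpu_to_pcpu -- list: index is the vcpu id, value is the pinned pcpu
    pcpu_to_node -- dict: pcpu -> host node (assumed to come from the host)

    Returns (guest_to_host_node, vcpu_to_guest_node). Guest nodes are
    numbered in the order their host node is first seen.
    """
    guest_to_host_node = []   # index: guest node, value: host node
    host_to_guest = {}        # reverse lookup while building
    vcpu_to_guest_node = []   # index: vcpu id, value: guest node
    for pcpu in vcpu_to_pcpu:
        host_node = pcpu_to_node[pcpu]
        if host_node not in host_to_guest:
            host_to_guest[host_node] = len(guest_to_host_node)
            guest_to_host_node.append(host_node)
        vcpu_to_guest_node.append(host_to_guest[host_node])
    return guest_to_host_node, vcpu_to_guest_node
```

For example, four vcpus pinned to pcpus 4, 5, 0, 1 on a host with two pcpus
per node touch host nodes 2 and 0, so the guest gets two nodes and
guest_to_host_node becomes [2, 0].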
5) I think we also need to present the numa guest with a virtual cpu topology,
e.g., through the initial APIC ID. In current xen, apic_id = vcpu_id * 2; even
if we have the guest SRAT support and use 2 guest nodes for a vcpus=n guest,
the guest would still think it's on a single package with n cores, without any
knowledge of the vcpu and cache topology, and this would harm guest
performance.
I think we can treat each guest node as a guest package and give the
guest a proper APIC ID (consisting of guest SMT_ID/Core_ID/Package_ID) to show
the vcpu topology to the guest. This needs changes to the APIC ID fields in
hvmloader's SRAT/MADT and to xen's cpuid/vlapic emulation.
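As an illustration of the APIC ID composition, here is a small Python sketch.
It assumes the usual x86 convention that each subfield occupies just enough
bits to hold its largest ID; the function names and the "one guest node = one
package" numbering are only for illustration and are not taken from the
patches.

```python
def field_width(count):
    """Bits needed to encode IDs 0..count-1 (0 bits when count is 1)."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return (count - 1).bit_length()


def make_apic_id(package_id, core_id, smt_id, cores_per_package,
                 threads_per_core):
    """Pack Package_ID | Core_ID | SMT_ID into one APIC ID."""
    smt_bits = field_width(threads_per_core)
    core_bits = field_width(cores_per_package)
    return ((package_id << (core_bits + smt_bits))
            | (core_id << smt_bits)
            | smt_id)


def guest_apic_ids(vcpus_per_node):
    """APIC IDs for a guest where each guest node is one package.

    vcpus_per_node -- list: index is the guest node, value is its vcpu count.
    Every vcpu is modelled as a single-threaded core (an assumption).
    """
    widest = max(vcpus_per_node)  # same core field width for every package
    ids = []
    for package_id, ncores in enumerate(vcpus_per_node):
        for core_id in range(ncores):
            ids.append(make_apic_id(package_id, core_id, 0, widest, 1))
    return ids
```

For a 6-vcpu guest with two guest nodes of 3 vcpus each, this gives APIC IDs
0, 1, 2, 4, 5, 6, so the package boundary is visible in the ID, whereas
apic_id = vcpu_id * 2 gives 0, 2, 4, 6, 8, 10 with no package information.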
6) HVM vcpu hot add/remove functionality was added to xen recently. The
guest numa support should take this into consideration.
7) I don't see live migration support in your patches. It looks like live
migration will be hard for an hvm numa guest, as the src/dest hosts could be
very different in HW configuration.
Thanks,
-- Dexuan
-----Original Message-----
From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
[mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Andre Przywara
Sent: February 5, 2010 5:51
To: Keir Fraser; Kamble, Nitin A
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support
Hi,
to avoid double work in the community on the same topic and to help
syncing on the subject and as I am not in office next week, I would like
to send the NUMA guest support patches I have so far.
These patches introduce NUMA support for guests. This can be handy if
either the guest's resources (VCPUs and/or memory) exceed one node's
capacity or the host is already loaded so that the requirement cannot be
satisfied from one node alone. Some applications may also benefit from
the aggregated bandwidth of multiple memory controllers.
Even if the guest has only a single node, this code replaces the current
NUMA placement mechanism by moving it into libxc.
I have changed something lately, so there are some loose ends, but it
should suffice as a discussion base.
The patches are for HVM guest primarily, as I don't deal much with PV I
am not sure whether a port would be straight-forward or the complexity
is higher. One thing I was not sure about is how to communicate the NUMA
topology to PV guests. Reusing the existing code base and injecting a
generated ACPI table seems smart, but this would mean enabling the ACPI
parsing code in PV Linux, which currently seems to be disabled (?).
If someone wants to step in and implement PV support, I will be glad to
help.
I have reworked the (guest node to) host node assignment part, this is
currently unfinished. I decided to move the node-rating part from
XendDomainInfo.py:find_relaxed_node() into libxc (should this eventually
go into libxenlight?) to avoid passing too much information between the
layers and to include libxl support. This code snippet (patch 5/5)
basically scans all VCPUs on all domains and generates an array holding
the node load metric for future sorting. The missing part here is a
static function in xc_hvm_build.c to pick the <n> best nodes and
populate the numainfo->guest_to_host_node array with the result. I will
do this when I am back.
For more details see the following email bodies.
Thanks and Regards,
Andre.
--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 488-3567-12
----to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Andrew Bowd; Thomas M. McCoy; Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel