Hi Andre, will you re-post your patches?
For the first implementation I think we can keep things simple: the user
specifies how many guest nodes the HVM guest will see (the "guestnodes"
option in your patch -- I think "numa_nodes", or simply "nodes", may be
a better name), and we distribute guest memory and vcpus uniformly among
those guest nodes. We should also add a boolean option "uniform_nodes",
defaulting to True: if we cannot construct uniform nodes for the guest
(e.g., not enough memory can be allocated on the corresponding host
node), guest creation should fail. This option is useful to users who
want predictable guest performance.
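The uniform split described above could look like this minimal sketch
(the helper name split_uniform and its signature are hypothetical, not
taken from the patches):

```c
#include <assert.h>

/* Hypothetical helper: split total_mb of guest memory and total_vcpus
 * uniformly across nr_nodes guest nodes. Any remainder goes to the
 * first nodes so every megabyte and vCPU is accounted for. */
static void split_uniform(unsigned int total_mb, unsigned int total_vcpus,
                          unsigned int nr_nodes,
                          unsigned int node_mb[], unsigned int node_vcpus[])
{
    for (unsigned int i = 0; i < nr_nodes; i++) {
        node_mb[i]    = total_mb / nr_nodes    + (i < total_mb % nr_nodes);
        node_vcpus[i] = total_vcpus / nr_nodes + (i < total_vcpus % nr_nodes);
    }
}
```

With "uniform_nodes" set, the caller would then fail guest creation if
any node_mb[i] cannot actually be allocated on its host node.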
From: Andre Przywara [mailto:andre.przywara@xxxxxxx]
Sent: February 23, 2010 17:53
To: Cui, Dexuan
Cc: xen-devel; Keir Fraser; Kamble, Nitin A
Subject: Re: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support
Cui, Dexuan wrote:
> Hi Andre,
> I'm also looking into hvm guests' numa support and I'd like to share my
> thoughts and my understanding of your patches.
> 1) Besides SRAT, I think we should also build the guest SLIT according
> to the host SLIT.
That is probably right, though it is currently low priority. Let's get
the basics upstream first.
> 2) I agree we should give the user a way to specify how much memory
> each guest node should have, namely the "nodemem" parameter in your
> patch02. However, I can't find where it is assigned a value in your
> patches. I guess you missed it in image.py.
Omitted for now. I wanted to keep the first patches clean and had a hard
time propagating arrays from the config files down to libxc. Is there a
good explanation of the different kinds of config file options? I see
different classes (like HVM-only) along with some legacy parts that
appear quite confusing to me.
> And what if xen can't allocate memory from the specified host node
> (e.g., not enough free memory on that host node)? Currently xen
> *silently* tries to allocate memory from other host nodes -- this
> would hurt guest performance while the user doesn't know about it at
> all! I think we should add an option to the guest config file: if it's
> set, guest creation should fail if xen cannot allocate memory from the
> specified host node.
I had exactly that scenario in mind, too: provide some kind of numa=auto
option in the config file to let Xen automatically split the memory
allocation across different nodes if needed. I think we need an upper
limit here, or maybe something like: a numa=allow option would only
allocate from up to two nodes if no single node can satisfy the memory
request.
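A rough sketch of what such a capped fallback policy could look like
(pick_nodes and its interface are made up for illustration; the real
code would live in libxc and query the hypervisor for per-node free
memory):

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)

/* Hypothetical sketch of the numa=allow policy: prefer a single host
 * node that can hold the whole guest; otherwise allow a split across at
 * most two nodes, else refuse. free_mb[] holds free memory per host
 * node. Returns the number of nodes chosen (1 or 2) and fills out[],
 * or 0 on failure. */
static int pick_nodes(const unsigned int free_mb[], int nr_nodes,
                      unsigned int need_mb, int out[2])
{
    int best = NUMA_NO_NODE, second = NUMA_NO_NODE;

    /* find the two host nodes with the most free memory */
    for (int i = 0; i < nr_nodes; i++) {
        if (best == NUMA_NO_NODE || free_mb[i] > free_mb[best]) {
            second = best;
            best = i;
        } else if (second == NUMA_NO_NODE || free_mb[i] > free_mb[second]) {
            second = i;
        }
    }
    if (best == NUMA_NO_NODE)
        return 0;
    if (free_mb[best] >= need_mb) {        /* a single node suffices */
        out[0] = best;
        return 1;
    }
    if (second != NUMA_NO_NODE &&
        free_mb[best] + free_mb[second] >= need_mb) {  /* 2-node split */
        out[0] = best;
        out[1] = second;
        return 2;
    }
    return 0;                              /* would need >2 nodes: refuse */
}
```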
> 3) In your patch02:
> + for (i = 0; i < numanodes; i++)
> + numainfo.guest_to_host_node[i] = i % 2;
> As you said in the mail "[PATCH 5/5]", at present it is "simply round
> robin until the code for automatic allocation is in place". I think
> "simply round robin" is not acceptable and we should implement a proper
> algorithm.
Right, but this depends on the one part I missed. The first part of this
is the xc_nodeload() function. I will try to provide
the missing part this week.
> 4) Your patches try to sort the host nodes using a node load
> evaluation algorithm, require the user to specify how many guest nodes
> the guest should see, and distribute guest vcpus equally among the
> guest nodes.
> I don't think the algorithm can be wise enough every time, and it's
> not flexible. Requiring the user to specify the number of guest nodes
> and distributing vcpus equally among them also doesn't sound wise or
> flexible enough.
Another possible extension. I had a draft with "node_cpus=[1,2,1]" to
put one vCPU in the first and third nodes and two vCPUs in the second
node, but I omitted it from the first "draft" release.
> Since guest numa needs vcpu pinning to work as expected, what about my
> thoughts below?
> a) ask the user to use "cpus" option to pin each vcpu to a physical cpu
> (or node);
> b) find out how many physical nodes (host nodes) are involved and use that
> number as the number of guest node;
> c) each guest node corresponds to a host node found out in step b) and use
> this info to fill the numainfo.guest_to_host_node in 3).
My idea is:
1) use xc_nodeload() to get a list of host nodes with the respective
amount of free memory
2) either use the user-provided number of guest nodes or determine the
number based on memory availability (=n)
3) select the <n> best nodes from the list (algorithm still to be
discussed, but a simple approach is sufficient for the first time)
4) populate numainfo.guest_to_host_node accordingly
5) pin vCPUs based on this array
This is basically the missing function (TM) I described earlier.
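Steps 1-4 above could be sketched roughly like this (map_guest_nodes is
a hypothetical stand-in for the missing function; the real
implementation would consume xc_nodeload() output and a smarter
selection criterion than free memory alone):

```c
#include <assert.h>

/* Hypothetical sketch: given per-host-node free memory (as
 * xc_nodeload() would report it), pick the nr_guest host nodes with
 * the most free memory and fill guest_to_host_node[]. A simple
 * selection sort is enough for the handful of nodes involved.
 * Assumes nr_guest <= nr_host <= 64. */
static void map_guest_nodes(const unsigned int free_mb[], int nr_host,
                            int nr_guest, int guest_to_host_node[])
{
    unsigned int mb[64];
    int idx[64];

    for (int i = 0; i < nr_host; i++) {
        mb[i] = free_mb[i];
        idx[i] = i;
    }
    for (int g = 0; g < nr_guest; g++) {
        int best = g;
        for (int i = g + 1; i < nr_host; i++)
            if (mb[i] > mb[best])
                best = i;
        /* swap the best remaining node into position g */
        unsigned int tmb = mb[g]; mb[g] = mb[best]; mb[best] = tmb;
        int tidx = idx[g]; idx[g] = idx[best]; idx[best] = tidx;
        guest_to_host_node[g] = idx[g];
    }
}
```

Step 5 would then pin each guest node's vCPUs to the host node recorded
in this array.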
> 5) I think we also need to present the numa guest with a virtual cpu
> topology, e.g., through the initial APIC ID. In current xen,
> apic_id = vcpu_id * 2; even with guest SRAT support and 2 guest nodes
> for a vcpus=n guest, the guest would still think it's on a package
> with n cores, without knowledge of the vcpu and cache topology, and
> this would harm guest performance.
> I think we can treat each guest node as a guest package and show the
> vcpu topology to the guest by giving it a proper APIC ID (consisting
> of guest SMT_ID/Core_ID/Package_ID).
> This needs changes to the hvmloader SRAT/MADT APIC ID fields and xen's
> cpuid/vlapic emulation.
The APIC ID scenario does not work on AMD CPUs, which don't have a
bit-field based association between compute units and APIC IDs. For
NUMA purposes SRAT should be sufficient, as it overrides APIC-based
decisions. But you are right that it needs more CPUID / ACPI tweaking
to get the topology right, although this should be addressed in a
separate patch. Currently(?) it is very cumbersome to inject a specific
"cores per socket" number into Xen (by tweaking those ugly CPUID bit
masks). For QEMU/KVM I introduced an easy config scheme
(smp=8,cores=2,threads=2) to allow this (purely CPUID based). If only I
had the time, I would do this for Xen, too.
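For illustration, the bit-field APIC ID composition from point 5 could
be sketched like this (Intel-style power-of-two field widths; as noted
above, AMD parts don't follow this scheme, and both function names here
are made up):

```c
#include <assert.h>

/* width(n): number of bits needed to enumerate n values,
 * rounded up to the next power of two as the APIC ID scheme requires. */
static unsigned int width(unsigned int n)
{
    unsigned int w = 0;
    while ((1u << w) < n)
        w++;
    return w;
}

/* Compose an initial APIC ID from Package/Core/SMT IDs, packing each
 * into its own bit field: [pkg | core | smt]. */
static unsigned int apic_id(unsigned int pkg, unsigned int core,
                            unsigned int smt,
                            unsigned int cores_per_pkg,
                            unsigned int threads_per_core)
{
    unsigned int smt_bits  = width(threads_per_core);
    unsigned int core_bits = width(cores_per_pkg);

    return (pkg << (core_bits + smt_bits)) | (core << smt_bits) | smt;
}
```

Treating each guest node as a package, pkg would simply be the guest
node number of the vCPU.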
> 6) HVM vcpu hot add/remove functionality was added to xen recently.
> The guest numa support should take this into consideration.
Are you volunteering? ;-)
> 7) I don't see live migration support in your patches. It looks hard
> for an hvm numa guest to do live migration, as the src/dest hosts
> could have very different HW configurations.
I don't think this is a problem. We need to separate guest-specific
options (like the VCPUs-to-guest-nodes or guest-memory-to-guest-nodes
mapping) from host-specific parts (guest nodes to host nodes). I
haven't tested it yet, but I assume the config file options specifying
the guest-specific parts are already transferred today, so the new
guest comes up with the proper guest configuration. The guest-node to
host-node association is determined dynamically by the new host,
depending on that host's current resources. This can turn out to be
sub-optimal: migrating a "4 guest nodes on 4 host nodes" guest to a
dual-node host would currently map to a 0-1-0-1 setup, where two guest
nodes are assigned the same host node. I don't see much of an issue
with that, though.
Thanks for your thoughts and looking forward to future collaboration.