This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
Home Products Support Community News


[Xen-devel] [PATCH] tools: avoid over-commitment if numa=on

To: Keir Fraser <keir.fraser@xxxxxxxxxxxxx>
Subject: [Xen-devel] [PATCH] tools: avoid over-commitment if numa=on
From: Andre Przywara <andre.przywara@xxxxxxx>
Date: Mon, 30 Nov 2009 16:40:48 +0100
Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxx, Papagiannis Anastasios <apapag@xxxxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxxxx>
Delivery-date: Mon, 30 Nov 2009 07:42:06 -0800
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <4AF84100020000780001E8CC@xxxxxxxxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <bd4f4a54-5269-42d8-b16d-cbdfaeeba361@default> <4AF82F12.6040400@xxxxxxx> <4AF84100020000780001E8CC@xxxxxxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Thunderbird (X11/20090329)
Jan Beulich wrote:
Andre Przywara <andre.przywara@xxxxxxx> 09.11.09 16:02 >>>
BTW: Shouldn't we set finally numa=on as the default value?

I'd say no, at least until the default confinement of a guest to a single
node gets fixed to properly deal with guests having more vCPU-s than
a node's worth of pCPU-s (i.e. I take it for granted that the benefits of
not overcommitting CPUs outweigh the drawbacks of cross-node memory
accesses at the very least for CPU-bound workloads).
That sounds reasonable.
Attached a patch to lift the restriction of one node per guest if the number of VCPUs is greater than the number of cores / node. This isn't optimal (the best way would be to inform the guest about it, but this is another patchset ;-), but should solve the above concerns.

Please apply,

Signed-off-by: Andre Przywara <andre.przywara@xxxxxxx>

Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448 3567 12
----to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Andrew Bowd; Thomas M. McCoy; Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
# HG changeset patch
# User Andre Przywara <andre.przywara@xxxxxxx>
# Date 1259594006 -3600
# Node ID bdf4109edffbcc0cbac605a19d2fd7a7459f1117
# Parent  abc6183f486e66b5721dbf0313ee0d3460613a99
allocate enough NUMA nodes for all VCPUs

If numa=on, we constrain a guest to one node to keep it's memory
accesses local. This will hurt performance if the number of VCPUs
is greater than the number of cores per node. We detect this case
now and allocate further NUMA nodes to allow all VCPUs to run

Signed-off-by: Andre Przywara <andre.przywara@xxxxxxx>

diff -r abc6183f486e -r bdf4109edffb tools/python/xen/xend/XendDomainInfo.py
--- a/tools/python/xen/xend/XendDomainInfo.py   Mon Nov 30 10:58:23 2009 +0000
+++ b/tools/python/xen/xend/XendDomainInfo.py   Mon Nov 30 16:13:26 2009 +0100
@@ -2637,8 +2637,7 @@
                         nodeload[i] = int(nodeload[i] * 16 / 
                         nodeload[i] = sys.maxint
-                index = nodeload.index( min(nodeload) )    
-                return index
+                return map(lambda x: x[0], sorted(enumerate(nodeload), 
key=lambda x:x[1]))
             info = xc.physinfo()
             if info['nr_nodes'] > 1:
@@ -2648,8 +2647,15 @@
                 for i in range(0, info['nr_nodes']):
                     if node_memory_list[i] >= needmem and 
len(info['node_to_cpu'][i]) > 0:
-                index = find_relaxed_node(candidate_node_list)
-                cpumask = info['node_to_cpu'][index]
+                best_node = find_relaxed_node(candidate_node_list)[0]
+                cpumask = info['node_to_cpu'][best_node]
+                cores_per_node = info['nr_cpus'] / info['nr_nodes']
+                nodes_required = (self.info['VCPUs_max'] + cores_per_node - 1) 
/ cores_per_node
+                if nodes_required > 1:
+                    log.debug("allocating %d NUMA nodes", nodes_required)
+                    best_nodes = find_relaxed_node(filter(lambda x: x != 
best_node, range(0,info['nr_nodes'])))
+                    for i in best_nodes[:nodes_required - 1]:
+                        cpumask = cpumask + info['node_to_cpu'][i]
                 for v in range(0, self.info['VCPUs_max']):
                     xc.vcpu_setaffinity(self.domid, v, cpumask)
         return index
Xen-devel mailing list
<Prev in Thread] Current Thread [Next in Thread>