This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


RE: [Xen-devel] RE: Live migration fails due to c/s 20627

To: "Zhang, Xiantao" <xiantao.zhang@xxxxxxxxx>, "Xu, Dongxiao" <dongxiao.xu@xxxxxxxxx>, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>
Subject: RE: [Xen-devel] RE: Live migration fails due to c/s 20627
From: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>
Date: Wed, 16 Dec 2009 08:23:57 -0800 (PST)
Cc: kurt.hackel@xxxxxxxxxx, Jeremy Fitzhardinge <jeremy@xxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxx, "Dugger, Donald D" <donald.d.dugger@xxxxxxxxx>, "Nakajima, Jun" <jun.nakajima@xxxxxxxxx>
Delivery-date: Wed, 16 Dec 2009 08:24:54 -0800
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <EB8593BCECAB3D40A8248BE0B6400A382E9BF4BD@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Since this discussion seems to be going in circles, I suspect
we may have some fundamentally different assumptions.  You
likely have some unstated ideas, maybe about the underlying
implementation of the Linux NUMA syscalls when running on
Xen, or maybe defaults for how NUMA-ness might be specified
when creating an HVM domain.

But all of these are mostly unrelated to rdtscp.  The only
reason that this discussion has involved NUMA concepts is
that the rdtscp instruction, by accident rather than by
design, may on some (but not all) guest OSes communicate the
guest OS's notion of cpu and node information to an application.
As Jeremy has pointed out, this cpu/node information is exactly
the same information that can be obtained by a system call.
So the only reason that rdtscp is better than using the
system call would be for performance.

Rdtscp is faster than a system call in many situations, but
it is now often emulated in Xen (even on processors that do
support the hardware instruction*), so it cannot be assumed to
be much faster than a system call.  And the difference in
performance is measurable only if an app executes rdtscp many
thousands of times every second.

Are there apps that execute rdtscp many thousands of times
every second PRIMARILY TO OBTAIN the cpu/node information?
If so, I agree that it is unfortunately necessary to expose
the rdtscp instruction.  If not, I would highly recommend
we do NOT expose it now.  Otherwise, to use Keir's words,
we are "Supporting CPU instructions just because they're there
[which] is not a useful effort."

Once rdtscp/TSC_AUX is exposed to guests, it is very hard
to remove it again (as saved guests may have tested
the cpuid bit once at startup and will fail if restored).

Other brief NUMA-related replies below.

* See xen-unstable.hg/docs/misc/tscmode.txt for explanation

> From: Zhang, Xiantao [mailto:xiantao.zhang@xxxxxxxxx]
> Dan Magenheimer wrote:
> >>> .  And, as I've said before,
> >>> the node/cpu info provided by Linux in TSC_AUX is
> >>> wrong anyway (except in very constrained environments
> >>> such as where the admin has pinned vcpus to pcpus).
> >> 
> >> I don't agree with you at this point. For guest numa support,
> >> it should be a must to pin virtual node's vcpus to its
> >> related physical node and crossing-node vcpu migration should
> >> be disallowed by default, otherwise guest numa support is
> >> meaningless, right ?
> > 
> > It's not a must.  A system administrator should always
> > have the option of choosing flexibility vs performance.
> > I agree that when performance is higher priority, pinning
> > is a must, but pinning may even have issues when the
> > guest's nvcpus exceeds the number of cores in a node. 
> Could you elaborate on the issues you can see?  Normally, a 
> virtual node's number of vcpus should be less than one 
> physical node's cpu count.  But even if the vcpu count is more 
> than the physical cpu count in a node, why would it introduce issues? 

Suppose a guest believes it has eight cores on a single
processor/node. It is now started on a machine that has
four cores per processor/node (and two or more sockets).
Since the guest believes it is running on a single node,
it communicates that information (via TSC_AUX or vgetcpu)
to an application. The application is NUMA-aware, but since
the guest OS told it that all cores are on the same node,
it doesn't use its NUMA code/mode.

Suppose a guest believes it has a total of four cores,
two cores on each of two nodes.  It is now started on
some future machine with 16 cores all on a single
node. Since the guest believes it is running on two
nodes, it communicates that information (via TSC_AUX
or vgetcpu) to an application.  The application is
NUMA-aware, and the guest OS told it that there are
two nodes.  This app has very high memory bandwidth
needs, so it spends lots of time doing NUMA-related
syscalls such as Linux move_pages to ensure that the
memory is on the same node as the cpu.  All of these
move calls are wasted.

Both of these situations are very possible in a cloud environment.

(NOTE: Since this NUMA-related discussion is orthogonal
to rdtscp, we should probably start a separate thread
for further discussion.)

If the above discussion doesn't clarify my concerns and
I haven't answered other questions in your email, please
let me know.
