This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


[Xen-devel] RE: TSC scaling and softtsc reprise, and PROPOSAL

To: Keir Fraser <keir.fraser@xxxxxxxxxxxxx>, Ian Pratt <Ian.Pratt@xxxxxxxxxxxxx>, "Zhang, Xiantao" <xiantao.zhang@xxxxxxxxx>, "Xen-Devel (E-mail)" <xen-devel@xxxxxxxxxxxxxxxxxxx>
Subject: [Xen-devel] RE: TSC scaling and softtsc reprise, and PROPOSAL
From: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>
Date: Thu, 23 Jul 2009 09:39:38 -0700 (PDT)
Cc: "Dong, Eddie" <eddie.dong@xxxxxxxxx>, John Levon <levon@xxxxxxxxxxxxxxxxx>
Delivery-date: Thu, 23 Jul 2009 09:41:09 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <C68E406A.1044D%keir.fraser@xxxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
> >> I've informally heard that certain version of the JVM and
> >> Oracle Db have a habit of pounding rdtsc hard from user
> >> space, but I don't know what rates.
> > 
> > Indeed they do and they use it for timestamping
> > events/transactions, so these are the very same
> > apps that need to guarantee SMP timestamp ordering.
> Why would you expect host TSC consistency running on Xen to 
> be worse than
> when running on a native OS?

In short, it is because a new class of machine
is emerging in the virtualization space: one
that is really a NUMA machine, and tries to
look like an SMP (non-NUMA) machine by making
memory access fast enough that the NUMA-ness
can be ignored, but that, for the purposes
of time, is still a NUMA machine.

Let's consider three physical platforms:

SMALL = single socket (multi-core)
MEDIUM = multiple sockets, same motherboard
LARGE = multiple sockets, multiple motherboards

LARGE machines are becoming more widely
available (e.g. the HP DL785) because multiple
motherboards are very convenient for field
upgradeability (which has a major impact on
support costs).  They also make a very nice
consolidation target for virtualizing a bunch
of SMALL machines.  However, SMALL and MEDIUM
are much less expensive, so much more
prevalent (especially as development machines!).

On SMALL, TSC is always consistent between cores
(at least on all but the first dual-core processors).

On MEDIUM, some claim that TSC is always
consistent between cores on different sockets
because the sockets share a motherboard
crystal.  I don't know if this is true; if it
is, MEDIUM can be considered the same as
SMALL, and if not, the same as LARGE.  So
ignore MEDIUM as a subcase of one of the others.
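Whether successive TSC reads actually stay ordered can be probed empirically. A minimal sketch of the raw read involved, assuming x86 with GCC/Clang (the helper names are mine, and a non-x86 fallback is included only so the sketch compiles anywhere):

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* __rdtsc() */
/* Raw TSC read, as an OS or application would issue it. */
static inline uint64_t read_tsc(void)
{
    return __rdtsc();
}
#else
/* Fallback for non-x86 builds of this sketch only. */
#include <time.h>
static inline uint64_t read_tsc(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* On a single core, successive reads must be non-decreasing.
 * The question raised above is whether the same holds when
 * consecutive reads land on cores in different sockets (MEDIUM)
 * or on different motherboards (LARGE); on SMALL it does. */
static int tsc_nondecreasing_here(void)
{
    uint64_t a = read_tsc();
    uint64_t b = read_tsc();
    return b >= a;
}
```

A real consistency check would pin threads to specific cores and compare reads across them, which this sketch does not attempt.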

On LARGE, the motherboards are connected by
HT or QPI, but neither has any form of clock
synchronization.  So, from a clock perspective,
LARGE needs to be "partitioned"; OR there has
to be sophisticated system software that does
its best to synchronize TSC across all of
the cores (which enterprise OS's like HP-UX
and AIX have, Linux is working on, and Xen
has... though it remains to be seen if any
of these work "good enough"); OR TSC has to
be abandoned altogether by all software that
relies on it (OR TSC needs to be emulated).

This problem on LARGE machines is obscure enough
that software is developed (on SMALL machines)
that has a hidden timebomb if TSC is not perfectly
consistent. Admittedly, all such software should
have a switch that abandons TSC altogether in favor
of an OS "gettimeofday", but this either depends
on TSC as well or on a verrryyy sllloooowwww
platform timer that, if used frequently, probably
has a performance impact as bad as or worse than
emulating TSC.

So what is "good enough"?  If Xen's existing
algorithm works poorly on LARGE systems (or
even on older SMALL systems), applications
should abandon TSC.  If Xen's existing algorithm
works "well", then applications can and should
use TSC.  But unless "good enough" can be carefully
defined and agreed upon between Xen and the
applications AND Xen can communicate "YES
this platform is good enough or NOT" to any
software that cares, we are caught in a gray
area.  Unfortunately, neither is true:  "good
enough" is not defined, AND there is no clean
way to communicate it even if it were.

And living in the gray area means some very
infrequent, very bizarre bugs can arise because
sometimes, unbeknownst to that application,
rarely and irreproducibly, time will appear to
go backwards.  And if timestamps are used,
for example, to replay transactions, data
corruption occurs.
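Concretely, the failure mode is an ordering inversion: if the TSC of the CPU stamping the second event lags by more than the real gap between the two events, the later event gets the smaller timestamp. A toy model of this (all names and numbers are mine, for illustration only):

```c
#include <stdint.h>

/* Model: the stamp an application records is the true time of the
 * event plus the fixed TSC offset of the CPU that stamped it. */
static uint64_t stamp(uint64_t true_time, int64_t cpu_tsc_offset)
{
    return true_time + (uint64_t)cpu_tsc_offset;
}

/* Returns 1 if the stamps order the two events the same way the
 * true times do -- i.e. if replaying transactions by timestamp
 * would reconstruct them in the order they actually happened. */
static int stamps_preserve_order(uint64_t ta, int64_t offa,
                                 uint64_t tb, int64_t offb)
{
    return (ta < tb) == (stamp(ta, offa) < stamp(tb, offb));
}
```

With both offsets zero (the SMALL case) ordering is always preserved; with a 100-cycle lag on the second CPU and events only 50 cycles apart, replay order inverts.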

So the choices are:
1) Ignore the problem and hope it never happens (or
   if it does that Xen doesn't get blamed)
2) Tell all Xen users that TSC should not be used
   as a timestamp.  (In other words, fix your apps
   or always turn on the app's TSC-is-bad option when
   running virtualized on a "bad" physical machine.)
3) Always emulate TSC and let the heavy TSC users
   pay the performance cost.
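For reference, choice 3 is possible because the hypervisor can force RDTSC to trap (CR4.TSD for PV guests, the RDTSC-exiting VMX control for HVM) and return a synthesized value. The arithmetic such a handler performs, in outline (the names and structure are mine, not Xen's actual code):

```c
#include <stdint.h>

/* Per-VCPU state a hypervisor might keep for a software TSC: a host
 * TSC value captured at some epoch, and the guest TSC value that
 * should correspond to that instant. */
typedef struct {
    uint64_t host_tsc_at_epoch;
    uint64_t guest_tsc_at_epoch;
} vtsc_state;

/* What the trap handler computes on each emulated RDTSC: elapsed
 * host cycles since the epoch, rebased onto the guest's timeline.
 * Because one coherent host clock is consulted on every trap, the
 * result is consistent across all VCPUs -- that consistency is
 * what the trap overhead buys. */
static uint64_t emulated_rdtsc(const vtsc_state *v, uint64_t host_tsc_now)
{
    return v->guest_tsc_at_epoch + (host_tsc_now - v->host_tsc_at_epoch);
}
```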

Last, as Intel has pointed out, a related kind of
issue occurs when live migration moves a running
VM from a machine with one TSC rate to another
machine with a different TSC rate (or if the TSC
rate varies on the same machine, e.g. for
power-saving reasons).  It would be nice if our
choice (above) solves this problem too.
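Handling a rate change would then be a fixed-point multiply layered on the same scheme: elapsed host cycles are scaled by guest_hz/host_hz before being added to the guest's timeline. A sketch of the common 32.32 fixed-point idiom (variants of which appear in both Xen and Linux; the function names are mine):

```c
#include <stdint.h>

/* Precompute the 32.32 fixed-point ratio guest_hz/host_hz once,
 * e.g. after a migration.  Assumes guest_hz < 2^32 so the shift
 * cannot overflow. */
static uint64_t make_mult(uint64_t guest_hz, uint64_t host_hz)
{
    return (guest_hz << 32) / host_hz;
}

/* Scale elapsed host cycles to guest cycles.  A 128-bit
 * intermediate keeps large deltas from overflowing. */
static uint64_t scale_delta(uint64_t host_delta, uint64_t mult)
{
    return (uint64_t)(((__uint128_t)host_delta * mult) >> 32);
}
```

For example, a guest that started on a 2 GHz host and migrates to a 3 GHz host would have its elapsed host cycles multiplied by make_mult(2 GHz, 3 GHz) so that its apparent TSC rate does not change.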
