[Xen-devel] rdtsc: correctness vs performance on Xen (and KVM?)

(Sorry for the repost.  The original was accidentally
posted as a reply to an existing thread and so,
despite having a different subject field, is threaded
with that thread rather than starting a new one.
Please reply to this thread rather than to the
previous post on the old thread.)

To summarize:

Xen and KVM currently allow rdtsc to be executed
directly by userland.  As a result, apps that
use rdtsc smartly and effectively on (some) physical
machines may break badly in Xen or KVM because of
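the disassociation of physical and virtual cpus:
a virtual cpu may be rescheduled onto a different
physical cpu, or migrated to a different host,
whose TSC is not synchronized with the previous one.
(Readers not familiar with why rdtsc is a problem
can read e.g. http://en.wikipedia.org/wiki/Rdtsc)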

VMware always emulates rdtsc, for both kernel and
userland.  (I don't know what HyperV does.)

Xen currently has a boot option to always emulate
rdtsc in HVM guests, and code was just added so that
the same boot option also emulates userland-only
rdtsc in PVM guests.  There is some agreement
in the Xen community that rdtsc emulation should
be the default, though the default is currently
off.  KVM is having a similar discussion and, I'm
told, has also come to the conclusion that emulating
rdtsc is a necessary evil.

The problem is that emulating rdtsc is slow.  On
my dual-core Conroe, rdtsc is about 72 cycles and
emulating rdtsc (returning a fixed frequency 1GHz
Xen monotonic system time) is over 15x slower.
This is a big hit for apps that do tens to hundreds
of thousands of rdtsc's per processor per second.
(And yes these apps are more common than one
might think.)

VMware has the advantage of binary translation;
rdtsc can be translated to return a "conforming"
value in ~200 cycles (measured on an older processor,
so probably faster on hardware comparable to my
dual-core Conroe above).  The returned value
is "stale" (not linear with wallclock time).
For VMs that need rdtsc to more accurately reflect
wallclock time, full emulation can be optionally
enabled for a VM.

I'm searching for alternatives that provide the
correctness of emulation, but better performance
than emulation.  Jeremy points out that the
pvclock mechanism in upstream Linux works well,
but the pvclock data is currently only exposed
to the kernel, and exposing it to userland would
still require apps-using-rdtsc to be rewritten.
But Jeremy claims that all apps-that-use-rdtsc
MUST be rewritten because using rdtsc is unsafe,
and that they should be rewritten to use
gettimeofday (or actually vgettimeofday).
But on older OSes (including the vast majority
of installed units) and on machines where tsc is
"unsafe", gettimeofday can be MUCH slower than
emulating rdtsc.  So telling app writers to
convert all uses of rdtsc to gettimeofday is
not an acceptable solution for these apps in
the short term.

My current thinking is that we (the Linux, Xen,
and KVM communities) should architect a
userland API using the pvclock mechanism.
The underlying implementation of this API would
use Linux only to "register" the mechanism,
preferably via a module so that it, like
disk and network frontends, could easily be
bolted on to shipping OSes.  Individual uses
of "pvclock_read" would need no syscall; like
the kernel pvclock mechanism, they need only
access memory to get the necessary scaling
and offset data.  Once the mechanism is registered,
rdtsc is executed directly by the app as part of
the pvclock protocol.  If it is never registered,
rdtsc would always be trapped and emulated.

I realize this idea is half-baked, but would like
to invite other TSC/time experts to determine
if some or all of the idea might be used to
achieve a fully-baked solution.
