[Xen-devel] RE: rdtsc: correctness vs performance on Xen (and KVM?)

> My current thinking is that we (the Linux and
> Xen and KVM community) should architect a
> userland API using the pvclock mechanism.

OK, here's a slightly refined proposal.  To
reiterate, the problem is that Xen's current
mechanism for handling the rdtsc instruction
may silently provide incorrect results while
alternative mechanisms are too slow (vs VMware
which is both fast and correct).  My goal is to
provide a paravirtualized tsc mechanism for apps
running on Xen that is reliably correct,
is not dependent on a particular OS or
processor family, is approximately as fast
as rdtsc (or at least much faster than emulated
rdtsc), provides adequate (e.g. nanosecond)
resolution, does not require recompilation to
work both on Xen and bare metal, and works properly
across: vcpu-to-pcpu rescheduling even on NUMA
machines; system sleep/hibernation; and 
save/restore/migration between machines with
dissimilar clock rates.  Implementation requires
changes in Xen and "the app" but no OS changes,
which keeps it viable on legacy OS's
and possibly(?) HVM domains.  Note that
only apps that sample time more than roughly
5-100K times per core per second would use this;
for other apps, rdtsc emulation overhead
is probably negligible (<0.2%).
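
As a rough sanity check on that threshold, here is a
small C sketch of the arithmetic.  It uses the numbers
from the quoted message below (rdtsc at about 72 cycles,
emulation about 15x slower).  The 2.4GHz core clock and
the function name are assumptions made only for
illustration, and since emulation was measured as "over
15x" slower, the results are closer to a lower bound.

```c
/*
 * Fraction of one core's cycles spent on the EXTRA cost of emulating
 * rdtsc, compared with executing it natively.
 *
 * samples_per_sec: rdtsc calls per core per second
 * native_cycles:   cost of a native rdtsc (about 72 in the quoted message)
 * slowdown:        emulated cost / native cost (about 15 in the quoted message)
 * core_hz:         core clock rate (ASSUMED, not given in the thread)
 */
double rdtsc_emulation_overhead(double samples_per_sec, double native_cycles,
                                double slowdown, double core_hz)
{
    double extra_cycles = native_cycles * (slowdown - 1.0);
    return samples_per_sec * extra_cycles / core_hz;
}
```

With an assumed 2.4GHz core, 5K samples/second comes to
about 0.0021 (0.21%) and 100K samples/second to about
0.042 (4.2%), which is roughly where emulation stops
being negligible.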

0)  Xen implements rdtsc emulation by default
1)  Guest OS is launched with pvtsc=1 in vm.cfg
2)  App running on guest OS sets up a SIGILL handler
3)  App executes a special rdmsr instruction or
    hypercall.
4a) If SIGILL results, not running on Xen at all,
    or on old Xen; app uses rdtsc at own risk. Done.
4b) Else, rdmsr/hypercall returns virtual address of
    special pvclock page ("pvclock_va").
5)  App executes another special rdmsr instruction/
    hypercall to disable rdtsc emulation.  This
    affects ALL execution for all processes in this VM.
6)  Xen maintains mapping of pvclock_va to a
    different physical page for each processor
    and transparently handles TLB misses for
    pvclock_va.
7)  App uses (unemulated) rdtsc and applies
    pvclock algorithm (using values in memory
    at pvclock_va) resulting in pvtsc, which
    is nanoseconds since VM start.  App can
    further apply local algorithms to enforce
    monotonicity or frequency scaling as desired.
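
Step 7 might look roughly like the C sketch below.  This
is an illustration, not a finished implementation.  The
record layout is modeled on the per-vcpu time record that
Xen already shares with guest kernels; that the page at
pvclock_va would hold exactly this layout is an assumption.
The names pvtsc_read, pvtsc_monotonic and fake_tsc are
hypothetical.  fake_tsc stands in for a real rdtsc so the
sketch runs anywhere.

```c
#include <stdint.h>
#include <stdatomic.h>

/* ASSUMED layout of the record found at pvclock_va. */
struct pvclock_vcpu_time_info {
    uint32_t version;            /* odd while the hypervisor is updating */
    uint32_t pad0;
    uint64_t tsc_timestamp;      /* TSC value at the last update */
    uint64_t system_time;        /* nanoseconds at the last update */
    uint32_t tsc_to_system_mul;  /* 32.32 fixed-point multiplier */
    int8_t   tsc_shift;          /* shift applied to the TSC delta first */
    uint8_t  pad[3];
};

/* Scale a TSC delta to ns: shift, then multiply by mul / 2^32. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift >= 0)
        delta <<= shift;
    else
        delta >>= -shift;
    /* Split the 64x32 multiply so no 128-bit type is needed. */
    uint64_t hi = (delta >> 32) * mul;
    uint64_t lo = (delta & 0xffffffffu) * mul;
    return hi + (lo >> 32);
}

/* Read pvtsc, retrying if the record changed while we were reading it. */
static uint64_t pvtsc_read(const volatile struct pvclock_vcpu_time_info *info,
                           uint64_t (*read_tsc)(void))
{
    uint32_t version;
    uint64_t ns;

    do {
        version = info->version;
        atomic_thread_fence(memory_order_acquire);
        uint64_t delta = read_tsc() - info->tsc_timestamp;
        ns = info->system_time +
             scale_delta(delta, info->tsc_to_system_mul, info->tsc_shift);
        atomic_thread_fence(memory_order_acquire);
    } while ((version & 1) || version != info->version);

    return ns;
}

/* Optional local policy: never return a value below the previous one. */
static uint64_t pvtsc_monotonic(uint64_t ns, uint64_t *last)
{
    if (ns < *last)
        ns = *last;
    else
        *last = ns;
    return ns;
}

/* Stand-in TSC source for illustration; a real app would execute rdtsc. */
static uint64_t fake_tsc_value;
static uint64_t fake_tsc(void) { return fake_tsc_value; }
```

The version check is what lets the app read the page
without a syscall or lock: if Xen rewrites the record
(for example after a migration changes the clock rate),
the reader sees the version change and retries.  The
monotonic clamp keeps its state in a caller-supplied
variable, so as written it only orders readings that
share that variable.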

Comments appreciated.  I realize that this is hacky
and ugly... better alternatives gladly solicited.

Thanks,
Dan

P.S. While it would be nice if we could just tell
apps to use a fast vgettimeofday equivalent, this
does not exist today and, even if it did, would not
be widely available for years in the kernel running under
most enterprise app deployments (and, even then,
only on 64-bit Linux.)

> -----Original Message-----
> From: Dan Magenheimer 
> Sent: Friday, August 28, 2009 11:50 AM
> To: Xen-Devel (E-mail)
> Cc: Jeremy Fitzhardinge; Keir Fraser; Alan Cox
> Subject: rdtsc: correctness vs performance on Xen (and KVM?)
> 
> 
> To summarize:
> 
> Xen and KVM currently allow rdtsc to be executed
> directly by userland.  As a result, apps that
> use rdtsc smartly and effectively on (some) physical
> machines may break badly in Xen or KVM because of
> the disassociation of physical and virtual cpus.
> (Readers not familiar with why rdtsc is a problem
> can read e.g. http://en.wikipedia.org/wiki/Rdtsc)
> 
> VMware always emulates rdtsc, both for kernel and
> userland rdtsc's. (I don't know what HyperV does.)
> 
> Xen currently has a boot option to always emulate
> rdtsc in HVM guests and just added code such that
> the same boot option will always emulate rdtsc for
> userland-only in PVM guests.  There is some agreement
> in the Xen community that rdtsc emulation should
> always be the default though the default is currently
> off.  KVM is having a similar discussion and, I'm
> told, has also come to the conclusion that emulating
> rdtsc is a necessary evil.
> 
> The problem is that emulating rdtsc is slow.  On
> my dual-core Conroe, rdtsc is about 72 cycles and
> emulating rdtsc (returning a fixed frequency 1GHz
> Xen monotonic system time) is over 15x slower.
> This is a big hit for apps that do tens to hundreds
> of thousands of rdtsc's per processor per second.
> (And yes these apps are more common than one
> might think.)
> 
> VMware has the advantage of binary translation;
> rdtsc can be translated to return a "conforming"
> value in ~200 cycles (on an older processor so
> probably faster if you are comparing against my
> dual-core Conroe numbers above).  This value
> is "stale" (not linear with wallclock time).
> For VMs that need rdtsc to more accurately reflect
> wallclock time, full emulation can be optionally
> enabled for a VM.
> 
> I'm searching for alternatives that provide the
> correctness of emulation, but better performance
> than emulation.  Jeremy points out that the
> pvclock mechanism in upstream Linux works well,
> but the pvclock data is currently only exposed
> to kernel... and exposing it to userland still
> requires apps-using-rdtsc to be rewritten.
> But Jeremy claims that all apps-that-use-rdtsc
> MUST be rewritten because using rdtsc is unsafe,
> and that they should be rewritten to use
> gettimeofday (or actually vgettimeofday).
> But on older OS's (including the vast majority
> of installed units) and machines where tsc is
> "unsafe", gettimeofday can be MUCH slower than
> emulating rdtsc.  So telling app writers to
> convert all uses of rdtsc to gettimeofday is
> not an acceptable solution for these apps in
> the short term.
> 
> My current thinking is that we (the Linux and
> Xen and KVM community) should architect a
> userland API using the pvclock mechanism.
> The underlying implementation of this API would
> utilize Linux only to "register" the mechanism,
> preferably via a module so that it, like
> disk and network frontends, could easily be
> bolted on to shipping OS's.  Individual uses
> of "pvclock_read" would need no syscall... like
> the kernel pvclock mechanism, they need only
> access memory to get the necessary scaling
> and offset data.  Once instantiated, rdtsc
> is executed directly by the app as part of the
> pvclock protocol.  If never registered,
> rdtsc would always be trapped and emulated.
> 
> I realize this idea is half-baked, but would like
> to invite other TSC/time experts to determine
> if some or all of the idea might be used to
> achieve a fully-baked solution.
