On 08/28/09 10:49, Dan Magenheimer wrote:
>> Apps are free to try and use the tsc in any way they
>> feel like, but it has never had any
>> GUARANTEED [djm's emphasis] properties.
> I think this is the key difference of opinion which
> must be resolved. If what you say is true, your
> other positions make sense. If it is false,
> they make much less sense. (And unfortunately
> it is not a black and white issue.)
> There ARE guaranteed properties specified by
> the Intel SDM for any _single_ processor,
> namely that rdtsc is "guaranteed to return
> a monotonically increasing unique value whenever
> executed, except for 64-bit counter wraparound.
> Intel guarantees that the time-stamp counter
> will not wrap-around within 10 years after being
> reset." Both uses of the word "guarantee"
> are quoted from the Intel SDM.
Yes, but those are fairly weak guarantees. They do not guarantee that
the tsc won't change rate arbitrarily, or stop outright, between reads.
> What is NOT guaranteed, but is widely and
> incorrectly assumed to be implied and has
> gotten us into this mess, is that
> the same properties apply across multiple processors.
Yes, Linux offers even weaker guarantees than Intel. Aside from the
processor migration issue, the tsc can jump arbitrarily as a result of
suspend/resume (ie, it can be non-monotonic).
> And there are notable examples
> of systems where the properties do NOT apply.
> So it is true that an app that
> does not know conclusively that certain threads
> are running on certain processors cannot
> always safely use rdtsc to obtain the
> single-processor-guaranteed results.
> BUT some software systems (including VMware) do
> provide this guarantee across multiple processors.
> And recent families of both Intel and AMD
> multi-core have advanced to the point where
> the properties apply across all cores, so
> on the vast majority (but admittedly not all)
> of future physical systems, apps can and will
> use rdtsc and expect the properties to apply
> (whether guaranteed or not).
Even very recent processors with "constant" tscs (ie, they don't change
rate with the core frequency) stop in certain power states. Any
motherboard design which runs packages in different clock-domains will
lose tsc-sync between those packages, regardless of what's in the packages.
The "sane tsc" properties are primarily for the benefit of kernels, to
allow them to make better use of the tsc. They will have enough
knowledge of the overall system architecture to know how and when the
tsc can be trusted. Usermode apps can try to piggyback onto this if
they like, but they're in much more treacherous territory. They can
never know what the underlying system design is, or whether it's really
safe to trust the tsc's sanity. And without some explicit guarantees on
Linux's part, the tsc will still be non-monotonic over suspend/resume
(in all its many forms).
> So in your opinion, some systems are broken
> so Xen should assume all future systems are
> broken. In my opinion, the problem is being
> fixed in hardware and has always been fixed
> in VMware, so Xen should look to the future
> not the past.
> Does that sound like a good summary of this disagreement?
You are talking about three different cases:
1. the reliability of the tsc in a PV guest in kernel mode
2. the reliability of the tsc in a PV guest in user mode
3. the reliability of the tsc in an HVM guest
I don't think 1. needs any attention. The current scheme works fine.
The only option for 3 is to make a best-effort attempt at tsc quality,
which ranges from trapping every rdtsc so that they all return globally
monotonic results, to using the VT/SVM features to apply an offset
to the raw tsc to derive a guest tsc, etc. Either way the situation isn't
much different from running native (ie, apps will see basically the same
tsc behaviour as in the native case, to some degree of approximation).
So, there's case 2: pv usermode. There are four classes of apps worth
considering:
1. Old apps which make unwarranted assumptions about the behaviour of
the tsc. They assume they're basically running on some equivalent
of a P54, and so will get junk on any modernish system with SMP
and/or power management. If people are still using such apps, it
probably means their performance isn't critically dependent on the
tsc.
2. More sophisticated apps which know the tsc has some limitations
and try to mitigate them by filtering discontinuities, using
rdtscp, etc. They're best-effort, but they inherently lack enough
information to do a complete job (they have to guess at where
power transitions occurred, etc).
3. New apps which know about modern processor capabilities, and
attempt to rely on constant_tsc, forgoing all the best-effort
mitigations of class 2.
4. Apps which use gettimeofday() and/or clock_gettime() for all time
measurement. They're guaranteed to get consistent time results,
perhaps at the cost of a syscall. On systems which support it,
they'll get vsyscall implementations which avoid the syscall while
still using the best-possible clocksource. Even if they don't, a
syscall will still outperform an emulated rdtsc.
Class 1 apps are just broken. We can try to emulate a UP, no-PM
processor for them, and that's probably best done in an HVM domain.
There's no need to go to extraordinary efforts for them because the
native hardware certainly won't.
Class 2 apps will work as well as ever in a Xen PV domain as-is. If
they use rdtscp then they will be able to correlate the tsc to the
underlying pcpu and manage consistency that way. If they pin threads to
VCPUs, then they may also require VCPUs to be pinned to PCPUs. But
there's no need to make deep changes to Xen's tsc handling to
accommodate them.
Class 3 apps will get a bit of a rude surprise in a PV Xen domain. But
they're also new enough to use another mechanism to get time. They're
new enough to "know" that gettimeofday can be very efficient, and should
not be going down the rathole of using rdtsc directly. And unless
they're going to be restricted to a very narrow class of machines (for
example, not my relatively new Core2 laptop which stops the "constant"
tsc in deep sleep modes), they need to fall back to being a class 2 or 4
app anyway.
Class 4 apps are not well-served under Xen. I think the vsyscall
mechanism will be disabled and they'll always end up doing a real
syscall. However, I think it would be relatively easy to add a new
vgettimeofday implementation which directly uses the pvclock mechanism
from usermode (the same code would work equally well for Xen and KVM).
There's no need to add a new usermode ABI to get quick, high-quality
time in usermode. Performance-wise it would be more or less
indistinguishable from using a raw rdtsc, but it has the benefit of
getting full cooperation from the kernel and Xen, and can take into
account all tsc variations (if any).
So if you want to address these problems, it seems to me you'll get most
bang for the buck by fixing (v)gettimeofday to use pvclock, and
convincing app writers to trust in gettimeofday.
Xen-devel mailing list