FYI, I finally found a published source describing
the TSC Invariant bit in Nehalem. See 2.2.6 in:
http://www.intel.com/Assets/PDF/appnote/241618.pdf
"In the Core i7 AND FUTURE PROCESSOR GENERATIONS
[my emphasis] the TSC will continue to run in the
deepest C-states. Therefore, the TSC will run at
a constant rate in all ACPI P-, C-, and T-states.
Support for this feature is indicated by
CPUID.0x8000_0007.EDX[8]. On processors with
invariant TSC support, the OS may use the TSC
for wall clock timer services (instead of ACPI
or HPET timers). TSC reads are much more efficient
and do not incur the overhead associated with a
ring transition or access to a platform resource."
Linux upstream now does exactly that; if this
bit is set (on Intel processors), tsc is utilized
as the system clocksource and afaict there
is NO path that will test or revert this
decision.
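
For anyone who wants to look at the bit on their own box, here is a minimal sketch of the check, assuming GCC or Clang on x86 (it relies on __get_cpuid() from <cpuid.h>). The function and macro names are mine, for illustration only; this is not the code Linux or Xen actually uses.

```c
#include <stdint.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>              /* GCC/Clang helper header, x86 only */
#endif

#define CPUID_LEAF_ADV_POWER 0x80000007u   /* leaf named in the Intel note */
#define INVARIANT_TSC_BIT    (1u << 8)     /* EDX[8] */

/* Pure helper: does this EDX value from leaf 0x80000007 advertise
 * an invariant TSC? */
int edx_has_invariant_tsc(uint32_t edx)
{
    return (edx & INVARIANT_TSC_BIT) != 0;
}

/* Returns 1 if the bit is set, 0 if it is clear, and -1 if the leaf
 * is not available (or this is not an x86 build). */
int cpu_has_invariant_tsc(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid() returns 0 when the requested leaf is unsupported. */
    if (!__get_cpuid(CPUID_LEAF_ADV_POWER, &eax, &ebx, &ecx, &edx))
        return -1;
    return edx_has_invariant_tsc(edx);
#else
    return -1;
#endif
}
```

Note that this only reports what one package claims about itself; it says nothing about whether the TSCs on different sockets agree with each other.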
Admittedly, this doesn't guarantee that a multi-socket
platform obeys invariance, but apparently this
feature utilizes a crystal available externally
to the socket so it is easy to leverage in a
system design to ensure invariance across
multiple sockets, or even across multiple enclosures
that are all on a QPI link. So system designers
(other than perhaps for the very largest superNUMA
machines) would be silly to not use it.
So, I'd recommend that:
1) On (Intel, maybe later AMD) systems where this
bit is set, the mechanisms enabled by the
Xen consistent_tscs boot option should be enabled
automatically for Xen.
2) The time_calibration_tsc_rendezvous loop in
timer.c could/should be rewritten or removed
and certainly should NOT write_tsc().
Keir, I know you are very sensitive around
this code, so thought I'd check before messing
with it. Or feel free to do it yourself.
Thanks,
Dan
> -----Original Message-----
> From: Dan Magenheimer
> Sent: Friday, October 02, 2009 11:51 AM
> To: Xen-Devel (E-mail)
> Cc: Kurt Hackel; Ian Pratt; Keir Fraser
> Subject: [Xen-devel] [RFC] Correct/fast timestamping in apps under Xen
> [1 of 4]: Reliable TSC
>
>
> =============
> Premise 1: A large and growing percentage of servers
> running Xen have a "reliable" TSC and Xen can determine
> conclusively whether a server does or does not have a
> reliable TSC.
> =============
>
> The truth of this statement has been vociferously
> challenged in other threads, so I'd LOVE TO GET
> FEEDBACK OR CONFIRMATION FROM PROCESSOR AND SERVER
> VENDORS.
>
> The rest of this is long though hopefully educational,
> but if you have no interest in the rdtsc instruction
> or timestamping, please move on to [2 of 4].
>
> Since my overall premise is a bit vague, I need to
> first very clearly define my terms. And to define
> those terms clearly, I need to provide some more
> background. As far as I can find, there is no
> publication which clearly describes all of these
> concepts.
>
> The rdtsc instruction was at one time the easiest
> and cheapest and most precise method for "approximating
> the passage of time"; as such rdtsc was widely
> used by x86 performance practitioners and high-end
> apps that needed to provide extensive metrics. When
> commodity SMP x86 systems emerged, rdtsc fell into
> disfavor because: (a) it was difficult for
> different CPU packages to share a crystal or
> ensure different crystals were synchronized or
> increasing at precisely the same rate, and
> (b) SMP apps were oblivious to which CPU their
> thread(s) were running on so two rdtsc instructions
> in the same thread might execute on different
> CPUs and thus unwittingly use different crystals,
> resulting in strange things like the appearance that
> time went backwards (sometimes by a large amount)
> or events appearing to take different amounts of
> time depending on whether they were running on
> processor A or processor B. We will call this
> the "inconsistent TSC" problem.
>
> Processor and system vendors attempted to fix the
> inconsistent TSC problem by providing a new class
> of "platform timers" (e.g. HPET), but these proved
> to be slow and difficult to use, especially for
> apps that required frequent fine metrics.
>
> Processor and system vendors eventually figured out
> how to synchronize TSC with the same crystal, but
> then a new set of problems emerged: Power features
> sometimes caused the clock on one processor to
> slow down or even stop, thus destroying the synchrony
> with other processors. This was fixed first
> by ensuring that the tick rate did not change
> ("constant TSC") and later that it did not stop
> ("nonstop TSC"), unless ALL of the TSCs on all of
> the processors stopped. Nearly all of the most recent
> generations of server processors support these
> capabilities, so as a result on most recent servers,
> the TSC on all processors/cores/sockets is driven by
> the same crystal, always ticks at the same rate,
> and doesn't stop independently of other processors'
> TSCs. This is what we call a "reliable TSC".
>
> But we're not done yet. What does a reliable TSC
> provide? We need to define a few more terms.
>
> A "perfect TSC" would be one where a magic logic
> analyzer with a cesium clock could confirm that
> the TSCs on every processor increment at precisely
> the same femtosecond. Both the speed of light
> and the pricing models of commodity processors
> make a perfect TSC unlikely :-)
>
> How close is good enough? We define two TSCs
> as being "unobservably different" if code running
> on the two processors can never see time going
> backwards, because the difference between their
> TSCs is smaller than the memory access overhead
> due to cache synchronization. (This is sometimes
> called a "cache bounce".) To wit, suppose processor
> A does a rdtsc and writes the result into memory;
> meanwhile processor B is spinning until it sees that the
> memory location has changed, reads A's value
> from memory and then does its own rdtsc. If
> B's rdtsc is NEVER less than OR equal to A's rdtsc,
> we will call this an "optimal TSC".
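
(Inline sketch of the test described in the quoted paragraph above, assuming POSIX threads and C11 atomics; build with -pthread. The counter is passed in as a function pointer so the sketch stays portable. All names here are hypothetical, and ticking_counter/stuck_counter are stand-ins for demonstration, not real TSC reads.)

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t (*counter_fn)(void);   /* on real hardware: wraps rdtsc */

struct bounce_ctx {
    counter_fn read_counter;
    int rounds;
    _Atomic uint64_t slot;   /* 0 means "empty", else A's published value */
    int violations;          /* written only by B, read after the join */
};

/* "Processor A": read the counter and publish the result in memory. */
static void *processor_a(void *arg)
{
    struct bounce_ctx *c = arg;
    for (int i = 0; i < c->rounds; i++) {
        while (atomic_load(&c->slot) != 0)
            ;                              /* wait until B took the last one */
        uint64_t a = c->read_counter();
        atomic_store(&c->slot, a ? a : 1); /* never publish the empty marker */
    }
    return NULL;
}

/* "Processor B": spin until A's value is visible, then read our own. */
static void *processor_b(void *arg)
{
    struct bounce_ctx *c = arg;
    for (int i = 0; i < c->rounds; i++) {
        uint64_t a;
        while ((a = atomic_load(&c->slot)) == 0)
            ;                              /* the cache bounce happens here */
        uint64_t b = c->read_counter();
        if (b <= a)                        /* time stood still or went back */
            c->violations++;
        atomic_store(&c->slot, 0);
    }
    return NULL;
}

/* Returns the number of rounds in which B's reading was not strictly
 * greater than A's, or -1 if the helper thread could not be started. */
int count_bounce_violations(counter_fn read_counter, int rounds)
{
    struct bounce_ctx c = { .read_counter = read_counter, .rounds = rounds };
    pthread_t ta;

    if (pthread_create(&ta, NULL, processor_a, &c) != 0)
        return -1;
    processor_b(&c);                       /* B runs in the calling thread */
    pthread_join(ta, NULL);
    return c.violations;
}

/* Stand-in counters so the sketch runs anywhere. */
static _Atomic uint64_t fake_ticks = 1;
uint64_t ticking_counter(void) { return atomic_fetch_add(&fake_ticks, 1); }
uint64_t stuck_counter(void)   { return 42; }
```

On x86 with GCC or Clang, passing a function that returns __rdtsc() (from <x86intrin.h>) and pinning the two threads to different CPUs would turn this into the real test; both are left out here to keep the sketch portable. A result of zero over many rounds is only evidence, not proof, that the two TSCs are unobservably different.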
>
> A reliable TSC is not guaranteed to be optimal;
> it may just be very close to optimal, meaning
> the difference between two TSCs may sometimes
> be observable but it will always be very small.
> (As far as I know, processor and server vendors
> will not guarantee exactly how small.) To simulate
> an optimal TSC with a reliable TSC, a software
> wrapper can be placed around the reads from a
> reliable TSC to catch and "fix" the rare
> circumstances where time goes backwards.
> If this wrapper ensures that time never goes
> backwards AND ensures that time always moves
> forward, we call this a monotonically-increasing
> wrapper. If it instead ensures that time never
> goes backwards AND may appear to stop, we call
> this a monotonically-non-decreasing wrapper.
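
(Inline sketch of the two wrappers named in the quoted paragraph above, assuming C11 atomics. The function names and the idea of keeping the last returned value in a shared atomic are my own illustration, not existing Xen or Linux code.)

```c
#include <stdatomic.h>
#include <stdint.h>

/* Monotonically-non-decreasing wrapper: never returns a value lower
 * than one already returned; if the raw reading is behind, the last
 * value is repeated, so time may appear to stop. */
uint64_t tsc_nondecreasing(_Atomic uint64_t *last, uint64_t raw)
{
    uint64_t prev = atomic_load(last);

    while (raw > prev) {
        /* On failure the CAS reloads prev with the current value. */
        if (atomic_compare_exchange_weak(last, &prev, raw))
            return raw;
    }
    return prev;
}

/* Monotonically-increasing wrapper: every call returns a value
 * strictly greater than the previous one; if the raw reading is not
 * ahead, the last value is bumped by one tick. */
uint64_t tsc_increasing(_Atomic uint64_t *last, uint64_t raw)
{
    uint64_t prev = atomic_load(last);

    for (;;) {
        uint64_t next = raw > prev ? raw : prev + 1;
        if (atomic_compare_exchange_weak(last, &prev, next))
            return next;
    }
}
```

The caller would pass the raw rdtsc result as raw and share one last variable among all threads. The shared variable is itself a cache line that bounces between processors, so the wrapper gives back some of the speed that made rdtsc attractive in the first place; the increasing variant can also run slightly ahead of the hardware if it is called very frequently while raw readings lag.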
>
> Note also that a reliable TSC is not guaranteed
> to never stop; it is just guaranteed that if
> the TSC on one processor is stopped, the TSC on
> all other processors will also be stopped. As
> a result, a reliable TSC cannot be used as
> a wallclock, at least without other software
> support that can properly adjust the TSC on all
> processors when all processors awaken.
>
> Last, there is the issue of whether or not Xen can
> conclusively determine if the TSC is reliable.
> This is still an open challenge. There exists
> a CPUID bit which purports to do this, but it
> is not known with certainty if there are exceptions.
> Notably, there is concern if certain newer
> larger NUMA servers will truly provide a reliable
> TSC across all system processors even if the
> CPUID bit on each CPU package says the package
> does provide a reliable TSC. One large server vendor
> claims that this is not a problem anymore, but
> ideally we would like to test this dynamically
> and there is GPL code available to do exactly
> that. This code is used in Linux in some
> circumstances once at boot-time to test for
> an "optimal TSC". But in some cases the CPUID
> bit defuses this test. And in any case a boot-time
> test may not catch all problems, such as a
> power event that doesn't handle TSC quite properly.
> So without some form of ongoing post-boot
> test, we just don't know.
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel
>
>