Detailed documentation(!) for the new tsc_mode VM configuration option
(Note, new file to be hg add'ed.)
Signed-off-by: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>
diff -r d0b030008814 docs/misc/tscmode.txt
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/docs/misc/tscmode.txt Tue Dec 01 11:23:19 2009 -0700
@@ -0,0 +1,299 @@
+by: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>
+As of Xen 4.0, a new config option called tsc_mode may be specified
+for each domain. The default for tsc_mode handles the vast majority
+of hardware and software environments. This document is targeted
+for Xen users and administrators that may need to select a non-default
+Proper selection of tsc_mode depends on an understanding not only of
+the guest operating system (OS), but also of the application set that will
+ever run on this guest OS. This is because tsc_mode applies
+equally to both the OS and ALL apps that are running on this
+domain, now or in the future.
+Key questions to be answered for the OS and/or each application are:
+- Does the OS/app use the rdtsc instruction at all? (We will explain below
+ how to determine this.)
+- At what frequency is the rdtsc instruction executed by either the OS
+ or any running apps? If the sum exceeds about 10,000 rdtsc instructions
+ per second per processor, we call this a "high-TSC-frequency"
+ OS/app/environment. (This is relatively rare, and developers of OS's
+ and apps that are high-TSC-frequency are usually aware of it.)
+- If the OS/app does use rdtsc, will it behave incorrectly if "time goes
+ backwards" or if the frequency of the TSC suddenly changes? If so,
+ we call this a "TSC-sensitive" app or OS; otherwise it is "TSC-resilient".
+This last is the US$64,000 question as it may be very difficult
+(or, for legacy apps, even impossible) to predict all possible
+failure cases. As a result, unless proven otherwise, any app
+that uses rdtsc must be assumed to be TSC-sensitive and, as we
+will see, this is the default starting in Xen 4.0.
+Xen's new tsc_mode parameter determines the circumstances under which
+the family of rdtsc instructions are executed "natively" vs emulated.
+Roughly speaking, native means rdtsc is fast but TSC-sensitive apps
+may, under unpredictable circumstances, run incorrectly; emulated means
+there is some performance degradation (unobservable in most cases),
+but TSC-sensitive apps will always run correctly. Prior to Xen 4.0,
+all rdtsc instructions were native: "fast but potentially incorrect."
+Starting at Xen 4.0, the default is that all rdtsc instructions are
+"correct but potentially slow". The tsc_mode parameter in 4.0 provides
+an intelligent default but allows system administrator's to adjust
+how rdtsc instructions are executed differently for different domains.
+The non-default choices for tsc_mode are:
+- tsc_mode=1 (always emulate). All rdtsc instructions are emulated;
+ this is the best choice when TSC-sensitive apps are running and
+ it is necessary to understand worst-case performance degradation
+ for a specific hardware environment.
+- tsc_mode=2 (never emulate). This is the same as prior to Xen 4.0
+ and is the best choice if it is certain that all apps running in
+ this VM are TSC-resilient and highest performance is required.
+- tsc_mode=3 (PVRDTSCP). High-TSC-frequency apps may be paravirtualized
+ (modified) to obtain both correctness and highest performance; any
+ unmodified apps must be TSC-resilient.
+If tsc_mode is left unspecified (or set to tsc_mode=0), a hybrid
+algorithm is utilized to ensure correctness while providing the
+best performance possible given:
+- the requirement of correctness,
+- the underlying hardware, and
+- whether or not the VM has been saved/restored/migrated
+To understand this in more detail, the rest of this document must
+DETERMINING RDTSC FREQUENCY
+To determine the frequency of rdtsc instructions that are emulated,
+an "xm" command can be used by a privileged user of domain0. The
+# xm debug-key s; xm dmesg | tail
+provides information about TSC usage in each domain where TSC
+emulation is currently enabled.
+To understand tsc_mode completely, some background on TSC is required:
+The x86 "timestamp counter", or TSC, is a 64-bit register on each
+processor that increases monotonically. Historically, TSC incremented
+every processor cycle, but on recent processors, it increases
+at a constant rate even if the processor changes frequency (for example,
+to reduce processor power usage). TSC is known by x86 programmers
+as the fastest, highest-precision measurement of the passage of time
+so it is often used as a foundation for performance monitoring.
+And since it is guaranteed to be monotonically increasing and, at
+64 bits, is guaranteed to not wraparound within 10 years, it is
+sometimes used as a random number or a unique sequence identifier,
+such as to stamp transactions so they can be replayed in a specific
+On most older SMP and early multi-core machines, TSC was not synchronized
+between processors. Thus if an application were to read the TSC on
+one processor, then was moved by the OS to another processor, then read
+TSC again, it might appear that "time went backwards". This loss of
+monotonicity resulted in many obscure application bugs when TSC-sensitive
+apps were ported from a uniprocessor to an SMP environment; as a result,
+many applications -- especially in the Windows world -- removed their
+dependency on TSC and replaced their timestamp needs with OS-specific
+functions, losing both performance and precision. On some more recent
+generations of multi-core machines, especially multi-socket multi-core
+machines, the TSC was synchronized but if one processor were to enter
+certain low-power states, its TSC would stop, destroying the synchrony
+and again causing obscure bugs. This reinforced decisions to avoid use
+of TSC altogether. On the most recent generations of multi-core
+machines, however, synchronization is provided across all processors
+in all power states, even on multi-socket machines, and provide a
+flag that indicates that TSC is synchronized and "invariant". Thus
+TSC is once again useful for applications, and even newer operating
+systems are using and depending upon TSC for critical timekeeping
+tasks when running on these recent machines.
+We will refer to hardware that ensures TSC is both synchronized and
+invariant as "TSC-safe" and any hardware on which TSC is not (or
+may not remain) synchronized as "TSC-unsafe".
+As a result of TSC's sordid history, two classes of applications use
+TSC: old applications designed for single processors, and the most recent
+enteprise applications which require high-frequency high-precision
+We will refer to apps that might break if running on a TSC-unsafe
+machine as "TSC-sensitive"; apps that don't use TSC, or do use
+TSC but use it in a way that monotonicity and frequency invariance
+are unimportant as "TSC-resilient".
+The emergence of virtualization once again complicates the usage of
+TSC. When features such as save/restore or live migration are employed,
+a guest OS and all its currently running applications may be invisibly
+transported to an entirely different physical machine. While TSC
+may be "safe" on one machine, it is essentially impossible to precisely
+synchronize TSC across a data center or even a pool of machines. As
+a result, when run in a virtualized environment, rare and obscure
+"time going backwards" problems might once again occur for those
+TSC-sensitive applications. Worse, if a guest OS moves from, for
+example, a 3GHz
+machine to a 1.5GHz machine, attempts by an OS/app to measure time
+intervals with TSC may without notice be incorrect by a factor of two.
+The rdtsc (read timestamp counter) instruction is used to read the
+TSC register. The rdtscp instruction is a variant of rdtsc on recent
+processors. We refer to these together as the rdtsc family of instructions,
+or just "rdtsc". Instructions in the rdtsc family are non-privileged, but
+privileged software may set a cpuid bit to cause all rdtsc family
+instructions to trap. This trap can be detected by Xen, which can
+then transparently "emulate" the results of the rdtsc instruction and
+return control to the code following the rdtsc instruction.
+To provide a "safe" TSC, i.e. to ensure both TSC monontonicity and a
+fixed rate, Xen provides rdtsc emulation whenever necessary or when
+explicitly specified by a per-VM configuration option. TSC emulation is
+relatively slow -- roughly 15-20 times slower than the rdtsc instruction
+when executed natively. However, except when an OS or application uses
+the rdtsc instruction at a high frequency (e.g. more than about 10,000 times
+per second per processor), this performance degradation is not noticable
+(i.e. <0.3%). And, TSC emulation is nearly always faster than
+OS-provided alternatives (e.g. Linux's gettimeofday). For environments
+where it is certain that all apps are TSC-resilient (e.g.
+"TSC-safeness" is not necessary) and highest performance is a
+requirement, TSC emulation may be entirely disabled (tsc_mode==2).
+The default mode (tsc_mode==0) checks TSC-safeness of the underlying
+hardware on which the virtual machine is launched. If it is
+TSC-safe, rdtsc will execute at hardware speed; if it is not, rdtsc
+will be emulated. Once a virtual machine is save/restored or migrated,
+however, there are two possibilities: For a paravirtualized (PV) domain,
+TSC will always be emulated. For a fully-virtualized (HVM) domain,
+TSC remains native IF the source physical machine and target physical machine
+have the same TSC frequency; else TSC is emulated. Note that, though
+emulated, the "apparent" TSC frequency will be the TSC frequency
+of the initial physical machine, even after migration.
+For environments where both TSC-safeness AND highest performance
+even across migration is a requirement, application code can be specially
+modified to use an algorithm explicitly designed into Xen for this purpose.
+This mode (tsc_mode==3) is called PVRDTSCP, because it requires
+app paravirtualization (awareness by the app that it may be running
+on top of Xen), and utilizes a variation of the rdtsc instruction
+called rdtscp that is available on most recent generation processors.
+(The rdtscp instruction differs from the rdtsc instruction in that it
+reads not only the TSC but an additional register set by system software.)
+When a pvrdtscp-modified app is running on a processor that is both TSC-safe
+and supports the rdtscp instruction, information can be obtained
+about migration and TSC frequency/offset adjustment to allow the
+vast majority of timestamps to be obtained at top performance; when
+running on a TSC-unsafe processor or a processor that doesn't support
+the rdtscp instruction, rdtscp is emulated.
+PVRDTSCP (tsc_mode==3) has two limitations. First, it applies to
+all apps running in this virtual machine. This means that all
+apps must either be TSC-resilient or pvrdtscp-modified. Second,
+highest performance is only obtained on TSC-safe machines that
+support the rdtscp instruction; when running on older machines,
+rdtscp is emulated and thus slower. For more information on PVRTSCP,
+Finally, tsc_mode==1 always enables TSC emulation, regardless of
+the underlying physical hardware. The "apparent" TSC frequency will
+be the TSC frequency of the initial physical machine, even after migration.
+This mode is useful to measure any performance degradation that
+might be encountered by a tsc_mode==0 domain after migration occurs,
+or a tsc_mode==3 domain when it is running on TSC-unsafe hardware.
+Note that while Xen ensures that an emulated TSC is "safe" across migration,
+it does not ensure that it continues to tick at the same rate during
+the actual migration. As an oversimplified example, if TSC is ticking
+once per second in a guest, and the guest is saved when the TSC is 1000,
+then restored 30 seconds later, TSC is only guaranteed to be greater
+than or equal to 1001, not precisely 1030. This has some OS implications
+as will be seen in the next section.
+TSC INVARIANT BIT and NO_MIGRATE
+Related to TSC emulation, the "TSC Invariant" bit is architecturally defined
+in a cpuid bit on the most recent x86 processors. If set, TSC invariance
+ensures that the TSC is "safe", that is it will increment at a constant rate
+regardless of power events, will be synchronized across all processors, and
+was properly initialized to zero on all processors at boot-time
+by system hardware/BIOS. As long as system software never writes to TSC,
+TSC will be safe and continuously incremented at a fixed rate and thus
+can be used as a system "clocksource".
+This bit is used by some OS's, and specifically by Linux starting with
+version 2.6.30(?), to select TSC as a system clocksource. Once selected,
+TSC remains the Linux system clocksource unless manually overridden. In
+a virtualized environment, since it is not possible to synchronize TSC
+across all the machines in a pool or data center, a migration may "break"
+TSC as a usable clocksource; while time will not go backwards, it may
+not track wallclock time well enough to avoid certain time-sensitive
+consequences. As a result, Xen can only expose the TSC Invariant bit
+to a guest OS if it is certain that the domain will never migrate.
+As of Xen 4.0, the "no_migrate=1" VM configuration option may be specified
+to disable migration. If no_migrate is selected and the VM is running
+on a physical machine with "TSC Invariant", Linux 2.6.30+ will safely
+use TSC as the system clocksource. But, attempts to migrate or, once
+saved, restore this domain will fail.
+There is another cpuid-related complication: The x86 cpuid instruction is
+non-privileged. HVM domains are configured to always trap this instruction
+to Xen, where Xen can "filter" the result. In a PV OS, all cpuid instructions
+have been replaced by a parvirtualized equivalent of the cpuid instruction
+("pvcpuid") and also trap to Xen. But apps in a PV guest that use a
+cpuid instruction execute it directly, without a trap to Xen. As a result,
+an app may directly examine the physical TSC Invariant cpuid bit and make
+decisions based on that bit. This is still an unsolved problem, though
+a workaround exists as part of the PVRDTSCP tsc_mode for apps that
+can be modified.
+MORE ON PVRDTSCP
+Paravirtualized OS's use the "pvclock" algorithm to manage the passing
+of time. This sophisticated algorithm obtains information from a memory
+page shared between Xen and the OS and selects information from this
+page based on the current virtual CPU (vcpu) in order to properly adapt to
+TSC-unsafe systems and changes that occur across migration. Neither
+this shared page nor the vcpu information is available to a userland
+app so the pvclock algorithm cannot be directly used by an app, at least
+without performance degradation roughly equal to the cost of just
+emulating an rdtsc.
+As a result, as of 4.0, Xen provides capabilities for a userland app
+to obtain key time values similar to the information accessible
+to the PV OS pvclock algorithm. The app uses the rdtscp instruction
+which is defined in recent processors to obtain both the TSC and an
+auxiliary value called TSC_AUX. Xen is responsible for setting TSC_AUX
+to the same value on all vcpus running any domain with tsc_mode==3;
+further, Xen tools are responsible for monotonically incrementing TSC_AUX
+anytime the domain is restored/migrated (thus changing key time values);
+and, when the domain is running on a physical machine that either
+is not TSC-safe or does not support the rdtscp instruction, Xen
+is responsible for emulating the rdtscp instruction and for setting
+TSC_AUX to zero on all processors.
+Xen also provides pvclock information via a "pvcpuid" instruction.
+While this results in a slow trap, the information changes
+(and thus must be reobtained via pvcpuid) ONLY when TSC_AUX
+has changed, which should be very rare relative to a high
+frequency of rdtscp instructions.
+Finally, Xen provides additional time-related information via
+other pvcpuid instructions. First, an app is capable of
+determining if it is currently running on Xen, next whether
+the tsc_mode setting of the domain in which it is running,
+and finally whether the underlying hardware is TSC-safe and
+supports the rdtscp instruction.
+As a result, a pvrdtscp-modified app has sufficient information
+to compute the pvclock "elapsed nanoseconds" which can
+be used as a timestamp. And this can be done nearly as
+fast as a native rdtsc instruction, much faster than emulation,
+and also much faster than nearly all OS-provided time mechanisms.
+While pvrtscp is too complex for most apps, certain enterprise
+TSC-sensitive high-TSC-frequency apps may find it useful to
+obtain a significant performance gain.
Description: Binary data
Xen-devel mailing list