[Xen-devel] Potential side-effects and mitigations after cpuidle enabled

In Xen 3.4, cpuidle is enabled by default as of c/s 19421. However, some
side-effects may appear depending on the h/w C-state implementation or h/w
configuration, so users may occasionally observe extra latency or system
time/TSC skew. Below are the conditions that cause these side-effects and the
means to mitigate them:

1. Latency

Latency can come from two sources: C-state entry/exit latency, and extra
latency added by the broadcast mechanism.

C-state entry/exit latency is inevitable since powering gates on/off takes
time. Normally a shallower C-state incurs less latency but saves less power,
and vice versa for a deeper C-state. The cpuidle governor tries to balance the
performance/power tradeoff at a high level, which is one area we'll continue
to tune.

Broadcast is necessary to handle the APIC timer stopping in deep C-states
(>=C3) on some platforms. One platform timer source is chosen to carry the
per-cpu timer deadlines, and to wake up CPUs in deep C-states in time for the
expected expiry. So far Xen 3.4 supports PIT and HPET as the broadcast source.
In the current implementation PIT broadcast runs in periodic mode (10ms),
which means up to 10ms of extra latency could be added to a timer expiry
expected by a sleeping CPU. This is just an initial implementation choice,
which of course could be enhanced to an on-demand on/off mode in the future.
We didn't go into that complexity in the current implementation because of
the PIT's slow access and short wrap count. So HPET broadcast is always
preferred once this facility is available, as it gives timely wakeup with
negligible overhead. Then... the world is not always perfect, and some
side-effects also exist with HPET.
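
To make the "up to 10ms" figure concrete, here is a rough sketch of the
arithmetic. It is a simplified model, not Xen code: it assumes the periodic
ticks fall on exact multiples of the period, and the function names are made
up for illustration.

```python
PIT_PERIOD_US = 10_000  # 10ms periodic PIT broadcast, as described above


def periodic_wakeup_us(deadline_us, period_us=PIT_PERIOD_US):
    """Time of the first periodic tick at or after the requested deadline."""
    ticks = -(-deadline_us // period_us)  # integer ceiling division
    return ticks * period_us


def extra_latency_us(deadline_us, period_us=PIT_PERIOD_US):
    """Delay between the requested deadline and the tick that wakes the CPU."""
    return periodic_wakeup_us(deadline_us, period_us) - deadline_us
```

In this model a deadline that lands just after a tick waits almost a full
period, while a deadline exactly on a tick waits nothing, so the extra latency
is always somewhere in [0, 10ms).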

Details are listed below:

1.1. For h/w supporting ACPI C1 (halt) only (as reported by the BIOS in the
ACPI _CST method):

It's immune to this side-effect as only instruction execution is halted.

1.2. For h/w supporting ACPI C2 in which the TSC and APIC timer don't stop:

The ACPI C2 type is a bit special: it is sometimes an alias for a deep CPU
C-state, so Xen 3.4 currently treats the ACPI C2 type in the same manner as
the ACPI C3 type (i.e. broadcast is activated). If the user knows that ACPI C2
on that platform does not have this h/w limitation, 'lapic_timer_c2_ok' can be
added in grub to deactivate the software mitigation.

1.3. For the remaining implementations supporting ACPI C2+ in which the APIC
timer stops:

1.3.1. HPET as broadcast timer source

HPET can deliver timely wakeup events to CPUs sleeping in deep
C-states with negligible overhead, as stated earlier. But the
HPET mode being used does make some differences worth
noting:

1.3.1.1. If the h/w supports per-channel MSI delivery mode (intr via FSB),
it's the best broadcast mechanism known so far. There is no side-effect on
latency, and the IPIs used to broadcast wakeup events can be reduced by a
factor of the number of available channels (each channel can independently
serve one or several sleeping CPUs).

As long as this feature is available, it is always preferred automatically.

1.3.1.2. When MSI delivery mode is absent, we have to use legacy replacement
mode with only one HPET channel available. Well, it's not that bad, as this
single channel can serve all sleeping CPUs by using IPIs to wake them up.
However another side-effect occurs, as the PIT/RTC interrupts (IRQ0/IRQ8) are
replaced by the HPET channel. The RTC alarm feature in dom0 is then lost,
unless we add RTC emulation between dom0's rtc module and Xen's hpet logic
(however, that is not implemented so far).

Due to the above side-effect, this broadcast option is disabled by default,
and PIT broadcast is used instead. If the user is sure that RTC alarm is not
needed, the 'hpetbroadcast' grub option can be used to force it on.

1.3.2. PIT as broadcast timer source

If MSI based HPET intr delivery is not available or HPET is missing, PIT
broadcast is the current default in all cases. As said earlier, PIT broadcast
is implemented in 10ms periodic mode, which can thus incur up to 10ms of
latency for each deep C-state entry/exit. One natural result is to observe
'many lost ticks' in some guests.
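
The selection rules in 1.3 can be summarized as a small decision function.
This is only a sketch of the policy as described in this mail, not the actual
Xen code; the function name, parameter names and returned labels are
hypothetical.

```python
def pick_broadcast_source(has_hpet, hpet_msi, hpetbroadcast_opt):
    """Return which timer would carry the broadcast, per the rules in 1.3.

    has_hpet          -- platform exposes an HPET at all
    hpet_msi          -- HPET supports per-channel MSI delivery
    hpetbroadcast_opt -- user passed the 'hpetbroadcast' boot option
    """
    if has_hpet and hpet_msi:
        # 1.3.1.1: preferred automatically, no known latency side-effect
        return "hpet-msi"
    if has_hpet and hpetbroadcast_opt:
        # 1.3.1.2: legacy replacement mode, dom0 loses the RTC alarm
        return "hpet-legacy"
    # 1.3.2: default fallback, 10ms periodic mode
    return "pit"
```

For example, a platform with an HPET but no MSI delivery ends up on PIT
broadcast unless 'hpetbroadcast' is given explicitly.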

1.4. Suggestions

So, if the user doesn't care about power consumption but the platform does
expose deep C-states, one mitigation is to add the 'max_cstate=' boot option
to restrict the maximum allowed C-state (if limited to C2, make sure to add
'lapic_timer_c2_ok' where it applies). Runtime modification of 'max_cstate' is
possible through xenpm (patch posted on 3/24/2009, not checked in yet).

If the user does care about power consumption and has no requirement for RTC
alarm, then always using HPET is preferred.

Last, we could either add RTC emulation on HPET or enhance PIT broadcast to
use single shot mode, but we would like to hear from the community whether
it's worth doing. :-)

2. System time/TSC skew

Similar to the APIC timer stop, the TSC is also stopped in deep C-states in
some implementations, which requires Xen to recover the lost counts by
software means at exit from a deep C-state. It's easy to imagine the kinds of
errors that software methods can introduce. For details on how TSC skew can
occur, its side-effects and possible solutions, you can refer to our Xen
summit presentation:
http://www.xen.org/files/xensummit_oracle09/XenSummit09pm.pdf

Below is a brief introduction to which algorithm is used in different
implementations:

2.1. The best case is to have a non-stop TSC at the h/w implementation level.
For example, Intel Core-i7 processors support this green feature, which can be
detected by CPUID. Xen does nothing extra once this feature is detected, and
thus there is no software-caused skew beyond the dozens of cycles due to
crystal drift.

2.2. If the TSC frequency is invariant across freq/voltage scaling (true for
all Intel processors supporting VTx), Xen syncs the APs' TSCs to the BSP's at
a 1 second interval in per-cpu time calibration, and meanwhile does recovery
in a per-cpu style, where only the platform counter ticks elapsed since the
last calibration point are compensated to the local TSC with a
boot-time-calculated scale factor. This global synchronization along with
per-cpu compensation limits TSC skew to the ns level in most cases.
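
The per-cpu recovery step can be sketched as follows. This is an illustration
of the idea only, not the Xen implementation: the function name, argument
names and the exact-fraction scale factor are assumptions made for the
example.

```python
from fractions import Fraction


def recover_tsc(tsc_at_calib, plat_at_calib, plat_now, scale):
    """Rebuild the local TSC value after exit from a deep C-state.

    Only the platform counter ticks elapsed since the last calibration
    point are converted to TSC cycles (via 'scale', TSC cycles per platform
    tick) and added to the TSC value recorded at that calibration point.
    """
    elapsed = plat_now - plat_at_calib
    return tsc_at_calib + int(elapsed * scale)


# Hypothetical numbers: a 2.4GHz TSC and a 14.31818MHz platform counter.
EXAMPLE_SCALE = Fraction(2_400_000_000, 14_318_180)
```

With these example numbers, one second of platform counter ticks (14318180)
is turned back into 2400000000 TSC cycles. Any error in the scale factor goes
straight into the recovered value, which is where the software-caused skew
comes from.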

2.3. If the TSC frequency varies across freq/voltage scaling, Xen only does
recovery in a per-cpu style, where only the platform counter ticks elapsed
since the last calibration point are compensated to the local TSC with a local
scale factor. In this case TSC skew across cpus accumulates and is easy to
observe after the system has been up for some time.
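
A back-of-envelope model shows why the skew stays small with a periodic
resync (2.2) and grows without one (2.3). It is a rough sketch under strong
assumptions: the two CPUs' scale factors differ by a fixed, hypothetical
number of parts per million, and the whole interval is recovered from the
platform counter. None of the names come from Xen.

```python
def skew_cycles(elapsed_ms, tsc_hz, scale_diff_ppm, resync_ms=None):
    """Estimated TSC skew (in cycles) between two CPUs.

    scale_diff_ppm is the assumed relative difference between the two CPUs'
    scale factors. If resync_ms is given, the TSCs are assumed to be
    re-aligned every resync_ms, so only the time since the last resync
    contributes to the skew.
    """
    if resync_ms is not None:
        elapsed_ms %= resync_ms
    # cycles = tsc_hz * (elapsed_ms / 1000) * (scale_diff_ppm / 1e6)
    return tsc_hz * elapsed_ms * scale_diff_ppm // 1_000_000_000
```

In this model the skew without resync grows linearly with uptime, while a
1 second resync keeps it below one second's worth of drift.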

2.4. Solution

If you observe obvious system time/TSC skew and don't particularly care about
power consumption, then, as with broadcast latency:

Limit 'max_cstate' to C1, or limit 'max_cstate' to a real C2 and give the
'lapic_timer_c2_ok' option.

Or, better, run your workload on a newer platform that supports either
constant TSC frequency or the non-stop TSC feature. :-)

Jimmy
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel