On Tuesday, March 31, 2009 10:52 PM, Carsten Schiers wrote:
> Sorry for my ignorance, but as I find all of this very interesting, before
> reading it all over, and as I suffer from skew after using cpuidle and also
> cpufreq management (on an AMD CPU, whose TSC frequency varies across
> freq/voltage scaling ;-), a few questions:
>
> - You mention lost ticks in some guests; does this include Dom0? That's where
> my messages mainly show up.
I haven't observed any lost-ticks warnings in Dom0 on the current Xen 3.4 tip so far.
> - You recommend limiting cpuidle to either C1 or C2 (in case the APIC
> timer does not stop). How can I know that?
You may need to refer to your processor's specification.
> - xm debug-key c reports active C1, max_cstate C2, but only lists C1 usage.
> C1 Clock Ramping seems to be disabled. The platform timer is a 25MHz HPET.
> Excuse my ignorance again, but doesn't that mean I am not using C-states
> at all?
In Xen 3.3, C1 residency is not counted yet. max_cstate=C2 does not mean that
your platform supports C2; it just means that if your platform supports C-states
deeper than C2, the deepest C-state used will be C2. I guess xm debug-key c
didn't report any C2 information (usage, residency) on your platform, right?
If so, that means your system only supports C1.
> I understand you are speaking about Xen 3.4. Currently I am at 3.3.1 and have
> to wait for a slot to test 3.4. I am curious to see what happens. Dan told me
> how to use xm debug-key t and said the max cycles skew is so much smaller
> than the max stime (Xen system time) skew. This makes him believe 3.4 will
> help.
Yes, I also strongly suggest that you try 3.4. But I don't expect much for the
variant-TSC case, just as I said in the original mail.
BTW, I believe enabling cpuidle or not should have no impact on your case. Have
you checked the result with cpufreq disabled?
Thanks
Jimmy
>
> BR,
> Carsten.
>
> ----- Original Message -----
> From: "Wei, Gang" <gang.wei@xxxxxxxxx>
> Sent: Tue, 31.3.2009 16:00
> To: xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>
> Cc: "Tian, Kevin" <kevin.tian@xxxxxxxxx> ; Keir Fraser
> <keir.fraser@xxxxxxxxxxxxx> ; "Yu, Ke" <ke.yu@xxxxxxxxx>
> Subject: [Xen-devel] Potential side-effects and mitigations after cpuidle
> enabled by default
>
> In Xen 3.4, cpuidle is enabled by default as of c/s 19421. But some
> side-effects may exist under different h/w C-state implementations or h/w
> configurations, so users may occasionally observe latency or system time/TSC
> skew. Below are the conditions causing these side-effects and the means to
> mitigate them:
>
> 1. Latency
>
> Latency can be caused by two factors: C-state entry/exit latency, and the
> extra latency caused by the broadcast mechanism.
>
> C-state entry/exit latency is inevitable, since powering gates on/off takes
> time. Normally a shallower C-state incurs lower latency but less power-saving
> capability, and vice versa for a deeper C-state. The cpuidle governor tries
> to balance the performance/power tradeoff at a high level, which is one area
> we'll continue to tune.
>
> Broadcast is necessary to handle the APIC timer stopping in deep C-states
> (>=C3) on some platforms. One platform timer source is chosen to carry the
> per-cpu timer deadlines and wake up CPUs in deep C-states at the expected
> expiry time. So far Xen 3.4 supports PIT/HPET as the broadcast source. In the
> current implementation, PIT broadcast runs in periodic mode (10ms), which
> means up to 10ms of extra latency can be added to the expiry expected by a
> sleeping CPU. This is just an initial implementation choice, which could of
> course be enhanced to an on-demand on/off mode in the future; we didn't go
> into that complexity in the current implementation because of the PIT's slow
> access and short wrap count. So HPET broadcast is always preferred once that
> facility is available, as it adds negligible overhead while waking up CPUs on
> time. But the world is not always perfect, and some side-effects also exist
> with HPET.
>
> Details are listed below:
>
> 1.1. For h/w supporting only ACPI C1 (halt), as reported by the BIOS in the
> ACPI _CST method:
>
> It's immune from this side-effect, as only instruction execution is halted.
>
> 1.2. For h/w supporting ACPI C2, in which the TSC and APIC timer don't stop:
>
> The ACPI C2 type is a bit special: it is sometimes an alias for a deep CPU
> C-state, and thus current Xen 3.4 treats the ACPI C2 type in the same manner
> as the ACPI C3 type (i.e. broadcast is activated). If the user knows that
> ACPI C2 does not have that h/w limitation on their platform,
> 'lapic_timer_c2_ok' can be added in grub to deactivate the software
> mitigation.
>
> 1.3. For the remaining implementations, which support ACPI C2+ in which the
> APIC timer is stopped:
>
> 1.3.1. HPET as broadcast timer source
>
> HPET can deliver timely wakeup events to CPUs sleeping in deep C-states with
> negligible overhead, as stated earlier. But the HPET mode being used does
> make some differences worth noting:
>
> 1.3.1.1. If the h/w supports per-channel MSI delivery mode (interrupts via
> FSB), it's the best broadcast mechanism known so far. There is no side-effect
> regarding latency, and the IPIs used to broadcast the wakeup event are
> reduced by a factor of the number of available channels (each channel can
> independently serve one or several sleeping CPUs).
>
> As long as this feature is available, it is always preferred automatically.
>
> 1.3.1.2. When MSI delivery mode is absent, we have to use legacy replacement
> mode, with only one HPET channel available. Well, it's not that bad, as this
> single channel can serve all sleeping CPUs by using IPIs to wake them up.
> However, another side-effect occurs: the PIT/RTC interrupts (IRQ0/IRQ8) are
> replaced by the HPET channel. The RTC alarm feature in dom0 is then lost,
> unless we add RTC emulation between dom0's rtc module and Xen's HPET logic
> (which is not implemented so far).
>
> Due to the above side-effect, this broadcast option is disabled by default,
> and PIT broadcast is used instead. If the user is sure that he doesn't need
> the RTC alarm, the 'hpetbroadcast' grub option can be used to force-enable
> it.
>
> 1.3.2. PIT as broadcast timer source
>
> If MSI-based HPET interrupt delivery is not available, or HPET is missing
> entirely, PIT broadcast is the current default. As said earlier, PIT
> broadcast is implemented in 10ms periodic mode, which can thus incur up to
> 10ms of latency for each deep C-state entry/exit. One natural result is
> observing 'many lost ticks' in some guests.
>
> 1.4 Suggestions
>
> So, if the user doesn't care about power consumption while his platform does
> expose deep C-states, one mitigation is to add the 'max_cstate=' boot option
> to restrict the maximum allowed C-state (if limited to C2, make sure to also
> add 'lapic_timer_c2_ok' where applicable). Runtime modification of
> 'max_cstate' is possible via xenpm (patch posted on 3/24/2009, not checked
> in yet).
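Putting the options from 1.2-1.4 together, a grub entry might look like the following. This is only an illustrative sketch: the title, paths, and kernel/module names are placeholders for your own installation.

```
# Illustrative GRUB legacy entry; adjust paths and names to your setup.
title Xen 3.4
    root (hd0,0)
    kernel /boot/xen.gz max_cstate=2 lapic_timer_c2_ok
    module /boot/vmlinuz-xen console=tty0
    module /boot/initrd-xen.img
```

Append 'hpetbroadcast' to the xen.gz line as well if you don't need the dom0 RTC alarm, as described in 1.3.1.2.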
>
> If the user does care about power consumption and has no requirement for the
> RTC alarm, then always using HPET is preferred.
>
> Lastly, we could either add RTC emulation on top of HPET or enhance PIT
> broadcast to use single-shot mode, but we would like to hear comments from
> the community on whether it's worth doing. :-)
>
> 2. system time/TSC skew
>
> Similarly to the APIC timer stopping, the TSC also stops in deep C-states in
> some implementations, which requires Xen to recover the lost counts on exit
> from a deep C-state by software means. It's easy to imagine the kinds of
> errors such software methods can cause. For the details of how TSC skew can
> occur, its side-effects, and possible solutions, you can refer to our Xen
> Summit presentation:
> http://www.xen.org/files/xensummit_oracle09/XenSummit09pm.pdf
>
> Below is a brief introduction to which algorithm is used in the different
> implementations:
>
> 2.1. The best case is non-stop TSC at the h/w implementation level. For
> example, Intel Core i7 processors support this green feature, which can be
> detected via CPUID. Xen does nothing once this feature is detected, and thus
> there is no extra software-caused skew beyond dozens of cycles due to crystal
> drift.
>
> 2.2. If the TSC frequency is invariant across freq/voltage scaling (true for
> all Intel processors supporting VT-x), Xen syncs the APs' TSCs to the BSP's
> at a 1-second interval during per-cpu time calibration, and meanwhile
> recovers in a per-cpu style, where only the platform counter ticks elapsed
> since the last calibration point are compensated into the local TSC with a
> boot-time-calculated scale factor. This global synchronization, along with
> per-cpu compensation, limits TSC skew to the ns level in most cases.
>
> 2.3. If the TSC frequency varies across freq/voltage scaling, Xen only
> recovers in a per-cpu style, where only the platform counter ticks elapsed
> since the last calibration point are compensated into the local TSC with a
> local scale factor. In this manner, TSC skew across cpus accumulates and is
> easy to observe after the system has been up for some time.
>
> 2.4. Solution
>
> Once you observe obvious system time/TSC skew, and you don't particularly
> care about power consumption, then, similarly to handling broadcast latency:
>
> Limit 'max_cstate' to C1, or limit 'max_cstate' to a real C2 and give the
> 'lapic_timer_c2_ok' option.
>
> Or, better, run your workload on a newer platform with either constant TSC
> frequency or the non-stop TSC feature supported. :-)
>
> Jimmy
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel