On Tuesday, March 31, 2009 10:52 PM, Carsten Schiers wrote:
> Sorry for my ignorance, but as I find all of this very interesting, before
> reading it all over, and as I suffer from skew after using cpuidle and also
> cpufreq management (on an AMD CPU, whose TSC frequency varies across
> freq/voltage scaling ;-), a few questions:
>
> - You mention lost ticks in some guests; does this include Dom0? That's where
> my messages mainly show up.
I haven't observed any lost-ticks warnings in Dom0 on the current Xen 3.4 tip so far.
> - You recommend limiting cpuidle to either C1 or C2 (in case the APIC
> timer does not stop). How can I know that?
You may need to refer to your processor's specification.
> - xm debug-key c reports active C1, max_cstate C2, but only lists C1 usage.
> C1 Clock Ramping seems to be disabled. The platform timer is a 25MHz HPET.
> Excuse my ignorance again, but doesn't that mean I am not using C-states
> at all?
In Xen 3.3, C1 residency is not counted yet. max_cstate=C2 does not mean that
your platform supports C2; it just means that if your platform supports C-states
deeper than C2, the deepest C-state used will be C2. I guess xm debug-key c
didn't report any C2 information (usage, residency) on your platform, right?
If so, that means your system only supports C1.
> I understand you are speaking about Xen 3.4. Currently I am at 3.3.1 and have
> to wait for a slot to test 3.4. I am curious to see what happens. Dan told me
> how to use xm debug-key t and said the max cycles skew is so much smaller
> than the max stime (Xen system time) skew. This makes him believe 3.4 will
> help.
Yes, I also strongly suggest that you try 3.4. But I don't expect much for the
variant-TSC case, just as I said in the original mail.
BTW, I believe enabling cpuidle or not should have no impact on your case. Have
you checked the result with cpufreq disabled?
Thanks
Jimmy
>
> BR,
> Carsten.
>
> ----- Original Message -----
> From: "Wei, Gang" <gang.wei@xxxxxxxxx>
> Sent: Tue, 31.3.2009 16:00
> To: xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>
> Cc: "Tian, Kevin" <kevin.tian@xxxxxxxxx> ; Keir Fraser
> <keir.fraser@xxxxxxxxxxxxx> ; "Yu, Ke" <ke.yu@xxxxxxxxx>
> Subject: [Xen-devel] Potential side-effects and mitigations after cpuidle
> enabled by default
>
> In Xen 3.4, cpuidle is enabled by default as of c/s 19421. But some
> side-effects may exist under different h/w C-state implementations or h/w
> configurations, so users may occasionally observe latency or system time/TSC
> skew. Below are the conditions causing these side-effects and the means to
> mitigate them:
>
> 1. Latency
>
> Latency can be caused by two factors: C-state entry/exit latency, and the
> extra latency caused by the broadcast mechanism.
>
> C-state entry/exit latency is inevitable, since powering gates on/off takes
> time. Normally a shallower C-state incurs lower latency but less power-saving
> capability, and vice versa for a deeper C-state. The cpuidle governor tries
> to balance the performance/power tradeoff at a high level, which is one area
> we'll continue to tune.
>
> Broadcast is necessary to handle the APIC timer stopping in deep C-states
> (>=C3) on some platforms. One platform timer source is chosen to carry the
> per-cpu timer deadlines and wake up CPUs in deep C-states at the expected
> expiry time. So far Xen 3.4 supports PIT/HPET as the broadcast source. In the
> current implementation, PIT broadcast runs in periodic mode (10ms), which
> means up to 10ms of extra latency can be added to the expiry expected by a
> sleeping CPU. This is just an initial implementation choice, which could of
> course be enhanced to an on-demand on/off mode in the future; we didn't go
> into that complexity in the current implementation because of the PIT's slow
> access and short wrap count. So HPET broadcast is always preferred once that
> facility is available, as it adds negligible overhead while waking up CPUs on
> time. But the world is not always perfect, and some side-effects also exist
> with HPET.
>
> Details are listed below:
>
> 1.1. For h/w supporting only ACPI C1 (halt), as reported by the BIOS in the
> ACPI _CST method:
>
> It's immune from this side-effect, as only instruction execution is halted.
>
> 1.2. For h/w supporting ACPI C2, in which the TSC and APIC timer don't stop:
>
> The ACPI C2 type is a bit special: it is sometimes an alias for a deep CPU
> C-state, and thus current Xen 3.4 treats the ACPI C2 type in the same manner
> as the ACPI C3 type (i.e. broadcast is activated). If the user knows that
> ACPI C2 does not have that h/w limitation on their platform,
> 'lapic_timer_c2_ok' can be added in grub to deactivate the software
> mitigation.
>
> 1.3. For the remaining implementations, which support ACPI C2+ in which the
> APIC timer is stopped:
>
> 1.3.1. HPET as broadcast timer source
>
> HPET can deliver timely wakeup events to CPUs sleeping in deep C-states with
> negligible overhead, as stated earlier. But the HPET mode being used does
> make some differences worth noting:
>
> 1.3.1.1. If the h/w supports per-channel MSI delivery mode (interrupts via
> FSB), it's the best broadcast mechanism known so far. There is no side-effect
> regarding latency, and the IPIs used to broadcast the wakeup event are
> reduced by a factor of the number of available channels (each channel can
> independently serve one or several sleeping CPUs).
>
> As long as this feature is available, it is always preferred automatically.
>
> 1.3.1.2. When MSI delivery mode is absent, we have to use legacy replacement
> mode, with only one HPET channel available. Well, it's not that bad, as this
> single channel can serve all sleeping CPUs by using IPIs to wake them up.
> However, another side-effect occurs: the PIT/RTC interrupts (IRQ0/IRQ8) are
> replaced by the HPET channel. The RTC alarm feature in dom0 is then lost,
> unless we add RTC emulation between dom0's rtc module and Xen's HPET logic
> (which is not implemented so far).
>
> Due to the above side-effect, this broadcast option is disabled by default,
> and PIT broadcast is used instead. If the user is sure that he doesn't need
> the RTC alarm, the 'hpetbroadcast' grub option can be used to force-enable
> it.
>
> 1.3.2. PIT as broadcast timer source
>
> If MSI-based HPET interrupt delivery is not available, or HPET is missing
> entirely, PIT broadcast is the current default. As said earlier, PIT
> broadcast is implemented in 10ms periodic mode, which can thus incur up to
> 10ms of latency for each deep C-state entry/exit. One natural result is
> observing 'many lost ticks' in some guests.
>
> 1.4 Suggestions
>
> So, if the user doesn't care about power consumption while his platform does
> expose deep C-states, one mitigation is to add the 'max_cstate=' boot option
> to restrict the maximum allowed C-state (if limited to C2, make sure to also
> add 'lapic_timer_c2_ok' where applicable). Runtime modification of
> 'max_cstate' is possible via xenpm (patch posted on 3/24/2009, not checked
> in yet).
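Putting the options from 1.2-1.4 together, a grub entry might look like the following. This is only an illustrative sketch: the title, paths, and kernel/module names are placeholders for your own installation.

```
# Illustrative GRUB legacy entry; adjust paths and names to your setup.
title Xen 3.4
    root (hd0,0)
    kernel /boot/xen.gz max_cstate=2 lapic_timer_c2_ok
    module /boot/vmlinuz-xen console=tty0
    module /boot/initrd-xen.img
```

Append 'hpetbroadcast' to the xen.gz line as well if you don't need the dom0 RTC alarm, as described in 1.3.1.2.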
>
> If the user does care about power consumption and has no requirement for the
> RTC alarm, then always using HPET is preferred.
>
> Lastly, we could either add RTC emulation on top of HPET or enhance PIT
> broadcast to use single-shot mode, but we would like to hear comments from
> the community on whether it's worth doing. :-)
>
> 2. system time/TSC skew
>
> Similarly to the APIC timer stopping, the TSC also stops in deep C-states in
> some implementations, which requires Xen to recover the lost counts on exit
> from a deep C-state by software means. It's easy to imagine the kinds of
> errors such software methods can cause. For the details of how TSC skew can
> occur, its side-effects, and possible solutions, you can refer to our Xen
> Summit presentation:
> http://www.xen.org/files/xensummit_oracle09/XenSummit09pm.pdf
>
> Below is a brief introduction to which algorithm is used in the different
> implementations:
>
> 2.1. The best case is non-stop TSC at the h/w implementation level. For
> example, Intel Core i7 processors support this green feature, which can be
> detected via CPUID. Xen does nothing once this feature is detected, and thus
> there is no extra software-caused skew beyond dozens of cycles due to crystal
> drift.
>
> 2.2. If the TSC frequency is invariant across freq/voltage scaling (true for
> all Intel processors supporting VT-x), Xen syncs the APs' TSCs to the BSP's
> at a 1-second interval during per-cpu time calibration, and meanwhile
> recovers in a per-cpu style, where only the platform counter ticks elapsed
> since the last calibration point are compensated into the local TSC with a
> boot-time-calculated scale factor. This global synchronization, along with
> per-cpu compensation, limits TSC skew to the ns level in most cases.
>
> 2.3. If the TSC frequency varies across freq/voltage scaling, Xen only
> recovers in a per-cpu style, where only the platform counter ticks elapsed
> since the last calibration point are compensated into the local TSC with a
> local scale factor. In this manner, TSC skew across cpus accumulates and is
> easy to observe after the system has been up for some time.
>
> 2.4. Solution
>
> Once you observe obvious system time/TSC skew, and you don't particularly
> care about power consumption, then, similarly to handling broadcast latency:
>
> Limit 'max_cstate' to C1, or limit 'max_cstate' to a real C2 and give the
> 'lapic_timer_c2_ok' option.
>
> Or, better, run your workload on a newer platform with either constant TSC
> frequency or the non-stop TSC feature supported. :-)
>
> Jimmy
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel