WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 
 
   
 


To: Carsten Schiers <carsten@xxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>
Subject: RE: [Xen-devel] Potential side-effects and mitigations after cpuidle enabled by default
From: "Wei, Gang" <gang.wei@xxxxxxxxx>
Date: Wed, 1 Apr 2009 10:38:54 +0800
Accept-language: en-US
Cc: "Tian, Kevin" <kevin.tian@xxxxxxxxx>, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>, "Yu, Ke" <ke.yu@xxxxxxxxx>
Delivery-date: Tue, 31 Mar 2009 19:39:37 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <25710728.381238511147382.JavaMail.root@uhura>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <25710728.381238511147382.JavaMail.root@uhura>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: AcmyEF5Z3S2Y4R+MSsGLpOgDam+IAgAXDiKA
Thread-topic: [Xen-devel] Potential side-effects and mitigations after cpuidle enabled by default
On Tuesday, March 31, 2009 10:52 PM, Carsten Schiers wrote:
> Sorry for my ignorance, but as I find all this very interesting, a few
> questions prior to reading it all over. I suffer from skew after using
> cpuidle and also cpufreq management (on an AMD CPU, whose TSC frequency is
> variant across freq/voltage scaling ;-):
> 
>   - you mention lost ticks in some guests, does this include Dom0? It's where
>     my messages mainly show up.

I haven't observed any lost-ticks warnings in Dom0 on the current Xen 3.4 tip
so far.

>   - you recommend limiting cpuidle either to C1 or to C2 (in case the
>     APIC timer is not stopping). How do I know that?

You may need to refer to the processor's specification.
 
>   - xm debug-key c reports active C1, max_cstate C2, but only lists C1
>     usage. C1 Clock Ramping seems to be disabled. The platform timer is a
>     25MHz HPET. Excuse my ignorance again, but doesn't that mean I am not
>     using C-states at all?

In Xen 3.3 the C1 residency is not counted yet. max_cstate=C2 does not mean
your platform supports C2; it just means that if your platform supports
C-states deeper than C2, the deepest C-state used will be C2. I guess xm
debug-key c didn't report any C2 information (usage, residency) on your
platform, right? If so, that means your system only supports C1.

> I understand you speak about Xen 3.4. Currently I am at 3.3.1 and have to
> wait for a slot to test 3.4. I am curious to see what happens. Dan told me
> how to use xm debug-key t and said max cycles skew is so much smaller than
> max stime (xen system time) skew. This makes him believe 3.4 will help.

Yes, I also strongly suggest you try 3.4. But I don't expect much for the
variant-TSC case, just as I said in the original mail.

BTW, I believe whether cpuidle is enabled or not should have no impact on
your case. Have you checked the result with cpufreq disabled?

Thanks
Jimmy

> 
> BR,
> Carsten.
> 
> ----- Originalnachricht -----
> Von: "Wei, Gang" <gang.wei@xxxxxxxxx>
> Gesendet: Die, 31.3.2009 16:00
> An: xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>
> Cc: "Tian, Kevin" <kevin.tian@xxxxxxxxx> ; Keir Fraser
> <keir.fraser@xxxxxxxxxxxxx> ; "Yu, Ke" <ke.yu@xxxxxxxxx> Betreff: [Xen-devel]
> Potential side-effects and mitigations after cpuidle enabled by default 
> 
> In Xen 3.4, cpuidle is enabled by default as of c/s 19421. But some
> side-effects may exist under different h/w C-state implementations or h/w
> configurations, so users may occasionally observe latency or system
> time/TSC skew. Below are the conditions causing these side-effects and the
> means to mitigate them:
> 
> 1. Latency
> 
> Latency could be caused by two factors: C-state entry/exit latency, and
> extra latency caused by the broadcast mechanism.
> 
> C-state entry/exit latency is inevitable since powering gates on/off takes
> time. Normally a shallower C-state incurs lower latency but less
> power-saving capability, and vice versa for a deeper C-state. The cpuidle
> governor tries to balance the performance/power tradeoff at a high level,
> which is one area we'll continue to tune.
> 
> Broadcast is necessary to handle the APIC timer stopping in deep C-states
> (>=C3) on some platforms. One platform timer source is chosen to carry the
> per-cpu timer deadlines and wake up CPUs in deep C-states at the expected
> expiry. So far Xen 3.4 supports the PIT and the HPET as broadcast sources.
> In the current implementation, PIT broadcast runs in periodic mode (10ms),
> which means up to 10ms of extra latency can be added to the expiry expected
> by a sleeping CPU. This is just an initial implementation choice which
> could of course be enhanced to an on-demand on/off mode in the future; we
> didn't go into that complexity in the current implementation due to the
> PIT's slow access and short wrap count. So HPET broadcast, which adds
> negligible overhead and wakes CPUs on time, is always preferred once that
> facility is available. Then... the world is not always perfect, and some
> side-effects also exist with the HPET.
> 
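The cost of the periodic PIT mode versus a one-shot (HPET-style) broadcast
can be sketched with a toy simulation. This is hypothetical Python for
illustration only, not Xen code; only the 10ms period is taken from the
description above.

```python
import math

# Toy model of the two broadcast schemes described above. A periodic
# broadcast (PIT-style) only wakes sleeping CPUs on its own 10ms tick,
# so a deadline can be overshot by up to one full period; a one-shot
# broadcast (HPET-style) is reprogrammed to the earliest pending
# deadline, so the overshoot is negligible.

PERIOD_MS = 10.0  # PIT broadcast period used by Xen 3.4

def periodic_wakeup(deadline_ms):
    """First periodic tick at or after the deadline."""
    return math.ceil(deadline_ms / PERIOD_MS) * PERIOD_MS

def oneshot_wakeup(deadline_ms):
    """One-shot timer fires right at the deadline."""
    return deadline_ms

for d in [3.0, 10.0, 14.2, 27.5]:
    extra = periodic_wakeup(d) - d
    assert 0.0 <= extra < PERIOD_MS   # at most one period of slack
    assert oneshot_wakeup(d) == d     # no extra latency
```

A deadline at 14.2ms, for instance, is only served by the 20ms periodic
tick, i.e. almost 6ms late, which is the "up to 10ms" effect above.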
> Details are listed below:
> 
> 1.1. For h/w supporting ACPI C1 (halt) only (as reported by the BIOS in
> the ACPI _CST method):
> 
> It's immune from this side-effect, as only instruction execution is halted.
> 
> 1.2. For h/w supporting ACPI C2, in which the TSC and APIC timer don't stop:
> 
> The ACPI C2 type is a bit special in that it is sometimes an alias for a
> deep CPU C-state, so current Xen 3.4 treats the ACPI C2 type in the same
> manner as the ACPI C3 type (i.e. broadcast is activated). If the user
> knows that ACPI C2 does not have that h/w limitation on their platform,
> 'lapic_timer_c2_ok' can be added to the grub command line to deactivate
> the software mitigation.
> 
> 1.3. For the remaining implementations, which support ACPI C2+ states in
> which the APIC timer is stopped:
> 
> 1.3.1. HPET as broadcast timer source
> 
> The HPET can deliver timely wakeup events to CPUs sleeping in deep
> C-states with negligible overhead, as stated earlier. But the HPET mode
> being used does make some differences worth noting:
> 
> 1.3.1.1. If the h/w supports per-channel MSI delivery mode (interrupts via
> FSB), it's the best broadcast mechanism known so far. There is no side
> effect regarding latency, and the IPIs used to broadcast wakeup events can
> be reduced by a factor of the number of available channels (each channel
> can independently serve one or several sleeping CPUs).
> 
> As long as this feature is available, it is always preferred automatically.
> 
> 1.3.1.2. When MSI delivery mode is absent, we have to use legacy
> replacement mode, with only one HPET channel available. Well, it's not
> that bad, as this single channel can serve all sleeping CPUs by using IPIs
> to wake them up. However, another side-effect occurs: the PIT/RTC
> interrupts (IRQ0/IRQ8) are replaced by the HPET channel. The RTC alarm
> feature in dom0 is then lost, unless we add RTC emulation between dom0's
> rtc module and Xen's hpet logic (which is not implemented so far).
> 
> Due to the above side-effect, this broadcast option is disabled by
> default, in which case PIT broadcast is used. If the user is sure that the
> RTC alarm is not needed, the 'hpetbroadcast' grub option forces it on.
> 
> 1.3.2. PIT as broadcast timer source
> 
> If MSI-based HPET interrupt delivery is not available or the HPET is
> missing, PIT broadcast is the current default in all cases. As said
> earlier, PIT broadcast is implemented in 10ms periodic mode, which can
> thus incur up to 10ms of latency for each deep C-state entry/exit. One
> natural result is to observe 'many lost ticks' warnings in some guests.
> 
> 1.4 Suggestions
> 
> So, if the user doesn't care about power consumption while the platform
> does expose deep C-states, one mitigation is to add the 'max_cstate=' boot
> option to restrict the maximum allowed C-state (if limiting to C2, make
> sure to also add 'lapic_timer_c2_ok' where applicable). Runtime
> modification of 'max_cstate' is possible via xenpm (patch posted on
> 3/24/2009, not checked in yet).
> 
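For illustration, these boot options go on the Xen line of a grub stanza.
The kernel/initrd paths and version numbers below are placeholders for
whatever the system actually uses:

```
title Xen 3.4
    root (hd0,0)
    # limit to C1 only:
    kernel /boot/xen.gz max_cstate=1
    # or, if ACPI C2 on this platform keeps the APIC timer running:
    # kernel /boot/xen.gz max_cstate=2 lapic_timer_c2_ok
    module /boot/vmlinuz-2.6-xen root=/dev/sda1 ro
    module /boot/initrd-2.6-xen.img
```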
> If the user does care about power consumption and has no requirement for
> the RTC alarm, then always using the HPET is preferred.
> 
> Lastly, we could either add RTC emulation on top of the HPET or enhance
> PIT broadcast to use one-shot mode, but we would like to hear comments
> from the community on whether it's worth doing. :-)
> 
> 2. system time/TSC skew
> 
> Similarly to the APIC timer stopping, the TSC also stops in deep C-states
> in some implementations, which requires Xen to recover the lost counts on
> exit from a deep C-state by software means. It's easy to imagine the kinds
> of errors such software methods can cause. For the details of how TSC skew
> can occur, its side effects and possible solutions, you can refer to our
> Xen Summit presentation:
> http://www.xen.org/files/xensummit_oracle09/XenSummit09pm.pdf
> 
> Below is a brief overview of which algorithm is available on different
> implementations:
> 
> 2.1. The best case is a non-stop TSC at the h/w implementation level. For
> example, Intel Core i7 processors support this green feature, which can be
> detected via CPUID. Xen will do nothing once this feature is detected, so
> there is no extra software-caused skew besides dozens of cycles due to
> crystal drift.
> 
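On Linux, the results of that CPUID detection surface as cpu flags, so the
three cases below can be told apart without reading raw CPUID leaves. A
hedged sketch: the flag names are the ones Linux reports in /proc/cpuinfo
(`nonstop_tsc` for the invariant-TSC bit CPUID.80000007H:EDX[8],
`constant_tsc` for frequency invariance), and the classifier function
itself is hypothetical:

```python
# Sketch: decide which of the three TSC cases (2.1-2.3) applies, based
# on the CPU flags Linux derives from CPUID. "nonstop_tsc" means the
# TSC keeps counting in deep C-states; "constant_tsc" means its
# frequency does not change across freq/voltage scaling.

def tsc_case(flags):
    flags = set(flags.split())
    if "nonstop_tsc" in flags:
        return "2.1: non-stop TSC, no software compensation needed"
    if "constant_tsc" in flags:
        return "2.2: invariant frequency, global sync + per-cpu recovery"
    return "2.3: variant TSC, per-cpu recovery only, skew accumulates"

# Abbreviated sample flag strings (illustrative, not real cpuinfo dumps)
corei7     = "fpu tsc msr constant_tsc nonstop_tsc"
older_vtx  = "fpu tsc msr constant_tsc"
amd_pstate = "fpu tsc msr"

assert tsc_case(corei7).startswith("2.1")
assert tsc_case(older_vtx).startswith("2.2")
assert tsc_case(amd_pstate).startswith("2.3")
```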
> 2.2. If the TSC frequency is invariant across freq/voltage scaling (true
> for all Intel processors supporting VT-x), Xen will sync the APs' TSCs to
> the BSP's at a 1-second interval in the per-cpu time calibration, and
> meanwhile recover in a per-cpu style, where the platform counter ticks
> elapsed since the last calibration point are compensated into the local
> TSC with a boot-time-calculated scale factor. This global synchronization,
> along with the per-cpu compensation, limits TSC skew to the ns level in
> most cases.
> 
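The per-cpu compensation described above amounts to simple arithmetic. A
hypothetical sketch, not Xen code; the 2 GHz TSC rate is invented, and the
25MHz rate matches the HPET mentioned in the report above:

```python
# Sketch of the per-cpu recovery in case 2.2: TSC counts lost while the
# CPU slept are reconstructed from the platform timer (e.g. HPET) using
# a boot-time scale factor tsc_hz / platform_hz.

TSC_HZ = 2_000_000_000        # 2 GHz TSC (illustrative)
PLATFORM_HZ = 25_000_000      # 25 MHz HPET
SCALE = TSC_HZ / PLATFORM_HZ  # boot-time-calculated factor (80.0 here)

def recover_tsc(tsc_at_calibration, platform_then, platform_now):
    """Compensate the platform ticks elapsed since the last
    calibration point into the local TSC via the scale factor."""
    return tsc_at_calibration + int((platform_now - platform_then) * SCALE)

# One second of HPET ticks should map back to one second of TSC cycles.
assert recover_tsc(0, 0, PLATFORM_HZ) == TSC_HZ
```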
> 2.3. If the TSC frequency is variant across freq/voltage scaling, Xen will
> only recover in a per-cpu style, where the platform counter ticks elapsed
> since the last calibration point are compensated into the local TSC with a
> local scale factor. In this manner, TSC skew across cpus accumulates and
> is easy to observe after the system has been up for some time.
> 
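The difference between cases 2.2 and 2.3 can be seen in a toy simulation.
Everything here is hypothetical and the per-cpu scale error is wildly
exaggerated to make the effect visible; the point is only the structure,
with versus without the 1-second resync:

```python
# Toy model of why skew accumulates in case 2.3: each CPU compensates
# with its own slightly-wrong local scale factor and nothing resyncs the
# TSCs, so the error grows linearly with uptime. In case 2.2 the
# 1-second sync of the APs to the BSP bounds the error per interval.

local_scale = {0: 80.0, 1: 80.5}     # CPU1's factor is off (exaggerated)
PLATFORM_TICKS_PER_CAL = 25_000_000  # one 1s calibration interval

def run(seconds, resync_to_bsp):
    tsc = {0: 0.0, 1: 0.0}
    for _ in range(seconds):
        for cpu in tsc:
            tsc[cpu] += PLATFORM_TICKS_PER_CAL * local_scale[cpu]
        if resync_to_bsp:            # case 2.2: APs synced to the BSP
            tsc[1] = tsc[0]
    return abs(tsc[1] - tsc[0])      # cross-cpu TSC skew

assert run(60, resync_to_bsp=True) == 0.0          # case 2.2: bounded
assert run(60, resync_to_bsp=False) == 60 * 25_000_000 * 0.5  # case 2.3
```

With the resync, the skew stays bounded per interval; without it, doubling
the uptime doubles the skew, which matches the "easy to observe after the
system is up for some time" behavior above.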
> 2.4. Solution
> 
> Once you observe obvious system time/TSC skew, and you don't particularly
> care about power consumption, then, similarly to handling broadcast
> latency:
> 
> Limit 'max_cstate' to C1, or limit 'max_cstate' to a real C2 and give the
> 'lapic_timer_c2_ok' option.
> 
> Or, better, run your workload on a newer platform with either constant
> TSC frequency or the non-stop TSC feature supported. :-)
> 
> Jimmy
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
