RE: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to

To:	'Keir Fraser' <keir.fraser@xxxxxxxxxxxxx>, "Wei, Gang" <gang.wei@xxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>
Subject:	RE: [Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus
From:	"Tian, Kevin" <kevin.tian@xxxxxxxxx>
Date:	Fri, 5 Dec 2008 19:50:09 +0800
Accept-language:	en-US
Cc:
Delivery-date:	Fri, 05 Dec 2008 03:50:52 -0800
Envelope-to:	www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to:	<C55EAF42.1FE11%keir.fraser@xxxxxxxxxxxxx>
List-help:	<mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id:	Xen developer discussion <xen-devel.lists.xensource.com>
List-post:	<mailto:xen-devel@lists.xensource.com>
List-subscribe:	<http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe:	<http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References:	<0A882F4D99BBF6449D58E61AAFD7EDD601E23C35@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <C55EAF42.1FE11%keir.fraser@xxxxxxxxxxxxx>
Sender:	xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index:	AclWoeM2xc+X6mj6QOaQHsyxDmpzagAFRoNpAAEP6tAAAWmCgAAATdDsAAIJ5qA=
Thread-topic:	[Xen-devel] Re: [PATCH] CPUIDLE: revise tsc-save/restore to avoid big tsc skew between cpus

>From: Keir Fraser [mailto:keir.fraser@xxxxxxxxxxxxx] 
>Sent: Friday, December 05, 2008 6:13 PM
>On 05/12/2008 10:05, "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
>
>>> From: Tian, Kevin
>>> Sent: Friday, December 05, 2008 6:00 PM
>>> 
>>> Then if we agree always aligning TSC to absolute platform timer
>>> counter, it doesn't make difference to use cpu_khz or local tsc_scale
>>> since both are using scale factor calculated within a small period
>>> to represent the underlying crystal frequency.
>>> 
>> 
>> Let me hold back above words. As you said, cpu_khz has lower accuracy
>> by cutting down lowest bits.
>
>Yes. Also bear in mind that absolute ongoing synchronisation between TSCs
>*does not matter*. Xen will happily synchronise system time on top of
>(slowly enough, constantly enough) diverging TSCs, and of course HVM VCPUs
>re-set their guest TSC offset when moving between host CPUs.

We took measurements for the following cases (4 idle up-hvm-rhel5 guests, 2 cores):

a) disable deep C-state
b) enable deep C-state, with original tsc save/restore at each C-state
    entry/exit
c) enable deep C-state, and restore TSC based on local calibration stamp
    and tsc scale
d) enable deep C-state, and restore TSC based on monotonic platform stime
    and cpu_khz

        system time skew        TSC skew
a)      hundreds of ns          several us
b)      keeps accumulating      keeps accumulating
c)      dozens of us            keeps accumulating
d)      hundreds of ns          several us
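
To make the difference between c) and d) concrete, here is a minimal
sketch in C of the two ways to compute the TSC value to restore. It is
an illustration under stated assumptions, not the code from the patch:
the struct, its field names, and the function names are hypothetical,
and the local scale is modelled as a plain frequency in Hz rather than
as a fixed-point scale factor.

```c
#include <assert.h>
#include <stdint.h>

/* value * mul / div without 64-bit overflow, assuming (div - 1) * mul fits. */
static uint64_t scale(uint64_t value, uint64_t mul, uint64_t div)
{
    assert(div != 0);
    return (value / div) * mul + (value % div) * mul / div;
}

/* Hypothetical per-CPU calibration record; field names are illustrative. */
struct calib {
    uint64_t local_tsc_stamp;   /* TSC value at the last local calibration */
    uint64_t stime_local_stamp; /* system time in ns at that calibration */
    uint64_t tsc_hz;            /* this CPU's own calibrated TSC rate */
};

/* Option c): extrapolate from this CPU's last calibration point. */
static uint64_t tsc_option_c(const struct calib *c, uint64_t now_ns)
{
    assert(now_ns >= c->stime_local_stamp);
    return c->local_tsc_stamp +
           scale(now_ns - c->stime_local_stamp, c->tsc_hz, 1000000000ULL);
}

/* Option d): derive the TSC from platform time and the global cpu_khz. */
static uint64_t tsc_option_d(uint64_t stime_ns, uint64_t cpu_khz)
{
    /* ticks = ns * kHz / 10^6 */
    return scale(stime_ns, cpu_khz, 1000000ULL);
}
```

With d), every cpu that evaluates the formula at the same platform time
gets the same TSC value. With c), two cpus whose calibrated rates differ
slightly get different values for the same instant; in this simplified
model, which ignores periodic recalibration, a 100 Hz difference gives
1000 ticks of skew after 10 seconds.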

Large system time skew can impact both pv and hvm domains. A pv
domain will complain that time went backwards when migrating to a cpu
with a slower NOW(). An hvm domain will see delayed vpt expiration
when migrating to a slower cpu or, vice versa, missed ticks accounted
by xen in some timer modes. Both c) and d) keep the skew within a stable
range.

Large TSC skew is normally OK for a pv domain, since xen time
stamps are synced at gettimeofday and at the timer interrupt within the
pv guest. Possibly impacted are user processes which use
rdtsc directly. However, large TSC skew is really bad for an hvm
guest, especially when the guest TSC offset is never adjusted at
vcpu migration. That causes the guest itself to catch up on missing
ticks in a batch, which results in softlockup warnings or DMA
timeouts. Thus with c) we can still observe guest complaints after
running for long enough.

I'm not sure whether the guest TSC offset can be adjusted accurately,
since you first need to get the TSC skew among cores, which may require
issuing an IPI and adds extra overhead. It just gets really messy to
handle an accumulating TSC skew for an hvm guest.
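
For reference, the offset adjustment in question can be sketched as
below, assuming the usual model where guest TSC = host TSC + offset.
The function names are hypothetical. The sketch also assumes the two
host TSC values were sampled at the same instant, which is exactly the
part that needs cross-cpu coordination.

```c
#include <stdint.h>

/* Guest-visible TSC under the assumed model: host TSC plus a per-vcpu offset. */
static uint64_t guest_tsc(uint64_t host_tsc, int64_t offset)
{
    return host_tsc + (uint64_t)offset;
}

/*
 * New offset for a vcpu moving from a source cpu to a destination cpu,
 * chosen so the guest TSC is continuous across the move. Both host TSC
 * values are assumed to have been read at the same instant.
 */
static int64_t rebase_offset(int64_t old_offset,
                             uint64_t src_host_tsc, uint64_t dst_host_tsc)
{
    return old_offset + (int64_t)(src_host_tsc - dst_host_tsc);
}
```

If the destination cpu's TSC is 600000 ticks behind the source and the
offset is left alone, the guest TSC jumps back by 600000 ticks at
migration; with the rebased offset it reads the same value on both cpus.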

That's why we go with option d), which really exposes the same level
of constraints as the disabled case. This is not a perfect
solution, but it shows more stable results than the others.

>
>What *does* matter is the possibility of warping a host TSC value on wake
>from deep sleep, compared with its value if the sleep had never happened. In
>this case, system time will be wrong (since we haven't been through a
>calibration step since waking up) and HVM timers will be wrong. And using
>start-of-day timestamp plus cpu_khz makes this more likely. The correct
>thing to do is obey the most recent set of local calibration values.
>

I assume you meant S3 by "deep sleep"? If yes, I don't think it is
an issue. A sane dom0 S3 flow will only happen after the other domains
have been notified with a virtual S3 event, and thus after waking up
from dom0 S3, every domain will resume its timekeeping sub-system.

Thanks,
Kevin