RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Hi Dave --

Thanks for the additional explanation.

Could you please be very precise, when you say "Linux",
as to what you are (and are not) testing? Specifically:

1) kernel version number and/or distro info
2) 32 vs 64
3) kernel parameters specified
4) config file parameters
5) relevant CPU info that may be passed through by Xen
   to hvm guests (e.g. whether "tsc is synchronized")
6) relevant xen boot parameters (if any)

As we've seen, different combinations of the above can yield
very different test results. We'd like to confirm your tests,
but if we can avoid unnecessary additional iterations (due to
mismatches on the above), that would be helpful.

Our testing goal is to ensure that there is at least one
known good combination of parameters for each of RHEL3,
RHEL4, and RHEL5 (both 32 and 64) and that works on
both tsc-synchronized and tsc-unsynchronized Intel
and AMD boxes. And hopefully that works with and without
a real physical hpet available.

We don't have a good test environment for Windows time,
but if you can provide the same test configuration detail,
we may be able to do some testing.

Thanks,

Dan

-----Original Message-----
From: Dave Winchell [mailto:dwinchell@xxxxxxxxxxxxxxx]
Sent: Sunday, June 08, 2008 2:32 PM
To: dan.magenheimer@xxxxxxxxxx; Keir Fraser
Cc: Ben Guthro; xen-devel; Dave Winchell
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Hi Dan,

> While I am fully supportive of offering hardware hpet as an option
> for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> surprised by your preliminary results; the most obvious conclusion
> is that Xen system time is losing time at the rate of 1000 PPM
> though it's possible there's a bug somewhere else in the "time
> stack". Your Windows result is jaw-dropping and inexplicable,
> though I have to admit ignorance of how Windows manages time.

I think xen system time is fine. You have to add the interrupt
delivery policies described in the write-up for the patch to get
accurate timekeeping in the guest.

The windows policy is obvious and results in a large improvement
in accuracy. The Linux policy is more subtle, but is required to go
from .1% to .03%.

> I think with my recent patch and hpet=1 (essentially the same as
> your emulated hpet), hvm guest time should track Xen system time.
> I wonder if domain0 (which if I understand correctly is directly
> using Xen system time) is also seeing an error of .1%? Also,
> for the skew you are seeing (in both hvm guests and domain0),
> is time moving too fast or too slow?

I don't recall the direction. I can look it up in my notes at work
tomorrow.

> Although hwhpet=1 is a fine alternative in many cases, it may
> be unavailable on some systems and may cause significant performance
> issues on others. So I think we will still need to track down
> the poor accuracy when hwhpet=0.

Our patch is accurate to < .03% using the physical hpet mode or
the simulated mode.

> And if for some reason
> Xen system time can't be made accurate enough (< 0.05%), then
> I think we should consider building Xen system time itself on
> top of hardware hpet instead of TSC... at least when Xen discovers
> a capable hpet.

In our experience, Xen system time is accurate enough now.

> One more thought... do you know the accuracy of the TSC crystals
> on your test systems? I posted a patch a while ago that was
> intended to test that, though I guess it was only testing skew
> of different TSCs on the same system, not TSCs against an
> external time source.

I do not know the tsc accuracy.

> Or maybe there's a computation error somewhere in the hvm hpet
> scaling code? Hmmm...

Regards,
Dave

-----Original Message-----
From: Dan Magenheimer [mailto:dan.magenheimer@xxxxxxxxxx]
Sent: Fri 6/6/2008 4:29 PM
To: Dave Winchell; Keir Fraser
Cc: Ben Guthro; xen-devel
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Dave --

Thanks much for posting the preliminary results!

While I am fully supportive of offering hardware hpet as an option
for hvm guests (let's call it hwhpet=1 for shorthand), I am very
surprised by your preliminary results; the most obvious conclusion
is that Xen system time is losing time at the rate of 1000 PPM
though it's possible there's a bug somewhere else in the "time
stack". Your Windows result is jaw-dropping and inexplicable,
though I have to admit ignorance of how Windows manages time.

I think with my recent patch and hpet=1 (essentially the same as
your emulated hpet), hvm guest time should track Xen system time.
I wonder if domain0 (which if I understand correctly is directly
using Xen system time) is also seeing an error of .1%? Also,
for the skew you are seeing (in both hvm guests and domain0),
is time moving too fast or too slow?

Although hwhpet=1 is a fine alternative in many cases, it may
be unavailable on some systems and may cause significant performance
issues on others. So I think we will still need to track down
the poor accuracy when hwhpet=0. And if for some reason
Xen system time can't be made accurate enough (< 0.05%), then
I think we should consider building Xen system time itself on
top of hardware hpet instead of TSC... at least when Xen discovers
a capable hpet.
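
As a quick sanity check on the units used in this thread, here is a
minimal Python sketch that converts a drift percentage to PPM and to
seconds gained or lost per day. The function names are illustrative
only and are not part of any patch.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def percent_to_ppm(percent):
    """Convert a clock drift in percent to parts per million (1% = 10,000 PPM)."""
    return percent * 10_000


def drift_seconds_per_day(percent):
    """Seconds gained or lost over one day at the given drift percentage."""
    return SECONDS_PER_DAY * percent / 100


if __name__ == "__main__":
    # Figures quoted in this thread: unpatched Linux guest, the ntp limit,
    # and the unpatched Windows vista64 guest.
    figures = [
        ("linux 64, unpatched", 0.1),
        ("ntp limit", 0.05),
        ("vista64, unpatched", 11.0),
    ]
    for label, pct in figures:
        print(f"{label}: {percent_to_ppm(pct):.0f} PPM, "
              f"{drift_seconds_per_day(pct):.1f} s/day")
```

By this arithmetic .1% is 1000 PPM (86.4 seconds per day) and the
.05% ntp limit is 500 PPM (43.2 seconds per day).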

One more thought... do you know the accuracy of the TSC crystals
on your test systems? I posted a patch a while ago that was
intended to test that, though I guess it was only testing skew
of different TSCs on the same system, not TSCs against an
external time source.

Or maybe there's a computation error somewhere in the hvm hpet
scaling code? Hmmm...

Thanks,
Dan

> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@xxxxxxxxxxxxxxx]
> Sent: Friday, June 06, 2008 1:33 PM
> To: dan.magenheimer@xxxxxxxxxx; Keir Fraser
> Cc: Ben Guthro; xen-devel; Dave Winchell
> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
>
> Dan, Keir:
>
> Preliminary test results indicate an error of .1% for Linux 64 bit
> guests configured for hpet with xen-unstable as is. As we have
> discussed many times, the ntp requirement is .05%.
> Tests on the patch we just submitted for hpet have indicated errors
> of .0012% on this platform under similar test conditions and .03%
> on other platforms.
>
> Windows vista64 has an error of 11% using hpet with the
> xen-unstable bits.
> In an overnight test with our hpet patch, the Windows vista
> error was .008%.
>
> The tests are with two or three guests on a physical node, all under
> load, and with the ratio of vcpus to phys cpus > 1.
>
> I will continue to run tests over the next few days.
>
> thanks,
> Dave
>
>
> Dan Magenheimer wrote:
>
> > Hi Dave and Ben --
> >
> > When running tests on xen-unstable (without your patch), please
> > ensure that hpet=1 is set in the hvm config. Also, I think that when
> > hpet is the clocksource on RHEL4-32, the clock IS resilient to missed
> > ticks, so timer_mode should be 2 (vs. when pit is the clocksource on
> > RHEL4-32, all clock ticks must be delivered and so timer_mode should
> > be 0).
> >
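
Assuming an xm-style HVM guest config file (which uses Python syntax),
the settings described above might look like the fragment below. Only
the two option names and their values are taken from this message; the
rest of the guest config is omitted.

```python
# Fragment of an HVM guest config file (assumed xm-style, Python syntax).
hpet = 1        # expose the emulated hpet to the guest
timer_mode = 2  # guest is resilient to missed ticks (hpet clocksource, RHEL4-32)
```

With pit as the clocksource on RHEL4-32, the same message suggests
timer_mode = 0 instead, so that all clock ticks are delivered.
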
> > Per
> > http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
> > it's my intent to clean this up, but I won't get to it until next week.
> >
> > Thanks,
> > Dan
> >
> >     -----Original Message-----
> >     From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
> >     [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of
> >     Dave Winchell
> >     Sent: Friday, June 06, 2008 4:46 AM
> >     To: Keir Fraser; Ben Guthro; xen-devel
> >     Cc: dan.magenheimer@xxxxxxxxxx; Dave Winchell
> >     Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> >     Keir,
> >
> >     I think the changes are required. We'll run some tests today so
> >     that we have some data to talk about.
> >
> >     -Dave
> >
> >
> >     -----Original Message-----
> >     From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx on behalf of Keir Fraser
> >     Sent: Fri 6/6/2008 4:58 AM
> >     To: Ben Guthro; xen-devel
> >     Cc: dan.magenheimer@xxxxxxxxxx
> >     Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> >     Are these patches needed now the timers are built on Xen system
> >     time rather than host TSC? Dan has reported much better
> >     time-keeping with his patch checked in, and it's for sure a lot
> >     less invasive than this patchset.
> >
> >
> >      -- Keir
> >
> >     On 5/6/08 15:59, "Ben Guthro" <bguthro@xxxxxxxxxxxxxxx> wrote:
> >
> >     >
> >     > 1. Introduction
> >     >
> >     > This patch improves the hpet based guest clock in terms of drift and
> >     > monotonicity. Prior to this work the drift with hpet was greater than
> >     > 2%, far above the .05% limit for ntp to synchronize. With this code,
> >     > the drift ranges from .001% to .0033% depending on guest and physical
> >     > platform.
> >     >
> >     > Using hpet allows guest operating systems to provide monotonic time to
> >     > their applications. Time sources other than hpet are not monotonic
> >     > because of their reliance on tsc, which is not synchronized across
> >     > physical processors.
> >     >
> >     > Windows 2k8-64 and many Linux guests are supported with two policies,
> >     > one for guests that handle missed clock interrupts and the other for
> >     > guests that require the correct number of interrupts.
> >     >
> >     > Guests may use hpet for the timing source even if the physical platform
> >     > has no visible hpet. Migration is supported between physical machines
> >     > which differ in physical hpet visibility.
> >     >
> >     > Most of the changes are in hpet.c. Two general facilities are added to
> >     > track interrupt progress. The ideas here and the facilities would be
> >     > useful in vpt.c, for other time sources, though no attempt is made
> >     > here to improve vpt.c.
> >     >
> >     > The following sections discuss hpet dependencies, interrupt delivery
> >     > policies, live migration, test results, and relation to recent work
> >     > with monotonic time.
> >     >
> >     >
> >     > 2. Virtual Hpet dependencies
> >     >
> >     > The virtual hpet depends on the ability to read the physical or
> >     > simulated (see discussion below) hpet. For timekeeping, the virtual
> >     > hpet also depends on two new interrupt notification facilities to
> >     > implement its policies for interrupt delivery.
> >     >
> >     > 2.1. Two modes of low-level hpet main counter reads.
> >     >
> >     > In this implementation, the virtual hpet uses read_64_main_counter(),
> >     > exported by time.c, to read either the real physical hpet main counter
> >     > register directly or a "simulated" hpet main counter.
> >     >
> >     > The simulated mode uses a monotonic version of get_s_time() (NOW()),
> >     > where the last time value is returned whenever the current time value
> >     > is less than the last time value. In simulated mode, since it is
> >     > layered on s_time, the underlying hardware can be hpet or some other
> >     > device. The frequency of the main counter in simulated mode is the
> >     > same as the standard physical hpet frequency, allowing live migration
> >     > between nodes that are configured differently.
> >     >
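
The "monotonic version" of the time read described above can be
sketched in a few lines of Python. This is only a model of the stated
rule (return the last value whenever the new reading is lower), not
the code in time.c; the class name and the read_raw callable are
hypothetical, and real hypervisor code would also need locking.

```python
class MonotonicCounter:
    """Wrap a raw time source so that readings never go backwards."""

    def __init__(self, read_raw):
        self._read_raw = read_raw  # hypothetical stand-in for get_s_time()
        self._last = None

    def read(self):
        now = self._read_raw()
        if self._last is not None and now < self._last:
            # Raw source stepped backwards: repeat the last value instead.
            return self._last
        self._last = now
        return now


# Raw samples with one backwards step (105 -> 103).
samples = iter([100, 105, 103, 110])
counter = MonotonicCounter(lambda: next(samples))
readings = [counter.read() for _ in range(4)]
```

Here readings comes out as [100, 105, 105, 110]. The repeated 105 is
the same effect as the disadvantage noted for simulated mode: two
consecutive reads can return the same counter value.
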
> >     > If the physical platform does not have an hpet device, or if xen is
> >     > configured not to use the device, then the simulated method is used.
> >     > If there is a physical hpet device, and xen has initialized it, then
> >     > either simulated or physical mode can be used. This is governed by a
> >     > boot time option, hpet-avoid. Setting this option to 1 gives the
> >     > simulated mode and 0 the physical mode. The default is physical mode.
> >     >
> >     > A disadvantage of the physical mode is that it may take longer to read
> >     > the device than in simulated mode. On some platforms the cost is about
> >     > the same (less than 250 nsec) for physical and simulated modes, while
> >     > on others the physical cost is much higher than the simulated cost.
> >     > A disadvantage of the simulated mode is that it can return the same
> >     > value for the counter in consecutive calls.
> >     >
> >     > 2.2. Interrupt notification facilities.
> >     >
> >     > Two interrupt notification facilities are introduced: one is
> >     > hvm_isa_irq_assert_cb() and the other hvm_register_intr_en_notif().
> >     >
> >     > The vhpet uses hvm_isa_irq_assert_cb() to deliver interrupts to the
> >     > vioapic. hvm_isa_irq_assert_cb() allows a callback to be passed along
> >     > to vioapic_deliver(), and this callback is called with a mask of the
> >     > vcpus which will get the interrupt. This callback is made before any
> >     > vcpus receive an interrupt.
> >     >
> >     > Vhpet uses hvm_register_intr_en_notif() to register a handler for a
> >     > particular vector that will be called when that vector is injected in
> >     > [vmx,svm]_intr_assist() and also when the guest finishes handling the
> >     > interrupt. Here finished is defined as the point when the guest
> >     > re-enables interrupts or lowers the tpr value. EOI is not used as the
> >     > end of interrupt as this is sometimes returned before the interrupt
> >     > handler has done its work. A flag is passed to the handler indicating
> >     > whether this is the injection point (post = 1) or the interrupt
> >     > finished (post = 0) point. The need for the finished point callback is
> >     > discussed in the missed ticks policy section.
> >     >
> >     > To prevent a possible early trigger of the finished callback,
> >     > intr_en_notif logic has a two stage arm, the first at injection
> >     > (hvm_intr_en_notif_arm()) and the second when interrupts are seen to
> >     > be disabled (hvm_intr_en_notif_disarm()). Once fully armed,
> >     > re-enabling interrupts will cause hvm_intr_en_notif_disarm() to make
> >     > the end of interrupt callback. hvm_intr_en_notif_arm() and
> >     > hvm_intr_en_notif_disarm() are called by [vmx,svm]_intr_assist().
> >     >
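
The two stage arm can be pictured as a small state machine. The Python
sketch below is a simplified model under stated assumptions: the class,
state, and method names are hypothetical, the tpr-lowering case is left
out, and the guest's interrupt-enable state is passed in as a plain
boolean rather than read from vmx/svm state.

```python
from enum import Enum


class NotifState(Enum):
    IDLE = 0
    INJECTED = 1     # first stage: vector has been injected
    FULLY_ARMED = 2  # second stage: guest seen with interrupts disabled


class IntrEnNotif:
    def __init__(self, handler):
        self.handler = handler  # called with post=1 (injected) or post=0 (finished)
        self.state = NotifState.IDLE

    def on_inject(self):
        """First stage of the arm, at injection time."""
        self.state = NotifState.INJECTED
        self.handler(1)

    def on_guest_check(self, interrupts_enabled):
        """Called each time the guest's interrupt-enable state is observed."""
        if self.state is NotifState.INJECTED and not interrupts_enabled:
            # Second stage: the guest has entered its handler.
            self.state = NotifState.FULLY_ARMED
        elif self.state is NotifState.FULLY_ARMED and interrupts_enabled:
            # Guest re-enabled interrupts: treat the interrupt as finished.
            self.state = NotifState.IDLE
            self.handler(0)
```

In this model, seeing interrupts enabled right after injection (before
the guest has actually entered its handler) does not fire the finished
callback, which is the early trigger the two stages are meant to
prevent.
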
> >     > 3. Interrupt delivery policies
> >     >
> >     > The existing hpet interrupt delivery is preserved. This includes vcpu
> >     > round robin delivery used by Linux and broadcast delivery used by
> >     > Windows.
> >     >
> >     > There are two policies for interrupt delivery, one for Windows 2k8-64
> >     > and the other for Linux. The Linux policy takes advantage of the
> >     > (guest) Linux missed tick and offset calculations and does not attempt
> >     > to deliver the right number of interrupts. The Windows policy delivers
> >     > the correct number of interrupts, even if sometimes much closer to
> >     > each other than the period. The policies are similar to those in
> >     > vpt.c, though there are some important differences.
> >     >
> >     > Policies are selected with an HVMOP_set_param hypercall with index
> >     > HVM_PARAM_TIMER_MODE. Two new values are added,
> >     > HVM_HPET_guest_computes_missed_ticks and
> >     > HVM_HPET_guest_does_not_compute_missed_ticks. The reason that two new
> >     > ones are added is that in some guests (32bit Linux) a no-missed policy
> >     > is needed for clock sources other than hpet and a missed ticks policy
> >     > for hpet. It was felt that there would be less confusion by simply
> >     > introducing the two hpet policies.
> >     >
> >     > 3.1. The missed ticks policy
> >     >
> >     > The Linux clock interrupt handler for hpet calculates missed ticks and
> >     > offset using the hpet main counter. The algorithm works well when the
> >     > time since the last interrupt is greater than or equal to a period and
> >     > poorly otherwise.
> >     >
> >     > The missed ticks policy ensures that no two clock interrupts are
> >     > delivered to the guest at a time interval less than a period. A time
> >     > stamp (hpet main counter value) is recorded (by a callback registered
> >     > with hvm_register_intr_en_notif()) when Linux finishes handling the
> >     > clock interrupt. Then, ensuing interrupts are delivered to the vioapic
> >     > only if the current main counter value is at least a period greater
> >     > than when the last interrupt was handled.
> >     >
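
The gating rule of the missed ticks policy can be modelled as follows.
This is a sketch, not the hpet.c code: the class and method names are
hypothetical, counter values are plain integers, and ">= period" is
this sketch's reading of "a period greater than".

```python
class MissedTicksGate:
    """Deliver a clock interrupt only if a full period has passed since the
    guest finished handling the previous one."""

    def __init__(self, period):
        self.period = period      # timer period in main counter ticks
        self.last_handled = None  # counter value at end of last interrupt

    def on_interrupt_finished(self, counter_now):
        # Models the post=0 callback: stamp the end of interrupt handling.
        self.last_handled = counter_now

    def should_deliver(self, counter_now):
        if self.last_handled is None:
            return True  # nothing delivered yet
        return counter_now - self.last_handled >= self.period


gate = MissedTicksGate(period=100)
first = gate.should_deliver(0)      # no history, so deliver
gate.on_interrupt_finished(30)      # guest handler finished at counter 30
too_soon = gate.should_deliver(100) # only 70 ticks since the end stamp
on_time = gate.should_deliver(130)  # a full period since the end stamp
```

Here first and on_time are True and too_soon is False: the timer
deadline at counter 100 is skipped because the guest only finished the
previous interrupt at 30, and Linux is left to account for the missed
tick itself.
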
> >     > Tests showed a significant improvement in clock drift with end of
> >     > interrupt time stamps versus beginning of interrupt[1]. It is believed
> >     > that the reason for the improvement is that the clock interrupt
> >     > handler goes for a spinlock and can therefore be delayed in its
> >     > processing. Furthermore, the main counter is read by the guest under
> >     > the lock. The net effect is that if we time stamp injection, we can
> >     > get the difference in time between successive interrupt handler lock
> >     > acquisitions to be less than the period.
> >     >
> >     > 3.2. The no-missed ticks policy
> >     >
> >     > Windows 2k8-64 keeps very poor time with the missed ticks policy, so
> >     > the no-missed ticks policy was developed. In the no-missed ticks
> >     > policy we deliver the correct number of interrupts, even if they are
> >     > spaced less than a period apart (when catching up).
> >     >
> >     > Windows 2k8-64 uses a broadcast mode in the interrupt routing such that
> >     > all vcpus get the clock interrupt. The best Windows drift performance
> >     > was achieved when the policy code ensured that all the previous
> >     > interrupts (on the various vcpus) had been injected before injecting
> >     > the next interrupt to the vioapic.
> >     >
> >     > The policy code works as follows. It uses hvm_isa_irq_assert_cb() to
> >     > record the vcpus to be interrupted in h->hpet.pending_mask. Then, in
> >     > the callback registered with hvm_register_intr_en_notif(), at post=1
> >     > time it clears the current vcpu in the pending_mask. When the
> >     > pending_mask is clear it decrements hpet.intr_pending_nr and, if
> >     > intr_pending_nr is still non-zero, posts another interrupt to the
> >     > ioapic with hvm_isa_irq_assert_cb(). intr_pending_nr is incremented in
> >     > hpet_route_decision_not_missed_ticks().
> >     >
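
The bookkeeping described above can be modelled in Python as below.
The field names pending_mask and intr_pending_nr follow the write-up;
the class, its methods, and the "posted" counter are hypothetical. The
write-up does not say exactly when the first interrupt of a backlog is
posted, so this sketch assumes it is posted as soon as a tick arrives
with nothing outstanding.

```python
class NoMissedTicksPolicy:
    def __init__(self, vcpus):
        self.vcpus = list(vcpus)  # vcpu ids that receive the broadcast
        self.pending_mask = 0     # vcpus still waiting for injection
        self.intr_pending_nr = 0  # ticks owed to the guest
        self.posted = 0           # interrupts posted to the modelled vioapic

    def _post(self):
        # Models hvm_isa_irq_assert_cb(): its callback reports the target mask.
        self.pending_mask = sum(1 << v for v in self.vcpus)
        self.posted += 1

    def timer_period_elapsed(self):
        # Models the increment in hpet_route_decision_not_missed_ticks().
        self.intr_pending_nr += 1
        if self.intr_pending_nr == 1 and self.pending_mask == 0:
            self._post()  # assumption: post at once when nothing is outstanding

    def on_injected(self, vcpu):
        # Models the intr_en_notif callback at post=1 time.
        bit = 1 << vcpu
        if not self.pending_mask & bit:
            return  # not waiting on this vcpu
        self.pending_mask &= ~bit
        if self.pending_mask == 0:
            self.intr_pending_nr -= 1
            if self.intr_pending_nr > 0:
                self._post()  # catch up: next interrupt goes out right away
```

With two vcpus and three periods elapsing before any injection, the
model posts exactly three interrupts in total, each one only after
both vcpus have taken the previous one.
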
> >     > The missed ticks policy intr_en_notif callback also uses the
> >     > pending_mask method. So even though Linux does not broadcast its
> >     > interrupts, the code could handle it if it did. In this case the end
> >     > of interrupt time stamp is made when the pending_mask is clear.
> >     >
> >     > 4. Live Migration
> >     >
> >     > Live migration with hpet preserves the current offset of the guest
> >     > clock with respect to ntp. This is accomplished by migrating all of
> >     > the state in the h->hpet data structure in the usual way. The
> >     > hp->mc_offset is recalculated on the receiving node so that the guest
> >     > sees a continuous hpet main counter.
> >     >
> >     > Code has been added to xc_domain_save.c to send a small message after
> >     > the domain context is sent. The content of the message is the
> >     > physical tsc timestamp, last_tsc, read just before the message is
> >     > sent. When the last_tsc message is received in xc_domain_restore.c,
> >     > another physical tsc timestamp, cur_tsc, is read. The two timestamps
> >     > are loaded into the domain structure as last_tsc_sender and
> >     > first_tsc_receiver with hypercalls. Then xc_domain_hvm_setcontext is
> >     > called so that hpet_load has access to these time stamps. hpet_load
> >     > uses the timestamps to account for the time spent saving and loading
> >     > the domain context. With this technique, the only neglected time is
> >     > the time spent sending a small network message.
> >     >
> >     > 5. Test Results
> >     >
> >     > Some recent test results are:
> >     >
> >     > 5.1 Linux 4u6-64 and Windows 2k8-64 load test
> >     >       Duration: 70 hours
> >     >       Test date: 6/2/08
> >     >       Loads: usex -b48 on Linux; burn-in on Windows
> >     >       Guest vcpus: 8 for Linux; 2 for Windows
> >     >       Hardware: 8 physical cpu AMD
> >     >       Clock drift: Linux: .0012% Windows: .009%
> >     >
> >     > 5.2 Linux 4u6-64, Linux 4u4-64, and Windows 2k8-64 no-load test
> >     >       Duration: 23 hours
> >     >       Test date: 6/3/08
> >     >       Loads: none
> >     >       Guest vcpus: 8 for each Linux; 2 for Windows
> >     >       Hardware: 4 physical cpu AMD
> >     >       Clock drift: Linux: .033% Windows: .019%
> >     >
> >     > 6. Relation to recent work in xen-unstable
> >     >
> >     > There is a similarity between hvm_get_guest_time() in xen-unstable and
> >     > read_64_main_counter() in this code. However, read_64_main_counter()
> >     > is more tuned to the needs of hpet.c. It has no "set" operation, only
> >     > the get. It isolates the mode, physical or simulated, in
> >     > read_64_main_counter() itself. It uses no vcpu or domain state as it
> >     > is a physical entity, in either mode. And it provides a real physical
> >     > mode for every read for those applications that desire this.
> >     >
> >     > 7. Conclusion
> >     >
> >     > The virtual hpet is improved by this patch in terms of accuracy and
> >     > monotonicity. Tests performed to date verify this and more testing is
> >     > under way.
> >     >
> >     > 8. Future Work
> >     >
> >     > Testing with Windows Vista will be performed soon. The reason for
> >     > accuracy variations on different platforms using the physical hpet
> >     > device will be investigated. Additional overhead measurements on
> >     > simulated vs physical hpet mode will be made.
> >     >
> >     > Footnotes:
> >     >
> >     > 1. I don't recall the accuracy improvement with end of interrupt
> >     > stamping, but it was significant, perhaps better than a two to one
> >     > improvement. It would be a very simple matter to re-measure the
> >     > improvement as the facility can call back at injection time as well.
> >     >
> >     >
> >     > Signed-off-by: Dave Winchell <dwinchell@xxxxxxxxxxxxxxx>
> >     > Signed-off-by: Ben Guthro <bguthro@xxxxxxxxxxxxxxx>
> >     >
> >     >
> >     > _______________________________________________
> >     > Xen-devel mailing list
> >     > Xen-devel@xxxxxxxxxxxxxxxxxxx
> >     > http://lists.xensource.com/xen-devel
> >
> >
> >
>
>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
