Hi Dan,
> While I am fully supportive of offering hardware hpet as an option
> for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> surprised by your preliminary results; the most obvious conclusion
> is that Xen system time is losing time at the rate of 1000 PPM
> though it's possible there's a bug somewhere else in the "time
> stack". Your Windows result is jaw-dropping and inexplicable,
> though I have to admit ignorance of how Windows manages time.
I think Xen system time is fine. You have to add the interrupt
delivery policies described in the write-up for the patch to get
accurate timekeeping in the guest.
The Windows policy is obvious and results in a large improvement
in accuracy. The Linux policy is more subtle, but is required to go
from .1% to .03%.
> I think with my recent patch and hpet=1 (essentially the same as
> your emulated hpet), hvm guest time should track Xen system time.
> I wonder if domain0 (which if I understand correctly is directly
> using Xen system time) is also seeing an error of .1%? Also
> I wonder for the skew you are seeing (in both hvm guests and
> domain0) is time moving too fast or too slow?
I don't recall the direction. I can look it up in my notes at work
tomorrow.
> Although hwhpet=1 is a fine alternative in many cases, it may
> be unavailable on some systems and may cause significant performance
> issues on others. So I think we will still need to track down
> the poor accuracy when hwhpet=0.
Our patch is accurate to < .03% using the physical hpet mode or
the simulated mode.
> And if for some reason
> Xen system time can't be made accurate enough (< 0.05%), then
> I think we should consider building Xen system time itself on
> top of hardware hpet instead of TSC... at least when Xen discovers
> a capable hpet.
In our experience, Xen system time is accurate enough now.
> One more thought... do you know the accuracy of the TSC crystals
> on your test systems? I posted a patch awhile ago that was
> intended to test that, though I guess it was only testing skew
> of different TSCs on the same system, not TSCs against an
> external time source.
I do not know the tsc accuracy.
> Or maybe there's a computation error somewhere in the hvm hpet
> scaling code? Hmmm...
Regards,
Dave
-----Original Message-----
From: Dan Magenheimer [mailto:dan.magenheimer@xxxxxxxxxx]
Sent: Fri 6/6/2008 4:29 PM
To: Dave Winchell; Keir Fraser
Cc: Ben Guthro; xen-devel
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
Dave --
Thanks much for posting the preliminary results!
While I am fully supportive of offering hardware hpet as an option
for hvm guests (let's call it hwhpet=1 for shorthand), I am very
surprised by your preliminary results; the most obvious conclusion
is that Xen system time is losing time at the rate of 1000 PPM
though it's possible there's a bug somewhere else in the "time
stack". Your Windows result is jaw-dropping and inexplicable,
though I have to admit ignorance of how Windows manages time.
I think with my recent patch and hpet=1 (essentially the same as
your emulated hpet), hvm guest time should track Xen system time.
I wonder if domain0 (which if I understand correctly is directly
using Xen system time) is also seeing an error of .1%? Also
I wonder for the skew you are seeing (in both hvm guests and
domain0) is time moving too fast or too slow?
Although hwhpet=1 is a fine alternative in many cases, it may
be unavailable on some systems and may cause significant performance
issues on others. So I think we will still need to track down
the poor accuracy when hwhpet=0. And if for some reason
Xen system time can't be made accurate enough (< 0.05%), then
I think we should consider building Xen system time itself on
top of hardware hpet instead of TSC... at least when Xen discovers
a capable hpet.
One more thought... do you know the accuracy of the TSC crystals
on your test systems? I posted a patch awhile ago that was
intended to test that, though I guess it was only testing skew
of different TSCs on the same system, not TSCs against an
external time source.
Or maybe there's a computation error somewhere in the hvm hpet
scaling code? Hmmm...
Thanks,
Dan
> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@xxxxxxxxxxxxxxx]
> Sent: Friday, June 06, 2008 1:33 PM
> To: dan.magenheimer@xxxxxxxxxx; Keir Fraser
> Cc: Ben Guthro; xen-devel; Dave Winchell
> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
>
> Dan, Keir:
>
> Preliminary test results indicate an error of .1% for Linux 64-bit
> guests configured
> for hpet with xen-unstable as is. As we have discussed many times, the
> ntp requirement is .05%.
> Tests on the patch we just submitted for hpet have indicated errors of
> .0012%
> on this platform under similar test conditions and .03% on
> other platforms.
>
> Windows vista64 has an error of 11% using hpet with the
> xen-unstable bits.
> In an overnight test with our hpet patch, the Windows vista
> error was .008%.
>
> The tests are with two or three guests on a physical node, all under
> load, and with
> the ratio of vcpus to phys cpus > 1.
>
> I will continue to run tests over the next few days.
>
> thanks,
> Dave
>
>
> Dan Magenheimer wrote:
>
> > Hi Dave and Ben --
> >
> > When running tests on xen-unstable (without your patch),
> please ensure
> > that hpet=1 is set in the hvm config and also I think that when hpet
> > is the clocksource on RHEL4-32, the clock IS resilient to
> missed ticks
> > so timer_mode should be 2 (vs when pit is the clocksource
> on RHEL4-32,
> > all clock ticks must be delivered and so timer_mode should be 0).
> >
> > Per
> >
> > > http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
> > > it's my intent to clean this up, but I won't get to it until next week.
> >
> > Thanks,
> > Dan
> >
> > -----Original Message-----
> > From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
> > [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Dave Winchell
> > Sent: Friday, June 06, 2008 4:46 AM
> > To: Keir Fraser; Ben Guthro; xen-devel
> > Cc: dan.magenheimer@xxxxxxxxxx; Dave Winchell
> > Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> > Keir,
> >
> > I think the changes are required. We'll run some tests
> today so
> > that we have some data to talk about.
> >
> > -Dave
> >
> >
> > -----Original Message-----
> > From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx on behalf
> of Keir Fraser
> > Sent: Fri 6/6/2008 4:58 AM
> > To: Ben Guthro; xen-devel
> > Cc: dan.magenheimer@xxxxxxxxxx
> > Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> > Are these patches needed now the timers are built on Xen system
> > time rather
> > than host TSC? Dan has reported much better
> time-keeping with his
> > patch
> > checked in, and it's for sure a lot less invasive than
> this patchset.
> >
> >
> > -- Keir
> >
> > On 5/6/08 15:59, "Ben Guthro" <bguthro@xxxxxxxxxxxxxxx> wrote:
> >
> > >
> > > 1. Introduction
> > >
> > > This patch improves the hpet based guest clock in
> terms of drift and
> > > monotonicity.
> > > Prior to this work the drift with hpet was greater
> than 2%, far
> > above the .05%
> > > limit
> > > for ntp to synchronize. With this code, the drift ranges from
> > .001% to .0033%
> > > depending
> > > on guest and physical platform.
> > >
> > > Using hpet allows guest operating systems to provide monotonic
> > time to their
> > > applications. Time sources other than hpet are not
> monotonic because
> > > of their reliance on tsc, which is not synchronized
> across physical
> > > processors.
> > >
> > > Windows 2k864 and many Linux guests are supported with two
> > policies, one for
> > > guests
> > > that handle missed clock interrupts and the other for guests
> > that require the
> > > correct number of interrupts.
> > >
> > > Guests may use hpet for the timing source even if the physical
> > platform has no
> > > visible
> > > hpet. Migration is supported between physical machines which
> > differ in
> > > physical
> > > hpet visibility.
> > >
> > > Most of the changes are in hpet.c. Two general facilities are
> > added to track
> > > interrupt
> > > progress. The ideas here and the facilities would be useful in
> > vpt.c, for
> > > other time
> > > sources, though no attempt is made here to improve vpt.c.
> > >
> > > The following sections discuss hpet dependencies, interrupt
> > delivery policies,
> > > live migration,
> > > test results, and relation to recent work with monotonic time.
> > >
> > >
> > > 2. Virtual Hpet dependencies
> > >
> > > The virtual hpet depends on the ability to read the
> physical or
> > simulated
> > > (see discussion below) hpet. For timekeeping, the
> virtual hpet
> > also depends
> > > on two new interrupt notification facilities to implement its
> > policies for
> > > interrupt delivery.
> > >
> > > 2.1. Two modes of low-level hpet main counter reads.
> > >
> > > In this implementation, the virtual hpet reads with
> > read_64_main_counter(),
> > > exported by
> > > time.c, either the real physical hpet main counter register
> > directly or a
> > > "simulated"
> > > hpet main counter.
> > >
> > > The simulated mode uses a monotonic version of get_s_time()
> > (NOW()), where the
> > > last
> > > time value is returned whenever the current time value is less
> > than the last
> > > time
> > > value. In simulated mode, since it is layered on s_time, the
> > underlying
> > > hardware
> > > can be hpet or some other device. The frequency of the main
> > counter in
> > > simulated
> > > mode is the same as the standard physical hpet frequency,
> > allowing live
> > > migration
> > > between nodes that are configured differently.
> > >
> > > If the physical platform does not have an hpet
> device, or if xen
> > is configured
> > > not
> > > to use the device, then the simulated method is used. If there
> > is a physical
> > > hpet device,
> > > and xen has initialized it, then either simulated or physical
> > mode can be
> > > used.
> > > This is governed by a boot time option, hpet-avoid.
> Setting this
> > option to 1
> > > gives the
> > > simulated mode and 0 the physical mode. The default
> is physical
> > mode.
> > >
> > > A disadvantage of the physical mode is that it may take longer to
> > read the device
> > > than in simulated mode. On some platforms the cost is
> about the
> > same (less
> > > than 250 nsec) for
> > > physical and simulated modes, while on others physical cost is
> > much higher
> > > than simulated.
> > > A disadvantage of the simulated mode is that it can return the
> > same value
> > > for the counter in consecutive calls.
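> > >
> > > As a rough illustration, the simulated mode can be thought of as a
> > > monotonic clamp layered over get_s_time(), scaled to the standard hpet
> > > frequency. The sketch below is only that, a sketch: now_ns(),
> > > read_hpet_register() and the locking a real SMP implementation would
> > > need are placeholders, not the actual time.c code.
> > >
> > >     #include <stdint.h>
> > >
> > >     #define HPET_PERIOD_FS  69841279ULL  /* ~14.318 MHz standard hpet period, in fs */
> > >
> > >     static int opt_hpet_avoid;           /* boot option: 1 = simulated, 0 = physical */
> > >     static uint64_t last_sim_value;      /* monotonic clamp state */
> > >
> > >     extern uint64_t now_ns(void);              /* stand-in for get_s_time()/NOW() */
> > >     extern uint64_t read_hpet_register(void);  /* stand-in for a raw main counter read */
> > >
> > >     uint64_t read_64_main_counter(void)
> > >     {
> > >         uint64_t v;
> > >
> > >         if ( !opt_hpet_avoid )
> > >             return read_hpet_register();       /* physical mode: read the device */
> > >
> > >         /* simulated mode: scale system time (ns) to hpet ticks ... */
> > >         v = now_ns() * 1000000ULL / HPET_PERIOD_FS;
> > >
> > >         /* ... and never let the returned value go backwards */
> > >         if ( v < last_sim_value )
> > >             v = last_sim_value;
> > >         last_sim_value = v;
> > >         return v;
> > >     }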
> > >
> > > 2.2. Interrupt notification facilities.
> > >
> > > Two interrupt notification facilities are introduced, one is
> > > hvm_isa_irq_assert_cb()
> > > and the other hvm_register_intr_en_notif().
> > >
> > > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to
> > the vioapic.
> > > hvm_isa_irq_assert_cb allows a callback to be passed along to
> > > vioapic_deliver()
> > > and this callback is called with a mask of the vcpus
> which will
> > get the
> > > interrupt. This callback is made before any vcpus receive an
> > interrupt.
> > >
> > > Vhpet uses hvm_register_intr_en_notif() to register a handler
> > for a particular
> > > vector that will be called when that vector is injected in
> > > [vmx,svm]_intr_assist()
> > > and also when the guest finishes handling the interrupt. Here
> > finished is
> > > defined
> > > as the point when the guest re-enables interrupts or
> lowers the
> > tpr value.
> > > EOI is not used as the end of interrupt as this is sometimes
> > returned before
> > > the interrupt handler has done its work. A flag is
> passed to the
> > handler
> > > indicating
> > > whether this is the injection point (post = 1) or the
> interrupt
> > finished (post
> > > = 0) point.
> > > The need for the finished point callback is discussed in the
> > missed ticks
> > > policy section.
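> > >
> > > To make the shape of the notification concrete, here is a rough sketch
> > > of the handler the vhpet might register; the prototype and the struct
> > > are invented for this description and are not the exact code in the
> > > patch.
> > >
> > >     #include <stdint.h>
> > >
> > >     uint64_t read_64_main_counter(void);
> > >
> > >     /* hypothetical slice of per-domain vhpet policy state */
> > >     struct hpet_notif_state {
> > >         uint64_t pending_mask;     /* vcpus still owed the current injection */
> > >         uint64_t last_eoi_stamp;   /* main counter value at end of handling */
> > >     };
> > >
> > >     /* registered with hvm_register_intr_en_notif() for the hpet vector:
> > >      * post == 1 at injection, post == 0 when the guest finishes handling
> > >      * the interrupt (re-enables interrupts or lowers the tpr) */
> > >     static void hpet_intr_en_notif(struct hpet_notif_state *h,
> > >                                    unsigned int vcpu, int post)
> > >     {
> > >         if ( post )
> > >             h->pending_mask &= ~(1ULL << vcpu);         /* injection seen */
> > >         else if ( h->pending_mask == 0 )
> > >             h->last_eoi_stamp = read_64_main_counter(); /* end of interrupt */
> > >     }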
> > >
> > > To prevent a possible early trigger of the finished callback,
> > intr_en_notif
> > > logic
> > > has a two stage arm, the first at injection
> > (hvm_intr_en_notif_arm()) and the
> > > second when
> > > interrupts are seen to be disabled
> (hvm_intr_en_notif_disarm()).
> > Once fully
> > > armed, re-enabling
> > > interrupts will cause hvm_intr_en_notif_disarm() to
> make the end
> > of interrupt
> > > callback. hvm_intr_en_notif_arm() and
> hvm_intr_en_notif_disarm()
> > are called by
> > > [vmx,svm]_intr_assist().
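> > >
> > > The arm/disarm sequence amounts to a small state machine, roughly as
> > > sketched below (state and function names are invented here for clarity):
> > >
> > >     enum notif_state {
> > >         NOTIF_IDLE,        /* nothing armed */
> > >         NOTIF_ARMED,       /* armed at injection (hvm_intr_en_notif_arm) */
> > >         NOTIF_FULLY_ARMED  /* guest has since been seen with interrupts disabled */
> > >     };
> > >
> > >     /* at injection time, from [vmx,svm]_intr_assist() */
> > >     static enum notif_state notif_on_inject(void)
> > >     {
> > >         return NOTIF_ARMED;
> > >     }
> > >
> > >     /* whenever the guest is observed with interrupts disabled */
> > >     static enum notif_state notif_on_intr_disabled(enum notif_state s)
> > >     {
> > >         return (s == NOTIF_ARMED) ? NOTIF_FULLY_ARMED : s;
> > >     }
> > >
> > >     /* when the guest re-enables interrupts (or lowers the tpr): only a
> > >      * fully armed notification fires the post == 0 callback, which is
> > >      * what prevents an early trigger of the finished callback */
> > >     static int notif_on_intr_enabled(enum notif_state s)
> > >     {
> > >         return s == NOTIF_FULLY_ARMED;
> > >     }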
> > >
> > > 3. Interrupt delivery policies
> > >
> > > The existing hpet interrupt delivery is preserved.
> This includes
> > > vcpu round robin delivery used by Linux and broadcast delivery
> > used by
> > > Windows.
> > >
> > > There are two policies for interrupt delivery, one for Windows
> > 2k8-64 and the
> > > other
> > > for Linux. The Linux policy takes advantage of the
> (guest) Linux
> > missed tick
> > > and offset
> > > calculations and does not attempt to deliver the
> right number of
> > interrupts.
> > > The Windows policy delivers the correct number of interrupts,
> > even if
> > > sometimes much
> > > closer to each other than the period. The policies are similar
> > to those in
> > > vpt.c, though
> > > there are some important differences.
> > >
> > > Policies are selected with an HVMOP_set_param
> hypercall with index
> > > HVM_PARAM_TIMER_MODE.
> > > Two new values are added,
> HVM_HPET_guest_computes_missed_ticks and
> > > HVM_HPET_guest_does_not_compute_missed_ticks. The reason that
> > two new ones
> > > are added is that
> > > in some guests (32bit Linux) a no-missed policy is needed for
> > clock sources
> > > other than hpet
> > > and a missed ticks policy for hpet. It was felt that
> there would
> > be less
> > > confusion by simply
> > > introducing the two hpet policies.
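> > >
> > > For example, a toolstack could select the Linux-style policy with
> > > something like the call below. The header names and the exact
> > > xc_set_hvm_param() argument types are assumed from the usual
> > > HVM_PARAM_TIMER_MODE handling; error handling is omitted.
> > >
> > >     #include <stdint.h>
> > >     #include <xenctrl.h>           /* xc_set_hvm_param() */
> > >     #include <xen/hvm/params.h>    /* HVM_PARAM_TIMER_MODE and the mode values */
> > >
> > >     static int use_hpet_missed_ticks_policy(int xc_handle, uint32_t domid)
> > >     {
> > >         return xc_set_hvm_param(xc_handle, domid, HVM_PARAM_TIMER_MODE,
> > >                                 HVM_HPET_guest_computes_missed_ticks);
> > >     }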
> > >
> > > 3.1. The missed ticks policy
> > >
> > > The Linux clock interrupt handler for hpet calculates missed
> > ticks and offset
> > > using the hpet
> > > main counter. The algorithm works well when the time since the
> > last interrupt
> > > is greater than
> > > or equal to a period and poorly otherwise.
> > >
> > > The missed ticks policy ensures that no two clock
> interrupts are
> > delivered to
> > > the guest at
> > > a time interval less than a period. A time stamp (hpet main
> > counter value) is
> > > recorded (by a
> > > callback registered with hvm_register_intr_en_notif)
> when Linux
> > finishes
> > > handling the clock
> > > interrupt. Then, ensuing interrupts are delivered to
> the vioapic
> > only if the
> > > current main
> > > counter value is a period greater than when the last interrupt
> > was handled.
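> > >
> > > In rough pseudo-C the delivery gate looks like the following (names are
> > > invented for this description, not the identifiers in hpet.c):
> > >
> > >     #include <stdint.h>
> > >
> > >     uint64_t read_64_main_counter(void);
> > >
> > >     static uint64_t hpet_period_ticks;   /* main counter ticks per guest clock period */
> > >     static uint64_t last_handled_stamp;  /* stamped by the intr_en_notif callback when
> > >                                           * the guest finished the last clock interrupt */
> > >
> > >     /* missed ticks policy: deliver only if a full period has elapsed since
> > >      * the guest finished handling the previous clock interrupt */
> > >     static int hpet_should_deliver(void)
> > >     {
> > >         return (read_64_main_counter() - last_handled_stamp) >= hpet_period_ticks;
> > >     }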
> > >
> > > Tests showed a significant improvement in clock drift with end
> > of interrupt
> > > time stamps
> > > versus beginning of interrupt[1]. It is believed that
> the reason
> > for the
> > > improvement
> > > is that the clock interrupt handler goes for a
> > spinlock and can therefore be
> > > delayed in its
> > > processing. Furthermore, the main counter is read by the guest
> > under the lock.
> > > The net
> > > effect is that if we time stamp injection, we can get the
> > difference in time
> > > between successive interrupt handler lock acquisitions to be
> > less than the
> > > period.
> > >
> > > 3.2. The no-missed ticks policy
> > >
> > > Windows 2k864 keeps very poor time with the missed
> ticks policy.
> > So the
> > > no-missed ticks policy
> > > was developed. In the no-missed ticks policy we deliver the
> > correct number of
> > > interrupts,
> > > even if they are spaced less than a period apart
> (when catching up).
> > >
> > > Windows 2k864 uses a broadcast mode in the interrupt routing
> > such that
> > > all vcpus get the clock interrupt. The best Windows drift
> > performance was
> > > achieved when the
> > > policy code ensured that all the previous interrupts (on the
> > various vcpus)
> > > had been injected
> > > before injecting the next interrupt to the vioapic.
> > >
> > > The policy code works as follows. It uses the
> > hvm_isa_irq_assert_cb() to
> > > record
> > > the vcpus to be interrupted in h->hpet.pending_mask. Then, in
> > the callback
> > > registered
> > > with hvm_register_intr_en_notif() at post=1 time it clears the
> > current vcpu in
> > > the pending_mask.
> > > When the pending_mask is clear it decrements
> > hpet.intr_pending_nr and if
> > > intr_pending_nr is still
> > > non-zero posts another interrupt to the ioapic with
> > hvm_isa_irq_assert_cb().
> > > Intr_pending_nr is incremented in
> > hpet_route_decision_not_missed_ticks().
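> > >
> > > A compressed sketch of that bookkeeping follows; the helper names and
> > > the shape of the state are invented here, and the real code keeps this
> > > state in h->hpet rather than in a standalone struct.
> > >
> > >     #include <stdint.h>
> > >
> > >     struct hpet_pending {
> > >         uint64_t pending_mask;         /* vcpus that have not yet seen the injection */
> > >         unsigned int intr_pending_nr;  /* clock interrupts still owed to the guest */
> > >     };
> > >
> > >     /* wraps hvm_isa_irq_assert_cb(); its callback fills pending_mask with
> > >      * the vcpus that will receive the interrupt */
> > >     void hpet_assert_irq(struct hpet_pending *h);
> > >
> > >     /* timer expiry, cf. hpet_route_decision_not_missed_ticks(): count one
> > >      * more owed interrupt and start delivery if none is in flight */
> > >     static void hpet_timer_expired(struct hpet_pending *h)
> > >     {
> > >         if ( h->intr_pending_nr++ == 0 )
> > >             hpet_assert_irq(h);
> > >     }
> > >
> > >     /* intr_en_notif callback at post == 1: this vcpu saw the injection */
> > >     static void hpet_injected(struct hpet_pending *h, unsigned int vcpu)
> > >     {
> > >         h->pending_mask &= ~(1ULL << vcpu);
> > >         if ( h->pending_mask == 0 && --h->intr_pending_nr != 0 )
> > >             hpet_assert_irq(h);    /* catch up: inject the next one right away */
> > >     }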
> > >
> > > The missed ticks policy intr_en_notif callback also uses the
> > pending_mask
> > > method. So even though
> > > Linux does not broadcast its interrupts, the code could handle
> > it if it did.
> > > In this case the end of interrupt time stamp is made when the
> > pending_mask is
> > > clear.
> > >
> > > 4. Live Migration
> > >
> > > Live migration with hpet preserves the current offset of the
> > guest clock with
> > > respect
> > > to ntp. This is accomplished by migrating all of the state in
> > the h->hpet data
> > > structure
> > > in the usual way. The hp->mc_offset is recalculated on the
> > receiving node so
> > > that the
> > > guest sees a continuous hpet main counter.
> > >
> > > Code has been added to xc_domain_save.c to send a small message
> > after the
> > > domain context is sent. The content of the message is the
> > physical tsc
> > > timestamp, last_tsc,
> > > read just before the message is sent. When the
> last_tsc message
> > is received in
> > > xc_domain_restore.c,
> > > another physical tsc timestamp, cur_tsc, is read. The two
> > timestamps are
> > > loaded into the domain
> > > structure as last_tsc_sender and first_tsc_receiver with
> > hypercalls. Then
> > > xc_domain_hvm_setcontext
> > > is called so that hpet_load has access to these time stamps.
> > Hpet_load uses
> > > the timestamps
> > > to account for the time spent saving and loading the domain
> > context. With this
> > > technique,
> > > the only neglected time is the time spent sending a small
> > network message.
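> > >
> > > One way to picture the arithmetic (the exact stamps hpet_save/hpet_load
> > > record are not spelled out above, so the names and khz-based conversion
> > > below are assumptions for illustration): each tsc difference is taken on
> > > a single host, so the two unsynchronized tsc bases never mix.
> > >
> > >     #include <stdint.h>
> > >
> > >     /* sender tsc just before the last_tsc message is sent, and receiver
> > >      * tsc when that message arrives */
> > >     struct migrate_stamps {
> > >         uint64_t last_tsc_sender;
> > >         uint64_t first_tsc_receiver;
> > >     };
> > >
> > >     static uint64_t migration_downtime_ns(const struct migrate_stamps *ts,
> > >                                           uint64_t tsc_at_save, uint64_t sender_tsc_khz,
> > >                                           uint64_t tsc_at_load, uint64_t receiver_tsc_khz)
> > >     {
> > >         uint64_t save_ns = (ts->last_tsc_sender - tsc_at_save)
> > >                            * 1000000ULL / sender_tsc_khz;
> > >         uint64_t load_ns = (tsc_at_load - ts->first_tsc_receiver)
> > >                            * 1000000ULL / receiver_tsc_khz;
> > >         /* only the time to send the small network message is neglected */
> > >         return save_ns + load_ns;
> > >     }
> > >
> > > hpet_load can then fold this downtime, converted to main counter ticks,
> > > into the recalculated mc_offset so the guest clock keeps its alignment
> > > with ntp across the migration.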
> > >
> > > 5. Test Results
> > >
> > > Some recent test results are:
> > >
> > > 5.1 Linux 4u664 and Windows 2k864 load test.
> > > Duration: 70 hours.
> > > Test date: 6/2/08
> > > Loads: usex -b48 on Linux; burn-in on Windows
> > > Guest vcpus: 8 for Linux; 2 for Windows
> > > Hardware: 8 physical cpu AMD
> > > Clock drift : Linux: .0012% Windows: .009%
> > >
> > > 5.2 Linux 4u664, Linux 4u464 , and Windows 2k864 no-load test
> > > Duration: 23 hours.
> > > Test date: 6/3/08
> > > Loads: none
> > > Guest vcpus: 8 for each Linux; 2 for Windows
> > > Hardware: 4 physical cpu AMD
> > > Clock drift : Linux: .033% Windows: .019%
> > >
> > > 6. Relation to recent work in xen-unstable
> > >
> > > There is a similarity between hvm_get_guest_time() in
> > xen-unstable and
> > > read_64_main_counter()
> > > in this code. However, read_64_main_counter() is more tuned to
> > the needs of
> > > hpet.c. It has no
> > > "set" operation, only the get. It isolates the mode,
> physical or
> > simulated, in
> > > read_64_main_counter()
> > > itself. It uses no vcpu or domain state as it is a physical
> > entity, in either
> > > mode. And it provides a real
> > > physical mode for every read for those applications
> that desire
> > this.
> > >
> > > 7. Conclusion
> > >
> > > The virtual hpet is improved by this patch in terms
> of accuracy and
> > > monotonicity.
> > > Tests performed to date verify this and more testing
> is under way.
> > >
> > > 8. Future Work
> > >
> > > Testing with Windows Vista will be performed soon. The reason
> > for accuracy
> > > variations
> > > on different platforms using the physical hpet device will be
> > investigated.
> > > Additional overhead measurements on simulated vs physical hpet
> > mode will be
> > > made.
> > >
> > > Footnotes:
> > >
> > > 1. I don't recall the accuracy improvement with end
> of interrupt
> > stamping, but
> > > it was
> > > significant, perhaps better than a two-to-one improvement. It
> > would be a very
> > > simple matter
> > > to re-measure the improvement as the facility can call back at
> > injection time
> > > as well.
> > >
> > >
> > > Signed-off-by: Dave Winchell <dwinchell@xxxxxxxxxxxxxxx>
> > > Signed-off-by: Ben Guthro <bguthro@xxxxxxxxxxxxxxx>
> > >
> > >
> >
> >
> >
>
>