xen-devel
Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
Hi Dan,
I am running with hpet=1 and timer_mode=2. I don't see where timer_mode
is checked for hpet timekeeping, but I set it nevertheless.
thanks,
Dave
Dan Magenheimer wrote:
Hi Dave and Ben --
When running tests on xen-unstable (without your patch), please ensure
that hpet=1 is set in the hvm config. Also, I think that when hpet is
the clocksource on RHEL4-32, the clock IS resilient to missed ticks, so
timer_mode should be 2 (whereas when pit is the clocksource on RHEL4-32,
all clock ticks must be delivered, so timer_mode should be 0).
Per
http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html it's
my intent to clean this up, but I won't get to it until next week.
Thanks,
Dan
-----Original Message-----
*From:* xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
[mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] *On Behalf Of* Dave Winchell
*Sent:* Friday, June 06, 2008 4:46 AM
*To:* Keir Fraser; Ben Guthro; xen-devel
*Cc:* dan.magenheimer@xxxxxxxxxx; Dave Winchell
*Subject:* RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
Keir,
I think the changes are required. We'll run some tests today so
that we have some data to talk about.
-Dave
-----Original Message-----
From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx on behalf of Keir Fraser
Sent: Fri 6/6/2008 4:58 AM
To: Ben Guthro; xen-devel
Cc: dan.magenheimer@xxxxxxxxxx
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
Are these patches needed now the timers are built on Xen system time
rather than host TSC? Dan has reported much better time-keeping with his
patch checked in, and it's for sure a lot less invasive than this
patchset.
-- Keir
On 5/6/08 15:59, "Ben Guthro" <bguthro@xxxxxxxxxxxxxxx> wrote:
>
> 1. Introduction
>
> This patch improves the hpet based guest clock in terms of drift and
> monotonicity. Prior to this work the drift with hpet was greater than
> 2%, far above the .05% limit for ntp to synchronize. With this code,
> the drift ranges from .001% to .0033% depending on guest and physical
> platform.
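
To put the percentages above in perspective, here is a small sketch that
converts a drift in percent to parts per million and to seconds per day.
The function names are made up for illustration and are not part of the
patch. A 2% drift works out to about 1728 seconds (roughly 29 minutes)
per day, the .05% ntp limit is 500 ppm or about 43 seconds per day, and
.001% is under one second per day.

```c
/* Hypothetical helpers for illustration only; not from the patch. */

/* Convert a clock drift given in percent into parts per million. */
double drift_ppm(double percent)
{
    return percent * 10000.0;           /* 1% = 10,000 ppm */
}

/* Seconds gained or lost over one day at the given percent drift. */
double drift_seconds_per_day(double percent)
{
    return percent / 100.0 * 86400.0;   /* 86,400 seconds in a day */
}
```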
>
> Using hpet allows guest operating systems to provide monotonic time to
> their applications. Time sources other than hpet are not monotonic
> because of their reliance on tsc, which is not synchronized across
> physical processors.
>
> Windows 2k8-64 and many Linux guests are supported with two policies,
> one for guests that handle missed clock interrupts and the other for
> guests that require the correct number of interrupts.
>
> Guests may use hpet for the timing source even if the physical platform
> has no visible hpet. Migration is supported between physical machines
> which differ in physical hpet visibility.
>
> Most of the changes are in hpet.c. Two general facilities are added to
> track interrupt progress. The ideas here and the facilities would be
> useful in vpt.c, for other time sources, though no attempt is made here
> to improve vpt.c.
>
> The following sections discuss hpet dependencies, interrupt delivery
> policies, live migration, test results, and relation to recent work
> with monotonic time.
>
>
> 2. Virtual Hpet dependencies
>
> The virtual hpet depends on the ability to read the physical or
> simulated (see discussion below) hpet. For timekeeping, the virtual
> hpet also depends on two new interrupt notification facilities to
> implement its policies for interrupt delivery.
>
> 2.1. Two modes of low-level hpet main counter reads.
>
> In this implementation, the virtual hpet reads with
> read_64_main_counter(), exported by time.c, either the real physical
> hpet main counter register directly or a "simulated" hpet main counter.
>
> The simulated mode uses a monotonic version of get_s_time() (NOW()),
> where the last time value is returned whenever the current time value
> is less than the last time value. In simulated mode, since it is
> layered on s_time, the underlying hardware can be hpet or some other
> device. The frequency of the main counter in simulated mode is the same
> as the standard physical hpet frequency, allowing live migration
> between nodes that are configured differently.
>
> If the physical platform does not have an hpet device, or if xen is
> configured not to use the device, then the simulated method is used. If
> there is a physical hpet device, and xen has initialized it, then
> either simulated or physical mode can be used. This is governed by a
> boot time option, hpet-avoid. Setting this option to 1 gives the
> simulated mode and 0 the physical mode. The default is physical mode.
>
> A disadvantage of the physical mode is that it may take longer to read
> the device than in simulated mode. On some platforms the cost is about
> the same (less than 250 nsec) for physical and simulated modes, while
> on others physical cost is much higher than simulated. A disadvantage
> of the simulated mode is that it can return the same value for the
> counter in consecutive calls.
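
The "return the last value if time went backwards" rule of the simulated
mode can be sketched as below. This is only an illustration under stated
assumptions: monotonic_counter() and last_value are hypothetical names,
the raw sample stands in for a get_s_time() reading already scaled to
counter ticks, and the sketch is single-threaded (real hypervisor code
would need a lock or an atomic update around last_value).

```c
#include <stdint.h>

/* Hypothetical state: the last value handed out by the simulated counter. */
static uint64_t last_value;

/*
 * Clamp a possibly non-monotonic sample so that the returned sequence
 * never goes backwards. raw_now is an assumed stand-in for a system time
 * reading converted to main-counter ticks.
 */
uint64_t monotonic_counter(uint64_t raw_now)
{
    if (raw_now < last_value)
        return last_value;      /* time went backwards: repeat last value */

    last_value = raw_now;       /* time advanced (or stood still) */
    return raw_now;
}
```

The sketch also shows the disadvantage noted above: two consecutive
calls can return the same value whenever the underlying sample steps
backwards or does not advance.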
>
> 2.2. Interrupt notification facilities.
>
> Two interrupt notification facilities are introduced: one is
> hvm_isa_irq_assert_cb() and the other hvm_register_intr_en_notif().
>
> The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to the
> vioapic. hvm_isa_irq_assert_cb allows a callback to be passed along to
> vioapic_deliver() and this callback is called with a mask of the vcpus
> which will get the interrupt. This callback is made before any vcpus
> receive an interrupt.
>
> Vhpet uses hvm_register_intr_en_notif() to register a handler for a
> particular vector that will be called when that vector is injected in
> [vmx,svm]_intr_assist() and also when the guest finishes handling the
> interrupt. Here finished is defined as the point when the guest
> re-enables interrupts or lowers the tpr value. EOI is not used as the
> end of interrupt as this is sometimes returned before the interrupt
> handler has done its work. A flag is passed to the handler indicating
> whether this is the injection point (post = 1) or the interrupt
> finished (post = 0) point. The need for the finished point callback is
> discussed in the missed ticks policy section.
>
> To prevent a possible early trigger of the finished callback,
> intr_en_notif logic has a two stage arm, the first at injection
> (hvm_intr_en_notif_arm()) and the second when interrupts are seen to be
> disabled (hvm_intr_en_notif_disarm()). Once fully armed, re-enabling
> interrupts will cause hvm_intr_en_notif_disarm() to make the end of
> interrupt callback. hvm_intr_en_notif_arm() and
> hvm_intr_en_notif_disarm() are called by [vmx,svm]_intr_assist().
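
The two stage arm described above can be modelled as a small state
machine. The sketch below is an assumption-laden illustration, not the
patch's code: the type and function names (intr_notif, notif_arm,
notif_disarm) are hypothetical, the real signatures of
hvm_intr_en_notif_arm() and hvm_intr_en_notif_disarm() are not shown in
this thread, and only the "guest re-enables interrupts" condition is
modelled (the tpr-lowering case is left out).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; not the patch's actual data structures. */
enum notif_state {
    NOTIF_IDLE,         /* nothing in flight */
    NOTIF_INJECTED,     /* stage one: vector has been injected */
    NOTIF_FULLY_ARMED   /* stage two: guest seen with interrupts disabled */
};

typedef void (*notif_cb)(int vector, int post);

struct intr_notif {
    enum notif_state state;
    int vector;
    notif_cb cb;        /* handler to notify; may be NULL */
};

/* Stage one: called at injection time; makes the post = 1 callback. */
void notif_arm(struct intr_notif *n)
{
    n->state = NOTIF_INJECTED;
    if (n->cb != NULL)
        n->cb(n->vector, 1);
}

/*
 * Called on later passes with the guest's current interrupt-enable state.
 * Returns true when the "finished" (post = 0) callback was made.
 */
bool notif_disarm(struct intr_notif *n, bool guest_irqs_enabled)
{
    if (n->state == NOTIF_INJECTED && !guest_irqs_enabled) {
        n->state = NOTIF_FULLY_ARMED;   /* guest is now inside the handler */
        return false;
    }
    if (n->state == NOTIF_FULLY_ARMED && guest_irqs_enabled) {
        n->state = NOTIF_IDLE;          /* handler re-enabled interrupts */
        if (n->cb != NULL)
            n->cb(n->vector, 0);
        return true;
    }
    return false;                       /* not armed far enough yet */
}
```

Seeing interrupts enabled right after injection does not fire the
finished callback, because the guest has not yet been observed with
interrupts disabled. That is the early trigger the second stage guards
against.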
>
> 3. Interrupt delivery policies
>
> The existing hpet interrupt delivery is preserved. This includes vcpu
> round robin delivery used by Linux and broadcast delivery used by
> Windows.
>
> There are two policies for interrupt delivery, one for Windows 2k8-64
> and the other for Linux. The Linux policy takes advantage of the
> (guest) Linux missed tick and offset calculations and does not attempt
> to deliver the right number of interrupts. The Windows policy delivers
> the correct number of interrupts, even if sometimes much closer to each
> other than the period. The policies are similar to those in vpt.c,
> though there are some important differences.
>
> Policies are selected with an HVMOP_set_param hypercall with index
> HVM_PARAM_TIMER_MODE. Two new values are added,
> HVM_HPET_guest_computes_missed_ticks and
> HVM_HPET_guest_does_not_compute_missed_ticks. The reason that two new
> ones are added is that in some guests (32bit Linux) a no-missed policy
> is needed for clock sources other than hpet and a missed ticks policy
> for hpet. It was felt that there would be less confusion by simply
> introducing the two hpet policies.
>
> 3.1. The missed ticks policy
>
> The Linux clock interrupt handler for hpet calculates missed ticks and
> offset using the hpet main counter. The algorithm works well when the
> time since the last interrupt is greater than or equal to a period and
> poorly otherwise.
>
> The missed ticks policy ensures that no two clock interrupts are
> delivered to the guest at a time interval less than a period. A time
> stamp (hpet main counter value) is recorded (by a callback registered
> with hvm_register_intr_en_notif) when Linux finishes handling the clock
> interrupt. Then, ensuing interrupts are delivered to the vioapic only
> if the current main counter value is at least a period greater than
> when the last interrupt was handled.
>
> Tests showed a significant improvement in clock drift with end of
> interrupt time stamps versus beginning of interrupt[1]. It is believed
> that the reason for the improvement is that the clock interrupt handler
> goes for a spinlock and can therefore be delayed in its processing.
> Furthermore, the main counter is read by the guest under the lock. The
> net effect is that if we time stamp injection, we can get the
> difference in time between successive interrupt handler lock
> acquisitions to be less than the period.
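
The gating rule of the missed ticks policy can be sketched as follows.
This is a minimal illustration under assumptions: the struct and
function names are hypothetical, the counter values are taken to be
main-counter ticks from a monotonic source, and the real code's
bookkeeping in hpet.c is not shown in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-timer state for the missed ticks policy. */
struct missed_ticks_state {
    uint64_t period;         /* timer period, in main-counter ticks */
    uint64_t last_finished;  /* counter value when the guest finished
                                handling the previous clock interrupt */
};

/* Models the post = 0 callback: time stamp the end of interrupt handling. */
void tick_finished(struct missed_ticks_state *s, uint64_t now)
{
    s->last_finished = now;
}

/*
 * Deliver to the vioapic only if at least one full period has elapsed
 * since the guest finished the previous interrupt. Assumes now is never
 * less than last_finished (monotonic counter).
 */
bool should_deliver(const struct missed_ticks_state *s, uint64_t now)
{
    return now - s->last_finished >= s->period;
}
```

Because the time stamp is taken when the handler finishes rather than at
injection, a handler delayed on its spinlock pushes the next delivery
out by the same amount.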
>
> 3.2. The no-missed ticks policy
>
> Windows 2k8-64 keeps very poor time with the missed ticks policy. So
> the no-missed ticks policy was developed. In the no-missed ticks policy
> we deliver the correct number of interrupts, even if they are spaced
> less than a period apart (when catching up).
>
> Windows 2k8-64 uses a broadcast mode in the interrupt routing such that
> all vcpus get the clock interrupt. The best Windows drift performance
> was achieved when the policy code ensured that all the previous
> interrupts (on the various vcpus) had been injected before injecting
> the next interrupt to the vioapic.
>
> The policy code works as follows. It uses the hvm_isa_irq_assert_cb()
> to record the vcpus to be interrupted in h->hpet.pending_mask. Then, in
> the callback registered with hvm_register_intr_en_notif() at post=1
> time it clears the current vcpu in the pending_mask. When the
> pending_mask is clear it decrements hpet.intr_pending_nr and if
> intr_pending_nr is still non-zero posts another interrupt to the ioapic
> with hvm_isa_irq_assert_cb(). Intr_pending_nr is incremented in
> hpet_route_decision_not_missed_ticks().
>
> The missed ticks policy intr_en_notif callback also uses the
> pending_mask method. So even though Linux does not broadcast its
> interrupts, the code could handle it if it did. In this case the end of
> interrupt time stamp is made when the pending_mask is clear.
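
The pending_mask / intr_pending_nr bookkeeping of the no-missed ticks
policy can be sketched as below. Only the two field names come from the
description above; the struct, the function names, and the rule that a
newly due tick is asserted immediately when nothing is outstanding are
assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the two fields named in the description. */
struct no_missed_state {
    uint64_t pending_mask;         /* vcpus still owed the current interrupt */
    unsigned int intr_pending_nr;  /* ticks owed to the guest */
};

/*
 * A timer period elapsed, so one more tick is owed. Returns true when the
 * caller should assert the interrupt now (assumed: nothing outstanding).
 */
bool tick_due(struct no_missed_state *s)
{
    s->intr_pending_nr++;
    return s->pending_mask == 0 && s->intr_pending_nr == 1;
}

/* Models the assert callback: remember which vcpus will get the interrupt. */
void record_targets(struct no_missed_state *s, uint64_t vcpu_mask)
{
    s->pending_mask = vcpu_mask;
}

/*
 * Models the post = 1 callback on one vcpu. Returns true when every
 * targeted vcpu has been injected and another owed tick should be posted.
 */
bool vcpu_injected(struct no_missed_state *s, unsigned int vcpu)
{
    uint64_t bit = UINT64_C(1) << vcpu;

    if ((s->pending_mask & bit) == 0)
        return false;               /* this vcpu was not a pending target */

    s->pending_mask &= ~bit;
    if (s->pending_mask != 0)
        return false;               /* other vcpus still outstanding */

    if (s->intr_pending_nr > 0)
        s->intr_pending_nr--;       /* the current tick is fully injected */
    return s->intr_pending_nr != 0; /* catch up if more ticks are owed */
}
```

With a two-vcpu broadcast, a tick that falls due while the previous one
is still being injected is held back until both vcpus have taken the
earlier interrupt, then posted right away.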
>
> 4. Live Migration
>
> Live migration with hpet preserves the current offset of the guest
> clock with respect to ntp. This is accomplished by migrating all of the
> state in the h->hpet data structure in the usual way. The hp->mc_offset
> is recalculated on the receiving node so that the guest sees a
> continuous hpet main counter.
>
> Code has been added to xc_domain_save.c to send a small message after
> the domain context is sent. The content of the message is the physical
> tsc timestamp, last_tsc, read just before the message is sent. When the
> last_tsc message is received in xc_domain_restore.c, another physical
> tsc timestamp, cur_tsc, is read. The two timestamps are loaded into the
> domain structure as last_tsc_sender and first_tsc_receiver with
> hypercalls. Then xc_domain_hvm_setcontext is called so that hpet_load
> has access to these time stamps. Hpet_load uses the timestamps to
> account for the time spent saving and loading the domain context. With
> this technique, the only neglected time is the time spent sending a
> small network message.
>
> 5. Test Results
>
> Some recent test results are:
>
> 5.1 Linux 4u664 and Windows 2k8-64 load test.
> Duration: 70 hours.
> Test date: 6/2/08
> Loads: usex -b48 on Linux; burn-in on Windows
> Guest vcpus: 8 for Linux; 2 for Windows
> Hardware: 8 physical cpu AMD
> Clock drift : Linux: .0012% Windows: .009%
>
> 5.2 Linux 4u664, Linux 4u464, and Windows 2k8-64 no-load test
> Duration: 23 hours.
> Test date: 6/3/08
> Loads: none
> Guest vcpus: 8 for each Linux; 2 for Windows
> Hardware: 4 physical cpu AMD
> Clock drift : Linux: .033% Windows: .019%
>
> 6. Relation to recent work in xen-unstable
>
> There is a similarity between hvm_get_guest_time() in xen-unstable and
> read_64_main_counter() in this code. However, read_64_main_counter() is
> more tuned to the needs of hpet.c. It has no "set" operation, only the
> get. It isolates the mode, physical or simulated, in
> read_64_main_counter() itself. It uses no vcpu or domain state as it is
> a physical entity, in either mode. And it provides a real physical mode
> for every read for those applications that desire this.
>
> 7. Conclusion
>
> The virtual hpet is improved by this patch in terms of accuracy and
> monotonicity. Tests performed to date verify this and more testing is
> under way.
>
> 8. Future Work
>
> Testing with Windows Vista will be performed soon. The reason for
> accuracy variations on different platforms using the physical hpet
> device will be investigated. Additional overhead measurements on
> simulated vs physical hpet mode will be made.
>
> Footnotes:
>
> 1. I don't recall the accuracy improvement with end of interrupt
> stamping, but it was significant, perhaps better than two to one
> improvement. It would be a very simple matter to re-measure the
> improvement as the facility can call back at injection time as well.
>
>
> Signed-off-by: Dave Winchell <dwinchell@xxxxxxxxxxxxxxx>
> Signed-off-by: Ben Guthro <bguthro@xxxxxxxxxxxxxxx>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel