WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 

xen-devel

Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

To: "dan.magenheimer@xxxxxxxxxx" <dan.magenheimer@xxxxxxxxxx>
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
From: Dave Winchell <dwinchell@xxxxxxxxxxxxxxx>
Date: Fri, 06 Jun 2008 13:54:19 -0400
Cc: Dave Winchell <dwinchell@xxxxxxxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>, Ben Guthro <bguthro@xxxxxxxxxxxxxxx>
Delivery-date: Fri, 06 Jun 2008 10:53:09 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxx
In-reply-to: <20080606095323843.00000002776@djm-pc>
References: <20080606095323843.00000002776@djm-pc>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mozilla Thunderbird 1.0.7-1.1.fc4 (X11/20050929)
Hi Dan,

I am running with hpet=1 and timer_mode=2. I don't see where timer_mode is checked for
hpet timekeeping but I set it nevertheless.

thanks,
Dave


Dan Magenheimer wrote:

Hi Dave and Ben --
When running tests on xen-unstable (without your patch), please ensure that hpet=1 is set in the hvm config. Also, I think that when hpet is the clocksource on RHEL4-32, the clock IS resilient to missed ticks, so timer_mode should be 2 (whereas when pit is the clocksource on RHEL4-32, all clock ticks must be delivered, so timer_mode should be 0). Per http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html it's my intent to clean this up, but I won't get to it until next week. Thanks,
Dan

    -----Original Message-----
    *From:* xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
    [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx]*On Behalf Of *Dave
    Winchell
    *Sent:* Friday, June 06, 2008 4:46 AM
    *To:* Keir Fraser; Ben Guthro; xen-devel
    *Cc:* dan.magenheimer@xxxxxxxxxx; Dave Winchell
    *Subject:* RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

    Keir,

    I think the changes are required. We'll run some tests today so
    that we have some data to talk about.

    -Dave


    -----Original Message-----
    From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx on behalf of Keir Fraser
    Sent: Fri 6/6/2008 4:58 AM
    To: Ben Guthro; xen-devel
    Cc: dan.magenheimer@xxxxxxxxxx
    Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

    Are these patches needed now the timers are built on Xen system time
    rather than host TSC? Dan has reported much better time-keeping with
    his patch checked in, and it's for sure a lot less invasive than this
    patchset.


     -- Keir

    On 5/6/08 15:59, "Ben Guthro" <bguthro@xxxxxxxxxxxxxxx> wrote:

    >
    > 1. Introduction
    >
    > This patch improves the hpet based guest clock in terms of drift and
    > monotonicity. Prior to this work the drift with hpet was greater than
    > 2%, far above the .05% limit for ntp to synchronize. With this code,
    > the drift ranges from .001% to .0033% depending on guest and physical
    > platform.
    >
    > Using hpet allows guest operating systems to provide monotonic time
    > to their applications. Time sources other than hpet are not monotonic
    > because of their reliance on tsc, which is not synchronized across
    > physical processors.
    >
    > Windows 2k8-64 and many Linux guests are supported with two policies,
    > one for guests that handle missed clock interrupts and the other for
    > guests that require the correct number of interrupts.
    >
    > Guests may use hpet for the timing source even if the physical
    > platform has no visible hpet. Migration is supported between physical
    > machines which differ in physical hpet visibility.
    >
    > Most of the changes are in hpet.c. Two general facilities are added
    > to track interrupt progress. The ideas here and the facilities would
    > be useful in vpt.c, for other time sources, though no attempt is made
    > here to improve vpt.c.
    >
    > The following sections discuss hpet dependencies, interrupt delivery
    > policies, live migration, test results, and relation to recent work
    > with monotonic time.
    >
    >
    > 2. Virtual Hpet dependencies
    >
    > The virtual hpet depends on the ability to read the physical or
    > simulated (see discussion below) hpet. For timekeeping, the virtual
    > hpet also depends on two new interrupt notification facilities to
    > implement its policies for interrupt delivery.
    >
    > 2.1. Two modes of low-level hpet main counter reads.
    >
    > In this implementation, the virtual hpet uses read_64_main_counter(),
    > exported by time.c, to read either the real physical hpet main
    > counter register directly or a "simulated" hpet main counter.
    >
    > The simulated mode uses a monotonic version of get_s_time() (NOW()),
    > where the last time value is returned whenever the current time value
    > is less than the last time value. In simulated mode, since it is
    > layered on s_time, the underlying hardware can be hpet or some other
    > device. The frequency of the main counter in simulated mode is the
    > same as the standard physical hpet frequency, allowing live migration
    > between nodes that are configured differently.
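
The clamping rule described above (return the last value whenever the
current reading is lower) can be sketched as follows. This is a minimal
illustration, not the patch's code: the struct and function names are
hypothetical, `raw_now` stands in for whatever get_s_time() would return,
and a real hypervisor implementation would also need locking or atomics
so that several physical cpus can share the state.

```c
#include <stdint.h>

/* Hypothetical state: the last value handed out to any caller. */
struct mono_clock {
    uint64_t last_value;
};

/*
 * Return a reading that never goes backwards.
 * raw_now is a stand-in for the value get_s_time() would return.
 */
uint64_t mono_clock_read(struct mono_clock *c, uint64_t raw_now)
{
    if (raw_now < c->last_value)
        return c->last_value;   /* time went backwards: repeat last value */

    c->last_value = raw_now;    /* normal case: remember and return */
    return raw_now;
}
```

Note that a repeated value is exactly the disadvantage of simulated mode
mentioned in this section: two consecutive reads can return the same
counter value.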
    >
    > If the physical platform does not have an hpet device, or if xen is
    > configured not to use the device, then the simulated method is used.
    > If there is a physical hpet device, and xen has initialized it, then
    > either simulated or physical mode can be used. This is governed by a
    > boot time option, hpet-avoid. Setting this option to 1 gives the
    > simulated mode and 0 the physical mode. The default is physical mode.
    >
    > A disadvantage of the physical mode is that it may take longer to
    > read the device than in simulated mode. On some platforms the cost is
    > about the same (less than 250 nsec) for physical and simulated modes,
    > while on others the physical cost is much higher than the simulated
    > cost. A disadvantage of the simulated mode is that it can return the
    > same value for the counter in consecutive calls.
    >
    > 2.2. Interrupt notification facilities.
    >
    > Two interrupt notification facilities are introduced:
    > hvm_isa_irq_assert_cb() and hvm_register_intr_en_notif().
    >
    > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to the
    > vioapic. hvm_isa_irq_assert_cb allows a callback to be passed along
    > to vioapic_deliver() and this callback is called with a mask of the
    > vcpus which will get the interrupt. This callback is made before any
    > vcpus receive an interrupt.
    >
    > Vhpet uses hvm_register_intr_en_notif() to register a handler for a
    > particular vector that will be called when that vector is injected in
    > [vmx,svm]_intr_assist() and also when the guest finishes handling the
    > interrupt. Here finished is defined as the point when the guest
    > re-enables interrupts or lowers the tpr value. EOI is not used as the
    > end of interrupt as this is sometimes returned before the interrupt
    > handler has done its work. A flag is passed to the handler indicating
    > whether this is the injection point (post = 1) or the interrupt
    > finished (post = 0) point. The need for the finished point callback
    > is discussed in the missed ticks policy section.
    >
    > To prevent a possible early trigger of the finished callback,
    > intr_en_notif logic has a two stage arm, the first at injection
    > (hvm_intr_en_notif_arm()) and the second when interrupts are seen to
    > be disabled (hvm_intr_en_notif_disarm()). Once fully armed,
    > re-enabling interrupts will cause hvm_intr_en_notif_disarm() to make
    > the end of interrupt callback. hvm_intr_en_notif_arm() and
    > hvm_intr_en_notif_disarm() are called by [vmx,svm]_intr_assist().
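
The two stage arm described above can be read as a small state machine.
The sketch below is only an illustration under stated assumptions: the
state names, the struct, and the `finished_calls` counter (standing in for
the end of interrupt callback) are hypothetical, and the tpr-lowering case
is left out for brevity.

```c
#include <stdbool.h>

/* Hypothetical states for the two stage arm. */
enum notif_state {
    NOTIF_IDLE,         /* nothing outstanding */
    NOTIF_INJECTED,     /* stage 1: vector was injected */
    NOTIF_FULLY_ARMED   /* stage 2: guest was seen with interrupts disabled */
};

struct notif {
    enum notif_state state;
    int finished_calls; /* stand-in for invoking the post = 0 callback */
};

/* Stage 1, at injection time. */
void notif_arm(struct notif *n)
{
    n->state = NOTIF_INJECTED;
}

/* Called on each pass with the guest's current interrupt-enable state. */
void notif_check(struct notif *n, bool guest_irqs_enabled)
{
    switch (n->state) {
    case NOTIF_INJECTED:
        /* Only become fully armed once interrupts are seen disabled. */
        if (!guest_irqs_enabled)
            n->state = NOTIF_FULLY_ARMED;
        break;
    case NOTIF_FULLY_ARMED:
        /* Interrupts re-enabled: the guest has finished the handler. */
        if (guest_irqs_enabled) {
            n->finished_calls++;
            n->state = NOTIF_IDLE;
        }
        break;
    case NOTIF_IDLE:
        break;
    }
}
```

The point of the second stage is visible in the NOTIF_INJECTED case:
seeing interrupts enabled right after injection does not fire the
finished callback, because the guest has not yet entered its handler.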
    >
    > 3. Interrupt delivery policies
    >
    > The existing hpet interrupt delivery is preserved. This includes vcpu
    > round robin delivery used by Linux and broadcast delivery used by
    > Windows.
    >
    > There are two policies for interrupt delivery, one for Windows 2k8-64
    > and the other for Linux. The Linux policy takes advantage of the
    > (guest) Linux missed tick and offset calculations and does not
    > attempt to deliver the right number of interrupts. The Windows policy
    > delivers the correct number of interrupts, even if sometimes much
    > closer to each other than the period. The policies are similar to
    > those in vpt.c, though there are some important differences.
    >
    > Policies are selected with an HVMOP_set_param hypercall with index
    > HVM_PARAM_TIMER_MODE. Two new values are added,
    > HVM_HPET_guest_computes_missed_ticks and
    > HVM_HPET_guest_does_not_compute_missed_ticks. The reason that two new
    > ones are added is that in some guests (32bit Linux) a no-missed
    > policy is needed for clock sources other than hpet and a missed ticks
    > policy for hpet. It was felt that there would be less confusion by
    > simply introducing the two hpet policies.
    >
    > 3.1. The missed ticks policy
    >
    > The Linux clock interrupt handler for hpet calculates missed ticks
    > and offset using the hpet main counter. The algorithm works well when
    > the time since the last interrupt is greater than or equal to a
    > period and poorly otherwise.
    >
    > The missed ticks policy ensures that no two clock interrupts are
    > delivered to the guest at a time interval less than a period. A time
    > stamp (hpet main counter value) is recorded (by a callback registered
    > with hvm_register_intr_en_notif) when Linux finishes handling the
    > clock interrupt. Then, ensuing interrupts are delivered to the
    > vioapic only if the current main counter value is a period greater
    > than when the last interrupt was handled.
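
The delivery gate described above can be sketched as a single comparison
against the end of interrupt time stamp. The struct and function names
here are hypothetical, and the sketch reads "a period greater" as "at
least one full period later", which is an assumption about the exact
boundary condition.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-timer state, in hpet main counter ticks. */
struct mt_policy {
    uint64_t period;          /* programmed tick period */
    uint64_t last_end_stamp;  /* counter value when the guest last
                                 finished handling the clock interrupt */
};

/* Record the time stamp from the "interrupt finished" (post = 0) callback. */
void mt_interrupt_finished(struct mt_policy *p, uint64_t now)
{
    p->last_end_stamp = now;
}

/*
 * Deliver only if at least one full period has elapsed since the guest
 * finished the previous interrupt. Unsigned subtraction keeps the test
 * correct across counter wrap-around.
 */
bool mt_should_deliver(const struct mt_policy *p, uint64_t now)
{
    return now - p->last_end_stamp >= p->period;
}
```

Because the stamp is taken when the handler finishes rather than at
injection, a handler that was delayed pushes the next delivery out by the
same amount.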
    >
    > Tests showed a significant improvement in clock drift with end of
    > interrupt time stamps versus beginning of interrupt[1]. It is
    > believed that the reason for the improvement is that the clock
    > interrupt handler goes for a spinlock and can therefore be delayed in
    > its processing. Furthermore, the main counter is read by the guest
    > under the lock. The net effect is that if we time stamp injection, we
    > can get the difference in time between successive interrupt handler
    > lock acquisitions to be less than the period.
    >
    > 3.2. The no-missed ticks policy
    >
    > Windows 2k8-64 keeps very poor time with the missed ticks policy, so
    > the no-missed ticks policy was developed. In the no-missed ticks
    > policy we deliver the correct number of interrupts, even if they are
    > spaced less than a period apart (when catching up).
    >
    > Windows 2k8-64 uses a broadcast mode in the interrupt routing such
    > that all vcpus get the clock interrupt. The best Windows drift
    > performance was achieved when the policy code ensured that all the
    > previous interrupts (on the various vcpus) had been injected before
    > injecting the next interrupt to the vioapic.
    >
    > The policy code works as follows. It uses the hvm_isa_irq_assert_cb()
    > to record the vcpus to be interrupted in h->hpet.pending_mask. Then,
    > in the callback registered with hvm_register_intr_en_notif() at
    > post=1 time it clears the current vcpu in the pending_mask. When the
    > pending_mask is clear it decrements hpet.intr_pending_nr and if
    > intr_pending_nr is still non-zero posts another interrupt to the
    > ioapic with hvm_isa_irq_assert_cb(). Intr_pending_nr is incremented
    > in hpet_route_decision_not_missed_ticks().
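
The bookkeeping described above can be sketched as follows. All names are
hypothetical stand-ins rather than the patch's code: `nm_assert()` models
the hvm_isa_irq_assert_cb() call, `target_mask` models the set of vcpus
the vioapic would report, and the rule that a new tick asserts immediately
only when nothing else is outstanding is an assumption, since the
description does not spell out that case.

```c
#include <stdint.h>

/* Hypothetical model of the no-missed ticks state (up to 64 vcpus). */
struct nm_policy {
    uint64_t target_mask;      /* vcpus that receive the broadcast */
    uint64_t pending_mask;     /* vcpus that have not yet been injected */
    unsigned intr_pending_nr;  /* ticks owed to the guest */
    unsigned asserts;          /* how many times the line was asserted */
};

/* Stand-in for hvm_isa_irq_assert_cb(): record who will be interrupted. */
static void nm_assert(struct nm_policy *p)
{
    p->pending_mask = p->target_mask;
    p->asserts++;
}

/* A timer period elapsed: one more interrupt is owed. */
void nm_tick_due(struct nm_policy *p)
{
    p->intr_pending_nr++;
    if (p->intr_pending_nr == 1)   /* assumption: nothing in flight */
        nm_assert(p);
}

/* Injection (post = 1) callback for one vcpu. */
void nm_injected(struct nm_policy *p, unsigned vcpu)
{
    p->pending_mask &= ~(UINT64_C(1) << vcpu);
    if (p->pending_mask != 0)
        return;                    /* other vcpus still outstanding */

    /* Every vcpu got this tick; post the next one if any are owed. */
    if (p->intr_pending_nr > 0 && --p->intr_pending_nr > 0)
        nm_assert(p);
}
```

With two vcpus and three ticks owed, this model asserts the line three
times in total, and each assertion waits until both vcpus have been
injected with the previous one.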
    >
    > The missed ticks policy intr_en_notif callback also uses the
    > pending_mask method. So even though Linux does not broadcast its
    > interrupts, the code could handle it if it did. In this case the end
    > of interrupt time stamp is made when the pending_mask is clear.
    >
    > 4. Live Migration
    >
    > Live migration with hpet preserves the current offset of the guest
    > clock with respect to ntp. This is accomplished by migrating all of
    > the state in the h->hpet data structure in the usual way. The
    > hp->mc_offset is recalculated on the receiving node so that the
    > guest sees a continuous hpet main counter.
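
One way to picture the offset recalculation is shown below. This is a
sketch under an explicit assumption that is not stated in the text: that
the guest-visible counter is the host counter plus mc_offset. The function
names are hypothetical, and the sketch ignores the time that passes during
the migration itself, which the next paragraph addresses separately.

```c
#include <stdint.h>

/* Assumed relationship: guest view = host counter + per-guest offset. */
uint64_t guest_main_counter(uint64_t host_mc, uint64_t mc_offset)
{
    return host_mc + mc_offset;
}

/*
 * On the receiving node, choose a new offset so that the guest sees the
 * value it last saw on the sender. Unsigned arithmetic is modulo 2^64,
 * so this also works when the receiver's counter is ahead of the guest's.
 */
uint64_t recalc_mc_offset(uint64_t saved_guest_mc, uint64_t host_mc_now)
{
    return saved_guest_mc - host_mc_now;
}
```

For example, if the guest last read 5001000 on the sender and the
receiver's counter currently reads 200, the new offset is 5000800 and the
guest's next read continues from 5001000.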
    >
    > Code has been added to xc_domain_save.c to send a small message after
    > the domain context is sent. The content of the message is the
    > physical tsc timestamp, last_tsc, read just before the message is
    > sent. When the last_tsc message is received in xc_domain_restore.c,
    > another physical tsc timestamp, cur_tsc, is read. The two timestamps
    > are loaded into the domain structure as last_tsc_sender and
    > first_tsc_receiver with hypercalls. Then xc_domain_hvm_setcontext is
    > called so that hpet_load has access to these time stamps. Hpet_load
    > uses the timestamps to account for the time spent saving and loading
    > the domain context. With this technique, the only neglected time is
    > the time spent sending a small network message.
    >
    > 5. Test Results
    >
    > Some recent test results are:
    >
    > 5.1 Linux 4u664 and Windows 2k864 load test.
    >       Duration: 70 hours.
    >       Test date: 6/2/08
    >       Loads: usex -b48 on Linux; burn-in on Windows
    >       Guest vcpus: 8 for Linux; 2 for Windows
    >       Hardware: 8 physical cpu AMD
    >       Clock drift : Linux: .0012% Windows: .009%
    >
    > 5.2 Linux 4u664, Linux 4u464, and Windows 2k864 no-load test
    >       Duration: 23 hours.
    >       Test date: 6/3/08
    >       Loads: none
    >       Guest vcpus: 8 for each Linux; 2 for Windows
    >       Hardware: 4 physical cpu AMD
    >       Clock drift : Linux: .033% Windows: .019%
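
To put the percentages above in perspective, a drift figure can be
converted into accumulated clock error over the test duration, and
compared with the .05% ntp limit quoted in the introduction. The helper
names below are hypothetical and the arithmetic is only a restatement of
the reported numbers, not additional measurement data.

```c
#include <stdbool.h>

/* Accumulated error in seconds for a drift given in percent. */
double drift_seconds(double drift_percent, double duration_hours)
{
    return duration_hours * 3600.0 * drift_percent / 100.0;
}

/* The introduction quotes .05% as the limit for ntp to synchronize. */
bool within_ntp_limit(double drift_percent)
{
    return drift_percent <= 0.05;
}
```

For the 70 hour load test, the reported Linux drift of .0012% corresponds
to roughly 3 seconds of accumulated error, while the 2% drift seen before
this work would be well outside what ntp can correct.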
    >
    > 6. Relation to recent work in xen-unstable
    >
    > There is a similarity between hvm_get_guest_time() in xen-unstable
    > and read_64_main_counter() in this code. However,
    > read_64_main_counter() is more tuned to the needs of hpet.c. It has
    > no "set" operation, only the get. It isolates the mode, physical or
    > simulated, in read_64_main_counter() itself. It uses no vcpu or
    > domain state as it is a physical entity, in either mode. And it
    > provides a real physical mode for every read for those applications
    > that desire this.
    >
    > 7. Conclusion
    >
    > The virtual hpet is improved by this patch in terms of accuracy and
    > monotonicity.
    > Tests performed to date verify this and more testing is under way.
    >
    > 8. Future Work
    >
    > Testing with Windows Vista will be performed soon. The reason for
    > accuracy variations on different platforms using the physical hpet
    > device will be investigated. Additional overhead measurements on
    > simulated vs physical hpet mode will be made.
    >
    > Footnotes:
    >
    > 1. I don't recall the accuracy improvement with end of interrupt
    > stamping, but it was significant, perhaps better than two to one
    > improvement. It would be a very simple matter to re-measure the
    > improvement as the facility can call back at injection time as well.
    >
    >
    > Signed-off-by: Dave Winchell <dwinchell@xxxxxxxxxxxxxxx>
    > Signed-off-by: Ben Guthro <bguthro@xxxxxxxxxxxxxxx>
    >
    >
    > _______________________________________________
    > Xen-devel mailing list
    > Xen-devel@xxxxxxxxxxxxxxxxxxx
    > http://lists.xensource.com/xen-devel




