WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/

xen-devel

Re: [Xen-devel] [PATCH] per-cpu timer changes

On Tue, May 24, 2005 at 02:20:36AM +0100, Ian Pratt wrote:
> 
> Don,
> 
> This is looking good. To help other people review the patch, it might be
> a good idea to post some of the design discussion we had off list as I
> think the approach will be new to most people. (Perhaps put some of the
> text in a comment in the hypervisor interface).
> 
> As regards the time going backwards messages, if you're seeing small -ve
> deltas, I'm not surprised -- you need to round to some precision as we
> won't be nanosecond accurate. Experience suggests we'll be good for a
> few 10's of ns with any kind of decent crystal. We could round to e.g.
> 512ns or 1024ns to make sure.
> 
> Best,
> Ian
> 

I am including the email that we exchanged off-list.  I started to edit
it, but decided that what I thought unimportant others might find vital,
so I am including the exchange in full.

The time going backwards happened only occasionally, and it was a BIG
jump backwards.  I tracked it down yesterday to a problem with doing
32-bit arithmetic in Linux on the TSC values.  For some reason, every
5-20 minutes Xen seems to pause for about 5 seconds.  This causes the
TSC delta to wrap if only 32 bits are used, and the 'time went
backwards' message is printed.  I changed to using 64-bit TSC deltas and
have been running since yesterday afternoon without any 'time went
backwards' messages.  I want to do some more cleanup (remove my
debugging code) and will post all my changes to the list this afternoon.


----- Forwarded by Don Fry/Beaverton/IBM on 05/26/2005 09:29 AM -----

Bruce Jones/Beaverton/IBM wrote on 04/21/2005 09:07:26 AM:

> John, can you provide some additional technical guidance here?
>
> Ian, Keir: John is the implementor of our Linux changes for Summit
> and understands these issues better than anyone.
>
> I've added Don to the cc: list but he's on vacation this week and
> not reading email.
>
>  -- brucej
>
> Ian Pratt <Ian.Pratt@xxxxxxxxxxxx> wrote on 04/20/2005 05:42:47 PM:
>
> > > "Ian Pratt" <m+Ian.Pratt@xxxxxxxxxxxx> wrote on 04/20/2005 04:47:44 PM:
> > > > Please could Don write a paragraph explaining why cyclone timer support
> > > > is useful. Do summit systems have different frequency CPUs in the same
> > > > box?
> > > Bruce writes:
> > > I can write that paragraph myself.  IBM's high end xSeries systems are
> > > NUMA systems, each node is a separate machine with its own front-side
> > > bus, I/O buses, etc...  The chipset provides a cache-coherent interconnect
> > > to allow them to be cabled together into one big system.
> >
> > OK, so even the FSB clocks come from different crystals.
>
> Yes, and the hardware intentionally skews their frequencies, for reasons
> only the chipset designers understand. :)
>
> > > We had a boatload of problems with Linux when we first shipped it, with
> > > time moving around forward and backward for applications.  The processors
> > > in the various nodes  run at different frequencies and the on-processor
> > > timers do not run in sync.  We needed to modify Linux to use a system-wide
> > > timer.  Our chipset (code-named Cyclone) provides one, for newer systems
> > > Intel has defined the HPET that we can use.  We need to make similar
> > > changes to Xen.
> >
> > This needs some agreement on the design.
> >
> > My gut feeling is that it should still be possible for guests to use
> > the TSC to calculate the time offset relative to the published
> > Xen system time record (which is updated every couple of
> > seconds). The TSC calibration should be good enough to mean that
> > the relative drift over the period between records is tiny (and
> > errors can't accumulate beyond the period).
>
> My gut feeling is that your gut feeling is wrong.  We can't ever
> use the TSC on these systems - even a tiny amount of relative drift
> causes problems.
>
> But I'm no expert.  John, this is your cue.  Please join in.
>
> > The 'TSC when time record created' and 'TSC frequency' will have
> > to be per VCPU and updated to reflect the real CPU that the VCPU
> > is running on.
>
> As long as these are virtual and not read using the readTSC instruction,
> we may be OK.
>
> >
> > Ian

----- Forwarded by Don Fry/Beaverton/IBM on 05/26/2005 09:29 AM -----

"Ian Pratt" <m+Ian.Pratt@xxxxxxxxxxxx> wrote on 04/21/2005 09:24:54 AM: 

> > Yes, and the hardware intentionally skews their frequencies,
> > for reasons only the chipset designers understand. :)
>
> It's to be sneaky as regards FCC EMC emissions regulations.
>
> Some systems even modulate the PCI bus frequency.
>
> > > My gut feeling is that it should still be possible for
> > guests to use
> > > the TSC to calculate the time offset relative to the published Xen
> > > system time record (which is updated every couple of
> > seconds). The TSC
> > > calibration should be good enough to mean that the relative
> > drift over
> > > the period between records is tiny (and errors can't
> > accumulate beyond
> > > the period).
> >
> > My gut feeling is that your gut feeling is wrong.  We can't
> > ever use the TSC on these systems - even a tiny amount of
> > relative drift causes problems.
>
> It depends on the crystal stability, the accuracy with which the
> calibration is done, and the frequency of publishing new absolute time
> records.
>
> The latter can be made quite frequent if need be.
>
> I'd much prefer avoiding having to expose Linux to the HPET/cyclone by
> hiding it in Xen, and having the guest use TSC extrapolation from the
> time record published by Xen.
> We'd just need to update the current interface to have per-CPU records
> (and TSC frequency calibration).
>
> > But I'm no expert.  John, this is your cue.  Please join in.
> >
> > > The 'TSC when time record created' and 'TSC frequency' will
> > have to be
> > > per VCPU and updated to reflect the real CPU that the VCPU
> > is running
> > > on.
> >
> > As long as these are virtual and not read using the readTSC
> > instruction, we may be OK.
>
> Using readTSC should be fine, since we're only using it to extrapolate
> from the last Xen supplied time record, and we've calibrated the
> frequency of the particular CPU we're running on. We only have to worry
> about rapid clock drift due to sudden temperature changes etc, but even
> then we can just update the calibration frequency periodically. Using
> this approach we get to keep gettimeofday very fast, and avoid
> complicating the hypervisor API -- it's exactly what we need for
> migrating a domain between physical servers with different frequency
> CPUs.
>
> Ian
>

----- Forwarded by Don Fry/Beaverton/IBM on 05/26/2005 09:29 AM -----

"Ian Pratt" <m+Ian.Pratt@xxxxxxxxxxxx> wrote on 04/21/2005 01:12:51 PM: 

> > First, forgive my lack of knowledge about Xen. Since I don't
> > know the details of what you're proposing, let me make a
> > straw-man and let you correct my assumptions.
> >
> > Let's say you're proposing that time be calculated with the
> > following formula:
> >
> > timeofday = xen_time_base +  rdtsc() - xen_last_tsc[CPUNUM]
> >
> > Given a guest domain with two cpus, the issue is managing
> > xen_last_tsc[] and xen_time_base. For the equation to work,
> > xen_last_tsc[0] must hold the TSC value from CPU0 at exactly
> > the time stored in xen_time_base. Additionally the same is
> > true with xen_last_tsc[1].
>
> I'm proposing:
>
> timeofday  = round_to_precision( last_xen_time_base[VCPU] +
>              ( rdtsc() - last_xen_tsc[VCPU] ) * xen_tsc_calibrate[VCPU]
> )
>
> We update last_xen_time_base and last_xen_tsc on each CPU every second
> or so.
> xen_tsc_calibrate is calculated for each CPU at start of day. For
> completeness, we could recalculate the calibration every 30s or so to 
> cope with crystal temperature drift if we wanted ultimate precision.
>
> > The difficult question is how do you ensure that the two
> > values in xen_last_tsc[] are linked with the time in
> > xen_time_base? This requires reading the TSC on two cpus at
> > the exact same time. Additionally, this sync point must
> > happen frequently enough so that the continuing drift between
> > cpus isn't noticed.
>
> Nope, we would set the time_base on each CPU independently, but relative
> to the same timer.
> This could be the cyclone, HPET, or even the PIT if it's possible to read
> the same PIT from any node (though I'm guessing you probably have a PIT
> per node and can't read the remote one).
>
> > Then you'll have to weigh that solution against just using an
> > alternate global timesource like HPET/Cyclone.
>
> I'd prefer to avoid this as it would mean that there'd be a different 
> hypervisor API for guests on cyclone/hpet systems vs. normal synchronous
> CPU systems.
> Using the TSC will probably give a lower cost gettimeofday, we can also
> trap it and emulate if we want to lie to guests about the progress of
> time.
>
> Best,
> Ian

----- Forwarded by Don Fry/Beaverton/IBM on 05/26/2005 09:29 AM -----

John Stultz/Beaverton/IBM wrote on 04/21/2005 01:49:54 PM:

> I'm just resending this with proper addresses as something got futzed up
> in the CC list on that last mail.
>
> "Ian Pratt" <m+Ian.Pratt@xxxxxxxxxxxx> wrote on 04/21/2005 01:12:51 PM:
>
> > > First, forgive my lack of knowledge about Xen. Since I don't
> > > know the details of what you're proposing, let me make a
> > > straw-man and let you correct my assumptions.
> > >
> > > Let's say you're proposing that time be calculated with the
> > > following formula:
> > >
> > > timeofday = xen_time_base +  rdtsc() - xen_last_tsc[CPUNUM]
> > >
> > > Given a guest domain with two cpus, the issue is managing
> > > xen_last_tsc[] and xen_time_base. For the equation to work,
> > > xen_last_tsc[0] must hold the TSC value from CPU0 at exactly
> > > the time stored in xen_time_base. Additionally the same is
> > > true with xen_last_tsc[1].
> >
> > I'm proposing:
> >
> > timeofday  = round_to_precision( last_xen_time_base[VCPU] +
> >              ( rdtsc() - last_xen_tsc[VCPU] ) * xen_tsc_calibrate[VCPU]
> > )
> >
> > We update last_xen_time_base and last_xen_tsc on each CPU every second
> > or so.
>
> Or possibly more frequently, as on a 4 GHz CPU the 32-bit TSC will wrap
> each second. Alternatively you could use the full 64 bits.
>
> > xen_tsc_calibrate is calculated for each CPU at start of day. For
> > completeness, we could recalculate the calibration every 30s or so to
> > cope with crystal temperature drift if we wanted ultimate precision.
> >
> > > The difficult question is how do you ensure that the two
> > > values in xen_last_tsc[] are linked with the time in
> > > xen_time_base? This requires reading the TSC on two cpus at
> > > the exact same time. Additionally, this sync point must
> > > happen frequently enough so that the continuing drift between
> > > cpus isn't noticed.
> >
> > Nope, we would set the time_base on each CPU independently, but relative
> > to the same timer.

> Hmmm. That sounds like it could work. Just be sure that preempt won't
> bite you in the timeofday calculation. The bit about still using the
> cyclone/HPET to sync the different xen_time_base[] values is the real key.
>
> > This could be the cyclone, HPET, or even the PIT if it's possible to read
> > the same PIT from any node (though I'm guessing you probably have a PIT
> > per node and can't read the remote one).

> The ioport space is unified by the BIOS so there is one global PIT shared
> by all nodes. Although, as you'll need a continuous timesource that doesn't
> loop between xen_time_base updates, the PIT would not work.
>
> thanks
> -john
----- Forwarded by Don Fry/Beaverton/IBM on 05/26/2005 09:29 AM -----

"Ian Pratt" <m+Ian.Pratt@xxxxxxxxxxxx> wrote on 04/28/2005 07:08:05 PM:

>
> > First I apologize for not being involved in this email
> > exchange last week.
> > I am also just learning about Xen so my questions may be
> > obvious to others.
> >
> > What is the last_xen_time_base referred to in Ian's email? Is
> > this the stime_irq or wc_sec,wc_usec or something else?
>
> I was referring to the wc_ wall clock and system time values.
> We'll need to make these per VCPU, or perhaps slightly more cleanly,
> store an offset in ns for each VCPU.
>
> > When would the last_xen_tsc[VCPU] values be captured by Xen?
> > Currently, the tsc for cpu 0 is obtained during
> > timer_interrupt as full_tsc_irq.
>
> These just need to be captured periodically on each real CPU -- every
> couple of seconds would do it, though more frequently wouldn't hurt.
>
> > When updating the domain's shared_info structure mapping the
> > physical CPU to the domain's view of the CPU would need to be
> > done. For example if domain2 was running on CPU3 and CPU2 and
> > the domain's view was cpu0 and cpu1, the saved tsc value for
> > CPU3 would be copied to last_xen_tsc[0] and CPU2 to
> > last_xen_tsc[1] before sending the interrupt to the domain.
>
> Yep, this shouldn't be hard -- there's already some code to spot when
> they need to be updated.
>
> > From the last algorithm from Ian, I don't see anything that
> > refers to the Cyclone/HPET value directly. Is that because
> > Xen is the only thing that reads the Cyclone/HPET counter and
> > the domain just uses the TSC?
>
> Yep, we don't want to expose the cyclone/hpet to guests. There's no
> need, and it would have implications for migrating VMs between different
> systems.
>
> Strictly speaking, Xen wouldn't even need support for the hpet/cyclone
> as it could just use the shared PIT, though I have no objection to
> adding such support.
>
> Are you happy with this design? It's a little more work, but I believe
> better in the long run. We need to get the hypervisor interface change
> incorporated ASAP.
>
> Cheers,
> Ian

----- Forwarded by Don Fry/Beaverton/IBM on 05/26/2005 09:29 AM -----

"Ian Pratt" <m+Ian.Pratt@xxxxxxxxxxxx> wrote on 04/30/2005 12:04:57 AM:

> > It sounds like the per-cpu changes should be sufficient.
> >
> > Having a time base and ns deltas for each CPU sounds
> > interesting, but wouldn't you have do a subtraction to
> > generate the delta in Xen, and then add it back in, in the
> > domain? Just saving the per-cpu value would save the extra
> > add and subtract.
>
> Sure, but the add/subtract won't cost much, and it saves some space in
> the shared info page, which might be an issue if we have a lot of VCPUs.
>
> Not a big deal either way.
>
> > The bottom line is that it can all be done with the TSC,
> > without needing to use the Cyclone or HPET hardware, which
> > isn't available on all systems like the TSC.
>
> Great, we're in agreement. I think the first stage is just to do the per
> [V]CPU calibration and time vals. Could you work something up?
>
> Thanks,
> Ian


-- 
Don Fry
brazilnut@xxxxxxxxxx

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel