I think I still have a real concern here. Let me see if
I can explain.
The goal for Xen timekeeping is to ensure that if a guest
could somehow magically read any of its virtual clocks
(tsc, pit, hpet, pmtimer, ??) on all its virtual processors
simultaneously, the values read must always obey this
"virtual clock law":
max - min < delta
We can argue about how large that delta can reasonably
be, and it may vary with the workload, but it's certainly
under a millisecond; ten microseconds might not be a bad
starting point, and it is getting smaller as processors
get faster.
If Xen can't guarantee that, then it must turn on "numa"
mode, which appears to me to be extremely restrictive;
no system vendor could honestly sell the true promise
of virtualization on such a box. So we'd like to avoid
that if possible.
Now HP DL785-like designs are likely to become more common
because an HT/QPI interconnect makes it possible to build
a single model that is low-cost but very expandable.
Such boxes use multiple motherboards because it's much easier
to expand by adding field-replaceable units.
Unfortunately, the current Xen system time model (which I think
is also used by kvm?) may not be scalable to these boxes.
If the current Xen system time algorithm is scalable, great.
We are done. If it can be tweaked to be scalable, great,
no problem. But if the model needs to be changed substantially,
for example if everything needs to be built on a platform
timer because we just can't guarantee the "virtual clock law",
then we may have a real problem... and not just performance.
Why? Because the "paravirtual clock" API is hard-coded
in every existing PV domain... and in current and future
versions of the linux kernel (and probably in Windows too?).
If the new model is unable to use the same API, every
prepackaged VM is broken.
So I think we need to be very sure that we either:
A) do not need to change the xen system time model to
ensure the "virtual clock law" can be obeyed on
such boxes, or
B) DO need to change the xen system time model, but the
paravirtual clock API does NOT need to change, or
C) modify/augment the paravirtual clock API and start
getting the updated version into guests/kernels asap, or
D) ensure that system vendors know that Xen will never run
guests reliably on such a box, without restricting
operation to NUMA mode
Note that the Linux approach doesn't work here
because: 1) a guest's clocks might obey the "virtual clock
law" at one moment on one set of physical processors
and not at the next moment; 2) a guest's access to all
clocks (except the tsc) is emulated, so even if a guest
decides the tsc is unreliable, that just doesn't help
if the alternate clock it chooses (e.g. HPET) is silently
emulated on top of Xen system time using the physical tsc.
Now does that make my concern more clear?
> -----Original Message-----
> From: Jeremy Fitzhardinge [mailto:jeremy@xxxxxxxx]
> Sent: Friday, March 27, 2009 4:37 PM
> To: Dan Magenheimer
> Cc: Xen-Devel (E-mail); john.v.morris@xxxxxx
> Subject: Re: [Xen-devel] Time skew on HP DL785 (and possibly other
> Dan Magenheimer wrote:
> > However, I'm told that it's not possible to route a clocksource
> > over hypertransport, so TSCs on processors on different
> > motherboards may be VERY different, and apparently the
> > mechanisms for synchronizing Xen system time across
> > motherboards may not be up to the challenge. As a result,
> > OS's and apps sensitive to time that are running on PV
> > domains may be in for a rough ride on systems like this.
> > (HVM domains may run into other problems because time will
> > apparently stop for a "long time".)
> I don't see what the problem is. If each individual cpu has
> well-known tsc parameters (rate and offset), then a PV client
> will get those timing parameters and use them to compute its
> time. It doesn't matter if they're synchronized between cpus
> or nodes.
> Xen will need to calibrate each of them against a good reference
> (hpet?), but that's no different from now. I guess it's possible
> that this system has more variation and latency for hpet access,
> which may mean that the calibration algorithm needs tweaking.
> Of course, if the tsc rates on each cpu are changing in some
> unpredictable way, then that's a whole other barrel of problems.
> Guests rely on Xen maintaining accurate tsc timing parameters.
> > Since systems like this are targeted for consolidation
> > and virtualization, I see this as a potentially big problem
> > as it may appear to real Xen customers as bizarre
> > non-reproducible problems, such as "make" failing,
> > leading to questions about the stability and viability
> > of using Xen.
> > Comments?
> In Linux there's this function:
>
> /*
>  * apic_is_clustered_box() -- Check if we can expect good TSC
>  *
>  * Thus far, the major user of this is IBM's Summit2 series:
>  * Clustered boxes may have unsynced TSC problems if they are
>  * multi-chassis. Use available data to take a good guess.
>  * If in doubt, go HPET.
>  */
> __cpuinit int apic_is_clustered_box(void)
> Which deals with Summit2 and ScaleSMP vsmp systems which also have
> unsynchronized tscs across nodes. At the moment it assumes that no
> non-VSMP AMD system has unsynchronized tscs; sounds like it will need
> updating for this system.
Xen-devel mailing list