> > However, I do need one special case to indicate
> > emulation vs non-emulation, so wraparound is
> > still a problem.
>
> I was assuming you'd just repurpose the existing version number scheme
> which is always even, and therefore can never equal -1.
That wasn't my plan but if it can be made to work (see
below), it probably saves code in Xen.
> What's the full algorithm for detecting this feature? Usermode has to
> establish:
>
> 1. It is running under Xen (or not, if you expect this to be
>    implemented on multiple hypervisors)
> 2. rdtscp is available
> 3. the ABI is actually being implemented, ie:
>    1. the tsc_aux value actually has the correct meaning
>    2. it has a working mechanism for getting the tsc scaling
>       parameters
>    3. (accommodate ways to evolve the ABI in a
>       back-compatible way)
> before it can do anything else.
Yes, that's what I was thinking. I was planning on prototyping
these checks with "userland-rdmsr" but userland-hypercall or
userland-shared-page could work also.
> If nothing else, it's probably worth removing the rdtscp feature from the
> logical guest cpuid, so that nothing else tries to use it for its own
> purposes; in other words, you're exclusively claiming rdtscp for this
> ABI. Or you could disable this ABI if a guest kernel tries to set
> TSC_AUX.
I was thinking that setting pvrdtscp=1 would override
any kernel use of rdtscp/TSC_AUX, but disabling the
cpuid has_rdtscp flag and using a different userland
detection mechanism (than checking cpuid for has_rdtscp)
would be a better way to avoid possible conflict.
> > I've restricted the scheme to constant_tsc as I think
> > it breaks down due to nasty races if running on a
> > machine where the pvclock parameters differ across
> > different pcpus. I think the races can only be
> > avoided if Xen sets the TSC_AUX for all of the
> > pcpus running a pvrdtscp domain while all are idle.
> >
> > Is there a scheme that avoids the races?
>
> rdtscp makes it quite easy to avoid races because you get the tsc and
> metadata about the tsc atomically. You just need to encode enough info
> in the metadata to do the conversion.
Yes, but I don't think there are enough bits to encode
it all (32 bits in TSC_AUX, right?).
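To make the size problem concrete, here is a minimal sketch of the conversion usermode would have to do. It assumes the usual pvclock scaling (shift the TSC delta, multiply by a 32.32 fixed-point factor, add the base time); the struct is a local illustration with hypothetical naming, not a copy of any Xen header. The four fields add up to 168 bits, well beyond the 32 bits available in TSC_AUX.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative parameter set; field names are assumptions for this sketch. */
struct pv_params {
    uint64_t tsc_timestamp;      /* TSC value when the parameters were sampled */
    uint64_t system_time;        /* nanoseconds at tsc_timestamp */
    uint32_t tsc_to_system_mul;  /* 32.32 fixed-point multiplier */
    int8_t   tsc_shift;          /* pre-scale shift, may be negative */
};

/* Convert a raw TSC reading to nanoseconds using one parameter set. */
uint64_t pv_tsc_to_ns(const struct pv_params *p, uint64_t tsc)
{
    assert(p->tsc_shift > -64 && p->tsc_shift < 64);

    uint64_t delta = tsc - p->tsc_timestamp;
    if (p->tsc_shift >= 0)
        delta <<= p->tsc_shift;
    else
        delta >>= -p->tsc_shift;

    /* (delta * mul) >> 32, split into halves so no 128-bit type is needed */
    uint64_t hi = delta >> 32;
    uint64_t lo = delta & 0xffffffffu;
    uint64_t scaled = hi * p->tsc_to_system_mul
                    + ((lo * p->tsc_to_system_mul) >> 32);

    return p->system_time + scaled;
}
```

With a multiplier of 0x80000000 (0.5) and a shift of 1, a delta of 100 ticks comes out as 100 ns on top of system_time.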
> The obvious thing to do is to pack a version number and pcpu number into
> TSC_AUX. Usermode would maintain an array of pv_clock parameters, one
> for each pcpu. If the version number matches, then it uses the
> parameters it has; if not it fetches new parameters and repeats the
> rdtscp. There's no need to worry about either thread or vcpu context
> switches because you get the (tsc,params) tuple atomically, which is the
> tricky bit without rdtscp.
>
> (The version number would be truncated wrt the normal pvclock version
> number, but it just needs to be large enough to avoid aliasing from
> wrapping; I'm assuming something like 24 bits version and 8 bits cpu
> number.)
I think a race occurs if the vcpu switches pcpu TWICE
from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
each time on pcpu-A but reads one or more pvclock parameters
(that are too big to be encoded in TSC_AUX) on pcpu-B.
If Xen can atomically bump/change
TSC_AUX on *all* pcpus running a guest vcpu, the race
can be avoided. But I suspect that is too expensive (some
kind of rendezvous would be required for each bump on any processor).
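For reference, the 24-bit version / 8-bit cpu packing suggested in the quoted text could be sketched as below. The bit layout (version in the high bits), the helper names, and the use of all-ones as the reserved special value are assumptions for illustration only. With this layout an even version can never produce 0xffffffff, which is the property mentioned at the top of the thread. The sketch only covers the "does my cached copy match" check; it does not address the A-to-B-to-A concern above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define AUX_CPU_BITS 8u
#define AUX_CPU_MASK ((1u << AUX_CPU_BITS) - 1u)
#define AUX_VER_MASK 0xffffffu      /* version truncated to 24 bits */
#define AUX_SPECIAL  0xffffffffu    /* assumed reserved "-1" value */

/* Build a TSC_AUX value from a (possibly wider) version and a pcpu number. */
uint32_t aux_pack(uint32_t version, uint32_t pcpu)
{
    assert(pcpu <= AUX_CPU_MASK);
    return ((version & AUX_VER_MASK) << AUX_CPU_BITS) | pcpu;
}

uint32_t aux_version(uint32_t aux) { return aux >> AUX_CPU_BITS; }
uint32_t aux_pcpu(uint32_t aux)    { return aux & AUX_CPU_MASK; }

/* True if the cached parameters for the pcpu named in aux are still valid.
 * On false, usermode would refetch parameters and repeat the rdtscp. */
bool aux_params_current(const uint32_t cached_version[256], uint32_t aux)
{
    if (aux == AUX_SPECIAL)
        return false;
    return cached_version[aux_pcpu(aux)] == aux_version(aux);
}
```

Usermode would call aux_params_current() on the value returned alongside each rdtscp and fall back to the slow path whenever it returns false.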
> > Fortunately, this also has the effect of greatly
> > reducing the version increase frequency.
>
> I don't think that's going to be a huge issue; fetching time parameters
> with a syscall/hypercall would be on the same order as doing an emulated
> rdtsc, and would only need to happen, say, once per timeslice (100Hz?)
> at the outside.
Even if my assumption of the race (above) is incorrect,
32 bits is not very much time at 100Hz. But the version
bump needs to occur synchronously with every P/C-state
transition for pvclock to work on non_constant_tsc machines,
doesn't it? How frequently can those transitions occur?
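As a rough check on the wraparound numbers, the time until a version counter wraps is just 2^bits divided by the bump rate. The helper below is a hypothetical illustration, and it assumes one increment per bump (a scheme that only uses even versions would halve the result).

```c
#include <assert.h>
#include <stdint.h>

/* Whole seconds until a counter of `bits` bits wraps at `hz` bumps/second. */
uint64_t wrap_seconds(unsigned bits, uint64_t hz)
{
    assert(bits >= 1 && bits <= 63 && hz > 0);
    return (UINT64_C(1) << bits) / hz;
}
```

At 100 bumps per second, a full 32-bit counter lasts 42949672 seconds (about 497 days), while a 24-bit version field lasts 167772 seconds (under two days). If P/C-state transitions push the rate to 1000 per second, 24 bits wraps in under five hours.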
> > The rate is synced but the values may not be. Since
> > software (BIOS or Xen) sets tsc on each processor
> > it is essentially impossible to ensure they are
> > identical. The rendezvous algorithm should be able
> > to set them so that they are "unobservably" different,
> > but I keep hearing "within 2usec". (It would be
> > interesting to measure this across a broad set
> > of machines.) So it's probably prudent to recommend
> > that apps be prepared for the possibility even if
> > it never happens.
>
> You don't need to guarantee anything stronger than they'd see on bare
> hardware. You also need to be more precise about exactly what you're
> guaranteeing.
>
> Are you saying that a single thread will never see regressing tscs?
> That just requires making sure that Xen gets the tscs synced closer than
> the context switch time of a thread between cpus, which should be
> possible.
>
> Or are you making the stronger guarantee that two threads running
> concurrently on different cpus doing rdtsc will see monotonically
> increasing tscs with respect to the ordering of all their operations?
> That would require arbitrarily close syncing (well, within the time it
> takes a cacheline to bounce, I guess).
I guess this all depends on what Xen is capable of
guaranteeing. If Xen can provide a "cacheline
bounce guarantee", the app shouldn't have to care.
Linux now seems to provide a cacheline bounce guarantee for
itself, but afaik has no way to communicate that to an app
using raw rdtsc{,p} and all the relevant syscalls have a
monotonicity option and/or have insufficient resolution
to matter.
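If the guarantee can't be communicated, an app that wants to "be prepared for the possibility" can enforce monotonicity itself. A minimal sketch, assuming C11 atomics and a hypothetical helper name: every reading is clamped to the largest value already handed out through the same shared word.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Return raw, or the largest value already returned via *last if raw is
 * behind it. *last is shared by all threads that need a common ordering. */
uint64_t tsc_monotonic(_Atomic uint64_t *last, uint64_t raw)
{
    assert(last != NULL);

    uint64_t prev = atomic_load(last);
    while (raw > prev) {
        /* On failure, prev is reloaded with the current value of *last. */
        if (atomic_compare_exchange_weak(last, &prev, raw))
            return raw;
    }
    return prev;
}
```

The obvious cost is that the shared word itself bounces between cpus on every call, so this only makes sense if the app really needs cross-thread ordering.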