On 09/21/09 15:20, Dan Magenheimer wrote:
>>> However, I do need one special case to indicate
>>> emulation vs non-emulation, so wraparound is
>>> still a problem.
>>>
>> I was assuming you'd just repurpose the existing version number scheme
>> which is always even, and therefore can never equal -1.
>>
> That wasn't my plan but if it can be made to work (see
> below), it probably saves code in Xen.
>
>
>> What's the full algorithm for detecting this feature? Usermode has to
>> establish:
>>
>> 1. It is running under Xen (or not, if you expect this to be
>>    implemented on multiple hypervisors)
>> 2. rdtscp is available
>> 3. the ABI is actually being implemented, ie:
>>    1. the tsc_aux value actually has the correct meaning
>>    2. it has a working mechanism for getting the tsc scaling
>>       parameters
>>    3. (accommodate ways to evolve the ABI in a
>>       back-compatible way)
>> before it can do anything else.
>>
> Yes, that's what I was thinking. I was planning on prototyping
> these checks with "userland-rdmsr" but userland-hypercall or
> userland-shared-page could work also.
>
>
>> If nothing else, it's probably worth removing the rdtscp feature from
>> the logical guest cpuid, so that nothing else tries to use it for its
>> own purposes; in other words, you're exclusively claiming rdtscp for
>> this ABI. Or you could disable this ABI if a guest kernel tries to set
>> TSC_AUX.
>>
> I was thinking that setting pvrdtscp=1 would override
> any kernel use of rdtscp/TSC_AUX, but disabling the
> cpuid has_rdtscp flag and using a different userland
> detection mechanism (than checking cpuid for has_rdtscp)
> would be a better way to avoid possible conflict.
>
>
>>> I've restricted the scheme to constant_tsc as I think
>>> it breaks down due to nasty races if running on a
>>> machine where the pvclock parameters differ across
>>> different pcpus. I think the races can only be
>>> avoided if Xen sets the TSC_AUX for all of the
>>> pcpus running a pvrdtscp domain while all are idle.
>>>
>>> Is there a scheme that avoids the races?
>>>
>> rdtscp makes it quite easy to avoid races because you get the tsc and
>> metadata about the tsc atomically. You just need to encode enough info
>> in the metadata to do the conversion.
>>
> Yes but I don't think there are enough bits for encoding
> it all (32 bits in TSC_AUX, right?).
>
>
>> The obvious thing to do is to pack a version number and pcpu number
>> into TSC_AUX. Usermode would maintain an array of pv_clock parameters,
>> one for each pcpu. If the version number matches, then it uses the
>> parameters it has; if not it fetches new parameters and repeats the
>> rdtscp. There's no need to worry about either thread or vcpu context
>> switches because you get the (tsc,params) tuple atomically, which is
>> the tricky bit without rdtscp.
>>
>> (The version number would be truncated wrt the normal pvclock version
>> number, but it just needs to be large enough to avoid aliasing from
>> wrapping; I'm assuming something like 24 bits version and 8 bits cpu
>> number.)
>>
> I think a race occurs if the vcpu switches pcpu TWICE
> from pcpu-A to pcpu-B and back to pcpu-A and does rdtscp
> each time on pcpu-A but reads one or more pvclock parameters
> (that are too big to be encoded in TSC_AUX) on pcpu-B.
>
That shouldn't matter. Once the process has (tsc,cpu,version) it can
use its own local copy of cpu's pvclock parameters to compute the
tsc->ns conversion. Once it has that triple, it doesn't matter if it
gets context-switched; the time computation doesn't depend on what CPU
is currently running.
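
To make that concrete, here is a rough sketch of the decode-and-convert
step. The 24/8 split of TSC_AUX is only the layout I floated above, and
the names (pv_params, aux_pack, sample_to_ns, ...) are made up for
illustration; none of this is an existing Xen interface. The scaling is
the usual pvclock shift/multiply.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: high 24 bits = truncated version, low 8 bits = pcpu. */
#define AUX_CPU_BITS 8u
#define AUX_CPU_MASK ((1u << AUX_CPU_BITS) - 1u)
#define AUX_VER_MASK 0xffffffu

/* Process-local copy of one pcpu's pvclock parameters (hypothetical type). */
struct pv_params {
    uint32_t version;            /* compared after truncation to 24 bits */
    uint64_t tsc_timestamp;      /* tsc value when system_time_ns was taken */
    uint64_t system_time_ns;
    uint32_t tsc_to_system_mul;  /* 32.32 fixed-point multiplier */
    int8_t   tsc_shift;
};

static inline uint32_t aux_cpu(uint32_t aux)     { return aux & AUX_CPU_MASK; }
static inline uint32_t aux_version(uint32_t aux) { return aux >> AUX_CPU_BITS; }

static inline uint32_t aux_pack(uint32_t version, uint32_t cpu)
{
    return (version << AUX_CPU_BITS) | (cpu & AUX_CPU_MASK);
}

/* pvclock-style scaling: ns = system_time + ((delta << shift) * mul) >> 32 */
static uint64_t tsc_to_ns(const struct pv_params *p, uint64_t tsc)
{
    uint64_t delta = tsc - p->tsc_timestamp;

    if (p->tsc_shift >= 0)
        delta <<= p->tsc_shift;
    else
        delta >>= -p->tsc_shift;

    /* (delta * mul) >> 32, split so it never needs a 128-bit product */
    uint64_t hi = delta >> 32, lo = delta & 0xffffffffu;
    uint64_t scaled = hi * p->tsc_to_system_mul +
                      ((lo * p->tsc_to_system_mul) >> 32);

    return p->system_time_ns + scaled;
}

/*
 * Convert one (tsc, aux) sample using only the local table.  Returns false
 * on a version mismatch, i.e. the caller must refresh table[cpu] and redo
 * the rdtscp.  Which cpu the thread is on *now* never enters into it.
 */
static bool sample_to_ns(const struct pv_params table[256],
                         uint64_t tsc, uint32_t aux, uint64_t *ns)
{
    const struct pv_params *p = &table[aux_cpu(aux)];

    if ((p->version & AUX_VER_MASK) != aux_version(aux))
        return false;
    *ns = tsc_to_ns(p, tsc);
    return true;
}
```

Note that sample_to_ns() takes the sample by value, so a context switch
after the rdtscp can't change the answer.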

It only needs to iterate if it gets a version mismatch. You can
potentially get a livelock if the version is constantly changing between
the rdtscp and the get-pvclock-params, exacerbated if the process keeps
bouncing between cpus between the two. But given that the
rdtscp+get-params should take no more than a couple of microseconds, it
seems very unlikely the process is sustaining a megahertz CPU migration
rate.

And even if it fails, the process always has to be prepared to go to
some other time source.
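
A sketch of what that bounded retry might look like. The rdtscp and the
parameter fetch are abstracted behind function pointers because the fetch
mechanism is still undecided; the fake_* functions are just a stand-in so
the loop can be exercised, and MAX_TRIES and all the names are
assumptions of mine.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_PCPUS 256   /* 8-bit cpu field in the assumed TSC_AUX layout */
#define MAX_TRIES 4     /* arbitrary bound; give up rather than livelock */

struct sample { uint64_t tsc; uint32_t aux; };

struct clock_src {
    struct sample (*read_tscp)(void *ctx);            /* would wrap rdtscp */
    uint32_t (*fetch_version)(void *ctx, uint32_t cpu); /* refetch params,
                                                          return their version */
    void *ctx;
    uint32_t cached_version[MAX_PCPUS];
    bool valid[MAX_PCPUS];
};

/*
 * Returns true with *out set to a sample whose version matches the cached
 * parameters for that sample's pcpu.  Returns false after MAX_TRIES
 * mismatches, in which case the caller goes to another time source.
 */
static bool read_consistent(struct clock_src *s, struct sample *out)
{
    for (int i = 0; i < MAX_TRIES; i++) {
        struct sample sm = s->read_tscp(s->ctx);
        uint32_t cpu = sm.aux & 0xffu;
        uint32_t ver = sm.aux >> 8;

        if (s->valid[cpu] && s->cached_version[cpu] == ver) {
            *out = sm;
            return true;
        }
        /* Mismatch (or first use): refresh this pcpu's entry, then retry. */
        s->cached_version[cpu] = s->fetch_version(s->ctx, cpu);
        s->valid[cpu] = true;
    }
    return false;
}

/* Stand-in "hypervisor" for experimenting with the loop; not a real API. */
struct fake_hv {
    uint64_t tsc;
    uint32_t cpu;
    uint32_t version;
    uint32_t bump_per_read;  /* >0 simulates params changing on every read */
};

static struct sample fake_read_tscp(void *ctx)
{
    struct fake_hv *hv = ctx;

    hv->version += hv->bump_per_read;
    hv->tsc += 100;
    return (struct sample){ hv->tsc,
                            ((hv->version & 0xffffffu) << 8) | (hv->cpu & 0xffu) };
}

static uint32_t fake_fetch_version(void *ctx, uint32_t cpu)
{
    struct fake_hv *hv = ctx;

    (void)cpu;
    return hv->version & 0xffffffu;
}
```

With bump_per_read = 0 the second rdtscp matches and the loop exits; with
bump_per_read = 1 the cached version is always one behind, which is the
livelock case, and the loop gives up after MAX_TRIES.
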
> If Xen can atomically bump/change
> TSC_AUX on *all* pcpus running a guest vcpu, the race
> can be avoided. But I suspect that is too expensive (some
> kind of rendezvous required for each bump on any processor).
>
Right. Any synchronized cross-cpu call is going to be very expensive,
and can't be done atomically without some kind of stop-the-world which
is even worse.
> Even if my assumption of the race (above) is incorrect,
> 32-bits is not very much time at 100Hz. But the version
> bump needs to occur synchronously with every P/C-state
> transition for pvclock to work on non_constant_tsc machines
> doesn't it? How frequent can those transitions occur?
>
24 bits at 100Hz is 46ish hours. So there's a potential alias problem
if the program reads the tsc at precisely 46.603 (ish) hours after its
previous read. One workaround would be to force a re-read of the timing
parameters every X secs/mins/hours to guarantee that there's no wrap for
some expected rate of param updates.
That said, the standard pvclock algorithm is only 128 times better than
that, and I don't think it has ever been considered to be a problem. I've
never seen an update rate of more than once every few seconds.
Also Xen need only update the version number if something has actually
read that version; if nobody had read the current parameters, there's no
need to update the version when updating them to a new value. That
would help mitigate the case of rapid param updates and a low rate of
reading.
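
For anyone who wants to check the wrap arithmetic, a throwaway helper
(the name is mine). The 31-bit comparison assumes the standard pvclock
version is 32 bits with the low bit spoken for by the always-even
convention, which is where the factor of 128 comes from.

```c
#include <stdint.h>

/* Seconds until a `bits`-wide version counter wraps at a given bump rate. */
static double wrap_seconds(unsigned bits, double bumps_per_sec)
{
    return (double)(UINT64_C(1) << bits) / bumps_per_sec;
}
```

wrap_seconds(24, 100.0) / 3600 comes out at about 46.6 hours, and
wrap_seconds(31, 100.0) is 128 times that. Any forced re-read interval
comfortably below wrap_seconds() for the worst expected bump rate rules
out the alias.
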
> I guess this all depends on what Xen is capable of
> guaranteeing. If Xen can provide a "cacheline
> bounce guarantee", the app shouldn't have to care.
>
It can't, in principle, sync the tscs at a finer grain than the app can
measure. It only has the same resources to play with, and there'll
always be some error margin.
> Linux now seems to provide a cacheline bounce guarantee for
> itself, but afaik has no way to communicate that to an app
> using raw rdtsc{,p} and all the relevant syscalls have a
> monotonicity option and/or have insufficient resolution
> to matter.
>
It's a detail that a usermode app can't rely on anyway.
J