Hi Dan,
Sorry for the late reply! See my comments below.
>
> Thanks very much for the additional detail on the 10%
> performance loss. What is this oltp benchmark? Is
> it available for others to run? Also is the rdtsc
> rate 120000/sec on EACH processor?
The OLTP benchmark is one of sysbench's test modes; you can get sysbench
from the following link:
http://sysbench.sourceforge.net/
We configured only one virtual processor per VM, and I don't know whether
the OLTP test can make use of two virtual processors.
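For reference, with the sysbench 0.4.x releases the OLTP test is usually
run against a MySQL instance roughly as follows (the exact options depend
on the sysbench version and the database setup, so treat this only as an
illustration):

    sysbench --test=oltp --mysql-db=test --mysql-user=root \
             --oltp-table-size=1000000 prepare
    sysbench --test=oltp --mysql-db=test --mysql-user=root \
             --num-threads=1 run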
>
> Assuming a 3GHz machine, your results seem to show that
> emulating a rdtsc with softtsc takes about 2500 cycles.
> This agrees with my approximation of about 1 usec.
>
> Have you analyzed where this 2500 cycles is being used?
> My suggestion about performance optimization was not
> to try a different algorithm but to see if it is possible
> to code the existing algorithm much faster using a
> special trap path and assembly code. (We called this
> a "fast path" on Xen/ia64.) Even if the 2500 cycles
> can be cut in half, that would be a big win.
On the x86 side there isn't really a fast path to be had for rdtsc
emulation: the main cost comes from the hardware context switch (the VM
exit and entry), not from the emulation code itself. I ran this benchmark
on an old machine, so the cost should drop sharply on the latest
processors, but I haven't tested those yet.
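A quick way to see this cost from inside a guest is a small loop that
times a batch of rdtsc instructions with rdtsc itself. This is only a
rough sketch (it includes the loop overhead, and the numbers depend on
the processor), but with rdtsc exiting enabled each iteration pays for a
full VM exit/entry:

    #include <stdio.h>
    #include <stdint.h>

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        const int iters = 100000;
        uint64_t start, end;
        int i;

        start = rdtsc();
        for (i = 0; i < iters; i++)
            (void)rdtsc();
        end = rdtsc();

        /* With rdtsc exiting enabled, each iteration includes a VM
         * exit/entry, so this approximates the emulation overhead. */
        printf("cycles per rdtsc: %llu\n",
               (unsigned long long)((end - start) / iters));
        return 0;
    }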
> Am I correct in reading that your patch is ONLY for
> HVM guests? If so, since some (maybe most) workloads
> that rely on tsc for transaction timestamps will be
> PV, your patch doesn't solve the whole problem.
Yes, this patch is only for HVM guests, because only HVM guests can use
the TSC-offset feature (one of the VT features), and I also don't think
PV guests need it.
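For context, the VT feature referred to here is the TSC_OFFSET field in
the VMCS: when rdtsc exiting is disabled, the processor returns the host
TSC plus that offset on every guest rdtsc, with no exit at all. Roughly,
Xen's vmx code programs it along these lines (a simplified sketch from
memory; the helper and field names may differ from the current tree):

    /* Sketch only: program a per-vcpu TSC offset on VMX. The offset is
     * the difference between the TSC the guest should see and the host
     * TSC; hardware adds it to every guest RDTSC result. */
    static void vmx_set_guest_tsc(struct vcpu *v, u64 guest_tsc)
    {
        u64 host_tsc, offset;

        rdtscll(host_tsc);
        offset = guest_tsc - host_tsc;

        vmx_vmcs_enter(v);
        __vmwrite(TSC_OFFSET, offset);
        vmx_vmcs_exit(v);
    }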
> Can someone at Intel confirm or deny that VMware ESX
> always traps rdtsc? If so, it is probably not hard
> to write an application that works on VMware ESX (on
> certain hardware) but fails on Xen.
>
>> -----Original Message-----
>> From: Zhang, Xiantao [mailto:xiantao.zhang@xxxxxxxxx]
>> Sent: Tuesday, July 21, 2009 11:05 PM
>> To: Keir Fraser; Dan Magenheimer; Xen-Devel (E-mail)
>> Cc: John Levon; Ian Pratt; Dong, Eddie
>> Subject: RE: TSC scaling and softtsc reprise, and PROPOSAL
>>
>>
>> Keir Fraser wrote:
>>> On 20/07/2009 21:02, "Dan Magenheimer" <dan.magenheimer@xxxxxxxxxx>
>>> wrote:
>>>
>>>> I agree that if the performance is *really bad*, the default
>>>> should not change. But I think we are still flying on rumors
>>>> of data collected years ago in a very different world, and
>>>> the performance data should be re-collected to prove that
>>>> it is still *really bad*. If the degradation is a fraction
>>>> of a percent even in worst case analysis, I think the default
>>>> should be changed so that correctness prevails.
>>>>
>>>> Why now? Because more and more real-world applications are
>>>> built on top of multi-core platforms where TSC is reliable
>>>> and (by far) the best timesource. And I think(?) we all agree
>>>> now that softtsc is the only way to guarantee correctness
>>>> in a virtual environment.
>>>
>>> So how bad is the non-softtsc default mode anyway? Our default
>>> timer_mode has guest TSCs track host TSC (plus a fixed per-vcpu
>>> offset that defaults to having all vcpus of a domain aligned to
>>> vcpu0 boot = zero tsc).
>>>
>>> Looking at the email thread you cited, all I see is someone from
>>> Intel saying something about how their code to improve TSC
>>> consistency across migration avoids RDTSC exiting where possible
>>> (which I do not see -- if the TSC rates across the hosts do not
>>> match closely then RDTSC exiting is enabled forever for that
>>> domain), and, most bizarrely, that their 'solution' may have a tsc
>>> drift >10^5 cycles. Where did this huge number come from? What
>>> solution is being talked about, and under what conditions might the
>>> claim hold? Who knows!
>>
>> We ran an experiment to measure the performance impact of
>> softtsc using the OLTP workload, and we saw ~10% performance
>> loss when the rdtsc rate is more than 120,000/second. We also
>> ran some other tests, and the results show that roughly 1%
>> performance loss is caused by every 10,000 rdtsc instructions
>> per second. So if the rdtsc rate is not that high (not more
>> than 10,000/second), the performance impact can be ignored.
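To put these numbers together: 10% of each second spread over 120,000
exits is about 0.83us per rdtsc, i.e. roughly 2,500 cycles on a 3GHz
part, which matches Dan's estimate above; likewise, 1% per 10,000
rdtsc/second works out to about 1us, or ~3,000 cycles:

    0.10 s / 120,000 ~= 0.83 us ~= 2,500 cycles @ 3 GHz
    0.01 s /  10,000 ~= 1.0  us ~= 3,000 cycles @ 3 GHz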
>>
>> We also explored some performance optimization solutions, but,
>> as we said before, they may introduce some TSC drift (10^5~10^6
>> cycles) between virtual processors in the SMP case. One such
>> solution is described below. Suppose the guest is migrated from
>> a machine with a low TSC frequency (low_freq) to one with a
>> high TSC frequency (high_freq); the low frequency is the
>> guest's expected frequency (exp_freq), and any optimization
>> must keep the guest believing it is running on a machine with
>> an exp_freq TSC, to avoid possible issues caused by the faster
>> TSC.
>>
>> 1. In this solution, we only guarantee that the guest's TSC
>> increases monotonically and that its average frequency equals
>> the guest's expected frequency (exp_freq) over a fixed time
>> slot (e.g. ~1ms).
>> 2. To keep it simple, let the guest run on the high_freq TSC
>> (using the hardware TSC-offset feature, so no performance loss)
>> for 1ms, then enable rdtsc exiting and use trap-and-emulate
>> (which does suffer the performance loss) to let the guest run
>> on a *VERY* low frequency TSC (e.g. 0.2GHz) for some time. That
>> time (in ms, with frequencies in GHz) follows from requiring
>> the average TSC frequency to equal exp_freq; a worked example
>> follows this list:
>> time = 1ms * (high_freq - low_freq) / (low_freq - 0.2)
>>
>> 3. If the guest migrates from a 2.4GHz machine to a 3.0GHz
>> machine, the guest only suffers the performance loss for
>> (3.0-2.4)/(2.4-0.2) == ~0.273ms out of the total 1ms+0.273ms,
>> which is to say that for most of the time the guest can
>> leverage the hardware TSC-offset feature and avoid the
>> performance loss.
>>
>> 4. Over that 1.273ms window, the guest's TSC frequency is
>> emulated to its expected value through this hardware/software
>> co-emulation, and the performance loss is very small compared
>> with the pure softtsc solution.
>> 5. At the same time, however, since each vcpu's TSC is emulated
>> independently for an SMP guest, a drift can build up between
>> vcpus; its range should be 10^5~10^6 cycles, and we don't know
>> whether such drift between vcpus brings other side effects. At
>> least one side effect we can identify: an application running
>> on one vcpu may see a backward TSC value after it migrates to
>> another vcpu. Not sure this is a real problem, but it exists in
>> theory.
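To make the step-2 formula concrete, here is a small self-contained
sketch (illustrative only) that computes the length of the slow,
trap-and-emulate phase for the 2.4GHz -> 3.0GHz migration above and
checks that the average frequency over one 1ms + slow-phase period equals
the guest's expected frequency. Each vcpu would run this duty cycle
independently, which is where the inter-vcpu drift mentioned in step 5
comes from:

    #include <stdio.h>

    int main(void)
    {
        double high_freq = 3.0;  /* GHz, TSC freq of the new host     */
        double exp_freq  = 2.4;  /* GHz, guest's expected TSC freq    */
        double slow_freq = 0.2;  /* GHz, emulated freq while trapping */
        double fast_ms   = 1.0;  /* time spent running at high_freq   */

        /* Slow-phase length so that the average frequency == exp_freq */
        double slow_ms = fast_ms * (high_freq - exp_freq)
                                 / (exp_freq - slow_freq);

        double avg = (high_freq * fast_ms + slow_freq * slow_ms)
                   / (fast_ms + slow_ms);

        printf("slow phase: %.3f ms of %.3f ms total\n",
               slow_ms, fast_ms + slow_ms);          /* ~0.273 / 1.273 */
        printf("average guest TSC freq: %.3f GHz\n", avg);  /* 2.400  */
        return 0;
    }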
>>
>> Attached is a draft patch implementing this solution, based on
>> the old changeset #Cset19591.
>>
>> Xiantao
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel