Hi Jan,
Could you please try the attached patch? It tries to avoid entering a deep
C state when a vCPU has local irq disabled and is polling an event channel.
When tested on my 64-CPU box, the issue is gone with this patch.
Best Regards
Ke
>-----Original Message-----
>From: Yu, Ke
>Sent: Wednesday, February 03, 2010 1:07 AM
>To: Jan Beulich; Keir Fraser
>Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
>Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups
>
>>-----Original Message-----
>>From: Jan Beulich [mailto:JBeulich@xxxxxxxxxx]
>>Sent: Tuesday, February 02, 2010 3:55 PM
>>To: Keir Fraser; Yu, Ke
>>Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
>>Subject: Re: [Xen-devel] cpuidle causing Dom0 soft lockups
>>
>>>>> Keir Fraser <keir.fraser@xxxxxxxxxxxxx> 21.01.10 12:03 >>>
>>>On 21/01/2010 10:53, "Jan Beulich" <JBeulich@xxxxxxxxxx> wrote:
>>>> I can see your point. But how can you consider shipping with something
>>>> apparently severely broken. As said before - the fact that this manifests
>>>> itself by hanging many-vCPU Dom0 has the very likely implication that
>>>> there are (so far unnoticed) problems with smaller Dom0-s. If I had a
>>>> machine at hand that supports C3, I'd try to do some measurements
>>>> with smaller domains...
>>>
>>>Well it's a fallback I guess. If we can't make progress on solving it then I
>>>suppose I agree.
>>
>>Just fyi, we now also have seen an issue on a 24-CPU system that went
>>away with cpuidle=0 (and static analysis of the hang hinted in that
>>direction). All I can judge so far is that this likely has something to do
>>with our kernel's intensive use of the poll hypercall (i.e. we see vCPU-s
>>not waking up from the call despite there being pending unmasked or
>>polled for events).
>>
>>Jan
>
>Hi Jan,
>
>We just identified the cause of this issue, and are trying to find an
>appropriate way to fix it.
>
>This issue is the result of the following sequence:
>1. Every dom0 vCPU has one 250 Hz timer (i.e. 4ms period). The vCPU
>timer_interrupt handler acquires a global ticket spin lock, xtime_lock.
>When xtime_lock is held by another vCPU, the vCPU polls the event channel
>and becomes blocked. As a result, the pCPU where the vCPU runs becomes idle.
>Later, when the lock holder releases xtime_lock, it notifies the event
>channel to wake up the vCPU. As a result, the pCPU wakes up from the idle
>state and schedules the vCPU to run.
>
>From the above, we can see that the latency of the vCPU timer interrupt
>consists of the following items. The "latency" here means the time between
>starting to acquire the lock and finally acquiring it.
>T1 - CPU execution time (e.g. timer interrupt lock holding time, event channel
>notification time)
>T2 - CPU idle wake-up time, i.e. the time for the CPU to wake up from a deep
>C state (e.g. C3) to C0, usually on the order of tens to hundreds of
>microseconds
>
>2. Now consider the case of a large number of CPUs, e.g. 64 pCPUs and 64
>vCPUs in dom0, and assume the lock holding sequence is vCPU0 ->
>vCPU1 -> vCPU2 ... -> vCPU63.
>Then vCPU63 will spend 64*(T1 + T2) to acquire xtime_lock. If T1+T2 is
>100us, the total latency would be ~6.4ms.
>Since the timer is 250 Hz, i.e. a 4ms period, when the event channel
>notification is issued and the pCPU schedules vCPU63, the hypervisor will find
>that the timer is overdue and will send another TIMER_VIRQ to vCPU63 (see
>schedule()->vcpu_periodic_timer_work() for details). In this case, vCPU63 will
>always be busy handling timer interrupts and will not be able to update the
>watchdog, thus causing the soft lockup.
>
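As a rough check of the arithmetic in step 2, here is a minimal sketch. The
function names are made up for illustration, and the 100us per-handoff cost is
just the example value used above, not a measurement.

```python
# Hypothetical helpers: model the worst-case wait for xtime_lock when the
# ticket lock hands over in a fixed order and each handoff costs T1 + T2.

def worst_case_wait_us(n_vcpus, per_handoff_us):
    """Time the last vCPU in the queue waits: n_vcpus * (T1 + T2)."""
    return n_vcpus * per_handoff_us

def timer_period_us(hz):
    """Period of a periodic timer in microseconds."""
    return 1_000_000 / hz

wait = worst_case_wait_us(64, 100)   # 64 vCPUs, T1 + T2 = 100us (example value)
period = timer_period_us(250)        # 250 Hz dom0 timer
overdue = wait > period              # True means the next tick is already late
print(wait, period, overdue)
```

With these example numbers the wait (6400us) exceeds the 4000us period, which
is the condition under which the hypervisor re-sends the timer VIRQ.
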
>So from the above sequence, we can see:
>- The cpuidle driver adds extra latency, making this issue more likely to
>occur.
>- A large number of CPUs multiplies the latency.
>- The ticket spin lock leads to a fixed lock acquisition order, so the latency
>is repeatedly 64*(T1+T2), making this issue more likely to occur.
>The fundamental cause of this issue is that the vCPU timer interrupt handler
>does not scale well, due to the global xtime_lock.
>
>From the cpuidle point of view, one thing we are trying to do is to change the
>cpuidle driver so that it does not enter a deep C state when there is a vCPU
>with local irq disabled that is polling an event channel. In this case, the T2
>latency will be eliminated.
>
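The idea in the paragraph above can be sketched roughly as follows. This is
only an illustration of the policy, not the code in the attached patch: the
VCpu fields, the function name, and the choice of C1 as the "shallow" state
and C3 as the deepest are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class VCpu:
    # Hypothetical per-vCPU flags; the real hypervisor state looks different.
    irq_disabled: bool   # guest has local irq disabled
    polling: bool        # vCPU is blocked polling an event channel

def max_allowed_cstate(vcpus, deepest=3, shallow=1):
    """Cap the C state if any vCPU is polling with local irq disabled.

    Such a vCPU is probably waiting for a lock handoff, so the deep C state
    wake-up latency (T2) would be added to its wait.
    """
    if any(v.irq_disabled and v.polling for v in vcpus):
        return shallow
    return deepest

waiting = [VCpu(irq_disabled=True, polling=True)]
idle = [VCpu(irq_disabled=False, polling=False)]
print(max_allowed_cstate(waiting), max_allowed_cstate(idle))
```
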
>Anyway, cpuidle is just one side. We can anticipate that if the CPU count is
>large enough that NR_CPU * T1 > 4ms, this issue will occur again. So another
>way is to make dom0 scale well by not using xtime_lock, although this is
>pretty hard currently. Yet another way is to limit the number of dom0 vCPUs
>to a certain reasonable level.
>
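To give a feel for the NR_CPU * T1 > 4ms bound, a small sketch follows. The
function name is made up, and the T1 value of 50us is purely an assumed
figure for illustration; the real T1 would have to be measured.

```python
def max_vcpus_within_period(t1_us, period_us=4000):
    """Largest vCPU count n for which n * T1 does not exceed the timer period."""
    return period_us // t1_us

t1_us = 50                               # assumed T1, not a measured value
limit = max_vcpus_within_period(t1_us)   # with T2 eliminated, only T1 remains
print(limit, (limit + 1) * t1_us > 4000)
```

Above this count, the last vCPU in the lock queue would miss its 4ms tick
even with the deep C state latency removed.
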
>Regards
>Ke
cpuidle-hint-v2.patch
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel