>-----Original Message-----
>From: Jan Beulich [mailto:JBeulich@xxxxxxxxxx]
>Sent: Tuesday, February 02, 2010 3:55 PM
>To: Keir Fraser; Yu, Ke
>Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
>Subject: Re: [Xen-devel] cpuidle causing Dom0 soft lockups
>
>>>> Keir Fraser <keir.fraser@xxxxxxxxxxxxx> 21.01.10 12:03 >>>
>>On 21/01/2010 10:53, "Jan Beulich" <JBeulich@xxxxxxxxxx> wrote:
>>> I can see your point. But how can you consider shipping with something
>>> apparently severely broken. As said before - the fact that this manifests
>>> itself by hanging many-vCPU Dom0 has the very likely implication that
>>> there are (so far unnoticed) problems with smaller Dom0-s. If I had a
>>> machine at hand that supports C3, I'd try to do some measurements
>>> with smaller domains...
>>
>>Well it's a fallback I guess. If we can't make progress on solving it then I
>>suppose I agree.
>
>Just fyi, we now also have seen an issue on a 24-CPU system that went
>away with cpuidle=0 (and static analysis of the hang hinted in that
>direction). All I can judge so far is that this likely has something to do
>with our kernel's intensive use of the poll hypercall (i.e. we see vCPU-s
>not waking up from the call despite there being pending unmasked or
>polled for events).
>
>Jan
Hi Jan,
We have just identified the cause of this issue and are trying to find an
appropriate way to fix it.
The issue results from the following sequence:
1. Every dom0 vCPU has one 250 Hz timer (i.e. a 4 ms period). The vCPU's
timer_interrupt handler acquires a global ticket spin lock, xtime_lock. When
xtime_lock is held by another vCPU, the waiting vCPU polls its event channel
and blocks. As a result, the pCPU on which that vCPU runs becomes idle. Later,
when the lock holder releases xtime_lock, it notifies the event channel to
wake the vCPU. The pCPU then wakes from its idle state and schedules the vCPU
to run.
From the above, we can see that the latency of the vCPU timer interrupt
consists of the following items. "Latency" here means the time between
starting to acquire the lock and finally acquiring it.
T1 - CPU execution time (e.g. timer interrupt lock holding time, event channel
notification time)
T2 - CPU idle wake-up time, i.e. the time for the CPU to wake from a deep C
state (e.g. C3) to C0, usually on the order of tens to hundreds of microseconds
2. Now consider the case of a large number of CPUs, e.g. 64 pCPUs and 64
vCPUs in dom0, and assume the lock holding sequence is vCPU0 -> vCPU1 -> vCPU2
... -> vCPU63.
Then vCPU63 will spend 64*(T1 + T2) acquiring xtime_lock. If T1+T2 is
100us, the total latency is ~6ms.
Since the timer runs at 250 Hz (a 4 ms period), by the time the event channel
notification is issued and the pCPU schedules vCPU63, the hypervisor finds the
timer overdue and sends another TIMER_VIRQ to vCPU63 (see
schedule()->vcpu_periodic_timer_work() for details). In this case, vCPU63 is
always busy handling timer interrupts and cannot update the watchdog, which
causes the soft lockup.
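To make the arithmetic in step 2 concrete, here is a minimal sketch. The
function names and the 20us/80us split between T1 and T2 are assumptions chosen
for illustration; only the 100us sum, the 64 vCPUs and the 250 Hz timer come
from the description above.

```python
# Rough model of the worst-case xtime_lock wait with FIFO (ticket) ordering.
# All names and the T1/T2 split are illustrative assumptions, not Xen code.

TIMER_HZ = 250
PERIOD_US = 1_000_000 // TIMER_HZ  # 4000 us, i.e. the 4 ms timer period


def worst_case_wait_us(n_vcpus, t1_us, t2_us):
    """Wait of the last vCPU in line: every handoff costs T1 + T2."""
    return n_vcpus * (t1_us + t2_us)


def timer_overdue(n_vcpus, t1_us, t2_us, period_us=PERIOD_US):
    """True if the last vCPU gets the lock only after its next tick is due."""
    return worst_case_wait_us(n_vcpus, t1_us, t2_us) > period_us


if __name__ == "__main__":
    wait = worst_case_wait_us(64, t1_us=20, t2_us=80)
    print(f"64 vCPUs: wait = {wait} us, period = {PERIOD_US} us, "
          f"overdue = {timer_overdue(64, 20, 80)}")
```

With these assumed numbers the last vCPU waits 6400us against a 4000us period,
which matches the ~6ms estimate above.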
So from the above sequence, we can see:
- The cpuidle driver adds extra latency, which makes this issue more likely to
occur.
- A large number of CPUs multiplies the latency.
- The ticket spin lock imposes a fixed lock acquisition order, so the latency
is repeatedly 64*(T1+T2), which also makes this issue more likely to occur.
The fundamental cause of this issue is that the vCPU timer interrupt handler
does not scale well, due to the global xtime_lock.
From the cpuidle point of view, one thing we are trying is to change the
cpuidle driver so that it does not enter a deep C state when there is a vCPU
with local irqs disabled that is polling an event channel. In this case, the
T2 latency is eliminated.
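The check described here can be modelled roughly as follows. This is only a
sketch of the decision rule, not the actual cpuidle driver change: the VCpu
class, its fields and allow_deep_cstate() are hypothetical names introduced for
illustration.

```python
# Toy model of the proposed cpuidle rule; all names are hypothetical.
from dataclasses import dataclass


@dataclass
class VCpu:
    irqs_disabled: bool  # vCPU has local irqs disabled
    polling: bool        # vCPU is blocked in the event channel poll


def allow_deep_cstate(vcpus):
    """Deep C states are allowed only if no vCPU is polling with irqs off."""
    return not any(v.irqs_disabled and v.polling for v in vcpus)


if __name__ == "__main__":
    waiting_for_lock = VCpu(irqs_disabled=True, polling=True)
    ordinary_idle = VCpu(irqs_disabled=False, polling=False)
    print(allow_deep_cstate([ordinary_idle]))                    # no lock waiter
    print(allow_deep_cstate([ordinary_idle, waiting_for_lock]))  # lock waiter present
```

The idea is that a vCPU polling with irqs disabled is most likely waiting for a
lock and will be woken very soon, so paying the deep C state exit latency for
it is not worthwhile.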
However, cpuidle is only one side of the problem: if the number of CPUs is
large enough that NR_CPU * T1 > 4ms, this issue will occur again. So another
option is to make dom0 scale well by not using xtime_lock, although this is
quite hard at present. Yet another option is to limit the number of dom0 vCPUs
to a reasonable level.
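As a rough guide to what a "reasonable level" might be, the NR_CPU * T1 > 4ms
condition can be turned into an upper bound on the vCPU count. The T1 values
below are assumed for illustration only and have not been measured; the
function name is hypothetical.

```python
# Largest vCPU count for which n * T1 still fits within one timer period.
# T1 values are illustrative assumptions, not measurements.

PERIOD_US = 4000  # 250 Hz timer, 4 ms period


def max_safe_vcpus(t1_us, period_us=PERIOD_US):
    """Largest n such that n * t1_us <= period_us (T2 assumed eliminated)."""
    if t1_us <= 0:
        raise ValueError("t1_us must be positive")
    return period_us // t1_us


if __name__ == "__main__":
    for t1 in (10, 20, 50, 100):
        print(f"T1 = {t1:3d} us -> at most {max_safe_vcpus(t1)} vCPUs")
```

Even with T2 removed entirely, the bound shrinks linearly as the lock holding
time T1 grows.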
Regards
Ke