I decided to keep the spin_trylock() calls, as they quell my paranoia about
other possible deadlock scenarios inside those complicated hypercall
functions. But I have modified the comments appropriately, in
xen-unstable:21179. Note that this also depends on xen-unstable:21178 (we
mustn't execute the hypercall continuation immediately, in the context of
the caller of continue_hypercall_on_cpu()). Thanks.

But here's a subtler and trickier deadlock scenario for you. You'll like
this one :-) stop_machine_run() schedules a softirq on every CPU. Let's say
CPU A enters our softirq handler, interrupting some guest VCPU X which is
still scheduled on CPU A. But some other CPU B could be waiting for X to be
descheduled (one obvious example is hvmop_flush_tlb_all, which is a good one
because an HVM guest can call it at any time). So we never get full softirq
rendezvous, because CPU B is spinning in hvmop_flush_tlb_all() while CPU A
spins in the stop_machine softirq handler. Deadlock!

What do you think of that? :-D

 -- Keir

On 15/04/2010 11:19, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx> wrote:

> Aha, yes, you are right. So do I need to create a patch, or can you simply
> revert some chunks?
>
> --jyh
>
>> -----Original Message-----
>> From: Keir Fraser [mailto:keir.fraser@xxxxxxxxxxxxx]
>> Sent: Thursday, April 15, 2010 6:17 PM
>> To: Jiang, Yunhong
>> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
>> Subject: Re: CPU offlining patch xen-unstable:21049
>>
>> On 15/04/2010 09:50, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx> wrote:
>>
>>> I think the try_lock is not for cpu_down() itself. The point is what
>>> happens if another CPU is trying to get the lock.
>>>
>>> Consider the following scenario:
>>> 1) cpu_down() on CPU A gets the xenpf_lock, then calls
>>> stop_machine_run(), trying to bring all CPUs into the stop_machine_run
>>> context.
>>> 2) At the same time, another vcpu on CPU B does a xenpf hypercall and
>>> tries to get the xenpf_lock. If there is no retry for this lock, it can't
>>> get xenpf_lock, and it will never go to the softirq.
>>> So the system will hang.
>>>
>>> Hope this makes things clear.
>>
>> But CPU A doesn't hold the xenpf_lock when it calls stop_machine_run(). It
>> dropped it before cpu_down() got invoked, because that gets executed via
>> continue_hypercall_on_cpu().
>>
>> -- Keir
>>
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel