[Xen-devel] Re: CPU offlining patch xen-unstable:21049

To:	"Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>
Subject:	[Xen-devel] Re: CPU offlining patch xen-unstable:21049
From:	Keir Fraser <keir.fraser@xxxxxxxxxxxxx>
Date:	Thu, 15 Apr 2010 12:04:01 +0100
Cc:	"xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>
Delivery-date:	Thu, 15 Apr 2010 04:05:18 -0700
Envelope-to:	www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to:	<789F9655DD1B8F43B48D77C5D30659731D73CED7@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>
List-help:	<mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id:	Xen developer discussion <xen-devel.lists.xensource.com>
List-post:	<mailto:xen-devel@lists.xensource.com>
List-subscribe:	<http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe:	<http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
Sender:	xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index:	AcrcdgaxyxAXcjWt2U2B4gzh49/iiwAAddGwAAM8tEEAAA/mEAABkodd
Thread-topic:	CPU offlining patch xen-unstable:21049
User-agent:	Microsoft-Entourage/12.24.0.100205

I decided to keep the spin_trylock calls, as they quell my paranoia about other
possible deadlock scenarios inside those complicated hypercall functions.
But I have modified the comments appropriately, in xen-unstable:21179. Note
that this also depends on xen-unstable:21178 (we mustn't execute the
hypercall continuation immediately, in the context of the caller of
c_h_o_c()). Thanks.
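
To illustrate the try-lock-and-retry pattern being discussed, here is a minimal Python model. It is not Xen code: the `Cpu` class, `xenpf_op` and `run_hypercall` are hypothetical names, a `threading.Lock` stands in for xenpf_lock, and returning "retry" only models what creating a hypercall continuation would achieve.

```python
import threading

xenpf_lock = threading.Lock()  # stands in for Xen's xenpf_lock (assumption)


class Cpu:
    """Hypothetical model of a physical CPU with one pending-softirq flag."""

    def __init__(self, name):
        self.name = name
        self.pending_softirq = False
        self.softirqs_handled = 0

    def do_softirq(self):
        # Service a pending softirq (e.g. the stop_machine rendezvous).
        if self.pending_softirq:
            self.pending_softirq = False
            self.softirqs_handled += 1


def xenpf_op(cpu):
    """Hypothetical hypercall body: try the lock once, never spin on it."""
    if not xenpf_lock.acquire(blocking=False):
        return "retry"      # models creating a hypercall continuation
    try:
        return "done"       # the real work would happen here
    finally:
        xenpf_lock.release()


def run_hypercall(cpu, max_attempts=10):
    """Retry loop: between attempts the CPU gets to process softirqs."""
    for attempt in range(1, max_attempts + 1):
        if xenpf_op(cpu) == "done":
            return attempt
        cpu.do_softirq()
    return None
```

The point of the sketch is that a failed try-lock hands control back, so the CPU can still reach its softirq handler, whereas a plain blocking acquire would keep it spinning.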

But here's a subtler, trickier deadlock scenario for you. You'll
like this one :-): stop_machine_run() schedules a softirq on every CPU.
Let's say CPU A enters our softirq handler, interrupting some guest VCPU X
which is still scheduled on CPU A. But some other CPU B could be waiting for
X to be descheduled (one obvious example is hvmop_flush_tlb_all, which is a
good one because some HVM guest can call that at any time). So we never get
full softirq rendezvous because CPU B is spinning in hvmop_flush_tlb_all(),
while CPU A spins in the stop_machine softirq handler. Deadlock!
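
The circular wait in this scenario can be made explicit with a small wait-for graph. This is a sketch in Python, not Xen code: the `find_cycle` helper and the node labels are hypothetical, and the edges only encode the three dependencies described above.

```python
def find_cycle(waits_for):
    """Return one cycle in a wait-for graph as a list of nodes, or None.

    waits_for maps each node to the list of nodes it is blocked on.
    """
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Back edge found: slice out the cycle and close it.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for nxt in waits_for.get(node, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in waits_for:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# Edges taken from the scenario (labels are illustrative only):
deadlock = {
    # CPU A spins in the stop_machine softirq until every CPU has joined.
    "CPU A in stop_machine softirq": ["CPU B enters softirq"],
    # CPU B cannot join while it spins in hvmop_flush_tlb_all().
    "CPU B enters softirq": ["VCPU X descheduled"],
    # X cannot be descheduled while CPU A spins in the handler.
    "VCPU X descheduled": ["CPU A in stop_machine softirq"],
}
```

In this model, `find_cycle(deadlock)` returns a closed path through all three nodes; dropping any one edge (for instance, if X were descheduled before CPU A started spinning) makes it return None.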

What do you think of that? :-D

 -- Keir

On 15/04/2010 11:19, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx> wrote:

> Aha, yes, you are right. So do I need to create a patch, or can you simply
> revert some chunks?
> 
> --jyh
> 
>> -----Original Message-----
>> From: Keir Fraser [mailto:keir.fraser@xxxxxxxxxxxxx]
>> Sent: Thursday, April 15, 2010 6:17 PM
>> To: Jiang, Yunhong
>> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
>> Subject: Re: CPU offlining patch xen-unstable:21049
>> 
>> On 15/04/2010 09:50, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx> wrote:
>> 
>>> I think the try_lock is not for the cpu_down(). The point is what happens if
>>> another CPU is trying to get the lock.
>>> 
>>> Consider the following scenario:
>>> 1) cpu_down() runs on CPU A and gets the xenpf_lock, then calls
>>> stop_machine_run(), trying to bring all CPUs into stop_machine_run context.
>>> 2) At the same time, another vcpu on CPU B does a xenpf hypercall and tries
>>> to get the xenpf_lock. If there is no retry for this lock, it can't get
>>> xenpf_lock, and it will never go to the softirq.
>>> So the system will hang.
>>> 
>>> Hope this makes things clear.
>> 
>> But CPU A doesn't hold the xenpf_lock when it calls stop_machine_run(). It
>> dropped it before cpu_down() got invoked, because that gets executed via
>> continue_hypercall_on_cpu().
>> 
>> -- Keir
>> 
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
