BTW, I suspect if cpu online/offline lock should really wrap the
stop_machine_run(). I remember Criping questioned this also. Will have a look
on it.
--jyh
>-----Original Message-----
>From: Jiang, Yunhong
>Sent: Friday, April 16, 2010 2:43 PM
>To: Keir Fraser; Juergen Gross
>Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; Yu, Ke
>Subject: RE: [Xen-devel] [Patch] continue_hypercall_on_cpu rework using
>tasklets
>
>
>
>>-----Original Message-----
>>From: Keir Fraser [mailto:keir.fraser@xxxxxxxxxxxxx]
>>Sent: Thursday, April 15, 2010 7:07 PM
>>To: Jiang, Yunhong; Juergen Gross
>>Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; Yu, Ke
>>Subject: Re: [Xen-devel] [Patch] continue_hypercall_on_cpu rework using
>>tasklets
>>
>>On 15/04/2010 10:59, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx> wrote:
>>
>>>> Actually that's a good example because it now won't work, but for other
>>>> reasons! The hypercall continuation can interrupt another vcpu's execution,
>>>> and then try to synchronously pause that vcpu. Which will deadlock.
>>>>
>>>> Luckily I think we can re-jig this code to freeze_domains() before doing
>>>> the
>>>> continue_hypercall_on_cpu(). I've cc'ed one of the CPU RAS guys. :-)
>>>
>>> Hmm, I have cc'ed one of the PM guys because it is enter_state :-)
>>> Can we add check in vcpu_sleep_sync() for current? It is meaningless to
>>> cpu_relax for current vcpu in that situation, especially if we are not in
>>> irq
>>> context.
>>> I'm not sure why in freeze_domains it only checkes dom0's vcpu for current,
>>> instead of all domains.
>>
>>Well actually pausing any vcpu from within the hypercall continuation is
>>dangerous. The softirq handler running the hypercall continuation may have
>>interrupted some running VCPU X. And the VCPU Y that the continuation is
>>currently trying to pause may itself be trying to pause X. So we can get a
>>deadlock that way. The freeze_domains() *has* to be pulled outside of the
>>hypercall continuation.
>>
>>It's a little bit similar to the super-subtle stop_machine_run deadlock
>>possibility I just emailed to you a second ago. :-)
>
>Thanks for pointing out the stop_machine_run deadlock issue.
>
>After more consideration and internally discussion, seems the key point is, the
>tasklet softirq is something like getting a lock for the current vcpu's
>state(i.e. no one
>else could change that state unless this softirq is finished). So any block
>action in
>softirq context, not just vcpu_pause_sync, is dangerous and should be avoided
>(we
>can't get a lock and do block action per my understanding).
>This is because vcpu's state can only be changed by schedule softirq (am I
>right on
>this?), while schedule softirq can't prempt other softirq. So, more generally,
>anything
>that will be updated by a softirq context, and will be syncrhozed/blocking
>waitied in
>xen's vcpu context is in fact a implicit lock held by the softirq.
>
>To the tricky bug on the stop_machine_run(), I think it is in fact similar to
>the
>cpu_add_remove_lock. The stop_machine_run() is a block action, so we must make
>sure no one will be blocking to get the lock that is held by stop_machine_run()
>already. At that time, we change all components that try to get the
>cpu_add_remove_lock to be try_lock.
>
>The changes caused by the tasklet is, a new implicit lock is added, i.e. the
>vcpu's
>state.
>The straightforward method is like the cpu_add_remove_lock: make everything
>that
>waiting for the vcpu state change to do softirq between the checking. Maybe the
>cleaner way is your previous suggestion, that is, put the stop_machine_run()
>in the
>idle_vcpu(), another way is, turn back to the original method, i.e. do it in
>the
>schedule_tail.
>
>Also We are not sure why the continue_hypercall_on_cpu is changed to use
>tasklet.
>What's the benifit for it? At least I think this is quite confusing, because
>per our
>understanding, usually hypercall is assumed to execut in current context,
>while this
>change break the assumption. So any hypercall that may use this _c_h_o_c, and
>any
>function called by that hypercall, should be aware of this, I'm not sure if
>this is really
>so correct, at least it may cause trouble if someone use this without realize
>the
>limitation. From Juergen Gross's mail, it seems for cpupool, but I have no
>idea of the
>cpupool :-(
>
>--jyh
>
>
>>
>> -- Keir
>>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
|