Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split

To:	Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>
Subject:	Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split
From:	George Dunlap <George.Dunlap@xxxxxxxxxxxxx>
Date:	Mon, 7 Feb 2011 15:55:54 +0000
Cc:	Andre Przywara <andre.przywara@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, "Diestelhorst, Stephan" <Stephan.Diestelhorst@xxxxxxx>

Juergen,

What is supposed to happen if a domain is in cpupool0, and then all of
the cpus are taken out of cpupool0?  Is that possible?

It looks like there's code in cpupool.c:cpupool_unassign_cpu() which
will move all VMs in a cpupool to cpupool0 before removing the last
cpu.  But what happens if cpupool0 is the pool that has become empty?
That seems to break a lot of assumptions; e.g., sched_move_domain()
seems to assume that the pool we're moving a VM to actually has cpus.
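
To make the concern concrete, here is a minimal sketch of the kind of guard I have in mind. It is a toy model, not the Xen code: struct toy_pool, toy_pool_has_cpus() and toy_unassign_cpu() are made-up names, a pool is reduced to a cpu bitmask plus a domain count, and "moving a domain" is just a counter update. The only point is that the last cpu of a populated pool is released only when there is a different pool with cpus to receive the domains.

```c
#include <assert.h>
#include <errno.h>

/* Toy model, not Xen's data structures: a pool is a cpu bitmask plus a domain count. */
struct toy_pool {
    unsigned long cpu_valid;  /* bit n set => cpu n belongs to this pool */
    int n_dom;                /* number of domains currently in the pool */
};

static int toy_pool_has_cpus(const struct toy_pool *p)
{
    return p->cpu_valid != 0;
}

/*
 * Remove `cpu` from pool `c` (cpu must be smaller than the bit width of
 * unsigned long).  If this would empty a pool that still hosts domains,
 * evacuate them to `pool0` first, but only if pool0 is a different pool
 * that actually has cpus to run them on.
 */
static int toy_unassign_cpu(struct toy_pool *c, struct toy_pool *pool0,
                            unsigned int cpu)
{
    unsigned long bit = 1UL << cpu;

    if (!(c->cpu_valid & bit))
        return 0;                    /* cpu not in this pool: nothing to do */

    if (c->n_dom > 0 && c->cpu_valid == bit) {
        /* Last cpu of a populated pool. */
        if (c == pool0 || !toy_pool_has_cpus(pool0))
            return -EBUSY;           /* no valid destination for the domains */
        pool0->n_dom += c->n_dom;    /* stand-in for moving each domain */
        c->n_dom = 0;
    }

    c->cpu_valid &= ~bit;
    return 0;
}
```

With this shape, a request to remove the last cpu from a populated pool 0 fails with -EBUSY and leaves the pool untouched, instead of leaving domains in a pool with nothing to run on.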

While we're at it, what's with the "(cpu != cpupool_moving_cpu)" in the
first half of cpupool_unassign_cpu()?  Under what conditions are you
anticipating cpupool_unassign_cpu() being called a second time before
the first call completes?  If you have to abort the move because
schedule_cpu_switch() failed, wouldn't it be better just to roll the
whole transaction back, rather than leaving it hanging in the middle?
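
By "roll the whole transaction back" I mean something along these lines. Again this is only a sketch under assumptions: struct mv_pool, toy_cpu_switch() and toy_move_cpu() are hypothetical names, and the failing step is simulated with a flag rather than a real scheduler switch. The source pool's state is restored on failure, so no "move in progress" state survives the call.

```c
#include <assert.h>
#include <errno.h>

/* Toy pool: just a bitmask of the cpus it owns. */
struct mv_pool {
    unsigned long cpu_valid;
};

/* Hypothetical stand-in for the scheduler switch; fails when should_fail is set. */
static int toy_cpu_switch(int should_fail)
{
    return should_fail ? -EAGAIN : 0;
}

/* Move `cpu` from `src` to `dst`; either the whole move happens or nothing changes. */
static int toy_move_cpu(struct mv_pool *src, struct mv_pool *dst,
                        unsigned int cpu, int should_fail)
{
    unsigned long bit = 1UL << cpu;
    int ret;

    if (!(src->cpu_valid & bit))
        return -ENOENT;              /* cpu is not in the source pool */

    src->cpu_valid &= ~bit;          /* step 1: detach from the source pool */

    ret = toy_cpu_switch(should_fail);   /* step 2: the step that may fail */
    if (ret) {
        src->cpu_valid |= bit;       /* undo step 1, leaving no half-moved cpu */
        return ret;
    }

    dst->cpu_valid |= bit;           /* step 3: commit to the destination */
    return 0;
}
```

The caller can then simply retry the whole operation later, and no other path needs to know about a move that is still pending.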

Hmm, and why does RMCPU call cpupool_get_by_id() with exact==0?  What
could possibly be the use of grabbing a random cpupool and then trying
to remove the specified cpu from it?
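
To spell out why this looks wrong to me: I am assuming here that a non-exact lookup returns the first pool whose id is greater than or equal to the requested one, which is handy for iterating over pools but not for picking a target. The sketch below uses made-up names (struct id_pool, toy_get_by_id()) and a sorted array instead of the real list, so treat it as an illustration of that assumption only.

```c
#include <assert.h>
#include <stddef.h>

/* Toy pool carrying only its id. */
struct id_pool {
    int id;
};

/*
 * Look up a pool in an array sorted by ascending id.
 * exact != 0: return the pool only if its id matches.
 * exact == 0: return the first pool with id >= the requested id.
 */
static const struct id_pool *toy_get_by_id(const struct id_pool *pools,
                                           size_t n, int id, int exact)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (pools[i].id >= id)
            break;

    if (i == n)
        return NULL;                 /* no pool at or above this id */
    if (exact && pools[i].id != id)
        return NULL;                 /* exact match requested but not found */
    return &pools[i];
}
```

With pools 0, 2 and 5, asking for pool 3 non-exactly hands back pool 5, so a remove request naming a nonexistent pool would be applied to a pool the caller never mentioned.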

Andre, you might think about folding the attached patch into your debug patch.

 -George

On Mon, Feb 7, 2011 at 1:32 PM, Juergen Gross
<juergen.gross@xxxxxxxxxxxxxx> wrote:
> On 02/07/11 13:38, Andre Przywara wrote:
>>
>> Juergen,
>>
>> as promised, some more debug data. This is from c/s 22858 with Stephan's
>> debug patch (attached).
>> We get the following dump when the hypervisor crashes; note that the
>> first lock is different from the second and subsequent ones:
>>
>> (XEN) sched_credit.c, 572: prv: ffff831836df2970 &prv->lock: ffff831836df2970 prv->weight: 256 sdom->active_vcpu_count: 3 sdom->weight: 256
>> (XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 768 sdom->active_vcpu_count: 4 sdom->weight: 256
>> (XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 1024 sdom->active_vcpu_count: 5 sdom->weight: 256
>> (XEN) sched_credit.c, 572: prv: ffff830437ffa5e0 &prv->lock: ffff830437ffa5e0 prv->weight: 1280 sdom->active_vcpu_count: 6 sdom->weight: 256
>>
>> ....
>>
>> Hope that gives you an idea. I attach the whole log for your reference.
>
> Hmm, could it be your log wasn't created with the attached patch? I'm
> missing Dom-Id and VCPU from the printk() above, which would be
> interesting (at least I hope so)...
> Additionally printing the local pcpu number would help, too.
> And could you add a printk for the new prv address in csched_init()?
>
> It would be nice if you could enable cpupool diag output. Please use the
> attached patch (includes the previous patch for executing the cpu move on
> the cpu to be moved, plus some diag printk corrections).
>
>
> Juergen
>

cpupools-bug-on-move-to-self.diff
Description: Text Data