The good news is, I've managed to reproduce this on my local test
hardware with 1x4x2 (1 socket, 4 cores, 2 threads per core) using the
attached script. It's time to go home now, but I should be able to
dig something up tomorrow.
To use the script:
* Rename cpupool0 to "p0", and create an empty second pool, "p1" (see the
sketch just after this list)
* You can modify the script's behavior by passing "arg=val" pairs as arguments.
* Arguments are:
+ dryrun={true,false} Do the work, but don't actually execute any xl
commands. Default is false.
+ left: Number of commands to execute. Default is 10.
+ maxcpus: Highest numerical cpu index. Default is 7 (i.e., 0-7 is 8 cpus).
+ verbose={true,false} Print what the script is doing. Default is true.
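
For reference, the pool setup step might look roughly like this (a minimal
sketch; the default pool name and the exact cpupool config syntax depend on
your toolstack version, so treat these commands as assumptions rather than
an exact recipe):

# Rename the default pool (often "Pool-0" / "cpupool0") to "p0"
xl cpupool-rename Pool-0 p0

# Create a second pool "p1" with the same scheduler; it starts out empty
# (if your xl insists on a cpus list, give it one cpu and remove it again)
cat > /tmp/p1.cfg <<'EOF'
name = "p1"
sched = "credit"
EOF
xl cpupool-create /tmp/p1.cfg

# Check that both pools are visible before running the script
xl cpupool-list
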
The script sometimes attempts to remove the last cpu from cpupool0; in
that case, libxl will print an error. If the script gets an error under
that condition, it ignores it; on any other error, it prints diagnostic
information.
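
The script itself is attached rather than inlined; the loop it runs is
roughly of this shape. This is a hypothetical sketch based on the argument
descriptions above, not the actual attachment: the real script tracks which
cpu is in which pool and only ignores the expected "last cpu in cpupool0"
error, while the sketch below simply logs every failure.

#!/bin/bash
# Hypothetical sketch of the kind of stress loop cpupool-test.sh runs;
# see the attached script for the real thing.
left=10; maxcpus=7; verbose=true; dryrun=false

# Accept "arg=val" overrides, e.g.: ./cpupool-test.sh verbose=false left=1000
for arg in "$@"; do eval "$arg"; done

while [ "$left" -gt 0 ]; do
    cpu=$((RANDOM % (maxcpus + 1)))
    # Pick a random direction: pull the cpu out of p0, or push it into p1
    if [ $((RANDOM % 2)) -eq 0 ]; then
        cmd="xl cpupool-cpu-remove p0 $cpu"
    else
        cmd="xl cpupool-cpu-add p1 $cpu"
    fi
    [ "$verbose" = true ] && echo "$cmd"
    if [ "$dryrun" = false ]; then
        # Many of these calls will fail (cpu not in the pool, last cpu in
        # cpupool0, ...); the point is just to exercise the add/remove paths
        # as fast as possible, so failures are only printed, not fatal.
        $cmd || echo "failed: $cmd"
    fi
    left=$((left - 1))
done
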
What finally crashed it for me was this command:
# ./cpupool-test.sh verbose=false left=1000
-George
On Fri, Feb 11, 2011 at 7:39 AM, Andre Przywara <andre.przywara@xxxxxxx> wrote:
> Juergen Gross wrote:
>>
>> On 02/10/11 15:18, Andre Przywara wrote:
>>>
>>> Andre Przywara wrote:
>>>>
>>>> On 02/10/2011 07:42 AM, Juergen Gross wrote:
>>>>>
>>>>> On 02/09/11 15:21, Juergen Gross wrote:
>>>>>>
>>>>>> Andre, George,
>>>>>>
>>>>>>
>>>>>> What seems to be interesting: I think the problem always occurred
>>>>>> when a new cpupool was created and the first cpu was moved to it.
>>>>>>
>>>>>> I think my previous assumption regarding the master_ticker was not
>>>>>> too bad. I think somehow the master_ticker of the new cpupool is
>>>>>> becoming active before the scheduler is really initialized properly.
>>>>>> This could happen if enough time is spent between alloc_pdata for
>>>>>> the cpu to be moved and the critical section in
>>>>>> schedule_cpu_switch().
>>>>>>
>>>>>> The solution should be to activate the timers only if the scheduler
>>>>>> is ready for them.
>>>>>>
>>>>>> George, do you think the master_ticker should be stopped in
>>>>>> suspend_ticker as well? I still see potential problems for entering
>>>>>> deep C-States. I think I'll prepare a patch which will keep the
>>>>>> master_ticker active for the C-State case and migrate it for the
>>>>>> schedule_cpu_switch() case.
>>>>>
>>>>> Okay, here is a patch for this. It ran on my 4-core machine without any
>>>>> problems.
>>>>> Andre, could you give it a try?
>>>>
>>>> Did, but unfortunately it crashed as always. Tried twice and made sure
>>>> I booted the right kernel. Sorry.
>>>> The idea of a race between the timer and the state change sounded
>>>> very appealing; actually, that part was suspicious to me from the
>>>> beginning.
>>>>
>>>> I will add some code at the BUG_ON to dump the state of all cpupools,
>>>> to see which situation we are in when the bug triggers.
>>>
>>> OK, here is a first try of this: the patch iterates over all CPU
>>> pools and outputs some data if the BUG_ON
>>> ((sdom->weight * sdom->active_vcpu_count) > weight_left) condition
>>> triggers:
>>> (XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask: fffffffc003f
>>> (XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
>>> (XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
>>> (XEN) Xen BUG at sched_credit.c:1010
>>> ....
>>> The masks look proper (6 cores per node); the bug triggers when the
>>> first CPU is about to be(?) inserted.
>>
>> Sure? I'm missing the cpu with mask 2000.
>> I'll try to reproduce the problem on a larger machine here (24 cores,
>> 4 numa nodes).
>> Andre, can you give me your xen boot parameters? Which xen changeset
>> are you running, and do you have any additional patches in use?
>
> The grub lines:
> kernel (hd1,0)/boot/xen-22858_debug_04.gz console=com1,vga com1=115200
> module (hd1,0)/boot/vmlinuz-2.6.32.27_pvops console=tty0
> console=ttyS0,115200 ro root=/dev/sdb1 xencons=hvc0
>
> All of my experiments use c/s 22858 as a base.
> If you use an AMD Magny-Cours box for your experiments (socket C32 or G34),
> you should apply the following patch (removing the line):
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -803,7 +803,6 @@ static void pv_cpuid(struct cpu_user_regs *regs)
> __clear_bit(X86_FEATURE_SKINIT % 32, &c);
> __clear_bit(X86_FEATURE_WDT % 32, &c);
> __clear_bit(X86_FEATURE_LWP % 32, &c);
> - __clear_bit(X86_FEATURE_NODEID_MSR % 32, &c);
> __clear_bit(X86_FEATURE_TOPOEXT % 32, &c);
> break;
> case 5: /* MONITOR/MWAIT */
>
> This is not necessary (in fact it reverts my patch c/s 22815), but it raises
> the probability of triggering the bug, probably because it increases the
> pressure on the Dom0 scheduler. If you cannot trigger it with Dom0, try to
> create a guest with many VCPUs and squeeze it into a small CPU-pool.
>
> Good luck ;-)
> Andre.
>
> --
> Andre Przywara
> AMD-OSRC (Dresden)
> Tel: x29712
>
cpupool-test.sh
Description: Bourne shell script
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel