[Xen-devel] Re: PVops domain 0 crash on NUMA system only Node==1

To:	Ian Campbell <ijc@xxxxxxxxxxxxxx>
Subject:	[Xen-devel] Re: PVops domain 0 crash on NUMA system only Node==1 present (Was: Re: Bug#603632: linux-image-2.6.32-5-xen-amd64: Linux kernel 2.6.32/xen/amd64 booting fine on bare metal, but not as dom0 with Xen 4.0.1 (Dell R410))
From:	Cris Daniluk <cris.daniluk@xxxxxxxxx>
Date:	Tue, 23 Nov 2010 07:44:32 -0500
Cc:	Vincent CARON <zerodeux@xxxxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>, Keir Fraser <keir@xxxxxxx>, Jeremy Fitzhardinge <jeremy@xxxxxxxx>, 603632@xxxxxxxxxxxxxxx
Delivery-date:	Wed, 24 Nov 2010 02:17:19 -0800
In-reply-to:	<1290513067.31507.7699.camel@xxxxxxxxxxxxxxxxxxxxxx>
References:	<20101115233253.11935.35707.reportbug@zerohal> <1290513067.31507.7699.camel@xxxxxxxxxxxxxxxxxxxxxx>

I was unable to, and this does indeed look similar. I tried a variety of pvops kernels and kernel configs and could not get past this. I never found a resolution and eventually fell back to 3.4.3 with a xenlinux kernel. Much less sexy, but very stable on the same hardware.

I also had related but different problems on IBM 3650 M2s and IBM 3500s with pvops kernels. They seem very prone to crashing on any APIC/ACPI bug, of which there seem to be quite a few on both Dell and IBM hardware. I was toying with the idea of downgrading the BIOSes, based on the success someone else on the xen-devel list reported with that, but I didn't have the time to see that idea through.

On Tue, Nov 23, 2010 at 6:51 AM, Ian Campbell <ijc@xxxxxxxxxxxxxx> wrote:

Thanks for the report Vincent.

I've added xen-devel to the CC as well as Cris Daniluk who previously
reported a very similar issue[0] also on an R410 -- Cris did you ever
get a resolution to your issue?

Vincent's full report is at:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=603632
I've also attached the boot log here of which the interesting part looks
to be:

[ 8.422639] xen: acpi sci 9
[ 8.434217] Console: colour VGA+ 80x25
[ 8.441350] console [hvc0] enabled, bootconsole disabled
[ 8.441350] console [hvc0] enabled, bootconsole disabled
[ 8.462694] Xen: using vcpuop timer interface
[ 8.471508] installing Xen timer for CPU 0
[ 8.479841] BUG: unable to handle kernel paging request at 0000000000005a08
[ 8.493868] IP: [<ffffffff810badce>] __alloc_pages_nodemask+0x8f/0x5f5
[ 8.507041] PGD 0
[ 8.511199] Thread overran stack, or stack corrupted
[ 8.521253] Oops: 0000 [#1] SMP
[ 8.527838] last sysfs file:
[ 8.533941] CPU 0
[ 8.538100] Modules linked in:
[ 8.544342] Pid: 0, comm: swapper Not tainted 2.6.32-5-xen-amd64 #1 PowerEdge R410
[ 8.559594] RIP: e030:[<ffffffff810badce>] [<ffffffff810badce>] __alloc_pages_nodemask+0x8f/0x5f5
[ 8.577620] RSP: e02b:ffffffff81443c88 EFLAGS: 00010046
[ 8.588366] RAX: 0000000000000000 RBX: 0000000000005220 RCX: 0000000000005a00
[ 8.602752] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000005220
[ 8.617139] RBP: 0000000000004020 R08: 0000000000000002 R09: ffff88003fc1c010
[ 8.631525] R10: ffffffff813c2700 R11: 00000000000186a0 R12: 0000000000005220
[ 8.645910] R13: 0000000000000002 R14: 0000000000000000 R15: ffff88000000da28
[ 8.660300] FS: 0000000000000000(0000) GS:ffff88000349b000(0000) knlGS:0000000000000000
[ 8.676591] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 8.688203] CR2: 0000000000005a08 CR3: 0000000001001000 CR4: 0000000000002660
[ 8.702589] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8.716975] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 8.731361] Process swapper (pid: 0, threadinfo ffffffff81442000, task ffffffff814771f0)
[ 8.747654] Stack:
[ 8.751813] ffff88000000da00 00000010813c2765 00000000000212d0 00000000000186a0
[ 8.766199] <0> ffff88000000ac10 ffffffff8100e5b5 ffffffff8100ec72 00000000000186a0
[ 8.781625] <0> 00000000000186a0 0000000000000000 0000000000005a00 0000000000000000
[ 8.797572] Call Trace:
[ 8.802603] [<ffffffff8100e5b5>] ? xen_force_evtchn_callback+0x9/0xa
[ 8.815600] [<ffffffff8100ec72>] ? check_events+0x12/0x20
[ 8.826695] [<ffffffff810e759d>] ? new_slab+0x42/0x1ca
[ 8.837267] [<ffffffff810e7915>] ? __slab_alloc+0x1f0/0x39b
[ 8.848707] [<ffffffff812f87d8>] ? irq_to_desc_alloc_node+0x96/0x195
[ 8.861704] [<ffffffff810e85cb>] ? __kmalloc_node+0xe8/0x146
[ 8.873317] [<ffffffff812f87d8>] ? irq_to_desc_alloc_node+0x96/0x195
[ 8.886316] [<ffffffff812f87d8>] ? irq_to_desc_alloc_node+0x96/0x195
[ 8.899317] [<ffffffff811f24df>] ? find_unbound_irq+0x67/0xae
[ 8.911103] [<ffffffff811f259e>] ? bind_virq_to_irq+0x78/0x126
[ 8.923062] [<ffffffff8100e5b5>] ? xen_force_evtchn_callback+0x9/0xa
[ 8.936063] [<ffffffff8100e8f6>] ? xen_timer_interrupt+0x0/0x18d
[ 8.948368] [<ffffffff811f29f6>] ? bind_virq_to_irqhandler+0x19/0x4a
[ 8.961368] [<ffffffff8100e884>] ? xen_setup_timer+0x55/0xaa
[ 8.972982] [<ffffffff81509a5e>] ? xen_time_init+0xaf/0xb5
[ 8.984247] [<ffffffff8150a491>] ? x86_late_time_init+0xa/0x10
[ 8.996206] [<ffffffff81506c3d>] ? start_kernel+0x348/0x3e8
[ 9.007646] [<ffffffff81508c7d>] ? xen_start_kernel+0x57c/0x581
[ 9.019777] Code: d8 c1 e8 13 83 e0 01 09 44 24 64 41 89 dc 44 23 25 28 01 43 00 44 89 e2 83 e2 10 89 54 24 5c 74 05 e8 16 03 25 00 48 8b 4c 24 50 <48> 83 79 08 00 0f 84 30 05 00 00 83 e3 0f 48 8b 44 24 50 41 bf
[ 9.057561] RIP [<ffffffff810badce>] __alloc_pages_nodemask+0x8f/0x5f5
[ 9.070909] RSP <ffffffff81443c88>
[ 9.078015] CR2: 0000000000005a08
[ 9.084780] ---[ end trace a7919e7f17c0a725 ]---
[ 9.094136] Kernel panic - not syncing: Attempted to kill the idle task!

It's worth noting that the Debian kernels are based on
e73f4955a821f850f5b88c32d12a81714523a95f (less the GPU fixes merged by
bcf16b6b4f34fb40a7aaf637947c7d3bce0be671, which the Debian kernel
maintainer chose to exclude).

The baseline is slightly old, but Debian is now pretty deeply frozen so a
wholesale rebase is not possible. If either of you has run a more
recent kernel, the result would be interesting to know.

The actual crashing RIP corresponds to mm/page_alloc.c:1975 which is in
__alloc_pages_nodemask:

	/*
	 * Check the zones suitable for the gfp_mask contain at least one
	 * valid zone. It's possible to have an empty zonelist as a result
	 * of GFP_THISNODE and a memoryless node
	 */
	if (unlikely(!zonelist->_zonerefs->zone))
		return NULL;

zonelist->_zonerefs is an array, but judging by the disassembly and the
register dump, zonelist itself appears to be 0x5a00, which seems unlikely
to be valid.
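
The relationship between the register dump and the faulting address can be checked with a small user-space sketch. The struct definitions below are simplified stand-ins for the kernel's zonelist types (an assumption: one pointer-sized field precedes the _zonerefs array, and zone is the first member of each entry), and zone_field_address() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct zone;                           /* opaque: only the pointer size matters */

/* Simplified model of one zonelist entry (assumed layout). */
struct zoneref_model {
    struct zone *zone;                 /* first member, so offset 0 */
    int zone_idx;
};

/* Simplified model of the zonelist (assumed layout). */
struct zonelist_model {
    void *zlcache_ptr;                 /* one pointer ahead of the array */
    struct zoneref_model _zonerefs[4]; /* real length differs; irrelevant here */
};

/* Address read by zonelist->_zonerefs->zone for a given zonelist value. */
static uintptr_t zone_field_address(uintptr_t zonelist)
{
    assert(offsetof(struct zoneref_model, zone) == 0);
    return zonelist
         + offsetof(struct zonelist_model, _zonerefs)
         + offsetof(struct zoneref_model, zone);
}
```

On a 64-bit build, zone_field_address(0x5a00) evaluates to 0x5a08, which matches CR2 when RCX (0x5a00) is taken as the zonelist pointer. This agrees with the marked bytes in the Code: line (48 83 79 08 00), which decode to a compare of the quadword at 8(%rcx) against zero.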

The zonelist ultimately comes from node, which is always passed as 0 in
the outermost caller in this stack trace (find_unbound_irq calling
irq_to_desc_alloc_node).
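
As a rough illustration of why a bad node number could yield a small non-NULL pointer such as 0x5a00 rather than NULL, the sketch below models the lookup as "per-node data pointer plus a fixed field offset". Everything here is hypothetical: the struct sizes are placeholders and the names are invented for the model. Nothing in the report confirms that the per-node pointer for node 0 is actually NULL on this machine.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder zonelist; the real one is larger. */
struct zonelist_model {
    void *zlcache_ptr;
    void *zonerefs[8];
};

/* Placeholder per-node data: some state followed by the zonelists. */
struct pgdat_model {
    char per_zone_state[256];          /* stand-in for everything before the zonelists */
    struct zonelist_model node_zonelists[2];
};

/*
 * Numeric value of &node_data[nid]->node_zonelists[idx], computed with
 * integers so that a NULL per-node pointer does not invoke undefined
 * behaviour in this model.
 */
static uintptr_t zonelist_address(const struct pgdat_model *const node_data[],
                                  int nid, int idx)
{
    uintptr_t base = (uintptr_t)node_data[nid];

    return base
         + offsetof(struct pgdat_model, node_zonelists)
         + (size_t)idx * sizeof(struct zonelist_model);
}
```

With a table in which entry 0 is NULL and entry 1 points at real data, zonelist_address() for node 0 returns just the field offset (256 in this model): a small, unmapped address that still passes a NULL check. A value like 0x5a00 would be consistent with that pattern, though the dump alone does not prove it.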

I'm not sure but looking at the complete bootlog it looks as if the
system may only have node==1 i.e. no 0 node which could plausibly lead
to this sort of issue:
[ 0.000000] Bootmem setup node 1 0000000000000000-0000000040000000
[ 0.000000] NODE_DATA [0000000000008000 - 000000000000ffff]
[ 0.000000] bootmap [0000000000010000 - 0000000000017fff] pages 8
[ 0.000000] (8 early reservations) ==> bootmem [0000000000 - 0040000000]
[ 0.000000] #0 [0000000000 - 0000001000] BIOS data page ==> [0000000000 - 0000001000]
[ 0.000000] #1 [0003446000 - 0003465000] XEN PAGETABLES ==> [0003446000 - 0003465000]
[ 0.000000] #2 [0000006000 - 0000008000] TRAMPOLINE ==> [0000006000 - 0000008000]
[ 0.000000] #3 [0001000000 - 0001694994] TEXT DATA BSS ==> [0001000000 - 0001694994]
[ 0.000000] #4 [00016b5000 - 0003244e00] RAMDISK ==> [00016b5000 - 0003244e00]
[ 0.000000] #5 [0003245000 - 0003446000] XEN START INFO ==> [0003245000 - 0003446000]
[ 0.000000] #6 [0001695000 - 000169532d] BRK ==> [0001695000 - 000169532d]
[ 0.000000] #7 [0000100000 - 00002e0000] PGTABLE ==> [0000100000 - 00002e0000]
[ 0.000000] found SMP MP-table at [ffff8800000fe710] fe710
[ 0.000000] Zone PFN ranges:
[ 0.000000] DMA 0x00000000 -> 0x00001000
[ 0.000000] DMA32 0x00001000 -> 0x00100000
[ 0.000000] Normal 0x00100000 -> 0x00100000
[ 0.000000] Movable zone start PFN for each node
[ 0.000000] early_node_map[2] active PFN ranges
[ 0.000000] 1: 0x00000000 -> 0x000000a0
[ 0.000000] 1: 0x00000100 -> 0x00040000
[ 0.000000] On node 1 totalpages: 262048
[ 0.000000] DMA zone: 56 pages used for memmap
[ 0.000000] DMA zone: 483 pages reserved
[ 0.000000] DMA zone: 3461 pages, LIFO batch:0
[ 0.000000] DMA32 zone: 3528 pages used for memmap
[ 0.000000] DMA32 zone: 254520 pages, LIFO batch:31
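
The page total in this excerpt can be cross-checked against the two active PFN ranges, which shows that every page dom0 sees is attributed to node 1. A minimal sketch, with pfn_range and total_pages() as illustrative names only:

```c
#include <stddef.h>

/* Half-open PFN range [start, end), as printed in early_node_map. */
struct pfn_range {
    unsigned long start;
    unsigned long end;
};

/* Sum the number of page frames covered by n ranges. */
static unsigned long total_pages(const struct pfn_range *ranges, size_t n)
{
    unsigned long pages = 0;

    for (size_t i = 0; i < n; i++)
        pages += ranges[i].end - ranges[i].start;
    return pages;
}
```

For the ranges {0x0, 0xa0} and {0x100, 0x40000} this gives 160 + 261888 = 262048, the same figure as "On node 1 totalpages: 262048". No range is listed for node 0.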

Perhaps we should be passing numa_node_id() (i.e. the current node) instead
of node 0? There doesn't seem to be an obvious alternative to passing an
explicit node number into this callchain (some places cope with -1, but
not this path AFAICT).
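
The idea of not hard-coding node 0 can be modelled in user space as follows. This is only a sketch of the selection logic, not a proposed kernel patch: pick_alloc_node() and the 32-bit online mask are hypothetical, and a real change would use the kernel's own node helpers.

```c
#include <stdint.h>

/*
 * Return 'preferred' if that node is online, otherwise the lowest-numbered
 * online node, or -1 if no node is online. Bit n of online_mask set means
 * node n is online (model only; limited to 32 nodes).
 */
static int pick_alloc_node(uint32_t online_mask, int preferred)
{
    if (preferred >= 0 && preferred < 32 &&
        (online_mask & (UINT32_C(1) << preferred)))
        return preferred;

    for (int nid = 0; nid < 32; nid++)
        if (online_mask & (UINT32_C(1) << nid))
            return nid;

    return -1;
}
```

On a system where only node 1 is online (mask 0x2), asking for node 0 returns 1 instead of handing an absent node to the allocator.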

It's also not obvious whether dom0 should be seeing the tables which
describe the host's nodes at all, or whether we should be clobbering
something. Given that dom0 sees a pseudo-physical address map, I'm not
convinced seeing the real SRAT is in any way beneficial. Perhaps we should
simply be clobbering NUMAness until actual PV understanding of NUMA is ready?

One thing I notice when googling R410 issues is that they apparently
have a "Cores per CPU" BIOS option which might be worth playing with,
since configuring a reduced number of cores might remove node 0 but not
node 1 (odd but not invalid?). Presumably it is also worth making sure
you have the latest BIOS etc.

It's very much an outside possibility, but it is also worth trying the
packages at http://xenbits.xen.org/people/ianc/ which reinstate the
changesets from bcf16b6b4f34fb40a7aaf637947c7d3bce0be671.

Ian.

[0]
http://lists.xensource.com/archives/html/xen-devel/2010-06/msg01140.html

On Tue, 2010-11-16 at 00:32 +0100, Vincent CARON wrote:
> Package: linux-image-2.6.32-5-xen-amd64
> Version: 2.6.32-27
> Severity: important
>
> I just tried d-i 6beta1 and booted Squeeze and its 2.6.32 kernel for
> the first time on my usual server hardware (Dell R410).
>
> I opted for the xen-amd64 kernel, and it boots fine on bare metal. But
> as soon as I tried to boot it as dom0 over Xen hypervisor, it BUG's:
>
> [ 8.479841] BUG: unable to handle kernel paging request at
> 0000000000005a08
> [ 8.493868] IP: [<ffffffff810badce>]
> __alloc_pages_nodemask+0x8f/0x5f5
>
> Then quickly oopses and panics. I tried various flags:
> - upping dom0_mem from 256M to 1024M (I've been running Lenny/Xen 3.2
> with 256M happily for several months on the same hw)
> - using Xen 'nommu'
> - using Linux nomodeset
>
> Then I followed instructions on a Xen wiki page to provide verbose
> traces (although they do not look much more verbose than the regular
> boot).
>
> I'm using an IPMI serial-over-lan console which appears as a regular
> UART to Xen.
>
> I'm attaching a boot log to this report.
>
> -- System Information:
> Debian Release: squeeze/sid
> APT prefers testing
> APT policy: (500, 'testing')
> Architecture: amd64 (x86_64)
>
> Kernel: Linux 2.6.32-5-amd64 (SMP w/2 CPU cores)
> Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
> Shell: /bin/sh linked to /bin/bash
>
>
>

--
Ian Campbell
Current Noise: Wolf - Seize The Night

If you will practice being fictional for a while, you will understand that
fictional characters are sometimes more real than people with bodies and
heartbeats.
