[Xen-devel] Hunting down an oops in Xen 3.1.0's 2.6.18 kernel

To:	xen-devel@xxxxxxxxxxxxxxxxxxx
Subject:	[Xen-devel] Hunting down an oops in Xen 3.1.0's 2.6.18 kernel
From:	"Michael Marineau" <mike@xxxxxxxxxxxx>
Date:	Fri, 14 Sep 2007 15:51:03 -0700
Delivery-date:	Fri, 14 Sep 2007 15:51:44 -0700

Hey,
I've been beating my head against this bug for the last few days.
After dom0's memory is reduced, something appears to reference a page
that has been removed from the machine_to_phys_mapping table. After a
lot of tracing I still haven't spotted how that could happen.

System required to reproduce:
x86_32, with or without PAE
2 GB of RAM or more
3.1.0's 2.6.18 kernel, or anything based on it, such as Red Hat's 2.6.20 Xen patch
dom0 started with no memory limit, so that it uses most of the 2 GB

The easiest way to reproduce the problem is to reduce dom0's memory
significantly (to something like 150M), either with xm mem-set or by
starting a very large domU. Then do something: sometimes ls will do,
other times I start compiling glibc. It is also possible to hit the
issue after reducing memory only a little, but then it takes longer to
hit, if it hits at all.
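
For anyone who wants to script the steps above, here is a minimal sketch.
It assumes the stock xm tool is on the path and that dom0 is listed as
"Domain-0". The function names and the default workload are made up for
illustration; they are not part of any Xen tooling.

```python
import subprocess

DOM0 = "Domain-0"   # assumption: the name xm uses for dom0
TARGET_MB = 150     # "something like 150M" from the description above


def mem_set_command(domain=DOM0, target_mb=TARGET_MB):
    """Build the xm command line that shrinks a domain to target_mb MB."""
    return ["xm", "mem-set", domain, str(target_mb)]


def reproduce(workload=("ls",)):
    """Shrink dom0, then run a workload that has to exec a new process.

    Both oopses in this report happen on the execve path, so any command
    that execs should do; a glibc build is the heavier alternative.
    """
    subprocess.check_call(mem_set_command())
    return subprocess.call(list(workload))
```

Calling reproduce() as root on an affected dom0 should be enough to try
it; on a healthy kernel it just shrinks dom0 and runs ls.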

I have been unable to reproduce this with 3.0.4's 2.6.16 kernel but
2.6.18 will oops on both 3.0.4 and 3.1.0. Also, x86_64 appears to be
ok.

I'm guessing this issue is the same as the oops reported here:
http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=975

Below is an example of the oops on my 2.6.18 PAE kernel with a couple
of extra debugging lines added:

(XEN) mm.c:503:d0 Could not get page ref for pfn 7fffffff
(XEN) mm.c:2324:d0 mfn: 7fffffff, gmfn: 7fffffff, ptr: 7fffffff0c0
(XEN) mm.c:2325:d0 Could not get page for normal update
virtptr: f57a70c0 machineptr: 7fffffff0c0
------------[ cut here ]------------
kernel BUG at arch/i386/mm/hypervisor.c:62!
invalid opcode: 0000 [#1]
SMP
Modules linked in:
CPU:    1
EIP:    0061:[<c0117875>]    Not tainted VLI
EFLAGS: 00010296   (2.6.18-xen-r5-try2 #6)
EIP is at xen_l1_entry_update+0xb9/0xde
eax: 0000002d   ebx: deadbeef   ecx: 00000000   edx: 00000001
esi: deadbeef   edi: 00000000   ebp: ecea0c4c   esp: ecea0c14
ds: 007b   es: 007b   ss: 0069
Process bash (pid: 5065, ti=ecea0000 task=ecfe3030 task.ti=ecea0000)
Stack: c037b964 f57a70c0 fffff0c0 000007ff 00000000 00000000 f57a70c0 fffff0c0
       000007ff 00000000 00000000 00000000 00000000 00000000 ecea0cc0 c0158693
       3536f025 00000000 ed383780 ed3837c8 c04bce70 00000000 00000004 00000000
Call Trace:
 [<c0158693>] zap_pte_range+0x265/0x658
 [<c0158bf2>] unmap_page_range+0x16c/0x2b4
 [<c0158e08>] unmap_vmas+0xce/0x1cb
 [<c015f094>] exit_mmap+0x7d/0xf4
 [<c011e0cf>] mmput+0x36/0x8c
 [<c01782af>] exec_mmap+0x156/0x229
 [<c0178a54>] flush_old_exec+0x59/0x25a
 [<c01989f4>] load_elf_binary+0x33c/0xc52
 [<c0178f06>] search_binary_handler+0x89/0x23c
 [<c0197c95>] load_script+0x221/0x23c
 [<c0178f06>] search_binary_handler+0x89/0x23c
 [<c017920b>] do_execve+0x152/0x1be
 [<c010391c>] sys_execve+0x32/0x84
 [<c0104dfb>] syscall_call+0x7/0xb
 [<b7e13899>] 0xb7e13899
Code: 78 08 83 c4 2c 5b 5e 5f 5d c3 8b 45 e4 8b 55 e8 89 54 24 0c 89
44 24 08 8b 45 e
EIP: [<c0117875>] xen_l1_entry_update+0xb9/0xde SS:ESP 0069:ecea0c14

And just for kicks, a non-PAE oops:

(XEN) mm.c:503:d0 Could not get page ref for pfn fffff
(XEN) mm.c:2324:d0 mfn: fffff, gmfn: fffff, ptr: fffff060
(XEN) mm.c:2325:d0 Could not get page for normal update
virtptr: fbfa7060 machineptr: fffff060
------------[ cut here ]------------
kernel BUG at arch/i386/mm/hypervisor.c:62!
invalid opcode: 0000 [#1]
SMP
Modules linked in:
CPU:    1
EIP:    0061:[<c01158e1>]    Not tainted VLI
EFLAGS: 00010282   (2.6.18-xen-r5-try2 #4)
EIP is at xen_l1_entry_update+0xa1/0xb1
eax: 0000002a   ebx: deadbeef   ecx: 00000000   edx: 00000001
esi: deadbeef   edi: fbfa7060   ebp: c0bcbca0   esp: c0bcbc74
ds: 007b   es: 007b   ss: 0069
Process bash (pid: 4943, ti=c0bcb000 task=c1fd7030 task.ti=c0bcb000)
Stack: c036508c fbfa7060 fffff060 00000000 fffff060 00000000 00000000 00000000
       fbfa7060 3b875025 f3bce3c0 c0bcbd20 c0152f4b c0bcbd10 f35ff840 80018000
       00000000 f35bb860 c0bcbd38 003fefe8 00000000 00000001 800c9000 f3be7800
Call Trace:
 [<c0152f4b>] unmap_vmas+0x4d4/0x743
 [<c0156b36>] exit_mmap+0x7f/0xf4
 [<c011b779>] mmput+0x24/0x85
 [<c016fd62>] flush_old_exec+0x2de/0xa6d
 [<c018fad0>] load_elf_binary+0x51d/0x1a4d
 [<c016f23e>] search_binary_handler+0x8d/0x22c
 [<c0170eca>] do_execve+0x14d/0x1c9
 [<c01034be>] sys_execve+0x2e/0x76
 [<c0104e83>] syscall_call+0x7/0xb
 [<b7ecb899>] 0xb7ecb899
Code: c1 72 af 0f 0b 22 00 54 29 36 c0 eb a5 8b 45 e4 8b 55 e8 89 44
24 08 89 54 24 0
EIP: [<c01158e1>] xen_l1_entry_update+0xa1/0xb1 SS:ESP 0069:c0bcbc74
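
The virtptr/machineptr pairs in the two oopses above can be decoded with
a bit of arithmetic. The sketch below assumes the usual 4 KiB x86 page
size (a shift of 12); the helper names are made up for illustration. It
splits each machine address into a frame number and an in-page offset,
checks that the offset matches the virtual pointer's, and checks whether
the frame number has every bit set. An all-ones frame number would fit
the idea that the value came from an invalidated table entry rather than
a real page, but that reading is a guess, not something the logs prove.

```python
PAGE_SHIFT = 12                        # assumption: 4 KiB pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1    # low 12 bits: offset within the page


def split_machine_ptr(ptr):
    """Split a machine address into (mfn, offset within the page)."""
    return ptr >> PAGE_SHIFT, ptr & OFFSET_MASK


def is_all_ones(value):
    """True if value is 2**n - 1 for some n > 0 (every used bit is set)."""
    return value > 0 and (value & (value + 1)) == 0


# (virtptr, machineptr) pairs copied from the two oopses above
samples = {
    "pae": (0xf57a70c0, 0x7fffffff0c0),
    "non-pae": (0xfbfa7060, 0xfffff060),
}

for name, (virt, mach) in sorted(samples.items()):
    mfn, offset = split_machine_ptr(mach)
    same_offset = offset == (virt & OFFSET_MASK)
    print("%s: mfn=%#x offset=%#x same_offset=%s all_ones=%s"
          % (name, mfn, offset, same_offset, is_all_ones(mfn)))
```

In both cases the frame number that comes out matches the "mfn:" value
Xen printed, and only the in-page offset carries over from the virtual
pointer.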

The call traces tend to differ, but the two above are pretty common.
The oops is in xen_l1_entry_update almost all of the time, although I
have also seen it in xen_l2_entry_update.

Thanks,
-- 
Michael Marineau
Oregon State University
mike@xxxxxxxxxxxx

