WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 

To: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: Re: [Xen-devel] kernel BUG at arch/x86/xen/mmu.c:1860!
From: Christophe Saout <christophe@xxxxxxxx>
Date: Wed, 05 Jan 2011 00:10:16 +0100
Cc: Teck Choon Giam <giamteckchoon@xxxxxxxxx>, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
Delivery-date: Tue, 04 Jan 2011 15:10:55 -0800
In-reply-to: <1294153817.24719.3.camel@xxxxxxxxxxxxxxxxxxxx>
References: <AANLkTi=Hwjooo43FiLPAAGzzOTG440ij_QsEqks6ECVv@xxxxxxxxxxxxxx> <20101227155314.GG3728@xxxxxxxxxxxx> <AANLkTikNvKGc78HQOMtVfi=Q+r8r92=svzZcMLQ2xojQ@xxxxxxxxxxxxxx> <20101228104256.GJ2754@xxxxxxxxxxx> <1294153817.24719.3.camel@xxxxxxxxxxxxxxxxxxxx>

Hi,

> > >      > While doing LVM snapshot for migration and get the following:
> > >      >
> > >      > Dec 26 15:58:29 xen01 kernel: ------------[ cut here ]------------
> > >      > Dec 26 15:58:29 xen01 kernel: kernel BUG at arch/x86/xen/mmu.c:1860!
> > >      > Dec 26 15:58:29 xen01 kernel: invalid opcode: 0000 [#1] SMP
> > >      > Dec 26 15:58:29 xen01 kernel: last sysfs file: /sys/block/dm-26/dev
> > >      > Dec 26 15:58:29 xen01 kernel: CPU 0
> > >      > Dec 26 15:58:29 xen01 kernel: Modules linked in: ipt_MASQUERADE
>
> [...]
> [<ffffffff810052e2>] pin_pagetable_pfn+0x52/0x60    
> [<ffffffff81006f5c>] xen_alloc_ptpage+0x9c/0xa0
> [<ffffffff81006f8e>] xen_alloc_pte+0xe/0x10
> [<ffffffff810decde>] __pte_alloc+0x7e/0xf0
> [<ffffffff810e15c5>] handle_mm_fault+0x855/0x930
> [<ffffffff8102dd9e>] ? pvclock_clocksource_read+0x4e/0x100
> [<ffffffff810e734c>] ? do_mmap_pgoff+0x33c/0x380
> [<ffffffff81452b96>] do_page_fault+0x116/0x3e0
> [<ffffffff8144ff65>] page_fault+0x25/0x30
> [...]
> (XEN) mm.c:2364:d0 Bad type (saw 7400000000000001 != exp 1000000000000000) for mfn 41114f (pfn d514f)
> (XEN) mm.c:2733:d0 Error while pinning mfn 41114f

Looking into the code, the Dom0 kernel is attempting to pin what it thinks
is a "PGT_l1_page_table", but the hypervisor returns -EINVAL because the
page actually is a "PGT_writable_page".
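To see how the two numbers in the hypervisor message map to those names, here is a small decoder. It is only a sketch: the bit layout (page type in bits 60-63, 1 = L1 page table, 7 = writable page) is an assumption based on the Xen 4.0-era x86_64 headers and should be checked against the hypervisor actually running. Under that assumption, 0x74... decodes as a writable page and 0x10... as an L1 page table; the remaining bits (the 0x04 nibble and the trailing 1) are presumably flag and reference-count bits, which the sketch does not try to interpret.

```c
#include <stdint.h>

/* ASSUMPTION: Xen 4.0-era x86_64 type_info layout, with the page type in
   the top four bits (type mask 15UL << 60). Check asm-x86/mm.h in the
   hypervisor tree actually running before trusting these values. */
#define TOY_PGT_TYPE_SHIFT 60
#define TOY_PGT_TYPE_MASK  (UINT64_C(0xf) << TOY_PGT_TYPE_SHIFT)

/* Extract the type field from a type_info word as printed by mm.c. */
static unsigned pgt_type(uint64_t type_info)
{
    return (unsigned)((type_info & TOY_PGT_TYPE_MASK) >> TOY_PGT_TYPE_SHIFT);
}

/* Map the type field to the PGT_* name it is assumed to correspond to. */
static const char *pgt_type_name(uint64_t type_info)
{
    switch (pgt_type(type_info)) {
    case 0:  return "PGT_none";
    case 1:  return "PGT_l1_page_table";
    case 2:  return "PGT_l2_page_table";
    case 3:  return "PGT_l3_page_table";
    case 4:  return "PGT_l4_page_table";
    case 7:  return "PGT_writable_page";
    default: return "other";
    }
}
```

With those assumptions, pgt_type_name(0x7400000000000001) yields "PGT_writable_page" (what the hypervisor saw) and pgt_type_name(0x1000000000000000) yields "PGT_l1_page_table" (what the pin expected).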

After a few hours I managed to catch the crash while the offending
process was being straced.  However, the results were totally
inconclusive, because the last lines before the crash are:

16576 open("/lib/multipath/libcheckdirectio.so", O_RDONLY) = 4
16576 read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\v\0\0\0\0\0\0"..., 832) = 832
16576 fstat(4, {st_mode=S_IFREG|0644, st_size=9344, ...}) = 0
16576 mmap(NULL, 2104672, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 0x7fa6b36f6000
16576 mprotect(0x7fa6b36f8000, 2093056, PROT_NONE) = 0
16576 mmap(0x7fa6b38f7000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x1000) = 0x7fa6b38f7000
16576 close(4)                          = 0

A non-crashing execution would have continued with:

16667 open("/etc/ld.so.cache", O_RDONLY) = 4
16667 fstat(4, {st_mode=S_IFREG|0644, st_size=21739, ...}) = 0
16667 mmap(NULL, 21739, PROT_READ, MAP_PRIVATE, 4, 0) = 0x7f237de56000
16667 close(4)                          = 0
16667 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
16667 open("/lib/libaio.so.1", O_RDONLY) = 4
[...]

This means that it crashed during the dynamic loading of a plugin
shared library and not while interacting with the device mapper
(also, the device being investigated was /dev/sde and not some dm
device).

This leads me to believe that some device-mapper shared library has a
particular memory layout that tends to trigger this crash, and that the
crash has nothing to do with any device-mapper code at all.  Also, the
crash seems to be timing-sensitive, so it might also be a race condition
of some sort (on a side note: this is a 24-core machine (!) and the
kernel happens to have full preemption enabled).
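If the problem really is tied to the mapping pattern rather than to device-mapper, one way to try to provoke it without multipath would be to replay the loader's sequence from the strace in a loop. The helper below is a hypothetical sketch (replay_loader_maps is not an existing tool): it reserves about 2 MiB read/exec from a file, turns the tail PROT_NONE, maps one private read/write page over it with MAP_FIXED, and touches both mappings so that page faults go through handle_mm_fault. Whether that actually needs a fresh pte page (and so reaches __pte_alloc and xen_alloc_pte) depends on what is already mapped nearby, and it is not known to reproduce the crash. MAP_DENYWRITE from the trace is left out because the mmap syscall ignores it. Run it only on a test box.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* HYPOTHETICAL helper: replay the ld.so-style mapping sequence from the
   strace once against `path`. Returns 0 on success, -1 on any failure. */
static int replay_loader_maps(const char *path)
{
    long pg = sysconf(_SC_PAGESIZE);
    /* Close to the 2104672-byte reservation in the trace: 2 MiB + 2 pages. */
    size_t reserve = (size_t)2 * 1024 * 1024 + 2 * (size_t)pg;
    struct stat st;
    unsigned char *base, *data;
    int rc = -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Only touch file-backed pages: require at least two pages of data. */
    if (fstat(fd, &st) == 0 && st.st_size >= 2 * pg) {
        /* 1. Reserve the whole image read/exec (MAP_DENYWRITE omitted). */
        base = mmap(NULL, reserve, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
        if (base != MAP_FAILED) {
            /* 2. Revoke access past the first two pages, like the PROT_NONE mprotect. */
            if (mprotect(base + 2 * pg, reserve - 2 * (size_t)pg, PROT_NONE) == 0) {
                /* 3. Map one private read/write page at the end with MAP_FIXED,
                      from file offset one page (like the 0x1000 offset in the trace). */
                data = mmap(base + reserve - pg, (size_t)pg, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_FIXED, fd, (off_t)pg);
                if (data != MAP_FAILED) {
                    /* 4. Fault both mappings in: a read of the text page and
                          a copy-on-write store to the data page. */
                    volatile unsigned char sink = base[0];
                    (void)sink;
                    data[0] = (unsigned char)(data[0] + 1);
                    rc = 0;
                }
            }
            /* Unmapping the reservation also removes the MAP_FIXED page. */
            munmap(base, reserve);
        }
    }
    close(fd);
    return rc;
}
```

One would call it in a tight loop against /lib/multipath/libcheckdirectio.so, probably with several copies running in parallel given how timing-sensitive the crash appears to be.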

I am trying to understand the code a bit.  Can someone explain to me
what xen_alloc_ptpage is doing?

> /* This needs to make sure the new pte page is pinned iff its being
>   attached to a pinned pagetable. */
> [...]
> if (PagePinned(virt_to_page(mm->pgd))) {
>     [...]
>     pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE, pfn);
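To make the quoted rule concrete, here is a toy user-space model of it. Everything in it is hypothetical: toy_alloc_ptpage, toy_hyp_pin_l1 and the reference counting are made up for illustration and are not the kernel or hypervisor code. It only shows the one failure mode consistent with the log above: if the new pte page is still mapped writable somewhere else when the pin is requested, the hypervisor sees a writable page instead of an L1 table and refuses with -EINVAL. In the real kernel, a failed pin in pin_pagetable_pfn presumably ends in the BUG() (invalid opcode) from the trace.

```c
#include <errno.h>
#include <stdbool.h>

/* HYPOTHETICAL toy model: names only echo Xen/Linux, they are not real APIs. */
struct toy_frame {
    unsigned writable_refs;  /* writable mappings of the frame still alive */
    bool is_l1;              /* accepted by the "hypervisor" as an L1 table */
};

struct toy_ptpage {
    struct toy_frame *frame; /* machine frame behind this page-table page */
    bool pinned;             /* stands in for PagePinned() */
};

/* Toy hypervisor side: refuse to pin as L1 while any writable mapping
   exists, mirroring "Bad type (saw <writable> != exp <L1>)". */
static int toy_hyp_pin_l1(struct toy_frame *f)
{
    if (f->writable_refs != 0)
        return -EINVAL;
    f->is_l1 = true;
    return 0;
}

/* Toy guest side of the quoted rule: pin the new pte page iff the pgd it
   is attached to is pinned. The first writable reference stands for the
   guest's own mapping, dropped to read-only before pinning; any further
   (stray) writable reference makes the pin fail. */
static int toy_alloc_ptpage(const struct toy_ptpage *pgd, struct toy_ptpage *pte)
{
    if (!pgd->pinned)
        return 0;                     /* unpinned tree: nothing to pin yet */
    pte->pinned = true;
    if (pte->frame->writable_refs > 0)
        pte->frame->writable_refs--;  /* drop the guest's own writable mapping */
    return toy_hyp_pin_l1(pte->frame);
}
```

In this model a frame with one writable reference pins cleanly, while a frame with an extra writable alias fails exactly the way the hypervisor log describes.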

I must admit I don't know very much about memory handling in Linux (so
please excuse me if I am reading total nonsense into this; still, I'm
intrigued and would like to understand it a bit better), but isn't
`mm->pgd' supposed to point to the L1 page table, and `pfn', being a
pte page, to a 3rd/4th level page?  Is this a code path that is
exercised a lot?
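For orientation on the level names: on x86-64, Linux's pgd/pud/pmd/pte levels correspond to what Xen calls L4/L3/L2/L1, so Xen's L1 is the lowest level (the page full of PTEs), not the top one. The sketch below splits a virtual address into the four 9-bit indices of 4-level paging with 4 KiB pages; split_va and struct va_split are made-up names for illustration.

```c
#include <stdint.h>

/* x86-64 4-level paging, 4 KiB pages: 9 index bits per level. */
struct va_split {
    unsigned pgd;  /* bits 39-47: top level, Xen's L4 */
    unsigned pud;  /* bits 30-38: Xen's L3 */
    unsigned pmd;  /* bits 21-29: Xen's L2 */
    unsigned pte;  /* bits 12-20: Xen's L1, the level a "pte page" holds */
    unsigned off;  /* bits 0-11: byte offset within the page */
};

/* Hypothetical helper: split a virtual address into its table indices. */
static struct va_split split_va(uint64_t va)
{
    struct va_split s;
    s.pgd = (unsigned)((va >> 39) & 0x1ff);
    s.pud = (unsigned)((va >> 30) & 0x1ff);
    s.pmd = (unsigned)((va >> 21) & 0x1ff);
    s.pte = (unsigned)((va >> 12) & 0x1ff);
    s.off = (unsigned)(va & 0xfff);
    return s;
}
```

For the library mapping at 0x7fa6b36f6000 from the strace, this gives pgd 255, pud 154, pmd 411, pte 246 and offset 0.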

Thanks,
        Christophe



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
