Hi,
> > > > While doing LVM snapshot for migration and get the following:
> > > >
> > > > Dec 26 15:58:29 xen01 kernel: ------------[ cut here ]------------
> > > > Dec 26 15:58:29 xen01 kernel: kernel BUG at arch/x86/xen/mmu.c:1860!
> > > > Dec 26 15:58:29 xen01 kernel: invalid opcode: 0000 [#1] SMP
> > > > Dec 26 15:58:29 xen01 kernel: last sysfs file: /sys/block/dm-26/dev
> > > > Dec 26 15:58:29 xen01 kernel: CPU 0
> > > > Dec 26 15:58:29 xen01 kernel: Modules linked in: ipt_MASQUERADE
>
> [...]
> [<ffffffff810052e2>] pin_pagetable_pfn+0x52/0x60
> [<ffffffff81006f5c>] xen_alloc_ptpage+0x9c/0xa0
> [<ffffffff81006f8e>] xen_alloc_pte+0xe/0x10
> [<ffffffff810decde>] __pte_alloc+0x7e/0xf0
> [<ffffffff810e15c5>] handle_mm_fault+0x855/0x930
> [<ffffffff8102dd9e>] ? pvclock_clocksource_read+0x4e/0x100
> [<ffffffff810e734c>] ? do_mmap_pgoff+0x33c/0x380
> [<ffffffff81452b96>] do_page_fault+0x116/0x3e0
> [<ffffffff8144ff65>] page_fault+0x25/0x30
> [...]
> (XEN) mm.c:2364:d0 Bad type (saw 7400000000000001 != exp 1000000000000000) for mfn 41114f (pfn d514f)
> (XEN) mm.c:2733:d0 Error while pinning mfn 41114f
Looking into the code, the Dom0 code is attempting to pin what it thinks
is a "PGT_l1_page_table"; however, the hypervisor returns -EINVAL because
the page actually is a "PGT_writable_page".
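To make that reading of the hypervisor message concrete, here is a minimal sketch of how I decode the two values from the "Bad type" line. It assumes that the page type sits in the top four bits of the 64-bit type_info word; that layout is my assumption from reading the headers, not something the log states. The helper names are made up for illustration and are not Xen functions, and only the two type encodings that show up in this thread are named.

```c
#include <stdint.h>

/* Assumption: the page type is stored in the top 4 bits of type_info. */
#define TYPE_SHIFT          60
#define TYPE_L1_PAGE_TABLE  0x1u  /* taken from "exp 1000000000000000" */
#define TYPE_WRITABLE_PAGE  0x7u  /* taken from "saw 7400000000000001" */

/* Hypothetical helper: extract the type field from a type_info word. */
unsigned int page_type(uint64_t type_info)
{
    return (unsigned int)(type_info >> TYPE_SHIFT);
}

/* Hypothetical helper: name the two types seen in this thread. */
const char *page_type_name(uint64_t type_info)
{
    switch (page_type(type_info)) {
    case TYPE_L1_PAGE_TABLE: return "PGT_l1_page_table";
    case TYPE_WRITABLE_PAGE: return "PGT_writable_page";
    default:                 return "other";
    }
}

/* Returns 1 if the type the page currently has equals the expected one. */
int type_matches(uint64_t saw, uint64_t expected)
{
    return page_type(saw) == page_type(expected);
}
```

Under that assumption, calling type_matches() with the "saw" and "exp" values from the log reports a mismatch, which is consistent with the pin request being refused.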
After a few hours I managed to catch the crash while the offending
process was being straced. However, the results were totally
inconclusive, because the last lines before the crash are:
16576 open("/lib/multipath/libcheckdirectio.so", O_RDONLY) = 4
16576 read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\v\0\0\0\0\0\0"..., 832) = 832
16576 fstat(4, {st_mode=S_IFREG|0644, st_size=9344, ...}) = 0
16576 mmap(NULL, 2104672, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 0x7fa6b36f6000
16576 mprotect(0x7fa6b36f8000, 2093056, PROT_NONE) = 0
16576 mmap(0x7fa6b38f7000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x1000) = 0x7fa6b38f7000
16576 close(4) = 0
A non-crashing execution would have continued with:
16667 open("/etc/ld.so.cache", O_RDONLY) = 4
16667 fstat(4, {st_mode=S_IFREG|0644, st_size=21739, ...}) = 0
16667 mmap(NULL, 21739, PROT_READ, MAP_PRIVATE, 4, 0) = 0x7f237de56000
16667 close(4) = 0
16667 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
16667 open("/lib/libaio.so.1", O_RDONLY) = 4
[...]
This means that it crashed during the dynamic loading of a plugin
shared library and not while interacting with the device mapper.
(Also, the device being investigated was /dev/sde and not some dm
device.)
This leads me to believe that some device-mapper shared library has a
particular memory layout that tends to trigger this crash and it has
nothing to do with any device-mapper code at all. Also, the crash seems
to be timing-sensitive, so it might also be a race condition of some
sort. (On a side note: this is a 24-core machine (!) and the kernel
happens to have full preemption enabled.)
I am trying to understand the code a bit. Can someone explain to me
what xen_alloc_ptpage is doing?
> /* This needs to make sure the new pte page is pinned iff its being
> attached to a pinned pagetable. */
> [...]
> if (PagePinned(virt_to_page(mm->pgd))) {
> [...]
> pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE, pfn);
I must admit I don't know very much about memory handling in Linux (so
please excuse me if I am reading total nonsense into this; still,
I'm intrigued and would like to understand it a bit better), but
isn't `mm->pgd' supposed to point to the L1 page table, and `pfn', being
a pte page, a 3rd/4th level page? Is this a code path that is exercised
a lot?
Thanks,
Christophe
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel