Jeremy Fitzhardinge wrote:
> On 08/27/2010 06:22 PM, Scott Garron wrote:
>> I use LVM volumes for domU disks. To create backups, I create a
>> snapshot of the volume, mount the snapshot in the dom0, mount an
>> equally-sized backup volume from another physical storage source, run
>> an rsync from one to the other, unmount both, then remove the
>> snapshot.
>> This includes creating a snapshot and mounting NTFS volumes from
>> Windows-based HVM guests.
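>>
>> For reference, each per-volume cycle is essentially the following
>> (volume group, volume, and mount point names are only illustrative,
>> not my real ones):
>>
>> # snapshot the guest's disk with a 1G copy-on-write area
>> lvcreate -s -L 1G -n guest1-snap /dev/vg0/guest1-disk
>> # mount the snapshot read-only and the backup volume read-write
>> # (the Windows HVM snapshots get mounted with the ntfs driver instead)
>> mount -o ro /dev/vg0/guest1-snap /mnt/snap
>> mount /dev/vg1/guest1-backup /mnt/backup
>> # mirror the snapshot's contents onto the backup volume
>> rsync -a --delete /mnt/snap/ /mnt/backup/
>> # unmount both and drop the snapshot
>> umount /mnt/snap /mnt/backup
>> lvremove -f /dev/vg0/guest1-snap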
>>
>> This practice may not be perfect, but it has worked fine for me for
>> a couple of years while I was running Xen 3.2.1 and a
>> linux-2.6.18.8-xen dom0 (and the same kernel for domU). After
>> upgraded versions of udev started complaining about the kernel being
>> too old, I thought it was well past time to try to transition to a
>> newer version of Xen and a newer dom0 kernel. This transition has
>> been a gigantic learning experience, let me tell you.
>>
>> After that transition, here's the problem I've been wrestling with
>> and can't seem to find a solution for: it seems like any time I
>> start manipulating a volume group to add or remove a snapshot of a
>> logical volume that's used as a disk for a running HVM guest, new
>> calls into LVM2 and/or Xen's storage layer lock up and spin forever.
>> The first time I ran across the problem, there was no indication of
>> trouble other than that any command touching LVM would freeze and
>> could not be signaled to do anything. In other words, no error
>> messages, nothing in dmesg, nothing in syslog... the commands would
>> just freeze and never return. That was with the 2.6.31.14 kernel,
>> which is what's currently retrieved if you check out
>> xen-4.0-testing.hg and just do a make dist.
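>>
>> For reference, that build amounts to roughly the following (the
>> clone URL is from memory, so double-check it against xenbits):
>>
>> hg clone http://xenbits.xensource.com/xen-4.0-testing.hg
>> cd xen-4.0-testing.hg
>> make dist   # builds the hypervisor and tools and pulls in that dom0 kernel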
>>
>> I have since checked out and compiled 2.6.32.18, which comes from
>> doing git checkout -b xen/stable-2.6.32.x origin/xen/stable-2.6.32.x,
>> as described on the Wiki page here:
>> http://wiki.xensource.com/xenwiki/XenParavirtOps
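>>
>> Concretely, getting and building that kernel went something like
>> this (the clone URL is my recollection of what the wiki page gives,
>> so treat it as approximate):
>>
>> git clone git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git linux-2.6-xen
>> cd linux-2.6-xen
>> git checkout -b xen/stable-2.6.32.x origin/xen/stable-2.6.32.x
>> make menuconfig                        # enable the Xen / dom0 options
>> make -j4 && make modules_install install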
>>
>> If I run that kernel for dom0, but continue to use 2.6.31.14 for the
>> paravirtualized domUs, everything works fine until I try to
>> manipulate the snapshots of the HVM volumes. Today, I got this
>> kernel OOPS:
>
> That's definitely bad. Something is causing udevd to end up with bad
> pagetables, which are causing a kernel crash on exit. I'm not sure if
> it's *the* udevd or some transient child, but either way it's bad.
>
> Any thoughts on this, Daniel?
>
>>
>> ---------------------------
>>
>> [78084.004530] BUG: unable to handle kernel paging request at ffff8800267c9010
>> [78084.004710] IP: [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
>> [78084.004886] PGD 1002067 PUD 1006067 PMD 217067 PTE 80100000267c9065
>> [78084.005065] Oops: 0003 [#1] SMP
>> [78084.005234] last sysfs file: /sys/devices/virtual/block/dm-32/removable
>> [78084.005256] CPU 1
>> [78084.005256] Modules linked in: tun xt_multiport fuse dm_snapshot nf_nat_tftp nf_conntrack_tftp nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_proto_gre ntfs parport_pc parport k8temp floppy forcedeth [last unloaded: scsi_wait_scan]
>> [78084.005256] Pid: 22814, comm: udevd Tainted: G W 2.6.32.18 #1 H8SMI
>> [78084.005256] RIP: e030:[<ffffffff810382ff>] [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
>> [78084.005256] RSP: e02b:ffff88002e2e1d18 EFLAGS: 00010246
>> [78084.005256] RAX: 0000000000000000 RBX: ffff8800267c9010 RCX: ffff880000000000
>> [78084.005256] RDX: dead000000100100 RSI: 0000000000000000 RDI: 0000000000000004
>> [78084.005256] RBP: ffff88002e2e1d28 R08: 0000000001993000 R09: dead000000100100
>> [78084.005256] R10: 800000016e90e165 R11: 0000000000000000 R12: 0000000000000000
>> [78084.005256] R13: ffff880002d8f580 R14: 0000000000400000 R15: ffff880029248000
>> [78084.005256] FS: 00007fa07d87f7a0(0000) GS:ffff880002d81000(0000) knlGS:0000000000000000
>> [78084.005256] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [78084.005256] CR2: ffff8800267c9010 CR3: 0000000001001000 CR4: 0000000000000660
>> [78084.005256] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [78084.005256] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [78084.005256] Process udevd (pid: 22814, threadinfo ffff88002e2e0000, task ffff880019491e80)
>> [78084.005256] Stack:
>> [78084.005256]  0000000000600000 000000000061e000 ffff88002e2e1de8 ffffffff810fb8a5
>> [78084.005256] <0> 00007fff13ffffff 0000000100000206 ffff880003158003 0000000000000000
>> [78084.005256] <0> 0000000000000000 000000000061dfff 000000000061dfff 000000000061dfff
>> [78084.005256] Call Trace:
>> [78084.005256]  [<ffffffff810fb8a5>] free_pgd_range+0x27c/0x45e
>> [78084.005256]  [<ffffffff810fbb2b>] free_pgtables+0xa4/0xc7
>> [78084.005256]  [<ffffffff810ff1fd>] exit_mmap+0x107/0x13f
>> [78084.005256]  [<ffffffff8107714b>] mmput+0x39/0xda
>> [78084.005256]  [<ffffffff8107adff>] exit_mm+0xfb/0x106
>> [78084.005256]  [<ffffffff8107c86d>] do_exit+0x1e8/0x6ff
>> [78084.005256]  [<ffffffff815c228b>] ? do_page_fault+0x2cd/0x2fd
>> [78084.005256]  [<ffffffff8107ce0d>] do_group_exit+0x89/0xb3
>> [78084.005256]  [<ffffffff8107ce49>] sys_exit_group+0x12/0x16
>> [78084.005256]  [<ffffffff8103cc82>] system_call_fastpath+0x16/0x1b
>> [78084.005256] Code: 48 83 c4 28 5b c9 c3 55 48 89 e5 41 54 49 89 f4 53 48 89 fb e8 fc ee ff ff 48 89 df ff 05 52 8f 9e 00 e8 78 e4 ff ff 84 c0 75 05 <4c> 89 23 eb 16 e8 e0 ee ff ff 4c 89 e6 48 89 df ff 05 37 8f 9e
>> [78084.005256] RIP [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
>> [78084.005256] RSP <ffff88002e2e1d18>
>> [78084.005256] CR2: ffff8800267c9010
>> [78084.005256] ---[ end trace 4eaa2a86a8e2da24 ]---
>> [78084.005256] Fixing recursive fault but reboot is needed!
>>
>> ---------------------------
>>
>> After that was printed on the console, any command that interacts
>> with Xen (xentop, xm) would freeze and never return. After I tried
>> to do a sane shutdown on the guests, the whole dom0 locked up
>> completely. Even the alt-sysrq combinations stopped working after I
>> had looked at a couple of them.
>>
>> I feel it's probably necessary to mention that this happens after
>> several fairly rapid-fire creations and deletions of snapshot
>> volumes. I have it scripted to make a snapshot, mount it, mount a
>> backup volume, rsync it, unmount both volumes, and delete the
>> snapshot, for 19 volumes in a row; the loop is sketched below. In
>> other words, there's a lot of disk I/O going on around the time of
>> the lockup. The lockup always seems to coincide with when the script
>> gets to the volumes that are being used by active, running Windows
>> Server 2008 HVM guests. That may just be coincidental, though,
>> because those are the last ones on the list; the 15 volumes used by
>> active, running paravirtualized Linux guests are at the top of the
>> list.
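>>
>> Wrapped in a loop, the script is roughly the following (again with
>> illustrative names, and fewer volumes than the real 19):
>>
>> #!/bin/sh
>> # the 15 PV guest volumes come first, the Windows 2008 HVM volumes last
>> for vol in pv-guest01 pv-guest02 hvm-win01 hvm-win02; do
>>     lvcreate -s -L 1G -n "${vol}-snap" "/dev/vg0/${vol}"
>>     mount -o ro "/dev/vg0/${vol}-snap" /mnt/snap
>>     mount "/dev/vg1/${vol}-backup" /mnt/backup
>>     rsync -a --delete /mnt/snap/ /mnt/backup/
>>     umount /mnt/snap /mnt/backup
>>     lvremove -f "/dev/vg0/${vol}-snap"
>> done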
>>
>>
>> Another issue that comes up is that if I run the 2.6.32.18 pvops
>> kernel for my Linux domUs, after a while (usually only about an hour
>> or so) the network interfaces stop responding. I don't know if the
>> problem is related, but it was something else that I noticed. The
>> only way to get network access back is to reboot the domU. When I
>> revert the domU kernel to 2.6.31.14, the problem goes away.
>
> That's a separate problem in netfront that appears to be a bug in the
> "smartpoll" code. I think Dongxiao is looking into it.
Yes, I have been trying to reproduce this for the past few days; however, I
could not catch it locally. I tried both netperf and ping for a long time,
but the bug was never triggered. What workload were you running when you
hit the bug?
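
For reference, the kind of load I was generating was roughly the
following (the domU address is only an example):

    netperf -H 192.168.0.101 -l 3600   # hour-long TCP_STREAM run against the domU
    ping -f 192.168.0.101              # flood ping from dom0 to the domU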
Thanks,
Dongxiao
>
>> I'm not 100% sure, but I think this issue also prevents xm console
>> from passing keyboard input through to the console you connect to.
>> If I connect to a console and then issue an xm shutdown on the same
>> domU from another terminal, all of the console messages showing the
>> play-by-play of the shutdown process are displayed, but my keyboard
>> input doesn't seem to make it through.
>
> Hm, I'm not familiar with this problem. Perhaps it's just something
> wrong with your console settings for the domain? Do you have
> "console=" on the kernel command line?
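>
> For a pvops domU, a minimal working setup is usually along these
> lines (an illustrative xm-style config and inittab snippet, not
> taken from your actual configuration):
>
>     # domain config: send kernel output to the PV console
>     extra = "console=hvc0"
>
>     # guest /etc/inittab: run a getty on hvc0 so the console takes input
>     co:2345:respawn:/sbin/getty 38400 hvc0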
>
>> Since I'm not a developer, I don't know if these questions are
>> better suited for the xen-users list, but since it generated an OOPS
>> with the word "BUG" in capital letters, I thought I'd post it here.
>> If that assumption was incorrect, just give me a gentle nudge and
>> I'll redirect the inquiry to somewhere more appropriate. :)
>
> Nope, they're both xen-devel fodder. Thanks for posting.
>
> J
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel