This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/



To: Scott Garron <xen-devel@xxxxxxxxxxxxxxxxxx>
Subject: Re: [Xen-devel] Making snapshot of logical volumes handling HVM domU causes OOPS and instability
From: Jeremy Fitzhardinge <jeremy@xxxxxxxx>
Date: Mon, 30 Aug 2010 09:52:22 -0700
Cc: "Xu, Dongxiao" <dongxiao.xu@xxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxx, Daniel Stodden <daniel.stodden@xxxxxxxxxx>
Delivery-date: Mon, 30 Aug 2010 09:53:05 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <4C7864BB.1010808@xxxxxxxxxxxxxxxxxx>
References: <4C7864BB.1010808@xxxxxxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv: Gecko/20100806 Fedora/3.1.2-1.fc13 Lightning/1.0b2pre Thunderbird/3.1.2
 On 08/27/2010 06:22 PM, Scott Garron wrote:
> I use LVM volumes for domU disks.  To create backups, I create a
> snapshot of the volume, mount the snapshot in the dom0, mount an
> equally-sized backup volume from another physical storage source, run an
> rsync from one to the other, unmount both, then remove the snapshot.
> This includes creating a snapshot and mounting NTFS volumes from
> Windows-based HVM guests.
> This practice may not be perfect, but has worked fine for me for a
> couple of years - while I was running Xen 3.2.1 and linux-
> dom0 (and the same kernel for domU).  After upgrades of udev started
> complaining about the kernel being too old, I thought it was well past
> time to try to transition to a newer version of Xen and a newer dom0
> kernel.  This transition has been a gigantic learning experience, let me
> tell you.
> After that transition, here's the problem I've been wrestling with and
> can't seem to find a solution for:  It seems like any time I start
> manipulating a volume group to add or remove a snapshot of a logical
> volume that's used as a disk for a running HVM guest, new calls to LVM2
> and/or Xen's storage lock up and spin forever.  The first time I ran
> across the problem, there was no indication of a problem other than
> any command I ran that handled anything to do with LVM would freeze and
> be completely unable to be signaled to do anything.  In other words, no
> error messages, nothing in dmesg, nothing in syslog...  The commands
> would just freeze and not return.  That was with the kernel
> that is what's currently retrieved if you checkout xen-4.0-testing.hg
> and just do a make dist.
> I have since checked out and compiled that comes from doing
> git checkout -b xen/stable-2.6.32.x origin/xen/stable-2.6.32.x, as
> described on the Wiki page here:
> http://wiki.xensource.com/xenwiki/XenParavirtOps
> If I run that kernel for dom0, but continue to use for the
> paravirtualized domUs, everything works fine until I try to manipulate
> the snapshots of the HVM volumes.  Today, I got this kernel OOPS:

That's definitely bad.  Something is causing udevd to end up with bad
pagetables, which are causing a kernel crash on exit.  I'm not sure if
it's *the* udevd or some transient child, but either way it's bad.

Any thoughts on this, Daniel?
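
As a reading aid for the trace quoted below, here is a minimal Python sketch
that decodes the page-fault error code on the "Oops: 0003" line and checks the
writable bit of the PTE value printed next to it.  The bit meanings are the
standard x86 ones; the function names are made up for this illustration and
are not part of any kernel or Xen tool.

```python
# Standard x86 page-fault error-code bits: (mask, meaning if set, meaning if clear).
PF_BITS = [
    (0x01, "protection violation (page was present)", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in a page-table entry", None),
    (0x10, "instruction fetch", None),
]

PTE_PRESENT = 0x1  # bit 0 of an x86 page-table entry
PTE_RW = 0x2       # bit 1: set when the mapping is writable


def decode_oops_code(code):
    """Return a human-readable description for each relevant error-code bit."""
    described = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            described.append(if_set)
        elif if_clear is not None:
            described.append(if_clear)
    return described


def pte_is_present(pte):
    return bool(pte & PTE_PRESENT)


def pte_is_writable(pte):
    return bool(pte & PTE_RW)


# Values copied from the quoted trace.
print(decode_oops_code(0x0003))
print(pte_is_present(0x80100000267C9065), pte_is_writable(0x80100000267C9065))
```

Read this way, the fault looks like a kernel-mode write to a page that is
mapped present but read-only.  That would be consistent with the bad-pagetable
suspicion above, since a PV kernel under Xen keeps pinned page-table pages
mapped read-only, but the sketch only decodes the numbers and does not
establish the cause.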

> ---------------------------
> [78084.004530] BUG: unable to handle kernel paging request at
> ffff8800267c9010
> [78084.004710] IP: [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
> [78084.004886] PGD 1002067 PUD 1006067 PMD 217067 PTE 80100000267c9065
> [78084.005065] Oops: 0003 [#1] SMP
> [78084.005234] last sysfs file:
> /sys/devices/virtual/block/dm-32/removable
> [78084.005256] CPU 1
> [78084.005256] Modules linked in: tun xt_multiport fuse dm_snapshot
> nf_nat_tftp nf_conntrack_tftp nf_nat_pptp nf_conntrack_pptp
> nf_conntrack_proto_gre nf_nat_proto_gre ntfs parport_pc parport k8temp
> floppy forcedeth [last unloaded: scsi_wait_scan]
> [78084.005256] Pid: 22814, comm: udevd Tainted: G        W #1
> [78084.005256] RIP: e030:[<ffffffff810382ff>]  [<ffffffff810382ff>]
> xen_set_pmd+0x24/0x44
> [78084.005256] RSP: e02b:ffff88002e2e1d18  EFLAGS: 00010246
> [78084.005256] RAX: 0000000000000000 RBX: ffff8800267c9010 RCX:
> ffff880000000000
> [78084.005256] RDX: dead000000100100 RSI: 0000000000000000 RDI:
> 0000000000000004
> [78084.005256] RBP: ffff88002e2e1d28 R08: 0000000001993000 R09:
> dead000000100100
> [78084.005256] R10: 800000016e90e165 R11: 0000000000000000 R12:
> 0000000000000000
> [78084.005256] R13: ffff880002d8f580 R14: 0000000000400000 R15:
> ffff880029248000
> [78084.005256] FS:  00007fa07d87f7a0(0000) GS:ffff880002d81000(0000)
> knlGS:0000000000000000
> [78084.005256] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [78084.005256] CR2: ffff8800267c9010 CR3: 0000000001001000 CR4:
> 0000000000000660
> [78084.005256] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [78084.005256] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [78084.005256] Process udevd (pid: 22814, threadinfo ffff88002e2e0000,
> task ffff880019491e80)
> [78084.005256] Stack:
> [78084.005256]  0000000000600000 000000000061e000 ffff88002e2e1de8
> ffffffff810fb8a5
> [78084.005256] <0> 00007fff13ffffff 0000000100000206 ffff880003158003
> 0000000000000000
> [78084.005256] <0> 0000000000000000 000000000061dfff 000000000061dfff
> 000000000061dfff
> [78084.005256] Call Trace:
> [78084.005256]  [<ffffffff810fb8a5>] free_pgd_range+0x27c/0x45e
> [78084.005256]  [<ffffffff810fbb2b>] free_pgtables+0xa4/0xc7
> [78084.005256]  [<ffffffff810ff1fd>] exit_mmap+0x107/0x13f
> [78084.005256]  [<ffffffff8107714b>] mmput+0x39/0xda
> [78084.005256]  [<ffffffff8107adff>] exit_mm+0xfb/0x106
> [78084.005256]  [<ffffffff8107c86d>] do_exit+0x1e8/0x6ff
> [78084.005256]  [<ffffffff815c228b>] ? do_page_fault+0x2cd/0x2fd
> [78084.005256]  [<ffffffff8107ce0d>] do_group_exit+0x89/0xb3
> [78084.005256]  [<ffffffff8107ce49>] sys_exit_group+0x12/0x16
> [78084.005256]  [<ffffffff8103cc82>] system_call_fastpath+0x16/0x1b
> [78084.005256] Code: 48 83 c4 28 5b c9 c3 55 48 89 e5 41 54 49 89 f4 53
> 48 89 fb e8 fc ee ff ff 48 89 df ff 05 52 8f 9e 00 e8 78 e4 ff ff 84 c0
> 75 05 <4c> 89 23 eb 16 e8 e0 ee ff ff 4c 89 e6 48 89 df ff 05 37 8f 9e
> [78084.005256] RIP  [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
> [78084.005256]  RSP <ffff88002e2e1d18>
> [78084.005256] CR2: ffff8800267c9010
> [78084.005256] ---[ end trace 4eaa2a86a8e2da24 ]---
> [78084.005256] Fixing recursive fault but reboot is needed!
> ---------------------------
> After that was printed on the console, use of anything that interacts
> with Xen (xentop, xm) would freeze whatever command it was and not
> return.  After trying to do a sane shutdown on the guests, the whole
> dom0 locked completely.  Even the alt-sysrq things stopped working after
> looking at a couple of them.
> I feel it's probably necessary to mention that this is after several,
> fairly rapid-fire creations and deletions of snapshot volumes.  I have
> it scripted to make a snapshot, mount it, mount a backup volume, rsync
> it, unmount both volumes, and delete the snapshot for 19 volumes in a
> row.  In other words, there's a lot of disk I/O going on around the time
> of the lockup.  It always seems to coincide with when it gets to the
> volumes that are being used for active, running, Windows Server 2008,
> HVM volumes.  That may be just coincidental, though, because those are
> the last ones on the list.  15 volumes used in active, running
> paravirtualized Linux guests are at the top of the list.
> Another issue that comes up is that if I run the pvops kernel
> for my Linux domUs, after a time (usually only about an hour or so), the
> network interfaces stop responding.  I don't know if the problem is
> related, but it was something else that I noticed.  The only way to get
> the network access to come back is to reboot the domU.  When I reverted
> the domU kernel to, this problem goes away.

That's a separate problem in netfront that appears to be a bug in the
"smartpoll" code.  I think Dongxiao is looking into it.

> I'm not 100%
> sure, but I think this issue also causes xm console to not allow you to
> type on the console that you connect to.  If I connect to a console,
> then issue an xm shutdown on the same domU from another terminal, all of
> the console messages that show the play-by-play of the shutdown process
> display, but my keyboard input doesn't seem to make it through.

Hm, not familiar with this problem.  Perhaps it's just something wrong
with your console settings for the domain?  Do you have "console=" on
the kernel command line?

> Since I'm not a developer, I don't know if these questions are better
> suited for the xen-users list, but since it generated an OOPS with the
> word "BUG" in capital letters, I thought I'd post it here.  If that
> assumption was incorrect, just give me a gentle nudge and I'll redirect
> the inquiry to somewhere more appropriate.  :)

Nope, they're both xen-devel fodder.  Thanks for posting.
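
For anyone trying to reproduce this, the per-volume backup cycle Scott
describes (snapshot, mount, rsync, unmount, remove) can be sketched roughly as
below.  This is only an illustration that builds the command lines without
running them.  The volume group, volume names, mount points, snapshot size,
read-only mount and rsync options are all assumptions, since the original
script was not posted, and it assumes each volume holds a filesystem directly
rather than a partition table.

```python
import shlex


def snapshot_backup_commands(vg, lv, backup_dev, snap_size="1G",
                             snap_mnt="/mnt/snap", backup_mnt="/mnt/backup"):
    """Build (but do not run) the commands for one snapshot-and-rsync cycle.

    All names and sizes are hypothetical placeholders for illustration.
    """
    snap = f"{lv}-snap"
    return [
        # Create a copy-on-write snapshot of the guest's logical volume.
        ["lvcreate", "--snapshot", "--size", snap_size,
         "--name", snap, f"/dev/{vg}/{lv}"],
        # Mount the snapshot (read-only here, as an assumption) and the target.
        ["mount", "-o", "ro", f"/dev/{vg}/{snap}", snap_mnt],
        ["mount", backup_dev, backup_mnt],
        # Copy the snapshot contents onto the backup volume.
        ["rsync", "-a", "--delete", f"{snap_mnt}/", f"{backup_mnt}/"],
        # Tear everything down again.
        ["umount", snap_mnt],
        ["umount", backup_mnt],
        ["lvremove", "-f", f"/dev/{vg}/{snap}"],
    ]


# Hypothetical volume names; print the cycle for review instead of executing it.
for cmd in snapshot_backup_commands("vg0", "winsrv", "/dev/backupvg/winsrv"):
    print(shlex.join(cmd))
```

Running this cycle back to back over many volumes, ending with the ones
backing live HVM guests, matches the load pattern described in the report.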

