[Xen-devel] Tapdisk devices too strongly attached?

To:	"xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>
Subject:	[Xen-devel] Tapdisk devices too strongly attached?
From:	Gerd Jakobovitsch <gerd@xxxxxxxxxxx>
Date:	Wed, 10 Aug 2011 09:24:36 -0300
Cc:	Daniel Stodden <daniel.stodden@xxxxxxxxxx>

Dear all,

I am having problems with tapdisk devices:
- When the virtual machine is shut down, the tapdisk process keeps running and the device is still present under /sys/class/blktap2. It can be removed manually, though, by issuing echo 1 > /sys/class/blktap2/blktap<id>/remove.
- I tried to reproduce the snapshot process implemented in tools/blktap2/drivers/xmsnap, but using a vhd snapshot instead of qcow. The process appeared to work, but changes are still written to the renamed disk, not to the snapshot. It seems that the tapdisk process keeps its association with the file it opened, even after that file has been moved.
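
The second symptom matches ordinary POSIX file semantics: an open file descriptor refers to the file itself, not to its path, so a rename does not redirect later writes. The sketch below only illustrates that general behavior; it does not involve tapdisk or vhd at all, and the file names and the write_after_rename helper are made up for the illustration. Whether this is really what happens inside tapdisk is my assumption.

```python
import os
import tempfile


def write_after_rename(workdir):
    """Open a file, rename it, recreate the old path, then write through
    the original descriptor. Returns (renamed_content, new_content)."""
    disk = os.path.join(workdir, "disk.img")            # hypothetical name
    parent = os.path.join(workdir, "disk-parent.img")   # hypothetical name

    # Stand-in for the long-running process that holds the image open.
    fd = os.open(disk, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        # Stand-in for the snapshot script: move the image away and
        # create a fresh, empty file at the original path.
        os.rename(disk, parent)
        with open(disk, "wb"):
            pass

        # The descriptor still refers to the original file, which is
        # now reachable under the name `parent`.
        os.write(fd, b"guest write")
    finally:
        os.close(fd)

    with open(parent, "rb") as f:
        renamed_content = f.read()
    with open(disk, "rb") as f:
        new_content = f.read()
    return renamed_content, new_content


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(write_after_rename(d))
```

On a POSIX system the data ends up in the renamed file and the new file at the old path stays empty. If tapdisk behaves the same way, it would need to close and reopen the image (or be restarted) before writes reach the new snapshot file.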

I'm running Xen on a CentOS 5 distribution, with Xen and the kernel compiled from Xen's own baselines. I see the same behavior with xen 4.0.2.rc3 / kernel 2.6.32.36+fix and with xen 4.1.2.rc1-pre / kernel 2.6.32.43.

Output from an xl log file:
cat /var/log/xen/xl-teste020.log.2
Waiting for domain teste020 (domid 11) to die [pid 7352]
Domain 11 is dead
Unknown shutdown reason code 255. Destroying domain.
Action for shutdown reason code 255 is destroy
Domain 11 needs to be cleaned up: destroying the domain
libxl: error: libxl.c:734:libxl_domain_destroy xc_domain_pause failed for 11
libxl: error: libxl_dm.c:747:libxl__destroy_device_model Couldn't find device model's pid: No such file or directory
libxl: error: libxl.c:738:libxl_domain_destroy libxl__destroy_device_model failed for 11
libxl: error: libxl_dom.c:603:userdata_path unable to find domain info for domain 11: No such file or directory
libxl: error: libxl.c:755:libxl_domain_destroy xc_domain_destroy failed for 11
Done. Exiting now
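
After a destroy like the one logged above, the leftover devices can be listed and removed through sysfs. The following is a minimal sketch that assumes only the layout already mentioned in this message (/sys/class/blktap2/blktap<id>/remove). The function names are made up, and the code lists every blktap<id> entry it finds; deciding which ones are actually stale is left to the operator.

```python
import glob
import os


def list_blktap_devices(sysfs_root="/sys/class/blktap2"):
    """Return the blktap<id> directories currently present under sysfs_root."""
    return sorted(glob.glob(os.path.join(sysfs_root, "blktap[0-9]*")))


def remove_blktap_device(device_dir):
    """Equivalent of: echo 1 > <device_dir>/remove"""
    with open(os.path.join(device_dir, "remove"), "w") as f:
        f.write("1\n")


if __name__ == "__main__":
    # Only print the devices; removal is deliberately not automatic.
    for dev in list_blktap_devices():
        print(dev)
```

Calling remove_blktap_device() on one of the printed paths performs the same write as the echo command in the first item above and requires root.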

As a hint, some months ago I posted a bug report on xen-devel about tapdisk failures. It was resolved by a spinlock fix that was recently merged into the 2.6.32 pvops kernel baseline. At that point, Daniel Stodden, who identified the needed fix, wrote:

"It's the only pending bugfix, quite an obvious one actually. It's been rare enough unless provoked like Gerd did, but we found it first in XCP so it actually tends to happen."

I'm not sure how I could be provoking different behavior from tapdisk, but it seems that something in my configuration is leading tapdisk to behave unexpectedly.

The whole message exchange:

On Thu, 2011-04-14 at 12:38 -0400, Daniel Stodden wrote:

> On Thu, 2011-04-14 at 09:15 -0400, Konrad Rzeszutek Wilk wrote:

> > On Wed, Apr 13, 2011 at 06:02:13PM -0300, Gerd Jakobovitsch wrote:

> > > I'm trying to run several VMs (linux hvm, with tapdisk:aio disks at
> > > a storage over nfs) on a CentOS system, using the up-to-date version
> > > of xen 4.0 / kernel pvops 2.6.32.x stable. With a configuration
> > > without (most of) debug activated, I can start several instances -
> > > I'm running 7 of them - but shortly afterwards the system stops
> > > responding. I can't find any information on this.

> > 
> > First time I see it.

> > > 
> > > Activating several debug configuration items, among them
> > > DEBUG_PAGEALLOC, I get an exception as soon as I try to start up a
> > > VM. The system reboots.

> > 
> > Oooh, and is the log below from that situation?
> > 
> > Daniel, any thoughts?

> 
> ---
> 	  Unmap pages from the kernel linear mapping after free_pages().
> 	  This results in a large slowdown, but helps to find certain types
> 	  of memory corruption.
> 
> Stunning. Our I/O page allocator is a sort of twisted mempool. Unless
> the allocation is explicitly modified in sysfs/, everything should stay
> pinned. We might be just tripping over debug code alone, but I didn't
> figure it out yet.

Ah, that's just missing Dominic's spinlock fix.

http://xenbits.xen.org/gitweb/?p=people/dstodden/linux.git;a=commit;h=a765257af7e28c41bd776c3e03615539597eb592

Daniel

