   
 
 
 
   
 

xen-users

[Xen-users] Xen 4.0.0 - tapdisk2 "hangs"

To: <xen-users@xxxxxxxxxxxxxxxxxxx>
Subject: [Xen-users] Xen 4.0.0 - tapdisk2 "hangs"
From: "Heiko Wundram" <modelnine@xxxxxxxxxxxxx>
Date: Tue, 4 May 2010 15:09:31 +0200
Hey all!

I'm currently in the process of migrating a (Gentoo-based) Xen-server to use
Xen 4.0.0 (where I'm using the Xen ebuilds from bugs.gentoo.org), and I'm
having severe problems with tapdisk2 (which I wish to use to do I/O
prioritizing using CFQ on the LVM-based backing storage of a virtual
server).

It seems that after a while of heavy I/O in the virtual domain, the
communication between the (paravirtualized) DomU and Dom0 (the
tapdisk2 process) breaks: no more interrupts are delivered to Dom0 for
I/O requests from the virtual domain, and as such the virtual host
"loses" its hard disk (but does not "break" besides not responding). The
network front-/backend is not affected by this communication loss, AFAICT.
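
One way to check whether interrupt delivery has really stopped is to compare
two snapshots of /proc/interrupts in Dom0 and see which counters no longer
advance. The following is only a rough sketch: the function names are made up
for illustration, and which line (if any) belongs to the blktap2 event channel
on a given system is an assumption you'd have to verify by hand.

```python
import time


def parse_interrupts(text):
    """Return {irq_label: total count across all CPUs} for /proc/interrupts text."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        counts = []
        for field in rest.split()[:ncpus]:
            if not field.isdigit():
                break  # reached the controller/device description
            counts.append(int(field))
        totals[label.strip()] = sum(counts)
    return totals


def stalled_irqs(before, after):
    """Labels whose counters did not advance between the two snapshots."""
    return sorted(k for k in before if k in after and after[k] == before[k])


def watch(path="/proc/interrupts", interval=5.0):
    """Take two snapshots `interval` seconds apart and report stalled lines."""
    with open(path) as fh:
        first = parse_interrupts(fh.read())
    time.sleep(interval)
    with open(path) as fh:
        second = parse_interrupts(fh.read())
    return stalled_irqs(first, second)
```

Calling watch() while the DomU is supposedly doing disk I/O lists the lines
that stayed flat; idle devices will show up there too, so the result only
narrows things down.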

The virtual host can be destroyed by an xm destroy, but the created blktap2
interface does not disappear until the next reboot, and cannot be removed by
the respective sysfs accesses (rather, echoing a 1 into "remove" blocks as
well, and the blocked process is "unkillable", i.e. it stays in kernel
space). After a blktap2 device has entered this broken state, no more hosts
can be created by xm create (that blocks, too), and the host system must be
rebooted to enter a usable state again.
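
Processes that are stuck in kernel space like this usually show up in
uninterruptible sleep (state "D") in /proc/<pid>/stat. As a hedged sketch
(the helper names are hypothetical, and that the blocked echo and xm create
actually sit in state D is an assumption, not something confirmed here), they
could be listed like so:

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    head, _, tail = stat_line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    return int(pid_str), comm, tail.split()[0]


def uninterruptible(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except OSError:
            continue  # process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return sorted(found)
```

Running uninterruptible() in Dom0 after the freeze would show whether the
hanging commands are among the D-state processes.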

I've not been able to provoke this breakage by "normal" I/O (i.e., when the
hosts run normally), but I have been able to provoke it by using bonnie,
which after a short period of sustained read/write I/O at 120+ MB/s will
freeze the blktap2 device.

The Dom0 and the DomU kernels that are being used are xen-sources-2.6.32-r1
(which are the xen-stable 2.6.32.10 [11?] based OpenSuSE Xen-kernel sources,
AFAIK) from the official portage tree; the kernel configuration that's in
use is attached.

I've tried iommu=off for xen (the mobo doesn't support VT-d anyway, so Xen
never turns it on), and I've also looked for any signs of errors appearing
when setting verbosity 9 for the blktap2 module and loglvl=all and
guest_loglvl=all for Xen, but there are no errors that I've seen so far.

Strace-ing the tapdisk2 process reveals that it's blocked on select(), and
none of the descriptors it's polling on ever return as readable (which is
the condition that tapdisk2 queries); rather, they always time out after
600s.
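
To illustrate what that strace output means, here is a minimal sketch of a
select() call with a timeout on a descriptor that nobody writes to. It is not
tapdisk2's code; the pipe merely stands in for the descriptors tapdisk2 waits
on, and wait_readable is a name made up for this example.

```python
import os
import select


def wait_readable(fds, timeout):
    """Block until one of fds is readable or timeout (seconds) expires."""
    readable, _, _ = select.select(fds, [], [], timeout)
    return readable  # empty list means the call timed out


def demo():
    r, w = os.pipe()
    try:
        # Nothing has been written yet: select() returns empty after the timeout,
        # which is what tapdisk2 sees every 600s in the hung state.
        idle = wait_readable([r], 0.1)
        # Once the other side signals, the descriptor is reported as readable.
        os.write(w, b"x")
        signalled = wait_readable([r], 0.1)
        return idle, signalled == [r]
    finally:
        os.close(r)
        os.close(w)
```

In the hung state only the first case ever happens, which suggests that
whatever should signal the descriptor (the notification from the DomU side)
never arrives.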

Thanks in advance for any hint as to what is causing this, or if there's
anything I might try to get things working...

PS: I have to boot with acpi=off, as the mobo won't reboot when ACPI is
turned on for Dom0 (not even when disabling ACPI reboots), but booting with
ACPI enabled doesn't change the fact that blktap2 blocks.

--- Heiko.


Attachment: config
Description: Binary data

Attachment: dmesg.dump
Description: Binary data

Attachment: interrupts
Description: Binary data

Attachment: lspci.dump
Description: Binary data

Attachment: xmdmesg.dump
Description: Binary data
