On Tue, May 4, 2010 at 2:09 PM, Heiko Wundram <modelnine@xxxxxxxxxxxxx> wrote:
> Hey all!
>
> I'm currently in the process of migrating a (Gentoo-based) Xen-server to use
> Xen 4.0.0 (where I'm using the Xen ebuilds from bugs.gentoo.org), and I'm
> having severe problems with tapdisk2 (which I wish to use to do I/O
> prioritizing using CFQ on the LVM-based backing storage of a virtual
> server).
>
> It seems that after a while of heavy I/O in the virtual domain, the
> communication between the (paravirtualized) DomU and Dom0 (the
> tapdisk2-process) breaks, in that no more interrupts are delivered to Dom0
> for I/O requests from the virtual domain, and as such the virtual host
> "loses" its harddisk (but does not "break" besides not responding). The
> network front-/backend is not affected by this communication loss, AFAICT.
>
> The virtual host can be destroyed by an xm destroy, but the created blktap2
> interface does not disappear until the next reboot, and cannot be removed by
> the respective sysfs accesses (rather, echoing a 1 into "remove" blocks,
> too, and is "unkillable", i.e. stays in kernel space). After a blktap2
> device has entered this broken state, no more hosts can be created by xm
> create (that blocks, too), and the host system must be rebooted to enter a
> usable state again.
>
> I've not been able to provoke this breakage by "normal" I/O (i.e., when the
> hosts run normally), but I have been able to provoke it by using bonnie,
> which after a short period of sustained read/write I/O of +120MB/s will
> freeze the blktap2 device.
>
> The Dom0 and the DomU kernels that are being used are xen-sources-2.6.32-r1
> (which are the xen-stable 2.6.32.10 [11?] based OpenSuSE Xen-kernel sources,
> AFAIK) from the official portage tree; the kernel configuration that's in
> use is attached.
>
> I've tried iommu=off for xen (the mobo doesn't support VT-d anyway, so Xen
> never turns it on), and I've also looked for any signs of errors appearing
> when setting verbosity 9 for the blktap2 module and loglvl=all and
> guest_loglvl=all for Xen, but there are no errors that I've seen so far.
>
> Strace-ing the tapdisk2 process reveals that it's blocked on select(), and
> none of the descriptors it's polling on ever return as readable (which is
> the condition that tapdisk2 queries); rather, they always time out after 600s.
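
For reference, the symptom described above matches a select() wait whose
descriptors never become readable. Below is a minimal Python sketch of that
pattern, not tapdisk2's actual code: an idle pipe stands in for the blktap2
descriptors, the timeout is 0.1s rather than 600s, and wait_readable is a
hypothetical helper name chosen for illustration.

```python
import os
import select


def wait_readable(fds, timeout_s):
    """Return the descriptors that became readable, or [] on timeout."""
    readable, _, _ = select.select(fds, [], [], timeout_s)
    return readable


# Assumption: a plain pipe stands in for the descriptors tapdisk2 polls.
r, w = os.pipe()

# Nothing has been written, so select() returns only after the timeout.
idle = wait_readable([r], 0.1)

# Simulate a notification arriving; the read end is now reported readable.
os.write(w, b"x")
ready = wait_readable([r], 0.1)

os.close(r)
os.close(w)
```

In the broken state, every call would behave like the first one: the wait
always ends in a timeout because no notification ever reaches the
descriptors.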
>
> Thanks in advance for any hint as to what is causing this, or if there's
> anything I might try to get things working...
>
> PS: I have to boot with acpi=off, as the mobo won't reboot when acpi is
> turned on for Dom0 (not even when disabling ACPI reboots), but using acpi
> directly doesn't change that blktap2 blocks.
>
> --- Heiko.
>
>
>
> _______________________________________________
> Xen-users mailing list
> Xen-users@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-users
>
I have had exactly the same problem and ended up going back to tapdisk1.
I was able to replicate the problem using the entire SLE11-SP1 kernel
source patch set, which proves that the bug exists upstream.
Unfortunately, I am very busy with other projects at the moment, so I
have not had time to debug it at all.
The SLE11-SP1 tree has been updated since xen-sources-2.6.32-r1. I
will make an updated set of patches for you to try, but it will take me
a couple of days.
Andy