xen-devel

[Xen-devel] Bug: Problematic DomU Duplication on reboot

To: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: [Xen-devel] Bug: Problematic DomU Duplication on reboot
From: Florian Kirstein <xenlist@xxxxxxxxxxxxxx>
Date: Sun, 21 Jan 2007 09:59:25 +0100
Delivery-date: Sun, 21 Jan 2007 01:00:14 -0800
Envelope-to: www-data@xxxxxxxxxxxxxxxxxx
In-reply-to: <20070117233659.A22236@xxxxxxxxxxx>; from xenlist@xxxxxxxxxxxxxx on Wed, Jan 17, 2007 at 11:36:59PM +0100
References: <20070117233659.A22236@xxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mutt/1.2.5.1i
Hi,

OK, I did some more experiments and can now reproduce the duplication
of a domain on its reboot. Seems to be a race condition somewhere,
as I can trigger it by putting high load on xend.

The really bad thing: all instances of the domain are then actively
running on the same block devices, which almost certainly causes massive
data corruption :-( And: it also can happen in normal operation, I had
it at least twice in a "normal" environment without much load on xend,
possibly just a libvirt request at the wrong time during a DomU reboot.

If this is already known: sorry for the long mail then... Is there a fix
for 3.0.4-testing? :)

If not: I more or less see two bugs here:
1) Why is the domain duplicated during the reboot?
2) Why is it possible at all for it to be started twice, using the same
devices? Could a check be added to prevent duplicate read-write use of
the same device, or is there already one which is failing in
this case?
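
To make point 2 concrete, here is a minimal sketch of the kind of check meant there. It is not how xend or the block scripts actually work; it only assumes that each domain's disks are given as "backend,frontend,mode" strings in the style of the domain config files. The function name, the device path and the frontend name are made up for illustration.

```python
from collections import defaultdict


def writable_conflicts(disks_by_domain):
    """Return {backend: [domains]} for every backend device that more
    than one domain has attached in a writable mode."""
    writers = defaultdict(list)
    for domain, specs in disks_by_domain.items():
        for spec in specs:
            # Assumed layout: "backend,frontend,mode",
            # e.g. "phy:/dev/vg0/domu1,sda1,w" (hypothetical device).
            backend, _frontend, mode = spec.split(",")
            if mode.startswith("w"):
                writers[backend].append(domain)
    return {dev: doms for dev, doms in writers.items() if len(doms) > 1}


# Two instances of the same guest, keyed by domain ID, sharing one disk.
running = {
    97: ["phy:/dev/vg0/domu1,sda1,w"],
    98: ["phy:/dev/vg0/domu1,sda1,w"],
}
print(writable_conflicts(running))  # {'phy:/dev/vg0/domu1': [97, 98]}
```

A real check would have to run before the second domain's devices are connected, and it would need to be atomic with respect to concurrent domain creation, which is probably exactly where the race sits.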

Reproduction:
I was able to reproduce this quite reliably using the sample program
dump-info.pl from the perl-Sys-virt libvirt interface. I (as root) just do a
while true; do ./dump-info.pl; done
in the examples dir to stress the system/xend. Building the loop inside
dump-info.pl and removing all "print"s makes it work even a bit "better"
and really messes things up, so try that if the other doesn't work. I
tested it on a P4 3 GHz and a dual-core A64 2.2 GHz; it's easier when
I use nosmp on the Xen kernel on the A64, but it also works in the SMP
case.

While this is running I simply issue:
xm reboot DomU1
and most of the time it results in two or more DomU1s running
afterwards... Sometimes it also causes DomU1 to disappear, with an
entry in the log saying it was rebooting too fast (of course I waited long
enough before the reboot). If it "works" it looks like this:
DomU1                                     97   256     1     -b----     12.5
DomU1                                     98   256     1     -b----     12.9
afterwards. DomU1 is just a normal paravirtualized Linux guest.
Dom0 is CentOS 4, in case it matters.
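
For anyone who wants to watch for this automatically while reproducing it, a small sketch that scans "xm list" output for names that appear more than once. The two DomU1 lines are the ones shown above; the header line and the assumption that name and ID are the first two whitespace-separated columns are mine, and the function name is made up.

```python
from collections import defaultdict


def find_duplicate_domains(xm_list_output):
    """Return {name: [ids]} for every domain name listed more than once."""
    ids_by_name = defaultdict(list)
    for line in xm_list_output.splitlines():
        fields = line.split()
        # Skip the header and anything without a numeric ID column.
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        ids_by_name[fields[0]].append(int(fields[1]))
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# Header line is an assumption; the DomU1 lines are from the output above.
SAMPLE = """\
Name                                      ID Mem(MiB) VCPUs State   Time(s)
DomU1                                     97   256     1     -b----     12.5
DomU1                                     98   256     1     -b----     12.9
"""
print(find_duplicate_domains(SAMPLE))  # {'DomU1': [97, 98]}
```

Feeding it the real output (for example captured from "xm list" right after each reboot) would show at once whether a run produced a duplicate.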

Observations:
During the reboot, sometimes multiple duplicates were created, the load
on Dom0 went up to about 30, and I saw lots of xen-backend hotplug agents:
10613 ?        S<     0:00  \_ /bin/sh /sbin/hotplug xen-backend
10617 ?        S<     0:01  |   \_ /bin/sh /etc/hotplug/xen-backend.agent
15018 ?        S<     0:00  \_ /bin/sh /sbin/hotplug xen-backend
15248 ?        S<     0:01  |   \_ /bin/sh /etc/hotplug/xen-backend.agent
14698 ?        S<     0:00  \_ /bin/sh /sbin/hotplug xen-backend
14702 ?        S<     0:00  |   \_ /bin/sh /etc/hotplug/xen-backend.agent
15091 ?        S<     0:00  \_ /bin/sh /sbin/hotplug xen-backend
(about 60 more lines like this, and I had just one DomU). After everything
settled, the result was:
VM100                                     38   256     1     -b----     13.3
VM100                                     10   256     1     -b----     14.1
Noticeable is the large gap between IDs 10 and 38, meaning 27 domains were
partially created and then died; the domain I rebooted had ID 9.
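
The count of 27 is just the number of unused IDs between the two survivors, assuming domain IDs are handed out sequentially and nothing else was started in between. As a small sketch (function name made up):

```python
def ids_skipped(surviving_ids):
    """Count the domain IDs between the lowest and highest surviving ID
    that do not belong to a surviving domain."""
    low, high = min(surviving_ids), max(surviving_ids)
    return len(set(range(low, high + 1)) - set(surviving_ids))


print(ids_skipped([10, 38]))  # 27
```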

Oh, and one more thing: when using "stress" to put load on the Dom0
system instead of the perl-Sys-virt tool, it usually causes the
DomU to disappear on reboot, but I couldn't reproduce the duplication
that way.

All this was done with the released 3.0.4.1-1; I will try xen-unstable next,
but possibly someone already has an idea what could be wrong here?

(:ul8er, r@y
