xen-devel

[Xen-devel] Probable Xen bug triggered by localhost migration

To: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: [Xen-devel] Probable Xen bug triggered by localhost migration
From: Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>
Date: Fri, 4 Feb 2011 18:08:58 +0000
Once again I have had a test fail during "10 migrations of a PV domain
to localhost", with an apparent Xen or dom0 lockup or other serious
problem.

Failure modes include:
  * dom0 reporting soft lockup BUGs (showing xl stuck in a privcmd
     ioctl, apparently in a hypercall)
  * dom0 disk controller failure due to apparent lost/stuck
     interrupt (dom0 decides disk not working, tries unsuccessfully to
     reset)
  * apparent dom0 lockup or networking failure

Problems occur with both XCP 2.6.27 and pvops 2.6.32 kernels.
Problems seem to happen only with xl, but that is probably because the
underlying bug is a race: xl and xend make their calls in different
orders and with different timing.

Having added some machinery to request Xen debug keys, I now have some
more information:

   
  http://www.chiark.greenend.org.uk/~xensrcts/logs/5639/test-amd64-i386-xl-credit2/info.html

The most relevant files there are these:

  
  http://www.chiark.greenend.org.uk/~xensrcts/logs/5639/test-amd64-i386-xl-credit2/14.ts-guest-localmigrate.log

That shows the failure.  The test harness ssh's to the dom0 to run "xl
migrate" and gets "No route to host", which typically means the dom0 has
stopped responding to ARP requests.  In this particular case the
failure happened after an apparently successful previous migration,
but the more common failure mode is that "xl migrate" prints the 0%
progress message and then nothing else gets through.
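
For anyone wanting to reproduce this outside the test harness, the
loop can be approximated roughly as below.  This is only a sketch, not
the harness's actual code: the root login, the host and guest names
passed in, and the outcome labels are all assumptions made for
illustration.  The only things taken from the description above are
that "xl migrate" is run over ssh with localhost as the target, and
that ssh reports "No route to host" when the dom0 goes away.

```python
import subprocess


def migrate_command(dom0, guest):
    """Build the ssh command that runs one localhost migration on dom0.

    Logging in as root is an assumption for this sketch.
    """
    return ["ssh", f"root@{dom0}", "xl", "migrate", guest, "localhost"]


def classify(returncode, stderr):
    """Map one ssh result onto a coarse outcome label (labels are made up)."""
    if returncode == 0:
        return "ok"
    if "No route to host" in stderr:
        # ssh could not reach the dom0 at all.
        return "host-unreachable"
    return "migrate-failed"


def run_migrations(dom0, guest, count=10, runner=subprocess.run):
    """Run up to `count` migrations, stopping at the first failure.

    Returns (iteration, outcome).  `runner` is injectable so the loop
    can be exercised without a real host.
    """
    for i in range(1, count + 1):
        proc = runner(migrate_command(dom0, guest),
                      capture_output=True, text=True)
        outcome = classify(proc.returncode, proc.stderr)
        if outcome != "ok":
            return i, outcome
    return count, "ok"
```

Note that the more common failure mode (output stopping after the 0%
progress message) would show up here as ssh hanging, so a real run
would also want a timeout on each call.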

  
  http://www.chiark.greenend.org.uk/~xensrcts/logs/5639/test-amd64-i386-xl-credit2/serial-woodlouse.log

Serial log.  Scroll to around "Feb 4 03:30:35" (timestamps, and the
messages about clients connecting and disconnecting, are from the
serial concentrator).

You'll see a series of debug key outputs, which you can correlate with
the test harness's requests, listed with timestamps here:

  
  http://www.chiark.greenend.org.uk/~xensrcts/logs/5639/test-amd64-i386-xl-credit2/15.ts-logs-capture.log
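
If you want to do that correlation mechanically, something like the
following might help.  It is a rough sketch which assumes that
timestamped serial log lines start with a stamp of the form
"Feb 4 03:30:35" (as quoted above), that the year is 2011, and that
unstamped lines belong to the most recent stamp before them.  I have
not checked it against every line format in the actual log.

```python
import re
from datetime import datetime, timedelta

# Leading stamp such as "Feb 4 03:30:35" (format assumed, see above).
STAMP = re.compile(r"^([A-Z][a-z]{2}) +(\d{1,2}) (\d{2}:\d{2}:\d{2})")


def parse_stamp(line, year=2011):
    """Return the datetime at the start of `line`, or None if there is none."""
    m = STAMP.match(line)
    if not m:
        return None
    text = f"{year} {m.group(1)} {m.group(2)} {m.group(3)}"
    return datetime.strptime(text, "%Y %b %d %H:%M:%S")


def lines_near(lines, when, window=timedelta(seconds=30)):
    """Pick the lines whose governing stamp is within `window` of `when`.

    Unstamped lines inherit the stamp of the last stamped line.
    """
    current = None
    picked = []
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is not None:
            current = stamp
        if current is not None and abs(current - when) <= window:
            picked.append(line)
    return picked
```

Feeding it the time of one debug key request from the capture log
should return the block of serial output around that request.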

After the Xen debug keys have been run through, the test harness sends
the "q" guest debug key, which also produces the output you can see in
the serial log.

Then the test harness switches the serial back to dom0 and sends RET
and we can see dom0 produce a new login prompt.  So dom0 is not
entirely dead.

However, later entries in the "ts-logs-capture" log show that dom0 still
isn't responding on the network, and eventually the test harness
decides to power cycle the host and collect what remains from the dom0
filesystem.  So that's why you see a pile of boot messages at the end
of the test log - these should be disregarded.

Ian.
