To: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: Re: [Xen-devel] [SPAM] Re: kernel BUG at arch/x86/xen/mmu.c:1860!
From: Andreas Olsowski <andreas.olsowski@xxxxxxxxxxx>
Date: Thu, 10 Mar 2011 14:45:02 +0100
In-reply-to: <4D77DC0A.9090705@xxxxxxxxxxx>
References: <20110303221639.GB12175@xxxxxxxxxxxx> <AANLkTi=r+ErO+PkPWF=3L8+v9+TPbgVz-6qaycTgMo4c@xxxxxxxxxxxxxx> <AANLkTimTuXqoLe9VinpAdhwJPO3Z8HGU2+KOoHOBaUvq@xxxxxxxxxxxxxx> <AANLkTimin0OZZUmvUr2KcXugc_2GGuEhRHLdug1ufha6@xxxxxxxxxxxxxx> <20110308192950.GA4562@xxxxxxxxxxxx> <20110308201002.GA5721@xxxxxxxxxxxx> <AANLkTikdA0vnxYzU7MFZNk3m6SH6=ns-WGVry2zCfws+@xxxxxxxxxxxxxx> <1299617407852-3414620.post@xxxxxxxxxxxxx> <20110309004318.GB10007@xxxxxxxxxxxx> <4D77251F.8070709@xxxxxxxxxxx> <20110309150023.GB6247@xxxxxxxxxxxx> <4D77DC0A.9090705@xxxxxxxxxxx>
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.16) Gecko/20101226 Icedove/3.0.11

All Xen 4.1.0 tests were done on server1 (netcatarina).
All but one of the Xen 4.0.1 tests were done on server2 (memoryana).
Why I had to rerun one of the server2 tests on server1 is explained below.

Here are my test results:

======================================================
Kernel 2.6.32.28 without XEN:
about 50 successful runs of Teck Choon Giam's "test.sh" script
(modified to handle 10 test volumes and sleep 2 seconds; a rough sketch follows below)
multipathd restarted successfully
multipath module loaded/unloaded successfully
lvm2 restarted successfully
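
For reference, this is roughly what the modified test loop and the restart checks look like. I don't have Teck Choon Giam's original script in front of me here, so the volume group name, the volume size and the exact init scripts below are my own placeholders, not the literal script:

    #!/bin/bash
    # Rough reconstruction of the test loop: create and remove 10 small
    # test LVs, sleeping 2 seconds between operations. "vg_san" and the
    # 100M size are placeholders for whatever the real script uses.
    VG=vg_san
    for loop in $(seq 1 100); do
        for vol in $(seq 1 10); do
            lvcreate -L 100M -n testlv$vol $VG || exit 1
            sleep 2
        done
        for vol in $(seq 1 10); do
            lvremove -f /dev/$VG/testlv$vol || exit 1
            sleep 2
        done
        echo "loop $loop done"
    done

The restart checks after each kernel/hypervisor combination were along the lines of:

    /etc/init.d/multipath-tools restart     # or "service multipathd restart"
    rmmod dm_multipath && modprobe dm_multipath
    /etc/init.d/lvm2 restart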

======================================================
Kernel 2.6.38 without XEN:
about 20 successful runs of "test.sh"
multipathd restarted successfully
multipath module loaded/unloaded successfully
lvm2 restarted successfully

======================================================
Kernel 2.6.32.28 with XEN 4.0.1:
At about loop 2 for volume 7 of "test.sh" it stopped doing ... well, anything.
There was no output on the screen and no syslog or dmesg entry.
I left it hanging for about 15 minutes until I decided to write this one off as a side effect of the same underlying problem.
All lvm2 tools stopped working and I couldn't shut it down.
Killing the hanging process ended it properly.

I did a cold reset of the server, as I wanted to see the discussed BUG again, but I failed here.
It would seem that my server2 has some kind of addressing error:
pci 0000:04:00.1: BAR 6: address space collision of device ....

0000:04:00.1 is one of my QLogic HBAs.
And since I use centralized FC storage ... who knows what side effects happened here.
Interestingly enough, I had no problems with kernel 2.6.38 on this machine.
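
For what it's worth, the BAR assignment for that HBA can be inspected directly; the bus address below is just the one from that message, the rest is standard lspci/sysfs usage:

    # memory window / expansion ROM assignments for the HBA (BAR 6 is the ROM)
    lspci -vv -s 0000:04:00.1 | grep -Ei 'region|expansion rom'

    # the same information from sysfs
    cat /sys/bus/pci/devices/0000:04:00.1/resource

    # and the collision messages from the kernel log
    dmesg | grep -i 'address space collision'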

So I downgraded server1, which had never shown this message, to Xen 4.0.1 and ran the test:
After 2 loops at volume 5 I hit "kernel BUG at arch/x86/xen/mmu.c" again.



======================================================
Kernel 2.6.38 with XEN 4.0.1:
100 runs of test.sh without error
multipathd restarted successfully
multipath module loaded/unloaded successfully
lvm2 stop/start ok

======================================================
Kernel 2.6.32.28 with XEN 4.1.0-rc7:
It booted at first:
crash after only 5 iterations of "test.sh"
http://pastebin.com/uNL7ehZ8

Later, after having booted 2.6.38 on this server to test it with Xen 4.1, I encountered a different error at boot time:
BUG: unable to handle kernel paging request at ffff8800cc3e5f48
I only have pictures of it:
http://141.39.208.101/err1.png
http://141.39.208.101/err2.png
I then did a cold boot of the server, as this had proven to make it boot in the past. When that did not help, I stopped the test.sh running on my other server, because the hang came when lvm2 was started and the servers use shared storage.
Apparently this helped; the server booted fine after another cold reset.

After that I encountered an error again at loop 10 of "test.sh", but not the "kernel BUG at arch/x86/xen/mmu.c"; instead, again:
"BUG: unable to handle kernel paging request at ffff8800cc61ce010"
http://141.39.208.101/err3.png
http://141.39.208.101/err4.png


======================================================
Kernel 2.6.38 with XEN 4.1.0-rc7:
100 runs of test.sh without error
multipathd restarted successfully
multipath module loaded/unloaded successfully
lvm2 stop/start ok


======================================================
Summary
======================================================

So that's two different errors I have encountered:
one is the "kernel BUG at arch/x86/xen/mmu.c", the other is
"BUG: unable to handle kernel paging request".

Both only occur with 2.6.32 when running under either Xen 4.0.1 or 4.1.
On its own the kernel works fine.

Kernel 2.6.38 ran fine on both hypervisors as well as on its own.

One other issue occurred that I didn't expect:
With the same .config (make oldconfig), 2.6.38 left my screen black after loading the kernel, on both hypervisors. The servers worked just fine; I just didn't see any output on their VGA ports.
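
I have not dug into this yet; if it matters, my first step would be comparing the console/framebuffer options that "make oldconfig" may have picked differently. The config paths below are placeholders for wherever the two configs actually live:

    # compare VGA/framebuffer console options between the two kernel configs
    for opt in CONFIG_VGA_CONSOLE CONFIG_FRAMEBUFFER_CONSOLE CONFIG_FB_VESA CONFIG_FB_EFI; do
        grep -H "$opt[ =]" /boot/config-2.6.32.28 /boot/config-2.6.38
    done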


I hope this information helps you hunt this bug down, as it effectively makes the "default" Xen unusable in server situations where the device mapper is involved.

It is puzzling to me why no one noticed this last year. Am I the only one running Xen on server hardware (Dell R610, 710 and 2950) with centralized storage (FibreChannel or iSCSI) and using it as a production environment?

Is multipathing two links to centralized storage and using LVM2 to split it up for virtual machines running on two or more servers really such a rare setup to find Xen running on?
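
For clarity, the storage stack I mean is the usual multipath-on-top-of-FC layering; the device and volume group names below are made-up examples, not my real ones:

    # two FC paths to the same LUN, aggregated by dm-multipath
    multipath -ll                       # shows e.g. mpatha with two active paths

    # LVM2 on top of the multipath device, carved up for the guests
    pvcreate /dev/mapper/mpatha
    vgcreate vg_guests /dev/mapper/mpatha
    lvcreate -L 20G -n vm01-disk vg_guests

    # each LV is then handed to a domU, e.g.:
    #   disk = [ 'phy:/dev/vg_guests/vm01-disk,xvda,w' ]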

Btw, who is currently working on the Remus implementation?



If you should need any more testing from me, feel free to ask.

Best regards.


--
Andreas Olsowski

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel