All Xen 4.1.0 tests were done on server1 (netcatarina).
All but one test with Xen 4.0.1 were done on server2 (memoryana).
Why I had to rerun one of the tests for server2 on server1 is explained
below.
Here are my test results:
======================================================
Kernel 2.6.32.28 without XEN:
about 50 successful runs of Teck Choon Giam's "test.sh" script
(modified to handle 10 test volumes and sleep 2 seconds; a sketch follows below)
multipathd restarted successfully
multipath module loaded/unloaded successfully
lvm2 restarted successfully
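I am not reproducing Teck Choon Giam's original test.sh here; as a rough,
hedged sketch, my modified version boils down to a loop like the following
(the volume group name, LV names and sizes are placeholders, not the actual
values from the script):

  #!/bin/bash
  # Rough sketch of the modified test loop (placeholder names/sizes):
  # repeatedly create and remove LVs on 10 test volumes, sleeping
  # 2 seconds between operations, and stop on the first failure.
  VG=vgtest
  for loop in $(seq 1 100); do
      for vol in $(seq 1 10); do          # modified: 10 test volumes
          echo "loop $loop, volume $vol"
          lvcreate -L 1G -n test$vol $VG || exit 1
          sleep 2                         # modified: sleep 2 seconds
          lvremove -f /dev/$VG/test$vol || exit 1
      done
  done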
======================================================
Kernel 2.6.38 without XEN:
about 20 successful runs of "test.sh"
multipathd restarted successfully
multipath module loaded/unloaded successfully
lvm2 restarted successfully (the commands I used are sketched below)
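For reference, the restart/reload checks above amount to roughly the
following (the init script names are Debian-style and an assumption on my
part, they may differ per distribution; the multipath maps have to be
flushed before the module can be unloaded):

  /etc/init.d/multipath-tools restart   # restart multipathd
  /etc/init.d/multipath-tools stop      # stop it again so the module can be unloaded
  multipath -F                          # flush remaining multipath maps
  modprobe -r dm_multipath              # unload the multipath module
  modprobe dm_multipath                 # ... and load it again
  /etc/init.d/multipath-tools start
  /etc/init.d/lvm2 restart              # restart lvm2 (re-activates the VGs)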
======================================================
Kernel 2.6.32.28 with XEN 4.0.1:
At about loop 2 for volume 7 of "test.sh" it stopped doing ... well, anything.
There was no output on the screen and no syslog or dmesg entry.
I left it hanging for about 15 minutes until I decided to write this one
off as a side effect of the same underlying problem.
All lvm2 tools stopped working and I could not shut it down.
Killing the hanging process ended it properly.
I did a cold reset of the server, as I wanted to see the discussed BUG
again, but I failed here.
It would seem that my server2 has some kind of addressing error:
pci 0000:04:00.1: BAR 6: address space collision of device ....
0000:04:00.1 is one of my QLogic HBAs.
And since I use centralized FC storage ... who knows what side effects
happened here.
Interestingly enough, I had no problems with kernel 2.6.38 on this machine.
So I downgraded server1, which never showed this message, to Xen 4.0.1
and ran the test:
after 2 loops at volume 5 I hit "kernel BUG at arch/x86/xen/mmu.c" again.
======================================================
Kernel 2.6.38 with XEN 4.0.1:
100 runs of test.sh without error
multipathd restarted successfully
multipath module loaded/unloaded successfully
lvm2 stop/start ok
======================================================
Kernel 2.6.32.28 with XEN 4.1.0-rc7:
It booted at first:
crash after only 5 iterations of "test.sh"
http://pastebin.com/uNL7ehZ8
Later, after having booted 2.6.38 on this server to test it with Xen
4.1, I encountered a different error at boot time:
BUG: unable to handle kernel paging request at ffff8800cc3e5f48
I only have pictures of it:
http://141.39.208.101/err1.png
http://141.39.208.101/err2.png
I then did a cold boot of the server, as this had proven to make it boot
in the past.
When this did not help, I stopped the test.sh running on my other
server, because the hang occurred when lvm2 was started and the servers
use shared storage.
Apparently this helped; the server booted fine after another cold reset.
After that I encountered an error again at loop 10 of "test.sh", but not
the "kernel BUG at arch/x86/xen/mmu.c" this time; again it was
"BUG: unable to handle kernel paging request at ffff8800cc61ce010"
http://141.39.208.101/err3.png
http://141.39.208.101/err4.png
======================================================
Kernel 2.6.38 with XEN 4.1.0-rc7:
100 runs of test.sh without error
multipathd restarted successfully
multipath module loaded/unloaded successfully
lvm2 stop/start ok
======================================================
Summary
======================================================
So that is two different errors I have encountered:
one is the "kernel BUG at arch/x86/xen/mmu.c", the other is
"BUG: unable to handle kernel paging request".
Both only occur with 2.6.32 when running under either Xen 4.0.1 or 4.1.0.
On its own the kernel works fine.
Kernel 2.6.38 ran fine on both hypervisors as well as on its own.
One other issue occurred that I did not expect:
with the same .config (make oldconfig), 2.6.38 left my screen black
after loading the kernel, on both hypervisors.
The servers worked just fine; I just did not see any output on their VGA
ports.
I hope this information helps you hunt this bug down, as it
effectively makes the "default" Xen unusable in server setups where
the device mapper is involved.
It is puzzling to me why no one noticed this last year. Am I the only
one running Xen on server hardware (Dell R610, 710 and 2950) with
centralized storage (Fibre Channel or iSCSI) and using it as a
production environment?
Is multipathing two links to centralized storage and using LVM2 to
split it up for virtual machines running on two or more servers really
such a rare setup to find Xen running on?
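For context, the basic shape of that setup on one of my dom0s looks roughly
like this (device names, VG/LV names and sizes below are placeholders, not
my actual configuration):

  # The FC LUN is reached over two paths and shows up as a single
  # multipath map, e.g. /dev/mapper/mpatha (placeholder name).
  pvcreate /dev/mapper/mpatha
  vgcreate vg_domus /dev/mapper/mpatha
  lvcreate -L 20G -n vm01-disk vg_domus
  # The domU config then uses the LV as its disk, e.g.:
  #   disk = [ 'phy:/dev/vg_domus/vm01-disk,xvda,w' ]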
Btw, who is currently working on the Remus implementation?
If you should need any more testing from me, feel free to ask.
Best regards.
--
Andreas Olsowski