I have encountered a crash in the dom0 kernel while booting a domU from an AOE device. I have not seen such crashes when booting from local partitions, LVM volumes, or loopback file systems, nor when doing repetitive I/O to these AOE devices. As the call trace indicates, the crash is in the xenolinux kernel. The crash is also reliably reproducible.
I am currently using Xen 3.0.1, but I saw the same thing happen with 3.0.2 some time back. If time permits, I can try to reproduce it on the latest Xen builds.
The domU's disks look like this:
Inside the domU, sda1 is treated as the root device and sda2 as swap.
In the AOE setup, vblade servers running on the server machine export some disks over AOE. The dom0 instance in question is a client of this AOE server. It has the 'aoe' module loaded, and the aoe-tools version is 10.
The stack trace of the crash is as follows:
Unable to handle kernel NULL pointer dereference at virtual
*pde = ma 8da99067 pa 32e99067
*pte = ma 00000000 pa 55555000
Oops: 0002 [#1]
Modules linked in: ipt_physdev iptable_filter ip_tables aoe
bridge nfs lockd ppdev vmnet vmmon sg parport_pc lp parport autofs4 sunrpc
af_packet binfmt_misc dm_mirror dm_multipath video thermal processor fan button
battery ac ipv6 md ohci1394 ieee1394 uhci_hcd intel_agp agpgart i2c_i801
i2c_core pci_hotplug snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss
snd_pcm snd_timer snd soundcore snd_page_alloc e1000 floppy unix sd_mod aacraid
scsi_mod ext3 jbd dm_mod
EIP: 0061:[<c012cc32>] Tainted: P VLI
EFLAGS: 00010012 (184.108.40.206-xen)
EIP is at run_timer_softirq+0xa2/0x1c0
eax: 00000000 ebx: 00000000 ecx: f33dbe00 edx:
esi: 00000100 edi: c26deda0 ebp: 00000000 esp:
ds: 007b es: 007b ss: 0069
Process swapper (pid: 0, threadinfo=c03f2000 task=c0369fc0)
Stack: 00000000 c03f3f7c 00000100 c01438a0 c03f2000 f33dbe00
00000011 c03ecda8 c0420ea0 00000000 c0127ee6 c03ecda8
00000001 00000000 00000000 c0128005 00000000 fbf7e000
Code: 00 8b 53 04 8d 6c 24 14 8b 44 24 14 89 69 04 89 4c 24
14 89 50 04 89 02 89 5b 04 89 5e 0c eb 66 8b 51 04 8b 01 8b 69 14 8b 59 18
<89> 50 04 89 02 c7 41 04 00 02 20 00 c7 01 00 01 10 00 89 4f 08
<0>Kernel panic - not syncing: Fatal exception in
(XEN) Domain 0 shutdown: rebooting machine.
(XEN) Reboot disabled on cmdline: require manual reset
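As an aside for reading the trace above, the `Oops: 0002` value can be interpreted as an x86 page-fault error code. The following is a minimal Python sketch of that decoding; the function name `decode_page_fault_code` is hypothetical, and only the three low bits are interpreted.

```python
def decode_page_fault_code(code: int) -> dict:
    """Decode the three low bits of an x86 page-fault error code."""
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "page_present": bool(code & 0x1),
        # bit 1: 0 = read access, 1 = write access
        "write_access": bool(code & 0x2),
        # bit 2: 0 = kernel mode, 1 = user mode
        "user_mode": bool(code & 0x4),
    }


# The oops line prints the error code in hexadecimal.
oops_code = int("0002", 16)
print(decode_page_fault_code(oops_code))
```

For 0002 this gives a kernel-mode write to a page that is not present, which is consistent with the NULL pointer dereference reported at the top of the trace.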
Before the crash, I get some warnings on the serial console that look like the following:
This is just a warning. Your computer is OK
But I guess these have nothing to do with the crash.
I also observed the AOE traffic with tcpdump when the crash occurs. Nothing seemed unusual to my eyes, except that the packets stopped flowing after the AOE client dom0 crashed. Furthermore, there is no problem with the AOE servers: after a reboot I can start using the same AOE devices again (apart from the inconsistent file system). My past attempts at putting printk's in the AOE driver source also didn't reveal any helpful information.
Please let me know if any bug fixes have been made in recent versions in the area where this crash is being seen (handle_IRQ_event). Any other suggestions for tackling the problem are welcome.