I've just run into an odd DomU crash while doing a live migration of a 4-VCPU
domain (with 3.0.4, but the code looks the same in 2.6.18/unstable to me). The
actual panic is attached at the end of this mail, but the bottom line is that
the code in cache_remove_shared_cpu_map (in
arch/i386/kernel/cpu/intel_cacheinfo.c) is attempting to clean up the cache
info for a processor that does not yet have this info set up: the code
dereferences a pointer from the cpuid4_info[] array, and looking at the dump I
can see that this pointer is NULL.
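The faulting address in the oops at the end of this mail seems consistent with
a NULL entry. A small sketch of the address arithmetic, based only on my
reading of the "Code:" bytes (the 20-byte stride and the 0x10 offset are
inferred from the disassembly, and the function name is made up for
illustration):

```c
#include <stdint.h>

/*
 * Models the address computed just before the faulting instruction:
 *   base   = cpuid4_info[cpu]   (edx in the oops, 0 for CPU 1)
 *   index  = cache leaf index   (esi in the oops, 0)
 *   stride = 20 bytes, offset = 0x10 (both inferred from the Code: bytes)
 */
static uintptr_t faulting_address(uintptr_t base, unsigned int index)
{
    return base + (uintptr_t)index * 20u + 0x10u;
}
```

With base 0 and index 0 this gives 0x10, which matches both the reported
"virtual address 00000010" and the value in edi.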
My working theory is that we attempted the migration very early, before the
array of cache info pointers had been initialized for all processors. It would
be relatively easy to protect against this by checking for NULL, but I'm not
sure whether that is the correct solution -- if anyone is familiar with this
code and can comment on an appropriate fix I'd be grateful.
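To make the NULL-check idea concrete, here is a minimal user-space sketch. It
is not the kernel code: struct cache_info, cache_add_dev_model and
cache_remove_dev_guarded are hypothetical stand-ins, and only the cpuid4_info[]
name and the 32-entry size are taken from the dump below.

```c
#include <stddef.h>
#include <stdlib.h>

#define NR_CPUS 32   /* the arrays in the crash dump have 32 entries */

/* Hypothetical stand-in for the per-CPU cache info structure. */
struct cache_info {
    unsigned int num_leaves;
};

/* Models cpuid4_info[]: NULL means "cache info never set up for this CPU". */
static struct cache_info *cpuid4_info[NR_CPUS];

/* Models the setup path; returns 0 on success, -1 on failure. */
static int cache_add_dev_model(unsigned int cpu)
{
    if (cpu >= NR_CPUS)
        return -1;
    cpuid4_info[cpu] = calloc(1, sizeof(*cpuid4_info[cpu]));
    return cpuid4_info[cpu] ? 0 : -1;
}

/* Guarded teardown: returns 0 if info was freed, -1 if there was none. */
static int cache_remove_dev_guarded(unsigned int cpu)
{
    if (cpu >= NR_CPUS || cpuid4_info[cpu] == NULL)
        return -1;   /* nothing was set up: skip instead of dereferencing */
    free(cpuid4_info[cpu]);
    cpuid4_info[cpu] = NULL;
    return 0;
}
```

In this model, removing CPU 1 when only CPU 0 has been set up simply returns
-1 rather than faulting. Whether silently skipping is right in the real
CPU-down path, or whether it just hides an ordering problem, is exactly what
I'm unsure about.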
One thing I'm really not sure about is the timing of marking the CPUs up with
respect to the trace about initializing CPUs (see console output below). I can
see that the four VCPUs are set up in the cpu_sys_devices array (which is set
up by the code that outputs the 'Initializing CPU#n' trace), but the array of
cache info structures only has an entry for VCPU 0:
crash> cpu_sys_devices
cpu_sys_devices = $3 =
 {0xc0464448, 0xc046448c, 0xc04644d0, 0xc0464514, 0x0, 0x0, 0x0, 0x0,
  0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
  0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
  0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
crash> cpuid4_info
cpuid4_info = $4 =
 {0xc7971180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
  0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
  0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
  0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
Any suggestions for appropriate fixes here?
Simon
--- console output ---
Enabling SMP...
Initializing CPU#3
Initializing CPU#2
Initializing CPU#1
eth0: no IPv6 routers present
Unable to handle kernel NULL pointer dereference at virtual address 00000010
printing eip:
c010dd3a
0204a000 -> *pde = 00000001:0d8ec001
06a9c000 -> *pme = 00000000:00000000
Oops: 0000 [#1]
SMP
Modules linked in: ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core
binfmt_misc dm_mirror dm_mod bnx2 ata_piix libata mptscsih mptfc mptspi mptsas
mptscsi scsi_mod mptbase
CPU: 0
EIP: 0061:[<c010dd3a>] Tainted: GF VLI
EFLAGS: 00010202 (2.6.16.29-xen #1)
EIP is at cache_remove_shared_cpu_map+0x1a/0x90
eax: 00000000 ebx: 00000001 ecx: 00000001 edx: 00000000
esi: 00000000 edi: 00000010 ebp: c3913f14 esp: c3913f08
ds: 007b es: 007b ss: 0069
Process suspend (pid: 4038, threadinfo=c3912000 task=c2244570)
Stack: <0>00000001 00000001 00000000 c3913f28 c010e3ba 00000007 00000001
00000007
c3913f34 c010e425 c03bd804 c3913f48 c012fae8 ffffffea 00000001 c568c570
c3913f7c c013b889 c3913fc0 00000002 00000001 00000000 00000003 00000000
Call Trace:
[<c0105401>] show_stack_log_lvl+0xa1/0xe0
[<c01055f1>] show_registers+0x181/0x200
[<c0105810>] die+0x100/0x1a0
[<c01156f6>] do_page_fault+0x3c6/0x8b1
[<c0105067>] error_code+0x2b/0x30
[<c010e3ba>] cache_remove_dev+0x2a/0x60
[<c010e425>] cacheinfo_cpu_callback+0x35/0x40
[<c012fae8>] notifier_call_chain+0x18/0x40
[<c013b889>] cpu_down+0x139/0x260
[<c028bc9f>] smp_suspend+0x7f/0x100
[<c028ca80>] __do_suspend+0x40/0x180
[<c0136a06>] kthread+0x96/0xe0
[<c0102e95>] kernel_thread_helper+0x5/0x10
Code: 0c 5b 5e 5f 5d c3 8d 74 26 00 8d bc 27 00 00 00 00 55 89 e5 57 56 89 d6
53 89 c3 8d 04 92 8b 14 9d 20 4d 46 c0 8d 04 82 8d 78 10 <8b> 40 10 ba 20 00 00
00 85 c0 74 03 0f bc d0 83 fa 21 b9 20 00
-and-
crash> bt
PID: 4038 TASK: c2244570 CPU: 0 COMMAND: "suspend"
#0 [c3913ddc] xen_panic_event at c010a527
#1 [c3913df8] notifier_call_chain at c012fae6
#2 [c3913e0c] panic at c0120b16
#3 [c3913e20] die at c0105866
#4 [c3913e6c] do_page_fault at c01156f1
#5 [c3913ed0] error_code (via page_fault) at c0105065
EAX: 00000000 EBX: 00000001 ECX: 00000001 EDX: 00000000 EBP: c3913f14
DS: 007b ESI: 00000000 ES: 007b EDI: 00000010
CS: 0061 EIP: c010dd3a ERR: ffffffff EFLAGS: 00010202
#6 [c3913f04] cache_remove_shared_cpu_map at c010dd3a
#7 [c3913f18] cache_remove_dev at c010e3b5
#8 [c3913f2c] cacheinfo_cpu_callback at c010e420
#9 [c3913f38] notifier_call_chain at c012fae6
#10 [c3913f4c] cpu_down at c013b884
#11 [c3913f80] smp_suspend at c028bc9a
#12 [c3913f98] __do_suspend at c028ca7b
#13 [c3913fc4] kthread at c0136a03
#14 [c3913fe8] kernel_thread_helper at c0102e93
crash>