
[Xen-devel] BUG: unable to handle kernel NULL pointer dereference at IP: [<ffffffff8105ae4c>] process_one_work+



On Mon, Jun 13, 2011 at 07:20:34PM -0400, Scott Garron wrote:
> On 06/13/2011 06:03 PM, Konrad Rzeszutek Wilk wrote:
> >Can you do one more thing - bootup the same kernel as baremetal?
> >Without any Xen and with the same options .. and also with
> >/proc/interrupts so I can see what native Linux sees?
> 
> The serial console plus cat /proc/interrupts pasted onto the end of it
> is here:

Thank you.
> 
> http://pridelands.org/~simba/xen/hailstorm-fullserial20110613.txt

So IRQ 9 is correct.

Somehow I thought that this:

[    1.646560]  dc 0FF ACPI Warning: Large Reference Count (0x1FEA) in object ffff88001ebb3b98 (20110316/utdelete-448)
[    4.136398] ACPI Warning: Large Reference Count (0x1FE9) in object ffff88001ebb3b98 (20110316/utdelete-448)
[    4.136426] BUG: unable to handle kernel NULL pointer dereference at (null)
[    4.136436] IP: [<ffffffff8105ae4c>] process_one_work+0x27/0x286
[    4.136459] PGD 0
[    4.136465] Oops: 0000 [#1] SMP
[    4.136475] CPU 0
[    4.136479] Modules linked in:
[    4.136485]
[    4.136492] Pid: 374, comm: kworker/0:1 Tainted: G        W   2.6.39+ #2 To Be Filled By O.E.M. To Be Filled By O.E.M./TYAN High-End Dual AMD Opteron, S2882
[    4.136505] RIP: e030:[<ffffffff8105ae4c>]  [<ffffffff8105ae4c>] process_one_work+0x27/0x286
[    4.136516] RSP: e02b:ffff88001eb4be40  EFLAGS: 00010046

(from http://pridelands.org/~simba/xen/hailstorm-fullserial20110610.txt)

are related - as in: the ACPI IRQ gets triggered, it does something (which looks
to make the ACPI parser complain), and then it puts some function on the
workqueue, which dies trying to access ffff88001ebb3b80. It died, so whatever
that function was supposed to do never completed. I was thinking that IRQ 9
having the wrong polarity (which it does not) or trigger (which it also does
not) was causing this mayhem - but that is not the case. Sorry about wasting
your time heading down that wrong path.
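
For context, this is roughly what the top of process_one_work() looks like in
2.6.39 (a simplified sketch, not the exact upstream code) - the very first
thing it does is chase pointers off 'work', which is why a NULL or poisoned
work item faults that early in the function (+0x27):

        static void process_one_work(struct worker *worker, struct work_struct *work)
        {
                struct cpu_workqueue_struct *cwq = get_work_cwq(work); /* reads work->data,
                                                                          so a NULL 'work'
                                                                          faults at (null) */
                work_func_t f = work->func;   /* the line 1804 quoted further down */

                /* ... bookkeeping elided ... */
                f(work);                      /* never reached if we oops above */
        }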

The boot process continues, the xen clocksource kicks in, and it does a
hypercall .. and it is probably looping between the hypercall, the xen upcall
handler, and back. IRQ 9 is pending, so it has not been acknowledged by the
Linux kernel. In fact, there are a couple of events that are stuck and locally
masked. Which means that 'spin_lock_irqsave' has been called and it masks
events on the vcpu, but 'spin_unlock_irqrestore' has not been - which could be
due to process_one_work dying.
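
To illustrate the lock pairing (a generic sketch - 'example_lock' and
'do_something()' are made-up names, not the actual code path involved):

        static DEFINE_SPINLOCK(example_lock);

        void example(void)
        {
                unsigned long flags;

                spin_lock_irqsave(&example_lock, flags);      /* masks events on this vcpu */
                do_something();                               /* oops here and we never... */
                spin_unlock_irqrestore(&example_lock, flags); /* ...restore, so the vcpu
                                                                 stays masked */
        }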

But the curious thing is that you have two CPUs assigned to Dom0, and while
CPU0 looks to be bouncing back and forth, CPU1 is doing something. Its RIP
is 0xffffffff8108820c. Can you run that through System.map (a simple
'grep 8108820c System.map' will do)? Or the whole bunch of these:

ffffffff8108820c
ffffffff81088100
ffffffff810881a7
ffffffff8108811a
ffffffff816101a8
ffffffff81006c32
ffffffff816114a4
ffffffff8108803a
ffffffff8105f5bd
ffffffff81618564
ffffffff81617973
ffffffff816117a1
ffffffff81618560
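
Alternatively, the kernel can resolve addresses like these for you - printk's
'%pS' format looks a text address up in kallsyms (just a sketch; plug in any
of the addresses above):

        /* Sketch: resolve a raw text address to symbol+offset via kallsyms. */
        printk(KERN_INFO "RIP resolves to %pS\n", (void *)0xffffffff8108820c);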

The other idea is to limit Dom0 to run on only one CPU. You can do this by
putting 'dom0_max_vcpus=1 dom0_vcpus_pin' on the Xen command line and seeing
if it fails somewhere else. It will probably just die at 0xffffffff810013aa :-(

But regardless of what I mentioned above, we need to find out why
process_one_work got a toxic parameter. Can you disassemble 0xffffffff8105ae4c
and see what it does and how it corresponds to 'process_one_work' in
kernel/workqueue.c?
You can also instrument the code to find out what:

1804         work_func_t f = work->func;

is.
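
Something along these lines would do it (a hypothetical debug hack around
line 1804, not a proper patch; '%pF' prints the function's name via kallsyms):

        /* Hypothetical instrumentation around line 1804 of kernel/workqueue.c: */
        work_func_t f;

        if (unlikely(!work)) {
                printk(KERN_ERR "process_one_work: NULL work item on %s/%d\n",
                       current->comm, current->pid);
                dump_stack();
                return;
        }
        f = work->func;
        printk(KERN_DEBUG "process_one_work: work=%p func=%pF\n", work, f);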

Jeremy, any thoughts on what else might be afoot here?

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

