[Xen-devel] NMI with SMP domain causing machine to reboot
I have spent most of the last few weeks trying to nail down a nasty bug
that is preventing me from releasing xenoprof for SMP domains.
The bug is non-deterministic, and when it happens the machine just
reboots with no message or warning on the serial console.
This has made the debugging process painful and slow.
I started removing specific components of the xenoprof code to find
which component is causing the problem. After removing almost all of
the code, it seems the bug is associated with NMIs.
Right now the only code left is the code that programs a hardware
performance counter to count "non-halted" clock cycles (hard-coded)
and handles the resulting NMIs. All other logic was removed, and I am
still seeing the machine reboot at some non-deterministic time.
I am starting to suspect this might be a Xen bug and I will probably
need some help from the Xen core team to nail this down.
I have attached a patch that enables Xen to program the perf counter
and handle the NMIs they generate. I have also attached a patch for
a new user level test tool for starting the performance counter.
I hope these patches enable others to reproduce the behaviour I am
observing.
I only see this bug when running SMP domains (either dom0 or domU)
with NMIs being generated. My machine has two CPUs with hyperthreading
disabled. When I boot an SMP domain0 (with 2 VCPUs) I only see the
bug when NMIs are generated for CPU 1. Surprisingly, I have never
seen the automatic reboot when NMIs are generated on CPU 0 only.
Since the bug is non-deterministic, it is possible that the bug is
still there but for some reason is not triggered by NMIs on CPU 0.
Here is the sequence of steps that I use to trigger the bug (on an SMP
dom0 with 2 VCPUs):
1) initialize the performance counter
> xenpmc -i
2) start the counter
> xenpmc -g
3) verify that NMIs are being generated
> xenpmc -s
This causes the NMI counts for [CPU0,CPU1] to be printed.
This command was originally intended to stop the counters
(and NMI generation), but it was modified to just return
without stopping them. As a side effect, the number of NMIs
is printed on the Xen console and can be used to verify
that NMIs are being generated.
To trigger the bug I execute the command "xm dmesg" in a loop,
and eventually the machine reboots on its own (usually after
a few minutes). I use the following shell script to execute
"xm dmesg" in a loop.
#!/bin/bash
while true;
do xm dmesg;
sleep 1;
done
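
Since the reboot leaves nothing on the serial console, it can help to
record how far the loop got before the machine went down. The script
below is only a sketch of one way to do that: it runs steps 1-3 and
then the same "xm dmesg" loop, appending a timestamped line to a log
file and flushing it to disk before each iteration. The names LOG,
CMD, MAX_ITER, SLEEP_SECS, setup_counters and run_loop are
hypothetical and are not part of the attached patches; xenpmc is
assumed to behave as described in the steps above.

```shell
#!/bin/bash
# Sketch only: LOG, CMD, MAX_ITER, SLEEP_SECS, setup_counters and
# run_loop are illustrative names, not part of the attached patches.

LOG="${LOG:-/var/tmp/nmitest.log}"   # progress log, read back after the reboot
CMD="${CMD:-xm dmesg}"               # command executed on every iteration
MAX_ITER="${MAX_ITER:-0}"            # 0 means loop until the machine reboots
SLEEP_SECS="${SLEEP_SECS:-1}"        # pause between iterations

# Steps 1-3 from the list above (assumes the patched xenpmc tool).
setup_counters() {
    xenpmc -i    # initialize the performance counter
    xenpmc -g    # start the counter
    xenpmc -s    # print the per-CPU NMI counts on the Xen console
}

# Run CMD repeatedly, logging each iteration before it starts.
run_loop() {
    local i=0
    while [ "$MAX_ITER" -eq 0 ] || [ "$i" -lt "$MAX_ITER" ]; do
        i=$((i + 1))
        echo "$(date +%s) iteration $i" >> "$LOG"
        sync                          # push the log line to disk first
        $CMD > /dev/null 2>&1
        sleep "$SLEEP_SECS"
    done
}

# Only start when invoked as: ./nmitest.sh run
if [ "${1:-}" = "run" ]; then
    setup_counters
    run_loop
fi
```

After the machine comes back up, the last line of the log gives the
iteration and the time (seconds since the epoch) at which the loop was
last known to be alive.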
Does anybody have an idea of what could be causing this behavior and
how we could nail it down?
Thanks
Renato
Attachments:
nmitest_xen.patch
nmitest_tools.patch
- [Xen-devel] NMI with SMP domain causing machine to reboot,
Santos, Jose Renato G