Here is some additional info from my experiments over the weekend.
I took a Lenovo T500 and removed its internal WiFi miniPCIe card.  In
its place, I put in a miniPCIe to PCIe converter card with a PCIe
socket.  Into that socket, I placed a PCIe dump card.  This card has a
switch that generates an SERR error when pressed.  Using the
utility provided by the vendor, I enabled all the bridges between the
card and the CPU to carry the SERR signal so that the CPU sees it as
an NMI.  I tested the set-up several times.  Every single time I pressed
the switch, I got an NMI, followed by a kdump core.  So I was sure the
HW setup was working correctly.
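For reference, the bridge setting described above normally corresponds to the SERR# Enable bits in PCI configuration space. The following is only a minimal sketch, not the vendor utility (which is not shown in this thread). It assumes the standard PCI header layout: the Command register at offset 0x04 (bit 8 is SERR# Enable) and, for type 1 headers (PCI-to-PCI bridges), the Bridge Control register at offset 0x3E (bit 1 is SERR# Enable). The function name `serr_forwarding` is made up for illustration.

```python
import struct

# Offsets and bit masks from the standard PCI configuration header.
CMD_OFFSET = 0x04              # Command register (16 bit)
HEADER_TYPE_OFFSET = 0x0E      # Header Type (8 bit, bit 7 = multi-function)
BRIDGE_CTL_OFFSET = 0x3E       # Bridge Control (16 bit, type 1 headers only)
CMD_SERR_ENABLE = 1 << 8       # Command: SERR# Enable
BRIDGE_CTL_SERR_ENABLE = 1 << 1  # Bridge Control: SERR# Enable


def serr_forwarding(config: bytes) -> dict:
    """Decode the SERR# enable bits from the first 64 bytes of config space.

    Hypothetical helper: it only reports what the bits say, it does not
    change anything on the device.
    """
    if len(config) < 64:
        raise ValueError("need at least the 64-byte standard header")

    (command,) = struct.unpack_from("<H", config, CMD_OFFSET)
    header_type = config[HEADER_TYPE_OFFSET] & 0x7F
    is_bridge = header_type == 1

    result = {
        "command_serr": bool(command & CMD_SERR_ENABLE),
        "is_bridge": is_bridge,
        "bridge_ctl_serr": None,  # stays None for non-bridge devices
    }
    if is_bridge:
        (bridge_ctl,) = struct.unpack_from("<H", config, BRIDGE_CTL_OFFSET)
        result["bridge_ctl_serr"] = bool(bridge_ctl & BRIDGE_CTL_SERR_ENABLE)
    return result
```

On a Linux host the input bytes could be read from the `config` file of the device under `/sys/bus/pci/devices/`. For the SERR to reach the CPU, every bridge on the path would be expected to report both bits as set; whether the chipset then turns it into an NMI is a separate setting that this sketch does not cover.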
I left two Lenovo T500s running over the weekend and when I returned this
morning, both had hung.  Completely frozen.  I pressed the NMI switch on
both systems and got nothing.  No crashes, no coredumps.  It looks as if the
SERR/NMI is getting ignored/blocked or the CPU is completely shut down
(STPCLK).
This experiment helps me prove that the software watchdog code in Xen
was not the problem and that the NMIs are indeed getting blocked somehow.
This is what I now need to investigate.  The areas I would like to learn more
about are the SMI handler and the external chip's use of the STPCLK
signal to the CPU.
As an additional bit of info, the only response we get when the systems
are hung is a beep when the power cord is unplugged from or plugged into the
laptop.  I don't know if the beep is done via a HW module or whether
ACPI/BIOS is involved.
Still looking for additional ideas.
Regards,
Roger R. Cruz
-----Original Message-----
From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
[mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Roger Cruz
Sent: Monday, October 04, 2010 3:03 PM
To: Jan Kiszka
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; Konrad Rzeszutek Wilk
Subject: [Xen-devel] RE: How to generate a HW NMI
> BTW, "rmmod processor thermal" (should be equivalent to your Xen
> parameter) did not make a difference here.
I am not familiar with the thermal module but my guess is that its states are
not the same as the C3 states which can be entered when the kernel
becomes idle.  I believe the thermal module works with another type of state (P?)
where it alters the voltage and frequency of the CPU to keep the CPU
still running but at a particular % of the top speed.  The C3 state
causes the CPU clocks to shut down entirely and the CPU is then awakened by an
external event.
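To see which idle (C) states a plain Linux kernel exposes for a CPU, the cpuidle entries in sysfs can be listed. This is only a rough sketch under assumptions: it relies on the Linux cpuidle sysfs layout (`cpuidle/stateN/` with `name`, `latency` and `usage` files), and the helper name `list_cpuidle_states` is made up. Under Xen the hypervisor owns the idle states, so dom0 may show nothing useful here.

```python
from pathlib import Path


def list_cpuidle_states(cpu_dir):
    """Return the cpuidle states found under one CPU's sysfs directory.

    cpu_dir is e.g. /sys/devices/system/cpu/cpu0 on Linux. Missing files
    are reported as None rather than raising, since not every kernel
    exposes every attribute.
    """
    idle_dir = Path(cpu_dir) / "cpuidle"
    if not idle_dir.is_dir():
        return []

    def read(path):
        # Read one small sysfs attribute, or None if it is absent.
        return path.read_text().strip() if path.is_file() else None

    def to_int(text):
        return int(text) if text is not None and text.isdigit() else None

    states = []
    for state_dir in idle_dir.glob("state*"):
        suffix = state_dir.name[len("state"):]
        if not suffix.isdigit():
            continue
        states.append({
            "index": int(suffix),
            "name": read(state_dir / "name"),
            "latency_us": to_int(read(state_dir / "latency")),
            "usage": to_int(read(state_dir / "usage")),
        })
    # Sort numerically so state10 comes after state2.
    return sorted(states, key=lambda s: s["index"])
```

A state whose `usage` counter keeps growing is one the CPU actually enters; comparing that list with and without the processor module loaded would show whether the deep states are really gone.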
R.
-----Original Message-----
From: Jan Kiszka [mailto:jan.kiszka@xxxxxxxxxxx] 
Sent: Monday, October 04, 2010 11:23 AM
To: Roger Cruz
Cc: Konrad Rzeszutek Wilk; xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: Re: How to generate a HW NMI
Am 04.10.2010 16:19, Roger Cruz wrote:
> Until Friday, all hard hangs that we and our customers had experienced
> were on Lenovo T500 and X200, even with their latest BIOSes.
Yeah, the T500 was reported as problematic here as well. My Fujitsu
Celsius H700 also crashes.
In contrast, we have positive results from a Dell server with an Asus
P6T Deluxe V2 board and a Core i7 920.
>  The Lenovo
> T400 has never hung for me and I don't have any reports on them from
> the field.  On Friday, I had an HP i5 hard hang with similar footprint as
i5? Mmh, we only have reports from i7 so far. Which BIOS vendor?
> the Lenovos.  When this hard hang happens, the Xen watchdog (which is
> driven by the NMI handler) will not do its job and cause a crash/stack
> trace.
>  This is why we have started to suspect something with the BIOS
> and SMIs as they are the only thing that can block an NMI.  I am
> pretty certain that this is somehow related to entering C3 power
> states and possibly at the same time an SMI comes in.
I tried various stuff under Linux as well: nmi_watchdog=1, tracing to
VGA buffer right before/after guest-host switch (it always hangs after
entry here), verified guest interruptibility before entry (though
hypervisors usually do not play with the critical bits), read-out of
host RAM (including kernel log buffer) via Firewire - it all points to a
crash outside the scope of the host OS.
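As a cross-check that the NMI watchdog is ticking at all before the hang, the per-CPU NMI counters in /proc/interrupts can be sampled twice and compared. A minimal sketch, assuming the usual Linux x86 layout where the row starts with `NMI:` followed by one count per CPU; the function names are made up for illustration, and this only helps while the host is still alive.

```python
def nmi_counts(interrupts_text):
    """Extract the per-CPU NMI counts from the text of /proc/interrupts.

    Returns an empty list if no NMI row is present.
    """
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # reached the trailing description text
                counts.append(int(field))
            return counts
    return []


def nmi_delta(before, after):
    """Per-CPU difference between two samples taken some time apart."""
    return [b - a for a, b in zip(before, after)]
```

With nmi_watchdog enabled, a delta of zero on a CPU between two samples taken a few seconds apart would suggest that NMIs had already stopped arriving there.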
>  The time it takes to hang
> varies from 30mins to 24 hrs.
We are a bit luckier, maybe due to our special guest (an old RTOS in
16-bit mode): I can reproduce the hang after a few minutes.
BTW, "rmmod processor thermal" (should be equivalent to your Xen
parameter) did not make a difference here.
Jan
-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
No virus found in this incoming message.
Checked by AVG - www.avg.com 
Version: 9.0.856 / Virus Database: 271.1.1/3168 - Release Date: 10/04/10
02:35:00
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel