Hi Dan,
I tried both of your suggestions and I am still seeing a
hang. The only clue I have left to go on is that the print statements
I placed in the NMI tick handler do not keep arriving at a regular
1-second interval. I expect them to come at a steady rate, and after
boot they do, up to about the 39th second. See the output below, where
curr_sum comes from the nmi_timer_ticks variable. After the 39th tick, the
messages are printed at irregular intervals, anywhere from 2 to 20 seconds apart.
Does Xen adjust the NMI interval rate? I know that the NMIs are
programmed based on CPU cycles, so I disabled SpeedStep to make sure the
processor speed is not being adjusted, and the symptoms are still
identical.
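As a rough illustration of how a cycle-driven NMI can drift even with SpeedStep off: if the watchdog counter is reloaded with cpu_khz * 1000 / nmi_hz cycles (the Linux-style reload formula; assumed here, not checked against our 3.4.2 tree) and the counter is programmed with an "unhalted cycles" event, so it only advances while the core is not halted, then the wall-clock gap between NMIs grows as the CPU spends more time idle. The helper names, the 2 GHz clock and the busy fractions are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: cycles the counter must count between two watchdog
 * NMIs, assuming a Linux-style reload value of cpu_khz * 1000 / nmi_hz. */
static uint64_t watchdog_reload_count(uint64_t cpu_khz, unsigned int nmi_hz)
{
    return cpu_khz * 1000u / nmi_hz;
}

/* Wall-clock seconds between NMIs if the counter only advances for a
 * fraction `busy` (0 < busy <= 1) of real time, e.g. because the core
 * spends the rest of the time halted (an assumption, not measured). */
static double nmi_period_seconds(uint64_t cpu_khz, unsigned int nmi_hz,
                                 double busy)
{
    double counted_cycles_per_sec = (double)cpu_khz * 1000.0 * busy;
    return (double)watchdog_reload_count(cpu_khz, nmi_hz) /
           counted_cycles_per_sec;
}
```

With a hypothetical 2 GHz clock (cpu_khz = 2000000) and nmi_hz = 1, nmi_period_seconds() gives 1.0 s at busy = 1.0, 2.0 s at 0.5 and 4.0 s at 0.25, so an idle-heavy CPU would see its NMIs spread out even at a fixed clock frequency.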
Any other ideas?
Regards,
Roger R. Cruz
void nmi_watchdog_tick(struct cpu_user_regs *regs)
{
    unsigned int sum = this_cpu(nmi_timer_ticks);
    /* ... rest of the handler, including the debug printk, omitted ... */
}
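For reference when reading the counter, last_sum and curr_sum fields in the output: the Linux-derived NMI watchdog compares the tick count on each NMI with the value seen on the previous one, increments an alert counter while it has not moved, and resets the counter as soon as it does. The sketch below models only that comparison. struct wd_state, wd_check() and the threshold parameter are hypothetical stand-ins, not Xen's own code; the real trip point in the Linux-derived code is on the order of 5 * nmi_hz, which is an assumption for this 3.4.2 tree.

```c
#include <stdbool.h>

/* Hypothetical per-CPU watchdog state, modelled on the last_sum/counter
 * fields printed in the log. */
struct wd_state {
    unsigned int last_sum; /* timer-tick count seen at the previous NMI */
    unsigned int counter;  /* consecutive NMIs with no tick progress */
};

/* One watchdog NMI: returns true if the CPU should be declared stuck.
 * `threshold` stands in for something like 5 * nmi_hz (assumption). */
static bool wd_check(struct wd_state *s, unsigned int curr_sum,
                     unsigned int threshold)
{
    if (curr_sum == s->last_sum) {
        /* No timer ticks since the last NMI: count towards an alert. */
        s->counter++;
        return s->counter >= threshold;
    }
    /* Ticks are still advancing: remember them and reset the alert. */
    s->last_sum = curr_sum;
    s->counter = 0;
    return false;
}
```

Replaying the curr_sum values from the log (starting from last_sum = 25: 27, 28, 29, 29, 30, ...) through wd_check() with a threshold of 5 never trips, because at least over the span captured here the tick count keeps advancing between NMIs; the printed counter likewise never goes above 1.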
HEDLEY-T500 login:
(XEN) [23:29:04] 17213**** CPU0, counter=0, last_sum=25, curr_sum=27, hz=10, nmis=36
(XEN) [23:29:05] 17298**** CPU0, counter=0, last_sum=27, curr_sum=28, hz=10, nmis=37
(XEN) [23:29:06] 17383**** CPU0, counter=0, last_sum=28, curr_sum=29, hz=10, nmis=38
(XEN) [23:29:06] 17468mm.c:806:d0 Error getting mfn 100 (pfn 3ff0) from L1 entry 8000000000100625 for l1e_owner=0, pg_owner=32753
(XEN) [23:29:07] 17598**** CPU0, counter=0, last_sum=29, curr_sum=29, hz=10, nmis=39
(XEN) [23:29:08] 17683**** CPU0, counter=1, last_sum=29, curr_sum=30, hz=10, nmis=40
(XEN) [23:29:08] 17768**** CPU0, counter=0, last_sum=30, curr_sum=31, hz=10, nmis=41
(XEN) [23:29:09] 17853**** CPU0, counter=0, last_sum=31, curr_sum=32, hz=10, nmis=42
mapping kernel into physical memory
about to get started...
(XEN) [23:29:10] 17938**** CPU0, counter=0, last_sum=32, curr_sum=32, hz=10, nmis=43
(XEN) [23:29:10] 18023**** CPU0, counter=1, last_sum=32, curr_sum=33, hz=10, nmis=44
(XEN) [23:29:11] 18108**** CPU0, counter=0, last_sum=33, curr_sum=34, hz=10, nmis=45
(XEN) [23:29:12] 18193**** CPU0, counter=0, last_sum=34, curr_sum=34, hz=10, nmis=46
(XEN) [23:29:13] 18278**** CPU0, counter=1, last_sum=34, curr_sum=35, hz=10, nmis=47
(XEN) [23:29:13] 18363**** CPU0, counter=0, last_sum=35, curr_sum=36, hz=10, nmis=48
(XEN) [23:29:15] 18448**** CPU0, counter=0, last_sum=36, curr_sum=38, hz=10, nmis=49
(XEN) [23:29:16] 18533**** CPU0, counter=0, last_sum=38, curr_sum=39, hz=10, nmis=50
(XEN) [23:29:18] 18618**** CPU0, counter=0, last_sum=39, curr_sum=41, hz=10, nmis=51
(XEN) [23:29:21] 18703**** CPU0, counter=0, last_sum=41, curr_sum=43, hz=10, nmis=52
(XEN) [23:29:27] 18788**** CPU0, counter=0, last_sum=43, curr_sum=49, hz=10, nmis=53
(XEN) [23:29:34] 18873**** CPU0, counter=0, last_sum=49, curr_sum=56, hz=10, nmis=54
(XEN) [23:29:39] 18958**** CPU0, counter=0, last_sum=56, curr_sum=61, hz=10, nmis=55
(XEN) [23:29:43] 19043**** CPU0, counter=0, last_sum=61, curr_sum=66, hz=10, nmis=56
(XEN) [23:29:47] 19128**** CPU0, counter=0, last_sum=66, curr_sum=69, hz=10, nmis=57
(XEN) [23:29:50] 19213**** CPU0, counter=0, last_sum=69, curr_sum=73, hz=10, nmis=58
(XEN) [23:29:54] 19298**** CPU0, counter=0, last_sum=73, curr_sum=77, hz=10, nmis=59
(XEN) [23:29:58] 19383**** CPU0, counter=0, last_sum=77, curr_sum=80, hz=10, nmis=60
(XEN) [23:30:01] 19468**** CPU0, counter=0, last_sum=80, curr_sum=83, hz=10, nmis=61
(XEN) [23:30:04] 19553**** CPU0, counter=0, last_sum=83, curr_sum=87, hz=10, nmis=62
(XEN) [23:30:08] 19638**** CPU0, counter=0, last_sum=87, curr_sum=91, hz=10, nmis=63
(XEN) [23:30:11] 19723**** CPU0, counter=0, last_sum=91, curr_sum=94, hz=10, nmis=64
(XEN) [23:30:13] 19808**** CPU0, counter=0, last_sum=94, curr_sum=96, hz=10, nmis=65
(XEN) [23:30:16] 19893**** CPU0, counter=0, last_sum=96, curr_sum=99, hz=10, nmis=66
(XEN) [23:30:19] 19978**** CPU0, counter=0, last_sum=99, curr_sum=102, hz=10, nmis=67
(XEN) [23:30:22] 20064**** CPU0, counter=0, last_sum=102, curr_sum=105, hz=10, nmis=68
(XEN) [23:30:25] 20151**** CPU0, counter=0, last_sum=105, curr_sum=108, hz=10, nmis=69
(XEN) [23:30:28] 20238**** CPU0, counter=0, last_sum=108, curr_sum=111, hz=10, nmis=70
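To put numbers on the drift rather than eyeballing timestamps, a small filter can pull the [hh:mm:ss] time and curr_sum out of each watchdog print and report the gap between consecutive prints. This is a sketch that assumes each entry sits on a single line, as the console emits it, and that the printk layout is exactly the one shown above; parse_entry() and report_gaps() are hypothetical helpers written against this format, not Xen tools.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one watchdog debug print. */
struct wd_entry {
    unsigned int secs;     /* [hh:mm:ss] timestamp as seconds since midnight */
    unsigned int curr_sum; /* nmi_timer_ticks value printed by the handler */
    unsigned int nmis;     /* NMI count printed by the handler */
};

/* Parse one console line. Returns 1 for a watchdog print and 0 for anything
 * else (the mm.c error, guest output, wrapped fragments). */
static int parse_entry(const char *line, struct wd_entry *e)
{
    const char *p = strstr(line, "(XEN) [");
    unsigned int h, m, s, cpu, counter, last_sum, hz;

    if (p == NULL)
        return 0;
    /* %*u skips the number glued to the "****" marker. */
    if (sscanf(p, "(XEN) [%u:%u:%u] %*u**** CPU%u, counter=%u, "
                  "last_sum=%u, curr_sum=%u, hz=%u, nmis=%u",
               &h, &m, &s, &cpu, &counter, &last_sum, &e->curr_sum,
               &hz, &e->nmis) != 9)
        return 0;
    e->secs = h * 3600 + m * 60 + s;
    return 1;
}

/* For each watchdog print after the first, write the wall-clock gap and
 * the curr_sum delta since the previous print. */
static void report_gaps(FILE *in, FILE *out)
{
    char line[512];
    struct wd_entry prev = { 0, 0, 0 }, cur;
    int have_prev = 0;

    while (fgets(line, sizeof(line), in) != NULL) {
        if (!parse_entry(line, &cur))
            continue;
        if (have_prev) {
            unsigned int dt = cur.secs - prev.secs;
            if (cur.secs < prev.secs)
                dt += 24 * 3600; /* log crossed midnight */
            fprintf(out, "nmis=%u: +%u s, +%u ticks\n",
                    cur.nmis, dt, cur.curr_sum - prev.curr_sum);
        }
        prev = cur;
        have_prev = 1;
    }
}
```

Called as report_gaps(stdin, stdout) from a trivial main() and fed the entries above, it reports, for instance, "nmis=53: +6 s, +6 ticks" for the 23:29:21 to 23:29:27 step, against "+1 s, +1 ticks" for the early prints.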
From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Roger Cruz
Sent: Tuesday, September 14, 2010 11:55 AM
To: Dan Magenheimer; Tim Deegan
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: RE: [Xen-devel] State of current Xen debugger
Hi Dan,
I am using Xen 3.4.2, to which we have made very minor modifications (some
backports, for example).
I have not tried your suggestions yet, so I will do that next. Thanks!
R.
-----Original Message-----
From: Dan Magenheimer [mailto:dan.magenheimer@xxxxxxxxxx]
Sent: Tue 9/14/2010 11:19 AM
To: Roger Cruz; Tim Deegan
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: RE: [Xen-devel] State of current Xen debugger
A couple of thoughts:
Have you tried max_cstate=0 (as a Xen boot option)?
Also, you didn't say which version of Xen you are using, but playing around with
hpet_broadcast (enabling it, or force-disabling it as in the thread linked
below) might be worth a try.
http://lists.xensource.com/archives/html/xen-devel/2010-09/msg00556.html
From: Roger Cruz [mailto:roger.cruz@xxxxxxxxxxxxxxxxxxx]
Sent: Tuesday, September 14, 2010 8:56 AM
To: Tim Deegan
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: RE: [Xen-devel] State of current Xen debugger
Hi Tim, good to hear from you again.
I had a pretty good inkling that one of you hardcore developers would say that
:-) Yes, it is pretty well wedged. I can reproduce the problem more quickly
by dropping to a single CPU. When the hang happens, the Xen console is
completely dead. None of the special keys work.
I do hope a BIOS upgrade could fix this as a last resort, but first I want to
see whether I can at least understand the problem. We have a few different
machines exhibiting similar symptoms, so I have to see if I can find a
workaround that does not require every user to upgrade their BIOS :-(
Just in case, what debugger have you been using? Are there recent
instructions on how to set it up that you can point me to?
Thanks
Roger
-----Original Message-----
From: Tim Deegan [mailto:Tim.Deegan@xxxxxxxxxx]
Sent: Tue 9/14/2010 10:30 AM
To: Roger Cruz
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: Re: [Xen-devel] State of current Xen debugger
Hi,
At 15:22 +0100 on 14 Sep (1284477779), Roger Cruz wrote:
> I am trying to debug a problem where the hypervisor is hanging hard.
> Not even the NMI watchdog is triggering a reboot. So I wanted to hook
> up a debugger.
Sorry to bring a counsel of despair but if the NMI watchdog isn't
working then your chances of getting a working debugger are slim. It's
likely that at least one CPU is very very stuck. Does the 'd' debug key
work on the serial line when the machine is wedged?
On a more cheerful note, I've twice seen hard hangs like this that
turned out to be hardware issues, fixable with BIOS upgrades.
Cheers,
Tim.
> What is the state of the current debuggers out there?
> Any input on how I should set it up (kdb, gdb, etc.) and pointers to a
> good wiki page are much appreciated. I did perform a Google search
> and found some links, but I want to hear from the current developers as
> to what is most stable and useful for debugging this type of hard
> hang. I only have a PCI Express serial port card to use, as the
> laptop has no built-in port.
--
Tim Deegan <Tim.Deegan@xxxxxxxxxx>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd. (Company #02937203, SL9 0BG)
No virus found in this incoming message.
Checked by AVG - www.avg.com
Version: 9.0.851 / Virus Database: 271.1.1/3119 - Release Date: 09/14/10
02:35:00