I am using Xen 3.4.2 with some modifications.
I added printks to nmi_watchdog_tick() as shown below. I don't break the console lock, but I am convinced that the printk lock is not the problem: I also tested with an empty (no-op) printk routine and it still hangs, so it felt pretty safe not to break the lock. I also tried console_start_sync()/console_end_sync() to make sure I was seeing all the messages when it hung.
void nmi_watchdog_tick(struct cpu_user_regs * regs)
{
    unsigned int sum = this_cpu(nmi_timer_ticks);

    if ( (this_cpu(last_irq_sums) == sum) &&
         !atomic_read(&watchdog_disable_count) )
    {
        /* Debug: timer ticks have not advanced since the previous NMI. */
        if (sum > 20) {
            // console_start_sync();
            printk("**** CPU%d, counter=%d, last_sum=%d, curr_sum=%d, hz=%d, nmis=%d\n",
                   smp_processor_id(), this_cpu(alert_counter), this_cpu(last_irq_sums), sum, 5*nmi_hz, nmi_count(smp_processor_id()) );
            // console_end_sync();
        }
        /*
         * Ayiee, looks like this CPU is stuck ... wait a few IRQs (5 seconds)
         * before doing the oops ...
         */
        this_cpu(alert_counter)++;
        if ( this_cpu(alert_counter) == 5*nmi_hz )
        {
            console_force_unlock();
            printk("Watchdog timer detects that CPU%d is stuck!\n",
                   smp_processor_id());
            fatal_trap(TRAP_nmi, regs);
        }
    }
    else
    {
        /* Debug: timer ticks advanced, so the CPU looks healthy. */
        if (sum > 20) {
            // console_start_sync();
            printk("*CPU%d, counter=%d, last_sum=%d, curr_sum=%d, nmis=%d\n",
                   smp_processor_id(), this_cpu(alert_counter), this_cpu(last_irq_sums), sum, nmi_count(smp_processor_id()) );
            // console_end_sync();
        }
        this_cpu(last_irq_sums) = sum;
        this_cpu(alert_counter) = 0;
    }
}
My messages stop printing and I get a hard hang. The performance-counter NMI appears to arrive about once every 4 seconds, but I have observed instances where they are about 10 seconds apart. I am not sure what makes the NMIs arrive at uneven intervals. As a test, I turned on SpeedStep and the power management functions in the BIOS, and it still hangs.
(XEN) *CPU0, counter=0, last_sum=974, curr_sum=977, nmis=391
(XEN) *CPU0, counter=0, last_sum=977, curr_sum=979, nmis=392
(XEN) *CPU0, counter=0, last_sum=979, curr_sum=981, nmis=393
(XEN) *CPU0, counter=0, last_sum=981, curr_sum=984, nmis=394
(XEN) *CPU0, counter=0, last_sum=984, curr_sum=986, nmis=395
(XEN) *CPU0, counter=0, last_sum=986, curr_sum=988, nmis=396
(XEN) *CPU0, counter=0, last_sum=988, curr_sum=991, nmis=397
(XEN) *CPU0, counter=0, last_sum=991, curr_sum=993, nmis=398
(XEN) *CPU0, counter=0, last_sum=993, curr_sum=995, nmis=399
(XEN) *CPU0, counter=0, last_sum=995, curr_sum=997, nmis=400
(XEN) *CPU0, counter=0, last_sum=997, curr_sum=1000, nmis=401
(XEN) *CPU0, counter=0, last_sum=1000, curr_sum=1002, nmis=402
(XEN) *CPU0, counter=0, last_sum=1002, curr_sum=1005, nmis=403
(XEN) *CPU0, counter=0, last_sum=1005, curr_sum=1008, nmis=404
(XEN) *CPU0, counter=0, last_sum=1008, curr_sum=1010, nmis=405
(XEN) *CPU0, counter=0, last_sum=1010, curr_sum=1013, nmis=406
(XEN) *CPU0, counter=0, last_sum=1013, curr_sum=1015, nmis=407
(XEN) *CPU0, counter=0, last_sum=1015, curr_sum=1018, nmis=408
(XEN) *CPU0, counter=0, last_sum=1018, curr_sum=1020, nmis=409
(XEN) *CPU0, counter=0, last_sum=1020, curr_sum=1023, nmis=410
(XEN) *CPU0, counter=0, last_sum=1023, curr_sum=1026, nmis=411
(XEN) *CPU0, counter=0, last_sum=1026, curr_sum=1029, nmis=412
(XEN) *CPU0, counter=0, last_sum=1029, curr_sum=1031, nmis=413
(XEN) *CPU0, counter=0, last_sum=1031, curr_sum=1033, nmis=414
(XEN) *CPU0, counter=0, last_sum=1033, curr_sum=1035, nmis=415
(XEN) *CPU0, counter=0, last_sum=1035, curr_sum=1038, nmis=416
(XEN) *CPU0, counter=0, last_sum=1038, curr_sum=1041, nmis=417
(XEN) *CPU0, counter=0, last_sum=1041, curr_sum=1043, nmis=418
(XEN) *CPU0, counter=0, last_sum=1043, curr_sum=1046, nmis=419
(XEN) *CPU0, counter=0, last_sum=1046, curr_sum=1049, nmis=420
(XEN) *CPU0, counter=0, last_sum=1049, curr_sum=1051, nmis=421
(XEN) *CPU0, counter=0, last_sum=1051, curr_sum=1055, nmis=422
(XEN) *CPU0, counter=0, last_sum=1055, curr_sum=1058, nmis=423
(XEN) *CPU0, counter=0, last_sum=1058, curr_sum=1061, nmis=424
(XEN) *CPU0, counter=0, last_sum=1061, curr_sum=1064, nmis=425
(XEN) *CPU0, counter=0, last_sum=1064, curr_sum=1067, nmis=426
(XEN) *CPU0, counter=0, last_sum=1067, curr_sum=1070, nmis=427
(XEN) *CPU0, counter=0, last_sum=1070, curr_sum=1073, nmis=428
(XEN) *CPU0, counter=0, last_sum=1073, curr_sum=1076, nmis=429
 __  __            _____ _  _    ____
 \ \/ /___ _ __   |___ /| || |  |___ \
  \  // _ \ '_ \    |_ \| || |_   __) |
  /  \  __/ | | |  ___) |__   _| / __/
 /_/\_\___|_| |_| |____(_) |_|(_)_____|
(XEN) Xen version 3.4.2 (rcruz@) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) Mon Sep 13 23:06:17 UTC 2010
(XEN) Latest ChangeSet: Mon Sep 13 16:12:14 2010 -0400 132:a499dd8fcb55
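
For reference, here is a minimal user-space sketch of the stuck-CPU check in nmi_watchdog_tick() above. It is only a model under stated assumptions: the names wd_state, wd_tick and wd_run are hypothetical, the per-CPU variables are replaced by a plain struct, and the threshold argument stands in for 5*nmi_hz. It illustrates why the log shows counter=0 on every line (the tick sum keeps advancing) and why the check cannot fire once the NMIs themselves stop arriving.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU state used by the watchdog. */
struct wd_state {
    unsigned int last_irq_sum;   /* timer tick count seen at the previous NMI */
    unsigned int alert_counter;  /* consecutive NMIs with no tick progress */
};

/*
 * Model of one watchdog NMI. 'sum' is the current timer tick count and
 * 'threshold' plays the role of 5*nmi_hz. Returns true on the NMI at
 * which the CPU would be declared stuck.
 */
static bool wd_tick(struct wd_state *s, unsigned int sum,
                    unsigned int threshold)
{
    assert(threshold > 0);

    if (s->last_irq_sum == sum) {
        /* No timer progress since the last NMI: count it. */
        s->alert_counter++;
        return s->alert_counter == threshold;
    }

    /* Timer ticks advanced: the CPU looks alive, so start over. */
    s->last_irq_sum = sum;
    s->alert_counter = 0;
    return false;
}

/*
 * Deliver 'nmis' NMIs in a row; the tick count grows by 'step' between
 * NMIs. Returns the 1-based index of the NMI that trips the watchdog,
 * or 0 if it never trips.
 */
static unsigned int wd_run(unsigned int nmis, unsigned int step,
                           unsigned int threshold)
{
    struct wd_state s = { 0, 0 };
    unsigned int sum = 1;  /* differs from the initial last_irq_sum */

    for (unsigned int i = 1; i <= nmis; i++) {
        if (wd_tick(&s, sum, threshold))
            return i;
        sum += step;
    }
    return 0;
}
```

With a step of 2 (roughly what the log shows) wd_run never reports a stuck CPU. With a step of 0 it trips only after the threshold number of further NMIs, so if the NMIs stop first, the hang goes unreported.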
-----Original Message-----
From: Tim Deegan [mailto:Tim.Deegan@xxxxxxxxxx]
Sent: Tue 9/14/2010 11:20 AM
To: Roger Cruz
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: Re: [Xen-devel] State of current Xen debugger
At 15:56 +0100 on 14 Sep (1284479787), Roger Cruz wrote:
> I had a pretty good inkling that one of you hardcore developers would
> say that :-) Yes, it is pretty well wedged. I can cause the problem
> more rapidly by dropping to a single CPU. When the hang happens, the
> Xen console is completely dead. None of the special keys work.
If the 'd' key doesn't work then the serial irq isn't getting handled,
so the CPU is wedged at a higher TPR (at least). Usually in that case
the CPU is spinning so the NMI watchdog timer kicks in OK; possibly if
it was idle with a high TPR it wouldn't.
What version of Xen are you using?
It might be worth trying a boot with MSI disabled (there were reports at
one stage of MSIs not being EOI'd because the timer interrupt that would
remind Xen to EOI them was at a lower priority than the MSI).
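
Both priority remarks above (a high TPR keeping the serial IRQ out, and an un-EOI'd MSI keeping a lower-priority timer interrupt out) come down to the same local APIC rule. The sketch below is a simplified model of that rule, not Xen code: the function names are hypothetical, any vector numbers passed in are made up for illustration, and it ignores EFLAGS.IF and everything else outside the priority comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Priority class of a vector (or of the TPR value): bits 7:4. */
static unsigned int prio_class(uint8_t v)
{
    return v >> 4;
}

/*
 * Simplified acceptance check for a fixed (maskable) interrupt.
 * The processor priority class is the higher of the TPR class and the
 * class of the highest in-service vector (one not yet EOI'd; pass 0 for
 * none). The interrupt is dispatched only if its own class is strictly
 * greater than that.
 */
static bool fixed_irq_deliverable(uint8_t vector, uint8_t tpr,
                                  uint8_t in_service)
{
    unsigned int tpr_class = prio_class(tpr);
    unsigned int isr_class = prio_class(in_service);
    unsigned int ppr_class = tpr_class > isr_class ? tpr_class : isr_class;

    return prio_class(vector) > ppr_class;
}

/* NMIs are not subject to the TPR, which is why the watchdog can
 * normally still fire on a CPU spinning at a high TPR. */
static bool nmi_deliverable(uint8_t tpr)
{
    (void)tpr;
    return true;
}
```

In this model a serial IRQ on a made-up vector 0x38 is blocked while the TPR is 0xF0, and a timer on vector 0x30 is blocked while an MSI on vector 0x40 is still in service, which is the shape of the missed-EOI report mentioned above.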
> I do have hopes a BIOS upgrade could fix this as a last resort but I
> want to see if at least I can understand the problem. We have a few
> different machines that are exhibiting similar symptoms so I have to
> see if I can find a work-around without requiring every user to
> upgrade their BIOS :-(
>
> Just in case, what debugger have you been using? Are there recent
> instructions on how to set it up that you can point me to?
I don't use a debugger on Xen. I usually find that by the time the
debugger kicks in it's too late to help, so I end up finding bugs by
code inspection and printks. :)
Mukesh Rathor at Oracle has done some debugger work, though, including
an in-Xen debugger. There's a gdb stub too but I suspect it's rotted
quite badly.
Cheers,
Tim.
> Thanks
> Roger
>
>
> -----Original Message-----
> From: Tim Deegan [mailto:Tim.Deegan@xxxxxxxxxx]
> Sent: Tue 9/14/2010 10:30 AM
> To: Roger Cruz
> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
> Subject: Re: [Xen-devel] State of current Xen debugger
>
> Hi,
>
> At 15:22 +0100 on 14 Sep (1284477779), Roger Cruz wrote:
> > I am trying to debug a problem where the hypervisor is hanging hard.
> > Not even the NMI watchdog is triggering a reboot. So I wanted to hook
> > up a debugger.
>
> Sorry to bring a counsel of despair but if the NMI watchdog isn't
> working then your chances of getting a working debugger are slim. It's
> likely that at least one CPU is very very stuck. Does the 'd' debug key
> work on the serial line when the machine is wedged?
>
> On a more cheerful note, I've twice seen hard hangs like this that
> turned out to be hardware issues, fixable with BIOS upgrades.
>
> Cheers,
>
> Tim.
>
> > What is the state of the current debuggers out there?
> > Any input on how I should set it up (kdb, gdb, etc) and pointers to a
> > good wiki page are much appreciated. I did perform a Google search
> > and found some links but I want to hear from the current developers as
> > to what is most stable and useful for debugging this type of hard
> > hang. I only have a serial port PCI-express card to use as the laptop
> > has no built in port.
>
> --
> Tim Deegan <Tim.Deegan@xxxxxxxxxx>
> Principal Software Engineer, XenServer Engineering
> Citrix Systems UK Ltd. (Company #02937203, SL9 0BG)
>
--
Tim Deegan <Tim.Deegan@xxxxxxxxxx>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd. (Company #02937203, SL9 0BG)
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel