Never saw this reply, sorry for the delay. Answers inline. (Still
seeing the issue.)
On Sat, Feb 26, 2011 at 2:50 PM, Todd Deshane <todd.deshane@xxxxxxx> wrote:
> On Sat, Feb 26, 2011 at 12:11 AM, Javier Frias <jfrias@xxxxxxxxx> wrote:
>> I posted a bug about this, but figured I'd ask the mailing list to see
>> if someone had seen this.
>> Bugzilla: http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1746
>>
>> Basically, I had a dom0 lock up for 4 hours after 57 days with no
>> issues: completely unresponsive, and then it recovered. The domUs were
>> unaffected, except that I could not shut them down (since dom0 was
>> unresponsive). I was, however, able to gain access via xapi/XenCenter,
>> and at least had some access (console, status, etc. all worked via
>> xapi).
>>
>
> Could you clarify this explanation a bit? What access was not
> available for 4 hours?
>
The dom0 was so loaded that ssh and any services running on it (SNMP,
for one) were simply unavailable. It was swapping and thoroughly
overloaded. I think this was due to the high I/O being done by one of
the guests, since I was able to log in to the host as one of the
events happened and saw this via top:
Tasks: 228 total, 2 running, 226 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.4%us, 0.0%sy, 0.0%ni, 98.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.8%st
Mem: 771328k total, 747572k used, 23756k free, 139952k buffers
Swap: 524280k total, 5440k used, 518840k free, 342188k cached

  PID USER PR NI VIRT RES  SHR  S   %CPU %MEM     TIME+ COMMAND
15715 root 20  0 3796 2388 1868 S 9293.8  0.3 686893:40 tapdisk2
24367 root 20  0 4128 2720 1896 S 8004.8  0.4 553094:39 tapdisk2
 3133 root 20  0 3928 2520 1868 S 5264.2  0.3 695773:20 tapdisk2
26586 root 20  0 4924 3516 1868 S 1370.3  0.5 450796:40 tapdisk2
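To figure out which VDI each of those tapdisk2 PIDs was serving, I
believe something like this works from the dom0 shell (a sketch:
tap-ctl ships with blktap2, and the sample output line is from memory,
not verbatim):

  # list running tapdisk2 instances and the images they serve
  tap-ctl list
  # roughly: pid=15715 minor=0 state=0 args=vhd:/path/to/<vdi-uuid>.vhd

  # then map a VDI uuid back to a human-readable name
  xe vdi-list uuid=<vdi-uuid> params=name-label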
Everything I've read says tapdisk2 is far more CPU-intensive than the
other drivers. Is there a way to use raw LVM in XCP? In our case I
think that would be the best choice, since we have a beefy disk
subsystem.
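For what it's worth, I understand XCP can create an LVM-backed local
SR via the xe CLI; a minimal sketch, assuming a spare block device at
/dev/sdb (the device path and name-label are placeholders):

  # find this host's uuid
  xe host-list params=uuid

  # create a local LVM-based storage repository on /dev/sdb
  xe sr-create host-uuid=<host-uuid> type=lvm content-type=user \
    name-label="Local LVM SR" device-config:device=/dev/sdb

Though from what I've read, recent XCP releases implement type=lvm as
VHD-on-LVM (LVHD), so the datapath may still go through tapdisk2; I'd
want to confirm that before counting on it as a fix.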
> You say you could access via xapi/xencenter; was this after the 4 hours
> or during?
>
Oddly enough, during, which was puzzling since every other service was
affected by the high load and swapping going on in the host. Things
like shutting down a VM did not work, though; it seemed only read-only
operations (like verifying VM running state and parameters) worked via
XenCenter or by hitting the API directly.
> Did you happen to look at the guest performance during those times?
> Was one of the guests doing a lot of disk I/O? Could you give some more
> information as to how the guests access their virtual disks (local,
> NFS, iSCSI, etc.) and any other information about your setup that
> could give us hints as to what might have caused this.
Yes, absolutely: two VMs on this host that locked up have what would
be considered high-I/O characteristics (one does lots of small-file
I/O, and the other just has large files being appended to).
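Next time it happens I can try to quantify that I/O from dom0 with
iostat (from the sysstat package, assuming it's installed in dom0):

  # extended per-device stats in kB, refreshed every 5 seconds
  iostat -xk 5
  # watch the await and %util columns for the SR's physical volume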
My hardware looks like the following (I use no shared storage):
Dell R710
72GB RAM
2 x X5650 @ 2.67GHz (12 physical cores, 12 additional threads)
6 x 600GB 15K disks in RAID 10
Dell H700 RAID controller (512MB version)
So the hardware should handle the I/O being done by the VMs with no problem.
The dom0 has the default CPU and RAM allocation (768MB and 4 vCPUs).
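One thing I'm considering is giving dom0 more memory than the 768MB
default, since it was swapping. As I understand it, on XCP that means
editing the Xen boot entry (the exact file and values vary by release,
so treat 2048M as a placeholder):

  # on the Xen (mboot.c32) line in /boot/extlinux.conf, append:
  #   dom0_mem=2048M,max:2048M
  # then reboot the host for the new allocation to take effect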
Any help is greatly appreciated.
Also, here's a kernel message from one of the VMs as it went nuts... (seems related):
===dmesg====
[6954775.046768] BUG: soft lockup - CPU#2 stuck for 61s! [apache2:20139]
[6954775.046776] Modules linked in: xenfs lp parport
[6954775.046784] CPU 2
[6954775.046786] Modules linked in: xenfs lp parport
[6954775.046793]
[6954775.046796] Pid: 20139, comm: apache2 Tainted: G D 2.6.35-22-virtual #34~lucid1-Ubuntu /
[6954775.046802] RIP: e030:[<ffffffff812526a5>] [<ffffffff812526a5>] sys_semtimedop+0x625/0x690
[6954775.046811] RSP: e02b:ffff8800fb0fbcf8 EFLAGS: 00000246
[6954775.046815] RAX: 0000000000000001 RBX: 0000000000430000 RCX: ffff8800fb0fbfd8
[6954775.046820] RDX: 0000000000000000 RSI: ffff8800eeb744a0 RDI: 00000000ffffffff
[6954775.046825] RBP: ffff8800fb0fbf68 R08: 0000000000000000 R09: 0000000000000000
[6954775.046830] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001
[6954775.046835] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8800fae5ee50
[6954775.046843] FS: 00007f3943fd2740(0000) GS:ffff880003e76000(0000) knlGS:0000000000000000
[6954775.046848] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[6954775.046852] CR2: 00007f9da4c3b000 CR3: 00000000fa0b3000 CR4: 0000000000002660
[6954775.046857] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[6954775.046863] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[6954775.046868] Process apache2 (pid: 20139, threadinfo ffff8800fb0fa000, task ffff8800fa9496e0)
[6954775.046874] Stack:
[6954775.046876]  ffff8800ffc39400 ffff8800fb0fbf28 ffff8800fa9496e0 ffffffff81a514a8
[6954775.046884] <0> ffff8800fa4ec060 0000000000000000 00000001810072d2 ffff8800fb0fbd48
[6954775.046893] <0> ffff8800fae5ee50 ffff8800fb0fbd48 ffff1000ffff0000 ffff8800fa9402e0
[6954775.046904] Call Trace:
[6954775.046909] [<ffffffff810072bf>] ? xen_restore_fl_direct_end+0x0/0x1
[6954775.046916] [<ffffffff8100611d>] ? xen_flush_tlb_single+0x9d/0xb0
[6954775.046921] [<ffffffff8100527f>] ? xen_set_pte_at+0x6f/0xf0
[6954775.046927] [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954775.046932] [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954775.046938] [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954775.046943] [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954775.046949] [<ffffffff810072bf>] ? xen_restore_fl_direct_end+0x0/0x1
[6954775.046954] [<ffffffff810041a1>] ? xen_clts+0x71/0x80
[6954775.046959] [<ffffffff8101407c>] ? restore_i387_xstate+0xcc/0x1c0
[6954775.046965] [<ffffffff81252720>] sys_semop+0x10/0x20
[6954775.046970] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[6954775.046974] Code: 57 48 45 85 f6 74 65 48 8b 4a 10 48 89 42 10 48 83 c2 08 48 89 95 60 ff ff ff 48 89 8d 68 ff ff ff 48 89 01 e9 29 fe ff ff f3 90 <e9> 63 fe ff ff 48 8b 95 60 ff ff ff 48 8b 85 68 ff ff ff 49 b8
[6954775.047036] Call Trace:
[6954775.047040] [<ffffffff810072bf>] ? xen_restore_fl_direct_end+0x0/0x1
[6954775.047045] [<ffffffff8100611d>] ? xen_flush_tlb_single+0x9d/0xb0
[6954775.047050] [<ffffffff8100527f>] ? xen_set_pte_at+0x6f/0xf0
[6954775.047055] [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954775.047061] [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954775.047066] [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954775.047071] [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954775.047077] [<ffffffff810072bf>] ? xen_restore_fl_direct_end+0x0/0x1
[6954775.047082] [<ffffffff810041a1>] ? xen_clts+0x71/0x80
[6954775.047087] [<ffffffff8101407c>] ? restore_i387_xstate+0xcc/0x1c0
[6954775.047092] [<ffffffff81252720>] sys_semop+0x10/0x20
[6954775.047097] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[6954777.197935] BUG: soft lockup - CPU#3 stuck for 61s! [apache2:20145]
[6954777.197949] Modules linked in: xenfs lp parport
[6954777.197959] CPU 3
[6954777.197961] Modules linked in: xenfs lp parport
[6954777.197969]
[6954777.197973] Pid: 20145, comm: apache2 Tainted: G D 2.6.35-22-virtual #34~lucid1-Ubuntu /
[6954777.197979] RIP: e030:[<ffffffff812526a5>] [<ffffffff812526a5>] sys_semtimedop+0x625/0x690
[6954777.197993] RSP: e02b:ffff880048ed3cf8 EFLAGS: 00000246
[6954777.197997] RAX: 0000000000000001 RBX: 0000000000430000 RCX: ffff880048ed3fd8
[6954777.198002] RDX: 0000000000000000 RSI: ffff8800032d16e0 RDI: 00000000ffffffff
[6954777.198007] RBP: ffff880048ed3f68 R08: 0000000000000000 R09: 0000000000000000
[6954777.198012] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001
[6954777.198017] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8800fae5ee50
[6954777.198027] FS: 00007f3943fd2740(0000) GS:ffff880003e94000(0000) knlGS:0000000000000000
[6954777.198032] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[6954777.198036] CR2: 00007f393dc39030 CR3: 00000000faf56000 CR4: 0000000000002660
[6954777.198042] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[6954777.198047] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[6954777.198052] Process apache2 (pid: 20145, threadinfo ffff880048ed2000, task ffff8800fb16c4a0)
[6954777.198057] Stack:
[6954777.198060]  0000000000000293 ffff880048ed3f28 ffff8800fb16c4a0 ffffffff81a514a8
[6954777.198068] <0> ffff8800fa4ec8a0 0000000000000000 0000000148ed3dd8 ffff880048ed3d48
[6954777.198077] <0> ffff8800fae5ee50 ffff880048ed3d48 ffff1000ffff0000 ffff8800fb2d4480
[6954777.198088] Call Trace:
[6954777.198097] [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954777.198104] [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954777.198109] [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954777.198115] [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954777.198123] [<ffffffff81036e88>] ? pvclock_clocksource_read+0x58/0xd0
[6954777.198129] [<ffffffff81007161>] ? xen_clocksource_read+0x21/0x30
[6954777.198137] [<ffffffff8108931a>] ? do_gettimeofday+0x1a/0x50
[6954777.198142] [<ffffffff81252720>] sys_semop+0x10/0x20
[6954777.198148] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[6954777.198152] Code: 57 48 45 85 f6 74 65 48 8b 4a 10 48 89 42 10 48 83 c2 08 48 89 95 60 ff ff ff 48 89 8d 68 ff ff ff 48 89 01 e9 29 fe ff ff f3 90 <e9> 63 fe ff ff 48 8b 95 60 ff ff ff 48 8b 85 68 ff ff ff 49 b8
[6954777.198218] Call Trace:
[6954777.198223] [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954777.198229] [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954777.198234] [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954777.198239] [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954777.198245] [<ffffffff81036e88>] ? pvclock_clocksource_read+0x58/0xd0
[6954777.198251] [<ffffffff81007161>] ? xen_clocksource_read+0x21/0x30
[6954777.198256] [<ffffffff8108931a>] ? do_gettimeofday+0x1a/0x50
[6954777.198261] [<ffffffff81252720>] sys_semop+0x10/0x20
[6954777.198267] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b