WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 
Xen 
 
Home Products Support Community News
 
   
 

xen-devel

RE: [Xen-devel] BUG: soft lockup - CPU#1 stuck for 61s! [udevd:4865]

To: "Jeremy Fitzhardinge" <jeremy@xxxxxxxx>
Subject: RE: [Xen-devel] BUG: soft lockup - CPU#1 stuck for 61s! [udevd:4865]
From: "James Harper" <james.harper@xxxxxxxxxxxxxxxx>
Date: Fri, 3 Sep 2010 10:35:34 +1000
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
Delivery-date: Thu, 02 Sep 2010 17:36:26 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <4C8040AE.7060902@xxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <AEC6C66638C05B468B556EA548C1A77D01A92A28@trantor> <4C8040AE.7060902@xxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: ActK/q882zvhES+FSZ6IH+PETbgLzAAAAuyw
Thread-topic: [Xen-devel] BUG: soft lockup - CPU#1 stuck for 61s! [udevd:4865]
> 
>  On 09/02/2010 05:20 PM, James Harper wrote:
> > I've had "BUG: soft lockup - CPU#1 stuck for 61s! [udevd:4865]"
happen
> > twice now and the server fairly quickly (over the course of 5-10
> > minutes) becomes unusable after that happens. The call trace appears
to
> > be:
> >
> > Call Trace:
> >  [<ffffffff8100e8e9>] ? xen_force_evtchn_callback+0x9/0xa
> >  [<ffffffff8100ef72>] ? check_events+0x12/0x20
> >  [<ffffffff8100ef19>] ? xen_irq_enable_direct_end+0x0/0x7
> >  [<ffffffff8102fcc1>] ? do_page_fault+0x9e/0x27b
> >  [<ffffffff812b39a5>] ? page_fault+0x25/0x30
> >  [<ffffffff8116f28d>] ? __put_user_4+0x1d/0x30
> >  [<ffffffff81045265>] ? schedule_tail+0x92/0x96
> >  [<ffffffff81011983>] ? ret_from_fork+0x13/0x80
> >
> > That was under 2.6.32.17-g69a73fa. I've since upgraded to
> > 2.6.32.18-ge6b9b2c but the changelogs don't show anything obviously
> > related to the crash (but as usual there are a lot of them so maybe
I've
> > missed something?)
> >
> > Is this a known problem under 2.6.32.17-g69a73fa that has since been
> > fixed?
> 
> Hard to know; nothing springs to mind right now.  What else is going
on
> at the time?  What's the full softlockup message?  Is there any more
> context?

I just noticed that it coincided pretty much exactly when a backup runs.
I take a snapshot in Dom0 then block-attach it to the linux DomU so it
can be backed up from DomU which makes restores easier etc, rather than
doing the backup in Dom0. I hadn't noticed before because it's the
Windows DomU that people complain about first, not the linux DomU.
Block-attach/detach would explain why udevd is involved too which makes
more sense. The block-attach happens every hour on the hour. The full
cut&paste of the first hang is:

Sep  3 09:00:42 keechsvr2 kernel: [235893.079566] blkback: ring-ref
1442, event-channel 13, protocol 1 (x86_64-abi)
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] BUG: soft lockup -
CPU#1 stuck for 61s! [udevd:4865]
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Modules linked in:
ipt_MASQUERADE xt_physdev pppoe pppox ipmi_devintf autofs4 tun
act_police sch_ingress cls_u32 sch_sfq sch_cbq ppp_generic slhc ipt_ULO
G xt_limit ipt_REJECT xt_state xt_dscp iptable_filter ipt_REDIRECT
xt_conntrack iptable_nat xt_TCPMSS xt_tcpmss xt_TCPOPTSTRIP xt_tcpudp
iptable_mangle ip_tables x_tables ipv6 8021q garp bridge stp llc n
f_nat_ftp nf_conntrack_ftp ip_gre nf_nat_pptp nf_conntrack_pptp
nf_conntrack_proto_gre nf_nat_proto_gre nf_nat nf_conntrack_ipv4
nf_conntrack nf_defrag_ipv4 xen_gntdev xen_evtchn xen_netback
xen_blkback
blkback_pagemap loop snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd
tpm_tis tpm sd_mod soundcore ipmi_si tpm_bios hpilo snd_page_alloc
ipmi_msghandler pcspkr psmouse serio_raw rng_core i5000_edac edac_c
ore i5k_amb button evdev xfs exportfs dm_mirror dm_region_hash dm_log
dm_snapshot dm_mod ide_cd_mod cdrom ata_piix ata_generic libata
usb_storage scsi_mod usbhid hid piix ide_pci_generic bnx2 ehci_hcd id
e_core uhci_hc
Sep  3 09:02:13 keechsvr2 kernel:  cciss thermal fan [last unloaded:
scsi_wait_scan]
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] CPU 1:
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Modules linked in:
ipt_MASQUERADE xt_physdev pppoe pppox ipmi_devintf autofs4 tun
act_police sch_ingress cls_u32 sch_sfq sch_cbq ppp_generic slhc ipt_ULO
G xt_limit ipt_REJECT xt_state xt_dscp iptable_filter ipt_REDIRECT
xt_conntrack iptable_nat xt_TCPMSS xt_tcpmss xt_TCPOPTSTRIP xt_tcpudp
iptable_mangle ip_tables x_tables ipv6 8021q garp bridge stp llc n
f_nat_ftp nf_conntrack_ftp ip_gre nf_nat_pptp nf_conntrack_pptp
nf_conntrack_proto_gre nf_nat_proto_gre nf_nat nf_conntrack_ipv4
nf_conntrack nf_defrag_ipv4 xen_gntdev xen_evtchn xen_netback
xen_blkback
blkback_pagemap loop snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd
tpm_tis tpm sd_mod soundcore ipmi_si tpm_bios hpilo snd_page_alloc
ipmi_msghandler pcspkr psmouse serio_raw rng_core i5000_edac edac_c
ore i5k_amb button evdev xfs exportfs dm_mirror dm_region_hash dm_log
dm_snapshot dm_mod ide_cd_mod cdrom ata_piix ata_generic libata
usb_storage scsi_mod usbhid hid piix ide_pci_generic bnx2 ehci_hcd id
e_core uhci_hc
Sep  3 09:02:13 keechsvr2 kernel:  cciss thermal fan [last unloaded:
scsi_wait_scan]
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Pid: 4865, comm: udevd
Not tainted 2.6.32.17-g69a73fa #1 ProLiant ML370 G5
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RIP:
e030:[<ffffffff8100922a>]  [<ffffffff8100922a>]
hypercall_page+0x22a/0x1001
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RSP:
e02b:ffff8800114d5dd0  EFLAGS: 00000246
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RAX: 0000000000040000
RBX: 0000000000000001 RCX: ffffffff8100922a
keechsvr2:/usr/local/src/linux-2.6-xen# head -100 /var/log/kern.log
Sep  3 07:00:44 keechsvr2 kernel: [228694.767691] blkback: ring-ref 311,
event-channel 13, protocol 1 (x86_64-abi)
Sep  3 08:00:42 keechsvr2 kernel: [232293.072248] blkback: ring-ref
1334, event-channel 13, protocol 1 (x86_64-abi)
Sep  3 09:00:42 keechsvr2 kernel: [235893.079566] blkback: ring-ref
1442, event-channel 13, protocol 1 (x86_64-abi)
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] BUG: soft lockup -
CPU#1 stuck for 61s! [udevd:4865]
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Modules linked in:
ipt_MASQUERADE xt_physdev pppoe pppox ipmi_devintf autofs4 tun
act_police sch_ingress cls_u32 sch_sfq sch_cbq ppp_generic slhc ipt_ULOG
xt_limit ipt_REJECT xt_state xt_dscp iptable_filter ipt_REDIRECT
xt_conntrack iptable_nat xt_TCPMSS xt_tcpmss xt_TCPOPTSTRIP xt_tcpudp
iptable_mangle ip_tables x_tables ipv6 8021q garp bridge stp llc
nf_nat_ftp nf_conntrack_ftp ip_gre nf_nat_pptp nf_conntrack_pptp
nf_conntrack_proto_gre nf_nat_proto_gre nf_nat nf_conntrack_ipv4
nf_conntrack nf_defrag_ipv4 xen_gntdev xen_evtchn xen_netback
xen_blkback blkback_pagemap loop snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd tpm_tis tpm sd_mod soundcore ipmi_si tpm_bios hpilo
snd_page_alloc ipmi_msghandler pcspkr psmouse serio_raw rng_core
i5000_edac edac_core i5k_amb button evdev xfs exportfs dm_mirror
dm_region_hash dm_log dm_snapshot dm_mod ide_cd_mod cdrom ata_piix
ata_generic libata usb_storage scsi_mod usbhid hid piix ide_pci_generic
bnx2 ehci_hcd ide_core uhci_hc
Sep  3 09:02:13 keechsvr2 kernel:  cciss thermal fan [last unloaded:
scsi_wait_scan]
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] CPU 1:
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Modules linked in:
ipt_MASQUERADE xt_physdev pppoe pppox ipmi_devintf autofs4 tun
act_police sch_ingress cls_u32 sch_sfq sch_cbq ppp_generic slhc ipt_ULOG
xt_limit ipt_REJECT xt_state xt_dscp iptable_filter ipt_REDIRECT
xt_conntrack iptable_nat xt_TCPMSS xt_tcpmss xt_TCPOPTSTRIP xt_tcpudp
iptable_mangle ip_tables x_tables ipv6 8021q garp bridge stp llc
nf_nat_ftp nf_conntrack_ftp ip_gre nf_nat_pptp nf_conntrack_pptp
nf_conntrack_proto_gre nf_nat_proto_gre nf_nat nf_conntrack_ipv4
nf_conntrack nf_defrag_ipv4 xen_gntdev xen_evtchn xen_netback
xen_blkback blkback_pagemap loop snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd tpm_tis tpm sd_mod soundcore ipmi_si tpm_bios hpilo
snd_page_alloc ipmi_msghandler pcspkr psmouse serio_raw rng_core
i5000_edac edac_core i5k_amb button evdev xfs exportfs dm_mirror
dm_region_hash dm_log dm_snapshot dm_mod ide_cd_mod cdrom ata_piix
ata_generic libata usb_storage scsi_mod usbhid hid piix ide_pci_generic
bnx2 ehci_hcd ide_core uhci_hc
Sep  3 09:02:13 keechsvr2 kernel:  cciss thermal fan [last unloaded:
scsi_wait_scan]
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Pid: 4865, comm: udevd
Not tainted 2.6.32.17-g69a73fa #1 ProLiant ML370 G5
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RIP:
e030:[<ffffffff8100922a>]  [<ffffffff8100922a>]
hypercall_page+0x22a/0x1001
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RSP:
e02b:ffff8800114d5dd0  EFLAGS: 00000246
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RAX: 0000000000040000
RBX: 0000000000000001 RCX: ffffffff8100922a
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RDX: ffffffff814293d0
RSI: 0000000000000000 RDI: 0000000000000000
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RBP: 00007fbc628a9800
R08: 0000000000000000 R09: 0000000000000000
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] R10: 0000000000000002
R11: 0000000000000246 R12: 0000000000000002
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] R13: ffff8800114d5e88
R14: ffff880017f9cfc0 R15: ffff880017eed500
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] FS:
00007fbc628a9770(0000) GS:ffff880002de8000(0000) knlGS:0000000000000000
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] CS:  e033 DS: 0000 ES:
0000 CR0: 000000008005003b
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] CR2: 00007fbc628a9800
CR3: 000000001e2d7000 CR4: 0000000000002660
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] DR0: 0000000000000000
DR1: 0000000000000000 DR2: 0000000000000000
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] DR3: 0000000000000000
DR6: 00000000ffff0ff0 DR7: 0000000000000400
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Call Trace:
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff8100e8e9>]
? xen_force_evtchn_callback+0x9/0xa
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff8100ef72>]
? check_events+0x12/0x20
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff8100ef19>]
? xen_irq_enable_direct_end+0x0/0x7
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff8102fcc1>]
? do_page_fault+0x9e/0x27b
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff812b39a5>]
? page_fault+0x25/0x30
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff8116f28d>]
? __put_user_4+0x1d/0x30
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff81045265>]
? schedule_tail+0x92/0x96
Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff81011983>]
? ret_from_fork+0x13/0x80

There's another hang at 09:03:09 which I can post if you'd find it
useful.

If the block-attach happened at 09:00:42 and the crash happened at
09:02:13 then I assume that whatever udevd started doing started at
09:01:12 (eg 61 seconds earlier) which means that it's more likely to be
the block-detach at the end of the backup causing the hang not the
block-attach. The backup runs pretty quick.

I'm inclined to turn off the snapshot backup if this isn't a known and
resolved problem...

Thanks

James


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel

<Prev in Thread] Current Thread [Next in Thread>