xen-users

[Xen-users] Task Blocking / Domu Lockups

To: "xen-users@xxxxxxxxxxxxxxxxxxx" <xen-users@xxxxxxxxxxxxxxxxxxx>
Subject: [Xen-users] Task Blocking / Domu Lockups
From: Richard Maynard / Wessex Networks <rjm@xxxxxxxxxxxxxxxxxx>
Date: Fri, 2 Sep 2011 10:51:46 +0000
Hi All,

We have been seeing DomUs lock up for some time now, apparently under moderate
I/O load, on two different Xen hosts.  Everything is Debian: the Dom0 is
Squeeze and the DomUs are a mixture of Lenny and Squeeze, and both crash in
the same way.

The DomUs and the Dom0 are running the latest Squeeze kernel 
(2.6.32-5-xen-amd64) and Xen is 4.0.1-2.

The block device (or the kernel's handling of it) is probably closer to the
cause of the problem than a bug in any individual task: with enough console
output you can see several tasks block at the same time, and different tasks
are affected in separate incidents.  A couple of excerpts from the console
are below:

[581606.222303] INFO: task syslogd:1142 blocked for more than 120 seconds.
[581606.222321] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[581606.222329] syslogd       D ffff8800f9eafc78     0  1142      1
[581606.222338]  ffff8800f9eafda8 0000000000000286 ffff880003d74fe8 ffff8800060f18f0
[581606.222349]  ffff8800f9e30440 ffff8800d8d8c940 ffff8800f9e306c0 0000000000000000
[581606.222360]  ffff880000000005 0000000000138512 ffff8800f7c41cc0 ffff88000000000f
[581606.222368] Call Trace:
[581606.222382]  [<ffffffff8022383e>] __wake_up+0x38/0x4f
[581606.222395]  [<ffffffffa0032067>] :jbd:log_wait_commit+0xb6/0x11f
[581606.222403]  [<ffffffff8023f64d>] autoremove_wake_function+0x0/0x2e
[581606.222413]  [<ffffffffa002d552>] :jbd:journal_stop+0x198/0x1f3
[581606.222421]  [<ffffffff802a7eec>] __writeback_single_inode+0x1bc/0x2da
[581606.222429]  [<ffffffff8028a992>] do_readv_writev+0x176/0x18b
[581606.222436]  [<ffffffff802a898d>] sync_inode+0x24/0x53
[581606.222453]  [<ffffffffa003e48a>] :ext3:ext3_sync_file+0x9e/0xb0
[581606.222460]  [<ffffffff802aafc6>] do_fsync+0x52/0xa4
[581606.222467]  [<ffffffff802ab03b>] __do_fsync+0x23/0x36
[581606.222473]  [<ffffffff8020b528>] system_call+0x68/0x6d
[581606.222479]  [<ffffffff8020b4c0>] system_call+0x0/0x6d
[581606.222484]

[581376.493333] INFO: task apache2:14097 blocked for more than 120 seconds.
[581376.493348] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[581376.493356] apache2       D ffffffff8044af00     0 14097  26200
[581376.493365]  ffff8800d0091de0 0000000000000286 0000000000000000 ffff8800f759aec0
[581376.493375]  ffff8800d8f17440 ffffffff804ff460 ffff8800d8f176c0 00000000d0091e68
[581376.493385]  00000000ffffffff 0000000000000000 ffff880073859000 ffff8800f74a76c4
[581376.493394] Call Trace:
[581376.493408]  [<ffffffff8029443f>] path_walk+0x7e/0x8b
[581376.493415]  [<ffffffff80294733>] do_path_lookup+0x158/0x1ce
[581376.493423]  [<ffffffff804356ad>] __mutex_lock_slowpath+0x79/0xc7
[581376.493430]  [<ffffffff80435482>] mutex_lock+0xa/0xb
[581376.493435]  [<ffffffff8029542a>] do_filp_open+0x11a/0x7c4
[581376.493445]  [<ffffffff80288b3b>] get_unused_fd_flags+0x74/0x13f
[581376.493452]  [<ffffffff80288c4c>] do_sys_open+0x46/0xc3
[581376.493458]  [<ffffffff8020b528>] system_call+0x68/0x6d
[581376.493464]  [<ffffffff8020b4c0>] system_call+0x0/0x6d
[581376.493471]

[1426201.768058] INFO: task sshd:772 blocked for more than 120 seconds.
[1426201.768058] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1426201.768058] sshd          D 0000000000000000     0   772      1 0x00000000
[1426201.768058]  ffffffff814791f0 0000000000000282 0000000000000000 ffff88000edc35b0
[1426201.768058]  ffff88000edc3690 ffffffff8117fd56 000000000000f9e0 ffff88000edc3fd8
[1426201.768058]  0000000000015780 0000000000015780 ffff88000284f100 ffff88000284f3f8
[1426201.768058] Call Trace:
[1426201.768058]  [<ffffffff8117fd56>] ? blk_peek_request+0x18b/0x19f
[1426201.768058]  [<ffffffff8102ddcc>] ? pvclock_clocksource_read+0x3a/0x8b
[1426201.768058]  [<ffffffff8130c16a>] ? io_schedule+0x73/0xb7
[1426201.768058]  [<ffffffff81180b77>] ? get_request_wait+0xf0/0x188
[1426201.768058]  [<ffffffff81065f06>] ? autoremove_wake_function+0x0/0x2e
[1426201.768058]  [<ffffffff81180f06>] ? __make_request+0x2f7/0x428
[1426201.768058]  [<ffffffff81192e43>] ? radix_tree_tag_clear+0x93/0xf1
[1426201.768058]  [<ffffffff8117f6e3>] ? generic_make_request+0x299/0x2f9
[1426201.768058]  [<ffffffff8100e629>] ? xen_force_evtchn_callback+0x9/0xa
[1426201.768058]  [<ffffffff8100ece2>] ? check_events+0x12/0x20
[1426201.768058]  [<ffffffff810bc7ce>] ? __set_page_dirty_nobuffers+0x0/0xfa
[1426201.768058]  [<ffffffff8117f819>] ? submit_bio+0xd6/0xf2
[1426201.768058]  [<ffffffff810bb841>] ? test_set_page_writeback+0xe0/0xef
[1426201.768058]  [<ffffffff810d9a70>] ? swap_writepage+0x9b/0xa5
[1426201.768058]  [<ffffffff810bf3c1>] ? shrink_page_list+0x375/0x623
[1426201.768058]  [<ffffffff8100e629>] ? xen_force_evtchn_callback+0x9/0xa
[1426201.768058]  [<ffffffff8100ece2>] ? check_events+0x12/0x20
[1426201.768058]  [<ffffffff810bfda4>] ? shrink_list+0x45c/0x767
[1426201.768058]  [<ffffffff81042abe>] ? pick_next_task_fair+0xca/0xd6
[1426201.768058]  [<ffffffff8100eccf>] ? xen_restore_fl_direct_end+0x0/0x1
[1426201.768058]  [<ffffffff8130d42a>] ? _spin_unlock_irqrestore+0xd/0xe
[1426201.768058]  [<ffffffff8105b8c8>] ? try_to_del_timer_sync+0x63/0x6c
[1426201.768058]  [<ffffffff810c032f>] ? shrink_zone+0x280/0x342
[1426201.768058]  [<ffffffff8130d42a>] ? _spin_unlock_irqrestore+0xd/0xe
[1426201.768058]  [<ffffffff810c94f8>] ? congestion_wait+0x74/0x80
[1426201.768058]  [<ffffffff81065f06>] ? autoremove_wake_function+0x0/0x2e
[1426201.768058]  [<ffffffff810c13f6>] ? try_to_free_pages+0x232/0x38e
[1426201.768058]  [<ffffffff810be3eb>] ? isolate_pages_global+0x0/0x20f
[1426201.768058]  [<ffffffff810fdb83>] ? pollwake+0x0/0x59
[1426201.768058]  [<ffffffff810bb484>] ? __alloc_pages_nodemask+0x3cd/0x5f5
[1426201.768058]  [<ffffffff810ba60f>] ? __get_free_pages+0x9/0x46
[1426201.768058]  [<ffffffff8104d4f6>] ? copy_process+0xd7/0x115f
[1426201.768058]  [<ffffffff811542f6>] ? cap_d_instantiate+0x0/0x1
[1426201.768058]  [<ffffffff8100eccf>] ? xen_restore_fl_direct_end+0x0/0x1
[1426201.768058]  [<ffffffff8100e629>] ? xen_force_evtchn_callback+0x9/0xa
[1426201.768058]  [<ffffffff8100ece2>] ? check_events+0x12/0x20
[1426201.768058]  [<ffffffff811542f6>] ? cap_d_instantiate+0x0/0x1
[1426201.768058]  [<ffffffff8100eccf>] ? xen_restore_fl_direct_end+0x0/0x1
[1426201.768058]  [<ffffffff8104e6d5>] ? do_fork+0x157/0x31e
[1426201.768058]  [<ffffffff81118548>] ? inotify_d_instantiate+0x12/0x39
[1426201.768058]  [<ffffffff812510d3>] ? sock_attach_fd+0x91/0xbf
[1426201.768058]  [<ffffffff810ee05f>] ? fd_install+0x2e/0x5a
[1426201.768058]  [<ffffffff81011e63>] ? stub_clone+0x13/0x20
[1426201.768058]  [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b
[1426201.768058] INFO: task master:845 blocked for more than 120 seconds.
[1426201.768058] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1426201.768058] master        D 0000000000000000     0   845      1 0x00000000
[1426201.768058]  ffffffff814791f0 0000000000000286 0000000000000000 ffff88000ebcd588
[1426201.768058]  ffff88000ebcd668 ffffffff8117fd56 000000000000f9e0 ffff88000ebcdfd8
[1426201.768058]  0000000000015780 0000000000015780 ffff88000fd1f810 ffff88000fd1fb08
[1426201.768058] Call Trace:
[1426201.768058]  [<ffffffff8117fd56>] ? blk_peek_request+0x18b/0x19f
[1426201.768058]  [<ffffffff8102ddcc>] ? pvclock_clocksource_read+0x3a/0x8b
[1426201.768058]  [<ffffffff8130c16a>] ? io_schedule+0x73/0xb7
[1426201.768058]  [<ffffffff81180b77>] ? get_request_wait+0xf0/0x188
[1426201.768058]  [<ffffffff810bee23>] ? move_active_pages_to_lru+0xf3/0x126
[1426201.768058]  [<ffffffff81065f06>] ? autoremove_wake_function+0x0/0x2e
[1426201.768058]  [<ffffffff81180f06>] ? __make_request+0x2f7/0x428
[1426201.768058]  [<ffffffff8100ece2>] ? check_events+0x12/0x20
[1426201.768058]  [<ffffffff81192e43>] ? radix_tree_tag_clear+0x93/0xf1
[1426201.768058]  [<ffffffff8117f6e3>] ? generic_make_request+0x299/0x2f9
[1426201.768058]  [<ffffffff8100e629>] ? xen_force_evtchn_callback+0x9/0xa
[1426201.768058]  [<ffffffff8100ece2>] ? check_events+0x12/0x20
[1426201.768058]  [<ffffffff8118f534>] ? cpumask_any_but+0x28/0x34
[1426201.768058]  [<ffffffff8117f819>] ? submit_bio+0xd6/0xf2
[1426201.768058]  [<ffffffff810bb841>] ? test_set_page_writeback+0xe0/0xef
[1426201.768058]  [<ffffffff810d9a70>] ? swap_writepage+0x9b/0xa5
[1426201.768058]  [<ffffffff810bf3c1>] ? shrink_page_list+0x375/0x623
[1426201.768058]  [<ffffffff810bfda4>] ? shrink_list+0x45c/0x767
[1426201.768058]  [<ffffffff810bbfd0>] ? determine_dirtyable_memory+0xd/0x1d
[1426201.768058]  [<ffffffff810bc048>] ? get_dirty_limits+0x1d/0x259
[1426201.768058]  [<ffffffffa00380ba>] ? journal_cancel_revoke+0xc3/0xec [jbd]
[1426201.768058]  [<ffffffff810c032f>] ? shrink_zone+0x280/0x342
[1426201.768058]  [<ffffffffa002c226>] ? mb_cache_shrink_fn+0x26/0x129 [mbcache]
[1426201.768058]  [<ffffffff810c0532>] ? shrink_slab+0x141/0x153
[1426201.768058]  [<ffffffff810c13f6>] ? try_to_free_pages+0x232/0x38e
[1426201.768058]  [<ffffffff810be3eb>] ? isolate_pages_global+0x0/0x20f
[1426201.768058]  [<ffffffff810bb484>] ? __alloc_pages_nodemask+0x3cd/0x5f5
[1426201.768058]  [<ffffffff810cc224>] ? do_wp_page+0x386/0x707
[1426201.768058]  [<ffffffff810efa56>] ? do_sync_write+0xce/0x113
[1426201.768058]  [<ffffffff8100c3a5>] ? __raw_callee_save_xen_pud_val+0x11/0x1e
[1426201.768058]  [<ffffffff8100c369>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
[1426201.768058]  [<ffffffff810cdfc7>] ? handle_mm_fault+0x7aa/0x80f
[1426201.768058]  [<ffffffff8115421a>] ? cap_cred_commit+0x0/0x1
[1426201.768058]  [<ffffffff8130f906>] ? do_page_fault+0x2e0/0x2fc
[1426201.768058]  [<ffffffff8130d7a5>] ? page_fault+0x25/0x30
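
To compare incidents, console output like the above can be reduced to a list
of blocked tasks and the symbols their traces have in common.  What follows is
a minimal sketch in Python, assuming the log has been captured to text with
each kernel line unwrapped.  The names summarize, shared_frames and SAMPLE are
made up for illustration, and SAMPLE is only a shortened copy of the sshd and
master traces above.  Frames the kernel marks with "?" are not guaranteed to
be part of the real call chain, so a shared symbol is a hint, not proof.

```python
import re

# Header printed by the kernel's hung-task detector, e.g.
# "[581606.222303] INFO: task syslogd:1142 blocked for more than 120 seconds."
HEADER = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than \d+ seconds"
)

# One call-trace frame, in either of the two styles seen in the excerpts:
#   " [<ffffffff8130c16a>] ? io_schedule+0x73/0xb7"
#   " [<ffffffffa0032067>] :jbd:log_wait_commit+0xb6/0x11f"
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?(?::\w+:)?(?P<sym>[A-Za-z_][\w.]*)\+0x"
)

# Shortened illustrative input taken from the traces in this message.
SAMPLE = """\
[1426201.768058] INFO: task sshd:772 blocked for more than 120 seconds.
[1426201.768058] Call Trace:
[1426201.768058]  [<ffffffff8130c16a>] ? io_schedule+0x73/0xb7
[1426201.768058]  [<ffffffff81180b77>] ? get_request_wait+0xf0/0x188
[1426201.768058]  [<ffffffff8104e6d5>] ? do_fork+0x157/0x31e
[1426201.768058] INFO: task master:845 blocked for more than 120 seconds.
[1426201.768058] Call Trace:
[1426201.768058]  [<ffffffff8130c16a>] ? io_schedule+0x73/0xb7
[1426201.768058]  [<ffffffff81180b77>] ? get_request_wait+0xf0/0x188
[1426201.768058]  [<ffffffff8130f906>] ? do_page_fault+0x2e0/0x2fc
"""


def summarize(log_text):
    """Return one dict per hung-task report: name, pid and frame symbols."""
    tasks = []
    current = None
    for line in log_text.splitlines():
        header = HEADER.search(line)
        if header:
            current = {
                "name": header["name"],
                "pid": int(header["pid"]),
                "frames": [],
            }
            tasks.append(current)
            continue
        frame = FRAME.search(line)
        # Frames seen before any header cannot be attributed, so skip them.
        if frame and current is not None:
            current["frames"].append(frame["sym"])
    return tasks


def shared_frames(tasks):
    """Return the sorted symbols that appear in every task's trace."""
    if len(tasks) < 2:
        return []
    common = set(tasks[0]["frames"])
    for task in tasks[1:]:
        common &= set(task["frames"])
    return sorted(common)


if __name__ == "__main__":
    reports = summarize(SAMPLE)
    for report in reports:
        print(report["name"], report["pid"], len(report["frames"]), "frames")
    print("shared:", ", ".join(shared_frames(reports)))
```

Run against a full capture, symbols such as io_schedule and get_request_wait
turning up in every report would be consistent with tasks waiting on the
block layer and not on anything specific to the program itself.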

Has anybody seen this before?  Is there a fix or workaround, or should we be
trying (or building) different kernels for the DomUs?

Thanks in advance!

Regards,

Richard Maynard

Wessex Networks
Linchmere Place
Ifield
Crawley
West Sussex
RH11 0EX
www.wessexnetworks.com rjm@xxxxxxxxxxxxxxxxxx
T: 01293 542080 F: 01293 553849
Twitter: @wessexnetworks

