Re: [Xen-devel] remus trouble
On Tuesday, 06 July 2010 at 13:36, Nomen Nescio wrote:
> Hey all,
>
> I ran into some problems trying remus on xen4.0.1rc4 with the 2.6.31.13
> dom0 (checkout from yesterday):
What's your domU kernel? pvops support was recently added to dom0, but
still doesn't work for domU.
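A quick way to check is from the guest console (the domain name
"guest1" below is a placeholder):

    # On dom0: attach to the guest's console
    xm console guest1
    # Inside the guest: report its kernel version
    uname -r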
> Upon creating a VM with remus tapdisk, the domU starts up fine, but
> keeps hanging on mountall. The tapdisk process is only active on
> server1. The domU shows up in xm list as blocked. Once remus is started,
> the tapdisk process also shows up on the backup server, but where I was
> expecting the domU to show up in paused state, it appeared as blocked.
> Console output on the initial server remains unresponsive, but on the
> backup server a login shell appears. After logging in it turns out the
> disk is read-only.
The tapdisk process on the backup is created when you start Remus on
the primary. Until you start Remus, writes to the primary will block,
so you'll typically want to start Remus close to the time of the first
write to the primary disk (mount is a good time). If the backup shows
up as blocked while the primary is still running, you're in a
split-brain state (both sides running), caused by a checkpointing
timeout being interpreted as a guest failure. I'm guessing from the
logs you posted that the guest kernel on the primary is not suspending
correctly.
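For reference, a minimal sketch of the intended sequence (the domain
name "guest1" and backup hostname "backup" are placeholders, and the
checkpoint interval is just an example value):

    # On the primary: create the guest; its first disk write will
    # block until Remus is running
    xm create /etc/xen/guest1.cfg

    # Start Remus promptly so the buffered writes get replicated;
    # -i sets the checkpoint interval in milliseconds
    remus -i 100 guest1 backup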
> Remus doesn't give any output anymore after the first two syncs and the
> following errors show up in dmesg on the primary server:
Are these in dom0 or the primary domU? Looks a bit like dom0, but I
haven't seen these before.
> [ 6600.224933] INFO: task xenwatch:55 blocked for more than 120 seconds.
> [ 6600.224969] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [ 6600.225025] xenwatch D ffff8802d7511040 0 55 2
> 0x00000000
> [ 6600.225071] ffff8802e2e65c10 0000000000000246 ffff8802d5011d68
> ffffffff8103819d
> [ 6600.225145] ffff8802e2e65be0 ffffffff81038a12 0000000000000000
> 00000000ffffffff
> [ 6600.225217] ffff8802e2e683b0 000000000000f668 00000000000153c0
> ffff8802e2e683b0
> [ 6600.225287] Call Trace:
> [ 6600.225320] [<ffffffff8103819d>] ?
> xen_force_evtchn_callback+0xd/0x10
> [ 6600.225357] [<ffffffff81038a12>] ? check_events+0x12/0x20
> [ 6600.225396] [<ffffffff810389ff>] ? xen_restore_fl_direct_end+0x0/0x1
> [ 6600.225432] [<ffffffff81577aa7>] ? _spin_unlock_irqrestore+0x37/0x60
> [ 6600.225468] [<ffffffff81344c55>] blkif_disconnect+0x75/0xf0
> [ 6600.225502] [<ffffffff810af240>] ? autoremove_wake_function+0x0/0x40
> [ 6600.225542] [<ffffffff8114fd80>] ? kfree+0xf0/0x180
> [ 6600.225576] [<ffffffff8134448c>] blkback_remove+0x4c/0x90
> [ 6600.225609] [<ffffffff8133e890>] xenbus_dev_remove+0x50/0x70
> [ 6600.225644] [<ffffffff8138b9d8>] __device_release_driver+0x58/0xb0
> [ 6600.225679] [<ffffffff8138bb4d>] device_release_driver+0x2d/0x40
> [ 6600.225712] [<ffffffff8138ac0a>] bus_remove_device+0x9a/0xc0
> [ 6600.225744] [<ffffffff81388da7>] device_del+0x127/0x1d0
> [ 6600.225777] [<ffffffff81388e66>] device_unregister+0x16/0x30
> [ 6600.225813] [<ffffffff813441a0>] frontend_changed+0x90/0x2a0
> [ 6600.225848] [<ffffffff8133eb82>] xenbus_otherend_changed+0xb2/0xc0
> [ 6600.225883] [<ffffffff81577aa7>] ? _spin_unlock_irqrestore+0x37/0x60
> [ 6600.225918] [<ffffffff8133f150>] frontend_changed+0x10/0x20
> [ 6600.225951] [<ffffffff8133c794>] xenwatch_thread+0xb4/0x190
> [ 6600.225986] [<ffffffff810af240>] ? autoremove_wake_function+0x0/0x40
> [ 6600.226023] [<ffffffff8133c6e0>] ? xenwatch_thread+0x0/0x190
> [ 6600.226060] [<ffffffff810aecb6>] kthread+0xa6/0xb0
> [ 6600.226091] [<ffffffff8103f3ea>] child_rip+0xa/0x20
> [ 6600.226125] [<ffffffff8103e5d7>] ? int_ret_from_sys_call+0x7/0x1b
> [ 6600.226158] [<ffffffff8103ed5d>] ? retint_restore_args+0x5/0x6
> [ 6600.226193] [<ffffffff8103f3e0>] ? child_rip+0x0/0x20
>
> soon followed by:
>
> [ 6691.837161] ------------[ cut here ]------------
> [ 6691.837163] kernel BUG at
> /usr/src/xen-4.0-testing.hg/linux-2.6-pvops.git/fs/sysfs/dir.c:511!
> [ 6691.837165] invalid opcode: 0000 [#1] SMP
> [ 6691.837167] last sysfs file:
> /sys/devices/pci0000:00/0000:00:05.0/0000:1a:00.0/host4/target4:2:0/4:2:0:0/block/sda/sda6/uevent
> [ 6691.837169] CPU 1
> [ 6691.837171] Modules linked in: iptable_filter ip_tables bridge stp
> qla4xxx scsi_transport_iscsi qla3xxx sch_queue bonding ebt_imq x_tables
> lp parport serio_raw shpchp joydev bnx2 ses enclosure usbhid hid
> megaraid_sas e1000e
> [ 6691.837191] Pid: 55, comm: xenwatch Not tainted 2.6.31.13 #4 System
> x3650 M3 -[7945M2G]-
> [ 6691.837192] RIP: e030:[<ffffffff811c2b4b>] [<ffffffff811c2b4b>]
> sysfs_remove_one+0x9b/0xa0
> [ 6691.837199] RSP: e02b:ffff8802e2e65a30 EFLAGS: 00010202
> [ 6691.837201] RAX: 0000000000000000 RBX: ffff8802e2e65a40 RCX:
> 0000000000000201
> [ 6691.837202] RDX: ffffffff81a1a9a0 RSI: ffff8802644649b0 RDI:
> ffff8802e2e65a40
> [ 6691.837204] RBP: ffff8802e2e65a30 R08: 000000004e84bbbc R09:
> 0000000000000000
> [ 6691.837206] R10: 0000000000000005 R11: 0000000000000000 R12:
> ffff8802644649b0
> [ 6691.837207] R13: ffff8802644649b0 R14: ffff880264464780 R15:
> 000000000000000a
> [ 6691.837212] FS: 00007f723b663710(0000) GS:ffffc90000018000(0000)
> knlGS:0000000000000000
> [ 6691.837214] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 6691.837215] CR2: 00007f47c3dfe000 CR3: 00000002da830000 CR4:
> 0000000000002660
> [ 6691.837217] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 6691.837219] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [ 6691.837221] Process xenwatch (pid: 55, threadinfo ffff8802e2e64000,
> task ffff8802e2e68000)
> [ 6691.837222] Stack:
> [ 6691.837224] ffff8802e2e65a70 ffffffff811c3863 ffff880264464780
> 0000000000000000
> [ 6691.837227] <0> 0000000000000000 0000000000000000 ffffffff81865320
> ffffffff81865348
> [ 6691.837231] <0> ffff8802e2e65a80 ffffffff811c388e ffff8802e2e65ab0
> ffffffff811c4ed6
> [ 6691.837235] Call Trace:
> [ 6691.837239] [<ffffffff811c3863>] remove_dir+0x33/0x50
> [ 6691.837242] [<ffffffff811c388e>] sysfs_remove_subdir+0xe/0x10
> [ 6691.837244] [<ffffffff811c4ed6>] sysfs_remove_group+0x76/0x100
> [ 6691.837248] [<ffffffff8138fd09>] dpm_sysfs_remove+0x19/0x20
> [ 6691.837252] [<ffffffff81388cc9>] device_del+0x49/0x1d0
> [ 6691.837254] [<ffffffff81388e66>] device_unregister+0x16/0x30
> [ 6691.837258] [<ffffffff81130fff>] bdi_unregister+0x3f/0x60
> [ 6691.837262] [<ffffffff812b60d4>] unlink_gendisk+0x34/0x60
> [ 6691.837265] [<ffffffff811bbcec>] del_gendisk+0x9c/0x150
> [ 6691.837268] [<ffffffff812ab8ba>] ? blk_remove_plug+0x4a/0xd0
> [ 6691.837271] [<ffffffff813475f3>] blktap_device_destroy+0xc3/0x130
> [ 6691.837274] [<ffffffff81347688>] blktap_device_release+0x28/0x30
> [ 6691.837278] [<ffffffff81187cf1>] __blkdev_put+0xd1/0x1a0
> [ 6691.837281] [<ffffffff81187dd0>] blkdev_put+0x10/0x20
> [ 6691.837284] [<ffffffff81344fea>] vbd_free+0x2a/0x40
> [ 6691.837287] [<ffffffff81344499>] blkback_remove+0x59/0x90
> [ 6691.837289] [<ffffffff8133e890>] xenbus_dev_remove+0x50/0x70
> [ 6691.837292] [<ffffffff8138b9d8>] __device_release_driver+0x58/0xb0
> [ 6691.837295] [<ffffffff8138bb4d>] device_release_driver+0x2d/0x40
> [ 6691.837298] [<ffffffff8138ac0a>] bus_remove_device+0x9a/0xc0
> [ 6691.837301] [<ffffffff81388da7>] device_del+0x127/0x1d0
> [ 6691.837303] [<ffffffff81388e66>] device_unregister+0x16/0x30
> [ 6691.837306] [<ffffffff813441a0>] frontend_changed+0x90/0x2a0
> [ 6691.837309] [<ffffffff8133eb82>] xenbus_otherend_changed+0xb2/0xc0
> [ 6691.837313] [<ffffffff81577aa7>] ? _spin_unlock_irqrestore+0x37/0x60
> [ 6691.837316] [<ffffffff8133f150>] frontend_changed+0x10/0x20
> [ 6691.837318] [<ffffffff8133c794>] xenwatch_thread+0xb4/0x190
> [ 6691.837321] [<ffffffff810af240>] ? autoremove_wake_function+0x0/0x40
> [ 6691.837324] [<ffffffff8133c6e0>] ? xenwatch_thread+0x0/0x190
> [ 6691.837327] [<ffffffff810aecb6>] kthread+0xa6/0xb0
> [ 6691.837331] [<ffffffff8103f3ea>] child_rip+0xa/0x20
> [ 6691.837334] [<ffffffff8103e5d7>] ? int_ret_from_sys_call+0x7/0x1b
> [ 6691.837337] [<ffffffff8103ed5d>] ? retint_restore_args+0x5/0x6
> [ 6691.837340] [<ffffffff8103f3e0>] ? child_rip+0x0/0x20
> [ 6691.837341] Code: 48 8b 46 10 48 89 02 48 c7 46 10 00 00 00 00 8b 4e
> 30 eb c3 0f 1f 40 00 48 8b 47 08 48 85 c0 74 ce 83 68 4c 01 83 47 18 01
> c9 c3 <0f> 0b eb fe 90 55 48 89 e5 41 54 53 0f 1f 44 00 00 48 8b 5f 28
> [ 6691.837378] RIP [<ffffffff811c2b4b>] sysfs_remove_one+0x9b/0xa0
> [ 6691.837381] RSP <ffff8802e2e65a30>
> [ 6691.837697] ---[ end trace 2bd84e059b85817e ]---
>
> and then:
>
> [ 6803.229132] BUG: soft lockup - CPU#1 stuck for 61s! [irqbalance:3772]
> [ 6803.229205] Modules linked in: iptable_filter ip_tables bridge stp
> qla4xxx scsi_transport_iscsi qla3xxx sch_queue bonding ebt_imq x_tables
> lp parport serio_raw shpchp joydev bnx2 ses enclosure usbhid hid
> megaraid_sas e1000e
> [ 6803.230394] CPU 1:
> [ 6803.230499] Modules linked in: iptable_filter ip_tables bridge stp
> qla4xxx scsi_transport_iscsi qla3xxx sch_queue bonding ebt_imq x_tables
> lp parport serio_raw shpchp joydev bnx2 ses enclosure usbhid hid
> megaraid_sas e1000e
> [ 6803.231669] Pid: 3772, comm: irqbalance Tainted: G D
> 2.6.31.13 #4 System x3650 M3 -[7945M2G]-
> [ 6803.231764] RIP: e030:[<ffffffff8157611a>] [<ffffffff8157611a>]
> __mutex_lock_slowpath+0x3a/0x180
> [ 6803.231904] RSP: e02b:ffff8802d6049bd8 EFLAGS: 00000286
> [ 6803.231974] RAX: 00000000fffffffe RBX: ffffffff81841980 RCX:
> 0000000000000000
> [ 6803.232047] RDX: 0000000000000000 RSI: ffff8802e2e64000 RDI:
> ffffffff81841980
> [ 6803.232121] RBP: ffff8802d6049c38 R08: 000000000000004e R09:
> 0000000000000000
> [ 6803.232194] R10: 0000000000000001 R11: 0000000000000000 R12:
> ffff8802d609c470
> [ 6803.232268] R13: ffff8802e2e64000 R14: ffffffff81841998 R15:
> 0000000000000001
> [ 6803.232344] FS: 00007f47c3df4720(0000) GS:ffffc90000018000(0000)
> knlGS:0000000000000000
> [ 6803.232434] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 6803.232505] CR2: 00007f47c3dfe000 CR3: 00000002db8e7000 CR4:
> 0000000000002660
> [ 6803.232579] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 6803.232651] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [ 6803.232724] Call Trace:
> [ 6803.232789] [<ffffffff81576125>] ? __mutex_lock_slowpath+0x45/0x180
> [ 6803.232863] [<ffffffff8116d3dd>] ? __d_lookup+0x12d/0x170
> [ 6803.232936] [<ffffffff815760bb>] mutex_lock+0x2b/0x50
> [ 6803.233009] [<ffffffff811c3ecd>] sysfs_follow_link+0x6d/0x1d0
> [ 6803.233081] [<ffffffff8116f038>] ? touch_atime+0x38/0x150
> [ 6803.233152] [<ffffffff81164c47>] __link_path_walk+0x5d7/0x1060
> [ 6803.233229] [<ffffffff81577aa7>] ? _spin_unlock_irqrestore+0x37/0x60
> [ 6803.233304] [<ffffffff812c7d2a>] ? __up_read+0x9a/0xc0
> [ 6803.233379] [<ffffffff8103819d>] ?
> xen_force_evtchn_callback+0xd/0x10
> [ 6803.233452] [<ffffffff8116594a>] path_walk+0x6a/0xe0
> [ 6803.233522] [<ffffffff81165b1b>] do_path_lookup+0x5b/0xa0
> [ 6803.233593] [<ffffffff8115a2a1>] ? get_empty_filp+0xa1/0x170
> [ 6803.233663] [<ffffffff81166ac3>] do_filp_open+0x103/0xc60
> [ 6803.233736] [<ffffffff810389ff>] ? xen_restore_fl_direct_end+0x0/0x1
> [ 6803.233809] [<ffffffff8114fc0b>] ? kmem_cache_free+0xcb/0x150
> [ 6803.233884] [<ffffffff8103819d>] ?
> xen_force_evtchn_callback+0xd/0x10
> [ 6803.233957] [<ffffffff81172334>] ? alloc_fd+0x114/0x160
> [ 6803.234028] [<ffffffff81156b39>] do_sys_open+0x69/0x140
> [ 6803.234098] [<ffffffff81156c50>] sys_open+0x20/0x30
> [ 6803.235949] [<ffffffff8103e332>] system_call_fastpath+0x16/0x1b
>
> Which repeats several times.
> Does anyone have a suggestion as to what's going wrong? Is it normal for
> the tapdisk process to only show up on the backup server once Remus is
> started, or should I initialise it beforehand? Thanks in advance!
I'm not sure what these panics are, but it is normal for the tapdisk
process not to start on the backup until you start Remus. If you're
not using 2.6.18 for your guest kernel, give that a try, or possibly
HVM. If you haven't already walked through the getting-started steps here:
http://nss.cs.ubc.ca/remus/doc.html
they may help narrow down the problem.
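For comparison, the guest config from those steps would look roughly
like this (hostname, port, paths and the 2.6.18 kernel image name are
placeholders, not your actual setup):

    # /etc/xen/guest1.cfg (sketch only)
    kernel = "/boot/vmlinuz-2.6.18-xenU"
    memory = 512
    name   = "guest1"
    # Route the guest's disk through the Remus tapdisk replicator,
    # pointing it at the backup host and replication port
    disk   = [ 'tap:remus:backup:9000|aio:/srv/xen/guest1.img,xvda1,w' ]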
Regards,
Brendan
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel