Re: [Xen-devel] pv guests die after failed migration

On Fri, 2011-09-23 at 08:39 +0100, Andreas Olsowski wrote:
> Here is the full procedure:
[...]

Thanks, I should be able to figure out a repro with this, although I may
not get to it straight away.

> root@xenturio1:/usr/src/linux-2.6-xen# xl console thiswillfail
> PM: freeze of devices complete after 0.207 msecs
> PM: late freeze of devices complete after 0.058 msecs
> ------------[ cut here ]------------
> kernel BUG at drivers/xen/events.c:1466!
> invalid opcode: 0000 [#1] SMP
> CPU 0
> Modules linked in:
> 
> Pid: 6, comm: migration/0 Not tainted 3.0.4-xenU #6
> RIP: e030:[<ffffffff8140d574>]  [<ffffffff8140d574>] 
> xen_irq_resume+0x224/0x370
> RSP: e02b:ffff88001f9fbce0  EFLAGS: 00010082
> RAX: ffffffffffffffef RBX: 0000000000000000 RCX: 0000000000000000
> RDX: ffff88001f809ea8 RSI: ffff88001f9fbd00 RDI: 0000000000000001
> RBP: 0000000000000010 R08: ffffffff81859a00 R09: 0000000000000000
> R10: 0000000000000000 R11: 09f911029d74e35b R12: 0000000000000000
> R13: 000000000000f0a0 R14: 0000000000000000 R15: ffff88001f9fbd00
> FS:  00007ff28f8c8700(0000) GS:ffff88001fec6000(0000) knlGS:0000000000000000
> CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007fff02056048 CR3: 000000001e4d8000 CR4: 0000000000002660
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process migration/0 (pid: 6, threadinfo ffff88001f9fa000, task 
> ffff88001f9f7170)
> Stack:
>   ffff88001f9fbd34 ffff88001f9fbd54 0000000000000003 000000000000f100
>   0000000000000000 0000000000000003 0000000000000000 0000000000000003
>   ffff88001fa6ddb0 ffffffff8140aa20 ffffffff81859a08 0000000000000000
> Call Trace:
>   [<ffffffff8140aa20>] ? gnttab_map+0x100/0x130
>   [<ffffffff815c2765>] ? _raw_spin_lock+0x5/0x10
>   [<ffffffff81083e01>] ? cpu_stopper_thread+0x101/0x190
>   [<ffffffff8140e1f5>] ? xen_suspend+0x75/0xa0
>   [<ffffffff81083f1b>] ? stop_machine_cpu_stop+0x8b/0xd0
>   [<ffffffff81083e90>] ? cpu_stopper_thread+0x190/0x190
>   [<ffffffff81083dd0>] ? cpu_stopper_thread+0xd0/0x190
>   [<ffffffff815c0870>] ? schedule+0x270/0x6c0
>   [<ffffffff81083d00>] ? copy_pid_ns+0x2a0/0x2a0
>   [<ffffffff81065846>] ? kthread+0x96/0xa0
>   [<ffffffff815c4024>] ? kernel_thread_helper+0x4/0x10
>   [<ffffffff815c3436>] ? int_ret_from_sys_call+0x7/0x1b
>   [<ffffffff815c2be1>] ? retint_restore_args+0x5/0x6
>   [<ffffffff815c4020>] ? gs_change+0x13/0x13
> Code: e8 f2 e9 ff ff 8b 44 24 10 44 89 e6 89 c7 e8 64 e8 ff ff ff c3 83 
> fb 04 0f 84 95 fe ff ff 4a 8b 14 f5 20 95 85 81 e9 68 ff ff ff <0f> 0b 
> eb fe 0f 0b eb fe 48 8b 1d fd 00 42 00 4c 8d 6c 24 20 eb
> RIP  [<ffffffff8140d574>] xen_irq_resume+0x224/0x370
>   RSP <ffff88001f9fbce0>
> ---[ end trace 82e2e97d58b5f835 ]---

This seems to be taking the non-cancelled resume path, does this patch
help at all:

diff -r d7b14b76f1eb tools/libxl/libxl.c
--- a/tools/libxl/libxl.c       Thu Sep 22 14:26:08 2011 +0100
+++ b/tools/libxl/libxl.c       Fri Sep 23 08:45:28 2011 +0100
@@ -246,7 +246,7 @@ int libxl_domain_resume(libxl_ctx *ctx, 
         rc = ERROR_NI;
         goto out;
     }
-    if (xc_domain_resume(ctx->xch, domid, 0)) {
+    if (xc_domain_resume(ctx->xch, domid, 1)) {
         LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
                         "xc_domain_resume failed for domain %u",
                         domid);

I don't think that's a solution but if this patch works then it may
indicate a problem with xc_domain_resume_any.

[...]
> I am not quite sure what you mean by "guest log".

The guest console log, which you provided above, thanks.

Ian.



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
WARNING - OLD ARCHIVES

xen-devel

Re: [Xen-devel] pv guests die after failed migration