WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 
Xen 
 
Home Products Support Community News
 
   
 

xen-devel

Re: [Xen-devel] [PATCH] libxl/xl: improve behaviour when guest fails to

To: Ian Campbell <ian.campbell@xxxxxxxxxx>
Subject: Re: [Xen-devel] [PATCH] libxl/xl: improve behaviour when guest fails to suspend itself
From: Stefano Stabellini <stefano.stabellini@xxxxxxxxxxxxx>
Date: Tue, 8 Feb 2011 15:06:05 +0000
Cc: Ian, Campbell <Ian.Campbell@xxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>
Delivery-date: Tue, 08 Feb 2011 07:04:47 -0800
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <ad6f318aac5dcdc5a7be.1297171403@xxxxxxxxxxxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <ad6f318aac5dcdc5a7be.1297171403@xxxxxxxxxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Alpine 2.00 (DEB 1167 2008-08-23)
On Tue, 8 Feb 2011, Ian Campbell wrote:
> # HG changeset patch
> # User Ian Campbell <ian.campbell@xxxxxxxxxx>
> # Date 1297171349 0
> # Node ID ad6f318aac5dcdc5a7be03a4a80f27ed80729638
> # Parent  deaa7bc1a7ff6cfad7865394cb8b7205741004c9
> libxl/xl: improve behaviour when guest fails to suspend itself.
> 
> The PV suspend protocol requires guest co-operating whereb the guest
> must respond to a suspend request written to the xenstore control node
> by clearing the node and then making a suspend hypercall.
> 
> Currently when a guest fails to do this libxl times out and returns
> a generic failure code to the caller.
> 
> In response to this failure xl attempts to resume the guest. However
> if the guest has not responded to the suspend request then the is no
> guarantee that the guest has made the suspend hypercall (in fact it is
> quite unlikely). Since the resume process attempts to modify the
> return value of the hypercall (to indicate a cancelled suspend) this
> results in the guest eax/rax register being corrupted!
> 
> Firstly ensure that libxl cancels the suspend request in a synchronous
> manner, to avoid racing with a slow guest. Both pvops and classic-Xen
> guests read and clear the control node in a single transaction so it
> sufficient to do the same on the toolstack side.
> 
> Secondly add a mechanism to propagate guest timeouts from the
> xc_domain_suspend ->suspend callback
> (libxl__domain_suspend_common_callback in the case of libxl) to the
> caller (libxl__domain_suspend_common).
> 
> Thirdly add a new libxl error code to indicate that the guest did not
> respond in time.
> 
> Lastly in xl do not attempt to resume a guest if it has not responded
> to the suspend request.
> 
> Tested by live migration of PVops kernels which either ignore the
> suspend request, have already crashed and those which suspend/resume
> correctly. In the first two cases the source domain is left alone (and
> continues to function in the first case) and in the third the
> migration is successful.
> 
> Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
> 
> diff -r deaa7bc1a7ff -r ad6f318aac5d tools/libxl/libxl.h
> --- a/tools/libxl/libxl.h     Tue Feb 08 09:13:38 2011 +0000
> +++ b/tools/libxl/libxl.h     Tue Feb 08 13:22:29 2011 +0000
> @@ -239,6 +239,7 @@ enum {
>      ERROR_NOMEM = -5,
>      ERROR_INVAL = -6,
>      ERROR_BADFAIL = -7,
> +    ERROR_GUEST_TIMEDOUT = -8,
>  };
>  
>  #define LIBXL_VERSION 0
> diff -r deaa7bc1a7ff -r ad6f318aac5d tools/libxl/libxl_dom.c
> --- a/tools/libxl/libxl_dom.c Tue Feb 08 09:13:38 2011 +0000
> +++ b/tools/libxl/libxl_dom.c Tue Feb 08 13:22:29 2011 +0000
> @@ -319,7 +319,7 @@ struct suspendinfo {
>      int suspend_eventchn;
>      int domid;
>      int hvm;
> -    unsigned int flags;
> +    int guest_responded;
>  };
>  

Even though they are not currently used, I would keep the flags field.


>  static int libxl__domain_suspend_common_switch_qemu_logdirty(int domid, 
> unsigned int enable, void *data)
> @@ -349,6 +349,7 @@ static int libxl__domain_suspend_common_
>      char *path, *state = "suspend";
>      int watchdog = 60;
>      libxl_ctx *ctx = libxl__gc_owner(si->gc);
> +    xs_transaction_t t;
>  
>      if (si->hvm)
>          xc_get_hvm_param(ctx->xch, si->domid, HVM_PARAM_ACPI_S_STATE, 
> &s_state);
> @@ -363,6 +364,7 @@ static int libxl__domain_suspend_common_
>              LIBXL__LOG(ctx, LIBXL__LOG_ERROR, "xc_await_suspend failed 
> ret=%d", ret);
>              return 0;
>          }
> +        si->guest_responded = 1;
>          return 1;
>      }
>      path = libxl__sprintf(si->gc, "%s/control/shutdown", 
> libxl__xs_get_dompath(si->gc, si->domid));
> @@ -386,16 +388,40 @@ static int libxl__domain_suspend_common_
>              int shutdown_reason;
>  
>              shutdown_reason = (info.flags >> XEN_DOMINF_shutdownshift) & 
> XEN_DOMINF_shutdownmask;
> -            if (shutdown_reason == SHUTDOWN_suspend)
> +            if (shutdown_reason == SHUTDOWN_suspend) {
> +                si->guest_responded = 1;
>                  return 1;
> +            }
>          }
>          state = libxl__xs_read(si->gc, XBT_NULL, path);
>          watchdog--;
>      }
> +
> +    /*
> +     * Guest appear to not be responding, clear the suspend request
> +     * synchronously using a transaction.
> +     */
>      if (!strcmp(state, "suspend")) {
> -        LIBXL__LOG(ctx, LIBXL__LOG_ERROR, "guest didn't suspend in time");
> -        libxl__xs_write(si->gc, XBT_NULL, path, "");
> +        LIBXL__LOG(ctx, LIBXL__LOG_ERROR, "guest didn't suspend in time, 
> clearing suspend request");
> +    retry_transaction:
> +        t = xs_transaction_start(ctx->xsh);
> +
> +        state = libxl__xs_read(si->gc, t, path);
> +
> +        libxl__xs_write(si->gc, t, path, "");
> +

Doesn't it make sense to clear path only if state is still "suspend"?
Clearing the path again even if the guest already cleared it may call a
spurious xenstore watch event in the guest, right?


> +
> +    /* Final check, guest may have suspended while we were clearing the 
> request. */
> +    if (!strcmp(state, "suspend")) {
> +        LIBXL__LOG(ctx, LIBXL__LOG_ERROR, "final check, guest didn't 
> suspend");
> +        return 0;
> +    }
> +
> +    si->guest_responded = 1;
>      return 1;
>  }

If "suspend" was gone before we cleared it, shouldn't we check
xc_domain_getinfolist again before asserting that the domain suspended
correcty?
Or maybe is better just to return 0 without the final check, assuming
that the guest failed in any case.


> @@ -414,10 +440,10 @@ int libxl__domain_suspend_common(libxl_c
>            | (hvm) ? XCFLAGS_HVM : 0;
>  
>      si.domid = domid;
> -    si.flags = flags;
>      si.hvm = hvm;
>      si.gc = &gc;
>      si.suspend_eventchn = -1;
> +    si.guest_responded = 0;
>  
>      si.xce = xc_evtchn_open(NULL, 0);
>      if (si.xce == NULL)
> @@ -441,8 +467,14 @@ int libxl__domain_suspend_common(libxl_c
>  
>      rc = xc_domain_save(ctx->xch, fd, domid, 0, 0, flags, &callbacks, hvm);
>      if ( rc ) {
> -        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "saving domain");
> -        rc = ERROR_FAIL;
> +        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "saving domain: %s",
> +                         si.guest_responded ?
> +                         "domain responded to suspend request" :
> +                         "domain did not respond to suspend request");
> +        if ( !si.guest_responded )
> +            rc = ERROR_GUEST_TIMEDOUT;
> +        else
> +            rc = ERROR_FAIL;
>      }
>  
>      if (si.suspend_eventchn > 0)
> diff -r deaa7bc1a7ff -r ad6f318aac5d tools/libxl/xl_cmdimpl.c
> --- a/tools/libxl/xl_cmdimpl.c        Tue Feb 08 09:13:38 2011 +0000
> +++ b/tools/libxl/xl_cmdimpl.c        Tue Feb 08 13:22:29 2011 +0000
> @@ -2583,7 +2583,10 @@ static void migrate_domain(const char *d
>      if (rc) {
>          fprintf(stderr, "migration sender: libxl_domain_suspend failed"
>                  " (rc=%d)\n", rc);
> -        goto failed_resume;
> +        if (rc == ERROR_GUEST_TIMEDOUT)
> +            goto failed_suspend;
> +        else
> +            goto failed_resume;
>      }
>  
>      //fprintf(stderr, "migration sender: Transfer complete.\n");
> @@ -2661,6 +2664,12 @@ static void migrate_domain(const char *d
>      libxl_domain_destroy(&ctx, domid, 1); /* bang! */
>      fprintf(stderr, "Migration successful.\n");
>      exit(0);
> +
> + failed_suspend:
> +    close(send_fd);
> +    migration_child_report(child, recv_fd);
> +    fprintf(stderr, "Migration failed, failed to suspend at sender.\n");
> +    exit(-ERROR_FAIL);
>  
>   failed_resume:
>      close(send_fd);
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel