Re: [Xen-devel] [PATCH] libxc: succeed silently on restore

On Thu, 2010-09-02 at 18:07 +0100, Ian Jackson wrote:
> Ian Campbell writes ("Re: [Xen-devel] [PATCH] libxc: succeed silently on 
> restore"):
> > I'm not so sure what can be done about this case: the way
> > xc_domain_restore is (currently) designed, it relies on the saving end
> > closing its FD when it is done in order to generate an EOF at the
> > receiver end, signalling the end of the migration.
> 
> This was introduced in the Remus patches and is IMO not correct.
> 
> > The xl migration protocol has a postamble which prevents us from
> > closing the FD, so instead the sender finishes the save and then sits
> > waiting for the ACK from the receiver, until the receiver hits the
> > Remus heartbeat timeout, which causes us to continue. This isn't ideal
> > from the downtime point of view, nor from a general design POV.
> 
> The xl migration protocol postamble is needed to try to mitigate the
> consequences of network failure, where otherwise it is easy to get
> into situations where neither the sender nor the receiver can safely
> resume the domain.

Yes, I wasn't suggesting getting rid of the postamble, just commenting
on why we can't simply close the sending fd as xc_domain_restore
currently expects.
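
(For context, the restore end today just keeps reading until the fd
returns EOF or an error -- roughly like the sketch below, which is
illustrative only, not the actual libxc code; process_record() is an
invented stand-in:)

/* Simplified sketch of the current end-of-stream handling -- illustrative
 * only, not the real xc_domain_restore loop. */
#include <unistd.h>

extern void process_record(const char *buf, size_t len);  /* invented */

static int read_stream(int io_fd)
{
    char buf[4096];

    for (;;) {
        ssize_t len = read(io_fd, buf, sizeof(buf));

        if (len == 0)
            return 0;   /* EOF: the sender closed its fd, stream complete */
        if (len < 0)
            return -1;  /* read error, or the Remus heartbeat timeout fired */

        process_record(buf, (size_t)len);
    }
}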

> > Perhaps we should insert an explicit done marker into the xc save
> > protocol, which would be appended in the non-checkpoint case? Only the
> > save end knows whether the migration is a checkpoint or not (and only
> > implicitly, via callbacks->checkpoint != NULL), but that is OK, I think.
> 
> There _is_ an explicit done marker: the sender stops sending pages and
> sends a register dump.  It's just that remus then wants to continue
> anyway.

I was suggesting a second "alldone" marker to be sent after the register
dump and other tail bits when there are no more checkpoints to come.
But...

> The solution is that the interface to xc_domain_restore should be
> extended so that:
>  * Callers specify whether they are expecting a series of checkpoints,
>    or just one.
>  * When it returns you find out whether the response was "we got
>    exactly the one checkpoint you were expecting" or "the network
>    connection failed too soon" or "we got some checkpoints and then
>    the network connection failed".

... I like this idea more. I'll see what I can rustle up.
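
Something along these lines, perhaps (the names, extra arguments and
outcome values below are invented purely to illustrate the three cases
you describe; this is not an existing libxc interface):

/* Hypothetical sketch of an extended restore interface -- names invented
 * for illustration, not part of libxc. */
#include <stdbool.h>

enum restore_outcome {
    RESTORE_GOT_EXPECTED_CHECKPOINT, /* exactly the one checkpoint expected */
    RESTORE_FAILED_TOO_SOON,         /* the connection failed before any
                                        complete checkpoint arrived */
    RESTORE_PARTIAL_CHECKPOINTS,     /* some checkpoints arrived, then the
                                        connection failed */
};

/* The caller states up front whether it expects a stream of checkpoints
 * (Remus) or a single one (plain migration), and learns on return which
 * of the three cases actually happened. */
int xc_domain_restore_checkpointed(int io_fd, unsigned int domid,
                                   bool expecting_checkpoints,
                                   enum restore_outcome *outcome);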

> A related problem is that it is very difficult for the caller to
> determine when the replication has been properly set up: ie, to know
> when the receiver has got at least one whole checkpoint.

I think this actually does work with the code as it is -- the receiver
will return an error if it doesn't get at least one whole checkpoint,
and will return success and commit to the most recent complete
checkpoint otherwise.
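
In other words a caller can already do something like the following
(receive_checkpoints() and the helpers are invented stand-ins, not actual
xl or libxc functions):

/* Hypothetical caller-side sketch -- not actual xl/Remus code. */
extern int receive_checkpoints(int io_fd, unsigned int domid); /* stand-in
                                      for whatever wraps xc_domain_restore */
extern void resume_domain(unsigned int domid);    /* invented helper */
extern void report_failure(unsigned int domid);   /* invented helper */

static void handle_incoming(int io_fd, unsigned int domid)
{
    if (receive_checkpoints(io_fd, domid) == 0)
        /* At least one complete checkpoint was received and committed,
         * so this end holds a consistent state to resume from. */
        resume_domain(domid);
    else
        /* No complete checkpoint ever arrived; nothing here is safe
         * to resume. */
        report_failure(domid);
}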

Ian.
