xen-devel

Re: [Xen-devel] [PATCH] libxc: succeed silently on restore

On Thursday, 02 September 2010 at 19:29, Ian Campbell wrote:
> On Thu, 2010-09-02 at 19:16 +0100, Brendan Cully wrote:
> > On Thursday, 02 September 2010 at 18:01, Ian Campbell wrote:
> > > So it turns out that there is a similar issue on migration:
> > >         xc: Saving memory: iter 3 (last sent 37 skipped 0): 0/32768    0%
> > >         xc: error: rdexact failed (select returned 0): Internal error
> > >         xc: error: Error when reading batch size (110 = Connection timed out): Internal error
> > >         xc: error: error when buffering batch, finishing (110 = Connection timed out): Internal error
> > > 
> > > I'm not so sure what can be done about this case; the way
> > > xc_domain_restore is (currently) designed, it relies on the saving end
> > > closing its FD when it is done in order to generate an EOF at the
> > > receiving end, which signals the end of the migration.
> > > 
> > > The xl migration protocol has a postamble which prevents us from closing
> > > the FD, so instead the sender finishes the save and then sits waiting
> > > for the ACK from the receiver, until the receiver hits the Remus
> > > heartbeat timeout, which causes us to continue. This isn't ideal from a
> > > downtime point of view, nor from a general design POV.
> > > 
> > > Perhaps we should insert an explicit "done" marker into the xc save
> > > protocol, to be appended in the non-checkpoint case? Only the saving
> > > end knows whether the migration is a checkpoint or not (and only
> > > implicitly, via callbacks->checkpoint != NULL), but that is OK, I think.
> > 
> > I think this can be done trivially. We can just add another negative
> > length record at the end of memory copying (like the debug flag, tmem,
> > hvm extensions, etc.) if we're running the new xl migration protocol
> > and expect restore to exit after receiving the first full
> > checkpoint. Or, if you're not as worried about preserving the existing
> > semantics, make the minus flag indicate that callbacks->checkpoint is
> > not NULL, and only continue reading past the first complete checkpoint
> > if you see that minus flag on the receive side.
> > 
> > Isn't that sufficient?
> 
> It would probably work, but isn't there a benefit to having the receiver
> know that it is partaking in a multiple-checkpoint restore, and being
> told how many iterations there were, etc.?

Is there?

The minus flag does tell the receiver that it is participating in a
multiple checkpoint restore (when it receives the flag). I can't
really see a reason why the sender should want to tell the receiver to
expect n checkpoints (as opposed to 1 or continuous). I suppose it
would be nice if the sender could gracefully abort a continual
checkpoint process without causing the receiver to activate the
VM. Yet another minus flag? :)
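
For concreteness, a rough sketch of what such a negative-length marker
could look like on each side (the chunk ID and helper names below are
made up for illustration, not the actual libxc identifiers):

    #include <unistd.h>

    /* Hypothetical negative chunk ID, following the same convention as
     * the existing debug/tmem/hvm control records. */
    #define XC_SAVE_ID_LAST_CHECKPOINT  (-9)

    /* Save side: emitted once, only when callbacks->checkpoint == NULL,
     * i.e. a plain migration rather than a continuous checkpoint stream. */
    static int send_last_checkpoint_marker(int io_fd)
    {
        int marker = XC_SAVE_ID_LAST_CHECKPOINT;

        return write(io_fd, &marker, sizeof(marker)) == sizeof(marker)
            ? 0 : -1;
    }

    /* Restore side: called on the value read as the batch size.  Returns
     * 1 if it was the marker, 0 if it is an ordinary batch count. */
    static int is_last_checkpoint_marker(int count)
    {
        return count == XC_SAVE_ID_LAST_CHECKPOINT;
    }

The receive side would just set a "last checkpoint" flag when it sees
the marker and finish after applying the current checkpoint, instead
of waiting for EOF or the heartbeat timeout.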

I have no objection to more aggressive refactoring (the current
protocol and code are gross); I'm just noting that this particular
problem also has an easy fix.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel