Re: [Xen-devel] [PATCH] libxc: succeed silently on restore
On Thursday, 02 September 2010 at 19:29, Ian Campbell wrote:
> On Thu, 2010-09-02 at 19:16 +0100, Brendan Cully wrote:
> > On Thursday, 02 September 2010 at 18:01, Ian Campbell wrote:
> > > So it turns out that there is a similar issue on migration:
> > > xc: Saving memory: iter 3 (last sent 37 skipped 0): 0/32768
> > > 0%xc: error: rdexact failed (select returned 0): Internal error
> > > xc: error: Error when reading batch size (110 = Connection timed
> > > out): Internal error
> > > xc: error: error when buffering batch, finishing (110 =
> > > Connection timed out): Internal error
> > >
> > > I'm not so sure what can be done about this case: the way
> > > xc_domain_restore is (currently) designed, it relies on the saving end
> > > closing its FD when it is done in order to generate an EOF at the
> > > receiving end, which signals the end of the migration.
> > >
> > > The xl migration protocol has a postamble which prevents us from closing
> > > the FD, so instead the sender finishes the save and then sits waiting for
> > > the ACK from the receiver, while the receiver waits until it hits the
> > > Remus heartbeat timeout, which causes us to continue. This isn't ideal
> > > from the downtime point of view, nor from a general design POV.
> > >
> > > Perhaps we should insert an explicit "done" marker into the xc save
> > > protocol which would be appended in the non-checkpoint case? Only the
> > > save end knows whether the migration is a checkpoint or not (and only
> > > implicitly, via callbacks->checkpoint != NULL), but that is OK, I think.
> >
> > I think this can be done trivially? We can just add another
> > negative-length record at the end of memory copying (like the debug
> > flag, tmem, HVM extensions, etc.) if we're running the new xl migration
> > protocol and expect restore to exit after receiving the first full
> > checkpoint. Or, if you're not as worried about preserving the existing
> > semantics, make the minus flag indicate that callbacks->checkpoint is
> > not NULL, and only continue reading past the first complete checkpoint
> > if you see that minus flag on the receive side.
> >
> > Isn't that sufficient?
>
> It would probably work, but isn't there a benefit to having the receiver
> know that it is partaking in a multiple-checkpoint restore, being told
> how many iterations there will be, etc.?
Is there?
The minus flag does tell the receiver that it is participating in a
multiple-checkpoint restore (when it receives the flag). I can't
really see a reason why the sender should want to tell the receiver to
expect n checkpoints (as opposed to 1, or continuous checkpoints). I
suppose it would be nice if the sender could gracefully abort a
continuous checkpoint process without causing the receiver to activate
the VM. Yet another minus flag? :)
I have no objection to more aggressive refactoring (the current
protocol and code are gross); I'm just noting that this particular
problem also has an easy fix.
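
For concreteness, here is a minimal sketch of the minus-flag idea. This is
not the actual libxc change: the record name and value, the helper names,
and where the call sites would live are all assumptions, and real code
would reuse libxc's own write helpers rather than raw write().

/*
 * Sketch only: one more negative-length record ("minus flag") appended
 * by the sender after the final memory batch when it is NOT doing
 * continuous checkpointing.  On seeing it, the receiver knows it can
 * activate the domain once this checkpoint completes, instead of
 * waiting for EOF or for the Remus heartbeat timeout.
 * XC_SAVE_ID_LAST_CHECKPOINT is an assumed name/value and must not
 * clash with the negative IDs already used by the save format.
 */
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

#define XC_SAVE_ID_LAST_CHECKPOINT (-9)

/* Write exactly 'size' bytes, retrying on short writes and EINTR
 * (stand-in for libxc's own write helper). */
static int write_exact_sketch(int fd, const void *data, size_t size)
{
    const unsigned char *p = data;

    while (size) {
        ssize_t n = write(fd, p, size);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        size -= (size_t)n;
    }
    return 0;
}

/* Save side: called once after the last batch, only when
 * callbacks->checkpoint == NULL (plain migration, not Remus). */
static int send_last_checkpoint_marker(int io_fd)
{
    int32_t marker = XC_SAVE_ID_LAST_CHECKPOINT;

    return write_exact_sketch(io_fd, &marker, sizeof(marker));
}

On the restore side, the loop that reads the batch size would grow one
more case for this ID that simply records "this is the last checkpoint"
and continues, so the existing clean exit path runs once the current
checkpoint has been applied rather than after an EOF or the heartbeat
timeout.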
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel