On 12/03/09 02:28, Ian Campbell wrote:
> On Wed, 2009-12-02 at 19:54 +0000, Ian Campbell wrote:
>
>>> Does it really need to be a panic? Can't we just start failing all
>>> future operations? Seems bad to take out the whole machine if we can
>>> just get away with crippling one device (especially if it can be
>>> recovered by downing it and re-upping a new one with nc1 and/or
>>> gt1).
>>
>> Wouldn't there be (failing) grant table ops on the down path?
>>
>> In any case doesn't it affect all devices since they all use the same
>> grant table?
>>
> Oh, I see what you meant... in the proper resume case (as opposed to the
> cancelled suspend/checkpoint case I was thinking of) there should be no
> grant tables in use at this point so most devices should, in theory, be
> able to reconnect using v1 grants, any drivers which require v2 grant
> tables need to check for them in their resume hook as well as at start
> of day.
>
> Unfortunately frontend devices tear down their grant entries after the
> resume rather than before the suspend (I presume this has to do with
> faster checkpointing?) which means they could be trying to clear an
> entry of the wrong layout, leading to the unbounded badness that the
> comment refers to.
>
I think the reason frontends don't do anything before suspend is because
they need to cope with backends going away at any moment, and a
suspend/migrate is just a special case of that. But a normal backend
restart won't change the grant table format, whereas a resume/migrate
can, so it does make sense to take advantage of the suspend callback.
Also I think originally there wasn't a suspend callback, but there is
now that we're using the device model.

I don't know how it affects performance, but I guess it would require
checkpoints to do a full teardown/reconnect so that the checkpointed
image can cope.

On the other hand, on resume, there are no existing grants, so the
device can just ignore any grant state it currently has established and
do it all afresh with the current grant mechanism, no?
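Roughly what I have in mind, as an untested sketch. Every name here
(demo_frontend, demo_grant_access, demo_frontend_resume) is made up for
illustration and is not the real xenbus or gnttab interface; the point is
only that the resume hook forgets its old references without touching the
entries they named, then re-grants under whatever version is live:

```c
#include <assert.h>

#define DEMO_RING_REFS   4
#define DEMO_INVALID_REF (-1)

/* Hypothetical per-device state; not the real frontend structures. */
struct demo_frontend {
    int ring_ref[DEMO_RING_REFS]; /* grant refs handed to the backend */
    int gnttab_version;           /* version those refs were created under */
};

/* Stand-in for granting access under the currently live table version. */
static int demo_grant_access(int version, int slot)
{
    return version * 100 + slot;
}

/*
 * Resume hook: the old grant table no longer exists, so drop the stale
 * refs without dereferencing the entries they pointed at, then set
 * everything up again using the current version.
 */
static void demo_frontend_resume(struct demo_frontend *fe, int live_version)
{
    int i;

    assert(live_version == 1 || live_version == 2);

    /* Deliberately no per-entry teardown: the layout may have changed. */
    for (i = 0; i < DEMO_RING_REFS; i++)
        fe->ring_ref[i] = DEMO_INVALID_REF;

    fe->gnttab_version = live_version;
    for (i = 0; i < DEMO_RING_REFS; i++)
        fe->ring_ref[i] = demo_grant_access(live_version, i);
}
```

This only works for the proper resume case; a cancelled suspend still has
live entries that somebody has to clean up.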
> I think the choices are basically:
> * Always latch to either v1 or v2 at start of day, if we can't get
> the version we want then panic (this is a stronger restriction
> than the current code which will try to upgrade to v2 on resume)
> * Write v1<->v2 layout transformations called on gnttab resume
> before the devices get a chance to try and unmap their old
> entries. Would need to handle v2 entries using features which are
> not expressible in v1.
>
> I'm tempted to go with the former for simplicity, it enables migration
> to a newer version of Xen (the guest will just keep using v1) but will
> not allow migration back to an older version of Xen, which is not
> something we generally support anyway.
>
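To make the problem with the second option concrete, here is a rough,
untested sketch of a v2 -> v1 conversion. The structs are simplified
stand-ins of my own, only loosely modelled on the real entry layouts, and
the "kind" field and the assumption that v2 frames are wider than v1
frames are illustrative:

```c
#include <stdint.h>

/* Simplified stand-ins; not the real grant entry definitions. */
struct demo_entry_v1 {
    uint16_t flags;
    uint16_t domid;
    uint32_t frame;
};

enum demo_v2_kind { DEMO_FULL_PAGE, DEMO_SUB_PAGE, DEMO_TRANSITIVE };

struct demo_entry_v2 {
    uint16_t flags;
    uint16_t domid;
    enum demo_v2_kind kind; /* hypothetical tag for the v2 entry flavour */
    uint64_t frame;         /* assumed wider than the v1 frame field */
};

/* Returns 0 on success, -1 if the v2 entry has no v1 equivalent. */
static int demo_v2_to_v1(const struct demo_entry_v2 *in,
                         struct demo_entry_v1 *out)
{
    /* Sub-page and transitive grants cannot be described in v1. */
    if (in->kind != DEMO_FULL_PAGE)
        return -1;

    /* A frame number that does not fit the narrower field is lost too. */
    if (in->frame > UINT32_MAX)
        return -1;

    out->flags = in->flags;
    out->domid = in->domid;
    out->frame = (uint32_t)in->frame;
    return 0;
}
```

Every -1 path is a live entry that the transformation would have to
revoke or fail on somehow, which is where the extra complexity comes from.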
Yeah, given the "no downgrade" rule we don't need to solve it in the
most general way.
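
For the first option, the logic would be something like the untested
sketch below. The names (demo_pick_version, demo_gnttab_setup, host_max)
are hypothetical, and demo_panic just stands in for a real panic():

```c
#include <stdio.h>
#include <stdlib.h>

static int demo_latched_version; /* 0 means "not latched yet" */

/* Stand-in for panic(); a real guest would not come back from this. */
static void demo_panic(const char *msg)
{
    fprintf(stderr, "panic: %s\n", msg);
    abort();
}

/*
 * Decide which grant table version to use. host_max is the highest
 * version the hypervisor offers (a made-up parameter for this sketch).
 * Returns -1 if the latched version is no longer available.
 */
static int demo_pick_version(int latched, int host_max)
{
    if (latched == 0)
        return host_max >= 2 ? 2 : 1; /* start of day: take the best */
    if (latched > host_max)
        return -1;                    /* resumed on an older Xen */
    return latched;                   /* never upgrade on resume */
}

/* Called at start of day and again on every resume. */
static void demo_gnttab_setup(int host_max)
{
    int v = demo_pick_version(demo_latched_version, host_max);

    if (v < 0)
        demo_panic("latched grant table version not available");
    demo_latched_version = v;
}
```

So a guest that booted with v1 stays on v1 after moving to a newer host,
and demo_pick_version(2, 1), the downgrade case, is the only one that
ends in the panic.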
J
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel