On Saturday, 06 September 2008 at 09:19, Keir Fraser wrote:
> On 5/9/08 20:16, "Brendan Cully" <brendan@xxxxxxxxx> wrote:
> > I've noticed that the suspend event channel becomes pending as soon as
> > it is bound. I'm not sure why or whether this is intentional, but it
> > means that the suspend function will return before the domain has
> > completed suspending unless the first notification is cleared. Without
> > this patch, xc_domain_save may find that the guest has not suspended
> > and sleep in 10ms chunks until it does. Typically this is several
> > milliseconds of wasted time.
> This patch looks pretty dodgy to me.
> First of all, you removed the non-error-case return statement from
> suspend_evtchn_init(). Hence it appears you release the local port before
> returning, and you return -1 (which you'd never notice since the caller
> erroneously does not check the return code - please fix this).
Very sorry. I've attached the simple fix below.
> Hence I think actually you are ending up using the old slow shutdown method
> since suspend_evtchn will be -1 and hence you'll end up in compat_suspend().
> That should result in a significant slowdown rather than speedup, so were
> your measurements actually taken with this exact patch?
No, I measured in remus and misbackported, I'm afraid. But I've timed
this version -- it tends to be 2-5ms to suspend and notify xend.
> Even with this fixed, I don't think that ignoring an event wakeup is a great
> idea. You can easily make the not-suspended-yet retry loop always wait on
> the evtchn rather than sleeping 10ms. By which I mean -- there is really no
> reason for that to be a pure poll loop when you have an evtchn to sleep on.
This code is only ignoring the event at setup time, before the suspend
function has actually notified the domain to suspend itself. So I
think it should be reasonably safe. Presumably the only reason the
event channel has a pending notification at this point is because it
is forcibly set in evtchn_bind_interdomain? The only other case I can
think of is that somebody else has requested that the domain suspend
itself, maybe by writing to the watch. That seems like a higher-level
locking problem to me.
> To do this, call (*suspend) from within the retry loop: the evtchn case can
> do what it always does (basically sleep on the evtchn device until its
> evtchn of interest appears); the compat case should change behaviour after
> its first invocation so that it sleeps 10ms (stash a static variable in the
> function or in suspendinfo for this purpose, to remember whether it was
> already invoked).
I could certainly code this up as well (it'd need a static flag in
evtchn_suspend as well to avoid resignalling the domain, I think). But
generally without clearing the event channel before signalling the
guest, the first suspend attempt will always return early. I'm not
really clear on the scenario that results in the domain not being
suspended after *suspend has succesfully returned. Could you clarify?
Description: Text document
Xen-devel mailing list