On Mon, Dec 04, 2006 at 02:18:37PM -0500, Graham, Simon wrote:
> We've been noticing a lot of these errors when booting VMs since we
> moved to 3.0.3 - I've traced this to the hotplug scripts in Dom0 taking
> >10s to run to completion and specifically the vif-bridge script taking
> >=9s to plug the vif into the s/w bridge on occasion - was wondering if
> anyone has any insight into why it might take this long.
>
> I added some instrumentation to the scripts to log entry/exit from
> xen-backend.agent and also lock contention (attached at the end of this)
> and have the following observations:
>
> 1. Currently, the various script invocations are issued in parallel but
> are serialized by a single global lock -- is it really necessary, for
> example, to serialize vif and vbd hot plug processing in Dom0?

You need to serialise VBD hotplug if you are going to get the right result
when performing the sharing check. If you're using vif-nat, you need to
serialise the modifications to the DHCP configuration file. Other than that,
I don't think that there's a need to serialise events at startup. In Bugzilla
#515, Harry Butterworth notes that there is a race condition in teardown,
which is why he introduced the global lock. You could make this cleverer,
possibly, so that it doesn't affect startup times.

All that said, I believe that udev is supposed to serialise all events
anyway, so unless you're using hotplug rather than udev, I'd expect you to see
no lock contention whatsoever.
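
To illustrate what "cleverer" locking might look like, here is a rough Python
sketch of one lock per device class instead of a single global lock. It is not
the code from the hotplug scripts (those are shell), and the lock directory,
file names and the class_lock helper are all made up for the example. It also
does nothing about the teardown race mentioned above.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

# Hypothetical lock directory, chosen only so the sketch runs anywhere.
LOCK_DIR = tempfile.gettempdir()


@contextmanager
def class_lock(device_class, lock_dir=LOCK_DIR):
    """Hold an exclusive lock that is shared only by one device class.

    Two "vbd" events serialise against each other, but a "vif" event
    does not wait for a "vbd" event, because each class has its own file.
    """
    path = os.path.join(lock_dir, "hotplug-%s.lock" % device_class)
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the class lock is free
        yield path
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


if __name__ == "__main__":
    # The two locks are independent, so nesting them does not deadlock.
    with class_lock("vbd"):
        with class_lock("vif"):
            print("holding the vbd and vif locks at the same time")
```

With this layout the VBD sharing check would still run under the "vbd" lock,
and vif-nat's DHCP file edits under the "vif" lock, but the two classes would
no longer wait on each other.
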
> 2. In most cases we've seen, this problem happens when the first VM is
> started after re-installing a box. In the example below, the 'vif online'
> processing started at 2:21:53 and did not finish until 2:22:04
>
> 3. Clearly a hard coded timeout of 10s is less than perfect -- is there
> no better way of knowing when the hotplug processing is done?

We know precisely when hotplugging is done -- the scripts write an entry into
the store to tell us so. It's knowing when they've locked up that's the hard
bit.

If you're seeing vif bringup taking 9 seconds, then clearly the 10 second
timeout is far too short. There's no particular reason to keep the timeout
short, so feel free to lengthen it, with the obvious consequences. Bear in
mind that Xend will time out the whole device bringup phase after 100 seconds.

I'd want to root-cause the 9 second bringup as well, as I don't see why it
ought to take that long.
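
The trade-off can be shown with a small Python sketch. This is not Xend's
code: wait_for_hotplug and read_status are hypothetical names, and
read_status simply stands in for "read the entry the script writes into the
store". The point is that completion is an explicit signal, whereas a hang can
only be inferred from the deadline passing.

```python
import time


def wait_for_hotplug(read_status, timeout=10.0, poll_interval=0.1,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll read_status() until it returns a value or the timeout expires.

    read_status is a hypothetical callable that returns None while the
    hotplug script is still running and a status string once the script
    has written its result. Raises TimeoutError if no result appears in
    time; a slow script and a hung script look the same from here.
    """
    deadline = clock() + timeout
    while True:
        status = read_status()
        if status is not None:
            return status
        if clock() >= deadline:
            raise TimeoutError("no hotplug status after %.1f s" % timeout)
        sleep(poll_interval)


if __name__ == "__main__":
    start = time.monotonic()

    def finishes_after_half_a_second():
        # Simulated script: reports a result 0.5 s after the wait begins.
        return "connected" if time.monotonic() - start >= 0.5 else None

    print(wait_for_hotplug(finishes_after_half_a_second, timeout=2.0))
```

A script that takes 9 seconds passes with a longer timeout and fails with a
shorter one, and in neither case does the waiter learn why it was slow.
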
Cheers,
Ewan.
_______________________________________________
Xen-devel mailing list
http://lists.xensource.com/xen-devel