OK, I've managed to reproduce this: under 2.0.5 I can make a domain
crash in the same fashion if it is under extreme network receive load
during the migration.
This used to work fine, so we've obviously introduced a bug in the last
few months.
I'll investigate when I get a chance. We seem to get stuck in a page
fault loop writing to an skb's shared info area, passing through the
vmalloc fault section of do_page_fault. It looks like the PTE is
read-only, which is very odd. We just need to figure out how it got
that way.
This smells like the first real Xen-internal bug in the stable series
for several months...
Ian
> The patch did not make a difference. I do have a few more
> data points though. Irrespective of whether the patch is
> applied, migration without the --live switch works. Further,
> even --live seems to work if all the memory pages are copied
> in one iteration. However, if xfrd.log shows that a second
> iteration has been started, live migration will fail.
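>
> For what it's worth, my mental model of the pre-copy loop is roughly
> the following (a minimal sketch with made-up names, not the actual
> xfrd code), which is why I suspect the trouble starts only when a
> second pass is needed:
>
>     # Sketch of iterative pre-copy live migration (hypothetical API).
>     def live_migrate(domain, send_page, max_iters=29):
>         to_send = domain.all_pages()        # iteration 1: every page
>         for _ in range(max_iters):
>             for page in to_send:
>                 send_page(page)             # guest keeps running
>             to_send = domain.dirty_pages()  # pages touched meanwhile
>             if not to_send:
>                 break                       # converged after one pass
>         domain.suspend()                    # final stop-and-copy
>         for page in domain.dirty_pages():
>             send_page(page)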
>
> > > After the error on the source machine, while the VM shows up on
> > > xm list at the destination, xfrd.log (on the destination) shows
> > > that it's trying to reload memory pages beyond 100%. The number
> > > of memory pages reloaded keeps going up until I use 'xm destroy'.
> >
> > Going beyond 100% is normal behaviour: pages that are dirtied while
> > an iteration is in progress get re-sent in the next one, so the
> > total transferred can exceed the domain's memory size. Obviously it
> > should terminate eventually, after doing a number of iterations.
> >
> > Posting the xfrd log (after applying the patch) would be
> > interesting.
> >
>
> Applying the patch did not make a difference. However, reducing the
> amount of memory given to the migrated domain changes the error
> message. While the live migration still fails, it no longer keeps
> reloading pages endlessly; it now fails with a message like "Frame
> number in type 1 page table is out of range".
>
> The xfrd logs from the sender and receiver are attached for the
> 512 MB and 256 MB domain configurations.
>
>
> Also, a couple of smaller things:
>
> a) On doing a migrate (without --live), the 'xm migrate' command does
> not return control to the shell even after a successful migration.
> Pressing Control-C gives the following traceback:
>
> Traceback (most recent call last):
>   File "/usr/sbin/xm", line 9, in ?
>     main.main(sys.argv)
>   File "/usr/lib/python/xen/xm/main.py", line 808, in main
>     xm.main(args)
>   File "/usr/lib/python/xen/xm/main.py", line 106, in main
>     self.main_call(args)
>   File "/usr/lib/python/xen/xm/main.py", line 124, in main_call
>     p.main(args[1:])
>   File "/usr/lib/python/xen/xm/main.py", line 309, in main
>     migrate.main(args)
>   File "/usr/lib/python/xen/xm/migrate.py", line 49, in main
>     server.xend_domain_migrate(dom, dst, opts.vals.live, opts.vals.resource)
>   File "/usr/lib/python/xen/xend/XendClient.py", line 249, in xend_domain_migrate
>     {'op' : 'migrate',
>   File "/usr/lib/python/xen/xend/XendClient.py", line 148, in xendPost
>     return self.client.xendPost(url, data)
>   File "/usr/lib/python/xen/xend/XendProtocol.py", line 79, in xendPost
>     return self.xendRequest(url, "POST", args)
>   File "/usr/lib/python/xen/xend/XendProtocol.py", line 143, in xendRequest
>     resp = conn.getresponse()
>   File "/usr/lib/python2.3/httplib.py", line 778, in getresponse
>     response.begin()
>   File "/usr/lib/python2.3/httplib.py", line 273, in begin
>     version, status, reason = self._read_status()
>   File "/usr/lib/python2.3/httplib.py", line 231, in _read_status
>     line = self.fp.readline()
>   File "/usr/lib/python2.3/socket.py", line 323, in readline
>     data = recv(1)
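>
> The bottom of that trace shows xm blocked in recv(), waiting for an
> HTTP response from xend that never arrives. As a diagnostic (just a
> sketch, not a proper fix), a socket-level default timeout would at
> least turn the hang into a visible error; socket.setdefaulttimeout()
> is available in Python 2.3:
>
>     import socket
>
>     # If set before the client opens its connection to xend, any
>     # recv() that stalls longer than this raises socket.timeout
>     # instead of hanging forever.
>     socket.setdefaulttimeout(60.0)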
>
>
> b) I noticed that if the sender migrates a VM (without --live) while
> a console is attached to the domain, CPU utilization hits 100% after
> the migration until the console is disconnected.
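>
> My guess (and it is only a guess) at the console spin: once the
> domain has moved away, the console socket hits EOF; a socket at EOF
> is always readable, so a loop that never drops it burns CPU.
> Something shaped like this, with hypothetical names:
>
>     import select
>
>     def pump_console(sock):
>         while True:
>             readable, _, _ = select.select([sock], [], [])
>             data = sock.recv(1024)
>             # Bug pattern: when the peer closes, recv() returns ''
>             # immediately and select() stays ready, so without a
>             # 'break' on empty data this loop spins at 100% CPU.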
>
> Niraj
>
> > Best,
> > Ian
> >
> > > Going through xm save / scp <saved image> / xm restore works
> > > fine for the machine, though. Any ideas why live migration would
> > > not work? The combined memory usage of dom0 and domU is around
> > > half the physical memory present in the machine.
> > >
> > > The traceback from xend.log
> > >
> > > Traceback (most recent call last):
> > >   File "/usr/lib/python2.3/site-packages/twisted/internet/defer.py", line 308, in _startRunCallbacks
> > >     self.timeoutCall.cancel()
> > >   File "/usr/lib/python2.3/site-packages/twisted/internet/base.py", line 82, in cancel
> > >     raise error.AlreadyCalled
> > > AlreadyCalled: Tried to cancel an already-called event.
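> > >
> > > From a quick look, AlreadyCalled means cancel() was called on a
> > > timeout that had already fired. The usual Twisted guard (just a
> > > sketch; I haven't checked where xend arms timeoutCall) is:
> > >
> > >     from twisted.internet import reactor
> > >
> > >     # Hypothetical stand-in for xend's timeout: callLater returns
> > >     # a DelayedCall; active() goes false once it fires or is
> > >     # cancelled, so guarding cancel() avoids AlreadyCalled.
> > >     timeoutCall = reactor.callLater(30, lambda: None)
> > >     if timeoutCall.active():
> > >         timeoutCall.cancel()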
> > >
> > > xfrd.log on the sender is full of "Retry suspend domain (120)"
> > > before it says "Unable to suspend domain. (120)" and "Domain
> > > appears not to have suspended: 120".
> > >
> > > Niraj
> > >
> >
>
>
> --
> http://www.cs.cmu.edu/~ntolia
>
_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users