
Re: [Xen-devel] blkif migration problem

Ewan Mellor wrote:
On Thu, Dec 07, 2006 at 03:47:39PM +0000, Cristian Zamfir wrote:


I am trying to live-migrate blkif devices backed by DRBD devices and I have been struggling with a problem for a few days now. The problem is that after migration, the domU cannot load any new programs into memory. The ssh connection survives migration and I can run programs that are already in memory, but not anything that needs to be loaded from the disk.

I am currently testing with an almost idle machine, and I am triggering the drive migration after the domain is suspended, in step 2, from XendCheckpoint.py: dominfo.migrateDevices(network, dst, DEV_MIGRATE_STEP2, domain_name).

However, I also tried triggering it before the domain is suspended, in step 1 (dominfo.migrateDevices(network, dst, DEV_MIGRATE_STEP1, domain_name)), and everything works fine, except that there is the obvious possibility of losing some writes to the disk because the domain is not suspended yet.

After migration, when I reattach a console I get this message:
"vbd vbd-769: 16 Device in use; refusing to close"
This is printed by the backend_changed() function in blkfront.c, but I cannot figure out why this error occurs.

I believe that this means that the frontend has seen that the backend is
tearing down, but since the device is still mounted inside the guest, it's
refusing.  I don't think that the frontend ought to see the backend tear down
at all -- the guest ought to be suspended before you tear down the backend.

I am triggering the migration in DEV_MIGRATE_STEP2, which is right after the domain has been suspended, as far as I can tell from the Python code in XendCheckpoint.py:

dominfo.migrateDevices(network, dst, DEV_MIGRATE_STEP1, domain_name)
def saveInputHandler(line, tochild):
    log.debug("In saveInputHandler %s", line)
    if line == "suspend":
        log.debug("Suspending %d ...", dominfo.getDomid())
        dominfo.migrateDevices(network, dst, DEV_MIGRATE_STEP2,
                               domain_name)
        log.info("Domain %d suspended.", dominfo.getDomid())
        dominfo.migrateDevices(network, dst, DEV_MIGRATE_STEP3,
                               domain_name)

"Triggering the migration" involves dominfo.migrateDevices(...) calling my script in /etc/xen/scripts. This script checks that the drive at the source and the replica at the destination are in sync, and then switches their roles (the one on the source becomes secondary and the one on the destination becomes primary). But since the guest is suspended at this point, I don't understand why the frontend should see any change.
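The sync check is the simple part; here is a minimal sketch of what my script does, assuming DRBD 8's /proc/drbd layout (the cs:/st:/ds: field names are my assumption from the docs and differ between DRBD versions):

```python
import re

# Example /proc/drbd contents, as produced by DRBD 8 (assumed format;
# older 0.7 releases report an "ld:" field instead of "ds:").
SAMPLE_PROC_DRBD = """\
version: 8.0.6 (api:86/proto:86)
 1: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:12345 dw:12345 dr:0 al:0 bm:7 lo:0 pe:0 ua:0 ap:0
"""

def drbd_in_sync(proc_drbd_text, minor):
    """Return True if the given DRBD minor is connected and both
    replicas report UpToDate, i.e. it is safe to switch roles."""
    for line in proc_drbd_text.splitlines():
        m = re.match(r"\s*(\d+): cs:(\S+) (?:st|ro):(\S+) ds:(\S+)", line)
        if m and int(m.group(1)) == minor:
            return (m.group(2) == "Connected"
                    and m.group(4) == "UpToDate/UpToDate")
    return False
```

Once that check passes, the role switch itself is just "drbdadm secondary <resource>" on the source followed by "drbdadm primary <resource>" on the destination.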

I found that DRBD drives are not really usable while they are in the secondary state; only the primary may be mounted. For instance, when trying to mount a DRBD device in the secondary state I get this error:
#mount -r -t reiserfs /dev/drbd1 /mnt/vm
mount: /dev/drbd1 already mounted or /mnt/vm busy

Therefore, could this error happen on the destination, during restore while waiting for backends to set up, if the drive is in secondary state?
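If so, the destination-side script probably needs to gate the backend on the local DRBD role. A hypothetical sketch of that decision (the function names here are illustrative, not a real Xen or DRBD API):

```python
def backend_may_open(local_role):
    """The device can only be opened (and hence seen by blkback) once
    this side has been promoted to Primary."""
    return local_role == "Primary"

def action_for_role(local_role, peer_role):
    """Decide what the destination hotplug script should do, given the
    local and peer DRBD roles.  Without allow-two-primaries, DRBD will
    refuse a promotion while the peer is still Primary."""
    if local_role == "Primary":
        return "open"
    if peer_role == "Secondary":
        return "promote-then-open"
    return "wait-for-peer-demotion"
```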

I also don't understand why everything works if I migrate the hard drive in DEV_MIGRATE_STEP1. The only error I get in that case is reiserfs complaining about some writes that failed, but everything else seems OK.

I cannot really try a localhost migration because I think DRBD only works between two machines, but I have tested most of my code outside Xen and it worked.

Thank you very much for your help.

When you say that you are "triggering the drive migration", what does that
involve?  Why would the frontend see the store contents change at all at this
point?

Have you tried a localhost migration?  This would be easier, because you don't
actually need to move the disk of course, so you can get half your signalling
tested before moving on to the harder problem.


Xen-devel mailing list
