We are seeing disk corruption when migrating a VM between two nodes that are
both active writers to a shared storage block device. The corruption seems to
be caused by a lack of synchronization between the migration source and
destination regarding outstanding block write requests. The failing scenario
is as follows:

1) The VM has block write A in progress on the source node X at the time it
   is being migrated.

2) The blkfront driver requeues A on the destination node Y after migration.
   Request A gets completed immediately, because the shared storage already
   has a request in flight to the same block (from X), so it ignores the new
   request.

3) Now that the VM is running on Y, a new block write request A' is made
   from Y to the same block number as A. Request A' gets completed
   immediately for the same reason as in step 2.

The corruption we are seeing is that the block contains the data A, not A' as
the VM expects. The problem is that the shared storage doesn't guarantee the
outcome of the concurrent writes X->A and Y->A. It chooses to ignore and
immediately complete the second request, which I understand is one of the
acceptable strategies for managing concurrent writes to the same block. That
behavior is fine when the redundant request A is being ignored, but when the
new request A' arrives and is ignored in the same way, we get corruption.

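To make the sequence concrete, here is a toy Python model of the three steps. It is only a sketch and not Xen or blkfront code: the SharedStorage class, its method names, and the block number are made up for illustration, and it assumes the storage acknowledges but drops any write to a block that already has a write in flight, as described above.

```python
# Toy model of the failing scenario; not Xen, blkfront, or storage code.
# Assumption: the storage acknowledges but drops any write to a block
# that already has a write in flight.

class SharedStorage:
    def __init__(self):
        self.blocks = {}      # block number -> data on disk
        self.in_flight = {}   # block number -> (node, data) not yet on disk

    def submit_write(self, node, block, data):
        """Return 'queued', or 'ignored' if a write to block is in flight."""
        if block in self.in_flight:
            return "ignored"  # completed immediately, data dropped
        self.in_flight[block] = (node, data)
        return "queued"

    def finish_in_flight(self, block):
        """The in-flight write finally reaches the disk."""
        _node, data = self.in_flight.pop(block)
        self.blocks[block] = data


def failing_scenario():
    storage = SharedStorage()
    block = 7  # arbitrary block number

    # 1) Write A is in progress on source node X when the VM is migrated.
    r1 = storage.submit_write("X", block, "A")
    # 2) A is requeued on destination node Y; the storage ignores it.
    r2 = storage.submit_write("Y", block, "A")
    # 3) The VM, now running on Y, writes A' to the same block; also ignored.
    r3 = storage.submit_write("Y", block, "A'")
    # X's original write lands last and determines the block contents.
    storage.finish_in_flight(block)
    return [r1, r2, r3], storage.blocks[block]


if __name__ == "__main__":
    results, on_disk = failing_scenario()
    print(results, on_disk)
```

In this model the block ends up holding A even though the VM's last acknowledged write was A', which matches the corruption we observe.
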
The problem only shows up under heavy disk load (e.g. the Bonnie benchmark)
while migrating, so most users probably haven't seen it. If I understand this
correctly, though, it could affect anyone using shared block storage with dual
active writers and live migration. When we run with a single active writer and
then move the active writer to the destination node, all outstanding requests
get flushed in the background and we don't see this problem.

The blkfront xenbus_driver doesn't have a "suspend" method. I was going to add
one that flushes the outstanding requests from the migration source to fix the
problem. Alternatively, maybe we can cancel all outstanding I/O requests to
eliminate the concurrency between the two nodes. Does the Linux block I/O
interface allow requests to be canceled?

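As a rough illustration of the flush idea, the toy Python model below drains the source's outstanding writes before the VM issues anything from the destination. It is a sketch under stated assumptions, not a patch: suspend_source() is a hypothetical stand-in for the proposed blkfront "suspend" method, and SharedStorage is an invented model of storage that ignores a write to a block with a write already in flight.

```python
# Toy model of the proposed fix; not the blkfront driver.
# Assumptions: suspend_source() stands in for a hypothetical blkfront
# "suspend" hook, and the storage ignores a write to a block that
# already has a write in flight.

class SharedStorage:
    def __init__(self):
        self.blocks = {}     # block number -> data on disk
        self.in_flight = {}  # block number -> data not yet on disk

    def submit_write(self, block, data):
        if block in self.in_flight:
            return "ignored"
        self.in_flight[block] = data
        return "queued"

    def flush(self):
        """Wait for every in-flight write to reach the disk."""
        for block in list(self.in_flight):
            self.blocks[block] = self.in_flight.pop(block)


def suspend_source(storage):
    # Hypothetical suspend step: drain the source's outstanding writes
    # before the VM is allowed to resume on the destination.
    storage.flush()


def migrate_with_flush():
    storage = SharedStorage()
    block = 7  # arbitrary block number
    storage.submit_write(block, "A")             # in progress on source X
    suspend_source(storage)                      # drained before Y takes over
    result = storage.submit_write(block, "A'")   # VM on Y writes A'
    storage.flush()
    return result, storage.blocks[block]


if __name__ == "__main__":
    print(migrate_with_flush())
```

With the source drained first, nothing is in flight when A' arrives, so in this model A' is queued rather than ignored and the block ends up holding A'.
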
Is anyone else seeing this problem? Any other ideas for solutions?

Thanks,
Jeff