Hi,
I'm part of the team integrating Xen support into Solaris and I'm trying
to add barrier support to our implementation so that we can improve
performance on our back end disk devices.
There are some differences in the way that I/O is performed by Solaris and
Linux that make this a little more challenging than it might appear. Hence
this email, to try to get some feedback on possible solutions.
We are trying to map the semantics of a BLKIF_OP_WRITE_BARRIER onto the
behaviour of the Solaris I/O sub-system. The first thing to observe is
that we believe that when a BLKIF_OP_WRITE_BARRIER returns to the front
end, all previously issued write operations (including the write which
triggered the barrier) are complete. Secondly, Solaris separates the notion
of a "barrier" from the write operation and provides the
DKIOCFLUSHWRITECACHE ioctl, which can be used to request that all previously
issued I/Os are flushed to disk.
We are currently implementing the behaviour of BLKIF_OP_WRITE_BARRIER
in two ways:
1. If a write is requested and the front end write cache is not enabled,
   then we issue a BLKIF_OP_WRITE_BARRIER, which causes the back end
   to wait for completion of the write and then to issue a
   DKIOCFLUSHWRITECACHE ioctl on the underlying device to ensure the
   write cache is flushed before returning to the front end.
   If the write cache is enabled, then we issue a BLKIF_OP_WRITE
   request, which doesn't require a DKIOCFLUSHWRITECACHE ioctl in
   the back end and so performs better.
2. If we receive a DKIOCFLUSHWRITECACHE in the front end, then we
   have a problem: we have received a requirement to ensure that
   previous writes are flushed, but we have no write associated with
   the request with which to issue a BLKIF_OP_WRITE_BARRIER.
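
As a rough sketch of the two cases above (not our actual driver code), the
front end decision could be written as follows. The operation codes mirror
Xen's public io/blkif.h; fe_request, FE_OP_NONE and fe_choose_op() are
names invented purely for illustration.

```c
#include <stdbool.h>

/* Operation codes as defined in Xen's public io/blkif.h. */
#define BLKIF_OP_WRITE          1
#define BLKIF_OP_WRITE_BARRIER  2

/* Hypothetical marker: the request has no blkif operation to map onto. */
#define FE_OP_NONE              (-1)

/* Hypothetical classification of what the front end was asked to do. */
enum fe_request {
    FE_REQ_WRITE,   /* ordinary write carrying data */
    FE_REQ_FLUSH    /* DKIOCFLUSHWRITECACHE: no data attached */
};

/*
 * Pick the blkif operation for a front end request.
 * Returns FE_OP_NONE for case 2, where a flush arrives with no write
 * to attach a barrier to.
 */
static int fe_choose_op(enum fe_request req, bool write_cache_enabled)
{
    if (req == FE_REQ_FLUSH)
        return FE_OP_NONE;      /* case 2: nothing in the protocol fits */

    /* Case 1: with the cache off, each write must be flushed to disk. */
    return write_cache_enabled ? BLKIF_OP_WRITE : BLKIF_OP_WRITE_BARRIER;
}
```

The FE_OP_NONE result is the gap the rest of this email is about.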
We have modified our Solaris front end so that it can issue zero-byte
writes. The existing Solaris back end receives such a write and passes it
on to the lower-level drivers, which return success, and this eventually
results in an ioctl to flush pending writes. This is where we hit the
problem I'm trying to solve. On Linux, if a zero-byte write is received, the
blkback device returns a failure response to the front end, presumably
because a zero-byte write will not be accepted by the lower-level drivers.
This is where I need to work out what my options are for making Solaris
work correctly as a domU on a Linux dom0. Things I am considering include:
(a) Caching a previously issued write and, when running on Linux,
    issuing that write in place of a zero-byte write so as to
    succeed. This is a bit of a nasty hack, and is not efficient.
(b) Investigating removing the restriction on zero-byte writes
    in the Linux blkback driver. I'm not knowledgeable enough about
    the Linux kernel to know if this would work and would appreciate
    feedback on this suggestion. This would require the Linux block
    layer to accept zero-byte writes, which I am led to believe it
    currently does not.
(c) Adding a new protocol operation, BLKIF_OP_BARRIER, which would
    require support in Linux blkback to return success on receipt. In
    Solaris, we would issue the DKIOCFLUSHWRITECACHE ioctl on
    our layered device and return. In Linux, I've been informed that
    it might be implemented via blkdev_issue_flush().
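
To make (c) a little more concrete, here is a toy model of how a back end
might dispatch the proposed operation. This is only a sketch: toy_disk,
toy_flush() and be_handle() are invented for illustration, the value given
to BLKIF_OP_BARRIER is a placeholder, and toy_flush() merely stands in for
DKIOCFLUSHWRITECACHE or blkdev_issue_flush(). The other constants mirror
Xen's public io/blkif.h.

```c
#include <stddef.h>

/* Values as in Xen's public io/blkif.h. */
#define BLKIF_OP_WRITE          1
#define BLKIF_OP_WRITE_BARRIER  2
#define BLKIF_RSP_ERROR         (-1)
#define BLKIF_RSP_OKAY          0

/* HYPOTHETICAL: the operation proposed in (c); the value is a placeholder. */
#define BLKIF_OP_BARRIER        100

/* Toy disk: bytes held in the write cache versus bytes on stable media. */
struct toy_disk {
    size_t cached;
    size_t stable;
};

/* Stand-in for a real cache flush on the underlying device. */
static void toy_flush(struct toy_disk *d)
{
    d->stable += d->cached;
    d->cached = 0;
}

/* Handle one request of len bytes and return a blkif response code. */
static int be_handle(struct toy_disk *d, int op, size_t len)
{
    switch (op) {
    case BLKIF_OP_BARRIER:
        /* No data attached: flush whatever was written earlier. */
        toy_flush(d);
        return BLKIF_RSP_OKAY;
    case BLKIF_OP_WRITE:
    case BLKIF_OP_WRITE_BARRIER:
        if (len == 0)
            return BLKIF_RSP_ERROR;  /* models the zero-byte rejection */
        d->cached += len;
        if (op == BLKIF_OP_WRITE_BARRIER)
            toy_flush(d);            /* the write and all before it */
        return BLKIF_RSP_OKAY;
    default:
        return BLKIF_RSP_ERROR;
    }
}
```

With this shape, a front end flush maps to a single data-less request
instead of a zero-byte write.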
There are pros and cons for each of the above suggestions and they are not
mutually exclusive. I imagine we will need to implement (a) in order to
work with existing installations. (b) and (c) are alternatives to get
around the perceived problem that Linux doesn't like 0 byte writes and
to provide a clean solution to the problem which would minimise the
requirement for the hack described in (a). (b) would be simpler in some
ways, but we would still have issues with the installed base unless we
added a new flag to xenstore indicating that it is acceptable for
a client to pass a zero-byte write to the back end. (c) is probably the
cleanest approach and would, I think, provide a complete solution when
coupled with (a).
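
Since the options are not mutually exclusive, the front end could pick a
strategy per back end. The sketch below assumes the back end advertises its
capabilities somehow (for example via xenstore flags); the be_features
fields and the preference order of (c), then (b), then (a) are assumptions
of mine, not anything that exists today.

```c
#include <stdbool.h>

/* How the front end turns a DKIOCFLUSHWRITECACHE into a blkif request. */
enum flush_strategy {
    FLUSH_VIA_BARRIER_OP,       /* (c): dedicated data-less barrier op */
    FLUSH_VIA_ZERO_BYTE_WRITE,  /* (b): zero-byte write accepted */
    FLUSH_VIA_REPLAYED_WRITE    /* (a): reissue a cached write */
};

/* HYPOTHETICAL capabilities a back end might advertise. */
struct be_features {
    bool barrier_op;        /* understands the proposed BLKIF_OP_BARRIER */
    bool zero_byte_write;   /* will not fail a zero-byte write */
};

/* Prefer the cleanest mechanism; fall back to (a) for older back ends. */
static enum flush_strategy choose_flush_strategy(const struct be_features *f)
{
    if (f->barrier_op)
        return FLUSH_VIA_BARRIER_OP;
    if (f->zero_byte_write)
        return FLUSH_VIA_ZERO_BYTE_WRITE;
    return FLUSH_VIA_REPLAYED_WRITE;
}
```

An existing installation that advertises neither capability ends up on the
(a) path, which is why I expect we need (a) regardless.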
Ok, those are the alternatives which seem viable to me right now. I've
considered and discarded other schemes/alternatives none of which were
as desirable as the ones I've listed. I'd really appreciate some feedback
from the community at this point.
Thanks,
Gary
--
Gary Pennington
Solaris Core OS
Sun Microsystems
Gary.Pennington@xxxxxxx
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel