On 07/28/2010 04:06 AM, Daniel Stodden wrote:
Hey.
I tried to figure out some of the barrier stuff. I think it's becoming
more clear to me now. But take everything below with a grain of salt.
Sorry for having written a book about it. It's just because I'm not sure
if I'm always reasoning correctly.
Thanks for the detailed response!
There's summary at the bottom. Maybe skip over and just apply your
patch, possibly with a more optimistic update to the comment, and with a
third option in support of QUEUE_ORDERED_DRAIN.
In between, I seems to make sense to differentiate between blkback and
blktap1/2 to understand what our disk model traditionally was/is,
because until now frontends don't differentiate much. Except for maybe
that feature-barrier flag. And below I gather why I think we might want
to do something about it.
Blktap1
--------------------------------------------------------------------------
A blktap1 backend ring is quite simple to explain. It thinks it's a
cacheless disk with no ordering guarantees. In kernel queueing terms
that's a QUEUE_ORDERED_DRAIN.
Yes, if it has no cache then we don't need to worry about delayed
writes, and draining should be enough.
- We pass requests directly on to tapdisk.
- Tapdisk drivers will scatter and reorder at will, especially VHD.
- The RSP message is driven by iocb completion(s).
- It fully relies on AIO/O_DIRECT to make sure the data made it
actually down to the platter by that time.
I'm not aware of anything else left to a userspace app.
Blktap2
--------------------------------------------------------------------------
The blktap2 datapath was completely derived from blktap1.
The difference that it's now hooked into a bdev. Which is quite
important because image consistency is now in the hands of the block
layer.
It presently doesn't declare a barrier mode, which in my understanding
evaluates to QUEUE_ORDERED_NONE.
My understanding of QUEUE_ORDERED_NONE semantics is that the blk code
and filesystems will happily fill the queue and assmume in-order
completion. This is the mode of choice with either a queue depth of 1
or no reordering. Or a non-flushable cache, at which point you know
you're already screwed, so you don't need to worry.
Blktap2 passes a frontend ring worth of requests to userland, with the
same semantics as blktap1: queue depth of 32, no serialization
control. Which should actually be a QUEUE_ORDERED_DRAIN passed on to
the block layer.
So blktap2 doesn't pass barriers into userland? I guess userland
doesn't really have a way to send barriers into the kernel except with
f(data)sync or direct io.
So I guess we better fix that.
Blkback
--------------------------------------------------------------------------
I had some trouble with this one. Blkback and -tap1 being somewhat the
same kinda guy from the frontend perspective, I've always assumed the
semantics should be (are) the same. Now I think that's quite wrong.
Blkback is sitting on a kernel device queue, and that's a slightly
different beast, because the blkdev interface works SCSI-style, which
means it's completely revolving around SCSI barriers.
On SCSI you issue a barrier into a request stream and ideally that
directly maps to controller hardware, you're done.
For ATA with NCQ that apparently doesn't apply, but you still issue a
barrier op. It will cause a cache flush (if given) to complete
foregoing requests, then a forced unit access (if given) to complete
the barrier'd write itself. Or post-flush, again.
The point is you don't do that on every request separately, or
performance somewhat suffers. :)
Let's assume blkback on a bdev of queue depth>1 with
barriers enabled. And that we wanted to guarantee responses implying
physical completion. We'd have to do at least something like the
following: queue any batch we can get from the frontend, set a barrier
bit on the last one, then ack every req only after the last one
completed. Because in between, bio completion doesn't mean much. That
might even be an option, I haven't tried. One could probably also play
tricks based on the _FLUSH bit. Anyway, I don't see the blkback code
presently doing that.
Couldn't blkback just send a barrier write down into the host's storage
subsystem, and rely on it completing the bios in the right order?
> From the blkfront perspective, that would still be a weird-looking
QUEUE_ORDERED_DRAIN. Because everything queued appears to complete at
once. It didn't, of course, so you still have to drain where you want
to be safe.
If we don't do that, then the frontend has to submit the barrier
position.
Not sure I follow this.
I find this has some interesting implications. One is that the
guest will effectively be utilizing a writeback cache, safely. The
reason why I find this so exciting is because that cache can be
potentially much larger than a ring, and it'd still be safely used by
the guest. The other reason is I'm often made to wonder if blktap should
get one.
To a Linux/blkfront queue, this is a QUEUE_ORDERED_TAG.
Blkif
--------------------------------------------------------------------------
I'm not entirely clear about the reason for BLKIF_OP_FLUSH_DISKCACHE
because of the existence of BLKIF_OP_WRITE_BARRIER. Maybe someone can
enlighten me.
The latter implies a flush, if one is necessary (otherwise it wouldn't
be a barrier).
Really? It's perfectly reasonable to define a barrier which prevents
reordering but doesn't imply any synchronous write. If you have a
filesystem which just wants to make sure the superblock is written after
metadata updates, but doesn't really mean to push things out *now*, a
non-flushing barrier makes complete sense. I found it odd that all the
Linux documentation about barriers seem to imply a flush. Or is that
just the conventional meaning of "barrier" in storage-world?
I could imagine drain/flush combinations executed to achieve barriers,
or a flush as a slightly relaxed alternative, e.g. for queues which
preserve order.
Right.
Otoh, Xen normally tries to be about interface idealization where
appropriate. There is no point in feature-barrier==0&&
feature-cache-flush==1 or vice versa, because a barrier would still
have been realizable per drain/flush, and at least in the blkback case
there'd be no additional cost in the backend, if it just does that
on behalf of the frontend. And it saves precious ring space.
Last I'm not clear about how the meaning of feature-barrier got
formulated in the header. To the frontend, is not a "feature" which
"may succeed". Rather, if set, as a strong recommendation for guests
who aim to eventually remount their filesystems r/w after a
reboot.
Summary
--------------------------------------------------------------------------
I'm assuming most guest OS filesystem/block layer glue follows an
ordering interface based on SCSI? Does anyone know if there are
exceptions to that? I'd be really curious. [*]
Assuming the former is correct, I think we absolutely want interface
idealization.
This leaves exactly 2.5 reasonable modes which frontends should prepare
for:
- feature-barrier == 0 -> QUEUE_ORDERED_NONE
Disk is either a) non-caching/non-reordering or b) roken.
- feature-barrier == 1 -> QUEUE_ORDERED_TAG
Disk is reordering, quite possibly caching, but you won't need to
know. You seriously want to issue ordered writes with
BLKIF_OP_WRITE_BARRIER.
So does this mean that feature-barrier is actually not backwards
compatible? If you use an old blkfront which doesn't know what to do
with it, you end up with less reliable storage than before feature-barrier?
Perhaps the backend should keep writes synchronous until it sees a
barrier coming from the guest, then it switches to
caching/reordering/etc (and hope the guest sends a barrier quickish).
[
- !xenbus_exists(feature-barrier) -> QUEUE_ORDERED_DRAIN (?)
This is either blktap1, or an outdated blkback (?) I think one would
want to assume blktap1 (or parse otherend-id). With Blkback there's
probably not much you can do anyway. With blktap1 there's a chance
you're doing the right thing.
]
See patch below (delta against previous one). Does that look OK?
I'd suggest to ignore/phase-out the caching bit, because it's no use
wrt QUEUE_ORDERED_TAG and complementary to QUEUE_ORDERED_NONE.
I'd suggest to fix the backends where people see fit. In blktap2,
which appears urgent to me, this is a one liner for now, setting
QUEUE_ORDERED_DRAIN on the bdev. Unless we discover some day we want
to implement barriers in tapdisk. Which is not going to happen in
July.
OK. Is blkback OK as-is? And I don't care about blktap1, but I guess
its still the current product storage backend...
Thanks,
J
From: Jeremy Fitzhardinge<jeremy.fitzhardinge@xxxxxxxxxx>
Date: Wed, 28 Jul 2010 10:49:29 -0700
Subject: [PATCH] xen/blkfront: Use QUEUE_ORDERED_DRAIN for old backends
If there's no feature-barrier key in xenstore, then it means its a fairly
old backend which does uncached in-order writes, which means ORDERED_DRAIN
is appropriate.
Signed-off-by: Jeremy Fitzhardinge<jeremy.fitzhardinge@xxxxxxxxxx>
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index eb28e1f..8cd5418 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -423,27 +423,22 @@ static int xlvbd_init_blk_queue(struct blkfront_info
*info,
static int xlvbd_barrier(struct blkfront_info *info)
{
int err;
- unsigned ordered = QUEUE_ORDERED_NONE;
+ const char *barrier;
- /*
- * If we don't have barrier support, then there's really no
- * way to guarantee write ordering, so we really just have to
- * send writes to the backend and hope for the best. If
- * barriers are supported, we don't really know what the
- * flushing semantics of the barrier are, so again, tag the
- * requests in order and hope for the best.
- */
- if (info->feature_barrier)
- ordered = QUEUE_ORDERED_TAG;
+ switch (info->feature_barrier) {
+ case QUEUE_ORDERED_DRAIN: barrier = "enabled (drain)"; break;
+ case QUEUE_ORDERED_TAG: barrier = "enabled (tag)"; break;
+ case QUEUE_ORDERED_NONE: barrier = "disabled"; break;
+ default: return -EINVAL;
+ }
- err = blk_queue_ordered(info->rq, ordered, NULL);
+ err = blk_queue_ordered(info->rq, info->feature_barrier, NULL);
if (err)
return err;
printk(KERN_INFO "blkfront: %s: barriers %s\n",
- info->gd->disk_name,
- info->feature_barrier ? "enabled" : "disabled");
+ info->gd->disk_name, barrier);
return 0;
}
@@ -717,7 +712,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
printk(KERN_WARNING "blkfront: %s: write barrier op
failed\n",
info->gd->disk_name);
error = -EOPNOTSUPP;
- info->feature_barrier = 0;
+ info->feature_barrier = QUEUE_ORDERED_NONE;
xlvbd_barrier(info);
}
/* fall through */
@@ -1057,6 +1052,7 @@ static void blkfront_connect(struct blkfront_info *info)
unsigned long sector_size;
unsigned int binfo;
int err;
+ int barrier;
switch (info->connected) {
case BLKIF_STATE_CONNECTED:
@@ -1097,10 +1093,27 @@ static void blkfront_connect(struct blkfront_info *info)
}
err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
- "feature-barrier", "%lu",&info->feature_barrier,
+ "feature-barrier", "%lu",&barrier,
NULL);
+
+ /*
+ * If there's no "feature-barrier" defined, then it means
+ * we're dealing with a very old backend which probably writes
+ * in order with no cache; draining will do what needs to get
+ * done.
+ *
+ * If there are barriers, then we can do full queued writes
+ * with tagged barriers.
+ *
+ * If barriers are not supported, then there's no much we can
+ * do, so just set ordering to NONE.
+ */
if (err)
- info->feature_barrier = 0;
+ info->feature_barrier = QUEUE_ORDERED_DRAIN;
+ else if (barrier)
+ info->feature_barrier = QUEUE_ORDERED_TAG;
+ else
+ info->feature_barrier = QUEUE_ORDERED_NONE;
err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size);
if (err) {
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
|