Hey.
I tried to figure out some of the barrier stuff. I think it's becoming
more clear to me now. But take everything below with a grain of salt.
Sorry for having written a book about it. It's just because I'm not sure
if I'm always reasoning correctly.
There's a summary at the bottom. Maybe skip ahead and just apply your
patch, possibly with a more optimistic update to the comment, and with a
third option in support of QUEUE_ORDERED_DRAIN.
In between, it seems to make sense to differentiate between blkback and
blktap1/2 to understand what our disk model traditionally was/is,
because until now frontends don't differentiate much, except maybe for
that feature-barrier flag. Below I gather why I think we might want
to do something about it.
Blktap1
--------------------------------------------------------------------------
A blktap1 backend ring is quite simple to explain. It thinks it's a
cacheless disk with no ordering guarantees. In kernel queueing terms
that's a QUEUE_ORDERED_DRAIN.
- We pass requests directly on to tapdisk.
- Tapdisk drivers will scatter and reorder at will, especially VHD.
- The RSP message is driven by iocb completion(s).
- It fully relies on AIO/O_DIRECT to make sure the data made it
actually down to the platter by that time.
I'm not aware of anything else left to a userspace app.
Blktap2
--------------------------------------------------------------------------
The blktap2 datapath was completely derived from blktap1.
The difference is that it's now hooked into a bdev, which is quite
important because image consistency is now in the hands of the block
layer.
It presently doesn't declare a barrier mode, which in my understanding
evaluates to QUEUE_ORDERED_NONE.
My understanding of QUEUE_ORDERED_NONE semantics is that the blk code
and filesystems will happily fill the queue and assume in-order
completion. This is the mode of choice with either a queue depth of 1
or no reordering. Or with a non-flushable cache, at which point you know
you're already screwed, so you don't need to worry.
Blktap2 passes a frontend ring worth of requests to userland, with the
same semantics as blktap1: queue depth of 32, no serialization
control. Which should actually be a QUEUE_ORDERED_DRAIN passed on to
the block layer.
So I guess we better fix that.
Blkback
--------------------------------------------------------------------------
I had some trouble with this one. Blkback and -tap1 being somewhat the
same kinda guy from the frontend perspective, I've always assumed the
semantics should be (are) the same. Now I think that's quite wrong.
Blkback is sitting on a kernel device queue, and that's a slightly
different beast, because the blkdev interface works SCSI-style, which
means it's completely revolving around SCSI barriers.
On SCSI you issue a barrier into a request stream and ideally that
directly maps to controller hardware, you're done.
For ATA with NCQ that apparently doesn't apply, but you still issue a
barrier op. It will cause a cache flush (if given) to complete the
foregoing requests, then a forced unit access (if given) to complete
the barrier'd write itself. Or, again, a post-flush.
The point is you don't do that on every request separately, or
performance somewhat suffers. :)
Let's assume blkback on a bdev of queue depth >1 with
barriers enabled. And that we wanted to guarantee responses implying
physical completion. We'd have to do at least something like the
following: queue any batch we can get from the frontend, set a barrier
bit on the last one, then ack every req only after the last one
completed. Because in between, bio completion doesn't mean much. That
might even be an option, I haven't tried. One could probably also play
tricks based on the _FLUSH bit. Anyway, I don't see the blkback code
presently doing that.
From the blkfront perspective, that would still be a weird-looking
QUEUE_ORDERED_DRAIN. Because everything queued appears to complete at
once. It didn't, of course, so you still have to drain where you want
to be safe.
If we don't do that, then the frontend has to submit the barrier
position. I find this has some interesting implications. One is that the
guest will effectively be utilizing a writeback cache, safely. The
reason why I find this so exciting is because that cache can be
potentially much larger than a ring, and it'd still be safely used by
the guest. The other reason is I'm often made to wonder if blktap should
get one.
To a Linux/blkfront queue, this is a QUEUE_ORDERED_TAG.
Blkif
--------------------------------------------------------------------------
I'm not entirely clear about the reason for BLKIF_OP_FLUSH_DISKCACHE
because of the existence of BLKIF_OP_WRITE_BARRIER. Maybe someone can
enlighten me.
The latter implies a flush, if one is necessary (otherwise it wouldn't
be a barrier).
I could imagine drain/flush combinations executed to achieve barriers,
or a flush as a slightly relaxed alternative, e.g. for queues which
preserve order.
Otoh, Xen normally tries to be about interface idealization where
appropriate. There is no point in feature-barrier==0 &&
feature-cache-flush==1 or vice versa, because a barrier would still
have been realizable per drain/flush, and at least in the blkback case
there'd be no additional cost in the backend, if it just does that
on behalf of the frontend. And it saves precious ring space.
Lastly, I'm not clear about how the meaning of feature-barrier got
formulated in the header. To the frontend, it is not a "feature" which
"may succeed". Rather, if set, it is a strong recommendation for guests
who aim to eventually remount their filesystems r/w after a
reboot.
Summary
--------------------------------------------------------------------------
I'm assuming most guest OS filesystem/block layer glue follows an
ordering interface based on SCSI? Does anyone know if there are
exceptions to that? I'd be really curious. [*]
Assuming the former is correct, I think we absolutely want interface
idealization.
This leaves exactly 2.5 reasonable modes which frontends should prepare
for:
- feature-barrier == 0 -> QUEUE_ORDERED_NONE
Disk is either a) non-caching/non-reordering or b) broken.
- feature-barrier == 1 -> QUEUE_ORDERED_TAG
Disk is reordering, quite possibly caching, but you won't need to
know. You seriously want to issue ordered writes with
BLKIF_OP_WRITE_BARRIER.
[
- !xenbus_exists(feature-barrier) -> QUEUE_ORDERED_DRAIN (?)
This is either blktap1, or an outdated blkback (?). I think one would
want to assume blktap1 (or parse otherend-id). With blkback there's
probably not much you can do anyway. With blktap1 there's a chance
you're doing the right thing.
]
I'd suggest to ignore/phase-out the caching bit, because it's no use
wrt QUEUE_ORDERED_TAG and complementary to QUEUE_ORDERED_NONE.
I'd suggest to fix the backends where people see fit. In blktap2,
which appears urgent to me, this is a one liner for now, setting
QUEUE_ORDERED_DRAIN on the bdev. Unless we discover some day we want
to implement barriers in tapdisk. Which is not going to happen in
July.
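For reference, the one-liner I mean would be something like the below in the tapdev bring-up (a sketch against the blk_queue_ordered() API used in the patch quoted further down; the exact call site is hypothetical):

```c
/* Sketch: declare drain ordering for the tapdev queue, so the block
 * layer stops assuming in-order completion as it does under the
 * current default of QUEUE_ORDERED_NONE. */
blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);
```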
In good old blktap1, it would take a queue drain. I guess that would
best be done in userspace, by forwarding BLKIF_OP_WRITE_BARRIER. At
which point we might just shift the blktap2 tapdev mode to
QUEUE_ORDERED_TAG as well.
[*] One could imagine different hardware models for a ring-based Xen
backend. As mentioned above, NCQ could be reasonable too. Essentially,
one might just go and map queue ordering bits into a xenstore key and
let the guest I/O stack figure it out. That doesn't map to bio.h, but
it would map to blkdev.h.
I'm also the wrong guy to ask whether there'd be a benefit :). Definitely
not for Linux guests, where it's only overhead in order to get the
same full barrier op onto the ring. But maybe for other OSes?
On Thu, 2010-07-22 at 17:20 -0400, Jeremy Fitzhardinge wrote:
> When barriers are supported, then use QUEUE_ORDERED_TAG to tell the block
> subsystem that it doesn't need to do anything else with the barriers.
> Previously we used ORDERED_DRAIN which caused the block subsystem to
> drain all pending IO before submitting the barrier, which would be
> very expensive.
>
> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 92091a7..b61a021 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -418,10 +418,20 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
> static int xlvbd_barrier(struct blkfront_info *info)
> {
> int err;
> + unsigned ordered = QUEUE_ORDERED_NONE;
>
> -	err = blk_queue_ordered(info->rq,
> -			info->feature_barrier ? QUEUE_ORDERED_DRAIN :
> -			QUEUE_ORDERED_NONE,
> -			NULL);
> + /*
> + * If we don't have barrier support, then there's really no
> + * way to guarantee write ordering, so we really just have to
> + * send writes to the backend and hope for the best. If
> + * barriers are supported, we don't really know what the
> + * flushing semantics of the barrier are, so again, tag the
> + * requests in order and hope for the best.
> + */
> + if (info->feature_barrier)
> + ordered = QUEUE_ORDERED_TAG;
> +
> + err = blk_queue_ordered(info->rq, ordered, NULL);
>
> if (err)
> return err;
> @@ -508,8 +518,7 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
> info->rq = gd->queue;
> info->gd = gd;
>
> - if (info->feature_barrier)
> - xlvbd_barrier(info);
> + xlvbd_barrier(info);
>
> if (vdisk_info & VDISK_READONLY)
> set_disk_ro(gd, 1);
>
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel