[Xen-devel] Re: [PATCH RFC] xen/blkfront: use tagged queuing for barriers

To: Daniel Stodden <daniel.stodden@xxxxxxxxxx>
Subject: [Xen-devel] Re: [PATCH RFC] xen/blkfront: use tagged queuing for barriers
From: Jeremy Fitzhardinge <jeremy@xxxxxxxx>
Date: Wed, 28 Jul 2010 16:12:08 -0700
Cc: Paul Durrant <Paul.Durrant@xxxxxxxxxx>, "Xen-devel@xxxxxxxxxxxxxxxxxxx" <Xen-devel@xxxxxxxxxxxxxxxxxxx>, Jake Wires <Jake.Wires@xxxxxxxxxx>
In-reply-to: <1280353948.16573.1679.camel@xxxxxxxxxxxxxxxxxxxxxxx>
References: <4C48B61F.2030302@xxxxxxxx> <1280315165.4626.3945.camel@xxxxxxxxxxxxxxxxxxx> <4C5070FE.4020308@xxxxxxxx> <1280353948.16573.1679.camel@xxxxxxxxxxxxxxxxxxxxxxx>
 On 07/28/2010 02:52 PM, Daniel Stodden wrote:
If your average disk is caching, there is no non-flushing barrier. You
cannot order a metadata write *after* any foregoing one if you submit it
to a non-empty disk cache only with a post-flush, because the cache
won't preserve order.

If you have a modern SCSI disk which supports an ordered write, good
news is that you can leave the problem to the hardware and keep queuing
without explicit draining or flushing. I think I understood that
correctly.

Same with NCQ for SATA?  Or is that something else?

There's nothing in the kernel suggesting a BIO_RW_BARRIER with a bit
dropping the durability requirement. So it must be a full flush. You're
right that a non-flushing one might make sense to FSes, but then again
that requires caches to track order. Or NVRAM right away.

Yes. Given that the drive manufacturers seem keen on putting more smarts into the drive (deeper queues to allow drives to schedule, object storage, etc), tracking write ordering in their delayed write buffers shouldn't be too complex.

Caching disks without hardware barriers in the above sense need to get
flushed explicitly on the host before submitting the request. That's
probably why my desktop's ATA drive has all those drain and flush bits
set on its queue.

With barrier bits in the software queue that's fairly transparent to an
FS which doesn't want to care. But I think it will want to. I'm not sure
what the filesystem does in every detail, but with data=ordered the way
to deal with it would be to gather as many data writes as is reasonable
in the queue before following up with a bunch of metadata updates.

So you gather and merge threads at the filesystem level to optimize the
queue. This queue, which I only started studying recently, right now
looks very single-threaded to me. There's no partial ordering anywhere
on the way to the hardware, although having that sounds like a neat
idea to me. Again, I might be missing some bits, so I probably shouldn't
even comment on that issue.

I found it odd that all the
Linux documentation about barriers seems to imply a flush.  Or is that
just the conventional meaning of "barrier" in storage-world?

Barrier in barrier.txt means a full barrier in every respect: No prior
request may commit later, no later request may commit earlier. Plus (!)
a barrier write completion is durable.

So you're right. Maybe also in wondering if a weaker model wouldn't be
more elegant, but I'd expect it not to be done because it's too
expensive in hardware. NVRAM is simpler because it works well in a
legacy context. So you can call that fast and cacheless, and everything
stays as is and just gets faster. And flash is going to be everywhere,
anyway.

Flash is still dependent on being able to reorder writes, particularly to make sure they get grouped in whatever way is good for the flash controller.

But a weak, non-flushing barrier can be implemented as a flushing barrier if the hardware finds it convenient (or at any level); just because you can't implement it on all devices/paths doesn't mean you can't specify it conceptually.

What I would personally find interesting would be partial ordering, to
multi-thread the queue down to the controller.. Because it appears to me
that's what SATA with NCQ does best. You have 32 slots. A normal request
may cache, or have the FUA bit set, so it's writing through. In either
case, requests complete out-of-order.

Now imagine you have a whole lot of threads in your filesystem,
performing independent updates. I guess that's the normal case. You
order data, so in phase 1 you fill the disk cache with user data. Then
flush. Once. Then you follow up with 32 metadata updates, all FUA, now
simultaneously.

I think in the kernel that's presently not possible, because there's no
partial ordering. One would probably want to add an I/O context
allocated by each thread, as a key to bio submission, which then tracks
which parent request the ordered ones relate to. Then aggressively merge
those, bundling independent barrier writes at the tip of the queue.

I might be missing a couple tiny details. Such as journaling and what
not. :P

That might make sense for a very tree-oriented filesystem like btrfs, though ultimately you need a total ordering to make sure the overall filesystem is in a consistent state.
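
For concreteness, a purely illustrative sketch of the kind of per-thread
ordering context described above, used as a key at bio submission time.
Nothing like this exists in the kernel; the names (io_order_ctx,
submit_bio_ordered) are made up here:

        /* Illustrative only: a hypothetical per-thread ordering context.
         * Requests submitted against the same context must commit in
         * submission order; requests from different contexts are free to
         * be reordered and merged against each other. */
        struct io_order_ctx {
                struct list_head pending;   /* bios issued in this context */
                u64              seq;       /* per-context ordering point */
        };

        /* Hypothetical submission path: like submit_bio(), but carrying
         * the context so the block layer can track partial ordering. */
        void submit_bio_ordered(struct io_order_ctx *ctx, int rw,
                                struct bio *bio);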

Summary
--------------------------------------------------------------------------

I'm assuming most guest OS filesystem/block layer glue follows an
ordering interface based on SCSI? Does anyone know if there are
exceptions to that? I'd be really curious. [*]

Assuming the former is correct, I think we absolutely want interface
idealization.

This leaves exactly 2.5 reasonable modes which frontends should prepare
for:

   - feature-barrier == 0 ->   QUEUE_ORDERED_NONE

     Disk is either a) non-caching/non-reordering or b) broken.

   - feature-barrier == 1 ->   QUEUE_ORDERED_TAG

     Disk is reordering, quite possibly caching, but you won't need to
     know. You seriously want to issue ordered writes with
     BLKIF_OP_WRITE_BARRIER.

So does this mean that feature-barrier is actually not backwards
compatible?  If you use an old blkfront which doesn't know what to do
with it, you end up with less reliable storage than before feature-barrier?

The previous model in blkfront is the one you just replaced, right?

        err = blk_queue_ordered(info->rq,
                                info->feature_barrier ? QUEUE_ORDERED_DRAIN :
                                                        QUEUE_ORDERED_NONE,
                                NULL);

Blkback writes feature-barrier=1 because the disk needs it, and the
frontend is moving to NONE, i.e. assumes non-reordering. That's wrong
and dangerous?

No, I mean an *old* blkfront which doesn't know about feature-barrier *at all*. It never looks at it, and never sends barriers. If blkback sets feature-barrier=1 and the blkfront ignores it, then you end up with no ordering or durability guarantees at all. Whereas if feature-barrier=0 or the blkback is pre-barrier, then the writes are synchronous and in-order.

In other words, at the moment, it's actually mandatory for a backend to implement barriers if feature-barrier=1.

Blkback writes feature-barrier=0 and the disk moves to DRAIN. That's not
wrong. It might even help avoid the worst, because it adds latency.
But otherwise it's a waste of time.

Blk*tap*1 writes no feature-barrier at all, iirc, and the frontend stays
at NONE. That sounds wrong again.

That's already fairly broken.

Well, if we assume that blktap1 with no feature-barrier is doing synchronous writes (not necessarily ordered) then the patch to make it use DRAIN should be fine. If they're ordered then NONE is better.

We could fix blktap1 by making it write the bit complemented (i.e. 1).
We cannot entirely fix blkback with barriers enabled, because there's
nothing generating barriers in there.

But writing the bit inverted might help increase the probability of
getting away with it.

We could update blkif.h accordingly:

"""
#define BLKIF_OP_BARRIER ..

When feature-barrier is set, then barrier support is only a feature, and
it's almost guaranteed to fail. If feature-barrier is zero, then barrier
support is not just a feature but strictly required, assuming you care
for your disk. If feature-barrier isn't set at all, we strongly
recommend resorting to a queue drain instead, to improve the probability
of getting your writes to order correctly.
"""

That just seems gratuitously confusing.

Then again, I guess pvops cares mostly for the blkback case where
feature-barrier is normally set, because there aren't many disks left
with queue depth 1 or no reordering (?). We won't get this fixed, but
the DRAIN might be better than nothing (?) Altogether not sure about
this one.

I don't think it has very much to do with the underlying disks. If you submit a bunch of IOs to the blk subsystem, then it will reorder them in the ioscheduler anyway, won't it?

I should also check with Paul what the XenServer PV drivers are
assuming. [hereby CCd]

Perhaps the backend should keep writes synchronous until it sees a
barrier coming from the guest, then it switches to
caching/reordering/etc (and hope the guest sends a barrier quickish).

That won't help. The synchronous mode just means request completion and
release, but not durability. Or do you mean flagging everything to
barrier writes?

I mean that until the backend sees a barrier from the frontend, it does whatever it can to make sure that writes complete in order, and are durable when they're complete. So I guess that means sticking a barrier on every one.
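
A minimal sketch of what that could look like in blkback's dispatch
switch, assuming a hypothetical per-blkif "seen_barrier" flag (this is
not existing code, just the idea spelled out):

        switch (req->operation) {
        case BLKIF_OP_READ:
                operation = READ;
                break;
        case BLKIF_OP_WRITE:
                /* No barrier seen from this frontend yet: keep its
                 * writes ordered and durable by forcing a barrier. */
                operation = blkif->seen_barrier ? WRITE : WRITE_BARRIER;
                break;
        case BLKIF_OP_WRITE_BARRIER:
                blkif->seen_barrier = 1;
                operation = WRITE_BARRIER;
                break;
        default:
                operation = 0; /* make gcc happy */
                BUG();
        }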

That sounds like a good correctness-preserving measure. I'm just kinda
worried about rogue guests; a gratuitous barrier doesn't come cheap,
does it?

Rogue guest in what sense? I guess the question is how long does it take before a guest typically sends a write barrier down. I wonder if it requires using a barrier capable and enabled filesystem?

[
   - !xenbus_exists(feature-barrier) ->   QUEUE_ORDERED_DRAIN (?)

     This is either blktap1, or an outdated blkback (?) I think one would
     want to assume blktap1 (or parse otherend-id). With Blkback there's
     probably not much you can do anyway. With blktap1 there's a chance
     you're doing the right thing.
]
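
Taken together, the three cases above would map onto queue ordering
modes roughly as follows. This is only a sketch, not the actual patch;
the "feature_barrier_present" flag (key exists in xenstore at all) is a
hypothetical name:

        int ordered;

        if (!info->feature_barrier_present)
                ordered = QUEUE_ORDERED_DRAIN;  /* blktap1 / old blkback */
        else if (info->feature_barrier)
                ordered = QUEUE_ORDERED_TAG;    /* backend honours BLKIF_OP_WRITE_BARRIER */
        else
                ordered = QUEUE_ORDERED_NONE;   /* non-reordering, non-caching backend */

        err = blk_queue_ordered(info->rq, ordered, NULL);
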
See patch below (delta against previous one).  Does that look OK?

Yes, this patch looks good to me.

Do you think it's worth asking Jens or someone else who's really good
with this stuff about the reasoning so far?

Not sure. I'm going to push these blkfront patches to Jens shortly; perhaps he'll review them.

I also have no idea if we're facing a regrettable performance
difference. :}

TAG should always be an improvement over DRAIN, no?

I'd suggest to ignore/phase-out the caching bit, because it's no use
wrt QUEUE_ORDERED_TAG and complementary to QUEUE_ORDERED_NONE.

I'd suggest to fix the backends where people see fit. In blktap2,
which appears urgent to me, this is a one liner for now, setting
QUEUE_ORDERED_DRAIN on the bdev.  Unless we discover some day we want
to implement barriers in tapdisk. Which is not going to happen in
July.

OK.  Is blkback OK as-is?  And I don't care about blktap1, but I guess
it's still the current product storage backend...

The blkback side is still beyond my understanding.

Blkback has only a few simple duties.

* q->ordered must always be reflected in feature-barrier.

    We start out with 1, without testing, then clear after failing.

    That looks okay. Should it rather test first? Because I'm
    pretty sure even those spurious request attempts generate alerting
    noise in the logs.

     Also, judging from the code it seems the ordering mode
     may change (stuff like hdparm -W called maybe?).

I think the choice of IO scheduler will have a bigger effect.

    I should check if that needs improvement then.

     If that's the case, then the frontend should be
     watching. Could even be the case that this is too simplistic
     if you're switching to ordered > 0 while I/O is in-flight.

I don't think that should matter. If the backend understands barriers, then the frontend should always be able to use barrier requests and QUEUE_ORDERED_TAG internally. It should be up to the backend to make the barriers work correctly, regardless of the underlying IO stack.

In other words, I'm not sure there's any point in advertising feature-barrier=0 unless the backend literally does not understand BLKIF_OP_WRITE_BARRIER (in which case, it shouldn't even mention feature-barrier either). And conversely feature-barrier=1 doesn't mean "the disks have native barrier support", but "I will make BLKIF_OP_WRITE_BARRIER do the right thing, so go ahead and use it to your heart's desire".

Perhaps this would be/have been better:

   * feature-barrier missing - backend doesn't understand
     BLKIF_OP_WRITE_BARRIER at all
   * feature-barrier=0 - backend will complete all writes in order, and
     they'll be durable once complete
   * feature-barrier=1 - backend can reorder writes at will, but
     they'll be durable once complete.  BLKIF_OP_WRITE_BARRIER required
     to guarantee any specific ordering
   * blkfront writes 1 to feature-barrier when it wants to use
     barriers.  The first use of BLKIF_OP_WRITE_BARRIER also has the
     effect of writing 1 to feature-barrier

But it's probably too late to retrofit now. Still, we can make the frontend issue a "gratuitous" barrier early on to make sure the backend knows what to expect later.
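
Under that (hypothetical) retrofitted scheme, the frontend side might
look roughly like this: write the key via xenbus_printf(), then send an
empty BLKIF_OP_WRITE_BARRIER as the gratuitous barrier once the ring is
connected. A sketch only, not current blkfront code:

        /* Advertise that this frontend will use barriers. */
        err = xenbus_printf(XBT_NIL, dev->nodename, "feature-barrier", "%d", 1);
        if (err)
                xenbus_dev_fatal(dev, err, "writing feature-barrier");

        /* ... later, once the ring is connected: one empty barrier so the
         * backend knows it may start reordering other writes. */
        ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
        ring_req->operation   = BLKIF_OP_WRITE_BARRIER;
        ring_req->nr_segments = 0;      /* no payload, pure ordering point */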

    As you pointed out below, that would have to make all requests
    synchronous until the frontend starts
    to ack that state change with some barrier request. I'm wondering
    about the shared performance impact then in case that doesn't happen.

Probably not good. The IO path probably gets a lot of win out of being given the freedom to reorder IOs. It also raises the question of guests just deliberately barriering everything to cause a seek storm DoS (I guess what you're alluding to by "rogue guest" above). We need to look at all the new(ish) IO controller stuff for managing QoS, etc.

* A barrier write should translate to a barrier write.

    I see it generating an empty bio. I don't see it setting a
    BIO_RW_BARRIER.

    I see nothing in blk-core which suggests to me that this translates
    to a barrier, but there is always hope I'm just mistaken. Help?

In dispatch_rw_block_io():

        switch (req->operation) {
        case BLKIF_OP_READ:
                operation = READ;
                break;
        case BLKIF_OP_WRITE:
                operation = WRITE;
                break;
        case BLKIF_OP_WRITE_BARRIER:
                operation = WRITE_BARRIER;
                break;
        default:
                operation = 0; /* make gcc happy */
                BUG();
        }
where WRITE_BARRIER is
#define WRITE_BARRIER   (WRITE | (1 << BIO_RW_BARRIER))


    I think a BLKIF_OP_WRITE_BARRIER should be allowed to carry
    data, even encouraged to do so, because ring space is precious
    and I think Linux would otherwise have to insert barriers before
    and after a kernel barrier write to stay simple. Right now
    it's apparently doing the right thing. Only that this path has
    never been taken.

    It should not be required to carry data.

Seems reasonable. It isn't clear to me whether blkfront will generate WRITE_BARRIER with a payload, but it looks like it might already.
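
For reference, on the frontend side the mapping presumably only needs a
flag test while filling in the ring request, along these lines (a sketch
against the 2.6.3x-era request flags, not a quote of blkfront):

        /* Map a Linux barrier request, payload included, onto a barrier
         * operation on the ring.  REQ_HARD_BARRIER is the cmd_flags bit
         * of that era; field names follow struct blkif_request. */
        ring_req->operation = rq_data_dir(req) ?
                BLKIF_OP_WRITE : BLKIF_OP_READ;
        if (req->cmd_flags & REQ_HARD_BARRIER)
                ring_req->operation = BLKIF_OP_WRITE_BARRIER;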

    Should we phase out OP_FLUSH_DISKCACHE? Did anything actually
    ever implement it?

Is that relatively new? It isn't present in the blkif.h in the kernel, which is probably a snapshot from a while ago.

So I think just doing a quiet blkback update where necessary would stay
compatible.

Got a patch?

Thanks,
    J

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel