From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeremy Fitzhardinge Subject: Re: [PATCH RFC] xen/blkfront: use tagged queuing for barriers Date: Wed, 28 Jul 2010 11:03:42 -0700 Message-ID: <4C5070FE.4020308@goop.org> References: <4C48B61F.2030302@goop.org> <1280315165.4626.3945.camel@ramone.somacoma.net> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <1280315165.4626.3945.camel@ramone.somacoma.net> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xensource.com Errors-To: xen-devel-bounces@lists.xensource.com To: Daniel Stodden Cc: "Xen-devel@lists.xensource.com" , Jake Wires List-Id: xen-devel@lists.xenproject.org On 07/28/2010 04:06 AM, Daniel Stodden wrote: > Hey. > > I tried to figure out some of the barrier stuff. I think it's becoming > more clear to me now. But take everything below with a grain of salt. > Sorry for having written a book about it. It's just because I'm not sure > if I'm always reasoning correctly. Thanks for the detailed response! > There's summary at the bottom. Maybe skip over and just apply your > patch, possibly with a more optimistic update to the comment, and with a > third option in support of QUEUE_ORDERED_DRAIN. > > In between, I seems to make sense to differentiate between blkback and > blktap1/2 to understand what our disk model traditionally was/is, > because until now frontends don't differentiate much. Except for maybe > that feature-barrier flag. And below I gather why I think we might want > to do something about it. > > Blktap1 > -------------------------------------------------------------------------- > > A blktap1 backend ring is quite simple to explain. It thinks it's a > cacheless disk with no ordering guarantees. In kernel queueing terms > that's a QUEUE_ORDERED_DRAIN. Yes, if it has no cache then we don't need to worry about delayed writes, and draining should be enough. > - We pass requests directly on to tapdisk. > > - Tapdisk drivers will scatter and reorder at will, especially VHD. > > - The RSP message is driven by iocb completion(s). > > - It fully relies on AIO/O_DIRECT to make sure the data made it > actually down to the platter by that time. > > I'm not aware of anything else left to a userspace app. > > Blktap2 > -------------------------------------------------------------------------- > > The blktap2 datapath was completely derived from blktap1. > > The difference that it's now hooked into a bdev. Which is quite > important because image consistency is now in the hands of the block > layer. > > It presently doesn't declare a barrier mode, which in my understanding > evaluates to QUEUE_ORDERED_NONE. > > My understanding of QUEUE_ORDERED_NONE semantics is that the blk code > and filesystems will happily fill the queue and assmume in-order > completion. This is the mode of choice with either a queue depth of 1 > or no reordering. Or a non-flushable cache, at which point you know > you're already screwed, so you don't need to worry. > > Blktap2 passes a frontend ring worth of requests to userland, with the > same semantics as blktap1: queue depth of 32, no serialization > control. Which should actually be a QUEUE_ORDERED_DRAIN passed on to > the block layer. So blktap2 doesn't pass barriers into userland? I guess userland doesn't really have a way to send barriers into the kernel except with f(data)sync or direct io. > So I guess we better fix that. > > Blkback > -------------------------------------------------------------------------- > > I had some trouble with this one. Blkback and -tap1 being somewhat the > same kinda guy from the frontend perspective, I've always assumed the > semantics should be (are) the same. Now I think that's quite wrong. > > Blkback is sitting on a kernel device queue, and that's a slightly > different beast, because the blkdev interface works SCSI-style, which > means it's completely revolving around SCSI barriers. > > On SCSI you issue a barrier into a request stream and ideally that > directly maps to controller hardware, you're done. > > For ATA with NCQ that apparently doesn't apply, but you still issue a > barrier op. It will cause a cache flush (if given) to complete > foregoing requests, then a forced unit access (if given) to complete > the barrier'd write itself. Or post-flush, again. > > The point is you don't do that on every request separately, or > performance somewhat suffers. :) > > Let's assume blkback on a bdev of queue depth>1 with > barriers enabled. And that we wanted to guarantee responses implying > physical completion. We'd have to do at least something like the > following: queue any batch we can get from the frontend, set a barrier > bit on the last one, then ack every req only after the last one > completed. Because in between, bio completion doesn't mean much. That > might even be an option, I haven't tried. One could probably also play > tricks based on the _FLUSH bit. Anyway, I don't see the blkback code > presently doing that. Couldn't blkback just send a barrier write down into the host's storage subsystem, and rely on it completing the bios in the right order? > > From the blkfront perspective, that would still be a weird-looking > QUEUE_ORDERED_DRAIN. Because everything queued appears to complete at > once. It didn't, of course, so you still have to drain where you want > to be safe. > > If we don't do that, then the frontend has to submit the barrier > position. Not sure I follow this. > I find this has some interesting implications. One is that the > guest will effectively be utilizing a writeback cache, safely. The > reason why I find this so exciting is because that cache can be > potentially much larger than a ring, and it'd still be safely used by > the guest. The other reason is I'm often made to wonder if blktap should > get one. > > To a Linux/blkfront queue, this is a QUEUE_ORDERED_TAG. > > Blkif > -------------------------------------------------------------------------- > > I'm not entirely clear about the reason for BLKIF_OP_FLUSH_DISKCACHE > because of the existence of BLKIF_OP_WRITE_BARRIER. Maybe someone can > enlighten me. > > The latter implies a flush, if one is necessary (otherwise it wouldn't > be a barrier). Really? It's perfectly reasonable to define a barrier which prevents reordering but doesn't imply any synchronous write. If you have a filesystem which just wants to make sure the superblock is written after metadata updates, but doesn't really mean to push things out *now*, a non-flushing barrier makes complete sense. I found it odd that all the Linux documentation about barriers seem to imply a flush. Or is that just the conventional meaning of "barrier" in storage-world? > I could imagine drain/flush combinations executed to achieve barriers, > or a flush as a slightly relaxed alternative, e.g. for queues which > preserve order. Right. > Otoh, Xen normally tries to be about interface idealization where > appropriate. There is no point in feature-barrier==0&& > feature-cache-flush==1 or vice versa, because a barrier would still > have been realizable per drain/flush, and at least in the blkback case > there'd be no additional cost in the backend, if it just does that > on behalf of the frontend. And it saves precious ring space. > > Last I'm not clear about how the meaning of feature-barrier got > formulated in the header. To the frontend, is not a "feature" which > "may succeed". Rather, if set, as a strong recommendation for guests > who aim to eventually remount their filesystems r/w after a > reboot. > > Summary > -------------------------------------------------------------------------- > > I'm assuming most guest OS filesystem/block layer glue follows an > ordering interface based on SCSI? Does anyone know if there are > exceptions to that? I'd be really curious. [*] > > Assuming the former is correct, I think we absolutely want interface > idealization. > > This leaves exactly 2.5 reasonable modes which frontends should prepare > for: > > - feature-barrier == 0 -> QUEUE_ORDERED_NONE > > Disk is either a) non-caching/non-reordering or b) roken. > > - feature-barrier == 1 -> QUEUE_ORDERED_TAG > > Disk is reordering, quite possibly caching, but you won't need to > know. You seriously want to issue ordered writes with > BLKIF_OP_WRITE_BARRIER. So does this mean that feature-barrier is actually not backwards compatible? If you use an old blkfront which doesn't know what to do with it, you end up with less reliable storage than before feature-barrier? Perhaps the backend should keep writes synchronous until it sees a barrier coming from the guest, then it switches to caching/reordering/etc (and hope the guest sends a barrier quickish). > [ > - !xenbus_exists(feature-barrier) -> QUEUE_ORDERED_DRAIN (?) > > This is either blktap1, or an outdated blkback (?) I think one would > want to assume blktap1 (or parse otherend-id). With Blkback there's > probably not much you can do anyway. With blktap1 there's a chance > you're doing the right thing. > ] See patch below (delta against previous one). Does that look OK? > I'd suggest to ignore/phase-out the caching bit, because it's no use > wrt QUEUE_ORDERED_TAG and complementary to QUEUE_ORDERED_NONE. > > I'd suggest to fix the backends where people see fit. In blktap2, > which appears urgent to me, this is a one liner for now, setting > QUEUE_ORDERED_DRAIN on the bdev. Unless we discover some day we want > to implement barriers in tapdisk. Which is not going to happen in > July. OK. Is blkback OK as-is? And I don't care about blktap1, but I guess its still the current product storage backend... Thanks, J From: Jeremy Fitzhardinge Date: Wed, 28 Jul 2010 10:49:29 -0700 Subject: [PATCH] xen/blkfront: Use QUEUE_ORDERED_DRAIN for old backends If there's no feature-barrier key in xenstore, then it means its a fairly old backend which does uncached in-order writes, which means ORDERED_DRAIN is appropriate. Signed-off-by: Jeremy Fitzhardinge diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c index eb28e1f..8cd5418 100644 --- a/drivers/block/xen-blkfront.c +++ b/drivers/block/xen-blkfront.c @@ -423,27 +423,22 @@ static int xlvbd_init_blk_queue(struct blkfront_info *info, static int xlvbd_barrier(struct blkfront_info *info) { int err; - unsigned ordered = QUEUE_ORDERED_NONE; + const char *barrier; - /* - * If we don't have barrier support, then there's really no - * way to guarantee write ordering, so we really just have to - * send writes to the backend and hope for the best. If - * barriers are supported, we don't really know what the - * flushing semantics of the barrier are, so again, tag the - * requests in order and hope for the best. - */ - if (info->feature_barrier) - ordered = QUEUE_ORDERED_TAG; + switch (info->feature_barrier) { + case QUEUE_ORDERED_DRAIN: barrier = "enabled (drain)"; break; + case QUEUE_ORDERED_TAG: barrier = "enabled (tag)"; break; + case QUEUE_ORDERED_NONE: barrier = "disabled"; break; + default: return -EINVAL; + } - err = blk_queue_ordered(info->rq, ordered, NULL); + err = blk_queue_ordered(info->rq, info->feature_barrier, NULL); if (err) return err; printk(KERN_INFO "blkfront: %s: barriers %s\n", - info->gd->disk_name, - info->feature_barrier ? "enabled" : "disabled"); + info->gd->disk_name, barrier); return 0; } @@ -717,7 +712,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id) printk(KERN_WARNING "blkfront: %s: write barrier op failed\n", info->gd->disk_name); error = -EOPNOTSUPP; - info->feature_barrier = 0; + info->feature_barrier = QUEUE_ORDERED_NONE; xlvbd_barrier(info); } /* fall through */ @@ -1057,6 +1052,7 @@ static void blkfront_connect(struct blkfront_info *info) unsigned long sector_size; unsigned int binfo; int err; + int barrier; switch (info->connected) { case BLKIF_STATE_CONNECTED: @@ -1097,10 +1093,27 @@ static void blkfront_connect(struct blkfront_info *info) } err = xenbus_gather(XBT_NIL, info->xbdev->otherend, - "feature-barrier", "%lu",&info->feature_barrier, + "feature-barrier", "%lu",&barrier, NULL); + + /* + * If there's no "feature-barrier" defined, then it means + * we're dealing with a very old backend which probably writes + * in order with no cache; draining will do what needs to get + * done. + * + * If there are barriers, then we can do full queued writes + * with tagged barriers. + * + * If barriers are not supported, then there's no much we can + * do, so just set ordering to NONE. + */ if (err) - info->feature_barrier = 0; + info->feature_barrier = QUEUE_ORDERED_DRAIN; + else if (barrier) + info->feature_barrier = QUEUE_ORDERED_TAG; + else + info->feature_barrier = QUEUE_ORDERED_NONE; err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size); if (err) {