From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jeremy Fitzhardinge <jeremy@goop.org>
Subject: Re: [PATCH RFC] xen/blkfront: use tagged queuing for
	barriers
Date: Wed, 28 Jul 2010 11:03:42 -0700
Message-ID: <4C5070FE.4020308@goop.org>
References: <4C48B61F.2030302@goop.org>
	<1280315165.4626.3945.camel@ramone.somacoma.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <1280315165.4626.3945.camel@ramone.somacoma.net>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Daniel Stodden <daniel.stodden@citrix.com>
Cc: "Xen-devel@lists.xensource.com" <Xen-devel@lists.xensource.com>, Jake Wires <Jake.Wires@citrix.com>
List-Id: xen-devel@lists.xenproject.org

  On 07/28/2010 04:06 AM, Daniel Stodden wrote:
> Hey.
>
> I tried to figure out some of the barrier stuff. I think it's becoming
> more clear to me now. But take everything below with a grain of salt.
> Sorry for having written a book about it. It's just because I'm not sure
> if I'm always reasoning correctly.

Thanks for the detailed response!

> There's summary at the bottom. Maybe skip over and just apply your
> patch, possibly with a more optimistic update to the comment, and with a
> third option in support of QUEUE_ORDERED_DRAIN.
>
> In between, I seems to make sense to differentiate between blkback and
> blktap1/2 to understand what our disk model traditionally was/is,
> because until now frontends don't differentiate much. Except for maybe
> that feature-barrier flag. And below I gather why I think we might want
> to do something about it.
>
> Blktap1
> --------------------------------------------------------------------------
>
> A blktap1 backend ring is quite simple to explain. It thinks it's a
> cacheless disk with no ordering guarantees. In kernel queueing terms
> that's a QUEUE_ORDERED_DRAIN.

Yes, if it has no cache then we don't need to worry about delayed 
writes, and draining should be enough.

>   - We pass requests directly on to tapdisk.
>
>   - Tapdisk drivers will scatter and reorder at will, especially VHD.
>
>   - The RSP message is driven by iocb completion(s).
>
>   - It fully relies on AIO/O_DIRECT to make sure the data made it
>     actually down to the platter by that time.
>
> I'm not aware of anything else left to a userspace app.
>
> Blktap2
> --------------------------------------------------------------------------
>
> The blktap2 datapath was completely derived from blktap1.
>
> The difference that it's now hooked into a bdev. Which is quite
> important because image consistency is now in the hands of the block
> layer.
>
> It presently doesn't declare a barrier mode, which in my understanding
> evaluates to QUEUE_ORDERED_NONE.
>
> My understanding of QUEUE_ORDERED_NONE semantics is that the blk code
> and filesystems will happily fill the queue and assmume in-order
> completion. This is the mode of choice with either a queue depth of 1
> or no reordering. Or a non-flushable cache, at which point you know
> you're already screwed, so you don't need to worry.
>
> Blktap2 passes a frontend ring worth of requests to userland, with the
> same semantics as blktap1: queue depth of 32, no serialization
> control. Which should actually be a QUEUE_ORDERED_DRAIN passed on to
> the block layer.

So blktap2 doesn't pass barriers into userland?  I guess userland 
doesn't really have a way to send barriers into the kernel except with 
f(data)sync or direct io.

> So I guess we better fix that.
>
> Blkback
> --------------------------------------------------------------------------
>
> I had some trouble with this one. Blkback and -tap1 being somewhat the
> same kinda guy from the frontend perspective, I've always assumed the
> semantics should be (are) the same. Now I think that's quite wrong.
>
> Blkback is sitting on a kernel device queue, and that's a slightly
> different beast, because the blkdev interface works SCSI-style, which
> means it's completely revolving around SCSI barriers.
>
> On SCSI you issue a barrier into a request stream and ideally that
> directly maps to controller hardware, you're done.
>
> For ATA with NCQ that apparently doesn't apply, but you still issue a
> barrier op. It will cause a cache flush (if given) to complete
> foregoing requests, then a forced unit access (if given) to complete
> the barrier'd write itself. Or post-flush, again.
>
> The point is you don't do that on every request separately, or
> performance somewhat suffers. :)
>
> Let's assume blkback on a bdev of queue depth>1 with
> barriers enabled. And that we wanted to guarantee responses implying
> physical completion. We'd have to do at least something like the
> following: queue any batch we can get from the frontend, set a barrier
> bit on the last one, then ack every req only after the last one
> completed. Because in between, bio completion doesn't mean much. That
> might even be an option, I haven't tried. One could probably also play
> tricks based on the _FLUSH bit. Anyway, I don't see the blkback code
> presently doing that.

Couldn't blkback just send a barrier write down into the host's storage 
subsystem, and rely on it completing the bios in the right order?

> > From the blkfront perspective, that would still be a weird-looking
> QUEUE_ORDERED_DRAIN. Because everything queued appears to complete at
> once. It didn't, of course, so you still have to drain where you want
> to be safe.
>
> If we don't do that, then the frontend has to submit the barrier
> position.

Not sure I follow this.

>   I find this has some interesting implications. One is that the
> guest will effectively be utilizing a writeback cache, safely. The
> reason why I find this so exciting is because that cache can be
> potentially much larger than a ring, and it'd still be safely used by
> the guest. The other reason is I'm often made to wonder if blktap should
> get one.
>
> To a Linux/blkfront queue, this is a QUEUE_ORDERED_TAG.
>
> Blkif
> --------------------------------------------------------------------------
>
> I'm not entirely clear about the reason for BLKIF_OP_FLUSH_DISKCACHE
> because of the existence of BLKIF_OP_WRITE_BARRIER.  Maybe someone can
> enlighten me.
>
> The latter implies a flush, if one is necessary (otherwise it wouldn't
> be a barrier).

Really?  It's perfectly reasonable to define a barrier which prevents 
reordering but doesn't imply any synchronous write.  If you have a 
filesystem which just wants to make sure the superblock is written after 
metadata updates, but doesn't really mean to push things out *now*, a 
non-flushing barrier makes complete sense.  I found it odd that all the 
Linux documentation about barriers seem to imply a flush.  Or is that 
just the conventional meaning of "barrier" in storage-world?

> I could imagine drain/flush combinations executed to achieve barriers,
> or a flush as a slightly relaxed alternative, e.g. for queues which
> preserve order.

Right.

> Otoh, Xen normally tries to be about interface idealization where
> appropriate. There is no point in feature-barrier==0&&
> feature-cache-flush==1 or vice versa, because a barrier would still
> have been realizable per drain/flush, and at least in the blkback case
> there'd be no additional cost in the backend, if it just does that
> on behalf of the frontend. And it saves precious ring space.
>
> Last I'm not clear about how the meaning of feature-barrier got
> formulated in the header. To the frontend, is not a "feature" which
> "may succeed". Rather, if set, as a strong recommendation for guests
> who aim to eventually remount their filesystems r/w after a
> reboot.
>
> Summary
> --------------------------------------------------------------------------
>
> I'm assuming most guest OS filesystem/block layer glue follows an
> ordering interface based on SCSI? Does anyone know if there are
> exceptions to that? I'd be really curious. [*]
>
> Assuming the former is correct, I think we absolutely want interface
> idealization.
>
> This leaves exactly 2.5 reasonable modes which frontends should prepare
> for:
>
>   - feature-barrier == 0 ->  QUEUE_ORDERED_NONE
>
>     Disk is either a) non-caching/non-reordering or b) roken.
>
>   - feature-barrier == 1 ->  QUEUE_ORDERED_TAG
>
>     Disk is reordering, quite possibly caching, but you won't need to
>     know. You seriously want to issue ordered writes with
>     BLKIF_OP_WRITE_BARRIER.

So does this mean that feature-barrier is actually not backwards 
compatible?  If you use an old blkfront which doesn't know what to do 
with it, you end up with less reliable storage than before feature-barrier?

Perhaps the backend should keep writes synchronous until it sees a 
barrier coming from the guest, then it switches to 
caching/reordering/etc (and hope the guest sends a barrier quickish).

> [
>   - !xenbus_exists(feature-barrier) ->  QUEUE_ORDERED_DRAIN (?)
>
>     This is either blktap1, or an outdated blkback (?) I think one would
>     want to assume blktap1 (or parse otherend-id). With Blkback there's
>     probably not much you can do anyway. With blktap1 there's a chance
>     you're doing the right thing.
> ]

See patch below (delta against previous one).  Does that look OK?

> I'd suggest to ignore/phase-out the caching bit, because it's no use
> wrt QUEUE_ORDERED_TAG and complementary to QUEUE_ORDERED_NONE.
>
> I'd suggest to fix the backends where people see fit. In blktap2,
> which appears urgent to me, this is a one liner for now, setting
> QUEUE_ORDERED_DRAIN on the bdev.  Unless we discover some day we want
> to implement barriers in tapdisk. Which is not going to happen in
> July.

OK.  Is blkback OK as-is?  And I don't care about blktap1, but I guess 
its still the current product storage backend...


Thanks,
     J


From: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>
Date: Wed, 28 Jul 2010 10:49:29 -0700
Subject: [PATCH] xen/blkfront: Use QUEUE_ORDERED_DRAIN for old backends

If there's no feature-barrier key in xenstore, then it means its a fairly
old backend which does uncached in-order writes, which means ORDERED_DRAIN
is appropriate.

Signed-off-by: Jeremy Fitzhardinge<jeremy.fitzhardinge@citrix.com>

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index eb28e1f..8cd5418 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -423,27 +423,22 @@ static int xlvbd_init_blk_queue(struct blkfront_info *info,
  static int xlvbd_barrier(struct blkfront_info *info)
  {
  	int err;
-	unsigned ordered = QUEUE_ORDERED_NONE;
+	const char *barrier;

-	/*
-	 * If we don't have barrier support, then there's really no
-	 * way to guarantee write ordering, so we really just have to
-	 * send writes to the backend and hope for the best.  If
-	 * barriers are supported, we don't really know what the
-	 * flushing semantics of the barrier are, so again, tag the
-	 * requests in order and hope for the best.
-	 */
-	if (info->feature_barrier)
-		ordered = QUEUE_ORDERED_TAG;
+	switch (info->feature_barrier) {
+	case QUEUE_ORDERED_DRAIN:	barrier = "enabled (drain)"; break;
+	case QUEUE_ORDERED_TAG:		barrier = "enabled (tag)"; break;
+	case QUEUE_ORDERED_NONE:	barrier = "disabled"; break;
+	default:			return -EINVAL;
+	}

-	err = blk_queue_ordered(info->rq, ordered, NULL);
+	err = blk_queue_ordered(info->rq, info->feature_barrier, NULL);

  	if (err)
  		return err;

  	printk(KERN_INFO "blkfront: %s: barriers %s\n",
-	       info->gd->disk_name,
-	       info->feature_barrier ? "enabled" : "disabled");
+	       info->gd->disk_name, barrier);
  	return 0;
  }

@@ -717,7 +712,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
  				printk(KERN_WARNING "blkfront: %s: write barrier op failed\n",
  				       info->gd->disk_name);
  				error = -EOPNOTSUPP;
-				info->feature_barrier = 0;
+				info->feature_barrier = QUEUE_ORDERED_NONE;
  				xlvbd_barrier(info);
  			}
  			/* fall through */
@@ -1057,6 +1052,7 @@ static void blkfront_connect(struct blkfront_info *info)
  	unsigned long sector_size;
  	unsigned int binfo;
  	int err;
+	int barrier;

  	switch (info->connected) {
  	case BLKIF_STATE_CONNECTED:
@@ -1097,10 +1093,27 @@ static void blkfront_connect(struct blkfront_info *info)
  	}

  	err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
-			    "feature-barrier", "%lu",&info->feature_barrier,
+			    "feature-barrier", "%lu",&barrier,
  			    NULL);
+
+	/*
+	 * If there's no "feature-barrier" defined, then it means
+	 * we're dealing with a very old backend which probably writes
+	 * in order with no cache; draining will do what needs to get
+	 * done.
+	 *
+	 * If there are barriers, then we can do full queued writes
+	 * with tagged barriers.
+	 *
+	 * If barriers are not supported, then there's no much we can
+	 * do, so just set ordering to NONE.
+	 */
  	if (err)
-		info->feature_barrier = 0;
+		info->feature_barrier = QUEUE_ORDERED_DRAIN;
+	else if (barrier)
+		info->feature_barrier = QUEUE_ORDERED_TAG;
+	else
+		info->feature_barrier = QUEUE_ORDERED_NONE;

  	err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size);
  	if (err) {