- * [PATCH 01/11] block/loop: queue ordered mode should be DRAIN_FLUSH
  2010-08-12 12:41 [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Tejun Heo
@ 2010-08-12 12:41 ` Tejun Heo
  2010-08-12 12:41 ` [PATCH 02/11] block: kill QUEUE_ORDERED_BY_TAG Tejun Heo
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2010-08-12 12:41 UTC (permalink / raw)
  To: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch
  Cc: Tejun Heo
loop implements FLUSH using fsync but was incorrectly setting its
ordered mode to DRAIN.  Change it to DRAIN_FLUSH.  In practice, this
doesn't change anything as loop doesn't make use of the block layer
ordered implementation.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 drivers/block/loop.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f3c636d..c3a4a2e 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -832,7 +832,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 	lo->lo_queue->unplug_fn = loop_unplug;
 
 	if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
-		blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN);
+		blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN_FLUSH);
 
 	set_capacity(lo->lo_disk, size);
 	bd_set_size(bdev, size << 9);
-- 
1.7.1
^ permalink raw reply related	[flat|nested] 109+ messages in thread
- * [PATCH 02/11] block: kill QUEUE_ORDERED_BY_TAG
  2010-08-12 12:41 [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Tejun Heo
  2010-08-12 12:41 ` [PATCH 01/11] block/loop: queue ordered mode should be DRAIN_FLUSH Tejun Heo
@ 2010-08-12 12:41 ` Tejun Heo
  2010-08-13 12:56   ` Vladislav Bolkhovitin
  2010-08-12 12:41 ` [PATCH 03/11] block: deprecate barrier and replace blk_queue_ordered() with blk_queue_flush() Tejun Heo
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 109+ messages in thread
From: Tejun Heo @ 2010-08-12 12:41 UTC (permalink / raw)
  To: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.
  Cc: Tejun Heo, Christoph Hellwig, Nick Piggin, Michael S. Tsirkin,
	Jeremy Fitzhardinge, Chris Wright
Nobody is making meaningful use of ORDERED_BY_TAG now and queue
draining for barrier requests will be removed soon which will render
the advantage of tag ordering moot.  Kill ORDERED_BY_TAG.  The
following users are affected.
* brd: converted to ORDERED_DRAIN.
* virtio_blk: ORDERED_TAG path was already marked deprecated.  Removed.
* xen-blkfront: ORDERED_TAG case dropped; barrier-capable backends now
  use ORDERED_DRAIN_FLUSH.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
---
 block/blk-barrier.c          |   35 +++++++----------------------------
 drivers/block/brd.c          |    2 +-
 drivers/block/virtio_blk.c   |    9 ---------
 drivers/block/xen-blkfront.c |    8 +++-----
 drivers/scsi/sd.c            |    4 +---
 include/linux/blkdev.h       |   17 +----------------
 6 files changed, 13 insertions(+), 62 deletions(-)
diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index f0faefc..c807e9c 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -26,10 +26,7 @@ int blk_queue_ordered(struct request_queue *q, unsigned ordered)
 	if (ordered != QUEUE_ORDERED_NONE &&
 	    ordered != QUEUE_ORDERED_DRAIN &&
 	    ordered != QUEUE_ORDERED_DRAIN_FLUSH &&
-	    ordered != QUEUE_ORDERED_DRAIN_FUA &&
-	    ordered != QUEUE_ORDERED_TAG &&
-	    ordered != QUEUE_ORDERED_TAG_FLUSH &&
-	    ordered != QUEUE_ORDERED_TAG_FUA) {
+	    ordered != QUEUE_ORDERED_DRAIN_FUA) {
 		printk(KERN_ERR "blk_queue_ordered: bad value %d\n", ordered);
 		return -EINVAL;
 	}
@@ -155,21 +152,9 @@ static inline bool start_ordered(struct request_queue *q, struct request **rqp)
 	 * For an empty barrier, there's no actual BAR request, which
 	 * in turn makes POSTFLUSH unnecessary.  Mask them off.
 	 */
-	if (!blk_rq_sectors(rq)) {
+	if (!blk_rq_sectors(rq))
 		q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
 				QUEUE_ORDERED_DO_POSTFLUSH);
-		/*
-		 * Empty barrier on a write-through device w/ ordered
-		 * tag has no command to issue and without any command
-		 * to issue, ordering by tag can't be used.  Drain
-		 * instead.
-		 */
-		if ((q->ordered & QUEUE_ORDERED_BY_TAG) &&
-		    !(q->ordered & QUEUE_ORDERED_DO_PREFLUSH)) {
-			q->ordered &= ~QUEUE_ORDERED_BY_TAG;
-			q->ordered |= QUEUE_ORDERED_BY_DRAIN;
-		}
-	}
 
 	/* stash away the original request */
 	blk_dequeue_request(rq);
@@ -210,7 +195,7 @@ static inline bool start_ordered(struct request_queue *q, struct request **rqp)
 	} else
 		skip |= QUEUE_ORDSEQ_PREFLUSH;
 
-	if ((q->ordered & QUEUE_ORDERED_BY_DRAIN) && queue_in_flight(q))
+	if (queue_in_flight(q))
 		rq = NULL;
 	else
 		skip |= QUEUE_ORDSEQ_DRAIN;
@@ -257,16 +242,10 @@ bool blk_do_ordered(struct request_queue *q, struct request **rqp)
 	    rq != &q->pre_flush_rq && rq != &q->post_flush_rq)
 		return true;
 
-	if (q->ordered & QUEUE_ORDERED_BY_TAG) {
-		/* Ordered by tag.  Blocking the next barrier is enough. */
-		if (is_barrier && rq != &q->bar_rq)
-			*rqp = NULL;
-	} else {
-		/* Ordered by draining.  Wait for turn. */
-		WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q));
-		if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q))
-			*rqp = NULL;
-	}
+	/* Ordered by draining.  Wait for turn. */
+	WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q));
+	if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q))
+		*rqp = NULL;
 
 	return true;
 }
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 1c7f637..47a4127 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -482,7 +482,7 @@ static struct brd_device *brd_alloc(int i)
 	if (!brd->brd_queue)
 		goto out_free_dev;
 	blk_queue_make_request(brd->brd_queue, brd_make_request);
-	blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_TAG);
+	blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_DRAIN);
 	blk_queue_max_hw_sectors(brd->brd_queue, 1024);
 	blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);
 
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 2aafafc..7965280 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -395,15 +395,6 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 		 * to implement write barrier support.
 		 */
 		blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH);
-	} else if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER)) {
-		/*
-		 * If the BARRIER feature is supported the host expects us
-		 * to order request by tags.  This implies there is not
-		 * volatile write cache on the host, and that the host
-		 * never re-orders outstanding I/O.  This feature is not
-		 * useful for real life scenarious and deprecated.
-		 */
-		blk_queue_ordered(q, QUEUE_ORDERED_TAG);
 	} else {
 		/*
 		 * If the FLUSH feature is not supported we must assume that
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 510ab86..25ffbf9 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -424,8 +424,7 @@ static int xlvbd_barrier(struct blkfront_info *info)
 	const char *barrier;
 
 	switch (info->feature_barrier) {
-	case QUEUE_ORDERED_DRAIN:	barrier = "enabled (drain)"; break;
-	case QUEUE_ORDERED_TAG:		barrier = "enabled (tag)"; break;
+	case QUEUE_ORDERED_DRAIN:	barrier = "enabled"; break;
 	case QUEUE_ORDERED_NONE:	barrier = "disabled"; break;
 	default:			return -EINVAL;
 	}
@@ -1078,8 +1077,7 @@ static void blkfront_connect(struct blkfront_info *info)
 	 * we're dealing with a very old backend which writes
 	 * synchronously; draining will do what needs to get done.
 	 *
-	 * If there are barriers, then we can do full queued writes
-	 * with tagged barriers.
+	 * If there are barriers, then we use flush.
 	 *
 	 * If barriers are not supported, then there's no much we can
 	 * do, so just set ordering to NONE.
@@ -1087,7 +1085,7 @@ static void blkfront_connect(struct blkfront_info *info)
 	if (err)
 		info->feature_barrier = QUEUE_ORDERED_DRAIN;
 	else if (barrier)
-		info->feature_barrier = QUEUE_ORDERED_TAG;
+		info->feature_barrier = QUEUE_ORDERED_DRAIN_FLUSH;
 	else
 		info->feature_barrier = QUEUE_ORDERED_NONE;
 
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 8e2e893..05a15b0 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2151,9 +2151,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
 
 	/*
 	 * We now have all cache related info, determine how we deal
-	 * with ordered requests.  Note that as the current SCSI
-	 * dispatch function can alter request order, we cannot use
-	 * QUEUE_ORDERED_TAG_* even when ordered tag is supported.
+	 * with ordered requests.
 	 */
 	if (sdkp->WCE)
 		ordered = sdkp->DPOFUA
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 89c855c..96ef5f1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -469,12 +469,7 @@ enum {
 	 * DRAIN	: ordering by draining is enough
 	 * DRAIN_FLUSH	: ordering by draining w/ pre and post flushes
 	 * DRAIN_FUA	: ordering by draining w/ pre flush and FUA write
-	 * TAG		: ordering by tag is enough
-	 * TAG_FLUSH	: ordering by tag w/ pre and post flushes
-	 * TAG_FUA	: ordering by tag w/ pre flush and FUA write
 	 */
-	QUEUE_ORDERED_BY_DRAIN		= 0x01,
-	QUEUE_ORDERED_BY_TAG		= 0x02,
 	QUEUE_ORDERED_DO_PREFLUSH	= 0x10,
 	QUEUE_ORDERED_DO_BAR		= 0x20,
 	QUEUE_ORDERED_DO_POSTFLUSH	= 0x40,
@@ -482,8 +477,7 @@ enum {
 
 	QUEUE_ORDERED_NONE		= 0x00,
 
-	QUEUE_ORDERED_DRAIN		= QUEUE_ORDERED_BY_DRAIN |
-					  QUEUE_ORDERED_DO_BAR,
+	QUEUE_ORDERED_DRAIN		= QUEUE_ORDERED_DO_BAR,
 	QUEUE_ORDERED_DRAIN_FLUSH	= QUEUE_ORDERED_DRAIN |
 					  QUEUE_ORDERED_DO_PREFLUSH |
 					  QUEUE_ORDERED_DO_POSTFLUSH,
@@ -491,15 +485,6 @@ enum {
 					  QUEUE_ORDERED_DO_PREFLUSH |
 					  QUEUE_ORDERED_DO_FUA,
 
-	QUEUE_ORDERED_TAG		= QUEUE_ORDERED_BY_TAG |
-					  QUEUE_ORDERED_DO_BAR,
-	QUEUE_ORDERED_TAG_FLUSH		= QUEUE_ORDERED_TAG |
-					  QUEUE_ORDERED_DO_PREFLUSH |
-					  QUEUE_ORDERED_DO_POSTFLUSH,
-	QUEUE_ORDERED_TAG_FUA		= QUEUE_ORDERED_TAG |
-					  QUEUE_ORDERED_DO_PREFLUSH |
-					  QUEUE_ORDERED_DO_FUA,
-
 	/*
 	 * Ordered operation sequence
 	 */
-- 
1.7.1
^ permalink raw reply related	[flat|nested] 109+ messages in thread
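
A minimal userspace sketch, not part of the posted patch, of how the
ordered modes compose once the BY_DRAIN/BY_TAG bits are removed, and of
the empty-barrier masking kept in start_ordered().  The DO_PREFLUSH,
DO_BAR and DO_POSTFLUSH values are taken from the blkdev.h hunk above;
the DO_FUA value is not visible in that hunk and 0x80 is assumed here.
The helper names ordered_mode_valid(), mask_empty_barrier() and
ordered_selfcheck() are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum {
	QUEUE_ORDERED_DO_PREFLUSH	= 0x10,
	QUEUE_ORDERED_DO_BAR		= 0x20,
	QUEUE_ORDERED_DO_POSTFLUSH	= 0x40,
	QUEUE_ORDERED_DO_FUA		= 0x80,	/* assumed, not shown in the hunk */

	QUEUE_ORDERED_NONE		= 0x00,

	/* After the patch DRAIN no longer carries a BY_DRAIN bit. */
	QUEUE_ORDERED_DRAIN		= QUEUE_ORDERED_DO_BAR,
	QUEUE_ORDERED_DRAIN_FLUSH	= QUEUE_ORDERED_DRAIN |
					  QUEUE_ORDERED_DO_PREFLUSH |
					  QUEUE_ORDERED_DO_POSTFLUSH,
	QUEUE_ORDERED_DRAIN_FUA		= QUEUE_ORDERED_DRAIN |
					  QUEUE_ORDERED_DO_PREFLUSH |
					  QUEUE_ORDERED_DO_FUA,
};

/* Same acceptance test as blk_queue_ordered() after the patch. */
bool ordered_mode_valid(unsigned ordered)
{
	return ordered == QUEUE_ORDERED_NONE ||
	       ordered == QUEUE_ORDERED_DRAIN ||
	       ordered == QUEUE_ORDERED_DRAIN_FLUSH ||
	       ordered == QUEUE_ORDERED_DRAIN_FUA;
}

/*
 * Same masking as start_ordered(): an empty barrier (no data sectors)
 * has no BAR request, which in turn makes POSTFLUSH unnecessary.
 */
unsigned mask_empty_barrier(unsigned ordered, unsigned sectors)
{
	if (!sectors)
		ordered &= ~(unsigned)(QUEUE_ORDERED_DO_BAR |
				       QUEUE_ORDERED_DO_POSTFLUSH);
	return ordered;
}

/* Sanity checks on the composition above. */
void ordered_selfcheck(void)
{
	/* The old QUEUE_ORDERED_TAG (0x02 | 0x20) is now rejected. */
	assert(!ordered_mode_valid(0x02 | QUEUE_ORDERED_DO_BAR));
	assert(ordered_mode_valid(QUEUE_ORDERED_DRAIN_FLUSH));

	/* An empty barrier on a DRAIN_FLUSH queue leaves only PREFLUSH. */
	assert(mask_empty_barrier(QUEUE_ORDERED_DRAIN_FLUSH, 0) ==
	       (unsigned)QUEUE_ORDERED_DO_PREFLUSH);
}
```

Calling ordered_selfcheck() from a small main() is enough to exercise
the sketch.  It shows only the flag arithmetic; the request sequencing
done by blk_do_ordered() is not modelled.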
- * Re: [PATCH 02/11] block: kill QUEUE_ORDERED_BY_TAG
  2010-08-12 12:41 ` [PATCH 02/11] block: kill QUEUE_ORDERED_BY_TAG Tejun Heo
@ 2010-08-13 12:56   ` Vladislav Bolkhovitin
  2010-08-13 13:06     ` Christoph Hellwig
  0 siblings, 1 reply; 109+ messages in thread
From: Vladislav Bolkhovitin @ 2010-08-13 12:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, jack, rwheeler, hare,
	Christoph Hellwig, Nick Piggin, Michael S. Tsirkin,
	Jeremy Fitzhardinge, Chris Wright
Hello Tejun,
Tejun Heo, on 08/12/2010 04:41 PM wrote:
> Nobody is making meaningful use of ORDERED_BY_TAG now and queue
> draining for barrier requests will be removed soon which will render
> the advantage of tag ordering moot.
Have you seen Hannes Reinecke's and my measurements in 
http://marc.info/?l=linux-scsi&m=128110662528485&w=2 and 
http://marc.info/?l=linux-scsi&m=128111995217405&w=2, respectively?
If yes, what other evidence do you need to see that tag ordering is 
a big performance win?
Vlad
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCH 02/11] block: kill QUEUE_ORDERED_BY_TAG
  2010-08-13 12:56   ` Vladislav Bolkhovitin
@ 2010-08-13 13:06     ` Christoph Hellwig
  0 siblings, 0 replies; 109+ messages in thread
From: Christoph Hellwig @ 2010-08-13 13:06 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Tejun Heo, jaxboe, linux-fsdevel, linux-scsi, linux-ide,
	linux-kernel, linux-raid, hch, James.Bottomley, tytso,
	chris.mason, swhiteho, konishi.ryusuke, dm-devel, jack, rwheeler,
	hare, Christoph Hellwig, Nick Piggin, Michael S. Tsirkin,
	Jeremy Fitzhardinge, Chris Wright
On Fri, Aug 13, 2010 at 04:56:32PM +0400, Vladislav Bolkhovitin wrote:
> Tejun Heo, on 08/12/2010 04:41 PM wrote:
> >Nobody is making meaningful use of ORDERED_BY_TAG now and queue
> >draining for barrier requests will be removed soon which will render
> >the advantage of tag ordering moot.
> 
> Have you seen Hannes Reinecke's and my measurements in 
> http://marc.info/?l=linux-scsi&m=128110662528485&w=2 and 
> http://marc.info/?l=linux-scsi&m=128111995217405&w=2, respectively?
> 
> If yes, what other evidence do you need to see that tag ordering is 
> a big performance win?
It's not tag ordering that is a win but large queue depth.  That's what
you measured and what I fully agree on.  I haven't been able to get out
of Hannes what he actually measured.
And if you actually look at the patchset, allowing deep queues is
exactly what it does.  I haven't tested this patchset, only my
previous version, but it does get us back to using the full potential
of large arrays, exactly because of that.
^ permalink raw reply	[flat|nested] 109+ messages in thread 
 
 
- * [PATCH 03/11] block: deprecate barrier and replace blk_queue_ordered() with blk_queue_flush()
  2010-08-12 12:41 [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Tejun Heo
  2010-08-12 12:41 ` [PATCH 01/11] block/loop: queue ordered mode should be DRAIN_FLUSH Tejun Heo
  2010-08-12 12:41 ` [PATCH 02/11] block: kill QUEUE_ORDERED_BY_TAG Tejun Heo
@ 2010-08-12 12:41 ` Tejun Heo
  2010-08-14  1:07   ` Jeremy Fitzhardinge
  2010-08-12 12:41 ` [PATCH 04/11] block: remove spurious uses of REQ_HARDBARRIER Tejun Heo
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 109+ messages in thread
From: Tejun Heo @ 2010-08-12 12:41 UTC (permalink / raw)
  To: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.
  Cc: Tejun Heo, Christoph Hellwig, Nick Piggin, Michael S. Tsirkin,
	Jeremy Fitzhardinge, Chris Wright, FUJITA Tomonori, Boaz Harrosh,
	Geert Uytterhoeven, David S. Miller, Alasdair G Kergon,
	Pierre Ossman, Stefan Weinhuber
Barrier is deemed too heavy and will soon be replaced by FLUSH/FUA
requests.  Deprecate barrier.  All REQ_HARDBARRIERs are failed with
-EOPNOTSUPP and blk_queue_ordered() is replaced with simpler
blk_queue_flush().
blk_queue_flush() takes combinations of REQ_FLUSH and REQ_FUA.  If a
device has write cache and can flush it, it should set REQ_FLUSH.  If
the device can handle FUA writes, it should also set REQ_FUA.
All blk_queue_ordered() users are converted.
* ORDERED_DRAIN is mapped to 0 which is the default value.
* ORDERED_DRAIN_FLUSH is mapped to REQ_FLUSH.
* ORDERED_DRAIN_FUA is mapped to REQ_FLUSH | REQ_FUA.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Pierre Ossman <drzeus@drzeus.cx>
Cc: Stefan Weinhuber <wein@de.ibm.com>
---
 block/blk-barrier.c          |   29 ----------------------------
 block/blk-core.c             |    6 +++-
 block/blk-settings.c         |   20 +++++++++++++++++++
 drivers/block/brd.c          |    1 -
 drivers/block/loop.c         |    2 +-
 drivers/block/osdblk.c       |    2 +-
 drivers/block/ps3disk.c      |    2 +-
 drivers/block/virtio_blk.c   |   25 ++++++++---------------
 drivers/block/xen-blkfront.c |   43 +++++++++++------------------------------
 drivers/ide/ide-disk.c       |   13 +++++------
 drivers/md/dm.c              |    2 +-
 drivers/mmc/card/queue.c     |    1 -
 drivers/s390/block/dasd.c    |    1 -
 drivers/scsi/sd.c            |   16 +++++++-------
 include/linux/blkdev.h       |    6 +++-
 15 files changed, 67 insertions(+), 102 deletions(-)
diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index c807e9c..ed0aba5 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -9,35 +9,6 @@
 
 #include "blk.h"
 
-/**
- * blk_queue_ordered - does this queue support ordered writes
- * @q:        the request queue
- * @ordered:  one of QUEUE_ORDERED_*
- *
- * Description:
- *   For journalled file systems, doing ordered writes on a commit
- *   block instead of explicitly doing wait_on_buffer (which is bad
- *   for performance) can be a big win. Block drivers supporting this
- *   feature should call this function and indicate so.
- *
- **/
-int blk_queue_ordered(struct request_queue *q, unsigned ordered)
-{
-	if (ordered != QUEUE_ORDERED_NONE &&
-	    ordered != QUEUE_ORDERED_DRAIN &&
-	    ordered != QUEUE_ORDERED_DRAIN_FLUSH &&
-	    ordered != QUEUE_ORDERED_DRAIN_FUA) {
-		printk(KERN_ERR "blk_queue_ordered: bad value %d\n", ordered);
-		return -EINVAL;
-	}
-
-	q->ordered = ordered;
-	q->next_ordered = ordered;
-
-	return 0;
-}
-EXPORT_SYMBOL(blk_queue_ordered);
-
 /*
  * Cache flushing for ordered writes handling
  */
diff --git a/block/blk-core.c b/block/blk-core.c
index 5ab3ac2..3f802dd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1203,11 +1203,13 @@ static int __make_request(struct request_queue *q, struct bio *bio)
 	const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK;
 	int rw_flags;
 
-	if ((bio->bi_rw & REQ_HARDBARRIER) &&
-	    (q->next_ordered == QUEUE_ORDERED_NONE)) {
+	/* REQ_HARDBARRIER is no more */
+	if (WARN_ONCE(bio->bi_rw & REQ_HARDBARRIER,
+		"block: HARDBARRIER is deprecated, use FLUSH/FUA instead\n")) {
 		bio_endio(bio, -EOPNOTSUPP);
 		return 0;
 	}
+
 	/*
 	 * low level driver can indicate that it wants pages above a
 	 * certain limit bounced to low memory (ie for highmem, or even
diff --git a/block/blk-settings.c b/block/blk-settings.c
index a234f4b..9b18afc 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -794,6 +794,26 @@ void blk_queue_update_dma_alignment(struct request_queue *q, int mask)
 }
 EXPORT_SYMBOL(blk_queue_update_dma_alignment);
 
+/**
+ * blk_queue_flush - configure queue's cache flush capability
+ * @q:		the request queue for the device
+ * @flush:	0, REQ_FLUSH or REQ_FLUSH | REQ_FUA
+ *
+ * Tell block layer cache flush capability of @q.  If it supports
+ * flushing, REQ_FLUSH should be set.  If it supports bypassing
+ * write cache for individual writes, REQ_FUA should be set.
+ */
+void blk_queue_flush(struct request_queue *q, unsigned int flush)
+{
+	WARN_ON_ONCE(flush & ~(REQ_FLUSH | REQ_FUA));
+
+	if (WARN_ON_ONCE(!(flush & REQ_FLUSH) && (flush & REQ_FUA)))
+		flush &= ~REQ_FUA;
+
+	q->flush_flags = flush & (REQ_FLUSH | REQ_FUA);
+}
+EXPORT_SYMBOL_GPL(blk_queue_flush);
+
 static int __init blk_settings_init(void)
 {
 	blk_max_low_pfn = max_low_pfn - 1;
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 47a4127..fa33f97 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -482,7 +482,6 @@ static struct brd_device *brd_alloc(int i)
 	if (!brd->brd_queue)
 		goto out_free_dev;
 	blk_queue_make_request(brd->brd_queue, brd_make_request);
-	blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_DRAIN);
 	blk_queue_max_hw_sectors(brd->brd_queue, 1024);
 	blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);
 
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c3a4a2e..953d1e1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -832,7 +832,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 	lo->lo_queue->unplug_fn = loop_unplug;
 
 	if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
-		blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN_FLUSH);
+		blk_queue_flush(lo->lo_queue, REQ_FLUSH);
 
 	set_capacity(lo->lo_disk, size);
 	bd_set_size(bdev, size << 9);
diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
index 2284b4f..72d6246 100644
--- a/drivers/block/osdblk.c
+++ b/drivers/block/osdblk.c
@@ -439,7 +439,7 @@ static int osdblk_init_disk(struct osdblk_device *osdev)
 	blk_queue_stack_limits(q, osd_request_queue(osdev->osd));
 
 	blk_queue_prep_rq(q, blk_queue_start_tag);
-	blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH);
+	blk_queue_flush(q, REQ_FLUSH);
 
 	disk->queue = q;
 
diff --git a/drivers/block/ps3disk.c b/drivers/block/ps3disk.c
index e9da874..4911f9e 100644
--- a/drivers/block/ps3disk.c
+++ b/drivers/block/ps3disk.c
@@ -468,7 +468,7 @@ static int __devinit ps3disk_probe(struct ps3_system_bus_device *_dev)
 	blk_queue_dma_alignment(queue, dev->blk_size-1);
 	blk_queue_logical_block_size(queue, dev->blk_size);
 
-	blk_queue_ordered(queue, QUEUE_ORDERED_DRAIN_FLUSH);
+	blk_queue_flush(queue, REQ_FLUSH);
 
 	blk_queue_max_segments(queue, -1);
 	blk_queue_max_segment_size(queue, dev->bounce_size);
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 7965280..d10b635 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -388,22 +388,15 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 	vblk->disk->driverfs_dev = &vdev->dev;
 	index++;
 
-	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH)) {
-		/*
-		 * If the FLUSH feature is supported we do have support for
-		 * flushing a volatile write cache on the host.  Use that
-		 * to implement write barrier support.
-		 */
-		blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH);
-	} else {
-		/*
-		 * If the FLUSH feature is not supported we must assume that
-		 * the host does not perform any kind of volatile write
-		 * caching. We still need to drain the queue to provider
-		 * proper barrier semantics.
-		 */
-		blk_queue_ordered(q, QUEUE_ORDERED_DRAIN);
-	}
+	/*
+	 * If the FLUSH feature is supported we do have support for
+	 * flushing a volatile write cache on the host.  Use that to
+	 * implement write barrier support; otherwise, we must assume
+	 * that the host does not perform any kind of volatile write
+	 * caching.
+	 */
+	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
+		blk_queue_flush(q, REQ_FLUSH);
 
 	/* If disk is read-only in the host, the guest should obey */
 	if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 25ffbf9..1d48f3a 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -95,7 +95,7 @@ struct blkfront_info
 	struct gnttab_free_callback callback;
 	struct blk_shadow shadow[BLK_RING_SIZE];
 	unsigned long shadow_free;
-	int feature_barrier;
+	unsigned int feature_flush;
 	int is_ready;
 };
 
@@ -418,25 +418,12 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
 }
 
 
-static int xlvbd_barrier(struct blkfront_info *info)
+static void xlvbd_flush(struct blkfront_info *info)
 {
-	int err;
-	const char *barrier;
-
-	switch (info->feature_barrier) {
-	case QUEUE_ORDERED_DRAIN:	barrier = "enabled"; break;
-	case QUEUE_ORDERED_NONE:	barrier = "disabled"; break;
-	default:			return -EINVAL;
-	}
-
-	err = blk_queue_ordered(info->rq, info->feature_barrier);
-
-	if (err)
-		return err;
-
+	blk_queue_flush(info->rq, info->feature_flush);
 	printk(KERN_INFO "blkfront: %s: barriers %s\n",
-	       info->gd->disk_name, barrier);
-	return 0;
+	       info->gd->disk_name,
+	       info->feature_flush ? "enabled" : "disabled");
 }
 
 
@@ -515,7 +502,7 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
 	info->rq = gd->queue;
 	info->gd = gd;
 
-	xlvbd_barrier(info);
+	xlvbd_flush(info);
 
 	if (vdisk_info & VDISK_READONLY)
 		set_disk_ro(gd, 1);
@@ -661,8 +648,8 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 				printk(KERN_WARNING "blkfront: %s: write barrier op failed\n",
 				       info->gd->disk_name);
 				error = -EOPNOTSUPP;
-				info->feature_barrier = QUEUE_ORDERED_NONE;
-				xlvbd_barrier(info);
+				info->feature_flush = 0;
+				xlvbd_flush(info);
 			}
 			/* fall through */
 		case BLKIF_OP_READ:
@@ -1075,19 +1062,13 @@ static void blkfront_connect(struct blkfront_info *info)
 	/*
 	 * If there's no "feature-barrier" defined, then it means
 	 * we're dealing with a very old backend which writes
-	 * synchronously; draining will do what needs to get done.
+	 * synchronously; nothing to do.
 	 *
 	 * If there are barriers, then we use flush.
-	 *
-	 * If barriers are not supported, then there's no much we can
-	 * do, so just set ordering to NONE.
 	 */
-	if (err)
-		info->feature_barrier = QUEUE_ORDERED_DRAIN;
-	else if (barrier)
-		info->feature_barrier = QUEUE_ORDERED_DRAIN_FLUSH;
-	else
-		info->feature_barrier = QUEUE_ORDERED_NONE;
+	info->feature_flush = 0;
+	if (!err && barrier)
+		info->feature_flush = REQ_FLUSH;
 
 	err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size);
 	if (err) {
diff --git a/drivers/ide/ide-disk.c b/drivers/ide/ide-disk.c
index 7433e07..7c5b01c 100644
--- a/drivers/ide/ide-disk.c
+++ b/drivers/ide/ide-disk.c
@@ -516,10 +516,10 @@ static int ide_do_setfeature(ide_drive_t *drive, u8 feature, u8 nsect)
 	return ide_no_data_taskfile(drive, &cmd);
 }
 
-static void update_ordered(ide_drive_t *drive)
+static void update_flush(ide_drive_t *drive)
 {
 	u16 *id = drive->id;
-	unsigned ordered = QUEUE_ORDERED_NONE;
+	unsigned flush = 0;
 
 	if (drive->dev_flags & IDE_DFLAG_WCACHE) {
 		unsigned long long capacity;
@@ -543,13 +543,12 @@ static void update_ordered(ide_drive_t *drive)
 		       drive->name, barrier ? "" : "not ");
 
 		if (barrier) {
-			ordered = QUEUE_ORDERED_DRAIN_FLUSH;
+			flush = REQ_FLUSH;
 			blk_queue_prep_rq(drive->queue, idedisk_prep_fn);
 		}
-	} else
-		ordered = QUEUE_ORDERED_DRAIN;
+	}
 
-	blk_queue_ordered(drive->queue, ordered);
+	blk_queue_flush(drive->queue, flush);
 }
 
 ide_devset_get_flag(wcache, IDE_DFLAG_WCACHE);
@@ -572,7 +571,7 @@ static int set_wcache(ide_drive_t *drive, int arg)
 		}
 	}
 
-	update_ordered(drive);
+	update_flush(drive);
 
 	return err;
 }
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a3f21dc..b71cc9e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1908,7 +1908,7 @@ static struct mapped_device *alloc_dev(int minor)
 	blk_queue_softirq_done(md->queue, dm_softirq_done);
 	blk_queue_prep_rq(md->queue, dm_prep_fn);
 	blk_queue_lld_busy(md->queue, dm_lld_busy);
-	blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN_FLUSH);
+	blk_queue_flush(md->queue, REQ_FLUSH);
 
 	md->disk = alloc_disk(1);
 	if (!md->disk)
diff --git a/drivers/mmc/card/queue.c b/drivers/mmc/card/queue.c
index c77eb49..d791772 100644
--- a/drivers/mmc/card/queue.c
+++ b/drivers/mmc/card/queue.c
@@ -128,7 +128,6 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock
 	mq->req = NULL;
 
 	blk_queue_prep_rq(mq->queue, mmc_prep_request);
-	blk_queue_ordered(mq->queue, QUEUE_ORDERED_DRAIN);
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, mq->queue);
 
 #ifdef CONFIG_MMC_BLOCK_BOUNCE
diff --git a/drivers/s390/block/dasd.c b/drivers/s390/block/dasd.c
index 1a84fae..29046b7 100644
--- a/drivers/s390/block/dasd.c
+++ b/drivers/s390/block/dasd.c
@@ -2197,7 +2197,6 @@ static void dasd_setup_queue(struct dasd_block *block)
 	 */
 	blk_queue_max_segment_size(block->request_queue, PAGE_SIZE);
 	blk_queue_segment_boundary(block->request_queue, PAGE_SIZE - 1);
-	blk_queue_ordered(block->request_queue, QUEUE_ORDERED_DRAIN);
 }
 
 /*
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 05a15b0..7f6aca2 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2109,7 +2109,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
 	struct scsi_disk *sdkp = scsi_disk(disk);
 	struct scsi_device *sdp = sdkp->device;
 	unsigned char *buffer;
-	unsigned ordered;
+	unsigned flush = 0;
 
 	SCSI_LOG_HLQUEUE(3, sd_printk(KERN_INFO, sdkp,
 				      "sd_revalidate_disk\n"));
@@ -2151,15 +2151,15 @@ static int sd_revalidate_disk(struct gendisk *disk)
 
 	/*
 	 * We now have all cache related info, determine how we deal
-	 * with ordered requests.
+	 * with flush requests.
 	 */
-	if (sdkp->WCE)
-		ordered = sdkp->DPOFUA
-			? QUEUE_ORDERED_DRAIN_FUA : QUEUE_ORDERED_DRAIN_FLUSH;
-	else
-		ordered = QUEUE_ORDERED_DRAIN;
+	if (sdkp->WCE) {
+		flush |= REQ_FLUSH;
+		if (sdkp->DPOFUA)
+			flush |= REQ_FUA;
+	}
 
-	blk_queue_ordered(sdkp->disk->queue, ordered);
+	blk_queue_flush(sdkp->disk->queue, flush);
 
 	set_capacity(disk, sdkp->capacity);
 	kfree(buffer);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 96ef5f1..6003f7c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -355,8 +355,10 @@ struct request_queue
 	struct blk_trace	*blk_trace;
 #endif
 	/*
-	 * reserved for flush operations
+	 * for flush operations
 	 */
+	unsigned int		flush_flags;
+
 	unsigned int		ordered, next_ordered, ordseq;
 	int			orderr, ordcolor;
 	struct request		pre_flush_rq, bar_rq, post_flush_rq;
@@ -863,8 +865,8 @@ extern void blk_queue_update_dma_alignment(struct request_queue *, int);
 extern void blk_queue_softirq_done(struct request_queue *, softirq_done_fn *);
 extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
 extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
+extern void blk_queue_flush(struct request_queue *q, unsigned int flush);
 extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
-extern int blk_queue_ordered(struct request_queue *, unsigned);
 extern bool blk_do_ordered(struct request_queue *, struct request **);
 extern unsigned blk_ordered_cur_seq(struct request_queue *);
 extern unsigned blk_ordered_req_seq(struct request *);
-- 
1.7.1
^ permalink raw reply related	[flat|nested] 109+ messages in thread
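
A minimal userspace sketch, not part of the posted patch, of the new
capability interface.  It models the sanitising done by
blk_queue_flush(), the mapping from the old ordered modes given in the
commit message, and the WCE/DPOFUA selection from the sd.c hunk.  The
REQ_FLUSH and REQ_FUA bit values below are placeholders chosen for this
sketch, not the kernel's values, and struct queue_model, enum old_mode
and all function names are hypothetical.

```c
#include <assert.h>

/* Placeholder bit values; the kernel's real definitions are not shown
 * in the patch. */
#define REQ_FLUSH	(1u << 0)
#define REQ_FUA		(1u << 1)

/* Stand-in for the single request_queue field the patch adds. */
struct queue_model {
	unsigned int flush_flags;
};

/*
 * Follows blk_queue_flush() from the patch without the WARN_ON_ONCE()
 * diagnostics: FUA without FLUSH is cleared, unknown bits are dropped.
 */
void model_queue_flush(struct queue_model *q, unsigned int flush)
{
	if (!(flush & REQ_FLUSH) && (flush & REQ_FUA))
		flush &= ~REQ_FUA;

	q->flush_flags = flush & (REQ_FLUSH | REQ_FUA);
}

/* Names for the old modes, used only to express the mapping. */
enum old_mode { OLD_NONE, OLD_DRAIN, OLD_DRAIN_FLUSH, OLD_DRAIN_FUA };

/* Mapping from the commit message: DRAIN becomes the default 0. */
unsigned int old_mode_to_flush(enum old_mode mode)
{
	switch (mode) {
	case OLD_DRAIN_FLUSH:
		return REQ_FLUSH;
	case OLD_DRAIN_FUA:
		return REQ_FLUSH | REQ_FUA;
	default:
		return 0;
	}
}

/*
 * Selection as in the sd.c hunk: a write cache (WCE) implies FLUSH,
 * and FUA is only advertised on top of a write cache.
 */
unsigned int sd_flush_flags(int wce, int dpofua)
{
	unsigned int flush = 0;

	if (wce) {
		flush |= REQ_FLUSH;
		if (dpofua)
			flush |= REQ_FUA;
	}
	return flush;
}

/* Sanity checks on the behaviour described above. */
void flush_selfcheck(void)
{
	struct queue_model q = { 0 };

	model_queue_flush(&q, REQ_FUA);		/* FUA alone is dropped */
	assert(q.flush_flags == 0);

	model_queue_flush(&q, sd_flush_flags(1, 1));
	assert(q.flush_flags == (REQ_FLUSH | REQ_FUA));

	assert(old_mode_to_flush(OLD_DRAIN) == 0);
}
```

Calling flush_selfcheck() from a small main() exercises the sketch.
Drivers that previously set ORDERED_DRAIN (brd, mmc, dasd) correspond to
the default of 0, which is why the patch simply deletes their
blk_queue_ordered() calls.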
- * Re: [PATCH 03/11] block: deprecate barrier and replace blk_queue_ordered() with blk_queue_flush()
  2010-08-12 12:41 ` [PATCH 03/11] block: deprecate barrier and replace blk_queue_ordered() with blk_queue_flush() Tejun Heo
@ 2010-08-14  1:07   ` Jeremy Fitzhardinge
  2010-08-14  9:42     ` hch
  0 siblings, 1 reply; 109+ messages in thread
From: Jeremy Fitzhardinge @ 2010-08-14  1:07 UTC (permalink / raw)
  To: hch@lst.de
  Cc: jack@suse.cz, Michael S. Tsirkin, linux-ide@vger.kernel.org,
	dm-devel@redhat.com, James.Bottomley@suse.de, Pierre Ossman,
	konishi.ryusuke@lab.ntt.co.jp, Alasdair G Kergon,
	Stefan Weinhuber, vst@vlnb.net, linux-scsi@vger.kernel.org,
	Christoph Hellwig, Boaz Harrosh, Geert Uytterhoeven,
	Daniel Stodden, Nick Piggin, Chris Wright, Tejun Heo,
	"swhiteho@redhat.com" <swhite>
 On 08/12/2010 05:41 AM, Tejun Heo wrote:
> Barrier is deemed too heavy and will soon be replaced by FLUSH/FUA
> requests.  Deprecate barrier.  All REQ_HARDBARRIERs are failed with
> -EOPNOTSUPP and blk_queue_ordered() is replaced with simpler
> blk_queue_flush().
>
> blk_queue_flush() takes combinations of REQ_FLUSH and REQ_FUA.  If a
> device has write cache and can flush it, it should set REQ_FLUSH.  If
> the device can handle FUA writes, it should also set REQ_FUA.
Christoph, do these two patches (parts 2 and 3) make xen-blkfront
correct WRT barriers/flushing as far as you're concerned?
Thanks,
    J
> All blk_queue_ordered() users are converted.
>
> * ORDERED_DRAIN is mapped to 0 which is the default value.
> * ORDERED_DRAIN_FLUSH is mapped to REQ_FLUSH.
> * ORDERED_DRAIN_FUA is mapped to REQ_FLUSH | REQ_FUA.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Nick Piggin <npiggin@kernel.dk>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
> Cc: Chris Wright <chrisw@sous-sol.org>
> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
> Cc: Boaz Harrosh <bharrosh@panasas.com>
> Cc: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Alasdair G Kergon <agk@redhat.com>
> Cc: Pierre Ossman <drzeus@drzeus.cx>
> Cc: Stefan Weinhuber <wein@de.ibm.com>
> ---
>  block/blk-barrier.c          |   29 ----------------------------
>  block/blk-core.c             |    6 +++-
>  block/blk-settings.c         |   20 +++++++++++++++++++
>  drivers/block/brd.c          |    1 -
>  drivers/block/loop.c         |    2 +-
>  drivers/block/osdblk.c       |    2 +-
>  drivers/block/ps3disk.c      |    2 +-
>  drivers/block/virtio_blk.c   |   25 ++++++++---------------
>  drivers/block/xen-blkfront.c |   43 +++++++++++------------------------------
>  drivers/ide/ide-disk.c       |   13 +++++------
>  drivers/md/dm.c              |    2 +-
>  drivers/mmc/card/queue.c     |    1 -
>  drivers/s390/block/dasd.c    |    1 -
>  drivers/scsi/sd.c            |   16 +++++++-------
>  include/linux/blkdev.h       |    6 +++-
>  15 files changed, 67 insertions(+), 102 deletions(-)
>
> diff --git a/block/blk-barrier.c b/block/blk-barrier.c
> index c807e9c..ed0aba5 100644
> --- a/block/blk-barrier.c
> +++ b/block/blk-barrier.c
> @@ -9,35 +9,6 @@
>
>  #include "blk.h"
>
> -/**
> - * blk_queue_ordered - does this queue support ordered writes
> - * @q:        the request queue
> - * @ordered:  one of QUEUE_ORDERED_*
> - *
> - * Description:
> - *   For journalled file systems, doing ordered writes on a commit
> - *   block instead of explicitly doing wait_on_buffer (which is bad
> - *   for performance) can be a big win. Block drivers supporting this
> - *   feature should call this function and indicate so.
> - *
> - **/
> -int blk_queue_ordered(struct request_queue *q, unsigned ordered)
> -{
> -       if (ordered != QUEUE_ORDERED_NONE &&
> -           ordered != QUEUE_ORDERED_DRAIN &&
> -           ordered != QUEUE_ORDERED_DRAIN_FLUSH &&
> -           ordered != QUEUE_ORDERED_DRAIN_FUA) {
> -               printk(KERN_ERR "blk_queue_ordered: bad value %d\n", ordered);
> -               return -EINVAL;
> -       }
> -
> -       q->ordered = ordered;
> -       q->next_ordered = ordered;
> -
> -       return 0;
> -}
> -EXPORT_SYMBOL(blk_queue_ordered);
> -
>  /*
>   * Cache flushing for ordered writes handling
>   */
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 5ab3ac2..3f802dd 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1203,11 +1203,13 @@ static int __make_request(struct request_queue *q, struct bio *bio)
>         const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK;
>         int rw_flags;
>
> -       if ((bio->bi_rw & REQ_HARDBARRIER) &&
> -           (q->next_ordered == QUEUE_ORDERED_NONE)) {
> +       /* REQ_HARDBARRIER is no more */
> +       if (WARN_ONCE(bio->bi_rw & REQ_HARDBARRIER,
> +               "block: HARDBARRIER is deprecated, use FLUSH/FUA instead\n")) {
>                 bio_endio(bio, -EOPNOTSUPP);
>                 return 0;
>         }
> +
>         /*
>          * low level driver can indicate that it wants pages above a
>          * certain limit bounced to low memory (ie for highmem, or even
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index a234f4b..9b18afc 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -794,6 +794,26 @@ void blk_queue_update_dma_alignment(struct request_queue *q, int mask)
>  }
>  EXPORT_SYMBOL(blk_queue_update_dma_alignment);
>
> +/**
> + * blk_queue_flush - configure queue's cache flush capability
> + * @q:         the request queue for the device
> + * @flush:     0, REQ_FLUSH or REQ_FLUSH | REQ_FUA
> + *
> + * Tell block layer cache flush capability of @q.  If it supports
> + * flushing, REQ_FLUSH should be set.  If it supports bypassing
> + * write cache for individual writes, REQ_FUA should be set.
> + */
> +void blk_queue_flush(struct request_queue *q, unsigned int flush)
> +{
> +       WARN_ON_ONCE(flush & ~(REQ_FLUSH | REQ_FUA));
> +
> +       if (WARN_ON_ONCE(!(flush & REQ_FLUSH) && (flush & REQ_FUA)))
> +               flush &= ~REQ_FUA;
> +
> +       q->flush_flags = flush & (REQ_FLUSH | REQ_FUA);
> +}
> +EXPORT_SYMBOL_GPL(blk_queue_flush);
> +
>  static int __init blk_settings_init(void)
>  {
>         blk_max_low_pfn = max_low_pfn - 1;
> diff --git a/drivers/block/brd.c b/drivers/block/brd.c
> index 47a4127..fa33f97 100644
> --- a/drivers/block/brd.c
> +++ b/drivers/block/brd.c
> @@ -482,7 +482,6 @@ static struct brd_device *brd_alloc(int i)
>         if (!brd->brd_queue)
>                 goto out_free_dev;
>         blk_queue_make_request(brd->brd_queue, brd_make_request);
> -       blk_queue_ordered(brd->brd_queue, QUEUE_ORDERED_DRAIN);
>         blk_queue_max_hw_sectors(brd->brd_queue, 1024);
>         blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index c3a4a2e..953d1e1 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -832,7 +832,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
>         lo->lo_queue->unplug_fn = loop_unplug;
>
>         if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
> -               blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN_FLUSH);
> +               blk_queue_flush(lo->lo_queue, REQ_FLUSH);
>
>         set_capacity(lo->lo_disk, size);
>         bd_set_size(bdev, size << 9);
> diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
> index 2284b4f..72d6246 100644
> --- a/drivers/block/osdblk.c
> +++ b/drivers/block/osdblk.c
> @@ -439,7 +439,7 @@ static int osdblk_init_disk(struct osdblk_device *osdev)
>         blk_queue_stack_limits(q, osd_request_queue(osdev->osd));
>
>         blk_queue_prep_rq(q, blk_queue_start_tag);
> -       blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH);
> +       blk_queue_flush(q, REQ_FLUSH);
>
>         disk->queue = q;
>
> diff --git a/drivers/block/ps3disk.c b/drivers/block/ps3disk.c
> index e9da874..4911f9e 100644
> --- a/drivers/block/ps3disk.c
> +++ b/drivers/block/ps3disk.c
> @@ -468,7 +468,7 @@ static int __devinit ps3disk_probe(struct ps3_system_bus_device *_dev)
>         blk_queue_dma_alignment(queue, dev->blk_size-1);
>         blk_queue_logical_block_size(queue, dev->blk_size);
>
> -       blk_queue_ordered(queue, QUEUE_ORDERED_DRAIN_FLUSH);
> +       blk_queue_flush(queue, REQ_FLUSH);
>
>         blk_queue_max_segments(queue, -1);
>         blk_queue_max_segment_size(queue, dev->bounce_size);
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 7965280..d10b635 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -388,22 +388,15 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
>         vblk->disk->driverfs_dev = &vdev->dev;
>         index++;
>
> -       if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH)) {
> -               /*
> -                * If the FLUSH feature is supported we do have support for
> -                * flushing a volatile write cache on the host.  Use that
> -                * to implement write barrier support.
> -                */
> -               blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH);
> -       } else {
> -               /*
> -                * If the FLUSH feature is not supported we must assume that
> -                * the host does not perform any kind of volatile write
> -                * caching. We still need to drain the queue to provider
> -                * proper barrier semantics.
> -                */
> -               blk_queue_ordered(q, QUEUE_ORDERED_DRAIN);
> -       }
> +       /*
> +        * If the FLUSH feature is supported we do have support for
> +        * flushing a volatile write cache on the host.  Use that to
> +        * implement write barrier support; otherwise, we must assume
> +        * that the host does not perform any kind of volatile write
> +        * caching.
> +        */
> +       if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
> +               blk_queue_flush(q, REQ_FLUSH);
>
>         /* If disk is read-only in the host, the guest should obey */
>         if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 25ffbf9..1d48f3a 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -95,7 +95,7 @@ struct blkfront_info
>         struct gnttab_free_callback callback;
>         struct blk_shadow shadow[BLK_RING_SIZE];
>         unsigned long shadow_free;
> -       int feature_barrier;
> +       unsigned int feature_flush;
>         int is_ready;
>  };
>
> @@ -418,25 +418,12 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
>  }
>
>
> -static int xlvbd_barrier(struct blkfront_info *info)
> +static void xlvbd_flush(struct blkfront_info *info)
>  {
> -       int err;
> -       const char *barrier;
> -
> -       switch (info->feature_barrier) {
> -       case QUEUE_ORDERED_DRAIN:       barrier = "enabled"; break;
> -       case QUEUE_ORDERED_NONE:        barrier = "disabled"; break;
> -       default:                        return -EINVAL;
> -       }
> -
> -       err = blk_queue_ordered(info->rq, info->feature_barrier);
> -
> -       if (err)
> -               return err;
> -
> +       blk_queue_flush(info->rq, info->feature_flush);
>         printk(KERN_INFO "blkfront: %s: barriers %s\n",
> -              info->gd->disk_name, barrier);
> -       return 0;
> +              info->gd->disk_name,
> +              info->feature_flush ? "enabled" : "disabled");
>  }
>
>
> @@ -515,7 +502,7 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
>         info->rq = gd->queue;
>         info->gd = gd;
>
> -       xlvbd_barrier(info);
> +       xlvbd_flush(info);
>
>         if (vdisk_info & VDISK_READONLY)
>                 set_disk_ro(gd, 1);
> @@ -661,8 +648,8 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
>                                 printk(KERN_WARNING "blkfront: %s: write barrier op failed\n",
>                                        info->gd->disk_name);
>                                 error = -EOPNOTSUPP;
> -                               info->feature_barrier = QUEUE_ORDERED_NONE;
> -                               xlvbd_barrier(info);
> +                               info->feature_flush = 0;
> +                               xlvbd_flush(info);
>                         }
>                         /* fall through */
>                 case BLKIF_OP_READ:
> @@ -1075,19 +1062,13 @@ static void blkfront_connect(struct blkfront_info *info)
>         /*
>          * If there's no "feature-barrier" defined, then it means
>          * we're dealing with a very old backend which writes
> -        * synchronously; draining will do what needs to get done.
> +        * synchronously; nothing to do.
>          *
>          * If there are barriers, then we use flush.
> -        *
> -        * If barriers are not supported, then there's no much we can
> -        * do, so just set ordering to NONE.
>          */
> -       if (err)
> -               info->feature_barrier = QUEUE_ORDERED_DRAIN;
> -       else if (barrier)
> -               info->feature_barrier = QUEUE_ORDERED_DRAIN_FLUSH;
> -       else
> -               info->feature_barrier = QUEUE_ORDERED_NONE;
> +       info->feature_flush = 0;
> +       if (!err && barrier)
> +               info->feature_flush = REQ_FLUSH;
>
>         err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size);
>         if (err) {
> diff --git a/drivers/ide/ide-disk.c b/drivers/ide/ide-disk.c
> index 7433e07..7c5b01c 100644
> --- a/drivers/ide/ide-disk.c
> +++ b/drivers/ide/ide-disk.c
> @@ -516,10 +516,10 @@ static int ide_do_setfeature(ide_drive_t *drive, u8 feature, u8 nsect)
>         return ide_no_data_taskfile(drive, &cmd);
>  }
>
> -static void update_ordered(ide_drive_t *drive)
> +static void update_flush(ide_drive_t *drive)
>  {
>         u16 *id = drive->id;
> -       unsigned ordered = QUEUE_ORDERED_NONE;
> +       unsigned flush = 0;
>
>         if (drive->dev_flags & IDE_DFLAG_WCACHE) {
>                 unsigned long long capacity;
> @@ -543,13 +543,12 @@ static void update_ordered(ide_drive_t *drive)
>                        drive->name, barrier ? "" : "not ");
>
>                 if (barrier) {
> -                       ordered = QUEUE_ORDERED_DRAIN_FLUSH;
> +                       flush = REQ_FLUSH;
>                         blk_queue_prep_rq(drive->queue, idedisk_prep_fn);
>                 }
> -       } else
> -               ordered = QUEUE_ORDERED_DRAIN;
> +       }
>
> -       blk_queue_ordered(drive->queue, ordered);
> +       blk_queue_flush(drive->queue, flush);
>  }
>
>  ide_devset_get_flag(wcache, IDE_DFLAG_WCACHE);
> @@ -572,7 +571,7 @@ static int set_wcache(ide_drive_t *drive, int arg)
>                 }
>         }
>
> -       update_ordered(drive);
> +       update_flush(drive);
>
>         return err;
>  }
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index a3f21dc..b71cc9e 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1908,7 +1908,7 @@ static struct mapped_device *alloc_dev(int minor)
>         blk_queue_softirq_done(md->queue, dm_softirq_done);
>         blk_queue_prep_rq(md->queue, dm_prep_fn);
>         blk_queue_lld_busy(md->queue, dm_lld_busy);
> -       blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN_FLUSH);
> +       blk_queue_flush(md->queue, REQ_FLUSH);
>
>         md->disk = alloc_disk(1);
>         if (!md->disk)
> diff --git a/drivers/mmc/card/queue.c b/drivers/mmc/card/queue.c
> index c77eb49..d791772 100644
> --- a/drivers/mmc/card/queue.c
> +++ b/drivers/mmc/card/queue.c
> @@ -128,7 +128,6 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock
>         mq->req = NULL;
>
>         blk_queue_prep_rq(mq->queue, mmc_prep_request);
> -       blk_queue_ordered(mq->queue, QUEUE_ORDERED_DRAIN);
>         queue_flag_set_unlocked(QUEUE_FLAG_NONROT, mq->queue);
>
>  #ifdef CONFIG_MMC_BLOCK_BOUNCE
> diff --git a/drivers/s390/block/dasd.c b/drivers/s390/block/dasd.c
> index 1a84fae..29046b7 100644
> --- a/drivers/s390/block/dasd.c
> +++ b/drivers/s390/block/dasd.c
> @@ -2197,7 +2197,6 @@ static void dasd_setup_queue(struct dasd_block *block)
>          */
>         blk_queue_max_segment_size(block->request_queue, PAGE_SIZE);
>         blk_queue_segment_boundary(block->request_queue, PAGE_SIZE - 1);
> -       blk_queue_ordered(block->request_queue, QUEUE_ORDERED_DRAIN);
>  }
>
>  /*
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 05a15b0..7f6aca2 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -2109,7 +2109,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
>         struct scsi_disk *sdkp = scsi_disk(disk);
>         struct scsi_device *sdp = sdkp->device;
>         unsigned char *buffer;
> -       unsigned ordered;
> +       unsigned flush = 0;
>
>         SCSI_LOG_HLQUEUE(3, sd_printk(KERN_INFO, sdkp,
>                                       "sd_revalidate_disk\n"));
> @@ -2151,15 +2151,15 @@ static int sd_revalidate_disk(struct gendisk *disk)
>
>         /*
>          * We now have all cache related info, determine how we deal
> -        * with ordered requests.
> +        * with flush requests.
>          */
> -       if (sdkp->WCE)
> -               ordered = sdkp->DPOFUA
> -                       ? QUEUE_ORDERED_DRAIN_FUA : QUEUE_ORDERED_DRAIN_FLUSH;
> -       else
> -               ordered = QUEUE_ORDERED_DRAIN;
> +       if (sdkp->WCE) {
> +               flush |= REQ_FLUSH;
> +               if (sdkp->DPOFUA)
> +                       flush |= REQ_FUA;
> +       }
>
> -       blk_queue_ordered(sdkp->disk->queue, ordered);
> +       blk_queue_flush(sdkp->disk->queue, flush);
>
>         set_capacity(disk, sdkp->capacity);
>         kfree(buffer);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 96ef5f1..6003f7c 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -355,8 +355,10 @@ struct request_queue
>         struct blk_trace        *blk_trace;
>  #endif
>         /*
> -        * reserved for flush operations
> +        * for flush operations
>          */
> +       unsigned int            flush_flags;
> +
>         unsigned int            ordered, next_ordered, ordseq;
>         int                     orderr, ordcolor;
>         struct request          pre_flush_rq, bar_rq, post_flush_rq;
> @@ -863,8 +865,8 @@ extern void blk_queue_update_dma_alignment(struct request_queue *, int);
>  extern void blk_queue_softirq_done(struct request_queue *, softirq_done_fn *);
>  extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
>  extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
> +extern void blk_queue_flush(struct request_queue *q, unsigned int flush);
>  extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
> -extern int blk_queue_ordered(struct request_queue *, unsigned);
>  extern bool blk_do_ordered(struct request_queue *, struct request **);
>  extern unsigned blk_ordered_cur_seq(struct request_queue *);
>  extern unsigned blk_ordered_req_seq(struct request *);
> --
> 1.7.1
>
^ permalink raw reply	[flat|nested] 109+ messages in thread
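
The mapping and sanity checks described in the quoted patch can be
sketched in plain userspace C.  This is only an illustration, not the
kernel code: the flag values, the enum old_ordered type and the helper
names ordered_to_flush() and sanitize_flush() are assumptions made up
for this note, and the WARN_ON_ONCE() calls are left out.

```c
/* Stand-in values; the real REQ_FLUSH/REQ_FUA bits live in kernel headers. */
#define REQ_FLUSH (1u << 0)
#define REQ_FUA   (1u << 1)

/* Old ordered modes as named in the patch description (values illustrative). */
enum old_ordered {
	ORDERED_DRAIN,
	ORDERED_DRAIN_FLUSH,
	ORDERED_DRAIN_FLUSH_FUA,
};

/* Map an old ordered mode to the flags handed to blk_queue_flush(). */
static unsigned int ordered_to_flush(enum old_ordered mode)
{
	switch (mode) {
	case ORDERED_DRAIN_FLUSH:
		return REQ_FLUSH;
	case ORDERED_DRAIN_FLUSH_FUA:
		return REQ_FLUSH | REQ_FUA;
	case ORDERED_DRAIN:
	default:
		return 0;	/* the default: nothing to flush */
	}
}

/* Same filtering as blk_queue_flush() in the patch, minus the warnings. */
static unsigned int sanitize_flush(unsigned int flush)
{
	/* FUA without FLUSH is rejected, so drop the FUA bit. */
	if (!(flush & REQ_FLUSH) && (flush & REQ_FUA))
		flush &= ~REQ_FUA;

	/* Only FLUSH and FUA are kept; any other bit is ignored. */
	return flush & (REQ_FLUSH | REQ_FUA);
}
```

With these helpers, a driver that used to pass ORDERED_DRAIN ends up
with flush flags of 0, which is why the brd, mmc and dasd hunks simply
drop their blk_queue_ordered() calls.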
- * Re: [PATCH 03/11] block: deprecate barrier and replace blk_queue_ordered() with blk_queue_flush()
  2010-08-14  1:07   ` Jeremy Fitzhardinge
@ 2010-08-14  9:42     ` hch
  2010-08-16 20:38       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 109+ messages in thread
From: hch @ 2010-08-14  9:42 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: jack@suse.cz, Michael S. Tsirkin, linux-ide@vger.kernel.org,
	dm-devel@redhat.com, James.Bottomley@suse.de, Pierre Ossman,
	konishi.ryusuke@lab.ntt.co.jp, Alasdair G Kergon,
	Stefan Weinhuber, vst@vlnb.net, linux-scsi@vger.kernel.org,
	Boaz Harrosh, Geert Uytterhoeven, Daniel Stodden, Nick Piggin,
	Chris Wright, Tejun Heo, swhiteho@redhat.com, chris.mason
On Fri, Aug 13, 2010 at 06:07:13PM -0700, Jeremy Fitzhardinge wrote:
>  On 08/12/2010 05:41 AM, Tejun Heo wrote:
> > Barrier is deemed too heavy and will soon be replaced by FLUSH/FUA
> > requests.  Deprecate barrier.  All REQ_HARDBARRIERs are failed with
> > -EOPNOTSUPP and blk_queue_ordered() is replaced with simpler
> > blk_queue_flush().
> >
> > blk_queue_flush() takes combinations of REQ_FLUSH and FUA.  If a
> > device has write cache and can flush it, it should set REQ_FLUSH.  If
> > the device can handle FUA writes, it should also set REQ_FUA.
> 
> Christoph, do these two patches (parts 2 and 3) make xen-blkfront
> correct WRT barriers/flushing as far as you're concerned?
If all your backends handle a zero-length BLKIF_OP_WRITE_BARRIER request,
it is a fully correct but rather suboptimal implementation.  To get
all the benefit of the new non-draining barriers you'll need a new
BLKIF_OP_FLUSH request that only flushes the cache, but has no ordering
side effects.  Note that "suboptimal" here means not as good as the new
barrier implementation, but it shouldn't be noticeably worse than the
old one for Xen.
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCH 03/11] block: deprecate barrier and replace blk_queue_ordered() with blk_queue_flush()
  2010-08-14  9:42     ` hch
@ 2010-08-16 20:38       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 109+ messages in thread
From: Jeremy Fitzhardinge @ 2010-08-16 20:38 UTC (permalink / raw)
  To: hch@lst.de
  Cc: jack@suse.cz, Michael S. Tsirkin, linux-ide@vger.kernel.org,
	dm-devel@redhat.com, James.Bottomley@suse.de, Pierre Ossman,
	konishi.ryusuke@lab.ntt.co.jp, Alasdair G Kergon,
	Stefan Weinhuber, vst@vlnb.net, linux-scsi@vger.kernel.org,
	Boaz Harrosh, Geert Uytterhoeven, Daniel Stodden, Nick Piggin,
	Chris Wright, Tejun Heo, swhiteho@redhat.com, chris.mason
 On 08/14/2010 02:42 AM, hch@lst.de wrote:
> On Fri, Aug 13, 2010 at 06:07:13PM -0700, Jeremy Fitzhardinge wrote:
>>  On 08/12/2010 05:41 AM, Tejun Heo wrote:
>>> Barrier is deemed too heavy and will soon be replaced by FLUSH/FUA
>>> requests.  Deprecate barrier.  All REQ_HARDBARRIERs are failed with
>>> -EOPNOTSUPP and blk_queue_ordered() is replaced with simpler
>>> blk_queue_flush().
>>>
>>> blk_queue_flush() takes combinations of REQ_FLUSH and FUA.  If a
>>> device has write cache and can flush it, it should set REQ_FLUSH.  If
>>> the device can handle FUA writes, it should also set REQ_FUA.
>> Christoph, do these two patches (parts 2 and 3) make xen-blkfront
>> correct WRT barriers/flushing as far as you're concerned?
> If all your backends handle a zero-length BLKIF_OP_WRITE_BARRIER request,
> it is a fully correct but rather suboptimal implementation.  To get
> all the benefit of the new non-draining barriers you'll need a new
> BLKIF_OP_FLUSH request that only flushes the cache, but has no ordering
> side effects.
Is the effect of the flush that, once complete, any previously completed
write is guaranteed to be on durable storage, but it is not guaranteed
to have any effect on pending writes?  If so, does it flush writes that
were completed before the flush is issued, or writes that complete
before the flush completes?
>   Note that
> "suboptimal" here means not as good as the new barrier
> implementation, but it shouldn't be noticeably worse than the old one
> for Xen.
OK, thanks.  We can do some testing on that and see if there's a benefit
to adding a flush operation with the appropriate semantics.
    J
^ permalink raw reply	[flat|nested] 109+ messages in thread
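
For reference, the xen-blkfront change discussed in this sub-thread
boils down to one decision in blkfront_connect().  The following is a
minimal userspace sketch of that decision, not the driver code: the
REQ_FLUSH value and the helper name blkfront_feature_flush() are
assumptions for illustration, and err/barrier stand for the result of
looking up the backend's "feature-barrier" setting.

```c
/* Stand-in value; the real REQ_FLUSH bit is defined by the kernel headers. */
#define REQ_FLUSH (1u << 0)

/*
 * Model of the feature selection after the patch.
 * err:     non-zero when no "feature-barrier" setting could be read
 * barrier: the value of that setting when it could be read
 */
static unsigned int blkfront_feature_flush(int err, int barrier)
{
	/* Backend advertises barriers: ask the block layer for flushes. */
	if (!err && barrier)
		return REQ_FLUSH;

	/* Very old backend, or barriers not supported: nothing to do. */
	return 0;
}
```

The old three-way choice (DRAIN, DRAIN_FLUSH, NONE) therefore collapses
into two outcomes, since draining is no longer something a driver
requests.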
 
 
 
- * [PATCH 04/11] block: remove spurious uses of REQ_HARDBARRIER
  2010-08-12 12:41 [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Tejun Heo
                   ` (2 preceding siblings ...)
  2010-08-12 12:41 ` [PATCH 03/11] block: deprecate barrier and replace blk_queue_ordered() with blk_queue_flush() Tejun Heo
@ 2010-08-12 12:41 ` Tejun Heo
  2010-08-12 12:41 ` [PATCH 05/11] block: misc cleanups in barrier code Tejun Heo
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2010-08-12 12:41 UTC (permalink / raw)
  To: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.
  Cc: Tejun Heo, Boaz Harrosh, Peter Osterlund
REQ_HARDBARRIER is deprecated.  Remove spurious uses in the following
users.  Please note that other than osdblk, all other uses were
already spurious before deprecation.
* osdblk: osdblk_rq_fn() won't receive any request with
  REQ_HARDBARRIER set.  Remove the test for it.
* pktcdvd: use of REQ_HARDBARRIER in pkt_generic_packet() doesn't mean
  anything.  Removed.
* aic7xxx_old: Setting MSG_ORDERED_Q_TAG on REQ_HARDBARRIER is
  spurious.  Removed.
* sas_scsi_host: Setting TASK_ATTR_ORDERED on REQ_HARDBARRIER is
  spurious.  Removed.
* scsi_tcq: The ordered tag path wasn't being used anyway.  Removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: James Bottomley <James.Bottomley@suse.de>
Cc: Peter Osterlund <petero2@telia.com>
---
 drivers/block/osdblk.c              |    3 +--
 drivers/block/pktcdvd.c             |    1 -
 drivers/scsi/aic7xxx_old.c          |   21 ++-------------------
 drivers/scsi/libsas/sas_scsi_host.c |   13 +------------
 include/scsi/scsi_tcq.h             |    6 +-----
 5 files changed, 5 insertions(+), 39 deletions(-)
diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
index 72d6246..87311eb 100644
--- a/drivers/block/osdblk.c
+++ b/drivers/block/osdblk.c
@@ -310,8 +310,7 @@ static void osdblk_rq_fn(struct request_queue *q)
 			break;
 
 		/* filter out block requests we don't understand */
-		if (rq->cmd_type != REQ_TYPE_FS &&
-		    !(rq->cmd_flags & REQ_HARDBARRIER)) {
+		if (rq->cmd_type != REQ_TYPE_FS) {
 			blk_end_request_all(rq, 0);
 			continue;
 		}
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index b1cbeb5..0166ea1 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -753,7 +753,6 @@ static int pkt_generic_packet(struct pktcdvd_device *pd, struct packet_command *
 
 	rq->timeout = 60*HZ;
 	rq->cmd_type = REQ_TYPE_BLOCK_PC;
-	rq->cmd_flags |= REQ_HARDBARRIER;
 	if (cgc->quiet)
 		rq->cmd_flags |= REQ_QUIET;
 
diff --git a/drivers/scsi/aic7xxx_old.c b/drivers/scsi/aic7xxx_old.c
index 93984c9..e1cd606 100644
--- a/drivers/scsi/aic7xxx_old.c
+++ b/drivers/scsi/aic7xxx_old.c
@@ -2850,12 +2850,6 @@ aic7xxx_done(struct aic7xxx_host *p, struct aic7xxx_scb *scb)
       aic_dev->r_total++;
       ptr = aic_dev->r_bins;
     }
-    if(cmd->device->simple_tags && cmd->request->cmd_flags & REQ_HARDBARRIER)
-    {
-      aic_dev->barrier_total++;
-      if(scb->tag_action == MSG_ORDERED_Q_TAG)
-        aic_dev->ordered_total++;
-    }
     x = scb->sg_length;
     x >>= 10;
     for(i=0; i<6; i++)
@@ -10144,19 +10138,8 @@ static void aic7xxx_buildscb(struct aic7xxx_host *p, struct scsi_cmnd *cmd,
     /* We always force TEST_UNIT_READY to untagged */
     if (cmd->cmnd[0] != TEST_UNIT_READY && sdptr->simple_tags)
     {
-      if (req->cmd_flags & REQ_HARDBARRIER)
-      {
-	if(sdptr->ordered_tags)
-	{
-          hscb->control |= MSG_ORDERED_Q_TAG;
-          scb->tag_action = MSG_ORDERED_Q_TAG;
-	}
-      }
-      else
-      {
-        hscb->control |= MSG_SIMPLE_Q_TAG;
-        scb->tag_action = MSG_SIMPLE_Q_TAG;
-      }
+      hscb->control |= MSG_SIMPLE_Q_TAG;
+      scb->tag_action = MSG_SIMPLE_Q_TAG;
     }
   }
   if ( !(aic_dev->dtr_pending) &&
diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index f0cfba9..535085c 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -130,17 +130,6 @@ static void sas_scsi_task_done(struct sas_task *task)
 	sc->scsi_done(sc);
 }
 
-static enum task_attribute sas_scsi_get_task_attr(struct scsi_cmnd *cmd)
-{
-	enum task_attribute ta = TASK_ATTR_SIMPLE;
-	if (cmd->request && blk_rq_tagged(cmd->request)) {
-		if (cmd->device->ordered_tags &&
-		    (cmd->request->cmd_flags & REQ_HARDBARRIER))
-			ta = TASK_ATTR_ORDERED;
-	}
-	return ta;
-}
-
 static struct sas_task *sas_create_task(struct scsi_cmnd *cmd,
 					       struct domain_device *dev,
 					       gfp_t gfp_flags)
@@ -160,7 +149,7 @@ static struct sas_task *sas_create_task(struct scsi_cmnd *cmd,
 	task->ssp_task.retry_count = 1;
 	int_to_scsilun(cmd->device->lun, &lun);
 	memcpy(task->ssp_task.LUN, &lun.scsi_lun, 8);
-	task->ssp_task.task_attr = sas_scsi_get_task_attr(cmd);
+	task->ssp_task.task_attr = TASK_ATTR_SIMPLE;
 	memcpy(task->ssp_task.cdb, cmd->cmnd, 16);
 
 	task->scatter = scsi_sglist(cmd);
diff --git a/include/scsi/scsi_tcq.h b/include/scsi/scsi_tcq.h
index 1723138..d6e7994 100644
--- a/include/scsi/scsi_tcq.h
+++ b/include/scsi/scsi_tcq.h
@@ -97,13 +97,9 @@ static inline void scsi_deactivate_tcq(struct scsi_device *sdev, int depth)
 static inline int scsi_populate_tag_msg(struct scsi_cmnd *cmd, char *msg)
 {
         struct request *req = cmd->request;
-	struct scsi_device *sdev = cmd->device;
 
         if (blk_rq_tagged(req)) {
-		if (sdev->ordered_tags && req->cmd_flags & REQ_HARDBARRIER)
-        	        *msg++ = MSG_ORDERED_TAG;
-        	else
-        	        *msg++ = MSG_SIMPLE_TAG;
+		*msg++ = MSG_SIMPLE_TAG;
         	*msg++ = req->tag;
         	return 2;
 	}
-- 
1.7.1
^ permalink raw reply related	[flat|nested] 109+ messages in thread
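
The effect of the scsi_tcq.h hunk above can be shown with a small
userspace model.  It is a sketch under stated assumptions: struct
fake_request and populate_tag_msg() are hypothetical stand-ins for
struct request and scsi_populate_tag_msg(), the MSG_SIMPLE_TAG value
is a local definition, and the untagged path is assumed to write no
message bytes.

```c
/* Local stand-in for the kernel's simple queue tag message code. */
#define MSG_SIMPLE_TAG 0x20

/* Hypothetical reduction of struct request to the two fields used here. */
struct fake_request {
	int tagged;	/* models blk_rq_tagged(req) */
	int tag;	/* models req->tag */
};

/*
 * Model of scsi_populate_tag_msg() after the patch: a tagged request
 * always gets a simple tag message, never an ordered one.  Returns the
 * number of bytes written to msg.
 */
static int populate_tag_msg(const struct fake_request *req, char *msg)
{
	if (req->tagged) {
		*msg++ = MSG_SIMPLE_TAG;
		*msg++ = (char)req->tag;
		return 2;
	}

	return 0;	/* assumption: untagged requests add nothing */
}
```

Since no request can carry REQ_HARDBARRIER any more, the ordered-tag
branch had become dead code, and the model has no input that could
select it.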
- * [PATCH 05/11] block: misc cleanups in barrier code
  2010-08-12 12:41 [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Tejun Heo
                   ` (3 preceding siblings ...)
  2010-08-12 12:41 ` [PATCH 04/11] block: remove spurious uses of REQ_HARDBARRIER Tejun Heo
@ 2010-08-12 12:41 ` Tejun Heo
  2010-08-12 12:41 ` [PATCH 06/11] block: drop barrier ordering by queue draining Tejun Heo
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2010-08-12 12:41 UTC (permalink / raw)
  To: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch
  Cc: Tejun Heo
Make the following cleanups in preparation for the barrier/flush update.
* blk_do_ordered() declaration is moved from include/linux/blkdev.h to
  block/blk.h.
* blk_do_ordered() now returns a pointer to struct request, with %NULL
  meaning "try the next request" and ERR_PTR(-EAGAIN) meaning "try
  again later".  The ERR_PTR(-EAGAIN) case will be dropped with further
  changes.
* In the initialization of proxy barrier request, data direction is
  already set by init_request_from_bio().  Drop unnecessary explicit
  REQ_WRITE setting and move init_request_from_bio() above REQ_FUA
  flag setting.
* add_request() is collapsed into __make_request().
These changes don't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-barrier.c    |   32 ++++++++++++++------------------
 block/blk-core.c       |   21 ++++-----------------
 block/blk.h            |    7 +++++--
 include/linux/blkdev.h |    1 -
 4 files changed, 23 insertions(+), 38 deletions(-)
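
The new return convention of blk_do_ordered() can be illustrated with
a small userspace sketch.  Everything here is an assumption made for
the illustration: ERR_PTR() and IS_ERR() are minimal imitations of the
kernel helpers, struct request is reduced to a single hypothetical seq
field, and do_ordered()/next_request() are toy stand-ins for
blk_do_ordered() and __elv_next_request().

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal userspace imitations of the kernel's ERR_PTR()/IS_ERR(). */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical request: seq < 0 means "already completed". */
struct request {
	int seq;
};

/*
 * Toy stand-in for blk_do_ordered().  Returns rq to dispatch it, NULL
 * when the request is gone ("try the next request"), or
 * ERR_PTR(-EAGAIN) when it is not this request's turn yet ("try again
 * later").
 */
static struct request *do_ordered(struct request *rq, int cur_seq)
{
	if (rq->seq < 0)
		return NULL;
	if (rq->seq > cur_seq)
		return ERR_PTR(-EAGAIN);
	return rq;
}

/* Caller side, following the shape of the __elv_next_request() hunk. */
static struct request *next_request(struct request **queue, size_t n,
				    int cur_seq)
{
	for (size_t i = 0; i < n; i++) {
		struct request *rq = do_ordered(queue[i], cur_seq);

		/* Non-NULL ends the scan; an error pointer means "nothing now". */
		if (rq)
			return !IS_ERR(rq) ? rq : NULL;
	}

	return NULL;
}
```

Folding the old bool-plus-out-parameter pair into one pointer keeps
all three outcomes distinguishable at the call site without a second
return channel.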
diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index ed0aba5..f1be85b 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -110,9 +110,9 @@ static void queue_flush(struct request_queue *q, unsigned which)
 	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
 }
 
-static inline bool start_ordered(struct request_queue *q, struct request **rqp)
+static inline struct request *start_ordered(struct request_queue *q,
+					    struct request *rq)
 {
-	struct request *rq = *rqp;
 	unsigned skip = 0;
 
 	q->orderr = 0;
@@ -149,11 +149,9 @@ static inline bool start_ordered(struct request_queue *q, struct request **rqp)
 
 		/* initialize proxy request and queue it */
 		blk_rq_init(q, rq);
-		if (bio_data_dir(q->orig_bar_rq->bio) == WRITE)
-			rq->cmd_flags |= REQ_WRITE;
+		init_request_from_bio(rq, q->orig_bar_rq->bio);
 		if (q->ordered & QUEUE_ORDERED_DO_FUA)
 			rq->cmd_flags |= REQ_FUA;
-		init_request_from_bio(rq, q->orig_bar_rq->bio);
 		rq->end_io = bar_end_io;
 
 		elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
@@ -171,27 +169,26 @@ static inline bool start_ordered(struct request_queue *q, struct request **rqp)
 	else
 		skip |= QUEUE_ORDSEQ_DRAIN;
 
-	*rqp = rq;
-
 	/*
 	 * Complete skipped sequences.  If whole sequence is complete,
-	 * return false to tell elevator that this request is gone.
+	 * return %NULL to tell elevator that this request is gone.
 	 */
-	return !blk_ordered_complete_seq(q, skip, 0);
+	if (blk_ordered_complete_seq(q, skip, 0))
+		rq = NULL;
+	return rq;
 }
 
-bool blk_do_ordered(struct request_queue *q, struct request **rqp)
+struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
 {
-	struct request *rq = *rqp;
 	const int is_barrier = rq->cmd_type == REQ_TYPE_FS &&
 				(rq->cmd_flags & REQ_HARDBARRIER);
 
 	if (!q->ordseq) {
 		if (!is_barrier)
-			return true;
+			return rq;
 
 		if (q->next_ordered != QUEUE_ORDERED_NONE)
-			return start_ordered(q, rqp);
+			return start_ordered(q, rq);
 		else {
 			/*
 			 * Queue ordering not supported.  Terminate
@@ -199,8 +196,7 @@ bool blk_do_ordered(struct request_queue *q, struct request **rqp)
 			 */
 			blk_dequeue_request(rq);
 			__blk_end_request_all(rq, -EOPNOTSUPP);
-			*rqp = NULL;
-			return false;
+			return NULL;
 		}
 	}
 
@@ -211,14 +207,14 @@ bool blk_do_ordered(struct request_queue *q, struct request **rqp)
 	/* Special requests are not subject to ordering rules. */
 	if (rq->cmd_type != REQ_TYPE_FS &&
 	    rq != &q->pre_flush_rq && rq != &q->post_flush_rq)
-		return true;
+		return rq;
 
 	/* Ordered by draining.  Wait for turn. */
 	WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q));
 	if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q))
-		*rqp = NULL;
+		rq = ERR_PTR(-EAGAIN);
 
-	return true;
+	return rq;
 }
 
 static void bio_end_empty_barrier(struct bio *bio, int err)
diff --git a/block/blk-core.c b/block/blk-core.c
index 3f802dd..ed8ef89 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1037,22 +1037,6 @@ void blk_insert_request(struct request_queue *q, struct request *rq,
 }
 EXPORT_SYMBOL(blk_insert_request);
 
-/*
- * add-request adds a request to the linked list.
- * queue lock is held and interrupts disabled, as we muck with the
- * request queue list.
- */
-static inline void add_request(struct request_queue *q, struct request *req)
-{
-	drive_stat_acct(req, 1);
-
-	/*
-	 * elevator indicated where it wants this request to be
-	 * inserted at elevator_merge time
-	 */
-	__elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);
-}
-
 static void part_round_stats_single(int cpu, struct hd_struct *part,
 				    unsigned long now)
 {
@@ -1316,7 +1300,10 @@ get_rq:
 		req->cpu = blk_cpu_to_group(smp_processor_id());
 	if (queue_should_plug(q) && elv_queue_empty(q))
 		blk_plug_device(q);
-	add_request(q, req);
+
+	/* insert the request into the elevator */
+	drive_stat_acct(req, 1);
+	__elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);
 out:
 	if (unplug || !queue_should_plug(q))
 		__generic_unplug_device(q);
diff --git a/block/blk.h b/block/blk.h
index 6e7dc87..874eb4e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -51,6 +51,8 @@ static inline void blk_clear_rq_complete(struct request *rq)
  */
 #define ELV_ON_HASH(rq)		(!hlist_unhashed(&(rq)->hash))
 
+struct request *blk_do_ordered(struct request_queue *q, struct request *rq);
+
 static inline struct request *__elv_next_request(struct request_queue *q)
 {
 	struct request *rq;
@@ -58,8 +60,9 @@ static inline struct request *__elv_next_request(struct request_queue *q)
 	while (1) {
 		while (!list_empty(&q->queue_head)) {
 			rq = list_entry_rq(q->queue_head.next);
-			if (blk_do_ordered(q, &rq))
-				return rq;
+			rq = blk_do_ordered(q, rq);
+			if (rq)
+				return !IS_ERR(rq) ? rq : NULL;
 		}
 
 		if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6003f7c..21baa19 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -867,7 +867,6 @@ extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
 extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
 extern void blk_queue_flush(struct request_queue *q, unsigned int flush);
 extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
-extern bool blk_do_ordered(struct request_queue *, struct request **);
 extern unsigned blk_ordered_cur_seq(struct request_queue *);
 extern unsigned blk_ordered_req_seq(struct request *);
 extern bool blk_ordered_complete_seq(struct request_queue *, unsigned, int);
-- 
1.7.1
- * [PATCH 06/11] block: drop barrier ordering by queue draining
  2010-08-12 12:41 [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Tejun Heo
                   ` (4 preceding siblings ...)
  2010-08-12 12:41 ` [PATCH 05/11] block: misc cleanups in barrier code Tejun Heo
@ 2010-08-12 12:41 ` Tejun Heo
  2010-08-12 12:41 ` [PATCH 07/11] block: rename blk-barrier.c to blk-flush.c Tejun Heo
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2010-08-12 12:41 UTC (permalink / raw)
  To: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.
  Cc: Tejun Heo, Christoph Hellwig
Filesystems will take all the responsibilities for ordering requests
around commit writes and will only indicate how the commit writes
themselves should be handled by the block layer.  This patch drops
barrier ordering by queue draining from the block layer.  The ordering
by draining implementation was somewhat invasive to request handling.
A list of notable changes follows.
* Each queue had a 1-bit color which was flipped on each barrier
  issue.  This was used to track whether a given request was issued
  before the current barrier or not.  The REQ_ORDERED_COLOR flag and
  the coloring implementation in __elv_add_request() are removed.
* Requests which shouldn't be processed yet for draining were stalled
  by returning -EAGAIN from blk_do_ordered() according to the test
  result between blk_ordered_req_seq() and blk_ordered_cur_seq().
  This logic is removed.
* Draining completion logic in elv_completed_request() is removed.
* All barrier sequence requests were queued to the request queue and
  then trickled to the lower layer according to progress, so
  maintaining request order during requeue was necessary.  This is replaced by
  queueing the next request in the barrier sequence only after the
  current one is complete from blk_ordered_complete_seq(), which
  removes the need for multiple proxy requests in struct request_queue
  and the request sorting logic in the ELEVATOR_INSERT_REQUEUE path of
  elv_insert().
* As barriers no longer have ordering constraints, there's no need to
  dump the whole elevator onto the dispatch queue on each barrier.
  Insert barriers at the front instead.
* If other barrier requests come to the front of the dispatch queue
  while one is already in progress, they are stored in
  q->pending_barriers and restored to dispatch queue one-by-one after
  each barrier completion from blk_ordered_complete_seq().
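
The sequencing described above can be pictured with a small user-space
model.  This is only a sketch, not the kernel code: the toy_* names,
struct toy_queue and MAX_PENDING are made up for illustration, and only
the bit layout follows the QUEUE_ORDSEQ_* values in this patch.  Each
completed step sets its bit, the lowest clear bit is the step to issue
next, and a barrier arriving while a sequence is running is parked.

```c
#include <assert.h>
#include <stddef.h>

/* Sequence bits, mirroring the QUEUE_ORDSEQ_* values in this patch. */
enum {
	SEQ_STARTED   = 1 << 0,
	SEQ_PREFLUSH  = 1 << 1,
	SEQ_BAR       = 1 << 2,
	SEQ_POSTFLUSH = 1 << 3,
	SEQ_DONE      = 1 << 4,
};

#define MAX_PENDING 8	/* hypothetical bound, for the toy model only */

struct toy_queue {
	unsigned ordseq;		/* completed steps, 0 when idle */
	int pending[MAX_PENDING];	/* ids of parked barriers */
	size_t npending;
	int completed;			/* barriers fully completed */
};

/* Lowest step not yet completed; same idea as 1 << ffz(q->ordseq). */
static unsigned toy_cur_seq(const struct toy_queue *q)
{
	unsigned bit = 1;

	if (!q->ordseq)
		return 0;
	while (q->ordseq & bit)
		bit <<= 1;
	return bit;
}

/*
 * Mark @seq as complete.  Returns the step to issue next, or 0 once
 * the whole sequence is done and the queue is idle again.
 */
static unsigned toy_complete_seq(struct toy_queue *q, unsigned seq)
{
	assert(!(q->ordseq & seq));	/* a step completes only once */
	q->ordseq |= seq;

	if (toy_cur_seq(q) != SEQ_DONE)
		return toy_cur_seq(q);

	q->ordseq = 0;
	q->completed++;
	return 0;
}

/*
 * Start a barrier, skipping the steps in @skip.  Returns the first
 * step to issue, or 0 if the barrier was parked or needed no steps.
 */
static unsigned toy_start_barrier(struct toy_queue *q, int id, unsigned skip)
{
	if (q->ordseq) {
		/* one sequence at a time: park the newcomer */
		assert(q->npending < MAX_PENDING);
		q->pending[q->npending++] = id;
		return 0;
	}
	q->ordseq |= SEQ_STARTED;
	return toy_complete_seq(q, skip);
}
```

In this model an empty barrier corresponds to passing
SEQ_BAR | SEQ_POSTFLUSH as @skip, so completing the pre-flush finishes
the whole sequence.  The model only records parked barriers; the patch
itself moves the first one back to the dispatch queue when a sequence
completes.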
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
---
 block/blk-barrier.c       |  220 ++++++++++++++++++---------------------------
 block/blk-core.c          |   11 ++-
 block/blk.h               |    2 +-
 block/elevator.c          |   79 ++--------------
 include/linux/blk_types.h |    2 -
 include/linux/blkdev.h    |   19 ++---
 6 files changed, 113 insertions(+), 220 deletions(-)
diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index f1be85b..e8b2e5c 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -9,6 +9,8 @@
 
 #include "blk.h"
 
+static struct request *queue_next_ordseq(struct request_queue *q);
+
 /*
  * Cache flushing for ordered writes handling
  */
@@ -19,38 +21,10 @@ unsigned blk_ordered_cur_seq(struct request_queue *q)
 	return 1 << ffz(q->ordseq);
 }
 
-unsigned blk_ordered_req_seq(struct request *rq)
-{
-	struct request_queue *q = rq->q;
-
-	BUG_ON(q->ordseq == 0);
-
-	if (rq == &q->pre_flush_rq)
-		return QUEUE_ORDSEQ_PREFLUSH;
-	if (rq == &q->bar_rq)
-		return QUEUE_ORDSEQ_BAR;
-	if (rq == &q->post_flush_rq)
-		return QUEUE_ORDSEQ_POSTFLUSH;
-
-	/*
-	 * !fs requests don't need to follow barrier ordering.  Always
-	 * put them at the front.  This fixes the following deadlock.
-	 *
-	 * http://thread.gmane.org/gmane.linux.kernel/537473
-	 */
-	if (rq->cmd_type != REQ_TYPE_FS)
-		return QUEUE_ORDSEQ_DRAIN;
-
-	if ((rq->cmd_flags & REQ_ORDERED_COLOR) ==
-	    (q->orig_bar_rq->cmd_flags & REQ_ORDERED_COLOR))
-		return QUEUE_ORDSEQ_DRAIN;
-	else
-		return QUEUE_ORDSEQ_DONE;
-}
-
-bool blk_ordered_complete_seq(struct request_queue *q, unsigned seq, int error)
+static struct request *blk_ordered_complete_seq(struct request_queue *q,
+						unsigned seq, int error)
 {
-	struct request *rq;
+	struct request *next_rq = NULL;
 
 	if (error && !q->orderr)
 		q->orderr = error;
@@ -58,16 +32,22 @@ bool blk_ordered_complete_seq(struct request_queue *q, unsigned seq, int error)
 	BUG_ON(q->ordseq & seq);
 	q->ordseq |= seq;
 
-	if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE)
-		return false;
-
-	/*
-	 * Okay, sequence complete.
-	 */
-	q->ordseq = 0;
-	rq = q->orig_bar_rq;
-	__blk_end_request_all(rq, q->orderr);
-	return true;
+	if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) {
+		/* not complete yet, queue the next ordered sequence */
+		next_rq = queue_next_ordseq(q);
+	} else {
+		/* complete this barrier request */
+		__blk_end_request_all(q->orig_bar_rq, q->orderr);
+		q->orig_bar_rq = NULL;
+		q->ordseq = 0;
+
+		/* dispatch the next barrier if there's one */
+		if (!list_empty(&q->pending_barriers)) {
+			next_rq = list_entry_rq(q->pending_barriers.next);
+			list_move(&next_rq->queuelist, &q->queue_head);
+		}
+	}
+	return next_rq;
 }
 
 static void pre_flush_end_io(struct request *rq, int error)
@@ -88,133 +68,105 @@ static void post_flush_end_io(struct request *rq, int error)
 	blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error);
 }
 
-static void queue_flush(struct request_queue *q, unsigned which)
+static void queue_flush(struct request_queue *q, struct request *rq,
+			rq_end_io_fn *end_io)
 {
-	struct request *rq;
-	rq_end_io_fn *end_io;
-
-	if (which == QUEUE_ORDERED_DO_PREFLUSH) {
-		rq = &q->pre_flush_rq;
-		end_io = pre_flush_end_io;
-	} else {
-		rq = &q->post_flush_rq;
-		end_io = post_flush_end_io;
-	}
-
 	blk_rq_init(q, rq);
 	rq->cmd_type = REQ_TYPE_FS;
-	rq->cmd_flags = REQ_HARDBARRIER | REQ_FLUSH;
+	rq->cmd_flags = REQ_FLUSH;
 	rq->rq_disk = q->orig_bar_rq->rq_disk;
 	rq->end_io = end_io;
 
 	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
 }
 
-static inline struct request *start_ordered(struct request_queue *q,
-					    struct request *rq)
+static struct request *queue_next_ordseq(struct request_queue *q)
 {
-	unsigned skip = 0;
-
-	q->orderr = 0;
-	q->ordered = q->next_ordered;
-	q->ordseq |= QUEUE_ORDSEQ_STARTED;
-
-	/*
-	 * For an empty barrier, there's no actual BAR request, which
-	 * in turn makes POSTFLUSH unnecessary.  Mask them off.
-	 */
-	if (!blk_rq_sectors(rq))
-		q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
-				QUEUE_ORDERED_DO_POSTFLUSH);
-
-	/* stash away the original request */
-	blk_dequeue_request(rq);
-	q->orig_bar_rq = rq;
-	rq = NULL;
-
-	/*
-	 * Queue ordered sequence.  As we stack them at the head, we
-	 * need to queue in reverse order.  Note that we rely on that
-	 * no fs request uses ELEVATOR_INSERT_FRONT and thus no fs
-	 * request gets inbetween ordered sequence.
-	 */
-	if (q->ordered & QUEUE_ORDERED_DO_POSTFLUSH) {
-		queue_flush(q, QUEUE_ORDERED_DO_POSTFLUSH);
-		rq = &q->post_flush_rq;
-	} else
-		skip |= QUEUE_ORDSEQ_POSTFLUSH;
+	struct request *rq = &q->bar_rq;
 
-	if (q->ordered & QUEUE_ORDERED_DO_BAR) {
-		rq = &q->bar_rq;
+	switch (blk_ordered_cur_seq(q)) {
+	case QUEUE_ORDSEQ_PREFLUSH:
+		queue_flush(q, rq, pre_flush_end_io);
+		break;
 
+	case QUEUE_ORDSEQ_BAR:
 		/* initialize proxy request and queue it */
 		blk_rq_init(q, rq);
 		init_request_from_bio(rq, q->orig_bar_rq->bio);
+		rq->cmd_flags &= ~REQ_HARDBARRIER;
 		if (q->ordered & QUEUE_ORDERED_DO_FUA)
 			rq->cmd_flags |= REQ_FUA;
 		rq->end_io = bar_end_io;
 
 		elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
-	} else
-		skip |= QUEUE_ORDSEQ_BAR;
+		break;
 
-	if (q->ordered & QUEUE_ORDERED_DO_PREFLUSH) {
-		queue_flush(q, QUEUE_ORDERED_DO_PREFLUSH);
-		rq = &q->pre_flush_rq;
-	} else
-		skip |= QUEUE_ORDSEQ_PREFLUSH;
+	case QUEUE_ORDSEQ_POSTFLUSH:
+		queue_flush(q, rq, post_flush_end_io);
+		break;
 
-	if (queue_in_flight(q))
-		rq = NULL;
-	else
-		skip |= QUEUE_ORDSEQ_DRAIN;
-
-	/*
-	 * Complete skipped sequences.  If whole sequence is complete,
-	 * return %NULL to tell elevator that this request is gone.
-	 */
-	if (blk_ordered_complete_seq(q, skip, 0))
-		rq = NULL;
+	default:
+		BUG();
+	}
 	return rq;
 }
 
 struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
 {
-	const int is_barrier = rq->cmd_type == REQ_TYPE_FS &&
-				(rq->cmd_flags & REQ_HARDBARRIER);
-
-	if (!q->ordseq) {
-		if (!is_barrier)
-			return rq;
-
-		if (q->next_ordered != QUEUE_ORDERED_NONE)
-			return start_ordered(q, rq);
-		else {
-			/*
-			 * Queue ordering not supported.  Terminate
-			 * with prejudice.
-			 */
-			blk_dequeue_request(rq);
-			__blk_end_request_all(rq, -EOPNOTSUPP);
-			return NULL;
-		}
+	unsigned skip = 0;
+
+	if (!(rq->cmd_flags & REQ_HARDBARRIER))
+		return rq;
+
+	if (q->ordseq) {
+		/*
+		 * Barrier is already in progress and they can't be
+		 * processed in parallel.  Queue for later processing.
+		 */
+		list_move_tail(&rq->queuelist, &q->pending_barriers);
+		return NULL;
+	}
+
+	if (unlikely(q->next_ordered == QUEUE_ORDERED_NONE)) {
+		/*
+		 * Queue ordering not supported.  Terminate
+		 * with prejudice.
+		 */
+		blk_dequeue_request(rq);
+		__blk_end_request_all(rq, -EOPNOTSUPP);
+		return NULL;
 	}
 
 	/*
-	 * Ordered sequence in progress
+	 * Start a new ordered sequence
 	 */
+	q->orderr = 0;
+	q->ordered = q->next_ordered;
+	q->ordseq |= QUEUE_ORDSEQ_STARTED;
 
-	/* Special requests are not subject to ordering rules. */
-	if (rq->cmd_type != REQ_TYPE_FS &&
-	    rq != &q->pre_flush_rq && rq != &q->post_flush_rq)
-		return rq;
+	/*
+	 * For an empty barrier, there's no actual BAR request, which
+	 * in turn makes POSTFLUSH unnecessary.  Mask them off.
+	 */
+	if (!blk_rq_sectors(rq))
+		q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
+				QUEUE_ORDERED_DO_POSTFLUSH);
 
-	/* Ordered by draining.  Wait for turn. */
-	WARN_ON(blk_ordered_req_seq(rq) < blk_ordered_cur_seq(q));
-	if (blk_ordered_req_seq(rq) > blk_ordered_cur_seq(q))
-		rq = ERR_PTR(-EAGAIN);
+	/* stash away the original request */
+	blk_dequeue_request(rq);
+	q->orig_bar_rq = rq;
 
-	return rq;
+	if (!(q->ordered & QUEUE_ORDERED_DO_PREFLUSH))
+		skip |= QUEUE_ORDSEQ_PREFLUSH;
+
+	if (!(q->ordered & QUEUE_ORDERED_DO_BAR))
+		skip |= QUEUE_ORDSEQ_BAR;
+
+	if (!(q->ordered & QUEUE_ORDERED_DO_POSTFLUSH))
+		skip |= QUEUE_ORDSEQ_POSTFLUSH;
+
+	/* complete skipped sequences and return the first sequence */
+	return blk_ordered_complete_seq(q, skip, 0);
 }
 
 static void bio_end_empty_barrier(struct bio *bio, int err)
diff --git a/block/blk-core.c b/block/blk-core.c
index ed8ef89..82bd6d9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -520,6 +520,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	init_timer(&q->unplug_timer);
 	setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
 	INIT_LIST_HEAD(&q->timeout_list);
+	INIT_LIST_HEAD(&q->pending_barriers);
 	INIT_WORK(&q->unplug_work, blk_unplug_work);
 
 	kobject_init(&q->kobj, &blk_queue_ktype);
@@ -1185,6 +1186,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
 	const bool sync = (bio->bi_rw & REQ_SYNC);
 	const bool unplug = (bio->bi_rw & REQ_UNPLUG);
 	const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK;
+	int where = ELEVATOR_INSERT_SORT;
 	int rw_flags;
 
 	/* REQ_HARDBARRIER is no more */
@@ -1203,7 +1205,12 @@ static int __make_request(struct request_queue *q, struct bio *bio)
 
 	spin_lock_irq(q->queue_lock);
 
-	if (unlikely((bio->bi_rw & REQ_HARDBARRIER)) || elv_queue_empty(q))
+	if (bio->bi_rw & REQ_HARDBARRIER) {
+		where = ELEVATOR_INSERT_FRONT;
+		goto get_rq;
+	}
+
+	if (elv_queue_empty(q))
 		goto get_rq;
 
 	el_ret = elv_merge(q, &req, bio);
@@ -1303,7 +1310,7 @@ get_rq:
 
 	/* insert the request into the elevator */
 	drive_stat_acct(req, 1);
-	__elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);
+	__elv_add_request(q, req, where, 0);
 out:
 	if (unplug || !queue_should_plug(q))
 		__generic_unplug_device(q);
diff --git a/block/blk.h b/block/blk.h
index 874eb4e..08081e4 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -62,7 +62,7 @@ static inline struct request *__elv_next_request(struct request_queue *q)
 			rq = list_entry_rq(q->queue_head.next);
 			rq = blk_do_ordered(q, rq);
 			if (rq)
-				return !IS_ERR(rq) ? rq : NULL;
+				return rq;
 		}
 
 		if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
diff --git a/block/elevator.c b/block/elevator.c
index 816a7c8..22c46b5 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -611,8 +611,6 @@ void elv_quiesce_end(struct request_queue *q)
 
 void elv_insert(struct request_queue *q, struct request *rq, int where)
 {
-	struct list_head *pos;
-	unsigned ordseq;
 	int unplug_it = 1;
 
 	trace_block_rq_insert(q, rq);
@@ -620,9 +618,16 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 	rq->q = q;
 
 	switch (where) {
+	case ELEVATOR_INSERT_REQUEUE:
+		/*
+		 * Most requeues happen because of a busy condition,
+		 * don't force unplug of the queue for that case.
+		 * Clear unplug_it and fall through.
+		 */
+		unplug_it = 0;
+
 	case ELEVATOR_INSERT_FRONT:
 		rq->cmd_flags |= REQ_SOFTBARRIER;
-
 		list_add(&rq->queuelist, &q->queue_head);
 		break;
 
@@ -662,36 +667,6 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 		q->elevator->ops->elevator_add_req_fn(q, rq);
 		break;
 
-	case ELEVATOR_INSERT_REQUEUE:
-		/*
-		 * If ordered flush isn't in progress, we do front
-		 * insertion; otherwise, requests should be requeued
-		 * in ordseq order.
-		 */
-		rq->cmd_flags |= REQ_SOFTBARRIER;
-
-		/*
-		 * Most requeues happen because of a busy condition,
-		 * don't force unplug of the queue for that case.
-		 */
-		unplug_it = 0;
-
-		if (q->ordseq == 0) {
-			list_add(&rq->queuelist, &q->queue_head);
-			break;
-		}
-
-		ordseq = blk_ordered_req_seq(rq);
-
-		list_for_each(pos, &q->queue_head) {
-			struct request *pos_rq = list_entry_rq(pos);
-			if (ordseq <= blk_ordered_req_seq(pos_rq))
-				break;
-		}
-
-		list_add_tail(&rq->queuelist, pos);
-		break;
-
 	default:
 		printk(KERN_ERR "%s: bad insertion point %d\n",
 		       __func__, where);
@@ -710,26 +685,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 void __elv_add_request(struct request_queue *q, struct request *rq, int where,
 		       int plug)
 {
-	if (q->ordcolor)
-		rq->cmd_flags |= REQ_ORDERED_COLOR;
-
 	if (rq->cmd_flags & (REQ_SOFTBARRIER | REQ_HARDBARRIER)) {
-		/*
-		 * toggle ordered color
-		 */
-		if (rq->cmd_flags & REQ_HARDBARRIER)
-			q->ordcolor ^= 1;
-
-		/*
-		 * barriers implicitly indicate back insertion
-		 */
-		if (where == ELEVATOR_INSERT_SORT)
-			where = ELEVATOR_INSERT_BACK;
-
-		/*
-		 * this request is scheduling boundary, update
-		 * end_sector
-		 */
+		/* barriers are scheduling boundary, update end_sector */
 		if (rq->cmd_type == REQ_TYPE_FS ||
 		    (rq->cmd_flags & REQ_DISCARD)) {
 			q->end_sector = rq_end_sector(rq);
@@ -849,24 +806,6 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
 		    e->ops->elevator_completed_req_fn)
 			e->ops->elevator_completed_req_fn(q, rq);
 	}
-
-	/*
-	 * Check if the queue is waiting for fs requests to be
-	 * drained for flush sequence.
-	 */
-	if (unlikely(q->ordseq)) {
-		struct request *next = NULL;
-
-		if (!list_empty(&q->queue_head))
-			next = list_entry_rq(q->queue_head.next);
-
-		if (!queue_in_flight(q) &&
-		    blk_ordered_cur_seq(q) == QUEUE_ORDSEQ_DRAIN &&
-		    (!next || blk_ordered_req_seq(next) > QUEUE_ORDSEQ_DRAIN)) {
-			blk_ordered_complete_seq(q, QUEUE_ORDSEQ_DRAIN, 0);
-			__blk_run_queue(q);
-		}
-	}
 }
 
 #define to_elv(atr) container_of((atr), struct elv_fs_entry, attr)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1185237..8e9887d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -141,7 +141,6 @@ enum rq_flag_bits {
 	__REQ_FAILED,		/* set if the request failed */
 	__REQ_QUIET,		/* don't worry about errors */
 	__REQ_PREEMPT,		/* set for "ide_preempt" requests */
-	__REQ_ORDERED_COLOR,	/* is before or after barrier */
 	__REQ_ALLOCED,		/* request came from our alloc pool */
 	__REQ_COPY_USER,	/* contains copies of user pages */
 	__REQ_INTEGRITY,	/* integrity metadata has been remapped */
@@ -181,7 +180,6 @@ enum rq_flag_bits {
 #define REQ_FAILED		(1 << __REQ_FAILED)
 #define REQ_QUIET		(1 << __REQ_QUIET)
 #define REQ_PREEMPT		(1 << __REQ_PREEMPT)
-#define REQ_ORDERED_COLOR	(1 << __REQ_ORDERED_COLOR)
 #define REQ_ALLOCED		(1 << __REQ_ALLOCED)
 #define REQ_COPY_USER		(1 << __REQ_COPY_USER)
 #define REQ_INTEGRITY		(1 << __REQ_INTEGRITY)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 21baa19..522ecda 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -360,9 +360,10 @@ struct request_queue
 	unsigned int		flush_flags;
 
 	unsigned int		ordered, next_ordered, ordseq;
-	int			orderr, ordcolor;
-	struct request		pre_flush_rq, bar_rq, post_flush_rq;
+	int			orderr;
+	struct request		bar_rq;
 	struct request		*orig_bar_rq;
+	struct list_head	pending_barriers;
 
 	struct mutex		sysfs_lock;
 
@@ -490,12 +491,11 @@ enum {
 	/*
 	 * Ordered operation sequence
 	 */
-	QUEUE_ORDSEQ_STARTED	= 0x01,	/* flushing in progress */
-	QUEUE_ORDSEQ_DRAIN	= 0x02,	/* waiting for the queue to be drained */
-	QUEUE_ORDSEQ_PREFLUSH	= 0x04,	/* pre-flushing in progress */
-	QUEUE_ORDSEQ_BAR	= 0x08,	/* original barrier req in progress */
-	QUEUE_ORDSEQ_POSTFLUSH	= 0x10,	/* post-flushing in progress */
-	QUEUE_ORDSEQ_DONE	= 0x20,
+	QUEUE_ORDSEQ_STARTED	= (1 << 0), /* flushing in progress */
+	QUEUE_ORDSEQ_PREFLUSH	= (1 << 1), /* pre-flushing in progress */
+	QUEUE_ORDSEQ_BAR	= (1 << 2), /* barrier write in progress */
+	QUEUE_ORDSEQ_POSTFLUSH	= (1 << 3), /* post-flushing in progress */
+	QUEUE_ORDSEQ_DONE	= (1 << 4),
 };
 
 #define blk_queue_plugged(q)	test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags)
@@ -867,9 +867,6 @@ extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
 extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
 extern void blk_queue_flush(struct request_queue *q, unsigned int flush);
 extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
-extern unsigned blk_ordered_cur_seq(struct request_queue *);
-extern unsigned blk_ordered_req_seq(struct request *);
-extern bool blk_ordered_complete_seq(struct request_queue *, unsigned, int);
 
 extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
 extern void blk_dump_rq_flags(struct request *, char *);
-- 
1.7.1
- * [PATCH 07/11] block: rename blk-barrier.c to blk-flush.c
  2010-08-12 12:41 [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Tejun Heo
                   ` (5 preceding siblings ...)
  2010-08-12 12:41 ` [PATCH 06/11] block: drop barrier ordering by queue draining Tejun Heo
@ 2010-08-12 12:41 ` Tejun Heo
  2010-08-12 12:41 ` [PATCH 08/11] block: rename barrier/ordered to flush Tejun Heo
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2010-08-12 12:41 UTC (permalink / raw)
  To: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.
  Cc: Tejun Heo, Christoph Hellwig
Without ordering requirements, barrier and ordering are misnomers.
Rename block/blk-barrier.c to block/blk-flush.c.  Rename of symbols
will follow.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
---
 block/Makefile      |    2 +-
 block/blk-barrier.c |  248 ---------------------------------------------------
 block/blk-flush.c   |  248 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 249 insertions(+), 249 deletions(-)
 delete mode 100644 block/blk-barrier.c
 create mode 100644 block/blk-flush.c
diff --git a/block/Makefile b/block/Makefile
index 0bb499a..f627e4b 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -3,7 +3,7 @@
 #
 
 obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
-			blk-barrier.o blk-settings.o blk-ioc.o blk-map.o \
+			blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
 			blk-iopoll.o blk-lib.o ioctl.o genhd.o scsi_ioctl.o
 
diff --git a/block/blk-barrier.c b/block/blk-barrier.c
deleted file mode 100644
index e8b2e5c..0000000
--- a/block/blk-barrier.c
+++ /dev/null
@@ -1,248 +0,0 @@
-/*
- * Functions related to barrier IO handling
- */
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/bio.h>
-#include <linux/blkdev.h>
-#include <linux/gfp.h>
-
-#include "blk.h"
-
-static struct request *queue_next_ordseq(struct request_queue *q);
-
-/*
- * Cache flushing for ordered writes handling
- */
-unsigned blk_ordered_cur_seq(struct request_queue *q)
-{
-	if (!q->ordseq)
-		return 0;
-	return 1 << ffz(q->ordseq);
-}
-
-static struct request *blk_ordered_complete_seq(struct request_queue *q,
-						unsigned seq, int error)
-{
-	struct request *next_rq = NULL;
-
-	if (error && !q->orderr)
-		q->orderr = error;
-
-	BUG_ON(q->ordseq & seq);
-	q->ordseq |= seq;
-
-	if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) {
-		/* not complete yet, queue the next ordered sequence */
-		next_rq = queue_next_ordseq(q);
-	} else {
-		/* complete this barrier request */
-		__blk_end_request_all(q->orig_bar_rq, q->orderr);
-		q->orig_bar_rq = NULL;
-		q->ordseq = 0;
-
-		/* dispatch the next barrier if there's one */
-		if (!list_empty(&q->pending_barriers)) {
-			next_rq = list_entry_rq(q->pending_barriers.next);
-			list_move(&next_rq->queuelist, &q->queue_head);
-		}
-	}
-	return next_rq;
-}
-
-static void pre_flush_end_io(struct request *rq, int error)
-{
-	elv_completed_request(rq->q, rq);
-	blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_PREFLUSH, error);
-}
-
-static void bar_end_io(struct request *rq, int error)
-{
-	elv_completed_request(rq->q, rq);
-	blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_BAR, error);
-}
-
-static void post_flush_end_io(struct request *rq, int error)
-{
-	elv_completed_request(rq->q, rq);
-	blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error);
-}
-
-static void queue_flush(struct request_queue *q, struct request *rq,
-			rq_end_io_fn *end_io)
-{
-	blk_rq_init(q, rq);
-	rq->cmd_type = REQ_TYPE_FS;
-	rq->cmd_flags = REQ_FLUSH;
-	rq->rq_disk = q->orig_bar_rq->rq_disk;
-	rq->end_io = end_io;
-
-	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
-}
-
-static struct request *queue_next_ordseq(struct request_queue *q)
-{
-	struct request *rq = &q->bar_rq;
-
-	switch (blk_ordered_cur_seq(q)) {
-	case QUEUE_ORDSEQ_PREFLUSH:
-		queue_flush(q, rq, pre_flush_end_io);
-		break;
-
-	case QUEUE_ORDSEQ_BAR:
-		/* initialize proxy request and queue it */
-		blk_rq_init(q, rq);
-		init_request_from_bio(rq, q->orig_bar_rq->bio);
-		rq->cmd_flags &= ~REQ_HARDBARRIER;
-		if (q->ordered & QUEUE_ORDERED_DO_FUA)
-			rq->cmd_flags |= REQ_FUA;
-		rq->end_io = bar_end_io;
-
-		elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
-		break;
-
-	case QUEUE_ORDSEQ_POSTFLUSH:
-		queue_flush(q, rq, post_flush_end_io);
-		break;
-
-	default:
-		BUG();
-	}
-	return rq;
-}
-
-struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
-{
-	unsigned skip = 0;
-
-	if (!(rq->cmd_flags & REQ_HARDBARRIER))
-		return rq;
-
-	if (q->ordseq) {
-		/*
-		 * Barrier is already in progress and they can't be
-		 * processed in parallel.  Queue for later processing.
-		 */
-		list_move_tail(&rq->queuelist, &q->pending_barriers);
-		return NULL;
-	}
-
-	if (unlikely(q->next_ordered == QUEUE_ORDERED_NONE)) {
-		/*
-		 * Queue ordering not supported.  Terminate
-		 * with prejudice.
-		 */
-		blk_dequeue_request(rq);
-		__blk_end_request_all(rq, -EOPNOTSUPP);
-		return NULL;
-	}
-
-	/*
-	 * Start a new ordered sequence
-	 */
-	q->orderr = 0;
-	q->ordered = q->next_ordered;
-	q->ordseq |= QUEUE_ORDSEQ_STARTED;
-
-	/*
-	 * For an empty barrier, there's no actual BAR request, which
-	 * in turn makes POSTFLUSH unnecessary.  Mask them off.
-	 */
-	if (!blk_rq_sectors(rq))
-		q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
-				QUEUE_ORDERED_DO_POSTFLUSH);
-
-	/* stash away the original request */
-	blk_dequeue_request(rq);
-	q->orig_bar_rq = rq;
-
-	if (!(q->ordered & QUEUE_ORDERED_DO_PREFLUSH))
-		skip |= QUEUE_ORDSEQ_PREFLUSH;
-
-	if (!(q->ordered & QUEUE_ORDERED_DO_BAR))
-		skip |= QUEUE_ORDSEQ_BAR;
-
-	if (!(q->ordered & QUEUE_ORDERED_DO_POSTFLUSH))
-		skip |= QUEUE_ORDSEQ_POSTFLUSH;
-
-	/* complete skipped sequences and return the first sequence */
-	return blk_ordered_complete_seq(q, skip, 0);
-}
-
-static void bio_end_empty_barrier(struct bio *bio, int err)
-{
-	if (err) {
-		if (err == -EOPNOTSUPP)
-			set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
-		clear_bit(BIO_UPTODATE, &bio->bi_flags);
-	}
-	if (bio->bi_private)
-		complete(bio->bi_private);
-	bio_put(bio);
-}
-
-/**
- * blkdev_issue_flush - queue a flush
- * @bdev:	blockdev to issue flush for
- * @gfp_mask:	memory allocation flags (for bio_alloc)
- * @error_sector:	error sector
- * @flags:	BLKDEV_IFL_* flags to control behaviour
- *
- * Description:
- *    Issue a flush for the block device in question. Caller can supply
- *    room for storing the error offset in case of a flush error, if they
- *    wish to. If WAIT flag is not passed then caller may check only what
- *    request was pushed in some internal queue for later handling.
- */
-int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
-		sector_t *error_sector, unsigned long flags)
-{
-	DECLARE_COMPLETION_ONSTACK(wait);
-	struct request_queue *q;
-	struct bio *bio;
-	int ret = 0;
-
-	if (bdev->bd_disk == NULL)
-		return -ENXIO;
-
-	q = bdev_get_queue(bdev);
-	if (!q)
-		return -ENXIO;
-
-	/*
-	 * some block devices may not have their queue correctly set up here
-	 * (e.g. loop device without a backing file) and so issuing a flush
-	 * here will panic. Ensure there is a request function before issuing
-	 * the barrier.
-	 */
-	if (!q->make_request_fn)
-		return -ENXIO;
-
-	bio = bio_alloc(gfp_mask, 0);
-	bio->bi_end_io = bio_end_empty_barrier;
-	bio->bi_bdev = bdev;
-	if (test_bit(BLKDEV_WAIT, &flags))
-		bio->bi_private = &wait;
-
-	bio_get(bio);
-	submit_bio(WRITE_BARRIER, bio);
-	if (test_bit(BLKDEV_WAIT, &flags)) {
-		wait_for_completion(&wait);
-		/*
-		 * The driver must store the error location in ->bi_sector, if
-		 * it supports it. For non-stacked drivers, this should be
-		 * copied from blk_rq_pos(rq).
-		 */
-		if (error_sector)
-			*error_sector = bio->bi_sector;
-	}
-
-	if (bio_flagged(bio, BIO_EOPNOTSUPP))
-		ret = -EOPNOTSUPP;
-	else if (!bio_flagged(bio, BIO_UPTODATE))
-		ret = -EIO;
-
-	bio_put(bio);
-	return ret;
-}
-EXPORT_SYMBOL(blkdev_issue_flush);
diff --git a/block/blk-flush.c b/block/blk-flush.c
new file mode 100644
index 0000000..e8b2e5c
--- /dev/null
+++ b/block/blk-flush.c
@@ -0,0 +1,248 @@
+/*
+ * Functions related to barrier IO handling
+ */
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+#include <linux/gfp.h>
+
+#include "blk.h"
+
+static struct request *queue_next_ordseq(struct request_queue *q);
+
+/*
+ * Cache flushing for ordered writes handling
+ */
+unsigned blk_ordered_cur_seq(struct request_queue *q)
+{
+	if (!q->ordseq)
+		return 0;
+	return 1 << ffz(q->ordseq);
+}
+
+static struct request *blk_ordered_complete_seq(struct request_queue *q,
+						unsigned seq, int error)
+{
+	struct request *next_rq = NULL;
+
+	if (error && !q->orderr)
+		q->orderr = error;
+
+	BUG_ON(q->ordseq & seq);
+	q->ordseq |= seq;
+
+	if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) {
+		/* not complete yet, queue the next ordered sequence */
+		next_rq = queue_next_ordseq(q);
+	} else {
+		/* complete this barrier request */
+		__blk_end_request_all(q->orig_bar_rq, q->orderr);
+		q->orig_bar_rq = NULL;
+		q->ordseq = 0;
+
+		/* dispatch the next barrier if there's one */
+		if (!list_empty(&q->pending_barriers)) {
+			next_rq = list_entry_rq(q->pending_barriers.next);
+			list_move(&next_rq->queuelist, &q->queue_head);
+		}
+	}
+	return next_rq;
+}
+
+static void pre_flush_end_io(struct request *rq, int error)
+{
+	elv_completed_request(rq->q, rq);
+	blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_PREFLUSH, error);
+}
+
+static void bar_end_io(struct request *rq, int error)
+{
+	elv_completed_request(rq->q, rq);
+	blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_BAR, error);
+}
+
+static void post_flush_end_io(struct request *rq, int error)
+{
+	elv_completed_request(rq->q, rq);
+	blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error);
+}
+
+static void queue_flush(struct request_queue *q, struct request *rq,
+			rq_end_io_fn *end_io)
+{
+	blk_rq_init(q, rq);
+	rq->cmd_type = REQ_TYPE_FS;
+	rq->cmd_flags = REQ_FLUSH;
+	rq->rq_disk = q->orig_bar_rq->rq_disk;
+	rq->end_io = end_io;
+
+	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
+}
+
+static struct request *queue_next_ordseq(struct request_queue *q)
+{
+	struct request *rq = &q->bar_rq;
+
+	switch (blk_ordered_cur_seq(q)) {
+	case QUEUE_ORDSEQ_PREFLUSH:
+		queue_flush(q, rq, pre_flush_end_io);
+		break;
+
+	case QUEUE_ORDSEQ_BAR:
+		/* initialize proxy request and queue it */
+		blk_rq_init(q, rq);
+		init_request_from_bio(rq, q->orig_bar_rq->bio);
+		rq->cmd_flags &= ~REQ_HARDBARRIER;
+		if (q->ordered & QUEUE_ORDERED_DO_FUA)
+			rq->cmd_flags |= REQ_FUA;
+		rq->end_io = bar_end_io;
+
+		elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
+		break;
+
+	case QUEUE_ORDSEQ_POSTFLUSH:
+		queue_flush(q, rq, post_flush_end_io);
+		break;
+
+	default:
+		BUG();
+	}
+	return rq;
+}
+
+struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
+{
+	unsigned skip = 0;
+
+	if (!(rq->cmd_flags & REQ_HARDBARRIER))
+		return rq;
+
+	if (q->ordseq) {
+		/*
+		 * Barrier is already in progress and they can't be
+		 * processed in parallel.  Queue for later processing.
+		 */
+		list_move_tail(&rq->queuelist, &q->pending_barriers);
+		return NULL;
+	}
+
+	if (unlikely(q->next_ordered == QUEUE_ORDERED_NONE)) {
+		/*
+		 * Queue ordering not supported.  Terminate
+		 * with prejudice.
+		 */
+		blk_dequeue_request(rq);
+		__blk_end_request_all(rq, -EOPNOTSUPP);
+		return NULL;
+	}
+
+	/*
+	 * Start a new ordered sequence
+	 */
+	q->orderr = 0;
+	q->ordered = q->next_ordered;
+	q->ordseq |= QUEUE_ORDSEQ_STARTED;
+
+	/*
+	 * For an empty barrier, there's no actual BAR request, which
+	 * in turn makes POSTFLUSH unnecessary.  Mask them off.
+	 */
+	if (!blk_rq_sectors(rq))
+		q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
+				QUEUE_ORDERED_DO_POSTFLUSH);
+
+	/* stash away the original request */
+	blk_dequeue_request(rq);
+	q->orig_bar_rq = rq;
+
+	if (!(q->ordered & QUEUE_ORDERED_DO_PREFLUSH))
+		skip |= QUEUE_ORDSEQ_PREFLUSH;
+
+	if (!(q->ordered & QUEUE_ORDERED_DO_BAR))
+		skip |= QUEUE_ORDSEQ_BAR;
+
+	if (!(q->ordered & QUEUE_ORDERED_DO_POSTFLUSH))
+		skip |= QUEUE_ORDSEQ_POSTFLUSH;
+
+	/* complete skipped sequences and return the first sequence */
+	return blk_ordered_complete_seq(q, skip, 0);
+}
+
+static void bio_end_empty_barrier(struct bio *bio, int err)
+{
+	if (err) {
+		if (err == -EOPNOTSUPP)
+			set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
+		clear_bit(BIO_UPTODATE, &bio->bi_flags);
+	}
+	if (bio->bi_private)
+		complete(bio->bi_private);
+	bio_put(bio);
+}
+
+/**
+ * blkdev_issue_flush - queue a flush
+ * @bdev:	blockdev to issue flush for
+ * @gfp_mask:	memory allocation flags (for bio_alloc)
+ * @error_sector:	error sector
+ * @flags:	BLKDEV_IFL_* flags to control behaviour
+ *
+ * Description:
+ *    Issue a flush for the block device in question. Caller can supply
+ *    room for storing the error offset in case of a flush error, if they
+ *    wish to. If WAIT flag is not passed then caller may check only what
+ *    request was pushed in some internal queue for later handling.
+ */
+int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
+		sector_t *error_sector, unsigned long flags)
+{
+	DECLARE_COMPLETION_ONSTACK(wait);
+	struct request_queue *q;
+	struct bio *bio;
+	int ret = 0;
+
+	if (bdev->bd_disk == NULL)
+		return -ENXIO;
+
+	q = bdev_get_queue(bdev);
+	if (!q)
+		return -ENXIO;
+
+	/*
+	 * some block devices may not have their queue correctly set up here
+	 * (e.g. loop device without a backing file) and so issuing a flush
+	 * here will panic. Ensure there is a request function before issuing
+	 * the barrier.
+	 */
+	if (!q->make_request_fn)
+		return -ENXIO;
+
+	bio = bio_alloc(gfp_mask, 0);
+	bio->bi_end_io = bio_end_empty_barrier;
+	bio->bi_bdev = bdev;
+	if (test_bit(BLKDEV_WAIT, &flags))
+		bio->bi_private = &wait;
+
+	bio_get(bio);
+	submit_bio(WRITE_BARRIER, bio);
+	if (test_bit(BLKDEV_WAIT, &flags)) {
+		wait_for_completion(&wait);
+		/*
+		 * The driver must store the error location in ->bi_sector, if
+		 * it supports it. For non-stacked drivers, this should be
+		 * copied from blk_rq_pos(rq).
+		 */
+		if (error_sector)
+			*error_sector = bio->bi_sector;
+	}
+
+	if (bio_flagged(bio, BIO_EOPNOTSUPP))
+		ret = -EOPNOTSUPP;
+	else if (!bio_flagged(bio, BIO_UPTODATE))
+		ret = -EIO;
+
+	bio_put(bio);
+	return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_flush);
-- 
1.7.1
- * [PATCH 08/11] block: rename barrier/ordered to flush
  2010-08-12 12:41 [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Tejun Heo
                   ` (6 preceding siblings ...)
  2010-08-12 12:41 ` [PATCH 07/11] block: rename blk-barrier.c to blk-flush.c Tejun Heo
@ 2010-08-12 12:41 ` Tejun Heo
  2010-08-17 13:26   ` Christoph Hellwig
  2010-08-12 12:41 ` [PATCH 09/11] block: implement REQ_FLUSH/FUA based interface for FLUSH/FUA requests Tejun Heo
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 109+ messages in thread
From: Tejun Heo @ 2010-08-12 12:41 UTC (permalink / raw)
  To: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch
  Cc: Tejun Heo, Christoph Hellwig
With ordering requirements dropped, barrier and ordered are misnomers.
All the block layer does now is sequence FLUSH and FUA.  Rename them
to flush.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
---
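For readers following the rename, here is a minimal userspace sketch of
the bookkeeping that blk_flush_cur_seq() and blk_flush_complete_seq()
perform on q->flush_seq.  The bit values are copied from the
QUEUE_FSEQ_* enum in this patch; the MODEL_* and model_*() names are
hypothetical and exist only for this illustration, and model_ffz() is a
plain-C stand-in for the kernel's ffz().  It is a model of the bitmask
logic only, not of request queueing.

```c
/* Sequence bits, mirroring QUEUE_FSEQ_* from this patch (MODEL_* names are hypothetical). */
enum {
	MODEL_FSEQ_STARTED	= (1 << 0),
	MODEL_FSEQ_PREFLUSH	= (1 << 1),
	MODEL_FSEQ_DATA		= (1 << 2),
	MODEL_FSEQ_POSTFLUSH	= (1 << 3),
	MODEL_FSEQ_DONE		= (1 << 4),
};

/*
 * Userspace stand-in for ffz(): index of the lowest zero bit.
 * Like the kernel helper, it must not be called with all bits set.
 */
static unsigned model_ffz(unsigned word)
{
	unsigned bit = 0;

	while (word & 1u) {
		word >>= 1;
		bit++;
	}
	return bit;
}

/* Mirrors blk_flush_cur_seq(): 0 when idle, else the lowest step not yet done. */
static unsigned model_cur_seq(unsigned flush_seq)
{
	if (!flush_seq)
		return 0;
	return 1u << model_ffz(flush_seq);
}

/*
 * Mirrors the bit handling in blk_flush_complete_seq(): mark @seq as
 * completed (or skipped) and report which step comes next.
 */
static unsigned model_complete_seq(unsigned *flush_seq, unsigned seq)
{
	*flush_seq |= seq;
	return model_cur_seq(*flush_seq);
}
```

Starting from MODEL_FSEQ_STARTED and marking POSTFLUSH as skipped, the
next step reported is PREFLUSH, then DATA, then DONE.  Skipped steps are
simply pre-set bits, which is why blk_do_flush() can hand the skip mask
to the same completion helper.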
 block/blk-core.c       |   21 +++++-----
 block/blk-flush.c      |   98 +++++++++++++++++++++++------------------------
 block/blk.h            |    4 +-
 include/linux/blkdev.h |   26 ++++++------
 4 files changed, 73 insertions(+), 76 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 82bd6d9..efe391b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -136,7 +136,7 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 {
 	struct request_queue *q = rq->q;
 
-	if (&q->bar_rq != rq) {
+	if (&q->flush_rq != rq) {
 		if (error)
 			clear_bit(BIO_UPTODATE, &bio->bi_flags);
 		else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
@@ -160,13 +160,12 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 		if (bio->bi_size == 0)
 			bio_endio(bio, error);
 	} else {
-
 		/*
-		 * Okay, this is the barrier request in progress, just
-		 * record the error;
+		 * Okay, this is the sequenced flush request in
+		 * progress, just record the error;
 		 */
-		if (error && !q->orderr)
-			q->orderr = error;
+		if (error && !q->flush_err)
+			q->flush_err = error;
 	}
 }
 
@@ -520,7 +519,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	init_timer(&q->unplug_timer);
 	setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
 	INIT_LIST_HEAD(&q->timeout_list);
-	INIT_LIST_HEAD(&q->pending_barriers);
+	INIT_LIST_HEAD(&q->pending_flushes);
 	INIT_WORK(&q->unplug_work, blk_unplug_work);
 
 	kobject_init(&q->kobj, &blk_queue_ktype);
@@ -1758,11 +1757,11 @@ static void blk_account_io_completion(struct request *req, unsigned int bytes)
 static void blk_account_io_done(struct request *req)
 {
 	/*
-	 * Account IO completion.  bar_rq isn't accounted as a normal
-	 * IO on queueing nor completion.  Accounting the containing
-	 * request is enough.
+	 * Account IO completion.  flush_rq isn't accounted as a
+	 * normal IO on queueing nor completion.  Accounting the
+	 * containing request is enough.
 	 */
-	if (blk_do_io_stat(req) && req != &req->q->bar_rq) {
+	if (blk_do_io_stat(req) && req != &req->q->flush_rq) {
 		unsigned long duration = jiffies - req->start_time;
 		const int rw = rq_data_dir(req);
 		struct hd_struct *part;
diff --git a/block/blk-flush.c b/block/blk-flush.c
index e8b2e5c..dd87322 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -9,41 +9,38 @@
 
 #include "blk.h"
 
-static struct request *queue_next_ordseq(struct request_queue *q);
+static struct request *queue_next_fseq(struct request_queue *q);
 
-/*
- * Cache flushing for ordered writes handling
- */
-unsigned blk_ordered_cur_seq(struct request_queue *q)
+unsigned blk_flush_cur_seq(struct request_queue *q)
 {
-	if (!q->ordseq)
+	if (!q->flush_seq)
 		return 0;
-	return 1 << ffz(q->ordseq);
+	return 1 << ffz(q->flush_seq);
 }
 
-static struct request *blk_ordered_complete_seq(struct request_queue *q,
-						unsigned seq, int error)
+static struct request *blk_flush_complete_seq(struct request_queue *q,
+					      unsigned seq, int error)
 {
 	struct request *next_rq = NULL;
 
-	if (error && !q->orderr)
-		q->orderr = error;
+	if (error && !q->flush_err)
+		q->flush_err = error;
 
-	BUG_ON(q->ordseq & seq);
-	q->ordseq |= seq;
+	BUG_ON(q->flush_seq & seq);
+	q->flush_seq |= seq;
 
-	if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) {
-		/* not complete yet, queue the next ordered sequence */
-		next_rq = queue_next_ordseq(q);
+	if (blk_flush_cur_seq(q) != QUEUE_FSEQ_DONE) {
+		/* not complete yet, queue the next flush sequence */
+		next_rq = queue_next_fseq(q);
 	} else {
-		/* complete this barrier request */
-		__blk_end_request_all(q->orig_bar_rq, q->orderr);
-		q->orig_bar_rq = NULL;
-		q->ordseq = 0;
-
-		/* dispatch the next barrier if there's one */
-		if (!list_empty(&q->pending_barriers)) {
-			next_rq = list_entry_rq(q->pending_barriers.next);
+		/* complete this flush request */
+		__blk_end_request_all(q->orig_flush_rq, q->flush_err);
+		q->orig_flush_rq = NULL;
+		q->flush_seq = 0;
+
+		/* dispatch the next flush if there's one */
+		if (!list_empty(&q->pending_flushes)) {
+			next_rq = list_entry_rq(q->pending_flushes.next);
 			list_move(&next_rq->queuelist, &q->queue_head);
 		}
 	}
@@ -53,19 +50,19 @@ static struct request *blk_ordered_complete_seq(struct request_queue *q,
 static void pre_flush_end_io(struct request *rq, int error)
 {
 	elv_completed_request(rq->q, rq);
-	blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_PREFLUSH, error);
+	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_PREFLUSH, error);
 }
 
-static void bar_end_io(struct request *rq, int error)
+static void flush_data_end_io(struct request *rq, int error)
 {
 	elv_completed_request(rq->q, rq);
-	blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_BAR, error);
+	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_DATA, error);
 }
 
 static void post_flush_end_io(struct request *rq, int error)
 {
 	elv_completed_request(rq->q, rq);
-	blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error);
+	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
 }
 
 static void queue_flush(struct request_queue *q, struct request *rq,
@@ -74,34 +71,34 @@ static void queue_flush(struct request_queue *q, struct request *rq,
 	blk_rq_init(q, rq);
 	rq->cmd_type = REQ_TYPE_FS;
 	rq->cmd_flags = REQ_FLUSH;
-	rq->rq_disk = q->orig_bar_rq->rq_disk;
+	rq->rq_disk = q->orig_flush_rq->rq_disk;
 	rq->end_io = end_io;
 
 	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
 }
 
-static struct request *queue_next_ordseq(struct request_queue *q)
+static struct request *queue_next_fseq(struct request_queue *q)
 {
-	struct request *rq = &q->bar_rq;
+	struct request *rq = &q->flush_rq;
 
-	switch (blk_ordered_cur_seq(q)) {
-	case QUEUE_ORDSEQ_PREFLUSH:
+	switch (blk_flush_cur_seq(q)) {
+	case QUEUE_FSEQ_PREFLUSH:
 		queue_flush(q, rq, pre_flush_end_io);
 		break;
 
-	case QUEUE_ORDSEQ_BAR:
+	case QUEUE_FSEQ_DATA:
 		/* initialize proxy request and queue it */
 		blk_rq_init(q, rq);
-		init_request_from_bio(rq, q->orig_bar_rq->bio);
+		init_request_from_bio(rq, q->orig_flush_rq->bio);
 		rq->cmd_flags &= ~REQ_HARDBARRIER;
 		if (q->ordered & QUEUE_ORDERED_DO_FUA)
 			rq->cmd_flags |= REQ_FUA;
-		rq->end_io = bar_end_io;
+		rq->end_io = flush_data_end_io;
 
 		elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
 		break;
 
-	case QUEUE_ORDSEQ_POSTFLUSH:
+	case QUEUE_FSEQ_POSTFLUSH:
 		queue_flush(q, rq, post_flush_end_io);
 		break;
 
@@ -111,19 +108,20 @@ static struct request *queue_next_ordseq(struct request_queue *q)
 	return rq;
 }
 
-struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
+struct request *blk_do_flush(struct request_queue *q, struct request *rq)
 {
 	unsigned skip = 0;
 
 	if (!(rq->cmd_flags & REQ_HARDBARRIER))
 		return rq;
 
-	if (q->ordseq) {
+	if (q->flush_seq) {
 		/*
-		 * Barrier is already in progress and they can't be
-		 * processed in parallel.  Queue for later processing.
+		 * Sequenced flush is already in progress and they
+		 * can't be processed in parallel.  Queue for later
+		 * processing.
 		 */
-		list_move_tail(&rq->queuelist, &q->pending_barriers);
+		list_move_tail(&rq->queuelist, &q->pending_flushes);
 		return NULL;
 	}
 
@@ -138,11 +136,11 @@ struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
 	}
 
 	/*
-	 * Start a new ordered sequence
+	 * Start a new flush sequence
 	 */
-	q->orderr = 0;
+	q->flush_err = 0;
 	q->ordered = q->next_ordered;
-	q->ordseq |= QUEUE_ORDSEQ_STARTED;
+	q->flush_seq |= QUEUE_FSEQ_STARTED;
 
 	/*
 	 * For an empty barrier, there's no actual BAR request, which
@@ -154,19 +152,19 @@ struct request *blk_do_ordered(struct request_queue *q, struct request *rq)
 
 	/* stash away the original request */
 	blk_dequeue_request(rq);
-	q->orig_bar_rq = rq;
+	q->orig_flush_rq = rq;
 
 	if (!(q->ordered & QUEUE_ORDERED_DO_PREFLUSH))
-		skip |= QUEUE_ORDSEQ_PREFLUSH;
+		skip |= QUEUE_FSEQ_PREFLUSH;
 
 	if (!(q->ordered & QUEUE_ORDERED_DO_BAR))
-		skip |= QUEUE_ORDSEQ_BAR;
+		skip |= QUEUE_FSEQ_DATA;
 
 	if (!(q->ordered & QUEUE_ORDERED_DO_POSTFLUSH))
-		skip |= QUEUE_ORDSEQ_POSTFLUSH;
+		skip |= QUEUE_FSEQ_POSTFLUSH;
 
 	/* complete skipped sequences and return the first sequence */
-	return blk_ordered_complete_seq(q, skip, 0);
+	return blk_flush_complete_seq(q, skip, 0);
 }
 
 static void bio_end_empty_barrier(struct bio *bio, int err)
diff --git a/block/blk.h b/block/blk.h
index 08081e4..24b92bd 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -51,7 +51,7 @@ static inline void blk_clear_rq_complete(struct request *rq)
  */
 #define ELV_ON_HASH(rq)		(!hlist_unhashed(&(rq)->hash))
 
-struct request *blk_do_ordered(struct request_queue *q, struct request *rq);
+struct request *blk_do_flush(struct request_queue *q, struct request *rq);
 
 static inline struct request *__elv_next_request(struct request_queue *q)
 {
@@ -60,7 +60,7 @@ static inline struct request *__elv_next_request(struct request_queue *q)
 	while (1) {
 		while (!list_empty(&q->queue_head)) {
 			rq = list_entry_rq(q->queue_head.next);
-			rq = blk_do_ordered(q, rq);
+			rq = blk_do_flush(q, rq);
 			if (rq)
 				return rq;
 		}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 522ecda..87e58f0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -357,13 +357,13 @@ struct request_queue
 	/*
 	 * for flush operations
 	 */
+	unsigned int		ordered, next_ordered;
 	unsigned int		flush_flags;
-
-	unsigned int		ordered, next_ordered, ordseq;
-	int			orderr;
-	struct request		bar_rq;
-	struct request		*orig_bar_rq;
-	struct list_head	pending_barriers;
+	unsigned int		flush_seq;
+	int			flush_err;
+	struct request		flush_rq;
+	struct request		*orig_flush_rq;
+	struct list_head	pending_flushes;
 
 	struct mutex		sysfs_lock;
 
@@ -489,13 +489,13 @@ enum {
 					  QUEUE_ORDERED_DO_FUA,
 
 	/*
-	 * Ordered operation sequence
+	 * FLUSH/FUA sequences.
 	 */
-	QUEUE_ORDSEQ_STARTED	= (1 << 0), /* flushing in progress */
-	QUEUE_ORDSEQ_PREFLUSH	= (1 << 1), /* pre-flushing in progress */
-	QUEUE_ORDSEQ_BAR	= (1 << 2), /* barrier write in progress */
-	QUEUE_ORDSEQ_POSTFLUSH	= (1 << 3), /* post-flushing in progress */
-	QUEUE_ORDSEQ_DONE	= (1 << 4),
+	QUEUE_FSEQ_STARTED	= (1 << 0), /* flushing in progress */
+	QUEUE_FSEQ_PREFLUSH	= (1 << 1), /* pre-flushing in progress */
+	QUEUE_FSEQ_DATA		= (1 << 2), /* data write in progress */
+	QUEUE_FSEQ_POSTFLUSH	= (1 << 3), /* post-flushing in progress */
+	QUEUE_FSEQ_DONE		= (1 << 4),
 };
 
 #define blk_queue_plugged(q)	test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags)
@@ -507,7 +507,7 @@ enum {
 #define blk_queue_nonrot(q)	test_bit(QUEUE_FLAG_NONROT, &(q)->queue_flags)
 #define blk_queue_io_stat(q)	test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
 #define blk_queue_add_random(q)	test_bit(QUEUE_FLAG_ADD_RANDOM, &(q)->queue_flags)
-#define blk_queue_flushing(q)	((q)->ordseq)
+#define blk_queue_flushing(q)	((q)->flush_seq)
 #define blk_queue_stackable(q)	\
 	test_bit(QUEUE_FLAG_STACKABLE, &(q)->queue_flags)
 #define blk_queue_discard(q)	test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
-- 
1.7.1
- * Re: [PATCH 08/11] block: rename barrier/ordered to flush
  2010-08-12 12:41 ` [PATCH 08/11] block: rename barrier/ordered to flush Tejun Heo
@ 2010-08-17 13:26   ` Christoph Hellwig
  2010-08-17 16:23     ` Tejun Heo
  0 siblings, 1 reply; 109+ messages in thread
From: Christoph Hellwig @ 2010-08-17 13:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare,
	Christoph Hellwig
> -#define blk_queue_flushing(q)	((q)->ordseq)
> +#define blk_queue_flushing(q)	((q)->flush_seq)
Btw, I think this one should just go away.  It's only used by
ide in an attempt to make ordered sequences atomic, which isn't
needed for the new design.
- * Re: [PATCH 08/11] block: rename barrier/ordered to flush
  2010-08-17 13:26   ` Christoph Hellwig
@ 2010-08-17 16:23     ` Tejun Heo
  2010-08-17 17:08       ` Christoph Hellwig
  0 siblings, 1 reply; 109+ messages in thread
From: Tejun Heo @ 2010-08-17 16:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare,
	Christoph Hellwig
Hello,
On 08/17/2010 03:26 PM, Christoph Hellwig wrote:
>> -#define blk_queue_flushing(q)	((q)->ordseq)
>> +#define blk_queue_flushing(q)	((q)->flush_seq)
> 
> Btw, I think this one should just go away.  It's only used by
> ide in an attempt to make ordered sequences atomic, which isn't
> needed for the new design.
Yeap, agreed.  I couldn't really understand why the sequence
needed to be atomic for ide in the first place so just left it alone.
Do you understand why it tried to be atomic?
Thanks.
-- 
tejun
- * Re: [PATCH 08/11] block: rename barrier/ordered to flush
  2010-08-17 16:23     ` Tejun Heo
@ 2010-08-17 17:08       ` Christoph Hellwig
  2010-08-18  6:23         ` Tejun Heo
  0 siblings, 1 reply; 109+ messages in thread
From: Christoph Hellwig @ 2010-08-17 17:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Hellwig, jaxboe, linux-fsdevel, linux-scsi, linux-ide,
	linux-kernel, linux-raid, James.Bottomley, tytso, chris.mason,
	swhiteho, konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare,
	Christoph Hellwig
On Tue, Aug 17, 2010 at 06:23:55PM +0200, Tejun Heo wrote:
> Yeap, agreed.  I couldn't really understand why the sequence
> needed to be atomic for ide in the first place so just left it alone.
> Do you understand why it tried to be atomic?
I think initial drafts of the barrier specification talked about atomic
sequences.  Except for that I can't think of any reason.
- * Re: [PATCH 08/11] block: rename barrier/ordered to flush
  2010-08-17 17:08       ` Christoph Hellwig
@ 2010-08-18  6:23         ` Tejun Heo
  0 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2010-08-18  6:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare,
	Christoph Hellwig
Hello,
On 08/17/2010 07:08 PM, Christoph Hellwig wrote:
> On Tue, Aug 17, 2010 at 06:23:55PM +0200, Tejun Heo wrote:
>> Yeap, agreed.  I couldn't really understand why the sequence
>> needed to be atomic for ide in the first place so just left it alone.
>> Do you understand why it tried to be atomic?
> 
> I think initial drafts of the barrier specification talked about atomic
> sequences.  Except for that I can't think of any reason.
Hmm... alright, I'll rip it out.
Thanks.
-- 
tejun
- * [PATCH 09/11] block: implement REQ_FLUSH/FUA based interface for FLUSH/FUA requests
  2010-08-12 12:41 [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Tejun Heo
                   ` (7 preceding siblings ...)
  2010-08-12 12:41 ` [PATCH 08/11] block: rename barrier/ordered to flush Tejun Heo
@ 2010-08-12 12:41 ` Tejun Heo
  2010-08-12 12:41 ` [PATCH 10/11] fs, block: propagate REQ_FLUSH/FUA interface to upper layers Tejun Heo
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2010-08-12 12:41 UTC (permalink / raw)
  To: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.
  Cc: Tejun Heo, Christoph Hellwig
Now that the backend conversion is complete, export sequenced
FLUSH/FUA capability through REQ_FLUSH/FUA flags.  REQ_FLUSH means the
device cache should be flushed before executing the request.  REQ_FUA
means that the data in the request should be on non-volatile media on
completion.
The block layer chooses the correct way of implementing the semantics
and executes it.  The request may be passed to the device directly if
the device can handle it; otherwise, it will be sequenced using one or
more proxy requests.  A device will never see a REQ_FLUSH and/or
REQ_FUA that it doesn't support.
* QUEUE_ORDERED_* are removed and QUEUE_FSEQ_* are moved into
  blk-flush.c.
* REQ_FLUSH w/o data could also be passed directly to drivers without
  sequencing, but some drivers assume that zero length requests don't
  have rq->bio.  That isn't true for these requests, so they still
  require the use of proxy requests.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
---
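To make the selection rules above concrete, here is a small userspace
sketch of the decision blk_do_flush() makes in this patch: which steps
are run and which flags the driver ends up seeing on the data request.
MODEL_REQ_FLUSH, MODEL_REQ_FUA, struct model_plan and model_plan_flush()
are hypothetical names with made-up bit values; only the boolean logic
is taken from the diff below.

```c
#include <stdbool.h>

/* Made-up flag bits for this model; the kernel's REQ_* values differ. */
#define MODEL_REQ_FLUSH	(1u << 0)
#define MODEL_REQ_FUA	(1u << 1)

struct model_plan {
	bool direct;		/* issued as-is, no sequencing needed */
	bool preflush;		/* proxy FLUSH before the data */
	bool data;		/* the request carries data */
	bool postflush;		/* proxy FLUSH after the data (emulated FUA) */
	unsigned rq_flags;	/* FLUSH/FUA bits left for the driver to see */
};

/*
 * @queue_flags: what the device supports (models q->flush_flags)
 * @rq_flags:    what the submitter asked for (models rq->cmd_flags)
 * @sectors:     data size; 0 models an empty flush
 */
static struct model_plan model_plan_flush(unsigned queue_flags,
					  unsigned rq_flags, unsigned sectors)
{
	bool has_flush = queue_flags & MODEL_REQ_FLUSH;
	bool has_fua = queue_flags & MODEL_REQ_FUA;
	struct model_plan p = { 0 };

	p.preflush = has_flush && (rq_flags & MODEL_REQ_FLUSH);
	p.postflush = has_flush && !has_fua && (rq_flags & MODEL_REQ_FUA);
	p.data = sectors != 0;
	p.direct = p.data && !p.preflush && !p.postflush;

	/* FLUSH is never left on the data request; FUA only if supported. */
	rq_flags &= ~MODEL_REQ_FLUSH;
	if (!has_fua)
		rq_flags &= ~MODEL_REQ_FUA;
	p.rq_flags = rq_flags;
	return p;
}
```

For example, a FLUSH|FUA write to a device that advertises only FLUSH
comes out as preflush + data + postflush with no flags left on the data
request, while the same write to a FLUSH|FUA device comes out as
preflush + data with FUA kept.  A device advertising neither gets the
data request directly with both flags stripped.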
 block/blk-core.c       |    2 +-
 block/blk-flush.c      |   85 ++++++++++++++++++++++++++----------------------
 block/blk.h            |    3 ++
 include/linux/blkdev.h |   38 +--------------------
 4 files changed, 52 insertions(+), 76 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index efe391b..c00ace2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1204,7 +1204,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
 
 	spin_lock_irq(q->queue_lock);
 
-	if (bio->bi_rw & REQ_HARDBARRIER) {
+	if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
 		where = ELEVATOR_INSERT_FRONT;
 		goto get_rq;
 	}
diff --git a/block/blk-flush.c b/block/blk-flush.c
index dd87322..452c552 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -1,5 +1,5 @@
 /*
- * Functions related to barrier IO handling
+ * Functions to sequence FLUSH and FUA writes.
  */
 #include <linux/kernel.h>
 #include <linux/module.h>
@@ -9,6 +9,15 @@
 
 #include "blk.h"
 
+/* FLUSH/FUA sequences */
+enum {
+	QUEUE_FSEQ_STARTED	= (1 << 0), /* flushing in progress */
+	QUEUE_FSEQ_PREFLUSH	= (1 << 1), /* pre-flushing in progress */
+	QUEUE_FSEQ_DATA		= (1 << 2), /* data write in progress */
+	QUEUE_FSEQ_POSTFLUSH	= (1 << 3), /* post-flushing in progress */
+	QUEUE_FSEQ_DONE		= (1 << 4),
+};
+
 static struct request *queue_next_fseq(struct request_queue *q);
 
 unsigned blk_flush_cur_seq(struct request_queue *q)
@@ -79,6 +88,7 @@ static void queue_flush(struct request_queue *q, struct request *rq,
 
 static struct request *queue_next_fseq(struct request_queue *q)
 {
+	struct request *orig_rq = q->orig_flush_rq;
 	struct request *rq = &q->flush_rq;
 
 	switch (blk_flush_cur_seq(q)) {
@@ -87,12 +97,11 @@ static struct request *queue_next_fseq(struct request_queue *q)
 		break;
 
 	case QUEUE_FSEQ_DATA:
-		/* initialize proxy request and queue it */
+		/* initialize proxy request, inherit FLUSH/FUA and queue it */
 		blk_rq_init(q, rq);
-		init_request_from_bio(rq, q->orig_flush_rq->bio);
-		rq->cmd_flags &= ~REQ_HARDBARRIER;
-		if (q->ordered & QUEUE_ORDERED_DO_FUA)
-			rq->cmd_flags |= REQ_FUA;
+		init_request_from_bio(rq, orig_rq->bio);
+		rq->cmd_flags &= ~(REQ_FLUSH | REQ_FUA);
+		rq->cmd_flags |= orig_rq->cmd_flags & (REQ_FLUSH | REQ_FUA);
 		rq->end_io = flush_data_end_io;
 
 		elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
@@ -110,60 +119,58 @@ static struct request *queue_next_fseq(struct request_queue *q)
 
 struct request *blk_do_flush(struct request_queue *q, struct request *rq)
 {
+	unsigned int fflags = q->flush_flags; /* may change, cache it */
+	bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
+	bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
+	bool do_postflush = has_flush && !has_fua && (rq->cmd_flags & REQ_FUA);
 	unsigned skip = 0;
 
-	if (!(rq->cmd_flags & REQ_HARDBARRIER))
+	/*
+	 * Special case.  If there's data but flush is not necessary,
+	 * the request can be issued directly.
+	 *
+	 * Flush w/o data should be able to be issued directly too but
+	 * currently some drivers assume that rq->bio contains
+	 * non-zero data if it isn't NULL and empty FLUSH requests
+	 * getting here usually have bio's without data.
+	 */
+	if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
+		rq->cmd_flags &= ~REQ_FLUSH;
+		if (!has_fua)
+			rq->cmd_flags &= ~REQ_FUA;
 		return rq;
+	}
 
+	/*
+	 * Sequenced flushes can't be processed in parallel.  If
+	 * another one is already in progress, queue for later
+	 * processing.
+	 */
 	if (q->flush_seq) {
-		/*
-		 * Sequenced flush is already in progress and they
-		 * can't be processed in parallel.  Queue for later
-		 * processing.
-		 */
 		list_move_tail(&rq->queuelist, &q->pending_flushes);
 		return NULL;
 	}
 
-	if (unlikely(q->next_ordered == QUEUE_ORDERED_NONE)) {
-		/*
-		 * Queue ordering not supported.  Terminate
-		 * with prejudice.
-		 */
-		blk_dequeue_request(rq);
-		__blk_end_request_all(rq, -EOPNOTSUPP);
-		return NULL;
-	}
-
 	/*
 	 * Start a new flush sequence
 	 */
 	q->flush_err = 0;
-	q->ordered = q->next_ordered;
 	q->flush_seq |= QUEUE_FSEQ_STARTED;
 
-	/*
-	 * For an empty barrier, there's no actual BAR request, which
-	 * in turn makes POSTFLUSH unnecessary.  Mask them off.
-	 */
-	if (!blk_rq_sectors(rq))
-		q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
-				QUEUE_ORDERED_DO_POSTFLUSH);
-
-	/* stash away the original request */
+	/* adjust FLUSH/FUA of the original request and stash it away */
+	rq->cmd_flags &= ~REQ_FLUSH;
+	if (!has_fua)
+		rq->cmd_flags &= ~REQ_FUA;
 	blk_dequeue_request(rq);
 	q->orig_flush_rq = rq;
 
-	if (!(q->ordered & QUEUE_ORDERED_DO_PREFLUSH))
+	/* skip unneeded sequences and return the first one */
+	if (!do_preflush)
 		skip |= QUEUE_FSEQ_PREFLUSH;
-
-	if (!(q->ordered & QUEUE_ORDERED_DO_BAR))
+	if (!blk_rq_sectors(rq))
 		skip |= QUEUE_FSEQ_DATA;
-
-	if (!(q->ordered & QUEUE_ORDERED_DO_POSTFLUSH))
+	if (!do_postflush)
 		skip |= QUEUE_FSEQ_POSTFLUSH;
-
-	/* complete skipped sequences and return the first sequence */
 	return blk_flush_complete_seq(q, skip, 0);
 }
 
diff --git a/block/blk.h b/block/blk.h
index 24b92bd..a09c18b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -60,6 +60,9 @@ static inline struct request *__elv_next_request(struct request_queue *q)
 	while (1) {
 		while (!list_empty(&q->queue_head)) {
 			rq = list_entry_rq(q->queue_head.next);
+			if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
+			    rq == &q->flush_rq)
+				return rq;
 			rq = blk_do_flush(q, rq);
 			if (rq)
 				return rq;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 87e58f0..5ce0696 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -357,7 +357,6 @@ struct request_queue
 	/*
 	 * for flush operations
 	 */
-	unsigned int		ordered, next_ordered;
 	unsigned int		flush_flags;
 	unsigned int		flush_seq;
 	int			flush_err;
@@ -464,40 +463,6 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 	__clear_bit(flag, &q->queue_flags);
 }
 
-enum {
-	/*
-	 * Hardbarrier is supported with one of the following methods.
-	 *
-	 * NONE		: hardbarrier unsupported
-	 * DRAIN	: ordering by draining is enough
-	 * DRAIN_FLUSH	: ordering by draining w/ pre and post flushes
-	 * DRAIN_FUA	: ordering by draining w/ pre flush and FUA write
-	 */
-	QUEUE_ORDERED_DO_PREFLUSH	= 0x10,
-	QUEUE_ORDERED_DO_BAR		= 0x20,
-	QUEUE_ORDERED_DO_POSTFLUSH	= 0x40,
-	QUEUE_ORDERED_DO_FUA		= 0x80,
-
-	QUEUE_ORDERED_NONE		= 0x00,
-
-	QUEUE_ORDERED_DRAIN		= QUEUE_ORDERED_DO_BAR,
-	QUEUE_ORDERED_DRAIN_FLUSH	= QUEUE_ORDERED_DRAIN |
-					  QUEUE_ORDERED_DO_PREFLUSH |
-					  QUEUE_ORDERED_DO_POSTFLUSH,
-	QUEUE_ORDERED_DRAIN_FUA		= QUEUE_ORDERED_DRAIN |
-					  QUEUE_ORDERED_DO_PREFLUSH |
-					  QUEUE_ORDERED_DO_FUA,
-
-	/*
-	 * FLUSH/FUA sequences.
-	 */
-	QUEUE_FSEQ_STARTED	= (1 << 0), /* flushing in progress */
-	QUEUE_FSEQ_PREFLUSH	= (1 << 1), /* pre-flushing in progress */
-	QUEUE_FSEQ_DATA		= (1 << 2), /* data write in progress */
-	QUEUE_FSEQ_POSTFLUSH	= (1 << 3), /* post-flushing in progress */
-	QUEUE_FSEQ_DONE		= (1 << 4),
-};
-
 #define blk_queue_plugged(q)	test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags)
 #define blk_queue_tagged(q)	test_bit(QUEUE_FLAG_QUEUED, &(q)->queue_flags)
 #define blk_queue_stopped(q)	test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
@@ -576,7 +541,8 @@ static inline void blk_clear_queue_full(struct request_queue *q, int sync)
  * it already be started by driver.
  */
 #define RQ_NOMERGE_FLAGS	\
-	(REQ_NOMERGE | REQ_STARTED | REQ_HARDBARRIER | REQ_SOFTBARRIER)
+	(REQ_NOMERGE | REQ_STARTED | REQ_HARDBARRIER | REQ_SOFTBARRIER | \
+	 REQ_FLUSH | REQ_FUA)
 #define rq_mergeable(rq)	\
 	(!((rq)->cmd_flags & RQ_NOMERGE_FLAGS) && \
 	 (((rq)->cmd_flags & REQ_DISCARD) || \
-- 
1.7.1
- * [PATCH 10/11] fs, block: propagate REQ_FLUSH/FUA interface to upper layers
  2010-08-12 12:41 [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Tejun Heo
                   ` (8 preceding siblings ...)
  2010-08-12 12:41 ` [PATCH 09/11] block: implement REQ_FLUSH/FUA based interface for FLUSH/FUA requests Tejun Heo
@ 2010-08-12 12:41 ` Tejun Heo
  2010-08-12 21:24   ` Jan Kara
  2010-08-16 16:33   ` [PATCH UPDATED " Tejun Heo
  2010-08-12 12:41 ` [PATCH 11/11] block: use REQ_FLUSH in blkdev_issue_flush() Tejun Heo
                   ` (5 subsequent siblings)
  15 siblings, 2 replies; 109+ messages in thread
From: Tejun Heo @ 2010-08-12 12:41 UTC (permalink / raw)
  To: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.
  Cc: Tejun Heo, Christoph Hellwig
Propagate the deprecation of REQ_HARDBARRIER and the new REQ_FLUSH/FUA
interface to upper layers.
* WRITE_BARRIER is marked deprecated and WRITE_FLUSH, WRITE_FUA and
  WRITE_FLUSH_FUA are added.
* REQ_COMMON_MASK now includes REQ_FLUSH | REQ_FUA so that they are
  copied from bio to request.
* BH_Ordered is marked deprecated and BH_Flush and BH_FUA are added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
---
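As a rough illustration of the submit_bh() change below, this userspace
sketch models how per-buffer Flush/FUA state is folded into the rw
argument for writes only.  All MODEL_* constants and model_submit_rw()
are hypothetical stand-ins; the real WRITE_FLUSH and WRITE_FUA values
are defined in include/linux/fs.h and are not reproduced here.

```c
/* Made-up rw bits for this model only. */
#define MODEL_WRITE	(1u << 0)
#define MODEL_FLUSH	(1u << 1)
#define MODEL_FUA	(1u << 2)

/* Made-up buffer state bits standing in for BH_Flush and BH_FUA. */
#define MODEL_BH_FLUSH	(1u << 0)
#define MODEL_BH_FUA	(1u << 1)

/*
 * Models the flag handling added to submit_bh(): buffer state only
 * affects writes; reads pass through unchanged.
 */
static unsigned model_submit_rw(unsigned rw, unsigned bh_state)
{
	if (rw & MODEL_WRITE) {
		if (bh_state & MODEL_BH_FLUSH)
			rw |= MODEL_FLUSH;
		if (bh_state & MODEL_BH_FUA)
			rw |= MODEL_FUA;
	}
	return rw;
}
```

A read with both buffer bits set comes back unchanged, while a write
with both set gains both the flush and FUA bits.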
 fs/buffer.c                 |   27 ++++++++++++++++-----------
 include/linux/blk_types.h   |    2 +-
 include/linux/buffer_head.h |    8 ++++++--
 include/linux/fs.h          |   20 +++++++++++++-------
 4 files changed, 36 insertions(+), 21 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index d54812b..ec32fbb 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3019,18 +3019,23 @@ int submit_bh(int rw, struct buffer_head * bh)
 	BUG_ON(buffer_delay(bh));
 	BUG_ON(buffer_unwritten(bh));
 
-	/*
-	 * Mask in barrier bit for a write (could be either a WRITE or a
-	 * WRITE_SYNC
-	 */
-	if (buffer_ordered(bh) && (rw & WRITE))
-		rw |= WRITE_BARRIER;
+	if (rw & WRITE) {
+		/* ordered is deprecated, will be removed */
+		if (buffer_ordered(bh))
+			rw |= WRITE_BARRIER;
 
-	/*
-	 * Only clear out a write error when rewriting
-	 */
-	if (test_set_buffer_req(bh) && (rw & WRITE))
-		clear_buffer_write_io_error(bh);
+		if (buffer_flush(bh))
+			rw |= WRITE_FLUSH;
+
+		if (buffer_fua(bh))
+			rw |= WRITE_FUA;
+
+		/*
+		 * Only clear out a write error when rewriting
+		 */
+		if (test_set_buffer_req(bh))
+			clear_buffer_write_io_error(bh);
+	}
 
 	/*
 	 * from here on down, it's all bio -- do the initial mapping,
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 8e9887d..6609fc0 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -164,7 +164,7 @@ enum rq_flag_bits {
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
 #define REQ_COMMON_MASK \
 	(REQ_WRITE | REQ_FAILFAST_MASK | REQ_HARDBARRIER | REQ_SYNC | \
-	 REQ_META| REQ_DISCARD | REQ_NOIDLE)
+	 REQ_META | REQ_DISCARD | REQ_NOIDLE | REQ_FLUSH | REQ_FUA)
 
 #define REQ_UNPLUG		(1 << __REQ_UNPLUG)
 #define REQ_RAHEAD		(1 << __REQ_RAHEAD)
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 1b9ba19..498bd8b 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -32,8 +32,10 @@ enum bh_state_bits {
 	BH_Delay,	/* Buffer is not yet allocated on disk */
 	BH_Boundary,	/* Block is followed by a discontiguity */
 	BH_Write_EIO,	/* I/O error on write */
-	BH_Ordered,	/* ordered write */
-	BH_Eopnotsupp,	/* operation not supported (barrier) */
+	BH_Ordered,	/* DEPRECATED: ordered write */
+	BH_Eopnotsupp,	/* DEPRECATED: operation not supported (barrier) */
+	BH_Flush,	/* Flush device cache before executing IO */
+	BH_FUA,		/* Data should be on non-volatile media on completion */
 	BH_Unwritten,	/* Buffer is allocated on disk but not written */
 	BH_Quiet,	/* Buffer Error Prinks to be quiet */
 
@@ -126,6 +128,8 @@ BUFFER_FNS(Delay, delay)
 BUFFER_FNS(Boundary, boundary)
 BUFFER_FNS(Write_EIO, write_io_error)
 BUFFER_FNS(Ordered, ordered)
+BUFFER_FNS(Flush, flush)
+BUFFER_FNS(FUA, fua)
 BUFFER_FNS(Eopnotsupp, eopnotsupp)
 BUFFER_FNS(Unwritten, unwritten)
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4ebd8eb..6e30b0b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -138,13 +138,13 @@ struct inodes_stat_t {
  * SWRITE_SYNC
  * SWRITE_SYNC_PLUG	Like WRITE_SYNC/WRITE_SYNC_PLUG, but locks the buffer.
  *			See SWRITE.
- * WRITE_BARRIER	Like WRITE_SYNC, but tells the block layer that all
- *			previously submitted writes must be safely on storage
- *			before this one is started. Also guarantees that when
- *			this write is complete, it itself is also safely on
- *			storage. Prevents reordering of writes on both sides
- *			of this IO.
- *
+ * WRITE_BARRIER	DEPRECATED. Always fails. Use FLUSH/FUA instead.
+ * WRITE_FLUSH		Like WRITE_SYNC but with preceding cache flush.
+ * WRITE_FUA		Like WRITE_SYNC but data is guaranteed to be on
+ *			non-volatile media on completion.
+ * WRITE_FLUSH_FUA	Combination of WRITE_FLUSH and FUA.  The IO is preceded
+ *			by a cache flush and data is guaranteed to be on
+ *			non-volatile media on completion.
  */
 #define RW_MASK			REQ_WRITE
 #define RWA_MASK		REQ_RAHEAD
@@ -162,6 +162,12 @@ struct inodes_stat_t {
 #define WRITE_META		(WRITE | REQ_META)
 #define WRITE_BARRIER		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
 				 REQ_HARDBARRIER)
+#define WRITE_FLUSH		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
+				 REQ_FLUSH)
+#define WRITE_FUA		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
+				 REQ_FUA)
+#define WRITE_FLUSH_FUA		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
+				 REQ_FLUSH | REQ_FUA)
 #define SWRITE_SYNC_PLUG	(SWRITE | REQ_SYNC | REQ_NOIDLE)
 #define SWRITE_SYNC		(SWRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG)
 
-- 
1.7.1
^ permalink raw reply related	[flat|nested] 109+ messages in thread
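The REQ_COMMON_MASK change in the patch above can be illustrated with a
small userspace sketch.  It is not kernel code: the SK_* names and bit
positions are hypothetical and chosen only for this example.  It shows why
a flag has to be part of the common mask before it can travel from the
bio's rw field to the request.

```c
#include <assert.h>

/* Hypothetical bit positions for this sketch; the kernel's values differ. */
#define SK_REQ_WRITE	(1u << 0)
#define SK_REQ_SYNC	(1u << 1)
#define SK_REQ_FLUSH	(1u << 2)
#define SK_REQ_FUA	(1u << 3)
#define SK_REQ_UNPLUG	(1u << 4)	/* bio-only flag, never in the mask */

/* Mask before and after adding FLUSH/FUA, reduced to a few bits. */
#define SK_COMMON_MASK_OLD	(SK_REQ_WRITE | SK_REQ_SYNC)
#define SK_COMMON_MASK_NEW	(SK_COMMON_MASK_OLD | SK_REQ_FLUSH | SK_REQ_FUA)

/* Only bits present in the mask are copied from the bio to the request. */
static unsigned int sk_bio_to_rq_flags(unsigned int bio_rw, unsigned int mask)
{
	return bio_rw & mask;
}

/* Example: the same bio flags filtered through the old and the new mask. */
static void sk_mask_demo(void)
{
	unsigned int rw = SK_REQ_WRITE | SK_REQ_SYNC | SK_REQ_UNPLUG |
			  SK_REQ_FLUSH | SK_REQ_FUA;

	/* Old mask: FLUSH and FUA are silently dropped. */
	assert(sk_bio_to_rq_flags(rw, SK_COMMON_MASK_OLD) ==
	       (SK_REQ_WRITE | SK_REQ_SYNC));

	/* New mask: FLUSH and FUA reach the request, UNPLUG still does not. */
	assert(sk_bio_to_rq_flags(rw, SK_COMMON_MASK_NEW) ==
	       (SK_REQ_WRITE | SK_REQ_SYNC | SK_REQ_FLUSH | SK_REQ_FUA));
}
```

With the old mask a filesystem could set the new bits on a bio and the
request seen by the driver would never carry them, which is presumably why
the mask update ships in the same patch as the new WRITE_* definitions.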
- * Re: [PATCH 10/11] fs, block: propagate REQ_FLUSH/FUA interface to upper layers
  2010-08-12 12:41 ` [PATCH 10/11] fs, block: propagate REQ_FLUSH/FUA interface to upper layers Tejun Heo
@ 2010-08-12 21:24   ` Jan Kara
  2010-08-13  7:19     ` Tejun Heo
  2010-08-16 16:33   ` [PATCH UPDATED " Tejun Heo
  1 sibling, 1 reply; 109+ messages in thread
From: Jan Kara @ 2010-08-12 21:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare,
	Christoph Hellwig
On Thu 12-08-10 14:41:30, Tejun Heo wrote:
> Propagate deprecation of REQ_HARDBARRIER and new REQ_FLUSH/FUA
> interface to upper layers.
> 
> * WRITE_BARRIER is marked deprecated and WRITE_FLUSH, WRITE_FUA and
>   WRITE_FLUSH_FUA are added.
> 
> * REQ_COMMON_MASK now includes REQ_FLUSH | REQ_FUA so that they are
>   copied from bio to request.
> 
> * BH_Ordered is marked deprecated and BH_Flush and BH_FUA are added.
  Deprecating BH_Ordered is fine but I wouldn't introduce new BH flags for
this. BH flags should be used for buffer state, not for encoding how the
buffer should be written (there were actually bugs in the past because of
this). Being able to set proper flags when calling submit_bh() in the rw
parameter is enough.
								Honza
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Christoph Hellwig <hch@infradead.org>
> ---
>  fs/buffer.c                 |   27 ++++++++++++++++-----------
>  include/linux/blk_types.h   |    2 +-
>  include/linux/buffer_head.h |    8 ++++++--
>  include/linux/fs.h          |   20 +++++++++++++-------
>  4 files changed, 36 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d54812b..ec32fbb 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -3019,18 +3019,23 @@ int submit_bh(int rw, struct buffer_head * bh)
>  	BUG_ON(buffer_delay(bh));
>  	BUG_ON(buffer_unwritten(bh));
>  
> -	/*
> -	 * Mask in barrier bit for a write (could be either a WRITE or a
> -	 * WRITE_SYNC
> -	 */
> -	if (buffer_ordered(bh) && (rw & WRITE))
> -		rw |= WRITE_BARRIER;
> +	if (rw & WRITE) {
> +		/* ordered is deprecated, will be removed */
> +		if (buffer_ordered(bh))
> +			rw |= WRITE_BARRIER;
>  
> -	/*
> -	 * Only clear out a write error when rewriting
> -	 */
> -	if (test_set_buffer_req(bh) && (rw & WRITE))
> -		clear_buffer_write_io_error(bh);
> +		if (buffer_flush(bh))
> +			rw |= WRITE_FLUSH;
> +
> +		if (buffer_fua(bh))
> +			rw |= WRITE_FUA;
> +
> +		/*
> +		 * Only clear out a write error when rewriting
> +		 */
> +		if (test_set_buffer_req(bh))
> +			clear_buffer_write_io_error(bh);
> +	}
>  
>  	/*
>  	 * from here on down, it's all bio -- do the initial mapping,
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 8e9887d..6609fc0 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -164,7 +164,7 @@ enum rq_flag_bits {
>  	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
>  #define REQ_COMMON_MASK \
>  	(REQ_WRITE | REQ_FAILFAST_MASK | REQ_HARDBARRIER | REQ_SYNC | \
> -	 REQ_META| REQ_DISCARD | REQ_NOIDLE)
> +	 REQ_META | REQ_DISCARD | REQ_NOIDLE | REQ_FLUSH | REQ_FUA)
>  
>  #define REQ_UNPLUG		(1 << __REQ_UNPLUG)
>  #define REQ_RAHEAD		(1 << __REQ_RAHEAD)
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 1b9ba19..498bd8b 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -32,8 +32,10 @@ enum bh_state_bits {
>  	BH_Delay,	/* Buffer is not yet allocated on disk */
>  	BH_Boundary,	/* Block is followed by a discontiguity */
>  	BH_Write_EIO,	/* I/O error on write */
> -	BH_Ordered,	/* ordered write */
> -	BH_Eopnotsupp,	/* operation not supported (barrier) */
> +	BH_Ordered,	/* DEPRECATED: ordered write */
> +	BH_Eopnotsupp,	/* DEPRECATED: operation not supported (barrier) */
> +	BH_Flush,	/* Flush device cache before executing IO */
> +	BH_FUA,		/* Data should be on non-volatile media on completion */
>  	BH_Unwritten,	/* Buffer is allocated on disk but not written */
>  	BH_Quiet,	/* Buffer Error Prinks to be quiet */
>  
> @@ -126,6 +128,8 @@ BUFFER_FNS(Delay, delay)
>  BUFFER_FNS(Boundary, boundary)
>  BUFFER_FNS(Write_EIO, write_io_error)
>  BUFFER_FNS(Ordered, ordered)
> +BUFFER_FNS(Flush, flush)
> +BUFFER_FNS(FUA, fua)
>  BUFFER_FNS(Eopnotsupp, eopnotsupp)
>  BUFFER_FNS(Unwritten, unwritten)
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 4ebd8eb..6e30b0b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -138,13 +138,13 @@ struct inodes_stat_t {
>   * SWRITE_SYNC
>   * SWRITE_SYNC_PLUG	Like WRITE_SYNC/WRITE_SYNC_PLUG, but locks the buffer.
>   *			See SWRITE.
> - * WRITE_BARRIER	Like WRITE_SYNC, but tells the block layer that all
> - *			previously submitted writes must be safely on storage
> - *			before this one is started. Also guarantees that when
> - *			this write is complete, it itself is also safely on
> - *			storage. Prevents reordering of writes on both sides
> - *			of this IO.
> - *
> + * WRITE_BARRIER	DEPRECATED. Always fails. Use FLUSH/FUA instead.
> + * WRITE_FLUSH		Like WRITE_SYNC but with preceding cache flush.
> + * WRITE_FUA		Like WRITE_SYNC but data is guaranteed to be on
> + *			non-volatile media on completion.
> + * WRITE_FLUSH_FUA	Combination of WRITE_FLUSH and FUA.  The IO is preceded
> + *			by a cache flush and data is guaranteed to be on
> + *			non-volatile media on completion.
>   */
>  #define RW_MASK			REQ_WRITE
>  #define RWA_MASK		REQ_RAHEAD
> @@ -162,6 +162,12 @@ struct inodes_stat_t {
>  #define WRITE_META		(WRITE | REQ_META)
>  #define WRITE_BARRIER		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
>  				 REQ_HARDBARRIER)
> +#define WRITE_FLUSH		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
> +				 REQ_FLUSH)
> +#define WRITE_FUA		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
> +				 REQ_FUA)
> +#define WRITE_FLUSH_FUA		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
> +				 REQ_FLUSH | REQ_FUA)
>  #define SWRITE_SYNC_PLUG	(SWRITE | REQ_SYNC | REQ_NOIDLE)
>  #define SWRITE_SYNC		(SWRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG)
>  
> -- 
> 1.7.1
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 109+ messages in thread
- * Re: [PATCH 10/11] fs, block: propagate REQ_FLUSH/FUA interface to upper layers
  2010-08-12 21:24   ` Jan Kara
@ 2010-08-13  7:19     ` Tejun Heo
  2010-08-13  7:47       ` Christoph Hellwig
  0 siblings, 1 reply; 109+ messages in thread
From: Tejun Heo @ 2010-08-13  7:19 UTC (permalink / raw)
  To: Jan Kara
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, vst, rwheeler, hare, Christoph Hellwig
Hello, Jan.
On 08/12/2010 11:24 PM, Jan Kara wrote:
> On Thu 12-08-10 14:41:30, Tejun Heo wrote:
>> Propagate deprecation of REQ_HARDBARRIER and new REQ_FLUSH/FUA
>> interface to upper layers.
>>
>> * WRITE_BARRIER is marked deprecated and WRITE_FLUSH, WRITE_FUA and
>>   WRITE_FLUSH_FUA are added.
>>
>> * REQ_COMMON_MASK now includes REQ_FLUSH | REQ_FUA so that they are
>>   copied from bio to request.
>>
>> * BH_Ordered is marked deprecated and BH_Flush and BH_FUA are added.
>
>   Deprecating BH_Ordered is fine but I wouldn't introduce new BH flags for
> this. BH flags should be used for buffer state, not for encoding how the
> buffer should be written (there were actually bugs in the past because of
> this). Being able to set proper flags when calling submit_bh() in the rw
> parameter is enough.
Ah, okay, I was just trying to match the BH_Ordered usage but you're
saying just requiring submit_bh() users to specify appropriate REQ_*
(or WRITE_*) in @rw is okay, right?  I'll drop the bh part then.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 109+ messages in thread 
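The approach agreed on in this exchange, passing the write mode in the rw
argument instead of storing it as a buffer state bit, can be sketched in
userspace C as follows.  Everything here is an assumption made for
illustration: the SK_* flags, struct sk_buffer and sk_submit() are
hypothetical stand-ins, and the composite flag omits the REQ_NOIDLE and
REQ_UNPLUG bits that the real WRITE_FLUSH_FUA carries.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for this sketch; the kernel's differ. */
#define SK_WRITE		(1u << 0)
#define SK_REQ_SYNC		(1u << 1)
#define SK_REQ_FLUSH		(1u << 2)
#define SK_REQ_FUA		(1u << 3)
#define SK_WRITE_SYNC		(SK_WRITE | SK_REQ_SYNC)
#define SK_WRITE_FLUSH_FUA	(SK_WRITE_SYNC | SK_REQ_FLUSH | SK_REQ_FUA)

/* The buffer records state only; it has no "how to write me" bits. */
struct sk_buffer {
	bool dirty;
};

/* Stand-in for the device: remembers the flags of the last submission. */
static unsigned int sk_last_rw;

/* Mock of a submit_bh()-like call: the write mode arrives in @rw. */
static int sk_submit(unsigned int rw, struct sk_buffer *b)
{
	sk_last_rw = rw;
	if (rw & SK_WRITE)
		b->dirty = false;
	return 0;
}

/* Hypothetical caller: a data block and a commit block, different modes. */
static void sk_submit_demo(void)
{
	struct sk_buffer data = { .dirty = true };
	struct sk_buffer commit = { .dirty = true };

	/* Ordinary synchronous write: no flush, no FUA. */
	sk_submit(SK_WRITE_SYNC, &data);
	assert(!(sk_last_rw & (SK_REQ_FLUSH | SK_REQ_FUA)));

	/* Commit block: preceding cache flush plus FUA, chosen per call. */
	sk_submit(SK_WRITE_FLUSH_FUA, &commit);
	assert((sk_last_rw & SK_REQ_FLUSH) && (sk_last_rw & SK_REQ_FUA));

	assert(!data.dirty && !commit.dirty);
}
```

Because the mode is chosen at each call site, nothing is left behind on the
buffer that a later, unrelated write could pick up by accident, which is
the class of bug Jan alludes to.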
- * Re: [PATCH 10/11] fs, block: propagate REQ_FLUSH/FUA interface to upper layers
  2010-08-13  7:19     ` Tejun Heo
@ 2010-08-13  7:47       ` Christoph Hellwig
  0 siblings, 0 replies; 109+ messages in thread
From: Christoph Hellwig @ 2010-08-13  7:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, jaxboe, linux-fsdevel, linux-scsi, linux-ide,
	linux-kernel, linux-raid, hch, James.Bottomley, tytso,
	chris.mason, swhiteho, konishi.ryusuke, dm-devel, vst, rwheeler,
	hare
FYI: I've already sent a patch to kill BH_Ordered, hopefully Al will
still push it in this merge window.
^ permalink raw reply	[flat|nested] 109+ messages in thread 
 
 
- * [PATCH UPDATED 10/11] fs, block: propagate REQ_FLUSH/FUA interface to upper layers
  2010-08-12 12:41 ` [PATCH 10/11] fs, block: propagate REQ_FLUSH/FUA interface to upper layers Tejun Heo
  2010-08-12 21:24   ` Jan Kara
@ 2010-08-16 16:33   ` Tejun Heo
  1 sibling, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2010-08-16 16:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare,
	Christoph Hellwig
Propagate deprecation of REQ_HARDBARRIER and new REQ_FLUSH/FUA
interface to upper layers.
* WRITE_BARRIER is marked deprecated and WRITE_FLUSH, WRITE_FUA and
  WRITE_FLUSH_FUA are added.
* REQ_COMMON_MASK now includes REQ_FLUSH | REQ_FUA so that they are
  copied from bio to request.
* BH_Ordered and BH_Eopnotsupp are marked deprecated.  BH_Flush/FUA
  are _NOT_ added as they can and should be specified when calling
  submit_bh() as @rw parameter as suggested by Jan Kara.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@suse.cz>
---
Dropped BH_Flush/FUA as suggested.
Thanks.
 include/linux/blk_types.h   |    2 +-
 include/linux/buffer_head.h |    4 ++--
 include/linux/fs.h          |   20 +++++++++++++-------
 3 files changed, 16 insertions(+), 10 deletions(-)
Index: block/include/linux/fs.h
===================================================================
--- block.orig/include/linux/fs.h
+++ block/include/linux/fs.h
@@ -138,13 +138,13 @@ struct inodes_stat_t {
  * SWRITE_SYNC
  * SWRITE_SYNC_PLUG	Like WRITE_SYNC/WRITE_SYNC_PLUG, but locks the buffer.
  *			See SWRITE.
- * WRITE_BARRIER	Like WRITE_SYNC, but tells the block layer that all
- *			previously submitted writes must be safely on storage
- *			before this one is started. Also guarantees that when
- *			this write is complete, it itself is also safely on
- *			storage. Prevents reordering of writes on both sides
- *			of this IO.
- *
+ * WRITE_BARRIER	DEPRECATED. Always fails. Use FLUSH/FUA instead.
+ * WRITE_FLUSH		Like WRITE_SYNC but with preceding cache flush.
+ * WRITE_FUA		Like WRITE_SYNC but data is guaranteed to be on
+ *			non-volatile media on completion.
+ * WRITE_FLUSH_FUA	Combination of WRITE_FLUSH and FUA.  The IO is preceded
+ *			by a cache flush and data is guaranteed to be on
+ *			non-volatile media on completion.
  */
 #define RW_MASK			REQ_WRITE
 #define RWA_MASK		REQ_RAHEAD
@@ -162,6 +162,12 @@ struct inodes_stat_t {
 #define WRITE_META		(WRITE | REQ_META)
 #define WRITE_BARRIER		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
 				 REQ_HARDBARRIER)
+#define WRITE_FLUSH		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
+				 REQ_FLUSH)
+#define WRITE_FUA		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
+				 REQ_FUA)
+#define WRITE_FLUSH_FUA		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
+				 REQ_FLUSH | REQ_FUA)
 #define SWRITE_SYNC_PLUG	(SWRITE | REQ_SYNC | REQ_NOIDLE)
 #define SWRITE_SYNC		(SWRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG)
Index: block/include/linux/blk_types.h
===================================================================
--- block.orig/include/linux/blk_types.h
+++ block/include/linux/blk_types.h
@@ -164,7 +164,7 @@ enum rq_flag_bits {
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
 #define REQ_COMMON_MASK \
 	(REQ_WRITE | REQ_FAILFAST_MASK | REQ_HARDBARRIER | REQ_SYNC | \
-	 REQ_META| REQ_DISCARD | REQ_NOIDLE)
+	 REQ_META | REQ_DISCARD | REQ_NOIDLE | REQ_FLUSH | REQ_FUA)
 #define REQ_UNPLUG		(1 << __REQ_UNPLUG)
 #define REQ_RAHEAD		(1 << __REQ_RAHEAD)
Index: block/include/linux/buffer_head.h
===================================================================
--- block.orig/include/linux/buffer_head.h
+++ block/include/linux/buffer_head.h
@@ -32,8 +32,8 @@ enum bh_state_bits {
 	BH_Delay,	/* Buffer is not yet allocated on disk */
 	BH_Boundary,	/* Block is followed by a discontiguity */
 	BH_Write_EIO,	/* I/O error on write */
-	BH_Ordered,	/* ordered write */
-	BH_Eopnotsupp,	/* operation not supported (barrier) */
+	BH_Ordered,	/* DEPRECATED: ordered write */
+	BH_Eopnotsupp,	/* DEPRECATED: operation not supported (barrier) */
 	BH_Unwritten,	/* Buffer is allocated on disk but not written */
 	BH_Quiet,	/* Buffer Error Prinks to be quiet */
^ permalink raw reply	[flat|nested] 109+ messages in thread
 
- * [PATCH 11/11] block: use REQ_FLUSH in blkdev_issue_flush()
  2010-08-12 12:41 [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Tejun Heo
                   ` (9 preceding siblings ...)
  2010-08-12 12:41 ` [PATCH 10/11] fs, block: propagate REQ_FLUSH/FUA interface to upper layers Tejun Heo
@ 2010-08-12 12:41 ` Tejun Heo
  2010-08-13 11:48 ` [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Christoph Hellwig
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2010-08-12 12:41 UTC (permalink / raw)
  To: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.
  Cc: Tejun Heo, Christoph Hellwig
Update blkdev_issue_flush() to use new REQ_FLUSH interface.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
---
 block/blk-flush.c |   17 ++++++-----------
 1 files changed, 6 insertions(+), 11 deletions(-)
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 452c552..ab765c2 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -174,13 +174,10 @@ struct request *blk_do_flush(struct request_queue *q, struct request *rq)
 	return blk_flush_complete_seq(q, skip, 0);
 }
 
-static void bio_end_empty_barrier(struct bio *bio, int err)
+static void bio_end_flush(struct bio *bio, int err)
 {
-	if (err) {
-		if (err == -EOPNOTSUPP)
-			set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
+	if (err)
 		clear_bit(BIO_UPTODATE, &bio->bi_flags);
-	}
 	if (bio->bi_private)
 		complete(bio->bi_private);
 	bio_put(bio);
@@ -218,19 +215,19 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
 	 * some block devices may not have their queue correctly set up here
 	 * (e.g. loop device without a backing file) and so issuing a flush
 	 * here will panic. Ensure there is a request function before issuing
-	 * the barrier.
+	 * the flush.
 	 */
 	if (!q->make_request_fn)
 		return -ENXIO;
 
 	bio = bio_alloc(gfp_mask, 0);
-	bio->bi_end_io = bio_end_empty_barrier;
+	bio->bi_end_io = bio_end_flush;
 	bio->bi_bdev = bdev;
 	if (test_bit(BLKDEV_WAIT, &flags))
 		bio->bi_private = &wait;
 
 	bio_get(bio);
-	submit_bio(WRITE_BARRIER, bio);
+	submit_bio(WRITE_FLUSH, bio);
 	if (test_bit(BLKDEV_WAIT, &flags)) {
 		wait_for_completion(&wait);
 		/*
@@ -242,9 +239,7 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
 			*error_sector = bio->bi_sector;
 	}
 
-	if (bio_flagged(bio, BIO_EOPNOTSUPP))
-		ret = -EOPNOTSUPP;
-	else if (!bio_flagged(bio, BIO_UPTODATE))
+	if (!bio_flagged(bio, BIO_UPTODATE))
 		ret = -EIO;
 
 	bio_put(bio);
-- 
1.7.1
^ permalink raw reply related	[flat|nested] 109+ messages in thread
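The simplified error handling in the patch above can be modelled in a few
lines of userspace C.  This is only a sketch under stated assumptions:
struct sk_bio and the sk_* functions are hypothetical stand-ins for the bio
and its completion path, with a single boolean in place of BIO_UPTODATE.
It shows that once the BIO_EOPNOTSUPP special case is gone, every nonzero
completion error is reported to the caller as -EIO.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a bio: one flag instead of bi_flags. */
struct sk_bio {
	bool uptodate;
};

/* Mirrors the shape of the new end_io: any error clears "uptodate". */
static void sk_end_flush(struct sk_bio *bio, int err)
{
	if (err)
		bio->uptodate = false;
}

/* Mirrors the tail of the issue path: not uptodate means -EIO. */
static int sk_issue_flush(int completion_err)
{
	struct sk_bio bio = { .uptodate = true };

	sk_end_flush(&bio, completion_err);
	return bio.uptodate ? 0 : -EIO;
}

/* Example: success, a plain I/O error, and an "unsupported" error. */
static void sk_flush_demo(void)
{
	assert(sk_issue_flush(0) == 0);
	assert(sk_issue_flush(-EIO) == -EIO);
	/* No separate -EOPNOTSUPP result any more in this model. */
	assert(sk_issue_flush(-EOPNOTSUPP) == -EIO);
}
```

Whether a lower layer can still complete a flush with -EOPNOTSUPP after
this series is not something the sketch answers; it only shows how the
return value is derived once the dedicated flag check is removed.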
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-12 12:41 [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Tejun Heo
                   ` (10 preceding siblings ...)
  2010-08-12 12:41 ` [PATCH 11/11] block: use REQ_FLUSH in blkdev_issue_flush() Tejun Heo
@ 2010-08-13 11:48 ` Christoph Hellwig
  2010-08-13 13:48   ` Tejun Heo
  2010-08-13 12:55 ` Vladislav Bolkhovitin
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 109+ messages in thread
From: Christoph Hellwig @ 2010-08-13 11:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
The patchset looks functionally correct to me, and with a small patch
to make use of WRITE_FLUSH_FUA it survives xfstests, and instrumenting the
underlying qemu shows that we actually get the flush requests where we should.
No performance or power fail testing done yet.
But I do not like the transition very much.  The new WRITE_FLUSH_FUA
request is exactly what filesystems expect from a current barrier
request, so I'd rather move to that functionality without breaking stuff
in between.
So if it were up to me I'd keep patches 1, 2, 4 and 5 from your series,
then a main one to relax barrier semantics, then have the renaming patches
7 and 8, and possibly keep patch 11 separate from the main implementation
change, and if absolutely necessary also a separate one to introduce
REQ_FUA and REQ_FLUSH in the bio interface, but keep things working while
doing this.
Then we can have patches to disable the reiserfs barrier "optimization" as
the very first one, and DM/MD support which I'm currently working on
as the last one and we can start doing the heavy testing.
^ permalink raw reply	[flat|nested] 109+ messages in thread
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-13 11:48 ` [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Christoph Hellwig
@ 2010-08-13 13:48   ` Tejun Heo
  2010-08-13 14:38     ` Christoph Hellwig
  0 siblings, 1 reply; 109+ messages in thread
From: Tejun Heo @ 2010-08-13 13:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
Hello, Christoph.
On 08/13/2010 01:48 PM, Christoph Hellwig wrote:
> The patchset looks functionally correct to me, and with a small patch
> to make use of WRITE_FLUSH_FUA it survives xfstests, and instrumenting the
> underlying qemu shows that we actually get the flush requests where we should.
Great.
> No performance or power fail testing done yet.
>
> But I do not like the transition very much.  The new WRITE_FLUSH_FUA
> request is exactly what filesystems expect from a current barrier
> request, so I'd rather move to that functionality without breaking stuff
> in between.
> 
> So if it were up to me I'd keep patches 1, 2, 4 and 5 from your series,
> then a main one to relax barrier semantics, then have the renaming patches
> 7 and 8, and possibly keep patch 11 separate from the main implementation
> change, and if absolutely necessary also a separate one to introduce
> REQ_FUA and REQ_FLUSH in the bio interface, but keep things working while
> doing this.
There are two reasons to avoid changing the meaning of REQ_HARDBARRIER
and just deprecate it.  One is to avoid breaking filesystems'
expectations underneath it.  Please note that there are out-of-tree
filesystems too.  I think it would be too dangerous to relax
REQ_HARDBARRIER.
Another is that pseudo block layer drivers (loop, virtio_blk,
md/dm...) have assumptions about REQ_HARDBARRIER behavior and things
would be broken in obscure ways between REQ_HARDBARRIER semantics
change and updates to each of those drivers, so I don't really think
changing the semantics while the mechanism is online is a good idea.
> Then we can have patches to disable the reiserfs barrier "optimization" as
> the very first one, and DM/MD support which I'm currently working on
> as the last one and we can start doing the heavy testing.
Oops, I've already converted loop, virtio_blk/lguest and am working on
md/dm right now too.  I'm almost done with md and now doing dm. :-)
Maybe we should post them right now so that we don't waste too much
time trying to solve the same problems?
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-13 13:48   ` Tejun Heo
@ 2010-08-13 14:38     ` Christoph Hellwig
  2010-08-13 14:51       ` Tejun Heo
  0 siblings, 1 reply; 109+ messages in thread
From: Christoph Hellwig @ 2010-08-13 14:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Hellwig, jaxboe, linux-fsdevel, linux-scsi, linux-ide,
	linux-kernel, linux-raid, James.Bottomley, tytso, chris.mason,
	swhiteho, konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
On Fri, Aug 13, 2010 at 03:48:59PM +0200, Tejun Heo wrote:
> There are two reasons to avoid changing the meaning of REQ_HARDBARRIER
> and just deprecate it.  One is to avoid breaking filesystems'
> expectations underneath it.  Please note that there are out-of-tree
> filesystems too.  I think it would be too dangerous to relax
> REQ_HARDBARRIER.
Note that the renaming patch would include a move from REQ_HARDBARRIER
to REQ_FLUSH_FUA, so things just using REQ_HARDBARRIER will fail to
compile.  And while out-of-tree filesystems do exist, it's their
problem to keep up with kernel changes.  They decide not to be part
of the Linux kernel, so it'll be their job to keep up with it.
> Another is that pseudo block layer drivers (loop, virtio_blk,
> md/dm...) have assumptions about REQ_HARDBARRIER behavior and things
> would be broken in obscure ways between REQ_HARDBARRIER semantics
> change and updates to each of those drivers, so I don't really think
> changing the semantics while the mechanism is online is a good idea.
I don't think doing those changes in a separate commit is a good idea.
> > Then we can have patches to disable the reiserfs barrier "optimization" as
> > the very first one, and DM/MD support which I'm currently working on
> > as the last one and we can start doing the heavy testing.
> 
> Oops, I've already converted loop, virtio_blk/lguest and am working on
> md/dm right now too.  I'm almost done with md and now doing dm. :-)
> Maybe we should post them right now so that we don't waste too much
> time trying to solve the same problems?
Here's the dm patch.  So far it only handles normal bio-based dm, which
I understand and can test.  Request-based dm (multipath) still needs
work.
Index: linux-2.6/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-crypt.c	2010-08-13 16:11:04.207010218 +0200
+++ linux-2.6/drivers/md/dm-crypt.c	2010-08-13 16:11:10.048003862 +0200
@@ -1249,7 +1249,7 @@ static int crypt_map(struct dm_target *t
 	struct dm_crypt_io *io;
 	struct crypt_config *cc;
 
-	if (unlikely(bio_empty_barrier(bio))) {
+	if (bio_empty_flush(bio)) {
 		cc = ti->private;
 		bio->bi_bdev = cc->dev->bdev;
 		return DM_MAPIO_REMAPPED;
Index: linux-2.6/drivers/md/dm-io.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-io.c	2010-08-13 16:11:04.213011894 +0200
+++ linux-2.6/drivers/md/dm-io.c	2010-08-13 16:11:10.049003792 +0200
@@ -364,7 +364,7 @@ static void dispatch_io(int rw, unsigned
 	 */
 	for (i = 0; i < num_regions; i++) {
 		*dp = old_pages;
-		if (where[i].count || (rw & REQ_HARDBARRIER))
+		if (where[i].count || (rw & REQ_FLUSH))
 			do_region(rw, i, where + i, dp, io);
 	}
 
@@ -412,8 +412,8 @@ retry:
 	}
 	set_current_state(TASK_RUNNING);
 
-	if (io->eopnotsupp_bits && (rw & REQ_HARDBARRIER)) {
-		rw &= ~REQ_HARDBARRIER;
+	if (io->eopnotsupp_bits && (rw & REQ_FLUSH)) {
+		rw &= ~REQ_FLUSH;
 		goto retry;
 	}
 
Index: linux-2.6/drivers/md/dm-raid1.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-raid1.c	2010-08-13 16:11:04.220013431 +0200
+++ linux-2.6/drivers/md/dm-raid1.c	2010-08-13 16:11:10.054018319 +0200
@@ -670,7 +670,7 @@ static void do_writes(struct mirror_set
 	bio_list_init(&requeue);
 
 	while ((bio = bio_list_pop(writes))) {
-		if (unlikely(bio_empty_barrier(bio))) {
+		if (bio_empty_flush(bio)) {
 			bio_list_add(&sync, bio);
 			continue;
 		}
@@ -1203,7 +1203,7 @@ static int mirror_end_io(struct dm_targe
 	 * We need to dec pending if this was a write.
 	 */
 	if (rw == WRITE) {
-		if (likely(!bio_empty_barrier(bio)))
+		if (!bio_empty_flush(bio))
 			dm_rh_dec(ms->rh, map_context->ll);
 		return error;
 	}
Index: linux-2.6/drivers/md/dm-region-hash.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-region-hash.c	2010-08-13 16:11:04.228004631 +0200
+++ linux-2.6/drivers/md/dm-region-hash.c	2010-08-13 16:11:10.060003932 +0200
@@ -399,7 +399,7 @@ void dm_rh_mark_nosync(struct dm_region_
 	region_t region = dm_rh_bio_to_region(rh, bio);
 	int recovering = 0;
 
-	if (bio_empty_barrier(bio)) {
+	if (bio_empty_flush(bio)) {
 		rh->barrier_failure = 1;
 		return;
 	}
@@ -524,7 +524,7 @@ void dm_rh_inc_pending(struct dm_region_
 	struct bio *bio;
 
 	for (bio = bios->head; bio; bio = bio->bi_next) {
-		if (bio_empty_barrier(bio))
+		if (bio_empty_flush(bio))
 			continue;
 		rh_inc(rh, dm_rh_bio_to_region(rh, bio));
 	}
Index: linux-2.6/drivers/md/dm-snap.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-snap.c	2010-08-13 16:11:04.238004701 +0200
+++ linux-2.6/drivers/md/dm-snap.c	2010-08-13 16:11:10.067005677 +0200
@@ -1581,7 +1581,7 @@ static int snapshot_map(struct dm_target
 	chunk_t chunk;
 	struct dm_snap_pending_exception *pe = NULL;
 
-	if (unlikely(bio_empty_barrier(bio))) {
+	if (bio_empty_flush(bio)) {
 		bio->bi_bdev = s->cow->bdev;
 		return DM_MAPIO_REMAPPED;
 	}
@@ -1685,7 +1685,7 @@ static int snapshot_merge_map(struct dm_
 	int r = DM_MAPIO_REMAPPED;
 	chunk_t chunk;
 
-	if (unlikely(bio_empty_barrier(bio))) {
+	if (bio_empty_flush(bio)) {
 		if (!map_context->flush_request)
 			bio->bi_bdev = s->origin->bdev;
 		else
@@ -2123,7 +2123,7 @@ static int origin_map(struct dm_target *
 	struct dm_dev *dev = ti->private;
 	bio->bi_bdev = dev->bdev;
 
-	if (unlikely(bio_empty_barrier(bio)))
+	if (bio_empty_flush(bio))
 		return DM_MAPIO_REMAPPED;
 
 	/* Only tell snapshots if this is a write */
Index: linux-2.6/drivers/md/dm-stripe.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-stripe.c	2010-08-13 16:11:04.247011266 +0200
+++ linux-2.6/drivers/md/dm-stripe.c	2010-08-13 16:11:10.072026629 +0200
@@ -214,7 +214,7 @@ static int stripe_map(struct dm_target *
 	sector_t offset, chunk;
 	uint32_t stripe;
 
-	if (unlikely(bio_empty_barrier(bio))) {
+	if (bio_empty_flush(bio)) {
 		BUG_ON(map_context->flush_request >= sc->stripes);
 		bio->bi_bdev = sc->stripe[map_context->flush_request].dev->bdev;
 		return DM_MAPIO_REMAPPED;
Index: linux-2.6/drivers/md/dm.c
===================================================================
--- linux-2.6.orig/drivers/md/dm.c	2010-08-13 16:11:04.256004631 +0200
+++ linux-2.6/drivers/md/dm.c	2010-08-13 16:11:37.152005462 +0200
@@ -139,17 +139,6 @@ struct mapped_device {
 	spinlock_t deferred_lock;
 
 	/*
-	 * An error from the barrier request currently being processed.
-	 */
-	int barrier_error;
-
-	/*
-	 * Protect barrier_error from concurrent endio processing
-	 * in request-based dm.
-	 */
-	spinlock_t barrier_error_lock;
-
-	/*
 	 * Processing queue (flush/barriers)
 	 */
 	struct workqueue_struct *wq;
@@ -194,9 +183,6 @@ struct mapped_device {
 
 	/* sysfs handle */
 	struct kobject kobj;
-
-	/* zero-length barrier that will be cloned and submitted to targets */
-	struct bio barrier_bio;
 };
 
 /*
@@ -505,10 +491,6 @@ static void end_io_acct(struct dm_io *io
 	part_stat_add(cpu, &dm_disk(md)->part0, ticks[rw], duration);
 	part_stat_unlock();
 
-	/*
-	 * After this is decremented the bio must not be touched if it is
-	 * a barrier.
-	 */
 	dm_disk(md)->part0.in_flight[rw] = pending =
 		atomic_dec_return(&md->pending[rw]);
 	pending += atomic_read(&md->pending[rw^0x1]);
@@ -621,7 +603,7 @@ static void dec_pending(struct dm_io *io
 			 */
 			spin_lock_irqsave(&md->deferred_lock, flags);
 			if (__noflush_suspending(md)) {
-				if (!(io->bio->bi_rw & REQ_HARDBARRIER))
+				if (!(io->bio->bi_rw & (REQ_FLUSH|REQ_FUA)))
 					bio_list_add_head(&md->deferred,
 							  io->bio);
 			} else
@@ -633,25 +615,13 @@ static void dec_pending(struct dm_io *io
 		io_error = io->error;
 		bio = io->bio;
 
-		if (bio->bi_rw & REQ_HARDBARRIER) {
-			/*
-			 * There can be just one barrier request so we use
-			 * a per-device variable for error reporting.
-			 * Note that you can't touch the bio after end_io_acct
-			 */
-			if (!md->barrier_error && io_error != -EOPNOTSUPP)
-				md->barrier_error = io_error;
-			end_io_acct(io);
-			free_io(md, io);
-		} else {
-			end_io_acct(io);
-			free_io(md, io);
+		end_io_acct(io);
+		free_io(md, io);
 
-			if (io_error != DM_ENDIO_REQUEUE) {
-				trace_block_bio_complete(md->queue, bio);
+		if (io_error != DM_ENDIO_REQUEUE) {
+			trace_block_bio_complete(md->queue, bio);
 
-				bio_endio(bio, io_error);
-			}
+			bio_endio(bio, io_error);
 		}
 	}
 }
@@ -744,23 +714,6 @@ static void end_clone_bio(struct bio *cl
 	blk_update_request(tio->orig, 0, nr_bytes);
 }
 
-static void store_barrier_error(struct mapped_device *md, int error)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&md->barrier_error_lock, flags);
-	/*
-	 * Basically, the first error is taken, but:
-	 *   -EOPNOTSUPP supersedes any I/O error.
-	 *   Requeue request supersedes any I/O error but -EOPNOTSUPP.
-	 */
-	if (!md->barrier_error || error == -EOPNOTSUPP ||
-	    (md->barrier_error != -EOPNOTSUPP &&
-	     error == DM_ENDIO_REQUEUE))
-		md->barrier_error = error;
-	spin_unlock_irqrestore(&md->barrier_error_lock, flags);
-}
-
 /*
  * Don't touch any member of the md after calling this function because
  * the md may be freed in dm_put() at the end of this function.
@@ -798,13 +751,11 @@ static void free_rq_clone(struct request
 static void dm_end_request(struct request *clone, int error)
 {
 	int rw = rq_data_dir(clone);
-	int run_queue = 1;
-	bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct mapped_device *md = tio->md;
 	struct request *rq = tio->orig;
 
-	if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
+	if (rq->cmd_type == REQ_TYPE_BLOCK_PC) {
 		rq->errors = clone->errors;
 		rq->resid_len = clone->resid_len;
 
@@ -818,15 +769,8 @@ static void dm_end_request(struct reques
 	}
 
 	free_rq_clone(clone);
-
-	if (unlikely(is_barrier)) {
-		if (unlikely(error))
-			store_barrier_error(md, error);
-		run_queue = 0;
-	} else
-		blk_end_request_all(rq, error);
-
-	rq_completed(md, rw, run_queue);
+	blk_end_request_all(rq, error);
+	rq_completed(md, rw, 1);
 }
 
 static void dm_unprep_request(struct request *rq)
@@ -1113,7 +1057,7 @@ static struct bio *split_bvec(struct bio
 
 	clone->bi_sector = sector;
 	clone->bi_bdev = bio->bi_bdev;
-	clone->bi_rw = bio->bi_rw & ~REQ_HARDBARRIER;
+	clone->bi_rw = bio->bi_rw;
 	clone->bi_vcnt = 1;
 	clone->bi_size = to_bytes(len);
 	clone->bi_io_vec->bv_offset = offset;
@@ -1140,7 +1084,6 @@ static struct bio *clone_bio(struct bio
 
 	clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
 	__bio_clone(clone, bio);
-	clone->bi_rw &= ~REQ_HARDBARRIER;
 	clone->bi_destructor = dm_bio_destructor;
 	clone->bi_sector = sector;
 	clone->bi_idx = idx;
@@ -1186,7 +1129,7 @@ static void __flush_target(struct clone_
 	__map_bio(ti, clone, tio);
 }
 
-static int __clone_and_map_empty_barrier(struct clone_info *ci)
+static int __clone_and_map_empty_flush(struct clone_info *ci)
 {
 	unsigned target_nr = 0, flush_nr;
 	struct dm_target *ti;
@@ -1208,8 +1151,8 @@ static int __clone_and_map(struct clone_
 	sector_t len = 0, max;
 	struct dm_target_io *tio;
 
-	if (unlikely(bio_empty_barrier(bio)))
-		return __clone_and_map_empty_barrier(ci);
+	if (bio_empty_flush(bio))
+		return __clone_and_map_empty_flush(ci);
 
 	ti = dm_table_find_target(ci->map, ci->sector);
 	if (!dm_target_is_valid(ti))
@@ -1308,11 +1251,7 @@ static void __split_and_process_bio(stru
 
 	ci.map = dm_get_live_table(md);
 	if (unlikely(!ci.map)) {
-		if (!(bio->bi_rw & REQ_HARDBARRIER))
-			bio_io_error(bio);
-		else
-			if (!md->barrier_error)
-				md->barrier_error = -EIO;
+		bio_io_error(bio);
 		return;
 	}
 
@@ -1326,7 +1265,7 @@ static void __split_and_process_bio(stru
 	spin_lock_init(&ci.io->endio_lock);
 	ci.sector = bio->bi_sector;
 	ci.sector_count = bio_sectors(bio);
-	if (unlikely(bio_empty_barrier(bio)))
+	if (bio_empty_flush(bio))
 		ci.sector_count = 1;
 	ci.idx = bio->bi_idx;
 
@@ -1420,8 +1359,7 @@ static int _dm_request(struct request_qu
 	 * If we're suspended or the thread is processing barriers
 	 * we have to queue this io for later.
 	 */
-	if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
-	    unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
+	if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags))) {
 		up_read(&md->io_lock);
 
 		if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) &&
@@ -1873,7 +1811,6 @@ static struct mapped_device *alloc_dev(i
 	init_rwsem(&md->io_lock);
 	mutex_init(&md->suspend_lock);
 	spin_lock_init(&md->deferred_lock);
-	spin_lock_init(&md->barrier_error_lock);
 	rwlock_init(&md->map_lock);
 	atomic_set(&md->holders, 1);
 	atomic_set(&md->open_count, 0);
@@ -2233,38 +2170,6 @@ static int dm_wait_for_completion(struct
 	return r;
 }
 
-static void dm_flush(struct mapped_device *md)
-{
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-
-	bio_init(&md->barrier_bio);
-	md->barrier_bio.bi_bdev = md->bdev;
-	md->barrier_bio.bi_rw = WRITE_BARRIER;
-	__split_and_process_bio(md, &md->barrier_bio);
-
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-}
-
-static void process_barrier(struct mapped_device *md, struct bio *bio)
-{
-	md->barrier_error = 0;
-
-	dm_flush(md);
-
-	if (!bio_empty_barrier(bio)) {
-		__split_and_process_bio(md, bio);
-		dm_flush(md);
-	}
-
-	if (md->barrier_error != DM_ENDIO_REQUEUE)
-		bio_endio(bio, md->barrier_error);
-	else {
-		spin_lock_irq(&md->deferred_lock);
-		bio_list_add_head(&md->deferred, bio);
-		spin_unlock_irq(&md->deferred_lock);
-	}
-}
-
 /*
  * Process the deferred bios
  */
@@ -2290,12 +2195,8 @@ static void dm_wq_work(struct work_struc
 
 		if (dm_request_based(md))
 			generic_make_request(c);
-		else {
-			if (c->bi_rw & REQ_HARDBARRIER)
-				process_barrier(md, c);
-			else
-				__split_and_process_bio(md, c);
-		}
+		else
+			__split_and_process_bio(md, c);
 
 		down_write(&md->io_lock);
 	}
@@ -2326,8 +2227,6 @@ static int dm_rq_barrier(struct mapped_d
 	struct dm_target *ti;
 	struct request *clone;
 
-	md->barrier_error = 0;
-
 	for (i = 0; i < num_targets; i++) {
 		ti = dm_table_get_target(map, i);
 		for (j = 0; j < ti->num_flush_requests; j++) {
@@ -2341,7 +2240,7 @@ static int dm_rq_barrier(struct mapped_d
 	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
 	dm_table_put(map);
 
-	return md->barrier_error;
+	return 0;
 }
 
 static void dm_rq_barrier_work(struct work_struct *work)
Index: linux-2.6/include/linux/bio.h
===================================================================
--- linux-2.6.orig/include/linux/bio.h	2010-08-13 16:11:04.268004351 +0200
+++ linux-2.6/include/linux/bio.h	2010-08-13 16:11:10.082005677 +0200
@@ -66,8 +66,8 @@
 #define bio_offset(bio)		bio_iovec((bio))->bv_offset
 #define bio_segments(bio)	((bio)->bi_vcnt - (bio)->bi_idx)
 #define bio_sectors(bio)	((bio)->bi_size >> 9)
-#define bio_empty_barrier(bio) \
-	((bio->bi_rw & REQ_HARDBARRIER) && \
+#define bio_empty_flush(bio) \
+	((bio->bi_rw & REQ_FLUSH) && \
 	 !bio_has_data(bio) && \
 	 !(bio->bi_rw & REQ_DISCARD))
 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-13 14:38     ` Christoph Hellwig
@ 2010-08-13 14:51       ` Tejun Heo
  2010-08-14 10:36         ` Christoph Hellwig
  0 siblings, 1 reply; 109+ messages in thread
From: Tejun Heo @ 2010-08-13 14:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
Hello,
On 08/13/2010 04:38 PM, Christoph Hellwig wrote:
> On Fri, Aug 13, 2010 at 03:48:59PM +0200, Tejun Heo wrote:
>> There are two reasons to avoid changing the meaning of REQ_HARDBARRIER
>> and just deprecate it.  One is to avoid breaking filesystems'
>> expectations underneath it.  Please note that there are out-of-tree
>> filesystems too.  I think it would be too dangerous to relax
>> REQ_HARDBARRIER.
> 
> Note that the renaming patch would include a move from REQ_HARDBARRIER
> to REQ_FLUSH_FUA, so things just using REQ_HARDBARRIER will fail to
> compile.  And while out of tree filesystems do exist, it's their
> problem to keep up with kernel changes.  They decide not to be part
> of the Linux kernel, so it'll be their job to keep up with it.
Oh, right, we can simply remove REQ_HARDBARRIER completely.
>> Another is that pseudo block layer drivers (loop, virtio_blk,
>> md/dm...) have assumptions about REQ_HARDBARRIER behavior and things
>> would be broken in obscure ways between REQ_HARDBARRIER semantics
>> change and updates to each of those drivers, so I don't really think
>> changing the semantics while the mechanism is online is a good idea.
> 
> I don't think doing those changes in a separate commit is a good idea.
Do you want to change the whole thing in a single commit?  That would
be a pretty big invasive patch touching multiple subsystems.  Also, I
don't know what to do about drbd and would like to leave its
conversion to the maintainer (in separate patches).
Eh, well, this is mostly logistics.  Jens, what do you think?
>>> Then we can have patches to disable the reiserfs barrier "optimization" as
>>> the very first one, and DM/MD support which I'm currently working on
>>> as the last one and we can start doing the heavy testing.
>>
>> Oops, I've already converted loop, virtio_blk/lguest and am working on
>> md/dm right now too.  I'm almost done with md and now doing dm. :-)
>> Maybe we should post them right now so that we don't waste too much
>> time trying to solve the same problems?
> 
> Here's the dm patch.  So far it only handles normal bio-based dm, which
> I understand and can test.  Request-based dm (multipath) still needs
> work.
Here's the combined patch I've been working on.  I've verified loop
and virtio_blk/lguest.  I just (like five minutes ago) got the md/dm
conversion compiling, so I'm sure they're broken.  The neat part is that thanks
to the separation between REQ_FLUSH and FUA handling, bio mangling
drivers only have to sequence the pre-flush and pass FUA directly to
lower layers which in many cases saves an array-wide cache flush
cycle.
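
To illustrate the saving, here is a minimal userspace sketch, not kernel
code.  Every name in it (the SK_* flags, struct sk_member, the sk_*()
helpers) is hypothetical and exists only for this example.  It assumes an
array of n members where one write lands on a single member, and it only
counts how many cache flushes each scheme would issue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch only; NOT the kernel's values. */
#define SK_REQ_FLUSH	(1u << 0)	/* flush caches before the write */
#define SK_REQ_FUA	(1u << 1)	/* the write itself must reach media */

/* Per-member counters standing in for a real lower-level device. */
struct sk_member {
	unsigned int flushes;		/* cache flushes received */
	unsigned int writes;		/* data writes received */
	unsigned int fua_writes;	/* data writes that carried FUA */
};

/* Send one cache flush to every member of the array. */
static void sk_flush_all(struct sk_member *m, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		m[i].flushes++;
}

/* Old barrier model: flush everything, write, flush everything again. */
static void sk_submit_barrier_write(struct sk_member *m, size_t n,
				    size_t target)
{
	assert(target < n);
	sk_flush_all(m, n);
	m[target].writes++;
	sk_flush_all(m, n);
}

/*
 * Flush/FUA model: only the pre-flush is sequenced by the stacking
 * driver; FUA is handed to the member that receives the data.
 */
static void sk_submit_write(struct sk_member *m, size_t n, size_t target,
			    unsigned int flags)
{
	assert(target < n);
	if (flags & SK_REQ_FLUSH)
		sk_flush_all(m, n);
	m[target].writes++;
	if (flags & SK_REQ_FUA)
		m[target].fua_writes++;
}

/* Sum of cache flushes seen across the whole array. */
static unsigned int sk_total_flushes(const struct sk_member *m, size_t n)
{
	unsigned int total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += m[i].flushes;
	return total;
}
```

In this model, with four members, a write carrying both flags costs one
round of flushes (four in total) plus a single FUA write on the target,
whereas the barrier-style path costs two rounds (eight in total).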
After getting this patch working, the only remaining bits would be
blktrace and drbd.
Thanks.
 Documentation/lguest/lguest.c   |   36 +++-----
 drivers/block/loop.c            |   18 ++--
 drivers/block/virtio_blk.c      |   26 ++---
 drivers/md/dm-io.c              |   20 ----
 drivers/md/dm-log.c             |    2
 drivers/md/dm-raid1.c           |    8 -
 drivers/md/dm-snap-persistent.c |    2
 drivers/md/dm.c                 |  176 +++++++++++++++++++--------------------
 drivers/md/linear.c             |    4
 drivers/md/md.c                 |  117 +++++---------------------
 drivers/md/md.h                 |   23 +----
 drivers/md/multipath.c          |    4
 drivers/md/raid0.c              |    4
 drivers/md/raid1.c              |  178 +++++++++++++---------------------------
 drivers/md/raid1.h              |    2
 drivers/md/raid10.c             |    6 -
 drivers/md/raid5.c              |   18 +---
 include/linux/virtio_blk.h      |    6 +
 18 files changed, 244 insertions(+), 406 deletions(-)
Index: block/drivers/block/loop.c
===================================================================
--- block.orig/drivers/block/loop.c
+++ block/drivers/block/loop.c
@@ -477,17 +477,17 @@ static int do_bio_filebacked(struct loop
 	pos = ((loff_t) bio->bi_sector << 9) + lo->lo_offset;
 	if (bio_rw(bio) == WRITE) {
-		bool barrier = (bio->bi_rw & REQ_HARDBARRIER);
 		struct file *file = lo->lo_backing_file;
-		if (barrier) {
-			if (unlikely(!file->f_op->fsync)) {
-				ret = -EOPNOTSUPP;
-				goto out;
-			}
+		/* REQ_HARDBARRIER is deprecated */
+		if (bio->bi_rw & REQ_HARDBARRIER) {
+			ret = -EOPNOTSUPP;
+			goto out;
+		}
+		if (bio->bi_rw & REQ_FLUSH) {
 			ret = vfs_fsync(file, 0);
-			if (unlikely(ret)) {
+			if (unlikely(ret && ret != -EINVAL)) {
 				ret = -EIO;
 				goto out;
 			}
@@ -495,9 +495,9 @@ static int do_bio_filebacked(struct loop
 		ret = lo_send(lo, bio, pos);
-		if (barrier && !ret) {
+		if ((bio->bi_rw & REQ_FUA) && !ret) {
 			ret = vfs_fsync(file, 0);
-			if (unlikely(ret))
+			if (unlikely(ret && ret != -EINVAL))
 				ret = -EIO;
 		}
 	} else
Index: block/drivers/block/virtio_blk.c
===================================================================
--- block.orig/drivers/block/virtio_blk.c
+++ block/drivers/block/virtio_blk.c
@@ -128,9 +128,6 @@ static bool do_req(struct request_queue
 		}
 	}
-	if (vbr->req->cmd_flags & REQ_HARDBARRIER)
-		vbr->out_hdr.type |= VIRTIO_BLK_T_BARRIER;
-
 	sg_set_buf(&vblk->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
 	/*
@@ -157,6 +154,8 @@ static bool do_req(struct request_queue
 		if (rq_data_dir(vbr->req) == WRITE) {
 			vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
 			out += num;
+			if (req->cmd_flags & REQ_FUA)
+				vbr->out_hdr.type |= VIRTIO_BLK_T_FUA;
 		} else {
 			vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
 			in += num;
@@ -307,6 +306,7 @@ static int __devinit virtblk_probe(struc
 {
 	struct virtio_blk *vblk;
 	struct request_queue *q;
+	unsigned int flush;
 	int err;
 	u64 cap;
 	u32 v, blk_size, sg_elems, opt_io_size;
@@ -388,15 +388,13 @@ static int __devinit virtblk_probe(struc
 	vblk->disk->driverfs_dev = &vdev->dev;
 	index++;
-	/*
-	 * If the FLUSH feature is supported we do have support for
-	 * flushing a volatile write cache on the host.  Use that to
-	 * implement write barrier support; otherwise, we must assume
-	 * that the host does not perform any kind of volatile write
-	 * caching.
-	 */
+	/* configure queue flush support */
+	flush = 0;
 	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FLUSH))
-		blk_queue_flush(q, REQ_FLUSH);
+		flush |= REQ_FLUSH;
+	if (virtio_has_feature(vdev, VIRTIO_BLK_F_FUA))
+		flush |= REQ_FUA;
+	blk_queue_flush(q, flush);
 	/* If disk is read-only in the host, the guest should obey */
 	if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))
@@ -515,9 +513,9 @@ static const struct virtio_device_id id_
 };
 static unsigned int features[] = {
-	VIRTIO_BLK_F_BARRIER, VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX,
-	VIRTIO_BLK_F_GEOMETRY, VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
-	VIRTIO_BLK_F_SCSI, VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY
+	VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
+	VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE, VIRTIO_BLK_F_SCSI,
+	VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_FUA,
 };
 /*
Index: block/include/linux/virtio_blk.h
===================================================================
--- block.orig/include/linux/virtio_blk.h
+++ block/include/linux/virtio_blk.h
@@ -16,6 +16,7 @@
 #define VIRTIO_BLK_F_SCSI	7	/* Supports scsi command passthru */
 #define VIRTIO_BLK_F_FLUSH	9	/* Cache flush command support */
 #define VIRTIO_BLK_F_TOPOLOGY	10	/* Topology information is available */
+#define VIRTIO_BLK_F_FUA	11	/* Forced Unit Access write support */
 #define VIRTIO_BLK_ID_BYTES	20	/* ID string length */
@@ -70,7 +71,10 @@ struct virtio_blk_config {
 #define VIRTIO_BLK_T_FLUSH	4
 /* Get device ID command */
-#define VIRTIO_BLK_T_GET_ID    8
+#define VIRTIO_BLK_T_GET_ID	8
+
+/* FUA command */
+#define VIRTIO_BLK_T_FUA	16
 /* Barrier before this op. */
 #define VIRTIO_BLK_T_BARRIER	0x80000000
Index: block/Documentation/lguest/lguest.c
===================================================================
--- block.orig/Documentation/lguest/lguest.c
+++ block/Documentation/lguest/lguest.c
@@ -1639,15 +1639,6 @@ static void blk_request(struct virtqueue
 	off = out->sector * 512;
 	/*
-	 * The block device implements "barriers", where the Guest indicates
-	 * that it wants all previous writes to occur before this write.  We
-	 * don't have a way of asking our kernel to do a barrier, so we just
-	 * synchronize all the data in the file.  Pretty poor, no?
-	 */
-	if (out->type & VIRTIO_BLK_T_BARRIER)
-		fdatasync(vblk->fd);
-
-	/*
 	 * In general the virtio block driver is allowed to try SCSI commands.
 	 * It'd be nice if we supported eject, for example, but we don't.
 	 */
@@ -1679,6 +1670,19 @@ static void blk_request(struct virtqueue
 			/* Die, bad Guest, die. */
 			errx(1, "Write past end %llu+%u", off, ret);
 		}
+
+		/* Honor FUA by syncing everything. */
+		if (ret >= 0 && (out->type & VIRTIO_BLK_T_FUA)) {
+			ret = fdatasync(vblk->fd);
+			verbose("FUA fdatasync: %i\n", ret);
+		}
+
+		wlen = sizeof(*in);
+		*in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
+	} else if (out->type & VIRTIO_BLK_T_FLUSH) {
+		/* Flush */
+		ret = fdatasync(vblk->fd);
+		verbose("FLUSH fdatasync: %i\n", ret);
 		wlen = sizeof(*in);
 		*in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
 	} else {
@@ -1702,15 +1706,6 @@ static void blk_request(struct virtqueue
 		}
 	}
-	/*
-	 * OK, so we noted that it was pretty poor to use an fdatasync as a
-	 * barrier.  But Christoph Hellwig points out that we need a sync
-	 * *afterwards* as well: "Barriers specify no reordering to the front
-	 * or the back."  And Jens Axboe confirmed it, so here we are:
-	 */
-	if (out->type & VIRTIO_BLK_T_BARRIER)
-		fdatasync(vblk->fd);
-
 	/* Finished that request. */
 	add_used(vq, head, wlen);
 }
@@ -1735,8 +1730,9 @@ static void setup_block_file(const char
 	vblk->fd = open_or_die(filename, O_RDWR|O_LARGEFILE);
 	vblk->len = lseek64(vblk->fd, 0, SEEK_END);
-	/* We support barriers. */
-	add_feature(dev, VIRTIO_BLK_F_BARRIER);
+	/* We support FLUSH and FUA. */
+	add_feature(dev, VIRTIO_BLK_F_FLUSH);
+	add_feature(dev, VIRTIO_BLK_F_FUA);
 	/* Tell Guest how many sectors this device has. */
 	conf.capacity = cpu_to_le64(vblk->len / 512);
Index: block/drivers/md/linear.c
===================================================================
--- block.orig/drivers/md/linear.c
+++ block/drivers/md/linear.c
@@ -294,8 +294,8 @@ static int linear_make_request (mddev_t
 	dev_info_t *tmp_dev;
 	sector_t start_sector;
-	if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
-		md_barrier_request(mddev, bio);
+	if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+		md_flush_request(mddev, bio);
 		return 0;
 	}
Index: block/drivers/md/md.c
===================================================================
--- block.orig/drivers/md/md.c
+++ block/drivers/md/md.c
@@ -226,12 +226,12 @@ static int md_make_request(struct reques
 		return 0;
 	}
 	rcu_read_lock();
-	if (mddev->suspended || mddev->barrier) {
+	if (mddev->suspended) {
 		DEFINE_WAIT(__wait);
 		for (;;) {
 			prepare_to_wait(&mddev->sb_wait, &__wait,
 					TASK_UNINTERRUPTIBLE);
-			if (!mddev->suspended && !mddev->barrier)
+			if (!mddev->suspended)
 				break;
 			rcu_read_unlock();
 			schedule();
@@ -280,40 +280,29 @@ static void mddev_resume(mddev_t *mddev)
 int mddev_congested(mddev_t *mddev, int bits)
 {
-	if (mddev->barrier)
-		return 1;
 	return mddev->suspended;
 }
 EXPORT_SYMBOL(mddev_congested);
 /*
- * Generic barrier handling for md
+ * Generic flush handling for md
  */
-#define POST_REQUEST_BARRIER ((void*)1)
-
-static void md_end_barrier(struct bio *bio, int err)
+static void md_end_flush(struct bio *bio, int err)
 {
 	mdk_rdev_t *rdev = bio->bi_private;
 	mddev_t *mddev = rdev->mddev;
-	if (err == -EOPNOTSUPP && mddev->barrier != POST_REQUEST_BARRIER)
-		set_bit(BIO_EOPNOTSUPP, &mddev->barrier->bi_flags);
 	rdev_dec_pending(rdev, mddev);
 	if (atomic_dec_and_test(&mddev->flush_pending)) {
-		if (mddev->barrier == POST_REQUEST_BARRIER) {
-			/* This was a post-request barrier */
-			mddev->barrier = NULL;
-			wake_up(&mddev->sb_wait);
-		} else
-			/* The pre-request barrier has finished */
-			schedule_work(&mddev->barrier_work);
+		/* The pre-request flush has finished */
+		schedule_work(&mddev->flush_work);
 	}
 	bio_put(bio);
 }
-static void submit_barriers(mddev_t *mddev)
+static void submit_flushes(mddev_t *mddev)
 {
 	mdk_rdev_t *rdev;
@@ -330,60 +319,56 @@ static void submit_barriers(mddev_t *mdd
 			atomic_inc(&rdev->nr_pending);
 			rcu_read_unlock();
 			bi = bio_alloc(GFP_KERNEL, 0);
-			bi->bi_end_io = md_end_barrier;
+			bi->bi_end_io = md_end_flush;
 			bi->bi_private = rdev;
 			bi->bi_bdev = rdev->bdev;
 			atomic_inc(&mddev->flush_pending);
-			submit_bio(WRITE_BARRIER, bi);
+			submit_bio(WRITE_FLUSH, bi);
 			rcu_read_lock();
 			rdev_dec_pending(rdev, mddev);
 		}
 	rcu_read_unlock();
 }
-static void md_submit_barrier(struct work_struct *ws)
+static void md_submit_flush_data(struct work_struct *ws)
 {
-	mddev_t *mddev = container_of(ws, mddev_t, barrier_work);
-	struct bio *bio = mddev->barrier;
+	mddev_t *mddev = container_of(ws, mddev_t, flush_work);
+	struct bio *bio = mddev->flush_bio;
 	atomic_set(&mddev->flush_pending, 1);
-	if (test_bit(BIO_EOPNOTSUPP, &bio->bi_flags))
-		bio_endio(bio, -EOPNOTSUPP);
-	else if (bio->bi_size == 0)
+	if (bio->bi_size == 0)
 		/* an empty barrier - all done */
 		bio_endio(bio, 0);
 	else {
-		bio->bi_rw &= ~REQ_HARDBARRIER;
+		bio->bi_rw &= ~REQ_FLUSH;
 		if (mddev->pers->make_request(mddev, bio))
 			generic_make_request(bio);
-		mddev->barrier = POST_REQUEST_BARRIER;
-		submit_barriers(mddev);
 	}
 	if (atomic_dec_and_test(&mddev->flush_pending)) {
-		mddev->barrier = NULL;
+		mddev->flush_bio = NULL;
 		wake_up(&mddev->sb_wait);
 	}
 }
-void md_barrier_request(mddev_t *mddev, struct bio *bio)
+void md_flush_request(mddev_t *mddev, struct bio *bio)
 {
 	spin_lock_irq(&mddev->write_lock);
 	wait_event_lock_irq(mddev->sb_wait,
-			    !mddev->barrier,
+			    !mddev->flush_bio,
 			    mddev->write_lock, /*nothing*/);
-	mddev->barrier = bio;
+	mddev->flush_bio = bio;
 	spin_unlock_irq(&mddev->write_lock);
 	atomic_set(&mddev->flush_pending, 1);
-	INIT_WORK(&mddev->barrier_work, md_submit_barrier);
+	INIT_WORK(&mddev->flush_work, md_submit_flush_data);
-	submit_barriers(mddev);
+	submit_flushes(mddev);
 	if (atomic_dec_and_test(&mddev->flush_pending))
-		schedule_work(&mddev->barrier_work);
+		schedule_work(&mddev->flush_work);
 }
-EXPORT_SYMBOL(md_barrier_request);
+EXPORT_SYMBOL(md_flush_request);
 static inline mddev_t *mddev_get(mddev_t *mddev)
 {
@@ -642,31 +627,6 @@ static void super_written(struct bio *bi
 	bio_put(bio);
 }
-static void super_written_barrier(struct bio *bio, int error)
-{
-	struct bio *bio2 = bio->bi_private;
-	mdk_rdev_t *rdev = bio2->bi_private;
-	mddev_t *mddev = rdev->mddev;
-
-	if (!test_bit(BIO_UPTODATE, &bio->bi_flags) &&
-	    error == -EOPNOTSUPP) {
-		unsigned long flags;
-		/* barriers don't appear to be supported :-( */
-		set_bit(BarriersNotsupp, &rdev->flags);
-		mddev->barriers_work = 0;
-		spin_lock_irqsave(&mddev->write_lock, flags);
-		bio2->bi_next = mddev->biolist;
-		mddev->biolist = bio2;
-		spin_unlock_irqrestore(&mddev->write_lock, flags);
-		wake_up(&mddev->sb_wait);
-		bio_put(bio);
-	} else {
-		bio_put(bio2);
-		bio->bi_private = rdev;
-		super_written(bio, error);
-	}
-}
-
 void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
 		   sector_t sector, int size, struct page *page)
 {
@@ -675,51 +635,28 @@ void md_super_write(mddev_t *mddev, mdk_
 	 * and decrement it on completion, waking up sb_wait
 	 * if zero is reached.
 	 * If an error occurred, call md_error
-	 *
-	 * As we might need to resubmit the request if REQ_HARDBARRIER
-	 * causes ENOTSUPP, we allocate a spare bio...
 	 */
 	struct bio *bio = bio_alloc(GFP_NOIO, 1);
-	int rw = REQ_WRITE | REQ_SYNC | REQ_UNPLUG;
 	bio->bi_bdev = rdev->bdev;
 	bio->bi_sector = sector;
 	bio_add_page(bio, page, size, 0);
 	bio->bi_private = rdev;
 	bio->bi_end_io = super_written;
-	bio->bi_rw = rw;
 	atomic_inc(&mddev->pending_writes);
-	if (!test_bit(BarriersNotsupp, &rdev->flags)) {
-		struct bio *rbio;
-		rw |= REQ_HARDBARRIER;
-		rbio = bio_clone(bio, GFP_NOIO);
-		rbio->bi_private = bio;
-		rbio->bi_end_io = super_written_barrier;
-		submit_bio(rw, rbio);
-	} else
-		submit_bio(rw, bio);
+	submit_bio(REQ_WRITE | REQ_SYNC | REQ_UNPLUG | REQ_FLUSH | REQ_FUA,
+		   bio);
 }
 void md_super_wait(mddev_t *mddev)
 {
-	/* wait for all superblock writes that were scheduled to complete.
-	 * if any had to be retried (due to BARRIER problems), retry them
-	 */
+	/* wait for all superblock writes that were scheduled to complete */
 	DEFINE_WAIT(wq);
 	for(;;) {
 		prepare_to_wait(&mddev->sb_wait, &wq, TASK_UNINTERRUPTIBLE);
 		if (atomic_read(&mddev->pending_writes)==0)
 			break;
-		while (mddev->biolist) {
-			struct bio *bio;
-			spin_lock_irq(&mddev->write_lock);
-			bio = mddev->biolist;
-			mddev->biolist = bio->bi_next ;
-			bio->bi_next = NULL;
-			spin_unlock_irq(&mddev->write_lock);
-			submit_bio(bio->bi_rw, bio);
-		}
 		schedule();
 	}
 	finish_wait(&mddev->sb_wait, &wq);
@@ -1016,7 +953,6 @@ static int super_90_validate(mddev_t *md
 	clear_bit(Faulty, &rdev->flags);
 	clear_bit(In_sync, &rdev->flags);
 	clear_bit(WriteMostly, &rdev->flags);
-	clear_bit(BarriersNotsupp, &rdev->flags);
 	if (mddev->raid_disks == 0) {
 		mddev->major_version = 0;
@@ -1431,7 +1367,6 @@ static int super_1_validate(mddev_t *mdd
 	clear_bit(Faulty, &rdev->flags);
 	clear_bit(In_sync, &rdev->flags);
 	clear_bit(WriteMostly, &rdev->flags);
-	clear_bit(BarriersNotsupp, &rdev->flags);
 	if (mddev->raid_disks == 0) {
 		mddev->major_version = 1;
@@ -4463,7 +4398,6 @@ static int md_run(mddev_t *mddev)
 	/* may be over-ridden by personality */
 	mddev->resync_max_sectors = mddev->dev_sectors;
-	mddev->barriers_work = 1;
 	mddev->ok_start_degraded = start_dirty_degraded;
 	if (start_readonly && mddev->ro == 0)
@@ -4638,7 +4572,6 @@ static void md_clean(mddev_t *mddev)
 	mddev->recovery = 0;
 	mddev->in_sync = 0;
 	mddev->degraded = 0;
-	mddev->barriers_work = 0;
 	mddev->safemode = 0;
 	mddev->bitmap_info.offset = 0;
 	mddev->bitmap_info.default_offset = 0;
Index: block/drivers/md/md.h
===================================================================
--- block.orig/drivers/md/md.h
+++ block/drivers/md/md.h
@@ -67,7 +67,6 @@ struct mdk_rdev_s
 #define	Faulty		1		/* device is known to have a fault */
 #define	In_sync		2		/* device is in_sync with rest of array */
 #define	WriteMostly	4		/* Avoid reading if at all possible */
-#define	BarriersNotsupp	5		/* REQ_HARDBARRIER is not supported */
 #define	AllReserved	6		/* If whole device is reserved for
 					 * one array */
 #define	AutoDetected	7		/* added by auto-detect */
@@ -249,13 +248,6 @@ struct mddev_s
 	int				degraded;	/* whether md should consider
 							 * adding a spare
 							 */
-	int				barriers_work;	/* initialised to true, cleared as soon
-							 * as a barrier request to slave
-							 * fails.  Only supported
-							 */
-	struct bio			*biolist; 	/* bios that need to be retried
-							 * because REQ_HARDBARRIER is not supported
-							 */
 	atomic_t			recovery_active; /* blocks scheduled, but not written */
 	wait_queue_head_t		recovery_wait;
@@ -308,16 +300,13 @@ struct mddev_s
 	struct list_head		all_mddevs;
 	struct attribute_group		*to_remove;
-	/* Generic barrier handling.
-	 * If there is a pending barrier request, all other
-	 * writes are blocked while the devices are flushed.
-	 * The last to finish a flush schedules a worker to
-	 * submit the barrier request (without the barrier flag),
-	 * then submit more flush requests.
+	/* Generic flush handling.
+	 * The last to finish preflush schedules a worker to submit
+	 * the rest of the request (without the REQ_FLUSH flag).
 	 */
-	struct bio *barrier;
+	struct bio *flush_bio;
 	atomic_t flush_pending;
-	struct work_struct barrier_work;
+	struct work_struct flush_work;
 };
@@ -458,7 +447,7 @@ extern void md_done_sync(mddev_t *mddev,
 extern void md_error(mddev_t *mddev, mdk_rdev_t *rdev);
 extern int mddev_congested(mddev_t *mddev, int bits);
-extern void md_barrier_request(mddev_t *mddev, struct bio *bio);
+extern void md_flush_request(mddev_t *mddev, struct bio *bio);
 extern void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev,
 			   sector_t sector, int size, struct page *page);
 extern void md_super_wait(mddev_t *mddev);
Index: block/drivers/md/raid0.c
===================================================================
--- block.orig/drivers/md/raid0.c
+++ block/drivers/md/raid0.c
@@ -483,8 +483,8 @@ static int raid0_make_request(mddev_t *m
 	struct strip_zone *zone;
 	mdk_rdev_t *tmp_dev;
-	if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
-		md_barrier_request(mddev, bio);
+	if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+		md_flush_request(mddev, bio);
 		return 0;
 	}
Index: block/drivers/md/raid1.c
===================================================================
--- block.orig/drivers/md/raid1.c
+++ block/drivers/md/raid1.c
@@ -319,83 +319,74 @@ static void raid1_end_write_request(stru
 		if (r1_bio->bios[mirror] == bio)
 			break;
-	if (error == -EOPNOTSUPP && test_bit(R1BIO_Barrier, &r1_bio->state)) {
-		set_bit(BarriersNotsupp, &conf->mirrors[mirror].rdev->flags);
-		set_bit(R1BIO_BarrierRetry, &r1_bio->state);
-		r1_bio->mddev->barriers_work = 0;
-		/* Don't rdev_dec_pending in this branch - keep it for the retry */
-	} else {
+	/*
+	 * 'one mirror IO has finished' event handler:
+	 */
+	r1_bio->bios[mirror] = NULL;
+	to_put = bio;
+	if (!uptodate) {
+		md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
+		/* an I/O failed, we can't clear the bitmap */
+		set_bit(R1BIO_Degraded, &r1_bio->state);
+	} else
 		/*
-		 * this branch is our 'one mirror IO has finished' event handler:
+		 * Set R1BIO_Uptodate in our master bio, so that we
+		 * will return a good error code for to the higher
+		 * levels even if IO on some other mirrored buffer
+		 * fails.
+		 *
+		 * The 'master' represents the composite IO operation
+		 * to user-side. So if something waits for IO, then it
+		 * will wait for the 'master' bio.
 		 */
-		r1_bio->bios[mirror] = NULL;
-		to_put = bio;
-		if (!uptodate) {
-			md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
-			/* an I/O failed, we can't clear the bitmap */
-			set_bit(R1BIO_Degraded, &r1_bio->state);
-		} else
-			/*
-			 * Set R1BIO_Uptodate in our master bio, so that
-			 * we will return a good error code for to the higher
-			 * levels even if IO on some other mirrored buffer fails.
-			 *
-			 * The 'master' represents the composite IO operation to
-			 * user-side. So if something waits for IO, then it will
-			 * wait for the 'master' bio.
-			 */
-			set_bit(R1BIO_Uptodate, &r1_bio->state);
+		set_bit(R1BIO_Uptodate, &r1_bio->state);
+
+	update_head_pos(mirror, r1_bio);
-		update_head_pos(mirror, r1_bio);
+	if (behind) {
+		if (test_bit(WriteMostly, &conf->mirrors[mirror].rdev->flags))
+			atomic_dec(&r1_bio->behind_remaining);
-		if (behind) {
-			if (test_bit(WriteMostly, &conf->mirrors[mirror].rdev->flags))
-				atomic_dec(&r1_bio->behind_remaining);
-
-			/* In behind mode, we ACK the master bio once the I/O has safely
-			 * reached all non-writemostly disks. Setting the Returned bit
-			 * ensures that this gets done only once -- we don't ever want to
-			 * return -EIO here, instead we'll wait */
-
-			if (atomic_read(&r1_bio->behind_remaining) >= (atomic_read(&r1_bio->remaining)-1) &&
-			    test_bit(R1BIO_Uptodate, &r1_bio->state)) {
-				/* Maybe we can return now */
-				if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
-					struct bio *mbio = r1_bio->master_bio;
-					PRINTK(KERN_DEBUG "raid1: behind end write sectors %llu-%llu\n",
-					       (unsigned long long) mbio->bi_sector,
-					       (unsigned long long) mbio->bi_sector +
-					       (mbio->bi_size >> 9) - 1);
-					bio_endio(mbio, 0);
-				}
+		/*
+		 * In behind mode, we ACK the master bio once the I/O
+		 * has safely reached all non-writemostly
+		 * disks. Setting the Returned bit ensures that this
+		 * gets done only once -- we don't ever want to return
+		 * -EIO here, instead we'll wait
+		 */
+		if (atomic_read(&r1_bio->behind_remaining) >= (atomic_read(&r1_bio->remaining)-1) &&
+		    test_bit(R1BIO_Uptodate, &r1_bio->state)) {
+			/* Maybe we can return now */
+			if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
+				struct bio *mbio = r1_bio->master_bio;
+				PRINTK(KERN_DEBUG "raid1: behind end write sectors %llu-%llu\n",
+				       (unsigned long long) mbio->bi_sector,
+				       (unsigned long long) mbio->bi_sector +
+				       (mbio->bi_size >> 9) - 1);
+				bio_endio(mbio, 0);
 			}
 		}
-		rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
 	}
+	rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
+
 	/*
-	 *
 	 * Let's see if all mirrored write operations have finished
 	 * already.
 	 */
 	if (atomic_dec_and_test(&r1_bio->remaining)) {
-		if (test_bit(R1BIO_BarrierRetry, &r1_bio->state))
-			reschedule_retry(r1_bio);
-		else {
-			/* it really is the end of this request */
-			if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
-				/* free extra copy of the data pages */
-				int i = bio->bi_vcnt;
-				while (i--)
-					safe_put_page(bio->bi_io_vec[i].bv_page);
-			}
-			/* clear the bitmap if all writes complete successfully */
-			bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
-					r1_bio->sectors,
-					!test_bit(R1BIO_Degraded, &r1_bio->state),
-					behind);
-			md_write_end(r1_bio->mddev);
-			raid_end_bio_io(r1_bio);
-		}
+		if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
+			/* free extra copy of the data pages */
+			int i = bio->bi_vcnt;
+			while (i--)
+				safe_put_page(bio->bi_io_vec[i].bv_page);
+		}
+		/* clear the bitmap if all writes complete successfully */
+		bitmap_endwrite(r1_bio->mddev->bitmap, r1_bio->sector,
+				r1_bio->sectors,
+				!test_bit(R1BIO_Degraded, &r1_bio->state),
+				behind);
+		md_write_end(r1_bio->mddev);
+		raid_end_bio_io(r1_bio);
 	}
 	if (to_put)
@@ -787,17 +778,14 @@ static int make_request(mddev_t *mddev,
 	struct bio_list bl;
 	struct page **behind_pages = NULL;
 	const int rw = bio_data_dir(bio);
-	const bool do_sync = (bio->bi_rw & REQ_SYNC);
-	bool do_barriers;
+	const unsigned int do_sync = (bio->bi_rw & REQ_SYNC);
+	const unsigned int do_flush_fua = (bio->bi_rw & (REQ_FLUSH | REQ_FUA));
 	mdk_rdev_t *blocked_rdev;
 	/*
 	 * Register the new request and wait if the reconstruction
 	 * thread has put up a bar for new requests.
 	 * Continue immediately if no resync is active currently.
-	 * We test barriers_work *after* md_write_start as md_write_start
-	 * may cause the first superblock write, and that will check out
-	 * if barriers work.
 	 */
 	md_write_start(mddev, bio); /* wait on superblock update early */
@@ -821,13 +809,6 @@ static int make_request(mddev_t *mddev,
 		}
 		finish_wait(&conf->wait_barrier, &w);
 	}
-	if (unlikely(!mddev->barriers_work &&
-		     (bio->bi_rw & REQ_HARDBARRIER))) {
-		if (rw == WRITE)
-			md_write_end(mddev);
-		bio_endio(bio, -EOPNOTSUPP);
-		return 0;
-	}
 	wait_barrier(conf);
@@ -877,7 +858,7 @@ static int make_request(mddev_t *mddev,
 		read_bio->bi_sector = r1_bio->sector + mirror->rdev->data_offset;
 		read_bio->bi_bdev = mirror->rdev->bdev;
 		read_bio->bi_end_io = raid1_end_read_request;
-		read_bio->bi_rw = READ | do_sync;
+		read_bio->bi_rw = READ | do_sync | do_flush_fua;
 		read_bio->bi_private = r1_bio;
 		generic_make_request(read_bio);
@@ -959,10 +940,6 @@ static int make_request(mddev_t *mddev,
 	atomic_set(&r1_bio->remaining, 0);
 	atomic_set(&r1_bio->behind_remaining, 0);
-	do_barriers = bio->bi_rw & REQ_HARDBARRIER;
-	if (do_barriers)
-		set_bit(R1BIO_Barrier, &r1_bio->state);
-
 	bio_list_init(&bl);
 	for (i = 0; i < disks; i++) {
 		struct bio *mbio;
@@ -975,7 +952,7 @@ static int make_request(mddev_t *mddev,
 		mbio->bi_sector	= r1_bio->sector + conf->mirrors[i].rdev->data_offset;
 		mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
 		mbio->bi_end_io	= raid1_end_write_request;
-		mbio->bi_rw = WRITE | do_barriers | do_sync;
+		mbio->bi_rw = WRITE | do_sync;
 		mbio->bi_private = r1_bio;
 		if (behind_pages) {
@@ -1631,41 +1608,6 @@ static void raid1d(mddev_t *mddev)
 		if (test_bit(R1BIO_IsSync, &r1_bio->state)) {
 			sync_request_write(mddev, r1_bio);
 			unplug = 1;
-		} else if (test_bit(R1BIO_BarrierRetry, &r1_bio->state)) {
-			/* some requests in the r1bio were REQ_HARDBARRIER
-			 * requests which failed with -EOPNOTSUPP.  Hohumm..
-			 * Better resubmit without the barrier.
-			 * We know which devices to resubmit for, because
-			 * all others have had their bios[] entry cleared.
-			 * We already have a nr_pending reference on these rdevs.
-			 */
-			int i;
-			const bool do_sync = (r1_bio->master_bio->bi_rw & REQ_SYNC);
-			clear_bit(R1BIO_BarrierRetry, &r1_bio->state);
-			clear_bit(R1BIO_Barrier, &r1_bio->state);
-			for (i=0; i < conf->raid_disks; i++)
-				if (r1_bio->bios[i])
-					atomic_inc(&r1_bio->remaining);
-			for (i=0; i < conf->raid_disks; i++)
-				if (r1_bio->bios[i]) {
-					struct bio_vec *bvec;
-					int j;
-
-					bio = bio_clone(r1_bio->master_bio, GFP_NOIO);
-					/* copy pages from the failed bio, as
-					 * this might be a write-behind device */
-					__bio_for_each_segment(bvec, bio, j, 0)
-						bvec->bv_page = bio_iovec_idx(r1_bio->bios[i], j)->bv_page;
-					bio_put(r1_bio->bios[i]);
-					bio->bi_sector = r1_bio->sector +
-						conf->mirrors[i].rdev->data_offset;
-					bio->bi_bdev = conf->mirrors[i].rdev->bdev;
-					bio->bi_end_io = raid1_end_write_request;
-					bio->bi_rw = WRITE | do_sync;
-					bio->bi_private = r1_bio;
-					r1_bio->bios[i] = bio;
-					generic_make_request(bio);
-				}
 		} else {
 			int disk;
Index: block/drivers/md/raid1.h
===================================================================
--- block.orig/drivers/md/raid1.h
+++ block/drivers/md/raid1.h
@@ -117,8 +117,6 @@ struct r1bio_s {
 #define	R1BIO_IsSync	1
 #define	R1BIO_Degraded	2
 #define	R1BIO_BehindIO	3
-#define	R1BIO_Barrier	4
-#define R1BIO_BarrierRetry 5
 /* For write-behind requests, we call bi_end_io when
  * the last non-write-behind device completes, providing
  * any write was successful.  Otherwise we call when
Index: block/drivers/md/raid5.c
===================================================================
--- block.orig/drivers/md/raid5.c
+++ block/drivers/md/raid5.c
@@ -3278,7 +3278,7 @@ static void handle_stripe5(struct stripe
 	if (dec_preread_active) {
 		/* We delay this until after ops_run_io so that if make_request
-		 * is waiting on a barrier, it won't continue until the writes
+		 * is waiting on a flush, it won't continue until the writes
 		 * have actually been submitted.
 		 */
 		atomic_dec(&conf->preread_active_stripes);
@@ -3580,7 +3580,7 @@ static void handle_stripe6(struct stripe
 	if (dec_preread_active) {
 		/* We delay this until after ops_run_io so that if make_request
-		 * is waiting on a barrier, it won't continue until the writes
+		 * is waiting on a flush, it won't continue until the writes
 		 * have actually been submitted.
 		 */
 		atomic_dec(&conf->preread_active_stripes);
@@ -3958,14 +3958,8 @@ static int make_request(mddev_t *mddev,
 	const int rw = bio_data_dir(bi);
 	int remaining;
-	if (unlikely(bi->bi_rw & REQ_HARDBARRIER)) {
-		/* Drain all pending writes.  We only really need
-		 * to ensure they have been submitted, but this is
-		 * easier.
-		 */
-		mddev->pers->quiesce(mddev, 1);
-		mddev->pers->quiesce(mddev, 0);
-		md_barrier_request(mddev, bi);
+	if (unlikely(bi->bi_rw & REQ_FLUSH)) {
+		md_flush_request(mddev, bi);
 		return 0;
 	}
@@ -4083,7 +4077,7 @@ static int make_request(mddev_t *mddev,
 			finish_wait(&conf->wait_for_overlap, &w);
 			set_bit(STRIPE_HANDLE, &sh->state);
 			clear_bit(STRIPE_DELAYED, &sh->state);
-			if (mddev->barrier &&
+			if (mddev->flush_bio &&
 			    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
 				atomic_inc(&conf->preread_active_stripes);
 			release_stripe(sh);
@@ -4106,7 +4100,7 @@ static int make_request(mddev_t *mddev,
 		bio_endio(bi, 0);
 	}
-	if (mddev->barrier) {
+	if (mddev->flush_bio) {
 		/* We need to wait for the stripes to all be handled.
 		 * So: wait for preread_active_stripes to drop to 0.
 		 */
Index: block/drivers/md/multipath.c
===================================================================
--- block.orig/drivers/md/multipath.c
+++ block/drivers/md/multipath.c
@@ -142,8 +142,8 @@ static int multipath_make_request(mddev_
 	struct multipath_bh * mp_bh;
 	struct multipath_info *multipath;
-	if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
-		md_barrier_request(mddev, bio);
+	if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+		md_flush_request(mddev, bio);
 		return 0;
 	}
Index: block/drivers/md/raid10.c
===================================================================
--- block.orig/drivers/md/raid10.c
+++ block/drivers/md/raid10.c
@@ -799,13 +799,13 @@ static int make_request(mddev_t *mddev,
 	int i;
 	int chunk_sects = conf->chunk_mask + 1;
 	const int rw = bio_data_dir(bio);
-	const bool do_sync = (bio->bi_rw & REQ_SYNC);
+	const unsigned int do_sync = (bio->bi_rw & REQ_SYNC);
 	struct bio_list bl;
 	unsigned long flags;
 	mdk_rdev_t *blocked_rdev;
-	if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
-		md_barrier_request(mddev, bio);
+	if (unlikely(bio->bi_rw & REQ_FLUSH)) {
+		md_flush_request(mddev, bio);
 		return 0;
 	}
Index: block/drivers/md/dm-io.c
===================================================================
--- block.orig/drivers/md/dm-io.c
+++ block/drivers/md/dm-io.c
@@ -31,7 +31,6 @@ struct dm_io_client {
  */
 struct io {
 	unsigned long error_bits;
-	unsigned long eopnotsupp_bits;
 	atomic_t count;
 	struct task_struct *sleeper;
 	struct dm_io_client *client;
@@ -130,11 +129,8 @@ static void retrieve_io_and_region_from_
  *---------------------------------------------------------------*/
 static void dec_count(struct io *io, unsigned int region, int error)
 {
-	if (error) {
+	if (error)
 		set_bit(region, &io->error_bits);
-		if (error == -EOPNOTSUPP)
-			set_bit(region, &io->eopnotsupp_bits);
-	}
 	if (atomic_dec_and_test(&io->count)) {
 		if (io->sleeper)
@@ -310,8 +306,8 @@ static void do_region(int rw, unsigned r
 	sector_t remaining = where->count;
 	/*
-	 * where->count may be zero if rw holds a write barrier and we
-	 * need to send a zero-sized barrier.
+	 * where->count may be zero if rw holds a flush and we need to
+	 * send a zero-sized flush.
 	 */
 	do {
 		/*
@@ -364,7 +360,7 @@ static void dispatch_io(int rw, unsigned
 	 */
 	for (i = 0; i < num_regions; i++) {
 		*dp = old_pages;
-		if (where[i].count || (rw & REQ_HARDBARRIER))
+		if (where[i].count || (rw & REQ_FLUSH))
 			do_region(rw, i, where + i, dp, io);
 	}
@@ -393,9 +389,7 @@ static int sync_io(struct dm_io_client *
 		return -EIO;
 	}
-retry:
 	io->error_bits = 0;
-	io->eopnotsupp_bits = 0;
 	atomic_set(&io->count, 1); /* see dispatch_io() */
 	io->sleeper = current;
 	io->client = client;
@@ -412,11 +406,6 @@ retry:
 	}
 	set_current_state(TASK_RUNNING);
-	if (io->eopnotsupp_bits && (rw & REQ_HARDBARRIER)) {
-		rw &= ~REQ_HARDBARRIER;
-		goto retry;
-	}
-
 	if (error_bits)
 		*error_bits = io->error_bits;
@@ -437,7 +426,6 @@ static int async_io(struct dm_io_client
 	io = mempool_alloc(client->pool, GFP_NOIO);
 	io->error_bits = 0;
-	io->eopnotsupp_bits = 0;
 	atomic_set(&io->count, 1); /* see dispatch_io() */
 	io->sleeper = NULL;
 	io->client = client;
Index: block/drivers/md/dm-raid1.c
===================================================================
--- block.orig/drivers/md/dm-raid1.c
+++ block/drivers/md/dm-raid1.c
@@ -259,7 +259,7 @@ static int mirror_flush(struct dm_target
 	struct dm_io_region io[ms->nr_mirrors];
 	struct mirror *m;
 	struct dm_io_request io_req = {
-		.bi_rw = WRITE_BARRIER,
+		.bi_rw = WRITE_FLUSH,
 		.mem.type = DM_IO_KMEM,
 		.mem.ptr.bvec = NULL,
 		.client = ms->io_client,
@@ -629,7 +629,7 @@ static void do_write(struct mirror_set *
 	struct dm_io_region io[ms->nr_mirrors], *dest = io;
 	struct mirror *m;
 	struct dm_io_request io_req = {
-		.bi_rw = WRITE | (bio->bi_rw & WRITE_BARRIER),
+		.bi_rw = WRITE | (bio->bi_rw & (WRITE_FLUSH | WRITE_FUA)),
 		.mem.type = DM_IO_BVEC,
 		.mem.ptr.bvec = bio->bi_io_vec + bio->bi_idx,
 		.notify.fn = write_callback,
@@ -670,7 +670,7 @@ static void do_writes(struct mirror_set
 	bio_list_init(&requeue);
 	while ((bio = bio_list_pop(writes))) {
-		if (unlikely(bio_empty_barrier(bio))) {
+		if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {
 			bio_list_add(&sync, bio);
 			continue;
 		}
@@ -1203,7 +1203,7 @@ static int mirror_end_io(struct dm_targe
 	 * We need to dec pending if this was a write.
 	 */
 	if (rw == WRITE) {
-		if (likely(!bio_empty_barrier(bio)))
+		if (!(bio->bi_rw & REQ_FLUSH) || bio_has_data(bio))
 			dm_rh_dec(ms->rh, map_context->ll);
 		return error;
 	}
Index: block/drivers/md/dm.c
===================================================================
--- block.orig/drivers/md/dm.c
+++ block/drivers/md/dm.c
@@ -139,21 +139,21 @@ struct mapped_device {
 	spinlock_t deferred_lock;
 	/*
-	 * An error from the barrier request currently being processed.
+	 * An error from the flush request currently being processed.
 	 */
-	int barrier_error;
+	int flush_error;
 	/*
-	 * Protect barrier_error from concurrent endio processing
+	 * Protect flush_error from concurrent endio processing
 	 * in request-based dm.
 	 */
-	spinlock_t barrier_error_lock;
+	spinlock_t flush_error_lock;
 	/*
-	 * Processing queue (flush/barriers)
+	 * Processing queue (flush)
 	 */
 	struct workqueue_struct *wq;
-	struct work_struct barrier_work;
+	struct work_struct flush_work;
 	/* A pointer to the currently processing pre/post flush request */
 	struct request *flush_request;
@@ -195,8 +195,8 @@ struct mapped_device {
 	/* sysfs handle */
 	struct kobject kobj;
-	/* zero-length barrier that will be cloned and submitted to targets */
-	struct bio barrier_bio;
+	/* zero-length flush that will be cloned and submitted to targets */
+	struct bio flush_bio;
 };
 /*
@@ -507,7 +507,7 @@ static void end_io_acct(struct dm_io *io
 	/*
 	 * After this is decremented the bio must not be touched if it is
-	 * a barrier.
+	 * a flush.
 	 */
 	dm_disk(md)->part0.in_flight[rw] = pending =
 		atomic_dec_return(&md->pending[rw]);
@@ -621,7 +621,7 @@ static void dec_pending(struct dm_io *io
 			 */
 			spin_lock_irqsave(&md->deferred_lock, flags);
 			if (__noflush_suspending(md)) {
-				if (!(io->bio->bi_rw & REQ_HARDBARRIER))
+				if (!(io->bio->bi_rw & REQ_FLUSH))
 					bio_list_add_head(&md->deferred,
 							  io->bio);
 			} else
@@ -633,14 +633,14 @@ static void dec_pending(struct dm_io *io
 		io_error = io->error;
 		bio = io->bio;
-		if (bio->bi_rw & REQ_HARDBARRIER) {
+		if (bio->bi_rw & REQ_FLUSH) {
 			/*
-			 * There can be just one barrier request so we use
+			 * There can be just one flush request so we use
 			 * a per-device variable for error reporting.
 			 * Note that you can't touch the bio after end_io_acct
 			 */
-			if (!md->barrier_error && io_error != -EOPNOTSUPP)
-				md->barrier_error = io_error;
+			if (!md->flush_error)
+				md->flush_error = io_error;
 			end_io_acct(io);
 			free_io(md, io);
 		} else {
@@ -744,21 +744,18 @@ static void end_clone_bio(struct bio *cl
 	blk_update_request(tio->orig, 0, nr_bytes);
 }
-static void store_barrier_error(struct mapped_device *md, int error)
+static void store_flush_error(struct mapped_device *md, int error)
 {
 	unsigned long flags;
-	spin_lock_irqsave(&md->barrier_error_lock, flags);
+	spin_lock_irqsave(&md->flush_error_lock, flags);
 	/*
-	 * Basically, the first error is taken, but:
-	 *   -EOPNOTSUPP supersedes any I/O error.
-	 *   Requeue request supersedes any I/O error but -EOPNOTSUPP.
-	 */
-	if (!md->barrier_error || error == -EOPNOTSUPP ||
-	    (md->barrier_error != -EOPNOTSUPP &&
-	     error == DM_ENDIO_REQUEUE))
-		md->barrier_error = error;
-	spin_unlock_irqrestore(&md->barrier_error_lock, flags);
+	 * Basically, the first error is taken, but requeue request
+	 * supersedes any I/O error.
+	 */
+	if (!md->flush_error || error == DM_ENDIO_REQUEUE)
+		md->flush_error = error;
+	spin_unlock_irqrestore(&md->flush_error_lock, flags);
 }
 /*
@@ -799,12 +796,12 @@ static void dm_end_request(struct reques
 {
 	int rw = rq_data_dir(clone);
 	int run_queue = 1;
-	bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
+	bool is_flush = clone->cmd_flags & REQ_FLUSH;
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct mapped_device *md = tio->md;
 	struct request *rq = tio->orig;
-	if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
+	if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_flush) {
 		rq->errors = clone->errors;
 		rq->resid_len = clone->resid_len;
@@ -819,12 +816,13 @@ static void dm_end_request(struct reques
 	free_rq_clone(clone);
-	if (unlikely(is_barrier)) {
+	if (!is_flush)
+		blk_end_request_all(rq, error);
+	else {
 		if (unlikely(error))
-			store_barrier_error(md, error);
+			store_flush_error(md, error);
 		run_queue = 0;
-	} else
-		blk_end_request_all(rq, error);
+	}
 	rq_completed(md, rw, run_queue);
 }
@@ -851,9 +849,9 @@ void dm_requeue_unmapped_request(struct
 	struct request_queue *q = rq->q;
 	unsigned long flags;
-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+	if (clone->cmd_flags & REQ_FLUSH) {
 		/*
-		 * Barrier clones share an original request.
+		 * Flush clones share an original request.
 		 * Leave it to dm_end_request(), which handles this special
 		 * case.
 		 */
@@ -950,14 +948,14 @@ static void dm_complete_request(struct r
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct request *rq = tio->orig;
-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+	if (clone->cmd_flags & REQ_FLUSH) {
 		/*
-		 * Barrier clones share an original request.  So can't use
+		 * Flush clones share an original request.  So can't use
 		 * softirq_done with the original.
 		 * Pass the clone to dm_done() directly in this special case.
 		 * It is safe (even if clone->q->queue_lock is held here)
 		 * because there is no I/O dispatching during the completion
-		 * of barrier clone.
+		 * of flush clone.
 		 */
 		dm_done(clone, error, true);
 		return;
@@ -979,9 +977,9 @@ void dm_kill_unmapped_request(struct req
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct request *rq = tio->orig;
-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
+	if (clone->cmd_flags & REQ_FLUSH) {
 		/*
-		 * Barrier clones share an original request.
+		 * Flush clones share an original request.
 		 * Leave it to dm_end_request(), which handles this special
 		 * case.
 		 */
@@ -1098,7 +1096,7 @@ static void dm_bio_destructor(struct bio
 }
 /*
- * Creates a little bio that is just does part of a bvec.
+ * Creates a little bio that is just a part of a bvec.
  */
 static struct bio *split_bvec(struct bio *bio, sector_t sector,
 			      unsigned short idx, unsigned int offset,
@@ -1113,7 +1111,7 @@ static struct bio *split_bvec(struct bio
 	clone->bi_sector = sector;
 	clone->bi_bdev = bio->bi_bdev;
-	clone->bi_rw = bio->bi_rw & ~REQ_HARDBARRIER;
+	clone->bi_rw = bio->bi_rw;
 	clone->bi_vcnt = 1;
 	clone->bi_size = to_bytes(len);
 	clone->bi_io_vec->bv_offset = offset;
@@ -1140,7 +1138,6 @@ static struct bio *clone_bio(struct bio
 	clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
 	__bio_clone(clone, bio);
-	clone->bi_rw &= ~REQ_HARDBARRIER;
 	clone->bi_destructor = dm_bio_destructor;
 	clone->bi_sector = sector;
 	clone->bi_idx = idx;
@@ -1186,7 +1183,7 @@ static void __flush_target(struct clone_
 	__map_bio(ti, clone, tio);
 }
-static int __clone_and_map_empty_barrier(struct clone_info *ci)
+static int __clone_and_map_flush(struct clone_info *ci)
 {
 	unsigned target_nr = 0, flush_nr;
 	struct dm_target *ti;
@@ -1208,9 +1205,6 @@ static int __clone_and_map(struct clone_
 	sector_t len = 0, max;
 	struct dm_target_io *tio;
-	if (unlikely(bio_empty_barrier(bio)))
-		return __clone_and_map_empty_barrier(ci);
-
 	ti = dm_table_find_target(ci->map, ci->sector);
 	if (!dm_target_is_valid(ti))
 		return -EIO;
@@ -1308,11 +1302,11 @@ static void __split_and_process_bio(stru
 	ci.map = dm_get_live_table(md);
 	if (unlikely(!ci.map)) {
-		if (!(bio->bi_rw & REQ_HARDBARRIER))
+		if (!(bio->bi_rw & REQ_FLUSH))
 			bio_io_error(bio);
 		else
-			if (!md->barrier_error)
-				md->barrier_error = -EIO;
+			if (!md->flush_error)
+				md->flush_error = -EIO;
 		return;
 	}
@@ -1325,14 +1319,22 @@ static void __split_and_process_bio(stru
 	ci.io->md = md;
 	spin_lock_init(&ci.io->endio_lock);
 	ci.sector = bio->bi_sector;
-	ci.sector_count = bio_sectors(bio);
-	if (unlikely(bio_empty_barrier(bio)))
+	if (!(bio->bi_rw & REQ_FLUSH))
+		ci.sector_count = bio_sectors(bio);
+	else {
+		/* FLUSH bio reaching here should all be empty */
+		WARN_ON_ONCE(bio_has_data(bio));
 		ci.sector_count = 1;
+	}
 	ci.idx = bio->bi_idx;
 	start_io_acct(ci.io);
-	while (ci.sector_count && !error)
-		error = __clone_and_map(&ci);
+	while (ci.sector_count && !error) {
+		if (!(bio->bi_rw & REQ_FLUSH))
+			error = __clone_and_map(&ci);
+		else
+			error = __clone_and_map_flush(&ci);
+	}
 	/* drop the extra reference count */
 	dec_pending(ci.io, error);
@@ -1417,11 +1419,11 @@ static int _dm_request(struct request_qu
 	part_stat_unlock();
 	/*
-	 * If we're suspended or the thread is processing barriers
+	 * If we're suspended or the thread is processing flushes
 	 * we have to queue this io for later.
 	 */
 	if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
-	    unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
+	    (bio->bi_rw & REQ_FLUSH)) {
 		up_read(&md->io_lock);
 		if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) &&
@@ -1464,10 +1466,7 @@ static int dm_request(struct request_que
 static bool dm_rq_is_flush_request(struct request *rq)
 {
-	if (rq->cmd_flags & REQ_FLUSH)
-		return true;
-	else
-		return false;
+	return rq->cmd_flags & REQ_FLUSH;
 }
 void dm_dispatch_request(struct request *rq)
@@ -1520,7 +1519,7 @@ static int setup_clone(struct request *c
 	if (dm_rq_is_flush_request(rq)) {
 		blk_rq_init(NULL, clone);
 		clone->cmd_type = REQ_TYPE_FS;
-		clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
+		clone->cmd_flags |= (REQ_FLUSH | WRITE);
 	} else {
 		r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
 				      dm_rq_bio_constructor, tio);
@@ -1668,7 +1667,7 @@ static void dm_request_fn(struct request
 			BUG_ON(md->flush_request);
 			md->flush_request = rq;
 			blk_start_request(rq);
-			queue_work(md->wq, &md->barrier_work);
+			queue_work(md->wq, &md->flush_work);
 			goto out;
 		}
@@ -1843,7 +1842,7 @@ out:
 static const struct block_device_operations dm_blk_dops;
 static void dm_wq_work(struct work_struct *work);
-static void dm_rq_barrier_work(struct work_struct *work);
+static void dm_rq_flush_work(struct work_struct *work);
 /*
  * Allocate and initialise a blank device with a given minor.
@@ -1873,7 +1872,7 @@ static struct mapped_device *alloc_dev(i
 	init_rwsem(&md->io_lock);
 	mutex_init(&md->suspend_lock);
 	spin_lock_init(&md->deferred_lock);
-	spin_lock_init(&md->barrier_error_lock);
+	spin_lock_init(&md->flush_error_lock);
 	rwlock_init(&md->map_lock);
 	atomic_set(&md->holders, 1);
 	atomic_set(&md->open_count, 0);
@@ -1918,7 +1917,7 @@ static struct mapped_device *alloc_dev(i
 	atomic_set(&md->pending[1], 0);
 	init_waitqueue_head(&md->wait);
 	INIT_WORK(&md->work, dm_wq_work);
-	INIT_WORK(&md->barrier_work, dm_rq_barrier_work);
+	INIT_WORK(&md->flush_work, dm_rq_flush_work);
 	init_waitqueue_head(&md->eventq);
 	md->disk->major = _major;
@@ -2233,31 +2232,28 @@ static int dm_wait_for_completion(struct
 	return r;
 }
-static void dm_flush(struct mapped_device *md)
+static void process_flush(struct mapped_device *md, struct bio *bio)
 {
+	md->flush_error = 0;
+
+	/* handle REQ_FLUSH */
 	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-	bio_init(&md->barrier_bio);
-	md->barrier_bio.bi_bdev = md->bdev;
-	md->barrier_bio.bi_rw = WRITE_BARRIER;
-	__split_and_process_bio(md, &md->barrier_bio);
+	bio_init(&md->flush_bio);
+	md->flush_bio.bi_bdev = md->bdev;
+	md->flush_bio.bi_rw = WRITE_FLUSH;
+	__split_and_process_bio(md, &md->flush_bio);
 	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-}
-
-static void process_barrier(struct mapped_device *md, struct bio *bio)
-{
-	md->barrier_error = 0;
-	dm_flush(md);
+	bio->bi_rw &= ~REQ_FLUSH;
-	if (!bio_empty_barrier(bio)) {
+	/* handle data + REQ_FUA */
+	if (bio_has_data(bio))
 		__split_and_process_bio(md, bio);
-		dm_flush(md);
-	}
-	if (md->barrier_error != DM_ENDIO_REQUEUE)
-		bio_endio(bio, md->barrier_error);
+	if (md->flush_error != DM_ENDIO_REQUEUE)
+		bio_endio(bio, md->flush_error);
 	else {
 		spin_lock_irq(&md->deferred_lock);
 		bio_list_add_head(&md->deferred, bio);
@@ -2291,8 +2287,8 @@ static void dm_wq_work(struct work_struc
 		if (dm_request_based(md))
 			generic_make_request(c);
 		else {
-			if (c->bi_rw & REQ_HARDBARRIER)
-				process_barrier(md, c);
+			if (c->bi_rw & REQ_FLUSH)
+				process_flush(md, c);
 			else
 				__split_and_process_bio(md, c);
 		}
@@ -2317,8 +2313,8 @@ static void dm_rq_set_flush_nr(struct re
 	tio->info.flush_request = flush_nr;
 }
-/* Issue barrier requests to targets and wait for their completion. */
-static int dm_rq_barrier(struct mapped_device *md)
+/* Issue flush requests to targets and wait for their completion. */
+static int dm_rq_flush(struct mapped_device *md)
 {
 	int i, j;
 	struct dm_table *map = dm_get_live_table(md);
@@ -2326,7 +2322,7 @@ static int dm_rq_barrier(struct mapped_d
 	struct dm_target *ti;
 	struct request *clone;
-	md->barrier_error = 0;
+	md->flush_error = 0;
 	for (i = 0; i < num_targets; i++) {
 		ti = dm_table_get_target(map, i);
@@ -2341,26 +2337,26 @@ static int dm_rq_barrier(struct mapped_d
 	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
 	dm_table_put(map);
-	return md->barrier_error;
+	return md->flush_error;
 }
-static void dm_rq_barrier_work(struct work_struct *work)
+static void dm_rq_flush_work(struct work_struct *work)
 {
 	int error;
 	struct mapped_device *md = container_of(work, struct mapped_device,
-						barrier_work);
+						flush_work);
 	struct request_queue *q = md->queue;
 	struct request *rq;
 	unsigned long flags;
 	/*
 	 * Hold the md reference here and leave it at the last part so that
-	 * the md can't be deleted by device opener when the barrier request
+	 * the md can't be deleted by device opener when the flush request
 	 * completes.
 	 */
 	dm_get(md);
-	error = dm_rq_barrier(md);
+	error = dm_rq_flush(md);
 	rq = md->flush_request;
 	md->flush_request = NULL;
@@ -2520,7 +2516,7 @@ int dm_suspend(struct mapped_device *md,
 	up_write(&md->io_lock);
 	/*
-	 * Request-based dm uses md->wq for barrier (dm_rq_barrier_work) which
+	 * Request-based dm uses md->wq for flush (dm_rq_flush_work) which
 	 * can be kicked until md->queue is stopped.  So stop md->queue before
 	 * flushing md->wq.
 	 */
Index: block/drivers/md/dm-log.c
===================================================================
--- block.orig/drivers/md/dm-log.c
+++ block/drivers/md/dm-log.c
@@ -300,7 +300,7 @@ static int flush_header(struct log_c *lc
 		.count = 0,
 	};
-	lc->io_req.bi_rw = WRITE_BARRIER;
+	lc->io_req.bi_rw = WRITE_FLUSH;
 	return dm_io(&lc->io_req, 1, &null_location, NULL);
 }
Index: block/drivers/md/dm-snap-persistent.c
===================================================================
--- block.orig/drivers/md/dm-snap-persistent.c
+++ block/drivers/md/dm-snap-persistent.c
@@ -687,7 +687,7 @@ static void persistent_commit_exception(
 	/*
 	 * Commit exceptions to disk.
 	 */
-	if (ps->valid && area_io(ps, WRITE_BARRIER))
+	if (ps->valid && area_io(ps, WRITE_FLUSH_FUA))
 		ps->valid = 0;
 	/*
-- 
tejun
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-13 14:51       ` Tejun Heo
@ 2010-08-14 10:36         ` Christoph Hellwig
  2010-08-17  9:59           ` Tejun Heo
  0 siblings, 1 reply; 109+ messages in thread
From: Christoph Hellwig @ 2010-08-14 10:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Hellwig, jaxboe, linux-fsdevel, linux-scsi, linux-ide,
	linux-kernel, linux-raid, James.Bottomley, tytso, chris.mason,
	swhiteho, konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
On Fri, Aug 13, 2010 at 04:51:17PM +0200, Tejun Heo wrote:
> Do you want to change the whole thing in a single commit?  That would
> be a pretty big invasive patch touching multiple subsystems.
We can just stop draining in the block layer in the first patch, then
stop doing the stuff in md/dm/etc in the following and then do the
final renaming patches.  It would still be fewer patches than now, but
keep things working through the whole transition, which would really
help bisecting any problems.
> +			if (req->cmd_flags & REQ_FUA)
> +				vbr->out_hdr.type |= VIRTIO_BLK_T_FUA;
I'd suggest not adding FUA support to virtio yet.  Just using the flush
feature gives you a fully working barrier implementation.
Eventually we might want to add a flag in the block queue to send
REQ_FLUSH|REQ_FUA requests through to virtio directly so that we can
avoid separate pre- and post-flushes, but I really want to benchmark
whether it makes an impact on real-life setups first.
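To make the pre-/post-flush point concrete, here is a rough userspace
sketch of how a REQ_FLUSH|REQ_FUA write could be broken into steps
depending on what the device supports.  It is only a model under stated
assumptions: the SEQ_* flag values, struct seq_dev and seq_flush_steps()
are made up for illustration and are not the block layer's definitions
or the code in this patchset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for this model; not the kernel's definitions. */
#define SEQ_REQ_FLUSH	(1u << 0)	/* flush the write cache before the data */
#define SEQ_REQ_FUA	(1u << 1)	/* data must be on stable media at completion */

/* Steps a request is broken into, as a bitmask. */
#define SEQ_STEP_PREFLUSH	(1u << 0)
#define SEQ_STEP_DATA		(1u << 1)
#define SEQ_STEP_POSTFLUSH	(1u << 2)

/* What the modelled device can do natively (assumed capability bits). */
struct seq_dev {
	bool has_write_cache;	/* volatile cache present, so flushes matter */
	bool has_fua;		/* device honours FUA on the write itself */
};

/* Return the set of steps needed to satisfy one request on one device. */
static unsigned int seq_flush_steps(const struct seq_dev *dev,
				    unsigned int rw, unsigned int nr_bytes)
{
	unsigned int steps = 0;

	assert(dev != NULL);

	if (nr_bytes)
		steps |= SEQ_STEP_DATA;

	/* Without a volatile cache, neither FLUSH nor FUA needs extra work. */
	if (!dev->has_write_cache)
		return steps;

	if (rw & SEQ_REQ_FLUSH)
		steps |= SEQ_STEP_PREFLUSH;

	/* FUA without native support is emulated by a flush after the data. */
	if ((rw & SEQ_REQ_FUA) && nr_bytes && !dev->has_fua)
		steps |= SEQ_STEP_POSTFLUSH;

	return steps;
}
```

In this model a device with native FUA needs only the pre-flush and the
data write, with the FUA bit carried on the write itself; a device
without it needs all three steps.  Passing the combined request through
directly would collapse them into a single command, which is the case
that still needs benchmarking.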
> Index: block/drivers/md/linear.c
> ===================================================================
> --- block.orig/drivers/md/linear.c
> +++ block/drivers/md/linear.c
> @@ -294,8 +294,8 @@ static int linear_make_request (mddev_t
>  	dev_info_t *tmp_dev;
>  	sector_t start_sector;
> 
> -	if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
> -		md_barrier_request(mddev, bio);
> +	if (unlikely(bio->bi_rw & REQ_FLUSH)) {
> +		md_flush_request(mddev, bio);
We only need the special md_flush_request handling for
empty REQ_FLUSH requests.  REQ_WRITE | REQ_FLUSH requests just need
the flag propagated to the underlying devices.
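A rough userspace sketch of that split, assuming a simplified bio with
only a flags word and a size: the MODEL_* flag values, struct model_bio
and both helpers are hypothetical and exist only for illustration, they
are not md code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for this model; not the kernel's definitions. */
#define MODEL_REQ_WRITE	(1u << 0)
#define MODEL_REQ_FLUSH	(1u << 1)
#define MODEL_REQ_FUA	(1u << 2)

/* Minimal stand-in for a bio: request flags plus payload size in bytes. */
struct model_bio {
	unsigned int rw;
	unsigned int size;
};

/* True only for a flush that carries no data; this is the special case. */
static bool model_bio_empty_flush(const struct model_bio *bio)
{
	assert(bio != NULL);
	return (bio->rw & MODEL_REQ_FLUSH) && bio->size == 0;
}

/*
 * Flags for the per-member clone of a data-carrying write: keep FLUSH and
 * FUA from the original so each underlying device sees them.
 */
static unsigned int model_member_rw(const struct model_bio *bio)
{
	assert(bio != NULL);
	return MODEL_REQ_WRITE | (bio->rw & (MODEL_REQ_FLUSH | MODEL_REQ_FUA));
}
```

A make_request function would then send a bio to the special flush path
only when model_bio_empty_flush() is true, and otherwise map it as usual
with model_member_rw() supplying the flags for each member device.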
> +static void md_end_flush(struct bio *bio, int err)
>  {
>  	mdk_rdev_t *rdev = bio->bi_private;
>  	mddev_t *mddev = rdev->mddev;
> 
>  	rdev_dec_pending(rdev, mddev);
> 
>  	if (atomic_dec_and_test(&mddev->flush_pending)) {
> +		/* The pre-request flush has finished */
> +		schedule_work(&mddev->flush_work);
Once we only handle empty barriers here we can directly call bio_endio
and the super wakeup instead of first scheduling a work queue.
>  	while ((bio = bio_list_pop(writes))) {
> -		if (unlikely(bio_empty_barrier(bio))) {
> +		if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {
I kept bio_empty_barrier as bio_empty_flush, which is actually quite a
useful macro for the bio based drivers.
> @@ -621,7 +621,7 @@ static void dec_pending(struct dm_io *io
>  			 */
>  			spin_lock_irqsave(&md->deferred_lock, flags);
>  			if (__noflush_suspending(md)) {
> -				if (!(io->bio->bi_rw & REQ_HARDBARRIER))
> +				if (!(io->bio->bi_rw & REQ_FLUSH))
I suspect we don't actually need to special case flushes here anymore.
> @@ -633,14 +633,14 @@ static void dec_pending(struct dm_io *io
>  		io_error = io->error;
>  		bio = io->bio;
> 
> -		if (bio->bi_rw & REQ_HARDBARRIER) {
> +		if (bio->bi_rw & REQ_FLUSH) {
>  			/*
> -			 * There can be just one barrier request so we use
> +			 * There can be just one flush request so we use
>  			 * a per-device variable for error reporting.
>  			 * Note that you can't touch the bio after end_io_acct
>  			 */
> -			if (!md->barrier_error && io_error != -EOPNOTSUPP)
> -				md->barrier_error = io_error;
> +			if (!md->flush_error)
> +				md->flush_error = io_error;
And we certainly do not need any special casing here.  See my patch.
>  {
>  	int rw = rq_data_dir(clone);
>  	int run_queue = 1;
> -	bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
> +	bool is_flush = clone->cmd_flags & REQ_FLUSH;
>  	struct dm_rq_target_io *tio = clone->end_io_data;
>  	struct mapped_device *md = tio->md;
>  	struct request *rq = tio->orig;
> 
> -	if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
> +	if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_flush) {
We never send flush requests as REQ_TYPE_BLOCK_PC anymore, so no need
for the second half of this conditional.
> +	if (!is_flush)
> +		blk_end_request_all(rq, error);
> +	else {
>  		if (unlikely(error))
> -			store_barrier_error(md, error);
> +			store_flush_error(md, error);
>  		run_queue = 0;
> -	} else
> -		blk_end_request_all(rq, error);
> +	}
Flush requests can now be completed normally.
> @@ -1308,11 +1302,11 @@ static void __split_and_process_bio(stru
> 
>  	ci.map = dm_get_live_table(md);
>  	if (unlikely(!ci.map)) {
> -		if (!(bio->bi_rw & REQ_HARDBARRIER))
> +		if (!(bio->bi_rw & REQ_FLUSH))
>  			bio_io_error(bio);
>  		else
> -			if (!md->barrier_error)
> -				md->barrier_error = -EIO;
> +			if (!md->flush_error)
> +				md->flush_error = -EIO;
No need for the special error handling here, flush requests can now
be completed normally.
> @@ -1417,11 +1419,11 @@ static int _dm_request(struct request_qu
>  	part_stat_unlock();
> 
>  	/*
> -	 * If we're suspended or the thread is processing barriers
> +	 * If we're suspended or the thread is processing flushes
>  	 * we have to queue this io for later.
>  	 */
>  	if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
> -	    unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
> +	    (bio->bi_rw & REQ_FLUSH)) {
>  		up_read(&md->io_lock);
AFAICS this is only needed for the old barrier code, no need for this
for pure flushes.
> @@ -1464,10 +1466,7 @@ static int dm_request(struct request_que
> 
>  static bool dm_rq_is_flush_request(struct request *rq)
>  {
> -	if (rq->cmd_flags & REQ_FLUSH)
> -		return true;
> -	else
> -		return false;
> +	return rq->cmd_flags & REQ_FLUSH;
>  }
It's probably worth just killing this wrapper.
>  void dm_dispatch_request(struct request *rq)
> @@ -1520,7 +1519,7 @@ static int setup_clone(struct request *c
>  	if (dm_rq_is_flush_request(rq)) {
>  		blk_rq_init(NULL, clone);
>  		clone->cmd_type = REQ_TYPE_FS;
> -		clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
> +		clone->cmd_flags |= (REQ_FLUSH | WRITE);
>  	} else {
>  		r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
>  				      dm_rq_bio_constructor, tio);
My suspicion is that we can get rid of all that special casing here
and just use blk_rq_prep_clone once it's been updated to propagate
REQ_FLUSH, similar to the DISCARD flag.
I also suspect that there is absolutely no need for the barrier work
queue once we stop waiting for outstanding requests.  But then again
the request based dm code still somewhat confuses me.
> +static void process_flush(struct mapped_device *md, struct bio *bio)
>  {
> +	md->flush_error = 0;
> +
> +	/* handle REQ_FLUSH */
>  	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
> 
> -	bio_init(&md->barrier_bio);
> -	md->barrier_bio.bi_bdev = md->bdev;
> -	md->barrier_bio.bi_rw = WRITE_BARRIER;
> -	__split_and_process_bio(md, &md->barrier_bio);
> +	bio_init(&md->flush_bio);
> +	md->flush_bio.bi_bdev = md->bdev;
> +	md->flush_bio.bi_rw = WRITE_FLUSH;
> +	__split_and_process_bio(md, &md->flush_bio);
There's no need to use a separate flush_bio here.
__split_and_process_bio does the right thing for empty REQ_FLUSH
requests.  See my patch for how to do this differently.  And yeah,
my version has been tested.
^ permalink raw reply	[flat|nested] 109+ messages in thread
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-14 10:36         ` Christoph Hellwig
@ 2010-08-17  9:59           ` Tejun Heo
  2010-08-17 13:19             ` Christoph Hellwig
  0 siblings, 1 reply; 109+ messages in thread
From: Tejun Heo @ 2010-08-17  9:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
Hello, Christoph.
On 08/14/2010 12:36 PM, Christoph Hellwig wrote:
> On Fri, Aug 13, 2010 at 04:51:17PM +0200, Tejun Heo wrote:
>> Do you want to change the whole thing in a single commit?  That would
>> be a pretty big invasive patch touching multiple subsystems.
> 
> We can just stop draining in the block layer in the first patch, then
> stop doing the stuff in md/dm/etc in the following and then do the
> final renaming patches.  It would still be fewer patches than now, but
> keep things working through the whole transition, which would really
> help bisecting any problems.
I'm not really convinced that would help much.  If bisecting can point
to the conversion as the culprit for whatever kind of failure,
wouldn't that be enough?  No matter what we do the conversion will be
a single step thing.  If we make the filesystems enforce the ordering
first and then relax ordering in the block layer, bisection would
still just point at the later patch.  The same goes for md/dm, the
best we can find out would be whether the conversion is correct or not
anyway.
I'm not against restructuring the patchset if it makes more sense but
it just feels like it would be a bit pointless effort (and one which
would require much tighter coordination among different trees) at this
point.  Am I missing something?
>> +			if (req->cmd_flags & REQ_FUA)
>> +				vbr->out_hdr.type |= VIRTIO_BLK_T_FUA;
> 
> I'd suggest not adding FUA support to virtio yet.  Just using the flush
> feature gives you a fully working barrier implementation.
> 
> Eventually we might want to add a flag in the block queue to send
> REQ_FLUSH|REQ_FUA request through to virtio directly so that we can
> avoid separate pre- and post flushes, but I really want to benchmark if
> it makes an impact on real life setups first.
I wrote this in the other mail but I think it would make a difference if
the backend storage is md/dm, especially if it's shared by multiple VMs.
It cuts down on one array-wide cache flush.
>> Index: block/drivers/md/linear.c
>> ===================================================================
>> --- block.orig/drivers/md/linear.c
>> +++ block/drivers/md/linear.c
>> @@ -294,8 +294,8 @@ static int linear_make_request (mddev_t
>>  	dev_info_t *tmp_dev;
>>  	sector_t start_sector;
>>
>> -	if (unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
>> -		md_barrier_request(mddev, bio);
>> +	if (unlikely(bio->bi_rw & REQ_FLUSH)) {
>> +		md_flush_request(mddev, bio);
> 
> We only need the special md_flush_request handling for
> empty REQ_FLUSH requests.  REQ_WRITE | REQ_FLUSH just need the
> flag propagated to the underlying devices.
Hmm, not really, the WRITE should happen after all the data in cache
are committed to NV media, meaning that empty FLUSH should already
have finished by the time the WRITE starts.
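The ordering described here can be sketched in user space as follows.
This is only an illustration under stated assumptions: the sk_ names
and flag values are hypothetical, not the kernel's, and each step is
assumed to complete before the next one is issued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for illustration; not the kernel's values. */
#define SK_REQ_FLUSH (1u << 0)
#define SK_REQ_FUA   (1u << 1)

enum sk_step { SK_PREFLUSH, SK_DATA, SK_POSTFLUSH };

/*
 * Expand one request into the ordered steps a front-most sequencer
 * would issue.  Returns the number of steps written to out[] (at
 * most 3).  The caller must wait for step i before starting i + 1.
 */
static size_t sk_sequence(unsigned int flags, bool has_data,
			  bool dev_has_fua, enum sk_step out[3])
{
	size_t n = 0;

	if (flags & SK_REQ_FLUSH)
		out[n++] = SK_PREFLUSH;		/* drain the write cache first */
	if (has_data)
		out[n++] = SK_DATA;		/* starts only after the preflush */
	if (has_data && (flags & SK_REQ_FUA) && !dev_has_fua)
		out[n++] = SK_POSTFLUSH;	/* emulate FUA with a second flush */
	return n;
}
```

An empty flush yields only the preflush step, while a WRITE carrying
the flush flag yields the preflush followed by the data step, which is
why propagating the flag alone would not be enough.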
>> +static void md_end_flush(struct bio *bio, int err)
>>  {
>>  	mdk_rdev_t *rdev = bio->bi_private;
>>  	mddev_t *mddev = rdev->mddev;
>>
>>  	rdev_dec_pending(rdev, mddev);
>>
>>  	if (atomic_dec_and_test(&mddev->flush_pending)) {
>> +		/* The pre-request flush has finished */
>> +		schedule_work(&mddev->flush_work);
> 
> Once we only handle empty barriers here we can directly call bio_endio
> and the super wakeup instead of first scheduling a work queue.
Yeap, right.  That would be a nice optimization.
>>  	while ((bio = bio_list_pop(writes))) {
>> -		if (unlikely(bio_empty_barrier(bio))) {
>> +		if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {
> 
> I kept bio_empty_barrier as bio_empty_flush, which actually is a quite
> useful macro for the bio based drivers.
Hmm... maybe.  The reason why I removed bio_empty_flush() was that
except for the front-most sequencer (block layer for all the request
based ones and the front-most make_request for bio based ones), it
doesn't make sense to see REQ_FLUSH + data bios.  They should be
sequenced at the front-most stage anyway, so I didn't have much use
for them.  Those code paths couldn't deal with REQ_FLUSH + data bios
anyway.
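For reference, a helper of the kind being discussed boils down to
something like the sketch below.  It uses a mock bio type so that it
builds in user space; the sk_ names, the struct layout and the flag
value are assumptions for illustration, not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit for illustration; not the kernel's value. */
#define SK_REQ_FLUSH (1u << 0)

/* Mock of the two bio fields the check needs. */
struct sk_bio {
	unsigned int bi_rw;	/* request flags */
	unsigned int bi_size;	/* payload size in bytes, 0 if no data */
};

/* True only for a flush that carries no data. */
static inline bool sk_bio_empty_flush(const struct sk_bio *bio)
{
	return (bio->bi_rw & SK_REQ_FLUSH) && bio->bi_size == 0;
}
```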
>> @@ -621,7 +621,7 @@ static void dec_pending(struct dm_io *io
>>  			 */
>>  			spin_lock_irqsave(&md->deferred_lock, flags);
>>  			if (__noflush_suspending(md)) {
>> -				if (!(io->bio->bi_rw & REQ_HARDBARRIER))
>> +				if (!(io->bio->bi_rw & REQ_FLUSH))
> 
> I suspect we don't actually need to special case flushes here anymore.
Oh, I'm not sure about this part at all.  I'll ask Mike.
>> @@ -633,14 +633,14 @@ static void dec_pending(struct dm_io *io
>>  		io_error = io->error;
>>  		bio = io->bio;
>>
>> -		if (bio->bi_rw & REQ_HARDBARRIER) {
>> +		if (bio->bi_rw & REQ_FLUSH) {
>>  			/*
>> -			 * There can be just one barrier request so we use
>> +			 * There can be just one flush request so we use
>>  			 * a per-device variable for error reporting.
>>  			 * Note that you can't touch the bio after end_io_acct
>>  			 */
>> -			if (!md->barrier_error && io_error != -EOPNOTSUPP)
>> -				md->barrier_error = io_error;
>> +			if (!md->flush_error)
>> +				md->flush_error = io_error;
> 
> And we certainly do not need any special casing here.  See my patch.
I wasn't sure about that part.  You removed store_flush_error(), but
DM_ENDIO_REQUEUE should still have higher priority than other
failures, no?
>>  {
>>  	int rw = rq_data_dir(clone);
>>  	int run_queue = 1;
>> -	bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
>> +	bool is_flush = clone->cmd_flags & REQ_FLUSH;
>>  	struct dm_rq_target_io *tio = clone->end_io_data;
>>  	struct mapped_device *md = tio->md;
>>  	struct request *rq = tio->orig;
>>
>> -	if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
>> +	if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_flush) {
> 
> We never send flush requests as REQ_TYPE_BLOCK_PC anymore, so no need
> for the second half of this conditional.
I see.
>> +	if (!is_flush)
>> +		blk_end_request_all(rq, error);
>> +	else {
>>  		if (unlikely(error))
>> -			store_barrier_error(md, error);
>> +			store_flush_error(md, error);
>>  		run_queue = 0;
>> -	} else
>> -		blk_end_request_all(rq, error);
>> +	}
> 
> Flush requests can now be completed normally.
The same question as before.  I think we still need to prioritize
DM_ENDIO_REQUEUE failures.
>> @@ -1417,11 +1419,11 @@ static int _dm_request(struct request_qu
>>  	part_stat_unlock();
>>
>>  	/*
>> -	 * If we're suspended or the thread is processing barriers
>> +	 * If we're suspended or the thread is processing flushes
>>  	 * we have to queue this io for later.
>>  	 */
>>  	if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
>> -	    unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
>> +	    (bio->bi_rw & REQ_FLUSH)) {
>>  		up_read(&md->io_lock);
> 
> AFAICS this is only needed for the old barrier code, no need for this
> for pure flushes.
I'll ask Mike.
>> @@ -1464,10 +1466,7 @@ static int dm_request(struct request_que
>>
>>  static bool dm_rq_is_flush_request(struct request *rq)
>>  {
>> -	if (rq->cmd_flags & REQ_FLUSH)
>> -		return true;
>> -	else
>> -		return false;
>> +	return rq->cmd_flags & REQ_FLUSH;
>>  }
> 
> It's probably worth just killing this wrapper.
Yeah, probably.  It was an accidental edit to begin with and I left
this part out in the new patch.
>> +static void process_flush(struct mapped_device *md, struct bio *bio)
>>  {
>> +	md->flush_error = 0;
>> +
>> +	/* handle REQ_FLUSH */
>>  	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
>>
>> -	bio_init(&md->barrier_bio);
>> -	md->barrier_bio.bi_bdev = md->bdev;
>> -	md->barrier_bio.bi_rw = WRITE_BARRIER;
>> -	__split_and_process_bio(md, &md->barrier_bio);
>> +	bio_init(&md->flush_bio);
>> +	md->flush_bio.bi_bdev = md->bdev;
>> +	md->flush_bio.bi_rw = WRITE_FLUSH;
>> +	__split_and_process_bio(md, &md->flush_bio);
> 
> There's no need to use a separate flush_bio here.
> __split_and_process_bio does the right thing for empty REQ_FLUSH
> requests.  See my patch for how to do this differently.  And yeah,
> my version has been tested.
But how do you make sure REQ_FLUSHes for preflush finish before
starting the write?
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 109+ messages in thread
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-17  9:59           ` Tejun Heo
@ 2010-08-17 13:19             ` Christoph Hellwig
  2010-08-17 16:41               ` Tejun Heo
  0 siblings, 1 reply; 109+ messages in thread
From: Christoph Hellwig @ 2010-08-17 13:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Hellwig, jaxboe, linux-fsdevel, linux-scsi, linux-ide,
	linux-kernel, linux-raid, James.Bottomley, tytso, chris.mason,
	swhiteho, konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
On Tue, Aug 17, 2010 at 11:59:38AM +0200, Tejun Heo wrote:
> I'm not really convinced that would help much.  If bisecting can point
> to the conversion as the culprit for whatever kind of failure,
> wouldn't that be enough?  No matter what we do the conversion will be
> a single step thing.  If we make the filesystems enforce the ordering
> first and then relax ordering in the block layer, bisection would
> still just point at the later patch.  The same goes for md/dm, the
> best we can find out would be whether the conversion is correct or not
> anyway.
The filesystems already enforce the ordering, except reiserfs which
opts out if the barrier option is set.
> I'm not against restructuring the patchset if it makes more sense but
> it just feels like it would be a bit pointless effort (and one which
> would require much tighter coordination among different trees) at this
> point.  Am I missing something?
What other trees do you mean?  The conversions of the 8 filesystems
that actually support barriers need to go through this tree anyway
if we want to be able to test it.  Also the changes in the filesystem
are absolutely minimal - it's basically just
s/WRITE_BARRIER/WRITE_FUA_FLUSH/ after my initial patch kills BH_Ordered,
and removing about 10 lines of code in reiserfs.
> > We only need the special md_flush_request handling for
> > empty REQ_FLUSH requests.  REQ_WRITE | REQ_FLUSH just need the
> > flag propagated to the underlying devices.
> 
> Hmm, not really, the WRITE should happen after all the data in cache
> are committed to NV media, meaning that empty FLUSH should already
> have finished by the time the WRITE starts.
You're right.
> >>  	while ((bio = bio_list_pop(writes))) {
> >> -		if (unlikely(bio_empty_barrier(bio))) {
> >> +		if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {
> > 
> > I kept bio_empty_barrier as bio_empty_flush, which actually is a quite
> > useful macro for the bio based drivers.
> 
> Hmm... maybe.  The reason why I removed bio_empty_flush() was that
> except for the front-most sequencer (block layer for all the request
> based ones and the front-most make_request for bio based ones), it
> doesn't make sense to see REQ_FLUSH + data bios.  They should be
> sequenced at the front-most stage anyway, so I didn't have much use
> for them.  Those code paths couldn't deal with REQ_FLUSH + data bios
> anyway.
The current bio_empty_barrier is only used in dm, and indeed only makes
sense for make_request-based drivers.  But I think it's a rather useful
helper for them.  Either way, it's not a big issue and either way is
fine with me.
> >> +		if (bio->bi_rw & REQ_FLUSH) {
> >>  			/*
> >> -			 * There can be just one barrier request so we use
> >> +			 * There can be just one flush request so we use
> >>  			 * a per-device variable for error reporting.
> >>  			 * Note that you can't touch the bio after end_io_acct
> >>  			 */
> >> -			if (!md->barrier_error && io_error != -EOPNOTSUPP)
> >> -				md->barrier_error = io_error;
> >> +			if (!md->flush_error)
> >> +				md->flush_error = io_error;
> > 
> > And we certainly do not need any special casing here.  See my patch.
> 
> I wasn't sure about that part.  You removed store_flush_error(), but
> DM_ENDIO_REQUEUE should still have higher priority than other
> failures, no?
Which priority?
> >> +static void process_flush(struct mapped_device *md, struct bio *bio)
> >>  {
> >> +	md->flush_error = 0;
> >> +
> >> +	/* handle REQ_FLUSH */
> >>  	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
> >>
> >> -	bio_init(&md->barrier_bio);
> >> -	md->barrier_bio.bi_bdev = md->bdev;
> >> -	md->barrier_bio.bi_rw = WRITE_BARRIER;
> >> -	__split_and_process_bio(md, &md->barrier_bio);
> >> +	bio_init(&md->flush_bio);
> >> +	md->flush_bio.bi_bdev = md->bdev;
> >> +	md->flush_bio.bi_rw = WRITE_FLUSH;
> >> +	__split_and_process_bio(md, &md->flush_bio);
> > 
> > There's no need to use a separate flush_bio here.
> > __split_and_process_bio does the right thing for empty REQ_FLUSH
> > requests.  See my patch for how to do this differently.  And yeah,
> > my version has been tested.
> 
> But how do you make sure REQ_FLUSHes for preflush finish before
> starting the write?
Hmm, okay.  I see how the special flush_bio makes the waiting easier,
let's see if Mike or others on the DM team have a better idea.
^ permalink raw reply	[flat|nested] 109+ messages in thread
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-17 13:19             ` Christoph Hellwig
@ 2010-08-17 16:41               ` Tejun Heo
  2010-08-17 16:59                 ` Christoph Hellwig
  0 siblings, 1 reply; 109+ messages in thread
From: Tejun Heo @ 2010-08-17 16:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
Hi,
On 08/17/2010 03:19 PM, Christoph Hellwig wrote:
> On Tue, Aug 17, 2010 at 11:59:38AM +0200, Tejun Heo wrote:
>> I'm not against restructuring the patchset if it makes more sense but
>> it just feels like it would be a bit pointless effort (and one which
>> would require much tighter coordination among different trees) at this
>> point.  Am I missing something?
> 
> What other trees do you mean?
I was mostly thinking about dm/md, drbd and stuff, but you're talking
about filesystem conversion patches being routed through block tree,
right?
> The conversions of the 8 filesystems that actually support barriers
> need to go through this tree anyway if we want to be able to test
> it.  Also the changes in the filesystem are absolutely minimal -
> it's basically just s/WRITE_BARRIER/WRITE_FUA_FLUSH/ after my
> initial patch kills BH_Ordered, and removing about 10 lines of code in
> reiserfs.
I might just resequence it to finish this part of discussion but what
does that really buy us?  It's not really gonna help bisection.
Bisection won't be able to tell anything in higher resolution than
"the new implementation doesn't work".  If you show me how it would
actually help, I'll happily reshuffle the patches.
>> I wasn't sure about that part.  You removed store_flush_error(), but
>> DM_ENDIO_REQUEUE should still have higher priority than other
>> failures, no?
> 
> Which priority?
IIUC, when any of flushes get DM_ENDIO_REQUEUE (which tells the dm
core layer to retry the whole bio later), it trumps all other failures
and the bio is retried later.  That was why DM_ENDIO_REQUEUE was
prioritized over other error codes, which actually is sort of
incorrect in that once a FLUSH fails, it _MUST_ be reported to upper
layers as FLUSH failure implies data already lost.  So,
DM_ENDIO_REQUEUE actually should have lower priority than other
failures.  But, then again, the error codes still need to be
prioritized.
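To make the proposed priority concrete, here is a minimal sketch of
how per-clone completion statuses could be merged.  It assumes real
failures are negative errno values and uses a hypothetical
SK_ENDIO_REQUEUE stand-in for DM_ENDIO_REQUEUE; it is not the dm code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for DM_ENDIO_REQUEUE; value chosen for the sketch. */
#define SK_ENDIO_REQUEUE 1

/*
 * Merge a new completion status into the stored flush status.
 * Priority, highest first: real error (negative errno), requeue,
 * success (0).  The first real error seen is the one that is kept.
 */
static int sk_merge_flush_error(int stored, int incoming)
{
	if (stored < 0)
		return stored;			/* a real failure is never overwritten */
	if (incoming < 0)
		return incoming;		/* real failure trumps requeue and success */
	if (stored == SK_ENDIO_REQUEUE || incoming == SK_ENDIO_REQUEUE)
		return SK_ENDIO_REQUEUE;	/* requeue trumps success only */
	return 0;
}
```

With this ordering a flush that failed on one clone is reported upwards
even if another clone asked for a requeue.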
>> But how do you make sure REQ_FLUSHes for preflush finish before
>> starting the write?
> 
> Hmm, okay.  I see how the special flush_bio makes the waiting easier,
> let's see if Mike or others on the DM team have a better idea.
Yeah, it would be better if it can be sequenced w/o using a work but
let's leave it for later.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-17 16:41               ` Tejun Heo
@ 2010-08-17 16:59                 ` Christoph Hellwig
  2010-08-18  6:35                   ` Tejun Heo
  2010-08-20  8:26                   ` Kiyoshi Ueda
  0 siblings, 2 replies; 109+ messages in thread
From: Christoph Hellwig @ 2010-08-17 16:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Hellwig, jaxboe, linux-fsdevel, linux-scsi, linux-ide,
	linux-kernel, linux-raid, James.Bottomley, tytso, chris.mason,
	swhiteho, konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
On Tue, Aug 17, 2010 at 06:41:47PM +0200, Tejun Heo wrote:
> > What other trees do you mean?
> 
> I was mostly thinking about dm/md, drbd and stuff, but you're talking
> about filesystem conversion patches being routed through block tree,
> right?
I think we really need all the conversions in one tree, block layer,
remapping drivers and filesystems.
Btw, I've done the conversion for all filesystems and I'm running tests
over them now.  Expect the series late today or tomorrow.
> I might just resequence it to finish this part of discussion but what
> does that really buy us?  It's not really gonna help bisection.
> Bisection won't be able to tell anything in higher resolution than
> "the new implementation doesn't work".  If you show me how it would
> actually help, I'll happily reshuffle the patches.
It's not bisecting to find bugs in the barrier conversion.  We can't
easily bisect it down anyway.  The problem is when we try to bisect
other problems and get into the middle of the series barriers suddenly
are gone.  Which is not very helpful for things like data integrity
problems in filesystems.
> >> I wasn't sure about that part.  You removed store_flush_error(), but
> >> DM_ENDIO_REQUEUE should still have higher priority than other
> >> failures, no?
> > 
> > Which priority?
> 
> IIUC, when any of flushes get DM_ENDIO_REQUEUE (which tells the dm
> core layer to retry the whole bio later), it trumps all other failures
> and the bio is retried later.  That was why DM_ENDIO_REQUEUE was
> prioritized over other error codes, which actually is sort of
> incorrect in that once a FLUSH fails, it _MUST_ be reported to upper
> layers as FLUSH failure implies data already lost.  So,
> DM_ENDIO_REQUEUE actually should have lower priority than other
> failures.  But, then again, the error codes still need to be
> prioritized.
I think that's something we better leave to the DM team.
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-17 16:59                 ` Christoph Hellwig
@ 2010-08-18  6:35                   ` Tejun Heo
  2010-08-18  8:11                     ` Tejun Heo
  2010-08-20  8:26                   ` Kiyoshi Ueda
  1 sibling, 1 reply; 109+ messages in thread
From: Tejun Heo @ 2010-08-18  6:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
Hello,
On 08/17/2010 06:59 PM, Christoph Hellwig wrote:
> I think we really need all the conversions in one tree, block layer,
> remapping drivers and filesystems.
I don't know.  If filesystem changes are really trivial maybe, but
md/dm changes seem a bit too invasive to go through the block tree.
> Btw, I've done the conversion for all filesystems and I'm running tests
> over them now.  Expect the series late today or tomorrow.
Cool. :-)
>> I might just resequence it to finish this part of discussion but what
>> does that really buy us?  It's not really gonna help bisection.
>> Bisection won't be able to tell anything in higher resolution than
>> "the new implementation doesn't work".  If you show me how it would
>> actually help, I'll happily reshuffle the patches.
> 
> It's not bisecting to find bugs in the barrier conversion.  We can't
> easily bisect it down anyway.  The problem is when we try to bisect
> other problems and get into the middle of the series barriers suddenly
> are gone.  Which is not very helpful for things like data integrity
> problems in filesystems.
Ah, okay, hmmm.... alright, I'll resequence the patches.  If the
filesystem changes can be put into a single tree somehow, we can keep
things mostly working at least for direct devices.
>> IIUC, when any of flushes get DM_ENDIO_REQUEUE (which tells the dm
>> core layer to retry the whole bio later), it trumps all other failures
>> and the bio is retried later.  That was why DM_ENDIO_REQUEUE was
>> prioritized over other error codes, which actually is sort of
>> incorrect in that once a FLUSH fails, it _MUST_ be reported to upper
>> layers as FLUSH failure implies data already lost.  So,
>> DM_ENDIO_REQUEUE actually should have lower priority than other
>> failures.  But, then again, the error codes still need to be
>> prioritized.
> 
> I think that's something we better leave to the DM team.
Sure, but we shouldn't be ripping out the code to do that.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-18  6:35                   ` Tejun Heo
@ 2010-08-18  8:11                     ` Tejun Heo
  0 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2010-08-18  8:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
Hello,
On 08/18/2010 08:35 AM, Tejun Heo wrote:
>> It's not bisecting to find bugs in the barrier conversion.  We can't
>> easily bisect it down anyway.  The problem is when we try to bisect
>> other problems and get into the middle of the series barriers suddenly
>> are gone.  Which is not very helpful for things like data integrity
>> problems in filesystems.
> 
> Ah, okay, hmmm.... alright, I'll resequence the patches.  If the
> filesystem changes can be put into a single tree somehow, we can keep
> things mostly working at least for direct devices.
Sorry but I'm not doing it.  It just doesn't make much sense.  I can't
relax the ordering for REQ_HARDBARRIER without breaking the remapping
drivers.  So, to keep things working, I'll have to 1. relax the
ordering 2. implement new REQ_FLUSH/FUA based interface and 3. use
them in the filesystems in the same patch.  That's just wrong.  And I
don't think md/dm changes can or should go through the block tree.
They're way too invasive for that.  It's a new implementation and
barrier won't work (fail gracefully) for several commits during the
transition.  I don't think there's a better way around it.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 109+ messages in thread 
 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-17 16:59                 ` Christoph Hellwig
  2010-08-18  6:35                   ` Tejun Heo
@ 2010-08-20  8:26                   ` Kiyoshi Ueda
  2010-08-23 12:14                     ` Tejun Heo
  1 sibling, 1 reply; 109+ messages in thread
From: Kiyoshi Ueda @ 2010-08-20  8:26 UTC (permalink / raw)
  To: Christoph Hellwig, Tejun Heo
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
Hi Tejun, Christoph,
On Tue, Aug 17, 2010 at 06:41:47PM +0200, Tejun Heo wrote:
>>> I wasn't sure about that part.  You removed store_flush_error(), but
>>> DM_ENDIO_REQUEUE should still have higher priority than other
>>> failures, no?
>>
>> Which priority?
>
> IIUC, when any of flushes get DM_ENDIO_REQUEUE (which tells the dm
> core layer to retry the whole bio later), it trumps all other failures
> and the bio is retried later.  That was why DM_ENDIO_REQUEUE was
> prioritized over other error codes, which actually is sort of
> incorrect in that once a FLUSH fails, it _MUST_ be reported to upper
> layers as FLUSH failure implies data already lost.  So,
> DM_ENDIO_REQUEUE actually should have lower priority than other
> failures.  But, then again, the error codes still need to be
> prioritized.
I think that's correct and changing the priority of DM_ENDIO_REQUEUE
for REQ_FLUSH down to the lowest should be fine.
(I didn't know that FLUSH failure implies data loss possibility.)
But the patch is not enough, you have to change target drivers, too.
E.g. As for multipath, you need to change
     drivers/md/dm-mpath.c:do_end_io() to return error for REQ_FLUSH
     like the REQ_DISCARD support included in 2.6.36-rc1.
By the way, if this patch set with the change above is included,
even one path failure for REQ_FLUSH on multipath configuration will
be reported to upper layer as error, although it's retried using
other paths currently.
Then, if an upper layer won't take correct recovery action for the error,
it would be seen as a regression for users. (e.g. Frequent EXT3-error
resulting in read-only mount on multipath configuration.)
Although I think the explicit error is fine rather than implicit data
corruption, please check upper layers carefully so that users won't see
such errors as much as possible.
Thanks,
Kiyoshi Ueda
^ permalink raw reply	[flat|nested] 109+ messages in thread
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-20  8:26                   ` Kiyoshi Ueda
@ 2010-08-23 12:14                     ` Tejun Heo
  2010-08-23 14:17                       ` Mike Snitzer
  2010-08-24 17:11                       ` [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Vladislav Bolkhovitin
  0 siblings, 2 replies; 109+ messages in thread
From: Tejun Heo @ 2010-08-23 12:14 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Christoph Hellwig, jaxboe, linux-fsdevel, linux-scsi, linux-ide,
	linux-kernel, linux-raid, James.Bottomley, tytso, chris.mason,
	swhiteho, konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
Hello,
On 08/20/2010 10:26 AM, Kiyoshi Ueda wrote:
> I think that's correct and changing the priority of DM_ENDIO_REQUEUE
> for REQ_FLUSH down to the lowest should be fine.
> (I didn't know that FLUSH failure implies data loss possibility.)
At least on ATA, FLUSH failure implies that data is already lost, so
the error can't be ignored or retried.
> But the patch is not enough, you have to change target drivers, too.
> E.g. As for multipath, you need to change
>      drivers/md/dm-mpath.c:do_end_io() to return error for REQ_FLUSH
>      like the REQ_DISCARD support included in 2.6.36-rc1.
I'll take a look but is there an easy way to test mpath other than having
fancy hardware?
> By the way, if this patch set with the change above is included,
> even one path failure for REQ_FLUSH on multipath configuration will
> be reported to upper layer as error, although it's retried using
> other paths currently.
> Then, if an upper layer won't take correct recovery action for the error,
> it would be seen as a regression for users. (e.g. Frequent EXT3-error
> resulting in read-only mount on multipath configuration.)
> 
> Although I think the explicit error is fine rather than implicit data
> corruption, please check upper layers carefully so that users won't see
> such errors as much as possible.
Argh... then it will have to discern why FLUSH failed.  It can retry
for transport errors but if it got aborted by the device it should
report upwards.  Maybe just turn off barrier support in mpath for now?
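To make the retry-vs-report split concrete, here is a minimal sketch of the policy, not actual dm-mpath code.  The names flush_err_class, flush_action and mpath_flush_action() are hypothetical, and the sketch assumes the lower layer can already tell a transport failure from a device-side failure, which is exactly the piece that is not available yet.

```c
/* Hypothetical classification of why a FLUSH failed (not a kernel API). */
enum flush_err_class {
	FLUSH_ERR_NONE,       /* flush completed successfully */
	FLUSH_ERR_TRANSPORT,  /* path/link problem, the device never failed it */
	FLUSH_ERR_TARGET,     /* the device itself aborted/failed the flush */
};

/* Hypothetical outcome the multipath layer would pick. */
enum flush_action {
	FLUSH_COMPLETE_OK,    /* report success upwards */
	FLUSH_RETRY_OTHER,    /* requeue the flush on another path */
	FLUSH_REPORT_ERROR,   /* report the failure upwards */
};

static enum flush_action mpath_flush_action(enum flush_err_class err,
					    unsigned int usable_paths)
{
	switch (err) {
	case FLUSH_ERR_NONE:
		return FLUSH_COMPLETE_OK;
	case FLUSH_ERR_TRANSPORT:
		/* Retrying only makes sense if another path is left. */
		return usable_paths > 0 ? FLUSH_RETRY_OTHER : FLUSH_REPORT_ERROR;
	case FLUSH_ERR_TARGET:
	default:
		/* Data may already be lost; never hide this behind a retry. */
		return FLUSH_REPORT_ERROR;
	}
}
```

The point of the split is that only FLUSH_ERR_TRANSPORT is ever turned into a retry; anything the device itself failed goes straight to the upper layer.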
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-23 12:14                     ` Tejun Heo
@ 2010-08-23 14:17                       ` Mike Snitzer
  2010-08-24 10:24                         ` Kiyoshi Ueda
  2010-08-24 17:11                       ` [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Vladislav Bolkhovitin
  1 sibling, 1 reply; 109+ messages in thread
From: Mike Snitzer @ 2010-08-23 14:17 UTC (permalink / raw)
  To: Tejun Heo, Hannes Reinecke
  Cc: Kiyoshi Ueda, tytso, linux-scsi, jaxboe, jack, linux-kernel,
	swhiteho, linux-raid, linux-ide, James.Bottomley, konishi.ryusuke,
	linux-fsdevel, vst, rwheeler, Christoph Hellwig, chris.mason,
	dm-devel
On Mon, Aug 23 2010 at  8:14am -0400,
Tejun Heo <tj@kernel.org> wrote:
> Hello,
> 
> On 08/20/2010 10:26 AM, Kiyoshi Ueda wrote:
> > I think that's correct and changing the priority of DM_ENDIO_REQUEUE
> > for REQ_FLUSH down to the lowest should be fine.
> > (I didn't know that FLUSH failure implies data loss possibility.)
> 
> At least on ATA, FLUSH failure implies that data is already lost, so
> the error can't be ignored or retried.
> 
> > But the patch is not enough, you have to change target drivers, too.
> > E.g. As for multipath, you need to change
> >      drivers/md/dm-mpath.c:do_end_io() to return error for REQ_FLUSH
> >      like the REQ_DISCARD support included in 2.6.36-rc1.
> 
> I'll take a look, but is there an easy way to test mpath other than having
> fancy hardware?
It is easy enough to make a single path use mpath.  Just verify/modify
/etc/multipath.conf so that your device isn't blacklisted.
multipathd will even work with a scsi-debug device.
You obviously won't get path failover but you'll see the path get marked
faulty, etc.
> > By the way, if these patch-set with the change above are included,
> > even one path failure for REQ_FLUSH on multipath configuration will
> > be reported to upper layer as error, although it's retried using
> > other paths currently.
> > Then, if an upper layer won't take correct recovery action for the error,
> > it would be seen as a regression for users. (e.g. Frequent EXT3-error
> > resulting in read-only mount on multipath configuration.)
> > 
> > Although I think the explicit error is fine rather than implicit data
> > corruption, please check upper layers carefully so that users won't see
> > such errors as much as possible.
> 
> Argh... then it will have to discern why FLUSH failed.  It can retry
> for transport errors but if it got aborted by the device it should
> report upwards.
Yes, we discussed this issue of needing to train dm-multipath to know if
there was a transport failure or not (at LSF).  But I'm not sure when
Hannes intends to repost his work in this area (updated to account for
feedback from LSF).
> Maybe just turn off barrier support in mpath for now?
I think we'd prefer to have a device fail rather than jeopardize data
integrity.  Clearly not ideal but...
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-23 14:17                       ` Mike Snitzer
@ 2010-08-24 10:24                         ` Kiyoshi Ueda
  2010-08-24 16:59                           ` Tejun Heo
  0 siblings, 1 reply; 109+ messages in thread
From: Kiyoshi Ueda @ 2010-08-24 10:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Snitzer, Hannes Reinecke, tytso, linux-scsi, jaxboe, jack,
	linux-kernel, swhiteho, linux-raid, linux-ide, James.Bottomley,
	konishi.ryusuke, linux-fsdevel, vst, rwheeler, Christoph Hellwig,
	chris.mason, dm-devel
Hi Tejun,
On 08/23/2010 11:17 PM +0900, Mike Snitzer wrote:
> On Mon, Aug 23 2010 at  8:14am -0400, Tejun Heo <tj@kernel.org> wrote:
>> On 08/20/2010 10:26 AM, Kiyoshi Ueda wrote:
>>> By the way, if these patch-set with the change above are included,
>>> even one path failure for REQ_FLUSH on multipath configuration will
>>> be reported to upper layer as error, although it's retried using
>>> other paths currently.
>>> Then, if an upper layer won't take correct recovery action for the error,
>>> it would be seen as a regression for users. (e.g. Frequent EXT3-error
>>> resulting in read-only mount on multipath configuration.)
>>>
>>> Although I think the explicit error is fine rather than implicit data
>>> corruption, please check upper layers carefully so that users won't see
>>> such errors as much as possible.
>> 
>> Argh... then it will have to discern why FLUSH failed.  It can retry
>> for transport errors but if it got aborted by the device it should
>> report upwards.
> 
> Yes, we discussed this issue of needing to train dm-multipath to know if
> there was a transport failure or not (at LSF).  But I'm not sure when
> Hannes intends to repost his work in this area (updated to account for
> feedback from LSF).
Yes, checking whether it's a transport error in lower layer is
the right solution.
(Since I know it's not available yet, I just hoped if upper layers
 had some other options.)
Anyway, only reporting errors for REQ_FLUSH to upper layer without
such a solution would make dm-multipath almost unusable in real world,
although it's better than implicit data loss.
>> Maybe just turn off barrier support in mpath for now?
If it's possible, it could be a workaround for a short term.
But how can you do that?
I think it's not enough to just drop REQ_FLUSH flag from q->flush_flags.
Underlying devices of a mpath device may have write-back cache and
it may be enabled.
So if a mpath device doesn't set REQ_FLUSH flag in q->flush_flags, it
becomes a device which has write-back cache but doesn't support flush.
Then, upper layer can do nothing to ensure cache flush?
Thanks,
Kiyoshi Ueda
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-24 10:24                         ` Kiyoshi Ueda
@ 2010-08-24 16:59                           ` Tejun Heo
  2010-08-24 17:52                             ` Mike Snitzer
  2010-08-25  8:00                             ` Kiyoshi Ueda
  0 siblings, 2 replies; 109+ messages in thread
From: Tejun Heo @ 2010-08-24 16:59 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Mike Snitzer, Hannes Reinecke, tytso, linux-scsi, jaxboe, jack,
	linux-kernel, swhiteho, linux-raid, linux-ide, James.Bottomley,
	konishi.ryusuke, linux-fsdevel, vst, rwheeler, Christoph Hellwig,
	chris.mason, dm-devel
Hello,
On 08/24/2010 12:24 PM, Kiyoshi Ueda wrote:
> Yes, checking whether it's a transport error in lower layer is
> the right solution.
> (Since I know it's not available yet, I just hoped if upper layers
>  had some other options.)
> 
> Anyway, only reporting errors for REQ_FLUSH to upper layer without
> such a solution would make dm-multipath almost unusable in real world,
> although it's better than implicit data loss.
I see.
>>> Maybe just turn off barrier support in mpath for now?
> 
> If it's possible, it could be a workaround for a short term.
> But how can you do that?
> 
> I think it's not enough to just drop REQ_FLUSH flag from q->flush_flags.
> Underlying devices of a mpath device may have write-back cache and
> it may be enabled.
> So if a mpath device doesn't set REQ_FLUSH flag in q->flush_flags, it
> becomes a device which has write-back cache but doesn't support flush.
> Then, upper layer can do nothing to ensure cache flush?
Yeah, I was basically suggesting to forget about cache flush w/ mpath
until it can be fixed.  You're saying that if mpath just passes
REQ_FLUSH errors upwards without retrying, it will be almost unusable,
right?  I'm not sure how to proceed here.  How much work would
discerning between transport and IO errors take?  If it can't be done
quickly enough the retry logic can be kept around to keep the old
behavior but that already was a broken behavior, so...  :-(
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-24 16:59                           ` Tejun Heo
@ 2010-08-24 17:52                             ` Mike Snitzer
  2010-08-24 18:14                               ` Tejun Heo
  2010-08-25  8:00                             ` Kiyoshi Ueda
  1 sibling, 1 reply; 109+ messages in thread
From: Mike Snitzer @ 2010-08-24 17:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kiyoshi Ueda, Hannes Reinecke, tytso, linux-scsi, jaxboe, jack,
	linux-kernel, swhiteho, linux-raid, linux-ide, James.Bottomley,
	konishi.ryusuke, linux-fsdevel, vst, rwheeler, Christoph Hellwig,
	chris.mason, dm-devel
On Tue, Aug 24 2010 at 12:59pm -0400,
Tejun Heo <tj@kernel.org> wrote:
> Hello,
> 
> On 08/24/2010 12:24 PM, Kiyoshi Ueda wrote:
> > Yes, checking whether it's a transport error in lower layer is
> > the right solution.
> > (Since I know it's not available yet, I just hoped if upper layers
> >  had some other options.)
> > 
> > Anyway, only reporting errors for REQ_FLUSH to upper layer without
> > such a solution would make dm-multipath almost unusable in real world,
> > although it's better than implicit data loss.
> 
> I see.
> 
> >>> Maybe just turn off barrier support in mpath for now?
> > 
> > If it's possible, it could be a workaround for a short term.
> > But how can you do that?
> > 
> > I think it's not enough to just drop REQ_FLUSH flag from q->flush_flags.
> > Underlying devices of a mpath device may have write-back cache and
> > it may be enabled.
> > So if a mpath device doesn't set REQ_FLUSH flag in q->flush_flags, it
> > becomes a device which has write-back cache but doesn't support flush.
> > Then, upper layer can do nothing to ensure cache flush?
> 
> Yeah, I was basically suggesting to forget about cache flush w/ mpath
> until it can be fixed.  You're saying that if mpath just passes
> REQ_FLUSH errors upwards without retrying, it will be almost unusable,
> right?  I'm not sure how to proceed here.
Seems clear that we must fix mpath to receive the SCSI errors, in some
form, so it can decide if a retry is required/valid or not.
Such error processing was a big selling point for the transition from
bio-based to request-based multipath; so it's unfortunate that this
piece has been left until now.
> How much work would discerning between transport and IO errors take?
Hannes already proposed some patches:
https://patchwork.kernel.org/patch/61282/
https://patchwork.kernel.org/patch/61283/
https://patchwork.kernel.org/patch/61596/
This work was discussed at LSF, see "Error Handling - Hannes Reinecke"
here: http://lwn.net/Articles/400589/
I thought James, Alasdair and others offered some guidance on what they'd
like to see...
Unfortunately, even though I was at this LSF session, I can't recall any
specific consensus on how Hannes' work should be refactored (to avoid
adding SCSI sense processing code directly in dm-mpath).  Maybe James,
Hannes or others remember?
Was it enough to just have the SCSI sense processing code split out in a
new sub-section of the SCSI midlayer -- and then DM calls that code?
> If it can't be done quickly enough the retry logic can be kept around
> to keep the old behavior but that already was a broken behavior, so...
> :-(
I'll have to review this thread again to understand why mpath's existing
retry logic is broken behavior.  mpath is used with more capable SCSI
devices so I'm missing why a failed FLUSH implies data loss.
Mike
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-24 17:52                             ` Mike Snitzer
@ 2010-08-24 18:14                               ` Tejun Heo
  0 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2010-08-24 18:14 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Kiyoshi Ueda, Hannes Reinecke, tytso, linux-scsi, jaxboe, jack,
	linux-kernel, swhiteho, linux-raid, linux-ide, James.Bottomley,
	konishi.ryusuke, linux-fsdevel, vst, rwheeler, Christoph Hellwig,
	chris.mason, dm-devel
Hello,
On 08/24/2010 07:52 PM, Mike Snitzer wrote:
>> If it can't be done quickly enough the retry logic can be kept around
>> to keep the old behavior but that already was a broken behavior, so...
>> :-(
> 
> I'll have to review this thread again to understand why mpath's existing
> retry logic is broken behavior.  mpath is used with more capable SCSI
> devices so I'm missing why a failed FLUSH implies data loss.
SBC doesn't specify the failure behavior, so retrying a flush could
turn out to be safe.  But for most disk-type devices, a flush failure
usually indicates that the device has exhausted all its options for
committing some of the pending data to NV media - i.e. even remapping
failed for whatever reason.  Even if a retry is safe, it is more likely
to simply delay notification of the failure.
In ATA, the situation is clearer: when a device actively fails a
flush, the drive reports the first sector it failed to commit, and
the next flush will continue _after_ that sector - IOW, data is
already lost.
<speculation>
I think there's no reason mpath should be tasked with retrying flush
failures.  That's up to the SCSI EH.  If the command failed in a 'safe'
transient way - i.e. device busy or whatnot - SCSI EH can and does
retry the command.  There are several FAILFAST bits already, so SCSI EH
can avoid retrying transport errors for mpath (maybe it already does
that?).  It would then just need to be able to tell the upper layer
that the failure was a fast one and that the upper layer is responsible
for retrying.  Is there any reason to pass the whole sense information
upwards?
</speculation>
Anyways, flush failure is different from read/write failures.
Read/writes can always be retried cleanly.  They are stateless.  I
don't know how SCSI devices would actually behave but it's a bit
scary to retry a SYNCHRONIZE_CACHE that a device failed and then
report success upwards.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 109+ messages in thread 
 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-24 16:59                           ` Tejun Heo
  2010-08-24 17:52                             ` Mike Snitzer
@ 2010-08-25  8:00                             ` Kiyoshi Ueda
  2010-08-25 15:28                               ` Mike Snitzer
  2010-08-25 15:59                               ` [RFC] training mpath to discern between SCSI errors (was: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush) Mike Snitzer
  1 sibling, 2 replies; 109+ messages in thread
From: Kiyoshi Ueda @ 2010-08-25  8:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Snitzer, Hannes Reinecke, tytso, linux-scsi, jaxboe, jack,
	linux-kernel, swhiteho, linux-raid, linux-ide, James.Bottomley,
	konishi.ryusuke, linux-fsdevel, vst, rwheeler, Christoph Hellwig,
	chris.mason, dm-devel
Hi Tejun,
On 08/25/2010 01:59 AM +0900, Tejun Heo wrote:
> On 08/24/2010 12:24 PM, Kiyoshi Ueda wrote:
>> Anyway, only reporting errors for REQ_FLUSH to upper layer without
>> such a solution would make dm-multipath almost unusable in real world,
>> although it's better than implicit data loss.
> 
> I see.
> 
>>> Maybe just turn off barrier support in mpath for now?
>> 
>> If it's possible, it could be a workaround for a short term.
>> But how can you do that?
>>
>> I think it's not enough to just drop REQ_FLUSH flag from q->flush_flags.
>> Underlying devices of a mpath device may have write-back cache and
>> it may be enabled.
>> So if a mpath device doesn't set REQ_FLUSH flag in q->flush_flags, it
>> becomes a device which has write-back cache but doesn't support flush.
>> Then, upper layer can do nothing to ensure cache flush?
> 
> Yeah, I was basically suggesting to forget about cache flush w/ mpath
> until it can be fixed.  You're saying that if mpath just passes
> REQ_FLUSH errors upwards without retrying, it will be almost unusable,
> right?
Right.
If the error is one that is safe and necessary to retry on another
path, mpath should retry even for REQ_FLUSH.  Otherwise, a single path
failure may bring the system down.
Just passing every REQ_FLUSH error upwards regardless of the error type
will create such situations, and users will perceive the behavior as
unstable/unusable.
> I'm not sure how to proceed here.  How much work would
> discerning between transport and IO errors take?  If it can't be done
> quickly enough the retry logic can be kept around to keep the old
> behavior but that already was a broken behavior, so...  :-(
I'm not sure how long it will take.
Anyway, as you said, the flush error handling of dm-mpath is already
broken if data loss really happens on any storage used by dm-mpath.
Although it's a serious issue and quick fix is required, I think
you may leave the old behavior in your patch-set, since it's
a separate issue.
Thanks,
Kiyoshi Ueda
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-25  8:00                             ` Kiyoshi Ueda
@ 2010-08-25 15:28                               ` Mike Snitzer
  2010-08-27  9:47                                 ` Kiyoshi Ueda
  2010-08-25 15:59                               ` [RFC] training mpath to discern between SCSI errors (was: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush) Mike Snitzer
  1 sibling, 1 reply; 109+ messages in thread
From: Mike Snitzer @ 2010-08-25 15:28 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Tejun Heo, Hannes Reinecke, tytso, linux-scsi, jaxboe, jack,
	linux-kernel, swhiteho, linux-raid, linux-ide, James.Bottomley,
	konishi.ryusuke, linux-fsdevel, vst, rwheeler, Christoph Hellwig,
	chris.mason, dm-devel
On Wed, Aug 25 2010 at  4:00am -0400,
Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> Hi Tejun,
> 
> On 08/25/2010 01:59 AM +0900, Tejun Heo wrote:
> > On 08/24/2010 12:24 PM, Kiyoshi Ueda wrote:
> >> Anyway, only reporting errors for REQ_FLUSH to upper layer without
> >> such a solution would make dm-multipath almost unusable in real world,
> >> although it's better than implicit data loss.
> > 
> > I see.
> > 
> >>> Maybe just turn off barrier support in mpath for now?
> >> 
> >> If it's possible, it could be a workaround for a short term.
> >> But how can you do that?
> >>
> >> I think it's not enough to just drop REQ_FLUSH flag from q->flush_flags.
> >> Underlying devices of a mpath device may have write-back cache and
> >> it may be enabled.
> >> So if a mpath device doesn't set REQ_FLUSH flag in q->flush_flags, it
> >> becomes a device which has write-back cache but doesn't support flush.
> >> Then, upper layer can do nothing to ensure cache flush?
> > 
> > Yeah, I was basically suggesting to forget about cache flush w/ mpath
> > until it can be fixed.  You're saying that if mpath just passes
> > REQ_FLUSH errors upwards without retrying, it will be almost unusable,
> > right?
> 
> Right.
> If the error is one that is safe and necessary to retry on another
> path, mpath should retry even for REQ_FLUSH.  Otherwise, a single path
> failure may bring the system down.
> Just passing every REQ_FLUSH error upwards regardless of the error type
> will create such situations, and users will perceive the behavior as
> unstable/unusable.
Right, there are hardware configurations that lend themselves to FLUSH
retries mattering, namely:
1) a SAS drive with 2 ports and a writeback cache
2) theoretically possible: SCSI array that is mpath capable but
   advertises cache as writeback (WCE=1)
The SAS case is obviously a more concrete example of why FLUSH retries
are worthwhile in mpath.
But I understand (and agree) that we'd be better off if mpath could
differentiate between failures rather than blindly retrying on failures
like it does today (fails path and retries if additional paths
available).
> Anyway, as you said, the flush error handling of dm-mpath is already
> broken if data loss really happens on any storage used by dm-mpath.
> Although it's a serious issue and quick fix is required, I think
> you may leave the old behavior in your patch-set, since it's
> a separate issue.
I'm not seeing where anything is broken with current mpath.  If a
multipathed LUN is WCE=1 then it should be fair to assume the cache is
mirrored or shared across ports.  Therefore retrying the SYNCHRONIZE
CACHE is needed.
Do we still have fear that SYNCHRONIZE CACHE can silently drop data?
Seems unlikely especially given what Tejun shared from SBC.
It seems that at worst, with current mpath, we retry when it doesn't
make sense (e.g. target failure).
Mike
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-25 15:28                               ` Mike Snitzer
@ 2010-08-27  9:47                                 ` Kiyoshi Ueda
  2010-08-27 13:49                                   ` Mike Snitzer
  0 siblings, 1 reply; 109+ messages in thread
From: Kiyoshi Ueda @ 2010-08-27  9:47 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: jack, linux-scsi, jaxboe, linux-fsdevel, linux-kernel,
	Christoph Hellwig, linux-raid, linux-ide, dm-devel,
	James.Bottomley, konishi.ryusuke, Tejun Heo, tytso, rwheeler, vst,
	swhiteho, chris.mason
Hi Mike,
On 08/26/2010 12:28 AM +0900, Mike Snitzer wrote:
> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
>> Anyway, as you said, the flush error handling of dm-mpath is already
>> broken if data loss really happens on any storage used by dm-mpath.
>> Although it's a serious issue and quick fix is required, I think
>> you may leave the old behavior in your patch-set, since it's
>> a separate issue.
> 
> I'm not seeing where anything is broken with current mpath.  If a
> multipathed LUN is WCE=1 then it should be fair to assume the cache is
> mirrored or shared across ports.  Therefore retrying the SYNCHRONIZE
> CACHE is needed.
> 
> Do we still have fear that SYNCHRONIZE CACHE can silently drop data?
> Seems unlikely especially given what Tejun shared from SBC.
Do we have any proof to wipe that fear?
If retrying on flush failure is safe on all storages used with multipath
(e.g. SCSI, CCISS, DASD, etc), then current dm-mpath should be fine in
the real world.
But I'm afraid there may be storage where something like the following
can happen:
    - a flush command is returned to mpath as an error because, say, a
      part of the cache has physically broken; the data in that part is
      lost and the storage shrinks the cache accordingly.
    - mpath retries the flush command using another path.
    - the flush command is returned to mpath as a success.
    - mpath passes the result, success, to the upper layer, but some of
      the data is already lost.
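The sequence above can be modelled in a few lines.  This is a toy model, not a description of any real device: struct toy_dev and the toy_*() helpers are made up for illustration, and the behaviour "a broken cache drops its data, fails one flush, then reports success" is precisely the assumption in question.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_SECTORS 8

/* Toy write-back device: media plus a volatile cache (all hypothetical). */
struct toy_dev {
	char media[TOY_SECTORS];
	char cache[TOY_SECTORS];
	bool dirty[TOY_SECTORS];
	bool error_pending;   /* a cache failure not yet reported to the host */
};

static void toy_init(struct toy_dev *d)
{
	memset(d, 0, sizeof(*d));
}

/* A WRITE completes as soon as the data is in the cache. */
static void toy_write(struct toy_dev *d, int sector, char data)
{
	d->cache[sector] = data;
	d->dirty[sector] = true;
}

/* Part of the cache breaks: the dirty data in it is gone for good. */
static void toy_break_cache(struct toy_dev *d, int sector)
{
	d->dirty[sector] = false;
	d->error_pending = true;
}

/* FLUSH: fail once to report the breakage, then flush what is left. */
static int toy_flush(struct toy_dev *d)
{
	if (d->error_pending) {
		d->error_pending = false;
		return -1;
	}
	for (int i = 0; i < TOY_SECTORS; i++) {
		if (d->dirty[i]) {
			d->media[i] = d->cache[i];
			d->dirty[i] = false;
		}
	}
	return 0;
}

/* A blind retry: try again (think: other path) and return the last result. */
static int flush_with_blind_retry(struct toy_dev *d)
{
	int ret = toy_flush(d);

	if (ret)
		ret = toy_flush(d);
	return ret;
}
```

If a sector is written and then lost via toy_break_cache(), flush_with_blind_retry() still returns 0 while that sector never reaches media[], which is the silent loss I'm worried about.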
Thanks,
Kiyoshi Ueda
^ permalink raw reply	[flat|nested] 109+ messages in thread
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-27  9:47                                 ` Kiyoshi Ueda
@ 2010-08-27 13:49                                   ` Mike Snitzer
  2010-08-30  6:13                                     ` Kiyoshi Ueda
  0 siblings, 1 reply; 109+ messages in thread
From: Mike Snitzer @ 2010-08-27 13:49 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Tejun Heo, Hannes Reinecke, tytso, linux-scsi, jaxboe, jack,
	linux-kernel, swhiteho, linux-raid, linux-ide, James.Bottomley,
	konishi.ryusuke, linux-fsdevel, vst, rwheeler, Christoph Hellwig,
	chris.mason, dm-devel
On Fri, Aug 27 2010 at  5:47am -0400,
Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> Hi Mike,
> 
> On 08/26/2010 12:28 AM +0900, Mike Snitzer wrote:
> > Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> >> Anyway, as you said, the flush error handling of dm-mpath is already
> >> broken if data loss really happens on any storage used by dm-mpath.
> >> Although it's a serious issue and quick fix is required, I think
> >> you may leave the old behavior in your patch-set, since it's
> >> a separate issue.
> > 
> > I'm not seeing where anything is broken with current mpath.  If a
> > multipathed LUN is WCE=1 then it should be fair to assume the cache is
> > mirrored or shared across ports.  Therefore retrying the SYNCHRONIZE
> > CACHE is needed.
> > 
> > Do we still have fear that SYNCHRONIZE CACHE can silently drop data?
> > Seems unlikely especially given what Tejun shared from SBC.
> 
> Do we have any proof to wipe that fear?
> 
> If retrying on flush failure is safe on all storages used with multipath
> (e.g. SCSI, CCISS, DASD, etc), then current dm-mpath should be fine in
> the real world.
> But I'm afraid there may be storage where something like the following
> can happen:
>     - a flush command is returned to mpath as an error because, say, a
>       part of the cache has physically broken; the data in that part is
>       lost and the storage shrinks the cache accordingly.
>     - mpath retries the flush command using another path.
>     - the flush command is returned to mpath as a success.
>     - mpath passes the result, success, to the upper layer, but some of
>       the data is already lost.
That does seem like a valid concern.  But I'm not seeing why it's unique
to SYNCHRONIZE CACHE.  Any IO that fails on the target side should be
passed up once the error gets to DM.
Mike
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-27 13:49                                   ` Mike Snitzer
@ 2010-08-30  6:13                                     ` Kiyoshi Ueda
  2010-09-01  0:55                                       ` safety of retrying SYNCHRONIZE CACHE [was: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush] Mike Snitzer
  0 siblings, 1 reply; 109+ messages in thread
From: Kiyoshi Ueda @ 2010-08-30  6:13 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Tejun Heo, Hannes Reinecke, tytso, linux-scsi, jaxboe, jack,
	linux-kernel, swhiteho, linux-raid, linux-ide, James.Bottomley,
	konishi.ryusuke, linux-fsdevel, vst, rwheeler, Christoph Hellwig,
	chris.mason, dm-devel
Hi Mike,
On 08/27/2010 10:49 PM +0900, Mike Snitzer wrote:
> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
>> On 08/26/2010 12:28 AM +0900, Mike Snitzer wrote:
>>> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
>>>> Anyway, as you said, the flush error handling of dm-mpath is already
>>>> broken if data loss really happens on any storage used by dm-mpath.
>>>> Although it's a serious issue and quick fix is required, I think
>>>> you may leave the old behavior in your patch-set, since it's
>>>> a separate issue.
>>> 
>>> I'm not seeing where anything is broken with current mpath.  If a
>>> multipathed LUN is WCE=1 then it should be fair to assume the cache is
>>> mirrored or shared across ports.  Therefore retrying the SYNCHRONIZE
>>> CACHE is needed.
>>>
>>> Do we still have fear that SYNCHRONIZE CACHE can silently drop data?
>>> Seems unlikely especially given what Tejun shared from SBC.
>> 
>> Do we have any proof to wipe that fear?
>>
>> If retrying on flush failure is safe on all storages used with multipath
>> (e.g. SCSI, CCISS, DASD, etc), then current dm-mpath should be fine in
>> the real world.
>> But I'm afraid there may be storage where something like the following
>> can happen:
>>     - a flush command is returned to mpath as an error because, say, a
>>       part of the cache has physically broken; the data in that part is
>>       lost and the storage shrinks the cache accordingly.
>>     - mpath retries the flush command using another path.
>>     - the flush command is returned to mpath as a success.
>>     - mpath passes the result, success, to the upper layer, but some of
>>       the data is already lost.
> 
> That does seem like a valid concern.  But I'm not seeing why it's unique
> to SYNCHRONIZE CACHE.  Any IO that fails on the target side should be
> passed up once the error gets to DM.
See Tejun's explanation again:
    http://marc.info/?l=linux-kernel&m=128267361813859&w=2
What I'm concerned about is whether the same thing Tejun explained
for ATA can happen on other types of devices.
Normal write command has data and no data loss happens on error.
So it can be retried cleanly, and if the result of the retry is
success, it's really success, no implicit data loss.
Normal read command has a sector to read.  If the sector is broken,
all retries will fail and the error will be reported upwards.
So it can be retried cleanly as well.
Thanks,
Kiyoshi Ueda
^ permalink raw reply	[flat|nested] 109+ messages in thread
- * safety of retrying SYNCHRONIZE CACHE [was: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush]
  2010-08-30  6:13                                     ` Kiyoshi Ueda
@ 2010-09-01  0:55                                       ` Mike Snitzer
  2010-09-01  7:32                                         ` Hannes Reinecke
  0 siblings, 1 reply; 109+ messages in thread
From: Mike Snitzer @ 2010-09-01  0:55 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Tejun Heo, Hannes Reinecke, tytso, linux-scsi, jaxboe, jack,
	linux-kernel, swhiteho, linux-raid, linux-ide, James.Bottomley,
	konishi.ryusuke, linux-fsdevel, vst, rwheeler, Christoph Hellwig,
	chris.mason, dm-devel, Frederick.Knight
Hi Kiyoshi,
On Mon, Aug 30 2010 at  2:13am -0400,
Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> > That does seem like a valid concern.  But I'm not seeing why it's unique
> > to SYNCHRONIZE CACHE.  Any IO that fails on the target side should be
> > passed up once the error gets to DM.
> 
> See Tejun's explanation again:
>     http://marc.info/?l=linux-kernel&m=128267361813859&w=2
> What I'm concerned about is whether the same thing Tejun explained
> for ATA can happen on other types of devices.
> 
> 
> Normal write command has data and no data loss happens on error.
> So it can be retried cleanly, and if the result of the retry is
> success, it's really success, no implicit data loss.
> 
> Normal read command has a sector to read.  If the sector is broken,
> all retries will fail and the error will be reported upwards.
> So it can be retried cleanly as well.
I reached out to Fred Knight on this, to get more insight from a pure
SCSI SBC perspective, and he shared the following:
----- Forwarded message from "Knight, Frederick" <Frederick.Knight@netapp.com> -----
> Date: Tue, 31 Aug 2010 13:24:15 -0400
> From: "Knight, Frederick" <Frederick.Knight@netapp.com>
> To: Mike Snitzer <snitzer@redhat.com>
> Subject: RE: safety of retrying SYNCHRONIZE CACHE?
>
> There are requirements in SBC to maintain data integrity.  If you WRITE
> a block and READ that block, you must get the data you sent in the
> WRITE.  This will be synchronized around the completion of the WRITE.
> Before the WRITE completes, who knows what a READ will return.  Maybe
> all the old data, maybe all the new data, maybe some mix of old and new
> data.  Once the WRITE ends successfully, all READs of those LBAs (from any
> port) will always get the same data.
>
> As for errors, SBC describes how the deferred errors are reported (like
> when a CACHE tries to flush but fails).  So if a write from cache to
> media does have problems, the device would tell you via a CHECK
> CONDITION (with the first byte of the sense data set to 71h or 73h).  SBC
> clauses 4.12 and 4.13 cover a lot of this information.  It is these error
> codes that prevent silent loss of data.  And, in this case, when the
> CHECK CONDITION is delivered, it will have nothing to do with the
> command that was issued (the victim command).  If you look into the
> sense data, you will see the deferred error flag, and all the additional
> information fields will relate to the original I/O.
>
> SYNCHRONIZE CACHE is not substantially different than a WRITE (it puts
> data on the media).  So issuing it multiple times wouldn't be any
> different than issuing multiple WRITES (it might put a temporary dent in
> performance as everything flushes out to media).  If it or any other
> commands fail with 71h/73h, then you have to dig down into the sense
> data buffer to find out what happened.  For example, if you issue a
> WRITE command, and it completes into write back cache but later (before
> being written to the media), some of the cache breaks and loses data,
> then the device must signal a deferred error to tell the host, and cause
> a forced error on the LBA in question.
>
> Does that help?
>
>       Fred
----- End forwarded message -----
Seems like verifying/improving the handling of CHECK CONDITION is a more
pressing concern than silent data loss purely due to SYNCHRONIZE CACHE
retries.  Without proper handling we could completely miss these
deferred errors.
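As a rough illustration of the 71h/73h check Fred describes, here is a minimal standalone C sketch (not the kernel's code; the helper name is hypothetical). It assumes the response code sits in the low 7 bits of the first sense byte, with 70h/72h meaning a current error and 71h/73h a deferred one:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if the sense buffer reports a
 * deferred error.  Assumes byte 0 holds the response code in its low
 * 7 bits: 70h/72h = current error, 71h/73h = deferred error.
 */
static bool sense_is_deferred(const uint8_t *sense, size_t len)
{
	uint8_t code;

	if (sense == NULL || len == 0)
		return false;	/* no sense data to inspect */

	code = sense[0] & 0x7f;	/* mask off the top bit */
	return code == 0x71 || code == 0x73;
}
```

A caller would pass the raw sense buffer returned with the CHECK CONDITION; a true result means the error belongs to some earlier I/O rather than to the command that just completed.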
But how to effectively report such errors to upper layers is unclear to
me given that a particular SCSI command can carry error information for
IO that was already acknowledged successful (e.g. to the FS).
drivers/scsi/scsi_error.c's various calls to scsi_check_sense()
illustrate Linux's current CHECK CONDITION handling.  I need to look
closer at how deferred errors propagate to upper layers.  After an
initial look it seems scsi_error.c does handle retrying commands where
appropriate.
I believe Hannes has concerns/insight here.
Mike
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: safety of retrying SYNCHRONIZE CACHE [was: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush]
  2010-09-01  0:55                                       ` safety of retrying SYNCHRONIZE CACHE [was: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush] Mike Snitzer
@ 2010-09-01  7:32                                         ` Hannes Reinecke
  2010-09-01  7:38                                           ` Hannes Reinecke
  0 siblings, 1 reply; 109+ messages in thread
From: Hannes Reinecke @ 2010-09-01  7:32 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Kiyoshi Ueda, Tejun Heo, tytso, linux-scsi, jaxboe, jack,
	linux-kernel, swhiteho, linux-raid, linux-ide, James.Bottomley,
	konishi.ryusuke, linux-fsdevel, vst, rwheeler, Christoph Hellwig,
	chris.mason, dm-devel, Frederick.Knight
Mike Snitzer wrote:
> Hi Kiyoshi,
> 
> On Mon, Aug 30 2010 at  2:13am -0400,
> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> 
>>> That does seem like a valid concern.  But I'm not seeing why it's unique
>>> to SYNCHRONIZE CACHE.  Any IO that fails on the target side should be
>>> passed up once the error gets to DM.
>> See Tejun's explanation again:
>>     http://marc.info/?l=linux-kernel&m=128267361813859&w=2
>> What I'm concerned about is whether the same thing Tejun explained
>> for ATA can happen on other types of devices.
>>
>>
>> Normal write command has data and no data loss happens on error.
>> So it can be retried cleanly, and if the result of the retry is
>> success, it's really success, no implicit data loss.
>>
>> Normal read command has a sector to read.  If the sector is broken,
>> all retries will fail and the error will be reported upwards.
>> So it can be retried cleanly as well.
> 
> I reached out to Fred Knight on this to get more insight from a pure
> SCSI SBC perspective, and he shared the following:
> 
> ----- Forwarded message from "Knight, Frederick" <Frederick.Knight@netapp.com> -----
> 
>> Date: Tue, 31 Aug 2010 13:24:15 -0400
>> From: "Knight, Frederick" <Frederick.Knight@netapp.com>
>> To: Mike Snitzer <snitzer@redhat.com>
>> Subject: RE: safety of retrying SYNCHRONIZE CACHE?
>>
>> There are requirements in SBC to maintain data integrity.  If you WRITE
>> a block and READ that block, you must get the data you sent in the
>> WRITE.  This will be synchronized around the completion of the WRITE.
>> Before the WRITE completes, who knows what a READ will return.  Maybe
>> all the old data, maybe all the new data, maybe some mix of old and new
>> data.  Once the WRITE ends successfully, all READs of those LBAs (from any
>> port) will always get the same data.
>>
>> As for errors, SBC describes how the deferred errors are reported (like
>> when a CACHE tries to flush but fails).  So if a write from cache to
>> media does have problems, the device would tell you via a CHECK
>> CONDITION (with the first byte of the sense data set to 71h or 73h).  SBC
>> clauses 4.12 and 4.13 cover a lot of this information.  It is these error
>> codes that prevent silent loss of data.  And, in this case, when the
>> CHECK CONDITION is delivered, it will have nothing to do with the
>> command that was issued (the victim command).  If you look into the
>> sense data, you will see the deferred error flag, and all the additional
>> information fields will relate to the original I/O.
>>
>> SYNCHRONIZE CACHE is not substantially different than a WRITE (it puts
>> data on the media).  So issuing it multiple times wouldn't be any
>> different than issuing multiple WRITES (it might put a temporary dent in
>> performance as everything flushes out to media).  If it or any other
>> commands fail with 71h/73h, then you have to dig down into the sense
>> data buffer to find out what happened.  For example, if you issue a
>> WRITE command, and it completes into write back cache but later (before
>> being written to the media), some of the cache breaks and loses data,
>> then the device must signal a deferred error to tell the host, and cause
>> a forced error on the LBA in question.
>>
>> Does that help?
>>
>>       Fred
> ----- End forwarded message -----
> 
> Seems like verifying/improving the handling of CHECK CONDITION is a more
> pressing concern than silent data loss purely due to SYNCHRONIZE CACHE
> retries.  Without proper handling we could completely miss these
> deferred errors.
> 
Yes.
> But how to effectively report such errors to upper layers is unclear to
> me given that a particular SCSI command can carry error information for
> IO that was already acknowledged successful (e.g. to the FS).
> 
> drivers/scsi/scsi_error.c's various calls to scsi_check_sense()
> illustrate Linux's current CHECK CONDITION handling.  I need to look
> closer at how deferred errors propagate to upper layers.  After an
> initial look it seems scsi_error.c does handle retrying commands where
> appropriate.
> 
> I believe Hannes has concerns/insight here.
> 
Quite. We _should_ be handling deferred errors correctly;
if you check drivers/scsi/scsi_lib.c:scsi_io_completion()
you'll find this:
	if (host_byte(result) == DID_RESET) {
		/* Third party bus reset or reset for error recovery
		 * reasons.  Just retry the command and see what
		 * happens.
		 */
		action = ACTION_RETRY;
	} else if (sense_valid && !sense_deferred) {
                ...
	} else {
		description = "Unhandled error code";
		action = ACTION_FAIL;
	}
i.e. for deferred errors we're already aborting the command. Not sure
if I agree with this bit in drivers/scsi/scsi_error.c:
static int scsi_check_sense(struct scsi_cmnd *scmd)
{
	struct scsi_device *sdev = scmd->device;
	struct scsi_sense_hdr sshdr;
	if (! scsi_command_normalize_sense(scmd, &sshdr))
		return FAILED;	/* no valid sense data */
	if (scsi_sense_is_deferred(&sshdr))
		return NEEDS_RETRY;
I doubt we can resolve the situation by retrying the command, which
will be the wrong command to retry anyway. I would rather
have those return 'SUCCESS' and add another case in scsi_io_completion()
to notify us about the deferred error.
I'll be sending a patch.
Cheers,
Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
^ permalink raw reply	[flat|nested] 109+ messages in thread
- * Re: safety of retrying SYNCHRONIZE CACHE [was: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush]
  2010-09-01  7:32                                         ` Hannes Reinecke
@ 2010-09-01  7:38                                           ` Hannes Reinecke
  0 siblings, 0 replies; 109+ messages in thread
From: Hannes Reinecke @ 2010-09-01  7:38 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Kiyoshi Ueda, Tejun Heo, tytso, linux-scsi, jaxboe, jack,
	linux-kernel, swhiteho, linux-raid, linux-ide, James.Bottomley,
	konishi.ryusuke, linux-fsdevel, vst, rwheeler, Christoph Hellwig,
	chris.mason, dm-devel, Frederick.Knight
Hannes Reinecke wrote:
> Mike Snitzer wrote:
>> Hi Kiyoshi,
>>
>> On Mon, Aug 30 2010 at  2:13am -0400,
>> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
>>
>>>> That does seem like a valid concern.  But I'm not seeing why it's unique
>>>> to SYNCHRONIZE CACHE.  Any IO that fails on the target side should be
>>>> passed up once the error gets to DM.
>>> See Tejun's explanation again:
>>>     http://marc.info/?l=linux-kernel&m=128267361813859&w=2
>>> What I'm concerned about is whether the same thing Tejun explained
>>> for ATA can happen on other types of devices.
>>>
>>>
>>> Normal write command has data and no data loss happens on error.
>>> So it can be retried cleanly, and if the result of the retry is
>>> success, it's really success, no implicit data loss.
>>>
>>> Normal read command has a sector to read.  If the sector is broken,
>>> all retries will fail and the error will be reported upwards.
>>> So it can be retried cleanly as well.
>> I reached out to Fred Knight on this to get more insight from a pure
>> SCSI SBC perspective, and he shared the following:
>>
>> ----- Forwarded message from "Knight, Frederick" <Frederick.Knight@netapp.com> -----
>>
>>> Date: Tue, 31 Aug 2010 13:24:15 -0400
>>> From: "Knight, Frederick" <Frederick.Knight@netapp.com>
>>> To: Mike Snitzer <snitzer@redhat.com>
>>> Subject: RE: safety of retrying SYNCHRONIZE CACHE?
>>>
>>> There are requirements in SBC to maintain data integrity.  If you WRITE
>>> a block and READ that block, you must get the data you sent in the
>>> WRITE.  This will be synchronized around the completion of the WRITE.
>>> Before the WRITE completes, who knows what a READ will return.  Maybe
>>> all the old data, maybe all the new data, maybe some mix of old and new
>>> data.  Once the WRITE ends successfully, all READs of those LBAs (from any
>>> port) will always get the same data.
>>>
>>> As for errors, SBC describes how the deferred errors are reported (like
>>> when a CACHE tries to flush but fails).  So if a write from cache to
>>> media does have problems, the device would tell you via a CHECK
>>> CONDITION (with the first byte of the sense data set to 71h or 73h).  SBC
>>> clauses 4.12 and 4.13 cover a lot of this information.  It is these error
>>> codes that prevent silent loss of data.  And, in this case, when the
>>> CHECK CONDITION is delivered, it will have nothing to do with the
>>> command that was issued (the victim command).  If you look into the
>>> sense data, you will see the deferred error flag, and all the additional
>>> information fields will relate to the original I/O.
>>>
>>> SYNCHRONIZE CACHE is not substantially different than a WRITE (it puts
>>> data on the media).  So issuing it multiple times wouldn't be any
>>> different than issuing multiple WRITES (it might put a temporary dent in
>>> performance as everything flushes out to media).  If it or any other
>>> commands fail with 71h/73h, then you have to dig down into the sense
>>> data buffer to find out what happened.  For example, if you issue a
>>> WRITE command, and it completes into write back cache but later (before
>>> being written to the media), some of the cache breaks and loses data,
>>> then the device must signal a deferred error to tell the host, and cause
>>> a forced error on the LBA in question.
>>>
>>> Does that help?
>>>
>>>       Fred
>> ----- End forwarded message -----
>>
>> Seems like verifying/improving the handling of CHECK CONDITION is a more
>> pressing concern than silent data loss purely due to SYNCHRONIZE CACHE
>> retries.  Without proper handling we could completely miss these
>> deferred errors.
>>
> Yes.
> 
>> But how to effectively report such errors to upper layers is unclear to
>> me given that a particular SCSI command can carry error information for
>> IO that was already acknowledged successful (e.g. to the FS).
>>
>> drivers/scsi/scsi_error.c's various calls to scsi_check_sense()
>> illustrate Linux's current CHECK CONDITION handling.  I need to look
>> closer at how deferred errors propagate to upper layers.  After an
>> initial look it seems scsi_error.c does handle retrying commands where
>> appropriate.
>>
>> I believe Hannes has concerns/insight here.
>>
> 
> Quite. We _should_ be handling deferred errors correctly;
> if you check drivers/scsi/scsi_lib.c:scsi_io_completion()
> you'll find this:
> 
> 	if (host_byte(result) == DID_RESET) {
> 		/* Third party bus reset or reset for error recovery
> 		 * reasons.  Just retry the command and see what
> 		 * happens.
> 		 */
> 		action = ACTION_RETRY;
> 	} else if (sense_valid && !sense_deferred) {
>                 ...
> 	} else {
> 		description = "Unhandled error code";
> 		action = ACTION_FAIL;
> 	}
> 
> i.e. for deferred errors we're already aborting the command. Not sure
> if I agree with this bit in drivers/scsi/scsi_error.c:
> 
> static int scsi_check_sense(struct scsi_cmnd *scmd)
> {
> 	struct scsi_device *sdev = scmd->device;
> 	struct scsi_sense_hdr sshdr;
> 
> 	if (! scsi_command_normalize_sense(scmd, &sshdr))
> 		return FAILED;	/* no valid sense data */
> 
> 	if (scsi_sense_is_deferred(&sshdr))
> 		return NEEDS_RETRY;
> 
> I doubt we can resolve the situation by retrying the command, which
> will be the wrong command to retry anyway. I would rather
> have those return 'SUCCESS' and add another case in scsi_io_completion()
> to notify us about the deferred error.
> 
Ah. No. That is actually correct. SPC-3 states:
If the task terminates with CHECK CONDITION status and the sense data
describes a deferred error, the command for the terminated task shall
not have been processed.
So we're good after all and I would just add this patch:
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index fb841e3..efb4609 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -912,7 +912,10 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
                        break;
                }
        } else {
-               description = "Unhandled error code";
+               if (sense_deferred)
+                       description = "Deferred error";
+               else
+                       description = "Unhandled error code";
                action = ACTION_FAIL;
        }
 
to make the whole situation more transparent.
Cheers,
Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
^ permalink raw reply related	[flat|nested] 109+ messages in thread
- * [RFC] training mpath to discern between SCSI errors (was: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush)
  2010-08-25  8:00                             ` Kiyoshi Ueda
  2010-08-25 15:28                               ` Mike Snitzer
@ 2010-08-25 15:59                               ` Mike Snitzer
  2010-08-25 19:15                                 ` [RFC] training mpath to discern between SCSI errors Mike Christie
  2010-08-30 11:38                                 ` Hannes Reinecke
  1 sibling, 2 replies; 109+ messages in thread
From: Mike Snitzer @ 2010-08-25 15:59 UTC (permalink / raw)
  To: Kiyoshi Ueda, Tejun Heo, michaelc, James.Bottomley,
	Hannes Reinecke
  Cc: tytso, linux-scsi, jaxboe, jack, linux-kernel, swhiteho,
	linux-raid, linux-ide, konishi.ryusuke, linux-fsdevel, vst,
	rwheeler, Christoph Hellwig, chris.mason, dm-devel
On Wed, Aug 25 2010 at  4:00am -0400,
Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> > I'm not sure how to proceed here.  How much work would
> > discerning between transport and IO errors take?  If it can't be done
> > quickly enough the retry logic can be kept around to keep the old
> > behavior but that already was a broken behavior, so...  :-(
> 
> I'm not sure how long it will take.
We first need to understand what direction we want to go with this.  We
currently have 2 options.  But any other ideas are obviously welcome.
1)
Mike Christie has a patchset that introduces more specific
target/transport/host error codes.  Mike shared these pointers but he'd
have to put the work in to refresh them:
http://marc.info/?l=linux-scsi&m=112487427230642&w=2
http://marc.info/?l=linux-scsi&m=112487427306501&w=2
http://marc.info/?l=linux-scsi&m=112487431524436&w=2
http://marc.info/?l=linux-scsi&m=112487431524350&w=2
errno.h new EXYZ
http://marc.info/?l=linux-kernel&m=107715299008231&w=2
add block layer blkdev.h error values
http://marc.info/?l=linux-kernel&m=107961883915068&w=2
add block layer blkdev.h error values (v2 convert more drivers)
http://marc.info/?l=linux-scsi&m=112487427230642&w=2
I think that patchset's approach is fairly disruptive just to be able to
train upper layers to differentiate (e.g. mpath).  But in the end maybe
that change takes the code in a more desirable direction?
2)
Another option is Hannes' approach of having DM consume req->errors and
SCSI sense more directly.
I've refreshed Hannes' previous patchset against 2.6.36-rc2 but I
haven't finished testing it yet (should be OK.. it boots, but still have
FIXME to move scsi_uld_should_retry to scsi_error.c):
http://people.redhat.com/msnitzer/patches/dm-scsi-sense/
Would be great if James, Hannes and others had a look at this
refreshed RFC patchset.  It's clearly not polished but it gives an idea
of the approach.  Does this look worthwhile?
Follow-on work is needed to refine scsi_uld_should_retry further.  Keep
in mind that scsi_error.c is the intended location for this code.
James, please note that I've attempted to make REQ_TYPE_FS set
req->errors only for "genuine errors" by (ab)using
scsi_decide_disposition:
http://people.redhat.com/msnitzer/patches/dm-scsi-sense/scsi-Always-pass-error-result-and-sense-on-request-completion.patch
If others think this may be worthwhile I can finish testing, clean up the
patches further, and post them.
Mike
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [RFC] training mpath to discern between SCSI errors
  2010-08-25 15:59                               ` [RFC] training mpath to discern between SCSI errors (was: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush) Mike Snitzer
@ 2010-08-25 19:15                                 ` Mike Christie
  2010-08-30 11:38                                 ` Hannes Reinecke
  1 sibling, 0 replies; 109+ messages in thread
From: Mike Christie @ 2010-08-25 19:15 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Kiyoshi Ueda, Tejun Heo, James.Bottomley, Hannes Reinecke, tytso,
	linux-scsi, jaxboe, jack, linux-kernel, swhiteho, linux-raid,
	linux-ide, konishi.ryusuke, linux-fsdevel, vst, rwheeler,
	Christoph Hellwig, chris.mason, dm-devel
On 08/25/2010 10:59 AM, Mike Snitzer wrote:
> On Wed, Aug 25 2010 at  4:00am -0400,
> Kiyoshi Ueda<k-ueda@ct.jp.nec.com>  wrote:
>
>>> I'm not sure how to proceed here.  How much work would
>>> discerning between transport and IO errors take?  If it can't be done
>>> quickly enough the retry logic can be kept around to keep the old
>>> behavior but that already was a broken behavior, so...  :-(
>>
>> I'm not sure how long it will take.
>
> We first need to understand what direction we want to go with this.  We
> currently have 2 options.  But any other ideas are obviously welcome.
>
> 1)
> Mike Christie has a patchset that introduces more specific
> target/transport/host error codes.  Mike shared these pointers but he'd
> have to put the work in to refresh them:
> http://marc.info/?l=linux-scsi&m=112487427230642&w=2
> http://marc.info/?l=linux-scsi&m=112487427306501&w=2
> http://marc.info/?l=linux-scsi&m=112487431524436&w=2
> http://marc.info/?l=linux-scsi&m=112487431524350&w=2
>
> errno.h new EXYZ
> http://marc.info/?l=linux-kernel&m=107715299008231&w=2
>
> add block layer blkdev.h error values
> http://marc.info/?l=linux-kernel&m=107961883915068&w=2
>
> add block layer blkdev.h error values (v2 convert more drivers)
> http://marc.info/?l=linux-scsi&m=112487427230642&w=2
>
> I think that patchset's approach is fairly disruptive just to be able to
> train upper layers to differentiate (e.g. mpath).  But in the end maybe
> that change takes the code in a more desirable direction?
I think it is more disruptive, but it is the cleaner approach in the end.
#2 looks hacky. In upper layers, we will need checks for dasd, AOE and
other drivers. And #2 does not even work for filesystems
(ext said they need this).
>
> 2)
> Another option is Hannes' approach of having DM consume req->errors and
> SCSI sense more directly.
>
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [RFC] training mpath to discern between SCSI errors
  2010-08-25 15:59                               ` [RFC] training mpath to discern between SCSI errors (was: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush) Mike Snitzer
  2010-08-25 19:15                                 ` [RFC] training mpath to discern between SCSI errors Mike Christie
@ 2010-08-30 11:38                                 ` Hannes Reinecke
  2010-08-30 12:07                                   ` Sergei Shtylyov
  1 sibling, 1 reply; 109+ messages in thread
From: Hannes Reinecke @ 2010-08-30 11:38 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Kiyoshi Ueda, Tejun Heo, michaelc, James.Bottomley, tytso,
	linux-scsi, jaxboe, jack, linux-kernel, swhiteho, linux-raid,
	linux-ide, konishi.ryusuke, linux-fsdevel, vst, rwheeler,
	Christoph Hellwig, chris.mason, dm-devel
Mike Snitzer wrote:
> On Wed, Aug 25 2010 at  4:00am -0400,
> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> 
>>> I'm not sure how to proceed here.  How much work would
>>> discerning between transport and IO errors take?  If it can't be done
>>> quickly enough the retry logic can be kept around to keep the old
>>> behavior but that already was a broken behavior, so...  :-(
>> I'm not sure how long it will take.
> 
> We first need to understand what direction we want to go with this.  We
> currently have 2 options.  But any other ideas are obviously welcome.
> 
> 1)
> Mike Christie has a patchset that introduces more specific
> target/transport/host error codes.  Mike shared these pointers but he'd
> have to put the work in to refresh them:
> http://marc.info/?l=linux-scsi&m=112487427230642&w=2
> http://marc.info/?l=linux-scsi&m=112487427306501&w=2
> http://marc.info/?l=linux-scsi&m=112487431524436&w=2
> http://marc.info/?l=linux-scsi&m=112487431524350&w=2
> 
> errno.h new EXYZ
> http://marc.info/?l=linux-kernel&m=107715299008231&w=2
> 
> add block layer blkdev.h error values
> http://marc.info/?l=linux-kernel&m=107961883915068&w=2
> 
> add block layer blkdev.h error values (v2 convert more drivers)
> http://marc.info/?l=linux-scsi&m=112487427230642&w=2
> 
> I think that patchset's approach is fairly disruptive just to be able to
> train upper layers to differentiate (e.g. mpath).  But in the end maybe
> that change takes the code in a more desirable direction?
> 
> 2)
> Another option is Hannes' approach of having DM consume req->errors and
> SCSI sense more directly.
> 
Actually, I think we have two separate issues here:
1) The need of having more detailed I/O errors even in the fs layer. This
   we've already discussed at the LSF, consensus here is to allow other
   errors than just 'EIO'.
   Instead of Mike's approach I would rather use existing error codes here;
   this will make the transition somewhat easier.
   Initially I would propose to return 'ENOLINK' for a transport failure,
   'EIO' for a non-retryable failure on the target, and 'ENODEV' for a
   retryable failure on the target.
2) The need to differentiate the various error conditions on the multipath
   layer. Multipath needs to distinguish the three error types as specified
   in 1)
Mike has been trying to solve 1) and 2) by introducing separate/new error
codes, and I have been trying to solve 2) by parsing the sense codes directly
from multipathing.
Given that the fs people have expressed their desire to know about these
error classes, too, it makes sense to have them exposed to the fs layer.
I'll see if I can come up with a patch.
Cheers,
Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [RFC] training mpath to discern between SCSI errors
  2010-08-30 11:38                                 ` Hannes Reinecke
@ 2010-08-30 12:07                                   ` Sergei Shtylyov
  2010-08-30 12:39                                     ` Hannes Reinecke
  0 siblings, 1 reply; 109+ messages in thread
From: Sergei Shtylyov @ 2010-08-30 12:07 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Mike Snitzer, Kiyoshi Ueda, Tejun Heo, michaelc, James.Bottomley,
	tytso, linux-scsi, jaxboe, jack, linux-kernel, swhiteho,
	linux-raid, linux-ide, konishi.ryusuke, linux-fsdevel, vst,
	rwheeler, Christoph Hellwig, chris.mason, dm-devel
Hello.
Hannes Reinecke wrote:
> Actually, I think we have two separate issues here:
> 1) The need of having more detailed I/O errors even in the fs layer. This
>    we've already discussed at the LSF, consensus here is to allow other
>    errors than just 'EIO'.
>    Instead of Mike's approach I would rather use existing error codes here;
>    this will make the transition somewhat easier.
>    Initially I would propose to return 'ENOLINK' for a transport failure,
>    'EIO' for a non-retryable failure on the target, and 'ENODEV' for a
>    retryable failure on the target.
Are you sure it's not vice versa: EIO for retryable and ENODEV for
non-retryable failures? ENODEV looks more like a permanent condition to me.
WBR, Sergei
^ permalink raw reply	[flat|nested] 109+ messages in thread
- * Re: [RFC] training mpath to discern between SCSI errors
  2010-08-30 12:07                                   ` Sergei Shtylyov
@ 2010-08-30 12:39                                     ` Hannes Reinecke
  2010-08-30 14:52                                       ` [dm-devel] " Hannes Reinecke
  0 siblings, 1 reply; 109+ messages in thread
From: Hannes Reinecke @ 2010-08-30 12:39 UTC (permalink / raw)
  To: Sergei Shtylyov
  Cc: Mike Snitzer, Kiyoshi Ueda, Tejun Heo, michaelc, James.Bottomley,
	tytso, linux-scsi, jaxboe, jack, linux-kernel, swhiteho,
	linux-raid, linux-ide, konishi.ryusuke, linux-fsdevel, vst,
	rwheeler, Christoph Hellwig, chris.mason, dm-devel
Sergei Shtylyov wrote:
> Hello.
> 
> Hannes Reinecke wrote:
> 
>> Actually, I think we have two separate issues here:
>> 1) The need of having more detailed I/O errors even in the fs layer. This
>>    we've already discussed at the LSF, consensus here is to allow other
>>    errors than just 'EIO'.
>>    Instead of Mike's approach I would rather use existing error codes
>> here;
>>    this will make the transition somewhat easier.
>>    Initially I would propose to return 'ENOLINK' for a transport failure,
>>    'EIO' for a non-retryable failure on the target, and 'ENODEV' for a
>>    retryable failure on the target.
> 
>    Are you sure it's not vice versa: EIO for retryable and ENODEV for
> non-retryable failures? ENODEV looks more like a permanent condition to me.
> 
Ok, can do.
And looking at the error numbers again, maybe we should be using 'EREMOTEIO'
for non-retryable failures.
So we would end up with:
ENOLINK: transport failure
EIO: retryable remote failure
EREMOTEIO: non-retryable remote failure
Does that look okay?
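To make the split concrete, here is a small userspace C sketch of how a multipath-like caller might act on the three codes. The enum and function names are hypothetical and the actions are an assumption about the intended use; only the errno values come from <errno.h> (ENOLINK and EREMOTEIO as defined on Linux):

```c
#include <errno.h>

/* Hypothetical actions a multipath-style layer could take. */
enum path_action {
	PATH_IO_DONE,		/* no error, complete the I/O */
	PATH_RETRY_OTHER_PATH,	/* failure may be path-specific */
	PATH_FAIL_UP,		/* retrying cannot help, report upwards */
};

/* 'error' is a negative errno, as used in kernel completion paths. */
static enum path_action classify_io_error(int error)
{
	switch (error) {
	case 0:
		return PATH_IO_DONE;
	case -EREMOTEIO:	/* non-retryable remote failure */
		return PATH_FAIL_UP;
	case -ENOLINK:		/* transport failure */
	case -EIO:		/* retryable remote failure */
	default:		/* unknown errors are treated like EIO */
		return PATH_RETRY_OTHER_PATH;
	}
}
```

The point of the split is that only EREMOTEIO short-circuits the retry; everything else still gets a chance on another path.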
Cheers,
Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [dm-devel] [RFC] training mpath to discern between SCSI errors
  2010-08-30 12:39                                     ` Hannes Reinecke
@ 2010-08-30 14:52                                       ` Hannes Reinecke
  2010-10-18  8:09                                         ` Jun'ichi Nomura
  0 siblings, 1 reply; 109+ messages in thread
From: Hannes Reinecke @ 2010-08-30 14:52 UTC (permalink / raw)
  To: device-mapper development
  Cc: Sergei Shtylyov, Kiyoshi Ueda, michaelc, tytso, linux-scsi,
	Mike Snitzer, jaxboe, linux-fsdevel, linux-kernel,
	Christoph Hellwig, linux-raid, linux-ide, James.Bottomley,
	rwheeler, konishi.ryusuke, Tejun Heo, jack, vst, swhiteho,
	chris.mason
[-- Attachment #1: Type: text/plain, Size: 2475 bytes --]
Hannes Reinecke wrote:
> Sergei Shtylyov wrote:
>> Hello.
>>
>> Hannes Reinecke wrote:
>>
>>> Actually, I think we have two separate issues here:
>>> 1) The need of having more detailed I/O errors even in the fs layer. This
>>>    we've already discussed at the LSF, consensus here is to allow other
>>>    errors than just 'EIO'.
>>>    Instead of Mike's approach I would rather use existing error codes
>>> here;
>>>    this will make the transition somewhat easier.
>>>    Initially I would propose to return 'ENOLINK' for a transport failure,
>>>    'EIO' for a non-retryable failure on the target, and 'ENODEV' for a
>>>    retryable failure on the target.
>>    Are you sure it's not vice versa: EIO for retryable and ENODEV for
>> non-retryable failures? ENODEV looks more like a permanent condition to me.
>>
> Ok, can do.
> And looking at the error numbers again, maybe we should be using 'EREMOTEIO'
> for non-retryable failures.
> 
> So we would end up with:
> 
> ENOLINK: transport failure
> EIO: retryable remote failure
> EREMOTEIO: non-retryable remote failure
> 
And here is the corresponding patch.
Compile tested only; just to give an idea of the possible implementation.
I have decided to pass the I/O failure information in-line:
- scsi_check_sense() might now return 'TARGET_ERROR' to signal
  a permanent error
- scsi_decide_disposition() sets the driver byte of the result
  field to 'DID_TARGET_FAILURE' if a return code of 'TARGET_ERROR'
  is encountered.
- scsi_io_completion() sets the error to ENOLINK for DID_TRANSPORT_FAILFAST,
  EREMOTEIO for DID_TARGET_FAILURE, and EIO for any other error. It also
  resets DID_TARGET_FAILURE back to DID_OK once the error code is set.
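The mapping in the last item can be sketched in standalone C as follows. This is only an illustration, not the patch itself: the helper names are hypothetical, the numeric DID_* values are placeholders (DID_TARGET_FAILURE is the code proposed here, so its value is an assumption), and the sketch assumes the code travels in bits 16-23 of the result word, where DID_* codes are conventionally carried:

```c
#include <errno.h>

/*
 * Illustrative values only; the real definitions live in the kernel's
 * SCSI headers.  DID_TARGET_FAILURE is the newly proposed code, so its
 * numeric value here is an assumption.
 */
#define DID_OK			0x00
#define DID_TRANSPORT_FAILFAST	0x0f
#define DID_TARGET_FAILURE	0x10

/* Extract bits 16-23 of the SCSI result word. */
static int host_byte(int result)
{
	return (result >> 16) & 0xff;
}

/* Hypothetical helper mirroring the errno mapping described above. */
static int result_to_errno(int result)
{
	if (result == 0)
		return 0;		/* command completed cleanly */

	switch (host_byte(result)) {
	case DID_TRANSPORT_FAILFAST:
		return -ENOLINK;	/* link failure */
	case DID_TARGET_FAILURE:
		return -EREMOTEIO;	/* permanent target failure */
	default:
		return -EIO;		/* any other error */
	}
}
```

Keeping the translation in one place means upper layers only ever see the errno and never have to parse the result word themselves.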
I'm not 100% happy with this patch; DID_TARGET_FAILURE is really just
a communication vehicle to signal the permanent target failure.
I looked at passing this information directly via an explicit argument
to scsi_finish_command(), but this would include changing
scsi_io_completion(), too. As both of them are exported / public
interfaces I didn't like modifying them.
Another possibility would be to re-use / redefine the 'DRIVER_'
bits; they don't seem to be used at the moment. E.g. 'DRIVER_HARD'
for permanent errors, 'DRIVER_SOFT' for link failures.
Opinions welcome.
Cheers,
Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
[-- Attachment #2: scsi-detailed-io-errors --]
[-- Type: text/plain, Size: 5633 bytes --]
From f0835d92426cb3938f79f1b7a1e1208de63ca7bc Mon Sep 17 00:00:00 2001
From: Hannes Reinecke <hare@suse.de>
Date: Mon, 30 Aug 2010 16:21:10 +0200
Subject: [RFC][PATCH] scsi: Detailed I/O errors
Instead of just passing 'EIO' for any I/O errors we should be
notifying the upper layers with some more details about the cause
of this error.
This patch updates the possible I/O errors to:
- ENOLINK: Link failure between host and target
- EIO: Retryable I/O error
- EREMOTEIO: Non-retryable I/O error
'Retryable' in this context means that an I/O error _might_ be
restricted to the I_T_L nexus (vulgo: path), so retrying on another
nexus / path might succeed.
Signed-off-by: Hannes Reinecke <hare@suse.de>
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 487ecda..d49b375 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1270,7 +1270,7 @@ static int do_end_io(struct multipath *m, struct request *clone,
 	if (!error && !clone->errors)
 		return 0;	/* I/O complete */
 
-	if (error == -EOPNOTSUPP)
+	if (error == -EOPNOTSUPP || error == -EREMOTEIO)
 		return error;
 
 	if (clone->cmd_flags & REQ_DISCARD)
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index ce089df..5da040b 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -223,7 +223,7 @@ static inline void scsi_eh_prt_fail_stats(struct Scsi_Host *shost,
  * @scmd:	Cmd to have sense checked.
  *
  * Return value:
- * 	SUCCESS or FAILED or NEEDS_RETRY
+ *	SUCCESS or FAILED or NEEDS_RETRY or TARGET_ERROR
  *
  * Notes:
  *	When a deferred error is detected the current command has
@@ -338,25 +338,25 @@ static int scsi_check_sense(struct scsi_cmnd *scmd)
 	case COPY_ABORTED:
 	case VOLUME_OVERFLOW:
 	case MISCOMPARE:
-		return SUCCESS;
+	case DATA_PROTECT:
+	case BLANK_CHECK:
+		return TARGET_ERROR;
 
 	case MEDIUM_ERROR:
 		if (sshdr.asc == 0x11 || /* UNRECOVERED READ ERR */
 		    sshdr.asc == 0x13 || /* AMNF DATA FIELD */
 		    sshdr.asc == 0x14) { /* RECORD NOT FOUND */
-			return SUCCESS;
+			return TARGET_ERROR;
 		}
 		return NEEDS_RETRY;
 
 	case HARDWARE_ERROR:
 		if (scmd->device->retry_hwerror)
 			return ADD_TO_MLQUEUE;
-		else
-			return SUCCESS;
-
+		else {
+			return TARGET_ERROR;
+		}
 	case ILLEGAL_REQUEST:
-	case BLANK_CHECK:
-	case DATA_PROTECT:
 	default:
 		return SUCCESS;
 	}
@@ -819,6 +819,7 @@ static int scsi_send_eh_cmnd(struct scsi_cmnd *scmd, unsigned char *cmnd,
 		case SUCCESS:
 		case NEEDS_RETRY:
 		case FAILED:
+		case TARGET_ERROR:
 			break;
 		case ADD_TO_MLQUEUE:
 			rtn = NEEDS_RETRY;
@@ -1512,6 +1513,12 @@ int scsi_decide_disposition(struct scsi_cmnd *scmd)
 		rtn = scsi_check_sense(scmd);
 		if (rtn == NEEDS_RETRY)
 			goto maybe_retry;
+		if (rtn == TARGET_ERROR) {
+			/* Need to modify host byte to signal a
+			 * permanent target failure */
+			scmd->result |= (DID_TARGET_FAILURE << 16);
+			rtn = SUCCESS;
+		}
 		/* if rtn == FAILED, we have no sense information;
 		 * returning FAILED will wake the error handler thread
 		 * to collect the sense and redo the decide
@@ -1529,6 +1536,7 @@ int scsi_decide_disposition(struct scsi_cmnd *scmd)
 	case RESERVATION_CONFLICT:
 		SCSI_LOG_ERROR_RECOVERY(3, sdev_printk(KERN_INFO, scmd->device,
 						"reservation conflict\n"));
+		scmd->result |= (DID_TARGET_FAILURE << 16);
 		return SUCCESS; /* causes immediate i/o error */
 	default:
 		return FAILED;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9ade720..fb841e3 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -736,8 +736,20 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 				memcpy(req->sense, cmd->sense_buffer,  len);
 				req->sense_len = len;
 			}
-			if (!sense_deferred)
-				error = -EIO;
+			if (!sense_deferred) {
+				switch(host_byte(result)) {
+				case DID_TRANSPORT_FAILFAST:
+					error = -ENOLINK;
+					break;
+				case DID_TARGET_FAILURE:
+					cmd->result |= (DID_OK << 16);
+					error = -EREMOTEIO;
+					break;
+				default:
+					error = -EIO;
+					break;
+				}
+			}
 		}
 
 		req->resid_len = scsi_get_resid(cmd);
@@ -796,7 +808,18 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 	if (scsi_end_request(cmd, error, good_bytes, result == 0) == NULL)
 		return;
 
-	error = -EIO;
+	switch (host_byte(result)) {
+	case DID_TRANSPORT_FAILFAST:
+		error = -ENOLINK;
+		break;
+	case DID_TARGET_FAILURE:
+		cmd->result |= (DID_OK << 16);
+		error = -EREMOTEIO;
+		break;
+	default:
+		error = -EIO;
+		break;
+	}
 
 	if (host_byte(result) == DID_RESET) {
 		/* Third party bus reset or reset for error recovery
@@ -1418,7 +1441,6 @@ static void scsi_softirq_done(struct request *rq)
 			    wait_for/HZ);
 		disposition = SUCCESS;
 	}
-			
 	scsi_log_completion(cmd, disposition);
 
 	switch (disposition) {
diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
index 8fcb6e0..abfee76 100644
--- a/include/scsi/scsi.h
+++ b/include/scsi/scsi.h
@@ -397,6 +397,8 @@ static inline int scsi_is_wlun(unsigned int lun)
 				      * recover the link. Transport class will
 				      * retry or fail IO */
 #define DID_TRANSPORT_FAILFAST	0x0f /* Transport class fastfailed the io */
+#define DID_TARGET_FAILURE 0x10 /* Permanent target failure, do not retry on
+				 * other paths */
 #define DRIVER_OK       0x00	/* Driver status                           */
 
 /*
@@ -426,6 +428,7 @@ static inline int scsi_is_wlun(unsigned int lun)
 #define TIMEOUT_ERROR   0x2007
 #define SCSI_RETURN_NOT_HANDLED   0x2008
 #define FAST_IO_FAIL	0x2009
+#define TARGET_ERROR    0x200A
 
 /*
  * Midlevel queue return values.
^ permalink raw reply related	[flat|nested] 109+ messages in thread
- * Re: [RFC] training mpath to discern between SCSI errors
  2010-08-30 14:52                                       ` [dm-devel] " Hannes Reinecke
@ 2010-10-18  8:09                                         ` Jun'ichi Nomura
  2010-10-18 11:55                                           ` Hannes Reinecke
  0 siblings, 1 reply; 109+ messages in thread
From: Jun'ichi Nomura @ 2010-10-18  8:09 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: device-mapper development, Kiyoshi Ueda, michaelc, tytso,
	linux-scsi, Mike Snitzer, jaxboe, jack, vst, linux-kernel,
	swhiteho, linux-raid, linux-ide, James.Bottomley, chris.mason,
	konishi.ryusuke, linux-fsdevel, Tejun Heo, rwheeler,
	Christoph Hellwig, Sergei Shtylyov
Hi Hannes,
Thank you for working on this issue and sorry for the very late reply...
(08/30/10 23:52), Hannes Reinecke wrote:
> From: Hannes Reinecke <hare@suse.de>
> Date: Mon, 30 Aug 2010 16:21:10 +0200
> Subject: [RFC][PATCH] scsi: Detailed I/O errors
> 
> Instead of just passing 'EIO' for any I/O errors we should be
> notifying the upper layers with some more details about the cause
> of this error.
> This patch updates the possible I/O errors to:
> 
> - ENOLINK: Link failure between host and target
> - EIO: Retryable I/O error
> - EREMOTEIO: Non-retryable I/O error
> 
> 'Retryable' in this context means that an I/O error _might_ be
> restricted to the I_T_L nexus (vulgo: path), so retrying on another
> nexus / path might succeed.
Does 'retryable' for EIO mean retryable in the multipath layer?
If so, what is the difference between EIO and ENOLINK?
I've heard of a case where just retrying within path-group is
preferred to (relatively costly) switching group.
So, if EIO (or other error code) can be used to indicate such type
of errors, it's nice.
Also (although this might be a bit off topic from your patch),
can we expand such a distinction to what should be logged?
Currently, it's difficult to distinguish important SCSI/block errors
and less important ones in kernel log.
For example, when I get a link failure on sda, kernel prints something
like below, regardless of whether the I/O is recovered by multipathing or not:
  end_request: I/O error, dev sda, sector XXXXX
Setting REQ_QUIET in dm-multipath could mask the message
but also other important ones in SCSI.
Thanks,
-- 
Jun'ichi Nomura, NEC Corporation
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [RFC] training mpath to discern between SCSI errors
  2010-10-18  8:09                                         ` Jun'ichi Nomura
@ 2010-10-18 11:55                                           ` Hannes Reinecke
  2010-10-19  4:03                                             ` Jun'ichi Nomura
  2010-11-19  3:11                                             ` [dm-devel] " Malahal Naineni
  0 siblings, 2 replies; 109+ messages in thread
From: Hannes Reinecke @ 2010-10-18 11:55 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: device-mapper development, Kiyoshi Ueda, michaelc, tytso,
	linux-scsi, Mike Snitzer, jaxboe, jack, vst, linux-kernel,
	swhiteho, linux-raid, linux-ide, James.Bottomley, chris.mason,
	konishi.ryusuke, linux-fsdevel, Tejun Heo, rwheeler,
	Christoph Hellwig, Sergei Shtylyov
On 10/18/2010 10:09 AM, Jun'ichi Nomura wrote:
> Hi Hannes,
> 
> Thank you for working on this issue and sorry for the very late reply...
> 
> (08/30/10 23:52), Hannes Reinecke wrote:
>> From: Hannes Reinecke <hare@suse.de>
>> Date: Mon, 30 Aug 2010 16:21:10 +0200
>> Subject: [RFC][PATCH] scsi: Detailed I/O errors
>>
>> Instead of just passing 'EIO' for any I/O errors we should be
>> notifying the upper layers with some more details about the cause
>> of this error.
>> This patch updates the possible I/O errors to:
>>
>> - ENOLINK: Link failure between host and target
>> - EIO: Retryable I/O error
>> - EREMOTEIO: Non-retryable I/O error
>>
>> 'Retryable' in this context means that an I/O error _might_ be
>> restricted to the I_T_L nexus (vulgo: path), so retrying on another
>> nexus / path might succeed.
> 
> Does 'retryable' for EIO mean retryable in the multipath layer?
> If so, what is the difference between EIO and ENOLINK?
> 
Yes, EIO is intended for errors which should be retried at the
multipath layer. This does _not_ include transport errors, which are
signalled by ENOLINK.
Basically, ENOLINK is a transport error, and EIO just means
something is wrong and we weren't able to classify it properly.
If we were, it'd be either ENOLINK or EREMOTEIO.
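Seen from the multipath layer, the three codes lead to two different
reactions. The following is a rough sketch of that decision, modelled on
the do_end_io() hunk in the RFC patch, where -EOPNOTSUPP and -EREMOTEIO
are returned to the caller and everything else falls through to path
failure handling. The enum and sketch_classify() are hypothetical names,
and the case where no usable path is left is deliberately ignored.

```c
#include <errno.h>

#ifndef EREMOTEIO
#define EREMOTEIO 121	/* Linux value; fallback for non-Linux builds (assumption) */
#endif

/* Hypothetical reactions of a multipath-like consumer */
enum sketch_mpath_action {
	SKETCH_COMPLETE,	/* I/O succeeded, nothing to do */
	SKETCH_PROPAGATE,	/* pass the error up, another path will not help */
	SKETCH_RETRY_OTHER	/* fail this path and requeue on another one */
};

static enum sketch_mpath_action sketch_classify(int error)
{
	switch (error) {
	case 0:
		return SKETCH_COMPLETE;
	case -EOPNOTSUPP:	/* already propagated before the patch */
	case -EREMOTEIO:	/* permanent target failure */
		return SKETCH_PROPAGATE;
	case -ENOLINK:		/* transport failure */
	case -EIO:		/* could not be classified */
	default:
		return SKETCH_RETRY_OTHER;
	}
}
```

With the RFC patch as posted, -ENOLINK and -EIO are treated alike by
dm-mpath; the distinction only becomes useful once a consumer wants to
handle transport failures differently from unclassified errors.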
> I've heard of a case where just retrying within path-group is
> preferred to (relatively costly) switching group.
> So, if EIO (or other error code) can be used to indicate such type
> of errors, it's nice.
> 
Yes, that was one of the intentions.
> 
> Also (although this might be a bit off topic from your patch),
> can we expand such a distinction to what should be logged?
> Currently, it's difficult to distinguish important SCSI/block errors
> and less important ones in kernel log.
> For example, when I get a link failure on sda, kernel prints something
> like below, regardless of whether the I/O is recovered by multipathing or not:
>   end_request: I/O error, dev sda, sector XXXXX
> 
Indeed, when using the above we could be modifying the above
message, eg by
end_request: transport error, dev sda, sector XXXXX
or
end_request: target error, dev sda, sector XXXXX
which would improve the output noticeably.
> Setting REQ_QUIET in dm-multipath could mask the message
> but also other important ones in SCSI.
> 
Hmm. Not sure about that, but I think the above modifications will
be useful already.
I'll be sending an updated patch.
Cheers,
Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [RFC] training mpath to discern between SCSI errors
  2010-10-18 11:55                                           ` Hannes Reinecke
@ 2010-10-19  4:03                                             ` Jun'ichi Nomura
  2010-11-19  3:11                                             ` [dm-devel] " Malahal Naineni
  1 sibling, 0 replies; 109+ messages in thread
From: Jun'ichi Nomura @ 2010-10-19  4:03 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: device-mapper development, Kiyoshi Ueda, michaelc, tytso,
	linux-scsi, Mike Snitzer, jaxboe, jack, vst, linux-kernel,
	swhiteho, linux-raid, linux-ide, James.Bottomley, chris.mason,
	konishi.ryusuke, linux-fsdevel, Tejun Heo, rwheeler,
	Christoph Hellwig, Sergei Shtylyov
Hi Hannes,
(10/18/10 20:55), Hannes Reinecke wrote:
>> Does 'retryable' for EIO mean retryable in the multipath layer?
>> If so, what is the difference between EIO and ENOLINK?
>>
> Yes, EIO is intended for errors which should be retried at the
> multipath layer. This does _not_ include transport errors, which are
> signalled by ENOLINK.
> 
> Basically, ENOLINK is a transport error, and EIO just means
> something is wrong and we weren't able to classify it properly.
> If we were, it'd be either ENOLINK or EREMOTEIO.
> 
>> I've heard of a case where just retrying within path-group is
>> preferred to (relatively costly) switching group.
>> So, if EIO (or other error code) can be used to indicate such type
>> of errors, it's nice.
>>
> Yes, that was one of the intentions.
Great to hear that.
And when it comes to retrying, the next problem is who controls it.
I don't think it's good to duplicate retry logic in multipath and
underlying device like SCSI (i.e. sd retries 5 times).
So perhaps we need a way to disable (or limit) retries in underlying
device at least.
>> Also (although this might be a bit off topic from your patch),
>> can we expand such a distinction to what should be logged?
>> Currently, it's difficult to distinguish important SCSI/block errors
>> and less important ones in kernel log.
>> For example, when I get a link failure on sda, kernel prints something
>> like below, regardless of whether the I/O is recovered by multipathing or not:
>>   end_request: I/O error, dev sda, sector XXXXX
>>
> Indeed, when using the above we could be modifying the above
> message, eg by
> 
> end_request: transport error, dev sda, sector XXXXX
> 
> or
> 
> end_request: target error, dev sda, sector XXXXX
> 
> which would improve the output noticeably.
That is an improvement, but they still look like critical errors
even if multipath recovers from them.
When I see this:
  end_request: target error, dev sda, sector XXXXX
I can't tell whether it's a real error visible to user space
or whether it was just recovered by multipath retry/failover afterwards.
>> Setting REQ_QUIET in dm-multipath could mask the message
>> but also other important ones in SCSI.
>>
> Hmm. Not sure about that, but I think the above modifications will
> be useful already.
> 
> I'll be sending an updated patch.
Thank you. I'm looking forward to that.
-- 
Jun'ichi Nomura, NEC Corporation
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [dm-devel] [RFC] training mpath to discern between SCSI errors
  2010-10-18 11:55                                           ` Hannes Reinecke
  2010-10-19  4:03                                             ` Jun'ichi Nomura
@ 2010-11-19  3:11                                             ` Malahal Naineni
  2010-11-30 22:59                                               ` Mike Snitzer
  1 sibling, 1 reply; 109+ messages in thread
From: Malahal Naineni @ 2010-11-19  3:11 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jun'ichi Nomura, Kiyoshi Ueda, michaelc, tytso, linux-scsi,
	Mike Snitzer, jaxboe, vst, linux-kernel, Christoph Hellwig,
	linux-raid, linux-ide, device-mapper development, James.Bottomley,
	Sergei Shtylyov, konishi.ryusuke, linux-fsdevel, jack, rwheeler,
	swhiteho, chris.mason, Tejun Heo
Hannes Reinecke [hare@suse.de] wrote:
> > Also (although this might be a bit off topic from your patch),
> > can we expand such a distinction to what should be logged?
> > Currently, it's difficult to distinguish important SCSI/block errors
> > and less important ones in kernel log.
> > For example, when I get a link failure on sda, kernel prints something
> > like below, regardless of whether the I/O is recovered by multipathing or not:
> >   end_request: I/O error, dev sda, sector XXXXX
> > 
> Indeed, when using the above we could be modifying the above
> message, eg by
> 
> end_request: transport error, dev sda, sector XXXXX
> 
> or
> 
> end_request: target error, dev sda, sector XXXXX
> 
> which would improve the output noticeably.
> 
> > Setting REQ_QUIET in dm-multipath could mask the message
> > but also other important ones in SCSI.
> > 
> Hmm. Not sure about that, but I think the above modifications will
> be useful already.
> 
> I'll be sending an updated patch.
Hannes, is there an updated version of this patch? It applied fine to
Linus' git tree with a minor reject! I would like to test an updated
version if you have one (the update seems to refer to better logging
only, right?).
Thanks, Malahal.
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: training mpath to discern between SCSI errors
  2010-11-19  3:11                                             ` [dm-devel] " Malahal Naineni
@ 2010-11-30 22:59                                               ` Mike Snitzer
  2010-12-07 23:16                                                 ` [RFC PATCH 0/3] differentiate between I/O errors Mike Snitzer
  2010-12-17  9:47                                                 ` training mpath to discern between SCSI errors Hannes Reinecke
  0 siblings, 2 replies; 109+ messages in thread
From: Mike Snitzer @ 2010-11-30 22:59 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jun'ichi Nomura, Kiyoshi Ueda, michaelc, tytso, linux-scsi,
	jaxboe, vst, linux-kernel, Christoph Hellwig, linux-raid,
	linux-ide, device-mapper development, James.Bottomley,
	Sergei Shtylyov, konishi.ryusuke, linux-fsdevel, jack, rwheeler,
	swhiteho, chris.mason, Tejun Heo
On Thu, Nov 18 2010 at 10:11pm -0500,
Malahal Naineni <malahal@us.ibm.com> wrote:
> Hannes Reinecke [hare@suse.de] wrote:
> > > Also (although this might be a bit off topic from your patch),
> > > can we expand such a distinction to what should be logged?
> > > Currently, it's difficult to distinguish important SCSI/block errors
> > > and less important ones in kernel log.
> > > For example, when I get a link failure on sda, kernel prints something
> > > like below, regardless of whether the I/O is recovered by multipathing or not:
> > >   end_request: I/O error, dev sda, sector XXXXX
> > > 
> > Indeed, when using the above we could be modifying the above
> > message, eg by
> > 
> > end_request: transport error, dev sda, sector XXXXX
> > 
> > or
> > 
> > end_request: target error, dev sda, sector XXXXX
> > 
> > which would improve the output noticeably.
> > 
> > > Setting REQ_QUIET in dm-multipath could mask the message
> > > but also other important ones in SCSI.
> > > 
> > Hmm. Not sure about that, but I think the above modifications will
> > be useful already.
> > 
> > I'll be sending an updated patch.
> 
> Hannes, is there an updated version of this patch? It applied fine to
> Linus' git tree with a minor reject! I would like to test an updated
> version if you have one (the update seems to refer to better logging
> only, right?).
Hannes,
Any chance you've had time to fold your proposed logging changes in and
rebase this patch?  Could you post that updated patch?
I'd like to help see this patch through to inclusion when the 2.6.38 merge
window opens.  I can help with further review, testing and development.
Please advise, thanks.
Mike
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * [RFC PATCH 0/3] differentiate between I/O errors
  2010-11-30 22:59                                               ` Mike Snitzer
@ 2010-12-07 23:16                                                 ` Mike Snitzer
  2010-12-07 23:16                                                   ` [RFC PATCH v2 1/3] scsi: Detailed " Mike Snitzer
                                                                     ` (3 more replies)
  2010-12-17  9:47                                                 ` training mpath to discern between SCSI errors Hannes Reinecke
  1 sibling, 4 replies; 109+ messages in thread
From: Mike Snitzer @ 2010-12-07 23:16 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: k-ueda, michaelc, tytso, sshtylyov, linux-scsi, jaxboe, jack,
	linux-fsdevel, linux-kernel, swhiteho, linux-raid, linux-ide,
	dm-devel, James.Bottomley, konishi.ryusuke, j-nomura, vst,
	rwheeler, hch, chris.mason, tj
Refreshed Hannes' initial "scsi: Detailed I/O errors" patch against
v2.6.37-rc5.  v2 introduces __scsi_error_from_host_byte to avoid
the duplicate switch statement.  Also a few whitespace and comment
changes.
Split DM mpath change out to separate v2 patch; failed discard is now
retryable in the face of a non-target IO error.
Improved the block layer's I/O error message (based on the finer
grained I/O error returns afforded by SCSI).
Comments/suggestions are welcome.
Thanks,
Mike
Hannes Reinecke (1):
  scsi: Detailed I/O errors
Mike Snitzer (2):
  dm mpath: propagate target I/O errors immediately
  block: improve detail in I/O error messages
 block/blk-core.c          |   12 +++++++++---
 drivers/md/dm-mpath.c     |   11 +----------
 drivers/scsi/scsi_error.c |   24 +++++++++++++++++-------
 drivers/scsi/scsi_lib.c   |   24 ++++++++++++++++++++++--
 include/scsi/scsi.h       |    3 +++
 5 files changed, 52 insertions(+), 22 deletions(-)
-- 
1.7.2.3
^ permalink raw reply	[flat|nested] 109+ messages in thread
- * [RFC PATCH v2 1/3] scsi: Detailed I/O errors
  2010-12-07 23:16                                                 ` [RFC PATCH 0/3] differentiate between I/O errors Mike Snitzer
@ 2010-12-07 23:16                                                   ` Mike Snitzer
  2010-12-07 23:16                                                   ` [RFC PATCH v2 2/3] dm mpath: propagate target errors immediately Mike Snitzer
                                                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 109+ messages in thread
From: Mike Snitzer @ 2010-12-07 23:16 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: k-ueda, michaelc, tytso, sshtylyov, linux-scsi, jaxboe, jack,
	linux-fsdevel, linux-kernel, swhiteho, linux-raid, linux-ide,
	dm-devel, James.Bottomley, konishi.ryusuke, j-nomura, vst,
	rwheeler, hch, chris.mason, tj
From: Hannes Reinecke <hare@suse.de>
Instead of just passing 'EIO' for any I/O error we should be
notifying the upper layers with more details about the cause
of this error.
Update the possible I/O errors to:
- ENOLINK: Link failure between host and target
- EIO: Retryable I/O error
- EREMOTEIO: Non-retryable I/O error
'Retryable' in this context means that an I/O error _might_ be
restricted to the I_T_L nexus (vulgo: path), so retrying on another
nexus / path might succeed.
'Non-retryable' means target failure or reservation conflict.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/scsi/scsi_error.c |   24 +++++++++++++++++-------
 drivers/scsi/scsi_lib.c   |   24 ++++++++++++++++++++++--
 include/scsi/scsi.h       |    3 +++
 3 files changed, 42 insertions(+), 9 deletions(-)
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 824b8fc..48fdd85 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -223,7 +223,7 @@ static inline void scsi_eh_prt_fail_stats(struct Scsi_Host *shost,
  * @scmd:	Cmd to have sense checked.
  *
  * Return value:
- * 	SUCCESS or FAILED or NEEDS_RETRY
+ *	SUCCESS or FAILED or NEEDS_RETRY or TARGET_ERROR
  *
  * Notes:
  *	When a deferred error is detected the current command has
@@ -326,17 +326,19 @@ static int scsi_check_sense(struct scsi_cmnd *scmd)
 		 */
 		return SUCCESS;
 
-		/* these three are not supported */
+		/* these are not supported */
 	case COPY_ABORTED:
 	case VOLUME_OVERFLOW:
 	case MISCOMPARE:
-		return SUCCESS;
+	case BLANK_CHECK:
+	case DATA_PROTECT:
+		return TARGET_ERROR;
 
 	case MEDIUM_ERROR:
 		if (sshdr.asc == 0x11 || /* UNRECOVERED READ ERR */
 		    sshdr.asc == 0x13 || /* AMNF DATA FIELD */
 		    sshdr.asc == 0x14) { /* RECORD NOT FOUND */
-			return SUCCESS;
+			return TARGET_ERROR;
 		}
 		return NEEDS_RETRY;
 
@@ -344,11 +346,9 @@ static int scsi_check_sense(struct scsi_cmnd *scmd)
 		if (scmd->device->retry_hwerror)
 			return ADD_TO_MLQUEUE;
 		else
-			return SUCCESS;
+			return TARGET_ERROR;
 
 	case ILLEGAL_REQUEST:
-	case BLANK_CHECK:
-	case DATA_PROTECT:
 	default:
 		return SUCCESS;
 	}
@@ -809,6 +809,7 @@ static int scsi_send_eh_cmnd(struct scsi_cmnd *scmd, unsigned char *cmnd,
 		case SUCCESS:
 		case NEEDS_RETRY:
 		case FAILED:
+		case TARGET_ERROR:
 			break;
 		case ADD_TO_MLQUEUE:
 			rtn = NEEDS_RETRY;
@@ -1502,6 +1503,14 @@ int scsi_decide_disposition(struct scsi_cmnd *scmd)
 		rtn = scsi_check_sense(scmd);
 		if (rtn == NEEDS_RETRY)
 			goto maybe_retry;
+		else if (rtn == TARGET_ERROR) {
+			/*
+			 * Need to modify host byte to signal a
+			 * permanent target failure
+			 */
+			scmd->result |= (DID_TARGET_FAILURE << 16);
+			rtn = SUCCESS;
+		}
 		/* if rtn == FAILED, we have no sense information;
 		 * returning FAILED will wake the error handler thread
 		 * to collect the sense and redo the decide
@@ -1519,6 +1528,7 @@ int scsi_decide_disposition(struct scsi_cmnd *scmd)
 	case RESERVATION_CONFLICT:
 		sdev_printk(KERN_INFO, scmd->device,
 			    "reservation conflict\n");
+		scmd->result |= (DID_TARGET_FAILURE << 16);
 		return SUCCESS; /* causes immediate i/o error */
 	default:
 		return FAILED;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index eafeeda..4da6459 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -667,6 +667,26 @@ void scsi_release_buffers(struct scsi_cmnd *cmd)
 }
 EXPORT_SYMBOL(scsi_release_buffers);
 
+static int __scsi_error_from_host_byte(struct scsi_cmnd *cmd, int result)
+{
+	int error = 0;
+
+	switch(host_byte(result)) {
+	case DID_TRANSPORT_FAILFAST:
+		error = -ENOLINK;
+		break;
+	case DID_TARGET_FAILURE:
+		cmd->result |= (DID_OK << 16);
+		error = -EREMOTEIO;
+		break;
+	default:
+		error = -EIO;
+		break;
+	}
+
+	return error;
+}
+
 /*
  * Function:    scsi_io_completion()
  *
@@ -737,7 +757,7 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 				req->sense_len = len;
 			}
 			if (!sense_deferred)
-				error = -EIO;
+				error = __scsi_error_from_host_byte(cmd, result);
 		}
 
 		req->resid_len = scsi_get_resid(cmd);
@@ -796,7 +816,7 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 	if (scsi_end_request(cmd, error, good_bytes, result == 0) == NULL)
 		return;
 
-	error = -EIO;
+	error = __scsi_error_from_host_byte(cmd, result);
 
 	if (host_byte(result) == DID_RESET) {
 		/* Third party bus reset or reset for error recovery
diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
index 216af85..73d27d9 100644
--- a/include/scsi/scsi.h
+++ b/include/scsi/scsi.h
@@ -405,6 +405,8 @@ static inline int scsi_is_wlun(unsigned int lun)
 				      * recover the link. Transport class will
 				      * retry or fail IO */
 #define DID_TRANSPORT_FAILFAST	0x0f /* Transport class fastfailed the io */
+#define DID_TARGET_FAILURE 0x10 /* Permanent target failure, do not retry on
+				 * other paths */
 #define DRIVER_OK       0x00	/* Driver status                           */
 
 /*
@@ -434,6 +436,7 @@ static inline int scsi_is_wlun(unsigned int lun)
 #define TIMEOUT_ERROR   0x2007
 #define SCSI_RETURN_NOT_HANDLED   0x2008
 #define FAST_IO_FAIL	0x2009
+#define TARGET_ERROR    0x200A
 
 /*
  * Midlevel queue return values.
-- 
1.7.2.3
^ permalink raw reply related	[flat|nested] 109+ messages in thread
- * [RFC PATCH v2 2/3] dm mpath: propagate target errors immediately
  2010-12-07 23:16                                                 ` [RFC PATCH 0/3] differentiate between I/O errors Mike Snitzer
  2010-12-07 23:16                                                   ` [RFC PATCH v2 1/3] scsi: Detailed " Mike Snitzer
@ 2010-12-07 23:16                                                   ` Mike Snitzer
  2010-12-07 23:16                                                   ` [RFC PATCH 3/3] block: improve detail in I/O error messages Mike Snitzer
  2010-12-10 23:40                                                   ` [RFC PATCH 0/3] differentiate between I/O errors Malahal Naineni
  3 siblings, 0 replies; 109+ messages in thread
From: Mike Snitzer @ 2010-12-07 23:16 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: k-ueda, michaelc, tytso, sshtylyov, linux-scsi, jaxboe, vst,
	linux-kernel, hch, linux-raid, linux-ide, dm-devel,
	James.Bottomley, konishi.ryusuke, linux-fsdevel, jack, j-nomura,
	rwheeler, swhiteho, chris.mason, tj
DM now has more information about the nature of the underlying storage
failure.  Path failure is avoided if a request failed due to a target
error.  Instead the target error is immediately passed up the stack.
Discard requests that fail due to non-target errors may now be retried.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-mpath.c |   11 +----------
 1 files changed, 1 insertions(+), 10 deletions(-)
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 487ecda..071529a 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1270,16 +1270,7 @@ static int do_end_io(struct multipath *m, struct request *clone,
 	if (!error && !clone->errors)
 		return 0;	/* I/O complete */
 
-	if (error == -EOPNOTSUPP)
-		return error;
-
-	if (clone->cmd_flags & REQ_DISCARD)
-		/*
-		 * Pass all discard request failures up.
-		 * FIXME: only fail_path if the discard failed due to a
-		 * transport problem.  This requires precise understanding
-		 * of the underlying failure (e.g. the SCSI sense).
-		 */
+	if (error == -EOPNOTSUPP || error == -EREMOTEIO)
 		return error;
 
 	if (mpio->pgpath)
-- 
1.7.2.3
^ permalink raw reply related	[flat|nested] 109+ messages in thread 
- * [RFC PATCH 3/3] block: improve detail in I/O error messages
  2010-12-07 23:16                                                 ` [RFC PATCH 0/3] differentiate between I/O errors Mike Snitzer
  2010-12-07 23:16                                                   ` [RFC PATCH v2 1/3] scsi: Detailed " Mike Snitzer
  2010-12-07 23:16                                                   ` [RFC PATCH v2 2/3] dm mpath: propagate target errors immediately Mike Snitzer
@ 2010-12-07 23:16                                                   ` Mike Snitzer
  2010-12-08 11:28                                                     ` Sergei Shtylyov
  2010-12-10 23:40                                                   ` [RFC PATCH 0/3] differentiate between I/O errors Malahal Naineni
  3 siblings, 1 reply; 109+ messages in thread
From: Mike Snitzer @ 2010-12-07 23:16 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: k-ueda, michaelc, tytso, sshtylyov, linux-scsi, jaxboe, jack,
	linux-fsdevel, linux-kernel, swhiteho, linux-raid, linux-ide,
	dm-devel, James.Bottomley, konishi.ryusuke, j-nomura, vst,
	rwheeler, hch, chris.mason, tj
Classify severity of I/O errors for target and transport errors.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-core.c |   12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 4ce953f..ab8c776 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2028,9 +2028,15 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
 
 	if (error && req->cmd_type == REQ_TYPE_FS &&
 	    !(req->cmd_flags & REQ_QUIET)) {
-		printk(KERN_ERR "end_request: I/O error, dev %s, sector %llu\n",
-				req->rq_disk ? req->rq_disk->disk_name : "?",
-				(unsigned long long)blk_rq_pos(req));
+		char *error_type = "I/O";
+
+		if (error == -ENOLINK)
+			error_type = "recoverable transport";
+		else if (error == -EREMOTEIO)
+			error_type = "critical target";
+		printk(KERN_ERR "end_request: %s error, dev %s, sector %llu\n",
+		       error_type, req->rq_disk ? req->rq_disk->disk_name : "?",
+		       (unsigned long long)blk_rq_pos(req));
 	}
 
 	blk_account_io_completion(req, nr_bytes);
-- 
1.7.2.3
^ permalink raw reply related	[flat|nested] 109+ messages in thread
- * Re: [RFC PATCH 3/3] block: improve detail in I/O error messages
  2010-12-07 23:16                                                   ` [RFC PATCH 3/3] block: improve detail in I/O error messages Mike Snitzer
@ 2010-12-08 11:28                                                     ` Sergei Shtylyov
  2010-12-08 15:05                                                       ` [PATCH v2 " Mike Snitzer
  0 siblings, 1 reply; 109+ messages in thread
From: Sergei Shtylyov @ 2010-12-08 11:28 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Hannes Reinecke, k-ueda, michaelc, tytso, sshtylyov, linux-scsi,
	jaxboe, jack, linux-fsdevel, linux-kernel, swhiteho, linux-raid,
	linux-ide, dm-devel, James.Bottomley, konishi.ryusuke, j-nomura,
	vst, rwheeler, hch, chris.mason, tj
Hello.
On 08-12-2010 2:16, Mike Snitzer wrote:
> Classify severity of I/O errors for target and transport errors.
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
[...]
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 4ce953f..ab8c776 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2028,9 +2028,15 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
>
>   	if (error && req->cmd_type == REQ_TYPE_FS &&
>   	    !(req->cmd_flags & REQ_QUIET)) {
> -		printk(KERN_ERR "end_request: I/O error, dev %s, sector %llu\n",
> -				req->rq_disk ? req->rq_disk->disk_name : "?",
> -				(unsigned long long)blk_rq_pos(req));
> +		char *error_type = "I/O";
> +
> +		if (error == -ENOLINK)
> +			error_type = "recoverable transport";
> +		else if (error == -EREMOTEIO)
> +			error_type = "critical target";
    *switch* would be more natural here.
WBR, Sergei
^ permalink raw reply	[flat|nested] 109+ messages in thread
- * [PATCH v2 3/3] block: improve detail in I/O error messages
  2010-12-08 11:28                                                     ` Sergei Shtylyov
@ 2010-12-08 15:05                                                       ` Mike Snitzer
  0 siblings, 0 replies; 109+ messages in thread
From: Mike Snitzer @ 2010-12-08 15:05 UTC (permalink / raw)
  To: Sergei Shtylyov
  Cc: Hannes Reinecke, k-ueda, michaelc, tytso, linux-scsi, jaxboe,
	jack, linux-fsdevel, linux-kernel, swhiteho, linux-raid,
	linux-ide, dm-devel, James.Bottomley, konishi.ryusuke, j-nomura,
	vst, rwheeler, hch, chris.mason, tj
Classify severity of I/O errors for target and transport errors.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-core.c |   20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)
Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c
+++ linux-2.6/block/blk-core.c
@@ -2028,9 +2028,23 @@ bool blk_update_request(struct request *
 
 	if (error && req->cmd_type == REQ_TYPE_FS &&
 	    !(req->cmd_flags & REQ_QUIET)) {
-		printk(KERN_ERR "end_request: I/O error, dev %s, sector %llu\n",
-				req->rq_disk ? req->rq_disk->disk_name : "?",
-				(unsigned long long)blk_rq_pos(req));
+		char *error_type;
+
+		switch (error) {
+		case -ENOLINK:
+			error_type = "recoverable transport";
+			break;
+		case -EREMOTEIO:
+			error_type = "critical target";
+			break;
+		case -EIO:
+		default:
+			error_type = "I/O";
+			break;
+		}
+		printk(KERN_ERR "end_request: %s error, dev %s, sector %llu\n",
+		       error_type, req->rq_disk ? req->rq_disk->disk_name : "?",
+		       (unsigned long long)blk_rq_pos(req));
 	}
 
 	blk_account_io_completion(req, nr_bytes);
^ permalink raw reply	[flat|nested] 109+ messages in thread
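The classification in the v2 patch above can be tried outside the kernel. The following is only a user-space sketch, not the kernel code: io_error_type() and format_io_error() are hypothetical helper names, and snprintf() stands in for printk(). The errno values and label strings are the ones used in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Map a kernel-style negative errno to the label used in the log line. */
static const char *io_error_type(int error)
{
	switch (error) {
	case -ENOLINK:
		return "recoverable transport";
	case -EREMOTEIO:
		return "critical target";
	case -EIO:
	default:
		return "I/O";
	}
}

/*
 * Build the message text the patched printk() would emit (hypothetical
 * helper; returns the snprintf() result, i.e. the message length).
 */
static int format_io_error(char *buf, size_t len, int error,
			   const char *disk, unsigned long long sector)
{
	assert(buf != NULL);
	return snprintf(buf, len, "end_request: %s error, dev %s, sector %llu",
			io_error_type(error), disk ? disk : "?", sector);
}
```

Any error other than -ENOLINK and -EREMOTEIO falls through to the generic "I/O" label, so log lines for unclassified errors keep their existing wording.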
 
 
- * Re: [RFC PATCH 0/3] differentiate between I/O errors
  2010-12-07 23:16                                                 ` [RFC PATCH 0/3] differentiate between I/O errors Mike Snitzer
                                                                     ` (2 preceding siblings ...)
  2010-12-07 23:16                                                   ` [RFC PATCH 3/3] block: improve detail in I/O error messages Mike Snitzer
@ 2010-12-10 23:40                                                   ` Malahal Naineni
  2011-01-14  1:15                                                     ` Mike Snitzer
  3 siblings, 1 reply; 109+ messages in thread
From: Malahal Naineni @ 2010-12-10 23:40 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Hannes Reinecke, k-ueda, michaelc, tytso, sshtylyov, linux-scsi,
	jaxboe, jack, linux-fsdevel, linux-kernel, swhiteho, linux-raid,
	linux-ide, dm-devel, James.Bottomley, konishi.ryusuke, j-nomura,
	vst, rwheeler, hch, chris.mason, tj
Mike Snitzer [snitzer@redhat.com] wrote:
> Refreshed Hannes' initial "scsi: Detailed I/O errors" patch against
> v2.6.37-rc5.  v2 introduces __scsi_error_from_host_byte to avoid
> the duplicate switch statement.  Also a few whitespace and comment
> changes.
> 
> Split DM mpath change out to separate v2 patch; failed discard is now
> retryable in the face of a non-target IO error.
> 
> Added improved block layer's I/O error message (based on the finer
> grained I/O error returns afforded by SCSI).
> 
> Comments/suggestions are welcome.
I did test Hannes' original patch against Linus' latest git tree! I
used scsi_debug to simulate path failures as well as 'Media' failures,
and it worked as expected. I will test your patches soon.
Thanks, Malahal.
^ permalink raw reply	[flat|nested] 109+ messages in thread
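How a multipath layer might act on the finer-grained error codes discussed in this thread can be sketched as follows. This is an illustration under assumptions, not the dm-mpath implementation: the enum and mpath_classify() are hypothetical names. The only facts taken from the thread are that -EREMOTEIO marks a target error, which is propagated immediately, and that non-target errors such as -ENOLINK remain retryable.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcomes for a completed request on one path. */
enum mpath_action {
	MPATH_COMPLETE,		/* request succeeded */
	MPATH_FAIL_IO,		/* propagate the error to the caller */
	MPATH_RETRY_OTHER_PATH	/* fail this path, retry on another */
};

static enum mpath_action mpath_classify(int error, bool paths_left)
{
	assert(error <= 0);	/* kernel-style: 0 or negative errno */

	if (error == 0)
		return MPATH_COMPLETE;

	/* A target error would fail on every path, so do not retry. */
	if (error == -EREMOTEIO)
		return MPATH_FAIL_IO;

	/* Transport and unclassified errors may succeed on another path. */
	return paths_left ? MPATH_RETRY_OTHER_PATH : MPATH_FAIL_IO;
}
```

In this sketch an unclassified -EIO is still treated as retryable, so only an explicit target error changes the failover behaviour.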
- * Re: [RFC PATCH 0/3] differentiate between I/O errors
  2010-12-10 23:40                                                   ` [RFC PATCH 0/3] differentiate between I/O errors Malahal Naineni
@ 2011-01-14  1:15                                                     ` Mike Snitzer
  0 siblings, 0 replies; 109+ messages in thread
From: Mike Snitzer @ 2011-01-14  1:15 UTC (permalink / raw)
  To: Hannes Reinecke, k-ueda, michaelc, tytso, sshtylyov, linux-scsi,
	jaxboe, jack, linux-f
On Fri, Dec 10 2010 at  6:40pm -0500,
Malahal Naineni <malahal@us.ibm.com> wrote:
> Mike Snitzer [snitzer@redhat.com] wrote:
> > Refreshed Hannes' initial "scsi: Detailed I/O errors" patch against
> > v2.6.37-rc5.  v2 introduces __scsi_error_from_host_byte to avoid
> > the duplicate switch statement.  Also a few whitespace and comment
> > changes.
> > 
> > Split DM mpath change out to separate v2 patch; failed discard is now
> > retryable in the face of a non-target IO error.
> > 
> > Added improved block layer's I/O error message (based on the finer
> > grained I/O error returns afforded by SCSI).
> > 
> > Comments/suggestions are welcome.
> 
> I did test Hannes' original patch against Linus' latest git tree! I
> used scsi_debug to simulate path failures as well as 'Media' failures,
> and it worked as expected. I will test your patches soon.
Hi Malahal,
I was wondering if you had any feedback (testing or otherwise) for these
patches:
https://patchwork.kernel.org/patch/384612/
https://patchwork.kernel.org/patch/384602/
https://patchwork.kernel.org/patch/390882/
We haven't heard from Hannes in a bit but I was hoping we could at least
understand that the few changes I made are agreeable and working as
expected.
Thanks,
Mike
^ permalink raw reply	[flat|nested] 109+ messages in thread 
 
 
- * Re: training mpath to discern between SCSI errors
  2010-11-30 22:59                                               ` Mike Snitzer
  2010-12-07 23:16                                                 ` [RFC PATCH 0/3] differentiate between I/O errors Mike Snitzer
@ 2010-12-17  9:47                                                 ` Hannes Reinecke
  2010-12-17 14:06                                                   ` Mike Snitzer
  1 sibling, 1 reply; 109+ messages in thread
From: Hannes Reinecke @ 2010-12-17  9:47 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jun'ichi Nomura, Kiyoshi Ueda, michaelc, tytso, linux-scsi,
	jaxboe, vst, linux-kernel, Christoph Hellwig, linux-raid,
	linux-ide, device-mapper development, James.Bottomley,
	Sergei Shtylyov, konishi.ryusuke, linux-fsdevel, jack, rwheeler,
	swhiteho, chris.mason, Tejun Heo
On 11/30/2010 11:59 PM, Mike Snitzer wrote:
> On Thu, Nov 18 2010 at 10:11pm -0500,
> Malahal Naineni <malahal@us.ibm.com> wrote:
> 
>> Hannes Reinecke [hare@suse.de] wrote:
>>>> Also (although this might be a bit off topic from your patch),
>>>> can we expand such a distinction to what should be logged?
>>>> Currently, it's difficult to distinguish important SCSI/block errors
>>>> and less important ones in kernel log.
>>>> For example, when I get a link failure on sda, kernel prints something
>>>> like below, regardless of whether the I/O is recovered by multipathing or not:
>>>>   end_request: I/O error, dev sda, sector XXXXX
>>>>
>>> Indeed, when using the above we could be modifying the above
>>> message, eg by
>>>
>>> end_request: transport error, dev sda, sector XXXXX
>>>
>>> or
>>>
>>> end_request: target error, dev sda, sector XXXXX
>>>
>>> which would improve the output noticeably.
>>>
>>>> Setting REQ_QUIET in dm-multipath could mask the message
>>>> but also other important ones in SCSI.
>>>>
>>> Hmm. Not sure about that, but I think the above modifications will
>>> be useful already.
>>>
>>> I'll be sending an updated patch.
>>
>> Hannes, is there an updated version of this patch? It applied fine to
>> Linus' git tree with one minor reject! I would like to test an updated
>> version if you have one (the update seems to refer to better logging
>> only, right?).
> 
> Hannes,
> 
> Any chance you've had time to fold your proposed logging changes in and
> rebase this patch?  Could you post that updated patch?
> 
yes, will be following shortly.
> I'd like to help see this patch through to inclusion when 2.6.38 merge
> window opens.  I can help with further review, testing and development.
> 
Ok, thanks.
Cheers,
Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: training mpath to discern between SCSI errors
  2010-12-17  9:47                                                 ` training mpath to discern between SCSI errors Hannes Reinecke
@ 2010-12-17 14:06                                                   ` Mike Snitzer
  2011-01-14  1:09                                                     ` Mike Snitzer
  0 siblings, 1 reply; 109+ messages in thread
From: Mike Snitzer @ 2010-12-17 14:06 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Kiyoshi Ueda, michaelc, tytso, Sergei Shtylyov, linux-scsi,
	jaxboe, jack, linux-fsdevel, linux-kernel, swhiteho, linux-raid,
	linux-ide, device-mapper development, James.Bottomley,
	konishi.ryusuke, Jun'ichi Nomura, vst, rwheeler,
	Christoph Hellwig, chris.mason, Tejun Heo
On Fri, Dec 17 2010 at  4:47am -0500,
Hannes Reinecke <hare@suse.de> wrote:
> On 11/30/2010 11:59 PM, Mike Snitzer wrote:
> > Hannes,
> > 
> > Any chance you've had time to fold your proposed logging changes in and
> > rebase this patch?  Could you post that updated patch?
> > 
> yes, will be following shortly.
> 
> > I'd like to help see this patch through to inclusion when 2.6.38 merge
> > window opens.  I can help with further review, testing and development.
> > 
> Ok, thanks.
I took some steps at furthering your work.  Here is the cover letter to
the patches I recently sent to dm-devel:
https://www.redhat.com/archives/dm-devel/2010-December/msg00090.html
And here are the patches:
https://patchwork.kernel.org/patch/384612/
https://patchwork.kernel.org/patch/384602/
https://patchwork.kernel.org/patch/390882/
Please feel free to change these however you see fit, but your feedback
is really appreciated.
Thanks,
Mike
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: training mpath to discern between SCSI errors
  2010-12-17 14:06                                                   ` Mike Snitzer
@ 2011-01-14  1:09                                                     ` Mike Snitzer
  2011-01-14  7:45                                                       ` Hannes Reinecke
  0 siblings, 1 reply; 109+ messages in thread
From: Mike Snitzer @ 2011-01-14  1:09 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jun'ichi Nomura, Kiyoshi Ueda, michaelc, tytso, linux-scsi,
	jaxboe, vst, linux-kernel, Christoph Hellwig, linux-raid,
	linux-ide, device-mapper development, James.Bottomley,
	Sergei Shtylyov, konishi.ryusuke, linux-fsdevel, jack, rwheeler,
	swhiteho, chris.mason, Tejun Heo
On Fri, Dec 17 2010 at  9:06am -0500,
Mike Snitzer <snitzer@redhat.com> wrote:
> On Fri, Dec 17 2010 at  4:47am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
> 
> > On 11/30/2010 11:59 PM, Mike Snitzer wrote:
> > > Hannes,
> > > 
> > > Any chance you've had time to fold your proposed logging changes in and
> > > rebase this patch?  Could you post that updated patch?
> > > 
> > yes, will be following shortly.
> > 
> > > I'd like to help see this patch through to inclusion when 2.6.38 merge
> > > window opens.  I can help with further review, testing and development.
> > > 
> > Ok, thanks.
> 
> I took some steps at furthering your work.  Here is the cover letter to
> the patches I recently sent to dm-devel:
> https://www.redhat.com/archives/dm-devel/2010-December/msg00090.html
> 
> And here are the patches:
> https://patchwork.kernel.org/patch/384612/
> https://patchwork.kernel.org/patch/384602/
> https://patchwork.kernel.org/patch/390882/
> 
> Please feel free to change these however you see fit, but your feedback
> is really appreciated.
Hannes,
Any update?  I'd really like to see this work get upstream ASAP.  I'm
doubtful that is possible for 2.6.38 given the merge window is likely to
close shortly.
Regardless, if we could get consensus on this work now and then stage it
with James that would be great.
Thanks,
Mike
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: training mpath to discern between SCSI errors
  2011-01-14  1:09                                                     ` Mike Snitzer
@ 2011-01-14  7:45                                                       ` Hannes Reinecke
  2011-01-14 13:59                                                         ` Mike Snitzer
  0 siblings, 1 reply; 109+ messages in thread
From: Hannes Reinecke @ 2011-01-14  7:45 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jun'ichi Nomura, Kiyoshi Ueda, michaelc, tytso, linux-scsi,
	jaxboe, vst, linux-kernel, Christoph Hellwig, linux-raid,
	linux-ide, device-mapper development, James.Bottomley,
	Sergei Shtylyov, konishi.ryusuke, linux-fsdevel, jack, rwheeler,
	swhiteho, chris.mason, Tejun Heo
On 01/14/2011 02:09 AM, Mike Snitzer wrote:
> On Fri, Dec 17 2010 at  9:06am -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
>> On Fri, Dec 17 2010 at  4:47am -0500,
>> Hannes Reinecke <hare@suse.de> wrote:
>>
>>> On 11/30/2010 11:59 PM, Mike Snitzer wrote:
>>>> Hannes,
>>>>
>>>> Any chance you've had time to fold your proposed logging changes in and
>>>> rebase this patch?  Could you post that updated patch?
>>>>
>>> yes, will be following shortly.
>>>
>>>> I'd like to help see this patch through to inclusion when 2.6.38 merge
>>>> window opens.  I can help with further review, testing and development.
>>>>
>>> Ok, thanks.
>>
>> I took some steps at furthering your work.  Here is the cover letter to
>> the patches I recently sent to dm-devel:
>> https://www.redhat.com/archives/dm-devel/2010-December/msg00090.html
>>
>> And here are the patches:
>> https://patchwork.kernel.org/patch/384612/
>> https://patchwork.kernel.org/patch/384602/
>> https://patchwork.kernel.org/patch/390882/
>>
>> Please feel free to change these however you see fit, but your feedback
>> is really appreciated.
> 
> Hannes,
> 
> Any update?  I'd really like to see this work get upstream ASAP.  I'm
> doubtful that is possible for 2.6.38 given the merge window is likely to
> close shortly.
> 
Sorry for the late answer; the above patches somehow got lost in my
various mail folders :-(
But yes, the patchset looks okay.
Feel free to add my Acked-by: to the last two.
> Regardless, if we could get consensus on this work now and then stage it
> with James that would be great.
> 
Indeed. Will you resend it to linux-scsi or shall I do it?
Cheers,
Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: training mpath to discern between SCSI errors
  2011-01-14  7:45                                                       ` Hannes Reinecke
@ 2011-01-14 13:59                                                         ` Mike Snitzer
  0 siblings, 0 replies; 109+ messages in thread
From: Mike Snitzer @ 2011-01-14 13:59 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jun'ichi Nomura, Kiyoshi Ueda, michaelc, tytso, linux-scsi,
	jaxboe, vst, linux-kernel, Christoph Hellwig, linux-raid,
	linux-ide, device-mapper development, James.Bottomley,
	Sergei Shtylyov, konishi.ryusuke, linux-fsdevel, jack, rwheeler,
	swhiteho, chris.mason, Tejun Heo
On Fri, Jan 14 2011 at  2:45am -0500,
Hannes Reinecke <hare@suse.de> wrote:
> On 01/14/2011 02:09 AM, Mike Snitzer wrote:
> > On Fri, Dec 17 2010 at  9:06am -0500,
> > Mike Snitzer <snitzer@redhat.com> wrote:
> > 
> >> On Fri, Dec 17 2010 at  4:47am -0500,
> >> Hannes Reinecke <hare@suse.de> wrote:
> >>
> >>> On 11/30/2010 11:59 PM, Mike Snitzer wrote:
> >>>> Hannes,
> >>>>
> >>>> Any chance you've had time to fold your proposed logging changes in and
> >>>> rebase this patch?  Could you post that updated patch?
> >>>>
> >>> yes, will be following shortly.
> >>>
> >>>> I'd like to help see this patch through to inclusion when 2.6.38 merge
> >>>> window opens.  I can help with further review, testing and development.
> >>>>
> >>> Ok, thanks.
> >>
> >> I took some steps at furthering your work.  Here is the cover letter to
> >> the patches I recently sent to dm-devel:
> >> https://www.redhat.com/archives/dm-devel/2010-December/msg00090.html
> >>
> >> And here are the patches:
> >> https://patchwork.kernel.org/patch/384612/
> >> https://patchwork.kernel.org/patch/384602/
> >> https://patchwork.kernel.org/patch/390882/
> >>
> >> Please feel free to change these however you see fit, but your feedback
> >> is really appreciated.
> > 
> > Hannes,
> > 
> > Any update?  I'd really like to see this work get upstream ASAP.  I'm
> > doubtful that is possible for 2.6.38 given the merge window is likely to
> > close shortly.
> > 
> Sorry for the late answer; the above patches somehow got lost in my
> various mail folders :-(
> 
> But yes, the patchset looks okay.
> Feel free to add my Acked-by: to the last two.
OK.
> > Regardless, if we could get consensus on this work now and then stage it
> > with James that would be great.
> > 
> Indeed. Will you resend it to linux-scsi or shall I do it?
I can resend, will do so shortly.
Thanks!
Mike
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-23 12:14                     ` Tejun Heo
  2010-08-23 14:17                       ` Mike Snitzer
@ 2010-08-24 17:11                       ` Vladislav Bolkhovitin
  2010-08-24 23:14                         ` Alan Cox
  1 sibling, 1 reply; 109+ messages in thread
From: Vladislav Bolkhovitin @ 2010-08-24 17:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kiyoshi Ueda, Christoph Hellwig, jaxboe, linux-fsdevel,
	linux-scsi, linux-ide, linux-kernel, linux-raid, James.Bottomley,
	tytso, chris.mason, swhiteho, konishi.ryusuke, dm-devel, jack,
	rwheeler, hare
Tejun Heo, on 08/23/2010 04:14 PM wrote:
>> I think that's correct and changing the priority of DM_ENDIO_REQUEUE
>> for REQ_FLUSH down to the lowest should be fine.
>> (I didn't know that FLUSH failure implies data loss possibility.)
>
> At least on ATA, FLUSH failure implies that data is already lost, so
> the error can't be ignored or retried.
In SCSI there are conditions under which a failed command, including
FLUSH (SYNC_CACHE), doesn't imply lost data. In those cases the caller
is expected to retry the failed command. The most common cases are Unit
Attentions and TASK SET FULL (queue full) status.
Vlad
^ permalink raw reply	[flat|nested] 109+ messages in thread 
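The distinction drawn above, between SCSI failures that only ask for a retry and failures that may mean lost data, can be sketched as a small predicate. This is a sketch under assumptions, not code from the SCSI midlayer: the macro and function names are hypothetical, while the numeric values are the standard SAM status codes and sense keys. The check covers only the two retryable cases named in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* SAM status codes (values from the SCSI Architecture Model). */
#define ST_CHECK_CONDITION	0x02
#define ST_TASK_SET_FULL	0x28

/* Sense keys (4-bit field in the sense data). */
#define SK_MEDIUM_ERROR		0x03
#define SK_UNIT_ATTENTION	0x06

/*
 * Hypothetical helper: does a failed command (for example SYNC_CACHE)
 * merely need to be reissued?  sense_key is only meaningful when the
 * status is CHECK CONDITION.
 */
static bool scsi_failure_is_retryable(uint8_t status, uint8_t sense_key)
{
	assert(sense_key <= 0x0f);

	if (status == ST_TASK_SET_FULL)
		return true;	/* device queue was full, nothing was lost */
	if (status == ST_CHECK_CONDITION && sense_key == SK_UNIT_ATTENTION)
		return true;	/* device state changed, reissue the command */
	return false;		/* e.g. MEDIUM ERROR: treat as a real failure */
}
```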
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-24 17:11                       ` [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Vladislav Bolkhovitin
@ 2010-08-24 23:14                         ` Alan Cox
  0 siblings, 0 replies; 109+ messages in thread
From: Alan Cox @ 2010-08-24 23:14 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Kiyoshi Ueda, tytso, linux-scsi, linux-ide, jaxboe, linux-kernel,
	swhiteho, linux-raid, linux-fsdevel, dm-devel, James.Bottomley,
	konishi.ryusuke, Tejun Heo, jack, rwheeler, Christoph Hellwig,
	chris.mason
> In SCSI there are conditions under which a failed command, including
> FLUSH (SYNC_CACHE), doesn't imply lost data. In those cases the caller
> is expected to retry the failed command. The most common cases are Unit
> Attentions and TASK SET FULL (queue full) status.
ATA expects the command to be retried as well because a failed flush
indicates the specific sector is lost (unless the host still has a copy
of course - which is *very* likely although we don't use it) but the rest
of the flush transaction can be retried to continue to flush sectors
beyond the failed one.
Alan
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-12 12:41 [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Tejun Heo
                   ` (11 preceding siblings ...)
  2010-08-13 11:48 ` [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Christoph Hellwig
@ 2010-08-13 12:55 ` Vladislav Bolkhovitin
  2010-08-13 13:17   ` Christoph Hellwig
  2010-08-13 13:21   ` Tejun Heo
  2010-08-18  9:46 ` Christoph Hellwig
                   ` (2 subsequent siblings)
  15 siblings, 2 replies; 109+ messages in thread
From: Vladislav Bolkhovitin @ 2010-08-13 12:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, jack, rwheeler, hare
Tejun Heo, on 08/12/2010 04:41 PM wrote:
> Each filesystem needs to be updated to enforce request
> ordering themselves and then to use REQ_FLUSH/FUA mechanism.
I generally agree with the patchset, but I believe this particular move
is a really bad one.
I won't dwell on the obvious point that common functionality (enforcing
request ordering in this case) should be handled by a common library,
not internally by each of the zillion file systems Linux has.
The worst part of this move is that it would hide all the request
ordering semantics inside file systems, most likely in a very unclear
way. As a result, if I or someone else decided to implement "hardware
offload" of request ordering (ORDERED requests), we would not be able to
see any improvement until at least one file system had been changed to
use it. Worse, if the implementor can't demonstrate the improvement, how
can he encourage file system developers to update their file systems?
This basically means that only a person with deep knowledge of *BOTH*
storage and file system internals can do the job. How many such people
do you know? Both storage and file systems are very wide and tricky
topics, so people nearly always specialize in one of them, not both.
Thus, this move would basically mean that proper ordered queuing would
probably never be implemented in Linux.
I believe it would be much better to create a common interface, which
file systems would use to enforce request order when they need it.
Advantages of this approach:
1. The ordering requirements of file systems would be clear.
2. They would be handled in one place by common code.
3. Any storage level expert could try to implement ordered queuing
without a deep dive into file system design and implementation.
I already suggested such an interface in
http://marc.info/?l=linux-scsi&m=128077574815881&w=2. Internally, for
the moment, it can be implemented using the existing REQ_FLUSH/FUA/etc.
and waiting for all the requests in the group to finish. As a nice side
effect, if a device doesn't support FUA, it would be possible to issue
SYNC_CACHE command(s) only for the required blocks, not for the whole
device as is done now.
If requested, I can develop the interface further.
Vlad
^ permalink raw reply	[flat|nested] 109+ messages in thread
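The fallback described above (wait for every request in a group to finish, then flush) can be modelled in a few lines of user-space C. This is only a sketch of the idea, not the interface from the linked proposal: struct order_group and its functions are hypothetical names, and setting the flushed field stands in for issuing a real cache flush.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical group of requests that must all finish before a flush. */
struct order_group {
	unsigned int pending;	/* requests submitted but not yet completed */
	bool sealed;		/* no further requests will be added */
	bool flushed;		/* stand-in for "cache flush was issued" */
};

static void order_group_init(struct order_group *g)
{
	g->pending = 0;
	g->sealed = false;
	g->flushed = false;
}

/* Issue the flush once the group is sealed and fully drained. */
static void order_group_maybe_flush(struct order_group *g)
{
	if (g->sealed && g->pending == 0 && !g->flushed)
		g->flushed = true;
}

/* Account for one request submitted as part of the group. */
static void order_group_add(struct order_group *g)
{
	assert(!g->sealed);
	g->pending++;
}

/* The caller has submitted everything that belongs to the group. */
static void order_group_seal(struct order_group *g)
{
	g->sealed = true;
	order_group_maybe_flush(g);
}

/* Completion callback for one request of the group. */
static void order_group_complete(struct order_group *g)
{
	assert(g->pending > 0);
	g->pending--;
	order_group_maybe_flush(g);
}
```

Only requests in the same group wait for each other here; unrelated I/O is not drained, which is the difference from a whole-queue barrier.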
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-13 12:55 ` Vladislav Bolkhovitin
@ 2010-08-13 13:17   ` Christoph Hellwig
  2010-08-18 19:29     ` Vladislav Bolkhovitin
  2010-08-13 13:21   ` Tejun Heo
  1 sibling, 1 reply; 109+ messages in thread
From: Christoph Hellwig @ 2010-08-13 13:17 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: Tejun Heo, jaxboe, linux-fsdevel, linux-scsi, linux-ide,
	linux-kernel, linux-raid, hch, James.Bottomley, tytso,
	chris.mason, swhiteho, konishi.ryusuke, dm-devel, jack, rwheeler,
	hare
On Fri, Aug 13, 2010 at 04:55:33PM +0400, Vladislav Bolkhovitin wrote:
> I won't dwell on the obvious point that common functionality (enforcing
> request ordering in this case) should be handled by a common library,
> not internally by each of the zillion file systems Linux has.
I/O ordering is still handled mostly by common code, that is the
pagecache and the buffercache, although a few filesystems like XFS and
btrfs have their own implementation of the second one.
The current ordered semantics of barriers have only successfully been
implemented by a complete queue drain, and have not been used
effectively by filesystems.  This patchset removes the bogus global
ordering enforced by the block layer whenever a filesystem wants to be
able to use cache flushes, and because of that allows I/O with deeper
outstanding queue depths and less latency.
Now I know you in particular are a fan of scsi ordered tags.  And as I
told you before, I'm open to reviewing such an implementation if it shows
us any advantages.  Adding it after this patch is in fact not any more
complicated than before.  I'd almost be tempted to say it's easier, as you
don't have to plug it into the complex state machine we used for barriers,
and more importantly we drop the requirement for the barrier sequence to
be atomic, which in fact made implementing barriers using tagged queues
impossible with the current scsi layer.
As far as playing with ordered tags goes, it's just a matter of adding a
new flag for it on the bio that gets passed down to the driver.  For a
final version you'd need a queue-level feature flag saying whether it's
supported, but you don't even need that for the initial work.  Then you
can implement a variant of blk_do_flush that does away with queueing
each additional request once the previous one finishes and instead
queues all two or three at the same time with your new ordered flag set,
at which point you are back to the level of ordered tag usage that the
old code allows.  You're still left with all the hard problems of
actually implementing error handling for it and using it higher up in
the filesystem and generic page cache code.
I'd really love to see your results, to the point that I might just try
it myself once I get a little spare time.  But my theory is that it won't
help us - the problem with ordered tags is that they enforce global
ordering while we currently have local ordering.  While it will reduce
the latency for the process waiting for an fsync or similar, it will
affect other I/O going on in the background and reduce the device's
ability to reorder that I/O.
So for now this patch set is a massive performance improvement for the
workloads we care about, while removing the interface we put in place
to allow a theoretical optimization that hasn't shown up in 8 years,
and that in fact made the interface just complicated enough to make
that optimization so hard.
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-13 13:17   ` Christoph Hellwig
@ 2010-08-18 19:29     ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 109+ messages in thread
From: Vladislav Bolkhovitin @ 2010-08-18 19:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tejun Heo, jaxboe, linux-fsdevel, linux-scsi, linux-ide,
	linux-kernel, linux-raid, James.Bottomley, tytso, chris.mason,
	swhiteho, konishi.ryusuke, dm-devel, jack, rwheeler, hare
Christoph Hellwig, on 08/13/2010 05:17 PM wrote:
> As far as playing with ordered tags goes, it's just a matter of adding a
> new flag for it on the bio that gets passed down to the driver.  For a
> final version you'd need a queue-level feature flag saying whether it's
> supported, but you don't even need that for the initial work.  Then you
> can implement a variant of blk_do_flush that does away with queueing
> each additional request once the previous one finishes and instead
> queues all two or three at the same time with your new ordered flag set,
> at which point you are back to the level of ordered tag usage that the
> old code allows.  You're still left with all the hard problems of
> actually implementing error handling for it and using it higher up in
> the filesystem and generic page cache code.
But how about file systems doing internal local order-by-drain? Without
converting them to use ordered commands it would be impossible to show
the full potential of ordered commands, and to make the conversion one
would need deep internal FS knowledge. That's my point. But if there's a
trivial way to find all such places in the filesystem code and convert
them, then OK, I agree.
> I'd really love to see your results, to the point that I might just try
> it myself once I get a little spare time.  But my theory is that it won't
> help us - the problem with ordered tags is that they enforce global
> ordering while we currently have local ordering.  While it will reduce
> the latency for the process waiting for an fsync or similar, it will
> affect other I/O going on in the background and reduce the device's
> ability to reorder that I/O.
Local ordering vs global ordering is relevant only if the load comes
from several applications/threads. But how about a single
application/thread?
Another point, for which, AFAIU, the ORDERED commands were invented, is
that they enforce ordering on the _other_ side of the link, _after_ all
link/transfer latencies. This is why it's hard to see their advantage on
local disks.
Vlad
^ permalink raw reply	[flat|nested] 109+ messages in thread 
 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-13 12:55 ` Vladislav Bolkhovitin
  2010-08-13 13:17   ` Christoph Hellwig
@ 2010-08-13 13:21   ` Tejun Heo
  2010-08-18 19:30     ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 109+ messages in thread
From: Tejun Heo @ 2010-08-13 13:21 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, jack, rwheeler, hare
Hello,
On 08/13/2010 02:55 PM, Vladislav Bolkhovitin wrote:
> If requested, I can develop the interface further.
I still think the benefit of ordering by tag would be marginal at
best, and what have you guys measured there?  Under the current
framework, there's no easy way to measure full ordered-by-tag
implementation.  The mechanism for filesystems to communicate the
ordering information (which would be a partially ordered graph) just
isn't there and there is no way the current usage of ordering-by-tag
only for barrier sequence can achieve anything close to that level of
difference.
Ripping out the original ordering by tag mechanism doesn't amount to
much.  The use of ordering-by-tag was pretty half-assed there anyway.
If you think exporting full ordering information from filesystem to
the lower layers is worthwhile, please go ahead.  It would be very
interesting to see how much actual difference it can make compared to
ordering-by-filesystem and if it's actually better and the added
complexity is manageable, there's no reason not to do that.
Thank you.
-- 
tejun
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-13 13:21   ` Tejun Heo
@ 2010-08-18 19:30     ` Vladislav Bolkhovitin
  2010-08-19  9:51       ` Tejun Heo
  0 siblings, 1 reply; 109+ messages in thread
From: Vladislav Bolkhovitin @ 2010-08-18 19:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, jack, rwheeler, hare
Hello,
Tejun Heo, on 08/13/2010 05:21 PM wrote:
>> If requested, I can develop the interface further.
>
> I still think the benefit of ordering by tag would be marginal at
> best, and what have you guys measured there?  Under the current
> framework, there's no easy way to measure full ordered-by-tag
> implementation.  The mechanism for filesystems to communicate the
> ordering information (which would be a partially ordered graph) just
> isn't there and there is no way the current usage of ordering-by-tag
> only for barrier sequence can achieve anything close to that level of
> difference.
Basically, I measured how iSCSI link utilization depends on the number of
queued commands and the queued data size. This is why I presented it as a
table. From it you can see what improvement you would get by removing queue
draining after 1, 2, 4, etc. commands, depending on the command sizes.
For instance, in my previous XFS rm example, where rm of 4 files took
3.5 minutes with the nobarrier option, I could see that XFS was sending 1-3
32K commands in a row. From my table you can see that if it sent them all
at once without draining, it would get about a 150-200% speed increase.
Vlad
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-18 19:30     ` Vladislav Bolkhovitin
@ 2010-08-19  9:51       ` Tejun Heo
  2010-08-30  9:54         ` Hannes Reinecke
  0 siblings, 1 reply; 109+ messages in thread
From: Tejun Heo @ 2010-08-19  9:51 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, jack, rwheeler, hare
Hello,
On 08/18/2010 09:30 PM, Vladislav Bolkhovitin wrote:
> Basically, I measured how iSCSI link utilization depends from amount
> of queued commands and queued data size. This is why I made it as a
> table. From it you can see which improvement you will have removing
> queue draining after 1, 2, 4, etc. commands depending of commands
> sizes.
>
> For instance, on my previous XFS rm example, where rm of 4 files
> took 3.5 minutes with nobarrier option, I could see that XFS was
> sending 1-3 32K commands in a row. From my table you can see that if
> it sent all them at once without draining, it would have about
> 150-200% speed increase.
You compared barrier off/on.  Of course, it will make a big
difference.  I think a good part of that gain should be realized by the
currently proposed patchset which removes draining.  What needs to
be demonstrated is the difference between ordered-by-waiting and
ordered-by-tag.  We've never had code to do that properly.
The original ordered-by-tag we had only applied tag ordering to two or
three command sequences inside a barrier, which doesn't amount to much
(and could even be harmful as it imposes draining of all simple
commands inside the device only to reduce issue latencies for a few
commands).  You'll need to hook into the filesystem and somehow export the
ordering information down to the driver so that whatever needs
ordering is sent out as ordered commands.
As I've written multiple times, I'm pretty skeptical it will bring much.
Ordered tag mandates draining inside the device just like the original
barrier implementation.  Sure, it's done at a lower layer and command
issue latencies will be reduced thanks to that but ordered-by-waiting
doesn't require _any_ draining at all.  The whole pipeline can be kept
full all the time.  I'm often wrong tho, so please feel free to go
ahead and prove me wrong.  :-)
Thanks.
-- 
tejun
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-19  9:51       ` Tejun Heo
@ 2010-08-30  9:54         ` Hannes Reinecke
  2010-08-30 20:34           ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 109+ messages in thread
From: Hannes Reinecke @ 2010-08-30  9:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vladislav Bolkhovitin, jaxboe, linux-fsdevel, linux-scsi,
	linux-ide, linux-kernel, linux-raid, hch, James.Bottomley, tytso,
	chris.mason, swhiteho, konishi.ryusuke, dm-devel, jack, rwheeler
Tejun Heo wrote:
> Hello,
> 
> On 08/18/2010 09:30 PM, Vladislav Bolkhovitin wrote:
>> Basically, I measured how iSCSI link utilization depends from amount
>> of queued commands and queued data size. This is why I made it as a
>> table. From it you can see which improvement you will have removing
>> queue draining after 1, 2, 4, etc. commands depending of commands
>> sizes.
>>
>> For instance, on my previous XFS rm example, where rm of 4 files
>> took 3.5 minutes with nobarrier option, I could see that XFS was
>> sending 1-3 32K commands in a row. From my table you can see that if
>> it sent all them at once without draining, it would have about
>> 150-200% speed increase.
> 
> You compared barrier off/on.  Of course, it will make a big
> difference.  I think good part of that gain should be realized by the
> currently proposed patchset which removes draining.  What's needed to
> be demonstrated is the difference between ordered-by-waiting and
> ordered-by-tag.  We've never had code to do that properly.
> 
> The original ordered-by-tag we had only applied tag ordering to two or
> three command sequences inside a barrier, which doesn't amount to much
> (and could even be harmful as it imposes draining of all simple
> commands inside the device only to reduce issue latencies for a few
> commands).  You'll need to hook into filesystem and somehow export the
> ordering information down to the driver so that whatever needs
> ordering is sent out as ordered commands.
> 
> As I've wrote multiple times, I'm pretty skeptical it will bring much.
> Ordered tag mandates draining inside the device just like the original
> barrier implementation.  Sure, it's done at a lower layer and command
> issue latencies will be reduced thanks to that but ordered-by-waiting
> doesn't require _any_ draining at all.  The whole pipeline can be kept
> full all the time.  I'm often wrong tho, so please feel free to go
> ahead and prove me wrong.  :-)
> 
Actually, I thought about ordered tag writes, too.
But eventually I had to give up on this for a simple reason:
Ordered tag controls the ordering on the SCSI _TARGET_. But for a
meaningful implementation we need to control the ordering all the way
down from ->queuecommand(). Which means we have three areas we need
to cover here:
- driver (ie between ->queuecommand() and passing it off to the firmware)
- firmware
- fabric
Sadly, the latter two are really hard to influence. And, what's more,
with the new/modern CNAs with multiple queues and possibly multiple
routes to the target it becomes impossible to guarantee ordering.
So using ordered tags for FibreChannel is not going to work, which
makes implementing it a bit of a pointless exercise for me.
Cheers,
Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-30  9:54         ` Hannes Reinecke
@ 2010-08-30 20:34           ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 109+ messages in thread
From: Vladislav Bolkhovitin @ 2010-08-30 20:34 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Tejun Heo, jaxboe, linux-fsdevel, linux-scsi, linux-ide,
	linux-kernel, linux-raid, hch, James.Bottomley, tytso,
	chris.mason, swhiteho, konishi.ryusuke, dm-devel, jack, rwheeler
Hannes Reinecke, on 08/30/2010 01:54 PM wrote:
>> As I've wrote multiple times, I'm pretty skeptical it will bring much.
>> Ordered tag mandates draining inside the device just like the original
>> barrier implementation.  Sure, it's done at a lower layer and command
>> issue latencies will be reduced thanks to that but ordered-by-waiting
>> doesn't require _any_ draining at all.  The whole pipeline can be kept
>> full all the time.  I'm often wrong tho, so please feel free to go
>> ahead and prove me wrong.  :-)
>>
> Actually, I thought about ordered tag writes, too.
> But eventually I had to give up on this for a simple reason:
> Ordered tag controls the ordering on the SCSI _TARGET_. But for a
> meaningful implementation we need to control the ordering all the way
> down from ->queuecommand(). Which means we have three areas we need
> to cover here:
> - driver (ie between ->queuecommand() and passing it off to the firmware)
> - firmware
> - fabric
>
> Sadly, the latter two are really hard to influence. And, what's more,
> with the new/modern CNAs with multiple queues and possible multiple
> routes to the target it becomes impossible to guarantee ordering.
> So using ordered tags for FibreChannel is not going to work, which
> makes implementing it a bit of a pointless exercise for me.
The situation is, actually, much better than you think. A SCSI
transport should provide in-order delivery of commands. In some
transports it is required (e.g. iSCSI), in some it is optional (e.g. FC).
For FC "an application client may determine if a device server supports 
the precise delivery function by using the MODE SENSE and MODE SELECT 
commands to examine and set the enable precise delivery checking (EPDC) 
bit in the Fibre Channel Logical Unit Control page" (Fibre Channel 
Protocol for SCSI (FCP)). You can find more details in FCP section 
"Precise delivery of SCSI commands".
Regarding multiple queues, in the case of multipath access to a device
SCSI requires either that each path be a separate I_T nexus, within which
the order of commands is maintained, or that the transport maintain in-order
command delivery among multiple paths in a single I_T nexus (session),
as is done in iSCSI's MC/S and, most likely, wide SAS ports.
So, everything is in the specs. We only need to use it properly. A few
weeks ago, in the "[RFC] relaxed barrier semantics" thread, I described how
it can be done at the driver level, as well as how error recovery can be
done using the ACA and UA_INTLCK facilities.
Vlad
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-12 12:41 [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Tejun Heo
                   ` (12 preceding siblings ...)
  2010-08-13 12:55 ` Vladislav Bolkhovitin
@ 2010-08-18  9:46 ` Christoph Hellwig
  2010-08-19  9:57   ` Tejun Heo
  2010-08-20 13:22 ` Christoph Hellwig
  2010-08-23 14:15 ` [PATCH] block: simplify queue_next_fseq Christoph Hellwig
  15 siblings, 1 reply; 109+ messages in thread
From: Christoph Hellwig @ 2010-08-18  9:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
FYI: One issue with this series is that make_request based drivers
now have to accept all REQ_FLUSH and REQ_FUA requests.  We'll either
need to add handling of empty REQ_FLUSH requests to all of them or
figure out a way to prevent them getting sent.  That is assuming they'll
simply ignore REQ_FLUSH/REQ_FUA on normal writes.
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-18  9:46 ` Christoph Hellwig
@ 2010-08-19  9:57   ` Tejun Heo
  2010-08-19 10:20     ` Christoph Hellwig
  0 siblings, 1 reply; 109+ messages in thread
From: Tejun Heo @ 2010-08-19  9:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
Hello,
On 08/18/2010 11:46 AM, Christoph Hellwig wrote:
> FYI: One issue with this series is that make_request based drivers
> not have to access all REQ_FLUSH and REQ_FUA requests.  We'll either
> need to add handling to empty REQ_FLUSH requests to all of them or
> figure out a way to prevent them getting sent.  That is assuming they'll
> simply ignore REQ_FLUSH/REQ_FUA on normal writes.
Can you be a bit more specific?  In most cases, request based drivers
should be fine.  They sit behind the frontmost request_queue which
would decompose REQ_FLUSH/FUAs into the appropriate command sequence.
For the request based drivers, it's no different from the original
REQ_HARDBARRIER mechanism, they'll just see flushes and optionally FUA
writes.
Thanks.
-- 
tejun
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-19  9:57   ` Tejun Heo
@ 2010-08-19 10:20     ` Christoph Hellwig
  2010-08-19 10:22       ` Tejun Heo
  0 siblings, 1 reply; 109+ messages in thread
From: Christoph Hellwig @ 2010-08-19 10:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Hellwig, jaxboe, linux-fsdevel, linux-scsi, linux-ide,
	linux-kernel, linux-raid, James.Bottomley, tytso, chris.mason,
	swhiteho, konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
On Thu, Aug 19, 2010 at 11:57:53AM +0200, Tejun Heo wrote:
> On 08/18/2010 11:46 AM, Christoph Hellwig wrote:
> > FYI: One issue with this series is that make_request based drivers
> > not have to access all REQ_FLUSH and REQ_FUA requests.  We'll either
> > need to add handling to empty REQ_FLUSH requests to all of them or
> > figure out a way to prevent them getting sent.  That is assuming they'll
> > simply ignore REQ_FLUSH/REQ_FUA on normal writes.
> 
> Can you be a bit more specific?  In most cases, request based drivers
> should be fine.  They sit behind the front most request_queue which
> would discompose REQ_FLUSH/FUAs into appropriate command sequence.
I said make_request based drivers, that is drivers taking bios.   These
get bios directly from __generic_make_request and need to deal with
REQ_FLUSH/FUA themselves.  We have quite a few more than just dm/md of
this kind:
arch/powerpc/sysdev/axonram.c:	blk_queue_make_request(bank->disk->queue, axon_ram_make_request);
drivers/block/aoe/aoeblk.c:     blk_queue_make_request(d->blkq, aoeblk_make_request);
drivers/block/brd.c:		blk_queue_make_request(brd->brd_queue, brd_make_request);
drivers/block/drbd/drbd_main.c: blk_queue_make_request(q, drbd_make_request_26);
drivers/block/loop.c:		blk_queue_make_request(lo->lo_queue, loop_make_request);
drivers/block/pktcdvd.c:        blk_queue_make_request(q, pkt_make_request);
drivers/block/ps3vram.c:        blk_queue_make_request(queue, ps3vram_make_request);
drivers/block/umem.c:		blk_queue_make_request(card->queue, mm_make_request);
drivers/s390/block/dcssblk.c:	blk_queue_make_request(dev_info->dcssblk_queue, dcssblk_make_request);
drivers/s390/block/xpram.c:	blk_queue_make_request(xpram_queues[i], xpram_make_request);
drivers/staging/zram/zram_drv.c:blk_queue_make_request(zram->queue, zram_make_request);
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-19 10:20     ` Christoph Hellwig
@ 2010-08-19 10:22       ` Tejun Heo
  0 siblings, 0 replies; 109+ messages in thread
From: Tejun Heo @ 2010-08-19 10:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
Hello,
On 08/19/2010 12:20 PM, Christoph Hellwig wrote:
> I said make_request based drivers, that is drivers taking bios.
Right.  Gees, it's confusing.
> These get bios directly from __generic_make_request and need to deal
> with REQ_FLUSH/FUA themselves.  We have quite a few more than just
> dm/md of this kind:
>
> arch/powerpc/sysdev/axonram.c
> drivers/block/aoe/aoeblk.c
> drivers/block/brd.c
I'll try to convert these three.
> drivers/block/drbd/drbd_main.c
I'd rather leave drbd to its maintainers.
> drivers/block/loop.c
Already converted.
> drivers/block/pktcdvd.c
> drivers/block/ps3vram.c
> drivers/block/umem.c
> drivers/s390/block/dcssblk.c
> drivers/s390/block/xpram.c
> drivers/staging/zram/zram_drv.c
Will work on these.
Thanks.
-- 
tejun
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-12 12:41 [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Tejun Heo
                   ` (13 preceding siblings ...)
  2010-08-18  9:46 ` Christoph Hellwig
@ 2010-08-20 13:22 ` Christoph Hellwig
  2010-08-20 15:18   ` Ric Wheeler
  2010-08-23 12:36   ` Tejun Heo
  2010-08-23 14:15 ` [PATCH] block: simplify queue_next_fseq Christoph Hellwig
  15 siblings, 2 replies; 109+ messages in thread
From: Christoph Hellwig @ 2010-08-20 13:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
FYI: here's a little writeup to document the new cache flushing scheme,
intended to replace Documentation/block/barriers.txt.  Any good
suggestion for a filename in the kernel tree?
---
Explicit volatile write cache control
=====================================

Introduction
------------

Many storage devices, especially in the consumer market, come with volatile
write back caches.  That means the devices signal I/O completion to the
operating system before data actually has hit the physical medium.  This
behavior obviously speeds up various workloads, but it means the operating
system needs to force data out to the physical medium when it performs
a data integrity operation like fsync, sync or an unmount.

The Linux block layer provides two simple mechanisms that let filesystems
control the caching behavior of the storage device.  These mechanisms are
a forced cache flush, and the Force Unit Access (FUA) flag for requests.

Explicit cache flushes
----------------------

The REQ_FLUSH flag can be ORed into the r/w flags of a bio submitted from the
filesystem and will make sure the volatile cache of the storage device
has been flushed before the actual I/O operation is started.  This explicitly
guarantees that write requests that have completed before the bio was
submitted actually are on the physical medium before this request has started.
In addition the REQ_FLUSH flag can be set on an otherwise empty bio
structure, which causes only an explicit cache flush without any dependent
I/O.  It is recommended to use the blkdev_issue_flush() helper for a pure
cache flush.

Force Unit Access
-----------------

The REQ_FUA flag can be ORed into the r/w flags of a bio submitted from the
filesystem and will make sure that I/O completion for this request is not
signaled before the data has made it to non-volatile storage on the
physical medium.

Implementation details for filesystems
--------------------------------------

Filesystems can simply set the REQ_FLUSH and REQ_FUA bits and do not have to
worry whether the underlying devices need any explicit cache flushing or how
Force Unit Access is implemented.  The REQ_FLUSH and REQ_FUA flags
may both be set on a single bio.

Implementation details for make_request_fn based block drivers
--------------------------------------------------------------

These drivers will always see the REQ_FLUSH and REQ_FUA bits as they sit
directly below the submit_bio interface.  For remapping drivers the REQ_FUA
bit needs to be propagated to underlying devices, and a global flush needs
to be implemented for bios with the REQ_FLUSH bit set.  For real device
drivers that do not have a volatile cache the REQ_FLUSH and REQ_FUA bits
on non-empty bios can simply be ignored, and REQ_FLUSH requests without
data can be completed successfully without doing any work.  Drivers for
devices with volatile caches need to implement the support for these
flags themselves without any help from the block layer.
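
As a rough illustration of the cache-less case described above, the sketch
below models such a driver in plain userspace C.  It is not kernel code:
struct model_bio, the MODEL_* flag values and model_make_request() are
made-up names for this example only, and the real flag bits must be taken
from the kernel headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values for this model, not the kernel's bit positions. */
#define MODEL_REQ_FLUSH	(1u << 0)
#define MODEL_REQ_FUA	(1u << 1)

/* Minimal stand-in for a bio: only the flags and the payload size. */
struct model_bio {
	unsigned int flags;
	size_t size;		/* payload bytes; 0 means an empty bio */
};

enum model_action {
	MODEL_COMPLETE_NOW,	/* complete successfully, no I/O needed */
	MODEL_DO_IO,		/* carry out the transfer as a plain write */
};

/* make_request-style handler for a device without a volatile write cache. */
enum model_action model_make_request(struct model_bio *bio)
{
	assert(bio != NULL);

	/* An empty flush has nothing to flush and nothing to transfer. */
	if ((bio->flags & MODEL_REQ_FLUSH) && bio->size == 0)
		return MODEL_COMPLETE_NOW;

	/*
	 * With no volatile cache every completed write is already on the
	 * medium, so both bits can be ignored on bios that carry data.
	 */
	bio->flags &= ~(MODEL_REQ_FLUSH | MODEL_REQ_FUA);
	return MODEL_DO_IO;
}
```

A caller would end the bio right away on MODEL_COMPLETE_NOW and issue the
write as usual on MODEL_DO_IO.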

Implementation details for request_fn based block drivers
---------------------------------------------------------

For devices that do not support volatile write caches no driver support is
required: the block layer completes empty REQ_FLUSH requests before
entering the driver and strips off the REQ_FLUSH and REQ_FUA bits from
requests that have a payload.  For devices with volatile write caches the
driver needs to tell the block layer that it supports flushing caches by
doing:

	blk_queue_flush(sdkp->disk->queue, REQ_FLUSH);

and handle empty REQ_FLUSH requests in its prep_fn/request_fn.  Note that
REQ_FLUSH requests with a payload are automatically turned into a sequence
of an empty REQ_FLUSH request and the actual write by the block layer.  For
devices that also support the FUA bit the block layer needs to be told to
pass through that bit using:

	blk_queue_flush(sdkp->disk->queue, REQ_FLUSH | REQ_FUA);

and handle write requests that have the REQ_FUA bit set properly in its
prep_fn/request_fn.  If the FUA bit is not natively supported the block
layer turns it into an empty REQ_FLUSH request after the actual write.
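
The rules in this section can be summarized as a small decision function.
The following is only a userspace sketch of the behavior described in the
text, not the block layer implementation: model_flush_steps(), the STEP_*
values and the MODEL_* flag values are invented for this example, and it
assumes that a queue only advertises FUA together with FLUSH.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for this model, not the kernel's bit positions. */
#define MODEL_REQ_FLUSH	(1u << 0)
#define MODEL_REQ_FUA	(1u << 1)

/* What the driver gets to see, as a bitmask listed in issue order. */
#define STEP_PREFLUSH	(1u << 0)	/* empty REQ_FLUSH before the write */
#define STEP_DATA	(1u << 1)	/* the write itself */
#define STEP_DATA_FUA	(1u << 2)	/* the write still carries REQ_FUA */
#define STEP_POSTFLUSH	(1u << 3)	/* empty REQ_FLUSH after the write */

/*
 * queue_caps: what the driver registered (0, FLUSH, or FLUSH | FUA).
 * req_flags:  REQ_FLUSH/REQ_FUA bits set by the filesystem.
 */
unsigned int model_flush_steps(unsigned int queue_caps,
			       unsigned int req_flags, bool has_payload)
{
	unsigned int steps = 0;

	/* A request without payload only makes sense as a pure flush. */
	assert(has_payload || (req_flags & MODEL_REQ_FLUSH));

	/* No volatile cache: empty flushes never reach the driver and
	 * both bits are stripped from requests with a payload. */
	if (!(queue_caps & MODEL_REQ_FLUSH))
		return has_payload ? STEP_DATA : 0;

	if (req_flags & MODEL_REQ_FLUSH)
		steps |= STEP_PREFLUSH;

	if (has_payload) {
		steps |= STEP_DATA;
		if (req_flags & MODEL_REQ_FUA) {
			if (queue_caps & MODEL_REQ_FUA)
				steps |= STEP_DATA_FUA;	/* native FUA */
			else
				steps |= STEP_POSTFLUSH; /* emulated FUA */
		}
	}
	return steps;
}
```

For example, a REQ_FLUSH | REQ_FUA write on a queue that only registered
REQ_FLUSH yields STEP_PREFLUSH | STEP_DATA | STEP_POSTFLUSH, while the same
write on a queue without a volatile cache yields just STEP_DATA.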
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-20 13:22 ` Christoph Hellwig
@ 2010-08-20 15:18   ` Ric Wheeler
  2010-08-20 16:00     ` Chris Mason
  2010-08-23 12:30     ` Tejun Heo
  2010-08-23 12:36   ` Tejun Heo
  1 sibling, 2 replies; 109+ messages in thread
From: Ric Wheeler @ 2010-08-20 15:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tejun Heo, jaxboe, linux-fsdevel, linux-scsi, linux-ide,
	linux-kernel, linux-raid, James.Bottomley, tytso, chris.mason,
	swhiteho, konishi.ryusuke, dm-devel, vst, jack, hare
On 08/20/2010 09:22 AM, Christoph Hellwig wrote:
> FYI: here's a little writeup to document the new cache flushing scheme,
> intended to replace Documentation/block/barriers.txt.  Any good
> suggestion for a filename in the kernel tree?
>
> ---
I was thinking that we might be better off using the "durable writes" term more 
since it is well documented (at least in the database world, where it is the "D" 
in ACID properties).  Maybe "durable_writes_support.txt" ?
>
> Explicit volatile write cache control
> =====================================
>
> Introduction
> ------------
>
> Many storage devices, especially in the consumer market, come with volatile
> write back caches.  That means the devices signal I/O completion to the
> operating system before data actually has hit the physical medium.  This
> behavior obviously speeds up various workloads, but it means the operating
> system needs to force data out to the physical medium when it performs
> a data integrity operation like fsync, sync or an unmount.
>
> The Linux block layer provides a two simple mechanism that lets filesystems
> control the caching behavior of the storage device.  These mechanisms are
> a forced cache flush, and the Force Unit Access (FUA) flag for requests.
>
Should we mention that users can also disable the write cache on the target device?
It might also be worth mentioning that storage needs to be properly configured - 
i.e., an internal hardware RAID card with battery backing can expose 
itself as a writethrough cache *only if* it actually has control over all of the 
backend disks and can flush/disable their write caches.
Maybe that is too much detail, but I know that people have lost data with some 
of these setups.
The rest of the write up below sounds good, thanks for pulling this together!
Ric
>
> Explicit cache flushes
> ----------------------
>
> The REQ_FLUSH flag can be OR ed into the r/w flags of a bio submitted from the
> filesystem and will make sure the volatile cache of the storage device
> has been flushed before the actual I/O operation is started.  The explicit
> guarantees write requests that have completed before the bio was submitted
> actually are on the physical medium before this request has started.
> In addition the REQ_FLUSH flag can be set on an otherwise empty bio
> structure, which causes only an explicit cache flush without any dependent
> I/O.  It is recommend to use the blkdev_issue_flush() helper for a pure
> cache flush.
>
>
> Forced Unit Access
> -----------------
>
> The REQ_FUA flag can be OR ed into the r/w flags of a bio submitted from the
> filesystem and will make sure that I/O completion for this requests is not
> signaled before the data has made it to non-volatile storage on the
> physical medium.
>
>
> Implementation details for filesystems
> --------------------------------------
>
> Filesystem can simply set the REQ_FLUSH and REQ_FUA bits and do not have to
> worry if the underlying devices need any explicit cache flushing and how
> the Forced Unit Access is implemented.  The REQ_FLUSH and REQ_FUA flags
> may both be set on a single bio.
>
>
> Implementation details for make_request_fn based block drivers
> --------------------------------------------------------------
>
> These drivers will always see the REQ_FLUSH and REQ_FUA bits as they sit
> directly below the submit_bio interface.  For remapping drivers the REQ_FUA
> bits needs to be propagate to underlying devices, and a global flush needs
> to be implemented for bios with the REQ_FLUSH bit set.  For real device
> drivers that do not have a volatile cache the REQ_FLUSH and REQ_FUA bits
> on non-empty bios can simply be ignored, and REQ_FLUSH requests without
> data can be completed successfully without doing any work.  Drivers for
> devices with volatile caches need to implement the support for these
> flags themselves without any help from the block layer.
>
>
> Implementation details for request_fn based block drivers
> --------------------------------------------------------------
>
> For devices that do not support volatile write caches there is no driver
> support required, the block layer completes empty REQ_FLUSH requests before
> entering the driver and strips off the REQ_FLUSH and REQ_FUA bits from
> requests that have a payload.  For device with volatile write caches the
> driver needs to tell the block layer that it supports flushing caches by
> doing:
>
> 	blk_queue_flush(sdkp->disk->queue, REQ_FLUSH);
>
> and handle empty REQ_FLUSH requests in it's prep_fn/request_fn.  Note that
> REQ_FLUSH requests with a payload are automatically turned into a sequence
> of empty REQ_FLUSH and the actual write by the block layer.  For devices
> that also support the FUA bit the block layer needs to be told to pass
> through that bit using:
>
> 	blk_queue_flush(sdkp->disk->queue, REQ_FLUSH | REQ_FUA);
>
> and handle write requests that have the REQ_FUA bit set properly in it's
> prep_fn/request_fn.  If the FUA bit is not natively supported the block
> layer turns it into an empty REQ_FLUSH requests after the actual write.
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-20 15:18   ` Ric Wheeler
@ 2010-08-20 16:00     ` Chris Mason
  2010-08-20 16:02       ` Ric Wheeler
  2010-08-23 12:30     ` Tejun Heo
  1 sibling, 1 reply; 109+ messages in thread
From: Chris Mason @ 2010-08-20 16:00 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Christoph Hellwig, Tejun Heo, jaxboe, linux-fsdevel, linux-scsi,
	linux-ide, linux-kernel, linux-raid, James.Bottomley, tytso,
	swhiteho, konishi.ryusuke, dm-devel, vst, jack, hare
On Fri, Aug 20, 2010 at 11:18:07AM -0400, Ric Wheeler wrote:
> On 08/20/2010 09:22 AM, Christoph Hellwig wrote:
> >FYI: here's a little writeup to document the new cache flushing scheme,
> >intended to replace Documentation/block/barriers.txt.  Any good
> >suggestion for a filename in the kernel tree?
> >
> >---
> 
> I was thinking that we might be better off using the "durable
> writes" term more since it is well documented (at least in the
> database world, where it is the "D" in ACID properties).  Maybe
> "durable_writes_support.txt" ?
sata_lies.txt?
Ok, maybe writeback_cache.txt?
-chris
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-20 16:00     ` Chris Mason
@ 2010-08-20 16:02       ` Ric Wheeler
  0 siblings, 0 replies; 109+ messages in thread
From: Ric Wheeler @ 2010-08-20 16:02 UTC (permalink / raw)
  To: Chris Mason, Christoph Hellwig, Tejun Heo, jaxboe, linux-fsdevel,
	linux-scsi
On 08/20/2010 12:00 PM, Chris Mason wrote:
> On Fri, Aug 20, 2010 at 11:18:07AM -0400, Ric Wheeler wrote:
>> On 08/20/2010 09:22 AM, Christoph Hellwig wrote:
>>> FYI: here's a little writeup to document the new cache flushing scheme,
>>> intended to replace Documentation/block/barriers.txt.  Any good
>>> suggestion for a filename in the kernel tree?
>>>
>>> ---
>>
>> I was thinking that we might be better off using the "durable
>> writes" term more since it is well documented (at least in the
>> database world, where it is the "D" in ACID properties).  Maybe
>> "durable_writes_support.txt" ?
>
> sata_lies.txt?
>
> Ok, maybe writeback_cache.txt?
>
> -chris
writeback_cache.txt is certainly the least confusing :)
ric
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-20 15:18   ` Ric Wheeler
  2010-08-20 16:00     ` Chris Mason
@ 2010-08-23 12:30     ` Tejun Heo
  2010-08-23 12:48       ` Christoph Hellwig
  1 sibling, 1 reply; 109+ messages in thread
From: Tejun Heo @ 2010-08-23 12:30 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Christoph Hellwig, jaxboe, linux-fsdevel, linux-scsi, linux-ide,
	linux-kernel, linux-raid, James.Bottomley, tytso, chris.mason,
	swhiteho, konishi.ryusuke, dm-devel, vst, jack, hare
Hello,
On 08/20/2010 05:18 PM, Ric Wheeler wrote:
> On 08/20/2010 09:22 AM, Christoph Hellwig wrote:
>> FYI: here's a little writeup to document the new cache flushing scheme,
>> intended to replace Documentation/block/barriers.txt.  Any good
>> suggestion for a filename in the kernel tree?
>>
> 
> I was thinking that we might be better off using the "durable
> writes" term more since it is well documented (at least in the
> database world, where it is the "D" in ACID properties).  Maybe
> "durable_writes_support.txt" ?
The term is very foreign to people outside of enterprise / database
circles.  writeback-cache.txt or write-cache-control.txt sounds good
enough to me.
>> The Linux block layer provides a two simple mechanism that lets filesystems
>> control the caching behavior of the storage device.  These mechanisms are
>> a forced cache flush, and the Force Unit Access (FUA) flag for requests.
>>
> 
> Should we mention that users can also disable the write cache on the
> target device?
> 
> It might also be worth mentioning that storage needs to be properly
> configured - i.e., an internal hardware RAID card with battery
> backing can expose itself as a writethrough cache *only if* it
> actually has control over all of the backend disks and can
> flush/disable their write caches.
It might be useful to give several example configurations with
different cache configurations.  I don't have much experience with
battery backed arrays but aren't they supposed to report write through
cache automatically?
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-23 12:30     ` Tejun Heo
@ 2010-08-23 12:48       ` Christoph Hellwig
  2010-08-23 13:58         ` Ric Wheeler
  0 siblings, 1 reply; 109+ messages in thread
From: Christoph Hellwig @ 2010-08-23 12:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ric Wheeler, Christoph Hellwig, jaxboe, linux-fsdevel, linux-scsi,
	linux-ide, linux-kernel, linux-raid, James.Bottomley, tytso,
	chris.mason, swhiteho, konishi.ryusuke, dm-devel, vst, jack, hare
On Mon, Aug 23, 2010 at 02:30:33PM +0200, Tejun Heo wrote:
> It might be useful to give several example configurations with
> different cache configurations.  I don't have much experience with
> battery backed arrays but aren't they supposed to report write through
> cache automatically?
They usually do.  I have one that doesn't, but SYNCHRONIZE CACHE on
it is so fast that it effectively must be a no-op.
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-23 12:48       ` Christoph Hellwig
@ 2010-08-23 13:58         ` Ric Wheeler
  2010-08-23 14:01           ` Jens Axboe
  0 siblings, 1 reply; 109+ messages in thread
From: Ric Wheeler @ 2010-08-23 13:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tejun Heo, jaxboe, linux-fsdevel, linux-scsi, linux-ide,
	linux-kernel, linux-raid, James.Bottomley, tytso, chris.mason,
	swhiteho, konishi.ryusuke, dm-devel, vst, jack, hare
On 08/23/2010 08:48 AM, Christoph Hellwig wrote:
> On Mon, Aug 23, 2010 at 02:30:33PM +0200, Tejun Heo wrote:
>> It might be useful to give several example configurations with
>> different cache configurations.  I don't have much experience with
>> battery backed arrays but aren't they supposed to report write through
>> cache automatically?
>
> They usually do.  I have one that doesn't, but SYNCHRONIZE CACHE on
> it is so fast that it effectively must be a no-op.
>
Arrays are not a problem in general - they normally have internal, redundant
batteries to hold up the cache.
The issue is when you have an internal hardware RAID card with a large cache. 
Those cards sit in your server and the batteries on the card protect its 
internal cache, but do not have the capacity to hold up the drives behind it.
Normally, those drives should have their write cache disabled, but sometimes 
(especially with S-ATA disks) this is not done.
ric
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-23 13:58         ` Ric Wheeler
@ 2010-08-23 14:01           ` Jens Axboe
  2010-08-23 14:08             ` Christoph Hellwig
  2010-08-23 15:19             ` Ric Wheeler
  0 siblings, 2 replies; 109+ messages in thread
From: Jens Axboe @ 2010-08-23 14:01 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Christoph Hellwig, Tejun Heo, linux-fsdevel@vger.kernel.org,
	linux-scsi@vger.kernel.org, linux-ide@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	James.Bottomley@suse.de, tytso@mit.edu, chris.mason@oracle.com,
	swhiteho@redhat.com, konishi.ryusuke@lab.ntt.co.jp,
	dm-devel@redhat.com, vst@vlnb.net, jack@suse.cz, hare@suse.de
On 2010-08-23 15:58, Ric Wheeler wrote:
> On 08/23/2010 08:48 AM, Christoph Hellwig wrote:
>> On Mon, Aug 23, 2010 at 02:30:33PM +0200, Tejun Heo wrote:
>>> It might be useful to give several example configurations with
>>> different cache configurations.  I don't have much experience with
>>> battery backed arrays but aren't they supposed to report write through
>>> cache automatically?
>>
>> They usually do.  I have one that doesn't, but SYNCHRONIZE CACHE on
>> it is so fast that it effectively must be a no-op.
>>
> 
> Arrays are not a problem in general - they normally have internal, redundant
> batteries to hold up the cache.
> 
> The issue is when you have an internal hardware RAID card with a large cache. 
> Those cards sit in your server and the batteries on the card protect its 
> internal cache, but do not have the capacity to hold up the drives behind it.
> 
> Normally, those drives should have their write cache disabled, but sometimes 
> (especially with S-ATA disks) this is not done.
The problem purely exists on arrays that report write back cache enabled
AND don't implement SYNC_CACHE as a noop. Do any of them exist, or are
they purely urban legend?
-- 
Jens Axboe
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-23 14:01           ` Jens Axboe
@ 2010-08-23 14:08             ` Christoph Hellwig
  2010-08-23 14:13               ` Tejun Heo
                                 ` (2 more replies)
  2010-08-23 15:19             ` Ric Wheeler
  1 sibling, 3 replies; 109+ messages in thread
From: Christoph Hellwig @ 2010-08-23 14:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ric Wheeler, Christoph Hellwig, Tejun Heo,
	linux-fsdevel@vger.kernel.org, linux-scsi@vger.kernel.org,
	linux-ide@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-raid@vger.kernel.org, James.Bottomley@suse.de,
	tytso@mit.edu, chris.mason@oracle.com, swhiteho@redhat.com,
	konishi.ryusuke@lab.ntt.co.jp, dm-devel@redhat.com, vst@vlnb.net,
	jack@suse.cz, hare@suse.de
On Mon, Aug 23, 2010 at 04:01:15PM +0200, Jens Axboe wrote:
> The problem purely exists on arrays that report write back cache enabled
> AND don't implement SYNC_CACHE as a noop. Do any of them exist, or are
> they purely urban legend?
I haven't seen it.  I don't care particularly about this case, but once
in a while people want to disable flushing for testing or because they
really don't care.
What about adding a sysfs attribute to every request_queue that allows
disabling the cache flushing feature?  Compared to the barrier option
this controls the feature at the right level and makes it available
to everyone instead of being duplicated.  After a while we can then
simply ignore the barrier/nobarrier options.
^ permalink raw reply	[flat|nested] 109+ messages in thread
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-23 14:08             ` Christoph Hellwig
@ 2010-08-23 14:13               ` Tejun Heo
  2010-08-23 14:19                 ` Christoph Hellwig
  2010-08-25 11:31               ` Jens Axboe
  2010-08-30 10:04               ` Hannes Reinecke
  2 siblings, 1 reply; 109+ messages in thread
From: Tejun Heo @ 2010-08-23 14:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Ric Wheeler, linux-fsdevel@vger.kernel.org,
	linux-scsi@vger.kernel.org, linux-ide@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	James.Bottomley@suse.de, tytso@mit.edu, chris.mason@oracle.com,
	swhiteho@redhat.com, konishi.ryusuke@lab.ntt.co.jp,
	dm-devel@redhat.com, vst@vlnb.net, jack@suse.cz, hare@suse.de
Hello,
On 08/23/2010 04:08 PM, Christoph Hellwig wrote:
> On Mon, Aug 23, 2010 at 04:01:15PM +0200, Jens Axboe wrote:
>> The problem purely exists on arrays that report write back cache enabled
>> AND don't implement SYNC_CACHE as a noop. Do any of them exist, or are
>> they purely urban legend?
> 
> I haven't seen it.  I don't care particularly about this case, but once
> in a while people want to disable flushing for testing or because they
> really don't care.
> 
> What about adding a sysfs attribute to every request_queue that allows
> disabling the cache flushing feature?  Compared to the barrier option
> this controls the feature at the right level and makes it available
> to everyone instead of being duplicated.  After a while we can then
> simply ignore the barrier/nobarrier options.
Yeah, that sounds reasonable.  blk_queue_flush() can be called anytime
without locking anyway, so it should be really easy to implement too.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-23 14:13               ` Tejun Heo
@ 2010-08-23 14:19                 ` Christoph Hellwig
  0 siblings, 0 replies; 109+ messages in thread
From: Christoph Hellwig @ 2010-08-23 14:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Hellwig, Jens Axboe, Ric Wheeler,
	linux-fsdevel@vger.kernel.org, linux-scsi@vger.kernel.org,
	linux-ide@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-raid@vger.kernel.org, James.Bottomley@suse.de,
	tytso@mit.edu, chris.mason@oracle.com, swhiteho@redhat.com,
	konishi.ryusuke@lab.ntt.co.jp, dm-devel@redhat.com, vst@vlnb.net,
	jack@suse.cz, hare@suse.de
On Mon, Aug 23, 2010 at 04:13:36PM +0200, Tejun Heo wrote:
> Yeah, that sounds reasonable.  blk_queue_flush() can be called anytime
> without locking anyway, so it should be really easy to implement too.
I don't think we can simply call blk_queue_flush - we must ensure that we
never set more bits than the device allows.  We'll just need two
sets of flags in the request queue, with the sysfs file checking that
it never allows more flags than the driver passed to blk_queue_flush.
I'll prepare a patch for this on top of the current series.
^ permalink raw reply	[flat|nested] 109+ messages in thread 
 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-23 14:08             ` Christoph Hellwig
  2010-08-23 14:13               ` Tejun Heo
@ 2010-08-25 11:31               ` Jens Axboe
  2010-08-30 10:04               ` Hannes Reinecke
  2 siblings, 0 replies; 109+ messages in thread
From: Jens Axboe @ 2010-08-25 11:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ric Wheeler, Tejun Heo, linux-fsdevel@vger.kernel.org,
	linux-scsi@vger.kernel.org, linux-ide@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	James.Bottomley@suse.de, tytso@mit.edu, chris.mason@oracle.com,
	swhiteho@redhat.com, konishi.ryusuke@lab.ntt.co.jp,
	dm-devel@redhat.com, vst@vlnb.net, jack@suse.cz, hare@suse.de
On 2010-08-23 16:08, Christoph Hellwig wrote:
> On Mon, Aug 23, 2010 at 04:01:15PM +0200, Jens Axboe wrote:
>> The problem purely exists on arrays that report write back cache enabled
>> AND don't implement SYNC_CACHE as a noop. Do any of them exist, or are
>> they purely urban legend?
> 
> I haven't seen it.  I don't care particularly about this case, but once
> in a while people want to disable flushing for testing or because they
> really don't care.
> 
> What about adding a sysfs attribute to every request_queue that allows
> disabling the cache flushing feature?  Compared to the barrier option
> this controls the feature at the right level and makes it available
> to everyone instead of being duplicated.  After a while we can then
> simply ignore the barrier/nobarrier options.
Agree, that would be fine.
-- 
Jens Axboe
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-23 14:08             ` Christoph Hellwig
  2010-08-23 14:13               ` Tejun Heo
  2010-08-25 11:31               ` Jens Axboe
@ 2010-08-30 10:04               ` Hannes Reinecke
  2 siblings, 0 replies; 109+ messages in thread
From: Hannes Reinecke @ 2010-08-30 10:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Ric Wheeler, Tejun Heo, linux-fsdevel@vger.kernel.org,
	linux-scsi@vger.kernel.org, linux-ide@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	James.Bottomley@suse.de, tytso@mit.edu, chris.mason@oracle.com,
	swhiteho@redhat.com, konishi.ryusuke@lab.ntt.co.jp,
	dm-devel@redhat.com, vst@vlnb.net, jack@suse.cz
Christoph Hellwig wrote:
> On Mon, Aug 23, 2010 at 04:01:15PM +0200, Jens Axboe wrote:
>> The problem purely exists on arrays that report write back cache enabled
>> AND don't implement SYNC_CACHE as a noop. Do any of them exist, or are
>> they purely urban legend?
> 
> I haven't seen it.  I don't care particularly about this case, but once
> in a while people want to disable flushing for testing or because they
> really don't care.
> 
aacraid for one falls into this category.
SYNC_CACHE is no-oped in the driver. Otherwise you get a _HUGE_
performance loss.
Cheers,
Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
^ permalink raw reply	[flat|nested] 109+ messages in thread 
 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-23 14:01           ` Jens Axboe
  2010-08-23 14:08             ` Christoph Hellwig
@ 2010-08-23 15:19             ` Ric Wheeler
  2010-08-23 16:45               ` Sergey Vlasov
  1 sibling, 1 reply; 109+ messages in thread
From: Ric Wheeler @ 2010-08-23 15:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Tejun Heo, linux-fsdevel@vger.kernel.org,
	linux-scsi@vger.kernel.org, linux-ide@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	James.Bottomley@suse.de, tytso@mit.edu, chris.mason@oracle.com,
	swhiteho@redhat.com, konishi.ryusuke@lab.ntt.co.jp,
	dm-devel@redhat.com, vst@vlnb.net, jack@suse.cz, hare@suse.de
On 08/23/2010 10:01 AM, Jens Axboe wrote:
> On 2010-08-23 15:58, Ric Wheeler wrote:
>> On 08/23/2010 08:48 AM, Christoph Hellwig wrote:
>>> On Mon, Aug 23, 2010 at 02:30:33PM +0200, Tejun Heo wrote:
>>>> It might be useful to give several example configurations with
>>>> different cache configurations.  I don't have much experience with
>>>> battery backed arrays but aren't they supposed to report write through
>>>> cache automatically?
>>>
>>> They usually do.  I have one that doesn't, but SYNCHRONIZE CACHE on
>>> it is so fast that it effectively must be a no-op.
>>>
>>
>> Arrays are not a problem in general - they normally have internal, redundant
>> batteries to hold up the cache.
>>
>> The issue is when you have an internal hardware RAID card with a large cache.
>> Those cards sit in your server and the batteries on the card protect its
>> internal cache, but do not have the capacity to hold up the drives behind it.
>>
>> Normally, those drives should have their write cache disabled, but sometimes
>> (especially with S-ATA disks) this is not done.
>
> The problem purely exists on arrays that report write back cache enabled
> AND don't implement SYNC_CACHE as a noop. Do any of them exist, or are
> they purely urban legend?
>
Hi Jens,
There are actually two distinct problems:
(1) arrays with a non-volatile write cache (battery backed, NVRAM, whatever)
that do not NOOP a SYNC_CACHE command. I know of one brand that seems to do 
this, but it is not a common brand. If we do not issue flushes for write through 
caches, I think that we will avoid this in any case.
(2) hardware raid cards with internal buffer memory and on-card battery backup 
(they sit in your server, disks sit in JBOD-like expansion shelves). These are
fine if the drives in those shelves have write cache disabled.
ric
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-23 15:19             ` Ric Wheeler
@ 2010-08-23 16:45               ` Sergey Vlasov
  2010-08-23 16:49                 ` [dm-devel] " Ric Wheeler
  0 siblings, 1 reply; 109+ messages in thread
From: Sergey Vlasov @ 2010-08-23 16:45 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: jack@suse.cz, linux-scsi@vger.kernel.org, Jens Axboe,
	vst@vlnb.net, linux-kernel@vger.kernel.org, Christoph Hellwig,
	linux-raid@vger.kernel.org, linux-ide@vger.kernel.org,
	dm-devel@redhat.com, James.Bottomley@suse.de,
	konishi.ryusuke@lab.ntt.co.jp, linux-fsdevel@vger.kernel.org,
	tytso@mit.edu, swhiteho@redhat.com, chris.mason@oracle.com,
	Tejun Heo
[-- Attachment #1.1: Type: text/plain, Size: 862 bytes --]
On Mon, Aug 23, 2010 at 11:19:13AM -0400, Ric Wheeler wrote:
[...]
> (2) hardware raid cards with internal buffer memory and on-card battery backup 
> (they sit in your server, disks sit in JBOD-like expansion shelves). These are
> fine if the drives in those shelves have write cache disabled.
Actually, some such cards keep the write cache on the drives enabled and
issue FLUSH CACHE commands to the drives.  E.g., 3ware 9690SA behaves
like this at least with SATA drives (the FLUSH CACHE commands can be
seen after enabling performance monitoring - they often end up in the
"10 commands having the largest latency" table).  This can actually be
safe if the card waits for the FLUSH CACHE completion before making
the write cache data in its battery-backed memory available for reuse
(and the drive implements the FLUSH CACHE command correctly).
[-- Attachment #1.2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
[-- Attachment #2: Type: text/plain, Size: 0 bytes --]
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [dm-devel] [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-23 16:45               ` Sergey Vlasov
@ 2010-08-23 16:49                 ` Ric Wheeler
  0 siblings, 0 replies; 109+ messages in thread
From: Ric Wheeler @ 2010-08-23 16:49 UTC (permalink / raw)
  To: Jens Axboe, tytso@mit.edu, linux-scsi@vger.kernel.org,
	linux-ide@vger.kernel.org
On 08/23/2010 12:45 PM, Sergey Vlasov wrote:
> On Mon, Aug 23, 2010 at 11:19:13AM -0400, Ric Wheeler wrote:
> [...]
>> (2) hardware raid cards with internal buffer memory and on-card battery backup
>> (they sit in your server, disks sit in JBOD-like expansion shelves). These are
>> fine if the drives in those shelves have write cache disabled.
>
> Actually, some such cards keep the write cache on the drives enabled and
> issue FLUSH CACHE commands to the drives.  E.g., 3ware 9690SA behaves
> like this at least with SATA drives (the FLUSH CACHE commands can be
> seen after enabling performance monitoring - they often end up in the
> "10 commands having the largest latency" table).  This can actually be
> safe if the card waits for the FLUSH CACHE completion before making
> the write cache data in its battery-backed memory available for reuse
> (and the drive implements the FLUSH CACHE command correctly).
Yes - this is certainly one way to do it. Note that this will not work if the 
card advertises itself as a write through cache (and we end up not sending down 
the SYNC_CACHE commands).
At least one hardware RAID card (I unfortunately cannot mention the brand) did 
not do this command forwarding.
ric
^ permalink raw reply	[flat|nested] 109+ messages in thread 
 
 
 
 
 
 
 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-20 13:22 ` Christoph Hellwig
  2010-08-20 15:18   ` Ric Wheeler
@ 2010-08-23 12:36   ` Tejun Heo
  2010-08-23 14:05     ` Christoph Hellwig
  1 sibling, 1 reply; 109+ messages in thread
From: Tejun Heo @ 2010-08-23 12:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
Hello,
On 08/20/2010 03:22 PM, Christoph Hellwig wrote:
> Many storage devices, especially in the consumer market, come with volatile
> write back caches.  That means the devices signal I/O completion to the
> operating system before data actually has hit the physical medium.
A bit nit picky but flash devices can also have writeback caches and
the term physical medium sounds a bit off for those cases.  Maybe just
saying "non-volatile media" is better?
> Implementation details for filesystems
> --------------------------------------
> 
> Filesystem can simply set the REQ_FLUSH and REQ_FUA bits and do not have to
> worry if the underlying devices need any explicit cache flushing and how
> the Forced Unit Access is implemented.  The REQ_FLUSH and REQ_FUA flags
> may both be set on a single bio.
It may be worthwhile to explain the sequence of operations when
REQ_FLUSH + data + REQ_FUA is executed.  It can be extrapolated from
the previous two descriptions but I think giving examples of different
sequences depending on FLUSH/FUA configuration would help readers
understand the overall picture of things.
Other than those, looks good to me.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 109+ messages in thread 
- * Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
  2010-08-23 12:36   ` Tejun Heo
@ 2010-08-23 14:05     ` Christoph Hellwig
  0 siblings, 0 replies; 109+ messages in thread
From: Christoph Hellwig @ 2010-08-23 14:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Hellwig, jaxboe, linux-fsdevel, linux-scsi, linux-ide,
	linux-kernel, linux-raid, James.Bottomley, tytso, chris.mason,
	swhiteho, konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
Below is an updated version of the documentation.  It fixes several
typos Zach Brown noticed and replaces all references to a physical
medium with the term non-volatile storage.  I haven't added any examples
yet as I need to figure how they fit into the rest of the document.
---
Explicit volatile write cache control
=====================================
Introduction
------------
Many storage devices, especially in the consumer market, come with volatile
write back caches.  That means the devices signal I/O completion to the
operating system before data actually has hit the non-volatile storage.  This
behavior obviously speeds up various workloads, but it means the operating
system needs to force data out to the non-volatile storage when it performs
a data integrity operation like fsync, sync or an unmount.
The Linux block layer provides two simple mechanisms that let filesystems
control the caching behavior of the storage device.  These mechanisms are
a forced cache flush, and the Force Unit Access (FUA) flag for requests.
Explicit cache flushes
----------------------
The REQ_FLUSH flag can be ORed into the r/w flags of a bio submitted from
the filesystem and will make sure the volatile cache of the storage device
has been flushed before the actual I/O operation is started.  This explicitly
guarantees that previously completed write requests are on non-volatile
storage before the flagged bio starts. In addition the REQ_FLUSH flag can be
set on an otherwise empty bio structure, which causes only an explicit cache
flush without any dependent I/O.  It is recommended to use
the blkdev_issue_flush() helper for a pure cache flush.
Forced Unit Access
------------------
The REQ_FUA flag can be ORed into the r/w flags of a bio submitted from the
filesystem and will make sure that I/O completion for this request is only
signaled after the data has been committed to non-volatile storage.
Implementation details for filesystems
--------------------------------------
Filesystems can simply set the REQ_FLUSH and REQ_FUA bits and do not have to
worry if the underlying devices need any explicit cache flushing and how
the Forced Unit Access is implemented.  The REQ_FLUSH and REQ_FUA flags
may both be set on a single bio.
Implementation details for make_request_fn based block drivers
--------------------------------------------------------------
These drivers will always see the REQ_FLUSH and REQ_FUA bits as they sit
directly below the submit_bio interface.  For remapping drivers the REQ_FUA
bits need to be propagated to underlying devices, and a global flush needs
to be implemented for bios with the REQ_FLUSH bit set.  For real device
drivers that do not have a volatile cache the REQ_FLUSH and REQ_FUA bits
on non-empty bios can simply be ignored, and REQ_FLUSH requests without
data can be completed successfully without doing any work.  Drivers for
devices with volatile caches need to implement the support for these
flags themselves without any help from the block layer.
Implementation details for request_fn based block drivers
---------------------------------------------------------
For devices that do not support volatile write caches no driver support
is required: the block layer completes empty REQ_FLUSH requests before
entering the driver and strips off the REQ_FLUSH and REQ_FUA bits from
requests that have a payload.  For devices with volatile write caches the
driver needs to tell the block layer that it supports flushing caches by
doing:
	blk_queue_flush(sdkp->disk->queue, REQ_FLUSH);
and handle empty REQ_FLUSH requests in its prep_fn/request_fn.  Note that
REQ_FLUSH requests with a payload are automatically turned into a sequence
of empty REQ_FLUSH and the actual write by the block layer.  For devices
that also support the FUA bit the block layer needs to be told to pass
through that bit using:
	blk_queue_flush(sdkp->disk->queue, REQ_FLUSH | REQ_FUA);
and handle write requests that have the REQ_FUA bit set properly in its
prep_fn/request_fn.  If the FUA bit is not natively supported the block
layer turns it into an empty REQ_FLUSH request after the actual write.
^ permalink raw reply	[flat|nested] 109+ messages in thread 
 
 
- * [PATCH] block: simplify queue_next_fseq
  2010-08-12 12:41 [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush Tejun Heo
                   ` (14 preceding siblings ...)
  2010-08-20 13:22 ` Christoph Hellwig
@ 2010-08-23 14:15 ` Christoph Hellwig
  2010-08-23 16:28   ` OT grammar nit " John Robinson
  15 siblings, 1 reply; 109+ messages in thread
From: Christoph Hellwig @ 2010-08-23 14:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, linux-fsdevel, linux-scsi, linux-ide, linux-kernel,
	linux-raid, hch, James.Bottomley, tytso, chris.mason, swhiteho,
	konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
We need to call blk_rq_init and elv_insert for all cases in queue_next_fseq,
so take these calls into common code.  Also move the end_io initialization
from queue_flush into queue_next_fseq and rename queue_flush to
init_flush_request now that it's old name doesn't apply anymore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Index: linux-2.6/block/blk-flush.c
===================================================================
--- linux-2.6.orig/block/blk-flush.c	2010-08-17 15:34:27.864004351 +0200
+++ linux-2.6/block/blk-flush.c	2010-08-17 16:12:53.504253827 +0200
@@ -74,16 +74,11 @@ static void post_flush_end_io(struct req
 	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
 }
 
-static void queue_flush(struct request_queue *q, struct request *rq,
-			rq_end_io_fn *end_io)
+static void init_flush_request(struct request *rq, struct gendisk *disk)
 {
-	blk_rq_init(q, rq);
 	rq->cmd_type = REQ_TYPE_FS;
 	rq->cmd_flags = REQ_FLUSH;
-	rq->rq_disk = q->orig_flush_rq->rq_disk;
-	rq->end_io = end_io;
-
-	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
+	rq->rq_disk = disk;
 }
 
 static struct request *queue_next_fseq(struct request_queue *q)
@@ -91,29 +86,28 @@ static struct request *queue_next_fseq(s
 	struct request *orig_rq = q->orig_flush_rq;
 	struct request *rq = &q->flush_rq;
 
+	blk_rq_init(q, rq);
+
 	switch (blk_flush_cur_seq(q)) {
 	case QUEUE_FSEQ_PREFLUSH:
-		queue_flush(q, rq, pre_flush_end_io);
+		init_flush_request(rq, orig_rq->rq_disk);
+		rq->end_io = pre_flush_end_io;
 		break;
-
 	case QUEUE_FSEQ_DATA:
-		/* initialize proxy request, inherit FLUSH/FUA and queue it */
-		blk_rq_init(q, rq);
 		init_request_from_bio(rq, orig_rq->bio);
 		rq->cmd_flags &= ~(REQ_FLUSH | REQ_FUA);
 		rq->cmd_flags |= orig_rq->cmd_flags & (REQ_FLUSH | REQ_FUA);
 		rq->end_io = flush_data_end_io;
-
-		elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
 		break;
-
 	case QUEUE_FSEQ_POSTFLUSH:
-		queue_flush(q, rq, post_flush_end_io);
+		init_flush_request(rq, orig_rq->rq_disk);
+		rq->end_io = post_flush_end_io;
 		break;
-
 	default:
 		BUG();
 	}
+
+	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
 	return rq;
 }
 
^ permalink raw reply	[flat|nested] 109+ messages in thread
- * OT grammar nit Re: [PATCH] block: simplify queue_next_fseq
  2010-08-23 14:15 ` [PATCH] block: simplify queue_next_fseq Christoph Hellwig
@ 2010-08-23 16:28   ` John Robinson
  0 siblings, 0 replies; 109+ messages in thread
From: John Robinson @ 2010-08-23 16:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tejun Heo, jaxboe, linux-fsdevel, linux-scsi, linux-ide,
	linux-kernel, linux-raid, James.Bottomley, tytso, chris.mason,
	swhiteho, konishi.ryusuke, dm-devel, vst, jack, rwheeler, hare
On 23/08/2010 15:15, Christoph Hellwig wrote:
> We need to call blk_rq_init and elv_insert for all cases in queue_next_fseq,
> so take these calls into common code.  Also move the end_io initialization
> from queue_flush into queue_next_fseq and rename queue_flush to
> init_flush_request now that it's old name doesn't apply anymore.
Nit: it's "its" above, not "it's". If in doubt, if it's "it is" (or "it 
has") it's "it's" but if it could be "his" or "hers" it's "its".
I'm guessing English isn't your first language (a) because of your .de 
address and (b) because it's better than most British people's, but 
still, it's a common mistake. If I can remember any of the German I 
studied all those years ago, "its" is roughly equivalent to "sein", and 
"it's" to "es ist".
Cheers,
John.
^ permalink raw reply	[flat|nested] 109+ messages in thread