* [PATCH 0/6] blk-mq: don't allocate driver tag beforehand for flush rq
From: Ming Lei @ 2017-09-14 16:42 UTC
To: Jens Axboe, linux-block; +Cc: Omar Sandoval, Ming Lei
Hi,
This patchset avoids allocating a driver tag beforehand for the
flush rq when an I/O scheduler is in use. The flush rq is then
no longer treated specially wrt. get/put of the driver tag, and
the code gets cleaned up quite a bit: reorder_tags_to_front()
is removed, and we no longer need to worry about request order
in the dispatch list to avoid I/O deadlock.
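
A simplified sketch of the resulting flow (function names as in
the patches below; the flush rq now takes the same path as any
normal request and only picks up a driver tag at dispatch time):

	blk_mq_make_request()
		blk_insert_flush(rq);		/* inserted without a driver tag */
		blk_mq_run_hw_queue(hctx, true);
	...
	blk_mq_dispatch_rq_list()
		blk_mq_get_driver_tag(rq, &hctx, false);	/* tag allocated here */
		/* ->queue_rq() */
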
'dbench -t 30 -s -F 64' has been run on different devices
(shared tag, multi-queue, single queue, ...) and no issues were
observed; even when the test is run at a very low queue depth
(1), dbench still works well.
Any comments are welcome!
Thanks,
Ming
Ming Lei (6):
blk-flush: don't run queue for requests of bypassing flush
block: pass 'run_queue' to blk_mq_request_bypass_insert
blk-flush: use blk_mq_request_bypass_insert()
blk-mq: decide how to handle flush rq via RQF_FLUSH_SEQ
blk-mq: move __blk_mq_put_driver_tag() into blk-mq.h
blk-mq: don't allocate driver tag beforehand for flush rq
block/blk-core.c | 2 +-
block/blk-flush.c | 43 +++++++++++++++++++++++++++++------
block/blk-mq-sched.c | 64 ++++++++++++++--------------------------------------
block/blk-mq.c | 55 ++++++--------------------------------------
block/blk-mq.h | 15 +++++++++++-
5 files changed, 75 insertions(+), 104 deletions(-)
--
2.9.5
* [PATCH 1/6] blk-flush: don't run queue for requests of bypassing flush
From: Ming Lei @ 2017-09-14 16:42 UTC
To: Jens Axboe, linux-block; +Cc: Omar Sandoval, Ming Lei
blk_insert_flush() should only insert the request, since a queue
run always follows it.
For the case of bypassing the flush machinery, we therefore don't
need to run the queue here: every blk_insert_flush() is already
followed by one queue run.
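
For reference, this is the caller pattern the change relies on,
as it ends up looking in blk_mq_make_request() by the end of this
series:

	blk_insert_flush(rq);			/* insert only */
	blk_mq_run_hw_queue(data.hctx, true);	/* the caller runs the queue */
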
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-flush.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 4938bec8cfef..81bd1a843043 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -463,7 +463,7 @@ void blk_insert_flush(struct request *rq)
if ((policy & REQ_FSEQ_DATA) &&
!(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
if (q->mq_ops)
- blk_mq_sched_insert_request(rq, false, true, false, false);
+ blk_mq_sched_insert_request(rq, false, false, false, false);
else
list_add_tail(&rq->queuelist, &q->queue_head);
return;
--
2.9.5
* [PATCH 2/6] block: pass 'run_queue' to blk_mq_request_bypass_insert
From: Ming Lei @ 2017-09-14 16:42 UTC
To: Jens Axboe, linux-block; +Cc: Omar Sandoval, Ming Lei
The block flush code needs this function without the subsequent
queue run, so introduce a 'run_queue' parameter.
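
Callers then choose whether the queue gets kicked:

	blk_mq_request_bypass_insert(rq, true);		/* insert and run the queue */
	blk_mq_request_bypass_insert(rq, false);	/* insert only; a queue run follows elsewhere */
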
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-core.c | 2 +-
block/blk-mq.c | 5 +++--
block/blk-mq.h | 2 +-
3 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index aebe676225e6..bb9ec0013582 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2347,7 +2347,7 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
* bypass a potential scheduler on the bottom device for
* insert.
*/
- blk_mq_request_bypass_insert(rq);
+ blk_mq_request_bypass_insert(rq, true);
return BLK_STS_OK;
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 98a18609755e..21331e785919 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1405,7 +1405,7 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
* Should only be used carefully, when the caller knows we want to
* bypass a potential IO scheduler on the target device.
*/
-void blk_mq_request_bypass_insert(struct request *rq)
+void blk_mq_request_bypass_insert(struct request *rq, bool run_queue)
{
struct blk_mq_ctx *ctx = rq->mq_ctx;
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
@@ -1414,7 +1414,8 @@ void blk_mq_request_bypass_insert(struct request *rq)
list_add_tail(&rq->queuelist, &hctx->dispatch);
spin_unlock(&hctx->lock);
- blk_mq_run_hw_queue(hctx, false);
+ if (run_queue)
+ blk_mq_run_hw_queue(hctx, false);
}
void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
diff --git a/block/blk-mq.h b/block/blk-mq.h
index ef15b3414da5..038bdeb84629 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -54,7 +54,7 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
*/
void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
bool at_head);
-void blk_mq_request_bypass_insert(struct request *rq);
+void blk_mq_request_bypass_insert(struct request *rq, bool run_queue);
void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
struct list_head *list);
--
2.9.5
* [PATCH 3/6] blk-flush: use blk_mq_request_bypass_insert()
From: Ming Lei @ 2017-09-14 16:42 UTC
To: Jens Axboe, linux-block; +Cc: Omar Sandoval, Ming Lei
In the following patch, we will use RQF_FLUSH_SEQ to decide:
- if the flag isn't set, the flush rq needs to be inserted
  via blk_insert_flush()
- otherwise, the flush rq needs to be dispatched directly,
  since it is in the flush machinery now.
So use blk_mq_request_bypass_insert() for requests that bypass
the flush machinery, just as the legacy path does.
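
In other words, the rule the next patch implements is roughly:

	if (!(rq->rq_flags & RQF_FLUSH_SEQ) && op_is_flush(rq->cmd_flags)) {
		/* not in the flush machinery yet */
		blk_insert_flush(rq);
	} else {
		/* already sequenced: send straight to hctx->dispatch */
		blk_mq_sched_bypass_insert(hctx, !!e, rq);
	}
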
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-flush.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 81bd1a843043..a9773d2075ac 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -463,7 +463,7 @@ void blk_insert_flush(struct request *rq)
if ((policy & REQ_FSEQ_DATA) &&
!(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
if (q->mq_ops)
- blk_mq_sched_insert_request(rq, false, false, false, false);
+ blk_mq_request_bypass_insert(rq, false);
else
list_add_tail(&rq->queuelist, &q->queue_head);
return;
--
2.9.5
* [PATCH 4/6] blk-mq: decide how to handle flush rq via RQF_FLUSH_SEQ
From: Ming Lei @ 2017-09-14 16:42 UTC
To: Jens Axboe, linux-block; +Cc: Omar Sandoval, Ming Lei
Currently we always preallocate one driver tag before calling
blk_insert_flush(), and the flush request is marked with
RQF_FLUSH_SEQ once it enters the flush machinery.
So if RQF_FLUSH_SEQ isn't set, call blk_insert_flush() to handle
the request; otherwise, dispatch the flush request directly to
the ->dispatch list.
This is a preparation patch for not preallocating a driver tag
for the flush rq, and for removing the special case.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq-sched.c | 30 ++++++++++++++++--------------
1 file changed, 16 insertions(+), 14 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 4ab69435708c..b6399d99438e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -260,21 +260,22 @@ void blk_mq_sched_request_inserted(struct request *rq)
EXPORT_SYMBOL_GPL(blk_mq_sched_request_inserted);
static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
+ bool has_sched,
struct request *rq)
{
- if (rq->tag == -1) {
- rq->rq_flags |= RQF_SORTED;
- return false;
+ /* dispatch flush rq directly */
+ if (rq->rq_flags & RQF_FLUSH_SEQ) {
+ spin_lock(&hctx->lock);
+ list_add(&rq->queuelist, &hctx->dispatch);
+ spin_unlock(&hctx->lock);
+ return true;
}
- /*
- * If we already have a real request tag, send directly to
- * the dispatch list.
- */
- spin_lock(&hctx->lock);
- list_add(&rq->queuelist, &hctx->dispatch);
- spin_unlock(&hctx->lock);
- return true;
+ WARN_ON(rq->tag != -1);
+ if (has_sched)
+ rq->rq_flags |= RQF_SORTED;
+
+ return false;
}
/**
@@ -362,12 +363,13 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
struct blk_mq_ctx *ctx = rq->mq_ctx;
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
- if (rq->tag == -1 && op_is_flush(rq->cmd_flags)) {
+ /* flush rq in flush machinery need to be dispatched directly */
+ if (!(rq->rq_flags & RQF_FLUSH_SEQ) && op_is_flush(rq->cmd_flags)) {
blk_mq_sched_insert_flush(hctx, rq, can_block);
return;
}
- if (e && blk_mq_sched_bypass_insert(hctx, rq))
+ if (blk_mq_sched_bypass_insert(hctx, !!e, rq))
goto run;
if (e && e->type->ops.mq.insert_requests) {
@@ -405,7 +407,7 @@ void blk_mq_sched_insert_requests(struct request_queue *q,
list_for_each_entry_safe(rq, next, list, queuelist) {
if (WARN_ON_ONCE(rq->tag != -1)) {
list_del_init(&rq->queuelist);
- blk_mq_sched_bypass_insert(hctx, rq);
+ blk_mq_sched_bypass_insert(hctx, true, rq);
}
}
}
--
2.9.5
* [PATCH 5/6] blk-mq: move __blk_mq_put_driver_tag() into blk-mq.h
From: Ming Lei @ 2017-09-14 16:42 UTC
To: Jens Axboe, linux-block; +Cc: Omar Sandoval, Ming Lei
We need this helper to put the driver tag of the flush rq, since
we will no longer share a tag across the flush request sequence
when an I/O scheduler is in use.
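
The next patch pairs it with a queue restart when dropping the
flush rq's driver tag, roughly:

	__blk_mq_put_driver_tag(hctx, rq);
	blk_mq_sched_restart(hctx);
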
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 12 ------------
block/blk-mq.h | 13 +++++++++++++
2 files changed, 13 insertions(+), 12 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 21331e785919..46d561e26774 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -920,18 +920,6 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
return rq->tag != -1;
}
-static void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
- struct request *rq)
-{
- blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
- rq->tag = -1;
-
- if (rq->rq_flags & RQF_MQ_INFLIGHT) {
- rq->rq_flags &= ~RQF_MQ_INFLIGHT;
- atomic_dec(&hctx->nr_active);
- }
-}
-
static void blk_mq_put_driver_tag_hctx(struct blk_mq_hw_ctx *hctx,
struct request *rq)
{
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 038bdeb84629..383f1624c34f 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -2,6 +2,7 @@
#define INT_BLK_MQ_H
#include "blk-stat.h"
+#include "blk-mq-tag.h"
struct blk_mq_tag_set;
@@ -137,4 +138,16 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part,
unsigned int inflight[2]);
+static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
+ struct request *rq)
+{
+ blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
+ rq->tag = -1;
+
+ if (rq->rq_flags & RQF_MQ_INFLIGHT) {
+ rq->rq_flags &= ~RQF_MQ_INFLIGHT;
+ atomic_dec(&hctx->nr_active);
+ }
+}
+
#endif
--
2.9.5
* [PATCH 6/6] blk-mq: don't allocate driver tag beforehand for flush rq
From: Ming Lei @ 2017-09-14 16:42 UTC
To: Jens Axboe, linux-block; +Cc: Omar Sandoval, Ming Lei
The idea behind this is simple:
1) with the 'none' scheduler, the driver tag has to be borrowed
for the flush rq, otherwise we may run out of tags and cause an
I/O hang. get/put of the driver tag is effectively a nop there,
so reordering tags isn't necessary at all.
2) with a real I/O scheduler, we don't need to allocate the
driver tag beforehand for the flush rq; it works just fine to
follow the same path as normal requests: allocate the driver tag
for each rq just before calling .queue_rq().
One visible change to drivers is that the driver tag isn't
shared across the flush request sequence; that shouldn't be a
problem, since the legacy path has always worked that way.
The flush rq is then no longer treated specially wrt. get/put
of the driver tag, and the code gets cleaned up quite a bit:
reorder_tags_to_front() is removed, and we no longer need to
worry about request order in the dispatch list to avoid I/O
deadlock.
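
Condensed, the tag-borrowing rule in blk_kick_flush() becomes
(see the diff below):

	if (!q->elevator) {
		/* none: borrow the data rq's driver tag, as before */
		flush_rq->tag = first_rq->tag;
	} else {
		/*
		 * I/O scheduler: borrow only the scheduler tag; the
		 * driver tag is allocated just before ->queue_rq() and
		 * released again via put_flush_driver_tag()
		 */
		flush_rq->internal_tag = first_rq->internal_tag;
	}
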
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-flush.c | 41 +++++++++++++++++++++++++++++++++++------
block/blk-mq-sched.c | 36 ++----------------------------------
block/blk-mq.c | 38 ++++----------------------------------
3 files changed, 41 insertions(+), 74 deletions(-)
diff --git a/block/blk-flush.c b/block/blk-flush.c
index a9773d2075ac..b091e298b97a 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -216,6 +216,17 @@ static bool blk_flush_complete_seq(struct request *rq,
return kicked | queued;
}
+/*
+ * We don't share tag between flush rq and data rq in case of
+ * IO scheduler, so have to release tag and restart queue
+ */
+static void put_flush_driver_tag(struct blk_mq_hw_ctx *hctx,
+ struct request *rq)
+{
+ __blk_mq_put_driver_tag(hctx, rq);
+ blk_mq_sched_restart(hctx);
+}
+
static void flush_end_io(struct request *flush_rq, blk_status_t error)
{
struct request_queue *q = flush_rq->q;
@@ -231,8 +242,13 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
/* release the tag's ownership to the req cloned from */
spin_lock_irqsave(&fq->mq_flush_lock, flags);
hctx = blk_mq_map_queue(q, flush_rq->mq_ctx->cpu);
- blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
- flush_rq->tag = -1;
+ if (!q->elevator) {
+ blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
+ flush_rq->tag = -1;
+ } else {
+ put_flush_driver_tag(hctx, flush_rq);
+ flush_rq->internal_tag = -1;
+ }
}
running = &fq->flush_queue[fq->flush_running_idx];
@@ -321,16 +337,24 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq)
* Borrow tag from the first request since they can't
* be in flight at the same time. And acquire the tag's
* ownership for flush req.
+ *
+ * In case of io scheduler, flush rq need to borrow
+ * scheduler tag for making put/get driver tag workable.
+ * In case of none, flush rq need to borrow driver tag.
*/
if (q->mq_ops) {
struct blk_mq_hw_ctx *hctx;
flush_rq->mq_ctx = first_rq->mq_ctx;
- flush_rq->tag = first_rq->tag;
- fq->orig_rq = first_rq;
- hctx = blk_mq_map_queue(q, first_rq->mq_ctx->cpu);
- blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq);
+ if (!q->elevator) {
+ fq->orig_rq = first_rq;
+ flush_rq->tag = first_rq->tag;
+ hctx = blk_mq_map_queue(q, first_rq->mq_ctx->cpu);
+ blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq);
+ } else {
+ flush_rq->internal_tag = first_rq->internal_tag;
+ }
}
flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
@@ -394,6 +418,11 @@ static void mq_flush_data_end_io(struct request *rq, blk_status_t error)
hctx = blk_mq_map_queue(q, ctx->cpu);
+ if (q->elevator) {
+ WARN_ON(rq->tag < 0);
+ put_flush_driver_tag(hctx, rq);
+ }
+
/*
* After populating an empty queue, kick it to avoid stall. Read
* the comment in flush_end_io().
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index b6399d99438e..d2c675c089b3 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -340,21 +340,6 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
}
}
-/*
- * Add flush/fua to the queue. If we fail getting a driver tag, then
- * punt to the requeue list. Requeue will re-invoke us from a context
- * that's safe to block from.
- */
-static void blk_mq_sched_insert_flush(struct blk_mq_hw_ctx *hctx,
- struct request *rq, bool can_block)
-{
- if (blk_mq_get_driver_tag(rq, &hctx, can_block)) {
- blk_insert_flush(rq);
- blk_mq_run_hw_queue(hctx, true);
- } else
- blk_mq_add_to_requeue_list(rq, false, true);
-}
-
void blk_mq_sched_insert_request(struct request *rq, bool at_head,
bool run_queue, bool async, bool can_block)
{
@@ -365,8 +350,8 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
/* flush rq in flush machinery need to be dispatched directly */
if (!(rq->rq_flags & RQF_FLUSH_SEQ) && op_is_flush(rq->cmd_flags)) {
- blk_mq_sched_insert_flush(hctx, rq, can_block);
- return;
+ blk_insert_flush(rq);
+ goto run;
}
if (blk_mq_sched_bypass_insert(hctx, !!e, rq))
@@ -395,23 +380,6 @@ void blk_mq_sched_insert_requests(struct request_queue *q,
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
struct elevator_queue *e = hctx->queue->elevator;
- if (e) {
- struct request *rq, *next;
-
- /*
- * We bypass requests that already have a driver tag assigned,
- * which should only be flushes. Flushes are only ever inserted
- * as single requests, so we shouldn't ever hit the
- * WARN_ON_ONCE() below (but let's handle it just in case).
- */
- list_for_each_entry_safe(rq, next, list, queuelist) {
- if (WARN_ON_ONCE(rq->tag != -1)) {
- list_del_init(&rq->queuelist);
- blk_mq_sched_bypass_insert(hctx, true, rq);
- }
- }
- }
-
if (e && e->type->ops.mq.insert_requests)
e->type->ops.mq.insert_requests(hctx, list, false);
else
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 46d561e26774..68c165f8f0f8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -940,30 +940,6 @@ static void blk_mq_put_driver_tag(struct request *rq)
__blk_mq_put_driver_tag(hctx, rq);
}
-/*
- * If we fail getting a driver tag because all the driver tags are already
- * assigned and on the dispatch list, BUT the first entry does not have a
- * tag, then we could deadlock. For that case, move entries with assigned
- * driver tags to the front, leaving the set of tagged requests in the
- * same order, and the untagged set in the same order.
- */
-static bool reorder_tags_to_front(struct list_head *list)
-{
- struct request *rq, *tmp, *first = NULL;
-
- list_for_each_entry_safe_reverse(rq, tmp, list, queuelist) {
- if (rq == first)
- break;
- if (rq->tag != -1) {
- list_move(&rq->queuelist, list);
- if (!first)
- first = rq;
- }
- }
-
- return first != NULL;
-}
-
static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode, int flags,
void *key)
{
@@ -1021,9 +997,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
rq = list_first_entry(list, struct request, queuelist);
if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
- if (!queued && reorder_tags_to_front(list))
- continue;
-
/*
* The initial allocation attempt failed, so we need to
* rerun the hardware queue when a tag is freed.
@@ -1630,13 +1603,10 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
if (unlikely(is_flush_fua)) {
blk_mq_put_ctx(data.ctx);
blk_mq_bio_to_request(rq, bio);
- if (q->elevator) {
- blk_mq_sched_insert_request(rq, false, true, true,
- true);
- } else {
- blk_insert_flush(rq);
- blk_mq_run_hw_queue(data.hctx, true);
- }
+
+ /* bypass scheduler for flush rq */
+ blk_insert_flush(rq);
+ blk_mq_run_hw_queue(data.hctx, true);
} else if (plug && q->nr_hw_queues == 1) {
struct request *last = NULL;
--
2.9.5
* Re: [PATCH 0/6] blk-mq: don't allocate driver tag beforehand for flush rq
From: Jens Axboe @ 2017-09-14 18:51 UTC
To: Ming Lei, linux-block; +Cc: Omar Sandoval
On 09/14/2017 10:42 AM, Ming Lei wrote:
> Hi,
>
> This patchset avoids allocating a driver tag beforehand for the
> flush rq when an I/O scheduler is in use. The flush rq is then
> no longer treated specially wrt. get/put of the driver tag, and
> the code gets cleaned up quite a bit: reorder_tags_to_front()
> is removed, and we no longer need to worry about request order
> in the dispatch list to avoid I/O deadlock.
>
> 'dbench -t 30 -s -F 64' has been run on different devices
> (shared tag, multi-queue, single queue, ...) and no issues were
> observed; even when the test is run at a very low queue depth
> (1), dbench still works well.
Gave this a quick spin on the test box, and I get tons of spewage
on booting up:
[ 9.131290] WARNING: CPU: 2 PID: 337 at block/blk-mq-sched.c:274 blk_mq_sched_insert_request+0x15d/0x170
[ 9.131290] Modules linked in: sd_mod igb(+) ahci libahci i2c_algo_bit libata dca nvme nvme_core
[ 9.131295] CPU: 2 PID: 337 Comm: kworker/u129:1 Tainted: G W 4.13.0+ #472
[ 9.131295] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
[ 9.131298] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[ 9.131299] task: ffff881ff7940e00 task.stack: ffff881ff7960000
[ 9.131301] RIP: 0010:blk_mq_sched_insert_request+0x15d/0x170
[ 9.131301] RSP: 0018:ffff881ff79639c8 EFLAGS: 00010217
[ 9.131302] RAX: ffff881feeb30000 RBX: ffff881feebd3f00 RCX: 0000000000002000
[ 9.131303] RDX: ffff881ff31c1800 RSI: 0000000000000000 RDI: ffff881feebd3f00
[ 9.131303] RBP: ffff881ff7963a10 R08: 0000000000000000 R09: 0000000000000008
[ 9.131304] R10: 0000000000001000 R11: 0000000000000422 R12: ffff881ff348c400
[ 9.131305] R13: 0000000000000000 R14: 0000000000000001 R15: ffffe8dfffe4a540
[ 9.131305] FS: 0000000000000000(0000) GS:ffff881fff640000(0000) knlGS:0000000000000000
[ 9.131306] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9.131307] CR2: 00007f5675e76b60 CR3: 0000001ff832e002 CR4: 00000000003606e0
[ 9.131308] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9.131308] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9.131308] Call Trace:
[ 9.131311] ? bio_alloc_bioset+0x179/0x1d0
[ 9.131314] blk_execute_rq_nowait+0x68/0xe0
[ 9.131316] blk_execute_rq+0x53/0x90
[ 9.131318] __nvme_submit_sync_cmd+0xa2/0xf0 [nvme_core]
[ 9.131320] nvme_identify_ns.isra.32+0x6b/0xa0 [nvme_core]
[ 9.131323] nvme_revalidate_disk+0x7c/0x130 [nvme_core]
[ 9.131324] rescan_partitions+0x80/0x350
[ 9.131325] ? rescan_partitions+0x80/0x350
[ 9.131327] ? down_write+0x1b/0x50
[ 9.131331] __blkdev_get+0x277/0x3f0
[ 9.131332] ? _raw_spin_unlock+0x9/0x20
[ 9.131334] blkdev_get+0x11e/0x320
[ 9.131335] ? bdget+0x11d/0x140
[ 9.131337] device_add_disk+0x3e0/0x430
[ 9.131340] ? __might_sleep+0x45/0x80
[ 9.131342] nvme_validate_ns+0x302/0x560 [nvme_core]
[ 9.131344] nvme_scan_work+0x7c/0x2f0 [nvme_core]
[ 9.131346] ? try_to_wake_up+0x45/0x430
[ 9.131348] process_one_work+0x18a/0x3e0
[ 9.131349] worker_thread+0x48/0x3b0
[ 9.131351] kthread+0x12a/0x140
[ 9.131352] ? process_one_work+0x3e0/0x3e0
[ 9.131353] ? kthread_create_on_node+0x40/0x40
[ 9.131355] ret_from_fork+0x22/0x30
[ 9.131356] Code: c0 4c 89 fa e8 e5 97 02 00 48 8b 4d c0 84 c0 74 10 49 89 5f 08 4c 89 3b 48 89 4b 08 49 89 5c 24 18 4c 89 e7 e8 55 12 37 00 eb 93 <0f> ff e9 fd fe ff ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 55 49
[ 9.131374] ---[ end trace 199de228942af254 ]---
On top of that, there's a lot of spewage before that that I can't
even decipher since it's all intermingled, but it looks like a
lock imbalance issue.
This is on top of current master, fwiw.
--
Jens Axboe
* Re: [PATCH 0/6] blk-mq: don't allocate driver tag beforehand for flush rq
From: Ming Lei @ 2017-09-15 2:20 UTC
To: Jens Axboe; +Cc: linux-block, Omar Sandoval
On Thu, Sep 14, 2017 at 12:51:24PM -0600, Jens Axboe wrote:
> On 09/14/2017 10:42 AM, Ming Lei wrote:
> > Hi,
> >
> > This patchset avoids allocating a driver tag beforehand for the
> > flush rq when an I/O scheduler is in use. The flush rq is then
> > no longer treated specially wrt. get/put of the driver tag, and
> > the code gets cleaned up quite a bit: reorder_tags_to_front()
> > is removed, and we no longer need to worry about request order
> > in the dispatch list to avoid I/O deadlock.
> >
> > 'dbench -t 30 -s -F 64' has been run on different devices
> > (shared tag, multi-queue, single queue, ...) and no issues were
> > observed; even when the test is run at a very low queue depth
> > (1), dbench still works well.
>
> Gave this a quick spin on the test box, and I get tons of spewage
> on booting up:
>
> [ 9.131290] WARNING: CPU: 2 PID: 337 at block/blk-mq-sched.c:274 blk_mq_sched_insert_request+0x15d/0x170
Sorry, my fault.
The WARN_ON() was actually meant to be inside 'if (has_sched)'.
Could you please move it there (or just remove the WARN_ON()
from blk_mq_sched_bypass_insert()) and see if it works?
Thanks,
Ming
* Re: [PATCH 0/6] blk-mq: don't allocate driver tag beforehand for flush rq
From: Jens Axboe @ 2017-09-15 14:29 UTC
To: Ming Lei; +Cc: linux-block, Omar Sandoval
On 09/14/2017 08:20 PM, Ming Lei wrote:
> On Thu, Sep 14, 2017 at 12:51:24PM -0600, Jens Axboe wrote:
>> On 09/14/2017 10:42 AM, Ming Lei wrote:
>>> Hi,
>>>
>>> This patchset avoids allocating a driver tag beforehand for the
>>> flush rq when an I/O scheduler is in use. The flush rq is then
>>> no longer treated specially wrt. get/put of the driver tag, and
>>> the code gets cleaned up quite a bit: reorder_tags_to_front()
>>> is removed, and we no longer need to worry about request order
>>> in the dispatch list to avoid I/O deadlock.
>>>
>>> 'dbench -t 30 -s -F 64' has been run on different devices
>>> (shared tag, multi-queue, single queue, ...) and no issues were
>>> observed; even when the test is run at a very low queue depth
>>> (1), dbench still works well.
>>
>> Gave this a quick spin on the test box, and I get tons of spewage
>> on booting up:
>>
>> [ 9.131290] WARNING: CPU: 2 PID: 337 at block/blk-mq-sched.c:274 blk_mq_sched_insert_request+0x15d/0x170
>
> Sorry, my fault.
>
> The WARN_ON() was actually meant to be inside 'if (has_sched)'.
> Could you please move it there (or just remove the WARN_ON()
> from blk_mq_sched_bypass_insert()) and see if it works?
Putting it inside 'has_sched' makes it go away. I'll fire up some
testing of it here.
--
Jens Axboe
* Re: [PATCH 0/6] blk-mq: don't allocate driver tag beforehand for flush rq
From: Jens Axboe @ 2017-09-15 18:06 UTC
To: Ming Lei; +Cc: linux-block, Omar Sandoval
On 09/15/2017 08:29 AM, Jens Axboe wrote:
> On 09/14/2017 08:20 PM, Ming Lei wrote:
>> On Thu, Sep 14, 2017 at 12:51:24PM -0600, Jens Axboe wrote:
>>> On 09/14/2017 10:42 AM, Ming Lei wrote:
>>>> Hi,
>>>>
>>>> This patchset avoids allocating a driver tag beforehand for the
>>>> flush rq when an I/O scheduler is in use. The flush rq is then
>>>> no longer treated specially wrt. get/put of the driver tag, and
>>>> the code gets cleaned up quite a bit: reorder_tags_to_front()
>>>> is removed, and we no longer need to worry about request order
>>>> in the dispatch list to avoid I/O deadlock.
>>>>
>>>> 'dbench -t 30 -s -F 64' has been run on different devices
>>>> (shared tag, multi-queue, single queue, ...) and no issues were
>>>> observed; even when the test is run at a very low queue depth
>>>> (1), dbench still works well.
>>>
>>> Gave this a quick spin on the test box, and I get tons of spewage
>>> on booting up:
>>>
>>> [ 9.131290] WARNING: CPU: 2 PID: 337 at block/blk-mq-sched.c:274 blk_mq_sched_insert_request+0x15d/0x170
>>
>> Sorry, my fault.
>>
>> The WARN_ON() was actually meant to be inside 'if (has_sched)'.
>> Could you please move it there (or just remove the WARN_ON()
>> from blk_mq_sched_bypass_insert()) and see if it works?
>
> Putting it inside 'has_sched' makes it go away. I'll fire up some
> testing of it here.
It still triggers for the requeue path, however:
[12403.280753] WARNING: CPU: 4 PID: 2279 at block/blk-mq-sched.c:275 blk_mq_sched_insert_request+0
[12403.302465] Modules linked in: null_blk configfs crct10dif_generic crct10dif_common loop dm_mo]
[12403.331762] CPU: 4 PID: 2279 Comm: kworker/4:1H Tainted: G W 4.13.0+ #473
[12403.341497] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
[12403.350745] Workqueue: kblockd blk_mq_requeue_work
[12403.356574] task: ffff881ff6c57000 task.stack: ffff881ff2acc000
[12403.363667] RIP: 0010:blk_mq_sched_insert_request+0x15d/0x170
[12403.370565] RSP: 0018:ffff881ff2acfdc0 EFLAGS: 00010213
[12403.376878] RAX: ffff881ff2868000 RBX: ffff880c7677be00 RCX: 0000000000022003
[12403.385333] RDX: ffff881ff32d8c00 RSI: 0000000000000001 RDI: ffff880c7677be00
[12403.393790] RBP: ffff881ff2acfe08 R08: 0000000000000001 R09: ffff8809010d4000
[12403.402240] R10: 0000000000000000 R11: 0000000000001000 R12: ffff881ff37b3800
[12403.410695] R13: 0000000000000000 R14: 0000000000000000 R15: ffffe8dfffe86940
[12403.419150] FS: 0000000000000000(0000) GS:ffff881fff680000(0000) knlGS:0000000000000000
[12403.429065] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12403.435961] CR2: 00007f13f4ce1000 CR3: 0000001ff357d002 CR4: 00000000003606e0
[12403.444414] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[12403.452868] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[12403.461316] Call Trace:
[12403.464528] ? __blk_mq_delay_run_hw_queue+0x84/0xa0
[12403.470550] blk_mq_requeue_work+0xc6/0x140
[12403.475702] process_one_work+0x18a/0x3e0
[12403.480654] worker_thread+0x48/0x3b0
[12403.485313] kthread+0x12a/0x140
[12403.489390] ? process_one_work+0x3e0/0x3e0
[12403.494545] ? kthread_create_on_node+0x40/0x40
[12403.500078] ret_from_fork+0x22/0x30
[12403.504551] Code: c0 4c 89 fa e8 e5 97 02 00 48 8b 4d c0 84 c0 74 10 49 89 5f 08 4c 89 3b 48 8
[12403.526954] ---[ end trace 2eefb804292867b5 ]---
The code now looks like this:
	if (has_sched) {
		WARN_ON(rq->tag != -1);
		rq->rq_flags |= RQF_SORTED;
	}
--
Jens Axboe
* Re: [PATCH 0/6] blk-mq: don't allocate driver tag beforehand for flush rq
From: Ming Lei @ 2017-09-16 1:55 UTC
To: Jens Axboe; +Cc: linux-block, Omar Sandoval
On Fri, Sep 15, 2017 at 12:06:41PM -0600, Jens Axboe wrote:
> On 09/15/2017 08:29 AM, Jens Axboe wrote:
> > On 09/14/2017 08:20 PM, Ming Lei wrote:
> >> On Thu, Sep 14, 2017 at 12:51:24PM -0600, Jens Axboe wrote:
> >>> On 09/14/2017 10:42 AM, Ming Lei wrote:
> >>>> Hi,
> >>>>
> >>>> This patchset avoids allocating a driver tag beforehand for the
> >>>> flush rq when an I/O scheduler is in use. The flush rq is then
> >>>> no longer treated specially wrt. get/put of the driver tag, and
> >>>> the code gets cleaned up quite a bit: reorder_tags_to_front()
> >>>> is removed, and we no longer need to worry about request order
> >>>> in the dispatch list to avoid I/O deadlock.
> >>>>
> >>>> 'dbench -t 30 -s -F 64' has been run on different devices
> >>>> (shared tag, multi-queue, single queue, ...) and no issues were
> >>>> observed; even when the test is run at a very low queue depth
> >>>> (1), dbench still works well.
> >>>
> >>> Gave this a quick spin on the test box, and I get tons of spewage
> >>> on booting up:
> >>>
> >>> [ 9.131290] WARNING: CPU: 2 PID: 337 at block/blk-mq-sched.c:274 blk_mq_sched_insert_request+0x15d/0x170
> >>
> >> Sorry, my fault.
> >>
> >> The WARN_ON() was actually meant to be inside 'if (has_sched)'.
> >> Could you please move it there (or just remove the WARN_ON()
> >> from blk_mq_sched_bypass_insert()) and see if it works?
> >
> > Putting it inside 'has_sched' makes it go away. I'll fire up some
> > testing of it here.
>
> It still triggers for the requeue path, however:
>
> [12403.280753] WARNING: CPU: 4 PID: 2279 at block/blk-mq-sched.c:275 blk_mq_sched_insert_request+0
> [12403.302465] Modules linked in: null_blk configfs crct10dif_generic crct10dif_common loop dm_mo]
> [12403.331762] CPU: 4 PID: 2279 Comm: kworker/4:1H Tainted: G W 4.13.0+ #473
> [12403.341497] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
> [12403.350745] Workqueue: kblockd blk_mq_requeue_work
> [12403.356574] task: ffff881ff6c57000 task.stack: ffff881ff2acc000
> [12403.363667] RIP: 0010:blk_mq_sched_insert_request+0x15d/0x170
> [12403.370565] RSP: 0018:ffff881ff2acfdc0 EFLAGS: 00010213
> [12403.376878] RAX: ffff881ff2868000 RBX: ffff880c7677be00 RCX: 0000000000022003
> [12403.385333] RDX: ffff881ff32d8c00 RSI: 0000000000000001 RDI: ffff880c7677be00
> [12403.393790] RBP: ffff881ff2acfe08 R08: 0000000000000001 R09: ffff8809010d4000
> [12403.402240] R10: 0000000000000000 R11: 0000000000001000 R12: ffff881ff37b3800
> [12403.410695] R13: 0000000000000000 R14: 0000000000000000 R15: ffffe8dfffe86940
> [12403.419150] FS: 0000000000000000(0000) GS:ffff881fff680000(0000) knlGS:0000000000000000
> [12403.429065] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [12403.435961] CR2: 00007f13f4ce1000 CR3: 0000001ff357d002 CR4: 00000000003606e0
> [12403.444414] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [12403.452868] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [12403.461316] Call Trace:
> [12403.464528] ? __blk_mq_delay_run_hw_queue+0x84/0xa0
> [12403.470550] blk_mq_requeue_work+0xc6/0x140
> [12403.475702] process_one_work+0x18a/0x3e0
> [12403.480654] worker_thread+0x48/0x3b0
> [12403.485313] kthread+0x12a/0x140
> [12403.489390] ? process_one_work+0x3e0/0x3e0
> [12403.494545] ? kthread_create_on_node+0x40/0x40
> [12403.500078] ret_from_fork+0x22/0x30
> [12403.504551] Code: c0 4c 89 fa e8 e5 97 02 00 48 8b 4d c0 84 c0 74 10 49 89 5f 08 4c 89 3b 48 8
> [12403.526954] ---[ end trace 2eefb804292867b5 ]---
>
> The code now looks like this:
>
> 	if (has_sched) {
> 		WARN_ON(rq->tag != -1);
> 		rq->rq_flags |= RQF_SORTED;
> 	}
Yeah, requeue is one case I missed: the request may still have
a valid driver tag assigned at that point if the requeue happens
in the completion path.
So I think we need to release the driver tag in
__blk_mq_requeue_request().
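
Something along these lines at the start of
__blk_mq_requeue_request() might do it (untested sketch):

	/*
	 * Drop any driver tag still held, so a requeued flush rq
	 * reaches blk_mq_sched_insert_request() with rq->tag == -1.
	 */
	blk_mq_put_driver_tag(rq);
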
--
Ming