* [PATCH 0/6] blk-mq: don't allocate driver tag beforehand for flush rq
From: Ming Lei @ 2017-09-14 16:42 UTC
To: Jens Axboe, linux-block; +Cc: Omar Sandoval, Ming Lei
Hi,
This patchset avoids allocating a driver tag beforehand for the
flush rq when an I/O scheduler is in use. The flush rq is then
no longer treated specially wrt. get/put of the driver tag, and
the code gets cleaned up quite a bit: reorder_tags_to_front()
is removed, and we no longer need to worry about request order
in the dispatch list to avoid I/O deadlock.
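
A simplified sketch of the resulting flow (function names as in
the patches below; the flush rq now takes the same path as any
normal request and only picks up a driver tag at dispatch time):

	blk_mq_make_request()
		blk_insert_flush(rq);		/* inserted without a driver tag */
		blk_mq_run_hw_queue(hctx, true);
	...
	blk_mq_dispatch_rq_list()
		blk_mq_get_driver_tag(rq, &hctx, false);	/* tag allocated here */
		/* ->queue_rq() */
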
'dbench -t 30 -s -F 64' has been run on different devices
(shared tag, multi-queue, single queue, ...) and no issues were
observed; even when the test is run at a very low queue depth
(1), dbench still works well.
Any comments are welcome!
Thanks,
Ming
Ming Lei (6):
blk-flush: don't run queue for requests of bypassing flush
block: pass 'run_queue' to blk_mq_request_bypass_insert
blk-flush: use blk_mq_request_bypass_insert()
blk-mq: decide how to handle flush rq via RQF_FLUSH_SEQ
blk-mq: move __blk_mq_put_driver_tag() into blk-mq.h
blk-mq: don't allocate driver tag beforehand for flush rq
block/blk-core.c | 2 +-
block/blk-flush.c | 43 +++++++++++++++++++++++++++++------
block/blk-mq-sched.c | 64 ++++++++++++++--------------------------------------
block/blk-mq.c | 55 ++++++--------------------------------------
block/blk-mq.h | 15 +++++++++++-
5 files changed, 75 insertions(+), 104 deletions(-)
--
2.9.5
* [PATCH 1/6] blk-flush: don't run queue for requests of bypassing flush
From: Ming Lei @ 2017-09-14 16:42 UTC
To: Jens Axboe, linux-block; +Cc: Omar Sandoval, Ming Lei
blk_insert_flush() should only insert the request, since a queue
run always follows it.
For the case of bypassing the flush machinery, we therefore don't
need to run the queue here: every blk_insert_flush() is already
followed by one queue run.
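
For reference, this is the caller pattern the change relies on,
as it ends up looking in blk_mq_make_request() by the end of this
series:

	blk_insert_flush(rq);			/* insert only */
	blk_mq_run_hw_queue(data.hctx, true);	/* the caller runs the queue */
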
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-flush.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 4938bec8cfef..81bd1a843043 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -463,7 +463,7 @@ void blk_insert_flush(struct request *rq)
if ((policy & REQ_FSEQ_DATA) &&
!(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
if (q->mq_ops)
- blk_mq_sched_insert_request(rq, false, true, false, false);
+ blk_mq_sched_insert_request(rq, false, false, false, false);
else
list_add_tail(&rq->queuelist, &q->queue_head);
return;
--
2.9.5
* [PATCH 2/6] block: pass 'run_queue' to blk_mq_request_bypass_insert
From: Ming Lei @ 2017-09-14 16:42 UTC
To: Jens Axboe, linux-block; +Cc: Omar Sandoval, Ming Lei
The block flush code needs this function without the subsequent
queue run, so introduce a 'run_queue' parameter.
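
Callers then choose whether the queue gets kicked:

	blk_mq_request_bypass_insert(rq, true);		/* insert and run the queue */
	blk_mq_request_bypass_insert(rq, false);	/* insert only; a queue run follows elsewhere */
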
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-core.c | 2 +-
block/blk-mq.c | 5 +++--
block/blk-mq.h | 2 +-
3 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index aebe676225e6..bb9ec0013582 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2347,7 +2347,7 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
* bypass a potential scheduler on the bottom device for
* insert.
*/
- blk_mq_request_bypass_insert(rq);
+ blk_mq_request_bypass_insert(rq, true);
return BLK_STS_OK;
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 98a18609755e..21331e785919 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1405,7 +1405,7 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
* Should only be used carefully, when the caller knows we want to
* bypass a potential IO scheduler on the target device.
*/
-void blk_mq_request_bypass_insert(struct request *rq)
+void blk_mq_request_bypass_insert(struct request *rq, bool run_queue)
{
struct blk_mq_ctx *ctx = rq->mq_ctx;
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
@@ -1414,7 +1414,8 @@ void blk_mq_request_bypass_insert(struct request *rq)
list_add_tail(&rq->queuelist, &hctx->dispatch);
spin_unlock(&hctx->lock);
- blk_mq_run_hw_queue(hctx, false);
+ if (run_queue)
+ blk_mq_run_hw_queue(hctx, false);
}
void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
diff --git a/block/blk-mq.h b/block/blk-mq.h
index ef15b3414da5..038bdeb84629 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -54,7 +54,7 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
*/
void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
bool at_head);
-void blk_mq_request_bypass_insert(struct request *rq);
+void blk_mq_request_bypass_insert(struct request *rq, bool run_queue);
void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
struct list_head *list);
--
2.9.5
* [PATCH 3/6] blk-flush: use blk_mq_request_bypass_insert()
From: Ming Lei @ 2017-09-14 16:42 UTC
To: Jens Axboe, linux-block; +Cc: Omar Sandoval, Ming Lei
In the following patch, we will use RQF_FLUSH_SEQ to decide:
- if the flag isn't set, the flush rq needs to be inserted
  via blk_insert_flush()
- otherwise, the flush rq needs to be dispatched directly,
  since it is in the flush machinery now.
So use blk_mq_request_bypass_insert() for requests that bypass
the flush machinery, just as the legacy path does.
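
In other words, the rule the next patch implements is roughly:

	if (!(rq->rq_flags & RQF_FLUSH_SEQ) && op_is_flush(rq->cmd_flags)) {
		/* not in the flush machinery yet */
		blk_insert_flush(rq);
	} else {
		/* already sequenced: send straight to hctx->dispatch */
		blk_mq_sched_bypass_insert(hctx, !!e, rq);
	}
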
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-flush.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 81bd1a843043..a9773d2075ac 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -463,7 +463,7 @@ void blk_insert_flush(struct request *rq)
if ((policy & REQ_FSEQ_DATA) &&
!(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
if (q->mq_ops)
- blk_mq_sched_insert_request(rq, false, false, false, false);
+ blk_mq_request_bypass_insert(rq, false);
else
list_add_tail(&rq->queuelist, &q->queue_head);
return;
--
2.9.5
* [PATCH 4/6] blk-mq: decide how to handle flush rq via RQF_FLUSH_SEQ
From: Ming Lei @ 2017-09-14 16:42 UTC
To: Jens Axboe, linux-block; +Cc: Omar Sandoval, Ming Lei
Currently we always preallocate one driver tag before calling
blk_insert_flush(), and the flush request is marked with
RQF_FLUSH_SEQ once it enters the flush machinery.
So if RQF_FLUSH_SEQ isn't set, call blk_insert_flush() to handle
the request; otherwise, dispatch the flush request directly to
the ->dispatch list.
This is a preparation patch for not preallocating a driver tag
for the flush rq, and for removing the special case.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq-sched.c | 30 ++++++++++++++++--------------
1 file changed, 16 insertions(+), 14 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 4ab69435708c..b6399d99438e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -260,21 +260,22 @@ void blk_mq_sched_request_inserted(struct request *rq)
EXPORT_SYMBOL_GPL(blk_mq_sched_request_inserted);
static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
+ bool has_sched,
struct request *rq)
{
- if (rq->tag == -1) {
- rq->rq_flags |= RQF_SORTED;
- return false;
+ /* dispatch flush rq directly */
+ if (rq->rq_flags & RQF_FLUSH_SEQ) {
+ spin_lock(&hctx->lock);
+ list_add(&rq->queuelist, &hctx->dispatch);
+ spin_unlock(&hctx->lock);
+ return true;
}
- /*
- * If we already have a real request tag, send directly to
- * the dispatch list.
- */
- spin_lock(&hctx->lock);
- list_add(&rq->queuelist, &hctx->dispatch);
- spin_unlock(&hctx->lock);
- return true;
+ WARN_ON(rq->tag != -1);
+ if (has_sched)
+ rq->rq_flags |= RQF_SORTED;
+
+ return false;
}
/**
@@ -362,12 +363,13 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
struct blk_mq_ctx *ctx = rq->mq_ctx;
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
- if (rq->tag == -1 && op_is_flush(rq->cmd_flags)) {
+ /* flush rq in flush machinery need to be dispatched directly */
+ if (!(rq->rq_flags & RQF_FLUSH_SEQ) && op_is_flush(rq->cmd_flags)) {
blk_mq_sched_insert_flush(hctx, rq, can_block);
return;
}
- if (e && blk_mq_sched_bypass_insert(hctx, rq))
+ if (blk_mq_sched_bypass_insert(hctx, !!e, rq))
goto run;
if (e && e->type->ops.mq.insert_requests) {
@@ -405,7 +407,7 @@ void blk_mq_sched_insert_requests(struct request_queue *q,
list_for_each_entry_safe(rq, next, list, queuelist) {
if (WARN_ON_ONCE(rq->tag != -1)) {
list_del_init(&rq->queuelist);
- blk_mq_sched_bypass_insert(hctx, rq);
+ blk_mq_sched_bypass_insert(hctx, true, rq);
}
}
}
--
2.9.5
* [PATCH 5/6] blk-mq: move __blk_mq_put_driver_tag() into blk-mq.h
From: Ming Lei @ 2017-09-14 16:42 UTC
To: Jens Axboe, linux-block; +Cc: Omar Sandoval, Ming Lei
We need this helper to put the driver tag of the flush rq, since
we will no longer share a tag across the flush request sequence
when an I/O scheduler is in use.
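
The next patch pairs it with a queue restart when dropping the
flush rq's driver tag, roughly:

	__blk_mq_put_driver_tag(hctx, rq);
	blk_mq_sched_restart(hctx);
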
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 12 ------------
block/blk-mq.h | 13 +++++++++++++
2 files changed, 13 insertions(+), 12 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 21331e785919..46d561e26774 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -920,18 +920,6 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
return rq->tag != -1;
}
-static void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
- struct request *rq)
-{
- blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
- rq->tag = -1;
-
- if (rq->rq_flags & RQF_MQ_INFLIGHT) {
- rq->rq_flags &= ~RQF_MQ_INFLIGHT;
- atomic_dec(&hctx->nr_active);
- }
-}
-
static void blk_mq_put_driver_tag_hctx(struct blk_mq_hw_ctx *hctx,
struct request *rq)
{
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 038bdeb84629..383f1624c34f 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -2,6 +2,7 @@
#define INT_BLK_MQ_H
#include "blk-stat.h"
+#include "blk-mq-tag.h"
struct blk_mq_tag_set;
@@ -137,4 +138,16 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part,
unsigned int inflight[2]);
+static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
+ struct request *rq)
+{
+ blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
+ rq->tag = -1;
+
+ if (rq->rq_flags & RQF_MQ_INFLIGHT) {
+ rq->rq_flags &= ~RQF_MQ_INFLIGHT;
+ atomic_dec(&hctx->nr_active);
+ }
+}
+
#endif
--
2.9.5
* [PATCH 6/6] blk-mq: don't allocate driver tag beforehand for flush rq
From: Ming Lei @ 2017-09-14 16:42 UTC
To: Jens Axboe, linux-block; +Cc: Omar Sandoval, Ming Lei
The idea behind this is simple:
1) with the 'none' scheduler, the driver tag has to be borrowed
for the flush rq, otherwise we may run out of tags and cause an
I/O hang. get/put of the driver tag is effectively a nop there,
so reordering tags isn't necessary at all.
2) with a real I/O scheduler, we don't need to allocate the
driver tag beforehand for the flush rq; it works just fine to
follow the same path as normal requests: allocate the driver tag
for each rq just before calling .queue_rq().
One visible change to drivers is that the driver tag isn't
shared across the flush request sequence; that shouldn't be a
problem, since the legacy path has always worked that way.
The flush rq is then no longer treated specially wrt. get/put
of the driver tag, and the code gets cleaned up quite a bit:
reorder_tags_to_front() is removed, and we no longer need to
worry about request order in the dispatch list to avoid I/O
deadlock.
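
Condensed, the tag-borrowing rule in blk_kick_flush() becomes
(see the diff below):

	if (!q->elevator) {
		/* none: borrow the data rq's driver tag, as before */
		flush_rq->tag = first_rq->tag;
	} else {
		/*
		 * I/O scheduler: borrow only the scheduler tag; the
		 * driver tag is allocated just before ->queue_rq() and
		 * released again via put_flush_driver_tag()
		 */
		flush_rq->internal_tag = first_rq->internal_tag;
	}
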
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-flush.c | 41 +++++++++++++++++++++++++++++++++++------
block/blk-mq-sched.c | 36 ++----------------------------------
block/blk-mq.c | 38 ++++----------------------------------
3 files changed, 41 insertions(+), 74 deletions(-)
diff --git a/block/blk-flush.c b/block/blk-flush.c
index a9773d2075ac..b091e298b97a 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -216,6 +216,17 @@ static bool blk_flush_complete_seq(struct request *rq,
return kicked | queued;
}
+/*
+ * We don't share tag between flush rq and data rq in case of
+ * IO scheduler, so have to release tag and restart queue
+ */
+static void put_flush_driver_tag(struct blk_mq_hw_ctx *hctx,
+ struct request *rq)
+{
+ __blk_mq_put_driver_tag(hctx, rq);
+ blk_mq_sched_restart(hctx);
+}
+
static void flush_end_io(struct request *flush_rq, blk_status_t error)
{
struct request_queue *q = flush_rq->q;
@@ -231,8 +242,13 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
/* release the tag's ownership to the req cloned from */
spin_lock_irqsave(&fq->mq_flush_lock, flags);
hctx = blk_mq_map_queue(q, flush_rq->mq_ctx->cpu);
- blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
- flush_rq->tag = -1;
+ if (!q->elevator) {
+ blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
+ flush_rq->tag = -1;
+ } else {
+ put_flush_driver_tag(hctx, flush_rq);
+ flush_rq->internal_tag = -1;
+ }
}
running = &fq->flush_queue[fq->flush_running_idx];
@@ -321,16 +337,24 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq)
* Borrow tag from the first request since they can't
* be in flight at the same time. And acquire the tag's
* ownership for flush req.
+ *
+ * In case of io scheduler, flush rq need to borrow
+ * scheduler tag for making put/get driver tag workable.
+ * In case of none, flush rq need to borrow driver tag.
*/
if (q->mq_ops) {
struct blk_mq_hw_ctx *hctx;
flush_rq->mq_ctx = first_rq->mq_ctx;
- flush_rq->tag = first_rq->tag;
- fq->orig_rq = first_rq;
- hctx = blk_mq_map_queue(q, first_rq->mq_ctx->cpu);
- blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq);
+ if (!q->elevator) {
+ fq->orig_rq = first_rq;
+ flush_rq->tag = first_rq->tag;
+ hctx = blk_mq_map_queue(q, first_rq->mq_ctx->cpu);
+ blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq);
+ } else {
+ flush_rq->internal_tag = first_rq->internal_tag;
+ }
}
flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
@@ -394,6 +418,11 @@ static void mq_flush_data_end_io(struct request *rq, blk_status_t error)
hctx = blk_mq_map_queue(q, ctx->cpu);
+ if (q->elevator) {
+ WARN_ON(rq->tag < 0);
+ put_flush_driver_tag(hctx, rq);
+ }
+
/*
* After populating an empty queue, kick it to avoid stall. Read
* the comment in flush_end_io().
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index b6399d99438e..d2c675c089b3 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -340,21 +340,6 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
}
}
-/*
- * Add flush/fua to the queue. If we fail getting a driver tag, then
- * punt to the requeue list. Requeue will re-invoke us from a context
- * that's safe to block from.
- */
-static void blk_mq_sched_insert_flush(struct blk_mq_hw_ctx *hctx,
- struct request *rq, bool can_block)
-{
- if (blk_mq_get_driver_tag(rq, &hctx, can_block)) {
- blk_insert_flush(rq);
- blk_mq_run_hw_queue(hctx, true);
- } else
- blk_mq_add_to_requeue_list(rq, false, true);
-}
-
void blk_mq_sched_insert_request(struct request *rq, bool at_head,
bool run_queue, bool async, bool can_block)
{
@@ -365,8 +350,8 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
/* flush rq in flush machinery need to be dispatched directly */
if (!(rq->rq_flags & RQF_FLUSH_SEQ) && op_is_flush(rq->cmd_flags)) {
- blk_mq_sched_insert_flush(hctx, rq, can_block);
- return;
+ blk_insert_flush(rq);
+ goto run;
}
if (blk_mq_sched_bypass_insert(hctx, !!e, rq))
@@ -395,23 +380,6 @@ void blk_mq_sched_insert_requests(struct request_queue *q,
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
struct elevator_queue *e = hctx->queue->elevator;
- if (e) {
- struct request *rq, *next;
-
- /*
- * We bypass requests that already have a driver tag assigned,
- * which should only be flushes. Flushes are only ever inserted
- * as single requests, so we shouldn't ever hit the
- * WARN_ON_ONCE() below (but let's handle it just in case).
- */
- list_for_each_entry_safe(rq, next, list, queuelist) {
- if (WARN_ON_ONCE(rq->tag != -1)) {
- list_del_init(&rq->queuelist);
- blk_mq_sched_bypass_insert(hctx, true, rq);
- }
- }
- }
-
if (e && e->type->ops.mq.insert_requests)
e->type->ops.mq.insert_requests(hctx, list, false);
else
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 46d561e26774..68c165f8f0f8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -940,30 +940,6 @@ static void blk_mq_put_driver_tag(struct request *rq)
__blk_mq_put_driver_tag(hctx, rq);
}
-/*
- * If we fail getting a driver tag because all the driver tags are already
- * assigned and on the dispatch list, BUT the first entry does not have a
- * tag, then we could deadlock. For that case, move entries with assigned
- * driver tags to the front, leaving the set of tagged requests in the
- * same order, and the untagged set in the same order.
- */
-static bool reorder_tags_to_front(struct list_head *list)
-{
- struct request *rq, *tmp, *first = NULL;
-
- list_for_each_entry_safe_reverse(rq, tmp, list, queuelist) {
- if (rq == first)
- break;
- if (rq->tag != -1) {
- list_move(&rq->queuelist, list);
- if (!first)
- first = rq;
- }
- }
-
- return first != NULL;
-}
-
static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode, int flags,
void *key)
{
@@ -1021,9 +997,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
rq = list_first_entry(list, struct request, queuelist);
if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
- if (!queued && reorder_tags_to_front(list))
- continue;
-
/*
* The initial allocation attempt failed, so we need to
* rerun the hardware queue when a tag is freed.
@@ -1630,13 +1603,10 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
if (unlikely(is_flush_fua)) {
blk_mq_put_ctx(data.ctx);
blk_mq_bio_to_request(rq, bio);
- if (q->elevator) {
- blk_mq_sched_insert_request(rq, false, true, true,
- true);
- } else {
- blk_insert_flush(rq);
- blk_mq_run_hw_queue(data.hctx, true);
- }
+
+ /* bypass scheduler for flush rq */
+ blk_insert_flush(rq);
+ blk_mq_run_hw_queue(data.hctx, true);
} else if (plug && q->nr_hw_queues == 1) {
struct request *last = NULL;
--
2.9.5
* Re: [PATCH 0/6] blk-mq: don't allocate driver tag beforehand for flush rq
From: Jens Axboe @ 2017-09-14 18:51 UTC
To: Ming Lei, linux-block; +Cc: Omar Sandoval
On 09/14/2017 10:42 AM, Ming Lei wrote:
> Hi,
>
> This patchset avoids allocating a driver tag beforehand for the
> flush rq when an I/O scheduler is in use. The flush rq is then
> no longer treated specially wrt. get/put of the driver tag, and
> the code gets cleaned up quite a bit: reorder_tags_to_front()
> is removed, and we no longer need to worry about request order
> in the dispatch list to avoid I/O deadlock.
>
> 'dbench -t 30 -s -F 64' has been run on different devices
> (shared tag, multi-queue, single queue, ...) and no issues were
> observed; even when the test is run at a very low queue depth
> (1), dbench still works well.
Gave this a quick spin on the test box, and I get tons of spewage
on booting up:
[ 9.131290] WARNING: CPU: 2 PID: 337 at block/blk-mq-sched.c:274 blk_mq_sched_insert_request+0x15d/0x170
[ 9.131290] Modules linked in: sd_mod igb(+) ahci libahci i2c_algo_bit libata dca nvme nvme_core
[ 9.131295] CPU: 2 PID: 337 Comm: kworker/u129:1 Tainted: G W 4.13.0+ #472
[ 9.131295] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
[ 9.131298] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[ 9.131299] task: ffff881ff7940e00 task.stack: ffff881ff7960000
[ 9.131301] RIP: 0010:blk_mq_sched_insert_request+0x15d/0x170
[ 9.131301] RSP: 0018:ffff881ff79639c8 EFLAGS: 00010217
[ 9.131302] RAX: ffff881feeb30000 RBX: ffff881feebd3f00 RCX: 0000000000002000
[ 9.131303] RDX: ffff881ff31c1800 RSI: 0000000000000000 RDI: ffff881feebd3f00
[ 9.131303] RBP: ffff881ff7963a10 R08: 0000000000000000 R09: 0000000000000008
[ 9.131304] R10: 0000000000001000 R11: 0000000000000422 R12: ffff881ff348c400
[ 9.131305] R13: 0000000000000000 R14: 0000000000000001 R15: ffffe8dfffe4a540
[ 9.131305] FS: 0000000000000000(0000) GS:ffff881fff640000(0000) knlGS:0000000000000000
[ 9.131306] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9.131307] CR2: 00007f5675e76b60 CR3: 0000001ff832e002 CR4: 00000000003606e0
[ 9.131308] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9.131308] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9.131308] Call Trace:
[ 9.131311] ? bio_alloc_bioset+0x179/0x1d0
[ 9.131314] blk_execute_rq_nowait+0x68/0xe0
[ 9.131316] blk_execute_rq+0x53/0x90
[ 9.131318] __nvme_submit_sync_cmd+0xa2/0xf0 [nvme_core]
[ 9.131320] nvme_identify_ns.isra.32+0x6b/0xa0 [nvme_core]
[ 9.131323] nvme_revalidate_disk+0x7c/0x130 [nvme_core]
[ 9.131324] rescan_partitions+0x80/0x350
[ 9.131325] ? rescan_partitions+0x80/0x350
[ 9.131327] ? down_write+0x1b/0x50
[ 9.131331] __blkdev_get+0x277/0x3f0
[ 9.131332] ? _raw_spin_unlock+0x9/0x20
[ 9.131334] blkdev_get+0x11e/0x320
[ 9.131335] ? bdget+0x11d/0x140
[ 9.131337] device_add_disk+0x3e0/0x430
[ 9.131340] ? __might_sleep+0x45/0x80
[ 9.131342] nvme_validate_ns+0x302/0x560 [nvme_core]
[ 9.131344] nvme_scan_work+0x7c/0x2f0 [nvme_core]
[ 9.131346] ? try_to_wake_up+0x45/0x430
[ 9.131348] process_one_work+0x18a/0x3e0
[ 9.131349] worker_thread+0x48/0x3b0
[ 9.131351] kthread+0x12a/0x140
[ 9.131352] ? process_one_work+0x3e0/0x3e0
[ 9.131353] ? kthread_create_on_node+0x40/0x40
[ 9.131355] ret_from_fork+0x22/0x30
[ 9.131356] Code: c0 4c 89 fa e8 e5 97 02 00 48 8b 4d c0 84 c0 74 10 49 89 5f 08 4c 89 3b 48 89 4b 08 49 89 5c 24 18 4c 89 e7 e8 55 12 37 00 eb 93 <0f> ff e9 fd fe ff ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 55 49
[ 9.131374] ---[ end trace 199de228942af254 ]---
On top of that, there's a lot of spewage before that that I can't
even decipher since it's all intermingled, but it looks like a
lock imbalance issue.
This is on top of current master, fwiw.
--
Jens Axboe
* Re: [PATCH 0/6] blk-mq: don't allocate driver tag beforehand for flush rq
From: Ming Lei @ 2017-09-15 2:20 UTC
To: Jens Axboe; +Cc: linux-block, Omar Sandoval
On Thu, Sep 14, 2017 at 12:51:24PM -0600, Jens Axboe wrote:
> On 09/14/2017 10:42 AM, Ming Lei wrote:
> > Hi,
> >
> > This patchset avoids allocating a driver tag beforehand for the
> > flush rq when an I/O scheduler is in use. The flush rq is then
> > no longer treated specially wrt. get/put of the driver tag, and
> > the code gets cleaned up quite a bit: reorder_tags_to_front()
> > is removed, and we no longer need to worry about request order
> > in the dispatch list to avoid I/O deadlock.
> >
> > 'dbench -t 30 -s -F 64' has been run on different devices
> > (shared tag, multi-queue, single queue, ...) and no issues were
> > observed; even when the test is run at a very low queue depth
> > (1), dbench still works well.
>
> Gave this a quick spin on the test box, and I get tons of spewage
> on booting up:
>
> [ 9.131290] WARNING: CPU: 2 PID: 337 at block/blk-mq-sched.c:274 blk_mq_sched_insert_request+0x15d/0x170
Sorry, my fault.
The WARN_ON() was actually meant to be inside 'if (has_sched)'.
Could you please move it there (or just remove the WARN_ON()
from blk_mq_sched_bypass_insert()) and see if it works?
Thanks,
Ming
* Re: [PATCH 0/6] blk-mq: don't allocate driver tag beforehand for flush rq
From: Jens Axboe @ 2017-09-15 14:29 UTC
To: Ming Lei; +Cc: linux-block, Omar Sandoval
On 09/14/2017 08:20 PM, Ming Lei wrote:
> On Thu, Sep 14, 2017 at 12:51:24PM -0600, Jens Axboe wrote:
>> On 09/14/2017 10:42 AM, Ming Lei wrote:
>>> Hi,
>>>
>>> This patchset avoids allocating a driver tag beforehand for the
>>> flush rq when an I/O scheduler is in use. The flush rq is then
>>> no longer treated specially wrt. get/put of the driver tag, and
>>> the code gets cleaned up quite a bit: reorder_tags_to_front()
>>> is removed, and we no longer need to worry about request order
>>> in the dispatch list to avoid I/O deadlock.
>>>
>>> 'dbench -t 30 -s -F 64' has been run on different devices
>>> (shared tag, multi-queue, single queue, ...) and no issues were
>>> observed; even when the test is run at a very low queue depth
>>> (1), dbench still works well.
>>
>> Gave this a quick spin on the test box, and I get tons of spewage
>> on booting up:
>>
>> [ 9.131290] WARNING: CPU: 2 PID: 337 at block/blk-mq-sched.c:274 blk_mq_sched_insert_request+0x15d/0x170
>
> Sorry, my fault.
>
> The WARN_ON() was actually meant to be inside 'if (has_sched)'.
> Could you please move it there (or just remove the WARN_ON()
> from blk_mq_sched_bypass_insert()) and see if it works?
Putting it inside 'has_sched' makes it go away. I'll fire up some
testing of it here.
--
Jens Axboe
* Re: [PATCH 0/6] blk-mq: don't allocate driver tag beforehand for flush rq
From: Jens Axboe @ 2017-09-15 18:06 UTC
To: Ming Lei; +Cc: linux-block, Omar Sandoval
On 09/15/2017 08:29 AM, Jens Axboe wrote:
> On 09/14/2017 08:20 PM, Ming Lei wrote:
>> On Thu, Sep 14, 2017 at 12:51:24PM -0600, Jens Axboe wrote:
>>> On 09/14/2017 10:42 AM, Ming Lei wrote:
>>>> Hi,
>>>>
>>>> This patchset avoids allocating a driver tag beforehand for the
>>>> flush rq when an I/O scheduler is in use. The flush rq is then
>>>> no longer treated specially wrt. get/put of the driver tag, and
>>>> the code gets cleaned up quite a bit: reorder_tags_to_front()
>>>> is removed, and we no longer need to worry about request order
>>>> in the dispatch list to avoid I/O deadlock.
>>>>
>>>> 'dbench -t 30 -s -F 64' has been run on different devices
>>>> (shared tag, multi-queue, single queue, ...) and no issues were
>>>> observed; even when the test is run at a very low queue depth
>>>> (1), dbench still works well.
>>>
>>> Gave this a quick spin on the test box, and I get tons of spewage
>>> on booting up:
>>>
>>> [ 9.131290] WARNING: CPU: 2 PID: 337 at block/blk-mq-sched.c:274 blk_mq_sched_insert_request+0x15d/0x170
>>
>> Sorry, my fault.
>>
>> The WARN_ON() was actually meant to be inside 'if (has_sched)'.
>> Could you please move it there (or just remove the WARN_ON()
>> from blk_mq_sched_bypass_insert()) and see if it works?
>
> Putting it inside 'has_sched' makes it go away. I'll fire up some
> testing of it here.
It still triggers for the requeue path, however:
[12403.280753] WARNING: CPU: 4 PID: 2279 at block/blk-mq-sched.c:275 blk_mq_sched_insert_request+0
[12403.302465] Modules linked in: null_blk configfs crct10dif_generic crct10dif_common loop dm_mo]
[12403.331762] CPU: 4 PID: 2279 Comm: kworker/4:1H Tainted: G W 4.13.0+ #473
[12403.341497] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
[12403.350745] Workqueue: kblockd blk_mq_requeue_work
[12403.356574] task: ffff881ff6c57000 task.stack: ffff881ff2acc000
[12403.363667] RIP: 0010:blk_mq_sched_insert_request+0x15d/0x170
[12403.370565] RSP: 0018:ffff881ff2acfdc0 EFLAGS: 00010213
[12403.376878] RAX: ffff881ff2868000 RBX: ffff880c7677be00 RCX: 0000000000022003
[12403.385333] RDX: ffff881ff32d8c00 RSI: 0000000000000001 RDI: ffff880c7677be00
[12403.393790] RBP: ffff881ff2acfe08 R08: 0000000000000001 R09: ffff8809010d4000
[12403.402240] R10: 0000000000000000 R11: 0000000000001000 R12: ffff881ff37b3800
[12403.410695] R13: 0000000000000000 R14: 0000000000000000 R15: ffffe8dfffe86940
[12403.419150] FS: 0000000000000000(0000) GS:ffff881fff680000(0000) knlGS:0000000000000000
[12403.429065] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12403.435961] CR2: 00007f13f4ce1000 CR3: 0000001ff357d002 CR4: 00000000003606e0
[12403.444414] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[12403.452868] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[12403.461316] Call Trace:
[12403.464528] ? __blk_mq_delay_run_hw_queue+0x84/0xa0
[12403.470550] blk_mq_requeue_work+0xc6/0x140
[12403.475702] process_one_work+0x18a/0x3e0
[12403.480654] worker_thread+0x48/0x3b0
[12403.485313] kthread+0x12a/0x140
[12403.489390] ? process_one_work+0x3e0/0x3e0
[12403.494545] ? kthread_create_on_node+0x40/0x40
[12403.500078] ret_from_fork+0x22/0x30
[12403.504551] Code: c0 4c 89 fa e8 e5 97 02 00 48 8b 4d c0 84 c0 74 10 49 89 5f 08 4c 89 3b 48 8
[12403.526954] ---[ end trace 2eefb804292867b5 ]---
The code now looks like this:
	if (has_sched) {
		WARN_ON(rq->tag != -1);
		rq->rq_flags |= RQF_SORTED;
	}
--
Jens Axboe
* Re: [PATCH 0/6] blk-mq: don't allocate driver tag beforehand for flush rq
From: Ming Lei @ 2017-09-16 1:55 UTC
To: Jens Axboe; +Cc: linux-block, Omar Sandoval
On Fri, Sep 15, 2017 at 12:06:41PM -0600, Jens Axboe wrote:
> On 09/15/2017 08:29 AM, Jens Axboe wrote:
> > On 09/14/2017 08:20 PM, Ming Lei wrote:
> >> On Thu, Sep 14, 2017 at 12:51:24PM -0600, Jens Axboe wrote:
> >>> On 09/14/2017 10:42 AM, Ming Lei wrote:
> >>>> Hi,
> >>>>
> >>>> This patchset avoids allocating a driver tag beforehand for the
> >>>> flush rq when an I/O scheduler is in use. The flush rq is then
> >>>> no longer treated specially wrt. get/put of the driver tag, and
> >>>> the code gets cleaned up quite a bit: reorder_tags_to_front()
> >>>> is removed, and we no longer need to worry about request order
> >>>> in the dispatch list to avoid I/O deadlock.
> >>>>
> >>>> 'dbench -t 30 -s -F 64' has been run on different devices
> >>>> (shared tag, multi-queue, single queue, ...) and no issues were
> >>>> observed; even when the test is run at a very low queue depth
> >>>> (1), dbench still works well.
> >>>
> >>> Gave this a quick spin on the test box, and I get tons of spewage
> >>> on booting up:
> >>>
> >>> [ 9.131290] WARNING: CPU: 2 PID: 337 at block/blk-mq-sched.c:274 blk_mq_sched_insert_request+0x15d/0x170
> >>
> >> Sorry, my fault.
> >>
> >> The WARN_ON() was actually meant to be inside 'if (has_sched)'.
> >> Could you please move it there (or just remove the WARN_ON()
> >> from blk_mq_sched_bypass_insert()) and see if it works?
> >
> > Putting it inside 'has_sched' makes it go away. I'll fire up some
> > testing of it here.
>
> It still triggers for the requeue path, however:
>
> [12403.280753] WARNING: CPU: 4 PID: 2279 at block/blk-mq-sched.c:275 blk_mq_sched_insert_request+0
> [12403.302465] Modules linked in: null_blk configfs crct10dif_generic crct10dif_common loop dm_mo]
> [12403.331762] CPU: 4 PID: 2279 Comm: kworker/4:1H Tainted: G W 4.13.0+ #473
> [12403.341497] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
> [12403.350745] Workqueue: kblockd blk_mq_requeue_work
> [12403.356574] task: ffff881ff6c57000 task.stack: ffff881ff2acc000
> [12403.363667] RIP: 0010:blk_mq_sched_insert_request+0x15d/0x170
> [12403.370565] RSP: 0018:ffff881ff2acfdc0 EFLAGS: 00010213
> [12403.376878] RAX: ffff881ff2868000 RBX: ffff880c7677be00 RCX: 0000000000022003
> [12403.385333] RDX: ffff881ff32d8c00 RSI: 0000000000000001 RDI: ffff880c7677be00
> [12403.393790] RBP: ffff881ff2acfe08 R08: 0000000000000001 R09: ffff8809010d4000
> [12403.402240] R10: 0000000000000000 R11: 0000000000001000 R12: ffff881ff37b3800
> [12403.410695] R13: 0000000000000000 R14: 0000000000000000 R15: ffffe8dfffe86940
> [12403.419150] FS: 0000000000000000(0000) GS:ffff881fff680000(0000) knlGS:0000000000000000
> [12403.429065] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [12403.435961] CR2: 00007f13f4ce1000 CR3: 0000001ff357d002 CR4: 00000000003606e0
> [12403.444414] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [12403.452868] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [12403.461316] Call Trace:
> [12403.464528] ? __blk_mq_delay_run_hw_queue+0x84/0xa0
> [12403.470550] blk_mq_requeue_work+0xc6/0x140
> [12403.475702] process_one_work+0x18a/0x3e0
> [12403.480654] worker_thread+0x48/0x3b0
> [12403.485313] kthread+0x12a/0x140
> [12403.489390] ? process_one_work+0x3e0/0x3e0
> [12403.494545] ? kthread_create_on_node+0x40/0x40
> [12403.500078] ret_from_fork+0x22/0x30
> [12403.504551] Code: c0 4c 89 fa e8 e5 97 02 00 48 8b 4d c0 84 c0 74 10 49 89 5f 08 4c 89 3b 48 8
> [12403.526954] ---[ end trace 2eefb804292867b5 ]---
>
> The code now looks like this:
>
> 	if (has_sched) {
> 		WARN_ON(rq->tag != -1);
> 		rq->rq_flags |= RQF_SORTED;
> 	}
Yeah, requeue is one case I missed: the request may still have
a valid driver tag assigned at that point if the requeue happens
in the completion path.
So I think we need to release the driver tag in
__blk_mq_requeue_request().
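
Something along these lines at the start of
__blk_mq_requeue_request() might do it (untested sketch):

	/*
	 * Drop any driver tag still held, so a requeued flush rq
	 * reaches blk_mq_sched_insert_request() with rq->tag == -1.
	 */
	blk_mq_put_driver_tag(rq);
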
--
Ming