[PATCH V6 0/5] blk-mq-sched: improve sequential I/O performance

linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH V6 0/5] blk-mq-sched: improve sequential I/O performance
@ 2017-10-09 11:24 Ming Lei
  2017-10-09 11:24 ` [PATCH V6 1/5] blk-mq-sched: fix scheduler bad performance Ming Lei
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Ming Lei @ 2017-10-09 11:24 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Ming Lei

Hi Jens,

In Red Hat internal storage test wrt. blk-mq scheduler, we
found that I/O performance is much bad with mq-deadline, especially
about sequential I/O on some multi-queue SCSI devcies(lpfc, qla2xxx,
SRP...)

Turns out one big issue causes the performance regression: requests
are still dequeued from sw queue/scheduler queue even when ldd's
queue is busy, so I/O merge becomes quite difficult to make, then
sequential IO degrades a lot.

This issue becomes one of mains reasons for reverting default SCSI_MQ
in V4.13.

The 1st patch takes direct issue in blk_mq_request_bypass_insert(),
then we can improve dm-mpath's performance in part 2, which will
be posted out soon.

The 2nd six patches improve this situation, and brings back
some performance loss.

With this change, SCSI-MQ sequential I/O performance is
improved much, Paolo reported that mq-deadline performance
improved much[2] in his dbench test wrt V2. Also performanc
improvement on lpfc/qla2xx was observed with V1.[1]

Please consider it for V4.15.

[1] http://marc.info/?l=linux-block&m=150151989915776&w=2
[2] https://marc.info/?l=linux-block&m=150217980602843&w=2

gitweb:
	https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V6_part1	

git & branch:
	https://github.com/ming1/linux.git	#blk_mq_improve_scsi_mpath_perf_V6_part1

V6:
	- address comments from Christoph
	- drop the 1st patch which changes blk_mq_request_bypass_insert(),
	which should belong to dm-mpath's improvement
	- move ' blk-mq-sched: move actual dispatching into one helper'
	as 2nd patch, and use the introduced helper to simplify dispatch
	logic
	- merge two previous patches into one for improving dispatch from sw queue
	- make comment/commit log line width as ~70, as suggested by
	  Christoph

V5:
	- address some comments from Omar
	- add Tested-by & Reveiewed-by tag
	- use direct issue for blk_mq_request_bypass_insert(), and
	start to consider to improve sequential I/O for dm-mpath
	- only include part 1(the original patch 1 ~ 6), as suggested
	by Omar

V4:
	- add Reviewed-by tag
	- some trival change: typo fix in commit log or comment,
	variable name, no actual functional change

V3:
	- totally round robin for picking req from ctx, as suggested
	by Bart
	- remove one local variable in __sbitmap_for_each_set()
	- drop patches of single dispatch list, which can improve
	performance on mq-deadline, but cause a bit degrade on
	none because all hctxs need to be checked after ->dispatch
	is flushed. Will post it again once it is mature.
	- rebase on v4.13-rc6 with block for-next

V2:
	- dequeue request from sw queues in round roubin's style
	as suggested by Bart, and introduces one helper in sbitmap
	for this purpose
	- improve bio merge via hash table from sw queue
	- add comments about using DISPATCH_BUSY state in lockless way,
	simplifying handling on busy state,
	- hold ctx->lock when clearing ctx busy bit as suggested
	by Bart

Ming Lei (5):
  blk-mq-sched: fix scheduler bad performance
  blk-mq-sched: move actual dispatching into one helper
  sbitmap: introduce __sbitmap_for_each_set()
  blk-mq-sched: improve dispatching from sw queue
  blk-mq-sched: don't dequeue request until all in ->dispatch are
    flushed

 block/blk-mq-debugfs.c  |   1 +
 block/blk-mq-sched.c    | 115 ++++++++++++++++++++++++++++++++++++++++--------
 block/blk-mq.c          |  44 ++++++++++++++++++
 block/blk-mq.h          |   2 +
 include/linux/blk-mq.h  |   3 ++
 include/linux/sbitmap.h |  64 ++++++++++++++++++++-------
 6 files changed, 193 insertions(+), 36 deletions(-)

-- 
2.9.5

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH V6 1/5] blk-mq-sched: fix scheduler bad performance
  2017-10-09 11:24 [PATCH V6 0/5] blk-mq-sched: improve sequential I/O performance Ming Lei
@ 2017-10-09 11:24 ` Ming Lei
  2017-10-10 18:10   ` Omar Sandoval
  2017-10-09 11:24 ` [PATCH V6 2/5] blk-mq-sched: move actual dispatching into one helper Ming Lei
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2017-10-09 11:24 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Ming Lei

When hw queue is busy, we shouldn't take requests from
scheduler queue any more, otherwise it is difficult to do
IO merge.

This patch fixes the awful IO performance on some
SCSI devices(lpfc, qla2xxx, ...) when mq-deadline/kyber
is used by not taking requests if hw queue is busy.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 4ab69435708c..eca011fdfa0e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -94,7 +94,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
-	bool did_work = false;
+	bool do_sched_dispatch = true;
 	LIST_HEAD(rq_list);
 
 	/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -125,18 +125,18 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
+		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
 	} else if (!has_sched_dispatch) {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
 		blk_mq_dispatch_rq_list(q, &rq_list);
 	}
 
 	/*
-	 * We want to dispatch from the scheduler if we had no work left
-	 * on the dispatch list, OR if we did have work but weren't able
-	 * to make progress.
+	 * We want to dispatch from the scheduler if there was nothing
+	 * on the dispatch list or we were able to dispatch from the
+	 * dispatch list.
 	 */
-	if (!did_work && has_sched_dispatch) {
+	if (do_sched_dispatch && has_sched_dispatch) {
 		do {
 			struct request *rq;
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH V6 1/5] blk-mq-sched: fix scheduler bad performance
  2017-10-09 11:24 ` [PATCH V6 1/5] blk-mq-sched: fix scheduler bad performance Ming Lei
@ 2017-10-10 18:10   ` Omar Sandoval
  0 siblings, 0 replies; 17+ messages in thread
From: Omar Sandoval @ 2017-10-10 18:10 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel, Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval

On Mon, Oct 09, 2017 at 07:24:20PM +0800, Ming Lei wrote:
> When hw queue is busy, we shouldn't take requests from
> scheduler queue any more, otherwise it is difficult to do
> IO merge.
> 
> This patch fixes the awful IO performance on some
> SCSI devices(lpfc, qla2xxx, ...) when mq-deadline/kyber
> is used by not taking requests if hw queue is busy.
> 
> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Omar Sandoval <osandov@fb.com>

> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq-sched.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 4ab69435708c..eca011fdfa0e 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -94,7 +94,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  	struct request_queue *q = hctx->queue;
>  	struct elevator_queue *e = q->elevator;
>  	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
> -	bool did_work = false;
> +	bool do_sched_dispatch = true;
>  	LIST_HEAD(rq_list);
>  
>  	/* RCU or SRCU read lock is needed before checking quiesced flag */
> @@ -125,18 +125,18 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  	 */
>  	if (!list_empty(&rq_list)) {
>  		blk_mq_sched_mark_restart_hctx(hctx);
> -		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
> +		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
>  	} else if (!has_sched_dispatch) {
>  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
>  		blk_mq_dispatch_rq_list(q, &rq_list);
>  	}
>  
>  	/*
> -	 * We want to dispatch from the scheduler if we had no work left
> -	 * on the dispatch list, OR if we did have work but weren't able
> -	 * to make progress.
> +	 * We want to dispatch from the scheduler if there was nothing
> +	 * on the dispatch list or we were able to dispatch from the
> +	 * dispatch list.
>  	 */
> -	if (!did_work && has_sched_dispatch) {
> +	if (do_sched_dispatch && has_sched_dispatch) {
>  		do {
>  			struct request *rq;
>  
> -- 
> 2.9.5
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH V6 2/5] blk-mq-sched: move actual dispatching into one helper
  2017-10-09 11:24 [PATCH V6 0/5] blk-mq-sched: improve sequential I/O performance Ming Lei
  2017-10-09 11:24 ` [PATCH V6 1/5] blk-mq-sched: fix scheduler bad performance Ming Lei
@ 2017-10-09 11:24 ` Ming Lei
  2017-10-09 11:24 ` [PATCH V6 3/5] sbitmap: introduce __sbitmap_for_each_set() Ming Lei
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2017-10-09 11:24 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Ming Lei

So that it becomes easy to support to dispatch from sw queue in the
following patch.

No functional change.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Suggested-by: Christoph Hellwig <hch@lst.de> # for simplifying dispatch logic
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 43 ++++++++++++++++++++++++-------------------
 1 file changed, 24 insertions(+), 19 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index eca011fdfa0e..be29ba849408 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -89,12 +89,26 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
 	return false;
 }
 
+static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
+{
+	struct request_queue *q = hctx->queue;
+	struct elevator_queue *e = q->elevator;
+	LIST_HEAD(rq_list);
+
+	do {
+		struct request *rq = e->type->ops.mq.dispatch_request(hctx);
+
+		if (!rq)
+			break;
+		list_add(&rq->queuelist, &rq_list);
+	} while (blk_mq_dispatch_rq_list(q, &rq_list));
+}
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
 	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
-	bool do_sched_dispatch = true;
 	LIST_HEAD(rq_list);
 
 	/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -122,30 +136,21 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 * scheduler, we can no longer merge or sort them. So it's best to
 	 * leave them there for as long as we can. Mark the hw queue as
 	 * needing a restart in that case.
+	 *
+	 * We want to dispatch from the scheduler if there was nothing
+	 * on the dispatch list or we were able to dispatch from the
+	 * dispatch list.
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
-	} else if (!has_sched_dispatch) {
+		if (blk_mq_dispatch_rq_list(q, &rq_list) && has_sched_dispatch)
+			blk_mq_do_dispatch_sched(hctx);
+	} else if (has_sched_dispatch) {
+		blk_mq_do_dispatch_sched(hctx);
+	} else {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
 		blk_mq_dispatch_rq_list(q, &rq_list);
 	}
-
-	/*
-	 * We want to dispatch from the scheduler if there was nothing
-	 * on the dispatch list or we were able to dispatch from the
-	 * dispatch list.
-	 */
-	if (do_sched_dispatch && has_sched_dispatch) {
-		do {
-			struct request *rq;
-
-			rq = e->type->ops.mq.dispatch_request(hctx);
-			if (!rq)
-				break;
-			list_add(&rq->queuelist, &rq_list);
-		} while (blk_mq_dispatch_rq_list(q, &rq_list));
-	}
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH V6 3/5] sbitmap: introduce __sbitmap_for_each_set()
  2017-10-09 11:24 [PATCH V6 0/5] blk-mq-sched: improve sequential I/O performance Ming Lei
  2017-10-09 11:24 ` [PATCH V6 1/5] blk-mq-sched: fix scheduler bad performance Ming Lei
  2017-10-09 11:24 ` [PATCH V6 2/5] blk-mq-sched: move actual dispatching into one helper Ming Lei
@ 2017-10-09 11:24 ` Ming Lei
  2017-10-10 18:15   ` Omar Sandoval
  2017-10-09 11:24 ` [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue Ming Lei
  2017-10-09 11:24 ` [PATCH V6 5/5] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed Ming Lei
  4 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2017-10-09 11:24 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Ming Lei

We need to iterate ctx starting from any ctx in round robin
way, so introduce this helper.

Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/sbitmap.h | 64 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 47 insertions(+), 17 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index a1904aadbc45..0dcc60e820de 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -211,10 +211,14 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb);
  */
 bool sbitmap_any_bit_clear(const struct sbitmap *sb);
 
+#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
+#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
+
 typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
 
 /**
- * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * __sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * @start: Where to start the iteration.
  * @sb: Bitmap to iterate over.
  * @fn: Callback. Should return true to continue or false to break early.
  * @data: Pointer to pass to callback.
@@ -222,35 +226,61 @@ typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
  * This is inline even though it's non-trivial so that the function calls to the
  * callback will hopefully get optimized away.
  */
-static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
-					void *data)
+static inline void __sbitmap_for_each_set(struct sbitmap *sb,
+					  unsigned int start,
+					  sb_for_each_fn fn, void *data)
 {
-	unsigned int i;
+	unsigned int index;
+	unsigned int nr;
+	unsigned int scanned = 0;
 
-	for (i = 0; i < sb->map_nr; i++) {
-		struct sbitmap_word *word = &sb->map[i];
-		unsigned int off, nr;
+	if (start >= sb->depth)
+		start = 0;
+	index = SB_NR_TO_INDEX(sb, start);
+	nr = SB_NR_TO_BIT(sb, start);
 
-		if (!word->word)
-			continue;
+	while (scanned < sb->depth) {
+		struct sbitmap_word *word = &sb->map[index];
+		unsigned int depth = min_t(unsigned int, word->depth - nr,
+					   sb->depth - scanned);
 
-		nr = 0;
-		off = i << sb->shift;
+		scanned += depth;
+		if (!word->word)
+			goto next;
+
+		/*
+		 * On the first iteration of the outer loop, we need to add the
+		 * bit offset back to the size of the word for find_next_bit().
+		 * On all other iterations, nr is zero, so this is a noop.
+		 */
+		depth += nr;
 		while (1) {
-			nr = find_next_bit(&word->word, word->depth, nr);
-			if (nr >= word->depth)
+			nr = find_next_bit(&word->word, depth, nr);
+			if (nr >= depth)
 				break;
-
-			if (!fn(sb, off + nr, data))
+			if (!fn(sb, (index << sb->shift) + nr, data))
 				return;
 
 			nr++;
 		}
+next:
+		nr = 0;
+		if (++index >= sb->map_nr)
+			index = 0;
 	}
 }
 
-#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
-#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
+/**
+ * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * @sb: Bitmap to iterate over.
+ * @fn: Callback. Should return true to continue or false to break early.
+ * @data: Pointer to pass to callback.
+ */
+static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
+					void *data)
+{
+	__sbitmap_for_each_set(sb, 0, fn, data);
+}
 
 static inline unsigned long *__sbitmap_word(struct sbitmap *sb,
 					    unsigned int bitnr)
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH V6 3/5] sbitmap: introduce __sbitmap_for_each_set()
  2017-10-09 11:24 ` [PATCH V6 3/5] sbitmap: introduce __sbitmap_for_each_set() Ming Lei
@ 2017-10-10 18:15   ` Omar Sandoval
  0 siblings, 0 replies; 17+ messages in thread
From: Omar Sandoval @ 2017-10-10 18:15 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel, Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval

On Mon, Oct 09, 2017 at 07:24:22PM +0800, Ming Lei wrote:
> We need to iterate ctx starting from any ctx in round robin
> way, so introduce this helper.
> 
> Cc: Omar Sandoval <osandov@fb.com>

Reviewed-by: Omar Sandoval <osandov@fb.com>

> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  include/linux/sbitmap.h | 64 ++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 47 insertions(+), 17 deletions(-)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue
  2017-10-09 11:24 [PATCH V6 0/5] blk-mq-sched: improve sequential I/O performance Ming Lei
                   ` (2 preceding siblings ...)
  2017-10-09 11:24 ` [PATCH V6 3/5] sbitmap: introduce __sbitmap_for_each_set() Ming Lei
@ 2017-10-09 11:24 ` Ming Lei
  2017-10-10 18:23   ` Omar Sandoval
  2017-10-09 11:24 ` [PATCH V6 5/5] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed Ming Lei
  4 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2017-10-09 11:24 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Ming Lei

SCSI devices use host-wide tagset, and the shared driver tag space is
often quite big. Meantime there is also queue depth for each lun(
.cmd_per_lun), which is often small, for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.

So lots of requests may stay in sw queue, and we always flush all
belonging to same hw queue and dispatch them all to driver, unfortunately
it is easy to cause queue busy because of the small .cmd_per_lun.
Once these requests are flushed out, they have to stay in hctx->dispatch,
and no bio merge can participate into these requests, and sequential IO
performance is hurt a lot.

This patch introduces blk_mq_dequeue_from_ctx for dequeuing request from
sw queue so that we can dispatch them in scheduler's way, then we can
avoid to dequeue too many requests from sw queue when ->dispatch isn't
flushed completely.

This patch improves dispatching from sw queue when there is per-request-queue
queue depth by taking request one by one from sw queue, just like the way
of IO scheduler.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c   | 50 ++++++++++++++++++++++++++++++++++++++++++++++++--
 block/blk-mq.c         | 39 +++++++++++++++++++++++++++++++++++++++
 block/blk-mq.h         |  2 ++
 include/linux/blk-mq.h |  2 ++
 4 files changed, 91 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index be29ba849408..14b354f617e5 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -104,6 +104,39 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 	} while (blk_mq_dispatch_rq_list(q, &rq_list));
 }
 
+static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
+					  struct blk_mq_ctx *ctx)
+{
+	unsigned idx = ctx->index_hw;
+
+	if (++idx == hctx->nr_ctx)
+		idx = 0;
+
+	return hctx->ctxs[idx];
+}
+
+static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
+{
+	struct request_queue *q = hctx->queue;
+	LIST_HEAD(rq_list);
+	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
+
+	do {
+		struct request *rq;
+
+		rq = blk_mq_dequeue_from_ctx(hctx, ctx);
+		if (!rq)
+			break;
+		list_add(&rq->queuelist, &rq_list);
+
+		/* round robin for fair dispatch */
+		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
+
+	} while (blk_mq_dispatch_rq_list(q, &rq_list));
+
+	WRITE_ONCE(hctx->dispatch_from, ctx);
+}
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
@@ -143,10 +176,23 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		if (blk_mq_dispatch_rq_list(q, &rq_list) && has_sched_dispatch)
-			blk_mq_do_dispatch_sched(hctx);
+		if (blk_mq_dispatch_rq_list(q, &rq_list)) {
+			if (has_sched_dispatch)
+				blk_mq_do_dispatch_sched(hctx);
+			else
+				blk_mq_do_dispatch_ctx(hctx);
+		}
 	} else if (has_sched_dispatch) {
 		blk_mq_do_dispatch_sched(hctx);
+	} else if (q->queue_depth) {
+		/*
+		 * If there is per-request_queue depth, we dequeue
+		 * request one by one from sw queue for avoiding to mess
+		 * up I/O merge when dispatch runs out of resource, which
+		 * can be triggered easily when there is per-request_queue
+		 * queue depth or .cmd_per_lun, such as SCSI device.
+		 */
+		blk_mq_do_dispatch_ctx(hctx);
 	} else {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
 		blk_mq_dispatch_rq_list(q, &rq_list);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 076cbab9c3e0..394cb75d66fa 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -911,6 +911,45 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 }
 EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);
 
+struct dispatch_rq_data {
+	struct blk_mq_hw_ctx *hctx;
+	struct request *rq;
+};
+
+static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr,
+		void *data)
+{
+	struct dispatch_rq_data *dispatch_data = data;
+	struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
+	struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
+
+	spin_lock(&ctx->lock);
+	if (unlikely(!list_empty(&ctx->rq_list))) {
+		dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
+		list_del_init(&dispatch_data->rq->queuelist);
+		if (list_empty(&ctx->rq_list))
+			sbitmap_clear_bit(sb, bitnr);
+	}
+	spin_unlock(&ctx->lock);
+
+	return !dispatch_data->rq;
+}
+
+struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
+					struct blk_mq_ctx *start)
+{
+	unsigned off = start ? start->index_hw : 0;
+	struct dispatch_rq_data data = {
+		.hctx = hctx,
+		.rq   = NULL,
+	};
+
+	__sbitmap_for_each_set(&hctx->ctx_map, off,
+			       dispatch_rq_from_ctx, &data);
+
+	return data.rq;
+}
+
 static inline unsigned int queued_to_index(unsigned int queued)
 {
 	if (!queued)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index ef15b3414da5..231cfb0d973b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -35,6 +35,8 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
 bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 				bool wait);
+struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
+					struct blk_mq_ctx *start);
 
 /*
  * Internal helpers for allocating/freeing the request map
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 50c6485cb04f..7b7a366a97f3 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -30,6 +30,8 @@ struct blk_mq_hw_ctx {
 
 	struct sbitmap		ctx_map;
 
+	struct blk_mq_ctx	*dispatch_from;
+
 	struct blk_mq_ctx	**ctxs;
 	unsigned int		nr_ctx;
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue
  2017-10-09 11:24 ` [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue Ming Lei
@ 2017-10-10 18:23   ` Omar Sandoval
  2017-10-12 10:01     ` Ming Lei
  0 siblings, 1 reply; 17+ messages in thread
From: Omar Sandoval @ 2017-10-10 18:23 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel, Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval

On Mon, Oct 09, 2017 at 07:24:23PM +0800, Ming Lei wrote:
> SCSI devices use host-wide tagset, and the shared driver tag space is
> often quite big. Meantime there is also queue depth for each lun(
> .cmd_per_lun), which is often small, for example, on both lpfc and
> qla2xxx, .cmd_per_lun is just 3.
> 
> So lots of requests may stay in sw queue, and we always flush all
> belonging to same hw queue and dispatch them all to driver, unfortunately
> it is easy to cause queue busy because of the small .cmd_per_lun.
> Once these requests are flushed out, they have to stay in hctx->dispatch,
> and no bio merge can participate into these requests, and sequential IO
> performance is hurt a lot.
> 
> This patch introduces blk_mq_dequeue_from_ctx for dequeuing request from
> sw queue so that we can dispatch them in scheduler's way, then we can
> avoid to dequeue too many requests from sw queue when ->dispatch isn't
> flushed completely.
> 
> This patch improves dispatching from sw queue when there is per-request-queue
> queue depth by taking request one by one from sw queue, just like the way
> of IO scheduler.

This still didn't address Jens' concern about using q->queue_depth as
the heuristic for whether to do the full sw queue flush or one-by-one
dispatch. The EWMA approach is a bit too complex for now, can you please
try the heuristic of whether the driver ever returned BLK_STS_RESOURCE?

> Reviewed-by: Omar Sandoval <osandov@fb.com>
> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq-sched.c   | 50 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  block/blk-mq.c         | 39 +++++++++++++++++++++++++++++++++++++++
>  block/blk-mq.h         |  2 ++
>  include/linux/blk-mq.h |  2 ++
>  4 files changed, 91 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index be29ba849408..14b354f617e5 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -104,6 +104,39 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  	} while (blk_mq_dispatch_rq_list(q, &rq_list));
>  }
>  
> +static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
> +					  struct blk_mq_ctx *ctx)
> +{
> +	unsigned idx = ctx->index_hw;
> +
> +	if (++idx == hctx->nr_ctx)
> +		idx = 0;
> +
> +	return hctx->ctxs[idx];
> +}
> +
> +static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct request_queue *q = hctx->queue;
> +	LIST_HEAD(rq_list);
> +	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
> +
> +	do {
> +		struct request *rq;
> +
> +		rq = blk_mq_dequeue_from_ctx(hctx, ctx);
> +		if (!rq)
> +			break;
> +		list_add(&rq->queuelist, &rq_list);
> +
> +		/* round robin for fair dispatch */
> +		ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
> +
> +	} while (blk_mq_dispatch_rq_list(q, &rq_list));
> +
> +	WRITE_ONCE(hctx->dispatch_from, ctx);
> +}
> +
>  void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct request_queue *q = hctx->queue;
> @@ -143,10 +176,23 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
>  	 */
>  	if (!list_empty(&rq_list)) {
>  		blk_mq_sched_mark_restart_hctx(hctx);
> -		if (blk_mq_dispatch_rq_list(q, &rq_list) && has_sched_dispatch)
> -			blk_mq_do_dispatch_sched(hctx);
> +		if (blk_mq_dispatch_rq_list(q, &rq_list)) {
> +			if (has_sched_dispatch)
> +				blk_mq_do_dispatch_sched(hctx);
> +			else
> +				blk_mq_do_dispatch_ctx(hctx);
> +		}
>  	} else if (has_sched_dispatch) {
>  		blk_mq_do_dispatch_sched(hctx);
> +	} else if (q->queue_depth) {
> +		/*
> +		 * If there is per-request_queue depth, we dequeue
> +		 * request one by one from sw queue for avoiding to mess
> +		 * up I/O merge when dispatch runs out of resource, which
> +		 * can be triggered easily when there is per-request_queue
> +		 * queue depth or .cmd_per_lun, such as SCSI device.
> +		 */
> +		blk_mq_do_dispatch_ctx(hctx);
>  	} else {
>  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
>  		blk_mq_dispatch_rq_list(q, &rq_list);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 076cbab9c3e0..394cb75d66fa 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -911,6 +911,45 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);
>  
> +struct dispatch_rq_data {
> +	struct blk_mq_hw_ctx *hctx;
> +	struct request *rq;
> +};
> +
> +static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr,
> +		void *data)
> +{
> +	struct dispatch_rq_data *dispatch_data = data;
> +	struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
> +	struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
> +
> +	spin_lock(&ctx->lock);
> +	if (unlikely(!list_empty(&ctx->rq_list))) {
> +		dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
> +		list_del_init(&dispatch_data->rq->queuelist);
> +		if (list_empty(&ctx->rq_list))
> +			sbitmap_clear_bit(sb, bitnr);
> +	}
> +	spin_unlock(&ctx->lock);
> +
> +	return !dispatch_data->rq;
> +}
> +
> +struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
> +					struct blk_mq_ctx *start)
> +{
> +	unsigned off = start ? start->index_hw : 0;
> +	struct dispatch_rq_data data = {
> +		.hctx = hctx,
> +		.rq   = NULL,
> +	};
> +
> +	__sbitmap_for_each_set(&hctx->ctx_map, off,
> +			       dispatch_rq_from_ctx, &data);
> +
> +	return data.rq;
> +}
> +
>  static inline unsigned int queued_to_index(unsigned int queued)
>  {
>  	if (!queued)
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index ef15b3414da5..231cfb0d973b 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -35,6 +35,8 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
>  bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
>  bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
>  				bool wait);
> +struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
> +					struct blk_mq_ctx *start);
>  
>  /*
>   * Internal helpers for allocating/freeing the request map
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 50c6485cb04f..7b7a366a97f3 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -30,6 +30,8 @@ struct blk_mq_hw_ctx {
>  
>  	struct sbitmap		ctx_map;
>  
> +	struct blk_mq_ctx	*dispatch_from;
> +
>  	struct blk_mq_ctx	**ctxs;
>  	unsigned int		nr_ctx;
>  
> -- 
> 2.9.5
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue
  2017-10-10 18:23   ` Omar Sandoval
@ 2017-10-12 10:01     ` Ming Lei
  2017-10-12 14:52       ` Jens Axboe
  2017-10-12 15:33       ` Bart Van Assche
  0 siblings, 2 replies; 17+ messages in thread
From: Ming Lei @ 2017-10-12 10:01 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel, Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval

On Tue, Oct 10, 2017 at 11:23:45AM -0700, Omar Sandoval wrote:
> On Mon, Oct 09, 2017 at 07:24:23PM +0800, Ming Lei wrote:
> > SCSI devices use host-wide tagset, and the shared driver tag space is
> > often quite big. Meantime there is also queue depth for each lun(
> > .cmd_per_lun), which is often small, for example, on both lpfc and
> > qla2xxx, .cmd_per_lun is just 3.
> > 
> > So lots of requests may stay in sw queue, and we always flush all
> > belonging to same hw queue and dispatch them all to driver, unfortunately
> > it is easy to cause queue busy because of the small .cmd_per_lun.
> > Once these requests are flushed out, they have to stay in hctx->dispatch,
> > and no bio merge can participate into these requests, and sequential IO
> > performance is hurt a lot.
> > 
> > This patch introduces blk_mq_dequeue_from_ctx for dequeuing request from
> > sw queue so that we can dispatch them in scheduler's way, then we can
> > avoid to dequeue too many requests from sw queue when ->dispatch isn't
> > flushed completely.
> > 
> > This patch improves dispatching from sw queue when there is per-request-queue
> > queue depth by taking request one by one from sw queue, just like the way
> > of IO scheduler.
> 
> This still didn't address Jens' concern about using q->queue_depth as
> the heuristic for whether to do the full sw queue flush or one-by-one
> dispatch. The EWMA approach is a bit too complex for now, can you please
> try the heuristic of whether the driver ever returned BLK_STS_RESOURCE?

That can be done easily, but I am not sure if it is good.

For example, inside queue rq path of NVMe, kmalloc(GFP_ATOMIC) is
often used, if kmalloc() returns NULL just once, BLK_STS_RESOURCE
will be returned to blk-mq, then blk-mq will never do full sw
queue flush even when kmalloc() always succeed from that time
on.

Even EWMA approach isn't good on SCSI-MQ too, because
some SCSI's .cmd_per_lun is very small, such as 3 on
lpfc and qla2xxx, and one full flush will trigger
BLK_STS_RESOURCE easily.

So I suggest to use the way of q->queue_depth first, since we
don't get performance degrade report on other devices(!q->queue_depth)
with blk-mq. We can improve this way in the future if we
have better approach.

What do you think about it?


-- 
Ming

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue
  2017-10-12 10:01     ` Ming Lei
@ 2017-10-12 14:52       ` Jens Axboe
  2017-10-12 15:22         ` Ming Lei
  2017-10-12 15:33       ` Bart Van Assche
  1 sibling, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2017-10-12 14:52 UTC (permalink / raw)
  To: Ming Lei, Omar Sandoval
  Cc: linux-block, Christoph Hellwig, Mike Snitzer, dm-devel,
	Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval

On 10/12/2017 04:01 AM, Ming Lei wrote:
> On Tue, Oct 10, 2017 at 11:23:45AM -0700, Omar Sandoval wrote:
>> On Mon, Oct 09, 2017 at 07:24:23PM +0800, Ming Lei wrote:
>>> SCSI devices use host-wide tagset, and the shared driver tag space is
>>> often quite big. Meantime there is also queue depth for each lun(
>>> .cmd_per_lun), which is often small, for example, on both lpfc and
>>> qla2xxx, .cmd_per_lun is just 3.
>>>
>>> So lots of requests may stay in sw queue, and we always flush all
>>> belonging to same hw queue and dispatch them all to driver, unfortunately
>>> it is easy to cause queue busy because of the small .cmd_per_lun.
>>> Once these requests are flushed out, they have to stay in hctx->dispatch,
>>> and no bio merge can participate into these requests, and sequential IO
>>> performance is hurt a lot.
>>>
>>> This patch introduces blk_mq_dequeue_from_ctx for dequeuing request from
>>> sw queue so that we can dispatch them in scheduler's way, then we can
>>> avoid to dequeue too many requests from sw queue when ->dispatch isn't
>>> flushed completely.
>>>
>>> This patch improves dispatching from sw queue when there is per-request-queue
>>> queue depth by taking request one by one from sw queue, just like the way
>>> of IO scheduler.
>>
>> This still didn't address Jens' concern about using q->queue_depth as
>> the heuristic for whether to do the full sw queue flush or one-by-one
>> dispatch. The EWMA approach is a bit too complex for now, can you please
>> try the heuristic of whether the driver ever returned BLK_STS_RESOURCE?
> 
> That can be done easily, but I am not sure if it is good.
> 
> For example, inside queue rq path of NVMe, kmalloc(GFP_ATOMIC) is
> often used, if kmalloc() returns NULL just once, BLK_STS_RESOURCE
> will be returned to blk-mq, then blk-mq will never do full sw
> queue flush even when kmalloc() always succeed from that time
> on.

Have it be a bit more than a single bit, then. Reset it every x IOs or
something like that, that'll be more representative of transient busy
conditions anyway.

> Even EWMA approach isn't good on SCSI-MQ too, because
> some SCSI's .cmd_per_lun is very small, such as 3 on
> lpfc and qla2xxx, and one full flush will trigger
> BLK_STS_RESOURCE easily.
> 
> So I suggest to use the way of q->queue_depth first, since we
> don't get performance degrade report on other devices(!q->queue_depth)
> with blk-mq. We can improve this way in the future if we
> have better approach.
> 
> What do you think about it?

I think it's absolutely horrible, and I already explained why in great
detail in an earlier review. tldr is that the fact that only scsi sets
->queue_depth right now is completely randomly related to the fact that
only scsi also shares tags. Every driver should set the queue depth, so
that signal will go to zero very quickly.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue
  2017-10-12 14:52       ` Jens Axboe
@ 2017-10-12 15:22         ` Ming Lei
  2017-10-12 15:24           ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2017-10-12 15:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Omar Sandoval, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel, Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval

On Thu, Oct 12, 2017 at 08:52:12AM -0600, Jens Axboe wrote:
> On 10/12/2017 04:01 AM, Ming Lei wrote:
> > On Tue, Oct 10, 2017 at 11:23:45AM -0700, Omar Sandoval wrote:
> >> On Mon, Oct 09, 2017 at 07:24:23PM +0800, Ming Lei wrote:
> >>> SCSI devices use host-wide tagset, and the shared driver tag space is
> >>> often quite big. Meantime there is also queue depth for each lun(
> >>> .cmd_per_lun), which is often small, for example, on both lpfc and
> >>> qla2xxx, .cmd_per_lun is just 3.
> >>>
> >>> So lots of requests may stay in sw queue, and we always flush all
> >>> belonging to same hw queue and dispatch them all to driver, unfortunately
> >>> it is easy to cause queue busy because of the small .cmd_per_lun.
> >>> Once these requests are flushed out, they have to stay in hctx->dispatch,
> >>> and no bio merge can participate into these requests, and sequential IO
> >>> performance is hurt a lot.
> >>>
> >>> This patch introduces blk_mq_dequeue_from_ctx for dequeuing request from
> >>> sw queue so that we can dispatch them in scheduler's way, then we can
> >>> avoid to dequeue too many requests from sw queue when ->dispatch isn't
> >>> flushed completely.
> >>>
> >>> This patch improves dispatching from sw queue when there is per-request-queue
> >>> queue depth by taking request one by one from sw queue, just like the way
> >>> of IO scheduler.
> >>
> >> This still didn't address Jens' concern about using q->queue_depth as
> >> the heuristic for whether to do the full sw queue flush or one-by-one
> >> dispatch. The EWMA approach is a bit too complex for now, can you please
> >> try the heuristic of whether the driver ever returned BLK_STS_RESOURCE?
> > 
> > That can be done easily, but I am not sure if it is good.
> > 
> > For example, inside queue rq path of NVMe, kmalloc(GFP_ATOMIC) is
> > often used, if kmalloc() returns NULL just once, BLK_STS_RESOURCE
> > will be returned to blk-mq, then blk-mq will never do full sw
> > queue flush even when kmalloc() always succeed from that time
> > on.
> 
> Have it be a bit more than a single bit, then. Reset it every x IOs or
> something like that, that'll be more representative of transient busy
> conditions anyway.

OK, that can be done via a simplified EWMA by considering
the dispatch result only.

I will address it in V6.


-- 
Ming

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue
  2017-10-12 15:22         ` Ming Lei
@ 2017-10-12 15:24           ` Jens Axboe
  0 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2017-10-12 15:24 UTC (permalink / raw)
  To: Ming Lei
  Cc: Omar Sandoval, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel, Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval

On 10/12/2017 09:22 AM, Ming Lei wrote:
> On Thu, Oct 12, 2017 at 08:52:12AM -0600, Jens Axboe wrote:
>> On 10/12/2017 04:01 AM, Ming Lei wrote:
>>> On Tue, Oct 10, 2017 at 11:23:45AM -0700, Omar Sandoval wrote:
>>>> On Mon, Oct 09, 2017 at 07:24:23PM +0800, Ming Lei wrote:
>>>>> SCSI devices use host-wide tagset, and the shared driver tag space is
>>>>> often quite big. Meantime there is also queue depth for each lun(
>>>>> .cmd_per_lun), which is often small, for example, on both lpfc and
>>>>> qla2xxx, .cmd_per_lun is just 3.
>>>>>
>>>>> So lots of requests may stay in sw queue, and we always flush all
>>>>> belonging to same hw queue and dispatch them all to driver, unfortunately
>>>>> it is easy to cause queue busy because of the small .cmd_per_lun.
>>>>> Once these requests are flushed out, they have to stay in hctx->dispatch,
>>>>> and no bio merge can participate into these requests, and sequential IO
>>>>> performance is hurt a lot.
>>>>>
>>>>> This patch introduces blk_mq_dequeue_from_ctx for dequeuing request from
>>>>> sw queue so that we can dispatch them in scheduler's way, then we can
>>>>> avoid to dequeue too many requests from sw queue when ->dispatch isn't
>>>>> flushed completely.
>>>>>
>>>>> This patch improves dispatching from sw queue when there is per-request-queue
>>>>> queue depth by taking request one by one from sw queue, just like the way
>>>>> of IO scheduler.
>>>>
>>>> This still didn't address Jens' concern about using q->queue_depth as
>>>> the heuristic for whether to do the full sw queue flush or one-by-one
>>>> dispatch. The EWMA approach is a bit too complex for now, can you please
>>>> try the heuristic of whether the driver ever returned BLK_STS_RESOURCE?
>>>
>>> That can be done easily, but I am not sure if it is good.
>>>
>>> For example, inside queue rq path of NVMe, kmalloc(GFP_ATOMIC) is
>>> often used, if kmalloc() returns NULL just once, BLK_STS_RESOURCE
>>> will be returned to blk-mq, then blk-mq will never do full sw
>>> queue flush even when kmalloc() always succeed from that time
>>> on.
>>
>> Have it be a bit more than a single bit, then. Reset it every x IOs or
>> something like that, that'll be more representative of transient busy
>> conditions anyway.
> 
> OK, that can be done via a simplified EWMA by considering
> the dispatch result only.

Yes, if it's kept simple enough, then that would be fine. I'm not totally
against EWMA, I just don't want to have any of this over-engineered.
Especially not when it's a pretty simple thing, we don't care about
averages, basically only if we ever see BLK_STS_RESOURCE in any kind
of recurring fashion.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue
  2017-10-12 10:01     ` Ming Lei
  2017-10-12 14:52       ` Jens Axboe
@ 2017-10-12 15:33       ` Bart Van Assche
  2017-10-12 15:37         ` Jens Axboe
  1 sibling, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2017-10-12 15:33 UTC (permalink / raw)
  To: osandov@osandov.com, ming.lei@redhat.com
  Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	hch@infradead.org, tom81094@gmail.com, paolo.valente@linaro.org,
	snitzer@redhat.com, Bart Van Assche, axboe@fb.com,
	linux-scsi@vger.kernel.org, oleksandr@natalenko.name,
	osandov@fb.com, loberman@redhat.com, dm-devel@redhat.com

On Thu, 2017-10-12 at 18:01 +0800, Ming Lei wrote:
> Even EWMA approach isn't good on SCSI-MQ too, because
> some SCSI's .cmd_per_lun is very small, such as 3 on
> lpfc and qla2xxx, and one full flush will trigger
> BLK_STS_RESOURCE easily.
> 
> So I suggest to use the way of q->queue_depth first, since we
> don't get performance degrade report on other devices(!q->queue_depth)
> with blk-mq. We can improve this way in the future if we
> have better approach.

Measurements have shown that even with this patch series applied sequential
I/O performance is still below that of the legacy block and SCSI layers. So
this patch series is not the final solution. (See also John Garry's e-mail
of October 10th - https://lkml.org/lkml/2017/10/10/401). I have been
wondering what could be causing that performance difference. Maybe it's
because requests can reside for a while in the hctx dispatch queue and hence
are unvisible for the scheduler while in the hctx dispatch queue? Should we
modify blk_mq_dispatch_rq_list() such that it puts back requests that have
not been accepted by .queue_rq() onto the scheduler queue(s) instead of to
the hctx dispatch queue? If we would make that change, would it allow us to
drop patch "blk-mq-sched: improve dispatching from sw queue"?

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue
  2017-10-12 15:33       ` Bart Van Assche
@ 2017-10-12 15:37         ` Jens Axboe
  2017-10-12 15:49           ` Ming Lei
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2017-10-12 15:37 UTC (permalink / raw)
  To: Bart Van Assche, osandov@osandov.com, ming.lei@redhat.com
  Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	hch@infradead.org, tom81094@gmail.com, paolo.valente@linaro.org,
	snitzer@redhat.com, linux-scsi@vger.kernel.org,
	oleksandr@natalenko.name, osandov@fb.com, loberman@redhat.com,
	dm-devel@redhat.com

On 10/12/2017 09:33 AM, Bart Van Assche wrote:
> On Thu, 2017-10-12 at 18:01 +0800, Ming Lei wrote:
>> Even EWMA approach isn't good on SCSI-MQ too, because
>> some SCSI's .cmd_per_lun is very small, such as 3 on
>> lpfc and qla2xxx, and one full flush will trigger
>> BLK_STS_RESOURCE easily.
>>
>> So I suggest to use the way of q->queue_depth first, since we
>> don't get performance degrade report on other devices(!q->queue_depth)
>> with blk-mq. We can improve this way in the future if we
>> have better approach.
> 
> Measurements have shown that even with this patch series applied sequential
> I/O performance is still below that of the legacy block and SCSI layers. So
> this patch series is not the final solution. (See also John Garry's e-mail
> of October 10th - https://lkml.org/lkml/2017/10/10/401). I have been
> wondering what could be causing that performance difference. Maybe it's
> because requests can reside for a while in the hctx dispatch queue and hence
> are unvisible for the scheduler while in the hctx dispatch queue? Should we
> modify blk_mq_dispatch_rq_list() such that it puts back requests that have
> not been accepted by .queue_rq() onto the scheduler queue(s) instead of to
> the hctx dispatch queue? If we would make that change, would it allow us to
> drop patch "blk-mq-sched: improve dispatching from sw queue"?

Yes, it's clear that even with the full series, we're not completely there
yet. We are closer, though, and I do want to close that gap up as much
as we can. I think everybody will be more motivated and have an easier time
getting the last bit of the way there, once we have a good foundation in.

It may be the reason that you hint at, if we do see a lot of requeueing
or BUSY in the test case. That would prematurely move requests from the
schedulers knowledge and into the hctx->dispatch holding area. It'd be
useful to have a standard SATA test run and see if we're missing merging
in that case (since merging is what it boils down to). If we are, then
it's not hctx->dispatch issues.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue
  2017-10-12 15:37         ` Jens Axboe
@ 2017-10-12 15:49           ` Ming Lei
  0 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2017-10-12 15:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Bart Van Assche, osandov@osandov.com,
	linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
	hch@infradead.org, tom81094@gmail.com, paolo.valente@linaro.org,
	snitzer@redhat.com, linux-scsi@vger.kernel.org,
	oleksandr@natalenko.name, osandov@fb.com, loberman@redhat.com,
	dm-devel@redhat.com

On Thu, Oct 12, 2017 at 09:37:11AM -0600, Jens Axboe wrote:
> On 10/12/2017 09:33 AM, Bart Van Assche wrote:
> > On Thu, 2017-10-12 at 18:01 +0800, Ming Lei wrote:
> >> Even EWMA approach isn't good on SCSI-MQ too, because
> >> some SCSI's .cmd_per_lun is very small, such as 3 on
> >> lpfc and qla2xxx, and one full flush will trigger
> >> BLK_STS_RESOURCE easily.
> >>
> >> So I suggest to use the way of q->queue_depth first, since we
> >> don't get performance degrade report on other devices(!q->queue_depth)
> >> with blk-mq. We can improve this way in the future if we
> >> have better approach.
> > 
> > Measurements have shown that even with this patch series applied sequential
> > I/O performance is still below that of the legacy block and SCSI layers. So
> > this patch series is not the final solution. (See also John Garry's e-mail
> > of October 10th - https://lkml.org/lkml/2017/10/10/401). I have been
> > wondering what could be causing that performance difference. Maybe it's
> > because requests can reside for a while in the hctx dispatch queue and hence
> > are unvisible for the scheduler while in the hctx dispatch queue? Should we
> > modify blk_mq_dispatch_rq_list() such that it puts back requests that have
> > not been accepted by .queue_rq() onto the scheduler queue(s) instead of to
> > the hctx dispatch queue? If we would make that change, would it allow us to
> > drop patch "blk-mq-sched: improve dispatching from sw queue"?
> 
> Yes, it's clear that even with the full series, we're not completely there
> yet. We are closer, though, and I do want to close that gap up as much
> as we can. I think everybody will be more motivated and have an easier time
> getting the last bit of the way there, once we have a good foundation in.
> 
> It may be the reason that you hint at, if we do see a lot of requeueing
> or BUSY in the test case. That would prematurely move requests from the
> schedulers knowledge and into the hctx->dispatch holding area. It'd be
> useful to have a standard SATA test run and see if we're missing merging
> in that case (since merging is what it boils down to). If we are, then
> it's not hctx->dispatch issues.

>From Gary's test result on the patches of .get_budget()/.put_budget()[1],
the sequential I/O performance is still not good, that means the
issue may not be in IO merge, because .get_buget/.put_budget is 
more helpful to do I/O merge than block legacy.

Actually in my virtio-scsi test, blk-mq has been better than block legacy
with the way of .get_budget()/.put_budget().


[1] https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V6.2_test


-- 
Ming

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH V6 5/5] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
  2017-10-09 11:24 [PATCH V6 0/5] blk-mq-sched: improve sequential I/O performance Ming Lei
                   ` (3 preceding siblings ...)
  2017-10-09 11:24 ` [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue Ming Lei
@ 2017-10-09 11:24 ` Ming Lei
  2017-10-10 18:26   ` Omar Sandoval
  4 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2017-10-09 11:24 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel
  Cc: Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval, Ming Lei

During dispatching, we moved all requests from hctx->dispatch to
one temporary list, then dispatch them one by one from this list.
Unfortunately during this period, run queue from other contexts
may think the queue is idle, then start to dequeue from sw/scheduler
queue and still try to dispatch because ->dispatch is empty. This way
hurts sequential I/O performance because requests are dequeued when
lld queue is busy.

This patch introduces the state of BLK_MQ_S_DISPATCH_BUSY to
make sure that request isn't dequeued until ->dispatch is
flushed.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq-sched.c   | 38 ++++++++++++++++++++++++++++++++------
 block/blk-mq.c         |  5 +++++
 include/linux/blk-mq.h |  1 +
 4 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index e5dccc9f6f1d..6c15487bc3ff 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -181,6 +181,7 @@ static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(SCHED_RESTART),
 	HCTX_STATE_NAME(TAG_WAITING),
 	HCTX_STATE_NAME(START_ON_RUN),
+	HCTX_STATE_NAME(DISPATCH_BUSY),
 };
 #undef HCTX_STATE_NAME
 
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 14b354f617e5..9f549711da84 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -95,6 +95,18 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 	struct elevator_queue *e = q->elevator;
 	LIST_HEAD(rq_list);
 
+	/*
+	 * If DISPATCH_BUSY is set, that means hw queue is busy
+	 * and requests in the list of hctx->dispatch need to
+	 * be flushed first, so return early.
+	 *
+	 * Wherever DISPATCH_BUSY is set, blk_mq_run_hw_queue()
+	 * will be run to try to make progress, so it is always
+	 * safe to check the state here.
+	 */
+	if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
+		return;
+
 	do {
 		struct request *rq = e->type->ops.mq.dispatch_request(hctx);
 
@@ -121,6 +133,10 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
 	LIST_HEAD(rq_list);
 	struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
 
+	/* See same comment in blk_mq_do_dispatch_sched() */
+	if (test_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state))
+		return;
+
 	do {
 		struct request *rq;
 
@@ -176,12 +192,22 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		if (blk_mq_dispatch_rq_list(q, &rq_list)) {
-			if (has_sched_dispatch)
-				blk_mq_do_dispatch_sched(hctx);
-			else
-				blk_mq_do_dispatch_ctx(hctx);
-		}
+		blk_mq_dispatch_rq_list(q, &rq_list);
+
+		/*
+		 * We may clear DISPATCH_BUSY just after it is set from
+		 * another context, the only cost is that one request is
+		 * dequeued a bit early, we can survive that. Given the
+		 * window is small enough, no need to worry about performance
+		 * effect.
+		 */
+		if (list_empty_careful(&hctx->dispatch))
+			clear_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
+
+		if (has_sched_dispatch)
+			blk_mq_do_dispatch_sched(hctx);
+		else
+			blk_mq_do_dispatch_ctx(hctx);
 	} else if (has_sched_dispatch) {
 		blk_mq_do_dispatch_sched(hctx);
 	} else if (q->queue_depth) {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 394cb75d66fa..06dda6182b7a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1172,6 +1172,11 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 
 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);
+		/*
+		 * DISPATCH_BUSY won't be cleared until all requests
+		 * in hctx->dispatch are dispatched successfully
+		 */
+		set_bit(BLK_MQ_S_DISPATCH_BUSY, &hctx->state);
 		spin_unlock(&hctx->lock);
 
 		/*
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 7b7a366a97f3..13f6c25fa461 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -172,6 +172,7 @@ enum {
 	BLK_MQ_S_SCHED_RESTART	= 2,
 	BLK_MQ_S_TAG_WAITING	= 3,
 	BLK_MQ_S_START_ON_RUN	= 4,
+	BLK_MQ_S_DISPATCH_BUSY	= 5,
 
 	BLK_MQ_MAX_DEPTH	= 10240,
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH V6 5/5] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
  2017-10-09 11:24 ` [PATCH V6 5/5] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed Ming Lei
@ 2017-10-10 18:26   ` Omar Sandoval
  0 siblings, 0 replies; 17+ messages in thread
From: Omar Sandoval @ 2017-10-10 18:26 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel, Bart Van Assche, Laurence Oberman, Paolo Valente,
	Oleksandr Natalenko, Tom Nguyen, linux-kernel, linux-scsi,
	Omar Sandoval

On Mon, Oct 09, 2017 at 07:24:24PM +0800, Ming Lei wrote:
> During dispatching, we moved all requests from hctx->dispatch to
> one temporary list, then dispatch them one by one from this list.
> Unfortunately during this period, run queue from other contexts
> may think the queue is idle, then start to dequeue from sw/scheduler
> queue and still try to dispatch because ->dispatch is empty. This way
> hurts sequential I/O performance because requests are dequeued when
> lld queue is busy.
> 
> This patch introduces the state of BLK_MQ_S_DISPATCH_BUSY to
> make sure that request isn't dequeued until ->dispatch is
> flushed.
> 
> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

I think this will do for now.

Reviewed-by: Omar Sandoval <osandov@fb.com>

> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq-debugfs.c |  1 +
>  block/blk-mq-sched.c   | 38 ++++++++++++++++++++++++++++++++------
>  block/blk-mq.c         |  5 +++++
>  include/linux/blk-mq.h |  1 +
>  4 files changed, 39 insertions(+), 6 deletions(-)

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-10-12 15:49 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-10-09 11:24 [PATCH V6 0/5] blk-mq-sched: improve sequential I/O performance Ming Lei
2017-10-09 11:24 ` [PATCH V6 1/5] blk-mq-sched: fix scheduler bad performance Ming Lei
2017-10-10 18:10   ` Omar Sandoval
2017-10-09 11:24 ` [PATCH V6 2/5] blk-mq-sched: move actual dispatching into one helper Ming Lei
2017-10-09 11:24 ` [PATCH V6 3/5] sbitmap: introduce __sbitmap_for_each_set() Ming Lei
2017-10-10 18:15   ` Omar Sandoval
2017-10-09 11:24 ` [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue Ming Lei
2017-10-10 18:23   ` Omar Sandoval
2017-10-12 10:01     ` Ming Lei
2017-10-12 14:52       ` Jens Axboe
2017-10-12 15:22         ` Ming Lei
2017-10-12 15:24           ` Jens Axboe
2017-10-12 15:33       ` Bart Van Assche
2017-10-12 15:37         ` Jens Axboe
2017-10-12 15:49           ` Ming Lei
2017-10-09 11:24 ` [PATCH V6 5/5] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed Ming Lei
2017-10-10 18:26   ` Omar Sandoval

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).