Linux block layer
 help / color / mirror / Atom feed
* Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request
From: NeilBrown @ 2017-04-03 21:25 UTC (permalink / raw)
  To: Michael Wang, linux-kernel@vger.kernel.org, linux-block,
	linux-raid
  Cc: Jens Axboe, Shaohua Li, Jinpu Wang
In-Reply-To: <9505ff12-7307-7dec-76b5-2a233a592634@profitbricks.com>

[-- Attachment #1: Type: text/plain, Size: 1927 bytes --]

On Mon, Apr 03 2017, Michael Wang wrote:

> blk_attempt_plug_merge() try to merge bio into request and chain them
> by 'bi_next', while after the bio is done inside request, we forgot to
> reset the 'bi_next'.
>
> This lead into BUG while removing all the underlying devices from md-raid1,
> the bio once go through:
>
>   md_do_sync()
>     sync_request()
>       generic_make_request()

This is a read request from the "first" device.

>         blk_queue_bio()
>           blk_attempt_plug_merge()
>             CHAINED HERE
>
> will keep chained and reused by:
>
>   raid1d()
>     sync_request_write()
>       generic_make_request()

This is a write request to some other device, isn't it?

If sync_request_write() is using a bio that has already been used, it
should call bio_reset() and fill in the details again.
However I don't see how that would happen.
Can you give specific details on the situation that triggers the bug?

Thanks,
NeilBrown


>         BUG_ON(bio->bi_next)
>
> After reset the 'bi_next' this can no longer happen.
>
> Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
> ---
>  block/blk-core.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 43b7d06..91223b2 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2619,8 +2619,10 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
>                 struct bio *bio = req->bio;
>                 unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
>
> -               if (bio_bytes == bio->bi_iter.bi_size)
> +               if (bio_bytes == bio->bi_iter.bi_size) {
>                         req->bio = bio->bi_next;
> +                       bio->bi_next = NULL;
> +               }
>
>                 req_bio_endio(req, bio, bio_bytes, error);
>
> -- 
> 2.5.0

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply

* [PATCH 0/5] blk-mq: scheduler and hw queue initialization fixes/enhancements
From: Omar Sandoval @ 2017-04-03 21:42 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

Hi, Jens,

This series has some fixes and enhancements for blk-mq:

- Patch 1 is a cleanup in preparation for the rest of the series
- Patch 2 is a fix necessary for patch 4 when scheduling is enabled,
  making sure we bring up new hardware queues with scheduler tags
- Patch 3 makes error handling in elevator_switch() more robust, making
  us fall back to none like you recommended last time
- Patch 4 is the remap fix from last week
- Patch 5 is an extension of patch 2 for multiqueue schedulers that
  allocate per-hctx data. Nothing in-tree needs it, but Kyber will.

Let me know if you'd prefer deferring this to 4.12 or want to apply 1-4
for 4.11. These are based on block/for-next, so the latter might require
a respin.

Thanks!

Omar Sandoval (5):
  blk-mq-sched: refactor scheduler initialization
  blk-mq-sched: set up scheduler tags when bringing up new queues
  blk-mq-sched: fix crash in switch error path
  blk-mq: remap queues when adding/removing hardware queues
  blk-mq-sched: provide hooks for initializing hardware queue data

 block/blk-mq-sched.c     | 178 +++++++++++++++++++++++++++++------------------
 block/blk-mq-sched.h     |  13 ++--
 block/blk-mq.c           |  25 +++++--
 block/blk-sysfs.c        |   2 +-
 block/elevator.c         | 114 +++++++++++++++---------------
 include/linux/elevator.h |   4 +-
 6 files changed, 198 insertions(+), 138 deletions(-)

-- 
2.12.2

^ permalink raw reply

* [PATCH 1/5] blk-mq-sched: refactor scheduler initialization
From: Omar Sandoval @ 2017-04-03 21:42 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491254827.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Preparation cleanup for the next couple of fixes, push
blk_mq_sched_setup() and e->ops.mq.init_sched() into a helper.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c | 82 ++++++++++++++++++++++++++++------------------------
 block/blk-mq-sched.h |  2 +-
 block/elevator.c     | 32 ++++++++------------
 3 files changed, 57 insertions(+), 59 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 09af8ff18719..e08ba915343e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -431,11 +431,45 @@ static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
 	}
 }
 
-int blk_mq_sched_setup(struct request_queue *q)
+static int blk_mq_sched_alloc_tags(struct request_queue *q,
+				   struct blk_mq_hw_ctx *hctx,
+				   unsigned int hctx_idx)
+{
+	struct blk_mq_tag_set *set = q->tag_set;
+	int ret;
+
+	hctx->sched_tags = blk_mq_alloc_rq_map(set, hctx_idx, q->nr_requests,
+					       set->reserved_tags);
+	if (!hctx->sched_tags)
+		return -ENOMEM;
+
+	ret = blk_mq_alloc_rqs(set, hctx->sched_tags, hctx_idx, q->nr_requests);
+	if (ret)
+		blk_mq_sched_free_tags(set, hctx, hctx_idx);
+
+	return ret;
+}
+
+void blk_mq_sched_teardown(struct request_queue *q)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
 	struct blk_mq_hw_ctx *hctx;
-	int ret, i;
+	int i;
+
+	queue_for_each_hw_ctx(q, hctx, i)
+		blk_mq_sched_free_tags(set, hctx, i);
+}
+
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned int i;
+	int ret;
+
+	if (!e) {
+		q->elevator = NULL;
+		return 0;
+	}
 
 	/*
 	 * Default to 256, since we don't split into sync/async like the
@@ -443,49 +477,21 @@ int blk_mq_sched_setup(struct request_queue *q)
 	 */
 	q->nr_requests = 2 * BLKDEV_MAX_RQ;
 
-	/*
-	 * We're switching to using an IO scheduler, so setup the hctx
-	 * scheduler tags and switch the request map from the regular
-	 * tags to scheduler tags. First allocate what we need, so we
-	 * can safely fail and fallback, if needed.
-	 */
-	ret = 0;
 	queue_for_each_hw_ctx(q, hctx, i) {
-		hctx->sched_tags = blk_mq_alloc_rq_map(set, i,
-				q->nr_requests, set->reserved_tags);
-		if (!hctx->sched_tags) {
-			ret = -ENOMEM;
-			break;
-		}
-		ret = blk_mq_alloc_rqs(set, hctx->sched_tags, i, q->nr_requests);
+		ret = blk_mq_sched_alloc_tags(q, hctx, i);
 		if (ret)
-			break;
+			goto err;
 	}
 
-	/*
-	 * If we failed, free what we did allocate
-	 */
-	if (ret) {
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (!hctx->sched_tags)
-				continue;
-			blk_mq_sched_free_tags(set, hctx, i);
-		}
-
-		return ret;
-	}
+	ret = e->ops.mq.init_sched(q, e);
+	if (ret)
+		goto err;
 
 	return 0;
-}
 
-void blk_mq_sched_teardown(struct request_queue *q)
-{
-	struct blk_mq_tag_set *set = q->tag_set;
-	struct blk_mq_hw_ctx *hctx;
-	int i;
-
-	queue_for_each_hw_ctx(q, hctx, i)
-		blk_mq_sched_free_tags(set, hctx, i);
+err:
+	blk_mq_sched_teardown(q);
+	return ret;
 }
 
 int blk_mq_sched_init(struct request_queue *q)
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index a75b16b123f7..873f9af5a35b 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -32,7 +32,7 @@ void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
 			struct list_head *rq_list,
 			struct request *(*get_rq)(struct blk_mq_hw_ctx *));
 
-int blk_mq_sched_setup(struct request_queue *q);
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
 void blk_mq_sched_teardown(struct request_queue *q);
 
 int blk_mq_sched_init(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index 01139f549b5b..f236ef1d2be9 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -242,17 +242,12 @@ int elevator_init(struct request_queue *q, char *name)
 		}
 	}
 
-	if (e->uses_mq) {
-		err = blk_mq_sched_setup(q);
-		if (!err)
-			err = e->ops.mq.init_sched(q, e);
-	} else
+	if (e->uses_mq)
+		err = blk_mq_init_sched(q, e);
+	else
 		err = e->ops.sq.elevator_init_fn(q, e);
-	if (err) {
-		if (e->uses_mq)
-			blk_mq_sched_teardown(q);
+	if (err)
 		elevator_put(e);
-	}
 	return err;
 }
 EXPORT_SYMBOL(elevator_init);
@@ -987,21 +982,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	}
 
 	/* allocate, init and register new elevator */
-	if (new_e) {
-		if (new_e->uses_mq) {
-			err = blk_mq_sched_setup(q);
-			if (!err)
-				err = new_e->ops.mq.init_sched(q, new_e);
-		} else
-			err = new_e->ops.sq.elevator_init_fn(q, new_e);
-		if (err)
-			goto fail_init;
+	if (q->mq_ops)
+		err = blk_mq_init_sched(q, new_e);
+	else
+		err = new_e->ops.sq.elevator_init_fn(q, new_e);
+	if (err)
+		goto fail_init;
 
+	if (new_e) {
 		err = elv_register_queue(q);
 		if (err)
 			goto fail_register;
-	} else
-		q->elevator = NULL;
+	}
 
 	/* done, kill the old one and finish */
 	if (old) {
-- 
2.12.2

^ permalink raw reply related

* [PATCH 2/5] blk-mq-sched: set up scheduler tags when bringing up new queues
From: Omar Sandoval @ 2017-04-03 21:42 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491254827.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

If a new hardware queue is added at runtime, we don't allocate scheduler
tags for it, leading to a crash. This hooks up the scheduler framework
to blk_mq_{init,exit}_hctx() to make sure everything gets properly
initialized/freed.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c | 22 ++++++++++++++++++++++
 block/blk-mq-sched.h |  5 +++++
 block/blk-mq.c       |  9 ++++++++-
 3 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index e08ba915343e..3fd918bb13a2 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -460,6 +460,28 @@ void blk_mq_sched_teardown(struct request_queue *q)
 		blk_mq_sched_free_tags(set, hctx, i);
 }
 
+int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+			   unsigned int hctx_idx)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (!e)
+		return 0;
+
+	return blk_mq_sched_alloc_tags(q, hctx, hctx_idx);
+}
+
+void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+			    unsigned int hctx_idx)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (!e)
+		return;
+
+	blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
+}
+
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 {
 	struct blk_mq_hw_ctx *hctx;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 873f9af5a35b..19db25e0c95a 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -35,6 +35,11 @@ void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
 void blk_mq_sched_teardown(struct request_queue *q);
 
+int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+			   unsigned int hctx_idx);
+void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+			    unsigned int hctx_idx);
+
 int blk_mq_sched_init(struct request_queue *q);
 
 static inline bool
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 061fc2cc88d3..ac830cb488d7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1839,6 +1839,8 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 				       hctx->fq->flush_rq, hctx_idx,
 				       flush_start_tag + hctx_idx);
 
+	blk_mq_sched_exit_hctx(q, hctx, hctx_idx);
+
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
 
@@ -1905,9 +1907,12 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
 		goto free_bitmap;
 
+	if (blk_mq_sched_init_hctx(q, hctx, hctx_idx))
+		goto exit_hctx;
+
 	hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size);
 	if (!hctx->fq)
-		goto exit_hctx;
+		goto sched_exit_hctx;
 
 	if (set->ops->init_request &&
 	    set->ops->init_request(set->driver_data,
@@ -1922,6 +1927,8 @@ static int blk_mq_init_hctx(struct request_queue *q,
 
  free_fq:
 	kfree(hctx->fq);
+ sched_exit_hctx:
+	blk_mq_sched_exit_hctx(q, hctx, hctx_idx);
  exit_hctx:
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
-- 
2.12.2

^ permalink raw reply related

* [PATCH 4/5] blk-mq: remap queues when adding/removing hardware queues
From: Omar Sandoval @ 2017-04-03 21:42 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491254827.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

blk_mq_update_nr_hw_queues() used to remap hardware queues, which is the
behavior that drivers expect. However, commit 4e68a011428a changed
blk_mq_queue_reinit() to not remap queues for the case of CPU
hotplugging, inadvertently making blk_mq_update_nr_hw_queues() not remap
queues as well. This breaks, for example, NBD's multi-connection mode,
leaving the added hardware queues unused. Fix it by making
blk_mq_update_nr_hw_queues() explicitly remap the queues.

Fixes: 4e68a011428a ("blk-mq: don't redistribute hardware queues on a CPU hotplug event")
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 477951d10cc9..b49fdfde7e06 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2483,6 +2483,14 @@ static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 	return 0;
 }
 
+static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
+{
+	if (set->ops->map_queues)
+		return set->ops->map_queues(set);
+	else
+		return blk_mq_map_queues(set);
+}
+
 /*
  * Alloc a tag set to be associated with one or more request queues.
  * May fail with EINVAL for various error conditions. May adjust the
@@ -2537,10 +2545,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (!set->mq_map)
 		goto out_free_tags;
 
-	if (set->ops->map_queues)
-		ret = set->ops->map_queues(set);
-	else
-		ret = blk_mq_map_queues(set);
+	ret = blk_mq_update_queue_map(set);
 	if (ret)
 		goto out_free_mq_map;
 
@@ -2632,6 +2637,7 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 		blk_mq_freeze_queue(q);
 
 	set->nr_hw_queues = nr_hw_queues;
+	blk_mq_update_queue_map(set);
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_realloc_hw_ctxs(set, q);
 		blk_mq_queue_reinit(q, cpu_online_mask);
-- 
2.12.2

^ permalink raw reply related

* [PATCH 3/5] blk-mq-sched: fix crash in switch error path
From: Omar Sandoval @ 2017-04-03 21:42 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491254827.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
back to the original scheduler. However, at this point, we've already
torn down the original scheduler's tags, so this causes a crash. Doing
the fallback like the legacy elevator path is much harder for mq, so fix
it by just falling back to none, instead.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c     | 13 +++++--
 block/blk-mq-sched.h     |  2 +-
 block/blk-mq.c           |  2 --
 block/blk-sysfs.c        |  2 +-
 block/elevator.c         | 94 +++++++++++++++++++++++++++---------------------
 include/linux/elevator.h |  2 +-
 6 files changed, 67 insertions(+), 48 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 3fd918bb13a2..63281fe34090 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -450,7 +450,7 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
 	return ret;
 }
 
-void blk_mq_sched_teardown(struct request_queue *q)
+static void blk_mq_sched_tags_teardown(struct request_queue *q)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
 	struct blk_mq_hw_ctx *hctx;
@@ -512,10 +512,19 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	return 0;
 
 err:
-	blk_mq_sched_teardown(q);
+	blk_mq_sched_tags_teardown(q);
+	q->elevator = NULL;
 	return ret;
 }
 
+void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
+{
+	if (e->type->ops.mq.exit_sched)
+		e->type->ops.mq.exit_sched(e);
+	blk_mq_sched_tags_teardown(q);
+	q->elevator = NULL;
+}
+
 int blk_mq_sched_init(struct request_queue *q)
 {
 	int ret;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 19db25e0c95a..e704956e0862 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -33,7 +33,7 @@ void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
 			struct request *(*get_rq)(struct blk_mq_hw_ctx *));
 
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
-void blk_mq_sched_teardown(struct request_queue *q);
+void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
 
 int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			   unsigned int hctx_idx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ac830cb488d7..477951d10cc9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2153,8 +2153,6 @@ void blk_mq_release(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
 
-	blk_mq_sched_teardown(q);
-
 	/* hctx kobj stays in hctx */
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 45854266e398..c47db43a40cc 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -803,7 +803,7 @@ static void blk_release_queue(struct kobject *kobj)
 
 	if (q->elevator) {
 		ioc_clear_queue(q);
-		elevator_exit(q->elevator);
+		elevator_exit(q, q->elevator);
 	}
 
 	blk_free_queue_stats(q->stats);
diff --git a/block/elevator.c b/block/elevator.c
index f236ef1d2be9..dbeecf7be719 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -252,11 +252,11 @@ int elevator_init(struct request_queue *q, char *name)
 }
 EXPORT_SYMBOL(elevator_init);
 
-void elevator_exit(struct elevator_queue *e)
+void elevator_exit(struct request_queue *q, struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
 	if (e->uses_mq && e->type->ops.mq.exit_sched)
-		e->type->ops.mq.exit_sched(e);
+		blk_mq_exit_sched(q, e);
 	else if (!e->uses_mq && e->type->ops.sq.elevator_exit_fn)
 		e->type->ops.sq.elevator_exit_fn(e);
 	mutex_unlock(&e->sysfs_lock);
@@ -941,6 +941,45 @@ void elv_unregister(struct elevator_type *e)
 }
 EXPORT_SYMBOL_GPL(elv_unregister);
 
+static int elevator_switch_mq(struct request_queue *q,
+			      struct elevator_type *new_e)
+{
+	int ret;
+
+	blk_mq_freeze_queue(q);
+	blk_mq_quiesce_queue(q);
+
+	if (q->elevator) {
+		if (q->elevator->registered)
+			elv_unregister_queue(q);
+		ioc_clear_queue(q);
+		elevator_exit(q, q->elevator);
+	}
+
+	ret = blk_mq_init_sched(q, new_e);
+	if (ret)
+		goto out;
+
+	if (new_e) {
+		ret = elv_register_queue(q);
+		if (ret) {
+			elevator_exit(q, q->elevator);
+			goto out;
+		}
+	}
+
+	if (new_e)
+		blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+	else
+		blk_add_trace_msg(q, "elv switch: none");
+
+out:
+	blk_mq_unfreeze_queue(q);
+	blk_mq_start_stopped_hw_queues(q, true);
+	return ret;
+
+}
+
 /*
  * switch to new_e io scheduler. be careful not to introduce deadlocks -
  * we don't free the old io scheduler, before we have allocated what we
@@ -953,10 +992,8 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	bool old_registered = false;
 	int err;
 
-	if (q->mq_ops) {
-		blk_mq_freeze_queue(q);
-		blk_mq_quiesce_queue(q);
-	}
+	if (q->mq_ops)
+		return elevator_switch_mq(q, new_e);
 
 	/*
 	 * Turn on BYPASS and drain all requests w/ elevator private data.
@@ -968,11 +1005,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	if (old) {
 		old_registered = old->registered;
 
-		if (old->uses_mq)
-			blk_mq_sched_teardown(q);
-
-		if (!q->mq_ops)
-			blk_queue_bypass_start(q);
+		blk_queue_bypass_start(q);
 
 		/* unregister and clear all auxiliary data of the old elevator */
 		if (old_registered)
@@ -982,53 +1015,32 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	}
 
 	/* allocate, init and register new elevator */
-	if (q->mq_ops)
-		err = blk_mq_init_sched(q, new_e);
-	else
-		err = new_e->ops.sq.elevator_init_fn(q, new_e);
+	err = new_e->ops.sq.elevator_init_fn(q, new_e);
 	if (err)
 		goto fail_init;
 
-	if (new_e) {
-		err = elv_register_queue(q);
-		if (err)
-			goto fail_register;
-	}
+	err = elv_register_queue(q);
+	if (err)
+		goto fail_register;
 
 	/* done, kill the old one and finish */
 	if (old) {
-		elevator_exit(old);
-		if (!q->mq_ops)
-			blk_queue_bypass_end(q);
+		elevator_exit(q, old);
+		blk_queue_bypass_end(q);
 	}
 
-	if (q->mq_ops) {
-		blk_mq_unfreeze_queue(q);
-		blk_mq_start_stopped_hw_queues(q, true);
-	}
-
-	if (new_e)
-		blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
-	else
-		blk_add_trace_msg(q, "elv switch: none");
+	blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
 
 	return 0;
 
 fail_register:
-	if (q->mq_ops)
-		blk_mq_sched_teardown(q);
-	elevator_exit(q->elevator);
+	elevator_exit(q, q->elevator);
 fail_init:
 	/* switch failed, restore and re-register old elevator */
 	if (old) {
 		q->elevator = old;
 		elv_register_queue(q);
-		if (!q->mq_ops)
-			blk_queue_bypass_end(q);
-	}
-	if (q->mq_ops) {
-		blk_mq_unfreeze_queue(q);
-		blk_mq_start_stopped_hw_queues(q, true);
+		blk_queue_bypass_end(q);
 	}
 
 	return err;
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index aebecc4ed088..22d39e8d4de1 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -211,7 +211,7 @@ extern ssize_t elv_iosched_show(struct request_queue *, char *);
 extern ssize_t elv_iosched_store(struct request_queue *, const char *, size_t);
 
 extern int elevator_init(struct request_queue *, char *);
-extern void elevator_exit(struct elevator_queue *);
+extern void elevator_exit(struct request_queue *, struct elevator_queue *);
 extern int elevator_change(struct request_queue *, const char *);
 extern bool elv_bio_merge_ok(struct request *, struct bio *);
 extern struct elevator_queue *elevator_alloc(struct request_queue *,
-- 
2.12.2

^ permalink raw reply related

* [PATCH 5/5] blk-mq-sched: provide hooks for initializing hardware queue data
From: Omar Sandoval @ 2017-04-03 21:42 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491254827.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Schedulers need to be informed when a hardware queue is added or removed
at runtime so they can allocate/free per-hardware queue data. So,
replace the blk_mq_sched_init_hctx_data() helper, which only makes sense
at init time, with .init_hctx() and .exit_hctx() hooks.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c     | 81 +++++++++++++++++++++++++-----------------------
 block/blk-mq-sched.h     |  4 ---
 include/linux/elevator.h |  2 ++
 3 files changed, 45 insertions(+), 42 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 63281fe34090..435f91bc724f 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -30,43 +30,6 @@ void blk_mq_sched_free_hctx_data(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
 
-int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
-				int (*init)(struct blk_mq_hw_ctx *),
-				void (*exit)(struct blk_mq_hw_ctx *))
-{
-	struct blk_mq_hw_ctx *hctx;
-	int ret;
-	int i;
-
-	queue_for_each_hw_ctx(q, hctx, i) {
-		hctx->sched_data = kmalloc_node(size, GFP_KERNEL, hctx->numa_node);
-		if (!hctx->sched_data) {
-			ret = -ENOMEM;
-			goto error;
-		}
-
-		if (init) {
-			ret = init(hctx);
-			if (ret) {
-				/*
-				 * We don't want to give exit() a partially
-				 * initialized sched_data. init() must clean up
-				 * if it fails.
-				 */
-				kfree(hctx->sched_data);
-				hctx->sched_data = NULL;
-				goto error;
-			}
-		}
-	}
-
-	return 0;
-error:
-	blk_mq_sched_free_hctx_data(q, exit);
-	return ret;
-}
-EXPORT_SYMBOL_GPL(blk_mq_sched_init_hctx_data);
-
 static void __blk_mq_sched_assign_ioc(struct request_queue *q,
 				      struct request *rq,
 				      struct bio *bio,
@@ -464,11 +427,24 @@ int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			   unsigned int hctx_idx)
 {
 	struct elevator_queue *e = q->elevator;
+	int ret;
 
 	if (!e)
 		return 0;
 
-	return blk_mq_sched_alloc_tags(q, hctx, hctx_idx);
+	ret = blk_mq_sched_alloc_tags(q, hctx, hctx_idx);
+	if (ret)
+		return ret;
+
+	if (e->type->ops.mq.init_hctx) {
+		ret = e->type->ops.mq.init_hctx(hctx, hctx_idx);
+		if (ret) {
+			blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
+			return ret;
+		}
+	}
+
+	return 0;
 }
 
 void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
@@ -479,12 +455,18 @@ void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 	if (!e)
 		return;
 
+	if (e->type->ops.mq.exit_hctx && hctx->sched_data) {
+		e->type->ops.mq.exit_hctx(hctx, hctx_idx);
+		hctx->sched_data = NULL;
+	}
+
 	blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
 }
 
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 {
 	struct blk_mq_hw_ctx *hctx;
+	struct elevator_queue *eq;
 	unsigned int i;
 	int ret;
 
@@ -509,6 +491,18 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	if (ret)
 		goto err;
 
+	if (e->ops.mq.init_hctx) {
+		queue_for_each_hw_ctx(q, hctx, i) {
+			ret = e->ops.mq.init_hctx(hctx, i);
+			if (ret) {
+				eq = q->elevator;
+				blk_mq_exit_sched(q, eq);
+				kobject_put(&eq->kobj);
+				return ret;
+			}
+		}
+	}
+
 	return 0;
 
 err:
@@ -519,6 +513,17 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 
 void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
 {
+	struct blk_mq_hw_ctx *hctx;
+	unsigned int i;
+
+	if (e->type->ops.mq.exit_hctx) {
+		queue_for_each_hw_ctx(q, hctx, i) {
+			if (hctx->sched_data) {
+				e->type->ops.mq.exit_hctx(hctx, i);
+				hctx->sched_data = NULL;
+			}
+		}
+	}
 	if (e->type->ops.mq.exit_sched)
 		e->type->ops.mq.exit_sched(e);
 	blk_mq_sched_tags_teardown(q);
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index e704956e0862..c6e760df0fb4 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -4,10 +4,6 @@
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
 
-int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
-				int (*init)(struct blk_mq_hw_ctx *),
-				void (*exit)(struct blk_mq_hw_ctx *));
-
 void blk_mq_sched_free_hctx_data(struct request_queue *q,
 				 void (*exit)(struct blk_mq_hw_ctx *));
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 22d39e8d4de1..b7ec315ee7e7 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -93,6 +93,8 @@ struct blk_mq_hw_ctx;
 struct elevator_mq_ops {
 	int (*init_sched)(struct request_queue *, struct elevator_type *);
 	void (*exit_sched)(struct elevator_queue *);
+	int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
+	void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
 
 	bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
 	bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *);
-- 
2.12.2

^ permalink raw reply related

* [PATCH v2 0/5] Avoid that scsi-mq queue processing stalls
From: Bart Van Assche @ 2017-04-03 23:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Martin K . Petersen, James Bottomley,
	Bart Van Assche

Hello Jens,

The five patches in this patch series fix the queue lockup I reported
last week on the linux-block mailing list. Please consider these patches
for kernel v4.11.

Thanks,

Bart.

Changes compared to v1 of this patch:
- Reworked scsi_restart_queues() such that it no longer takes the SCSI
  host lock.
- Added two patches - one for exporting blk_mq_sched_restart_hctx() and
  another one to make iterating with RCU over blk_mq_tag_set.tag_list safe.

Bart Van Assche (5):
  block: Export blk_mq_sched_restart_hctx()
  block: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list
  blk-mq: Introduce blk_mq_ops.restart_hctx
  scsi: Add scsi_restart_queues()
  scsi: Ensure that scsi_run_queue() runs all hardware queues

 block/blk-mq-sched.c    | 22 ++++++++--------------
 block/blk-mq-sched.h    | 14 --------------
 block/blk-mq.c          |  6 ++++++
 drivers/scsi/scsi_lib.c | 20 +++++++++++++++++---
 include/linux/blk-mq.h  |  8 ++++++++
 include/linux/blkdev.h  |  1 -
 6 files changed, 39 insertions(+), 32 deletions(-)

-- 
2.12.0

^ permalink raw reply

* [PATCH v2 1/5] blk-mq: Export blk_mq_sched_restart_hctx()
From: Bart Van Assche @ 2017-04-03 23:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Martin K . Petersen, James Bottomley,
	Bart Van Assche, Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170403232228.11208-1-bart.vanassche@sandisk.com>

Since a later patch will add a call to this function from the
SCSI core, export this function. Move the BLK_MQ_S_SCHED_RESTART
bit test from blk_mq_sched_restart_hctx() into the callers of
this function. This leads to some code duplication but that will
be addressed in another patch in this series.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
---
 block/blk-mq-sched.c   | 17 +++++++++--------
 include/linux/blk-mq.h |  1 +
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 09af8ff18719..414ed4b3d266 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -317,14 +317,13 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
 	return true;
 }
 
-static void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
+void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
 {
-	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
-		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
-		if (blk_mq_hctx_has_pending(hctx))
-			blk_mq_run_hw_queue(hctx, true);
-	}
+	clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+	if (blk_mq_hctx_has_pending(hctx))
+		blk_mq_run_hw_queue(hctx, true);
 }
+EXPORT_SYMBOL(blk_mq_sched_restart_hctx);
 
 void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx)
 {
@@ -334,9 +333,11 @@ void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx)
 	if (test_bit(QUEUE_FLAG_RESTART, &q->queue_flags)) {
 		if (test_and_clear_bit(QUEUE_FLAG_RESTART, &q->queue_flags)) {
 			queue_for_each_hw_ctx(q, hctx, i)
-				blk_mq_sched_restart_hctx(hctx);
+				if (test_bit(BLK_MQ_S_SCHED_RESTART,
+					     &hctx->state))
+					blk_mq_sched_restart_hctx(hctx);
 		}
-	} else {
+	} else if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
 		blk_mq_sched_restart_hctx(hctx);
 	}
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index ea2e9dcd3aef..f62f3ce2dc65 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -237,6 +237,7 @@ void blk_mq_stop_hw_queues(struct request_queue *q);
 void blk_mq_start_hw_queues(struct request_queue *q);
 void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
+void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx);
 void blk_mq_run_hw_queues(struct request_queue *q, bool async);
 void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
-- 
2.12.0

^ permalink raw reply related

* [PATCH v2 2/5] blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list
From: Bart Van Assche @ 2017-04-03 23:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Martin K . Petersen, James Bottomley,
	Bart Van Assche, Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170403232228.11208-1-bart.vanassche@sandisk.com>

A later patch in this series will namely use RCU to iterate over
this list.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
---
 block/blk-mq.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 061fc2cc88d3..8fb983e6e2e4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2093,6 +2093,8 @@ static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set, bool shared)
 {
 	struct request_queue *q;
 
+	lockdep_assert_held(&set->tag_list_lock);
+
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_freeze_queue(q);
 		queue_set_hctx_shared(q, shared);
@@ -2113,6 +2115,8 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 		blk_mq_update_tag_set_depth(set, false);
 	}
 	mutex_unlock(&set->tag_list_lock);
+
+	synchronize_rcu();
 }
 
 static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
@@ -2618,6 +2622,8 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 {
 	struct request_queue *q;
 
+	lockdep_assert_held(&set->tag_list_lock);
+
 	if (nr_hw_queues > nr_cpu_ids)
 		nr_hw_queues = nr_cpu_ids;
 	if (nr_hw_queues < 1 || nr_hw_queues == set->nr_hw_queues)
-- 
2.12.0

^ permalink raw reply related

* [PATCH v2 3/5] blk-mq: Introduce blk_mq_ops.restart_hctx
From: Bart Van Assche @ 2017-04-03 23:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Martin K . Petersen, James Bottomley,
	Bart Van Assche, Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170403232228.11208-1-bart.vanassche@sandisk.com>

If a tag set is shared among multiple hardware queues, leave
it to the block driver to rerun hardware queues. Hence remove
QUEUE_FLAG_RESTART and introduce blk_mq_ops.restart_hctx.
Remove blk_mq_sched_mark_restart_queue() because this
function has no callers.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
---
 block/blk-mq-sched.c   | 15 ++++-----------
 block/blk-mq-sched.h   | 14 --------------
 include/linux/blk-mq.h |  7 +++++++
 include/linux/blkdev.h |  1 -
 4 files changed, 11 insertions(+), 26 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 414ed4b3d266..f0c691a3ef9e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -328,18 +328,11 @@ EXPORT_SYMBOL(blk_mq_sched_restart_hctx);
 void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx)
 {
 	struct request_queue *q = hctx->queue;
-	unsigned int i;
-
-	if (test_bit(QUEUE_FLAG_RESTART, &q->queue_flags)) {
-		if (test_and_clear_bit(QUEUE_FLAG_RESTART, &q->queue_flags)) {
-			queue_for_each_hw_ctx(q, hctx, i)
-				if (test_bit(BLK_MQ_S_SCHED_RESTART,
-					     &hctx->state))
-					blk_mq_sched_restart_hctx(hctx);
-		}
-	} else if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
+
+	if (q->mq_ops->restart_hctx)
+		q->mq_ops->restart_hctx(q, hctx);
+	else if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
 		blk_mq_sched_restart_hctx(hctx);
-	}
 }
 
 /*
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index a75b16b123f7..fe62b1eccf4c 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -131,20 +131,6 @@ static inline void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
 		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 }
 
-/*
- * Mark a hardware queue and the request queue it belongs to as needing a
- * restart.
- */
-static inline void blk_mq_sched_mark_restart_queue(struct blk_mq_hw_ctx *hctx)
-{
-	struct request_queue *q = hctx->queue;
-
-	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
-	if (!test_bit(QUEUE_FLAG_RESTART, &q->queue_flags))
-		set_bit(QUEUE_FLAG_RESTART, &q->queue_flags);
-}
-
 static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
 {
 	return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index f62f3ce2dc65..ebeea36ff9cd 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -86,6 +86,7 @@ struct blk_mq_queue_data {
 };
 
 typedef int (queue_rq_fn)(struct blk_mq_hw_ctx *, const struct blk_mq_queue_data *);
+typedef void (restart_fn)(struct request_queue *q, struct blk_mq_hw_ctx *hctx);
 typedef enum blk_eh_timer_return (timeout_fn)(struct request *, bool);
 typedef int (init_hctx_fn)(struct blk_mq_hw_ctx *, void *, unsigned int);
 typedef void (exit_hctx_fn)(struct blk_mq_hw_ctx *, unsigned int);
@@ -109,6 +110,12 @@ struct blk_mq_ops {
 	queue_rq_fn		*queue_rq;
 
 	/*
+	 * Called upon request completion to rerun all hctxs that share a tag
+	 * set with the hctx for which a request finished.
+	 */
+	restart_fn		*restart_hctx;
+
+	/*
 	 * Called on request timeout
 	 */
 	timeout_fn		*timeout;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a2dc6b390d48..a80543ec8be7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -615,7 +615,6 @@ struct request_queue {
 #define QUEUE_FLAG_FLUSH_NQ    25	/* flush not queueuable */
 #define QUEUE_FLAG_DAX         26	/* device supports DAX */
 #define QUEUE_FLAG_STATS       27	/* track rq completion times */
-#define QUEUE_FLAG_RESTART     28	/* queue needs restart at completion */
 #define QUEUE_FLAG_POLL_STATS  29	/* collecting stats for hybrid polling */
 #define QUEUE_FLAG_REGISTERED  30	/* queue has been registered to a disk */
 
-- 
2.12.0

^ permalink raw reply related

* [PATCH v2 4/5] scsi: Add scsi_restart_hctx()
From: Bart Van Assche @ 2017-04-03 23:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Martin K . Petersen, James Bottomley,
	Bart Van Assche, Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170403232228.11208-1-bart.vanassche@sandisk.com>

This patch avoids that if multiple SCSI devices are associated with
a SCSI host that a queue can get stuck if scsi_queue_rq() returns
"busy".

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
---
 drivers/scsi/scsi_lib.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index c1519660824b..0e240aebc150 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -555,6 +555,21 @@ void scsi_run_host_queues(struct Scsi_Host *shost)
 		scsi_run_queue(sdev->request_queue);
 }
 
+static void scsi_restart_hctx(struct request_queue *q,
+			      struct blk_mq_hw_ctx *hctx)
+{
+	struct blk_mq_tags *tags = hctx->tags;
+	struct blk_mq_tag_set *set = q->tag_set;
+	int i;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(q, &set->tag_list, tag_set_list)
+		queue_for_each_hw_ctx(q, hctx, i)
+			if (hctx->tags == tags)
+				blk_mq_sched_restart_hctx(hctx);
+	rcu_read_unlock();
+}
+
 static void scsi_uninit_cmd(struct scsi_cmnd *cmd)
 {
 	if (!blk_rq_is_passthrough(cmd->request)) {
@@ -2156,6 +2171,7 @@ struct request_queue *scsi_alloc_queue(struct scsi_device *sdev)
 
 static const struct blk_mq_ops scsi_mq_ops = {
 	.queue_rq	= scsi_queue_rq,
+	.restart_hctx	= scsi_restart_hctx,
 	.complete	= scsi_softirq_done,
 	.timeout	= scsi_timeout,
 	.init_request	= scsi_init_request,
-- 
2.12.0

^ permalink raw reply related

* [PATCH v2 5/5] scsi: Ensure that scsi_run_queue() runs all hardware queues
From: Bart Van Assche @ 2017-04-03 23:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Martin K . Petersen, James Bottomley,
	Bart Van Assche, Christoph Hellwig
In-Reply-To: <20170403232228.11208-1-bart.vanassche@sandisk.com>

commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped
queues") removed the blk_mq_stop_hw_queue() call from scsi_queue_rq()
for the BLK_MQ_RQ_QUEUE_BUSY case. blk_mq_start_stopped_hw_queues()
only runs queues that had been stopped. Hence change the
blk_mq_start_stopped_hw_queues() call in scsi_run_queue() into
blk_mq_run_hw_queues(). Remove the blk_mq_start_stopped_hw_queues()
call from scsi_end_request() because __blk_mq_finish_request()
already runs all hardware queues if needed.

Fixes: commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped queues")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/scsi_lib.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0e240aebc150..4459b18f62a1 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -496,7 +496,7 @@ static void scsi_run_queue(struct request_queue *q)
 		scsi_starved_list_run(sdev->host);
 
 	if (q->mq_ops)
-		blk_mq_start_stopped_hw_queues(q, false);
+		blk_mq_run_hw_queues(q, false);
 	else
 		blk_run_queue(q);
 }
@@ -681,8 +681,6 @@ static bool scsi_end_request(struct request *req, int error,
 		if (scsi_target(sdev)->single_lun ||
 		    !list_empty(&sdev->host->starved_list))
 			kblockd_schedule_work(&sdev->requeue_work);
-		else
-			blk_mq_start_stopped_hw_queues(q, true);
 	} else {
 		unsigned long flags;
 
-- 
2.12.0

^ permalink raw reply related

* Re: [PATCH v3] blk-mq: remap queues when adding/removing hardware queues
From: Christoph Hellwig @ 2017-04-04  6:24 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Keith Busch,
	Josef Bacik, kernel-team
In-Reply-To: <9753ebd0c51a9d49f110a6d0d00888170905d97a.1490993257.git.osandov@fb.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH rfc 1/6] mlx5: convert to generic pci_alloc_irq_vectors
From: Christoph Hellwig @ 2017-04-04  6:27 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed,
	Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-2-git-send-email-sagi@grimberg.me>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH rfc 2/6] mlx5: move affinity hints assignments to generic code
From: Christoph Hellwig @ 2017-04-04  6:32 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed,
	Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-3-git-send-email-sagi@grimberg.me>

> @@ -1375,7 +1375,8 @@ static void mlx5e_close_cq(struct mlx5e_cq *cq)
>  
>  static int mlx5e_get_cpu(struct mlx5e_priv *priv, int ix)
>  {
> -	return cpumask_first(priv->mdev->priv.irq_info[ix].mask);
> +	return cpumask_first(pci_irq_get_affinity(priv->mdev->pdev,
> +			MLX5_EQ_VEC_COMP_BASE + ix));

This looks ok for now, but if we look at the callers we'd probably
want to make direct use of pci_irq_get_node and pci_irq_get_affinity for
the uses directly in mlx5e_open_channel as well as the stored away
->cpu field.  But maybe that should be left for another patch after
this one.

> +	struct irq_affinity irqdesc = { .pre_vectors = MLX5_EQ_VEC_COMP_BASE, };

I usually move assignments inside structures onto a separate line to make
it more readable, e.g.

	struct irq_affinity irqdesc = {
		.pre_vectors = MLX5_EQ_VEC_COMP_BASE,
	};

Otherwise this looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH rfc 3/6] RDMA/core: expose affinity mappings per completion vector
From: Christoph Hellwig @ 2017-04-04  6:32 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed,
	Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-4-git-send-email-sagi@grimberg.me>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH rfc 4/6] mlx5: support ->get_vector_affinity
From: Christoph Hellwig @ 2017-04-04  6:33 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed,
	Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-5-git-send-email-sagi@grimberg.me>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH rfc 5/6] block: Add rdma affinity based queue mapping helper
From: Christoph Hellwig @ 2017-04-04  6:33 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed,
	Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-6-git-send-email-sagi@grimberg.me>

On Sun, Apr 02, 2017 at 04:41:31PM +0300, Sagi Grimberg wrote:
> Like pci and virtio, we add a rdma helper for affinity
> spreading. This achieves optimal mq affinity assignments
> according to the underlying rdma device affinity maps.
> 
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>  block/Kconfig               |  5 ++++
>  block/Makefile              |  1 +
>  block/blk-mq-rdma.c         | 56 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/blk-mq-rdma.h | 10 ++++++++
>  4 files changed, 72 insertions(+)
>  create mode 100644 block/blk-mq-rdma.c
>  create mode 100644 include/linux/blk-mq-rdma.h
> 
> diff --git a/block/Kconfig b/block/Kconfig
> index 89cd28f8d051..3ab42bbb06d5 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -206,4 +206,9 @@ config BLK_MQ_VIRTIO
>  	depends on BLOCK && VIRTIO
>  	default y
>  
> +config BLK_MQ_RDMA
> +	bool
> +	depends on BLOCK && INFINIBAND
> +	default y
> +
>  source block/Kconfig.iosched
> diff --git a/block/Makefile b/block/Makefile
> index 081bb680789b..4498603dbc83 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -26,6 +26,7 @@ obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
>  obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o
>  obj-$(CONFIG_BLK_MQ_PCI)	+= blk-mq-pci.o
>  obj-$(CONFIG_BLK_MQ_VIRTIO)	+= blk-mq-virtio.o
> +obj-$(CONFIG_BLK_MQ_RDMA)	+= blk-mq-rdma.o
>  obj-$(CONFIG_BLK_DEV_ZONED)	+= blk-zoned.o
>  obj-$(CONFIG_BLK_WBT)		+= blk-wbt.o
>  obj-$(CONFIG_BLK_DEBUG_FS)	+= blk-mq-debugfs.o
> diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
> new file mode 100644
> index 000000000000..d402f7c93528
> --- /dev/null
> +++ b/block/blk-mq-rdma.c
> @@ -0,0 +1,56 @@
> +/*
> + * Copyright (c) 2017 Sagi Grimberg.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +#include <linux/blk-mq.h>
> +#include <linux/blk-mq-rdma.h>
> +#include <rdma/ib_verbs.h>
> +#include <linux/module.h>
> +#include "blk-mq.h"
> +
> +/**
> + * blk_mq_rdma_map_queues - provide a default queue mapping for rdma device
> + * @set:	tagset to provide the mapping for
> + * @dev:	rdma device associated with @set.
> + * @first_vec:	first interrupt vectors to use for queues (usually 0)
> + *
> + * This function assumes the rdma device @dev has at least as many available
> + * interrupt vetors as @set has queues.  It will then query it's affinity mask
> + * and built queue mapping that maps a queue to the CPUs that have irq affinity
> + * for the corresponding vector.
> + *
> + * In case either the driver passed a @dev with less vectors than
> + * @set->nr_hw_queues, or @dev does not provide an affinity mask for a
> + * vector, we fallback to the naive mapping.
> + */
> +int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
> +		struct ib_device *dev, int first_vec)
> +{
> +	const struct cpumask *mask;
> +	unsigned int queue, cpu;
> +
> +	if (set->nr_hw_queues > dev->num_comp_vectors)
> +		goto fallback;

maybe print a warning here?

Otherwise looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH rfc 6/6] nvme-rdma: use intelligent affinity based queue mappings
From: Christoph Hellwig @ 2017-04-04  6:34 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed,
	Or Gerlitz, Christoph Hellwig
In-Reply-To: <1491140492-25703-7-git-send-email-sagi@grimberg.me>

On Sun, Apr 02, 2017 at 04:41:32PM +0300, Sagi Grimberg wrote:
> Use the geneic block layer affinity mapping helper. Also,

          generic

>  	nr_io_queues = min(opts->nr_io_queues, num_online_cpus());
> +	nr_io_queues = min_t(unsigned int, nr_io_queues,
> +				ibdev->num_comp_vectors);
> +

Add a comment here?

Otherwise looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH v2 1/5] blk-mq: Export blk_mq_sched_restart_hctx()
From: Christoph Hellwig @ 2017-04-04  6:40 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Martin K . Petersen, James Bottomley,
	Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170403232228.11208-2-bart.vanassche@sandisk.com>

> +void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
>  {
> +	clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> +	if (blk_mq_hctx_has_pending(hctx))
> +		blk_mq_run_hw_queue(hctx, true);
>  }
> +EXPORT_SYMBOL(blk_mq_sched_restart_hctx);

_GPL export like the other _hctx functions, please.

Otherwise looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH v2 2/5] blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list
From: Christoph Hellwig @ 2017-04-04  6:40 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Martin K . Petersen, James Bottomley,
	Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170403232228.11208-3-bart.vanassche@sandisk.com>

On Mon, Apr 03, 2017 at 04:22:25PM -0700, Bart Van Assche wrote:
> A later patch in this series will namely use RCU to iterate over
> this list.

It also adds a couple lockdep_assert_held calls, which might be worth
mentionining.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [PATCH v2 3/5] blk-mq: Introduce blk_mq_ops.restart_hctx
From: Christoph Hellwig @ 2017-04-04  6:42 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Martin K . Petersen, James Bottomley,
	Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170403232228.11208-4-bart.vanassche@sandisk.com>

On Mon, Apr 03, 2017 at 04:22:26PM -0700, Bart Van Assche wrote:
> If a tag set is shared among multiple hardware queues, leave
> it to the block driver to rerun hardware queues. Hence remove
> QUEUE_FLAG_RESTART and introduce blk_mq_ops.restart_hctx.
> Remove blk_mq_sched_mark_restart_queue() because this
> function has no callers.

This looks fine, but I think it needs to also actually implement at
least a dummy restart_hctx method for SCSI and NVMe to keep the
existing functionality.

^ permalink raw reply

* Re: [PATCH v2 4/5] scsi: Add scsi_restart_hctx()
From: Christoph Hellwig @ 2017-04-04  6:42 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Martin K . Petersen, James Bottomley,
	Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170403232228.11208-5-bart.vanassche@sandisk.com>

> +static void scsi_restart_hctx(struct request_queue *q,
> +			      struct blk_mq_hw_ctx *hctx)
> +{
> +	struct blk_mq_tags *tags = hctx->tags;
> +	struct blk_mq_tag_set *set = q->tag_set;
> +	int i;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(q, &set->tag_list, tag_set_list)
> +		queue_for_each_hw_ctx(q, hctx, i)
> +			if (hctx->tags == tags)
> +				blk_mq_sched_restart_hctx(hctx);
> +	rcu_read_unlock();
> +}

This looks like generic block layer code, why is it in SCSI?

^ permalink raw reply

* Re: [PATCH 1/8] nowait aio: Introduce IOCB_RW_FLAG_NOWAIT
From: Christoph Hellwig @ 2017-04-04  6:48 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: linux-fsdevel, jack, hch, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, sagi, avi, axboe, linux-api, willy, tom.leiming,
	Goldwyn Rodrigues
In-Reply-To: <20170403185307.6243-2-rgoldwyn@suse.de>

On Mon, Apr 03, 2017 at 01:53:00PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> This flag informs kernel to bail out if an AIO request will block
> for reasons such as file allocations, or a writeback triggered,
> or would block while allocating requests while performing
> direct I/O.
> 
> Unfortunately, aio_flags is not checked for validity, which would
> break existing applications which have it set to anything besides zero
> or IOCB_FLAG_RESFD. So, we are using aio_reserved1 and renaming it
> to aio_rw_flags.
> 
> IOCB_RW_FLAG_NOWAIT is translated to IOCB_NOWAIT for
> iocb->ki_flags.

Please make this a flag in the RWF_* namespace, and as a preparation
support the existing RWF_* flags for aio.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox