Linux block layer


* Re: [PATCH 3/5] nvme: mark nvme_max_retries static
From: Sagi Grimberg @ 2017-04-05 17:33 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: linux-nvme, linux-block, linux-scsi
In-Reply-To: <20170405171812.19911-4-hch@lst.de>

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply

* Re: [PATCH 4/5] nvme: move the retries count to struct nvme_request
From: Sagi Grimberg @ 2017-04-05 17:33 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: linux-nvme, linux-block, linux-scsi
In-Reply-To: <20170405171812.19911-5-hch@lst.de>

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply

* Re: [PATCH 5/5] block, scsi: move the retries field to struct scsi_request
From: Sagi Grimberg @ 2017-04-05 17:33 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: linux-nvme, linux-block, linux-scsi
In-Reply-To: <20170405171812.19911-6-hch@lst.de>

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply

* Re: [PATCH v2 4/5] scsi: Add scsi_restart_hctx()
From: Bart Van Assche @ 2017-04-05 17:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block@vger.kernel.org, Martin K . Petersen,
	James Bottomley, Hannes Reinecke
In-Reply-To: <20170405062020.GA26389@lst.de>

On 04/04/2017 11:20 PM, Christoph Hellwig wrote:
> On Tue, Apr 04, 2017 at 03:56:34PM +0000, Bart Van Assche wrote:
>>> This looks like generic block layer code, why is it in SCSI?
>>
>> That's an excellent question. I assume that you are fine with moving
>> this code to the block layer?
> 
> Yes.  In fact I wonder if we need the blk_mq_ops method at all now
> that you're using RCU locking and don't need the scsi host_lock.

Hello Christoph,

My plan is to remove the new blk_mq_ops method again before I post v3 of
this patch series.

Bart.

^ permalink raw reply

* [PATCH v2 0/2] blk-stat: Add ability to not bucket IO; improve IO polling.
From: sbates @ 2017-04-05 17:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-nvme, Damien.LeMoal, osandov, sbates, sagi

From: Stephen Bates <sbates@raithlin.com>

Omar recently developed some patches for block layer stats that use
callbacks to determine which bucket an IO should be considered for. At
the same time there was discussion at LSF/MM that we might not want to
consider all IO when generating stats for certain algorithms (e.g. IO
completion polling) or to bucket them in a more optimal fashion.

This set does two things. It makes the bucket callback for stats
signed so we can now ignore IO that causes a negative value to be
returned from the bucket function. It then improves the IO polling
latency estimations by bucketing stats based on IO size and direction.

This patchset applies cleanly on 6809ef67eb7b4b68d (Merge branch
'for-4.12/block' into for-next) in Jens' for-next tree.

I've lightly tested this using QEMU and a real NVMe low-latency
device. I do not have performance numbers yet. Feedback would be
appreciated! I am not *super* happy with how the bucketing by size is
done. Any suggestions on how to improve this would be appreciated!

Cc: Damien.LeMoal@wdc.com
Cc: osandov@osandov.com

Changes since v1:
  Dropped the cast in blk_stat_rq_ddir() as per Omar's suggestion.
  Moved to an array of buckets based on IO size rather than a filter
  as suggested by Jens and Damien.

Stephen Bates (2):
  blk-stat: convert blk-stat bucket callback to signed
  blk-mq: Add a polling specific stats function

 block/blk-mq.c   | 54 ++++++++++++++++++++++++++++++++++++++++++++----------
 block/blk-stat.c |  6 ++++--
 block/blk-stat.h |  9 +++++----
 3 files changed, 53 insertions(+), 16 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [PATCH v2 1/2] blk-stat: convert blk-stat bucket callback to signed
From: sbates @ 2017-04-05 17:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-nvme, Damien.LeMoal, osandov, sbates, sagi
In-Reply-To: <1491413957-30885-1-git-send-email-sbates@raithlin.com>

From: Stephen Bates <sbates@raithlin.com>

In order to allow for filtering of IO based on some other properties
of the request than direction we allow the bucket function to return
an int.

If the bucket callback returns a negative value, do not count the
request in the stats accumulation.
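
As a rough illustration of the filtering behaviour described above, here
is a minimal userspace sketch. All demo_* names are hypothetical
stand-ins and are not the kernel's blk-stat types; the only point is
that a negative return from the bucket callback means the sample is
skipped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-bucket statistics slot. */
struct demo_stat {
	unsigned long nr_samples;
	unsigned long long sum;
};

/* Bucket callback: returns a bucket index, or a negative value to skip. */
typedef int (*demo_bucket_fn)(unsigned int bytes);

/* Example filter (an assumption): only account IO of at most 4096 bytes. */
static int demo_small_io_only(unsigned int bytes)
{
	return bytes <= 4096 ? 0 : -1;
}

/* Add one latency sample unless the callback rejects the request. */
static void demo_stat_add(struct demo_stat *stats, size_t nr_buckets,
			  demo_bucket_fn fn, unsigned int bytes,
			  unsigned long long latency)
{
	int bucket = fn(bytes);

	if (bucket < 0)
		return;		/* filtered out, not accounted */
	assert((size_t)bucket < nr_buckets);
	stats[bucket].nr_samples++;
	stats[bucket].sum += latency;
}
```

With this shape, a callback can act as both a classifier and a filter,
which is what the signed return type is meant to allow.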

Signed-off-by: Stephen Bates <sbates@raithlin.com>
---
 block/blk-stat.c | 6 ++++--
 block/blk-stat.h | 9 +++++----
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/block/blk-stat.c b/block/blk-stat.c
index e77ec52..dde9d39 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -19,7 +19,7 @@ struct blk_queue_stats {
 	bool enable_accounting;
 };
 
-unsigned int blk_stat_rq_ddir(const struct request *rq)
+int blk_stat_rq_ddir(const struct request *rq)
 {
 	return rq_data_dir(rq);
 }
@@ -104,6 +104,8 @@ void blk_stat_add(struct request *rq)
 	list_for_each_entry_rcu(cb, &q->stats->callbacks, list) {
 		if (blk_stat_is_active(cb)) {
 			bucket = cb->bucket_fn(rq);
+			if (bucket < 0)
+				continue;
 			stat = &this_cpu_ptr(cb->cpu_stat)[bucket];
 			__blk_stat_add(stat, value);
 		}
@@ -135,7 +137,7 @@ static void blk_stat_timer_fn(unsigned long data)
 
 struct blk_stat_callback *
 blk_stat_alloc_callback(void (*timer_fn)(struct blk_stat_callback *),
-			unsigned int (*bucket_fn)(const struct request *),
+			int (*bucket_fn)(const struct request *),
 			unsigned int buckets, void *data)
 {
 	struct blk_stat_callback *cb;
diff --git a/block/blk-stat.h b/block/blk-stat.h
index 53f08a6..622a62c 100644
--- a/block/blk-stat.h
+++ b/block/blk-stat.h
@@ -48,9 +48,10 @@ struct blk_stat_callback {
 
 	/**
 	 * @bucket_fn: Given a request, returns which statistics bucket it
-	 * should be accounted under.
+	 * should be accounted under. Return -1 for no bucket for this
+	 * request.
 	 */
-	unsigned int (*bucket_fn)(const struct request *);
+	int (*bucket_fn)(const struct request *);
 
 	/**
 	 * @buckets: Number of statistics buckets.
@@ -120,7 +121,7 @@ void blk_stat_enable_accounting(struct request_queue *q);
  *
  * Return: Data direction of the request, either READ or WRITE.
  */
-unsigned int blk_stat_rq_ddir(const struct request *rq);
+int blk_stat_rq_ddir(const struct request *rq);
 
 /**
  * blk_stat_alloc_callback() - Allocate a block statistics callback.
@@ -135,7 +136,7 @@ unsigned int blk_stat_rq_ddir(const struct request *rq);
  */
 struct blk_stat_callback *
 blk_stat_alloc_callback(void (*timer_fn)(struct blk_stat_callback *),
-			unsigned int (*bucket_fn)(const struct request *),
+			int (*bucket_fn)(const struct request *),
 			unsigned int buckets, void *data);
 
 /**
-- 
2.7.4

^ permalink raw reply related

* [PATCH v2 2/2] blk-mq: Add a polling specific stats function
From: sbates @ 2017-04-05 17:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-nvme, Damien.LeMoal, osandov, sbates, sagi
In-Reply-To: <1491413957-30885-1-git-send-email-sbates@raithlin.com>

From: Stephen Bates <sbates@raithlin.com>

Rather than bucketing IO statistics based on direction only, we also
bucket based on the IO size. This leads to improved polling
performance. Update the bucket callback function and use it in the
polling latency estimation.

Signed-off-by: Stephen Bates <sbates@raithlin.com>
---
 block/blk-mq.c | 53 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 43 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 061fc2c..8fb1fb0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -42,6 +42,33 @@ static LIST_HEAD(all_q_list);
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
 
+static int blk_mq_poll_stats_bkt(const struct request *rq)
+{
+	int dir, bytes;
+
+	dir = blk_stat_rq_ddir(rq);
+	bytes = blk_rq_bytes(rq);
+
+	if (bytes <= 512)
+		return dir;
+	else if (bytes <= 4096)
+		return dir + 2;
+	else if (bytes <= 8192)
+		return dir + 4;
+	else if (bytes <= 16384)
+		return dir + 6;
+	else if (bytes <= 32768)
+		return dir + 8;
+	else if (bytes <= 65536)
+		return dir + 10;
+	else
+		return dir + 12;
+
+	return -1;
+}
+/* Must be consistent with the function above */
+#define BLK_MQ_POLL_STATS_BKTS 14
+
 /*
  * Check if any of the ctx's have pending work in this hardware queue
  */
@@ -2245,7 +2272,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	q->mq_ops = set->ops;
 
 	q->poll_cb = blk_stat_alloc_callback(blk_mq_poll_stats_fn,
-					     blk_stat_rq_ddir, 2, q);
+					     blk_mq_poll_stats_bkt,
+					     BLK_MQ_POLL_STATS_BKTS, q);
 	if (!q->poll_cb)
 		goto err_exit;
 
@@ -2663,11 +2691,12 @@ static void blk_mq_poll_stats_start(struct request_queue *q)
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb)
 {
 	struct request_queue *q = cb->data;
+	int bucket;
 
-	if (cb->stat[READ].nr_samples)
-		q->poll_stat[READ] = cb->stat[READ];
-	if (cb->stat[WRITE].nr_samples)
-		q->poll_stat[WRITE] = cb->stat[WRITE];
+	for (bucket = 0; bucket < BLK_MQ_POLL_STATS_BKTS; bucket++) {
+		if (cb->stat[bucket].nr_samples)
+			q->poll_stat[bucket] = cb->stat[bucket];
+	}
 }
 
 static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
@@ -2675,6 +2704,7 @@ static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
 				       struct request *rq)
 {
 	unsigned long ret = 0;
+	int bucket;
 
 	/*
 	 * If stats collection isn't on, don't sleep but turn it on for
@@ -2689,12 +2719,15 @@ static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
 	 * For instance, if the completion latencies are tight, we can
 	 * get closer than just half the mean. This is especially
 	 * important on devices where the completion latencies are longer
-	 * than ~10 usec.
+	 * than ~10 usec. We do use the stats for the relevant IO size
+	 * if available which does lead to better estimates.
 	 */
-	if (req_op(rq) == REQ_OP_READ && q->poll_stat[READ].nr_samples)
-		ret = (q->poll_stat[READ].mean + 1) / 2;
-	else if (req_op(rq) == REQ_OP_WRITE && q->poll_stat[WRITE].nr_samples)
-		ret = (q->poll_stat[WRITE].mean + 1) / 2;
+	bucket = blk_mq_poll_stats_bkt(rq);
+	if (bucket < 0)
+		return ret;
+
+	if (q->poll_stat[bucket].nr_samples)
+		ret = (q->poll_stat[bucket].mean + 1) / 2;
 
 	return ret;
 }
-- 
2.7.4

^ permalink raw reply related

* Re: ->retries fixups V2
From: Jens Axboe @ 2017-04-05 18:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block, linux-scsi
In-Reply-To: <20170405171812.19911-1-hch@lst.de>

On Wed, Apr 05 2017, Christoph Hellwig wrote:
> This series fixes a few loose bits in terms of how nvme uses ->retries,
> including fixing it for non-PCIe transports.  While at it I noticed that
> nvme and scsi use the field in entirely different ways, and no other
> driver uses it at all.  So I decided to move it into the nvme_request and
> scsi_request structures instead.
> 
> Changes since V1:
>  - better changelog for one patch
>  - move the new retries field to the end of struct nvme_request

Applied for 4.12. If we do the below on my box, we remove the (now) 2
holes from struct request and shrink it by 8 bytes.

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ce6f9a6534c9..3cf241b0814d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -215,6 +215,8 @@ struct request {
 
 	unsigned short ioprio;
 
+	unsigned int timeout;
+
 	void *special;		/* opaque pointer available for LLD use */
 
 	int errors;
@@ -223,7 +225,6 @@ struct request {
 
 	unsigned long deadline;
 	struct list_head timeout_list;
-	unsigned int timeout;
 
 	/*
 	 * completion callback.
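
For anyone unfamiliar with padding holes, a small userspace sketch of
the effect. The two structs below are a hypothetical subset of fields,
not the real struct request, and the exact saving depends on the ABI;
on a typical LP64 target the reordered layout should come out 8 bytes
smaller, on some 32-bit targets there may be no difference at all.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of fields, in the order before the reshuffle. */
struct demo_before {
	unsigned short ioprio;
	void *special;
	int errors;
	unsigned long deadline;
	unsigned int timeout;
};

/* Same fields, with timeout moved up next to ioprio. */
struct demo_after {
	unsigned short ioprio;
	unsigned int timeout;
	void *special;
	int errors;
	unsigned long deadline;
};

/* Bytes saved by the reordering on the current ABI (may be zero). */
static size_t demo_bytes_saved(void)
{
	assert(sizeof(struct demo_after) <= sizeof(struct demo_before));
	return sizeof(struct demo_before) - sizeof(struct demo_after);
}
```

On the real structure, pahole is the usual way to see where the holes
are before and after a change like this.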

-- 
Jens Axboe

^ permalink raw reply related

* Re: [PATCH v2 2/2] blk-mq: Add a polling specific stats function
From: Jens Axboe @ 2017-04-05 18:14 UTC (permalink / raw)
  To: sbates; +Cc: linux-block, linux-nvme, Damien.LeMoal, osandov, sagi
In-Reply-To: <1491413957-30885-3-git-send-email-sbates@raithlin.com>

On 04/05/2017 11:39 AM, sbates@raithlin.com wrote:
> @@ -42,6 +42,33 @@ static LIST_HEAD(all_q_list);
>  static void blk_mq_poll_stats_start(struct request_queue *q);
>  static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
>  
> +static int blk_mq_poll_stats_bkt(const struct request *rq)
> +{
> +	int dir, bytes;
> +
> +	dir = blk_stat_rq_ddir(rq);
> +	bytes = blk_rq_bytes(rq);
> +
> +	if (bytes <= 512)
> +		return dir;
> +	else if (bytes <= 4096)
> +		return dir + 2;
> +	else if (bytes <= 8192)
> +		return dir + 4;
> +	else if (bytes <= 16384)
> +		return dir + 6;
> +	else if (bytes <= 32768)
> +		return dir + 8;
> +	else if (bytes <= 65536)
> +		return dir + 10;
> +	else
> +		return dir + 12;

Why not just have 8 buckets, and make it:

	bucket = ddir + ilog2(bytes) - 9;

and cap it at MAX_BUCKET (8) and put all those above into the top
bucket.
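
A rough userspace sketch of that idea follows. It makes two assumptions
that are not in the one-liner above: the size term is multiplied by two
so that reads and writes of adjacent size classes do not land in the
same bucket, and anything smaller than 512 bytes is rejected with -1.
demo_ilog2() stands in for the kernel's ilog2(), and DEMO_MAX_BUCKET is
a hypothetical name.

```c
#include <assert.h>

#define DEMO_MAX_BUCKET 8	/* hypothetical cap, per the suggestion */

/* Portable floor(log2(v)) for v > 0; returns -1 for v == 0. */
static int demo_ilog2(unsigned int v)
{
	int log = -1;

	while (v) {
		v >>= 1;
		log++;
	}
	return log;
}

/* ddir: 0 = read, 1 = write. Returns -1 for sizes below 512 bytes. */
static int demo_poll_bucket(int ddir, unsigned int bytes)
{
	int bucket;

	assert(ddir == 0 || ddir == 1);
	if (bytes < 512)
		return -1;

	bucket = ddir + 2 * (demo_ilog2(bytes) - 9);
	if (bucket >= DEMO_MAX_BUCKET)
		bucket = ddir + DEMO_MAX_BUCKET - 2;	/* top size class */
	return bucket;
}
```

Note that the boundaries differ from the if/else chain in the patch:
here 512..1023 bytes share a size class, whereas the patch groups
513..4096 together.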

-- 
Jens Axboe

^ permalink raw reply

* Re: ->retries fixups V2
From: Christoph Hellwig @ 2017-04-05 18:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, linux-nvme,
	linux-block, linux-scsi
In-Reply-To: <20170405180653.GA5548@kernel.dk>

On Wed, Apr 05, 2017 at 12:06:53PM -0600, Jens Axboe wrote:
> On Wed, Apr 05 2017, Christoph Hellwig wrote:
> > This series fixes a few loose bits in terms of how nvme uses ->retries,
> > including fixing it for non-PCIe transports.  While at it I noticed that
> > nvme and scsi use the field in entirely different ways, and no other
> > driver uses it at all.  So I decided to move it into the nvme_request and
> > scsi_request structures instead.
> > 
> > Changes since V1:
> >  - better changelog for one patch
> >  - move the new retries field to the end of struct nvme_request
> 
> Applied for 4.12. If we do the below on my box, we remove the (now) 2
> holes from struct request and shrink it 8 bytes.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: ->retries fixups V2
From: Jens Axboe @ 2017-04-05 18:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Sagi Grimberg, linux-nvme, linux-block, linux-scsi
In-Reply-To: <20170405181605.GA9702@lst.de>

On 04/05/2017 12:16 PM, Christoph Hellwig wrote:
> On Wed, Apr 05, 2017 at 12:06:53PM -0600, Jens Axboe wrote:
>> On Wed, Apr 05 2017, Christoph Hellwig wrote:
>>> This series fixes a few loose bits in terms of how nvme uses ->retries,
>>> including fixing it for non-PCIe transports.  While at it I noticed that
>>> nvme and scsi use the field in entirely different ways, and no other
>>> driver uses it at all.  So I decided to move it into the nvme_request and
>>> scsi_request structures instead.
>>>
>>> Changes since V1:
>>>  - better changelog for one patch
>>>  - move the new retries field to the end of struct nvme_request
>>
>> Applied for 4.12. If we do the below on my box, we remove the (now) 2
>> holes from struct request and shrink it 8 bytes.
> 
> Looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks, added that too.

-- 
Jens Axboe

^ permalink raw reply

* [PATCH v2 0/8] blk-mq: various fixes and cleanups
From: Omar Sandoval @ 2017-04-05 18:28 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

This is v2 of my series from a couple of days ago [1] with one extra fix
and two extra cleanups.

- Patch 1 is the new fix for a hang that Josef reported after trying v1.
- Patches 2-6 are the original series. Patch 5 now has Christoph's and
  Sagi's Reviewed-by.
- Patches 7 and 8 are trivial cleanups.

Patches 1-5 should probably go into 4.11, and 6-8 are for 4.12.

1: http://marc.info/?l=linux-block&m=149125578724683&w=2

Omar Sandoval (8):
  blk-mq: use the right hctx when getting a driver tag fails
  blk-mq-sched: refactor scheduler initialization
  blk-mq-sched: set up scheduler tags when bringing up new queues
  blk-mq-sched: fix crash in switch error path
  blk-mq: remap queues when adding/removing hardware queues
  blk-mq-sched: provide hooks for initializing hardware queue data
  blk-mq: make driver tag failure path easier to follow
  blk-mq: clean up direct issue blk_mq_queue_data initialization

 block/blk-mq-sched.c     | 187 +++++++++++++++++++++++++++++------------------
 block/blk-mq-sched.h     |  13 ++--
 block/blk-mq.c           |  72 ++++++++++--------
 block/blk-mq.h           |   2 +-
 block/blk-sysfs.c        |   2 +-
 block/elevator.c         | 114 +++++++++++++++--------------
 include/linux/elevator.h |   4 +-
 7 files changed, 227 insertions(+), 167 deletions(-)

-- 
2.12.2

^ permalink raw reply

* [PATCH v2 1/8] blk-mq: use the right hctx when getting a driver tag fails
From: Omar Sandoval @ 2017-04-05 18:28 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491416593.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

While dispatching requests, if we fail to get a driver tag, we mark the
hardware queue as waiting for a tag and put the requests on a
hctx->dispatch list to be run later when a driver tag is freed. However,
blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
queues if using a single-queue scheduler with a multiqueue device. If
blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
are processing. This means we end up using the hardware queue of the
previous request, which may or may not be the same as that of the
current request. If it isn't, the wrong hardware queue will end up
waiting for a tag, and the requests will be on the wrong dispatch list,
leading to a hang.

The fix is twofold:

1. Make sure we save which hardware queue we were trying to get a
   request for in blk_mq_get_driver_tag() regardless of whether it
   succeeds or not.
2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
   blk_mq_hw_queue to make it clear that it must handle multiple
   hardware queues, since I've already messed this up on a couple of
   occasions.
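
The first point can be sketched in userspace as follows. This is only
an illustration with hypothetical demo_* types, not the blk-mq code:
the out-parameter is written on both the success and the failure path,
so the caller always knows which hardware queue the request belonged
to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a hardware queue and a request. */
struct demo_hctx { int id; int free_tags; };
struct demo_rq { int tag; struct demo_hctx *hctx; };

/*
 * Try to get a tag for rq. *hctx_out is written whether or not a tag
 * was obtained, so a failing caller can park the request on the
 * correct queue.
 */
static bool demo_get_driver_tag(struct demo_rq *rq,
				struct demo_hctx **hctx_out)
{
	struct demo_hctx *hctx;

	assert(rq && rq->hctx);
	hctx = rq->hctx;

	if (rq->tag == -1 && hctx->free_tags > 0) {
		hctx->free_tags--;
		rq->tag = hctx->free_tags;
	}

	if (hctx_out)
		*hctx_out = hctx;
	return rq->tag != -1;
}
```

A caller that loops over requests from several queues can then use the
returned queue on failure rather than whatever the previous iteration
left behind.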

This didn't appear in testing with nvme and mq-deadline because nvme has
more driver tags than the default number of scheduler tags. However,
with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c |  9 +++++----
 block/blk-mq.c       | 25 +++++++++++++------------
 block/blk-mq.h       |  2 +-
 3 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 09af8ff18719..fc00f00898d3 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -171,7 +171,8 @@ void blk_mq_sched_put_request(struct request *rq)
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
-	struct elevator_queue *e = hctx->queue->elevator;
+	struct request_queue *q = hctx->queue;
+	struct elevator_queue *e = q->elevator;
 	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
 	bool did_work = false;
 	LIST_HEAD(rq_list);
@@ -203,10 +204,10 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		did_work = blk_mq_dispatch_rq_list(hctx, &rq_list);
+		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
 	} else if (!has_sched_dispatch) {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
-		blk_mq_dispatch_rq_list(hctx, &rq_list);
+		blk_mq_dispatch_rq_list(q, &rq_list);
 	}
 
 	/*
@@ -222,7 +223,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 			if (!rq)
 				break;
 			list_add(&rq->queuelist, &rq_list);
-		} while (blk_mq_dispatch_rq_list(hctx, &rq_list));
+		} while (blk_mq_dispatch_rq_list(q, &rq_list));
 	}
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 061fc2cc88d3..6c1bedc23b5a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -863,12 +863,8 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 		.flags = wait ? 0 : BLK_MQ_REQ_NOWAIT,
 	};
 
-	if (rq->tag != -1) {
-done:
-		if (hctx)
-			*hctx = data.hctx;
-		return true;
-	}
+	if (rq->tag != -1)
+		goto done;
 
 	if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag))
 		data.flags |= BLK_MQ_REQ_RESERVED;
@@ -880,10 +876,12 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 			atomic_inc(&data.hctx->nr_active);
 		}
 		data.hctx->tags->rqs[rq->tag] = rq;
-		goto done;
 	}
 
-	return false;
+done:
+	if (hctx)
+		*hctx = data.hctx;
+	return rq->tag != -1;
 }
 
 static void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
@@ -980,14 +978,17 @@ static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
 	return true;
 }
 
-bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
+bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 {
-	struct request_queue *q = hctx->queue;
+	struct blk_mq_hw_ctx *hctx;
 	struct request *rq;
 	LIST_HEAD(driver_list);
 	struct list_head *dptr;
 	int errors, queued, ret = BLK_MQ_RQ_QUEUE_OK;
 
+	if (list_empty(list))
+		return false;
+
 	/*
 	 * Start off with dptr being NULL, so we start the first request
 	 * immediately, even if we have more pending.
@@ -998,7 +999,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 	 * Now process all the entries, sending them to the driver.
 	 */
 	errors = queued = 0;
-	while (!list_empty(list)) {
+	do {
 		struct blk_mq_queue_data bd;
 
 		rq = list_first_entry(list, struct request, queuelist);
@@ -1069,7 +1070,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 		 */
 		if (!dptr && list->next != list->prev)
 			dptr = &driver_list;
-	}
+	} while (!list_empty(list));
 
 	hctx->dispatched[queued_to_index(queued)]++;
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 8d49c06fc520..7e6f2e467696 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -30,7 +30,7 @@ void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
-bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
+bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *);
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
 bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
-- 
2.12.2

^ permalink raw reply related

* [PATCH v2 2/8] blk-mq-sched: refactor scheduler initialization
From: Omar Sandoval @ 2017-04-05 18:28 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491416593.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Preparation cleanup for the next couple of fixes: push
blk_mq_sched_setup() and e->ops.mq.init_sched() into a helper.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c | 82 ++++++++++++++++++++++++++++------------------------
 block/blk-mq-sched.h |  2 +-
 block/elevator.c     | 32 ++++++++------------
 3 files changed, 57 insertions(+), 59 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index fc00f00898d3..6bd1758ea29b 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -432,11 +432,45 @@ static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
 	}
 }
 
-int blk_mq_sched_setup(struct request_queue *q)
+static int blk_mq_sched_alloc_tags(struct request_queue *q,
+				   struct blk_mq_hw_ctx *hctx,
+				   unsigned int hctx_idx)
+{
+	struct blk_mq_tag_set *set = q->tag_set;
+	int ret;
+
+	hctx->sched_tags = blk_mq_alloc_rq_map(set, hctx_idx, q->nr_requests,
+					       set->reserved_tags);
+	if (!hctx->sched_tags)
+		return -ENOMEM;
+
+	ret = blk_mq_alloc_rqs(set, hctx->sched_tags, hctx_idx, q->nr_requests);
+	if (ret)
+		blk_mq_sched_free_tags(set, hctx, hctx_idx);
+
+	return ret;
+}
+
+void blk_mq_sched_teardown(struct request_queue *q)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
 	struct blk_mq_hw_ctx *hctx;
-	int ret, i;
+	int i;
+
+	queue_for_each_hw_ctx(q, hctx, i)
+		blk_mq_sched_free_tags(set, hctx, i);
+}
+
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned int i;
+	int ret;
+
+	if (!e) {
+		q->elevator = NULL;
+		return 0;
+	}
 
 	/*
 	 * Default to 256, since we don't split into sync/async like the
@@ -444,49 +478,21 @@ int blk_mq_sched_setup(struct request_queue *q)
 	 */
 	q->nr_requests = 2 * BLKDEV_MAX_RQ;
 
-	/*
-	 * We're switching to using an IO scheduler, so setup the hctx
-	 * scheduler tags and switch the request map from the regular
-	 * tags to scheduler tags. First allocate what we need, so we
-	 * can safely fail and fallback, if needed.
-	 */
-	ret = 0;
 	queue_for_each_hw_ctx(q, hctx, i) {
-		hctx->sched_tags = blk_mq_alloc_rq_map(set, i,
-				q->nr_requests, set->reserved_tags);
-		if (!hctx->sched_tags) {
-			ret = -ENOMEM;
-			break;
-		}
-		ret = blk_mq_alloc_rqs(set, hctx->sched_tags, i, q->nr_requests);
+		ret = blk_mq_sched_alloc_tags(q, hctx, i);
 		if (ret)
-			break;
+			goto err;
 	}
 
-	/*
-	 * If we failed, free what we did allocate
-	 */
-	if (ret) {
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (!hctx->sched_tags)
-				continue;
-			blk_mq_sched_free_tags(set, hctx, i);
-		}
-
-		return ret;
-	}
+	ret = e->ops.mq.init_sched(q, e);
+	if (ret)
+		goto err;
 
 	return 0;
-}
 
-void blk_mq_sched_teardown(struct request_queue *q)
-{
-	struct blk_mq_tag_set *set = q->tag_set;
-	struct blk_mq_hw_ctx *hctx;
-	int i;
-
-	queue_for_each_hw_ctx(q, hctx, i)
-		blk_mq_sched_free_tags(set, hctx, i);
+err:
+	blk_mq_sched_teardown(q);
+	return ret;
 }
 
 int blk_mq_sched_init(struct request_queue *q)
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index a75b16b123f7..873f9af5a35b 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -32,7 +32,7 @@ void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
 			struct list_head *rq_list,
 			struct request *(*get_rq)(struct blk_mq_hw_ctx *));
 
-int blk_mq_sched_setup(struct request_queue *q);
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
 void blk_mq_sched_teardown(struct request_queue *q);
 
 int blk_mq_sched_init(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index 01139f549b5b..f236ef1d2be9 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -242,17 +242,12 @@ int elevator_init(struct request_queue *q, char *name)
 		}
 	}
 
-	if (e->uses_mq) {
-		err = blk_mq_sched_setup(q);
-		if (!err)
-			err = e->ops.mq.init_sched(q, e);
-	} else
+	if (e->uses_mq)
+		err = blk_mq_init_sched(q, e);
+	else
 		err = e->ops.sq.elevator_init_fn(q, e);
-	if (err) {
-		if (e->uses_mq)
-			blk_mq_sched_teardown(q);
+	if (err)
 		elevator_put(e);
-	}
 	return err;
 }
 EXPORT_SYMBOL(elevator_init);
@@ -987,21 +982,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	}
 
 	/* allocate, init and register new elevator */
-	if (new_e) {
-		if (new_e->uses_mq) {
-			err = blk_mq_sched_setup(q);
-			if (!err)
-				err = new_e->ops.mq.init_sched(q, new_e);
-		} else
-			err = new_e->ops.sq.elevator_init_fn(q, new_e);
-		if (err)
-			goto fail_init;
+	if (q->mq_ops)
+		err = blk_mq_init_sched(q, new_e);
+	else
+		err = new_e->ops.sq.elevator_init_fn(q, new_e);
+	if (err)
+		goto fail_init;
 
+	if (new_e) {
 		err = elv_register_queue(q);
 		if (err)
 			goto fail_register;
-	} else
-		q->elevator = NULL;
+	}
 
 	/* done, kill the old one and finish */
 	if (old) {
-- 
2.12.2

^ permalink raw reply related

* [PATCH v2 3/8] blk-mq-sched: set up scheduler tags when bringing up new queues
From: Omar Sandoval @ 2017-04-05 18:28 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491416593.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

If a new hardware queue is added at runtime, we don't allocate scheduler
tags for it, leading to a crash. This hooks up the scheduler framework
to blk_mq_{init,exit}_hctx() to make sure everything gets properly
initialized/freed.
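
The init/exit pairing relies on the usual goto-based unwind, where each
setup step gets a label that undoes it and later failures fall through
the earlier labels in reverse order. A minimal userspace sketch of that
pattern, with hypothetical demo_* names and a fail_at knob that exists
only to exercise the error paths:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for per-hardware-queue state. */
struct demo_hctx { void *driver_data; void *sched_tags; void *fq; };

/* fail_at: step number to fail at (1..3), or 0 for no forced failure. */
static int demo_init_hctx(struct demo_hctx *h, int fail_at)
{
	assert(h);
	h->driver_data = h->sched_tags = h->fq = NULL;

	h->driver_data = fail_at == 1 ? NULL : malloc(16);
	if (!h->driver_data)
		goto fail;

	h->sched_tags = fail_at == 2 ? NULL : malloc(16);
	if (!h->sched_tags)
		goto exit_hctx;

	h->fq = fail_at == 3 ? NULL : malloc(16);
	if (!h->fq)
		goto sched_exit_hctx;

	return 0;

sched_exit_hctx:
	free(h->sched_tags);
	h->sched_tags = NULL;
exit_hctx:
	free(h->driver_data);
	h->driver_data = NULL;
fail:
	return -1;
}
```

Adding a new setup step in the middle then means adding one label in
the matching position of the unwind sequence.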

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c | 22 ++++++++++++++++++++++
 block/blk-mq-sched.h |  5 +++++
 block/blk-mq.c       |  9 ++++++++-
 3 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 6bd1758ea29b..0bb13bb51daa 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -461,6 +461,28 @@ void blk_mq_sched_teardown(struct request_queue *q)
 		blk_mq_sched_free_tags(set, hctx, i);
 }
 
+int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+			   unsigned int hctx_idx)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (!e)
+		return 0;
+
+	return blk_mq_sched_alloc_tags(q, hctx, hctx_idx);
+}
+
+void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+			    unsigned int hctx_idx)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (!e)
+		return;
+
+	blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
+}
+
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 {
 	struct blk_mq_hw_ctx *hctx;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 873f9af5a35b..19db25e0c95a 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -35,6 +35,11 @@ void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
 void blk_mq_sched_teardown(struct request_queue *q);
 
+int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+			   unsigned int hctx_idx);
+void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
+			    unsigned int hctx_idx);
+
 int blk_mq_sched_init(struct request_queue *q);
 
 static inline bool
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6c1bedc23b5a..cf158a6fc997 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1840,6 +1840,8 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 				       hctx->fq->flush_rq, hctx_idx,
 				       flush_start_tag + hctx_idx);
 
+	blk_mq_sched_exit_hctx(q, hctx, hctx_idx);
+
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
 
@@ -1906,9 +1908,12 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
 		goto free_bitmap;
 
+	if (blk_mq_sched_init_hctx(q, hctx, hctx_idx))
+		goto exit_hctx;
+
 	hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size);
 	if (!hctx->fq)
-		goto exit_hctx;
+		goto sched_exit_hctx;
 
 	if (set->ops->init_request &&
 	    set->ops->init_request(set->driver_data,
@@ -1923,6 +1928,8 @@ static int blk_mq_init_hctx(struct request_queue *q,
 
  free_fq:
 	kfree(hctx->fq);
+ sched_exit_hctx:
+	blk_mq_sched_exit_hctx(q, hctx, hctx_idx);
  exit_hctx:
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
-- 
2.12.2

^ permalink raw reply related

* [PATCH v2 4/8] blk-mq-sched: fix crash in switch error path
From: Omar Sandoval @ 2017-04-05 18:28 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491416593.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
back to the original scheduler. However, at this point, we've already
torn down the original scheduler's tags, so this causes a crash. Doing
the fallback like the legacy elevator path is much harder for mq, so fix
it by just falling back to none, instead.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c     | 13 +++++--
 block/blk-mq-sched.h     |  2 +-
 block/blk-mq.c           |  2 --
 block/blk-sysfs.c        |  2 +-
 block/elevator.c         | 94 +++++++++++++++++++++++++++---------------------
 include/linux/elevator.h |  2 +-
 6 files changed, 67 insertions(+), 48 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 0bb13bb51daa..e8c2ed654ef0 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -451,7 +451,7 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
 	return ret;
 }
 
-void blk_mq_sched_teardown(struct request_queue *q)
+static void blk_mq_sched_tags_teardown(struct request_queue *q)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
 	struct blk_mq_hw_ctx *hctx;
@@ -513,10 +513,19 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	return 0;
 
 err:
-	blk_mq_sched_teardown(q);
+	blk_mq_sched_tags_teardown(q);
+	q->elevator = NULL;
 	return ret;
 }
 
+void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
+{
+	if (e->type->ops.mq.exit_sched)
+		e->type->ops.mq.exit_sched(e);
+	blk_mq_sched_tags_teardown(q);
+	q->elevator = NULL;
+}
+
 int blk_mq_sched_init(struct request_queue *q)
 {
 	int ret;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 19db25e0c95a..e704956e0862 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -33,7 +33,7 @@ void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx,
 			struct request *(*get_rq)(struct blk_mq_hw_ctx *));
 
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
-void blk_mq_sched_teardown(struct request_queue *q);
+void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
 
 int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			   unsigned int hctx_idx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index cf158a6fc997..53c5c6097be9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2154,8 +2154,6 @@ void blk_mq_release(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
 
-	blk_mq_sched_teardown(q);
-
 	/* hctx kobj stays in hctx */
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 45854266e398..c47db43a40cc 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -803,7 +803,7 @@ static void blk_release_queue(struct kobject *kobj)
 
 	if (q->elevator) {
 		ioc_clear_queue(q);
-		elevator_exit(q->elevator);
+		elevator_exit(q, q->elevator);
 	}
 
 	blk_free_queue_stats(q->stats);
diff --git a/block/elevator.c b/block/elevator.c
index f236ef1d2be9..dbeecf7be719 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -252,11 +252,11 @@ int elevator_init(struct request_queue *q, char *name)
 }
 EXPORT_SYMBOL(elevator_init);
 
-void elevator_exit(struct elevator_queue *e)
+void elevator_exit(struct request_queue *q, struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
 	if (e->uses_mq && e->type->ops.mq.exit_sched)
-		e->type->ops.mq.exit_sched(e);
+		blk_mq_exit_sched(q, e);
 	else if (!e->uses_mq && e->type->ops.sq.elevator_exit_fn)
 		e->type->ops.sq.elevator_exit_fn(e);
 	mutex_unlock(&e->sysfs_lock);
@@ -941,6 +941,45 @@ void elv_unregister(struct elevator_type *e)
 }
 EXPORT_SYMBOL_GPL(elv_unregister);
 
+static int elevator_switch_mq(struct request_queue *q,
+			      struct elevator_type *new_e)
+{
+	int ret;
+
+	blk_mq_freeze_queue(q);
+	blk_mq_quiesce_queue(q);
+
+	if (q->elevator) {
+		if (q->elevator->registered)
+			elv_unregister_queue(q);
+		ioc_clear_queue(q);
+		elevator_exit(q, q->elevator);
+	}
+
+	ret = blk_mq_init_sched(q, new_e);
+	if (ret)
+		goto out;
+
+	if (new_e) {
+		ret = elv_register_queue(q);
+		if (ret) {
+			elevator_exit(q, q->elevator);
+			goto out;
+		}
+	}
+
+	if (new_e)
+		blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+	else
+		blk_add_trace_msg(q, "elv switch: none");
+
+out:
+	blk_mq_unfreeze_queue(q);
+	blk_mq_start_stopped_hw_queues(q, true);
+	return ret;
+
+}
+
 /*
  * switch to new_e io scheduler. be careful not to introduce deadlocks -
  * we don't free the old io scheduler, before we have allocated what we
@@ -953,10 +992,8 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	bool old_registered = false;
 	int err;
 
-	if (q->mq_ops) {
-		blk_mq_freeze_queue(q);
-		blk_mq_quiesce_queue(q);
-	}
+	if (q->mq_ops)
+		return elevator_switch_mq(q, new_e);
 
 	/*
 	 * Turn on BYPASS and drain all requests w/ elevator private data.
@@ -968,11 +1005,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	if (old) {
 		old_registered = old->registered;
 
-		if (old->uses_mq)
-			blk_mq_sched_teardown(q);
-
-		if (!q->mq_ops)
-			blk_queue_bypass_start(q);
+		blk_queue_bypass_start(q);
 
 		/* unregister and clear all auxiliary data of the old elevator */
 		if (old_registered)
@@ -982,53 +1015,32 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	}
 
 	/* allocate, init and register new elevator */
-	if (q->mq_ops)
-		err = blk_mq_init_sched(q, new_e);
-	else
-		err = new_e->ops.sq.elevator_init_fn(q, new_e);
+	err = new_e->ops.sq.elevator_init_fn(q, new_e);
 	if (err)
 		goto fail_init;
 
-	if (new_e) {
-		err = elv_register_queue(q);
-		if (err)
-			goto fail_register;
-	}
+	err = elv_register_queue(q);
+	if (err)
+		goto fail_register;
 
 	/* done, kill the old one and finish */
 	if (old) {
-		elevator_exit(old);
-		if (!q->mq_ops)
-			blk_queue_bypass_end(q);
+		elevator_exit(q, old);
+		blk_queue_bypass_end(q);
 	}
 
-	if (q->mq_ops) {
-		blk_mq_unfreeze_queue(q);
-		blk_mq_start_stopped_hw_queues(q, true);
-	}
-
-	if (new_e)
-		blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
-	else
-		blk_add_trace_msg(q, "elv switch: none");
+	blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
 
 	return 0;
 
 fail_register:
-	if (q->mq_ops)
-		blk_mq_sched_teardown(q);
-	elevator_exit(q->elevator);
+	elevator_exit(q, q->elevator);
 fail_init:
 	/* switch failed, restore and re-register old elevator */
 	if (old) {
 		q->elevator = old;
 		elv_register_queue(q);
-		if (!q->mq_ops)
-			blk_queue_bypass_end(q);
-	}
-	if (q->mq_ops) {
-		blk_mq_unfreeze_queue(q);
-		blk_mq_start_stopped_hw_queues(q, true);
+		blk_queue_bypass_end(q);
 	}
 
 	return err;
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index aebecc4ed088..22d39e8d4de1 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -211,7 +211,7 @@ extern ssize_t elv_iosched_show(struct request_queue *, char *);
 extern ssize_t elv_iosched_store(struct request_queue *, const char *, size_t);
 
 extern int elevator_init(struct request_queue *, char *);
-extern void elevator_exit(struct elevator_queue *);
+extern void elevator_exit(struct request_queue *, struct elevator_queue *);
 extern int elevator_change(struct request_queue *, const char *);
 extern bool elv_bio_merge_ok(struct request *, struct bio *);
 extern struct elevator_queue *elevator_alloc(struct request_queue *,
-- 
2.12.2

^ permalink raw reply related
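
As a rough illustration of the fallback described in the patch above: the old scheduler is torn down before the new one is initialized, so a failed init cannot restore it and the queue is left with no scheduler. The sketch below uses hypothetical stand-in types (struct sched, struct queue, switch_sched()); none of them are the kernel's structures, and the init_fails flag only simulates a failing blk_mq_init_sched().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel's types. */
struct sched { const char *name; int init_fails; };
struct queue { const struct sched *elevator; };	/* NULL means "none" */

/* Returns 0 on success, -1 if the new scheduler fails to initialize. */
static int switch_sched(struct queue *q, const struct sched *new_e)
{
	/* The old scheduler is torn down first, so it cannot be restored. */
	q->elevator = NULL;

	if (new_e == NULL)
		return 0;	/* switching to "none" always succeeds here */

	if (new_e->init_fails)
		return -1;	/* q->elevator stays NULL: fall back to none */

	q->elevator = new_e;
	return 0;
}
```

In this model a failed switch returns an error and leaves the queue usable without a scheduler, instead of trying to reinstate state that no longer exists.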

* [PATCH v2 5/8] blk-mq: remap queues when adding/removing hardware queues
From: Omar Sandoval @ 2017-04-05 18:28 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491416593.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

blk_mq_update_nr_hw_queues() used to remap hardware queues, which is the
behavior that drivers expect. However, commit 4e68a011428a changed
blk_mq_queue_reinit() to not remap queues for the case of CPU
hotplugging, inadvertently making blk_mq_update_nr_hw_queues() not remap
queues as well. This breaks, for example, NBD's multi-connection mode,
leaving the added hardware queues unused. Fix it by making
blk_mq_update_nr_hw_queues() explicitly remap the queues.

Fixes: 4e68a011428a ("blk-mq: don't redistribute hardware queues on a CPU hotplug event")
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 53c5c6097be9..4aa3148befe1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2484,6 +2484,14 @@ static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 	return 0;
 }
 
+static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
+{
+	if (set->ops->map_queues)
+		return set->ops->map_queues(set);
+	else
+		return blk_mq_map_queues(set);
+}
+
 /*
  * Alloc a tag set to be associated with one or more request queues.
  * May fail with EINVAL for various error conditions. May adjust the
@@ -2538,10 +2546,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (!set->mq_map)
 		goto out_free_tags;
 
-	if (set->ops->map_queues)
-		ret = set->ops->map_queues(set);
-	else
-		ret = blk_mq_map_queues(set);
+	ret = blk_mq_update_queue_map(set);
 	if (ret)
 		goto out_free_mq_map;
 
@@ -2633,6 +2638,7 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 		blk_mq_freeze_queue(q);
 
 	set->nr_hw_queues = nr_hw_queues;
+	blk_mq_update_queue_map(set);
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_realloc_hw_ctxs(set, q);
 		blk_mq_queue_reinit(q, cpu_online_mask);
-- 
2.12.2

^ permalink raw reply related
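
A minimal sketch of the helper pattern in the patch above, assuming simplified stand-in types: a driver-provided map_queues callback is used when present, otherwise a default mapping is applied, and the map is rebuilt whenever the number of hardware queues changes. The modulo mapping in default_map_queues() is a placeholder assumption and is not meant to reproduce what blk_mq_map_queues() actually does.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4

struct tag_set;
struct ops { int (*map_queues)(struct tag_set *set); };
struct tag_set {
	const struct ops *ops;
	unsigned int nr_hw_queues;
	unsigned int mq_map[NR_CPUS];	/* CPU index -> hardware queue index */
};

/* Placeholder default: spread CPUs over queues by modulo (an assumption). */
static int default_map_queues(struct tag_set *set)
{
	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
		set->mq_map[cpu] = cpu % set->nr_hw_queues;
	return 0;
}

/* Prefer the driver's mapping callback, else use the default. */
static int update_queue_map(struct tag_set *set)
{
	if (set->ops->map_queues)
		return set->ops->map_queues(set);
	return default_map_queues(set);
}

static void update_nr_hw_queues(struct tag_set *set, unsigned int nr)
{
	set->nr_hw_queues = nr;
	update_queue_map(set);	/* without this, the map keeps the old layout */
}
```

With one queue every CPU maps to queue 0; after growing to two queues the remap step is what makes the second queue reachable.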

* [PATCH v2 6/8] blk-mq-sched: provide hooks for initializing hardware queue data
From: Omar Sandoval @ 2017-04-05 18:28 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491416593.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Schedulers need to be informed when a hardware queue is added or removed
at runtime so they can allocate/free per-hardware queue data. So,
replace the blk_mq_sched_init_hctx_data() helper, which only makes sense
at init time, with .init_hctx() and .exit_hctx() hooks.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq-sched.c     | 81 +++++++++++++++++++++++++-----------------------
 block/blk-mq-sched.h     |  4 ---
 include/linux/elevator.h |  2 ++
 3 files changed, 45 insertions(+), 42 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index e8c2ed654ef0..9d7f6d6ca693 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -30,43 +30,6 @@ void blk_mq_sched_free_hctx_data(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
 
-int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
-				int (*init)(struct blk_mq_hw_ctx *),
-				void (*exit)(struct blk_mq_hw_ctx *))
-{
-	struct blk_mq_hw_ctx *hctx;
-	int ret;
-	int i;
-
-	queue_for_each_hw_ctx(q, hctx, i) {
-		hctx->sched_data = kmalloc_node(size, GFP_KERNEL, hctx->numa_node);
-		if (!hctx->sched_data) {
-			ret = -ENOMEM;
-			goto error;
-		}
-
-		if (init) {
-			ret = init(hctx);
-			if (ret) {
-				/*
-				 * We don't want to give exit() a partially
-				 * initialized sched_data. init() must clean up
-				 * if it fails.
-				 */
-				kfree(hctx->sched_data);
-				hctx->sched_data = NULL;
-				goto error;
-			}
-		}
-	}
-
-	return 0;
-error:
-	blk_mq_sched_free_hctx_data(q, exit);
-	return ret;
-}
-EXPORT_SYMBOL_GPL(blk_mq_sched_init_hctx_data);
-
 static void __blk_mq_sched_assign_ioc(struct request_queue *q,
 				      struct request *rq,
 				      struct bio *bio,
@@ -465,11 +428,24 @@ int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			   unsigned int hctx_idx)
 {
 	struct elevator_queue *e = q->elevator;
+	int ret;
 
 	if (!e)
 		return 0;
 
-	return blk_mq_sched_alloc_tags(q, hctx, hctx_idx);
+	ret = blk_mq_sched_alloc_tags(q, hctx, hctx_idx);
+	if (ret)
+		return ret;
+
+	if (e->type->ops.mq.init_hctx) {
+		ret = e->type->ops.mq.init_hctx(hctx, hctx_idx);
+		if (ret) {
+			blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
+			return ret;
+		}
+	}
+
+	return 0;
 }
 
 void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
@@ -480,12 +456,18 @@ void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 	if (!e)
 		return;
 
+	if (e->type->ops.mq.exit_hctx && hctx->sched_data) {
+		e->type->ops.mq.exit_hctx(hctx, hctx_idx);
+		hctx->sched_data = NULL;
+	}
+
 	blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx);
 }
 
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 {
 	struct blk_mq_hw_ctx *hctx;
+	struct elevator_queue *eq;
 	unsigned int i;
 	int ret;
 
@@ -510,6 +492,18 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	if (ret)
 		goto err;
 
+	if (e->ops.mq.init_hctx) {
+		queue_for_each_hw_ctx(q, hctx, i) {
+			ret = e->ops.mq.init_hctx(hctx, i);
+			if (ret) {
+				eq = q->elevator;
+				blk_mq_exit_sched(q, eq);
+				kobject_put(&eq->kobj);
+				return ret;
+			}
+		}
+	}
+
 	return 0;
 
 err:
@@ -520,6 +514,17 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 
 void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
 {
+	struct blk_mq_hw_ctx *hctx;
+	unsigned int i;
+
+	if (e->type->ops.mq.exit_hctx) {
+		queue_for_each_hw_ctx(q, hctx, i) {
+			if (hctx->sched_data) {
+				e->type->ops.mq.exit_hctx(hctx, i);
+				hctx->sched_data = NULL;
+			}
+		}
+	}
 	if (e->type->ops.mq.exit_sched)
 		e->type->ops.mq.exit_sched(e);
 	blk_mq_sched_tags_teardown(q);
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index e704956e0862..c6e760df0fb4 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -4,10 +4,6 @@
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
 
-int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
-				int (*init)(struct blk_mq_hw_ctx *),
-				void (*exit)(struct blk_mq_hw_ctx *));
-
 void blk_mq_sched_free_hctx_data(struct request_queue *q,
 				 void (*exit)(struct blk_mq_hw_ctx *));
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 22d39e8d4de1..b7ec315ee7e7 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -93,6 +93,8 @@ struct blk_mq_hw_ctx;
 struct elevator_mq_ops {
 	int (*init_sched)(struct request_queue *, struct elevator_type *);
 	void (*exit_sched)(struct elevator_queue *);
+	int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
+	void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
 
 	bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
 	bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *);
-- 
2.12.2

^ permalink raw reply related
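
The unwinding rule used by the hooks in the patch above (call exit only for hardware queues whose sched_data was set, then clear it) can be sketched as follows. The types and the init_hctx()/exit_hctx() bodies are hypothetical; the init hook is rigged to fail on queue index 2 purely to exercise the error path.

```c
#include <assert.h>
#include <stddef.h>

#define NR_HCTX 4

struct hctx { void *sched_data; };

static int exit_calls;	/* counts exit_hctx invocations, for illustration */
static int dummy_data;	/* stands in for an allocated per-queue structure */

/* Hypothetical scheduler hook: fails on queue index 2. */
static int init_hctx(struct hctx *h, unsigned int idx)
{
	if (idx == 2)
		return -1;
	h->sched_data = &dummy_data;
	return 0;
}

static void exit_hctx(struct hctx *h, unsigned int idx)
{
	(void)h;
	(void)idx;
	exit_calls++;
}

/* Tear down only the queues that were actually initialized. */
static void exit_all(struct hctx *hctxs)
{
	for (unsigned int i = 0; i < NR_HCTX; i++) {
		if (hctxs[i].sched_data) {
			exit_hctx(&hctxs[i], i);
			hctxs[i].sched_data = NULL;
		}
	}
}

/* Initialize every queue; on the first failure, unwind and report it. */
static int init_all(struct hctx *hctxs)
{
	for (unsigned int i = 0; i < NR_HCTX; i++) {
		int ret = init_hctx(&hctxs[i], i);

		if (ret) {
			exit_all(hctxs);
			return ret;
		}
	}
	return 0;
}
```

Checking sched_data before calling the exit hook means a queue whose init failed, or was never reached, is never handed to exit.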

* [PATCH v2 7/8] blk-mq: make driver tag failure path easier to follow
From: Omar Sandoval @ 2017-04-05 18:28 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491416593.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Minor cleanup that makes it easier to figure out what's going on in the
driver tag allocation failure path of blk_mq_dispatch_rq_list().

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4aa3148befe1..71dc8608f3a8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1011,17 +1011,16 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 			 * The initial allocation attempt failed, so we need to
 			 * rerun the hardware queue when a tag is freed.
 			 */
-			if (blk_mq_dispatch_wait_add(hctx)) {
-				/*
-				 * It's possible that a tag was freed in the
-				 * window between the allocation failure and
-				 * adding the hardware queue to the wait queue.
-				 */
-				if (!blk_mq_get_driver_tag(rq, &hctx, false))
-					break;
-			} else {
+			if (!blk_mq_dispatch_wait_add(hctx))
+				break;
+
+			/*
+			 * It's possible that a tag was freed in the window
+			 * between the allocation failure and adding the
+			 * hardware queue to the wait queue.
+			 */
+			if (!blk_mq_get_driver_tag(rq, &hctx, false))
 				break;
-			}
 		}
 
 		list_del_init(&rq->queuelist);
-- 
2.12.2

^ permalink raw reply related
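
To check that the cleanup above does not change behavior, the two control-flow shapes can be reduced to boolean functions. This is only a model: wait_added stands for the result of blk_mq_dispatch_wait_add() and retry_got_tag for the result of the second blk_mq_get_driver_tag() call; the function names are made up for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Old shape: the retry is nested inside the "added to wait queue" branch. */
static bool stops_nested(bool wait_added, bool retry_got_tag)
{
	if (wait_added) {
		if (!retry_got_tag)
			return true;
	} else {
		return true;
	}
	return false;
}

/* New shape: bail out early, then retry the allocation. */
static bool stops_flat(bool wait_added, bool retry_got_tag)
{
	if (!wait_added)
		return true;
	if (!retry_got_tag)
		return true;
	return false;
}
```

Both functions return true (stop dispatching) for the same inputs; dispatch continues only when the queue was added to the wait queue and the retry obtained a tag.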

* [PATCH v2 8/8] blk-mq: clean up direct issue blk_mq_queue_data initialization
From: Omar Sandoval @ 2017-04-05 18:28 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team
In-Reply-To: <cover.1491416593.git.osandov@fb.com>

From: Omar Sandoval <osandov@fb.com>

Trivial cleanup.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 block/blk-mq.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 71dc8608f3a8..779249a5999b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1451,8 +1451,7 @@ static void __blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie,
 	struct request_queue *q = rq->q;
 	struct blk_mq_queue_data bd = {
 		.rq = rq,
-		.list = NULL,
-		.last = 1
+		.last = true,
 	};
 	struct blk_mq_hw_ctx *hctx;
 	blk_qc_t new_cookie;
-- 
2.12.2

^ permalink raw reply related

* Re: [PATCH v2 1/8] blk-mq: use the right hctx when getting a driver tag fails
From: Bart Van Assche @ 2017-04-05 18:33 UTC (permalink / raw)
  To: osandov@osandov.com, linux-block@vger.kernel.org, axboe@fb.com
  Cc: kernel-team@fb.com
In-Reply-To: <89da4a6561df3e24af3ba1c8625470d3088d2fa1.1491416593.git.osandov@fb.com>

On Wed, 2017-04-05 at 11:28 -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> While dispatching requests, if we fail to get a driver tag, we mark the
> hardware queue as waiting for a tag and put the requests on a
> hctx->dispatch list to be run later when a driver tag is freed. However,
> blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
> queues if using a single-queue scheduler with a multiqueue device. If
> blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
> are processing. This means we end up using the hardware queue of the
> previous request, which may or may not be the same as that of the
> current request. If it isn't, the wrong hardware queue will end up
> waiting for a tag, and the requests will be on the wrong dispatch list,
> leading to a hang.
>
> The fix is twofold:
>
> 1. Make sure we save which hardware queue we were trying to get a
>    request for in blk_mq_get_driver_tag() regardless of whether it
>    succeeds or not.
> 2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
>    blk_mq_hw_queue to make it clear that it must handle multiple
>    hardware queues, since I've already messed this up on a couple of
>    occasions.
>
> This didn't appear in testing with nvme and mq-deadline because nvme has
> more driver tags than the default number of scheduler tags. However,
> with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.

Would the patch below be a valid alternative?

Thanks,

Bart.

[PATCH] blk-mq: Simplify blk_mq_get_driver_tag()

The blk_mq_get_driver_tag() callers either assume that *hctx is not
modified or that it points to a valid hctx pointer upon return if
tag allocation succeeded. Avoid this confusion by returning the hctx
pointer if and only if tag allocation succeeded and by only storing
the return value into hctx in those blk_mq_get_driver_tag() callers
for which the hctx pointer had not yet been computed before the
blk_mq_get_driver_tag() call.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
---
 block/blk-mq-sched.c |  4 +++-
 block/blk-mq.c       | 24 ++++++++++--------------
 block/blk-mq.h       |  3 +--
 3 files changed, 14 insertions(+), 17 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index a85d939ef450..bfb8bdb95a87 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -386,7 +386,9 @@ void blk_mq_sched_restart_a_queue(struct blk_mq_hw_ctx *hctx)
 static void blk_mq_sched_insert_flush(struct blk_mq_hw_ctx *hctx,
 				      struct request *rq, bool can_block)
 {
-	if (blk_mq_get_driver_tag(rq, &hctx, can_block)) {
+	WARN_ON_ONCE(!hctx);
+
+	if (blk_mq_get_driver_tag(rq, can_block)) {
 		blk_insert_flush(rq);
 		blk_mq_run_hw_queue(hctx, true);
 	} else
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 45e7f597cea3..c8e0c02dc8ca 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -857,8 +857,7 @@ static inline unsigned int queued_to_index(unsigned int queued)
 	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
 }
 
-bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
-			   bool wait)
+struct blk_mq_hw_ctx *blk_mq_get_driver_tag(struct request *rq, bool wait)
 {
 	struct blk_mq_alloc_data data = {
 		.q = rq->q,
@@ -866,12 +865,8 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 		.flags = wait ? 0 : BLK_MQ_REQ_NOWAIT,
 	};
 
-	if (rq->tag != -1) {
-done:
-		if (hctx)
-			*hctx = data.hctx;
-		return true;
-	}
+	if (rq->tag != -1)
+		return data.hctx;
 
 	if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag))
 		data.flags |= BLK_MQ_REQ_RESERVED;
@@ -883,10 +878,10 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 			atomic_inc(&data.hctx->nr_active);
 		}
 		data.hctx->tags->rqs[rq->tag] = rq;
-		goto done;
+		return data.hctx;
 	}
 
-	return false;
+	return NULL;
 }
 
 static void blk_mq_put_driver_tag_hctx(struct blk_mq_hw_ctx *hctx,
@@ -985,7 +980,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 		struct blk_mq_queue_data bd;
 
 		rq = list_first_entry(list, struct request, queuelist);
-		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
+		if (!blk_mq_get_driver_tag(rq, false)) {
 			if (!queued && reorder_tags_to_front(list))
 				continue;
 
@@ -999,7 +994,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 				 * window between the allocation failure and
 				 * adding the hardware queue to the wait queue.
 				 */
-				if (!blk_mq_get_driver_tag(rq, &hctx, false))
+				if (!blk_mq_get_driver_tag(rq, false))
 					break;
 			} else {
 				break;
@@ -1020,7 +1015,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 			struct request *nxt;
 
 			nxt = list_first_entry(list, struct request, queuelist);
-			bd.last = !blk_mq_get_driver_tag(nxt, NULL, false);
+			bd.last = !blk_mq_get_driver_tag(nxt, false);
 		}
 
 		ret = q->mq_ops->queue_rq(hctx, &bd);
@@ -1435,7 +1430,8 @@ static void __blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie,
 	if (q->elevator)
 		goto insert;
 
-	if (!blk_mq_get_driver_tag(rq, &hctx, false))
+	hctx = blk_mq_get_driver_tag(rq, false);
+	if (!hctx)
 		goto insert;
 
 	new_cookie = request_to_qc_t(hctx, rq);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 8d49c06fc520..b1917fbe955c 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -33,8 +33,7 @@ void blk_mq_wake_waiters(struct request_queue *q);
 bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *, struct list_head *);
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
 bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
-bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
-				bool wait);
+struct blk_mq_hw_ctx *blk_mq_get_driver_tag(struct request *rq, bool wait);
 
 /*
  * Internal helpers for allocating/freeing the request map
-- 
2.12.0

^ permalink raw reply related

* Re: [PATCH v2 8/8] blk-mq: clean up direct issue blk_mq_queue_data initialization
From: Bart Van Assche @ 2017-04-05 18:35 UTC (permalink / raw)
  To: osandov@osandov.com, linux-block@vger.kernel.org, axboe@fb.com
  Cc: kernel-team@fb.com
In-Reply-To: <e0eac74e541e9f85403fd4cf887a7f288768f00e.1491416593.git.osandov@fb.com>

On Wed, 2017-04-05 at 11:28 -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> Trivial cleanup.
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  block/blk-mq.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 71dc8608f3a8..779249a5999b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1451,8 +1451,7 @@ static void __blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie,
>  	struct request_queue *q = rq->q;
>  	struct blk_mq_queue_data bd = {
>  		.rq = rq,
> -		.list = NULL,
> -		.last = 1
> +		.last = true,
>  	};
>  	struct blk_mq_hw_ctx *hctx;
>  	blk_qc_t new_cookie;

Hello Omar,

Although this patch looks fine to me I'm afraid that since this morning
it no longer applies to Jens' for-next branch.

Bart.

^ permalink raw reply

* Re: [PATCH v2 8/8] blk-mq: clean up direct issue blk_mq_queue_data initialization
From: Omar Sandoval @ 2017-04-05 18:37 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block@vger.kernel.org, axboe@fb.com, kernel-team@fb.com
In-Reply-To: <1491417357.2787.13.camel@sandisk.com>

On Wed, Apr 05, 2017 at 06:35:58PM +0000, Bart Van Assche wrote:
> On Wed, 2017-04-05 at 11:28 -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > Trivial cleanup.
> > 
> > Signed-off-by: Omar Sandoval <osandov@fb.com>
> > ---
> >  block/blk-mq.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 71dc8608f3a8..779249a5999b 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1451,8 +1451,7 @@ static void __blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie,
> >  	struct request_queue *q = rq->q;
> >  	struct blk_mq_queue_data bd = {
> >  		.rq = rq,
> > -		.list = NULL,
> > -		.last = 1
> > +		.last = true,
> >  	};
> >  	struct blk_mq_hw_ctx *hctx;
> >  	blk_qc_t new_cookie;
> 
> Hello Omar,
> 
> Although this patch looks fine to me I'm afraid that since this morning
> it no longer applies to Jens' for-next branch.
> 
> Bart.

Oh, I forgot to rebase. Thanks, Bart, I'll send out an updated version.

^ permalink raw reply

* Re: [PATCH v2 1/8] blk-mq: use the right hctx when getting a driver tag fails
From: Omar Sandoval @ 2017-04-05 18:42 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block@vger.kernel.org, axboe@fb.com, kernel-team@fb.com
In-Reply-To: <1491417192.2787.11.camel@sandisk.com>

On Wed, Apr 05, 2017 at 06:33:14PM +0000, Bart Van Assche wrote:
> On Wed, 2017-04-05 at 11:28 -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> > 
> > While dispatching requests, if we fail to get a driver tag, we mark the
> > hardware queue as waiting for a tag and put the requests on a
> > hctx->dispatch list to be run later when a driver tag is freed. However,
> > blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
> > queues if using a single-queue scheduler with a multiqueue device. If
> > blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
> > are processing. This means we end up using the hardware queue of the
> > previous request, which may or may not be the same as that of the
> > current request. If it isn't, the wrong hardware queue will end up
> > waiting for a tag, and the requests will be on the wrong dispatch list,
> > leading to a hang.
> > 
> > The fix is twofold:
> > 
> > 1. Make sure we save which hardware queue we were trying to get a
> >    request for in blk_mq_get_driver_tag() regardless of whether it
> >    succeeds or not.
> > 2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
> >    blk_mq_hw_queue to make it clear that it must handle multiple
> >    hardware queues, since I've already messed this up on a couple of
> >    occasions.
> > 
> > This didn't appear in testing with nvme and mq-deadline because nvme has
> > more driver tags than the default number of scheduler tags. However,
> > with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
> 
> Would the patch below be a valid alternative?
> 
> Thanks,
> 
> Bart.

Hi, Bart,

This actually has the same bug as the original code, see below.

> [PATCH] blk-mq: Simplify blk_mq_get_driver_tag()
> 
> The blk_mq_get_driver_tag() callers either assume that *hctx is not
> modified or that it points to a valid hctx pointer upon return if
> tag allocation succeeded. Avoid this confusion by returning the hctx
> pointer if and only if tag allocation succeeded and by only storing
> the return value into hctx in those blk_mq_get_driver_tag() callers
> for which the hctx pointer had not yet been computed before the
> blk_mq_get_driver_tag() call.
> 
> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
> ---
>  block/blk-mq-sched.c |  4 +++-
>  block/blk-mq.c       | 24 ++++++++++--------------
>  block/blk-mq.h       |  3 +--
>  3 files changed, 14 insertions(+), 17 deletions(-)
> 

[snip]

>  static void blk_mq_put_driver_tag_hctx(struct blk_mq_hw_ctx *hctx,
> @@ -985,7 +980,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
>  		struct blk_mq_queue_data bd;
>  
>  		rq = list_first_entry(list, struct request, queuelist);
> -		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
> +		if (!blk_mq_get_driver_tag(rq, false)) {

Here, we want to know what hardware queue we attempted the tag
allocation on, so this won't work.

Thanks for taking a look!

^ permalink raw reply
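
The objection in the reply above can be modeled with a small sketch, assuming simplified stand-in types. get_tag_out() mirrors the out-parameter style, where the caller learns which hardware queue the allocation was attempted on even when it fails; get_tag_ret() mirrors the return-value style, which reports a queue only on success. All names here are hypothetical and the tag accounting is deliberately trivial.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct hctx { int id; int free_tags; };
struct request { struct hctx *hctx; int tag; };	/* tag == -1: no tag yet */

/* Out-parameter form: *hctx is written whether or not allocation succeeds. */
static bool get_tag_out(struct request *rq, struct hctx **hctx)
{
	if (hctx)
		*hctx = rq->hctx;
	if (rq->tag != -1)
		return true;	/* request already holds a tag */
	if (rq->hctx->free_tags == 0)
		return false;	/* allocation failed on rq->hctx */
	rq->hctx->free_tags--;
	rq->tag = 0;
	return true;
}

/* Return-value form: the queue is only reported on success. */
static struct hctx *get_tag_ret(struct request *rq)
{
	return get_tag_out(rq, NULL) ? rq->hctx : NULL;
}
```

If the caller is still holding the previous request's queue and the allocation fails, the return-value form leaves it with that stale queue, while the out-parameter form updates it to the queue that actually ran out of tags.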

* [PATCH v3 0/8] blk-mq: various fixes and cleanups
From: Omar Sandoval @ 2017-04-05 19:01 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

Rebase of v2 [1] onto block-next. As with v2,

- Patch 1 is the new fix for a hang that Josef reported after trying v1.
- Patches 2-6 are the original series. Patch 5 now has Christoph's and
  Sagi's Reviewed-by.
- Patches 7 and 8 are trivial cleanups.

Patches 1-5 should probably go into 4.11, and 6-8 are for 4.12.

1: http://marc.info/?l=linux-block&m=149141696306216&w=2

Omar Sandoval (8):
  blk-mq: use the right hctx when getting a driver tag fails
  blk-mq-sched: refactor scheduler initialization
  blk-mq-sched: set up scheduler tags when bringing up new queues
  blk-mq-sched: fix crash in switch error path
  blk-mq: remap queues when adding/removing hardware queues
  blk-mq-sched: provide hooks for initializing hardware queue data
  blk-mq: make driver tag failure path easier to follow
  blk-mq: use true instead of 1 for blk_mq_queue_data.last

 block/blk-mq-sched.c     | 187 +++++++++++++++++++++++++++++------------------
 block/blk-mq-sched.h     |  13 ++--
 block/blk-mq.c           |  71 ++++++++++--------
 block/blk-mq.h           |   2 +-
 block/blk-sysfs.c        |   2 +-
 block/elevator.c         | 114 +++++++++++++++--------------
 include/linux/elevator.h |   4 +-
 7 files changed, 227 insertions(+), 166 deletions(-)

-- 
2.12.2

^ permalink raw reply
