Linux block layer

Linux block layer
 help / color / mirror / Atom feed

* [PATCH 1/4] lightnvm: double-clear of dev->lun_map on target init error
From: Javier González @ 2017-04-07 18:31 UTC (permalink / raw)
  To: mb; +Cc: linux-block, linux-kernel, Javier González,
	Matias Bjørling

The dev->lun_map bits are cleared twice if an target init error occurs.
First in the target clean routine, and then next in the nvm_tgt_create
error function. Make sure that it is only cleared once by extending
nvm_remove_tgt_devi() with a clear bit, such that clearing of bits can
ignored when cleaning up a successful initialized target.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
---
 drivers/lightnvm/core.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 2122922..da4c082 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -89,7 +89,7 @@ static void nvm_release_luns_err(struct nvm_dev *dev, int lun_begin,
 		WARN_ON(!test_and_clear_bit(i, dev->lun_map));
 }
 
-static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev)
+static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev, int clear)
 {
 	struct nvm_dev *dev = tgt_dev->parent;
 	struct nvm_dev_map *dev_map = tgt_dev->map;
@@ -100,11 +100,13 @@ static void nvm_remove_tgt_dev(struct nvm_tgt_dev *tgt_dev)
 		int *lun_offs = ch_map->lun_offs;
 		int ch = i + ch_map->ch_off;
 
-		for (j = 0; j < ch_map->nr_luns; j++) {
-			int lun = j + lun_offs[j];
-			int lunid = (ch * dev->geo.luns_per_chnl) + lun;
+		if (clear) {
+			for (j = 0; j < ch_map->nr_luns; j++) {
+				int lun = j + lun_offs[j];
+				int lunid = (ch * dev->geo.luns_per_chnl) + lun;
 
-			WARN_ON(!test_and_clear_bit(lunid, dev->lun_map));
+				WARN_ON(!test_and_clear_bit(lunid, dev->lun_map));
+			}
 		}
 
 		kfree(ch_map->lun_offs);
@@ -309,7 +311,7 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 err_queue:
 	blk_cleanup_queue(tqueue);
 err_dev:
-	nvm_remove_tgt_dev(tgt_dev);
+	nvm_remove_tgt_dev(tgt_dev, 0);
 err_t:
 	kfree(t);
 err_reserve:
@@ -332,7 +334,7 @@ static void __nvm_remove_target(struct nvm_target *t)
 	if (tt->exit)
 		tt->exit(tdisk->private_data);
 
-	nvm_remove_tgt_dev(t->dev);
+	nvm_remove_tgt_dev(t->dev, 1);
 	put_disk(tdisk);
 
 	list_del(&t->list);
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically
From: Jens Axboe @ 2017-04-07 18:23 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block, linux-scsi
In-Reply-To: <20170407181654.27836-1-bart.vanassche@sandisk.com>

On 04/07/2017 12:16 PM, Bart Van Assche wrote:
> Hello Jens,
> 
> The six patches in this patch series fix the queue lockup I reported
> recently on the linux-block mailing list. Please consider these patches
> for inclusion in the upstream kernel.

Some of this we need in 4.11, but not all of it. I can't be applying patches
that "improve scalability" at this point.

4-6 looks like what we want for 4.11, I'll see if those apply directly. Then
we can put 1-3 on top in 4.12, with the others pulled in first.

-- 
Jens Axboe

^ permalink raw reply

* [PATCH v4 6/6] dm rq: Avoid that request processing stalls sporadically
From: Bart Van Assche @ 2017-04-07 18:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Bart Van Assche, Mike Snitzer, dm-devel
In-Reply-To: <20170407181654.27836-1-bart.vanassche@sandisk.com>

While running the srp-test software I noticed that request
processing stalls sporadically at the beginning of a test, namely
when mkfs is run against a dm-mpath device. Every time when that
happened the following command was sufficient to resume request
processing:

    echo run >/sys/kernel/debug/block/dm-0/state

This patch avoids that such request processing stalls occur. The
test I ran is as follows:

    while srp-test/run_tests -d -r 30 -t 02-mq; do :; done

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
---
 drivers/md/dm-rq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 6886bf160fb2..d19af1d21f4c 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -755,6 +755,7 @@ static int dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 		/* Undo dm_start_request() before requeuing */
 		rq_end_stats(md, rq);
 		rq_completed(md, rq_data_dir(rq), false);
+		blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
 		return BLK_MQ_RQ_QUEUE_BUSY;
 	}
 
-- 
2.12.0

^ permalink raw reply related

* [PATCH v4 5/6] scsi: Avoid that SCSI queues get stuck
From: Bart Van Assche @ 2017-04-07 18:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Bart Van Assche, Martin K . Petersen,
	James Bottomley, Christoph Hellwig, Hannes Reinecke,
	Sagi Grimberg, Long Li, K . Y . Srinivasan
In-Reply-To: <20170407181654.27836-1-bart.vanassche@sandisk.com>

If a .queue_rq() function returns BLK_MQ_RQ_QUEUE_BUSY then the block
driver that implements that function is responsible for rerunning the
hardware queue once requests can be queued again successfully.

commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped
queues") removed the blk_mq_stop_hw_queue() call from scsi_queue_rq()
for the BLK_MQ_RQ_QUEUE_BUSY case. Hence change all calls to functions
that are intended to rerun a busy queue such that these examine all
hardware queues instead of only stopped queues.

Since no other functions than scsi_internal_device_block() and
scsi_internal_device_unblock() should ever stop or restart a SCSI
queue, change the blk_mq_delay_queue() call into a
blk_mq_delay_run_hw_queue() call.

Fixes: commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped queues")
Fixes: commit 7e79dadce222 ("blk-mq: stop hardware queue in blk_mq_delay_queue()")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Long Li <longli@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
---
 drivers/scsi/scsi_lib.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 11972d1075f1..7bc4513bf4e4 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -496,7 +496,7 @@ static void scsi_run_queue(struct request_queue *q)
 		scsi_starved_list_run(sdev->host);
 
 	if (q->mq_ops)
-		blk_mq_start_stopped_hw_queues(q, false);
+		blk_mq_run_hw_queues(q, false);
 	else
 		blk_run_queue(q);
 }
@@ -667,7 +667,7 @@ static bool scsi_end_request(struct request *req, int error,
 		    !list_empty(&sdev->host->starved_list))
 			kblockd_schedule_work(&sdev->requeue_work);
 		else
-			blk_mq_start_stopped_hw_queues(q, true);
+			blk_mq_run_hw_queues(q, true);
 	} else {
 		unsigned long flags;
 
@@ -1974,7 +1974,7 @@ static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_MQ_RQ_QUEUE_BUSY:
 		if (atomic_read(&sdev->device_busy) == 0 &&
 		    !scsi_device_blocked(sdev))
-			blk_mq_delay_queue(hctx, SCSI_QUEUE_DELAY);
+			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
 		break;
 	case BLK_MQ_RQ_QUEUE_ERROR:
 		/*
-- 
2.12.0

^ permalink raw reply related

* [PATCH v4 4/6] blk-mq: Introduce blk_mq_delay_run_hw_queue()
From: Bart Van Assche @ 2017-04-07 18:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Bart Van Assche, Christoph Hellwig,
	Hannes Reinecke, Long Li, K . Y . Srinivasan
In-Reply-To: <20170407181654.27836-1-bart.vanassche@sandisk.com>

Introduce a function that runs a hardware queue unconditionally
after a delay. Note: there is already a function that stops and
restarts a hardware queue after a delay, namely blk_mq_delay_queue().

This function will be used in the next patch in this series.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Long Li <longli@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
---
 block/blk-mq.c         | 32 ++++++++++++++++++++++++++++++--
 include/linux/blk-mq.h |  2 ++
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index aff85d41cea3..836e3a17da54 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1146,7 +1146,8 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 	return hctx->next_cpu;
 }
 
-void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
+static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
+					unsigned long msecs)
 {
 	if (unlikely(blk_mq_hctx_stopped(hctx) ||
 		     !blk_mq_hw_queue_mapped(hctx)))
@@ -1163,7 +1164,24 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 		put_cpu();
 	}
 
-	kblockd_schedule_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work);
+	if (msecs == 0)
+		kblockd_schedule_work_on(blk_mq_hctx_next_cpu(hctx),
+					 &hctx->run_work);
+	else
+		kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
+						 &hctx->delayed_run_work,
+						 msecs_to_jiffies(msecs));
+}
+
+void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
+{
+	__blk_mq_delay_run_hw_queue(hctx, true, msecs);
+}
+EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
+
+void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
+{
+	__blk_mq_delay_run_hw_queue(hctx, async, 0);
 }
 
 void blk_mq_run_hw_queues(struct request_queue *q, bool async)
@@ -1266,6 +1284,15 @@ static void blk_mq_run_work_fn(struct work_struct *work)
 	__blk_mq_run_hw_queue(hctx);
 }
 
+static void blk_mq_delayed_run_work_fn(struct work_struct *work)
+{
+	struct blk_mq_hw_ctx *hctx;
+
+	hctx = container_of(work, struct blk_mq_hw_ctx, delayed_run_work.work);
+
+	__blk_mq_run_hw_queue(hctx);
+}
+
 static void blk_mq_delay_work_fn(struct work_struct *work)
 {
 	struct blk_mq_hw_ctx *hctx;
@@ -1866,6 +1893,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
 		node = hctx->numa_node = set->numa_node;
 
 	INIT_WORK(&hctx->run_work, blk_mq_run_work_fn);
+	INIT_DELAYED_WORK(&hctx->delayed_run_work, blk_mq_delayed_run_work_fn);
 	INIT_DELAYED_WORK(&hctx->delay_work, blk_mq_delay_work_fn);
 	spin_lock_init(&hctx->lock);
 	INIT_LIST_HEAD(&hctx->dispatch);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index bdea90d75274..b90c3d5766cd 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -51,6 +51,7 @@ struct blk_mq_hw_ctx {
 
 	atomic_t		nr_active;
 
+	struct delayed_work	delayed_run_work;
 	struct delayed_work	delay_work;
 
 	struct hlist_node	cpuhp_dead;
@@ -236,6 +237,7 @@ void blk_mq_stop_hw_queues(struct request_queue *q);
 void blk_mq_start_hw_queues(struct request_queue *q);
 void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
+void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_run_hw_queues(struct request_queue *q, bool async);
 void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
-- 
2.12.0

^ permalink raw reply related

* [PATCH v4 3/6] blk-mq: Clarify comments in blk_mq_dispatch_rq_list()
From: Bart Van Assche @ 2017-04-07 18:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Bart Van Assche, Omar Sandoval,
	Christoph Hellwig, Hannes Reinecke
In-Reply-To: <20170407181654.27836-1-bart.vanassche@sandisk.com>

The blk_mq_dispatch_rq_list() implementation got modified several
times but the comments in that function were not updated every
time. Since it is nontrivial what is going on, update the comments
in blk_mq_dispatch_rq_list().

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
---
 block/blk-mq.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index dba34eb79a08..aff85d41cea3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1063,8 +1063,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 	 */
 	if (!list_empty(list)) {
 		/*
-		 * If we got a driver tag for the next request already,
-		 * free it again.
+		 * If an I/O scheduler has been configured and we got a driver
+		 * tag for the next request already, free it again.
 		 */
 		rq = list_first_entry(list, struct request, queuelist);
 		blk_mq_put_driver_tag(rq);
@@ -1074,16 +1074,24 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 		spin_unlock(&hctx->lock);
 
 		/*
-		 * the queue is expected stopped with BLK_MQ_RQ_QUEUE_BUSY, but
-		 * it's possible the queue is stopped and restarted again
-		 * before this. Queue restart will dispatch requests. And since
-		 * requests in rq_list aren't added into hctx->dispatch yet,
-		 * the requests in rq_list might get lost.
+		 * If SCHED_RESTART was set by the caller of this function and
+		 * it is no longer set that means that it was cleared by another
+		 * thread and hence that a queue rerun is needed.
 		 *
-		 * blk_mq_run_hw_queue() already checks the STOPPED bit
+		 * If TAG_WAITING is set that means that an I/O scheduler has
+		 * been configured and another thread is waiting for a driver
+		 * tag. To guarantee fairness, do not rerun this hardware queue
+		 * but let the other thread grab the driver tag.
 		 *
-		 * If RESTART or TAG_WAITING is set, then let completion restart
-		 * the queue instead of potentially looping here.
+		 * If no I/O scheduler has been configured it is possible that
+		 * the hardware queue got stopped and restarted before requests
+		 * were pushed back onto the dispatch list. Rerun the queue to
+		 * avoid starvation. Notes:
+		 * - blk_mq_run_hw_queue() checks whether or not a queue has
+		 *   been stopped before rerunning a queue.
+		 * - Some but not all block drivers stop a queue before
+		 *   returning BLK_MQ_RQ_QUEUE_BUSY. Two exceptions are scsi-mq
+		 *   and dm-rq.
 		 */
 		if (!blk_mq_sched_needs_restart(hctx) &&
 		    !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state))
-- 
2.12.0

^ permalink raw reply related

* [PATCH v4 2/6] blk-mq: Restart a single queue if tag sets are shared
From: Bart Van Assche @ 2017-04-07 18:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Bart Van Assche, Christoph Hellwig,
	Hannes Reinecke
In-Reply-To: <20170407181654.27836-1-bart.vanassche@sandisk.com>

To improve scalability, if hardware queues are shared, restart
a single hardware queue in round-robin fashion. Rename
blk_mq_sched_restart_queues() to reflect the new semantics.
Remove blk_mq_sched_mark_restart_queue() because this function
has no callers. Remove flag QUEUE_FLAG_RESTART because this
patch removes the code that uses this flag.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
---
 block/blk-mq-sched.c   | 63 ++++++++++++++++++++++++++++++++++++++++++--------
 block/blk-mq-sched.h   | 16 +------------
 block/blk-mq.c         |  2 +-
 include/linux/blkdev.h |  1 -
 4 files changed, 55 insertions(+), 27 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 09af8ff18719..a5c683a6429c 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -317,25 +317,68 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
 	return true;
 }
 
-static void blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
 {
 	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
 		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
-		if (blk_mq_hctx_has_pending(hctx))
+		if (blk_mq_hctx_has_pending(hctx)) {
 			blk_mq_run_hw_queue(hctx, true);
+			return true;
+		}
 	}
+	return false;
 }
 
-void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx)
-{
-	struct request_queue *q = hctx->queue;
-	unsigned int i;
+/**
+ * list_for_each_entry_rcu_rr - iterate in a round-robin fashion over rcu list
+ * @pos:    loop cursor.
+ * @skip:   the list element that will not be examined. Iteration starts at
+ *          @skip->next.
+ * @head:   head of the list to examine. This list must have at least one
+ *          element, namely @skip.
+ * @member: name of the list_head structure within typeof(*pos).
+ */
+#define list_for_each_entry_rcu_rr(pos, skip, head, member)		\
+	for ((pos) = (skip);						\
+	     (pos = (pos)->member.next != (head) ? list_entry_rcu(	\
+			(pos)->member.next, typeof(*pos), member) :	\
+	      list_entry_rcu((pos)->member.next->next, typeof(*pos), member)), \
+	     (pos) != (skip); )
 
-	if (test_bit(QUEUE_FLAG_RESTART, &q->queue_flags)) {
-		if (test_and_clear_bit(QUEUE_FLAG_RESTART, &q->queue_flags)) {
-			queue_for_each_hw_ctx(q, hctx, i)
-				blk_mq_sched_restart_hctx(hctx);
+/*
+ * Called after a driver tag has been freed to check whether a hctx needs to
+ * be restarted. Restarts @hctx if its tag set is not shared. Restarts hardware
+ * queues in a round-robin fashion if the tag set of @hctx is shared with other
+ * hardware queues.
+ */
+void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
+{
+	struct blk_mq_tags *const tags = hctx->tags;
+	struct blk_mq_tag_set *const set = hctx->queue->tag_set;
+	struct request_queue *const queue = hctx->queue, *q;
+	struct blk_mq_hw_ctx *hctx2;
+	unsigned int i, j;
+
+	if (set->flags & BLK_MQ_F_TAG_SHARED) {
+		rcu_read_lock();
+		list_for_each_entry_rcu_rr(q, queue, &set->tag_list,
+					   tag_set_list) {
+			queue_for_each_hw_ctx(q, hctx2, i)
+				if (hctx2->tags == tags &&
+				    blk_mq_sched_restart_hctx(hctx2))
+					goto done;
+		}
+		j = hctx->queue_num + 1;
+		for (i = 0; i < queue->nr_hw_queues; i++, j++) {
+			if (j == queue->nr_hw_queues)
+				j = 0;
+			hctx2 = queue->queue_hw_ctx[j];
+			if (hctx2->tags == tags &&
+			    blk_mq_sched_restart_hctx(hctx2))
+				break;
 		}
+done:
+		rcu_read_unlock();
 	} else {
 		blk_mq_sched_restart_hctx(hctx);
 	}
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index a75b16b123f7..4e3fc2a40207 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -19,7 +19,7 @@ bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
 				struct request **merged_request);
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio);
 bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq);
-void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx);
+void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_sched_insert_request(struct request *rq, bool at_head,
 				 bool run_queue, bool async, bool can_block);
@@ -131,20 +131,6 @@ static inline void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
 		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 }
 
-/*
- * Mark a hardware queue and the request queue it belongs to as needing a
- * restart.
- */
-static inline void blk_mq_sched_mark_restart_queue(struct blk_mq_hw_ctx *hctx)
-{
-	struct request_queue *q = hctx->queue;
-
-	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
-	if (!test_bit(QUEUE_FLAG_RESTART, &q->queue_flags))
-		set_bit(QUEUE_FLAG_RESTART, &q->queue_flags);
-}
-
 static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
 {
 	return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c26464f9649a..dba34eb79a08 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -351,7 +351,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
 		blk_mq_sched_completed_request(hctx, rq);
-	blk_mq_sched_restart_queues(hctx);
+	blk_mq_sched_restart(hctx);
 	blk_queue_exit(q);
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3cf241b0814d..dc6c8d39d462 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -615,7 +615,6 @@ struct request_queue {
 #define QUEUE_FLAG_FLUSH_NQ    25	/* flush not queueuable */
 #define QUEUE_FLAG_DAX         26	/* device supports DAX */
 #define QUEUE_FLAG_STATS       27	/* track rq completion times */
-#define QUEUE_FLAG_RESTART     28	/* queue needs restart at completion */
 #define QUEUE_FLAG_POLL_STATS  29	/* collecting stats for hybrid polling */
 #define QUEUE_FLAG_REGISTERED  30	/* queue has been registered to a disk */
 
-- 
2.12.0

^ permalink raw reply related

* [PATCH v4 1/6] blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list
From: Bart Van Assche @ 2017-04-07 18:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Bart Van Assche, Christoph Hellwig,
	Hannes Reinecke
In-Reply-To: <20170407181654.27836-1-bart.vanassche@sandisk.com>

Since the next patch in this series will use RCU to iterate over
tag_list, make this safe. Add lockdep_assert_held() statements
in functions that iterate over tag_list to make clear that using
list_for_each_entry() instead of list_for_each_entry_rcu() is
fine in these functions.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
---
 block/blk-mq.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f7cd3208bcdf..c26464f9649a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2076,6 +2076,8 @@ static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set, bool shared)
 {
 	struct request_queue *q;
 
+	lockdep_assert_held(&set->tag_list_lock);
+
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_freeze_queue(q);
 		queue_set_hctx_shared(q, shared);
@@ -2088,7 +2090,8 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 	struct blk_mq_tag_set *set = q->tag_set;
 
 	mutex_lock(&set->tag_list_lock);
-	list_del_init(&q->tag_set_list);
+	list_del_rcu(&q->tag_set_list);
+	INIT_LIST_HEAD(&q->tag_set_list);
 	if (list_is_singular(&set->tag_list)) {
 		/* just transitioned to unshared */
 		set->flags &= ~BLK_MQ_F_TAG_SHARED;
@@ -2096,6 +2099,8 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 		blk_mq_update_tag_set_depth(set, false);
 	}
 	mutex_unlock(&set->tag_list_lock);
+
+	synchronize_rcu();
 }
 
 static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
@@ -2113,7 +2118,7 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 	}
 	if (set->flags & BLK_MQ_F_TAG_SHARED)
 		queue_set_hctx_shared(q, true);
-	list_add_tail(&q->tag_set_list, &set->tag_list);
+	list_add_tail_rcu(&q->tag_set_list, &set->tag_list);
 
 	mutex_unlock(&set->tag_list_lock);
 }
@@ -2601,6 +2606,8 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 {
 	struct request_queue *q;
 
+	lockdep_assert_held(&set->tag_list_lock);
+
 	if (nr_hw_queues > nr_cpu_ids)
 		nr_hw_queues = nr_cpu_ids;
 	if (nr_hw_queues < 1 || nr_hw_queues == set->nr_hw_queues)
-- 
2.12.0

^ permalink raw reply related

* [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically
From: Bart Van Assche @ 2017-04-07 18:16 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-scsi, Bart Van Assche

Hello Jens,

The six patches in this patch series fix the queue lockup I reported
recently on the linux-block mailing list. Please consider these patches
for inclusion in the upstream kernel.

Thanks,

Bart.

Changes between v3 and v4:
- Addressed the review comments on version three of this series about the
  patch that makes it safe to use RCU to iterate over .tag_list and also
  about the runtime performance and use of short variable names in patch 2/5.
- Clarified the description of the patch that fixes the scsi-mq stall.
- Added a patch to fix a dm-mq queue stall.
  
Changes between v2 and v3:
- Removed the blk_mq_ops.restart_hctx function pointer again.
- Modified blk_mq_sched_restart_queues() such that only a single hardware
  queue is restarted instead of multiple if hardware queues are shared.
- Introduced a new function in the block layer, namely
  blk_mq_delay_run_hw_queue().  

Changes between v1 and v2:
- Reworked scsi_restart_queues() such that it no longer takes the SCSI
  host lock.
- Added two patches - one for exporting blk_mq_sched_restart_hctx() and
  another one to make iterating with RCU over blk_mq_tag_set.tag_list safe.

Bart Van Assche (6):
  blk-mq: Make it safe to use RCU to iterate over
    blk_mq_tag_set.tag_list
  blk-mq: Restart a single queue if tag sets are shared
  blk-mq: Clarify comments in blk_mq_dispatch_rq_list()
  blk-mq: Introduce blk_mq_delay_run_hw_queue()
  scsi: Avoid that SCSI queues get stuck
  dm rq: Avoid that request processing stalls sporadically

 block/blk-mq-sched.c    | 63 +++++++++++++++++++++++++++++++++++-------
 block/blk-mq-sched.h    | 16 +----------
 block/blk-mq.c          | 73 +++++++++++++++++++++++++++++++++++++++----------
 drivers/md/dm-rq.c      |  1 +
 drivers/scsi/scsi_lib.c |  6 ++--
 include/linux/blk-mq.h  |  2 ++
 include/linux/blkdev.h  |  1 -
 7 files changed, 118 insertions(+), 44 deletions(-)

-- 
2.12.0

^ permalink raw reply

* Re: [PATCH] block: sed-opal: Tone down all the pr_* to debugs
From: Jens Axboe @ 2017-04-07 18:08 UTC (permalink / raw)
  To: Scott Bauer, linux-block; +Cc: jonathan.derrick, hch
In-Reply-To: <1491581217-7342-1-git-send-email-scott.bauer@intel.com>

On 04/07/2017 10:06 AM, Scott Bauer wrote:
> Lets not flood the kernel log with messages unless
> the user requests so.

Can you send this against my for-4.12/block of for-next branch? One
hunk is rejected right now, looks like this is against master.

-- 
Jens Axboe

^ permalink raw reply

* [PATCH] block: sed-opal: Tone down all the pr_* to debugs
From: Scott Bauer @ 2017-04-07 16:06 UTC (permalink / raw)
  To: linux-block; +Cc: axboe, jonathan.derrick, hch, Scott Bauer

Lets not flood the kernel log with messages unless
the user requests so.

Signed-off-by: Scott Bauer <scott.bauer@intel.com>
---
 block/sed-opal.c | 153 +++++++++++++++++++++++++++----------------------------
 1 file changed, 74 insertions(+), 79 deletions(-)

diff --git a/block/sed-opal.c b/block/sed-opal.c
index 14035f8..f10227e 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -275,8 +275,8 @@ static bool check_tper(const void *data)
 	u8 flags = tper->supported_features;
 
 	if (!(flags & TPER_SYNC_SUPPORTED)) {
-		pr_err("TPer sync not supported. flags = %d\n",
-		       tper->supported_features);
+		pr_debug("TPer sync not supported. flags = %d\n",
+			 tper->supported_features);
 		return false;
 	}
 
@@ -289,7 +289,7 @@ static bool check_sum(const void *data)
 	u32 nlo = be32_to_cpu(sum->num_locking_objects);
 
 	if (nlo == 0) {
-		pr_err("Need at least one locking object.\n");
+		pr_debug("Need at least one locking object.\n");
 		return false;
 	}
 
@@ -385,9 +385,9 @@ static int next(struct opal_dev *dev)
 
 		error = step->fn(dev, step->data);
 		if (error) {
-			pr_err("Error on step function: %d with error %d: %s\n",
-			       state, error,
-			       opal_error_to_human(error));
+			pr_debug("Error on step function: %d with error %d: %s\n",
+				 state, error,
+				 opal_error_to_human(error));
 
 			/* For each OPAL command we do a discovery0 then we
 			 * start some sort of session.
@@ -419,8 +419,8 @@ static int opal_discovery0_end(struct opal_dev *dev)
 	print_buffer(dev->resp, hlen);
 
 	if (hlen > IO_BUFFER_LENGTH - sizeof(*hdr)) {
-		pr_warn("Discovery length overflows buffer (%zu+%u)/%u\n",
-			sizeof(*hdr), hlen, IO_BUFFER_LENGTH);
+		pr_debug("Discovery length overflows buffer (%zu+%u)/%u\n",
+			 sizeof(*hdr), hlen, IO_BUFFER_LENGTH);
 		return -EFAULT;
 	}
 
@@ -503,7 +503,7 @@ static void add_token_u8(int *err, struct opal_dev *cmd, u8 tok)
 	if (*err)
 		return;
 	if (cmd->pos >= IO_BUFFER_LENGTH - 1) {
-		pr_err("Error adding u8: end of buffer.\n");
+		pr_debug("Error adding u8: end of buffer.\n");
 		*err = -ERANGE;
 		return;
 	}
@@ -553,7 +553,7 @@ static void add_token_u64(int *err, struct opal_dev *cmd, u64 number)
 	len = DIV_ROUND_UP(msb, 4);
 
 	if (cmd->pos >= IO_BUFFER_LENGTH - len - 1) {
-		pr_err("Error adding u64: end of buffer.\n");
+		pr_debug("Error adding u64: end of buffer.\n");
 		*err = -ERANGE;
 		return;
 	}
@@ -579,7 +579,7 @@ static void add_token_bytestring(int *err, struct opal_dev *cmd,
 	}
 
 	if (len >= IO_BUFFER_LENGTH - cmd->pos - header_len) {
-		pr_err("Error adding bytestring: end of buffer.\n");
+		pr_debug("Error adding bytestring: end of buffer.\n");
 		*err = -ERANGE;
 		return;
 	}
@@ -597,7 +597,7 @@ static void add_token_bytestring(int *err, struct opal_dev *cmd,
 static int build_locking_range(u8 *buffer, size_t length, u8 lr)
 {
 	if (length > OPAL_UID_LENGTH) {
-		pr_err("Can't build locking range. Length OOB\n");
+		pr_debug("Can't build locking range. Length OOB\n");
 		return -ERANGE;
 	}
 
@@ -614,7 +614,7 @@ static int build_locking_range(u8 *buffer, size_t length, u8 lr)
 static int build_locking_user(u8 *buffer, size_t length, u8 lr)
 {
 	if (length > OPAL_UID_LENGTH) {
-		pr_err("Can't build locking range user, Length OOB\n");
+		pr_debug("Can't build locking range user, Length OOB\n");
 		return -ERANGE;
 	}
 
@@ -648,7 +648,7 @@ static int cmd_finalize(struct opal_dev *cmd, u32 hsn, u32 tsn)
 	add_token_u8(&err, cmd, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error finalizing command.\n");
+		pr_debug("Error finalizing command.\n");
 		return -EFAULT;
 	}
 
@@ -660,7 +660,7 @@ static int cmd_finalize(struct opal_dev *cmd, u32 hsn, u32 tsn)
 	hdr->subpkt.length = cpu_to_be32(cmd->pos - sizeof(*hdr));
 	while (cmd->pos % 4) {
 		if (cmd->pos >= IO_BUFFER_LENGTH) {
-			pr_err("Error: Buffer overrun\n");
+			pr_debug("Error: Buffer overrun\n");
 			return -ERANGE;
 		}
 		cmd->cmd[cmd->pos++] = 0;
@@ -679,14 +679,14 @@ static const struct opal_resp_tok *response_get_token(
 	const struct opal_resp_tok *tok;
 
 	if (n >= resp->num) {
-		pr_err("Token number doesn't exist: %d, resp: %d\n",
-		       n, resp->num);
+		pr_debug("Token number doesn't exist: %d, resp: %d\n",
+			 n, resp->num);
 		return ERR_PTR(-EINVAL);
 	}
 
 	tok = &resp->toks[n];
 	if (tok->len == 0) {
-		pr_err("Token length must be non-zero\n");
+		pr_debug("Token length must be non-zero\n");
 		return ERR_PTR(-EINVAL);
 	}
 
@@ -727,7 +727,7 @@ static ssize_t response_parse_short(struct opal_resp_tok *tok,
 
 		tok->type = OPAL_DTA_TOKENID_UINT;
 		if (tok->len > 9) {
-			pr_warn("uint64 with more than 8 bytes\n");
+			pr_debug("uint64 with more than 8 bytes\n");
 			return -EINVAL;
 		}
 		for (i = tok->len - 1; i > 0; i--) {
@@ -814,8 +814,8 @@ static int response_parse(const u8 *buf, size_t length,
 
 	if (clen == 0 || plen == 0 || slen == 0 ||
 	    slen > IO_BUFFER_LENGTH - sizeof(*hdr)) {
-		pr_err("Bad header length. cp: %u, pkt: %u, subpkt: %u\n",
-		       clen, plen, slen);
+		pr_debug("Bad header length. cp: %u, pkt: %u, subpkt: %u\n",
+			 clen, plen, slen);
 		print_buffer(pos, sizeof(*hdr));
 		return -EINVAL;
 	}
@@ -848,7 +848,7 @@ static int response_parse(const u8 *buf, size_t length,
 	}
 
 	if (num_entries == 0) {
-		pr_err("Couldn't parse response.\n");
+		pr_debug("Couldn't parse response.\n");
 		return -EINVAL;
 	}
 	resp->num = num_entries;
@@ -861,18 +861,18 @@ static size_t response_get_string(const struct parsed_resp *resp, int n,
 {
 	*store = NULL;
 	if (!resp) {
-		pr_err("Response is NULL\n");
+		pr_debug("Response is NULL\n");
 		return 0;
 	}
 
 	if (n > resp->num) {
-		pr_err("Response has %d tokens. Can't access %d\n",
-		       resp->num, n);
+		pr_debug("Response has %d tokens. Can't access %d\n",
+			 resp->num, n);
 		return 0;
 	}
 
 	if (resp->toks[n].type != OPAL_DTA_TOKENID_BYTESTRING) {
-		pr_err("Token is not a byte string!\n");
+		pr_debug("Token is not a byte string!\n");
 		return 0;
 	}
 
@@ -883,26 +883,26 @@ static size_t response_get_string(const struct parsed_resp *resp, int n,
 static u64 response_get_u64(const struct parsed_resp *resp, int n)
 {
 	if (!resp) {
-		pr_err("Response is NULL\n");
+		pr_debug("Response is NULL\n");
 		return 0;
 	}
 
 	if (n > resp->num) {
-		pr_err("Response has %d tokens. Can't access %d\n",
-		       resp->num, n);
+		pr_debug("Response has %d tokens. Can't access %d\n",
+			 resp->num, n);
 		return 0;
 	}
 
 	if (resp->toks[n].type != OPAL_DTA_TOKENID_UINT) {
-		pr_err("Token is not unsigned it: %d\n",
-		       resp->toks[n].type);
+		pr_debug("Token is not unsigned it: %d\n",
+			 resp->toks[n].type);
 		return 0;
 	}
 
 	if (!(resp->toks[n].width == OPAL_WIDTH_TINY ||
 	      resp->toks[n].width == OPAL_WIDTH_SHORT)) {
-		pr_err("Atom is not short or tiny: %d\n",
-		       resp->toks[n].width);
+		pr_debug("Atom is not short or tiny: %d\n",
+			 resp->toks[n].width);
 		return 0;
 	}
 
@@ -949,7 +949,7 @@ static int parse_and_check_status(struct opal_dev *dev)
 
 	error = response_parse(dev->resp, IO_BUFFER_LENGTH, &dev->parsed);
 	if (error) {
-		pr_err("Couldn't parse response.\n");
+		pr_debug("Couldn't parse response.\n");
 		return error;
 	}
 
@@ -975,7 +975,7 @@ static int start_opal_session_cont(struct opal_dev *dev)
 	tsn = response_get_u64(&dev->parsed, 5);
 
 	if (hsn == 0 && tsn == 0) {
-		pr_err("Couldn't authenticate session\n");
+		pr_debug("Couldn't authenticate session\n");
 		return -EPERM;
 	}
 
@@ -1012,7 +1012,7 @@ static int finalize_and_send(struct opal_dev *dev, cont_fn cont)
 
 	ret = cmd_finalize(dev, dev->hsn, dev->tsn);
 	if (ret) {
-		pr_err("Error finalizing command buffer: %d\n", ret);
+		pr_debug("Error finalizing command buffer: %d\n", ret);
 		return ret;
 	}
 
@@ -1041,7 +1041,7 @@ static int gen_key(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error building gen key command\n");
+		pr_debug("Error building gen key command\n");
 		return err;
 
 	}
@@ -1059,8 +1059,8 @@ static int get_active_key_cont(struct opal_dev *dev)
 		return error;
 	keylen = response_get_string(&dev->parsed, 4, &activekey);
 	if (!activekey) {
-		pr_err("%s: Couldn't extract the Activekey from the response\n",
-		       __func__);
+		pr_debug("%s: Couldn't extract the Activekey from the response\n",
+			 __func__);
 		return OPAL_INVAL_PARAM;
 	}
 	dev->prev_data = kmemdup(activekey, keylen, GFP_KERNEL);
@@ -1103,7 +1103,7 @@ static int get_active_key(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 	if (err) {
-		pr_err("Error building get active key command\n");
+		pr_debug("Error building get active key command\n");
 		return err;
 	}
 
@@ -1159,7 +1159,7 @@ static inline int enable_global_lr(struct opal_dev *dev, u8 *uid,
 	err = generic_lr_enable_disable(dev, uid, !!setup->RLE, !!setup->WLE,
 					0, 0);
 	if (err)
-		pr_err("Failed to create enable global lr command\n");
+		pr_debug("Failed to create enable global lr command\n");
 	return err;
 }
 
@@ -1217,7 +1217,7 @@ static int setup_locking_range(struct opal_dev *dev, void *data)
 
 	}
 	if (err) {
-		pr_err("Error building Setup Locking range command.\n");
+		pr_debug("Error building Setup Locking range command.\n");
 		return err;
 
 	}
@@ -1234,11 +1234,8 @@ static int start_generic_opal_session(struct opal_dev *dev,
 	u32 hsn;
 	int err = 0;
 
-	if (key == NULL && auth != OPAL_ANYBODY_UID) {
-		pr_err("%s: Attempted to open ADMIN_SP Session without a Host" \
-		       "Challenge, and not as the Anybody UID\n", __func__);
+	if (key == NULL && auth != OPAL_ANYBODY_UID)
 		return OPAL_INVAL_PARAM;
-	}
 
 	clear_opal_cmd(dev);
 
@@ -1273,12 +1270,12 @@ static int start_generic_opal_session(struct opal_dev *dev,
 		add_token_u8(&err, dev, OPAL_ENDLIST);
 		break;
 	default:
-		pr_err("Cannot start Admin SP session with auth %d\n", auth);
+		pr_debug("Cannot start Admin SP session with auth %d\n", auth);
 		return OPAL_INVAL_PARAM;
 	}
 
 	if (err) {
-		pr_err("Error building start adminsp session command.\n");
+		pr_debug("Error building start adminsp session command.\n");
 		return err;
 	}
 
@@ -1369,7 +1366,7 @@ static int start_auth_opal_session(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error building STARTSESSION command.\n");
+		pr_debug("Error building STARTSESSION command.\n");
 		return err;
 	}
 
@@ -1391,7 +1388,7 @@ static int revert_tper(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_STARTLIST);
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 	if (err) {
-		pr_err("Error building REVERT TPER command.\n");
+		pr_debug("Error building REVERT TPER command.\n");
 		return err;
 	}
 
@@ -1426,7 +1423,7 @@ static int internal_activate_user(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error building Activate UserN command.\n");
+		pr_debug("Error building Activate UserN command.\n");
 		return err;
 	}
 
@@ -1453,7 +1450,7 @@ static int erase_locking_range(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error building Erase Locking Range Command.\n");
+		pr_debug("Error building Erase Locking Range Command.\n");
 		return err;
 	}
 	return finalize_and_send(dev, parse_and_check_status);
@@ -1484,7 +1481,7 @@ static int set_mbr_done(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error Building set MBR Done command\n");
+		pr_debug("Error Building set MBR Done command\n");
 		return err;
 	}
 
@@ -1516,7 +1513,7 @@ static int set_mbr_enable_disable(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error Building set MBR done command\n");
+		pr_debug("Error Building set MBR done command\n");
 		return err;
 	}
 
@@ -1567,7 +1564,7 @@ static int set_new_pw(struct opal_dev *dev, void *data)
 
 	if (generic_pw_cmd(usr->opal_key.key, usr->opal_key.key_len,
 			   cpin_uid, dev)) {
-		pr_err("Error building set password command.\n");
+		pr_debug("Error building set password command.\n");
 		return -ERANGE;
 	}
 
@@ -1582,7 +1579,7 @@ static int set_sid_cpin_pin(struct opal_dev *dev, void *data)
 	memcpy(cpin_uid, opaluid[OPAL_C_PIN_SID], OPAL_UID_LENGTH);
 
 	if (generic_pw_cmd(key->key, key->key_len, cpin_uid, dev)) {
-		pr_err("Error building Set SID cpin\n");
+		pr_debug("Error building Set SID cpin\n");
 		return -ERANGE;
 	}
 	return finalize_and_send(dev, parse_and_check_status);
@@ -1657,7 +1654,7 @@ static int add_user_to_lr(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error building add user to locking range command.\n");
+		pr_debug("Error building add user to locking range command.\n");
 		return err;
 	}
 
@@ -1691,7 +1688,7 @@ static int lock_unlock_locking_range(struct opal_dev *dev, void *data)
 		/* vars are initalized to locked */
 		break;
 	default:
-		pr_err("Tried to set an invalid locking state... returning to uland\n");
+		pr_debug("Tried to set an invalid locking state... returning to uland\n");
 		return OPAL_INVAL_PARAM;
 	}
 
@@ -1718,7 +1715,7 @@ static int lock_unlock_locking_range(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error building SET command.\n");
+		pr_debug("Error building SET command.\n");
 		return err;
 	}
 	return finalize_and_send(dev, parse_and_check_status);
@@ -1752,14 +1749,14 @@ static int lock_unlock_locking_range_sum(struct opal_dev *dev, void *data)
 		/* vars are initalized to locked */
 		break;
 	default:
-		pr_err("Tried to set an invalid locking state.\n");
+		pr_debug("Tried to set an invalid locking state.\n");
 		return OPAL_INVAL_PARAM;
 	}
 	ret = generic_lr_enable_disable(dev, lr_buffer, 1, 1,
 					read_locked, write_locked);
 
 	if (ret < 0) {
-		pr_err("Error building SET command.\n");
+		pr_debug("Error building SET command.\n");
 		return ret;
 	}
 	return finalize_and_send(dev, parse_and_check_status);
@@ -1811,7 +1808,7 @@ static int activate_lsp(struct opal_dev *dev, void *data)
 	}
 
 	if (err) {
-		pr_err("Error building Activate LockingSP command.\n");
+		pr_debug("Error building Activate LockingSP command.\n");
 		return err;
 	}
 
@@ -1831,7 +1828,7 @@ static int get_lsp_lifecycle_cont(struct opal_dev *dev)
 	/* 0x08 is Manufacured Inactive */
 	/* 0x09 is Manufactured */
 	if (lc_status != OPAL_MANUFACTURED_INACTIVE) {
-		pr_err("Couldn't determine the status of the Lifcycle state\n");
+		pr_debug("Couldn't determine the status of the Lifcycle state\n");
 		return -ENODEV;
 	}
 
@@ -1868,7 +1865,7 @@ static int get_lsp_lifecycle(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error Building GET Lifecycle Status command\n");
+		pr_debug("Error Building GET Lifecycle Status command\n");
 		return err;
 	}
 
@@ -1887,7 +1884,7 @@ static int get_msid_cpin_pin_cont(struct opal_dev *dev)
 
 	strlen = response_get_string(&dev->parsed, 4, &msid_pin);
 	if (!msid_pin) {
-		pr_err("%s: Couldn't extract PIN from response\n", __func__);
+		pr_debug("%s: Couldn't extract PIN from response\n", __func__);
 		return OPAL_INVAL_PARAM;
 	}
 
@@ -1929,7 +1926,7 @@ static int get_msid_cpin_pin(struct opal_dev *dev, void *data)
 	add_token_u8(&err, dev, OPAL_ENDLIST);
 
 	if (err) {
-		pr_err("Error building Get MSID CPIN PIN command.\n");
+		pr_debug("Error building Get MSID CPIN PIN command.\n");
 		return err;
 	}
 
@@ -2124,18 +2121,18 @@ static int opal_add_user_to_lr(struct opal_dev *dev,
 
 	if (lk_unlk->l_state != OPAL_RO &&
 	    lk_unlk->l_state != OPAL_RW) {
-		pr_err("Locking state was not RO or RW\n");
+		pr_debug("Locking state was not RO or RW\n");
 		return -EINVAL;
 	}
 	if (lk_unlk->session.who < OPAL_USER1 ||
 	    lk_unlk->session.who > OPAL_USER9) {
-		pr_err("Authority was not within the range of users: %d\n",
-		       lk_unlk->session.who);
+		pr_debug("Authority was not within the range of users: %d\n",
+			 lk_unlk->session.who);
 		return -EINVAL;
 	}
 	if (lk_unlk->session.sum) {
-		pr_err("%s not supported in sum. Use setup locking range\n",
-		       __func__);
+		pr_debug("%s not supported in sum. Use setup locking range\n",
+			 __func__);
 		return -EINVAL;
 	}
 
@@ -2312,7 +2309,7 @@ static int opal_activate_user(struct opal_dev *dev,
 	/* We can't activate Admin1 it's active as manufactured */
 	if (opal_session->who < OPAL_USER1 ||
 	    opal_session->who > OPAL_USER9) {
-		pr_err("Who was not a valid user: %d\n", opal_session->who);
+		pr_debug("Who was not a valid user: %d\n", opal_session->who);
 		return -EINVAL;
 	}
 
@@ -2343,9 +2340,9 @@ bool opal_unlock_from_suspend(struct opal_dev *dev)
 
 		ret = __opal_lock_unlock(dev, &suspend->unlk);
 		if (ret) {
-			pr_warn("Failed to unlock LR %hhu with sum %d\n",
-				suspend->unlk.session.opal_key.lr,
-				suspend->unlk.session.sum);
+			pr_debug("Failed to unlock LR %hhu with sum %d\n",
+				 suspend->unlk.session.opal_key.lr,
+				 suspend->unlk.session.sum);
 			was_failure = true;
 		}
 	}
@@ -2363,10 +2360,8 @@ int sed_ioctl(struct opal_dev *dev, unsigned int cmd, void __user *arg)
 		return -EACCES;
 	if (!dev)
 		return -ENOTSUPP;
-	if (!dev->supported) {
-		pr_err("Not supported\n");
+	if (!dev->supported)
 		return -ENOTSUPP;
-	}
 
 	p = memdup_user(arg, _IOC_SIZE(cmd));
 	if (IS_ERR(p))
@@ -2410,7 +2405,7 @@ int sed_ioctl(struct opal_dev *dev, unsigned int cmd, void __user *arg)
 		ret = opal_secure_erase_locking_range(dev, p);
 		break;
 	default:
-		pr_warn("No such Opal Ioctl %u\n", cmd);
+		break;
 	}
 
 	kfree(p);
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH v3 0/5] Avoid that scsi-mq queue processing stalls
From: Bart Van Assche @ 2017-04-07 15:46 UTC (permalink / raw)
  To: ming.lei@redhat.com; +Cc: linux-block@vger.kernel.org, axboe@kernel.dk
In-Reply-To: <20170407153426.GE29821@ming.t460p>

On Fri, 2017-04-07 at 23:34 +0800, Ming Lei wrote:
> On Fri, Apr 07, 2017 at 03:18:19PM +0000, Bart Van Assche wrote:
> > On Fri, 2017-04-07 at 17:41 +0800, Ming Lei wrote:
> > > On Thu, Apr 06, 2017 at 11:10:45AM -0700, Bart Van Assche wrote:
> > > > Hello Jens,
> > > >=20
> > > > The five patches in this patch series fix the queue lockup I report=
ed
> > > > recently on the linux-block mailing list. Please consider these pat=
ches
> > > > for inclusion in the upstream kernel.
> > >=20
> > > I read the commit log of the 5 patches, looks not found descriptions
> > > about root cause of the queue lockup, so could you explain a bit abou=
t
> > > the reason behind?
> >=20
> > Hello Ming,
> >=20
> > If a .queue_rq() function returns BLK_MQ_RQ_QUEUE_BUSY then the block
> > driver that implements that function is responsible for rerunning the
> > hardware queue once requests can be queued successfully again. That is
> > not the case today for the SCSI core. Patch 5/5 ensures that hardware
>=20
> The current .queue_rq() will call blk_mq_delay_queue() if QUEUE_BUSY is
> returned, and once request is completed, the queue will be restarted
> by blk_mq_start_stopped_hw_queues() in scsi_end_request(). This way
> sounds OK in theory. And I just try to understand the specific reason
> which causes the lockup, but still not get it.

Hello Ming,

blk_mq_delay_queue() stops and restarts a hardware queue after a delay has
expired. If the SCSI core calls blk_mq_start_stopped_hw_queues() after that
delay has expired no queues will be restarted. This is why patch 5/5 change=
s
two blk_mq_start_stopped_hw_queues() calls into two blk_mq_run_hw_queues()
calls.

Bart.=

^ permalink raw reply

* Re: [PATCH 0/2] Trace completion of all bios
From: Jens Axboe @ 2017-04-07 15:42 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-raid, Mike Snitzer, Christoph Hellwig, Ming Lei,
	linux-kernel, linux-block, dm-devel, Shaohua Li, Alasdair Kergon
In-Reply-To: <149152735878.17489.16036848644242802943.stgit@noble>

On 04/06/2017 07:10 PM, NeilBrown wrote:
> Hi Jens,
>  I think all objections to this patch have been answered so I'm
>  resending.
>  I've added a small cleanup patch first which makes some small
>  enhancements to the documentation and #defines for the bio->flags
>  field.

Added for 4.12. I hand applied 2/2, since it did not apply directly to
the 4.12 branch.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH v3 0/5] Avoid that scsi-mq queue processing stalls
From: Ming Lei @ 2017-04-07 15:34 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block@vger.kernel.org, axboe@kernel.dk
In-Reply-To: <1491578297.2559.7.camel@sandisk.com>

On Fri, Apr 07, 2017 at 03:18:19PM +0000, Bart Van Assche wrote:
> On Fri, 2017-04-07 at 17:41 +0800, Ming Lei wrote:
> > On Thu, Apr 06, 2017 at 11:10:45AM -0700, Bart Van Assche wrote:
> > > Hello Jens,
> > > 
> > > The five patches in this patch series fix the queue lockup I reported
> > > recently on the linux-block mailing list. Please consider these patches
> > > for inclusion in the upstream kernel.
> > 
> > I read the commit log of the 5 patches, looks not found descriptions
> > about root cause of the queue lockup, so could you explain a bit about
> > the reason behind?
> 
> Hello Ming,
> 
> If a .queue_rq() function returns BLK_MQ_RQ_QUEUE_BUSY then the block
> driver that implements that function is responsible for rerunning the
> hardware queue once requests can be queued successfully again. That is
> not the case today for the SCSI core. Patch 5/5 ensures that hardware

The current .queue_rq() will call blk_mq_delay_queue() if QUEUE_BUSY is
returned, and once request is completed, the queue will be restarted
by blk_mq_start_stopped_hw_queues() in scsi_end_request(). This way
sounds OK in theory. And I just try to understand the specific reason
which causes the lockup, but still not get it.

Thanks,
Ming

^ permalink raw reply

* Re: [PATCH v3 0/5] Avoid that scsi-mq queue processing stalls
From: Bart Van Assche @ 2017-04-07 15:18 UTC (permalink / raw)
  To: ming.lei@redhat.com; +Cc: linux-block@vger.kernel.org, axboe@kernel.dk
In-Reply-To: <20170407094107.GC27631@ming.t460p>

On Fri, 2017-04-07 at 17:41 +0800, Ming Lei wrote:
> On Thu, Apr 06, 2017 at 11:10:45AM -0700, Bart Van Assche wrote:
> > Hello Jens,
> >=20
> > The five patches in this patch series fix the queue lockup I reported
> > recently on the linux-block mailing list. Please consider these patches
> > for inclusion in the upstream kernel.
>=20
> I read the commit log of the 5 patches, looks not found descriptions
> about root cause of the queue lockup, so could you explain a bit about
> the reason behind?

Hello Ming,

If a .queue_rq() function returns BLK_MQ_RQ_QUEUE_BUSY then the block
driver that implements that function is responsible for rerunning the
hardware queue once requests can be queued successfully again. That is
not the case today for the SCSI core. Patch 5/5 ensures that hardware
queues for which scsi_queue_rq() has returned "busy" are rerun after
the "busy" condition has been cleared. This is why patch 5/5 fixes a
queue lockup.

Bart.=

^ permalink raw reply

* Re: [PATCH v3 1/5] blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list
From: Bart Van Assche @ 2017-04-07 15:15 UTC (permalink / raw)
  To: ming.lei@redhat.com
  Cc: hch@lst.de, linux-block@vger.kernel.org, hare@suse.com,
	axboe@kernel.dk
In-Reply-To: <20170407094623.GD27631@ming.t460p>

On Fri, 2017-04-07 at 17:46 +0800, Ming Lei wrote:
> On Thu, Apr 06, 2017 at 11:10:46AM -0700, Bart Van Assche wrote:
> > Since the next patch in this series will use RCU to iterate over
> > tag_list, make this safe. Add lockdep_assert_held() statements
> > in functions that iterate over tag_list to make clear that using
> > list_for_each_entry() instead of list_for_each_entry_rcu() is
> > fine in these functions.
> >=20
> > Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Hannes Reinecke <hare@suse.com>
> > ---
> >  block/blk-mq.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >=20
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index f7cd3208bcdf..b5580b09b4a5 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2076,6 +2076,8 @@ static void blk_mq_update_tag_set_depth(struct bl=
k_mq_tag_set *set, bool shared)
> >  {
> >  	struct request_queue *q;
> > =20
> > +	lockdep_assert_held(&set->tag_list_lock);
> > +
> >  	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> >  		blk_mq_freeze_queue(q);
> >  		queue_set_hctx_shared(q, shared);
> > @@ -2096,6 +2098,8 @@ static void blk_mq_del_queue_tag_set(struct reque=
st_queue *q)
> >  		blk_mq_update_tag_set_depth(set, false);
> >  	}
> >  	mutex_unlock(&set->tag_list_lock);
> > +
> > +	synchronize_rcu();
>=20
> Looks synchronize_rcu() is only needed in deletion path, so it can
> be moved to blk_mq_del_queue_tag_set().
>=20
> Also list_del_init/list_add_tail() need to be replaced with RCU
> safe version in functions operating &set->tag_list.

Hello Ming,

I will replace list_del_init() / list_add_tail() by their RCU equivalents.

Regarding synchronize_rcu(): have you noticed that that call has been added=
 to
blk_mq_del_queue_tag_set(), the function you requested to move that call to=
?

Bart.=

^ permalink raw reply

* Re: [PATCH v3 4/5] blk-mq: Introduce blk_mq_delay_run_hw_queue()
From: Bart Van Assche @ 2017-04-07 15:13 UTC (permalink / raw)
  To: ming.lei@redhat.com
  Cc: hch@lst.de, linux-block@vger.kernel.org, hare@suse.de,
	longli@microsoft.com, axboe@kernel.dk, kys@microsoft.com
In-Reply-To: <20170407102343.GB28554@ming.t460p>

On Fri, 2017-04-07 at 18:23 +0800, Ming Lei wrote:
> On Thu, Apr 06, 2017 at 11:10:49AM -0700, Bart Van Assche wrote:
> > Introduce a function that runs a hardware queue unconditionally
>=20
> I appreciate if some background can be provided for this function,
> like fixing some issues?

Hello Ming,

Have you noticed that patch 5/5 in this series uses this function?

Bart.=

^ permalink raw reply

* Re: [PATCH v3 2/5] blk-mq: Restart a single queue if tag sets are shared
From: Bart Van Assche @ 2017-04-07 15:12 UTC (permalink / raw)
  To: axboe@kernel.dk; +Cc: hch@lst.de, linux-block@vger.kernel.org, hare@suse.com
In-Reply-To: <e5c27abd-d3ad-ac21-2a37-d20999d459f5@kernel.dk>

On Thu, 2017-04-06 at 13:21 -0600, Jens Axboe wrote:
> On 04/06/2017 01:12 PM, Jens Axboe wrote:
> > On 04/06/2017 12:10 PM, Bart Van Assche wrote:
> > > +		for (i =3D 0; i < queue->nr_hw_queues; i++) {
> > > +			j =3D (i + hctx->queue_num + 1) % queue->nr_hw_queues;
> > > +			h =3D queue->queue_hw_ctx[j];
> > > +			if (h->tags =3D=3D tags && blk_mq_sched_restart_hctx(h))
> > > +				break;
> >=20
> > I'm pretty sure that doing:
> >=20
> > 	j =3D i + hctx->queue_num + 1;;
>=20
> And 'i' too many there of course:
>=20
>  	j =3D hctx->queue_num + 1;;

Hello Jens,

Thanks for the feedback. I will implement this change and retest this patch
series.

Bart.=

^ permalink raw reply

* Re: [PATCH 7/8] nowait aio: xfs
From: Darrick J. Wong @ 2017-04-07 15:08 UTC (permalink / raw)
  To: Goldwyn Rodrigues
  Cc: Christoph Hellwig, linux-fsdevel, jack, linux-block, linux-btrfs,
	linux-ext4, linux-xfs, sagi, avi, axboe, linux-api, willy,
	tom.leiming, Goldwyn Rodrigues
In-Reply-To: <6cc69b0c-7af4-879e-1012-aae8ba5358bd@suse.de>

On Fri, Apr 07, 2017 at 06:34:28AM -0500, Goldwyn Rodrigues wrote:
> 
> 
> On 04/06/2017 05:54 PM, Darrick J. Wong wrote:
> > On Mon, Apr 03, 2017 at 11:52:11PM -0700, Christoph Hellwig wrote:
> >>> +	if (unaligned_io) {
> >>> +		/* If we are going to wait for other DIO to finish, bail */
> >>> +		if ((iocb->ki_flags & IOCB_NOWAIT) &&
> >>> +		     atomic_read(&inode->i_dio_count))
> >>> +			return -EAGAIN;
> >>>  		inode_dio_wait(inode);
> >>
> >> This checks i_dio_count twice in the nowait case, I think it should be:
> >>
> >> 	if (iocb->ki_flags & IOCB_NOWAIT) {
> >> 		if (atomic_read(&inode->i_dio_count))
> >> 			return -EAGAIN;
> >> 	} else {
> >> 		inode_dio_wait(inode);
> >> 	}
> >>
> >>>  	if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
> >>>  		if (flags & IOMAP_DIRECT) {
> >>> +			/* A reflinked inode will result in CoW alloc */
> >>> +			if (flags & IOMAP_NOWAIT) {
> >>> +				error = -EAGAIN;
> >>> +				goto out_unlock;
> >>> +			}
> >>
> >> This is a bit pessimistic - just because the inode has any shared
> >> extents we could still write into unshared ones.  For now I think this
> >> pessimistic check is fine, but the comment should be corrected.
> > 
> > Consider what happens in both _reflink_{allocate,reserve}_cow.  If there
> > is already an existing reservation in the CoW fork then we'll have to
> > CoW and therefore can't satisfy the NOWAIT flag.  If there isn't already
> > anything in the CoW fork, then we have to see if there are shared blocks
> > by calling _reflink_trim_around_shared.  That performs a refcountbt
> > lookup, which involves locking the AGF, so we also can't satisfy NOWAIT.
> > 
> > IOWs, I think this hunk has to move outside the IOMAP_DIRECT check to
> > cover both write-to-reflinked-file cases.
> > 
> 
> IOMAP_NOWAIT is set only with IOMAP_DIRECT since the nowait feature is
> for direct-IO only. This is checked early on, when we are checking for

Ah, ok.  Disregard what I said about moving it then.

--D

> user-passed flags, and if not, -EINVAL is returned.
> 
> 
> -- 
> Goldwyn

^ permalink raw reply

* Re: [PATCHv7 2/2] loop: support 4k physical blocksize
From: Ming Lei @ 2017-04-07 14:59 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Jens Axboe, Omar Sandoval, linux-block, Hannes Reinecke
In-Reply-To: <1491562257-82574-3-git-send-email-hare@suse.de>

On Fri, Apr 07, 2017 at 12:50:57PM +0200, Hannes Reinecke wrote:
> When generating bootable VM images certain systems (most notably
> s390x) require devices with 4k blocksize. This patch implements
> a new flag 'LO_FLAGS_BLOCKSIZE' which will set the physical
> blocksize to that of the underlying device, and allow to change
> the logical blocksize for up to the physical blocksize.
> 
> Signed-off-by: Hannes Reinecke <hare@suse.com>
> ---
>  drivers/block/loop.c      | 40 +++++++++++++++++++++++++++++++++++-----
>  drivers/block/loop.h      |  1 +
>  include/uapi/linux/loop.h |  3 +++
>  3 files changed, 39 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 81b3900..0909e06 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -221,7 +221,8 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
>  }
>  
>  static int
> -figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit)
> +figure_loop_size(struct loop_device *lo, loff_t offset, loff_t sizelimit,
> +		 loff_t logical_blocksize)
>  {
>  	loff_t size = get_size(offset, sizelimit, lo->lo_backing_file);
>  	sector_t x = (sector_t)size;
> @@ -233,6 +234,12 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
>  		lo->lo_offset = offset;
>  	if (lo->lo_sizelimit != sizelimit)
>  		lo->lo_sizelimit = sizelimit;
> +	if (lo->lo_flags & LO_FLAGS_BLOCKSIZE) {
> +		lo->lo_logical_blocksize = logical_blocksize;
> +		blk_queue_physical_block_size(lo->lo_queue, lo->lo_blocksize);
> +		blk_queue_logical_block_size(lo->lo_queue,
> +					     lo->lo_logical_blocksize);
> +	}
>  	set_capacity(lo->lo_disk, x);
>  	bd_set_size(bdev, (loff_t)get_capacity(bdev->bd_disk) << 9);
>  	/* let user-space know about the new size */
> @@ -814,6 +821,7 @@ static void loop_config_discard(struct loop_device *lo)
>  	struct file *file = lo->lo_backing_file;
>  	struct inode *inode = file->f_mapping->host;
>  	struct request_queue *q = lo->lo_queue;
> +	int lo_bits = 9;
>  
>  	/*
>  	 * We use punch hole to reclaim the free space used by the
> @@ -833,7 +841,10 @@ static void loop_config_discard(struct loop_device *lo)
>  
>  	q->limits.discard_granularity = inode->i_sb->s_blocksize;
>  	q->limits.discard_alignment = 0;
> -	blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> +	if (lo->lo_flags & LO_FLAGS_BLOCKSIZE)
> +		lo_bits = blksize_bits(lo->lo_logical_blocksize);
> +
> +	blk_queue_max_discard_sectors(q, UINT_MAX >> lo_bits);
>  	q->limits.discard_zeroes_data = 1;
>  	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
>  }
> @@ -1087,6 +1098,7 @@ static int loop_clr_fd(struct loop_device *lo)
>  	int err;
>  	struct loop_func_table *xfer;
>  	kuid_t uid = current_uid();
> +	int lo_flags = lo->lo_flags;
>  
>  	if (lo->lo_encrypt_key_size &&
>  	    !uid_eq(lo->lo_key_owner, uid) &&
> @@ -1119,9 +1131,26 @@ static int loop_clr_fd(struct loop_device *lo)
>  	if (err)
>  		goto exit;
>  
> +	if (info->lo_flags & LO_FLAGS_BLOCKSIZE) {
> +		if (!(lo->lo_flags & LO_FLAGS_BLOCKSIZE))
> +			lo->lo_logical_blocksize = 512;
> +		lo->lo_flags |= LO_FLAGS_BLOCKSIZE;
> +		if (LO_INFO_BLOCKSIZE(info) != 512 &&
> +		    LO_INFO_BLOCKSIZE(info) != 1024 &&
> +		    LO_INFO_BLOCKSIZE(info) != 2048 &&
> +		    LO_INFO_BLOCKSIZE(info) != 4096)
> +			return -EINVAL;
> +		if (LO_INFO_BLOCKSIZE(info) > lo->lo_blocksize)
> +			return -EINVAL;
> +	}
> +
>  	if (lo->lo_offset != info->lo_offset ||
> -	    lo->lo_sizelimit != info->lo_sizelimit)
> -		if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit)) {
> +	    lo->lo_sizelimit != info->lo_sizelimit ||
> +	    lo->lo_flags != lo_flags ||
> +	    ((lo->lo_flags & LO_FLAGS_BLOCKSIZE) &&
> +	     lo->lo_logical_blocksize != LO_INFO_BLOCKSIZE(info))) {
> +		if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit,
> +				     LO_INFO_BLOCKSIZE(info)))
>  			err = -EFBIG;
>  			goto exit;
>  		}
> @@ -1312,7 +1341,8 @@ static int loop_set_capacity(struct loop_device *lo)
>  	if (unlikely(lo->lo_state != Lo_bound))
>  		return -ENXIO;
>  
> -	return figure_loop_size(lo, lo->lo_offset, lo->lo_sizelimit);
> +	return figure_loop_size(lo, lo->lo_offset, lo->lo_sizelimit,
> +				lo->lo_logical_blocksize);
>  }
>  
>  static int loop_set_dio(struct loop_device *lo, unsigned long arg)
> diff --git a/drivers/block/loop.h b/drivers/block/loop.h
> index fb2237c..579f2f7 100644
> --- a/drivers/block/loop.h
> +++ b/drivers/block/loop.h
> @@ -49,6 +49,7 @@ struct loop_device {
>  	struct file *	lo_backing_file;
>  	struct block_device *lo_device;
>  	unsigned	lo_blocksize;
> +	unsigned	lo_logical_blocksize;
>  	void		*key_data; 
>  
>  	gfp_t		old_gfp_mask;
> diff --git a/include/uapi/linux/loop.h b/include/uapi/linux/loop.h
> index c8125ec..a3960f9 100644
> --- a/include/uapi/linux/loop.h
> +++ b/include/uapi/linux/loop.h
> @@ -22,6 +22,7 @@ enum {
>  	LO_FLAGS_AUTOCLEAR	= 4,
>  	LO_FLAGS_PARTSCAN	= 8,
>  	LO_FLAGS_DIRECT_IO	= 16,
> +	LO_FLAGS_BLOCKSIZE	= 32,
>  };
>  
>  #include <asm/posix_types.h>	/* for __kernel_old_dev_t */
> @@ -59,6 +60,8 @@ struct loop_info64 {
>  	__u64		   lo_init[2];
>  };
>  
> +#define LO_INFO_BLOCKSIZE(l) (l)->lo_init[0]
> +
>  /*
>   * Loop filter types
>   */
> -- 
> 1.8.5.6
> 

Looks V7 still doesn't recover logical/physical block size for
new attached file?

Thanks,
Ming

^ permalink raw reply

* Re: [PATCH v3 1/8] blk-mq: use the right hctx when getting a driver tag fails
From: Jens Axboe @ 2017-04-07 14:45 UTC (permalink / raw)
  To: Ming Lei; +Cc: Omar Sandoval, linux-block, kernel-team
In-Reply-To: <20170407032309.GA10976@ming.t460p>

On 04/06/2017 09:23 PM, Ming Lei wrote:
> Hi Jens,
> 
> Thanks for your comment!
> 
> On Thu, Apr 06, 2017 at 01:29:26PM -0600, Jens Axboe wrote:
>> On 04/06/2017 02:23 AM, Ming Lei wrote:
>>> On Thu, Apr 06, 2017 at 12:57:51AM -0700, Omar Sandoval wrote:
>>>> On Thu, Apr 06, 2017 at 12:31:18PM +0800, Ming Lei wrote:
>>>>> On Wed, Apr 05, 2017 at 12:01:29PM -0700, Omar Sandoval wrote:
>>>>>> From: Omar Sandoval <osandov@fb.com>
>>>>>>
>>>>>> While dispatching requests, if we fail to get a driver tag, we mark the
>>>>>> hardware queue as waiting for a tag and put the requests on a
>>>>>> hctx->dispatch list to be run later when a driver tag is freed. However,
>>>>>> blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
>>>>>> queues if using a single-queue scheduler with a multiqueue device. If
>>>>>
>>>>> It can't perform well by using a SQ scheduler on a MQ device, so just be
>>>>> curious why someone wants to do that in this way,:-)
>>>>
>>>> I don't know why anyone would want to, but it has to work :) The only
>>>> reason we noticed this is because when the NBD device is created, it
>>>> only has a single queue, so we automatically assign mq-deadline to it.
>>>> Later, we update the number of queues, but it's still using mq-deadline.
>>>>
>>>>> I guess you mean that ops.mq.dispatch_request() may dispatch requests
>>>>> from other hardware queues in blk_mq_sched_dispatch_requests() instead
>>>>> of current hctx.
>>>>
>>>> Yup, that's right. It's weird, and I talked to Jens about just forcing
>>>> the MQ device into an SQ mode when using an SQ scheduler, but this way
>>>> works fine more or less.
>>>
>>> Or just switch the elevator to the MQ default one when the device becomes
>>> MQ? Or let mq-deadline's .dispatch_request() just return reqs in current
>>> hctx?
>>
>> No, that would be a really bad idea imho. First of all, I don't want
>> kernel driven scheduler changes. Secondly, the framework should work
>> with a non-direct mapping between hardware dispatch queues and
>> scheduling queues.
>>
>> While we could force a single queue usage to make that a 1:1 mapping
>> always, that loses big benefits on eg nbd, which uses multiple hardware
>> queues to up the bandwidth. Similarly on nvme, for example, we still
>> scale better with N submission queues and 1 scheduling queue compared to
>> having just 1 submission queue.
> 
> Looks that isn't what I meant. And my 2nd point is to make
> mq-deadline's dispatch_request(hctx) just returns requests mapped to
> the hw queue of 'hctx', then we can avoid to mess up blk-mq.c and
> blk-mq-sched.c.

That would indeed be better. But let's assume that we have 4 hardware
queues, we're asked to run queue X. But the FIFO dictates that the first
request that should run is on queue Y, since it's expired. What do we
do? Then we'd have to abort dispatching on queue X, and run queue Y
appropriately.

This shuffle can happen all the time. I think it'd be worthwhile to work
towards the goal of improving mq-deadline to deal with this, and
subsequently cleaning up the interface. It would be a cleaner
implementation, though efficiency might suffer further.

>>>>> If that is true, it looks like a issue in usage of I/O scheduler, since
>>>>> the mq-deadline scheduler just queues requests in one per-request_queue
>>>>> linked list, for MQ device, the scheduler queue should have been per-hctx.
>>>>
>>>> That's an option, but that's a different scheduling policy. Again, I
>>>> agree that it's strange, but it's reasonable behavior.
>>>
>>> IMO, the current mq-deadline isn't good/ready for MQ device, and it
>>> doesn't make sense to use it for MQ.
>>
>> I don't think that's true at all. I do agree that it's somewhat quirky
>> since it does introduce scheduling dependencies between the hardware
>> queues, and we have to work at making that well understood and explicit,
>> as not to introduce bugs due to that. But in reality, all multiqueue
>> hardware we are deadling with are mapped to a single resource. As such,
>> it makes a lot of sense to schedule it as such. Hence I don't think that
>> a single queue deadline approach is necessarily a bad idea for even fast
>> storage.
> 
> When we map all mq into one single deadline queue, it can't scale well, for
> example, I just run a simple write test(libaio, dio, bs:4k, 4jobs) over one
> commodity NVMe in a dual-socket(four cores) system:
> 	
> 	IO scheduler|	CPU utilization
> 	-------------------------------
> 	none		|   60%
> 	-------------------------------
> 	mq-deadline |	100%
> 	-------------------------------
> 
> And IO throughput is basically same in two cases.
> 
That's a given, for low blocksize and high iops, we'll be hitting the
deadline lock pretty hard. We dispatch single requests at the time, to
optimize for rotational storage, since it improves merging a lot. If we
modified e->type->ops.mq.dispatch_request() to take a "dump this many
requests" parameter and ramped up the depth quickly, then we could
drastically reduce the overhead for your case. The initial version
dumped the whole thing. It had lower overhead on flash, but didn't reach
the performance of old deadline on !mq since we continually emptied the
queue and eliminated merging opportunities.

blk_mq_sched_dispatch_requests() could add logic that brought back the
old behavior if NONROT is set.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH v2 2/2] blk-mq: Add a polling specific stats function
From: Jens Axboe @ 2017-04-07 14:01 UTC (permalink / raw)
  To: Stephen Bates
  Cc: linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
	Damien.LeMoal@wdc.com, osandov@osandov.com, sagi@grimberg.me
In-Reply-To: <2AD7D1E1-3906-4ABA-A301-E15DABAF443C@raithlin.com>

On 04/07/2017 06:11 AM, Stephen  Bates wrote:
> On 2017-04-05, 7:14 PM, "Jens Axboe" <axboe@kernel.dk> wrote:
> 
>> Why not just have 8 buckets, and make it:
>>
>> 	bucket = ddir + ilog2(bytes) - 9;
>>
>> and cap it at MAX_BUCKET (8) and put all those above into the top
>> bucket.
> 
> Thanks. However, that equation does not differentiate between
> direction and size. Instead we can use
> 
> bucket = ddir + 2*(ilog2(bytes) - 9);

It would be cleaner to just embed the fact that we have 2 sets of
identical buckets, and return 

	bucket = ilog2(bytes) - 9;

and have poll_stat be indexed by:

	->poll_stat[ddir][bucket];

instead.

-- 
Jens Axboe

^ permalink raw reply

* Re: [Nbd] [PATCH 00/12] nbd: Netlink interface and path failure enhancements
From: Wouter Verhelst @ 2017-04-07 13:04 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Jens Axboe, nbd-general, linux-block, kernel-team
In-Reply-To: <0BDE8C5C-1444-455D-88A3-F1D70181056C@toxicpanda.com>

Hi Josef,

On Thu, Apr 06, 2017 at 05:05:25PM -0400, Josef Bacik wrote:
> 
> > On Apr 6, 2017, at 5:01 PM, Josef Bacik <josef@toxicpanda.com> wrote:
> > 
> > This patchset adds a new netlink configuration interface to NBD as well as a
> > bunch of enhancments around path failures.  The patches provide the following
> > enhancemnts to NBD
> > 
> > - Netlink configuration interface that doesn't leave a userspace application
> >   waiting in kernel space for the device to disconnect.
> > - Netlink reconfigure interface for adding re-connected sockets to replace dead
> >   sockets.
> > - A flag to destroy the NBD device on disconnect, much like how mount -o loop
> >   works.
> > - A status interface that currently will only report whether a device is
> >   connected or not, but can be extended to include whatever in the future.
> > - A netlink multicast notification scheme to notify user space when there are
> >   connection issues to allow for seamless reconnects.
> > - Dead link handling.  You can specify a dead link timeout and the NBD device
> >   will pause IO for that timeout waiting to see if the connection can be
> >   re-established.  This is helpful to allow for things like nbd server upgrades
> >   where the whole server disappears for a short period of time.
> > 
> > These patches have been thorougly and continuously tested for about a month.
> > I've been finding bugs in various places, but this batch has been solid for the
> > last few days of testing, which include a constant disconnect/reconnect torture
> > test.  Thanks,
> 
> Oops forgot to mention that I have patches for the nbd user space side
> that utilizes all of these new interfaces, you can find the tree here
> 
> https://github.com/josefbacik/nbd

Yeah, found that already :-)

> The user space patches are a bit rough, but utilize everything so good
> for testing. They will be cleaned up and submitted once the kernel
> part is upstream. Thanks,

No worries, I'll probably clean it up myself. TBH, cleaning up
nbd-client so it becomes somewhat less of a beast has been on my TODO
list for a while, and this will only add to it... maybe that'll finally
convince me that it's time ;-)

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply

* [PATCH v3 2/2] blk-mq: Add a polling specific stats function
From: sbates @ 2017-04-07 12:24 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-nvme, Damien.LeMoal, osandov, sbates, sagi
In-Reply-To: <1491567843-26190-1-git-send-email-sbates@raithlin.com>

From: Stephen Bates <sbates@raithlin.com>

Rather than bucketing IO statisics based on direction only we also
bucket based on the IO size. This leads to improved polling
performance. Update the bucket callback function and use it in the
polling latency estimation.

Signed-off-by: Stephen Bates <sbates@raithlin.com>
---
 block/blk-mq.c | 45 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 35 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 061fc2c..5fd376b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -42,6 +42,25 @@ static LIST_HEAD(all_q_list);
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
 
+/* Must be consisitent with function below */
+#define BLK_MQ_POLL_STATS_BKTS 16
+static int blk_mq_poll_stats_bkt(const struct request *rq)
+{
+	int ddir, bytes, bucket;
+
+	ddir = blk_stat_rq_ddir(rq);
+	bytes = blk_rq_bytes(rq);
+
+	bucket = ddir + 2*(ilog2(bytes) - 9);
+
+	if (bucket < 0)
+		return -1;
+	else if (bucket >= BLK_MQ_POLL_STATS_BKTS)
+		return ddir + BLK_MQ_POLL_STATS_BKTS - 2;
+
+	return bucket;
+}
+
 /*
  * Check if any of the ctx's have pending work in this hardware queue
  */
@@ -2245,7 +2264,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	q->mq_ops = set->ops;
 
 	q->poll_cb = blk_stat_alloc_callback(blk_mq_poll_stats_fn,
-					     blk_stat_rq_ddir, 2, q);
+					     blk_mq_poll_stats_bkt,
+					     BLK_MQ_POLL_STATS_BKTS, q);
 	if (!q->poll_cb)
 		goto err_exit;
 
@@ -2663,11 +2683,12 @@ static void blk_mq_poll_stats_start(struct request_queue *q)
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb)
 {
 	struct request_queue *q = cb->data;
+	int bucket;
 
-	if (cb->stat[READ].nr_samples)
-		q->poll_stat[READ] = cb->stat[READ];
-	if (cb->stat[WRITE].nr_samples)
-		q->poll_stat[WRITE] = cb->stat[WRITE];
+	for (bucket = 0; bucket < BLK_MQ_POLL_STATS_BKTS; bucket++) {
+		if (cb->stat[bucket].nr_samples)
+			q->poll_stat[bucket] = cb->stat[bucket];
+	}
 }
 
 static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
@@ -2675,6 +2696,7 @@ static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
 				       struct request *rq)
 {
 	unsigned long ret = 0;
+	int bucket;
 
 	/*
 	 * If stats collection isn't on, don't sleep but turn it on for
@@ -2689,12 +2711,15 @@ static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
 	 * For instance, if the completion latencies are tight, we can
 	 * get closer than just half the mean. This is especially
 	 * important on devices where the completion latencies are longer
-	 * than ~10 usec.
+	 * than ~10 usec. We do use the stats for the relevant IO size
+	 * if available which does lead to better estimates.
 	 */
-	if (req_op(rq) == REQ_OP_READ && q->poll_stat[READ].nr_samples)
-		ret = (q->poll_stat[READ].mean + 1) / 2;
-	else if (req_op(rq) == REQ_OP_WRITE && q->poll_stat[WRITE].nr_samples)
-		ret = (q->poll_stat[WRITE].mean + 1) / 2;
+	bucket = blk_mq_poll_stats_bkt(rq);
+	if (bucket < 0)
+		return ret;
+
+	if (q->poll_stat[bucket].nr_samples)
+		ret = (q->poll_stat[bucket].mean + 1) / 2;
 
 	return ret;
 }
-- 
2.7.4

^ permalink raw reply related

* [PATCH v3 1/2] blk-stat: convert blk-stat bucket callback to signed
From: sbates @ 2017-04-07 12:24 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-nvme, Damien.LeMoal, osandov, sbates, sagi
In-Reply-To: <1491567843-26190-1-git-send-email-sbates@raithlin.com>

From: Stephen Bates <sbates@raithlin.com>

In order to allow for filtering of IO based on some other properties
of the request than direction we allow the bucket function to return
an int.

If the bucket callback returns a negative do no count it in the stats
accumulation.

Signed-off-by: Stephen Bates <sbates@raithlin.com>
---
 block/blk-stat.c | 6 ++++--
 block/blk-stat.h | 9 +++++----
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/block/blk-stat.c b/block/blk-stat.c
index e77ec52..dde9d39 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -19,7 +19,7 @@ struct blk_queue_stats {
 	bool enable_accounting;
 };
 
-unsigned int blk_stat_rq_ddir(const struct request *rq)
+int blk_stat_rq_ddir(const struct request *rq)
 {
 	return rq_data_dir(rq);
 }
@@ -104,6 +104,8 @@ void blk_stat_add(struct request *rq)
 	list_for_each_entry_rcu(cb, &q->stats->callbacks, list) {
 		if (blk_stat_is_active(cb)) {
 			bucket = cb->bucket_fn(rq);
+			if (bucket < 0)
+				continue;
 			stat = &this_cpu_ptr(cb->cpu_stat)[bucket];
 			__blk_stat_add(stat, value);
 		}
@@ -135,7 +137,7 @@ static void blk_stat_timer_fn(unsigned long data)
 
 struct blk_stat_callback *
 blk_stat_alloc_callback(void (*timer_fn)(struct blk_stat_callback *),
-			unsigned int (*bucket_fn)(const struct request *),
+			int (*bucket_fn)(const struct request *),
 			unsigned int buckets, void *data)
 {
 	struct blk_stat_callback *cb;
diff --git a/block/blk-stat.h b/block/blk-stat.h
index 53f08a6..622a62c 100644
--- a/block/blk-stat.h
+++ b/block/blk-stat.h
@@ -48,9 +48,10 @@ struct blk_stat_callback {
 
 	/**
 	 * @bucket_fn: Given a request, returns which statistics bucket it
-	 * should be accounted under.
+	 * should be accounted under. Return -1 for no bucket for this
+	 * request.
 	 */
-	unsigned int (*bucket_fn)(const struct request *);
+	int (*bucket_fn)(const struct request *);
 
 	/**
 	 * @buckets: Number of statistics buckets.
@@ -120,7 +121,7 @@ void blk_stat_enable_accounting(struct request_queue *q);
  *
  * Return: Data direction of the request, either READ or WRITE.
  */
-unsigned int blk_stat_rq_ddir(const struct request *rq);
+int blk_stat_rq_ddir(const struct request *rq);
 
 /**
  * blk_stat_alloc_callback() - Allocate a block statistics callback.
@@ -135,7 +136,7 @@ unsigned int blk_stat_rq_ddir(const struct request *rq);
  */
 struct blk_stat_callback *
 blk_stat_alloc_callback(void (*timer_fn)(struct blk_stat_callback *),
-			unsigned int (*bucket_fn)(const struct request *),
+			int (*bucket_fn)(const struct request *),
 			unsigned int buckets, void *data);
 
 /**
-- 
2.7.4

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox