dm-devel.redhat.com archive mirror
* [PATCH 0/5] blk-mq & dm: fix IO hang and deal with one performance issue
@ 2018-01-22  3:35 Ming Lei
  2018-01-22  3:35 ` [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE Ming Lei
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Ming Lei @ 2018-01-22  3:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: dm-devel, Ming Lei

Hi Jens,

The 1st patch introduces BLK_STS_DEV_RESOURCE for dealing with an IO hang
that can happen when a non-device resource runs out while the queue is
idle; this approach was suggested by you.

The 2nd and 3rd patches clean up dm-rq a bit and prepare for applying
blk_get_request_notify().

The 4th patch introduces blk_get_request_notify(), also suggested by you.

The last patch applies blk_get_request_notify() and avoids a DM-specific
performance issue introduced by the 1st patch.

These 5 patches apply against both the dm tree (dm-4.16) and the block tree
(for-4.16/block).

Ming Lei (5):
  blk-mq: introduce BLK_STS_DEV_RESOURCE
  dm-rq: handle dispatch exception in dm_dispatch_clone_request()
  dm-rq: return BLK_STS_* from map_request()
  blk-mq: introduce blk_get_request_notify
  dm-mpath: use blk_mq_alloc_request_notify for allocating blk-mq req

 block/blk-core.c             |  1 +
 block/blk-mq-tag.c           | 27 +++++++++++++++++++++-
 block/blk-mq.c               | 43 +++++++++++++++++++++++++++++------
 block/blk-mq.h               |  1 +
 drivers/block/null_blk.c     |  2 +-
 drivers/block/virtio_blk.c   |  2 +-
 drivers/block/xen-blkfront.c |  2 +-
 drivers/md/dm-mpath.c        | 50 ++++++++++++++++++++++++++++++++++++++++-
 drivers/md/dm-rq.c           | 53 +++++++++++++++++++++++++++-----------------
 drivers/md/dm.c              |  1 +
 drivers/scsi/scsi_lib.c      |  6 ++---
 include/linux/blk-mq.h       |  5 +++++
 include/linux/blk_types.h    |  7 ++++++
 13 files changed, 165 insertions(+), 35 deletions(-)

-- 
2.9.5

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-22  3:35 [PATCH 0/5] blk-mq & dm: fix IO hang and deal with one performance issue Ming Lei
@ 2018-01-22  3:35 ` Ming Lei
  2018-01-22 16:32   ` Christoph Hellwig
  2018-01-22 16:49   ` Bart Van Assche
  2018-01-22  3:35 ` [PATCH 2/5] dm-rq: handle dispatch exception in dm_dispatch_clone_request() Ming Lei
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 23+ messages in thread
From: Ming Lei @ 2018-01-22  3:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: dm-devel, Ming Lei, Laurence Oberman, Bart Van Assche,
	James E . J . Bottomley, Martin K . Petersen, linux-scsi

This status is returned from a driver to the block layer when a
device-related resource runs out, but the driver can guarantee that IO
dispatch will be triggered again once the resource becomes available.

This patch converts some drivers to use this return value. Meanwhile, if a
driver returns BLK_STS_RESOURCE and BLK_MQ_S_SCHED_RESTART is set, the
queue is rerun after 10ms to avoid an IO hang.
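
For illustration only (this snippet is not part of the patch and the
my_dev_* names are made up): a driver's ->queue_rq() would pick between
the two status codes roughly like this, returning BLK_STS_DEV_RESOURCE
only when a pending completion is guaranteed to run the queue again:

	static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
					const struct blk_mq_queue_data *bd)
	{
		struct my_dev *dev = hctx->queue->queuedata;

		if (!my_dev_get_resource(dev, bd->rq)) {
			if (my_dev_has_inflight(dev))
				/* a completion will rerun this queue */
				return BLK_STS_DEV_RESOURCE;
			/* queue may be idle: blk-mq reruns it after 10ms */
			return BLK_STS_RESOURCE;
		}

		my_dev_submit(dev, bd->rq);
		return BLK_STS_OK;
	}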

Suggested-by: Jens Axboe <axboe@kernel.dk>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c             |  1 +
 block/blk-mq.c               | 19 +++++++++++++++----
 drivers/block/null_blk.c     |  2 +-
 drivers/block/virtio_blk.c   |  2 +-
 drivers/block/xen-blkfront.c |  2 +-
 drivers/md/dm-rq.c           |  2 +-
 drivers/scsi/scsi_lib.c      |  6 +++---
 include/linux/blk_types.h    |  7 +++++++
 8 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a2005a485335..ac4789c5e329 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -145,6 +145,7 @@ static const struct {
 	[BLK_STS_MEDIUM]	= { -ENODATA,	"critical medium" },
 	[BLK_STS_PROTECTION]	= { -EILSEQ,	"protection" },
 	[BLK_STS_RESOURCE]	= { -ENOMEM,	"kernel resource" },
+	[BLK_STS_DEV_RESOURCE]	= { -ENOMEM,	"device resource" },
 	[BLK_STS_AGAIN]		= { -EAGAIN,	"nonblocking retry" },
 
 	/* device mapper special case, should not leak out: */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 01f271d40825..e91d688792a9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1169,6 +1169,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	struct request *rq, *nxt;
 	bool no_tag = false;
 	int errors, queued;
+	blk_status_t ret = BLK_STS_OK;
 
 	if (list_empty(list))
 		return false;
@@ -1181,7 +1182,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	errors = queued = 0;
 	do {
 		struct blk_mq_queue_data bd;
-		blk_status_t ret;
 
 		rq = list_first_entry(list, struct request, queuelist);
 		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
@@ -1226,7 +1226,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		}
 
 		ret = q->mq_ops->queue_rq(hctx, &bd);
-		if (ret == BLK_STS_RESOURCE) {
+		if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE)) {
 			/*
 			 * If an I/O scheduler has been configured and we got a
 			 * driver tag for the next request already, free it
@@ -1257,6 +1257,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	 * that is where we will continue on next queue run.
 	 */
 	if (!list_empty(list)) {
+		bool needs_restart;
+
 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
@@ -1280,10 +1282,18 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		 * - Some but not all block drivers stop a queue before
 		 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
 		 *   and dm-rq.
+		 *
+		 * If drivers return BLK_STS_RESOURCE and S_SCHED_RESTART
+		 * bit is set, run queue after 10ms for avoiding IO hang
+		 * because the queue may be idle and the RESTART mechanism
+		 * can't work any more.
 		 */
-		if (!blk_mq_sched_needs_restart(hctx) ||
+		needs_restart = blk_mq_sched_needs_restart(hctx);
+		if (!needs_restart ||
 		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
 			blk_mq_run_hw_queue(hctx, true);
+		else if (needs_restart && (ret == BLK_STS_RESOURCE))
+			blk_mq_delay_run_hw_queue(hctx, 10);
 	}
 
 	return (queued + errors) != 0;
@@ -1764,6 +1774,7 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
 		*cookie = new_cookie;
 		break;
 	case BLK_STS_RESOURCE:
+	case BLK_STS_DEV_RESOURCE:
 		__blk_mq_requeue_request(rq);
 		break;
 	default:
@@ -1826,7 +1837,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	hctx_lock(hctx, &srcu_idx);
 
 	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
-	if (ret == BLK_STS_RESOURCE)
+	if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE))
 		blk_mq_sched_insert_request(rq, false, true, false);
 	else if (ret != BLK_STS_OK)
 		blk_mq_end_request(rq, ret);
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 6655893a3a7a..287a09611c0f 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -1230,7 +1230,7 @@ static blk_status_t null_handle_cmd(struct nullb_cmd *cmd)
 				return BLK_STS_OK;
 			} else
 				/* requeue request */
-				return BLK_STS_RESOURCE;
+				return BLK_STS_DEV_RESOURCE;
 		}
 	}
 
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 68846897d213..79908e6ddbf2 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -276,7 +276,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
 		/* Out of mem doesn't actually happen, since we fall back
 		 * to direct descriptors */
 		if (err == -ENOMEM || err == -ENOSPC)
-			return BLK_STS_RESOURCE;
+			return BLK_STS_DEV_RESOURCE;
 		return BLK_STS_IOERR;
 	}
 
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 891265acb10e..e126e4cac2ca 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -911,7 +911,7 @@ static blk_status_t blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
 out_busy:
 	blk_mq_stop_hw_queue(hctx);
 	spin_unlock_irqrestore(&rinfo->ring_lock, flags);
-	return BLK_STS_RESOURCE;
+	return BLK_STS_DEV_RESOURCE;
 }
 
 static void blkif_complete_rq(struct request *rq)
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index d8519ddd7e1a..e0bb56723b9b 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -500,7 +500,7 @@ static int map_request(struct dm_rq_target_io *tio)
 		trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
 				     blk_rq_pos(rq));
 		ret = dm_dispatch_clone_request(clone, rq);
-		if (ret == BLK_STS_RESOURCE) {
+		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
 			blk_rq_unprep_clone(clone);
 			tio->ti->type->release_clone_rq(clone);
 			tio->clone = NULL;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index d9ca1dfab154..55be2550c555 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
-		if (atomic_read(&sdev->device_busy) == 0 &&
-		    !scsi_device_blocked(sdev))
-			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
+		if (atomic_read(&sdev->device_busy) ||
+		    scsi_device_blocked(sdev))
+			ret = BLK_STS_DEV_RESOURCE;
 		break;
 	default:
 		/*
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index c5d3db0d83f8..d76f1744a7ca 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -39,6 +39,13 @@ typedef u8 __bitwise blk_status_t;
 
 #define BLK_STS_AGAIN		((__force blk_status_t)12)
 
+/*
+ * This status is returned from driver to block layer if device related
+ * resource is run out of, but driver can guarantee that IO dispatch is
+ * triggered in future when the resource is available.
+ */
+#define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)
+
 /**
  * blk_path_error - returns true if error may be path related
  * @error: status the request was completed with
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/5] dm-rq: handle dispatch exception in dm_dispatch_clone_request()
  2018-01-22  3:35 [PATCH 0/5] blk-mq & dm: fix IO hang and deal with one performance issue Ming Lei
  2018-01-22  3:35 ` [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE Ming Lei
@ 2018-01-22  3:35 ` Ming Lei
  2018-01-22  3:35 ` [PATCH 3/5] dm-rq: return BLK_STS_* from map_request() Ming Lei
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: Ming Lei @ 2018-01-22  3:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: Bart Van Assche, dm-devel, Laurence Oberman, Ming Lei

This allows map_request() to be cleaned up a bit; no functional change.

Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-rq.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index e0bb56723b9b..1408e6664c16 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -12,6 +12,9 @@
 
 #define DM_MSG_PREFIX "core-rq"
 
+#define  dispatch_busy(r)  \
+	(((r) == BLK_STS_RESOURCE) || ((r) == BLK_STS_DEV_RESOURCE))
+
 #define DM_MQ_NR_HW_QUEUES 1
 #define DM_MQ_QUEUE_DEPTH 2048
 static unsigned dm_mq_nr_hw_queues = DM_MQ_NR_HW_QUEUES;
@@ -399,7 +402,8 @@ static void end_clone_request(struct request *clone, blk_status_t error)
 	dm_complete_request(tio->orig, error);
 }
 
-static blk_status_t dm_dispatch_clone_request(struct request *clone, struct request *rq)
+static blk_status_t dm_dispatch_clone_request(struct dm_rq_target_io *tio,
+		struct request *clone, struct request *rq)
 {
 	blk_status_t r;
 
@@ -408,9 +412,17 @@ static blk_status_t dm_dispatch_clone_request(struct request *clone, struct requ
 
 	clone->start_time = jiffies;
 	r = blk_insert_cloned_request(clone->q, clone);
-	if (r != BLK_STS_OK && r != BLK_STS_RESOURCE)
+	if (dispatch_busy(r)) {
+		blk_rq_unprep_clone(clone);
+		tio->ti->type->release_clone_rq(clone);
+		tio->clone = NULL;
+	} else if (r != BLK_STS_OK) {
 		/* must complete clone in terms of original request */
 		dm_complete_request(rq, r);
+
+		/* we handled the failure, so return OK */
+		r = BLK_STS_OK;
+	}
 	return r;
 }
 
@@ -499,11 +511,8 @@ static int map_request(struct dm_rq_target_io *tio)
 		/* The target has remapped the I/O so dispatch it */
 		trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
 				     blk_rq_pos(rq));
-		ret = dm_dispatch_clone_request(clone, rq);
-		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
-			blk_rq_unprep_clone(clone);
-			tio->ti->type->release_clone_rq(clone);
-			tio->clone = NULL;
+		ret = dm_dispatch_clone_request(tio, clone, rq);
+		if (dispatch_busy(ret)) {
 			if (!rq->q->mq_ops)
 				r = DM_MAPIO_DELAY_REQUEUE;
 			else
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/5] dm-rq: return BLK_STS_* from map_request()
  2018-01-22  3:35 [PATCH 0/5] blk-mq & dm: fix IO hang and deal with one performance issue Ming Lei
  2018-01-22  3:35 ` [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE Ming Lei
  2018-01-22  3:35 ` [PATCH 2/5] dm-rq: handle dispatch exception in dm_dispatch_clone_request() Ming Lei
@ 2018-01-22  3:35 ` Ming Lei
  2018-01-22  5:35   ` Ming Lei
  2018-01-22  3:35 ` [PATCH 4/5] blk-mq: introduce blk_get_request_notify Ming Lei
  2018-01-22  3:35 ` [PATCH 5/5] dm-mpath: use blk_mq_alloc_request_notify for allocating blk-mq req Ming Lei
  4 siblings, 1 reply; 23+ messages in thread
From: Ming Lei @ 2018-01-22  3:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: Bart Van Assche, dm-devel, Laurence Oberman, Ming Lei

Except for DM_MAPIO_REQUEUE, map_request() already handles the other dispatch
exceptions, so return BLK_STS_* from map_request() directly.

Another change is that if dm_dispatch_clone_request() returns
BLK_STS_DEV_RESOURCE from the underlying queue, this status is returned to
blk-mq too, since the underlying queue's RESTART can cover dm-rq's RESTART
in this case.

Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-rq.c | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 1408e6664c16..830e1ccfbb44 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -482,9 +482,10 @@ static void init_tio(struct dm_rq_target_io *tio, struct request *rq,
 
 /*
  * Returns:
- * DM_MAPIO_*       : the request has been processed as indicated
- * DM_MAPIO_REQUEUE : the original request needs to be immediately requeued
- * < 0              : the request was completed due to failure
+ * BLK_STS_OK       : the request has been processed already, no need to
+ * 		ask the block layer to handle it any more
+ * BLK_STS_RESOURCE : the original request needs to be immediately requeued
+ * BLK_STS_DEV_RESOURCE : same as BLK_STS_RESOURCE, but blk-mq needs this info
  */
 static int map_request(struct dm_rq_target_io *tio)
 {
@@ -493,7 +494,7 @@ static int map_request(struct dm_rq_target_io *tio)
 	struct mapped_device *md = tio->md;
 	struct request *rq = tio->orig;
 	struct request *clone = NULL;
-	blk_status_t ret;
+	blk_status_t ret, result = BLK_STS_OK;
 
 	r = ti->type->clone_and_map_rq(ti, rq, &tio->info, &clone);
 check_again:
@@ -513,15 +514,16 @@ static int map_request(struct dm_rq_target_io *tio)
 				     blk_rq_pos(rq));
 		ret = dm_dispatch_clone_request(tio, clone, rq);
 		if (dispatch_busy(ret)) {
-			if (!rq->q->mq_ops)
+			if (!rq->q->mq_ops) {
 				r = DM_MAPIO_DELAY_REQUEUE;
-			else
-				r = DM_MAPIO_REQUEUE;
-			goto check_again;
+				goto check_again;
+			}
+			result = ret;
 		}
 		break;
 	case DM_MAPIO_REQUEUE:
 		/* The target wants to requeue the I/O */
+		result = BLK_STS_RESOURCE;
 		break;
 	case DM_MAPIO_DELAY_REQUEUE:
 		/* The target wants to requeue the I/O after a delay */
@@ -536,7 +538,7 @@ static int map_request(struct dm_rq_target_io *tio)
 		BUG();
 	}
 
-	return r;
+	return result;
 }
 
 static void dm_start_request(struct mapped_device *md, struct request *orig)
@@ -599,7 +601,7 @@ static void map_tio_request(struct kthread_work *work)
 {
 	struct dm_rq_target_io *tio = container_of(work, struct dm_rq_target_io, work);
 
-	if (map_request(tio) == DM_MAPIO_REQUEUE)
+	if (dispatch_busy(map_request(tio)))
 		dm_requeue_original_request(tio, false);
 }
 
@@ -754,6 +756,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 	struct dm_rq_target_io *tio = blk_mq_rq_to_pdu(rq);
 	struct mapped_device *md = tio->md;
 	struct dm_target *ti = md->immutable_target;
+	blk_status_t ret;
 
 	if (unlikely(!ti)) {
 		int srcu_idx;
@@ -777,14 +780,15 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 	tio->ti = ti;
 
 	/* Direct call is fine since .queue_rq allows allocations */
-	if (map_request(tio) == DM_MAPIO_REQUEUE) {
+	ret = map_request(tio);
+	if (dispatch_busy(ret)) {
 		/* Undo dm_start_request() before requeuing */
 		rq_end_stats(md, rq);
 		rq_completed(md, rq_data_dir(rq), false);
-		return BLK_STS_RESOURCE;
+		return ret;
 	}
 
-	return BLK_STS_OK;
+	return ret;
 }
 
 static const struct blk_mq_ops dm_mq_ops = {
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 4/5] blk-mq: introduce blk_get_request_notify
  2018-01-22  3:35 [PATCH 0/5] blk-mq & dm: fix IO hang and deal with one performance issue Ming Lei
                   ` (2 preceding siblings ...)
  2018-01-22  3:35 ` [PATCH 3/5] dm-rq: return BLK_STS_* from map_request() Ming Lei
@ 2018-01-22  3:35 ` Ming Lei
  2018-01-22 10:19   ` Ming Lei
  2018-01-22 17:13   ` Bart Van Assche
  2018-01-22  3:35 ` [PATCH 5/5] dm-mpath: use blk_mq_alloc_request_notify for allocating blk-mq req Ming Lei
  4 siblings, 2 replies; 23+ messages in thread
From: Ming Lei @ 2018-01-22  3:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: dm-devel, Ming Lei, Laurence Oberman, Bart Van Assche

DM-MPATH needs to allocate requests from the underlying queue, but when the
allocation fails, there is no way for the underlying queue's RESTART to
restart DM's queue.

This patch introduces blk_get_request_notify() for this purpose. The caller
needs to pass a properly initialized 'wait_queue_entry_t' to this function,
so that after the current allocation fails, DM is notified when a request
becomes available on the underlying queue.

This approach was suggested by Jens and has been used in the blk-mq dispatch
path for a while; see blk_mq_mark_tag_wait().
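
A minimal sketch of the intended calling convention (illustrative only;
'struct my_ctx', 'my_wake' and 'my_queue' are made-up names, and patch 5
wires this up for real in dm-mpath):

	static int my_wake(wait_queue_entry_t *wait, unsigned mode,
			   int flags, void *key)
	{
		struct my_ctx *ctx = container_of(wait, struct my_ctx, wait);

		list_del_init(&wait->entry);
		/* a request was freed: rerun the waiter's own queue */
		blk_mq_run_hw_queues(ctx->my_queue, true);
		return 1;
	}

	/* one-time setup */
	init_waitqueue_func_entry(&ctx->wait, my_wake);
	INIT_LIST_HEAD(&ctx->wait.entry);

	/* on each allocation attempt */
	rq = blk_mq_alloc_request_notify(q, op,
			BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_ALLOC_NOTIFY,
			&ctx->wait);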

Suggested-by: Jens Axboe <axboe@kernel.dk>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-tag.c     | 27 ++++++++++++++++++++++++++-
 block/blk-mq.c         | 24 +++++++++++++++++++++---
 block/blk-mq.h         |  1 +
 include/linux/blk-mq.h |  5 +++++
 4 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 336dde07b230..911fc9bd1bab 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -128,10 +128,35 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	if (tag != -1)
 		goto found_tag;
 
-	if (data->flags & BLK_MQ_REQ_NOWAIT)
+	if ((data->flags & BLK_MQ_REQ_NOWAIT) && !(data->notifier &&
+			(data->flags & BLK_MQ_REQ_ALLOC_NOTIFY)))
 		return BLK_MQ_TAG_FAIL;
 
 	ws = bt_wait_ptr(bt, data->hctx);
+
+	/*
+	 * If caller requires notification when tag is available, add
+	 * wait entry of 'data->notifier' to the wait queue.
+	 */
+	if (data->flags & BLK_MQ_REQ_NOWAIT) {
+		bool added = false;
+
+		spin_lock_irq(&ws->wait.lock);
+		if (list_empty(&data->notifier->entry))
+			__add_wait_queue(&ws->wait, data->notifier);
+		else
+			added = true;
+		spin_unlock_irq(&ws->wait.lock);
+
+		if (added)
+			return BLK_MQ_TAG_FAIL;
+
+		tag = __blk_mq_get_tag(data, bt);
+		if (tag != -1)
+			goto found_tag;
+		return BLK_MQ_TAG_FAIL;
+	}
+
 	drop_ctx = data->ctx == NULL;
 	do {
 		/*
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e91d688792a9..a71e44f17e7f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -384,10 +384,14 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 	return rq;
 }
 
-struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
-		blk_mq_req_flags_t flags)
+static struct request *__blk_mq_alloc_request(struct request_queue *q,
+		unsigned int op, blk_mq_req_flags_t flags,
+		wait_queue_entry_t *notifier)
 {
-	struct blk_mq_alloc_data alloc_data = { .flags = flags };
+	struct blk_mq_alloc_data alloc_data = {
+		.flags = flags,
+		.notifier = notifier,
+	};
 	struct request *rq;
 	int ret;
 
@@ -408,8 +412,22 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 	rq->bio = rq->biotail = NULL;
 	return rq;
 }
+
+struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
+		blk_mq_req_flags_t flags)
+{
+	return __blk_mq_alloc_request(q, op, flags, NULL);
+}
 EXPORT_SYMBOL(blk_mq_alloc_request);
 
+struct request *blk_mq_alloc_request_notify(struct request_queue *q,
+		unsigned int op, blk_mq_req_flags_t flags,
+		wait_queue_entry_t *notifier)
+{
+	return __blk_mq_alloc_request(q, op, flags, notifier);
+}
+EXPORT_SYMBOL(blk_mq_alloc_request_notify);
+
 struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx)
 {
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 88c558f71819..bec2f675f8f1 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -160,6 +160,7 @@ struct blk_mq_alloc_data {
 	struct request_queue *q;
 	blk_mq_req_flags_t flags;
 	unsigned int shallow_depth;
+	wait_queue_entry_t *notifier;
 
 	/* input & output parameter */
 	struct blk_mq_ctx *ctx;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8efcf49796a3..335c7dace5a7 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -219,10 +219,15 @@ enum {
 	BLK_MQ_REQ_INTERNAL	= (__force blk_mq_req_flags_t)(1 << 2),
 	/* set RQF_PREEMPT */
 	BLK_MQ_REQ_PREEMPT	= (__force blk_mq_req_flags_t)(1 << 3),
+	/* notify when new rq is available */
+	BLK_MQ_REQ_ALLOC_NOTIFY	= (__force blk_mq_req_flags_t)(1 << 4),
 };
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 		blk_mq_req_flags_t flags);
+struct request *blk_mq_alloc_request_notify(struct request_queue *q,
+		unsigned int op, blk_mq_req_flags_t flags,
+		wait_queue_entry_t *notifier);
 struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 		unsigned int op, blk_mq_req_flags_t flags,
 		unsigned int hctx_idx);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 5/5] dm-mpath: use blk_mq_alloc_request_notify for allocating blk-mq req
  2018-01-22  3:35 [PATCH 0/5] blk-mq & dm: fix IO hang and deal with one performance issue Ming Lei
                   ` (3 preceding siblings ...)
  2018-01-22  3:35 ` [PATCH 4/5] blk-mq: introduce blk_get_request_notify Ming Lei
@ 2018-01-22  3:35 ` Ming Lei
  4 siblings, 0 replies; 23+ messages in thread
From: Ming Lei @ 2018-01-22  3:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: dm-devel, Ming Lei, Laurence Oberman, Bart Van Assche

When blk_get_request() fails to allocate a request, there is no way for the
underlying queue's RESTART to restart DM's queue. This patch applies the
newly introduced blk_mq_alloc_request_notify() to deal with this issue.

The following issues can be addressed:

1) When blk_get_request() fails, dm-rq will eventually return
BLK_STS_RESOURCE to blk-mq, and blk-mq will run the queue after 10ms; this
delay may degrade dm-rq performance a lot when any underlying device
attached to the same host is accessed directly.

2) There is no way to figure out a perfect delay for running the queue in
situation 1), and if the delay is too small, requests can be dequeued from
the DM queue too quickly, which hurts sequential IO performance.

This approach was suggested by Jens and has been used in the blk-mq dispatch
path for a while; see blk_mq_mark_tag_wait().

Suggested-by: Jens Axboe <axboe@kernel.dk>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-mpath.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/md/dm.c       |  1 +
 2 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 7d3e572072f5..f7753a2e9229 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -11,6 +11,7 @@
 #include "dm-bio-record.h"
 #include "dm-path-selector.h"
 #include "dm-uevent.h"
+#include "dm.h"
 
 #include <linux/blkdev.h>
 #include <linux/ctype.h>
@@ -30,6 +31,15 @@
 #define DM_PG_INIT_DELAY_MSECS 2000
 #define DM_PG_INIT_DELAY_DEFAULT ((unsigned) -1)
 
+/*
+ * When running out of requests on the underlying queue, use this
+ * structure to get notified by blk-mq
+ */
+struct underlying_notifier {
+	struct request_queue *q;
+	wait_queue_entry_t wait;
+};
+
 /* Path properties */
 struct pgpath {
 	struct list_head list;
@@ -40,6 +50,7 @@ struct pgpath {
 	struct dm_path path;
 	struct delayed_work activate_path;
 
+	struct underlying_notifier notifier;
 	bool is_active:1;		/* Path status */
 };
 
@@ -125,6 +136,17 @@ static void process_queued_bios(struct work_struct *work);
  * Allocation routines
  *-----------------------------------------------*/
 
+static int req_notify(wait_queue_entry_t *wait, unsigned mode, int flags,
+		void *key)
+{
+	struct underlying_notifier *notifier;
+
+	notifier = container_of(wait, struct underlying_notifier, wait);
+	list_del_init(&notifier->wait.entry);
+	blk_mq_run_hw_queues(notifier->q, true);
+	return 1;
+}
+
 static struct pgpath *alloc_pgpath(void)
 {
 	struct pgpath *pgpath = kzalloc(sizeof(*pgpath), GFP_KERNEL);
@@ -490,6 +512,27 @@ static bool must_push_back_bio(struct multipath *m)
 	return __must_push_back(m, flags);
 }
 
+static struct request *
+multipath_get_req(struct pgpath *pgpath, struct request_queue *q,
+		  struct request *rq)
+{
+	if (!q->mq_ops)
+		return blk_get_request(q, rq->cmd_flags | REQ_NOMERGE,
+				GFP_ATOMIC);
+
+	/*
+	 * Even if BLK_MQ_REQ_ALLOC_NOTIFY is passed, we can't return
+	 * BLK_STS_DEV_RESOURCE to blk-mq if this allocation fails, because
+	 * the notification can come before dm's request is added to the
+	 * dispatch list; so we still need to return BLK_STS_RESOURCE to
+	 * blk-mq, and the notification will run the queue and cancel the
+	 * delay caused by BLK_STS_RESOURCE immediately.
+	 */
+	return blk_mq_alloc_request_notify(q, rq->cmd_flags | REQ_NOMERGE,
+                                BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_ALLOC_NOTIFY,
+				&pgpath->notifier.wait);
+}
+
 /*
  * Map cloned requests (request-based multipath)
  */
@@ -526,7 +569,7 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
 
 	bdev = pgpath->path.dev->bdev;
 	q = bdev_get_queue(bdev);
-	clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
+	clone = multipath_get_req(pgpath, q, rq);
 	if (IS_ERR(clone)) {
 		/* EBUSY, ENODEV or EWOULDBLOCK: requeue */
 		if (blk_queue_dying(q)) {
@@ -873,6 +916,7 @@ static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps
 	int r;
 	struct pgpath *p;
 	struct multipath *m = ti->private;
+	struct mapped_device *md = dm_table_get_md((m)->ti->table);
 
 	/* we need at least a path arg */
 	if (as->argc < 1) {
@@ -906,6 +950,10 @@ static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps
 		goto bad;
 	}
 
+	p->notifier.q = dm_get_md_queue(md);
+	init_waitqueue_func_entry(&p->notifier.wait, req_notify);
+	INIT_LIST_HEAD(&p->notifier.wait.entry);
+
 	return p;
  bad:
 	free_pgpath(p);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index d6de00f367ef..ff06efcaf36e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -445,6 +445,7 @@ struct request_queue *dm_get_md_queue(struct mapped_device *md)
 {
 	return md->queue;
 }
+EXPORT_SYMBOL_GPL(dm_get_md_queue);
 
 struct dm_stats *dm_get_stats(struct mapped_device *md)
 {
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/5] dm-rq: return BLK_STS_* from map_request()
  2018-01-22  3:35 ` [PATCH 3/5] dm-rq: return BLK_STS_* from map_request() Ming Lei
@ 2018-01-22  5:35   ` Ming Lei
  0 siblings, 0 replies; 23+ messages in thread
From: Ming Lei @ 2018-01-22  5:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: dm-devel, Laurence Oberman, Bart Van Assche

On Mon, Jan 22, 2018 at 11:35:48AM +0800, Ming Lei wrote:
> Except for DM_MAPIO_REQUEUE, map_request() already handles the other dispatch
> exceptions, so return BLK_STS_* from map_request() directly.
> 
> Another change is that if dm_dispatch_clone_request() returns
> BLK_STS_DEV_RESOURCE from the underlying queue, this status is returned to
> blk-mq too, since the underlying queue's RESTART can cover dm-rq's RESTART
> in this case.

Hmm, this is obviously wrong, :-(

-- 
Ming

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/5] blk-mq: introduce blk_get_request_notify
  2018-01-22  3:35 ` [PATCH 4/5] blk-mq: introduce blk_get_request_notify Ming Lei
@ 2018-01-22 10:19   ` Ming Lei
  2018-01-22 17:13   ` Bart Van Assche
  1 sibling, 0 replies; 23+ messages in thread
From: Ming Lei @ 2018-01-22 10:19 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer
  Cc: dm-devel, Laurence Oberman, Bart Van Assche

On Mon, Jan 22, 2018 at 11:35:49AM +0800, Ming Lei wrote:
> DM-MPATH needs to allocate requests from the underlying queue, but when the
> allocation fails, there is no way for the underlying queue's RESTART to
> restart DM's queue.
> 
> This patch introduces blk_get_request_notify() for this purpose. The caller
> needs to pass a properly initialized 'wait_queue_entry_t' to this function,
> so that after the current allocation fails, DM is notified when a request
> becomes available on the underlying queue.
> 
> This approach was suggested by Jens and has been used in the blk-mq dispatch
> path for a while; see blk_mq_mark_tag_wait().
> 
> Suggested-by: Jens Axboe <axboe@kernel.dk>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Cc: Laurence Oberman <loberman@redhat.com>
> Cc: Bart Van Assche <bart.vanassche@sandisk.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq-tag.c     | 27 ++++++++++++++++++++++++++-
>  block/blk-mq.c         | 24 +++++++++++++++++++++---
>  block/blk-mq.h         |  1 +
>  include/linux/blk-mq.h |  5 +++++
>  4 files changed, 53 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 336dde07b230..911fc9bd1bab 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -128,10 +128,35 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>  	if (tag != -1)
>  		goto found_tag;
>  
> -	if (data->flags & BLK_MQ_REQ_NOWAIT)
> +	if ((data->flags & BLK_MQ_REQ_NOWAIT) && !(data->notifier &&
> +			(data->flags & BLK_MQ_REQ_ALLOC_NOTIFY)))
>  		return BLK_MQ_TAG_FAIL;
>  
>  	ws = bt_wait_ptr(bt, data->hctx);
> +
> +	/*
> +	 * If caller requires notification when tag is available, add
> +	 * wait entry of 'data->notifier' to the wait queue.
> +	 */
> +	if (data->flags & BLK_MQ_REQ_NOWAIT) {
> +		bool added = false;
> +
> +		spin_lock_irq(&ws->wait.lock);
> +		if (list_empty(&data->notifier->entry))
> +			__add_wait_queue(&ws->wait, data->notifier);
> +		else
> +			added = true;
> +		spin_unlock_irq(&ws->wait.lock);

The above needs a per-notifier lock too, since the same notifier may
be added to different wait queues at the same time.
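
Something like the following might be enough (only a rough sketch, not
tested): embed the wait entry in a small structure with its own lock,
and take that lock around the list_empty()/__add_wait_queue() pair so
two hctx wait queues can't race on the same entry; the wake function
would have to take the same lock before list_del_init():

	struct alloc_notifier {
		spinlock_t		lock;
		wait_queue_entry_t	wait;
	};

	spin_lock_irq(&notifier->lock);
	spin_lock(&ws->wait.lock);
	if (list_empty(&notifier->wait.entry))
		__add_wait_queue(&ws->wait, &notifier->wait);
	else
		added = true;
	spin_unlock(&ws->wait.lock);
	spin_unlock_irq(&notifier->lock);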

-- 
Ming

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-22  3:35 ` [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE Ming Lei
@ 2018-01-22 16:32   ` Christoph Hellwig
  2018-01-22 16:49   ` Bart Van Assche
  1 sibling, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2018-01-22 16:32 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	dm-devel, Laurence Oberman, Bart Van Assche,
	James E . J . Bottomley, Martin K . Petersen, linux-scsi

> -		if (ret == BLK_STS_RESOURCE) {
> +		if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE)) {

No need for the inner braces here.

> +	if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE))

Same here.

> +/*
> + * This status is returned from driver to block layer if device related
> + * resource is run out of, but driver can guarantee that IO dispatch is
> + * triggered in future when the resource is available.
> + */
> +#define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)

We'll need some good explanation on BLK_STS_RESOURCE vs
BLK_STS_DEV_RESOURCE I think.

Except for these nitpicks this looks fine to me.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-22  3:35 ` [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE Ming Lei
  2018-01-22 16:32   ` Christoph Hellwig
@ 2018-01-22 16:49   ` Bart Van Assche
  2018-01-23  0:57     ` Ming Lei
  1 sibling, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2018-01-22 16:49 UTC (permalink / raw)
  To: hch@infradead.org, linux-block@vger.kernel.org,
	snitzer@redhat.com, ming.lei@redhat.com, axboe@kernel.dk
  Cc: dm-devel@redhat.com, jejb@linux.vnet.ibm.com,
	linux-scsi@vger.kernel.org, martin.petersen@oracle.com,
	loberman@redhat.com

On Mon, 2018-01-22 at 11:35 +0800, Ming Lei wrote:
> @@ -1280,10 +1282,18 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
>  		 * - Some but not all block drivers stop a queue before
>  		 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
>  		 *   and dm-rq.
> +		 *
> +		 * If drivers return BLK_STS_RESOURCE and S_SCHED_RESTART
> +		 * bit is set, run queue after 10ms for avoiding IO hang
> +		 * because the queue may be idle and the RESTART mechanism
> +		 * can't work any more.
>  		 */
> -		if (!blk_mq_sched_needs_restart(hctx) ||
> +		needs_restart = blk_mq_sched_needs_restart(hctx);
> +		if (!needs_restart ||
>  		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>  			blk_mq_run_hw_queue(hctx, true);
> +		else if (needs_restart && (ret == BLK_STS_RESOURCE))
> +			blk_mq_delay_run_hw_queue(hctx, 10);
>  	}

In my opinion there are two problems with the above changes:
* Only the block driver author can know what a good choice is for the time
  after which to rerun the queue. So I think moving the rerun delay (10 ms)
  constant from block drivers into the core is a step backwards instead of a
  step forwards.
* The purpose of the BLK_MQ_S_SCHED_RESTART flag is to detect whether or not
  any of the queue runs triggered by freeing a tag happened concurrently. I
  don't think that there is any relationship between queue runs happening
  concurrently or not and the chance that driver resources become available.
  So deciding whether or not a queue should be rerun based on the value of
  the BLK_MQ_S_SCHED_RESTART flag seems wrong to me.

> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index d9ca1dfab154..55be2550c555 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	case BLK_STS_OK:
>  		break;
>  	case BLK_STS_RESOURCE:
> -		if (atomic_read(&sdev->device_busy) == 0 &&
> -		    !scsi_device_blocked(sdev))
> -			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
> +		if (atomic_read(&sdev->device_busy) ||
> +		    scsi_device_blocked(sdev))
> +			ret = BLK_STS_DEV_RESOURCE;
>  		break;
>  	default:
>  		/*

The above introduces two changes that have not been mentioned in the
description of this patch:
- The queue rerunning delay is changed from 3 ms into 10 ms. Where is the
  explanation of this change? Does this change have a positive or negative
  performance impact?
- The above modifies a guaranteed queue rerun into a queue rerun that
  may or may not happen, depending on whether or not multiple tags get freed
  concurrently (return BLK_STS_DEV_RESOURCE). Sorry but I think that's wrong.

Bart.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/5] blk-mq: introduce blk_get_request_notify
  2018-01-22  3:35 ` [PATCH 4/5] blk-mq: introduce blk_get_request_notify Ming Lei
  2018-01-22 10:19   ` Ming Lei
@ 2018-01-22 17:13   ` Bart Van Assche
  2018-01-23  1:29     ` Ming Lei
  1 sibling, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2018-01-22 17:13 UTC (permalink / raw)
  To: hch@infradead.org, linux-block@vger.kernel.org,
	snitzer@redhat.com, ming.lei@redhat.com, axboe@kernel.dk
  Cc: dm-devel@redhat.com, loberman@redhat.com

On Mon, 2018-01-22 at 11:35 +0800, Ming Lei wrote:
> DM-MPATH needs to allocate requests from the underlying queue, but when the
> allocation fails, there is no way for the underlying queue's RESTART to
> restart DM's queue.
> 
> This patch introduces blk_get_request_notify() for this purpose. The caller
> needs to pass a properly initialized 'wait_queue_entry_t' to this function,
> so that after the current allocation fails, DM is notified when a request
> becomes available on the underlying queue.

Please mention that this is only a partial solution because the case when
e.g. blk_insert_cloned_request() returns BLK_STS_RESOURCE is not handled.
This could help for drivers that support a very low queue depth (lpfc) but
probably won't be that useful for other drivers.

> +	/*
> +	 * If caller requires notification when tag is available, add
> +	 * wait entry of 'data->notifier' to the wait queue.
> +	 */
> +	if (data->flags & BLK_MQ_REQ_NOWAIT) {
> +		bool added = false;
> +
> +		spin_lock_irq(&ws->wait.lock);
> +		if (list_empty(&data->notifier->entry))
> +			__add_wait_queue(&ws->wait, data->notifier);
> +		else
> +			added = true;
> +		spin_unlock_irq(&ws->wait.lock);
> +
> +		if (added)
> +			return BLK_MQ_TAG_FAIL;
> +
> +		tag = __blk_mq_get_tag(data, bt);
> +		if (tag != -1)
> +			goto found_tag;
> +		return BLK_MQ_TAG_FAIL;
> +	}

Sorry but I don't like this approach. Adding "data->notifier" to the wait
queue creates a link between two request queues, e.g. a dm-mpath queue and
one of the paths that is a member of that dm-mpath queue. This creates the
potential for ugly races between e.g. "data->notifier" being triggered and
removal of the dm-mpath queue.

> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 88c558f71819..bec2f675f8f1 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -160,6 +160,7 @@ struct blk_mq_alloc_data {
>  	struct request_queue *q;
>  	blk_mq_req_flags_t flags;
>  	unsigned int shallow_depth;
> +	wait_queue_entry_t *notifier;

If others would agree with the approach of this patch please use another name
than "notifier". In the context of the Linux kernel a notifier is an instance
of struct notifier_block. The above "notifier" member is not a notifier but a
wait queue entry.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-22 16:49   ` Bart Van Assche
@ 2018-01-23  0:57     ` Ming Lei
  2018-01-23 16:17       ` Bart Van Assche
  0 siblings, 1 reply; 23+ messages in thread
From: Ming Lei @ 2018-01-23  0:57 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch@infradead.org, linux-block@vger.kernel.org,
	snitzer@redhat.com, axboe@kernel.dk, dm-devel@redhat.com,
	jejb@linux.vnet.ibm.com, linux-scsi@vger.kernel.org,
	martin.petersen@oracle.com, loberman@redhat.com

On Mon, Jan 22, 2018 at 04:49:54PM +0000, Bart Van Assche wrote:
> On Mon, 2018-01-22 at 11:35 +0800, Ming Lei wrote:
> > @@ -1280,10 +1282,18 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
> >  		 * - Some but not all block drivers stop a queue before
> >  		 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
> >  		 *   and dm-rq.
> > +		 *
> > +		 * If drivers return BLK_STS_RESOURCE and S_SCHED_RESTART
> > +		 * bit is set, run queue after 10ms for avoiding IO hang
> > +		 * because the queue may be idle and the RESTART mechanism
> > +		 * can't work any more.
> >  		 */
> > -		if (!blk_mq_sched_needs_restart(hctx) ||
> > +		needs_restart = blk_mq_sched_needs_restart(hctx);
> > +		if (!needs_restart ||
> >  		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> >  			blk_mq_run_hw_queue(hctx, true);
> > +		else if (needs_restart && (ret == BLK_STS_RESOURCE))
> > +			blk_mq_delay_run_hw_queue(hctx, 10);
> >  	}
> 
> In my opinion there are two problems with the above changes:
> * Only the block driver author can know what a good choice is for the time

The reality is that no one knew that before; if you look at the facts, most
BLK_STS_RESOURCE cases need that, but no one dealt with it at all. Both Jens
and I concluded that it is a generic issue which needs a generic solution.

On the contrary, it is much easier to evaluate whether one STS_RESOURCE can
be converted to STS_DEV_RESOURCE; as you saw, there aren't many, because now
we call .queue_rq() after getting the budget and the driver tag.

>   after which to rerun the queue. So I think moving the rerun delay (10 ms)
>   constant from block drivers into the core is a step backwards instead of a
>   step forwards.

As I mentioned before, if we run out of a resource inside .queue_rq(), the
request is still out of blk-mq's control, and scheduling a queue run inside
.queue_rq() isn't correct, because when __blk_mq_run_hw_queue() runs, the
request may not yet be visible for dispatch.

Even though RCU lock is held during dispatch, preemption or interrupt
can happen too, so it is simply wrong to depend on the timing to make
sure __blk_mq_run_hw_queue() can see the request in this situation.

Alternatively, the driver could reinsert the request into the scheduler
queue, but doing that in the driver involves a much bigger change and
introduces unnecessary cost.

> * The purpose of the BLK_MQ_S_SCHED_RESTART flag is to detect whether or not
>   any of the queue runs triggered by freeing a tag happened concurrently. I
>   don't think that there is any relationship between queue runs happening all
>   or not concurrently and the chance that driver resources become available.
>   So deciding whether or not a queue should be rerun based on the value of
>   the BLK_MQ_S_SCHED_RESTART flag seems wrong to me.

The fact is simple: once BLK_MQ_S_SCHED_RESTART is set,
blk_mq_dispatch_rq_list() won't call blk_mq_run_hw_queue() any more, and
depends on request completion to rerun the queue. If the driver can't
guarantee that the queue will be restarted in the future when it returns
STS_RESOURCE, we have to call blk_mq_delay_run_hw_queue(hctx, 10) to avoid
an IO hang.

> 
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index d9ca1dfab154..55be2550c555 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
> >  	case BLK_STS_OK:
> >  		break;
> >  	case BLK_STS_RESOURCE:
> > -		if (atomic_read(&sdev->device_busy) == 0 &&
> > -		    !scsi_device_blocked(sdev))
> > -			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
> > +		if (atomic_read(&sdev->device_busy) ||
> > +		    scsi_device_blocked(sdev))
> > +			ret = BLK_STS_DEV_RESOURCE;
> >  		break;
> >  	default:
> >  		/*
> 
> The above introduces two changes that have not been mentioned in the
> description of this patch:
> - The queue rerunning delay is changed from 3 ms into 10 ms. Where is the
>   explanation of this change? Does this change have a positive or negative
>   performance impact?

How can that be an issue for SCSI? The rerunning delay is only triggered
when there aren't any in-flight requests in the SCSI queue.

> - The above modifies a guaranteed queue rerun into a queue rerun that
>   may or may not happen, depending on whether or not multiple tags get freed
>   concurrently (return BLK_STS_DEV_RESOURCE). Sorry but I think that's wrong.

This patch only moves the rerunning delay from SCSI's .queue_rq() into
blk-mq; I don't know what you mean above, please explain a bit more.

I know there is a race here in SCSI, but it has nothing to do with this patch.


-- 
Ming

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/5] blk-mq: introduce blk_get_request_notify
  2018-01-22 17:13   ` Bart Van Assche
@ 2018-01-23  1:29     ` Ming Lei
  0 siblings, 0 replies; 23+ messages in thread
From: Ming Lei @ 2018-01-23  1:29 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch@infradead.org, linux-block@vger.kernel.org,
	snitzer@redhat.com, axboe@kernel.dk, dm-devel@redhat.com,
	loberman@redhat.com

On Mon, Jan 22, 2018 at 05:13:03PM +0000, Bart Van Assche wrote:
> On Mon, 2018-01-22 at 11:35 +0800, Ming Lei wrote:
> > DM-MPATH needs to allocate requests from the underlying queue, but when the
> > allocation fails, there is no way for the underlying queue's RESTART to
> > restart DM's queue.
> > 
> > This patch introduces blk_get_request_notify() for this purpose. The caller
> > needs to pass a properly initialized 'wait_queue_entry_t' to this function,
> > so that after the current allocation fails, DM is notified when a request
> > becomes available on the underlying queue.
> 
> Please mention that this is only a partial solution because the case when
> e.g. blk_insert_cloned_request() returns BLK_STS_RESOURCE is not handled.

I don't think it is necessary to mention that in the commit log, because
one patch won't deal with everything.

But there will definitely be a patch which deals with that in V2, otherwise
the performance issue can't be solved completely.

> This could help for drivers that support a very low queue depth (lpfc) but
> probably won't be that useful for other drivers.

Up to now, it is a dm-mpath-specific performance issue under one specific
situation: IO is submitted to the dm-mpath device and the underlying queue
at the same time, and the underlying queue depth is very low, such as 1.

> 
> > +	/*
> > +	 * If caller requires notification when tag is available, add
> > +	 * wait entry of 'data->notifier' to the wait queue.
> > +	 */
> > +	if (data->flags & BLK_MQ_REQ_NOWAIT) {
> > +		bool added = false;
> > +
> > +		spin_lock_irq(&ws->wait.lock);
> > +		if (list_empty(&data->notifier->entry))
> > +			__add_wait_queue(&ws->wait, data->notifier);
> > +		else
> > +			added = true;
> > +		spin_unlock_irq(&ws->wait.lock);
> > +
> > +		if (added)
> > +			return BLK_MQ_TAG_FAIL;
> > +
> > +		tag = __blk_mq_get_tag(data, bt);
> > +		if (tag != -1)
> > +			goto found_tag;
> > +		return BLK_MQ_TAG_FAIL;
> > +	}
> 
> Sorry but I don't like this approach. Adding "data->notifier" to the wait
> queue creates a link between two request queues, e.g. a dm-mpath queue and
> one of the paths that is a member of that dm-mpath queue. This creates the
> potential for ugly races between e.g. "data->notifier" being triggered and
> removal of the dm-mpath queue.

I thought there was no such race because the dm request is still in the
dispatch list before the 'notifier' is handled. Now it looks like this isn't
true, but the issue can be dealt with easily by calling
blk_queue_enter_live() before the allocation and releasing the reference
once the notifier is triggered.
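
Something along these lines (a hand-wavy sketch only; error handling and
the exact point where the reference is dropped still need thought):

	blk_queue_enter_live(q);	/* pin the underlying queue */
	rq = blk_mq_alloc_request_notify(q, op,
			BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_ALLOC_NOTIFY,
			&pgpath->notifier.wait);
	if (!IS_ERR(rq)) {
		/* got a request, the wait entry wasn't left armed */
		blk_queue_exit(q);
		return rq;
	}
	/*
	 * Allocation failed and the wait entry may now sit on the tag
	 * wait queue, so keep the queue reference until req_notify()
	 * runs, which would then do:
	 *
	 *	list_del_init(&wait->entry);
	 *	blk_mq_run_hw_queues(notifier->q, true);
	 *	blk_queue_exit(underlying_q);
	 */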

Anyway, this interface needs to be well documented.

> 
> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index 88c558f71819..bec2f675f8f1 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -160,6 +160,7 @@ struct blk_mq_alloc_data {
> >  	struct request_queue *q;
> >  	blk_mq_req_flags_t flags;
> >  	unsigned int shallow_depth;
> > +	wait_queue_entry_t *notifier;
> 
> If others would agree with the approach of this patch please use another name
> than "notifier". In the context of the Linux kernel a notifier is an instance
> of struct notifier_block. The above "notifier" member is not a notifier but a
> wait queue entry.

OK, I will change it to 'wait', similar to 'dispatch_wait'.

-- 
Ming

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-23  0:57     ` Ming Lei
@ 2018-01-23 16:17       ` Bart Van Assche
  2018-01-23 16:26         ` Ming Lei
  0 siblings, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2018-01-23 16:17 UTC (permalink / raw)
  To: Ming Lei
  Cc: hch@infradead.org, linux-block@vger.kernel.org,
	snitzer@redhat.com, axboe@kernel.dk, dm-devel@redhat.com,
	jejb@linux.vnet.ibm.com, linux-scsi@vger.kernel.org,
	martin.petersen@oracle.com, loberman@redhat.com



On 01/22/18 16:57, Ming Lei wrote:
> Even though RCU lock is held during dispatch, preemption or interrupt
> can happen too, so it is simply wrong to depend on the timing to make
> sure __blk_mq_run_hw_queue() can see the request in this situation.

It is very unlikely that this race will ever be hit because that race 
exists for less than one microsecond and the smallest timeout that can 
be specified for delayed queue rerunning is one millisecond. Let's 
address this race if anyone ever finds a way to hit it.

>>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>>> index d9ca1dfab154..55be2550c555 100644
>>> --- a/drivers/scsi/scsi_lib.c
>>> +++ b/drivers/scsi/scsi_lib.c
>>> @@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
>>>   	case BLK_STS_OK:
>>>   		break;
>>>   	case BLK_STS_RESOURCE:
>>> -		if (atomic_read(&sdev->device_busy) == 0 &&
>>> -		    !scsi_device_blocked(sdev))
>>> -			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
>>> +		if (atomic_read(&sdev->device_busy) ||
>>> +		    scsi_device_blocked(sdev))
>>> +			ret = BLK_STS_DEV_RESOURCE;
>>>   		break;
>>>   	default:
>>>   		/*
>>
>> The above introduces two changes that have not been mentioned in the
>> description of this patch:
>> - The queue rerunning delay is changed from 3 ms into 10 ms. Where is the
>>    explanation of this change? Does this change have a positive or negative
>>    performance impact?
> 
> How can that be an issue for SCSI? The rerunning delay is only triggered
> when there aren't any in-flight requests in the SCSI queue.

That change will result in more scsi_queue_rq() calls and hence in higher
CPU usage. How much the CPU usage increases will depend on the CPU time
required by the LLD .queuecommand() callback, if that function gets invoked.

Bart.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-23 16:17       ` Bart Van Assche
@ 2018-01-23 16:26         ` Ming Lei
  2018-01-23 16:37           ` Bart Van Assche
  0 siblings, 1 reply; 23+ messages in thread
From: Ming Lei @ 2018-01-23 16:26 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch@infradead.org, linux-block@vger.kernel.org,
	snitzer@redhat.com, axboe@kernel.dk, dm-devel@redhat.com,
	jejb@linux.vnet.ibm.com, linux-scsi@vger.kernel.org,
	martin.petersen@oracle.com, loberman@redhat.com

On Tue, Jan 23, 2018 at 08:17:02AM -0800, Bart Van Assche wrote:
> 
> 
> On 01/22/18 16:57, Ming Lei wrote:
> > Even though RCU lock is held during dispatch, preemption or interrupt
> > can happen too, so it is simply wrong to depend on the timing to make
> > sure __blk_mq_run_hw_queue() can see the request in this situation.
> 
> It is very unlikely that this race will ever be hit because that race exists
> for less than one microsecond and the smallest timeout that can be specified
> for delayed queue rerunning is one millisecond. Let's address this race if
> anyone ever finds a way to hit it.

Please don't depend on timing, which is often fragile, when we can make it
correct with a generic approach. Also, we should avoid making every driver
follow this way of dealing with most STS_RESOURCE cases, right?

> 
> > > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > > index d9ca1dfab154..55be2550c555 100644
> > > > --- a/drivers/scsi/scsi_lib.c
> > > > +++ b/drivers/scsi/scsi_lib.c
> > > > @@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
> > > >   	case BLK_STS_OK:
> > > >   		break;
> > > >   	case BLK_STS_RESOURCE:
> > > > -		if (atomic_read(&sdev->device_busy) == 0 &&
> > > > -		    !scsi_device_blocked(sdev))
> > > > -			blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
> > > > +		if (atomic_read(&sdev->device_busy) ||
> > > > +		    scsi_device_blocked(sdev))
> > > > +			ret = BLK_STS_DEV_RESOURCE;
> > > >   		break;
> > > >   	default:
> > > >   		/*
> > > 
> > > The above introduces two changes that have not been mentioned in the
> > > description of this patch:
> > > - The queue rerunning delay is changed from 3 ms into 10 ms. Where is the
> > >    explanation of this change? Does this change have a positive or negative
> > >    performance impact?
> > 
> > How can that be an issue for SCSI? The rerunning delay is only triggered
> > when there aren't any in-flight requests in the SCSI queue.
> 
> That change will result in more scsi_queue_rq() calls and hence in higher
> CPU usage. By how much the CPU usage is increased will depend on the CPU
> time required by the LLD .queuecommand() callback if that function gets
> invoked.

No, this patch won't increase CPU usage for SCSI; the only change is to
move the blk_mq_delay_run_hw_queue() call out of SCSI's .queue_rq(), and
the delay becomes 10ms.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-23 16:26         ` Ming Lei
@ 2018-01-23 16:37           ` Bart Van Assche
  2018-01-23 16:41             ` Ming Lei
  0 siblings, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2018-01-23 16:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: hch@infradead.org, linux-block@vger.kernel.org,
	snitzer@redhat.com, axboe@kernel.dk, dm-devel@redhat.com,
	jejb@linux.vnet.ibm.com, linux-scsi@vger.kernel.org,
	martin.petersen@oracle.com, loberman@redhat.com

On 01/23/18 08:26, Ming Lei wrote:
> On Tue, Jan 23, 2018 at 08:17:02AM -0800, Bart Van Assche wrote:
>> On 01/22/18 16:57, Ming Lei wrote:
>>> Even though RCU lock is held during dispatch, preemption or interrupt
>>> can happen too, so it is simply wrong to depend on the timing to make
>>> sure __blk_mq_run_hw_queue() can see the request in this situation.
>>
>> It is very unlikely that this race will ever be hit because that race exists
>> for less than one microsecond and the smallest timeout that can be specified
>> for delayed queue rerunning is one millisecond. Let's address this race if
>> anyone ever finds a way to hit it.
> 
> Please don't depend on timing, which is often fragile, when we can make it
> correct with a generic approach. Also, we should avoid making every driver
> follow this way of dealing with most STS_RESOURCE cases, right?

Wouldn't it be better to fix that race without changing the block layer 
API, e.g. by using call_rcu() for delayed queue runs? As you know 
call_rcu() will only call the specified function after a grace period. 
Since pushing back requests onto the dispatch list happens with the RCU 
lock held, using call_rcu() for delayed queue runs would be sufficient to 
address this race.
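
For example, a rough sketch of that approach (made-up names, untested):

	struct rcu_delayed_run {
		struct rcu_head		head;
		struct blk_mq_hw_ctx	*hctx;
	};

	static void blk_mq_run_hw_queue_rcu(struct rcu_head *head)
	{
		struct rcu_delayed_run *dr =
			container_of(head, struct rcu_delayed_run, head);

		/* async run, so this is fine from softirq context */
		blk_mq_run_hw_queue(dr->hctx, true);
		kfree(dr);
	}

	/* instead of blk_mq_delay_run_hw_queue(hctx, 10): */
	dr = kmalloc(sizeof(*dr), GFP_ATOMIC);
	if (dr) {
		dr->hctx = hctx;
		call_rcu(&dr->head, blk_mq_run_hw_queue_rcu);
	}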

Bart.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-23 16:37           ` Bart Van Assche
@ 2018-01-23 16:41             ` Ming Lei
  2018-01-23 16:47               ` Bart Van Assche
  0 siblings, 1 reply; 23+ messages in thread
From: Ming Lei @ 2018-01-23 16:41 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch@infradead.org, linux-block@vger.kernel.org,
	snitzer@redhat.com, axboe@kernel.dk, dm-devel@redhat.com,
	jejb@linux.vnet.ibm.com, linux-scsi@vger.kernel.org,
	martin.petersen@oracle.com, loberman@redhat.com

On Tue, Jan 23, 2018 at 08:37:26AM -0800, Bart Van Assche wrote:
> On 01/23/18 08:26, Ming Lei wrote:
> > On Tue, Jan 23, 2018 at 08:17:02AM -0800, Bart Van Assche wrote:
> > > On 01/22/18 16:57, Ming Lei wrote:
> > > > Even though RCU lock is held during dispatch, preemption or interrupt
> > > > can happen too, so it is simply wrong to depend on the timing to make
> > > > sure __blk_mq_run_hw_queue() can see the request in this situation.
> > > 
> > > It is very unlikely that this race will ever be hit because that race exists
> > > for less than one microsecond and the smallest timeout that can be specified
> > > for delayed queue rerunning is one millisecond. Let's address this race if
> > > anyone ever finds a way to hit it.
> > 
> > Please don't depend on the timing, which is often fragile, as we can
> > make it correct with a generic approach. Also we should avoid making
> > every driver follow this way of dealing with most STS_RESOURCE cases, right?
> 
> Wouldn't it be better to fix that race without changing the block layer API,
> e.g. by using call_rcu() for delayed queue runs? As you know call_rcu() will

Could you explain where to call call_rcu()?  call_rcu() can't be used in
IO path at all.

> only call the specified function after a grace period. Since pushing back
> requests onto the dispatch list happens with the RCU lock held, using
> call_rcu() for delayed queue runs would be sufficient to address this race.


-- 
Ming

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-23 16:41             ` Ming Lei
@ 2018-01-23 16:47               ` Bart Van Assche
  2018-01-23 16:49                 ` Ming Lei
  0 siblings, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2018-01-23 16:47 UTC (permalink / raw)
  To: ming.lei@redhat.com
  Cc: linux-block@vger.kernel.org, hch@infradead.org,
	snitzer@redhat.com, martin.petersen@oracle.com, axboe@kernel.dk,
	linux-scsi@vger.kernel.org, jejb@linux.vnet.ibm.com,
	loberman@redhat.com, dm-devel@redhat.com

On Wed, 2018-01-24 at 00:41 +0800, Ming Lei wrote:
> Could you explain where to call call_rcu()?  call_rcu() can't be used in
> IO path at all.

Can you explain what makes you think that call_rcu() can't be used in the
I/O path? As you know call_rcu() invokes a function asynchronously. From
Documentation/RCU/Design/Requirements/Requirements.html:

  The <tt>call_rcu()</tt> function may be used in a number of
  situations where neither <tt>synchronize_rcu()</tt> nor
  <tt>synchronize_rcu_expedited()</tt> would be legal,
  including within preempt-disable code, <tt>local_bh_disable()</tt> code,
  interrupt-disable code, and interrupt handlers.

Bart.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-23 16:47               ` Bart Van Assche
@ 2018-01-23 16:49                 ` Ming Lei
  2018-01-23 16:54                   ` Bart Van Assche
  0 siblings, 1 reply; 23+ messages in thread
From: Ming Lei @ 2018-01-23 16:49 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block@vger.kernel.org, hch@infradead.org,
	snitzer@redhat.com, martin.petersen@oracle.com, axboe@kernel.dk,
	linux-scsi@vger.kernel.org, jejb@linux.vnet.ibm.com,
	loberman@redhat.com, dm-devel@redhat.com

On Tue, Jan 23, 2018 at 04:47:11PM +0000, Bart Van Assche wrote:
> On Wed, 2018-01-24 at 00:41 +0800, Ming Lei wrote:
> > Could you explain where to call call_rcu()?  call_rcu() can't be used in
> > IO path at all.
> 
> Can you explain what makes you think that call_rcu() can't be used in the
> I/O path? As you know call_rcu() invokes a function asynchronously. From
> Documentation/RCU/Design/Requirements/Requirements.html:
> 
>   The <tt>call_rcu()</tt> function may be used in a number of
>   situations where neither <tt>synchronize_rcu()</tt> nor
>   <tt>synchronize_rcu_expedited()</tt> would be legal,
>   including within preempt-disable code, <tt>local_bh_disable()</tt> code,
>   interrupt-disable code, and interrupt handlers.

OK, supposing that is true, do you want to convert most of the
STS_RESOURCE returns in all drivers to work this way? Why can't we use
the generic and simple approach in this patch?

-- 
Ming

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-23 16:49                 ` Ming Lei
@ 2018-01-23 16:54                   ` Bart Van Assche
  2018-01-23 16:59                     ` Ming Lei
  0 siblings, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2018-01-23 16:54 UTC (permalink / raw)
  To: ming.lei@redhat.com
  Cc: linux-block@vger.kernel.org, hch@infradead.org,
	snitzer@redhat.com, martin.petersen@oracle.com, axboe@kernel.dk,
	linux-scsi@vger.kernel.org, jejb@linux.vnet.ibm.com,
	loberman@redhat.com, dm-devel@redhat.com

On Wed, 2018-01-24 at 00:49 +0800, Ming Lei wrote:
> On Tue, Jan 23, 2018 at 04:47:11PM +0000, Bart Van Assche wrote:
> > On Wed, 2018-01-24 at 00:41 +0800, Ming Lei wrote:
> > > Could you explain where to call call_rcu()?  call_rcu() can't be used in
> > > IO path at all.
> > 
> > Can you explain what makes you think that call_rcu() can't be used in the
> > I/O path? As you know call_rcu() invokes a function asynchronously. From
> > Documentation/RCU/Design/Requirements/Requirements.html:
> > 
> >   The <tt>call_rcu()</tt> function may be used in a number of
> >   situations where neither <tt>synchronize_rcu()</tt> nor
> >   <tt>synchronize_rcu_expedited()</tt> would be legal,
> >   including within preempt-disable code, <tt>local_bh_disable()</tt> code,
> >   interrupt-disable code, and interrupt handlers.
> 
> > OK, supposing that is true, do you want to convert most of the
> > STS_RESOURCE returns in all drivers to work this way? Why can't we use
> > the generic and simple approach in this patch?

Please reread my proposal. I did not propose to change any block drivers.
What I proposed is to change the blk_mq_delay_run_hw_queue() implementation
such that it uses call_rcu() instead of kblockd_mod_delayed_work_on().
That's a generic and simple approach.

Bart.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-23 16:54                   ` Bart Van Assche
@ 2018-01-23 16:59                     ` Ming Lei
  2018-01-23 22:01                       ` Bart Van Assche
  0 siblings, 1 reply; 23+ messages in thread
From: Ming Lei @ 2018-01-23 16:59 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe@kernel.dk, hch@infradead.org, jejb@linux.vnet.ibm.com,
	snitzer@redhat.com, martin.petersen@oracle.com,
	linux-block@vger.kernel.org, dm-devel@redhat.com,
	linux-scsi@vger.kernel.org, loberman@redhat.com

On Tue, Jan 23, 2018 at 04:54:02PM +0000, Bart Van Assche wrote:
> On Wed, 2018-01-24 at 00:49 +0800, Ming Lei wrote:
> > On Tue, Jan 23, 2018 at 04:47:11PM +0000, Bart Van Assche wrote:
> > > On Wed, 2018-01-24 at 00:41 +0800, Ming Lei wrote:
> > > > Could you explain where to call call_rcu()?  call_rcu() can't be used in
> > > > IO path at all.
> > > 
> > > Can you explain what makes you think that call_rcu() can't be used in the
> > > I/O path? As you know call_rcu() invokes a function asynchronously. From
> > > Documentation/RCU/Design/Requirements/Requirements.html:
> > > 
> > >   The <tt>call_rcu()</tt> function may be used in a number of
> > >   situations where neither <tt>synchronize_rcu()</tt> nor
> > >   <tt>synchronize_rcu_expedited()</tt> would be legal,
> > >   including within preempt-disable code, <tt>local_bh_disable()</tt> code,
> > >   interrupt-disable code, and interrupt handlers.
> > 
> > OK, supposing that is true, do you want to convert most of the
> > STS_RESOURCE returns in all drivers to work this way? Why can't we use
> > the generic and simple approach in this patch?
> 
> Please reread my proposal. I did not propose to change any block drivers.
> What I proposed is to change the blk_mq_delay_run_hw_queue() implementation
> such that it uses call_rcu() instead of kblockd_mod_delayed_work_on().
> That's a generic and simple approach.

How is that enough to fix the IO hang when the driver returns STS_RESOURCE
and the queue is idle? If you want to follow the old dm-rq way of calling
blk_mq_delay_run_hw_queue() in .queue_rq(), the same trick needs to be
applied to other drivers too, right?

Unfortunately most STS_RESOURCE users don't apply this trick, but they
still need to be handled.

The 'blk-mq: introduce BLK_STS_DEV_RESOURCE' patch can fix all of these
cases.
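
For reference, without the new status every such driver would need the
dm-rq style workaround in its .queue_rq(), roughly like this (hypothetical
driver, the foo_* names are made up):

static blk_status_t foo_queue_rq(struct blk_mq_hw_ctx *hctx,
				 const struct blk_mq_queue_data *bd)
{
	struct foo_dev *foo = hctx->queue->queuedata;

	if (!foo_get_resource(foo)) {
		/* Only rearm the queue when no in-flight IO will rerun it
		 * for us, same as dm-rq does today with a 100ms delay.
		 */
		if (atomic_read(&foo->inflight) == 0)
			blk_mq_delay_run_hw_queue(hctx, 100 /*ms*/);
		return BLK_STS_RESOURCE;
	}

	blk_mq_start_request(bd->rq);
	foo_issue(foo, bd->rq);
	return BLK_STS_OK;
}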

-- 
Ming

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-23 16:59                     ` Ming Lei
@ 2018-01-23 22:01                       ` Bart Van Assche
  2018-01-24  2:31                         ` Ming Lei
  0 siblings, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2018-01-23 22:01 UTC (permalink / raw)
  To: ming.lei@redhat.com
  Cc: linux-block@vger.kernel.org, hch@infradead.org,
	snitzer@redhat.com, martin.petersen@oracle.com, axboe@kernel.dk,
	linux-scsi@vger.kernel.org, jejb@linux.vnet.ibm.com,
	loberman@redhat.com, dm-devel@redhat.com

On Wed, 2018-01-24 at 00:59 +0800, Ming Lei wrote:
> How is that enough to fix the IO hang when the driver returns STS_RESOURCE
> and the queue is idle? If you want to follow the old dm-rq way of calling
> blk_mq_delay_run_hw_queue() in .queue_rq(), the same trick needs to be
> applied to other drivers too, right?
> 
> Unfortunately most STS_RESOURCE users don't apply this trick, but they
> still need to be handled.
> 
> The 'blk-mq: introduce BLK_STS_DEV_RESOURCE' patch can fix all of these
> cases.

The goal of my proposal was to address the race between running the queue and
adding requests back to the dispatch queue only. Regarding the I/O hangs that
can occur if a block driver returns BLK_STS_RESOURCE: since all of these can
be addressed by inserting blk_mq_delay_run_hw_queue() calls in the affected
block drivers, I prefer to modify the block drivers instead of making the
blk-mq core even more complicated.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE
  2018-01-23 22:01                       ` Bart Van Assche
@ 2018-01-24  2:31                         ` Ming Lei
  0 siblings, 0 replies; 23+ messages in thread
From: Ming Lei @ 2018-01-24  2:31 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block@vger.kernel.org, hch@infradead.org,
	snitzer@redhat.com, martin.petersen@oracle.com, axboe@kernel.dk,
	linux-scsi@vger.kernel.org, jejb@linux.vnet.ibm.com,
	loberman@redhat.com, dm-devel@redhat.com

On Tue, Jan 23, 2018 at 10:01:37PM +0000, Bart Van Assche wrote:
> On Wed, 2018-01-24 at 00:59 +0800, Ming Lei wrote:
> > How is that enough to fix the IO hang when the driver returns STS_RESOURCE
> > and the queue is idle? If you want to follow the old dm-rq way of calling
> > blk_mq_delay_run_hw_queue() in .queue_rq(), the same trick needs to be
> > applied to other drivers too, right?
> > 
> > Unfortunately most STS_RESOURCE users don't apply this trick, but they
> > still need to be handled.
> > 
> > The 'blk-mq: introduce BLK_STS_DEV_RESOURCE' patch can fix all of these
> > cases.
> 
> The goal of my proposal was to address the race between running the queue and
> adding requests back to the dispatch queue only. Regarding the I/O hangs that
> can occur if a block driver returns BLK_STS_RESOURCE: since all of these can
> be addressed by inserting blk_mq_delay_run_hw_queue() calls in the affected
> block drivers, I prefer to modify the block drivers instead of making the
> blk-mq core even more complicated.

IMO, this change doesn't make the blk-mq code more complicated, and it is
well documented; see the change in the block layer:

 block/blk-core.c             |  1 +
 block/blk-mq.c               | 19 +++++++++++++++----
 include/linux/blk_types.h    | 18 ++++++++++++++++++

Also, 21 of those lines are comments, so only 17 lines of actual code
change are needed in the block layer.
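
The core of it is just this at the end of blk_mq_dispatch_rq_list(), after
the left-over requests have been spliced back to hctx->dispatch
(paraphrased, not the exact hunk; needs_restart, no_tag and ret are locals
of that function):

	/*
	 * If SCHED_RESTART isn't set (or we failed to get a tag), rerun the
	 * queue ourselves.  If it is set, a completion normally reruns the
	 * queue, but for plain BLK_STS_RESOURCE there may be no IO in flight
	 * to complete, so also schedule a delayed run (10ms) to avoid the IO
	 * hang.  BLK_STS_DEV_RESOURCE means the driver guarantees a future
	 * rerun, so no delay is needed.
	 */
	needs_restart = blk_mq_sched_needs_restart(hctx);
	if (!needs_restart ||
	    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
		blk_mq_run_hw_queue(hctx, true);
	else if (needs_restart && ret == BLK_STS_RESOURCE)
		blk_mq_delay_run_hw_queue(hctx, 10 /*ms*/);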

If blk_mq_delay_run_hw_queue() is inserted into each driver instead, the
change can be a bit complicated: since call_rcu() has to be used, you need
to figure out a way to provide the callback and its parameter. You even
have to document the change in each driver.

[ming@ming linux]$ git grep -n BLK_STS_RESOURCE drivers/ | wc -l
42

There are at least 42 uses of BLK_STS_RESOURCE in drivers; in theory a
call_rcu()-deferred blk_mq_delay_run_hw_queue() would have to be inserted
at every one of those BLK_STS_RESOURCE returns.

I'll leave the decision to Jens and the driver maintainers.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2018-01-24  2:31 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-22  3:35 [PATCH 0/5] blk-mq & dm: fix IO hang and deal with one performance issue Ming Lei
2018-01-22  3:35 ` [PATCH 1/5] blk-mq: introduce BLK_STS_DEV_RESOURCE Ming Lei
2018-01-22 16:32   ` Christoph Hellwig
2018-01-22 16:49   ` Bart Van Assche
2018-01-23  0:57     ` Ming Lei
2018-01-23 16:17       ` Bart Van Assche
2018-01-23 16:26         ` Ming Lei
2018-01-23 16:37           ` Bart Van Assche
2018-01-23 16:41             ` Ming Lei
2018-01-23 16:47               ` Bart Van Assche
2018-01-23 16:49                 ` Ming Lei
2018-01-23 16:54                   ` Bart Van Assche
2018-01-23 16:59                     ` Ming Lei
2018-01-23 22:01                       ` Bart Van Assche
2018-01-24  2:31                         ` Ming Lei
2018-01-22  3:35 ` [PATCH 2/5] dm-rq: handle dispatch exception in dm_dispatch_clone_request() Ming Lei
2018-01-22  3:35 ` [PATCH 3/5] dm-rq: return BLK_STS_* from map_request() Ming Lei
2018-01-22  5:35   ` Ming Lei
2018-01-22  3:35 ` [PATCH 4/5] blk-mq: introduce blk_get_request_notify Ming Lei
2018-01-22 10:19   ` Ming Lei
2018-01-22 17:13   ` Bart Van Assche
2018-01-23  1:29     ` Ming Lei
2018-01-22  3:35 ` [PATCH 5/5] dm-mpath: use blk_mq_alloc_request_notify for allocating blk-mq req Ming Lei
