* [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
@ 2018-01-30 14:24 Mike Snitzer
2018-01-30 17:52 ` [dm-devel] " Bart Van Assche
2018-01-31 2:44 ` Jens Axboe
0 siblings, 2 replies; 17+ messages in thread
From: Mike Snitzer @ 2018-01-30 14:24 UTC (permalink / raw)
To: axboe; +Cc: linux-block, dm-devel, linux-scsi, linux-nvme
From: Ming Lei <ming.lei@redhat.com>
This status is returned from the driver to the block layer if a
device-related resource is unavailable, but the driver can guarantee
that IO dispatch will be triggered in the future when the resource
becomes available.
Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if the driver
returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun the queue after
a delay (BLK_MQ_QUEUE_DELAY) to avoid IO stalls. BLK_MQ_QUEUE_DELAY is
3 ms because both scsi-mq and nvmefc use that magic value.
If a driver can make sure there is in-flight IO, it is safe to return
BLK_STS_DEV_RESOURCE because:
1) If all in-flight IOs complete before examining SCHED_RESTART in
blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
is run immediately in this case by blk_mq_dispatch_rq_list();
2) if there is any in-flight IO after/when examining SCHED_RESTART
in blk_mq_dispatch_rq_list():
- if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
- otherwise, this request will be dispatched after any in-flight IO is
completed via blk_mq_sched_restart()
3) if SCHED_RESTART is set concurrently in another context because of
BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
cases and make sure an IO hang is avoided.
One invariant is that queue will be rerun if SCHED_RESTART is set.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
V5:
- fixed stale reference to 10ms delay in blk-mq.c comment
- revised commit header to include Ming's proof of how
blk-mq drivers will make progress (serves to document how
scsi_queue_rq now works)
V4:
- cleanup header and code comments
- rerun queue after BLK_MQ_QUEUE_DELAY (3ms) instead of 10ms
- eliminate nvmefc's queue rerun now that blk-mq does it
V3:
- fix typo and improve documentation
- add tested-by tag
V2:
- add comments on the new introduced status
- patch style fix
- both are suggested by Christoph
block/blk-core.c | 1 +
block/blk-mq.c | 20 ++++++++++++++++----
drivers/block/null_blk.c | 2 +-
drivers/block/virtio_blk.c | 2 +-
drivers/block/xen-blkfront.c | 2 +-
drivers/md/dm-rq.c | 5 ++---
drivers/nvme/host/fc.c | 12 ++----------
drivers/scsi/scsi_lib.c | 6 +++---
include/linux/blk_types.h | 17 +++++++++++++++++
9 files changed, 44 insertions(+), 23 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index cdae69be68e9..38279d4ae08b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -145,6 +145,7 @@ static const struct {
[BLK_STS_MEDIUM] = { -ENODATA, "critical medium" },
[BLK_STS_PROTECTION] = { -EILSEQ, "protection" },
[BLK_STS_RESOURCE] = { -ENOMEM, "kernel resource" },
+ [BLK_STS_DEV_RESOURCE] = { -ENOMEM, "device resource" },
[BLK_STS_AGAIN] = { -EAGAIN, "nonblocking retry" },
/* device mapper special case, should not leak out: */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 43e7449723e0..e39b4e2a63db 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1160,6 +1160,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
return true;
}
+#define BLK_MQ_QUEUE_DELAY 3 /* ms units */
+
bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
bool got_budget)
{
@@ -1167,6 +1169,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
struct request *rq, *nxt;
bool no_tag = false;
int errors, queued;
+ blk_status_t ret = BLK_STS_OK;
if (list_empty(list))
return false;
@@ -1179,7 +1182,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
errors = queued = 0;
do {
struct blk_mq_queue_data bd;
- blk_status_t ret;
rq = list_first_entry(list, struct request, queuelist);
if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
@@ -1224,7 +1226,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
}
ret = q->mq_ops->queue_rq(hctx, &bd);
- if (ret == BLK_STS_RESOURCE) {
+ if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
/*
* If an I/O scheduler has been configured and we got a
* driver tag for the next request already, free it
@@ -1255,6 +1257,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
* that is where we will continue on next queue run.
*/
if (!list_empty(list)) {
+ bool needs_restart;
+
spin_lock(&hctx->lock);
list_splice_init(list, &hctx->dispatch);
spin_unlock(&hctx->lock);
@@ -1278,10 +1282,17 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
* - Some but not all block drivers stop a queue before
* returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
* and dm-rq.
+ *
+ * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
+ * bit is set, run queue after a delay to avoid IO stalls
+ * that could otherwise occur if the queue is idle.
*/
- if (!blk_mq_sched_needs_restart(hctx) ||
+ needs_restart = blk_mq_sched_needs_restart(hctx);
+ if (!needs_restart ||
(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
blk_mq_run_hw_queue(hctx, true);
+ else if (needs_restart && (ret == BLK_STS_RESOURCE))
+ blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
}
return (queued + errors) != 0;
@@ -1762,6 +1773,7 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
*cookie = new_cookie;
break;
case BLK_STS_RESOURCE:
+ case BLK_STS_DEV_RESOURCE:
__blk_mq_requeue_request(rq);
break;
default:
@@ -1824,7 +1836,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
hctx_lock(hctx, &srcu_idx);
ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
- if (ret == BLK_STS_RESOURCE)
+ if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
blk_mq_sched_insert_request(rq, false, true, false);
else if (ret != BLK_STS_OK)
blk_mq_end_request(rq, ret);
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 5b94e530570c..4bc25fc4e73c 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -1230,7 +1230,7 @@ static blk_status_t null_handle_cmd(struct nullb_cmd *cmd)
return BLK_STS_OK;
} else
/* requeue request */
- return BLK_STS_RESOURCE;
+ return BLK_STS_DEV_RESOURCE;
}
}
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 68846897d213..79908e6ddbf2 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -276,7 +276,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
/* Out of mem doesn't actually happen, since we fall back
* to direct descriptors */
if (err == -ENOMEM || err == -ENOSPC)
- return BLK_STS_RESOURCE;
+ return BLK_STS_DEV_RESOURCE;
return BLK_STS_IOERR;
}
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 891265acb10e..e126e4cac2ca 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -911,7 +911,7 @@ static blk_status_t blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
out_busy:
blk_mq_stop_hw_queue(hctx);
spin_unlock_irqrestore(&rinfo->ring_lock, flags);
- return BLK_STS_RESOURCE;
+ return BLK_STS_DEV_RESOURCE;
}
static void blkif_complete_rq(struct request *rq)
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index b7d175e94a02..348a0cb6963a 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -404,7 +404,7 @@ static blk_status_t dm_dispatch_clone_request(struct request *clone, struct requ
clone->start_time = jiffies;
r = blk_insert_cloned_request(clone->q, clone);
- if (r != BLK_STS_OK && r != BLK_STS_RESOURCE)
+ if (r != BLK_STS_OK && r != BLK_STS_RESOURCE && r != BLK_STS_DEV_RESOURCE)
/* must complete clone in terms of original request */
dm_complete_request(rq, r);
return r;
@@ -496,7 +496,7 @@ static int map_request(struct dm_rq_target_io *tio)
trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
blk_rq_pos(rq));
ret = dm_dispatch_clone_request(clone, rq);
- if (ret == BLK_STS_RESOURCE) {
+ if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
blk_rq_unprep_clone(clone);
tio->ti->type->release_clone_rq(clone);
tio->clone = NULL;
@@ -769,7 +769,6 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
/* Undo dm_start_request() before requeuing */
rq_end_stats(md, rq);
rq_completed(md, rq_data_dir(rq), false);
- blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
return BLK_STS_RESOURCE;
}
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index b76ba4629e02..54e679541ad6 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -35,8 +35,6 @@ enum nvme_fc_queue_flags {
NVME_FC_Q_LIVE,
};
-#define NVMEFC_QUEUE_DELAY 3 /* ms units */
-
#define NVME_FC_DEFAULT_DEV_LOSS_TMO 60 /* seconds */
struct nvme_fc_queue {
@@ -2231,7 +2229,7 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
* the target device is present
*/
if (ctrl->rport->remoteport.port_state != FC_OBJSTATE_ONLINE)
- goto busy;
+ return BLK_STS_RESOURCE;
if (!nvme_fc_ctrl_get(ctrl))
return BLK_STS_IOERR;
@@ -2311,16 +2309,10 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
ret != -EBUSY)
return BLK_STS_IOERR;
- goto busy;
+ return BLK_STS_RESOURCE;
}
return BLK_STS_OK;
-
-busy:
- if (!(op->flags & FCOP_FLAGS_AEN) && queue->hctx)
- blk_mq_delay_run_hw_queue(queue->hctx, NVMEFC_QUEUE_DELAY);
-
- return BLK_STS_RESOURCE;
}
static inline blk_status_t nvme_fc_is_ready(struct nvme_fc_queue *queue,
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index d9ca1dfab154..55be2550c555 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
case BLK_STS_OK:
break;
case BLK_STS_RESOURCE:
- if (atomic_read(&sdev->device_busy) == 0 &&
- !scsi_device_blocked(sdev))
- blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
+ if (atomic_read(&sdev->device_busy) ||
+ scsi_device_blocked(sdev))
+ ret = BLK_STS_DEV_RESOURCE;
break;
default:
/*
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 2d973ac54b09..f41d2057215f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -39,6 +39,23 @@ typedef u8 __bitwise blk_status_t;
#define BLK_STS_AGAIN ((__force blk_status_t)12)
+/*
+ * BLK_STS_DEV_RESOURCE is returned from driver to block layer if device
+ * related resource is unavailable, but driver can guarantee that queue
+ * will be rerun in the future once the resource is available (and
+ * requests will then be dispatched).
+ *
+ * To safely return BLK_STS_DEV_RESOURCE, and allow forward progress, a
+ * driver just needs to make sure there is in-flight IO.
+ *
+ * Difference with BLK_STS_RESOURCE:
+ * If the driver isn't sure whether the queue will be rerun once the
+ * device resource is made available, return BLK_STS_RESOURCE instead.
+ * For example: when memory allocation, DMA mapping, or some other
+ * system resource allocation fails and IO can't be submitted to the
+ * device.
+ */
+#define BLK_STS_DEV_RESOURCE ((__force blk_status_t)13)
+
/**
* blk_path_error - returns true if error may be path related
* @error: status the request was completed with
--
2.15.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
2018-01-30 14:24 [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE Mike Snitzer
@ 2018-01-30 17:52 ` Bart Van Assche
2018-01-30 18:38 ` Laurence Oberman
` (3 more replies)
2018-01-31 2:44 ` Jens Axboe
1 sibling, 4 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-01-30 17:52 UTC (permalink / raw)
To: Mike Snitzer, axboe; +Cc: linux-block, dm-devel, linux-nvme, linux-scsi
On 01/30/18 06:24, Mike Snitzer wrote:
> + *
> + * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
> + * bit is set, run queue after a delay to avoid IO stalls
> + * that could otherwise occur if the queue is idle.
> */
> - if (!blk_mq_sched_needs_restart(hctx) ||
> + needs_restart = blk_mq_sched_needs_restart(hctx);
> + if (!needs_restart ||
> (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> blk_mq_run_hw_queue(hctx, true);
> + else if (needs_restart && (ret == BLK_STS_RESOURCE))
> + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
> }
If a request completes concurrently with execution of the above code
then the request completion will trigger a call of
blk_mq_sched_restart_hctx() and that call will clear the
BLK_MQ_S_SCHED_RESTART bit. If that bit is cleared before the above code
tests it then the above code will schedule an asynchronous call of
__blk_mq_run_hw_queue(). If the .queue_rq() call triggered by the new
queue run returns again BLK_STS_RESOURCE then the above code will be
executed again. In other words, a loop occurs. That loop will repeat as
long as the described race occurs. The current (kernel v4.15) block
layer behavior is simpler: only block drivers call
blk_mq_delay_run_hw_queue() and the block layer core never calls that
function. Hence that loop cannot occur with the v4.15 block layer core
and block drivers. A motivation of why that loop is preferred compared
to the current behavior (no loop) is missing. Does this mean that that
loop is a needless complication of the block layer core?
Sorry but I still prefer the v4.15 block layer approach because this
patch has in my view the following disadvantages:
- It involves a blk-mq API change. API changes are always risky and need
some time to stabilize.
- The delay after which to rerun the queue is moved from block layer
drivers into the block layer core. I think that's wrong because only
the block driver authors can make a good choice for this constant.
- This patch makes block drivers harder to understand. Anyone who sees
return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the first
time will have to look up the meaning of these constants. An explicit
blk_mq_delay_run_hw_queue() call is easier to understand.
- This patch makes the blk-mq core harder to understand because of the
loop mentioned above.
- This patch does not fix any bugs nor makes block drivers easier to
read or to implement. So why is this patch considered useful?
Thanks,
Bart.
* Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
2018-01-30 17:52 ` [dm-devel] " Bart Van Assche
@ 2018-01-30 18:38 ` Laurence Oberman
2018-01-30 19:33 ` Mike Snitzer
` (2 subsequent siblings)
3 siblings, 0 replies; 17+ messages in thread
From: Laurence Oberman @ 2018-01-30 18:38 UTC (permalink / raw)
To: Bart Van Assche, Mike Snitzer, axboe
Cc: linux-block, dm-devel, linux-nvme, linux-scsi
On Tue, 2018-01-30 at 09:52 -0800, Bart Van Assche wrote:
> On 01/30/18 06:24, Mike Snitzer wrote:
> > + *
> > + * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
> > + * bit is set, run queue after a delay to avoid IO stalls
> > + * that could otherwise occur if the queue is idle.
> > */
> > - if (!blk_mq_sched_needs_restart(hctx) ||
> > + needs_restart = blk_mq_sched_needs_restart(hctx);
> > + if (!needs_restart ||
> > (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> > blk_mq_run_hw_queue(hctx, true);
> > + else if (needs_restart && (ret == BLK_STS_RESOURCE))
> > + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
> > }
>
> If a request completes concurrently with execution of the above code
> then the request completion will trigger a call of
> blk_mq_sched_restart_hctx() and that call will clear the
> BLK_MQ_S_SCHED_RESTART bit. If that bit is cleared before the above code
> tests it then the above code will schedule an asynchronous call of
> __blk_mq_run_hw_queue(). If the .queue_rq() call triggered by the new
> queue run returns again BLK_STS_RESOURCE then the above code will be
> executed again. In other words, a loop occurs. That loop will repeat as
> long as the described race occurs. The current (kernel v4.15) block
> layer behavior is simpler: only block drivers call
> blk_mq_delay_run_hw_queue() and the block layer core never calls that
> function. Hence that loop cannot occur with the v4.15 block layer core
> and block drivers. A motivation of why that loop is preferred compared
> to the current behavior (no loop) is missing. Does this mean that that
> loop is a needless complication of the block layer core?
>
> Sorry but I still prefer the v4.15 block layer approach because this
> patch has in my view the following disadvantages:
> - It involves a blk-mq API change. API changes are always risky and need
>   some time to stabilize.
> - The delay after which to rerun the queue is moved from block layer
>   drivers into the block layer core. I think that's wrong because only
>   the block driver authors can make a good choice for this constant.
> - This patch makes block drivers harder to understand. Anyone who sees
>   return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the first
>   time will have to look up the meaning of these constants. An explicit
>   blk_mq_delay_run_hw_queue() call is easier to understand.
> - This patch makes the blk-mq core harder to understand because of the
>   loop mentioned above.
> - This patch does not fix any bugs nor makes block drivers easier to
>   read or to implement. So why is this patch considered useful?
>
> Thanks,
>
> Bart.
Hello Bart

What is the performance implication of your method versus the patch
above? Is there more of a delay in your approach or in Ming's approach,
based on your own testing?

I guess it seems we will never get consensus on this, so is it time to
take a vote?

I respect and trust your inputs, you know that, but are you perhaps
prepared to accept the approach above and agree to it, and if it turns
out to expose more issues they can be addressed later?

Is that not a better way to progress this? Because to me it looks like
you and Ming will continue to disagree on which is the better approach.

With much respect,
Laurence
* Re: [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
2018-01-30 17:52 ` [dm-devel] " Bart Van Assche
2018-01-30 18:38 ` Laurence Oberman
@ 2018-01-30 19:33 ` Mike Snitzer
2018-01-30 19:42 ` Bart Van Assche
2018-01-31 2:14 ` [dm-devel] " Ming Lei
2018-01-31 3:17 ` Jens Axboe
3 siblings, 1 reply; 17+ messages in thread
From: Mike Snitzer @ 2018-01-30 19:33 UTC (permalink / raw)
To: Bart Van Assche; +Cc: axboe, linux-block, dm-devel, linux-nvme, linux-scsi
On Tue, Jan 30 2018 at 12:52pm -0500,
Bart Van Assche <bart.vanassche@wdc.com> wrote:
> On 01/30/18 06:24, Mike Snitzer wrote:
> >+ *
> >+ * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
> >+ * bit is set, run queue after a delay to avoid IO stalls
> >+ * that could otherwise occur if the queue is idle.
> > */
> >- if (!blk_mq_sched_needs_restart(hctx) ||
> >+ needs_restart = blk_mq_sched_needs_restart(hctx);
> >+ if (!needs_restart ||
> > (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> > blk_mq_run_hw_queue(hctx, true);
> >+ else if (needs_restart && (ret == BLK_STS_RESOURCE))
> >+ blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
> > }
>
> If a request completes concurrently with execution of the above code
> then the request completion will trigger a call of
> blk_mq_sched_restart_hctx() and that call will clear the
> BLK_MQ_S_SCHED_RESTART bit. If that bit is cleared before the above
> code tests it then the above code will schedule an asynchronous call
> of __blk_mq_run_hw_queue(). If the .queue_rq() call triggered by the
> new queue run returns again BLK_STS_RESOURCE then the above code
> will be executed again. In other words, a loop occurs. That loop
> will repeat as long as the described race occurs. The current
> (kernel v4.15) block layer behavior is simpler: only block drivers
> call blk_mq_delay_run_hw_queue() and the block layer core never
> calls that function. Hence that loop cannot occur with the v4.15
> block layer core and block drivers. A motivation of why that loop is
> preferred compared to the current behavior (no loop) is missing.
> Does this mean that that loop is a needless complication of the
> block layer core?
No it means the loop is an internal blk-mq concern. And that drivers
needn't worry about kicking the queue. And it affords blk-mq latitude
to change how it responds to BLK_STS_RESOURCE in the future (without
needing to change every driver).
But even v4.15 has a similar loop. It just happens to extend into the
.queue_rq() where the driver is completely blind to SCHED_RESTART. And
the driver may just repeatedly kick the queue after a delay (via
blk_mq_delay_run_hw_queue).
This cycle should be a very rare occurrence regardless of which approach
is taken (V5 vs 4.15). The fact that you engineered your SRP initiator
and target code to pathologically trigger this worst case (via
target_can_queue) doesn't mean it is the fast path for a properly
configured and functioning storage network.
> Sorry but I still prefer the v4.15 block layer approach because this
> patch has in my view the following disadvantages:
> - It involves a blk-mq API change. API changes are always risky and need
> some time to stabilize.
The number of blk-mq API changes that have occurred since blk-mq was
introduced is a very long list. Seems contrived to make this the one
that is breaking the camel's back.
> - The delay after which to rerun the queue is moved from block layer
> drivers into the block layer core. I think that's wrong because only
> the block driver authors can make a good choice for this constant.
Unsubstantiated. 3ms (scsi-mq, nvmefc) vs 100ms (dm-mq mpath): which is
better? Pretty sure if the underlying storage network is 1) performant
2) properly configured -- then a shorter delay is preferable.
> - This patch makes block drivers harder to understand. Anyone who sees
> return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the first
> time will have to look up the meaning of these constants. An explicit
> blk_mq_delay_run_hw_queue() call is easier to understand.
No it doesn't make blk-mq harder to understand. But even if it did it
actually acknowledges that there is existing blk-mq SCHED_RESTART
heuristic for how to deal with the need for blk-mq to back-off in the
face of BLK_STS_RESOURCE returns. By just having each blk-mq driver
blindly kick the queue you're actively ignoring, and defeating, that
entire design element of blk-mq (SCHED_RESTART).
It is an instance where blk-mq can and does know better. Acknowledging
that fact is moving blk-mq in a better direction.
> - This patch makes the blk-mq core harder to understand because of the
> loop mentioned above.
You've said your piece. But you've taken on this campaign to undermine
this line of development with such passion that we're now in a place
where Jens is shell-shocked by all the repeat campaigning and noise.
Bart you keep saying the same things over and over. Yet you cannot show
the patch to actively be a problem with testing-based detail.
Seems you'd rather refuse to even test it than open yourself up to the
possibility that this concern of yours has been making a mountain out of
a molehill.
> - This patch does not fix any bugs nor makes block drivers easier to
> read or to implement. So why is this patch considered useful?
It enables blk-mq core to own the problem that individual drivers should
have no business needing to worry about. Period.
* Re: [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
2018-01-30 19:33 ` Mike Snitzer
@ 2018-01-30 19:42 ` Bart Van Assche
2018-01-30 20:12 ` Mike Snitzer
0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2018-01-30 19:42 UTC (permalink / raw)
To: snitzer@redhat.com
Cc: dm-devel@redhat.com, linux-scsi@vger.kernel.org,
linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
axboe@kernel.dk
On Tue, 2018-01-30 at 14:33 -0500, Mike Snitzer wrote:
> On Tue, Jan 30 2018 at 12:52pm -0500,
> Bart Van Assche <bart.vanassche@wdc.com> wrote:
>
> > - This patch does not fix any bugs nor makes block drivers easier to
> >   read or to implement. So why is this patch considered useful?
>
> It enables blk-mq core to own the problem that individual drivers should
> have no business needing to worry about.  Period.

Thanks for having confirmed that this patch is an API-change only and does
not fix any bugs.

Bart.
* Re: [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
2018-01-30 19:42 ` Bart Van Assche
@ 2018-01-30 20:12 ` Mike Snitzer
0 siblings, 0 replies; 17+ messages in thread
From: Mike Snitzer @ 2018-01-30 20:12 UTC (permalink / raw)
To: Bart Van Assche
Cc: linux-block@vger.kernel.org, axboe@kernel.dk, dm-devel@redhat.com,
linux-nvme@lists.infradead.org, linux-scsi@vger.kernel.org
On Tue, Jan 30 2018 at 2:42pm -0500,
Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> On Tue, 2018-01-30 at 14:33 -0500, Mike Snitzer wrote:
> > On Tue, Jan 30 2018 at 12:52pm -0500,
> > Bart Van Assche <bart.vanassche@wdc.com> wrote:
> >
> > > - This patch does not fix any bugs nor makes block drivers easier to
> > > read or to implement. So why is this patch considered useful?
> >
> > It enables blk-mq core to own the problem that individual drivers should
> > have no business needing to worry about. Period.
>
> Thanks for having confirmed that this patch is an API-change only and does
> not fix any bugs.
No, it is an API change that enables blk-mq drivers to make forward
progress without compromising the potential benefits associated with
blk-mq's SCHED_RESTART functionality.
(If we're going to beat this horse to death it might as well be with
precision)
* Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
2018-01-30 17:52 ` [dm-devel] " Bart Van Assche
2018-01-30 18:38 ` Laurence Oberman
2018-01-30 19:33 ` Mike Snitzer
@ 2018-01-31 2:14 ` Ming Lei
2018-01-31 3:17 ` Jens Axboe
3 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2018-01-31 2:14 UTC (permalink / raw)
To: Bart Van Assche
Cc: Mike Snitzer, axboe, linux-block, dm-devel, linux-nvme,
linux-scsi
On Tue, Jan 30, 2018 at 09:52:31AM -0800, Bart Van Assche wrote:
> On 01/30/18 06:24, Mike Snitzer wrote:
> > + *
> > + * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
> > + * bit is set, run queue after a delay to avoid IO stalls
> > + * that could otherwise occur if the queue is idle.
> > */
> > - if (!blk_mq_sched_needs_restart(hctx) ||
> > + needs_restart = blk_mq_sched_needs_restart(hctx);
> > + if (!needs_restart ||
> > (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> > blk_mq_run_hw_queue(hctx, true);
> > + else if (needs_restart && (ret == BLK_STS_RESOURCE))
> > + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
> > }
>
> If a request completes concurrently with execution of the above code then
> the request completion will trigger a call of blk_mq_sched_restart_hctx()
> and that call will clear the BLK_MQ_S_SCHED_RESTART bit. If that bit is
> cleared before the above code tests it then the above code will schedule an
> asynchronous call of __blk_mq_run_hw_queue(). If the .queue_rq() call
Right.
> triggered by the new queue run returns again BLK_STS_RESOURCE then the above
> code will be executed again. In other words, a loop occurs. That loop will
This patch doesn't change anything about that. When BLK_STS_RESOURCE is
returned, this request is added to hctx->dispatch; next time, before
this request is dispatched, SCHED_RESTART is set and the loop is cut.
> repeat as long as the described race occurs. The current (kernel v4.15)
> block layer behavior is simpler: only block drivers call
> blk_mq_delay_run_hw_queue() and the block layer core never calls that
> function. Hence that loop cannot occur with the v4.15 block layer core and
That way isn't safe, as I have explained to you in the following link:
https://marc.info/?l=dm-devel&m=151727454815018&w=2
> block drivers. A motivation of why that loop is preferred compared to the
> current behavior (no loop) is missing. Does this mean that that loop is a
> needless complication of the block layer core?
No such loop as I explained above.
>
> Sorry but I still prefer the v4.15 block layer approach because this patch
> has in my view the following disadvantages:
> - It involves a blk-mq API change. API changes are always risky and need
> some time to stabilize.
> - The delay after which to rerun the queue is moved from block layer
> drivers into the block layer core. I think that's wrong because only
> the block driver authors can make a good choice for this constant.
> - This patch makes block drivers harder to understand. Anyone who sees
> return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the first
> time will have to look up the meaning of these constants. An explicit
> blk_mq_delay_run_hw_queue() call is easier to understand.
> - This patch makes the blk-mq core harder to understand because of the
> loop mentioned above.
> - This patch does not fix any bugs nor makes block drivers easier to
> read or to implement. So why is this patch considered useful?
It does avoid the race I mentioned in the following link:
https://marc.info/?l=dm-devel&m=151727454815018&w=2
More importantly, every driver needs this change. If you have a better
idea to fix them all, please post a patch, then we can compare both.
Thanks,
Ming
* Re: [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
2018-01-30 14:24 [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE Mike Snitzer
2018-01-30 17:52 ` [dm-devel] " Bart Van Assche
@ 2018-01-31 2:44 ` Jens Axboe
2018-01-31 3:04 ` [PATCH v6] " Mike Snitzer
2018-01-31 3:07 ` [PATCH v5] " Mike Snitzer
1 sibling, 2 replies; 17+ messages in thread
From: Jens Axboe @ 2018-01-31 2:44 UTC (permalink / raw)
To: Mike Snitzer; +Cc: linux-block, dm-devel, linux-scsi, linux-nvme
On 1/30/18 7:24 AM, Mike Snitzer wrote:
> From: Ming Lei <ming.lei@redhat.com>
>
> This status is returned from the driver to the block layer if a
> device-related resource is unavailable, but the driver can guarantee
> that IO dispatch will be triggered in the future when the resource
> becomes available.
>
> Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if the driver
> returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun the queue after
> a delay (BLK_MQ_QUEUE_DELAY) to avoid IO stalls. BLK_MQ_QUEUE_DELAY is
> 3 ms because both scsi-mq and nvmefc use that magic value.
>
> If a driver can make sure there is in-flight IO, it is safe to return
> BLK_STS_DEV_RESOURCE because:
>
> 1) If all in-flight IOs complete before examining SCHED_RESTART in
> blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
> is run immediately in this case by blk_mq_dispatch_rq_list();
>
> 2) if there is any in-flight IO after/when examining SCHED_RESTART
> in blk_mq_dispatch_rq_list():
> - if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
> - otherwise, this request will be dispatched after any in-flight IO is
> completed via blk_mq_sched_restart()
>
> 3) if SCHED_RESTART is set concurrently in another context because of
> BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
> cases and make sure an IO hang is avoided.
>
> One invariant is that queue will be rerun if SCHED_RESTART is set.
This looks pretty good to me. I'm waffling a bit on whether to retain
the current BLK_STS_RESOURCE behavior and name the new one something
else, but I do like using the DEV name in there to signify the
difference between a global and device resource.
Just a few small nits below - can you roll a v6 with the changes?
> diff --git a/block/blk-core.c b/block/blk-core.c
> index cdae69be68e9..38279d4ae08b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -145,6 +145,7 @@ static const struct {
> [BLK_STS_MEDIUM] = { -ENODATA, "critical medium" },
> [BLK_STS_PROTECTION] = { -EILSEQ, "protection" },
> [BLK_STS_RESOURCE] = { -ENOMEM, "kernel resource" },
> + [BLK_STS_DEV_RESOURCE] = { -ENOMEM, "device resource" },
> [BLK_STS_AGAIN] = { -EAGAIN, "nonblocking retry" },
I think we should make BLK_STS_DEV_RESOURCE be -EBUSY, that more closely
matches the result and makes it distinctly different than the global
shortage that is STS_RESOURCE/ENOMEM.
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 43e7449723e0..e39b4e2a63db 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1160,6 +1160,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
> return true;
> }
>
> +#define BLK_MQ_QUEUE_DELAY 3 /* ms units */
Make that BLK_MQ_RESOURCE_DELAY or something like that. The name is too
generic right now.
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 2d973ac54b09..f41d2057215f 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -39,6 +39,23 @@ typedef u8 __bitwise blk_status_t;
>
> #define BLK_STS_AGAIN ((__force blk_status_t)12)
>
> +/*
> + * BLK_STS_DEV_RESOURCE is returned from driver to block layer if device
> + * related resource is unavailable, but driver can guarantee that queue
> + * will be rerun in future once the resource is available (whereby
> + * dispatching requests).
> + *
> + * To safely return BLK_STS_DEV_RESOURCE, and allow forward progress, a
> + * driver just needs to make sure there is in-flight IO.
> + *
> + * Difference with BLK_STS_RESOURCE:
> + * If driver isn't sure if the queue will be rerun once device resource
> + * is made available, please return BLK_STS_RESOURCE. For example: when
> + * memory allocation, DMA Mapping or other system resource allocation
> + * fails and IO can't be submitted to device.
> + */
> +#define BLK_STS_DEV_RESOURCE ((__force blk_status_t)13)
I'd rephrase that as:
BLK_STS_DEV_RESOURCE is returned from the driver to the block layer if
device related resource are unavailable, but the driver can guarantee
that the queue will be rerun in the future once resources become
available again. This is typically the case for device specific
resources that are consumed for IO. If the driver fails allocating these
resources, we know that inflight (or pending) IO will free these
resource upon completion.
This is different from BLK_STS_RESOURCE in that it explicitly references
device specific resource. For resources of wider scope, allocation
failure can happen without having pending IO. This means that we can't
rely on request completions freeing these resources, as IO may not be in
flight. Examples of that are kernel memory allocations, DMA mappings, or
any other system wide resources.
--
Jens Axboe
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v6] blk-mq: introduce BLK_STS_DEV_RESOURCE
2018-01-31 2:44 ` Jens Axboe
@ 2018-01-31 3:04 ` Mike Snitzer
2018-01-31 3:18 ` Jens Axboe
2018-01-31 3:07 ` [PATCH v5] " Mike Snitzer
1 sibling, 1 reply; 17+ messages in thread
From: Mike Snitzer @ 2018-01-31 3:04 UTC (permalink / raw)
To: axboe; +Cc: linux-block, dm-devel, linux-scsi, linux-nvme
From: Ming Lei <ming.lei@redhat.com>
This status is returned from the driver to the block layer if a device
related resource is unavailable, but the driver can guarantee that IO
dispatch will be triggered in the future once the resource is available.
Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if the driver
returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun the queue after
a delay (BLK_MQ_RESOURCE_DELAY) to avoid IO stalls. BLK_MQ_RESOURCE_DELAY
is 3 ms because both scsi-mq and nvmefc are using that magic value.
If a driver can make sure there is in-flight IO, it is safe to return
BLK_STS_DEV_RESOURCE because:
1) If all in-flight IOs complete before examining SCHED_RESTART in
blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
is run immediately in this case by blk_mq_dispatch_rq_list();
2) if there is any in-flight IO after/when examining SCHED_RESTART
in blk_mq_dispatch_rq_list():
- if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
- otherwise, this request will be dispatched after any in-flight IO is
completed via blk_mq_sched_restart()
3) if SCHED_RESTART is set concurrently in context because of
BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
cases and make sure IO hang can be avoided.
One invariant is that queue will be rerun if SCHED_RESTART is set.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
V6:
- use -EBUSY, instead of -ENOMEM, for BLK_STS_DEV_RESOURCE
- rename BLK_MQ_QUEUE_DELAY to BLK_MQ_RESOURCE_DELAY
- cleanup blk_types.h comment block above BLK_STS_DEV_RESOURCE
- all suggested by Jens
V5:
- fixed stale reference to 10ms delay in blk-mq.c comment
- revised commit header to include Ming's proof of how
blk-mq drivers will make progress (serves to document how
scsi_queue_rq now works)
V4:
- cleanup header and code comments
- rerun queue after BLK_MQ_QUEUE_DELAY (3ms) instead of 10ms
- eliminate nvmefc's queue rerun now that blk-mq does it
V3:
- fix typo, and improve documentation
- add tested-by tag
V2:
- add comments on the new introduced status
- patch style fix
- both are suggested by Christoph
block/blk-core.c | 1 +
block/blk-mq.c | 20 ++++++++++++++++----
drivers/block/null_blk.c | 2 +-
drivers/block/virtio_blk.c | 2 +-
drivers/block/xen-blkfront.c | 2 +-
drivers/md/dm-rq.c | 5 ++---
drivers/nvme/host/fc.c | 12 ++----------
drivers/scsi/scsi_lib.c | 6 +++---
include/linux/blk_types.h | 18 ++++++++++++++++++
9 files changed, 45 insertions(+), 23 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index cdae69be68e9..79dfef84c66c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -145,6 +145,7 @@ static const struct {
[BLK_STS_MEDIUM] = { -ENODATA, "critical medium" },
[BLK_STS_PROTECTION] = { -EILSEQ, "protection" },
[BLK_STS_RESOURCE] = { -ENOMEM, "kernel resource" },
+ [BLK_STS_DEV_RESOURCE] = { -EBUSY, "device resource" },
[BLK_STS_AGAIN] = { -EAGAIN, "nonblocking retry" },
/* device mapper special case, should not leak out: */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 43e7449723e0..52a206e3777f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1160,6 +1160,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
return true;
}
+#define BLK_MQ_RESOURCE_DELAY 3 /* ms units */
+
bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
bool got_budget)
{
@@ -1167,6 +1169,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
struct request *rq, *nxt;
bool no_tag = false;
int errors, queued;
+ blk_status_t ret = BLK_STS_OK;
if (list_empty(list))
return false;
@@ -1179,7 +1182,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
errors = queued = 0;
do {
struct blk_mq_queue_data bd;
- blk_status_t ret;
rq = list_first_entry(list, struct request, queuelist);
if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
@@ -1224,7 +1226,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
}
ret = q->mq_ops->queue_rq(hctx, &bd);
- if (ret == BLK_STS_RESOURCE) {
+ if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
/*
* If an I/O scheduler has been configured and we got a
* driver tag for the next request already, free it
@@ -1255,6 +1257,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
* that is where we will continue on next queue run.
*/
if (!list_empty(list)) {
+ bool needs_restart;
+
spin_lock(&hctx->lock);
list_splice_init(list, &hctx->dispatch);
spin_unlock(&hctx->lock);
@@ -1278,10 +1282,17 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
* - Some but not all block drivers stop a queue before
* returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
* and dm-rq.
+ *
+ * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
+ * bit is set, run queue after a delay to avoid IO stalls
+ * that could otherwise occur if the queue is idle.
*/
- if (!blk_mq_sched_needs_restart(hctx) ||
+ needs_restart = blk_mq_sched_needs_restart(hctx);
+ if (!needs_restart ||
(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
blk_mq_run_hw_queue(hctx, true);
+ else if (needs_restart && (ret == BLK_STS_RESOURCE))
+ blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
}
return (queued + errors) != 0;
@@ -1762,6 +1773,7 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
*cookie = new_cookie;
break;
case BLK_STS_RESOURCE:
+ case BLK_STS_DEV_RESOURCE:
__blk_mq_requeue_request(rq);
break;
default:
@@ -1824,7 +1836,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
hctx_lock(hctx, &srcu_idx);
ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
- if (ret == BLK_STS_RESOURCE)
+ if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
blk_mq_sched_insert_request(rq, false, true, false);
else if (ret != BLK_STS_OK)
blk_mq_end_request(rq, ret);
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 5b94e530570c..4bc25fc4e73c 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -1230,7 +1230,7 @@ static blk_status_t null_handle_cmd(struct nullb_cmd *cmd)
return BLK_STS_OK;
} else
/* requeue request */
- return BLK_STS_RESOURCE;
+ return BLK_STS_DEV_RESOURCE;
}
}
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 68846897d213..79908e6ddbf2 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -276,7 +276,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
/* Out of mem doesn't actually happen, since we fall back
* to direct descriptors */
if (err == -ENOMEM || err == -ENOSPC)
- return BLK_STS_RESOURCE;
+ return BLK_STS_DEV_RESOURCE;
return BLK_STS_IOERR;
}
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 891265acb10e..e126e4cac2ca 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -911,7 +911,7 @@ static blk_status_t blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
out_busy:
blk_mq_stop_hw_queue(hctx);
spin_unlock_irqrestore(&rinfo->ring_lock, flags);
- return BLK_STS_RESOURCE;
+ return BLK_STS_DEV_RESOURCE;
}
static void blkif_complete_rq(struct request *rq)
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index b7d175e94a02..348a0cb6963a 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -404,7 +404,7 @@ static blk_status_t dm_dispatch_clone_request(struct request *clone, struct requ
clone->start_time = jiffies;
r = blk_insert_cloned_request(clone->q, clone);
- if (r != BLK_STS_OK && r != BLK_STS_RESOURCE)
+ if (r != BLK_STS_OK && r != BLK_STS_RESOURCE && r != BLK_STS_DEV_RESOURCE)
/* must complete clone in terms of original request */
dm_complete_request(rq, r);
return r;
@@ -496,7 +496,7 @@ static int map_request(struct dm_rq_target_io *tio)
trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
blk_rq_pos(rq));
ret = dm_dispatch_clone_request(clone, rq);
- if (ret == BLK_STS_RESOURCE) {
+ if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
blk_rq_unprep_clone(clone);
tio->ti->type->release_clone_rq(clone);
tio->clone = NULL;
@@ -769,7 +769,6 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
/* Undo dm_start_request() before requeuing */
rq_end_stats(md, rq);
rq_completed(md, rq_data_dir(rq), false);
- blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
return BLK_STS_RESOURCE;
}
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index b76ba4629e02..54e679541ad6 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -35,8 +35,6 @@ enum nvme_fc_queue_flags {
NVME_FC_Q_LIVE,
};
-#define NVMEFC_QUEUE_DELAY 3 /* ms units */
-
#define NVME_FC_DEFAULT_DEV_LOSS_TMO 60 /* seconds */
struct nvme_fc_queue {
@@ -2231,7 +2229,7 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
* the target device is present
*/
if (ctrl->rport->remoteport.port_state != FC_OBJSTATE_ONLINE)
- goto busy;
+ return BLK_STS_RESOURCE;
if (!nvme_fc_ctrl_get(ctrl))
return BLK_STS_IOERR;
@@ -2311,16 +2309,10 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
ret != -EBUSY)
return BLK_STS_IOERR;
- goto busy;
+ return BLK_STS_RESOURCE;
}
return BLK_STS_OK;
-
-busy:
- if (!(op->flags & FCOP_FLAGS_AEN) && queue->hctx)
- blk_mq_delay_run_hw_queue(queue->hctx, NVMEFC_QUEUE_DELAY);
-
- return BLK_STS_RESOURCE;
}
static inline blk_status_t nvme_fc_is_ready(struct nvme_fc_queue *queue,
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index d9ca1dfab154..55be2550c555 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2030,9 +2030,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
case BLK_STS_OK:
break;
case BLK_STS_RESOURCE:
- if (atomic_read(&sdev->device_busy) == 0 &&
- !scsi_device_blocked(sdev))
- blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
+ if (atomic_read(&sdev->device_busy) ||
+ scsi_device_blocked(sdev))
+ ret = BLK_STS_DEV_RESOURCE;
break;
default:
/*
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 2d973ac54b09..edbd118f23cf 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -39,6 +39,24 @@ typedef u8 __bitwise blk_status_t;
#define BLK_STS_AGAIN ((__force blk_status_t)12)
+/*
+ * BLK_STS_DEV_RESOURCE is returned from the driver to the block layer if
+ * device related resources are unavailable, but the driver can guarantee
+ * that the queue will be rerun in the future once resources become
+ * available again. This is typically the case for device specific
+ * resources that are consumed for IO. If the driver fails allocating these
+ * resources, we know that inflight (or pending) IO will free these
+ * resource upon completion.
+ *
+ * This is different from BLK_STS_RESOURCE in that it explicitly references
+ * a device specific resource. For resources of wider scope, allocation
+ * failure can happen without having pending IO. This means that we can't
+ * rely on request completions freeing these resources, as IO may not be in
+ * flight. Examples of that are kernel memory allocations, DMA mappings, or
+ * any other system wide resources.
+ */
+#define BLK_STS_DEV_RESOURCE ((__force blk_status_t)13)
+
/**
* blk_path_error - returns true if error may be path related
* @error: status the request was completed with
--
2.15.0
* Re: [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
2018-01-31 2:44 ` Jens Axboe
2018-01-31 3:04 ` [PATCH v6] " Mike Snitzer
@ 2018-01-31 3:07 ` Mike Snitzer
1 sibling, 0 replies; 17+ messages in thread
From: Mike Snitzer @ 2018-01-31 3:07 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-block, dm-devel, linux-scsi, linux-nvme
On Tue, Jan 30 2018 at 9:44P -0500,
Jens Axboe <axboe@kernel.dk> wrote:
> On 1/30/18 7:24 AM, Mike Snitzer wrote:
> > From: Ming Lei <ming.lei@redhat.com>
> >
> > This status is returned from the driver to the block layer if a device
> > related resource is unavailable, but the driver can guarantee that IO
> > dispatch will be triggered in the future once the resource is available.
> >
> > Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if the driver
> > returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun the queue after
> > a delay (BLK_MQ_QUEUE_DELAY) to avoid IO stalls. BLK_MQ_QUEUE_DELAY is
> > 3 ms because both scsi-mq and nvmefc are using that magic value.
> >
> > If a driver can make sure there is in-flight IO, it is safe to return
> > BLK_STS_DEV_RESOURCE because:
> >
> > 1) If all in-flight IOs complete before examining SCHED_RESTART in
> > blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
> > is run immediately in this case by blk_mq_dispatch_rq_list();
> >
> > 2) if there is any in-flight IO after/when examining SCHED_RESTART
> > in blk_mq_dispatch_rq_list():
> > - if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
> > - otherwise, this request will be dispatched after any in-flight IO is
> > completed via blk_mq_sched_restart()
> >
> > 3) if SCHED_RESTART is set concurrently in context because of
> > BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
> > cases and make sure IO hang can be avoided.
> >
> > One invariant is that queue will be rerun if SCHED_RESTART is set.
>
> This looks pretty good to me. I'm waffling a bit on whether to retain
> the current BLK_STS_RESOURCE behavior and name the new one something
> else, but I do like using the DEV name in there to signify the
> difference between a global and device resource.
>
> Just a few small nits below - can you roll a v6 with the changes?
Folded in all your feedback and just replied with v6.
> > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> > index 2d973ac54b09..f41d2057215f 100644
> > --- a/include/linux/blk_types.h
> > +++ b/include/linux/blk_types.h
> > @@ -39,6 +39,23 @@ typedef u8 __bitwise blk_status_t;
> >
> > #define BLK_STS_AGAIN ((__force blk_status_t)12)
> >
> > +/*
> > + * BLK_STS_DEV_RESOURCE is returned from driver to block layer if device
> > + * related resource is unavailable, but driver can guarantee that queue
> > + * will be rerun in future once the resource is available (whereby
> > + * dispatching requests).
> > + *
> > + * To safely return BLK_STS_DEV_RESOURCE, and allow forward progress, a
> > + * driver just needs to make sure there is in-flight IO.
> > + *
> > + * Difference with BLK_STS_RESOURCE:
> > + * If driver isn't sure if the queue will be rerun once device resource
> > + * is made available, please return BLK_STS_RESOURCE. For example: when
> > + * memory allocation, DMA Mapping or other system resource allocation
> > + * fails and IO can't be submitted to device.
> > + */
> > +#define BLK_STS_DEV_RESOURCE ((__force blk_status_t)13)
>
> I'd rephrase that as:
>
> BLK_STS_DEV_RESOURCE is returned from the driver to the block layer if
> device related resource are unavailable, but the driver can guarantee
> that the queue will be rerun in the future once resources become
> available again. This is typically the case for device specific
> resources that are consumed for IO. If the driver fails allocating these
> resources, we know that inflight (or pending) IO will free these
> resource upon completion.
>
> This is different from BLK_STS_RESOURCE in that it explicitly references
> device specific resource. For resources of wider scope, allocation
> failure can happen without having pending IO. This means that we can't
> rely on request completions freeing these resources, as IO may not be in
> flight. Examples of that are kernel memory allocations, DMA mappings, or
> any other system wide resources.
Thanks for that, definitely clearer, nice job.
Mike
* Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
2018-01-30 17:52 ` [dm-devel] " Bart Van Assche
` (2 preceding siblings ...)
2018-01-31 2:14 ` [dm-devel] " Ming Lei
@ 2018-01-31 3:17 ` Jens Axboe
2018-01-31 3:21 ` Bart Van Assche
3 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2018-01-31 3:17 UTC (permalink / raw)
To: Bart Van Assche, Mike Snitzer
Cc: linux-block, dm-devel, linux-nvme, linux-scsi
On 1/30/18 10:52 AM, Bart Van Assche wrote:
> On 01/30/18 06:24, Mike Snitzer wrote:
>> + *
>> + * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
>> + * bit is set, run queue after a delay to avoid IO stalls
>> + * that could otherwise occur if the queue is idle.
>> */
>> - if (!blk_mq_sched_needs_restart(hctx) ||
>> + needs_restart = blk_mq_sched_needs_restart(hctx);
>> + if (!needs_restart ||
>> (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>> blk_mq_run_hw_queue(hctx, true);
>> + else if (needs_restart && (ret == BLK_STS_RESOURCE))
>> + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
>> }
>
> If a request completes concurrently with execution of the above code
> then the request completion will trigger a call of
> blk_mq_sched_restart_hctx() and that call will clear the
> BLK_MQ_S_SCHED_RESTART bit. If that bit is cleared before the above code
> tests it then the above code will schedule an asynchronous call of
> __blk_mq_run_hw_queue(). If the .queue_rq() call triggered by the new
> queue run returns again BLK_STS_RESOURCE then the above code will be
> executed again. In other words, a loop occurs. That loop will repeat as
> long as the described race occurs. The current (kernel v4.15) block
> layer behavior is simpler: only block drivers call
> blk_mq_delay_run_hw_queue() and the block layer core never calls that
> function. Hence that loop cannot occur with the v4.15 block layer core
> and block drivers. A motivation of why that loop is preferred compared
> to the current behavior (no loop) is missing. Does this mean that that
> loop is a needless complication of the block layer core?
The dispatch, and later restart check, is within the hctx lock. The
completions should be as well.
> Sorry but I still prefer the v4.15 block layer approach because this
> patch has in my view the following disadvantages:
> - It involves a blk-mq API change. API changes are always risky and need
> some time to stabilize.
> - The delay after which to rerun the queue is moved from block layer
> drivers into the block layer core. I think that's wrong because only
> the block driver authors can make a good choice for this constant.
It's exactly the right place to put it, as drivers cannot make a good
decision for when to run the queue again. You get NULL on allocating
some piece of memory, when do you run it again? That's the last thing
I want driver writers to make a decision on, because a novice device
driver writer will just think that he should run the queue again asap.
In reality we are screwed. Decisions like that SHOULD be in shared
and generic code, not in driver private code.
> - This patch makes block drivers harder to understand. Anyone who sees
> return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the first
> time will have to look up the meaning of these constants. An explicit
> blk_mq_delay_run_hw_queue() call is easier to understand.
BLK_STS_RESOURCE should always be safe to return, and it should work
the same as STS_DEV_RESOURCE, except it may cause an extra queue
run.
Well written drivers should use STS_DEV_RESOURCE where it makes
sense.
> - This patch does not fix any bugs nor makes block drivers easier to
> read or to implement. So why is this patch considered useful?
It does fix the run bug on global resources. I'm assuming you mean
it doesn't fix any EXTRA bugs compared to just using the delayed
run?
--
Jens Axboe
* Re: [PATCH v6] blk-mq: introduce BLK_STS_DEV_RESOURCE
2018-01-31 3:04 ` [PATCH v6] " Mike Snitzer
@ 2018-01-31 3:18 ` Jens Axboe
0 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2018-01-31 3:18 UTC (permalink / raw)
To: Mike Snitzer; +Cc: linux-block, dm-devel, linux-scsi, linux-nvme
On 1/30/18 8:04 PM, Mike Snitzer wrote:
> From: Ming Lei <ming.lei@redhat.com>
>
> This status is returned from the driver to the block layer if a device
> related resource is unavailable, but the driver can guarantee that IO
> dispatch will be triggered in the future once the resource is available.
>
> Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if the driver
> returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun the queue after
> a delay (BLK_MQ_RESOURCE_DELAY) to avoid IO stalls. BLK_MQ_RESOURCE_DELAY
> is 3 ms because both scsi-mq and nvmefc are using that magic value.
>
> If a driver can make sure there is in-flight IO, it is safe to return
> BLK_STS_DEV_RESOURCE because:
>
> 1) If all in-flight IOs complete before examining SCHED_RESTART in
> blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
> is run immediately in this case by blk_mq_dispatch_rq_list();
>
> 2) if there is any in-flight IO after/when examining SCHED_RESTART
> in blk_mq_dispatch_rq_list():
> - if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
> - otherwise, this request will be dispatched after any in-flight IO is
> completed via blk_mq_sched_restart()
>
> > 3) if SCHED_RESTART is set concurrently in context because of
> BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
> cases and make sure IO hang can be avoided.
>
> One invariant is that queue will be rerun if SCHED_RESTART is set.
Applied, thanks.
--
Jens Axboe
* Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
2018-01-31 3:17 ` Jens Axboe
@ 2018-01-31 3:21 ` Bart Van Assche
2018-01-31 3:22 ` Jens Axboe
0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2018-01-31 3:21 UTC (permalink / raw)
To: axboe@kernel.dk, snitzer@redhat.com
Cc: dm-devel@redhat.com, linux-scsi@vger.kernel.org,
linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
On Tue, 2018-01-30 at 20:17 -0700, Jens Axboe wrote:
> BLK_STS_RESOURCE should always be safe to return, and it should work
> the same as STS_DEV_RESOURCE, except it may cause an extra queue
> run.
>
> Well written drivers should use STS_DEV_RESOURCE where it makes
> sense.
Hello Jens,
I would appreciate it if other names would be chosen than BLK_STS_RESOURCE
and BLK_STS_DEV_RESOURCE, e.g. names that directly refer to the fact that
one of the two status codes causes the queue to be rerun and the other not.
I'm afraid that the currently chosen names will cause confusion.
Thanks,
Bart.
* Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
2018-01-31 3:21 ` Bart Van Assche
@ 2018-01-31 3:22 ` Jens Axboe
2018-01-31 3:27 ` Bart Van Assche
2018-01-31 3:33 ` Ming Lei
0 siblings, 2 replies; 17+ messages in thread
From: Jens Axboe @ 2018-01-31 3:22 UTC (permalink / raw)
To: Bart Van Assche, snitzer@redhat.com
Cc: dm-devel@redhat.com, linux-scsi@vger.kernel.org,
linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
On 1/30/18 8:21 PM, Bart Van Assche wrote:
> On Tue, 2018-01-30 at 20:17 -0700, Jens Axboe wrote:
>> BLK_STS_RESOURCE should always be safe to return, and it should work
>> the same as STS_DEV_RESOURCE, except it may cause an extra queue
>> run.
>>
>> Well written drivers should use STS_DEV_RESOURCE where it makes
>> sense.
>
> Hello Jens,
>
> I would appreciate it if other names would be chosen than BLK_STS_RESOURCE
> and BLK_STS_DEV_RESOURCE, e.g. names that directly refer to the fact that
> one of the two status codes causes the queue to be rerun and the other not.
> I'm afraid that the currently chosen names will cause confusion.
DEV_RESOURCE is pretty clear I think, but I agree that STS_RESOURCE
could perhaps be better. STS_SYSTEM_RESOURCE? It makes the distinction
a bit clearer.
--
Jens Axboe
* Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
2018-01-31 3:22 ` Jens Axboe
@ 2018-01-31 3:27 ` Bart Van Assche
2018-01-31 3:31 ` Jens Axboe
2018-01-31 3:33 ` Ming Lei
1 sibling, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2018-01-31 3:27 UTC (permalink / raw)
To: axboe@kernel.dk, snitzer@redhat.com
Cc: dm-devel@redhat.com, linux-scsi@vger.kernel.org,
linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
On Tue, 2018-01-30 at 20:22 -0700, Jens Axboe wrote:
> On 1/30/18 8:21 PM, Bart Van Assche wrote:
> > On Tue, 2018-01-30 at 20:17 -0700, Jens Axboe wrote:
> > > BLK_STS_RESOURCE should always be safe to return, and it should work
> > > the same as STS_DEV_RESOURCE, except it may cause an extra queue
> > > run.
> > >
> > > Well written drivers should use STS_DEV_RESOURCE where it makes
> > > sense.
> >
> > Hello Jens,
> >
> > I would appreciate it if other names would be chosen than BLK_STS_RESOURCE
> > and BLK_STS_DEV_RESOURCE, e.g. names that directly refer to the fact that
> > one of the two status codes causes the queue to be rerun and the other not.
> > I'm afraid that the currently chosen names will cause confusion.
>
> DEV_RESOURCE is pretty clear I think, but I agree that STS_RESOURCE
> could perhaps be better. STS_SYSTEM_RESOURCE? It makes the distinction
> a bit clearer.
I'm concerned about both. A block driver can namely run out of device resources
without receiving a notification if that resource becomes available again. In
that case the appropriate return code is BLK_STS_RESOURCE. Hence my concern ...
Bart.
* Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
2018-01-31 3:27 ` Bart Van Assche
@ 2018-01-31 3:31 ` Jens Axboe
0 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2018-01-31 3:31 UTC (permalink / raw)
To: Bart Van Assche, snitzer@redhat.com
Cc: dm-devel@redhat.com, linux-scsi@vger.kernel.org,
linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
On 1/30/18 8:27 PM, Bart Van Assche wrote:
> On Tue, 2018-01-30 at 20:22 -0700, Jens Axboe wrote:
>> On 1/30/18 8:21 PM, Bart Van Assche wrote:
>>> On Tue, 2018-01-30 at 20:17 -0700, Jens Axboe wrote:
>>>> BLK_STS_RESOURCE should always be safe to return, and it should work
>>>> the same as STS_DEV_RESOURCE, except it may cause an extra queue
>>>> run.
>>>>
>>>> Well written drivers should use STS_DEV_RESOURCE where it makes
>>>> sense.
>>>
>>> Hello Jens,
>>>
>>> I would appreciate it if other names would be chosen than BLK_STS_RESOURCE
>>> and BLK_STS_DEV_RESOURCE, e.g. names that directly refer to the fact that
>>> one of the two status codes causes the queue to be rerun and the other not.
>>> I'm afraid that the currently chosen names will cause confusion.
>>
>> DEV_RESOURCE is pretty clear I think, but I agree that STS_RESOURCE
>> could perhaps be better. STS_SYSTEM_RESOURCE? It makes the distinction
>> a bit clearer.
>
> I'm concerned about both. A block driver can namely run out of device
> resources without receiving a notification if that resource becomes
> available again.
That is true, and that is why I don't want the driver to make guesses on
when to re-run. The saving grace here is that it should happen extremely
rarely. I went over this in a previous email. If it's not a rare
occurrence, then there's no other way around it than wiring up a
notification mechanism to kick the queue when the shortage clears.
One way to deal with that is making the assumption that other IO will
clear the issue. That means we can confine it to the blk-mq layer.
Similarly to how we currently sometimes have to track starvation across
hardware queues and restart queue X when Y completes IO, we may have to
have a broader scope of shortage. That would probably fix all but the
pathological cases.
> In that case the appropriate return code is BLK_STS_RESOURCE. Hence my
> concern ...
But this isn't a new thing; the only real change here is making it so
that drivers don't have to think about that case.
--
Jens Axboe
* Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
2018-01-31 3:22 ` Jens Axboe
2018-01-31 3:27 ` Bart Van Assche
@ 2018-01-31 3:33 ` Ming Lei
1 sibling, 0 replies; 17+ messages in thread
From: Ming Lei @ 2018-01-31 3:33 UTC (permalink / raw)
To: Jens Axboe
Cc: Bart Van Assche, snitzer@redhat.com, dm-devel@redhat.com,
linux-scsi@vger.kernel.org, linux-block@vger.kernel.org,
linux-nvme@lists.infradead.org
On Tue, Jan 30, 2018 at 08:22:27PM -0700, Jens Axboe wrote:
> On 1/30/18 8:21 PM, Bart Van Assche wrote:
> > On Tue, 2018-01-30 at 20:17 -0700, Jens Axboe wrote:
> >> BLK_STS_RESOURCE should always be safe to return, and it should work
> >> the same as STS_DEV_RESOURCE, except it may cause an extra queue
> >> run.
> >>
> >> Well written drivers should use STS_DEV_RESOURCE where it makes
> >> sense.
> >
> > Hello Jens,
> >
> > I would appreciate it if other names would be chosen than BLK_STS_RESOURCE
> > and BLK_STS_DEV_RESOURCE, e.g. names that directly refer to the fact that
> > one of the two status codes causes the queue to be rerun and the other not.
> > I'm afraid that the currently chosen names will cause confusion.
>
> DEV_RESOURCE is pretty clear I think, but I agree that STS_RESOURCE
> > could perhaps be better. STS_SYSTEM_RESOURCE? It makes the distinction
> > a bit clearer.
I guess it still can't cover all cases. For example, .queue_rq() may not
submit the rq to hardware successfully because the transport is busy,
such as with FC, ...
--
Ming
Thread overview: 17+ messages
2018-01-30 14:24 [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE Mike Snitzer
2018-01-30 17:52 ` [dm-devel] " Bart Van Assche
2018-01-30 18:38 ` Laurence Oberman
2018-01-30 19:33 ` Mike Snitzer
2018-01-30 19:42 ` Bart Van Assche
2018-01-30 20:12 ` Mike Snitzer
2018-01-31 2:14 ` [dm-devel] " Ming Lei
2018-01-31 3:17 ` Jens Axboe
2018-01-31 3:21 ` Bart Van Assche
2018-01-31 3:22 ` Jens Axboe
2018-01-31 3:27 ` Bart Van Assche
2018-01-31 3:31 ` Jens Axboe
2018-01-31 3:33 ` Ming Lei
2018-01-31 2:44 ` Jens Axboe
2018-01-31 3:04 ` [PATCH v6] " Mike Snitzer
2018-01-31 3:18 ` Jens Axboe
2018-01-31 3:07 ` [PATCH v5] " Mike Snitzer