* [PATCH v3 0/7] Submit zoned writes in order
@ 2023-05-22 18:38 Bart Van Assche
  2023-05-22 18:38 ` [PATCH v3 1/7] block: Rename a local variable in blk_mq_requeue_work() Bart Van Assche
                   ` (7 more replies)
  0 siblings, 8 replies; 38+ messages in thread
From: Bart Van Assche @ 2023-05-22 18:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Mike Snitzer, Damien Le Moal,
	Jaegeuk Kim, Bart Van Assche

Hi Jens,

Tests with a zoned UFS prototype have shown that there are plenty of
opportunities for reordering in the block layer for zoned writes (REQ_OP_WRITE).
The UFS driver is more likely to trigger reordering than other SCSI drivers
because it reports BLK_STS_DEV_RESOURCE more often, e.g. during clock scaling.
This patch series ensures that zoned writes are submitted in order without
significantly affecting other workloads.
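
For context, a minimal sketch of where such BLK_STS_DEV_RESOURCE returns can
come from (a hypothetical driver, not the actual UFS code; my_dev_is_busy()
and my_dev_submit() are made-up helpers):

#include <linux/blk-mq.h>

/*
 * Hedged sketch: a ->queue_rq() callback that reports BLK_STS_DEV_RESOURCE
 * while the device is temporarily busy, e.g. during clock scaling. Every
 * such return makes the block layer requeue the request, which is where
 * zoned writes can currently get reordered.
 */
static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
				const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;

	if (my_dev_is_busy(hctx->queue))	/* e.g. clock scaling in progress */
		return BLK_STS_DEV_RESOURCE;	/* rq will be requeued */

	blk_mq_start_request(rq);
	return my_dev_submit(rq);		/* BLK_STS_OK on success */
}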

Please consider this patch series for the next merge window.

Thanks,

Bart.

Changes compared to v2:
- Changed the approach from one requeue list per hctx into preserving one
  requeue list per request queue.
- Rebased on top of Jens' for-next branch. Left out the mq-deadline patches
  since these are already in the for-next branch.
- Modified patch "block: Requeue requests if a CPU is unplugged" such that it
  always uses the requeue list.
- Added a patch that removes blk_mq_kick_requeue_list() and
  blk_mq_delay_kick_requeue_list().
- Dropped patch "block: mq-deadline: Disable head insertion for zoned writes".
- Dropped patch "block: mq-deadline: Introduce a local variable".

Changes compared to v1:
- Fixed two issues detected by the kernel test robot.

Bart Van Assche (7):
  block: Rename a local variable in blk_mq_requeue_work()
  block: Send requeued requests to the I/O scheduler
  block: Requeue requests if a CPU is unplugged
  block: Make it easier to debug zoned write reordering
  block: Preserve the order of requeued requests
  dm: Inline __dm_mq_kick_requeue_list()
  block: Inline blk_mq_{,delay_}kick_requeue_list()

 block/blk-flush.c            |   4 +-
 block/blk-mq-debugfs.c       |   2 +-
 block/blk-mq.c               | 107 ++++++++++++++++++-----------------
 drivers/block/ublk_drv.c     |   6 +-
 drivers/block/xen-blkfront.c |   1 -
 drivers/md/dm-rq.c           |  11 +---
 drivers/nvme/host/core.c     |   2 +-
 drivers/s390/block/scm_blk.c |   2 +-
 drivers/scsi/scsi_lib.c      |   2 +-
 include/linux/blk-mq.h       |   6 +-
 include/linux/blkdev.h       |   1 -
 11 files changed, 69 insertions(+), 75 deletions(-)


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v3 1/7] block: Rename a local variable in blk_mq_requeue_work()
  2023-05-22 18:38 [PATCH v3 0/7] Submit zoned writes in order Bart Van Assche
@ 2023-05-22 18:38 ` Bart Van Assche
  2023-05-23  7:12   ` Christoph Hellwig
  2023-05-23  9:48   ` Johannes Thumshirn
  2023-05-22 18:38 ` [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler Bart Van Assche
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 38+ messages in thread
From: Bart Van Assche @ 2023-05-22 18:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Mike Snitzer, Damien Le Moal,
	Jaegeuk Kim, Bart Van Assche, Ming Lei

Two data structures in blk_mq_requeue_work() represent request lists. Make
it clear that rq_list holds requests that come from the requeue list by
renaming that data structure.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 551e7760f45e..e79cc34ad962 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1436,17 +1436,17 @@ static void blk_mq_requeue_work(struct work_struct *work)
 {
 	struct request_queue *q =
 		container_of(work, struct request_queue, requeue_work.work);
-	LIST_HEAD(rq_list);
+	LIST_HEAD(requeue_list);
 	LIST_HEAD(flush_list);
 	struct request *rq;
 
 	spin_lock_irq(&q->requeue_lock);
-	list_splice_init(&q->requeue_list, &rq_list);
+	list_splice_init(&q->requeue_list, &requeue_list);
 	list_splice_init(&q->flush_list, &flush_list);
 	spin_unlock_irq(&q->requeue_lock);
 
-	while (!list_empty(&rq_list)) {
-		rq = list_entry(rq_list.next, struct request, queuelist);
+	while (!list_empty(&requeue_list)) {
+		rq = list_entry(requeue_list.next, struct request, queuelist);
 		/*
 		 * If RQF_DONTPREP ist set, the request has been started by the
 		 * driver already and might have driver-specific data allocated

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler
  2023-05-22 18:38 [PATCH v3 0/7] Submit zoned writes in order Bart Van Assche
  2023-05-22 18:38 ` [PATCH v3 1/7] block: Rename a local variable in blk_mq_requeue_work() Bart Van Assche
@ 2023-05-22 18:38 ` Bart Van Assche
  2023-05-23  7:18   ` Christoph Hellwig
  2023-05-23  9:03   ` Ming Lei
  2023-05-22 18:38 ` [PATCH v3 3/7] block: Requeue requests if a CPU is unplugged Bart Van Assche
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 38+ messages in thread
From: Bart Van Assche @ 2023-05-22 18:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Mike Snitzer, Damien Le Moal,
	Jaegeuk Kim, Bart Van Assche, Ming Lei, Jianchao Wang

Send requeued requests to the I/O scheduler such that the I/O scheduler
can control the order in which requests are dispatched.

This patch reworks commit aef1897cd36d ("blk-mq: insert rq with DONTPREP
to hctx dispatch list when requeue").

Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq.c         | 36 +++++++++++++++++++++++-------------
 include/linux/blk-mq.h |  4 ++--
 2 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e79cc34ad962..632aee9af60f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1438,30 +1438,26 @@ static void blk_mq_requeue_work(struct work_struct *work)
 		container_of(work, struct request_queue, requeue_work.work);
 	LIST_HEAD(requeue_list);
 	LIST_HEAD(flush_list);
-	struct request *rq;
+	struct request *rq, *next;
 
 	spin_lock_irq(&q->requeue_lock);
 	list_splice_init(&q->requeue_list, &requeue_list);
 	list_splice_init(&q->flush_list, &flush_list);
 	spin_unlock_irq(&q->requeue_lock);
 
-	while (!list_empty(&requeue_list)) {
-		rq = list_entry(requeue_list.next, struct request, queuelist);
-		/*
-		 * If RQF_DONTPREP ist set, the request has been started by the
-		 * driver already and might have driver-specific data allocated
-		 * already.  Insert it into the hctx dispatch list to avoid
-		 * block layer merges for the request.
-		 */
-		if (rq->rq_flags & RQF_DONTPREP) {
-			list_del_init(&rq->queuelist);
-			blk_mq_request_bypass_insert(rq, 0);
-		} else {
+	list_for_each_entry_safe(rq, next, &requeue_list, queuelist) {
+		if (!(rq->rq_flags & RQF_DONTPREP)) {
 			list_del_init(&rq->queuelist);
 			blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
 		}
 	}
 
+	while (!list_empty(&requeue_list)) {
+		rq = list_entry(requeue_list.next, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_insert_request(rq, 0);
+	}
+
 	while (!list_empty(&flush_list)) {
 		rq = list_entry(flush_list.next, struct request, queuelist);
 		list_del_init(&rq->queuelist);
@@ -2064,14 +2060,28 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
 		bool no_tag = prep == PREP_DISPATCH_NO_TAG &&
 			((hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) ||
 			blk_mq_is_shared_tags(hctx->flags));
+		LIST_HEAD(for_sched);
+		struct request *next;
 
 		if (nr_budgets)
 			blk_mq_release_budgets(q, list);
 
 		spin_lock(&hctx->lock);
+		list_for_each_entry_safe(rq, next, list, queuelist)
+			if (rq->rq_flags & RQF_USE_SCHED)
+				list_move_tail(&rq->queuelist, &for_sched);
 		list_splice_tail_init(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
 
+		if (q->elevator) {
+			if (q->elevator->type->ops.requeue_request)
+				list_for_each_entry(rq, &for_sched, queuelist)
+					q->elevator->type->ops.
+						requeue_request(rq);
+			q->elevator->type->ops.insert_requests(hctx, &for_sched,
+							BLK_MQ_INSERT_AT_HEAD);
+		}
+
 		/*
 		 * Order adding requests to hctx->dispatch and checking
 		 * SCHED_RESTART flag. The pair of this smp_mb() is the one
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d778cb6b2112..363894aea0e8 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -62,8 +62,8 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_RESV		((__force req_flags_t)(1 << 23))
 
 /* flags that prevent us from merging requests: */
-#define RQF_NOMERGE_FLAGS \
-	(RQF_STARTED | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
+#define RQF_NOMERGE_FLAGS                                               \
+	(RQF_STARTED | RQF_FLUSH_SEQ | RQF_DONTPREP | RQF_SPECIAL_PAYLOAD)
 
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 3/7] block: Requeue requests if a CPU is unplugged
  2023-05-22 18:38 [PATCH v3 0/7] Submit zoned writes in order Bart Van Assche
  2023-05-22 18:38 ` [PATCH v3 1/7] block: Rename a local variable in blk_mq_requeue_work() Bart Van Assche
  2023-05-22 18:38 ` [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler Bart Van Assche
@ 2023-05-22 18:38 ` Bart Van Assche
  2023-05-23  7:19   ` Christoph Hellwig
  2023-05-23  8:17   ` Ming Lei
  2023-05-22 18:38 ` [PATCH v3 4/7] block: Make it easier to debug zoned write reordering Bart Van Assche
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 38+ messages in thread
From: Bart Van Assche @ 2023-05-22 18:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Mike Snitzer, Damien Le Moal,
	Jaegeuk Kim, Bart Van Assche, Ming Lei

Requeue requests instead of sending these to the dispatch list if a CPU
is unplugged. This gives the I/O scheduler the chance to preserve the
order of zoned write requests.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 632aee9af60f..bc52a57641e2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3497,13 +3497,13 @@ static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
 }
 
 /*
- * 'cpu' is going away. splice any existing rq_list entries from this
- * software queue to the hw queue dispatch list, and ensure that it
- * gets run.
+ * @cpu is going away. Requeue any existing rq_list entries from its
+ * software queue and run the hw queue associated with @cpu.
  */
 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 {
 	struct blk_mq_hw_ctx *hctx;
+	struct request *rq, *next;
 	struct blk_mq_ctx *ctx;
 	LIST_HEAD(tmp);
 	enum hctx_type type;
@@ -3525,9 +3525,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	if (list_empty(&tmp))
 		return 0;
 
-	spin_lock(&hctx->lock);
-	list_splice_tail_init(&tmp, &hctx->dispatch);
-	spin_unlock(&hctx->lock);
+	list_for_each_entry_safe(rq, next, &tmp, queuelist)
+		blk_mq_requeue_request(rq, false);
+	blk_mq_kick_requeue_list(hctx->queue);
 
 	blk_mq_run_hw_queue(hctx, true);
 	return 0;

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 4/7] block: Make it easier to debug zoned write reordering
  2023-05-22 18:38 [PATCH v3 0/7] Submit zoned writes in order Bart Van Assche
                   ` (2 preceding siblings ...)
  2023-05-22 18:38 ` [PATCH v3 3/7] block: Requeue requests if a CPU is unplugged Bart Van Assche
@ 2023-05-22 18:38 ` Bart Van Assche
  2023-05-23  7:19   ` Christoph Hellwig
  2023-05-22 18:38 ` [PATCH v3 5/7] block: Preserve the order of requeued requests Bart Van Assche
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Bart Van Assche @ 2023-05-22 18:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Mike Snitzer, Damien Le Moal,
	Jaegeuk Kim, Bart Van Assche, Ming Lei

Issue a kernel warning if a zoned write is passed directly to the block
driver instead of to the I/O scheduler, since bypassing the I/O scheduler
may cause zoned write reordering.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index bc52a57641e2..9ef6fa5d7471 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2429,6 +2429,9 @@ static void blk_mq_request_bypass_insert(struct request *rq, blk_insert_t flags)
 {
 	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 
+	WARN_ON_ONCE(rq->rq_flags & RQF_USE_SCHED &&
+		     blk_rq_is_seq_zoned_write(rq));
+
 	spin_lock(&hctx->lock);
 	if (flags & BLK_MQ_INSERT_AT_HEAD)
 		list_add(&rq->queuelist, &hctx->dispatch);
@@ -2562,6 +2565,9 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
 	};
 	blk_status_t ret;
 
+	WARN_ON_ONCE(rq->rq_flags & RQF_USE_SCHED &&
+		     blk_rq_is_seq_zoned_write(rq));
+
 	/*
 	 * For OK queue, we are done. For error, caller may kill it.
 	 * Any other error (busy), just add it to our list as we

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 5/7] block: Preserve the order of requeued requests
  2023-05-22 18:38 [PATCH v3 0/7] Submit zoned writes in order Bart Van Assche
                   ` (3 preceding siblings ...)
  2023-05-22 18:38 ` [PATCH v3 4/7] block: Make it easier to debug zoned write reordering Bart Van Assche
@ 2023-05-22 18:38 ` Bart Van Assche
  2023-05-22 18:38 ` [PATCH v3 6/7] dm: Inline __dm_mq_kick_requeue_list() Bart Van Assche
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 38+ messages in thread
From: Bart Van Assche @ 2023-05-22 18:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Mike Snitzer, Damien Le Moal,
	Jaegeuk Kim, Bart Van Assche, Ming Lei

If a queue is run before all requeued requests have been sent to the I/O
scheduler, the I/O scheduler may dispatch the wrong request. Fix this by
making blk_mq_run_hw_queue() process the requeue_list instead of
blk_mq_requeue_work().

Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq.c         | 63 +++++++++++++++++++++---------------------
 include/linux/blkdev.h |  1 -
 2 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9ef6fa5d7471..52dffdc70480 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -68,6 +68,8 @@ static inline blk_qc_t blk_rq_to_qc(struct request *rq)
 static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
 {
 	return !list_empty_careful(&hctx->dispatch) ||
+		!list_empty_careful(&hctx->queue->requeue_list) ||
+		!list_empty_careful(&hctx->queue->flush_list) ||
 		sbitmap_any_bit_set(&hctx->ctx_map) ||
 			blk_mq_sched_has_work(hctx);
 }
@@ -1432,52 +1434,52 @@ void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
 }
 EXPORT_SYMBOL(blk_mq_requeue_request);
 
-static void blk_mq_requeue_work(struct work_struct *work)
+static void blk_mq_process_requeue_list(struct blk_mq_hw_ctx *hctx)
 {
-	struct request_queue *q =
-		container_of(work, struct request_queue, requeue_work.work);
-	LIST_HEAD(requeue_list);
-	LIST_HEAD(flush_list);
+	struct request_queue *q = hctx->queue;
 	struct request *rq, *next;
+	LIST_HEAD(at_head);
+	LIST_HEAD(at_tail);
 
-	spin_lock_irq(&q->requeue_lock);
-	list_splice_init(&q->requeue_list, &requeue_list);
-	list_splice_init(&q->flush_list, &flush_list);
-	spin_unlock_irq(&q->requeue_lock);
+	if (list_empty_careful(&q->requeue_list) &&
+	    list_empty_careful(&q->flush_list))
+		return;
 
-	list_for_each_entry_safe(rq, next, &requeue_list, queuelist) {
-		if (!(rq->rq_flags & RQF_DONTPREP)) {
+	spin_lock_irq(&q->requeue_lock);
+	list_for_each_entry_safe(rq, next, &q->requeue_list, queuelist) {
+		if (!blk_queue_sq_sched(q) && rq->mq_hctx != hctx)
+			continue;
+		if (rq->rq_flags & RQF_DONTPREP) {
+			list_move_tail(&rq->queuelist, &at_tail);
+		} else {
 			list_del_init(&rq->queuelist);
-			blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
+			list_move_tail(&rq->queuelist, &at_head);
 		}
 	}
-
-	while (!list_empty(&requeue_list)) {
-		rq = list_entry(requeue_list.next, struct request, queuelist);
-		list_del_init(&rq->queuelist);
-		blk_mq_insert_request(rq, 0);
+	list_for_each_entry_safe(rq, next, &q->flush_list, queuelist) {
+		if (!blk_queue_sq_sched(q) && rq->mq_hctx != hctx)
+			continue;
+		list_move_tail(&rq->queuelist, &at_tail);
 	}
+	spin_unlock_irq(&q->requeue_lock);
 
-	while (!list_empty(&flush_list)) {
-		rq = list_entry(flush_list.next, struct request, queuelist);
-		list_del_init(&rq->queuelist);
-		blk_mq_insert_request(rq, 0);
-	}
+	list_for_each_entry_safe(rq, next, &at_head, queuelist)
+		blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
 
-	blk_mq_run_hw_queues(q, false);
+	list_for_each_entry_safe(rq, next, &at_tail, queuelist)
+		blk_mq_insert_request(rq, 0);
 }
 
 void blk_mq_kick_requeue_list(struct request_queue *q)
 {
-	kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND, &q->requeue_work, 0);
+	blk_mq_run_hw_queues(q, true);
 }
 EXPORT_SYMBOL(blk_mq_kick_requeue_list);
 
 void blk_mq_delay_kick_requeue_list(struct request_queue *q,
 				    unsigned long msecs)
 {
-	kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND, &q->requeue_work,
-				    msecs_to_jiffies(msecs));
+	blk_mq_delay_run_hw_queues(q, msecs);
 }
 EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
 
@@ -2244,6 +2246,7 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 		return;
 	}
 
+	blk_mq_process_requeue_list(hctx);
 	blk_mq_run_dispatch_ops(hctx->queue,
 				blk_mq_sched_dispatch_requests(hctx));
 }
@@ -2292,7 +2295,7 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
 		 * scheduler.
 		 */
 		if (!sq_hctx || sq_hctx == hctx ||
-		    !list_empty_careful(&hctx->dispatch))
+		    blk_mq_hctx_has_pending(hctx))
 			blk_mq_run_hw_queue(hctx, async);
 	}
 }
@@ -2328,7 +2331,7 @@ void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs)
 		 * scheduler.
 		 */
 		if (!sq_hctx || sq_hctx == hctx ||
-		    !list_empty_careful(&hctx->dispatch))
+		    blk_mq_hctx_has_pending(hctx))
 			blk_mq_delay_run_hw_queue(hctx, msecs);
 	}
 }
@@ -2413,6 +2416,7 @@ static void blk_mq_run_work_fn(struct work_struct *work)
 	struct blk_mq_hw_ctx *hctx =
 		container_of(work, struct blk_mq_hw_ctx, run_work.work);
 
+	blk_mq_process_requeue_list(hctx);
 	blk_mq_run_dispatch_ops(hctx->queue,
 				blk_mq_sched_dispatch_requests(hctx));
 }
@@ -4237,7 +4241,6 @@ int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
 	blk_mq_update_poll_flag(q);
 
-	INIT_DELAYED_WORK(&q->requeue_work, blk_mq_requeue_work);
 	INIT_LIST_HEAD(&q->flush_list);
 	INIT_LIST_HEAD(&q->requeue_list);
 	spin_lock_init(&q->requeue_lock);
@@ -4786,8 +4789,6 @@ void blk_mq_cancel_work_sync(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned long i;
 
-	cancel_delayed_work_sync(&q->requeue_work);
-
 	queue_for_each_hw_ctx(q, hctx, i)
 		cancel_delayed_work_sync(&hctx->run_work);
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index fe99948688df..f410cce7289b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -491,7 +491,6 @@ struct request_queue {
 
 	struct list_head	requeue_list;
 	spinlock_t		requeue_lock;
-	struct delayed_work	requeue_work;
 
 	struct mutex		sysfs_lock;
 	struct mutex		sysfs_dir_lock;

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 6/7] dm: Inline __dm_mq_kick_requeue_list()
  2023-05-22 18:38 [PATCH v3 0/7] Submit zoned writes in order Bart Van Assche
                   ` (4 preceding siblings ...)
  2023-05-22 18:38 ` [PATCH v3 5/7] block: Preserve the order of requeued requests Bart Van Assche
@ 2023-05-22 18:38 ` Bart Van Assche
  2023-05-23  7:22   ` Christoph Hellwig
  2023-05-22 18:38 ` [PATCH v3 7/7] block: Inline blk_mq_{,delay_}kick_requeue_list() Bart Van Assche
  2023-05-23  7:22 ` [PATCH v3 0/7] Submit zoned writes in order Christoph Hellwig
  7 siblings, 1 reply; 38+ messages in thread
From: Bart Van Assche @ 2023-05-22 18:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Mike Snitzer, Damien Le Moal,
	Jaegeuk Kim, Bart Van Assche, Ming Lei, Alasdair Kergon, dm-devel

Since commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped
queues"), the function __dm_mq_kick_requeue_list() has been too short to be
worth keeping as a separate function. Hence, inline it.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/md/dm-rq.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index f7e9a3632eb3..bbe1e2ea0aa4 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -168,21 +168,16 @@ static void dm_end_request(struct request *clone, blk_status_t error)
 	rq_completed(md);
 }
 
-static void __dm_mq_kick_requeue_list(struct request_queue *q, unsigned long msecs)
-{
-	blk_mq_delay_kick_requeue_list(q, msecs);
-}
-
 void dm_mq_kick_requeue_list(struct mapped_device *md)
 {
-	__dm_mq_kick_requeue_list(md->queue, 0);
+	blk_mq_kick_requeue_list(md->queue);
 }
 EXPORT_SYMBOL(dm_mq_kick_requeue_list);
 
 static void dm_mq_delay_requeue_request(struct request *rq, unsigned long msecs)
 {
 	blk_mq_requeue_request(rq, false);
-	__dm_mq_kick_requeue_list(rq->q, msecs);
+	blk_mq_delay_kick_requeue_list(rq->q, msecs);
 }
 
 static void dm_requeue_original_request(struct dm_rq_target_io *tio, bool delay_requeue)

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v3 7/7] block: Inline blk_mq_{,delay_}kick_requeue_list()
  2023-05-22 18:38 [PATCH v3 0/7] Submit zoned writes in order Bart Van Assche
                   ` (5 preceding siblings ...)
  2023-05-22 18:38 ` [PATCH v3 6/7] dm: Inline __dm_mq_kick_requeue_list() Bart Van Assche
@ 2023-05-22 18:38 ` Bart Van Assche
  2023-05-24  8:01   ` Vineeth Vijayan
  2023-05-23  7:22 ` [PATCH v3 0/7] Submit zoned writes in order Christoph Hellwig
  7 siblings, 1 reply; 38+ messages in thread
From: Bart Van Assche @ 2023-05-22 18:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Mike Snitzer, Damien Le Moal,
	Jaegeuk Kim, Bart Van Assche, Ming Lei, Juergen Gross,
	Stefano Stabellini, Roger Pau Monné, Alasdair Kergon,
	dm-devel, Keith Busch, Sagi Grimberg, Vineeth Vijayan,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	James E.J. Bottomley, Martin K. Petersen

Patch "block: Preserve the order of requeued requests" changed
blk_mq_kick_requeue_list() and blk_mq_delay_kick_requeue_list() into
blk_mq_run_hw_queues() and blk_mq_delay_run_hw_queues() calls
respectively. Inline blk_mq_{,delay_}kick_requeue_list() because these
functions are now too short to keep as separate functions.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-flush.c            |  4 ++--
 block/blk-mq-debugfs.c       |  2 +-
 block/blk-mq.c               | 16 +---------------
 drivers/block/ublk_drv.c     |  6 +++---
 drivers/block/xen-blkfront.c |  1 -
 drivers/md/dm-rq.c           |  6 +++---
 drivers/nvme/host/core.c     |  2 +-
 drivers/s390/block/scm_blk.c |  2 +-
 drivers/scsi/scsi_lib.c      |  2 +-
 include/linux/blk-mq.h       |  2 --
 10 files changed, 13 insertions(+), 30 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index dba392cf22be..22170036ddcb 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -191,7 +191,7 @@ static void blk_flush_complete_seq(struct request *rq,
 		spin_lock(&q->requeue_lock);
 		list_add_tail(&rq->queuelist, &q->flush_list);
 		spin_unlock(&q->requeue_lock);
-		blk_mq_kick_requeue_list(q);
+		blk_mq_run_hw_queues(q, true);
 		break;
 
 	case REQ_FSEQ_DONE:
@@ -352,7 +352,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
 	list_add_tail(&flush_rq->queuelist, &q->flush_list);
 	spin_unlock(&q->requeue_lock);
 
-	blk_mq_kick_requeue_list(q);
+	blk_mq_run_hw_queues(q, true);
 }
 
 static enum rq_end_io_ret mq_flush_data_end_io(struct request *rq,
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 68165a50951b..869cc62ed50f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -145,7 +145,7 @@ static ssize_t queue_state_write(void *data, const char __user *buf,
 	} else if (strcmp(op, "start") == 0) {
 		blk_mq_start_stopped_hw_queues(q, true);
 	} else if (strcmp(op, "kick") == 0) {
-		blk_mq_kick_requeue_list(q);
+		blk_mq_run_hw_queues(q, true);
 	} else {
 		pr_err("%s: unsupported operation '%s'\n", __func__, op);
 inval:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 52dffdc70480..34dcfc84d902 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1430,7 +1430,7 @@ void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
 	spin_unlock_irqrestore(&q->requeue_lock, flags);
 
 	if (kick_requeue_list)
-		blk_mq_kick_requeue_list(q);
+		blk_mq_run_hw_queues(q, true);
 }
 EXPORT_SYMBOL(blk_mq_requeue_request);
 
@@ -1470,19 +1470,6 @@ static void blk_mq_process_requeue_list(struct blk_mq_hw_ctx *hctx)
 		blk_mq_insert_request(rq, 0);
 }
 
-void blk_mq_kick_requeue_list(struct request_queue *q)
-{
-	blk_mq_run_hw_queues(q, true);
-}
-EXPORT_SYMBOL(blk_mq_kick_requeue_list);
-
-void blk_mq_delay_kick_requeue_list(struct request_queue *q,
-				    unsigned long msecs)
-{
-	blk_mq_delay_run_hw_queues(q, msecs);
-}
-EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
-
 static bool blk_mq_rq_inflight(struct request *rq, void *priv)
 {
 	/*
@@ -3537,7 +3524,6 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 
 	list_for_each_entry_safe(rq, next, &tmp, queuelist)
 		blk_mq_requeue_request(rq, false);
-	blk_mq_kick_requeue_list(hctx->queue);
 
 	blk_mq_run_hw_queue(hctx, true);
 	return 0;
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 539eada32861..4a3d579a25b5 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -900,7 +900,7 @@ static inline void __ublk_rq_task_work(struct request *req,
 		 */
 		if (unlikely(!mapped_bytes)) {
 			blk_mq_requeue_request(req, false);
-			blk_mq_delay_kick_requeue_list(req->q,
+			blk_mq_delay_run_hw_queues(req->q,
 					UBLK_REQUEUE_DELAY_MS);
 			return;
 		}
@@ -1290,7 +1290,7 @@ static void ublk_unquiesce_dev(struct ublk_device *ub)
 
 	blk_mq_unquiesce_queue(ub->ub_disk->queue);
 	/* We may have requeued some rqs in ublk_quiesce_queue() */
-	blk_mq_kick_requeue_list(ub->ub_disk->queue);
+	blk_mq_run_hw_queues(ub->ub_disk->queue, true);
 }
 
 static void ublk_stop_dev(struct ublk_device *ub)
@@ -2334,7 +2334,7 @@ static int ublk_ctrl_end_recovery(struct ublk_device *ub,
 	blk_mq_unquiesce_queue(ub->ub_disk->queue);
 	pr_devel("%s: queue unquiesced, dev id %d.\n",
 			__func__, header->dev_id);
-	blk_mq_kick_requeue_list(ub->ub_disk->queue);
+	blk_mq_run_hw_queues(ub->ub_disk->queue, true);
 	ub->dev_info.state = UBLK_S_DEV_LIVE;
 	schedule_delayed_work(&ub->monitor_work, UBLK_DAEMON_MONITOR_PERIOD);
 	ret = 0;
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 23ed258b57f0..6b37a134dd3a 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -2040,7 +2040,6 @@ static int blkif_recover(struct blkfront_info *info)
 		blk_mq_requeue_request(req, false);
 	}
 	blk_mq_start_stopped_hw_queues(info->rq, true);
-	blk_mq_kick_requeue_list(info->rq);
 
 	while ((bio = bio_list_pop(&info->bio_list)) != NULL) {
 		/* Traverse the list of pending bios and re-queue them */
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index bbe1e2ea0aa4..6421cc2c9852 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -64,7 +64,7 @@ int dm_request_based(struct mapped_device *md)
 void dm_start_queue(struct request_queue *q)
 {
 	blk_mq_unquiesce_queue(q);
-	blk_mq_kick_requeue_list(q);
+	blk_mq_run_hw_queues(q, true);
 }
 
 void dm_stop_queue(struct request_queue *q)
@@ -170,14 +170,14 @@ static void dm_end_request(struct request *clone, blk_status_t error)
 
 void dm_mq_kick_requeue_list(struct mapped_device *md)
 {
-	blk_mq_kick_requeue_list(md->queue);
+	blk_mq_run_hw_queues(md->queue, true);
 }
 EXPORT_SYMBOL(dm_mq_kick_requeue_list);
 
 static void dm_mq_delay_requeue_request(struct request *rq, unsigned long msecs)
 {
 	blk_mq_requeue_request(rq, false);
-	blk_mq_delay_kick_requeue_list(rq->q, msecs);
+	blk_mq_delay_run_hw_queues(rq->q, msecs);
 }
 
 static void dm_requeue_original_request(struct dm_rq_target_io *tio, bool delay_requeue)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ccb6eb1282f8..9d3e4de23787 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -303,7 +303,7 @@ static void nvme_retry_req(struct request *req)
 
 	nvme_req(req)->retries++;
 	blk_mq_requeue_request(req, false);
-	blk_mq_delay_kick_requeue_list(req->q, delay);
+	blk_mq_delay_run_hw_queues(req->q, delay);
 }
 
 static void nvme_log_error(struct request *req)
diff --git a/drivers/s390/block/scm_blk.c b/drivers/s390/block/scm_blk.c
index 0c1df1d5f1ac..fe5937d28fdc 100644
--- a/drivers/s390/block/scm_blk.c
+++ b/drivers/s390/block/scm_blk.c
@@ -243,7 +243,7 @@ static void scm_request_requeue(struct scm_request *scmrq)
 
 	atomic_dec(&bdev->queued_reqs);
 	scm_request_done(scmrq);
-	blk_mq_kick_requeue_list(bdev->rq);
+	blk_mq_run_hw_queues(bdev->rq, true);
 }
 
 static void scm_request_finish(struct scm_request *scmrq)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index b7c569a42aa4..d74903221638 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -124,7 +124,7 @@ static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd, unsigned long msecs)
 
 	if (msecs) {
 		blk_mq_requeue_request(rq, false);
-		blk_mq_delay_kick_requeue_list(rq->q, msecs);
+		blk_mq_delay_run_hw_queues(rq->q, msecs);
 	} else
 		blk_mq_requeue_request(rq, true);
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 363894aea0e8..79b67664ace7 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -867,8 +867,6 @@ static inline bool blk_mq_add_to_batch(struct request *req,
 }
 
 void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list);
-void blk_mq_kick_requeue_list(struct request_queue *q);
-void blk_mq_delay_kick_requeue_list(struct request_queue *q, unsigned long msecs);
 void blk_mq_complete_request(struct request *rq);
 bool blk_mq_complete_request_remote(struct request *rq);
 void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx);

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/7] block: Rename a local variable in blk_mq_requeue_work()
  2023-05-22 18:38 ` [PATCH v3 1/7] block: Rename a local variable in blk_mq_requeue_work() Bart Van Assche
@ 2023-05-23  7:12   ` Christoph Hellwig
  2023-05-23  9:48   ` Johannes Thumshirn
  1 sibling, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2023-05-23  7:12 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	Damien Le Moal, Jaegeuk Kim, Ming Lei

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler
  2023-05-22 18:38 ` [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler Bart Van Assche
@ 2023-05-23  7:18   ` Christoph Hellwig
  2023-05-23 22:30     ` Bart Van Assche
  2023-05-23  9:03   ` Ming Lei
  1 sibling, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2023-05-23  7:18 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	Damien Le Moal, Jaegeuk Kim, Ming Lei, Jianchao Wang

On Mon, May 22, 2023 at 11:38:37AM -0700, Bart Van Assche wrote:
> Send requeued requests to the I/O scheduler such that the I/O scheduler
> can control the order in which requests are dispatched.
> 
> This patch reworks commit aef1897cd36d ("blk-mq: insert rq with DONTPREP
> to hctx dispatch list when requeue").

But you need to explain why it is safe.  I think it is because you add
DONTPREP to RQF_NOMERGE_FLAGS, but that really belongs into the commit
log.
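
For reference: the merge path refuses to merge any request that carries a
flag from RQF_NOMERGE_FLAGS. Roughly, as a hedged sketch from memory rather
than the exact in-tree helper (assumes <linux/blk-mq.h>):

static inline bool rq_is_mergeable_sketch(struct request *rq)
{
	/*
	 * Any flag in RQF_NOMERGE_FLAGS makes a request non-mergeable, so an
	 * already-prepared (RQF_DONTPREP) request that is re-inserted through
	 * the I/O scheduler can no longer be merged with other requests.
	 */
	if (rq->rq_flags & RQF_NOMERGE_FLAGS)
		return false;
	/* The real check also excludes passthrough, flush, ... requests. */
	return true;
}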

> +	list_for_each_entry_safe(rq, next, &requeue_list, queuelist) {
> +		if (!(rq->rq_flags & RQF_DONTPREP)) {
>  			list_del_init(&rq->queuelist);
>  			blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
>  		}
>  	}
>  
> +	while (!list_empty(&requeue_list)) {
> +		rq = list_entry(requeue_list.next, struct request, queuelist);
> +		list_del_init(&rq->queuelist);
> +		blk_mq_insert_request(rq, 0);

So now both started and unstarted requests go through
blk_mq_insert_request, and thus potentially the I/O scheduler, but
RQF_DONTPREP requests that are by definition further along in processing
are inserted at the tail, while !RQF_DONTPREP ones get inserted at the head.

This feels wrong, and we probably need to sort out why and how we are
doing head insertion on a requeue first.

> @@ -2064,14 +2060,28 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
>  		bool no_tag = prep == PREP_DISPATCH_NO_TAG &&
>  			((hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) ||
>  			blk_mq_is_shared_tags(hctx->flags));
> +		LIST_HEAD(for_sched);
> +		struct request *next;
>  
>  		if (nr_budgets)
>  			blk_mq_release_budgets(q, list);
>  
>  		spin_lock(&hctx->lock);
> +		list_for_each_entry_safe(rq, next, list, queuelist)
> +			if (rq->rq_flags & RQF_USE_SCHED)
> +				list_move_tail(&rq->queuelist, &for_sched);
>  		list_splice_tail_init(list, &hctx->dispatch);
>  		spin_unlock(&hctx->lock);
>  
> +		if (q->elevator) {
> +			if (q->elevator->type->ops.requeue_request)
> +				list_for_each_entry(rq, &for_sched, queuelist)
> +					q->elevator->type->ops.
> +						requeue_request(rq);
> +			q->elevator->type->ops.insert_requests(hctx, &for_sched,
> +							BLK_MQ_INSERT_AT_HEAD);
> +		}
> +

None of this is explained in the commit log; please elaborate on the
rationale for it.  Also:

 - this code block really belongs in a helper
 - calling ->requeue_request in a loop before calling ->insert_requests
   feels wrong.  A BLK_MQ_INSERT_REQUEUE flag feels like the better
   interface here.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 3/7] block: Requeue requests if a CPU is unplugged
  2023-05-22 18:38 ` [PATCH v3 3/7] block: Requeue requests if a CPU is unplugged Bart Van Assche
@ 2023-05-23  7:19   ` Christoph Hellwig
  2023-05-23  8:17   ` Ming Lei
  1 sibling, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2023-05-23  7:19 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	Damien Le Moal, Jaegeuk Kim, Ming Lei

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/7] block: Make it easier to debug zoned write reordering
  2023-05-22 18:38 ` [PATCH v3 4/7] block: Make it easier to debug zoned write reordering Bart Van Assche
@ 2023-05-23  7:19   ` Christoph Hellwig
  2023-05-23 19:34     ` Bart Van Assche
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2023-05-23  7:19 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	Damien Le Moal, Jaegeuk Kim, Ming Lei

On Mon, May 22, 2023 at 11:38:39AM -0700, Bart Van Assche wrote:
> @@ -2429,6 +2429,9 @@ static void blk_mq_request_bypass_insert(struct request *rq, blk_insert_t flags)
>  {
>  	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
>  
> +	WARN_ON_ONCE(rq->rq_flags & RQF_USE_SCHED &&
> +		     blk_rq_is_seq_zoned_write(rq));
> +
>  	spin_lock(&hctx->lock);
>  	if (flags & BLK_MQ_INSERT_AT_HEAD)
>  		list_add(&rq->queuelist, &hctx->dispatch);
> @@ -2562,6 +2565,9 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>  	};
>  	blk_status_t ret;
>  
> +	WARN_ON_ONCE(rq->rq_flags & RQF_USE_SCHED &&
> +		     blk_rq_is_seq_zoned_write(rq));

What makes sequential writes here special vs other requests that are
supposed to be using the scheduler and not a bypass?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 0/7] Submit zoned writes in order
  2023-05-22 18:38 [PATCH v3 0/7] Submit zoned writes in order Bart Van Assche
                   ` (6 preceding siblings ...)
  2023-05-22 18:38 ` [PATCH v3 7/7] block: Inline blk_mq_{,delay_}kick_requeue_list() Bart Van Assche
@ 2023-05-23  7:22 ` Christoph Hellwig
  2023-05-23 20:04   ` Bart Van Assche
  7 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2023-05-23  7:22 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	Damien Le Moal, Jaegeuk Kim

On Mon, May 22, 2023 at 11:38:35AM -0700, Bart Van Assche wrote:
> - Changed the approach from one requeue list per hctx into preserving one
>   requeue list per request queue.

Can you explain why?  The resulting code looks rather odd to me as we
now reach out to a global list from the per-hctx run_queue helper,
which seems a bit awkward.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 6/7] dm: Inline __dm_mq_kick_requeue_list()
  2023-05-22 18:38 ` [PATCH v3 6/7] dm: Inline __dm_mq_kick_requeue_list() Bart Van Assche
@ 2023-05-23  7:22   ` Christoph Hellwig
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2023-05-23  7:22 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	Damien Le Moal, Jaegeuk Kim, Ming Lei, Alasdair Kergon, dm-devel

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 3/7] block: Requeue requests if a CPU is unplugged
  2023-05-22 18:38 ` [PATCH v3 3/7] block: Requeue requests if a CPU is unplugged Bart Van Assche
  2023-05-23  7:19   ` Christoph Hellwig
@ 2023-05-23  8:17   ` Ming Lei
  2023-05-23 20:15     ` Bart Van Assche
  1 sibling, 1 reply; 38+ messages in thread
From: Ming Lei @ 2023-05-23  8:17 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	Damien Le Moal, Jaegeuk Kim, Ming Lei

On Tue, May 23, 2023 at 2:39 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> Requeue requests instead of sending these to the dispatch list if a CPU
> is unplugged. This gives the I/O scheduler the chance to preserve the
> order of zoned write requests.

But the affected code path is only taken for queues with the none
scheduler. Do you think none can maintain the order of write requests?

Thanks,


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler
  2023-05-22 18:38 ` [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler Bart Van Assche
  2023-05-23  7:18   ` Christoph Hellwig
@ 2023-05-23  9:03   ` Ming Lei
  2023-05-23 17:19     ` Bart Van Assche
  1 sibling, 1 reply; 38+ messages in thread
From: Ming Lei @ 2023-05-23  9:03 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	Damien Le Moal, Jaegeuk Kim, Jianchao Wang, ming.lei

On Mon, May 22, 2023 at 11:38:37AM -0700, Bart Van Assche wrote:
> Send requeued requests to the I/O scheduler such that the I/O scheduler
> can control the order in which requests are dispatched.

I guess you are addressing UFS zoned for REQ_OP_WRITE:

https://lore.kernel.org/linux-block/8e88b22e-fdf2-5182-02fe-9876e8148947@acm.org/

I am wondering how this approach can maintain the order of zoned write
requests.

Requeued writes may happen in any order: for example, req A and req B
are in order, but req B is requeued first, followed by req A.

So req B is requeued to the scheduler first and issued to the LLD, then
req A is requeued and issued to the LLD later, so they are still
reordered?

Or can sd_zbc provide a requeue-order guarantee?

Thanks,
Ming


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/7] block: Rename a local variable in blk_mq_requeue_work()
  2023-05-22 18:38 ` [PATCH v3 1/7] block: Rename a local variable in blk_mq_requeue_work() Bart Van Assche
  2023-05-23  7:12   ` Christoph Hellwig
@ 2023-05-23  9:48   ` Johannes Thumshirn
  1 sibling, 0 replies; 38+ messages in thread
From: Johannes Thumshirn @ 2023-05-23  9:48 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block@vger.kernel.org, Christoph Hellwig, Mike Snitzer,
	Damien Le Moal, Jaegeuk Kim, Ming Lei

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler
  2023-05-23  9:03   ` Ming Lei
@ 2023-05-23 17:19     ` Bart Van Assche
  2023-05-24  0:31       ` Ming Lei
  0 siblings, 1 reply; 38+ messages in thread
From: Bart Van Assche @ 2023-05-23 17:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	Damien Le Moal, Jaegeuk Kim, Jianchao Wang

On 5/23/23 02:03, Ming Lei wrote:
> On Mon, May 22, 2023 at 11:38:37AM -0700, Bart Van Assche wrote:
>> Send requeued requests to the I/O scheduler such that the I/O scheduler
>> can control the order in which requests are dispatched.
> 
> I guess you are addressing UFS zoned for REQ_OP_WRITE:
> 
> https://lore.kernel.org/linux-block/8e88b22e-fdf2-5182-02fe-9876e8148947@acm.org/
> 
> I am wondering how this approach can maintain the order of zoned write
> requests.
> 
> Requeued writes may happen in any order: for example, req A and req B
> are in order, but req B is requeued first, followed by req A.
> 
> So req B is requeued to the scheduler first and issued to the LLD, then
> req A is requeued and issued to the LLD later, so they are still
> reordered?
> 
> Or can sd_zbc provide a requeue-order guarantee?

Hi Ming,

The mq-deadline scheduler restricts the queue depth to one per zone for zoned
storage so at any time there is at most one write command (REQ_OP_WRITE) in
flight per zone.
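
For illustration, here is a hedged sketch of that per-zone serialization
(hypothetical data structure and helpers, not the actual mq-deadline code;
blk_rq_is_seq_zoned_write() is the helper used elsewhere in this series):

#include <linux/bitops.h>
#include <linux/blk-mq.h>

struct zone_write_state {
	unsigned long *write_in_flight;	/* one bit per zone */
};

/* Dispatch a request only if no other write to its zone is in flight. */
static bool may_dispatch(struct zone_write_state *zws, struct request *rq)
{
	if (!blk_rq_is_seq_zoned_write(rq))
		return true;
	return !test_and_set_bit(blk_rq_zone_no(rq), zws->write_in_flight);
}

/* On completion, allow the next write to the same zone to be dispatched. */
static void write_completed(struct zone_write_state *zws, struct request *rq)
{
	if (blk_rq_is_seq_zoned_write(rq))
		clear_bit(blk_rq_zone_no(rq), zws->write_in_flight);
}

With at most one write per zone in flight, a requeued write cannot race
with another write to the same zone.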

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/7] block: Make it easier to debug zoned write reordering
  2023-05-23  7:19   ` Christoph Hellwig
@ 2023-05-23 19:34     ` Bart Van Assche
  2023-05-24  6:13       ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Bart Van Assche @ 2023-05-23 19:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Mike Snitzer, Damien Le Moal,
	Jaegeuk Kim, Ming Lei

On 5/23/23 00:19, Christoph Hellwig wrote:
> On Mon, May 22, 2023 at 11:38:39AM -0700, Bart Van Assche wrote:
>> @@ -2429,6 +2429,9 @@ static void blk_mq_request_bypass_insert(struct request *rq, blk_insert_t flags)
>>   {
>>   	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
>>   
>> +	WARN_ON_ONCE(rq->rq_flags & RQF_USE_SCHED &&
>> +		     blk_rq_is_seq_zoned_write(rq));
>> +
>>   	spin_lock(&hctx->lock);
>>   	if (flags & BLK_MQ_INSERT_AT_HEAD)
>>   		list_add(&rq->queuelist, &hctx->dispatch);
>> @@ -2562,6 +2565,9 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>>   	};
>>   	blk_status_t ret;
>>   
>> +	WARN_ON_ONCE(rq->rq_flags & RQF_USE_SCHED &&
>> +		     blk_rq_is_seq_zoned_write(rq));
> 
> What makes sequential writes here special vs other requests that are
> supposed to be using the scheduler and not a bypass?

Hi Christoph,

If some REQ_OP_WRITE or REQ_OP_WRITE_ZEROES requests are submitted to 
the I/O scheduler and others bypass the I/O scheduler this may lead to 
reordering. Hence this patch that triggers a kernel warning if any 
REQ_OP_WRITE or REQ_OP_WRITE_ZEROES requests bypass the I/O scheduler.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 0/7] Submit zoned writes in order
  2023-05-23  7:22 ` [PATCH v3 0/7] Submit zoned writes in order Christoph Hellwig
@ 2023-05-23 20:04   ` Bart Van Assche
  2023-05-24  6:15     ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Bart Van Assche @ 2023-05-23 20:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Mike Snitzer, Damien Le Moal,
	Jaegeuk Kim

On 5/23/23 00:22, Christoph Hellwig wrote:
> On Mon, May 22, 2023 at 11:38:35AM -0700, Bart Van Assche wrote:
>> - Changed the approach from one requeue list per hctx into preserving one
>>    requeue list per request queue.
> 
> Can you explain why?  The resulting code looks rather odd to me as we
> now reach out to a global list from the per-hctx run_queue helper,
> which seems a bit awkward.

Hi Christoph,

This change is based on the assumption that requeuing and flushing are 
relatively rare events. Do you perhaps want me to change the approach 
back to one requeue list and one flush list per hctx?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 3/7] block: Requeue requests if a CPU is unplugged
  2023-05-23  8:17   ` Ming Lei
@ 2023-05-23 20:15     ` Bart Van Assche
  2023-05-24  0:35       ` Ming Lei
  0 siblings, 1 reply; 38+ messages in thread
From: Bart Van Assche @ 2023-05-23 20:15 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	Damien Le Moal, Jaegeuk Kim

On 5/23/23 01:17, Ming Lei wrote:
> On Tue, May 23, 2023 at 2:39 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>
>> Requeue requests instead of sending these to the dispatch list if a CPU
>> is unplugged. This gives the I/O scheduler the chance to preserve the
>> order of zoned write requests.
> 
> But the affected code path is only taken for queues with the none
> scheduler. Do you think none can maintain the order of write requests?

Hi Ming,

Doesn't blk_mq_insert_requests() insert requests in ctx->rq_lists[type] 
whether or not an I/O scheduler is active?

I haven't found any code in blk_mq_hctx_notify_dead() that makes this 
function behave differently based on whether or not an I/O scheduler has 
been associated with the request queue. Did I perhaps overlook something?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler
  2023-05-23  7:18   ` Christoph Hellwig
@ 2023-05-23 22:30     ` Bart Van Assche
  2023-05-24  6:13       ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Bart Van Assche @ 2023-05-23 22:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Mike Snitzer, Damien Le Moal,
	Jaegeuk Kim, Ming Lei, Jianchao Wang

On 5/23/23 00:18, Christoph Hellwig wrote:
> On Mon, May 22, 2023 at 11:38:37AM -0700, Bart Van Assche wrote:
>> Send requeued requests to the I/O scheduler such that the I/O scheduler
>> can control the order in which requests are dispatched.
>>
>> This patch reworks commit aef1897cd36d ("blk-mq: insert rq with DONTPREP
>> to hctx dispatch list when requeue").
> 
> But you need to explain why it is safe.  I think it is because you add
> DONTPREP to RQF_NOMERGE_FLAGS, but that really belongs into the commit
> log.

Hi Christoph,

I will add the above explanation in the commit message.

> 
>> +	list_for_each_entry_safe(rq, next, &requeue_list, queuelist) {
>> +		if (!(rq->rq_flags & RQF_DONTPREP)) {
>>   			list_del_init(&rq->queuelist);
>>   			blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
>>   		}
>>   	}
>>   
>> +	while (!list_empty(&requeue_list)) {
>> +		rq = list_entry(requeue_list.next, struct request, queuelist);
>> +		list_del_init(&rq->queuelist);
>> +		blk_mq_insert_request(rq, 0);
> 
> So now both started and unstarted requests go through
> blk_mq_insert_request, and thus potentially the I/O scheduler, but
> RQF_DONTPREP request that are by definition further along in processing
> are inserted at the tail, while !RQF_DONTPREP ones get inserted at head.
> 
> This feels wrong, and we probably need to sort out why and how we are
> doing head insertation on a requeue first.

The use cases for requeuing requests are as follows:
* Resubmitting a partially completed request (ATA, SCSI, loop, ...).
* Handling connection recovery (nbd, ublk, ...).
* Simulating the "device busy" condition (null_blk).
* Handling the queue full condition (virtio-blk).
* Handling path errors by dm-mpath (DM_ENDIO_REQUEUE).
* Handling retryable I/O errors (mmc, NVMe).
* Handling unit attentions (SCSI).
* Unplugging a CPU.

Do you perhaps want me to select BLK_MQ_INSERT_AT_HEAD for all requeued requests?

>> +		if (q->elevator) {
>> +			if (q->elevator->type->ops.requeue_request)
>> +				list_for_each_entry(rq, &for_sched, queuelist)
>> +					q->elevator->type->ops.
>> +						requeue_request(rq);
>> +			q->elevator->type->ops.insert_requests(hctx, &for_sched,
>> +							BLK_MQ_INSERT_AT_HEAD);
>> +		}
>> +
> 
> None of this is explained in the commit log; please elaborate on the
> rationale for it.  Also:
> 
>   - this code block really belongs in a helper
>   - calling ->requeue_request in a loop before calling ->insert_requests
>     feels wrong.  A BLK_MQ_INSERT_REQUEUE flag feels like the better
>     interface here.

Pushing a request back to hctx->dispatch is a requeuing mechanism. I want
to send requeued requests to the I/O scheduler.

Regarding the BLK_MQ_INSERT_REQUEUE suggestion, how about adding the patch
below near the start of this patch series?

Thanks,

Bart.


block: Remove elevator_type.requeue_request()

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 117aec1791c0..319d3c5a0f85 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6232,6 +6232,7 @@ static inline void bfq_update_insert_stats(struct request_queue *q,
  #endif /* CONFIG_BFQ_CGROUP_DEBUG */

  static struct bfq_queue *bfq_init_rq(struct request *rq);
+static void bfq_finish_requeue_request(struct request *rq);

  static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
  			       blk_insert_t flags)
@@ -6243,6 +6244,11 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
  	blk_opf_t cmd_flags;
  	LIST_HEAD(free);

+	if (rq->rq_flags & RQF_REQUEUED) {
+		rq->rq_flags &= ~RQF_REQUEUED;
+		bfq_finish_requeue_request(rq);
+	}
+
  #ifdef CONFIG_BFQ_GROUP_IOSCHED
  	if (!cgroup_subsys_on_dfl(io_cgrp_subsys) && rq->bio)
  		bfqg_stats_update_legacy_io(q, rq);
@@ -7596,7 +7602,6 @@ static struct elevator_type iosched_bfq_mq = {
  	.ops = {
  		.limit_depth		= bfq_limit_depth,
  		.prepare_request	= bfq_prepare_request,
-		.requeue_request        = bfq_finish_requeue_request,
  		.finish_request		= bfq_finish_request,
  		.exit_icq		= bfq_exit_icq,
  		.insert_requests	= bfq_insert_requests,
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 1326526bb733..8d4b835539b1 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -58,13 +58,8 @@ static inline void blk_mq_sched_completed_request(struct request *rq, u64 now)

  static inline void blk_mq_sched_requeue_request(struct request *rq)
  {
-	if (rq->rq_flags & RQF_USE_SCHED) {
-		struct request_queue *q = rq->q;
-		struct elevator_queue *e = q->elevator;
-
-		if (e->type->ops.requeue_request)
-			e->type->ops.requeue_request(rq);
-	}
+	if (rq->rq_flags & RQF_USE_SCHED)
+		rq->rq_flags |= RQF_REQUEUED;
  }

  static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
diff --git a/block/elevator.h b/block/elevator.h
index 7ca3d7b6ed82..c1f459eebc9f 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -43,7 +43,6 @@ struct elevator_mq_ops {
  	struct request *(*dispatch_request)(struct blk_mq_hw_ctx *);
  	bool (*has_work)(struct blk_mq_hw_ctx *);
  	void (*completed_request)(struct request *, u64);
-	void (*requeue_request)(struct request *);
  	struct request *(*former_request)(struct request_queue *, struct request *);
  	struct request *(*next_request)(struct request_queue *, struct request *);
  	void (*init_icq)(struct io_cq *);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index f495be433276..b1f2e172fe61 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -608,6 +608,10 @@ static void kyber_insert_requests(struct blk_mq_hw_ctx *hctx,

  		spin_lock(&kcq->lock);
  		trace_block_rq_insert(rq);
+		if (rq->rq_flags & RQF_REQUEUED) {
+			rq->rq_flags &= ~RQF_REQUEUED;
+			kyber_finish_request(rq);
+		}
  		if (flags & BLK_MQ_INSERT_AT_HEAD)
  			list_move(&rq->queuelist, head);
  		else
@@ -1022,7 +1026,6 @@ static struct elevator_type kyber_sched = {
  		.prepare_request = kyber_prepare_request,
  		.insert_requests = kyber_insert_requests,
  		.finish_request = kyber_finish_request,
-		.requeue_request = kyber_finish_request,
  		.completed_request = kyber_completed_request,
  		.dispatch_request = kyber_dispatch_request,
  		.has_work = kyber_has_work,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d778cb6b2112..861ca3bb0bce 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -59,6 +59,9 @@ typedef __u32 __bitwise req_flags_t;
  #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
  /* ->timeout has been called, don't expire again */
  #define RQF_TIMED_OUT		((__force req_flags_t)(1 << 21))
+/* The request is about to be requeued. */
+#define RQF_REQUEUED		((__force req_flags_t)(1 << 22))
+/* A reserved tag has been assigned to the request. */
  #define RQF_RESV		((__force req_flags_t)(1 << 23))

  /* flags that prevent us from merging requests: */

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler
  2023-05-23 17:19     ` Bart Van Assche
@ 2023-05-24  0:31       ` Ming Lei
  2023-05-24 17:56         ` Bart Van Assche
  0 siblings, 1 reply; 38+ messages in thread
From: Ming Lei @ 2023-05-24  0:31 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	Damien Le Moal, Jaegeuk Kim, Jianchao Wang, ming.lei

On Tue, May 23, 2023 at 10:19:34AM -0700, Bart Van Assche wrote:
> On 5/23/23 02:03, Ming Lei wrote:
> > On Mon, May 22, 2023 at 11:38:37AM -0700, Bart Van Assche wrote:
> > > Send requeued requests to the I/O scheduler such that the I/O scheduler
> > > can control the order in which requests are dispatched.
> > 
> > I guess you are addressing UFS zoned for REQ_OP_WRITE:
> > 
> > https://lore.kernel.org/linux-block/8e88b22e-fdf2-5182-02fe-9876e8148947@acm.org/
> > 
> > I am wondering how this approach can maintain the order of zoned write
> > requests.
> > 
> > Requeued writes may happen in any order: for example, req A and req B
> > are in order, but req B is requeued first, followed by req A.
> > 
> > So req B is requeued to the scheduler first and issued to the LLD, then
> > req A is requeued and issued to the LLD later, so they are still
> > reordered?
> > 
> > Or can sd_zbc provide a requeue-order guarantee?
> 
> Hi Ming,
> 
> The mq-deadline scheduler restricts the queue depth to one per zone for zoned
> storage so at any time there is at most one write command (REQ_OP_WRITE) in
> flight per zone.

But if the write queue depth is 1 per zone, the requeued request won't
be reordered at all, given that no other write request can be issued by
the scheduler for this zone before the requeued request has completed.

So why bother to requeue the BLK_STS_RESOURCE request via the scheduler?

Thanks,
Ming


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 3/7] block: Requeue requests if a CPU is unplugged
  2023-05-23 20:15     ` Bart Van Assche
@ 2023-05-24  0:35       ` Ming Lei
  2023-05-24 18:18         ` Bart Van Assche
  0 siblings, 1 reply; 38+ messages in thread
From: Ming Lei @ 2023-05-24  0:35 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	Damien Le Moal, Jaegeuk Kim

On Tue, May 23, 2023 at 01:15:55PM -0700, Bart Van Assche wrote:
> On 5/23/23 01:17, Ming Lei wrote:
> > On Tue, May 23, 2023 at 2:39 AM Bart Van Assche <bvanassche@acm.org> wrote:
> > > 
> > > Requeue requests instead of sending these to the dispatch list if a CPU
> > > is unplugged. This gives the I/O scheduler the chance to preserve the
> > > order of zoned write requests.
> > 
> > But the affected code path is only for queue with none scheduler, do you
> > think none can maintain the order for write requests?
> 
> Hi Ming,
> 
> Doesn't blk_mq_insert_requests() insert requests in ctx->rq_lists[type]
> whether or not an I/O scheduler is active?

blk_mq_insert_requests() is only called for inserting passthrough
requests in the !q->elevator case.

> 
> I haven't found any code in blk_mq_hctx_notify_dead() that makes this
> function behave differently based on whether or not an I/O scheduler has
> been associated with the request queue. Did I perhaps overlook something?

blk_mq_hctx_notify_dead() is only for handling requests from the sw queue,
which is used for !q->elevator.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler
  2023-05-23 22:30     ` Bart Van Assche
@ 2023-05-24  6:13       ` Christoph Hellwig
  2023-05-24 18:22         ` Bart Van Assche
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2023-05-24  6:13 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Mike Snitzer,
	Damien Le Moal, Jaegeuk Kim, Ming Lei, Jianchao Wang

On Tue, May 23, 2023 at 03:30:16PM -0700, Bart Van Assche wrote:
>  static inline void blk_mq_sched_requeue_request(struct request *rq)
>  {
> -	if (rq->rq_flags & RQF_USE_SCHED) {
> -		struct request_queue *q = rq->q;
> -		struct elevator_queue *e = q->elevator;
> -
> -		if (e->type->ops.requeue_request)
> -			e->type->ops.requeue_request(rq);
> -	}
> +	if (rq->rq_flags & RQF_USE_SCHED)
> +		rq->rq_flags |= RQF_REQUEUED;
>  }

I'd drop this helper function if we go down this way.  But maybe
we might just want to keep the method.  Sorry for the noise.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/7] block: Make it easier to debug zoned write reordering
  2023-05-23 19:34     ` Bart Van Assche
@ 2023-05-24  6:13       ` Christoph Hellwig
  2023-05-24 18:25         ` Bart Van Assche
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2023-05-24  6:13 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Mike Snitzer,
	Damien Le Moal, Jaegeuk Kim, Ming Lei

On Tue, May 23, 2023 at 12:34:38PM -0700, Bart Van Assche wrote:
>> What makes sequential writes here special vs other requests that are
>> supposed to be using the scheduler and not a bypass?
>
> Hi Christoph,
>
> If some REQ_OP_WRITE or REQ_OP_WRITE_ZEROES requests are submitted to the 
> I/O scheduler and others bypass the I/O scheduler this may lead to 
> reordering. Hence this patch that triggers a kernel warning if any 
> REQ_OP_WRITE or REQ_OP_WRITE_ZEROES requests bypass the I/O scheduler.

But why would we ever do a direct insert when using a scheduler for
write (and read, for that matter) commands when not on a zoned device?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 0/7] Submit zoned writes in order
  2023-05-23 20:04   ` Bart Van Assche
@ 2023-05-24  6:15     ` Christoph Hellwig
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2023-05-24  6:15 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Mike Snitzer,
	Damien Le Moal, Jaegeuk Kim

On Tue, May 23, 2023 at 01:04:44PM -0700, Bart Van Assche wrote:
>> Can you explain why?  The resulting code looks rather odd to me as we
>> now reach out to a global list from the per-hctx run_queue helper,
>> which seems a bit awkward.
>
> Hi Christoph,
>
> This change is based on the assumption that requeuing and flushing are 
> relatively rare events.

The former are, the latter not so much.  But more importantly, you now
look at a global list from the per-hctx dispatch path, which adds cache
line sharing.

> Do you perhaps want me to change the approach back 
> to one requeue list and one flush list per hctx?

Unless we have a very good reason to make them global that would
be my preference.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 7/7] block: Inline blk_mq_{,delay_}kick_requeue_list()
  2023-05-22 18:38 ` [PATCH v3 7/7] block: Inline blk_mq_{,delay_}kick_requeue_list() Bart Van Assche
@ 2023-05-24  8:01   ` Vineeth Vijayan
  0 siblings, 0 replies; 38+ messages in thread
From: Vineeth Vijayan @ 2023-05-24  8:01 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Mike Snitzer, Damien Le Moal,
	Jaegeuk Kim, Ming Lei, Juergen Gross, Stefano Stabellini,
	Roger Pau Monné, Alasdair Kergon, dm-devel, Keith Busch,
	Sagi Grimberg, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	James E.J. Bottomley, Martin K. Petersen


On 5/22/23 20:38, Bart Van Assche wrote:
> Patch "block: Preserve the order of requeued requests" changed
> blk_mq_kick_requeue_list() and blk_mq_delay_kick_requeue_list() into
> blk_mq_run_hw_queues() and blk_mq_delay_run_hw_queues() calls
> respectively. Inline blk_mq_{,delay_}kick_requeue_list() because these
> functions are now too short to keep these as separate functions.
>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Mike Snitzer <snitzer@kernel.org>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   block/blk-flush.c            |  4 ++--
>   block/blk-mq-debugfs.c       |  2 +-
>   block/blk-mq.c               | 16 +---------------
>   drivers/block/ublk_drv.c     |  6 +++---
>   drivers/block/xen-blkfront.c |  1 -
>   drivers/md/dm-rq.c           |  6 +++---
>   drivers/nvme/host/core.c     |  2 +-
>   drivers/s390/block/scm_blk.c |  2 +-
>   drivers/scsi/scsi_lib.c      |  2 +-
>   include/linux/blk-mq.h       |  2 --
>   10 files changed, 13 insertions(+), 30 deletions(-)
The scm_blk.c changes look good to me. Thanks.

Acked-by: Vineeth Vijayan <vneethv@linux.ibm.com>

...


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler
  2023-05-24  0:31       ` Ming Lei
@ 2023-05-24 17:56         ` Bart Van Assche
  2023-05-24 23:06           ` Damien Le Moal
  0 siblings, 1 reply; 38+ messages in thread
From: Bart Van Assche @ 2023-05-24 17:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	Damien Le Moal, Jaegeuk Kim, Jianchao Wang

On 5/23/23 17:31, Ming Lei wrote:
> On Tue, May 23, 2023 at 10:19:34AM -0700, Bart Van Assche wrote:
>> The mq-deadline scheduler restricts the queue depth to one per zone for zoned
>> storage so at any time there is at most one write command (REQ_OP_WRITE) in
>> flight per zone.
> 
> But if the write queue depth is 1 per zone, the requeued request won't
> be re-ordered at all given no other write request can be issued from
> scheduler in this zone before this requeued request is completed.
> 
> So why bother to requeue the BLK_STS_RESOURCE request via scheduler?

Hi Ming,

It seems like my previous email was not clear enough. The mq-deadline 
scheduler restricts the queue depth per zone for commands passed to the 
SCSI core. It does not restrict how many requests a filesystem can 
submit per zone to the block layer. Without this patch there is a risk 
of reordering if a request is requeued, e.g. by the SCSI core, and other 
requests are pending for the same zone.
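
To make that concrete, below is a minimal user-space model of the scenario
(plain C, not kernel code; the zone geometry and sector numbers are made up):
a second write for the same zone is already queued, the first write bounces
and ends up behind it, and the device-side write pointer check fails for the
write that overtook it.

#include <stdio.h>

/* One zone starting at sector 0; wp models the device's zone write pointer. */
static unsigned long long wp;

struct zwrite {
        unsigned long long sector;
        unsigned int nr_sectors;
};

static void submit(const struct zwrite *w)
{
        if (w->sector != wp) {
                printf("unaligned write: sector %llu but wp is %llu\n",
                       w->sector, wp);
                return;
        }
        wp += w->nr_sectors;
        printf("write at sector %llu ok, wp is now %llu\n", w->sector, wp);
}

int main(void)
{
        struct zwrite first  = { .sector = 0, .nr_sectors = 8 };
        struct zwrite second = { .sector = 8, .nr_sectors = 8 };

        /*
         * "first" was dispatched, came back with something like
         * BLK_STS_DEV_RESOURCE and was requeued behind "second" instead of
         * in front of it, so the device now sees the writes out of order.
         */
        submit(&second);
        submit(&first);
        return 0;
}

Running this prints one rejected, out-of-order write followed by one aligned
write, which is the kind of failure a zoned device reports for such a sequence.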

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 3/7] block: Requeue requests if a CPU is unplugged
  2023-05-24  0:35       ` Ming Lei
@ 2023-05-24 18:18         ` Bart Van Assche
  0 siblings, 0 replies; 38+ messages in thread
From: Bart Van Assche @ 2023-05-24 18:18 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	Damien Le Moal, Jaegeuk Kim

On 5/23/23 17:35, Ming Lei wrote:
> blk_mq_insert_requests() is only called for inserting passthrough
> requests in the !q->elevator case.
Hmm ... blk_mq_requeue_work() may call blk_mq_insert_request() to 
requeue a request that is not a passthrough request, even when an I/O 
scheduler is active. I think there are more examples in the block layer 
of blk_mq_insert_request() calls that happen if q->elevator != NULL.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler
  2023-05-24  6:13       ` Christoph Hellwig
@ 2023-05-24 18:22         ` Bart Van Assche
  2023-05-25  8:25           ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Bart Van Assche @ 2023-05-24 18:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Mike Snitzer, Damien Le Moal,
	Jaegeuk Kim, Ming Lei, Jianchao Wang

On 5/23/23 23:13, Christoph Hellwig wrote:
> On Tue, May 23, 2023 at 03:30:16PM -0700, Bart Van Assche wrote:
>>   static inline void blk_mq_sched_requeue_request(struct request *rq)
>>   {
>> -	if (rq->rq_flags & RQF_USE_SCHED) {
>> -		struct request_queue *q = rq->q;
>> -		struct elevator_queue *e = q->elevator;
>> -
>> -		if (e->type->ops.requeue_request)
>> -			e->type->ops.requeue_request(rq);
>> -	}
>> +	if (rq->rq_flags & RQF_USE_SCHED)
>> +		rq->rq_flags |= RQF_REQUEUED;
>>   }
> 
> I'd drop this helper function if we go down this way.  But maybe
> we might just want to keep the method.

My understanding is that every .requeue_request() call is followed by a 
.insert_requests() call and hence that we would no longer need the 
.requeue_request() method if the RQF_REQUEUED flag were introduced?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/7] block: Make it easier to debug zoned write reordering
  2023-05-24  6:13       ` Christoph Hellwig
@ 2023-05-24 18:25         ` Bart Van Assche
  0 siblings, 0 replies; 38+ messages in thread
From: Bart Van Assche @ 2023-05-24 18:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Mike Snitzer, Damien Le Moal,
	Jaegeuk Kim, Ming Lei

On 5/23/23 23:13, Christoph Hellwig wrote:
> On Tue, May 23, 2023 at 12:34:38PM -0700, Bart Van Assche wrote:
>>> What makes sequential writes here special vs other requests that are
>>> supposed to be using the scheduler and not a bypass?
>>
>> Hi Christoph,
>>
>> If some REQ_OP_WRITE or REQ_OP_WRITE_ZEROES requests are submitted to the
>> I/O scheduler and others bypass the I/O scheduler this may lead to
>> reordering. Hence this patch that triggers a kernel warning if any
>> REQ_OP_WRITE or REQ_OP_WRITE_ZEROES requests bypass the I/O scheduler.
> 
> But why would we ever do a direct insert when using a scheduler for
> write (and read, for that matter) commands when not on a zoned device?

Patch 2/7 of this series removes a blk_mq_request_bypass_insert() call 
from blk_mq_requeue_work(). I think this shows that without this patch 
series blk_mq_request_bypass_insert() could get called for REQ_OP_WRITE 
/ REQ_OP_WRITE_ZEROES requests.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler
  2023-05-24 17:56         ` Bart Van Assche
@ 2023-05-24 23:06           ` Damien Le Moal
  2023-05-25  0:53             ` Ming Lei
  2023-06-21  0:34             ` Bart Van Assche
  0 siblings, 2 replies; 38+ messages in thread
From: Damien Le Moal @ 2023-05-24 23:06 UTC (permalink / raw)
  To: Bart Van Assche, Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	Jaegeuk Kim, Jianchao Wang

On 5/25/23 02:56, Bart Van Assche wrote:
> On 5/23/23 17:31, Ming Lei wrote:
>> On Tue, May 23, 2023 at 10:19:34AM -0700, Bart Van Assche wrote:
>>> The mq-deadline scheduler restricts the queue depth to one per zone for zoned
>>> storage so at any time there is at most one write command (REQ_OP_WRITE) in
>>> flight per zone.
>>
>> But if the write queue depth is 1 per zone, the requeued request won't
>> be re-ordered at all given no other write request can be issued from
>> scheduler in this zone before this requeued request is completed.
>>
>> So why bother to requeue the BLK_STS_RESOURCE request via scheduler?
> 
> Hi Ming,
> 
> It seems like my previous email was not clear enough. The mq-deadline 
> scheduler restricts the queue depth per zone for commands passed to the 
> SCSI core. It does not restrict how many requests a filesystem can 
> submit per zone to the block layer. Without this patch there is a risk 
> of reordering if a request is requeued, e.g. by the SCSI core, and other 
> requests are pending for the same zone.

Yes there is, but the contract we established for zoned devices in the block
layer, from the start of the support, is that users *must* write sequentially.
The block layer does not attempt, generally speaking, to reorder requests.

When mq-deadline is used, the scheduler lba reordering *may* reorder writes,
thus hiding potential bugs in the user write sequence for a zone. That is fine.
However, once a write request is dispatched, we should keep the assumption that
it is a well formed one, namely directed at the zone write pointer. So any
consideration of requeue solving write ordering issues is moot to me.

Furthermore, when the requeue happens, the target zone is still locked and the
only write request that can be in flight for that target zone is the one being
requeued. Add to that the above assumption that the request is the one we must
dispatch first, and there are absolutely zero chances of seeing a reordering
happen for writes to a particular zone. I really do not see the point of
requeuing that request through the IO scheduler at all.
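
To illustrate that point with a toy model (plain C, not kernel code; the
sector numbers are invented): as long as the requeued write keeps the zone
write lock until it completes, no other write for that zone can be picked,
regardless of where the requeued request sits in any list.

#include <stdbool.h>
#include <stdio.h>

/* Toy zone: the write lock is taken at dispatch and released at completion. */
struct zone {
        bool write_locked;
};

static bool try_dispatch(struct zone *z)
{
        if (z->write_locked)
                return false;           /* another write still owns the zone */
        z->write_locked = true;
        return true;
}

static void complete(struct zone *z)
{
        z->write_locked = false;        /* only completion unlocks the zone */
}

int main(void)
{
        struct zone z = { .write_locked = false };

        printf("dispatch write at 0: %d\n", try_dispatch(&z)); /* 1: goes out */
        /* The write at 0 bounces and is requeued; the zone stays locked. */
        printf("dispatch write at 8: %d\n", try_dispatch(&z)); /* 0: cannot overtake */
        /* The requeued write at 0 is simply sent again and completes. */
        complete(&z);
        printf("dispatch write at 8: %d\n", try_dispatch(&z)); /* 1: only now */
        return 0;
}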

In general, even for reads, requeuing through the scheduler is, I think, a really
bad idea as it can potentially significantly increase the request latency
(time to completion), with the user seeing long tail latencies. E.g. if the
request has high priority or a short CDL time limit, requeuing through the
scheduler will go against the user-indicated urgency for that request and
degrade the effectiveness of latency control measures such as IO priority and CDL.

Requeues should be at the head of the dispatch queue, not through the scheduler.
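
A toy model of the latency concern (again plain C, not kernel code, with
invented LBAs): a requeued request pushed back through an LBA-sorting
scheduler can end up behind requests that arrived after it, whereas head
insertion into the dispatch list keeps it first.

#include <stdio.h>
#include <stdlib.h>

static int cmp_lba(const void *a, const void *b)
{
        long long x = *(const long long *)a, y = *(const long long *)b;

        return (x > y) - (x < y);
}

int main(void)
{
        /*
         * LBA 900 was already dispatched once and is being requeued; LBAs
         * 100..500 were queued while it was out.
         */
        long long lbas[] = { 900, 100, 200, 300, 400, 500 };
        int i, n = sizeof(lbas) / sizeof(lbas[0]);

        /* Requeue through an LBA-sorting scheduler: sorted with the rest. */
        qsort(lbas, n, sizeof(lbas[0]), cmp_lba);
        for (i = 0; i < n; i++)
                if (lbas[i] == 900)
                        printf("via the scheduler it is dispatched %dth\n",
                               i + 1);

        /* Requeue at the head of the dispatch list: it stays first. */
        printf("with head insertion it is dispatched 1st\n");
        return 0;
}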

As long as we keep zone write locking for zoned devices, requeue to the head of
the dispatch queue is fine. But maybe this work is preparatory to removing zone
write locking ? If that is the case, I would like to see that as well to get the
big picture. Otherwise, the latency concerns I raised above are in my opinion, a
blocker for this change.

> 
> Thanks,
> 
> Bart.
> 

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler
  2023-05-24 23:06           ` Damien Le Moal
@ 2023-05-25  0:53             ` Ming Lei
  2023-06-21  0:34             ` Bart Van Assche
  1 sibling, 0 replies; 38+ messages in thread
From: Ming Lei @ 2023-05-25  0:53 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Bart Van Assche, Jens Axboe, linux-block, Christoph Hellwig,
	Mike Snitzer, Jaegeuk Kim, Jianchao Wang, ming.lei

On Thu, May 25, 2023 at 08:06:31AM +0900, Damien Le Moal wrote:
> On 5/25/23 02:56, Bart Van Assche wrote:
> > On 5/23/23 17:31, Ming Lei wrote:
> >> On Tue, May 23, 2023 at 10:19:34AM -0700, Bart Van Assche wrote:
> >>> The mq-deadline scheduler restricts the queue depth to one per zone for zoned
> >>> storage so at any time there is at most one write command (REQ_OP_WRITE) in
> >>> flight per zone.
> >>
> >> But if the write queue depth is 1 per zone, the requeued request won't
> >> be re-ordered at all given no other write request can be issued from
> >> scheduler in this zone before this requeued request is completed.
> >>
> >> So why bother to requeue the BLK_STS_RESOURCE request via scheduler?
> > 
> > Hi Ming,
> > 
> > It seems like my previous email was not clear enough. The mq-deadline 
> > scheduler restricts the queue depth per zone for commands passed to the 
> > SCSI core. It does not restrict how many requests a filesystem can 
> > submit per zone to the block layer. Without this patch there is a risk 
> > of reordering if a request is requeued, e.g. by the SCSI core, and other 
> > requests are pending for the same zone.
> 
> Yes there is, but the contract we established for zoned devices in the block
> layer, from the start of the support, is that users *must* write sequentially.
> The block layer does not attempt, generally speaking, to reorder requests.
> 
> When mq-deadline is used, the scheduler lba reordering *may* reorder writes,
> thus hiding potential bugs in the user write sequence for a zone. That is fine.
> However, once a write request is dispatched, we should keep the assumption that
> it is a well formed one, namely directed at the zone write pointer. So any
> consideration of requeue solving write ordering issues is moot to me.
> 
> Furthermore, when the requeue happens, the target zone is still locked and the
> only write request that can be in flight for that target zone is the one being
> requeued. Add to that the above assumption that the request is the one we must
> dispatch first, and there are absolutely zero chances of seeing a reordering
> happen for writes to a particular zone. I really do not see the point of
> requeuing that request through the IO scheduler at all.

Totally agree, that is exactly what I meant.

> 
> In general, even for reads, requeuing through the scheduler is, I think, a really
> bad idea as it can potentially significantly increase the request latency
> (time to completion), with the user seeing long tail latencies. E.g. if the
> request has high priority or a short CDL time limit, requeuing through the
> scheduler will go against the user-indicated urgency for that request and
> degrade the effectiveness of latency control measures such as IO priority and CDL.
> 
> Requeues should be at the head of the dispatch queue, not through the scheduler.

It is pretty easy to run into BLK_STS_RESOURCE for some SCSI devices,
and requeuing via the scheduler does add more overhead for these devices.

> 
> As long as we keep zone write locking for zoned devices, requeue to the head of
> the dispatch queue is fine. But maybe this work is preparatory to removing zone
> write locking ? If that is the case, I would like to see that as well to get the
> big picture. Otherwise, the latency concerns I raised above are in my opinion, a
> blocker for this change.

Yeah, it looks like Bart needs to say more about the requirements &
motivation of this patchset.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler
  2023-05-24 18:22         ` Bart Van Assche
@ 2023-05-25  8:25           ` Christoph Hellwig
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2023-05-25  8:25 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Mike Snitzer,
	Damien Le Moal, Jaegeuk Kim, Ming Lei, Jianchao Wang

On Wed, May 24, 2023 at 11:22:00AM -0700, Bart Van Assche wrote:
>>>   static inline void blk_mq_sched_requeue_request(struct request *rq)
>>>   {
>>> -	if (rq->rq_flags & RQF_USE_SCHED) {
>>> -		struct request_queue *q = rq->q;
>>> -		struct elevator_queue *e = q->elevator;
>>> -
>>> -		if (e->type->ops.requeue_request)
>>> -			e->type->ops.requeue_request(rq);
>>> -	}
>>> +	if (rq->rq_flags & RQF_USE_SCHED)
>>> +		rq->rq_flags |= RQF_REQUEUED;
>>>   }
>>
>> I'd drop this helper function if we go down this way.  But maybe
>> we might just want to keep the method.
>
> My understanding is that every .requeue_request() call is followed by a 
> .insert_requests() call and hence that we would no longer need the 
> .requeue_request() method if the RQF_REQUEUED flag were introduced?

Yes, but at the same time RQF_REQUEUED now creates global state instead
of state that only lives in a call chain.  That's why I originally suggested
a flag to ->insert_requests instead of leaving state on every request.

>
> Thanks,
>
> Bart.
---end quoted text---

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler
  2023-05-24 23:06           ` Damien Le Moal
  2023-05-25  0:53             ` Ming Lei
@ 2023-06-21  0:34             ` Bart Van Assche
  2023-06-22 23:45               ` Damien Le Moal
  1 sibling, 1 reply; 38+ messages in thread
From: Bart Van Assche @ 2023-06-21  0:34 UTC (permalink / raw)
  To: Damien Le Moal, Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	Jaegeuk Kim, Jianchao Wang

On 5/24/23 16:06, Damien Le Moal wrote:
> When mq-deadline is used, the scheduler lba reordering *may* reorder writes,
> thus hiding potential bugs in the user write sequence for a zone. That is fine.
> However, once a write request is dispatched, we should keep the assumption that
> it is a well formed one, namely directed at the zone write pointer. So any
> consideration of requeue solving write ordering issues is moot to me.
> 
> Furthermore, when the requeue happens, the target zone is still locked and the
> only write request that can be in flight for that target zone is the one being
> requeued. Add to that the above assumption that the request is the one we must
> dispatch first, and there are absolutely zero chances of seeing a reordering
> happen for writes to a particular zone. I really do not see the point of
> requeuing that request through the IO scheduler at all.

I will drop the changes from this patch that send requeued dispatched 
requests to the I/O scheduler instead of back to the dispatch list. It 
took me considerable effort to find all the potential causes of 
reordering. As you may have noticed I also came up with some changes 
that are not essential. I should have realized myself that this change 
is not essential.

> As long as we keep zone write locking for zoned devices, requeue to the head of
> the dispatch queue is fine. But maybe this work is preparatory to removing zone
> write locking ? If that is the case, I would like to see that as well to get the
> big picture. Otherwise, the latency concerns I raised above are in my opinion, a
> blocker for this change.

Regarding removing zone write locking, would it be acceptable to 
implement a solution for SCSI devices before it is clear how to 
implement a solution for NVMe devices? I think a potential solution for 
SCSI devices is to send requests that should be requeued to the SCSI 
error handler instead of to the block layer requeue list. The SCSI error 
handler waits until all pending requests have timed out or have been 
sent to the error handler. The SCSI error handler can be modified such 
that requests are sorted in LBA order before being resubmitted. This 
would solve the nasty issues that would otherwise arise when requeuing 
requests if multiple write requests for the same zone are pending.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler
  2023-06-21  0:34             ` Bart Van Assche
@ 2023-06-22 23:45               ` Damien Le Moal
  2023-06-23 20:31                 ` Bart Van Assche
  0 siblings, 1 reply; 38+ messages in thread
From: Damien Le Moal @ 2023-06-22 23:45 UTC (permalink / raw)
  To: Bart Van Assche, Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	Jaegeuk Kim, Jianchao Wang

On 6/21/23 09:34, Bart Van Assche wrote:
> On 5/24/23 16:06, Damien Le Moal wrote:
>> When mq-deadline is used, the scheduler lba reordering *may* reorder writes,
>> thus hiding potential bugs in the user write sequence for a zone. That is fine.
>> However, once a write request is dispatched, we should keep the assumption that
>> it is a well formed one, namely directed at the zone write pointer. So any
>> consideration of requeue solving write ordering issues is moot to me.
>>
>> Furthermore, when the requeue happens, the target zone is still locked and the
>> only write request that can be in flight for that target zone is the one being
>> requeued. Add to that the above assumption that the request is the one we must
>> dispatch first, and there are absolutely zero chances of seeing a reordering
>> happen for writes to a particular zone. I really do not see the point of
>> requeuing that request through the IO scheduler at all.
> 
> I will drop the changes from this patch that send requeued dispatched 
> requests to the I/O scheduler instead of back to the dispatch list. It 
> took me considerable effort to find all the potential causes of 
> reordering. As you may have noticed I also came up with some changes 
> that are not essential. I should have realized myself that this change 
> is not essential.
> 
>> As long as we keep zone write locking for zoned devices, requeue to the head of
>> the dispatch queue is fine. But maybe this work is preparatory to removing zone
>> write locking ? If that is the case, I would like to see that as well to get the
>> big picture. Otherwise, the latency concerns I raised above are in my opinion, a
>> blocker for this change.
> 
> Regarding removing zone write locking, would it be acceptable to 
> implement a solution for SCSI devices before it is clear how to 
> implement a solution for NVMe devices? I think a potential solution for 
> SCSI devices is to send requests that should be requeued to the SCSI 
> error handler instead of to the block layer requeue list. The SCSI error 
> handler waits until all pending requests have timed out or have been 
> sent to the error handler. The SCSI error handler can be modified such 
> that requests are sorted in LBA order before being resubmitted. This 
> would solve the nasty issues that would otherwise arise when requeuing 
> requests if multiple write requests for the same zone are pending.

I am still thinking that a dedicated hctx for writes to sequential zones may be
the simplest solution for all device types:
1) For scsi HBAs, we can likely gain high qd zone writes, but that needs to be
checked. For AHCI though, we need to keep the max write qd=1 per zone because of
the chipsets reordering command submissions. So we'll need a queue flag saying
"need zone write locking" indicated by the adapter when creating the queue.
2) For NVMe, this would allow high QD writes, with only the penalty of heavier
locking overhead when writes are issued from multiple CPUs.

But I have not started looking at all the details. Need to start prototyping
something. We can try working on this together if you want.
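
Very roughly, and only as a user-space sketch of the queue mapping idea (the
function and numbers below are made up, this is not block layer code):

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy queue mapping: all writes to sequential zones share one dedicated
 * hctx so their submission order is preserved, while everything else keeps
 * the usual per-CPU spreading.
 */
static unsigned int pick_hctx(bool seq_zone_write, unsigned int cpu,
                              unsigned int nr_hw_queues)
{
        if (seq_zone_write)
                return 0;                       /* dedicated, ordered queue */
        return 1 + cpu % (nr_hw_queues - 1);    /* the rest spread out */
}

int main(void)
{
        printf("zone write from cpu 3 -> hctx %u\n", pick_hctx(true, 3, 8));
        printf("zone write from cpu 5 -> hctx %u\n", pick_hctx(true, 5, 8));
        printf("read from cpu 3       -> hctx %u\n", pick_hctx(false, 3, 8));
        printf("read from cpu 5       -> hctx %u\n", pick_hctx(false, 5, 8));
        return 0;
}

Whether a single hctx can actually sustain the write throughput of the whole
device is exactly the part that still needs to be measured.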

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler
  2023-06-22 23:45               ` Damien Le Moal
@ 2023-06-23 20:31                 ` Bart Van Assche
  0 siblings, 0 replies; 38+ messages in thread
From: Bart Van Assche @ 2023-06-23 20:31 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer,
	Ming Lei, Jaegeuk Kim, Jianchao Wang

On 6/22/23 16:45, Damien Le Moal wrote:
> On 6/21/23 09:34, Bart Van Assche wrote:
>> Regarding removing zone write locking, would it be acceptable to
>> implement a solution for SCSI devices before it is clear how to
>> implement a solution for NVMe devices? I think a potential solution for
>> SCSI devices is to send requests that should be requeued to the SCSI
>> error handler instead of to the block layer requeue list. The SCSI error
>> handler waits until all pending requests have timed out or have been
>> sent to the error handler. The SCSI error handler can be modified such
>> that requests are sorted in LBA order before being resubmitted. This
>> would solve the nasty issues that would otherwise arise when requeuing
>> requests if multiple write requests for the same zone are pending.
> 
> I am still thinking that a dedicated hctx for writes to sequential zones may be
> the simplest solution for all device types:
> 1) For scsi HBAs, we can likely gain high qd zone writes, but that needs to be
> checked. For AHCI though, we need to keep the max write qd=1 per zone because of
> the chipsets reordering command submissions. So we'll need a queue flag saying
> "need zone write locking" indicated by the adapter when creating the queue.
> 2) For NVMe, this would allow high QD writes, with only the penalty of heavier
> locking overhead when writes are issued from multiple CPUs.
> 
> But I have not started looking at all the details. Need to start prototyping
> something. We can try working on this together if you want.

Hi Damien,

I'm interested in collaborating on this. But I'm not sure whether a 
dedicated hardware queue for sequential writes is a full solution. 
Applications must submit zoned writes (other than write appends) in 
order. These zoned writes may end up in a software queue. It is possible 
that the software queues are flushed in such a way that the zoned writes 
are reordered. Or do you perhaps want to send all zoned writes directly 
to a hardware queue? If so, is this really a better solution than a 
single-queue I/O scheduler? Is the difference perhaps that higher read 
IOPS can be achieved because multiple hardware queues are used for reads?

Even if all sequential writes were sent to a single hardware queue, 
to support queue depths > 1 we still need a mechanism for resubmitting 
requests in order after a request has been requeued. If e.g. three zoned 
writes are in flight and a unit attention is reported for the second 
write, then resubmitting the two writes that have to be resubmitted (the 
second and the third) must only happen after both of them have completed.
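
A user-space sketch of that resubmission rule (not kernel or SCSI code; the
sector numbers are invented): completions for the zone are gathered first, and
the failed writes are only resubmitted, in LBA order, once nothing for the
zone is left in flight.

#include <stdio.h>
#include <stdlib.h>

struct zwrite {
        long long sector;
        int done;
        int failed;
};

static int cmp_sector(const void *a, const void *b)
{
        const struct zwrite *x = a, *y = b;

        return (x->sector > y->sector) - (x->sector < y->sector);
}

/* Record a completion; resubmit failed writes only once the zone is idle. */
static void complete(struct zwrite *w, int n, int idx, int failed)
{
        int i;

        w[idx].done = 1;
        w[idx].failed = failed;
        for (i = 0; i < n; i++)
                if (!w[i].done)
                        return;         /* something is still in flight */
        qsort(w, n, sizeof(*w), cmp_sector);
        for (i = 0; i < n; i++)
                if (w[i].failed)
                        printf("resubmit write at sector %lld\n", w[i].sector);
}

int main(void)
{
        /* Three writes for one zone are in flight at the same time. */
        struct zwrite w[] = { { 0 }, { 8 }, { 16 } };

        complete(w, 3, 0, 0);   /* the first write succeeds */
        complete(w, 3, 1, 1);   /* the second write fails (unit attention) */
        complete(w, 3, 2, 1);   /* the third fails too; now resubmit 8 and 16 */
        return 0;
}

The resubmission order is independent of the order in which the completions
arrive, which is the property needed here.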

Another possibility is to introduce a new request queue flag that 
specifies that only writes should be sent to the I/O scheduler. I'm 
interested in this because of the following observation for zoned UFS 
devices for a block size of 4 KiB and a random read workload:
* mq-deadline scheduler:  59 K IOPS.
* no I/O scheduler:      100 K IOPS.
In other words, 70% more IOPS with no I/O scheduler compared to 
mq-deadline. I don't think that this indicates a performance bug in the 
mq-deadline scheduler. From a quick measurement with the null_blk driver 
it seems to me that all I/O schedulers saturate around 150 K - 170 K IOPS.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2023-06-23 20:32 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-22 18:38 [PATCH v3 0/7] Submit zoned writes in order Bart Van Assche
2023-05-22 18:38 ` [PATCH v3 1/7] block: Rename a local variable in blk_mq_requeue_work() Bart Van Assche
2023-05-23  7:12   ` Christoph Hellwig
2023-05-23  9:48   ` Johannes Thumshirn
2023-05-22 18:38 ` [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler Bart Van Assche
2023-05-23  7:18   ` Christoph Hellwig
2023-05-23 22:30     ` Bart Van Assche
2023-05-24  6:13       ` Christoph Hellwig
2023-05-24 18:22         ` Bart Van Assche
2023-05-25  8:25           ` Christoph Hellwig
2023-05-23  9:03   ` Ming Lei
2023-05-23 17:19     ` Bart Van Assche
2023-05-24  0:31       ` Ming Lei
2023-05-24 17:56         ` Bart Van Assche
2023-05-24 23:06           ` Damien Le Moal
2023-05-25  0:53             ` Ming Lei
2023-06-21  0:34             ` Bart Van Assche
2023-06-22 23:45               ` Damien Le Moal
2023-06-23 20:31                 ` Bart Van Assche
2023-05-22 18:38 ` [PATCH v3 3/7] block: Requeue requests if a CPU is unplugged Bart Van Assche
2023-05-23  7:19   ` Christoph Hellwig
2023-05-23  8:17   ` Ming Lei
2023-05-23 20:15     ` Bart Van Assche
2023-05-24  0:35       ` Ming Lei
2023-05-24 18:18         ` Bart Van Assche
2023-05-22 18:38 ` [PATCH v3 4/7] block: Make it easier to debug zoned write reordering Bart Van Assche
2023-05-23  7:19   ` Christoph Hellwig
2023-05-23 19:34     ` Bart Van Assche
2023-05-24  6:13       ` Christoph Hellwig
2023-05-24 18:25         ` Bart Van Assche
2023-05-22 18:38 ` [PATCH v3 5/7] block: Preserve the order of requeued requests Bart Van Assche
2023-05-22 18:38 ` [PATCH v3 6/7] dm: Inline __dm_mq_kick_requeue_list() Bart Van Assche
2023-05-23  7:22   ` Christoph Hellwig
2023-05-22 18:38 ` [PATCH v3 7/7] block: Inline blk_mq_{,delay_}kick_requeue_list() Bart Van Assche
2023-05-24  8:01   ` Vineeth Vijayan
2023-05-23  7:22 ` [PATCH v3 0/7] Submit zoned writes in order Christoph Hellwig
2023-05-23 20:04   ` Bart Van Assche
2023-05-24  6:15     ` Christoph Hellwig
