linux-block.vger.kernel.org archive mirror
* [PATCH v18 00/12] Improve write performance for zoned UFS devices
@ 2025-06-16 22:33 Bart Van Assche
  2025-06-16 22:33 ` [PATCH v18 01/12] block: Support block drivers that preserve the order of write requests Bart Van Assche
                   ` (11 more replies)
  0 siblings, 12 replies; 24+ messages in thread
From: Bart Van Assche @ 2025-06-16 22:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
	Bart Van Assche

Hi Jens,

This patch series improves small write IOPS by a factor of two for zoned UFS
devices on my test setup. The changes included in this patch series are as
follows:
 - A new request queue limits flag is introduced that allows block drivers to
   declare whether or not they preserve the request order per hardware queue.
 - The order of zoned writes is preserved in the block layer by submitting all
   zoned writes for a zone from the same CPU core as long as any zoned writes
   for that zone are pending. A new member 'from_cpu' is introduced in the
   per-zone data structure 'blk_zone_wplug' to track from which CPU to submit
   zoned writes. This member is reset to -1 after all pending zoned writes for
   a zone have completed. A sketch of this routing logic is shown just after
   this list.
 - The retry count for zoned writes is increased in the SCSI core to deal with
   reordering caused by unit attention conditions or the SCSI error handler.
 - New functionality is added in the null_blk and scsi_debug drivers to
   make it easier to test the changes introduced by this patch series.
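
The routing decision at the heart of this series can be summarized as
follows. This is a simplified sketch of the logic that patch 06/12 adds to
blk_zone_wplug_handle_write(); all identifiers below come from the patches
in this series:

	if (disk->queue->limits.driver_preserves_write_order) {
		/*
		 * Write pipelining: pin all writes for this zone to a
		 * single software queue instead of plugging them.
		 */
		if (zwplug->from_cpu < 0)
			zwplug->from_cpu = raw_smp_processor_id();
		*from_cpu = zwplug->from_cpu;
	} else {
		/* Zone write plugging: throttle writes for this zone. */
		zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
	}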

Please consider this patch series for the next merge window.

Thanks,

Bart.

Changes compared to v17:
 - Rebased the patch series on top of kernel v6.16-rc1.
 - Dropped support for UFSHCI 3.0 controllers because the UFSHCI 3.0 auto-
   hibernation mechanism causes request reordering. UFSHCI 4.0 controllers
   remain supported.
 - Removed the error handling and write pointer tracking mechanisms again
   from block/blk-zoned.c.
 - Dropped the dm-linear patch from this patch series since I'm not aware of
   any use cases for write pipelining and dm-linear.

Changes compared to v16:
 - Rebased the entire patch series on top of Jens' for-next branch. Compared
   to when v16 of this series was posted, the BLK_ZONE_WPLUG_NEED_WP_UPDATE
   flag has been introduced and support for REQ_NOWAIT has been fixed.
 - The behavior for SMR disks is preserved: if .driver_preserves_write_order
   has not been set, BLK_ZONE_WPLUG_NEED_WP_UPDATE is still set if a write
   error has been encountered. If .driver_preserves_write_order has been set,
   the write pointer is restored and the failed zoned writes are retried.
 - The superfluous "disk->zone_wplugs_hash_bits != 0" tests have been removed.

Changes compared to v15:
 - Reworked this patch series on top of the zone write plugging approach.
 - Moved support for requeuing requests from the SCSI core into the block
   layer core.
 - In the UFS driver, instead of disabling write pipelining if
   auto-hibernation is enabled, rely on the requeuing mechanism to handle
   reordering caused by resuming from auto-hibernation.

Changes compared to v14:
 - Removed the drivers/scsi/Kconfig.kunit and drivers/scsi/Makefile.kunit
   files. Instead, modified drivers/scsi/Kconfig and added #include "*_test.c"
   directives in the appropriate .c files. Removed the EXPORT_SYMBOL()
   directives that were added to make the unit tests link.
 - Fixed a double free in a unit test.

Changes compared to v13:
 - Reworked patch "block: Preserve the order of requeued zoned writes".
 - Addressed a performance concern by removing the eh_needs_prepare_resubmit
   SCSI driver callback and by introducing the SCSI host template flag
   .needs_prepare_resubmit instead.
 - Added a patch that adds a 'host' argument to scsi_eh_flush_done_q().
 - Made the code in unit tests less repetitive.

Changes compared to v12:
 - Added two new patches: "block: Preserve the order of requeued zoned writes"
   and "scsi: sd: Add a unit test for sd_cmp_sector()"
 - Restricted the number of zoned write retries. To my surprise I had to add
   "&& scmd->retries <= scmd->allowed" in the SCSI error handler to limit the
   number of retries.
 - In patch "scsi: ufs: Inform the block layer about write ordering", only set
   ELEVATOR_F_ZBD_SEQ_WRITE for zoned block devices.

Changes compared to v11:
 - Fixed a NULL pointer dereference that happened when booting from an ATA
   device by adding an scmd->device != NULL check in scsi_needs_preparation().
 - Updated Reviewed-by tags.

Changes compared to v10:
 - Dropped the UFS MediaTek and HiSilicon patches because these patches are
   not correct and because it is safe to drop them.
 - Updated Acked-by / Reviewed-by tags.

Changes compared to v9:
 - Introduced an additional scsi_driver callback: .eh_needs_prepare_resubmit().
 - Renamed the scsi_debug kernel module parameter 'no_zone_write_lock' into
   'preserves_write_order'.
 - Fixed an out-of-bounds access in the scsi_call_prepare_resubmit() unit
   test.
 - Wrapped ufshcd_auto_hibern8_update() calls in UFS host drivers with
   WARN_ON_ONCE() such that a kernel stack trace appears in case an error
   code is returned.
 - Elaborated a comment in the UFSHCI driver.

Changes compared to v8:
 - Fixed handling of 'driver_preserves_write_order' and 'use_zone_write_lock'
   in blk_stack_limits().
 - Added a comment in disk_set_zoned().
 - Modified blk_req_needs_zone_write_lock() such that it returns false if
   q->limits.use_zone_write_lock is false.
 - Modified disk_clear_zone_settings() such that it clears
   q->limits.use_zone_write_lock.
 - Left out one change from the mq-deadline patch that became superfluous due to
   the blk_req_needs_zone_write_lock() change.
 - Modified scsi_call_prepare_resubmit() such that it only calls list_sort() if
   zoned writes have to be resubmitted for which zone write locking is disabled.
 - Added an additional unit test for scsi_call_prepare_resubmit().
 - Modified the sorting code in the sd driver such that only those SCSI commands
   are sorted for which write locking is disabled.
 - Modified sd_zbc.c such that ELEVATOR_F_ZBD_SEQ_WRITE is only set if the
   write order is not preserved.
 - Included three patches for UFS host drivers that rework code that wrote
   directly to the auto-hibernation controller register.
 - Modified the UFS driver such that enabling auto-hibernation is not allowed
   if a zoned logical unit is present and if the controller operates in legacy
   mode.
 - Also in the UFS driver, simplified ufshcd_auto_hibern8_update().

Changes compared to v7:
 - Split the queue_limits member variable `use_zone_write_lock' into two member
   variables: `use_zone_write_lock' (set by disk_set_zoned()) and
   `driver_preserves_write_order' (set by the block driver or SCSI LLD). This
   should clear up the confusion about the purpose of this variable.
 - Moved the code for sorting SCSI commands by LBA from the SCSI error handler
   into the SCSI disk (sd) driver as requested by Christoph.
Changes compared to v6:
 - Removed QUEUE_FLAG_NO_ZONE_WRITE_LOCK and instead introduced a flag in
   the request queue limits data structure.

Changes compared to v5:
 - Renamed scsi_cmp_lba() into scsi_cmp_sector().
 - Improved several source code comments.

Changes compared to v4:
 - Dropped the patch that introduces the REQ_NO_ZONE_WRITE_LOCK flag.
 - Dropped the null_blk patch and added two scsi_debug patches instead.
 - Dropped the f2fs patch.
 - Split the patch for the UFS driver into two patches.
 - Modified several patch descriptions and source code comments.
 - Renamed dd_use_write_locking() into dd_use_zone_write_locking().
 - Moved the list_sort() call from scsi_unjam_host() into scsi_eh_flush_done_q()
   such that sorting happens just before reinserting.
 - Removed the scsi_cmd_retry_allowed() call from scsi_check_sense() to make
   sure that the retry counter is adjusted once per retry instead of twice.

Changes compared to v3:
 - Restored the patch that introduces QUEUE_FLAG_NO_ZONE_WRITE_LOCK. That patch
   had accidentally been left out from v2.
 - In patch "block: Introduce the flag REQ_NO_ZONE_WRITE_LOCK", improved the
   patch description and added the function blk_no_zone_write_lock().
 - In patch "block/mq-deadline: Only use zone locking if necessary", moved the
   blk_queue_is_zoned() call into dd_use_write_locking().
 - In patch "fs/f2fs: Disable zone write locking", set REQ_NO_ZONE_WRITE_LOCK
   from inside __bio_alloc() instead of in f2fs_submit_write_bio().

Changes compared to v2:
 - Renamed the request queue flag for disabling zone write locking.
 - Introduced a new request flag for disabling zone write locking.
 - Modified the mq-deadline scheduler such that zone write locking is only
   disabled if both flags are set.
 - Added an F2FS patch that sets the request flag for disabling zone write
   locking.
 - Only disable zone write locking in the UFS driver if auto-hibernation is
   disabled.

Changes compared to v1:
 - Left out the patches that are already upstream.
 - Switched the approach in patch "scsi: Retry unaligned zoned writes" from
   retrying immediately to sending unaligned write commands to the SCSI error
   handler.

Bart Van Assche (12):
  block: Support block drivers that preserve the order of write requests
  block: Rework request allocation in blk_mq_submit_bio()
  block: Support allocating from a specific software queue
  blk-mq: Restore the zoned write order when requeuing
  blk-zoned: Add an argument to blk_zone_plug_bio()
  blk-zoned: Support pipelining of zoned writes
  null_blk: Add the preserves_write_order attribute
  scsi: core: Retry unaligned zoned writes
  scsi: sd: Increase retry count for zoned writes
  scsi: scsi_debug: Add the preserves_write_order module parameter
  scsi: scsi_debug: Support injecting unaligned write errors
  scsi: ufs: Inform the block layer about write ordering

 block/bfq-iosched.c               |  2 +
 block/blk-mq.c                    | 73 +++++++++++++++++++------------
 block/blk-mq.h                    |  3 ++
 block/blk-settings.c              |  2 +
 block/blk-zoned.c                 | 34 ++++++++++----
 block/kyber-iosched.c             |  2 +
 block/mq-deadline.c               |  7 ++-
 drivers/block/null_blk/main.c     |  3 ++
 drivers/block/null_blk/null_blk.h |  1 +
 drivers/md/dm.c                   |  2 +-
 drivers/scsi/scsi_debug.c         | 21 ++++++++-
 drivers/scsi/scsi_error.c         | 16 +++++++
 drivers/scsi/sd.c                 |  7 +++
 drivers/ufs/core/ufshcd.c         |  6 +++
 include/linux/blk-mq.h            | 13 +++++-
 include/linux/blkdev.h            |  7 ++-
 16 files changed, 158 insertions(+), 41 deletions(-)



* [PATCH v18 01/12] block: Support block drivers that preserve the order of write requests
  2025-06-16 22:33 [PATCH v18 00/12] Improve write performance for zoned UFS devices Bart Van Assche
@ 2025-06-16 22:33 ` Bart Van Assche
  2025-06-16 22:33 ` [PATCH v18 02/12] block: Rework request allocation in blk_mq_submit_bio() Bart Van Assche
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2025-06-16 22:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
	Bart Van Assche, Hannes Reinecke, Nitesh Shetty, Ming Lei

Some storage controllers preserve the request order per hardware queue.
Introduce the request queue limit member variable
'driver_preserves_write_order' to allow block drivers to indicate that
the order of write commands is preserved per hardware queue. If this limit
is set, serialization of writes per zone is not required as long as all
pending writes for a zone are submitted to the same hardware queue.
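
A driver opts in by setting this limit while initializing its queue
limits. A minimal sketch for a hypothetical driver (null_blk and UFS are
converted in later patches of this series):

	struct queue_limits lim = { };

	/*
	 * Hypothetical example: only set this limit if the controller is
	 * known to keep write commands ordered per hardware queue.
	 */
	lim.driver_preserves_write_order = true;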

Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Nitesh Shetty <nj.shetty@samsung.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-settings.c   | 2 ++
 include/linux/blkdev.h | 5 +++++
 2 files changed, 7 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index a000daafbfb4..bceb9a9cb5ba 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -814,6 +814,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 	}
 	t->max_secure_erase_sectors = min_not_zero(t->max_secure_erase_sectors,
 						   b->max_secure_erase_sectors);
+	t->driver_preserves_write_order = t->driver_preserves_write_order &&
+		b->driver_preserves_write_order;
 	t->zone_write_granularity = max(t->zone_write_granularity,
 					b->zone_write_granularity);
 	if (!(t->features & BLK_FEAT_ZONED)) {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a59880c809c7..6d01269f9fcf 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -408,6 +408,11 @@ struct queue_limits {
 
 	unsigned int		max_open_zones;
 	unsigned int		max_active_zones;
+	/*
+	 * Whether or not the block driver preserves the order of write
+	 * requests per hardware queue. Set by the block driver.
+	 */
+	bool			driver_preserves_write_order;
 
 	/*
 	 * Drivers that set dma_alignment to less than 511 must be prepared to


* [PATCH v18 02/12] block: Rework request allocation in blk_mq_submit_bio()
  2025-06-16 22:33 [PATCH v18 00/12] Improve write performance for zoned UFS devices Bart Van Assche
  2025-06-16 22:33 ` [PATCH v18 01/12] block: Support block drivers that preserve the order of write requests Bart Van Assche
@ 2025-06-16 22:33 ` Bart Van Assche
  2025-06-26  0:00   ` Damien Le Moal
  2025-06-16 22:33 ` [PATCH v18 03/12] block: Support allocating from a specific software queue Bart Van Assche
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2025-06-16 22:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
	Bart Van Assche

Prepare for allocating a request from a specific hctx by making
blk_mq_submit_bio() allocate the request later in that function.

The performance impact of this patch on the hot path is small: if a
request is cached, one percpu_ref_get(&q->q_usage_counter) call and one
percpu_ref_put(&q->q_usage_counter) call are added to the hot path.
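
This is because a cached request already holds a queue usage reference.
With this change the submission path always calls bio_queue_enter() and,
when a cached request is then used, drops the extra reference again, as
the comment added by the hunk below explains.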

Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq.c | 31 ++++++++++---------------------
 1 file changed, 10 insertions(+), 21 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4806b867e37d..b22b56042e91 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3116,11 +3116,6 @@ void blk_mq_submit_bio(struct bio *bio)
 	struct request *rq;
 	blk_status_t ret;
 
-	/*
-	 * If the plug has a cached request for this queue, try to use it.
-	 */
-	rq = blk_mq_peek_cached_request(plug, q, bio->bi_opf);
-
 	/*
 	 * A BIO that was released from a zone write plug has already been
 	 * through the preparation in this function, already holds a reference
@@ -3129,19 +3124,11 @@ void blk_mq_submit_bio(struct bio *bio)
 	 */
 	if (bio_zone_write_plugging(bio)) {
 		nr_segs = bio->__bi_nr_segments;
-		if (rq)
-			blk_queue_exit(q);
 		goto new_request;
 	}
 
-	/*
-	 * The cached request already holds a q_usage_counter reference and we
-	 * don't have to acquire a new one if we use it.
-	 */
-	if (!rq) {
-		if (unlikely(bio_queue_enter(bio)))
-			return;
-	}
+	if (unlikely(bio_queue_enter(bio)))
+		return;
 
 	/*
 	 * Device reconfiguration may change logical block size or reduce the
@@ -3173,8 +3160,15 @@ void blk_mq_submit_bio(struct bio *bio)
 		goto queue_exit;
 
 new_request:
+	rq = blk_mq_peek_cached_request(plug, q, bio->bi_opf);
 	if (rq) {
 		blk_mq_use_cached_rq(rq, plug, bio);
+		/*
+		 * Here we hold two references: one because of the
+		 * bio_queue_enter() call and a second one as the result of
+		 * request allocation. Drop one.
+		 */
+		blk_queue_exit(q);
 	} else {
 		rq = blk_mq_get_new_requests(q, plug, bio);
 		if (unlikely(!rq)) {
@@ -3220,12 +3214,7 @@ void blk_mq_submit_bio(struct bio *bio)
 	return;
 
 queue_exit:
-	/*
-	 * Don't drop the queue reference if we were trying to use a cached
-	 * request and thus didn't acquire one.
-	 */
-	if (!rq)
-		blk_queue_exit(q);
+	blk_queue_exit(q);
 }
 
 #ifdef CONFIG_BLK_MQ_STACKING


* [PATCH v18 03/12] block: Support allocating from a specific software queue
  2025-06-16 22:33 [PATCH v18 00/12] Improve write performance for zoned UFS devices Bart Van Assche
  2025-06-16 22:33 ` [PATCH v18 01/12] block: Support block drivers that preserve the order of write requests Bart Van Assche
  2025-06-16 22:33 ` [PATCH v18 02/12] block: Rework request allocation in blk_mq_submit_bio() Bart Van Assche
@ 2025-06-16 22:33 ` Bart Van Assche
  2025-06-26  0:04   ` Damien Le Moal
  2025-06-16 22:33 ` [PATCH v18 04/12] blk-mq: Restore the zoned write order when requeuing Bart Van Assche
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2025-06-16 22:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
	Bart Van Assche

A later patch will preserve the order of pipelined zoned writes by
submitting all zoned writes per zone to the same software queue as
previously submitted zoned writes. Hence support allocating a request
from a specific software queue.
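
With this change, callers that want to pin request allocation to a
specific software queue set the new 'from_cpu' field in struct
blk_mq_alloc_data to a CPU number, while -1 keeps the existing behavior of
allocating from the software queue of the CPU the caller runs on. A sketch
based on the hunks below:

	struct blk_mq_alloc_data data = {
		.q		= q,
		.cmd_flags	= bio->bi_opf,
		.nr_tags	= 1,
		/* -1: use the current CPU; >= 0: pin to that CPU. */
		.from_cpu	= -1,
	};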

Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq.c | 18 ++++++++++++++----
 block/blk-mq.h |  1 +
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b22b56042e91..9dab1caf750a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -485,6 +485,7 @@ __blk_mq_alloc_requests_batch(struct blk_mq_alloc_data *data)
 static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
 {
 	struct request_queue *q = data->q;
+	int from_cpu = data->from_cpu;
 	u64 alloc_time_ns = 0;
 	struct request *rq;
 	unsigned int tag;
@@ -497,7 +498,8 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
 		data->flags |= BLK_MQ_REQ_NOWAIT;
 
 retry:
-	data->ctx = blk_mq_get_ctx(q);
+	data->from_cpu = from_cpu >= 0 ? from_cpu : raw_smp_processor_id();
+	data->ctx = __blk_mq_get_ctx(q, data->from_cpu);
 	data->hctx = blk_mq_map_queue(data->cmd_flags, data->ctx);
 
 	if (q->elevator) {
@@ -579,6 +581,7 @@ static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
 		.rq_flags	= 0,
 		.nr_tags	= plug->nr_ios,
 		.cached_rqs	= &plug->cached_rqs,
+		.from_cpu	= -1,
 		.ctx		= NULL,
 		.hctx		= NULL
 	};
@@ -644,6 +647,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
 			.cmd_flags	= opf,
 			.rq_flags	= 0,
 			.nr_tags	= 1,
+			.from_cpu	= -1,
 			.cached_rqs	= NULL,
 			.ctx		= NULL,
 			.hctx		= NULL
@@ -678,6 +682,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 		.cmd_flags	= opf,
 		.rq_flags	= 0,
 		.nr_tags	= 1,
+		.from_cpu	= -1,
 		.cached_rqs	= NULL,
 		.ctx		= NULL,
 		.hctx		= NULL
@@ -3012,6 +3017,7 @@ static bool blk_mq_attempt_bio_merge(struct request_queue *q,
 }
 
 static struct request *blk_mq_get_new_requests(struct request_queue *q,
+					       int from_cpu,
 					       struct blk_plug *plug,
 					       struct bio *bio)
 {
@@ -3021,6 +3027,7 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
 		.shallow_depth	= 0,
 		.cmd_flags	= bio->bi_opf,
 		.rq_flags	= 0,
+		.from_cpu	= from_cpu,
 		.nr_tags	= 1,
 		.cached_rqs	= NULL,
 		.ctx		= NULL,
@@ -3046,7 +3053,7 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
  * Check if there is a suitable cached request and return it.
  */
 static struct request *blk_mq_peek_cached_request(struct blk_plug *plug,
-		struct request_queue *q, blk_opf_t opf)
+		struct request_queue *q, int from_cpu, blk_opf_t opf)
 {
 	enum hctx_type type = blk_mq_get_hctx_type(opf);
 	struct request *rq;
@@ -3056,6 +3063,8 @@ static struct request *blk_mq_peek_cached_request(struct blk_plug *plug,
 	rq = rq_list_peek(&plug->cached_rqs);
 	if (!rq || rq->q != q)
 		return NULL;
+	if (from_cpu >= 0 && rq->mq_ctx->cpu != from_cpu)
+		return NULL;
 	if (type != rq->mq_hctx->type &&
 	    (type != HCTX_TYPE_READ || rq->mq_hctx->type != HCTX_TYPE_DEFAULT))
 		return NULL;
@@ -3114,6 +3123,7 @@ void blk_mq_submit_bio(struct bio *bio)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int nr_segs;
 	struct request *rq;
+	int from_cpu = -1;
 	blk_status_t ret;
 
 	/*
@@ -3160,7 +3170,7 @@ void blk_mq_submit_bio(struct bio *bio)
 		goto queue_exit;
 
 new_request:
-	rq = blk_mq_peek_cached_request(plug, q, bio->bi_opf);
+	rq = blk_mq_peek_cached_request(plug, q, from_cpu, bio->bi_opf);
 	if (rq) {
 		blk_mq_use_cached_rq(rq, plug, bio);
 		/*
@@ -3170,7 +3180,7 @@ void blk_mq_submit_bio(struct bio *bio)
 		 */
 		blk_queue_exit(q);
 	} else {
-		rq = blk_mq_get_new_requests(q, plug, bio);
+		rq = blk_mq_get_new_requests(q, from_cpu, plug, bio);
 		if (unlikely(!rq)) {
 			if (bio->bi_opf & REQ_NOWAIT)
 				bio_wouldblock_error(bio);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index affb2e14b56e..52e907547b55 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -156,6 +156,7 @@ struct blk_mq_alloc_data {
 	struct rq_list *cached_rqs;
 
 	/* input & output parameter */
+	int from_cpu;
 	struct blk_mq_ctx *ctx;
 	struct blk_mq_hw_ctx *hctx;
 };


* [PATCH v18 04/12] blk-mq: Restore the zoned write order when requeuing
  2025-06-16 22:33 [PATCH v18 00/12] Improve write performance for zoned UFS devices Bart Van Assche
                   ` (2 preceding siblings ...)
  2025-06-16 22:33 ` [PATCH v18 03/12] block: Support allocating from a specific software queue Bart Van Assche
@ 2025-06-16 22:33 ` Bart Van Assche
  2025-06-26  0:11   ` Damien Le Moal
  2025-06-16 22:33 ` [PATCH v18 05/12] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2025-06-16 22:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
	Bart Van Assche, Yu Kuai

Zoned writes may be requeued, e.g. if a block driver returns
BLK_STS_RESOURCE, if the SCSI core requeues a command to handle a unit
attention condition, or if the SCSI error handler requeues commands after
error handling has finished. Requests may be requeued in a different order
than they were submitted in. Restore the request order if requests are
requeued: as an example, if writes for sectors 8 and 24 of a zone are
already in the list, a requeued write for sector 16 is inserted between
them. Add RQF_DONTPREP to RQF_NOMERGE_FLAGS because this patch may cause
RQF_DONTPREP requests to be passed to the code that checks whether a
request can be merged, and RQF_DONTPREP requests must not be merged.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/bfq-iosched.c    |  2 ++
 block/blk-mq.c         | 20 +++++++++++++++++++-
 block/blk-mq.h         |  2 ++
 block/kyber-iosched.c  |  2 ++
 block/mq-deadline.c    |  7 ++++++-
 include/linux/blk-mq.h | 13 ++++++++++++-
 6 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0cb1e9873aab..1bd3afe5d779 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6276,6 +6276,8 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 
 	if (flags & BLK_MQ_INSERT_AT_HEAD) {
 		list_add(&rq->queuelist, &bfqd->dispatch);
+	} else if (flags & BLK_MQ_INSERT_ORDERED) {
+		blk_mq_insert_ordered(rq, &bfqd->dispatch);
 	} else if (!bfqq) {
 		list_add_tail(&rq->queuelist, &bfqd->dispatch);
 	} else {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9dab1caf750a..1735b1ac0574 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1562,7 +1562,9 @@ static void blk_mq_requeue_work(struct work_struct *work)
 		 * already.  Insert it into the hctx dispatch list to avoid
 		 * block layer merges for the request.
 		 */
-		if (rq->rq_flags & RQF_DONTPREP)
+		if (blk_rq_is_seq_zoned_write(rq))
+			blk_mq_insert_request(rq, BLK_MQ_INSERT_ORDERED);
+		else if (rq->rq_flags & RQF_DONTPREP)
 			blk_mq_request_bypass_insert(rq, 0);
 		else
 			blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
@@ -2595,6 +2597,20 @@ static void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx,
 	blk_mq_run_hw_queue(hctx, run_queue_async);
 }
 
+void blk_mq_insert_ordered(struct request *rq, struct list_head *list)
+{
+	struct request_queue *q = rq->q;
+	struct request *rq2;
+
+	list_for_each_entry(rq2, list, queuelist)
+		if (rq2->q == q && blk_rq_pos(rq2) > blk_rq_pos(rq))
+			break;
+
+	/* Insert rq before rq2. If rq2 is the list head, append at the end. */
+	list_add_tail(&rq->queuelist, &rq2->queuelist);
+}
+EXPORT_SYMBOL_GPL(blk_mq_insert_ordered);
+
 static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
 {
 	struct request_queue *q = rq->q;
@@ -2649,6 +2665,8 @@ static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
 		spin_lock(&ctx->lock);
 		if (flags & BLK_MQ_INSERT_AT_HEAD)
 			list_add(&rq->queuelist, &ctx->rq_lists[hctx->type]);
+		else if (flags & BLK_MQ_INSERT_ORDERED)
+			blk_mq_insert_ordered(rq, &ctx->rq_lists[hctx->type]);
 		else
 			list_add_tail(&rq->queuelist,
 				      &ctx->rq_lists[hctx->type]);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 52e907547b55..75b657554f1f 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -40,8 +40,10 @@ enum {
 
 typedef unsigned int __bitwise blk_insert_t;
 #define BLK_MQ_INSERT_AT_HEAD		((__force blk_insert_t)0x01)
+#define BLK_MQ_INSERT_ORDERED		((__force blk_insert_t)0x02)
 
 void blk_mq_submit_bio(struct bio *bio);
+void blk_mq_insert_ordered(struct request *rq, struct list_head *list);
 int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, struct io_comp_batch *iob,
 		unsigned int flags);
 void blk_mq_exit_queue(struct request_queue *q);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 4dba8405bd01..051c05ceafd7 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -603,6 +603,8 @@ static void kyber_insert_requests(struct blk_mq_hw_ctx *hctx,
 		trace_block_rq_insert(rq);
 		if (flags & BLK_MQ_INSERT_AT_HEAD)
 			list_move(&rq->queuelist, head);
+		else if (flags & BLK_MQ_INSERT_ORDERED)
+			blk_mq_insert_ordered(rq, head);
 		else
 			list_move_tail(&rq->queuelist, head);
 		sbitmap_set_bit(&khd->kcq_map[sched_domain],
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 2edf1cac06d5..110fef65b829 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -710,7 +710,12 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 		 * set expire time and add to fifo list
 		 */
 		rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
-		list_add_tail(&rq->queuelist, &per_prio->fifo_list[data_dir]);
+		if (flags & BLK_MQ_INSERT_ORDERED)
+			blk_mq_insert_ordered(rq,
+					      &per_prio->fifo_list[data_dir]);
+		else
+			list_add_tail(&rq->queuelist,
+				      &per_prio->fifo_list[data_dir]);
 	}
 }
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index de8c85a03bb7..6833d5a980ef 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -86,7 +86,7 @@ enum rqf_flags {
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
-	(RQF_STARTED | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
+	(RQF_STARTED | RQF_FLUSH_SEQ | RQF_DONTPREP | RQF_SPECIAL_PAYLOAD)
 
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
@@ -1189,4 +1189,15 @@ static inline int blk_rq_map_sg(struct request *rq, struct scatterlist *sglist)
 }
 void blk_dump_rq_flags(struct request *, char *);
 
+static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
+{
+	switch (req_op(rq)) {
+	case REQ_OP_WRITE:
+	case REQ_OP_WRITE_ZEROES:
+		return bdev_zone_is_seq(rq->q->disk->part0, blk_rq_pos(rq));
+	default:
+		return false;
+	}
+}
+
 #endif /* BLK_MQ_H */


* [PATCH v18 05/12] blk-zoned: Add an argument to blk_zone_plug_bio()
  2025-06-16 22:33 [PATCH v18 00/12] Improve write performance for zoned UFS devices Bart Van Assche
                   ` (3 preceding siblings ...)
  2025-06-16 22:33 ` [PATCH v18 04/12] blk-mq: Restore the zoned write order when requeuing Bart Van Assche
@ 2025-06-16 22:33 ` Bart Van Assche
  2025-06-26  0:15   ` Damien Le Moal
  2025-06-16 22:33 ` [PATCH v18 06/12] blk-zoned: Support pipelining of zoned writes Bart Van Assche
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2025-06-16 22:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
	Bart Van Assche

Prepare for preserving the order of pipelined zoned writes per zone.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq.c         | 2 +-
 block/blk-zoned.c      | 4 +++-
 drivers/md/dm.c        | 2 +-
 include/linux/blkdev.h | 2 +-
 4 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1735b1ac0574..ec04ff556983 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3184,7 +3184,7 @@ void blk_mq_submit_bio(struct bio *bio)
 	if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
 		goto queue_exit;
 
-	if (blk_queue_is_zoned(q) && blk_zone_plug_bio(bio, nr_segs))
+	if (blk_queue_is_zoned(q) && blk_zone_plug_bio(bio, nr_segs, &from_cpu))
 		goto queue_exit;
 
 new_request:
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 351d659280e1..a79bb71a83e0 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1104,6 +1104,8 @@ static void blk_zone_wplug_handle_native_zone_append(struct bio *bio)
  * blk_zone_plug_bio - Handle a zone write BIO with zone write plugging
  * @bio: The BIO being submitted
  * @nr_segs: The number of physical segments of @bio
+ * @from_cpu: [out] CPU of the software queue to which the bio should be
+ *	queued
  *
  * Handle write, write zeroes and zone append operations requiring emulation
  * using zone write plugging.
@@ -1112,7 +1114,7 @@ static void blk_zone_wplug_handle_native_zone_append(struct bio *bio)
  * write plug. Otherwise, return false to let the submission path process
  * @bio normally.
  */
-bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
+bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int *from_cpu)
 {
 	struct block_device *bdev = bio->bi_bdev;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 1726f0f828cc..18084277ea27 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1788,7 +1788,7 @@ static inline bool dm_zone_bio_needs_split(struct mapped_device *md,
 }
 static inline bool dm_zone_plug_bio(struct mapped_device *md, struct bio *bio)
 {
-	return dm_emulate_zone_append(md) && blk_zone_plug_bio(bio, 0);
+	return dm_emulate_zone_append(md) && blk_zone_plug_bio(bio, 0, NULL);
 }
 
 static blk_status_t __send_zone_reset_all_emulated(struct clone_info *ci,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6d01269f9fcf..f3471c98c22f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -842,7 +842,7 @@ static inline unsigned int disk_nr_zones(struct gendisk *disk)
 {
 	return disk->nr_zones;
 }
-bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs);
+bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int *from_cpu);
 
 /**
  * disk_zone_capacity - returns the zone capacity of zone containing @sector


* [PATCH v18 06/12] blk-zoned: Support pipelining of zoned writes
  2025-06-16 22:33 [PATCH v18 00/12] Improve write performance for zoned UFS devices Bart Van Assche
                   ` (4 preceding siblings ...)
  2025-06-16 22:33 ` [PATCH v18 05/12] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
@ 2025-06-16 22:33 ` Bart Van Assche
  2025-06-16 22:33 ` [PATCH v18 07/12] null_blk: Add the preserves_write_order attribute Bart Van Assche
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2025-06-16 22:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
	Bart Van Assche

Support pipelining of zoned writes if the block driver preserves the write
order per hardware queue. Track per zone to which software queue writes
have been queued. If zoned writes are pipelined, submit new writes to the
same software queue as the writes that are already in progress. This
prevents the reordering that would happen if requests for the same zone
were submitted to different software or hardware queues.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq.c    |  4 ++--
 block/blk-zoned.c | 30 +++++++++++++++++++++++-------
 2 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ec04ff556983..6954de9f9fe9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3147,8 +3147,8 @@ void blk_mq_submit_bio(struct bio *bio)
 	/*
 	 * A BIO that was released from a zone write plug has already been
 	 * through the preparation in this function, already holds a reference
-	 * on the queue usage counter, and is the only write BIO in-flight for
-	 * the target zone. Go straight to preparing a request for it.
+	 * on the queue usage counter. Go straight to preparing a request for
+	 * it.
 	 */
 	if (bio_zone_write_plugging(bio)) {
 		nr_segs = bio->__bi_nr_segments;
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index a79bb71a83e0..76f76fd83f73 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -51,6 +51,8 @@ static const char *const zone_cond_name[] = {
  * @zone_no: The number of the zone the plug is managing.
  * @wp_offset: The zone write pointer location relative to the start of the zone
  *             as a number of 512B sectors.
+ * @from_cpu: Software queue to submit writes from for drivers that preserve
+ *	the write order.
  * @bio_list: The list of BIOs that are currently plugged.
  * @bio_work: Work struct to handle issuing of plugged BIOs
  * @rcu_head: RCU head to free zone write plugs with an RCU grace period.
@@ -63,6 +65,7 @@ struct blk_zone_wplug {
 	unsigned int		flags;
 	unsigned int		zone_no;
 	unsigned int		wp_offset;
+	int			from_cpu;
 	struct bio_list		bio_list;
 	struct work_struct	bio_work;
 	struct rcu_head		rcu_head;
@@ -72,8 +75,7 @@ struct blk_zone_wplug {
 /*
  * Zone write plug flags bits:
  *  - BLK_ZONE_WPLUG_PLUGGED: Indicates that the zone write plug is plugged,
- *    that is, that write BIOs are being throttled due to a write BIO already
- *    being executed or the zone write plug bio list is not empty.
+ *    that is, that write BIOs are being throttled.
  *  - BLK_ZONE_WPLUG_NEED_WP_UPDATE: Indicates that we lost track of a zone
  *    write pointer offset and need to update it.
  *  - BLK_ZONE_WPLUG_UNHASHED: Indicates that the zone write plug was removed
@@ -568,6 +570,7 @@ static struct blk_zone_wplug *disk_get_and_lock_zone_wplug(struct gendisk *disk,
 	zwplug->flags = 0;
 	zwplug->zone_no = zno;
 	zwplug->wp_offset = bdev_offset_from_zone_start(disk->part0, sector);
+	zwplug->from_cpu = -1;
 	bio_list_init(&zwplug->bio_list);
 	INIT_WORK(&zwplug->bio_work, blk_zone_wplug_bio_work);
 	zwplug->disk = disk;
@@ -983,7 +986,8 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
 	return true;
 }
 
-static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
+static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs,
+					int *from_cpu)
 {
 	struct gendisk *disk = bio->bi_bdev->bd_disk;
 	sector_t sector = bio->bi_iter.bi_sector;
@@ -1045,7 +1049,13 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
 		return true;
 	}
 
-	zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+	if (disk->queue->limits.driver_preserves_write_order) {
+		if (zwplug->from_cpu < 0)
+			zwplug->from_cpu = raw_smp_processor_id();
+		*from_cpu = zwplug->from_cpu;
+	} else {
+		zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+	}
 
 	spin_unlock_irqrestore(&zwplug->lock, flags);
 
@@ -1165,7 +1175,7 @@ bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int *from_cpu)
 		fallthrough;
 	case REQ_OP_WRITE:
 	case REQ_OP_WRITE_ZEROES:
-		return blk_zone_wplug_handle_write(bio, nr_segs);
+		return blk_zone_wplug_handle_write(bio, nr_segs, from_cpu);
 	case REQ_OP_ZONE_RESET:
 		return blk_zone_wplug_handle_reset_or_finish(bio, 0);
 	case REQ_OP_ZONE_FINISH:
@@ -1197,6 +1207,9 @@ static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
 
 	zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
 
+	if (refcount_read(&zwplug->ref) == 2)
+		zwplug->from_cpu = -1;
+
 	/*
 	 * If the zone is full (it was fully written or finished, or empty
 	 * (it was reset), remove its zone write plug from the hash table.
@@ -1848,6 +1861,7 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
 	unsigned int zwp_zone_no, zwp_ref;
 	unsigned int zwp_bio_list_size;
 	unsigned long flags;
+	int from_cpu;
 
 	spin_lock_irqsave(&zwplug->lock, flags);
 	zwp_zone_no = zwplug->zone_no;
@@ -1855,10 +1869,12 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
 	zwp_ref = refcount_read(&zwplug->ref);
 	zwp_wp_offset = zwplug->wp_offset;
 	zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
+	from_cpu = zwplug->from_cpu;
 	spin_unlock_irqrestore(&zwplug->lock, flags);
 
-	seq_printf(m, "%u 0x%x %u %u %u\n", zwp_zone_no, zwp_flags, zwp_ref,
-		   zwp_wp_offset, zwp_bio_list_size);
+	seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u bio_list_size %u from_cpu %d\n",
+		   zwp_zone_no, zwp_flags, zwp_ref, zwp_wp_offset,
+		   zwp_bio_list_size, from_cpu);
 }
 
 int queue_zone_wplugs_show(void *data, struct seq_file *m)


* [PATCH v18 07/12] null_blk: Add the preserves_write_order attribute
  2025-06-16 22:33 [PATCH v18 00/12] Improve write performance for zoned UFS devices Bart Van Assche
                   ` (5 preceding siblings ...)
  2025-06-16 22:33 ` [PATCH v18 06/12] blk-zoned: Support pipelining of zoned writes Bart Van Assche
@ 2025-06-16 22:33 ` Bart Van Assche
  2025-06-16 22:33 ` [PATCH v18 08/12] scsi: core: Retry unaligned zoned writes Bart Van Assche
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2025-06-16 22:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
	Bart Van Assche

Support configuring the preserves_write_order attribute in the null_blk
driver to make it easier to test support for this attribute.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/block/null_blk/main.c     | 3 +++
 drivers/block/null_blk/null_blk.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index aa163ae9b2aa..45eaa6f0e983 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -475,6 +475,7 @@ NULLB_DEVICE_ATTR(fua, bool, NULL);
 NULLB_DEVICE_ATTR(rotational, bool, NULL);
 NULLB_DEVICE_ATTR(badblocks_once, bool, NULL);
 NULLB_DEVICE_ATTR(badblocks_partial_io, bool, NULL);
+NULLB_DEVICE_ATTR(preserves_write_order, bool, NULL);
 
 static ssize_t nullb_device_power_show(struct config_item *item, char *page)
 {
@@ -613,6 +614,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
 	&nullb_device_attr_no_sched,
 	&nullb_device_attr_poll_queues,
 	&nullb_device_attr_power,
+	&nullb_device_attr_preserves_write_order,
 	&nullb_device_attr_queue_mode,
 	&nullb_device_attr_rotational,
 	&nullb_device_attr_shared_tag_bitmap,
@@ -1979,6 +1981,7 @@ static int null_add_dev(struct nullb_device *dev)
 	if (dev->virt_boundary)
 		lim.virt_boundary_mask = PAGE_SIZE - 1;
 	null_config_discard(nullb, &lim);
+	lim.driver_preserves_write_order = dev->preserves_write_order;
 	if (dev->zoned) {
 		rv = null_init_zoned_dev(dev, &lim);
 		if (rv)
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 7bb6128dbaaf..08b732cf853f 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -110,6 +110,7 @@ struct nullb_device {
 	bool shared_tag_bitmap; /* use hostwide shared tags */
 	bool fua; /* Support FUA */
 	bool rotational; /* Fake rotational device */
+	bool preserves_write_order;
 };
 
 struct nullb {


* [PATCH v18 08/12] scsi: core: Retry unaligned zoned writes
  2025-06-16 22:33 [PATCH v18 00/12] Improve write performance for zoned UFS devices Bart Van Assche
                   ` (6 preceding siblings ...)
  2025-06-16 22:33 ` [PATCH v18 07/12] null_blk: Add the preserves_write_order attribute Bart Van Assche
@ 2025-06-16 22:33 ` Bart Van Assche
  2025-06-16 22:33 ` [PATCH v18 09/12] scsi: sd: Increase retry count for " Bart Van Assche
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2025-06-16 22:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
	Bart Van Assche, Martin K. Petersen, Ming Lei

If zoned writes (REQ_OP_WRITE) for a sequential write required zone have
a starting LBA that differs from the write pointer, e.g. because a prior
write triggered a unit attention condition, then the storage device will
respond with an UNALIGNED WRITE COMMAND error. Retry commands that failed
with an unaligned write error.
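
In SCSI terms, this failure is reported as sense key ILLEGAL REQUEST with
additional sense code/qualifier 0x21/0x04 (UNALIGNED WRITE COMMAND), which
is exactly the condition the hunk below matches before deciding to retry.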

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/scsi_error.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 376b8897ab90..32f0f7d520df 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -712,6 +712,22 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
 		fallthrough;
 
 	case ILLEGAL_REQUEST:
+		/*
+		 * Unaligned write command. This may indicate that zoned writes
+		 * have been received by the device in the wrong order. If write
+		 * pipelining is enabled, retry.
+		 */
+		if (sshdr.asc == 0x21 && sshdr.ascq == 0x04 &&
+		    req->q->limits.driver_preserves_write_order &&
+		    blk_rq_is_seq_zoned_write(req) &&
+		    scsi_cmd_retry_allowed(scmd)) {
+			SCSI_LOG_ERROR_RECOVERY(1,
+				sdev_printk(KERN_WARNING, scmd->device,
+				"Retrying unaligned write at LBA %#llx.\n",
+				scsi_get_lba(scmd)));
+			return NEEDS_RETRY;
+		}
+
 		if (sshdr.asc == 0x20 || /* Invalid command operation code */
 		    sshdr.asc == 0x21 || /* Logical block address out of range */
 		    sshdr.asc == 0x22 || /* Invalid function */


* [PATCH v18 09/12] scsi: sd: Increase retry count for zoned writes
  2025-06-16 22:33 [PATCH v18 00/12] Improve write performance for zoned UFS devices Bart Van Assche
                   ` (7 preceding siblings ...)
  2025-06-16 22:33 ` [PATCH v18 08/12] scsi: core: Retry unaligned zoned writes Bart Van Assche
@ 2025-06-16 22:33 ` Bart Van Assche
  2025-06-16 22:33 ` [PATCH v18 10/12] scsi: scsi_debug: Add the preserves_write_order module parameter Bart Van Assche
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2025-06-16 22:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
	Bart Van Assche, Martin K. Petersen, Ming Lei

If the write order is preserved, increase the number of retries for write
commands sent to a sequential zone to the maximum number of outstanding
commands, because in the worst case the number of times reordered zoned
writes have to be retried is (number of outstanding writes per sequential
zone) - 1.
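
As a worked example: if N writes for the same zone are in flight and reach
the device in reversed order, only the write that matches the write pointer
succeeds on each pass, so the write for the highest LBA may fail with an
unaligned write error and be retried up to N - 1 times before it succeeds.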

Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/sd.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 3f6e87705b62..75ff6af37ecc 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1404,6 +1404,13 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 	cmd->transfersize = sdp->sector_size;
 	cmd->underflow = nr_blocks << 9;
 	cmd->allowed = sdkp->max_retries;
+	/*
+	 * Increase the number of allowed retries for zoned writes if the driver
+	 * preserves the command order.
+	 */
+	if (rq->q->limits.driver_preserves_write_order &&
+	    blk_rq_is_seq_zoned_write(rq))
+		cmd->allowed += rq->q->nr_requests;
 	cmd->sdb.length = nr_blocks * sdp->sector_size;
 
 	SCSI_LOG_HLQUEUE(1,


* [PATCH v18 10/12] scsi: scsi_debug: Add the preserves_write_order module parameter
  2025-06-16 22:33 [PATCH v18 00/12] Improve write performance for zoned UFS devices Bart Van Assche
                   ` (8 preceding siblings ...)
  2025-06-16 22:33 ` [PATCH v18 09/12] scsi: sd: Increase retry count for " Bart Van Assche
@ 2025-06-16 22:33 ` Bart Van Assche
  2025-06-16 22:33 ` [PATCH v18 11/12] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
  2025-06-16 22:33 ` [PATCH v18 12/12] scsi: ufs: Inform the block layer about write ordering Bart Van Assche
  11 siblings, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2025-06-16 22:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
	Bart Van Assche, Douglas Gilbert, Martin K. Petersen, Ming Lei

Zone write locking is not used for zoned devices if the block driver
reports that it preserves the order of write commands. Make it easier to
test the code paths that avoid zone write locking by adding support for
setting the driver_preserves_write_order flag.
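
A possible invocation for testing, assuming the existing scsi_debug 'zbc'
module parameter for emulating a host-managed zoned disk:

	modprobe scsi_debug preserves_write_order=1 zbc=managed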

Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/scsi_debug.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index aef33d1e346a..5575bc8a833d 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -1004,6 +1004,7 @@ static int dix_reads;
 static int dif_errors;
 
 /* ZBC global data */
+static bool sdeb_preserves_write_order;
 static bool sdeb_zbc_in_use;	/* true for host-aware and host-managed disks */
 static int sdeb_zbc_zone_cap_mb;
 static int sdeb_zbc_zone_size_mb;
@@ -6607,10 +6608,14 @@ static struct sdebug_dev_info *find_build_dev_info(struct scsi_device *sdev)
 
 static int scsi_debug_sdev_init(struct scsi_device *sdp)
 {
+	struct request_queue *q = sdp->request_queue;
+
 	if (sdebug_verbose)
 		pr_info("sdev_init <%u %u %u %llu>\n",
 		       sdp->host->host_no, sdp->channel, sdp->id, sdp->lun);
 
+	q->limits.driver_preserves_write_order = sdeb_preserves_write_order;
+
 	return 0;
 }
 
@@ -7339,6 +7344,8 @@ module_param_named(statistics, sdebug_statistics, bool, S_IRUGO | S_IWUSR);
 module_param_named(strict, sdebug_strict, bool, S_IRUGO | S_IWUSR);
 module_param_named(submit_queues, submit_queues, int, S_IRUGO);
 module_param_named(poll_queues, poll_queues, int, S_IRUGO);
+module_param_named(preserves_write_order, sdeb_preserves_write_order, bool,
+		   S_IRUGO);
 module_param_named(tur_ms_to_ready, sdeb_tur_ms_to_ready, int, S_IRUGO);
 module_param_named(unmap_alignment, sdebug_unmap_alignment, int, S_IRUGO);
 module_param_named(unmap_granularity, sdebug_unmap_granularity, int, S_IRUGO);
@@ -7411,6 +7418,8 @@ MODULE_PARM_DESC(opts, "1->noise, 2->medium_err, 4->timeout, 8->recovered_err...
 MODULE_PARM_DESC(per_host_store, "If set, next positive add_host will get new store (def=0)");
 MODULE_PARM_DESC(physblk_exp, "physical block exponent (def=0)");
 MODULE_PARM_DESC(poll_queues, "support for iouring iopoll queues (1 to max(submit_queues - 1))");
+MODULE_PARM_DESC(preserves_write_order,
+		 "Whether or not to inform the block layer that this driver preserves the order of WRITE commands (def=0)");
 MODULE_PARM_DESC(ptype, "SCSI peripheral type(def=0[disk])");
 MODULE_PARM_DESC(random, "If set, uniformly randomize command duration between 0 and delay_in_ns");
 MODULE_PARM_DESC(removable, "claim to have removable media (def=0)");


* [PATCH v18 11/12] scsi: scsi_debug: Support injecting unaligned write errors
  2025-06-16 22:33 [PATCH v18 00/12] Improve write performance for zoned UFS devices Bart Van Assche
                   ` (9 preceding siblings ...)
  2025-06-16 22:33 ` [PATCH v18 10/12] scsi: scsi_debug: Add the preserves_write_order module parameter Bart Van Assche
@ 2025-06-16 22:33 ` Bart Van Assche
  2025-06-16 22:33 ` [PATCH v18 12/12] scsi: ufs: Inform the block layer about write ordering Bart Van Assche
  11 siblings, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2025-06-16 22:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
	Bart Van Assche, Douglas Gilbert, Martin K. Petersen, Ming Lei

Allow user space software, e.g. a blktests test, to inject unaligned
write errors.

Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/scsi_debug.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 5575bc8a833d..963246f5894a 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -230,6 +230,7 @@ struct tape_block {
 #define SDEBUG_OPT_NO_CDB_NOISE		0x4000
 #define SDEBUG_OPT_HOST_BUSY		0x8000
 #define SDEBUG_OPT_CMD_ABORT		0x10000
+#define SDEBUG_OPT_UNALIGNED_WRITE	0x20000
 #define SDEBUG_OPT_ALL_NOISE (SDEBUG_OPT_NOISE | SDEBUG_OPT_Q_NOISE | \
 			      SDEBUG_OPT_RESET_NOISE)
 #define SDEBUG_OPT_ALL_INJECTING (SDEBUG_OPT_RECOVERED_ERR | \
@@ -237,7 +238,8 @@ struct tape_block {
 				  SDEBUG_OPT_DIF_ERR | SDEBUG_OPT_DIX_ERR | \
 				  SDEBUG_OPT_SHORT_TRANSFER | \
 				  SDEBUG_OPT_HOST_BUSY | \
-				  SDEBUG_OPT_CMD_ABORT)
+				  SDEBUG_OPT_CMD_ABORT | \
+				  SDEBUG_OPT_UNALIGNED_WRITE)
 #define SDEBUG_OPT_RECOV_DIF_DIX (SDEBUG_OPT_RECOVERED_ERR | \
 				  SDEBUG_OPT_DIF_ERR | SDEBUG_OPT_DIX_ERR)
 
@@ -4915,6 +4917,14 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 	u8 *cmd = scp->cmnd;
 	bool meta_data_locked = false;
 
+	if (unlikely(sdebug_opts & SDEBUG_OPT_UNALIGNED_WRITE &&
+		     atomic_read(&sdeb_inject_pending))) {
+		atomic_set(&sdeb_inject_pending, 0);
+		mk_sense_buffer(scp, ILLEGAL_REQUEST, LBA_OUT_OF_RANGE,
+				UNALIGNED_WRITE_ASCQ);
+		return check_condition_result;
+	}
+
 	switch (cmd[0]) {
 	case WRITE_16:
 		ei_lba = 0;


* [PATCH v18 12/12] scsi: ufs: Inform the block layer about write ordering
  2025-06-16 22:33 [PATCH v18 00/12] Improve write performance for zoned UFS devices Bart Van Assche
                   ` (10 preceding siblings ...)
  2025-06-16 22:33 ` [PATCH v18 11/12] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
@ 2025-06-16 22:33 ` Bart Van Assche
  2025-06-17  6:58   ` Avri Altman
  2025-06-19 12:49   ` Peter Wang (王信友)
  11 siblings, 2 replies; 24+ messages in thread
From: Bart Van Assche @ 2025-06-16 22:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
	Bart Van Assche, Bao D. Nguyen, Can Guo, Martin K. Petersen,
	Avri Altman

From the UFSHCI 4.0 specification, about the MCQ mode:
"Command Submission
1. Host SW writes an Entry to SQ
2. Host SW updates SQ doorbell tail pointer

Command Processing
3. After fetching the Entry, Host Controller updates SQ doorbell head
   pointer
4. Host controller sends COMMAND UPIU to UFS device"

In other words, in MCQ mode, UFS controllers are required to forward
commands to the UFS device in the order these commands have been
received from the host.

This patch improves performance as follows on a test setup with a UFSHCI
4.0 controller:
- With the mq-deadline scheduler: 2.0x more IOPS for small writes.
- When not using an I/O scheduler compared to using mq-deadline with
  zone locking: 2.3x more IOPS for small writes.

Cc: Bao D. Nguyen <quic_nguyenb@quicinc.com>
Cc: Can Guo <quic_cang@quicinc.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Avri Altman <avri.altman@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/ufs/core/ufshcd.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index 4410e7d93b7d..340db59b7675 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -5281,6 +5281,12 @@ static int ufshcd_sdev_configure(struct scsi_device *sdev,
 	struct ufs_hba *hba = shost_priv(sdev->host);
 	struct request_queue *q = sdev->request_queue;
 
+	/*
+	 * The write order is preserved per MCQ. Without MCQ, auto-hibernation
+	 * may cause write reordering that results in unaligned write errors.
+	 */
+	lim->driver_preserves_write_order = hba->mcq_enabled;
+
 	lim->dma_pad_mask = PRDT_DATA_BYTE_COUNT_PAD - 1;
 
 	/*


* RE: [PATCH v18 12/12] scsi: ufs: Inform the block layer about write ordering
  2025-06-16 22:33 ` [PATCH v18 12/12] scsi: ufs: Inform the block layer about write ordering Bart Van Assche
@ 2025-06-17  6:58   ` Avri Altman
  2025-06-19 12:49   ` Peter Wang (王信友)
  1 sibling, 0 replies; 24+ messages in thread
From: Avri Altman @ 2025-06-17  6:58 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
	Christoph Hellwig, Damien Le Moal, Bao D. Nguyen, Can Guo,
	Martin K. Petersen, Avri Altman

> From the UFSHCI 4.0 specification, about the MCQ mode:
> "Command Submission
> 1. Host SW writes an Entry to SQ
> 2. Host SW updates SQ doorbell tail pointer
> 
> Command Processing
> 3. After fetching the Entry, Host Controller updates SQ doorbell head
>    pointer
> 4. Host controller sends COMMAND UPIU to UFS device"
> 
> In other words, in MCQ mode, UFS controllers are required to forward
> commands to the UFS device in the order these commands have been
> received from the host.
> 
> This patch improves performance as follows on a test setup with a UFSHCI
> 4.0 controller:
> - With the mq-deadline scheduler: 2.0x more IOPS for small writes.
> - When not using an I/O scheduler compared to using mq-deadline with
>   zone locking: 2.3x more IOPS for small writes.
> 
> Cc: Bao D. Nguyen <quic_nguyenb@quicinc.com>
> Cc: Can Guo <quic_cang@quicinc.com>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: Avri Altman <avri.altman@wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Avri Altman <avri.altman@sandisk.com>


* Re: [PATCH v18 12/12] scsi: ufs: Inform the block layer about write ordering
  2025-06-16 22:33 ` [PATCH v18 12/12] scsi: ufs: Inform the block layer about write ordering Bart Van Assche
  2025-06-17  6:58   ` Avri Altman
@ 2025-06-19 12:49   ` Peter Wang (王信友)
  2025-06-19 16:38     ` Bart Van Assche
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Wang (王信友) @ 2025-06-19 12:49 UTC (permalink / raw)
  To: axboe@kernel.dk, bvanassche@acm.org
  Cc: avri.altman@wdc.com, hch@lst.de, quic_cang@quicinc.com,
	quic_nguyenb@quicinc.com, linux-scsi@vger.kernel.org,
	dlemoal@kernel.org, linux-block@vger.kernel.org,
	martin.petersen@oracle.com

Hi Bart,

How can we ensure ordering across different HW queues?
For example, if req0 is sent to hwq0 and req1 is sent to hwq1,
couldn't req1 be dispatched first?

Thanks.
Peter


* Re: [PATCH v18 12/12] scsi: ufs: Inform the block layer about write ordering
  2025-06-19 12:49   ` Peter Wang (王信友)
@ 2025-06-19 16:38     ` Bart Van Assche
  0 siblings, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2025-06-19 16:38 UTC (permalink / raw)
  To: Peter Wang (王信友), axboe@kernel.dk
  Cc: avri.altman@wdc.com, hch@lst.de, quic_cang@quicinc.com,
	quic_nguyenb@quicinc.com, linux-scsi@vger.kernel.org,
	dlemoal@kernel.org, linux-block@vger.kernel.org,
	martin.petersen@oracle.com

On 6/19/25 5:49 AM, Peter Wang (王信友) wrote:
> How can we ensure ordering across different HW queues?
> For example, if req0 is sent to hwq0 and req1 is sent to hwq1,
> couldn't req1 be dispatched first?

Hi Peter,

Please take a look at patch 06/12 of this series. That patch prevents
the reordering mentioned above by queuing all zoned writes to the same
hwq as long as any zoned writes are in progress. See also
https://lore.kernel.org/linux-scsi/20250616223312.1607638-7-bvanassche@acm.org/
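
A minimal sketch of that mechanism, assuming a from_cpu member in
struct blk_zone_wplug that is reset to -1 once all pending zoned writes
for a zone have completed (the helper name is hypothetical, not the
actual patch):

static int blk_zone_wplug_submit_cpu(struct blk_zone_wplug *zwplug)
{
	/*
	 * Pin all zoned writes for this zone to one CPU, and hence to one
	 * software/hardware queue, while any of them are in flight.
	 */
	if (zwplug->from_cpu < 0)	/* no zoned writes in flight */
		zwplug->from_cpu = raw_smp_processor_id();
	return zwplug->from_cpu;
}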

Bart.


* Re: [PATCH v18 02/12] block: Rework request allocation in blk_mq_submit_bio()
  2025-06-16 22:33 ` [PATCH v18 02/12] block: Rework request allocation in blk_mq_submit_bio() Bart Van Assche
@ 2025-06-26  0:00   ` Damien Le Moal
  2025-06-26 17:25     ` Bart Van Assche
  0 siblings, 1 reply; 24+ messages in thread
From: Damien Le Moal @ 2025-06-26  0:00 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig

On 6/17/25 07:33, Bart Van Assche wrote:
> Prepare for allocating a request from a specific hctx by making
> blk_mq_submit_bio() allocate a request later.
> 
> The performance impact of this patch on the hot path is small: if a
> request is cached, one percpu_ref_get(&q->q_usage_counter) call and one
> percpu_ref_put(&q->q_usage_counter) call are added to the hot path.

Numbers ?

The change forces a queue enter for all BIOs because you remove the
cached request optimization. So I am not sure the impact is that small
if you have a very fast storage device.


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v18 03/12] block: Support allocating from a specific software queue
  2025-06-16 22:33 ` [PATCH v18 03/12] block: Support allocating from a specific software queue Bart Van Assche
@ 2025-06-26  0:04   ` Damien Le Moal
  2025-06-26 17:17     ` Bart Van Assche
  0 siblings, 1 reply; 24+ messages in thread
From: Damien Le Moal @ 2025-06-26  0:04 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig

On 6/17/25 07:33, Bart Van Assche wrote:
> A later patch will preserve the order of pipelined zoned writes by
> submitting all zoned writes per zone to the same software queue as
> previously submitted zoned writes. Hence support allocating a request
> from a specific software queue.

Why is this needed ? All you need to do is schedule the zone write plug BIO work
on a specific CPU. Then the request used will naturally come from the software
queue of that CPU, no ? Am I missing something ?


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v18 04/12] blk-mq: Restore the zoned write order when requeuing
  2025-06-16 22:33 ` [PATCH v18 04/12] blk-mq: Restore the zoned write order when requeuing Bart Van Assche
@ 2025-06-26  0:11   ` Damien Le Moal
  2025-06-26 19:42     ` Bart Van Assche
  0 siblings, 1 reply; 24+ messages in thread
From: Damien Le Moal @ 2025-06-26  0:11 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Yu Kuai

On 6/17/25 07:33, Bart Van Assche wrote:
> Zoned writes may be requeued. This happens if a block driver returns
> BLK_STS_RESOURCE, when handling SCSI unit attentions, or after the SCSI
> error handler has finished. Requests may be requeued in a different
> order than they were submitted. Restore the request order when requests
> are requeued. Add RQF_DONTPREP to RQF_NOMERGE_FLAGS because this patch
> may cause RQF_DONTPREP requests to reach the code that checks whether
> a request can be merged, and RQF_DONTPREP requests must not be merged.

Shouldn't this last part be a different prep patch ?

But overall, the commit message is inadequate because you have not yet
enabled pipelining, so this patch will see at most one write request per
zone. In that case we do not care about global ordering; we only care
about the order of requests within each zone.

>  static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
>  {
>  	struct request_queue *q = rq->q;
> @@ -2649,6 +2665,8 @@ static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
>  		spin_lock(&ctx->lock);
>  		if (flags & BLK_MQ_INSERT_AT_HEAD)
>  			list_add(&rq->queuelist, &ctx->rq_lists[hctx->type]);
> +		else if (flags & BLK_MQ_INSERT_ORDERED)
> +			blk_mq_insert_ordered(rq, &ctx->rq_lists[hctx->type]);
>  		else
>  			list_add_tail(&rq->queuelist,
>  				      &ctx->rq_lists[hctx->type]);

This pattern is repeated multiple times. Make it a helper ?



-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v18 05/12] blk-zoned: Add an argument to blk_zone_plug_bio()
  2025-06-16 22:33 ` [PATCH v18 05/12] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
@ 2025-06-26  0:15   ` Damien Le Moal
  2025-06-26 17:16     ` Bart Van Assche
  0 siblings, 1 reply; 24+ messages in thread
From: Damien Le Moal @ 2025-06-26  0:15 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig

On 6/17/25 07:33, Bart Van Assche wrote:
> Prepare for preserving the order of pipelined zoned writes per zone.

I do not understand why that is needed. And the commit message does not explain
anything.

I suspect it is to handle the case where a submitter changes CPU when issuing
writes to a zone. But if that is the case, you should be able to hide that
explicit CPU choice within the blk-zoned code instead of exposing it through the
API. E.g.: plug the bio and schedule the BIO work to run on the CPU you want.
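
For illustration, that suggestion amounts to something like the
following sketch (chosen_cpu is a made-up variable; zwplug->bio_work and
disk->zone_wplugs_wq are assumed to be the existing zone write plugging
fields):

	/*
	 * Run the plug's BIO work on a specific CPU so that the request
	 * is allocated from that CPU's software queue.
	 */
	queue_work_on(chosen_cpu, disk->zone_wplugs_wq, &zwplug->bio_work);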



-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v18 05/12] blk-zoned: Add an argument to blk_zone_plug_bio()
  2025-06-26  0:15   ` Damien Le Moal
@ 2025-06-26 17:16     ` Bart Van Assche
  0 siblings, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2025-06-26 17:16 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig

On 6/25/25 5:15 PM, Damien Le Moal wrote:
> I suspect it is to handle the case where a submitter changes CPU when issuing
> writes to a zone.

Yes, that's correct.

> But if that is the case, you should be able to hide that
> explicit CPU choice within the blk-zoned code instead of exposing it through the
> API. E.g.: plug the bio and schedule the BIO work to run on the CPU you want.

The purpose of this patch series is to process zoned writes faster for
storage controllers and devices that preserve the write order per queue.
Adding this kind of overhead in the hot path would make write processing
slower for UFS devices instead of faster and hence is unacceptable.

Bart.


* Re: [PATCH v18 03/12] block: Support allocating from a specific software queue
  2025-06-26  0:04   ` Damien Le Moal
@ 2025-06-26 17:17     ` Bart Van Assche
  0 siblings, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2025-06-26 17:17 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig

On 6/25/25 5:04 PM, Damien Le Moal wrote:
> On 6/17/25 07:33, Bart Van Assche wrote:
>> A later patch will preserve the order of pipelined zoned writes by
>> submitting all zoned writes per zone to the same software queue as
>> previously submitted zoned writes. Hence support allocating a request
>> from a specific software queue.
> 
> Why is this needed ? All you need to do is schedule the zone write plug BIO work
> on a specific CPU. Then the request used will naturally come from the software
> queue of that CPU, no ? Am I missing something ?

Yes. That approach may be acceptable for rotating magnetic storage but
is not acceptable for UFS devices. Inserting a context switch in the hot
path would slow down UFS writes significantly.

Thanks,

Bart.


* Re: [PATCH v18 02/12] block: Rework request allocation in blk_mq_submit_bio()
  2025-06-26  0:00   ` Damien Le Moal
@ 2025-06-26 17:25     ` Bart Van Assche
  0 siblings, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2025-06-26 17:25 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig

On 6/25/25 5:00 PM, Damien Le Moal wrote:
> On 6/17/25 07:33, Bart Van Assche wrote:
>> Prepare for allocating a request from a specific hctx by making
>> blk_mq_submit_bio() allocate a request later.
>>
>> The performance impact of this patch on the hot path is small: if a
>> request is cached, one percpu_ref_get(&q->q_usage_counter) call and one
>> percpu_ref_put(&q->q_usage_counter) call are added to the hot path.
> 
> Numbers ?
> 
> The change forces a queue enter for all BIOs because you remove the
> cached request optimization. So I am not sure the impact is that small
> if you have a very fast storage device.

Hi Damien,

I will check whether I can drop this patch by reworking the later
patches from this series.

Thanks,

Bart.


* Re: [PATCH v18 04/12] blk-mq: Restore the zoned write order when requeuing
  2025-06-26  0:11   ` Damien Le Moal
@ 2025-06-26 19:42     ` Bart Van Assche
  0 siblings, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2025-06-26 19:42 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, linux-scsi, Christoph Hellwig, Yu Kuai

On 6/25/25 5:11 PM, Damien Le Moal wrote:
> On 6/17/25 07:33, Bart Van Assche wrote:
>> Zoned writes may be requeued. This happens if a block driver returns
>> BLK_STS_RESOURCE, when handling SCSI unit attentions, or after the SCSI
>> error handler has finished. Requests may be requeued in a different
>> order than they were submitted. Restore the request order when requests
>> are requeued. Add RQF_DONTPREP to RQF_NOMERGE_FLAGS because this patch
>> may cause RQF_DONTPREP requests to reach the code that checks whether
>> a request can be merged, and RQF_DONTPREP requests must not be merged.
> 
> Shouldn't this last part be a different prep patch ?

No. Without the RQF_NOMERGE_FLAGS change, this patch would introduce a
bug: RQF_DONTPREP requests might get merged. As you know, patch series
must be bisectable.

> But overall, the commit message is inadequate because you have not yet
> enabled pipelining, so this patch will see at most one write request per
> zone. In that case we do not care about global ordering; we only care
> about the order of requests within each zone.

I will add in the patch description that this patch prepares for
enabling write pipelining.

>>   static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
>>   {
>>   	struct request_queue *q = rq->q;
>> @@ -2649,6 +2665,8 @@ static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
>>   		spin_lock(&ctx->lock);
>>   		if (flags & BLK_MQ_INSERT_AT_HEAD)
>>   			list_add(&rq->queuelist, &ctx->rq_lists[hctx->type]);
>> +		else if (flags & BLK_MQ_INSERT_ORDERED)
>> +			blk_mq_insert_ordered(rq, &ctx->rq_lists[hctx->type]);
>>   		else
>>   			list_add_tail(&rq->queuelist,
>>   				      &ctx->rq_lists[hctx->type]);
> 
> This pattern is repeated multiple times. Make it a helper ?

The repeated code is too short to justify a helper function. Common
feedback on block layer and NVMe kernel patches is that very short
helper functions reduce code readability, so it is typically better to
keep such code inline rather than to introduce a very short helper.
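
For context, an ordered insert along the lines of the quoted
blk_mq_insert_ordered() call could look roughly like this sketch
(inferred from the quoted hunk, not necessarily the exact patch):

static void blk_mq_insert_ordered(struct request *rq, struct list_head *list)
{
	struct request_queue *q = rq->q;
	struct request *rq2;

	/*
	 * Insert rq before the first request from the same queue that
	 * starts at a higher LBA, so that requeued zoned writes end up
	 * in ascending LBA order again.
	 */
	list_for_each_entry(rq2, list, queuelist)
		if (rq2->q == q && blk_rq_pos(rq2) > blk_rq_pos(rq))
			break;

	/* If no such request exists, &rq2->queuelist == list: append. */
	list_add_tail(&rq->queuelist, &rq2->queuelist);
}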

Thanks,

Bart.


end of thread, other threads:[~2025-06-26 19:42 UTC | newest]

Thread overview: 24+ messages
2025-06-16 22:33 [PATCH v18 00/12] Improve write performance for zoned UFS devices Bart Van Assche
2025-06-16 22:33 ` [PATCH v18 01/12] block: Support block drivers that preserve the order of write requests Bart Van Assche
2025-06-16 22:33 ` [PATCH v18 02/12] block: Rework request allocation in blk_mq_submit_bio() Bart Van Assche
2025-06-26  0:00   ` Damien Le Moal
2025-06-26 17:25     ` Bart Van Assche
2025-06-16 22:33 ` [PATCH v18 03/12] block: Support allocating from a specific software queue Bart Van Assche
2025-06-26  0:04   ` Damien Le Moal
2025-06-26 17:17     ` Bart Van Assche
2025-06-16 22:33 ` [PATCH v18 04/12] blk-mq: Restore the zoned write order when requeuing Bart Van Assche
2025-06-26  0:11   ` Damien Le Moal
2025-06-26 19:42     ` Bart Van Assche
2025-06-16 22:33 ` [PATCH v18 05/12] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
2025-06-26  0:15   ` Damien Le Moal
2025-06-26 17:16     ` Bart Van Assche
2025-06-16 22:33 ` [PATCH v18 06/12] blk-zoned: Support pipelining of zoned writes Bart Van Assche
2025-06-16 22:33 ` [PATCH v18 07/12] null_blk: Add the preserves_write_order attribute Bart Van Assche
2025-06-16 22:33 ` [PATCH v18 08/12] scsi: core: Retry unaligned zoned writes Bart Van Assche
2025-06-16 22:33 ` [PATCH v18 09/12] scsi: sd: Increase retry count for " Bart Van Assche
2025-06-16 22:33 ` [PATCH v18 10/12] scsi: scsi_debug: Add the preserves_write_order module parameter Bart Van Assche
2025-06-16 22:33 ` [PATCH v18 11/12] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
2025-06-16 22:33 ` [PATCH v18 12/12] scsi: ufs: Inform the block layer about write ordering Bart Van Assche
2025-06-17  6:58   ` Avri Altman
2025-06-19 12:49   ` Peter Wang (王信友)
2025-06-19 16:38     ` Bart Van Assche
