* [PATCH v21 00/12] Improve write performance for zoned UFS devices
@ 2025-07-17 20:57 Bart Van Assche
2025-07-17 20:57 ` [PATCH v21 01/12] block: Support block devices that preserve the order of write requests Bart Van Assche
` (13 more replies)
0 siblings, 14 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-07-17 20:57 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Hi Jens,
This patch series improves small write IOPS by a factor of two for zoned UFS
devices on my test setup. The changes included in this patch series are as
follows:
- A new request queue limits flag is introduced that allows block drivers to
declare whether or not the request order is preserved per hardware queue.
- The order of zoned writes is preserved in the block layer by submitting all
zoned writes from the same CPU core as long as any zoned writes are pending.
- A new member 'from_cpu' is introduced in the per-zone data structure
'blk_zone_wplug' to track from which CPU to submit zoned writes. This data
member is reset to -1 after all pending zoned writes for a zone have
completed (see the sketch after this list).
- The retry count for zoned writes is increased in the SCSI core to deal with
reordering caused by unit attention conditions or the SCSI error handler.
- New functionality is added in the scsi_debug driver to make it easier to
test the changes introduced by this patch series.
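As a quick illustration, here is a simplified sketch of the 'from_cpu'
lifecycle described above (illustrative pseudo-C, not taken verbatim from the
patches; the authoritative code is in the blk-zoned.c changes of this series):

	/* Simplified lifecycle of the per-zone 'from_cpu' member: */
	zwplug->from_cpu = -1;	/* idle: no zoned writes pending */
	/* The first zoned write for the zone is submitted on CPU n: */
	zwplug->from_cpu = n;	/* pin submissions for this zone to CPU n */
	/* All later zoned writes for the zone are issued from CPU n. */
	/* The last pending zoned write for the zone completes: */
	zwplug->from_cpu = -1;	/* unpin the zone */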
Please consider this patch series for the next merge window.
Thanks,
Bart.
Changes compared to v20:
- Converted a struct queue_limits member variable into a queue_limits feature
flag.
- Optimized performance of blk_mq_requeue_work().
- Instead of splitting blk_zone_wplug_bio_work(), introduced a loop in that
function.
- Reworked patch "blk-zoned: Support pipelining of zoned writes".
- Dropped the null_blk driver patch.
- Improved several patch descriptions.
Changes compared to v19:
- Dropped patch 2/11 "block: Support allocating from a specific software queue"
- Implemented Damien's proposal to always add pipelined bios to the plug list
and to submit all pipelined bios from the bio work for a zone.
- Added three refactoring patches to make this patch series easier to review.
Changes compared to v18:
- Dropped patch 2/12 "block: Rework request allocation in blk_mq_submit_bio()".
- Improved patch descriptions.
Changes compared to v17:
- Rebased the patch series on top of kernel v6.16-rc1.
- Dropped support for UFSHCI 3.0 controllers because the UFSHCI 3.0 auto-
hibernation mechanism causes request reordering. UFSHCI 4.0 controllers
remain supported.
- Removed the error handling and write pointer tracking mechanisms again
from block/blk-zoned.c.
- Dropped the dm-linear patch from this patch series since I'm not aware of
any use cases for write pipelining and dm-linear.
Changes compared to v16:
- Rebased the entire patch series on top of Jens' for-next branch. Compared
to when v16 of this series was posted, the BLK_ZONE_WPLUG_NEED_WP_UPDATE
flag has been introduced and support for REQ_NOWAIT has been fixed.
- The behavior for SMR disks is preserved: if .driver_preserves_write_order
has not been set, BLK_ZONE_WPLUG_NEED_WP_UPDATE is still set if a write
error has been encountered. If .driver_preserves_write_order has been
set, the write pointer is restored and the failed zoned writes are retried.
- The superfluous "disk->zone_wplugs_hash_bits != 0" tests have been removed.
Changes compared to v15:
- Reworked this patch series on top of the zone write plugging approach.
- Moved support for requeuing requests from the SCSI core into the block
layer core.
- In the UFS driver, instead of disabling write pipelining if
auto-hibernation is enabled, rely on the requeuing mechanism to handle
reordering caused by resuming from auto-hibernation.
Changes compared to v14:
- Removed the drivers/scsi/Kconfig.kunit and drivers/scsi/Makefile.kunit
files. Instead, modified drivers/scsi/Kconfig and added #include "*_test.c"
directives in the appropriate .c files. Removed the EXPORT_SYMBOL()
directives that were added to make the unit tests link.
- Fixed a double free in a unit test.
Changes compared to v13:
- Reworked patch "block: Preserve the order of requeued zoned writes".
- Addressed a performance concern by removing the eh_needs_prepare_resubmit
SCSI driver callback and by introducing the SCSI host template flag
.needs_prepare_resubmit instead.
- Added a patch that adds a 'host' argument to scsi_eh_flush_done_q().
- Made the code in unit tests less repetitive.
Changes compared to v12:
- Added two new patches: "block: Preserve the order of requeued zoned writes"
and "scsi: sd: Add a unit test for sd_cmp_sector()"
- Restricted the number of zoned write retries. To my surprise I had to add
"&& scmd->retries <= scmd->allowed" in the SCSI error handler to limit the
number of retries.
- In patch "scsi: ufs: Inform the block layer about write ordering", only set
ELEVATOR_F_ZBD_SEQ_WRITE for zoned block devices.
Changes compared to v11:
- Fixed a NULL pointer dereference that happened when booting from an ATA
device by adding an scmd->device != NULL check in scsi_needs_preparation().
- Updated Reviewed-by tags.
Changes compared to v10:
- Dropped the UFS MediaTek and HiSilicon patches because these are not correct
and because it is safe to drop these patches.
- Updated Acked-by / Reviewed-by tags.
Changes compared to v9:
- Introduced an additional scsi_driver callback: .eh_needs_prepare_resubmit().
- Renamed the scsi_debug kernel module parameter 'no_zone_write_lock' into
'preserves_write_order'.
- Fixed an out-of-bounds access in the scsi_call_prepare_resubmit() unit
test.
- Wrapped ufshcd_auto_hibern8_update() calls in UFS host drivers with
WARN_ON_ONCE() such that a kernel stack appears in case an error code is
returned.
- Elaborated a comment in the UFSHCI driver.
Changes compared to v8:
- Fixed handling of 'driver_preserves_write_order' and 'use_zone_write_lock'
in blk_stack_limits().
- Added a comment in disk_set_zoned().
- Modified blk_req_needs_zone_write_lock() such that it returns false if
q->limits.use_zone_write_lock is false.
- Modified disk_clear_zone_settings() such that it clears
q->limits.use_zone_write_lock.
- Left out one change from the mq-deadline patch that became superfluous due to
the blk_req_needs_zone_write_lock() change.
- Modified scsi_call_prepare_resubmit() such that it only calls list_sort() if
zoned writes have to be resubmitted for which zone write locking is disabled.
- Added an additional unit test for scsi_call_prepare_resubmit().
- Modified the sorting code in the sd driver such that only those SCSI commands
are sorted for which write locking is disabled.
- Modified sd_zbc.c such that ELEVATOR_F_ZBD_SEQ_WRITE is only set if the
write order is not preserved.
- Included three patches for UFS host drivers that rework code that wrote
directly to the auto-hibernation controller register.
- Modified the UFS driver such that enabling auto-hibernation is not allowed
if a zoned logical unit is present and if the controller operates in legacy
mode.
- Also in the UFS driver, simplified ufshcd_auto_hibern8_update().
Changes compared to v7:
- Split the queue_limits member variable `use_zone_write_lock' into two member
variables: `use_zone_write_lock' (set by disk_set_zoned()) and
`driver_preserves_write_order' (set by the block driver or SCSI LLD). This
should clear up the confusion about the purpose of this variable.
- Moved the code for sorting SCSI commands by LBA from the SCSI error handler
into the SCSI disk (sd) driver as requested by Christoph.
Changes compared to v6:
- Removed QUEUE_FLAG_NO_ZONE_WRITE_LOCK and instead introduced a flag in
the request queue limits data structure.
Changes compared to v5:
- Renamed scsi_cmp_lba() into scsi_cmp_sector().
- Improved several source code comments.
Changes compared to v4:
- Dropped the patch that introduces the REQ_NO_ZONE_WRITE_LOCK flag.
- Dropped the null_blk patch and added two scsi_debug patches instead.
- Dropped the f2fs patch.
- Split the patch for the UFS driver into two patches.
- Modified several patch descriptions and source code comments.
- Renamed dd_use_write_locking() into dd_use_zone_write_locking().
- Moved the list_sort() call from scsi_unjam_host() into scsi_eh_flush_done_q()
such that sorting happens just before reinserting.
- Removed the scsi_cmd_retry_allowed() call from scsi_check_sense() to make
sure that the retry counter is adjusted once per retry instead of twice.
Changes compared to v3:
- Restored the patch that introduces QUEUE_FLAG_NO_ZONE_WRITE_LOCK. That patch
had accidentally been left out from v2.
- In patch "block: Introduce the flag REQ_NO_ZONE_WRITE_LOCK", improved the
patch description and added the function blk_no_zone_write_lock().
- In patch "block/mq-deadline: Only use zone locking if necessary", moved the
blk_queue_is_zoned() call into dd_use_write_locking().
- In patch "fs/f2fs: Disable zone write locking", set REQ_NO_ZONE_WRITE_LOCK
from inside __bio_alloc() instead of in f2fs_submit_write_bio().
Changes compared to v2:
- Renamed the request queue flag for disabling zone write locking.
- Introduced a new request flag for disabling zone write locking.
- Modified the mq-deadline scheduler such that zone write locking is only
disabled if both flags are set.
- Added an F2FS patch that sets the request flag for disabling zone write
locking.
- Only disable zone write locking in the UFS driver if auto-hibernation is
disabled.
Changes compared to v1:
- Left out the patches that are already upstream.
- Switched the approach in patch "scsi: Retry unaligned zoned writes" from
retrying immediately to sending unaligned write commands to the SCSI error
handler.
Bart Van Assche (12):
block: Support block devices that preserve the order of write requests
blk-mq: Restore the zone write order when requeuing
blk-zoned: Add an argument to blk_zone_plug_bio()
blk-zoned: Split an if-statement
blk-zoned: Move code from disk_zone_wplug_add_bio() into its caller
blk-zoned: Introduce a loop in blk_zone_wplug_bio_work()
blk-zoned: Support pipelining of zoned writes
scsi: core: Retry unaligned zoned writes
scsi: sd: Increase retry count for zoned writes
scsi: scsi_debug: Add the preserves_write_order module parameter
scsi: scsi_debug: Support injecting unaligned write errors
ufs: core: Inform the block layer about write ordering
block/bfq-iosched.c | 2 +
block/blk-mq.c | 32 +++++--
block/blk-mq.h | 2 +
block/blk-settings.c | 2 +
block/blk-zoned.c | 179 +++++++++++++++++++++++---------------
block/kyber-iosched.c | 2 +
block/mq-deadline.c | 7 +-
drivers/md/dm.c | 5 +-
drivers/scsi/scsi_debug.c | 22 ++++-
drivers/scsi/scsi_error.c | 16 ++++
drivers/scsi/sd.c | 7 ++
drivers/ufs/core/ufshcd.c | 7 ++
include/linux/blk-mq.h | 13 ++-
include/linux/blkdev.h | 11 ++-
14 files changed, 223 insertions(+), 84 deletions(-)
* [PATCH v21 01/12] block: Support block devices that preserve the order of write requests
2025-07-17 20:57 [PATCH v21 00/12] Improve write performance for zoned UFS devices Bart Van Assche
@ 2025-07-17 20:57 ` Bart Van Assche
2025-07-17 20:57 ` [PATCH v21 02/12] blk-mq: Restore the zone write order when requeuing Bart Van Assche
` (12 subsequent siblings)
13 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-07-17 20:57 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche, Hannes Reinecke, Nitesh Shetty, Ming Lei
Some storage controllers preserve the request order per hardware queue.
Some but not all device mapper drivers preserve the bio order. Introduce
the feature flag BLK_FEAT_ORDERED_HWQ to allow block drivers and stacked
drivers to indicate that the order of write commands is preserved per
hardware queue and hence that serialization of writes per zone is not
required if all pending writes are submitted to the same hardware queue.
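As a usage sketch, a hypothetical blk-mq driver whose hardware preserves the
per-queue command order could opt in through its queue limits; 'tag_set',
'disk' and 'driver_data' below are assumed driver context, not part of this
patch:

	/* Hypothetical driver opt-in: writes stay ordered per hardware queue. */
	struct queue_limits lim = {
		.features	= BLK_FEAT_ZONED | BLK_FEAT_ORDERED_HWQ,
		/* ... other limits ... */
	};

	disk = blk_mq_alloc_disk(&tag_set, &lim, driver_data);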
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Nitesh Shetty <nj.shetty@samsung.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-settings.c | 2 ++
include/linux/blkdev.h | 6 ++++++
2 files changed, 8 insertions(+)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index a000daafbfb4..45ab1d644720 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -698,6 +698,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
t->features &= ~BLK_FEAT_NOWAIT;
if (!(b->features & BLK_FEAT_POLL))
t->features &= ~BLK_FEAT_POLL;
+ if (!(b->features & BLK_FEAT_ORDERED_HWQ))
+ t->features &= ~BLK_FEAT_ORDERED_HWQ;
t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5f14c20c8bc0..3ea6c77746c5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -337,6 +337,12 @@ typedef unsigned int __bitwise blk_features_t;
/* skip this queue in blk_mq_(un)quiesce_tagset */
#define BLK_FEAT_SKIP_TAGSET_QUIESCE ((__force blk_features_t)(1u << 13))
+/*
+ * The request order is preserved per hardware queue by the block driver and by
+ * the block device.
+ */
+#define BLK_FEAT_ORDERED_HWQ ((__force blk_features_t)(1u << 14))
+
/* undocumented magic for bcache */
#define BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE \
((__force blk_features_t)(1u << 15))
* [PATCH v21 02/12] blk-mq: Restore the zone write order when requeuing
2025-07-17 20:57 [PATCH v21 00/12] Improve write performance for zoned UFS devices Bart Van Assche
2025-07-17 20:57 ` [PATCH v21 01/12] block: Support block devices that preserve the order of write requests Bart Van Assche
@ 2025-07-17 20:57 ` Bart Van Assche
2025-07-17 20:57 ` [PATCH v21 03/12] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
` (11 subsequent siblings)
13 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-07-17 20:57 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche, Yu Kuai
Zoned writes may be requeued. This happens if a block driver returns
BLK_STS_RESOURCE, to handle SCSI unit attentions, or when the SCSI error
handler reinserts requests after error handling has finished. A later patch
enables write pipelining and increases the number of pending writes per
zone. If multiple writes are pending per zone, write requests may be
requeued in a different order than submitted. Restore the request order if
requests are requeued. Add RQF_DONTPREP to RQF_NOMERGE_FLAGS because this
patch may cause RQF_DONTPREP requests to be sent to the code that checks
whether a request can be merged. RQF_DONTPREP requests must not be merged.
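A worked example of the ordered insertion implemented below, with
hypothetical sector numbers: if requests for one request queue at positions
48, 16 and 32 are requeued in that order, blk_mq_insert_ordered() keeps the
list sorted by position:

	/* Hypothetical walk-through of blk_mq_insert_ordered(): */
	/* insert rq@48: list is empty          -> [48]          */
	/* insert rq@16: 48 > 16, insert before -> [16, 48]      */
	/* insert rq@32: 48 > 32, insert before -> [16, 32, 48]  */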
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/bfq-iosched.c | 2 ++
block/blk-mq.c | 21 ++++++++++++++++++++-
block/blk-mq.h | 2 ++
block/kyber-iosched.c | 2 ++
block/mq-deadline.c | 7 ++++++-
include/linux/blk-mq.h | 13 ++++++++++++-
6 files changed, 44 insertions(+), 3 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0cb1e9873aab..1bd3afe5d779 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6276,6 +6276,8 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
if (flags & BLK_MQ_INSERT_AT_HEAD) {
list_add(&rq->queuelist, &bfqd->dispatch);
+ } else if (flags & BLK_MQ_INSERT_ORDERED) {
+ blk_mq_insert_ordered(rq, &bfqd->dispatch);
} else if (!bfqq) {
list_add_tail(&rq->queuelist, &bfqd->dispatch);
} else {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b1d81839679f..58d3d0e724cb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1559,7 +1559,10 @@ static void blk_mq_requeue_work(struct work_struct *work)
* already. Insert it into the hctx dispatch list to avoid
* block layer merges for the request.
*/
- if (rq->rq_flags & RQF_DONTPREP)
+ if (q->limits.features & BLK_FEAT_ORDERED_HWQ &&
+ blk_rq_is_seq_zoned_write(rq))
+ blk_mq_insert_request(rq, BLK_MQ_INSERT_ORDERED);
+ else if (rq->rq_flags & RQF_DONTPREP)
blk_mq_request_bypass_insert(rq, 0);
else
blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
@@ -2592,6 +2595,20 @@ static void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx,
blk_mq_run_hw_queue(hctx, run_queue_async);
}
+void blk_mq_insert_ordered(struct request *rq, struct list_head *list)
+{
+ struct request_queue *q = rq->q;
+ struct request *rq2;
+
+ list_for_each_entry(rq2, list, queuelist)
+ if (rq2->q == q && blk_rq_pos(rq2) > blk_rq_pos(rq))
+ break;
+
+ /* Insert rq before rq2. If rq2 is the list head, append at the end. */
+ list_add_tail(&rq->queuelist, &rq2->queuelist);
+}
+EXPORT_SYMBOL_GPL(blk_mq_insert_ordered);
+
static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
{
struct request_queue *q = rq->q;
@@ -2646,6 +2663,8 @@ static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
spin_lock(&ctx->lock);
if (flags & BLK_MQ_INSERT_AT_HEAD)
list_add(&rq->queuelist, &ctx->rq_lists[hctx->type]);
+ else if (flags & BLK_MQ_INSERT_ORDERED)
+ blk_mq_insert_ordered(rq, &ctx->rq_lists[hctx->type]);
else
list_add_tail(&rq->queuelist,
&ctx->rq_lists[hctx->type]);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index affb2e14b56e..393660311a56 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -40,8 +40,10 @@ enum {
typedef unsigned int __bitwise blk_insert_t;
#define BLK_MQ_INSERT_AT_HEAD ((__force blk_insert_t)0x01)
+#define BLK_MQ_INSERT_ORDERED ((__force blk_insert_t)0x02)
void blk_mq_submit_bio(struct bio *bio);
+void blk_mq_insert_ordered(struct request *rq, struct list_head *list);
int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, struct io_comp_batch *iob,
unsigned int flags);
void blk_mq_exit_queue(struct request_queue *q);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 4dba8405bd01..051c05ceafd7 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -603,6 +603,8 @@ static void kyber_insert_requests(struct blk_mq_hw_ctx *hctx,
trace_block_rq_insert(rq);
if (flags & BLK_MQ_INSERT_AT_HEAD)
list_move(&rq->queuelist, head);
+ else if (flags & BLK_MQ_INSERT_ORDERED)
+ blk_mq_insert_ordered(rq, head);
else
list_move_tail(&rq->queuelist, head);
sbitmap_set_bit(&khd->kcq_map[sched_domain],
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 2edf1cac06d5..110fef65b829 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -710,7 +710,12 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
* set expire time and add to fifo list
*/
rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
- list_add_tail(&rq->queuelist, &per_prio->fifo_list[data_dir]);
+ if (flags & BLK_MQ_INSERT_ORDERED)
+ blk_mq_insert_ordered(rq,
+ &per_prio->fifo_list[data_dir]);
+ else
+ list_add_tail(&rq->queuelist,
+ &per_prio->fifo_list[data_dir]);
}
}
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2a5a828f19a0..1c516151fff0 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -86,7 +86,7 @@ enum rqf_flags {
/* flags that prevent us from merging requests: */
#define RQF_NOMERGE_FLAGS \
- (RQF_STARTED | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
+ (RQF_STARTED | RQF_FLUSH_SEQ | RQF_DONTPREP | RQF_SPECIAL_PAYLOAD)
enum mq_rq_state {
MQ_RQ_IDLE = 0,
@@ -1191,4 +1191,15 @@ static inline int blk_rq_map_sg(struct request *rq, struct scatterlist *sglist)
}
void blk_dump_rq_flags(struct request *, char *);
+static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
+{
+ switch (req_op(rq)) {
+ case REQ_OP_WRITE:
+ case REQ_OP_WRITE_ZEROES:
+ return bdev_zone_is_seq(rq->q->disk->part0, blk_rq_pos(rq));
+ default:
+ return false;
+ }
+}
+
#endif /* BLK_MQ_H */
* [PATCH v21 03/12] blk-zoned: Add an argument to blk_zone_plug_bio()
2025-07-17 20:57 [PATCH v21 00/12] Improve write performance for zoned UFS devices Bart Van Assche
2025-07-17 20:57 ` [PATCH v21 01/12] block: Support block devices that preserve the order of write requests Bart Van Assche
2025-07-17 20:57 ` [PATCH v21 02/12] blk-mq: Restore the zone write order when requeuing Bart Van Assche
@ 2025-07-17 20:57 ` Bart Van Assche
2025-07-18 7:13 ` Damien Le Moal
2025-07-17 20:58 ` [PATCH v21 04/12] blk-zoned: Split an if-statement Bart Van Assche
` (10 subsequent siblings)
13 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2025-07-17 20:57 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Software that submits zoned writes, e.g. a filesystem, may do so from
multiple CPUs as long as the zoned writes are serialized per zone.
Submitting bios from different CPUs may cause bio reordering if e.g.
different bios reach the storage device through different queues. Prepare
for preserving the order of pipelined zoned writes per zone by adding the
'rq_cpu' argument to blk_zone_plug_bio(). This argument tells
blk_zone_plug_bio() from which CPU a cached request has been allocated.
The cached request will only be used if it matches the CPU from which
zoned writes are being submitted for the zone associated with the bio.
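Summarizing the contract of the new argument (commentary added here, not
code from the patch):

	/*
	 * rq_cpu >= 0: the caller holds a cached request that was allocated
	 *	on that software queue (rq->mq_ctx->cpu in blk_mq_submit_bio()).
	 * rq_cpu == -1: no cached request is held, or the caller is BIO-based
	 *	(e.g. device mapper), so no submission CPU is implied.
	 */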
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 7 +++----
block/blk-zoned.c | 5 ++++-
drivers/md/dm.c | 5 ++---
include/linux/blkdev.h | 5 +++--
4 files changed, 12 insertions(+), 10 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 58d3d0e724cb..c1035a2bbda8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3190,10 +3190,9 @@ void blk_mq_submit_bio(struct bio *bio)
if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
goto queue_exit;
- if (bio_needs_zone_write_plugging(bio)) {
- if (blk_zone_plug_bio(bio, nr_segs))
- goto queue_exit;
- }
+ if (bio_needs_zone_write_plugging(bio) &&
+ blk_zone_plug_bio(bio, nr_segs, rq ? rq->mq_ctx->cpu : -1))
+ goto queue_exit;
new_request:
if (rq) {
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index ef43aaca49f4..7e0f90626459 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1110,6 +1110,9 @@ static void blk_zone_wplug_handle_native_zone_append(struct bio *bio)
* blk_zone_plug_bio - Handle a zone write BIO with zone write plugging
* @bio: The BIO being submitted
* @nr_segs: The number of physical segments of @bio
+ * @rq_cpu: software queue onto which a request will be queued. -1 if the caller
+ * has not yet decided onto which software queue to queue the request or if
+ * the bio won't be converted into a request.
*
* Handle write, write zeroes and zone append operations requiring emulation
* using zone write plugging.
@@ -1118,7 +1121,7 @@ static void blk_zone_wplug_handle_native_zone_append(struct bio *bio)
* write plug. Otherwise, return false to let the submission path process
* @bio normally.
*/
-bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
+bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int rq_cpu)
{
struct block_device *bdev = bio->bi_bdev;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ca889328fdfe..5033af6d687c 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1814,9 +1814,8 @@ static inline bool dm_zone_bio_needs_split(struct mapped_device *md,
static inline bool dm_zone_plug_bio(struct mapped_device *md, struct bio *bio)
{
- if (!bio_needs_zone_write_plugging(bio))
- return false;
- return blk_zone_plug_bio(bio, 0);
+ return bio_needs_zone_write_plugging(bio) &&
+ blk_zone_plug_bio(bio, 0, -1);
}
static blk_status_t __send_zone_reset_all_emulated(struct clone_info *ci,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3ea6c77746c5..904e2bb1e5fc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -897,7 +897,7 @@ static inline bool bio_needs_zone_write_plugging(struct bio *bio)
}
}
-bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs);
+bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int rq_cpu);
/**
* disk_zone_capacity - returns the zone capacity of zone containing @sector
@@ -932,7 +932,8 @@ static inline bool bio_needs_zone_write_plugging(struct bio *bio)
return false;
}
-static inline bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
+static inline bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs,
+ int rq_cpu)
{
return false;
}
* [PATCH v21 04/12] blk-zoned: Split an if-statement
2025-07-17 20:57 [PATCH v21 00/12] Improve write performance for zoned UFS devices Bart Van Assche
` (2 preceding siblings ...)
2025-07-17 20:57 ` [PATCH v21 03/12] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
@ 2025-07-17 20:58 ` Bart Van Assche
2025-07-17 20:58 ` [PATCH v21 05/12] blk-zoned: Move code from disk_zone_wplug_add_bio() into its caller Bart Van Assche
` (9 subsequent siblings)
13 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-07-17 20:58 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Split an if-statement and also the comment above that if-statement. This
patch prepares for moving code from disk_zone_wplug_add_bio() into its
caller. No functionality has been changed.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-zoned.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 7e0f90626459..d7bdc92bedae 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1036,13 +1036,14 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
/*
- * If the zone is already plugged, add the BIO to the plug BIO list.
- * Do the same for REQ_NOWAIT BIOs to ensure that we will not see a
+ * Add REQ_NOWAIT BIOs to the plug list to ensure that we will not see a
* BLK_STS_AGAIN failure if we let the BIO execute.
- * Otherwise, plug and let the BIO execute.
*/
- if ((zwplug->flags & BLK_ZONE_WPLUG_PLUGGED) ||
- (bio->bi_opf & REQ_NOWAIT))
+ if (bio->bi_opf & REQ_NOWAIT)
+ goto plug;
+
+ /* If the zone is already plugged, add the BIO to the plug BIO list. */
+ if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)
goto plug;
if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
@@ -1051,6 +1052,7 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
return true;
}
+ /* Otherwise, plug and submit the BIO. */
zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
spin_unlock_irqrestore(&zwplug->lock, flags);
* [PATCH v21 05/12] blk-zoned: Move code from disk_zone_wplug_add_bio() into its caller
2025-07-17 20:57 [PATCH v21 00/12] Improve write performance for zoned UFS devices Bart Van Assche
` (3 preceding siblings ...)
2025-07-17 20:58 ` [PATCH v21 04/12] blk-zoned: Split an if-statement Bart Van Assche
@ 2025-07-17 20:58 ` Bart Van Assche
2025-07-18 7:15 ` Damien Le Moal
2025-07-17 20:58 ` [PATCH v21 06/12] blk-zoned: Introduce a loop in blk_zone_wplug_bio_work() Bart Van Assche
` (8 subsequent siblings)
13 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2025-07-17 20:58 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Move the following code into the only caller of disk_zone_wplug_add_bio():
- The code for clearing the REQ_NOWAIT flag.
- The code that sets the BLK_ZONE_WPLUG_PLUGGED flag.
- The disk_zone_wplug_schedule_bio_work() call.
No functionality has been changed.
This patch prepares for zoned write pipelining by removing the code from
disk_zone_wplug_add_bio() that will not apply to all bio processing cases
once zoned writes are pipelined.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-zoned.c | 34 ++++++++++++++--------------------
1 file changed, 14 insertions(+), 20 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index d7bdc92bedae..8fe6e545f300 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -782,8 +782,6 @@ static inline void disk_zone_wplug_add_bio(struct gendisk *disk,
struct blk_zone_wplug *zwplug,
struct bio *bio, unsigned int nr_segs)
{
- bool schedule_bio_work = false;
-
/*
* Grab an extra reference on the BIO request queue usage counter.
* This reference will be reused to submit a request for the BIO for
@@ -799,16 +797,6 @@ static inline void disk_zone_wplug_add_bio(struct gendisk *disk,
*/
bio_clear_polled(bio);
- /*
- * REQ_NOWAIT BIOs are always handled using the zone write plug BIO
- * work, which can block. So clear the REQ_NOWAIT flag and schedule the
- * work if this is the first BIO we are plugging.
- */
- if (bio->bi_opf & REQ_NOWAIT) {
- schedule_bio_work = !(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED);
- bio->bi_opf &= ~REQ_NOWAIT;
- }
-
/*
* Reuse the poll cookie field to store the number of segments when
* split to the hardware limits.
@@ -824,11 +812,6 @@ static inline void disk_zone_wplug_add_bio(struct gendisk *disk,
bio_list_add(&zwplug->bio_list, bio);
trace_disk_zone_wplug_add_bio(zwplug->disk->queue, zwplug->zone_no,
bio->bi_iter.bi_sector, bio_sectors(bio));
-
- zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
-
- if (schedule_bio_work)
- disk_zone_wplug_schedule_bio_work(disk, zwplug);
}
/*
@@ -993,6 +976,7 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
{
struct gendisk *disk = bio->bi_bdev->bd_disk;
sector_t sector = bio->bi_iter.bi_sector;
+ bool schedule_bio_work = false;
struct blk_zone_wplug *zwplug;
gfp_t gfp_mask = GFP_NOIO;
unsigned long flags;
@@ -1039,12 +1023,16 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
* Add REQ_NOWAIT BIOs to the plug list to ensure that we will not see a
* BLK_STS_AGAIN failure if we let the BIO execute.
*/
- if (bio->bi_opf & REQ_NOWAIT)
- goto plug;
+ if (bio->bi_opf & REQ_NOWAIT) {
+ bio->bi_opf &= ~REQ_NOWAIT;
+ if (!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED))
+ goto plug;
+ goto add_to_bio_list;
+ }
/* If the zone is already plugged, add the BIO to the plug BIO list. */
if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)
- goto plug;
+ goto add_to_bio_list;
if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
spin_unlock_irqrestore(&zwplug->lock, flags);
@@ -1060,7 +1048,13 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
return false;
plug:
+ zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+ schedule_bio_work = true;
+
+add_to_bio_list:
disk_zone_wplug_add_bio(disk, zwplug, bio, nr_segs);
+ if (schedule_bio_work)
+ disk_zone_wplug_schedule_bio_work(disk, zwplug);
spin_unlock_irqrestore(&zwplug->lock, flags);
* [PATCH v21 06/12] blk-zoned: Introduce a loop in blk_zone_wplug_bio_work()
2025-07-17 20:57 [PATCH v21 00/12] Improve write performance for zoned UFS devices Bart Van Assche
` (4 preceding siblings ...)
2025-07-17 20:58 ` [PATCH v21 05/12] blk-zoned: Move code from disk_zone_wplug_add_bio() into its caller Bart Van Assche
@ 2025-07-17 20:58 ` Bart Van Assche
2025-07-18 7:17 ` Damien Le Moal
2025-07-17 20:58 ` [PATCH v21 07/12] blk-zoned: Support pipelining of zoned writes Bart Van Assche
` (7 subsequent siblings)
13 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2025-07-17 20:58 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Prepare for submitting multiple bios from inside a single
blk_zone_wplug_bio_work() call. No functionality has been changed.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-zoned.c | 72 +++++++++++++++++++++++------------------------
1 file changed, 36 insertions(+), 36 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 8fe6e545f300..6ef53f78fa3b 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1283,47 +1283,47 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
struct blk_zone_wplug *zwplug =
container_of(work, struct blk_zone_wplug, bio_work);
struct block_device *bdev;
- unsigned long flags;
struct bio *bio;
- /*
- * Submit the next plugged BIO. If we do not have any, clear
- * the plugged flag.
- */
- spin_lock_irqsave(&zwplug->lock, flags);
-
+ do {
again:
- bio = bio_list_pop(&zwplug->bio_list);
- if (!bio) {
- zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
- spin_unlock_irqrestore(&zwplug->lock, flags);
- goto put_zwplug;
- }
-
- trace_blk_zone_wplug_bio(zwplug->disk->queue, zwplug->zone_no,
- bio->bi_iter.bi_sector, bio_sectors(bio));
-
- if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
- blk_zone_wplug_bio_io_error(zwplug, bio);
- goto again;
- }
-
- spin_unlock_irqrestore(&zwplug->lock, flags);
+ /*
+ * Submit the next plugged BIO. If we do not have any, clear
+ * the plugged flag.
+ */
+ scoped_guard(spinlock_irqsave, &zwplug->lock) {
+ bio = bio_list_pop(&zwplug->bio_list);
+ if (!bio) {
+ zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+ goto put_zwplug;
+ }
+
+ trace_blk_zone_wplug_bio(zwplug->disk->queue,
+ zwplug->zone_no,
+ bio->bi_iter.bi_sector,
+ bio_sectors(bio));
+
+ if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
+ blk_zone_wplug_bio_io_error(zwplug, bio);
+ goto again;
+ }
+ }
- bdev = bio->bi_bdev;
+ bdev = bio->bi_bdev;
- /*
- * blk-mq devices will reuse the extra reference on the request queue
- * usage counter we took when the BIO was plugged, but the submission
- * path for BIO-based devices will not do that. So drop this extra
- * reference here.
- */
- if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO)) {
- bdev->bd_disk->fops->submit_bio(bio);
- blk_queue_exit(bdev->bd_disk->queue);
- } else {
- blk_mq_submit_bio(bio);
- }
+ /*
+ * blk-mq devices will reuse the extra reference on the request
+ * queue usage counter we took when the BIO was plugged, but the
+ * submission path for BIO-based devices will not do that. So
+ * drop this extra reference here.
+ */
+ if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO)) {
+ bdev->bd_disk->fops->submit_bio(bio);
+ blk_queue_exit(bdev->bd_disk->queue);
+ } else {
+ blk_mq_submit_bio(bio);
+ }
+ } while (0);
put_zwplug:
/* Drop the reference we took in disk_zone_wplug_schedule_bio_work(). */
* [PATCH v21 07/12] blk-zoned: Support pipelining of zoned writes
2025-07-17 20:57 [PATCH v21 00/12] Improve write performance for zoned UFS devices Bart Van Assche
` (5 preceding siblings ...)
2025-07-17 20:58 ` [PATCH v21 06/12] blk-zoned: Introduce a loop in blk_zone_wplug_bio_work() Bart Van Assche
@ 2025-07-17 20:58 ` Bart Van Assche
2025-07-18 7:38 ` Damien Le Moal
2025-07-17 20:58 ` [PATCH v21 08/12] scsi: core: Retry unaligned " Bart Van Assche
` (6 subsequent siblings)
13 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2025-07-17 20:58 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Support pipelining of zoned writes if the block driver preserves the write
order per hardware queue. Track, per zone, the software queue to which
writes have been queued. If zoned writes are pipelined, submit new writes
to the same software queue as the writes that are already in progress. This
prevents the reordering that could result from submitting requests for the
same zone to different software or hardware queues.
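Condensed decision sketch for a zoned write when BLK_FEAT_ORDERED_HWQ is set
(simplified from the blk_zone_wplug_handle_write() changes below; the
plugged-flag and REQ_NOWAIT checks are omitted):

	if (zwplug->from_cpu < 0)	/* no zoned writes in progress */
		zwplug->from_cpu = raw_smp_processor_id();
	if (zwplug->from_cpu == from_cpu)
		return false;		/* let the caller submit the bio */
	goto plug;			/* the bio work submits it from from_cpu */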
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 4 +--
block/blk-zoned.c | 66 ++++++++++++++++++++++++++++++++++++++---------
2 files changed, 56 insertions(+), 14 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c1035a2bbda8..56384b4aadd9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3145,8 +3145,8 @@ void blk_mq_submit_bio(struct bio *bio)
/*
* A BIO that was released from a zone write plug has already been
* through the preparation in this function, already holds a reference
- * on the queue usage counter, and is the only write BIO in-flight for
- * the target zone. Go straight to preparing a request for it.
+ * on the queue usage counter. Go straight to preparing a request for
+ * it.
*/
if (bio_zone_write_plugging(bio)) {
nr_segs = bio->__bi_nr_segments;
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 6ef53f78fa3b..3813e4bc8b0b 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -53,6 +53,8 @@ static const char *const zone_cond_name[] = {
* @zone_no: The number of the zone the plug is managing.
* @wp_offset: The zone write pointer location relative to the start of the zone
* as a number of 512B sectors.
+ * @from_cpu: Software queue to submit writes from for drivers that preserve
+ * the write order.
* @bio_list: The list of BIOs that are currently plugged.
* @bio_work: Work struct to handle issuing of plugged BIOs
* @rcu_head: RCU head to free zone write plugs with an RCU grace period.
@@ -65,6 +67,7 @@ struct blk_zone_wplug {
unsigned int flags;
unsigned int zone_no;
unsigned int wp_offset;
+ int from_cpu;
struct bio_list bio_list;
struct work_struct bio_work;
struct rcu_head rcu_head;
@@ -74,8 +77,7 @@ struct blk_zone_wplug {
/*
* Zone write plug flags bits:
* - BLK_ZONE_WPLUG_PLUGGED: Indicates that the zone write plug is plugged,
- * that is, that write BIOs are being throttled due to a write BIO already
- * being executed or the zone write plug bio list is not empty.
+ * that is, that write BIOs are being throttled.
* - BLK_ZONE_WPLUG_NEED_WP_UPDATE: Indicates that we lost track of a zone
* write pointer offset and need to update it.
* - BLK_ZONE_WPLUG_UNHASHED: Indicates that the zone write plug was removed
@@ -572,6 +574,7 @@ static struct blk_zone_wplug *disk_get_and_lock_zone_wplug(struct gendisk *disk,
zwplug->flags = 0;
zwplug->zone_no = zno;
zwplug->wp_offset = bdev_offset_from_zone_start(disk->part0, sector);
+ zwplug->from_cpu = -1;
bio_list_init(&zwplug->bio_list);
INIT_WORK(&zwplug->bio_work, blk_zone_wplug_bio_work);
zwplug->disk = disk;
@@ -768,14 +771,19 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
+ lockdep_assert_held(&zwplug->lock);
+
/*
* Take a reference on the zone write plug and schedule the submission
* of the next plugged BIO. blk_zone_wplug_bio_work() will release the
* reference we take here.
*/
- WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
refcount_inc(&zwplug->ref);
- queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
+ if (zwplug->from_cpu >= 0)
+ queue_work_on(zwplug->from_cpu, disk->zone_wplugs_wq,
+ &zwplug->bio_work);
+ else
+ queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
}
static inline void disk_zone_wplug_add_bio(struct gendisk *disk,
@@ -972,9 +980,12 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
return true;
}
-static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
+static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs,
+ int from_cpu)
{
struct gendisk *disk = bio->bi_bdev->bd_disk;
+ const bool ordered_hwq = bio_op(bio) != REQ_OP_ZONE_APPEND &&
+ disk->queue->limits.features & BLK_FEAT_ORDERED_HWQ;
sector_t sector = bio->bi_iter.bi_sector;
bool schedule_bio_work = false;
struct blk_zone_wplug *zwplug;
@@ -1034,15 +1045,38 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)
goto add_to_bio_list;
+ if (ordered_hwq && zwplug->from_cpu < 0) {
+ /* No zoned writes are in progress. Select the current CPU. */
+ zwplug->from_cpu = raw_smp_processor_id();
+ }
+
+ if (ordered_hwq && zwplug->from_cpu == from_cpu) {
+ /*
+ * The block driver preserves the write order, zoned writes have
+ * not been plugged and the zoned write will be submitted from
+ * zwplug->from_cpu. Let the caller submit the bio.
+ */
+ } else if (ordered_hwq) {
+ /*
+ * The block driver preserves the write order but the caller
+ * allocated a request from another CPU. Submit the bio from
+ * zwplug->from_cpu.
+ */
+ goto plug;
+ } else {
+ /*
+ * The block driver does not preserve the write order. Plug and
+ * let the caller submit the BIO.
+ */
+ zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+ }
+
if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
spin_unlock_irqrestore(&zwplug->lock, flags);
bio_io_error(bio);
return true;
}
- /* Otherwise, plug and submit the BIO. */
- zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
-
spin_unlock_irqrestore(&zwplug->lock, flags);
return false;
@@ -1150,7 +1184,7 @@ bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int rq_cpu)
fallthrough;
case REQ_OP_WRITE:
case REQ_OP_WRITE_ZEROES:
- return blk_zone_wplug_handle_write(bio, nr_segs);
+ return blk_zone_wplug_handle_write(bio, nr_segs, rq_cpu);
case REQ_OP_ZONE_RESET:
return blk_zone_wplug_handle_reset_or_finish(bio, 0);
case REQ_OP_ZONE_FINISH:
@@ -1182,6 +1216,9 @@ static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+ if (refcount_read(&zwplug->ref) == 2)
+ zwplug->from_cpu = -1;
+
/*
* If the zone is full (it was fully written or finished, or empty
* (it was reset), remove its zone write plug from the hash table.
@@ -1283,6 +1320,8 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
struct blk_zone_wplug *zwplug =
container_of(work, struct blk_zone_wplug, bio_work);
struct block_device *bdev;
+ bool ordered_hwq = zwplug->disk->queue->limits.features &
+ BLK_FEAT_ORDERED_HWQ;
struct bio *bio;
do {
@@ -1323,7 +1362,7 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
} else {
blk_mq_submit_bio(bio);
}
- } while (0);
+ } while (ordered_hwq);
put_zwplug:
/* Drop the reference we took in disk_zone_wplug_schedule_bio_work(). */
@@ -1850,6 +1889,7 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
unsigned int zwp_zone_no, zwp_ref;
unsigned int zwp_bio_list_size;
unsigned long flags;
+ int from_cpu;
spin_lock_irqsave(&zwplug->lock, flags);
zwp_zone_no = zwplug->zone_no;
@@ -1857,10 +1897,12 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
zwp_ref = refcount_read(&zwplug->ref);
zwp_wp_offset = zwplug->wp_offset;
zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
+ from_cpu = zwplug->from_cpu;
spin_unlock_irqrestore(&zwplug->lock, flags);
- seq_printf(m, "%u 0x%x %u %u %u\n", zwp_zone_no, zwp_flags, zwp_ref,
- zwp_wp_offset, zwp_bio_list_size);
+ seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u bio_list_size %u from_cpu %d\n",
+ zwp_zone_no, zwp_flags, zwp_ref, zwp_wp_offset,
+ zwp_bio_list_size, from_cpu);
}
int queue_zone_wplugs_show(void *data, struct seq_file *m)
* [PATCH v21 08/12] scsi: core: Retry unaligned zoned writes
2025-07-17 20:57 [PATCH v21 00/12] Improve write performance for zoned UFS devices Bart Van Assche
` (6 preceding siblings ...)
2025-07-17 20:58 ` [PATCH v21 07/12] blk-zoned: Support pipelining of zoned writes Bart Van Assche
@ 2025-07-17 20:58 ` Bart Van Assche
2025-07-17 20:58 ` [PATCH v21 09/12] scsi: sd: Increase retry count for " Bart Van Assche
` (5 subsequent siblings)
13 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-07-17 20:58 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche, Martin K. Petersen, Ming Lei
If zoned writes (REQ_OP_WRITE) for a sequential write required zone have
a starting LBA that differs from the write pointer, e.g. because a prior
write triggered a unit attention condition, then the storage device will
respond with an UNALIGNED WRITE COMMAND error. Retry commands that failed
with an unaligned write error.
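For reference, the sense data matched below is ILLEGAL REQUEST with
additional sense code/qualifier 0x21/0x04, which the ZBC standard defines as
UNALIGNED WRITE COMMAND.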
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/scsi_error.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 746ff6a1f309..0af0ce8c9ed0 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -713,6 +713,22 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
fallthrough;
case ILLEGAL_REQUEST:
+ /*
+ * Unaligned write command. This may indicate that zoned writes
+ * have been received by the device in the wrong order. If write
+ * pipelining is enabled, retry.
+ */
+ if (sshdr.asc == 0x21 && sshdr.ascq == 0x04 &&
+ req->q->limits.features & BLK_FEAT_ORDERED_HWQ &&
+ blk_rq_is_seq_zoned_write(req) &&
+ scsi_cmd_retry_allowed(scmd)) {
+ SCSI_LOG_ERROR_RECOVERY(1,
+ sdev_printk(KERN_WARNING, scmd->device,
+ "Retrying unaligned write at LBA %#llx.\n",
+ scsi_get_lba(scmd)));
+ return NEEDS_RETRY;
+ }
+
if (sshdr.asc == 0x20 || /* Invalid command operation code */
sshdr.asc == 0x21 || /* Logical block address out of range */
sshdr.asc == 0x22 || /* Invalid function */
* [PATCH v21 09/12] scsi: sd: Increase retry count for zoned writes
2025-07-17 20:57 [PATCH v21 00/12] Improve write performance for zoned UFS devices Bart Van Assche
` (7 preceding siblings ...)
2025-07-17 20:58 ` [PATCH v21 08/12] scsi: core: Retry unaligned " Bart Van Assche
@ 2025-07-17 20:58 ` Bart Van Assche
2025-07-17 20:58 ` [PATCH v21 10/12] scsi: scsi_debug: Add the preserves_write_order module parameter Bart Van Assche
` (4 subsequent siblings)
13 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-07-17 20:58 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche, Martin K. Petersen, Ming Lei
If the write order is preserved, increase the retry count for write
commands sent to a sequential zone by the maximum number of outstanding
commands: in the worst case, a reordered zoned write has to be retried
(number of outstanding writes per sequential zone) - 1 times.
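A worked example with hypothetical numbers: with q->nr_requests = 32 and all
32 requests pending as writes for the same zone, the write submitted last
may have to be retried once per reordered predecessor, i.e. up to
32 - 1 = 31 times. Adding q->nr_requests to the retry budget covers this
worst case.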
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/sd.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index eeaa6af294b8..587d91a9e10c 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1404,6 +1404,13 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
cmd->transfersize = sdp->sector_size;
cmd->underflow = nr_blocks << 9;
cmd->allowed = sdkp->max_retries;
+ /*
+ * Increase the number of allowed retries for zoned writes if the driver
+ * preserves the command order.
+ */
+ if (rq->q->limits.features & BLK_FEAT_ORDERED_HWQ &&
+ blk_rq_is_seq_zoned_write(rq))
+ cmd->allowed += rq->q->nr_requests;
cmd->sdb.length = nr_blocks * sdp->sector_size;
SCSI_LOG_HLQUEUE(1,
* [PATCH v21 10/12] scsi: scsi_debug: Add the preserves_write_order module parameter
2025-07-17 20:57 [PATCH v21 00/12] Improve write performance for zoned UFS devices Bart Van Assche
` (8 preceding siblings ...)
2025-07-17 20:58 ` [PATCH v21 09/12] scsi: sd: Increase retry count for " Bart Van Assche
@ 2025-07-17 20:58 ` Bart Van Assche
2025-07-17 20:58 ` [PATCH v21 11/12] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
` (3 subsequent siblings)
13 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-07-17 20:58 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche, Douglas Gilbert, Martin K. Petersen, Ming Lei
Zoned writes are not serialized per zone if the block driver reports that
it preserves the order of write commands. Make it easier to test write
pipelining by adding the preserves_write_order module parameter, which sets
the BLK_FEAT_ORDERED_HWQ feature flag.
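For example, loading the driver with preserves_write_order=1 together with
an emulated host-managed disk (e.g. via the existing zbc=managed parameter)
makes scsi_debug report BLK_FEAT_ORDERED_HWQ and thereby exercises the write
pipelining paths added earlier in this series.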
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/scsi_debug.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index aef33d1e346a..e970407cc7c4 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -1004,6 +1004,7 @@ static int dix_reads;
static int dif_errors;
/* ZBC global data */
+static bool sdeb_preserves_write_order;
static bool sdeb_zbc_in_use; /* true for host-aware and host-managed disks */
static int sdeb_zbc_zone_cap_mb;
static int sdeb_zbc_zone_size_mb;
@@ -6607,10 +6608,15 @@ static struct sdebug_dev_info *find_build_dev_info(struct scsi_device *sdev)
static int scsi_debug_sdev_init(struct scsi_device *sdp)
{
+ struct request_queue *q = sdp->request_queue;
+
if (sdebug_verbose)
pr_info("sdev_init <%u %u %u %llu>\n",
sdp->host->host_no, sdp->channel, sdp->id, sdp->lun);
+ if (sdeb_preserves_write_order)
+ q->limits.features |= BLK_FEAT_ORDERED_HWQ;
+
return 0;
}
@@ -7339,6 +7345,8 @@ module_param_named(statistics, sdebug_statistics, bool, S_IRUGO | S_IWUSR);
module_param_named(strict, sdebug_strict, bool, S_IRUGO | S_IWUSR);
module_param_named(submit_queues, submit_queues, int, S_IRUGO);
module_param_named(poll_queues, poll_queues, int, S_IRUGO);
+module_param_named(preserves_write_order, sdeb_preserves_write_order, bool,
+ S_IRUGO);
module_param_named(tur_ms_to_ready, sdeb_tur_ms_to_ready, int, S_IRUGO);
module_param_named(unmap_alignment, sdebug_unmap_alignment, int, S_IRUGO);
module_param_named(unmap_granularity, sdebug_unmap_granularity, int, S_IRUGO);
@@ -7411,6 +7419,8 @@ MODULE_PARM_DESC(opts, "1->noise, 2->medium_err, 4->timeout, 8->recovered_err...
MODULE_PARM_DESC(per_host_store, "If set, next positive add_host will get new store (def=0)");
MODULE_PARM_DESC(physblk_exp, "physical block exponent (def=0)");
MODULE_PARM_DESC(poll_queues, "support for iouring iopoll queues (1 to max(submit_queues - 1))");
+MODULE_PARM_DESC(preserves_write_order,
+ "Whether or not to inform the block layer that this driver preserves the order of WRITE commands (def=0)");
MODULE_PARM_DESC(ptype, "SCSI peripheral type(def=0[disk])");
MODULE_PARM_DESC(random, "If set, uniformly randomize command duration between 0 and delay_in_ns");
MODULE_PARM_DESC(removable, "claim to have removable media (def=0)");
* [PATCH v21 11/12] scsi: scsi_debug: Support injecting unaligned write errors
2025-07-17 20:57 [PATCH v21 00/12] Improve write performance for zoned UFS devices Bart Van Assche
` (9 preceding siblings ...)
2025-07-17 20:58 ` [PATCH v21 10/12] scsi: scsi_debug: Add the preserves_write_order module parameter Bart Van Assche
@ 2025-07-17 20:58 ` Bart Van Assche
2025-07-17 20:58 ` [PATCH v21 12/12] ufs: core: Inform the block layer about write ordering Bart Van Assche
` (2 subsequent siblings)
13 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-07-17 20:58 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche, Douglas Gilbert, Martin K. Petersen, Ming Lei
Allow user space software, e.g. a blktests test, to inject unaligned
write errors.
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/scsi_debug.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index e970407cc7c4..959d0dc850a9 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -230,6 +230,7 @@ struct tape_block {
#define SDEBUG_OPT_NO_CDB_NOISE 0x4000
#define SDEBUG_OPT_HOST_BUSY 0x8000
#define SDEBUG_OPT_CMD_ABORT 0x10000
+#define SDEBUG_OPT_UNALIGNED_WRITE 0x20000
#define SDEBUG_OPT_ALL_NOISE (SDEBUG_OPT_NOISE | SDEBUG_OPT_Q_NOISE | \
SDEBUG_OPT_RESET_NOISE)
#define SDEBUG_OPT_ALL_INJECTING (SDEBUG_OPT_RECOVERED_ERR | \
@@ -237,7 +238,8 @@ struct tape_block {
SDEBUG_OPT_DIF_ERR | SDEBUG_OPT_DIX_ERR | \
SDEBUG_OPT_SHORT_TRANSFER | \
SDEBUG_OPT_HOST_BUSY | \
- SDEBUG_OPT_CMD_ABORT)
+ SDEBUG_OPT_CMD_ABORT | \
+ SDEBUG_OPT_UNALIGNED_WRITE)
#define SDEBUG_OPT_RECOV_DIF_DIX (SDEBUG_OPT_RECOVERED_ERR | \
SDEBUG_OPT_DIF_ERR | SDEBUG_OPT_DIX_ERR)
@@ -4915,6 +4917,14 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
u8 *cmd = scp->cmnd;
bool meta_data_locked = false;
+ if (unlikely(sdebug_opts & SDEBUG_OPT_UNALIGNED_WRITE &&
+ atomic_read(&sdeb_inject_pending))) {
+ atomic_set(&sdeb_inject_pending, 0);
+ mk_sense_buffer(scp, ILLEGAL_REQUEST, LBA_OUT_OF_RANGE,
+ UNALIGNED_WRITE_ASCQ);
+ return check_condition_result;
+ }
+
switch (cmd[0]) {
case WRITE_16:
ei_lba = 0;
* [PATCH v21 12/12] ufs: core: Inform the block layer about write ordering
2025-07-17 20:57 [PATCH v21 00/12] Improve write performance for zoned UFS devices Bart Van Assche
` (10 preceding siblings ...)
2025-07-17 20:58 ` [PATCH v21 11/12] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
@ 2025-07-17 20:58 ` Bart Van Assche
2025-07-18 7:08 ` [PATCH v21 00/12] Improve write performance for zoned UFS devices Damien Le Moal
2025-07-18 7:39 ` Damien Le Moal
13 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-07-17 20:58 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche, Avri Altman, Bao D. Nguyen, Can Guo,
Martin K. Petersen
From the UFSHCI 4.0 specification, about the MCQ mode:
"Command Submission
1. Host SW writes an Entry to SQ
2. Host SW updates SQ doorbell tail pointer
Command Processing
3. After fetching the Entry, Host Controller updates SQ doorbell head
pointer
4. Host controller sends COMMAND UPIU to UFS device"
In other words, in MCQ mode, UFS controllers are required to forward
commands to the UFS device in the order these commands have been
received from the host.
This patch improves performance as follows on a test setup with a UFSHCI
4.0 controller:
- When not using an I/O scheduler: 2.3x more IOPS for small writes.
- With the mq-deadline scheduler: 2.0x more IOPS for small writes.
Reviewed-by: Avri Altman <avri.altman@wdc.com>
Cc: Bao D. Nguyen <quic_nguyenb@quicinc.com>
Cc: Can Guo <quic_cang@quicinc.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/ufs/core/ufshcd.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index 50adfb8b335b..6ff097e2c919 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -5281,6 +5281,13 @@ static int ufshcd_sdev_configure(struct scsi_device *sdev,
struct ufs_hba *hba = shost_priv(sdev->host);
struct request_queue *q = sdev->request_queue;
+ /*
+ * The write order is preserved per MCQ. Without MCQ, auto-hibernation
+ * may cause write reordering that results in unaligned write errors.
+ */
+ if (hba->mcq_enabled)
+ lim->features |= BLK_FEAT_ORDERED_HWQ;
+
lim->dma_pad_mask = PRDT_DATA_BYTE_COUNT_PAD - 1;
/*
* Re: [PATCH v21 00/12] Improve write performance for zoned UFS devices
2025-07-17 20:57 [PATCH v21 00/12] Improve write performance for zoned UFS devices Bart Van Assche
` (11 preceding siblings ...)
2025-07-17 20:58 ` [PATCH v21 12/12] ufs: core: Inform the block layer about write ordering Bart Van Assche
@ 2025-07-18 7:08 ` Damien Le Moal
2025-07-18 18:30 ` Bart Van Assche
2025-07-18 7:39 ` Damien Le Moal
13 siblings, 1 reply; 25+ messages in thread
From: Damien Le Moal @ 2025-07-18 7:08 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 7/18/25 05:57, Bart Van Assche wrote:
> Hi Jens,
>
> This patch series improves small write IOPS by a factor of two for zoned UFS
> devices on my test setup. The changes included in this patch series are as
> follows:
> - A new request queue limits flag is introduced that allows block drivers to
> declare whether or not the request order is preserved per hardware queue.
> - The order of zoned writes is preserved in the block layer by submitting all
> zoned writes from the same CPU core as long as any zoned writes are pending.
> - A new member 'from_cpu' is introduced in the per-zone data structure
> 'blk_zone_wplug' to track from which CPU to submit zoned writes. This data
> member is reset to -1 after all pending zoned writes for a zone have
> completed.
> - The retry count for zoned writes is increased in the SCSI core to deal with
> reordering caused by unit attention conditions or the SCSI error handler.
> - New functionality is added in the scsi_debug driver to make it easier to
> test the changes introduced by this patch series.
>
> Please consider this patch series for the next merge window.
Bart,
How did you test this ?
I do not have a zoned UFS drive, so I used an NVMe ZNS drive, which should be
fine since the commands in the submission queues of a PCI controller are always
handled in order. So I added:
diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index cce4c5b55aa9..36d16b8d3f37 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -108,7 +108,7 @@ int nvme_query_zone_info(struct nvme_ns *ns, unsigned lbaf,
void nvme_update_zone_info(struct nvme_ns *ns, struct queue_limits *lim,
struct nvme_zone_info *zi)
{
- lim->features |= BLK_FEAT_ZONED;
+ lim->features |= BLK_FEAT_ZONED | BLK_FEAT_ORDERED_HWQ;
lim->max_open_zones = zi->max_open_zones;
lim->max_active_zones = zi->max_active_zones;
lim->max_hw_zone_append_sectors = ns->ctrl->max_zone_append;
And ran this:
fio --name=test --filename=/dev/nvme1n2 --ioengine=io_uring --iodepth=128 \
--direct=1 --bs=4096 --zonemode=zbd --rw=randwrite \
--numjobs=1
And I get unaligned write errors 100% of the time. Looking at your patches
again, you are not handling the REQ_NOWAIT case in
blk_zone_wplug_handle_write(). If you get a REQ_NOWAIT BIO, which io_uring
will issue, the code goes directly to plugging the BIO, thus bypassing your
from_cpu handling.
But the same fio command with libaio (no REQ_NOWAIT in that case) also fails.
I have not looked further into it yet.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH v21 03/12] blk-zoned: Add an argument to blk_zone_plug_bio()
2025-07-17 20:57 ` [PATCH v21 03/12] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
@ 2025-07-18 7:13 ` Damien Le Moal
2025-07-18 15:54 ` Bart Van Assche
0 siblings, 1 reply; 25+ messages in thread
From: Damien Le Moal @ 2025-07-18 7:13 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 7/18/25 05:57, Bart Van Assche wrote:
> Software that submits zoned writes, e.g. a filesystem, may submit zoned
> writes from multiple CPUs as long as the zoned writes are serialized per
> zone. Submitting bios from different CPUs may cause bio reordering if
> e.g. different bios reach the storage device through different queues.
> Prepare for preserving the order of pipelined zoned writes per zone by
> adding the 'rq_cpu` argument to blk_zone_plug_bio(). This argument tells
> blk_zone_plug_bio() from which CPU a cached request has been allocated.
> The cached request will only be used if it matches the CPU from which
> zoned writes are being submitted for the zone associated with the bio.
I still do not understand why this patch is needed, because you can get the
current CPU submitting the BIO inside blk_zone_plug_bio() with
raw_smp_processor_id(). That CPU ID should be the same as that of the cached
request, which we will use only if the BIO is not going through the BIO work,
that is, if it is the first write BIO in-flight for the zone.
Furthermore, for the DM case, you pass a CPU of "-1", but if the DM target needs
zone append emulation, it will use zone write plugging. So the same control as
for blk-mq is needed.
>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/blk-mq.c | 7 +++----
> block/blk-zoned.c | 5 ++++-
> drivers/md/dm.c | 5 ++---
> include/linux/blkdev.h | 5 +++--
> 4 files changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 58d3d0e724cb..c1035a2bbda8 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3190,10 +3190,9 @@ void blk_mq_submit_bio(struct bio *bio)
> if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
> goto queue_exit;
>
> - if (bio_needs_zone_write_plugging(bio)) {
> - if (blk_zone_plug_bio(bio, nr_segs))
> - goto queue_exit;
> - }
> + if (bio_needs_zone_write_plugging(bio) &&
> + blk_zone_plug_bio(bio, nr_segs, rq ? rq->mq_ctx->cpu : -1))
> + goto queue_exit;
>
> new_request:
> if (rq) {
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index ef43aaca49f4..7e0f90626459 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -1110,6 +1110,9 @@ static void blk_zone_wplug_handle_native_zone_append(struct bio *bio)
> * blk_zone_plug_bio - Handle a zone write BIO with zone write plugging
> * @bio: The BIO being submitted
> * @nr_segs: The number of physical segments of @bio
> + * @rq_cpu: software queue onto which a request will be queued. -1 if the caller
> + * has not yet decided onto which software queue to queue the request or if
> + * the bio won't be converted into a request.
> *
> * Handle write, write zeroes and zone append operations requiring emulation
> * using zone write plugging.
> @@ -1118,7 +1121,7 @@ static void blk_zone_wplug_handle_native_zone_append(struct bio *bio)
> * write plug. Otherwise, return false to let the submission path process
> * @bio normally.
> */
> -bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
> +bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int rq_cpu)
> {
> struct block_device *bdev = bio->bi_bdev;
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index ca889328fdfe..5033af6d687c 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1814,9 +1814,8 @@ static inline bool dm_zone_bio_needs_split(struct mapped_device *md,
>
> static inline bool dm_zone_plug_bio(struct mapped_device *md, struct bio *bio)
> {
> - if (!bio_needs_zone_write_plugging(bio))
> - return false;
> - return blk_zone_plug_bio(bio, 0);
> + return bio_needs_zone_write_plugging(bio) &&
> + blk_zone_plug_bio(bio, 0, -1);
> }
>
> static blk_status_t __send_zone_reset_all_emulated(struct clone_info *ci,
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 3ea6c77746c5..904e2bb1e5fc 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -897,7 +897,7 @@ static inline bool bio_needs_zone_write_plugging(struct bio *bio)
> }
> }
>
> -bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs);
> +bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int rq_cpu);
>
> /**
> * disk_zone_capacity - returns the zone capacity of zone containing @sector
> @@ -932,7 +932,8 @@ static inline bool bio_needs_zone_write_plugging(struct bio *bio)
> return false;
> }
>
> -static inline bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
> +static inline bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs,
> + int rq_cpu)
> {
> return false;
> }
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v21 05/12] blk-zoned: Move code from disk_zone_wplug_add_bio() into its caller
2025-07-17 20:58 ` [PATCH v21 05/12] blk-zoned: Move code from disk_zone_wplug_add_bio() into its caller Bart Van Assche
@ 2025-07-18 7:15 ` Damien Le Moal
0 siblings, 0 replies; 25+ messages in thread
From: Damien Le Moal @ 2025-07-18 7:15 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 7/18/25 05:58, Bart Van Assche wrote:
> Move the following code into the only caller of disk_zone_wplug_add_bio():
> - The code for clearing the REQ_NOWAIT flag.
> - The code that sets the BLK_ZONE_WPLUG_PLUGGED flag.
> - The disk_zone_wplug_schedule_bio_work() call.
>
> No functionality has been changed.
>
> This patch prepares for zoned write pipelining by removing the code from
> disk_zone_wplug_add_bio() that does not apply to all zoned write pipelining
> bio processing cases.
>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
[...]
> @@ -993,6 +976,7 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
> {
> struct gendisk *disk = bio->bi_bdev->bd_disk;
> sector_t sector = bio->bi_iter.bi_sector;
> + bool schedule_bio_work = false;
> struct blk_zone_wplug *zwplug;
> gfp_t gfp_mask = GFP_NOIO;
> unsigned long flags;
> @@ -1039,12 +1023,16 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
> * Add REQ_NOWAIT BIOs to the plug list to ensure that we will not see a
> * BLK_STS_AGAIN failure if we let the BIO execute.
> */
> - if (bio->bi_opf & REQ_NOWAIT)
> - goto plug;
> + if (bio->bi_opf & REQ_NOWAIT) {
> + bio->bi_opf &= ~REQ_NOWAIT;
> + if (!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED))
> + goto plug;
See below.
> + goto add_to_bio_list;
> + }
>
> /* If the zone is already plugged, add the BIO to the plug BIO list. */
> if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)
> - goto plug;
> + goto add_to_bio_list;
>
> if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
> spin_unlock_irqrestore(&zwplug->lock, flags);
> @@ -1060,7 +1048,13 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
> return false;
>
> plug:
> + zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
> + schedule_bio_work = true;
Since there is only a single "goto plug", I would simply move the 2 lines above
to the site of that goto and replace the goto itself with "goto
add_to_bio_list".
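Something like this (an untested sketch, reusing the variable names from your
patch):

if (bio->bi_opf & REQ_NOWAIT) {
        bio->bi_opf &= ~REQ_NOWAIT;
        if (!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)) {
                zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
                schedule_bio_work = true;
        }
        goto add_to_bio_list;
}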
> +
> +add_to_bio_list:
> disk_zone_wplug_add_bio(disk, zwplug, bio, nr_segs);
> + if (schedule_bio_work)
> + disk_zone_wplug_schedule_bio_work(disk, zwplug);
>
> spin_unlock_irqrestore(&zwplug->lock, flags);
>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v21 06/12] blk-zoned: Introduce a loop in blk_zone_wplug_bio_work()
2025-07-17 20:58 ` [PATCH v21 06/12] blk-zoned: Introduce a loop in blk_zone_wplug_bio_work() Bart Van Assche
@ 2025-07-18 7:17 ` Damien Le Moal
0 siblings, 0 replies; 25+ messages in thread
From: Damien Le Moal @ 2025-07-18 7:17 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 7/18/25 05:58, Bart Van Assche wrote:
> Prepare for submitting multiple bios from inside a single
> blk_zone_wplug_bio_work() call. No functionality has been changed.
>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/blk-zoned.c | 72 +++++++++++++++++++++++------------------------
> 1 file changed, 36 insertions(+), 36 deletions(-)
>
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 8fe6e545f300..6ef53f78fa3b 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -1283,47 +1283,47 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
> struct blk_zone_wplug *zwplug =
> container_of(work, struct blk_zone_wplug, bio_work);
> struct block_device *bdev;
> - unsigned long flags;
> struct bio *bio;
>
> - /*
> - * Submit the next plugged BIO. If we do not have any, clear
> - * the plugged flag.
> - */
> - spin_lock_irqsave(&zwplug->lock, flags);
> -
> + do {
> again:
> - bio = bio_list_pop(&zwplug->bio_list);
> - if (!bio) {
> - zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
> - spin_unlock_irqrestore(&zwplug->lock, flags);
> - goto put_zwplug;
> - }
> -
> - trace_blk_zone_wplug_bio(zwplug->disk->queue, zwplug->zone_no,
> - bio->bi_iter.bi_sector, bio_sectors(bio));
> -
> - if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
> - blk_zone_wplug_bio_io_error(zwplug, bio);
> - goto again;
> - }
> -
> - spin_unlock_irqrestore(&zwplug->lock, flags);
> + /*
> + * Submit the next plugged BIO. If we do not have any, clear
> + * the plugged flag.
> + */
> + scoped_guard(spinlock_irqsave, &zwplug->lock) {
I am really not a fan of this. It adds one level of indentation without making
the code easier to read.
> + bio = bio_list_pop(&zwplug->bio_list);
> + if (!bio) {
> + zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
> + goto put_zwplug;
> + }
> +
> + trace_blk_zone_wplug_bio(zwplug->disk->queue,
> + zwplug->zone_no,
> + bio->bi_iter.bi_sector,
> + bio_sectors(bio));
> +
> + if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
> + blk_zone_wplug_bio_io_error(zwplug, bio);
> + goto again;
> + }
> + }
>
> - bdev = bio->bi_bdev;
> + bdev = bio->bi_bdev;
>
> - /*
> - * blk-mq devices will reuse the extra reference on the request queue
> - * usage counter we took when the BIO was plugged, but the submission
> - * path for BIO-based devices will not do that. So drop this extra
> - * reference here.
> - */
> - if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO)) {
> - bdev->bd_disk->fops->submit_bio(bio);
> - blk_queue_exit(bdev->bd_disk->queue);
> - } else {
> - blk_mq_submit_bio(bio);
> - }
> + /*
> + * blk-mq devices will reuse the extra reference on the request
> + * queue usage counter we took when the BIO was plugged, but the
> + * submission path for BIO-based devices will not do that. So
> + * drop this extra reference here.
> + */
> + if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO)) {
> + bdev->bd_disk->fops->submit_bio(bio);
> + blk_queue_exit(bdev->bd_disk->queue);
> + } else {
> + blk_mq_submit_bio(bio);
> + }
> + } while (0);
>
> put_zwplug:
> /* Drop the reference we took in disk_zone_wplug_schedule_bio_work(). */
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v21 07/12] blk-zoned: Support pipelining of zoned writes
2025-07-17 20:58 ` [PATCH v21 07/12] blk-zoned: Support pipelining of zoned writes Bart Van Assche
@ 2025-07-18 7:38 ` Damien Le Moal
2025-07-18 16:29 ` Bart Van Assche
0 siblings, 1 reply; 25+ messages in thread
From: Damien Le Moal @ 2025-07-18 7:38 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 7/18/25 05:58, Bart Van Assche wrote:
> Support pipelining of zoned writes if the block driver preserves the write
> order per hardware queue. Track per zone to which software queue writes
> have been queued. If zoned writes are pipelined, submit new writes to the
> same software queue as the writes that are already in progress. This
> prevents reordering by submitting requests for the same zone to different
> software or hardware queues.
>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/blk-mq.c | 4 +--
> block/blk-zoned.c | 66 ++++++++++++++++++++++++++++++++++++++---------
> 2 files changed, 56 insertions(+), 14 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c1035a2bbda8..56384b4aadd9 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3145,8 +3145,8 @@ void blk_mq_submit_bio(struct bio *bio)
> /*
> * A BIO that was released from a zone write plug has already been
> * through the preparation in this function, already holds a reference
> - * on the queue usage counter, and is the only write BIO in-flight for
> - * the target zone. Go straight to preparing a request for it.
> + * on the queue usage counter. Go straight to preparing a request for
> + * it.
> */
> if (bio_zone_write_plugging(bio)) {
> nr_segs = bio->__bi_nr_segments;
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 6ef53f78fa3b..3813e4bc8b0b 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -53,6 +53,8 @@ static const char *const zone_cond_name[] = {
> * @zone_no: The number of the zone the plug is managing.
> * @wp_offset: The zone write pointer location relative to the start of the zone
> * as a number of 512B sectors.
> + * @from_cpu: Software queue to submit writes from for drivers that preserve
> + * the write order.
> * @bio_list: The list of BIOs that are currently plugged.
> * @bio_work: Work struct to handle issuing of plugged BIOs
> * @rcu_head: RCU head to free zone write plugs with an RCU grace period.
> @@ -65,6 +67,7 @@ struct blk_zone_wplug {
> unsigned int flags;
> unsigned int zone_no;
> unsigned int wp_offset;
> + int from_cpu;
> struct bio_list bio_list;
> struct work_struct bio_work;
> struct rcu_head rcu_head;
> @@ -74,8 +77,7 @@ struct blk_zone_wplug {
> /*
> * Zone write plug flags bits:
> * - BLK_ZONE_WPLUG_PLUGGED: Indicates that the zone write plug is plugged,
> - * that is, that write BIOs are being throttled due to a write BIO already
> - * being executed or the zone write plug bio list is not empty.
> + * that is, that write BIOs are being throttled.
> * - BLK_ZONE_WPLUG_NEED_WP_UPDATE: Indicates that we lost track of a zone
> * write pointer offset and need to update it.
> * - BLK_ZONE_WPLUG_UNHASHED: Indicates that the zone write plug was removed
> @@ -572,6 +574,7 @@ static struct blk_zone_wplug *disk_get_and_lock_zone_wplug(struct gendisk *disk,
> zwplug->flags = 0;
> zwplug->zone_no = zno;
> zwplug->wp_offset = bdev_offset_from_zone_start(disk->part0, sector);
> + zwplug->from_cpu = -1;
> bio_list_init(&zwplug->bio_list);
> INIT_WORK(&zwplug->bio_work, blk_zone_wplug_bio_work);
> zwplug->disk = disk;
> @@ -768,14 +771,19 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
> static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
> struct blk_zone_wplug *zwplug)
> {
> + lockdep_assert_held(&zwplug->lock);
Unrelated change. Please move this to a prep patch.
> +
> /*
> * Take a reference on the zone write plug and schedule the submission
> * of the next plugged BIO. blk_zone_wplug_bio_work() will release the
> * reference we take here.
> */
> - WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
Why do you remove this warning ?
> refcount_inc(&zwplug->ref);
> - queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
> + if (zwplug->from_cpu >= 0)
> + queue_work_on(zwplug->from_cpu, disk->zone_wplugs_wq,
> + &zwplug->bio_work);
> + else
> + queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
> }
>
> static inline void disk_zone_wplug_add_bio(struct gendisk *disk,
> @@ -972,9 +980,12 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
> return true;
> }
>
> -static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
> +static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs,
> + int from_cpu)
> {
> struct gendisk *disk = bio->bi_bdev->bd_disk;
> + const bool ordered_hwq = bio_op(bio) != REQ_OP_ZONE_APPEND &&
> + disk->queue->limits.features & BLK_FEAT_ORDERED_HWQ;
This is not correct. If the BIO is a zone append and
blk_zone_wplug_handle_write() is called, it means that we need to handle the BIO
using zone append emulation, that is, the BIO will be a regular write. So you
must treat it as if it originally was a regular write.
> sector_t sector = bio->bi_iter.bi_sector;
> bool schedule_bio_work = false;
> struct blk_zone_wplug *zwplug;
> @@ -1034,15 +1045,38 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
> if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)
> goto add_to_bio_list;
>
> + if (ordered_hwq && zwplug->from_cpu < 0) {
> + /* No zoned writes are in progress. Select the current CPU. */
> + zwplug->from_cpu = raw_smp_processor_id();
> + }
> +
> + if (ordered_hwq && zwplug->from_cpu == from_cpu) {
> + /*
> + * The block driver preserves the write order, zoned writes have
> + * not been plugged and the zoned write will be submitted from
> + * zwplug->from_cpu. Let the caller submit the bio.
> + */
> + } else if (ordered_hwq) {
> + /*
> + * The block driver preserves the write order but the caller
> + * allocated a request from another CPU. Submit the bio from
> + * zwplug->from_cpu.
> + */
> + goto plug;
> + } else {
> + /*
> + * The block driver does not preserve the write order. Plug and
> + * let the caller submit the BIO.
> + */
> + zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
> + }
In the last round of comments, I already suggested a much nicer way of writing
this, one that does not repeat the "if (ordered_hwq)" check and does not have
an empty if clause:
if (ordered_hwq) {
        /*
         * The block driver preserves the write order, zoned writes have
         * not been plugged and the zoned write will be submitted from
         * zwplug->from_cpu. Let the caller submit the bio.
         */
        if (zwplug->from_cpu < 0) {
                /*
                 * No zoned writes are in progress: select the
                 * current CPU.
                 */
                zwplug->from_cpu = raw_smp_processor_id();
        } else if (zwplug->from_cpu != raw_smp_processor_id()) {
                /*
                 * The caller allocated a request from another CPU.
                 * Submit the bio from zwplug->from_cpu.
                 */
                goto plug;
        }
} else {
        /*
         * The block driver does not preserve the write order. Plug and
         * let the caller submit the BIO.
         */
        zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
}
> if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
You moved the BLK_ZONE_WPLUG_PLUGGED flag set above. So if this fails, you need
to clear this flag and also reset zwplug->from_cpu to -1.
> spin_unlock_irqrestore(&zwplug->lock, flags);
> bio_io_error(bio);
> return true;
> }
>
> - /* Otherwise, plug and submit the BIO. */
> - zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
> -
> spin_unlock_irqrestore(&zwplug->lock, flags);
>
> return false;
> @@ -1150,7 +1184,7 @@ bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int rq_cpu)
> fallthrough;
> case REQ_OP_WRITE:
> case REQ_OP_WRITE_ZEROES:
> - return blk_zone_wplug_handle_write(bio, nr_segs);
> + return blk_zone_wplug_handle_write(bio, nr_segs, rq_cpu);
> case REQ_OP_ZONE_RESET:
> return blk_zone_wplug_handle_reset_or_finish(bio, 0);
> case REQ_OP_ZONE_FINISH:
> @@ -1182,6 +1216,9 @@ static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
>
> zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
>
> + if (refcount_read(&zwplug->ref) == 2)
This needs a comment explaining why you use the plug ref count instead of
unconditionally clearing from_cpu.
> + zwplug->from_cpu = -1;
> +
> /*
> * If the zone is full (it was fully written or finished, or empty
> * (it was reset), remove its zone write plug from the hash table.
> @@ -1283,6 +1320,8 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
> struct blk_zone_wplug *zwplug =
> container_of(work, struct blk_zone_wplug, bio_work);
> struct block_device *bdev;
> + bool ordered_hwq = zwplug->disk->queue->limits.features &
> + BLK_FEAT_ORDERED_HWQ;
Splitting the line after the "=" would be nicer.
> struct bio *bio;
>
> do {
> @@ -1323,7 +1362,7 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
> } else {
> blk_mq_submit_bio(bio);
> }
> - } while (0);
> + } while (ordered_hwq);
>
> put_zwplug:
> /* Drop the reference we took in disk_zone_wplug_schedule_bio_work(). */
> @@ -1850,6 +1889,7 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
> unsigned int zwp_zone_no, zwp_ref;
> unsigned int zwp_bio_list_size;
> unsigned long flags;
> + int from_cpu;
>
> spin_lock_irqsave(&zwplug->lock, flags);
> zwp_zone_no = zwplug->zone_no;
> @@ -1857,10 +1897,12 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
> zwp_ref = refcount_read(&zwplug->ref);
> zwp_wp_offset = zwplug->wp_offset;
> zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
> + from_cpu = zwplug->from_cpu;
> spin_unlock_irqrestore(&zwplug->lock, flags);
>
> - seq_printf(m, "%u 0x%x %u %u %u\n", zwp_zone_no, zwp_flags, zwp_ref,
> - zwp_wp_offset, zwp_bio_list_size);
> + seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u bio_list_size %u from_cpu %d\n",
> + zwp_zone_no, zwp_flags, zwp_ref, zwp_wp_offset,
> + zwp_bio_list_size, from_cpu);
> }
>
> int queue_zone_wplugs_show(void *data, struct seq_file *m)
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v21 00/12] Improve write performance for zoned UFS devices
2025-07-17 20:57 [PATCH v21 00/12] Improve write performance for zoned UFS devices Bart Van Assche
` (12 preceding siblings ...)
2025-07-18 7:08 ` [PATCH v21 00/12] Improve write performance for zoned UFS devices Damien Le Moal
@ 2025-07-18 7:39 ` Damien Le Moal
2025-07-18 16:32 ` Bart Van Assche
13 siblings, 1 reply; 25+ messages in thread
From: Damien Le Moal @ 2025-07-18 7:39 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 7/18/25 05:57, Bart Van Assche wrote:
> Changes compared to v20:
> - Converted a struct queue_limits member variable into a queue_limits feature
> flag.
> - Optimized performance of blk_mq_requeue_work().
> - Instead of splitting blk_zone_wplug_bio_work(), introduce a loop in that
> function.
> - Reworked patch "blk-zoned: Support pipelining of zoned writes".
> - Dropped the null_blk driver patch.
Why ? null_blk does not maintain submission order ?
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v21 03/12] blk-zoned: Add an argument to blk_zone_plug_bio()
2025-07-18 7:13 ` Damien Le Moal
@ 2025-07-18 15:54 ` Bart Van Assche
0 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-07-18 15:54 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 7/18/25 12:13 AM, Damien Le Moal wrote:
> I still do not understand why this patch is needed, because you can get the
> current CPU submitting the BIO inside blk_zone_plug_bio() with
> raw_smp_processor_id(). That CPU ID should be the same as that of the cached
> request, which we will use only if the BIO is not going through the BIO work,
> that is, if it is the first write BIO in-flight for the zone.
I do not agree with the above. With CONFIG_PREEMPT enabled, migration to
another CPU may happen after a cached request has been allocated and
before the zoned block device code is called. This can be prevented by
surrounding the code with preempt_disable() and preempt_enable(). However,
I don't think that we want to do this in submit_bio() since there is code
in submit_bio() that may sleep (bio_queue_enter()) and sleeping with
preemption disabled is not allowed.
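As a (hypothetical) illustration of the race:
1. blk_mq_submit_bio() picks up a cached request rq on CPU 0, so
   rq->mq_ctx->cpu == 0.
2. The task is preempted and migrated to CPU 1.
3. blk_zone_plug_bio() is called on CPU 1, so raw_smp_processor_id()
   returns 1 while the cached request was allocated on CPU 0.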
Is there perhaps something that I'm overlooking or misunderstanding?
> Furthermore, for the DM case, you pass a CPU of "-1", but if the DM target needs
> zone append emulation, it will use zone write plugging. So the same control as
> for blk-mq is needed.
This "-1" means that it is not known from which CPU a request will be
allocated since a migration to another CPU may happen between the
blk_zone_plug_bio() call and request allocation.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v21 07/12] blk-zoned: Support pipelining of zoned writes
2025-07-18 7:38 ` Damien Le Moal
@ 2025-07-18 16:29 ` Bart Van Assche
0 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-07-18 16:29 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 7/18/25 12:38 AM, Damien Le Moal wrote:
> On 7/18/25 05:58, Bart Van Assche wrote:
>> @@ -768,14 +771,19 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
>> static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
>> struct blk_zone_wplug *zwplug)
>> {
>> + lockdep_assert_held(&zwplug->lock);
>
> Unrelated change. Please move this to a prep patch.
I will drop this change since I don't really need it.
>> +
>> /*
>> * Take a reference on the zone write plug and schedule the submission
>> * of the next plugged BIO. blk_zone_wplug_bio_work() will release the
>> * reference we take here.
>> */
>> - WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
>
> Why do you remove this warning ?
This warning probably can be retained. I will look into restoring it.
>> @@ -972,9 +980,12 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
>> return true;
>> }
>>
>> -static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
>> +static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs,
>> + int from_cpu)
>> {
>> struct gendisk *disk = bio->bi_bdev->bd_disk;
>> + const bool ordered_hwq = bio_op(bio) != REQ_OP_ZONE_APPEND &&
>> + disk->queue->limits.features & BLK_FEAT_ORDERED_HWQ;
>
> This is not correct. If the BIO is a zone append and
> blk_zone_wplug_handle_write() is called, it means that we need to handle the BIO
> using zone append emulation, that is, the BIO will be a regular write. So you
> must treat it as if it originally was a regular write.
Hmm ... my understanding is that zone append emulation, including the
conversion of REQ_OP_ZONE_APPEND into REQ_OP_WRITE, happens after the above
code has been executed, namely in blk_zone_wplug_prepare_bio().
From that function:
[ ... ]
        if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
                bio->bi_opf &= ~REQ_OP_MASK;
                bio->bi_opf |= REQ_OP_WRITE | REQ_NOMERGE;
[ ... ]
Did I perhaps misunderstand your comment?
>> + if (refcount_read(&zwplug->ref) == 2)
>> + zwplug->from_cpu = -1;
>
> This needs a comment explaining why you use the plug ref count instead of
> unconditionally clearing from_cpu.
I'm considering adding the following comment:
/*
 * zwplug->from_cpu must not change while one or more writes are pending
 * for the zone associated with zwplug. zwplug->ref is 2 when the plug
 * is unused (one reference taken when the plug was allocated and
 * another reference taken by the caller context). Reset
 * zwplug->from_cpu if no more writes are pending.
 */
Thanks,
Bart.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v21 00/12] Improve write performance for zoned UFS devices
2025-07-18 7:39 ` Damien Le Moal
@ 2025-07-18 16:32 ` Bart Van Assche
0 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-07-18 16:32 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 7/18/25 12:39 AM, Damien Le Moal wrote:
> On 7/18/25 05:57, Bart Van Assche wrote:
>> Changes compared to v20:
>> - Converted a struct queue_limits member variable into a queue_limits feature
>> flag.
>> - Optimized performance of blk_mq_requeue_work().
>> - Instead of splitting blk_zone_wplug_bio_work(), introduce a loop in that
>> function.
>> - Reworked patch "blk-zoned: Support pipelining of zoned writes".
>> - Dropped the null_blk driver patch.
>
> Why ? null_blk does not maintain submission order ?
It does, but modifying scsi_debug is sufficient to test this patch
series. Do you perhaps want me to restore the null_blk patch?
Thanks,
Bart.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v21 00/12] Improve write performance for zoned UFS devices
2025-07-18 7:08 ` [PATCH v21 00/12] Improve write performance for zoned UFS devices Damien Le Moal
@ 2025-07-18 18:30 ` Bart Van Assche
2025-07-22 1:36 ` Damien Le Moal
0 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2025-07-18 18:30 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
[-- Attachment #1: Type: text/plain, Size: 3627 bytes --]
On 7/18/25 12:08 AM, Damien Le Moal wrote:
> How did you test this ?
Hi Damien,
This patch series has been tested as follows:
- In an x86-64 VM:
- By running blktests.
- By running the attached two scripts. test-pipelining-zoned-writes
submits small writes sequentially and has been used to compare IOPS
with and without write pipelining. test-pipelining-and-requeuing
submits sequential or random writes. This script has
been used to verify that the HOST BUSY and UNALIGNED WRITE
conditions are handled correctly for both I/O patterns.
- On an ARM development board, by running a multitude of I/O patterns on
  top of F2FS on a ZUFS device with data verification enabled.
> I do not have a zoned UFS drive, so I used an NVMe ZNS drive, which should be
> fine since the commands in the submission queues of a PCI controller are always
> handled in order. So I added:
>
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index cce4c5b55aa9..36d16b8d3f37 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -108,7 +108,7 @@ int nvme_query_zone_info(struct nvme_ns *ns, unsigned lbaf,
> void nvme_update_zone_info(struct nvme_ns *ns, struct queue_limits *lim,
> struct nvme_zone_info *zi)
> {
> - lim->features |= BLK_FEAT_ZONED;
> + lim->features |= BLK_FEAT_ZONED | BLK_FEAT_ORDERED_HWQ;
> lim->max_open_zones = zi->max_open_zones;
> lim->max_active_zones = zi->max_active_zones;
> lim->max_hw_zone_append_sectors = ns->ctrl->max_zone_append;
>
> And ran this:
>
> fio --name=test --filename=/dev/nvme1n2 --ioengine=io_uring --iodepth=128 \
> --direct=1 --bs=4096 --zonemode=zbd --rw=randwrite \
> --numjobs=1
>
> And I get unaligned write errors 100% of the time. Looking at your patches
> again, you are not handling the REQ_NOWAIT case in
> blk_zone_wplug_handle_write(). If you get a REQ_NOWAIT BIO, which io_uring
> will issue, the code goes directly to plugging the BIO, thus bypassing your
> from_cpu handling.
Didn't Jens recommend libaio instead of io_uring for zoned storage? See
also
https://lore.kernel.org/linux-block/8c0f9d28-d68f-4800-b94f-1905079d4007@kernel.dk/T/#mb61b6d1294da76a9f1be38edf6dceaf703112335.
I ran all my tests with
libaio instead of io_uring.
> But the same fio command with libaio (no REQ_NOWAIT in that case) also fails.
While this patch series addresses most potential causes of reordering by
the block layer, it does not address all possible causes of reordering.
An example of a potential cause of reordering that has not been
addressed by this patch series can be found in blk_mq_insert_requests().
That function either inserts requests in a software or a hardware queue.
Bypassing the software queue for some requests can cause reordering.
Another example can be found in blk_mq_dispatch_rq_list(). If the block
driver responds with BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, the
requests that have not been accepted by the block driver are added to
the &hctx->dispatch list. If these requests came from a software queue,
adding them to hctx->dispatch_list instead of putting them back in
their original position in the software queue can cause reordering.
Patches 8 and 9 work around this by retrying writes in the unlikely case
that reordering happens. I think this is a more pragmatic solution than
making more changes in the block layer to make it fully preserve the
request order. In the traces that I gathered and that I inspected, I
did not see any UNALIGNED WRITE errors being reported by ZUFS devices.
Thanks,
Bart.
[-- Attachment #2: test-pipelining-and-requeuing --]
[-- Type: text/plain, Size: 2022 bytes --]
#!/bin/bash
set -eu
stop_tracing() {
if lsof -t /sys/kernel/tracing/trace_pipe | xargs -r kill; then :; fi
echo 0 >/sys/kernel/tracing/tracing_on
}
tracing_active() {
[ "$(cat /sys/kernel/tracing/tracing_on)" = 1 ]
}
start_tracing() {
rm -f /tmp/block-trace.txt
stop_tracing
(
cd /sys/kernel/tracing
echo nop > current_tracer
echo > trace
echo 0 > events/enable
echo 1 > events/block/enable
echo 0 > events/block/block_dirty_buffer/enable
echo 0 > events/block/block_touch_buffer/enable
echo 1 > tracing_on
cat trace_pipe >/tmp/block-trace.txt
) &
tracing_pid=$!
while ! tracing_active; do
sleep .1
done
}
end_tracing() {
if [ -n "$tracing_pid" ]; then kill "$tracing_pid"; fi
stop_tracing
}
qd=${1:-64}
# Log error recovery actions
echo 63 > /sys/module/scsi_mod/parameters/scsi_logging_level
if modprobe -r scsi_debug; then :; fi
params=(
delay=0
dev_size_mb=256
every_nth=$((2 * qd)) # trigger error injection every 2*qd commands (see opts)
max_queue="${qd}"
ndelay=100000 # 100 us
opts=0x28000 # SDEBUG_OPT_UNALIGNED_WRITE | SDEBUG_OPT_HOST_BUSY
preserves_write_order=1
sector_size=4096
zbc=host-managed
zone_nr_conv=0
zone_size_mb=4
)
modprobe scsi_debug "${params[@]}"
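# Wait until the block device created by scsi_debug appears.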
while true; do
bdev=$(cd /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block 2>/dev/null && echo *)
if [ -e /dev/"${bdev}" ]; then break; fi
sleep .1
done
dev=/dev/"${bdev}"
[ -b "${dev}" ]
for rw in write randwrite; do
start_tracing
params=(
--direct=1
--filename="${dev}"
--iodepth="${qd}"
--iodepth_batch=$(((qd + 3) / 4))
--ioengine=libaio
--ioscheduler=none
--gtod_reduce=1
--name="$(basename "${dev}")"
--runtime=30
--rw="$rw"
--time_based=1
--zonemode=zbd
)
set +e
fio "${params[@]}"
rc=$?
set -e
end_tracing
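# Each line of the zone_wplugs debugfs file contains "ref <count>". After fio
# has finished, every zone write plug should be back at a refcount of 1, so
# any line that does not contain " ref 1 " indicates a leaked reference.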
if grep -avH " ref 1 " "/sys/kernel/debug/block/${bdev}/zone_wplugs"; then
echo
echo "Detected one or more reference count leaks!"
break
fi
echo ''
[ $rc = 0 ] || break
done
echo 0 > /sys/module/scsi_mod/parameters/scsi_logging_level
[-- Attachment #3: test-pipelining-zoned-writes --]
[-- Type: text/plain, Size: 5414 bytes --]
#!/bin/bash
set -e
run_cmd() {
if [ -z "$android" ]; then
eval "$1"
else
adb shell "$1"
fi
}
tracing_active() {
[ "$(run_cmd "cat /sys/kernel/tracing/tracing_on")" = 1 ]
}
start_tracing() {
rm -f "/tmp/block-trace-$1.txt"
cmd="(if [ ! -e /sys/kernel/tracing/trace ]; then mount -t tracefs none /sys/kernel/tracing; fi &&
cd /sys/kernel/tracing &&
if lsof -t /sys/kernel/tracing/trace_pipe | xargs -r kill; then :; fi &&
echo 0 > tracing_on &&
echo nop > current_tracer &&
echo > trace &&
echo 0 > events/enable &&
echo 1 > events/block/enable &&
echo 0 > events/block/block_dirty_buffer/enable &&
echo 0 > events/block/block_touch_buffer/enable &&
if [ -e events/nullb ]; then echo 1 > events/nullb/enable; fi &&
echo 1 > tracing_on &&
cat trace_pipe)"
run_cmd "$cmd" >"/tmp/block-trace-$1.txt" &
tracing_pid=$!
while ! tracing_active; do
sleep .1
done
}
end_tracing() {
sleep 5
if [ -n "$tracing_pid" ]; then kill "$tracing_pid"; fi
run_cmd "cd /sys/kernel/tracing &&
if lsof -t /sys/kernel/tracing/trace_pipe | xargs -r kill; then :; fi &&
echo 0 >/sys/kernel/tracing/tracing_on"
}
android=
fastest_cpucore=
tracing=
while [ "${1#-}" != "$1" ]; do
case "$1" in
-a)
android=true; shift;;
-t)
tracing=true; shift;;
*)
echo "Usage: $0 [-a] [-t]" >&2; exit 1;;
esac
done
set -u
if [ -n "${android}" ]; then
adb root 1>&2
adb push ~/software/fio/fio /tmp >&/dev/null
adb push ~/software/util-linux/blkzone /tmp >&/dev/null
fastest_cpucore=$(adb shell 'grep -aH . /sys/devices/system/cpu/cpu[0-9]*/cpufreq/cpuinfo_max_freq 2>/dev/null' |
sed 's/:/ /' |
sort -rnk2 |
head -n1 |
sed -e 's|/sys/devices/system/cpu/cpu||;s|/cpufreq.*||')
if [ -z "$fastest_cpucore" ]; then
fastest_cpucore=$(($(adb shell nproc) - 1))
fi
[ -n "$fastest_cpucore" ]
fi
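# Measure write IOPS for each I/O scheduler, with and without write
# pipelining (the second field of "mode" is the preserves_write_order value).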
for mode in "none 0" "none 1" "mq-deadline 0" "mq-deadline 1"; do
for d in /sys/kernel/config/nullb/*; do
if [ -d "$d" ] && rmdir "$d"; then :; fi
done
read -r iosched preserves_write_order <<<"$mode"
echo "==== iosched=$iosched preserves_write_order=$preserves_write_order"
if [ -z "$android" ]; then
if true; then # scsi_debug; change "true" into "false" to use null_blk instead
if modprobe -r scsi_debug; then :; fi
params=(
ndelay=100000 # 100 us
host_max_queue=64
preserves_write_order="${preserves_write_order}"
dev_size_mb=1024 # 1 GiB
submit_queues="$(nproc)"
zone_size_mb=1 # 1 MiB
zone_nr_conv=0
zbc=2
)
modprobe scsi_debug "${params[@]}"
udevadm settle
dev=/dev/$(cd /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block && echo *)
basename=$(basename "${dev}")
else
if modprobe -r null_blk; then :; fi
modprobe null_blk nr_devices=0
(
cd /sys/kernel/config/nullb
mkdir nullb0
cd nullb0
params=(
completion_nsec=100000 # 100 us
hw_queue_depth=64
irqmode=2 # NULL_IRQ_TIMER
max_sectors=$((4096/512))
memory_backed=1
preserves_write_order="${preserves_write_order}"
size=1 # 1 GiB
submit_queues="$(nproc)"
zone_size=1 # 1 MiB
zoned=1
power=1
)
for p in "${params[@]}"; do
if ! echo "${p//*=}" > "${p//=*}"; then
echo "$p"
exit 1
fi
done
)
basename=nullb0
dev=/dev/${basename}
udevadm settle
fi
[ -b "${dev}" ]
else
# Retrieve the device name assigned to the zoned logical unit.
basename=$(adb shell grep -lvw 0 /sys/class/block/sd*/queue/chunk_sectors 2>/dev/null |
sed 's|/sys/class/block/||g;s|/queue/chunk_sectors||g')
dev="/dev/block/${basename}"
fi
run_cmd "echo 4096 > /sys/class/block/${basename}/queue/max_sectors_kb"
# 0: disable I/O statistics
run_cmd "echo 0 > /sys/class/block/${basename}/queue/iostats"
# 2: do not attempt any merges
run_cmd "echo 2 > /sys/class/block/${basename}/queue/nomerges"
# 2: complete on the requesting CPU
run_cmd "echo 2 > /sys/class/block/${basename}/queue/rq_affinity"
if [ -n "${tracing}" ]; then
start_tracing "${iosched}-${preserves_write_order}"
fi
params1=(
--name=trim
--filename="${dev}"
--direct=1
--end_fsync=1
--ioengine=pvsync
--gtod_reduce=1
--rw=trim
--size=100%
--thread=1
--zonemode=zbd
)
params2=(
--name=measure-iops
--filename="${dev}"
--direct=1
--ioscheduler="${iosched}"
--gtod_reduce=1
--runtime=30
--rw=write
--thread=1
--time_based=1
--zonemode=zbd
)
if [ -n "$fastest_cpucore" ]; then
# Pin the IOPS measurement to the fastest CPU core.
params2+=(--cpus_allowed="${fastest_cpucore}")
fi
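# When the driver preserves the write order, writes can be pipelined, so
# submit them asynchronously with libaio at a high queue depth. Without
# pipelining, at most one write per zone is in flight, so a synchronous
# engine suffices.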
if [ "$preserves_write_order" = 1 ]; then
params2+=(
--ioengine=libaio
--iodepth=64
--iodepth_batch=16
)
else
params2+=(
--ioengine=pvsync2
)
fi
set +e
echo "fio ${params2[*]}"
# Finish all open zones to prevent that the maximum number of open zones is
# exceeded. Next, trim all zones and measure IOPS.
if [ -z "$android" ]; then
blkzone finish "${dev}"
fio "${params1[@]}" >"/tmp/fio-trim-${iosched}-${preserves_write_order}.txt"
fio "${params2[@]}"
else
adb shell /tmp/blkzone finish "${dev}"
adb shell /tmp/fio "${params1[@]}" >/dev/null
adb shell /tmp/fio "${params2[@]}"
fi
ret=$?
set -e
if [ -n "${tracing}" ]; then
end_tracing
fi
[ "$ret" = 0 ] || break
done
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v21 00/12] Improve write performance for zoned UFS devices
2025-07-18 18:30 ` Bart Van Assche
@ 2025-07-22 1:36 ` Damien Le Moal
2025-07-22 18:24 ` Bart Van Assche
0 siblings, 1 reply; 25+ messages in thread
From: Damien Le Moal @ 2025-07-22 1:36 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 7/19/25 3:30 AM, Bart Van Assche wrote:
> On 7/18/25 12:08 AM, Damien Le Moal wrote:
>> How did you test this ?
>
> Hi Damien,
>
> This patch series has been tested as follows:
> - In an x86-64 VM:
> - By running blktests.
> - By running the attached two scripts. test-pipelining-zoned-writes
> submits small writes sequentially and has been used to compare IOPS
> with and without write pipelining. test-pipelining-and-requeuing
> submits sequential or random writes. This script has
> been used to verify that the HOST BUSY and UNALIGNED WRITE
> conditions are handled correctly for both I/O patterns.
> - On an ARM development board, by running a multitude of I/O patterns on
> top of F2FS on a ZUFS device with data verification enabled.
>
>> I do not have a zoned UFS drive, so I used an NVMe ZNS drive, which should be
>> fine since the commands in the submission queues of a PCI controller are always
>> handled in order. So I added:
>>
>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>> index cce4c5b55aa9..36d16b8d3f37 100644
>> --- a/drivers/nvme/host/zns.c
>> +++ b/drivers/nvme/host/zns.c
>> @@ -108,7 +108,7 @@ int nvme_query_zone_info(struct nvme_ns *ns, unsigned lbaf,
>> void nvme_update_zone_info(struct nvme_ns *ns, struct queue_limits *lim,
>> struct nvme_zone_info *zi)
>> {
>> - lim->features |= BLK_FEAT_ZONED;
>> + lim->features |= BLK_FEAT_ZONED | BLK_FEAT_ORDERED_HWQ;
>> lim->max_open_zones = zi->max_open_zones;
>> lim->max_active_zones = zi->max_active_zones;
>> lim->max_hw_zone_append_sectors = ns->ctrl->max_zone_append;
>>
>> And ran this:
>>
>> fio --name=test --filename=/dev/nvme1n2 --ioengine=io_uring --iodepth=128 \
>> --direct=1 --bs=4096 --zonemode=zbd --rw=randwrite \
>> --numjobs=1
>>
>> And I get unaligned write errors 100% of the time. Looking at your patches
>> again, you are not handling the REQ_NOWAIT case in
>> blk_zone_wplug_handle_write(). If you get a REQ_NOWAIT BIO, which io_uring
>> will issue, the code goes directly to plugging the BIO, thus bypassing your
>> from_cpu handling.
>
> Didn't Jens recommend libaio instead of io_uring for zoned storage? See
> also https://lore.kernel.org/linux-block/8c0f9d28-d68f-4800-
> b94f-1905079d4007@kernel.dk/T/#mb61b6d1294da76a9f1be38edf6dceaf703112335. I ran
> all my tests with
> libaio instead of io_uring.
My bad, yes, io_uring does not work reliably for zoned writes because of its
no-wait handling of BIOs and punting to a worker thread for blocking BIOs. But
as I said, tests with libaio did not go well either.
>> But the same fio command with libaio (no REQ_NOWAIT in that case) also fails.
>
> While this patch series addresses most potential causes of reordering by
> the block layer, it does not address all possible causes of reordering.
> An example of a potential cause of reordering that has not been
> addressed by this patch series can be found in blk_mq_insert_requests().
> That function either inserts requests in a software or a hardware queue.
> Bypassing the software queue for some requests can cause reordering.
> Another example can be found in blk_mq_dispatch_rq_list(). If the block
> driver responds with BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, the
> requests that have not been accepted by the block driver are added to
> the &hctx->dispatch list. If these requests came from a software queue,
> adding them to hctx->dispatch_list instead of putting them back in
> their original position in the software queue can cause reordering.
>
> Patches 8 and 9 work around this by retrying writes in the unlikely case
> that reordering happens. I think this is a more pragmatic solution than
> making more changes in the block layer to make it fully preserve the
> request order. In the traces that I gathered and that I inspected, I
> did not see any UNALIGNED WRITE errors being reported by ZUFS devices.
So the end result of your patches is that the submission path can still
generate reordering and cause unaligned write errors. Not great, to say the
least. I would really prefer something that does not cause such submission
errors, to be sure that if we see an error, it is due to a user bug (a user
sending unaligned writes).
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v21 00/12] Improve write performance for zoned UFS devices
2025-07-22 1:36 ` Damien Le Moal
@ 2025-07-22 18:24 ` Bart Van Assche
0 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-07-22 18:24 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 7/21/25 6:36 PM, Damien Le Moal wrote:
> I would really prefer something that does not cause such submission
> errors, to be sure that if we see an error, it is due to a user bug (a
> user sending unaligned writes).
Hi Damien,
I will look into this.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2025-07-22 18:24 UTC | newest]
Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-17 20:57 [PATCH v21 00/12] Improve write performance for zoned UFS devices Bart Van Assche
2025-07-17 20:57 ` [PATCH v21 01/12] block: Support block devices that preserve the order of write requests Bart Van Assche
2025-07-17 20:57 ` [PATCH v21 02/12] blk-mq: Restore the zone write order when requeuing Bart Van Assche
2025-07-17 20:57 ` [PATCH v21 03/12] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
2025-07-18 7:13 ` Damien Le Moal
2025-07-18 15:54 ` Bart Van Assche
2025-07-17 20:58 ` [PATCH v21 04/12] blk-zoned: Split an if-statement Bart Van Assche
2025-07-17 20:58 ` [PATCH v21 05/12] blk-zoned: Move code from disk_zone_wplug_add_bio() into its caller Bart Van Assche
2025-07-18 7:15 ` Damien Le Moal
2025-07-17 20:58 ` [PATCH v21 06/12] blk-zoned: Introduce a loop in blk_zone_wplug_bio_work() Bart Van Assche
2025-07-18 7:17 ` Damien Le Moal
2025-07-17 20:58 ` [PATCH v21 07/12] blk-zoned: Support pipelining of zoned writes Bart Van Assche
2025-07-18 7:38 ` Damien Le Moal
2025-07-18 16:29 ` Bart Van Assche
2025-07-17 20:58 ` [PATCH v21 08/12] scsi: core: Retry unaligned " Bart Van Assche
2025-07-17 20:58 ` [PATCH v21 09/12] scsi: sd: Increase retry count for " Bart Van Assche
2025-07-17 20:58 ` [PATCH v21 10/12] scsi: scsi_debug: Add the preserves_write_order module parameter Bart Van Assche
2025-07-17 20:58 ` [PATCH v21 11/12] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
2025-07-17 20:58 ` [PATCH v21 12/12] ufs: core: Inform the block layer about write ordering Bart Van Assche
2025-07-18 7:08 ` [PATCH v21 00/12] Improve write performance for zoned UFS devices Damien Le Moal
2025-07-18 18:30 ` Bart Van Assche
2025-07-22 1:36 ` Damien Le Moal
2025-07-22 18:24 ` Bart Van Assche
2025-07-18 7:39 ` Damien Le Moal
2025-07-18 16:32 ` Bart Van Assche