* [PATCH v25 00/20] Improve write performance for zoned UFS devices
@ 2025-10-14 21:54 Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 01/20] block: Support block devices that preserve the order of write requests Bart Van Assche
` (19 more replies)
0 siblings, 20 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Hi Jens,
This patch series improves small write IOPS by a factor of two for zoned UFS
devices on my test setup. The changes included in this patch series are as
follows:
- A new request queue limits flag is introduced that allows block drivers to
declare whether or not the request order is preserved per hardware queue.
- The order of zoned writes is preserved in the block layer by submitting all
zoned writes from the same CPU core as long as any zoned writes are pending.
- A new member 'from_cpu' is introduced in the per-zone data structure
'blk_zone_wplug' to track from which CPU to submit zoned writes. This data
member is reset to -1 after all pending zoned writes for a zone have
completed.
- The retry count for zoned writes is increased in the SCSI core to deal with
reordering caused by unit attention conditions or the SCSI error handler.
- New functionality is added in the null_blk and scsi_debug drivers to make it
easier to test the changes introduced by this patch series.
Please consider this patch series for the next merge window.
Thanks,
Bart.
Changes compared to v24:
- Split the mq-deadline patch into two patches to make reviewing easier.
- Further improved patch descriptions and source code comments.
- Included a patch with a source code comment fix.
Changes compared to v23:
- Removed the sysfs attribute for configuring write pipelining.
- Split patch "Run all hwqs for sq scheds if write pipelining is enabled" into
two patches to make it easier to review.
- Added patch "blk-zoned: Document disk_zone_wplug_schedule_bio_work() locking".
- Rebased on top of Jens' for-next branch.
Changes compared to v22:
- Made write pipelining configurable via sysfs.
- Fixed sporadic write errors observed with the mq-deadline I/O scheduler.
Changes compared to v21:
- Added a patch that makes the block layer preserve the request order when
inserting a request.
- Restored a warning statement in block/blk-zoned.c.
- Reworked the code that selects a CPU to queue zoned writes from such that no
changes have to be undone if blk_zone_wplug_prepare_bio() fails.
- Removed the "plug" label in block/blk-zoned.c and retained the
"add_to_bio_list" label.
- Changed scoped_guard() back into spin_lock_*() calls.
- Fixed a recently introduced reference count leak in
disk_zone_wplug_schedule_bio_work().
- Restored the patch for the null_blk driver.
Changes compared to v20:
- Converted a struct queue_limits member variable into a queue_limits feature
flag.
- Optimized performance of blk_mq_requeue_work().
- Instead of splitting blk_zone_wplug_bio_work(), introduce a loop in that
function.
- Reworked patch "blk-zoned: Support pipelining of zoned writes".
- Dropped the null_blk driver patch.
- Improved several patch descriptions.
Changes compared to v19:
- Dropped patch 2/11 "block: Support allocating from a specific software queue"
- Implemented Damien's proposal to always add pipelined bios to the plug list
and to submit all pipelined bios from the bio work for a zone.
- Added three refactoring patches to make this patch series easier to review.
Changes compared to v18:
- Dropped patch 2/12 "block: Rework request allocation in blk_mq_submit_bio()".
- Improved patch descriptions.
Changes compared to v17:
- Rebased the patch series on top of kernel v6.16-rc1.
- Dropped support for UFSHCI 3.0 controllers because the UFSHCI 3.0 auto-
hibernation mechanism causes request reordering. UFSHCI 4.0 controllers
remain supported.
- Removed the error handling and write pointer tracking mechanisms again
from block/blk-zoned.c.
- Dropped the dm-linear patch from this patch series since I'm not aware of
any use cases for write pipelining and dm-linear.
Changes compared to v16:
- Rebased the entire patch series on top of Jens' for-next branch. Compared
to when v16 of this series was posted, the BLK_ZONE_WPLUG_NEED_WP_UPDATE
flag has been introduced and support for REQ_NOWAIT has been fixed.
- The behavior for SMR disks is preserved: if .driver_preserves_write_order
has not been set, BLK_ZONE_WPLUG_NEED_WP_UPDATE is still set if a write
error has been encountered. If .driver_preserves_write_order has been set,
the write pointer is restored and the failed zoned writes are retried.
- The superfluous "disk->zone_wplugs_hash_bits != 0" tests have been removed.
Changes compared to v15:
- Reworked this patch series on top of the zone write plugging approach.
- Moved support for requeuing requests from the SCSI core into the block
layer core.
- In the UFS driver, instead of disabling write pipelining if
auto-hibernation is enabled, rely on the requeuing mechanism to handle
reordering caused by resuming from auto-hibernation.
Changes compared to v14:
- Removed the drivers/scsi/Kconfig.kunit and drivers/scsi/Makefile.kunit
files. Instead, modified drivers/scsi/Kconfig and added #include "*_test.c"
directives in the appropriate .c files. Removed the EXPORT_SYMBOL()
directives that were added to make the unit tests link.
- Fixed a double free in a unit test.
Changes compared to v13:
- Reworked patch "block: Preserve the order of requeued zoned writes".
- Addressed a performance concern by removing the eh_needs_prepare_resubmit
SCSI driver callback and by introducing the SCSI host template flag
.needs_prepare_resubmit instead.
- Added a patch that adds a 'host' argument to scsi_eh_flush_done_q().
- Made the code in unit tests less repetitive.
Changes compared to v12:
- Added two new patches: "block: Preserve the order of requeued zoned writes"
and "scsi: sd: Add a unit test for sd_cmp_sector()"
- Restricted the number of zoned write retries. To my surprise I had to add
"&& scmd->retries <= scmd->allowed" in the SCSI error handler to limit the
number of retries.
- In patch "scsi: ufs: Inform the block layer about write ordering", only set
ELEVATOR_F_ZBD_SEQ_WRITE for zoned block devices.
Changes compared to v11:
- Fixed a NULL pointer dereference that happened when booting from an ATA
device by adding an scmd->device != NULL check in scsi_needs_preparation().
- Updated Reviewed-by tags.
Changes compared to v10:
- Dropped the UFS MediaTek and HiSilicon patches because they are not correct
and because it is safe to drop them.
- Updated Acked-by / Reviewed-by tags.
Changes compared to v9:
- Introduced an additional scsi_driver callback: .eh_needs_prepare_resubmit().
- Renamed the scsi_debug kernel module parameter 'no_zone_write_lock' to
'preserves_write_order'.
- Fixed an out-of-bounds access in the scsi_call_prepare_resubmit() unit
test.
- Wrapped ufshcd_auto_hibern8_update() calls in UFS host drivers with
WARN_ON_ONCE() such that a kernel stack trace appears if an error code is
returned.
- Elaborated a comment in the UFSHCI driver.
Changes compared to v8:
- Fixed handling of 'driver_preserves_write_order' and 'use_zone_write_lock'
in blk_stack_limits().
- Added a comment in disk_set_zoned().
- Modified blk_req_needs_zone_write_lock() such that it returns false if
q->limits.use_zone_write_lock is false.
- Modified disk_clear_zone_settings() such that it clears
q->limits.use_zone_write_lock.
- Left out one change from the mq-deadline patch that became superfluous due to
the blk_req_needs_zone_write_lock() change.
- Modified scsi_call_prepare_resubmit() such that it only calls list_sort() if
zoned writes have to be resubmitted for which zone write locking is disabled.
- Added an additional unit test for scsi_call_prepare_resubmit().
- Modified the sorting code in the sd driver such that only those SCSI commands
are sorted for which write locking is disabled.
- Modified sd_zbc.c such that ELEVATOR_F_ZBD_SEQ_WRITE is only set if the
write order is not preserved.
- Included three patches for UFS host drivers that rework code that wrote
directly to the auto-hibernation controller register.
- Modified the UFS driver such that enabling auto-hibernation is not allowed
if a zoned logical unit is present and if the controller operates in legacy
mode.
- Also in the UFS driver, simplified ufshcd_auto_hibern8_update().
Changes compared to v7:
- Split the queue_limits member variable `use_zone_write_lock' into two member
variables: `use_zone_write_lock' (set by disk_set_zoned()) and
`driver_preserves_write_order' (set by the block driver or SCSI LLD). This
should clear up the confusion about the purpose of this variable.
- Moved the code for sorting SCSI commands by LBA from the SCSI error handler
into the SCSI disk (sd) driver as requested by Christoph.
Changes compared to v6:
- Removed QUEUE_FLAG_NO_ZONE_WRITE_LOCK and instead introduced a flag in
the request queue limits data structure.
Changes compared to v5:
- Renamed scsi_cmp_lba() to scsi_cmp_sector().
- Improved several source code comments.
Changes compared to v4:
- Dropped the patch that introduces the REQ_NO_ZONE_WRITE_LOCK flag.
- Dropped the null_blk patch and added two scsi_debug patches instead.
- Dropped the f2fs patch.
- Split the patch for the UFS driver into two patches.
- Modified several patch descriptions and source code comments.
- Renamed dd_use_write_locking() to dd_use_zone_write_locking().
- Moved the list_sort() call from scsi_unjam_host() into scsi_eh_flush_done_q()
such that sorting happens just before reinserting.
- Removed the scsi_cmd_retry_allowed() call from scsi_check_sense() to make
sure that the retry counter is adjusted once per retry instead of twice.
Changes compared to v3:
- Restored the patch that introduces QUEUE_FLAG_NO_ZONE_WRITE_LOCK. That patch
had accidentally been left out from v2.
- In patch "block: Introduce the flag REQ_NO_ZONE_WRITE_LOCK", improved the
patch description and added the function blk_no_zone_write_lock().
- In patch "block/mq-deadline: Only use zone locking if necessary", moved the
blk_queue_is_zoned() call into dd_use_write_locking().
- In patch "fs/f2fs: Disable zone write locking", set REQ_NO_ZONE_WRITE_LOCK
from inside __bio_alloc() instead of in f2fs_submit_write_bio().
Changes compared to v2:
- Renamed the request queue flag for disabling zone write locking.
- Introduced a new request flag for disabling zone write locking.
- Modified the mq-deadline scheduler such that zone write locking is only
disabled if both flags are set.
- Added an F2FS patch that sets the request flag for disabling zone write
locking.
- Only disable zone write locking in the UFS driver if auto-hibernation is
disabled.
Changes compared to v1:
- Left out the patches that are already upstream.
- Switched the approach in patch "scsi: Retry unaligned zoned writes" from
retrying immediately to sending unaligned writes to the SCSI error handler.
Bart Van Assche (20):
block: Support block devices that preserve the order of write requests
blk-mq: Always insert sequential zoned writes into a software queue
blk-mq: Restore the zone write order when requeuing
blk-mq: Move the blk_queue_sq_sched() calls
blk-mq: Run all hwqs for sq scheds if write pipelining is enabled
block/mq-deadline: Make locking IRQ-safe
block/mq-deadline: Enable zoned write pipelining
blk-zoned: Fix a typo in a source code comment
blk-zoned: Add an argument to blk_zone_plug_bio()
blk-zoned: Split an if-statement
blk-zoned: Move code from disk_zone_wplug_add_bio() into its caller
blk-zoned: Introduce a loop in blk_zone_wplug_bio_work()
blk-zoned: Document disk_zone_wplug_schedule_bio_work() locking
blk-zoned: Support pipelining of zoned writes
null_blk: Add the preserves_write_order attribute
scsi: core: Retry unaligned zoned writes
scsi: sd: Increase retry count for zoned writes
scsi: scsi_debug: Add the preserves_write_order module parameter
scsi: scsi_debug: Support injecting unaligned write errors
ufs: core: Inform the block layer about write ordering
block/bfq-iosched.c | 2 +
block/blk-mq.c | 93 +++++++++---
block/blk-mq.h | 2 +
block/blk-settings.c | 2 +
block/blk-zoned.c | 226 ++++++++++++++++++++----------
block/elevator.h | 1 +
block/kyber-iosched.c | 2 +
block/mq-deadline.c | 107 ++++++++++----
drivers/block/null_blk/main.c | 4 +
drivers/block/null_blk/null_blk.h | 1 +
drivers/md/dm.c | 5 +-
drivers/scsi/scsi_debug.c | 22 ++-
drivers/scsi/scsi_error.c | 16 +++
drivers/scsi/sd.c | 6 +
drivers/ufs/core/ufshcd.c | 7 +
include/linux/blk-mq.h | 13 +-
include/linux/blkdev.h | 18 ++-
17 files changed, 402 insertions(+), 125 deletions(-)
* [PATCH v25 01/20] block: Support block devices that preserve the order of write requests
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
@ 2025-10-14 21:54 ` Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 02/20] blk-mq: Always insert sequential zoned writes into a software queue Bart Van Assche
` (18 subsequent siblings)
19 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Some storage controllers preserve the request order per hardware queue.
Some but not all device mapper drivers preserve the bio order. Introduce
the feature flag BLK_FEAT_ORDERED_HWQ to allow block drivers and stacked
drivers to indicate that the order of write commands is preserved per
hardware queue and hence that serialization of writes per zone is not
required if all pending writes are submitted to the same hardware queue.
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-settings.c | 2 ++
include/linux/blkdev.h | 6 ++++++
2 files changed, 8 insertions(+)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 54cffaae4df4..553ec729a5b1 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -771,6 +771,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
t->features &= ~BLK_FEAT_NOWAIT;
if (!(b->features & BLK_FEAT_POLL))
t->features &= ~BLK_FEAT_POLL;
+ if (!(b->features & BLK_FEAT_ORDERED_HWQ))
+ t->features &= ~BLK_FEAT_ORDERED_HWQ;
t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 70b671a9a7f7..9af9d97e31af 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -346,6 +346,12 @@ typedef unsigned int __bitwise blk_features_t;
#define BLK_FEAT_ATOMIC_WRITES \
((__force blk_features_t)(1u << 16))
+/*
+ * The request order is preserved per hardware queue by the block driver and by
+ * the block device. Set by the block driver.
+ */
+#define BLK_FEAT_ORDERED_HWQ ((__force blk_features_t)(1u << 17))
+
/*
* Flags automatically inherited when stacking limits.
*/
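
For illustration, here is a minimal, hypothetical sketch of how a block driver
whose hardware and submission path preserve the per-hardware-queue write order
would opt in to this flag. The real driver-side changes are in the null_blk,
scsi_debug and UFS patches later in this series (not quoted above); the
mydrv_* names below are made up, and the tag set is assumed to have been set
up elsewhere.

#include <linux/blk-mq.h>
#include <linux/blkdev.h>

/* Hypothetical device structure, for illustration only. */
struct mydrv_dev {
        struct blk_mq_tag_set tag_set;  /* assumed to be initialized elsewhere */
        struct gendisk *disk;
};

static int mydrv_add_disk(struct mydrv_dev *dev)
{
        struct queue_limits lim = {
                /* Writes are not reordered within a hardware queue. */
                .features = BLK_FEAT_ORDERED_HWQ,
        };
        struct gendisk *disk;

        disk = blk_mq_alloc_disk(&dev->tag_set, &lim, dev);
        if (IS_ERR(disk))
                return PTR_ERR(disk);

        dev->disk = disk;
        return 0;
}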
* [PATCH v25 02/20] blk-mq: Always insert sequential zoned writes into a software queue
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 01/20] block: Support block devices that preserve the order of write requests Bart Van Assche
@ 2025-10-14 21:54 ` Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 03/20] blk-mq: Restore the zone write order when requeuing Bart Van Assche
` (17 subsequent siblings)
19 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
One of the optimizations in the block layer is that the software queues
are bypassed if it is expected that the block driver will accept a
request. This can cause request reordering even for requests submitted
from the same CPU core. This patch preserves the order for sequential
zoned writes submitted from a given CPU core by always inserting these
requests into the appropriate software queue.
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 35 +++++++++++++++++++++++++++++++++--
block/blk-zoned.c | 21 +++++++++++++++++++++
block/elevator.h | 1 +
include/linux/blk-mq.h | 11 +++++++++++
include/linux/blkdev.h | 7 +++++++
5 files changed, 73 insertions(+), 2 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 09f579414161..0457aa6eef47 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1546,6 +1546,35 @@ void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
}
EXPORT_SYMBOL(blk_mq_requeue_request);
+/*
+ * Whether the block layer should preserve the order of @rq relative to other
+ * requests submitted to the same software queue.
+ */
+static bool blk_mq_preserve_order(struct request *rq)
+{
+ return blk_pipeline_zwr(rq->q) && blk_rq_is_seq_zoned_write(rq);
+}
+
+/*
+ * Whether the order should be preserved for any request in @list. Returns %true
+ * if and only if zoned write pipelining is enabled and if there are any
+ * sequential zoned writes in @list.
+ */
+static bool blk_mq_preserve_order_for_list(struct request_queue *q,
+ struct list_head *list)
+{
+ struct request *rq;
+
+ if (!blk_pipeline_zwr(q))
+ return false;
+
+ list_for_each_entry(rq, list, queuelist)
+ if (blk_rq_is_seq_zoned_write(rq))
+ return true;
+
+ return false;
+}
+
static void blk_mq_requeue_work(struct work_struct *work)
{
struct request_queue *q =
@@ -2575,7 +2604,8 @@ static void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx,
* Try to issue requests directly if the hw queue isn't busy to save an
* extra enqueue & dequeue to the sw queue.
*/
- if (!hctx->dispatch_busy && !run_queue_async) {
+ if (!hctx->dispatch_busy && !run_queue_async &&
+ !blk_mq_preserve_order_for_list(hctx->queue, list)) {
blk_mq_run_dispatch_ops(hctx->queue,
blk_mq_try_issue_list_directly(hctx, list));
if (list_empty(list))
@@ -3225,7 +3255,8 @@ void blk_mq_submit_bio(struct bio *bio)
hctx = rq->mq_hctx;
if ((rq->rq_flags & RQF_USE_SCHED) ||
- (hctx->dispatch_busy && (q->nr_hw_queues == 1 || !is_sync))) {
+ (hctx->dispatch_busy && (q->nr_hw_queues == 1 || !is_sync)) ||
+ blk_mq_preserve_order(rq)) {
blk_mq_insert_request(rq, 0);
blk_mq_run_hw_queue(hctx, true);
} else {
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 5e2a5788dc3b..f6bb4331eea6 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -22,6 +22,7 @@
#include "blk.h"
#include "blk-mq-sched.h"
#include "blk-mq-debugfs.h"
+#include "elevator.h"
#define ZONE_COND_NAME(name) [BLK_ZONE_COND_##name] = #name
static const char *const zone_cond_name[] = {
@@ -377,6 +378,26 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
return ret;
}
+/*
+ * blk_pipeline_zwr() - Whether or not sequential zoned writes will be
+ * pipelined per zone.
+ * @q: request queue pointer.
+ *
+ * Return: %true if and only if zoned writes will be pipelined per zone. Since
+ * running different hardware queues simultaneously on different CPU cores may
+ * lead to I/O reordering if an I/O scheduler maintains a single dispatch queue,
+ * only enable write pipelining if either no I/O scheduler is active or if the
+ * I/O scheduler has set the ELEVATOR_FLAG_SUPPORTS_ZONED_WRITE_PIPELINING flag.
+ */
+bool blk_pipeline_zwr(struct request_queue *q)
+{
+ return q->limits.features & BLK_FEAT_ORDERED_HWQ &&
+ (!q->elevator ||
+ test_bit(ELEVATOR_FLAG_SUPPORTS_ZONED_WRITE_PIPELINING,
+ &q->elevator->flags));
+}
+EXPORT_SYMBOL(blk_pipeline_zwr);
+
static bool disk_zone_is_last(struct gendisk *disk, struct blk_zone *zone)
{
return zone->start + zone->len >= get_capacity(disk);
diff --git a/block/elevator.h b/block/elevator.h
index c4d20155065e..41f28909a31c 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -133,6 +133,7 @@ struct elevator_queue
#define ELEVATOR_FLAG_REGISTERED 0
#define ELEVATOR_FLAG_DYING 1
#define ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT 2
+#define ELEVATOR_FLAG_SUPPORTS_ZONED_WRITE_PIPELINING 3
/*
* block elevator interface
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b25d12545f46..2c08a86b4ac3 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1195,4 +1195,15 @@ static inline int blk_rq_map_sg(struct request *rq, struct scatterlist *sglist)
}
void blk_dump_rq_flags(struct request *, char *);
+static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
+{
+ switch (req_op(rq)) {
+ case REQ_OP_WRITE:
+ case REQ_OP_WRITE_ZEROES:
+ return bdev_zone_is_seq(rq->q->disk->part0, blk_rq_pos(rq));
+ default:
+ return false;
+ }
+}
+
#endif /* BLK_MQ_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9af9d97e31af..85fca05bd5eb 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -855,6 +855,8 @@ static inline unsigned int disk_nr_zones(struct gendisk *disk)
return disk->nr_zones;
}
+bool blk_pipeline_zwr(struct request_queue *q);
+
/**
* bio_needs_zone_write_plugging - Check if a BIO needs to be handled with zone
* write plugging
@@ -933,6 +935,11 @@ static inline unsigned int disk_nr_zones(struct gendisk *disk)
return 0;
}
+static inline bool blk_pipeline_zwr(struct request_queue *q)
+{
+ return false;
+}
+
static inline bool bio_needs_zone_write_plugging(struct bio *bio)
{
return false;
* [PATCH v25 03/20] blk-mq: Restore the zone write order when requeuing
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 01/20] block: Support block devices that preserve the order of write requests Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 02/20] blk-mq: Always insert sequential zoned writes into a software queue Bart Van Assche
@ 2025-10-14 21:54 ` Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 04/20] blk-mq: Move the blk_queue_sq_sched() calls Bart Van Assche
` (16 subsequent siblings)
19 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Zoned writes may be requeued, e.g. if a block driver returns BLK_STS_RESOURCE,
if a SCSI unit attention has to be handled, or after the SCSI error handler
has finished error handling. A later patch enables write pipelining and
thereby increases the number of pending writes per zone. If multiple writes
are pending per zone, requests may be requeued in a different order than they
were submitted in. Restore the request order when requests are requeued. Add
RQF_DONTPREP to RQF_NOMERGE_FLAGS because this patch may cause RQF_DONTPREP
requests to be sent to the code that checks whether a request can be merged,
and RQF_DONTPREP requests must not be merged.
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/bfq-iosched.c | 2 ++
block/blk-mq.c | 20 +++++++++++++++++++-
block/blk-mq.h | 2 ++
block/kyber-iosched.c | 2 ++
block/mq-deadline.c | 7 ++++++-
include/linux/blk-mq.h | 2 +-
6 files changed, 32 insertions(+), 3 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 4a8d3d96bfe4..4766c89491d1 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6259,6 +6259,8 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
if (flags & BLK_MQ_INSERT_AT_HEAD) {
list_add(&rq->queuelist, &bfqd->dispatch);
+ } else if (flags & BLK_MQ_INSERT_ORDERED) {
+ blk_mq_insert_ordered(rq, &bfqd->dispatch);
} else if (!bfqq) {
list_add_tail(&rq->queuelist, &bfqd->dispatch);
} else {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0457aa6eef47..28d05b8846d7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1597,7 +1597,9 @@ static void blk_mq_requeue_work(struct work_struct *work)
* already. Insert it into the hctx dispatch list to avoid
* block layer merges for the request.
*/
- if (rq->rq_flags & RQF_DONTPREP)
+ if (blk_mq_preserve_order(rq))
+ blk_mq_insert_request(rq, BLK_MQ_INSERT_ORDERED);
+ else if (rq->rq_flags & RQF_DONTPREP)
blk_mq_request_bypass_insert(rq, 0);
else
blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
@@ -2631,6 +2633,20 @@ static void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx,
blk_mq_run_hw_queue(hctx, run_queue_async);
}
+void blk_mq_insert_ordered(struct request *rq, struct list_head *list)
+{
+ struct request_queue *q = rq->q;
+ struct request *rq2;
+
+ list_for_each_entry(rq2, list, queuelist)
+ if (rq2->q == q && blk_rq_pos(rq2) > blk_rq_pos(rq))
+ break;
+
+ /* Insert rq before rq2. If rq2 is the list head, append at the end. */
+ list_add_tail(&rq->queuelist, &rq2->queuelist);
+}
+EXPORT_SYMBOL_GPL(blk_mq_insert_ordered);
+
static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
{
struct request_queue *q = rq->q;
@@ -2685,6 +2701,8 @@ static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
spin_lock(&ctx->lock);
if (flags & BLK_MQ_INSERT_AT_HEAD)
list_add(&rq->queuelist, &ctx->rq_lists[hctx->type]);
+ else if (flags & BLK_MQ_INSERT_ORDERED)
+ blk_mq_insert_ordered(rq, &ctx->rq_lists[hctx->type]);
else
list_add_tail(&rq->queuelist,
&ctx->rq_lists[hctx->type]);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index af42dc018808..7a22d870b7b7 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -41,8 +41,10 @@ enum {
typedef unsigned int __bitwise blk_insert_t;
#define BLK_MQ_INSERT_AT_HEAD ((__force blk_insert_t)0x01)
+#define BLK_MQ_INSERT_ORDERED ((__force blk_insert_t)0x02)
void blk_mq_submit_bio(struct bio *bio);
+void blk_mq_insert_ordered(struct request *rq, struct list_head *list);
int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, struct io_comp_batch *iob,
unsigned int flags);
void blk_mq_exit_queue(struct request_queue *q);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 18efd6ef2a2b..c510db7f748d 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -590,6 +590,8 @@ static void kyber_insert_requests(struct blk_mq_hw_ctx *hctx,
trace_block_rq_insert(rq);
if (flags & BLK_MQ_INSERT_AT_HEAD)
list_move(&rq->queuelist, head);
+ else if (flags & BLK_MQ_INSERT_ORDERED)
+ blk_mq_insert_ordered(rq, head);
else
list_move_tail(&rq->queuelist, head);
sbitmap_set_bit(&khd->kcq_map[sched_domain],
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 3e3719093aec..85806c73562a 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -676,7 +676,12 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
* set expire time and add to fifo list
*/
rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
- list_add_tail(&rq->queuelist, &per_prio->fifo_list[data_dir]);
+ if (flags & BLK_MQ_INSERT_ORDERED)
+ blk_mq_insert_ordered(rq,
+ &per_prio->fifo_list[data_dir]);
+ else
+ list_add_tail(&rq->queuelist,
+ &per_prio->fifo_list[data_dir]);
}
}
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2c08a86b4ac3..a5dbab229d80 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -86,7 +86,7 @@ enum rqf_flags {
/* flags that prevent us from merging requests: */
#define RQF_NOMERGE_FLAGS \
- (RQF_STARTED | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
+ (RQF_STARTED | RQF_FLUSH_SEQ | RQF_DONTPREP | RQF_SPECIAL_PAYLOAD)
enum mq_rq_state {
MQ_RQ_IDLE = 0,
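
To make the ordering rule of blk_mq_insert_ordered() concrete, below is a
stand-alone toy program (plain C, not kernel code; toy_rq and
toy_insert_ordered() are made-up names) that applies the same rule: insert
before the first entry with a larger position, otherwise append at the end.
Unlike the kernel function, the toy omits the same-request-queue check.

#include <stdio.h>
#include <stdlib.h>

/* Toy stand-in for struct request: only the start sector matters here. */
struct toy_rq {
        unsigned long long pos;
        struct toy_rq *next;
};

/* Same rule as blk_mq_insert_ordered(): keep the list sorted by position. */
static void toy_insert_ordered(struct toy_rq **head, struct toy_rq *rq)
{
        struct toy_rq **link = head;

        /* Skip entries that start at or before @rq. */
        while (*link && (*link)->pos <= rq->pos)
                link = &(*link)->next;

        /* Insert before the first entry with a larger position, or at the end. */
        rq->next = *link;
        *link = rq;
}

int main(void)
{
        /* Requests requeued in the order 200, 100, 300 ... */
        unsigned long long requeued[] = { 200, 100, 300 };
        struct toy_rq *head = NULL;

        for (size_t i = 0; i < sizeof(requeued) / sizeof(requeued[0]); i++) {
                struct toy_rq *rq = malloc(sizeof(*rq));

                if (!rq)
                        return 1;
                rq->pos = requeued[i];
                toy_insert_ordered(&head, rq);
        }

        /* ... come out in LBA order: 100, 200, 300. */
        for (struct toy_rq *rq = head; rq; rq = rq->next)
                printf("%llu\n", rq->pos);

        return 0;
}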
* [PATCH v25 04/20] blk-mq: Move the blk_queue_sq_sched() calls
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
` (2 preceding siblings ...)
2025-10-14 21:54 ` [PATCH v25 03/20] blk-mq: Restore the zone write order when requeuing Bart Van Assche
@ 2025-10-14 21:54 ` Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 05/20] blk-mq: Run all hwqs for sq scheds if write pipelining is enabled Bart Van Assche
` (15 subsequent siblings)
19 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Move the blk_queue_sq_sched() calls from blk_mq_run_hw_queues() and
blk_mq_delay_run_hw_queues() into blk_mq_get_sq_hctx(). Prepare for running
all hardware queues for single-queue I/O schedulers if write pipelining is
enabled. No functionality has been changed.
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 28d05b8846d7..81952d0ae544 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2406,7 +2406,13 @@ EXPORT_SYMBOL(blk_mq_run_hw_queue);
*/
static struct blk_mq_hw_ctx *blk_mq_get_sq_hctx(struct request_queue *q)
{
- struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
+ struct blk_mq_hw_ctx *hctx;
+ struct blk_mq_ctx *ctx;
+
+ if (!blk_queue_sq_sched(q))
+ return NULL;
+
+ ctx = blk_mq_get_ctx(q);
/*
* If the IO scheduler does not respect hardware queues when
* dispatching, we just don't bother with multiple HW queues and
@@ -2414,8 +2420,7 @@ static struct blk_mq_hw_ctx *blk_mq_get_sq_hctx(struct request_queue *q)
* just causes lock contention inside the scheduler and pointless cache
* bouncing.
*/
- struct blk_mq_hw_ctx *hctx = ctx->hctxs[HCTX_TYPE_DEFAULT];
-
+ hctx = ctx->hctxs[HCTX_TYPE_DEFAULT];
if (!blk_mq_hctx_stopped(hctx))
return hctx;
return NULL;
@@ -2431,9 +2436,7 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
struct blk_mq_hw_ctx *hctx, *sq_hctx;
unsigned long i;
- sq_hctx = NULL;
- if (blk_queue_sq_sched(q))
- sq_hctx = blk_mq_get_sq_hctx(q);
+ sq_hctx = blk_mq_get_sq_hctx(q);
queue_for_each_hw_ctx(q, hctx, i) {
if (blk_mq_hctx_stopped(hctx))
continue;
@@ -2459,9 +2462,7 @@ void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs)
struct blk_mq_hw_ctx *hctx, *sq_hctx;
unsigned long i;
- sq_hctx = NULL;
- if (blk_queue_sq_sched(q))
- sq_hctx = blk_mq_get_sq_hctx(q);
+ sq_hctx = blk_mq_get_sq_hctx(q);
queue_for_each_hw_ctx(q, hctx, i) {
if (blk_mq_hctx_stopped(hctx))
continue;
* [PATCH v25 05/20] blk-mq: Run all hwqs for sq scheds if write pipelining is enabled
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
` (3 preceding siblings ...)
2025-10-14 21:54 ` [PATCH v25 04/20] blk-mq: Move the blk_queue_sq_sched() calls Bart Van Assche
@ 2025-10-14 21:54 ` Bart Van Assche
2025-10-15 7:25 ` Damien Le Moal
2025-10-14 21:54 ` [PATCH v25 06/20] block/mq-deadline: Make locking IRQ-safe Bart Van Assche
` (14 subsequent siblings)
19 siblings, 1 reply; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
One of the optimizations in the block layer is that blk_mq_run_hw_queues()
only calls blk_mq_run_hw_queue() for a single hardware queue for single
queue I/O schedulers. Since this optimization may cause I/O reordering,
disable this optimization if ELEVATOR_FLAG_SUPPORTS_ZONED_WRITE_PIPELINING
has been set. This patch prepares for adding write pipelining support in
the mq-deadline I/O scheduler.
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 81952d0ae544..5f07483960f8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2401,8 +2401,7 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
EXPORT_SYMBOL(blk_mq_run_hw_queue);
/*
- * Return prefered queue to dispatch from (if any) for non-mq aware IO
- * scheduler.
+ * Return preferred queue to dispatch from for single-queue IO schedulers.
*/
static struct blk_mq_hw_ctx *blk_mq_get_sq_hctx(struct request_queue *q)
{
@@ -2412,6 +2411,11 @@ static struct blk_mq_hw_ctx *blk_mq_get_sq_hctx(struct request_queue *q)
if (!blk_queue_sq_sched(q))
return NULL;
+ if (blk_queue_is_zoned(q) && blk_pipeline_zwr(q) &&
+ test_bit(ELEVATOR_FLAG_SUPPORTS_ZONED_WRITE_PIPELINING,
+ &q->elevator->flags))
+ return NULL;
+
ctx = blk_mq_get_ctx(q);
/*
* If the IO scheduler does not respect hardware queues when
* [PATCH v25 06/20] block/mq-deadline: Make locking IRQ-safe
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
` (4 preceding siblings ...)
2025-10-14 21:54 ` [PATCH v25 05/20] blk-mq: Run all hwqs for sq scheds if write pipelining is enabled Bart Van Assche
@ 2025-10-14 21:54 ` Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 07/20] block/mq-deadline: Enable zoned write pipelining Bart Van Assche
` (13 subsequent siblings)
19 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Prepare for locking dd->lock in dd_has_write_work(). That function may
be called from interrupt context.
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/mq-deadline.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 85806c73562a..0a46d0f06f72 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -457,7 +457,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
struct request *rq;
enum dd_prio prio;
- spin_lock(&dd->lock);
+ spin_lock_irq(&dd->lock);
if (!list_empty(&dd->dispatch)) {
rq = list_first_entry(&dd->dispatch, struct request, queuelist);
@@ -481,7 +481,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
}
unlock:
- spin_unlock(&dd->lock);
+ spin_unlock_irq(&dd->lock);
return rq;
}
@@ -527,9 +527,9 @@ static void dd_exit_sched(struct elevator_queue *e)
WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
- spin_lock(&dd->lock);
+ spin_lock_irq(&dd->lock);
queued = dd_queued(dd, prio);
- spin_unlock(&dd->lock);
+ spin_unlock_irq(&dd->lock);
WARN_ONCE(queued != 0,
"statistics for priority %d: i %u m %u d %u c %u\n",
@@ -623,9 +623,9 @@ static bool dd_bio_merge(struct request_queue *q, struct bio *bio,
struct request *free = NULL;
bool ret;
- spin_lock(&dd->lock);
+ spin_lock_irq(&dd->lock);
ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
- spin_unlock(&dd->lock);
+ spin_unlock_irq(&dd->lock);
if (free)
blk_mq_free_request(free);
@@ -696,7 +696,7 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
struct deadline_data *dd = q->elevator->elevator_data;
LIST_HEAD(free);
- spin_lock(&dd->lock);
+ spin_lock_irq(&dd->lock);
while (!list_empty(list)) {
struct request *rq;
@@ -704,7 +704,7 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
list_del_init(&rq->queuelist);
dd_insert_request(hctx, rq, flags, &free);
}
- spin_unlock(&dd->lock);
+ spin_unlock_irq(&dd->lock);
blk_mq_free_requests(&free);
}
@@ -828,7 +828,7 @@ static void *deadline_##name##_fifo_start(struct seq_file *m, \
struct deadline_data *dd = q->elevator->elevator_data; \
struct dd_per_prio *per_prio = &dd->per_prio[prio]; \
\
- spin_lock(&dd->lock); \
+ spin_lock_irq(&dd->lock); \
return seq_list_start(&per_prio->fifo_list[data_dir], *pos); \
} \
\
@@ -848,7 +848,7 @@ static void deadline_##name##_fifo_stop(struct seq_file *m, void *v) \
struct request_queue *q = m->private; \
struct deadline_data *dd = q->elevator->elevator_data; \
\
- spin_unlock(&dd->lock); \
+ spin_unlock_irq(&dd->lock); \
} \
\
static const struct seq_operations deadline_##name##_fifo_seq_ops = { \
@@ -914,11 +914,11 @@ static int dd_queued_show(void *data, struct seq_file *m)
struct deadline_data *dd = q->elevator->elevator_data;
u32 rt, be, idle;
- spin_lock(&dd->lock);
+ spin_lock_irq(&dd->lock);
rt = dd_queued(dd, DD_RT_PRIO);
be = dd_queued(dd, DD_BE_PRIO);
idle = dd_queued(dd, DD_IDLE_PRIO);
- spin_unlock(&dd->lock);
+ spin_unlock_irq(&dd->lock);
seq_printf(m, "%u %u %u\n", rt, be, idle);
@@ -942,11 +942,11 @@ static int dd_owned_by_driver_show(void *data, struct seq_file *m)
struct deadline_data *dd = q->elevator->elevator_data;
u32 rt, be, idle;
- spin_lock(&dd->lock);
+ spin_lock_irq(&dd->lock);
rt = dd_owned_by_driver(dd, DD_RT_PRIO);
be = dd_owned_by_driver(dd, DD_BE_PRIO);
idle = dd_owned_by_driver(dd, DD_IDLE_PRIO);
- spin_unlock(&dd->lock);
+ spin_unlock_irq(&dd->lock);
seq_printf(m, "%u %u %u\n", rt, be, idle);
@@ -959,7 +959,7 @@ static void *deadline_dispatch_start(struct seq_file *m, loff_t *pos)
struct request_queue *q = m->private;
struct deadline_data *dd = q->elevator->elevator_data;
- spin_lock(&dd->lock);
+ spin_lock_irq(&dd->lock);
return seq_list_start(&dd->dispatch, *pos);
}
@@ -977,7 +977,7 @@ static void deadline_dispatch_stop(struct seq_file *m, void *v)
struct request_queue *q = m->private;
struct deadline_data *dd = q->elevator->elevator_data;
- spin_unlock(&dd->lock);
+ spin_unlock_irq(&dd->lock);
}
static const struct seq_operations deadline_dispatch_seq_ops = {
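
For context, a sketch of the locking pattern this patch prepares for
(illustration only; the example_* helpers below are hypothetical and not part
of this patch). Once dd->lock can also be taken from interrupt context, e.g.
by dd_has_work() running from a completion path, process context has to
disable interrupts while holding the lock, otherwise an interrupt on the same
CPU could try to re-acquire it and deadlock.

/* Hypothetical illustration, not part of this patch. */
static void example_process_context(struct deadline_data *dd)
{
        /*
         * Plain spin_lock(&dd->lock) would be unsafe here: an interrupt on
         * this CPU could call back into the scheduler and spin on dd->lock
         * forever. Disabling interrupts prevents that.
         */
        spin_lock_irq(&dd->lock);
        /* ... manipulate the FIFO and dispatch lists ... */
        spin_unlock_irq(&dd->lock);
}

static bool example_interrupt_context(struct deadline_data *dd)
{
        unsigned long flags;
        bool has_work;

        /*
         * Interrupt context does not know the previous IRQ state, so it
         * saves and restores it.
         */
        spin_lock_irqsave(&dd->lock, flags);
        has_work = !list_empty(&dd->dispatch);
        spin_unlock_irqrestore(&dd->lock, flags);

        return has_work;
}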
* [PATCH v25 07/20] block/mq-deadline: Enable zoned write pipelining
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
` (5 preceding siblings ...)
2025-10-14 21:54 ` [PATCH v25 06/20] block/mq-deadline: Make locking IRQ-safe Bart Van Assche
@ 2025-10-14 21:54 ` Bart Van Assche
2025-10-15 7:31 ` Damien Le Moal
2025-10-14 21:54 ` [PATCH v25 08/20] blk-zoned: Fix a typo in a source code comment Bart Van Assche
` (12 subsequent siblings)
19 siblings, 1 reply; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
The hwq selected by blk_mq_run_hw_queues() for single-queue I/O schedulers
depends on the CPU core that function has been called from. This may lead
to concurrent dispatching of I/O requests on different CPU cores and hence
may cause I/O reordering. Prevent reordering of zoned writes as follows:
- Set the ELEVATOR_FLAG_SUPPORTS_ZONED_WRITE_PIPELINING flag. This disables
the single hwq optimization in the block layer core.
- Modify dd_has_work() such that it only reports pending work for zoned
writes if those writes have been submitted to the hwq that has been passed
as argument to dd_has_work().
- Modify dd_dispatch_request() such that it only dispatches zoned writes
if the hwq argument passed to this function matches the hwq of the
pending zoned writes.
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/mq-deadline.c | 68 ++++++++++++++++++++++++++++++++++++++-------
1 file changed, 58 insertions(+), 10 deletions(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 0a46d0f06f72..be6ed3d8fa36 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -319,11 +319,25 @@ static struct request *dd_start_request(struct deadline_data *dd,
return rq;
}
+/*
+ * If write pipelining is enabled, only dispatch sequential zoned writes if
+ * rq->mq_hctx == hctx.
+ */
+static bool dd_dispatch_from_hctx(struct blk_mq_hw_ctx *hctx,
+ struct request *rq)
+{
+ struct request_queue *q = hctx->queue;
+
+ return !(q->limits.features & BLK_FEAT_ORDERED_HWQ) ||
+ rq->mq_hctx == hctx || !blk_rq_is_seq_zoned_write(rq);
+}
+
/*
* deadline_dispatch_requests selects the best request according to
* read/write expire, fifo_batch, etc and with a start time <= @latest_start.
*/
static struct request *__dd_dispatch_request(struct deadline_data *dd,
+ struct blk_mq_hw_ctx *hctx,
struct dd_per_prio *per_prio,
unsigned long latest_start)
{
@@ -336,7 +350,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
* batches are currently reads XOR writes
*/
rq = deadline_next_request(dd, per_prio, dd->last_dir);
- if (rq && dd->batching < dd->fifo_batch) {
+ if (rq && dd->batching < dd->fifo_batch &&
+ dd_dispatch_from_hctx(hctx, rq)) {
/* we have a next request and are still entitled to batch */
data_dir = rq_data_dir(rq);
goto dispatch_request;
@@ -396,7 +411,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
rq = next_rq;
}
- if (!rq)
+ if (!rq || !dd_dispatch_from_hctx(hctx, rq))
return NULL;
dd->last_dir = data_dir;
@@ -418,8 +433,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
* Check whether there are any requests with priority other than DD_RT_PRIO
* that were inserted more than prio_aging_expire jiffies ago.
*/
-static struct request *dd_dispatch_prio_aged_requests(struct deadline_data *dd,
- unsigned long now)
+static struct request *
+dd_dispatch_prio_aged_requests(struct deadline_data *dd,
+ struct blk_mq_hw_ctx *hctx, unsigned long now)
{
struct request *rq;
enum dd_prio prio;
@@ -433,7 +449,7 @@ static struct request *dd_dispatch_prio_aged_requests(struct deadline_data *dd,
return NULL;
for (prio = DD_BE_PRIO; prio <= DD_PRIO_MAX; prio++) {
- rq = __dd_dispatch_request(dd, &dd->per_prio[prio],
+ rq = __dd_dispatch_request(dd, hctx, &dd->per_prio[prio],
now - dd->prio_aging_expire);
if (rq)
return rq;
@@ -466,7 +482,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
goto unlock;
}
- rq = dd_dispatch_prio_aged_requests(dd, now);
+ rq = dd_dispatch_prio_aged_requests(dd, hctx, now);
if (rq)
goto unlock;
@@ -475,7 +491,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
* requests if any higher priority requests are pending.
*/
for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
- rq = __dd_dispatch_request(dd, &dd->per_prio[prio], now);
+ rq = __dd_dispatch_request(dd, hctx, &dd->per_prio[prio], now);
if (rq || dd_queued(dd, prio))
break;
}
@@ -575,6 +591,8 @@ static int dd_init_sched(struct request_queue *q, struct elevator_queue *eq)
/* We dispatch from request queue wide instead of hw queue */
blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q);
+ set_bit(ELEVATOR_FLAG_SUPPORTS_ZONED_WRITE_PIPELINING, &eq->flags);
+
q->elevator = eq;
dd_depth_updated(q);
return 0;
@@ -731,10 +749,40 @@ static void dd_finish_request(struct request *rq)
atomic_inc(&per_prio->stats.completed);
}
-static bool dd_has_work_for_prio(struct dd_per_prio *per_prio)
+/* May be called from interrupt context. */
+static bool dd_has_write_work(struct deadline_data *dd,
+ struct blk_mq_hw_ctx *hctx,
+ struct list_head *list)
+{
+ struct request_queue *q = hctx->queue;
+ unsigned long flags;
+ struct request *rq;
+ bool has_work = false;
+
+ if (list_empty_careful(list))
+ return false;
+
+ if (!(q->limits.features & BLK_FEAT_ORDERED_HWQ))
+ return true;
+
+ spin_lock_irqsave(&dd->lock, flags);
+ list_for_each_entry(rq, list, queuelist) {
+ if (rq->mq_hctx == hctx) {
+ has_work = true;
+ break;
+ }
+ }
+ spin_unlock_irqrestore(&dd->lock, flags);
+
+ return has_work;
+}
+
+static bool dd_has_work_for_prio(struct deadline_data *dd,
+ struct blk_mq_hw_ctx *hctx,
+ struct dd_per_prio *per_prio)
{
return !list_empty_careful(&per_prio->fifo_list[DD_READ]) ||
- !list_empty_careful(&per_prio->fifo_list[DD_WRITE]);
+ dd_has_write_work(dd, hctx, &per_prio->fifo_list[DD_WRITE]);
}
static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
@@ -746,7 +794,7 @@ static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
return true;
for (prio = 0; prio <= DD_PRIO_MAX; prio++)
- if (dd_has_work_for_prio(&dd->per_prio[prio]))
+ if (dd_has_work_for_prio(dd, hctx, &dd->per_prio[prio]))
return true;
return false;
* [PATCH v25 08/20] blk-zoned: Fix a typo in a source code comment
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
` (6 preceding siblings ...)
2025-10-14 21:54 ` [PATCH v25 07/20] block/mq-deadline: Enable zoned write pipelining Bart Van Assche
@ 2025-10-14 21:54 ` Bart Van Assche
2025-10-15 7:32 ` Damien Le Moal
2025-10-14 21:54 ` [PATCH v25 09/20] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
` (11 subsequent siblings)
19 siblings, 1 reply; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Remove a superfluous parenthesis that was introduced by commit fa8555630b32
("blk-zoned: Improve the queue reference count strategy documentation").
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-zoned.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index f6bb4331eea6..13c45ab58d63 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -621,7 +621,7 @@ static inline void blk_zone_wplug_bio_io_error(struct blk_zone_wplug *zwplug,
bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING);
bio_io_error(bio);
disk_put_zone_wplug(zwplug);
- /* Drop the reference taken by disk_zone_wplug_add_bio(() */
+ /* Drop the reference taken by disk_zone_wplug_add_bio(). */
blk_queue_exit(q);
}
* [PATCH v25 09/20] blk-zoned: Add an argument to blk_zone_plug_bio()
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
` (7 preceding siblings ...)
2025-10-14 21:54 ` [PATCH v25 08/20] blk-zoned: Fix a typo in a source code comment Bart Van Assche
@ 2025-10-14 21:54 ` Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 10/20] blk-zoned: Split an if-statement Bart Van Assche
` (10 subsequent siblings)
19 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Software that submits zoned writes, e.g. a filesystem, may submit zoned
writes from multiple CPU cores as long as the zoned writes are serialized
per zone. Submitting bios from different CPUs may cause bio reordering if
e.g. different bios reach the storage device through different queues.
Prepare for preserving the order of pipelined zoned writes per zone by
adding the 'rq_cpu' argument to blk_zone_plug_bio(). This argument tells
blk_zone_plug_bio() from which CPU a cached request has been allocated.
The cached request will only be used if it matches the CPU from which
zoned writes are being submitted for the zone associated with the bio.
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 7 +++----
block/blk-zoned.c | 5 ++++-
drivers/md/dm.c | 5 ++---
include/linux/blkdev.h | 5 +++--
4 files changed, 12 insertions(+), 10 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5f07483960f8..a19645701b0e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3234,10 +3234,9 @@ void blk_mq_submit_bio(struct bio *bio)
if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
goto queue_exit;
- if (bio_needs_zone_write_plugging(bio)) {
- if (blk_zone_plug_bio(bio, nr_segs))
- goto queue_exit;
- }
+ if (bio_needs_zone_write_plugging(bio) &&
+ blk_zone_plug_bio(bio, nr_segs, rq ? rq->mq_ctx->cpu : -1))
+ goto queue_exit;
new_request:
if (rq) {
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 13c45ab58d63..aff7ab5487cd 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1131,6 +1131,9 @@ static void blk_zone_wplug_handle_native_zone_append(struct bio *bio)
* blk_zone_plug_bio - Handle a zone write BIO with zone write plugging
* @bio: The BIO being submitted
* @nr_segs: The number of physical segments of @bio
+ * @rq_cpu: CPU of the software queue onto which a request will be queued. -1 if the caller
+ * has not yet decided onto which software queue to queue the request or if
+ * the bio won't be converted into a request.
*
* Handle write, write zeroes and zone append operations requiring emulation
* using zone write plugging.
@@ -1139,7 +1142,7 @@ static void blk_zone_wplug_handle_native_zone_append(struct bio *bio)
* write plug. Otherwise, return false to let the submission path process
* @bio normally.
*/
-bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
+bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int rq_cpu)
{
struct block_device *bdev = bio->bi_bdev;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f5e5e59b232b..b75b296aa79b 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1802,9 +1802,8 @@ static inline bool dm_zone_bio_needs_split(struct bio *bio)
static inline bool dm_zone_plug_bio(struct mapped_device *md, struct bio *bio)
{
- if (!bio_needs_zone_write_plugging(bio))
- return false;
- return blk_zone_plug_bio(bio, 0);
+ return bio_needs_zone_write_plugging(bio) &&
+ blk_zone_plug_bio(bio, 0, -1);
}
static blk_status_t __send_zone_reset_all_emulated(struct clone_info *ci,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 85fca05bd5eb..22edafb5efc1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -905,7 +905,7 @@ static inline bool bio_needs_zone_write_plugging(struct bio *bio)
}
}
-bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs);
+bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int rq_cpu);
/**
* disk_zone_capacity - returns the zone capacity of zone containing @sector
@@ -945,7 +945,8 @@ static inline bool bio_needs_zone_write_plugging(struct bio *bio)
return false;
}
-static inline bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
+static inline bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs,
+ int rq_cpu)
{
return false;
}
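
To illustrate what the new argument is for (illustration only, not code from
this series): the cover letter describes a per-zone 'from_cpu' member, added
in a later patch and reset to -1 once all pending zoned writes for a zone have
completed, that records the CPU from which writes for that zone are being
submitted. A plausible consumer of rq_cpu, with hypothetical helper and field
names, could look like this:

/*
 * Hypothetical sketch -- NOT the code from the later patch that adds
 * blk_zone_wplug::from_cpu. It only illustrates how the rq_cpu value passed
 * to blk_zone_plug_bio() could be compared against the CPU that owns the
 * pending writes for a zone.
 */
static bool zwplug_may_use_cached_rq(struct blk_zone_wplug *zwplug, int rq_cpu)
{
        /* No zoned writes pending for this zone: adopt the caller's CPU. */
        if (zwplug->from_cpu < 0) {
                zwplug->from_cpu = rq_cpu;
                return true;
        }

        /*
         * Only reuse the cached request if it was allocated on the CPU from
         * which all pending writes for this zone are being submitted;
         * otherwise let the bio go through that CPU's software queue.
         */
        return rq_cpu == zwplug->from_cpu;
}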
* [PATCH v25 10/20] blk-zoned: Split an if-statement
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
` (8 preceding siblings ...)
2025-10-14 21:54 ` [PATCH v25 09/20] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
@ 2025-10-14 21:54 ` Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 11/20] blk-zoned: Move code from disk_zone_wplug_add_bio() into its caller Bart Van Assche
` (9 subsequent siblings)
19 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Split an if-statement and also the comment above that if-statement. This
patch prepares for moving code from disk_zone_wplug_add_bio() into its
caller. No functionality has been changed.
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-zoned.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index aff7ab5487cd..063b7b9459dd 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1057,13 +1057,14 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
/*
- * If the zone is already plugged, add the BIO to the plug BIO list.
- * Do the same for REQ_NOWAIT BIOs to ensure that we will not see a
+ * Add REQ_NOWAIT BIOs to the plug list to ensure that we will not see a
* BLK_STS_AGAIN failure if we let the BIO execute.
- * Otherwise, plug and let the BIO execute.
*/
- if ((zwplug->flags & BLK_ZONE_WPLUG_PLUGGED) ||
- (bio->bi_opf & REQ_NOWAIT))
+ if (bio->bi_opf & REQ_NOWAIT)
+ goto plug;
+
+ /* If the zone is already plugged, add the BIO to the BIO plug list. */
+ if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)
goto plug;
if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
@@ -1072,6 +1073,7 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
return true;
}
+ /* Otherwise, plug and submit the BIO. */
zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
spin_unlock_irqrestore(&zwplug->lock, flags);
* [PATCH v25 11/20] blk-zoned: Move code from disk_zone_wplug_add_bio() into its caller
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
` (9 preceding siblings ...)
2025-10-14 21:54 ` [PATCH v25 10/20] blk-zoned: Split an if-statement Bart Van Assche
@ 2025-10-14 21:54 ` Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 12/20] blk-zoned: Introduce a loop in blk_zone_wplug_bio_work() Bart Van Assche
` (8 subsequent siblings)
19 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Move the following code into the only caller of disk_zone_wplug_add_bio():
- The code for clearing the REQ_NOWAIT flag.
- The code that sets the BLK_ZONE_WPLUG_PLUGGED flag.
- The disk_zone_wplug_schedule_bio_work() call.
No functionality has been changed.
This patch prepares for zoned write pipelining by removing the code from
disk_zone_wplug_add_bio() that does not apply to all zoned write pipelining
bio processing cases.
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-zoned.c | 33 ++++++++++++---------------------
1 file changed, 12 insertions(+), 21 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 063b7b9459dd..6f4ceb7f4f43 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -803,8 +803,6 @@ static inline void disk_zone_wplug_add_bio(struct gendisk *disk,
struct blk_zone_wplug *zwplug,
struct bio *bio, unsigned int nr_segs)
{
- bool schedule_bio_work = false;
-
/*
* Grab an extra reference on the BIO request queue usage counter.
* This reference will be reused to submit a request for the BIO for
@@ -820,16 +818,6 @@ static inline void disk_zone_wplug_add_bio(struct gendisk *disk,
*/
bio_clear_polled(bio);
- /*
- * REQ_NOWAIT BIOs are always handled using the zone write plug BIO
- * work, which can block. So clear the REQ_NOWAIT flag and schedule the
- * work if this is the first BIO we are plugging.
- */
- if (bio->bi_opf & REQ_NOWAIT) {
- schedule_bio_work = !(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED);
- bio->bi_opf &= ~REQ_NOWAIT;
- }
-
/*
* Reuse the poll cookie field to store the number of segments when
* split to the hardware limits.
@@ -845,11 +833,6 @@ static inline void disk_zone_wplug_add_bio(struct gendisk *disk,
bio_list_add(&zwplug->bio_list, bio);
trace_disk_zone_wplug_add_bio(zwplug->disk->queue, zwplug->zone_no,
bio->bi_iter.bi_sector, bio_sectors(bio));
-
- zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
-
- if (schedule_bio_work)
- disk_zone_wplug_schedule_bio_work(disk, zwplug);
}
/*
@@ -1014,6 +997,7 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
{
struct gendisk *disk = bio->bi_bdev->bd_disk;
sector_t sector = bio->bi_iter.bi_sector;
+ bool schedule_bio_work = false;
struct blk_zone_wplug *zwplug;
gfp_t gfp_mask = GFP_NOIO;
unsigned long flags;
@@ -1060,12 +1044,14 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
* Add REQ_NOWAIT BIOs to the plug list to ensure that we will not see a
* BLK_STS_AGAIN failure if we let the BIO execute.
*/
- if (bio->bi_opf & REQ_NOWAIT)
- goto plug;
+ if (bio->bi_opf & REQ_NOWAIT) {
+ bio->bi_opf &= ~REQ_NOWAIT;
+ goto add_to_bio_list;
+ }
/* If the zone is already plugged, add the BIO to the BIO plug list. */
if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)
- goto plug;
+ goto add_to_bio_list;
if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
spin_unlock_irqrestore(&zwplug->lock, flags);
@@ -1080,8 +1066,13 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
return false;
-plug:
+add_to_bio_list:
+ schedule_bio_work = !(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED);
+ zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+
disk_zone_wplug_add_bio(disk, zwplug, bio, nr_segs);
+ if (schedule_bio_work)
+ disk_zone_wplug_schedule_bio_work(disk, zwplug);
spin_unlock_irqrestore(&zwplug->lock, flags);
* [PATCH v25 12/20] blk-zoned: Introduce a loop in blk_zone_wplug_bio_work()
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
` (10 preceding siblings ...)
2025-10-14 21:54 ` [PATCH v25 11/20] blk-zoned: Move code from disk_zone_wplug_add_bio() into its caller Bart Van Assche
@ 2025-10-14 21:54 ` Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 13/20] blk-zoned: Document disk_zone_wplug_schedule_bio_work() locking Bart Van Assche
` (7 subsequent siblings)
19 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Prepare for submitting multiple bios from inside a single
blk_zone_wplug_bio_work() call. No functionality has been changed.
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-zoned.c | 70 +++++++++++++++++++++++++----------------------
1 file changed, 37 insertions(+), 33 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 6f4ceb7f4f43..1df466167e55 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1305,44 +1305,48 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
struct bio *bio;
bool prepared;
- /*
- * Submit the next plugged BIO. If we do not have any, clear
- * the plugged flag.
- */
-again:
- spin_lock_irqsave(&zwplug->lock, flags);
- bio = bio_list_pop(&zwplug->bio_list);
- if (!bio) {
- zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
- spin_unlock_irqrestore(&zwplug->lock, flags);
- goto put_zwplug;
- }
+ do {
+ /*
+ * Submit the next plugged BIO. If we do not have any, clear
+ * the plugged flag.
+ */
+ spin_lock_irqsave(&zwplug->lock, flags);
+ bio = bio_list_pop(&zwplug->bio_list);
+ if (!bio) {
+ zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+ spin_unlock_irqrestore(&zwplug->lock, flags);
+ goto put_zwplug;
+ }
- trace_blk_zone_wplug_bio(zwplug->disk->queue, zwplug->zone_no,
- bio->bi_iter.bi_sector, bio_sectors(bio));
+ trace_blk_zone_wplug_bio(zwplug->disk->queue,
+ zwplug->zone_no,
+ bio->bi_iter.bi_sector,
+ bio_sectors(bio));
- prepared = blk_zone_wplug_prepare_bio(zwplug, bio);
- spin_unlock_irqrestore(&zwplug->lock, flags);
+ prepared = blk_zone_wplug_prepare_bio(zwplug, bio);
+ spin_unlock_irqrestore(&zwplug->lock, flags);
- if (!prepared) {
- blk_zone_wplug_bio_io_error(zwplug, bio);
- goto again;
- }
+ if (!prepared) {
+ blk_zone_wplug_bio_io_error(zwplug, bio);
+ continue;
+ }
- bdev = bio->bi_bdev;
- /*
- * blk-mq devices will reuse the extra reference on the request queue
- * usage counter we took when the BIO was plugged, but the submission
- * path for BIO-based devices will not do that. So drop this extra
- * reference here.
- */
- if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO)) {
- bdev->bd_disk->fops->submit_bio(bio);
- blk_queue_exit(bdev->bd_disk->queue);
- } else {
- blk_mq_submit_bio(bio);
- }
+ bdev = bio->bi_bdev;
+
+ /*
+ * blk-mq devices will reuse the extra reference on the request
+ * queue usage counter we took when the BIO was plugged, but the
+ * submission path for BIO-based devices will not do that. So
+ * drop this extra reference here.
+ */
+ if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO)) {
+ bdev->bd_disk->fops->submit_bio(bio);
+ blk_queue_exit(bdev->bd_disk->queue);
+ } else {
+ blk_mq_submit_bio(bio);
+ }
+ } while (0);
put_zwplug:
/* Drop the reference we took in disk_zone_wplug_schedule_bio_work(). */
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v25 13/20] blk-zoned: Document disk_zone_wplug_schedule_bio_work() locking
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
` (11 preceding siblings ...)
2025-10-14 21:54 ` [PATCH v25 12/20] blk-zoned: Introduce a loop in blk_zone_wplug_bio_work() Bart Van Assche
@ 2025-10-14 21:54 ` Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 14/20] blk-zoned: Support pipelining of zoned writes Bart Van Assche
` (6 subsequent siblings)
19 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Before adding more code in disk_zone_wplug_schedule_bio_work() that depends
on the zone write plug lock being held, document that all callers hold this
lock.
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-zoned.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 1df466167e55..74f0fea56eda 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -789,6 +789,8 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
+ lockdep_assert_held(&zwplug->lock);
+
/*
* Take a reference on the zone write plug and schedule the submission
* of the next plugged BIO. blk_zone_wplug_bio_work() will release the
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v25 14/20] blk-zoned: Support pipelining of zoned writes
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
` (12 preceding siblings ...)
2025-10-14 21:54 ` [PATCH v25 13/20] blk-zoned: Document disk_zone_wplug_schedule_bio_work() locking Bart Van Assche
@ 2025-10-14 21:54 ` Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 15/20] null_blk: Add the preserves_write_order attribute Bart Van Assche
` (5 subsequent siblings)
19 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Support pipelining of zoned writes if the write order is preserved per
hardware queue. Track per zone to which software queue writes have been
queued. If zoned writes are pipelined, submit new writes to the same
software queue as the writes that are already in progress. This prevents
the reordering that would result from submitting requests for the same
zone to different software or hardware queues. In
disk_zone_wplug_schedule_bio_work(), only increment the zwplug reference
count if queuing zwplug->bio_work succeeded since, with this patch
applied, the bio_work may already be queued when
disk_zone_wplug_schedule_bio_work() is called.
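As an example of the approach implemented below: if the first pipelined write
for a zone is queued from CPU 3, zwplug->from_cpu is set to 3. Subsequent
writes for that zone are either submitted directly when they are being queued
from CPU 3, or added to the zone write plug BIO list and issued by the BIO
work that is queued on CPU 3. Once all writes for the zone have completed,
from_cpu is reset to -1.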
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 4 +--
block/blk-zoned.c | 89 +++++++++++++++++++++++++++++++++++++++--------
2 files changed, 77 insertions(+), 16 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a19645701b0e..4b7a1dca0fb9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3188,8 +3188,8 @@ void blk_mq_submit_bio(struct bio *bio)
/*
* A BIO that was released from a zone write plug has already been
* through the preparation in this function, already holds a reference
- * on the queue usage counter, and is the only write BIO in-flight for
- * the target zone. Go straight to preparing a request for it.
+ * on the queue usage counter. Go straight to preparing a request for
+ * it.
*/
if (bio_zone_write_plugging(bio)) {
nr_segs = bio->__bi_nr_segments;
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 74f0fea56eda..3584c049323e 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -8,6 +8,7 @@
* Copyright (c) 2016, Damien Le Moal
* Copyright (c) 2016, Western Digital
* Copyright (c) 2024, Western Digital Corporation or its affiliates.
+ * Copyright 2025 Google LLC
*/
#include <linux/kernel.h>
@@ -54,6 +55,8 @@ static const char *const zone_cond_name[] = {
* @zone_no: The number of the zone the plug is managing.
* @wp_offset: The zone write pointer location relative to the start of the zone
* as a number of 512B sectors.
+ * @from_cpu: Software queue to submit writes from for drivers that preserve
+ * the write order.
* @bio_list: The list of BIOs that are currently plugged.
* @bio_work: Work struct to handle issuing of plugged BIOs
* @rcu_head: RCU head to free zone write plugs with an RCU grace period.
@@ -66,6 +69,7 @@ struct blk_zone_wplug {
unsigned int flags;
unsigned int zone_no;
unsigned int wp_offset;
+ int from_cpu;
struct bio_list bio_list;
struct work_struct bio_work;
struct rcu_head rcu_head;
@@ -75,8 +79,7 @@ struct blk_zone_wplug {
/*
* Zone write plug flags bits:
* - BLK_ZONE_WPLUG_PLUGGED: Indicates that the zone write plug is plugged,
- * that is, that write BIOs are being throttled due to a write BIO already
- * being executed or the zone write plug bio list is not empty.
+ * that is, that write BIOs are being throttled.
* - BLK_ZONE_WPLUG_NEED_WP_UPDATE: Indicates that we lost track of a zone
* write pointer offset and need to update it.
* - BLK_ZONE_WPLUG_UNHASHED: Indicates that the zone write plug was removed
@@ -593,6 +596,7 @@ static struct blk_zone_wplug *disk_get_and_lock_zone_wplug(struct gendisk *disk,
zwplug->flags = 0;
zwplug->zone_no = zno;
zwplug->wp_offset = bdev_offset_from_zone_start(disk->part0, sector);
+ zwplug->from_cpu = -1;
bio_list_init(&zwplug->bio_list);
INIT_WORK(&zwplug->bio_work, blk_zone_wplug_bio_work);
zwplug->disk = disk;
@@ -789,16 +793,25 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
+ int cpu;
+
lockdep_assert_held(&zwplug->lock);
/*
- * Take a reference on the zone write plug and schedule the submission
- * of the next plugged BIO. blk_zone_wplug_bio_work() will release the
- * reference we take here.
+ * Schedule a blk_zone_wplug_bio_work() call and increase the zone write
+ * plug reference count. blk_zone_wplug_bio_work() will release the
+ * reference we take here. Increasing the zone write plug reference
+ * count after the queue_work_on() call is safe because all callers hold
+ * the zone write plug lock and blk_zone_wplug_bio_work() obtains the
+ * same lock before decrementing the reference count.
*/
WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
- refcount_inc(&zwplug->ref);
- queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
+ if (zwplug->from_cpu >= 0)
+ cpu = zwplug->from_cpu;
+ else
+ cpu = WORK_CPU_UNBOUND;
+ if (queue_work_on(cpu, disk->zone_wplugs_wq, &zwplug->bio_work))
+ refcount_inc(&zwplug->ref);
}
static inline void disk_zone_wplug_add_bio(struct gendisk *disk,
@@ -995,14 +1008,18 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
return true;
}
-static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
+static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs,
+ int rq_cpu)
{
struct gendisk *disk = bio->bi_bdev->bd_disk;
+ const bool pipeline_zwr = bio_op(bio) != REQ_OP_ZONE_APPEND &&
+ blk_pipeline_zwr(disk->queue);
sector_t sector = bio->bi_iter.bi_sector;
bool schedule_bio_work = false;
struct blk_zone_wplug *zwplug;
gfp_t gfp_mask = GFP_NOIO;
unsigned long flags;
+ int from_cpu = -1;
/*
* BIOs must be fully contained within a zone so that we use the correct
@@ -1055,14 +1072,44 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)
goto add_to_bio_list;
+ /*
+ * The code below has been organized such that zwplug->from_cpu and
+ * zwplug->flags are only modified after it is clear that a request will
+ * be added to the bio list or that it will be submitted by the
+ * caller. This avoids having to revert any changes to these member
+ * variables if the blk_zone_wplug_prepare_bio() call fails.
+ */
+
+ if (pipeline_zwr) {
+ if (zwplug->from_cpu >= 0)
+ from_cpu = zwplug->from_cpu;
+ else
+ from_cpu = smp_processor_id();
+ if (from_cpu != rq_cpu) {
+ zwplug->from_cpu = from_cpu;
+ goto add_to_bio_list;
+ }
+ }
+
if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
spin_unlock_irqrestore(&zwplug->lock, flags);
bio_io_error(bio);
return true;
}
- /* Otherwise, plug and submit the BIO. */
- zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+ if (pipeline_zwr) {
+ /*
+ * The block driver preserves the write order. Submit future
+ * writes from the same CPU core as ongoing writes.
+ */
+ zwplug->from_cpu = from_cpu;
+ } else {
+ /*
+ * The block driver does not preserve the write order. Plug and
+ * let the caller submit the BIO.
+ */
+ zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+ }
spin_unlock_irqrestore(&zwplug->lock, flags);
@@ -1170,7 +1217,7 @@ bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int rq_cpu)
fallthrough;
case REQ_OP_WRITE:
case REQ_OP_WRITE_ZEROES:
- return blk_zone_wplug_handle_write(bio, nr_segs);
+ return blk_zone_wplug_handle_write(bio, nr_segs, rq_cpu);
case REQ_OP_ZONE_RESET:
return blk_zone_wplug_handle_reset_or_finish(bio, 0);
case REQ_OP_ZONE_FINISH:
@@ -1202,6 +1249,16 @@ static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+ /*
+ * zwplug->from_cpu must not change while one or more writes are pending
+ * for the zone associated with zwplug. zwplug->ref is 2 when the plug
+ * is unused (one reference taken when the plug was allocated and
+ * another reference taken by the caller context). Reset
+ * zwplug->from_cpu if no more writes are pending.
+ */
+ if (refcount_read(&zwplug->ref) == 2)
+ zwplug->from_cpu = -1;
+
/*
* If the zone is full (it was fully written or finished, or empty
* (it was reset), remove its zone write plug from the hash table.
@@ -1302,6 +1359,7 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
{
struct blk_zone_wplug *zwplug =
container_of(work, struct blk_zone_wplug, bio_work);
+ bool pipeline_zwr = blk_pipeline_zwr(zwplug->disk->queue);
struct block_device *bdev;
unsigned long flags;
struct bio *bio;
@@ -1348,7 +1406,7 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
} else {
blk_mq_submit_bio(bio);
}
- } while (0);
+ } while (pipeline_zwr);
put_zwplug:
/* Drop the reference we took in disk_zone_wplug_schedule_bio_work(). */
@@ -1875,6 +1933,7 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
unsigned int zwp_zone_no, zwp_ref;
unsigned int zwp_bio_list_size;
unsigned long flags;
+ int from_cpu;
spin_lock_irqsave(&zwplug->lock, flags);
zwp_zone_no = zwplug->zone_no;
@@ -1882,10 +1941,12 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
zwp_ref = refcount_read(&zwplug->ref);
zwp_wp_offset = zwplug->wp_offset;
zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
+ from_cpu = zwplug->from_cpu;
spin_unlock_irqrestore(&zwplug->lock, flags);
- seq_printf(m, "%u 0x%x %u %u %u\n", zwp_zone_no, zwp_flags, zwp_ref,
- zwp_wp_offset, zwp_bio_list_size);
+ seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u bio_list_size %u from_cpu %d\n",
+ zwp_zone_no, zwp_flags, zwp_ref, zwp_wp_offset,
+ zwp_bio_list_size, from_cpu);
}
int queue_zone_wplugs_show(void *data, struct seq_file *m)
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v25 15/20] null_blk: Add the preserves_write_order attribute
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
` (13 preceding siblings ...)
2025-10-14 21:54 ` [PATCH v25 14/20] blk-zoned: Support pipelining of zoned writes Bart Van Assche
@ 2025-10-14 21:54 ` Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 16/20] scsi: core: Retry unaligned zoned writes Bart Van Assche
` (4 subsequent siblings)
19 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Support configuring the BLK_FEAT_ORDERED_HWQ flag in the null_blk driver to
make it easier to test write pipelining.
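For example, a zoned null_blk device with write pipelining enabled can be
created via configfs as follows (all settings other than preserves_write_order
are just an illustration):

  modprobe null_blk nr_devices=0
  mkdir /sys/kernel/config/nullb/nullb0
  cd /sys/kernel/config/nullb/nullb0
  echo 1 > zoned
  echo 1 > zone_size             # zone size in MiB
  echo 1 > preserves_write_order
  echo 1 > power                 # instantiate /dev/nullb0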
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/block/null_blk/main.c | 4 ++++
drivers/block/null_blk/null_blk.h | 1 +
2 files changed, 5 insertions(+)
diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index f982027e8c85..0284e64a1648 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -475,6 +475,7 @@ NULLB_DEVICE_ATTR(fua, bool, NULL);
NULLB_DEVICE_ATTR(rotational, bool, NULL);
NULLB_DEVICE_ATTR(badblocks_once, bool, NULL);
NULLB_DEVICE_ATTR(badblocks_partial_io, bool, NULL);
+NULLB_DEVICE_ATTR(preserves_write_order, bool, NULL);
static ssize_t nullb_device_power_show(struct config_item *item, char *page)
{
@@ -613,6 +614,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
&nullb_device_attr_no_sched,
&nullb_device_attr_poll_queues,
&nullb_device_attr_power,
+ &nullb_device_attr_preserves_write_order,
&nullb_device_attr_queue_mode,
&nullb_device_attr_rotational,
&nullb_device_attr_shared_tag_bitmap,
@@ -1979,6 +1981,8 @@ static int null_add_dev(struct nullb_device *dev)
if (dev->virt_boundary)
lim.virt_boundary_mask = PAGE_SIZE - 1;
null_config_discard(nullb, &lim);
+ if (dev->preserves_write_order)
+ lim.features |= BLK_FEAT_ORDERED_HWQ;
if (dev->zoned) {
rv = null_init_zoned_dev(dev, &lim);
if (rv)
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 7bb6128dbaaf..08b732cf853f 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -110,6 +110,7 @@ struct nullb_device {
bool shared_tag_bitmap; /* use hostwide shared tags */
bool fua; /* Support FUA */
bool rotational; /* Fake rotational device */
+ bool preserves_write_order;
};
struct nullb {
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v25 16/20] scsi: core: Retry unaligned zoned writes
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
` (14 preceding siblings ...)
2025-10-14 21:54 ` [PATCH v25 15/20] null_blk: Add the preserves_write_order attribute Bart Van Assche
@ 2025-10-14 21:54 ` Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 17/20] scsi: sd: Increase retry count for " Bart Van Assche
` (3 subsequent siblings)
19 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche, Martin K. Petersen, Ming Lei
If zoned writes (REQ_OP_WRITE) for a sequential write required zone have
a starting LBA that differs from the write pointer, e.g. because a prior
write triggered a unit attention condition, then the storage device will
respond with an UNALIGNED WRITE COMMAND error. Retry commands that failed
with an unaligned write error.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/scsi_error.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 746ff6a1f309..812b18e9b9de 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -713,6 +713,22 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
fallthrough;
case ILLEGAL_REQUEST:
+ /*
+ * Unaligned write command. This may indicate that zoned writes
+ * have been received by the device in the wrong order. If write
+ * pipelining is enabled, retry.
+ */
+ if (sshdr.asc == 0x21 && sshdr.ascq == 0x04 &&
+ blk_pipeline_zwr(req->q) &&
+ blk_rq_is_seq_zoned_write(req) &&
+ scsi_cmd_retry_allowed(scmd)) {
+ SCSI_LOG_ERROR_RECOVERY(1,
+ sdev_printk(KERN_WARNING, scmd->device,
+ "Retrying unaligned write at LBA %#llx.\n",
+ scsi_get_lba(scmd)));
+ return NEEDS_RETRY;
+ }
+
if (sshdr.asc == 0x20 || /* Invalid command operation code */
sshdr.asc == 0x21 || /* Logical block address out of range */
sshdr.asc == 0x22 || /* Invalid function */
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v25 17/20] scsi: sd: Increase retry count for zoned writes
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
` (15 preceding siblings ...)
2025-10-14 21:54 ` [PATCH v25 16/20] scsi: core: Retry unaligned zoned writes Bart Van Assche
@ 2025-10-14 21:54 ` Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 18/20] scsi: scsi_debug: Add the preserves_write_order module parameter Bart Van Assche
` (2 subsequent siblings)
19 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche, Martin K. Petersen, Ming Lei
If the write order is preserved, increase the number of retries for
write commands sent to a sequential zone to the maximum number of
outstanding commands. In the worst case, the number of times a
reordered zoned write has to be retried is (number of outstanding
writes per sequential zone) - 1.
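As an example, consider four pending writes W1 < W2 < W3 < W4 (ordered by LBA)
for the same zone that reach the device in reverse order: only W1 succeeds in
the first round, and in each subsequent retry round exactly one more write
lands, so in the worst case W4 is retried three times, i.e. (number of
outstanding writes per sequential zone) - 1 times.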
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/sd.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 0252d3f6bed1..f94ce38131e5 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1409,6 +1409,12 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
cmd->transfersize = sdp->sector_size;
cmd->underflow = nr_blocks << 9;
cmd->allowed = sdkp->max_retries;
+ /*
+ * Increase the number of allowed retries for zoned writes if zoned
+ * write pipelining is enabled.
+ */
+ if (blk_pipeline_zwr(rq->q) && blk_rq_is_seq_zoned_write(rq))
+ cmd->allowed += rq->q->nr_requests;
cmd->sdb.length = nr_blocks * sdp->sector_size;
SCSI_LOG_HLQUEUE(1,
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v25 18/20] scsi: scsi_debug: Add the preserves_write_order module parameter
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
` (16 preceding siblings ...)
2025-10-14 21:54 ` [PATCH v25 17/20] scsi: sd: Increase retry count for " Bart Van Assche
@ 2025-10-14 21:54 ` Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 19/20] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 20/20] ufs: core: Inform the block layer about write ordering Bart Van Assche
19 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche, Douglas Gilbert, Martin K. Petersen, Ming Lei
Make it easier to test write pipelining by adding support for setting the
BLK_FEAT_ORDERED_HWQ flag.
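For example, a zoned scsi_debug device that preserves the write order can be
created as follows (all parameters other than preserves_write_order are just
an illustration):

  modprobe scsi_debug preserves_write_order=1 zbc=2 zone_nr_conv=0 \
          zone_size_mb=1 dev_size_mb=1024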
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/scsi_debug.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index b2ab97be5db3..03e1a93d90e9 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -1004,6 +1004,7 @@ static int dix_reads;
static int dif_errors;
/* ZBC global data */
+static bool sdeb_preserves_write_order;
static bool sdeb_zbc_in_use; /* true for host-aware and host-managed disks */
static int sdeb_zbc_zone_cap_mb;
static int sdeb_zbc_zone_size_mb;
@@ -6625,10 +6626,15 @@ static struct sdebug_dev_info *find_build_dev_info(struct scsi_device *sdev)
static int scsi_debug_sdev_init(struct scsi_device *sdp)
{
+ struct request_queue *q = sdp->request_queue;
+
if (sdebug_verbose)
pr_info("sdev_init <%u %u %u %llu>\n",
sdp->host->host_no, sdp->channel, sdp->id, sdp->lun);
+ if (sdeb_preserves_write_order)
+ q->limits.features |= BLK_FEAT_ORDERED_HWQ;
+
return 0;
}
@@ -7357,6 +7363,8 @@ module_param_named(statistics, sdebug_statistics, bool, S_IRUGO | S_IWUSR);
module_param_named(strict, sdebug_strict, bool, S_IRUGO | S_IWUSR);
module_param_named(submit_queues, submit_queues, int, S_IRUGO);
module_param_named(poll_queues, poll_queues, int, S_IRUGO);
+module_param_named(preserves_write_order, sdeb_preserves_write_order, bool,
+ S_IRUGO);
module_param_named(tur_ms_to_ready, sdeb_tur_ms_to_ready, int, S_IRUGO);
module_param_named(unmap_alignment, sdebug_unmap_alignment, int, S_IRUGO);
module_param_named(unmap_granularity, sdebug_unmap_granularity, int, S_IRUGO);
@@ -7429,6 +7437,8 @@ MODULE_PARM_DESC(opts, "1->noise, 2->medium_err, 4->timeout, 8->recovered_err...
MODULE_PARM_DESC(per_host_store, "If set, next positive add_host will get new store (def=0)");
MODULE_PARM_DESC(physblk_exp, "physical block exponent (def=0)");
MODULE_PARM_DESC(poll_queues, "support for iouring iopoll queues (1 to max(submit_queues - 1))");
+MODULE_PARM_DESC(preserves_write_order,
+ "Whether or not to inform the block layer that this driver preserves the order of WRITE commands (def=0)");
MODULE_PARM_DESC(ptype, "SCSI peripheral type(def=0[disk])");
MODULE_PARM_DESC(random, "If set, uniformly randomize command duration between 0 and delay_in_ns");
MODULE_PARM_DESC(removable, "claim to have removable media (def=0)");
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v25 19/20] scsi: scsi_debug: Support injecting unaligned write errors
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
` (17 preceding siblings ...)
2025-10-14 21:54 ` [PATCH v25 18/20] scsi: scsi_debug: Add the preserves_write_order module parameter Bart Van Assche
@ 2025-10-14 21:54 ` Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 20/20] ufs: core: Inform the block layer about write ordering Bart Van Assche
19 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche, Douglas Gilbert, Martin K. Petersen, Ming Lei
Allow user space software, e.g. a blktests test, to inject unaligned
write errors.
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/scsi_debug.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 03e1a93d90e9..33ff6c46df8e 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -230,6 +230,7 @@ struct tape_block {
#define SDEBUG_OPT_NO_CDB_NOISE 0x4000
#define SDEBUG_OPT_HOST_BUSY 0x8000
#define SDEBUG_OPT_CMD_ABORT 0x10000
+#define SDEBUG_OPT_UNALIGNED_WRITE 0x20000
#define SDEBUG_OPT_ALL_NOISE (SDEBUG_OPT_NOISE | SDEBUG_OPT_Q_NOISE | \
SDEBUG_OPT_RESET_NOISE)
#define SDEBUG_OPT_ALL_INJECTING (SDEBUG_OPT_RECOVERED_ERR | \
@@ -237,7 +238,8 @@ struct tape_block {
SDEBUG_OPT_DIF_ERR | SDEBUG_OPT_DIX_ERR | \
SDEBUG_OPT_SHORT_TRANSFER | \
SDEBUG_OPT_HOST_BUSY | \
- SDEBUG_OPT_CMD_ABORT)
+ SDEBUG_OPT_CMD_ABORT | \
+ SDEBUG_OPT_UNALIGNED_WRITE)
#define SDEBUG_OPT_RECOV_DIF_DIX (SDEBUG_OPT_RECOVERED_ERR | \
SDEBUG_OPT_DIF_ERR | SDEBUG_OPT_DIX_ERR)
@@ -4933,6 +4935,14 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
u8 *cmd = scp->cmnd;
bool meta_data_locked = false;
+ if (unlikely(sdebug_opts & SDEBUG_OPT_UNALIGNED_WRITE &&
+ atomic_read(&sdeb_inject_pending))) {
+ atomic_set(&sdeb_inject_pending, 0);
+ mk_sense_buffer(scp, ILLEGAL_REQUEST, LBA_OUT_OF_RANGE,
+ UNALIGNED_WRITE_ASCQ);
+ return check_condition_result;
+ }
+
switch (cmd[0]) {
case WRITE_16:
ei_lba = 0;
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v25 20/20] ufs: core: Inform the block layer about write ordering
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
` (18 preceding siblings ...)
2025-10-14 21:54 ` [PATCH v25 19/20] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
@ 2025-10-14 21:54 ` Bart Van Assche
19 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-14 21:54 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche, Avri Altman, Can Guo, Bao D. Nguyen,
Martin K. Petersen
From the UFSHCI 4.0 specification, about the MCQ mode:
"Command Submission
1. Host SW writes an Entry to SQ
2. Host SW updates SQ doorbell tail pointer
Command Processing
3. After fetching the Entry, Host Controller updates SQ doorbell head
pointer
4. Host controller sends COMMAND UPIU to UFS device"
In other words, in MCQ mode, UFS controllers are required to forward
commands to the UFS device in the order these commands have been
received from the host.
This patch improves performance as follows on a test setup with a
UFSHCI 4.0 controller:
- When not using an I/O scheduler: 2.3x more IOPS for small writes.
- With the mq-deadline scheduler: 2.0x more IOPS for small writes.
Reviewed-by: Avri Altman <avri.altman@wdc.com>
Reviewed-by: Can Guo <quic_cang@quicinc.com>
Cc: Bao D. Nguyen <quic_nguyenb@quicinc.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/ufs/core/ufshcd.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index 8339fec975b9..b2a44db7163b 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -5333,6 +5333,13 @@ static int ufshcd_sdev_configure(struct scsi_device *sdev,
struct ufs_hba *hba = shost_priv(sdev->host);
struct request_queue *q = sdev->request_queue;
+ /*
+ * The write order is preserved per MCQ. Without MCQ, auto-hibernation
+ * may cause write reordering that results in unaligned write errors.
+ */
+ if (hba->mcq_enabled)
+ lim->features |= BLK_FEAT_ORDERED_HWQ;
+
lim->dma_pad_mask = PRDT_DATA_BYTE_COUNT_PAD - 1;
/*
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [PATCH v25 05/20] blk-mq: Run all hwqs for sq scheds if write pipelining is enabled
2025-10-14 21:54 ` [PATCH v25 05/20] blk-mq: Run all hwqs for sq scheds if write pipelining is enabled Bart Van Assche
@ 2025-10-15 7:25 ` Damien Le Moal
2025-10-15 16:35 ` Bart Van Assche
0 siblings, 1 reply; 33+ messages in thread
From: Damien Le Moal @ 2025-10-15 7:25 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 2025/10/15 6:54, Bart Van Assche wrote:
> One of the optimizations in the block layer is that blk_mq_run_hw_queues()
> only calls blk_mq_run_hw_queue() for a single hardware queue for single
> queue I/O schedulers. Since this optimization may cause I/O reordering,
> disable this optimization if ELEVATOR_FLAG_SUPPORTS_ZONED_WRITE_PIPELINING
> has been set. This patch prepares for adding write pipelining support in
> the mq-deadline I/O scheduler.
>
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/blk-mq.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 81952d0ae544..5f07483960f8 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2401,8 +2401,7 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
> EXPORT_SYMBOL(blk_mq_run_hw_queue);
>
> /*
> - * Return prefered queue to dispatch from (if any) for non-mq aware IO
> - * scheduler.
> + * Return preferred queue to dispatch from for single-queue IO schedulers.
> */
> static struct blk_mq_hw_ctx *blk_mq_get_sq_hctx(struct request_queue *q)
> {
> @@ -2412,6 +2411,11 @@ static struct blk_mq_hw_ctx *blk_mq_get_sq_hctx(struct request_queue *q)
> if (!blk_queue_sq_sched(q))
> return NULL;
>
> + if (blk_queue_is_zoned(q) && blk_pipeline_zwr(q) &&
> + test_bit(ELEVATOR_FLAG_SUPPORTS_ZONED_WRITE_PIPELINING,
> + &q->elevator->flags))
The above test_bit() is already done in blk_pipeline_zwr().
> + return NULL;
> +
> ctx = blk_mq_get_ctx(q);
> /*
> * If the IO scheduler does not respect hardware queues when
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v25 07/20] block/mq-deadline: Enable zoned write pipelining
2025-10-14 21:54 ` [PATCH v25 07/20] block/mq-deadline: Enable zoned write pipelining Bart Van Assche
@ 2025-10-15 7:31 ` Damien Le Moal
2025-10-15 16:32 ` Bart Van Assche
2025-10-16 20:50 ` Bart Van Assche
0 siblings, 2 replies; 33+ messages in thread
From: Damien Le Moal @ 2025-10-15 7:31 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 2025/10/15 6:54, Bart Van Assche wrote:
> The hwq selected by blk_mq_run_hw_queues() for single-queue I/O schedulers
> depends on the CPU core that function has been called from. This may lead
> to concurrent dispatching of I/O requests on different CPU cores and hence
> may cause I/O reordering. Prevent as follows that zoned writes are
> reordered:
> - Set the ELEVATOR_FLAG_SUPPORTS_ZONED_WRITE_PIPELINING flag. This disables
> the single hwq optimization in the block layer core.
> - Modify dd_has_work() such that it only reports that any work is pending
> for zoned writes if the zoned writes have been submitted to the hwq that
> has been passed as argument to dd_has_work().
> - Modify dd_dispatch_request() such that it only dispatches zoned writes
> if the hwq argument passed to this function matches the hwq of the
> pending zoned writes.
One of the goals of zone write plugging was to remove the dependence on IO
schedulers to control the ordering of write commands to zoned block devices.
Such changes are going backward and I do not like that. What if the user sets
Kyber or bfq with your zone write pipelining ? Does it break ?
From the very light explanation above, it seems to me that what you are trying
to do can be generic in the block layer and leave mq-deadline untouched.
>
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/mq-deadline.c | 68 ++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 58 insertions(+), 10 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 0a46d0f06f72..be6ed3d8fa36 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -319,11 +319,25 @@ static struct request *dd_start_request(struct deadline_data *dd,
> return rq;
> }
>
> +/*
> + * If write pipelining is enabled, only dispatch sequential zoned writes if
> + * rq->mq_hctx == hctx.
> + */
> +static bool dd_dispatch_from_hctx(struct blk_mq_hw_ctx *hctx,
> + struct request *rq)
> +{
> + struct request_queue *q = hctx->queue;
> +
> + return !(q->limits.features & BLK_FEAT_ORDERED_HWQ) ||
> + rq->mq_hctx == hctx || !blk_rq_is_seq_zoned_write(rq);
> +}
> +
> /*
> * deadline_dispatch_requests selects the best request according to
> * read/write expire, fifo_batch, etc and with a start time <= @latest_start.
> */
> static struct request *__dd_dispatch_request(struct deadline_data *dd,
> + struct blk_mq_hw_ctx *hctx,
> struct dd_per_prio *per_prio,
> unsigned long latest_start)
> {
> @@ -336,7 +350,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
> * batches are currently reads XOR writes
> */
> rq = deadline_next_request(dd, per_prio, dd->last_dir);
> - if (rq && dd->batching < dd->fifo_batch) {
> + if (rq && dd->batching < dd->fifo_batch &&
> + dd_dispatch_from_hctx(hctx, rq)) {
> /* we have a next request and are still entitled to batch */
> data_dir = rq_data_dir(rq);
> goto dispatch_request;
> @@ -396,7 +411,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
> rq = next_rq;
> }
>
> - if (!rq)
> + if (!rq || !dd_dispatch_from_hctx(hctx, rq))
> return NULL;
>
> dd->last_dir = data_dir;
> @@ -418,8 +433,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
> * Check whether there are any requests with priority other than DD_RT_PRIO
> * that were inserted more than prio_aging_expire jiffies ago.
> */
> -static struct request *dd_dispatch_prio_aged_requests(struct deadline_data *dd,
> - unsigned long now)
> +static struct request *
> +dd_dispatch_prio_aged_requests(struct deadline_data *dd,
> + struct blk_mq_hw_ctx *hctx, unsigned long now)
> {
> struct request *rq;
> enum dd_prio prio;
> @@ -433,7 +449,7 @@ static struct request *dd_dispatch_prio_aged_requests(struct deadline_data *dd,
> return NULL;
>
> for (prio = DD_BE_PRIO; prio <= DD_PRIO_MAX; prio++) {
> - rq = __dd_dispatch_request(dd, &dd->per_prio[prio],
> + rq = __dd_dispatch_request(dd, hctx, &dd->per_prio[prio],
> now - dd->prio_aging_expire);
> if (rq)
> return rq;
> @@ -466,7 +482,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> goto unlock;
> }
>
> - rq = dd_dispatch_prio_aged_requests(dd, now);
> + rq = dd_dispatch_prio_aged_requests(dd, hctx, now);
> if (rq)
> goto unlock;
>
> @@ -475,7 +491,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> * requests if any higher priority requests are pending.
> */
> for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> - rq = __dd_dispatch_request(dd, &dd->per_prio[prio], now);
> + rq = __dd_dispatch_request(dd, hctx, &dd->per_prio[prio], now);
> if (rq || dd_queued(dd, prio))
> break;
> }
> @@ -575,6 +591,8 @@ static int dd_init_sched(struct request_queue *q, struct elevator_queue *eq)
> /* We dispatch from request queue wide instead of hw queue */
> blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q);
>
> + set_bit(ELEVATOR_FLAG_SUPPORTS_ZONED_WRITE_PIPELINING, &eq->flags);
> +
> q->elevator = eq;
> dd_depth_updated(q);
> return 0;
> @@ -731,10 +749,40 @@ static void dd_finish_request(struct request *rq)
> atomic_inc(&per_prio->stats.completed);
> }
>
> -static bool dd_has_work_for_prio(struct dd_per_prio *per_prio)
> +/* May be called from interrupt context. */
> +static bool dd_has_write_work(struct deadline_data *dd,
> + struct blk_mq_hw_ctx *hctx,
> + struct list_head *list)
> +{
> + struct request_queue *q = hctx->queue;
> + unsigned long flags;
> + struct request *rq;
> + bool has_work = false;
> +
> + if (list_empty_careful(list))
> + return false;
> +
> + if (!(q->limits.features & BLK_FEAT_ORDERED_HWQ))
> + return true;
> +
> + spin_lock_irqsave(&dd->lock, flags);
> + list_for_each_entry(rq, list, queuelist) {
> + if (rq->mq_hctx == hctx) {
> + has_work = true;
> + break;
> + }
> + }
> + spin_unlock_irqrestore(&dd->lock, flags);
> +
> + return has_work;
> +}
> +
> +static bool dd_has_work_for_prio(struct deadline_data *dd,
> + struct blk_mq_hw_ctx *hctx,
> + struct dd_per_prio *per_prio)
> {
> return !list_empty_careful(&per_prio->fifo_list[DD_READ]) ||
> - !list_empty_careful(&per_prio->fifo_list[DD_WRITE]);
> + dd_has_write_work(dd, hctx, &per_prio->fifo_list[DD_WRITE]);
> }
>
> static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
> @@ -746,7 +794,7 @@ static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
> return true;
>
> for (prio = 0; prio <= DD_PRIO_MAX; prio++)
> - if (dd_has_work_for_prio(&dd->per_prio[prio]))
> + if (dd_has_work_for_prio(dd, hctx, &dd->per_prio[prio]))
> return true;
>
> return false;
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v25 08/20] blk-zoned: Fix a typo in a source code comment
2025-10-14 21:54 ` [PATCH v25 08/20] blk-zoned: Fix a typo in a source code comment Bart Van Assche
@ 2025-10-15 7:32 ` Damien Le Moal
2025-10-15 16:33 ` Bart Van Assche
0 siblings, 1 reply; 33+ messages in thread
From: Damien Le Moal @ 2025-10-15 7:32 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 2025/10/15 6:54, Bart Van Assche wrote:
> Remove a superfluous parenthesis that was introduced by commit fa8555630b32
> ("blk-zoned: Improve the queue reference count strategy documentation").
>
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
This can be sent independently of this series.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v25 07/20] block/mq-deadline: Enable zoned write pipelining
2025-10-15 7:31 ` Damien Le Moal
@ 2025-10-15 16:32 ` Bart Van Assche
2025-10-16 20:50 ` Bart Van Assche
1 sibling, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-15 16:32 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 10/15/25 12:31 AM, Damien Le Moal wrote:
> On 2025/10/15 6:54, Bart Van Assche wrote:
>> The hwq selected by blk_mq_run_hw_queues() for single-queue I/O schedulers
>> depends on the CPU core that function has been called from. This may lead
>> to concurrent dispatching of I/O requests on different CPU cores and hence
>> may cause I/O reordering. Prevent as follows that zoned writes are
>> reordered:
>> - Set the ELEVATOR_FLAG_SUPPORTS_ZONED_WRITE_PIPELINING flag. This disables
>> the single hwq optimization in the block layer core.
>> - Modify dd_has_work() such that it only reports that any work is pending
>> for zoned writes if the zoned writes have been submitted to the hwq that
>> has been passed as argument to dd_has_work().
>> - Modify dd_dispatch_request() such that it only dispatches zoned writes
>> if the hwq argument passed to this function matches the hwq of the
>> pending zoned writes.
>
> One of the goals of zone write plugging was to remove the dependence on IO
> schedulers to control the ordering of write commands to zoned block devices.
> Such changes are going backward and I do not like that. What if the user sets
> Kyber or bfq with your zone write pipelining ? Does it break ?
Hi Damien,
As you know the Kyber I/O scheduler dispatch callback only returns requests
for the hardware queue that has been passed as an argument to the dispatch
callback. So it's probably safe to set the
ELEVATOR_FLAG_SUPPORTS_ZONED_WRITE_PIPELINING for the Kyber I/O scheduler.
However, that hasn't been done yet.
Since neither Kyber nor BFQ set the flag
ELEVATOR_FLAG_SUPPORTS_ZONED_WRITE_PIPELINING, zoned write pipelining will
be left disabled for these I/O schedulers. See also the following code from
a previous patch:
+bool blk_pipeline_zwr(struct request_queue *q)
+{
+ return q->limits.features & BLK_FEAT_ORDERED_HWQ &&
+ (!q->elevator ||
+ test_bit(ELEVATOR_FLAG_SUPPORTS_ZONED_WRITE_PIPELINING,
+ &q->elevator->flags));
+}
> From the very light explanation above, it seems to me that what you are trying
> to do can be generic in the block layer and leave mq-deadline untouched.
I don't think that the race described above can be solved in the block layer.
In more detail, the race condition that I observed several times while running
tests with mq-deadline and zoned write pipelining enabled and without the
mq-deadline patches from this series is as follows:
CPU core a                                  CPU core b
------------------------------------------  ------------------------------------------
blk_mq_run_hw_queue(hctx c)                 blk_mq_run_hw_queue(hctx c)
blk_mq_sched_dispatch_requests(hctx c)      blk_mq_sched_dispatch_requests(hctx c)
blk_mq_do_dispatch_sched(hctx c)            blk_mq_do_dispatch_sched(hctx c)
dd_dispatch_request(hctx c) -> rq e
                                            dd_dispatch_request(hctx c) -> rq f
blk_mq_dispatch_rq_list(hctx c)             blk_mq_dispatch_rq_list(hctx c)
                                            q->mq_ops->queue_rq(hctx c, rq f)
q->mq_ops->queue_rq(hctx c, rq e)
If requests (e) and (f) refer to the same zone, with write pipelining enabled,
the above sequence causes request reordering. If both requests are regular
writes (not write appends), this will trigger an UNALIGNED WRITE COMMAND error.
I think this can only be solved by modifying an I/O scheduler such that its
dispatch callback only dispatches requests for the hctx that has been passed as
an argument to the dispatch function.
Please note that for my use cases I'm fine with not using any I/O scheduler at
all and hence that I'm fine with dropping the mq-deadline patches from this
series. Is this perhaps what you prefer?
Thanks,
Bart.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v25 08/20] blk-zoned: Fix a typo in a source code comment
2025-10-15 7:32 ` Damien Le Moal
@ 2025-10-15 16:33 ` Bart Van Assche
0 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-15 16:33 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 10/15/25 12:32 AM, Damien Le Moal wrote:
> On 2025/10/15 6:54, Bart Van Assche wrote:
>> Remove a superfluous parenthesis that was introduced by commit fa8555630b32
>> ("blk-zoned: Improve the queue reference count strategy documentation").
>>
>> Cc: Damien Le Moal <dlemoal@kernel.org>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>
> This can be sent independently of this series.
Agreed, but I think that Jens has made it clear more than once that he doesn't
like comment-only patches unless they are part of a larger series that also
makes functional changes.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v25 05/20] blk-mq: Run all hwqs for sq scheds if write pipelining is enabled
2025-10-15 7:25 ` Damien Le Moal
@ 2025-10-15 16:35 ` Bart Van Assche
0 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-15 16:35 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 10/15/25 12:25 AM, Damien Le Moal wrote:
> On 2025/10/15 6:54, Bart Van Assche wrote:
>> @@ -2412,6 +2411,11 @@ static struct blk_mq_hw_ctx *blk_mq_get_sq_hctx(struct request_queue *q)
>> if (!blk_queue_sq_sched(q))
>> return NULL;
>>
>> + if (blk_queue_is_zoned(q) && blk_pipeline_zwr(q) &&
>> + test_bit(ELEVATOR_FLAG_SUPPORTS_ZONED_WRITE_PIPELINING,
>> + &q->elevator->flags))
>
> The above test_bit() is already done in blk_pipeline_zwr().
Thanks for the feedback. I will drop this test_bit() call.
Bart.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v25 07/20] block/mq-deadline: Enable zoned write pipelining
2025-10-15 7:31 ` Damien Le Moal
2025-10-15 16:32 ` Bart Van Assche
@ 2025-10-16 20:50 ` Bart Van Assche
2025-10-18 5:31 ` Damien Le Moal
1 sibling, 1 reply; 33+ messages in thread
From: Bart Van Assche @ 2025-10-16 20:50 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 10/15/25 12:31 AM, Damien Le Moal wrote:
> it seems to me that what you are trying to do can be generic in the
> block layer and leave mq-deadline untouched.
Hi Damien,
After having given this some further thought, I think that write
pipelining can be enabled if an I/O scheduler is active by serializing
blk_mq_run_dispatch_ops() calls, e.g. with a mutex. For mq-deadline and
BFQ a single mutex per request queue should be used. For Kyber one mutex
per hwq should be sufficient. With this approach it may be necessary to
use different hardware queues for reads and writes to prevent read
performance from being affected negatively. Inserting different types of
requests into different hardware queues is already supported - see e.g.
blk_mq_map_queue(). Please let me know if you want me to look further
into this approach.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v25 07/20] block/mq-deadline: Enable zoned write pipelining
2025-10-16 20:50 ` Bart Van Assche
@ 2025-10-18 5:31 ` Damien Le Moal
2025-10-20 18:28 ` Bart Van Assche
0 siblings, 1 reply; 33+ messages in thread
From: Damien Le Moal @ 2025-10-18 5:31 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 10/17/25 05:50, Bart Van Assche wrote:
> On 10/15/25 12:31 AM, Damien Le Moal wrote:
>> it seems to me that what you are trying to do can be generic in the
>> block layer and leave mq-deadline untouched.
> Hi Damien,
>
> After having given this some further thought, I think that write
> pipelining can be enabled if an I/O scheduler is active by serializing
> blk_mq_run_dispatch_ops() calls, e.g. with a mutex. For mq-deadline and
> BFQ a single mutex per request queue should be used. For Kyber one mutex
> per hwq should be sufficient. With this approach it may be necessary to
> use different hardware queues for reads and writes to prevent read
> performance from being affected negatively. Inserting different types of
> requests into different hardware queues is already supported - see e.g.
> blk_mq_map_queue(). Please let me know if you want me to look further
> into this approach.
As mentioned before, I really do not like the idea of having to spread zone
write ordering all over the place again. Zone write plugging isolated that,
removing all dependencies on other block layer features like the IO scheduler.
Having to modify these again is not making progress, but going backward.
Maybe we need to rethink this, restarting from your main use case and why
performance is not good. I think that said main use case is f2fs. So what
happens with write throughput with it ? Why doesn't merging of small writes in
the zone write plugs improve performance ? Wouldn't small modifications to f2fs
zone write path improve things ?
If the answers to all of the above are "no/does not work", what about a different
approach: zone write plugging v2 with a single thread per CPU that does the
pipelining without forcing changes to other layers or changing the API all over the
block layer ? And what about unplugging a zone write plug from the device driver
once a request is issued ? etc.
I find what you did to be too invasive. I am still trying to think of a
better/simpler approach to solve your problem. But I do not have zoned UFS
hardware to test anything, so I can only think about solutions. Unless you have
a neat way to recreate the problem without Zoned UFS devices ?
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v25 07/20] block/mq-deadline: Enable zoned write pipelining
2025-10-18 5:31 ` Damien Le Moal
@ 2025-10-20 18:28 ` Bart Van Assche
2025-10-21 21:01 ` Damien Le Moal
2025-10-22 7:07 ` Christoph Hellwig
0 siblings, 2 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-20 18:28 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
[-- Attachment #1: Type: text/plain, Size: 1722 bytes --]
On 10/17/25 10:31 PM, Damien Le Moal wrote:
> Maybe we need to rethink this, restarting from your main use case and why
> performance is not good. I think that said main use case is f2fs. So what
> happens with write throughput with it ? Why doesn't merging of small writes in
> the zone write plugs improve performance ? Wouldn't small modifications to f2fs
> zone write path improve things ?
F2FS typically generates large writes if the I/O bandwidth is high (100
MiB or more). Write pipelining improves write performance even for large
writes but not by a spectacular percentage. Write pipelining only
results in a drastic performance improvement if the write size is kept
small (e.g. 4 KiB).
> If the answers to all of the above are "no/does not work", what about a different
> approach: zone write plugging v2 with a single thread per CPU that does the
> pipelining without forcing changes to other layers or changing the API all over the
> block layer ?
The block layer changes that I'm proposing are small, easy to maintain
and not invasive. Using a mutex only when pipelining writes, as I
proposed in a previous email, is a solution that will yield better
performance than delegating work to another thread. Obtaining an
uncontended mutex takes less than a microsecond. Delegating work to
another thread introduces a delay of 10 to 100 microseconds.
> Unless you have a neat way to recreate the problem without Zoned UFS devices ?
This patch series adds support in both the scsi_debug and null_blk
drivers for write pipelining. If the mq-deadline patches from this
series are reverted then the attached shell script sporadically reports
a write error on my test setup for the mq-deadline test cases.
Thanks,
Bart.
[-- Attachment #2: test-pipelining-zoned-writes --]
[-- Type: text/plain, Size: 6237 bytes --]
#!/bin/bash
set -eu
run_cmd() {
if [ -z "$android" ]; then
eval "$*"
else
adb shell "$*"
fi
}
tracing_active() {
[ "$(run_cmd "cat /sys/kernel/tracing/tracing_on")" = 1 ]
}
start_tracing() {
rm -f "$1"
cmd="(if [ ! -e /sys/kernel/tracing/trace ]; then mount -t tracefs none /sys/kernel/tracing; fi &&
cd /sys/kernel/tracing &&
if lsof -t /sys/kernel/tracing/trace_pipe | xargs -r kill; then :; fi &&
echo 0 > tracing_on &&
echo nop > current_tracer &&
echo > trace &&
echo 0 > events/enable &&
echo 1 > events/block/enable &&
echo 0 > events/block/block_dirty_buffer/enable &&
echo 0 > events/block/block_touch_buffer/enable &&
if [ -e events/nullb ]; then echo 1 > events/nullb/enable; fi &&
echo 1 > tracing_on &&
cat trace_pipe)"
run_cmd "$cmd" >"$1" &
tracing_pid=$!
while ! tracing_active; do
sleep .1
done
}
end_tracing() {
sleep 5
if [ -n "$tracing_pid" ]; then kill "$tracing_pid"; fi
run_cmd "cd /sys/kernel/tracing &&
if lsof -t /sys/kernel/tracing/trace_pipe | xargs -r kill; then :; fi &&
echo 0 >/sys/kernel/tracing/tracing_on"
}
android=
fastest_cpucore=
null_blk=
tracing=
while [ $# -gt 0 ] && [ "${1#-}" != "$1" ]; do
case "$1" in
-a)
android=true; shift;;
-n)
null_blk=true; shift;;
-t)
tracing=true; shift;;
*)
# This script does not define a usage() helper, so report the supported options inline.
echo "Usage: $0 [-a] [-n] [-t]" >&2; exit 1;;
esac
done
if [ -n "${android}" ]; then
adb root 1>&2
adb shell "grep -q '^[^[:blank:]]* /sys/kernel/debug' /proc/mounts || mount -t debugfs none /sys/kernel/debug"
adb push ~/software/fio/fio /tmp >&/dev/null
adb push ~/software/util-linux/blkzone /tmp >&/dev/null
fastest_cpucore=$(adb shell 'grep -aH . /sys/devices/system/cpu/cpu[0-9]*/cpufreq/cpuinfo_max_freq 2>/dev/null' |
sed 's/:/ /' |
sort -rnk2 |
head -n1 |
sed -e 's|/sys/devices/system/cpu/cpu||;s|/cpufreq.*||')
if [ -z "$fastest_cpucore" ]; then
fastest_cpucore=$(($(adb shell nproc) - 1))
fi
[ -n "$fastest_cpucore" ]
fi
for mode in "none 0" "none 1" "mq-deadline 0" "mq-deadline 1"; do
for d in /sys/kernel/config/nullb/*; do
if [ -d "$d" ] && rmdir "$d"; then :; fi
done
read -r iosched preserves_write_order <<<"$mode"
if [ -z "$android" ]; then
if [ -z "$null_blk" ]; then
while ! modprobe -r scsi_debug >&/dev/null; do
sleep .1
done
params=(
ndelay=100000 # 100 us
host_max_queue=64
preserves_write_order="${preserves_write_order}"
dev_size_mb=1024 # 1 GiB
submit_queues="$(nproc)"
zone_size_mb=1 # 1 MiB
zone_nr_conv=0
zbc=2
)
modprobe scsi_debug "${params[@]}"
udevadm settle
dev=/dev/$(cd /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block && echo *)
basename=$(basename "${dev}")
else
if modprobe -r null_blk; then :; fi
modprobe null_blk nr_devices=0
(
cd /sys/kernel/config/nullb
mkdir nullb0
cd nullb0
params=(
completion_nsec=100000 # 100 us
hw_queue_depth=64
irqmode=2 # NULL_IRQ_TIMER
max_sectors=$((4096/512))
memory_backed=1
preserves_write_order="${preserves_write_order}"
size=1 # 1 GiB
submit_queues="$(nproc)"
zone_size=1 # 1 MiB
zoned=1
power=1
)
for p in "${params[@]}"; do
if ! echo "${p//*=}" > "${p//=*}"; then
echo "$p"
exit 1
fi
done
)
basename=nullb0
dev=/dev/${basename}
udevadm settle
fi
[ -b "${dev}" ]
if [ "${preserves_write_order}" = 1 ]; then
pzw=/sys/class/block/${basename}/queue/pipeline_zoned_writes
if [ -e "$pzw" ]; then
[ "$(<"/sys/class/block/${basename}/queue/pipeline_zoned_writes")" = 1 ]
fi
fi
else
# Retrieve the device name assigned to the zoned logical unit.
basename=$(adb shell grep -lvw 0 /sys/class/block/sd*/queue/chunk_sectors 2>/dev/null |
sed 's|/sys/class/block/||g;s|/queue/chunk_sectors||g')
# Disable block layer request merging.
dev="/dev/block/${basename}"
fi
run_cmd "echo 4096 > /sys/class/block/${basename}/queue/max_sectors_kb"
# 0: disable I/O statistics
run_cmd "echo 0 > /sys/class/block/${basename}/queue/iostats"
# 2: do not attempt any merges
run_cmd "echo 2 > /sys/class/block/${basename}/queue/nomerges"
# 2: complete on the requesting CPU
run_cmd "echo 2 > /sys/class/block/${basename}/queue/rq_affinity"
for iopattern in write randwrite; do
params=(
--name=measure-iops
--filename="${dev}"
--direct=1
--ioscheduler="${iosched}"
--gtod_reduce=1
--runtime=30
--rw="${iopattern}"
--thread=1
--time_based=1
--zonemode=zbd
)
if [ -n "$fastest_cpucore" ]; then
params+=(--cpus_allowed="${fastest_cpucore}")
fi
if [ "$preserves_write_order" = 1 ]; then
params+=(
--ioengine=libaio
)
queue_depths=(1 64)
else
params+=(
--ioengine=pvsync2
)
queue_depths=(1)
fi
for qd in "${queue_depths[@]}"; do
echo "==== iosched=$iosched preserves_write_order=$preserves_write_order iopattern=${iopattern} qd=${qd}"
params_with_qd=("${params[@]}")
params_with_qd+=("--iodepth=${qd}")
if [ "$qd" != 1 ]; then
params_with_qd+=(--iodepth_batch=$(((qd + 3) / 4)))
fi
echo "fio ${params_with_qd[*]}"
# Reset all zones so that the maximum number of open zones is not
# exceeded. Next, measure IOPS.
if [ -z "$android" ]; then
blkzone reset "${dev}"
else
adb shell /tmp/blkzone reset "${dev}"
fi
if [ -n "${tracing}" ]; then
trace="/tmp/block-trace-${iosched}-${preserves_write_order}-${iopattern}-${qd}.txt"
start_tracing "${trace}"
fi
set +e
if [ -z "$android" ]; then
fio "${params_with_qd[@]}"
else
adb shell /tmp/fio "${params_with_qd[@]}"
fi
ret=$?
set -e
if [ -n "${tracing}" ]; then
end_tracing
xz -9 "${trace}" &
fi
[ "$ret" = 0 ] || break
if run_cmd "cd /sys/kernel/debug/block/${basename} && if grep -q ' ref ' zone_wplugs; then grep -avH ' ref 1 ' zone_wplugs; else false; fi"; then
echo
echo "Detected one or more reference count leaks!"
break
fi
done
done
done
wait
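# Example invocations (run as root; log file names are only examples):
#   ./test-pipelining-zoned-writes -n -t 2>&1 | tee null_blk.log   # null_blk devices + ftrace
#   ./test-pipelining-zoned-writes -a 2>&1 | tee zoned-ufs.log     # Android zoned UFS device via adb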
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v25 07/20] block/mq-deadline: Enable zoned write pipelining
2025-10-20 18:28 ` Bart Van Assche
@ 2025-10-21 21:01 ` Damien Le Moal
2025-10-22 18:26 ` Bart Van Assche
2025-10-22 7:07 ` Christoph Hellwig
1 sibling, 1 reply; 33+ messages in thread
From: Damien Le Moal @ 2025-10-21 21:01 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 10/21/25 03:28, Bart Van Assche wrote:
> On 10/17/25 10:31 PM, Damien Le Moal wrote:
>> Maybe we need to rethink this, restarting from your main use case and why
>> performance is not good. I think that said main use case is f2fs. So what
>> happens with write throughput with it ? Why doesn't merging of small writes in
>> the zone write plugs improve performance ? Wouldn't small modifications to the
>> f2fs zone write path improve things ?
>
> F2FS typically generates large writes if the I/O bandwidth is high (100
> MiB or more). Write pipelining improves write performance even for large
> writes but not by a spectacular percentage. Write pipelining only
> results in a drastic performance improvement if the write size is kept
> small (e.g. 4 KiB).
But you are talking about a high queue depth 4K write pattern, right ? And if yes,
BIO merging in the zone write plugs should generate much larger commands anyway.
Have you verified that this is working as expected ?
>
>> If the answers to all of the above are "no/does not work", what about a different
>> approach: zone write plugging v2 with a single thread per CPU that does the
>> pipelining without forcing changes to other layers or changing the API all over
>> the block layer ?
>
> The block layer changes that I'm proposing are small, easy to maintain
> and not invasive. Using a mutex only when pipelining writes, as I
> proposed in a previous email, is a solution that will yield better
> performance than delegating work to another thread. Obtaining an
> uncontended mutex takes less than a microsecond. Delegating work to
> another thread introduces a delay of 10 to 100 microseconds.
>
>> Unless you have a neat way to recreate the problem without Zoned UFS devices ?
>
> This patch series adds support in both the scsi_debug and null_blk
> drivers for write pipelining. If the mq-deadline patches from this
> series are reverted then the attached shell script sporadically reports
> a write error on my test setup for the mq-deadline test cases.
I am not trying to check the correctness of your patches. I was wondering if
there is an easy way to recreate the performance difference you are seeing with
a zoned UFS device, e.g. the 4 K write case you are describing above.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v25 07/20] block/mq-deadline: Enable zoned write pipelining
2025-10-20 18:28 ` Bart Van Assche
2025-10-21 21:01 ` Damien Le Moal
@ 2025-10-22 7:07 ` Christoph Hellwig
1 sibling, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2025-10-22 7:07 UTC (permalink / raw)
To: Bart Van Assche
Cc: Damien Le Moal, Jens Axboe, linux-block, linux-scsi,
Christoph Hellwig
On Mon, Oct 20, 2025 at 11:28:36AM -0700, Bart Van Assche wrote:
> The block layer changes that I'm proposing are small, easy to maintain
> and not invasive.
No, they are absolutely not.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v25 07/20] block/mq-deadline: Enable zoned write pipelining
2025-10-21 21:01 ` Damien Le Moal
@ 2025-10-22 18:26 ` Bart Van Assche
0 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2025-10-22 18:26 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe; +Cc: linux-block, linux-scsi, Christoph Hellwig
On 10/21/25 2:01 PM, Damien Le Moal wrote:
> On 10/21/25 03:28, Bart Van Assche wrote:
>> On 10/17/25 10:31 PM, Damien Le Moal wrote:
>>> Maybe we need to rethink this, restarting from your main use case and why
>>> performance is not good. I think that said main use case is f2fs. So what
>>> happens with write throughput with it ? Why doesn't merging of small writes in
>>> the zone write plugs improve performance ? Wouldn't small modifications to the
>>> f2fs zone write path improve things ?
>>
>> F2FS typically generates large writes if the I/O bandwidth is high (100
>> MiB or more). Write pipelining improves write performance even for large
>> writes but not by a spectacular percentage. Write pipelining only
>> results in a drastic performance improvement if the write size is kept
>> small (e.g. 4 KiB).
>
> But you are talking about a high queue depth 4K write pattern, right ? And if yes,
> BIO merging in the zone write plugs should generate much larger commands anyway.
> Have you verified that this is working as expected ?
Write pipelining improves performance even if bio merging is enabled,
because with pipelining the Linux kernel doesn't wait for a prior write
to complete before the next write is sent to the storage device.
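One quick way to check how large the merged write commands actually end up
being is to look at /sys/block/<dev>/stat; a minimal sketch, assuming the
nullb0 device created by the test script (field 5 = writes completed,
field 7 = sectors written):

  awk '{ if ($5) printf "avg write size: %.1f KiB\n", $7 * 512 / $5 / 1024 }' \
      /sys/block/nullb0/stat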
>>> If the answers to all of the above are "no/does not work", what about a different
>>> approach: zone write plugging v2 with a single thread per CPU that does the
>>> pipelining without forcing changes to other layers or changing the API all over
>>> the block layer ?
>>
>> The block layer changes that I'm proposing are small, easy to maintain
>> and not invasive. Using a mutex only when pipelining writes, as I
>> proposed in a previous email, is a solution that will yield better
>> performance than delegating work to another thread. Obtaining an
>> uncontended mutex takes less than a microsecond. Delegating work to
>> another thread introduces a delay of 10 to 100 microseconds.
>>
>>> Unless you have a neat way to recreate the problem without Zoned UFS devices ?
>>
>> This patch series adds support in both the scsi_debug and null_blk
>> drivers for write pipelining. If the mq-deadline patches from this
>> series are reverted then the attached shell script sporadically reports
>> a write error on my test setup for the mq-deadline test cases.
>
> I am not trying to check the correctness of your patches. I was wondering if
> there is an easy way to recreate the performance difference you are seeing with
> a zoned UFS device, e.g. the 4 K write case you are describing above.
Yes, there is an easy way to recreate the performance difference. The
shell script attached to my previous email tests multiple combinations
of I/O schedulers, queue depths and write pipelining enabled/disabled.
The script that I shared disables I/O merging. Even if I make the
following changes in that shell script:
--- a/test-pipelining-zoned-writes
+++ b/test-pipelining-zoned-writes
@@ -147,11 +147,11 @@ for mode in "none 0" "none 1" "mq-deadline 0" "mq-deadline 1"; do
# Disable block layer request merging.
dev="/dev/block/${basename}"
fi
- run_cmd "echo 4096 > /sys/class/block/${basename}/queue/max_sectors_kb"
+ #run_cmd "echo 4096 > /sys/class/block/${basename}/queue/max_sectors_kb"
# 0: disable I/O statistics
run_cmd "echo 0 > /sys/class/block/${basename}/queue/iostats"
# 2: do not attempt any merges
- run_cmd "echo 2 > /sys/class/block/${basename}/queue/nomerges"
+ #run_cmd "echo 2 > /sys/class/block/${basename}/queue/nomerges"
# 2: complete on the requesting CPU
run_cmd "echo 2 > /sys/class/block/${basename}/queue/rq_affinity"
for iopattern in write randwrite; do
then I still see a significant performance improvement for the null_blk
driver (command-line option -n):
==== iosched=none preserves_write_order=0 iopattern=write qd=1
write: IOPS=6503, BW=25.4MiB/s (26.6MB/s)(762MiB/30003msec); 762 zone resets
==== iosched=none preserves_write_order=0 iopattern=randwrite qd=1
write: IOPS=6469, BW=25.3MiB/s (26.5MB/s)(758MiB/30003msec); 758 zone resets
==== iosched=none preserves_write_order=1 iopattern=write qd=1
write: IOPS=5566, BW=21.7MiB/s (22.8MB/s)(652MiB/30010msec); 652 zone resets
==== iosched=none preserves_write_order=1 iopattern=write qd=64
write: IOPS=15.3k, BW=59.9MiB/s (62.8MB/s)(1797MiB/30001msec); 1796 zone resets
==== iosched=none preserves_write_order=1 iopattern=randwrite qd=1
write: IOPS=5575, BW=21.8MiB/s (22.8MB/s)(653MiB/30005msec); 653 zone resets
==== iosched=none preserves_write_order=1 iopattern=randwrite qd=64
write: IOPS=15.5k, BW=60.7MiB/s (63.7MB/s)(1821MiB/30001msec); 1821 zone resets
As one can see above, with queue depth 64 and write pipelining enabled,
IOPS are about 2.5x higher than with queue depth 1 and write pipelining
disabled.
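(For scale: with the 100 us completion latency emulated by null_blk in the
test script, a fully serialized write path cannot exceed roughly
1 s / 100 us = 10,000 writes per second, which is consistent with the
queue depth 1 numbers above.)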
Thanks,
Bart.
^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2025-10-22 18:26 UTC | newest]
Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-14 21:54 [PATCH v25 00/20] Improve write performance for zoned UFS devices Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 01/20] block: Support block devices that preserve the order of write requests Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 02/20] blk-mq: Always insert sequential zoned writes into a software queue Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 03/20] blk-mq: Restore the zone write order when requeuing Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 04/20] blk-mq: Move the blk_queue_sq_sched() calls Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 05/20] blk-mq: Run all hwqs for sq scheds if write pipelining is enabled Bart Van Assche
2025-10-15 7:25 ` Damien Le Moal
2025-10-15 16:35 ` Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 06/20] block/mq-deadline: Make locking IRQ-safe Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 07/20] block/mq-deadline: Enable zoned write pipelining Bart Van Assche
2025-10-15 7:31 ` Damien Le Moal
2025-10-15 16:32 ` Bart Van Assche
2025-10-16 20:50 ` Bart Van Assche
2025-10-18 5:31 ` Damien Le Moal
2025-10-20 18:28 ` Bart Van Assche
2025-10-21 21:01 ` Damien Le Moal
2025-10-22 18:26 ` Bart Van Assche
2025-10-22 7:07 ` Christoph Hellwig
2025-10-14 21:54 ` [PATCH v25 08/20] blk-zoned: Fix a typo in a source code comment Bart Van Assche
2025-10-15 7:32 ` Damien Le Moal
2025-10-15 16:33 ` Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 09/20] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 10/20] blk-zoned: Split an if-statement Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 11/20] blk-zoned: Move code from disk_zone_wplug_add_bio() into its caller Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 12/20] blk-zoned: Introduce a loop in blk_zone_wplug_bio_work() Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 13/20] blk-zoned: Document disk_zone_wplug_schedule_bio_work() locking Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 14/20] blk-zoned: Support pipelining of zoned writes Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 15/20] null_blk: Add the preserves_write_order attribute Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 16/20] scsi: core: Retry unaligned zoned writes Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 17/20] scsi: sd: Increase retry count for " Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 18/20] scsi: scsi_debug: Add the preserves_write_order module parameter Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 19/20] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
2025-10-14 21:54 ` [PATCH v25 20/20] ufs: core: Inform the block layer about write ordering Bart Van Assche
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).