* [PATCH v16 00/26] Improve write performance for zoned UFS devices
@ 2024-11-19 0:27 Bart Van Assche
2024-11-19 0:27 ` [PATCH v16 01/26] blk-zoned: Fix a reference count leak Bart Van Assche
From: Bart Van Assche @ 2024-11-19 0:27 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
Hi Damien and Christoph,
This patch series improves small write IOPS by a factor of four (+300%) for
zoned UFS devices on my test setup with a UFSHCI 3.0 controller. Although
you are probably busy because the merge window is open, please take a look
at this patch series when you have the time. This patch series is organized
as follows:
- Bug fixes for existing code at the start of the series.
- The write pipelining support implementation comes after the bug fixes.
Thanks,
Bart.
Changes compared to v15:
- Reworked this patch series on top of the zone write plugging approach.
- Moved support for requeuing requests from the SCSI core into the block
layer core.
- In the UFS driver, instead of disabling write pipelining if
auto-hibernation is enabled, rely on the requeuing mechanism to handle
reordering caused by resuming from auto-hibernation.
Changes compared to v14:
- Removed the drivers/scsi/Kconfig.kunit and drivers/scsi/Makefile.kunit
files. Instead, modified drivers/scsi/Kconfig and added #include "*_test.c"
directives in the appropriate .c files. Removed the EXPORT_SYMBOL()
directives that were added to make the unit tests link.
- Fixed a double free in a unit test.
Changes compared to v13:
- Reworked patch "block: Preserve the order of requeued zoned writes".
- Addressed a performance concern by removing the eh_needs_prepare_resubmit
SCSI driver callback and by introducing the SCSI host template flag
.needs_prepare_resubmit instead.
- Added a patch that adds a 'host' argument to scsi_eh_flush_done_q().
- Made the code in unit tests less repetitive.
Changes compared to v12:
- Added two new patches: "block: Preserve the order of requeued zoned writes"
and "scsi: sd: Add a unit test for sd_cmp_sector()"
- Restricted the number of zoned write retries. To my surprise I had to add
"&& scmd->retries <= scmd->allowed" in the SCSI error handler to limit the
number of retries.
- In patch "scsi: ufs: Inform the block layer about write ordering", only set
ELEVATOR_F_ZBD_SEQ_WRITE for zoned block devices.
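The retry-bounding condition mentioned in the v12 changes above can be illustrated with a small userspace sketch (not the actual kernel code; the struct and function names are illustrative): a command is only requeued for another attempt while its retry count has not exceeded the allowed maximum, mirroring the "&& scmd->retries <= scmd->allowed" check added to the SCSI error handler.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace sketch of bounded retries. With "retries <= allowed", a
 * command with allowed == N is attempted at most N + 1 times in total.
 */
struct fake_scmd {
	int retries;	/* number of retries performed so far */
	int allowed;	/* maximum number of retries */
};

/* Returns true if the command may be retried one more time. */
static bool may_retry(const struct fake_scmd *scmd)
{
	return scmd->retries <= scmd->allowed;
}

static bool handle_unaligned_write(struct fake_scmd *scmd)
{
	if (!may_retry(scmd))
		return false;	/* fail the command to the submitter */
	scmd->retries++;	/* account for this retry */
	return true;		/* requeue the command */
}
```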
Changes compared to v11:
- Fixed a NULL pointer dereference that happened when booting from an ATA
device by adding an scmd->device != NULL check in scsi_needs_preparation().
- Updated Reviewed-by tags.
Changes compared to v10:
- Dropped the UFS MediaTek and HiSilicon patches because they are not correct
and because it is safe to drop them.
- Updated Acked-by / Reviewed-by tags.
Changes compared to v9:
- Introduced an additional scsi_driver callback: .eh_needs_prepare_resubmit().
- Renamed the scsi_debug kernel module parameter 'no_zone_write_lock' into
'preserves_write_order'.
- Fixed an out-of-bounds access in the scsi_call_prepare_resubmit() unit
test.
- Wrapped ufshcd_auto_hibern8_update() calls in UFS host drivers with
WARN_ON_ONCE() such that a kernel stack trace appears in case an error code
is returned.
- Elaborated a comment in the UFSHCI driver.
Changes compared to v8:
- Fixed handling of 'driver_preserves_write_order' and 'use_zone_write_lock'
in blk_stack_limits().
- Added a comment in disk_set_zoned().
- Modified blk_req_needs_zone_write_lock() such that it returns false if
q->limits.use_zone_write_lock is false.
- Modified disk_clear_zone_settings() such that it clears
q->limits.use_zone_write_lock.
- Left out one change from the mq-deadline patch that became superfluous due to
the blk_req_needs_zone_write_lock() change.
- Modified scsi_call_prepare_resubmit() such that it only calls list_sort() if
zoned writes have to be resubmitted for which zone write locking is disabled.
- Added an additional unit test for scsi_call_prepare_resubmit().
- Modified the sorting code in the sd driver such that only those SCSI commands
are sorted for which write locking is disabled.
- Modified sd_zbc.c such that ELEVATOR_F_ZBD_SEQ_WRITE is only set if the
write order is not preserved.
- Included three patches for UFS host drivers that rework code that wrote
directly to the auto-hibernation controller register.
- Modified the UFS driver such that enabling auto-hibernation is not allowed
if a zoned logical unit is present and if the controller operates in legacy
mode.
- Also in the UFS driver, simplified ufshcd_auto_hibern8_update().
Changes compared to v7:
- Split the queue_limits member variable `use_zone_write_lock' into two member
variables: `use_zone_write_lock' (set by disk_set_zoned()) and
`driver_preserves_write_order' (set by the block driver or SCSI LLD). This
should clear up the confusion about the purpose of this variable.
- Moved the code for sorting SCSI commands by LBA from the SCSI error handler
into the SCSI disk (sd) driver as requested by Christoph.
Changes compared to v6:
- Removed QUEUE_FLAG_NO_ZONE_WRITE_LOCK and instead introduced a flag in
the request queue limits data structure.
Changes compared to v5:
- Renamed scsi_cmp_lba() into scsi_cmp_sector().
- Improved several source code comments.
Changes compared to v4:
- Dropped the patch that introduces the REQ_NO_ZONE_WRITE_LOCK flag.
- Dropped the null_blk patch and added two scsi_debug patches instead.
- Dropped the f2fs patch.
- Split the patch for the UFS driver into two patches.
- Modified several patch descriptions and source code comments.
- Renamed dd_use_write_locking() into dd_use_zone_write_locking().
- Moved the list_sort() call from scsi_unjam_host() into scsi_eh_flush_done_q()
such that sorting happens just before reinserting.
- Removed the scsi_cmd_retry_allowed() call from scsi_check_sense() to make
sure that the retry counter is adjusted once per retry instead of twice.
Changes compared to v3:
- Restored the patch that introduces QUEUE_FLAG_NO_ZONE_WRITE_LOCK. That patch
had accidentally been left out from v2.
- In patch "block: Introduce the flag REQ_NO_ZONE_WRITE_LOCK", improved the
patch description and added the function blk_no_zone_write_lock().
- In patch "block/mq-deadline: Only use zone locking if necessary", moved the
blk_queue_is_zoned() call into dd_use_write_locking().
- In patch "fs/f2fs: Disable zone write locking", set REQ_NO_ZONE_WRITE_LOCK
from inside __bio_alloc() instead of in f2fs_submit_write_bio().
Changes compared to v2:
- Renamed the request queue flag for disabling zone write locking.
- Introduced a new request flag for disabling zone write locking.
- Modified the mq-deadline scheduler such that zone write locking is only
disabled if both flags are set.
- Added an F2FS patch that sets the request flag for disabling zone write
locking.
- Only disable zone write locking in the UFS driver if auto-hibernation is
disabled.
Changes compared to v1:
- Left out the patches that are already upstream.
- Switched the approach in patch "scsi: Retry unaligned zoned writes" from
retrying immediately to sending unaligned write commands to the SCSI error
handler.
Bart Van Assche (26):
blk-zoned: Fix a reference count leak
blk-zoned: Split disk_zone_wplugs_work()
blk-zoned: Split queue_zone_wplugs_show()
blk-zoned: Only handle errors after pending zoned writes have
completed
blk-zoned: Fix a deadlock triggered by unaligned writes
blk-zoned: Fix requeuing of zoned writes
block: Support block drivers that preserve the order of write requests
dm-linear: Report to the block layer that the write order is preserved
mq-deadline: Remove a local variable
blk-mq: Clean up blk_mq_requeue_work()
block: Optimize blk_mq_submit_bio() for the cache hit scenario
block: Rework request allocation in blk_mq_submit_bio()
block: Support allocating from a specific software queue
blk-mq: Restore the zoned write order when requeuing
blk-zoned: Document the locking order
blk-zoned: Document locking assumptions
blk-zoned: Uninline functions that are not in the hot path
blk-zoned: Make disk_should_remove_zone_wplug() more robust
blk-zoned: Add an argument to blk_zone_plug_bio()
blk-zoned: Support pipelining of zoned writes
scsi: core: Retry unaligned zoned writes
scsi: sd: Increase retry count for zoned writes
scsi: scsi_debug: Add the preserves_write_order module parameter
scsi: scsi_debug: Support injecting unaligned write errors
scsi: scsi_debug: Skip host/bus reset settle delay
scsi: ufs: Inform the block layer about write ordering
block/bfq-iosched.c | 2 +
block/blk-mq.c | 97 ++++++----
block/blk-mq.h | 3 +
block/blk-settings.c | 2 +
block/blk-zoned.c | 376 +++++++++++++++++++++++++-------------
block/blk.h | 33 +++-
block/kyber-iosched.c | 2 +
block/mq-deadline.c | 10 +-
drivers/md/dm-linear.c | 6 +
drivers/md/dm.c | 2 +-
drivers/scsi/scsi_debug.c | 22 ++-
drivers/scsi/scsi_error.c | 16 ++
drivers/scsi/sd.c | 7 +
drivers/ufs/core/ufshcd.c | 7 +
include/linux/blk-mq.h | 20 +-
include/linux/blkdev.h | 11 +-
16 files changed, 444 insertions(+), 172 deletions(-)
* [PATCH v16 01/26] blk-zoned: Fix a reference count leak
From: Bart Van Assche @ 2024-11-19 0:27 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
Fix a reference count leak in disk_zone_wplug_handle_error().
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-zoned.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 70211751df16..3346b8c53605 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1337,6 +1337,8 @@ static void disk_zone_wplug_handle_error(struct gendisk *disk,
unlock:
spin_unlock_irqrestore(&zwplug->lock, flags);
+
+ disk_put_zone_wplug(zwplug);
}
static void disk_zone_wplugs_work(struct work_struct *work)
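The bug class fixed by the patch above can be sketched in userspace (this is not the kernel code; the names are illustrative): a function that holds a reference on an object must drop it on every return path, including the early "goto unlock" path, which is where the missing disk_put_zone_wplug() call is added.

```c
#include <assert.h>

/* Userspace analogue of a reference-counted zone write plug. */
struct fake_wplug {
	int ref;
};

static void get_wplug(struct fake_wplug *p) { p->ref++; }
static void put_wplug(struct fake_wplug *p) { p->ref--; }

/* Error handler sketch: @fast_path models the "goto unlock" early exit. */
static void handle_error(struct fake_wplug *p, int fast_path)
{
	get_wplug(p);		/* reference held for the duration */
	if (fast_path)
		goto unlock;	/* before the fix, this path leaked a ref */
	/* ... error handling work ... */
unlock:
	put_wplug(p);		/* balance the get on all exit paths */
}
```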
* [PATCH v16 02/26] blk-zoned: Split disk_zone_wplugs_work()
From: Bart Van Assche @ 2024-11-19 0:27 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
Prepare for adding a second disk_zone_process_err_list() call.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-zoned.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 3346b8c53605..e5c8ab1eab9f 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1341,10 +1341,8 @@ static void disk_zone_wplug_handle_error(struct gendisk *disk,
disk_put_zone_wplug(zwplug);
}
-static void disk_zone_wplugs_work(struct work_struct *work)
+static void disk_zone_process_err_list(struct gendisk *disk)
{
- struct gendisk *disk =
- container_of(work, struct gendisk, zone_wplugs_work);
struct blk_zone_wplug *zwplug;
unsigned long flags;
@@ -1365,6 +1363,14 @@ static void disk_zone_wplugs_work(struct work_struct *work)
spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
}
+static void disk_zone_wplugs_work(struct work_struct *work)
+{
+ struct gendisk *disk =
+ container_of(work, struct gendisk, zone_wplugs_work);
+
+ disk_zone_process_err_list(disk);
+}
+
static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
{
return 1U << disk->zone_wplugs_hash_bits;
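The refactoring in the patch above follows a common kernel pattern: the work callback only recovers the containing object via container_of() and delegates to a helper, so the helper can later be called from other contexts as well. A userspace sketch with illustrative names (container_of() re-derived from offsetof() for portability):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace re-implementation of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct work_struct {
	int pending;
};

struct fake_disk {
	int errors_processed;
	struct work_struct err_work;	/* embedded work item */
};

/* Helper that does the real work; callable with a disk directly. */
static void process_err_list(struct fake_disk *disk)
{
	disk->errors_processed++;
}

/* Thin work callback: recover the containing disk, then delegate. */
static void err_work_fn(struct work_struct *work)
{
	struct fake_disk *disk =
		container_of(work, struct fake_disk, err_work);

	process_err_list(disk);
}
```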
* [PATCH v16 03/26] blk-zoned: Split queue_zone_wplugs_show()
From: Bart Van Assche @ 2024-11-19 0:27 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
Reduce the indentation level of the code in queue_zone_wplugs_show() by
moving the body of the loop in that function into a new function.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-zoned.c | 44 ++++++++++++++++++++++++--------------------
1 file changed, 24 insertions(+), 20 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index e5c8ab1eab9f..7e6e6ebb6235 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1838,37 +1838,41 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
#ifdef CONFIG_BLK_DEBUG_FS
+static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
+ struct seq_file *m)
+{
+ unsigned int zwp_wp_offset, zwp_flags;
+ unsigned int zwp_zone_no, zwp_ref;
+ unsigned int zwp_bio_list_size;
+ unsigned long flags;
+
+ spin_lock_irqsave(&zwplug->lock, flags);
+ zwp_zone_no = zwplug->zone_no;
+ zwp_flags = zwplug->flags;
+ zwp_ref = refcount_read(&zwplug->ref);
+ zwp_wp_offset = zwplug->wp_offset;
+ zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
+ spin_unlock_irqrestore(&zwplug->lock, flags);
+
+ seq_printf(m, "%u 0x%x %u %u %u\n", zwp_zone_no, zwp_flags, zwp_ref,
+ zwp_wp_offset, zwp_bio_list_size);
+}
int queue_zone_wplugs_show(void *data, struct seq_file *m)
{
struct request_queue *q = data;
struct gendisk *disk = q->disk;
struct blk_zone_wplug *zwplug;
- unsigned int zwp_wp_offset, zwp_flags;
- unsigned int zwp_zone_no, zwp_ref;
- unsigned int zwp_bio_list_size, i;
- unsigned long flags;
+ unsigned int i;
if (!disk->zone_wplugs_hash)
return 0;
rcu_read_lock();
- for (i = 0; i < disk_zone_wplugs_hash_size(disk); i++) {
- hlist_for_each_entry_rcu(zwplug,
- &disk->zone_wplugs_hash[i], node) {
- spin_lock_irqsave(&zwplug->lock, flags);
- zwp_zone_no = zwplug->zone_no;
- zwp_flags = zwplug->flags;
- zwp_ref = refcount_read(&zwplug->ref);
- zwp_wp_offset = zwplug->wp_offset;
- zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
- spin_unlock_irqrestore(&zwplug->lock, flags);
-
- seq_printf(m, "%u 0x%x %u %u %u\n",
- zwp_zone_no, zwp_flags, zwp_ref,
- zwp_wp_offset, zwp_bio_list_size);
- }
- }
+ for (i = 0; i < disk_zone_wplugs_hash_size(disk); i++)
+ hlist_for_each_entry_rcu(zwplug, &disk->zone_wplugs_hash[i],
+ node)
+ queue_zone_wplug_show(zwplug, m);
rcu_read_unlock();
return 0;
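The extracted queue_zone_wplug_show() above also demonstrates a snapshot-under-lock pattern: copy the fields of the shared structure into locals while holding the lock, then format them after dropping it, so no printing happens inside the critical section. A userspace sketch with illustrative names, using a pthread mutex in place of the spinlock:

```c
#include <pthread.h>
#include <stdio.h>

/* Userspace analogue of a zone write plug with a subset of its fields. */
struct fake_wplug {
	pthread_mutex_t lock;
	unsigned int zone_no;
	unsigned int wp_offset;
};

static int show_wplug(struct fake_wplug *p, char *buf, size_t len)
{
	unsigned int zone_no, wp_offset;

	/* Take a consistent snapshot under the lock. */
	pthread_mutex_lock(&p->lock);
	zone_no = p->zone_no;
	wp_offset = p->wp_offset;
	pthread_mutex_unlock(&p->lock);

	/* Format outside the critical section. */
	return snprintf(buf, len, "%u %u", zone_no, wp_offset);
}
```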
* [PATCH v16 04/26] blk-zoned: Only handle errors after pending zoned writes have completed
From: Bart Van Assche @ 2024-11-19 0:27 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
Instead of handling write errors immediately, only handle these after all
pending zoned write requests have completed or have been requeued. This
patch prepares for changing the zone write pointer tracking approach.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 9 +++
block/blk-zoned.c | 154 +++++++++++++++++++++++++++++++++++++++--
block/blk.h | 29 ++++++++
include/linux/blk-mq.h | 18 +++++
4 files changed, 203 insertions(+), 7 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 270cfd9fc6b0..a45077e187b5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -793,6 +793,9 @@ void blk_mq_free_request(struct request *rq)
rq_qos_done(q, rq);
WRITE_ONCE(rq->state, MQ_RQ_IDLE);
+
+ blk_zone_free_request(rq);
+
if (req_ref_put_and_test(rq))
__blk_mq_free_request(rq);
}
@@ -1189,6 +1192,9 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
continue;
WRITE_ONCE(rq->state, MQ_RQ_IDLE);
+
+ blk_zone_free_request(rq);
+
if (!req_ref_put_and_test(rq))
continue;
@@ -1507,6 +1513,7 @@ static void __blk_mq_requeue_request(struct request *rq)
if (blk_mq_request_started(rq)) {
WRITE_ONCE(rq->state, MQ_RQ_IDLE);
rq->rq_flags &= ~RQF_TIMED_OUT;
+ blk_zone_requeue_work(q);
}
}
@@ -1542,6 +1549,8 @@ static void blk_mq_requeue_work(struct work_struct *work)
list_splice_init(&q->flush_list, &flush_list);
spin_unlock_irq(&q->requeue_lock);
+ blk_zone_requeue_work(q);
+
while (!list_empty(&rq_list)) {
rq = list_entry(rq_list.next, struct request, queuelist);
/*
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 7e6e6ebb6235..b570b773e65f 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -608,6 +608,8 @@ static inline void disk_zone_wplug_set_error(struct gendisk *disk,
if (zwplug->flags & BLK_ZONE_WPLUG_ERROR)
return;
+ zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+ zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
/*
* At this point, we already have a reference on the zone write plug.
* However, since we are going to add the plug to the disk zone write
@@ -616,7 +618,6 @@ static inline void disk_zone_wplug_set_error(struct gendisk *disk,
* handled, or in disk_zone_wplug_clear_error() if the zone is reset or
* finished.
*/
- zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
refcount_inc(&zwplug->ref);
spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
@@ -642,6 +643,7 @@ static inline void disk_zone_wplug_clear_error(struct gendisk *disk,
spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
if (!list_empty(&zwplug->link)) {
list_del_init(&zwplug->link);
+ zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
disk_put_zone_wplug(zwplug);
}
@@ -746,6 +748,70 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
return false;
}
+struct all_zwr_inserted_data {
+ struct blk_zone_wplug *zwplug;
+ bool res;
+};
+
+/*
+ * Changes @data->res to %false if and only if @rq is a zoned write for the
+ * given zone and if it is owned by the block driver.
+ *
+ * @rq members may change while this function is in progress. Hence, use
+ * READ_ONCE() to read @rq members.
+ */
+static bool blk_zwr_inserted(struct request *rq, void *data)
+{
+ struct all_zwr_inserted_data *d = data;
+ struct blk_zone_wplug *zwplug = d->zwplug;
+ struct request_queue *q = zwplug->disk->queue;
+ struct bio *bio = READ_ONCE(rq->bio);
+
+ if (rq->q == q && READ_ONCE(rq->state) != MQ_RQ_IDLE &&
+ blk_rq_is_seq_zoned_write(rq) && bio &&
+ bio_zone_no(bio) == zwplug->zone_no) {
+ d->res = false;
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * Report whether or not all zoned writes for a zone have been inserted into a
+ * software queue, elevator queue or hardware queue.
+ */
+static bool blk_zone_all_zwr_inserted(struct blk_zone_wplug *zwplug)
+{
+ struct gendisk *disk = zwplug->disk;
+ struct request_queue *q = disk->queue;
+ struct all_zwr_inserted_data d = { .zwplug = zwplug, .res = true };
+ struct blk_mq_hw_ctx *hctx;
+ long unsigned int i;
+ struct request *rq;
+
+ scoped_guard(spinlock_irqsave, &q->requeue_lock) {
+ list_for_each_entry(rq, &q->requeue_list, queuelist)
+ if (blk_rq_is_seq_zoned_write(rq) &&
+ bio_zone_no(rq->bio) == zwplug->zone_no)
+ return false;
+ list_for_each_entry(rq, &q->flush_list, queuelist)
+ if (blk_rq_is_seq_zoned_write(rq) &&
+ bio_zone_no(rq->bio) == zwplug->zone_no)
+ return false;
+ }
+
+ queue_for_each_hw_ctx(q, hctx, i) {
+ struct blk_mq_tags *tags = hctx->sched_tags ?: hctx->tags;
+
+ blk_mq_all_tag_iter(tags, blk_zwr_inserted, &d);
+ if (!d.res || blk_mq_is_shared_tags(q->tag_set->flags))
+ break;
+ }
+
+ return d.res;
+}
+
static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
struct bio *bio, unsigned int nr_segs)
{
@@ -1096,6 +1162,29 @@ static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
}
+/*
+ * Change the zone state to "error" if a request is requeued to postpone
+ * processing of requeued requests until all pending requests have either
+ * completed or have been requeued.
+ */
+void blk_zone_write_plug_requeue_request(struct request *rq)
+{
+ struct gendisk *disk = rq->q->disk;
+ struct blk_zone_wplug *zwplug;
+
+ if (!disk->zone_wplugs_hash_bits || !blk_rq_is_seq_zoned_write(rq))
+ return;
+
+ zwplug = disk_get_zone_wplug(disk, blk_rq_pos(rq));
+ if (WARN_ON_ONCE(!zwplug))
+ return;
+
+ scoped_guard(spinlock_irqsave, &zwplug->lock)
+ disk_zone_wplug_set_error(disk, zwplug);
+
+ disk_put_zone_wplug(zwplug);
+}
+
static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
@@ -1202,6 +1291,33 @@ void blk_zone_write_plug_finish_request(struct request *req)
disk_put_zone_wplug(zwplug);
}
+/*
+ * Schedule zone_plugs_work if a zone is in the error state and if no requests
+ * are in flight. Called from blk_mq_free_request().
+ */
+void blk_zone_write_plug_free_request(struct request *rq)
+{
+ struct gendisk *disk = rq->q->disk;
+ struct blk_zone_wplug *zwplug;
+
+ /*
+ * Do nothing if this function is called before the zone information
+ * has been initialized.
+ */
+ if (!disk->zone_wplugs_hash_bits)
+ return;
+
+ zwplug = disk_get_zone_wplug(disk, blk_rq_pos(rq));
+
+ if (!zwplug)
+ return;
+
+ if (zwplug->flags & BLK_ZONE_WPLUG_ERROR)
+ kblockd_schedule_work(&disk->zone_wplugs_work);
+
+ disk_put_zone_wplug(zwplug);
+}
+
static void blk_zone_wplug_bio_work(struct work_struct *work)
{
struct blk_zone_wplug *zwplug =
@@ -1343,14 +1459,15 @@ static void disk_zone_wplug_handle_error(struct gendisk *disk,
static void disk_zone_process_err_list(struct gendisk *disk)
{
- struct blk_zone_wplug *zwplug;
+ struct blk_zone_wplug *zwplug, *next;
unsigned long flags;
spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
- while (!list_empty(&disk->zone_wplugs_err_list)) {
- zwplug = list_first_entry(&disk->zone_wplugs_err_list,
- struct blk_zone_wplug, link);
+ list_for_each_entry_safe(zwplug, next, &disk->zone_wplugs_err_list,
+ link) {
+ if (!blk_zone_all_zwr_inserted(zwplug))
+ continue;
list_del_init(&zwplug->link);
spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
@@ -1361,6 +1478,12 @@ static void disk_zone_process_err_list(struct gendisk *disk)
}
spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
+
+ /*
+ * If one or more zones have been skipped, this work will be requeued
+ * when a request is requeued (blk_zone_requeue_work()) or freed
+ * (blk_zone_write_plug_free_request()).
+ */
}
static void disk_zone_wplugs_work(struct work_struct *work)
@@ -1371,6 +1494,20 @@ static void disk_zone_wplugs_work(struct work_struct *work)
disk_zone_process_err_list(disk);
}
+/* May be called from interrupt context and hence must not sleep. */
+void blk_zone_requeue_work(struct request_queue *q)
+{
+ struct gendisk *disk = q->disk;
+
+ if (!disk)
+ return;
+
+ if (in_interrupt())
+ kblockd_schedule_work(&disk->zone_wplugs_work);
+ else
+ disk_zone_process_err_list(disk);
+}
+
static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
{
return 1U << disk->zone_wplugs_hash_bits;
@@ -1854,8 +1991,11 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
spin_unlock_irqrestore(&zwplug->lock, flags);
- seq_printf(m, "%u 0x%x %u %u %u\n", zwp_zone_no, zwp_flags, zwp_ref,
- zwp_wp_offset, zwp_bio_list_size);
+ bool all_zwr_inserted = blk_zone_all_zwr_inserted(zwplug);
+
+ seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u bio_list_size %u all_zwr_inserted %d\n",
+ zwp_zone_no, zwp_flags, zwp_ref, zwp_wp_offset,
+ zwp_bio_list_size, all_zwr_inserted);
}
int queue_zone_wplugs_show(void *data, struct seq_file *m)
diff --git a/block/blk.h b/block/blk.h
index 2c26abf505b8..be945db6298d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -473,6 +473,18 @@ static inline void blk_zone_update_request_bio(struct request *rq,
if (req_op(rq) == REQ_OP_ZONE_APPEND || bio_zone_write_plugging(bio))
bio->bi_iter.bi_sector = rq->__sector;
}
+
+void blk_zone_write_plug_requeue_request(struct request *rq);
+static inline void blk_zone_requeue_request(struct request *rq)
+{
+ if (!blk_rq_is_seq_zoned_write(rq))
+ return;
+
+ blk_zone_write_plug_requeue_request(rq);
+}
+
+void blk_zone_requeue_work(struct request_queue *q);
+
void blk_zone_write_plug_bio_endio(struct bio *bio);
static inline void blk_zone_bio_endio(struct bio *bio)
{
@@ -490,6 +502,14 @@ static inline void blk_zone_finish_request(struct request *rq)
if (rq->rq_flags & RQF_ZONE_WRITE_PLUGGING)
blk_zone_write_plug_finish_request(rq);
}
+
+void blk_zone_write_plug_free_request(struct request *rq);
+static inline void blk_zone_free_request(struct request *rq)
+{
+ if (blk_queue_is_zoned(rq->q))
+ blk_zone_write_plug_free_request(rq);
+}
+
int blkdev_report_zones_ioctl(struct block_device *bdev, unsigned int cmd,
unsigned long arg);
int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
@@ -515,12 +535,21 @@ static inline void blk_zone_update_request_bio(struct request *rq,
struct bio *bio)
{
}
+static inline void blk_zone_requeue_request(struct request *rq)
+{
+}
+static inline void blk_zone_requeue_work(struct request_queue *q)
+{
+}
static inline void blk_zone_bio_endio(struct bio *bio)
{
}
static inline void blk_zone_finish_request(struct request *rq)
{
}
+static inline void blk_zone_free_request(struct request *rq)
+{
+}
static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
unsigned int cmd, unsigned long arg)
{
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index c596e0e4cb75..ac05974f08f9 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1169,4 +1169,22 @@ static inline int blk_rq_map_sg(struct request_queue *q, struct request *rq,
}
void blk_dump_rq_flags(struct request *, char *);
+#ifdef CONFIG_BLK_DEV_ZONED
+static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
+{
+ switch (req_op(rq)) {
+ case REQ_OP_WRITE:
+ case REQ_OP_WRITE_ZEROES:
+ return bdev_zone_is_seq(rq->q->disk->part0, blk_rq_pos(rq));
+ default:
+ return false;
+ }
+}
+#else /* CONFIG_BLK_DEV_ZONED */
+static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
+{
+ return false;
+}
+#endif /* CONFIG_BLK_DEV_ZONED */
+
#endif /* BLK_MQ_H */
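The policy introduced by the patch above can be sketched in userspace (illustrative names; the kernel code walks the requeue list, flush list and tag sets instead of a plain array): error handling for a zone is postponed until no zoned write targeting that zone is still pending.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The set of zones that still have a pending zoned write. */
struct pending_writes {
	unsigned int zone_no[16];
	size_t count;
};

/* Report whether no pending write targets @zone_no. */
static bool all_writes_settled(const struct pending_writes *pw,
			       unsigned int zone_no)
{
	for (size_t i = 0; i < pw->count; i++)
		if (pw->zone_no[i] == zone_no)
			return false;
	return true;
}

/* Handle the error only once the zone has quiesced. */
static bool try_handle_zone_error(const struct pending_writes *pw,
				  unsigned int zone_no)
{
	if (!all_writes_settled(pw, zone_no))
		return false;	/* skipped; retried once a write settles */
	/* ... recover the write pointer and restart plugged BIOs ... */
	return true;
}
```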
* [PATCH v16 05/26] blk-zoned: Fix a deadlock triggered by unaligned writes
From: Bart Van Assche @ 2024-11-19 0:27 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
If the queue is filled with unaligned writes then the following
deadlock occurs:
Call Trace:
<TASK>
__schedule+0x8cc/0x2190
schedule+0xdd/0x2b0
blk_queue_enter+0x2ce/0x4f0
blk_mq_alloc_request+0x303/0x810
scsi_execute_cmd+0x3f4/0x7b0
sd_zbc_do_report_zones+0x19e/0x4c0
sd_zbc_report_zones+0x304/0x920
disk_zone_wplug_handle_error+0x237/0x920
disk_zone_wplugs_work+0x17e/0x430
process_one_work+0xdd0/0x1490
worker_thread+0x5eb/0x1010
kthread+0x2e5/0x3b0
ret_from_fork+0x3a/0x80
</TASK>
Fix this deadlock by removing the disk->fops->report_zones() call and by
deriving the write pointer information from successfully completed zoned
writes.
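The replacement write pointer tracking can be sketched in userspace (illustrative names, matching the wp_offset_compl field added by the patch): on successful completion of a zoned write, remember the highest end offset seen; on error recovery, that value becomes the new write pointer offset instead of the result of a report_zones() call.

```c
#include <assert.h>

/* Userspace analogue of a zone write plug's write pointer state. */
struct fake_wplug {
	unsigned int wp_offset;		/* submission-side write pointer */
	unsigned int wp_offset_compl;	/* highest completed end offset */
};

/* Called when a zoned write for [start, start + nr_sectors) completes OK. */
static void write_completed(struct fake_wplug *p, unsigned int start,
			    unsigned int nr_sectors)
{
	unsigned int end = start + nr_sectors;

	if (end > p->wp_offset_compl)
		p->wp_offset_compl = end;
}

/* On error recovery, fall back to what is known to have completed. */
static void recover_wp(struct fake_wplug *p)
{
	p->wp_offset = p->wp_offset_compl;
}
```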
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-zoned.c | 69 +++++++++++++++++++----------------------------
block/blk.h | 4 ++-
2 files changed, 31 insertions(+), 42 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index b570b773e65f..284820b29285 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -56,6 +56,8 @@ static const char *const zone_cond_name[] = {
* @zone_no: The number of the zone the plug is managing.
* @wp_offset: The zone write pointer location relative to the start of the zone
* as a number of 512B sectors.
+ * @wp_offset_compl: End offset for completed zoned writes as a number of 512
+ * byte sectors.
* @bio_list: The list of BIOs that are currently plugged.
* @bio_work: Work struct to handle issuing of plugged BIOs
* @rcu_head: RCU head to free zone write plugs with an RCU grace period.
@@ -69,6 +71,7 @@ struct blk_zone_wplug {
unsigned int flags;
unsigned int zone_no;
unsigned int wp_offset;
+ unsigned int wp_offset_compl;
struct bio_list bio_list;
struct work_struct bio_work;
struct rcu_head rcu_head;
@@ -531,6 +534,7 @@ static struct blk_zone_wplug *disk_get_and_lock_zone_wplug(struct gendisk *disk,
zwplug->flags = 0;
zwplug->zone_no = zno;
zwplug->wp_offset = sector & (disk->queue->limits.chunk_sectors - 1);
+ zwplug->wp_offset_compl = 0;
bio_list_init(&zwplug->bio_list);
INIT_WORK(&zwplug->bio_work, blk_zone_wplug_bio_work);
zwplug->disk = disk;
@@ -1226,6 +1230,7 @@ void blk_zone_write_plug_bio_endio(struct bio *bio)
struct gendisk *disk = bio->bi_bdev->bd_disk;
struct blk_zone_wplug *zwplug =
disk_get_zone_wplug(disk, bio->bi_iter.bi_sector);
+ unsigned int end_sector;
unsigned long flags;
if (WARN_ON_ONCE(!zwplug))
@@ -1243,11 +1248,24 @@ void blk_zone_write_plug_bio_endio(struct bio *bio)
bio->bi_opf |= REQ_OP_ZONE_APPEND;
}
- /*
- * If the BIO failed, mark the plug as having an error to trigger
- * recovery.
- */
- if (bio->bi_status != BLK_STS_OK) {
+ if (bio->bi_status == BLK_STS_OK) {
+ switch (bio_op(bio)) {
+ case REQ_OP_WRITE:
+ case REQ_OP_ZONE_APPEND:
+ case REQ_OP_WRITE_ZEROES:
+ end_sector = bdev_offset_from_zone_start(disk->part0,
+ bio->bi_iter.bi_sector + bio_sectors(bio));
+ if (end_sector > zwplug->wp_offset_compl)
+ zwplug->wp_offset_compl = end_sector;
+ break;
+ default:
+ break;
+ }
+ } else {
+ /*
+ * If the BIO failed, mark the plug as having an error to
+ * trigger recovery.
+ */
spin_lock_irqsave(&zwplug->lock, flags);
disk_zone_wplug_set_error(disk, zwplug);
spin_unlock_irqrestore(&zwplug->lock, flags);
@@ -1388,30 +1406,10 @@ static unsigned int blk_zone_wp_offset(struct blk_zone *zone)
}
}
-static int blk_zone_wplug_report_zone_cb(struct blk_zone *zone,
- unsigned int idx, void *data)
-{
- struct blk_zone *zonep = data;
-
- *zonep = *zone;
- return 0;
-}
-
static void disk_zone_wplug_handle_error(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
- sector_t zone_start_sector =
- bdev_zone_sectors(disk->part0) * zwplug->zone_no;
- unsigned int noio_flag;
- struct blk_zone zone;
unsigned long flags;
- int ret;
-
- /* Get the current zone information from the device. */
- noio_flag = memalloc_noio_save();
- ret = disk->fops->report_zones(disk, zone_start_sector, 1,
- blk_zone_wplug_report_zone_cb, &zone);
- memalloc_noio_restore(noio_flag);
spin_lock_irqsave(&zwplug->lock, flags);
@@ -1425,19 +1423,8 @@ static void disk_zone_wplug_handle_error(struct gendisk *disk,
zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
- if (ret != 1) {
- /*
- * We failed to get the zone information, meaning that something
- * is likely really wrong with the device. Abort all remaining
- * plugged BIOs as otherwise we could endup waiting forever on
- * plugged BIOs to complete if there is a queue freeze on-going.
- */
- disk_zone_wplug_abort(zwplug);
- goto unplug;
- }
-
/* Update the zone write pointer offset. */
- zwplug->wp_offset = blk_zone_wp_offset(&zone);
+ zwplug->wp_offset = zwplug->wp_offset_compl;
disk_zone_wplug_abort_unaligned(disk, zwplug);
/* Restart BIO submission if we still have any BIO left. */
@@ -1446,7 +1433,6 @@ static void disk_zone_wplug_handle_error(struct gendisk *disk,
goto unlock;
}
-unplug:
zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
if (disk_should_remove_zone_wplug(disk, zwplug))
disk_remove_zone_wplug(disk, zwplug);
@@ -1978,7 +1964,7 @@ EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
struct seq_file *m)
{
- unsigned int zwp_wp_offset, zwp_flags;
+ unsigned int zwp_wp_offset, zwp_wp_offset_compl, zwp_flags;
unsigned int zwp_zone_no, zwp_ref;
unsigned int zwp_bio_list_size;
unsigned long flags;
@@ -1988,14 +1974,15 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
zwp_flags = zwplug->flags;
zwp_ref = refcount_read(&zwplug->ref);
zwp_wp_offset = zwplug->wp_offset;
+ zwp_wp_offset_compl = zwplug->wp_offset_compl;
zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
spin_unlock_irqrestore(&zwplug->lock, flags);
bool all_zwr_inserted = blk_zone_all_zwr_inserted(zwplug);
- seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u bio_list_size %u all_zwr_inserted %d\n",
+ seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u wp_offset_compl %u bio_list_size %u all_zwr_inserted %d\n",
zwp_zone_no, zwp_flags, zwp_ref, zwp_wp_offset,
- zwp_bio_list_size, all_zwr_inserted);
+ zwp_wp_offset_compl, zwp_bio_list_size, all_zwr_inserted);
}
int queue_zone_wplugs_show(void *data, struct seq_file *m)
diff --git a/block/blk.h b/block/blk.h
index be945db6298d..88a6e258eafe 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -470,8 +470,10 @@ static inline void blk_zone_update_request_bio(struct request *rq,
* the original BIO sector so that blk_zone_write_plug_bio_endio() can
* lookup the zone write plug.
*/
- if (req_op(rq) == REQ_OP_ZONE_APPEND || bio_zone_write_plugging(bio))
+ if (req_op(rq) == REQ_OP_ZONE_APPEND || bio_zone_write_plugging(bio)) {
bio->bi_iter.bi_sector = rq->__sector;
+ bio->bi_iter.bi_size = rq->__data_len;
+ }
}
void blk_zone_write_plug_requeue_request(struct request *rq);
^ permalink raw reply related [flat|nested] 73+ messages in thread
* [PATCH v16 06/26] blk-zoned: Fix requeuing of zoned writes
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (4 preceding siblings ...)
2024-11-19 0:27 ` [PATCH v16 05/26] blk-zoned: Fix a deadlock triggered by unaligned writes Bart Van Assche
@ 2024-11-19 0:27 ` Bart Van Assche
2024-11-19 3:00 ` Damien Le Moal
2024-11-19 0:27 ` [PATCH v16 07/26] block: Support block drivers that preserve the order of write requests Bart Van Assche
` (21 subsequent siblings)
27 siblings, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:27 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
Make sure that unaligned writes are sent to the block driver. This
allows the block driver to limit the number of retries of unaligned
writes. Remove disk_zone_wplug_abort_unaligned() because it only examines
the bio plug list, while pending writes may be present either on the bio
plug list or on the request queue requeue list.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-zoned.c | 36 ------------------------------------
1 file changed, 36 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 284820b29285..ded38fa9ae3d 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -577,33 +577,6 @@ static void disk_zone_wplug_abort(struct blk_zone_wplug *zwplug)
blk_zone_wplug_bio_io_error(zwplug, bio);
}
-/*
- * Abort (fail) all plugged BIOs of a zone write plug that are not aligned
- * with the assumed write pointer location of the zone when the BIO will
- * be unplugged.
- */
-static void disk_zone_wplug_abort_unaligned(struct gendisk *disk,
- struct blk_zone_wplug *zwplug)
-{
- unsigned int wp_offset = zwplug->wp_offset;
- struct bio_list bl = BIO_EMPTY_LIST;
- struct bio *bio;
-
- while ((bio = bio_list_pop(&zwplug->bio_list))) {
- if (disk_zone_is_full(disk, zwplug->zone_no, wp_offset) ||
- (bio_op(bio) != REQ_OP_ZONE_APPEND &&
- bio_offset_from_zone_start(bio) != wp_offset)) {
- blk_zone_wplug_bio_io_error(zwplug, bio);
- continue;
- }
-
- wp_offset += bio_sectors(bio);
- bio_list_add(&bl, bio);
- }
-
- bio_list_merge(&zwplug->bio_list, &bl);
-}
-
static inline void disk_zone_wplug_set_error(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
@@ -982,14 +955,6 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
* so that we can restore its operation code on completion.
*/
bio_set_flag(bio, BIO_EMULATES_ZONE_APPEND);
- } else {
- /*
- * Check for non-sequential writes early because we avoid a
- * whole lot of error handling trouble if we don't send it off
- * to the driver.
- */
- if (bio_offset_from_zone_start(bio) != zwplug->wp_offset)
- goto err;
}
/* Advance the zone write pointer offset. */
@@ -1425,7 +1390,6 @@ static void disk_zone_wplug_handle_error(struct gendisk *disk,
/* Update the zone write pointer offset. */
zwplug->wp_offset = zwplug->wp_offset_compl;
- disk_zone_wplug_abort_unaligned(disk, zwplug);
/* Restart BIO submission if we still have any BIO left. */
if (!bio_list_empty(&zwplug->bio_list)) {
* [PATCH v16 07/26] block: Support block drivers that preserve the order of write requests
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (5 preceding siblings ...)
2024-11-19 0:27 ` [PATCH v16 06/26] blk-zoned: Fix requeuing of zoned writes Bart Van Assche
@ 2024-11-19 0:27 ` Bart Van Assche
2024-11-19 7:37 ` Damien Le Moal
2024-11-19 0:27 ` [PATCH v16 08/26] dm-linear: Report to the block layer that the write order is preserved Bart Van Assche
` (20 subsequent siblings)
27 siblings, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:27 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche, Hannes Reinecke, Nitesh Shetty,
Ming Lei
Many but not all storage controllers require serialization of zoned
writes. Introduce a new request queue limit, 'driver_preserves_write_order',
that allows block drivers to indicate that the order of write commands is
preserved per hardware queue and hence that serialization of writes per
zone is not required if all pending writes are submitted to the same
hardware queue.
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Nitesh Shetty <nj.shetty@samsung.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-settings.c | 2 ++
include/linux/blkdev.h | 5 +++++
2 files changed, 7 insertions(+)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index f1d4dfdc37a7..329d8b65a8d7 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -633,6 +633,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
}
t->max_secure_erase_sectors = min_not_zero(t->max_secure_erase_sectors,
b->max_secure_erase_sectors);
+ t->driver_preserves_write_order = t->driver_preserves_write_order &&
+ b->driver_preserves_write_order;
t->zone_write_granularity = max(t->zone_write_granularity,
b->zone_write_granularity);
if (!(t->features & BLK_FEAT_ZONED)) {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a1fd0ddce5cf..72be33d02d1f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -397,6 +397,11 @@ struct queue_limits {
unsigned int max_open_zones;
unsigned int max_active_zones;
+ /*
+ * Whether or not the block driver preserves the order of write
+ * requests. Set by the block driver.
+ */
+ bool driver_preserves_write_order;
/*
* Drivers that set dma_alignment to less than 511 must be prepared to
* [PATCH v16 08/26] dm-linear: Report to the block layer that the write order is preserved
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (6 preceding siblings ...)
2024-11-19 0:27 ` [PATCH v16 07/26] block: Support block drivers that preserve the order of write requests Bart Van Assche
@ 2024-11-19 0:27 ` Bart Van Assche
2024-11-19 0:27 ` [PATCH v16 09/26] mq-deadline: Remove a local variable Bart Van Assche
` (19 subsequent siblings)
27 siblings, 0 replies; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:27 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
Enable write pipelining if dm-linear is stacked on top of a driver that
supports write pipelining.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/md/dm-linear.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 49fb0f684193..967fbf856abc 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -148,6 +148,11 @@ static int linear_report_zones(struct dm_target *ti,
#define linear_report_zones NULL
#endif
+static void linear_io_hints(struct dm_target *ti, struct queue_limits *limits)
+{
+ limits->driver_preserves_write_order = true;
+}
+
static int linear_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
@@ -209,6 +214,7 @@ static struct target_type linear_target = {
.map = linear_map,
.status = linear_status,
.prepare_ioctl = linear_prepare_ioctl,
+ .io_hints = linear_io_hints,
.iterate_devices = linear_iterate_devices,
.direct_access = linear_dax_direct_access,
.dax_zero_page_range = linear_dax_zero_page_range,
* [PATCH v16 09/26] mq-deadline: Remove a local variable
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (7 preceding siblings ...)
2024-11-19 0:27 ` [PATCH v16 08/26] dm-linear: Report to the block layer that the write order is preserved Bart Van Assche
@ 2024-11-19 0:27 ` Bart Van Assche
2024-11-19 7:38 ` Damien Le Moal
2024-11-19 0:27 ` [PATCH v16 10/26] blk-mq: Clean up blk_mq_requeue_work() Bart Van Assche
` (18 subsequent siblings)
27 siblings, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:27 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
Since commit fde02699c242 ("block: mq-deadline: Remove support for zone
write locking"), the local variable 'insert_before' is assigned once and
is used once. Hence remove this local variable.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/mq-deadline.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index acdc28756d9d..2edf84b1bc2a 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -699,8 +699,6 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
list_add(&rq->queuelist, &per_prio->dispatch);
rq->fifo_time = jiffies;
} else {
- struct list_head *insert_before;
-
deadline_add_rq_rb(per_prio, rq);
if (rq_mergeable(rq)) {
@@ -713,8 +711,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
* set expire time and add to fifo list
*/
rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
- insert_before = &per_prio->fifo_list[data_dir];
- list_add_tail(&rq->queuelist, insert_before);
+ list_add_tail(&rq->queuelist, &per_prio->fifo_list[data_dir]);
}
}
* [PATCH v16 10/26] blk-mq: Clean up blk_mq_requeue_work()
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (8 preceding siblings ...)
2024-11-19 0:27 ` [PATCH v16 09/26] mq-deadline: Remove a local variable Bart Van Assche
@ 2024-11-19 0:27 ` Bart Van Assche
2024-11-19 7:39 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 11/26] block: Optimize blk_mq_submit_bio() for the cache hit scenario Bart Van Assche
` (17 subsequent siblings)
27 siblings, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:27 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
Move a statement that occurs in both branches of an if-statement in front
of the if-statement. Fix a typo in a source code comment. No functionality
has been changed.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a45077e187b5..ece567b1b6be 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1553,19 +1553,17 @@ static void blk_mq_requeue_work(struct work_struct *work)
while (!list_empty(&rq_list)) {
rq = list_entry(rq_list.next, struct request, queuelist);
+ list_del_init(&rq->queuelist);
/*
- * If RQF_DONTPREP ist set, the request has been started by the
+ * If RQF_DONTPREP is set, the request has been started by the
* driver already and might have driver-specific data allocated
* already. Insert it into the hctx dispatch list to avoid
* block layer merges for the request.
*/
- if (rq->rq_flags & RQF_DONTPREP) {
- list_del_init(&rq->queuelist);
+ if (rq->rq_flags & RQF_DONTPREP)
blk_mq_request_bypass_insert(rq, 0);
- } else {
- list_del_init(&rq->queuelist);
+ else
blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
- }
}
while (!list_empty(&flush_list)) {
* [PATCH v16 11/26] block: Optimize blk_mq_submit_bio() for the cache hit scenario
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (9 preceding siblings ...)
2024-11-19 0:27 ` [PATCH v16 10/26] blk-mq: Clean up blk_mq_requeue_work() Bart Van Assche
@ 2024-11-19 0:28 ` Bart Van Assche
2024-11-19 7:40 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 12/26] block: Rework request allocation in blk_mq_submit_bio() Bart Van Assche
` (16 subsequent siblings)
27 siblings, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:28 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
Help the CPU branch predictor in case of a cache hit by handling the cache
hit scenario first.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ece567b1b6be..56a6b5bef39f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3123,12 +3123,12 @@ void blk_mq_submit_bio(struct bio *bio)
goto queue_exit;
new_request:
- if (!rq) {
+ if (rq) {
+ blk_mq_use_cached_rq(rq, plug, bio);
+ } else {
rq = blk_mq_get_new_requests(q, plug, bio, nr_segs);
if (unlikely(!rq))
goto queue_exit;
- } else {
- blk_mq_use_cached_rq(rq, plug, bio);
}
trace_block_getrq(bio);
* [PATCH v16 12/26] block: Rework request allocation in blk_mq_submit_bio()
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (10 preceding siblings ...)
2024-11-19 0:28 ` [PATCH v16 11/26] block: Optimize blk_mq_submit_bio() for the cache hit scenario Bart Van Assche
@ 2024-11-19 0:28 ` Bart Van Assche
2024-11-19 7:44 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 13/26] block: Support allocating from a specific software queue Bart Van Assche
` (15 subsequent siblings)
27 siblings, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:28 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
Prepare for allocating a request from a specific hctx by making
blk_mq_submit_bio() allocate the request later in this function.
The performance impact of this patch on the hot path is small: if a
request has been cached, one percpu_ref_get(&q->q_usage_counter) call and
one percpu_ref_put(&q->q_usage_counter) call are added.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 36 +++++++++++++-----------------------
1 file changed, 13 insertions(+), 23 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 56a6b5bef39f..5ff124fbc17c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3071,11 +3071,6 @@ void blk_mq_submit_bio(struct bio *bio)
struct request *rq;
blk_status_t ret;
- /*
- * If the plug has a cached request for this queue, try to use it.
- */
- rq = blk_mq_peek_cached_request(plug, q, bio->bi_opf);
-
/*
* A BIO that was released from a zone write plug has already been
* through the preparation in this function, already holds a reference
@@ -3084,26 +3079,19 @@ void blk_mq_submit_bio(struct bio *bio)
*/
if (bio_zone_write_plugging(bio)) {
nr_segs = bio->__bi_nr_segments;
- if (rq)
- blk_queue_exit(q);
goto new_request;
}
bio = blk_queue_bounce(bio, q);
/*
- * The cached request already holds a q_usage_counter reference and we
- * don't have to acquire a new one if we use it.
+ * Increment the queue usage counter before performing any checks
+ * based on queue properties that may change if the queue is frozen,
+ * e.g. the logical block size.
*/
- if (!rq) {
- if (unlikely(bio_queue_enter(bio)))
- return;
- }
+ if (unlikely(bio_queue_enter(bio)))
+ return;
- /*
- * Device reconfiguration may change logical block size, so alignment
- * check has to be done with queue usage counter held
- */
if (unlikely(bio_unaligned(bio, q))) {
bio_io_error(bio);
goto queue_exit;
@@ -3123,8 +3111,15 @@ void blk_mq_submit_bio(struct bio *bio)
goto queue_exit;
new_request:
+ rq = blk_mq_peek_cached_request(plug, q, bio->bi_opf);
if (rq) {
blk_mq_use_cached_rq(rq, plug, bio);
+ /*
+ * Here we hold two references: one because of the
+ * bio_queue_enter() call and a second one as the result of
+ * request allocation. Drop one.
+ */
+ blk_queue_exit(q);
} else {
rq = blk_mq_get_new_requests(q, plug, bio, nr_segs);
if (unlikely(!rq))
@@ -3167,12 +3162,7 @@ void blk_mq_submit_bio(struct bio *bio)
return;
queue_exit:
- /*
- * Don't drop the queue reference if we were trying to use a cached
- * request and thus didn't acquire one.
- */
- if (!rq)
- blk_queue_exit(q);
+ blk_queue_exit(q);
}
#ifdef CONFIG_BLK_MQ_STACKING
* [PATCH v16 13/26] block: Support allocating from a specific software queue
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (11 preceding siblings ...)
2024-11-19 0:28 ` [PATCH v16 12/26] block: Rework request allocation in blk_mq_submit_bio() Bart Van Assche
@ 2024-11-19 0:28 ` Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 14/26] blk-mq: Restore the zoned write order when requeuing Bart Van Assche
` (14 subsequent siblings)
27 siblings, 0 replies; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:28 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
A later patch will preserve the order of pipelined zoned writes by
submitting all zoned writes for a zone to the same software queue as the
previously submitted zoned writes for that zone. Hence support allocating
a request from a specific software queue.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 18 ++++++++++++++----
block/blk-mq.h | 1 +
2 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5ff124fbc17c..f134d5e1c4a1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -493,6 +493,7 @@ __blk_mq_alloc_requests_batch(struct blk_mq_alloc_data *data)
static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
{
struct request_queue *q = data->q;
+ int swq_cpu = data->swq_cpu;
u64 alloc_time_ns = 0;
struct request *rq;
unsigned int tag;
@@ -505,7 +506,8 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
data->flags |= BLK_MQ_REQ_NOWAIT;
retry:
- data->ctx = blk_mq_get_ctx(q);
+ data->swq_cpu = swq_cpu >= 0 ? swq_cpu : raw_smp_processor_id();
+ data->ctx = __blk_mq_get_ctx(q, data->swq_cpu);
data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
if (q->elevator) {
@@ -585,6 +587,7 @@ static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
.cmd_flags = opf,
.nr_tags = plug->nr_ios,
.cached_rqs = &plug->cached_rqs,
+ .swq_cpu = -1,
};
struct request *rq;
@@ -646,6 +649,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
.flags = flags,
.cmd_flags = opf,
.nr_tags = 1,
+ .swq_cpu = -1,
};
int ret;
@@ -2969,12 +2973,14 @@ static bool blk_mq_attempt_bio_merge(struct request_queue *q,
}
static struct request *blk_mq_get_new_requests(struct request_queue *q,
+ int swq_cpu,
struct blk_plug *plug,
struct bio *bio,
unsigned int nsegs)
{
struct blk_mq_alloc_data data = {
.q = q,
+ .swq_cpu = swq_cpu,
.nr_tags = 1,
.cmd_flags = bio->bi_opf,
};
@@ -3001,7 +3007,8 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
* Check if there is a suitable cached request and return it.
*/
static struct request *blk_mq_peek_cached_request(struct blk_plug *plug,
- struct request_queue *q, blk_opf_t opf)
+ struct request_queue *q,
+ int swq_cpu, blk_opf_t opf)
{
enum hctx_type type = blk_mq_get_hctx_type(opf);
struct request *rq;
@@ -3011,6 +3018,8 @@ static struct request *blk_mq_peek_cached_request(struct blk_plug *plug,
rq = rq_list_peek(&plug->cached_rqs);
if (!rq || rq->q != q)
return NULL;
+ if (swq_cpu >= 0 && rq->mq_ctx->cpu != swq_cpu)
+ return NULL;
if (type != rq->mq_hctx->type &&
(type != HCTX_TYPE_READ || rq->mq_hctx->type != HCTX_TYPE_DEFAULT))
return NULL;
@@ -3069,6 +3078,7 @@ void blk_mq_submit_bio(struct bio *bio)
struct blk_mq_hw_ctx *hctx;
unsigned int nr_segs;
struct request *rq;
+ int swq_cpu = -1;
blk_status_t ret;
/*
@@ -3111,7 +3121,7 @@ void blk_mq_submit_bio(struct bio *bio)
goto queue_exit;
new_request:
- rq = blk_mq_peek_cached_request(plug, q, bio->bi_opf);
+ rq = blk_mq_peek_cached_request(plug, q, swq_cpu, bio->bi_opf);
if (rq) {
blk_mq_use_cached_rq(rq, plug, bio);
/*
@@ -3121,7 +3131,7 @@ void blk_mq_submit_bio(struct bio *bio)
*/
blk_queue_exit(q);
} else {
- rq = blk_mq_get_new_requests(q, plug, bio, nr_segs);
+ rq = blk_mq_get_new_requests(q, swq_cpu, plug, bio, nr_segs);
if (unlikely(!rq))
goto queue_exit;
}
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 89a20fffa4b1..309db553aba6 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -158,6 +158,7 @@ struct blk_mq_alloc_data {
struct rq_list *cached_rqs;
/* input & output parameter */
+ int swq_cpu;
struct blk_mq_ctx *ctx;
struct blk_mq_hw_ctx *hctx;
};
* [PATCH v16 14/26] blk-mq: Restore the zoned write order when requeuing
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (12 preceding siblings ...)
2024-11-19 0:28 ` [PATCH v16 13/26] block: Support allocating from a specific software queue Bart Van Assche
@ 2024-11-19 0:28 ` Bart Van Assche
2024-11-19 7:52 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 15/26] blk-zoned: Document the locking order Bart Van Assche
` (13 subsequent siblings)
27 siblings, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:28 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
Zoned writes may be requeued, e.g. if a block driver returns
BLK_STS_RESOURCE. Requests may be requeued in a different order than they
were submitted in. Restore the submission order when requeuing requests.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/bfq-iosched.c | 2 ++
block/blk-mq.c | 20 +++++++++++++++++++-
block/blk-mq.h | 2 ++
block/kyber-iosched.c | 2 ++
block/mq-deadline.c | 7 ++++++-
include/linux/blk-mq.h | 2 +-
6 files changed, 32 insertions(+), 3 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0747d9d0e48c..13bedbf03bd2 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6265,6 +6265,8 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
if (flags & BLK_MQ_INSERT_AT_HEAD) {
list_add(&rq->queuelist, &bfqd->dispatch);
+ } else if (flags & BLK_MQ_INSERT_ORDERED) {
+ blk_mq_insert_ordered(rq, &bfqd->dispatch);
} else if (!bfqq) {
list_add_tail(&rq->queuelist, &bfqd->dispatch);
} else {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f134d5e1c4a1..1302ccbf2a7d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1564,7 +1564,9 @@ static void blk_mq_requeue_work(struct work_struct *work)
* already. Insert it into the hctx dispatch list to avoid
* block layer merges for the request.
*/
- if (rq->rq_flags & RQF_DONTPREP)
+ if (blk_rq_is_seq_zoned_write(rq))
+ blk_mq_insert_request(rq, BLK_MQ_INSERT_ORDERED);
+ else if (rq->rq_flags & RQF_DONTPREP)
blk_mq_request_bypass_insert(rq, 0);
else
blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
@@ -2599,6 +2601,20 @@ static void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx,
blk_mq_run_hw_queue(hctx, run_queue_async);
}
+void blk_mq_insert_ordered(struct request *rq, struct list_head *list)
+{
+ struct request_queue *q = rq->q;
+ struct request *rq2;
+
+ list_for_each_entry(rq2, list, queuelist)
+ if (rq2->q == q && blk_rq_pos(rq2) > blk_rq_pos(rq))
+ break;
+
+ /* Insert rq before rq2. If rq2 is the list head, append at the end. */
+ list_add_tail(&rq->queuelist, &rq2->queuelist);
+}
+EXPORT_SYMBOL_GPL(blk_mq_insert_ordered);
+
static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
{
struct request_queue *q = rq->q;
@@ -2653,6 +2669,8 @@ static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
spin_lock(&ctx->lock);
if (flags & BLK_MQ_INSERT_AT_HEAD)
list_add(&rq->queuelist, &ctx->rq_lists[hctx->type]);
+ else if (flags & BLK_MQ_INSERT_ORDERED)
+ blk_mq_insert_ordered(rq, &ctx->rq_lists[hctx->type]);
else
list_add_tail(&rq->queuelist,
&ctx->rq_lists[hctx->type]);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 309db553aba6..10b9fb3ca762 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -40,8 +40,10 @@ enum {
typedef unsigned int __bitwise blk_insert_t;
#define BLK_MQ_INSERT_AT_HEAD ((__force blk_insert_t)0x01)
+#define BLK_MQ_INSERT_ORDERED ((__force blk_insert_t)0x02)
void blk_mq_submit_bio(struct bio *bio);
+void blk_mq_insert_ordered(struct request *rq, struct list_head *list);
int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, struct io_comp_batch *iob,
unsigned int flags);
void blk_mq_exit_queue(struct request_queue *q);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 4155594aefc6..77bb41bab68d 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -603,6 +603,8 @@ static void kyber_insert_requests(struct blk_mq_hw_ctx *hctx,
trace_block_rq_insert(rq);
if (flags & BLK_MQ_INSERT_AT_HEAD)
list_move(&rq->queuelist, head);
+ else if (flags & BLK_MQ_INSERT_ORDERED)
+ blk_mq_insert_ordered(rq, head);
else
list_move_tail(&rq->queuelist, head);
sbitmap_set_bit(&khd->kcq_map[sched_domain],
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 2edf84b1bc2a..200e5a2928ce 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -711,7 +711,12 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
* set expire time and add to fifo list
*/
rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
- list_add_tail(&rq->queuelist, &per_prio->fifo_list[data_dir]);
+ if (flags & BLK_MQ_INSERT_ORDERED)
+ blk_mq_insert_ordered(rq,
+ &per_prio->fifo_list[data_dir]);
+ else
+ list_add_tail(&rq->queuelist,
+ &per_prio->fifo_list[data_dir]);
}
}
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index ac05974f08f9..f7514eefccfd 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -85,7 +85,7 @@ enum {
/* flags that prevent us from merging requests: */
#define RQF_NOMERGE_FLAGS \
- (RQF_STARTED | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
+ (RQF_STARTED | RQF_FLUSH_SEQ | RQF_DONTPREP | RQF_SPECIAL_PAYLOAD)
enum mq_rq_state {
MQ_RQ_IDLE = 0,
* [PATCH v16 15/26] blk-zoned: Document the locking order
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (13 preceding siblings ...)
2024-11-19 0:28 ` [PATCH v16 14/26] blk-mq: Restore the zoned write order when requeuing Bart Van Assche
@ 2024-11-19 0:28 ` Bart Van Assche
2024-11-19 7:52 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 16/26] blk-zoned: Document locking assumptions Bart Van Assche
` (12 subsequent siblings)
27 siblings, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:28 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-zoned.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index ded38fa9ae3d..1d0f8fea7516 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -51,7 +51,8 @@ static const char *const zone_cond_name[] = {
* reference is dropped whenever the zone of the zone write plug is reset,
* finished and when the zone becomes full (last write BIO to the zone
* completes).
- * @lock: Spinlock to atomically manipulate the plug.
+ * @lock: Spinlock to atomically manipulate the plug. Outer lock relative to
+ * disk->zone_wplugs_lock.
* @flags: Flags indicating the plug state.
* @zone_no: The number of the zone the plug is managing.
* @wp_offset: The zone write pointer location relative to the start of the zone
* [PATCH v16 16/26] blk-zoned: Document locking assumptions
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (14 preceding siblings ...)
2024-11-19 0:28 ` [PATCH v16 15/26] blk-zoned: Document the locking order Bart Van Assche
@ 2024-11-19 0:28 ` Bart Van Assche
2024-11-19 7:53 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 17/26] blk-zoned: Uninline functions that are not in the hot path Bart Van Assche
` (11 subsequent siblings)
27 siblings, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:28 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-zoned.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 1d0f8fea7516..17fe40db1888 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -441,6 +441,8 @@ static inline void disk_put_zone_wplug(struct blk_zone_wplug *zwplug)
static inline bool disk_should_remove_zone_wplug(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
+ lockdep_assert_held(&zwplug->lock);
+
/* If the zone write plug was already removed, we are done. */
if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED)
return false;
@@ -583,6 +585,8 @@ static inline void disk_zone_wplug_set_error(struct gendisk *disk,
{
unsigned long flags;
+ lockdep_assert_held(&zwplug->lock);
+
if (zwplug->flags & BLK_ZONE_WPLUG_ERROR)
return;
@@ -933,6 +937,8 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
{
struct gendisk *disk = bio->bi_bdev->bd_disk;
+ lockdep_assert_held(&zwplug->lock);
+
/*
* Check that the user is not attempting to write to a full zone.
* We know such BIO will fail, and that would potentially overflow our
* [PATCH v16 17/26] blk-zoned: Uninline functions that are not in the hot path
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (15 preceding siblings ...)
2024-11-19 0:28 ` [PATCH v16 16/26] blk-zoned: Document locking assumptions Bart Van Assche
@ 2024-11-19 0:28 ` Bart Van Assche
2024-11-19 7:55 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 18/26] blk-zoned: Make disk_should_remove_zone_wplug() more robust Bart Van Assche
` (10 subsequent siblings)
27 siblings, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:28 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
Apply the convention followed elsewhere in the block layer: only declare
functions inline if they are in the hot path.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-zoned.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 17fe40db1888..2b4783026450 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -558,8 +558,8 @@ static struct blk_zone_wplug *disk_get_and_lock_zone_wplug(struct gendisk *disk,
return zwplug;
}
-static inline void blk_zone_wplug_bio_io_error(struct blk_zone_wplug *zwplug,
- struct bio *bio)
+static void blk_zone_wplug_bio_io_error(struct blk_zone_wplug *zwplug,
+ struct bio *bio)
{
struct request_queue *q = zwplug->disk->queue;
@@ -580,8 +580,8 @@ static void disk_zone_wplug_abort(struct blk_zone_wplug *zwplug)
blk_zone_wplug_bio_io_error(zwplug, bio);
}
-static inline void disk_zone_wplug_set_error(struct gendisk *disk,
- struct blk_zone_wplug *zwplug)
+static void disk_zone_wplug_set_error(struct gendisk *disk,
+ struct blk_zone_wplug *zwplug)
{
unsigned long flags;
@@ -607,8 +607,8 @@ static inline void disk_zone_wplug_set_error(struct gendisk *disk,
spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
}
-static inline void disk_zone_wplug_clear_error(struct gendisk *disk,
- struct blk_zone_wplug *zwplug)
+static void disk_zone_wplug_clear_error(struct gendisk *disk,
+ struct blk_zone_wplug *zwplug)
{
unsigned long flags;
@@ -1597,7 +1597,7 @@ void disk_free_zone_resources(struct gendisk *disk)
disk->nr_zones = 0;
}
-static inline bool disk_need_zone_resources(struct gendisk *disk)
+static bool disk_need_zone_resources(struct gendisk *disk)
{
/*
* All mq zoned devices need zone resources so that the block layer
* [PATCH v16 18/26] blk-zoned: Make disk_should_remove_zone_wplug() more robust
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (16 preceding siblings ...)
2024-11-19 0:28 ` [PATCH v16 17/26] blk-zoned: Uninline functions that are not in the hot path Bart Van Assche
@ 2024-11-19 0:28 ` Bart Van Assche
2024-11-19 7:58 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 19/26] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
` (9 subsequent siblings)
27 siblings, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:28 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
Make the disk_should_remove_zone_wplug() behavior independent of the
number of zwplug references held by the caller.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-zoned.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 2b4783026450..59f6559b94da 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -457,12 +457,9 @@ static inline bool disk_should_remove_zone_wplug(struct gendisk *disk,
* blk_zone_write_plug_finish_request() (e.g. with split BIOs
* that are chained). In such case, disk_zone_wplug_unplug_bio()
* should not attempt to remove the zone write plug until all BIO
- * completions are seen. Check by looking at the zone write plug
- * reference count, which is 2 when the plug is unused (one reference
- * taken when the plug was allocated and another reference taken by the
- * caller context).
+ * completions are seen.
*/
- if (refcount_read(&zwplug->ref) > 2)
+ if (zwplug->wp_offset != zwplug->wp_offset_compl)
return false;
/* We can remove zone write plugs for zones that are empty or full. */
* [PATCH v16 19/26] blk-zoned: Add an argument to blk_zone_plug_bio()
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (17 preceding siblings ...)
2024-11-19 0:28 ` [PATCH v16 18/26] blk-zoned: Make disk_should_remove_zone_wplug() more robust Bart Van Assche
@ 2024-11-19 0:28 ` Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 20/26] blk-zoned: Support pipelining of zoned writes Bart Van Assche
` (8 subsequent siblings)
27 siblings, 0 replies; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:28 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
Prepare for preserving the order of pipelined zoned writes per zone.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 2 +-
block/blk-zoned.c | 3 ++-
drivers/md/dm.c | 2 +-
include/linux/blkdev.h | 6 ++++--
4 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1302ccbf2a7d..b396ddfe7165 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3135,7 +3135,7 @@ void blk_mq_submit_bio(struct bio *bio)
if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
goto queue_exit;
- if (blk_queue_is_zoned(q) && blk_zone_plug_bio(bio, nr_segs))
+ if (blk_queue_is_zoned(q) && blk_zone_plug_bio(bio, nr_segs, &swq_cpu))
goto queue_exit;
new_request:
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 59f6559b94da..8d0fac14c837 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1049,6 +1049,7 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
* blk_zone_plug_bio - Handle a zone write BIO with zone write plugging
* @bio: The BIO being submitted
* @nr_segs: The number of physical segments of @bio
+ * @swq_cpu: [out] CPU of the software queue to which the bio should be queued
*
* Handle write, write zeroes and zone append operations requiring emulation
* using zone write plugging.
@@ -1057,7 +1058,7 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
* write plug. Otherwise, return false to let the submission path process
* @bio normally.
*/
-bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
+bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int *swq_cpu)
{
struct block_device *bdev = bio->bi_bdev;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 19230404d8c2..cd85c9b6a463 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1796,7 +1796,7 @@ static inline bool dm_zone_bio_needs_split(struct mapped_device *md,
}
static inline bool dm_zone_plug_bio(struct mapped_device *md, struct bio *bio)
{
- return dm_emulate_zone_append(md) && blk_zone_plug_bio(bio, 0);
+ return dm_emulate_zone_append(md) && blk_zone_plug_bio(bio, 0, NULL);
}
static blk_status_t __send_zone_reset_all_emulated(struct clone_info *ci,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 72be33d02d1f..1c28ec5b75ce 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -694,13 +694,15 @@ static inline unsigned int disk_nr_zones(struct gendisk *disk)
{
return disk->nr_zones;
}
-bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs);
+struct blk_mq_ctx;
+bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int *swq_cpu);
#else /* CONFIG_BLK_DEV_ZONED */
static inline unsigned int disk_nr_zones(struct gendisk *disk)
{
return 0;
}
-static inline bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
+static inline bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs,
+ int *swq_cpu)
{
return false;
}
* [PATCH v16 20/26] blk-zoned: Support pipelining of zoned writes
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (18 preceding siblings ...)
2024-11-19 0:28 ` [PATCH v16 19/26] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
@ 2024-11-19 0:28 ` Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 21/26] scsi: core: Retry unaligned " Bart Van Assche
` (7 subsequent siblings)
27 siblings, 0 replies; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:28 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
Support pipelining of zoned writes if the block driver preserves the write
order per hardware queue. Track, per zone, the software queue to which
writes have been queued. If zoned writes are pipelined, submit new writes
to the same software queue as the writes that are already in progress. This
prevents the reordering that would occur if requests for the same zone were
submitted to different software or hardware queues.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-zoned.c | 40 ++++++++++++++++++++++++++++------------
1 file changed, 28 insertions(+), 12 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 8d0fac14c837..f934c0cb5fdd 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -8,6 +8,7 @@
* Copyright (c) 2016, Damien Le Moal
* Copyright (c) 2016, Western Digital
* Copyright (c) 2024, Western Digital Corporation or its affiliates.
+ * Copyright 2024 Google LLC
*/
#include <linux/kernel.h>
@@ -59,6 +60,8 @@ static const char *const zone_cond_name[] = {
* as a number of 512B sectors.
* @wp_offset_compl: End offset for completed zoned writes as a number of 512
* byte sectors.
+ * @swq_cpu: Software queue to submit writes to for drivers that preserve the
+ * write order.
* @bio_list: The list of BIOs that are currently plugged.
* @bio_work: Work struct to handle issuing of plugged BIOs
* @rcu_head: RCU head to free zone write plugs with an RCU grace period.
@@ -73,6 +76,7 @@ struct blk_zone_wplug {
unsigned int zone_no;
unsigned int wp_offset;
unsigned int wp_offset_compl;
+ int swq_cpu;
struct bio_list bio_list;
struct work_struct bio_work;
struct rcu_head rcu_head;
@@ -82,8 +86,7 @@ struct blk_zone_wplug {
/*
* Zone write plug flags bits:
* - BLK_ZONE_WPLUG_PLUGGED: Indicates that the zone write plug is plugged,
- * that is, that write BIOs are being throttled due to a write BIO already
- * being executed or the zone write plug bio list is not empty.
+ * that is, that write BIOs are being throttled.
* - BLK_ZONE_WPLUG_ERROR: Indicates that a write error happened which will be
* recovered with a report zone to update the zone write pointer offset.
* - BLK_ZONE_WPLUG_UNHASHED: Indicates that the zone write plug was removed
@@ -535,6 +538,7 @@ static struct blk_zone_wplug *disk_get_and_lock_zone_wplug(struct gendisk *disk,
zwplug->zone_no = zno;
zwplug->wp_offset = sector & (disk->queue->limits.chunk_sectors - 1);
zwplug->wp_offset_compl = 0;
+ zwplug->swq_cpu = -1;
bio_list_init(&zwplug->bio_list);
INIT_WORK(&zwplug->bio_work, blk_zone_wplug_bio_work);
zwplug->disk = disk;
@@ -973,7 +977,8 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
return false;
}
-static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
+static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs,
+ int *swq_cpu)
{
struct gendisk *disk = bio->bi_bdev->bd_disk;
sector_t sector = bio->bi_iter.bi_sector;
@@ -1017,11 +1022,19 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
bio_set_flag(bio, BIO_ZONE_WRITE_PLUGGING);
/*
- * If the zone is already plugged or has a pending error, add the BIO
- * to the plug BIO list. Otherwise, plug and let the BIO execute.
+ * If the zone has a pending error or is already plugged, add the BIO
+ * to the plug BIO list. Otherwise, execute the BIO and plug if not yet
+ * plugged and if the write order is not preserved.
*/
- if (zwplug->flags & BLK_ZONE_WPLUG_BUSY)
+ if (zwplug->flags & BLK_ZONE_WPLUG_BUSY) {
goto plug;
+ } else if (disk->queue->limits.driver_preserves_write_order) {
+ if (zwplug->swq_cpu < 0)
+ zwplug->swq_cpu = raw_smp_processor_id();
+ *swq_cpu = zwplug->swq_cpu;
+ } else {
+ zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+ }
/*
* If an error is detected when preparing the BIO, add it to the BIO
@@ -1030,8 +1043,6 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
if (!blk_zone_wplug_prepare_bio(zwplug, bio))
goto plug;
- zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
-
spin_unlock_irqrestore(&zwplug->lock, flags);
return false;
@@ -1107,7 +1118,7 @@ bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int *swq_cpu)
fallthrough;
case REQ_OP_WRITE:
case REQ_OP_WRITE_ZEROES:
- return blk_zone_wplug_handle_write(bio, nr_segs);
+ return blk_zone_wplug_handle_write(bio, nr_segs, swq_cpu);
case REQ_OP_ZONE_RESET:
return blk_zone_wplug_handle_reset_or_finish(bio, 0);
case REQ_OP_ZONE_FINISH:
@@ -1131,7 +1142,6 @@ static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
* of the next plugged BIO. blk_zone_wplug_bio_work() will release the
* reference we take here.
*/
- WARN_ON_ONCE(!(zwplug->flags & BLK_ZONE_WPLUG_PLUGGED));
refcount_inc(&zwplug->ref);
queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
}
@@ -1185,6 +1195,9 @@ static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+ if (refcount_read(&zwplug->ref) == 2)
+ zwplug->swq_cpu = -1;
+
/*
* If the zone is full (it was fully written or finished, or empty
* (it was reset), remove its zone write plug from the hash table.
@@ -1937,6 +1950,7 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
unsigned int zwp_zone_no, zwp_ref;
unsigned int zwp_bio_list_size;
unsigned long flags;
+ int swq_cpu;
spin_lock_irqsave(&zwplug->lock, flags);
zwp_zone_no = zwplug->zone_no;
@@ -1945,13 +1959,15 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
zwp_wp_offset = zwplug->wp_offset;
zwp_wp_offset_compl = zwplug->wp_offset_compl;
zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
+ swq_cpu = zwplug->swq_cpu;
spin_unlock_irqrestore(&zwplug->lock, flags);
bool all_zwr_inserted = blk_zone_all_zwr_inserted(zwplug);
- seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u wp_offset_compl %u bio_list_size %u all_zwr_inserted %d\n",
+ seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u wp_offset_compl %u bio_list_size %u all_zwr_inserted %d swq_cpu %d\n",
zwp_zone_no, zwp_flags, zwp_ref, zwp_wp_offset,
- zwp_wp_offset_compl, zwp_bio_list_size, all_zwr_inserted);
+ zwp_wp_offset_compl, zwp_bio_list_size, all_zwr_inserted,
+ swq_cpu);
}
int queue_zone_wplugs_show(void *data, struct seq_file *m)
* [PATCH v16 21/26] scsi: core: Retry unaligned zoned writes
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (19 preceding siblings ...)
2024-11-19 0:28 ` [PATCH v16 20/26] blk-zoned: Support pipelining of zoned writes Bart Van Assche
@ 2024-11-19 0:28 ` Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 22/26] scsi: sd: Increase retry count for " Bart Van Assche
` (6 subsequent siblings)
27 siblings, 0 replies; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:28 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche, Martin K. Petersen, Ming Lei
If zoned writes (REQ_OP_WRITE) for a sequential write required zone have
a starting LBA that differs from the write pointer, e.g. because zoned
writes have been reordered, then the storage device will respond with an
UNALIGNED WRITE COMMAND error. Retry commands that failed with an
unaligned write error. The block layer core will sort the SCSI commands
per LBA before these are resubmitted.
If the write order is preserved, increase the number of retries for
write commands sent to a sequential zone to the maximum number of
outstanding commands because in the worst case the number of times
reordered zoned writes have to be retried is (number of outstanding
writes per sequential zone) - 1.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/scsi_error.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 10154d78e336..24cd8e8538e5 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -700,6 +700,22 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
fallthrough;
case ILLEGAL_REQUEST:
+ /*
+ * Unaligned write command. This may indicate that zoned writes
+ * have been received by the device in the wrong order. If write
+ * pipelining is enabled, retry.
+ */
+ if (sshdr.asc == 0x21 && sshdr.ascq == 0x04 &&
+ req->q->limits.driver_preserves_write_order &&
+ blk_rq_is_seq_zoned_write(req) &&
+ scsi_cmd_retry_allowed(scmd)) {
+ SCSI_LOG_ERROR_RECOVERY(1,
+ sdev_printk(KERN_WARNING, scmd->device,
+ "Retrying unaligned write at LBA %#llx.\n",
+ scsi_get_lba(scmd)));
+ return NEEDS_RETRY;
+ }
+
if (sshdr.asc == 0x20 || /* Invalid command operation code */
sshdr.asc == 0x21 || /* Logical block address out of range */
sshdr.asc == 0x22 || /* Invalid function */
* [PATCH v16 22/26] scsi: sd: Increase retry count for zoned writes
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (20 preceding siblings ...)
2024-11-19 0:28 ` [PATCH v16 21/26] scsi: core: Retry unaligned " Bart Van Assche
@ 2024-11-19 0:28 ` Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 23/26] scsi: scsi_debug: Add the preserves_write_order module parameter Bart Van Assche
` (5 subsequent siblings)
27 siblings, 0 replies; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:28 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche, Martin K. Petersen, Ming Lei
If the write order is preserved, increase the number of retries for
write commands sent to a sequential zone to the maximum number of
outstanding commands because in the worst case the number of times
reordered zoned writes have to be retried is (number of outstanding
writes per sequential zone) - 1.
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/sd.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 8947dab132d7..117fce3b1c7b 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1408,6 +1408,13 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
cmd->transfersize = sdp->sector_size;
cmd->underflow = nr_blocks << 9;
cmd->allowed = sdkp->max_retries;
+ /*
+ * Increase the number of allowed retries for zoned writes if the driver
+ * preserves the command order.
+ */
+ if (rq->q->limits.driver_preserves_write_order &&
+ blk_rq_is_seq_zoned_write(rq))
+ cmd->allowed += rq->q->nr_requests;
cmd->sdb.length = nr_blocks * sdp->sector_size;
SCSI_LOG_HLQUEUE(1,
* [PATCH v16 23/26] scsi: scsi_debug: Add the preserves_write_order module parameter
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (21 preceding siblings ...)
2024-11-19 0:28 ` [PATCH v16 22/26] scsi: sd: Increase retry count for " Bart Van Assche
@ 2024-11-19 0:28 ` Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 24/26] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
` (4 subsequent siblings)
27 siblings, 0 replies; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:28 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche, Douglas Gilbert, Martin K. Petersen,
Ming Lei
Zone write locking is not used for zoned devices if the block driver
reports that it preserves the order of write commands. Make it easier to
test the code paths that bypass zone write locking by adding support for
setting the driver_preserves_write_order flag.
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/scsi_debug.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index b52513eeeafa..1b5a9cc27fbd 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -927,6 +927,7 @@ static int dix_reads;
static int dif_errors;
/* ZBC global data */
+static bool sdeb_preserves_write_order;
static bool sdeb_zbc_in_use; /* true for host-aware and host-managed disks */
static int sdeb_zbc_zone_cap_mb;
static int sdeb_zbc_zone_size_mb;
@@ -5881,10 +5882,14 @@ static struct sdebug_dev_info *find_build_dev_info(struct scsi_device *sdev)
static int scsi_debug_slave_alloc(struct scsi_device *sdp)
{
+ struct request_queue *q = sdp->request_queue;
+
if (sdebug_verbose)
pr_info("slave_alloc <%u %u %u %llu>\n",
sdp->host->host_no, sdp->channel, sdp->id, sdp->lun);
+ q->limits.driver_preserves_write_order = sdeb_preserves_write_order;
+
return 0;
}
@@ -6620,6 +6625,8 @@ module_param_named(statistics, sdebug_statistics, bool, S_IRUGO | S_IWUSR);
module_param_named(strict, sdebug_strict, bool, S_IRUGO | S_IWUSR);
module_param_named(submit_queues, submit_queues, int, S_IRUGO);
module_param_named(poll_queues, poll_queues, int, S_IRUGO);
+module_param_named(preserves_write_order, sdeb_preserves_write_order, bool,
+ S_IRUGO);
module_param_named(tur_ms_to_ready, sdeb_tur_ms_to_ready, int, S_IRUGO);
module_param_named(unmap_alignment, sdebug_unmap_alignment, int, S_IRUGO);
module_param_named(unmap_granularity, sdebug_unmap_granularity, int, S_IRUGO);
@@ -6692,6 +6699,8 @@ MODULE_PARM_DESC(opts, "1->noise, 2->medium_err, 4->timeout, 8->recovered_err...
MODULE_PARM_DESC(per_host_store, "If set, next positive add_host will get new store (def=0)");
MODULE_PARM_DESC(physblk_exp, "physical block exponent (def=0)");
MODULE_PARM_DESC(poll_queues, "support for iouring iopoll queues (1 to max(submit_queues - 1))");
+MODULE_PARM_DESC(preserves_write_order,
+ "Whether or not to inform the block layer that this driver preserves the order of WRITE commands (def=0)");
MODULE_PARM_DESC(ptype, "SCSI peripheral type(def=0[disk])");
MODULE_PARM_DESC(random, "If set, uniformly randomize command duration between 0 and delay_in_ns");
MODULE_PARM_DESC(removable, "claim to have removable media (def=0)");
* [PATCH v16 24/26] scsi: scsi_debug: Support injecting unaligned write errors
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (22 preceding siblings ...)
2024-11-19 0:28 ` [PATCH v16 23/26] scsi: scsi_debug: Add the preserves_write_order module parameter Bart Van Assche
@ 2024-11-19 0:28 ` Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 25/26] scsi: scsi_debug: Skip host/bus reset settle delay Bart Van Assche
` (3 subsequent siblings)
27 siblings, 0 replies; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:28 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche, Douglas Gilbert, Martin K. Petersen,
Ming Lei
Allow user space software, e.g. a blktests test, to inject unaligned
write errors.
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/scsi_debug.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 1b5a9cc27fbd..2ff3a24d791e 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -193,6 +193,7 @@ static const char *sdebug_version_date = "20210520";
#define SDEBUG_OPT_NO_CDB_NOISE 0x4000
#define SDEBUG_OPT_HOST_BUSY 0x8000
#define SDEBUG_OPT_CMD_ABORT 0x10000
+#define SDEBUG_OPT_UNALIGNED_WRITE 0x20000
#define SDEBUG_OPT_ALL_NOISE (SDEBUG_OPT_NOISE | SDEBUG_OPT_Q_NOISE | \
SDEBUG_OPT_RESET_NOISE)
#define SDEBUG_OPT_ALL_INJECTING (SDEBUG_OPT_RECOVERED_ERR | \
@@ -200,7 +201,8 @@ static const char *sdebug_version_date = "20210520";
SDEBUG_OPT_DIF_ERR | SDEBUG_OPT_DIX_ERR | \
SDEBUG_OPT_SHORT_TRANSFER | \
SDEBUG_OPT_HOST_BUSY | \
- SDEBUG_OPT_CMD_ABORT)
+ SDEBUG_OPT_CMD_ABORT | \
+ SDEBUG_OPT_UNALIGNED_WRITE)
#define SDEBUG_OPT_RECOV_DIF_DIX (SDEBUG_OPT_RECOVERED_ERR | \
SDEBUG_OPT_DIF_ERR | SDEBUG_OPT_DIX_ERR)
@@ -4191,6 +4193,14 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
u8 *cmd = scp->cmnd;
bool meta_data_locked = false;
+ if (unlikely(sdebug_opts & SDEBUG_OPT_UNALIGNED_WRITE &&
+ atomic_read(&sdeb_inject_pending))) {
+ atomic_set(&sdeb_inject_pending, 0);
+ mk_sense_buffer(scp, ILLEGAL_REQUEST, LBA_OUT_OF_RANGE,
+ UNALIGNED_WRITE_ASCQ);
+ return check_condition_result;
+ }
+
switch (cmd[0]) {
case WRITE_16:
ei_lba = 0;
* [PATCH v16 25/26] scsi: scsi_debug: Skip host/bus reset settle delay
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (23 preceding siblings ...)
2024-11-19 0:28 ` [PATCH v16 24/26] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
@ 2024-11-19 0:28 ` Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 26/26] scsi: ufs: Inform the block layer about write ordering Bart Van Assche
` (2 subsequent siblings)
27 siblings, 0 replies; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:28 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche
Skip the reset settle delay during error handling since the scsi_debug
driver doesn't need this delay.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/scsi_debug.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 2ff3a24d791e..791c085917bc 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -8751,6 +8751,7 @@ static struct scsi_host_template sdebug_driver_template = {
.max_sectors = -1U,
.max_segment_size = -1U,
.module = THIS_MODULE,
+ .skip_settle_delay = 1,
.track_queue_depth = 1,
.cmd_size = sizeof(struct sdebug_scsi_cmd),
.init_cmd_priv = sdebug_init_cmd_priv,
* [PATCH v16 26/26] scsi: ufs: Inform the block layer about write ordering
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (24 preceding siblings ...)
2024-11-19 0:28 ` [PATCH v16 25/26] scsi: scsi_debug: Skip host/bus reset settle delay Bart Van Assche
@ 2024-11-19 0:28 ` Bart Van Assche
[not found] ` <37f95f44-ab1d-20db-e0c7-94946cb9d4eb@quicinc.com>
2024-11-19 8:01 ` [PATCH v16 00/26] Improve write performance for zoned UFS devices Damien Le Moal
2024-11-19 12:25 ` Christoph Hellwig
27 siblings, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 0:28 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bart Van Assche, Bao D. Nguyen, Can Guo,
Martin K. Petersen, Avri Altman
From the UFSHCI 4.0 specification, about the legacy (single queue) mode:
"The host controller always process transfer requests in-order according
to the order submitted to the list. In case of multiple commands with
single doorbell register ringing (batch mode), The dispatch order for
these transfer requests by host controller will base on their index in
the List. A transfer request with lower index value will be executed
before a transfer request with higher index value."
From the UFSHCI 4.0 specification, about the MCQ mode:
"Command Submission
1. Host SW writes an Entry to SQ
2. Host SW updates SQ doorbell tail pointer
Command Processing
3. After fetching the Entry, Host Controller updates SQ doorbell head
pointer
4. Host controller sends COMMAND UPIU to UFS device"
In other words, in both legacy and MCQ mode, UFS controllers are
required to forward commands to the UFS device in the order in which these
commands were received from the host.
Notes:
- For legacy mode this is only correct if the host submits one
command at a time. The UFS driver does this.
- Also in legacy mode, the command order is not preserved if
auto-hibernation is enabled in the UFS controller.
This patch improves performance as follows on a test setup with a UFSHCI
3.0 controller:
- With the mq-deadline scheduler: 2.5x more IOPS for small writes.
- When not using an I/O scheduler compared to using mq-deadline with
zone locking: 4x more IOPS for small writes.
Cc: Bao D. Nguyen <quic_nguyenb@quicinc.com>
Cc: Can Guo <quic_cang@quicinc.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Avri Altman <avri.altman@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/ufs/core/ufshcd.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index abbe7135a977..a6dec3b7e3fd 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -5241,6 +5241,13 @@ static int ufshcd_device_configure(struct scsi_device *sdev,
struct ufs_hba *hba = shost_priv(sdev->host);
struct request_queue *q = sdev->request_queue;
+ /*
+ * With auto-hibernation disabled, the write order is preserved per
+ * MCQ. Auto-hibernation may cause write reordering that results in
+ * unaligned write errors. The SCSI core will retry the failed writes.
+ */
+ lim->driver_preserves_write_order = true;
+
lim->dma_pad_mask = PRDT_DATA_BYTE_COUNT_PAD - 1;
/*
* Re: [PATCH v16 01/26] blk-zoned: Fix a reference count leak
2024-11-19 0:27 ` [PATCH v16 01/26] blk-zoned: Fix a reference count leak Bart Van Assche
@ 2024-11-19 2:23 ` Damien Le Moal
2024-11-19 20:21 ` Bart Van Assche
0 siblings, 1 reply; 73+ messages in thread
From: Damien Le Moal @ 2024-11-19 2:23 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/19/24 9:27 AM, Bart Van Assche wrote:
> Fix a reference count leak in disk_zone_wplug_handle_error()
>
> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/blk-zoned.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 70211751df16..3346b8c53605 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -1337,6 +1337,8 @@ static void disk_zone_wplug_handle_error(struct gendisk *disk,
>
> unlock:
> spin_unlock_irqrestore(&zwplug->lock, flags);
> +
> + disk_put_zone_wplug(zwplug);
The zone wplug put call is right after the single call site to
disk_zone_wplug_handle_error(). The reason it is *not* in that function is that
the reference on the wplug for handling an error is taken when the wplug is
added to the error list. disk_zone_wplug_handle_error() does not itself take a
reference on the wplug.
So how did you come up with this ? What workload/operation did you run to find
an issue ?
> }
>
> static void disk_zone_wplugs_work(struct work_struct *work)
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 03/26] blk-zoned: Split queue_zone_wplugs_show()
2024-11-19 0:27 ` [PATCH v16 03/26] blk-zoned: Split queue_zone_wplugs_show() Bart Van Assche
@ 2024-11-19 2:25 ` Damien Le Moal
0 siblings, 0 replies; 73+ messages in thread
From: Damien Le Moal @ 2024-11-19 2:25 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/19/24 9:27 AM, Bart Van Assche wrote:
> Reduce the indentation level of the code in queue_zone_wplugs_show() by
> moving the body of the loop in that function into a new function.
>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Looks OK.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 04/26] blk-zoned: Only handle errors after pending zoned writes have completed
2024-11-19 0:27 ` [PATCH v16 04/26] blk-zoned: Only handle errors after pending zoned writes have completed Bart Van Assche
@ 2024-11-19 2:50 ` Damien Le Moal
2024-11-19 20:51 ` Bart Van Assche
0 siblings, 1 reply; 73+ messages in thread
From: Damien Le Moal @ 2024-11-19 2:50 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/19/24 9:27 AM, Bart Van Assche wrote:
> Instead of handling write errors immediately, only handle these after all
> pending zoned write requests have completed or have been requeued. This
> patch prepares for changing the zone write pointer tracking approach.
A little more explanation of how this is achieved would be nice. I was
expecting a shorter change given the short commit message... It took some time
to understand the changes without details.
More comments below.
>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/blk-mq.c | 9 +++
> block/blk-zoned.c | 154 +++++++++++++++++++++++++++++++++++++++--
> block/blk.h | 29 ++++++++
> include/linux/blk-mq.h | 18 +++++
> 4 files changed, 203 insertions(+), 7 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 270cfd9fc6b0..a45077e187b5 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -793,6 +793,9 @@ void blk_mq_free_request(struct request *rq)
> rq_qos_done(q, rq);
>
> WRITE_ONCE(rq->state, MQ_RQ_IDLE);
> +
> + blk_zone_free_request(rq);
> +
> if (req_ref_put_and_test(rq))
> __blk_mq_free_request(rq);
> }
> @@ -1189,6 +1192,9 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
> continue;
>
> WRITE_ONCE(rq->state, MQ_RQ_IDLE);
> +
> + blk_zone_free_request(rq);
> +
> if (!req_ref_put_and_test(rq))
> continue;
>
> @@ -1507,6 +1513,7 @@ static void __blk_mq_requeue_request(struct request *rq)
> if (blk_mq_request_started(rq)) {
> WRITE_ONCE(rq->state, MQ_RQ_IDLE);
> rq->rq_flags &= ~RQF_TIMED_OUT;
> + blk_zone_requeue_work(q);
> }
> }
>
> @@ -1542,6 +1549,8 @@ static void blk_mq_requeue_work(struct work_struct *work)
> list_splice_init(&q->flush_list, &flush_list);
> spin_unlock_irq(&q->requeue_lock);
>
> + blk_zone_requeue_work(q);
> +
> while (!list_empty(&rq_list)) {
> rq = list_entry(rq_list.next, struct request, queuelist);
> /*
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 7e6e6ebb6235..b570b773e65f 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -608,6 +608,8 @@ static inline void disk_zone_wplug_set_error(struct gendisk *disk,
> if (zwplug->flags & BLK_ZONE_WPLUG_ERROR)
> return;
>
> + zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
> + zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
Why move these before the comment ? Also, why set BLK_ZONE_WPLUG_PLUGGED ? It
should be set already since this is handling a failed write that was either
being prepared for submission or submitted (and completed) already. In both
cases, the wplug is plugged since we have a write in flight.
> /*
> * At this point, we already have a reference on the zone write plug.
> * However, since we are going to add the plug to the disk zone write
> @@ -616,7 +618,6 @@ static inline void disk_zone_wplug_set_error(struct gendisk *disk,
> * handled, or in disk_zone_wplug_clear_error() if the zone is reset or
> * finished.
> */
> - zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
> refcount_inc(&zwplug->ref);
>
> spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
> @@ -642,6 +643,7 @@ static inline void disk_zone_wplug_clear_error(struct gendisk *disk,
> spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
> if (!list_empty(&zwplug->link)) {
> list_del_init(&zwplug->link);
> + zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
> zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
> disk_put_zone_wplug(zwplug);
> }
> @@ -746,6 +748,70 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
> return false;
> }
>
> +struct all_zwr_inserted_data {
> + struct blk_zone_wplug *zwplug;
> + bool res;
> +};
> +
> +/*
> + * Changes @data->res to %false if and only if @rq is a zoned write for the
> + * given zone and if it is owned by the block driver.
It would be nice to have a request flag or a state indicating that instead of
needing all this code... Can't that be done ?
> + *
> + * @rq members may change while this function is in progress. Hence, use
> + * READ_ONCE() to read @rq members.
> + */
> +static bool blk_zwr_inserted(struct request *rq, void *data)
> +{
> + struct all_zwr_inserted_data *d = data;
> + struct blk_zone_wplug *zwplug = d->zwplug;
> + struct request_queue *q = zwplug->disk->queue;
> + struct bio *bio = READ_ONCE(rq->bio);
> +
> + if (rq->q == q && READ_ONCE(rq->state) != MQ_RQ_IDLE &&
> + blk_rq_is_seq_zoned_write(rq) && bio &&
> + bio_zone_no(bio) == zwplug->zone_no) {
> + d->res = false;
> + return false;
> + }
> +
> + return true;
> +}
> +
> +/*
> + * Report whether or not all zoned writes for a zone have been inserted into a
> + * software queue, elevator queue or hardware queue.
> + */
> +static bool blk_zone_all_zwr_inserted(struct blk_zone_wplug *zwplug)
> +{
> + struct gendisk *disk = zwplug->disk;
> + struct request_queue *q = disk->queue;
> + struct all_zwr_inserted_data d = { .zwplug = zwplug, .res = true };
> + struct blk_mq_hw_ctx *hctx;
> + long unsigned int i;
> + struct request *rq;
> +
> + scoped_guard(spinlock_irqsave, &q->requeue_lock) {
> + list_for_each_entry(rq, &q->requeue_list, queuelist)
> + if (blk_rq_is_seq_zoned_write(rq) &&
> + bio_zone_no(rq->bio) == zwplug->zone_no)
> + return false;
> + list_for_each_entry(rq, &q->flush_list, queuelist)
> + if (blk_rq_is_seq_zoned_write(rq) &&
> + bio_zone_no(rq->bio) == zwplug->zone_no)
> + return false;
> + }
> +
> + queue_for_each_hw_ctx(q, hctx, i) {
> + struct blk_mq_tags *tags = hctx->sched_tags ?: hctx->tags;
> +
> + blk_mq_all_tag_iter(tags, blk_zwr_inserted, &d);
> + if (!d.res || blk_mq_is_shared_tags(q->tag_set->flags))
> + break;
> + }
> +
> + return d.res;
> +}
> +
> static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug,
> struct bio *bio, unsigned int nr_segs)
> {
> @@ -1096,6 +1162,29 @@ static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
> queue_work(disk->zone_wplugs_wq, &zwplug->bio_work);
> }
>
> +/*
> + * Change the zone state to "error" if a request is requeued to postpone
> + * processing of requeued requests until all pending requests have either
> + * completed or have been requeued.
> + */
> +void blk_zone_write_plug_requeue_request(struct request *rq)
> +{
> + struct gendisk *disk = rq->q->disk;
> + struct blk_zone_wplug *zwplug;
> +
> + if (!disk->zone_wplugs_hash_bits || !blk_rq_is_seq_zoned_write(rq))
> + return;
I think the disk->zone_wplugs_hash_bits check needs to go inside
disk_get_zone_wplug() as that will avoid a similar check in
blk_zone_write_plug_free_request() too. That said, I am not even convinced it
is needed at all since these functions should be called only for a zoned drive
which should have its zone wplug hash set up.
> +
> + zwplug = disk_get_zone_wplug(disk, blk_rq_pos(rq));
> + if (WARN_ON_ONCE(!zwplug))
> + return;
> +
> + scoped_guard(spinlock_irqsave, &zwplug->lock)
> + disk_zone_wplug_set_error(disk, zwplug);
> +
> + disk_put_zone_wplug(zwplug);
> +}
> +
> static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
> struct blk_zone_wplug *zwplug)
> {
> @@ -1202,6 +1291,33 @@ void blk_zone_write_plug_finish_request(struct request *req)
> disk_put_zone_wplug(zwplug);
> }
>
> +/*
> + * Schedule zone_plugs_work if a zone is in the error state and if no requests
> + * are in flight. Called from blk_mq_free_request().
> + */
> +void blk_zone_write_plug_free_request(struct request *rq)
> +{
> + struct gendisk *disk = rq->q->disk;
> + struct blk_zone_wplug *zwplug;
> +
> + /*
> + * Do nothing if this function is called before the zone information
> + * has been initialized.
> + */
> + if (!disk->zone_wplugs_hash_bits)
> + return;
> +
> + zwplug = disk_get_zone_wplug(disk, blk_rq_pos(rq));
> +
Useless blank line here.
> + if (!zwplug)
> + return;
> +
> + if (zwplug->flags & BLK_ZONE_WPLUG_ERROR)
> + kblockd_schedule_work(&disk->zone_wplugs_work);
I think this needs to be done under the zone wplug spinlock ?
> +
> + disk_put_zone_wplug(zwplug);
> +}
> +
> static void blk_zone_wplug_bio_work(struct work_struct *work)
> {
> struct blk_zone_wplug *zwplug =
> @@ -1343,14 +1459,15 @@ static void disk_zone_wplug_handle_error(struct gendisk *disk,
>
> static void disk_zone_process_err_list(struct gendisk *disk)
> {
> - struct blk_zone_wplug *zwplug;
> + struct blk_zone_wplug *zwplug, *next;
> unsigned long flags;
>
> spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
>
> - while (!list_empty(&disk->zone_wplugs_err_list)) {
> - zwplug = list_first_entry(&disk->zone_wplugs_err_list,
> - struct blk_zone_wplug, link);
> + list_for_each_entry_safe(zwplug, next, &disk->zone_wplugs_err_list,
> + link) {
You are holding the disk zwplug spinlock, so why use the _safe() loop ? Not
needed, right ?
> + if (!blk_zone_all_zwr_inserted(zwplug))
> + continue;
> list_del_init(&zwplug->link);
> spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
>
> @@ -1361,6 +1478,12 @@ static void disk_zone_process_err_list(struct gendisk *disk)
> }
>
> spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
> +
> + /*
> + * If one or more zones have been skipped, this work will be requeued
> + * when a request is requeued (blk_zone_requeue_work()) or freed
> + * (blk_zone_write_plug_free_request()).
> + */
> }
>
> static void disk_zone_wplugs_work(struct work_struct *work)
> @@ -1371,6 +1494,20 @@ static void disk_zone_wplugs_work(struct work_struct *work)
> disk_zone_process_err_list(disk);
> }
>
> +/* May be called from interrupt context and hence must not sleep. */
> +void blk_zone_requeue_work(struct request_queue *q)
> +{
> + struct gendisk *disk = q->disk;
> +
> + if (!disk)
> + return;
Can this happen ?
> +
> + if (in_interrupt())
> + kblockd_schedule_work(&disk->zone_wplugs_work);
> + else
> + disk_zone_process_err_list(disk);
> +}
> +
> static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
> {
> return 1U << disk->zone_wplugs_hash_bits;
> @@ -1854,8 +1991,11 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
> zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
> spin_unlock_irqrestore(&zwplug->lock, flags);
>
> - seq_printf(m, "%u 0x%x %u %u %u\n", zwp_zone_no, zwp_flags, zwp_ref,
> - zwp_wp_offset, zwp_bio_list_size);
> + bool all_zwr_inserted = blk_zone_all_zwr_inserted(zwplug);
Is this bool really needed ? If it is, shouldn't it be assigned while holding
the zwplug lock to have a "snapshot" of the plug with all printed values
consistent ?
> +
> + seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u bio_list_size %u all_zwr_inserted %d\n",
> + zwp_zone_no, zwp_flags, zwp_ref, zwp_wp_offset,
> + zwp_bio_list_size, all_zwr_inserted);
> }
>
> int queue_zone_wplugs_show(void *data, struct seq_file *m)
> diff --git a/block/blk.h b/block/blk.h
> index 2c26abf505b8..be945db6298d 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -473,6 +473,18 @@ static inline void blk_zone_update_request_bio(struct request *rq,
> if (req_op(rq) == REQ_OP_ZONE_APPEND || bio_zone_write_plugging(bio))
> bio->bi_iter.bi_sector = rq->__sector;
> }
> +
> +void blk_zone_write_plug_requeue_request(struct request *rq);
> +static inline void blk_zone_requeue_request(struct request *rq)
> +{
> + if (!blk_rq_is_seq_zoned_write(rq))
> + return;
> +
> + blk_zone_write_plug_requeue_request(rq);
May be:
if (blk_rq_is_seq_zoned_write(rq))
blk_zone_write_plug_requeue_request(rq);
?
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index c596e0e4cb75..ac05974f08f9 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -1169,4 +1169,22 @@ static inline int blk_rq_map_sg(struct request_queue *q, struct request *rq,
> }
> void blk_dump_rq_flags(struct request *, char *);
>
> +#ifdef CONFIG_BLK_DEV_ZONED
> +static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
bdev_zone_is_seq() is already stubbed for the !CONFIG_BLK_DEV_ZONED case, so I
do not think this function needs the ifdef. It will compile either way and will
never be called from anywhere in the !CONFIG_BLK_DEV_ZONED case.
> +{
> + switch (req_op(rq)) {
> + case REQ_OP_WRITE:
> + case REQ_OP_WRITE_ZEROES:
> + return bdev_zone_is_seq(rq->q->disk->part0, blk_rq_pos(rq));
> + default:
> + return false;
> + }
> +}
> +#else /* CONFIG_BLK_DEV_ZONED */
> +static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
> +{
> + return false;
> +}
> +#endif /* CONFIG_BLK_DEV_ZONED */
> +
> #endif /* BLK_MQ_H */
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 05/26] blk-zoned: Fix a deadlock triggered by unaligned writes
2024-11-19 0:27 ` [PATCH v16 05/26] blk-zoned: Fix a deadlock triggered by unaligned writes Bart Van Assche
@ 2024-11-19 2:57 ` Damien Le Moal
2024-11-19 21:04 ` Bart Van Assche
2025-01-09 19:11 ` Bart Van Assche
0 siblings, 2 replies; 73+ messages in thread
From: Damien Le Moal @ 2024-11-19 2:57 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/19/24 9:27 AM, Bart Van Assche wrote:
> If the queue is filled with unaligned writes then the following
> deadlock occurs:
>
> Call Trace:
> <TASK>
> __schedule+0x8cc/0x2190
> schedule+0xdd/0x2b0
> blk_queue_enter+0x2ce/0x4f0
> blk_mq_alloc_request+0x303/0x810
> scsi_execute_cmd+0x3f4/0x7b0
> sd_zbc_do_report_zones+0x19e/0x4c0
> sd_zbc_report_zones+0x304/0x920
> disk_zone_wplug_handle_error+0x237/0x920
> disk_zone_wplugs_work+0x17e/0x430
> process_one_work+0xdd0/0x1490
> worker_thread+0x5eb/0x1010
> kthread+0x2e5/0x3b0
> ret_from_fork+0x3a/0x80
> </TASK>
>
> Fix this deadlock by removing the disk->fops->report_zones() call and by
> deriving the write pointer information from successfully completed zoned
> writes.
>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Doesn't this need a Fixes tag and CC stable, and come earlier in the series (or
sent separately) ?
Overall, this patch seems wrong anyway as zone reset and zone finish may be
done between 2 writes, failing the next one but the recovery done here will use
the previous succeful write end position as the wp, which is NOT correct as
reset or finish changed that... And we also have the possibility of torn writes
(partial writes) with SAS SMR drives. So I really think that you cannot avoid
doing a report zone to recover errors.
> ---
> block/blk-zoned.c | 69 +++++++++++++++++++----------------------------
> block/blk.h | 4 ++-
> 2 files changed, 31 insertions(+), 42 deletions(-)
>
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index b570b773e65f..284820b29285 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -56,6 +56,8 @@ static const char *const zone_cond_name[] = {
> * @zone_no: The number of the zone the plug is managing.
> * @wp_offset: The zone write pointer location relative to the start of the zone
> * as a number of 512B sectors.
> + * @wp_offset_compl: End offset for completed zoned writes as a number of 512
> + * byte sectors.
> * @bio_list: The list of BIOs that are currently plugged.
> * @bio_work: Work struct to handle issuing of plugged BIOs
> * @rcu_head: RCU head to free zone write plugs with an RCU grace period.
> @@ -69,6 +71,7 @@ struct blk_zone_wplug {
> unsigned int flags;
> unsigned int zone_no;
> unsigned int wp_offset;
> + unsigned int wp_offset_compl;
> struct bio_list bio_list;
> struct work_struct bio_work;
> struct rcu_head rcu_head;
> @@ -531,6 +534,7 @@ static struct blk_zone_wplug *disk_get_and_lock_zone_wplug(struct gendisk *disk,
> zwplug->flags = 0;
> zwplug->zone_no = zno;
> zwplug->wp_offset = sector & (disk->queue->limits.chunk_sectors - 1);
> + zwplug->wp_offset_compl = 0;
> bio_list_init(&zwplug->bio_list);
> INIT_WORK(&zwplug->bio_work, blk_zone_wplug_bio_work);
> zwplug->disk = disk;
> @@ -1226,6 +1230,7 @@ void blk_zone_write_plug_bio_endio(struct bio *bio)
> struct gendisk *disk = bio->bi_bdev->bd_disk;
> struct blk_zone_wplug *zwplug =
> disk_get_zone_wplug(disk, bio->bi_iter.bi_sector);
> + unsigned int end_sector;
> unsigned long flags;
>
> if (WARN_ON_ONCE(!zwplug))
> @@ -1243,11 +1248,24 @@ void blk_zone_write_plug_bio_endio(struct bio *bio)
> bio->bi_opf |= REQ_OP_ZONE_APPEND;
> }
>
> - /*
> - * If the BIO failed, mark the plug as having an error to trigger
> - * recovery.
> - */
> - if (bio->bi_status != BLK_STS_OK) {
> + if (bio->bi_status == BLK_STS_OK) {
> + switch (bio_op(bio)) {
> + case REQ_OP_WRITE:
> + case REQ_OP_ZONE_APPEND:
> + case REQ_OP_WRITE_ZEROES:
> + end_sector = bdev_offset_from_zone_start(disk->part0,
> + bio->bi_iter.bi_sector + bio_sectors(bio));
> + if (end_sector > zwplug->wp_offset_compl)
> + zwplug->wp_offset_compl = end_sector;
> + break;
> + default:
> + break;
> + }
> + } else {
> + /*
> + * If the BIO failed, mark the plug as having an error to
> + * trigger recovery.
> + */
> spin_lock_irqsave(&zwplug->lock, flags);
> disk_zone_wplug_set_error(disk, zwplug);
> spin_unlock_irqrestore(&zwplug->lock, flags);
> @@ -1388,30 +1406,10 @@ static unsigned int blk_zone_wp_offset(struct blk_zone *zone)
> }
> }
>
> -static int blk_zone_wplug_report_zone_cb(struct blk_zone *zone,
> - unsigned int idx, void *data)
> -{
> - struct blk_zone *zonep = data;
> -
> - *zonep = *zone;
> - return 0;
> -}
> -
> static void disk_zone_wplug_handle_error(struct gendisk *disk,
> struct blk_zone_wplug *zwplug)
> {
> - sector_t zone_start_sector =
> - bdev_zone_sectors(disk->part0) * zwplug->zone_no;
> - unsigned int noio_flag;
> - struct blk_zone zone;
> unsigned long flags;
> - int ret;
> -
> - /* Get the current zone information from the device. */
> - noio_flag = memalloc_noio_save();
> - ret = disk->fops->report_zones(disk, zone_start_sector, 1,
> - blk_zone_wplug_report_zone_cb, &zone);
> - memalloc_noio_restore(noio_flag);
>
> spin_lock_irqsave(&zwplug->lock, flags);
>
> @@ -1425,19 +1423,8 @@ static void disk_zone_wplug_handle_error(struct gendisk *disk,
>
> zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
>
> - if (ret != 1) {
> - /*
> - * We failed to get the zone information, meaning that something
> - * is likely really wrong with the device. Abort all remaining
> - * plugged BIOs as otherwise we could endup waiting forever on
> - * plugged BIOs to complete if there is a queue freeze on-going.
> - */
> - disk_zone_wplug_abort(zwplug);
> - goto unplug;
> - }
> -
> /* Update the zone write pointer offset. */
> - zwplug->wp_offset = blk_zone_wp_offset(&zone);
> + zwplug->wp_offset = zwplug->wp_offset_compl;
> disk_zone_wplug_abort_unaligned(disk, zwplug);
>
> /* Restart BIO submission if we still have any BIO left. */
> @@ -1446,7 +1433,6 @@ static void disk_zone_wplug_handle_error(struct gendisk *disk,
> goto unlock;
> }
>
> -unplug:
> zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
> if (disk_should_remove_zone_wplug(disk, zwplug))
> disk_remove_zone_wplug(disk, zwplug);
> @@ -1978,7 +1964,7 @@ EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
> static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
> struct seq_file *m)
> {
> - unsigned int zwp_wp_offset, zwp_flags;
> + unsigned int zwp_wp_offset, zwp_wp_offset_compl, zwp_flags;
> unsigned int zwp_zone_no, zwp_ref;
> unsigned int zwp_bio_list_size;
> unsigned long flags;
> @@ -1988,14 +1974,15 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
> zwp_flags = zwplug->flags;
> zwp_ref = refcount_read(&zwplug->ref);
> zwp_wp_offset = zwplug->wp_offset;
> + zwp_wp_offset_compl = zwplug->wp_offset_compl;
> zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
> spin_unlock_irqrestore(&zwplug->lock, flags);
>
> bool all_zwr_inserted = blk_zone_all_zwr_inserted(zwplug);
>
> - seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u bio_list_size %u all_zwr_inserted %d\n",
> + seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u wp_offset_compl %u bio_list_size %u all_zwr_inserted %d\n",
> zwp_zone_no, zwp_flags, zwp_ref, zwp_wp_offset,
> - zwp_bio_list_size, all_zwr_inserted);
> + zwp_wp_offset_compl, zwp_bio_list_size, all_zwr_inserted);
> }
>
> int queue_zone_wplugs_show(void *data, struct seq_file *m)
> diff --git a/block/blk.h b/block/blk.h
> index be945db6298d..88a6e258eafe 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -470,8 +470,10 @@ static inline void blk_zone_update_request_bio(struct request *rq,
> * the original BIO sector so that blk_zone_write_plug_bio_endio() can
> * lookup the zone write plug.
> */
> - if (req_op(rq) == REQ_OP_ZONE_APPEND || bio_zone_write_plugging(bio))
> + if (req_op(rq) == REQ_OP_ZONE_APPEND || bio_zone_write_plugging(bio)) {
> bio->bi_iter.bi_sector = rq->__sector;
> + bio->bi_iter.bi_size = rq->__data_len;
> + }
> }
>
> void blk_zone_write_plug_requeue_request(struct request *rq);
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 06/26] blk-zoned: Fix requeuing of zoned writes
2024-11-19 0:27 ` [PATCH v16 06/26] blk-zoned: Fix requeuing of zoned writes Bart Van Assche
@ 2024-11-19 3:00 ` Damien Le Moal
2024-11-19 21:06 ` Bart Van Assche
0 siblings, 1 reply; 73+ messages in thread
From: Damien Le Moal @ 2024-11-19 3:00 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/19/24 9:27 AM, Bart Van Assche wrote:
> Make sure that unaligned writes are sent to the block driver. This
> allows the block driver to limit the number of retries of unaligned
> writes. Remove disk_zone_wplug_abort_unaligned() because it only examines
> the bio plug list. Pending writes can occur either on the bio plug list
> or on the request queue requeue list.
There can be only one write in flight, so at most one write in the requeue
list... And if that write was requeued, it means that it was not even started
so it has not failed yet. So I am not sure what this is all about.
Am I missing something ?
>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/blk-zoned.c | 36 ------------------------------------
> 1 file changed, 36 deletions(-)
>
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 284820b29285..ded38fa9ae3d 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -577,33 +577,6 @@ static void disk_zone_wplug_abort(struct blk_zone_wplug *zwplug)
> blk_zone_wplug_bio_io_error(zwplug, bio);
> }
>
> -/*
> - * Abort (fail) all plugged BIOs of a zone write plug that are not aligned
> - * with the assumed write pointer location of the zone when the BIO will
> - * be unplugged.
> - */
> -static void disk_zone_wplug_abort_unaligned(struct gendisk *disk,
> - struct blk_zone_wplug *zwplug)
> -{
> - unsigned int wp_offset = zwplug->wp_offset;
> - struct bio_list bl = BIO_EMPTY_LIST;
> - struct bio *bio;
> -
> - while ((bio = bio_list_pop(&zwplug->bio_list))) {
> - if (disk_zone_is_full(disk, zwplug->zone_no, wp_offset) ||
> - (bio_op(bio) != REQ_OP_ZONE_APPEND &&
> - bio_offset_from_zone_start(bio) != wp_offset)) {
> - blk_zone_wplug_bio_io_error(zwplug, bio);
> - continue;
> - }
> -
> - wp_offset += bio_sectors(bio);
> - bio_list_add(&bl, bio);
> - }
> -
> - bio_list_merge(&zwplug->bio_list, &bl);
> -}
> -
> static inline void disk_zone_wplug_set_error(struct gendisk *disk,
> struct blk_zone_wplug *zwplug)
> {
> @@ -982,14 +955,6 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
> * so that we can restore its operation code on completion.
> */
> bio_set_flag(bio, BIO_EMULATES_ZONE_APPEND);
> - } else {
> - /*
> - * Check for non-sequential writes early because we avoid a
> - * whole lot of error handling trouble if we don't send it off
> - * to the driver.
> - */
> - if (bio_offset_from_zone_start(bio) != zwplug->wp_offset)
> - goto err;
> }
>
> /* Advance the zone write pointer offset. */
> @@ -1425,7 +1390,6 @@ static void disk_zone_wplug_handle_error(struct gendisk *disk,
>
> /* Update the zone write pointer offset. */
> zwplug->wp_offset = zwplug->wp_offset_compl;
> - disk_zone_wplug_abort_unaligned(disk, zwplug);
>
> /* Restart BIO submission if we still have any BIO left. */
> if (!bio_list_empty(&zwplug->bio_list)) {
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 07/26] block: Support block drivers that preserve the order of write requests
2024-11-19 0:27 ` [PATCH v16 07/26] block: Support block drivers that preserve the order of write requests Bart Van Assche
@ 2024-11-19 7:37 ` Damien Le Moal
2024-11-19 21:08 ` Bart Van Assche
0 siblings, 1 reply; 73+ messages in thread
From: Damien Le Moal @ 2024-11-19 7:37 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim,
Hannes Reinecke, Nitesh Shetty, Ming Lei
On 11/19/24 09:27, Bart Van Assche wrote:
> Many but not all storage controllers require serialization of zoned
> writes. Introduce a new request queue limit member variable related to
> write serialization. 'driver_preserves_write_order' allows block drivers
> to indicate that the order of write commands is preserved per hardware
> queue and hence that serialization of writes per zone is not required if
> all pending writes are submitted to the same hardware queue.
>
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Nitesh Shetty <nj.shetty@samsung.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/blk-settings.c | 2 ++
> include/linux/blkdev.h | 5 +++++
> 2 files changed, 7 insertions(+)
>
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index f1d4dfdc37a7..329d8b65a8d7 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -633,6 +633,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> }
> t->max_secure_erase_sectors = min_not_zero(t->max_secure_erase_sectors,
> b->max_secure_erase_sectors);
> + t->driver_preserves_write_order = t->driver_preserves_write_order &&
> + b->driver_preserves_write_order;
> t->zone_write_granularity = max(t->zone_write_granularity,
> b->zone_write_granularity);
> if (!(t->features & BLK_FEAT_ZONED)) {
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index a1fd0ddce5cf..72be33d02d1f 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -397,6 +397,11 @@ struct queue_limits {
>
> unsigned int max_open_zones;
> unsigned int max_active_zones;
> + /*
> + * Whether or not the block driver preserves the order of write
> + * requests. Set by the block driver.
> + */
> + bool driver_preserves_write_order;
Why not make this a q->features flag ?
>
> /*
> * Drivers that set dma_alignment to less than 511 must be prepared to
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 09/26] mq-deadline: Remove a local variable
2024-11-19 0:27 ` [PATCH v16 09/26] mq-deadline: Remove a local variable Bart Van Assche
@ 2024-11-19 7:38 ` Damien Le Moal
2024-11-19 21:11 ` Bart Van Assche
0 siblings, 1 reply; 73+ messages in thread
From: Damien Le Moal @ 2024-11-19 7:38 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/19/24 09:27, Bart Van Assche wrote:
> Since commit fde02699c242 ("block: mq-deadline: Remove support for zone
> write locking"), the local variable 'insert_before' is assigned once and
> is used once. Hence remove this local variable.
>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Looks good.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
This patch can be sent separately from this series I think. It does not really
belong here, no ?
> ---
> block/mq-deadline.c | 5 +----
> 1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index acdc28756d9d..2edf84b1bc2a 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -699,8 +699,6 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> list_add(&rq->queuelist, &per_prio->dispatch);
> rq->fifo_time = jiffies;
> } else {
> - struct list_head *insert_before;
> -
> deadline_add_rq_rb(per_prio, rq);
>
> if (rq_mergeable(rq)) {
> @@ -713,8 +711,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> * set expire time and add to fifo list
> */
> rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
> - insert_before = &per_prio->fifo_list[data_dir];
> - list_add_tail(&rq->queuelist, insert_before);
> + list_add_tail(&rq->queuelist, &per_prio->fifo_list[data_dir]);
> }
> }
>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 10/26] blk-mq: Clean up blk_mq_requeue_work()
2024-11-19 0:27 ` [PATCH v16 10/26] blk-mq: Clean up blk_mq_requeue_work() Bart Van Assche
@ 2024-11-19 7:39 ` Damien Le Moal
0 siblings, 0 replies; 73+ messages in thread
From: Damien Le Moal @ 2024-11-19 7:39 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/19/24 09:27, Bart Van Assche wrote:
> Move a statement that occurs in both branches of an if-statement in front
> of the if-statement. Fix a typo in a source code comment. No functionality
> has been changed.
>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Looks good.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 11/26] block: Optimize blk_mq_submit_bio() for the cache hit scenario
2024-11-19 0:28 ` [PATCH v16 11/26] block: Optimize blk_mq_submit_bio() for the cache hit scenario Bart Van Assche
@ 2024-11-19 7:40 ` Damien Le Moal
0 siblings, 0 replies; 73+ messages in thread
From: Damien Le Moal @ 2024-11-19 7:40 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/19/24 09:28, Bart Van Assche wrote:
> Help the CPU branch predictor in case of a cache hit by handling the cache
> hit scenario first.
>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Not sure this really helps but looks OK to me.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v16 12/26] block: Rework request allocation in blk_mq_submit_bio()
2024-11-19 0:28 ` [PATCH v16 12/26] block: Rework request allocation in blk_mq_submit_bio() Bart Van Assche
@ 2024-11-19 7:44 ` Damien Le Moal
0 siblings, 0 replies; 73+ messages in thread
From: Damien Le Moal @ 2024-11-19 7:44 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/19/24 09:28, Bart Van Assche wrote:
> Prepare for allocating a request from a specific hctx by making
> blk_mq_submit_bio() allocate a request later.
>
> The performance impact of this patch on the hot path is small: if a
> request is cached, one percpu_ref_get(&q->q_usage_counter) call and one
> percpu_ref_put(&q->q_usage_counter) call are added to the hot path.
>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
The patch looks OK, but I cannot weigh in on whether this is really a "small"
impact on performance. Jens may want to run his perf setup with this to verify.
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v16 14/26] blk-mq: Restore the zoned write order when requeuing
2024-11-19 0:28 ` [PATCH v16 14/26] blk-mq: Restore the zoned write order when requeuing Bart Van Assche
@ 2024-11-19 7:52 ` Damien Le Moal
2024-11-19 21:16 ` Bart Van Assche
0 siblings, 1 reply; 73+ messages in thread
From: Damien Le Moal @ 2024-11-19 7:52 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/19/24 09:28, Bart Van Assche wrote:
> Zoned writes may be requeued, e.g. if a block driver returns
> BLK_STS_RESOURCE. Requests may be requeued in a different order than the
> order in which they were submitted. Restore the request order if requests
> are requeued.
>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/bfq-iosched.c | 2 ++
> block/blk-mq.c | 20 +++++++++++++++++++-
> block/blk-mq.h | 2 ++
> block/kyber-iosched.c | 2 ++
> block/mq-deadline.c | 7 ++++++-
> include/linux/blk-mq.h | 2 +-
> 6 files changed, 32 insertions(+), 3 deletions(-)
>
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 0747d9d0e48c..13bedbf03bd2 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -6265,6 +6265,8 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>
> if (flags & BLK_MQ_INSERT_AT_HEAD) {
> list_add(&rq->queuelist, &bfqd->dispatch);
> + } else if (flags & BLK_MQ_INSERT_ORDERED) {
> + blk_mq_insert_ordered(rq, &bfqd->dispatch);
> } else if (!bfqq) {
> list_add_tail(&rq->queuelist, &bfqd->dispatch);
> } else {
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f134d5e1c4a1..1302ccbf2a7d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1564,7 +1564,9 @@ static void blk_mq_requeue_work(struct work_struct *work)
> * already. Insert it into the hctx dispatch list to avoid
> * block layer merges for the request.
> */
> - if (rq->rq_flags & RQF_DONTPREP)
> + if (blk_rq_is_seq_zoned_write(rq))
> + blk_mq_insert_request(rq, BLK_MQ_INSERT_ORDERED);
Is this OK to do without any starvation prevention ? A high LBA write that
constantly gets requeued behind low LBA writes could end up in a timeout
situation, no ?
> + else if (rq->rq_flags & RQF_DONTPREP)
> blk_mq_request_bypass_insert(rq, 0);
> else
> blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
> @@ -2599,6 +2601,20 @@ static void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx,
> blk_mq_run_hw_queue(hctx, run_queue_async);
> }
>
> +void blk_mq_insert_ordered(struct request *rq, struct list_head *list)
> +{
> + struct request_queue *q = rq->q;
> + struct request *rq2;
> +
> + list_for_each_entry(rq2, list, queuelist)
> + if (rq2->q == q && blk_rq_pos(rq2) > blk_rq_pos(rq))
> + break;
> +
> + /* Insert rq before rq2. If rq2 is the list head, append at the end. */
> + list_add_tail(&rq->queuelist, &rq2->queuelist);
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_insert_ordered);
> +
> static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
> {
> struct request_queue *q = rq->q;
> @@ -2653,6 +2669,8 @@ static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
> spin_lock(&ctx->lock);
> if (flags & BLK_MQ_INSERT_AT_HEAD)
> list_add(&rq->queuelist, &ctx->rq_lists[hctx->type]);
> + else if (flags & BLK_MQ_INSERT_ORDERED)
> + blk_mq_insert_ordered(rq, &ctx->rq_lists[hctx->type]);
> else
> list_add_tail(&rq->queuelist,
> &ctx->rq_lists[hctx->type]);
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 309db553aba6..10b9fb3ca762 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -40,8 +40,10 @@ enum {
>
> typedef unsigned int __bitwise blk_insert_t;
> #define BLK_MQ_INSERT_AT_HEAD ((__force blk_insert_t)0x01)
> +#define BLK_MQ_INSERT_ORDERED ((__force blk_insert_t)0x02)
>
> void blk_mq_submit_bio(struct bio *bio);
> +void blk_mq_insert_ordered(struct request *rq, struct list_head *list);
> int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, struct io_comp_batch *iob,
> unsigned int flags);
> void blk_mq_exit_queue(struct request_queue *q);
> diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
> index 4155594aefc6..77bb41bab68d 100644
> --- a/block/kyber-iosched.c
> +++ b/block/kyber-iosched.c
> @@ -603,6 +603,8 @@ static void kyber_insert_requests(struct blk_mq_hw_ctx *hctx,
> trace_block_rq_insert(rq);
> if (flags & BLK_MQ_INSERT_AT_HEAD)
> list_move(&rq->queuelist, head);
> + else if (flags & BLK_MQ_INSERT_ORDERED)
> + blk_mq_insert_ordered(rq, head);
> else
> list_move_tail(&rq->queuelist, head);
> sbitmap_set_bit(&khd->kcq_map[sched_domain],
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 2edf84b1bc2a..200e5a2928ce 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -711,7 +711,12 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> * set expire time and add to fifo list
> */
> rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
> - list_add_tail(&rq->queuelist, &per_prio->fifo_list[data_dir]);
> + if (flags & BLK_MQ_INSERT_ORDERED)
> + blk_mq_insert_ordered(rq,
> + &per_prio->fifo_list[data_dir]);
> + else
> + list_add_tail(&rq->queuelist,
> + &per_prio->fifo_list[data_dir]);
> }
> }
>
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index ac05974f08f9..f7514eefccfd 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -85,7 +85,7 @@ enum {
>
> /* flags that prevent us from merging requests: */
> #define RQF_NOMERGE_FLAGS \
> - (RQF_STARTED | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
> + (RQF_STARTED | RQF_FLUSH_SEQ | RQF_DONTPREP | RQF_SPECIAL_PAYLOAD)
>
> enum mq_rq_state {
> MQ_RQ_IDLE = 0,
--
Damien Le Moal
Western Digital Research
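As a userspace illustration of the ordering implemented by the
blk_mq_insert_ordered() hunk quoted above: each request is inserted just
before the first request with a higher start position, and appended at the
end otherwise. This is a sketch only — the dict-based request representation
is an assumption of the example, and the kernel version additionally skips
requests that belong to a different request queue.

```python
def insert_ordered(rq, fifo):
    """Insert rq into fifo in ascending order of start position.

    Mirrors the walk in blk_mq_insert_ordered(): stop at the first
    request with a strictly higher position and insert before it; if
    none is found, append at the tail (the "rq2 is the list head" case
    in the kernel code).
    """
    for i, rq2 in enumerate(fifo):
        if rq2["pos"] > rq["pos"]:
            fifo.insert(i, rq)
            return
    fifo.append(rq)

fifo = []
for pos in (300, 100, 400, 200):
    insert_ordered({"pos": pos}, fifo)
print([rq["pos"] for rq in fifo])  # → [100, 200, 300, 400]
```

Note that a request whose position equals an already queued one is inserted
after it, since only a strictly greater position stops the walk.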
* Re: [PATCH v16 15/26] blk-zoned: Document the locking order
2024-11-19 0:28 ` [PATCH v16 15/26] blk-zoned: Document the locking order Bart Van Assche
@ 2024-11-19 7:52 ` Damien Le Moal
0 siblings, 0 replies; 73+ messages in thread
From: Damien Le Moal @ 2024-11-19 7:52 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/19/24 09:28, Bart Van Assche wrote:
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v16 16/26] blk-zoned: Document locking assumptions
2024-11-19 0:28 ` [PATCH v16 16/26] blk-zoned: Document locking assumptions Bart Van Assche
@ 2024-11-19 7:53 ` Damien Le Moal
2024-11-19 21:18 ` Bart Van Assche
0 siblings, 1 reply; 73+ messages in thread
From: Damien Le Moal @ 2024-11-19 7:53 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/19/24 09:28, Bart Van Assche wrote:
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
The patch title seems incorrect. This is not documenting anything but adding
lockdep checks.
With a better patch title:
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v16 17/26] blk-zoned: Uninline functions that are not in the hot path
2024-11-19 0:28 ` [PATCH v16 17/26] blk-zoned: Uninline functions that are not in the hot path Bart Van Assche
@ 2024-11-19 7:55 ` Damien Le Moal
2024-11-19 21:20 ` Bart Van Assche
0 siblings, 1 reply; 73+ messages in thread
From: Damien Le Moal @ 2024-11-19 7:55 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/19/24 09:28, Bart Van Assche wrote:
> Apply the convention that is followed elsewhere in the block layer: only
> declare functions inline if these are in the hot path.
>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
I do not really see the point... Anyway, looks OK.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v16 18/26] blk-zoned: Make disk_should_remove_zone_wplug() more robust
2024-11-19 0:28 ` [PATCH v16 18/26] blk-zoned: Make disk_should_remove_zone_wplug() more robust Bart Van Assche
@ 2024-11-19 7:58 ` Damien Le Moal
0 siblings, 0 replies; 73+ messages in thread
From: Damien Le Moal @ 2024-11-19 7:58 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/19/24 09:28, Bart Van Assche wrote:
> Make the disk_should_remove_zone_wplug() behavior independent of the
> number of zwplug references held by the caller.
>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/blk-zoned.c | 7 ++-----
> 1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 2b4783026450..59f6559b94da 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -457,12 +457,9 @@ static inline bool disk_should_remove_zone_wplug(struct gendisk *disk,
> * blk_zone_write_plug_finish_request() (e.g. with split BIOs
> * that are chained). In such case, disk_zone_wplug_unplug_bio()
> * should not attempt to remove the zone write plug until all BIO
> - * completions are seen. Check by looking at the zone write plug
> - * reference count, which is 2 when the plug is unused (one reference
> - * taken when the plug was allocated and another reference taken by the
> - * caller context).
> + * completions are seen.
> */
> - if (refcount_read(&zwplug->ref) > 2)
> + if (zwplug->wp_offset != zwplug->wp_offset_compl)
See my comment in patch 5 about wp_offset_compl. I do not think it works.
> return false;
>
> /* We can remove zone write plugs for zones that are empty or full. */
--
Damien Le Moal
Western Digital Research
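The intent of the reworked check discussed above can be sketched as follows:
a zone write plug is only removable once the submission-side write pointer
offset has been matched by the completion-side offset, i.e. once no zoned
write is still in flight for the zone. This is illustrative pseudologic
under the patch's field names, not the actual kernel code, and the
zone-empty-or-full check that follows in the real function is elided.

```python
def should_remove_zone_wplug(wp_offset, wp_offset_compl):
    """Return True only when every submitted sector has also completed,
    i.e. when no zoned write for this zone is still in flight."""
    return wp_offset == wp_offset_compl

# A plug with a submitted-but-uncompleted write must stay around:
print(should_remove_zone_wplug(wp_offset=8, wp_offset_compl=0))  # → False
# Once completions have caught up, removal may proceed:
print(should_remove_zone_wplug(wp_offset=8, wp_offset_compl=8))  # → True
```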
* Re: [PATCH v16 00/26] Improve write performance for zoned UFS devices
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (25 preceding siblings ...)
2024-11-19 0:28 ` [PATCH v16 26/26] scsi: ufs: Inform the block layer about write ordering Bart Van Assche
@ 2024-11-19 8:01 ` Damien Le Moal
2024-11-19 19:08 ` Bart Van Assche
2025-01-09 19:02 ` Bart Van Assche
2024-11-19 12:25 ` Christoph Hellwig
27 siblings, 2 replies; 73+ messages in thread
From: Damien Le Moal @ 2024-11-19 8:01 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/19/24 09:27, Bart Van Assche wrote:
> Hi Damien and Christoph,
>
> This patch series improves small write IOPS by a factor of four (+300%) for
> zoned UFS devices on my test setup with an UFSHCI 3.0 controller. Although
> you are probably busy because the merge window is open, please take a look
> at this patch series when you have the time. This patch series is organized
> as follows:
> - Bug fixes for existing code at the start of the series.
> - The write pipelining support implementation comes after the bug fixes.
Impressive improvements, but the changes are rather invasive. Have you tried a
simpler solution, such as forcing the unplugging of a zone write plug from the
driver once a command has been passed to the driver and the driver has not
rejected it? It seems like this would make everything simpler on the block
layer side. But I am not sure whether the performance gains would be the same.
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v16 00/26] Improve write performance for zoned UFS devices
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
` (26 preceding siblings ...)
2024-11-19 8:01 ` [PATCH v16 00/26] Improve write performance for zoned UFS devices Damien Le Moal
@ 2024-11-19 12:25 ` Christoph Hellwig
2024-11-19 18:52 ` Bart Van Assche
27 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2024-11-19 12:25 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, linux-scsi, Christoph Hellwig,
Damien Le Moal, Jaegeuk Kim
On Mon, Nov 18, 2024 at 04:27:49PM -0800, Bart Van Assche wrote:
> Hi Damien and Christoph,
>
> This patch series improves small write IOPS by a factor of four (+300%) for
> zoned UFS devices on my test setup with an UFSHCI 3.0 controller.
What's your exact test setup? Which upstream-kernel-supported zoned UFS
device are you using?
* Re: [PATCH v16 00/26] Improve write performance for zoned UFS devices
2024-11-19 12:25 ` Christoph Hellwig
@ 2024-11-19 18:52 ` Bart Van Assche
0 siblings, 0 replies; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 18:52 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, linux-scsi, Damien Le Moal, Jaegeuk Kim
On 11/19/24 4:25 AM, Christoph Hellwig wrote:
> What's your exact test setup?
Pixel 2023 and 2025 development boards with android-mainline kernel
(https://android.googlesource.com/kernel/common/+/refs/heads/android-mainline).
The android mainline kernel tracks the upstream kernel closely. While
the performance of Pixel development boards is
identical to that of the corresponding smartphone generation, these
development boards have the following advantages:
- A UFS socket that allows swapping UFS devices without any soldering.
- A USB port that makes it easy to monitor and capture kernel log
messages.
> Which upstream-kernel-supported zoned UFS device are you using?
A Micron ZUFS device. Micron zoned UFS devices have a zone size that is
a power of two.
Thanks,
Bart.
* Re: [PATCH v16 00/26] Improve write performance for zoned UFS devices
2024-11-19 8:01 ` [PATCH v16 00/26] Improve write performance for zoned UFS devices Damien Le Moal
@ 2024-11-19 19:08 ` Bart Van Assche
2024-11-21 3:20 ` Damien Le Moal
2025-01-09 19:02 ` Bart Van Assche
1 sibling, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 19:08 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/19/24 12:01 AM, Damien Le Moal wrote:
> Impressive improvements but the changes are rather invasive. Have you tried
> simpler solution like forcing unplugging a zone write plug from the driver once
> a command is passed to the driver and the driver did not reject it ? It seems
> like this would make everything simpler on the block layer side. But I am not
> sure if the performance gains would be the same.
Hi Damien,
I'm not sure that the approach of submitting a new zoned write if the
driver did not reject the previous write would result in a simpler
solution. SCSI devices are allowed to respond to any command with a unit
attention instead of processing the command. If a unit attention is
reported, the SCSI core requeues the command. In other words, even with
this approach, proper support for requeued zoned writes in the block
layer is required.
Additionally, that approach is not compatible with using .queue_rqs().
While the SCSI core does not yet support a .queue_rqs() callback, I
think it would be good to have this support in the SCSI core.
If we need requeuing support anyway, why select an approach that will
probably result in lower performance than what has been implemented in this
patch series?
Thanks,
Bart.
* Re: [PATCH v16 01/26] blk-zoned: Fix a reference count leak
2024-11-19 2:23 ` Damien Le Moal
@ 2024-11-19 20:21 ` Bart Van Assche
0 siblings, 0 replies; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 20:21 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/18/24 6:23 PM, Damien Le Moal wrote:
> On 11/19/24 9:27 AM, Bart Van Assche wrote:
>> Fix a reference count leak in disk_zone_wplug_handle_error().
>>
>> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>> ---
>> block/blk-zoned.c | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>> index 70211751df16..3346b8c53605 100644
>> --- a/block/blk-zoned.c
>> +++ b/block/blk-zoned.c
>> @@ -1337,6 +1337,8 @@ static void disk_zone_wplug_handle_error(struct gendisk *disk,
>>
>> unlock:
>> spin_unlock_irqrestore(&zwplug->lock, flags);
>> +
>> + disk_put_zone_wplug(zwplug);
>
> The zone wplug put call is right after the single call site to
> disk_zone_wplug_handle_error(). The reason it is *not* in that function is that
> the reference on the wplug for handling an error is taken when the wplug is
> added to the error list. disk_zone_wplug_handle_error() does not itself take a
> reference on the wplug.
>
> So how did you come up with this ? What workload/operation did you run to find
> an issue ?
Thanks for the feedback. I can't reproduce the refcount leak without my
other patches. I will check which other patches this patch has to be
combined with to trigger the leak. This is the script that triggered it:
#!/bin/bash
set -eu
qd=${1:-64}
if modprobe -r scsi_debug; then :; fi
params=(
	delay=0
	dev_size_mb=256
	every_nth=$((2 * qd))
	max_queue=${qd}
	ndelay=100000	# 100 us
	opts=0x8000	# SDEBUG_OPT_HOST_BUSY
	preserves_write_order=1
	sector_size=4096
	zbc=host-managed
	zone_nr_conv=0
	zone_size_mb=4
)
modprobe scsi_debug "${params[@]}"
while true; do
	bdev=$(cd /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block &&
	       echo *) 2>/dev/null
	if [ -e /dev/"${bdev}" ]; then break; fi
	sleep .1
done
dev=/dev/"${bdev}"
[ -b "${dev}" ]
for rw in write randwrite; do
	params=(
		--direct=1
		--filename="${dev}"
		--iodepth="${qd}"
		--ioengine=io_uring
		--ioscheduler=none
		--gtod_reduce=1
		--hipri=0
		--name="$(basename "${dev}")"
		--runtime=30
		--rw="$rw"
		--time_based=1
		--zonemode=zbd
	)
	fio "${params[@]}"
	rc=$?
	if grep -avH " ref 1 " /sys/kernel/debug/block/sda/zone_wplugs; then :; fi
	echo ''
	[ $rc = 0 ] || break
done
* Re: [PATCH v16 04/26] blk-zoned: Only handle errors after pending zoned writes have completed
2024-11-19 2:50 ` Damien Le Moal
@ 2024-11-19 20:51 ` Bart Van Assche
2024-11-21 3:23 ` Damien Le Moal
0 siblings, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 20:51 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/18/24 6:50 PM, Damien Le Moal wrote:
> On 11/19/24 9:27 AM, Bart Van Assche wrote:
>> Instead of handling write errors immediately, only handle these after all
>> pending zoned write requests have completed or have been requeued. This
>> patch prepares for changing the zone write pointer tracking approach.
>
> A little more explanation about how this is achieved would be nice. I was
> expecting a shorter change given the short commit message... It took some
> time to understand the changes without the details.
Hi Damien,
I will make the patch description more detailed.
>> @@ -608,6 +608,8 @@ static inline void disk_zone_wplug_set_error(struct gendisk *disk,
>> if (zwplug->flags & BLK_ZONE_WPLUG_ERROR)
>> return;
>>
>> + zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
>> + zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
>
> Why move these before the comment ? Also, why set BLK_ZONE_WPLUG_PLUGGED ? It
> should be set already since this is handling a failed write that was either
> being prepared for submission or submitted (and completed) already. In both
> cases, the wplug is plugged since we have a write in flight.
>
>> /*
>> * At this point, we already have a reference on the zone write plug.
>> * However, since we are going to add the plug to the disk zone write
>> @@ -616,7 +618,6 @@ static inline void disk_zone_wplug_set_error(struct gendisk *disk,
>> * handled, or in disk_zone_wplug_clear_error() if the zone is reset or
>> * finished.
>> */
>> - zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
>> refcount_inc(&zwplug->ref);
Does the comment block refer to the refcount_inc() call only or to both
the flags changes and the refcount_inc() call? I'm assuming the former.
That's why I moved the "zwplug->flags |= BLK_ZONE_WPLUG_ERROR;"
statement. Did I perhaps misunderstand something?
>> +/*
>> + * Changes @data->res to %false if and only if @rq is a zoned write for the
>> + * given zone and if it is owned by the block driver.
>
> It would be nice to have a request flag or a state indicating that instead of
> needing all this code... Can't that be done ?
rq->state and rq->bio are modified independently and are independent of
whether or not blk_rq_is_seq_zoned_write() evaluates to true. I'm
concerned that introducing the proposed flag would result in numerous
changes in the block layer and hence would make the block layer harder
to maintain.
>> +/*
>> + * Change the zone state to "error" if a request is requeued to postpone
>> + * processing of requeued requests until all pending requests have either
>> + * completed or have been requeued.
>> + */
>> +void blk_zone_write_plug_requeue_request(struct request *rq)
>> +{
>> + struct gendisk *disk = rq->q->disk;
>> + struct blk_zone_wplug *zwplug;
>> +
>> + if (!disk->zone_wplugs_hash_bits || !blk_rq_is_seq_zoned_write(rq))
>> + return;
>
> I think the disk->zone_wplugs_hash_bits check needs to go inside
> disk_get_zone_wplug() as that will avoid a similar check in
> blk_zone_write_plug_free_request() too. That said, I am not even convinced it
> is needed at all since these functions should be called only for a zoned drive
> which should have its zone wplug hash setup.
Moving the disk->zone_wplugs_hash_bits check sounds good to me.
I added this check after hitting an UBSAN report that indicates that
disk->zone_wplugs_hash_bits was used before it was changed into a non-
zero value. sd_revalidate_disk() submits a very substantial number of
SCSI commands before it calls blk_revalidate_disk_zones(), the function
that sets disk->zone_wplugs_hash_bits.
>> @@ -1343,14 +1459,15 @@ static void disk_zone_wplug_handle_error(struct gendisk *disk,
>>
>> static void disk_zone_process_err_list(struct gendisk *disk)
>> {
>> - struct blk_zone_wplug *zwplug;
>> + struct blk_zone_wplug *zwplug, *next;
>> unsigned long flags;
>>
>> spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
>>
>> - while (!list_empty(&disk->zone_wplugs_err_list)) {
>> - zwplug = list_first_entry(&disk->zone_wplugs_err_list,
>> - struct blk_zone_wplug, link);
>> + list_for_each_entry_safe(zwplug, next, &disk->zone_wplugs_err_list,
>> + link) {
>
> You are holding the disk zwplug spinlock, so why use the _safe() loop ? Not
> needed, right ?
>
>> + if (!blk_zone_all_zwr_inserted(zwplug))
>> + continue;
>> list_del_init(&zwplug->link);
>> spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
The _safe suffix in list_for_each_entry_safe() is not related to locking
- it means that it is allowed to delete the loop entry 'zwplug' from
the list.
I think the _safe variant is needed because of the list_del_init() call.
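The difference can be illustrated outside the kernel:
list_for_each_entry_safe() caches the next element before the loop body
runs, so the body may delete the current entry without breaking the
traversal. A rough Python analogue follows, where iterating over a snapshot
plays the role of the cached next pointer; the function and variable names
are illustrative only, not kernel API.

```python
def process_err_list(err_list, all_zwr_inserted):
    """Walk err_list and remove every zone write plug whose zoned writes
    have all been inserted; skip the others.

    Iterating over a copy (list(err_list)) makes deleting the current
    entry safe, analogous to list_for_each_entry_safe() in the kernel;
    iterating over err_list directly while removing from it would skip
    elements.
    """
    for zwplug in list(err_list):
        if not all_zwr_inserted(zwplug):
            continue
        err_list.remove(zwplug)  # corresponds to list_del_init()
    return err_list

# Plugs 2 and 4 are "ready" and get removed; 1 and 3 stay queued.
remaining = process_err_list([1, 2, 3, 4], lambda z: z % 2 == 0)
print(remaining)  # → [1, 3]
```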
>> +/* May be called from interrupt context and hence must not sleep. */
>> +void blk_zone_requeue_work(struct request_queue *q)
>> +{
>> + struct gendisk *disk = q->disk;
>> +
>> + if (!disk)
>> + return;
>
> Can this happen ?
The code in scsi_scan.c executes SCSI commands like INQUIRY before
sd_probe() attaches a disk (device_add_disk()) so I think that any code
inserted into blk_mq_requeue_work() must support q->disk == NULL.
>> @@ -1854,8 +1991,11 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
>> zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
>> spin_unlock_irqrestore(&zwplug->lock, flags);
>>
>> - seq_printf(m, "%u 0x%x %u %u %u\n", zwp_zone_no, zwp_flags, zwp_ref,
>> - zwp_wp_offset, zwp_bio_list_size);
>> + bool all_zwr_inserted = blk_zone_all_zwr_inserted(zwplug);
>
> Is this bool really needed ? If it is, shouldn't it be assigned while holding
> the zwplug lock to have a "snapshot" of the plug with all printed values
> consistent ?
Since blk_zone_all_zwr_inserted() checks information that is not
related to the zwplug state (e.g. the requeue list), I don't think that
the zwplug lock should be held around this blk_zone_all_zwr_inserted()
call.
I can keep this change as a local debugging patch if you would prefer
this.
>> +void blk_zone_write_plug_requeue_request(struct request *rq);
>> +static inline void blk_zone_requeue_request(struct request *rq)
>> +{
>> + if (!blk_rq_is_seq_zoned_write(rq))
>> + return;
>> +
>> + blk_zone_write_plug_requeue_request(rq);
>
> May be:
>
> if (blk_rq_is_seq_zoned_write(rq))
> blk_zone_write_plug_requeue_request(rq);
>
> ?
I can make this change - I don't have a strong opinion about which style
to prefer.
Thanks,
Bart.
* Re: [PATCH v16 05/26] blk-zoned: Fix a deadlock triggered by unaligned writes
2024-11-19 2:57 ` Damien Le Moal
@ 2024-11-19 21:04 ` Bart Van Assche
2024-11-21 3:32 ` Damien Le Moal
2025-01-09 19:11 ` Bart Van Assche
1 sibling, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 21:04 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/18/24 6:57 PM, Damien Le Moal wrote:
> On 11/19/24 9:27 AM, Bart Van Assche wrote:
>> If the queue is filled with unaligned writes then the following
>> deadlock occurs:
>>
>> Call Trace:
>> <TASK>
>> __schedule+0x8cc/0x2190
>> schedule+0xdd/0x2b0
>> blk_queue_enter+0x2ce/0x4f0
>> blk_mq_alloc_request+0x303/0x810
>> scsi_execute_cmd+0x3f4/0x7b0
>> sd_zbc_do_report_zones+0x19e/0x4c0
>> sd_zbc_report_zones+0x304/0x920
>> disk_zone_wplug_handle_error+0x237/0x920
>> disk_zone_wplugs_work+0x17e/0x430
>> process_one_work+0xdd0/0x1490
>> worker_thread+0x5eb/0x1010
>> kthread+0x2e5/0x3b0
>> ret_from_fork+0x3a/0x80
>> </TASK>
>>
>> Fix this deadlock by removing the disk->fops->report_zones() call and by
>> deriving the write pointer information from successfully completed zoned
>> writes.
>>
>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>
> Doesn't this need a Fixes tag and CC stable, and come earlier in the series (or
> sent separately) ?
I will add Fixes: and Cc: stable tags.
I'm not sure how to move this patch earlier since it depends on the
previous patch in this series ("blk-zoned: Only handle errors after
pending zoned writes have completed"). Without that patch, it is not
safe to use zwplug->wp_offset_compl in the error handler.
> Overall, this patch seems wrong anyway, as a zone reset or zone finish may
> be done between two writes, failing the next one, but the recovery done here
> will use the end position of the previous successful write as the wp, which
> is NOT correct since the reset or finish changed it...
I will add support for the zone reset and zone finish commands in this
patch.
> And we also have the possibility of torn writes
> (partial writes) with SAS SMR drives. So I really think that you cannot avoid
> doing a report zone to recover errors.
Thanks for having brought this up. This is something I was not aware of.
disk_zone_wplug_handle_error() submits a new request to retrieve zone
information while handling an error triggered by other requests. This
can easily lead to a deadlock as the above call trace shows. How about
introducing a queue flag for the "report zones" approach in
disk_zone_wplug_handle_error() such that the "report zones" approach is
only used for SAS SMR drives?
Thanks,
Bart.
* Re: [PATCH v16 06/26] blk-zoned: Fix requeuing of zoned writes
2024-11-19 3:00 ` Damien Le Moal
@ 2024-11-19 21:06 ` Bart Van Assche
0 siblings, 0 replies; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 21:06 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/18/24 7:00 PM, Damien Le Moal wrote:
> On 11/19/24 9:27 AM, Bart Van Assche wrote:
>> Make sure that unaligned writes are sent to the block driver. This
>> allows the block driver to limit the number of retries of unaligned
>> writes. Remove disk_zone_wplug_abort_unaligned() because it only examines
>> the bio plug list. Pending writes can occur either on the bio plug list
>> or on the request queue requeue list.
>
> There can be only one write in flight, so at most one write in the requeue
> list... And if that write was requeued, it means that it was not even
> started, so it has not failed yet. So I am not sure what this is all about.
>
> Am I missing something ?
Patch 20/26 in this series enables support for multiple in-flight zoned
writes per zone.
Thanks,
Bart.
* Re: [PATCH v16 07/26] block: Support block drivers that preserve the order of write requests
2024-11-19 7:37 ` Damien Le Moal
@ 2024-11-19 21:08 ` Bart Van Assche
0 siblings, 0 replies; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 21:08 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim,
Hannes Reinecke, Nitesh Shetty, Ming Lei
On 11/18/24 11:37 PM, Damien Le Moal wrote:
> On 11/19/24 09:27, Bart Van Assche wrote:
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index a1fd0ddce5cf..72be33d02d1f 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -397,6 +397,11 @@ struct queue_limits {
>>
>> unsigned int max_open_zones;
>> unsigned int max_active_zones;
>> + /*
>> + * Whether or not the block driver preserves the order of write
>> + * requests. Set by the block driver.
>> + */
>> + bool driver_preserves_write_order;
>
> Why not make this a q->features flag ?
I assume that you meant q->limits.features? I will look into this.
Thanks,
Bart.
* Re: [PATCH v16 09/26] mq-deadline: Remove a local variable
2024-11-19 7:38 ` Damien Le Moal
@ 2024-11-19 21:11 ` Bart Van Assche
0 siblings, 0 replies; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 21:11 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/18/24 11:38 PM, Damien Le Moal wrote:
> On 11/19/24 09:27, Bart Van Assche wrote:
>> Since commit fde02699c242 ("block: mq-deadline: Remove support for zone
>> write locking"), the local variable 'insert_before' is assigned once and
>> is used once. Hence remove this local variable.
>>
>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>
> Looks good.
>
> Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
>
> This patch can be sent separately from this series I think. It does not really
> belong here, no ?
Thanks for the review. There is another mq-deadline change in this
series that depends on this change ("blk-mq: Restore the zoned write
order when requeuing"). This is why I included this patch in this
series.
Bart.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 14/26] blk-mq: Restore the zoned write order when requeuing
2024-11-19 7:52 ` Damien Le Moal
@ 2024-11-19 21:16 ` Bart Van Assche
0 siblings, 0 replies; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 21:16 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/18/24 11:52 PM, Damien Le Moal wrote:
> On 11/19/24 09:28, Bart Van Assche wrote:
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index f134d5e1c4a1..1302ccbf2a7d 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -1564,7 +1564,9 @@ static void blk_mq_requeue_work(struct work_struct *work)
>> * already. Insert it into the hctx dispatch list to avoid
>> * block layer merges for the request.
>> */
>> - if (rq->rq_flags & RQF_DONTPREP)
>> + if (blk_rq_is_seq_zoned_write(rq))
>> + blk_mq_insert_request(rq, BLK_MQ_INSERT_ORDERED);
>
> Is this OK to do without any starvation prevention ? A high LBA write that
> constantly gets requeued behind low LBA writes could end up in a timeout
> situation, no ?
Hi Damien,
Requeuing zoned writes should be exceptional. Such starvation can only
happen if zoned writes for two different zones are requeued over and
over again. If that happens there will be not only starvation for the
write with the higher LBA but also retry count exhaustion for the write
with the lower LBA. If we agree that zoned write retries are rare then
I don't think we have to worry about this kind of starvation.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 16/26] blk-zoned: Document locking assumptions
2024-11-19 7:53 ` Damien Le Moal
@ 2024-11-19 21:18 ` Bart Van Assche
2024-11-21 3:34 ` Damien Le Moal
0 siblings, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 21:18 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/18/24 11:53 PM, Damien Le Moal wrote:
> On 11/19/24 09:28, Bart Van Assche wrote:
>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>
> The patch title seems incorrect. This is not documenting anything but adding
> lockdep checks.
Hmm ... what this patch does is to document what the locking assumptions
are for the modified functions. So I'm not sure why the patch title is
considered incorrect?
Thanks,
Bart.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 17/26] blk-zoned: Uninline functions that are not in the hot path
2024-11-19 7:55 ` Damien Le Moal
@ 2024-11-19 21:20 ` Bart Van Assche
2024-11-21 3:36 ` Damien Le Moal
0 siblings, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2024-11-19 21:20 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/18/24 11:55 PM, Damien Le Moal wrote:
> I do not really see the point... Anyway, looks OK.
Hi Damien,
One of the approaches I used to debug this patch series is to add
trace_printk() calls to track zwplug state changes. trace_printk()
reports the function name in front of the information in its argument
list. I noticed that the function name reported by trace_printk() is
incorrect for inlined functions. Hence this patch.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 00/26] Improve write performance for zoned UFS devices
2024-11-19 19:08 ` Bart Van Assche
@ 2024-11-21 3:20 ` Damien Le Moal
2024-11-21 18:00 ` Bart Van Assche
0 siblings, 1 reply; 73+ messages in thread
From: Damien Le Moal @ 2024-11-21 3:20 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/20/24 04:08, Bart Van Assche wrote:
> On 11/19/24 12:01 AM, Damien Le Moal wrote:
>> Impressive improvements but the changes are rather invasive. Have you tried
>> simpler solution like forcing unplugging a zone write plug from the driver once
>> a command is passed to the driver and the driver did not reject it ? It seems
>> like this would make everything simpler on the block layer side. But I am not
>> sure if the performance gains would be the same.
>
> Hi Damien,
>
> I'm not sure that the approach of submitting a new zoned write if the
> driver did not reject the previous write would result in a simpler
> solution. SCSI devices are allowed to respond to any command with a unit
> attention instead of processing the command. If a unit attention is
> reported, the SCSI core requeues the command. In other words, even with
> this approach, proper support for requeued zoned writes in the block
> layer is required.
Yes, but it would be vastly simpler because you would be guaranteed to have
only a single write request per zone in flight between the write plug and
the device at any time. So the requeue would not need reordering, and likely not
need any special code at all. Unless I am missing something, this would be
simpler, no ?
The main question though with such approach is: does it give you the same
performance improvements as your current (more invasive) approach ?
> Additionally, that approach is not compatible with using .queue_rqs().
> While the SCSI core does not yet support a .queue_rqs() callback, I
> think it would be good to have this support in the SCSI core.
I do not understand why it is not compatible. What is the problem ?
> If we need requeuing support anyway, why to select an approach that
> probably will result in lower performance than what has been implemented
> in this patch series?
I am only trying to see if there is not a simpler approach than what you did.
The less changes, the better, right ?
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 04/26] blk-zoned: Only handle errors after pending zoned writes have completed
2024-11-19 20:51 ` Bart Van Assche
@ 2024-11-21 3:23 ` Damien Le Moal
2024-11-21 17:43 ` Bart Van Assche
0 siblings, 1 reply; 73+ messages in thread
From: Damien Le Moal @ 2024-11-21 3:23 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/20/24 05:51, Bart Van Assche wrote:
>>> +/*
>>> + * Change the zone state to "error" if a request is requeued to postpone
>>> + * processing of requeued requests until all pending requests have either
>>> + * completed or have been requeued.
>>> + */
>>> +void blk_zone_write_plug_requeue_request(struct request *rq)
>>> +{
>>> + struct gendisk *disk = rq->q->disk;
>>> + struct blk_zone_wplug *zwplug;
>>> +
>>> + if (!disk->zone_wplugs_hash_bits || !blk_rq_is_seq_zoned_write(rq))
>>> + return;
>>
>> I think the disk->zone_wplugs_hash_bits check needs to go inside
>> disk_get_zone_wplug() as that will avoid a similar check in
>> blk_zone_write_plug_free_request() too. That said, I am not even convinced it
>> is needed at all since these functions should be called only for a zoned drive
>> which should have its zone wplug hash setup.
>
> Moving the disk->zone_wplugs_hash_bits check sounds good to me.
>
> I added this check after hitting an UBSAN report that indicates that
> disk->zone_wplugs_hash_bits was used before it was changed into a non-
> zero value. sd_revalidate_disk() submits a very substantial number of
> SCSI commands before it calls blk_revalidate_disk_zones(), the function
> that sets disk->zone_wplugs_hash_bits.
But none of the commands are writes to sequential zones, so the hash bits check
should not even be necessary, no ?
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 05/26] blk-zoned: Fix a deadlock triggered by unaligned writes
2024-11-19 21:04 ` Bart Van Assche
@ 2024-11-21 3:32 ` Damien Le Moal
2024-11-21 17:51 ` Bart Van Assche
0 siblings, 1 reply; 73+ messages in thread
From: Damien Le Moal @ 2024-11-21 3:32 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/20/24 06:04, Bart Van Assche wrote:
> On 11/18/24 6:57 PM, Damien Le Moal wrote:
>> On 11/19/24 9:27 AM, Bart Van Assche wrote:
>>> If the queue is filled with unaligned writes then the following
>>> deadlock occurs:
>>>
>>> Call Trace:
>>> <TASK>
>>> __schedule+0x8cc/0x2190
>>> schedule+0xdd/0x2b0
>>> blk_queue_enter+0x2ce/0x4f0
>>> blk_mq_alloc_request+0x303/0x810
>>> scsi_execute_cmd+0x3f4/0x7b0
>>> sd_zbc_do_report_zones+0x19e/0x4c0
>>> sd_zbc_report_zones+0x304/0x920
>>> disk_zone_wplug_handle_error+0x237/0x920
>>> disk_zone_wplugs_work+0x17e/0x430
>>> process_one_work+0xdd0/0x1490
>>> worker_thread+0x5eb/0x1010
>>> kthread+0x2e5/0x3b0
>>> ret_from_fork+0x3a/0x80
>>> </TASK>
>>>
>>> Fix this deadlock by removing the disk->fops->report_zones() call and by
>>> deriving the write pointer information from successfully completed zoned
>>> writes.
>>>
>>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>>
>> Doesn't this need a Fixes tag and CC stable, and come earlier in the series (or
>> sent separately) ?
>
> I will add Fixes: and Cc: stable tags.
>
> I'm not sure how to move this patch earlier since it depends on the
> previous patch in this series ("blk-zoned: Only handle errors after
> pending zoned writes have completed"). Without that patch, it is not
> safe to use zwplug->wp_offset_compl in the error handler.
>
>> Overall, this patch seems wrong anyway as zone reset and zone finish may be
>> done between 2 writes, failing the next one but the recovery done here will use
>> the previous successful write end position as the wp, which is NOT correct as
>> reset or finish changed that...
>
> I will add support for the zone reset and zone finish commands in this
> patch.
>
>> And we also have the possibility of torn writes
>> (partial writes) with SAS SMR drives. So I really think that you cannot avoid
>> doing a report zone to recover errors.
>
> Thanks for having brought this up. This is something I was not aware of.
>
> disk_zone_wplug_handle_error() submits a new request to retrieve zone
> information while handling an error triggered by other requests. This
> can easily lead to a deadlock as the above call trace shows. How about
> introducing a queue flag for the "report zones" approach in
> disk_zone_wplug_handle_error() such that the "report zones" approach is
> only used for SAS SMR drives?
Sure, but how would that solve the potential deadlock problem ? Also, I am not
entirely clear on how the deadlock can happen given that zone write plugs are
queueing/blocking BIOs, not requests. So even assuming you have a large number
of BIOs plugged in a zone write plug, the error handler work should still be
able to issue a request to do a report zones, no ? On which resource can the
deadlock happen ? Plugged BIOs do not yet use a tag, right ?
What am I missing here ? Or is it maybe something that can happen with your
modifications because you changed the zone write plug behavior to allow for more
than one BIO at a time being unplugged and issued to the device ?
Note that if you do have a test case for this triggering the deadlock, we
definitely need to solve this and ideally have a blktest case checking it.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 16/26] blk-zoned: Document locking assumptions
2024-11-19 21:18 ` Bart Van Assche
@ 2024-11-21 3:34 ` Damien Le Moal
0 siblings, 0 replies; 73+ messages in thread
From: Damien Le Moal @ 2024-11-21 3:34 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/20/24 06:18, Bart Van Assche wrote:
> On 11/18/24 11:53 PM, Damien Le Moal wrote:
>> On 11/19/24 09:28, Bart Van Assche wrote:
>>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>>
>> The patch title seems incorrect. This is not documenting anything but adding
>> lockdep checks.
>
> Hmm ... what this patch does is to document what the locking assumptions
> are for the modified functions. So I'm not sure why the patch title is
> considered incorrect?
Seeing the word "Document" in the patch title, I expected text/comments. These
are lockdep checks, so maybe just state that, e.g. "Add lockdep checks..."
>
> Thanks,
>
> Bart.
>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 17/26] blk-zoned: Uninline functions that are not in the hot path
2024-11-19 21:20 ` Bart Van Assche
@ 2024-11-21 3:36 ` Damien Le Moal
0 siblings, 0 replies; 73+ messages in thread
From: Damien Le Moal @ 2024-11-21 3:36 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/20/24 06:20, Bart Van Assche wrote:
> On 11/18/24 11:55 PM, Damien Le Moal wrote:
>> I do not really see the point... Anyway, looks OK.
> Hi Damien,
>
> One of the approaches I used to debug this patch series is to add
> trace_printk() calls to track zwplug state changes. trace_printk()
> reports the function name in front of the information in its argument
> list. I noticed that the function name reported by trace_printk() is
> incorrect for inlined functions. Hence this patch.
Maybe add this explanation to the commit message ? It does make sense to not
inline these functions when tracing.
>
> Thanks,
>
> Bart.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 04/26] blk-zoned: Only handle errors after pending zoned writes have completed
2024-11-21 3:23 ` Damien Le Moal
@ 2024-11-21 17:43 ` Bart Van Assche
0 siblings, 0 replies; 73+ messages in thread
From: Bart Van Assche @ 2024-11-21 17:43 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/20/24 7:23 PM, Damien Le Moal wrote:
> On 11/20/24 05:51, Bart Van Assche wrote:
>>>> +/*
>>>> + * Change the zone state to "error" if a request is requeued to postpone
>>>> + * processing of requeued requests until all pending requests have either
>>>> + * completed or have been requeued.
>>>> + */
>>>> +void blk_zone_write_plug_requeue_request(struct request *rq)
>>>> +{
>>>> + struct gendisk *disk = rq->q->disk;
>>>> + struct blk_zone_wplug *zwplug;
>>>> +
>>>> + if (!disk->zone_wplugs_hash_bits || !blk_rq_is_seq_zoned_write(rq))
>>>> + return;
>>>
>>> I think the disk->zone_wplugs_hash_bits check needs to go inside
>>> disk_get_zone_wplug() as that will avoid a similar check in
>>> blk_zone_write_plug_free_request() too. That said, I am not even convinced it
>>> is needed at all since these functions should be called only for a zoned drive
>>> which should have its zone wplug hash setup.
>>
>> Moving the disk->zone_wplugs_hash_bits check sounds good to me.
>>
>> I added this check after hitting an UBSAN report that indicates that
>> disk->zone_wplugs_hash_bits was used before it was changed into a non-
>> zero value. sd_revalidate_disk() submits a very substantial number of
>> SCSI commands before it calls blk_revalidate_disk_zones(), the function
>> that sets disk->zone_wplugs_hash_bits.
>
> But none of the commands are writes to sequential zones, so the hash bits check
> should not even be necessary, no ?
Hi Damien,
I will double check whether this test is really necessary and leave it
out if not.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 05/26] blk-zoned: Fix a deadlock triggered by unaligned writes
2024-11-21 3:32 ` Damien Le Moal
@ 2024-11-21 17:51 ` Bart Van Assche
2024-11-25 4:00 ` Damien Le Moal
0 siblings, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2024-11-21 17:51 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/20/24 7:32 PM, Damien Le Moal wrote:
> On 11/20/24 06:04, Bart Van Assche wrote:
>> On 11/18/24 6:57 PM, Damien Le Moal wrote:
>>> On 11/19/24 9:27 AM, Bart Van Assche wrote:
>>>> If the queue is filled with unaligned writes then the following
>>>> deadlock occurs:
>>>>
>>>> Call Trace:
>>>> <TASK>
>>>> __schedule+0x8cc/0x2190
>>>> schedule+0xdd/0x2b0
>>>> blk_queue_enter+0x2ce/0x4f0
>>>> blk_mq_alloc_request+0x303/0x810
>>>> scsi_execute_cmd+0x3f4/0x7b0
>>>> sd_zbc_do_report_zones+0x19e/0x4c0
>>>> sd_zbc_report_zones+0x304/0x920
>>>> disk_zone_wplug_handle_error+0x237/0x920
>>>> disk_zone_wplugs_work+0x17e/0x430
>>>> process_one_work+0xdd0/0x1490
>>>> worker_thread+0x5eb/0x1010
>>>> kthread+0x2e5/0x3b0
>>>> ret_from_fork+0x3a/0x80
>>>> </TASK>
>>>>
>>>> Fix this deadlock by removing the disk->fops->report_zones() call and by
>>>> deriving the write pointer information from successfully completed zoned
>>>> writes.
>>>>
>>>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>>>
>>> Doesn't this need a Fixes tag and CC stable, and come earlier in the series (or
>>> sent separately) ?
>>
>> I will add Fixes: and Cc: stable tags.
>>
>> I'm not sure how to move this patch earlier since it depends on the
>> previous patch in this series ("blk-zoned: Only handle errors after
>> pending zoned writes have completed"). Without that patch, it is not
>> safe to use zwplug->wp_offset_compl in the error handler.
>>
>>> Overall, this patch seems wrong anyway as zone reset and zone finish may be
>>> done between 2 writes, failing the next one but the recovery done here will use
>>> the previous successful write end position as the wp, which is NOT correct as
>>> reset or finish changed that...
>>
>> I will add support for the zone reset and zone finish commands in this
>> patch.
>>
>>> And we also have the possibility of torn writes
>>> (partial writes) with SAS SMR drives. So I really think that you cannot avoid
>>> doing a report zone to recover errors.
>>
>> Thanks for having brought this up. This is something I was not aware of.
>>
>> disk_zone_wplug_handle_error() submits a new request to retrieve zone
>> information while handling an error triggered by other requests. This
>> can easily lead to a deadlock as the above call trace shows. How about
>> introducing a queue flag for the "report zones" approach in
>> disk_zone_wplug_handle_error() such that the "report zones" approach is
>> only used for SAS SMR drives?
>
> Sure, but how would that solve the potential deadlock problem ? Also, I am not
> entirely clear on how the deadlock can happen given that zone write plugs are
> queueing/blocking BIOs, not requests. So even assuming you have a large number
> of BIOs plugged in a zone write plug, the error handler work should still be
> able to issue a request to do a report zones, no ? On which resource can the
> deadlock happen ? Plugged BIOs do not yet use a tag, right ?
>
> What am I missing here ? Or is it maybe something that can happen with your
> modifications because you changed the zone write plug behavior to allow for more
> than one BIO at a time being unplugged and issued to the device ?
>
> Note that if you do have a test case for this triggering the deadlock, we
> definitely need to solve this and ideally have a blktest case checking it.
Hi Damien,
The call trace mentioned above comes from the kernel log and was
encountered while I was testing this patch series. A reproducer has
already been shared - see also
https://lore.kernel.org/linux-block/e840b66a-79f0-4169-9ab1-c475d9608e4d@acm.org/.
The lockup happened after the queue was filled up
with requests and hence sd_zbc_report_zones() failed to allocate an
additional request for the zone report.
I'm wondering whether this lockup can also happen with the current
upstream kernel by submitting multiple unaligned writes simultaneously
where each write affects another zone and where the number of writes
matches the queue depth.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 00/26] Improve write performance for zoned UFS devices
2024-11-21 3:20 ` Damien Le Moal
@ 2024-11-21 18:00 ` Bart Van Assche
2024-11-25 3:59 ` Damien Le Moal
0 siblings, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2024-11-21 18:00 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/20/24 7:20 PM, Damien Le Moal wrote:
> I am only trying to see if there is not a simpler approach than what you did.
> The less changes, the better, right ?
Hi Damien,
I agree with you that we should select the simplest approach that yields
the desired performance.
Regarding the proposed approach, forcing unplugging a zone write plug
from the driver once a command is passed to the driver and the driver
did not reject it, is this approach compatible with SCSI devices that
may report a unit attention? If two zoned writes for the same zone are
submitted in order to a SCSI device and the SCSI device responds with a
unit attention condition to the first write then the second write will
fail with an "unaligned write" error. This will have to be handled by
pausing zoned write submission and by resubmitting zoned writes after
all pending zoned writes for the given zone have completed. In other
words, if higher queue depths are supported per zone, we cannot avoid
increasing the complexity of the code in block/blk-zoned.c. If we cannot
avoid increasing the complexity of that code, I think we can as well
select the approach that yields the highest performance and the fewest
changes in the block layer code for regular reads and writes.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 26/26] scsi: ufs: Inform the block layer about write ordering
[not found] ` <37f95f44-ab1d-20db-e0c7-94946cb9d4eb@quicinc.com>
@ 2024-11-22 18:20 ` Bart Van Assche
2024-11-23 0:34 ` Can Guo
0 siblings, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2024-11-22 18:20 UTC (permalink / raw)
To: Can Guo, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bao D. Nguyen, Martin K. Petersen, Avri Altman
On 11/21/24 6:58 PM, Can Guo wrote:
> So, in MCQ mode, we are re-directing all zoned writes to one HW queue to
> preserve the write order, is my understanding correct?
That's somewhat but not entirely correct. As one can see in patch 20/26,
the software queue (and hence also the hardware queue) that was selected
for zoned writes is unselected after all pending zoned writes for a
given zone have completed:
+ if (refcount_read(&zwplug->ref) == 2)
+ zwplug->swq_cpu = -1;
So the next time a write bio or write request is submitted a different
queue may be selected.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 26/26] scsi: ufs: Inform the block layer about write ordering
2024-11-22 18:20 ` Bart Van Assche
@ 2024-11-23 0:34 ` Can Guo
0 siblings, 0 replies; 73+ messages in thread
From: Can Guo @ 2024-11-23 0:34 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Jaegeuk Kim, Bao D. Nguyen, Martin K. Petersen, Avri Altman
On 11/23/2024 2:20 AM, Bart Van Assche wrote:
> On 11/21/24 6:58 PM, Can Guo wrote:
>> So, in MCQ mode, we are re-directing all zoned writes to one HW queue
>> to preserve the write order, is my understanding correct?
> That's somewhat but not entirely correct. As one can see in patch 20/26,
> the software queue (and hence also the hardware queue) that was selected
> for zoned writes is unselected after all pending zoned writes for a
> given zone have completed:
>
> + if (refcount_read(&zwplug->ref) == 2)
> + zwplug->swq_cpu = -1;
>
> So the next time a write bio or write request is submitted a different
> queue may be selected.
Understood, thanks for explaining.
Reviewed-by: Can Guo <quic_cang@quicinc.com>
>
> Thanks,
>
> Bart.
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 00/26] Improve write performance for zoned UFS devices
2024-11-21 18:00 ` Bart Van Assche
@ 2024-11-25 3:59 ` Damien Le Moal
0 siblings, 0 replies; 73+ messages in thread
From: Damien Le Moal @ 2024-11-25 3:59 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/22/24 3:00 AM, Bart Van Assche wrote:
> On 11/20/24 7:20 PM, Damien Le Moal wrote:
>> I am only trying to see if there is not a simpler approach than what you did.
>> The less changes, the better, right ?
>
> Hi Damien,
>
> I agree with you that we should select the simplest approach that yields
> the desired performance.
>
> Regarding the proposed approach, forcing unplugging a zone write plug
> from the driver once a command is passed to the driver and the driver
> did not reject it, is this approach compatible with SCSI devices that
> may report a unit attention? If two zoned writes for the same zone are
> submitted in order to a SCSI device and the SCSI device responds with a
> unit attention condition to the first write then the second write will
> fail with an "unaligned write" error. This will have to be handled by
I have never seen this happen (I mean UNIT ATTENTION being returned) in
practice with SAS HDDs. Not sure how zoned UFS devices behave though.
> pausing zoned write submission and by resubmitting zoned writes after
> all pending zoned writes for the given zone have completed. In other
> words, if higher queue depths are supported per zone, we cannot avoid
> increasing the complexity of the code in block/blk-zoned.c. If we cannot
> avoid increasing the complexity of that code, I think we can as well
> select the approach that yields the highest performance and the fewest
> changes in the block layer code for regular reads and writes.
But it seems that all we need is better request handling in the
requeue+dispatch path to handle requeued writes. I am not yet convinced that
zone write plugging needs a lot of changes besides allowing more than one write
per zone and some form of throttling based on feedback from the requeue path.
I.e. the current code throttles write command issuing every time one command is
submitted: the submission "plugs" the write plug. In your case, it seems that
the plugging should be driven by the requeue path, or the driver. Unplugging of
a plug happens currently when a BIO completes, but in your case, it would need
to be also driven by the requeue path or driver.
I am thinking that things could be cleaner/easier to maintain with a special
zoned write requeue path. Maybe... I am thinking aloud here as I have not
tried to code anything.
>
> Thanks,
>
> Bart.
>
>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 05/26] blk-zoned: Fix a deadlock triggered by unaligned writes
2024-11-21 17:51 ` Bart Van Assche
@ 2024-11-25 4:00 ` Damien Le Moal
2024-11-25 4:19 ` Damien Le Moal
0 siblings, 1 reply; 73+ messages in thread
From: Damien Le Moal @ 2024-11-25 4:00 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/22/24 2:51 AM, Bart Van Assche wrote:
>
> On 11/20/24 7:32 PM, Damien Le Moal wrote:
>> On 11/20/24 06:04, Bart Van Assche wrote:
>>> On 11/18/24 6:57 PM, Damien Le Moal wrote:
>>>> On 11/19/24 9:27 AM, Bart Van Assche wrote:
>>>>> If the queue is filled with unaligned writes then the following
>>>>> deadlock occurs:
>>>>>
>>>>> Call Trace:
>>>>> <TASK>
>>>>> __schedule+0x8cc/0x2190
>>>>> schedule+0xdd/0x2b0
>>>>> blk_queue_enter+0x2ce/0x4f0
>>>>> blk_mq_alloc_request+0x303/0x810
>>>>> scsi_execute_cmd+0x3f4/0x7b0
>>>>> sd_zbc_do_report_zones+0x19e/0x4c0
>>>>> sd_zbc_report_zones+0x304/0x920
>>>>> disk_zone_wplug_handle_error+0x237/0x920
>>>>> disk_zone_wplugs_work+0x17e/0x430
>>>>> process_one_work+0xdd0/0x1490
>>>>> worker_thread+0x5eb/0x1010
>>>>> kthread+0x2e5/0x3b0
>>>>> ret_from_fork+0x3a/0x80
>>>>> </TASK>
>>>>>
>>>>> Fix this deadlock by removing the disk->fops->report_zones() call and by
>>>>> deriving the write pointer information from successfully completed zoned
>>>>> writes.
>>>>>
>>>>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>>>>
>>>> Doesn't this need a Fixes tag and CC stable, and come earlier in the series
>>>> (or
>>>> sent separately) ?
>>>
>>> I will add Fixes: and Cc: stable tags.
>>>
>>> I'm not sure how to move this patch earlier since it depends on the
>>> previous patch in this series ("blk-zoned: Only handle errors after
>>> pending zoned writes have completed"). Without that patch, it is not
>>> safe to use zwplug->wp_offset_compl in the error handler.
>>>
>>>> Overall, this patch seems wrong anyway as zone reset and zone finish may be
>>>> done between 2 writes, failing the next one but the recovery done here will
>>>> use
>>>> the previous successful write end position as the wp, which is NOT correct as
>>>> reset or finish changed that...
>>>
>>> I will add support for the zone reset and zone finish commands in this
>>> patch.
>>>
>>>> And we also have the possibility of torn writes
>>>> (partial writes) with SAS SMR drives. So I really think that you cannot avoid
>>>> doing a report zone to recover errors.
>>>
>>> Thanks for having brought this up. This is something I was not aware of.
>>>
>>> disk_zone_wplug_handle_error() submits a new request to retrieve zone
>>> information while handling an error triggered by other requests. This
>>> can easily lead to a deadlock as the above call trace shows. How about
>>> introducing a queue flag for the "report zones" approach in
>>> disk_zone_wplug_handle_error() such that the "report zones" approach is
>>> only used for SAS SMR drives?
>>
>> Sure, but how would that solve the potential deadlock problem ? Also, I am not
>> entirely clear on how the deadlock can happen given that zone write plugs are
>> queueing/blocking BIOs, not requests. So even assuming you have a large number
>> of BIOs plugged in a zone write plug, the error handler work should still be
>> able to issue a request to do a report zones, no ? On which resource can the
>> deadlock happen ? Plugged BIOs do not yet use a tag, right ?
>>
>> What am I missing here ? Or is it maybe something that can happen with your
>> modifications because you changed the zone write plug behavior to allow for more
>> than one BIO at a time being unplugged and issued to the device ?
>>
>> Note that if you do have a test case for this triggering the deadlock, we
>> definitely need to solve this and ideally have a blktest case checking it.
>
> Hi Damien,
>
> The call trace mentioned above comes from the kernel log and was
> encountered while I was testing this patch series. A reproducer has
> already been shared - see also
> https://lore.kernel.org/linux-block/e840b66a-79f0-4169-9ab1-c475d9608e4d@acm.org/.
> The lockup happened after the queue was filled up
> with requests and hence sd_zbc_report_zones() failed to allocate an
> additional request for the zone report.
>
> I'm wondering whether this lockup can also happen with the current
> upstream kernel by submitting multiple unaligned writes simultaneously
> where each write affects another zone and where the number of writes
> matches the queue depth.
Let me check. This indeed may be a possibility.
>
> Thanks,
>
> Bart.
>
>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH v16 05/26] blk-zoned: Fix a deadlock triggered by unaligned writes
2024-11-25 4:00 ` Damien Le Moal
@ 2024-11-25 4:19 ` Damien Le Moal
0 siblings, 0 replies; 73+ messages in thread
From: Damien Le Moal @ 2024-11-25 4:19 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/25/24 1:00 PM, Damien Le Moal wrote:
> On 11/22/24 2:51 AM, Bart Van Assche wrote:
>>
>> On 11/20/24 7:32 PM, Damien Le Moal wrote:
>>> On 11/20/24 06:04, Bart Van Assche wrote:
>>>> On 11/18/24 6:57 PM, Damien Le Moal wrote:
>>>>> On 11/19/24 9:27 AM, Bart Van Assche wrote:
>>>>>> If the queue is filled with unaligned writes then the following
>>>>>> deadlock occurs:
>>>>>>
>>>>>> Call Trace:
>>>>>> <TASK>
>>>>>> __schedule+0x8cc/0x2190
>>>>>> schedule+0xdd/0x2b0
>>>>>> blk_queue_enter+0x2ce/0x4f0
>>>>>> blk_mq_alloc_request+0x303/0x810
>>>>>> scsi_execute_cmd+0x3f4/0x7b0
>>>>>> sd_zbc_do_report_zones+0x19e/0x4c0
>>>>>> sd_zbc_report_zones+0x304/0x920
>>>>>> disk_zone_wplug_handle_error+0x237/0x920
>>>>>> disk_zone_wplugs_work+0x17e/0x430
>>>>>> process_one_work+0xdd0/0x1490
>>>>>> worker_thread+0x5eb/0x1010
>>>>>> kthread+0x2e5/0x3b0
>>>>>> ret_from_fork+0x3a/0x80
>>>>>> </TASK>
>>>>>>
>>>>>> Fix this deadlock by removing the disk->fops->report_zones() call and by
>>>>>> deriving the write pointer information from successfully completed zoned
>>>>>> writes.
>>>>>>
>>>>>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>>>>>
>>>>> Doesn't this need a Fixes tag and CC stable, and come earlier in the series
>>>>> (or
>>>>> sent separately) ?
>>>>
>>>> I will add Fixes: and Cc: stable tags.
>>>>
>>>> I'm not sure how to move this patch earlier since it depends on the
>>>> previous patch in this series ("blk-zoned: Only handle errors after
>>>> pending zoned writes have completed"). Without that patch, it is not
>>>> safe to use zwplug->wp_offset_compl in the error handler.
>>>>
>>>>> Overall, this patch seems wrong anyway as zone reset and zone finish may be
>>>>> done between 2 writes, failing the next one but the recovery done here will
>>>>> use
>>>>> the previous successful write end position as the wp, which is NOT correct as
>>>>> reset or finish changed that...
>>>>
>>>> I will add support for the zone reset and zone finish commands in this
>>>> patch.
>>>>
>>>>> And we also have the possibility of torn writes
>>>>> (partial writes) with SAS SMR drives. So I really think that you cannot avoid
>>>>> doing a report zone to recover errors.
>>>>
>>>> Thanks for having brought this up. This is something I was not aware of.
>>>>
>>>> disk_zone_wplug_handle_error() submits a new request to retrieve zone
>>>> information while handling an error triggered by other requests. This
>>>> can easily lead to a deadlock as the above call trace shows. How about
>>>> introducing a queue flag for the "report zones" approach in
>>>> disk_zone_wplug_handle_error() such that the "report zones" approach is
>>>> only used for SAS SMR drives?
>>>
>>> Sure, but how would that solve the potential deadlock problem ? Also, I am not
>>> entirely clear on how the deadlock can happen given that zone write plugs are
>>> queueing/blocking BIOs, not requests. So even assuming you have a large number
>>> of BIOs plugged in a zone write plug, the error handler work should still be
>>> able to issue a request to do a report zones, no ? On which resource can the
>>> deadlock happen ? Plugged BIOs do not yet use a tag, right ?
>>>
>>> What am I missing here ? Or is it maybe something that can happen with your
>>> modifications because you changed the zone write plug behavior to allow for more
>>> than one BIO at a time being unplugged and issued to the device ?
>>>
>>> Note that if you do have a test case for this triggering the deadlock, we
>>> definitely need to solve this and ideally have a blktest case checking it.
>>
>> Hi Damien,
>>
>> The call trace mentioned above comes from the kernel log and was
>> encountered while I was testing this patch series. A reproducer has
>> already been shared - see also
>> https://lore.kernel.org/linux-block/e840b66a-79f0-4169-9ab1-c475d9608e4d@acm.org/.
>> The lockup happened after the queue was filled up
>> with requests and hence sd_zbc_report_zones() failed to allocate an
>> additional request for the zone report.
>>
>> I'm wondering whether this lockup can also happen with the current
>> upstream kernel by submitting multiple unaligned writes simultaneously
>> where each write affects another zone and where the number of writes
>> matches the queue depth.
>
> Let me check. This indeed may be a possibility.
Thinking more about this, the answer is no, this cannot happen. The reason is
that zone write plugs are blocking BIOs, not requests. So the number of BIOs
queued waiting for execution does not matter as we are not holding tags. A
report zone for recovering from a write error in a zone may have to wait for a
tag for some time before executing, when many zones are being read/written at
the same time. But as long as requests complete and tags are freed, error
recovery report zones will eventually proceed and execute. No deadlock.
Even if the drive stops responding, we will eventually get a timeout and drive
reset which will abort all commands, so even then, we will have forward
progress with error recovery.
This is for the current upstream code.
Now, in your case, with several write requests issued per zone, and those write
requests potentially waiting in the requeue or dispatch lists, a report zones
command issued for error recovery may indeed block forever... So with your
changes, I think a solution for this is needed.
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v16 00/26] Improve write performance for zoned UFS devices
2024-11-19 8:01 ` [PATCH v16 00/26] Improve write performance for zoned UFS devices Damien Le Moal
2024-11-19 19:08 ` Bart Van Assche
@ 2025-01-09 19:02 ` Bart Van Assche
2025-01-10 5:10 ` Damien Le Moal
1 sibling, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2025-01-09 19:02 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/19/24 12:01 AM, Damien Le Moal wrote:
> On 11/19/24 09:27, Bart Van Assche wrote:
>> This patch series improves small write IOPS by a factor of four (+300%) for
>> zoned UFS devices on my test setup with an UFSHCI 3.0 controller. Although
>> you are probably busy because the merge window is open, please take a look
>> at this patch series when you have the time. This patch series is organized
>> as follows:
>> - Bug fixes for existing code at the start of the series.
>> - The write pipelining support implementation comes after the bug fixes.
>
> Impressive improvements, but the changes are rather invasive. Have you tried
> a simpler solution, like forcing the unplugging of a zone write plug from the
> driver once a command has been passed to the driver and the driver did not
> reject it ? It seems like this would make everything simpler on the block
> layer side. But I am not sure if the performance gains would be the same.
(replying to an email from two months ago)
Hi Damien,
Here is a strong reason why the simpler solution mentioned above is not
sufficient: if anything goes wrong with the communication between UFS
host controller and UFS device (e.g. a command timeout or a power mode
transition fails) then the SCSI error handler is activated. This results
in ufshcd_err_handler() being called. That function resets both the host
controller and the UFS device (ufshcd_reset_and_restore()). At that time
multiple commands may be outstanding.
In other words, submitting UFS commands in order is not sufficient. An
approach is needed that is compatible with the SCSI error handler and
also that ensures that commands are resubmitted in LBA order per zone
after the SCSI error handler has completed.
The statistics I have access to show that the UFS error handler is
activated infrequently or never on any single device but also that it is
activated a nontrivial number of times across the entire device
population.
Thanks,
Bart.
* Re: [PATCH v16 05/26] blk-zoned: Fix a deadlock triggered by unaligned writes
2024-11-19 2:57 ` Damien Le Moal
2024-11-19 21:04 ` Bart Van Assche
@ 2025-01-09 19:11 ` Bart Van Assche
2025-01-10 5:07 ` Damien Le Moal
1 sibling, 1 reply; 73+ messages in thread
From: Bart Van Assche @ 2025-01-09 19:11 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 11/18/24 6:57 PM, Damien Le Moal wrote:
> And we also have the possibility of torn writes (partial writes) with
> SAS SMR drives. So I really think that you cannot avoid doing a
> report zone to recover errors.
(replying to an email of two months ago)
Hi Damien,
How about keeping the current approach (setting the
BLK_ZONE_WPLUG_NEED_WP_UPDATE flag after an I/O error has been observed)
if write pipelining is disabled and using the wp_offset_compl approach
only if write pipelining is enabled? This approach preserves the
existing behavior for SAS SMR drives and makes it possible to restore the write
pointer after a write error has been observed for UFS devices. Please
note that so far I have only observed write errors for UFS devices if I
add write error injection code in the UFS driver.
Thanks,
Bart.
* Re: [PATCH v16 05/26] blk-zoned: Fix a deadlock triggered by unaligned writes
2025-01-09 19:11 ` Bart Van Assche
@ 2025-01-10 5:07 ` Damien Le Moal
2025-01-10 18:17 ` Bart Van Assche
0 siblings, 1 reply; 73+ messages in thread
From: Damien Le Moal @ 2025-01-10 5:07 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 1/10/25 04:11, Bart Van Assche wrote:
> On 11/18/24 6:57 PM, Damien Le Moal wrote:
>> And we also have the possibility of torn writes (partial writes) with
>> SAS SMR drives. So I really think that you cannot avoid doing a
>> report zone to recover errors.
> (replying to an email of two months ago)
>
> Hi Damien,
>
> How about keeping the current approach (setting the
> BLK_ZONE_WPLUG_NEED_WP_UPDATE flag after an I/O error has been observed)
> if write pipelining is disabled and using the wp_offset_compl approach
> only if write pipelining is enabled? This approach preserves the
>> existing behavior for SAS SMR drives and makes it possible to restore the write
> pointer after a write error has been observed for UFS devices. Please
> note that so far I have only observed write errors for UFS devices if I
> add write error injection code in the UFS driver.
If you get write errors, they will be propagated to the user (f2fs in this case
I suspect) which should do a report zone to verify things. So I do not see why
this part would need to change.
> Thanks,
>
> Bart.
>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v16 00/26] Improve write performance for zoned UFS devices
2025-01-09 19:02 ` Bart Van Assche
@ 2025-01-10 5:10 ` Damien Le Moal
0 siblings, 0 replies; 73+ messages in thread
From: Damien Le Moal @ 2025-01-10 5:10 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 1/10/25 04:02, Bart Van Assche wrote:
> On 11/19/24 12:01 AM, Damien Le Moal wrote:
>> On 11/19/24 09:27, Bart Van Assche wrote:
>>> This patch series improves small write IOPS by a factor of four (+300%) for
>>> zoned UFS devices on my test setup with an UFSHCI 3.0 controller. Although
>>> you are probably busy because the merge window is open, please take a look
>>> at this patch series when you have the time. This patch series is organized
>>> as follows:
>>> - Bug fixes for existing code at the start of the series.
>>> - The write pipelining support implementation comes after the bug fixes.
>>
>> Impressive improvements, but the changes are rather invasive. Have you tried
>> a simpler solution, like forcing the unplugging of a zone write plug from the
>> driver once a command has been passed to the driver and the driver did not
>> reject it ? It seems like this would make everything simpler on the block
>> layer side. But I am not sure if the performance gains would be the same.
>
> (replying to an email from two months ago)
>
> Hi Damien,
>
> Here is a strong reason why the simpler solution mentioned above is not
> sufficient: if anything goes wrong with the communication between UFS
> host controller and UFS device (e.g. a command timeout or a power mode
> transition fails) then the SCSI error handler is activated. This results
> in ufshcd_err_handler() being called. That function resets both the host
> controller and the UFS device (ufshcd_reset_and_restore()). At that time
> multiple commands may be outstanding.
>
> In other words, submitting UFS commands in order is not sufficient. An
> approach is needed that is compatible with the SCSI error handler and
> also that ensures that commands are resubmitted in LBA order per zone
> after the SCSI error handler has completed.
If the failed commands are retried, they will be requeued and you will not see
the error as the request will not be completed yet, no ? And if you do see the
error back in the block layer, you cannot just retry the command at will. The
issuer must do that, no ? I am confused...
Please send patches to discuss based on code. That will be easier.
>
> The statistics I have access to show that the UFS error handler is
> activated infrequently or never on any single device but also that it is
> activated a nontrivial number of times across the entire device
> population.
>
> Thanks,
>
> Bart.
>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v16 05/26] blk-zoned: Fix a deadlock triggered by unaligned writes
2025-01-10 5:07 ` Damien Le Moal
@ 2025-01-10 18:17 ` Bart Van Assche
0 siblings, 0 replies; 73+ messages in thread
From: Bart Van Assche @ 2025-01-10 18:17 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Jaegeuk Kim
On 1/9/25 9:07 PM, Damien Le Moal wrote:
> On 1/10/25 04:11, Bart Van Assche wrote:
>> On 11/18/24 6:57 PM, Damien Le Moal wrote:
>>> And we also have the possibility of torn writes (partial writes) with
>>> SAS SMR drives. So I really think that you cannot avoid doing a
>>> report zone to recover errors.
>> (replying to an email of two months ago)
>>
>> Hi Damien,
>>
>> How about keeping the current approach (setting the
>> BLK_ZONE_WPLUG_NEED_WP_UPDATE flag after an I/O error has been observed)
>> if write pipelining is disabled and using the wp_offset_compl approach
>> only if write pipelining is enabled? This approach preserves the
>> existing behavior for SAS SMR drives and allows to restore the write
>> pointer after a write error has been observed for UFS devices. Please
>> note that so far I have only observed write errors for UFS devices if I
>> add write error injection code in the UFS driver.
>
> If you get write errors, they will be propagated to the user (f2fs in this case
> I suspect) which should do a report zone to verify things. So I do not see why
> this part would need to change.
Hi Damien,
In my email I was referring to write errors that should *not* be
propagated to the user but that should result in a retry by the kernel.
This is e.g. necessary if multiple writes are outstanding simultaneously
and the storage device reports a unit attention condition for any of
these writes except the last write.
Thanks,
Bart.
end of thread, other threads:[~2025-01-10 18:17 UTC | newest]
Thread overview: 73+ messages
2024-11-19 0:27 [PATCH v16 00/26] Improve write performance for zoned UFS devices Bart Van Assche
2024-11-19 0:27 ` [PATCH v16 01/26] blk-zoned: Fix a reference count leak Bart Van Assche
2024-11-19 2:23 ` Damien Le Moal
2024-11-19 20:21 ` Bart Van Assche
2024-11-19 0:27 ` [PATCH v16 02/26] blk-zoned: Split disk_zone_wplugs_work() Bart Van Assche
2024-11-19 0:27 ` [PATCH v16 03/26] blk-zoned: Split queue_zone_wplugs_show() Bart Van Assche
2024-11-19 2:25 ` Damien Le Moal
2024-11-19 0:27 ` [PATCH v16 04/26] blk-zoned: Only handle errors after pending zoned writes have completed Bart Van Assche
2024-11-19 2:50 ` Damien Le Moal
2024-11-19 20:51 ` Bart Van Assche
2024-11-21 3:23 ` Damien Le Moal
2024-11-21 17:43 ` Bart Van Assche
2024-11-19 0:27 ` [PATCH v16 05/26] blk-zoned: Fix a deadlock triggered by unaligned writes Bart Van Assche
2024-11-19 2:57 ` Damien Le Moal
2024-11-19 21:04 ` Bart Van Assche
2024-11-21 3:32 ` Damien Le Moal
2024-11-21 17:51 ` Bart Van Assche
2024-11-25 4:00 ` Damien Le Moal
2024-11-25 4:19 ` Damien Le Moal
2025-01-09 19:11 ` Bart Van Assche
2025-01-10 5:07 ` Damien Le Moal
2025-01-10 18:17 ` Bart Van Assche
2024-11-19 0:27 ` [PATCH v16 06/26] blk-zoned: Fix requeuing of zoned writes Bart Van Assche
2024-11-19 3:00 ` Damien Le Moal
2024-11-19 21:06 ` Bart Van Assche
2024-11-19 0:27 ` [PATCH v16 07/26] block: Support block drivers that preserve the order of write requests Bart Van Assche
2024-11-19 7:37 ` Damien Le Moal
2024-11-19 21:08 ` Bart Van Assche
2024-11-19 0:27 ` [PATCH v16 08/26] dm-linear: Report to the block layer that the write order is preserved Bart Van Assche
2024-11-19 0:27 ` [PATCH v16 09/26] mq-deadline: Remove a local variable Bart Van Assche
2024-11-19 7:38 ` Damien Le Moal
2024-11-19 21:11 ` Bart Van Assche
2024-11-19 0:27 ` [PATCH v16 10/26] blk-mq: Clean up blk_mq_requeue_work() Bart Van Assche
2024-11-19 7:39 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 11/26] block: Optimize blk_mq_submit_bio() for the cache hit scenario Bart Van Assche
2024-11-19 7:40 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 12/26] block: Rework request allocation in blk_mq_submit_bio() Bart Van Assche
2024-11-19 7:44 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 13/26] block: Support allocating from a specific software queue Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 14/26] blk-mq: Restore the zoned write order when requeuing Bart Van Assche
2024-11-19 7:52 ` Damien Le Moal
2024-11-19 21:16 ` Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 15/26] blk-zoned: Document the locking order Bart Van Assche
2024-11-19 7:52 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 16/26] blk-zoned: Document locking assumptions Bart Van Assche
2024-11-19 7:53 ` Damien Le Moal
2024-11-19 21:18 ` Bart Van Assche
2024-11-21 3:34 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 17/26] blk-zoned: Uninline functions that are not in the hot path Bart Van Assche
2024-11-19 7:55 ` Damien Le Moal
2024-11-19 21:20 ` Bart Van Assche
2024-11-21 3:36 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 18/26] blk-zoned: Make disk_should_remove_zone_wplug() more robust Bart Van Assche
2024-11-19 7:58 ` Damien Le Moal
2024-11-19 0:28 ` [PATCH v16 19/26] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 20/26] blk-zoned: Support pipelining of zoned writes Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 21/26] scsi: core: Retry unaligned " Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 22/26] scsi: sd: Increase retry count for " Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 23/26] scsi: scsi_debug: Add the preserves_write_order module parameter Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 24/26] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 25/26] scsi: scsi_debug: Skip host/bus reset settle delay Bart Van Assche
2024-11-19 0:28 ` [PATCH v16 26/26] scsi: ufs: Inform the block layer about write ordering Bart Van Assche
[not found] ` <37f95f44-ab1d-20db-e0c7-94946cb9d4eb@quicinc.com>
2024-11-22 18:20 ` Bart Van Assche
2024-11-23 0:34 ` Can Guo
2024-11-19 8:01 ` [PATCH v16 00/26] Improve write performance for zoned UFS devices Damien Le Moal
2024-11-19 19:08 ` Bart Van Assche
2024-11-21 3:20 ` Damien Le Moal
2024-11-21 18:00 ` Bart Van Assche
2024-11-25 3:59 ` Damien Le Moal
2025-01-09 19:02 ` Bart Van Assche
2025-01-10 5:10 ` Damien Le Moal
2024-11-19 12:25 ` Christoph Hellwig
2024-11-19 18:52 ` Bart Van Assche