* [PATCH v8 0/9] Improve performance for zoned UFS devices
@ 2023-08-11 21:35 Bart Van Assche
2023-08-11 21:35 ` [PATCH v8 1/9] block: Introduce more member variables related to zone write locking Bart Van Assche
` (8 more replies)
0 siblings, 9 replies; 33+ messages in thread
From: Bart Van Assche @ 2023-08-11 21:35 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Bart Van Assche
Hi Jens,
This patch series improves small write IOPS by a factor of four (+300%) for
zoned UFS devices on my test setup with a UFSHCI 3.0 controller. Please
consider this patch series for the next merge window.
Thank you,
Bart.
Changes compared to v7:
- Split the queue_limits member variable `use_zone_write_lock' into two member
variables: `use_zone_write_lock' (set by disk_set_zoned()) and
`driver_preserves_write_order' (set by the block driver or SCSI LLD). This
should clear up the confusion about the purpose of this variable.
- Moved the code for sorting SCSI commands by LBA from the SCSI error handler
into the SCSI disk (sd) driver as requested by Christoph.
Changes compared to v6:
- Removed QUEUE_FLAG_NO_ZONE_WRITE_LOCK and instead introduced a flag in
the request queue limits data structure.
Changes compared to v5:
- Renamed scsi_cmp_lba() to scsi_cmp_sector().
- Improved several source code comments.
Changes compared to v4:
- Dropped the patch that introduces the REQ_NO_ZONE_WRITE_LOCK flag.
- Dropped the null_blk patch and added two scsi_debug patches instead.
- Dropped the f2fs patch.
- Split the patch for the UFS driver into two patches.
- Modified several patch descriptions and source code comments.
- Renamed dd_use_write_locking() to dd_use_zone_write_locking().
- Moved the list_sort() call from scsi_unjam_host() into scsi_eh_flush_done_q()
such that sorting happens just before reinserting.
- Removed the scsi_cmd_retry_allowed() call from scsi_check_sense() to make
sure that the retry counter is adjusted once per retry instead of twice.
Changes compared to v3:
- Restored the patch that introduces QUEUE_FLAG_NO_ZONE_WRITE_LOCK. That patch
had accidentally been left out from v2.
- In patch "block: Introduce the flag REQ_NO_ZONE_WRITE_LOCK", improved the
patch description and added the function blk_no_zone_write_lock().
- In patch "block/mq-deadline: Only use zone locking if necessary", moved the
blk_queue_is_zoned() call into dd_use_write_locking().
- In patch "fs/f2fs: Disable zone write locking", set REQ_NO_ZONE_WRITE_LOCK
from inside __bio_alloc() instead of in f2fs_submit_write_bio().
Changes compared to v2:
- Renamed the request queue flag for disabling zone write locking.
- Introduced a new request flag for disabling zone write locking.
- Modified the mq-deadline scheduler such that zone write locking is only
disabled if both flags are set.
- Added an F2FS patch that sets the request flag for disabling zone write
locking.
- Only disable zone write locking in the UFS driver if auto-hibernation is
disabled.
Changes compared to v1:
- Left out the patches that are already upstream.
- Switched the approach in patch "scsi: Retry unaligned zoned writes" from
retrying immediately to sending unaligned write commands to the SCSI error
handler.
Bart Van Assche (9):
block: Introduce more member variables related to zone write locking
block/mq-deadline: Only use zone locking if necessary
scsi: core: Call .eh_prepare_resubmit() before resubmitting
scsi: sd: Sort commands by LBA before resubmitting
scsi: core: Retry unaligned zoned writes
scsi: scsi_debug: Support disabling zone write locking
scsi: scsi_debug: Support injecting unaligned write errors
scsi: ufs: Split an if-condition
scsi: ufs: Inform the block layer about write ordering
block/blk-settings.c | 7 +++
block/mq-deadline.c | 14 +++---
drivers/scsi/Kconfig | 2 +
drivers/scsi/Kconfig.kunit | 4 ++
drivers/scsi/Makefile | 2 +
drivers/scsi/Makefile.kunit | 1 +
drivers/scsi/scsi_debug.c | 20 +++++++-
drivers/scsi/scsi_error.c | 59 ++++++++++++++++++++++
drivers/scsi/scsi_error_test.c | 92 ++++++++++++++++++++++++++++++++++
drivers/scsi/scsi_lib.c | 1 +
drivers/scsi/scsi_priv.h | 1 +
drivers/scsi/sd.c | 25 +++++++++
drivers/ufs/core/ufshcd.c | 40 +++++++++++++--
include/linux/blkdev.h | 10 ++++
include/scsi/scsi.h | 1 +
include/scsi/scsi_driver.h | 1 +
16 files changed, 270 insertions(+), 10 deletions(-)
create mode 100644 drivers/scsi/Kconfig.kunit
create mode 100644 drivers/scsi/Makefile.kunit
create mode 100644 drivers/scsi/scsi_error_test.c
* [PATCH v8 1/9] block: Introduce more member variables related to zone write locking
2023-08-11 21:35 [PATCH v8 0/9] Improve performance for zoned UFS devices Bart Van Assche
@ 2023-08-11 21:35 ` Bart Van Assche
2023-08-14 12:32 ` Damien Le Moal
2023-08-11 21:35 ` [PATCH v8 2/9] block/mq-deadline: Only use zone locking if necessary Bart Van Assche
` (7 subsequent siblings)
8 siblings, 1 reply; 33+ messages in thread
From: Bart Van Assche @ 2023-08-11 21:35 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Bart Van Assche, Damien Le Moal, Ming Lei
Many but not all storage controllers require serialization of zoned writes.
Introduce a new request queue limit member variable
(driver_preserves_write_order) that allows block drivers to indicate that
the order of write commands is preserved and hence that serialization of
writes per zone is not required.
Make disk_set_zoned() set 'use_zone_write_lock' only if the block device
has zones and if the block driver does not preserve the order of write
requests.
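As an illustration only (not part of this patch), a zoned block driver whose
controller preserves write order would opt out of zone write locking roughly
as follows; `disk' and `q' are placeholders for the driver's gendisk and
request queue:
  struct request_queue *q = disk->queue;
  /* Tell the block layer that write requests are never reordered. */
  q->limits.driver_preserves_write_order = true;
  /* Because of the above, disk_set_zoned() leaves use_zone_write_lock false. */
  disk_set_zoned(disk, BLK_ZONED_HM);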
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-settings.c | 7 +++++++
include/linux/blkdev.h | 10 ++++++++++
2 files changed, 17 insertions(+)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 0046b447268f..3a7748af1bef 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -56,6 +56,8 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->alignment_offset = 0;
lim->io_opt = 0;
lim->misaligned = 0;
+ lim->driver_preserves_write_order = false;
+ lim->use_zone_write_lock = false;
lim->zoned = BLK_ZONED_NONE;
lim->zone_write_granularity = 0;
lim->dma_alignment = 511;
@@ -685,6 +687,9 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
b->max_secure_erase_sectors);
t->zone_write_granularity = max(t->zone_write_granularity,
b->zone_write_granularity);
+ /* Request-based stacking drivers do not reorder requests. */
+ t->driver_preserves_write_order = b->driver_preserves_write_order;
+ t->use_zone_write_lock = b->use_zone_write_lock;
t->zoned = max(t->zoned, b->zoned);
return ret;
}
@@ -949,6 +954,8 @@ void disk_set_zoned(struct gendisk *disk, enum blk_zoned_model model)
}
q->limits.zoned = model;
+ q->limits.use_zone_write_lock = model != BLK_ZONED_NONE &&
+ !q->limits.driver_preserves_write_order;
if (model != BLK_ZONED_NONE) {
/*
* Set the zone write granularity to the device logical block
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2f5371b8482c..2c090a28ec78 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -316,6 +316,16 @@ struct queue_limits {
unsigned char misaligned;
unsigned char discard_misaligned;
unsigned char raid_partial_stripes_expensive;
+ /*
+ * Whether or not the block driver preserves the order of write
+ * requests. Set by the block driver.
+ */
+ bool driver_preserves_write_order;
+ /*
+ * Whether or not zone write locking should be used. Set by
+ * disk_set_zoned().
+ */
+ bool use_zone_write_lock;
enum blk_zoned_model zoned;
/*
* [PATCH v8 2/9] block/mq-deadline: Only use zone locking if necessary
2023-08-11 21:35 [PATCH v8 0/9] Improve performance for zoned UFS devices Bart Van Assche
2023-08-11 21:35 ` [PATCH v8 1/9] block: Introduce more member variables related to zone write locking Bart Van Assche
@ 2023-08-11 21:35 ` Bart Van Assche
2023-08-14 12:33 ` Damien Le Moal
2023-08-11 21:35 ` [PATCH v8 3/9] scsi: core: Call .eh_prepare_resubmit() before resubmitting Bart Van Assche
` (6 subsequent siblings)
8 siblings, 1 reply; 33+ messages in thread
From: Bart Van Assche @ 2023-08-11 21:35 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Bart Van Assche, Damien Le Moal, Ming Lei
Measurements have shown that limiting the queue depth to one per zone for
zoned writes has a significant negative performance impact on zoned UFS
devices. Hence, disable zone write locking in the mq-deadline scheduler if
the storage controller preserves the command order, as sketched below. This
patch is based on the following assumptions:
- It happens infrequently that zoned write requests are reordered by the
block layer.
- The I/O priority of all write requests is the same per zone.
- Either no I/O scheduler is used or an I/O scheduler is used that
serializes write requests per zone.
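The resulting dispatch-time decision can be summarized by the sketch below.
The helper name dd_use_zone_write_locking() is hypothetical; the patch itself
open-codes the equivalent checks:
  /* Sketch: whether mq-deadline must serialize this request per zone. */
  static bool dd_use_zone_write_locking(struct request *rq)
  {
          return rq_data_dir(rq) == WRITE &&
                 rq->q->limits.use_zone_write_lock;
  }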
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/mq-deadline.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index f958e79277b8..5c2fc4003bc0 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -353,7 +353,7 @@ deadline_fifo_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
return NULL;
rq = rq_entry_fifo(per_prio->fifo_list[data_dir].next);
- if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
+ if (data_dir == DD_READ || !rq->q->limits.use_zone_write_lock)
return rq;
/*
@@ -398,7 +398,7 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
if (!rq)
return NULL;
- if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
+ if (data_dir == DD_READ || !rq->q->limits.use_zone_write_lock)
return rq;
/*
@@ -526,8 +526,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
}
/*
- * For a zoned block device, if we only have writes queued and none of
- * them can be dispatched, rq will be NULL.
+ * For a zoned block device that requires write serialization, if we
+ * only have writes queued and none of them can be dispatched, rq will
+ * be NULL.
*/
if (!rq)
return NULL;
@@ -552,7 +553,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
/*
* If the request needs its target zone locked, do it.
*/
- blk_req_zone_write_lock(rq);
+ if (rq->q->limits.use_zone_write_lock)
+ blk_req_zone_write_lock(rq);
rq->rq_flags |= RQF_STARTED;
return rq;
}
@@ -934,7 +936,7 @@ static void dd_finish_request(struct request *rq)
atomic_inc(&per_prio->stats.completed);
- if (blk_queue_is_zoned(q)) {
+ if (rq->q->limits.use_zone_write_lock) {
unsigned long flags;
spin_lock_irqsave(&dd->zone_lock, flags);
* [PATCH v8 3/9] scsi: core: Call .eh_prepare_resubmit() before resubmitting
2023-08-11 21:35 [PATCH v8 0/9] Improve performance for zoned UFS devices Bart Van Assche
2023-08-11 21:35 ` [PATCH v8 1/9] block: Introduce more member variables related to zone write locking Bart Van Assche
2023-08-11 21:35 ` [PATCH v8 2/9] block/mq-deadline: Only use zone locking if necessary Bart Van Assche
@ 2023-08-11 21:35 ` Bart Van Assche
2023-08-14 1:19 ` Damien Le Moal
2023-08-11 21:35 ` [PATCH v8 4/9] scsi: sd: Sort commands by LBA " Bart Van Assche
` (5 subsequent siblings)
8 siblings, 1 reply; 33+ messages in thread
From: Bart Van Assche @ 2023-08-11 21:35 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Bart Van Assche, Damien Le Moal, Ming Lei, James E.J. Bottomley
Make the error handler call .eh_prepare_resubmit() before resubmitting
commands. A later patch will use this functionality to sort SCSI
commands per LBA from inside the SCSI disk driver.
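For reference, a minimal ULD hookup could look as follows. This is a sketch
with made-up names (my_uld_prepare_resubmit(), my_uld_cmp()); the real sd
implementation follows in the next patch:
  /* Called by the SCSI EH with a list that only holds this ULD's commands. */
  static void my_uld_prepare_resubmit(struct list_head *cmd_list)
  {
          /* Reorder the failed commands before they are requeued. */
          list_sort(NULL, cmd_list, my_uld_cmp);
  }
  static struct scsi_driver my_uld_template = {
          /* ... other callbacks ... */
          .eh_prepare_resubmit = my_uld_prepare_resubmit,
  };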
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/Kconfig | 2 +
drivers/scsi/Kconfig.kunit | 4 ++
drivers/scsi/Makefile | 2 +
drivers/scsi/Makefile.kunit | 1 +
drivers/scsi/scsi_error.c | 43 ++++++++++++++++
drivers/scsi/scsi_error_test.c | 92 ++++++++++++++++++++++++++++++++++
drivers/scsi/scsi_priv.h | 1 +
include/scsi/scsi_driver.h | 1 +
8 files changed, 146 insertions(+)
create mode 100644 drivers/scsi/Kconfig.kunit
create mode 100644 drivers/scsi/Makefile.kunit
create mode 100644 drivers/scsi/scsi_error_test.c
diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
index 4962ce989113..fc288f8fb800 100644
--- a/drivers/scsi/Kconfig
+++ b/drivers/scsi/Kconfig
@@ -232,6 +232,8 @@ config SCSI_SCAN_ASYNC
Note that this setting also affects whether resuming from
system suspend will be performed asynchronously.
+source "drivers/scsi/Kconfig.kunit"
+
menu "SCSI Transports"
depends on SCSI
diff --git a/drivers/scsi/Kconfig.kunit b/drivers/scsi/Kconfig.kunit
new file mode 100644
index 000000000000..68e3d90d49e9
--- /dev/null
+++ b/drivers/scsi/Kconfig.kunit
@@ -0,0 +1,4 @@
+config SCSI_ERROR_TEST
+ tristate "scsi_error.c unit tests" if !KUNIT_ALL_TESTS
+ depends on SCSI_MOD && KUNIT
+ default KUNIT_ALL_TESTS
diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
index f055bfd54a68..1c5c3afb6c6e 100644
--- a/drivers/scsi/Makefile
+++ b/drivers/scsi/Makefile
@@ -168,6 +168,8 @@ scsi_mod-$(CONFIG_PM) += scsi_pm.o
scsi_mod-$(CONFIG_SCSI_DH) += scsi_dh.o
scsi_mod-$(CONFIG_BLK_DEV_BSG) += scsi_bsg.o
+include $(srctree)/drivers/scsi/Makefile.kunit
+
hv_storvsc-y := storvsc_drv.o
sd_mod-objs := sd.o
diff --git a/drivers/scsi/Makefile.kunit b/drivers/scsi/Makefile.kunit
new file mode 100644
index 000000000000..3e98053b2709
--- /dev/null
+++ b/drivers/scsi/Makefile.kunit
@@ -0,0 +1 @@
+obj-$(CONFIG_SCSI_ERROR_TEST) += scsi_error_test.o
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index c67cdcdc3ba8..0d7835bdc8af 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -27,6 +27,7 @@
#include <linux/blkdev.h>
#include <linux/delay.h>
#include <linux/jiffies.h>
+#include <linux/list_sort.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
@@ -2186,6 +2187,46 @@ void scsi_eh_ready_devs(struct Scsi_Host *shost,
}
EXPORT_SYMBOL_GPL(scsi_eh_ready_devs);
+/*
+ * Comparison function that allows to sort SCSI commands by ULD driver.
+ */
+static int scsi_cmp_uld(void *priv, const struct list_head *_a,
+ const struct list_head *_b)
+{
+ struct scsi_cmnd *a = list_entry(_a, typeof(*a), eh_entry);
+ struct scsi_cmnd *b = list_entry(_b, typeof(*b), eh_entry);
+
+ /* See also the comment above the list_sort() definition. */
+ return scsi_cmd_to_driver(a) > scsi_cmd_to_driver(b);
+}
+
+void scsi_call_prepare_resubmit(struct list_head *done_q)
+{
+ struct scsi_cmnd *scmd, *next;
+
+ /* Sort pending SCSI commands by ULD. */
+ list_sort(NULL, done_q, scsi_cmp_uld);
+
+ /*
+ * Call .eh_prepare_resubmit for each range of commands with identical
+ * ULD driver pointer.
+ */
+ list_for_each_entry_safe(scmd, next, done_q, eh_entry) {
+ struct scsi_driver *uld = scsi_cmd_to_driver(scmd);
+ struct list_head *prev, uld_cmd_list;
+
+ while (&next->eh_entry != done_q &&
+ scsi_cmd_to_driver(next) == uld)
+ next = list_next_entry(next, eh_entry);
+ if (!uld->eh_prepare_resubmit)
+ continue;
+ prev = scmd->eh_entry.prev;
+ list_cut_position(&uld_cmd_list, prev, next->eh_entry.prev);
+ uld->eh_prepare_resubmit(&uld_cmd_list);
+ list_splice(&uld_cmd_list, prev);
+ }
+}
+
/**
* scsi_eh_flush_done_q - finish processed commands or retry them.
* @done_q: list_head of processed commands.
@@ -2194,6 +2235,8 @@ void scsi_eh_flush_done_q(struct list_head *done_q)
{
struct scsi_cmnd *scmd, *next;
+ scsi_call_prepare_resubmit(done_q);
+
list_for_each_entry_safe(scmd, next, done_q, eh_entry) {
list_del_init(&scmd->eh_entry);
if (scsi_device_online(scmd->device) &&
diff --git a/drivers/scsi/scsi_error_test.c b/drivers/scsi/scsi_error_test.c
new file mode 100644
index 000000000000..fd616527f726
--- /dev/null
+++ b/drivers/scsi/scsi_error_test.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright 2023 Google LLC
+ */
+#include <kunit/test.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_driver.h>
+#include "scsi_priv.h"
+
+static struct kunit *kunit_test;
+
+static struct scsi_driver *uld1, *uld2, *uld3;
+
+static void uld1_prepare_resubmit(struct list_head *cmd_list)
+{
+ struct scsi_cmnd *cmd;
+
+ KUNIT_EXPECT_EQ(kunit_test, list_count_nodes(cmd_list), 2);
+ list_for_each_entry(cmd, cmd_list, eh_entry)
+ KUNIT_EXPECT_PTR_EQ(kunit_test, scsi_cmd_to_driver(cmd), uld1);
+}
+
+static void uld2_prepare_resubmit(struct list_head *cmd_list)
+{
+ struct scsi_cmnd *cmd;
+
+ KUNIT_EXPECT_EQ(kunit_test, list_count_nodes(cmd_list), 2);
+ list_for_each_entry(cmd, cmd_list, eh_entry)
+ KUNIT_EXPECT_PTR_EQ(kunit_test, scsi_cmd_to_driver(cmd), uld2);
+}
+
+static void test_prepare_resubmit(struct kunit *test)
+{
+ static struct scsi_cmnd cmd1, cmd2, cmd3, cmd4, cmd5, cmd6;
+ static struct scsi_device dev1, dev2, dev3;
+ struct scsi_driver *uld;
+ LIST_HEAD(cmd_list);
+
+ uld = kzalloc(3 * sizeof(*uld), GFP_KERNEL);
+ uld1 = &uld[0];
+ uld1->eh_prepare_resubmit = uld1_prepare_resubmit;
+ uld2 = &uld[1];
+ uld2->eh_prepare_resubmit = uld2_prepare_resubmit;
+ uld3 = &uld[2];
+ dev1.sdev_gendev.driver = &uld1->gendrv;
+ dev2.sdev_gendev.driver = &uld2->gendrv;
+ dev3.sdev_gendev.driver = &uld3->gendrv;
+ cmd1.device = &dev1;
+ cmd2.device = &dev1;
+ cmd3.device = &dev2;
+ cmd4.device = &dev2;
+ cmd5.device = &dev3;
+ cmd6.device = &dev3;
+ list_add_tail(&cmd1.eh_entry, &cmd_list);
+ list_add_tail(&cmd3.eh_entry, &cmd_list);
+ list_add_tail(&cmd5.eh_entry, &cmd_list);
+ list_add_tail(&cmd2.eh_entry, &cmd_list);
+ list_add_tail(&cmd4.eh_entry, &cmd_list);
+ list_add_tail(&cmd6.eh_entry, &cmd_list);
+
+ KUNIT_EXPECT_EQ(test, list_count_nodes(&cmd_list), 6);
+ kunit_test = test;
+ scsi_call_prepare_resubmit(&cmd_list);
+ kunit_test = NULL;
+ KUNIT_EXPECT_EQ(test, list_count_nodes(&cmd_list), 6);
+ KUNIT_EXPECT_TRUE(test, uld1 < uld2);
+ KUNIT_EXPECT_TRUE(test, uld2 < uld3);
+ KUNIT_EXPECT_PTR_EQ(test, cmd_list.next, &cmd1.eh_entry);
+ KUNIT_EXPECT_PTR_EQ(test, cmd_list.next->next, &cmd2.eh_entry);
+ KUNIT_EXPECT_PTR_EQ(test, cmd_list.next->next->next, &cmd3.eh_entry);
+ KUNIT_EXPECT_PTR_EQ(test, cmd_list.next->next->next->next,
+ &cmd4.eh_entry);
+ KUNIT_EXPECT_PTR_EQ(test, cmd_list.next->next->next->next->next,
+ &cmd5.eh_entry);
+ KUNIT_EXPECT_PTR_EQ(test, cmd_list.next->next->next->next->next->next,
+ &cmd6.eh_entry);
+ kfree(uld);
+}
+
+static struct kunit_case prepare_resubmit_test_cases[] = {
+ KUNIT_CASE(test_prepare_resubmit),
+ {}
+};
+
+static struct kunit_suite prepare_resubmit_test_suite = {
+ .name = "prepare_resubmit",
+ .test_cases = prepare_resubmit_test_cases,
+};
+kunit_test_suite(prepare_resubmit_test_suite);
+
+MODULE_AUTHOR("Bart Van Assche");
+MODULE_LICENSE("GPL");
diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
index f42388ecb024..df4af4645430 100644
--- a/drivers/scsi/scsi_priv.h
+++ b/drivers/scsi/scsi_priv.h
@@ -101,6 +101,7 @@ int scsi_eh_get_sense(struct list_head *work_q,
struct list_head *done_q);
bool scsi_noretry_cmd(struct scsi_cmnd *scmd);
void scsi_eh_done(struct scsi_cmnd *scmd);
+void scsi_call_prepare_resubmit(struct list_head *done_q);
/* scsi_lib.c */
extern int scsi_maybe_unblock_host(struct scsi_device *sdev);
diff --git a/include/scsi/scsi_driver.h b/include/scsi/scsi_driver.h
index 4ce1988b2ba0..2b11be896eee 100644
--- a/include/scsi/scsi_driver.h
+++ b/include/scsi/scsi_driver.h
@@ -18,6 +18,7 @@ struct scsi_driver {
int (*done)(struct scsi_cmnd *);
int (*eh_action)(struct scsi_cmnd *, int);
void (*eh_reset)(struct scsi_cmnd *);
+ void (*eh_prepare_resubmit)(struct list_head *cmd_list);
};
#define to_scsi_driver(drv) \
container_of((drv), struct scsi_driver, gendrv)
* [PATCH v8 4/9] scsi: sd: Sort commands by LBA before resubmitting
2023-08-11 21:35 [PATCH v8 0/9] Improve performance for zoned UFS devices Bart Van Assche
` (2 preceding siblings ...)
2023-08-11 21:35 ` [PATCH v8 3/9] scsi: core: Call .eh_prepare_resubmit() before resubmitting Bart Van Assche
@ 2023-08-11 21:35 ` Bart Van Assche
2023-08-11 21:35 ` [PATCH v8 5/9] scsi: core: Retry unaligned zoned writes Bart Van Assche
` (4 subsequent siblings)
8 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2023-08-11 21:35 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Bart Van Assche, Damien Le Moal, Ming Lei, James E.J. Bottomley
Sort SCSI commands by LBA before the SCSI error handler resubmits
these commands. This is necessary when resubmitting zoned writes
(REQ_OP_WRITE) if multiple writes have been queued for a single zone.
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/sd.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 3c668cfb146d..4d9c6ad11cca 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -47,6 +47,7 @@
#include <linux/blkpg.h>
#include <linux/blk-pm.h>
#include <linux/delay.h>
+#include <linux/list_sort.h>
#include <linux/major.h>
#include <linux/mutex.h>
#include <linux/string_helpers.h>
@@ -117,6 +118,7 @@ static void sd_uninit_command(struct scsi_cmnd *SCpnt);
static int sd_done(struct scsi_cmnd *);
static void sd_eh_reset(struct scsi_cmnd *);
static int sd_eh_action(struct scsi_cmnd *, int);
+static void sd_prepare_resubmit(struct list_head *cmd_list);
static void sd_read_capacity(struct scsi_disk *sdkp, unsigned char *buffer);
static void scsi_disk_release(struct device *cdev);
@@ -617,6 +619,7 @@ static struct scsi_driver sd_template = {
.done = sd_done,
.eh_action = sd_eh_action,
.eh_reset = sd_eh_reset,
+ .eh_prepare_resubmit = sd_prepare_resubmit,
};
/*
@@ -2018,6 +2021,26 @@ static int sd_eh_action(struct scsi_cmnd *scmd, int eh_disp)
return eh_disp;
}
+static int sd_cmp_sector(void *priv, const struct list_head *_a,
+ const struct list_head *_b)
+{
+ struct scsi_cmnd *a = list_entry(_a, typeof(*a), eh_entry);
+ struct scsi_cmnd *b = list_entry(_b, typeof(*b), eh_entry);
+
+ /* See also the comment above the list_sort() definition. */
+ return blk_rq_pos(scsi_cmd_to_rq(a)) > blk_rq_pos(scsi_cmd_to_rq(b));
+}
+
+static void sd_prepare_resubmit(struct list_head *cmd_list)
+{
+ /*
+ * Sort pending SCSI commands in starting sector order. This is
+ * important if one of the SCSI devices associated with @shost is a
+ * zoned block device for which zone write locking is disabled.
+ */
+ list_sort(NULL, cmd_list, sd_cmp_sector);
+}
+
static unsigned int sd_completed_bytes(struct scsi_cmnd *scmd)
{
struct request *req = scsi_cmd_to_rq(scmd);
* [PATCH v8 5/9] scsi: core: Retry unaligned zoned writes
2023-08-11 21:35 [PATCH v8 0/9] Improve performance for zoned UFS devices Bart Van Assche
` (3 preceding siblings ...)
2023-08-11 21:35 ` [PATCH v8 4/9] scsi: sd: Sort commands by LBA " Bart Van Assche
@ 2023-08-11 21:35 ` Bart Van Assche
2023-08-14 12:36 ` Damien Le Moal
2023-08-11 21:35 ` [PATCH v8 6/9] scsi: scsi_debug: Support disabling zone write locking Bart Van Assche
` (3 subsequent siblings)
8 siblings, 1 reply; 33+ messages in thread
From: Bart Van Assche @ 2023-08-11 21:35 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Bart Van Assche, Damien Le Moal, Ming Lei, James E.J. Bottomley
If zoned writes (REQ_OP_WRITE) for a sequential write required zone have
a starting LBA that differs from the write pointer, e.g. because zoned
writes have been reordered, then the storage device will respond with an
UNALIGNED WRITE COMMAND error. Send commands that failed with an
unaligned write error to the SCSI error handler if zone write locking is
disabled. Let the SCSI error handler sort SCSI commands by LBA before
resubmitting them.
If zone write locking is disabled, increase the number of retries for
write commands sent to a sequential write required zone by the maximum
number of outstanding commands, because in the worst case a reordered
zoned write has to be retried (number of outstanding writes per
sequential zone) - 1 times.
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/scsi_error.c | 16 ++++++++++++++++
drivers/scsi/scsi_lib.c | 1 +
drivers/scsi/sd.c | 2 ++
include/scsi/scsi.h | 1 +
4 files changed, 20 insertions(+)
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 0d7835bdc8af..7ae43fac07b7 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -699,6 +699,22 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
fallthrough;
case ILLEGAL_REQUEST:
+ /*
+ * Unaligned write command. This may indicate that zoned writes
+ * have been received by the device in the wrong order. If zone
+ * write locking is disabled, retry after all pending commands
+ * have completed.
+ */
+ if (sshdr.asc == 0x21 && sshdr.ascq == 0x04 &&
+ !req->q->limits.use_zone_write_lock &&
+ blk_rq_is_seq_zoned_write(req)) {
+ SCSI_LOG_ERROR_RECOVERY(3,
+ sdev_printk(KERN_INFO, scmd->device,
+ "Retrying unaligned write at LBA %#llx.\n",
+ scsi_get_lba(scmd)));
+ return NEEDS_DELAYED_RETRY;
+ }
+
if (sshdr.asc == 0x20 || /* Invalid command operation code */
sshdr.asc == 0x21 || /* Logical block address out of range */
sshdr.asc == 0x22 || /* Invalid function */
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 59176946ab56..69da8aee13df 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1443,6 +1443,7 @@ static void scsi_complete(struct request *rq)
case ADD_TO_MLQUEUE:
scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);
break;
+ case NEEDS_DELAYED_RETRY:
default:
scsi_eh_scmd_add(cmd);
break;
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 4d9c6ad11cca..c8466c6c7387 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1238,6 +1238,8 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
cmd->transfersize = sdp->sector_size;
cmd->underflow = nr_blocks << 9;
cmd->allowed = sdkp->max_retries;
+ if (!rq->q->limits.use_zone_write_lock && blk_rq_is_seq_zoned_write(rq))
+ cmd->allowed += rq->q->nr_requests;
cmd->sdb.length = nr_blocks * sdp->sector_size;
SCSI_LOG_HLQUEUE(1,
diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
index ec093594ba53..6600db046227 100644
--- a/include/scsi/scsi.h
+++ b/include/scsi/scsi.h
@@ -93,6 +93,7 @@ static inline int scsi_status_is_check_condition(int status)
* Internal return values.
*/
enum scsi_disposition {
+ NEEDS_DELAYED_RETRY = 0x2000,
NEEDS_RETRY = 0x2001,
SUCCESS = 0x2002,
FAILED = 0x2003,
* [PATCH v8 6/9] scsi: scsi_debug: Support disabling zone write locking
2023-08-11 21:35 [PATCH v8 0/9] Improve performance for zoned UFS devices Bart Van Assche
` (4 preceding siblings ...)
2023-08-11 21:35 ` [PATCH v8 5/9] scsi: core: Retry unaligned zoned writes Bart Van Assche
@ 2023-08-11 21:35 ` Bart Van Assche
2023-08-11 21:35 ` [PATCH v8 7/9] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
` (2 subsequent siblings)
8 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2023-08-11 21:35 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Bart Van Assche, Douglas Gilbert, Damien Le Moal, Ming Lei,
James E.J. Bottomley
Make it easier to test the code paths that do not use zone write locking
by adding support for disabling zone write locking to the scsi_debug
driver.
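As an example, a host-managed scsi_debug device without zone write locking can
be created with something like the command below; zbc= and zone_size_mb= are
pre-existing scsi_debug parameters, only no_zone_write_lock is new:
  modprobe scsi_debug zbc=managed zone_size_mb=128 no_zone_write_lock=1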
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/scsi_debug.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 9c0af50501f9..c44c523bde2c 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -832,6 +832,7 @@ static int dix_reads;
static int dif_errors;
/* ZBC global data */
+static bool sdeb_no_zwrl;
static bool sdeb_zbc_in_use; /* true for host-aware and host-managed disks */
static int sdeb_zbc_zone_cap_mb;
static int sdeb_zbc_zone_size_mb;
@@ -5138,9 +5139,13 @@ static struct sdebug_dev_info *find_build_dev_info(struct scsi_device *sdev)
static int scsi_debug_slave_alloc(struct scsi_device *sdp)
{
+ struct request_queue *q = sdp->request_queue;
+
if (sdebug_verbose)
pr_info("slave_alloc <%u %u %u %llu>\n",
sdp->host->host_no, sdp->channel, sdp->id, sdp->lun);
+ if (sdeb_no_zwrl)
+ q->limits.driver_preserves_write_order = true;
return 0;
}
@@ -5738,6 +5743,7 @@ module_param_named(ndelay, sdebug_ndelay, int, S_IRUGO | S_IWUSR);
module_param_named(no_lun_0, sdebug_no_lun_0, int, S_IRUGO | S_IWUSR);
module_param_named(no_rwlock, sdebug_no_rwlock, bool, S_IRUGO | S_IWUSR);
module_param_named(no_uld, sdebug_no_uld, int, S_IRUGO);
+module_param_named(no_zone_write_lock, sdeb_no_zwrl, bool, S_IRUGO);
module_param_named(num_parts, sdebug_num_parts, int, S_IRUGO);
module_param_named(num_tgts, sdebug_num_tgts, int, S_IRUGO | S_IWUSR);
module_param_named(opt_blks, sdebug_opt_blks, int, S_IRUGO);
@@ -5812,6 +5818,8 @@ MODULE_PARM_DESC(ndelay, "response delay in nanoseconds (def=0 -> ignore)");
MODULE_PARM_DESC(no_lun_0, "no LU number 0 (def=0 -> have lun 0)");
MODULE_PARM_DESC(no_rwlock, "don't protect user data reads+writes (def=0)");
MODULE_PARM_DESC(no_uld, "stop ULD (e.g. sd driver) attaching (def=0))");
+MODULE_PARM_DESC(no_zone_write_lock,
+ "Disable serialization of zoned writes (def=0)");
MODULE_PARM_DESC(num_parts, "number of partitions(def=0)");
MODULE_PARM_DESC(num_tgts, "number of targets per host to simulate(def=1)");
MODULE_PARM_DESC(opt_blks, "optimal transfer length in blocks (def=1024)");
* [PATCH v8 7/9] scsi: scsi_debug: Support injecting unaligned write errors
2023-08-11 21:35 [PATCH v8 0/9] Improve performance for zoned UFS devices Bart Van Assche
` (5 preceding siblings ...)
2023-08-11 21:35 ` [PATCH v8 6/9] scsi: scsi_debug: Support disabling zone write locking Bart Van Assche
@ 2023-08-11 21:35 ` Bart Van Assche
2023-08-11 21:35 ` [PATCH v8 8/9] scsi: ufs: Split an if-condition Bart Van Assche
2023-08-11 21:35 ` [PATCH v8 9/9] scsi: ufs: Inform the block layer about write ordering Bart Van Assche
8 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2023-08-11 21:35 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Bart Van Assche, Douglas Gilbert, Damien Le Moal, Ming Lei,
James E.J. Bottomley
Allow user space software, e.g. a blktests test, to inject unaligned
write errors.
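One possible way to arm the injection from user space is sketched below. This
assumes that the existing every_nth/opts sysfs attributes drive the new flag
the same way as the other SDEBUG_OPT_* injection flags:
  echo 1 > /sys/bus/pseudo/drivers/scsi_debug/every_nth
  echo 0x20000 > /sys/bus/pseudo/drivers/scsi_debug/opts
A subsequent zoned write then completes with an unaligned write check condition
and exercises the retry path added earlier in this series.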
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/scsi_debug.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index c44c523bde2c..c92bd6d00249 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -181,6 +181,7 @@ static const char *sdebug_version_date = "20210520";
#define SDEBUG_OPT_NO_CDB_NOISE 0x4000
#define SDEBUG_OPT_HOST_BUSY 0x8000
#define SDEBUG_OPT_CMD_ABORT 0x10000
+#define SDEBUG_OPT_UNALIGNED_WRITE 0x20000
#define SDEBUG_OPT_ALL_NOISE (SDEBUG_OPT_NOISE | SDEBUG_OPT_Q_NOISE | \
SDEBUG_OPT_RESET_NOISE)
#define SDEBUG_OPT_ALL_INJECTING (SDEBUG_OPT_RECOVERED_ERR | \
@@ -188,7 +189,8 @@ static const char *sdebug_version_date = "20210520";
SDEBUG_OPT_DIF_ERR | SDEBUG_OPT_DIX_ERR | \
SDEBUG_OPT_SHORT_TRANSFER | \
SDEBUG_OPT_HOST_BUSY | \
- SDEBUG_OPT_CMD_ABORT)
+ SDEBUG_OPT_CMD_ABORT | \
+ SDEBUG_OPT_UNALIGNED_WRITE)
#define SDEBUG_OPT_RECOV_DIF_DIX (SDEBUG_OPT_RECOVERED_ERR | \
SDEBUG_OPT_DIF_ERR | SDEBUG_OPT_DIX_ERR)
@@ -3587,6 +3589,14 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
struct sdeb_store_info *sip = devip2sip(devip, true);
u8 *cmd = scp->cmnd;
+ if (unlikely(sdebug_opts & SDEBUG_OPT_UNALIGNED_WRITE &&
+ atomic_read(&sdeb_inject_pending))) {
+ atomic_set(&sdeb_inject_pending, 0);
+ mk_sense_buffer(scp, ILLEGAL_REQUEST, LBA_OUT_OF_RANGE,
+ UNALIGNED_WRITE_ASCQ);
+ return check_condition_result;
+ }
+
switch (cmd[0]) {
case WRITE_16:
ei_lba = 0;
* [PATCH v8 8/9] scsi: ufs: Split an if-condition
2023-08-11 21:35 [PATCH v8 0/9] Improve performance for zoned UFS devices Bart Van Assche
` (6 preceding siblings ...)
2023-08-11 21:35 ` [PATCH v8 7/9] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
@ 2023-08-11 21:35 ` Bart Van Assche
2023-08-11 21:35 ` [PATCH v8 9/9] scsi: ufs: Inform the block layer about write ordering Bart Van Assche
8 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2023-08-11 21:35 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Bart Van Assche, Can Guo, Avri Altman, Damien Le Moal, Ming Lei,
James E.J. Bottomley, Stanley Chu, Bean Huo, Asutosh Das,
Bao D. Nguyen, Arthur Simchaev
Make the next patch in this series easier to read. No functionality is
changed.
Reviewed-by: Can Guo <quic_cang@quicinc.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Avri Altman <avri.altman@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/ufs/core/ufshcd.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index 129446775796..ae7b868f9c26 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -4352,8 +4352,9 @@ void ufshcd_auto_hibern8_update(struct ufs_hba *hba, u32 ahit)
}
spin_unlock_irqrestore(hba->host->host_lock, flags);
- if (update &&
- !pm_runtime_suspended(&hba->ufs_device_wlun->sdev_gendev)) {
+ if (!update)
+ return;
+ if (!pm_runtime_suspended(&hba->ufs_device_wlun->sdev_gendev)) {
ufshcd_rpm_get_sync(hba);
ufshcd_hold(hba);
ufshcd_auto_hibern8_enable(hba);
* [PATCH v8 9/9] scsi: ufs: Inform the block layer about write ordering
2023-08-11 21:35 [PATCH v8 0/9] Improve performance for zoned UFS devices Bart Van Assche
` (7 preceding siblings ...)
2023-08-11 21:35 ` [PATCH v8 8/9] scsi: ufs: Split an if-condition Bart Van Assche
@ 2023-08-11 21:35 ` Bart Van Assche
2023-08-12 17:09 ` Bao D. Nguyen
8 siblings, 1 reply; 33+ messages in thread
From: Bart Van Assche @ 2023-08-11 21:35 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Bart Van Assche, Can Guo, Avri Altman, Damien Le Moal, Ming Lei,
James E.J. Bottomley, Stanley Chu, Bean Huo, Asutosh Das,
Bao D. Nguyen, Arthur Simchaev
From the UFSHCI 4.0 specification, about the legacy (single queue) mode:
"The host controller always process transfer requests in-order according
to the order submitted to the list. In case of multiple commands with
single doorbell register ringing (batch mode), The dispatch order for
these transfer requests by host controller will base on their index in
the List. A transfer request with lower index value will be executed
before a transfer request with higher index value."
From the UFSHCI 4.0 specification, about the MCQ mode:
"Command Submission
1. Host SW writes an Entry to SQ
2. Host SW updates SQ doorbell tail pointer
Command Processing
3. After fetching the Entry, Host Controller updates SQ doorbell head
pointer
4. Host controller sends COMMAND UPIU to UFS device"
In other words, for both legacy and MCQ mode, UFS controllers are
required to forward commands to the UFS device in the order these
commands have been received from the host.
Notes:
- For legacy mode this is only correct if the host submits one
command at a time. The UFS driver does this.
- Also in legacy mode, the command order is not preserved if
auto-hibernation is enabled in the UFS controller. Hence, enable
zone write locking if auto-hibernation is enabled.
This patch improves performance as follows on my test setup:
- With the mq-deadline scheduler: 2.5x more IOPS for small writes.
- When not using an I/O scheduler compared to using mq-deadline with
zone locking: 4x more IOPS for small writes.
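For reference, this kind of small-write workload can be approximated with a
zone-aware fio job similar to the one below (illustrative only; not necessarily
the exact configuration used for the numbers above):
  fio --name=zoned-writes --filename=/dev/sdX --direct=1 --ioengine=io_uring \
      --zonemode=zbd --rw=randwrite --bs=4k --iodepth=32 --max_open_zones=1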
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Can Guo <quic_cang@quicinc.com>
Cc: Avri Altman <avri.altman@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/ufs/core/ufshcd.c | 35 ++++++++++++++++++++++++++++++++++-
1 file changed, 34 insertions(+), 1 deletion(-)
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index ae7b868f9c26..71cee10c75ad 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -4337,23 +4337,48 @@ int ufshcd_uic_hibern8_exit(struct ufs_hba *hba)
}
EXPORT_SYMBOL_GPL(ufshcd_uic_hibern8_exit);
+static void ufshcd_update_preserves_write_order(struct ufs_hba *hba,
+ bool preserves_write_order)
+{
+ struct scsi_device *sdev;
+
+ shost_for_each_device(sdev, hba->host)
+ blk_freeze_queue_start(sdev->request_queue);
+ shost_for_each_device(sdev, hba->host) {
+ struct request_queue *q = sdev->request_queue;
+
+ blk_mq_freeze_queue_wait(q);
+ q->limits.driver_preserves_write_order = preserves_write_order;
+ blk_mq_unfreeze_queue(q);
+ }
+}
+
void ufshcd_auto_hibern8_update(struct ufs_hba *hba, u32 ahit)
{
unsigned long flags;
- bool update = false;
+ bool prev_state, new_state, update = false;
if (!ufshcd_is_auto_hibern8_supported(hba))
return;
spin_lock_irqsave(hba->host->host_lock, flags);
+ prev_state = ufshcd_is_auto_hibern8_enabled(hba);
if (hba->ahit != ahit) {
hba->ahit = ahit;
update = true;
}
+ new_state = ufshcd_is_auto_hibern8_enabled(hba);
spin_unlock_irqrestore(hba->host->host_lock, flags);
if (!update)
return;
+ if (!is_mcq_enabled(hba) && !prev_state && new_state) {
+ /*
+ * Auto-hibernation will be enabled for legacy UFSHCI mode. Tell
+ * the block layer that write requests may be reordered.
+ */
+ ufshcd_update_preserves_write_order(hba, false);
+ }
if (!pm_runtime_suspended(&hba->ufs_device_wlun->sdev_gendev)) {
ufshcd_rpm_get_sync(hba);
ufshcd_hold(hba);
@@ -4361,6 +4386,13 @@ void ufshcd_auto_hibern8_update(struct ufs_hba *hba, u32 ahit)
ufshcd_release(hba);
ufshcd_rpm_put_sync(hba);
}
+ if (!is_mcq_enabled(hba) && prev_state && !new_state) {
+ /*
+ * Auto-hibernation has been disabled. Tell the block layer that
+ * the order of write requests is preserved.
+ */
+ ufshcd_update_preserves_write_order(hba, true);
+ }
}
EXPORT_SYMBOL_GPL(ufshcd_auto_hibern8_update);
@@ -5140,6 +5172,7 @@ static int ufshcd_slave_configure(struct scsi_device *sdev)
ufshcd_hpb_configure(hba, sdev);
+ q->limits.driver_preserves_write_order = true;
blk_queue_update_dma_pad(q, PRDT_DATA_BYTE_COUNT_PAD - 1);
if (hba->quirks & UFSHCD_QUIRK_4KB_DMA_ALIGNMENT)
blk_queue_update_dma_alignment(q, SZ_4K - 1);
* Re: [PATCH v8 9/9] scsi: ufs: Inform the block layer about write ordering
2023-08-11 21:35 ` [PATCH v8 9/9] scsi: ufs: Inform the block layer about write ordering Bart Van Assche
@ 2023-08-12 17:09 ` Bao D. Nguyen
2023-08-14 16:23 ` Bart Van Assche
0 siblings, 1 reply; 33+ messages in thread
From: Bao D. Nguyen @ 2023-08-12 17:09 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Can Guo, Avri Altman, Damien Le Moal, Ming Lei,
James E.J. Bottomley, Stanley Chu, Bean Huo, Asutosh Das,
Arthur Simchaev
On 8/11/2023 2:35 PM, Bart Van Assche wrote:
> From the UFSHCI 4.0 specification, about the legacy (single queue) mode:
> "The host controller always process transfer requests in-order according
> to the order submitted to the list. In case of multiple commands with
> single doorbell register ringing (batch mode), The dispatch order for
> these transfer requests by host controller will base on their index in
> the List. A transfer request with lower index value will be executed
> before a transfer request with higher index value."
>
> From the UFSHCI 4.0 specification, about the MCQ mode:
> "Command Submission
> 1. Host SW writes an Entry to SQ
> 2. Host SW updates SQ doorbell tail pointer
>
> Command Processing
> 3. After fetching the Entry, Host Controller updates SQ doorbell head
> pointer
> 4. Host controller sends COMMAND UPIU to UFS device"
>
> In other words, for both legacy and MCQ mode, UFS controllers are
> required to forward commands to the UFS device in the order these
> commands have been received from the host.
>
> Notes:
> - For legacy mode this is only correct if the host submits one
> command at a time. The UFS driver does this.
> - Also in legacy mode, the command order is not preserved if
> auto-hibernation is enabled in the UFS controller. Hence, enable
> zone write locking if auto-hibernation is enabled.
>
> This patch improves performance as follows on my test setup:
> - With the mq-deadline scheduler: 2.5x more IOPS for small writes.
> - When not using an I/O scheduler compared to using mq-deadline with
> zone locking: 4x more IOPS for small writes.
>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: Can Guo <quic_cang@quicinc.com>
> Cc: Avri Altman <avri.altman@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> drivers/ufs/core/ufshcd.c | 35 ++++++++++++++++++++++++++++++++++-
> 1 file changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
> index ae7b868f9c26..71cee10c75ad 100644
> --- a/drivers/ufs/core/ufshcd.c
> +++ b/drivers/ufs/core/ufshcd.c
> @@ -4337,23 +4337,48 @@ int ufshcd_uic_hibern8_exit(struct ufs_hba *hba)
> }
> EXPORT_SYMBOL_GPL(ufshcd_uic_hibern8_exit);
>
> +static void ufshcd_update_preserves_write_order(struct ufs_hba *hba,
> + bool preserves_write_order)
> +{
> + struct scsi_device *sdev;
> +
> + shost_for_each_device(sdev, hba->host)
> + blk_freeze_queue_start(sdev->request_queue);
> + shost_for_each_device(sdev, hba->host) {
> + struct request_queue *q = sdev->request_queue;
> +
> + blk_mq_freeze_queue_wait(q);
> + q->limits.driver_preserves_write_order = preserves_write_order;
> + blk_mq_unfreeze_queue(q);
> + }
> +}
> +
> void ufshcd_auto_hibern8_update(struct ufs_hba *hba, u32 ahit)
> {
> unsigned long flags;
> - bool update = false;
> + bool prev_state, new_state, update = false;
>
> if (!ufshcd_is_auto_hibern8_supported(hba))
> return;
>
> spin_lock_irqsave(hba->host->host_lock, flags);
> + prev_state = ufshcd_is_auto_hibern8_enabled(hba);
> if (hba->ahit != ahit) {
> hba->ahit = ahit;
> update = true;
> }
> + new_state = ufshcd_is_auto_hibern8_enabled(hba);
> spin_unlock_irqrestore(hba->host->host_lock, flags);
>
> if (!update)
> return;
> + if (!is_mcq_enabled(hba) && !prev_state && new_state) {
> + /*
> + * Auto-hibernation will be enabled for legacy UFSHCI mode. Tell
> + * the block layer that write requests may be reordered.
> + */
> + ufshcd_update_preserves_write_order(hba, false);
Hi Bart,
I am not reviewing other patches in this series, so I don't know the
whole context. Here is my comment on this patch alone.
It looks like you rely on ufshcd_auto_hibern8_update() being invoked so
that you can update driver_preserves_write_order. However, the
hba->ahit value can be updated by the vendor's driver, and
ufshcd_auto_hibern8_enable() can be invoked without
ufshcd_auto_hibern8_update(). Therefore, you may end up in a situation
where driver_preserves_write_order is still true (its default) while
auto-hibernation is also enabled by default.
Thanks,
Bao
> + }
> if (!pm_runtime_suspended(&hba->ufs_device_wlun->sdev_gendev)) {
> ufshcd_rpm_get_sync(hba);
> ufshcd_hold(hba);
> @@ -4361,6 +4386,13 @@ void ufshcd_auto_hibern8_update(struct ufs_hba *hba, u32 ahit)
> ufshcd_release(hba);
> ufshcd_rpm_put_sync(hba);
> }
> + if (!is_mcq_enabled(hba) && prev_state && !new_state) {
> + /*
> + * Auto-hibernation has been disabled. Tell the block layer that
> + * the order of write requests is preserved.
> + */
> + ufshcd_update_preserves_write_order(hba, true);
> + }
> }
> EXPORT_SYMBOL_GPL(ufshcd_auto_hibern8_update);
>
> @@ -5140,6 +5172,7 @@ static int ufshcd_slave_configure(struct scsi_device *sdev)
>
> ufshcd_hpb_configure(hba, sdev);
>
> + q->limits.driver_preserves_write_order = true;
> blk_queue_update_dma_pad(q, PRDT_DATA_BYTE_COUNT_PAD - 1);
> if (hba->quirks & UFSHCD_QUIRK_4KB_DMA_ALIGNMENT)
> blk_queue_update_dma_alignment(q, SZ_4K - 1);
* Re: [PATCH v8 3/9] scsi: core: Call .eh_prepare_resubmit() before resubmitting
2023-08-11 21:35 ` [PATCH v8 3/9] scsi: core: Call .eh_prepare_resubmit() before resubmitting Bart Van Assche
@ 2023-08-14 1:19 ` Damien Le Moal
2023-08-14 2:18 ` Bart Van Assche
0 siblings, 1 reply; 33+ messages in thread
From: Damien Le Moal @ 2023-08-14 1:19 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Ming Lei, James E.J. Bottomley
On 8/12/23 06:35, Bart Van Assche wrote:
> Make the error handler call .eh_prepare_resubmit() before resubmitting
This reads like the eh_prepare_resubmit callback already exists. But you are
adding it. So you should state that.
> commands. A later patch will use this functionality to sort SCSI
> commands per LBA from inside the SCSI disk driver.
>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> drivers/scsi/Kconfig | 2 +
> drivers/scsi/Kconfig.kunit | 4 ++
> drivers/scsi/Makefile | 2 +
> drivers/scsi/Makefile.kunit | 1 +
> drivers/scsi/scsi_error.c | 43 ++++++++++++++++
> drivers/scsi/scsi_error_test.c | 92 ++++++++++++++++++++++++++++++++++
> drivers/scsi/scsi_priv.h | 1 +
> include/scsi/scsi_driver.h | 1 +
> 8 files changed, 146 insertions(+)
> create mode 100644 drivers/scsi/Kconfig.kunit
> create mode 100644 drivers/scsi/Makefile.kunit
> create mode 100644 drivers/scsi/scsi_error_test.c
>
> diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
> index 4962ce989113..fc288f8fb800 100644
> --- a/drivers/scsi/Kconfig
> +++ b/drivers/scsi/Kconfig
> @@ -232,6 +232,8 @@ config SCSI_SCAN_ASYNC
> Note that this setting also affects whether resuming from
> system suspend will be performed asynchronously.
>
> +source "drivers/scsi/Kconfig.kunit"
> +
> menu "SCSI Transports"
> depends on SCSI
>
> diff --git a/drivers/scsi/Kconfig.kunit b/drivers/scsi/Kconfig.kunit
> new file mode 100644
> index 000000000000..68e3d90d49e9
> --- /dev/null
> +++ b/drivers/scsi/Kconfig.kunit
> @@ -0,0 +1,4 @@
> +config SCSI_ERROR_TEST
> + tristate "scsi_error.c unit tests" if !KUNIT_ALL_TESTS
> + depends on SCSI_MOD && KUNIT
> + default KUNIT_ALL_TESTS
> diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
> index f055bfd54a68..1c5c3afb6c6e 100644
> --- a/drivers/scsi/Makefile
> +++ b/drivers/scsi/Makefile
> @@ -168,6 +168,8 @@ scsi_mod-$(CONFIG_PM) += scsi_pm.o
> scsi_mod-$(CONFIG_SCSI_DH) += scsi_dh.o
> scsi_mod-$(CONFIG_BLK_DEV_BSG) += scsi_bsg.o
>
> +include $(srctree)/drivers/scsi/Makefile.kunit
> +
> hv_storvsc-y := storvsc_drv.o
>
> sd_mod-objs := sd.o
> diff --git a/drivers/scsi/Makefile.kunit b/drivers/scsi/Makefile.kunit
> new file mode 100644
> index 000000000000..3e98053b2709
> --- /dev/null
> +++ b/drivers/scsi/Makefile.kunit
> @@ -0,0 +1 @@
> +obj-$(CONFIG_SCSI_ERROR_TEST) += scsi_error_test.o
All the above kunit changes (and the test changes below) seem unrelated to what
the commit message describes. Should these be split into a different patch ?
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index c67cdcdc3ba8..0d7835bdc8af 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -27,6 +27,7 @@
> #include <linux/blkdev.h>
> #include <linux/delay.h>
> #include <linux/jiffies.h>
> +#include <linux/list_sort.h>
>
> #include <scsi/scsi.h>
> #include <scsi/scsi_cmnd.h>
> @@ -2186,6 +2187,46 @@ void scsi_eh_ready_devs(struct Scsi_Host *shost,
> }
> EXPORT_SYMBOL_GPL(scsi_eh_ready_devs);
>
> +/*
> + * Comparison function that allows to sort SCSI commands by ULD driver.
> + */
> +static int scsi_cmp_uld(void *priv, const struct list_head *_a,
> + const struct list_head *_b)
> +{
> + struct scsi_cmnd *a = list_entry(_a, typeof(*a), eh_entry);
> + struct scsi_cmnd *b = list_entry(_b, typeof(*b), eh_entry);
> +
> + /* See also the comment above the list_sort() definition. */
> + return scsi_cmd_to_driver(a) > scsi_cmd_to_driver(b);
> +}
> +
> +void scsi_call_prepare_resubmit(struct list_head *done_q)
> +{
> + struct scsi_cmnd *scmd, *next;
> +
> + /* Sort pending SCSI commands by ULD. */
> + list_sort(NULL, done_q, scsi_cmp_uld);
> +
> + /*
> + * Call .eh_prepare_resubmit for each range of commands with identical
> + * ULD driver pointer.
> + */
> + list_for_each_entry_safe(scmd, next, done_q, eh_entry) {
> + struct scsi_driver *uld = scsi_cmd_to_driver(scmd);
> + struct list_head *prev, uld_cmd_list;
> +
> + while (&next->eh_entry != done_q &&
> + scsi_cmd_to_driver(next) == uld)
> + next = list_next_entry(next, eh_entry);
> + if (!uld->eh_prepare_resubmit)
> + continue;
> + prev = scmd->eh_entry.prev;
> + list_cut_position(&uld_cmd_list, prev, next->eh_entry.prev);
> + uld->eh_prepare_resubmit(&uld_cmd_list);
Is it guaranteed that all uld implement eh_prepare_resubmit ?
> + list_splice(&uld_cmd_list, prev);
> + }
> +}
> +
> /**
> * scsi_eh_flush_done_q - finish processed commands or retry them.
> * @done_q: list_head of processed commands.
> @@ -2194,6 +2235,8 @@ void scsi_eh_flush_done_q(struct list_head *done_q)
> {
> struct scsi_cmnd *scmd, *next;
>
> + scsi_call_prepare_resubmit(done_q);
> +
> list_for_each_entry_safe(scmd, next, done_q, eh_entry) {
> list_del_init(&scmd->eh_entry);
> if (scsi_device_online(scmd->device) &&
> diff --git a/drivers/scsi/scsi_error_test.c b/drivers/scsi/scsi_error_test.c
> new file mode 100644
> index 000000000000..fd616527f726
> --- /dev/null
> +++ b/drivers/scsi/scsi_error_test.c
> @@ -0,0 +1,92 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright 2023 Google LLC
> + */
> +#include <kunit/test.h>
> +#include <scsi/scsi_cmnd.h>
> +#include <scsi/scsi_driver.h>
> +#include "scsi_priv.h"
> +
> +static struct kunit *kunit_test;
> +
> +static struct scsi_driver *uld1, *uld2, *uld3;
> +
> +static void uld1_prepare_resubmit(struct list_head *cmd_list)
> +{
> + struct scsi_cmnd *cmd;
> +
> + KUNIT_EXPECT_EQ(kunit_test, list_count_nodes(cmd_list), 2);
> + list_for_each_entry(cmd, cmd_list, eh_entry)
> + KUNIT_EXPECT_PTR_EQ(kunit_test, scsi_cmd_to_driver(cmd), uld1);
> +}
> +
> +static void uld2_prepare_resubmit(struct list_head *cmd_list)
> +{
> + struct scsi_cmnd *cmd;
> +
> + KUNIT_EXPECT_EQ(kunit_test, list_count_nodes(cmd_list), 2);
> + list_for_each_entry(cmd, cmd_list, eh_entry)
> + KUNIT_EXPECT_PTR_EQ(kunit_test, scsi_cmd_to_driver(cmd), uld2);
> +}
> +
> +static void test_prepare_resubmit(struct kunit *test)
> +{
> + static struct scsi_cmnd cmd1, cmd2, cmd3, cmd4, cmd5, cmd6;
> + static struct scsi_device dev1, dev2, dev3;
> + struct scsi_driver *uld;
> + LIST_HEAD(cmd_list);
> +
> + uld = kzalloc(3 * sizeof(*uld), GFP_KERNEL);
> + uld1 = &uld[0];
> + uld1->eh_prepare_resubmit = uld1_prepare_resubmit;
> + uld2 = &uld[1];
> + uld2->eh_prepare_resubmit = uld2_prepare_resubmit;
> + uld3 = &uld[2];
> + dev1.sdev_gendev.driver = &uld1->gendrv;
> + dev2.sdev_gendev.driver = &uld2->gendrv;
> + dev3.sdev_gendev.driver = &uld3->gendrv;
> + cmd1.device = &dev1;
> + cmd2.device = &dev1;
> + cmd3.device = &dev2;
> + cmd4.device = &dev2;
> + cmd5.device = &dev3;
> + cmd6.device = &dev3;
> + list_add_tail(&cmd1.eh_entry, &cmd_list);
> + list_add_tail(&cmd3.eh_entry, &cmd_list);
> + list_add_tail(&cmd5.eh_entry, &cmd_list);
> + list_add_tail(&cmd2.eh_entry, &cmd_list);
> + list_add_tail(&cmd4.eh_entry, &cmd_list);
> + list_add_tail(&cmd6.eh_entry, &cmd_list);
> +
> + KUNIT_EXPECT_EQ(test, list_count_nodes(&cmd_list), 6);
> + kunit_test = test;
> + scsi_call_prepare_resubmit(&cmd_list);
> + kunit_test = NULL;
> + KUNIT_EXPECT_EQ(test, list_count_nodes(&cmd_list), 6);
> + KUNIT_EXPECT_TRUE(test, uld1 < uld2);
> + KUNIT_EXPECT_TRUE(test, uld2 < uld3);
> + KUNIT_EXPECT_PTR_EQ(test, cmd_list.next, &cmd1.eh_entry);
> + KUNIT_EXPECT_PTR_EQ(test, cmd_list.next->next, &cmd2.eh_entry);
> + KUNIT_EXPECT_PTR_EQ(test, cmd_list.next->next->next, &cmd3.eh_entry);
> + KUNIT_EXPECT_PTR_EQ(test, cmd_list.next->next->next->next,
> + &cmd4.eh_entry);
> + KUNIT_EXPECT_PTR_EQ(test, cmd_list.next->next->next->next->next,
> + &cmd5.eh_entry);
> + KUNIT_EXPECT_PTR_EQ(test, cmd_list.next->next->next->next->next->next,
> + &cmd6.eh_entry);
> + kfree(uld);
> +}
> +
> +static struct kunit_case prepare_resubmit_test_cases[] = {
> + KUNIT_CASE(test_prepare_resubmit),
> + {}
> +};
> +
> +static struct kunit_suite prepare_resubmit_test_suite = {
> + .name = "prepare_resubmit",
> + .test_cases = prepare_resubmit_test_cases,
> +};
> +kunit_test_suite(prepare_resubmit_test_suite);
> +
> +MODULE_AUTHOR("Bart Van Assche");
> +MODULE_LICENSE("GPL");
> diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
> index f42388ecb024..df4af4645430 100644
> --- a/drivers/scsi/scsi_priv.h
> +++ b/drivers/scsi/scsi_priv.h
> @@ -101,6 +101,7 @@ int scsi_eh_get_sense(struct list_head *work_q,
> struct list_head *done_q);
> bool scsi_noretry_cmd(struct scsi_cmnd *scmd);
> void scsi_eh_done(struct scsi_cmnd *scmd);
> +void scsi_call_prepare_resubmit(struct list_head *done_q);
>
> /* scsi_lib.c */
> extern int scsi_maybe_unblock_host(struct scsi_device *sdev);
> diff --git a/include/scsi/scsi_driver.h b/include/scsi/scsi_driver.h
> index 4ce1988b2ba0..2b11be896eee 100644
> --- a/include/scsi/scsi_driver.h
> +++ b/include/scsi/scsi_driver.h
> @@ -18,6 +18,7 @@ struct scsi_driver {
> int (*done)(struct scsi_cmnd *);
> int (*eh_action)(struct scsi_cmnd *, int);
> void (*eh_reset)(struct scsi_cmnd *);
> + void (*eh_prepare_resubmit)(struct list_head *cmd_list);
> };
> #define to_scsi_driver(drv) \
> container_of((drv), struct scsi_driver, gendrv)
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v8 3/9] scsi: core: Call .eh_prepare_resubmit() before resubmitting
2023-08-14 1:19 ` Damien Le Moal
@ 2023-08-14 2:18 ` Bart Van Assche
2023-08-14 2:41 ` Damien Le Moal
0 siblings, 1 reply; 33+ messages in thread
From: Bart Van Assche @ 2023-08-14 2:18 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Ming Lei, James E.J. Bottomley
On 8/13/23 18:19, Damien Le Moal wrote:
> On 8/12/23 06:35, Bart Van Assche wrote:
>> Make the error handler call .eh_prepare_resubmit() before resubmitting
>
> This reads like the eh_prepare_resubmit callback already exists. But you are
> adding it. So you should state that.
Hi Damien,
I will rephrase the patch description.
>> +++ b/drivers/scsi/Makefile.kunit
>> @@ -0,0 +1 @@
>> +obj-$(CONFIG_SCSI_ERROR_TEST) += scsi_error_test.o
>
> All the above kunit changes (and the test changes below) seem unrelated to what
> the commit message describes. Should these be split into a different patch ?
Some people insist on including unit tests in the same patch as
the patch that introduces the code that is being tested. I can
move the unit test into a separate patch if that is preferred.
>> + /*
>> + * Call .eh_prepare_resubmit for each range of commands with identical
>> + * ULD driver pointer.
>> + */
>> + list_for_each_entry_safe(scmd, next, done_q, eh_entry) {
>> + struct scsi_driver *uld = scsi_cmd_to_driver(scmd);
>> + struct list_head *prev, uld_cmd_list;
>> +
>> + while (&next->eh_entry != done_q &&
>> + scsi_cmd_to_driver(next) == uld)
>> + next = list_next_entry(next, eh_entry);
>> + if (!uld->eh_prepare_resubmit)
>> + continue;
>> + prev = scmd->eh_entry.prev;
>> + list_cut_position(&uld_cmd_list, prev, next->eh_entry.prev);
>> + uld->eh_prepare_resubmit(&uld_cmd_list);
>
> Is it guaranteed that all uld implement eh_prepare_resubmit ?
That is not guaranteed. Hence the if (!uld->eh_prepare_resubmit)
test in the above loop.
Thanks,
Bart.
* Re: [PATCH v8 3/9] scsi: core: Call .eh_prepare_resubmit() before resubmitting
2023-08-14 2:18 ` Bart Van Assche
@ 2023-08-14 2:41 ` Damien Le Moal
2023-08-14 3:23 ` Bart Van Assche
0 siblings, 1 reply; 33+ messages in thread
From: Damien Le Moal @ 2023-08-14 2:41 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Ming Lei, James E.J. Bottomley
On 8/14/23 11:18, Bart Van Assche wrote:
> On 8/13/23 18:19, Damien Le Moal wrote:
>> On 8/12/23 06:35, Bart Van Assche wrote:
>>> Make the error handler call .eh_prepare_resubmit() before resubmitting
>>
>> This reads like the eh_prepare_resubmit callback already exists. But you are
>> adding it. So you should state that.
>
> Hi Damien,
>
> I will rephrase the patch description.
>
>>> +++ b/drivers/scsi/Makefile.kunit
>>> @@ -0,0 +1 @@
>>> +obj-$(CONFIG_SCSI_ERROR_TEST) += scsi_error_test.o
>>
>> All the above kunit changes (and the test changes below) seem unrelated to what
>> the commit message describes. Should these be split into a different patch ?
>
> Some people insist on including unit tests in the same patch as
> the patch that introduces the code that is being tested. I can
> move the unit test into a separate patch if that is preferred.
OK. I will let Martin decide that :)
But at the very least, maybe mention in the commit message that you also add
the unit tests associated with the change ?
>
>>> + /*
>>> + * Call .eh_prepare_resubmit for each range of commands with identical
>>> + * ULD driver pointer.
>>> + */
>>> + list_for_each_entry_safe(scmd, next, done_q, eh_entry) {
>>> + struct scsi_driver *uld = scsi_cmd_to_driver(scmd);
>>> + struct list_head *prev, uld_cmd_list;
>>> +
>>> + while (&next->eh_entry != done_q &&
>>> + scsi_cmd_to_driver(next) == uld)
>>> + next = list_next_entry(next, eh_entry);
>>> + if (!uld->eh_prepare_resubmit)
>>> + continue;
>>> + prev = scmd->eh_entry.prev;
>>> + list_cut_position(&uld_cmd_list, prev, next->eh_entry.prev);
>>> + uld->eh_prepare_resubmit(&uld_cmd_list);
>>
>> Is it guaranteed that all uld implement eh_prepare_resubmit ?
>
> That is not guaranteed. Hence the if (!uld->eh_prepare_resubmit)
> test in the above loop.
Oops. Missed that one !
Another remark: we'll need this sorting only for devices that are zoned and do
not need zone write locking. For other cases (most cases in fact), we don't and
this could have some performance impact (e.g. fast RAID systems). Is there a way
to have this eh_prepare_resubmit to do nothing for regular devices to avoid that ?
>
> Thanks,
>
> Bart.
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v8 3/9] scsi: core: Call .eh_prepare_resubmit() before resubmitting
2023-08-14 2:41 ` Damien Le Moal
@ 2023-08-14 3:23 ` Bart Van Assche
2023-08-14 4:18 ` Damien Le Moal
0 siblings, 1 reply; 33+ messages in thread
From: Bart Van Assche @ 2023-08-14 3:23 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Ming Lei, James E.J. Bottomley
On 8/13/23 19:41, Damien Le Moal wrote:
> But at the very least, maybe mention in the commit message that you also add
> the unit tests associated with the change ?
Hi Damien,
I will mention this in the patch description when I repost this patch.
> Another remark: we'll need this sorting only for devices that are zoned and do
> not need zone write locking. For other cases (most cases in fact), we don't and
> this could have some performance impact (e.g. fast RAID systems). Is there a way
> to have this eh_prepare_resubmit to do nothing for regular devices to avoid that ?
The common case is that all commands passed to the SCSI error handler
are associated with the same ULD. For this case list_sort() iterates
once over the list with commands that have to be resubmitted because
all eh_prepare_resubmit pointers are identical. The code introduced
in the next patch requires O(n^2) time with n the number of commands
passed to the error handler. I expect this time to be smaller than the
time needed to wake up the error handler if n < 100. Waking up the
error handler involves expediting the grace period (call_rcu_hurry())
and a context switch. I expect that these two mechanisms combined will
take more time than the list_sort() call if the number of commands that
failed is below 100. In other words, I don't think that
scsi_call_prepare_resubmit() will slow down error handling measurably
for the cases we care about.
Do you perhaps want me to change the code in the next patch such that
it takes O(n) time instead of O(n^2) time in case no zoned devices are
involved?
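For context, patch 4/9 implements this callback in the sd driver so that the
failed commands are sorted by LBA before being resubmitted. As a rough
illustration only (the helper names are made up and the posted patch may be
implemented differently), such a per-ULD callback could look like this:

#include <linux/list_sort.h>
#include <scsi/scsi_cmnd.h>

/* Order the failed commands by starting sector (ascending). */
static int sd_cmp_sector(void *priv, const struct list_head *a,
			 const struct list_head *b)
{
	struct scsi_cmnd *ca = list_entry(a, struct scsi_cmnd, eh_entry);
	struct scsi_cmnd *cb = list_entry(b, struct scsi_cmnd, eh_entry);
	sector_t sa = scsi_get_sector(ca), sb = scsi_get_sector(cb);

	/* list_sort() expects > 0 if @a should sort after @b. */
	return sa < sb ? -1 : sa > sb;
}

static void sd_prepare_resubmit(struct list_head *cmd_list)
{
	list_sort(NULL, cmd_list, sd_cmp_sector);
}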
Thanks,
Bart.
* Re: [PATCH v8 3/9] scsi: core: Call .eh_prepare_resubmit() before resubmitting
2023-08-14 3:23 ` Bart Van Assche
@ 2023-08-14 4:18 ` Damien Le Moal
2023-08-14 17:52 ` Bart Van Assche
0 siblings, 1 reply; 33+ messages in thread
From: Damien Le Moal @ 2023-08-14 4:18 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Ming Lei, James E.J. Bottomley
On 8/14/23 12:23, Bart Van Assche wrote:
> On 8/13/23 19:41, Damien Le Moal wrote:
>> But at the very least, maybe mention in the commit message that you also add
>> the unit tests associated with the change ?
>
> Hi Damien,
>
> I will mention this in the patch description when I repost this patch.
>
>> Another remark: we'll need this sorting only for devices that are zoned and do
>> not need zone write locking. For other cases (most cases in fact), we don't and
>> this could have some performance impact (e.g. fast RAID systems). Is there a way
>> to have this eh_prepare_resubmit to do nothing for regular devices to avoid that ?
>
> The common case is that all commands passed to the SCSI error handler
> are associated with the same ULD. For this case list_sort() iterates
> once over the list with commands that have to be resubmitted because
> all eh_prepare_resubmit pointers are identical. The code introduced
> in the next patch requires O(n^2) time with n the number of commands
> passed to the error handler. I expect this time to be smaller than the
> time needed to wake up the error handler if n < 100. Waking up the
> error handler involves expediting the grace period (call_rcu_hurry())
> and a context switch. I expect that these two mechanisms combined will
> take more time than the list_sort() call if the number of commands that
> failed is below 100. In other words, I don't think that
> scsi_call_prepare_resubmit() will slow down error handling measurably
> for the cases we care about.
>
> Do you perhaps want me to change the code in the next patch such that
> it takes O(n) time instead of O(n^2) time in case no zoned devices are
> involved?
I was more thinking of a way to not call scsi_call_prepare_resubmit() at all if
no zoned devices with use_zone_write_lock == true are involved.
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v8 1/9] block: Introduce more member variables related to zone write locking
2023-08-11 21:35 ` [PATCH v8 1/9] block: Introduce more member variables related to zone write locking Bart Van Assche
@ 2023-08-14 12:32 ` Damien Le Moal
2023-08-14 16:57 ` Bart Van Assche
0 siblings, 1 reply; 33+ messages in thread
From: Damien Le Moal @ 2023-08-14 12:32 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Ming Lei
On 8/12/23 06:35, Bart Van Assche wrote:
> Many but not all storage controllers require serialization of zoned writes.
> Introduce a new request queue limit member variable
> (driver_preserves_write_order) that allows block drivers to indicate that
> the order of write commands is preserved and hence that serialization of
> writes per zone is not required.
>
> Make disk_set_zoned() set 'use_zone_write_lock' only if the block device
> has zones and if the block driver does not preserve the order of write
> requests.
>
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/blk-settings.c | 7 +++++++
> include/linux/blkdev.h | 10 ++++++++++
> 2 files changed, 17 insertions(+)
>
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 0046b447268f..3a7748af1bef 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -56,6 +56,8 @@ void blk_set_default_limits(struct queue_limits *lim)
> lim->alignment_offset = 0;
> lim->io_opt = 0;
> lim->misaligned = 0;
> + lim->driver_preserves_write_order = false;
> + lim->use_zone_write_lock = false;
> lim->zoned = BLK_ZONED_NONE;
> lim->zone_write_granularity = 0;
> lim->dma_alignment = 511;
> @@ -685,6 +687,9 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> b->max_secure_erase_sectors);
> t->zone_write_granularity = max(t->zone_write_granularity,
> b->zone_write_granularity);
> + /* Request-based stacking drivers do not reorder requests. */
> + t->driver_preserves_write_order = b->driver_preserves_write_order;
> + t->use_zone_write_lock = b->use_zone_write_lock;
I do not think this is correct as the last target of a multi target device will
dictate the result, regardless of the other targets. So this should be something
like:
t->driver_preserves_write_order = t->driver_preserves_write_order &&
b->driver_preserves_write_order;
t->use_zone_write_lock =
t->use_zone_write_lock || b->use_zone_write_lock;
However, given that driver_preserves_write_order is initialized as false, this
would always be false. Not sure how to handle that...
> t->zoned = max(t->zoned, b->zoned);
> return ret;
> }
> @@ -949,6 +954,8 @@ void disk_set_zoned(struct gendisk *disk, enum blk_zoned_model model)
> }
>
> q->limits.zoned = model;
> + q->limits.use_zone_write_lock = model != BLK_ZONED_NONE &&
> + !q->limits.driver_preserves_write_order;
I think this needs a comment to explain the condition as it takes a while to
understand it.
> if (model != BLK_ZONED_NONE) {
> /*
> * Set the zone write granularity to the device logical block
You also should set use_zone_write_lock to false in disk_clear_zone_settings().
In patch 9, ufshcd_auto_hibern8_update() changes the value of
driver_preserves_write_order, which will change the value of use_zone_write_lock
only if disk_set_zoned() is called again after ufshcd_auto_hibern8_update(). Is
that the case ? Is the drive revalidated always after
ufshcd_auto_hibern8_update() is executed ?
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 2f5371b8482c..2c090a28ec78 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -316,6 +316,16 @@ struct queue_limits {
> unsigned char misaligned;
> unsigned char discard_misaligned;
> unsigned char raid_partial_stripes_expensive;
> + /*
> + * Whether or not the block driver preserves the order of write
> + * requests. Set by the block driver.
> + */
> + bool driver_preserves_write_order;
> + /*
> + * Whether or not zone write locking should be used. Set by
> + * disk_set_zoned().
> + */
> + bool use_zone_write_lock;
> enum blk_zoned_model zoned;
>
> /*
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v8 2/9] block/mq-deadline: Only use zone locking if necessary
2023-08-11 21:35 ` [PATCH v8 2/9] block/mq-deadline: Only use zone locking if necessary Bart Van Assche
@ 2023-08-14 12:33 ` Damien Le Moal
2023-08-14 17:00 ` Bart Van Assche
0 siblings, 1 reply; 33+ messages in thread
From: Damien Le Moal @ 2023-08-14 12:33 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Ming Lei
On 8/12/23 06:35, Bart Van Assche wrote:
> Measurements have shown that limiting the queue depth to one per zone for
> zoned writes has a significant negative performance impact on zoned UFS
> devices. Hence this patch that disables zone locking by the mq-deadline
> scheduler if the storage controller preserves the command order. This
> patch is based on the following assumptions:
> - It happens infrequently that zoned write requests are reordered by the
> block layer.
> - The I/O priority of all write requests is the same per zone.
> - Either no I/O scheduler is used or an I/O scheduler is used that
> serializes write requests per zone.
>
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/mq-deadline.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index f958e79277b8..5c2fc4003bc0 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -353,7 +353,7 @@ deadline_fifo_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
> return NULL;
>
> rq = rq_entry_fifo(per_prio->fifo_list[data_dir].next);
> - if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
> + if (data_dir == DD_READ || !rq->q->limits.use_zone_write_lock)
> return rq;
>
> /*
> @@ -398,7 +398,7 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
> if (!rq)
> return NULL;
>
> - if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
> + if (data_dir == DD_READ || !rq->q->limits.use_zone_write_lock)
> return rq;
>
> /*
> @@ -526,8 +526,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
> }
>
> /*
> - * For a zoned block device, if we only have writes queued and none of
> - * them can be dispatched, rq will be NULL.
> + * For a zoned block device that requires write serialization, if we
> + * only have writes queued and none of them can be dispatched, rq will
> + * be NULL.
> */
> if (!rq)
> return NULL;
> @@ -552,7 +553,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
> /*
> * If the request needs its target zone locked, do it.
> */
> - blk_req_zone_write_lock(rq);
> + if (rq->q->limits.use_zone_write_lock)
> + blk_req_zone_write_lock(rq);
> rq->rq_flags |= RQF_STARTED;
> return rq;
> }
> @@ -934,7 +936,7 @@ static void dd_finish_request(struct request *rq)
>
> atomic_inc(&per_prio->stats.completed);
>
> - if (blk_queue_is_zoned(q)) {
> + if (rq->q->limits.use_zone_write_lock) {
This is all nice and simple ! However, an inline helper to check
rq->q->limits.use_zone_write_lock would be nice. E.g.
blk_queue_use_zone_write_lock() ?
> unsigned long flags;
>
> spin_lock_irqsave(&dd->zone_lock, flags);
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v8 5/9] scsi: core: Retry unaligned zoned writes
2023-08-11 21:35 ` [PATCH v8 5/9] scsi: core: Retry unaligned zoned writes Bart Van Assche
@ 2023-08-14 12:36 ` Damien Le Moal
2023-08-14 17:57 ` Bart Van Assche
2023-08-15 17:29 ` Bart Van Assche
0 siblings, 2 replies; 33+ messages in thread
From: Damien Le Moal @ 2023-08-14 12:36 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Ming Lei, James E.J. Bottomley
On 8/12/23 06:35, Bart Van Assche wrote:
> If zoned writes (REQ_OP_WRITE) for a sequential write required zone have
> a starting LBA that differs from the write pointer, e.g. because zoned
> writes have been reordered, then the storage device will respond with an
> UNALIGNED WRITE COMMAND error. Send commands that failed with an
> unaligned write error to the SCSI error handler if zone write locking is
> disabled. Let the SCSI error handler sort SCSI commands per LBA before
> resubmitting these.
>
> If zone write locking is disabled, increase the number of retries for
> write commands sent to a sequential zone to the maximum number of
> outstanding commands because in the worst case the number of times
> reordered zoned writes have to be retried is (number of outstanding
> writes per sequential zone) - 1.
>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> drivers/scsi/scsi_error.c | 16 ++++++++++++++++
> drivers/scsi/scsi_lib.c | 1 +
> drivers/scsi/sd.c | 2 ++
> include/scsi/scsi.h | 1 +
> 4 files changed, 20 insertions(+)
>
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 0d7835bdc8af..7ae43fac07b7 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -699,6 +699,22 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
> fallthrough;
>
> case ILLEGAL_REQUEST:
> + /*
> + * Unaligned write command. This may indicate that zoned writes
> + * have been received by the device in the wrong order. If zone
> + * write locking is disabled, retry after all pending commands
> + * have completed.
> + */
> + if (sshdr.asc == 0x21 && sshdr.ascq == 0x04 &&
> + !req->q->limits.use_zone_write_lock &&
> + blk_rq_is_seq_zoned_write(req)) {
> + SCSI_LOG_ERROR_RECOVERY(3,
> + sdev_printk(KERN_INFO, scmd->device,
> + "Retrying unaligned write at LBA %#llx.\n",
> + scsi_get_lba(scmd)));
> + return NEEDS_DELAYED_RETRY;
> + }
> +
> if (sshdr.asc == 0x20 || /* Invalid command operation code */
> sshdr.asc == 0x21 || /* Logical block address out of range */
> sshdr.asc == 0x22 || /* Invalid function */
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 59176946ab56..69da8aee13df 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1443,6 +1443,7 @@ static void scsi_complete(struct request *rq)
> case ADD_TO_MLQUEUE:
> scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);
> break;
> + case NEEDS_DELAYED_RETRY:
> default:
> scsi_eh_scmd_add(cmd);
> break;
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 4d9c6ad11cca..c8466c6c7387 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -1238,6 +1238,8 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
> cmd->transfersize = sdp->sector_size;
> cmd->underflow = nr_blocks << 9;
> cmd->allowed = sdkp->max_retries;
> + if (!rq->q->limits.use_zone_write_lock && blk_rq_is_seq_zoned_write(rq))
This condition could be written as a little inline helper
blk_req_need_zone_write_lock(), which could be used in mq-deadline patch 2.
> + cmd->allowed += rq->q->nr_requests;
> cmd->sdb.length = nr_blocks * sdp->sector_size;
>
> SCSI_LOG_HLQUEUE(1,
> diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
> index ec093594ba53..6600db046227 100644
> --- a/include/scsi/scsi.h
> +++ b/include/scsi/scsi.h
> @@ -93,6 +93,7 @@ static inline int scsi_status_is_check_condition(int status)
> * Internal return values.
> */
> enum scsi_disposition {
> + NEEDS_DELAYED_RETRY = 0x2000,
> NEEDS_RETRY = 0x2001,
> SUCCESS = 0x2002,
> FAILED = 0x2003,
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v8 9/9] scsi: ufs: Inform the block layer about write ordering
2023-08-12 17:09 ` Bao D. Nguyen
@ 2023-08-14 16:23 ` Bart Van Assche
2023-08-15 3:20 ` Bao D. Nguyen
0 siblings, 1 reply; 33+ messages in thread
From: Bart Van Assche @ 2023-08-14 16:23 UTC (permalink / raw)
To: Bao D. Nguyen, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Can Guo, Avri Altman, Damien Le Moal, Ming Lei,
James E.J. Bottomley, Stanley Chu, Bean Huo, Asutosh Das,
Arthur Simchaev
On 8/12/23 10:09, Bao D. Nguyen wrote:
> I am not reviewing other patches in this series, so I don't know the
> whole context. Here is my comment on this patch alone.
>
> Looks like you rely on ufshcd_auto_hibern8_update() being invoked so
> that you can update the driver_preserves_write_order. However, the
> hba->ahit value can be updated by the vendor's driver, and
> ufshcd_auto_hibern8_enable() can be invoked without
> ufshcd_auto_hibern8_update(). Therefore, you may have a situation where
> the driver_preserves_write_order is true by default, but
> Auto-hibernation is enabled by default.
Hi Bao,
Other than setting a default value for auto-hibernation, vendor drivers
must not modify the auto-hibernation settings.
ufshcd_auto_hibern8_enable() calls from outside
ufshcd_auto_hibern8_update() are used to reapply auto-hibernation
settings and not to modify auto-hibernation settings.
So I think that integrating the following change on top of this patch is
sufficient:
@@ -5172,7 +5172,9 @@ static int ufshcd_slave_configure(struct
scsi_device *sdev)
ufshcd_hpb_configure(hba, sdev);
- q->limits.driver_preserves_write_order = true;
+ q->limits.driver_preserves_write_order =
+ !ufshcd_is_auto_hibern8_supported(hba) ||
+ FIELD_GET(UFSHCI_AHIBERN8_TIMER_MASK, hba->ahit) == 0;
blk_queue_update_dma_pad(q, PRDT_DATA_BYTE_COUNT_PAD - 1);
if (hba->quirks & UFSHCD_QUIRK_4KB_DMA_ALIGNMENT)
blk_queue_update_dma_alignment(q, SZ_4K - 1);
Thanks,
Bart.
* Re: [PATCH v8 1/9] block: Introduce more member variables related to zone write locking
2023-08-14 12:32 ` Damien Le Moal
@ 2023-08-14 16:57 ` Bart Van Assche
2023-08-15 2:01 ` Damien Le Moal
0 siblings, 1 reply; 33+ messages in thread
From: Bart Van Assche @ 2023-08-14 16:57 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Ming Lei
On 8/14/23 05:32, Damien Le Moal wrote:
> On 8/12/23 06:35, Bart Van Assche wrote:
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -56,6 +56,8 @@ void blk_set_default_limits(struct queue_limits *lim)
>> lim->alignment_offset = 0;
>> lim->io_opt = 0;
>> lim->misaligned = 0;
>> + lim->driver_preserves_write_order = false;
>> + lim->use_zone_write_lock = false;
>> lim->zoned = BLK_ZONED_NONE;
>> lim->zone_write_granularity = 0;
>> lim->dma_alignment = 511;
>> @@ -685,6 +687,9 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>> b->max_secure_erase_sectors);
>> t->zone_write_granularity = max(t->zone_write_granularity,
>> b->zone_write_granularity);
>> + /* Request-based stacking drivers do not reorder requests. */
>> + t->driver_preserves_write_order = b->driver_preserves_write_order;
>> + t->use_zone_write_lock = b->use_zone_write_lock;
>
> I do not think this is correct as the last target of a multi target device will
> dictate the result, regardless of the other targets. So this should be something
> like:
>
> t->driver_preserves_write_order = t->driver_preserves_write_order &&
> b->driver_preserves_write_order;
> t->use_zone_write_lock =
> t->use_zone_write_lock || b->use_zone_write_lock;
>
> However, given that driver_preserves_write_order is initialized as false, this
> would always be false. Not sure how to handle that...
How about integrating the (untested) change below into this patch? It keeps
the default value of driver_preserves_write_order false for regular block
drivers and changes the default to true for stacking drivers:
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -84,6 +84,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
lim->max_dev_sectors = UINT_MAX;
lim->max_write_zeroes_sectors = UINT_MAX;
lim->max_zone_append_sectors = UINT_MAX;
+ lim->driver_preserves_write_order = true;
}
EXPORT_SYMBOL(blk_set_stacking_limits);
@@ -688,8 +689,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
t->zone_write_granularity = max(t->zone_write_granularity,
b->zone_write_granularity);
/* Request-based stacking drivers do not reorder requests. */
- t->driver_preserves_write_order = b->driver_preserves_write_order;
- t->use_zone_write_lock = b->use_zone_write_lock;
+ t->driver_preserves_write_order = t->driver_preserves_write_order &&
+ b->driver_preserves_write_order;
+ t->use_zone_write_lock = t->use_zone_write_lock ||
+ b->use_zone_write_lock;
t->zoned = max(t->zoned, b->zoned);
return ret;
}
>> t->zoned = max(t->zoned, b->zoned);
>> return ret;
>> }
>> @@ -949,6 +954,8 @@ void disk_set_zoned(struct gendisk *disk, enum blk_zoned_model model)
>> }
>>
>> q->limits.zoned = model;
>> + q->limits.use_zone_write_lock = model != BLK_ZONED_NONE &&
>> + !q->limits.driver_preserves_write_order;
>
> I think this needs a comment to explain the condition as it takes a while to
> understand it.
Something like this?
/*
* Use the zone write lock only for zoned block devices and only if
* the block driver does not preserve the order of write commands.
*/
>> if (model != BLK_ZONED_NONE) {
>> /*
>> * Set the zone write granularity to the device logical block
>
> You also should set use_zone_write_lock to false in disk_clear_zone_settings().
I will do this.
> In patch 9, ufshcd_auto_hibern8_update() changes the value of
> driver_preserves_write_order, which will change the value of use_zone_write_lock
> only if disk_set_zoned() is called again after ufshcd_auto_hibern8_update(). Is
> that the case ? Is the drive revalidated always after
> ufshcd_auto_hibern8_update() is executed ?
I will start with testing this change on top of this patch series:
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -4350,6 +4350,7 @@ static void ufshcd_update_preserves_write_order(struct ufs_hba *hba,
blk_mq_freeze_queue_wait(q);
q->limits.driver_preserves_write_order = preserves_write_order;
blk_mq_unfreeze_queue(q);
+ scsi_rescan_device(&sdev->sdev_gendev);
}
}
Thanks,
Bart.
* Re: [PATCH v8 2/9] block/mq-deadline: Only use zone locking if necessary
2023-08-14 12:33 ` Damien Le Moal
@ 2023-08-14 17:00 ` Bart Van Assche
2023-08-15 1:57 ` Damien Le Moal
0 siblings, 1 reply; 33+ messages in thread
From: Bart Van Assche @ 2023-08-14 17:00 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Ming Lei
On 8/14/23 05:33, Damien Le Moal wrote:
> On 8/12/23 06:35, Bart Van Assche wrote:
>> @@ -934,7 +936,7 @@ static void dd_finish_request(struct request *rq)
>>
>> atomic_inc(&per_prio->stats.completed);
>>
>> - if (blk_queue_is_zoned(q)) {
>> + if (rq->q->limits.use_zone_write_lock) {
>
> This is all nice and simple ! However, an inline helper to check
> rq->q->limits.use_zone_write_lock would be nice. E.g.
> blk_queue_use_zone_write_lock() ?
Hi Damien,
Do you perhaps want me to introduce a function that does nothing else than
returning the value of q->limits.use_zone_write_lock? I'm asking this because
recently I have seen a fair number of patches that remove functions that do
nothing else than returning the value of a single member variable.
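Spelled out, the helper being suggested would be nothing more than this
one-line wrapper (a sketch, not part of the posted series):

static inline bool blk_queue_use_zone_write_lock(struct request_queue *q)
{
	return q->limits.use_zone_write_lock;
}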
Thanks,
Bart.
* Re: [PATCH v8 3/9] scsi: core: Call .eh_prepare_resubmit() before resubmitting
2023-08-14 4:18 ` Damien Le Moal
@ 2023-08-14 17:52 ` Bart Van Assche
0 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2023-08-14 17:52 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Ming Lei, James E.J. Bottomley
On 8/13/23 21:18, Damien Le Moal wrote:
> I was more thinking of a way to not call scsi_call_prepare_resubmit() at all if
> no zoned devices with use_zone_write_lock == true are involved.
That sounds easy to implement. I will include this change when I repost this
patch series.
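A minimal sketch of such a check (hypothetical helper name; it assumes the
use_zone_write_lock queue limit from patch 1/9 and the
blk_rq_is_seq_zoned_write() helper used elsewhere in this series):

/*
 * Hypothetical: the .eh_prepare_resubmit callbacks only matter if at least
 * one failed command is a write to a sequential zone on a queue that does
 * not use the zone write lock.
 */
static bool scsi_any_unlocked_seq_zoned_write(struct list_head *done_q)
{
	struct scsi_cmnd *scmd;

	list_for_each_entry(scmd, done_q, eh_entry) {
		struct request *rq = scsi_cmd_to_rq(scmd);

		if (!rq->q->limits.use_zone_write_lock &&
		    blk_rq_is_seq_zoned_write(rq))
			return true;
	}

	return false;
}

The error handler could then skip the scsi_call_prepare_resubmit() call when
this helper returns false.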
Thanks,
Bart.
* Re: [PATCH v8 5/9] scsi: core: Retry unaligned zoned writes
2023-08-14 12:36 ` Damien Le Moal
@ 2023-08-14 17:57 ` Bart Van Assche
2023-08-15 1:52 ` Damien Le Moal
2023-08-15 17:29 ` Bart Van Assche
1 sibling, 1 reply; 33+ messages in thread
From: Bart Van Assche @ 2023-08-14 17:57 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Ming Lei, James E.J. Bottomley
On 8/14/23 05:36, Damien Le Moal wrote:
> On 8/12/23 06:35, Bart Van Assche wrote:
>> + if (!rq->q->limits.use_zone_write_lock && blk_rq_is_seq_zoned_write(rq))
>
> This condition could be written as a little inline helper
> blk_req_need_zone_write_lock(), which could be used in mq-deadline patch 2.
Hi Damien,
Since q->limits.use_zone_write_lock is being introduced, how about
modifying blk_req_needs_zone_write_lock() such that it tests that new member
variable instead of checking rq->q->disk->seq_zones_wlock? That would allow
me to leave out one change from block/mq-deadline.c. I do not have a strong
opinion about whether the name blk_req_needs_zone_write_lock() should be
retained or whether that function should be renamed into
blk_req_use_zone_write_lock().
Thanks,
Bart.
* Re: [PATCH v8 5/9] scsi: core: Retry unaligned zoned writes
2023-08-14 17:57 ` Bart Van Assche
@ 2023-08-15 1:52 ` Damien Le Moal
0 siblings, 0 replies; 33+ messages in thread
From: Damien Le Moal @ 2023-08-15 1:52 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Ming Lei, James E.J. Bottomley
On 8/15/23 02:57, Bart Van Assche wrote:
> On 8/14/23 05:36, Damien Le Moal wrote:
>> On 8/12/23 06:35, Bart Van Assche wrote:
>>> + if (!rq->q->limits.use_zone_write_lock && blk_rq_is_seq_zoned_write(rq))
>>
>> This condition could be written as a little inline helper
>> blk_req_need_zone_write_lock(), which could be used in mq-dealine patch 2.
>
> Hi Damien,
>
> Since q->limits.use_zone_write_lock is being introduced, how about
> modifying blk_req_needs_zone_write_lock() such that it tests that new member
> variable instead of checking rq->q->disk->seq_zones_wlock? That would allow
> me to leave out one change from block/mq-deadline.c. I do not have a strong
> opinion about whether the name blk_req_needs_zone_write_lock() should be
> retained or whether that function should be renamed into
> blk_req_use_zone_write_lock().
Something like this ?
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 619ee41a51cc..a3980a71c0ac 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -57,7 +57,12 @@ EXPORT_SYMBOL_GPL(blk_zone_cond_str);
*/
bool blk_req_needs_zone_write_lock(struct request *rq)
{
- if (!rq->q->disk->seq_zones_wlock)
+ struct request_queue *q = rq->q;
+
+ if (!q->limits.use_zone_write_lock)
+ return false;
+
+ if (!q->disk->seq_zones_wlock)
return false;
return blk_rq_is_seq_zoned_write(rq);
I think that is fine and avoids adding yet another helper.
I am OK with this.
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v8 2/9] block/mq-deadline: Only use zone locking if necessary
2023-08-14 17:00 ` Bart Van Assche
@ 2023-08-15 1:57 ` Damien Le Moal
0 siblings, 0 replies; 33+ messages in thread
From: Damien Le Moal @ 2023-08-15 1:57 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Ming Lei
On 8/15/23 02:00, Bart Van Assche wrote:
> On 8/14/23 05:33, Damien Le Moal wrote:
>> On 8/12/23 06:35, Bart Van Assche wrote:
>>> @@ -934,7 +936,7 @@ static void dd_finish_request(struct request *rq)
>>>
>>> atomic_inc(&per_prio->stats.completed);
>>>
>>> - if (blk_queue_is_zoned(q)) {
>>> + if (rq->q->limits.use_zone_write_lock) {
>>
>> This is all nice and simple ! However, an inline helper to check
>> rq->q->limits.use_zone_write_lock would be nice. E.g.
>> blk_queue_use_zone_write_lock() ?
>
> Hi Damien,
>
> Do you perhaps want me to introduce a function that does nothing else than
> returning the value of q->limits.use_zone_write_lock? I'm asking this because
> recently I have seen a fair number of patches that remove functions that do
> nothing else than returning the value of a single member variable.
I think that what you proposed in your other email (modifying
blk_req_needs_zone_write_lock()) is better when you need to test
use_zone_write_lock for a request.
Not sure about the cases where we need to test that limit using the queue only.
I personally like helpers that avoid hardcoding accesses to the queue limits,
but if such helpers are not OK, that is fine. No strong opinion.
>
> Thanks,
>
> Bart.
>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v8 1/9] block: Introduce more member variables related to zone write locking
2023-08-14 16:57 ` Bart Van Assche
@ 2023-08-15 2:01 ` Damien Le Moal
2023-08-15 16:06 ` Bart Van Assche
0 siblings, 1 reply; 33+ messages in thread
From: Damien Le Moal @ 2023-08-15 2:01 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Ming Lei
On 8/15/23 01:57, Bart Van Assche wrote:
> On 8/14/23 05:32, Damien Le Moal wrote:
>> On 8/12/23 06:35, Bart Van Assche wrote:
>>> --- a/block/blk-settings.c
>>> +++ b/block/blk-settings.c
>>> @@ -56,6 +56,8 @@ void blk_set_default_limits(struct queue_limits *lim)
>>> lim->alignment_offset = 0;
>>> lim->io_opt = 0;
>>> lim->misaligned = 0;
>>> + lim->driver_preserves_write_order = false;
>>> + lim->use_zone_write_lock = false;
>>> lim->zoned = BLK_ZONED_NONE;
>>> lim->zone_write_granularity = 0;
>>> lim->dma_alignment = 511;
>>> @@ -685,6 +687,9 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>>> b->max_secure_erase_sectors);
>>> t->zone_write_granularity = max(t->zone_write_granularity,
>>> b->zone_write_granularity);
>>> + /* Request-based stacking drivers do not reorder requests. */
>>> + t->driver_preserves_write_order = b->driver_preserves_write_order;
>>> + t->use_zone_write_lock = b->use_zone_write_lock;
>>
>> I do not think this is correct as the last target of a multi target device will
>> dictate the result, regardless of the other targets. So this should be something
>> like:
>>
>> t->driver_preserves_write_order = t->driver_preserves_write_order &&
>> b->driver_preserves_write_order;
>> t->use_zone_write_lock =
>> t->use_zone_write_lock || b->use_zone_write_lock;
>>
>> However, given that driver_preserves_write_order is initialized as false, this
>> would always be false. Not sure how to handle that...
>
> How about integrating the (untested) change below into this patch? It keeps
> the default value of driver_preserves_write_order false for regular block
> drivers and changes the default to true for stacking drivers:
>
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -84,6 +84,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
> lim->max_dev_sectors = UINT_MAX;
> lim->max_write_zeroes_sectors = UINT_MAX;
> lim->max_zone_append_sectors = UINT_MAX;
> + lim->driver_preserves_write_order = true;
> }
> EXPORT_SYMBOL(blk_set_stacking_limits);
>
> @@ -688,8 +689,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> t->zone_write_granularity = max(t->zone_write_granularity,
> b->zone_write_granularity);
> /* Request-based stacking drivers do not reorder requests. */
> - t->driver_preserves_write_order = b->driver_preserves_write_order;
> - t->use_zone_write_lock = b->use_zone_write_lock;
> + t->driver_preserves_write_order = t->driver_preserves_write_order &&
> + b->driver_preserves_write_order;
> + t->use_zone_write_lock = t->use_zone_write_lock ||
> + b->use_zone_write_lock;
> t->zoned = max(t->zoned, b->zoned);
> return ret;
> }
I think that should work. Testing/checking this with dm-linear by combining
different null-blk devices with different configs should be easy enough.
>
>
>>> t->zoned = max(t->zoned, b->zoned);
>>> return ret;
>>> }
>>> @@ -949,6 +954,8 @@ void disk_set_zoned(struct gendisk *disk, enum blk_zoned_model model)
>>> }
>>>
>>> q->limits.zoned = model;
>>> + q->limits.use_zone_write_lock = model != BLK_ZONED_NONE &&
>>> + !q->limits.driver_preserves_write_order;
>>
>> I think this needs a comment to explain the condition as it takes a while to
>> understand it.
>
> Something like this?
>
> /*
> * Use the zone write lock only for zoned block devices and only if
> * the block driver does not preserve the order of write commands.
> */
That works for me.
>
>>> if (model != BLK_ZONED_NONE) {
>>> /*
>>> * Set the zone write granularity to the device logical block
>>
>> You also should set use_zone_write_lock to false in disk_clear_zone_settings().
>
> I will do this.
>
>> In patch 9, ufshcd_auto_hibern8_update() changes the value of
>> driver_preserves_write_order, which will change the value of use_zone_write_lock
>> only if disk_set_zoned() is called again after ufshcd_auto_hibern8_update(). Is
>> that the case ? Is the drive revalidated always after
>> ufshcd_auto_hibern8_update() is executed ?
>
> I will start with testing this change on top of this patch series:
>
> --- a/drivers/ufs/core/ufshcd.c
> +++ b/drivers/ufs/core/ufshcd.c
> @@ -4350,6 +4350,7 @@ static void ufshcd_update_preserves_write_order(struct ufs_hba *hba,
> blk_mq_freeze_queue_wait(q);
> q->limits.driver_preserves_write_order = preserves_write_order;
> blk_mq_unfreeze_queue(q);
> + scsi_rescan_device(&sdev->sdev_gendev);
Maybe calling disk_set_zoned() before blk_mq_unfreeze_queue() should be enough ?
rescan/revalidate will be done in case you get a topology change event (or
equivalent), which I think is not the case here.
> }
> }
>
> Thanks,
>
> Bart.
>
>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v8 9/9] scsi: ufs: Inform the block layer about write ordering
2023-08-14 16:23 ` Bart Van Assche
@ 2023-08-15 3:20 ` Bao D. Nguyen
2023-08-15 15:41 ` Bart Van Assche
0 siblings, 1 reply; 33+ messages in thread
From: Bao D. Nguyen @ 2023-08-15 3:20 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Can Guo, Avri Altman, Damien Le Moal, Ming Lei,
James E.J. Bottomley, Stanley Chu, Bean Huo, Asutosh Das,
Arthur Simchaev
On 8/14/2023 9:23 AM, Bart Van Assche wrote:
> On 8/12/23 10:09, Bao D. Nguyen wrote:
>> I am not reviewing other patches in this series, so I don't know the
>> whole context. Here is my comment on this patch alone.
>>
>> Looks like you rely on ufshcd_auto_hibern8_update() being invoked so
>> that you can update the driver_preserves_write_order. However, the
>> hba->ahit value can be updated by the vendor's driver, and
>> ufshcd_auto_hibern8_enable() can be invoked without
>> ufshcd_auto_hibern8_update(). Therefore, you may have a situation
>> where the driver_preserves_write_order is true by default, but
>> Auto-hibernation is enabled by default.
>
> Hi Bao,
>
> Other than setting a default value for auto-hibernation, vendor drivers
> must not modify the auto-hibernation settings.
IMO, it is not a good solution to prohibit changing the auto-hibernation
settings at runtime. I can think of a few situations where changing this
parameter may help the system, such as:
- Reducing heat: the device may be hot, so hibernating more often helps.
- Saving battery: the battery is low, so hibernating sooner saves power.
- Letting the vendor decide whether to trade performance for power, or
vice versa. Sometimes the system can afford to trade performance for
power savings.
How about this?
- Make ufshcd_auto_hibern8_enable() a static function. It should not be
called from the vendor drivers.
- Mandate that vendor drivers must only update auto-hibernation settings
via the ufshcd_auto_hibern8_update() API. This function is already
exported, so we only need to update the function comment to make it clear
(updating the hba->ahit alone may result in abnormal behavior).
Thanks,
Bao
>
> ufshcd_auto_hibern8_enable() calls from outside
> ufshcd_auto_hibern8_update() are used to reapply auto-hibernation
> settings and not to modify auto-hibernation settings.
>
> So I think that integrating the following change on top of this patch is
> sufficient:
>
> @@ -5172,7 +5172,9 @@ static int ufshcd_slave_configure(struct
> scsi_device *sdev)
>
> ufshcd_hpb_configure(hba, sdev);
>
> - q->limits.driver_preserves_write_order = true;
> + q->limits.driver_preserves_write_order =
> + !ufshcd_is_auto_hibern8_supported(hba) ||
> + FIELD_GET(UFSHCI_AHIBERN8_TIMER_MASK, hba->ahit) == 0;
Yes, this should help.
> blk_queue_update_dma_pad(q, PRDT_DATA_BYTE_COUNT_PAD - 1);
> if (hba->quirks & UFSHCD_QUIRK_4KB_DMA_ALIGNMENT)
> blk_queue_update_dma_alignment(q, SZ_4K - 1);
>
> Thanks,
>
> Bart.
>
* Re: [PATCH v8 9/9] scsi: ufs: Inform the block layer about write ordering
2023-08-15 3:20 ` Bao D. Nguyen
@ 2023-08-15 15:41 ` Bart Van Assche
0 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2023-08-15 15:41 UTC (permalink / raw)
To: Bao D. Nguyen, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Can Guo, Avri Altman, Damien Le Moal, Ming Lei,
James E.J. Bottomley, Stanley Chu, Bean Huo, Asutosh Das,
Arthur Simchaev
On 8/14/23 20:20, Bao D. Nguyen wrote:
> How about this?
> - Make ufshcd_auto_hibern8_enable() a static function. It should not be
>   called from the vendor drivers.
> - Mandate that vendor drivers must only update auto-hibernation settings
>   via the ufshcd_auto_hibern8_update() API. This function is already
>   exported, so we only need to update the function comment to make it
>   clear (updating the hba->ahit alone may result in abnormal behavior).
That sounds good to me. I will look into implementing the above proposal.
Thanks,
Bart.
* Re: [PATCH v8 1/9] block: Introduce more member variables related to zone write locking
2023-08-15 2:01 ` Damien Le Moal
@ 2023-08-15 16:06 ` Bart Van Assche
0 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2023-08-15 16:06 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Ming Lei
On 8/14/23 19:01, Damien Le Moal wrote:
> On 8/15/23 01:57, Bart Van Assche wrote:
>> --- a/drivers/ufs/core/ufshcd.c
>> +++ b/drivers/ufs/core/ufshcd.c
>> @@ -4350,6 +4350,7 @@ static void ufshcd_update_preserves_write_order(struct ufs_hba *hba,
>> blk_mq_freeze_queue_wait(q);
>> q->limits.driver_preserves_write_order = preserves_write_order;
>> blk_mq_unfreeze_queue(q);
>> + scsi_rescan_device(&sdev->sdev_gendev);
>
> Maybe calling disk_set_zoned() before blk_mq_unfreeze_queue() should be enough ?
> rescan/revalidate will be done in case you get a topology change event (or
> equivalent), which I think is not the case here.
Hi Damien,
I will look into calling disk_set_zoned() before blk_mq_unfreeze_queue().
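Something along these lines, perhaps (a sketch only; the loop over the SCSI
devices is omitted and the q->disk check is an assumption):

	blk_mq_freeze_queue_wait(q);
	q->limits.driver_preserves_write_order = preserves_write_order;
	/* Recompute use_zone_write_lock before unfreezing the queue. */
	if (q->disk)
		disk_set_zoned(q->disk, q->limits.zoned);
	blk_mq_unfreeze_queue(q);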
Thanks,
Bart.
* Re: [PATCH v8 5/9] scsi: core: Retry unaligned zoned writes
2023-08-14 12:36 ` Damien Le Moal
2023-08-14 17:57 ` Bart Van Assche
@ 2023-08-15 17:29 ` Bart Van Assche
2023-08-16 1:13 ` Damien Le Moal
1 sibling, 1 reply; 33+ messages in thread
From: Bart Van Assche @ 2023-08-15 17:29 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Ming Lei, James E.J. Bottomley
On 8/14/23 05:36, Damien Le Moal wrote:
> On 8/12/23 06:35, Bart Van Assche wrote:
>> --- a/drivers/scsi/sd.c
>> +++ b/drivers/scsi/sd.c
>> @@ -1238,6 +1238,8 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>> cmd->transfersize = sdp->sector_size;
>> cmd->underflow = nr_blocks << 9;
>> cmd->allowed = sdkp->max_retries;
>> + if (!rq->q->limits.use_zone_write_lock && blk_rq_is_seq_zoned_write(rq))
>
> This condition could be written as a little inline helper
> blk_req_need_zone_write_lock(), which could be used in mq-deadline patch 2.
The above test differs from the tests in the mq-deadline I/O scheduler.
The mq-deadline I/O scheduler uses write locking if
rq->q->limits.use_zone_write_lock && q->disk->seq_zones_wlock &&
blk_rq_is_seq_zoned_write(rq). The above test is different and occurs
two times in scsi_error.c and one time in sd.c. Do you still want me to
introduce a helper function for that expression?
Thanks,
Bart.
* Re: [PATCH v8 5/9] scsi: core: Retry unaligned zoned writes
2023-08-15 17:29 ` Bart Van Assche
@ 2023-08-16 1:13 ` Damien Le Moal
2023-08-16 19:59 ` Bart Van Assche
0 siblings, 1 reply; 33+ messages in thread
From: Damien Le Moal @ 2023-08-16 1:13 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Ming Lei, James E.J. Bottomley
On 8/16/23 02:29, Bart Van Assche wrote:
> On 8/14/23 05:36, Damien Le Moal wrote:
>> On 8/12/23 06:35, Bart Van Assche wrote:
>>> --- a/drivers/scsi/sd.c
>>> +++ b/drivers/scsi/sd.c
>>> @@ -1238,6 +1238,8 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>>> cmd->transfersize = sdp->sector_size;
>>> cmd->underflow = nr_blocks << 9;
>>> cmd->allowed = sdkp->max_retries;
>>> + if (!rq->q->limits.use_zone_write_lock && blk_rq_is_seq_zoned_write(rq))
>>
>> This condition could be written as a little inline helper
>> blk_req_need_zone_write_lock(), which could be used in mq-deadline patch 2.
>
> The above test differs from the tests in the mq-deadline I/O scheduler.
> The mq-deadline I/O scheduler uses write locking if
> rq->q->limits.use_zone_write_lock && q->disk->seq_zones_wlock &&
> blk_rq_is_seq_zoned_write(rq). The above test is different and occurs
> two times in scsi_error.c and one time in sd.c. Do you still want me to
> introduce a helper function for that expression?
Maybe not needed. But a comment explaining it may be good to add.
I still think that adding the test for use_zone_write_lock in
blk_req_needs_zone_write_lock() is needed though, for consistency of all
functions related to zone locking.
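For example, the sd.c hunk could carry a comment along these lines (wording
is only a suggestion):

	/*
	 * If zone write locking is not used, writes to sequential zones may
	 * get reordered and fail with an unaligned write error, after which
	 * the error handler retries them. In the worst case a write has to
	 * be retried once for every other outstanding write to the same
	 * zone, so allow up to nr_requests additional retries.
	 */
	if (!rq->q->limits.use_zone_write_lock && blk_rq_is_seq_zoned_write(rq))
		cmd->allowed += rq->q->nr_requests;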
>
> Thanks,
>
> Bart.
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v8 5/9] scsi: core: Retry unaligned zoned writes
2023-08-16 1:13 ` Damien Le Moal
@ 2023-08-16 19:59 ` Bart Van Assche
0 siblings, 0 replies; 33+ messages in thread
From: Bart Van Assche @ 2023-08-16 19:59 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Martin K . Petersen, Christoph Hellwig,
Ming Lei, James E.J. Bottomley
On 8/15/23 18:13, Damien Le Moal wrote:
> I still think that added the test for use_zone_write_lock in
> blk_req_needs_zone_write_lock() is needed though, for consistency of all
> functions related to zone locking.
This has been implemented in v9 of this patch series. v9 has just been
posted.
Thanks,
Bart.
Thread overview: 33+ messages
2023-08-11 21:35 [PATCH v8 0/9] Improve performance for zoned UFS devices Bart Van Assche
2023-08-11 21:35 ` [PATCH v8 1/9] block: Introduce more member variables related to zone write locking Bart Van Assche
2023-08-14 12:32 ` Damien Le Moal
2023-08-14 16:57 ` Bart Van Assche
2023-08-15 2:01 ` Damien Le Moal
2023-08-15 16:06 ` Bart Van Assche
2023-08-11 21:35 ` [PATCH v8 2/9] block/mq-deadline: Only use zone locking if necessary Bart Van Assche
2023-08-14 12:33 ` Damien Le Moal
2023-08-14 17:00 ` Bart Van Assche
2023-08-15 1:57 ` Damien Le Moal
2023-08-11 21:35 ` [PATCH v8 3/9] scsi: core: Call .eh_prepare_resubmit() before resubmitting Bart Van Assche
2023-08-14 1:19 ` Damien Le Moal
2023-08-14 2:18 ` Bart Van Assche
2023-08-14 2:41 ` Damien Le Moal
2023-08-14 3:23 ` Bart Van Assche
2023-08-14 4:18 ` Damien Le Moal
2023-08-14 17:52 ` Bart Van Assche
2023-08-11 21:35 ` [PATCH v8 4/9] scsi: sd: Sort commands by LBA " Bart Van Assche
2023-08-11 21:35 ` [PATCH v8 5/9] scsi: core: Retry unaligned zoned writes Bart Van Assche
2023-08-14 12:36 ` Damien Le Moal
2023-08-14 17:57 ` Bart Van Assche
2023-08-15 1:52 ` Damien Le Moal
2023-08-15 17:29 ` Bart Van Assche
2023-08-16 1:13 ` Damien Le Moal
2023-08-16 19:59 ` Bart Van Assche
2023-08-11 21:35 ` [PATCH v8 6/9] scsi: scsi_debug: Support disabling zone write locking Bart Van Assche
2023-08-11 21:35 ` [PATCH v8 7/9] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
2023-08-11 21:35 ` [PATCH v8 8/9] scsi: ufs: Split an if-condition Bart Van Assche
2023-08-11 21:35 ` [PATCH v8 9/9] scsi: ufs: Inform the block layer about write ordering Bart Van Assche
2023-08-12 17:09 ` Bao D. Nguyen
2023-08-14 16:23 ` Bart Van Assche
2023-08-15 3:20 ` Bao D. Nguyen
2023-08-15 15:41 ` Bart Van Assche