[PATCH v2 0/5] Enable zoned write pipelining for UFS devices

linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/5] Enable zoned write pipelining for UFS devices
@ 2023-07-10 18:01 Bart Van Assche
  2023-07-10 18:01 ` [PATCH v2 1/5] block: Introduce a request queue flag for pipelining zoned writes Bart Van Assche
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Bart Van Assche @ 2023-07-10 18:01 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Bart Van Assche

Hi Jens,

This patch series improves write performance for zoned UFS devices. Please
consider these patches for the next merge window.

Thank you,

Bart.

Changes compared to v1:
 - Left out the patches that are already upstream.
 - Switched the approach in patch "scsi: Retry unaligned zoned writes" from
   retrying immediately to sending unaligned write commands to the SCSI error
   handler.

Bart Van Assche (5):
  block: Introduce a request queue flag for pipelining zoned writes
  block/mq-deadline: Only use zone locking if necessary
  block/null_blk: Add support for pipelining zoned writes
  scsi: Retry unaligned zoned writes
  scsi: ufs: Enable zoned write pipelining

 block/blk-zoned.c                 |  3 ++-
 block/mq-deadline.c               | 14 +++++++-----
 drivers/block/null_blk/main.c     |  2 ++
 drivers/block/null_blk/null_blk.h |  1 +
 drivers/block/null_blk/zoned.c    |  3 +++
 drivers/scsi/scsi_error.c         | 37 +++++++++++++++++++++++++++++++
 drivers/scsi/scsi_lib.c           |  1 +
 drivers/scsi/sd.c                 |  3 +++
 drivers/ufs/core/ufshcd.c         |  1 +
 include/linux/blkdev.h            |  7 ++++++
 include/scsi/scsi.h               |  1 +
 11 files changed, 67 insertions(+), 6 deletions(-)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 1/5] block: Introduce a request queue flag for pipelining zoned writes
  2023-07-10 18:01 [PATCH v2 0/5] Enable zoned write pipelining for UFS devices Bart Van Assche
@ 2023-07-10 18:01 ` Bart Van Assche
  2023-07-18  6:34   ` Damien Le Moal
  2023-07-10 18:01 ` [PATCH v2 2/5] block/mq-deadline: Only use zone locking if necessary Bart Van Assche
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Bart Van Assche @ 2023-07-10 18:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Damien Le Moal

Writes in sequential write required zones must happen at the write
pointer. Even if the submitter of the write commands (e.g. a filesystem)
submits writes for sequential write required zones in order, the block
layer or the storage controller may reorder these write commands.

The zone locking mechanism in the mq-deadline I/O scheduler serializes
write commands for sequential zones. Some but not all storage controllers
require this serialization. Introduce a new flag such that block drivers
can request pipelining of writes for sequential write required zones.

An example of a storage controller standard that requires write
serialization is AHCI (Advanced Host Controller Interface). Submitting
commands to AHCI controllers happens by writing a bit pattern into a
register. Each set bit corresponds to an active command. This mechanism
does not preserve command ordering information.

Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 include/linux/blkdev.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ed44a997f629..805012c5a6ab 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -534,6 +534,8 @@ struct request_queue {
 #define QUEUE_FLAG_NONROT	6	/* non-rotational device (SSD) */
 #define QUEUE_FLAG_VIRT		QUEUE_FLAG_NONROT /* paravirt device */
 #define QUEUE_FLAG_IO_STAT	7	/* do disk/partitions IO accounting */
+/* Writes for sequential write required zones may be pipelined. */
+#define QUEUE_FLAG_PIPELINE_ZONED_WRITES 8
 #define QUEUE_FLAG_NOXMERGES	9	/* No extended merges */
 #define QUEUE_FLAG_ADD_RANDOM	10	/* Contributes to random pool */
 #define QUEUE_FLAG_SYNCHRONOUS	11	/* always completes in submit context */
@@ -596,6 +598,11 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 #define blk_queue_skip_tagset_quiesce(q) \
 	test_bit(QUEUE_FLAG_SKIP_TAGSET_QUIESCE, &(q)->queue_flags)

+static inline bool blk_queue_pipeline_zoned_writes(struct request_queue *q)
+{
+	return test_bit(QUEUE_FLAG_PIPELINE_ZONED_WRITES, &q->queue_flags);
+}
+
 extern void blk_set_pm_only(struct request_queue *q);
 extern void blk_clear_pm_only(struct request_queue *q);

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 2/5] block/mq-deadline: Only use zone locking if necessary
  2023-07-10 18:01 [PATCH v2 0/5] Enable zoned write pipelining for UFS devices Bart Van Assche
  2023-07-10 18:01 ` [PATCH v2 1/5] block: Introduce a request queue flag for pipelining zoned writes Bart Van Assche
@ 2023-07-10 18:01 ` Bart Van Assche
  2023-07-18  6:38   ` Damien Le Moal
  2023-07-10 18:01 ` [PATCH v2 3/5] block/null_blk: Add support for pipelining zoned writes Bart Van Assche
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Bart Van Assche @ 2023-07-10 18:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Damien Le Moal

Measurements have shown that limiting the queue depth to one for zoned
writes has a significant negative performance impact on zoned UFS devices.
Hence this patch that disables zone locking from the mq-deadline scheduler
for storage controllers that support pipelining zoned writes. This patch is
based on the following assumptions:
- Applications submit write requests to sequential write required zones
  in order.
- It happens infrequently that zoned write requests are reordered by the
  block layer.
- The storage controller does not reorder write requests that have been
  submitted to the same hardware queue. This is the case for UFS: the
  UFSHCI specification requires that UFS controllers process requests in
  order per hardware queue.
- The I/O priority of all pipelined write requests is the same per zone.
- Either no I/O scheduler is used or an I/O scheduler is used that
  submits write requests per zone in LBA order.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-zoned.c   |  3 ++-
 block/mq-deadline.c | 14 +++++++++-----
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 0f9f97cdddd9..59560d1657e3 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -504,7 +504,8 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 		break;
 	case BLK_ZONE_TYPE_SEQWRITE_REQ:
 	case BLK_ZONE_TYPE_SEQWRITE_PREF:
-		if (!args->seq_zones_wlock) {
+		if (!blk_queue_pipeline_zoned_writes(q) &&
+		    !args->seq_zones_wlock) {
 			args->seq_zones_wlock =
 				blk_alloc_zone_bitmap(q->node, args->nr_zones);
 			if (!args->seq_zones_wlock)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 6aa5daf7ae32..0bed2bdeed89 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -353,7 +353,8 @@ deadline_fifo_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 		return NULL;
 
 	rq = rq_entry_fifo(per_prio->fifo_list[data_dir].next);
-	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
+	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q) ||
+	    blk_queue_pipeline_zoned_writes(rq->q))
 		return rq;
 
 	/*
@@ -398,7 +399,8 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 	if (!rq)
 		return NULL;
 
-	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
+	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q) ||
+	    blk_queue_pipeline_zoned_writes(rq->q))
 		return rq;
 
 	/*
@@ -526,8 +528,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	}
 
 	/*
-	 * For a zoned block device, if we only have writes queued and none of
-	 * them can be dispatched, rq will be NULL.
+	 * For a zoned block device that requires write serialization, if we
+	 * only have writes queued and none of them can be dispatched, rq will
+	 * be NULL.
 	 */
 	if (!rq)
 		return NULL;
@@ -933,7 +936,8 @@ static void dd_finish_request(struct request *rq)
 
 	atomic_inc(&per_prio->stats.completed);
 
-	if (blk_queue_is_zoned(q)) {
+	if (blk_queue_is_zoned(rq->q) &&
+	    !blk_queue_pipeline_zoned_writes(q)) {
 		unsigned long flags;
 
 		spin_lock_irqsave(&dd->zone_lock, flags);

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 3/5] block/null_blk: Add support for pipelining zoned writes
  2023-07-10 18:01 [PATCH v2 0/5] Enable zoned write pipelining for UFS devices Bart Van Assche
  2023-07-10 18:01 ` [PATCH v2 1/5] block: Introduce a request queue flag for pipelining zoned writes Bart Van Assche
  2023-07-10 18:01 ` [PATCH v2 2/5] block/mq-deadline: Only use zone locking if necessary Bart Van Assche
@ 2023-07-10 18:01 ` Bart Van Assche
  2023-07-10 18:01 ` [PATCH v2 4/5] scsi: Retry unaligned " Bart Van Assche
  2023-07-10 18:01 ` [PATCH v2 5/5] scsi: ufs: Enable zoned write pipelining Bart Van Assche
  4 siblings, 0 replies; 19+ messages in thread
From: Bart Van Assche @ 2023-07-10 18:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Damien Le Moal,
	Chaitanya Kulkarni, Damien Le Moal, Johannes Thumshirn,
	Vincent Fu, Shin'ichiro Kawasaki, Akinobu Mita

Add a new configfs attribute for enabling pipelining of zoned writes.
The test script below reports 250 K IOPS with no I/O scheduler, 6 K IOPS
with mq-deadline and pipelining disabled and 123 K IOPS with mq-deadline
and pipelining enabled. This shows that pipelining results in about 20
times more IOPS for this particular test case.

    #!/bin/bash

    for mode in "none 0" "mq-deadline 0" "mq-deadline 1"; do
        set +e
        for d in /sys/kernel/config/nullb/*; do
            [ -d "$d" ] && rmdir "$d"
        done
        modprobe -r null_blk
        set -e
        read -r iosched pipelining <<<"$mode"
        modprobe null_blk nr_devices=0
        (
            cd /sys/kernel/config/nullb
            mkdir nullb0
            cd nullb0
            params=(
                completion_nsec=100000
                hw_queue_depth=64
                irqmode=2                # NULL_IRQ_TIMER
                max_sectors=$((4096/512))
                memory_backed=1
                pipeline_zoned_writes="${pipelining}"
                size=1
                submit_queues=1
                zone_size=1
                zoned=1
                power=1
            )
            for p in "${params[@]}"; do
                echo "${p//*=}" > "${p//=*}"
            done
        )
        udevadm settle
        dev=/dev/nullb0
        [ -b "${dev}" ]
        params=(
            --direct=1
            --filename="${dev}"
            --iodepth=64
            --iodepth_batch=16
            --ioengine=io_uring
            --ioscheduler="${iosched}"
            --gtod_reduce=1
            --hipri=0
            --name=nullb0
            --runtime=30
            --rw=write
            --time_based=1
            --zonemode=zbd
        )
        fio "${params[@]}"
    done

Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/block/null_blk/main.c     | 2 ++
 drivers/block/null_blk/null_blk.h | 1 +
 drivers/block/null_blk/zoned.c    | 3 +++
 3 files changed, 6 insertions(+)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 864013019d6b..1cd6eb4e2c16 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -424,6 +424,7 @@ NULLB_DEVICE_ATTR(zone_capacity, ulong, NULL);
 NULLB_DEVICE_ATTR(zone_nr_conv, uint, NULL);
 NULLB_DEVICE_ATTR(zone_max_open, uint, NULL);
 NULLB_DEVICE_ATTR(zone_max_active, uint, NULL);
+NULLB_DEVICE_ATTR(pipeline_zoned_writes, bool, NULL);
 NULLB_DEVICE_ATTR(virt_boundary, bool, NULL);
 NULLB_DEVICE_ATTR(no_sched, bool, NULL);
 NULLB_DEVICE_ATTR(shared_tag_bitmap, bool, NULL);
@@ -569,6 +570,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
 	&nullb_device_attr_zone_max_active,
 	&nullb_device_attr_zone_readonly,
 	&nullb_device_attr_zone_offline,
+	&nullb_device_attr_pipeline_zoned_writes,
 	&nullb_device_attr_virt_boundary,
 	&nullb_device_attr_no_sched,
 	&nullb_device_attr_shared_tag_bitmap,
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 929f659dd255..248acf288a8e 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -117,6 +117,7 @@ struct nullb_device {
 	bool memory_backed; /* if data is stored in memory */
 	bool discard; /* if support discard */
 	bool zoned; /* if device is zoned */
+	bool pipeline_zoned_writes;
 	bool virt_boundary; /* virtual boundary on/off for the device */
 	bool no_sched; /* no IO scheduler for the device */
 	bool shared_tag_bitmap; /* use hostwide shared tags */
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index 635ce0648133..2bfd5f7cee67 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -96,6 +96,9 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
 
 	spin_lock_init(&dev->zone_res_lock);
 
+	if (dev->pipeline_zoned_writes)
+		blk_queue_flag_set(QUEUE_FLAG_PIPELINE_ZONED_WRITES, q);
+
 	if (dev->zone_nr_conv >= dev->nr_zones) {
 		dev->zone_nr_conv = dev->nr_zones - 1;
 		pr_info("changed the number of conventional zones to %u",

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 4/5] scsi: Retry unaligned zoned writes
  2023-07-10 18:01 [PATCH v2 0/5] Enable zoned write pipelining for UFS devices Bart Van Assche
                   ` (2 preceding siblings ...)
  2023-07-10 18:01 ` [PATCH v2 3/5] block/null_blk: Add support for pipelining zoned writes Bart Van Assche
@ 2023-07-10 18:01 ` Bart Van Assche
  2023-07-18  6:47   ` Damien Le Moal
  2023-07-10 18:01 ` [PATCH v2 5/5] scsi: ufs: Enable zoned write pipelining Bart Van Assche
  4 siblings, 1 reply; 19+ messages in thread
From: Bart Van Assche @ 2023-07-10 18:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Martin K . Petersen, Damien Le Moal, James E.J. Bottomley

From ZBC-2: "The device server terminates with CHECK CONDITION status, with
the sense key set to ILLEGAL REQUEST, and the additional sense code set to
UNALIGNED WRITE COMMAND a write command, other than an entire medium write
same command, that specifies: a) the starting LBA in a sequential write
required zone set to a value that is not equal to the write pointer for that
sequential write required zone; or b) an ending LBA that is not equal to the
last logical block within a physical block (see SBC-5)."

I am not aware of any other conditions that may trigger the UNALIGNED
WRITE COMMAND response.

Send commands that failed with an unaligned write error to the SCSI error
handler. Let the SCSI error handler sort SCSI commands per LBA before
resubmitting these.

Increase the number of retries for write commands sent to a sequential
zone to the maximum number of outstanding commands.

Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/scsi_error.c | 37 +++++++++++++++++++++++++++++++++++++
 drivers/scsi/scsi_lib.c   |  1 +
 drivers/scsi/sd.c         |  3 +++
 include/scsi/scsi.h       |  1 +
 4 files changed, 42 insertions(+)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 3ec8bfd4090f..a6d562f3a085 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -27,6 +27,7 @@
 #include <linux/blkdev.h>
 #include <linux/delay.h>
 #include <linux/jiffies.h>
+#include <linux/list_sort.h>
 
 #include <scsi/scsi.h>
 #include <scsi/scsi_cmnd.h>
@@ -681,6 +682,17 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
 		fallthrough;
 
 	case ILLEGAL_REQUEST:
+		/*
+		 * Unaligned write command. This indicates that zoned writes
+		 * have been received by the device in the wrong order. If zoned
+		 * write pipelining is enabled, retry after all pending commands
+		 * have completed.
+		 */
+		if (sshdr.asc == 0x21 && sshdr.ascq == 0x04 &&
+		    blk_queue_pipeline_zoned_writes(sdev->request_queue) &&
+		    !scsi_noretry_cmd(scmd) && scsi_cmd_retry_allowed(scmd))
+			return NEEDS_DELAYED_RETRY;
+
 		if (sshdr.asc == 0x20 || /* Invalid command operation code */
 		    sshdr.asc == 0x21 || /* Logical block address out of range */
 		    sshdr.asc == 0x22 || /* Invalid function */
@@ -2177,6 +2189,25 @@ void scsi_eh_flush_done_q(struct list_head *done_q)
 }
 EXPORT_SYMBOL(scsi_eh_flush_done_q);
 
+/*
+ * Returns a negative value if @_a has a lower LBA than @_b, zero if
+ * both have the same LBA and a positive value otherwise.
+ */
+static int scsi_cmp_lba(void *priv, const struct list_head *_a,
+			const struct list_head *_b)
+{
+	struct scsi_cmnd *a = list_entry(_a, typeof(*a), eh_entry);
+	struct scsi_cmnd *b = list_entry(_b, typeof(*b), eh_entry);
+	const sector_t pos_a = blk_rq_pos(scsi_cmd_to_rq(a));
+	const sector_t pos_b = blk_rq_pos(scsi_cmd_to_rq(b));
+
+	if (pos_a < pos_b)
+		return -1;
+	if (pos_a > pos_b)
+		return 1;
+	return 0;
+}
+
 /**
  * scsi_unjam_host - Attempt to fix a host which has a cmd that failed.
  * @shost:	Host to unjam.
@@ -2212,6 +2243,12 @@ static void scsi_unjam_host(struct Scsi_Host *shost)
 
 	SCSI_LOG_ERROR_RECOVERY(1, scsi_eh_prt_fail_stats(shost, &eh_work_q));
 
+	/*
+	 * Sort pending SCSI commands in LBA order. This is important if write
+	 * pipelining is enabled for a zoned SCSI device.
+	 */
+	list_sort(NULL, &eh_work_q, scsi_cmp_lba);
+
 	if (!scsi_eh_get_sense(&eh_work_q, &eh_done_q))
 		scsi_eh_ready_devs(shost, &eh_work_q, &eh_done_q);
 
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0226c9279cef..a6cfdc4bfbf1 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1445,6 +1445,7 @@ static void scsi_complete(struct request *rq)
 	case ADD_TO_MLQUEUE:
 		scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);
 		break;
+	case NEEDS_DELAYED_RETRY:
 	default:
 		scsi_eh_scmd_add(cmd);
 		break;
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index ab216976dbdc..6cf7495c6b56 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1206,6 +1206,9 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 	cmd->transfersize = sdp->sector_size;
 	cmd->underflow = nr_blocks << 9;
 	cmd->allowed = sdkp->max_retries;
+	if (blk_queue_pipeline_zoned_writes(rq->q) &&
+	    blk_rq_is_seq_zoned_write(rq))
+		cmd->allowed += rq->q->nr_requests;
 	cmd->sdb.length = nr_blocks * sdp->sector_size;
 
 	SCSI_LOG_HLQUEUE(1,
diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
index ec093594ba53..6600db046227 100644
--- a/include/scsi/scsi.h
+++ b/include/scsi/scsi.h
@@ -93,6 +93,7 @@ static inline int scsi_status_is_check_condition(int status)
  * Internal return values.
  */
 enum scsi_disposition {
+	NEEDS_DELAYED_RETRY	= 0x2000,
 	NEEDS_RETRY		= 0x2001,
 	SUCCESS			= 0x2002,
 	FAILED			= 0x2003,

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 5/5] scsi: ufs: Enable zoned write pipelining
  2023-07-10 18:01 [PATCH v2 0/5] Enable zoned write pipelining for UFS devices Bart Van Assche
                   ` (3 preceding siblings ...)
  2023-07-10 18:01 ` [PATCH v2 4/5] scsi: Retry unaligned " Bart Van Assche
@ 2023-07-10 18:01 ` Bart Van Assche
  4 siblings, 0 replies; 19+ messages in thread
From: Bart Van Assche @ 2023-07-10 18:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Martin K . Petersen, Damien Le Moal, Avri Altman,
	James E.J. Bottomley, Stanley Chu, Manivannan Sadhasivam,
	Asutosh Das, Bean Huo, Jinyoung Choi, Ziqi Chen, Arthur Simchaev,
	Adrien Thierry

From the UFSHCI 4.0 specification, about the legacy (single queue) mode:
"The host controller always process transfer requests in-order according
to the order submitted to the list. In case of multiple commands with
single doorbell register ringing (batch mode), The dispatch order for
these transfer requests by host controller will base on their index in
the List. A transfer request with lower index value will be executed
before a transfer request with higher index value."

From the UFSHCI 4.0 specification, about the MCQ mode:
"Command Submission
1. Host SW writes an Entry to SQ
2. Host SW updates SQ doorbell tail pointer

Command Processing
3. After fetching the Entry, Host Controller updates SQ doorbell head
   pointer
4. Host controller sends COMMAND UPIU to UFS device"

In other words, for both legacy and MCQ mode, UFS controllers are
required to forward commands to the UFS device in the order these
commands have been received from the host.

Note: for legacy mode this is only correct if the host submits one
command at a time. The UFS driver does this.

This patch improves small write IOPS by about 150% on my test setup.

Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Avri Altman <avri.altman@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/ufs/core/ufshcd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index e7e79f515e14..8d0e495ae6fa 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -5146,6 +5146,7 @@ static int ufshcd_slave_configure(struct scsi_device *sdev)

 	ufshcd_hpb_configure(hba, sdev);

+	blk_queue_flag_set(QUEUE_FLAG_PIPELINE_ZONED_WRITES, q);
 	blk_queue_update_dma_pad(q, PRDT_DATA_BYTE_COUNT_PAD - 1);
 	if (hba->quirks & UFSHCD_QUIRK_4KB_DMA_ALIGNMENT)
 		blk_queue_update_dma_alignment(q, 4096 - 1);

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/5] block: Introduce a request queue flag for pipelining zoned writes
  2023-07-10 18:01 ` [PATCH v2 1/5] block: Introduce a request queue flag for pipelining zoned writes Bart Van Assche
@ 2023-07-18  6:34   ` Damien Le Moal
  2023-07-18 22:37     ` Bart Van Assche
  0 siblings, 1 reply; 19+ messages in thread
From: Damien Le Moal @ 2023-07-18  6:34 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Damien Le Moal

On 7/11/23 03:01, Bart Van Assche wrote:
> Writes in sequential write required zones must happen at the write
> pointer. Even if the submitter of the write commands (e.g. a filesystem)
> submits writes for sequential write required zones in order, the block
> layer or the storage controller may reorder these write commands.
> 
> The zone locking mechanism in the mq-deadline I/O scheduler serializes
> write commands for sequential zones. Some but not all storage controllers
> require this serialization. Introduce a new flag such that block drivers
> can request pipelining of writes for sequential write required zones.
> 
> An example of a storage controller standard that requires write
> serialization is AHCI (Advanced Host Controller Interface). Submitting
> commands to AHCI controllers happens by writing a bit pattern into a
> register. Each set bit corresponds to an active command. This mechanism
> does not preserve command ordering information.
> 
> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  include/linux/blkdev.h | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index ed44a997f629..805012c5a6ab 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -534,6 +534,8 @@ struct request_queue {
>  #define QUEUE_FLAG_NONROT	6	/* non-rotational device (SSD) */
>  #define QUEUE_FLAG_VIRT		QUEUE_FLAG_NONROT /* paravirt device */
>  #define QUEUE_FLAG_IO_STAT	7	/* do disk/partitions IO accounting */
> +/* Writes for sequential write required zones may be pipelined. */
> +#define QUEUE_FLAG_PIPELINE_ZONED_WRITES 8

I am not a big fan of this name as "pipeline" does not necessarily imply "high
queue depth write". What about simply calling this
QUEUE_FLAG_NO_ZONE_WRITE_LOCK, indicating that there is no need to write-lock
zones ?

>  #define QUEUE_FLAG_NOXMERGES	9	/* No extended merges */
>  #define QUEUE_FLAG_ADD_RANDOM	10	/* Contributes to random pool */
>  #define QUEUE_FLAG_SYNCHRONOUS	11	/* always completes in submit context */
> @@ -596,6 +598,11 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
>  #define blk_queue_skip_tagset_quiesce(q) \
>  	test_bit(QUEUE_FLAG_SKIP_TAGSET_QUIESCE, &(q)->queue_flags)
>  
> +static inline bool blk_queue_pipeline_zoned_writes(struct request_queue *q)
> +{
> +	return test_bit(QUEUE_FLAG_PIPELINE_ZONED_WRITES, &q->queue_flags);
> +}
> +
>  extern void blk_set_pm_only(struct request_queue *q);
>  extern void blk_clear_pm_only(struct request_queue *q);
>  

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/5] block/mq-deadline: Only use zone locking if necessary
  2023-07-10 18:01 ` [PATCH v2 2/5] block/mq-deadline: Only use zone locking if necessary Bart Van Assche
@ 2023-07-18  6:38   ` Damien Le Moal
  2023-07-18 22:38     ` Bart Van Assche
  2023-07-24 21:39     ` Bart Van Assche
  0 siblings, 2 replies; 19+ messages in thread
From: Damien Le Moal @ 2023-07-18  6:38 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Damien Le Moal

On 7/11/23 03:01, Bart Van Assche wrote:
> Measurements have shown that limiting the queue depth to one for zoned
> writes has a significant negative performance impact on zoned UFS devices.
> Hence this patch that disables zone locking from the mq-deadline scheduler
> for storage controllers that support pipelining zoned writes. This patch is
> based on the following assumptions:
> - Applications submit write requests to sequential write required zones
>   in order.
> - It happens infrequently that zoned write requests are reordered by the
>   block layer.
> - The storage controller does not reorder write requests that have been
>   submitted to the same hardware queue. This is the case for UFS: the
>   UFSHCI specification requires that UFS controllers process requests in
>   order per hardware queue.
> - The I/O priority of all pipelined write requests is the same per zone.
> - Either no I/O scheduler is used or an I/O scheduler is used that
>   submits write requests per zone in LBA order.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-zoned.c   |  3 ++-
>  block/mq-deadline.c | 14 +++++++++-----
>  2 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 0f9f97cdddd9..59560d1657e3 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -504,7 +504,8 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
>  		break;
>  	case BLK_ZONE_TYPE_SEQWRITE_REQ:
>  	case BLK_ZONE_TYPE_SEQWRITE_PREF:
> -		if (!args->seq_zones_wlock) {
> +		if (!blk_queue_pipeline_zoned_writes(q) &&
> +		    !args->seq_zones_wlock) {

I think that this change should go into the first patch, no ?

>  			args->seq_zones_wlock =
>  				blk_alloc_zone_bitmap(q->node, args->nr_zones);
>  			if (!args->seq_zones_wlock)
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 6aa5daf7ae32..0bed2bdeed89 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -353,7 +353,8 @@ deadline_fifo_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>  		return NULL;
>  
>  	rq = rq_entry_fifo(per_prio->fifo_list[data_dir].next);
> -	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
> +	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q) ||
> +	    blk_queue_pipeline_zoned_writes(rq->q))

What about using blk_req_needs_zone_write_lock() ?

>  		return rq;
>  
>  	/*
> @@ -398,7 +399,8 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>  	if (!rq)
>  		return NULL;
>  
> -	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
> +	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q) ||
> +	    blk_queue_pipeline_zoned_writes(rq->q))

Same.

>  		return rq;
>  
>  	/*
> @@ -526,8 +528,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
>  	}
>  
>  	/*
> -	 * For a zoned block device, if we only have writes queued and none of
> -	 * them can be dispatched, rq will be NULL.
> +	 * For a zoned block device that requires write serialization, if we
> +	 * only have writes queued and none of them can be dispatched, rq will
> +	 * be NULL.
>  	 */
>  	if (!rq)
>  		return NULL;
> @@ -933,7 +936,8 @@ static void dd_finish_request(struct request *rq)
>  
>  	atomic_inc(&per_prio->stats.completed);
>  
> -	if (blk_queue_is_zoned(q)) {
> +	if (blk_queue_is_zoned(rq->q) &&
> +	    !blk_queue_pipeline_zoned_writes(q)) {

And again here.

>  		unsigned long flags;
>  
>  		spin_lock_irqsave(&dd->zone_lock, flags);

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 4/5] scsi: Retry unaligned zoned writes
  2023-07-10 18:01 ` [PATCH v2 4/5] scsi: Retry unaligned " Bart Van Assche
@ 2023-07-18  6:47   ` Damien Le Moal
  2023-07-18 22:53     ` Bart Van Assche
  0 siblings, 1 reply; 19+ messages in thread
From: Damien Le Moal @ 2023-07-18  6:47 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Martin K . Petersen,
	James E.J. Bottomley

On 7/11/23 03:01, Bart Van Assche wrote:
> From ZBC-2: "The device server terminates with CHECK CONDITION status, with
> the sense key set to ILLEGAL REQUEST, and the additional sense code set to
> UNALIGNED WRITE COMMAND a write command, other than an entire medium write
> same command, that specifies: a) the starting LBA in a sequential write
> required zone set to a value that is not equal to the write pointer for that
> sequential write required zone; or b) an ending LBA that is not equal to the
> last logical block within a physical block (see SBC-5)."
> 
> I am not aware of any other conditions that may trigger the UNALIGNED
> WRITE COMMAND response.

Trying to write less than 4KB on a 4Kn drive. But that should not ever happen
for kernel issued commands.

> 
> Send commands that failed with an unaligned write error to the SCSI error
> handler. Let the SCSI error handler sort SCSI commands per LBA before
> resubmitting these.
> 
> Increase the number of retries for write commands sent to a sequential
> zone to the maximum number of outstanding commands.

I think I mentioned this before. When we started btrfs work, we did something
similar (but at the IO scheduler level) to try to avoid adding a big lock in
btrfs to serialize (and thus order) writes. What we discovered is that it was
extremely easy to fall into a situation were the maximum number of possible
outstanding request is already issued, but they all are behind a "hole" and
indefinitely delayed because the missing request cannot be issued due to the max
nr request limit being reached. No forward progress and deadlock.

I do not see how your change addresses this problem. The same will happen with
this and I do not have any suggestion how to solve this. For btrfs, we ended up
using cone append emulation for scsi to avoid the big lock and avoid the FS from
having to order writes. That solution guarantees forward progress. Delaying
already issued writes that are not sequential has no such guarantees.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/5] block: Introduce a request queue flag for pipelining zoned writes
  2023-07-18  6:34   ` Damien Le Moal
@ 2023-07-18 22:37     ` Bart Van Assche
  2023-07-19  9:58       ` Damien Le Moal
  0 siblings, 1 reply; 19+ messages in thread
From: Bart Van Assche @ 2023-07-18 22:37 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe; +Cc: linux-block, Christoph Hellwig, Damien Le Moal

On 7/17/23 23:34, Damien Le Moal wrote:
> On 7/11/23 03:01, Bart Van Assche wrote:
>> +/* Writes for sequential write required zones may be pipelined. */
>> +#define QUEUE_FLAG_PIPELINE_ZONED_WRITES 8
> 
> I am not a big fan of this name as "pipeline" does not necessarily imply "high
> queue depth write". What about simply calling this
> QUEUE_FLAG_NO_ZONE_WRITE_LOCK, indicating that there is no need to write-lock
> zones ?

Hi Damien,

I'm not a big fan of names with negative words like "no" or "not" embedded.
Isn't pipelining standard computer science terminology? See also
https://en.wikipedia.org/wiki/Pipeline_(computing).

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/5] block/mq-deadline: Only use zone locking if necessary
  2023-07-18  6:38   ` Damien Le Moal
@ 2023-07-18 22:38     ` Bart Van Assche
  2023-07-24 21:39     ` Bart Van Assche
  1 sibling, 0 replies; 19+ messages in thread
From: Bart Van Assche @ 2023-07-18 22:38 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe; +Cc: linux-block, Christoph Hellwig, Damien Le Moal

On 7/17/23 23:38, Damien Le Moal wrote:
> On 7/11/23 03:01, Bart Van Assche wrote:
>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>> index 0f9f97cdddd9..59560d1657e3 100644
>> --- a/block/blk-zoned.c
>> +++ b/block/blk-zoned.c
>> @@ -504,7 +504,8 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
>>   		break;
>>   	case BLK_ZONE_TYPE_SEQWRITE_REQ:
>>   	case BLK_ZONE_TYPE_SEQWRITE_PREF:
>> -		if (!args->seq_zones_wlock) {
>> +		if (!blk_queue_pipeline_zoned_writes(q) &&
>> +		    !args->seq_zones_wlock) {
> 
> I think that this change should go into the first patch, no ?

That's a good point. I will move this change into the first patch.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 4/5] scsi: Retry unaligned zoned writes
  2023-07-18  6:47   ` Damien Le Moal
@ 2023-07-18 22:53     ` Bart Van Assche
  2023-07-19  9:59       ` Damien Le Moal
  0 siblings, 1 reply; 19+ messages in thread
From: Bart Van Assche @ 2023-07-18 22:53 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Martin K . Petersen,
	James E.J. Bottomley

On 7/17/23 23:47, Damien Le Moal wrote:
> On 7/11/23 03:01, Bart Van Assche wrote:
>> Send commands that failed with an unaligned write error to the SCSI error
>> handler. Let the SCSI error handler sort SCSI commands per LBA before
>> resubmitting these.
>>
>> Increase the number of retries for write commands sent to a sequential
>> zone to the maximum number of outstanding commands.
> 
> I think I mentioned this before. When we started btrfs work, we did something
> similar (but at the IO scheduler level) to try to avoid adding a big lock in
> btrfs to serialize (and thus order) writes. What we discovered is that it was
> extremely easy to fall into a situation were the maximum number of possible
> outstanding request is already issued, but they all are behind a "hole" and
> indefinitely delayed because the missing request cannot be issued due to the max
> nr request limit being reached. No forward progress and deadlock.
> 
> I do not see how your change addresses this problem. The same will happen with
> this and I do not have any suggestion how to solve this. For btrfs, we ended up
> using cone append emulation for scsi to avoid the big lock and avoid the FS from
> having to order writes. That solution guarantees forward progress. Delaying
> already issued writes that are not sequential has no such guarantees.

Hi Damien,

Thank you for having explained in detail the scenario that you ran into.

I think what has been explained above is a scenario in which the filesystem
allocates requests per zone in another order than the LBA order. How about
requiring that the filesystem allocates and submits zoned writes in LBA order
per zone? I think that this is how F2FS supports zoned storage.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/5] block: Introduce a request queue flag for pipelining zoned writes
  2023-07-18 22:37     ` Bart Van Assche
@ 2023-07-19  9:58       ` Damien Le Moal
  0 siblings, 0 replies; 19+ messages in thread
From: Damien Le Moal @ 2023-07-19  9:58 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Damien Le Moal

On 7/19/23 07:37, Bart Van Assche wrote:
> On 7/17/23 23:34, Damien Le Moal wrote:
>> On 7/11/23 03:01, Bart Van Assche wrote:
>>> +/* Writes for sequential write required zones may be pipelined. */
>>> +#define QUEUE_FLAG_PIPELINE_ZONED_WRITES 8
>>
>> I am not a big fan of this name as "pipeline" does not necessarily
>> imply "high
>> queue depth write". What about simply calling this
>> QUEUE_FLAG_NO_ZONE_WRITE_LOCK, indicating that there is no need to
>> write-lock
>> zones ?
> 
> Hi Damien,
> 
> I'm not a big fan of names with negative words like "no" or "not" embedded.
> Isn't pipelining standard computer science terminology? See also
> https://en.wikipedia.org/wiki/Pipeline_(computing).

Sure, pipeline is a well known term. But I do not think it is synonymous
with "high queue depth write" :) A "pipeline" for zoned write may still
operate at write qd=1 per zone...

Given that the default is using zone write locking, I would prefer a
flag name that indicates a change to this default. What about something
like QUEUE_FLAG_UNRESTRICTED_ZONE_WRITE ?

But tht is beside the point as I still have reservations on this
approach anyway. See my reply to patch 4.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 4/5] scsi: Retry unaligned zoned writes
  2023-07-18 22:53     ` Bart Van Assche
@ 2023-07-19  9:59       ` Damien Le Moal
  2023-07-19 16:31         ` Bart Van Assche
  0 siblings, 1 reply; 19+ messages in thread
From: Damien Le Moal @ 2023-07-19  9:59 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Martin K . Petersen,
	James E.J. Bottomley

On 7/19/23 07:53, Bart Van Assche wrote:
> On 7/17/23 23:47, Damien Le Moal wrote:
>> On 7/11/23 03:01, Bart Van Assche wrote:
>>> Send commands that failed with an unaligned write error to the SCSI
>>> error
>>> handler. Let the SCSI error handler sort SCSI commands per LBA before
>>> resubmitting these.
>>>
>>> Increase the number of retries for write commands sent to a sequential
>>> zone to the maximum number of outstanding commands.
>>
>> I think I mentioned this before. When we started btrfs work, we did
>> something
>> similar (but at the IO scheduler level) to try to avoid adding a big
>> lock in
>> btrfs to serialize (and thus order) writes. What we discovered is that
>> it was
>> extremely easy to fall into a situation were the maximum number of
>> possible
>> outstanding request is already issued, but they all are behind a
>> "hole" and
>> indefinitely delayed because the missing request cannot be issued due
>> to the max
>> nr request limit being reached. No forward progress and deadlock.
>>
>> I do not see how your change addresses this problem. The same will
>> happen with
>> this and I do not have any suggestion how to solve this. For btrfs, we
>> ended up
>> using cone append emulation for scsi to avoid the big lock and avoid
>> the FS from
>> having to order writes. That solution guarantees forward progress.
>> Delaying
>> already issued writes that are not sequential has no such guarantees.
> 
> Hi Damien,
> 
> Thank you for having explained in detail the scenario that you ran into.
> 
> I think what has been explained above is a scenario in which the filesystem
> allocates requests per zone in another order than the LBA order. How about
> requiring that the filesystem allocates and submits zoned writes in LBA
> order
> per zone? I think that this is how F2FS supports zoned storage.

Sure. But what if an application uses the drive directly ? You loose
guarantees of forward progress then. Given that an application has to
use direct IO for writes to sequential zones, this is unlikely to happen
in a "good" scenario, but it also would not be hard to write an
application that can deadlock the drive forever by simply missing one
write in a sequence of writes for a zone... That is my concern. While
f2fs would likely be OK, the delay approach is not solid enough for all
cases.



-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 4/5] scsi: Retry unaligned zoned writes
  2023-07-19  9:59       ` Damien Le Moal
@ 2023-07-19 16:31         ` Bart Van Assche
  2023-07-19 23:07           ` Damien Le Moal
  0 siblings, 1 reply; 19+ messages in thread
From: Bart Van Assche @ 2023-07-19 16:31 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Martin K . Petersen,
	James E.J. Bottomley

On 7/19/23 02:59, Damien Le Moal wrote:
> On 7/19/23 07:53, Bart Van Assche wrote:
>> I think what has been explained above is a scenario in which the filesystem
>> allocates requests per zone in another order than the LBA order. How about
>> requiring that the filesystem allocates and submits zoned writes in LBA
>> order
>> per zone? I think that this is how F2FS supports zoned storage.
> 
> Sure. But what if an application uses the drive directly ? You loose
> guarantees of forward progress then. Given that an application has to
> use direct IO for writes to sequential zones, this is unlikely to happen
> in a "good" scenario, but it also would not be hard to write an
> application that can deadlock the drive forever by simply missing one
> write in a sequence of writes for a zone... That is my concern. While
> f2fs would likely be OK, the delay approach is not solid enough for all
> cases.

Hi Damien,

This patch series increases the number of retries for zoned writes
submitted by a filesystem. Direct I/O from user space is not affected
since the code that increases the number of retries occurs in
sd_setup_read_write_cmnd(). As you know scsi_prepare_cmd() only calls
sd_setup_read_write_cmnd() for requests that come from a filesystem
and not for pass-through requests.

Does this address your concern?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 4/5] scsi: Retry unaligned zoned writes
  2023-07-19 16:31         ` Bart Van Assche
@ 2023-07-19 23:07           ` Damien Le Moal
  2023-07-20 18:18             ` Bart Van Assche
  0 siblings, 1 reply; 19+ messages in thread
From: Damien Le Moal @ 2023-07-19 23:07 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Martin K . Petersen,
	James E.J. Bottomley

On 7/20/23 01:31, Bart Van Assche wrote:
> On 7/19/23 02:59, Damien Le Moal wrote:
>> On 7/19/23 07:53, Bart Van Assche wrote:
>>> I think what has been explained above is a scenario in which the filesystem
>>> allocates requests per zone in another order than the LBA order. How about
>>> requiring that the filesystem allocates and submits zoned writes in LBA
>>> order
>>> per zone? I think that this is how F2FS supports zoned storage.
>>
>> Sure. But what if an application uses the drive directly ? You loose
>> guarantees of forward progress then. Given that an application has to
>> use direct IO for writes to sequential zones, this is unlikely to happen
>> in a "good" scenario, but it also would not be hard to write an
>> application that can deadlock the drive forever by simply missing one
>> write in a sequence of writes for a zone... That is my concern. While
>> f2fs would likely be OK, the delay approach is not solid enough for all
>> cases.
> 
> Hi Damien,
> 
> This patch series increases the number of retries for zoned writes
> submitted by a filesystem. Direct I/O from user space is not affected
> since the code that increases the number of retries occurs in
> sd_setup_read_write_cmnd(). As you know scsi_prepare_cmd() only calls
> sd_setup_read_write_cmnd() for requests that come from a filesystem
> and not for pass-through requests.
> 
> Does this address your concern?

Not really. I was not thinking about passthrough requests but rather write
syscalls to /dev/sdX by applications. sd_setup_read_write_cmnd() is used for
these too. After all, /dev/sdX *is* a file system too, albeit the simplest possible.

> 
> Thanks,
> 
> Bart.
> 

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 4/5] scsi: Retry unaligned zoned writes
  2023-07-19 23:07           ` Damien Le Moal
@ 2023-07-20 18:18             ` Bart Van Assche
  2023-07-21  0:20               ` Damien Le Moal
  0 siblings, 1 reply; 19+ messages in thread
From: Bart Van Assche @ 2023-07-20 18:18 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Martin K . Petersen,
	James E.J. Bottomley

On 7/19/23 16:07, Damien Le Moal wrote:
> Not really. I was not thinking about passthrough requests but rather write
> syscalls to /dev/sdX by applications. sd_setup_read_write_cmnd() is used for
> these too. After all, /dev/sdX *is* a file system too, albeit the simplest possible.

How about introducing a new request flag (REQ_*) such that f2fs can disable
zone locking while it remains enabled for all other users (BTRFS and direct
I/O)?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 4/5] scsi: Retry unaligned zoned writes
  2023-07-20 18:18             ` Bart Van Assche
@ 2023-07-21  0:20               ` Damien Le Moal
  0 siblings, 0 replies; 19+ messages in thread
From: Damien Le Moal @ 2023-07-21  0:20 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Martin K . Petersen,
	James E.J. Bottomley

On 7/21/23 03:18, Bart Van Assche wrote:
> On 7/19/23 16:07, Damien Le Moal wrote:
>> Not really. I was not thinking about passthrough requests but rather write
>> syscalls to /dev/sdX by applications. sd_setup_read_write_cmnd() is used for
>> these too. After all, /dev/sdX *is* a file system too, albeit the simplest possible.
> 
> How about introducing a new request flag (REQ_*) such that f2fs can disable
> zone locking while it remains enabled for all other users (BTRFS and direct
> I/O)?

That could be a safer approach. Still a little worried about any DM used between
the FS and the device though. The DMs that support zoned devices should be OK
but only under the assumption that the bio issuer (f2fs) guarantees that writes
to the same zone are not issued in parallel when that REQ_NO_ZONE_WRITE (or
whatever the name) is used. This flag definition will need to have a big fat
warning mentioning this constraint, that is, make it clear that it is not a
replacement for zone append (in which case we do not write lock zones and do not
care about ordering).

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/5] block/mq-deadline: Only use zone locking if necessary
  2023-07-18  6:38   ` Damien Le Moal
  2023-07-18 22:38     ` Bart Van Assche
@ 2023-07-24 21:39     ` Bart Van Assche
  1 sibling, 0 replies; 19+ messages in thread
From: Bart Van Assche @ 2023-07-24 21:39 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe; +Cc: linux-block, Christoph Hellwig, Damien Le Moal

On 7/17/23 23:38, Damien Le Moal wrote:
> On 7/11/23 03:01, Bart Van Assche wrote:
>> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
>> index 6aa5daf7ae32..0bed2bdeed89 100644
>> --- a/block/mq-deadline.c
>> +++ b/block/mq-deadline.c
>> @@ -353,7 +353,8 @@ deadline_fifo_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>>   		return NULL;
>>   
>>   	rq = rq_entry_fifo(per_prio->fifo_list[data_dir].next);
>> -	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
>> +	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q) ||
>> +	    blk_queue_pipeline_zoned_writes(rq->q))
> 
> What about using blk_req_needs_zone_write_lock() ?

Hmm ... how would using blk_req_needs_zone_write_lock() improve the 
generated code? blk_queue_pipeline_zoned_writes() can be inlined and 
only tests a single bit (a request queue flag) while 
blk_req_needs_zone_write_lock() cannot be inlined by the compiler 
because it has been defined in a .c file. Additionally, 
blk_req_needs_zone_write_lock() has to dereference more pointers than 
blk_queue_pipeline_zoned_writes(). From block/blk-zoned.c:

bool blk_req_needs_zone_write_lock(struct request *rq)
{
	if (!rq->q->disk->seq_zones_wlock)
		return false;

	return blk_rq_is_seq_zoned_write(rq);
}
EXPORT_SYMBOL_GPL(blk_req_needs_zone_write_lock);

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2023-07-24 21:39 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-07-10 18:01 [PATCH v2 0/5] Enable zoned write pipelining for UFS devices Bart Van Assche
2023-07-10 18:01 ` [PATCH v2 1/5] block: Introduce a request queue flag for pipelining zoned writes Bart Van Assche
2023-07-18  6:34   ` Damien Le Moal
2023-07-18 22:37     ` Bart Van Assche
2023-07-19  9:58       ` Damien Le Moal
2023-07-10 18:01 ` [PATCH v2 2/5] block/mq-deadline: Only use zone locking if necessary Bart Van Assche
2023-07-18  6:38   ` Damien Le Moal
2023-07-18 22:38     ` Bart Van Assche
2023-07-24 21:39     ` Bart Van Assche
2023-07-10 18:01 ` [PATCH v2 3/5] block/null_blk: Add support for pipelining zoned writes Bart Van Assche
2023-07-10 18:01 ` [PATCH v2 4/5] scsi: Retry unaligned " Bart Van Assche
2023-07-18  6:47   ` Damien Le Moal
2023-07-18 22:53     ` Bart Van Assche
2023-07-19  9:59       ` Damien Le Moal
2023-07-19 16:31         ` Bart Van Assche
2023-07-19 23:07           ` Damien Le Moal
2023-07-20 18:18             ` Bart Van Assche
2023-07-21  0:20               ` Damien Le Moal
2023-07-10 18:01 ` [PATCH v2 5/5] scsi: ufs: Enable zoned write pipelining Bart Van Assche

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).