* [PATCH v17 00/14] Improve write performance for zoned UFS devices
@ 2025-01-15 22:46 Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 01/14] block: Support block drivers that preserve the order of write requests Bart Van Assche
` (14 more replies)
0 siblings, 15 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-01-15 22:46 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Bart Van Assche
Hi Damien and Christoph,
This patch series improves small write IOPS by a factor of four (+300%) for
zoned UFS devices on my test setup. Please help with reviewing this patch
series.
Thanks,
Bart.
Changes compared to v16:
- Rebased the entire patch series on top of Jens' for-next branch. Compared
to when v16 of this series was posted, the BLK_ZONE_WPLUG_NEED_WP_UPDATE
flag has been introduced and support for REQ_NOWAIT has been fixed.
- The behavior for SMR disks is preserved: if .driver_preserves_write_order
has not been set, BLK_ZONE_WPLUG_NEED_WP_UPDATE is still set if a write
error has been encountered. If .driver_preserves_write_order has been set,
the write pointer is restored and the failed zoned writes are retried.
- The superfluous "disk->zone_wplugs_hash_bits != 0" tests have been removed.
Changes compared to v15:
- Reworked this patch series on top of the zone write plugging approach.
- Moved support for requeuing requests from the SCSI core into the block
layer core.
- In the UFS driver, instead of disabling write pipelining if
auto-hibernation is enabled, rely on the requeuing mechanism to handle
reordering caused by resuming from auto-hibernation.
Changes compared to v14:
- Removed the drivers/scsi/Kconfig.kunit and drivers/scsi/Makefile.kunit
files. Instead, modified drivers/scsi/Kconfig and added #include "*_test.c"
directives in the appropriate .c files. Removed the EXPORT_SYMBOL()
directives that were added to make the unit tests link.
- Fixed a double free in a unit test.
Changes compared to v13:
- Reworked patch "block: Preserve the order of requeued zoned writes".
- Addressed a performance concern by removing the eh_needs_prepare_resubmit
SCSI driver callback and by introducing the SCSI host template flag
.needs_prepare_resubmit instead.
- Added a patch that adds a 'host' argument to scsi_eh_flush_done_q().
- Made the code in unit tests less repetitive.
Changes compared to v12:
- Added two new patches: "block: Preserve the order of requeued zoned writes"
and "scsi: sd: Add a unit test for sd_cmp_sector()"
- Restricted the number of zoned write retries. To my surprise I had to add
"&& scmd->retries <= scmd->allowed" in the SCSI error handler to limit the
number of retries.
- In patch "scsi: ufs: Inform the block layer about write ordering", only set
ELEVATOR_F_ZBD_SEQ_WRITE for zoned block devices.
Changes compared to v11:
- Fixed a NULL pointer dereference that happened when booting from an ATA
device by adding an scmd->device != NULL check in scsi_needs_preparation().
- Updated Reviewed-by tags.
Changes compared to v10:
- Dropped the UFS MediaTek and HiSilicon patches because these are not correct
and because it is safe to drop these patches.
- Updated Acked-by / Reviewed-by tags.
Changes compared to v9:
- Introduced an additional scsi_driver callback: .eh_needs_prepare_resubmit().
- Renamed the scsi_debug kernel module parameter 'no_zone_write_lock' into
'preserves_write_order'.
- Fixed an out-of-bounds access in the scsi_call_prepare_resubmit() unit
test.
- Wrapped ufshcd_auto_hibern8_update() calls in UFS host drivers with
WARN_ON_ONCE() such that a kernel stack appears in case an error code is
returned.
- Elaborated a comment in the UFSHCI driver.
Changes compared to v8:
- Fixed handling of 'driver_preserves_write_order' and 'use_zone_write_lock'
in blk_stack_limits().
- Added a comment in disk_set_zoned().
- Modified blk_req_needs_zone_write_lock() such that it returns false if
q->limits.use_zone_write_lock is false.
- Modified disk_clear_zone_settings() such that it clears
q->limits.use_zone_write_lock.
- Left out one change from the mq-deadline patch that became superfluous due to
the blk_req_needs_zone_write_lock() change.
- Modified scsi_call_prepare_resubmit() such that it only calls list_sort() if
zoned writes have to be resubmitted for which zone write locking is disabled.
- Added an additional unit test for scsi_call_prepare_resubmit().
- Modified the sorting code in the sd driver such that only those SCSI commands
are sorted for which write locking is disabled.
- Modified sd_zbc.c such that ELEVATOR_F_ZBD_SEQ_WRITE is only set if the
write order is not preserved.
- Included three patches for UFS host drivers that rework code that wrote
directly to the auto-hibernation controller register.
- Modified the UFS driver such that enabling auto-hibernation is not allowed
if a zoned logical unit is present and if the controller operates in legacy
mode.
- Also in the UFS driver, simplified ufshcd_auto_hibern8_update().
Changes compared to v7:
- Split the queue_limits member variable `use_zone_write_lock' into two member
variables: `use_zone_write_lock' (set by disk_set_zoned()) and
`driver_preserves_write_order' (set by the block driver or SCSI LLD). This
should clear up the confusion about the purpose of this variable.
- Moved the code for sorting SCSI commands by LBA from the SCSI error handler
into the SCSI disk (sd) driver as requested by Christoph.
Changes compared to v6:
- Removed QUEUE_FLAG_NO_ZONE_WRITE_LOCK and instead introduced a flag in
the request queue limits data structure.
Changes compared to v5:
- Renamed scsi_cmp_lba() into scsi_cmp_sector().
- Improved several source code comments.
Changes compared to v4:
- Dropped the patch that introduces the REQ_NO_ZONE_WRITE_LOCK flag.
- Dropped the null_blk patch and added two scsi_debug patches instead.
- Dropped the f2fs patch.
- Split the patch for the UFS driver into two patches.
- Modified several patch descriptions and source code comments.
- Renamed dd_use_write_locking() into dd_use_zone_write_locking().
- Moved the list_sort() call from scsi_unjam_host() into scsi_eh_flush_done_q()
such that sorting happens just before reinserting.
- Removed the scsi_cmd_retry_allowed() call from scsi_check_sense() to make
sure that the retry counter is adjusted once per retry instead of twice.
Changes compared to v3:
- Restored the patch that introduces QUEUE_FLAG_NO_ZONE_WRITE_LOCK. That patch
had accidentally been left out from v2.
- In patch "block: Introduce the flag REQ_NO_ZONE_WRITE_LOCK", improved the
patch description and added the function blk_no_zone_write_lock().
- In patch "block/mq-deadline: Only use zone locking if necessary", moved the
blk_queue_is_zoned() call into dd_use_write_locking().
- In patch "fs/f2fs: Disable zone write locking", set REQ_NO_ZONE_WRITE_LOCK
from inside __bio_alloc() instead of in f2fs_submit_write_bio().
Changes compared to v2:
- Renamed the request queue flag for disabling zone write locking.
- Introduced a new request flag for disabling zone write locking.
- Modified the mq-deadline scheduler such that zone write locking is only
disabled if both flags are set.
- Added an F2FS patch that sets the request flag for disabling zone write
locking.
- Only disable zone write locking in the UFS driver if auto-hibernation is
disabled.
Changes compared to v1:
- Left out the patches that are already upstream.
- Switched the approach in patch "scsi: Retry unaligned zoned writes" from
retrying immediately to sending unaligned write commands to the SCSI error
handler.
Bart Van Assche (14):
block: Support block drivers that preserve the order of write requests
dm-linear: Report to the block layer that the write order is preserved
block: Rework request allocation in blk_mq_submit_bio()
block: Support allocating from a specific software queue
blk-mq: Restore the zoned write order when requeuing
blk-zoned: Track the write pointer per zone
blk-zoned: Defer error handling
blk-zoned: Add an argument to blk_zone_plug_bio()
blk-zoned: Support pipelining of zoned writes
scsi: core: Retry unaligned zoned writes
scsi: sd: Increase retry count for zoned writes
scsi: scsi_debug: Add the preserves_write_order module parameter
scsi: scsi_debug: Support injecting unaligned write errors
scsi: ufs: Inform the block layer about write ordering
block/bfq-iosched.c | 2 +
block/blk-mq.c | 82 +++++----
block/blk-mq.h | 3 +
block/blk-settings.c | 2 +
block/blk-zoned.c | 342 ++++++++++++++++++++++++++++++++++----
block/blk.h | 31 +++-
block/kyber-iosched.c | 2 +
block/mq-deadline.c | 7 +-
drivers/md/dm-linear.c | 6 +
drivers/md/dm.c | 2 +-
drivers/scsi/scsi_debug.c | 21 ++-
drivers/scsi/scsi_error.c | 16 ++
drivers/scsi/sd.c | 7 +
drivers/ufs/core/ufshcd.c | 7 +
include/linux/blk-mq.h | 13 +-
include/linux/blkdev.h | 14 +-
16 files changed, 492 insertions(+), 65 deletions(-)
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v17 01/14] block: Support block drivers that preserve the order of write requests
2025-01-15 22:46 [PATCH v17 00/14] Improve write performance for zoned UFS devices Bart Van Assche
@ 2025-01-15 22:46 ` Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 02/14] dm-linear: Report to the block layer that the write order is preserved Bart Van Assche
` (13 subsequent siblings)
14 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-01-15 22:46 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Bart Van Assche,
Hannes Reinecke, Nitesh Shetty, Ming Lei
Some storage controllers preserve the request order per hardware queue.
Introduce the request queue limit member variable
'driver_preserves_write_order' to allow block drivers to indicate that
the order of write commands is preserved per hardware queue and hence
that serialization of writes per zone is not required if all pending
writes are submitted to the same hardware queue.
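The stacking rule this patch adds to blk_stack_limits() can be sketched
outside the kernel as follows. The struct below is a reduced stand-in for
the kernel's struct queue_limits, not its real definition; only the rule
that a stacked (top) device preserves the write order iff both the top and
the bottom device preserve it is modeled.

```c
#include <stdbool.h>

/*
 * Reduced model of struct queue_limits: only the member added by this
 * patch is represented.
 */
struct queue_limits {
	bool driver_preserves_write_order;
};

/*
 * Mirror of the blk_stack_limits() rule: the stacked device only
 * preserves the write order if both devices do.
 */
static void stack_write_order_limit(struct queue_limits *t,
				    const struct queue_limits *b)
{
	t->driver_preserves_write_order = t->driver_preserves_write_order &&
					  b->driver_preserves_write_order;
}
```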
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Nitesh Shetty <nj.shetty@samsung.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-settings.c | 2 ++
include/linux/blkdev.h | 5 +++++
2 files changed, 7 insertions(+)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index c8368ee8de2e..18bcf6e6dc60 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -796,6 +796,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
}
t->max_secure_erase_sectors = min_not_zero(t->max_secure_erase_sectors,
b->max_secure_erase_sectors);
+ t->driver_preserves_write_order = t->driver_preserves_write_order &&
+ b->driver_preserves_write_order;
t->zone_write_granularity = max(t->zone_write_granularity,
b->zone_write_granularity);
if (!(t->features & BLK_FEAT_ZONED)) {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7ac153e4423a..df9887412a9e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -399,6 +399,11 @@ struct queue_limits {
unsigned int max_open_zones;
unsigned int max_active_zones;
+ /*
+ * Whether or not the block driver preserves the order of write
+ * requests. Set by the block driver.
+ */
+ bool driver_preserves_write_order;
/*
* Drivers that set dma_alignment to less than 511 must be prepared to
* [PATCH v17 02/14] dm-linear: Report to the block layer that the write order is preserved
2025-01-15 22:46 [PATCH v17 00/14] Improve write performance for zoned UFS devices Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 01/14] block: Support block drivers that preserve the order of write requests Bart Van Assche
@ 2025-01-15 22:46 ` Bart Van Assche
2025-01-17 18:04 ` Mikulas Patocka
2025-01-15 22:46 ` [PATCH v17 03/14] block: Rework request allocation in blk_mq_submit_bio() Bart Van Assche
` (12 subsequent siblings)
14 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2025-01-15 22:46 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Bart Van Assche,
Alasdair Kergon, Mike Snitzer, Mikulas Patocka
Enable write pipelining if dm-linear is stacked on top of a driver that
supports write pipelining.
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/md/dm-linear.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 49fb0f684193..967fbf856abc 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -148,6 +148,11 @@ static int linear_report_zones(struct dm_target *ti,
#define linear_report_zones NULL
#endif
+static void linear_io_hints(struct dm_target *ti, struct queue_limits *limits)
+{
+ limits->driver_preserves_write_order = true;
+}
+
static int linear_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
@@ -209,6 +214,7 @@ static struct target_type linear_target = {
.map = linear_map,
.status = linear_status,
.prepare_ioctl = linear_prepare_ioctl,
+ .io_hints = linear_io_hints,
.iterate_devices = linear_iterate_devices,
.direct_access = linear_dax_direct_access,
.dax_zero_page_range = linear_dax_zero_page_range,
* [PATCH v17 03/14] block: Rework request allocation in blk_mq_submit_bio()
2025-01-15 22:46 [PATCH v17 00/14] Improve write performance for zoned UFS devices Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 01/14] block: Support block drivers that preserve the order of write requests Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 02/14] dm-linear: Report to the block layer that the write order is preserved Bart Van Assche
@ 2025-01-15 22:46 ` Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 04/14] block: Support allocating from a specific software queue Bart Van Assche
` (11 subsequent siblings)
14 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-01-15 22:46 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Bart Van Assche
Prepare for allocating a request from a specific hctx by making
blk_mq_submit_bio() allocate the request later in the submission path.
The performance impact of this patch on the hot path is small: if a
request is cached, one percpu_ref_get(&q->q_usage_counter) call and one
percpu_ref_put(&q->q_usage_counter) call are added to the hot path.
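The reference-count bookkeeping described above can be modeled with a plain
counter standing in for the kernel's percpu_ref. This is a sketch of the
invariant only, not the actual blk_mq_submit_bio() code: bio_queue_enter()
now always takes a queue reference, and a cached request already owns one,
so one of the two references must be dropped when a cached request is used.

```c
/* Plain counter standing in for percpu_ref on q->q_usage_counter. */
struct queue {
	int usage;
};

static void queue_enter(struct queue *q) { q->usage++; } /* bio_queue_enter() */
static void queue_exit(struct queue *q)  { q->usage--; } /* blk_queue_exit() */

/*
 * Submission path when a cached request is used: the request already
 * holds one reference, so the reference taken by queue_enter() is a
 * duplicate and is dropped again.
 */
static void submit_with_cached_rq(struct queue *q)
{
	queue_enter(q);
	/* ... use the cached request, which already holds a reference ... */
	queue_exit(q);
}
```

The net effect on the hot path is exactly the one get/put pair mentioned in
the commit message; the usage count seen by the rest of the system is
unchanged.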
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 31 ++++++++++---------------------
1 file changed, 10 insertions(+), 21 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index da39a1cac702..666e6e6ba143 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3063,11 +3063,6 @@ void blk_mq_submit_bio(struct bio *bio)
struct request *rq;
blk_status_t ret;
- /*
- * If the plug has a cached request for this queue, try to use it.
- */
- rq = blk_mq_peek_cached_request(plug, q, bio->bi_opf);
-
/*
* A BIO that was released from a zone write plug has already been
* through the preparation in this function, already holds a reference
@@ -3076,21 +3071,13 @@ void blk_mq_submit_bio(struct bio *bio)
*/
if (bio_zone_write_plugging(bio)) {
nr_segs = bio->__bi_nr_segments;
- if (rq)
- blk_queue_exit(q);
goto new_request;
}
bio = blk_queue_bounce(bio, q);
- /*
- * The cached request already holds a q_usage_counter reference and we
- * don't have to acquire a new one if we use it.
- */
- if (!rq) {
- if (unlikely(bio_queue_enter(bio)))
- return;
- }
+ if (unlikely(bio_queue_enter(bio)))
+ return;
/*
* Device reconfiguration may change logical block size or reduce the
@@ -3122,8 +3109,15 @@ void blk_mq_submit_bio(struct bio *bio)
goto queue_exit;
new_request:
+ rq = blk_mq_peek_cached_request(plug, q, bio->bi_opf);
if (rq) {
blk_mq_use_cached_rq(rq, plug, bio);
+ /*
+ * Here we hold two references: one because of the
+ * bio_queue_enter() call and a second one as the result of
+ * request allocation. Drop one.
+ */
+ blk_queue_exit(q);
} else {
rq = blk_mq_get_new_requests(q, plug, bio, nr_segs);
if (unlikely(!rq)) {
@@ -3169,12 +3163,7 @@ void blk_mq_submit_bio(struct bio *bio)
return;
queue_exit:
- /*
- * Don't drop the queue reference if we were trying to use a cached
- * request and thus didn't acquire one.
- */
- if (!rq)
- blk_queue_exit(q);
+ blk_queue_exit(q);
}
#ifdef CONFIG_BLK_MQ_STACKING
* [PATCH v17 04/14] block: Support allocating from a specific software queue
2025-01-15 22:46 [PATCH v17 00/14] Improve write performance for zoned UFS devices Bart Van Assche
` (2 preceding siblings ...)
2025-01-15 22:46 ` [PATCH v17 03/14] block: Rework request allocation in blk_mq_submit_bio() Bart Van Assche
@ 2025-01-15 22:46 ` Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 05/14] blk-mq: Restore the zoned write order when requeuing Bart Van Assche
` (10 subsequent siblings)
14 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-01-15 22:46 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Bart Van Assche
A later patch will preserve the order of pipelined zoned writes by
submitting all zoned writes per zone to the same software queue as
previously submitted zoned writes. Hence support allocating a request
from a specific software queue.
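The pinning idea behind this patch can be sketched as follows. The
fixed-size table and the explicit zone number are assumptions made for this
example; the actual patch instead threads an swq_cpu value (-1 meaning "no
preference, use the current CPU") through struct blk_mq_alloc_data.

```c
#define NR_ZONES 16

/*
 * Per-zone record of the CPU whose software queue received the first
 * zoned write. Entries store cpu + 1 so that the zero-initialized
 * value 0 means "zone not pinned yet".
 */
static int zone_swq_cpu[NR_ZONES];

/*
 * Return the software-queue CPU for a zone: the first write pins the
 * zone to the submitting CPU; later writes reuse that CPU so all
 * writes for the zone reach the same software queue.
 */
static int swq_cpu_for_zone(unsigned int zone_no, int current_cpu)
{
	if (!zone_swq_cpu[zone_no])
		zone_swq_cpu[zone_no] = current_cpu + 1;
	return zone_swq_cpu[zone_no] - 1;
}
```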
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 18 ++++++++++++++----
block/blk-mq.h | 1 +
2 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 666e6e6ba143..4262c85be206 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -495,6 +495,7 @@ __blk_mq_alloc_requests_batch(struct blk_mq_alloc_data *data)
static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
{
struct request_queue *q = data->q;
+ int swq_cpu = data->swq_cpu;
u64 alloc_time_ns = 0;
struct request *rq;
unsigned int tag;
@@ -507,7 +508,8 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
data->flags |= BLK_MQ_REQ_NOWAIT;
retry:
- data->ctx = blk_mq_get_ctx(q);
+ data->swq_cpu = swq_cpu >= 0 ? swq_cpu : raw_smp_processor_id();
+ data->ctx = __blk_mq_get_ctx(q, data->swq_cpu);
data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
if (q->elevator) {
@@ -587,6 +589,7 @@ static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
.cmd_flags = opf,
.nr_tags = plug->nr_ios,
.cached_rqs = &plug->cached_rqs,
+ .swq_cpu = -1,
};
struct request *rq;
@@ -648,6 +651,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
.flags = flags,
.cmd_flags = opf,
.nr_tags = 1,
+ .swq_cpu = -1,
};
int ret;
@@ -2964,12 +2968,14 @@ static bool blk_mq_attempt_bio_merge(struct request_queue *q,
}
static struct request *blk_mq_get_new_requests(struct request_queue *q,
+ int swq_cpu,
struct blk_plug *plug,
struct bio *bio,
unsigned int nsegs)
{
struct blk_mq_alloc_data data = {
.q = q,
+ .swq_cpu = swq_cpu,
.nr_tags = 1,
.cmd_flags = bio->bi_opf,
};
@@ -2993,7 +2999,8 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
* Check if there is a suitable cached request and return it.
*/
static struct request *blk_mq_peek_cached_request(struct blk_plug *plug,
- struct request_queue *q, blk_opf_t opf)
+ struct request_queue *q,
+ int swq_cpu, blk_opf_t opf)
{
enum hctx_type type = blk_mq_get_hctx_type(opf);
struct request *rq;
@@ -3003,6 +3010,8 @@ static struct request *blk_mq_peek_cached_request(struct blk_plug *plug,
rq = rq_list_peek(&plug->cached_rqs);
if (!rq || rq->q != q)
return NULL;
+ if (swq_cpu >= 0 && rq->mq_ctx->cpu != swq_cpu)
+ return NULL;
if (type != rq->mq_hctx->type &&
(type != HCTX_TYPE_READ || rq->mq_hctx->type != HCTX_TYPE_DEFAULT))
return NULL;
@@ -3061,6 +3070,7 @@ void blk_mq_submit_bio(struct bio *bio)
struct blk_mq_hw_ctx *hctx;
unsigned int nr_segs;
struct request *rq;
+ int swq_cpu = -1;
blk_status_t ret;
/*
@@ -3109,7 +3119,7 @@ void blk_mq_submit_bio(struct bio *bio)
goto queue_exit;
new_request:
- rq = blk_mq_peek_cached_request(plug, q, bio->bi_opf);
+ rq = blk_mq_peek_cached_request(plug, q, swq_cpu, bio->bi_opf);
if (rq) {
blk_mq_use_cached_rq(rq, plug, bio);
/*
@@ -3119,7 +3129,7 @@ void blk_mq_submit_bio(struct bio *bio)
*/
blk_queue_exit(q);
} else {
- rq = blk_mq_get_new_requests(q, plug, bio, nr_segs);
+ rq = blk_mq_get_new_requests(q, swq_cpu, plug, bio, nr_segs);
if (unlikely(!rq)) {
if (bio->bi_opf & REQ_NOWAIT)
bio_wouldblock_error(bio);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 44979e92b79f..d5536dcf2182 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -158,6 +158,7 @@ struct blk_mq_alloc_data {
struct rq_list *cached_rqs;
/* input & output parameter */
+ int swq_cpu;
struct blk_mq_ctx *ctx;
struct blk_mq_hw_ctx *hctx;
};
* [PATCH v17 05/14] blk-mq: Restore the zoned write order when requeuing
2025-01-15 22:46 [PATCH v17 00/14] Improve write performance for zoned UFS devices Bart Van Assche
` (3 preceding siblings ...)
2025-01-15 22:46 ` [PATCH v17 04/14] block: Support allocating from a specific software queue Bart Van Assche
@ 2025-01-15 22:46 ` Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 06/14] blk-zoned: Track the write pointer per zone Bart Van Assche
` (9 subsequent siblings)
14 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-01-15 22:46 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Bart Van Assche,
Yu Kuai
Zoned writes may be requeued. This happens if a block driver returns
BLK_STS_RESOURCE, if the SCSI core requeues a request to handle a unit
attention, or if the SCSI error handler requeues requests after error
handling has finished. Requests may be requeued in a different order than
the order in which they were submitted. Restore the request order when
requests are requeued. Add RQF_DONTPREP to RQF_NOMERGE_FLAGS because this
patch may
cause RQF_DONTPREP requests to be sent to the code that checks whether
a request can be merged and RQF_DONTPREP requests must not be merged.
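The ordered-insertion logic of blk_mq_insert_ordered() can be sketched with
a minimal singly linked list: walk the list and insert the request before
the first entry with a higher start sector, appending at the tail
otherwise. The struct below is a stand-in for struct request (the kernel
version also matches on the request queue, which a single-queue model can
omit).

```c
#include <stddef.h>

/* Stand-in for struct request: a start sector plus a list link. */
struct rq {
	unsigned long long sector;
	struct rq *next;
};

/*
 * Insert @rq before the first entry with a higher start sector so that
 * requeued zoned writes come out sorted by LBA; entries with an equal
 * sector keep their FIFO order, matching the '>' comparison in
 * blk_mq_insert_ordered().
 */
static void insert_ordered(struct rq **head, struct rq *rq)
{
	struct rq **pos = head;

	while (*pos && (*pos)->sector <= rq->sector)
		pos = &(*pos)->next;
	rq->next = *pos;
	*pos = rq;
}
```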
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/bfq-iosched.c | 2 ++
block/blk-mq.c | 20 +++++++++++++++++++-
block/blk-mq.h | 2 ++
block/kyber-iosched.c | 2 ++
block/mq-deadline.c | 7 ++++++-
include/linux/blk-mq.h | 13 ++++++++++++-
6 files changed, 43 insertions(+), 3 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 167542201603..ffa4ca3aad62 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6276,6 +6276,8 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
if (flags & BLK_MQ_INSERT_AT_HEAD) {
list_add(&rq->queuelist, &bfqd->dispatch);
+ } else if (flags & BLK_MQ_INSERT_ORDERED) {
+ blk_mq_insert_ordered(rq, &bfqd->dispatch);
} else if (!bfqq) {
list_add_tail(&rq->queuelist, &bfqd->dispatch);
} else {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4262c85be206..01478777ae5f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1557,7 +1557,9 @@ static void blk_mq_requeue_work(struct work_struct *work)
* already. Insert it into the hctx dispatch list to avoid
* block layer merges for the request.
*/
- if (rq->rq_flags & RQF_DONTPREP)
+ if (blk_rq_is_seq_zoned_write(rq))
+ blk_mq_insert_request(rq, BLK_MQ_INSERT_ORDERED);
+ else if (rq->rq_flags & RQF_DONTPREP)
blk_mq_request_bypass_insert(rq, 0);
else
blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
@@ -2592,6 +2594,20 @@ static void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx,
blk_mq_run_hw_queue(hctx, run_queue_async);
}
+void blk_mq_insert_ordered(struct request *rq, struct list_head *list)
+{
+ struct request_queue *q = rq->q;
+ struct request *rq2;
+
+ list_for_each_entry(rq2, list, queuelist)
+ if (rq2->q == q && blk_rq_pos(rq2) > blk_rq_pos(rq))
+ break;
+
+ /* Insert rq before rq2. If rq2 is the list head, append at the end. */
+ list_add_tail(&rq->queuelist, &rq2->queuelist);
+}
+EXPORT_SYMBOL_GPL(blk_mq_insert_ordered);
+
static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
{
struct request_queue *q = rq->q;
@@ -2646,6 +2662,8 @@ static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
spin_lock(&ctx->lock);
if (flags & BLK_MQ_INSERT_AT_HEAD)
list_add(&rq->queuelist, &ctx->rq_lists[hctx->type]);
+ else if (flags & BLK_MQ_INSERT_ORDERED)
+ blk_mq_insert_ordered(rq, &ctx->rq_lists[hctx->type]);
else
list_add_tail(&rq->queuelist,
&ctx->rq_lists[hctx->type]);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index d5536dcf2182..4035643c51a7 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -40,8 +40,10 @@ enum {
typedef unsigned int __bitwise blk_insert_t;
#define BLK_MQ_INSERT_AT_HEAD ((__force blk_insert_t)0x01)
+#define BLK_MQ_INSERT_ORDERED ((__force blk_insert_t)0x02)
void blk_mq_submit_bio(struct bio *bio);
+void blk_mq_insert_ordered(struct request *rq, struct list_head *list);
int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, struct io_comp_batch *iob,
unsigned int flags);
void blk_mq_exit_queue(struct request_queue *q);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index dc31f2dfa414..2877cce690f3 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -603,6 +603,8 @@ static void kyber_insert_requests(struct blk_mq_hw_ctx *hctx,
trace_block_rq_insert(rq);
if (flags & BLK_MQ_INSERT_AT_HEAD)
list_move(&rq->queuelist, head);
+ else if (flags & BLK_MQ_INSERT_ORDERED)
+ blk_mq_insert_ordered(rq, head);
else
list_move_tail(&rq->queuelist, head);
sbitmap_set_bit(&khd->kcq_map[sched_domain],
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 754f6b7415cd..78534279adab 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -710,7 +710,12 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
* set expire time and add to fifo list
*/
rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
- list_add_tail(&rq->queuelist, &per_prio->fifo_list[data_dir]);
+ if (flags & BLK_MQ_INSERT_ORDERED)
+ blk_mq_insert_ordered(rq,
+ &per_prio->fifo_list[data_dir]);
+ else
+ list_add_tail(&rq->queuelist,
+ &per_prio->fifo_list[data_dir]);
}
}
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index a0a9007cc1e3..482d5432817c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -85,7 +85,7 @@ enum {
/* flags that prevent us from merging requests: */
#define RQF_NOMERGE_FLAGS \
- (RQF_STARTED | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
+ (RQF_STARTED | RQF_FLUSH_SEQ | RQF_DONTPREP | RQF_SPECIAL_PAYLOAD)
enum mq_rq_state {
MQ_RQ_IDLE = 0,
@@ -1152,4 +1152,15 @@ static inline int blk_rq_map_sg(struct request_queue *q, struct request *rq,
}
void blk_dump_rq_flags(struct request *, char *);
+static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
+{
+ switch (req_op(rq)) {
+ case REQ_OP_WRITE:
+ case REQ_OP_WRITE_ZEROES:
+ return bdev_zone_is_seq(rq->q->disk->part0, blk_rq_pos(rq));
+ default:
+ return false;
+ }
+}
+
#endif /* BLK_MQ_H */
* [PATCH v17 06/14] blk-zoned: Track the write pointer per zone
2025-01-15 22:46 [PATCH v17 00/14] Improve write performance for zoned UFS devices Bart Van Assche
` (4 preceding siblings ...)
2025-01-15 22:46 ` [PATCH v17 05/14] blk-mq: Restore the zoned write order when requeuing Bart Van Assche
@ 2025-01-15 22:46 ` Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 07/14] blk-zoned: Defer error handling Bart Van Assche
` (8 subsequent siblings)
14 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-01-15 22:46 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Bart Van Assche
Derive the write pointer from successfully completed zoned writes. This
patch prepares for restoring the write pointer after a write has failed,
whether the failure was reported by the device (e.g. a unit attention or an
unaligned write) or by the driver (e.g. BLK_STS_RESOURCE).
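The wp_offset_compl tracking this patch adds can be sketched as follows.
The struct is a reduced stand-in for struct blk_zone_wplug with only the
two write-pointer fields; the key point is that completions may arrive out
of order, so the completed write pointer advances to the highest completed
end offset seen so far.

```c
/* Reduced stand-in for struct blk_zone_wplug. */
struct zone_wplug {
	unsigned int wp_offset;       /* next sector to submit */
	unsigned int wp_offset_compl; /* end of completed writes */
};

/*
 * On successful completion of a zoned write starting at @start and
 * spanning @nr_sectors, advance wp_offset_compl if this completion
 * extends past the highest end offset recorded so far.
 */
static void complete_zoned_write(struct zone_wplug *zwplug,
				 unsigned int start, unsigned int nr_sectors)
{
	unsigned int end = start + nr_sectors;

	if (end > zwplug->wp_offset_compl)
		zwplug->wp_offset_compl = end;
}
```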
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-zoned.c | 37 +++++++++++++++++++++++++++++--------
block/blk.h | 4 +++-
2 files changed, 32 insertions(+), 9 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 9d08a54c201e..089c6740df4a 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -51,6 +51,8 @@ static const char *const zone_cond_name[] = {
* @zone_no: The number of the zone the plug is managing.
* @wp_offset: The zone write pointer location relative to the start of the zone
* as a number of 512B sectors.
+ * @wp_offset_compl: End offset for completed zoned writes as a number of 512
+ * byte sectors.
* @bio_list: The list of BIOs that are currently plugged.
* @bio_work: Work struct to handle issuing of plugged BIOs
* @rcu_head: RCU head to free zone write plugs with an RCU grace period.
@@ -63,6 +65,7 @@ struct blk_zone_wplug {
unsigned int flags;
unsigned int zone_no;
unsigned int wp_offset;
+ unsigned int wp_offset_compl;
struct bio_list bio_list;
struct work_struct bio_work;
struct rcu_head rcu_head;
@@ -554,6 +557,7 @@ static struct blk_zone_wplug *disk_get_and_lock_zone_wplug(struct gendisk *disk,
zwplug->flags = 0;
zwplug->zone_no = zno;
zwplug->wp_offset = bdev_offset_from_zone_start(disk->part0, sector);
+ zwplug->wp_offset_compl = zwplug->wp_offset;
bio_list_init(&zwplug->bio_list);
INIT_WORK(&zwplug->bio_work, blk_zone_wplug_bio_work);
zwplug->disk = disk;
@@ -612,6 +616,7 @@ static void disk_zone_wplug_set_wp_offset(struct gendisk *disk,
/* Update the zone write pointer and abort all plugged BIOs. */
zwplug->flags &= ~BLK_ZONE_WPLUG_NEED_WP_UPDATE;
zwplug->wp_offset = wp_offset;
+ zwplug->wp_offset_compl = zwplug->wp_offset;
disk_zone_wplug_abort(zwplug);
/*
@@ -1148,6 +1153,7 @@ void blk_zone_write_plug_bio_endio(struct bio *bio)
struct gendisk *disk = bio->bi_bdev->bd_disk;
struct blk_zone_wplug *zwplug =
disk_get_zone_wplug(disk, bio->bi_iter.bi_sector);
+ unsigned int end_sector;
unsigned long flags;
if (WARN_ON_ONCE(!zwplug))
@@ -1165,11 +1171,24 @@ void blk_zone_write_plug_bio_endio(struct bio *bio)
bio->bi_opf |= REQ_OP_ZONE_APPEND;
}
- /*
- * If the BIO failed, abort all plugged BIOs and mark the plug as
- * needing a write pointer update.
- */
- if (bio->bi_status != BLK_STS_OK) {
+ if (bio->bi_status == BLK_STS_OK) {
+ switch (bio_op(bio)) {
+ case REQ_OP_WRITE:
+ case REQ_OP_ZONE_APPEND:
+ case REQ_OP_WRITE_ZEROES:
+ end_sector = bdev_offset_from_zone_start(disk->part0,
+ bio->bi_iter.bi_sector + bio_sectors(bio));
+ if (end_sector > zwplug->wp_offset_compl)
+ zwplug->wp_offset_compl = end_sector;
+ break;
+ default:
+ break;
+ }
+ } else {
+ /*
+ * If the BIO failed, mark the plug as having an error to
+ * trigger recovery.
+ */
spin_lock_irqsave(&zwplug->lock, flags);
disk_zone_wplug_abort(zwplug);
zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE;
@@ -1772,7 +1791,7 @@ EXPORT_SYMBOL_GPL(blk_zone_issue_zeroout);
static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
struct seq_file *m)
{
- unsigned int zwp_wp_offset, zwp_flags;
+ unsigned int zwp_wp_offset, zwp_wp_offset_compl, zwp_flags;
unsigned int zwp_zone_no, zwp_ref;
unsigned int zwp_bio_list_size;
unsigned long flags;
@@ -1782,11 +1801,13 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
zwp_flags = zwplug->flags;
zwp_ref = refcount_read(&zwplug->ref);
zwp_wp_offset = zwplug->wp_offset;
+ zwp_wp_offset_compl = zwplug->wp_offset_compl;
zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
spin_unlock_irqrestore(&zwplug->lock, flags);
- seq_printf(m, "%u 0x%x %u %u %u\n", zwp_zone_no, zwp_flags, zwp_ref,
- zwp_wp_offset, zwp_bio_list_size);
+ seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u wp_offset_compl %u bio_list_size %u\n",
+ zwp_zone_no, zwp_flags, zwp_ref, zwp_wp_offset,
+ zwp_wp_offset_compl, zwp_bio_list_size);
}
int queue_zone_wplugs_show(void *data, struct seq_file *m)
diff --git a/block/blk.h b/block/blk.h
index 4904b86d5fec..2274253cfa58 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -470,8 +470,10 @@ static inline void blk_zone_update_request_bio(struct request *rq,
* the original BIO sector so that blk_zone_write_plug_bio_endio() can
* lookup the zone write plug.
*/
- if (req_op(rq) == REQ_OP_ZONE_APPEND || bio_zone_write_plugging(bio))
+ if (req_op(rq) == REQ_OP_ZONE_APPEND || bio_zone_write_plugging(bio)) {
bio->bi_iter.bi_sector = rq->__sector;
+ bio->bi_iter.bi_size = rq->__data_len;
+ }
}
void blk_zone_write_plug_bio_endio(struct bio *bio);
static inline void blk_zone_bio_endio(struct bio *bio)
^ permalink raw reply related [flat|nested] 25+ messages in thread
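The completion-side tracking in the hunk above — `wp_offset_compl` only ever advances, to the largest completed end offset within the zone, even when writes complete out of order — can be illustrated with a small userspace model (the struct and function names here are invented for illustration, not the kernel's):

```c
#include <assert.h>

/* Toy model of the per-zone completed write pointer offset. */
struct zwplug_sim {
	unsigned int wp_offset_compl;	/* completed offset, in 512B sectors */
};

/*
 * Record the completion of a write covering [start, start + nr_sectors)
 * relative to the zone start. Out-of-order completions are handled by
 * only ever advancing the completed offset to the maximum end seen.
 */
static void record_write_completion(struct zwplug_sim *z,
				    unsigned int start,
				    unsigned int nr_sectors)
{
	unsigned int end_sector = start + nr_sectors;

	if (end_sector > z->wp_offset_compl)
		z->wp_offset_compl = end_sector;
}
```

With this rule, a completion for an earlier range arriving late never moves the completed offset backwards.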
* [PATCH v17 07/14] blk-zoned: Defer error handling
2025-01-15 22:46 [PATCH v17 00/14] Improve write performance for zoned UFS devices Bart Van Assche
` (5 preceding siblings ...)
2025-01-15 22:46 ` [PATCH v17 06/14] blk-zoned: Track the write pointer per zone Bart Van Assche
@ 2025-01-15 22:46 ` Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 08/14] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
` (7 subsequent siblings)
14 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-01-15 22:46 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Bart Van Assche
Instead of handling write errors immediately, only handle errors after all
pending zoned writes have completed or have been requeued. This patch
prepares for implementing write pipelining support. If, e.g., the SCSI error
handler is activated while multiple requests are queued, all requests must
have completed or failed before any request is resubmitted.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 9 ++
block/blk-zoned.c | 279 ++++++++++++++++++++++++++++++++++++++---
block/blk.h | 27 ++++
include/linux/blkdev.h | 4 +-
4 files changed, 302 insertions(+), 17 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 01478777ae5f..ca34ec34d595 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -799,6 +799,9 @@ void blk_mq_free_request(struct request *rq)
rq_qos_done(q, rq);
WRITE_ONCE(rq->state, MQ_RQ_IDLE);
+
+ blk_zone_free_request(rq);
+
if (req_ref_put_and_test(rq))
__blk_mq_free_request(rq);
}
@@ -1195,6 +1198,9 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
continue;
WRITE_ONCE(rq->state, MQ_RQ_IDLE);
+
+ blk_zone_free_request(rq);
+
if (!req_ref_put_and_test(rq))
continue;
@@ -1513,6 +1519,7 @@ static void __blk_mq_requeue_request(struct request *rq)
if (blk_mq_request_started(rq)) {
WRITE_ONCE(rq->state, MQ_RQ_IDLE);
rq->rq_flags &= ~RQF_TIMED_OUT;
+ blk_zone_requeue_work(q);
}
}
@@ -1548,6 +1555,8 @@ static void blk_mq_requeue_work(struct work_struct *work)
list_splice_init(&q->flush_list, &flush_list);
spin_unlock_irq(&q->requeue_lock);
+ blk_zone_requeue_work(q);
+
while (!list_empty(&rq_list)) {
rq = list_entry(rq_list.next, struct request, queuelist);
list_del_init(&rq->queuelist);
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 089c6740df4a..cc09ae84acc8 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -8,6 +8,7 @@
* Copyright (c) 2016, Damien Le Moal
* Copyright (c) 2016, Western Digital
* Copyright (c) 2024, Western Digital Corporation or its affiliates.
+ * Copyright 2024 Google LLC
*/
#include <linux/kernel.h>
@@ -37,6 +38,7 @@ static const char *const zone_cond_name[] = {
/*
* Per-zone write plug.
* @node: hlist_node structure for managing the plug using a hash table.
+ * @link: Entry in the zone write plug error list of the disk.
* @ref: Zone write plug reference counter. A zone write plug reference is
* always at least 1 when the plug is hashed in the disk plug hash table.
* The reference is incremented whenever a new BIO needing plugging is
@@ -60,6 +62,7 @@ static const char *const zone_cond_name[] = {
*/
struct blk_zone_wplug {
struct hlist_node node;
+ struct list_head link;
refcount_t ref;
spinlock_t lock;
unsigned int flags;
@@ -86,10 +89,16 @@ struct blk_zone_wplug {
* to prevent new references to the zone write plug to be taken for
* newly incoming BIOs. A zone write plug flagged with this flag will be
* freed once all remaining references from BIOs or functions are dropped.
+ * - BLK_ZONE_WPLUG_ERROR: Indicates that a write error happened. Recovery
+ * from the write error will happen after all pending zoned write requests
+ * either have been requeued or have been completed.
*/
#define BLK_ZONE_WPLUG_PLUGGED (1U << 0)
#define BLK_ZONE_WPLUG_NEED_WP_UPDATE (1U << 1)
#define BLK_ZONE_WPLUG_UNHASHED (1U << 2)
+#define BLK_ZONE_WPLUG_ERROR (1U << 3)
+
+#define BLK_ZONE_WPLUG_BUSY (BLK_ZONE_WPLUG_PLUGGED | BLK_ZONE_WPLUG_ERROR)
/**
* blk_zone_cond_str - Return string XXX in BLK_ZONE_COND_XXX.
@@ -468,8 +477,8 @@ static inline bool disk_should_remove_zone_wplug(struct gendisk *disk,
if (zwplug->flags & BLK_ZONE_WPLUG_UNHASHED)
return false;
- /* If the zone write plug is still plugged, it cannot be removed. */
- if (zwplug->flags & BLK_ZONE_WPLUG_PLUGGED)
+ /* If the zone write plug is still busy, it cannot be removed. */
+ if (zwplug->flags & BLK_ZONE_WPLUG_BUSY)
return false;
/*
@@ -552,6 +561,7 @@ static struct blk_zone_wplug *disk_get_and_lock_zone_wplug(struct gendisk *disk,
return NULL;
INIT_HLIST_NODE(&zwplug->node);
+ INIT_LIST_HEAD(&zwplug->link);
refcount_set(&zwplug->ref, 2);
spin_lock_init(&zwplug->lock);
zwplug->flags = 0;
@@ -601,6 +611,49 @@ static void disk_zone_wplug_abort(struct blk_zone_wplug *zwplug)
blk_zone_wplug_bio_io_error(zwplug, bio);
}
+static void disk_zone_wplug_set_error(struct gendisk *disk,
+ struct blk_zone_wplug *zwplug)
+{
+ lockdep_assert_held(&zwplug->lock);
+
+ if (zwplug->flags & BLK_ZONE_WPLUG_ERROR)
+ return;
+
+ zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+ zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
+ /*
+ * Increase the zwplug reference count because BLK_ZONE_WPLUG_ERROR has
+ * been set. This reference will be dropped when BLK_ZONE_WPLUG_ERROR is
+ * cleared.
+ */
+ refcount_inc(&zwplug->ref);
+
+ scoped_guard(spinlock_irqsave, &disk->zone_wplugs_lock)
+ list_add_tail(&zwplug->link, &disk->zone_wplugs_err_list);
+}
+
+static void disk_zone_wplug_clear_error(struct gendisk *disk,
+ struct blk_zone_wplug *zwplug)
+{
+ if (!(READ_ONCE(zwplug->flags) & BLK_ZONE_WPLUG_ERROR))
+ return;
+
+ /*
+ * We are racing with the error handling work which drops the reference
+ * on the zone write plug after handling the error state. So remove the
+ * plug from the error list and drop its reference count only if the
+ * error handling has not yet started, that is, if the zone write plug
+ * is still listed.
+ */
+ scoped_guard(spinlock_irqsave, &disk->zone_wplugs_lock) {
+ if (list_empty(&zwplug->link))
+ return;
+ list_del_init(&zwplug->link);
+ zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
+ }
+ disk_put_zone_wplug(zwplug);
+}
+
/*
* Set a zone write plug write pointer offset to the specified value.
* This aborts all plugged BIOs, which is fine as this function is called for
@@ -619,6 +672,13 @@ static void disk_zone_wplug_set_wp_offset(struct gendisk *disk,
zwplug->wp_offset_compl = zwplug->wp_offset;
disk_zone_wplug_abort(zwplug);
+ /*
+ * Updating the write pointer offset puts the zone back in a
+ * good state. So clear the error flag and drop the error
+ * reference if the zone was in the error state.
+ */
+ disk_zone_wplug_clear_error(disk, zwplug);
+
/*
* The zone write plug now has no BIO plugged: remove it from the
* hash table so that it cannot be seen. The plug will be freed
@@ -747,6 +807,70 @@ static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
return false;
}
+struct all_zwr_inserted_data {
+ struct blk_zone_wplug *zwplug;
+ bool res;
+};
+
+/*
+ * Changes @data->res to %false if and only if @rq is a zoned write for the
+ * given zone and if it is owned by the block driver.
+ *
+ * @rq members may change while this function is in progress. Hence, use
+ * READ_ONCE() to read @rq members.
+ */
+static bool blk_zwr_inserted(struct request *rq, void *data)
+{
+ struct all_zwr_inserted_data *d = data;
+ struct blk_zone_wplug *zwplug = d->zwplug;
+ struct request_queue *q = zwplug->disk->queue;
+ struct bio *bio = READ_ONCE(rq->bio);
+
+ if (rq->q == q && READ_ONCE(rq->state) != MQ_RQ_IDLE &&
+ blk_rq_is_seq_zoned_write(rq) && bio &&
+ bio_zone_no(bio) == zwplug->zone_no) {
+ d->res = false;
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * Report whether or not all zoned writes for a zone have been inserted into a
+ * software queue, elevator queue or hardware queue.
+ */
+static bool blk_zone_all_zwr_inserted(struct blk_zone_wplug *zwplug)
+{
+ struct gendisk *disk = zwplug->disk;
+ struct request_queue *q = disk->queue;
+ struct all_zwr_inserted_data d = { .zwplug = zwplug, .res = true };
+ struct blk_mq_hw_ctx *hctx;
+ unsigned long i;
+ struct request *rq;
+
+ scoped_guard(spinlock_irqsave, &q->requeue_lock) {
+ list_for_each_entry(rq, &q->requeue_list, queuelist)
+ if (blk_rq_is_seq_zoned_write(rq) &&
+ bio_zone_no(rq->bio) == zwplug->zone_no)
+ return false;
+ list_for_each_entry(rq, &q->flush_list, queuelist)
+ if (blk_rq_is_seq_zoned_write(rq) &&
+ bio_zone_no(rq->bio) == zwplug->zone_no)
+ return false;
+ }
+
+ queue_for_each_hw_ctx(q, hctx, i) {
+ struct blk_mq_tags *tags = hctx->sched_tags ?: hctx->tags;
+
+ blk_mq_all_tag_iter(tags, blk_zwr_inserted, &d);
+ if (!d.res || blk_mq_is_shared_tags(q->tag_set->flags))
+ break;
+ }
+
+ return d.res;
+}
+
static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
@@ -953,14 +1077,6 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
* so that we can restore its operation code on completion.
*/
bio_set_flag(bio, BIO_EMULATES_ZONE_APPEND);
- } else {
- /*
- * Check for non-sequential writes early as we know that BIOs
- * with a start sector not unaligned to the zone write pointer
- * will fail.
- */
- if (bio_offset_from_zone_start(bio) != zwplug->wp_offset)
- return false;
}
/* Advance the zone write pointer offset. */
@@ -1021,7 +1137,7 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
* BLK_STS_AGAIN failure if we let the BIO execute.
* Otherwise, plug and let the BIO execute.
*/
- if ((zwplug->flags & BLK_ZONE_WPLUG_PLUGGED) ||
+ if ((zwplug->flags & BLK_ZONE_WPLUG_BUSY) ||
(bio->bi_opf & REQ_NOWAIT))
goto plug;
@@ -1122,6 +1238,29 @@ bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
}
EXPORT_SYMBOL_GPL(blk_zone_plug_bio);
+/*
+ * Change the zone state to "error" if a zoned write request is requeued, to
+ * postpone processing of requeued requests until all pending requests have
+ * either completed or been requeued.
+ */
+void blk_zone_write_plug_requeue_request(struct request *rq)
+{
+ struct gendisk *disk = rq->q->disk;
+ struct blk_zone_wplug *zwplug;
+
+ if (!blk_rq_is_seq_zoned_write(rq))
+ return;
+
+ zwplug = disk_get_zone_wplug(disk, blk_rq_pos(rq));
+ if (WARN_ON_ONCE(!zwplug))
+ return;
+
+ scoped_guard(spinlock_irqsave, &zwplug->lock)
+ disk_zone_wplug_set_error(disk, zwplug);
+
+ disk_put_zone_wplug(zwplug);
+}
+
static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
@@ -1187,11 +1326,14 @@ void blk_zone_write_plug_bio_endio(struct bio *bio)
} else {
/*
* If the BIO failed, mark the plug as having an error to
- * trigger recovery.
+ * trigger recovery. Since we cannot rely on the completion
+ * information for torn SAS SMR writes, set
+ * BLK_ZONE_WPLUG_NEED_WP_UPDATE for these devices.
*/
spin_lock_irqsave(&zwplug->lock, flags);
- disk_zone_wplug_abort(zwplug);
- zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE;
+ if (!disk->queue->limits.driver_preserves_write_order)
+ zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE;
+ zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
spin_unlock_irqrestore(&zwplug->lock, flags);
}
@@ -1233,6 +1375,25 @@ void blk_zone_write_plug_finish_request(struct request *req)
disk_put_zone_wplug(zwplug);
}
+/*
+ * Schedule zone_wplugs_work if a zone is in the error state and if no requests
+ * are in flight. Called from blk_mq_free_request().
+ */
+void blk_zone_write_plug_free_request(struct request *rq)
+{
+ struct gendisk *disk = rq->q->disk;
+ struct blk_zone_wplug *zwplug;
+
+ zwplug = disk_get_zone_wplug(disk, blk_rq_pos(rq));
+ if (!zwplug)
+ return;
+
+ if (READ_ONCE(zwplug->flags) & BLK_ZONE_WPLUG_ERROR)
+ kblockd_schedule_work(&disk->zone_wplugs_work);
+
+ disk_put_zone_wplug(zwplug);
+}
+
static void blk_zone_wplug_bio_work(struct work_struct *work)
{
struct blk_zone_wplug *zwplug =
@@ -1279,6 +1440,88 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
disk_put_zone_wplug(zwplug);
}
+static void disk_zone_wplug_handle_error(struct gendisk *disk,
+ struct blk_zone_wplug *zwplug)
+{
+ scoped_guard(spinlock_irqsave, &zwplug->lock) {
+ /*
+ * A zone reset or finish may have cleared the error
+ * already. In that case, do nothing, as a zone report may
+ * have seen the "old" write pointer value before the
+ * reset/finish operation completed.
+ */
+ if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR))
+ return;
+
+ zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
+
+ /* Update the zone write pointer offset. */
+ zwplug->wp_offset = zwplug->wp_offset_compl;
+
+ /* Restart BIO submission if we still have any BIO left. */
+ if (!bio_list_empty(&zwplug->bio_list)) {
+ disk_zone_wplug_schedule_bio_work(disk, zwplug);
+ } else {
+ zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+ if (disk_should_remove_zone_wplug(disk, zwplug))
+ disk_remove_zone_wplug(disk, zwplug);
+ }
+ }
+
+ disk_put_zone_wplug(zwplug);
+}
+
+static void disk_zone_process_err_list(struct gendisk *disk)
+{
+ struct blk_zone_wplug *zwplug, *next;
+ unsigned long flags;
+
+ spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
+
+ list_for_each_entry_safe(zwplug, next, &disk->zone_wplugs_err_list,
+ link) {
+ if (!blk_zone_all_zwr_inserted(zwplug))
+ continue;
+ list_del_init(&zwplug->link);
+ spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
+
+ disk_zone_wplug_handle_error(disk, zwplug);
+ disk_put_zone_wplug(zwplug);
+
+ spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
+ }
+
+ spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
+
+ /*
+ * If one or more zones have been skipped, this work will be requeued
+ * when a request is requeued (blk_zone_requeue_work()) or freed
+ * (blk_zone_write_plug_free_request()).
+ */
+}
+
+static void disk_zone_wplugs_work(struct work_struct *work)
+{
+ struct gendisk *disk =
+ container_of(work, struct gendisk, zone_wplugs_work);
+
+ disk_zone_process_err_list(disk);
+}
+
+/* May be called from interrupt context. */
+void blk_zone_requeue_work(struct request_queue *q)
+{
+ struct gendisk *disk = q->disk;
+
+ if (!disk)
+ return;
+
+ if (in_interrupt())
+ kblockd_schedule_work(&disk->zone_wplugs_work);
+ else
+ disk_zone_process_err_list(disk);
+}
+
static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
{
return 1U << disk->zone_wplugs_hash_bits;
@@ -1287,6 +1530,8 @@ static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
void disk_init_zone_resources(struct gendisk *disk)
{
spin_lock_init(&disk->zone_wplugs_lock);
+ INIT_LIST_HEAD(&disk->zone_wplugs_err_list);
+ INIT_WORK(&disk->zone_wplugs_work, disk_zone_wplugs_work);
}
/*
@@ -1805,9 +2050,11 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
spin_unlock_irqrestore(&zwplug->lock, flags);
- seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u wp_offset_compl %u bio_list_size %u\n",
+ bool all_zwr_inserted = blk_zone_all_zwr_inserted(zwplug);
+
+ seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u bio_list_size %u all_zwr_inserted %d\n",
zwp_zone_no, zwp_flags, zwp_ref, zwp_wp_offset,
- zwp_wp_offset_compl, zwp_bio_list_size);
+ zwp_bio_list_size, all_zwr_inserted);
}
int queue_zone_wplugs_show(void *data, struct seq_file *m)
diff --git a/block/blk.h b/block/blk.h
index 2274253cfa58..98954fb0069f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -475,6 +475,16 @@ static inline void blk_zone_update_request_bio(struct request *rq,
bio->bi_iter.bi_size = rq->__data_len;
}
}
+
+void blk_zone_write_plug_requeue_request(struct request *rq);
+static inline void blk_zone_requeue_request(struct request *rq)
+{
+ if (blk_rq_is_seq_zoned_write(rq))
+ blk_zone_write_plug_requeue_request(rq);
+}
+
+void blk_zone_requeue_work(struct request_queue *q);
+
void blk_zone_write_plug_bio_endio(struct bio *bio);
static inline void blk_zone_bio_endio(struct bio *bio)
{
@@ -492,6 +502,14 @@ static inline void blk_zone_finish_request(struct request *rq)
if (rq->rq_flags & RQF_ZONE_WRITE_PLUGGING)
blk_zone_write_plug_finish_request(rq);
}
+
+void blk_zone_write_plug_free_request(struct request *rq);
+static inline void blk_zone_free_request(struct request *rq)
+{
+ if (blk_rq_is_seq_zoned_write(rq))
+ blk_zone_write_plug_free_request(rq);
+}
+
int blkdev_report_zones_ioctl(struct block_device *bdev, unsigned int cmd,
unsigned long arg);
int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
@@ -517,12 +535,21 @@ static inline void blk_zone_update_request_bio(struct request *rq,
struct bio *bio)
{
}
+static inline void blk_zone_requeue_request(struct request *rq)
+{
+}
+static inline void blk_zone_requeue_work(struct request_queue *q)
+{
+}
static inline void blk_zone_bio_endio(struct bio *bio)
{
}
static inline void blk_zone_finish_request(struct request *rq)
{
}
+static inline void blk_zone_free_request(struct request *rq)
+{
+}
static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
unsigned int cmd, unsigned long arg)
{
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index df9887412a9e..fcea07b4062e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -199,7 +199,9 @@ struct gendisk {
unsigned int zone_wplugs_hash_bits;
spinlock_t zone_wplugs_lock;
struct mempool_s *zone_wplugs_pool;
- struct hlist_head *zone_wplugs_hash;
+ struct hlist_head *zone_wplugs_hash;
+ struct list_head zone_wplugs_err_list;
+ struct work_struct zone_wplugs_work;
struct workqueue_struct *zone_wplugs_wq;
#endif /* CONFIG_BLK_DEV_ZONED */
^ permalink raw reply related [flat|nested] 25+ messages in thread
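The deferral rule this patch implements — error recovery for a zone only starts once every zoned write for that zone has either completed or been requeued — can be modeled with a toy in-flight counter. This is a userspace sketch with invented names, not the kernel code, which tracks ownership via tag iteration and the requeue/flush lists instead of a counter:

```c
#include <assert.h>

/* Toy model of per-zone error deferral. */
struct zone_err_state {
	int inflight;	/* zoned writes still owned by the block driver */
	int error;	/* a failed or requeued write was seen */
};

/* Called when a zoned write for the zone fails or is requeued. */
static void note_write_problem(struct zone_err_state *zs)
{
	zs->error = 1;
}

/*
 * Called when a zoned write completes or is freed. Returns 1 when it is
 * safe to run error recovery: an error was recorded and no other write
 * for the zone is still in flight.
 */
static int write_done(struct zone_err_state *zs)
{
	zs->inflight--;
	return zs->error && zs->inflight == 0;
}
```

This mirrors why the patch hooks both blk_mq_free_request() and the requeue path: either event can be the one that makes the zone quiescent.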
* [PATCH v17 08/14] blk-zoned: Add an argument to blk_zone_plug_bio()
2025-01-15 22:46 [PATCH v17 00/14] Improve write performance for zoned UFS devices Bart Van Assche
` (6 preceding siblings ...)
2025-01-15 22:46 ` [PATCH v17 07/14] blk-zoned: Defer error handling Bart Van Assche
@ 2025-01-15 22:46 ` Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 09/14] blk-zoned: Support pipelining of zoned writes Bart Van Assche
` (6 subsequent siblings)
14 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-01-15 22:46 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Bart Van Assche
Prepare for preserving the order of pipelined zoned writes per zone.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 2 +-
block/blk-zoned.c | 3 ++-
drivers/md/dm.c | 2 +-
include/linux/blkdev.h | 5 +++--
4 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ca34ec34d595..01cfcc6f7b02 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3142,7 +3142,7 @@ void blk_mq_submit_bio(struct bio *bio)
if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
goto queue_exit;
- if (blk_queue_is_zoned(q) && blk_zone_plug_bio(bio, nr_segs))
+ if (blk_queue_is_zoned(q) && blk_zone_plug_bio(bio, nr_segs, &swq_cpu))
goto queue_exit;
new_request:
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index cc09ae84acc8..e2929d00dafd 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1165,6 +1165,7 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
* blk_zone_plug_bio - Handle a zone write BIO with zone write plugging
* @bio: The BIO being submitted
* @nr_segs: The number of physical segments of @bio
+ * @swq_cpu: [out] CPU of the software queue to which the bio should be queued
*
* Handle write, write zeroes and zone append operations requiring emulation
* using zone write plugging.
@@ -1173,7 +1174,7 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
* write plug. Otherwise, return false to let the submission path process
* @bio normally.
*/
-bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
+bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int *swq_cpu)
{
struct block_device *bdev = bio->bi_bdev;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 12ecf07a3841..c3f851fe26f6 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1796,7 +1796,7 @@ static inline bool dm_zone_bio_needs_split(struct mapped_device *md,
}
static inline bool dm_zone_plug_bio(struct mapped_device *md, struct bio *bio)
{
- return dm_emulate_zone_append(md) && blk_zone_plug_bio(bio, 0);
+ return dm_emulate_zone_append(md) && blk_zone_plug_bio(bio, 0, NULL);
}
static blk_status_t __send_zone_reset_all_emulated(struct clone_info *ci,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index fcea07b4062e..0ae106944ab3 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -704,13 +704,14 @@ static inline unsigned int disk_nr_zones(struct gendisk *disk)
{
return disk->nr_zones;
}
-bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs);
+bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int *swq_cpu);
#else /* CONFIG_BLK_DEV_ZONED */
static inline unsigned int disk_nr_zones(struct gendisk *disk)
{
return 0;
}
-static inline bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
+static inline bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs,
+ int *swq_cpu)
{
return false;
}
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v17 09/14] blk-zoned: Support pipelining of zoned writes
2025-01-15 22:46 [PATCH v17 00/14] Improve write performance for zoned UFS devices Bart Van Assche
` (7 preceding siblings ...)
2025-01-15 22:46 ` [PATCH v17 08/14] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
@ 2025-01-15 22:46 ` Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 10/14] scsi: core: Retry unaligned " Bart Van Assche
` (5 subsequent siblings)
14 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-01-15 22:46 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Bart Van Assche
Support pipelining of zoned writes if the block driver preserves the write
order per hardware queue. Track per zone to which software queue writes
have been queued. If zoned writes are pipelined, submit new writes to the
same software queue as the writes that are already in progress. This
prevents the reordering that could otherwise be caused by submitting
requests for the same zone to different software or hardware queues.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 4 ++--
block/blk-zoned.c | 33 ++++++++++++++++++++++++---------
2 files changed, 26 insertions(+), 11 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 01cfcc6f7b02..5ac9ff1ab380 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3103,8 +3103,8 @@ void blk_mq_submit_bio(struct bio *bio)
/*
* A BIO that was released from a zone write plug has already been
* through the preparation in this function, already holds a reference
- * on the queue usage counter, and is the only write BIO in-flight for
- * the target zone. Go straight to preparing a request for it.
+ * on the queue usage counter. Go straight to preparing a request for
+ * it.
*/
if (bio_zone_write_plugging(bio)) {
nr_segs = bio->__bi_nr_segments;
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index e2929d00dafd..6eec11b04501 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -55,6 +55,8 @@ static const char *const zone_cond_name[] = {
* as a number of 512B sectors.
* @wp_offset_compl: End offset for completed zoned writes as a number of 512
* byte sectors.
+ * @swq_cpu: Software queue to submit writes to for drivers that preserve the
+ * write order.
* @bio_list: The list of BIOs that are currently plugged.
* @bio_work: Work struct to handle issuing of plugged BIOs
* @rcu_head: RCU head to free zone write plugs with an RCU grace period.
@@ -69,6 +71,7 @@ struct blk_zone_wplug {
unsigned int zone_no;
unsigned int wp_offset;
unsigned int wp_offset_compl;
+ int swq_cpu;
struct bio_list bio_list;
struct work_struct bio_work;
struct rcu_head rcu_head;
@@ -78,8 +81,7 @@ struct blk_zone_wplug {
/*
* Zone write plug flags bits:
* - BLK_ZONE_WPLUG_PLUGGED: Indicates that the zone write plug is plugged,
- * that is, that write BIOs are being throttled due to a write BIO already
- * being executed or the zone write plug bio list is not empty.
+ * that is, that write BIOs are being throttled.
* - BLK_ZONE_WPLUG_NEED_WP_UPDATE: Indicates that we lost track of a zone
* write pointer offset and need to update it.
* - BLK_ZONE_WPLUG_UNHASHED: Indicates that the zone write plug was removed
@@ -568,6 +570,7 @@ static struct blk_zone_wplug *disk_get_and_lock_zone_wplug(struct gendisk *disk,
zwplug->zone_no = zno;
zwplug->wp_offset = bdev_offset_from_zone_start(disk->part0, sector);
zwplug->wp_offset_compl = zwplug->wp_offset;
+ zwplug->swq_cpu = -1;
bio_list_init(&zwplug->bio_list);
INIT_WORK(&zwplug->bio_work, blk_zone_wplug_bio_work);
zwplug->disk = disk;
@@ -1085,7 +1088,8 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
return true;
}
-static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
+static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs,
+ int *swq_cpu)
{
struct gendisk *disk = bio->bi_bdev->bd_disk;
sector_t sector = bio->bi_iter.bi_sector;
@@ -1138,8 +1142,15 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
* Otherwise, plug and let the BIO execute.
*/
if ((zwplug->flags & BLK_ZONE_WPLUG_BUSY) ||
- (bio->bi_opf & REQ_NOWAIT))
+ (bio->bi_opf & REQ_NOWAIT)) {
goto plug;
+ } else if (disk->queue->limits.driver_preserves_write_order) {
+ if (zwplug->swq_cpu < 0)
+ zwplug->swq_cpu = raw_smp_processor_id();
+ *swq_cpu = zwplug->swq_cpu;
+ } else {
+ zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+ }
if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
spin_unlock_irqrestore(&zwplug->lock, flags);
@@ -1147,8 +1158,6 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
return true;
}
- zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
-
spin_unlock_irqrestore(&zwplug->lock, flags);
return false;
@@ -1223,7 +1232,7 @@ bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int *swq_cpu)
fallthrough;
case REQ_OP_WRITE:
case REQ_OP_WRITE_ZEROES:
- return blk_zone_wplug_handle_write(bio, nr_segs);
+ return blk_zone_wplug_handle_write(bio, nr_segs, swq_cpu);
case REQ_OP_ZONE_RESET:
return blk_zone_wplug_handle_reset_or_finish(bio, 0);
case REQ_OP_ZONE_FINISH:
@@ -1278,6 +1287,9 @@ static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+ if (refcount_read(&zwplug->ref) == 2)
+ zwplug->swq_cpu = -1;
+
/*
* If the zone is full (it was fully written or finished, or empty
* (it was reset), remove its zone write plug from the hash table.
@@ -2041,6 +2053,7 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
unsigned int zwp_zone_no, zwp_ref;
unsigned int zwp_bio_list_size;
unsigned long flags;
+ int swq_cpu;
spin_lock_irqsave(&zwplug->lock, flags);
zwp_zone_no = zwplug->zone_no;
@@ -2049,13 +2062,15 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
zwp_wp_offset = zwplug->wp_offset;
zwp_wp_offset_compl = zwplug->wp_offset_compl;
zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
+ swq_cpu = zwplug->swq_cpu;
spin_unlock_irqrestore(&zwplug->lock, flags);
bool all_zwr_inserted = blk_zone_all_zwr_inserted(zwplug);
- seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u bio_list_size %u all_zwr_inserted %d\n",
+ seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u wp_offset_compl %u bio_list_size %u all_zwr_inserted %d swq_cpu %d\n",
zwp_zone_no, zwp_flags, zwp_ref, zwp_wp_offset,
- zwp_bio_list_size, all_zwr_inserted);
+ zwp_wp_offset_compl, zwp_bio_list_size, all_zwr_inserted,
+ swq_cpu);
}
int queue_zone_wplugs_show(void *data, struct seq_file *m)
^ permalink raw reply related [flat|nested] 25+ messages in thread
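The queue-pinning rule this patch adds reduces to a few lines: the first pipelined write for a zone records the submitting CPU's software queue, and later writes for the same zone reuse it, with -1 meaning "unassigned" as in `zwplug->swq_cpu`. A userspace sketch (invented function name):

```c
#include <assert.h>

/*
 * Pick the software queue CPU for a zoned write. The first write pins
 * the zone to the current CPU's software queue; subsequent writes reuse
 * that queue so they cannot be spread across queues and reordered.
 * *zone_swq_cpu == -1 means no queue has been assigned yet.
 */
static int pick_swq_cpu(int *zone_swq_cpu, int submitting_cpu)
{
	if (*zone_swq_cpu < 0)
		*zone_swq_cpu = submitting_cpu;
	return *zone_swq_cpu;
}
```

In the patch itself the assignment happens under the zone write plug lock, and the pin is dropped (reset to -1) once the plug's reference count shows no writes remain.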
* [PATCH v17 10/14] scsi: core: Retry unaligned zoned writes
2025-01-15 22:46 [PATCH v17 00/14] Improve write performance for zoned UFS devices Bart Van Assche
` (8 preceding siblings ...)
2025-01-15 22:46 ` [PATCH v17 09/14] blk-zoned: Support pipelining of zoned writes Bart Van Assche
@ 2025-01-15 22:46 ` Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 11/14] scsi: sd: Increase retry count for " Bart Van Assche
` (4 subsequent siblings)
14 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-01-15 22:46 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Bart Van Assche,
Martin K. Petersen, Ming Lei
If zoned writes (REQ_OP_WRITE) for a sequential write required zone have
a starting LBA that differs from the write pointer, e.g. because a prior
write triggered a unit attention condition, then the storage device will
respond with an UNALIGNED WRITE COMMAND error. Retry commands that failed
with an unaligned write error. The block layer core will sort the SCSI
commands by LBA before they are resubmitted.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/scsi_error.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 10154d78e336..24cd8e8538e5 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -700,6 +700,22 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
fallthrough;
case ILLEGAL_REQUEST:
+ /*
+ * Unaligned write command. This may indicate that zoned writes
+ * have been received by the device in the wrong order. If write
+ * pipelining is enabled, retry.
+ */
+ if (sshdr.asc == 0x21 && sshdr.ascq == 0x04 &&
+ req->q->limits.driver_preserves_write_order &&
+ blk_rq_is_seq_zoned_write(req) &&
+ scsi_cmd_retry_allowed(scmd)) {
+ SCSI_LOG_ERROR_RECOVERY(1,
+ sdev_printk(KERN_WARNING, scmd->device,
+ "Retrying unaligned write at LBA %#llx.\n",
+ scsi_get_lba(scmd)));
+ return NEEDS_RETRY;
+ }
+
if (sshdr.asc == 0x20 || /* Invalid command operation code */
sshdr.asc == 0x21 || /* Logical block address out of range */
sshdr.asc == 0x22 || /* Invalid function */
^ permalink raw reply related [flat|nested] 25+ messages in thread
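The retry decision added to scsi_check_sense() boils down to a predicate on the sense data and request state. A hypothetical condensed form as a userspace sketch — the real code also calls scsi_cmd_retry_allowed() and blk_rq_is_seq_zoned_write(), and logs the LBA:

```c
#include <assert.h>

/*
 * Return nonzero if an ILLEGAL REQUEST with ASC 0x21 / ASCQ 0x04
 * (UNALIGNED WRITE COMMAND) should be retried: only for sequential-zone
 * writes on a queue whose driver preserves the write order, and only
 * while the command still has retries left.
 */
static int should_retry_unaligned_write(unsigned char asc, unsigned char ascq,
					int preserves_write_order,
					int seq_zoned_write,
					int retry_allowed)
{
	return asc == 0x21 && ascq == 0x04 && preserves_write_order &&
	       seq_zoned_write && retry_allowed;
}
```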
* [PATCH v17 11/14] scsi: sd: Increase retry count for zoned writes
2025-01-15 22:46 [PATCH v17 00/14] Improve write performance for zoned UFS devices Bart Van Assche
` (9 preceding siblings ...)
2025-01-15 22:46 ` [PATCH v17 10/14] scsi: core: Retry unaligned " Bart Van Assche
@ 2025-01-15 22:46 ` Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 12/14] scsi: scsi_debug: Add the preserves_write_order module parameter Bart Van Assche
` (3 subsequent siblings)
14 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-01-15 22:46 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Bart Van Assche,
Martin K. Petersen, Ming Lei
If the write order is preserved, increase the number of retries for
write commands sent to a sequential zone to the maximum number of
outstanding commands because in the worst case the number of times
reordered zoned writes have to be retried is (number of outstanding
writes per sequential zone) - 1.
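The worst case described above can be checked with a small user-space simulation (a hedged sketch, not kernel code): N pipelined writes for one zone arrive at the device in reverse order, each retry round completes exactly one write at the write pointer, and so the last write is retried N - 1 times before it lands:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * User-space model (hypothetical, for illustration only): n writes for one
 * zone are submitted in reverse LBA order, the worst case for reordering.
 * Per round, only the write matching the write pointer (wp) succeeds; the
 * writes ahead of wp fail as unaligned and are retried in the next round.
 * Returns how many times the highest-LBA write had to be retried.
 */
static int max_retries_needed(int n)
{
	bool done[64] = { false };
	int wp = 0, retries = 0;

	while (wp < n) {
		/* One retry round: resubmit remaining writes, highest LBA first. */
		for (int lba = n - 1; lba >= 0; lba--) {
			if (done[lba])
				continue;
			if (lba == wp) {
				done[lba] = true;	/* aligned write succeeds */
				wp++;
			} else if (lba > wp && lba == n - 1) {
				retries++;		/* unaligned, retried later */
			}
		}
	}
	return retries;
}
```

The simulation reproduces the bound used by the patch: the retry count equals the number of outstanding writes per zone minus one, which is why `cmd->allowed` is raised by the queue depth.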
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/sd.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index d9e3235d7fd0..2594debb756c 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1403,6 +1403,13 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
cmd->transfersize = sdp->sector_size;
cmd->underflow = nr_blocks << 9;
cmd->allowed = sdkp->max_retries;
+ /*
+ * Increase the number of allowed retries for zoned writes if the driver
+ * preserves the command order.
+ */
+ if (rq->q->limits.driver_preserves_write_order &&
+ blk_rq_is_seq_zoned_write(rq))
+ cmd->allowed += rq->q->nr_requests;
cmd->sdb.length = nr_blocks * sdp->sector_size;
SCSI_LOG_HLQUEUE(1,
* [PATCH v17 12/14] scsi: scsi_debug: Add the preserves_write_order module parameter
2025-01-15 22:46 [PATCH v17 00/14] Improve write performance for zoned UFS devices Bart Van Assche
` (10 preceding siblings ...)
2025-01-15 22:46 ` [PATCH v17 11/14] scsi: sd: Increase retry count for " Bart Van Assche
@ 2025-01-15 22:46 ` Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 13/14] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
` (2 subsequent siblings)
14 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-01-15 22:46 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Bart Van Assche,
Douglas Gilbert, Martin K. Petersen, Ming Lei
Zone write locking is not used for zoned devices if the block driver
reports that it preserves the order of write commands. Make it easier to
test the case in which zone write locking is not used by adding support for
setting the driver_preserves_write_order flag.
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/scsi_debug.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 680ba180a672..11df07b25c26 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -927,6 +927,7 @@ static int dix_reads;
static int dif_errors;
/* ZBC global data */
+static bool sdeb_preserves_write_order;
static bool sdeb_zbc_in_use; /* true for host-aware and host-managed disks */
static int sdeb_zbc_zone_cap_mb;
static int sdeb_zbc_zone_size_mb;
@@ -5881,10 +5882,14 @@ static struct sdebug_dev_info *find_build_dev_info(struct scsi_device *sdev)
static int scsi_debug_slave_alloc(struct scsi_device *sdp)
{
+ struct request_queue *q = sdp->request_queue;
+
if (sdebug_verbose)
pr_info("slave_alloc <%u %u %u %llu>\n",
sdp->host->host_no, sdp->channel, sdp->id, sdp->lun);
+ q->limits.driver_preserves_write_order = sdeb_preserves_write_order;
+
return 0;
}
@@ -6620,6 +6625,8 @@ module_param_named(statistics, sdebug_statistics, bool, S_IRUGO | S_IWUSR);
module_param_named(strict, sdebug_strict, bool, S_IRUGO | S_IWUSR);
module_param_named(submit_queues, submit_queues, int, S_IRUGO);
module_param_named(poll_queues, poll_queues, int, S_IRUGO);
+module_param_named(preserves_write_order, sdeb_preserves_write_order, bool,
+ S_IRUGO);
module_param_named(tur_ms_to_ready, sdeb_tur_ms_to_ready, int, S_IRUGO);
module_param_named(unmap_alignment, sdebug_unmap_alignment, int, S_IRUGO);
module_param_named(unmap_granularity, sdebug_unmap_granularity, int, S_IRUGO);
@@ -6692,6 +6699,8 @@ MODULE_PARM_DESC(opts, "1->noise, 2->medium_err, 4->timeout, 8->recovered_err...
MODULE_PARM_DESC(per_host_store, "If set, next positive add_host will get new store (def=0)");
MODULE_PARM_DESC(physblk_exp, "physical block exponent (def=0)");
MODULE_PARM_DESC(poll_queues, "support for iouring iopoll queues (1 to max(submit_queues - 1))");
+MODULE_PARM_DESC(preserves_write_order,
+ "Whether or not to inform the block layer that this driver preserves the order of WRITE commands (def=0)");
MODULE_PARM_DESC(ptype, "SCSI peripheral type(def=0[disk])");
MODULE_PARM_DESC(random, "If set, uniformly randomize command duration between 0 and delay_in_ns");
MODULE_PARM_DESC(removable, "claim to have removable media (def=0)");
* [PATCH v17 13/14] scsi: scsi_debug: Support injecting unaligned write errors
2025-01-15 22:46 [PATCH v17 00/14] Improve write performance for zoned UFS devices Bart Van Assche
` (11 preceding siblings ...)
2025-01-15 22:46 ` [PATCH v17 12/14] scsi: scsi_debug: Add the preserves_write_order module parameter Bart Van Assche
@ 2025-01-15 22:46 ` Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 14/14] scsi: ufs: Inform the block layer about write ordering Bart Van Assche
2025-01-17 22:47 ` [PATCH v17 00/14] Improve write performance for zoned UFS devices Damien Le Moal
14 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-01-15 22:46 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Bart Van Assche,
Douglas Gilbert, Martin K. Petersen, Ming Lei
Allow user space software, e.g. a blktests test, to inject unaligned
write errors.
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/scsi_debug.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 11df07b25c26..af6a128be9b6 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -193,6 +193,7 @@ static const char *sdebug_version_date = "20210520";
#define SDEBUG_OPT_NO_CDB_NOISE 0x4000
#define SDEBUG_OPT_HOST_BUSY 0x8000
#define SDEBUG_OPT_CMD_ABORT 0x10000
+#define SDEBUG_OPT_UNALIGNED_WRITE 0x20000
#define SDEBUG_OPT_ALL_NOISE (SDEBUG_OPT_NOISE | SDEBUG_OPT_Q_NOISE | \
SDEBUG_OPT_RESET_NOISE)
#define SDEBUG_OPT_ALL_INJECTING (SDEBUG_OPT_RECOVERED_ERR | \
@@ -200,7 +201,8 @@ static const char *sdebug_version_date = "20210520";
SDEBUG_OPT_DIF_ERR | SDEBUG_OPT_DIX_ERR | \
SDEBUG_OPT_SHORT_TRANSFER | \
SDEBUG_OPT_HOST_BUSY | \
- SDEBUG_OPT_CMD_ABORT)
+ SDEBUG_OPT_CMD_ABORT | \
+ SDEBUG_OPT_UNALIGNED_WRITE)
#define SDEBUG_OPT_RECOV_DIF_DIX (SDEBUG_OPT_RECOVERED_ERR | \
SDEBUG_OPT_DIF_ERR | SDEBUG_OPT_DIX_ERR)
@@ -4191,6 +4193,14 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
u8 *cmd = scp->cmnd;
bool meta_data_locked = false;
+ if (unlikely(sdebug_opts & SDEBUG_OPT_UNALIGNED_WRITE &&
+ atomic_read(&sdeb_inject_pending))) {
+ atomic_set(&sdeb_inject_pending, 0);
+ mk_sense_buffer(scp, ILLEGAL_REQUEST, LBA_OUT_OF_RANGE,
+ UNALIGNED_WRITE_ASCQ);
+ return check_condition_result;
+ }
+
switch (cmd[0]) {
case WRITE_16:
ei_lba = 0;
* [PATCH v17 14/14] scsi: ufs: Inform the block layer about write ordering
2025-01-15 22:46 [PATCH v17 00/14] Improve write performance for zoned UFS devices Bart Van Assche
` (12 preceding siblings ...)
2025-01-15 22:46 ` [PATCH v17 13/14] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
@ 2025-01-15 22:46 ` Bart Van Assche
2025-01-16 7:43 ` Can Guo
2025-01-16 15:58 ` Bao D. Nguyen
2025-01-17 22:47 ` [PATCH v17 00/14] Improve write performance for zoned UFS devices Damien Le Moal
14 siblings, 2 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-01-15 22:46 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Bart Van Assche,
Bao D. Nguyen, Can Guo, Martin K. Petersen, Avri Altman
From the UFSHCI 4.0 specification, about the legacy (single queue) mode:
"The host controller always process transfer requests in-order according
to the order submitted to the list. In case of multiple commands with
single doorbell register ringing (batch mode), The dispatch order for
these transfer requests by host controller will base on their index in
the List. A transfer request with lower index value will be executed
before a transfer request with higher index value."
From the UFSHCI 4.0 specification, about the MCQ mode:
"Command Submission
1. Host SW writes an Entry to SQ
2. Host SW updates SQ doorbell tail pointer
Command Processing
3. After fetching the Entry, Host Controller updates SQ doorbell head
pointer
4. Host controller sends COMMAND UPIU to UFS device"
In other words, for both legacy and MCQ mode, UFS controllers are
required to forward commands to the UFS device in the order these
commands have been received from the host.
Notes:
- For legacy mode this is only correct if the host submits one
command at a time. The UFS driver does this.
- Also in legacy mode, the command order is not preserved if
auto-hibernation is enabled in the UFS controller.
This patch improves performance as follows on a test setup with UFSHCI
3.0 controller:
- With the mq-deadline scheduler: 2.5x more IOPS for small writes.
- When not using an I/O scheduler compared to using mq-deadline with
zone locking: 4x more IOPS for small writes.
Cc: Bao D. Nguyen <quic_nguyenb@quicinc.com>
Cc: Can Guo <quic_cang@quicinc.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Avri Altman <avri.altman@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/ufs/core/ufshcd.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index 3094f3c89e82..08803ba21668 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -5255,6 +5255,13 @@ static int ufshcd_device_configure(struct scsi_device *sdev,
struct ufs_hba *hba = shost_priv(sdev->host);
struct request_queue *q = sdev->request_queue;
+ /*
+ * With auto-hibernation disabled, the write order is preserved per
+ * MCQ. Auto-hibernation may cause write reordering that results in
+ * unaligned write errors. The SCSI core will retry the failed writes.
+ */
+ lim->driver_preserves_write_order = true;
+
lim->dma_pad_mask = PRDT_DATA_BYTE_COUNT_PAD - 1;
/*
* Re: [PATCH v17 14/14] scsi: ufs: Inform the block layer about write ordering
2025-01-15 22:46 ` [PATCH v17 14/14] scsi: ufs: Inform the block layer about write ordering Bart Van Assche
@ 2025-01-16 7:43 ` Can Guo
2025-01-16 15:58 ` Bao D. Nguyen
1 sibling, 0 replies; 25+ messages in thread
From: Can Guo @ 2025-01-16 7:43 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Bao D. Nguyen,
Martin K. Petersen, Avri Altman
On 1/16/2025 6:46 AM, Bart Van Assche wrote:
> From the UFSHCI 4.0 specification, about the legacy (single queue) mode:
> "The host controller always process transfer requests in-order according
> to the order submitted to the list. In case of multiple commands with
> single doorbell register ringing (batch mode), The dispatch order for
> these transfer requests by host controller will base on their index in
> the List. A transfer request with lower index value will be executed
> before a transfer request with higher index value."
>
> From the UFSHCI 4.0 specification, about the MCQ mode:
> "Command Submission
> 1. Host SW writes an Entry to SQ
> 2. Host SW updates SQ doorbell tail pointer
>
> Command Processing
> 3. After fetching the Entry, Host Controller updates SQ doorbell head
> pointer
> 4. Host controller sends COMMAND UPIU to UFS device"
>
> In other words, for both legacy and MCQ mode, UFS controllers are
> required to forward commands to the UFS device in the order these
> commands have been received from the host.
>
> Notes:
> - For legacy mode this is only correct if the host submits one
> command at a time. The UFS driver does this.
> - Also in legacy mode, the command order is not preserved if
> auto-hibernation is enabled in the UFS controller.
>
> This patch improves performance as follows on a test setup with UFSHCI
> 3.0 controller:
> - With the mq-deadline scheduler: 2.5x more IOPS for small writes.
> - When not using an I/O scheduler compared to using mq-deadline with
> zone locking: 4x more IOPS for small writes.
>
> Cc: Bao D. Nguyen <quic_nguyenb@quicinc.com>
> Cc: Can Guo <quic_cang@quicinc.com>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: Avri Altman <avri.altman@wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> drivers/ufs/core/ufshcd.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
> index 3094f3c89e82..08803ba21668 100644
> --- a/drivers/ufs/core/ufshcd.c
> +++ b/drivers/ufs/core/ufshcd.c
> @@ -5255,6 +5255,13 @@ static int ufshcd_device_configure(struct scsi_device *sdev,
> struct ufs_hba *hba = shost_priv(sdev->host);
> struct request_queue *q = sdev->request_queue;
>
> + /*
> + * With auto-hibernation disabled, the write order is preserved per
> + * MCQ. Auto-hibernation may cause write reordering that results in
> + * unaligned write errors. The SCSI core will retry the failed writes.
> + */
> + lim->driver_preserves_write_order = true;
> +
> lim->dma_pad_mask = PRDT_DATA_BYTE_COUNT_PAD - 1;
>
> /*
Reviewed-by: Can Guo <quic_cang@quicinc.com>
* Re: [PATCH v17 14/14] scsi: ufs: Inform the block layer about write ordering
2025-01-15 22:46 ` [PATCH v17 14/14] scsi: ufs: Inform the block layer about write ordering Bart Van Assche
2025-01-16 7:43 ` Can Guo
@ 2025-01-16 15:58 ` Bao D. Nguyen
2025-01-23 0:52 ` Bart Van Assche
1 sibling, 1 reply; 25+ messages in thread
From: Bao D. Nguyen @ 2025-01-16 15:58 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Can Guo,
Martin K. Petersen, Avri Altman
On 1/15/2025 2:46 PM, Bart Van Assche wrote:
> This patch improves performance as follows on a test setup with UFSHCI
> 3.0 controller:
> - With the mq-deadline scheduler: 2.5x more IOPS for small writes.
> - When not using an I/O scheduler compared to using mq-deadline with
> zone locking: 4x more IOPS for small writes.
Hi Bart,
Wondering if the change has been tried on 4.x hosts and using different
IO schedulers? Any performance improvements?
Thanks, Bao
* Re: [PATCH v17 02/14] dm-linear: Report to the block layer that the write order is preserved
2025-01-15 22:46 ` [PATCH v17 02/14] dm-linear: Report to the block layer that the write order is preserved Bart Van Assche
@ 2025-01-17 18:04 ` Mikulas Patocka
2025-01-21 21:38 ` Bart Van Assche
0 siblings, 1 reply; 25+ messages in thread
From: Mikulas Patocka @ 2025-01-17 18:04 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal,
Alasdair Kergon, Mike Snitzer
On Wed, 15 Jan 2025, Bart Van Assche wrote:
> Enable write pipelining if dm-linear is stacked on top of a driver that
> supports write pipelining.
Hi
What if you have multiple linear targets in a table? Then, the write order
would not be preserved.
How is write pipelining supposed to work with suspend/resume? dm doesn't
preserve the order of writes in case of suspend.
Mikulas
> Cc: Alasdair Kergon <agk@redhat.com>
> Cc: Mike Snitzer <snitzer@kernel.org>
> Cc: Mikulas Patocka <mpatocka@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> drivers/md/dm-linear.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> index 49fb0f684193..967fbf856abc 100644
> --- a/drivers/md/dm-linear.c
> +++ b/drivers/md/dm-linear.c
> @@ -148,6 +148,11 @@ static int linear_report_zones(struct dm_target *ti,
> #define linear_report_zones NULL
> #endif
>
> +static void linear_io_hints(struct dm_target *ti, struct queue_limits *limits)
> +{
> + limits->driver_preserves_write_order = true;
> +}
> +
> static int linear_iterate_devices(struct dm_target *ti,
> iterate_devices_callout_fn fn, void *data)
> {
> @@ -209,6 +214,7 @@ static struct target_type linear_target = {
> .map = linear_map,
> .status = linear_status,
> .prepare_ioctl = linear_prepare_ioctl,
> + .io_hints = linear_io_hints,
> .iterate_devices = linear_iterate_devices,
> .direct_access = linear_dax_direct_access,
> .dax_zero_page_range = linear_dax_zero_page_range,
>
* Re: [PATCH v17 00/14] Improve write performance for zoned UFS devices
2025-01-15 22:46 [PATCH v17 00/14] Improve write performance for zoned UFS devices Bart Van Assche
` (13 preceding siblings ...)
2025-01-15 22:46 ` [PATCH v17 14/14] scsi: ufs: Inform the block layer about write ordering Bart Van Assche
@ 2025-01-17 22:47 ` Damien Le Moal
2025-01-21 21:57 ` Bart Van Assche
14 siblings, 1 reply; 25+ messages in thread
From: Damien Le Moal @ 2025-01-17 22:47 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe; +Cc: linux-block, Christoph Hellwig
On 1/16/25 7:46 AM, Bart Van Assche wrote:
> Hi Damien and Christoph,
>
> This patch series improves small write IOPS by a factor of four (+300%) for
> zoned UFS devices on my test setup. Please help with reviewing this patch
> series.
Bart,
Given that this is a set of 14 patches, giving a more detailed summary here of
what the entire patch set intends to do and how it does it would really help
with the review. I do not want to have to reverse engineer your line of thought
to see if the patch set is correctly organized.
Also, patch 14 puts back a lot of the code that was recently removed. Not nice,
and its commit message is also far too short given the size of the patch. Please
spend more time explaining what the patches do and how they do it to facilitate
review.
I also fail to see why you treat request requeues as errors if you actually
expect them to happen... I do not think you need error handling and tracking of
the completion wp position: when you get a requeue event, requeue all requests
in the plug and set the plug wp position to the lowest sector of all the
requeued requests. That is necessarily the current location of the write
pointer. No need for re-introducing all the error handling for that, no ?
I am going to wait for you to resend this with a better "big picture"
explanation of what you are trying to do so that I do not randomly comment on
things that I am not sure I completely understand. Thanks.
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v17 02/14] dm-linear: Report to the block layer that the write order is preserved
2025-01-17 18:04 ` Mikulas Patocka
@ 2025-01-21 21:38 ` Bart Van Assche
2025-01-27 17:55 ` Mikulas Patocka
0 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2025-01-21 21:38 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal,
Alasdair Kergon, Mike Snitzer
On 1/17/25 10:04 AM, Mikulas Patocka wrote:
> What if you have multiple linear targets in a table? Then, the write order
> would not be preserved.
If dm-linear is used on top of zoned storage, shouldn't each target
involve a whole number of zones? Doesn't this mean that dm-linear will
preserve the write order per zone?
> How is write pipelining supposed to work with suspend/resume? dm doesn't
> preserve the order of writes in case of suspend.
That's an interesting question. I expect that the following will happen
upon resume if zoned writes would have been reordered by dm-linear:
* The block device reports one or more unaligned write errors.
* For the zones for which an unaligned write error has been reported,
the flag BLK_ZONE_WPLUG_ERROR is set (see also patch 07/14 in this
series).
* Further zoned writes are postponed for the BLK_ZONE_WPLUG_ERROR zones
until all pending zoned writes have completed.
* Once all pending zoned writes have completed for a
BLK_ZONE_WPLUG_ERROR zone, these are resubmitted. This happens in LBA
order.
* The resubmitted writes will succeed unless the submitter (e.g. a
filesystem) left a gap between the zoned writes. If the submitter
does not follow the zoned block device specification, the zoned
writes will be retried until the number of retries has been exhausted.
Block devices are expected to set the number of retries to a small
positive number.
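The "resubmitted in LBA order" step above could look roughly like the following user-space sketch (names are hypothetical; the real blk-zoned.c code operates on plugged BIO lists rather than an array):

```c
#include <stdlib.h>

/*
 * Hypothetical user-space illustration of the recovery step described
 * above: once all pending zoned writes for an errored zone have completed,
 * the failed writes are sorted by start sector so that they can be
 * resubmitted in LBA order.
 */
struct failed_write {
	unsigned long long sector;	/* start LBA of the zoned write */
};

static int cmp_sector(const void *a, const void *b)
{
	const struct failed_write *wa = a, *wb = b;

	/* Avoid overflow-prone subtraction when comparing 64-bit sectors. */
	return (wa->sector > wb->sector) - (wa->sector < wb->sector);
}

static void sort_for_resubmission(struct failed_write *w, size_t n)
{
	qsort(w, n, sizeof(*w), cmp_sector);
}
```

After sorting, resubmission succeeds as long as the sorted writes form a contiguous run starting at the zone's write pointer, which matches the "unless the submitter left a gap" caveat above.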
Thanks,
Bart.
* Re: [PATCH v17 00/14] Improve write performance for zoned UFS devices
2025-01-17 22:47 ` [PATCH v17 00/14] Improve write performance for zoned UFS devices Damien Le Moal
@ 2025-01-21 21:57 ` Bart Van Assche
2025-01-23 4:16 ` Damien Le Moal
0 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2025-01-21 21:57 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe; +Cc: linux-block, Christoph Hellwig
On 1/17/25 2:47 PM, Damien Le Moal wrote:
> Also, patch 14 puts back a lot of the code that was recently removed. Not nice,
> and its commit message is also far too short given the size of the patch. Please
> spend more time explaining what the patches do and how they do it to facilitate
> review.
Regarding the following part of patch 07/14:
- disk_zone_wplug_abort(zwplug);
- zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE;
+ if (!disk->queue->limits.driver_preserves_write_order)
+ zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE;
+ zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
How about setting BLK_ZONE_WPLUG_ERROR only if
disk->queue->limits.driver_preserves_write_order has been set such that
the restored error handling code is only used if the storage controller
preserves the write order? This should be sufficient to preserve the
existing behavior of the blk-zoned.c code for SMR disks.
The reason I restored the error handling code is that the code in
blk-zoned.c does not support error handling inside block drivers if QD > 1
and because supporting the SCSI error handler is essential for UFS
devices.
> I also fail to see why you treat request requeue as errors if you actually
> expect them to happen...
I only expect requeues to happen in exceptional cases, e.g. because the
SCSI error handler has been activated or because a SCSI device reports a
unit attention. Requeues may happen in any order so I think that it is
essential to pause request submission after a requeue until all pending
zoned writes for the same zone have completed or failed.
> I do not think you need error handling and tracking of
> the completion wp position: when you get a requeue event, requeue all requests
> in the plug and set the plug wp position to the lowest sector of all the
> requeued requests. That is necessarily the current location of the write
> pointer.
That's an interesting proposal. I will look into setting the WP to the
lowest sector of all requeued requests.
> No need for re-introducing all the error handling for that, no ?
The error handling code has been restored because of another reason,
namely to support the SCSI error handler in case the queue depth is
larger than one.
> I am going to wait for you to resend this with a better "big picture"
> explanation of what you are trying to do so that I do not randomly comment on
> things that I am not sure I completely understand. Thanks.
I will include a more detailed cover letter when I repost this series
(after the merge window has closed). While I work on this, feedback on
this version of this patch series is still welcome.
Thanks,
Bart.
* Re: [PATCH v17 14/14] scsi: ufs: Inform the block layer about write ordering
2025-01-16 15:58 ` Bao D. Nguyen
@ 2025-01-23 0:52 ` Bart Van Assche
0 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-01-23 0:52 UTC (permalink / raw)
To: Bao D. Nguyen, Jens Axboe
Cc: linux-block, Christoph Hellwig, Damien Le Moal, Can Guo,
Martin K. Petersen, Avri Altman
On 1/16/25 7:58 AM, Bao D. Nguyen wrote:
> On 1/15/2025 2:46 PM, Bart Van Assche wrote:
>> This patch improves performance as follows on a test setup with UFSHCI
>> 3.0 controller:
>> - With the mq-deadline scheduler: 2.5x more IOPS for small writes.
>> - When not using an I/O scheduler compared to using mq-deadline with
>> zone locking: 4x more IOPS for small writes.
>
> Wondering if the change has been tried on 4.x hosts and using different
> IO schedulers? Any performance improvements?
Hi Bao,
Agreed that it would be great to have results for UFSHCI 4.0 host
controllers. Something that's unfortunate is that my UFSHCI 4.0
test setup runs kernel 6.6 and that I have not yet succeeded at
backporting the zone write plugging patches (a requirement for this
patch series) to the 6.6 kernel. A test setup with UFSHCI 4.0
controller and kernel 6.12 should become available to me around the
end of March (two months from now).
Thanks,
Bart.
* Re: [PATCH v17 00/14] Improve write performance for zoned UFS devices
2025-01-21 21:57 ` Bart Van Assche
@ 2025-01-23 4:16 ` Damien Le Moal
2025-01-27 23:01 ` Bart Van Assche
0 siblings, 1 reply; 25+ messages in thread
From: Damien Le Moal @ 2025-01-23 4:16 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe; +Cc: linux-block, Christoph Hellwig
On 1/22/25 6:57 AM, Bart Van Assche wrote:
> On 1/17/25 2:47 PM, Damien Le Moal wrote:
>> Also, patch 14 puts back a lot of the code that was recently removed. Not nice,
>> and its commit message is also far too short given the size of the patch. Please
>> spend more time explaining what the patches do and how they do it to facilitate
>> review.
>
> Regarding the following part of patch 07/14:
>
> - disk_zone_wplug_abort(zwplug);
> - zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE;
> + if (!disk->queue->limits.driver_preserves_write_order)
> + zwplug->flags |= BLK_ZONE_WPLUG_NEED_WP_UPDATE;
> + zwplug->flags |= BLK_ZONE_WPLUG_ERROR;
>
> How about setting BLK_ZONE_WPLUG_ERROR only if
> disk->queue->limits.driver_preserves_write_order has been set such that
> the restored error handling code is only used if the storage controller
> preserves the write order? This should be sufficient to preserve the
> existing behavior of the blk-zoned.c code for SMR disks.
Your patch changes BLK_ZONE_WPLUG_NEED_WP_UPDATE into BLK_ZONE_WPLUG_ERROR for
regular (non-write-order-preserving) zoned devices. I do not see why this is
needed. The main difference between this default behavior and what you are
trying to do is that BLK_ZONE_WPLUG_NEED_WP_UPDATE is to be expected with your
changes while it is NOT expected to ever happen for a regular zoned device.
> The reason I restored the error handling code is that the code in
> blk-zoned.c does not support error handling inside block drivers if QD > 1
> and because supporting the SCSI error handler is essential for UFS
> devices.
What error handling? The only error handling we need is to tell the user
that there was an error and track that the user actually does something about
it. That tracking is done with the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag. That's
it, nothing more.
I think that conflating the expected (even if infrequent) write requeue
events of zoned UFS devices with actual hard device write errors (e.g. the
write failed because the device is toast or you hit a bad sector, or you have a
bad cable or whatever other hardware reason for the write to fail) is making a
mess of everything. Treating the requeues like hard errors makes things
complicated, but also wrong because even for zoned UFS devices, we may still get
hard write errors, and treating these in the same way as requeue event is wrong
in my opinion. For hard errors, the write must not be retried and the error
propagated immediately to the issuer (e.g. the FS).
For a requeue event, even though that is signaled initially at the bottom of the
stack by a device unit attention error, there is no reason for us to treat these
events as failed requests in the zone write plugging code. We need to have a
special treatment for these, e.g. a "ZONE_WRITE_NEED_RETRY" request flag (and
that flag can be set in the scsi mid-layer before calling into the block layer
requeue code).
When the zone write plugging sees this flag, it can:
1) Stop issuing write BIOs since we know they will fail
2) Wait for all requests already issued and in-flight to come back
3) Restore the zone write plug write pointer position to the lowest sector of
all requests that were requeued
4) Re-issue the requeued writes (after somehow sorting them)
5) Re-start issuing BIOs waiting in the write plug
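Step 3 of the proposal can be sketched as follows (a hedged user-space illustration with hypothetical names; in the kernel this would walk the list of requeued requests):

```c
#include <stddef.h>

/*
 * Hypothetical sketch of step 3 above: the zone write plug's write pointer
 * is restored to the lowest start sector among the requeued writes for the
 * zone, which is necessarily the device's current write pointer position,
 * since everything below it was written in order before the requeue event.
 */
static unsigned long long restore_zone_wp(const unsigned long long *requeued,
					  size_t n)
{
	unsigned long long wp = requeued[0];

	for (size_t i = 1; i < n; i++)
		if (requeued[i] < wp)
			wp = requeued[i];
	return wp;
}
```

With the write pointer restored this way, no per-error write pointer tracking is needed: steps 4 and 5 simply reissue from that sector onward.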
And any write that is failed due to a hard error will still set the zone write
plug with BLK_ZONE_WPLUG_NEED_WP_UPDATE after aborting all BIOs that are plugged
in the zone write plug.
Now, I think that the biggest difficulty of this work is making sure that a
write that fails with an unaligned write error caused by a preceding requeue
event can be clearly differentiated from the same failure occurring without a
preceding requeue event. For the former, we can recover. For the latter, it is
a user bug and we must NOT hide it.
Can this be done cleanly ? I am not entirely sure about it because I am still
not 100% clear about the conditions that result in a zoned UFS device failing a
write with unit attention. As far as I can tell, the first write issued when the
device is sleeping will be rejected with that error and must be requeued. But
other writes behind this failed write that are already in flight will end up
failing with an unaligned write error, no? Or will they be failed with a unit
attention too? This is all unclear to me.
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v17 02/14] dm-linear: Report to the block layer that the write order is preserved
2025-01-21 21:38 ` Bart Van Assche
@ 2025-01-27 17:55 ` Mikulas Patocka
0 siblings, 0 replies; 25+ messages in thread
From: Mikulas Patocka @ 2025-01-27 17:55 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal,
Alasdair Kergon, Mike Snitzer
On Tue, 21 Jan 2025, Bart Van Assche wrote:
> > How is write pipelining supposed to work with suspend/resume? dm doesn't
> > preserve the order of writes in case of suspend.
>
> That's an interesting question. I expect that the following will happen
> upon resume if zoned writes would have been reordered by dm-linear:
> * The block device reports one or more unaligned write errors.
> * For the zones for which an unaligned write error has been reported,
> the flag BLK_ZONE_WPLUG_ERROR is set (see also patch 07/14 in this
> series).
> * Further zoned writes are postponed for the BLK_ZONE_WPLUG_ERROR zones
> until all pending zoned writes have completed.
> * Once all pending zoned writes have completed for a
> BLK_ZONE_WPLUG_ERROR zone, these are resubmitted. This happens in LBA
> order.
> * The resubmitted writes will succeed unless the submitter (e.g. a
> filesystem) left a gap between the zoned writes. If the submitter
> does not follow the zoned block device specification, the zoned
> writes will be retried until the number of retries has been exhausted.
> Block devices are expected to set the number of retries to a small
> positive number.
On suspend, dm_submit_bio calls queue_io to add bios to a list. On resume,
the list is processed in order and the bios are submitted. However, submitting
the deferred bios may race with new bios that are received and directed to the
underlying block device, so the new bios may be submitted before the old ones.
Mikulas
* Re: [PATCH v17 00/14] Improve write performance for zoned UFS devices
2025-01-23 4:16 ` Damien Le Moal
@ 2025-01-27 23:01 ` Bart Van Assche
0 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-01-27 23:01 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe; +Cc: linux-block, Christoph Hellwig
On 1/22/25 8:16 PM, Damien Le Moal wrote:
> On 1/22/25 6:57 AM, Bart Van Assche wrote:
>> The reason I restored the error handling code is that the code in
>> blk-zoned.c does not support error handling inside block drivers if QD > 1
>> and because supporting the SCSI error handler is essential for UFS
>> devices.
>
> What error handling ?
The SCSI error handler. If the SCSI error handler is invoked while only
one request is outstanding, there is no chance of reordering and no
additional measures are necessary. If multiple requests are outstanding
and the SCSI error handler is invoked, care must be taken that the
pending requests are resubmitted in LBA order per zone. This is not
possible without blocking zoned write processing until all pending
zoned writes have completed. Hence the need to set BLK_ZONE_WPLUG_ERROR
if write pipelining is enabled and a request is requeued.
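As a language-neutral illustration (a Python model, not kernel code; the function name and data shapes are hypothetical), the per-zone LBA-ordered resubmission described above can be sketched as follows: requeued writes are grouped by zone and each group is sorted by LBA before resubmission.

```python
from collections import defaultdict

def resubmission_order(requeued):
    """Given requeued zoned writes as (zone, lba) pairs in arbitrary
    (possibly reordered) order, return the per-zone resubmission
    order: LBA-ascending within each zone."""
    by_zone = defaultdict(list)
    for zone, lba in requeued:
        by_zone[zone].append(lba)
    return {zone: sorted(lbas) for zone, lbas in by_zone.items()}
```

Note that this only models the ordering constraint; in the real code the resubmission must additionally wait until no writes remain in flight for the affected zone.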
> I think that conflating the expected (even if infrequent) write requeue
> events on zoned UFS devices with actual hard device write errors (e.g. the
> write failed because the device is toast, you hit a bad sector, or you have a
> bad cable or some other hardware reason) is making a mess of everything.
> Treating the requeues like hard errors is not only complicated but also wrong:
> even zoned UFS devices may still encounter hard write errors, and treating
> these in the same way as requeue events is wrong in my opinion. For hard
> errors, the write must not be retried and the error must be propagated
> immediately to the issuer (e.g. the FS).
>
> For a requeue event, even though it is signaled initially at the bottom of the
> stack by a device unit attention error, there is no reason for us to treat these
> events as failed requests in the zone write plugging code. We need special
> treatment for these, e.g. a "ZONE_WRITE_NEED_RETRY" request flag (and that flag
> can be set in the scsi mid-layer before calling into the block layer requeue
> code).
>
> When the zone write plugging sees this flag, it can:
> 1) Stop issuing write BIOs since we know they will fail
> 2) Wait for all requests already issued and in-flight to come back
> 3) Restore the zone write plug write pointer position to the lowest sector of
> all requests that were requeued
> 4) Re-issue the requeued writes (after somehow sorting them)
> 5) Re-start issuing BIOs waiting in the write plug
>
> And any write that fails due to a hard error will still set the zone write
> plug with BLK_ZONE_WPLUG_NEED_WP_UPDATE after aborting all BIOs that are plugged
> in the zone write plug.
>
> Now, I think that the biggest difficulty of this work is to make sure that a
> write that fails with an unaligned write error due to a preceding write requeue
> event can be clearly differentiated from the same failure without a preceding
> requeue event. For the former, we can recover. For the latter, it is a user bug
> and we must NOT hide it.
>
> Can this be done cleanly ? I am not entirely sure about it because I am still
> not 100% clear about the conditions that result in a zoned UFS device failing a
> write with unit attention. As far as I can tell, the first write issued when the
> device is sleeping will be rejected with that error and must be requeued. But
> other writes behind this failed write that are already in flight will end up
> failing with an unaligned write error, no ? Or will they be failed with a unit
> attention too ? This is all unclear to me.
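The five-step recovery sequence quoted above can be modeled compactly (a Python sketch with hypothetical names; steps 1 and 2, stopping issue and waiting for in-flight writes to return, are assumed to have already happened before this function runs):

```python
def recover_after_requeue(requeued_writes, plugged_bios):
    """Model steps 3-5 of the quoted recovery sequence:
    restore the zone write plug write pointer to the lowest requeued
    sector (step 3), re-issue the requeued writes in sector order
    (step 4), then restart the BIOs waiting in the plug (step 5).
    Returns (restored_wp, issue_order)."""
    restored_wp = min(requeued_writes)
    issue_order = sorted(requeued_writes) + list(plugged_bios)
    return restored_wp, issue_order
```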
I propose to keep the approach of the code in blk-zoned.c independent of
the details of how UFS devices work. That will make it more likely that
the blk-zoned.c code remains generic and that it can be reused for other
storage protocols.
Although I have not yet observed reordering due to a unit attention, I
think that unit attentions should be supported because unit attentions
are such an important SCSI mechanism.
Since UFSHCI 3 controllers use a 32-bit register to indicate which
requests are in flight, reordering of requests is likely when such a
controller comes out of the suspended state. When a UFSHCI controller
leaves the suspended state, it scans that 32-bit register from the
lowest to the highest bit, so the ordering information provided by the
host is lost. It seems to me that the easiest way to address this
reordering is to resubmit requests once if the controller reordered
them while leaving the suspended state. Please note that UFSHCI 3
controllers are not my primary focus and that I would be fine with not
supporting write pipelining for these controllers if there would be
disagreement about this aspect. UFSHCI 4 controllers have proper
submission queues in host memory, just like NVMe controllers, and hence
request ordering information for submission queues is not lost if the
controller is suspended and resumed.
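The order loss described above can be illustrated with a small model (Python, not driver code; slot numbers and the function name are hypothetical): whatever order the host rang the doorbell bits in, after resume the controller walks the register from the lowest set bit upward.

```python
def resume_scan_order(submission_order):
    """Model a UFSHCI 3 controller resuming: the host set doorbell
    bits in `submission_order`, but the controller restarts the
    outstanding requests by scanning the 32-bit register from the
    lowest to the highest set bit, discarding the host's order."""
    doorbell = 0
    for slot in submission_order:
        doorbell |= 1 << slot
    return [bit for bit in range(32) if doorbell & (1 << bit)]
```

For example, slots submitted in the order 5, 2, 7 come back in the order 2, 5, 7 after resume, which is exactly the reordering that the requeue mechanism has to absorb.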
The SCSI error handler can also cause reordering. If e.g. the UFS
controller reports a CRC error then the UFS driver will reset the UFS
host controller and activate the SCSI error handler. Pending writes must
be resubmitted in LBA order per zone after the SCSI error handler has
finished, to prevent write errors from propagating to the file system
and to user space.
At the time a UFS device reports an unaligned write error, I don't think
that it is possible to determine whether or not this is a retryable
error. So instead of introducing a ZONE_WRITE_NEED_RETRY flag, I propose
the following:
* Use BLK_ZONE_WPLUG_ERROR for both hard and soft errors. Hard errors
are errors that should not be retried; soft errors are retryable.
* After all pending writes have completed and before resubmitting any
zoned writes, check whether or not these should be retried. For UFS
devices this can be done by comparing the wp_offset_compl member with
the lowest LBA of the pending zoned writes.
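The retryability check in the second bullet can be sketched as follows (Python model; `wp_offset_compl` is the member named in the series, while the function name and `pending_lbas` argument are illustrative): the error is soft only when the completed write pointer lines up with the lowest pending LBA, i.e. the writes were merely reordered and left no gap.

```python
def unaligned_error_is_retryable(wp_offset_compl, pending_lbas):
    """If the lowest pending LBA equals the write pointer position
    reached by the completed writes, the writes were merely reordered
    and can be resubmitted in LBA order (soft error). A gap means the
    submitter skipped LBAs, which is a hard, non-retryable error."""
    return bool(pending_lbas) and min(pending_lbas) == wp_offset_compl
```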
Do I remember correctly that residual information is not always reliable
for zoned writes to SMR disks and hence that another approach is needed
for SMR disks? How about setting BLK_ZONE_WPLUG_NEED_WP_UPDATE for SMR
disks instead of resubmitting writes after an unaligned write has been
reported by the storage device?
Thanks,
Bart.
end of thread, other threads:[~2025-01-27 23:01 UTC | newest]
Thread overview: 25+ messages
2025-01-15 22:46 [PATCH v17 00/14] Improve write performance for zoned UFS devices Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 01/14] block: Support block drivers that preserve the order of write requests Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 02/14] dm-linear: Report to the block layer that the write order is preserved Bart Van Assche
2025-01-17 18:04 ` Mikulas Patocka
2025-01-21 21:38 ` Bart Van Assche
2025-01-27 17:55 ` Mikulas Patocka
2025-01-15 22:46 ` [PATCH v17 03/14] block: Rework request allocation in blk_mq_submit_bio() Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 04/14] block: Support allocating from a specific software queue Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 05/14] blk-mq: Restore the zoned write order when requeuing Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 06/14] blk-zoned: Track the write pointer per zone Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 07/14] blk-zoned: Defer error handling Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 08/14] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 09/14] blk-zoned: Support pipelining of zoned writes Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 10/14] scsi: core: Retry unaligned " Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 11/14] scsi: sd: Increase retry count for " Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 12/14] scsi: scsi_debug: Add the preserves_write_order module parameter Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 13/14] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
2025-01-15 22:46 ` [PATCH v17 14/14] scsi: ufs: Inform the block layer about write ordering Bart Van Assche
2025-01-16 7:43 ` Can Guo
2025-01-16 15:58 ` Bao D. Nguyen
2025-01-23 0:52 ` Bart Van Assche
2025-01-17 22:47 ` [PATCH v17 00/14] Improve write performance for zoned UFS devices Damien Le Moal
2025-01-21 21:57 ` Bart Van Assche
2025-01-23 4:16 ` Damien Le Moal
2025-01-27 23:01 ` Bart Van Assche