* [PATCH v19 01/11] block: Support block drivers that preserve the order of write requests
2025-06-30 22:29 [PATCH v19 00/11] Improve write performance for zoned UFS devices Bart Van Assche
@ 2025-06-30 22:29 ` Bart Van Assche
2025-07-02 10:57 ` Damien Le Moal
2025-06-30 22:29 ` [PATCH v19 02/11] block: Support allocating from a specific software queue Bart Van Assche
` (9 subsequent siblings)
10 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2025-06-30 22:29 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche, Hannes Reinecke, Nitesh Shetty, Ming Lei
Some storage controllers preserve the request order per hardware queue.
Introduce the request queue limit member variable
'driver_preserves_write_order' to allow block drivers to indicate that
the order of write commands is preserved per hardware queue and hence
that serialization of writes per zone is not required if all pending
writes are submitted to the same hardware queue.
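For illustration, a block driver would opt in by setting this limit before
registering the disk. A minimal sketch, assuming a blk-mq driver that sets up
its queue_limits as usual (the surrounding setup is hypothetical; patch 06/11
below does this for null_blk):

	struct queue_limits lim = { };

	/* ... driver-specific limits ... */
	lim.driver_preserves_write_order = true;

	disk = blk_mq_alloc_disk(&tag_set, &lim, driver_data);
	if (IS_ERR(disk))
		return PTR_ERR(disk);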
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Nitesh Shetty <nj.shetty@samsung.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-settings.c | 2 ++
include/linux/blkdev.h | 5 +++++
2 files changed, 7 insertions(+)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index a000daafbfb4..bceb9a9cb5ba 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -814,6 +814,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
}
t->max_secure_erase_sectors = min_not_zero(t->max_secure_erase_sectors,
b->max_secure_erase_sectors);
+ t->driver_preserves_write_order = t->driver_preserves_write_order &&
+ b->driver_preserves_write_order;
t->zone_write_granularity = max(t->zone_write_granularity,
b->zone_write_granularity);
if (!(t->features & BLK_FEAT_ZONED)) {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a51f92b6c340..aa1990d94130 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -408,6 +408,11 @@ struct queue_limits {
unsigned int max_open_zones;
unsigned int max_active_zones;
+ /*
+ * Whether or not the block driver preserves the order of write
+ * requests per hardware queue. Set by the block driver.
+ */
+ bool driver_preserves_write_order;
/*
* Drivers that set dma_alignment to less than 511 must be prepared to
* Re: [PATCH v19 01/11] block: Support block drivers that preserve the order of write requests
2025-06-30 22:29 ` [PATCH v19 01/11] block: Support block drivers that preserve the order of write requests Bart Van Assche
@ 2025-07-02 10:57 ` Damien Le Moal
2025-07-02 15:57 ` Bart Van Assche
2025-07-07 21:14 ` Bart Van Assche
0 siblings, 2 replies; 17+ messages in thread
From: Damien Le Moal @ 2025-07-02 10:57 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Hannes Reinecke,
Nitesh Shetty, Ming Lei
On 7/1/25 07:29, Bart Van Assche wrote:
> Some storage controllers preserve the request order per hardware queue.
> Introduce the request queue limit member variable
> 'driver_preserves_write_order' to allow block drivers to indicate that
> the order of write commands is preserved per hardware queue and hence
> that serialization of writes per zone is not required if all pending
> writes are submitted to the same hardware queue.
>
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Nitesh Shetty <nj.shetty@samsung.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/blk-settings.c | 2 ++
> include/linux/blkdev.h | 5 +++++
> 2 files changed, 7 insertions(+)
>
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index a000daafbfb4..bceb9a9cb5ba 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -814,6 +814,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> }
> t->max_secure_erase_sectors = min_not_zero(t->max_secure_erase_sectors,
> b->max_secure_erase_sectors);
> + t->driver_preserves_write_order = t->driver_preserves_write_order &&
> + b->driver_preserves_write_order;
Why not use a feature instead? Stacking of the features does exactly this, no?
That would be less code and one less limit.
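(For reference, the feature-based variant would presumably add a flag next to
the existing BLK_FEAT_* definitions; the name and bit below are hypothetical:

	/* include/linux/blkdev.h, next to the existing BLK_FEAT_* flags */
	#define BLK_FEAT_PRESERVES_WRITE_ORDER	((__force blk_features_t)(1u << 16))

and rely on the existing feature stacking in blk_stack_limits() instead of the
explicit AND above. As discussed below, the inheritance semantics of feature
stacking differ from this AND.)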
> t->zone_write_granularity = max(t->zone_write_granularity,
> b->zone_write_granularity);
> if (!(t->features & BLK_FEAT_ZONED)) {
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index a51f92b6c340..aa1990d94130 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -408,6 +408,11 @@ struct queue_limits {
>
> unsigned int max_open_zones;
> unsigned int max_active_zones;
> + /*
> + * Whether or not the block driver preserves the order of write
> + * requests per hardware queue. Set by the block driver.
> + */
> + bool driver_preserves_write_order;
>
> /*
> * Drivers that set dma_alignment to less than 511 must be prepared to
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v19 01/11] block: Support block drivers that preserve the order of write requests
2025-07-02 10:57 ` Damien Le Moal
@ 2025-07-02 15:57 ` Bart Van Assche
2025-07-03 0:09 ` Damien Le Moal
2025-07-07 21:14 ` Bart Van Assche
1 sibling, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2025-07-02 15:57 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Hannes Reinecke,
Nitesh Shetty, Ming Lei
On 7/2/25 3:57 AM, Damien Le Moal wrote:
> On 7/1/25 07:29, Bart Van Assche wrote:
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index a000daafbfb4..bceb9a9cb5ba 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -814,6 +814,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>> }
>> t->max_secure_erase_sectors = min_not_zero(t->max_secure_erase_sectors,
>> b->max_secure_erase_sectors);
>> + t->driver_preserves_write_order = t->driver_preserves_write_order &&
>> + b->driver_preserves_write_order;
>
> Why not use a feature instead ? Stacking of the features does exactly this, no ?
> That would be less code and one less limit.
Hi Damien,
Thanks for the feedback. I will look into making this change.
Please also help with reviewing the other patches in this series.
Progress on this series has been slow so far because every time I post it,
reviewer feedback covers less than 10% of the code.
Thanks,
Bart.
* Re: [PATCH v19 01/11] block: Support block drivers that preserve the order of write requests
2025-07-02 15:57 ` Bart Van Assche
@ 2025-07-03 0:09 ` Damien Le Moal
2025-07-07 21:11 ` Bart Van Assche
0 siblings, 1 reply; 17+ messages in thread
From: Damien Le Moal @ 2025-07-03 0:09 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Hannes Reinecke,
Nitesh Shetty, Ming Lei
On 7/3/25 12:57 AM, Bart Van Assche wrote:
> On 7/2/25 3:57 AM, Damien Le Moal wrote:
>> On 7/1/25 07:29, Bart Van Assche wrote:
>>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>>> index a000daafbfb4..bceb9a9cb5ba 100644
>>> --- a/block/blk-settings.c
>>> +++ b/block/blk-settings.c
>>> @@ -814,6 +814,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>>> }
>>> t->max_secure_erase_sectors = min_not_zero(t->max_secure_erase_sectors,
>>> b->max_secure_erase_sectors);
>>> + t->driver_preserves_write_order = t->driver_preserves_write_order &&
>>> + b->driver_preserves_write_order;
>>
>> Why not use a feature instead ? Stacking of the features does exactly this, no ?
>> That would be less code and one less limit.
>
> Hi Damien,
>
> Thanks for the feedback. I will look into making this change.
>
> Please also help with reviewing the other patches in this series.
> Progress on this patch series has been slow so far because every time I
> post this patch series reviewer feedback is provided on less than 10% of
> the code in this patch series.
I am still trying to wrap my head around what you are doing with these patches.
If I understood things correctly, you allocate and submit a BIO from one CPU
using the ctx (software queue) of another CPU (your commit messages are still a
little too dry and do not explain much, especially the why...).
So far, I do not like what I am seeing because I think it is dangerous given
all the per-CPU state involved. I do not see how it can be safe to prepare a
BIO on one CPU and issue it on another.
I did suggest that you should simply plug all BIOs, always, and have them
submitted by the BIO work for a zone, scheduling that work on a single CPU,
always the same CPU. Doing so, you can distribute writes for different zones to
different CPUs. And if the submitting process is not bouncing all over the place
(being rescheduled constantly on different CPUs), there will be no overhead
because you can simply run the BIO work on the current CPU.
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v19 01/11] block: Support block drivers that preserve the order of write requests
2025-07-03 0:09 ` Damien Le Moal
@ 2025-07-07 21:11 ` Bart Van Assche
0 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2025-07-07 21:11 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Hannes Reinecke,
Nitesh Shetty, Ming Lei
On 7/2/25 5:09 PM, Damien Le Moal wrote:
> I did suggest that you should simply plug all BIOs, always, and have them
> submitted by the BIO work for a zone, scheduling that work on a single CPU,
> always the same CPU. Doing so, you can distribute writes for different zones to
> different CPUs. And if the submitting process is not bouncing all over the place
> (being rescheduled constantly on different CPUs), there will be no overhead
> because you can simply run the BIO work on the current CPU.
I will look into making this change.
Thanks,
Bart.
* Re: [PATCH v19 01/11] block: Support block drivers that preserve the order of write requests
2025-07-02 10:57 ` Damien Le Moal
2025-07-02 15:57 ` Bart Van Assche
@ 2025-07-07 21:14 ` Bart Van Assche
1 sibling, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2025-07-07 21:14 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Hannes Reinecke,
Nitesh Shetty, Ming Lei
On 7/2/25 3:57 AM, Damien Le Moal wrote:
> On 7/1/25 07:29, Bart Van Assche wrote:
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index a000daafbfb4..bceb9a9cb5ba 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -814,6 +814,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>> }
>> t->max_secure_erase_sectors = min_not_zero(t->max_secure_erase_sectors,
>> b->max_secure_erase_sectors);
>> + t->driver_preserves_write_order = t->driver_preserves_write_order &&
>> + b->driver_preserves_write_order;
>
> Why not use a feature instead ? Stacking of the features does exactly this, no ?
> That would be less code and one less limit.
Not all device mapper drivers preserve the bio order. The above code
allows device mapper drivers to declare whether or not the bio order is
preserved. The suggested alternative implements a different behavior,
namely that the driver at the top of the stack inherits the behavior
from the bottommost driver. I prefer to keep the above code.
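To make the difference concrete, a simplified sketch of the two stacking
behaviors (the second statement paraphrases the existing feature-inheritance
logic in blk_stack_limits(); treat it as an approximation):

	/* Limit-based (this patch): both levels of the stack must agree. */
	t->driver_preserves_write_order = t->driver_preserves_write_order &&
					  b->driver_preserves_write_order;

	/* Feature-based: the top inherits what the bottom device reports. */
	t->features |= (b->features & BLK_FEAT_INHERIT_MASK);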
Thanks,
Bart.
* [PATCH v19 02/11] block: Support allocating from a specific software queue
2025-06-30 22:29 [PATCH v19 00/11] Improve write performance for zoned UFS devices Bart Van Assche
2025-06-30 22:29 ` [PATCH v19 01/11] block: Support block drivers that preserve the order of write requests Bart Van Assche
@ 2025-06-30 22:29 ` Bart Van Assche
2025-06-30 22:29 ` [PATCH v19 03/11] blk-mq: Restore the zone write order when requeuing Bart Van Assche
` (8 subsequent siblings)
10 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2025-06-30 22:29 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
A later patch will preserve the order of pipelined zoned writes by submitting
all zoned writes for a zone to the same software queue as the zoned writes
that were submitted previously for that zone. Hence, support allocating a
request from a specific software queue.
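The new 'from_cpu' member keeps the old behavior when set to -1 and otherwise
selects the software queue of the given CPU. A minimal sketch of an internal
caller, using only fields that appear in the patch below:

	struct blk_mq_alloc_data data = {
		.q		= q,
		.cmd_flags	= bio->bi_opf,
		.nr_tags	= 1,
		.from_cpu	= -1,	/* -1: use the current CPU's queue */
	};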
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 22 +++++++++++++++++++---
block/blk-mq.h | 1 +
2 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0c61492724d2..7bbb4da6e168 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -485,6 +485,7 @@ __blk_mq_alloc_requests_batch(struct blk_mq_alloc_data *data)
static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
{
struct request_queue *q = data->q;
+ int from_cpu = data->from_cpu;
u64 alloc_time_ns = 0;
struct request *rq;
unsigned int tag;
@@ -497,7 +498,8 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
data->flags |= BLK_MQ_REQ_NOWAIT;
retry:
- data->ctx = blk_mq_get_ctx(q);
+ data->from_cpu = from_cpu >= 0 ? from_cpu : raw_smp_processor_id();
+ data->ctx = __blk_mq_get_ctx(q, data->from_cpu);
data->hctx = blk_mq_map_queue(data->cmd_flags, data->ctx);
if (q->elevator) {
@@ -579,6 +581,7 @@ static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
.rq_flags = 0,
.nr_tags = plug->nr_ios,
.cached_rqs = &plug->cached_rqs,
+ .from_cpu = -1,
.ctx = NULL,
.hctx = NULL
};
@@ -644,6 +647,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
.cmd_flags = opf,
.rq_flags = 0,
.nr_tags = 1,
+ .from_cpu = -1,
.cached_rqs = NULL,
.ctx = NULL,
.hctx = NULL
@@ -678,6 +682,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
.cmd_flags = opf,
.rq_flags = 0,
.nr_tags = 1,
+ .from_cpu = -1,
.cached_rqs = NULL,
.ctx = NULL,
.hctx = NULL
@@ -3012,6 +3017,7 @@ static bool blk_mq_attempt_bio_merge(struct request_queue *q,
}
static struct request *blk_mq_get_new_requests(struct request_queue *q,
+ int from_cpu,
struct blk_plug *plug,
struct bio *bio)
{
@@ -3021,6 +3027,7 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
.shallow_depth = 0,
.cmd_flags = bio->bi_opf,
.rq_flags = 0,
+ .from_cpu = from_cpu,
.nr_tags = 1,
.cached_rqs = NULL,
.ctx = NULL,
@@ -3114,6 +3121,7 @@ void blk_mq_submit_bio(struct bio *bio)
struct blk_mq_hw_ctx *hctx;
unsigned int nr_segs;
struct request *rq;
+ int from_cpu = -1;
blk_status_t ret;
/*
@@ -3175,10 +3183,17 @@ void blk_mq_submit_bio(struct bio *bio)
}
new_request:
- if (rq) {
+ if (rq && (from_cpu == -1 || from_cpu == rq->mq_ctx->cpu)) {
blk_mq_use_cached_rq(rq, plug, bio);
+ } else if (rq) {
+ /*
+ * We are not using the cached request, we know that
+ * q_usage_counter != 0 and have not yet called
+ * bio_queue_enter(). Increase q_usage_counter.
+ */
+ percpu_ref_get(&q->q_usage_counter);
} else {
- rq = blk_mq_get_new_requests(q, plug, bio);
+ rq = blk_mq_get_new_requests(q, from_cpu, plug, bio);
if (unlikely(!rq)) {
if (bio->bi_opf & REQ_NOWAIT)
bio_wouldblock_error(bio);
@@ -3190,6 +3205,7 @@ void blk_mq_submit_bio(struct bio *bio)
rq_qos_track(q, rq, bio);
+ /* Transfer the ownership of the q_usage_counter increase to 'rq'. */
blk_mq_bio_to_request(rq, bio, nr_segs);
ret = blk_crypto_rq_get_keyslot(rq);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index affb2e14b56e..52e907547b55 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -156,6 +156,7 @@ struct blk_mq_alloc_data {
struct rq_list *cached_rqs;
/* input & output parameter */
+ int from_cpu;
struct blk_mq_ctx *ctx;
struct blk_mq_hw_ctx *hctx;
};
* [PATCH v19 03/11] blk-mq: Restore the zone write order when requeuing
2025-06-30 22:29 [PATCH v19 00/11] Improve write performance for zoned UFS devices Bart Van Assche
2025-06-30 22:29 ` [PATCH v19 01/11] block: Support block drivers that preserve the order of write requests Bart Van Assche
2025-06-30 22:29 ` [PATCH v19 02/11] block: Support allocating from a specific software queue Bart Van Assche
@ 2025-06-30 22:29 ` Bart Van Assche
2025-06-30 22:29 ` [PATCH v19 04/11] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
` (7 subsequent siblings)
10 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2025-06-30 22:29 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche, Yu Kuai
Zoned writes may be requeued. This happens, e.g., if a block driver returns
BLK_STS_RESOURCE, to handle a SCSI unit attention, or when the SCSI error
handler resubmits requests after error handling has finished. A later patch
enables write pipelining and increases the number of pending writes per zone.
If multiple writes are pending per zone, write requests may be requeued in a
different order than they were submitted in. Restore the request order when
requests are requeued. Add RQF_DONTPREP to RQF_NOMERGE_FLAGS because this
patch may cause RQF_DONTPREP requests to reach the code that checks whether a
request can be merged, and RQF_DONTPREP requests must not be merged.
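As an illustration of the order restored by the new helper (hypothetical LBAs;
see blk_mq_insert_ordered() in the diff below):

	/*
	 * Writes for one zone that were requeued as LBA 8, 0, 4 end up in
	 * submission order again:
	 *
	 *   insert rq(LBA 8) -> list: [8]
	 *   insert rq(LBA 0) -> list: [0, 8]
	 *   insert rq(LBA 4) -> list: [0, 4, 8]
	 */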
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/bfq-iosched.c | 2 ++
block/blk-mq.c | 20 +++++++++++++++++++-
block/blk-mq.h | 2 ++
block/kyber-iosched.c | 2 ++
block/mq-deadline.c | 7 ++++++-
include/linux/blk-mq.h | 13 ++++++++++++-
6 files changed, 43 insertions(+), 3 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0cb1e9873aab..1bd3afe5d779 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6276,6 +6276,8 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
if (flags & BLK_MQ_INSERT_AT_HEAD) {
list_add(&rq->queuelist, &bfqd->dispatch);
+ } else if (flags & BLK_MQ_INSERT_ORDERED) {
+ blk_mq_insert_ordered(rq, &bfqd->dispatch);
} else if (!bfqq) {
list_add_tail(&rq->queuelist, &bfqd->dispatch);
} else {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7bbb4da6e168..7935f6adda71 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1562,7 +1562,9 @@ static void blk_mq_requeue_work(struct work_struct *work)
* already. Insert it into the hctx dispatch list to avoid
* block layer merges for the request.
*/
- if (rq->rq_flags & RQF_DONTPREP)
+ if (blk_rq_is_seq_zoned_write(rq))
+ blk_mq_insert_request(rq, BLK_MQ_INSERT_ORDERED);
+ else if (rq->rq_flags & RQF_DONTPREP)
blk_mq_request_bypass_insert(rq, 0);
else
blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
@@ -2595,6 +2597,20 @@ static void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx,
blk_mq_run_hw_queue(hctx, run_queue_async);
}
+void blk_mq_insert_ordered(struct request *rq, struct list_head *list)
+{
+ struct request_queue *q = rq->q;
+ struct request *rq2;
+
+ list_for_each_entry(rq2, list, queuelist)
+ if (rq2->q == q && blk_rq_pos(rq2) > blk_rq_pos(rq))
+ break;
+
+ /* Insert rq before rq2. If rq2 is the list head, append at the end. */
+ list_add_tail(&rq->queuelist, &rq2->queuelist);
+}
+EXPORT_SYMBOL_GPL(blk_mq_insert_ordered);
+
static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
{
struct request_queue *q = rq->q;
@@ -2649,6 +2665,8 @@ static void blk_mq_insert_request(struct request *rq, blk_insert_t flags)
spin_lock(&ctx->lock);
if (flags & BLK_MQ_INSERT_AT_HEAD)
list_add(&rq->queuelist, &ctx->rq_lists[hctx->type]);
+ else if (flags & BLK_MQ_INSERT_ORDERED)
+ blk_mq_insert_ordered(rq, &ctx->rq_lists[hctx->type]);
else
list_add_tail(&rq->queuelist,
&ctx->rq_lists[hctx->type]);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 52e907547b55..75b657554f1f 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -40,8 +40,10 @@ enum {
typedef unsigned int __bitwise blk_insert_t;
#define BLK_MQ_INSERT_AT_HEAD ((__force blk_insert_t)0x01)
+#define BLK_MQ_INSERT_ORDERED ((__force blk_insert_t)0x02)
void blk_mq_submit_bio(struct bio *bio);
+void blk_mq_insert_ordered(struct request *rq, struct list_head *list);
int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, struct io_comp_batch *iob,
unsigned int flags);
void blk_mq_exit_queue(struct request_queue *q);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 4dba8405bd01..051c05ceafd7 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -603,6 +603,8 @@ static void kyber_insert_requests(struct blk_mq_hw_ctx *hctx,
trace_block_rq_insert(rq);
if (flags & BLK_MQ_INSERT_AT_HEAD)
list_move(&rq->queuelist, head);
+ else if (flags & BLK_MQ_INSERT_ORDERED)
+ blk_mq_insert_ordered(rq, head);
else
list_move_tail(&rq->queuelist, head);
sbitmap_set_bit(&khd->kcq_map[sched_domain],
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 2edf1cac06d5..110fef65b829 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -710,7 +710,12 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
* set expire time and add to fifo list
*/
rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
- list_add_tail(&rq->queuelist, &per_prio->fifo_list[data_dir]);
+ if (flags & BLK_MQ_INSERT_ORDERED)
+ blk_mq_insert_ordered(rq,
+ &per_prio->fifo_list[data_dir]);
+ else
+ list_add_tail(&rq->queuelist,
+ &per_prio->fifo_list[data_dir]);
}
}
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index de8c85a03bb7..6833d5a980ef 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -86,7 +86,7 @@ enum rqf_flags {
/* flags that prevent us from merging requests: */
#define RQF_NOMERGE_FLAGS \
- (RQF_STARTED | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
+ (RQF_STARTED | RQF_FLUSH_SEQ | RQF_DONTPREP | RQF_SPECIAL_PAYLOAD)
enum mq_rq_state {
MQ_RQ_IDLE = 0,
@@ -1189,4 +1189,15 @@ static inline int blk_rq_map_sg(struct request *rq, struct scatterlist *sglist)
}
void blk_dump_rq_flags(struct request *, char *);
+static inline bool blk_rq_is_seq_zoned_write(struct request *rq)
+{
+ switch (req_op(rq)) {
+ case REQ_OP_WRITE:
+ case REQ_OP_WRITE_ZEROES:
+ return bdev_zone_is_seq(rq->q->disk->part0, blk_rq_pos(rq));
+ default:
+ return false;
+ }
+}
+
#endif /* BLK_MQ_H */
* [PATCH v19 04/11] blk-zoned: Add an argument to blk_zone_plug_bio()
2025-06-30 22:29 [PATCH v19 00/11] Improve write performance for zoned UFS devices Bart Van Assche
` (2 preceding siblings ...)
2025-06-30 22:29 ` [PATCH v19 03/11] blk-mq: Restore the zone write order when requeuing Bart Van Assche
@ 2025-06-30 22:29 ` Bart Van Assche
2025-06-30 22:29 ` [PATCH v19 05/11] blk-zoned: Support pipelining of zoned writes Bart Van Assche
` (6 subsequent siblings)
10 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2025-06-30 22:29 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Zoned writes may be submitted from multiple CPUs as long as they are
serialized per zone. Prepare for preserving the order of pipelined zoned
writes per zone by adding the 'from_cpu' argument to blk_zone_plug_bio().
This argument will be used to preserve the write order per zone by submitting
all zoned writes for a zone from the same CPU as long as any zoned writes are
pending for that zone.
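A sketch of the resulting calling convention, matching the blk-mq hunk below:
the caller initializes from_cpu to -1 and blk_zone_plug_bio() may overwrite it:

	int from_cpu = -1;

	if (bio_needs_zone_write_plugging(bio) &&
	    blk_zone_plug_bio(bio, nr_segs, &from_cpu))
		goto queue_exit;	/* the BIO has been plugged */

	/* If from_cpu >= 0, allocate the request from that CPU's queue. */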
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 7 +++----
block/blk-zoned.c | 4 +++-
drivers/md/dm.c | 5 ++---
include/linux/blkdev.h | 5 +++--
4 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7935f6adda71..34304b6ac29e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3195,10 +3195,9 @@ void blk_mq_submit_bio(struct bio *bio)
if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
goto queue_exit;
- if (bio_needs_zone_write_plugging(bio)) {
- if (blk_zone_plug_bio(bio, nr_segs))
- goto queue_exit;
- }
+ if (bio_needs_zone_write_plugging(bio) &&
+ blk_zone_plug_bio(bio, nr_segs, &from_cpu))
+ goto queue_exit;
new_request:
if (rq && (from_cpu == -1 || from_cpu == rq->mq_ctx->cpu)) {
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index efe71b1a1da1..91a59bf54547 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1104,6 +1104,8 @@ static void blk_zone_wplug_handle_native_zone_append(struct bio *bio)
* blk_zone_plug_bio - Handle a zone write BIO with zone write plugging
* @bio: The BIO being submitted
* @nr_segs: The number of physical segments of @bio
+ * @from_cpu: [out] CPU of the software queue to which the bio should be
+ * queued
*
* Handle write, write zeroes and zone append operations requiring emulation
* using zone write plugging.
@@ -1112,7 +1114,7 @@ static void blk_zone_wplug_handle_native_zone_append(struct bio *bio)
* write plug. Otherwise, return false to let the submission path process
* @bio normally.
*/
-bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
+bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int *from_cpu)
{
struct block_device *bdev = bio->bi_bdev;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ca889328fdfe..76c5eada7d03 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1814,9 +1814,8 @@ static inline bool dm_zone_bio_needs_split(struct mapped_device *md,
static inline bool dm_zone_plug_bio(struct mapped_device *md, struct bio *bio)
{
- if (!bio_needs_zone_write_plugging(bio))
- return false;
- return blk_zone_plug_bio(bio, 0);
+ return bio_needs_zone_write_plugging(bio) &&
+ blk_zone_plug_bio(bio, 0, NULL);
}
static blk_status_t __send_zone_reset_all_emulated(struct clone_info *ci,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index aa1990d94130..46fe8d97b31f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -891,7 +891,7 @@ static inline bool bio_needs_zone_write_plugging(struct bio *bio)
}
}
-bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs);
+bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int *from_cpu);
/**
* disk_zone_capacity - returns the zone capacity of zone containing @sector
@@ -926,7 +926,8 @@ static inline bool bio_needs_zone_write_plugging(struct bio *bio)
return false;
}
-static inline bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
+static inline bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs,
+ int *from_cpu)
{
return false;
}
* [PATCH v19 05/11] blk-zoned: Support pipelining of zoned writes
2025-06-30 22:29 [PATCH v19 00/11] Improve write performance for zoned UFS devices Bart Van Assche
` (3 preceding siblings ...)
2025-06-30 22:29 ` [PATCH v19 04/11] blk-zoned: Add an argument to blk_zone_plug_bio() Bart Van Assche
@ 2025-06-30 22:29 ` Bart Van Assche
2025-06-30 22:29 ` [PATCH v19 06/11] null_blk: Add the preserves_write_order attribute Bart Van Assche
` (5 subsequent siblings)
10 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2025-06-30 22:29 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Support pipelining of zoned writes if the block driver preserves the write
order per hardware queue. Track, per zone, the software queue to which writes
have been queued. If zoned writes are pipelined, submit new writes to the same
software queue as the writes that are already in progress. This prevents the
reordering that could result from submitting requests for the same zone to
different software or hardware queues.
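In pseudocode, the per-zone dispatch decision added by this patch looks as
follows (a summary of the blk_zone_wplug_handle_write() hunk below):

	/*
	 * if the driver preserves the write order per hardware queue:
	 *     record the CPU of the first write in zwplug->from_cpu and
	 *     steer all later writes for the zone to that CPU; reset
	 *     from_cpu to -1 once no writes are pending for the zone
	 * else:
	 *     classic zone write plugging: at most one write in flight
	 *     per zone (BLK_ZONE_WPLUG_PLUGGED)
	 */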
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-mq.c | 4 ++--
block/blk-zoned.c | 30 +++++++++++++++++++++++-------
2 files changed, 25 insertions(+), 9 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 34304b6ac29e..c8734539606f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3150,8 +3150,8 @@ void blk_mq_submit_bio(struct bio *bio)
/*
* A BIO that was released from a zone write plug has already been
* through the preparation in this function, already holds a reference
- * on the queue usage counter, and is the only write BIO in-flight for
- * the target zone. Go straight to preparing a request for it.
+ * on the queue usage counter. Go straight to preparing a request for
+ * it.
*/
if (bio_zone_write_plugging(bio)) {
nr_segs = bio->__bi_nr_segments;
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 91a59bf54547..04a852058155 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -51,6 +51,8 @@ static const char *const zone_cond_name[] = {
* @zone_no: The number of the zone the plug is managing.
* @wp_offset: The zone write pointer location relative to the start of the zone
* as a number of 512B sectors.
+ * @from_cpu: Software queue to submit writes from for drivers that preserve
+ * the write order.
* @bio_list: The list of BIOs that are currently plugged.
* @bio_work: Work struct to handle issuing of plugged BIOs
* @rcu_head: RCU head to free zone write plugs with an RCU grace period.
@@ -63,6 +65,7 @@ struct blk_zone_wplug {
unsigned int flags;
unsigned int zone_no;
unsigned int wp_offset;
+ int from_cpu;
struct bio_list bio_list;
struct work_struct bio_work;
struct rcu_head rcu_head;
@@ -72,8 +75,7 @@ struct blk_zone_wplug {
/*
* Zone write plug flags bits:
* - BLK_ZONE_WPLUG_PLUGGED: Indicates that the zone write plug is plugged,
- * that is, that write BIOs are being throttled due to a write BIO already
- * being executed or the zone write plug bio list is not empty.
+ * that is, that write BIOs are being throttled.
* - BLK_ZONE_WPLUG_NEED_WP_UPDATE: Indicates that we lost track of a zone
* write pointer offset and need to update it.
* - BLK_ZONE_WPLUG_UNHASHED: Indicates that the zone write plug was removed
@@ -568,6 +570,7 @@ static struct blk_zone_wplug *disk_get_and_lock_zone_wplug(struct gendisk *disk,
zwplug->flags = 0;
zwplug->zone_no = zno;
zwplug->wp_offset = bdev_offset_from_zone_start(disk->part0, sector);
+ zwplug->from_cpu = -1;
bio_list_init(&zwplug->bio_list);
INIT_WORK(&zwplug->bio_work, blk_zone_wplug_bio_work);
zwplug->disk = disk;
@@ -983,7 +986,8 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
return true;
}
-static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
+static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs,
+ int *from_cpu)
{
struct gendisk *disk = bio->bi_bdev->bd_disk;
sector_t sector = bio->bi_iter.bi_sector;
@@ -1045,7 +1049,13 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
return true;
}
- zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+ if (disk->queue->limits.driver_preserves_write_order) {
+ if (zwplug->from_cpu < 0)
+ zwplug->from_cpu = raw_smp_processor_id();
+ *from_cpu = zwplug->from_cpu;
+ } else {
+ zwplug->flags |= BLK_ZONE_WPLUG_PLUGGED;
+ }
spin_unlock_irqrestore(&zwplug->lock, flags);
@@ -1147,7 +1157,7 @@ bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs, int *from_cpu)
fallthrough;
case REQ_OP_WRITE:
case REQ_OP_WRITE_ZEROES:
- return blk_zone_wplug_handle_write(bio, nr_segs);
+ return blk_zone_wplug_handle_write(bio, nr_segs, from_cpu);
case REQ_OP_ZONE_RESET:
return blk_zone_wplug_handle_reset_or_finish(bio, 0);
case REQ_OP_ZONE_FINISH:
@@ -1179,6 +1189,9 @@ static void disk_zone_wplug_unplug_bio(struct gendisk *disk,
zwplug->flags &= ~BLK_ZONE_WPLUG_PLUGGED;
+ if (refcount_read(&zwplug->ref) == 2)
+ zwplug->from_cpu = -1;
+
/*
* If the zone is full (it was fully written or finished, or empty
* (it was reset), remove its zone write plug from the hash table.
@@ -1830,6 +1843,7 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
unsigned int zwp_zone_no, zwp_ref;
unsigned int zwp_bio_list_size;
unsigned long flags;
+ int from_cpu;
spin_lock_irqsave(&zwplug->lock, flags);
zwp_zone_no = zwplug->zone_no;
@@ -1837,10 +1851,12 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
zwp_ref = refcount_read(&zwplug->ref);
zwp_wp_offset = zwplug->wp_offset;
zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
+ from_cpu = zwplug->from_cpu;
spin_unlock_irqrestore(&zwplug->lock, flags);
- seq_printf(m, "%u 0x%x %u %u %u\n", zwp_zone_no, zwp_flags, zwp_ref,
- zwp_wp_offset, zwp_bio_list_size);
+ seq_printf(m, "zone_no %u flags 0x%x ref %u wp_offset %u bio_list_size %u from_cpu %d\n",
+ zwp_zone_no, zwp_flags, zwp_ref, zwp_wp_offset,
+ zwp_bio_list_size, from_cpu);
}
int queue_zone_wplugs_show(void *data, struct seq_file *m)
* [PATCH v19 06/11] null_blk: Add the preserves_write_order attribute
2025-06-30 22:29 [PATCH v19 00/11] Improve write performance for zoned UFS devices Bart Van Assche
` (4 preceding siblings ...)
2025-06-30 22:29 ` [PATCH v19 05/11] blk-zoned: Support pipelining of zoned writes Bart Van Assche
@ 2025-06-30 22:29 ` Bart Van Assche
2025-06-30 22:29 ` [PATCH v19 07/11] scsi: core: Retry unaligned zoned writes Bart Van Assche
` (4 subsequent siblings)
10 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2025-06-30 22:29 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche
Support configuring the preserves_write_order attribute in the null_blk
driver to make it easier to test support for this attribute.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/block/null_blk/main.c | 3 +++
drivers/block/null_blk/null_blk.h | 1 +
2 files changed, 4 insertions(+)
diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index aa163ae9b2aa..45eaa6f0e983 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -475,6 +475,7 @@ NULLB_DEVICE_ATTR(fua, bool, NULL);
NULLB_DEVICE_ATTR(rotational, bool, NULL);
NULLB_DEVICE_ATTR(badblocks_once, bool, NULL);
NULLB_DEVICE_ATTR(badblocks_partial_io, bool, NULL);
+NULLB_DEVICE_ATTR(preserves_write_order, bool, NULL);
static ssize_t nullb_device_power_show(struct config_item *item, char *page)
{
@@ -613,6 +614,7 @@ static struct configfs_attribute *nullb_device_attrs[] = {
&nullb_device_attr_no_sched,
&nullb_device_attr_poll_queues,
&nullb_device_attr_power,
+ &nullb_device_attr_preserves_write_order,
&nullb_device_attr_queue_mode,
&nullb_device_attr_rotational,
&nullb_device_attr_shared_tag_bitmap,
@@ -1979,6 +1981,7 @@ static int null_add_dev(struct nullb_device *dev)
if (dev->virt_boundary)
lim.virt_boundary_mask = PAGE_SIZE - 1;
null_config_discard(nullb, &lim);
+ lim.driver_preserves_write_order = dev->preserves_write_order;
if (dev->zoned) {
rv = null_init_zoned_dev(dev, &lim);
if (rv)
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 7bb6128dbaaf..08b732cf853f 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -110,6 +110,7 @@ struct nullb_device {
bool shared_tag_bitmap; /* use hostwide shared tags */
bool fua; /* Support FUA */
bool rotational; /* Fake rotational device */
+ bool preserves_write_order;
};
struct nullb {
* [PATCH v19 07/11] scsi: core: Retry unaligned zoned writes
2025-06-30 22:29 [PATCH v19 00/11] Improve write performance for zoned UFS devices Bart Van Assche
` (5 preceding siblings ...)
2025-06-30 22:29 ` [PATCH v19 06/11] null_blk: Add the preserves_write_order attribute Bart Van Assche
@ 2025-06-30 22:29 ` Bart Van Assche
2025-06-30 22:30 ` [PATCH v19 08/11] scsi: sd: Increase retry count for " Bart Van Assche
` (3 subsequent siblings)
10 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2025-06-30 22:29 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche, Martin K. Petersen, Ming Lei
If zoned writes (REQ_OP_WRITE) for a sequential write required zone have
a starting LBA that differs from the write pointer, e.g. because a prior
write triggered a unit attention condition, then the storage device will
respond with an UNALIGNED WRITE COMMAND error. Retry commands that failed
with an unaligned write error.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/scsi_error.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 746ff6a1f309..d742fa67fe0e 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -713,6 +713,22 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
fallthrough;
case ILLEGAL_REQUEST:
+ /*
+ * Unaligned write command. This may indicate that zoned writes
+ * have been received by the device in the wrong order. If write
+ * pipelining is enabled, retry.
+ */
+ if (sshdr.asc == 0x21 && sshdr.ascq == 0x04 &&
+ req->q->limits.driver_preserves_write_order &&
+ blk_rq_is_seq_zoned_write(req) &&
+ scsi_cmd_retry_allowed(scmd)) {
+ SCSI_LOG_ERROR_RECOVERY(1,
+ sdev_printk(KERN_WARNING, scmd->device,
+ "Retrying unaligned write at LBA %#llx.\n",
+ scsi_get_lba(scmd)));
+ return NEEDS_RETRY;
+ }
+
if (sshdr.asc == 0x20 || /* Invalid command operation code */
sshdr.asc == 0x21 || /* Logical block address out of range */
sshdr.asc == 0x22 || /* Invalid function */
* [PATCH v19 08/11] scsi: sd: Increase retry count for zoned writes
2025-06-30 22:29 [PATCH v19 00/11] Improve write performance for zoned UFS devices Bart Van Assche
` (6 preceding siblings ...)
2025-06-30 22:29 ` [PATCH v19 07/11] scsi: core: Retry unaligned zoned writes Bart Van Assche
@ 2025-06-30 22:30 ` Bart Van Assche
2025-06-30 22:30 ` [PATCH v19 09/11] scsi: scsi_debug: Add the preserves_write_order module parameter Bart Van Assche
` (2 subsequent siblings)
10 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2025-06-30 22:30 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche, Martin K. Petersen, Ming Lei
If the write order is preserved, increase the number of retries for write
commands sent to a sequential zone to the maximum number of outstanding
commands, because in the worst case the number of times reordered zoned
writes have to be retried is (number of outstanding writes per sequential
zone) - 1. For example, with 32 writes outstanding for one zone, a reordered
write may fail with an unaligned write error up to 31 times before the write
pointer has caught up.
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/sd.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 3f6e87705b62..75ff6af37ecc 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1404,6 +1404,13 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
cmd->transfersize = sdp->sector_size;
cmd->underflow = nr_blocks << 9;
cmd->allowed = sdkp->max_retries;
+ /*
+ * Increase the number of allowed retries for zoned writes if the driver
+ * preserves the command order.
+ */
+ if (rq->q->limits.driver_preserves_write_order &&
+ blk_rq_is_seq_zoned_write(rq))
+ cmd->allowed += rq->q->nr_requests;
cmd->sdb.length = nr_blocks * sdp->sector_size;
SCSI_LOG_HLQUEUE(1,
* [PATCH v19 09/11] scsi: scsi_debug: Add the preserves_write_order module parameter
2025-06-30 22:29 [PATCH v19 00/11] Improve write performance for zoned UFS devices Bart Van Assche
` (7 preceding siblings ...)
2025-06-30 22:30 ` [PATCH v19 08/11] scsi: sd: Increase retry count for " Bart Van Assche
@ 2025-06-30 22:30 ` Bart Van Assche
2025-06-30 22:30 ` [PATCH v19 10/11] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
2025-06-30 22:30 ` [PATCH v19 11/11] scsi: ufs: Inform the block layer about write ordering Bart Van Assche
10 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2025-06-30 22:30 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche, Douglas Gilbert, Martin K. Petersen, Ming Lei
Zone write locking is not used for zoned devices if the block driver reports
that it preserves the order of write commands. Make it easier to test the
code paths that do not use zone write locking by adding support for setting
the driver_preserves_write_order flag.
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/scsi_debug.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index aef33d1e346a..5575bc8a833d 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -1004,6 +1004,7 @@ static int dix_reads;
static int dif_errors;
/* ZBC global data */
+static bool sdeb_preserves_write_order;
static bool sdeb_zbc_in_use; /* true for host-aware and host-managed disks */
static int sdeb_zbc_zone_cap_mb;
static int sdeb_zbc_zone_size_mb;
@@ -6607,10 +6608,14 @@ static struct sdebug_dev_info *find_build_dev_info(struct scsi_device *sdev)
static int scsi_debug_sdev_init(struct scsi_device *sdp)
{
+ struct request_queue *q = sdp->request_queue;
+
if (sdebug_verbose)
pr_info("sdev_init <%u %u %u %llu>\n",
sdp->host->host_no, sdp->channel, sdp->id, sdp->lun);
+ q->limits.driver_preserves_write_order = sdeb_preserves_write_order;
+
return 0;
}
@@ -7339,6 +7344,8 @@ module_param_named(statistics, sdebug_statistics, bool, S_IRUGO | S_IWUSR);
module_param_named(strict, sdebug_strict, bool, S_IRUGO | S_IWUSR);
module_param_named(submit_queues, submit_queues, int, S_IRUGO);
module_param_named(poll_queues, poll_queues, int, S_IRUGO);
+module_param_named(preserves_write_order, sdeb_preserves_write_order, bool,
+ S_IRUGO);
module_param_named(tur_ms_to_ready, sdeb_tur_ms_to_ready, int, S_IRUGO);
module_param_named(unmap_alignment, sdebug_unmap_alignment, int, S_IRUGO);
module_param_named(unmap_granularity, sdebug_unmap_granularity, int, S_IRUGO);
@@ -7411,6 +7418,8 @@ MODULE_PARM_DESC(opts, "1->noise, 2->medium_err, 4->timeout, 8->recovered_err...
MODULE_PARM_DESC(per_host_store, "If set, next positive add_host will get new store (def=0)");
MODULE_PARM_DESC(physblk_exp, "physical block exponent (def=0)");
MODULE_PARM_DESC(poll_queues, "support for iouring iopoll queues (1 to max(submit_queues - 1))");
+MODULE_PARM_DESC(preserves_write_order,
+ "Whether or not to inform the block layer that this driver preserves the order of WRITE commands (def=0)");
MODULE_PARM_DESC(ptype, "SCSI peripheral type(def=0[disk])");
MODULE_PARM_DESC(random, "If set, uniformly randomize command duration between 0 and delay_in_ns");
MODULE_PARM_DESC(removable, "claim to have removable media (def=0)");
* [PATCH v19 10/11] scsi: scsi_debug: Support injecting unaligned write errors
2025-06-30 22:29 [PATCH v19 00/11] Improve write performance for zoned UFS devices Bart Van Assche
` (8 preceding siblings ...)
2025-06-30 22:30 ` [PATCH v19 09/11] scsi: scsi_debug: Add the preserves_write_order module parameter Bart Van Assche
@ 2025-06-30 22:30 ` Bart Van Assche
2025-06-30 22:30 ` [PATCH v19 11/11] scsi: ufs: Inform the block layer about write ordering Bart Van Assche
10 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2025-06-30 22:30 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche, Douglas Gilbert, Martin K. Petersen, Ming Lei
Allow user space software, e.g. a blktests test, to inject unaligned
write errors.
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/scsi/scsi_debug.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 5575bc8a833d..963246f5894a 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -230,6 +230,7 @@ struct tape_block {
#define SDEBUG_OPT_NO_CDB_NOISE 0x4000
#define SDEBUG_OPT_HOST_BUSY 0x8000
#define SDEBUG_OPT_CMD_ABORT 0x10000
+#define SDEBUG_OPT_UNALIGNED_WRITE 0x20000
#define SDEBUG_OPT_ALL_NOISE (SDEBUG_OPT_NOISE | SDEBUG_OPT_Q_NOISE | \
SDEBUG_OPT_RESET_NOISE)
#define SDEBUG_OPT_ALL_INJECTING (SDEBUG_OPT_RECOVERED_ERR | \
@@ -237,7 +238,8 @@ struct tape_block {
SDEBUG_OPT_DIF_ERR | SDEBUG_OPT_DIX_ERR | \
SDEBUG_OPT_SHORT_TRANSFER | \
SDEBUG_OPT_HOST_BUSY | \
- SDEBUG_OPT_CMD_ABORT)
+ SDEBUG_OPT_CMD_ABORT | \
+ SDEBUG_OPT_UNALIGNED_WRITE)
#define SDEBUG_OPT_RECOV_DIF_DIX (SDEBUG_OPT_RECOVERED_ERR | \
SDEBUG_OPT_DIF_ERR | SDEBUG_OPT_DIX_ERR)
@@ -4915,6 +4917,14 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
u8 *cmd = scp->cmnd;
bool meta_data_locked = false;
+ if (unlikely(sdebug_opts & SDEBUG_OPT_UNALIGNED_WRITE &&
+ atomic_read(&sdeb_inject_pending))) {
+ atomic_set(&sdeb_inject_pending, 0);
+ mk_sense_buffer(scp, ILLEGAL_REQUEST, LBA_OUT_OF_RANGE,
+ UNALIGNED_WRITE_ASCQ);
+ return check_condition_result;
+ }
+
switch (cmd[0]) {
case WRITE_16:
ei_lba = 0;
* [PATCH v19 11/11] scsi: ufs: Inform the block layer about write ordering
2025-06-30 22:29 [PATCH v19 00/11] Improve write performance for zoned UFS devices Bart Van Assche
` (9 preceding siblings ...)
2025-06-30 22:30 ` [PATCH v19 10/11] scsi: scsi_debug: Support injecting unaligned write errors Bart Van Assche
@ 2025-06-30 22:30 ` Bart Van Assche
10 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2025-06-30 22:30 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, linux-scsi, Christoph Hellwig, Damien Le Moal,
Bart Van Assche, Avri Altman, Bao D. Nguyen, Can Guo,
Martin K. Petersen
From the UFSHCI 4.0 specification, about the MCQ mode:
"Command Submission
1. Host SW writes an Entry to SQ
2. Host SW updates SQ doorbell tail pointer
Command Processing
3. After fetching the Entry, Host Controller updates SQ doorbell head
pointer
4. Host controller sends COMMAND UPIU to UFS device"
In other words, in MCQ mode, UFS controllers are required to forward
commands to the UFS device in the order these commands have been
received from the host.
This patch improves performance as follows on a test setup with a UFSHCI
4.0 controller:
- When not using an I/O scheduler: 2.3x more IOPS for small writes.
- With the mq-deadline scheduler: 2.0x more IOPS for small writes.
Reviewed-by: Avri Altman <avri.altman@wdc.com>
Cc: Bao D. Nguyen <quic_nguyenb@quicinc.com>
Cc: Can Guo <quic_cang@quicinc.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
drivers/ufs/core/ufshcd.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index 50adfb8b335b..1f2fc47ca7b4 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -5281,6 +5281,12 @@ static int ufshcd_sdev_configure(struct scsi_device *sdev,
struct ufs_hba *hba = shost_priv(sdev->host);
struct request_queue *q = sdev->request_queue;
+ /*
+ * The write order is preserved per MCQ. Without MCQ, auto-hibernation
+ * may cause write reordering that results in unaligned write errors.
+ */
+ lim->driver_preserves_write_order = hba->mcq_enabled;
+
lim->dma_pad_mask = PRDT_DATA_BYTE_COUNT_PAD - 1;
/*