* [PATCH v2 0/2] blk-mq: fix update nr_requests regressions
@ 2025-08-19 1:29 Yu Kuai
2025-08-19 1:29 ` [PATCH v2 1/2] blk-mq: fix elevator depth_updated method Yu Kuai
2025-08-19 1:29 ` [PATCH v2 2/2] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
0 siblings, 2 replies; 8+ messages in thread
From: Yu Kuai @ 2025-08-19 1:29 UTC (permalink / raw)
To: yukuai3, axboe, bvanassche, ming.lei, hare, nilay
Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
Changes in v2:
- instead of refactoring, cleaning up, and fixing the nr_requests update
path thoroughly, fix the regression in patch 2 the easy way, and delay
the refactoring and cleanups to the next merge window.
patch 1 fixes the regression that elevator async_depth is not updated
correctly if nr_requests changes, first from the error path, then for
mq-deadline, and recently for bfq and kyber.
patch 2 fixes the regression that the kernel will panic due to a tags
double free if nr_requests grows.
Yu Kuai (2):
blk-mq: fix elevator depth_updated method
blk-mq: fix blk_mq_tags double free while nr_requests grown
block/bfq-iosched.c | 21 ++++-----------------
block/blk-mq-sched.c | 3 +++
block/blk-mq-sched.h | 11 +++++++++++
block/blk-mq-tag.c | 1 +
block/blk-mq.c | 23 ++++++++++++-----------
block/elevator.h | 2 +-
block/kyber-iosched.c | 10 ++++------
block/mq-deadline.c | 15 ++-------------
8 files changed, 38 insertions(+), 48 deletions(-)
--
2.39.2
* [PATCH v2 1/2] blk-mq: fix elevator depth_updated method
2025-08-19 1:29 [PATCH v2 0/2] blk-mq: fix update nr_requests regressions Yu Kuai
@ 2025-08-19 1:29 ` Yu Kuai
2025-08-19 12:20 ` Nilay Shroff
2025-08-19 1:29 ` [PATCH v2 2/2] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
1 sibling, 1 reply; 8+ messages in thread
From: Yu Kuai @ 2025-08-19 1:29 UTC (permalink / raw)
To: yukuai3, axboe, bvanassche, ming.lei, hare, nilay
Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
Current depth_updated() has some problems:
1) depth_updated() is called for each hctx, while all elevators update
async_depth at the disk level, which is not related to any hctx;
2) In blk_mq_update_nr_requests(), if a previous hctx update succeeded
and the current hctx update failed, q->nr_requests will not be updated,
while async_depth has already been updated with the new nr_requests by
the previous depth_updated() calls;
3) All elevators now use q->nr_requests to calculate async_depth,
however, q->nr_requests still holds the old value when depth_updated()
is called from blk_mq_update_nr_requests();
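To make problems 2) and 3) concrete, here is a minimal user-space model
of the old ordering (hypothetical names, for illustration only, not
kernel code):

#include <stdio.h>

static unsigned int nr_requests = 64;   /* models q->nr_requests */
static unsigned int async_depth;        /* models e.g. dd->async_depth */

/* models depth_updated(): derives async_depth from nr_requests */
static void depth_updated(void)
{
        async_depth = nr_requests * 3 / 4;
}

int main(void)
{
        /* old flow: depth_updated() runs before the new value is stored */
        depth_updated();                        /* still computes from 64 */
        nr_requests = 128;
        printf("stale async_depth=%u\n", async_depth);  /* 48, not 96 */

        /* fixed flow: update nr_requests first, then call depth_updated() */
        nr_requests = 256;
        depth_updated();
        printf("fresh async_depth=%u\n", async_depth);  /* 192 */
        return 0;
}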
Fix those problems by:
- pass in request_queue instead of hctx;
- move depth_updated() after q->nr_requests is updated in
blk_mq_update_nr_requests();
- add depth_updated() call in blk_mq_init_sched();
- remove init_hctx() method for mq-deadline and bfq that is useless now;
Fixes: 77f1e0a52d26 ("bfq: update internal depth state when queue depth changes")
Fixes: 39823b47bbd4 ("block/mq-deadline: Fix the tag reservation code")
Fixes: 42e6c6ce03fd ("lib/sbitmap: convert shallow_depth from one word to the whole sbitmap")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/bfq-iosched.c | 21 ++++-----------------
block/blk-mq-sched.c | 3 +++
block/blk-mq-sched.h | 11 +++++++++++
block/blk-mq.c | 23 ++++++++++++-----------
block/elevator.h | 2 +-
block/kyber-iosched.c | 10 ++++------
block/mq-deadline.c | 15 ++-------------
7 files changed, 37 insertions(+), 48 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 50e51047e1fe..c0c398998aa1 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7109,9 +7109,10 @@ void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
* See the comments on bfq_limit_depth for the purpose of
* the depths set in the function. Return minimum shallow depth we'll use.
*/
-static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
+static void bfq_depth_updated(struct request_queue *q)
{
- unsigned int nr_requests = bfqd->queue->nr_requests;
+ struct bfq_data *bfqd = q->elevator->elevator_data;
+ unsigned int nr_requests = q->nr_requests;
/*
* In-word depths if no bfq_queue is being weight-raised:
@@ -7143,21 +7144,8 @@ static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
bfqd->async_depths[1][0] = max((nr_requests * 3) >> 4, 1U);
/* no more than ~37% of tags for sync writes (~20% extra tags) */
bfqd->async_depths[1][1] = max((nr_requests * 6) >> 4, 1U);
-}
-
-static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
-{
- struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
- struct blk_mq_tags *tags = hctx->sched_tags;
- bfq_update_depths(bfqd, &tags->bitmap_tags);
- sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, 1);
-}
-
-static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
-{
- bfq_depth_updated(hctx);
- return 0;
+ blk_mq_set_min_shallow_depth(q, 1);
}
static void bfq_exit_queue(struct elevator_queue *e)
@@ -7628,7 +7616,6 @@ static struct elevator_type iosched_bfq_mq = {
.request_merged = bfq_request_merged,
.has_work = bfq_has_work,
.depth_updated = bfq_depth_updated,
- .init_hctx = bfq_init_hctx,
.init_sched = bfq_init_queue,
.exit_sched = bfq_exit_queue,
},
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index e2ce4a28e6c9..bf7dd97422ec 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -585,6 +585,9 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
}
}
}
+
+ if (e->ops.depth_updated)
+ e->ops.depth_updated(q);
return 0;
out:
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index b554e1d55950..fe83187f41db 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -92,4 +92,15 @@ static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
}
+static inline void blk_mq_set_min_shallow_depth(struct request_queue *q,
+ unsigned int depth)
+{
+ struct blk_mq_hw_ctx *hctx;
+ unsigned long i;
+
+ queue_for_each_hw_ctx(q, hctx, i)
+ sbitmap_queue_min_shallow_depth(&hctx->sched_tags->bitmap_tags,
+ depth);
+}
+
#endif
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b67d6c02eceb..9c68749124c6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4951,20 +4951,21 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
false);
}
if (ret)
- break;
- if (q->elevator && q->elevator->type->ops.depth_updated)
- q->elevator->type->ops.depth_updated(hctx);
+ goto out;
}
- if (!ret) {
- q->nr_requests = nr;
- if (blk_mq_is_shared_tags(set->flags)) {
- if (q->elevator)
- blk_mq_tag_update_sched_shared_tags(q);
- else
- blk_mq_tag_resize_shared_tags(set, nr);
- }
+
+ q->nr_requests = nr;
+ if (q->elevator && q->elevator->type->ops.depth_updated)
+ q->elevator->type->ops.depth_updated(q);
+
+ if (blk_mq_is_shared_tags(set->flags)) {
+ if (q->elevator)
+ blk_mq_tag_update_sched_shared_tags(q);
+ else
+ blk_mq_tag_resize_shared_tags(set, nr);
}
+out:
blk_mq_unquiesce_queue(q);
return ret;
diff --git a/block/elevator.h b/block/elevator.h
index adc5c157e17e..c4d20155065e 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -37,7 +37,7 @@ struct elevator_mq_ops {
void (*exit_sched)(struct elevator_queue *);
int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
- void (*depth_updated)(struct blk_mq_hw_ctx *);
+ void (*depth_updated)(struct request_queue *);
bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
bool (*bio_merge)(struct request_queue *, struct bio *, unsigned int);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 70cbc7b2deb4..49ae52aa20d9 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -440,13 +440,12 @@ static void kyber_ctx_queue_init(struct kyber_ctx_queue *kcq)
INIT_LIST_HEAD(&kcq->rq_list[i]);
}
-static void kyber_depth_updated(struct blk_mq_hw_ctx *hctx)
+static void kyber_depth_updated(struct request_queue *q)
{
- struct kyber_queue_data *kqd = hctx->queue->elevator->elevator_data;
- struct blk_mq_tags *tags = hctx->sched_tags;
+ struct kyber_queue_data *kqd = q->elevator->elevator_data;
- kqd->async_depth = hctx->queue->nr_requests * KYBER_ASYNC_PERCENT / 100U;
- sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, kqd->async_depth);
+ kqd->async_depth = q->nr_requests * KYBER_ASYNC_PERCENT / 100U;
+ blk_mq_set_min_shallow_depth(q, kqd->async_depth);
}
static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
@@ -493,7 +492,6 @@ static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
khd->batching = 0;
hctx->sched_data = khd;
- kyber_depth_updated(hctx);
return 0;
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b9b7cdf1d3c9..578bc79c5654 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -507,22 +507,12 @@ static void dd_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
}
/* Called by blk_mq_update_nr_requests(). */
-static void dd_depth_updated(struct blk_mq_hw_ctx *hctx)
+static void dd_depth_updated(struct request_queue *q)
{
- struct request_queue *q = hctx->queue;
struct deadline_data *dd = q->elevator->elevator_data;
- struct blk_mq_tags *tags = hctx->sched_tags;
dd->async_depth = q->nr_requests;
-
- sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, 1);
-}
-
-/* Called by blk_mq_init_hctx() and blk_mq_init_sched(). */
-static int dd_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
-{
- dd_depth_updated(hctx);
- return 0;
+ blk_mq_set_min_shallow_depth(q, 1);
}
static void dd_exit_sched(struct elevator_queue *e)
@@ -1048,7 +1038,6 @@ static struct elevator_type mq_deadline = {
.has_work = dd_has_work,
.init_sched = dd_init_sched,
.exit_sched = dd_exit_sched,
- .init_hctx = dd_init_hctx,
},
#ifdef CONFIG_BLK_DEBUG_FS
--
2.39.2
* [PATCH v2 2/2] blk-mq: fix blk_mq_tags double free while nr_requests grown
2025-08-19 1:29 [PATCH v2 0/2] blk-mq: fix update nr_requests regressions Yu Kuai
2025-08-19 1:29 ` [PATCH v2 1/2] blk-mq: fix elevator depth_updated method Yu Kuai
@ 2025-08-19 1:29 ` Yu Kuai
2025-08-19 10:47 ` Ming Lei
2025-08-19 12:45 ` Nilay Shroff
1 sibling, 2 replies; 8+ messages in thread
From: Yu Kuai @ 2025-08-19 1:29 UTC (permalink / raw)
To: yukuai3, axboe, bvanassche, ming.lei, hare, nilay
Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
In case the user triggers tags growth through the queue sysfs attribute
nr_requests, hctx->sched_tags will be freed directly and replaced with
newly allocated tags, see blk_mq_tag_update_depth().
The problem is that hctx->sched_tags is from elevator->et->tags, while
et->tags is still the freed tags, hence later elevator exist will try to
free the tags again, causing kernel panic.
Fix this problem by replacing et->tags will new allocated tags as well.
Note that there are still some long-term problems that will require some
refactoring to be fixed thoroughly[1].
[1] https://lore.kernel.org/all/20250815080216.410665-1-yukuai1@huaweicloud.com/
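The aliasing can be modeled in user space as follows (a minimal sketch
with hypothetical names, not the actual kernel data structures):

#include <stdlib.h>

struct tags { unsigned int nr; };

int main(void)
{
        /* elevator->et->tags[i] and hctx->sched_tags alias one allocation */
        struct tags *et_tags = malloc(sizeof(*et_tags));
        struct tags *sched_tags = et_tags;

        /* tags grow path: free the old tags and install a new allocation */
        free(sched_tags);
        sched_tags = malloc(sizeof(*sched_tags));

        /*
         * Without the fix, et_tags still points at the freed allocation,
         * and freeing it again on elevator exit is the double free. The
         * fix keeps the alias in sync:
         */
        et_tags = sched_tags;

        free(et_tags);
        return 0;
}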
Fixes: f5a6604f7a44 ("block: fix lockdep warning caused by lock dependency in elv_iosched_store")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-mq-tag.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d880c50629d6..5cffa5668d0c 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -622,6 +622,7 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
return -ENOMEM;
blk_mq_free_map_and_rqs(set, *tagsptr, hctx->queue_num);
+ hctx->queue->elevator->et->tags[hctx->queue_num] = new;
*tagsptr = new;
} else {
/*
--
2.39.2
* Re: [PATCH v2 2/2] blk-mq: fix blk_mq_tags double free while nr_requests grown
2025-08-19 1:29 ` [PATCH v2 2/2] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
@ 2025-08-19 10:47 ` Ming Lei
2025-08-19 12:45 ` Nilay Shroff
1 sibling, 0 replies; 8+ messages in thread
From: Ming Lei @ 2025-08-19 10:47 UTC (permalink / raw)
To: Yu Kuai
Cc: yukuai3, axboe, bvanassche, hare, nilay, linux-block,
linux-kernel, yi.zhang, yangerkun, johnny.chenyi
On Tue, Aug 19, 2025 at 09:29:17AM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> In case the user triggers tags growth through the queue sysfs attribute
> nr_requests, hctx->sched_tags will be freed directly and replaced with
> newly allocated tags, see blk_mq_tag_update_depth().
>
> The problem is that hctx->sched_tags is from elevator->et->tags, while
> et->tags is still the freed tags, hence later elevator exist will try to
> free the tags again, causing kernel panic.
>
> Fix this problem by replacing et->tags will new allocated tags as well.
>
> Note that there are still some long-term problems that will require some
> refactoring to be fixed thoroughly[1].
>
> [1] https://lore.kernel.org/all/20250815080216.410665-1-yukuai1@huaweicloud.com/
> Fixes: f5a6604f7a44 ("block: fix lockdep warning caused by lock dependency in elv_iosched_store")
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> block/blk-mq-tag.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index d880c50629d6..5cffa5668d0c 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -622,6 +622,7 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
> return -ENOMEM;
>
> blk_mq_free_map_and_rqs(set, *tagsptr, hctx->queue_num);
> + hctx->queue->elevator->et->tags[hctx->queue_num] = new;
> *tagsptr = new;
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Thanks,
Ming
* Re: [PATCH v2 1/2] blk-mq: fix elevator depth_updated method
2025-08-19 1:29 ` [PATCH v2 1/2] blk-mq: fix elevator depth_updated method Yu Kuai
@ 2025-08-19 12:20 ` Nilay Shroff
2025-08-20 0:56 ` Yu Kuai
0 siblings, 1 reply; 8+ messages in thread
From: Nilay Shroff @ 2025-08-19 12:20 UTC (permalink / raw)
To: Yu Kuai, yukuai3, axboe, bvanassche, ming.lei, hare
Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi
On 8/19/25 6:59 AM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> Current depth_updated() has some problems:
>
> 1) depth_updated() is called for each hctx, while all elevators update
> async_depth at the disk level, which is not related to any hctx;
> 2) In blk_mq_update_nr_requests(), if a previous hctx update succeeded
> and the current hctx update failed, q->nr_requests will not be updated,
> while async_depth has already been updated with the new nr_requests by
> the previous depth_updated() calls;
> 3) All elevators now use q->nr_requests to calculate async_depth,
> however, q->nr_requests still holds the old value when depth_updated()
> is called from blk_mq_update_nr_requests();
>
> Fix those problems by:
>
> - pass in request_queue instead of hctx;
> - move depth_updated() after q->nr_requests is updated in
> blk_mq_update_nr_requests();
> - add depth_updated() call in blk_mq_init_sched();
> - remove init_hctx() method for mq-deadline and bfq that is useless now;
>
> Fixes: 77f1e0a52d26 ("bfq: update internal depth state when queue depth changes")
> Fixes: 39823b47bbd4 ("block/mq-deadline: Fix the tag reservation code")
> Fixes: 42e6c6ce03fd ("lib/sbitmap: convert shallow_depth from one word to the whole sbitmap")
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> block/bfq-iosched.c | 21 ++++-----------------
> block/blk-mq-sched.c | 3 +++
> block/blk-mq-sched.h | 11 +++++++++++
> block/blk-mq.c | 23 ++++++++++++-----------
> block/elevator.h | 2 +-
> block/kyber-iosched.c | 10 ++++------
> block/mq-deadline.c | 15 ++-------------
> 7 files changed, 37 insertions(+), 48 deletions(-)
>
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 50e51047e1fe..c0c398998aa1 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -7109,9 +7109,10 @@ void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
> * See the comments on bfq_limit_depth for the purpose of
> * the depths set in the function. Return minimum shallow depth we'll use.
> */
> -static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
> +static void bfq_depth_updated(struct request_queue *q)
> {
> - unsigned int nr_requests = bfqd->queue->nr_requests;
> + struct bfq_data *bfqd = q->elevator->elevator_data;
> + unsigned int nr_requests = q->nr_requests;
>
> /*
> * In-word depths if no bfq_queue is being weight-raised:
> @@ -7143,21 +7144,8 @@ static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
> bfqd->async_depths[1][0] = max((nr_requests * 3) >> 4, 1U);
> /* no more than ~37% of tags for sync writes (~20% extra tags) */
> bfqd->async_depths[1][1] = max((nr_requests * 6) >> 4, 1U);
> -}
> -
> -static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
> -{
> - struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
> - struct blk_mq_tags *tags = hctx->sched_tags;
>
> - bfq_update_depths(bfqd, &tags->bitmap_tags);
> - sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, 1);
> -}
> -
> -static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
> -{
> - bfq_depth_updated(hctx);
> - return 0;
> + blk_mq_set_min_shallow_depth(q, 1);
> }
>
> static void bfq_exit_queue(struct elevator_queue *e)
> @@ -7628,7 +7616,6 @@ static struct elevator_type iosched_bfq_mq = {
> .request_merged = bfq_request_merged,
> .has_work = bfq_has_work,
> .depth_updated = bfq_depth_updated,
> - .init_hctx = bfq_init_hctx,
> .init_sched = bfq_init_queue,
> .exit_sched = bfq_exit_queue,
> },
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index e2ce4a28e6c9..bf7dd97422ec 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -585,6 +585,9 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
> }
> }
> }
> +
> + if (e->ops.depth_updated)
> + e->ops.depth_updated(q);
> return 0;
>
Overall changes look good. That said, I think it might be cleaner to structure
it this way:
elevator_switch -> blk_mq_init_sched ->init_sched ==> sets async_depth
blk_mq_update_nr_requests ->depth_updated ==> updates async_depth
This way, we don’t need to call ->depth_updated from blk_mq_init_sched.
In summary:
- Avoid calling ->depth_updated during blk_mq_init_sched
- Set async_depth when the elevator is initialized (via ->init_sched)
- Update async_depth when nr_requests is modified through sysfs (via ->depth_updated)
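For mq-deadline that would roughly look like the following untested
sketch (the dd_init_sched() signature is from memory and the existing
setup is elided; dd_depth_updated() as introduced in patch 1):

static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
{
        /* ... existing allocation and setup, q->elevator assignment ... */

        /* initialize async_depth here rather than from blk_mq_init_sched() */
        dd_depth_updated(q);
        return 0;
}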
Thanks,
--Nilay
* Re: [PATCH v2 2/2] blk-mq: fix blk_mq_tags double free while nr_requests grown
2025-08-19 1:29 ` [PATCH v2 2/2] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
2025-08-19 10:47 ` Ming Lei
@ 2025-08-19 12:45 ` Nilay Shroff
1 sibling, 0 replies; 8+ messages in thread
From: Nilay Shroff @ 2025-08-19 12:45 UTC (permalink / raw)
To: Yu Kuai, yukuai3, axboe, bvanassche, ming.lei, hare
Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi
On 8/19/25 6:59 AM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> In case the user triggers tags growth through the queue sysfs attribute
> nr_requests, hctx->sched_tags will be freed directly and replaced with
> newly allocated tags, see blk_mq_tag_update_depth().
>
> The problem is that hctx->sched_tags is from elevator->et->tags, while
> et->tags is still the freed tags, hence later elevator exist will try to
nit: 's/exist/exit'
> free the tags again, causing kernel panic.
>
> Fix this problem by replacing et->tags will new allocated tags as well.
nit: 's/will/with'
>
> Note that there are still some long-term problems that will require some
> refactoring to be fixed thoroughly[1].
>
> [1] https://lore.kernel.org/all/20250815080216.410665-1-yukuai1@huaweicloud.com/
> Fixes: f5a6604f7a44 ("block: fix lockdep warning caused by lock dependency in elv_iosched_store")
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> block/blk-mq-tag.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index d880c50629d6..5cffa5668d0c 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -622,6 +622,7 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
> return -ENOMEM;
>
> blk_mq_free_map_and_rqs(set, *tagsptr, hctx->queue_num);
> + hctx->queue->elevator->et->tags[hctx->queue_num] = new;
> *tagsptr = new;
> } else {
> /*
Except for the above minor nitpicking, this change looks good to me:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
* Re: [PATCH v2 1/2] blk-mq: fix elevator depth_updated method
2025-08-19 12:20 ` Nilay Shroff
@ 2025-08-20 0:56 ` Yu Kuai
2025-08-20 5:34 ` Nilay Shroff
0 siblings, 1 reply; 8+ messages in thread
From: Yu Kuai @ 2025-08-20 0:56 UTC (permalink / raw)
To: Nilay Shroff, Yu Kuai, axboe, bvanassche, ming.lei, hare
Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi,
yukuai (C)
Hi,
On 2025/08/19 20:20, Nilay Shroff wrote:
>
>
> On 8/19/25 6:59 AM, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Current depth_updated() has some problems:
>>
>> 1) depth_updated() is called for each hctx, while all elevators update
>> async_depth at the disk level, which is not related to any hctx;
>> 2) In blk_mq_update_nr_requests(), if a previous hctx update succeeded
>> and the current hctx update failed, q->nr_requests will not be updated,
>> while async_depth has already been updated with the new nr_requests by
>> the previous depth_updated() calls;
>> 3) All elevators now use q->nr_requests to calculate async_depth,
>> however, q->nr_requests still holds the old value when depth_updated()
>> is called from blk_mq_update_nr_requests();
>>
>> Fix those problems by:
>>
>> - pass in request_queue instead of hctx;
>> - move depth_updated() after q->nr_requests is updated in
>> blk_mq_update_nr_requests();
>> - add depth_updated() call in blk_mq_init_sched();
>> - remove init_hctx() method for mq-deadline and bfq that is useless now;
>>
>> Fixes: 77f1e0a52d26 ("bfq: update internal depth state when queue depth changes")
>> Fixes: 39823b47bbd4 ("block/mq-deadline: Fix the tag reservation code")
>> Fixes: 42e6c6ce03fd ("lib/sbitmap: convert shallow_depth from one word to the whole sbitmap")
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>> block/bfq-iosched.c | 21 ++++-----------------
>> block/blk-mq-sched.c | 3 +++
>> block/blk-mq-sched.h | 11 +++++++++++
>> block/blk-mq.c | 23 ++++++++++++-----------
>> block/elevator.h | 2 +-
>> block/kyber-iosched.c | 10 ++++------
>> block/mq-deadline.c | 15 ++-------------
>> 7 files changed, 37 insertions(+), 48 deletions(-)
>>
>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>> index 50e51047e1fe..c0c398998aa1 100644
>> --- a/block/bfq-iosched.c
>> +++ b/block/bfq-iosched.c
>> @@ -7109,9 +7109,10 @@ void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
>> * See the comments on bfq_limit_depth for the purpose of
>> * the depths set in the function. Return minimum shallow depth we'll use.
>> */
>> -static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
>> +static void bfq_depth_updated(struct request_queue *q)
>> {
>> - unsigned int nr_requests = bfqd->queue->nr_requests;
>> + struct bfq_data *bfqd = q->elevator->elevator_data;
>> + unsigned int nr_requests = q->nr_requests;
>>
>> /*
>> * In-word depths if no bfq_queue is being weight-raised:
>> @@ -7143,21 +7144,8 @@ static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
>> bfqd->async_depths[1][0] = max((nr_requests * 3) >> 4, 1U);
>> /* no more than ~37% of tags for sync writes (~20% extra tags) */
>> bfqd->async_depths[1][1] = max((nr_requests * 6) >> 4, 1U);
>> -}
>> -
>> -static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
>> -{
>> - struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
>> - struct blk_mq_tags *tags = hctx->sched_tags;
>>
>> - bfq_update_depths(bfqd, &tags->bitmap_tags);
>> - sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, 1);
>> -}
>> -
>> -static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
>> -{
>> - bfq_depth_updated(hctx);
>> - return 0;
>> + blk_mq_set_min_shallow_depth(q, 1);
>> }
>>
>> static void bfq_exit_queue(struct elevator_queue *e)
>> @@ -7628,7 +7616,6 @@ static struct elevator_type iosched_bfq_mq = {
>> .request_merged = bfq_request_merged,
>> .has_work = bfq_has_work,
>> .depth_updated = bfq_depth_updated,
>> - .init_hctx = bfq_init_hctx,
>> .init_sched = bfq_init_queue,
>> .exit_sched = bfq_exit_queue,
>> },
>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>> index e2ce4a28e6c9..bf7dd97422ec 100644
>> --- a/block/blk-mq-sched.c
>> +++ b/block/blk-mq-sched.c
>> @@ -585,6 +585,9 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
>> }
>> }
>> }
>> +
>> + if (e->ops.depth_updated)
>> + e->ops.depth_updated(q);
>> return 0;
>>
>
> Overall changes look good. That said, I think it might be cleaner to structure
> it this way:
>
> elevator_switch -> blk_mq_init_sched ->init_sched ==> sets async_depth
> blk_mq_update_nr_requests ->depth_updated ==> updates async_depth
>
> This way, we don’t need to call ->depth_updated from blk_mq_init_sched.
Just to be sure, you mean calling the depth_updated method directly
inside the init_sched() method? This is indeed cleaner; each elevator
then has to use this method to initialize async_depth.
>
> In summary:
> - Avoid calling ->depth_updated during blk_mq_init_sched
> - Set async_depth when the elevator is initialized (via ->init_sched)
> - Update async_depth when nr_requests is modified through sysfs (via ->depth_updated)
>
> Thanks,
> --Nilay
> .
>
Thanks,
Kuai
* Re: [PATCH v2 1/2] blk-mq: fix elevator depth_updated method
2025-08-20 0:56 ` Yu Kuai
@ 2025-08-20 5:34 ` Nilay Shroff
0 siblings, 0 replies; 8+ messages in thread
From: Nilay Shroff @ 2025-08-20 5:34 UTC (permalink / raw)
To: Yu Kuai, axboe, bvanassche, ming.lei, hare
Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi,
yukuai (C)
On 8/20/25 6:26 AM, Yu Kuai wrote:
> Hi,
>
> On 2025/08/19 20:20, Nilay Shroff wrote:
>>
>>
>> On 8/19/25 6:59 AM, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> Current depth_updated() has some problems:
>>>
>>> 1) depth_updated() is called for each hctx, while all elevators update
>>> async_depth at the disk level, which is not related to any hctx;
>>> 2) In blk_mq_update_nr_requests(), if a previous hctx update succeeded
>>> and the current hctx update failed, q->nr_requests will not be updated,
>>> while async_depth has already been updated with the new nr_requests by
>>> the previous depth_updated() calls;
>>> 3) All elevators now use q->nr_requests to calculate async_depth,
>>> however, q->nr_requests still holds the old value when depth_updated()
>>> is called from blk_mq_update_nr_requests();
>>>
>>> Fix those problems by:
>>>
>>> - pass in request_queue instead of hctx;
>>> - move depth_updated() after q->nr_requests is updated in
>>> blk_mq_update_nr_requests();
>>> - add depth_updated() call in blk_mq_init_sched();
>>> - remove init_hctx() method for mq-deadline and bfq that is useless now;
>>>
>>> Fixes: 77f1e0a52d26 ("bfq: update internal depth state when queue depth changes")
>>> Fixes: 39823b47bbd4 ("block/mq-deadline: Fix the tag reservation code")
>>> Fixes: 42e6c6ce03fd ("lib/sbitmap: convert shallow_depth from one word to the whole sbitmap")
>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>> ---
>>> block/bfq-iosched.c | 21 ++++-----------------
>>> block/blk-mq-sched.c | 3 +++
>>> block/blk-mq-sched.h | 11 +++++++++++
>>> block/blk-mq.c | 23 ++++++++++++-----------
>>> block/elevator.h | 2 +-
>>> block/kyber-iosched.c | 10 ++++------
>>> block/mq-deadline.c | 15 ++-------------
>>> 7 files changed, 37 insertions(+), 48 deletions(-)
>>>
>>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>>> index 50e51047e1fe..c0c398998aa1 100644
>>> --- a/block/bfq-iosched.c
>>> +++ b/block/bfq-iosched.c
>>> @@ -7109,9 +7109,10 @@ void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
>>> * See the comments on bfq_limit_depth for the purpose of
>>> * the depths set in the function. Return minimum shallow depth we'll use.
>>> */
>>> -static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
>>> +static void bfq_depth_updated(struct request_queue *q)
>>> {
>>> - unsigned int nr_requests = bfqd->queue->nr_requests;
>>> + struct bfq_data *bfqd = q->elevator->elevator_data;
>>> + unsigned int nr_requests = q->nr_requests;
>>> /*
>>> * In-word depths if no bfq_queue is being weight-raised:
>>> @@ -7143,21 +7144,8 @@ static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
>>> bfqd->async_depths[1][0] = max((nr_requests * 3) >> 4, 1U);
>>> /* no more than ~37% of tags for sync writes (~20% extra tags) */
>>> bfqd->async_depths[1][1] = max((nr_requests * 6) >> 4, 1U);
>>> -}
>>> -
>>> -static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
>>> -{
>>> - struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
>>> - struct blk_mq_tags *tags = hctx->sched_tags;
>>> - bfq_update_depths(bfqd, &tags->bitmap_tags);
>>> - sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, 1);
>>> -}
>>> -
>>> -static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
>>> -{
>>> - bfq_depth_updated(hctx);
>>> - return 0;
>>> + blk_mq_set_min_shallow_depth(q, 1);
>>> }
>>> static void bfq_exit_queue(struct elevator_queue *e)
>>> @@ -7628,7 +7616,6 @@ static struct elevator_type iosched_bfq_mq = {
>>> .request_merged = bfq_request_merged,
>>> .has_work = bfq_has_work,
>>> .depth_updated = bfq_depth_updated,
>>> - .init_hctx = bfq_init_hctx,
>>> .init_sched = bfq_init_queue,
>>> .exit_sched = bfq_exit_queue,
>>> },
>>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>>> index e2ce4a28e6c9..bf7dd97422ec 100644
>>> --- a/block/blk-mq-sched.c
>>> +++ b/block/blk-mq-sched.c
>>> @@ -585,6 +585,9 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
>>> }
>>> }
>>> }
>>> +
>>> + if (e->ops.depth_updated)
>>> + e->ops.depth_updated(q);
>>> return 0;
>>>
>>
>> Overall changes look good. That said, I think it might be cleaner to structure
>> it this way:
>>
>> elevator_switch -> blk_mq_init_sched ->init_sched ==> sets async_depth
>> blk_mq_update_nr_requests ->depth_updated ==> updates async_depth
>>
>> This way, we don’t need to call ->depth_updated from blk_mq_init_sched.
>
> Just to be sure, you mean calling the depth_updated method directly
> inside the init_sched() method? This is indeed cleaner, each elevator
> has to use this method to initialize async_dpeth.
Yes, you're right. As ->init_sched() already has a pointer to the request
queue, you may now directly call the ->depth_updated() method of the
respective elevator from ->init_sched().
Thanks,
--Nilay