[PATCH 00/16] blk-mq: introduce new queue attribute asyc

linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth
@ 2025-08-14  3:35 Yu Kuai
  2025-08-14  3:35 ` [PATCH 01/16] blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags() Yu Kuai
                   ` (16 more replies)
  0 siblings, 17 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-14  3:35 UTC (permalink / raw)
  To: axboe, yukuai3, bvanassche, nilay, hare, ming.lei
  Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Backgroud and motivation:

At first, we test a performance regression from 5.10 to 6.6 in
downstream kernel(described in patch 13), the regression is related to
async_depth in mq-dealine.

While trying to fix this regression, Bart suggests add a new attribute
to request_queue, and I think this is a good idea because all elevators
have similar logical, however only mq-deadline allow user to configure
async_depth. And this is patch 9-16, where the performance problem is
fixed in patch 13;

Because async_depth is related to nr_requests, while reviewing related
code, patch 2-7 are cleanups and fixes to nr_reqeusts.

I was planning to send this set for the next merge window, however,
during test I found the last block pr(6.17-rc1) introduce a regression
if nr_reqeusts grows, exit elevator will panic, and I fix this by
patch 1,8.

So I send this set for now, and hope we can consider it for this merge
window. If not, I'll have to rework the whole set and send the second
kernel panic regression first.

Yu Kuai (16):
  blk-mq-sched: add new parameter nr_requests in
    blk_mq_alloc_sched_tags()
  blk-mq: remove useless checking from queue_requests_store()
  blk-mq: remove useless checkings from blk_mq_update_nr_requests()
  blk-mq: check invalid nr_requests in queue_requests_store()
  blk-mq: fix elevator depth_updated method
  blk-mq: cleanup shared tags case in blk_mq_update_nr_requests()
  blk-mq: split bitmap grow and resize case in
    blk_mq_update_nr_requests()
  blk-mq: fix blk_mq_tags double free while nr_requests grown
  block: convert nr_requests to unsigned int
  blk-mq-sched: unify elevators checking for async requests
  blk-mq: add a new queue sysfs attribute async_depth
  kyber: covert to use request_queue->async_depth
  mq-deadline: covert to use request_queue->async_depth
  block, bfq: convert to use request_queue->async_depth
  blk-mq: fix stale nr_requests documentation
  blk-mq: add documentation for new queue attribute async_dpeth

 Documentation/ABI/stable/sysfs-block | 24 ++++---
 block/bfq-iosched.c                  | 64 +++++++------------
 block/blk-core.c                     |  1 +
 block/blk-mq-sched.c                 | 14 +++--
 block/blk-mq-sched.h                 | 18 +++++-
 block/blk-mq-tag.c                   | 23 +------
 block/blk-mq.c                       | 93 +++++++++++++++++-----------
 block/blk-mq.h                       |  2 +-
 block/blk-sysfs.c                    | 63 ++++++++++++++++++-
 block/elevator.c                     |  3 +-
 block/elevator.h                     |  2 +-
 block/kyber-iosched.c                | 40 ++----------
 block/mq-deadline.c                  | 55 ++--------------
 include/linux/blkdev.h               |  3 +-
 14 files changed, 194 insertions(+), 211 deletions(-)

-- 
2.39.2


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 01/16] blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags()
  2025-08-14  3:35 [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Yu Kuai
@ 2025-08-14  3:35 ` Yu Kuai
  2025-08-14  8:16   ` Ming Lei
  2025-08-14  3:35 ` [PATCH 02/16] blk-mq: remove useless checking from queue_requests_store() Yu Kuai
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 30+ messages in thread
From: Yu Kuai @ 2025-08-14  3:35 UTC (permalink / raw)
  To: axboe, yukuai3, bvanassche, nilay, hare, ming.lei
  Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

This helper only support to iallocate the default number of requests,
add a new parameter to support specific number of requests.

Prepare to fix tags double free problem if nr_requests is grown by
queue sysfs attribute nr_requests.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-sched.c | 11 +++++++----
 block/blk-mq-sched.h |  2 +-
 block/elevator.c     |  2 +-
 3 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index e2ce4a28e6c9..9a8a0b5e04a9 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -454,7 +454,7 @@ void blk_mq_free_sched_tags_batch(struct xarray *et_table,
 }
 
 struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
-		unsigned int nr_hw_queues)
+		unsigned int nr_hw_queues, unsigned int nr_requests)
 {
 	unsigned int nr_tags;
 	int i;
@@ -475,8 +475,11 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
 	 * 128, since we don't split into sync/async like the old code
 	 * did. Additionally, this is a per-hw queue depth.
 	 */
-	et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
-			BLKDEV_DEFAULT_RQ);
+	if (nr_requests)
+		et->nr_requests = nr_requests;
+	else
+		et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
+				BLKDEV_DEFAULT_RQ);
 	et->nr_hw_queues = nr_hw_queues;
 
 	if (blk_mq_is_shared_tags(set->flags)) {
@@ -521,7 +524,7 @@ int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
 		 * concurrently.
 		 */
 		if (q->elevator) {
-			et = blk_mq_alloc_sched_tags(set, nr_hw_queues);
+			et = blk_mq_alloc_sched_tags(set, nr_hw_queues, 0);
 			if (!et)
 				goto out_unwind;
 			if (xa_insert(et_table, q->id, et, gfp))
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index b554e1d55950..0582d4bc3987 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -24,7 +24,7 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
 void blk_mq_sched_free_rqs(struct request_queue *q);
 
 struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
-		unsigned int nr_hw_queues);
+		unsigned int nr_hw_queues, unsigned int nr_requests);
 int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
 		struct blk_mq_tag_set *set, unsigned int nr_hw_queues);
 void blk_mq_free_sched_tags(struct elevator_tags *et,
diff --git a/block/elevator.c b/block/elevator.c
index fe96c6f4753c..f8a04f32cbcf 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -669,7 +669,7 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
 	lockdep_assert_held(&set->update_nr_hwq_lock);
 
 	if (strncmp(ctx->name, "none", 4)) {
-		ctx->et = blk_mq_alloc_sched_tags(set, set->nr_hw_queues);
+		ctx->et = blk_mq_alloc_sched_tags(set, set->nr_hw_queues, 0);
 		if (!ctx->et)
 			return -ENOMEM;
 	}
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 02/16] blk-mq: remove useless checking from queue_requests_store()
  2025-08-14  3:35 [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Yu Kuai
  2025-08-14  3:35 ` [PATCH 01/16] blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags() Yu Kuai
@ 2025-08-14  3:35 ` Yu Kuai
  2025-08-14  3:35 ` [PATCH 03/16] blk-mq: remove useless checkings from blk_mq_update_nr_requests() Yu Kuai
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-14  3:35 UTC (permalink / raw)
  To: axboe, yukuai3, bvanassche, nilay, hare, ming.lei
  Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

blk_mq_queue_attr_visible() already checked queue_is_mq(), no need to
check this again in queue_requests_store().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-sysfs.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index c5cf79a20842..1086f7b9da28 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -69,9 +69,6 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
 	unsigned int memflags;
 	struct request_queue *q = disk->queue;
 
-	if (!queue_is_mq(q))
-		return -EINVAL;
-
 	ret = queue_var_store(&nr, page, count);
 	if (ret < 0)
 		return ret;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 03/16] blk-mq: remove useless checkings from blk_mq_update_nr_requests()
  2025-08-14  3:35 [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Yu Kuai
  2025-08-14  3:35 ` [PATCH 01/16] blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags() Yu Kuai
  2025-08-14  3:35 ` [PATCH 02/16] blk-mq: remove useless checking from queue_requests_store() Yu Kuai
@ 2025-08-14  3:35 ` Yu Kuai
  2025-08-14 12:23   ` Nilay Shroff
  2025-08-14  3:35 ` [PATCH 04/16] blk-mq: check invalid nr_requests in queue_requests_store() Yu Kuai
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 30+ messages in thread
From: Yu Kuai @ 2025-08-14  3:35 UTC (permalink / raw)
  To: axboe, yukuai3, bvanassche, nilay, hare, ming.lei
  Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

1) queue_requests_store() is the only caller of
blk_mq_update_nr_requests(), where queue is already freezed, no need to
check mq_freeze_depth;
2) q->tag_set must be set for request_based device, and queue_is_mq() is
already checked in blk_mq_queue_attr_visible(), no need to check
q->tag_set.
3) During initialization, hctx->tags in initialized before queue
kobject, and during del_gendisk, queue kobject is deleted before
exiting hctx, hence checking hctx->tags is useless.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b67d6c02eceb..3a219b7b3688 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4921,24 +4921,15 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
 	struct blk_mq_hw_ctx *hctx;
-	int ret;
+	int ret = 0;
 	unsigned long i;
 
-	if (WARN_ON_ONCE(!q->mq_freeze_depth))
-		return -EINVAL;
-
-	if (!set)
-		return -EINVAL;
-
 	if (q->nr_requests == nr)
 		return 0;
 
 	blk_mq_quiesce_queue(q);
 
-	ret = 0;
 	queue_for_each_hw_ctx(q, hctx, i) {
-		if (!hctx->tags)
-			continue;
 		/*
 		 * If we're using an MQ scheduler, just update the scheduler
 		 * queue depth. This is similar to what the old code would do.
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 04/16] blk-mq: check invalid nr_requests in queue_requests_store()
  2025-08-14  3:35 [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Yu Kuai
                   ` (2 preceding siblings ...)
  2025-08-14  3:35 ` [PATCH 03/16] blk-mq: remove useless checkings from blk_mq_update_nr_requests() Yu Kuai
@ 2025-08-14  3:35 ` Yu Kuai
  2025-08-14  3:35 ` [PATCH 05/16] blk-mq: fix elevator depth_updated method Yu Kuai
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-14  3:35 UTC (permalink / raw)
  To: axboe, yukuai3, bvanassche, nilay, hare, ming.lei
  Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

queue_requests_store() is the only caller of
blk_mq_update_nr_requests(), and blk_mq_update_nr_requests() is the
only caller of blk_mq_tag_update_depth(), however, they all have
checkings for nr_requests input by user.

Make code cleaner by moving all the checkings to the top function:

1) nr_requests > reserved tags;
2) if there is elevator, 4 <= nr_requests <= 2048;
3) if elevator is none, 4 <= nr_requests < tag_set->queue_depth;

Meanwhile, case 2 is the only case tags can grow and -ENOMEM might be
returned.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-tag.c | 16 +---------------
 block/blk-mq.c     | 14 +++++---------
 block/blk-mq.h     |  2 +-
 block/blk-sysfs.c  | 13 +++++++++++++
 4 files changed, 20 insertions(+), 25 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d880c50629d6..7613a9889eb1 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -584,14 +584,10 @@ void blk_mq_free_tags(struct blk_mq_tags *tags)
 }
 
 int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
-			    struct blk_mq_tags **tagsptr, unsigned int tdepth,
-			    bool can_grow)
+			    struct blk_mq_tags **tagsptr, unsigned int tdepth)
 {
 	struct blk_mq_tags *tags = *tagsptr;
 
-	if (tdepth <= tags->nr_reserved_tags)
-		return -EINVAL;
-
 	/*
 	 * If we are allowed to grow beyond the original size, allocate
 	 * a new set of tags before freeing the old one.
@@ -600,16 +596,6 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 		struct blk_mq_tag_set *set = hctx->queue->tag_set;
 		struct blk_mq_tags *new;
 
-		if (!can_grow)
-			return -EINVAL;
-
-		/*
-		 * We need some sort of upper limit, set it high enough that
-		 * no valid use cases should require more.
-		 */
-		if (tdepth > MAX_SCHED_RQ)
-			return -EINVAL;
-
 		/*
 		 * Only the sbitmap needs resizing since we allocated the max
 		 * initially.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3a219b7b3688..e86cab125a2d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4924,9 +4924,6 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 	int ret = 0;
 	unsigned long i;
 
-	if (q->nr_requests == nr)
-		return 0;
-
 	blk_mq_quiesce_queue(q);
 
 	queue_for_each_hw_ctx(q, hctx, i) {
@@ -4934,13 +4931,12 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 		 * If we're using an MQ scheduler, just update the scheduler
 		 * queue depth. This is similar to what the old code would do.
 		 */
-		if (hctx->sched_tags) {
+		if (hctx->sched_tags)
 			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
-						      nr, true);
-		} else {
-			ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
-						      false);
-		}
+						      nr);
+		else
+			ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr);
+
 		if (ret)
 			break;
 		if (q->elevator && q->elevator->type->ops.depth_updated)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index affb2e14b56e..2b3ade60c90b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -171,7 +171,7 @@ void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
 		unsigned int tag);
 void blk_mq_put_tags(struct blk_mq_tags *tags, int *tag_array, int nr_tags);
 int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
-		struct blk_mq_tags **tags, unsigned int depth, bool can_grow);
+		struct blk_mq_tags **tags, unsigned int depth);
 void blk_mq_tag_resize_shared_tags(struct blk_mq_tag_set *set,
 		unsigned int size);
 void blk_mq_tag_update_sched_shared_tags(struct request_queue *q);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 1086f7b9da28..f3d08edcc34f 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -75,12 +75,25 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
 
 	memflags = blk_mq_freeze_queue(q);
 	mutex_lock(&q->elevator_lock);
+
+	if (q->nr_requests == nr)
+		goto unlock;
+
 	if (nr < BLKDEV_MIN_RQ)
 		nr = BLKDEV_MIN_RQ;
 
+	if (nr <= q->tag_set->reserved_tags ||
+	    (q->elevator && nr > MAX_SCHED_RQ) ||
+	    (!q->elevator && nr > q->tag_set->queue_depth)) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
 	err = blk_mq_update_nr_requests(disk->queue, nr);
 	if (err)
 		ret = err;
+
+unlock:
 	mutex_unlock(&q->elevator_lock);
 	blk_mq_unfreeze_queue(q, memflags);
 	return ret;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 05/16] blk-mq: fix elevator depth_updated method
  2025-08-14  3:35 [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Yu Kuai
                   ` (3 preceding siblings ...)
  2025-08-14  3:35 ` [PATCH 04/16] blk-mq: check invalid nr_requests in queue_requests_store() Yu Kuai
@ 2025-08-14  3:35 ` Yu Kuai
  2025-08-14  3:35 ` [PATCH 06/16] blk-mq: cleanup shared tags case in blk_mq_update_nr_requests() Yu Kuai
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-14  3:35 UTC (permalink / raw)
  To: axboe, yukuai3, bvanassche, nilay, hare, ming.lei
  Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Current depth_updated has some problems:

1) depth_updated() will be called for each hctx, while all elevators
will update async_depth for the disk level, this is not related to hctx;
2) In blk_mq_update_nr_requests(), if previous hctx update succeed and
this hctx update failed, q->nr_requests will not be updated, while
async_depth is already updated with new nr_reqeuests in previous
depth_updated();
3) All elevators are using q->nr_requests to calculate async_depth now,
however, q->nr_requests is still the old value when depth_updated() is
called from blk_mq_update_nr_requests();

Fix those problems by:

- pass in request_queue instead of hctx;
- move depth_updated() after q->nr_requests is updated in
  blk_mq_update_nr_requests();
- add depth_updated() call in blk_mq_init_sched();
- remove init_hctx() method for mq-deadline and bfq that is useless now;

Fixes: 77f1e0a52d26 ("bfq: update internal depth state when queue depth changes")
Fixes: 39823b47bbd4 ("block/mq-deadline: Fix the tag reservation code")
Fixes: 42e6c6ce03fd ("lib/sbitmap: convert shallow_depth from one word to the whole sbitmap")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/bfq-iosched.c   | 21 ++++-----------------
 block/blk-mq-sched.c  |  3 +++
 block/blk-mq-sched.h  | 11 +++++++++++
 block/blk-mq.c        | 24 ++++++++++++------------
 block/elevator.h      |  2 +-
 block/kyber-iosched.c | 10 ++++------
 block/mq-deadline.c   | 15 ++-------------
 7 files changed, 37 insertions(+), 49 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 50e51047e1fe..c0c398998aa1 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7109,9 +7109,10 @@ void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
  * See the comments on bfq_limit_depth for the purpose of
  * the depths set in the function. Return minimum shallow depth we'll use.
  */
-static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
+static void bfq_depth_updated(struct request_queue *q)
 {
-	unsigned int nr_requests = bfqd->queue->nr_requests;
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	unsigned int nr_requests = q->nr_requests;
 
 	/*
 	 * In-word depths if no bfq_queue is being weight-raised:
@@ -7143,21 +7144,8 @@ static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
 	bfqd->async_depths[1][0] = max((nr_requests * 3) >> 4, 1U);
 	/* no more than ~37% of tags for sync writes (~20% extra tags) */
 	bfqd->async_depths[1][1] = max((nr_requests * 6) >> 4, 1U);
-}
-
-static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
-{
-	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
-	struct blk_mq_tags *tags = hctx->sched_tags;
 
-	bfq_update_depths(bfqd, &tags->bitmap_tags);
-	sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, 1);
-}
-
-static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
-{
-	bfq_depth_updated(hctx);
-	return 0;
+	blk_mq_set_min_shallow_depth(q, 1);
 }
 
 static void bfq_exit_queue(struct elevator_queue *e)
@@ -7628,7 +7616,6 @@ static struct elevator_type iosched_bfq_mq = {
 		.request_merged		= bfq_request_merged,
 		.has_work		= bfq_has_work,
 		.depth_updated		= bfq_depth_updated,
-		.init_hctx		= bfq_init_hctx,
 		.init_sched		= bfq_init_queue,
 		.exit_sched		= bfq_exit_queue,
 	},
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 9a8a0b5e04a9..c6a1e5a25fec 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -588,6 +588,9 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
 			}
 		}
 	}
+
+	if (e->ops.depth_updated)
+		e->ops.depth_updated(q);
 	return 0;
 
 out:
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 0582d4bc3987..8e21a6b1415d 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -92,4 +92,15 @@ static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
 	return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 }
 
+static inline void blk_mq_set_min_shallow_depth(struct request_queue *q,
+						unsigned int depth)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned long i;
+
+	queue_for_each_hw_ctx(q, hctx, i)
+		sbitmap_queue_min_shallow_depth(&hctx->sched_tags->bitmap_tags,
+						depth);
+}
+
 #endif
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e86cab125a2d..14446d3927dd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4938,22 +4938,22 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 			ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr);
 
 		if (ret)
-			break;
-		if (q->elevator && q->elevator->type->ops.depth_updated)
-			q->elevator->type->ops.depth_updated(hctx);
+			goto out;
 	}
-	if (!ret) {
-		q->nr_requests = nr;
-		if (blk_mq_is_shared_tags(set->flags)) {
-			if (q->elevator)
-				blk_mq_tag_update_sched_shared_tags(q);
-			else
-				blk_mq_tag_resize_shared_tags(set, nr);
-		}
+
+	q->nr_requests = nr;
+	if (q->elevator && q->elevator->type->ops.depth_updated)
+		q->elevator->type->ops.depth_updated(q);
+
+	if (blk_mq_is_shared_tags(set->flags)) {
+		if (q->elevator)
+			blk_mq_tag_update_sched_shared_tags(q);
+		else
+			blk_mq_tag_resize_shared_tags(set, nr);
 	}
 
+out:
 	blk_mq_unquiesce_queue(q);
-
 	return ret;
 }
 
diff --git a/block/elevator.h b/block/elevator.h
index adc5c157e17e..c4d20155065e 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -37,7 +37,7 @@ struct elevator_mq_ops {
 	void (*exit_sched)(struct elevator_queue *);
 	int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
 	void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
-	void (*depth_updated)(struct blk_mq_hw_ctx *);
+	void (*depth_updated)(struct request_queue *);
 
 	bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
 	bool (*bio_merge)(struct request_queue *, struct bio *, unsigned int);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 70cbc7b2deb4..49ae52aa20d9 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -440,13 +440,12 @@ static void kyber_ctx_queue_init(struct kyber_ctx_queue *kcq)
 		INIT_LIST_HEAD(&kcq->rq_list[i]);
 }
 
-static void kyber_depth_updated(struct blk_mq_hw_ctx *hctx)
+static void kyber_depth_updated(struct request_queue *q)
 {
-	struct kyber_queue_data *kqd = hctx->queue->elevator->elevator_data;
-	struct blk_mq_tags *tags = hctx->sched_tags;
+	struct kyber_queue_data *kqd = q->elevator->elevator_data;
 
-	kqd->async_depth = hctx->queue->nr_requests * KYBER_ASYNC_PERCENT / 100U;
-	sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, kqd->async_depth);
+	kqd->async_depth = q->nr_requests * KYBER_ASYNC_PERCENT / 100U;
+	blk_mq_set_min_shallow_depth(q, kqd->async_depth);
 }
 
 static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
@@ -493,7 +492,6 @@ static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
 	khd->batching = 0;
 
 	hctx->sched_data = khd;
-	kyber_depth_updated(hctx);
 
 	return 0;
 
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b9b7cdf1d3c9..578bc79c5654 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -507,22 +507,12 @@ static void dd_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
 }
 
 /* Called by blk_mq_update_nr_requests(). */
-static void dd_depth_updated(struct blk_mq_hw_ctx *hctx)
+static void dd_depth_updated(struct request_queue *q)
 {
-	struct request_queue *q = hctx->queue;
 	struct deadline_data *dd = q->elevator->elevator_data;
-	struct blk_mq_tags *tags = hctx->sched_tags;
 
 	dd->async_depth = q->nr_requests;
-
-	sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, 1);
-}
-
-/* Called by blk_mq_init_hctx() and blk_mq_init_sched(). */
-static int dd_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
-{
-	dd_depth_updated(hctx);
-	return 0;
+	blk_mq_set_min_shallow_depth(q, 1);
 }
 
 static void dd_exit_sched(struct elevator_queue *e)
@@ -1048,7 +1038,6 @@ static struct elevator_type mq_deadline = {
 		.has_work		= dd_has_work,
 		.init_sched		= dd_init_sched,
 		.exit_sched		= dd_exit_sched,
-		.init_hctx		= dd_init_hctx,
 	},
 
 #ifdef CONFIG_BLK_DEBUG_FS
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 06/16] blk-mq: cleanup shared tags case in blk_mq_update_nr_requests()
  2025-08-14  3:35 [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Yu Kuai
                   ` (4 preceding siblings ...)
  2025-08-14  3:35 ` [PATCH 05/16] blk-mq: fix elevator depth_updated method Yu Kuai
@ 2025-08-14  3:35 ` Yu Kuai
  2025-08-14  3:35 ` [PATCH 07/16] blk-mq: split bitmap grow and resize " Yu Kuai
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-14  3:35 UTC (permalink / raw)
  To: axboe, yukuai3, bvanassche, nilay, hare, ming.lei
  Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

For shared tags case, all hctx->shared_tags/tags are the same, it
doesn't make sense to call blk_mq_tag_update_depth() multiple times
for the same tags.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-tag.c |  7 -------
 block/blk-mq.c     | 40 +++++++++++++++++++++-------------------
 2 files changed, 21 insertions(+), 26 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 7613a9889eb1..84a452e708e4 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -596,13 +596,6 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 		struct blk_mq_tag_set *set = hctx->queue->tag_set;
 		struct blk_mq_tags *new;
 
-		/*
-		 * Only the sbitmap needs resizing since we allocated the max
-		 * initially.
-		 */
-		if (blk_mq_is_shared_tags(set->flags))
-			return 0;
-
 		new = blk_mq_alloc_map_and_rqs(set, hctx->queue_num, tdepth);
 		if (!new)
 			return -ENOMEM;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 14446d3927dd..c29f98ec283d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4926,32 +4926,34 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 
 	blk_mq_quiesce_queue(q);
 
-	queue_for_each_hw_ctx(q, hctx, i) {
-		/*
-		 * If we're using an MQ scheduler, just update the scheduler
-		 * queue depth. This is similar to what the old code would do.
-		 */
-		if (hctx->sched_tags)
-			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
-						      nr);
-		else
-			ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr);
-
-		if (ret)
-			goto out;
-	}
-
-	q->nr_requests = nr;
-	if (q->elevator && q->elevator->type->ops.depth_updated)
-		q->elevator->type->ops.depth_updated(q);
-
 	if (blk_mq_is_shared_tags(set->flags)) {
 		if (q->elevator)
 			blk_mq_tag_update_sched_shared_tags(q);
 		else
 			blk_mq_tag_resize_shared_tags(set, nr);
+	} else {
+		queue_for_each_hw_ctx(q, hctx, i) {
+			/*
+			 * If we're using an MQ scheduler, just update the
+			 * scheduler queue depth. This is similar to what the
+			 * old code would do.
+			 */
+			if (hctx->sched_tags)
+				ret = blk_mq_tag_update_depth(hctx,
+						&hctx->sched_tags, nr);
+			else
+				ret = blk_mq_tag_update_depth(hctx,
+						&hctx->tags, nr);
+
+			if (ret)
+				goto out;
+		}
 	}
 
+	q->nr_requests = nr;
+	if (q->elevator && q->elevator->type->ops.depth_updated)
+		q->elevator->type->ops.depth_updated(q);
+
 out:
 	blk_mq_unquiesce_queue(q);
 	return ret;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 07/16] blk-mq: split bitmap grow and resize case in blk_mq_update_nr_requests()
  2025-08-14  3:35 [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Yu Kuai
                   ` (5 preceding siblings ...)
  2025-08-14  3:35 ` [PATCH 06/16] blk-mq: cleanup shared tags case in blk_mq_update_nr_requests() Yu Kuai
@ 2025-08-14  3:35 ` Yu Kuai
  2025-08-14  3:35 ` [PATCH 08/16] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-14  3:35 UTC (permalink / raw)
  To: axboe, yukuai3, bvanassche, nilay, hare, ming.lei
  Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

No functional changes are intended, make code cleaner and prepare to fix
the grow case in the next patch.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c29f98ec283d..a7d6a20c1524 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4931,6 +4931,14 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 			blk_mq_tag_update_sched_shared_tags(q);
 		else
 			blk_mq_tag_resize_shared_tags(set, nr);
+	} else if (!q->elevator) {
+		queue_for_each_hw_ctx(q, hctx, i)
+			sbitmap_queue_resize(&hctx->tags->bitmap_tags,
+				nr - hctx->tags->nr_reserved_tags);
+	} else if (nr <= q->elevator->et->nr_requests) {
+		queue_for_each_hw_ctx(q, hctx, i)
+			sbitmap_queue_resize(&hctx->sched_tags->bitmap_tags,
+				nr - hctx->sched_tags->nr_reserved_tags);
 	} else {
 		queue_for_each_hw_ctx(q, hctx, i) {
 			/*
@@ -4938,13 +4946,8 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 			 * scheduler queue depth. This is similar to what the
 			 * old code would do.
 			 */
-			if (hctx->sched_tags)
-				ret = blk_mq_tag_update_depth(hctx,
-						&hctx->sched_tags, nr);
-			else
-				ret = blk_mq_tag_update_depth(hctx,
-						&hctx->tags, nr);
-
+			ret = blk_mq_tag_update_depth(hctx,
+					&hctx->sched_tags, nr);
 			if (ret)
 				goto out;
 		}
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 08/16] blk-mq: fix blk_mq_tags double free while nr_requests grown
  2025-08-14  3:35 [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Yu Kuai
                   ` (6 preceding siblings ...)
  2025-08-14  3:35 ` [PATCH 07/16] blk-mq: split bitmap grow and resize " Yu Kuai
@ 2025-08-14  3:35 ` Yu Kuai
  2025-08-14  8:20   ` Ming Lei
  2025-08-14 12:15   ` Nilay Shroff
  2025-08-14  3:35 ` [PATCH 09/16] block: convert nr_requests to unsigned int Yu Kuai
                   ` (8 subsequent siblings)
  16 siblings, 2 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-14  3:35 UTC (permalink / raw)
  To: axboe, yukuai3, bvanassche, nilay, hare, ming.lei
  Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

In the case user trigger tags grow by queue sysfs attribute nr_requests,
hctx->sched_tags will be freed directly and replaced with a new
allocated tags, see blk_mq_tag_update_depth().

The problem is that hctx->sched_tags is from elevator->et->tags, while
et->tags is still the freed tags, hence later elevator exist will try to
free the tags again, causing kernel panic.

Fix this problem by using new halper blk_mq_alloc_sched_tags() to
allocate a new sched_tags. Meanwhile, there is a longterm problem can be
fixed as well:

If blk_mq_tag_update_depth() succeed for previous hctx, then bitmap depth
is updated, however, if following hctx failed, q->nr_requests is not
updated and the previous hctx->sched_tags endup bigger than q->nr_requests.

Fixes: f5a6604f7a44 ("block: fix lockdep warning caused by lock dependency in elv_iosched_store")
Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq.c | 31 ++++++++++++++++++++-----------
 1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a7d6a20c1524..f1c11f591c27 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4917,6 +4917,23 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 }
 EXPORT_SYMBOL(blk_mq_free_tag_set);
 
+static int blk_mq_sched_grow_tags(struct request_queue *q, unsigned int nr)
+{
+	struct elevator_tags *et =
+		blk_mq_alloc_sched_tags(q->tag_set, q->nr_hw_queues, nr);
+	struct blk_mq_hw_ctx *hctx;
+	unsigned long i;
+
+	if (!et)
+		return -ENOMEM;
+
+	blk_mq_free_sched_tags(q->elevator->et, q->tag_set);
+	queue_for_each_hw_ctx(q, hctx, i)
+		hctx->sched_tags = et->tags[i];
+	q->elevator->et = et;
+	return 0;
+}
+
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
@@ -4940,17 +4957,9 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 			sbitmap_queue_resize(&hctx->sched_tags->bitmap_tags,
 				nr - hctx->sched_tags->nr_reserved_tags);
 	} else {
-		queue_for_each_hw_ctx(q, hctx, i) {
-			/*
-			 * If we're using an MQ scheduler, just update the
-			 * scheduler queue depth. This is similar to what the
-			 * old code would do.
-			 */
-			ret = blk_mq_tag_update_depth(hctx,
-					&hctx->sched_tags, nr);
-			if (ret)
-				goto out;
-		}
+		ret = blk_mq_sched_grow_tags(q, nr);
+		if (ret)
+			goto out;
 	}
 
 	q->nr_requests = nr;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 09/16] block: convert nr_requests to unsigned int
  2025-08-14  3:35 [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Yu Kuai
                   ` (7 preceding siblings ...)
  2025-08-14  3:35 ` [PATCH 08/16] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
@ 2025-08-14  3:35 ` Yu Kuai
  2025-08-14  3:35 ` [PATCH 10/16] blk-mq-sched: unify elevators checking for async requests Yu Kuai
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-14  3:35 UTC (permalink / raw)
  To: axboe, yukuai3, bvanassche, nilay, hare, ming.lei
  Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

This value represents the number of requests for elevator tags, or drivers
tags if elevator is none. The max value for elevator tags is 2048, and
in drivers at most 16 bits is used for tag.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 include/linux/blkdev.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 95886b404b16..ad5087d5cade 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -541,7 +541,7 @@ struct request_queue {
 	/*
 	 * queue settings
 	 */
-	unsigned long		nr_requests;	/* Max # of requests */
+	unsigned int		nr_requests;	/* Max # of requests */
 
 #ifdef CONFIG_BLK_INLINE_ENCRYPTION
 	struct blk_crypto_profile *crypto_profile;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 10/16] blk-mq-sched: unify elevators checking for async requests
  2025-08-14  3:35 [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Yu Kuai
                   ` (8 preceding siblings ...)
  2025-08-14  3:35 ` [PATCH 09/16] block: convert nr_requests to unsigned int Yu Kuai
@ 2025-08-14  3:35 ` Yu Kuai
  2025-08-14  3:35 ` [PATCH 11/16] blk-mq: add a new queue sysfs attribute async_depth Yu Kuai
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-14  3:35 UTC (permalink / raw)
  To: axboe, yukuai3, bvanassche, nilay, hare, ming.lei
  Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

bfq and mq-deadline consider sync writes as async requests and only
resver tags for sync reads by async_depth, however, kyber doesn't
consider sync writes as async requests.

Consider the case there are lots of dirty pages, and user do fsync,
in this case sched_tags can be exhausted by sync writes and sync reads
can stuck waiting for tag. Hence let kyber follow what mq-deadline and
bfq did, and unify async requests checking for all elevators.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/bfq-iosched.c   | 2 +-
 block/blk-mq-sched.h  | 5 +++++
 block/kyber-iosched.c | 2 +-
 block/mq-deadline.c   | 2 +-
 4 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index c0c398998aa1..de0dee255ccf 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -697,7 +697,7 @@ static void bfq_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
 	unsigned int limit, act_idx;
 
 	/* Sync reads have full depth available */
-	if (op_is_sync(opf) && !op_is_write(opf))
+	if (blk_mq_sched_sync_request(opf))
 		limit = data->q->nr_requests;
 	else
 		limit = bfqd->async_depths[!!bfqd->wr_busy_queues][op_is_sync(opf)];
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 8e21a6b1415d..ae747f9053c7 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -103,4 +103,9 @@ static inline void blk_mq_set_min_shallow_depth(struct request_queue *q,
 						depth);
 }
 
+static inline bool blk_mq_sched_sync_request(blk_opf_t opf)
+{
+	return op_is_sync(opf) && !op_is_write(opf);
+}
+
 #endif
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 49ae52aa20d9..b3df807044c3 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -543,7 +543,7 @@ static void kyber_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
 	 * We use the scheduler tags as per-hardware queue queueing tokens.
 	 * Async requests can be limited at this stage.
 	 */
-	if (!op_is_sync(opf)) {
+	if (!blk_mq_sched_sync_request(opf)) {
 		struct kyber_queue_data *kqd = data->q->elevator->elevator_data;
 
 		data->shallow_depth = kqd->async_depth;
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 578bc79c5654..1825173d82a6 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -496,7 +496,7 @@ static void dd_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
 	struct deadline_data *dd = data->q->elevator->elevator_data;
 
 	/* Do not throttle synchronous reads. */
-	if (op_is_sync(opf) && !op_is_write(opf))
+	if (blk_mq_sched_sync_request(opf))
 		return;
 
 	/*
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 11/16] blk-mq: add a new queue sysfs attribute async_depth
  2025-08-14  3:35 [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Yu Kuai
                   ` (9 preceding siblings ...)
  2025-08-14  3:35 ` [PATCH 10/16] blk-mq-sched: unify elevators checking for async requests Yu Kuai
@ 2025-08-14  3:35 ` Yu Kuai
  2025-08-14  3:35 ` [PATCH 12/16] kyber: covert to use request_queue->async_depth Yu Kuai
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-14  3:35 UTC (permalink / raw)
  To: axboe, yukuai3, bvanassche, nilay, hare, ming.lei
  Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Add a new field async_depth to request_queue and related APIs, this is
currently not used, following patches will convert elevators to use
this instead of internal async_depth.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-core.c       |  1 +
 block/blk-mq.c         | 24 ++++++++++++++++++---
 block/blk-sysfs.c      | 47 ++++++++++++++++++++++++++++++++++++++++++
 block/elevator.c       |  1 +
 include/linux/blkdev.h |  1 +
 5 files changed, 71 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fdac48aec5ef..443056be1c4c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -463,6 +463,7 @@ struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id)
 	fs_reclaim_release(GFP_KERNEL);
 
 	q->nr_requests = BLKDEV_DEFAULT_RQ;
+	q->async_depth = BLKDEV_DEFAULT_RQ;
 
 	return q;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f1c11f591c27..699f7a2a36e5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -520,6 +520,8 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
 			data->rq_flags |= RQF_USE_SCHED;
 			if (ops->limit_depth)
 				ops->limit_depth(data->cmd_flags, data);
+			else if (!blk_mq_sched_sync_request(data->cmd_flags))
+				data->shallow_depth = q->async_depth;
 		}
 	} else {
 		blk_mq_tag_busy(data->hctx);
@@ -4606,6 +4608,7 @@ int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	spin_lock_init(&q->requeue_lock);
 
 	q->nr_requests = set->queue_depth;
+	q->async_depth = set->queue_depth;
 
 	blk_mq_init_cpu_queues(q, set->nr_hw_queues);
 	blk_mq_map_swqueue(q);
@@ -4934,6 +4937,23 @@ static int blk_mq_sched_grow_tags(struct request_queue *q, unsigned int nr)
 	return 0;
 }
 
+static void __blk_mq_update_nr_requests(struct request_queue *q,
+					unsigned int nr)
+{
+	unsigned int old_nr = q->nr_requests;
+
+	q->nr_requests = nr;
+	if (!q->elevator) {
+		q->async_depth = nr;
+		return;
+	}
+
+	/* keep the percentage of async requests */
+	q->async_depth = max(q->async_depth * nr / old_nr, 1);
+	if (q->elevator->type->ops.depth_updated)
+		q->elevator->type->ops.depth_updated(q);
+}
+
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
@@ -4962,9 +4982,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 			goto out;
 	}
 
-	q->nr_requests = nr;
-	if (q->elevator && q->elevator->type->ops.depth_updated)
-		q->elevator->type->ops.depth_updated(q);
+	__blk_mq_update_nr_requests(q, nr);
 
 out:
 	blk_mq_unquiesce_queue(q);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f3d08edcc34f..8f55730f06c6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -99,6 +99,51 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
 	return ret;
 }
 
+static ssize_t queue_async_depth_show(struct gendisk *disk, char *page)
+{
+	ssize_t ret;
+
+	mutex_lock(&disk->queue->elevator_lock);
+	ret = queue_var_show(disk->queue->async_depth, page);
+	mutex_unlock(&disk->queue->elevator_lock);
+	return ret;
+}
+
+static ssize_t
+queue_async_depth_store(struct gendisk *disk, const char *page, size_t count)
+{
+	struct request_queue *q = disk->queue;
+	unsigned int memflags;
+	unsigned long nr;
+	int ret;
+
+	if (!queue_is_mq(q))
+		return -EINVAL;
+
+	ret = queue_var_store(&nr, page, count);
+	if (ret < 0)
+		return ret;
+
+	if (nr == 0)
+		return -EINVAL;
+
+	memflags = blk_mq_freeze_queue(q);
+	mutex_lock(&q->elevator_lock);
+
+	if (q->elevator) {
+		q->async_depth = min(q->nr_requests, nr);
+		if (q->elevator->type->ops.depth_updated)
+			q->elevator->type->ops.depth_updated(q);
+	} else {
+		ret = -EINVAL;
+	}
+
+	mutex_unlock(&q->elevator_lock);
+	blk_mq_unfreeze_queue(q, memflags);
+
+	return ret;
+}
+
 static ssize_t queue_ra_show(struct gendisk *disk, char *page)
 {
 	ssize_t ret;
@@ -514,6 +559,7 @@ static struct queue_sysfs_entry _prefix##_entry = {	\
 }
 
 QUEUE_RW_ENTRY(queue_requests, "nr_requests");
+QUEUE_RW_ENTRY(queue_async_depth, "async_depth");
 QUEUE_RW_ENTRY(queue_ra, "read_ahead_kb");
 QUEUE_LIM_RW_ENTRY(queue_max_sectors, "max_sectors_kb");
 QUEUE_LIM_RO_ENTRY(queue_max_hw_sectors, "max_hw_sectors_kb");
@@ -736,6 +782,7 @@ static struct attribute *blk_mq_queue_attrs[] = {
 	 */
 	&elv_iosched_entry.attr,
 	&queue_requests_entry.attr,
+	&queue_async_depth_entry.attr,
 #ifdef CONFIG_BLK_WBT
 	&queue_wb_lat_entry.attr,
 #endif
diff --git a/block/elevator.c b/block/elevator.c
index f8a04f32cbcf..6bdb05d2500d 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -601,6 +601,7 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
 		blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q);
 		q->elevator = NULL;
 		q->nr_requests = q->tag_set->queue_depth;
+		q->async_depth = q->tag_set->queue_depth;
 	}
 	blk_add_trace_msg(q, "elv switch: %s", ctx->name);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ad5087d5cade..2e33298fcc15 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -542,6 +542,7 @@ struct request_queue {
 	 * queue settings
 	 */
 	unsigned int		nr_requests;	/* Max # of requests */
+	unsigned int		async_depth;	/* Max # of async requests */
 
 #ifdef CONFIG_BLK_INLINE_ENCRYPTION
 	struct blk_crypto_profile *crypto_profile;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 12/16] kyber: covert to use request_queue->async_depth
  2025-08-14  3:35 [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Yu Kuai
                   ` (10 preceding siblings ...)
  2025-08-14  3:35 ` [PATCH 11/16] blk-mq: add a new queue sysfs attribute async_depth Yu Kuai
@ 2025-08-14  3:35 ` Yu Kuai
  2025-08-14  3:35 ` [PATCH 13/16] mq-deadline: " Yu Kuai
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-14  3:35 UTC (permalink / raw)
  To: axboe, yukuai3, bvanassche, nilay, hare, ming.lei
  Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Instead of the internal async_depth, remove kqd->async_depth and related
helpers, also remove limit_depth() method that is useless now.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/kyber-iosched.c | 36 +++---------------------------------
 1 file changed, 3 insertions(+), 33 deletions(-)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index b3df807044c3..2b38ee46140f 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -47,9 +47,8 @@ enum {
 	 * asynchronous requests, we reserve 25% of requests for synchronous
 	 * operations.
 	 */
-	KYBER_ASYNC_PERCENT = 75,
+	KYBER_DEFAULT_ASYNC_PERCENT = 75,
 };
-
 /*
  * Maximum device-wide depth for each scheduling domain.
  *
@@ -157,9 +156,6 @@ struct kyber_queue_data {
 	 */
 	struct sbitmap_queue domain_tokens[KYBER_NUM_DOMAINS];
 
-	/* Number of allowed async requests. */
-	unsigned int async_depth;
-
 	struct kyber_cpu_latency __percpu *cpu_latency;
 
 	/* Timer for stats aggregation and adjusting domain tokens. */
@@ -413,6 +409,7 @@ static int kyber_init_sched(struct request_queue *q, struct elevator_queue *eq)
 
 	eq->elevator_data = kqd;
 	q->elevator = eq;
+	q->async_depth = q->nr_requests * KYBER_DEFAULT_ASYNC_PERCENT / 100;
 
 	return 0;
 }
@@ -442,10 +439,7 @@ static void kyber_ctx_queue_init(struct kyber_ctx_queue *kcq)
 
 static void kyber_depth_updated(struct request_queue *q)
 {
-	struct kyber_queue_data *kqd = q->elevator->elevator_data;
-
-	kqd->async_depth = q->nr_requests * KYBER_ASYNC_PERCENT / 100U;
-	blk_mq_set_min_shallow_depth(q, kqd->async_depth);
+	blk_mq_set_min_shallow_depth(q, q->async_depth);
 }
 
 static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
@@ -537,19 +531,6 @@ static void rq_clear_domain_token(struct kyber_queue_data *kqd,
 	}
 }
 
-static void kyber_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
-{
-	/*
-	 * We use the scheduler tags as per-hardware queue queueing tokens.
-	 * Async requests can be limited at this stage.
-	 */
-	if (!blk_mq_sched_sync_request(opf)) {
-		struct kyber_queue_data *kqd = data->q->elevator->elevator_data;
-
-		data->shallow_depth = kqd->async_depth;
-	}
-}
-
 static bool kyber_bio_merge(struct request_queue *q, struct bio *bio,
 		unsigned int nr_segs)
 {
@@ -943,15 +924,6 @@ KYBER_DEBUGFS_DOMAIN_ATTRS(KYBER_DISCARD, discard)
 KYBER_DEBUGFS_DOMAIN_ATTRS(KYBER_OTHER, other)
 #undef KYBER_DEBUGFS_DOMAIN_ATTRS
 
-static int kyber_async_depth_show(void *data, struct seq_file *m)
-{
-	struct request_queue *q = data;
-	struct kyber_queue_data *kqd = q->elevator->elevator_data;
-
-	seq_printf(m, "%u\n", kqd->async_depth);
-	return 0;
-}
-
 static int kyber_cur_domain_show(void *data, struct seq_file *m)
 {
 	struct blk_mq_hw_ctx *hctx = data;
@@ -977,7 +949,6 @@ static const struct blk_mq_debugfs_attr kyber_queue_debugfs_attrs[] = {
 	KYBER_QUEUE_DOMAIN_ATTRS(write),
 	KYBER_QUEUE_DOMAIN_ATTRS(discard),
 	KYBER_QUEUE_DOMAIN_ATTRS(other),
-	{"async_depth", 0400, kyber_async_depth_show},
 	{},
 };
 #undef KYBER_QUEUE_DOMAIN_ATTRS
@@ -1003,7 +974,6 @@ static struct elevator_type kyber_sched = {
 		.exit_sched = kyber_exit_sched,
 		.init_hctx = kyber_init_hctx,
 		.exit_hctx = kyber_exit_hctx,
-		.limit_depth = kyber_limit_depth,
 		.bio_merge = kyber_bio_merge,
 		.prepare_request = kyber_prepare_request,
 		.insert_requests = kyber_insert_requests,
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 13/16] mq-deadline: covert to use request_queue->async_depth
  2025-08-14  3:35 [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Yu Kuai
                   ` (11 preceding siblings ...)
  2025-08-14  3:35 ` [PATCH 12/16] kyber: covert to use request_queue->async_depth Yu Kuai
@ 2025-08-14  3:35 ` Yu Kuai
  2025-08-14  3:35 ` [PATCH 14/16] block, bfq: convert " Yu Kuai
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-14  3:35 UTC (permalink / raw)
  To: axboe, yukuai3, bvanassche, nilay, hare, ming.lei
  Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

In downstream kernel, we test with mq-deadline with many fio workloads, and
we found a performance regression after commit 39823b47bbd4
("block/mq-deadline: Fix the tag reservation code") with following test:

[global]
rw=randread
direct=1
ramp_time=1
ioengine=libaio
iodepth=1024
numjobs=24
bs=1024k
group_reporting=1
runtime=60

[job1]
filename=/dev/sda

Root cause is that mq-deadline now support configuring async_depth,
although the default value is nr_request, however the minimal value is
1, hence min_shallow_depth is set to 1, causing wake_batch to be 1. For
consequence, sbitmap_queue will be waken up after each IO instead of
8 IO.

In this test case, sda is HDD and max_sectors is 128k, hence each
submitted 1M io will be splited into 8 sequential 128k requests, however
due to there are 24 jobs and total tags are exhausted, the 8 requests are
unlikely to be dispatched sequentially, and changing wake_batch to 1
will make this much worse, accounting blktrace D stage, the percentage
of sequential io is decreased from 8% to 0.8%.

Fix this problem by converting to request_queue->async_depth, where
min_shallow_depth is set each time async_depth is updated.

Fixes: 39823b47bbd4 ("block/mq-deadline: Fix the tag reservation code")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/mq-deadline.c | 42 +++---------------------------------------
 1 file changed, 3 insertions(+), 39 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 1825173d82a6..970e1541b7b7 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -98,7 +98,6 @@ struct deadline_data {
 	int fifo_batch;
 	int writes_starved;
 	int front_merges;
-	u32 async_depth;
 	int prio_aging_expire;
 
 	spinlock_t lock;
@@ -487,32 +486,10 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	return rq;
 }
 
-/*
- * Called by __blk_mq_alloc_request(). The shallow_depth value set by this
- * function is used by __blk_mq_get_tag().
- */
-static void dd_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
-{
-	struct deadline_data *dd = data->q->elevator->elevator_data;
-
-	/* Do not throttle synchronous reads. */
-	if (blk_mq_sched_sync_request(opf))
-		return;
-
-	/*
-	 * Throttle asynchronous requests and writes such that these requests
-	 * do not block the allocation of synchronous requests.
-	 */
-	data->shallow_depth = dd->async_depth;
-}
-
-/* Called by blk_mq_update_nr_requests(). */
+/* Called by blk_mq_init_sched() and blk_mq_update_nr_requests(). */
 static void dd_depth_updated(struct request_queue *q)
 {
-	struct deadline_data *dd = q->elevator->elevator_data;
-
-	dd->async_depth = q->nr_requests;
-	blk_mq_set_min_shallow_depth(q, 1);
+	blk_mq_set_min_shallow_depth(q, q->async_depth);
 }
 
 static void dd_exit_sched(struct elevator_queue *e)
@@ -577,6 +554,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_queue *eq)
 	blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q);
 
 	q->elevator = eq;
+	q->async_depth = q->nr_requests;
 	return 0;
 }
 
@@ -761,7 +739,6 @@ SHOW_JIFFIES(deadline_write_expire_show, dd->fifo_expire[DD_WRITE]);
 SHOW_JIFFIES(deadline_prio_aging_expire_show, dd->prio_aging_expire);
 SHOW_INT(deadline_writes_starved_show, dd->writes_starved);
 SHOW_INT(deadline_front_merges_show, dd->front_merges);
-SHOW_INT(deadline_async_depth_show, dd->async_depth);
 SHOW_INT(deadline_fifo_batch_show, dd->fifo_batch);
 #undef SHOW_INT
 #undef SHOW_JIFFIES
@@ -791,7 +768,6 @@ STORE_JIFFIES(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MA
 STORE_JIFFIES(deadline_prio_aging_expire_store, &dd->prio_aging_expire, 0, INT_MAX);
 STORE_INT(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX);
 STORE_INT(deadline_front_merges_store, &dd->front_merges, 0, 1);
-STORE_INT(deadline_async_depth_store, &dd->async_depth, 1, INT_MAX);
 STORE_INT(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX);
 #undef STORE_FUNCTION
 #undef STORE_INT
@@ -805,7 +781,6 @@ static const struct elv_fs_entry deadline_attrs[] = {
 	DD_ATTR(write_expire),
 	DD_ATTR(writes_starved),
 	DD_ATTR(front_merges),
-	DD_ATTR(async_depth),
 	DD_ATTR(fifo_batch),
 	DD_ATTR(prio_aging_expire),
 	__ATTR_NULL
@@ -892,15 +867,6 @@ static int deadline_starved_show(void *data, struct seq_file *m)
 	return 0;
 }
 
-static int dd_async_depth_show(void *data, struct seq_file *m)
-{
-	struct request_queue *q = data;
-	struct deadline_data *dd = q->elevator->elevator_data;
-
-	seq_printf(m, "%u\n", dd->async_depth);
-	return 0;
-}
-
 static int dd_queued_show(void *data, struct seq_file *m)
 {
 	struct request_queue *q = data;
@@ -1010,7 +976,6 @@ static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
 	DEADLINE_NEXT_RQ_ATTR(write2),
 	{"batching", 0400, deadline_batching_show},
 	{"starved", 0400, deadline_starved_show},
-	{"async_depth", 0400, dd_async_depth_show},
 	{"dispatch0", 0400, .seq_ops = &deadline_dispatch0_seq_ops},
 	{"dispatch1", 0400, .seq_ops = &deadline_dispatch1_seq_ops},
 	{"dispatch2", 0400, .seq_ops = &deadline_dispatch2_seq_ops},
@@ -1024,7 +989,6 @@ static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
 static struct elevator_type mq_deadline = {
 	.ops = {
 		.depth_updated		= dd_depth_updated,
-		.limit_depth		= dd_limit_depth,
 		.insert_requests	= dd_insert_requests,
 		.dispatch_request	= dd_dispatch_request,
 		.prepare_request	= dd_prepare_request,
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 14/16] block, bfq: convert to use request_queue->async_depth
  2025-08-14  3:35 [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Yu Kuai
                   ` (12 preceding siblings ...)
  2025-08-14  3:35 ` [PATCH 13/16] mq-deadline: " Yu Kuai
@ 2025-08-14  3:35 ` Yu Kuai
  2025-08-14  3:35 ` [PATCH 15/16] blk-mq: fix stale nr_requests documentation Yu Kuai
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-14  3:35 UTC (permalink / raw)
  To: axboe, yukuai3, bvanassche, nilay, hare, ming.lei
  Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

The default limits is unchanged, and user can configure async_depth now.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/bfq-iosched.c | 43 +++++++++++++++++--------------------------
 1 file changed, 17 insertions(+), 26 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index de0dee255ccf..46520e50f584 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7112,39 +7112,29 @@ void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
 static void bfq_depth_updated(struct request_queue *q)
 {
 	struct bfq_data *bfqd = q->elevator->elevator_data;
-	unsigned int nr_requests = q->nr_requests;
+	unsigned int async_depth = q->async_depth;
 
 	/*
-	 * In-word depths if no bfq_queue is being weight-raised:
-	 * leaving 25% of tags only for sync reads.
+	 * By default:
+	 *  - sync reads are not limited
+	 * If bfqq is not being weight-raised:
+	 *  - sync writes are limited to 75%(async depth default value)
+	 *  - async IO are limited to 50%
+	 * If bfqq is being weight-raised:
+	 *  - sync writes are limited to ~37%
+	 *  - async IO are limited to ~18
 	 *
-	 * In next formulas, right-shift the value
-	 * (1U<<bt->sb.shift), instead of computing directly
-	 * (1U<<(bt->sb.shift - something)), to be robust against
-	 * any possible value of bt->sb.shift, without having to
-	 * limit 'something'.
+	 * If request_queue->async_depth is updated by user, all limit are
+	 * updated relatively.
 	 */
-	/* no more than 50% of tags for async I/O */
-	bfqd->async_depths[0][0] = max(nr_requests >> 1, 1U);
-	/*
-	 * no more than 75% of tags for sync writes (25% extra tags
-	 * w.r.t. async I/O, to prevent async I/O from starving sync
-	 * writes)
-	 */
-	bfqd->async_depths[0][1] = max((nr_requests * 3) >> 2, 1U);
+	bfqd->async_depths[0][1] = async_depth;
+	bfqd->async_depths[0][0] = max(async_depth * 2 / 3, 1U);
+	bfqd->async_depths[1][1] = max(async_depth >> 1, 1U);
+	bfqd->async_depths[1][0] = max(async_depth >> 2, 1U);
 
 	/*
-	 * In-word depths in case some bfq_queue is being weight-
-	 * raised: leaving ~63% of tags for sync reads. This is the
-	 * highest percentage for which, in our tests, application
-	 * start-up times didn't suffer from any regression due to tag
-	 * shortage.
+	 * Due to cgroup qos, the allowed request for bfqq might be 1
 	 */
-	/* no more than ~18% of tags for async I/O */
-	bfqd->async_depths[1][0] = max((nr_requests * 3) >> 4, 1U);
-	/* no more than ~37% of tags for sync writes (~20% extra tags) */
-	bfqd->async_depths[1][1] = max((nr_requests * 6) >> 4, 1U);
-
 	blk_mq_set_min_shallow_depth(q, 1);
 }
 
@@ -7364,6 +7354,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
 	blk_queue_flag_set(QUEUE_FLAG_DISABLE_WBT_DEF, q);
 	wbt_disable_default(q->disk);
 	blk_stat_enable_accounting(q);
+	q->async_depth = (q->nr_requests * 3) >> 2;
 
 	return 0;
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 15/16] blk-mq: fix stale nr_requests documentation
  2025-08-14  3:35 [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Yu Kuai
                   ` (13 preceding siblings ...)
  2025-08-14  3:35 ` [PATCH 14/16] block, bfq: convert " Yu Kuai
@ 2025-08-14  3:35 ` Yu Kuai
  2025-08-14  3:35 ` [PATCH 16/16] blk-mq: add documentation for new queue attribute async_dpeth Yu Kuai
  2025-08-14  7:54 ` [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Ming Lei
  16 siblings, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-14  3:35 UTC (permalink / raw)
  To: axboe, yukuai3, bvanassche, nilay, hare, ming.lei
  Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

The nr_requests documentation is still the removed single queue, remove
it and update to current blk-mq.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 Documentation/ABI/stable/sysfs-block | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 0ddffc9133d0..0ed10aeff86b 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -603,16 +603,10 @@ Date:		July 2003
 Contact:	linux-block@vger.kernel.org
 Description:
 		[RW] This controls how many requests may be allocated in the
-		block layer for read or write requests. Note that the total
-		allocated number may be twice this amount, since it applies only
-		to reads or writes (not the accumulated sum).
-
-		To avoid priority inversion through request starvation, a
-		request queue maintains a separate request pool per each cgroup
-		when CONFIG_BLK_CGROUP is enabled, and this parameter applies to
-		each such per-block-cgroup request pool.  IOW, if there are N
-		block cgroups, each request queue may have up to N request
-		pools, each independently regulated by nr_requests.
+		block layer. Noted this value only represents the quantity for a
+		single blk_mq_tags instance. The actual number for the entire
+		device depends on the hardware queue count, whether elevator is
+		enabled, and whether tags are shared.
 
 
 What:		/sys/block/<disk>/queue/nr_zones
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 16/16] blk-mq: add documentation for new queue attribute async_dpeth
  2025-08-14  3:35 [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Yu Kuai
                   ` (14 preceding siblings ...)
  2025-08-14  3:35 ` [PATCH 15/16] blk-mq: fix stale nr_requests documentation Yu Kuai
@ 2025-08-14  3:35 ` Yu Kuai
  2025-08-14  7:54 ` [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Ming Lei
  16 siblings, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-14  3:35 UTC (permalink / raw)
  To: axboe, yukuai3, bvanassche, nilay, hare, ming.lei
  Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 Documentation/ABI/stable/sysfs-block | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 0ed10aeff86b..09b9b3db9a1f 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -609,6 +609,16 @@ Description:
 		enabled, and whether tags are shared.
 
 
+What:		/sys/block/<disk>/queue/async_depth
+Date:		August 2025
+Contact:	linux-block@vger.kernel.org
+Description:
+		[RW] This controls how many async requests may be allocated in the
+		block layer. If elevator is none, then this value is nr_requests.
+		By default, this value is 75% of nr_requests for bfq and kyber,
+		abd nr_requests for mq-deadline.
+
+
 What:		/sys/block/<disk>/queue/nr_zones
 Date:		November 2018
 Contact:	Damien Le Moal <damien.lemoal@wdc.com>
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth
  2025-08-14  3:35 [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Yu Kuai
                   ` (15 preceding siblings ...)
  2025-08-14  3:35 ` [PATCH 16/16] blk-mq: add documentation for new queue attribute async_dpeth Yu Kuai
@ 2025-08-14  7:54 ` Ming Lei
  2025-08-14  8:22   ` Yu Kuai
  16 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2025-08-14  7:54 UTC (permalink / raw)
  To: Yu Kuai
  Cc: axboe, yukuai3, bvanassche, nilay, hare, linux-block,
	linux-kernel, yi.zhang, yangerkun, johnny.chenyi

On Thu, Aug 14, 2025 at 11:35:06AM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Backgroud and motivation:
> 
> At first, we test a performance regression from 5.10 to 6.6 in
> downstream kernel(described in patch 13), the regression is related to
> async_depth in mq-dealine.
> 
> While trying to fix this regression, Bart suggests add a new attribute
> to request_queue, and I think this is a good idea because all elevators
> have similar logical, however only mq-deadline allow user to configure
> async_depth. And this is patch 9-16, where the performance problem is
> fixed in patch 13;
> 
> Because async_depth is related to nr_requests, while reviewing related
> code, patch 2-7 are cleanups and fixes to nr_reqeusts.
> 
> I was planning to send this set for the next merge window, however,
> during test I found the last block pr(6.17-rc1) introduce a regression
> if nr_reqeusts grows, exit elevator will panic, and I fix this by
> patch 1,8.

Please split the patchset into two:

- one is for fixing recent regression on updating 'nr_requests', so this
  can be merged to v6.17, and be backport easily for stable & downstream

- another one is for improving IO performance related with async_depth.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 01/16] blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags()
  2025-08-14  3:35 ` [PATCH 01/16] blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags() Yu Kuai
@ 2025-08-14  8:16   ` Ming Lei
  2025-08-14  8:55     ` Yu Kuai
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2025-08-14  8:16 UTC (permalink / raw)
  To: Yu Kuai
  Cc: axboe, yukuai3, bvanassche, nilay, hare, linux-block,
	linux-kernel, yi.zhang, yangerkun, johnny.chenyi

On Thu, Aug 14, 2025 at 11:35:07AM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> This helper only support to iallocate the default number of requests,
> add a new parameter to support specific number of requests.
> 
> Prepare to fix tags double free problem if nr_requests is grown by
> queue sysfs attribute nr_requests.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  block/blk-mq-sched.c | 11 +++++++----
>  block/blk-mq-sched.h |  2 +-
>  block/elevator.c     |  2 +-
>  3 files changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index e2ce4a28e6c9..9a8a0b5e04a9 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -454,7 +454,7 @@ void blk_mq_free_sched_tags_batch(struct xarray *et_table,
>  }
>  
>  struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
> -		unsigned int nr_hw_queues)
> +		unsigned int nr_hw_queues, unsigned int nr_requests)
>  {
>  	unsigned int nr_tags;
>  	int i;
> @@ -475,8 +475,11 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
>  	 * 128, since we don't split into sync/async like the old code
>  	 * did. Additionally, this is a per-hw queue depth.
>  	 */
> -	et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
> -			BLKDEV_DEFAULT_RQ);
> +	if (nr_requests)
> +		et->nr_requests = nr_requests;
> +	else
> +		et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
> +				BLKDEV_DEFAULT_RQ);

It looks more readable to add helper blk_mq_default_nr_requests(),
and pass it from call sites directly, then people won't be confused
with the passed zero `nr_requests`.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 08/16] blk-mq: fix blk_mq_tags double free while nr_requests grown
  2025-08-14  3:35 ` [PATCH 08/16] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
@ 2025-08-14  8:20   ` Ming Lei
  2025-08-14 12:15   ` Nilay Shroff
  1 sibling, 0 replies; 30+ messages in thread
From: Ming Lei @ 2025-08-14  8:20 UTC (permalink / raw)
  To: Yu Kuai
  Cc: axboe, yukuai3, bvanassche, nilay, hare, linux-block,
	linux-kernel, yi.zhang, yangerkun, johnny.chenyi

On Thu, Aug 14, 2025 at 11:35:14AM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> In the case user trigger tags grow by queue sysfs attribute nr_requests,
> hctx->sched_tags will be freed directly and replaced with a new
> allocated tags, see blk_mq_tag_update_depth().
> 
> The problem is that hctx->sched_tags is from elevator->et->tags, while
> et->tags is still the freed tags, hence later elevator exist will try to
> free the tags again, causing kernel panic.
> 
> Fix this problem by using new halper blk_mq_alloc_sched_tags() to
> allocate a new sched_tags. Meanwhile, there is a longterm problem can be
> fixed as well:
> 
> If blk_mq_tag_update_depth() succeed for previous hctx, then bitmap depth
> is updated, however, if following hctx failed, q->nr_requests is not
> updated and the previous hctx->sched_tags endup bigger than q->nr_requests.
> 
> Fixes: f5a6604f7a44 ("block: fix lockdep warning caused by lock dependency in elv_iosched_store")
> Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs")
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  block/blk-mq.c | 31 ++++++++++++++++++++-----------
>  1 file changed, 20 insertions(+), 11 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index a7d6a20c1524..f1c11f591c27 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4917,6 +4917,23 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
>  }
>  EXPORT_SYMBOL(blk_mq_free_tag_set);
>  
> +static int blk_mq_sched_grow_tags(struct request_queue *q, unsigned int nr)
> +{
> +	struct elevator_tags *et =
> +		blk_mq_alloc_sched_tags(q->tag_set, q->nr_hw_queues, nr);
> +	struct blk_mq_hw_ctx *hctx;
> +	unsigned long i;
> +
> +	if (!et)
> +		return -ENOMEM;
> +
> +	blk_mq_free_sched_tags(q->elevator->et, q->tag_set);
> +	queue_for_each_hw_ctx(q, hctx, i)
> +		hctx->sched_tags = et->tags[i];
> +	q->elevator->et = et;
> +	return 0;
> +}

It depends on protection from elevator_lock, so probably it is
helpful by adding lockdep_assert_held(&q->elevator_lock), otherwise
this fix looks fine:

Reviewed-by: Ming Lei <ming.lei@redhat.com>


Thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth
  2025-08-14  7:54 ` [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Ming Lei
@ 2025-08-14  8:22   ` Yu Kuai
  2025-08-14  8:27     ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Kuai @ 2025-08-14  8:22 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: axboe, bvanassche, nilay, hare, linux-block, linux-kernel,
	yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/08/14 15:54, Ming Lei 写道:
> On Thu, Aug 14, 2025 at 11:35:06AM +0800, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Backgroud and motivation:
>>
>> At first, we test a performance regression from 5.10 to 6.6 in
>> downstream kernel(described in patch 13), the regression is related to
>> async_depth in mq-dealine.
>>
>> While trying to fix this regression, Bart suggests add a new attribute
>> to request_queue, and I think this is a good idea because all elevators
>> have similar logical, however only mq-deadline allow user to configure
>> async_depth. And this is patch 9-16, where the performance problem is
>> fixed in patch 13;
>>
>> Because async_depth is related to nr_requests, while reviewing related
>> code, patch 2-7 are cleanups and fixes to nr_reqeusts.
>>
>> I was planning to send this set for the next merge window, however,
>> during test I found the last block pr(6.17-rc1) introduce a regression
>> if nr_reqeusts grows, exit elevator will panic, and I fix this by
>> patch 1,8.
> 
> Please split the patchset into two:
> 
> - one is for fixing recent regression on updating 'nr_requests', so this
>    can be merged to v6.17, and be backport easily for stable & downstream

There are actually two regressions, as fixed by patch 5 and patch 8, how
about the first patchset for patch 1-8? Are you good with those minor
prep cleanup patches?

> 
> - another one is for improving IO performance related with async_depth.

Sure, although patch 9-16 is still a performance regression :)

Thanks,
Kuai

> 
> 
> Thanks,
> Ming
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth
  2025-08-14  8:22   ` Yu Kuai
@ 2025-08-14  8:27     ` Ming Lei
  2025-08-14  8:57       ` Yu Kuai
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2025-08-14  8:27 UTC (permalink / raw)
  To: Yu Kuai
  Cc: axboe, bvanassche, nilay, hare, linux-block, linux-kernel,
	yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

On Thu, Aug 14, 2025 at 04:22:27PM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2025/08/14 15:54, Ming Lei 写道:
> > On Thu, Aug 14, 2025 at 11:35:06AM +0800, Yu Kuai wrote:
> > > From: Yu Kuai <yukuai3@huawei.com>
> > > 
> > > Backgroud and motivation:
> > > 
> > > At first, we test a performance regression from 5.10 to 6.6 in
> > > downstream kernel(described in patch 13), the regression is related to
> > > async_depth in mq-dealine.
> > > 
> > > While trying to fix this regression, Bart suggests add a new attribute
> > > to request_queue, and I think this is a good idea because all elevators
> > > have similar logical, however only mq-deadline allow user to configure
> > > async_depth. And this is patch 9-16, where the performance problem is
> > > fixed in patch 13;
> > > 
> > > Because async_depth is related to nr_requests, while reviewing related
> > > code, patch 2-7 are cleanups and fixes to nr_reqeusts.
> > > 
> > > I was planning to send this set for the next merge window, however,
> > > during test I found the last block pr(6.17-rc1) introduce a regression
> > > if nr_reqeusts grows, exit elevator will panic, and I fix this by
> > > patch 1,8.
> > 
> > Please split the patchset into two:
> > 
> > - one is for fixing recent regression on updating 'nr_requests', so this
> >    can be merged to v6.17, and be backport easily for stable & downstream
> 
> There are actually two regressions, as fixed by patch 5 and patch 8, how
> about the first patchset for patch 1-8? Are you good with those minor
> prep cleanup patches?

Then probably you need to make it into three by adding one extra bug fix for
`fix elevator depth_updated method`, which follows the philosophy of
"do one thing, do it better", also helps people to review.

Thanks, 
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 01/16] blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags()
  2025-08-14  8:16   ` Ming Lei
@ 2025-08-14  8:55     ` Yu Kuai
  0 siblings, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-14  8:55 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: axboe, bvanassche, nilay, hare, linux-block, linux-kernel,
	yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/08/14 16:16, Ming Lei 写道:
>>   struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
>> -		unsigned int nr_hw_queues)
>> +		unsigned int nr_hw_queues, unsigned int nr_requests)
>>   {
>>   	unsigned int nr_tags;
>>   	int i;
>> @@ -475,8 +475,11 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
>>   	 * 128, since we don't split into sync/async like the old code
>>   	 * did. Additionally, this is a per-hw queue depth.
>>   	 */
>> -	et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
>> -			BLKDEV_DEFAULT_RQ);
>> +	if (nr_requests)
>> +		et->nr_requests = nr_requests;
>> +	else
>> +		et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
>> +				BLKDEV_DEFAULT_RQ);
> It looks more readable to add helper blk_mq_default_nr_requests(),
> and pass it from call sites directly, then people won't be confused
> with the passed zero `nr_requests`.

Yes, this sounds good.

Thanks for the review,
Kuai


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth
  2025-08-14  8:27     ` Ming Lei
@ 2025-08-14  8:57       ` Yu Kuai
  0 siblings, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-14  8:57 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: axboe, bvanassche, nilay, hare, linux-block, linux-kernel,
	yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/08/14 16:27, Ming Lei 写道:
> On Thu, Aug 14, 2025 at 04:22:27PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> 在 2025/08/14 15:54, Ming Lei 写道:
>>> On Thu, Aug 14, 2025 at 11:35:06AM +0800, Yu Kuai wrote:
>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>
>>>> Backgroud and motivation:
>>>>
>>>> At first, we test a performance regression from 5.10 to 6.6 in
>>>> downstream kernel(described in patch 13), the regression is related to
>>>> async_depth in mq-dealine.
>>>>
>>>> While trying to fix this regression, Bart suggests add a new attribute
>>>> to request_queue, and I think this is a good idea because all elevators
>>>> have similar logical, however only mq-deadline allow user to configure
>>>> async_depth. And this is patch 9-16, where the performance problem is
>>>> fixed in patch 13;
>>>>
>>>> Because async_depth is related to nr_requests, while reviewing related
>>>> code, patch 2-7 are cleanups and fixes to nr_reqeusts.
>>>>
>>>> I was planning to send this set for the next merge window, however,
>>>> during test I found the last block pr(6.17-rc1) introduce a regression
>>>> if nr_reqeusts grows, exit elevator will panic, and I fix this by
>>>> patch 1,8.
>>>
>>> Please split the patchset into two:
>>>
>>> - one is for fixing recent regression on updating 'nr_requests', so this
>>>     can be merged to v6.17, and be backport easily for stable & downstream
>>
>> There are actually two regressions, as fixed by patch 5 and patch 8, how
>> about the first patchset for patch 1-8? Are you good with those minor
>> prep cleanup patches?
> 
> Then probably you need to make it into three by adding one extra bug fix for
> `fix elevator depth_updated method`, which follows the philosophy of
> "do one thing, do it better", also helps people to review.

Ok, I'll send patch 5 seperatly since it can solve problem
independently, and then a patchset for nr_requests regression.

Thanks,
Kuai

> 
> Thanks,
> Ming
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 08/16] blk-mq: fix blk_mq_tags double free while nr_requests grown
  2025-08-14  3:35 ` [PATCH 08/16] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
  2025-08-14  8:20   ` Ming Lei
@ 2025-08-14 12:15   ` Nilay Shroff
  2025-08-15  1:54     ` Yu Kuai
  1 sibling, 1 reply; 30+ messages in thread
From: Nilay Shroff @ 2025-08-14 12:15 UTC (permalink / raw)
  To: Yu Kuai, axboe, yukuai3, bvanassche, hare, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi



On 8/14/25 9:05 AM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> In the case user trigger tags grow by queue sysfs attribute nr_requests,
> hctx->sched_tags will be freed directly and replaced with a new
> allocated tags, see blk_mq_tag_update_depth().
> 
> The problem is that hctx->sched_tags is from elevator->et->tags, while
> et->tags is still the freed tags, hence later elevator exist will try to
> free the tags again, causing kernel panic.
> 
> Fix this problem by using new halper blk_mq_alloc_sched_tags() to
> allocate a new sched_tags. Meanwhile, there is a longterm problem can be
> fixed as well:
> 
> If blk_mq_tag_update_depth() succeed for previous hctx, then bitmap depth
> is updated, however, if following hctx failed, q->nr_requests is not
> updated and the previous hctx->sched_tags endup bigger than q->nr_requests.
> 
> Fixes: f5a6604f7a44 ("block: fix lockdep warning caused by lock dependency in elv_iosched_store")
> Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs")
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  block/blk-mq.c | 31 ++++++++++++++++++++-----------
>  1 file changed, 20 insertions(+), 11 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index a7d6a20c1524..f1c11f591c27 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4917,6 +4917,23 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
>  }
>  EXPORT_SYMBOL(blk_mq_free_tag_set);
>  
> +static int blk_mq_sched_grow_tags(struct request_queue *q, unsigned int nr)
> +{
> +	struct elevator_tags *et =
> +		blk_mq_alloc_sched_tags(q->tag_set, q->nr_hw_queues, nr);
> +	struct blk_mq_hw_ctx *hctx;
> +	unsigned long i;
> +
> +	if (!et)
> +		return -ENOMEM;
> +
> +	blk_mq_free_sched_tags(q->elevator->et, q->tag_set);
> +	queue_for_each_hw_ctx(q, hctx, i)
> +		hctx->sched_tags = et->tags[i];
> +	q->elevator->et = et;
> +	return 0;
> +}
> +
I see that we're allocating/freeing sched tags here after we freeze the 
queue and acquire ->elevator_lock. And so this shall cause the lockdep 
splat we saw earlier[1]. I'm not saying that your proposed change would
cause it, but I think this is one of the code path which we missed to
handle. So when you're at it, please ensure that we don't get into this
splat again. We've already fixed this splat from another code paths 
(elevator change and nr_hw_queue update) but it seems we also need to
handle that case here as well.

[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/

Thanks,
--Nilay

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 03/16] blk-mq: remove useless checkings from blk_mq_update_nr_requests()
  2025-08-14  3:35 ` [PATCH 03/16] blk-mq: remove useless checkings from blk_mq_update_nr_requests() Yu Kuai
@ 2025-08-14 12:23   ` Nilay Shroff
  2025-08-15  1:32     ` Yu Kuai
  0 siblings, 1 reply; 30+ messages in thread
From: Nilay Shroff @ 2025-08-14 12:23 UTC (permalink / raw)
  To: Yu Kuai, axboe, yukuai3, bvanassche, hare, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi



On 8/14/25 9:05 AM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> 1) queue_requests_store() is the only caller of
> blk_mq_update_nr_requests(), where queue is already freezed, no need to
> check mq_freeze_depth;
> 2) q->tag_set must be set for request_based device, and queue_is_mq() is
> already checked in blk_mq_queue_attr_visible(), no need to check
> q->tag_set.
> 3) During initialization, hctx->tags in initialized before queue
> kobject, and during del_gendisk, queue kobject is deleted before
> exiting hctx, hence checking hctx->tags is useless.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  block/blk-mq.c | 11 +----------
>  1 file changed, 1 insertion(+), 10 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b67d6c02eceb..3a219b7b3688 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4921,24 +4921,15 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>  {
>  	struct blk_mq_tag_set *set = q->tag_set;
>  	struct blk_mq_hw_ctx *hctx;
> -	int ret;
> +	int ret = 0;
>  	unsigned long i;
>  
> -	if (WARN_ON_ONCE(!q->mq_freeze_depth))
> -		return -EINVAL;
> -
> -	if (!set)
> -		return -EINVAL;
> -
>  	if (q->nr_requests == nr)
>  		return 0;
>  
>  	blk_mq_quiesce_queue(q);
>  
> -	ret = 0;
>  	queue_for_each_hw_ctx(q, hctx, i) {
> -		if (!hctx->tags)
> -			continue;
It's possible that hctx->tags is set to NULL in case no software 
queues are mapped to the hardware queue. So it seems that this
check is valid. Please see blk_mq_map_swqueue().

Thanks,
--Nilay


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 03/16] blk-mq: remove useless checkings from blk_mq_update_nr_requests()
  2025-08-14 12:23   ` Nilay Shroff
@ 2025-08-15  1:32     ` Yu Kuai
  2025-08-15 11:59       ` Nilay Shroff
  0 siblings, 1 reply; 30+ messages in thread
From: Yu Kuai @ 2025-08-15  1:32 UTC (permalink / raw)
  To: Nilay Shroff, Yu Kuai, axboe, bvanassche, hare, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi,
	yukuai (C)

Hi,

在 2025/08/14 20:23, Nilay Shroff 写道:
> 
> 
> On 8/14/25 9:05 AM, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> 1) queue_requests_store() is the only caller of
>> blk_mq_update_nr_requests(), where queue is already freezed, no need to
>> check mq_freeze_depth;
>> 2) q->tag_set must be set for request_based device, and queue_is_mq() is
>> already checked in blk_mq_queue_attr_visible(), no need to check
>> q->tag_set.
>> 3) During initialization, hctx->tags in initialized before queue
>> kobject, and during del_gendisk, queue kobject is deleted before
>> exiting hctx, hence checking hctx->tags is useless.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   block/blk-mq.c | 11 +----------
>>   1 file changed, 1 insertion(+), 10 deletions(-)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index b67d6c02eceb..3a219b7b3688 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -4921,24 +4921,15 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>>   {
>>   	struct blk_mq_tag_set *set = q->tag_set;
>>   	struct blk_mq_hw_ctx *hctx;
>> -	int ret;
>> +	int ret = 0;
>>   	unsigned long i;
>>   
>> -	if (WARN_ON_ONCE(!q->mq_freeze_depth))
>> -		return -EINVAL;
>> -
>> -	if (!set)
>> -		return -EINVAL;
>> -
>>   	if (q->nr_requests == nr)
>>   		return 0;
>>   
>>   	blk_mq_quiesce_queue(q);
>>   
>> -	ret = 0;
>>   	queue_for_each_hw_ctx(q, hctx, i) {
>> -		if (!hctx->tags)
>> -			continue;
> It's possible that hctx->tags is set to NULL in case no software
> queues are mapped to the hardware queue. So it seems that this
> check is valid. Please see blk_mq_map_swqueue().

Ok, thanks for the reviw.

I didn't notice this, just wonder how can this happen?
nr_hw_queues > NR_CPUS?

Thanks,
Kuai

> 
> Thanks,
> --Nilay
> 
> .
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 08/16] blk-mq: fix blk_mq_tags double free while nr_requests grown
  2025-08-14 12:15   ` Nilay Shroff
@ 2025-08-15  1:54     ` Yu Kuai
  0 siblings, 0 replies; 30+ messages in thread
From: Yu Kuai @ 2025-08-15  1:54 UTC (permalink / raw)
  To: Nilay Shroff, Yu Kuai, axboe, bvanassche, hare, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi,
	yukuai (C)

Hi,

在 2025/08/14 20:15, Nilay Shroff 写道:
> 
> 
> On 8/14/25 9:05 AM, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> In the case user trigger tags grow by queue sysfs attribute nr_requests,
>> hctx->sched_tags will be freed directly and replaced with a new
>> allocated tags, see blk_mq_tag_update_depth().
>>
>> The problem is that hctx->sched_tags is from elevator->et->tags, while
>> et->tags is still the freed tags, hence later elevator exist will try to
>> free the tags again, causing kernel panic.
>>
>> Fix this problem by using new halper blk_mq_alloc_sched_tags() to
>> allocate a new sched_tags. Meanwhile, there is a longterm problem can be
>> fixed as well:
>>
>> If blk_mq_tag_update_depth() succeed for previous hctx, then bitmap depth
>> is updated, however, if following hctx failed, q->nr_requests is not
>> updated and the previous hctx->sched_tags endup bigger than q->nr_requests.
>>
>> Fixes: f5a6604f7a44 ("block: fix lockdep warning caused by lock dependency in elv_iosched_store")
>> Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs")
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   block/blk-mq.c | 31 ++++++++++++++++++++-----------
>>   1 file changed, 20 insertions(+), 11 deletions(-)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index a7d6a20c1524..f1c11f591c27 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -4917,6 +4917,23 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
>>   }
>>   EXPORT_SYMBOL(blk_mq_free_tag_set);
>>   
>> +static int blk_mq_sched_grow_tags(struct request_queue *q, unsigned int nr)
>> +{
>> +	struct elevator_tags *et =
>> +		blk_mq_alloc_sched_tags(q->tag_set, q->nr_hw_queues, nr);
>> +	struct blk_mq_hw_ctx *hctx;
>> +	unsigned long i;
>> +
>> +	if (!et)
>> +		return -ENOMEM;
>> +
>> +	blk_mq_free_sched_tags(q->elevator->et, q->tag_set);
>> +	queue_for_each_hw_ctx(q, hctx, i)
>> +		hctx->sched_tags = et->tags[i];
>> +	q->elevator->et = et;
>> +	return 0;
>> +}
>> +
> I see that we're allocating/freeing sched tags here after we freeze the
> queue and acquire ->elevator_lock. And so this shall cause the lockdep
> splat we saw earlier[1]. I'm not saying that your proposed change would
> cause it, but I think this is one of the code path which we missed to
> handle. So when you're at it, please ensure that we don't get into this
> splat again. We've already fixed this splat from another code paths
> (elevator change and nr_hw_queue update) but it seems we also need to
> handle that case here as well.

Thanks for pointing this, I was thinking the gfp flags inside
blk_mq_alloc_sched_tags will be enough to prevent deadlock related to 
fs_reclaim. I still need to review a lot to catch up.

Looks I need to allocate memory before freezing the queue from
queue_requests_store, I think this is fine with prevrious cleanups and
we'll know excatly if memory allocation is necessary.

BTW, just curious, do we plan or have documents anywhere to explain the
lock ordering? It will be much easier for people like me to add new code
around.

Thanks
Kuai

> 
> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
> 
> Thanks,
> --Nilay
> .
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 03/16] blk-mq: remove useless checkings from blk_mq_update_nr_requests()
  2025-08-15  1:32     ` Yu Kuai
@ 2025-08-15 11:59       ` Nilay Shroff
  2025-08-15 13:35         ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Nilay Shroff @ 2025-08-15 11:59 UTC (permalink / raw)
  To: Yu Kuai, axboe, bvanassche, hare, ming.lei
  Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi,
	yukuai (C)



On 8/15/25 7:02 AM, Yu Kuai wrote:
> Hi,
> 
> 在 2025/08/14 20:23, Nilay Shroff 写道:
>>
>>
>> On 8/14/25 9:05 AM, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> 1) queue_requests_store() is the only caller of
>>> blk_mq_update_nr_requests(), where queue is already freezed, no need to
>>> check mq_freeze_depth;
>>> 2) q->tag_set must be set for request_based device, and queue_is_mq() is
>>> already checked in blk_mq_queue_attr_visible(), no need to check
>>> q->tag_set.
>>> 3) During initialization, hctx->tags in initialized before queue
>>> kobject, and during del_gendisk, queue kobject is deleted before
>>> exiting hctx, hence checking hctx->tags is useless.
>>>
>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>> ---
>>>   block/blk-mq.c | 11 +----------
>>>   1 file changed, 1 insertion(+), 10 deletions(-)
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index b67d6c02eceb..3a219b7b3688 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -4921,24 +4921,15 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>>>   {
>>>       struct blk_mq_tag_set *set = q->tag_set;
>>>       struct blk_mq_hw_ctx *hctx;
>>> -    int ret;
>>> +    int ret = 0;
>>>       unsigned long i;
>>>   -    if (WARN_ON_ONCE(!q->mq_freeze_depth))
>>> -        return -EINVAL;
>>> -
>>> -    if (!set)
>>> -        return -EINVAL;
>>> -
>>>       if (q->nr_requests == nr)
>>>           return 0;
>>>         blk_mq_quiesce_queue(q);
>>>   -    ret = 0;
>>>       queue_for_each_hw_ctx(q, hctx, i) {
>>> -        if (!hctx->tags)
>>> -            continue;
>> It's possible that hctx->tags is set to NULL in case no software
>> queues are mapped to the hardware queue. So it seems that this
>> check is valid. Please see blk_mq_map_swqueue().
> 
> Ok, thanks for the reviw.
> 
> I didn't notice this, just wonder how can this happen?
> nr_hw_queues > NR_CPUS?
> 
I think typically having nr_hw_queues > NR_CPUS is not allowed. 
But it's possible to have no software queues are mapped to hctx.
Check this commit 4412efecf7fd ("Revert "blk-mq: remove code for
dealing with remapping queue")

Thanks,
--Nilay





^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 03/16] blk-mq: remove useless checkings from blk_mq_update_nr_requests()
  2025-08-15 11:59       ` Nilay Shroff
@ 2025-08-15 13:35         ` Ming Lei
  0 siblings, 0 replies; 30+ messages in thread
From: Ming Lei @ 2025-08-15 13:35 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: Yu Kuai, axboe, bvanassche, hare, linux-block, linux-kernel,
	yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

On Fri, Aug 15, 2025 at 05:29:01PM +0530, Nilay Shroff wrote:
> 
> 
> On 8/15/25 7:02 AM, Yu Kuai wrote:
> > Hi,
> > 
> > 在 2025/08/14 20:23, Nilay Shroff 写道:
> >>
> >>
> >> On 8/14/25 9:05 AM, Yu Kuai wrote:
> >>> From: Yu Kuai <yukuai3@huawei.com>
> >>>
> >>> 1) queue_requests_store() is the only caller of
> >>> blk_mq_update_nr_requests(), where queue is already freezed, no need to
> >>> check mq_freeze_depth;
> >>> 2) q->tag_set must be set for request_based device, and queue_is_mq() is
> >>> already checked in blk_mq_queue_attr_visible(), no need to check
> >>> q->tag_set.
> >>> 3) During initialization, hctx->tags in initialized before queue
> >>> kobject, and during del_gendisk, queue kobject is deleted before
> >>> exiting hctx, hence checking hctx->tags is useless.
> >>>
> >>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> >>> ---
> >>>   block/blk-mq.c | 11 +----------
> >>>   1 file changed, 1 insertion(+), 10 deletions(-)
> >>>
> >>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >>> index b67d6c02eceb..3a219b7b3688 100644
> >>> --- a/block/blk-mq.c
> >>> +++ b/block/blk-mq.c
> >>> @@ -4921,24 +4921,15 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
> >>>   {
> >>>       struct blk_mq_tag_set *set = q->tag_set;
> >>>       struct blk_mq_hw_ctx *hctx;
> >>> -    int ret;
> >>> +    int ret = 0;
> >>>       unsigned long i;
> >>>   -    if (WARN_ON_ONCE(!q->mq_freeze_depth))
> >>> -        return -EINVAL;
> >>> -
> >>> -    if (!set)
> >>> -        return -EINVAL;
> >>> -
> >>>       if (q->nr_requests == nr)
> >>>           return 0;
> >>>         blk_mq_quiesce_queue(q);
> >>>   -    ret = 0;
> >>>       queue_for_each_hw_ctx(q, hctx, i) {
> >>> -        if (!hctx->tags)
> >>> -            continue;
> >> It's possible that hctx->tags is set to NULL in case no software
> >> queues are mapped to the hardware queue. So it seems that this
> >> check is valid. Please see blk_mq_map_swqueue().
> > 
> > Ok, thanks for the reviw.
> > 
> > I didn't notice this, just wonder how can this happen?
> > nr_hw_queues > NR_CPUS?
> > 
> I think typically having nr_hw_queues > NR_CPUS is not allowed. 
> But it's possible to have no software queues are mapped to hctx.
> Check this commit 4412efecf7fd ("Revert "blk-mq: remove code for
> dealing with remapping queue")

It is possible in case of poll_queues, for example:

modprobe null_blk submit_queues=$(getconf _NPROCESSORS_ONLN) poll_queues=2
 
 

Thanks, 
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2025-08-15 13:36 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-14  3:35 [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Yu Kuai
2025-08-14  3:35 ` [PATCH 01/16] blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags() Yu Kuai
2025-08-14  8:16   ` Ming Lei
2025-08-14  8:55     ` Yu Kuai
2025-08-14  3:35 ` [PATCH 02/16] blk-mq: remove useless checking from queue_requests_store() Yu Kuai
2025-08-14  3:35 ` [PATCH 03/16] blk-mq: remove useless checkings from blk_mq_update_nr_requests() Yu Kuai
2025-08-14 12:23   ` Nilay Shroff
2025-08-15  1:32     ` Yu Kuai
2025-08-15 11:59       ` Nilay Shroff
2025-08-15 13:35         ` Ming Lei
2025-08-14  3:35 ` [PATCH 04/16] blk-mq: check invalid nr_requests in queue_requests_store() Yu Kuai
2025-08-14  3:35 ` [PATCH 05/16] blk-mq: fix elevator depth_updated method Yu Kuai
2025-08-14  3:35 ` [PATCH 06/16] blk-mq: cleanup shared tags case in blk_mq_update_nr_requests() Yu Kuai
2025-08-14  3:35 ` [PATCH 07/16] blk-mq: split bitmap grow and resize " Yu Kuai
2025-08-14  3:35 ` [PATCH 08/16] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
2025-08-14  8:20   ` Ming Lei
2025-08-14 12:15   ` Nilay Shroff
2025-08-15  1:54     ` Yu Kuai
2025-08-14  3:35 ` [PATCH 09/16] block: convert nr_requests to unsigned int Yu Kuai
2025-08-14  3:35 ` [PATCH 10/16] blk-mq-sched: unify elevators checking for async requests Yu Kuai
2025-08-14  3:35 ` [PATCH 11/16] blk-mq: add a new queue sysfs attribute async_depth Yu Kuai
2025-08-14  3:35 ` [PATCH 12/16] kyber: covert to use request_queue->async_depth Yu Kuai
2025-08-14  3:35 ` [PATCH 13/16] mq-deadline: " Yu Kuai
2025-08-14  3:35 ` [PATCH 14/16] block, bfq: convert " Yu Kuai
2025-08-14  3:35 ` [PATCH 15/16] blk-mq: fix stale nr_requests documentation Yu Kuai
2025-08-14  3:35 ` [PATCH 16/16] blk-mq: add documentation for new queue attribute async_dpeth Yu Kuai
2025-08-14  7:54 ` [PATCH 00/16] blk-mq: introduce new queue attribute asyc_dpeth Ming Lei
2025-08-14  8:22   ` Yu Kuai
2025-08-14  8:27     ` Ming Lei
2025-08-14  8:57       ` Yu Kuai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).