* [PATCH v2 0/2] blk-mq: fix update nr_requests regressions
@ 2025-08-19  1:29 Yu Kuai
  2025-08-19  1:29 ` [PATCH v2 1/2] blk-mq: fix elevator depth_updated method Yu Kuai
  2025-08-19  1:29 ` [PATCH v2 2/2] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
From: Yu Kuai @ 2025-08-19  1:29 UTC
  To: yukuai3, axboe, bvanassche, ming.lei, hare, nilay
  Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Changes in v2:
 - instead of refactoring and cleaning up the nr_requests update path
 thoroughly, fix the regression in patch 2 the easy way, and delay the
 refactoring and cleanups to the next merge window.

Patch 1 fixes a regression where the elevator async_depth is not updated
correctly when nr_requests changes, introduced first in the error path,
then for mq-deadline, and recently for bfq and kyber.

Patch 2 fixes a regression where, if nr_requests grows, the kernel
panics due to a double free of tags.
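
To make the failure mode in patch 2 concrete, here is a minimal
standalone C model of the double free; struct tags, struct hctx,
struct elevator and update_depth() are illustrative stand-ins, not the
kernel's types:

#include <stdio.h>
#include <stdlib.h>

struct tags { unsigned int depth; };

struct hctx     { struct tags *sched_tags; };
struct elevator { struct tags *tags; };  /* stands in for et->tags[i] */

static struct tags *alloc_tags(unsigned int depth)
{
	struct tags *t = malloc(sizeof(*t));

	t->depth = depth;
	return t;
}

/*
 * Grow path: free the old tags and install new ones. Without the line
 * marked "the fix", the elevator keeps pointing at freed memory, and
 * freeing through it at teardown is a double free.
 */
static void update_depth(struct hctx *h, struct elevator *e,
			 unsigned int depth)
{
	struct tags *new = alloc_tags(depth);

	free(h->sched_tags);
	e->tags = new;		/* the fix: keep the second owner in sync */
	h->sched_tags = new;
}

int main(void)
{
	struct tags *t = alloc_tags(64);
	struct hctx h = { .sched_tags = t };
	struct elevator e = { .tags = t };

	update_depth(&h, &e, 128);
	free(e.tags);		/* elevator teardown: frees exactly once */
	printf("grew tags without a double free\n");
	return 0;
}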

Yu Kuai (2):
  blk-mq: fix elevator depth_updated method
  blk-mq: fix blk_mq_tags double free while nr_requests grown

 block/bfq-iosched.c   | 21 ++++-----------------
 block/blk-mq-sched.c  |  3 +++
 block/blk-mq-sched.h  | 11 +++++++++++
 block/blk-mq-tag.c    |  1 +
 block/blk-mq.c        | 23 ++++++++++++-----------
 block/elevator.h      |  2 +-
 block/kyber-iosched.c | 10 ++++------
 block/mq-deadline.c   | 15 ++-------------
 8 files changed, 38 insertions(+), 48 deletions(-)

-- 
2.39.2



* [PATCH v2 1/2] blk-mq: fix elevator depth_updated method
  2025-08-19  1:29 [PATCH v2 0/2] blk-mq: fix update nr_requests regressions Yu Kuai
@ 2025-08-19  1:29 ` Yu Kuai
  2025-08-19 12:20   ` Nilay Shroff
  2025-08-19  1:29 ` [PATCH v2 2/2] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
From: Yu Kuai @ 2025-08-19  1:29 UTC
  To: yukuai3, axboe, bvanassche, ming.lei, hare, nilay
  Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

The current depth_updated() method has some problems:

1) depth_updated() is called once per hctx, while all elevators update
async_depth at the disk level; this is unrelated to any particular hctx;
2) In blk_mq_update_nr_requests(), if a previous hctx update succeeds
and the current hctx update fails, q->nr_requests is not updated, while
async_depth has already been updated with the new nr_requests by the
earlier depth_updated() calls;
3) All elevators now use q->nr_requests to calculate async_depth;
however, q->nr_requests still holds the old value when depth_updated()
is called from blk_mq_update_nr_requests();
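
To illustrate problem 2), here is a standalone toy model; struct queue,
resize_hctx() and the 3/4 depth formula are illustrative assumptions,
not kernel code:

#include <stdio.h>

#define NR_HCTX 4

struct queue { unsigned int nr_requests; unsigned int async_depth; };

/* Simulate a per-hctx resize that fails part-way through. */
static int resize_hctx(int i)
{
	return i == 2 ? -1 : 0;
}

static int update_nr_requests(struct queue *q, unsigned int nr)
{
	for (int i = 0; i < NR_HCTX; i++) {
		if (resize_hctx(i))
			return -1;		/* bail out mid-loop */
		q->async_depth = nr * 3 / 4;	/* depth_updated() per hctx */
	}
	q->nr_requests = nr;			/* only on full success */
	return 0;
}

int main(void)
{
	struct queue q = { .nr_requests = 64, .async_depth = 48 };

	update_nr_requests(&q, 128);
	/* Inconsistent: async_depth already reflects 128 while
	 * nr_requests is still 64. */
	printf("nr_requests=%u async_depth=%u\n",
	       q.nr_requests, q.async_depth);
	return 0;
}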

Fix those problems by:

- passing in the request_queue instead of an hctx;
- moving depth_updated() after q->nr_requests is updated in
  blk_mq_update_nr_requests();
- adding a depth_updated() call in blk_mq_init_sched();
- removing the init_hctx() method from mq-deadline and bfq, which is now
  useless;

Fixes: 77f1e0a52d26 ("bfq: update internal depth state when queue depth changes")
Fixes: 39823b47bbd4 ("block/mq-deadline: Fix the tag reservation code")
Fixes: 42e6c6ce03fd ("lib/sbitmap: convert shallow_depth from one word to the whole sbitmap")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/bfq-iosched.c   | 21 ++++-----------------
 block/blk-mq-sched.c  |  3 +++
 block/blk-mq-sched.h  | 11 +++++++++++
 block/blk-mq.c        | 23 ++++++++++++-----------
 block/elevator.h      |  2 +-
 block/kyber-iosched.c | 10 ++++------
 block/mq-deadline.c   | 15 ++-------------
 7 files changed, 37 insertions(+), 48 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 50e51047e1fe..c0c398998aa1 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7109,9 +7109,10 @@ void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
  * See the comments on bfq_limit_depth for the purpose of
  * the depths set in the function. Return minimum shallow depth we'll use.
  */
-static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
+static void bfq_depth_updated(struct request_queue *q)
 {
-	unsigned int nr_requests = bfqd->queue->nr_requests;
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	unsigned int nr_requests = q->nr_requests;
 
 	/*
 	 * In-word depths if no bfq_queue is being weight-raised:
@@ -7143,21 +7144,8 @@ static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
 	bfqd->async_depths[1][0] = max((nr_requests * 3) >> 4, 1U);
 	/* no more than ~37% of tags for sync writes (~20% extra tags) */
 	bfqd->async_depths[1][1] = max((nr_requests * 6) >> 4, 1U);
-}
-
-static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
-{
-	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
-	struct blk_mq_tags *tags = hctx->sched_tags;
 
-	bfq_update_depths(bfqd, &tags->bitmap_tags);
-	sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, 1);
-}
-
-static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
-{
-	bfq_depth_updated(hctx);
-	return 0;
+	blk_mq_set_min_shallow_depth(q, 1);
 }
 
 static void bfq_exit_queue(struct elevator_queue *e)
@@ -7628,7 +7616,6 @@ static struct elevator_type iosched_bfq_mq = {
 		.request_merged		= bfq_request_merged,
 		.has_work		= bfq_has_work,
 		.depth_updated		= bfq_depth_updated,
-		.init_hctx		= bfq_init_hctx,
 		.init_sched		= bfq_init_queue,
 		.exit_sched		= bfq_exit_queue,
 	},
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index e2ce4a28e6c9..bf7dd97422ec 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -585,6 +585,9 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
 			}
 		}
 	}
+
+	if (e->ops.depth_updated)
+		e->ops.depth_updated(q);
 	return 0;
 
 out:
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index b554e1d55950..fe83187f41db 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -92,4 +92,15 @@ static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
 	return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 }
 
+static inline void blk_mq_set_min_shallow_depth(struct request_queue *q,
+						unsigned int depth)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned long i;
+
+	queue_for_each_hw_ctx(q, hctx, i)
+		sbitmap_queue_min_shallow_depth(&hctx->sched_tags->bitmap_tags,
+						depth);
+}
+
 #endif
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b67d6c02eceb..9c68749124c6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4951,20 +4951,21 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 						      false);
 		}
 		if (ret)
-			break;
-		if (q->elevator && q->elevator->type->ops.depth_updated)
-			q->elevator->type->ops.depth_updated(hctx);
+			goto out;
 	}
-	if (!ret) {
-		q->nr_requests = nr;
-		if (blk_mq_is_shared_tags(set->flags)) {
-			if (q->elevator)
-				blk_mq_tag_update_sched_shared_tags(q);
-			else
-				blk_mq_tag_resize_shared_tags(set, nr);
-		}
+
+	q->nr_requests = nr;
+	if (q->elevator && q->elevator->type->ops.depth_updated)
+		q->elevator->type->ops.depth_updated(q);
+
+	if (blk_mq_is_shared_tags(set->flags)) {
+		if (q->elevator)
+			blk_mq_tag_update_sched_shared_tags(q);
+		else
+			blk_mq_tag_resize_shared_tags(set, nr);
 	}
 
+out:
 	blk_mq_unquiesce_queue(q);
 
 	return ret;
diff --git a/block/elevator.h b/block/elevator.h
index adc5c157e17e..c4d20155065e 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -37,7 +37,7 @@ struct elevator_mq_ops {
 	void (*exit_sched)(struct elevator_queue *);
 	int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
 	void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
-	void (*depth_updated)(struct blk_mq_hw_ctx *);
+	void (*depth_updated)(struct request_queue *);
 
 	bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
 	bool (*bio_merge)(struct request_queue *, struct bio *, unsigned int);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 70cbc7b2deb4..49ae52aa20d9 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -440,13 +440,12 @@ static void kyber_ctx_queue_init(struct kyber_ctx_queue *kcq)
 		INIT_LIST_HEAD(&kcq->rq_list[i]);
 }
 
-static void kyber_depth_updated(struct blk_mq_hw_ctx *hctx)
+static void kyber_depth_updated(struct request_queue *q)
 {
-	struct kyber_queue_data *kqd = hctx->queue->elevator->elevator_data;
-	struct blk_mq_tags *tags = hctx->sched_tags;
+	struct kyber_queue_data *kqd = q->elevator->elevator_data;
 
-	kqd->async_depth = hctx->queue->nr_requests * KYBER_ASYNC_PERCENT / 100U;
-	sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, kqd->async_depth);
+	kqd->async_depth = q->nr_requests * KYBER_ASYNC_PERCENT / 100U;
+	blk_mq_set_min_shallow_depth(q, kqd->async_depth);
 }
 
 static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
@@ -493,7 +492,6 @@ static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
 	khd->batching = 0;
 
 	hctx->sched_data = khd;
-	kyber_depth_updated(hctx);
 
 	return 0;
 
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b9b7cdf1d3c9..578bc79c5654 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -507,22 +507,12 @@ static void dd_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
 }
 
 /* Called by blk_mq_update_nr_requests(). */
-static void dd_depth_updated(struct blk_mq_hw_ctx *hctx)
+static void dd_depth_updated(struct request_queue *q)
 {
-	struct request_queue *q = hctx->queue;
 	struct deadline_data *dd = q->elevator->elevator_data;
-	struct blk_mq_tags *tags = hctx->sched_tags;
 
 	dd->async_depth = q->nr_requests;
-
-	sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, 1);
-}
-
-/* Called by blk_mq_init_hctx() and blk_mq_init_sched(). */
-static int dd_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
-{
-	dd_depth_updated(hctx);
-	return 0;
+	blk_mq_set_min_shallow_depth(q, 1);
 }
 
 static void dd_exit_sched(struct elevator_queue *e)
@@ -1048,7 +1038,6 @@ static struct elevator_type mq_deadline = {
 		.has_work		= dd_has_work,
 		.init_sched		= dd_init_sched,
 		.exit_sched		= dd_exit_sched,
-		.init_hctx		= dd_init_hctx,
 	},
 
 #ifdef CONFIG_BLK_DEBUG_FS
-- 
2.39.2



* [PATCH v2 2/2] blk-mq: fix blk_mq_tags double free while nr_requests grown
  2025-08-19  1:29 [PATCH v2 0/2] blk-mq: fix update nr_requests regressions Yu Kuai
  2025-08-19  1:29 ` [PATCH v2 1/2] blk-mq: fix elevator depth_updated method Yu Kuai
@ 2025-08-19  1:29 ` Yu Kuai
  2025-08-19 10:47   ` Ming Lei
  2025-08-19 12:45   ` Nilay Shroff
From: Yu Kuai @ 2025-08-19  1:29 UTC
  To: yukuai3, axboe, bvanassche, ming.lei, hare, nilay
  Cc: linux-block, linux-kernel, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

In the case user trigger tags grow by queue sysfs attribute nr_requests,
hctx->sched_tags will be freed directly and replaced with a new
allocated tags, see blk_mq_tag_update_depth().

The problem is that hctx->sched_tags is from elevator->et->tags, while
et->tags is still the freed tags, hence later elevator exist will try to
free the tags again, causing kernel panic.

Fix this problem by replacing et->tags will new allocated tags as well.

Noted there are still some long term problems that will require some
refactor to be fixed thoroughly[1].

[1] https://lore.kernel.org/all/20250815080216.410665-1-yukuai1@huaweicloud.com/
Fixes: f5a6604f7a44 ("block: fix lockdep warning caused by lock dependency in elv_iosched_store")

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-tag.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d880c50629d6..5cffa5668d0c 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -622,6 +622,7 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 			return -ENOMEM;
 
 		blk_mq_free_map_and_rqs(set, *tagsptr, hctx->queue_num);
+		hctx->queue->elevator->et->tags[hctx->queue_num] = new;
 		*tagsptr = new;
 	} else {
 		/*
-- 
2.39.2



* Re: [PATCH v2 2/2] blk-mq: fix blk_mq_tags double free while nr_requests grown
  2025-08-19  1:29 ` [PATCH v2 2/2] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
@ 2025-08-19 10:47   ` Ming Lei
  2025-08-19 12:45   ` Nilay Shroff
From: Ming Lei @ 2025-08-19 10:47 UTC
  To: Yu Kuai
  Cc: yukuai3, axboe, bvanassche, hare, nilay, linux-block,
	linux-kernel, yi.zhang, yangerkun, johnny.chenyi

On Tue, Aug 19, 2025 at 09:29:17AM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> In the case user trigger tags grow by queue sysfs attribute nr_requests,
> hctx->sched_tags will be freed directly and replaced with a new
> allocated tags, see blk_mq_tag_update_depth().
> 
> The problem is that hctx->sched_tags is from elevator->et->tags, while
> et->tags is still the freed tags, hence later elevator exist will try to
> free the tags again, causing kernel panic.
> 
> Fix this problem by replacing et->tags will new allocated tags as well.
> 
> Noted there are still some long term problems that will require some
> refactor to be fixed thoroughly[1].
> 
> [1] https://lore.kernel.org/all/20250815080216.410665-1-yukuai1@huaweicloud.com/
> Fixes: f5a6604f7a44 ("block: fix lockdep warning caused by lock dependency in elv_iosched_store")
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  block/blk-mq-tag.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index d880c50629d6..5cffa5668d0c 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -622,6 +622,7 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
>  			return -ENOMEM;
>  
>  		blk_mq_free_map_and_rqs(set, *tagsptr, hctx->queue_num);
> +		hctx->queue->elevator->et->tags[hctx->queue_num] = new;
>  		*tagsptr = new;

Reviewed-by: Ming Lei <ming.lei@redhat.com>


Thanks,
Ming



* Re: [PATCH v2 1/2] blk-mq: fix elevator depth_updated method
  2025-08-19  1:29 ` [PATCH v2 1/2] blk-mq: fix elevator depth_updated method Yu Kuai
@ 2025-08-19 12:20   ` Nilay Shroff
  2025-08-20  0:56     ` Yu Kuai
From: Nilay Shroff @ 2025-08-19 12:20 UTC
  To: Yu Kuai, yukuai3, axboe, bvanassche, ming.lei, hare
  Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi



On 8/19/25 6:59 AM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> The current depth_updated() method has some problems:
> 
> 1) depth_updated() is called once per hctx, while all elevators update
> async_depth at the disk level; this is unrelated to any particular hctx;
> 2) In blk_mq_update_nr_requests(), if a previous hctx update succeeds
> and the current hctx update fails, q->nr_requests is not updated, while
> async_depth has already been updated with the new nr_requests by the
> earlier depth_updated() calls;
> 3) All elevators now use q->nr_requests to calculate async_depth;
> however, q->nr_requests still holds the old value when depth_updated()
> is called from blk_mq_update_nr_requests();
> 
> Fix those problems by:
> 
> - passing in the request_queue instead of an hctx;
> - moving depth_updated() after q->nr_requests is updated in
>   blk_mq_update_nr_requests();
> - adding a depth_updated() call in blk_mq_init_sched();
> - removing the init_hctx() method from mq-deadline and bfq, which is
>   now useless;
> 
> Fixes: 77f1e0a52d26 ("bfq: update internal depth state when queue depth changes")
> Fixes: 39823b47bbd4 ("block/mq-deadline: Fix the tag reservation code")
> Fixes: 42e6c6ce03fd ("lib/sbitmap: convert shallow_depth from one word to the whole sbitmap")
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  block/bfq-iosched.c   | 21 ++++-----------------
>  block/blk-mq-sched.c  |  3 +++
>  block/blk-mq-sched.h  | 11 +++++++++++
>  block/blk-mq.c        | 23 ++++++++++++-----------
>  block/elevator.h      |  2 +-
>  block/kyber-iosched.c | 10 ++++------
>  block/mq-deadline.c   | 15 ++-------------
>  7 files changed, 37 insertions(+), 48 deletions(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 50e51047e1fe..c0c398998aa1 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -7109,9 +7109,10 @@ void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
>   * See the comments on bfq_limit_depth for the purpose of
>   * the depths set in the function. Return minimum shallow depth we'll use.
>   */
> -static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
> +static void bfq_depth_updated(struct request_queue *q)
>  {
> -	unsigned int nr_requests = bfqd->queue->nr_requests;
> +	struct bfq_data *bfqd = q->elevator->elevator_data;
> +	unsigned int nr_requests = q->nr_requests;
>  
>  	/*
>  	 * In-word depths if no bfq_queue is being weight-raised:
> @@ -7143,21 +7144,8 @@ static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
>  	bfqd->async_depths[1][0] = max((nr_requests * 3) >> 4, 1U);
>  	/* no more than ~37% of tags for sync writes (~20% extra tags) */
>  	bfqd->async_depths[1][1] = max((nr_requests * 6) >> 4, 1U);
> -}
> -
> -static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
> -{
> -	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
> -	struct blk_mq_tags *tags = hctx->sched_tags;
>  
> -	bfq_update_depths(bfqd, &tags->bitmap_tags);
> -	sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, 1);
> -}
> -
> -static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
> -{
> -	bfq_depth_updated(hctx);
> -	return 0;
> +	blk_mq_set_min_shallow_depth(q, 1);
>  }
>  
>  static void bfq_exit_queue(struct elevator_queue *e)
> @@ -7628,7 +7616,6 @@ static struct elevator_type iosched_bfq_mq = {
>  		.request_merged		= bfq_request_merged,
>  		.has_work		= bfq_has_work,
>  		.depth_updated		= bfq_depth_updated,
> -		.init_hctx		= bfq_init_hctx,
>  		.init_sched		= bfq_init_queue,
>  		.exit_sched		= bfq_exit_queue,
>  	},
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index e2ce4a28e6c9..bf7dd97422ec 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -585,6 +585,9 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
>  			}
>  		}
>  	}
> +
> +	if (e->ops.depth_updated)
> +		e->ops.depth_updated(q);
>  	return 0;
>  

Overall the changes look good. That said, I think it might be cleaner to structure
it this way:

elevator_switch -> blk_mq_init_sched ->init_sched ==> sets async_depth
blk_mq_update_nr_requests ->depth_updated ==> updates async_depth

This way, we don’t need to call ->depth_updated from blk_mq_init_sched.

In summary:
- Avoid calling ->depth_updated during blk_mq_init_sched
- Set async_depth when the elevator is initialized (via ->init_sched)
- Update async_depth when nr_requests is modified through sysfs (via ->depth_updated)
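
A minimal standalone sketch of this flow; init_sched(), depth_updated()
and struct queue below are illustrative stand-ins for the real elevator
callbacks, not kernel code:

#include <stdio.h>

struct queue { unsigned int nr_requests; unsigned int async_depth; };

/* Recompute async_depth from the current nr_requests, loosely
 * modelled on dd_depth_updated(). */
static void depth_updated(struct queue *q)
{
	q->async_depth = q->nr_requests;
}

/* Elevator init sets async_depth itself, so the core init path does
 * not need to invoke depth_updated(). */
static int init_sched(struct queue *q)
{
	depth_updated(q);
	return 0;
}

/* sysfs nr_requests update: assign first, then notify the elevator. */
static void update_nr_requests(struct queue *q, unsigned int nr)
{
	q->nr_requests = nr;
	depth_updated(q);
}

int main(void)
{
	struct queue q = { .nr_requests = 64 };

	init_sched(&q);
	update_nr_requests(&q, 128);
	printf("async_depth=%u\n", q.async_depth);
	return 0;
}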

Thanks,
--Nilay


* Re: [PATCH v2 2/2] blk-mq: fix blk_mq_tags double free while nr_requests grown
  2025-08-19  1:29 ` [PATCH v2 2/2] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
  2025-08-19 10:47   ` Ming Lei
@ 2025-08-19 12:45   ` Nilay Shroff
From: Nilay Shroff @ 2025-08-19 12:45 UTC
  To: Yu Kuai, yukuai3, axboe, bvanassche, ming.lei, hare
  Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi



On 8/19/25 6:59 AM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> In the case user trigger tags grow by queue sysfs attribute nr_requests,
> hctx->sched_tags will be freed directly and replaced with a new
> allocated tags, see blk_mq_tag_update_depth().
> 
> The problem is that hctx->sched_tags is from elevator->et->tags, while
> et->tags is still the freed tags, hence later elevator exist will try to
nit: 's/exist/exit'

> free the tags again, causing kernel panic.
> 
> Fix this problem by replacing et->tags will new allocated tags as well.
nit: 's/will/with'

> 
> Noted there are still some long term problems that will require some
> refactor to be fixed thoroughly[1].
> 
> [1] https://lore.kernel.org/all/20250815080216.410665-1-yukuai1@huaweicloud.com/
> Fixes: f5a6604f7a44 ("block: fix lockdep warning caused by lock dependency in elv_iosched_store")
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  block/blk-mq-tag.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index d880c50629d6..5cffa5668d0c 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -622,6 +622,7 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
>  			return -ENOMEM;
>  
>  		blk_mq_free_map_and_rqs(set, *tagsptr, hctx->queue_num);
> +		hctx->queue->elevator->et->tags[hctx->queue_num] = new;
>  		*tagsptr = new;
>  	} else {
>  		/*
Except for the minor nitpicking above, this change looks good to me:

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>



* Re: [PATCH v2 1/2] blk-mq: fix elevator depth_updated method
  2025-08-19 12:20   ` Nilay Shroff
@ 2025-08-20  0:56     ` Yu Kuai
  2025-08-20  5:34       ` Nilay Shroff
From: Yu Kuai @ 2025-08-20  0:56 UTC
  To: Nilay Shroff, Yu Kuai, axboe, bvanassche, ming.lei, hare
  Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi,
	yukuai (C)

Hi,

On 2025/08/19 20:20, Nilay Shroff wrote:
> 
> 
> On 8/19/25 6:59 AM, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> The current depth_updated() method has some problems:
>>
>> 1) depth_updated() is called once per hctx, while all elevators update
>> async_depth at the disk level; this is unrelated to any particular hctx;
>> 2) In blk_mq_update_nr_requests(), if a previous hctx update succeeds
>> and the current hctx update fails, q->nr_requests is not updated, while
>> async_depth has already been updated with the new nr_requests by the
>> earlier depth_updated() calls;
>> 3) All elevators now use q->nr_requests to calculate async_depth;
>> however, q->nr_requests still holds the old value when depth_updated()
>> is called from blk_mq_update_nr_requests();
>>
>> Fix those problems by:
>>
>> - passing in the request_queue instead of an hctx;
>> - moving depth_updated() after q->nr_requests is updated in
>>   blk_mq_update_nr_requests();
>> - adding a depth_updated() call in blk_mq_init_sched();
>> - removing the init_hctx() method from mq-deadline and bfq, which is
>>   now useless;
>>
>> Fixes: 77f1e0a52d26 ("bfq: update internal depth state when queue depth changes")
>> Fixes: 39823b47bbd4 ("block/mq-deadline: Fix the tag reservation code")
>> Fixes: 42e6c6ce03fd ("lib/sbitmap: convert shallow_depth from one word to the whole sbitmap")
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   block/bfq-iosched.c   | 21 ++++-----------------
>>   block/blk-mq-sched.c  |  3 +++
>>   block/blk-mq-sched.h  | 11 +++++++++++
>>   block/blk-mq.c        | 23 ++++++++++++-----------
>>   block/elevator.h      |  2 +-
>>   block/kyber-iosched.c | 10 ++++------
>>   block/mq-deadline.c   | 15 ++-------------
>>   7 files changed, 37 insertions(+), 48 deletions(-)
>>
>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>> index 50e51047e1fe..c0c398998aa1 100644
>> --- a/block/bfq-iosched.c
>> +++ b/block/bfq-iosched.c
>> @@ -7109,9 +7109,10 @@ void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
>>    * See the comments on bfq_limit_depth for the purpose of
>>    * the depths set in the function. Return minimum shallow depth we'll use.
>>    */
>> -static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
>> +static void bfq_depth_updated(struct request_queue *q)
>>   {
>> -	unsigned int nr_requests = bfqd->queue->nr_requests;
>> +	struct bfq_data *bfqd = q->elevator->elevator_data;
>> +	unsigned int nr_requests = q->nr_requests;
>>   
>>   	/*
>>   	 * In-word depths if no bfq_queue is being weight-raised:
>> @@ -7143,21 +7144,8 @@ static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
>>   	bfqd->async_depths[1][0] = max((nr_requests * 3) >> 4, 1U);
>>   	/* no more than ~37% of tags for sync writes (~20% extra tags) */
>>   	bfqd->async_depths[1][1] = max((nr_requests * 6) >> 4, 1U);
>> -}
>> -
>> -static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
>> -{
>> -	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
>> -	struct blk_mq_tags *tags = hctx->sched_tags;
>>   
>> -	bfq_update_depths(bfqd, &tags->bitmap_tags);
>> -	sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, 1);
>> -}
>> -
>> -static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
>> -{
>> -	bfq_depth_updated(hctx);
>> -	return 0;
>> +	blk_mq_set_min_shallow_depth(q, 1);
>>   }
>>   
>>   static void bfq_exit_queue(struct elevator_queue *e)
>> @@ -7628,7 +7616,6 @@ static struct elevator_type iosched_bfq_mq = {
>>   		.request_merged		= bfq_request_merged,
>>   		.has_work		= bfq_has_work,
>>   		.depth_updated		= bfq_depth_updated,
>> -		.init_hctx		= bfq_init_hctx,
>>   		.init_sched		= bfq_init_queue,
>>   		.exit_sched		= bfq_exit_queue,
>>   	},
>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>> index e2ce4a28e6c9..bf7dd97422ec 100644
>> --- a/block/blk-mq-sched.c
>> +++ b/block/blk-mq-sched.c
>> @@ -585,6 +585,9 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
>>   			}
>>   		}
>>   	}
>> +
>> +	if (e->ops.depth_updated)
>> +		e->ops.depth_updated(q);
>>   	return 0;
>>   
> 
> Overall the changes look good. That said, I think it might be cleaner to structure
> it this way:
> 
> elevator_switch -> blk_mq_init_sched ->init_sched ==> sets async_depth
> blk_mq_update_nr_requests ->depth_updated ==> updates async_depth
> 
> This way, we don’t need to call ->depth_updated from blk_mq_init_sched.

Just to be sure, you mean calling the depth_updated() method directly
inside the init_sched() method? This is indeed cleaner; each elevator
then uses this method to initialize async_depth.
> 
> In summary:
> - Avoid calling ->depth_updated during blk_mq_init_sched
> - Set async_depth when the elevator is initialized (via ->init_sched)
> - Update async_depth when nr_requests is modified through sysfs (via ->depth_updated)
> 
> Thanks,
> --Nilay
> .
> 

Thanks,
Kuai



* Re: [PATCH v2 1/2] blk-mq: fix elevator depth_updated method
  2025-08-20  0:56     ` Yu Kuai
@ 2025-08-20  5:34       ` Nilay Shroff
From: Nilay Shroff @ 2025-08-20  5:34 UTC
  To: Yu Kuai, axboe, bvanassche, ming.lei, hare
  Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi,
	yukuai (C)



On 8/20/25 6:26 AM, Yu Kuai wrote:
> Hi,
> 
> On 2025/08/19 20:20, Nilay Shroff wrote:
>>
>>
>> On 8/19/25 6:59 AM, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> The current depth_updated() method has some problems:
>>>
>>> 1) depth_updated() is called once per hctx, while all elevators update
>>> async_depth at the disk level; this is unrelated to any particular
>>> hctx;
>>> 2) In blk_mq_update_nr_requests(), if a previous hctx update succeeds
>>> and the current hctx update fails, q->nr_requests is not updated,
>>> while async_depth has already been updated with the new nr_requests by
>>> the earlier depth_updated() calls;
>>> 3) All elevators now use q->nr_requests to calculate async_depth;
>>> however, q->nr_requests still holds the old value when depth_updated()
>>> is called from blk_mq_update_nr_requests();
>>>
>>> Fix those problems by:
>>>
>>> - passing in the request_queue instead of an hctx;
>>> - moving depth_updated() after q->nr_requests is updated in
>>>   blk_mq_update_nr_requests();
>>> - adding a depth_updated() call in blk_mq_init_sched();
>>> - removing the init_hctx() method from mq-deadline and bfq, which is
>>>   now useless;
>>>
>>> Fixes: 77f1e0a52d26 ("bfq: update internal depth state when queue depth changes")
>>> Fixes: 39823b47bbd4 ("block/mq-deadline: Fix the tag reservation code")
>>> Fixes: 42e6c6ce03fd ("lib/sbitmap: convert shallow_depth from one word to the whole sbitmap")
>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>> ---
>>>   block/bfq-iosched.c   | 21 ++++-----------------
>>>   block/blk-mq-sched.c  |  3 +++
>>>   block/blk-mq-sched.h  | 11 +++++++++++
>>>   block/blk-mq.c        | 23 ++++++++++++-----------
>>>   block/elevator.h      |  2 +-
>>>   block/kyber-iosched.c | 10 ++++------
>>>   block/mq-deadline.c   | 15 ++-------------
>>>   7 files changed, 37 insertions(+), 48 deletions(-)
>>>
>>> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
>>> index 50e51047e1fe..c0c398998aa1 100644
>>> --- a/block/bfq-iosched.c
>>> +++ b/block/bfq-iosched.c
>>> @@ -7109,9 +7109,10 @@ void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
>>>   * See the comments on bfq_limit_depth for the purpose of
>>>   * the depths set in the function. Return minimum shallow depth we'll use.
>>>   */
>>> -static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
>>> +static void bfq_depth_updated(struct request_queue *q)
>>>  {
>>> -	unsigned int nr_requests = bfqd->queue->nr_requests;
>>> +	struct bfq_data *bfqd = q->elevator->elevator_data;
>>> +	unsigned int nr_requests = q->nr_requests;
>>>  
>>>  	/*
>>>  	 * In-word depths if no bfq_queue is being weight-raised:
>>> @@ -7143,21 +7144,8 @@ static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
>>>  	bfqd->async_depths[1][0] = max((nr_requests * 3) >> 4, 1U);
>>>  	/* no more than ~37% of tags for sync writes (~20% extra tags) */
>>>  	bfqd->async_depths[1][1] = max((nr_requests * 6) >> 4, 1U);
>>> -}
>>> -
>>> -static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
>>> -{
>>> -	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
>>> -	struct blk_mq_tags *tags = hctx->sched_tags;
>>>  
>>> -	bfq_update_depths(bfqd, &tags->bitmap_tags);
>>> -	sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, 1);
>>> -}
>>> -
>>> -static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
>>> -{
>>> -	bfq_depth_updated(hctx);
>>> -	return 0;
>>> +	blk_mq_set_min_shallow_depth(q, 1);
>>>  }
>>>  
>>>  static void bfq_exit_queue(struct elevator_queue *e)
>>> @@ -7628,7 +7616,6 @@ static struct elevator_type iosched_bfq_mq = {
>>>  		.request_merged		= bfq_request_merged,
>>>  		.has_work		= bfq_has_work,
>>>  		.depth_updated		= bfq_depth_updated,
>>> -		.init_hctx		= bfq_init_hctx,
>>>  		.init_sched		= bfq_init_queue,
>>>  		.exit_sched		= bfq_exit_queue,
>>>  	},
>>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>>> index e2ce4a28e6c9..bf7dd97422ec 100644
>>> --- a/block/blk-mq-sched.c
>>> +++ b/block/blk-mq-sched.c
>>> @@ -585,6 +585,9 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
>>>  			}
>>>  		}
>>>  	}
>>> +
>>> +	if (e->ops.depth_updated)
>>> +		e->ops.depth_updated(q);
>>>  	return 0;
>>>  
>>
>> Overall the changes look good. That said, I think it might be cleaner to structure
>> it this way:
>>
>> elevator_switch -> blk_mq_init_sched ->init_sched ==> sets async_depth
>> blk_mq_update_nr_requests ->depth_updated ==> updates async_depth
>>
>> This way, we don’t need to call ->depth_updated from blk_mq_init_sched.
> 
> Just to be sure, you mean calling the depth_updated() method directly
> inside the init_sched() method? This is indeed cleaner; each elevator
> then uses this method to initialize async_depth.

Yes, you're right. As ->init_sched() already has a pointer to the request
queue, you can now call the ->depth_updated() method of the respective
elevator directly from ->init_sched().

Thanks,
--Nilay

