linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/10] blk-mq: fix blk_mq_tags double free while nr_requests grown
@ 2025-08-15  8:02 Yu Kuai
  2025-08-15  8:02 ` [PATCH 01/10] blk-mq: remove useless checking from queue_requests_store() Yu Kuai
                   ` (10 more replies)
  0 siblings, 11 replies; 23+ messages in thread
From: Yu Kuai @ 2025-08-15  8:02 UTC (permalink / raw)
  To: axboe, hare, nilay, ming.lei
  Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

In the case where the user triggers tags growth via the queue sysfs
attribute nr_requests, hctx->sched_tags is freed directly and replaced
with newly allocated tags, see blk_mq_tag_update_depth().

The problem is that hctx->sched_tags comes from elevator->et->tags, and
et->tags still points to the freed tags, hence later elevator exit will
try to free the tags again, causing a kernel panic.

Patches 1-6 are prep cleanup and refactor patches for updating nr_requests.
Patches 7 and 8 fix the regression.
Patch 9 is a cleanup on top of patch 8.
Patch 10 fixes the stale nr_requests documentation.

Yu Kuai (10):
  blk-mq: remove useless checking from queue_requests_store()
  blk-mq: remove useless checkings from blk_mq_update_nr_requests()
  blk-mq: check invalid nr_requests in queue_requests_store()
  blk-mq: serialize updating nr_requests with update_nr_hwq_lock
  blk-mq: cleanup shared tags case in blk_mq_update_nr_requests()
  blk-mq: split bitmap grow and resize case in
    blk_mq_update_nr_requests()
  blk-mq-sched: add new parameter nr_requests in
    blk_mq_alloc_sched_tags()
  blk-mq: fix blk_mq_tags double free while nr_requests grown
  blk-mq: remove blk_mq_tag_update_depth()
  blk-mq: fix stale nr_requests documentation

 Documentation/ABI/stable/sysfs-block | 14 ++-----
 block/blk-mq-sched.c                 | 14 +++----
 block/blk-mq-sched.h                 |  2 +-
 block/blk-mq-tag.c                   | 52 -----------------------
 block/blk-mq.c                       | 62 +++++++++++-----------------
 block/blk-mq.h                       | 17 ++++++--
 block/blk-sysfs.c                    | 44 +++++++++++++++-----
 block/elevator.c                     |  3 +-
 8 files changed, 84 insertions(+), 124 deletions(-)

-- 
2.39.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 01/10] blk-mq: remove useless checking from queue_requests_store()
  2025-08-15  8:02 [PATCH 00/10] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
@ 2025-08-15  8:02 ` Yu Kuai
  2025-08-15  8:02 ` [PATCH 02/10] blk-mq: remove useless checkings from blk_mq_update_nr_requests() Yu Kuai
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Yu Kuai @ 2025-08-15  8:02 UTC (permalink / raw)
  To: axboe, hare, nilay, ming.lei
  Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

blk_mq_queue_attr_visible() has already checked queue_is_mq(); there is
no need to check it again in queue_requests_store().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-sysfs.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index c5cf79a20842..1086f7b9da28 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -69,9 +69,6 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
 	unsigned int memflags;
 	struct request_queue *q = disk->queue;
 
-	if (!queue_is_mq(q))
-		return -EINVAL;
-
 	ret = queue_var_store(&nr, page, count);
 	if (ret < 0)
 		return ret;
-- 
2.39.2



* [PATCH 02/10] blk-mq: remove useless checkings from blk_mq_update_nr_requests()
  2025-08-15  8:02 [PATCH 00/10] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
  2025-08-15  8:02 ` [PATCH 01/10] blk-mq: remove useless checking from queue_requests_store() Yu Kuai
@ 2025-08-15  8:02 ` Yu Kuai
  2025-08-15  8:02 ` [PATCH 03/10] blk-mq: check invalid nr_requests in queue_requests_store() Yu Kuai
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Yu Kuai @ 2025-08-15  8:02 UTC (permalink / raw)
  To: axboe, hare, nilay, ming.lei
  Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

1) queue_requests_store() is the only caller of
blk_mq_update_nr_requests(), and the queue is already frozen there; no
need to check mq_freeze_depth.
2) q->tag_set must be set for a request-based device, and queue_is_mq()
is already checked in blk_mq_queue_attr_visible(); no need to check
q->tag_set.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9c68749124c6..ea2995d4a917 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4921,21 +4921,14 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
 	struct blk_mq_hw_ctx *hctx;
-	int ret;
 	unsigned long i;
-
-	if (WARN_ON_ONCE(!q->mq_freeze_depth))
-		return -EINVAL;
-
-	if (!set)
-		return -EINVAL;
+	int ret = 0;
 
 	if (q->nr_requests == nr)
 		return 0;
 
 	blk_mq_quiesce_queue(q);
 
-	ret = 0;
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx->tags)
 			continue;
-- 
2.39.2



* [PATCH 03/10] blk-mq: check invalid nr_requests in queue_requests_store()
  2025-08-15  8:02 [PATCH 00/10] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
  2025-08-15  8:02 ` [PATCH 01/10] blk-mq: remove useless checking from queue_requests_store() Yu Kuai
  2025-08-15  8:02 ` [PATCH 02/10] blk-mq: remove useless checkings from blk_mq_update_nr_requests() Yu Kuai
@ 2025-08-15  8:02 ` Yu Kuai
  2025-08-15  8:02 ` [PATCH 04/10] blk-mq: serialize updating nr_requests with update_nr_hwq_lock Yu Kuai
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Yu Kuai @ 2025-08-15  8:02 UTC (permalink / raw)
  To: axboe, hare, nilay, ming.lei
  Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

queue_requests_store() is the only caller of
blk_mq_update_nr_requests(), and blk_mq_update_nr_requests() is the
only caller of blk_mq_tag_update_depth(); however, they all check the
nr_requests value input by the user.

Make the code cleaner by moving all the checks to the top-level
function:

1) nr_requests > reserved tags;
2) if there is an elevator, 4 <= nr_requests <= 2048;
3) if the elevator is none, 4 <= nr_requests <= tag_set->queue_depth.

Meanwhile, case 2 is the only case where tags can grow and -ENOMEM
might be returned.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-tag.c | 16 +---------------
 block/blk-mq.c     | 13 ++++---------
 block/blk-mq.h     |  2 +-
 block/blk-sysfs.c  | 13 +++++++++++++
 4 files changed, 19 insertions(+), 25 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d880c50629d6..7613a9889eb1 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -584,14 +584,10 @@ void blk_mq_free_tags(struct blk_mq_tags *tags)
 }
 
 int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
-			    struct blk_mq_tags **tagsptr, unsigned int tdepth,
-			    bool can_grow)
+			    struct blk_mq_tags **tagsptr, unsigned int tdepth)
 {
 	struct blk_mq_tags *tags = *tagsptr;
 
-	if (tdepth <= tags->nr_reserved_tags)
-		return -EINVAL;
-
 	/*
 	 * If we are allowed to grow beyond the original size, allocate
 	 * a new set of tags before freeing the old one.
@@ -600,16 +596,6 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 		struct blk_mq_tag_set *set = hctx->queue->tag_set;
 		struct blk_mq_tags *new;
 
-		if (!can_grow)
-			return -EINVAL;
-
-		/*
-		 * We need some sort of upper limit, set it high enough that
-		 * no valid use cases should require more.
-		 */
-		if (tdepth > MAX_SCHED_RQ)
-			return -EINVAL;
-
 		/*
 		 * Only the sbitmap needs resizing since we allocated the max
 		 * initially.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ea2995d4a917..048d6b2cffe6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4924,9 +4924,6 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 	unsigned long i;
 	int ret = 0;
 
-	if (q->nr_requests == nr)
-		return 0;
-
 	blk_mq_quiesce_queue(q);
 
 	queue_for_each_hw_ctx(q, hctx, i) {
@@ -4936,13 +4933,11 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 		 * If we're using an MQ scheduler, just update the scheduler
 		 * queue depth. This is similar to what the old code would do.
 		 */
-		if (hctx->sched_tags) {
+		if (hctx->sched_tags)
 			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
-						      nr, true);
-		} else {
-			ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
-						      false);
-		}
+						      nr);
+		else
+			ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr);
 		if (ret)
 			goto out;
 	}
diff --git a/block/blk-mq.h b/block/blk-mq.h
index affb2e14b56e..2b3ade60c90b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -171,7 +171,7 @@ void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
 		unsigned int tag);
 void blk_mq_put_tags(struct blk_mq_tags *tags, int *tag_array, int nr_tags);
 int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
-		struct blk_mq_tags **tags, unsigned int depth, bool can_grow);
+		struct blk_mq_tags **tags, unsigned int depth);
 void blk_mq_tag_resize_shared_tags(struct blk_mq_tag_set *set,
 		unsigned int size);
 void blk_mq_tag_update_sched_shared_tags(struct request_queue *q);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 1086f7b9da28..f3d08edcc34f 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -75,12 +75,25 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
 
 	memflags = blk_mq_freeze_queue(q);
 	mutex_lock(&q->elevator_lock);
+
+	if (q->nr_requests == nr)
+		goto unlock;
+
 	if (nr < BLKDEV_MIN_RQ)
 		nr = BLKDEV_MIN_RQ;
 
+	if (nr <= q->tag_set->reserved_tags ||
+	    (q->elevator && nr > MAX_SCHED_RQ) ||
+	    (!q->elevator && nr > q->tag_set->queue_depth)) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
 	err = blk_mq_update_nr_requests(disk->queue, nr);
 	if (err)
 		ret = err;
+
+unlock:
 	mutex_unlock(&q->elevator_lock);
 	blk_mq_unfreeze_queue(q, memflags);
 	return ret;
-- 
2.39.2



* [PATCH 04/10] blk-mq: serialize updating nr_requests with update_nr_hwq_lock
  2025-08-15  8:02 [PATCH 00/10] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
                   ` (2 preceding siblings ...)
  2025-08-15  8:02 ` [PATCH 03/10] blk-mq: check invalid nr_requests in queue_requests_store() Yu Kuai
@ 2025-08-15  8:02 ` Yu Kuai
  2025-08-15 14:47   ` Ming Lei
  2025-08-15  8:02 ` [PATCH 05/10] blk-mq: cleanup shared tags case in blk_mq_update_nr_requests() Yu Kuai
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Yu Kuai @ 2025-08-15  8:02 UTC (permalink / raw)
  To: axboe, hare, nilay, ming.lei
  Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

request_queue->nr_requests can be changed by:

a) switching the elevator while updating nr_hw_queues
b) switching the elevator via the elevator sysfs attribute
c) writing the queue sysfs attribute nr_requests

The current lock order is:

1) update_nr_hwq_lock, cases a,b
- allocate elevator tags
2) freeze_queue
3) elevator_lock, cases a,b,c
- update nr_requests

update_nr_hwq_lock is not held in case c for now, and we need to check
nr_requests while elevator_lock is held; however, we can't allocate
elevator_tags while the queue is frozen.

Hence hold update_nr_hwq_lock in case c as well and check nr_requests
under it; this makes it possible to allocate new elevator_tags without
the queue frozen, and also prevents nr_requests from being changed
concurrently by other contexts.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-sysfs.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f3d08edcc34f..37eae7927b45 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
 	int ret, err;
 	unsigned int memflags;
 	struct request_queue *q = disk->queue;
+	struct blk_mq_tag_set *set = q->tag_set;
 
 	ret = queue_var_store(&nr, page, count);
 	if (ret < 0)
 		return ret;
 
-	memflags = blk_mq_freeze_queue(q);
-	mutex_lock(&q->elevator_lock);
+	/* serialize with switching elevator */
+	down_write(&set->update_nr_hwq_lock);
 
 	if (q->nr_requests == nr)
 		goto unlock;
@@ -89,13 +90,17 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
 		goto unlock;
 	}
 
+	memflags = blk_mq_freeze_queue(q);
+	mutex_lock(&q->elevator_lock);
 	err = blk_mq_update_nr_requests(disk->queue, nr);
 	if (err)
 		ret = err;
 
-unlock:
 	mutex_unlock(&q->elevator_lock);
 	blk_mq_unfreeze_queue(q, memflags);
+
+unlock:
+	up_write(&set->update_nr_hwq_lock);
 	return ret;
 }
 
-- 
2.39.2



* [PATCH 05/10] blk-mq: cleanup shared tags case in blk_mq_update_nr_requests()
  2025-08-15  8:02 [PATCH 00/10] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
                   ` (3 preceding siblings ...)
  2025-08-15  8:02 ` [PATCH 04/10] blk-mq: serialize updating nr_requests with update_nr_hwq_lock Yu Kuai
@ 2025-08-15  8:02 ` Yu Kuai
  2025-08-15  8:02 ` [PATCH 06/10] blk-mq: split bitmap grow and resize " Yu Kuai
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Yu Kuai @ 2025-08-15  8:02 UTC (permalink / raw)
  To: axboe, hare, nilay, ming.lei
  Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

For the shared tags case, all hctx->sched_tags/tags are the same, so it
doesn't make sense to call blk_mq_tag_update_depth() multiple times
for the same tags.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-tag.c |  7 -------
 block/blk-mq.c     | 42 ++++++++++++++++++++++--------------------
 2 files changed, 22 insertions(+), 27 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 7613a9889eb1..84a452e708e4 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -596,13 +596,6 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 		struct blk_mq_tag_set *set = hctx->queue->tag_set;
 		struct blk_mq_tags *new;
 
-		/*
-		 * Only the sbitmap needs resizing since we allocated the max
-		 * initially.
-		 */
-		if (blk_mq_is_shared_tags(set->flags))
-			return 0;
-
 		new = blk_mq_alloc_map_and_rqs(set, hctx->queue_num, tdepth);
 		if (!new)
 			return -ENOMEM;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 048d6b2cffe6..1e85fcd474c2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4926,33 +4926,35 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 
 	blk_mq_quiesce_queue(q);
 
-	queue_for_each_hw_ctx(q, hctx, i) {
-		if (!hctx->tags)
-			continue;
-		/*
-		 * If we're using an MQ scheduler, just update the scheduler
-		 * queue depth. This is similar to what the old code would do.
-		 */
-		if (hctx->sched_tags)
-			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
-						      nr);
-		else
-			ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr);
-		if (ret)
-			goto out;
-	}
-
-	q->nr_requests = nr;
-	if (q->elevator && q->elevator->type->ops.depth_updated)
-		q->elevator->type->ops.depth_updated(q);
-
 	if (blk_mq_is_shared_tags(set->flags)) {
 		if (q->elevator)
 			blk_mq_tag_update_sched_shared_tags(q);
 		else
 			blk_mq_tag_resize_shared_tags(set, nr);
+	} else {
+		queue_for_each_hw_ctx(q, hctx, i) {
+			if (!hctx->tags)
+				continue;
+			/*
+			 * If we're using an MQ scheduler, just update the
+			 * scheduler queue depth. This is similar to what the
+			 * old code would do.
+			 */
+			if (hctx->sched_tags)
+				ret = blk_mq_tag_update_depth(hctx,
+						&hctx->sched_tags, nr);
+			else
+				ret = blk_mq_tag_update_depth(hctx,
+						&hctx->tags, nr);
+			if (ret)
+				goto out;
+		}
 	}
 
+	q->nr_requests = nr;
+	if (q->elevator && q->elevator->type->ops.depth_updated)
+		q->elevator->type->ops.depth_updated(q);
+
 out:
 	blk_mq_unquiesce_queue(q);
 
-- 
2.39.2



* [PATCH 06/10] blk-mq: split bitmap grow and resize case in blk_mq_update_nr_requests()
  2025-08-15  8:02 [PATCH 00/10] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
                   ` (4 preceding siblings ...)
  2025-08-15  8:02 ` [PATCH 05/10] blk-mq: cleanup shared tags case in blk_mq_update_nr_requests() Yu Kuai
@ 2025-08-15  8:02 ` Yu Kuai
  2025-08-15  8:02 ` [PATCH 07/10] blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags() Yu Kuai
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Yu Kuai @ 2025-08-15  8:02 UTC (permalink / raw)
  To: axboe, hare, nilay, ming.lei
  Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

No functional changes are intended; make the code cleaner and prepare
to fix the grow case in the following patches.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq.c | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1e85fcd474c2..11c8baebb9a0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4931,21 +4931,26 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 			blk_mq_tag_update_sched_shared_tags(q);
 		else
 			blk_mq_tag_resize_shared_tags(set, nr);
+	} else if (!q->elevator) {
+		queue_for_each_hw_ctx(q, hctx, i) {
+			if (!hctx->tags)
+				continue;
+			sbitmap_queue_resize(&hctx->tags->bitmap_tags,
+				nr - hctx->tags->nr_reserved_tags);
+		}
+	} else if (nr <= q->elevator->et->nr_requests) {
+		queue_for_each_hw_ctx(q, hctx, i) {
+			if (!hctx->tags)
+				continue;
+			sbitmap_queue_resize(&hctx->sched_tags->bitmap_tags,
+				nr - hctx->sched_tags->nr_reserved_tags);
+		}
 	} else {
 		queue_for_each_hw_ctx(q, hctx, i) {
 			if (!hctx->tags)
 				continue;
-			/*
-			 * If we're using an MQ scheduler, just update the
-			 * scheduler queue depth. This is similar to what the
-			 * old code would do.
-			 */
-			if (hctx->sched_tags)
-				ret = blk_mq_tag_update_depth(hctx,
-						&hctx->sched_tags, nr);
-			else
-				ret = blk_mq_tag_update_depth(hctx,
-						&hctx->tags, nr);
+			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
+						      nr);
 			if (ret)
 				goto out;
 		}
-- 
2.39.2



* [PATCH 07/10] blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags()
  2025-08-15  8:02 [PATCH 00/10] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
                   ` (5 preceding siblings ...)
  2025-08-15  8:02 ` [PATCH 06/10] blk-mq: split bitmap grow and resize " Yu Kuai
@ 2025-08-15  8:02 ` Yu Kuai
  2025-08-15  8:02 ` [PATCH 08/10] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Yu Kuai @ 2025-08-15  8:02 UTC (permalink / raw)
  To: axboe, hare, nilay, ming.lei
  Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

This helper only supports allocating the default number of requests;
add a new parameter to support a specific number of requests.

This prepares for fixing the tags double free problem when nr_requests
is grown via the queue sysfs attribute nr_requests.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-sched.c | 14 +++++---------
 block/blk-mq-sched.h |  2 +-
 block/blk-mq.h       | 11 +++++++++++
 block/elevator.c     |  3 ++-
 4 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index bf7dd97422ec..f3a918e36589 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -454,7 +454,7 @@ void blk_mq_free_sched_tags_batch(struct xarray *et_table,
 }
 
 struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
-		unsigned int nr_hw_queues)
+		unsigned int nr_hw_queues, unsigned int nr_requests)
 {
 	unsigned int nr_tags;
 	int i;
@@ -470,13 +470,8 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
 			nr_tags * sizeof(struct blk_mq_tags *), gfp);
 	if (!et)
 		return NULL;
-	/*
-	 * Default to double of smaller one between hw queue_depth and
-	 * 128, since we don't split into sync/async like the old code
-	 * did. Additionally, this is a per-hw queue depth.
-	 */
-	et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
-			BLKDEV_DEFAULT_RQ);
+
+	et->nr_requests = nr_requests;
 	et->nr_hw_queues = nr_hw_queues;
 
 	if (blk_mq_is_shared_tags(set->flags)) {
@@ -521,7 +516,8 @@ int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
 		 * concurrently.
 		 */
 		if (q->elevator) {
-			et = blk_mq_alloc_sched_tags(set, nr_hw_queues);
+			et = blk_mq_alloc_sched_tags(set, nr_hw_queues,
+					blk_mq_default_nr_requests(set));
 			if (!et)
 				goto out_unwind;
 			if (xa_insert(et_table, q->id, et, gfp))
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index fe83187f41db..8e21a6b1415d 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -24,7 +24,7 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
 void blk_mq_sched_free_rqs(struct request_queue *q);
 
 struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
-		unsigned int nr_hw_queues);
+		unsigned int nr_hw_queues, unsigned int nr_requests);
 int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
 		struct blk_mq_tag_set *set, unsigned int nr_hw_queues);
 void blk_mq_free_sched_tags(struct elevator_tags *et,
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 2b3ade60c90b..731f4578d9a8 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -109,6 +109,17 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(blk_opf_t opf,
 	return ctx->hctxs[blk_mq_get_hctx_type(opf)];
 }
 
+/*
+ * Default to double of smaller one between hw queue_depth and
+ * 128, since we don't split into sync/async like the old code
+ * did. Additionally, this is a per-hw queue depth.
+ */
+static inline unsigned int blk_mq_default_nr_requests(
+		struct blk_mq_tag_set *set)
+{
+	return 2 * min_t(unsigned int, set->queue_depth, BLKDEV_DEFAULT_RQ);
+}
+
 /*
  * sysfs helpers
  */
diff --git a/block/elevator.c b/block/elevator.c
index fe96c6f4753c..e2ebfbf107b3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -669,7 +669,8 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
 	lockdep_assert_held(&set->update_nr_hwq_lock);
 
 	if (strncmp(ctx->name, "none", 4)) {
-		ctx->et = blk_mq_alloc_sched_tags(set, set->nr_hw_queues);
+		ctx->et = blk_mq_alloc_sched_tags(set, set->nr_hw_queues,
+				blk_mq_default_nr_requests(set));
 		if (!ctx->et)
 			return -ENOMEM;
 	}
-- 
2.39.2



* [PATCH 08/10] blk-mq: fix blk_mq_tags double free while nr_requests grown
  2025-08-15  8:02 [PATCH 00/10] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
                   ` (6 preceding siblings ...)
  2025-08-15  8:02 ` [PATCH 07/10] blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags() Yu Kuai
@ 2025-08-15  8:02 ` Yu Kuai
  2025-08-15 19:30   ` Nilay Shroff
  2025-08-15  8:02 ` [PATCH 09/10] blk-mq: remove blk_mq_tag_update_depth() Yu Kuai
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Yu Kuai @ 2025-08-15  8:02 UTC (permalink / raw)
  To: axboe, hare, nilay, ming.lei
  Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

In the case where the user triggers tags growth via the queue sysfs
attribute nr_requests, hctx->sched_tags is freed directly and replaced
with newly allocated tags, see blk_mq_tag_update_depth().

The problem is that hctx->sched_tags comes from elevator->et->tags, and
et->tags still points to the freed tags, hence later elevator exit will
try to free the tags again, causing a kernel panic.

Fix this problem by using the newly allocated elevator_tags; also
convert blk_mq_update_nr_requests() to void since this helper can never
fail now.

Meanwhile, a long-standing problem can be fixed as well:

if blk_mq_tag_update_depth() succeeded for a previous hctx, its bitmap
depth was updated; however, if a following hctx fails, q->nr_requests
is not updated and the previous hctx->sched_tags ends up bigger than
q->nr_requests.

Fixes: f5a6604f7a44 ("block: fix lockdep warning caused by lock dependency in elv_iosched_store")
Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq.c    | 19 ++++++-------------
 block/blk-mq.h    |  4 +++-
 block/blk-sysfs.c | 21 ++++++++++++++-------
 3 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 11c8baebb9a0..e9f037a25fe3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4917,12 +4917,12 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 }
 EXPORT_SYMBOL(blk_mq_free_tag_set);
 
-int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
+void blk_mq_update_nr_requests(struct request_queue *q,
+			       struct elevator_tags *et, unsigned int nr)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
 	struct blk_mq_hw_ctx *hctx;
 	unsigned long i;
-	int ret = 0;
 
 	blk_mq_quiesce_queue(q);
 
@@ -4946,24 +4946,17 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 				nr - hctx->sched_tags->nr_reserved_tags);
 		}
 	} else {
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (!hctx->tags)
-				continue;
-			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
-						      nr);
-			if (ret)
-				goto out;
-		}
+		blk_mq_free_sched_tags(q->elevator->et, set);
+		queue_for_each_hw_ctx(q, hctx, i)
+			hctx->sched_tags = et->tags[i];
+		q->elevator->et = et;
 	}
 
 	q->nr_requests = nr;
 	if (q->elevator && q->elevator->type->ops.depth_updated)
 		q->elevator->type->ops.depth_updated(q);
 
-out:
 	blk_mq_unquiesce_queue(q);
-
-	return ret;
 }
 
 /*
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 731f4578d9a8..98740f9c204e 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -6,6 +6,7 @@
 #include "blk-stat.h"
 
 struct blk_mq_tag_set;
+struct elevator_tags;
 
 struct blk_mq_ctxs {
 	struct kobject kobj;
@@ -45,7 +46,8 @@ void blk_mq_submit_bio(struct bio *bio);
 int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, struct io_comp_batch *iob,
 		unsigned int flags);
 void blk_mq_exit_queue(struct request_queue *q);
-int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
+void blk_mq_update_nr_requests(struct request_queue *q,
+			       struct elevator_tags *et, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
 bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *,
 			     bool);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 37eae7927b45..848648456255 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -64,11 +64,12 @@ static ssize_t queue_requests_show(struct gendisk *disk, char *page)
 static ssize_t
 queue_requests_store(struct gendisk *disk, const char *page, size_t count)
 {
-	unsigned long nr;
-	int ret, err;
-	unsigned int memflags;
 	struct request_queue *q = disk->queue;
 	struct blk_mq_tag_set *set = q->tag_set;
+	struct elevator_tags *et = NULL;
+	unsigned int memflags;
+	unsigned long nr;
+	int ret;
 
 	ret = queue_var_store(&nr, page, count);
 	if (ret < 0)
@@ -90,12 +91,18 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
 		goto unlock;
 	}
 
+	if (q->elevator && nr > q->elevator->et->nr_requests) {
+		/* allocate memory before freezing queue to prevent deadlock */
+		et = blk_mq_alloc_sched_tags(set, q->nr_hw_queues, nr);
+		if (!et) {
+			ret = -ENOMEM;
+			goto unlock;
+		}
+	}
+
 	memflags = blk_mq_freeze_queue(q);
 	mutex_lock(&q->elevator_lock);
-	err = blk_mq_update_nr_requests(disk->queue, nr);
-	if (err)
-		ret = err;
-
+	blk_mq_update_nr_requests(disk->queue, et, nr);
 	mutex_unlock(&q->elevator_lock);
 	blk_mq_unfreeze_queue(q, memflags);
 
-- 
2.39.2



* [PATCH 09/10] blk-mq: remove blk_mq_tag_update_depth()
  2025-08-15  8:02 [PATCH 00/10] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
                   ` (7 preceding siblings ...)
  2025-08-15  8:02 ` [PATCH 08/10] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
@ 2025-08-15  8:02 ` Yu Kuai
  2025-08-15  8:02 ` [PATCH 10/10] blk-mq: fix stale nr_requests documentation Yu Kuai
  2025-08-15  8:30 ` [PATCH 00/10] blk-mq: fix blk_mq_tags double free while nr_requests grown Ming Lei
  10 siblings, 0 replies; 23+ messages in thread
From: Yu Kuai @ 2025-08-15  8:02 UTC (permalink / raw)
  To: axboe, hare, nilay, ming.lei
  Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

This helper is no longer used.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-mq-tag.c | 31 -------------------------------
 block/blk-mq.h     |  2 --
 2 files changed, 33 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 84a452e708e4..cbb93a653cef 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -583,37 +583,6 @@ void blk_mq_free_tags(struct blk_mq_tags *tags)
 	kfree(tags);
 }
 
-int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
-			    struct blk_mq_tags **tagsptr, unsigned int tdepth)
-{
-	struct blk_mq_tags *tags = *tagsptr;
-
-	/*
-	 * If we are allowed to grow beyond the original size, allocate
-	 * a new set of tags before freeing the old one.
-	 */
-	if (tdepth > tags->nr_tags) {
-		struct blk_mq_tag_set *set = hctx->queue->tag_set;
-		struct blk_mq_tags *new;
-
-		new = blk_mq_alloc_map_and_rqs(set, hctx->queue_num, tdepth);
-		if (!new)
-			return -ENOMEM;
-
-		blk_mq_free_map_and_rqs(set, *tagsptr, hctx->queue_num);
-		*tagsptr = new;
-	} else {
-		/*
-		 * Don't need (or can't) update reserved tags here, they
-		 * remain static and should never need resizing.
-		 */
-		sbitmap_queue_resize(&tags->bitmap_tags,
-				tdepth - tags->nr_reserved_tags);
-	}
-
-	return 0;
-}
-
 void blk_mq_tag_resize_shared_tags(struct blk_mq_tag_set *set, unsigned int size)
 {
 	struct blk_mq_tags *tags = set->shared_tags;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 98740f9c204e..aaa006fe5076 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -183,8 +183,6 @@ unsigned long blk_mq_get_tags(struct blk_mq_alloc_data *data, int nr_tags,
 void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
 		unsigned int tag);
 void blk_mq_put_tags(struct blk_mq_tags *tags, int *tag_array, int nr_tags);
-int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
-		struct blk_mq_tags **tags, unsigned int depth);
 void blk_mq_tag_resize_shared_tags(struct blk_mq_tag_set *set,
 		unsigned int size);
 void blk_mq_tag_update_sched_shared_tags(struct request_queue *q);
-- 
2.39.2



* [PATCH 10/10] blk-mq: fix stale nr_requests documentation
  2025-08-15  8:02 [PATCH 00/10] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
                   ` (8 preceding siblings ...)
  2025-08-15  8:02 ` [PATCH 09/10] blk-mq: remove blk_mq_tag_update_depth() Yu Kuai
@ 2025-08-15  8:02 ` Yu Kuai
  2025-08-15  8:30 ` [PATCH 00/10] blk-mq: fix blk_mq_tags double free while nr_requests grown Ming Lei
  10 siblings, 0 replies; 23+ messages in thread
From: Yu Kuai @ 2025-08-15  8:02 UTC (permalink / raw)
  To: axboe, hare, nilay, ming.lei
  Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
	johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

The nr_requests documentation still describes the removed single-queue
code path; remove that text and update it for the current blk-mq
implementation.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 Documentation/ABI/stable/sysfs-block | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 0ddffc9133d0..0ed10aeff86b 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -603,16 +603,10 @@ Date:		July 2003
 Contact:	linux-block@vger.kernel.org
 Description:
 		[RW] This controls how many requests may be allocated in the
-		block layer for read or write requests. Note that the total
-		allocated number may be twice this amount, since it applies only
-		to reads or writes (not the accumulated sum).
-
-		To avoid priority inversion through request starvation, a
-		request queue maintains a separate request pool per each cgroup
-		when CONFIG_BLK_CGROUP is enabled, and this parameter applies to
-		each such per-block-cgroup request pool.  IOW, if there are N
-		block cgroups, each request queue may have up to N request
-		pools, each independently regulated by nr_requests.
+		block layer. Note that this value only represents the depth of
+		a single blk_mq_tags instance. The actual number for the entire
+		device depends on the hardware queue count, whether an elevator
+		is enabled, and whether tags are shared.
 
 
 What:		/sys/block/<disk>/queue/nr_zones
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 00/10] blk-mq: fix blk_mq_tags double free while nr_requests grown
  2025-08-15  8:02 [PATCH 00/10] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
                   ` (9 preceding siblings ...)
  2025-08-15  8:02 ` [PATCH 10/10] blk-mq: fix stale nr_requests documentation Yu Kuai
@ 2025-08-15  8:30 ` Ming Lei
  2025-08-15  9:05   ` Yu Kuai
  10 siblings, 1 reply; 23+ messages in thread
From: Ming Lei @ 2025-08-15  8:30 UTC (permalink / raw)
  To: Yu Kuai
  Cc: axboe, hare, nilay, linux-block, linux-kernel, yukuai3, yi.zhang,
	yangerkun, johnny.chenyi

On Fri, Aug 15, 2025 at 04:02:06PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> In the case user trigger tags grow by queue sysfs attribute nr_requests,
> hctx->sched_tags will be freed directly and replaced with a new
> allocated tags, see blk_mq_tag_update_depth().
> 
> The problem is that hctx->sched_tags is from elevator->et->tags, while
> et->tags is still the freed tags, hence later elevator exist will try to
> free the tags again, causing kernel panic.
> 
> patch 1-6 are prep cleanup and refactor patches for updating nr_requests
> patch 7,8 are the fix patches for the regression
> patch 9 is cleanup patch after patch 8
> patch 10 fix the stale nr_requests documentation

Please do not mix bug (regression) fixes with cleanups.

The bug fix for updating nr_requests should have been simple enough to
do in one or two patches; why do you need 10 patches to deal with the
regression?

Not to mention this way is really unfriendly for stable tree backports.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 00/10] blk-mq: fix blk_mq_tags double free while nr_requests grown
  2025-08-15  8:30 ` [PATCH 00/10] blk-mq: fix blk_mq_tags double free while nr_requests grown Ming Lei
@ 2025-08-15  9:05   ` Yu Kuai
  2025-08-15 14:20     ` Ming Lei
  0 siblings, 1 reply; 23+ messages in thread
From: Yu Kuai @ 2025-08-15  9:05 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: axboe, hare, nilay, linux-block, linux-kernel, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

Hi,

在 2025/08/15 16:30, Ming Lei 写道:
> On Fri, Aug 15, 2025 at 04:02:06PM +0800, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> In the case user trigger tags grow by queue sysfs attribute nr_requests,
>> hctx->sched_tags will be freed directly and replaced with a new
>> allocated tags, see blk_mq_tag_update_depth().
>>
>> The problem is that hctx->sched_tags is from elevator->et->tags, while
>> et->tags is still the freed tags, hence later elevator exist will try to
>> free the tags again, causing kernel panic.
>>
>> patch 1-6 are prep cleanup and refactor patches for updating nr_requests
>> patch 7,8 are the fix patches for the regression
>> patch 9 is cleanup patch after patch 8
>> patch 10 fix the stale nr_requests documentation
> 
> Please do not mix bug(regression) fix with cleanup.
> 
> The bug fix for updating nr_requests should have been simple enough in single
> or two patches, why do you make 10-patches for dealing with the regression?

Ok, in short, my solution is:

- serialize switching elevator with updating nr_requests
- check the case that nr_requests will grow and allocate elevator_tags
before freezing the queue.
- for the grow case, switch to new elevator_tags.

I did try, and I couldn't find an easy way to fix this without making
the related code awkward. Perhaps that is because I did the cleanups and
refactoring first and can't think outside the box...

> 
> Not mention this way is really unfriendly for stable tree backport.

I checked: the last time the code related to queue_requests_store() was
changed is commit 3efe7571c3ae ("block: protect nr_requests update using
q->elevator_lock"), and I believe that is what the fix patches rely on,
so the backport should not have many conflicts.

Whichever stable branch f5a6604f7a44 ("block: fix lockdep warning
caused by lock dependency in elv_iosched_store") is backported to, I can
make sure a proper fix is backported as well.

Thanks,
Kuai

> 
> 
> Thanks,
> Ming
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 00/10] blk-mq: fix blk_mq_tags double free while nr_requests grown
  2025-08-15  9:05   ` Yu Kuai
@ 2025-08-15 14:20     ` Ming Lei
  0 siblings, 0 replies; 23+ messages in thread
From: Ming Lei @ 2025-08-15 14:20 UTC (permalink / raw)
  To: Yu Kuai
  Cc: axboe, hare, nilay, linux-block, linux-kernel, yi.zhang,
	yangerkun, johnny.chenyi, yukuai (C)

On Fri, Aug 15, 2025 at 05:05:34PM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2025/08/15 16:30, Ming Lei 写道:
> > On Fri, Aug 15, 2025 at 04:02:06PM +0800, Yu Kuai wrote:
> > > From: Yu Kuai <yukuai3@huawei.com>
> > > 
> > > In the case user trigger tags grow by queue sysfs attribute nr_requests,
> > > hctx->sched_tags will be freed directly and replaced with a new
> > > allocated tags, see blk_mq_tag_update_depth().
> > > 
> > > The problem is that hctx->sched_tags is from elevator->et->tags, while
> > > et->tags is still the freed tags, hence later elevator exist will try to
> > > free the tags again, causing kernel panic.
> > > 
> > > patch 1-6 are prep cleanup and refactor patches for updating nr_requests
> > > patch 7,8 are the fix patches for the regression
> > > patch 9 is cleanup patch after patch 8
> > > patch 10 fix the stale nr_requests documentation
> > 
> > Please do not mix bug(regression) fix with cleanup.
> > 
> > The bug fix for updating nr_requests should have been simple enough in single
> > or two patches, why do you make 10-patches for dealing with the regression?
> 
> Ok, in short, my solution is:
> 
> - serialize switching elevator with updating nr_requests
> - check the case that nr_requests will grow and allocate elevator_tags
> before freezing the queue.
> - for the grow case, switch to new elevator_tags.

I'd suggest making one or two commits to fix the recent regression
f5a6604f7a44 ("block: fix lockdep warning caused by lock dependency in elv_iosched_store")
first, because the double free is a serious issue, and the fix should
land in v6.17.

For other long-term or less serious issues, it may be fine to delay them
to v6.18 if the patchset is too big or complicated, which might imply
new regressions.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 04/10] blk-mq: serialize updating nr_requests with update_nr_hwq_lock
  2025-08-15  8:02 ` [PATCH 04/10] blk-mq: serialize updating nr_requests with update_nr_hwq_lock Yu Kuai
@ 2025-08-15 14:47   ` Ming Lei
  2025-08-16  0:49     ` Yu Kuai
  0 siblings, 1 reply; 23+ messages in thread
From: Ming Lei @ 2025-08-15 14:47 UTC (permalink / raw)
  To: Yu Kuai
  Cc: axboe, hare, nilay, linux-block, linux-kernel, yukuai3, yi.zhang,
	yangerkun, johnny.chenyi

On Fri, Aug 15, 2025 at 04:02:10PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> request_queue->nr_requests can be changed by:
> 
> a) switching elevator by update nr_hw_queues
> b) switching elevator by elevator sysfs attribute
> c) configue queue sysfs attribute nr_requests

 ->elevator_lock is grabbed for updating ->nr_requests except for queue
initialization, so what is the real problem you are trying to solve?


Thanks,
Ming


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 08/10] blk-mq: fix blk_mq_tags double free while nr_requests grown
  2025-08-15  8:02 ` [PATCH 08/10] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
@ 2025-08-15 19:30   ` Nilay Shroff
  2025-08-16  2:57     ` Yu Kuai
  0 siblings, 1 reply; 23+ messages in thread
From: Nilay Shroff @ 2025-08-15 19:30 UTC (permalink / raw)
  To: Yu Kuai, axboe, hare, ming.lei
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang, yangerkun,
	johnny.chenyi



On 8/15/25 1:32 PM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> In the case user trigger tags grow by queue sysfs attribute nr_requests,
> hctx->sched_tags will be freed directly and replaced with a new
> allocated tags, see blk_mq_tag_update_depth().
> 
> The problem is that hctx->sched_tags is from elevator->et->tags, while
> et->tags is still the freed tags, hence later elevator exist will try to
> free the tags again, causing kernel panic.
> 
> Fix this problem by using new allocated elevator_tags, also convert
> blk_mq_update_nr_requests to void since this helper will never fail now.
> 
> Meanwhile, there is a longterm problem can be fixed as well:
> 
> If blk_mq_tag_update_depth() succeed for previous hctx, then bitmap depth
> is updated, however, if following hctx failed, q->nr_requests is not
> updated and the previous hctx->sched_tags endup bigger than q->nr_requests.
> 
> Fixes: f5a6604f7a44 ("block: fix lockdep warning caused by lock dependency in elv_iosched_store")
> Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs")
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  block/blk-mq.c    | 19 ++++++-------------
>  block/blk-mq.h    |  4 +++-
>  block/blk-sysfs.c | 21 ++++++++++++++-------
>  3 files changed, 23 insertions(+), 21 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 11c8baebb9a0..e9f037a25fe3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4917,12 +4917,12 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
>  }
>  EXPORT_SYMBOL(blk_mq_free_tag_set);
>  
> -int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
> +void blk_mq_update_nr_requests(struct request_queue *q,
> +			       struct elevator_tags *et, unsigned int nr)
>  {
>  	struct blk_mq_tag_set *set = q->tag_set;
>  	struct blk_mq_hw_ctx *hctx;
>  	unsigned long i;
> -	int ret = 0;
>  
>  	blk_mq_quiesce_queue(q);
>  
> @@ -4946,24 +4946,17 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>  				nr - hctx->sched_tags->nr_reserved_tags);
>  		}
>  	} else {
> -		queue_for_each_hw_ctx(q, hctx, i) {
> -			if (!hctx->tags)
> -				continue;
> -			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
> -						      nr);
> -			if (ret)
> -				goto out;
> -		}
> +		blk_mq_free_sched_tags(q->elevator->et, set);

I think you also need to ensure that the elevator tags are freed after
we unfreeze the queue and release ->elevator_lock, otherwise we may run
into a lockdep splat due to the pcpu_lock dependency on ->freeze_lock
and/or ->elevator_lock. Please note that blk_mq_free_sched_tags()
internally invokes sbitmap_free(), which invokes free_percpu(), which
acquires pcpu_lock.
 
Thanks,
--Nilay



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 04/10] blk-mq: serialize updating nr_requests with update_nr_hwq_lock
  2025-08-15 14:47   ` Ming Lei
@ 2025-08-16  0:49     ` Yu Kuai
  2025-08-16  2:23       ` Ming Lei
  0 siblings, 1 reply; 23+ messages in thread
From: Yu Kuai @ 2025-08-16  0:49 UTC (permalink / raw)
  To: Ming Lei, Yu Kuai
  Cc: axboe, hare, nilay, linux-block, linux-kernel, yukuai3, yi.zhang,
	yangerkun, johnny.chenyi

Hi,

在 2025/8/15 22:47, Ming Lei 写道:
> On Fri, Aug 15, 2025 at 04:02:10PM +0800, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> request_queue->nr_requests can be changed by:
>>
>> a) switching elevator by update nr_hw_queues
>> b) switching elevator by elevator sysfs attribute
>> c) configue queue sysfs attribute nr_requests
>   ->elevator_lock is grabbed for updating ->nr_requests except for queue
> initialization, so what is the real problem you are trying to solve?
>
In order to fix the regression and prevent deadlock, we'll have to
allocate memory for the case where the bitmap tags grow before freezing
the queue, and also prevent nr_requests from being updated by cases a
and b concurrently.


Thanks,

Kuai

> Thanks,
> Ming
>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 04/10] blk-mq: serialize updating nr_requests with update_nr_hwq_lock
  2025-08-16  0:49     ` Yu Kuai
@ 2025-08-16  2:23       ` Ming Lei
  0 siblings, 0 replies; 23+ messages in thread
From: Ming Lei @ 2025-08-16  2:23 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Yu Kuai, axboe, hare, nilay, linux-block, linux-kernel, yukuai3,
	yi.zhang, yangerkun, johnny.chenyi

On Sat, Aug 16, 2025 at 08:49:41AM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2025/8/15 22:47, Ming Lei 写道:
> > On Fri, Aug 15, 2025 at 04:02:10PM +0800, Yu Kuai wrote:
> > > From: Yu Kuai <yukuai3@huawei.com>
> > > 
> > > request_queue->nr_requests can be changed by:
> > > 
> > > a) switching elevator by update nr_hw_queues
> > > b) switching elevator by elevator sysfs attribute
> > > c) configue queue sysfs attribute nr_requests
> >   ->elevator_lock is grabbed for updating ->nr_requests except for queue
> > initialization, so what is the real problem you are trying to solve?
> > 
> In order to fix the regression and prevent deadlock, we'll have to allocate
> memory
> 
> in the case bitmap tags grow before freezing the queue, also prevent
> nr_requests
> 
> to be updated by case a and b concurrently.

Yeah, there is a potential deadlock issue in updating nr_requests.

I feel the patch title is a bit confusing because q->nr_requests updating
is already serialized.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 08/10] blk-mq: fix blk_mq_tags double free while nr_requests grown
  2025-08-15 19:30   ` Nilay Shroff
@ 2025-08-16  2:57     ` Yu Kuai
  2025-08-16  4:05       ` Ming Lei
  0 siblings, 1 reply; 23+ messages in thread
From: Yu Kuai @ 2025-08-16  2:57 UTC (permalink / raw)
  To: Nilay Shroff, Yu Kuai, axboe, hare, ming.lei
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang, yangerkun,
	johnny.chenyi

Hi,

在 2025/8/16 3:30, Nilay Shroff 写道:
>
> On 8/15/25 1:32 PM, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> In the case user trigger tags grow by queue sysfs attribute nr_requests,
>> hctx->sched_tags will be freed directly and replaced with a new
>> allocated tags, see blk_mq_tag_update_depth().
>>
>> The problem is that hctx->sched_tags is from elevator->et->tags, while
>> et->tags is still the freed tags, hence later elevator exist will try to
>> free the tags again, causing kernel panic.
>>
>> Fix this problem by using new allocated elevator_tags, also convert
>> blk_mq_update_nr_requests to void since this helper will never fail now.
>>
>> Meanwhile, there is a longterm problem can be fixed as well:
>>
>> If blk_mq_tag_update_depth() succeed for previous hctx, then bitmap depth
>> is updated, however, if following hctx failed, q->nr_requests is not
>> updated and the previous hctx->sched_tags endup bigger than q->nr_requests.
>>
>> Fixes: f5a6604f7a44 ("block: fix lockdep warning caused by lock dependency in elv_iosched_store")
>> Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs")
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   block/blk-mq.c    | 19 ++++++-------------
>>   block/blk-mq.h    |  4 +++-
>>   block/blk-sysfs.c | 21 ++++++++++++++-------
>>   3 files changed, 23 insertions(+), 21 deletions(-)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 11c8baebb9a0..e9f037a25fe3 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -4917,12 +4917,12 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
>>   }
>>   EXPORT_SYMBOL(blk_mq_free_tag_set);
>>   
>> -int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>> +void blk_mq_update_nr_requests(struct request_queue *q,
>> +			       struct elevator_tags *et, unsigned int nr)
>>   {
>>   	struct blk_mq_tag_set *set = q->tag_set;
>>   	struct blk_mq_hw_ctx *hctx;
>>   	unsigned long i;
>> -	int ret = 0;
>>   
>>   	blk_mq_quiesce_queue(q);
>>   
>> @@ -4946,24 +4946,17 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>>   				nr - hctx->sched_tags->nr_reserved_tags);
>>   		}
>>   	} else {
>> -		queue_for_each_hw_ctx(q, hctx, i) {
>> -			if (!hctx->tags)
>> -				continue;
>> -			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
>> -						      nr);
>> -			if (ret)
>> -				goto out;
>> -		}
>> +		blk_mq_free_sched_tags(q->elevator->et, set);
> I think you also need to ensure that elevator tags are freed after we unfreeze
> queue and release ->elevator_lock otherwise we may get into the lockdep splat
> for pcpu_lock dependency on ->freeze_lock and/or ->elevator_lock. Please note
> that blk_mq_free_sched_tags internally invokes sbitmap_free which invokes
> free_percpu which acquires pcpu_lock.

Ok, thanks for the notice. However, as Ming suggested, we might fix
this problem in the next merge window. I'll send one patch to fix this
regression by replacing et->tags with the reallocated new sched_tags as
well.


Thanks,

Kuai

>   
> Thanks,
> --Nilay
>
>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 08/10] blk-mq: fix blk_mq_tags double free while nr_requests grown
  2025-08-16  2:57     ` Yu Kuai
@ 2025-08-16  4:05       ` Ming Lei
  2025-08-16  8:05         ` 余快
  0 siblings, 1 reply; 23+ messages in thread
From: Ming Lei @ 2025-08-16  4:05 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Nilay Shroff, Yu Kuai, axboe, hare, linux-block, linux-kernel,
	yukuai3, yi.zhang, yangerkun, johnny.chenyi

On Sat, Aug 16, 2025 at 10:57:23AM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2025/8/16 3:30, Nilay Shroff 写道:
> > 
> > On 8/15/25 1:32 PM, Yu Kuai wrote:
> > > From: Yu Kuai <yukuai3@huawei.com>
> > > 
> > > In the case user trigger tags grow by queue sysfs attribute nr_requests,
> > > hctx->sched_tags will be freed directly and replaced with a new
> > > allocated tags, see blk_mq_tag_update_depth().
> > > 
> > > The problem is that hctx->sched_tags is from elevator->et->tags, while
> > > et->tags is still the freed tags, hence later elevator exist will try to
> > > free the tags again, causing kernel panic.
> > > 
> > > Fix this problem by using new allocated elevator_tags, also convert
> > > blk_mq_update_nr_requests to void since this helper will never fail now.
> > > 
> > > Meanwhile, there is a longterm problem can be fixed as well:
> > > 
> > > If blk_mq_tag_update_depth() succeed for previous hctx, then bitmap depth
> > > is updated, however, if following hctx failed, q->nr_requests is not
> > > updated and the previous hctx->sched_tags endup bigger than q->nr_requests.
> > > 
> > > Fixes: f5a6604f7a44 ("block: fix lockdep warning caused by lock dependency in elv_iosched_store")
> > > Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs")
> > > Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> > > ---
> > >   block/blk-mq.c    | 19 ++++++-------------
> > >   block/blk-mq.h    |  4 +++-
> > >   block/blk-sysfs.c | 21 ++++++++++++++-------
> > >   3 files changed, 23 insertions(+), 21 deletions(-)
> > > 
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index 11c8baebb9a0..e9f037a25fe3 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -4917,12 +4917,12 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
> > >   }
> > >   EXPORT_SYMBOL(blk_mq_free_tag_set);
> > > -int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
> > > +void blk_mq_update_nr_requests(struct request_queue *q,
> > > +			       struct elevator_tags *et, unsigned int nr)
> > >   {
> > >   	struct blk_mq_tag_set *set = q->tag_set;
> > >   	struct blk_mq_hw_ctx *hctx;
> > >   	unsigned long i;
> > > -	int ret = 0;
> > >   	blk_mq_quiesce_queue(q);
> > > @@ -4946,24 +4946,17 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
> > >   				nr - hctx->sched_tags->nr_reserved_tags);
> > >   		}
> > >   	} else {
> > > -		queue_for_each_hw_ctx(q, hctx, i) {
> > > -			if (!hctx->tags)
> > > -				continue;
> > > -			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
> > > -						      nr);
> > > -			if (ret)
> > > -				goto out;
> > > -		}
> > > +		blk_mq_free_sched_tags(q->elevator->et, set);
> > I think you also need to ensure that elevator tags are freed after we unfreeze
> > queue and release ->elevator_lock otherwise we may get into the lockdep splat
> > for pcpu_lock dependency on ->freeze_lock and/or ->elevator_lock. Please note
> > that blk_mq_free_sched_tags internally invokes sbitmap_free which invokes
> > free_percpu which acquires pcpu_lock.
> 
> Ok, thanks for the notice. However, as Ming suggested, we might fix this
> problem
> 
> in the next merge window.

There are two issues involved:

- blk_mq_tags double free, introduced recently

- long-term lock issue in queue_requests_store()

IMO, the former is a bit serious, because a kernel panic can be
triggered, so I suggest getting it into v6.17. The latter looks less
serious and has existed for a long time, but may need code refactoring
to get a clean fix.

> I'll send one patch to fix this regression by
> replace
> 
> st->tags with reallocated new sched_tags as well.

Patch 7 in this patchset and patch 8 in your 1st post looks enough to
fix this double free issue.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 08/10] blk-mq: fix blk_mq_tags double free while nr_requests grown
  2025-08-16  4:05       ` Ming Lei
@ 2025-08-16  8:05         ` 余快
  2025-08-18  2:11           ` Ming Lei
  0 siblings, 1 reply; 23+ messages in thread
From: 余快 @ 2025-08-16  8:05 UTC (permalink / raw)
  To: Ming Lei
  Cc: Yu Kuai, Nilay Shroff, Yu Kuai, axboe, hare, linux-block,
	linux-kernel, yukuai3, yi.zhang, yangerkun, johnny.chenyi

Hi,

Ming Lei <ming.lei@redhat.com> 于2025年8月16日周六 12:05写道:
>
> On Sat, Aug 16, 2025 at 10:57:23AM +0800, Yu Kuai wrote:
> > Hi,
> >
> > 在 2025/8/16 3:30, Nilay Shroff 写道:
> > >
> > > On 8/15/25 1:32 PM, Yu Kuai wrote:
> > > > From: Yu Kuai <yukuai3@huawei.com>
> > > >
> > > > In the case user trigger tags grow by queue sysfs attribute nr_requests,
> > > > hctx->sched_tags will be freed directly and replaced with a new
> > > > allocated tags, see blk_mq_tag_update_depth().
> > > >
> > > > The problem is that hctx->sched_tags is from elevator->et->tags, while
> > > > et->tags is still the freed tags, hence later elevator exist will try to
> > > > free the tags again, causing kernel panic.
> > > >
> > > > Fix this problem by using new allocated elevator_tags, also convert
> > > > blk_mq_update_nr_requests to void since this helper will never fail now.
> > > >
> > > > Meanwhile, there is a longterm problem can be fixed as well:
> > > >
> > > > If blk_mq_tag_update_depth() succeed for previous hctx, then bitmap depth
> > > > is updated, however, if following hctx failed, q->nr_requests is not
> > > > updated and the previous hctx->sched_tags endup bigger than q->nr_requests.
> > > >
> > > > Fixes: f5a6604f7a44 ("block: fix lockdep warning caused by lock dependency in elv_iosched_store")
> > > > Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs")
> > > > Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> > > > ---
> > > >   block/blk-mq.c    | 19 ++++++-------------
> > > >   block/blk-mq.h    |  4 +++-
> > > >   block/blk-sysfs.c | 21 ++++++++++++++-------
> > > >   3 files changed, 23 insertions(+), 21 deletions(-)
> > > >
> > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > index 11c8baebb9a0..e9f037a25fe3 100644
> > > > --- a/block/blk-mq.c
> > > > +++ b/block/blk-mq.c
> > > > @@ -4917,12 +4917,12 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
> > > >   }
> > > >   EXPORT_SYMBOL(blk_mq_free_tag_set);
> > > > -int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
> > > > +void blk_mq_update_nr_requests(struct request_queue *q,
> > > > +                        struct elevator_tags *et, unsigned int nr)
> > > >   {
> > > >           struct blk_mq_tag_set *set = q->tag_set;
> > > >           struct blk_mq_hw_ctx *hctx;
> > > >           unsigned long i;
> > > > - int ret = 0;
> > > >           blk_mq_quiesce_queue(q);
> > > > @@ -4946,24 +4946,17 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
> > > >                                   nr - hctx->sched_tags->nr_reserved_tags);
> > > >                   }
> > > >           } else {
> > > > -         queue_for_each_hw_ctx(q, hctx, i) {
> > > > -                 if (!hctx->tags)
> > > > -                         continue;
> > > > -                 ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
> > > > -                                               nr);
> > > > -                 if (ret)
> > > > -                         goto out;
> > > > -         }
> > > > +         blk_mq_free_sched_tags(q->elevator->et, set);
> > > I think you also need to ensure that elevator tags are freed after we unfreeze
> > > queue and release ->elevator_lock otherwise we may get into the lockdep splat
> > > for pcpu_lock dependency on ->freeze_lock and/or ->elevator_lock. Please note
> > > that blk_mq_free_sched_tags internally invokes sbitmap_free which invokes
> > > free_percpu which acquires pcpu_lock.
> >
> > Ok, thanks for the notice. However, as Ming suggested, we might fix this
> > problem
> >
> > in the next merge window.
>
> There are two issues involved:
>
> - blk_mq_tags double free, introduced recently
>
> - long-term lock issue in queue_requests_store()
>
> IMO, the former is a bit serious, because kernel panic can be triggered,
> so suggest to make it to v6.17. The latter looks less serious and has
> existed for long time, but may need code refactor to get clean fix.
>
> > I'll send one patch to fix this regression by
> > replace
> >
> > st->tags with reallocated new sched_tags as well.
>
> Patch 7 in this patchset and patch 8 in your 1st post looks enough to
> fix this double free issue.
>
But without the previous refactoring, this looks hard. Can we consider
the following one-line patch for this merge window? It just fixes the
first double-free issue for now.

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d880c50629d6..1e0ccf19295a 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -622,6 +622,7 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
                        return -ENOMEM;

                blk_mq_free_map_and_rqs(set, *tagsptr, hctx->queue_num);
+               hctx->queue->elevator->et->tags[hctx->queue_num] = new;
                *tagsptr = new;
        } else {
                /*

>
> Thanks,
> Ming
>
>

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 08/10] blk-mq: fix blk_mq_tags double free while nr_requests grown
  2025-08-16  8:05         ` 余快
@ 2025-08-18  2:11           ` Ming Lei
  2025-08-18  3:12             ` Ming Lei
  0 siblings, 1 reply; 23+ messages in thread
From: Ming Lei @ 2025-08-18  2:11 UTC (permalink / raw)
  To: 余快
  Cc: Yu Kuai, Nilay Shroff, Yu Kuai, axboe, hare, linux-block,
	linux-kernel, yukuai3, yi.zhang, yangerkun, johnny.chenyi

On Sat, Aug 16, 2025 at 04:05:30PM +0800, 余快 wrote:
> Hi,
> 
> Ming Lei <ming.lei@redhat.com> 于2025年8月16日周六 12:05写道:
> >
> > On Sat, Aug 16, 2025 at 10:57:23AM +0800, Yu Kuai wrote:
> > > Hi,
> > >
> > > 在 2025/8/16 3:30, Nilay Shroff 写道:
> > > >
> > > > On 8/15/25 1:32 PM, Yu Kuai wrote:
> > > > > From: Yu Kuai <yukuai3@huawei.com>
> > > > >
> > > > > In the case user trigger tags grow by queue sysfs attribute nr_requests,
> > > > > hctx->sched_tags will be freed directly and replaced with a new
> > > > > allocated tags, see blk_mq_tag_update_depth().
> > > > >
> > > > > The problem is that hctx->sched_tags is from elevator->et->tags, while
> > > > > et->tags is still the freed tags, hence later elevator exist will try to
> > > > > free the tags again, causing kernel panic.
> > > > >
> > > > > Fix this problem by using new allocated elevator_tags, also convert
> > > > > blk_mq_update_nr_requests to void since this helper will never fail now.
> > > > >
> > > > > Meanwhile, there is a longterm problem can be fixed as well:
> > > > >
> > > > > If blk_mq_tag_update_depth() succeed for previous hctx, then bitmap depth
> > > > > is updated, however, if following hctx failed, q->nr_requests is not
> > > > > updated and the previous hctx->sched_tags endup bigger than q->nr_requests.
> > > > >
> > > > > Fixes: f5a6604f7a44 ("block: fix lockdep warning caused by lock dependency in elv_iosched_store")
> > > > > Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs")
> > > > > Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> > > > > ---
> > > > >   block/blk-mq.c    | 19 ++++++-------------
> > > > >   block/blk-mq.h    |  4 +++-
> > > > >   block/blk-sysfs.c | 21 ++++++++++++++-------
> > > > >   3 files changed, 23 insertions(+), 21 deletions(-)
> > > > >
> > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > > index 11c8baebb9a0..e9f037a25fe3 100644
> > > > > --- a/block/blk-mq.c
> > > > > +++ b/block/blk-mq.c
> > > > > @@ -4917,12 +4917,12 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
> > > > >   }
> > > > >   EXPORT_SYMBOL(blk_mq_free_tag_set);
> > > > > -int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
> > > > > +void blk_mq_update_nr_requests(struct request_queue *q,
> > > > > +                        struct elevator_tags *et, unsigned int nr)
> > > > >   {
> > > > >           struct blk_mq_tag_set *set = q->tag_set;
> > > > >           struct blk_mq_hw_ctx *hctx;
> > > > >           unsigned long i;
> > > > > - int ret = 0;
> > > > >           blk_mq_quiesce_queue(q);
> > > > > @@ -4946,24 +4946,17 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
> > > > >                                   nr - hctx->sched_tags->nr_reserved_tags);
> > > > >                   }
> > > > >           } else {
> > > > > -         queue_for_each_hw_ctx(q, hctx, i) {
> > > > > -                 if (!hctx->tags)
> > > > > -                         continue;
> > > > > -                 ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
> > > > > -                                               nr);
> > > > > -                 if (ret)
> > > > > -                         goto out;
> > > > > -         }
> > > > > +         blk_mq_free_sched_tags(q->elevator->et, set);
> > > > I think you also need to ensure that the elevator tags are freed after we
> > > > unfreeze the queue and release ->elevator_lock; otherwise we may get a
> > > > lockdep splat for the pcpu_lock dependency on ->freeze_lock and/or
> > > > ->elevator_lock. Please note that blk_mq_free_sched_tags internally
> > > > invokes sbitmap_free, which invokes free_percpu, which acquires pcpu_lock.
> > >
> > > OK, thanks for the notice. However, as Ming suggested, we might fix this
> > > problem in the next merge window.
> >
> > There are two issues involved:
> >
> > - blk_mq_tags double free, introduced recently
> >
> > - long-term lock issue in queue_requests_store()
> >
> > IMO, the former is more serious, because a kernel panic can be triggered,
> > so I suggest getting it into v6.17. The latter looks less serious and has
> > existed for a long time, but may need a code refactor for a clean fix.
> >
> > > I'll send one patch to fix this regression by replacing et->tags with the
> > > newly reallocated sched_tags as well.
> >
> > Patch 7 in this patchset and patch 8 in your 1st post look sufficient to
> > fix this double free issue.
> >
> But without the previous refactoring, this looks hard. Can we consider the following

I feel the following one should be enough:


diff --git a/block/blk-mq.c b/block/blk-mq.c
index b67d6c02eceb..021155bb64fc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4917,6 +4917,29 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 }
 EXPORT_SYMBOL(blk_mq_free_tag_set);
 
+static int blk_mq_sched_grow_tags(struct request_queue *q, unsigned int nr)
+{
+	struct blk_mq_hw_ctx *hctx;
+	struct elevator_tags *et;
+	unsigned long i;
+
+	if (nr > MAX_SCHED_RQ)
+		return -EINVAL;
+
+	et = blk_mq_alloc_sched_tags(q->tag_set, q->nr_hw_queues, nr);
+	if (!et)
+		return -ENOMEM;
+
+	blk_mq_free_sched_tags(q->elevator->et, q->tag_set);
+	queue_for_each_hw_ctx(q, hctx, i) {
+		hctx->sched_tags = et->tags[i];
+		if (q->elevator->type->ops.depth_updated)
+			q->elevator->type->ops.depth_updated(hctx);
+	}
+	q->elevator->et = et;
+	return 0;
+}
+
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
@@ -4936,6 +4959,12 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 	blk_mq_quiesce_queue(q);
 
 	ret = 0;
+
+	if (q->elevator && nr > q->elevator->et->nr_requests) {
+		ret = blk_mq_sched_grow_tags(q, nr);
+		goto update_nr_requests;
+	}
+
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx->tags)
 			continue;
@@ -4955,6 +4984,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 		if (q->elevator && q->elevator->type->ops.depth_updated)
 			q->elevator->type->ops.depth_updated(hctx);
 	}
+update_nr_requests:
 	if (!ret) {
 		q->nr_requests = nr;
 		if (blk_mq_is_shared_tags(set->flags)) {

> one-line patch for this merge window? It would just fix the first issue
> (the double free) for now.
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index d880c50629d6..1e0ccf19295a 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -622,6 +622,7 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
>                         return -ENOMEM;
> 
>                 blk_mq_free_map_and_rqs(set, *tagsptr, hctx->queue_num);
> +               hctx->queue->elevator->et->tags[hctx->queue_num]= new;
>                 *tagsptr = new;

It is fine if this way works; however, the old elevator->et may have a lower
depth, and then:

- the above change causes an et->tags overflow

- meanwhile, a memory leak is caused in blk_mq_free_sched_tags()


Thanks,
Ming


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 08/10] blk-mq: fix blk_mq_tags double free while nr_requests grown
  2025-08-18  2:11           ` Ming Lei
@ 2025-08-18  3:12             ` Ming Lei
  0 siblings, 0 replies; 23+ messages in thread
From: Ming Lei @ 2025-08-18  3:12 UTC (permalink / raw)
  To: 余快
  Cc: Yu Kuai, Nilay Shroff, Yu Kuai, axboe, hare, linux-block,
	linux-kernel, yukuai3, yi.zhang, yangerkun, johnny.chenyi

On Mon, Aug 18, 2025 at 10:12 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Sat, Aug 16, 2025 at 04:05:30PM +0800, 余快 wrote:
...
> > one-line patch for this merge window? It would just fix the first issue
> > (the double free) for now.
> >
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > index d880c50629d6..1e0ccf19295a 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -622,6 +622,7 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
> >                         return -ENOMEM;
> >
> >                 blk_mq_free_map_and_rqs(set, *tagsptr, hctx->queue_num);
> > +               hctx->queue->elevator->et->tags[hctx->queue_num]= new;
> >                 *tagsptr = new;
>
> It is fine if this way works; however, the old elevator->et may have a lower
> depth, and then:
>
> - the above change causes an et->tags overflow
>
> - meanwhile, a memory leak is caused in blk_mq_free_sched_tags()

Oops, it looks like I misunderstood nr_hw_queues as the queue depth, so this
single-line patch should fix the double free issue.

Thanks,
Ming



end of thread, other threads:[~2025-08-18  3:13 UTC | newest]

Thread overview: 23+ messages
2025-08-15  8:02 [PATCH 00/10] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
2025-08-15  8:02 ` [PATCH 01/10] blk-mq: remove useless checking from queue_requests_store() Yu Kuai
2025-08-15  8:02 ` [PATCH 02/10] blk-mq: remove useless checkings from blk_mq_update_nr_requests() Yu Kuai
2025-08-15  8:02 ` [PATCH 03/10] blk-mq: check invalid nr_requests in queue_requests_store() Yu Kuai
2025-08-15  8:02 ` [PATCH 04/10] blk-mq: serialize updating nr_requests with update_nr_hwq_lock Yu Kuai
2025-08-15 14:47   ` Ming Lei
2025-08-16  0:49     ` Yu Kuai
2025-08-16  2:23       ` Ming Lei
2025-08-15  8:02 ` [PATCH 05/10] blk-mq: cleanup shared tags case in blk_mq_update_nr_requests() Yu Kuai
2025-08-15  8:02 ` [PATCH 06/10] blk-mq: split bitmap grow and resize " Yu Kuai
2025-08-15  8:02 ` [PATCH 07/10] blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags() Yu Kuai
2025-08-15  8:02 ` [PATCH 08/10] blk-mq: fix blk_mq_tags double free while nr_requests grown Yu Kuai
2025-08-15 19:30   ` Nilay Shroff
2025-08-16  2:57     ` Yu Kuai
2025-08-16  4:05       ` Ming Lei
2025-08-16  8:05         ` 余快
2025-08-18  2:11           ` Ming Lei
2025-08-18  3:12             ` Ming Lei
2025-08-15  8:02 ` [PATCH 09/10] blk-mq: remove blk_mq_tag_update_depth() Yu Kuai
2025-08-15  8:02 ` [PATCH 10/10] blk-mq: fix stale nr_requests documentation Yu Kuai
2025-08-15  8:30 ` [PATCH 00/10] blk-mq: fix blk_mq_tags double free while nr_requests grown Ming Lei
2025-08-15  9:05   ` Yu Kuai
2025-08-15 14:20     ` Ming Lei
