* [PATCH for-6.18/block 01/10] blk-mq: remove useless checking in queue_requests_store()
2025-09-08 6:15 [PATCH for-6.18/block 00/10] cleanup and fixes for updating nr_requests Yu Kuai
@ 2025-09-08 6:15 ` Yu Kuai
2025-09-09 11:34 ` Nilay Shroff
2025-09-08 6:15 ` [PATCH for-6.18/block 02/10] blk-mq: remove useless checkings in blk_mq_update_nr_requests() Yu Kuai
` (8 subsequent siblings)
9 siblings, 1 reply; 37+ messages in thread
From: Yu Kuai @ 2025-09-08 6:15 UTC (permalink / raw)
To: nilay, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
blk_mq_queue_attr_visible() already checks queue_is_mq(), so there is
no need to check it again in queue_requests_store().
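For reference, the sysfs visibility hook already hides all blk-mq-only
queue attributes for bio-based devices, roughly like this (sketch from
memory, not part of this patch):

static umode_t blk_mq_queue_attr_visible(struct kobject *kobj,
					 struct attribute *attr, int n)
{
	struct gendisk *disk = container_of(kobj, struct gendisk, queue_kobj);

	if (!queue_is_mq(disk->queue))
		return 0;	/* attribute is not exposed at all */

	return attr->mode;
}

so queue_requests_store() can only ever be reached on an mq queue.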
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-sysfs.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index c94b8b6ab024..1ffa65feca4f 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -69,9 +69,6 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
unsigned int memflags;
struct request_queue *q = disk->queue;
- if (!queue_is_mq(q))
- return -EINVAL;
-
ret = queue_var_store(&nr, page, count);
if (ret < 0)
return ret;
--
2.39.2
* Re: [PATCH for-6.18/block 01/10] blk-mq: remove useless checking in queue_requests_store()
2025-09-08 6:15 ` [PATCH for-6.18/block 01/10] blk-mq: remove useless checking in queue_requests_store() Yu Kuai
@ 2025-09-09 11:34 ` Nilay Shroff
0 siblings, 0 replies; 37+ messages in thread
From: Nilay Shroff @ 2025-09-09 11:34 UTC (permalink / raw)
To: Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yi.zhang, yangerkun,
johnny.chenyi
On 9/8/25 11:45 AM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> blk_mq_queue_attr_visible() already checked queue_is_mq(), no need to
> check this again in queue_requests_store().
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Looks good to me:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
* [PATCH for-6.18/block 02/10] blk-mq: remove useless checkings in blk_mq_update_nr_requests()
2025-09-08 6:15 [PATCH for-6.18/block 00/10] cleanup and fixes for updating nr_requests Yu Kuai
2025-09-08 6:15 ` [PATCH for-6.18/block 01/10] blk-mq: remove useless checking in queue_requests_store() Yu Kuai
@ 2025-09-08 6:15 ` Yu Kuai
2025-09-09 11:35 ` Nilay Shroff
2025-09-08 6:15 ` [PATCH for-6.18/block 03/10] blk-mq: check invalid nr_requests in queue_requests_store() Yu Kuai
` (7 subsequent siblings)
9 siblings, 1 reply; 37+ messages in thread
From: Yu Kuai @ 2025-09-08 6:15 UTC (permalink / raw)
To: nilay, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
1) queue_requests_store() is the only caller of
blk_mq_update_nr_requests(), and it already freezes the queue, so there
is no need to check mq_freeze_depth;
2) q->tag_set must be set for a request-based device, and queue_is_mq()
is already checked in blk_mq_queue_attr_visible(), so there is no need
to check q->tag_set.
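For reference, after patch 1 the only call chain looks roughly like this
(current code in queue_requests_store(), abbreviated):

	memflags = blk_mq_freeze_queue(q);	/* mq_freeze_depth is elevated here */
	mutex_lock(&q->elevator_lock);
	err = blk_mq_update_nr_requests(disk->queue, nr);
	mutex_unlock(&q->elevator_lock);
	blk_mq_unfreeze_queue(q, memflags);

so both removed conditions always hold by the time
blk_mq_update_nr_requests() runs.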
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-mq.c | 9 +--------
1 file changed, 1 insertion(+), 8 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9055cd624700..0fac0236a5dc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4921,21 +4921,14 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
{
struct blk_mq_tag_set *set = q->tag_set;
struct blk_mq_hw_ctx *hctx;
- int ret;
+ int ret = 0;
unsigned long i;
- if (WARN_ON_ONCE(!q->mq_freeze_depth))
- return -EINVAL;
-
- if (!set)
- return -EINVAL;
-
if (q->nr_requests == nr)
return 0;
blk_mq_quiesce_queue(q);
- ret = 0;
queue_for_each_hw_ctx(q, hctx, i) {
if (!hctx->tags)
continue;
--
2.39.2
* Re: [PATCH for-6.18/block 02/10] blk-mq: remove useless checkings in blk_mq_update_nr_requests()
2025-09-08 6:15 ` [PATCH for-6.18/block 02/10] blk-mq: remove useless checkings in blk_mq_update_nr_requests() Yu Kuai
@ 2025-09-09 11:35 ` Nilay Shroff
0 siblings, 0 replies; 37+ messages in thread
From: Nilay Shroff @ 2025-09-09 11:35 UTC (permalink / raw)
To: Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yi.zhang, yangerkun,
johnny.chenyi
On 9/8/25 11:45 AM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> 1) queue_requests_store() is the only caller of
> blk_mq_update_nr_requests(), where queue is already freezed, no need to
> check mq_freeze_depth;
> 2) q->tag_set must be set for request based device, and queue_is_mq() is
> already checked in blk_mq_queue_attr_visible(), no need to check
> q->tag_set.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Looks good to me:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
* [PATCH for-6.18/block 03/10] blk-mq: check invalid nr_requests in queue_requests_store()
2025-09-08 6:15 [PATCH for-6.18/block 00/10] cleanup and fixes for updating nr_requests Yu Kuai
2025-09-08 6:15 ` [PATCH for-6.18/block 01/10] blk-mq: remove useless checking in queue_requests_store() Yu Kuai
2025-09-08 6:15 ` [PATCH for-6.18/block 02/10] blk-mq: remove useless checkings in blk_mq_update_nr_requests() Yu Kuai
@ 2025-09-08 6:15 ` Yu Kuai
2025-09-09 11:36 ` Nilay Shroff
2025-09-08 6:15 ` [PATCH for-6.18/block 04/10] blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock Yu Kuai
` (6 subsequent siblings)
9 siblings, 1 reply; 37+ messages in thread
From: Yu Kuai @ 2025-09-08 6:15 UTC (permalink / raw)
To: nilay, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
queue_requests_store() is the only caller of
blk_mq_update_nr_requests(), and blk_mq_update_nr_requests() is the
only caller of blk_mq_tag_update_depth(); however, each of them
validates the nr_requests value passed in by the user.
Make the code cleaner by moving all the checks into the top-level
function:
1) nr_requests > reserved tags;
2) if there is an elevator, 4 <= nr_requests <= 2048;
3) if the elevator is none, 4 <= nr_requests <= tag_set->queue_depth.
Note that case 2 is the only case where the tags can grow, hence the
only one where -ENOMEM might be returned.
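For reference, the limits above come from the following defines
(locations approximate):

	/* include/linux/blkdev.h */
	#define BLKDEV_MIN_RQ		4
	#define BLKDEV_DEFAULT_RQ	128

	/* block-layer internal header */
	#define MAX_SCHED_RQ		(16 * BLKDEV_DEFAULT_RQ)	/* 2048 */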
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-mq-tag.c | 16 +---------------
block/blk-mq.c | 8 ++------
block/blk-mq.h | 2 +-
block/blk-sysfs.c | 13 +++++++++++++
4 files changed, 17 insertions(+), 22 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 5cffa5668d0c..725210f27471 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -584,14 +584,10 @@ void blk_mq_free_tags(struct blk_mq_tags *tags)
}
int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
- struct blk_mq_tags **tagsptr, unsigned int tdepth,
- bool can_grow)
+ struct blk_mq_tags **tagsptr, unsigned int tdepth)
{
struct blk_mq_tags *tags = *tagsptr;
- if (tdepth <= tags->nr_reserved_tags)
- return -EINVAL;
-
/*
* If we are allowed to grow beyond the original size, allocate
* a new set of tags before freeing the old one.
@@ -600,16 +596,6 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
struct blk_mq_tag_set *set = hctx->queue->tag_set;
struct blk_mq_tags *new;
- if (!can_grow)
- return -EINVAL;
-
- /*
- * We need some sort of upper limit, set it high enough that
- * no valid use cases should require more.
- */
- if (tdepth > MAX_SCHED_RQ)
- return -EINVAL;
-
/*
* Only the sbitmap needs resizing since we allocated the max
* initially.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0fac0236a5dc..d7540b8e7471 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4924,9 +4924,6 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
int ret = 0;
unsigned long i;
- if (q->nr_requests == nr)
- return 0;
-
blk_mq_quiesce_queue(q);
queue_for_each_hw_ctx(q, hctx, i) {
@@ -4938,10 +4935,9 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
*/
if (hctx->sched_tags) {
ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
- nr, true);
+ nr);
} else {
- ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
- false);
+ ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr);
}
if (ret)
goto out;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index affb2e14b56e..2b3ade60c90b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -171,7 +171,7 @@ void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
unsigned int tag);
void blk_mq_put_tags(struct blk_mq_tags *tags, int *tag_array, int nr_tags);
int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
- struct blk_mq_tags **tags, unsigned int depth, bool can_grow);
+ struct blk_mq_tags **tags, unsigned int depth);
void blk_mq_tag_resize_shared_tags(struct blk_mq_tag_set *set,
unsigned int size);
void blk_mq_tag_update_sched_shared_tags(struct request_queue *q);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 1ffa65feca4f..f99519f7a820 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -75,12 +75,25 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
memflags = blk_mq_freeze_queue(q);
mutex_lock(&q->elevator_lock);
+
+ if (nr == q->nr_requests)
+ goto unlock;
+
if (nr < BLKDEV_MIN_RQ)
nr = BLKDEV_MIN_RQ;
+ if (nr <= q->tag_set->reserved_tags ||
+ (q->elevator && nr > MAX_SCHED_RQ) ||
+ (!q->elevator && nr > q->tag_set->queue_depth)) {
+ ret = -EINVAL;
+ goto unlock;
+ }
+
err = blk_mq_update_nr_requests(disk->queue, nr);
if (err)
ret = err;
+
+unlock:
mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
return ret;
--
2.39.2
* Re: [PATCH for-6.18/block 03/10] blk-mq: check invalid nr_requests in queue_requests_store()
2025-09-08 6:15 ` [PATCH for-6.18/block 03/10] blk-mq: check invalid nr_requests in queue_requests_store() Yu Kuai
@ 2025-09-09 11:36 ` Nilay Shroff
0 siblings, 0 replies; 37+ messages in thread
From: Nilay Shroff @ 2025-09-09 11:36 UTC (permalink / raw)
To: Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yi.zhang, yangerkun,
johnny.chenyi
On 9/8/25 11:45 AM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> queue_requests_store() is the only caller of
> blk_mq_update_nr_requests(), and blk_mq_update_nr_requests() is the
> only caller of blk_mq_tag_update_depth(), however, they all have
> checkings for nr_requests input by user.
>
> Make code cleaner by moving all the checkings to the top function:
>
> 1) nr_requests > reserved tags;
> 2) if there is elevator, 4 <= nr_requests <= 2048;
> 3) if elevator is none, 4 <= nr_requests <= tag_set->queue_depth;
>
> Meanwhile, case 2 is the only case tags can grow and -ENOMEM might be
> returned.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Looks good to me:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
* [PATCH for-6.18/block 04/10] blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock
2025-09-08 6:15 [PATCH for-6.18/block 00/10] cleanup and fixes for updating nr_requests Yu Kuai
` (2 preceding siblings ...)
2025-09-08 6:15 ` [PATCH for-6.18/block 03/10] blk-mq: check invalid nr_requests in queue_requests_store() Yu Kuai
@ 2025-09-08 6:15 ` Yu Kuai
2025-09-09 6:29 ` Nilay Shroff
2025-09-09 11:40 ` Nilay Shroff
2025-09-08 6:15 ` [PATCH for-6.18/block 05/10] blk-mq: cleanup shared tags case in blk_mq_update_nr_requests() Yu Kuai
` (5 subsequent siblings)
9 siblings, 2 replies; 37+ messages in thread
From: Yu Kuai @ 2025-09-08 6:15 UTC (permalink / raw)
To: nilay, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
request_queue->nr_requests can be changed by:
a) switching the elevator via a nr_hw_queues update
b) switching the elevator via the elevator sysfs attribute
c) writing the queue sysfs attribute nr_requests
Current lock order is:
1) update_nr_hwq_lock, cases a and b
2) freeze_queue
3) elevator_lock, cases a, b and c
Updating nr_requests is already serialized by elevator_lock; however, in
case c) we will have to allocate new sched_tags if nr_requests grows,
and doing that with elevator_lock held and the queue frozen risks
deadlock.
Hence take update_nr_hwq_lock instead: this makes it possible to
allocate memory when the tags grow, while still preventing nr_requests
from being changed concurrently.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-sysfs.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f99519f7a820..7ea15bf68b4b 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
int ret, err;
unsigned int memflags;
struct request_queue *q = disk->queue;
+ struct blk_mq_tag_set *set = q->tag_set;
ret = queue_var_store(&nr, page, count);
if (ret < 0)
return ret;
- memflags = blk_mq_freeze_queue(q);
- mutex_lock(&q->elevator_lock);
+ /* serialize updating nr_requests with switching elevator */
+ down_write(&set->update_nr_hwq_lock);
if (nr == q->nr_requests)
goto unlock;
@@ -89,13 +90,18 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
goto unlock;
}
+ memflags = blk_mq_freeze_queue(q);
+ mutex_lock(&q->elevator_lock);
+
err = blk_mq_update_nr_requests(disk->queue, nr);
if (err)
ret = err;
-unlock:
mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
+
+unlock:
+ up_write(&set->update_nr_hwq_lock);
return ret;
}
--
2.39.2
^ permalink raw reply related [flat|nested] 37+ messages in thread* Re: [PATCH for-6.18/block 04/10] blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock
2025-09-08 6:15 ` [PATCH for-6.18/block 04/10] blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock Yu Kuai
@ 2025-09-09 6:29 ` Nilay Shroff
2025-09-09 6:38 ` Yu Kuai
2025-09-09 11:40 ` Nilay Shroff
1 sibling, 1 reply; 37+ messages in thread
From: Nilay Shroff @ 2025-09-09 6:29 UTC (permalink / raw)
To: Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yi.zhang, yangerkun,
johnny.chenyi
On 9/8/25 11:45 AM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> request_queue->nr_requests can be changed by:
>
> a) switching elevator by update nr_hw_queues
> b) switching elevator by elevator sysfs attribute
> c) configue queue sysfs attribute nr_requests
>
> Current lock order is:
>
> 1) update_nr_hwq_lock, case a,b
> 2) freeze_queue
> 3) elevator_lock, cas a,b,c
>
> And update nr_requests is seriablized by elevator_lock() already,
> however, in the case c), we'll have to allocate new sched_tags if
> nr_requests grow, and do this with elevator_lock held and queue
> freezed has the risk of deadlock.
>
> Hence use update_nr_hwq_lock instead, make it possible to allocate
> memory if tags grow, meanwhile also prevent nr_requests to be changed
> concurrently.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> block/blk-sysfs.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index f99519f7a820..7ea15bf68b4b 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
> int ret, err;
> unsigned int memflags;
> struct request_queue *q = disk->queue;
> + struct blk_mq_tag_set *set = q->tag_set;
>
> ret = queue_var_store(&nr, page, count);
> if (ret < 0)
> return ret;
>
> - memflags = blk_mq_freeze_queue(q);
> - mutex_lock(&q->elevator_lock);
> + /* serialize updating nr_requests with switching elevator */
> + down_write(&set->update_nr_hwq_lock);
>
For serializing the nr_requests update with elevator switching, we
should use disable_elv_switch(). With this change we don't need to
acquire ->update_nr_hwq_lock in writer context while running
blk_mq_update_nr_requests(); instead it can run with
->update_nr_hwq_lock held in reader context.
So the code flow should be:
disable_elv_switch => this would set QUEUE_FLAG_NO_ELV_SWITCH
...
down_read ->update_nr_hwq_lock
acquire ->freeze_lock
acquire ->elevator_lock;
...
...
release ->elevator_lock;
release ->freeze_lock
clear QUEUE_FLAG_NO_ELV_SWITCH
up_read ->update_nr_hwq_lock
Thanks,
--Nilay
* Re: [PATCH for-6.18/block 04/10] blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock
2025-09-09 6:29 ` Nilay Shroff
@ 2025-09-09 6:38 ` Yu Kuai
2025-09-09 6:52 ` Nilay Shroff
0 siblings, 1 reply; 37+ messages in thread
From: Yu Kuai @ 2025-09-09 6:38 UTC (permalink / raw)
To: Nilay Shroff, Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi,
yukuai (C)
Hi,
On 2025/09/09 14:29, Nilay Shroff wrote:
>
>
> On 9/8/25 11:45 AM, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> request_queue->nr_requests can be changed by:
>>
>> a) switching elevator by update nr_hw_queues
>> b) switching elevator by elevator sysfs attribute
>> c) configue queue sysfs attribute nr_requests
>>
>> Current lock order is:
>>
>> 1) update_nr_hwq_lock, case a,b
>> 2) freeze_queue
>> 3) elevator_lock, cas a,b,c
>>
>> And update nr_requests is seriablized by elevator_lock() already,
>> however, in the case c), we'll have to allocate new sched_tags if
>> nr_requests grow, and do this with elevator_lock held and queue
>> freezed has the risk of deadlock.
>>
>> Hence use update_nr_hwq_lock instead, make it possible to allocate
>> memory if tags grow, meanwhile also prevent nr_requests to be changed
>> concurrently.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>> block/blk-sysfs.c | 12 +++++++++---
>> 1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index f99519f7a820..7ea15bf68b4b 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
>> int ret, err;
>> unsigned int memflags;
>> struct request_queue *q = disk->queue;
>> + struct blk_mq_tag_set *set = q->tag_set;
>>
>> ret = queue_var_store(&nr, page, count);
>> if (ret < 0)
>> return ret;
>>
>> - memflags = blk_mq_freeze_queue(q);
>> - mutex_lock(&q->elevator_lock);
>> + /* serialize updating nr_requests with switching elevator */
>> + down_write(&set->update_nr_hwq_lock);
>>
> For serializing update of nr_requests with switching elevator,
> we should use disable_elv_switch(). So with this change we
> don't need to acquire ->update_nr_hwq_lock in writer context
> while running blk_mq_update_nr_requests but instead it can run
> acquiring ->update_nr_hwq_lock in the reader context.
>
> So the code flow should be,
>
> disable_elv_switch => this would set QUEUE_FLAG_NO_ELV_SWITCH
> ...
> down_read ->update_nr_hwq_lock
> acquire ->freeze_lock
> acquire ->elevator_lock;
> ...
> ...
> release ->elevator_lock;
> release ->freeze_lock
>
> clear QUEUE_FLAG_NO_ELV_SWITCH
> up_read ->update_nr_hwq_lock
>
Yes, this makes sense; however, there is also an implied requirement
that queue_requests_store() must be serialized with itself. What if a
concurrent caller succeeds with disable_elv_switch() before the
down_read(), like this?
t1:                             t2:
disable_elv_switch
                                disable_elv_switch
down_read                       down_read
Thanks,
Kuai
> Thanks,
> --Nilay
>
>
> .
>
* Re: [PATCH for-6.18/block 04/10] blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock
2025-09-09 6:38 ` Yu Kuai
@ 2025-09-09 6:52 ` Nilay Shroff
2025-09-09 7:16 ` Yu Kuai
0 siblings, 1 reply; 37+ messages in thread
From: Nilay Shroff @ 2025-09-09 6:52 UTC (permalink / raw)
To: Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi,
yukuai (C)
On 9/9/25 12:08 PM, Yu Kuai wrote:
> Hi,
>
> On 2025/09/09 14:29, Nilay Shroff wrote:
>>
>>
>> On 9/8/25 11:45 AM, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> request_queue->nr_requests can be changed by:
>>>
>>> a) switching elevator by update nr_hw_queues
>>> b) switching elevator by elevator sysfs attribute
>>> c) configue queue sysfs attribute nr_requests
>>>
>>> Current lock order is:
>>>
>>> 1) update_nr_hwq_lock, case a,b
>>> 2) freeze_queue
>>> 3) elevator_lock, cas a,b,c
>>>
>>> And update nr_requests is seriablized by elevator_lock() already,
>>> however, in the case c), we'll have to allocate new sched_tags if
>>> nr_requests grow, and do this with elevator_lock held and queue
>>> freezed has the risk of deadlock.
>>>
>>> Hence use update_nr_hwq_lock instead, make it possible to allocate
>>> memory if tags grow, meanwhile also prevent nr_requests to be changed
>>> concurrently.
>>>
>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>> ---
>>> block/blk-sysfs.c | 12 +++++++++---
>>> 1 file changed, 9 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>> index f99519f7a820..7ea15bf68b4b 100644
>>> --- a/block/blk-sysfs.c
>>> +++ b/block/blk-sysfs.c
>>> @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
>>> int ret, err;
>>> unsigned int memflags;
>>> struct request_queue *q = disk->queue;
>>> + struct blk_mq_tag_set *set = q->tag_set;
>>> ret = queue_var_store(&nr, page, count);
>>> if (ret < 0)
>>> return ret;
>>> - memflags = blk_mq_freeze_queue(q);
>>> - mutex_lock(&q->elevator_lock);
>>> + /* serialize updating nr_requests with switching elevator */
>>> + down_write(&set->update_nr_hwq_lock);
>>>
>> For serializing update of nr_requests with switching elevator,
>> we should use disable_elv_switch(). So with this change we
>> don't need to acquire ->update_nr_hwq_lock in writer context
>> while running blk_mq_update_nr_requests but instead it can run
>> acquiring ->update_nr_hwq_lock in the reader context.
>>
>> So the code flow should be,
>>
>> disable_elv_switch => this would set QUEUE_FLAG_NO_ELV_SWITCH
>> ...
>> down_read ->update_nr_hwq_lock
>> acquire ->freeze_lock
>> acquire ->elevator_lock;
>> ...
>> ...
>> release ->elevator_lock;
>> release ->freeze_lock
>>
>> clear QUEUE_FLAG_NO_ELV_SWITCH
>> up_read ->update_nr_hwq_lock
>>
>
> Yes, this make sense, however, there is also an implied condition that
> we should serialize queue_requests_store() with itself, what if a
> concurrent caller succeed the disable_elv_switch() before the
> down_read() in this way?
>
> t1:
> disable_elv_switch
> t2:
> disable_elv_switch
>
> down_read down_read
>
I believe this is already protected with the kernfs internal
mutex locks. So you shouldn't be able to run two sysfs store
operations concurrently on the same attribute file.
Thanks,
--Nilay
* Re: [PATCH for-6.18/block 04/10] blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock
2025-09-09 6:52 ` Nilay Shroff
@ 2025-09-09 7:16 ` Yu Kuai
2025-09-09 9:26 ` Nilay Shroff
0 siblings, 1 reply; 37+ messages in thread
From: Yu Kuai @ 2025-09-09 7:16 UTC (permalink / raw)
To: Nilay Shroff, Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi,
yukuai (C)
Hi,
On 2025/09/09 14:52, Nilay Shroff wrote:
>
>
> On 9/9/25 12:08 PM, Yu Kuai wrote:
>> Hi,
>>
>> On 2025/09/09 14:29, Nilay Shroff wrote:
>>>
>>>
>>> On 9/8/25 11:45 AM, Yu Kuai wrote:
>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>
>>>> request_queue->nr_requests can be changed by:
>>>>
>>>> a) switching elevator by update nr_hw_queues
>>>> b) switching elevator by elevator sysfs attribute
>>>> c) configue queue sysfs attribute nr_requests
>>>>
>>>> Current lock order is:
>>>>
>>>> 1) update_nr_hwq_lock, case a,b
>>>> 2) freeze_queue
>>>> 3) elevator_lock, cas a,b,c
>>>>
>>>> And update nr_requests is seriablized by elevator_lock() already,
>>>> however, in the case c), we'll have to allocate new sched_tags if
>>>> nr_requests grow, and do this with elevator_lock held and queue
>>>> freezed has the risk of deadlock.
>>>>
>>>> Hence use update_nr_hwq_lock instead, make it possible to allocate
>>>> memory if tags grow, meanwhile also prevent nr_requests to be changed
>>>> concurrently.
>>>>
>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>>> ---
>>>> block/blk-sysfs.c | 12 +++++++++---
>>>> 1 file changed, 9 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>>> index f99519f7a820..7ea15bf68b4b 100644
>>>> --- a/block/blk-sysfs.c
>>>> +++ b/block/blk-sysfs.c
>>>> @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
>>>> int ret, err;
>>>> unsigned int memflags;
>>>> struct request_queue *q = disk->queue;
>>>> + struct blk_mq_tag_set *set = q->tag_set;
>>>> ret = queue_var_store(&nr, page, count);
>>>> if (ret < 0)
>>>> return ret;
>>>> - memflags = blk_mq_freeze_queue(q);
>>>> - mutex_lock(&q->elevator_lock);
>>>> + /* serialize updating nr_requests with switching elevator */
>>>> + down_write(&set->update_nr_hwq_lock);
>>>>
>>> For serializing update of nr_requests with switching elevator,
>>> we should use disable_elv_switch(). So with this change we
>>> don't need to acquire ->update_nr_hwq_lock in writer context
>>> while running blk_mq_update_nr_requests but instead it can run
>>> acquiring ->update_nr_hwq_lock in the reader context.
>>>
>>> So the code flow should be,
>>>
>>> disable_elv_switch => this would set QUEUE_FLAG_NO_ELV_SWITCH
>>> ...
>>> down_read ->update_nr_hwq_lock
>>> acquire ->freeze_lock
>>> acquire ->elevator_lock;
>>> ...
>>> ...
>>> release ->elevator_lock;
>>> release ->freeze_lock
>>>
>>> clear QUEUE_FLAG_NO_ELV_SWITCH
>>> up_read ->update_nr_hwq_lock
>>>
>>
>> Yes, this make sense, however, there is also an implied condition that
>> we should serialize queue_requests_store() with itself, what if a
>> concurrent caller succeed the disable_elv_switch() before the
>> down_read() in this way?
>>
>> t1:
>> disable_elv_switch
>> t2:
>> disable_elv_switch
>>
>> down_read down_read
>>
> I believe this is already protected with the kernfs internal
> mutex locks. So you shouldn't be able to run two sysfs store
> operations concurrently on the same attribute file.
>
There really is no such internal lock; the call stack is:
kernfs_fop_write_iter
  sysfs_kf_write
    queue_attr_store
There is only the per-open-file mutex kernfs_open_file->lock, taken in
the top function kernfs_fop_write_iter(); however, that lock is not the
same one if the same attribute file is opened from different contexts.
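Roughly, from memory (fs/kernfs/file.c, simplified):

static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter)
{
	struct kernfs_open_file *of = kernfs_of(iocb->ki_filp);
	...
	mutex_lock(&of->mutex);	/* per struct kernfs_open_file, i.e. per open fd */
	len = ops->write(of, buf, len, iocb->ki_pos);
	mutex_unlock(&of->mutex);
	...
}

Two opens of the same sysfs file get two different kernfs_open_file
instances, so this mutex does not serialize them against each other.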
Thanks,
Kuai
> Thanks,
> --Nilay
>
> .
>
* Re: [PATCH for-6.18/block 04/10] blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock
2025-09-09 7:16 ` Yu Kuai
@ 2025-09-09 9:26 ` Nilay Shroff
2025-09-09 9:36 ` Yu Kuai
0 siblings, 1 reply; 37+ messages in thread
From: Nilay Shroff @ 2025-09-09 9:26 UTC (permalink / raw)
To: Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi,
yukuai (C)
On 9/9/25 12:46 PM, Yu Kuai wrote:
> Hi,
>
> On 2025/09/09 14:52, Nilay Shroff wrote:
>>
>>
>> On 9/9/25 12:08 PM, Yu Kuai wrote:
>>> Hi,
>>>
>>> On 2025/09/09 14:29, Nilay Shroff wrote:
>>>>
>>>>
>>>> On 9/8/25 11:45 AM, Yu Kuai wrote:
>>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>>
>>>>> request_queue->nr_requests can be changed by:
>>>>>
>>>>> a) switching elevator by update nr_hw_queues
>>>>> b) switching elevator by elevator sysfs attribute
>>>>> c) configue queue sysfs attribute nr_requests
>>>>>
>>>>> Current lock order is:
>>>>>
>>>>> 1) update_nr_hwq_lock, case a,b
>>>>> 2) freeze_queue
>>>>> 3) elevator_lock, cas a,b,c
>>>>>
>>>>> And update nr_requests is seriablized by elevator_lock() already,
>>>>> however, in the case c), we'll have to allocate new sched_tags if
>>>>> nr_requests grow, and do this with elevator_lock held and queue
>>>>> freezed has the risk of deadlock.
>>>>>
>>>>> Hence use update_nr_hwq_lock instead, make it possible to allocate
>>>>> memory if tags grow, meanwhile also prevent nr_requests to be changed
>>>>> concurrently.
>>>>>
>>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>>>> ---
>>>>> block/blk-sysfs.c | 12 +++++++++---
>>>>> 1 file changed, 9 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>>>> index f99519f7a820..7ea15bf68b4b 100644
>>>>> --- a/block/blk-sysfs.c
>>>>> +++ b/block/blk-sysfs.c
>>>>> @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
>>>>> int ret, err;
>>>>> unsigned int memflags;
>>>>> struct request_queue *q = disk->queue;
>>>>> + struct blk_mq_tag_set *set = q->tag_set;
>>>>> ret = queue_var_store(&nr, page, count);
>>>>> if (ret < 0)
>>>>> return ret;
>>>>> - memflags = blk_mq_freeze_queue(q);
>>>>> - mutex_lock(&q->elevator_lock);
>>>>> + /* serialize updating nr_requests with switching elevator */
>>>>> + down_write(&set->update_nr_hwq_lock);
>>>>>
>>>> For serializing update of nr_requests with switching elevator,
>>>> we should use disable_elv_switch(). So with this change we
>>>> don't need to acquire ->update_nr_hwq_lock in writer context
>>>> while running blk_mq_update_nr_requests but instead it can run
>>>> acquiring ->update_nr_hwq_lock in the reader context.
>>>>
>>>> So the code flow should be,
>>>>
>>>> disable_elv_switch => this would set QUEUE_FLAG_NO_ELV_SWITCH
>>>> ...
>>>> down_read ->update_nr_hwq_lock
>>>> acquire ->freeze_lock
>>>> acquire ->elevator_lock;
>>>> ...
>>>> ...
>>>> release ->elevator_lock;
>>>> release ->freeze_lock
>>>>
>>>> clear QUEUE_FLAG_NO_ELV_SWITCH
>>>> up_read ->update_nr_hwq_lock
>>>>
>>>
>>> Yes, this make sense, however, there is also an implied condition that
>>> we should serialize queue_requests_store() with itself, what if a
>>> concurrent caller succeed the disable_elv_switch() before the
>>> down_read() in this way?
>>>
>>> t1:
>>> disable_elv_switch
>>> t2:
>>> disable_elv_switch
>>>
>>> down_read down_read
>>>
>> I believe this is already protected with the kernfs internal
>> mutex locks. So you shouldn't be able to run two sysfs store
>> operations concurrently on the same attribute file.
>>
>
> There really is no such internal lock, the call stack is:
>
> kernfs_fop_write_iter
> sysfs_kf_write
> queue_attr_store
>
> There is only a file level mutex kernfs_open_file->lock from the top
> function kernfs_fop_write_iter(), however, this lock is not the same
> if we open the same attribute file from different context.
>
Oh yes, this lock only protects against the same fd being written
concurrently. However, if we open the same sysfs file from different
process contexts then the fds are different, so this lock wouldn't
protect against simultaneous updates of the sysfs attribute. Having
said that, looking through the code again it seems that the
q->nr_requests update is protected by ->elevator_lock (in both the
elevator switch code and your proposed changes in this patchset). So my
question is: do we really need to synchronize the nr_requests update
code with the elevator switch code? In other words, what if we don't
use ->update_nr_hwq_lock in queue_requests_store() at all?
Thanks
--Nilay
* Re: [PATCH for-6.18/block 04/10] blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock
2025-09-09 9:26 ` Nilay Shroff
@ 2025-09-09 9:36 ` Yu Kuai
2025-09-09 10:11 ` Nilay Shroff
0 siblings, 1 reply; 37+ messages in thread
From: Yu Kuai @ 2025-09-09 9:36 UTC (permalink / raw)
To: Nilay Shroff, Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi,
yukuai (C)
Hi,
On 2025/09/09 17:26, Nilay Shroff wrote:
>
>
> On 9/9/25 12:46 PM, Yu Kuai wrote:
>> Hi,
>>
>> On 2025/09/09 14:52, Nilay Shroff wrote:
>>>
>>>
>>> On 9/9/25 12:08 PM, Yu Kuai wrote:
>>>> Hi,
>>>>
>>>> On 2025/09/09 14:29, Nilay Shroff wrote:
>>>>>
>>>>>
>>>>> On 9/8/25 11:45 AM, Yu Kuai wrote:
>>>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>>>
>>>>>> request_queue->nr_requests can be changed by:
>>>>>>
>>>>>> a) switching elevator by update nr_hw_queues
>>>>>> b) switching elevator by elevator sysfs attribute
>>>>>> c) configue queue sysfs attribute nr_requests
>>>>>>
>>>>>> Current lock order is:
>>>>>>
>>>>>> 1) update_nr_hwq_lock, case a,b
>>>>>> 2) freeze_queue
>>>>>> 3) elevator_lock, cas a,b,c
>>>>>>
>>>>>> And update nr_requests is seriablized by elevator_lock() already,
>>>>>> however, in the case c), we'll have to allocate new sched_tags if
>>>>>> nr_requests grow, and do this with elevator_lock held and queue
>>>>>> freezed has the risk of deadlock.
>>>>>>
>>>>>> Hence use update_nr_hwq_lock instead, make it possible to allocate
>>>>>> memory if tags grow, meanwhile also prevent nr_requests to be changed
>>>>>> concurrently.
>>>>>>
>>>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>>>>> ---
>>>>>> block/blk-sysfs.c | 12 +++++++++---
>>>>>> 1 file changed, 9 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>>>>> index f99519f7a820..7ea15bf68b4b 100644
>>>>>> --- a/block/blk-sysfs.c
>>>>>> +++ b/block/blk-sysfs.c
>>>>>> @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
>>>>>> int ret, err;
>>>>>> unsigned int memflags;
>>>>>> struct request_queue *q = disk->queue;
>>>>>> + struct blk_mq_tag_set *set = q->tag_set;
>>>>>> ret = queue_var_store(&nr, page, count);
>>>>>> if (ret < 0)
>>>>>> return ret;
>>>>>> - memflags = blk_mq_freeze_queue(q);
>>>>>> - mutex_lock(&q->elevator_lock);
>>>>>> + /* serialize updating nr_requests with switching elevator */
>>>>>> + down_write(&set->update_nr_hwq_lock);
>>>>>>
>>>>> For serializing update of nr_requests with switching elevator,
>>>>> we should use disable_elv_switch(). So with this change we
>>>>> don't need to acquire ->update_nr_hwq_lock in writer context
>>>>> while running blk_mq_update_nr_requests but instead it can run
>>>>> acquiring ->update_nr_hwq_lock in the reader context.
>>>>>
>>>>> So the code flow should be,
>>>>>
>>>>> disable_elv_switch => this would set QUEUE_FLAG_NO_ELV_SWITCH
>>>>> ...
>>>>> down_read ->update_nr_hwq_lock
>>>>> acquire ->freeze_lock
>>>>> acquire ->elevator_lock;
>>>>> ...
>>>>> ...
>>>>> release ->elevator_lock;
>>>>> release ->freeze_lock
>>>>>
>>>>> clear QUEUE_FLAG_NO_ELV_SWITCH
>>>>> up_read ->update_nr_hwq_lock
>>>>>
>>>>
>>>> Yes, this make sense, however, there is also an implied condition that
>>>> we should serialize queue_requests_store() with itself, what if a
>>>> concurrent caller succeed the disable_elv_switch() before the
>>>> down_read() in this way?
>>>>
>>>> t1:
>>>> disable_elv_switch
>>>> t2:
>>>> disable_elv_switch
>>>>
>>>> down_read down_read
>>>>
>>> I believe this is already protected with the kernfs internal
>>> mutex locks. So you shouldn't be able to run two sysfs store
>>> operations concurrently on the same attribute file.
>>>
>>
>> There really is no such internal lock, the call stack is:
>>
>> kernfs_fop_write_iter
>> sysfs_kf_write
>> queue_attr_store
>>
>> There is only a file level mutex kernfs_open_file->lock from the top
>> function kernfs_fop_write_iter(), however, this lock is not the same
>> if we open the same attribute file from different context.
>>
> Oh yes this lock only protects if the same fd is being written
> concurrently. However if we open the same sysfs file from different process
> contexts then fd would be different and so this lock wouldn't protect
> the simultaneous update of sysfs attribute. Having said that,
> looking through the code again it seems that q->nr_requests update
> is protected with ->elevator_lock (including both the elevator switch
> code and your proposed changes in this patchset). So my question is
> do we really need to synchronize nr_requests update code with elevator
> swiupdate_nr_hwq_locktch code? So in another words what if we don't at
> all use ->update_nr_hwq_lock in queue_requests_store?
1) take update_nr_hwq_lock, so that no one else can change nr_requests;
2) check the input nr_requests;
3) if the tags grow, allocate memory.
The main idea here is that we can check whether nr_requests grows and
then allocate memory, without worrying that nr_requests might change
after the memory allocation.
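Something like the following, just to show the ordering (rough sketch;
allocate_new_tags() is a placeholder, not a real helper):

	down_write(&set->update_nr_hwq_lock);	/* nr_requests can't change under us */

	if (nr > q->nr_requests)
		new = allocate_new_tags(set, nr);	/* may sleep, queue is not frozen yet */

	memflags = blk_mq_freeze_queue(q);
	mutex_lock(&q->elevator_lock);
	/* swap in the preallocated tags, or just resize the sbitmap */
	mutex_unlock(&q->elevator_lock);
	blk_mq_unfreeze_queue(q, memflags);

	up_write(&set->update_nr_hwq_lock);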
BTW, I think this sysfs attr is really a slow path, and it's fine to
grab the write lock.
Thanks,
Kuai
>
> Thanks
> --Nilay
> .
>
* Re: [PATCH for-6.18/block 04/10] blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock
2025-09-09 9:36 ` Yu Kuai
@ 2025-09-09 10:11 ` Nilay Shroff
2025-09-09 10:42 ` Yu Kuai
0 siblings, 1 reply; 37+ messages in thread
From: Nilay Shroff @ 2025-09-09 10:11 UTC (permalink / raw)
To: Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi,
yukuai (C)
On 9/9/25 3:06 PM, Yu Kuai wrote:
> Hi,
>
> On 2025/09/09 17:26, Nilay Shroff wrote:
>>
>>
>> On 9/9/25 12:46 PM, Yu Kuai wrote:
>>> Hi,
>>>
>>> On 2025/09/09 14:52, Nilay Shroff wrote:
>>>>
>>>>
>>>> On 9/9/25 12:08 PM, Yu Kuai wrote:
>>>>> Hi,
>>>>>
>>>>> On 2025/09/09 14:29, Nilay Shroff wrote:
>>>>>>
>>>>>>
>>>>>> On 9/8/25 11:45 AM, Yu Kuai wrote:
>>>>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>>>>
>>>>>>> request_queue->nr_requests can be changed by:
>>>>>>>
>>>>>>> a) switching elevator by update nr_hw_queues
>>>>>>> b) switching elevator by elevator sysfs attribute
>>>>>>> c) configue queue sysfs attribute nr_requests
>>>>>>>
>>>>>>> Current lock order is:
>>>>>>>
>>>>>>> 1) update_nr_hwq_lock, case a,b
>>>>>>> 2) freeze_queue
>>>>>>> 3) elevator_lock, cas a,b,c
>>>>>>>
>>>>>>> And update nr_requests is seriablized by elevator_lock() already,
>>>>>>> however, in the case c), we'll have to allocate new sched_tags if
>>>>>>> nr_requests grow, and do this with elevator_lock held and queue
>>>>>>> freezed has the risk of deadlock.
>>>>>>>
>>>>>>> Hence use update_nr_hwq_lock instead, make it possible to allocate
>>>>>>> memory if tags grow, meanwhile also prevent nr_requests to be changed
>>>>>>> concurrently.
>>>>>>>
>>>>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>>>>>> ---
>>>>>>> block/blk-sysfs.c | 12 +++++++++---
>>>>>>> 1 file changed, 9 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>>>>>> index f99519f7a820..7ea15bf68b4b 100644
>>>>>>> --- a/block/blk-sysfs.c
>>>>>>> +++ b/block/blk-sysfs.c
>>>>>>> @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
>>>>>>> int ret, err;
>>>>>>> unsigned int memflags;
>>>>>>> struct request_queue *q = disk->queue;
>>>>>>> + struct blk_mq_tag_set *set = q->tag_set;
>>>>>>> ret = queue_var_store(&nr, page, count);
>>>>>>> if (ret < 0)
>>>>>>> return ret;
>>>>>>> - memflags = blk_mq_freeze_queue(q);
>>>>>>> - mutex_lock(&q->elevator_lock);
>>>>>>> + /* serialize updating nr_requests with switching elevator */
>>>>>>> + down_write(&set->update_nr_hwq_lock);
>>>>>>>
>>>>>> For serializing update of nr_requests with switching elevator,
>>>>>> we should use disable_elv_switch(). So with this change we
>>>>>> don't need to acquire ->update_nr_hwq_lock in writer context
>>>>>> while running blk_mq_update_nr_requests but instead it can run
>>>>>> acquiring ->update_nr_hwq_lock in the reader context.
>>>>>>
>>>>>> So the code flow should be,
>>>>>>
>>>>>> disable_elv_switch => this would set QUEUE_FLAG_NO_ELV_SWITCH
>>>>>> ...
>>>>>> down_read ->update_nr_hwq_lock
>>>>>> acquire ->freeze_lock
>>>>>> acquire ->elevator_lock;
>>>>>> ...
>>>>>> ...
>>>>>> release ->elevator_lock;
>>>>>> release ->freeze_lock
>>>>>>
>>>>>> clear QUEUE_FLAG_NO_ELV_SWITCH
>>>>>> up_read ->update_nr_hwq_lock
>>>>>>
>>>>>
>>>>> Yes, this make sense, however, there is also an implied condition that
>>>>> we should serialize queue_requests_store() with itself, what if a
>>>>> concurrent caller succeed the disable_elv_switch() before the
>>>>> down_read() in this way?
>>>>>
>>>>> t1:
>>>>> disable_elv_switch
>>>>> t2:
>>>>> disable_elv_switch
>>>>>
>>>>> down_read down_read
>>>>>
>>>> I believe this is already protected with the kernfs internal
>>>> mutex locks. So you shouldn't be able to run two sysfs store
>>>> operations concurrently on the same attribute file.
>>>>
>>>
>>> There really is no such internal lock, the call stack is:
>>>
>>> kernfs_fop_write_iter
>>> sysfs_kf_write
>>> queue_attr_store
>>>
>>> There is only a file level mutex kernfs_open_file->lock from the top
>>> function kernfs_fop_write_iter(), however, this lock is not the same
>>> if we open the same attribute file from different context.
>>>
>> Oh yes this lock only protects if the same fd is being written
>> concurrently. However if we open the same sysfs file from different process
>> contexts then fd would be different and so this lock wouldn't protect
>> the simultaneous update of sysfs attribute. Having said that,
>> looking through the code again it seems that q->nr_requests update
>> is protected with ->elevator_lock (including both the elevator switch
>> code and your proposed changes in this patchset). So my question is
>> do we really need to synchronize nr_requests update code with elevator
>> swiupdate_nr_hwq_locktch code? So in another words what if we don't at
>> all use ->update_nr_hwq_lock in queue_requests_store?
>
> 1) lock update_nr_hwq_lock, then no one can change nr_queuests
> 2) checking input nr_reqeusts
> 3) if grow, allocate memory
>
> Main idea here is we can checking if nr_requests grow and then allocate
> mermory, without concern that nr_requests can be changed after memory
> allocation.
>
If nr_requests changes after the memory allocation we're still fine,
because eventually we'd only have one consistent value of nr_requests.
For instance, if process A is updating nr_requests to 128 and a sched
switch is updating nr_requests to 256 simultaneously, then we'd see
nr_requests set to either 128 or 256 in the end, depending on who runs
last. We wouldn't end up with an inconsistent value of nr_requests,
would we?
> BTW, I think this sysfs attr is really a slow path, and it's fine to
> grab the write lock.
>
Yep you're right. But I think we should avoid locks if possible.
Thanks,
--Nilay
* Re: [PATCH for-6.18/block 04/10] blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock
2025-09-09 10:11 ` Nilay Shroff
@ 2025-09-09 10:42 ` Yu Kuai
2025-09-09 11:32 ` Nilay Shroff
0 siblings, 1 reply; 37+ messages in thread
From: Yu Kuai @ 2025-09-09 10:42 UTC (permalink / raw)
To: Nilay Shroff, Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi,
yukuai (C)
Hi,
On 2025/9/9 18:11, Nilay Shroff wrote:
>
> On 9/9/25 3:06 PM, Yu Kuai wrote:
>> Hi,
>>
>> On 2025/09/09 17:26, Nilay Shroff wrote:
>>>
>>> On 9/9/25 12:46 PM, Yu Kuai wrote:
>>>> Hi,
>>>>
>>>> On 2025/09/09 14:52, Nilay Shroff wrote:
>>>>>
>>>>> On 9/9/25 12:08 PM, Yu Kuai wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 2025/09/09 14:29, Nilay Shroff wrote:
>>>>>>>
>>>>>>> On 9/8/25 11:45 AM, Yu Kuai wrote:
>>>>>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>>>>>
>>>>>>>> request_queue->nr_requests can be changed by:
>>>>>>>>
>>>>>>>> a) switching elevator by update nr_hw_queues
>>>>>>>> b) switching elevator by elevator sysfs attribute
>>>>>>>> c) configue queue sysfs attribute nr_requests
>>>>>>>>
>>>>>>>> Current lock order is:
>>>>>>>>
>>>>>>>> 1) update_nr_hwq_lock, case a,b
>>>>>>>> 2) freeze_queue
>>>>>>>> 3) elevator_lock, cas a,b,c
>>>>>>>>
>>>>>>>> And update nr_requests is seriablized by elevator_lock() already,
>>>>>>>> however, in the case c), we'll have to allocate new sched_tags if
>>>>>>>> nr_requests grow, and do this with elevator_lock held and queue
>>>>>>>> freezed has the risk of deadlock.
>>>>>>>>
>>>>>>>> Hence use update_nr_hwq_lock instead, make it possible to allocate
>>>>>>>> memory if tags grow, meanwhile also prevent nr_requests to be changed
>>>>>>>> concurrently.
>>>>>>>>
>>>>>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>>>>>>> ---
>>>>>>>> block/blk-sysfs.c | 12 +++++++++---
>>>>>>>> 1 file changed, 9 insertions(+), 3 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>>>>>>> index f99519f7a820..7ea15bf68b4b 100644
>>>>>>>> --- a/block/blk-sysfs.c
>>>>>>>> +++ b/block/blk-sysfs.c
>>>>>>>> @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
>>>>>>>> int ret, err;
>>>>>>>> unsigned int memflags;
>>>>>>>> struct request_queue *q = disk->queue;
>>>>>>>> + struct blk_mq_tag_set *set = q->tag_set;
>>>>>>>> ret = queue_var_store(&nr, page, count);
>>>>>>>> if (ret < 0)
>>>>>>>> return ret;
>>>>>>>> - memflags = blk_mq_freeze_queue(q);
>>>>>>>> - mutex_lock(&q->elevator_lock);
>>>>>>>> + /* serialize updating nr_requests with switching elevator */
>>>>>>>> + down_write(&set->update_nr_hwq_lock);
>>>>>>>>
>>>>>>> For serializing update of nr_requests with switching elevator,
>>>>>>> we should use disable_elv_switch(). So with this change we
>>>>>>> don't need to acquire ->update_nr_hwq_lock in writer context
>>>>>>> while running blk_mq_update_nr_requests but instead it can run
>>>>>>> acquiring ->update_nr_hwq_lock in the reader context.
>>>>>>>
>>>>>>> So the code flow should be,
>>>>>>>
>>>>>>> disable_elv_switch => this would set QUEUE_FLAG_NO_ELV_SWITCH
>>>>>>> ...
>>>>>>> down_read ->update_nr_hwq_lock
>>>>>>> acquire ->freeze_lock
>>>>>>> acquire ->elevator_lock;
>>>>>>> ...
>>>>>>> ...
>>>>>>> release ->elevator_lock;
>>>>>>> release ->freeze_lock
>>>>>>>
>>>>>>> clear QUEUE_FLAG_NO_ELV_SWITCH
>>>>>>> up_read ->update_nr_hwq_lock
>>>>>>>
>>>>>> Yes, this make sense, however, there is also an implied condition that
>>>>>> we should serialize queue_requests_store() with itself, what if a
>>>>>> concurrent caller succeed the disable_elv_switch() before the
>>>>>> down_read() in this way?
>>>>>>
>>>>>> t1:
>>>>>> disable_elv_switch
>>>>>> t2:
>>>>>> disable_elv_switch
>>>>>>
>>>>>> down_read down_read
>>>>>>
>>>>> I believe this is already protected with the kernfs internal
>>>>> mutex locks. So you shouldn't be able to run two sysfs store
>>>>> operations concurrently on the same attribute file.
>>>>>
>>>> There really is no such internal lock, the call stack is:
>>>>
>>>> kernfs_fop_write_iter
>>>> sysfs_kf_write
>>>> queue_attr_store
>>>>
>>>> There is only a file level mutex kernfs_open_file->lock from the top
>>>> function kernfs_fop_write_iter(), however, this lock is not the same
>>>> if we open the same attribute file from different context.
>>>>
>>> Oh yes this lock only protects if the same fd is being written
>>> concurrently. However if we open the same sysfs file from different process
>>> contexts then fd would be different and so this lock wouldn't protect
>>> the simultaneous update of sysfs attribute. Having said that,
>>> looking through the code again it seems that q->nr_requests update
>>> is protected with ->elevator_lock (including both the elevator switch
>>> code and your proposed changes in this patchset). So my question is
>>> do we really need to synchronize nr_requests update code with elevator
>>> swiupdate_nr_hwq_locktch code? So in another words what if we don't at
>>> all use ->update_nr_hwq_lock in queue_requests_store?
>> 1) lock update_nr_hwq_lock, then no one can change nr_queuests
>> 2) checking input nr_reqeusts
>> 3) if grow, allocate memory
>>
>> Main idea here is we can checking if nr_requests grow and then allocate
>> mermory, without concern that nr_requests can be changed after memory
>> allocation.
>>
> If nr_requests changes after memory allocation we're still good because
> eventually we'd only have one consistent value of nr_requests. For
> instance, if process A is updating nr_requests to 128 and sched switch
> is updating nr_requests to 256 simultaneously then we'd either see
> nr_requests set to 128 or 256 in the end depending on who runs last.
> We wouldn't get into a situation where we find some inconsistent update
> to nr_requests, isn't it?
Then we'd have to allocate memory for every nr_requests input, since
this way we don't know in advance whether the tags will grow. And even
then, we still have to hold the read lock before allocating memory, to
prevent the number of hctxs from changing.
>> BTW, I think this sysfs attr is really a slow path, and it's fine to
>> grab the write lock.
>>
> Yep you're right. But I think we should avoid locks if possible.
Yes, we should avoid locks, but we should also keep the code as simple
as possible, I think.
Thanks,
Kuai
>
> Thanks,
> --Nilay
>
>
* Re: [PATCH for-6.18/block 04/10] blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock
2025-09-09 10:42 ` Yu Kuai
@ 2025-09-09 11:32 ` Nilay Shroff
0 siblings, 0 replies; 37+ messages in thread
From: Nilay Shroff @ 2025-09-09 11:32 UTC (permalink / raw)
To: Yu Kuai, Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi,
yukuai (C)
On 9/9/25 4:12 PM, Yu Kuai wrote:
> Hi,
>
> On 2025/9/9 18:11, Nilay Shroff wrote:
>>
>> On 9/9/25 3:06 PM, Yu Kuai wrote:
>>> Hi,
>>>
>>> On 2025/09/09 17:26, Nilay Shroff wrote:
>>>>
>>>> On 9/9/25 12:46 PM, Yu Kuai wrote:
>>>>> Hi,
>>>>>
>>>>> On 2025/09/09 14:52, Nilay Shroff wrote:
>>>>>>
>>>>>> On 9/9/25 12:08 PM, Yu Kuai wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 2025/09/09 14:29, Nilay Shroff wrote:
>>>>>>>>
>>>>>>>> On 9/8/25 11:45 AM, Yu Kuai wrote:
>>>>>>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>>>>>>
>>>>>>>>> request_queue->nr_requests can be changed by:
>>>>>>>>>
>>>>>>>>> a) switching elevator by update nr_hw_queues
>>>>>>>>> b) switching elevator by elevator sysfs attribute
>>>>>>>>> c) configue queue sysfs attribute nr_requests
>>>>>>>>>
>>>>>>>>> Current lock order is:
>>>>>>>>>
>>>>>>>>> 1) update_nr_hwq_lock, case a,b
>>>>>>>>> 2) freeze_queue
>>>>>>>>> 3) elevator_lock, cas a,b,c
>>>>>>>>>
>>>>>>>>> And update nr_requests is seriablized by elevator_lock() already,
>>>>>>>>> however, in the case c), we'll have to allocate new sched_tags if
>>>>>>>>> nr_requests grow, and do this with elevator_lock held and queue
>>>>>>>>> freezed has the risk of deadlock.
>>>>>>>>>
>>>>>>>>> Hence use update_nr_hwq_lock instead, make it possible to allocate
>>>>>>>>> memory if tags grow, meanwhile also prevent nr_requests to be changed
>>>>>>>>> concurrently.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>>>>>>>> ---
>>>>>>>>> block/blk-sysfs.c | 12 +++++++++---
>>>>>>>>> 1 file changed, 9 insertions(+), 3 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>>>>>>>> index f99519f7a820..7ea15bf68b4b 100644
>>>>>>>>> --- a/block/blk-sysfs.c
>>>>>>>>> +++ b/block/blk-sysfs.c
>>>>>>>>> @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
>>>>>>>>> int ret, err;
>>>>>>>>> unsigned int memflags;
>>>>>>>>> struct request_queue *q = disk->queue;
>>>>>>>>> + struct blk_mq_tag_set *set = q->tag_set;
>>>>>>>>> ret = queue_var_store(&nr, page, count);
>>>>>>>>> if (ret < 0)
>>>>>>>>> return ret;
>>>>>>>>> - memflags = blk_mq_freeze_queue(q);
>>>>>>>>> - mutex_lock(&q->elevator_lock);
>>>>>>>>> + /* serialize updating nr_requests with switching elevator */
>>>>>>>>> + down_write(&set->update_nr_hwq_lock);
>>>>>>>>>
>>>>>>>> For serializing update of nr_requests with switching elevator,
>>>>>>>> we should use disable_elv_switch(). So with this change we
>>>>>>>> don't need to acquire ->update_nr_hwq_lock in writer context
>>>>>>>> while running blk_mq_update_nr_requests but instead it can run
>>>>>>>> acquiring ->update_nr_hwq_lock in the reader context.
>>>>>>>>
>>>>>>>> So the code flow should be,
>>>>>>>>
>>>>>>>> disable_elv_switch => this would set QUEUE_FLAG_NO_ELV_SWITCH
>>>>>>>> ...
>>>>>>>> down_read ->update_nr_hwq_lock
>>>>>>>> acquire ->freeze_lock
>>>>>>>> acquire ->elevator_lock;
>>>>>>>> ...
>>>>>>>> ...
>>>>>>>> release ->elevator_lock;
>>>>>>>> release ->freeze_lock
>>>>>>>>
>>>>>>>> clear QUEUE_FLAG_NO_ELV_SWITCH
>>>>>>>> up_read ->update_nr_hwq_lock
>>>>>>>>
>>>>>>> Yes, this make sense, however, there is also an implied condition that
>>>>>>> we should serialize queue_requests_store() with itself, what if a
>>>>>>> concurrent caller succeed the disable_elv_switch() before the
>>>>>>> down_read() in this way?
>>>>>>>
>>>>>>> t1:
>>>>>>> disable_elv_switch
>>>>>>> t2:
>>>>>>> disable_elv_switch
>>>>>>>
>>>>>>> down_read down_read
>>>>>>>
>>>>>> I believe this is already protected with the kernfs internal
>>>>>> mutex locks. So you shouldn't be able to run two sysfs store
>>>>>> operations concurrently on the same attribute file.
>>>>>>
>>>>> There really is no such internal lock, the call stack is:
>>>>>
>>>>> kernfs_fop_write_iter
>>>>> sysfs_kf_write
>>>>> queue_attr_store
>>>>>
>>>>> There is only a file level mutex kernfs_open_file->lock from the top
>>>>> function kernfs_fop_write_iter(), however, this lock is not the same
>>>>> if we open the same attribute file from different context.
>>>>>
>>>> Oh yes this lock only protects if the same fd is being written
>>>> concurrently. However if we open the same sysfs file from different process
>>>> contexts then fd would be different and so this lock wouldn't protect
>>>> the simultaneous update of sysfs attribute. Having said that,
>>>> looking through the code again it seems that q->nr_requests update
>>>> is protected with ->elevator_lock (including both the elevator switch
>>>> code and your proposed changes in this patchset). So my question is
>>>> do we really need to synchronize nr_requests update code with elevator
>>>> swiupdate_nr_hwq_locktch code? So in another words what if we don't at
>>>> all use ->update_nr_hwq_lock in queue_requests_store?
>>> 1) lock update_nr_hwq_lock, then no one can change nr_queuests
>>> 2) checking input nr_reqeusts
>>> 3) if grow, allocate memory
>>>
>>> Main idea here is we can checking if nr_requests grow and then allocate
>>> mermory, without concern that nr_requests can be changed after memory
>>> allocation.
>>>
>> If nr_requests changes after memory allocation we're still good because
>> eventually we'd only have one consistent value of nr_requests. For
>> instance, if process A is updating nr_requests to 128 and sched switch
>> is updating nr_requests to 256 simultaneously then we'd either see
>> nr_requests set to 128 or 256 in the end depending on who runs last.
>> We wouldn't get into a situation where we find some inconsistent update
>> to nr_requests, isn't it?
>
> Then we'll have to allocate memory for every input nr_requests now, we don't
> know for sure if tag will grow in advance this way. And even with this, we
> still have to hold read lock before allocating memory, to prevent nr hctxs
> to change.
>
Hmm, OK, so we still have to acquire the read lock and we can't avoid
->update_nr_hwq_lock. And that should be okay, as an rw_semaphore is by
design a multiple-readers, single-writer lock.
With the proposed patch we now have two contexts acquiring
->update_nr_hwq_lock in writer mode, but for this particular case I'm
okay with that, just to avoid the unnecessary complexity of making the
nr_requests update a reader-side context. And yes, as you mentioned,
this code runs in a slow path, and we shouldn't starve readers or
writers since the code it protects is not big or complex. So with that
said, I'll add my review tag in another email.
Thanks,
--Nilay
* Re: [PATCH for-6.18/block 04/10] blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock
2025-09-08 6:15 ` [PATCH for-6.18/block 04/10] blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock Yu Kuai
2025-09-09 6:29 ` Nilay Shroff
@ 2025-09-09 11:40 ` Nilay Shroff
1 sibling, 0 replies; 37+ messages in thread
From: Nilay Shroff @ 2025-09-09 11:40 UTC (permalink / raw)
To: Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yi.zhang, yangerkun,
johnny.chenyi
On 9/8/25 11:45 AM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> request_queue->nr_requests can be changed by:
>
> a) switching elevator by update nr_hw_queues
> b) switching elevator by elevator sysfs attribute
> c) configue queue sysfs attribute nr_requests
>
> Current lock order is:
>
> 1) update_nr_hwq_lock, case a,b
> 2) freeze_queue
> 3) elevator_lock, cas a,b,c
>
> And update nr_requests is seriablized by elevator_lock() already,
> however, in the case c), we'll have to allocate new sched_tags if
> nr_requests grow, and do this with elevator_lock held and queue
> freezed has the risk of deadlock.
>
> Hence use update_nr_hwq_lock instead, making it possible to allocate
> memory if the tags grow, while also preventing nr_requests from being
> changed concurrently.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> block/blk-sysfs.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index f99519f7a820..7ea15bf68b4b 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -68,13 +68,14 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
> int ret, err;
> unsigned int memflags;
> struct request_queue *q = disk->queue;
> + struct blk_mq_tag_set *set = q->tag_set;
>
> ret = queue_var_store(&nr, page, count);
> if (ret < 0)
> return ret;
>
> - memflags = blk_mq_freeze_queue(q);
> - mutex_lock(&q->elevator_lock);
> + /* serialize updating nr_requests with switching elevator */
> + down_write(&set->update_nr_hwq_lock);
>
> if (nr == q->nr_requests)
> goto unlock;
> @@ -89,13 +90,18 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
> goto unlock;
> }
>
> + memflags = blk_mq_freeze_queue(q);
> + mutex_lock(&q->elevator_lock);
> +
> err = blk_mq_update_nr_requests(disk->queue, nr);
> if (err)
> ret = err;
>
> -unlock:
> mutex_unlock(&q->elevator_lock);
> blk_mq_unfreeze_queue(q, memflags);
> +
> +unlock:
> + up_write(&set->update_nr_hwq_lock);
> return ret;
> }
>
As you moved ->elevator_lock here, q->elevator is now accessed directly
without holding ->elevator_lock, so please add a comment explaining why
that is safe.
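For example, something along these lines (just a sketch, the exact wording
is up to you):

	/*
	 * Accessing q->elevator without ->elevator_lock is safe here
	 * because we hold set->update_nr_hwq_lock in writer mode, so an
	 * elevator switch (which takes it in reader mode) cannot run
	 * concurrently.
	 */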
So with that change, please add:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH for-6.18/block 05/10] blk-mq: cleanup shared tags case in blk_mq_update_nr_requests()
2025-09-08 6:15 [PATCH for-6.18/block 00/10] cleanup and fixes for updating nr_requests Yu Kuai
` (3 preceding siblings ...)
2025-09-08 6:15 ` [PATCH for-6.18/block 04/10] blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock Yu Kuai
@ 2025-09-08 6:15 ` Yu Kuai
2025-09-09 11:58 ` Nilay Shroff
2025-09-08 6:15 ` [PATCH for-6.18/block 06/10] blk-mq: split bitmap grow and resize " Yu Kuai
` (4 subsequent siblings)
9 siblings, 1 reply; 37+ messages in thread
From: Yu Kuai @ 2025-09-08 6:15 UTC (permalink / raw)
To: nilay, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
For the shared tags case, all hctx->shared_tags/tags are the same, so it
doesn't make sense to call into blk_mq_tag_update_depth() multiple times
for the same tags.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-mq-tag.c | 7 -------
block/blk-mq.c | 43 ++++++++++++++++++++++---------------------
2 files changed, 22 insertions(+), 28 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 725210f27471..aed84c5d5c2b 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -596,13 +596,6 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
struct blk_mq_tag_set *set = hctx->queue->tag_set;
struct blk_mq_tags *new;
- /*
- * Only the sbitmap needs resizing since we allocated the max
- * initially.
- */
- if (blk_mq_is_shared_tags(set->flags))
- return 0;
-
new = blk_mq_alloc_map_and_rqs(set, hctx->queue_num, tdepth);
if (!new)
return -ENOMEM;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d7540b8e7471..1ff6370f7314 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4926,34 +4926,35 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
blk_mq_quiesce_queue(q);
- queue_for_each_hw_ctx(q, hctx, i) {
- if (!hctx->tags)
- continue;
- /*
- * If we're using an MQ scheduler, just update the scheduler
- * queue depth. This is similar to what the old code would do.
- */
- if (hctx->sched_tags) {
- ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
- nr);
- } else {
- ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr);
- }
- if (ret)
- goto out;
- }
-
- q->nr_requests = nr;
- if (q->elevator && q->elevator->type->ops.depth_updated)
- q->elevator->type->ops.depth_updated(q);
-
if (blk_mq_is_shared_tags(set->flags)) {
if (q->elevator)
blk_mq_tag_update_sched_shared_tags(q);
else
blk_mq_tag_resize_shared_tags(set, nr);
+ } else {
+ queue_for_each_hw_ctx(q, hctx, i) {
+ if (!hctx->tags)
+ continue;
+ /*
+ * If we're using an MQ scheduler, just update the
+ * scheduler queue depth. This is similar to what the
+ * old code would do.
+ */
+ if (hctx->sched_tags)
+ ret = blk_mq_tag_update_depth(hctx,
+ &hctx->sched_tags, nr);
+ else
+ ret = blk_mq_tag_update_depth(hctx,
+ &hctx->tags, nr);
+ if (ret)
+ goto out;
+ }
}
+ q->nr_requests = nr;
+ if (q->elevator && q->elevator->type->ops.depth_updated)
+ q->elevator->type->ops.depth_updated(q);
+
out:
blk_mq_unquiesce_queue(q);
--
2.39.2
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH for-6.18/block 05/10] blk-mq: cleanup shared tags case in blk_mq_update_nr_requests()
2025-09-08 6:15 ` [PATCH for-6.18/block 05/10] blk-mq: cleanup shared tags case in blk_mq_update_nr_requests() Yu Kuai
@ 2025-09-09 11:58 ` Nilay Shroff
0 siblings, 0 replies; 37+ messages in thread
From: Nilay Shroff @ 2025-09-09 11:58 UTC (permalink / raw)
To: Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yi.zhang, yangerkun,
johnny.chenyi
On 9/8/25 11:45 AM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> For shared tags case, all hctx->shared_tags/tags are the same, it doesn't
> make sense to call into blk_mq_tag_update_depth() multiple times for the
> same tags.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Looks good to me:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH for-6.18/block 06/10] blk-mq: split bitmap grow and resize case in blk_mq_update_nr_requests()
2025-09-08 6:15 [PATCH for-6.18/block 00/10] cleanup and fixes for updating nr_requests Yu Kuai
` (4 preceding siblings ...)
2025-09-08 6:15 ` [PATCH for-6.18/block 05/10] blk-mq: cleanup shared tags case in blk_mq_update_nr_requests() Yu Kuai
@ 2025-09-08 6:15 ` Yu Kuai
2025-09-09 12:18 ` Nilay Shroff
2025-09-08 6:15 ` [PATCH for-6.18/block 07/10] blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags() Yu Kuai
` (3 subsequent siblings)
9 siblings, 1 reply; 37+ messages in thread
From: Yu Kuai @ 2025-09-08 6:15 UTC (permalink / raw)
To: nilay, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
No functional changes are intended; make the code cleaner and prepare to fix
the grow case in the following patches.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-mq.c | 28 ++++++++++++++++------------
1 file changed, 16 insertions(+), 12 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1ff6370f7314..82fa81036115 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4931,21 +4931,25 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
blk_mq_tag_update_sched_shared_tags(q);
else
blk_mq_tag_resize_shared_tags(set, nr);
- } else {
+ } else if (!q->elevator) {
queue_for_each_hw_ctx(q, hctx, i) {
if (!hctx->tags)
continue;
- /*
- * If we're using an MQ scheduler, just update the
- * scheduler queue depth. This is similar to what the
- * old code would do.
- */
- if (hctx->sched_tags)
- ret = blk_mq_tag_update_depth(hctx,
- &hctx->sched_tags, nr);
- else
- ret = blk_mq_tag_update_depth(hctx,
- &hctx->tags, nr);
+ sbitmap_queue_resize(&hctx->tags->bitmap_tags,
+ nr - hctx->tags->nr_reserved_tags);
+ }
+ } else if (nr <= q->elevator->et->nr_requests) {
+ queue_for_each_hw_ctx(q, hctx, i) {
+ if (!hctx->sched_tags)
+ continue;
+ sbitmap_queue_resize(&hctx->sched_tags->bitmap_tags,
+ nr - hctx->sched_tags->nr_reserved_tags);
+ }
+ } else {
+ queue_for_each_hw_ctx(q, hctx, i) {
+ if (!hctx->sched_tags)
+ continue;
+ blk_mq_tag_update_depth(hctx, &hctx->sched_tags, nr);
if (ret)
goto out;
}
--
2.39.2
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH for-6.18/block 06/10] blk-mq: split bitmap grow and resize case in blk_mq_update_nr_requests()
2025-09-08 6:15 ` [PATCH for-6.18/block 06/10] blk-mq: split bitmap grow and resize " Yu Kuai
@ 2025-09-09 12:18 ` Nilay Shroff
2025-09-09 16:39 ` Yu Kuai
0 siblings, 1 reply; 37+ messages in thread
From: Nilay Shroff @ 2025-09-09 12:18 UTC (permalink / raw)
To: Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yi.zhang, yangerkun,
johnny.chenyi
On 9/8/25 11:45 AM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> No functional changes are intended, make code cleaner and prepare to fix
> the grow case in following patches.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> block/blk-mq.c | 28 ++++++++++++++++------------
> 1 file changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 1ff6370f7314..82fa81036115 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4931,21 +4931,25 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
> blk_mq_tag_update_sched_shared_tags(q);
> else
> blk_mq_tag_resize_shared_tags(set, nr);
> - } else {
> + } else if (!q->elevator) {
> queue_for_each_hw_ctx(q, hctx, i) {
> if (!hctx->tags)
> continue;
> - /*
> - * If we're using an MQ scheduler, just update the
> - * scheduler queue depth. This is similar to what the
> - * old code would do.
> - */
> - if (hctx->sched_tags)
> - ret = blk_mq_tag_update_depth(hctx,
> - &hctx->sched_tags, nr);
> - else
> - ret = blk_mq_tag_update_depth(hctx,
> - &hctx->tags, nr);
> + sbitmap_queue_resize(&hctx->tags->bitmap_tags,
> + nr - hctx->tags->nr_reserved_tags);
> + }
> + } else if (nr <= q->elevator->et->nr_requests) {
> + queue_for_each_hw_ctx(q, hctx, i) {
> + if (!hctx->sched_tags)
> + continue;
> + sbitmap_queue_resize(&hctx->sched_tags->bitmap_tags,
> + nr - hctx->sched_tags->nr_reserved_tags);
> + }
> + } else {
> + queue_for_each_hw_ctx(q, hctx, i) {
> + if (!hctx->sched_tags)
> + continue;
> + blk_mq_tag_update_depth(hctx, &hctx->sched_tags, nr);
> if (ret)
> goto out;
> }
The above code is good, however could it be simplified a bit?
It's a little difficult to follow through all the nesting, so
could it be structured as below:
if (shared-tags) {
if (elevator)
// resize sched-shared-tags bitmap
else
// resize shared-tags bitmap
} else {
// non-shared tags
if (elevator) {
if (new-depth-is-less-than-the-current-depth)
// resize sched-tags bitmap
else
// handle sched tags grow
} else
// resize tags bitmap
}
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH for-6.18/block 06/10] blk-mq: split bitmap grow and resize case in blk_mq_update_nr_requests()
2025-09-09 12:18 ` Nilay Shroff
@ 2025-09-09 16:39 ` Yu Kuai
2025-09-10 6:30 ` Nilay Shroff
0 siblings, 1 reply; 37+ messages in thread
From: Yu Kuai @ 2025-09-09 16:39 UTC (permalink / raw)
To: Nilay Shroff, Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yi.zhang, yangerkun,
johnny.chenyi
Hi,
在 2025/9/9 20:18, Nilay Shroff 写道:
>
> On 9/8/25 11:45 AM, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> No functional changes are intended, make code cleaner and prepare to fix
>> the grow case in following patches.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>> block/blk-mq.c | 28 ++++++++++++++++------------
>> 1 file changed, 16 insertions(+), 12 deletions(-)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 1ff6370f7314..82fa81036115 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -4931,21 +4931,25 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>> blk_mq_tag_update_sched_shared_tags(q);
>> else
>> blk_mq_tag_resize_shared_tags(set, nr);
>> - } else {
>> + } else if (!q->elevator) {
>> queue_for_each_hw_ctx(q, hctx, i) {
>> if (!hctx->tags)
>> continue;
>> - /*
>> - * If we're using an MQ scheduler, just update the
>> - * scheduler queue depth. This is similar to what the
>> - * old code would do.
>> - */
>> - if (hctx->sched_tags)
>> - ret = blk_mq_tag_update_depth(hctx,
>> - &hctx->sched_tags, nr);
>> - else
>> - ret = blk_mq_tag_update_depth(hctx,
>> - &hctx->tags, nr);
>> + sbitmap_queue_resize(&hctx->tags->bitmap_tags,
>> + nr - hctx->tags->nr_reserved_tags);
>> + }
>> + } else if (nr <= q->elevator->et->nr_requests) {
>> + queue_for_each_hw_ctx(q, hctx, i) {
>> + if (!hctx->sched_tags)
>> + continue;
>> + sbitmap_queue_resize(&hctx->sched_tags->bitmap_tags,
>> + nr - hctx->sched_tags->nr_reserved_tags);
>> + }
>> + } else {
>> + queue_for_each_hw_ctx(q, hctx, i) {
>> + if (!hctx->sched_tags)
>> + continue;
>> + blk_mq_tag_update_depth(hctx, &hctx->sched_tags, nr);
>> if (ret)
>> goto out;
>> }
> The above code is good however can this be bit simplified?
> It's a bit difficult to follow through all nesting and so
> could it be simplified as below:
>
> if (shared-tags) {
> if (elevator)
> // resize sched-shared-tags bitmap
> else
> // resize shared-tags bitmap
> } else {
> // non-shared tags
> if (elevator) {
> if (new-depth-is-less-than-the-current-depth)
> // resize sched-tags bitmap
> else
> // handle sched tags grow
> } else
> // resize tags bitmap
> }
AFAIK, an if - else if chain should be better than nested if - else, right?
If you don't mind, I can add a comment to each branch of the chain to make the code cleaner:
if () {
/* shared tags */
...
} else if () {
/* non-shared tags and elevator is none */
...
} else if () {
/* non-shared tags and elevator is not none, nr_requests doesn't grow */
...
} else () {
/* non-shared tags and elevator is not none, nr_requests grow */
...
}
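Fleshed out, the chain from this patch would then read roughly as below
(just a sketch based on the diff in this patch; the grow branch is reworked
later in the series):

	if (blk_mq_is_shared_tags(set->flags)) {
		/* shared tags: resize the single shared (sched) tags once */
		if (q->elevator)
			blk_mq_tag_update_sched_shared_tags(q);
		else
			blk_mq_tag_resize_shared_tags(set, nr);
	} else if (!q->elevator) {
		/* non-shared tags and elevator is none */
		queue_for_each_hw_ctx(q, hctx, i) {
			if (!hctx->tags)
				continue;
			sbitmap_queue_resize(&hctx->tags->bitmap_tags,
					nr - hctx->tags->nr_reserved_tags);
		}
	} else if (nr <= q->elevator->et->nr_requests) {
		/* non-shared tags, elevator is not none, nr_requests doesn't grow */
		queue_for_each_hw_ctx(q, hctx, i) {
			if (!hctx->sched_tags)
				continue;
			sbitmap_queue_resize(&hctx->sched_tags->bitmap_tags,
					nr - hctx->sched_tags->nr_reserved_tags);
		}
	} else {
		/* non-shared tags, elevator is not none, nr_requests grows */
		...
	}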
Thanks,
Kuai
>
> Thanks,
> --Nilay
>
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH for-6.18/block 06/10] blk-mq: split bitmap grow and resize case in blk_mq_update_nr_requests()
2025-09-09 16:39 ` Yu Kuai
@ 2025-09-10 6:30 ` Nilay Shroff
2025-09-10 6:42 ` Yu Kuai
0 siblings, 1 reply; 37+ messages in thread
From: Nilay Shroff @ 2025-09-10 6:30 UTC (permalink / raw)
To: Yu Kuai, Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yi.zhang, yangerkun,
johnny.chenyi
On 9/9/25 10:09 PM, Yu Kuai wrote:
> Hi,
>
> 在 2025/9/9 20:18, Nilay Shroff 写道:
>>
>> On 9/8/25 11:45 AM, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> No functional changes are intended, make code cleaner and prepare to fix
>>> the grow case in following patches.
>>>
>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>> ---
>>> block/blk-mq.c | 28 ++++++++++++++++------------
>>> 1 file changed, 16 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 1ff6370f7314..82fa81036115 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -4931,21 +4931,25 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>>> blk_mq_tag_update_sched_shared_tags(q);
>>> else
>>> blk_mq_tag_resize_shared_tags(set, nr);
>>> - } else {
>>> + } else if (!q->elevator) {
>>> queue_for_each_hw_ctx(q, hctx, i) {
>>> if (!hctx->tags)
>>> continue;
>>> - /*
>>> - * If we're using an MQ scheduler, just update the
>>> - * scheduler queue depth. This is similar to what the
>>> - * old code would do.
>>> - */
>>> - if (hctx->sched_tags)
>>> - ret = blk_mq_tag_update_depth(hctx,
>>> - &hctx->sched_tags, nr);
>>> - else
>>> - ret = blk_mq_tag_update_depth(hctx,
>>> - &hctx->tags, nr);
>>> + sbitmap_queue_resize(&hctx->tags->bitmap_tags,
>>> + nr - hctx->tags->nr_reserved_tags);
>>> + }
>>> + } else if (nr <= q->elevator->et->nr_requests) {
>>> + queue_for_each_hw_ctx(q, hctx, i) {
>>> + if (!hctx->sched_tags)
>>> + continue;
>>> + sbitmap_queue_resize(&hctx->sched_tags->bitmap_tags,
>>> + nr - hctx->sched_tags->nr_reserved_tags);
>>> + }
>>> + } else {
>>> + queue_for_each_hw_ctx(q, hctx, i) {
>>> + if (!hctx->sched_tags)
>>> + continue;
>>> + blk_mq_tag_update_depth(hctx, &hctx->sched_tags, nr);
>>> if (ret)
>>> goto out;
>>> }
>> The above code is good however can this be bit simplified?
>> It's a bit difficult to follow through all nesting and so
>> could it be simplified as below:
>>
>> if (shared-tags) {
>> if (elevator)
>> // resize sched-shared-tags bitmap
>> else
>> // resize shared-tags bitmap
>> } else {
>> // non-shared tags
>> if (elevator) {
>> if (new-depth-is-less-than-the-current-depth)
>> // resize sched-tags bitmap
>> else
>> // handle sched tags grow
>> } else
>> // resize tags bitmap
>> }
>
> AFAIK, if - else if chain should be better than nested if - else, right?
>
> If you don't mind, I can add comments to each else if chain to make code cleaner:
>
> if () {
> /* shared tags */
> ...
> } else if () {
> /* non-shared tags and elevator is none */
> ...
> } else if () {
> /* non-shared tags and elevator is not none, nr_requests doesn't grow */
> ...
> } else () {
> /* non-shared tags and elevator is not none, nr_requests grow */
> ...
> }
>
Yeah, I am good with the proper comments as well, so that it'd be easy
for anyone reviewing the code later to understand what all those nested
if-else conditions mean.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH for-6.18/block 06/10] blk-mq: split bitmap grow and resize case in blk_mq_update_nr_requests()
2025-09-10 6:30 ` Nilay Shroff
@ 2025-09-10 6:42 ` Yu Kuai
0 siblings, 0 replies; 37+ messages in thread
From: Yu Kuai @ 2025-09-10 6:42 UTC (permalink / raw)
To: Nilay Shroff, Yu Kuai, Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi,
yukuai (C)
Hi,
在 2025/09/10 14:30, Nilay Shroff 写道:
>
>
> On 9/9/25 10:09 PM, Yu Kuai wrote:
>> Hi,
>>
>> 在 2025/9/9 20:18, Nilay Shroff 写道:
>>>
>>> On 9/8/25 11:45 AM, Yu Kuai wrote:
>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>
>>>> No functional changes are intended, make code cleaner and prepare to fix
>>>> the grow case in following patches.
>>>>
>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>>> ---
>>>> block/blk-mq.c | 28 ++++++++++++++++------------
>>>> 1 file changed, 16 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index 1ff6370f7314..82fa81036115 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -4931,21 +4931,25 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>>>> blk_mq_tag_update_sched_shared_tags(q);
>>>> else
>>>> blk_mq_tag_resize_shared_tags(set, nr);
>>>> - } else {
>>>> + } else if (!q->elevator) {
>>>> queue_for_each_hw_ctx(q, hctx, i) {
>>>> if (!hctx->tags)
>>>> continue;
>>>> - /*
>>>> - * If we're using an MQ scheduler, just update the
>>>> - * scheduler queue depth. This is similar to what the
>>>> - * old code would do.
>>>> - */
>>>> - if (hctx->sched_tags)
>>>> - ret = blk_mq_tag_update_depth(hctx,
>>>> - &hctx->sched_tags, nr);
>>>> - else
>>>> - ret = blk_mq_tag_update_depth(hctx,
>>>> - &hctx->tags, nr);
>>>> + sbitmap_queue_resize(&hctx->tags->bitmap_tags,
>>>> + nr - hctx->tags->nr_reserved_tags);
>>>> + }
>>>> + } else if (nr <= q->elevator->et->nr_requests) {
>>>> + queue_for_each_hw_ctx(q, hctx, i) {
>>>> + if (!hctx->sched_tags)
>>>> + continue;
>>>> + sbitmap_queue_resize(&hctx->sched_tags->bitmap_tags,
>>>> + nr - hctx->sched_tags->nr_reserved_tags);
>>>> + }
>>>> + } else {
>>>> + queue_for_each_hw_ctx(q, hctx, i) {
>>>> + if (!hctx->sched_tags)
>>>> + continue;
>>>> + blk_mq_tag_update_depth(hctx, &hctx->sched_tags, nr);
>>>> if (ret)
>>>> goto out;
>>>> }
>>> The above code is good however can this be bit simplified?
>>> It's a bit difficult to follow through all nesting and so
>>> could it be simplified as below:
>>>
>>> if (shared-tags) {
>>> if (elevator)
>>> // resize sched-shared-tags bitmap
>>> else
>>> // resize shared-tags bitmap
>>> } else {
>>> // non-shared tags
>>> if (elevator) {
>>> if (new-depth-is-less-than-the-current-depth)
>>> // resize sched-tags bitmap
>>> else
>>> // handle sched tags grow
>>> } else
>>> // resize tags bitmap
>>> }
>>
>> AFAIK, if - else if chain should be better than nested if - else, right?
>>
>> If you don't mind, I can add comments to each else if chain to make code cleaner:
>>
>> if () {
>> /* shared tags */
>> ...
>> } else if () {
>> /* non-shared tags and elevator is none */
>> ...
>> } else if () {
>> /* non-shared tags and elevator is not none, nr_requests doesn't grow */
>> ...
>> } else () {
>> /* non-shared tags and elevator is not none, nr_requests grow */
>> ...
>> }
>>
> Yeah, I am good with the proper comments as well so that it'd be easy
> for anyone reviewing the code later to understand what those all nested
> if-else conditions meant.
>
Ok, I'll do that in the next version.
Thanks for the review!
Kuai
> Thanks,
> --Nilay
>
> .
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH for-6.18/block 07/10] blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags()
2025-09-08 6:15 [PATCH for-6.18/block 00/10] cleanup and fixes for updating nr_requests Yu Kuai
` (5 preceding siblings ...)
2025-09-08 6:15 ` [PATCH for-6.18/block 06/10] blk-mq: split bitmap grow and resize " Yu Kuai
@ 2025-09-08 6:15 ` Yu Kuai
2025-09-09 12:19 ` Nilay Shroff
2025-09-08 6:15 ` [PATCH for-6.18/block 08/10] blk-mq: fix potential deadlock while nr_requests grown Yu Kuai
` (2 subsequent siblings)
9 siblings, 1 reply; 37+ messages in thread
From: Yu Kuai @ 2025-09-08 6:15 UTC (permalink / raw)
To: nilay, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
This helper only supports allocating the default number of requests; add a
new parameter to support a specific number of requests.
Prepare to fix a potential deadlock in the case where nr_requests grows.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-mq-sched.c | 14 +++++---------
block/blk-mq-sched.h | 2 +-
block/blk-mq.h | 11 +++++++++++
block/elevator.c | 3 ++-
4 files changed, 19 insertions(+), 11 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index e2ce4a28e6c9..d06bb137a743 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -454,7 +454,7 @@ void blk_mq_free_sched_tags_batch(struct xarray *et_table,
}
struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
- unsigned int nr_hw_queues)
+ unsigned int nr_hw_queues, unsigned int nr_requests)
{
unsigned int nr_tags;
int i;
@@ -470,13 +470,8 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
nr_tags * sizeof(struct blk_mq_tags *), gfp);
if (!et)
return NULL;
- /*
- * Default to double of smaller one between hw queue_depth and
- * 128, since we don't split into sync/async like the old code
- * did. Additionally, this is a per-hw queue depth.
- */
- et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
- BLKDEV_DEFAULT_RQ);
+
+ et->nr_requests = nr_requests;
et->nr_hw_queues = nr_hw_queues;
if (blk_mq_is_shared_tags(set->flags)) {
@@ -521,7 +516,8 @@ int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
* concurrently.
*/
if (q->elevator) {
- et = blk_mq_alloc_sched_tags(set, nr_hw_queues);
+ et = blk_mq_alloc_sched_tags(set, nr_hw_queues,
+ blk_mq_default_nr_requests(set));
if (!et)
goto out_unwind;
if (xa_insert(et_table, q->id, et, gfp))
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index fe83187f41db..8e21a6b1415d 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -24,7 +24,7 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
void blk_mq_sched_free_rqs(struct request_queue *q);
struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
- unsigned int nr_hw_queues);
+ unsigned int nr_hw_queues, unsigned int nr_requests);
int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
struct blk_mq_tag_set *set, unsigned int nr_hw_queues);
void blk_mq_free_sched_tags(struct elevator_tags *et,
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 2b3ade60c90b..731f4578d9a8 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -109,6 +109,17 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(blk_opf_t opf,
return ctx->hctxs[blk_mq_get_hctx_type(opf)];
}
+/*
+ * Default to double of smaller one between hw queue_depth and
+ * 128, since we don't split into sync/async like the old code
+ * did. Additionally, this is a per-hw queue depth.
+ */
+static inline unsigned int blk_mq_default_nr_requests(
+ struct blk_mq_tag_set *set)
+{
+ return 2 * min_t(unsigned int, set->queue_depth, BLKDEV_DEFAULT_RQ);
+}
+
/*
* sysfs helpers
*/
diff --git a/block/elevator.c b/block/elevator.c
index fe96c6f4753c..e2ebfbf107b3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -669,7 +669,8 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
lockdep_assert_held(&set->update_nr_hwq_lock);
if (strncmp(ctx->name, "none", 4)) {
- ctx->et = blk_mq_alloc_sched_tags(set, set->nr_hw_queues);
+ ctx->et = blk_mq_alloc_sched_tags(set, set->nr_hw_queues,
+ blk_mq_default_nr_requests(set));
if (!ctx->et)
return -ENOMEM;
}
--
2.39.2
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH for-6.18/block 07/10] blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags()
2025-09-08 6:15 ` [PATCH for-6.18/block 07/10] blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags() Yu Kuai
@ 2025-09-09 12:19 ` Nilay Shroff
0 siblings, 0 replies; 37+ messages in thread
From: Nilay Shroff @ 2025-09-09 12:19 UTC (permalink / raw)
To: Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yi.zhang, yangerkun,
johnny.chenyi
On 9/8/25 11:45 AM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> This helper only support to allocate the default number of requests,
> add a new parameter to support specific number of requests.
>
> Prepare to fix potential deadlock in the case nr_requests grow.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Looks good to me:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH for-6.18/block 08/10] blk-mq: fix potential deadlock while nr_requests grown
2025-09-08 6:15 [PATCH for-6.18/block 00/10] cleanup and fixes for updating nr_requests Yu Kuai
` (6 preceding siblings ...)
2025-09-08 6:15 ` [PATCH for-6.18/block 07/10] blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags() Yu Kuai
@ 2025-09-08 6:15 ` Yu Kuai
2025-09-09 6:39 ` Nilay Shroff
` (2 more replies)
2025-09-08 6:15 ` [PATCH for-6.18/block 09/10] blk-mq: remove blk_mq_tag_update_depth() Yu Kuai
2025-09-08 6:15 ` [PATCH for-6.18/block 10/10] blk-mq: fix stale nr_requests documentation Yu Kuai
9 siblings, 3 replies; 37+ messages in thread
From: Yu Kuai @ 2025-09-08 6:15 UTC (permalink / raw)
To: nilay, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
Allocating and freeing sched_tags while the queue is frozen can deadlock[1];
this is a long-standing problem, hence allocate memory before freezing the
queue and free memory after the queue is unfrozen.
[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-mq.c | 22 ++++++++++------------
block/blk-mq.h | 5 ++++-
block/blk-sysfs.c | 25 +++++++++++++++++--------
3 files changed, 31 insertions(+), 21 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 82fa81036115..d85afbb9f031 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4917,11 +4917,13 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
}
EXPORT_SYMBOL(blk_mq_free_tag_set);
-int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
+struct elevator_tags *blk_mq_update_nr_requests(struct request_queue *q,
+ struct elevator_tags *et,
+ unsigned int nr)
{
struct blk_mq_tag_set *set = q->tag_set;
+ struct elevator_tags *old_et = NULL;
struct blk_mq_hw_ctx *hctx;
- int ret = 0;
unsigned long i;
blk_mq_quiesce_queue(q);
@@ -4946,23 +4948,19 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
nr - hctx->sched_tags->nr_reserved_tags);
}
} else {
- queue_for_each_hw_ctx(q, hctx, i) {
- if (!hctx->sched_tags)
- continue;
- blk_mq_tag_update_depth(hctx, &hctx->sched_tags, nr);
- if (ret)
- goto out;
- }
+ queue_for_each_hw_ctx(q, hctx, i)
+ hctx->sched_tags = et->tags[i];
+
+ old_et = q->elevator->et;
+ q->elevator->et = et;
}
q->nr_requests = nr;
if (q->elevator && q->elevator->type->ops.depth_updated)
q->elevator->type->ops.depth_updated(q);
-out:
blk_mq_unquiesce_queue(q);
-
- return ret;
+ return old_et;
}
/*
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 731f4578d9a8..6c9d03625ba1 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -6,6 +6,7 @@
#include "blk-stat.h"
struct blk_mq_tag_set;
+struct elevator_tags;
struct blk_mq_ctxs {
struct kobject kobj;
@@ -45,7 +46,9 @@ void blk_mq_submit_bio(struct bio *bio);
int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, struct io_comp_batch *iob,
unsigned int flags);
void blk_mq_exit_queue(struct request_queue *q);
-int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
+struct elevator_tags *blk_mq_update_nr_requests(struct request_queue *q,
+ struct elevator_tags *tags,
+ unsigned int nr);
void blk_mq_wake_waiters(struct request_queue *q);
bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *,
bool);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 7ea15bf68b4b..a0a7ebad378f 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -64,11 +64,12 @@ static ssize_t queue_requests_show(struct gendisk *disk, char *page)
static ssize_t
queue_requests_store(struct gendisk *disk, const char *page, size_t count)
{
- unsigned long nr;
- int ret, err;
- unsigned int memflags;
struct request_queue *q = disk->queue;
struct blk_mq_tag_set *set = q->tag_set;
+ struct elevator_tags *et = NULL;
+ unsigned int memflags;
+ unsigned long nr;
+ int ret;
ret = queue_var_store(&nr, page, count);
if (ret < 0)
@@ -90,16 +91,24 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
goto unlock;
}
+ if (q->elevator && nr > q->elevator->et->nr_requests) {
+ /* allocate memory before freezing queue to prevent deadlock */
+ et = blk_mq_alloc_sched_tags(set, q->nr_hw_queues, nr);
+ if (!et) {
+ ret = -ENOMEM;
+ goto unlock;
+ }
+ }
+
memflags = blk_mq_freeze_queue(q);
mutex_lock(&q->elevator_lock);
-
- err = blk_mq_update_nr_requests(disk->queue, nr);
- if (err)
- ret = err;
-
+ et = blk_mq_update_nr_requests(q, et, nr);
mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
+ if (et)
+ blk_mq_free_sched_tags(et, set);
+
unlock:
up_write(&set->update_nr_hwq_lock);
return ret;
--
2.39.2
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH for-6.18/block 08/10] blk-mq: fix potential deadlock while nr_requests grown
2025-09-08 6:15 ` [PATCH for-6.18/block 08/10] blk-mq: fix potential deadlock while nr_requests grown Yu Kuai
@ 2025-09-09 6:39 ` Nilay Shroff
2025-09-09 7:37 ` Yu Kuai
2025-09-09 12:21 ` Nilay Shroff
2025-09-10 7:46 ` Yu Kuai
2 siblings, 1 reply; 37+ messages in thread
From: Nilay Shroff @ 2025-09-09 6:39 UTC (permalink / raw)
To: Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yi.zhang, yangerkun,
johnny.chenyi
On 9/8/25 11:45 AM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> Allocate and free sched_tags while queue is freezed can deadlock[1],
> this is a long term problem, hence allocate memory before freezing
> queue and free memory after queue is unfreezed.
>
> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
> Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs")
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
[...]
[...]
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 7ea15bf68b4b..a0a7ebad378f 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -64,11 +64,12 @@ static ssize_t queue_requests_show(struct gendisk *disk, char *page)
> static ssize_t
> queue_requests_store(struct gendisk *disk, const char *page, size_t count)
> {
> - unsigned long nr;
> - int ret, err;
> - unsigned int memflags;
> struct request_queue *q = disk->queue;
> struct blk_mq_tag_set *set = q->tag_set;
> + struct elevator_tags *et = NULL;
> + unsigned int memflags;
> + unsigned long nr;
> + int ret;
>
> ret = queue_var_store(&nr, page, count);
> if (ret < 0)
> @@ -90,16 +91,24 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
> goto unlock;
> }
>
> + if (q->elevator && nr > q->elevator->et->nr_requests) {
> + /* allocate memory before freezing queue to prevent deadlock */
> + et = blk_mq_alloc_sched_tags(set, q->nr_hw_queues, nr);
> + if (!et) {
> + ret = -ENOMEM;
> + goto unlock;
> + }
> + }
> +
I think we should add a comment above explaining why it is safe
to access q->elevator without holding ->elevator_lock.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH for-6.18/block 08/10] blk-mq: fix potential deadlock while nr_requests grown
2025-09-09 6:39 ` Nilay Shroff
@ 2025-09-09 7:37 ` Yu Kuai
2025-09-09 9:36 ` Nilay Shroff
0 siblings, 1 reply; 37+ messages in thread
From: Yu Kuai @ 2025-09-09 7:37 UTC (permalink / raw)
To: Nilay Shroff, Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi,
yukuai (C)
Hi,
在 2025/09/09 14:39, Nilay Shroff 写道:
>
>
> On 9/8/25 11:45 AM, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Allocate and free sched_tags while queue is freezed can deadlock[1],
>> this is a long term problem, hence allocate memory before freezing
>> queue and free memory after queue is unfreezed.
>>
>> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
>> Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs")
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> [...]
> [...]
>
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index 7ea15bf68b4b..a0a7ebad378f 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -64,11 +64,12 @@ static ssize_t queue_requests_show(struct gendisk *disk, char *page)
>> static ssize_t
>> queue_requests_store(struct gendisk *disk, const char *page, size_t count)
>> {
>> - unsigned long nr;
>> - int ret, err;
>> - unsigned int memflags;
>> struct request_queue *q = disk->queue;
>> struct blk_mq_tag_set *set = q->tag_set;
>> + struct elevator_tags *et = NULL;
>> + unsigned int memflags;
>> + unsigned long nr;
>> + int ret;
>>
>> ret = queue_var_store(&nr, page, count);
>> if (ret < 0)
>> @@ -90,16 +91,24 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
>> goto unlock;
>> }
>>
>> + if (q->elevator && nr > q->elevator->et->nr_requests) {
>> + /* allocate memory before freezing queue to prevent deadlock */
>> + et = blk_mq_alloc_sched_tags(set, q->nr_hw_queues, nr);
>> + if (!et) {
>> + ret = -ENOMEM;
>> + goto unlock;
>> + }
>> + }
>> +
> I think we should add a comment above explaining why is it safe
> to access q->elevator without holding ->elevator_lock.
>
I have already accessed q->elevator to check the input nr since patch 4, and
that's why I added a comment there explaining that switching the elevator is
serialized. Is this enough?
Thanks,
Kuai
> Thanks,
> --Nilay
> .
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH for-6.18/block 08/10] blk-mq: fix potential deadlock while nr_requests grown
2025-09-09 7:37 ` Yu Kuai
@ 2025-09-09 9:36 ` Nilay Shroff
0 siblings, 0 replies; 37+ messages in thread
From: Nilay Shroff @ 2025-09-09 9:36 UTC (permalink / raw)
To: Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi,
yukuai (C)
On 9/9/25 1:07 PM, Yu Kuai wrote:
> Hi,
>
> 在 2025/09/09 14:39, Nilay Shroff 写道:
>>
>>
>> On 9/8/25 11:45 AM, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> Allocate and free sched_tags while queue is freezed can deadlock[1],
>>> this is a long term problem, hence allocate memory before freezing
>>> queue and free memory after queue is unfreezed.
>>>
>>> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
>>> Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs")
>>>
>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> [...]
>> [...]
>>
>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>> index 7ea15bf68b4b..a0a7ebad378f 100644
>>> --- a/block/blk-sysfs.c
>>> +++ b/block/blk-sysfs.c
>>> @@ -64,11 +64,12 @@ static ssize_t queue_requests_show(struct gendisk *disk, char *page)
>>> static ssize_t
>>> queue_requests_store(struct gendisk *disk, const char *page, size_t count)
>>> {
>>> - unsigned long nr;
>>> - int ret, err;
>>> - unsigned int memflags;
>>> struct request_queue *q = disk->queue;
>>> struct blk_mq_tag_set *set = q->tag_set;
>>> + struct elevator_tags *et = NULL;
>>> + unsigned int memflags;
>>> + unsigned long nr;
>>> + int ret;
>>> ret = queue_var_store(&nr, page, count);
>>> if (ret < 0)
>>> @@ -90,16 +91,24 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
>>> goto unlock;
>>> }
>>> + if (q->elevator && nr > q->elevator->et->nr_requests) {
>>> + /* allocate memory before freezing queue to prevent deadlock */
>>> + et = blk_mq_alloc_sched_tags(set, q->nr_hw_queues, nr);
>>> + if (!et) {
>>> + ret = -ENOMEM;
>>> + goto unlock;
>>> + }
>>> + }
>>> +
>> I think we should add a comment above explaining why is it safe
>> to access q->elevator without holding ->elevator_lock.
>>
>
> I already access q->elevator to check input nr from patch 4, and that's
> why I add comments to explain switching elevator is serialized, is this
> enough?
>
Yes, in patch 04/10 you moved the ->elevator_lock acquisition after the
usual sanity checks. However, when we run those sanity checks, or the code
in this patch where we have to access q->elevator, it's good to add a
comment here in the code (not just in the commit message).
For reference, you may check blk_mq_alloc_sched_tags_batch; I think a
similar comment may be added here as well.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH for-6.18/block 08/10] blk-mq: fix potential deadlock while nr_requests grown
2025-09-08 6:15 ` [PATCH for-6.18/block 08/10] blk-mq: fix potential deadlock while nr_requests grown Yu Kuai
2025-09-09 6:39 ` Nilay Shroff
@ 2025-09-09 12:21 ` Nilay Shroff
2025-09-10 7:46 ` Yu Kuai
2 siblings, 0 replies; 37+ messages in thread
From: Nilay Shroff @ 2025-09-09 12:21 UTC (permalink / raw)
To: Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yi.zhang, yangerkun,
johnny.chenyi
On 9/8/25 11:45 AM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> Allocate and free sched_tags while queue is freezed can deadlock[1],
> this is a long term problem, hence allocate memory before freezing
> queue and free memory after queue is unfreezed.
>
> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
> Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs")
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Looks good to me:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH for-6.18/block 08/10] blk-mq: fix potential deadlock while nr_requests grown
2025-09-08 6:15 ` [PATCH for-6.18/block 08/10] blk-mq: fix potential deadlock while nr_requests grown Yu Kuai
2025-09-09 6:39 ` Nilay Shroff
2025-09-09 12:21 ` Nilay Shroff
@ 2025-09-10 7:46 ` Yu Kuai
2 siblings, 0 replies; 37+ messages in thread
From: Yu Kuai @ 2025-09-10 7:46 UTC (permalink / raw)
To: Yu Kuai, nilay, ming.lei, axboe
Cc: linux-block, linux-kernel, yi.zhang, yangerkun, johnny.chenyi,
yukuai (C)
Hi,
在 2025/09/08 14:15, Yu Kuai 写道:
> @@ -90,16 +91,24 @@ queue_requests_store(struct gendisk *disk, const char *page, size_t count)
> goto unlock;
> }
>
> + if (q->elevator && nr > q->elevator->et->nr_requests) {
While rebasing v2, I found that I should also add a non-shared check
here, because in blk_mq_alloc_sched_tags(), et->nr_requests is not set
to MAX_SCHED_RQ, and we don't want to allocate memory for the shared case.
I'll fix this as well in v2, roughly along the lines of the sketch below.
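Just a sketch of what I have in mind (the exact form may differ in v2):

	if (!blk_mq_is_shared_tags(set->flags) && q->elevator &&
	    nr > q->elevator->et->nr_requests) {
		/* allocate memory before freezing queue to prevent deadlock */
		et = blk_mq_alloc_sched_tags(set, q->nr_hw_queues, nr);
		if (!et) {
			ret = -ENOMEM;
			goto unlock;
		}
	}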
Thanks,
Kuai
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH for-6.18/block 09/10] blk-mq: remove blk_mq_tag_update_depth()
2025-09-08 6:15 [PATCH for-6.18/block 00/10] cleanup and fixes for updating nr_requests Yu Kuai
` (7 preceding siblings ...)
2025-09-08 6:15 ` [PATCH for-6.18/block 08/10] blk-mq: fix potential deadlock while nr_requests grown Yu Kuai
@ 2025-09-08 6:15 ` Yu Kuai
2025-09-09 12:35 ` Nilay Shroff
2025-09-08 6:15 ` [PATCH for-6.18/block 10/10] blk-mq: fix stale nr_requests documentation Yu Kuai
9 siblings, 1 reply; 37+ messages in thread
From: Yu Kuai @ 2025-09-08 6:15 UTC (permalink / raw)
To: nilay, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
This helper is not used now.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-mq-tag.c | 32 --------------------------------
block/blk-mq.h | 2 --
2 files changed, 34 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index aed84c5d5c2b..cbb93a653cef 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -583,38 +583,6 @@ void blk_mq_free_tags(struct blk_mq_tags *tags)
kfree(tags);
}
-int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
- struct blk_mq_tags **tagsptr, unsigned int tdepth)
-{
- struct blk_mq_tags *tags = *tagsptr;
-
- /*
- * If we are allowed to grow beyond the original size, allocate
- * a new set of tags before freeing the old one.
- */
- if (tdepth > tags->nr_tags) {
- struct blk_mq_tag_set *set = hctx->queue->tag_set;
- struct blk_mq_tags *new;
-
- new = blk_mq_alloc_map_and_rqs(set, hctx->queue_num, tdepth);
- if (!new)
- return -ENOMEM;
-
- blk_mq_free_map_and_rqs(set, *tagsptr, hctx->queue_num);
- hctx->queue->elevator->et->tags[hctx->queue_num] = new;
- *tagsptr = new;
- } else {
- /*
- * Don't need (or can't) update reserved tags here, they
- * remain static and should never need resizing.
- */
- sbitmap_queue_resize(&tags->bitmap_tags,
- tdepth - tags->nr_reserved_tags);
- }
-
- return 0;
-}
-
void blk_mq_tag_resize_shared_tags(struct blk_mq_tag_set *set, unsigned int size)
{
struct blk_mq_tags *tags = set->shared_tags;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 6c9d03625ba1..af09eb617d11 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -184,8 +184,6 @@ unsigned long blk_mq_get_tags(struct blk_mq_alloc_data *data, int nr_tags,
void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
unsigned int tag);
void blk_mq_put_tags(struct blk_mq_tags *tags, int *tag_array, int nr_tags);
-int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
- struct blk_mq_tags **tags, unsigned int depth);
void blk_mq_tag_resize_shared_tags(struct blk_mq_tag_set *set,
unsigned int size);
void blk_mq_tag_update_sched_shared_tags(struct request_queue *q);
--
2.39.2
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH for-6.18/block 10/10] blk-mq: fix stale nr_requests documentation
2025-09-08 6:15 [PATCH for-6.18/block 00/10] cleanup and fixes for updating nr_requests Yu Kuai
` (8 preceding siblings ...)
2025-09-08 6:15 ` [PATCH for-6.18/block 09/10] blk-mq: remove blk_mq_tag_update_depth() Yu Kuai
@ 2025-09-08 6:15 ` Yu Kuai
2025-09-09 12:35 ` Nilay Shroff
9 siblings, 1 reply; 37+ messages in thread
From: Yu Kuai @ 2025-09-08 6:15 UTC (permalink / raw)
To: nilay, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yukuai1, yi.zhang, yangerkun,
johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
The nr_requests documentation still describes the removed single-queue
code; remove that text and update it for current blk-mq.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
Documentation/ABI/stable/sysfs-block | 14 ++++----------
1 file changed, 4 insertions(+), 10 deletions(-)
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 0ddffc9133d0..0ed10aeff86b 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -603,16 +603,10 @@ Date: July 2003
Contact: linux-block@vger.kernel.org
Description:
[RW] This controls how many requests may be allocated in the
- block layer for read or write requests. Note that the total
- allocated number may be twice this amount, since it applies only
- to reads or writes (not the accumulated sum).
-
- To avoid priority inversion through request starvation, a
- request queue maintains a separate request pool per each cgroup
- when CONFIG_BLK_CGROUP is enabled, and this parameter applies to
- each such per-block-cgroup request pool. IOW, if there are N
- block cgroups, each request queue may have up to N request
- pools, each independently regulated by nr_requests.
+ block layer. Noted this value only represents the quantity for a
+ single blk_mq_tags instance. The actual number for the entire
+ device depends on the hardware queue count, whether elevator is
+ enabled, and whether tags are shared.
What: /sys/block/<disk>/queue/nr_zones
--
2.39.2
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH for-6.18/block 10/10] blk-mq: fix stale nr_requests documentation
2025-09-08 6:15 ` [PATCH for-6.18/block 10/10] blk-mq: fix stale nr_requests documentation Yu Kuai
@ 2025-09-09 12:35 ` Nilay Shroff
0 siblings, 0 replies; 37+ messages in thread
From: Nilay Shroff @ 2025-09-09 12:35 UTC (permalink / raw)
To: Yu Kuai, ming.lei, axboe
Cc: linux-block, linux-kernel, yukuai3, yi.zhang, yangerkun,
johnny.chenyi
On 9/8/25 11:45 AM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> The nr_requests documentation is still the removed single queue, remove
> it and update to current blk-mq.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Looks good to me:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
^ permalink raw reply [flat|nested] 37+ messages in thread