public inbox for linux-block@vger.kernel.org
* [PATCHv2] block: restore two stage elevator switch while running nr_hw_queue update
@ 2025-07-18 13:32 Nilay Shroff
  2025-07-20 12:19 ` Ming Lei
  2025-07-22  2:21 ` Yu Kuai
  0 siblings, 2 replies; 10+ messages in thread
From: Nilay Shroff @ 2025-07-18 13:32 UTC (permalink / raw)
  To: linux-block
  Cc: yi.zhang, ming.lei, hch, yukuai1, axboe, shinichiro.kawasaki,
	yukuai3, gjoyce

The kmemleak reports memory leaks related to elevator resources that
were originally allocated in the ->init_hctx() method. The following
leak traces are observed after running blktests:

unreferenced object 0xffff8881b82f7400 (size 512):
  comm "check", pid 68454, jiffies 4310588881
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace (crc 5bac8b34):
    __kvmalloc_node_noprof+0x55d/0x7a0
    sbitmap_init_node+0x15a/0x6a0
    kyber_init_hctx+0x316/0xb90
    blk_mq_init_sched+0x419/0x580
    elevator_switch+0x18b/0x630
    elv_update_nr_hw_queues+0x219/0x2c0
    __blk_mq_update_nr_hw_queues+0x36a/0x6f0
    blk_mq_update_nr_hw_queues+0x3a/0x60
    0xffffffffc09ceb80
    0xffffffffc09d7e0b
    configfs_write_iter+0x2b1/0x470
    vfs_write+0x527/0xe70
    ksys_write+0xff/0x200
    do_syscall_64+0x98/0x3c0
    entry_SYSCALL_64_after_hwframe+0x76/0x7e
unreferenced object 0xffff8881b82f6000 (size 512):
  comm "check", pid 68454, jiffies 4310588881
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace (crc 5bac8b34):
    __kvmalloc_node_noprof+0x55d/0x7a0
    sbitmap_init_node+0x15a/0x6a0
    kyber_init_hctx+0x316/0xb90
    blk_mq_init_sched+0x419/0x580
    elevator_switch+0x18b/0x630
    elv_update_nr_hw_queues+0x219/0x2c0
    __blk_mq_update_nr_hw_queues+0x36a/0x6f0
    blk_mq_update_nr_hw_queues+0x3a/0x60
    0xffffffffc09ceb80
    0xffffffffc09d7e0b
    configfs_write_iter+0x2b1/0x470
    vfs_write+0x527/0xe70
    ksys_write+0xff/0x200
    do_syscall_64+0x98/0x3c0
    entry_SYSCALL_64_after_hwframe+0x76/0x7e
unreferenced object 0xffff8881b82f5800 (size 512):
  comm "check", pid 68454, jiffies 4310588881
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace (crc 5bac8b34):
    __kvmalloc_node_noprof+0x55d/0x7a0
    sbitmap_init_node+0x15a/0x6a0
    kyber_init_hctx+0x316/0xb90
    blk_mq_init_sched+0x419/0x580
    elevator_switch+0x18b/0x630
    elv_update_nr_hw_queues+0x219/0x2c0
    __blk_mq_update_nr_hw_queues+0x36a/0x6f0
    blk_mq_update_nr_hw_queues+0x3a/0x60
    0xffffffffc09ceb80
    0xffffffffc09d7e0b
    configfs_write_iter+0x2b1/0x470
    vfs_write+0x527/0xe70
    ksys_write+0xff/0x200
    do_syscall_64+0x98/0x3c0
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

The issue arises while running an nr_hw_queue update. Specifically, we first
reallocate hardware contexts (hctx) via __blk_mq_realloc_hw_ctxs(), and
then later invoke elevator_switch() (assuming q->elevator is not NULL).
The elevator switch code first exits the old elevator (elevator_exit())
and then switches to the new one. elevator_exit() loops through each
hctx and invokes the elevator's per-hctx exit method ->exit_hctx(),
which releases the resources allocated during ->init_hctx().

This memleak manifests when we reduce the number of h/w queues - for
example, when the initial update sets the number of queues to X, and a
later update reduces it to Y, where Y < X. In this case, by the time we
reach the elevator exit code, we'd have lost access to the old hctxs,
because __blk_mq_realloc_hw_ctxs() would have already released them. As
no reference to the old hctxs is left, there is no way to free the
scheduler resources (which were allocated in ->init_hctx()), and
kmemleak complains about it.

This issue was introduced by commit 596dce110b7d ("block: simplify
elevator reattachment for updating nr_hw_queues"). That change unified
the two-stage elevator teardown and reattachment into a single call that
occurs after __blk_mq_realloc_hw_ctxs() has already freed the hctxs.

This patch restores the previous two-stage elevator switch logic during
nr_hw_queues updates. First, the elevator is switched to 'none', which
ensures all scheduler resources are properly freed. Then, the hardware
contexts (hctxs) are reallocated, and the software-to-hardware queue
mappings are updated. Finally, the original elevator is reattached. This
sequence prevents loss of references to old hctxs and avoids the scheduler
resource leaks reported by kmemleak.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: 596dce110b7d ("block: simplify elevator reattachment for updating nr_hw_queues")
Closes: https://lore.kernel.org/all/CAHj4cs8oJFvz=daCvjHM5dYCNQH4UXwSySPPU4v-WHce_kZXZA@mail.gmail.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
Changes from v1:
    - Updated commit message with kmemleak trace generated using null-blk
      (Yi Zhang)
    - The elevator module could be removed while nr_hw_queue update is
      running, so protect elevator switch using elevator_get() and 
      elevator_put() (Ming Lei)
    - Invoke elv_update_nr_hw_queues() from blk_mq_elv_switch_back() and 
      that way avoid elevator code restructuring in a patch which fixes
      a regression. (Ming Lei)

---
 block/blk-mq.c   | 86 +++++++++++++++++++++++++++++++++++++++++++-----
 block/blk.h      |  2 +-
 block/elevator.c |  6 ++--
 3 files changed, 81 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4806b867e37d..fa25d6d36790 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4966,6 +4966,62 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 	return ret;
 }
 
+/*
+ * Switch back to the elevator type stored in the xarray.
+ */
+static void blk_mq_elv_switch_back(struct request_queue *q,
+		struct xarray *elv_tbl)
+{
+	struct elevator_type *e = xa_load(elv_tbl, q->id);
+
+	/* The elv_update_nr_hw_queues unfreezes the queue. */
+	elv_update_nr_hw_queues(q, e);
+
+	/* Drop the reference acquired in blk_mq_elv_switch_none. */
+	if (e)
+		elevator_put(e);
+}
+
+/*
+ * Stores the elevator type in the xarray and sets the current elevator to
+ * 'none'. It uses q->id as the index into the xarray.
+ */
+static int blk_mq_elv_switch_none(struct request_queue *q,
+		struct xarray *elv_tbl)
+{
+	int ret = 0;
+
+	lockdep_assert_held_write(&q->tag_set->update_nr_hwq_lock);
+
+	/*
+	 * Accessing q->elevator without holding q->elevator_lock is safe here
+	 * because we're called from nr_hw_queue update which is protected by
+	 * set->update_nr_hwq_lock in the writer context. So, scheduler update/
+	 * switch code (which acquires the same lock in the reader context)
+	 * can't run concurrently.
+	 */
+	if (q->elevator) {
+
+		ret = xa_insert(elv_tbl, q->id, q->elevator->type, GFP_KERNEL);
+		if (ret) {
+			WARN_ON_ONCE(1);
+			goto out;
+		}
+		/*
+		 * Before we switch the elevator to 'none', take a reference
+		 * to the elevator module so that no one can remove the
+		 * module while the nr_hw_queue update is running. We drop
+		 * this reference later when we switch back to the original
+		 * elevator.
+		 */
+		__elevator_get(q->elevator->type);
+
+		elevator_set_none(q);
+	}
+out:
+	return ret;
+}
+
 static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 							int nr_hw_queues)
 {
@@ -4973,6 +5029,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	int prev_nr_hw_queues = set->nr_hw_queues;
 	unsigned int memflags;
 	int i;
+	struct xarray elv_tbl;
 
 	lockdep_assert_held(&set->tag_list_lock);
 
@@ -4984,6 +5041,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		return;
 
 	memflags = memalloc_noio_save();
+
+	xa_init(&elv_tbl);
+
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_debugfs_unregister_hctxs(q);
 		blk_mq_sysfs_unregister_hctxs(q);
@@ -4992,11 +5052,17 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
 		blk_mq_freeze_queue_nomemsave(q);
 
-	if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) {
-		list_for_each_entry(q, &set->tag_list, tag_set_list)
-			blk_mq_unfreeze_queue_nomemrestore(q);
-		goto reregister;
-	}
+	/*
+	 * Switch IO scheduler to 'none', cleaning up the data associated
+	 * with the previous scheduler. We will switch back once we are done
+	 * updating the new sw to hw queue mappings.
+	 */
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		if (blk_mq_elv_switch_none(q, &elv_tbl))
+			goto switch_back;
+
+	if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
+		goto switch_back;
 
 fallback:
 	blk_mq_update_queue_map(set);
@@ -5016,12 +5082,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		}
 		blk_mq_map_swqueue(q);
 	}
-
-	/* elv_update_nr_hw_queues() unfreeze queue for us */
+switch_back:
+	/* blk_mq_elv_switch_back() unfreezes the queue for us. */
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		elv_update_nr_hw_queues(q);
+		blk_mq_elv_switch_back(q, &elv_tbl);
 
-reregister:
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_sysfs_register_hctxs(q);
 		blk_mq_debugfs_register_hctxs(q);
@@ -5029,6 +5094,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		blk_mq_remove_hw_queues_cpuhp(q);
 		blk_mq_add_hw_queues_cpuhp(q);
 	}
+
+	xa_destroy(&elv_tbl);
+
 	memalloc_noio_restore(memflags);
 
 	/* Free the excess tags when nr_hw_queues shrink. */
diff --git a/block/blk.h b/block/blk.h
index 37ec459fe656..fae7653a941f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -321,7 +321,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
 
 bool blk_insert_flush(struct request *rq);
 
-void elv_update_nr_hw_queues(struct request_queue *q);
+void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_type *e);
 void elevator_set_default(struct request_queue *q);
 void elevator_set_none(struct request_queue *q);
 
diff --git a/block/elevator.c b/block/elevator.c
index ab22542e6cf0..83d0bfb90a03 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -689,7 +689,7 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
  * The I/O scheduler depends on the number of hardware queues, this forces a
  * reattachment when nr_hw_queues changes.
  */
-void elv_update_nr_hw_queues(struct request_queue *q)
+void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_type *e)
 {
 	struct elv_change_ctx ctx = {};
 	int ret = -ENODEV;
@@ -697,8 +697,8 @@ void elv_update_nr_hw_queues(struct request_queue *q)
 	WARN_ON_ONCE(q->mq_freeze_depth == 0);
 
 	mutex_lock(&q->elevator_lock);
-	if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
-		ctx.name = q->elevator->type->elevator_name;
+	if (e && !blk_queue_dying(q) && blk_queue_registered(q)) {
+		ctx.name = e->elevator_name;
 
 		/* force to reattach elevator after nr_hw_queue is updated */
 		ret = elevator_switch(q, &ctx);
-- 
2.50.1



* Re: [PATCHv2] block: restore two stage elevator switch while running nr_hw_queue update
  2025-07-18 13:32 [PATCHv2] block: restore two stage elevator switch while running nr_hw_queue update Nilay Shroff
@ 2025-07-20 12:19 ` Ming Lei
  2025-07-20 14:25   ` Nilay Shroff
  2025-07-22  2:21 ` Yu Kuai
  1 sibling, 1 reply; 10+ messages in thread
From: Ming Lei @ 2025-07-20 12:19 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: linux-block, yi.zhang, hch, yukuai1, axboe, shinichiro.kawasaki,
	yukuai3, gjoyce

On Fri, Jul 18, 2025 at 07:02:09PM +0530, Nilay Shroff wrote:
> The kmemleak reports memory leaks related to elevator resources that
> were originally allocated in the ->init_hctx() method. The following
> leak traces are observed after running blktests:
> 
> unreferenced object 0xffff8881b82f7400 (size 512):
>   comm "check", pid 68454, jiffies 4310588881
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace (crc 5bac8b34):
>     __kvmalloc_node_noprof+0x55d/0x7a0
>     sbitmap_init_node+0x15a/0x6a0
>     kyber_init_hctx+0x316/0xb90
>     blk_mq_init_sched+0x419/0x580
>     elevator_switch+0x18b/0x630
>     elv_update_nr_hw_queues+0x219/0x2c0
>     __blk_mq_update_nr_hw_queues+0x36a/0x6f0
>     blk_mq_update_nr_hw_queues+0x3a/0x60
>     0xffffffffc09ceb80
>     0xffffffffc09d7e0b
>     configfs_write_iter+0x2b1/0x470
>     vfs_write+0x527/0xe70
>     ksys_write+0xff/0x200
>     do_syscall_64+0x98/0x3c0
>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
> unreferenced object 0xffff8881b82f6000 (size 512):
>   comm "check", pid 68454, jiffies 4310588881
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace (crc 5bac8b34):
>     __kvmalloc_node_noprof+0x55d/0x7a0
>     sbitmap_init_node+0x15a/0x6a0
>     kyber_init_hctx+0x316/0xb90
>     blk_mq_init_sched+0x419/0x580
>     elevator_switch+0x18b/0x630
>     elv_update_nr_hw_queues+0x219/0x2c0
>     __blk_mq_update_nr_hw_queues+0x36a/0x6f0
>     blk_mq_update_nr_hw_queues+0x3a/0x60
>     0xffffffffc09ceb80
>     0xffffffffc09d7e0b
>     configfs_write_iter+0x2b1/0x470
>     vfs_write+0x527/0xe70
>     ksys_write+0xff/0x200
>     do_syscall_64+0x98/0x3c0
>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
> unreferenced object 0xffff8881b82f5800 (size 512):
>   comm "check", pid 68454, jiffies 4310588881
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace (crc 5bac8b34):
>     __kvmalloc_node_noprof+0x55d/0x7a0
>     sbitmap_init_node+0x15a/0x6a0
>     kyber_init_hctx+0x316/0xb90
>     blk_mq_init_sched+0x419/0x580
>     elevator_switch+0x18b/0x630
>     elv_update_nr_hw_queues+0x219/0x2c0
>     __blk_mq_update_nr_hw_queues+0x36a/0x6f0
>     blk_mq_update_nr_hw_queues+0x3a/0x60
>     0xffffffffc09ceb80
>     0xffffffffc09d7e0b
>     configfs_write_iter+0x2b1/0x470
>     vfs_write+0x527/0xe70
>     ksys_write+0xff/0x200
>     do_syscall_64+0x98/0x3c0
>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> The issue arises while running an nr_hw_queue update. Specifically, we first
> reallocate hardware contexts (hctx) via __blk_mq_realloc_hw_ctxs(), and
> then later invoke elevator_switch() (assuming q->elevator is not NULL).
> The elevator switch code first exits the old elevator (elevator_exit())
> and then switches to the new one. elevator_exit() loops through each
> hctx and invokes the elevator's per-hctx exit method ->exit_hctx(),
> which releases the resources allocated during ->init_hctx().
> 
> This memleak manifests when we reduce the number of h/w queues - for
> example, when the initial update sets the number of queues to X, and a
> later update reduces it to Y, where Y < X. In this case, by the time we
> reach the elevator exit code, we'd have lost access to the old hctxs,
> because __blk_mq_realloc_hw_ctxs() would have already released them. As
> no reference to the old hctxs is left, there is no way to free the
> scheduler resources (which were allocated in ->init_hctx()), and
> kmemleak complains about it.
> 
> This issue was introduced by commit 596dce110b7d ("block: simplify
> elevator reattachment for updating nr_hw_queues"). That change unified
> the two-stage elevator teardown and reattachment into a single call that
> occurs after __blk_mq_realloc_hw_ctxs() has already freed the hctxs.
> 
> This patch restores the previous two-stage elevator switch logic during
> nr_hw_queues updates. First, the elevator is switched to 'none', which
> ensures all scheduler resources are properly freed. Then, the hardware
> contexts (hctxs) are reallocated, and the software-to-hardware queue
> mappings are updated. Finally, the original elevator is reattached. This
> sequence prevents loss of references to old hctxs and avoids the scheduler
> resource leaks reported by kmemleak.
> 
> Reported-by: Yi Zhang <yi.zhang@redhat.com>
> Fixes: 596dce110b7d ("block: simplify elevator reattachment for updating nr_hw_queues")
> Closes: https://lore.kernel.org/all/CAHj4cs8oJFvz=daCvjHM5dYCNQH4UXwSySPPU4v-WHce_kZXZA@mail.gmail.com/
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> Changes from v1:
>     - Updated commit message with kmemleak trace generated using null-blk
>       (Yi Zhang)
>     - The elevator module could be removed while nr_hw_queue update is
>       running, so protect elevator switch using elevator_get() and 
>       elevator_put() (Ming Lei)
>     - Invoke elv_update_nr_hw_queues() from blk_mq_elv_switch_back() and 
>       that way avoid elevator code restructuring in a patch which fixes
>       a regression. (Ming Lei)
> 
> ---
>  block/blk-mq.c   | 86 +++++++++++++++++++++++++++++++++++++++++++-----
>  block/blk.h      |  2 +-
>  block/elevator.c |  6 ++--
>  3 files changed, 81 insertions(+), 13 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4806b867e37d..fa25d6d36790 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4966,6 +4966,62 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>  	return ret;
>  }
>  
> +/*
> + * Switch back to the elevator type stored in the xarray.
> + */
> +static void blk_mq_elv_switch_back(struct request_queue *q,
> +		struct xarray *elv_tbl)
> +{
> +	struct elevator_type *e = xa_load(elv_tbl, q->id);
> +
> +	/* The elv_update_nr_hw_queues unfreezes the queue. */
> +	elv_update_nr_hw_queues(q, e);
> +
> +	/* Drop the reference acquired in blk_mq_elv_switch_none. */
> +	if (e)
> +		elevator_put(e);
> +}
> +
> +/*
> + * Stores elevator type in xarray and set current elevator to none. It uses
> + * q->id as an index to store the elevator type into the xarray.
> + */
> +static int blk_mq_elv_switch_none(struct request_queue *q,
> +		struct xarray *elv_tbl)
> +{
> +	int ret = 0;
> +
> +	lockdep_assert_held_write(&q->tag_set->update_nr_hwq_lock);
> +
> +	/*
> +	 * Accessing q->elevator without holding q->elevator_lock is safe here
> +	 * because we're called from nr_hw_queue update which is protected by
> +	 * set->update_nr_hwq_lock in the writer context. So, scheduler update/
> +	 * switch code (which acquires the same lock in the reader context)
> +	 * can't run concurrently.
> +	 */
> +	if (q->elevator) {
> +
> +		ret = xa_insert(elv_tbl, q->id, q->elevator->type, GFP_KERNEL);
> +		if (ret) {
> +			WARN_ON_ONCE(1);
> +			goto out;
> +		}
> +		/*
> +		 * Before we switch elevator to 'none', take a reference to
> +		 * the elevator module so that while nr_hw_queue update is
> +		 * running, no one can remove elevator module. We'd put the
> +		 * reference to elevator module later when we switch back
> +		 * elevator.
> +		 */
> +		__elevator_get(q->elevator->type);
> +
> +		elevator_set_none(q);
> +	}
> +out:
> +	return ret;
> +}
> +
>  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>  							int nr_hw_queues)
>  {
> @@ -4973,6 +5029,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>  	int prev_nr_hw_queues = set->nr_hw_queues;
>  	unsigned int memflags;
>  	int i;
> +	struct xarray elv_tbl;
>  
>  	lockdep_assert_held(&set->tag_list_lock);
>  
> @@ -4984,6 +5041,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>  		return;
>  
>  	memflags = memalloc_noio_save();
> +
> +	xa_init(&elv_tbl);
> +
>  	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>  		blk_mq_debugfs_unregister_hctxs(q);
>  		blk_mq_sysfs_unregister_hctxs(q);
> @@ -4992,11 +5052,17 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>  	list_for_each_entry(q, &set->tag_list, tag_set_list)
>  		blk_mq_freeze_queue_nomemsave(q);
>  
> -	if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) {
> -		list_for_each_entry(q, &set->tag_list, tag_set_list)
> -			blk_mq_unfreeze_queue_nomemrestore(q);
> -		goto reregister;
> -	}
> +	/*
> +	 * Switch IO scheduler to 'none', cleaning up the data associated
> +	 * with the previous scheduler. We will switch back once we are done
> +	 * updating the new sw to hw queue mappings.
> +	 */
> +	list_for_each_entry(q, &set->tag_list, tag_set_list)
> +		if (blk_mq_elv_switch_none(q, &elv_tbl))
> +			goto switch_back;

The partial `switch_back` needs to be handled carefully. If switching to
none fails for one queue, the remaining queues will still be switched
back, no matter whether their previous scheduler was none or not. Maybe
the return type of blk_mq_elv_switch_none() can simply be made 'void' to
avoid the trouble.

Otherwise, this patch looks fine.


Thanks,
Ming



* Re: [PATCHv2] block: restore two stage elevator switch while running nr_hw_queue update
  2025-07-20 12:19 ` Ming Lei
@ 2025-07-20 14:25   ` Nilay Shroff
  0 siblings, 0 replies; 10+ messages in thread
From: Nilay Shroff @ 2025-07-20 14:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, yi.zhang, hch, yukuai1, axboe, shinichiro.kawasaki,
	yukuai3, gjoyce



On 7/20/25 5:49 PM, Ming Lei wrote:
> On Fri, Jul 18, 2025 at 07:02:09PM +0530, Nilay Shroff wrote:
>> The kmemleak reports memory leaks related to elevator resources that
>> were originally allocated in the ->init_hctx() method. The following
>> leak traces are observed after running blktests:
>>
>> unreferenced object 0xffff8881b82f7400 (size 512):
>>   comm "check", pid 68454, jiffies 4310588881
>>   hex dump (first 32 bytes):
>>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>>   backtrace (crc 5bac8b34):
>>     __kvmalloc_node_noprof+0x55d/0x7a0
>>     sbitmap_init_node+0x15a/0x6a0
>>     kyber_init_hctx+0x316/0xb90
>>     blk_mq_init_sched+0x419/0x580
>>     elevator_switch+0x18b/0x630
>>     elv_update_nr_hw_queues+0x219/0x2c0
>>     __blk_mq_update_nr_hw_queues+0x36a/0x6f0
>>     blk_mq_update_nr_hw_queues+0x3a/0x60
>>     0xffffffffc09ceb80
>>     0xffffffffc09d7e0b
>>     configfs_write_iter+0x2b1/0x470
>>     vfs_write+0x527/0xe70
>>     ksys_write+0xff/0x200
>>     do_syscall_64+0x98/0x3c0
>>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> unreferenced object 0xffff8881b82f6000 (size 512):
>>   comm "check", pid 68454, jiffies 4310588881
>>   hex dump (first 32 bytes):
>>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>>   backtrace (crc 5bac8b34):
>>     __kvmalloc_node_noprof+0x55d/0x7a0
>>     sbitmap_init_node+0x15a/0x6a0
>>     kyber_init_hctx+0x316/0xb90
>>     blk_mq_init_sched+0x419/0x580
>>     elevator_switch+0x18b/0x630
>>     elv_update_nr_hw_queues+0x219/0x2c0
>>     __blk_mq_update_nr_hw_queues+0x36a/0x6f0
>>     blk_mq_update_nr_hw_queues+0x3a/0x60
>>     0xffffffffc09ceb80
>>     0xffffffffc09d7e0b
>>     configfs_write_iter+0x2b1/0x470
>>     vfs_write+0x527/0xe70
>>     ksys_write+0xff/0x200
>>     do_syscall_64+0x98/0x3c0
>>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> unreferenced object 0xffff8881b82f5800 (size 512):
>>   comm "check", pid 68454, jiffies 4310588881
>>   hex dump (first 32 bytes):
>>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>>   backtrace (crc 5bac8b34):
>>     __kvmalloc_node_noprof+0x55d/0x7a0
>>     sbitmap_init_node+0x15a/0x6a0
>>     kyber_init_hctx+0x316/0xb90
>>     blk_mq_init_sched+0x419/0x580
>>     elevator_switch+0x18b/0x630
>>     elv_update_nr_hw_queues+0x219/0x2c0
>>     __blk_mq_update_nr_hw_queues+0x36a/0x6f0
>>     blk_mq_update_nr_hw_queues+0x3a/0x60
>>     0xffffffffc09ceb80
>>     0xffffffffc09d7e0b
>>     configfs_write_iter+0x2b1/0x470
>>     vfs_write+0x527/0xe70
>>     ksys_write+0xff/0x200
>>     do_syscall_64+0x98/0x3c0
>>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> The issue arises while running an nr_hw_queue update. Specifically, we first
>> reallocate hardware contexts (hctx) via __blk_mq_realloc_hw_ctxs(), and
>> then later invoke elevator_switch() (assuming q->elevator is not NULL).
>> The elevator switch code first exits the old elevator (elevator_exit())
>> and then switches to the new one. elevator_exit() loops through each
>> hctx and invokes the elevator's per-hctx exit method ->exit_hctx(),
>> which releases the resources allocated during ->init_hctx().
>>
>> This memleak manifests when we reduce the number of h/w queues - for
>> example, when the initial update sets the number of queues to X, and a
>> later update reduces it to Y, where Y < X. In this case, by the time we
>> reach the elevator exit code, we'd have lost access to the old hctxs,
>> because __blk_mq_realloc_hw_ctxs() would have already released them. As
>> no reference to the old hctxs is left, there is no way to free the
>> scheduler resources (which were allocated in ->init_hctx()), and
>> kmemleak complains about it.
>>
>> This issue was introduced by commit 596dce110b7d ("block: simplify
>> elevator reattachment for updating nr_hw_queues"). That change unified
>> the two-stage elevator teardown and reattachment into a single call that
>> occurs after __blk_mq_realloc_hw_ctxs() has already freed the hctxs.
>>
>> This patch restores the previous two-stage elevator switch logic during
>> nr_hw_queues updates. First, the elevator is switched to 'none', which
>> ensures all scheduler resources are properly freed. Then, the hardware
>> contexts (hctxs) are reallocated, and the software-to-hardware queue
>> mappings are updated. Finally, the original elevator is reattached. This
>> sequence prevents loss of references to old hctxs and avoids the scheduler
>> resource leaks reported by kmemleak.
>>
>> Reported-by: Yi Zhang <yi.zhang@redhat.com>
>> Fixes: 596dce110b7d ("block: simplify elevator reattachment for updating nr_hw_queues")
>> Closes: https://lore.kernel.org/all/CAHj4cs8oJFvz=daCvjHM5dYCNQH4UXwSySPPU4v-WHce_kZXZA@mail.gmail.com/
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>> ---
>> Changes from v1:
>>     - Updated commit message with kmemleak trace generated using null-blk
>>       (Yi Zhang)
>>     - The elevator module could be removed while nr_hw_queue update is
>>       running, so protect elevator switch using elevator_get() and 
>>       elevator_put() (Ming Lei)
>>     - Invoke elv_update_nr_hw_queues() from blk_mq_elv_switch_back() and 
>>       that way avoid elevator code restructuring in a patch which fixes
>>       a regression. (Ming Lei)
>>
>> ---
>>  block/blk-mq.c   | 86 +++++++++++++++++++++++++++++++++++++++++++-----
>>  block/blk.h      |  2 +-
>>  block/elevator.c |  6 ++--
>>  3 files changed, 81 insertions(+), 13 deletions(-)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 4806b867e37d..fa25d6d36790 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -4966,6 +4966,62 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>>  	return ret;
>>  }
>>  
>> +/*
>> + * Switch back to the elevator type stored in the xarray.
>> + */
>> +static void blk_mq_elv_switch_back(struct request_queue *q,
>> +		struct xarray *elv_tbl)
>> +{
>> +	struct elevator_type *e = xa_load(elv_tbl, q->id);
>> +
>> +	/* The elv_update_nr_hw_queues unfreezes the queue. */
>> +	elv_update_nr_hw_queues(q, e);
>> +
>> +	/* Drop the reference acquired in blk_mq_elv_switch_none. */
>> +	if (e)
>> +		elevator_put(e);
>> +}
>> +
>> +/*
>> + * Stores elevator type in xarray and set current elevator to none. It uses
>> + * q->id as an index to store the elevator type into the xarray.
>> + */
>> +static int blk_mq_elv_switch_none(struct request_queue *q,
>> +		struct xarray *elv_tbl)
>> +{
>> +	int ret = 0;
>> +
>> +	lockdep_assert_held_write(&q->tag_set->update_nr_hwq_lock);
>> +
>> +	/*
>> +	 * Accessing q->elevator without holding q->elevator_lock is safe here
>> +	 * because we're called from nr_hw_queue update which is protected by
>> +	 * set->update_nr_hwq_lock in the writer context. So, scheduler update/
>> +	 * switch code (which acquires the same lock in the reader context)
>> +	 * can't run concurrently.
>> +	 */
>> +	if (q->elevator) {
>> +
>> +		ret = xa_insert(elv_tbl, q->id, q->elevator->type, GFP_KERNEL);
>> +		if (ret) {
>> +			WARN_ON_ONCE(1);
>> +			goto out;
>> +		}
>> +		/*
>> +		 * Before we switch elevator to 'none', take a reference to
>> +		 * the elevator module so that while nr_hw_queue update is
>> +		 * running, no one can remove elevator module. We'd put the
>> +		 * reference to elevator module later when we switch back
>> +		 * elevator.
>> +		 */
>> +		__elevator_get(q->elevator->type);
>> +
>> +		elevator_set_none(q);
>> +	}
>> +out:
>> +	return ret;
>> +}
>> +
>>  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>  							int nr_hw_queues)
>>  {
>> @@ -4973,6 +5029,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>  	int prev_nr_hw_queues = set->nr_hw_queues;
>>  	unsigned int memflags;
>>  	int i;
>> +	struct xarray elv_tbl;
>>  
>>  	lockdep_assert_held(&set->tag_list_lock);
>>  
>> @@ -4984,6 +5041,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>  		return;
>>  
>>  	memflags = memalloc_noio_save();
>> +
>> +	xa_init(&elv_tbl);
>> +
>>  	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>>  		blk_mq_debugfs_unregister_hctxs(q);
>>  		blk_mq_sysfs_unregister_hctxs(q);
>> @@ -4992,11 +5052,17 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>  	list_for_each_entry(q, &set->tag_list, tag_set_list)
>>  		blk_mq_freeze_queue_nomemsave(q);
>>  
>> -	if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) {
>> -		list_for_each_entry(q, &set->tag_list, tag_set_list)
>> -			blk_mq_unfreeze_queue_nomemrestore(q);
>> -		goto reregister;
>> -	}
>> +	/*
>> +	 * Switch IO scheduler to 'none', cleaning up the data associated
>> +	 * with the previous scheduler. We will switch back once we are done
>> +	 * updating the new sw to hw queue mappings.
>> +	 */
>> +	list_for_each_entry(q, &set->tag_list, tag_set_list)
>> +		if (blk_mq_elv_switch_none(q, &elv_tbl))
>> +			goto switch_back;
> 
> The partial `switch_back` needs to be handled carefully. If switching to
> none fails for one queue, the remaining queues will still be switched
> back, no matter whether their previous scheduler was none or not. Maybe
> the return type of blk_mq_elv_switch_none() can simply be made 'void' to
> avoid the trouble.
> 
Hmm, maybe not. I think in the case of a partial failure while switching the
scheduler to 'none', only those queues whose scheduler was actually switched
to 'none' would be switched back to their previous scheduler.

Let's consider an example scenario:
We have three queues — q1, q2, and q3 — all sharing the same tagset.

Initially:
- q1 uses scheduler s1
- q2 uses scheduler s2
- q3 uses scheduler s3

Now, during the nr_hw_queue update, we invoke blk_mq_elv_switch_none() to
temporarily switch all schedulers to 'none'. Suppose the switch fails specifically
for q3. In this case:
- q1 and q2 successfully switched to 'none'
- q3 remains with its original scheduler s3
- We save the original scheduler mappings in an xarray as follows:
  x[q1->id] = s1
  x[q2->id] = s2
  x[q3->id] = 0   // Indicates no change was made

Since the switch failed for q3, we then invoke blk_mq_elv_switch_back(), which 
internally calls elv_update_nr_hw_queues() to restore the previous schedulers:

For q1: x[q1->id] = s1 => scheduler is restored to s1
For q2: x[q2->id] = s2 => scheduler is restored to s2
For q3: x[q3->id] = 0  => no scheduler switch attempted, q3 continues to use s3

So based on this flow, the current implementation appears to correctly handle partial
failures — only those queues that were successfully switched to 'none' are switched back.

Let me know if you think there's a scenario where this logic might still break.

Thanks,
--Nilay

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCHv2] block: restore two stage elevator switch while running nr_hw_queue update
  2025-07-18 13:32 [PATCHv2] block: restore two stage elevator switch while running nr_hw_queue update Nilay Shroff
  2025-07-20 12:19 ` Ming Lei
@ 2025-07-22  2:21 ` Yu Kuai
  2025-07-22 11:27   ` Nilay Shroff
  1 sibling, 1 reply; 10+ messages in thread
From: Yu Kuai @ 2025-07-22  2:21 UTC (permalink / raw)
  To: Nilay Shroff, linux-block
  Cc: yi.zhang, ming.lei, hch, yukuai1, axboe, shinichiro.kawasaki,
	gjoyce, yukuai (C)

Hi,

On 2025/07/18 21:32, Nilay Shroff wrote:
> The kmemleak reports memory leaks related to elevator resources that
> were originally allocated in the ->init_hctx() method. The following
> leak traces are observed after running blktests:
> 
> unreferenced object 0xffff8881b82f7400 (size 512):
>    comm "check", pid 68454, jiffies 4310588881
>    hex dump (first 32 bytes):
>      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>    backtrace (crc 5bac8b34):
>      __kvmalloc_node_noprof+0x55d/0x7a0
>      sbitmap_init_node+0x15a/0x6a0
>      kyber_init_hctx+0x316/0xb90
>      blk_mq_init_sched+0x419/0x580
>      elevator_switch+0x18b/0x630
>      elv_update_nr_hw_queues+0x219/0x2c0
>      __blk_mq_update_nr_hw_queues+0x36a/0x6f0
>      blk_mq_update_nr_hw_queues+0x3a/0x60
>      0xffffffffc09ceb80
>      0xffffffffc09d7e0b
>      configfs_write_iter+0x2b1/0x470
>      vfs_write+0x527/0xe70
>      ksys_write+0xff/0x200
>      do_syscall_64+0x98/0x3c0
>      entry_SYSCALL_64_after_hwframe+0x76/0x7e
> unreferenced object 0xffff8881b82f6000 (size 512):
>    comm "check", pid 68454, jiffies 4310588881
>    hex dump (first 32 bytes):
>      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>    backtrace (crc 5bac8b34):
>      __kvmalloc_node_noprof+0x55d/0x7a0
>      sbitmap_init_node+0x15a/0x6a0
>      kyber_init_hctx+0x316/0xb90
>      blk_mq_init_sched+0x419/0x580
>      elevator_switch+0x18b/0x630
>      elv_update_nr_hw_queues+0x219/0x2c0
>      __blk_mq_update_nr_hw_queues+0x36a/0x6f0
>      blk_mq_update_nr_hw_queues+0x3a/0x60
>      0xffffffffc09ceb80
>      0xffffffffc09d7e0b
>      configfs_write_iter+0x2b1/0x470
>      vfs_write+0x527/0xe70
>      ksys_write+0xff/0x200
>      do_syscall_64+0x98/0x3c0
>      entry_SYSCALL_64_after_hwframe+0x76/0x7e
> unreferenced object 0xffff8881b82f5800 (size 512):
>    comm "check", pid 68454, jiffies 4310588881
>    hex dump (first 32 bytes):
>      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>    backtrace (crc 5bac8b34):
>      __kvmalloc_node_noprof+0x55d/0x7a0
>      sbitmap_init_node+0x15a/0x6a0
>      kyber_init_hctx+0x316/0xb90
>      blk_mq_init_sched+0x419/0x580
>      elevator_switch+0x18b/0x630
>      elv_update_nr_hw_queues+0x219/0x2c0
>      __blk_mq_update_nr_hw_queues+0x36a/0x6f0
>      blk_mq_update_nr_hw_queues+0x3a/0x60
>      0xffffffffc09ceb80
>      0xffffffffc09d7e0b
>      configfs_write_iter+0x2b1/0x470
>      vfs_write+0x527/0xe70
>      ksys_write+0xff/0x200
>      do_syscall_64+0x98/0x3c0
>      entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> The issue arises while we run an nr_hw_queue update. Specifically, we first
> reallocate hardware contexts (hctx) via __blk_mq_realloc_hw_ctxs(), and
> then later invoke elevator_switch() (assuming q->elevator is not NULL).
> The elevator switch code first exits the old elevator (elevator_exit)
> and then switches to the new elevator. The elevator_exit loops through
> each hctx and invokes the elevator's per-hctx exit method ->exit_hctx(),
> which releases resources allocated during ->init_hctx().
> 
> This memleak manifests when we reduce the number of h/w queues - for
> example, when the initial update sets the number of queues to X, and a
> later update reduces it to Y, where Y < X. In this case, we'd lose access
> to the old hctxs by the time we reach the elevator exit code, because
> __blk_mq_realloc_hw_ctxs would have already released them. As we no longer
> hold any reference to the old hctxs, we have no way to free the scheduler
> resources (which are allocated in ->init_hctx()), and kmemleak complains
> about it.
> 
> This issue was caused due to the commit 596dce110b7d ("block: simplify
> elevator reattachment for updating nr_hw_queues"). That change unified
> the two-stage elevator teardown and reattachment into a single call that
> occurs after __blk_mq_realloc_hw_ctxs() has already freed the hctxs.
> 
> This patch restores the previous two-stage elevator switch logic during
> nr_hw_queues updates. First, the elevator is switched to 'none', which
> ensures all scheduler resources are properly freed. Then, the hardware
> contexts (hctxs) are reallocated, and the software-to-hardware queue
> mappings are updated. Finally, the original elevator is reattached. This
> sequence prevents loss of references to old hctxs and avoids the scheduler
> resource leaks reported by kmemleak.
> 
> Reported-by: Yi Zhang <yi.zhang@redhat.com>
> Fixes: 596dce110b7d ("block: simplify elevator reattachment for updating nr_hw_queues")
> Closes: https://lore.kernel.org/all/CAHj4cs8oJFvz=daCvjHM5dYCNQH4UXwSySPPU4v-WHce_kZXZA@mail.gmail.com/
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> Changes from v1:
>      - Updated commit message with kmemleak trace generated using null-blk
>        (Yi Zhang)
>      - The elevator module could be removed while nr_hw_queue update is
>        running, so protect elevator switch using elevator_get() and
>        elevator_put() (Ming Lei)
>      - Invoke elv_update_nr_hw_queues() from blk_mq_elv_switch_back() and
>        that way avoid elevator code restructuring in a patch which fixes
>        a regression. (Ming Lei)
> 
> ---
>   block/blk-mq.c   | 86 +++++++++++++++++++++++++++++++++++++++++++-----
>   block/blk.h      |  2 +-
>   block/elevator.c |  6 ++--
>   3 files changed, 81 insertions(+), 13 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4806b867e37d..fa25d6d36790 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4966,6 +4966,62 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>   	return ret;
>   }
>   
> +/*
> + * Switch back to the elevator type stored in the xarray.
> + */
> +static void blk_mq_elv_switch_back(struct request_queue *q,
> +		struct xarray *elv_tbl)
> +{
> +	struct elevator_type *e = xa_load(elv_tbl, q->id);
> +
> +	/* The elv_update_nr_hw_queues unfreezes the queue. */
> +	elv_update_nr_hw_queues(q, e);
> +
> +	/* Drop the reference acquired in blk_mq_elv_switch_none. */
> +	if (e)
> +		elevator_put(e);
> +}
> +
> +/*
> + * Stores elevator type in xarray and set current elevator to none. It uses
> + * q->id as an index to store the elevator type into the xarray.
> + */
> +static int blk_mq_elv_switch_none(struct request_queue *q,
> +		struct xarray *elv_tbl)
> +{
> +	int ret = 0;
> +
> +	lockdep_assert_held_write(&q->tag_set->update_nr_hwq_lock);
> +
> +	/*
> +	 * Accessing q->elevator without holding q->elevator_lock is safe here
> +	 * because we're called from nr_hw_queue update which is protected by
> +	 * set->update_nr_hwq_lock in the writer context. So, scheduler update/
> +	 * switch code (which acquires the same lock in the reader context)
> +	 * can't run concurrently.
> +	 */
> +	if (q->elevator) {
> +
> +		ret = xa_insert(elv_tbl, q->id, q->elevator->type, GFP_KERNEL);
> +		if (ret) {
> +			WARN_ON_ONCE(1);
> +			goto out;
> +		}
Perhaps this is simpler, remove the out tag and return directly:

if (WARN_ON_ONCE(ret))
	return ret;

And BTW, I feel the warning is not necessary here for a small memory
allocation failure.

> +		/*
> +		 * Before we switch elevator to 'none', take a reference to
> +		 * the elevator module so that while nr_hw_queue update is
> +		 * running, no one can remove elevator module. We'd put the
> +		 * reference to elevator module later when we switch back
> +		 * elevator.
> +		 */
> +		__elevator_get(q->elevator->type);
> +
> +		elevator_set_none(q);
> +	}
> +out:
> +	return ret;
> +}
> +
>   static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>   							int nr_hw_queues)
>   {
> @@ -4973,6 +5029,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>   	int prev_nr_hw_queues = set->nr_hw_queues;
>   	unsigned int memflags;
>   	int i;
> +	struct xarray elv_tbl;

Perhaps:

DEFINE_XARRAY(elv_tbl)
>   
>   	lockdep_assert_held(&set->tag_list_lock);
>   
> @@ -4984,6 +5041,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>   		return;
>   
>   	memflags = memalloc_noio_save();
> +
> +	xa_init(&elv_tbl);
> +
>   	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>   		blk_mq_debugfs_unregister_hctxs(q);
>   		blk_mq_sysfs_unregister_hctxs(q);
> @@ -4992,11 +5052,17 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>   	list_for_each_entry(q, &set->tag_list, tag_set_list)
>   		blk_mq_freeze_queue_nomemsave(q);
>   
> -	if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) {
> -		list_for_each_entry(q, &set->tag_list, tag_set_list)
> -			blk_mq_unfreeze_queue_nomemrestore(q);
> -		goto reregister;
> -	}
> +	/*
> +	 * Switch IO scheduler to 'none', cleaning up the data associated
> +	 * with the previous scheduler. We will switch back once we are done
> +	 * updating the new sw to hw queue mappings.
> +	 */
> +	list_for_each_entry(q, &set->tag_list, tag_set_list)
> +		if (blk_mq_elv_switch_none(q, &elv_tbl))
> +			goto switch_back;

Can we move the queue freeze into blk_mq_elv_switch_none(), to
correspond with the queue unfreeze in blk_mq_elv_switch_back()?

> +
> +	if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
> +		goto switch_back;
>   
>   fallback:
>   	blk_mq_update_queue_map(set);
> @@ -5016,12 +5082,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>   		}
>   		blk_mq_map_swqueue(q);
>   	}
> -
> -	/* elv_update_nr_hw_queues() unfreeze queue for us */
> +switch_back:
> +	/* The blk_mq_elv_switch_back unfreezes queue for us. */
>   	list_for_each_entry(q, &set->tag_list, tag_set_list)
> -		elv_update_nr_hw_queues(q);
> +		blk_mq_elv_switch_back(q, &elv_tbl);
>   
> -reregister:
>   	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>   		blk_mq_sysfs_register_hctxs(q);
>   		blk_mq_debugfs_register_hctxs(q);
> @@ -5029,6 +5094,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>   		blk_mq_remove_hw_queues_cpuhp(q);
>   		blk_mq_add_hw_queues_cpuhp(q);
>   	}
> +
> +	xa_destroy(&elv_tbl);
> +
>   	memalloc_noio_restore(memflags);
>   
>   	/* Free the excess tags when nr_hw_queues shrink. */
> diff --git a/block/blk.h b/block/blk.h
> index 37ec459fe656..fae7653a941f 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -321,7 +321,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
>   
>   bool blk_insert_flush(struct request *rq);
>   
> -void elv_update_nr_hw_queues(struct request_queue *q);
> +void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_type *e);
>   void elevator_set_default(struct request_queue *q);
>   void elevator_set_none(struct request_queue *q);
>   
> diff --git a/block/elevator.c b/block/elevator.c
> index ab22542e6cf0..83d0bfb90a03 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -689,7 +689,7 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
>    * The I/O scheduler depends on the number of hardware queues, this forces a
>    * reattachment when nr_hw_queues changes.
>    */
> -void elv_update_nr_hw_queues(struct request_queue *q)
> +void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_type *e)

Now that this function no longer just updates nr_hw_queues, but switches
the elevator from none to e, this name will be confusing. Since there is
just one caller from blk_mq_update_nr_hw_queues(), I feel it's better
to move the implementation to blk_mq_elv_switch_back() directly.

>   {
>   	struct elv_change_ctx ctx = {};
>   	int ret = -ENODEV;
> @@ -697,8 +697,8 @@ void elv_update_nr_hw_queues(struct request_queue *q)
>   	WARN_ON_ONCE(q->mq_freeze_depth == 0);
>   
>   	mutex_lock(&q->elevator_lock);
> -	if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
> -		ctx.name = q->elevator->type->elevator_name;
> +	if (e && !blk_queue_dying(q) && blk_queue_registered(q)) {
> +		ctx.name = e->elevator_name;

This looks not optimal, since you don't need elevator_lock to protect e.
>   
>   		/* force to reattach elevator after nr_hw_queue is updated */
>   		ret = elevator_switch(q, &ctx);
> 

BTW, this is not related to this patch. Should we handle fallback
failure like blk_mq_sysfs_register_hctxs()?

Thanks,
Kuai


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCHv2] block: restore two stage elevator switch while running nr_hw_queue update
  2025-07-22  2:21 ` Yu Kuai
@ 2025-07-22 11:27   ` Nilay Shroff
  2025-07-23  0:37     ` Yu Kuai
  0 siblings, 1 reply; 10+ messages in thread
From: Nilay Shroff @ 2025-07-22 11:27 UTC (permalink / raw)
  To: Yu Kuai, linux-block
  Cc: yi.zhang, ming.lei, hch, axboe, shinichiro.kawasaki, gjoyce,
	yukuai (C)



On 7/22/25 7:51 AM, Yu Kuai wrote:
>> +/*
>> + * Stores elevator type in xarray and set current elevator to none. It uses
>> + * q->id as an index to store the elevator type into the xarray.
>> + */
>> +static int blk_mq_elv_switch_none(struct request_queue *q,
>> +        struct xarray *elv_tbl)
>> +{
>> +    int ret = 0;
>> +
>> +    lockdep_assert_held_write(&q->tag_set->update_nr_hwq_lock);
>> +
>> +    /*
>> +     * Accessing q->elevator without holding q->elevator_lock is safe here
>> +     * because we're called from nr_hw_queue update which is protected by
>> +     * set->update_nr_hwq_lock in the writer context. So, scheduler update/
>> +     * switch code (which acquires the same lock in the reader context)
>> +     * can't run concurrently.
>> +     */
>> +    if (q->elevator) {
>> +
>> +        ret = xa_insert(elv_tbl, q->id, q->elevator->type, GFP_KERNEL);
>> +        if (ret) {
>> +            WARN_ON_ONCE(1);
>> +            goto out;
>> +        }
> Perhaps this is simpler, remove the out tag and return directly:
> 
> if (WARN_ON_ONCE(ret))
>     return ret;

Yes I can do that.

> And BTW, I feel the warning is not necessary here for small memory
> allocation failure.
IMO, it’s actually useful to print a warning in this case — even though
the memory allocation failure is relatively minor. Since the failure causes
the code to unwind back to the original state and prevents the nr_hw_queues
from being updated, having a warning message can help indicate why the update
didn't go through. It provides visibility into what went wrong, which can
be valuable for debugging or understanding unexpected behavior.

> 
>> +        /*
>> +         * Before we switch elevator to 'none', take a reference to
>> +         * the elevator module so that while nr_hw_queue update is
>> +         * running, no one can remove elevator module. We'd put the
>> +         * reference to elevator module later when we switch back
>> +         * elevator.
>> +         */
>> +        __elevator_get(q->elevator->type);
>> +
>> +        elevator_set_none(q);
>> +    }
>> +out:
>> +    return ret;
>> +}
>> +
>>   static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>                               int nr_hw_queues)
>>   {
>> @@ -4973,6 +5029,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>       int prev_nr_hw_queues = set->nr_hw_queues;
>>       unsigned int memflags;
>>       int i;
>> +    struct xarray elv_tbl;
> 
> Perhaps:
> 
> DEFINE_XARRAY(elv_tbl)
It may not work, because then we would have to initialize it at compile time
at file scope. If we then had multiple threads running nr_hw_queue updates
(for different tagsets), we couldn't use that shared copy of elv_tbl
as is and would have to protect it with another lock. So, IMO, creating the
xarray locally here makes sense.

>>         lockdep_assert_held(&set->tag_list_lock);
>>   @@ -4984,6 +5041,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>           return;
>>         memflags = memalloc_noio_save();
>> +
>> +    xa_init(&elv_tbl);
>> +
>>       list_for_each_entry(q, &set->tag_list, tag_set_list) {
>>           blk_mq_debugfs_unregister_hctxs(q);
>>           blk_mq_sysfs_unregister_hctxs(q);
>> @@ -4992,11 +5052,17 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>       list_for_each_entry(q, &set->tag_list, tag_set_list)
>>           blk_mq_freeze_queue_nomemsave(q);
>>   -    if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) {
>> -        list_for_each_entry(q, &set->tag_list, tag_set_list)
>> -            blk_mq_unfreeze_queue_nomemrestore(q);
>> -        goto reregister;
>> -    }
>> +    /*
>> +     * Switch IO scheduler to 'none', cleaning up the data associated
>> +     * with the previous scheduler. We will switch back once we are done
>> +     * updating the new sw to hw queue mappings.
>> +     */
>> +    list_for_each_entry(q, &set->tag_list, tag_set_list)
>> +        if (blk_mq_elv_switch_none(q, &elv_tbl))
>> +            goto switch_back;
> 
> Can we move the freeze queue into blk_mq_elv_switch_none? To
> corresponding with unfreeze queue in blk_mq_elv_switch_back().
> 
Ideally yes, but as Ming pointed out in the first version of this
patch, we want to minimize code changes, as much as possible, in
the bug fix patch so that it'd be easy to backport to older
kernel releases. Having said that, we'll have another patch (not
a bug fix) where we'd address this by restructuring the code
around this area.

>> +
>> +    if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
>> +        goto switch_back;
>>     fallback:
>>       blk_mq_update_queue_map(set);
>> @@ -5016,12 +5082,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>           }
>>           blk_mq_map_swqueue(q);
>>       }
>> -
>> -    /* elv_update_nr_hw_queues() unfreeze queue for us */
>> +switch_back:
>> +    /* The blk_mq_elv_switch_back unfreezes queue for us. */
>>       list_for_each_entry(q, &set->tag_list, tag_set_list)
>> -        elv_update_nr_hw_queues(q);
>> +        blk_mq_elv_switch_back(q, &elv_tbl);
>>   -reregister:
>>       list_for_each_entry(q, &set->tag_list, tag_set_list) {
>>           blk_mq_sysfs_register_hctxs(q);
>>           blk_mq_debugfs_register_hctxs(q);
>> @@ -5029,6 +5094,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>           blk_mq_remove_hw_queues_cpuhp(q);
>>           blk_mq_add_hw_queues_cpuhp(q);
>>       }
>> +
>> +    xa_destroy(&elv_tbl);
>> +
>>       memalloc_noio_restore(memflags);
>>         /* Free the excess tags when nr_hw_queues shrink. */
>> diff --git a/block/blk.h b/block/blk.h
>> index 37ec459fe656..fae7653a941f 100644
>> --- a/block/blk.h
>> +++ b/block/blk.h
>> @@ -321,7 +321,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
>>     bool blk_insert_flush(struct request *rq);
>>   -void elv_update_nr_hw_queues(struct request_queue *q);
>> +void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_type *e);
>>   void elevator_set_default(struct request_queue *q);
>>   void elevator_set_none(struct request_queue *q);
>>   diff --git a/block/elevator.c b/block/elevator.c
>> index ab22542e6cf0..83d0bfb90a03 100644
>> --- a/block/elevator.c
>> +++ b/block/elevator.c
>> @@ -689,7 +689,7 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
>>    * The I/O scheduler depends on the number of hardware queues, this forces a
>>    * reattachment when nr_hw_queues changes.
>>    */
>> -void elv_update_nr_hw_queues(struct request_queue *q)
>> +void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_type *e)
> 
> Now that this function no longer just update nr_hw_queues, but switch
> elevator from none to e, this name will be confusing. Since there is
> just one caller from blk_mq_update_nr_hw_queues(), I feel it's better
> to move the implematation to blk_mq_elv_switch_back() directly.
> 
Yeah correct, and that's exactly what I implemented in the first version
of this patch, but then, as I mentioned above, we want to minimize the
code restructuring changes in a bug fix patch so that it'd be easy to
backport.

>>   {
>>       struct elv_change_ctx ctx = {};
>>       int ret = -ENODEV;
>> @@ -697,8 +697,8 @@ void elv_update_nr_hw_queues(struct request_queue *q)
>>       WARN_ON_ONCE(q->mq_freeze_depth == 0);
>>         mutex_lock(&q->elevator_lock);
>> -    if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
>> -        ctx.name = q->elevator->type->elevator_name;
>> +    if (e && !blk_queue_dying(q) && blk_queue_registered(q)) {
>> +        ctx.name = e->elevator_name;
> 
> This looks not optimal, since you don't need elevator_lock to protect e.
>>             /* force to reattach elevator after nr_hw_queue is updated */
>>           ret = elevator_switch(q, &ctx);
>>
> 
> BTW, this is not related to this patch. Should we handle fall_back
> failure like blk_mq_sysfs_register_hctxs()?
> 
Okay, I guess you meant handling the failure case by unwinding the
queues instead of looping through them from start to end. If yes, then
it could be done, but again we may not want to do it in the bug fix patch.

Thanks,
--Nilay


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCHv2] block: restore two stage elevator switch while running nr_hw_queue update
  2025-07-22 11:27   ` Nilay Shroff
@ 2025-07-23  0:37     ` Yu Kuai
  2025-07-23  6:24       ` Nilay Shroff
  0 siblings, 1 reply; 10+ messages in thread
From: Yu Kuai @ 2025-07-23  0:37 UTC (permalink / raw)
  To: Nilay Shroff, Yu Kuai, linux-block
  Cc: yi.zhang, ming.lei, hch, axboe, shinichiro.kawasaki, gjoyce,
	yukuai (C)

Hi,

On 2025/07/22 19:27, Nilay Shroff wrote:
> 
> 
> On 7/22/25 7:51 AM, Yu Kuai wrote:
>>> +/*
>>> + * Stores elevator type in xarray and set current elevator to none. It uses
>>> + * q->id as an index to store the elevator type into the xarray.
>>> + */
>>> +static int blk_mq_elv_switch_none(struct request_queue *q,
>>> +        struct xarray *elv_tbl)
>>> +{
>>> +    int ret = 0;
>>> +
>>> +    lockdep_assert_held_write(&q->tag_set->update_nr_hwq_lock);
>>> +
>>> +    /*
>>> +     * Accessing q->elevator without holding q->elevator_lock is safe here
>>> +     * because we're called from nr_hw_queue update which is protected by
>>> +     * set->update_nr_hwq_lock in the writer context. So, scheduler update/
>>> +     * switch code (which acquires the same lock in the reader context)
>>> +     * can't run concurrently.
>>> +     */
>>> +    if (q->elevator) {
>>> +
>>> +        ret = xa_insert(elv_tbl, q->id, q->elevator->type, GFP_KERNEL);
>>> +        if (ret) {
>>> +            WARN_ON_ONCE(1);
>>> +            goto out;
>>> +        }
>> Perhaps this is simpler, remove the out tag and return directly:
>>
>> if (WARN_ON_ONCE(ret))
>>      return ret;
> 
> Yes I can do that.
> 
>> And BTW, I feel the warning is not necessary here for small memory
>> allocation failure.
> IMO, it’s actually useful to print a warning in this case — even though
> the memory allocation failure is relatively minor. Since the failure causes
> the code to unwind back to the original state and prevents the nr_hw_queues
> from being updated, having a warning message can help indicate why the update
> didn't go through. It provides visibility into what went wrong, which can
> be valuable for debugging or understanding unexpected behavior.
> 
>>
>>> +        /*
>>> +         * Before we switch elevator to 'none', take a reference to
>>> +         * the elevator module so that while nr_hw_queue update is
>>> +         * running, no one can remove elevator module. We'd put the
>>> +         * reference to elevator module later when we switch back
>>> +         * elevator.
>>> +         */
>>> +        __elevator_get(q->elevator->type);
>>> +
>>> +        elevator_set_none(q);
>>> +    }
>>> +out:
>>> +    return ret;
>>> +}
>>> +
>>>    static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>                                int nr_hw_queues)
>>>    {
>>> @@ -4973,6 +5029,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>        int prev_nr_hw_queues = set->nr_hw_queues;
>>>        unsigned int memflags;
>>>        int i;
>>> +    struct xarray elv_tbl;
>>
>> Perhaps:
>>
>> DEFINE_XARRAY(elv_tbl)
> It may not work because then we have to initialize it at compile time
> at file scope. Then if we've multiple threads running nr_hw_queue update
> (for different tagsets) then we can't use that shared copy of elv_tbl
> as is and we've to protect it with another lock. So, IMO, instead creating
> xarray locally here makes sense.

I'm confused: I didn't add static, so this should still be a stack value.
I mean, using this helper to initialize it is simpler :)
> 
>>>          lockdep_assert_held(&set->tag_list_lock);
>>>    @@ -4984,6 +5041,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>            return;
>>>          memflags = memalloc_noio_save();
>>> +
>>> +    xa_init(&elv_tbl);
>>> +
>>>        list_for_each_entry(q, &set->tag_list, tag_set_list) {
>>>            blk_mq_debugfs_unregister_hctxs(q);
>>>            blk_mq_sysfs_unregister_hctxs(q);
>>> @@ -4992,11 +5052,17 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>        list_for_each_entry(q, &set->tag_list, tag_set_list)
>>>            blk_mq_freeze_queue_nomemsave(q);
>>>    -    if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) {
>>> -        list_for_each_entry(q, &set->tag_list, tag_set_list)
>>> -            blk_mq_unfreeze_queue_nomemrestore(q);
>>> -        goto reregister;
>>> -    }
>>> +    /*
>>> +     * Switch IO scheduler to 'none', cleaning up the data associated
>>> +     * with the previous scheduler. We will switch back once we are done
>>> +     * updating the new sw to hw queue mappings.
>>> +     */
>>> +    list_for_each_entry(q, &set->tag_list, tag_set_list)
>>> +        if (blk_mq_elv_switch_none(q, &elv_tbl))
>>> +            goto switch_back;
>>
>> Can we move the freeze queue into blk_mq_elv_switch_none? To
>> corresponding with unfreeze queue in blk_mq_elv_switch_back().
>>
> Ideally yes, but as Ming pointed in the first version of this
> patch we want to minimize code changes, as much possible, in
> the bug fix patch so that it'd be easy to back port to the older
> kernel release. Having said that, we'll have another patch (not
> a bug fix) where we'd address this by restructuring the code
> around this area.

Ok.
> 
>>> +
>>> +    if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
>>> +        goto switch_back;
>>>      fallback:
>>>        blk_mq_update_queue_map(set);
>>> @@ -5016,12 +5082,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>            }
>>>            blk_mq_map_swqueue(q);
>>>        }
>>> -
>>> -    /* elv_update_nr_hw_queues() unfreeze queue for us */
>>> +switch_back:
>>> +    /* The blk_mq_elv_switch_back unfreezes queue for us. */
>>>        list_for_each_entry(q, &set->tag_list, tag_set_list)
>>> -        elv_update_nr_hw_queues(q);
>>> +        blk_mq_elv_switch_back(q, &elv_tbl);
>>>    -reregister:
>>>        list_for_each_entry(q, &set->tag_list, tag_set_list) {
>>>            blk_mq_sysfs_register_hctxs(q);
>>>            blk_mq_debugfs_register_hctxs(q);
>>> @@ -5029,6 +5094,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>            blk_mq_remove_hw_queues_cpuhp(q);
>>>            blk_mq_add_hw_queues_cpuhp(q);
>>>        }
>>> +
>>> +    xa_destroy(&elv_tbl);
>>> +
>>>        memalloc_noio_restore(memflags);
>>>          /* Free the excess tags when nr_hw_queues shrink. */
>>> diff --git a/block/blk.h b/block/blk.h
>>> index 37ec459fe656..fae7653a941f 100644
>>> --- a/block/blk.h
>>> +++ b/block/blk.h
>>> @@ -321,7 +321,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
>>>      bool blk_insert_flush(struct request *rq);
>>>    -void elv_update_nr_hw_queues(struct request_queue *q);
>>> +void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_type *e);
>>>    void elevator_set_default(struct request_queue *q);
>>>    void elevator_set_none(struct request_queue *q);
>>>    diff --git a/block/elevator.c b/block/elevator.c
>>> index ab22542e6cf0..83d0bfb90a03 100644
>>> --- a/block/elevator.c
>>> +++ b/block/elevator.c
>>> @@ -689,7 +689,7 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
>>>     * The I/O scheduler depends on the number of hardware queues, this forces a
>>>     * reattachment when nr_hw_queues changes.
>>>     */
>>> -void elv_update_nr_hw_queues(struct request_queue *q)
>>> +void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_type *e)
>>
>> Now that this function no longer just update nr_hw_queues, but switch
>> elevator from none to e, this name will be confusing. Since there is
>> just one caller from blk_mq_update_nr_hw_queues(), I feel it's better
>> to move the implematation to blk_mq_elv_switch_back() directly.
>>
> Yeah correct, and that's what exactly I implemented in the first version
> of this patch but then as I mentioned above we'd want to minimize the
> code restructuring changes in a bug fix patch so that it'd be easy to
> backport.

Sure.
> 
>>>    {
>>>        struct elv_change_ctx ctx = {};
>>>        int ret = -ENODEV;
>>> @@ -697,8 +697,8 @@ void elv_update_nr_hw_queues(struct request_queue *q)
>>>        WARN_ON_ONCE(q->mq_freeze_depth == 0);
>>>          mutex_lock(&q->elevator_lock);
>>> -    if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
>>> -        ctx.name = q->elevator->type->elevator_name;
>>> +    if (e && !blk_queue_dying(q) && blk_queue_registered(q)) {
>>> +        ctx.name = e->elevator_name;
>>
>> This looks not optimal, since you don't need elevator_lock to protect e.
>>>              /* force to reattach elevator after nr_hw_queue is updated */
>>>            ret = elevator_switch(q, &ctx);
>>>
>>
>> BTW, this is not related to this patch. Should we handle fall_back
>> failure like blk_mq_sysfs_register_hctxs()?
>>
> OKay I guess you meant here handle failure case by unwinding the
> queue instead of looping through it from start to end. If yes, then
> it could be done but again we may not want to do it the bug fix patch.
> 

Not like that. Actually, I don't have any ideas for now: the hctxs are
unregistered first, and if re-registering fails, for example due to -ENOMEM,
I can't find a way to fall back :(

Thanks,
Kuai

> Thanks,
> --Nilay
> 
> .
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCHv2] block: restore two stage elevator switch while running nr_hw_queue update
  2025-07-23  0:37     ` Yu Kuai
@ 2025-07-23  6:24       ` Nilay Shroff
  2025-07-23  6:58         ` Yu Kuai
  0 siblings, 1 reply; 10+ messages in thread
From: Nilay Shroff @ 2025-07-23  6:24 UTC (permalink / raw)
  To: Yu Kuai, linux-block
  Cc: yi.zhang, ming.lei, hch, axboe, shinichiro.kawasaki, gjoyce,
	yukuai (C)



On 7/23/25 6:07 AM, Yu Kuai wrote:
>>>>    static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>>                                int nr_hw_queues)
>>>>    {
>>>> @@ -4973,6 +5029,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>>        int prev_nr_hw_queues = set->nr_hw_queues;
>>>>        unsigned int memflags;
>>>>        int i;
>>>> +    struct xarray elv_tbl;
>>>
>>> Perhaps:
>>>
>>> DEFINE_XARRAY(elv_tbl)
>> It may not work because then we have to initialize it at compile time
>> at file scope. Then, if we have multiple threads running nr_hw_queue
>> updates (for different tagsets), we can't use that shared copy of elv_tbl
>> as is, and we'd have to protect it with another lock. So, IMO, creating
>> the xarray locally here makes sense instead.
> 
> I'm confused; I didn't add static, so this should still be a stack value.
> I mean, using this helper to initialize it is simpler :)

Using DEFINE_XARRAY() to initialize a local (stack) variable is not safe because
the xarray structure embeds a spinlock (.xa_lock), and statically initializing a
spinlock in local scope can cause issues when lockdep is enabled.
Lockdep expects statically initialized lock objects to live at static file scope:
it assigns a unique key to each lock object created at compile time, which it
then uses for dependency analysis. This mechanism does not work for compile-time
initialization of an on-stack object, and attempting it triggers warnings or
errors from lockdep.

Therefore, DEFINE_XARRAY() should only be used for static/global declarations. For
local usage, it's safer to use xa_init() or xa_init_flags() to initialize the
xarray dynamically at runtime.
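
As a kernel-style pseudocode sketch of the distinction (illustrative only,
not taken from the patch; the function name is made up):

```c
/* File scope: DEFINE_XARRAY() gives .xa_lock a static lockdep key. */
static DEFINE_XARRAY(global_elv_tbl);

static void nr_hw_queues_update_sketch(void)
{
	struct xarray elv_tbl;		/* on-stack xarray */

	/*
	 * Runtime init: xa_init() goes through spin_lock_init(), which
	 * registers a proper lockdep key for this stack object.
	 */
	xa_init(&elv_tbl);

	/* ... xa_store()/xa_load() one entry per request_queue ... */

	xa_destroy(&elv_tbl);
}
```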

>>> BTW, this is not related to this patch. Should we handle fall_back
>>> failure like blk_mq_sysfs_register_hctxs()?
>>>
>> Okay, I guess you meant handling the failure case by unwinding the
>> queue instead of looping through it from start to end. If yes, then it
>> could be done, but again we may not want to do it in a bug fix patch.
>>
> 
> Not like that; actually, I don't have any ideas for now. The hctxs are
> unregistered first, and if registering fails, for example due to -ENOMEM,
> I can't find a way to fall back :(
> 
If registering new hctxs fails, we fall back to the previous value of 
nr_hw_queues (prev_nr_hw_queues). When prev_nr_hw_queues is less than
the new nr_hw_queues, we do not reallocate memory for the existing hctxs—
instead, we reuse the memory that was already allocated.

Memory allocation is only attempted for the additional hctxs beyond
prev_nr_hw_queues. Therefore, if memory allocation for these new hctxs
fails, we can safely fall back to prev_nr_hw_queues because the memory
of the previously allocated hctxs remains intact.

Thanks,
--Nilay

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCHv2] block: restore two stage elevator switch while running nr_hw_queue update
  2025-07-23  6:24       ` Nilay Shroff
@ 2025-07-23  6:58         ` Yu Kuai
  2025-07-23 11:05           ` Nilay Shroff
  0 siblings, 1 reply; 10+ messages in thread
From: Yu Kuai @ 2025-07-23  6:58 UTC (permalink / raw)
  To: Nilay Shroff, Yu Kuai, linux-block
  Cc: yi.zhang, ming.lei, hch, axboe, shinichiro.kawasaki, gjoyce,
	yukuai (C)

Hi,

On 2025/07/23 14:24, Nilay Shroff wrote:
> 
> 
> On 7/23/25 6:07 AM, Yu Kuai wrote:
>>>>>     static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>>>                                 int nr_hw_queues)
>>>>>     {
>>>>> @@ -4973,6 +5029,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>>>         int prev_nr_hw_queues = set->nr_hw_queues;
>>>>>         unsigned int memflags;
>>>>>         int i;
>>>>> +    struct xarray elv_tbl;
>>>>
>>>> Perhaps:
>>>>
>>>> DEFINE_XARRAY(elv_tbl)
>>> It may not work because then we have to initialize it at compile time
>>> at file scope. Then, if we have multiple threads running nr_hw_queue
>>> updates (for different tagsets), we can't use that shared copy of elv_tbl
>>> as is, and we'd have to protect it with another lock. So, IMO, creating
>>> the xarray locally here makes sense instead.
>>
>> I'm confused; I didn't add static, so this should still be a stack value.
>> I mean, using this helper to initialize it is simpler :)
> 
> Using DEFINE_XARRAY() to initialize a local (stack) variable is not safe because
> the xarray structure embeds a spinlock (.xa_lock), and statically initializing a
> spinlock in local scope can cause issues when lockdep is enabled.
> Lockdep expects statically initialized lock objects to live at static file scope:
> it assigns a unique key to each lock object created at compile time, which it
> then uses for dependency analysis. This mechanism does not work for compile-time
> initialization of an on-stack object, and attempting it triggers warnings or
> errors from lockdep.
> 
> Therefore, DEFINE_XARRAY() should only be used for static/global declarations. For
> local usage, it's safer to use xa_init() or xa_init_flags() to initialize the
> xarray dynamically at runtime.

Yes, you're right.
> 
>>>> BTW, this is not related to this patch. Should we handle fall_back
>>>> failure like blk_mq_sysfs_register_hctxs()?
>>>>
>>> Okay, I guess you meant handling the failure case by unwinding the
>>> queue instead of looping through it from start to end. If yes, then it
>>> could be done, but again we may not want to do it in a bug fix patch.
>>>
>>
>> Not like that; actually, I don't have any ideas for now. The hctxs are
>> unregistered first, and if registering fails, for example due to -ENOMEM,
>> I can't find a way to fall back :(
>>
> If registering new hctxs fails, we fall back to the previous value of
> nr_hw_queues (prev_nr_hw_queues). When prev_nr_hw_queues is less than
> the new nr_hw_queues, we do not reallocate memory for the existing hctxs—
> instead, we reuse the memory that was already allocated.
> 
> Memory allocation is only attempted for the additional hctxs beyond
> prev_nr_hw_queues. Therefore, if memory allocation for these new hctxs
> fails, we can safely fall back to prev_nr_hw_queues because the memory
> of the previously allocated hctxs remains intact.

No, like I said before, blk_mq_sysfs_unregister_hctxs() will free memory
by kobject_del() for hctx->kobj and ctx->kobj, and
__blk_mq_update_nr_hw_queues() calls that helper at the beginning.
Later, in the fallback code, blk_mq_sysfs_register_hctxs() can fail on
memory allocation in kobject_add(); however, the return value is not
checked.

I did a quick test injecting the failure; the following warning is
triggered because kobject_del() will still be called while removing the
disk:

[  230.783515] ------------[ cut here ]------------
[  230.784841] kernfs: can not remove 'nr_tags', no directory
[  230.786350] WARNING: CPU: 1 PID: 411 at fs/kernfs/dir.c:1706 
kernfs_remove_by_name_ns+0x140
[  230.788826] Modules linked in: null_blk nbd nvme nvme_core [last 
unloaded: nbd]
[  230.790817] CPU: 1 UID: 0 PID: 411 Comm: bash Not tainted 
6.16.0-rc4-00079-g7136d673eb80-d
[  230.793609] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS 1.16.1-2.fc37 04/04
[  230.795942] RIP: 0010:kernfs_remove_by_name_ns+0x14a/0x160
[  230.797426] Code: 0b 48 83 05 d7 8d 04 0c 01 e9 6a ff ff ff 48 c7 c7 
40 3f 0e 83 48 83 05 0
[  230.801783] RSP: 0018:ffa0000000b1fc28 EFLAGS: 00010202
[  230.802753] RAX: 0000000000000000 RBX: ffffffff8a9bcdc8 RCX: 
0000000000000000
[  230.804056] RDX: 0000000000000002 RSI: ff11003f3fd17f48 RDI: 
00000000ffffffff
[  230.805353] RBP: 0000000000000000 R08: 0000000000000000 R09: 
6761745f726e2720
[  230.806659] R10: 6f6e206e6163203a R11: 203a73666e72656b R12: 
ffffffff828973a0
[  230.807983] R13: ffffffff83130a57 R14: ff1100010fea5380 R15: 
0000000000000000
[  230.809283] FS:  00007fc52f16f740(0000) GS:ff11003fb4610000(0000) 
knlGS:0000000000000000
[  230.810759] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  230.811810] CR2: 000055d6b57f4b60 CR3: 000000011ac5f006 CR4: 
0000000000771ef0
[  230.812756] PKRU: 55555554
[  230.813130] Call Trace:
[  230.813475]  <TASK>
[  230.813774]  remove_files+0x39/0xc0
[  230.814252]  sysfs_remove_group+0x48/0xf0
[  230.814798]  sysfs_remove_groups+0x31/0x70
[  230.815354]  __kobject_del+0x23/0xe0
[  230.815858]  kobject_del+0x17/0x50
[  230.816323]  blk_mq_unregister_hctx+0x5d/0x80
[  230.816921]  blk_mq_sysfs_unregister+0x7e/0x110
[  230.817536]  blk_unregister_queue+0x8a/0x160
[  230.818122]  __del_gendisk+0x1f9/0x530
[  230.818638]  del_gendisk+0xb3/0x100
[  230.819120]  null_del_dev+0x83/0x1b0 [null_blk]
[  230.819749]  nullb_device_power_store+0x149/0x240 [null_blk]
[  230.820515]  configfs_write_iter+0x10c/0x1d0
[  230.821087]  vfs_write+0x26e/0x6f0
[  230.821528]  ksys_write+0x79/0x180
[  230.821878]  __x64_sys_write+0x1d/0x30
[  230.822254]  x64_sys_call+0x45c4/0x45f0
[  230.822648]  do_syscall_64+0xa7/0x500
[  230.823018]  ? clear_bhb_loop+0x40/0x90
[  230.823408]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  230.823921] RIP: 0033:0x7fc52f263387
[  230.824291] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 
1f 00 f3 0f 1e fa 64 4
[  230.826125] RSP: 002b:00007ffc51e3b338 EFLAGS: 00000246 ORIG_RAX: 
0000000000000001
[  230.826884] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 
00007fc52f263387
[  230.827603] RDX: 0000000000000002 RSI: 000055d6b57f4b60 RDI: 
0000000000000001
[  230.828331] RBP: 000055d6b57f4b60 R08: 000000000000000a R09: 
00007fc52f2f94e0
[  230.829051] R10: 00007fc52f2f93e0 R11: 0000000000000246 R12: 
0000000000000002
[  230.829770] R13: 00007fc52f336520 R14: 0000000000000002 R15: 
00007fc52f336700
[  230.830493]  </TASK>
[  230.830725] ---[ end trace 0000000000000000 ]---
[  230.831190] ------------[ cut here ]------------
[  230.831662] kernfs: can not remove 'nr_reserved_tags', no directory

And I wonder why syzkaller testing hasn't caught this.

Thanks,
Kuai

> 
> Thanks,
> --Nilay
> .
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCHv2] block: restore two stage elevator switch while running nr_hw_queue update
  2025-07-23  6:58         ` Yu Kuai
@ 2025-07-23 11:05           ` Nilay Shroff
  2025-07-25  1:15             ` Yu Kuai
  0 siblings, 1 reply; 10+ messages in thread
From: Nilay Shroff @ 2025-07-23 11:05 UTC (permalink / raw)
  To: Yu Kuai, linux-block
  Cc: yi.zhang, ming.lei, hch, axboe, shinichiro.kawasaki, gjoyce,
	yukuai (C)



On 7/23/25 12:28 PM, Yu Kuai wrote:
>>>>> BTW, this is not related to this patch. Should we handle fall_back
>>>>> failure like blk_mq_sysfs_register_hctxs()?
>>>>>
>>>> Okay, I guess you meant handling the failure case by unwinding the
>>>> queue instead of looping through it from start to end. If yes, then it
>>>> could be done, but again we may not want to do it in a bug fix patch.
>>>>
>>>
>>> Not like that; actually, I don't have any ideas for now. The hctxs are
>>> unregistered first, and if registering fails, for example due to -ENOMEM,
>>> I can't find a way to fall back :(
>>>
>> If registering new hctxs fails, we fall back to the previous value of
>> nr_hw_queues (prev_nr_hw_queues). When prev_nr_hw_queues is less than
>> the new nr_hw_queues, we do not reallocate memory for the existing hctxs—
>> instead, we reuse the memory that was already allocated.
>>
>> Memory allocation is only attempted for the additional hctxs beyond
>> prev_nr_hw_queues. Therefore, if memory allocation for these new hctxs
>> fails, we can safely fall back to prev_nr_hw_queues because the memory
>> of the previously allocated hctxs remains intact.
> 
> No, like I said before, blk_mq_sysfs_unregister_hctxs() will free memory
> by kobject_del() for hctx->kobj and ctx->kobj, and
> __blk_mq_update_nr_hw_queues() calls that helper at the beginning.
> Later, in the fallback code, blk_mq_sysfs_register_hctxs() can fail on
> memory allocation in kobject_add(); however, the return value is not
> checked.
> 
This can be done by checking the kobject state in sysfs: kobj->state_in_sysfs.
If kobj->state_in_sysfs is 1, it implies that kobject_add() for this kobj
was successful and we can safely call kobject_del() on it; otherwise we
can skip it. We already have a few places in the kernel using this trick;
for instance, see sysfs_slab_unlink(). So, IMO, a similar technique could be
used for hctx->kobj and ctx->kobj as well when we attempt to delete these
kobjects while unregistering the queue or updating nr_hw_queues.
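
A hedged kernel-style pseudocode sketch of that idea, following the
sysfs_slab_unlink() pattern (untested, for illustration only):

```c
/* Only delete kobjects that actually made it into sysfs. */
static void blk_mq_unregister_hctx_sketch(struct blk_mq_hw_ctx *hctx)
{
	struct blk_mq_ctx *ctx;
	int i;

	hctx_for_each_ctx(hctx, ctx, i)
		if (ctx->kobj.state_in_sysfs)
			kobject_del(&ctx->kobj);

	if (hctx->kobj.state_in_sysfs)
		kobject_del(&hctx->kobj);
}
```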

Thanks,
--Nilay

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCHv2] block: restore two stage elevator switch while running nr_hw_queue update
  2025-07-23 11:05           ` Nilay Shroff
@ 2025-07-25  1:15             ` Yu Kuai
  0 siblings, 0 replies; 10+ messages in thread
From: Yu Kuai @ 2025-07-25  1:15 UTC (permalink / raw)
  To: Nilay Shroff, Yu Kuai, linux-block
  Cc: yi.zhang, ming.lei, hch, axboe, shinichiro.kawasaki, gjoyce,
	yukuai (C)

Hi,

On 2025/07/23 19:05, Nilay Shroff wrote:
> 
> 
> On 7/23/25 12:28 PM, Yu Kuai wrote:
>>>>>> BTW, this is not related to this patch. Should we handle fall_back
>>>>>> failure like blk_mq_sysfs_register_hctxs()?
>>>>>>
>>>>> Okay, I guess you meant handling the failure case by unwinding the
>>>>> queue instead of looping through it from start to end. If yes, then it
>>>>> could be done, but again we may not want to do it in a bug fix patch.
>>>>>
>>>>
>>>> Not like that; actually, I don't have any ideas for now. The hctxs are
>>>> unregistered first, and if registering fails, for example due to -ENOMEM,
>>>> I can't find a way to fall back :(
>>>>
>>> If registering new hctxs fails, we fall back to the previous value of
>>> nr_hw_queues (prev_nr_hw_queues). When prev_nr_hw_queues is less than
>>> the new nr_hw_queues, we do not reallocate memory for the existing hctxs—
>>> instead, we reuse the memory that was already allocated.
>>>
>>> Memory allocation is only attempted for the additional hctxs beyond
>>> prev_nr_hw_queues. Therefore, if memory allocation for these new hctxs
>>> fails, we can safely fall back to prev_nr_hw_queues because the memory
>>> of the previously allocated hctxs remains intact.
>>
>> No, like I said before, blk_mq_sysfs_unregister_hctxs() will free memory
>> by kobject_del() for hctx->kobj and ctx->kobj, and
>> __blk_mq_update_nr_hw_queues() call that helper in the beginning.
>> And later in the fall back code, blk_mq_sysfs_register_hctxs() can fail
>> by memory allocation in kobject_add(), however, the return value is not
>> checked.
>>
> This can be done by checking the kobject state in sysfs: kobj->state_in_sysfs.
> If kobj->state_in_sysfs is 1, it implies that kobject_add() for this kobj
> was successful and we can safely call kobject_del() on it; otherwise we
> can skip it. We already have a few places in the kernel using this trick;
> for instance, see sysfs_slab_unlink(). So, IMO, a similar technique could be
> used for hctx->kobj and ctx->kobj as well when we attempt to delete these
> kobjects while unregistering the queue or updating nr_hw_queues.

I don't like missing sysfs attributes; however, I don't have a better
solution, so I'm good with this.

Thanks,
Kuai

> 
> Thanks,
> --Nilay
> .
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-07-25  1:15 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-18 13:32 [PATCHv2] block: restore two stage elevator switch while running nr_hw_queue update Nilay Shroff
2025-07-20 12:19 ` Ming Lei
2025-07-20 14:25   ` Nilay Shroff
2025-07-22  2:21 ` Yu Kuai
2025-07-22 11:27   ` Nilay Shroff
2025-07-23  0:37     ` Yu Kuai
2025-07-23  6:24       ` Nilay Shroff
2025-07-23  6:58         ` Yu Kuai
2025-07-23 11:05           ` Nilay Shroff
2025-07-25  1:15             ` Yu Kuai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox