* [PATCH] block: restore two stage elevator switch while running nr_hw_queue update
@ 2025-07-17 14:11 Nilay Shroff
2025-07-17 14:44 ` Ming Lei
0 siblings, 1 reply; 5+ messages in thread
From: Nilay Shroff @ 2025-07-17 14:11 UTC (permalink / raw)
To: linux-block
Cc: yi.zhang, ming.lei, hch, yukuai1, axboe, shinichiro.kawasaki,
yukuai3, gjoyce
The kmemleak reports memory leaks related to elevator resources that
were originally allocated in the ->init_hctx() method. The following
leak traces are observed after running blktests:
unreferenced object 0xffff8882e7fb9000 (size 2048):
comm "check", pid 10460, jiffies 4324980514
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc c47e6a37):
__kvmalloc_node_noprof+0x55d/0x7a0
sbitmap_init_node+0x15a/0x6a0
kyber_init_hctx+0x316/0xb90
blk_mq_init_sched+0x416/0x580
elevator_switch+0x18b/0x630
elv_update_nr_hw_queues+0x219/0x2c0
__blk_mq_update_nr_hw_queues+0x36a/0x6f0
blk_mq_update_nr_hw_queues+0x3a/0x60
find_fallback+0x510/0x540 [nbd]
nbd_send_cmd+0x24b/0x1480 [nbd]
configfs_write_iter+0x2ae/0x470
vfs_write+0x524/0xe70
ksys_write+0xff/0x200
do_syscall_64+0x98/0x3c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
unreferenced object 0xffff8882e7fbb000 (size 2048):
comm "check", pid 10460, jiffies 4324980514
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc c47e6a37):
__kvmalloc_node_noprof+0x55d/0x7a0
sbitmap_init_node+0x15a/0x6a0
kyber_init_hctx+0x316/0xb90
blk_mq_init_sched+0x416/0x580
elevator_switch+0x18b/0x630
elv_update_nr_hw_queues+0x219/0x2c0
__blk_mq_update_nr_hw_queues+0x36a/0x6f0
blk_mq_update_nr_hw_queues+0x3a/0x60
find_fallback+0x510/0x540 [nbd]
nbd_send_cmd+0x24b/0x1480 [nbd]
configfs_write_iter+0x2ae/0x470
vfs_write+0x524/0xe70
ksys_write+0xff/0x200
do_syscall_64+0x98/0x3c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
unreferenced object 0xffff88819e855000 (size 2048):
comm "check", pid 10460, jiffies 4324980514
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc c47e6a37):
__kvmalloc_node_noprof+0x55d/0x7a0
sbitmap_init_node+0x15a/0x6a0
kyber_init_hctx+0x316/0xb90
blk_mq_init_sched+0x416/0x580
elevator_switch+0x18b/0x630
elv_update_nr_hw_queues+0x219/0x2c0
__blk_mq_update_nr_hw_queues+0x36a/0x6f0
blk_mq_update_nr_hw_queues+0x3a/0x60
find_fallback+0x510/0x540 [nbd]
nbd_send_cmd+0x24b/0x1480 [nbd]
configfs_write_iter+0x2ae/0x470
vfs_write+0x524/0xe70
ksys_write+0xff/0x200
do_syscall_64+0x98/0x3c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The issue arises while we run nr_hw_queue update, Specifically, we first
reallocate hardware contexts (hctx) via __blk_mq_realloc_hw_ctxs(), and
then later invoke elevator_switch() (assuming q->elevator is not NULL).
The elevator switch code would first exit old elevator (elevator_exit)
and then switches to the new elevator. The elevator_exit loops through
each hctx and invokes the elevator’s per-hctx exit method ->exit_hctx(),
which releases resources allocated during ->init_hctx().
This memleak manifests when we reduce the num of h/w queues - for example,
when the initial update sets the number of queues to X, and a later update
reduces it to Y, where Y < X. In this case, we'd loose the access to old
hctxs while we get to elevator exit code because __blk_mq_realloc_hw_ctxs
would have already released the old hctxs. As we don't now have any
reference left to the old hctxs, we don't have any way to free the
scheduler resources (which are allocate in ->init_hctx()) and kmemleak
complains about it.
This issue was caused due to the commit 596dce110b7d ("block: simplify
elevator reattachment for updating nr_hw_queues"). That change unified
the two-stage elevator teardown and reattachment into a single call that
occurs after __blk_mq_realloc_hw_ctxs() has already freed the hctxs.
This patch restores the previous two-stage elevator switch logic during
nr_hw_queues updates. First, the elevator is switched to 'none', which
ensures all scheduler resources are properly freed. Then, the hardware
contexts (hctxs) are reallocated, and the software-to-hardware queue
mappings are updated. Finally, the original elevator is reattached. This
sequence prevents loss of references to old hctxs and avoids the scheduler
resource leaks reported by kmemleak.
Reported-by : Yi Zhang <yi.zhang@redhat.com>
Fixes: 596dce110b7d ("block: simplify elevator reattachment for updating nr_hw_queues")
Closes: https://lore.kernel.org/all/CAHj4cs8oJFvz=daCvjHM5dYCNQH4UXwSySPPU4v-WHce_kZXZA@mail.gmail.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-mq.c | 87 +++++++++++++++++++++++++++++++++++++++++++-----
block/blk.h | 4 ++-
block/elevator.c | 40 ++--------------------
block/elevator.h | 11 ++++++
4 files changed, 94 insertions(+), 48 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4806b867e37d..b26cbf2a86dd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4966,6 +4966,63 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
return ret;
}
+/*
+ * Switch back to the elevator name stored in the xarray.
+ */
+static void blk_mq_elv_switch_back(struct request_queue *q,
+ struct xarray *elv_tbl)
+{
+ struct elv_change_ctx ctx = {};
+ int ret = -ENODEV;
+ char *elevator_name = xa_load(elv_tbl, q->id);
+
+ WARN_ON_ONCE(q->mq_freeze_depth == 0);
+
+ mutex_lock(&q->elevator_lock);
+ if (elevator_name && !blk_queue_dying(q) && blk_queue_registered(q)) {
+ ctx.name = elevator_name;
+
+ /* force to reattach elevator after nr_hw_queue is updated */
+ ret = elevator_switch(q, &ctx);
+ }
+ mutex_unlock(&q->elevator_lock);
+ blk_mq_unfreeze_queue_nomemrestore(q);
+ if (!ret)
+ WARN_ON_ONCE(elevator_change_done(q, &ctx));
+}
+
+/*
+ * Stores elevator name in xarray and set current elevator to none. It uses
+ * q->id as an index to store the elevator name into the xarray.
+ */
+static int blk_mq_elv_switch_none(struct request_queue *q,
+ struct xarray *elv_tbl)
+{
+ int ret = 0;
+
+ lockdep_assert_held_write(&q->tag_set->update_nr_hwq_lock);
+
+ /*
+ * Accessing q->elevator without holding q->elevator_lock is safe here
+ * because we're called from nr_hw_queue update which is protected by
+ * set->update_nr_hwq_lock in the writer context. So, scheduler update/
+ * switch code (which acquires the same lock in the reader context)
+ * can't run concurrently.
+ */
+ if (q->elevator) {
+ char *elevator_name = (char *)q->elevator->type->elevator_name;
+
+ ret = xa_insert(elv_tbl, q->id, elevator_name, GFP_KERNEL);
+ if (ret) {
+ WARN_ON_ONCE(1);
+ goto out;
+ }
+ elevator_set_none(q);
+ }
+out:
+ return ret;
+}
+
static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
int nr_hw_queues)
{
@@ -4973,6 +5030,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
int prev_nr_hw_queues = set->nr_hw_queues;
unsigned int memflags;
int i;
+ struct xarray elv_tbl;
lockdep_assert_held(&set->tag_list_lock);
@@ -4984,6 +5042,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
return;
memflags = memalloc_noio_save();
+
+ xa_init(&elv_tbl);
+
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_debugfs_unregister_hctxs(q);
blk_mq_sysfs_unregister_hctxs(q);
@@ -4992,11 +5053,17 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
list_for_each_entry(q, &set->tag_list, tag_set_list)
blk_mq_freeze_queue_nomemsave(q);
- if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) {
- list_for_each_entry(q, &set->tag_list, tag_set_list)
- blk_mq_unfreeze_queue_nomemrestore(q);
- goto reregister;
- }
+ /*
+ * Switch IO scheduler to 'none', cleaning up the data associated
+ * with the previous scheduler. We will switch back once we are done
+ * updating the new sw to hw queue mappings.
+ */
+ list_for_each_entry(q, &set->tag_list, tag_set_list)
+ if (blk_mq_elv_switch_none(q, &elv_tbl))
+ goto switch_back;
+
+ if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
+ goto switch_back;
fallback:
blk_mq_update_queue_map(set);
@@ -5016,12 +5083,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
}
blk_mq_map_swqueue(q);
}
-
- /* elv_update_nr_hw_queues() unfreeze queue for us */
+switch_back:
+ /* blk_mq_elv_switch_back() unfreezes queue for us */
list_for_each_entry(q, &set->tag_list, tag_set_list)
- elv_update_nr_hw_queues(q);
+ blk_mq_elv_switch_back(q, &elv_tbl);
-reregister:
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_sysfs_register_hctxs(q);
blk_mq_debugfs_register_hctxs(q);
@@ -5029,6 +5095,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
blk_mq_remove_hw_queues_cpuhp(q);
blk_mq_add_hw_queues_cpuhp(q);
}
+
+ xa_destroy(&elv_tbl);
+
memalloc_noio_restore(memflags);
/* Free the excess tags when nr_hw_queues shrink. */
diff --git a/block/blk.h b/block/blk.h
index 37ec459fe656..35ec19a085fd 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -12,6 +12,7 @@
#include "blk-crypto-internal.h"
struct elevator_type;
+struct elv_change_ctx;
#define BLK_DEV_MAX_SECTORS (LLONG_MAX >> 9)
#define BLK_MIN_SEGMENT_SIZE 4096
@@ -321,9 +322,10 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
bool blk_insert_flush(struct request *rq);
-void elv_update_nr_hw_queues(struct request_queue *q);
void elevator_set_default(struct request_queue *q);
void elevator_set_none(struct request_queue *q);
+int elevator_change_done(struct request_queue *q, struct elv_change_ctx *ctx);
+int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx);
ssize_t part_size_show(struct device *dev, struct device_attribute *attr,
char *buf);
diff --git a/block/elevator.c b/block/elevator.c
index ab22542e6cf0..016fe9c08c8c 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -45,17 +45,6 @@
#include "blk-wbt.h"
#include "blk-cgroup.h"
-/* Holding context data for changing elevator */
-struct elv_change_ctx {
- const char *name;
- bool no_uevent;
-
- /* for unregistering old elevator */
- struct elevator_queue *old;
- /* for registering new elevator */
- struct elevator_queue *new;
-};
-
static DEFINE_SPINLOCK(elv_list_lock);
static LIST_HEAD(elv_list);
@@ -570,7 +559,7 @@ EXPORT_SYMBOL_GPL(elv_unregister);
* If switching fails, we are most likely running out of memory and not able
* to restore the old io scheduler, so leaving the io scheduler being none.
*/
-static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
+int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
{
struct elevator_type *new_e = NULL;
int ret = 0;
@@ -631,8 +620,7 @@ static void elv_exit_and_release(struct request_queue *q)
kobject_put(&e->kobj);
}
-static int elevator_change_done(struct request_queue *q,
- struct elv_change_ctx *ctx)
+int elevator_change_done(struct request_queue *q, struct elv_change_ctx *ctx)
{
int ret = 0;
@@ -685,30 +673,6 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
return ret;
}
-/*
- * The I/O scheduler depends on the number of hardware queues, this forces a
- * reattachment when nr_hw_queues changes.
- */
-void elv_update_nr_hw_queues(struct request_queue *q)
-{
- struct elv_change_ctx ctx = {};
- int ret = -ENODEV;
-
- WARN_ON_ONCE(q->mq_freeze_depth == 0);
-
- mutex_lock(&q->elevator_lock);
- if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
- ctx.name = q->elevator->type->elevator_name;
-
- /* force to reattach elevator after nr_hw_queue is updated */
- ret = elevator_switch(q, &ctx);
- }
- mutex_unlock(&q->elevator_lock);
- blk_mq_unfreeze_queue_nomemrestore(q);
- if (!ret)
- WARN_ON_ONCE(elevator_change_done(q, &ctx));
-}
-
/*
* Use the default elevator settings. If the chosen elevator initialization
* fails, fall back to the "none" elevator (no elevator).
diff --git a/block/elevator.h b/block/elevator.h
index a07ce773a38f..440b6e766848 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -85,6 +85,17 @@ struct elevator_type
struct list_head list;
};
+/* Holding context data for changing elevator */
+struct elv_change_ctx {
+ const char *name;
+ bool no_uevent;
+
+ /* for unregistering old elevator */
+ struct elevator_queue *old;
+ /* for registering new elevator */
+ struct elevator_queue *new;
+};
+
static inline bool elevator_tryget(struct elevator_type *e)
{
return try_module_get(e->elevator_owner);
--
2.50.1
^ permalink raw reply related [flat|nested] 5+ messages in thread* Re: [PATCH] block: restore two stage elevator switch while running nr_hw_queue update
2025-07-17 14:11 [PATCH] block: restore two stage elevator switch while running nr_hw_queue update Nilay Shroff
@ 2025-07-17 14:44 ` Ming Lei
2025-07-17 18:43 ` Nilay Shroff
0 siblings, 1 reply; 5+ messages in thread
From: Ming Lei @ 2025-07-17 14:44 UTC (permalink / raw)
To: Nilay Shroff
Cc: linux-block, yi.zhang, hch, yukuai1, axboe, shinichiro.kawasaki,
yukuai3, gjoyce
On Thu, Jul 17, 2025 at 07:41:20PM +0530, Nilay Shroff wrote:
> The kmemleak reports memory leaks related to elevator resources that
> were originally allocated in the ->init_hctx() method. The following
> leak traces are observed after running blktests:
>
> unreferenced object 0xffff8882e7fb9000 (size 2048):
> comm "check", pid 10460, jiffies 4324980514
> hex dump (first 32 bytes):
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> backtrace (crc c47e6a37):
> __kvmalloc_node_noprof+0x55d/0x7a0
> sbitmap_init_node+0x15a/0x6a0
> kyber_init_hctx+0x316/0xb90
> blk_mq_init_sched+0x416/0x580
> elevator_switch+0x18b/0x630
> elv_update_nr_hw_queues+0x219/0x2c0
> __blk_mq_update_nr_hw_queues+0x36a/0x6f0
> blk_mq_update_nr_hw_queues+0x3a/0x60
> find_fallback+0x510/0x540 [nbd]
> nbd_send_cmd+0x24b/0x1480 [nbd]
> configfs_write_iter+0x2ae/0x470
> vfs_write+0x524/0xe70
> ksys_write+0xff/0x200
> do_syscall_64+0x98/0x3c0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> unreferenced object 0xffff8882e7fbb000 (size 2048):
> comm "check", pid 10460, jiffies 4324980514
> hex dump (first 32 bytes):
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> backtrace (crc c47e6a37):
> __kvmalloc_node_noprof+0x55d/0x7a0
> sbitmap_init_node+0x15a/0x6a0
> kyber_init_hctx+0x316/0xb90
> blk_mq_init_sched+0x416/0x580
> elevator_switch+0x18b/0x630
> elv_update_nr_hw_queues+0x219/0x2c0
> __blk_mq_update_nr_hw_queues+0x36a/0x6f0
> blk_mq_update_nr_hw_queues+0x3a/0x60
> find_fallback+0x510/0x540 [nbd]
> nbd_send_cmd+0x24b/0x1480 [nbd]
> configfs_write_iter+0x2ae/0x470
> vfs_write+0x524/0xe70
> ksys_write+0xff/0x200
> do_syscall_64+0x98/0x3c0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> unreferenced object 0xffff88819e855000 (size 2048):
> comm "check", pid 10460, jiffies 4324980514
> hex dump (first 32 bytes):
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> backtrace (crc c47e6a37):
> __kvmalloc_node_noprof+0x55d/0x7a0
> sbitmap_init_node+0x15a/0x6a0
> kyber_init_hctx+0x316/0xb90
> blk_mq_init_sched+0x416/0x580
> elevator_switch+0x18b/0x630
> elv_update_nr_hw_queues+0x219/0x2c0
> __blk_mq_update_nr_hw_queues+0x36a/0x6f0
> blk_mq_update_nr_hw_queues+0x3a/0x60
> find_fallback+0x510/0x540 [nbd]
> nbd_send_cmd+0x24b/0x1480 [nbd]
> configfs_write_iter+0x2ae/0x470
> vfs_write+0x524/0xe70
> ksys_write+0xff/0x200
> do_syscall_64+0x98/0x3c0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> The issue arises while we run nr_hw_queue update, Specifically, we first
> reallocate hardware contexts (hctx) via __blk_mq_realloc_hw_ctxs(), and
> then later invoke elevator_switch() (assuming q->elevator is not NULL).
> The elevator switch code would first exit old elevator (elevator_exit)
> and then switches to the new elevator. The elevator_exit loops through
> each hctx and invokes the elevator’s per-hctx exit method ->exit_hctx(),
> which releases resources allocated during ->init_hctx().
>
> This memleak manifests when we reduce the num of h/w queues - for example,
> when the initial update sets the number of queues to X, and a later update
> reduces it to Y, where Y < X. In this case, we'd loose the access to old
> hctxs while we get to elevator exit code because __blk_mq_realloc_hw_ctxs
> would have already released the old hctxs. As we don't now have any
> reference left to the old hctxs, we don't have any way to free the
> scheduler resources (which are allocate in ->init_hctx()) and kmemleak
> complains about it.
>
> This issue was caused due to the commit 596dce110b7d ("block: simplify
> elevator reattachment for updating nr_hw_queues"). That change unified
> the two-stage elevator teardown and reattachment into a single call that
> occurs after __blk_mq_realloc_hw_ctxs() has already freed the hctxs.
>
> This patch restores the previous two-stage elevator switch logic during
> nr_hw_queues updates. First, the elevator is switched to 'none', which
> ensures all scheduler resources are properly freed. Then, the hardware
> contexts (hctxs) are reallocated, and the software-to-hardware queue
> mappings are updated. Finally, the original elevator is reattached. This
> sequence prevents loss of references to old hctxs and avoids the scheduler
> resource leaks reported by kmemleak.
>
> Reported-by : Yi Zhang <yi.zhang@redhat.com>
> Fixes: 596dce110b7d ("block: simplify elevator reattachment for updating nr_hw_queues")
> Closes: https://lore.kernel.org/all/CAHj4cs8oJFvz=daCvjHM5dYCNQH4UXwSySPPU4v-WHce_kZXZA@mail.gmail.com/
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> block/blk-mq.c | 87 +++++++++++++++++++++++++++++++++++++++++++-----
> block/blk.h | 4 ++-
> block/elevator.c | 40 ++--------------------
> block/elevator.h | 11 ++++++
> 4 files changed, 94 insertions(+), 48 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4806b867e37d..b26cbf2a86dd 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4966,6 +4966,63 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
> return ret;
> }
>
> +/*
> + * Switch back to the elevator name stored in the xarray.
> + */
> +static void blk_mq_elv_switch_back(struct request_queue *q,
> + struct xarray *elv_tbl)
> +{
> + struct elv_change_ctx ctx = {};
> + int ret = -ENODEV;
> + char *elevator_name = xa_load(elv_tbl, q->id);
> +
> + WARN_ON_ONCE(q->mq_freeze_depth == 0);
> +
> + mutex_lock(&q->elevator_lock);
> + if (elevator_name && !blk_queue_dying(q) && blk_queue_registered(q)) {
> + ctx.name = elevator_name;
> +
> + /* force to reattach elevator after nr_hw_queue is updated */
> + ret = elevator_switch(q, &ctx);
> + }
> + mutex_unlock(&q->elevator_lock);
> + blk_mq_unfreeze_queue_nomemrestore(q);
> + if (!ret)
> + WARN_ON_ONCE(elevator_change_done(q, &ctx));
> +}
> +
> +/*
> + * Stores elevator name in xarray and set current elevator to none. It uses
> + * q->id as an index to store the elevator name into the xarray.
> + */
> +static int blk_mq_elv_switch_none(struct request_queue *q,
> + struct xarray *elv_tbl)
> +{
> + int ret = 0;
> +
> + lockdep_assert_held_write(&q->tag_set->update_nr_hwq_lock);
> +
> + /*
> + * Accessing q->elevator without holding q->elevator_lock is safe here
> + * because we're called from nr_hw_queue update which is protected by
> + * set->update_nr_hwq_lock in the writer context. So, scheduler update/
> + * switch code (which acquires the same lock in the reader context)
> + * can't run concurrently.
> + */
> + if (q->elevator) {
> + char *elevator_name = (char *)q->elevator->type->elevator_name;
> +
> + ret = xa_insert(elv_tbl, q->id, elevator_name, GFP_KERNEL);
This way isn't memory safe, because elevator module can be reloaded
during updating nr_hw_queues. One solution is to build a string<->int mapping
table, and store the (int) index of elevator type to xarray.
> + if (ret) {
> + WARN_ON_ONCE(1);
> + goto out;
> + }
> + elevator_set_none(q);
> + }
> +out:
> + return ret;
> +}
> +
> static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> int nr_hw_queues)
> {
> @@ -4973,6 +5030,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> int prev_nr_hw_queues = set->nr_hw_queues;
> unsigned int memflags;
> int i;
> + struct xarray elv_tbl;
>
> lockdep_assert_held(&set->tag_list_lock);
>
> @@ -4984,6 +5042,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> return;
>
> memflags = memalloc_noio_save();
> +
> + xa_init(&elv_tbl);
> +
> list_for_each_entry(q, &set->tag_list, tag_set_list) {
> blk_mq_debugfs_unregister_hctxs(q);
> blk_mq_sysfs_unregister_hctxs(q);
> @@ -4992,11 +5053,17 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> list_for_each_entry(q, &set->tag_list, tag_set_list)
> blk_mq_freeze_queue_nomemsave(q);
>
> - if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) {
> - list_for_each_entry(q, &set->tag_list, tag_set_list)
> - blk_mq_unfreeze_queue_nomemrestore(q);
> - goto reregister;
> - }
> + /*
> + * Switch IO scheduler to 'none', cleaning up the data associated
> + * with the previous scheduler. We will switch back once we are done
> + * updating the new sw to hw queue mappings.
> + */
> + list_for_each_entry(q, &set->tag_list, tag_set_list)
> + if (blk_mq_elv_switch_none(q, &elv_tbl))
> + goto switch_back;
> +
> + if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
> + goto switch_back;
>
> fallback:
> blk_mq_update_queue_map(set);
> @@ -5016,12 +5083,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> }
> blk_mq_map_swqueue(q);
> }
> -
> - /* elv_update_nr_hw_queues() unfreeze queue for us */
> +switch_back:
> + /* blk_mq_elv_switch_back() unfreezes queue for us */
> list_for_each_entry(q, &set->tag_list, tag_set_list)
> - elv_update_nr_hw_queues(q);
> + blk_mq_elv_switch_back(q, &elv_tbl);
>
> -reregister:
> list_for_each_entry(q, &set->tag_list, tag_set_list) {
> blk_mq_sysfs_register_hctxs(q);
> blk_mq_debugfs_register_hctxs(q);
> @@ -5029,6 +5095,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> blk_mq_remove_hw_queues_cpuhp(q);
> blk_mq_add_hw_queues_cpuhp(q);
> }
> +
> + xa_destroy(&elv_tbl);
> +
> memalloc_noio_restore(memflags);
>
> /* Free the excess tags when nr_hw_queues shrink. */
> diff --git a/block/blk.h b/block/blk.h
> index 37ec459fe656..35ec19a085fd 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -12,6 +12,7 @@
> #include "blk-crypto-internal.h"
>
> struct elevator_type;
> +struct elv_change_ctx;
>
> #define BLK_DEV_MAX_SECTORS (LLONG_MAX >> 9)
> #define BLK_MIN_SEGMENT_SIZE 4096
> @@ -321,9 +322,10 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
>
> bool blk_insert_flush(struct request *rq);
>
> -void elv_update_nr_hw_queues(struct request_queue *q);
> void elevator_set_default(struct request_queue *q);
> void elevator_set_none(struct request_queue *q);
> +int elevator_change_done(struct request_queue *q, struct elv_change_ctx *ctx);
> +int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx);
>
> ssize_t part_size_show(struct device *dev, struct device_attribute *attr,
> char *buf);
> diff --git a/block/elevator.c b/block/elevator.c
> index ab22542e6cf0..016fe9c08c8c 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -45,17 +45,6 @@
> #include "blk-wbt.h"
> #include "blk-cgroup.h"
>
> -/* Holding context data for changing elevator */
> -struct elv_change_ctx {
> - const char *name;
> - bool no_uevent;
> -
> - /* for unregistering old elevator */
> - struct elevator_queue *old;
> - /* for registering new elevator */
> - struct elevator_queue *new;
> -};
> -
> static DEFINE_SPINLOCK(elv_list_lock);
> static LIST_HEAD(elv_list);
>
> @@ -570,7 +559,7 @@ EXPORT_SYMBOL_GPL(elv_unregister);
> * If switching fails, we are most likely running out of memory and not able
> * to restore the old io scheduler, so leaving the io scheduler being none.
> */
> -static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
> +int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
> {
> struct elevator_type *new_e = NULL;
> int ret = 0;
> @@ -631,8 +620,7 @@ static void elv_exit_and_release(struct request_queue *q)
> kobject_put(&e->kobj);
> }
>
> -static int elevator_change_done(struct request_queue *q,
> - struct elv_change_ctx *ctx)
> +int elevator_change_done(struct request_queue *q, struct elv_change_ctx *ctx)
> {
> int ret = 0;
>
> @@ -685,30 +673,6 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
> return ret;
> }
>
> -/*
> - * The I/O scheduler depends on the number of hardware queues, this forces a
> - * reattachment when nr_hw_queues changes.
> - */
> -void elv_update_nr_hw_queues(struct request_queue *q)
> -{
> - struct elv_change_ctx ctx = {};
> - int ret = -ENODEV;
> -
> - WARN_ON_ONCE(q->mq_freeze_depth == 0);
> -
> - mutex_lock(&q->elevator_lock);
> - if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
> - ctx.name = q->elevator->type->elevator_name;
> -
> - /* force to reattach elevator after nr_hw_queue is updated */
> - ret = elevator_switch(q, &ctx);
> - }
> - mutex_unlock(&q->elevator_lock);
> - blk_mq_unfreeze_queue_nomemrestore(q);
> - if (!ret)
> - WARN_ON_ONCE(elevator_change_done(q, &ctx));
> -}
> -
> /*
> * Use the default elevator settings. If the chosen elevator initialization
> * fails, fall back to the "none" elevator (no elevator).
> diff --git a/block/elevator.h b/block/elevator.h
> index a07ce773a38f..440b6e766848 100644
> --- a/block/elevator.h
> +++ b/block/elevator.h
> @@ -85,6 +85,17 @@ struct elevator_type
> struct list_head list;
> };
>
> +/* Holding context data for changing elevator */
> +struct elv_change_ctx {
> + const char *name;
> + bool no_uevent;
> +
> + /* for unregistering old elevator */
> + struct elevator_queue *old;
> + /* for registering new elevator */
> + struct elevator_queue *new;
> +};
> +
You may avoid the big chunk of code move for both `elv_change_ctx` and
`elv_update_nr_hw_queues()`, not sure why you did that in a bug fix patch.
Thanks,
Ming
^ permalink raw reply [flat|nested] 5+ messages in thread* Re: [PATCH] block: restore two stage elevator switch while running nr_hw_queue update
2025-07-17 14:44 ` Ming Lei
@ 2025-07-17 18:43 ` Nilay Shroff
2025-07-18 2:01 ` Ming Lei
0 siblings, 1 reply; 5+ messages in thread
From: Nilay Shroff @ 2025-07-17 18:43 UTC (permalink / raw)
To: Ming Lei
Cc: linux-block, yi.zhang, hch, yukuai1, axboe, shinichiro.kawasaki,
yukuai3, gjoyce
On 7/17/25 8:14 PM, Ming Lei wrote:
>> +static int blk_mq_elv_switch_none(struct request_queue *q,
>> + struct xarray *elv_tbl)
>> +{
>> + int ret = 0;
>> +
>> + lockdep_assert_held_write(&q->tag_set->update_nr_hwq_lock);
>> +
>> + /*
>> + * Accessing q->elevator without holding q->elevator_lock is safe here
>> + * because we're called from nr_hw_queue update which is protected by
>> + * set->update_nr_hwq_lock in the writer context. So, scheduler update/
>> + * switch code (which acquires the same lock in the reader context)
>> + * can't run concurrently.
>> + */
>> + if (q->elevator) {
>> + char *elevator_name = (char *)q->elevator->type->elevator_name;
>> +
>> + ret = xa_insert(elv_tbl, q->id, elevator_name, GFP_KERNEL);
>
> This way isn't memory safe, because elevator module can be reloaded
> during updating nr_hw_queues. One solution is to build a string<->int mapping
> table, and store the (int) index of elevator type to xarray.
>
Yes good point! How about avoiding this issue by using __elevator_get() and
elevator_put()? We can take module reference while switching elevator to 'none'
and then put the module reference while we switch back elevator.
>> -void elv_update_nr_hw_queues(struct request_queue *q)
>> -{
>> - struct elv_change_ctx ctx = {};
>> - int ret = -ENODEV;
>> -
>> - WARN_ON_ONCE(q->mq_freeze_depth == 0);
>> -
>> - mutex_lock(&q->elevator_lock);
>> - if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
>> - ctx.name = q->elevator->type->elevator_name;
>> -
>> - /* force to reattach elevator after nr_hw_queue is updated */
>> - ret = elevator_switch(q, &ctx);
>> - }
>> - mutex_unlock(&q->elevator_lock);
>> - blk_mq_unfreeze_queue_nomemrestore(q);
>> - if (!ret)
>> - WARN_ON_ONCE(elevator_change_done(q, &ctx));
>> -}
>> -
>> /*
>> * Use the default elevator settings. If the chosen elevator initialization
>> * fails, fall back to the "none" elevator (no elevator).
>> diff --git a/block/elevator.h b/block/elevator.h
>> index a07ce773a38f..440b6e766848 100644
>> --- a/block/elevator.h
>> +++ b/block/elevator.h
>> @@ -85,6 +85,17 @@ struct elevator_type
>> struct list_head list;
>> };
>>
>> +/* Holding context data for changing elevator */
>> +struct elv_change_ctx {
>> + const char *name;
>> + bool no_uevent;
>> +
>> + /* for unregistering old elevator */
>> + struct elevator_queue *old;
>> + /* for registering new elevator */
>> + struct elevator_queue *new;
>> +};
>> +
>
> You may avoid the big chunk of code move for both `elv_change_ctx` and
> `elv_update_nr_hw_queues()`, not sure why you did that in a bug fix patch.
>
This is because I’ve reintroduced the blk_mq_elv_switch_none and
blk_mq_elv_switch_back functions. I replaced elv_update_nr_hw_queues with
blk_mq_elv_switch_back(), which was the approach used prior to commit
596dce110b7d ("block: simplify elevator reattachment for updating nr_hw_queues").
Both blk_mq_elv_switch_none and blk_mq_elv_switch_back are defined in blk-mq.c
and use the elevator APIs for switching elevators. These APIs — namely
elevator_switch and elevator_change_done — rely on the elv_change_ctx structure.
As a result, I had to move some code around to ensure elv_change_ctx is accessible
from within blk-mq.c.
While it would be possible to avoid this code movement by continuing to use
elv_update_nr_hw_queues, I felt it was cleaner to restore the two-stage elevator
switch more fully by reintroducing the blk_mq_elv_switch_none and blk_mq_elv_switch_back
APIs, which were used prior to the simplification in the aforementioned commit.
Do you see any concerns with this code movement? Or is such restructuring generally
avoided in a bug fix patch?
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 5+ messages in thread* Re: [PATCH] block: restore two stage elevator switch while running nr_hw_queue update
2025-07-17 18:43 ` Nilay Shroff
@ 2025-07-18 2:01 ` Ming Lei
2025-07-18 6:06 ` Nilay Shroff
0 siblings, 1 reply; 5+ messages in thread
From: Ming Lei @ 2025-07-18 2:01 UTC (permalink / raw)
To: Nilay Shroff
Cc: linux-block, yi.zhang, hch, yukuai1, axboe, shinichiro.kawasaki,
yukuai3, gjoyce
On Fri, Jul 18, 2025 at 12:13:25AM +0530, Nilay Shroff wrote:
>
>
> On 7/17/25 8:14 PM, Ming Lei wrote:
>
> >> +static int blk_mq_elv_switch_none(struct request_queue *q,
> >> + struct xarray *elv_tbl)
> >> +{
> >> + int ret = 0;
> >> +
> >> + lockdep_assert_held_write(&q->tag_set->update_nr_hwq_lock);
> >> +
> >> + /*
> >> + * Accessing q->elevator without holding q->elevator_lock is safe here
> >> + * because we're called from nr_hw_queue update which is protected by
> >> + * set->update_nr_hwq_lock in the writer context. So, scheduler update/
> >> + * switch code (which acquires the same lock in the reader context)
> >> + * can't run concurrently.
> >> + */
> >> + if (q->elevator) {
> >> + char *elevator_name = (char *)q->elevator->type->elevator_name;
> >> +
> >> + ret = xa_insert(elv_tbl, q->id, elevator_name, GFP_KERNEL);
> >
> > This way isn't memory safe, because elevator module can be reloaded
> > during updating nr_hw_queues. One solution is to build a string<->int mapping
> > table, and store the (int) index of elevator type to xarray.
> >
> Yes good point! How about avoiding this issue by using __elevator_get() and
> elevator_put()? We can take module reference while switching elevator to 'none'
> and then put the module reference while we switch back elevator.
Yeah, that is another way.
>
> >> -void elv_update_nr_hw_queues(struct request_queue *q)
> >> -{
> >> - struct elv_change_ctx ctx = {};
> >> - int ret = -ENODEV;
> >> -
> >> - WARN_ON_ONCE(q->mq_freeze_depth == 0);
> >> -
> >> - mutex_lock(&q->elevator_lock);
> >> - if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
> >> - ctx.name = q->elevator->type->elevator_name;
> >> -
> >> - /* force to reattach elevator after nr_hw_queue is updated */
> >> - ret = elevator_switch(q, &ctx);
> >> - }
> >> - mutex_unlock(&q->elevator_lock);
> >> - blk_mq_unfreeze_queue_nomemrestore(q);
> >> - if (!ret)
> >> - WARN_ON_ONCE(elevator_change_done(q, &ctx));
> >> -}
> >> -
> >> /*
> >> * Use the default elevator settings. If the chosen elevator initialization
> >> * fails, fall back to the "none" elevator (no elevator).
> >> diff --git a/block/elevator.h b/block/elevator.h
> >> index a07ce773a38f..440b6e766848 100644
> >> --- a/block/elevator.h
> >> +++ b/block/elevator.h
> >> @@ -85,6 +85,17 @@ struct elevator_type
> >> struct list_head list;
> >> };
> >>
> >> +/* Holding context data for changing elevator */
> >> +struct elv_change_ctx {
> >> + const char *name;
> >> + bool no_uevent;
> >> +
> >> + /* for unregistering old elevator */
> >> + struct elevator_queue *old;
> >> + /* for registering new elevator */
> >> + struct elevator_queue *new;
> >> +};
> >> +
> >
> > You may avoid the big chunk of code move for both `elv_change_ctx` and
> > `elv_update_nr_hw_queues()`, not sure why you did that in a bug fix patch.
> >
> This is because I’ve reintroduced the blk_mq_elv_switch_none and
> blk_mq_elv_switch_back functions. I replaced elv_update_nr_hw_queues with
> blk_mq_elv_switch_back(), which was the approach used prior to commit
> 596dce110b7d ("block: simplify elevator reattachment for updating nr_hw_queues").
>
> Both blk_mq_elv_switch_none and blk_mq_elv_switch_back are defined in blk-mq.c
> and use the elevator APIs for switching elevators. These APIs — namely
> elevator_switch and elevator_change_done — rely on the elv_change_ctx structure.
> As a result, I had to move some code around to ensure elv_change_ctx is accessible
> from within blk-mq.c.
>
> While it would be possible to avoid this code movement by continuing to use
> elv_update_nr_hw_queues, I felt it was cleaner to restore the two-stage elevator
> switch more fully by reintroducing the blk_mq_elv_switch_none and blk_mq_elv_switch_back
> APIs, which were used prior to the simplification in the aforementioned commit.
>
> Do you see any concerns with this code movement? Or is such restructuring generally
> avoided in a bug fix patch?
You still can add blk_mq_elv_switch_none() and blk_mq_elv_switch_back(),
meantime call elv_update_nr_hw_queues() from blk_mq_elv_switch_back() by
passing elevator type name.
If one bug fix is simpler, it may become easier to review & merge in -rc or
backport.
Thanks,
Ming
^ permalink raw reply [flat|nested] 5+ messages in thread* Re: [PATCH] block: restore two stage elevator switch while running nr_hw_queue update
2025-07-18 2:01 ` Ming Lei
@ 2025-07-18 6:06 ` Nilay Shroff
0 siblings, 0 replies; 5+ messages in thread
From: Nilay Shroff @ 2025-07-18 6:06 UTC (permalink / raw)
To: Ming Lei
Cc: linux-block, yi.zhang, hch, yukuai1, axboe, shinichiro.kawasaki,
yukuai3, gjoyce
On 7/18/25 7:31 AM, Ming Lei wrote:
> On Fri, Jul 18, 2025 at 12:13:25AM +0530, Nilay Shroff wrote:
>>
>>
>> On 7/17/25 8:14 PM, Ming Lei wrote:
>>
>>>> +static int blk_mq_elv_switch_none(struct request_queue *q,
>>>> + struct xarray *elv_tbl)
>>>> +{
>>>> + int ret = 0;
>>>> +
>>>> + lockdep_assert_held_write(&q->tag_set->update_nr_hwq_lock);
>>>> +
>>>> + /*
>>>> + * Accessing q->elevator without holding q->elevator_lock is safe here
>>>> + * because we're called from nr_hw_queue update which is protected by
>>>> + * set->update_nr_hwq_lock in the writer context. So, scheduler update/
>>>> + * switch code (which acquires the same lock in the reader context)
>>>> + * can't run concurrently.
>>>> + */
>>>> + if (q->elevator) {
>>>> + char *elevator_name = (char *)q->elevator->type->elevator_name;
>>>> +
>>>> + ret = xa_insert(elv_tbl, q->id, elevator_name, GFP_KERNEL);
>>>
>>> This way isn't memory safe, because elevator module can be reloaded
>>> during updating nr_hw_queues. One solution is to build a string<->int mapping
>>> table, and store the (int) index of elevator type to xarray.
>>>
>> Yes good point! How about avoiding this issue by using __elevator_get() and
>> elevator_put()? We can take module reference while switching elevator to 'none'
>> and then put the module reference while we switch back elevator.
>
> Yeah, that is another way.
>
So I will update this way in the next patch.
>>
>>>> -void elv_update_nr_hw_queues(struct request_queue *q)
>>>> -{
>>>> - struct elv_change_ctx ctx = {};
>>>> - int ret = -ENODEV;
>>>> -
>>>> - WARN_ON_ONCE(q->mq_freeze_depth == 0);
>>>> -
>>>> - mutex_lock(&q->elevator_lock);
>>>> - if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
>>>> - ctx.name = q->elevator->type->elevator_name;
>>>> -
>>>> - /* force to reattach elevator after nr_hw_queue is updated */
>>>> - ret = elevator_switch(q, &ctx);
>>>> - }
>>>> - mutex_unlock(&q->elevator_lock);
>>>> - blk_mq_unfreeze_queue_nomemrestore(q);
>>>> - if (!ret)
>>>> - WARN_ON_ONCE(elevator_change_done(q, &ctx));
>>>> -}
>>>> -
>>>> /*
>>>> * Use the default elevator settings. If the chosen elevator initialization
>>>> * fails, fall back to the "none" elevator (no elevator).
>>>> diff --git a/block/elevator.h b/block/elevator.h
>>>> index a07ce773a38f..440b6e766848 100644
>>>> --- a/block/elevator.h
>>>> +++ b/block/elevator.h
>>>> @@ -85,6 +85,17 @@ struct elevator_type
>>>> struct list_head list;
>>>> };
>>>>
>>>> +/* Holding context data for changing elevator */
>>>> +struct elv_change_ctx {
>>>> + const char *name;
>>>> + bool no_uevent;
>>>> +
>>>> + /* for unregistering old elevator */
>>>> + struct elevator_queue *old;
>>>> + /* for registering new elevator */
>>>> + struct elevator_queue *new;
>>>> +};
>>>> +
>>>
>>> You may avoid the big chunk of code move for both `elv_change_ctx` and
>>> `elv_update_nr_hw_queues()`, not sure why you did that in a bug fix patch.
>>>
>> This is because I’ve reintroduced the blk_mq_elv_switch_none and
>> blk_mq_elv_switch_back functions. I replaced elv_update_nr_hw_queues with
>> blk_mq_elv_switch_back(), which was the approach used prior to commit
>> 596dce110b7d ("block: simplify elevator reattachment for updating nr_hw_queues").
>>
>> Both blk_mq_elv_switch_none and blk_mq_elv_switch_back are defined in blk-mq.c
>> and use the elevator APIs for switching elevators. These APIs — namely
>> elevator_switch and elevator_change_done — rely on the elv_change_ctx structure.
>> As a result, I had to move some code around to ensure elv_change_ctx is accessible
>> from within blk-mq.c.
>>
>> While it would be possible to avoid this code movement by continuing to use
>> elv_update_nr_hw_queues, I felt it was cleaner to restore the two-stage elevator
>> switch more fully by reintroducing the blk_mq_elv_switch_none and blk_mq_elv_switch_back
>> APIs, which were used prior to the simplification in the aforementioned commit.
>>
>> Do you see any concerns with this code movement? Or is such restructuring generally
>> avoided in a bug fix patch?
>
> You still can add blk_mq_elv_switch_none() and blk_mq_elv_switch_back(),
> meantime call elv_update_nr_hw_queues() from blk_mq_elv_switch_back() by
> passing elevator type name.
>
> If one bug fix is simpler, it may become easier to review & merge in -rc or
> backport.
>
Okay will do this way in the next version.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2025-07-18 6:07 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-17 14:11 [PATCH] block: restore two stage elevator switch while running nr_hw_queue update Nilay Shroff
2025-07-17 14:44 ` Ming Lei
2025-07-17 18:43 ` Nilay Shroff
2025-07-18 2:01 ` Ming Lei
2025-07-18 6:06 ` Nilay Shroff
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.