* [PATCH 00/15] block: unify elevator changing and fix lockdep warning
@ 2025-04-10 13:30 Ming Lei
2025-04-10 13:30 ` [PATCH 01/15] block: don't call freeze queue in elevator_switch() and elevator_disable() Ming Lei
` (14 more replies)
0 siblings, 15 replies; 57+ messages in thread
From: Ming Lei @ 2025-04-10 13:30 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Hello Jens,
The 1st 11 patch cleans up elevator change code, and unifying it via single
helper, meantime moves kobject_add/del & debugfs register/unregister out of
queue freezing & elevator_lock. This way fixes many lockdep warnings
reported recently, especially since fs_reclaim is connected with freeze lock
manually by commit ffa1e7ada456 ("block: Make request_queue lockdep splats
show up earlier").
The other 4 ones fixes lockdep warnings from bfq and updating nr_hw_queues
code.
Also the 4th patch fixes kasan warning & oops report from Shinichiro by
preventing elevator switching when updating nr_hw_queues is in-progress.
Thanks,
Ming Lei (15):
block: don't call freeze queue in elevator_switch() and
elevator_disable()
block: add two helpers for registering/un-registering sched debugfs
block: move sched debugfs register into elvevator_register_queue
block: prevent elevator switch during updating nr_hw_queues
block: simplify elevator reset for updating nr_hw_queues
block: add helper of elevator_change()
block: move blk_unregister_queue() & device_del() after freeze wait
block: add `struct elev_change_ctx` for unifying elevator change
block: unifying elevator change
block: pass elevator_queue to elv_register_queue & unregister_queue
block: move elv_register[unregister]_queue out of elevator_lock
block: move debugfs/sysfs register out of freezing queue
block: remove several ->elevator_lock
block: move hctx cpuhp add/del out of queue freezing
block: move wbt_enable_default() out of queue freezing from
scheduler's ->exit()
block/bfq-iosched.c | 2 +-
block/blk-mq-debugfs.c | 13 +--
block/blk-mq-sched.c | 39 +++++---
block/blk-mq.c | 163 +++++++++---------------------
block/blk-sysfs.c | 18 ++--
block/blk.h | 10 +-
block/elevator.c | 221 ++++++++++++++++++++++++++---------------
block/elevator.h | 16 +++
block/genhd.c | 25 +----
include/linux/blk-mq.h | 10 +-
10 files changed, 256 insertions(+), 261 deletions(-)
--
2.47.0
^ permalink raw reply [flat|nested] 57+ messages in thread
* [PATCH 01/15] block: don't call freeze queue in elevator_switch() and elevator_disable()
2025-04-10 13:30 [PATCH 00/15] block: unify elevator changing and fix lockdep warning Ming Lei
@ 2025-04-10 13:30 ` Ming Lei
2025-04-10 13:30 ` [PATCH 02/15] block: add two helpers for registering/un-registering sched debugfs Ming Lei
` (13 subsequent siblings)
14 siblings, 0 replies; 57+ messages in thread
From: Ming Lei @ 2025-04-10 13:30 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Both elevator_switch() and elevator_disable() are called from sysfs
store and updating nr_hw_queue code paths only.
And in the two code paths, queue has been frozen already, so don't call
freeze queue in the two functions.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/elevator.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/block/elevator.c b/block/elevator.c
index b4d08026b02c..5051a98dc08c 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -615,12 +615,11 @@ void elevator_init_mq(struct request_queue *q)
*/
int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
{
- unsigned int memflags;
int ret;
+ WARN_ON_ONCE(q->mq_freeze_depth == 0);
lockdep_assert_held(&q->elevator_lock);
- memflags = blk_mq_freeze_queue(q);
blk_mq_quiesce_queue(q);
if (q->elevator) {
@@ -641,7 +640,6 @@ int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
out_unfreeze:
blk_mq_unquiesce_queue(q);
- blk_mq_unfreeze_queue(q, memflags);
if (ret) {
pr_warn("elv: switch to \"%s\" failed, falling back to \"none\"\n",
@@ -653,11 +651,9 @@ int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
void elevator_disable(struct request_queue *q)
{
- unsigned int memflags;
-
+ WARN_ON_ONCE(q->mq_freeze_depth == 0);
lockdep_assert_held(&q->elevator_lock);
- memflags = blk_mq_freeze_queue(q);
blk_mq_quiesce_queue(q);
elv_unregister_queue(q);
@@ -668,7 +664,6 @@ void elevator_disable(struct request_queue *q)
blk_add_trace_msg(q, "elv switch: none");
blk_mq_unquiesce_queue(q);
- blk_mq_unfreeze_queue(q, memflags);
}
/*
--
2.47.0
^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 02/15] block: add two helpers for registering/un-registering sched debugfs
2025-04-10 13:30 [PATCH 00/15] block: unify elevator changing and fix lockdep warning Ming Lei
2025-04-10 13:30 ` [PATCH 01/15] block: don't call freeze queue in elevator_switch() and elevator_disable() Ming Lei
@ 2025-04-10 13:30 ` Ming Lei
2025-04-10 14:25 ` Christoph Hellwig
2025-04-10 13:30 ` [PATCH 03/15] block: move sched debugfs register into elvevator_register_queue Ming Lei
` (12 subsequent siblings)
14 siblings, 1 reply; 57+ messages in thread
From: Ming Lei @ 2025-04-10 13:30 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Add blk_mq_sched_reg_debugfs()/blk_mq_sched_unreg_debugfs() to clean
up sched init/exit code a bit.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq-sched.c | 42 +++++++++++++++++++++++++++---------------
1 file changed, 27 insertions(+), 15 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 109611445d40..f66abaa25430 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -436,6 +436,30 @@ static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
return 0;
}
+static void blk_mq_sched_reg_debugfs(struct request_queue *q)
+{
+ struct blk_mq_hw_ctx *hctx;
+ unsigned long i;
+
+ mutex_lock(&q->debugfs_mutex);
+ blk_mq_debugfs_register_sched(q);
+ queue_for_each_hw_ctx(q, hctx, i)
+ blk_mq_debugfs_register_sched_hctx(q, hctx);
+ mutex_unlock(&q->debugfs_mutex);
+}
+
+static void blk_mq_sched_unreg_debugfs(struct request_queue *q)
+{
+ struct blk_mq_hw_ctx *hctx;
+ unsigned long i;
+
+ mutex_lock(&q->debugfs_mutex);
+ queue_for_each_hw_ctx(q, hctx, i)
+ blk_mq_debugfs_unregister_sched_hctx(hctx);
+ blk_mq_debugfs_unregister_sched(q);
+ mutex_unlock(&q->debugfs_mutex);
+}
+
/* caller must have a reference to @e, will grab another one if successful */
int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
{
@@ -469,10 +493,6 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
if (ret)
goto err_free_map_and_rqs;
- mutex_lock(&q->debugfs_mutex);
- blk_mq_debugfs_register_sched(q);
- mutex_unlock(&q->debugfs_mutex);
-
queue_for_each_hw_ctx(q, hctx, i) {
if (e->ops.init_hctx) {
ret = e->ops.init_hctx(hctx, i);
@@ -484,10 +504,8 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
return ret;
}
}
- mutex_lock(&q->debugfs_mutex);
- blk_mq_debugfs_register_sched_hctx(q, hctx);
- mutex_unlock(&q->debugfs_mutex);
}
+ blk_mq_sched_reg_debugfs(q);
return 0;
@@ -526,11 +544,9 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
unsigned long i;
unsigned int flags = 0;
- queue_for_each_hw_ctx(q, hctx, i) {
- mutex_lock(&q->debugfs_mutex);
- blk_mq_debugfs_unregister_sched_hctx(hctx);
- mutex_unlock(&q->debugfs_mutex);
+ blk_mq_sched_unreg_debugfs(q);
+ queue_for_each_hw_ctx(q, hctx, i) {
if (e->type->ops.exit_hctx && hctx->sched_data) {
e->type->ops.exit_hctx(hctx, i);
hctx->sched_data = NULL;
@@ -538,10 +554,6 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
flags = hctx->flags;
}
- mutex_lock(&q->debugfs_mutex);
- blk_mq_debugfs_unregister_sched(q);
- mutex_unlock(&q->debugfs_mutex);
-
if (e->type->ops.exit_sched)
e->type->ops.exit_sched(e);
blk_mq_sched_tags_teardown(q, flags);
--
2.47.0
^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 03/15] block: move sched debugfs register into elvevator_register_queue
2025-04-10 13:30 [PATCH 00/15] block: unify elevator changing and fix lockdep warning Ming Lei
2025-04-10 13:30 ` [PATCH 01/15] block: don't call freeze queue in elevator_switch() and elevator_disable() Ming Lei
2025-04-10 13:30 ` [PATCH 02/15] block: add two helpers for registering/un-registering sched debugfs Ming Lei
@ 2025-04-10 13:30 ` Ming Lei
2025-04-10 14:27 ` Christoph Hellwig
2025-04-10 13:30 ` [PATCH 04/15] block: prevent elevator switch during updating nr_hw_queues Ming Lei
` (11 subsequent siblings)
14 siblings, 1 reply; 57+ messages in thread
From: Ming Lei @ 2025-04-10 13:30 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
sched debugfs shares same lifetime with scheduler's kobject, and same
lock(elevator lock), so move sched debugfs register/unregister into
elevator_register_queue() and elevator_unregister_queue().
Then we needn't blk_mq_debugfs_register() for us to register sched
debugfs any more.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq-debugfs.c | 12 ------------
block/blk-mq-sched.c | 7 ++-----
block/elevator.c | 6 ++++++
block/elevator.h | 3 +++
4 files changed, 11 insertions(+), 17 deletions(-)
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 3421b5521fe2..c308699ded58 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -624,22 +624,10 @@ void blk_mq_debugfs_register(struct request_queue *q)
debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);
- /*
- * blk_mq_init_sched() attempted to do this already, but q->debugfs_dir
- * didn't exist yet (because we don't know what to name the directory
- * until the queue is registered to a gendisk).
- */
- if (q->elevator && !q->sched_debugfs_dir)
- blk_mq_debugfs_register_sched(q);
-
- /* Similarly, blk_mq_init_hctx() couldn't do this previously. */
queue_for_each_hw_ctx(q, hctx, i) {
if (!hctx->debugfs_dir)
blk_mq_debugfs_register_hctx(q, hctx);
- if (q->elevator && !hctx->sched_debugfs_dir)
- blk_mq_debugfs_register_sched_hctx(q, hctx);
}
-
if (q->rq_qos) {
struct rq_qos *rqos = q->rq_qos;
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index f66abaa25430..14552c58c4e8 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -436,7 +436,7 @@ static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
return 0;
}
-static void blk_mq_sched_reg_debugfs(struct request_queue *q)
+void blk_mq_sched_reg_debugfs(struct request_queue *q)
{
struct blk_mq_hw_ctx *hctx;
unsigned long i;
@@ -448,7 +448,7 @@ static void blk_mq_sched_reg_debugfs(struct request_queue *q)
mutex_unlock(&q->debugfs_mutex);
}
-static void blk_mq_sched_unreg_debugfs(struct request_queue *q)
+void blk_mq_sched_unreg_debugfs(struct request_queue *q)
{
struct blk_mq_hw_ctx *hctx;
unsigned long i;
@@ -505,7 +505,6 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
}
}
}
- blk_mq_sched_reg_debugfs(q);
return 0;
@@ -544,8 +543,6 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
unsigned long i;
unsigned int flags = 0;
- blk_mq_sched_unreg_debugfs(q);
-
queue_for_each_hw_ctx(q, hctx, i) {
if (e->type->ops.exit_hctx && hctx->sched_data) {
e->type->ops.exit_hctx(hctx, i);
diff --git a/block/elevator.c b/block/elevator.c
index 5051a98dc08c..cf48613c6e62 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -459,6 +459,9 @@ int elv_register_queue(struct request_queue *q, bool uevent)
lockdep_assert_held(&q->elevator_lock);
+ if (test_bit(ELEVATOR_FLAG_REGISTERED, &e->flags))
+ return 0;
+
error = kobject_add(&e->kobj, &q->disk->queue_kobj, "iosched");
if (!error) {
const struct elv_fs_entry *attr = e->type->elevator_attrs;
@@ -472,6 +475,7 @@ int elv_register_queue(struct request_queue *q, bool uevent)
if (uevent)
kobject_uevent(&e->kobj, KOBJ_ADD);
+ blk_mq_sched_reg_debugfs(q);
set_bit(ELEVATOR_FLAG_REGISTERED, &e->flags);
}
return error;
@@ -486,6 +490,8 @@ void elv_unregister_queue(struct request_queue *q)
if (e && test_and_clear_bit(ELEVATOR_FLAG_REGISTERED, &e->flags)) {
kobject_uevent(&e->kobj, KOBJ_REMOVE);
kobject_del(&e->kobj);
+
+ blk_mq_sched_unreg_debugfs(q);
}
}
diff --git a/block/elevator.h b/block/elevator.h
index e4e44dfac503..80ff9b28a66f 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -182,4 +182,7 @@ extern struct request *elv_rb_find(struct rb_root *, sector_t);
#define rq_entry_fifo(ptr) list_entry((ptr), struct request, queuelist)
#define rq_fifo_clear(rq) list_del_init(&(rq)->queuelist)
+void blk_mq_sched_reg_debugfs(struct request_queue *q);
+void blk_mq_sched_unreg_debugfs(struct request_queue *q);
+
#endif /* _ELEVATOR_H */
--
2.47.0
^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 04/15] block: prevent elevator switch during updating nr_hw_queues
2025-04-10 13:30 [PATCH 00/15] block: unify elevator changing and fix lockdep warning Ming Lei
` (2 preceding siblings ...)
2025-04-10 13:30 ` [PATCH 03/15] block: move sched debugfs register into elvevator_register_queue Ming Lei
@ 2025-04-10 13:30 ` Ming Lei
2025-04-10 14:36 ` Christoph Hellwig
2025-04-11 19:13 ` Nilay Shroff
2025-04-10 13:30 ` [PATCH 05/15] block: simplify elevator reset for " Ming Lei
` (10 subsequent siblings)
14 siblings, 2 replies; 57+ messages in thread
From: Ming Lei @ 2025-04-10 13:30 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
updating nr_hw_queues is usually used for error handling code, when it
doesn't make sense to allow blk-mq elevator switching, since nr_hw_queues
may change, and elevator tags depends on nr_hw_queues.
Prevent elevator switch during updating nr_hw_queues by setting flag of
BLK_MQ_F_UPDATE_HW_QUEUES, and use srcu to fail elevator switch during
the period. Here elevator switch code is srcu reader of nr_hw_queues,
and blk_mq_update_nr_hw_queues() is the writer.
This way avoids lot of trouble.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq-debugfs.c | 1 +
block/blk-mq.c | 19 ++++++++++++++++++-
block/elevator.c | 12 +++++++++++-
include/linux/blk-mq.h | 10 +++++++++-
4 files changed, 39 insertions(+), 3 deletions(-)
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index c308699ded58..27f984311bb7 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -180,6 +180,7 @@ static const char *const hctx_flag_name[] = {
HCTX_FLAG_NAME(BLOCKING),
HCTX_FLAG_NAME(TAG_RR),
HCTX_FLAG_NAME(NO_SCHED_BY_DEFAULT),
+ HCTX_FLAG_NAME(UPDATE_HW_QUEUES),
};
#undef HCTX_FLAG_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d7a103dc258b..4b0707fb7ae3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4785,12 +4785,16 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
goto out_free_srcu;
}
+ ret = init_srcu_struct(&set->update_nr_hwq_srcu);
+ if (ret)
+ goto out_cleanup_srcu;
+
ret = -ENOMEM;
set->tags = kcalloc_node(set->nr_hw_queues,
sizeof(struct blk_mq_tags *), GFP_KERNEL,
set->numa_node);
if (!set->tags)
- goto out_cleanup_srcu;
+ goto out_cleanup_hwq_srcu;
for (i = 0; i < set->nr_maps; i++) {
set->map[i].mq_map = kcalloc_node(nr_cpu_ids,
@@ -4819,6 +4823,8 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
}
kfree(set->tags);
set->tags = NULL;
+out_cleanup_hwq_srcu:
+ cleanup_srcu_struct(&set->update_nr_hwq_srcu);
out_cleanup_srcu:
if (set->flags & BLK_MQ_F_BLOCKING)
cleanup_srcu_struct(set->srcu);
@@ -5081,7 +5087,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
{
mutex_lock(&set->tag_list_lock);
+ /*
+ * Mark us in updating nr_hw_queues for preventing switching
+ * elevator
+ *
+ * Elevator switch code can _not_ acquire ->tag_list_lock
+ */
+ set->flags |= BLK_MQ_F_UPDATE_HW_QUEUES;
+ synchronize_srcu(&set->update_nr_hwq_srcu);
+
__blk_mq_update_nr_hw_queues(set, nr_hw_queues);
+
+ set->flags &= BLK_MQ_F_UPDATE_HW_QUEUES;
mutex_unlock(&set->tag_list_lock);
}
EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
diff --git a/block/elevator.c b/block/elevator.c
index cf48613c6e62..7d7b77dd4341 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -718,9 +718,10 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
{
char elevator_name[ELV_NAME_MAX];
char *name;
- int ret;
+ int ret, idx;
unsigned int memflags;
struct request_queue *q = disk->queue;
+ struct blk_mq_tag_set *set = q->tag_set;
/*
* If the attribute needs to load a module, do it before freezing the
@@ -732,6 +733,13 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
elv_iosched_load_module(name);
+ idx = srcu_read_lock(&set->update_nr_hwq_srcu);
+
+ if (set->flags & BLK_MQ_F_UPDATE_HW_QUEUES) {
+ ret = -EBUSY;
+ goto exit;
+ }
+
memflags = blk_mq_freeze_queue(q);
mutex_lock(&q->elevator_lock);
ret = elevator_change(q, name);
@@ -739,6 +747,8 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
ret = count;
mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
+exit:
+ srcu_read_unlock(&set->update_nr_hwq_srcu, idx);
return ret;
}
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8eb9b3310167..473871c760e1 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -527,6 +527,7 @@ struct blk_mq_tag_set {
struct mutex tag_list_lock;
struct list_head tag_list;
struct srcu_struct *srcu;
+ struct srcu_struct update_nr_hwq_srcu;
};
/**
@@ -681,7 +682,14 @@ enum {
*/
BLK_MQ_F_NO_SCHED_BY_DEFAULT = 1 << 6,
- BLK_MQ_F_MAX = 1 << 7,
+ /*
+ * True when updating nr_hw_queues is in-progress
+ *
+ * tag_set only flag, not usable for hctx
+ */
+ BLK_MQ_F_UPDATE_HW_QUEUES = 1 << 7,
+
+ BLK_MQ_F_MAX = 1 << 8,
};
#define BLK_MQ_MAX_DEPTH (10240)
--
2.47.0
^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 05/15] block: simplify elevator reset for updating nr_hw_queues
2025-04-10 13:30 [PATCH 00/15] block: unify elevator changing and fix lockdep warning Ming Lei
` (3 preceding siblings ...)
2025-04-10 13:30 ` [PATCH 04/15] block: prevent elevator switch during updating nr_hw_queues Ming Lei
@ 2025-04-10 13:30 ` Ming Lei
2025-04-10 14:40 ` Christoph Hellwig
2025-04-10 15:34 ` Christoph Hellwig
2025-04-10 13:30 ` [PATCH 06/15] block: add helper of elevator_change() Ming Lei
` (9 subsequent siblings)
14 siblings, 2 replies; 57+ messages in thread
From: Ming Lei @ 2025-04-10 13:30 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
In blk_mq_update_nr_hw_queues(), nr_hw_queues may change, so elevator has
to be reset after nr_hw_queues is changed.
Now elevator switch isn't allowed during blk_mq_update_nr_hw_queues(), so
we can simply call elevator_change() to reset elevator sched tags after
nr_hw_queues is updated.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 99 +++++-------------------------------------------
block/blk.h | 4 +-
block/elevator.c | 12 +++---
3 files changed, 18 insertions(+), 97 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4b0707fb7ae3..b7e3cd355e66 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4930,88 +4930,10 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
return ret;
}
-/*
- * request_queue and elevator_type pair.
- * It is just used by __blk_mq_update_nr_hw_queues to cache
- * the elevator_type associated with a request_queue.
- */
-struct blk_mq_qe_pair {
- struct list_head node;
- struct request_queue *q;
- struct elevator_type *type;
-};
-
-/*
- * Cache the elevator_type in qe pair list and switch the
- * io scheduler to 'none'
- */
-static bool blk_mq_elv_switch_none(struct list_head *head,
- struct request_queue *q)
-{
- struct blk_mq_qe_pair *qe;
-
- qe = kmalloc(sizeof(*qe), GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY);
- if (!qe)
- return false;
-
- /* Accessing q->elevator needs protection from ->elevator_lock. */
- mutex_lock(&q->elevator_lock);
-
- if (!q->elevator) {
- kfree(qe);
- goto unlock;
- }
-
- INIT_LIST_HEAD(&qe->node);
- qe->q = q;
- qe->type = q->elevator->type;
- /* keep a reference to the elevator module as we'll switch back */
- __elevator_get(qe->type);
- list_add(&qe->node, head);
- elevator_disable(q);
-unlock:
- mutex_unlock(&q->elevator_lock);
-
- return true;
-}
-
-static struct blk_mq_qe_pair *blk_lookup_qe_pair(struct list_head *head,
- struct request_queue *q)
-{
- struct blk_mq_qe_pair *qe;
-
- list_for_each_entry(qe, head, node)
- if (qe->q == q)
- return qe;
-
- return NULL;
-}
-
-static void blk_mq_elv_switch_back(struct list_head *head,
- struct request_queue *q)
-{
- struct blk_mq_qe_pair *qe;
- struct elevator_type *t;
-
- qe = blk_lookup_qe_pair(head, q);
- if (!qe)
- return;
- t = qe->type;
- list_del(&qe->node);
- kfree(qe);
-
- mutex_lock(&q->elevator_lock);
- elevator_switch(q, t);
- /* drop the reference acquired in blk_mq_elv_switch_none */
- elevator_put(t);
- mutex_unlock(&q->elevator_lock);
-}
-
static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
int nr_hw_queues)
{
struct request_queue *q;
- LIST_HEAD(head);
int prev_nr_hw_queues = set->nr_hw_queues;
unsigned int memflags;
int i;
@@ -5029,15 +4951,6 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
list_for_each_entry(q, &set->tag_list, tag_set_list)
blk_mq_freeze_queue_nomemsave(q);
- /*
- * Switch IO scheduler to 'none', cleaning up the data associated
- * with the previous scheduler. We will switch back once we are done
- * updating the new sw to hw queue mappings.
- */
- list_for_each_entry(q, &set->tag_list, tag_set_list)
- if (!blk_mq_elv_switch_none(&head, q))
- goto switch_back;
-
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_debugfs_unregister_hctxs(q);
blk_mq_sysfs_unregister_hctxs(q);
@@ -5071,9 +4984,15 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
blk_mq_debugfs_register_hctxs(q);
}
-switch_back:
- list_for_each_entry(q, &set->tag_list, tag_set_list)
- blk_mq_elv_switch_back(&head, q);
+ list_for_each_entry(q, &set->tag_list, tag_set_list) {
+ const char *name = "none";
+
+ mutex_lock(&q->elevator_lock);
+ if (q->elevator && !blk_queue_dying(q))
+ name = q->elevator->type->elevator_name;
+ __elevator_change(q, name, true);
+ mutex_unlock(&q->elevator_lock);
+ }
list_for_each_entry(q, &set->tag_list, tag_set_list)
blk_mq_unfreeze_queue_nomemrestore(q);
diff --git a/block/blk.h b/block/blk.h
index 006e3be433d2..0c3cc1af2525 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -319,8 +319,8 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
bool blk_insert_flush(struct request *rq);
-int elevator_switch(struct request_queue *q, struct elevator_type *new_e);
-void elevator_disable(struct request_queue *q);
+int __elevator_change(struct request_queue *q, const char *elevator_name,
+ bool force);
void elevator_exit(struct request_queue *q);
int elv_register_queue(struct request_queue *q, bool uevent);
void elv_unregister_queue(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index 7d7b77dd4341..612fa2bdd40d 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -619,7 +619,7 @@ void elevator_init_mq(struct request_queue *q)
* If switching fails, we are most likely running out of memory and not able
* to restore the old io scheduler, so leaving the io scheduler being none.
*/
-int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
+static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
{
int ret;
@@ -655,7 +655,7 @@ int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
return ret;
}
-void elevator_disable(struct request_queue *q)
+static void elevator_disable(struct request_queue *q)
{
WARN_ON_ONCE(q->mq_freeze_depth == 0);
lockdep_assert_held(&q->elevator_lock);
@@ -675,7 +675,8 @@ void elevator_disable(struct request_queue *q)
/*
* Switch this queue to the given IO scheduler.
*/
-static int elevator_change(struct request_queue *q, const char *elevator_name)
+int __elevator_change(struct request_queue *q, const char *elevator_name,
+ bool force)
{
struct elevator_type *e;
int ret;
@@ -690,7 +691,8 @@ static int elevator_change(struct request_queue *q, const char *elevator_name)
return 0;
}
- if (q->elevator && elevator_match(q->elevator->type, elevator_name))
+ if (!force && q->elevator &&
+ elevator_match(q->elevator->type, elevator_name))
return 0;
e = elevator_find_get(elevator_name);
@@ -742,7 +744,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
memflags = blk_mq_freeze_queue(q);
mutex_lock(&q->elevator_lock);
- ret = elevator_change(q, name);
+ ret = __elevator_change(q, name, false);
if (!ret)
ret = count;
mutex_unlock(&q->elevator_lock);
--
2.47.0
^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 06/15] block: add helper of elevator_change()
2025-04-10 13:30 [PATCH 00/15] block: unify elevator changing and fix lockdep warning Ming Lei
` (4 preceding siblings ...)
2025-04-10 13:30 ` [PATCH 05/15] block: simplify elevator reset for " Ming Lei
@ 2025-04-10 13:30 ` Ming Lei
2025-04-10 13:30 ` [PATCH 07/15] block: move blk_unregister_queue() & device_del() after freeze wait Ming Lei
` (8 subsequent siblings)
14 siblings, 0 replies; 57+ messages in thread
From: Ming Lei @ 2025-04-10 13:30 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Add elevator_change() to simplify elv_iosched_store() a bit, and the new
helper will be used for unifying all scheduler change.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/elevator.c | 52 +++++++++++++++++++++++++++++-------------------
1 file changed, 31 insertions(+), 21 deletions(-)
diff --git a/block/elevator.c b/block/elevator.c
index 612fa2bdd40d..e028d2ff9624 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -53,6 +53,8 @@ static LIST_HEAD(elv_list);
*/
#define rq_hash_key(rq) (blk_rq_pos(rq) + blk_rq_sectors(rq))
+static int elevator_change(struct request_queue *q, const char *name);
+
/*
* Query io scheduler to see if the current process issuing bio may be
* merged with rq.
@@ -681,10 +683,6 @@ int __elevator_change(struct request_queue *q, const char *elevator_name,
struct elevator_type *e;
int ret;
- /* Make sure queue is not in the middle of being removed */
- if (!blk_queue_registered(q))
- return -ENOENT;
-
if (!strncmp(elevator_name, "none", 4)) {
if (q->elevator)
elevator_disable(q);
@@ -703,6 +701,29 @@ int __elevator_change(struct request_queue *q, const char *elevator_name,
return ret;
}
+static int elevator_change(struct request_queue *q, const char *name)
+{
+ int ret, idx;
+ unsigned int memflags;
+ struct blk_mq_tag_set *set = q->tag_set;
+
+ idx = srcu_read_lock(&set->update_nr_hwq_srcu);
+
+ if (set->flags & BLK_MQ_F_UPDATE_HW_QUEUES) {
+ ret = -EBUSY;
+ goto exit;
+ }
+
+ memflags = blk_mq_freeze_queue(q);
+ mutex_lock(&q->elevator_lock);
+ ret = __elevator_change(q, name, false);
+ mutex_unlock(&q->elevator_lock);
+ blk_mq_unfreeze_queue(q, memflags);
+exit:
+ srcu_read_unlock(&set->update_nr_hwq_srcu, idx);
+ return ret;
+}
+
static void elv_iosched_load_module(char *elevator_name)
{
struct elevator_type *found;
@@ -718,12 +739,10 @@ static void elv_iosched_load_module(char *elevator_name)
ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
size_t count)
{
+ struct request_queue *q = disk->queue;
char elevator_name[ELV_NAME_MAX];
char *name;
- int ret, idx;
- unsigned int memflags;
- struct request_queue *q = disk->queue;
- struct blk_mq_tag_set *set = q->tag_set;
+ int ret;
/*
* If the attribute needs to load a module, do it before freezing the
@@ -735,22 +754,13 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
elv_iosched_load_module(name);
- idx = srcu_read_lock(&set->update_nr_hwq_srcu);
-
- if (set->flags & BLK_MQ_F_UPDATE_HW_QUEUES) {
- ret = -EBUSY;
- goto exit;
- }
+ /* Make sure queue is not in the middle of being removed */
+ if (!blk_queue_registered(q))
+ return -ENOENT;
- memflags = blk_mq_freeze_queue(q);
- mutex_lock(&q->elevator_lock);
- ret = __elevator_change(q, name, false);
+ ret = elevator_change(q, name);
if (!ret)
ret = count;
- mutex_unlock(&q->elevator_lock);
- blk_mq_unfreeze_queue(q, memflags);
-exit:
- srcu_read_unlock(&set->update_nr_hwq_srcu, idx);
return ret;
}
--
2.47.0
^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 07/15] block: move blk_unregister_queue() & device_del() after freeze wait
2025-04-10 13:30 [PATCH 00/15] block: unify elevator changing and fix lockdep warning Ming Lei
` (5 preceding siblings ...)
2025-04-10 13:30 ` [PATCH 06/15] block: add helper of elevator_change() Ming Lei
@ 2025-04-10 13:30 ` Ming Lei
2025-04-14 6:19 ` Christoph Hellwig
2025-04-10 13:30 ` [PATCH 08/15] block: add `struct elev_change_ctx` for unifying elevator change Ming Lei
` (7 subsequent siblings)
14 siblings, 1 reply; 57+ messages in thread
From: Ming Lei @ 2025-04-10 13:30 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Move blk_unregister_queue() & device_del() after freeze wait, and prepare
for unifying elevator switch.
This way is just fine, since bdev has been unhashed at the beginning of
del_gendisk(), both blk_unregister_queue() & device_del() are dealing
with kobject & debugfs thing only.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/genhd.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/block/genhd.c b/block/genhd.c
index c2bd86cd09de..f426c13edf55 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -721,8 +721,6 @@ void del_gendisk(struct gendisk *disk)
bdi_unregister(disk->bdi);
}
- blk_unregister_queue(disk);
-
kobject_put(disk->part0->bd_holder_dir);
kobject_put(disk->slave_dir);
disk->slave_dir = NULL;
@@ -731,10 +729,12 @@ void del_gendisk(struct gendisk *disk)
disk->part0->bd_stamp = 0;
sysfs_remove_link(block_depr, dev_name(disk_to_dev(disk)));
pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
- device_del(disk_to_dev(disk));
blk_mq_freeze_queue_wait(q);
+ blk_unregister_queue(disk);
+ device_del(disk_to_dev(disk));
+
blk_throtl_cancel_bios(disk);
blk_sync_queue(q);
--
2.47.0
^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 08/15] block: add `struct elev_change_ctx` for unifying elevator change
2025-04-10 13:30 [PATCH 00/15] block: unify elevator changing and fix lockdep warning Ming Lei
` (6 preceding siblings ...)
2025-04-10 13:30 ` [PATCH 07/15] block: move blk_unregister_queue() & device_del() after freeze wait Ming Lei
@ 2025-04-10 13:30 ` Ming Lei
2025-04-14 6:21 ` Christoph Hellwig
2025-04-10 13:30 ` [PATCH 09/15] block: " Ming Lei
` (6 subsequent siblings)
14 siblings, 1 reply; 57+ messages in thread
From: Ming Lei @ 2025-04-10 13:30 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Add `struct elev_change_ctx` and prepare for unifying elevator change,
with this way, any input & output parameter can be provided & observed
in top caller.
This way also helps to move kobject & debugfs things out of
->elevator_lock.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 10 +++++++---
block/blk.h | 4 ++--
block/elevator.c | 33 +++++++++++++++++++--------------
block/elevator.h | 7 +++++++
4 files changed, 35 insertions(+), 19 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b7e3cd355e66..608a74e3a87c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4985,12 +4985,16 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
}
list_for_each_entry(q, &set->tag_list, tag_set_list) {
- const char *name = "none";
+ struct elev_change_ctx ctx = {
+ .name = "none",
+ .force = 1,
+ .uevent = 1,
+ };
mutex_lock(&q->elevator_lock);
if (q->elevator && !blk_queue_dying(q))
- name = q->elevator->type->elevator_name;
- __elevator_change(q, name, true);
+ ctx.name = q->elevator->type->elevator_name;
+ __elevator_change(q, &ctx);
mutex_unlock(&q->elevator_lock);
}
diff --git a/block/blk.h b/block/blk.h
index 0c3cc1af2525..922a429b5363 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -12,6 +12,7 @@
#include "blk-crypto-internal.h"
struct elevator_type;
+struct elev_change_ctx;
#define BLK_DEV_MAX_SECTORS (LLONG_MAX >> 9)
#define BLK_MIN_SEGMENT_SIZE 4096
@@ -319,8 +320,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
bool blk_insert_flush(struct request *rq);
-int __elevator_change(struct request_queue *q, const char *elevator_name,
- bool force);
+int __elevator_change(struct request_queue *q, struct elev_change_ctx *ctx);
void elevator_exit(struct request_queue *q);
int elv_register_queue(struct request_queue *q, bool uevent);
void elv_unregister_queue(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index e028d2ff9624..2bc1679dcd1f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -53,7 +53,8 @@ static LIST_HEAD(elv_list);
*/
#define rq_hash_key(rq) (blk_rq_pos(rq) + blk_rq_sectors(rq))
-static int elevator_change(struct request_queue *q, const char *name);
+static int elevator_change(struct request_queue *q,
+ struct elev_change_ctx *ctx);
/*
* Query io scheduler to see if the current process issuing bio may be
@@ -621,7 +622,8 @@ void elevator_init_mq(struct request_queue *q)
* If switching fails, we are most likely running out of memory and not able
* to restore the old io scheduler, so leaving the io scheduler being none.
*/
-static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
+static int elevator_switch(struct request_queue *q, struct elevator_type *new_e,
+ struct elev_change_ctx *ctx)
{
int ret;
@@ -639,7 +641,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
if (ret)
goto out_unfreeze;
- ret = elv_register_queue(q, true);
+ ret = elv_register_queue(q, ctx->uevent);
if (ret) {
elevator_exit(q);
goto out_unfreeze;
@@ -677,9 +679,9 @@ static void elevator_disable(struct request_queue *q)
/*
* Switch this queue to the given IO scheduler.
*/
-int __elevator_change(struct request_queue *q, const char *elevator_name,
- bool force)
+int __elevator_change(struct request_queue *q, struct elev_change_ctx *ctx)
{
+ const char *elevator_name = ctx->name;
struct elevator_type *e;
int ret;
@@ -689,19 +691,20 @@ int __elevator_change(struct request_queue *q, const char *elevator_name,
return 0;
}
- if (!force && q->elevator &&
+ if (!ctx->force && q->elevator &&
elevator_match(q->elevator->type, elevator_name))
return 0;
e = elevator_find_get(elevator_name);
if (!e)
return -EINVAL;
- ret = elevator_switch(q, e);
+ ret = elevator_switch(q, e, ctx);
elevator_put(e);
return ret;
}
-static int elevator_change(struct request_queue *q, const char *name)
+static int elevator_change(struct request_queue *q,
+ struct elev_change_ctx *ctx)
{
int ret, idx;
unsigned int memflags;
@@ -716,7 +719,7 @@ static int elevator_change(struct request_queue *q, const char *name)
memflags = blk_mq_freeze_queue(q);
mutex_lock(&q->elevator_lock);
- ret = __elevator_change(q, name, false);
+ ret = __elevator_change(q, ctx);
mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
exit:
@@ -724,7 +727,7 @@ static int elevator_change(struct request_queue *q, const char *name)
return ret;
}
-static void elv_iosched_load_module(char *elevator_name)
+static void elv_iosched_load_module(const char *elevator_name)
{
struct elevator_type *found;
@@ -741,7 +744,9 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
{
struct request_queue *q = disk->queue;
char elevator_name[ELV_NAME_MAX];
- char *name;
+ struct elev_change_ctx ctx = {
+ .uevent = 1,
+ };
int ret;
/*
@@ -750,15 +755,15 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
* queue is the one for the device storing the module file.
*/
strscpy(elevator_name, buf, sizeof(elevator_name));
- name = strstrip(elevator_name);
+ ctx.name = strstrip(elevator_name);
- elv_iosched_load_module(name);
+ elv_iosched_load_module(ctx.name);
/* Make sure queue is not in the middle of being removed */
if (!blk_queue_registered(q))
return -ENOENT;
- ret = elevator_change(q, name);
+ ret = elevator_change(q, &ctx);
if (!ret)
ret = count;
return ret;
diff --git a/block/elevator.h b/block/elevator.h
index 80ff9b28a66f..86b4977cf772 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -123,6 +123,13 @@ struct elevator_queue
#define ELEVATOR_FLAG_REGISTERED 0
#define ELEVATOR_FLAG_DISABLE_WBT 1
+/* Holding data for changing elevator */
+struct elev_change_ctx {
+ const char *name;
+ unsigned int force:1;
+ unsigned int uevent:1;
+};
+
/*
* block elevator interface
*/
--
2.47.0
^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 09/15] block: unifying elevator change
2025-04-10 13:30 [PATCH 00/15] block: unify elevator changing and fix lockdep warning Ming Lei
` (7 preceding siblings ...)
2025-04-10 13:30 ` [PATCH 08/15] block: add `struct elev_change_ctx` for unifying elevator change Ming Lei
@ 2025-04-10 13:30 ` Ming Lei
2025-04-10 18:37 ` Nilay Shroff
2025-04-10 13:30 ` [PATCH 10/15] block: pass elevator_queue to elv_register_queue & unregister_queue Ming Lei
` (5 subsequent siblings)
14 siblings, 1 reply; 57+ messages in thread
From: Ming Lei @ 2025-04-10 13:30 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
elevator change is one well-define behavior:
- tear down current elevator if it exists
- setup new elevator
It is supposed to cover any case for changing elevator, typically the
following cases:
- setup default elevator in add_disk()
- switch to none in del_disk()
- reset elevator in blk_mq_update_nr_hw_queues()
- switch elevator in sysfs `store` elevator attribute
This patch uses elevator_change() to cover all above cases:
- every elevator switch is serialized with each other: add_disk/del_disk/
store elevator is serialized already, blk_mq_update_nr_hw_queues() uses
srcu for syncing with the other three cases
- for both add_disk()/del_disk(), queue freeze works at atomic mode
or has been froze, so the freeze in elevator_change() won't add extra
delay
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-sysfs.c | 18 ++++-------
block/blk.h | 5 ++--
block/elevator.c | 76 +++++++++++++++++++++++++----------------------
block/genhd.c | 19 +-----------
4 files changed, 50 insertions(+), 68 deletions(-)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index a2882751f0d2..58c50709bc14 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -869,14 +869,8 @@ int blk_register_queue(struct gendisk *disk)
if (ret)
goto out_unregister_ia_ranges;
+ elevator_set_default(q);
mutex_lock(&q->elevator_lock);
- if (q->elevator) {
- ret = elv_register_queue(q, false);
- if (ret) {
- mutex_unlock(&q->elevator_lock);
- goto out_crypto_sysfs_unregister;
- }
- }
wbt_enable_default(disk);
mutex_unlock(&q->elevator_lock);
@@ -902,8 +896,6 @@ int blk_register_queue(struct gendisk *disk)
return ret;
-out_crypto_sysfs_unregister:
- blk_crypto_sysfs_unregister(disk);
out_unregister_ia_ranges:
disk_unregister_independent_access_ranges(disk);
out_debugfs_remove:
@@ -949,9 +941,11 @@ void blk_unregister_queue(struct gendisk *disk)
blk_mq_sysfs_unregister(disk);
blk_crypto_sysfs_unregister(disk);
- mutex_lock(&q->elevator_lock);
- elv_unregister_queue(q);
- mutex_unlock(&q->elevator_lock);
+ if (q->elevator) {
+ blk_mq_quiesce_queue(q);
+ elevator_set_none(q);
+ blk_mq_unquiesce_queue(q);
+ }
mutex_lock(&q->sysfs_lock);
disk_unregister_independent_access_ranges(disk);
diff --git a/block/blk.h b/block/blk.h
index 922a429b5363..4626beedfdce 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -321,9 +321,8 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
bool blk_insert_flush(struct request *rq);
int __elevator_change(struct request_queue *q, struct elev_change_ctx *ctx);
-void elevator_exit(struct request_queue *q);
-int elv_register_queue(struct request_queue *q, bool uevent);
-void elv_unregister_queue(struct request_queue *q);
+void elevator_set_default(struct request_queue *q);
+void elevator_set_none(struct request_queue *q);
ssize_t part_size_show(struct device *dev, struct device_attribute *attr,
char *buf);
diff --git a/block/elevator.c b/block/elevator.c
index 2bc1679dcd1f..7d2a56ef0be6 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -151,7 +151,7 @@ static void elevator_release(struct kobject *kobj)
kfree(e);
}
-void elevator_exit(struct request_queue *q)
+static void elevator_exit(struct request_queue *q)
{
struct elevator_queue *e = q->elevator;
@@ -455,7 +455,7 @@ static const struct kobj_type elv_ktype = {
.release = elevator_release,
};
-int elv_register_queue(struct request_queue *q, bool uevent)
+static int elv_register_queue(struct request_queue *q, bool uevent)
{
struct elevator_queue *e = q->elevator;
int error;
@@ -484,7 +484,7 @@ int elv_register_queue(struct request_queue *q, bool uevent)
return error;
}
-void elv_unregister_queue(struct request_queue *q)
+static void elv_unregister_queue(struct request_queue *q)
{
struct elevator_queue *e = q->elevator;
@@ -560,60 +560,56 @@ EXPORT_SYMBOL_GPL(elv_unregister);
* For single queue devices, default to using mq-deadline. If we have multiple
* queues or mq-deadline is not available, default to "none".
*/
-static struct elevator_type *elevator_get_default(struct request_queue *q)
+static bool use_default_elevator(struct request_queue *q)
{
if (q->tag_set->flags & BLK_MQ_F_NO_SCHED_BY_DEFAULT)
- return NULL;
+ return false;
if (q->nr_hw_queues != 1 &&
!blk_mq_is_shared_tags(q->tag_set->flags))
- return NULL;
+ return false;
- return elevator_find_get("mq-deadline");
+ return true;
}
/*
* Use the default elevator settings. If the chosen elevator initialization
* fails, fall back to the "none" elevator (no elevator).
*/
-void elevator_init_mq(struct request_queue *q)
+void elevator_set_default(struct request_queue *q)
{
- struct elevator_type *e;
- unsigned int memflags;
+ struct elev_change_ctx ctx = { };
int err;
- WARN_ON_ONCE(blk_queue_registered(q));
-
- if (unlikely(q->elevator))
+ if (!queue_is_mq(q))
return;
- e = elevator_get_default(q);
- if (!e)
+ ctx.name = use_default_elevator(q) ? "mq-deadline" : "none";
+ if (!q->elevator && !strcmp(ctx.name, "none"))
return;
+ err = elevator_change(q, &ctx);
+ if (err < 0)
+ pr_warn("\"%s\" set elevator failed %d, "
+ "falling back to \"none\"\n", ctx.name, err);
+}
- /*
- * We are called before adding disk, when there isn't any FS I/O,
- * so freezing queue plus canceling dispatch work is enough to
- * drain any dispatch activities originated from passthrough
- * requests, then no need to quiesce queue which may add long boot
- * latency, especially when lots of disks are involved.
- *
- * Disk isn't added yet, so verifying queue lock only manually.
- */
- memflags = blk_mq_freeze_queue(q);
-
- blk_mq_cancel_work_sync(q);
-
- err = blk_mq_init_sched(q, e);
+void elevator_set_none(struct request_queue *q)
+{
+ struct elev_change_ctx ctx = {
+ .name = "none",
+ .uevent = 1,
+ };
+ int err;
- blk_mq_unfreeze_queue(q, memflags);
+ if (!queue_is_mq(q))
+ return;
- if (err) {
- pr_warn("\"%s\" elevator initialization failed, "
- "falling back to \"none\"\n", e->elevator_name);
- }
+ if (!q->elevator)
+ return;
- elevator_put(e);
+ err = elevator_change(q, &ctx);
+ if (err < 0)
+ pr_warn("%s: set none elevator failed %d\n", __func__, err);
}
/*
@@ -718,6 +714,16 @@ static int elevator_change(struct request_queue *q,
}
memflags = blk_mq_freeze_queue(q);
+ /*
+ * May be called before adding disk, when there isn't any FS I/O,
+ * so freezing queue plus canceling dispatch work is enough to
+ * drain any dispatch activities originated from passthrough
+ * requests, then no need to quiesce queue which may add long boot
+ * latency, especially when lots of disks are involved.
+ *
+ * Disk isn't added yet, so verifying queue lock only manually.
+ */
+ blk_mq_cancel_work_sync(q);
mutex_lock(&q->elevator_lock);
ret = __elevator_change(q, ctx);
mutex_unlock(&q->elevator_lock);
diff --git a/block/genhd.c b/block/genhd.c
index f426c13edf55..d7264546a178 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -416,12 +416,6 @@ int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
*/
if (disk->fops->submit_bio || disk->fops->poll_bio)
return -EINVAL;
-
- /*
- * Initialize the I/O scheduler code and pick a default one if
- * needed.
- */
- elevator_init_mq(disk->queue);
} else {
if (!disk->fops->submit_bio)
return -EINVAL;
@@ -565,11 +559,7 @@ int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
if (disk->major == BLOCK_EXT_MAJOR)
blk_free_ext_minor(disk->first_minor);
out_exit_elevator:
- if (disk->queue->elevator) {
- mutex_lock(&disk->queue->elevator_lock);
- elevator_exit(disk->queue);
- mutex_unlock(&disk->queue->elevator_lock);
- }
+ elevator_set_none(disk->queue);
return ret;
}
EXPORT_SYMBOL_GPL(add_disk_fwnode);
@@ -743,14 +733,7 @@ void del_gendisk(struct gendisk *disk)
if (queue_is_mq(q))
blk_mq_cancel_work_sync(q);
- blk_mq_quiesce_queue(q);
- if (q->elevator) {
- mutex_lock(&q->elevator_lock);
- elevator_exit(q);
- mutex_unlock(&q->elevator_lock);
- }
rq_qos_exit(q);
- blk_mq_unquiesce_queue(q);
/*
* If the disk does not own the queue, allow using passthrough requests
--
2.47.0
^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 10/15] block: pass elevator_queue to elv_register_queue & unregister_queue
2025-04-10 13:30 [PATCH 00/15] block: unify elevator changing and fix lockdep warning Ming Lei
` (8 preceding siblings ...)
2025-04-10 13:30 ` [PATCH 09/15] block: " Ming Lei
@ 2025-04-10 13:30 ` Ming Lei
2025-04-14 6:22 ` Christoph Hellwig
2025-04-10 13:30 ` [PATCH 11/15] block: move elv_register[unregister]_queue out of elevator_lock Ming Lei
` (4 subsequent siblings)
14 siblings, 1 reply; 57+ messages in thread
From: Ming Lei @ 2025-04-10 13:30 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Pass elevator_queue reference to elv_register_queue() &
elv_unregister_queue().
No functional change, and prepare for moving the two out of elevator
lock.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/elevator.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/block/elevator.c b/block/elevator.c
index 7d2a56ef0be6..238b8d47cc2b 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -455,9 +455,10 @@ static const struct kobj_type elv_ktype = {
.release = elevator_release,
};
-static int elv_register_queue(struct request_queue *q, bool uevent)
+static int elv_register_queue(struct request_queue *q,
+ struct elevator_queue *e,
+ bool uevent)
{
- struct elevator_queue *e = q->elevator;
int error;
lockdep_assert_held(&q->elevator_lock);
@@ -484,10 +485,9 @@ static int elv_register_queue(struct request_queue *q, bool uevent)
return error;
}
-static void elv_unregister_queue(struct request_queue *q)
+static void elv_unregister_queue(struct request_queue *q,
+ struct elevator_queue *e)
{
- struct elevator_queue *e = q->elevator;
-
lockdep_assert_held(&q->elevator_lock);
if (e && test_and_clear_bit(ELEVATOR_FLAG_REGISTERED, &e->flags)) {
@@ -629,7 +629,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e,
blk_mq_quiesce_queue(q);
if (q->elevator) {
- elv_unregister_queue(q);
+ elv_unregister_queue(q, q->elevator);
elevator_exit(q);
}
@@ -637,7 +637,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e,
if (ret)
goto out_unfreeze;
- ret = elv_register_queue(q, ctx->uevent);
+ ret = elv_register_queue(q, q->elevator, ctx->uevent);
if (ret) {
elevator_exit(q);
goto out_unfreeze;
@@ -662,7 +662,7 @@ static void elevator_disable(struct request_queue *q)
blk_mq_quiesce_queue(q);
- elv_unregister_queue(q);
+ elv_unregister_queue(q, q->elevator);
elevator_exit(q);
blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q);
q->elevator = NULL;
--
2.47.0
^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 11/15] block: move elv_register[unregister]_queue out of elevator_lock
2025-04-10 13:30 [PATCH 00/15] block: unify elevator changing and fix lockdep warning Ming Lei
` (9 preceding siblings ...)
2025-04-10 13:30 ` [PATCH 10/15] block: pass elevator_queue to elv_register_queue & unregister_queue Ming Lei
@ 2025-04-10 13:30 ` Ming Lei
2025-04-11 19:20 ` Nilay Shroff
2025-04-10 13:30 ` [PATCH 12/15] block: move debugfs/sysfs register out of freezing queue Ming Lei
` (3 subsequent siblings)
14 siblings, 1 reply; 57+ messages in thread
From: Ming Lei @ 2025-04-10 13:30 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Move elv_register[unregister]_queue out of ->elevator_lock & queue freezing,
so we can kill many lockdep warnings.
elv_register[unregister]_queue() is serialized, and just dealing with
sysfs/debugfs things, no need to be done with queue frozen.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 9 ++++----
block/blk.h | 1 +
block/elevator.c | 58 ++++++++++++++++++++++++++++++++++--------------
block/elevator.h | 5 +++++
4 files changed, 52 insertions(+), 21 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 608a74e3a87c..7219b01764da 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4990,16 +4990,17 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
.force = 1,
.uevent = 1,
};
+ int ret = -ENODEV;
mutex_lock(&q->elevator_lock);
if (q->elevator && !blk_queue_dying(q))
ctx.name = q->elevator->type->elevator_name;
- __elevator_change(q, &ctx);
+ ret = __elevator_change(q, &ctx);
mutex_unlock(&q->elevator_lock);
- }
-
- list_for_each_entry(q, &set->tag_list, tag_set_list)
blk_mq_unfreeze_queue_nomemrestore(q);
+ if (!ret)
+ WARN_ON_ONCE(elevator_change_done(q, &ctx));
+ }
memalloc_noio_restore(memflags);
/* Free the excess tags when nr_hw_queues shrink. */
diff --git a/block/blk.h b/block/blk.h
index 4626beedfdce..634cebd7a7b4 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -321,6 +321,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
bool blk_insert_flush(struct request *rq);
int __elevator_change(struct request_queue *q, struct elev_change_ctx *ctx);
+int elevator_change_done(struct request_queue *q, struct elev_change_ctx *ctx);
void elevator_set_default(struct request_queue *q);
void elevator_set_none(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index 238b8d47cc2b..1cc640a9db3e 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -151,18 +151,24 @@ static void elevator_release(struct kobject *kobj)
kfree(e);
}
-static void elevator_exit(struct request_queue *q)
+static void __elevator_exit(struct request_queue *q)
{
struct elevator_queue *e = q->elevator;
+ lockdep_assert_held(&q->elevator_lock);
+
ioc_clear_queue(q);
blk_mq_sched_free_rqs(q);
mutex_lock(&e->sysfs_lock);
blk_mq_exit_sched(q, e);
mutex_unlock(&e->sysfs_lock);
+}
- kobject_put(&e->kobj);
+static void elevator_exit(struct request_queue *q)
+{
+ __elevator_exit(q);
+ kobject_put(&q->elevator->kobj);
}
static inline void __elv_rqhash_del(struct request *rq)
@@ -461,8 +467,6 @@ static int elv_register_queue(struct request_queue *q,
{
int error;
- lockdep_assert_held(&q->elevator_lock);
-
if (test_bit(ELEVATOR_FLAG_REGISTERED, &e->flags))
return 0;
@@ -488,8 +492,6 @@ static int elv_register_queue(struct request_queue *q,
static void elv_unregister_queue(struct request_queue *q,
struct elevator_queue *e)
{
- lockdep_assert_held(&q->elevator_lock);
-
if (e && test_and_clear_bit(ELEVATOR_FLAG_REGISTERED, &e->flags)) {
kobject_uevent(&e->kobj, KOBJ_REMOVE);
kobject_del(&e->kobj);
@@ -629,19 +631,15 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e,
blk_mq_quiesce_queue(q);
if (q->elevator) {
- elv_unregister_queue(q, q->elevator);
- elevator_exit(q);
+ ctx->old = q->elevator;
+ __elevator_exit(q);
}
ret = blk_mq_init_sched(q, new_e);
if (ret)
goto out_unfreeze;
- ret = elv_register_queue(q, q->elevator, ctx->uevent);
- if (ret) {
- elevator_exit(q);
- goto out_unfreeze;
- }
+ ctx->new = q->elevator;
blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
out_unfreeze:
@@ -655,15 +653,16 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e,
return ret;
}
-static void elevator_disable(struct request_queue *q)
+static void elevator_disable(struct request_queue *q,
+ struct elev_change_ctx *ctx)
{
WARN_ON_ONCE(q->mq_freeze_depth == 0);
lockdep_assert_held(&q->elevator_lock);
blk_mq_quiesce_queue(q);
- elv_unregister_queue(q, q->elevator);
- elevator_exit(q);
+ ctx->old = q->elevator;
+ __elevator_exit(q);
blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q);
q->elevator = NULL;
q->nr_requests = q->tag_set->queue_depth;
@@ -672,6 +671,28 @@ static void elevator_disable(struct request_queue *q)
blk_mq_unquiesce_queue(q);
}
+int elevator_change_done(struct request_queue *q, struct elev_change_ctx *ctx)
+{
+ int ret = 0;
+
+ if (ctx->old) {
+ elv_unregister_queue(q, ctx->old);
+ kobject_put(&ctx->old->kobj);
+ }
+ if (ctx->new) {
+ ret = elv_register_queue(q, ctx->new, ctx->uevent);
+ if (ret) {
+ unsigned memflags = blk_mq_freeze_queue(q);
+
+ mutex_lock(&q->elevator_lock);
+ elevator_exit(q);
+ mutex_unlock(&q->elevator_lock);
+ blk_mq_unfreeze_queue(q, memflags);
+ }
+ }
+ return 0;
+}
+
/*
* Switch this queue to the given IO scheduler.
*/
@@ -683,7 +704,7 @@ int __elevator_change(struct request_queue *q, struct elev_change_ctx *ctx)
if (!strncmp(elevator_name, "none", 4)) {
if (q->elevator)
- elevator_disable(q);
+ elevator_disable(q, ctx);
return 0;
}
@@ -728,6 +749,9 @@ static int elevator_change(struct request_queue *q,
ret = __elevator_change(q, ctx);
mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
+ if (!ret)
+ ret = elevator_change_done(q, ctx);
+
exit:
srcu_read_unlock(&set->update_nr_hwq_srcu, idx);
return ret;
diff --git a/block/elevator.h b/block/elevator.h
index 86b4977cf772..87848fdc8a52 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -128,6 +128,11 @@ struct elev_change_ctx {
const char *name;
unsigned int force:1;
unsigned int uevent:1;
+
+ /* for unregistering old elevator */
+ struct elevator_queue *old;
+ /* for registering new elevator */
+ struct elevator_queue *new;
};
/*
--
2.47.0
^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 12/15] block: move debugfs/sysfs register out of freezing queue
2025-04-10 13:30 [PATCH 00/15] block: unify elevator changing and fix lockdep warning Ming Lei
` (10 preceding siblings ...)
2025-04-10 13:30 ` [PATCH 11/15] block: move elv_register[unregister]_queue out of elevator_lock Ming Lei
@ 2025-04-10 13:30 ` Ming Lei
2025-04-10 18:57 ` Nilay Shroff
2025-04-10 13:30 ` [PATCH 13/15] block: remove several ->elevator_lock Ming Lei
` (2 subsequent siblings)
14 siblings, 1 reply; 57+ messages in thread
From: Ming Lei @ 2025-04-10 13:30 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Move debugfs/sysfs register out of freezing queue in
__blk_mq_update_nr_hw_queues(), so that the following lockdep dependency
can be killed:
#2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
#1 (fs_reclaim){+.+.}-{0:0}:
#0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs
And registering/un-registering debugfs/sysfs does not require queue to be
frozen.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7219b01764da..0fb72a698d77 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4947,15 +4947,15 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
if (set->nr_maps == 1 && nr_hw_queues == set->nr_hw_queues)
return;
- memflags = memalloc_noio_save();
- list_for_each_entry(q, &set->tag_list, tag_set_list)
- blk_mq_freeze_queue_nomemsave(q);
-
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_debugfs_unregister_hctxs(q);
blk_mq_sysfs_unregister_hctxs(q);
}
+ memflags = memalloc_noio_save();
+ list_for_each_entry(q, &set->tag_list, tag_set_list)
+ blk_mq_freeze_queue_nomemsave(q);
+
if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
goto reregister;
@@ -4978,12 +4978,6 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
blk_mq_map_swqueue(q);
}
-reregister:
- list_for_each_entry(q, &set->tag_list, tag_set_list) {
- blk_mq_sysfs_register_hctxs(q);
- blk_mq_debugfs_register_hctxs(q);
- }
-
list_for_each_entry(q, &set->tag_list, tag_set_list) {
struct elev_change_ctx ctx = {
.name = "none",
@@ -5003,6 +4997,12 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
}
memalloc_noio_restore(memflags);
+reregister:
+ list_for_each_entry(q, &set->tag_list, tag_set_list) {
+ blk_mq_sysfs_register_hctxs(q);
+ blk_mq_debugfs_register_hctxs(q);
+ }
+
/* Free the excess tags when nr_hw_queues shrink. */
for (i = set->nr_hw_queues; i < prev_nr_hw_queues; i++)
__blk_mq_free_map_and_rqs(set, i);
--
2.47.0
^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 13/15] block: remove several ->elevator_lock
2025-04-10 13:30 [PATCH 00/15] block: unify elevator changing and fix lockdep warning Ming Lei
` (11 preceding siblings ...)
2025-04-10 13:30 ` [PATCH 12/15] block: move debugfs/sysfs register out of freezing queue Ming Lei
@ 2025-04-10 13:30 ` Ming Lei
2025-04-10 19:07 ` Nilay Shroff
2025-04-10 13:30 ` [PATCH 14/15] block: move hctx cpuhp add/del out of queue freezing Ming Lei
2025-04-10 13:30 ` [PATCH 15/15] block: move wbt_enable_default() out of queue freezing from scheduler's ->exit() Ming Lei
14 siblings, 1 reply; 57+ messages in thread
From: Ming Lei @ 2025-04-10 13:30 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Both blk_mq_map_swqueue() and blk_mq_realloc_hw_ctxs() are only called
from queue initialization or updating nr_hw_queues code, in which
elevator switch can't happen any more.
So remove these ->elevator_lock uses.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 19 ++++---------------
1 file changed, 4 insertions(+), 15 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0fb72a698d77..812dfe759b89 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4095,8 +4095,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
struct blk_mq_ctx *ctx;
struct blk_mq_tag_set *set = q->tag_set;
- mutex_lock(&q->elevator_lock);
-
queue_for_each_hw_ctx(q, hctx, i) {
cpumask_clear(hctx->cpumask);
hctx->nr_ctx = 0;
@@ -4201,8 +4199,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
hctx->next_cpu = blk_mq_first_mapped_cpu(hctx);
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
}
-
- mutex_unlock(&q->elevator_lock);
}
/*
@@ -4506,16 +4502,9 @@ static void __blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
}
static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
- struct request_queue *q, bool lock)
+ struct request_queue *q)
{
- if (lock) {
- /* protect against switching io scheduler */
- mutex_lock(&q->elevator_lock);
- __blk_mq_realloc_hw_ctxs(set, q);
- mutex_unlock(&q->elevator_lock);
- } else {
- __blk_mq_realloc_hw_ctxs(set, q);
- }
+ __blk_mq_realloc_hw_ctxs(set, q);
/* unregister cpuhp callbacks for exited hctxs */
blk_mq_remove_hw_queues_cpuhp(q);
@@ -4547,7 +4536,7 @@ int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
xa_init(&q->hctx_table);
- blk_mq_realloc_hw_ctxs(set, q, false);
+ blk_mq_realloc_hw_ctxs(set, q);
if (!q->nr_hw_queues)
goto err_hctxs;
@@ -4962,7 +4951,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
fallback:
blk_mq_update_queue_map(set);
list_for_each_entry(q, &set->tag_list, tag_set_list) {
- blk_mq_realloc_hw_ctxs(set, q, true);
+ blk_mq_realloc_hw_ctxs(set, q);
if (q->nr_hw_queues != set->nr_hw_queues) {
int i = prev_nr_hw_queues;
--
2.47.0
^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 14/15] block: move hctx cpuhp add/del out of queue freezing
2025-04-10 13:30 [PATCH 00/15] block: unify elevator changing and fix lockdep warning Ming Lei
` (12 preceding siblings ...)
2025-04-10 13:30 ` [PATCH 13/15] block: remove several ->elevator_lock Ming Lei
@ 2025-04-10 13:30 ` Ming Lei
2025-04-10 13:30 ` [PATCH 15/15] block: move wbt_enable_default() out of queue freezing from scheduler's ->exit() Ming Lei
14 siblings, 0 replies; 57+ messages in thread
From: Ming Lei @ 2025-04-10 13:30 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Move hctx cpuhp add/del out of queue freezing for not connecting freeze
lock with cpuhp locks, then lockdep warning can be avoided.
This way is safe because both needn't queue to be frozen and scheduler
switch isn't allowed.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 812dfe759b89..d721e55c0844 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4951,7 +4951,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
fallback:
blk_mq_update_queue_map(set);
list_for_each_entry(q, &set->tag_list, tag_set_list) {
- blk_mq_realloc_hw_ctxs(set, q);
+ __blk_mq_realloc_hw_ctxs(set, q);
if (q->nr_hw_queues != set->nr_hw_queues) {
int i = prev_nr_hw_queues;
@@ -4990,6 +4990,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_sysfs_register_hctxs(q);
blk_mq_debugfs_register_hctxs(q);
+
+ blk_mq_remove_hw_queues_cpuhp(q);
+ blk_mq_add_hw_queues_cpuhp(q);
}
/* Free the excess tags when nr_hw_queues shrink. */
--
2.47.0
^ permalink raw reply related [flat|nested] 57+ messages in thread
* [PATCH 15/15] block: move wbt_enable_default() out of queue freezing from scheduler's ->exit()
2025-04-10 13:30 [PATCH 00/15] block: unify elevator changing and fix lockdep warning Ming Lei
` (13 preceding siblings ...)
2025-04-10 13:30 ` [PATCH 14/15] block: move hctx cpuhp add/del out of queue freezing Ming Lei
@ 2025-04-10 13:30 ` Ming Lei
2025-04-10 19:20 ` Nilay Shroff
14 siblings, 1 reply; 57+ messages in thread
From: Ming Lei @ 2025-04-10 13:30 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
scheduler's ->exit() is called with queue frozen and elevator lock is held, and
wbt_enable_default() can't be called with queue frozen, otherwise the
following lockdep warning is triggered:
#6 (&q->rq_qos_mutex){+.+.}-{4:4}:
#5 (&eq->sysfs_lock){+.+.}-{4:4}:
#4 (&q->elevator_lock){+.+.}-{4:4}:
#3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
#2 (fs_reclaim){+.+.}-{0:0}:
#1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
#0 (&q->debugfs_mutex){+.+.}-{4:4}:
Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
call it from elevator_change_done().
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/bfq-iosched.c | 2 +-
block/elevator.c | 5 +++++
block/elevator.h | 1 +
3 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index abd80dc13562..18018c8cf84d 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7211,7 +7211,7 @@ static void bfq_exit_queue(struct elevator_queue *e)
blk_stat_disable_accounting(bfqd->queue);
clear_bit(ELEVATOR_FLAG_DISABLE_WBT, &e->flags);
- wbt_enable_default(bfqd->queue->disk);
+ set_bit(ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT, &e->flags);
kfree(bfqd);
}
diff --git a/block/elevator.c b/block/elevator.c
index 1cc640a9db3e..9cf78db4d6a4 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -676,8 +676,13 @@ int elevator_change_done(struct request_queue *q, struct elev_change_ctx *ctx)
int ret = 0;
if (ctx->old) {
+ bool enable_wbt = test_bit(ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT,
+ &ctx->old->flags);
+
elv_unregister_queue(q, ctx->old);
kobject_put(&ctx->old->kobj);
+ if (enable_wbt)
+ wbt_enable_default(q->disk);
}
if (ctx->new) {
ret = elv_register_queue(q, ctx->new, ctx->uevent);
diff --git a/block/elevator.h b/block/elevator.h
index 87848fdc8a52..7b66be5b8295 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -122,6 +122,7 @@ struct elevator_queue
#define ELEVATOR_FLAG_REGISTERED 0
#define ELEVATOR_FLAG_DISABLE_WBT 1
+#define ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT 2
/* Holding data for changing elevator */
struct elev_change_ctx {
--
2.47.0
^ permalink raw reply related [flat|nested] 57+ messages in thread
* Re: [PATCH 02/15] block: add two helpers for registering/un-registering sched debugfs
2025-04-10 13:30 ` [PATCH 02/15] block: add two helpers for registering/un-registering sched debugfs Ming Lei
@ 2025-04-10 14:25 ` Christoph Hellwig
0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2025-04-10 14:25 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Thu, Apr 10, 2025 at 09:30:14PM +0800, Ming Lei wrote:
> Add blk_mq_sched_reg_debugfs()/blk_mq_sched_unreg_debugfs() to clean
> up sched init/exit code a bit.
This not just adds the new helpers, but also changes where the
hctx registration and unregistration is called. From a quick look
this looks fine, but please document it and explain why it is safe
and desirable.
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 03/15] block: move sched debugfs register into elvevator_register_queue
2025-04-10 13:30 ` [PATCH 03/15] block: move sched debugfs register into elvevator_register_queue Ming Lei
@ 2025-04-10 14:27 ` Christoph Hellwig
2025-04-14 0:42 ` Ming Lei
0 siblings, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2025-04-10 14:27 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
> lockdep_assert_held(&q->elevator_lock);
>
> + if (test_bit(ELEVATOR_FLAG_REGISTERED, &e->flags))
> + return 0;
> +
This looks unrelate to the rest of the patch, and I also don't understand
why it's needed as the callers of elv_register_queue don't change.
The rest of the patch looks good to me.
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 04/15] block: prevent elevator switch during updating nr_hw_queues
2025-04-10 13:30 ` [PATCH 04/15] block: prevent elevator switch during updating nr_hw_queues Ming Lei
@ 2025-04-10 14:36 ` Christoph Hellwig
2025-04-14 0:54 ` Ming Lei
2025-04-11 19:13 ` Nilay Shroff
1 sibling, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2025-04-10 14:36 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Thu, Apr 10, 2025 at 09:30:16PM +0800, Ming Lei wrote:
> updating nr_hw_queues is usually used for error handling code, when it
Capitalize the first word of each sentence, please.
> doesn't make sense to allow blk-mq elevator switching, since nr_hw_queues
> may change, and elevator tags depends on nr_hw_queues.
I don't think it's really updated from error handling
- nbd does it when starting a device
- nullb can do it through debugfs
- xen-blkfront does it when resuming from a suspend
- nvme does it when resetting a controller. While error handling
can escalate to it¸ it's basically probing and re-probing code
> Prevent elevator switch during updating nr_hw_queues by setting flag of
> BLK_MQ_F_UPDATE_HW_QUEUES, and use srcu to fail elevator switch during
> the period. Here elevator switch code is srcu reader of nr_hw_queues,
> and blk_mq_update_nr_hw_queues() is the writer.
That being said as we generally are in a setup path I think the general
idea is fine. No devices should be life yet at this point and thus
no udev rules changing the scheduler should run yet.
> This way avoids lot of trouble.
Can you spell that out a bit?
> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/
Are we using Closes for bug reports now? I haven't really seen that
anywhere.
> out_cleanup_srcu:
> if (set->flags & BLK_MQ_F_BLOCKING)
> cleanup_srcu_struct(set->srcu);
> @@ -5081,7 +5087,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
> {
> mutex_lock(&set->tag_list_lock);
> + /*
> + * Mark us in updating nr_hw_queues for preventing switching
> + * elevator
>
> + *
> + * Elevator switch code can _not_ acquire ->tag_list_lock
Please add a . at the end of a sentences. Also this should probably
be something like "Mark us as in.." but I'll leave more nitpicking
to the native speakers.
> struct request_queue *q = disk->queue;
> + struct blk_mq_tag_set *set = q->tag_set;
>
> /*
> * If the attribute needs to load a module, do it before freezing the
> @@ -732,6 +733,13 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>
> elv_iosched_load_module(name);
>
> + idx = srcu_read_lock(&set->update_nr_hwq_srcu);
> +
> + if (set->flags & BLK_MQ_F_UPDATE_HW_QUEUES) {
What provides atomicity for field modifications vs reading of set->flags?
i.e. does this need to switch using test/set_bit?
> + struct srcu_struct update_nr_hwq_srcu;
> };
>
> /**
> @@ -681,7 +682,14 @@ enum {
> */
> BLK_MQ_F_NO_SCHED_BY_DEFAULT = 1 << 6,
>
> - BLK_MQ_F_MAX = 1 << 7,
> + /*
> + * True when updating nr_hw_queues is in-progress
> + *
> + * tag_set only flag, not usable for hctx
> + */
> + BLK_MQ_F_UPDATE_HW_QUEUES = 1 << 7,
> +
> + BLK_MQ_F_MAX = 1 << 8,
Also mixing internal state with driver provided flags is always
a bad idea. So this should probably be a new state field in the
tag_set and not reuse flags.
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 05/15] block: simplify elevator reset for updating nr_hw_queues
2025-04-10 13:30 ` [PATCH 05/15] block: simplify elevator reset for " Ming Lei
@ 2025-04-10 14:40 ` Christoph Hellwig
2025-04-10 15:34 ` Christoph Hellwig
1 sibling, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2025-04-10 14:40 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Thu, Apr 10, 2025 at 09:30:17PM +0800, Ming Lei wrote:
> In blk_mq_update_nr_hw_queues(), nr_hw_queues may change, so elevator has
> to be reset after nr_hw_queues is changed.
This should be past tense I guess?
> Now elevator switch isn't allowed during blk_mq_update_nr_hw_queues(), so
> we can simply call elevator_change() to reset elevator sched tags after
> nr_hw_queues is updated.
Now that elevator switches aren't allowed during
blk_mq_update_nr_hw_queues(), we can simply call elevator_change() to
reset elevator sched tags after nr_hw_queues is updated.
?
Anyway, glad to see this code go away!
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 05/15] block: simplify elevator reset for updating nr_hw_queues
2025-04-10 13:30 ` [PATCH 05/15] block: simplify elevator reset for " Ming Lei
2025-04-10 14:40 ` Christoph Hellwig
@ 2025-04-10 15:34 ` Christoph Hellwig
2025-04-14 0:58 ` Ming Lei
1 sibling, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2025-04-10 15:34 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Thu, Apr 10, 2025 at 09:30:17PM +0800, Ming Lei wrote:
> @@ -5071,9 +4984,15 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> blk_mq_debugfs_register_hctxs(q);
> }
>
> -switch_back:
> - list_for_each_entry(q, &set->tag_list, tag_set_list)
> - blk_mq_elv_switch_back(&head, q);
> + list_for_each_entry(q, &set->tag_list, tag_set_list) {
> + const char *name = "none";
> +
> + mutex_lock(&q->elevator_lock);
> + if (q->elevator && !blk_queue_dying(q))
> + name = q->elevator->type->elevator_name;
> + __elevator_change(q, name, true);
> + mutex_unlock(&q->elevator_lock);
> + }
Coming back to this after looking through the next patches.
Why do we even need the __elevator_change call here? We've not
actually disabled the elevator, and we prevent other callers
from changing it.
As you pass in the force argument this now always calls
elevator_switch and thus blk_mq_init_sched. But why?
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 09/15] block: unifying elevator change
2025-04-10 13:30 ` [PATCH 09/15] block: " Ming Lei
@ 2025-04-10 18:37 ` Nilay Shroff
2025-04-14 1:22 ` Ming Lei
0 siblings, 1 reply; 57+ messages in thread
From: Nilay Shroff @ 2025-04-10 18:37 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig
On 4/10/25 7:00 PM, Ming Lei wrote:
> /*
> * Use the default elevator settings. If the chosen elevator initialization
> * fails, fall back to the "none" elevator (no elevator).
> */
> -void elevator_init_mq(struct request_queue *q)
> +void elevator_set_default(struct request_queue *q)
> {
> - struct elevator_type *e;
> - unsigned int memflags;
> + struct elev_change_ctx ctx = { };
> int err;
>
> - WARN_ON_ONCE(blk_queue_registered(q));
> -
> - if (unlikely(q->elevator))
> + if (!queue_is_mq(q))
> return;
>
> - e = elevator_get_default(q);
> - if (!e)
> + ctx.name = use_default_elevator(q) ? "mq-deadline" : "none";
> + if (!q->elevator && !strcmp(ctx.name, "none"))
> return;
> + err = elevator_change(q, &ctx);
> + if (err < 0)
> + pr_warn("\"%s\" set elevator failed %d, "
> + "falling back to \"none\"\n", ctx.name, err);
> +}
>
If we fail to set the evator to default (mq-deadline) while registering queue,
because nr_hw_queue update is simultaneously running then we may end up setting
the queue elevator to none and that's not correct. Isn't it?
> +void elevator_set_none(struct request_queue *q)
> +{
> + struct elev_change_ctx ctx = {
> + .name = "none",
> + .uevent = 1,
> + };
> + int err;
>
> - blk_mq_unfreeze_queue(q, memflags);
> + if (!queue_is_mq(q))
> + return;
>
> - if (err) {
> - pr_warn("\"%s\" elevator initialization failed, "
> - "falling back to \"none\"\n", e->elevator_name);
> - }
> + if (!q->elevator)
> + return;
>
> - elevator_put(e);
> + err = elevator_change(q, &ctx);
> + if (err < 0)
> + pr_warn("%s: set none elevator failed %d\n", __func__, err);
> }
>
Here as well if we fail to disable/exit elevator while deleting disk
because nr_hw_queue update is simultaneously running then we may
leak elevator resource?
> @@ -565,11 +559,7 @@ int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
> if (disk->major == BLOCK_EXT_MAJOR)
> blk_free_ext_minor(disk->first_minor);
> out_exit_elevator:
> - if (disk->queue->elevator) {
> - mutex_lock(&disk->queue->elevator_lock);
> - elevator_exit(disk->queue);
> - mutex_unlock(&disk->queue->elevator_lock);
> - }
> + elevator_set_none(disk->queue);
Same comment as above here as well but this is in add_disk code path.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 12/15] block: move debugfs/sysfs register out of freezing queue
2025-04-10 13:30 ` [PATCH 12/15] block: move debugfs/sysfs register out of freezing queue Ming Lei
@ 2025-04-10 18:57 ` Nilay Shroff
2025-04-14 1:42 ` Ming Lei
0 siblings, 1 reply; 57+ messages in thread
From: Nilay Shroff @ 2025-04-10 18:57 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig
On 4/10/25 7:00 PM, Ming Lei wrote:
> Move debugfs/sysfs register out of freezing queue in
> __blk_mq_update_nr_hw_queues(), so that the following lockdep dependency
> can be killed:
>
> #2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
> #1 (fs_reclaim){+.+.}-{0:0}:
> #0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs
>
> And registering/un-registering debugfs/sysfs does not require queue to be
> frozen.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> block/blk-mq.c | 20 ++++++++++----------
> 1 file changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 7219b01764da..0fb72a698d77 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4947,15 +4947,15 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> if (set->nr_maps == 1 && nr_hw_queues == set->nr_hw_queues)
> return;
>
> - memflags = memalloc_noio_save();
> - list_for_each_entry(q, &set->tag_list, tag_set_list)
> - blk_mq_freeze_queue_nomemsave(q);
> -
> list_for_each_entry(q, &set->tag_list, tag_set_list) {
> blk_mq_debugfs_unregister_hctxs(q);
> blk_mq_sysfs_unregister_hctxs(q);
> }
As we removed hctx sysfs protection while un-registering it, this might
cause crash or other side-effect if simultaneously these sysfs attributes
are accessed. The read access of these attributes are still protected
using ->elevator_lock.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 13/15] block: remove several ->elevator_lock
2025-04-10 13:30 ` [PATCH 13/15] block: remove several ->elevator_lock Ming Lei
@ 2025-04-10 19:07 ` Nilay Shroff
2025-04-14 1:46 ` Ming Lei
0 siblings, 1 reply; 57+ messages in thread
From: Nilay Shroff @ 2025-04-10 19:07 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig
On 4/10/25 7:00 PM, Ming Lei wrote:
> Both blk_mq_map_swqueue() and blk_mq_realloc_hw_ctxs() are only called
> from queue initialization or updating nr_hw_queues code, in which
> elevator switch can't happen any more.
>
> So remove these ->elevator_lock uses.
>
But what if blk_mq_map_swqueue runs in parallel, one context from
blk_mq_init_allocated_queue and another from blk_mq_update_nr_hw_queues?
It seems this is possible due to blk_mq_map_swqueue is invoked right
after queue is added in tag-set from blk_mq_init_allocated_queue.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 15/15] block: move wbt_enable_default() out of queue freezing from scheduler's ->exit()
2025-04-10 13:30 ` [PATCH 15/15] block: move wbt_enable_default() out of queue freezing from scheduler's ->exit() Ming Lei
@ 2025-04-10 19:20 ` Nilay Shroff
2025-04-14 1:55 ` Ming Lei
0 siblings, 1 reply; 57+ messages in thread
From: Nilay Shroff @ 2025-04-10 19:20 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig
On 4/10/25 7:00 PM, Ming Lei wrote:
> scheduler's ->exit() is called with queue frozen and elevator lock is held, and
> wbt_enable_default() can't be called with queue frozen, otherwise the
> following lockdep warning is triggered:
>
> #6 (&q->rq_qos_mutex){+.+.}-{4:4}:
> #5 (&eq->sysfs_lock){+.+.}-{4:4}:
> #4 (&q->elevator_lock){+.+.}-{4:4}:
> #3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
> #2 (fs_reclaim){+.+.}-{0:0}:
> #1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
> #0 (&q->debugfs_mutex){+.+.}-{4:4}:
>
> Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
> call it from elevator_change_done().
[...]
> diff --git a/block/elevator.c b/block/elevator.c
> index 1cc640a9db3e..9cf78db4d6a4 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -676,8 +676,13 @@ int elevator_change_done(struct request_queue *q, struct elev_change_ctx *ctx)
> int ret = 0;
>
> if (ctx->old) {
> + bool enable_wbt = test_bit(ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT,
> + &ctx->old->flags);
> +
> elv_unregister_queue(q, ctx->old);
> kobject_put(&ctx->old->kobj);
> + if (enable_wbt)
> + wbt_enable_default(q->disk);
> }
wbt_enable_default is also called from ioc_qos_write and blk_register_queue
with ->elevator_lock protection. So avoiding protection here doesn't seem
correct.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 04/15] block: prevent elevator switch during updating nr_hw_queues
2025-04-10 13:30 ` [PATCH 04/15] block: prevent elevator switch during updating nr_hw_queues Ming Lei
2025-04-10 14:36 ` Christoph Hellwig
@ 2025-04-11 19:13 ` Nilay Shroff
2025-04-14 0:55 ` Ming Lei
1 sibling, 1 reply; 57+ messages in thread
From: Nilay Shroff @ 2025-04-11 19:13 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig
On 4/10/25 7:00 PM, Ming Lei wrote:
> @@ -5081,7 +5087,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
> {
> mutex_lock(&set->tag_list_lock);
> + /*
> + * Mark us in updating nr_hw_queues for preventing switching
> + * elevator
> + *
> + * Elevator switch code can _not_ acquire ->tag_list_lock
> + */
> + set->flags |= BLK_MQ_F_UPDATE_HW_QUEUES;
> + synchronize_srcu(&set->update_nr_hwq_srcu);
> +
> __blk_mq_update_nr_hw_queues(set, nr_hw_queues);
> +
> + set->flags &= BLK_MQ_F_UPDATE_HW_QUEUES;
I think here we want to clear BLK_MQ_F_UPDATE_HW_QUEUES, so we need to
have set->flags updated as below:
set->flags &= ~BLK_MQ_F_UPDATE_HW_QUEUES;
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 11/15] block: move elv_register[unregister]_queue out of elevator_lock
2025-04-10 13:30 ` [PATCH 11/15] block: move elv_register[unregister]_queue out of elevator_lock Ming Lei
@ 2025-04-11 19:20 ` Nilay Shroff
2025-04-14 1:24 ` Ming Lei
0 siblings, 1 reply; 57+ messages in thread
From: Nilay Shroff @ 2025-04-11 19:20 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig
On 4/10/25 7:00 PM, Ming Lei wrote:
> +int elevator_change_done(struct request_queue *q, struct elev_change_ctx *ctx)
> +{
> + int ret = 0;
> +
> + if (ctx->old) {
> + elv_unregister_queue(q, ctx->old);
> + kobject_put(&ctx->old->kobj);
> + }
> + if (ctx->new) {
> + ret = elv_register_queue(q, ctx->new, ctx->uevent);
> + if (ret) {
> + unsigned memflags = blk_mq_freeze_queue(q);
> +
> + mutex_lock(&q->elevator_lock);
> + elevator_exit(q);
> + mutex_unlock(&q->elevator_lock);
> + blk_mq_unfreeze_queue(q, memflags);
> + }
> + }
> + return 0;
> +}
> +
We could have sysf elevator attributes simultaneously accessed while this function
adds/removes sysfs elevator attributes without any protection. In fact, the show/store
methods of elevator attributes runs with e->sysfs_lock held. So it seems moving
the above function out of lock protection might cause crash or other side effects?
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 03/15] block: move sched debugfs register into elvevator_register_queue
2025-04-10 14:27 ` Christoph Hellwig
@ 2025-04-14 0:42 ` Ming Lei
0 siblings, 0 replies; 57+ messages in thread
From: Ming Lei @ 2025-04-14 0:42 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström
On Thu, Apr 10, 2025 at 04:27:20PM +0200, Christoph Hellwig wrote:
> > lockdep_assert_held(&q->elevator_lock);
> >
> > + if (test_bit(ELEVATOR_FLAG_REGISTERED, &e->flags))
> > + return 0;
> > +
>
> This looks unrelate to the rest of the patch, and I also don't understand
> why it's needed as the callers of elv_register_queue don't change.
The check is actually missed, I will move it into one single patch.
Thanks,
Ming
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 04/15] block: prevent elevator switch during updating nr_hw_queues
2025-04-10 14:36 ` Christoph Hellwig
@ 2025-04-14 0:54 ` Ming Lei
2025-04-14 6:07 ` Christoph Hellwig
0 siblings, 1 reply; 57+ messages in thread
From: Ming Lei @ 2025-04-14 0:54 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström
On Thu, Apr 10, 2025 at 04:36:22PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 10, 2025 at 09:30:16PM +0800, Ming Lei wrote:
> > updating nr_hw_queues is usually used for error handling code, when it
>
> Capitalize the first word of each sentence, please.
>
> > doesn't make sense to allow blk-mq elevator switching, since nr_hw_queues
> > may change, and elevator tags depends on nr_hw_queues.
>
> I don't think it's really updated from error handling
NVMe does use it in error handling. I can remove error handling words, but
the trouble doesn't change.
>
> - nbd does it when starting a device
> - nullb can do it through debugfs
> - xen-blkfront does it when resuming from a suspend
> - nvme does it when resetting a controller. While error handling
> can escalate to it¸ it's basically probing and re-probing code
reset is part of error handling.
>
> > Prevent elevator switch during updating nr_hw_queues by setting flag of
> > BLK_MQ_F_UPDATE_HW_QUEUES, and use srcu to fail elevator switch during
> > the period. Here elevator switch code is srcu reader of nr_hw_queues,
> > and blk_mq_update_nr_hw_queues() is the writer.
>
> That being said as we generally are in a setup path I think the general
> idea is fine. No devices should be life yet at this point and thus
> no udev rules changing the scheduler should run yet.
>
> > This way avoids lot of trouble.
>
> Can you spell that out a bit?
Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/
>
> > Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/
>
> Are we using Closes for bug reports now? I haven't really seen that
> anywhere.
The blktests block/039 isn't merged yet, and the patch is posted recently.
kernel panic and kasan is triggered in this test.
>
> > out_cleanup_srcu:
> > if (set->flags & BLK_MQ_F_BLOCKING)
> > cleanup_srcu_struct(set->srcu);
> > @@ -5081,7 +5087,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> > void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
> > {
> > mutex_lock(&set->tag_list_lock);
> > + /*
> > + * Mark us in updating nr_hw_queues for preventing switching
> > + * elevator
> >
> > + *
> > + * Elevator switch code can _not_ acquire ->tag_list_lock
>
> Please add a . at the end of a sentences. Also this should probably
> be something like "Mark us as in.." but I'll leave more nitpicking
> to the native speakers.
OK.
>
> > struct request_queue *q = disk->queue;
> > + struct blk_mq_tag_set *set = q->tag_set;
> >
> > /*
> > * If the attribute needs to load a module, do it before freezing the
> > @@ -732,6 +733,13 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
> >
> > elv_iosched_load_module(name);
> >
> > + idx = srcu_read_lock(&set->update_nr_hwq_srcu);
> > +
> > + if (set->flags & BLK_MQ_F_UPDATE_HW_QUEUES) {
>
> What provides atomicity for field modifications vs reading of set->flags?
> i.e. does this need to switch using test/set_bit?
WRITE is serialized via tag set lock with synchronize_srcu().
READ is covered by srcu read lock.
It is typical RCU usage, one writer vs. multiple writer.
>
> > + struct srcu_struct update_nr_hwq_srcu;
> > };
> >
> > /**
> > @@ -681,7 +682,14 @@ enum {
> > */
> > BLK_MQ_F_NO_SCHED_BY_DEFAULT = 1 << 6,
> >
> > - BLK_MQ_F_MAX = 1 << 7,
> > + /*
> > + * True when updating nr_hw_queues is in-progress
> > + *
> > + * tag_set only flag, not usable for hctx
> > + */
> > + BLK_MQ_F_UPDATE_HW_QUEUES = 1 << 7,
> > +
> > + BLK_MQ_F_MAX = 1 << 8,
>
> Also mixing internal state with driver provided flags is always
> a bad idea. So this should probably be a new state field in the
> tag_set and not reuse flags.
That is fine, but BLK_MQ_F_TAG_QUEUE_SHARED is used in this way too.
thanks,
Ming
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 04/15] block: prevent elevator switch during updating nr_hw_queues
2025-04-11 19:13 ` Nilay Shroff
@ 2025-04-14 0:55 ` Ming Lei
0 siblings, 0 replies; 57+ messages in thread
From: Ming Lei @ 2025-04-14 0:55 UTC (permalink / raw)
To: Nilay Shroff
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Sat, Apr 12, 2025 at 12:43:10AM +0530, Nilay Shroff wrote:
>
>
> On 4/10/25 7:00 PM, Ming Lei wrote:
> > @@ -5081,7 +5087,18 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> > void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
> > {
> > mutex_lock(&set->tag_list_lock);
> > + /*
> > + * Mark us in updating nr_hw_queues for preventing switching
> > + * elevator
> > + *
> > + * Elevator switch code can _not_ acquire ->tag_list_lock
> > + */
> > + set->flags |= BLK_MQ_F_UPDATE_HW_QUEUES;
> > + synchronize_srcu(&set->update_nr_hwq_srcu);
> > +
> > __blk_mq_update_nr_hw_queues(set, nr_hw_queues);
> > +
> > + set->flags &= BLK_MQ_F_UPDATE_HW_QUEUES;
>
> I think here we want to clear BLK_MQ_F_UPDATE_HW_QUEUES, so we need to
> have set->flags updated as below:
>
> set->flags &= ~BLK_MQ_F_UPDATE_HW_QUEUES;
Good catch!
Thanks,
Ming
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 05/15] block: simplify elevator reset for updating nr_hw_queues
2025-04-10 15:34 ` Christoph Hellwig
@ 2025-04-14 0:58 ` Ming Lei
2025-04-14 6:09 ` Christoph Hellwig
0 siblings, 1 reply; 57+ messages in thread
From: Ming Lei @ 2025-04-14 0:58 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström
On Thu, Apr 10, 2025 at 05:34:18PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 10, 2025 at 09:30:17PM +0800, Ming Lei wrote:
> > @@ -5071,9 +4984,15 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> > blk_mq_debugfs_register_hctxs(q);
> > }
> >
> > -switch_back:
> > - list_for_each_entry(q, &set->tag_list, tag_set_list)
> > - blk_mq_elv_switch_back(&head, q);
> > + list_for_each_entry(q, &set->tag_list, tag_set_list) {
> > + const char *name = "none";
> > +
> > + mutex_lock(&q->elevator_lock);
> > + if (q->elevator && !blk_queue_dying(q))
> > + name = q->elevator->type->elevator_name;
> > + __elevator_change(q, name, true);
> > + mutex_unlock(&q->elevator_lock);
> > + }
>
> Coming back to this after looking through the next patches.
>
> Why do we even need the __elevator_change call here? We've not
> actually disabled the elevator, and we prevent other callers
> from changing it.
>
> As you pass in the force argument this now always calls
> elevator_switch and thus blk_mq_init_sched. But why?
sched tags is built over hctx and depends on ->nr_hw_queues,
when nr_hw_queues is changed, sched tags has to be rebuilt.
Thanks,
Ming
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 09/15] block: unifying elevator change
2025-04-10 18:37 ` Nilay Shroff
@ 2025-04-14 1:22 ` Ming Lei
2025-04-15 12:30 ` Nilay Shroff
0 siblings, 1 reply; 57+ messages in thread
From: Ming Lei @ 2025-04-14 1:22 UTC (permalink / raw)
To: Nilay Shroff
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Fri, Apr 11, 2025 at 12:07:34AM +0530, Nilay Shroff wrote:
>
>
> On 4/10/25 7:00 PM, Ming Lei wrote:
> > /*
> > * Use the default elevator settings. If the chosen elevator initialization
> > * fails, fall back to the "none" elevator (no elevator).
> > */
> > -void elevator_init_mq(struct request_queue *q)
> > +void elevator_set_default(struct request_queue *q)
> > {
> > - struct elevator_type *e;
> > - unsigned int memflags;
> > + struct elev_change_ctx ctx = { };
> > int err;
> >
> > - WARN_ON_ONCE(blk_queue_registered(q));
> > -
> > - if (unlikely(q->elevator))
> > + if (!queue_is_mq(q))
> > return;
> >
> > - e = elevator_get_default(q);
> > - if (!e)
> > + ctx.name = use_default_elevator(q) ? "mq-deadline" : "none";
> > + if (!q->elevator && !strcmp(ctx.name, "none"))
> > return;
> > + err = elevator_change(q, &ctx);
> > + if (err < 0)
> > + pr_warn("\"%s\" set elevator failed %d, "
> > + "falling back to \"none\"\n", ctx.name, err);
> > +}
> >
> If we fail to set the evator to default (mq-deadline) while registering queue,
> because nr_hw_queue update is simultaneously running then we may end up setting
> the queue elevator to none and that's not correct. Isn't it?
It still works with none.
I think it isn't one big deal. And if it is really one issue in future, we can
set one flag in elevator_set_default(), and let blk_mq_update_nr_hw_queues set
default sched for us.
>
> > +void elevator_set_none(struct request_queue *q)
> > +{
> > + struct elev_change_ctx ctx = {
> > + .name = "none",
> > + .uevent = 1,
> > + };
> > + int err;
> >
> > - blk_mq_unfreeze_queue(q, memflags);
> > + if (!queue_is_mq(q))
> > + return;
> >
> > - if (err) {
> > - pr_warn("\"%s\" elevator initialization failed, "
> > - "falling back to \"none\"\n", e->elevator_name);
> > - }
> > + if (!q->elevator)
> > + return;
> >
> > - elevator_put(e);
> > + err = elevator_change(q, &ctx);
> > + if (err < 0)
> > + pr_warn("%s: set none elevator failed %d\n", __func__, err);
> > }
> >
> Here as well if we fail to disable/exit elevator while deleting disk
> because nr_hw_queue update is simultaneously running then we may
> leak elevator resource?
When blk_mq_update_nr_hw_queues() observes that queue is dying, it
forces to change elevator to none, so there isn't elevator leak issue.
>
> > @@ -565,11 +559,7 @@ int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
> > if (disk->major == BLOCK_EXT_MAJOR)
> > blk_free_ext_minor(disk->first_minor);
> > out_exit_elevator:
> > - if (disk->queue->elevator) {
> > - mutex_lock(&disk->queue->elevator_lock);
> > - elevator_exit(disk->queue);
> > - mutex_unlock(&disk->queue->elevator_lock);
> > - }
> > + elevator_set_none(disk->queue);
> Same comment as above here as well but this is in add_disk code path.
We can avoid it by forcing to change to none in blk_mq_update_nr_hw_queues() for
!blk_queue_registered()
Thanks,
Ming
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 11/15] block: move elv_register[unregister]_queue out of elevator_lock
2025-04-11 19:20 ` Nilay Shroff
@ 2025-04-14 1:24 ` Ming Lei
2025-04-15 9:39 ` Nilay Shroff
0 siblings, 1 reply; 57+ messages in thread
From: Ming Lei @ 2025-04-14 1:24 UTC (permalink / raw)
To: Nilay Shroff
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Sat, Apr 12, 2025 at 12:50:10AM +0530, Nilay Shroff wrote:
>
>
> On 4/10/25 7:00 PM, Ming Lei wrote:
> > +int elevator_change_done(struct request_queue *q, struct elev_change_ctx *ctx)
> > +{
> > + int ret = 0;
> > +
> > + if (ctx->old) {
> > + elv_unregister_queue(q, ctx->old);
> > + kobject_put(&ctx->old->kobj);
> > + }
> > + if (ctx->new) {
> > + ret = elv_register_queue(q, ctx->new, ctx->uevent);
> > + if (ret) {
> > + unsigned memflags = blk_mq_freeze_queue(q);
> > +
> > + mutex_lock(&q->elevator_lock);
> > + elevator_exit(q);
> > + mutex_unlock(&q->elevator_lock);
> > + blk_mq_unfreeze_queue(q, memflags);
> > + }
> > + }
> > + return 0;
> > +}
> > +
> We could have sysf elevator attributes simultaneously accessed while this function
> adds/removes sysfs elevator attributes without any protection. In fact, the show/store
> methods of elevator attributes runs with e->sysfs_lock held. So it seems moving
> the above function out of lock protection might cause crash or other side effects?
sysfs/kobject provides such protection, and kobject_del() will drain any
in-flight attribute access.
Thanks,
Ming
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 12/15] block: move debugfs/sysfs register out of freezing queue
2025-04-10 18:57 ` Nilay Shroff
@ 2025-04-14 1:42 ` Ming Lei
2025-04-15 9:37 ` Nilay Shroff
0 siblings, 1 reply; 57+ messages in thread
From: Ming Lei @ 2025-04-14 1:42 UTC (permalink / raw)
To: Nilay Shroff
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Fri, Apr 11, 2025 at 12:27:17AM +0530, Nilay Shroff wrote:
>
>
> On 4/10/25 7:00 PM, Ming Lei wrote:
> > Move debugfs/sysfs register out of freezing queue in
> > __blk_mq_update_nr_hw_queues(), so that the following lockdep dependency
> > can be killed:
> >
> > #2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
> > #1 (fs_reclaim){+.+.}-{0:0}:
> > #0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs
> >
> > And registering/un-registering debugfs/sysfs does not require queue to be
> > frozen.
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> > block/blk-mq.c | 20 ++++++++++----------
> > 1 file changed, 10 insertions(+), 10 deletions(-)
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 7219b01764da..0fb72a698d77 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -4947,15 +4947,15 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> > if (set->nr_maps == 1 && nr_hw_queues == set->nr_hw_queues)
> > return;
> >
> > - memflags = memalloc_noio_save();
> > - list_for_each_entry(q, &set->tag_list, tag_set_list)
> > - blk_mq_freeze_queue_nomemsave(q);
> > -
> > list_for_each_entry(q, &set->tag_list, tag_set_list) {
> > blk_mq_debugfs_unregister_hctxs(q);
> > blk_mq_sysfs_unregister_hctxs(q);
> > }
> As we removed hctx sysfs protection while un-registering it, this might
> cause crash or other side-effect if simultaneously these sysfs attributes
> are accessed. The read access of these attributes are still protected
> using ->elevator_lock.
The ->elevator_lock in ->show() is useless except for reading the elevator
internal data(sched tags, requests, ...), even for reading elevator data,
it should have been relying on elevator reference, instead of lock, but
that is another topic & improvement in future.
Also this patch does _not_ change ->elevator_lock for above debugfs/sysfs
unregistering, does it? It is always done without holding ->elevator_lock.
Also ->show() does not require ->q_usage_counter too.
As I mentioned, kobject/sysfs provides protection between ->show()/->store()
and kobject_del(), isn't it the reason why you want to remove ->sys_lock?
https://lore.kernel.org/linux-block/20250226124006.1593985-1-nilay@linux.ibm.com/
Thanks,
Ming
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 13/15] block: remove several ->elevator_lock
2025-04-10 19:07 ` Nilay Shroff
@ 2025-04-14 1:46 ` Ming Lei
0 siblings, 0 replies; 57+ messages in thread
From: Ming Lei @ 2025-04-14 1:46 UTC (permalink / raw)
To: Nilay Shroff
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Fri, Apr 11, 2025 at 12:37:49AM +0530, Nilay Shroff wrote:
>
>
> On 4/10/25 7:00 PM, Ming Lei wrote:
> > Both blk_mq_map_swqueue() and blk_mq_realloc_hw_ctxs() are only called
> > from queue initialization or updating nr_hw_queues code, in which
> > elevator switch can't happen any more.
> >
> > So remove these ->elevator_lock uses.
> >
> But what if blk_mq_map_swqueue runs in parallel, one context from
> blk_mq_init_allocated_queue and another from blk_mq_update_nr_hw_queues?
> It seems this is possible due to blk_mq_map_swqueue is invoked right
> after queue is added in tag-set from blk_mq_init_allocated_queue.
Good catch, one simple fix is to swap blk_mq_add_queue_tag_set() with
blk_mq_map_swqueue() in blk_mq_init_allocated_queue() since blk_mq_map_swqueue
doesn't rely on BLK_MQ_F_TAG_QUEUE_SHARED.
Thanks,
Ming
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 15/15] block: move wbt_enable_default() out of queue freezing from scheduler's ->exit()
2025-04-10 19:20 ` Nilay Shroff
@ 2025-04-14 1:55 ` Ming Lei
0 siblings, 0 replies; 57+ messages in thread
From: Ming Lei @ 2025-04-14 1:55 UTC (permalink / raw)
To: Nilay Shroff
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Fri, Apr 11, 2025 at 12:50:42AM +0530, Nilay Shroff wrote:
>
>
> On 4/10/25 7:00 PM, Ming Lei wrote:
> > scheduler's ->exit() is called with queue frozen and elevator lock is held, and
> > wbt_enable_default() can't be called with queue frozen, otherwise the
> > following lockdep warning is triggered:
> >
> > #6 (&q->rq_qos_mutex){+.+.}-{4:4}:
> > #5 (&eq->sysfs_lock){+.+.}-{4:4}:
> > #4 (&q->elevator_lock){+.+.}-{4:4}:
> > #3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
> > #2 (fs_reclaim){+.+.}-{0:0}:
> > #1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
> > #0 (&q->debugfs_mutex){+.+.}-{4:4}:
> >
> > Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
> > call it from elevator_change_done().
>
> [...]
>
> > diff --git a/block/elevator.c b/block/elevator.c
> > index 1cc640a9db3e..9cf78db4d6a4 100644
> > --- a/block/elevator.c
> > +++ b/block/elevator.c
> > @@ -676,8 +676,13 @@ int elevator_change_done(struct request_queue *q, struct elev_change_ctx *ctx)
> > int ret = 0;
> >
> > if (ctx->old) {
> > + bool enable_wbt = test_bit(ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT,
> > + &ctx->old->flags);
> > +
> > elv_unregister_queue(q, ctx->old);
> > kobject_put(&ctx->old->kobj);
> > + if (enable_wbt)
> > + wbt_enable_default(q->disk);
> > }
> wbt_enable_default is also called from ioc_qos_write and blk_register_queue
> with ->elevator_lock protection. So avoiding protection here doesn't seem
> correct.
The only code which needs the protection should be:
```
wbt_enable_default():
...
if (q->elevator && test_bit(ELEVATOR_FLAG_DISABLE_WBT, &q->elevator->flags))
enable = false;
```
The flag can be moved to request queue from elevator queue, will do it in
V2.
Thanks,
Ming
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 04/15] block: prevent elevator switch during updating nr_hw_queues
2025-04-14 0:54 ` Ming Lei
@ 2025-04-14 6:07 ` Christoph Hellwig
2025-04-15 2:03 ` Ming Lei
0 siblings, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2025-04-14 6:07 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, Jens Axboe, linux-block, Nilay Shroff,
Shinichiro Kawasaki, Thomas Hellström
On Mon, Apr 14, 2025 at 08:54:36AM +0800, Ming Lei wrote:
> > > elv_iosched_load_module(name);
> > >
> > > + idx = srcu_read_lock(&set->update_nr_hwq_srcu);
> > > +
> > > + if (set->flags & BLK_MQ_F_UPDATE_HW_QUEUES) {
> >
> > What provides atomicity for field modifications vs reading of set->flags?
> > i.e. does this need to switch using test/set_bit?
>
> WRITE is serialized via tag set lock with synchronize_srcu().
>
> READ is covered by srcu read lock.
>
> It is typical RCU usage, one writer vs. multiple writer.
No, (S)RCU does not help you with atomicy of bitfields.
> > Also mixing internal state with driver provided flags is always
> > a bad idea. So this should probably be a new state field in the
> > tag_set and not reuse flags.
>
> That is fine, but BLK_MQ_F_TAG_QUEUE_SHARED is used in this way too.
Yes, that should also move over to the state field. Amd rnbd needs
to be fixed to not set it diretly which is a good example for why
they should not be mixed..
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 05/15] block: simplify elevator reset for updating nr_hw_queues
2025-04-14 0:58 ` Ming Lei
@ 2025-04-14 6:09 ` Christoph Hellwig
2025-04-15 2:05 ` Ming Lei
0 siblings, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2025-04-14 6:09 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, Jens Axboe, linux-block, Nilay Shroff,
Shinichiro Kawasaki, Thomas Hellström
On Mon, Apr 14, 2025 at 08:58:28AM +0800, Ming Lei wrote:
> > Coming back to this after looking through the next patches.
> >
> > Why do we even need the __elevator_change call here? We've not
> > actually disabled the elevator, and we prevent other callers
> > from changing it.
> >
> > As you pass in the force argument this now always calls
> > elevator_switch and thus blk_mq_init_sched. But why?
>
> sched tags is built over hctx and depends on ->nr_hw_queues,
> when nr_hw_queues is changed, sched tags has to be rebuilt.
Can you add a comment explaining this?
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 07/15] block: move blk_unregister_queue() & device_del() after freeze wait
2025-04-10 13:30 ` [PATCH 07/15] block: move blk_unregister_queue() & device_del() after freeze wait Ming Lei
@ 2025-04-14 6:19 ` Christoph Hellwig
2025-04-15 2:26 ` Ming Lei
0 siblings, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2025-04-14 6:19 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Thu, Apr 10, 2025 at 09:30:19PM +0800, Ming Lei wrote:
> Move blk_unregister_queue() & device_del() after freeze wait, and prepare
> for unifying elevator switch.
>
> This way is just fine, since bdev has been unhashed at the beginning of
> del_gendisk(), both blk_unregister_queue() & device_del() are dealing
> with kobject & debugfs thing only.
While I believe this is the right thing to do, moving the freeze wait
caused issues in the past, so be careful. Take a look at:
commit 4c66a326b5ab784cddd72de07ac5b6210e9e1b06
Author: Christoph Hellwig <hch@lst.de>
Date: Mon Sep 19 16:40:49 2022 +0200
Revert "block: freeze the queue earlier in del_gendisk"
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 08/15] block: add `struct elev_change_ctx` for unifying elevator change
2025-04-10 13:30 ` [PATCH 08/15] block: add `struct elev_change_ctx` for unifying elevator change Ming Lei
@ 2025-04-14 6:21 ` Christoph Hellwig
0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2025-04-14 6:21 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Thu, Apr 10, 2025 at 09:30:20PM +0800, Ming Lei wrote:
> + struct elev_change_ctx ctx = {
> + .name = "none",
> + .force = 1,
> + .uevent = 1,
> + };
>
> mutex_lock(&q->elevator_lock);
> if (q->elevator && !blk_queue_dying(q))
> + ctx.name = q->elevator->type->elevator_name;
> + __elevator_change(q, &ctx);
> mutex_unlock(&q->elevator_lock);
Can we have this whole code section in a helper in elevator.c and
keep the low-level __elevator_change function private there?
> +/* Holding data for changing elevator */
> +struct elev_change_ctx {
> + const char *name;
> + unsigned int force:1;
> + unsigned int uevent:1;
Using an integer type for flags is odd. This should a boolean, or even
better aflags field with flag names.
And I think with the flag names we can just pass the name and the flags
and do away with the separate structure entirely.
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 10/15] block: pass elevator_queue to elv_register_queue & unregister_queue
2025-04-10 13:30 ` [PATCH 10/15] block: pass elevator_queue to elv_register_queue & unregister_queue Ming Lei
@ 2025-04-14 6:22 ` Christoph Hellwig
2025-04-15 2:31 ` Ming Lei
0 siblings, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2025-04-14 6:22 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Thu, Apr 10, 2025 at 09:30:22PM +0800, Ming Lei wrote:
> Pass elevator_queue reference to elv_register_queue() &
> elv_unregister_queue().
>
> No functional change, and prepare for moving the two out of elevator
> lock.
Can you explain a bit more why passing a seemlingly duplicate argument
helps further down the series?
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 04/15] block: prevent elevator switch during updating nr_hw_queues
2025-04-14 6:07 ` Christoph Hellwig
@ 2025-04-15 2:03 ` Ming Lei
0 siblings, 0 replies; 57+ messages in thread
From: Ming Lei @ 2025-04-15 2:03 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström
On Mon, Apr 14, 2025 at 08:07:54AM +0200, Christoph Hellwig wrote:
> On Mon, Apr 14, 2025 at 08:54:36AM +0800, Ming Lei wrote:
> > > > elv_iosched_load_module(name);
> > > >
> > > > + idx = srcu_read_lock(&set->update_nr_hwq_srcu);
> > > > +
> > > > + if (set->flags & BLK_MQ_F_UPDATE_HW_QUEUES) {
> > >
> > > What provides atomicity for field modifications vs reading of set->flags?
> > > i.e. does this need to switch using test/set_bit?
> >
> > WRITE is serialized via tag set lock with synchronize_srcu().
> >
> > READ is covered by srcu read lock.
> >
> > It is typical RCU usage, one writer vs. multiple writer.
>
> No, (S)RCU does not help you with atomicy of bitfields.
OK, will change it into one state variable.
Thanks,
Ming
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 05/15] block: simplify elevator reset for updating nr_hw_queues
2025-04-14 6:09 ` Christoph Hellwig
@ 2025-04-15 2:05 ` Ming Lei
0 siblings, 0 replies; 57+ messages in thread
From: Ming Lei @ 2025-04-15 2:05 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström
On Mon, Apr 14, 2025 at 08:09:59AM +0200, Christoph Hellwig wrote:
> On Mon, Apr 14, 2025 at 08:58:28AM +0800, Ming Lei wrote:
> > > Coming back to this after looking through the next patches.
> > >
> > > Why do we even need the __elevator_change call here? We've not
> > > actually disabled the elevator, and we prevent other callers
> > > from changing it.
> > >
> > > As you pass in the force argument this now always calls
> > > elevator_switch and thus blk_mq_init_sched. But why?
> >
> > sched tags is built over hctx and depends on ->nr_hw_queues,
> > when nr_hw_queues is changed, sched tags has to be rebuilt.
>
> Can you add a comment explaining this?
Sure.
Thanks,
Ming
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 07/15] block: move blk_unregister_queue() & device_del() after freeze wait
2025-04-14 6:19 ` Christoph Hellwig
@ 2025-04-15 2:26 ` Ming Lei
0 siblings, 0 replies; 57+ messages in thread
From: Ming Lei @ 2025-04-15 2:26 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström
On Mon, Apr 14, 2025 at 08:19:10AM +0200, Christoph Hellwig wrote:
> On Thu, Apr 10, 2025 at 09:30:19PM +0800, Ming Lei wrote:
> > Move blk_unregister_queue() & device_del() after freeze wait, and prepare
> > for unifying elevator switch.
> >
> > This way is just fine, since bdev has been unhashed at the beginning of
> > del_gendisk(), both blk_unregister_queue() & device_del() are dealing
> > with kobject & debugfs thing only.
>
> While I believe this is the right thing to do, moving the freeze wait
> caused issues in the past, so be careful. Take a look at:
>
> commit 4c66a326b5ab784cddd72de07ac5b6210e9e1b06
> Author: Christoph Hellwig <hch@lst.de>
> Date: Mon Sep 19 16:40:49 2022 +0200
>
> Revert "block: freeze the queue earlier in del_gendisk"
Yeah, I know this story.
If it is triggered again with this patch, I will help to root cause.
The last thing we still can do is to just moving blk_unregister_queue()
after queue is frozen, or even elevator tear down. It is allowed to delete
children kobjects after their parent is removed. Just the KOBJ_REMOVE
event need to be ordered.
Thanks,
Ming
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 07/15] block: move blk_unregister_queue() & device_del() after freeze wait
@ 2025-04-15 2:27 Ming Lei
0 siblings, 0 replies; 57+ messages in thread
From: Ming Lei @ 2025-04-15 2:27 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström
[-- Attachment #1: Type: text/plain, Size: 193 bytes --]
Sender: ming.lei@redhat.com
Subject: Re: [PATCH 07/15] block: move blk_unregister_queue() & device_del() after freeze wait
Message-Id: <Z_3D0PZWolXUdXnK@fedora>
Recipient: yangpanfei@honor.com
[-- Attachment #2: Type: message/rfc822, Size: 10971 bytes --]
From: Ming Lei <ming.lei@redhat.com>
To: Christoph Hellwig <hch@lst.de>
Cc: "Jens Axboe" <axboe@kernel.dk>, linux-block@vger.kernel.org, "Nilay Shroff" <nilay@linux.ibm.com>, "Shinichiro Kawasaki" <shinichiro.kawasaki@wdc.com>, "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Subject: Re: [PATCH 07/15] block: move blk_unregister_queue() & device_del() after freeze wait
Date: Tue, 15 Apr 2025 10:26:24 +0800
Message-ID: <Z_3D0PZWolXUdXnK@fedora>
On Mon, Apr 14, 2025 at 08:19:10AM +0200, Christoph Hellwig wrote:
> On Thu, Apr 10, 2025 at 09:30:19PM +0800, Ming Lei wrote:
> > Move blk_unregister_queue() & device_del() after freeze wait, and prepare
> > for unifying elevator switch.
> >
> > This way is just fine, since bdev has been unhashed at the beginning of
> > del_gendisk(), both blk_unregister_queue() & device_del() are dealing
> > with kobject & debugfs thing only.
>
> While I believe this is the right thing to do, moving the freeze wait
> caused issues in the past, so be careful. Take a look at:
>
> commit 4c66a326b5ab784cddd72de07ac5b6210e9e1b06
> Author: Christoph Hellwig <hch@lst.de>
> Date: Mon Sep 19 16:40:49 2022 +0200
>
> Revert "block: freeze the queue earlier in del_gendisk"
Yeah, I know this story.
If it is triggered again with this patch, I will help to root cause.
The last thing we still can do is to just moving blk_unregister_queue()
after queue is frozen, or even elevator tear down. It is allowed to delete
children kobjects after their parent is removed. Just the KOBJ_REMOVE
event need to be ordered.
Thanks,
Ming
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 10/15] block: pass elevator_queue to elv_register_queue & unregister_queue
2025-04-14 6:22 ` Christoph Hellwig
@ 2025-04-15 2:31 ` Ming Lei
2025-04-16 4:53 ` Christoph Hellwig
0 siblings, 1 reply; 57+ messages in thread
From: Ming Lei @ 2025-04-15 2:31 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström
On Mon, Apr 14, 2025 at 08:22:09AM +0200, Christoph Hellwig wrote:
> On Thu, Apr 10, 2025 at 09:30:22PM +0800, Ming Lei wrote:
> > Pass elevator_queue reference to elv_register_queue() &
> > elv_unregister_queue().
> >
> > No functional change, and prepare for moving the two out of elevator
> > lock.
>
> Can you explain a bit more why passing a seemlingly duplicate argument
> helps further down the series?
Please see the following patch, especially elevator_change_done() in which
the old elevator queue is retrieved from the context structure, then its
unregistering can be moved out of queue freezing & elevator lock.
Same with registering of the new added elevator queue.
I will add more words in commit log for this motivation in next version.
Thanks,
Ming
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 12/15] block: move debugfs/sysfs register out of freezing queue
2025-04-14 1:42 ` Ming Lei
@ 2025-04-15 9:37 ` Nilay Shroff
2025-04-15 10:06 ` Ming Lei
0 siblings, 1 reply; 57+ messages in thread
From: Nilay Shroff @ 2025-04-15 9:37 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On 4/14/25 7:12 AM, Ming Lei wrote:
> On Fri, Apr 11, 2025 at 12:27:17AM +0530, Nilay Shroff wrote:
>>
>>
>> On 4/10/25 7:00 PM, Ming Lei wrote:
>>> Move debugfs/sysfs register out of freezing queue in
>>> __blk_mq_update_nr_hw_queues(), so that the following lockdep dependency
>>> can be killed:
>>>
>>> #2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
>>> #1 (fs_reclaim){+.+.}-{0:0}:
>>> #0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs
>>>
>>> And registering/un-registering debugfs/sysfs does not require queue to be
>>> frozen.
>>>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>> ---
>>> block/blk-mq.c | 20 ++++++++++----------
>>> 1 file changed, 10 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 7219b01764da..0fb72a698d77 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -4947,15 +4947,15 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>> if (set->nr_maps == 1 && nr_hw_queues == set->nr_hw_queues)
>>> return;
>>>
>>> - memflags = memalloc_noio_save();
>>> - list_for_each_entry(q, &set->tag_list, tag_set_list)
>>> - blk_mq_freeze_queue_nomemsave(q);
>>> -
>>> list_for_each_entry(q, &set->tag_list, tag_set_list) {
>>> blk_mq_debugfs_unregister_hctxs(q);
>>> blk_mq_sysfs_unregister_hctxs(q);
>>> }
>> As we removed hctx sysfs protection while un-registering it, this might
>> cause crash or other side-effect if simultaneously these sysfs attributes
>> are accessed. The read access of these attributes are still protected
>> using ->elevator_lock.
>
> The ->elevator_lock in ->show() is useless except for reading the elevator
> internal data(sched tags, requests, ...), even for reading elevator data,
> it should have been relying on elevator reference, instead of lock, but
> that is another topic & improvement in future.
>
> Also this patch does _not_ change ->elevator_lock for above debugfs/sysfs
> unregistering, does it? It is always done without holding ->elevator_lock.
> Also ->show() does not require ->q_usage_counter too.
>
> As I mentioned, kobject/sysfs provides protection between ->show()/->store()
> and kobject_del(), isn't it the reason why you want to remove ->sys_lock?
>
> https://lore.kernel.org/linux-block/20250226124006.1593985-1-nilay@linux.ibm.com/
>
Yes you were correct, that was the reason we wanted to remove ->sysfs_lock.
However for these particular hctx sysfs attributes (nr_tags and nr_reserved_tags)
could be updated simultaneously from another blk-mq sysfs attribute named nr_requests.
Hence IMO, the default protection provided by sysfs/kernfs may not be sufficient and
so we need to protect those attributes using ->elevator_lock.
Consider this case: While blk_mq_update_nr_hw_queues removes hctx attributes,
and simultaneously if nr_requests is also updating num of tags, would that not
cause any side effect? Maybe we also want to protect blk_mq_update_nr_requests
with srcu read lock (set->update_nr_hwq_srcu) so that it couldn't run while
blk_mq_update_nr_hw_queues is in progress?
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 11/15] block: move elv_register[unregister]_queue out of elevator_lock
2025-04-14 1:24 ` Ming Lei
@ 2025-04-15 9:39 ` Nilay Shroff
2025-04-15 10:32 ` Ming Lei
0 siblings, 1 reply; 57+ messages in thread
From: Nilay Shroff @ 2025-04-15 9:39 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On 4/14/25 6:54 AM, Ming Lei wrote:
> On Sat, Apr 12, 2025 at 12:50:10AM +0530, Nilay Shroff wrote:
>>
>>
>> On 4/10/25 7:00 PM, Ming Lei wrote:
>>> +int elevator_change_done(struct request_queue *q, struct elev_change_ctx *ctx)
>>> +{
>>> + int ret = 0;
>>> +
>>> + if (ctx->old) {
>>> + elv_unregister_queue(q, ctx->old);
>>> + kobject_put(&ctx->old->kobj);
>>> + }
>>> + if (ctx->new) {
>>> + ret = elv_register_queue(q, ctx->new, ctx->uevent);
>>> + if (ret) {
>>> + unsigned memflags = blk_mq_freeze_queue(q);
>>> +
>>> + mutex_lock(&q->elevator_lock);
>>> + elevator_exit(q);
>>> + mutex_unlock(&q->elevator_lock);
>>> + blk_mq_unfreeze_queue(q, memflags);
>>> + }
>>> + }
>>> + return 0;
>>> +}
>>> +
>> We could have sysf elevator attributes simultaneously accessed while this function
>> adds/removes sysfs elevator attributes without any protection. In fact, the show/store
>> methods of elevator attributes runs with e->sysfs_lock held. So it seems moving
>> the above function out of lock protection might cause crash or other side effects?
>
> sysfs/kobject provides such protection, and kobject_del() will drain any
> in-flight attribute access.
>
Okay, so in that case do we now really need e->sysfs_lock protection while accessing
elevator attributes?
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 12/15] block: move debugfs/sysfs register out of freezing queue
2025-04-15 9:37 ` Nilay Shroff
@ 2025-04-15 10:06 ` Ming Lei
2025-04-15 11:15 ` Nilay Shroff
0 siblings, 1 reply; 57+ messages in thread
From: Ming Lei @ 2025-04-15 10:06 UTC (permalink / raw)
To: Nilay Shroff
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Tue, Apr 15, 2025 at 03:07:18PM +0530, Nilay Shroff wrote:
>
>
> On 4/14/25 7:12 AM, Ming Lei wrote:
> > On Fri, Apr 11, 2025 at 12:27:17AM +0530, Nilay Shroff wrote:
> >>
> >>
> >> On 4/10/25 7:00 PM, Ming Lei wrote:
> >>> Move debugfs/sysfs register out of freezing queue in
> >>> __blk_mq_update_nr_hw_queues(), so that the following lockdep dependency
> >>> can be killed:
> >>>
> >>> #2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
> >>> #1 (fs_reclaim){+.+.}-{0:0}:
> >>> #0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs
> >>>
> >>> And registering/un-registering debugfs/sysfs does not require queue to be
> >>> frozen.
> >>>
> >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>> ---
> >>> block/blk-mq.c | 20 ++++++++++----------
> >>> 1 file changed, 10 insertions(+), 10 deletions(-)
> >>>
> >>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >>> index 7219b01764da..0fb72a698d77 100644
> >>> --- a/block/blk-mq.c
> >>> +++ b/block/blk-mq.c
> >>> @@ -4947,15 +4947,15 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> >>> if (set->nr_maps == 1 && nr_hw_queues == set->nr_hw_queues)
> >>> return;
> >>>
> >>> - memflags = memalloc_noio_save();
> >>> - list_for_each_entry(q, &set->tag_list, tag_set_list)
> >>> - blk_mq_freeze_queue_nomemsave(q);
> >>> -
> >>> list_for_each_entry(q, &set->tag_list, tag_set_list) {
> >>> blk_mq_debugfs_unregister_hctxs(q);
> >>> blk_mq_sysfs_unregister_hctxs(q);
> >>> }
> >> As we removed hctx sysfs protection while un-registering it, this might
> >> cause crash or other side-effect if simultaneously these sysfs attributes
> >> are accessed. The read access of these attributes are still protected
> >> using ->elevator_lock.
> >
> > The ->elevator_lock in ->show() is useless except for reading the elevator
> > internal data(sched tags, requests, ...), even for reading elevator data,
> > it should have been relying on elevator reference, instead of lock, but
> > that is another topic & improvement in future.
> >
> > Also this patch does _not_ change ->elevator_lock for above debugfs/sysfs
> > unregistering, does it? It is always done without holding ->elevator_lock.
> > Also ->show() does not require ->q_usage_counter too.
> >
> > As I mentioned, kobject/sysfs provides protection between ->show()/->store()
> > and kobject_del(), isn't it the reason why you want to remove ->sys_lock?
> >
> > https://lore.kernel.org/linux-block/20250226124006.1593985-1-nilay@linux.ibm.com/
> >
> Yes you were correct, that was the reason we wanted to remove ->sysfs_lock.
> However for these particular hctx sysfs attributes (nr_tags and nr_reserved_tags)
> could be updated simultaneously from another blk-mq sysfs attribute named nr_requests.
> Hence IMO, the default protection provided by sysfs/kernfs may not be sufficient and
> so we need to protect those attributes using ->elevator_lock.
Yes, what is why this patchset doesn't kill more ->elevator_lock uses, such
as, the uses in blk-mq-debugs, update_nr_requests, but many of them can be
replaced with grabbing elevator reference.
But with/without this patch, the touched register/unregisger code does not
require ->elevator_lock:
blk_mq_debugfs_unregister_hctxs(q);
blk_mq_sysfs_unregister_hctxs(q);
so I don't understand why you argue here about ->elevator_lock use?
>
> Consider this case: While blk_mq_update_nr_hw_queues removes hctx attributes,
> and simultaneously if nr_requests is also updating num of tags, would that not
> cause any side effect?
Why is updating nr_requests related with removing hctx attributes?
Can you explain the side effect in details?
> Maybe we also want to protect blk_mq_update_nr_requests
> with srcu read lock (set->update_nr_hwq_srcu) so that it couldn't run while
> blk_mq_update_nr_hw_queues is in progress?
Yeah, agree, and it can be one new patch for covering race between
blk_mq_update_nr_requests and blk_mq_update_nr_hw_queues, the point is just
that nr_hw_queues is being changed, and not related with removing hctx
attributes, IMO.
Thanks,
Ming
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 11/15] block: move elv_register[unregister]_queue out of elevator_lock
2025-04-15 9:39 ` Nilay Shroff
@ 2025-04-15 10:32 ` Ming Lei
0 siblings, 0 replies; 57+ messages in thread
From: Ming Lei @ 2025-04-15 10:32 UTC (permalink / raw)
To: Nilay Shroff
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Tue, Apr 15, 2025 at 03:09:12PM +0530, Nilay Shroff wrote:
>
>
> On 4/14/25 6:54 AM, Ming Lei wrote:
> > On Sat, Apr 12, 2025 at 12:50:10AM +0530, Nilay Shroff wrote:
> >>
> >>
> >> On 4/10/25 7:00 PM, Ming Lei wrote:
> >>> +int elevator_change_done(struct request_queue *q, struct elev_change_ctx *ctx)
> >>> +{
> >>> + int ret = 0;
> >>> +
> >>> + if (ctx->old) {
> >>> + elv_unregister_queue(q, ctx->old);
> >>> + kobject_put(&ctx->old->kobj);
> >>> + }
> >>> + if (ctx->new) {
> >>> + ret = elv_register_queue(q, ctx->new, ctx->uevent);
> >>> + if (ret) {
> >>> + unsigned memflags = blk_mq_freeze_queue(q);
> >>> +
> >>> + mutex_lock(&q->elevator_lock);
> >>> + elevator_exit(q);
> >>> + mutex_unlock(&q->elevator_lock);
> >>> + blk_mq_unfreeze_queue(q, memflags);
> >>> + }
> >>> + }
> >>> + return 0;
> >>> +}
> >>> +
> >> We could have sysf elevator attributes simultaneously accessed while this function
> >> adds/removes sysfs elevator attributes without any protection. In fact, the show/store
> >> methods of elevator attributes runs with e->sysfs_lock held. So it seems moving
> >> the above function out of lock protection might cause crash or other side effects?
> >
> > sysfs/kobject provides such protection, and kobject_del() will drain any
> > in-flight attribute access.
> >
> Okay, so in that case do we now really need e->sysfs_lock protection while accessing
> elevator attributes?
Yeah, I think so, elevator_exit() is always called after elevator kobject
is deleted.
However, this patchset moves elv_unregister_queue() after blk_mq_exit_sched(),
we may need this lock for failing elevator's attribute ->show() & ->store()
by adding one '->exiting' flag to 'struct elevator_queue'.
Thanks,
Ming
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 12/15] block: move debugfs/sysfs register out of freezing queue
2025-04-15 10:06 ` Ming Lei
@ 2025-04-15 11:15 ` Nilay Shroff
2025-04-15 11:54 ` Ming Lei
0 siblings, 1 reply; 57+ messages in thread
From: Nilay Shroff @ 2025-04-15 11:15 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On 4/15/25 3:36 PM, Ming Lei wrote:
> On Tue, Apr 15, 2025 at 03:07:18PM +0530, Nilay Shroff wrote:
>>
>>
>> On 4/14/25 7:12 AM, Ming Lei wrote:
>>> On Fri, Apr 11, 2025 at 12:27:17AM +0530, Nilay Shroff wrote:
>>>>
>>>>
>>>> On 4/10/25 7:00 PM, Ming Lei wrote:
>>>>> Move debugfs/sysfs register out of freezing queue in
>>>>> __blk_mq_update_nr_hw_queues(), so that the following lockdep dependency
>>>>> can be killed:
>>>>>
>>>>> #2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
>>>>> #1 (fs_reclaim){+.+.}-{0:0}:
>>>>> #0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs
>>>>>
>>>>> And registering/un-registering debugfs/sysfs does not require queue to be
>>>>> frozen.
>>>>>
>>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>>>> ---
>>>>> block/blk-mq.c | 20 ++++++++++----------
>>>>> 1 file changed, 10 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>> index 7219b01764da..0fb72a698d77 100644
>>>>> --- a/block/blk-mq.c
>>>>> +++ b/block/blk-mq.c
>>>>> @@ -4947,15 +4947,15 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>>> if (set->nr_maps == 1 && nr_hw_queues == set->nr_hw_queues)
>>>>> return;
>>>>>
>>>>> - memflags = memalloc_noio_save();
>>>>> - list_for_each_entry(q, &set->tag_list, tag_set_list)
>>>>> - blk_mq_freeze_queue_nomemsave(q);
>>>>> -
>>>>> list_for_each_entry(q, &set->tag_list, tag_set_list) {
>>>>> blk_mq_debugfs_unregister_hctxs(q);
>>>>> blk_mq_sysfs_unregister_hctxs(q);
>>>>> }
>>>> As we removed hctx sysfs protection while un-registering it, this might
>>>> cause crash or other side-effect if simultaneously these sysfs attributes
>>>> are accessed. The read access of these attributes are still protected
>>>> using ->elevator_lock.
>>>
>>> The ->elevator_lock in ->show() is useless except for reading the elevator
>>> internal data(sched tags, requests, ...), even for reading elevator data,
>>> it should have been relying on elevator reference, instead of lock, but
>>> that is another topic & improvement in future.
>>>
>>> Also this patch does _not_ change ->elevator_lock for above debugfs/sysfs
>>> unregistering, does it? It is always done without holding ->elevator_lock.
>>> Also ->show() does not require ->q_usage_counter too.
>>>
>>> As I mentioned, kobject/sysfs provides protection between ->show()/->store()
>>> and kobject_del(), isn't it the reason why you want to remove ->sys_lock?
>>>
>>> https://lore.kernel.org/linux-block/20250226124006.1593985-1-nilay@linux.ibm.com/
>>>
>> Yes you were correct, that was the reason we wanted to remove ->sysfs_lock.
>> However for these particular hctx sysfs attributes (nr_tags and nr_reserved_tags)
>> could be updated simultaneously from another blk-mq sysfs attribute named nr_requests.
>> Hence IMO, the default protection provided by sysfs/kernfs may not be sufficient and
>> so we need to protect those attributes using ->elevator_lock.
>
> Yes, what is why this patchset doesn't kill more ->elevator_lock uses, such
> as, the uses in blk-mq-debugs, update_nr_requests, but many of them can be
> replaced with grabbing elevator reference.
>
> But with/without this patch, the touched register/unregisger code does not
> require ->elevator_lock:
>
> blk_mq_debugfs_unregister_hctxs(q);
> blk_mq_sysfs_unregister_hctxs(q);
>
> so I don't understand why you argue here about ->elevator_lock use?
>
I am not arguing using ->elevator_lock wrt removal of hctx sysfs attributes
as you explained that sysfs/kernfs already provides the needed protection.
But please see below my explanation.
>>
>> Consider this case: While blk_mq_update_nr_hw_queues removes hctx attributes,
>> and simultaneously if nr_requests is also updating num of tags, would that not
>> cause any side effect?
>
> Why is updating nr_requests related with removing hctx attributes?
>
> Can you explain the side effect in details?
Thread 1:
writing-to-blk-mq-sysfs-attribute-nr_requests
-> queue_requests_store ==> freezes queue and acquires ->elevator_lock
-> blk_mq_update_nr_requests
-> blk_mq_tag_update_depth
-> blk_mq_alloc_map_and_rqs
-> blk_mq_alloc_rq_map
-> blk_mq_init_tags ==> updates ->nr_tags and ->nr_reserved_tags
Thread2:
blk_mq_update_nr_hw_queues
-> __blk_mq_update_nr_hw_queues
-> blk_mq_realloc_tag_set_tags
-> __blk_mq_alloc_map_and_rqs
-> blk_mq_alloc_map_and_rqs
-> blk_mq_alloc_rq_map
-> blk_mq_init_tags ==> updates ->nr_tags and ->nr_reserved_tags
Thread 3:
reading-hctx-sysfs-attribute-nr_tags
-> blk_mq_hw_sysfs_show ==> acquires ->elevaor_lock
-> blk_mq_hw_sysfs_nr_tags_show ==> access nr_tags
Thread 4:
reading-hctx-sysfs-attribute-nr_reserved_tags
-> blk_mq_hw_sysfs_show ==> acquires ->elevaor_lock
-> blk_mq_hw_sysfs_nr_reserved_tags_show ==> access nr_reserved_tags
As we can see above, ->nr_tags and ->nr_reserved_tags are also exported
to userspace using hctx sysfs attributes (nr_tags and nr_reserved_tags).
So my point was,
#1 For alleviating race between nr_hw_queues and nr_requests update,
we need protection (probably using srcu lock) so that ->nr_tags
and ->nr_reserved_tags are not updated simultaneously.
#2 How could we protect race between thread 3 and thread 2 above or
race between thread 4 and thread 2 above?
>
>> Maybe we also want to protect blk_mq_update_nr_requests
>> with srcu read lock (set->update_nr_hwq_srcu) so that it couldn't run while
>> blk_mq_update_nr_hw_queues is in progress?
>
> Yeah, agree, and it can be one new patch for covering race between
> blk_mq_update_nr_requests and blk_mq_update_nr_hw_queues, the point is just
> that nr_hw_queues is being changed, and not related with removing hctx
> attributes, IMO.
>
Please note that blk_mq_update_nr_requests also updates q->nr_requests,
however looking at all code paths which updates this value is already
protected with ->elevator_lock. So the only thing which worries me
about updates of ->nr_tags and ->nr_reserved tags as shown above.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 12/15] block: move debugfs/sysfs register out of freezing queue
2025-04-15 11:15 ` Nilay Shroff
@ 2025-04-15 11:54 ` Ming Lei
2025-04-15 12:21 ` Nilay Shroff
0 siblings, 1 reply; 57+ messages in thread
From: Ming Lei @ 2025-04-15 11:54 UTC (permalink / raw)
To: Nilay Shroff
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Tue, Apr 15, 2025 at 04:45:16PM +0530, Nilay Shroff wrote:
>
>
> On 4/15/25 3:36 PM, Ming Lei wrote:
> > On Tue, Apr 15, 2025 at 03:07:18PM +0530, Nilay Shroff wrote:
> >>
> >>
> >> On 4/14/25 7:12 AM, Ming Lei wrote:
> >>> On Fri, Apr 11, 2025 at 12:27:17AM +0530, Nilay Shroff wrote:
> >>>>
> >>>>
> >>>> On 4/10/25 7:00 PM, Ming Lei wrote:
> >>>>> Move debugfs/sysfs register out of freezing queue in
> >>>>> __blk_mq_update_nr_hw_queues(), so that the following lockdep dependency
> >>>>> can be killed:
> >>>>>
> >>>>> #2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
> >>>>> #1 (fs_reclaim){+.+.}-{0:0}:
> >>>>> #0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs
> >>>>>
> >>>>> And registering/un-registering debugfs/sysfs does not require queue to be
> >>>>> frozen.
> >>>>>
> >>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>>>> ---
> >>>>> block/blk-mq.c | 20 ++++++++++----------
> >>>>> 1 file changed, 10 insertions(+), 10 deletions(-)
> >>>>>
> >>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >>>>> index 7219b01764da..0fb72a698d77 100644
> >>>>> --- a/block/blk-mq.c
> >>>>> +++ b/block/blk-mq.c
> >>>>> @@ -4947,15 +4947,15 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> >>>>> if (set->nr_maps == 1 && nr_hw_queues == set->nr_hw_queues)
> >>>>> return;
> >>>>>
> >>>>> - memflags = memalloc_noio_save();
> >>>>> - list_for_each_entry(q, &set->tag_list, tag_set_list)
> >>>>> - blk_mq_freeze_queue_nomemsave(q);
> >>>>> -
> >>>>> list_for_each_entry(q, &set->tag_list, tag_set_list) {
> >>>>> blk_mq_debugfs_unregister_hctxs(q);
> >>>>> blk_mq_sysfs_unregister_hctxs(q);
> >>>>> }
> >>>> As we removed hctx sysfs protection while un-registering it, this might
> >>>> cause crash or other side-effect if simultaneously these sysfs attributes
> >>>> are accessed. The read access of these attributes are still protected
> >>>> using ->elevator_lock.
> >>>
> >>> The ->elevator_lock in ->show() is useless except for reading the elevator
> >>> internal data(sched tags, requests, ...), even for reading elevator data,
> >>> it should have been relying on elevator reference, instead of lock, but
> >>> that is another topic & improvement in future.
> >>>
> >>> Also this patch does _not_ change ->elevator_lock for above debugfs/sysfs
> >>> unregistering, does it? It is always done without holding ->elevator_lock.
> >>> Also ->show() does not require ->q_usage_counter too.
> >>>
> >>> As I mentioned, kobject/sysfs provides protection between ->show()/->store()
> >>> and kobject_del(), isn't it the reason why you want to remove ->sys_lock?
> >>>
> >>> https://lore.kernel.org/linux-block/20250226124006.1593985-1-nilay@linux.ibm.com/
> >>>
> >> Yes you were correct, that was the reason we wanted to remove ->sysfs_lock.
> >> However for these particular hctx sysfs attributes (nr_tags and nr_reserved_tags)
> >> could be updated simultaneously from another blk-mq sysfs attribute named nr_requests.
> >> Hence IMO, the default protection provided by sysfs/kernfs may not be sufficient and
> >> so we need to protect those attributes using ->elevator_lock.
> >
> > Yes, what is why this patchset doesn't kill more ->elevator_lock uses, such
> > as, the uses in blk-mq-debugs, update_nr_requests, but many of them can be
> > replaced with grabbing elevator reference.
> >
> > But with/without this patch, the touched register/unregisger code does not
> > require ->elevator_lock:
> >
> > blk_mq_debugfs_unregister_hctxs(q);
> > blk_mq_sysfs_unregister_hctxs(q);
> >
> > so I don't understand why you argue here about ->elevator_lock use?
> >
> I am not arguing using ->elevator_lock wrt removal of hctx sysfs attributes
> as you explained that sysfs/kernfs already provides the needed protection.
> But please see below my explanation.
>
> >>
> >> Consider this case: While blk_mq_update_nr_hw_queues removes hctx attributes,
> >> and simultaneously if nr_requests is also updating num of tags, would that not
> >> cause any side effect?
> >
> > Why is updating nr_requests related with removing hctx attributes?
> >
> > Can you explain the side effect in details?
> Thread 1:
> writing-to-blk-mq-sysfs-attribute-nr_requests
> -> queue_requests_store ==> freezes queue and acquires ->elevator_lock
> -> blk_mq_update_nr_requests
> -> blk_mq_tag_update_depth
> -> blk_mq_alloc_map_and_rqs
> -> blk_mq_alloc_rq_map
> -> blk_mq_init_tags ==> updates ->nr_tags and ->nr_reserved_tags
>
> Thread2:
> blk_mq_update_nr_hw_queues
> -> __blk_mq_update_nr_hw_queues
> -> blk_mq_realloc_tag_set_tags
> -> __blk_mq_alloc_map_and_rqs
> -> blk_mq_alloc_map_and_rqs
> -> blk_mq_alloc_rq_map
> -> blk_mq_init_tags ==> updates ->nr_tags and ->nr_reserved_tags
>
> Thread 3:
> reading-hctx-sysfs-attribute-nr_tags
> -> blk_mq_hw_sysfs_show ==> acquires ->elevaor_lock
> -> blk_mq_hw_sysfs_nr_tags_show ==> access nr_tags
>
> Thread 4:
> reading-hctx-sysfs-attribute-nr_reserved_tags
> -> blk_mq_hw_sysfs_show ==> acquires ->elevaor_lock
> -> blk_mq_hw_sysfs_nr_reserved_tags_show ==> access nr_reserved_tags
`hctx->tags` is guaranteed to be live if above ->show() method, and the
elevator lock is actually not needed, which isn't supposed to protect
hctx->tags too.
>
> As we can see above, ->nr_tags and ->nr_reserved_tags are also exported
> to userspace using hctx sysfs attributes (nr_tags and nr_reserved_tags).
>
> So my point was,
> #1 For alleviating race between nr_hw_queues and nr_requests update,
> we need protection (probably using srcu lock) so that ->nr_tags
> and ->nr_reserved_tags are not updated simultaneously.
>
> #2 How could we protect race between thread 3 and thread 2 above or
> race between thread 4 and thread 2 above?
blk_mq_update_nr_hw_queues() calls blk_mq_sysfs_unregister_hctxs() first,
then user can not see the above attributes before calling blk_mq_sysfs_register_hctxs().
So there isn't the race.
>
> >
> >> Maybe we also want to protect blk_mq_update_nr_requests
> >> with srcu read lock (set->update_nr_hwq_srcu) so that it couldn't run while
> >> blk_mq_update_nr_hw_queues is in progress?
> >
> > Yeah, agree, and it can be one new patch for covering race between
> > blk_mq_update_nr_requests and blk_mq_update_nr_hw_queues, the point is just
> > that nr_hw_queues is being changed, and not related with removing hctx
> > attributes, IMO.
> >
> Please note that blk_mq_update_nr_requests also updates q->nr_requests,
blk_mq_update_nr_requests() uses nr_hw_queues, so there is race between
blk_mq_update_nr_requests() and blk_mq_update_nr_hw_queues().
> however looking at all code paths which updates this value is already
> protected with ->elevator_lock. So the only thing which worries me
> about updates of ->nr_tags and ->nr_reserved tags as shown above.
As I mentioned, there isn't such race.
Thanks,
Ming
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 12/15] block: move debugfs/sysfs register out of freezing queue
2025-04-15 11:54 ` Ming Lei
@ 2025-04-15 12:21 ` Nilay Shroff
2025-04-15 12:41 ` Ming Lei
0 siblings, 1 reply; 57+ messages in thread
From: Nilay Shroff @ 2025-04-15 12:21 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On 4/15/25 5:24 PM, Ming Lei wrote:
>>>
>>> Why is updating nr_requests related with removing hctx attributes?
>>>
>>> Can you explain the side effect in details?
>> Thread 1:
>> writing-to-blk-mq-sysfs-attribute-nr_requests
>> -> queue_requests_store ==> freezes queue and acquires ->elevator_lock
>> -> blk_mq_update_nr_requests
>> -> blk_mq_tag_update_depth
>> -> blk_mq_alloc_map_and_rqs
>> -> blk_mq_alloc_rq_map
>> -> blk_mq_init_tags ==> updates ->nr_tags and ->nr_reserved_tags
>>
>> Thread2:
>> blk_mq_update_nr_hw_queues
>> -> __blk_mq_update_nr_hw_queues
>> -> blk_mq_realloc_tag_set_tags
>> -> __blk_mq_alloc_map_and_rqs
>> -> blk_mq_alloc_map_and_rqs
>> -> blk_mq_alloc_rq_map
>> -> blk_mq_init_tags ==> updates ->nr_tags and ->nr_reserved_tags
>>
>> Thread 3:
>> reading-hctx-sysfs-attribute-nr_tags
>> -> blk_mq_hw_sysfs_show ==> acquires ->elevaor_lock
>> -> blk_mq_hw_sysfs_nr_tags_show ==> access nr_tags
>>
>> Thread 4:
>> reading-hctx-sysfs-attribute-nr_reserved_tags
>> -> blk_mq_hw_sysfs_show ==> acquires ->elevaor_lock
>> -> blk_mq_hw_sysfs_nr_reserved_tags_show ==> access nr_reserved_tags
>
> `hctx->tags` is guaranteed to be live if above ->show() method, and the
> elevator lock is actually not needed, which isn't supposed to protect
> hctx->tags too.
>
I think, the ->elavtor_lock would still be needed for protecting updates to
hctx->tags from thread # 1 above and simultaneously reading the hctx->tags from
thread #3 and #4 above.
>>
>> As we can see above, ->nr_tags and ->nr_reserved_tags are also exported
>> to userspace using hctx sysfs attributes (nr_tags and nr_reserved_tags).
>>
>> So my point was,
>> #1 For alleviating race between nr_hw_queues and nr_requests update,
>> we need protection (probably using srcu lock) so that ->nr_tags
>> and ->nr_reserved_tags are not updated simultaneously.
>>
>> #2 How could we protect race between thread 3 and thread 2 above or
>> race between thread 4 and thread 2 above?
>
> blk_mq_update_nr_hw_queues() calls blk_mq_sysfs_unregister_hctxs() first,
> then user can not see the above attributes before calling blk_mq_sysfs_register_hctxs().
>
> So there isn't the race.
>
>>
>>>
>>>> Maybe we also want to protect blk_mq_update_nr_requests
>>>> with srcu read lock (set->update_nr_hwq_srcu) so that it couldn't run while
>>>> blk_mq_update_nr_hw_queues is in progress?
>>>
>>> Yeah, agree, and it can be one new patch for covering race between
>>> blk_mq_update_nr_requests and blk_mq_update_nr_hw_queues, the point is just
>>> that nr_hw_queues is being changed, and not related with removing hctx
>>> attributes, IMO.
>>>
>> Please note that blk_mq_update_nr_requests also updates q->nr_requests,
>
> blk_mq_update_nr_requests() uses nr_hw_queues, so there is race between
> blk_mq_update_nr_requests() and blk_mq_update_nr_hw_queues().
>
Okay so I believe, this could be protected using srcu lock.
>> however looking at all code paths which updates this value is already
>> protected with ->elevator_lock. So the only thing which worries me
>> about updates of ->nr_tags and ->nr_reserved tags as shown above.
>
> As I mentioned, there isn't such race.
Yes agreed, there's no race between thread 3 and thread 2 or
thread 4 and thread 2.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 09/15] block: unifying elevator change
2025-04-14 1:22 ` Ming Lei
@ 2025-04-15 12:30 ` Nilay Shroff
2025-04-16 1:49 ` Ming Lei
0 siblings, 1 reply; 57+ messages in thread
From: Nilay Shroff @ 2025-04-15 12:30 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On 4/14/25 6:52 AM, Ming Lei wrote:
> On Fri, Apr 11, 2025 at 12:07:34AM +0530, Nilay Shroff wrote:
>>
>>
>> On 4/10/25 7:00 PM, Ming Lei wrote:
>>> /*
>>> * Use the default elevator settings. If the chosen elevator initialization
>>> * fails, fall back to the "none" elevator (no elevator).
>>> */
>>> -void elevator_init_mq(struct request_queue *q)
>>> +void elevator_set_default(struct request_queue *q)
>>> {
>>> - struct elevator_type *e;
>>> - unsigned int memflags;
>>> + struct elev_change_ctx ctx = { };
>>> int err;
>>>
>>> - WARN_ON_ONCE(blk_queue_registered(q));
>>> -
>>> - if (unlikely(q->elevator))
>>> + if (!queue_is_mq(q))
>>> return;
>>>
>>> - e = elevator_get_default(q);
>>> - if (!e)
>>> + ctx.name = use_default_elevator(q) ? "mq-deadline" : "none";
>>> + if (!q->elevator && !strcmp(ctx.name, "none"))
>>> return;
>>> + err = elevator_change(q, &ctx);
>>> + if (err < 0)
>>> + pr_warn("\"%s\" set elevator failed %d, "
>>> + "falling back to \"none\"\n", ctx.name, err);
>>> +}
>>>
>> If we fail to set the evator to default (mq-deadline) while registering queue,
>> because nr_hw_queue update is simultaneously running then we may end up setting
>> the queue elevator to none and that's not correct. Isn't it?
>
> It still works with none.
>
> I think it isn't one big deal. And if it is really one issue in future, we can
> set one flag in elevator_set_default(), and let blk_mq_update_nr_hw_queues set
> default sched for us.
>
>>
>>> +void elevator_set_none(struct request_queue *q)
>>> +{
>>> + struct elev_change_ctx ctx = {
>>> + .name = "none",
>>> + .uevent = 1,
>>> + };
>>> + int err;
>>>
>>> - blk_mq_unfreeze_queue(q, memflags);
>>> + if (!queue_is_mq(q))
>>> + return;
>>>
>>> - if (err) {
>>> - pr_warn("\"%s\" elevator initialization failed, "
>>> - "falling back to \"none\"\n", e->elevator_name);
>>> - }
>>> + if (!q->elevator)
>>> + return;
>>>
>>> - elevator_put(e);
>>> + err = elevator_change(q, &ctx);
>>> + if (err < 0)
>>> + pr_warn("%s: set none elevator failed %d\n", __func__, err);
>>> }
>>>
>> Here as well if we fail to disable/exit elevator while deleting disk
>> because nr_hw_queue update is simultaneously running then we may
>> leak elevator resource?
>
> When blk_mq_update_nr_hw_queues() observes that queue is dying, it
> forces to change elevator to none, so there isn't elevator leak issue.
>
Yes if we get into blk_mq_update_nr_hw_queues after dying flag is set.
But what if blk_mq_update_nr_hw_queues doesn't see dying flag and starts
running __elevator_change. However later we set dying flag from del_gendisk
and starts running elevator_set_none simultaneously on another cpu?
In this case elevator_set_none would fail to set the elevator to "none" as
blk_mq_update_nr_hw_queues is running on another cpu. Isn't it?
>>
>>> @@ -565,11 +559,7 @@ int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
>>> if (disk->major == BLOCK_EXT_MAJOR)
>>> blk_free_ext_minor(disk->first_minor);
>>> out_exit_elevator:
>>> - if (disk->queue->elevator) {
>>> - mutex_lock(&disk->queue->elevator_lock);
>>> - elevator_exit(disk->queue);
>>> - mutex_unlock(&disk->queue->elevator_lock);
>>> - }
>>> + elevator_set_none(disk->queue);
>> Same comment as above here as well but this is in add_disk code path.
>
> We can avoid it by forcing to change to none in blk_mq_update_nr_hw_queues() for
> !blk_queue_registered()
>
Here as well there's a thin race window possible assuming add_disk fails
after we registered queue. Assuming nr_hw_queue update starts running
and it sees queue is registered however on another cpu add_disk fails
just after registering queue. So in this case still it might be possible
that elevator_set_none might fail to set elevator to "none" just because
nr_hw_queue update is running on another cpu. What do you think?
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 12/15] block: move debugfs/sysfs register out of freezing queue
2025-04-15 12:21 ` Nilay Shroff
@ 2025-04-15 12:41 ` Ming Lei
0 siblings, 0 replies; 57+ messages in thread
From: Ming Lei @ 2025-04-15 12:41 UTC (permalink / raw)
To: Nilay Shroff
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Tue, Apr 15, 2025 at 05:51:07PM +0530, Nilay Shroff wrote:
>
>
> On 4/15/25 5:24 PM, Ming Lei wrote:
> >>>
> >>> Why is updating nr_requests related with removing hctx attributes?
> >>>
> >>> Can you explain the side effect in details?
> >> Thread 1:
> >> writing-to-blk-mq-sysfs-attribute-nr_requests
> >> -> queue_requests_store ==> freezes queue and acquires ->elevator_lock
> >> -> blk_mq_update_nr_requests
> >> -> blk_mq_tag_update_depth
> >> -> blk_mq_alloc_map_and_rqs
> >> -> blk_mq_alloc_rq_map
> >> -> blk_mq_init_tags ==> updates ->nr_tags and ->nr_reserved_tags
> >>
> >> Thread2:
> >> blk_mq_update_nr_hw_queues
> >> -> __blk_mq_update_nr_hw_queues
> >> -> blk_mq_realloc_tag_set_tags
> >> -> __blk_mq_alloc_map_and_rqs
> >> -> blk_mq_alloc_map_and_rqs
> >> -> blk_mq_alloc_rq_map
> >> -> blk_mq_init_tags ==> updates ->nr_tags and ->nr_reserved_tags
> >>
> >> Thread 3:
> >> reading-hctx-sysfs-attribute-nr_tags
> >> -> blk_mq_hw_sysfs_show ==> acquires ->elevaor_lock
> >> -> blk_mq_hw_sysfs_nr_tags_show ==> access nr_tags
> >>
> >> Thread 4:
> >> reading-hctx-sysfs-attribute-nr_reserved_tags
> >> -> blk_mq_hw_sysfs_show ==> acquires ->elevaor_lock
> >> -> blk_mq_hw_sysfs_nr_reserved_tags_show ==> access nr_reserved_tags
> >
> > `hctx->tags` is guaranteed to be live if above ->show() method, and the
> > elevator lock is actually not needed, which isn't supposed to protect
> > hctx->tags too.
> >
> I think, the ->elavtor_lock would still be needed for protecting updates to
> hctx->tags from thread # 1 above and simultaneously reading the hctx->tags from
> thread #3 and #4 above.
update_nr_requests() can just reduce ->tags.sbitmap size, please see
blk_mq_tag_update_depth(), even it won't change hctx->tags->nr_tags, so there
isn't race.
More importantly ->elevator_lock shouldn't cover ->tags which isn't related
with scheduler, one exception is blk-mq debugfs, in which request may be
allocated from ->sched_tags, even ->tags is visiting.
Thanks,
Ming
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 09/15] block: unifying elevator change
2025-04-15 12:30 ` Nilay Shroff
@ 2025-04-16 1:49 ` Ming Lei
0 siblings, 0 replies; 57+ messages in thread
From: Ming Lei @ 2025-04-16 1:49 UTC (permalink / raw)
To: Nilay Shroff
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Tue, Apr 15, 2025 at 06:00:47PM +0530, Nilay Shroff wrote:
>
>
> On 4/14/25 6:52 AM, Ming Lei wrote:
> > On Fri, Apr 11, 2025 at 12:07:34AM +0530, Nilay Shroff wrote:
> >>
> >>
> >> On 4/10/25 7:00 PM, Ming Lei wrote:
> >>> /*
> >>> * Use the default elevator settings. If the chosen elevator initialization
> >>> * fails, fall back to the "none" elevator (no elevator).
> >>> */
> >>> -void elevator_init_mq(struct request_queue *q)
> >>> +void elevator_set_default(struct request_queue *q)
> >>> {
> >>> - struct elevator_type *e;
> >>> - unsigned int memflags;
> >>> + struct elev_change_ctx ctx = { };
> >>> int err;
> >>>
> >>> - WARN_ON_ONCE(blk_queue_registered(q));
> >>> -
> >>> - if (unlikely(q->elevator))
> >>> + if (!queue_is_mq(q))
> >>> return;
> >>>
> >>> - e = elevator_get_default(q);
> >>> - if (!e)
> >>> + ctx.name = use_default_elevator(q) ? "mq-deadline" : "none";
> >>> + if (!q->elevator && !strcmp(ctx.name, "none"))
> >>> return;
> >>> + err = elevator_change(q, &ctx);
> >>> + if (err < 0)
> >>> + pr_warn("\"%s\" set elevator failed %d, "
> >>> + "falling back to \"none\"\n", ctx.name, err);
> >>> +}
> >>>
> >> If we fail to set the evator to default (mq-deadline) while registering queue,
> >> because nr_hw_queue update is simultaneously running then we may end up setting
> >> the queue elevator to none and that's not correct. Isn't it?
> >
> > It still works with none.
> >
> > I think it isn't one big deal. And if it is really one issue in future, we can
> > set one flag in elevator_set_default(), and let blk_mq_update_nr_hw_queues set
> > default sched for us.
> >
> >>
> >>> +void elevator_set_none(struct request_queue *q)
> >>> +{
> >>> + struct elev_change_ctx ctx = {
> >>> + .name = "none",
> >>> + .uevent = 1,
> >>> + };
> >>> + int err;
> >>>
> >>> - blk_mq_unfreeze_queue(q, memflags);
> >>> + if (!queue_is_mq(q))
> >>> + return;
> >>>
> >>> - if (err) {
> >>> - pr_warn("\"%s\" elevator initialization failed, "
> >>> - "falling back to \"none\"\n", e->elevator_name);
> >>> - }
> >>> + if (!q->elevator)
> >>> + return;
> >>>
> >>> - elevator_put(e);
> >>> + err = elevator_change(q, &ctx);
> >>> + if (err < 0)
> >>> + pr_warn("%s: set none elevator failed %d\n", __func__, err);
> >>> }
> >>>
> >> Here as well if we fail to disable/exit elevator while deleting disk
> >> because nr_hw_queue update is simultaneously running then we may
> >> leak elevator resource?
> >
> > When blk_mq_update_nr_hw_queues() observes that queue is dying, it
> > forces to change elevator to none, so there isn't elevator leak issue.
> >
> Yes if we get into blk_mq_update_nr_hw_queues after dying flag is set.
> But what if blk_mq_update_nr_hw_queues doesn't see dying flag and starts
> running __elevator_change. However later we set dying flag from del_gendisk
> and starts running elevator_set_none simultaneously on another cpu?
> In this case elevator_set_none would fail to set the elevator to "none" as
> blk_mq_update_nr_hw_queues is running on another cpu. Isn't it?
>
> >>
> >>> @@ -565,11 +559,7 @@ int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
> >>> if (disk->major == BLOCK_EXT_MAJOR)
> >>> blk_free_ext_minor(disk->first_minor);
> >>> out_exit_elevator:
> >>> - if (disk->queue->elevator) {
> >>> - mutex_lock(&disk->queue->elevator_lock);
> >>> - elevator_exit(disk->queue);
> >>> - mutex_unlock(&disk->queue->elevator_lock);
> >>> - }
> >>> + elevator_set_none(disk->queue);
> >> Same comment as above here as well but this is in add_disk code path.
> >
> > We can avoid it by forcing to change to none in blk_mq_update_nr_hw_queues() for
> > !blk_queue_registered()
> >
> Here as well there's a thin race window possible assuming add_disk fails
> after we registered queue. Assuming nr_hw_queue update starts running
> and it sees queue is registered however on another cpu add_disk fails
> just after registering queue. So in this case still it might be possible
> that elevator_set_none might fail to set elevator to "none" just because
> nr_hw_queue update is running on another cpu. What do you think?
Yeah.
It isn't hard to solve, but just don't want to make the whole
implementation too complicated.
Another way is to prevent add_disk & del_disk from happening during
updating nr_hw_queues, and this way is reasonable too because both blk_mq
debugfs & sysfs registering depends on nr_hw_queues.
Meantime we can retry add_disk/del_disk until updating_nr_hw_queues are
finished, and one waitqueue can be added, so the wait can be:
add_disk():
while (true) {
srcu_read_lock()
if (set->is_updating_nr_hw_queus) {
srcu_read_unlock();
goto wait;
}
__add_disk();
srcu_read_unlock()
break;
wait:
wait_event(set->wq, !set->is_updating_nr_hw_queus);
}
Thanks,
Ming
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [PATCH 10/15] block: pass elevator_queue to elv_register_queue & unregister_queue
2025-04-15 2:31 ` Ming Lei
@ 2025-04-16 4:53 ` Christoph Hellwig
0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2025-04-16 4:53 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, Jens Axboe, linux-block, Nilay Shroff,
Shinichiro Kawasaki, Thomas Hellström
On Tue, Apr 15, 2025 at 10:31:14AM +0800, Ming Lei wrote:
> Please see the following patch, especially elevator_change_done() in which
> the old elevator queue is retrieved from the context structure, then its
> unregistering can be moved out of queue freezing & elevator lock.
>
> Same with registering of the new added elevator queue.
>
> I will add more words in commit log for this motivation in next version.
Thanks. When bisecting or using git-blame finding a following patch
for reference is really hard, so please make sure commit logs are useful
when standing alone.
^ permalink raw reply [flat|nested] 57+ messages in thread
end of thread, other threads:[~2025-04-18 8:44 UTC | newest]
Thread overview: 57+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-10 13:30 [PATCH 00/15] block: unify elevator changing and fix lockdep warning Ming Lei
2025-04-10 13:30 ` [PATCH 01/15] block: don't call freeze queue in elevator_switch() and elevator_disable() Ming Lei
2025-04-10 13:30 ` [PATCH 02/15] block: add two helpers for registering/un-registering sched debugfs Ming Lei
2025-04-10 14:25 ` Christoph Hellwig
2025-04-10 13:30 ` [PATCH 03/15] block: move sched debugfs register into elvevator_register_queue Ming Lei
2025-04-10 14:27 ` Christoph Hellwig
2025-04-14 0:42 ` Ming Lei
2025-04-10 13:30 ` [PATCH 04/15] block: prevent elevator switch during updating nr_hw_queues Ming Lei
2025-04-10 14:36 ` Christoph Hellwig
2025-04-14 0:54 ` Ming Lei
2025-04-14 6:07 ` Christoph Hellwig
2025-04-15 2:03 ` Ming Lei
2025-04-11 19:13 ` Nilay Shroff
2025-04-14 0:55 ` Ming Lei
2025-04-10 13:30 ` [PATCH 05/15] block: simplify elevator reset for " Ming Lei
2025-04-10 14:40 ` Christoph Hellwig
2025-04-10 15:34 ` Christoph Hellwig
2025-04-14 0:58 ` Ming Lei
2025-04-14 6:09 ` Christoph Hellwig
2025-04-15 2:05 ` Ming Lei
2025-04-10 13:30 ` [PATCH 06/15] block: add helper of elevator_change() Ming Lei
2025-04-10 13:30 ` [PATCH 07/15] block: move blk_unregister_queue() & device_del() after freeze wait Ming Lei
2025-04-14 6:19 ` Christoph Hellwig
2025-04-15 2:26 ` Ming Lei
2025-04-10 13:30 ` [PATCH 08/15] block: add `struct elev_change_ctx` for unifying elevator change Ming Lei
2025-04-14 6:21 ` Christoph Hellwig
2025-04-10 13:30 ` [PATCH 09/15] block: " Ming Lei
2025-04-10 18:37 ` Nilay Shroff
2025-04-14 1:22 ` Ming Lei
2025-04-15 12:30 ` Nilay Shroff
2025-04-16 1:49 ` Ming Lei
2025-04-10 13:30 ` [PATCH 10/15] block: pass elevator_queue to elv_register_queue & unregister_queue Ming Lei
2025-04-14 6:22 ` Christoph Hellwig
2025-04-15 2:31 ` Ming Lei
2025-04-16 4:53 ` Christoph Hellwig
2025-04-10 13:30 ` [PATCH 11/15] block: move elv_register[unregister]_queue out of elevator_lock Ming Lei
2025-04-11 19:20 ` Nilay Shroff
2025-04-14 1:24 ` Ming Lei
2025-04-15 9:39 ` Nilay Shroff
2025-04-15 10:32 ` Ming Lei
2025-04-10 13:30 ` [PATCH 12/15] block: move debugfs/sysfs register out of freezing queue Ming Lei
2025-04-10 18:57 ` Nilay Shroff
2025-04-14 1:42 ` Ming Lei
2025-04-15 9:37 ` Nilay Shroff
2025-04-15 10:06 ` Ming Lei
2025-04-15 11:15 ` Nilay Shroff
2025-04-15 11:54 ` Ming Lei
2025-04-15 12:21 ` Nilay Shroff
2025-04-15 12:41 ` Ming Lei
2025-04-10 13:30 ` [PATCH 13/15] block: remove several ->elevator_lock Ming Lei
2025-04-10 19:07 ` Nilay Shroff
2025-04-14 1:46 ` Ming Lei
2025-04-10 13:30 ` [PATCH 14/15] block: move hctx cpuhp add/del out of queue freezing Ming Lei
2025-04-10 13:30 ` [PATCH 15/15] block: move wbt_enable_default() out of queue freezing from scheduler's ->exit() Ming Lei
2025-04-10 19:20 ` Nilay Shroff
2025-04-14 1:55 ` Ming Lei
-- strict thread matches above, loose matches on Subject: below --
2025-04-15 2:27 [PATCH 07/15] block: move blk_unregister_queue() & device_del() after freeze wait Ming Lei
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).