* [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning
@ 2025-04-24 15:21 Ming Lei
2025-04-24 15:21 ` [PATCH V3 01/20] block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue() Ming Lei
` (20 more replies)
0 siblings, 21 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Hello Jens,
This patchset cleans up elevator change code, and unifying it via single
helper, meantime moves kobject_add/del & debugfs register/unregister out of
queue freezing & elevator_lock. This way fixes many lockdep warnings
reported recently, especially since fs_reclaim is connected with freeze lock
manually by commit ffa1e7ada456 ("block: Make request_queue lockdep splats
show up earlier").
Thanks,
Ming
V3:
- replace srcu with rw_sem for avoiding race between add/del disk &
elevator switch and updating nr_hw_queues (Nilay Shoff)
- add elv_update_nr_hw_queues() for elevator reattachment in case of
updating nr_hw_queues, meantime keep elv_change_ctx as local structure
(Christoph)
- replace ->elevator_lock with disk->rqos_state_mutex for covering wbt
state change
- add new patch "block: use q->elevator with ->elevator_lock held in elv_iosched_show()"
- small cleanup & commit log improvement
V2:
- retry add/del disk when blk_mq_update_nr_hw_queues() is in-progress
- swap blk_mq_add_queue_tag_set() with blk_mq_map_swqueue() in
blk_mq_init_allocated_queue() (Nilay Shroff)
- move ELEVATOR_FLAG_DISABLE_WBT to request queue's flags (Nilay Shoff)
- fix race because of delaying elevator unregister
- define flags of `elv_change_ctx` as `bool` (Christoph)
- improve comment and commit log (Christoph)
Ming Lei (20):
block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue()
block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag
block: don't call freeze queue in elevator_switch() and
elevator_disable()
block: use q->elevator with ->elevator_lock held in elv_iosched_show()
block: add two helpers for registering/un-registering sched debugfs
block: move sched debugfs register into elvevator_register_queue
block: prevent adding/deleting disk during updating nr_hw_queues
block: don't allow to switch elevator if updating nr_hw_queues is
in-progress
block: simplify elevator reattachment for updating nr_hw_queues
block: move blk_unregister_queue() & device_del() after freeze wait
block: move queue freezing & elevator_lock into elevator_change()
block: add `struct elv_change_ctx` for unifying elevator change
block: unifying elevator change
block: pass elevator_queue to elv_register_queue & unregister_queue
block: fail to show/store elevator sysfs attribute if elevator is
dying
block: move elv_register[unregister]_queue out of elevator_lock
block: move debugfs/sysfs register out of freezing queue
block: remove several ->elevator_lock
block: move hctx cpuhp add/del out of queue freezing
block: move wbt_enable_default() out of queue freezing from sched
->exit()
block/bfq-iosched.c | 6 +-
block/blk-mq-debugfs.c | 12 +-
block/blk-mq-sched.c | 41 +++---
block/blk-mq.c | 132 +++---------------
block/blk-sysfs.c | 24 ++--
block/blk-wbt.c | 13 +-
block/blk.h | 8 +-
block/elevator.c | 302 ++++++++++++++++++++++++++++-------------
block/elevator.h | 6 +-
block/genhd.c | 129 +++++++++++-------
include/linux/blk-mq.h | 3 +
include/linux/blkdev.h | 5 +
12 files changed, 365 insertions(+), 316 deletions(-)
--
2.47.0
^ permalink raw reply [flat|nested] 71+ messages in thread
* [PATCH V3 01/20] block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue()
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
2025-04-24 15:21 ` [PATCH V3 02/20] block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag Ming Lei
` (19 subsequent siblings)
20 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei, Hannes Reinecke
Move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue(), and publish
this request queue to tagset after everything is setup.
This way is safe because BLK_MQ_F_TAG_QUEUE_SHARED isn't used by
blk_mq_map_swqueue(), and this flag is mainly checked in fast IO code
path.
Prepare for removing ->elevator_lock from blk_mq_map_swqueue() which
is supposed to be called when elevator switch can't be done.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Closes: https://lore.kernel.org/linux-block/567cb7ab-23d6-4cee-a915-c8cdac903ddd@linux.ibm.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 554380bfd002..29cfc7ce2e0a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4581,8 +4581,8 @@ int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
q->nr_requests = set->queue_depth;
blk_mq_init_cpu_queues(q, set->nr_hw_queues);
- blk_mq_add_queue_tag_set(set, q);
blk_mq_map_swqueue(q);
+ blk_mq_add_queue_tag_set(set, q);
return 0;
err_hctxs:
--
2.47.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH V3 02/20] block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
2025-04-24 15:21 ` [PATCH V3 01/20] block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue() Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
2025-04-25 14:29 ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 03/20] block: don't call freeze queue in elevator_switch() and elevator_disable() Ming Lei
` (18 subsequent siblings)
20 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei, Hannes Reinecke
ELEVATOR_FLAG_DISABLE_WBT is only used by BFQ to disallow wbt when BFQ is
in use. The flag is set in BFQ's init(), and cleared in BFQ's exit().
Making it as request queue flag, so that we can avoid to deal with elevator
switch race. Also it isn't graceful to checking one scheduler flag in
wbt_enable_default().
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/bfq-iosched.c | 4 ++--
block/blk-mq-debugfs.c | 1 +
block/blk-wbt.c | 3 +--
block/elevator.h | 1 -
include/linux/blkdev.h | 3 +++
5 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index abd80dc13562..cc6f59836dcd 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7210,7 +7210,7 @@ static void bfq_exit_queue(struct elevator_queue *e)
#endif
blk_stat_disable_accounting(bfqd->queue);
- clear_bit(ELEVATOR_FLAG_DISABLE_WBT, &e->flags);
+ blk_queue_flag_clear(QUEUE_FLAG_DISABLE_WBT_DEF, bfqd->queue);
wbt_enable_default(bfqd->queue->disk);
kfree(bfqd);
@@ -7397,7 +7397,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
/* We dispatch from request queue wide instead of hw queue */
blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q);
- set_bit(ELEVATOR_FLAG_DISABLE_WBT, &eq->flags);
+ blk_queue_flag_set(QUEUE_FLAG_DISABLE_WBT_DEF, q);
wbt_disable_default(q->disk);
blk_stat_enable_accounting(q);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 3421b5521fe2..7710c409e432 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -93,6 +93,7 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(RQ_ALLOC_TIME),
QUEUE_FLAG_NAME(HCTX_ACTIVE),
QUEUE_FLAG_NAME(SQ_SCHED),
+ QUEUE_FLAG_NAME(DISABLE_WBT_DEF),
};
#undef QUEUE_FLAG_NAME
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index f1754d07f7e0..29cd2e33666f 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -704,8 +704,7 @@ void wbt_enable_default(struct gendisk *disk)
struct rq_qos *rqos;
bool enable = IS_ENABLED(CONFIG_BLK_WBT_MQ);
- if (q->elevator &&
- test_bit(ELEVATOR_FLAG_DISABLE_WBT, &q->elevator->flags))
+ if (blk_queue_disable_wbt(q))
enable = false;
/* Throttling already enabled? */
diff --git a/block/elevator.h b/block/elevator.h
index e4e44dfac503..e27af5492cdb 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -121,7 +121,6 @@ struct elevator_queue
};
#define ELEVATOR_FLAG_REGISTERED 0
-#define ELEVATOR_FLAG_DISABLE_WBT 1
/*
* block elevator interface
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 678dc38442bf..228b2e5ad493 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -644,6 +644,7 @@ enum {
QUEUE_FLAG_RQ_ALLOC_TIME, /* record rq->alloc_time_ns */
QUEUE_FLAG_HCTX_ACTIVE, /* at least one blk-mq hctx is active */
QUEUE_FLAG_SQ_SCHED, /* single queue style io dispatch */
+ QUEUE_FLAG_DISABLE_WBT_DEF, /* for sched to disable/enable wbt */
QUEUE_FLAG_MAX
};
@@ -679,6 +680,8 @@ void blk_queue_flag_clear(unsigned int flag, struct request_queue *q);
#define blk_queue_sq_sched(q) test_bit(QUEUE_FLAG_SQ_SCHED, &(q)->queue_flags)
#define blk_queue_skip_tagset_quiesce(q) \
((q)->limits.features & BLK_FEAT_SKIP_TAGSET_QUIESCE)
+#define blk_queue_disable_wbt(q) \
+ test_bit(QUEUE_FLAG_DISABLE_WBT_DEF, &(q)->queue_flags)
extern void blk_set_pm_only(struct request_queue *q);
extern void blk_clear_pm_only(struct request_queue *q);
--
2.47.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH V3 03/20] block: don't call freeze queue in elevator_switch() and elevator_disable()
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
2025-04-24 15:21 ` [PATCH V3 01/20] block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue() Ming Lei
2025-04-24 15:21 ` [PATCH V3 02/20] block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
2025-04-25 14:29 ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 04/20] block: use q->elevator with ->elevator_lock held in elv_iosched_show() Ming Lei
` (17 subsequent siblings)
20 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei, Hannes Reinecke
Both elevator_switch() and elevator_disable() are only called from the
two code paths, in which queue is guaranteed to be frozen.
So don't call freeze queue in the two functions, also add asserts for
queue freeze.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/elevator.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/block/elevator.c b/block/elevator.c
index b4d08026b02c..5051a98dc08c 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -615,12 +615,11 @@ void elevator_init_mq(struct request_queue *q)
*/
int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
{
- unsigned int memflags;
int ret;
+ WARN_ON_ONCE(q->mq_freeze_depth == 0);
lockdep_assert_held(&q->elevator_lock);
- memflags = blk_mq_freeze_queue(q);
blk_mq_quiesce_queue(q);
if (q->elevator) {
@@ -641,7 +640,6 @@ int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
out_unfreeze:
blk_mq_unquiesce_queue(q);
- blk_mq_unfreeze_queue(q, memflags);
if (ret) {
pr_warn("elv: switch to \"%s\" failed, falling back to \"none\"\n",
@@ -653,11 +651,9 @@ int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
void elevator_disable(struct request_queue *q)
{
- unsigned int memflags;
-
+ WARN_ON_ONCE(q->mq_freeze_depth == 0);
lockdep_assert_held(&q->elevator_lock);
- memflags = blk_mq_freeze_queue(q);
blk_mq_quiesce_queue(q);
elv_unregister_queue(q);
@@ -668,7 +664,6 @@ void elevator_disable(struct request_queue *q)
blk_add_trace_msg(q, "elv switch: none");
blk_mq_unquiesce_queue(q);
- blk_mq_unfreeze_queue(q, memflags);
}
/*
--
2.47.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH V3 04/20] block: use q->elevator with ->elevator_lock held in elv_iosched_show()
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
` (2 preceding siblings ...)
2025-04-24 15:21 ` [PATCH V3 03/20] block: don't call freeze queue in elevator_switch() and elevator_disable() Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
2025-04-25 6:08 ` Hannes Reinecke
` (2 more replies)
2025-04-24 15:21 ` [PATCH V3 05/20] block: add two helpers for registering/un-registering sched debugfs Ming Lei
` (16 subsequent siblings)
20 siblings, 3 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Use q->elevator with ->elevator_lock held in elv_iosched_show(), since
the local cached elevator reference may become stale after getting
->elevator_lock.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/elevator.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/block/elevator.c b/block/elevator.c
index 5051a98dc08c..b32815594892 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -739,7 +739,6 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
ssize_t elv_iosched_show(struct gendisk *disk, char *name)
{
struct request_queue *q = disk->queue;
- struct elevator_queue *eq = q->elevator;
struct elevator_type *cur = NULL, *e;
int len = 0;
@@ -748,7 +747,7 @@ ssize_t elv_iosched_show(struct gendisk *disk, char *name)
len += sprintf(name+len, "[none] ");
} else {
len += sprintf(name+len, "none ");
- cur = eq->type;
+ cur = q->elevator->type;
}
spin_lock(&elv_list_lock);
--
2.47.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH V3 05/20] block: add two helpers for registering/un-registering sched debugfs
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
` (3 preceding siblings ...)
2025-04-24 15:21 ` [PATCH V3 04/20] block: use q->elevator with ->elevator_lock held in elv_iosched_show() Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
2025-04-24 15:21 ` [PATCH V3 06/20] block: move sched debugfs register into elvevator_register_queue Ming Lei
` (15 subsequent siblings)
20 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei, Yu Kuai, Hannes Reinecke
Add blk_mq_sched_reg_debugfs()/blk_mq_sched_unreg_debugfs() to clean up
sched init/exit code a bit.
Register & unregister debugfs for sched & sched_hctx order is changed a
bit, but it is safe because sched & sched_hctx is guaranteed to be ready
when exporting via debugfs.
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq-sched.c | 45 +++++++++++++++++++++++++++++---------------
1 file changed, 30 insertions(+), 15 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 9b81771774ef..2abc5e0704e8 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -434,6 +434,30 @@ static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
return 0;
}
+static void blk_mq_sched_reg_debugfs(struct request_queue *q)
+{
+ struct blk_mq_hw_ctx *hctx;
+ unsigned long i;
+
+ mutex_lock(&q->debugfs_mutex);
+ blk_mq_debugfs_register_sched(q);
+ queue_for_each_hw_ctx(q, hctx, i)
+ blk_mq_debugfs_register_sched_hctx(q, hctx);
+ mutex_unlock(&q->debugfs_mutex);
+}
+
+static void blk_mq_sched_unreg_debugfs(struct request_queue *q)
+{
+ struct blk_mq_hw_ctx *hctx;
+ unsigned long i;
+
+ mutex_lock(&q->debugfs_mutex);
+ queue_for_each_hw_ctx(q, hctx, i)
+ blk_mq_debugfs_unregister_sched_hctx(hctx);
+ blk_mq_debugfs_unregister_sched(q);
+ mutex_unlock(&q->debugfs_mutex);
+}
+
/* caller must have a reference to @e, will grab another one if successful */
int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
{
@@ -467,10 +491,6 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
if (ret)
goto err_free_map_and_rqs;
- mutex_lock(&q->debugfs_mutex);
- blk_mq_debugfs_register_sched(q);
- mutex_unlock(&q->debugfs_mutex);
-
queue_for_each_hw_ctx(q, hctx, i) {
if (e->ops.init_hctx) {
ret = e->ops.init_hctx(hctx, i);
@@ -482,11 +502,11 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
return ret;
}
}
- mutex_lock(&q->debugfs_mutex);
- blk_mq_debugfs_register_sched_hctx(q, hctx);
- mutex_unlock(&q->debugfs_mutex);
}
+ /* sched is initialized, it is ready to export it via debugfs */
+ blk_mq_sched_reg_debugfs(q);
+
return 0;
err_free_map_and_rqs:
@@ -524,11 +544,10 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
unsigned long i;
unsigned int flags = 0;
- queue_for_each_hw_ctx(q, hctx, i) {
- mutex_lock(&q->debugfs_mutex);
- blk_mq_debugfs_unregister_sched_hctx(hctx);
- mutex_unlock(&q->debugfs_mutex);
+ /* unexport via debugfs before exiting sched */
+ blk_mq_sched_unreg_debugfs(q);
+ queue_for_each_hw_ctx(q, hctx, i) {
if (e->type->ops.exit_hctx && hctx->sched_data) {
e->type->ops.exit_hctx(hctx, i);
hctx->sched_data = NULL;
@@ -536,10 +555,6 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
flags = hctx->flags;
}
- mutex_lock(&q->debugfs_mutex);
- blk_mq_debugfs_unregister_sched(q);
- mutex_unlock(&q->debugfs_mutex);
-
if (e->type->ops.exit_sched)
e->type->ops.exit_sched(e);
blk_mq_sched_tags_teardown(q, flags);
--
2.47.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH V3 06/20] block: move sched debugfs register into elvevator_register_queue
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
` (4 preceding siblings ...)
2025-04-24 15:21 ` [PATCH V3 05/20] block: add two helpers for registering/un-registering sched debugfs Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
2025-04-24 15:21 ` [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues Ming Lei
` (14 subsequent siblings)
20 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei, Yu Kuai, Hannes Reinecke
sched debugfs shares same lifetime with scheduler's kobject, and same
lock(elevator lock), so move sched debugfs register/unregister into
elevator_register_queue() and elevator_unregister_queue().
Then we needn't blk_mq_debugfs_register() for us to register sched
debugfs any more.
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq-debugfs.c | 11 -----------
block/blk-mq-sched.c | 11 ++---------
block/elevator.c | 8 ++++++++
block/elevator.h | 3 +++
4 files changed, 13 insertions(+), 20 deletions(-)
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 7710c409e432..2837a8ce8054 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -625,20 +625,9 @@ void blk_mq_debugfs_register(struct request_queue *q)
debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);
- /*
- * blk_mq_init_sched() attempted to do this already, but q->debugfs_dir
- * didn't exist yet (because we don't know what to name the directory
- * until the queue is registered to a gendisk).
- */
- if (q->elevator && !q->sched_debugfs_dir)
- blk_mq_debugfs_register_sched(q);
-
- /* Similarly, blk_mq_init_hctx() couldn't do this previously. */
queue_for_each_hw_ctx(q, hctx, i) {
if (!hctx->debugfs_dir)
blk_mq_debugfs_register_hctx(q, hctx);
- if (q->elevator && !hctx->sched_debugfs_dir)
- blk_mq_debugfs_register_sched_hctx(q, hctx);
}
if (q->rq_qos) {
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 2abc5e0704e8..336a15ffecfa 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -434,7 +434,7 @@ static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
return 0;
}
-static void blk_mq_sched_reg_debugfs(struct request_queue *q)
+void blk_mq_sched_reg_debugfs(struct request_queue *q)
{
struct blk_mq_hw_ctx *hctx;
unsigned long i;
@@ -446,7 +446,7 @@ static void blk_mq_sched_reg_debugfs(struct request_queue *q)
mutex_unlock(&q->debugfs_mutex);
}
-static void blk_mq_sched_unreg_debugfs(struct request_queue *q)
+void blk_mq_sched_unreg_debugfs(struct request_queue *q)
{
struct blk_mq_hw_ctx *hctx;
unsigned long i;
@@ -503,10 +503,6 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
}
}
}
-
- /* sched is initialized, it is ready to export it via debugfs */
- blk_mq_sched_reg_debugfs(q);
-
return 0;
err_free_map_and_rqs:
@@ -544,9 +540,6 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
unsigned long i;
unsigned int flags = 0;
- /* unexport via debugfs before exiting sched */
- blk_mq_sched_unreg_debugfs(q);
-
queue_for_each_hw_ctx(q, hctx, i) {
if (e->type->ops.exit_hctx && hctx->sched_data) {
e->type->ops.exit_hctx(hctx, i);
diff --git a/block/elevator.c b/block/elevator.c
index b32815594892..4400eb8fe54f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -472,6 +472,11 @@ int elv_register_queue(struct request_queue *q, bool uevent)
if (uevent)
kobject_uevent(&e->kobj, KOBJ_ADD);
+ /*
+ * Sched is initialized, it is ready to export it via
+ * debugfs
+ */
+ blk_mq_sched_reg_debugfs(q);
set_bit(ELEVATOR_FLAG_REGISTERED, &e->flags);
}
return error;
@@ -486,6 +491,9 @@ void elv_unregister_queue(struct request_queue *q)
if (e && test_and_clear_bit(ELEVATOR_FLAG_REGISTERED, &e->flags)) {
kobject_uevent(&e->kobj, KOBJ_REMOVE);
kobject_del(&e->kobj);
+
+ /* unexport via debugfs before exiting sched */
+ blk_mq_sched_unreg_debugfs(q);
}
}
diff --git a/block/elevator.h b/block/elevator.h
index e27af5492cdb..9198676644a9 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -181,4 +181,7 @@ extern struct request *elv_rb_find(struct rb_root *, sector_t);
#define rq_entry_fifo(ptr) list_entry((ptr), struct request, queuelist)
#define rq_fifo_clear(rq) list_del_init(&(rq)->queuelist)
+void blk_mq_sched_reg_debugfs(struct request_queue *q);
+void blk_mq_sched_unreg_debugfs(struct request_queue *q);
+
#endif /* _ELEVATOR_H */
--
2.47.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
` (5 preceding siblings ...)
2025-04-24 15:21 ` [PATCH V3 06/20] block: move sched debugfs register into elvevator_register_queue Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
2025-04-25 6:33 ` Hannes Reinecke
` (3 more replies)
2025-04-24 15:21 ` [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress Ming Lei
` (13 subsequent siblings)
20 siblings, 4 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Both adding/deleting disk code are reader of `nr_hw_queues`, so we can't
allow them in-progress when updating nr_hw_queues, kernel panic and
kasan has been reported in [1].
Prevent adding/deleting disk during updating nr_hw_queues by adding
rw_semaphore to tagset, write lock is grabbed in blk_mq_update_nr_hw_queues(),
and read lock is acquired when adding/deleting disk.
Also mark GFP_NOIO allocation scope for adding/deleting disk because
blk_mq_update_nr_hw_queues() is part of some driver's error handler.
This way avoids lot of trouble.
Suggested-by: Nilay Shroff <nilay@linux.ibm.com>
Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Closes: https://lore.kernel.org/linux-block/a5896cdb-a59a-4a37-9f99-20522f5d2987@linux.ibm.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 4 ++
block/genhd.c | 94 +++++++++++++++++++++++++++++++-----------
include/linux/blk-mq.h | 3 ++
3 files changed, 78 insertions(+), 23 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 29cfc7ce2e0a..1ed2d183f912 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4802,6 +4802,8 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
goto out_free_srcu;
}
+ init_rwsem(&set->update_nr_hwq_sema);
+
ret = -ENOMEM;
set->tags = kcalloc_node(set->nr_hw_queues,
sizeof(struct blk_mq_tags *), GFP_KERNEL,
@@ -5097,9 +5099,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
{
+ down_write(&set->update_nr_hwq_sema);
mutex_lock(&set->tag_list_lock);
__blk_mq_update_nr_hw_queues(set, nr_hw_queues);
mutex_unlock(&set->tag_list_lock);
+ up_write(&set->update_nr_hwq_sema);
}
EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
diff --git a/block/genhd.c b/block/genhd.c
index c2bd86cd09de..7f3ae3d23b26 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -399,9 +399,9 @@ int disk_scan_partitions(struct gendisk *disk, blk_mode_t mode)
* This function registers the partitioning information in @disk
* with the kernel. Also attach a fwnode to the disk device.
*/
-int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
- const struct attribute_group **groups,
- struct fwnode_handle *fwnode)
+static int __add_disk_fwnode(struct device *parent, struct gendisk *disk,
+ const struct attribute_group **groups,
+ struct fwnode_handle *fwnode)
{
struct device *ddev = disk_to_dev(disk);
@@ -572,6 +572,37 @@ int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
}
return ret;
}
+
+/**
+ * add_disk_fwnode - add disk information to kernel list with fwnode
+ * @parent: parent device for the disk
+ * @disk: per-device partitioning information
+ * @groups: Additional per-device sysfs groups
+ * @fwnode: attached disk fwnode
+ *
+ * This function registers the partitioning information in @disk
+ * with the kernel. Also attach a fwnode to the disk device.
+ */
+int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
+ const struct attribute_group **groups,
+ struct fwnode_handle *fwnode)
+{
+ struct blk_mq_tag_set *set;
+ unsigned int memflags;
+ int ret;
+
+ if (!queue_is_mq(disk->queue))
+ return __add_disk_fwnode(parent, disk, groups, fwnode);
+
+ set = disk->queue->tag_set;
+ memflags = memalloc_noio_save();
+ down_read(&set->update_nr_hwq_sema);
+ ret = __add_disk_fwnode(parent, disk, groups, fwnode);
+ up_read(&set->update_nr_hwq_sema);
+ memalloc_noio_restore(memflags);
+
+ return ret;
+}
EXPORT_SYMBOL_GPL(add_disk_fwnode);
/**
@@ -652,26 +683,7 @@ void blk_mark_disk_dead(struct gendisk *disk)
}
EXPORT_SYMBOL_GPL(blk_mark_disk_dead);
-/**
- * del_gendisk - remove the gendisk
- * @disk: the struct gendisk to remove
- *
- * Removes the gendisk and all its associated resources. This deletes the
- * partitions associated with the gendisk, and unregisters the associated
- * request_queue.
- *
- * This is the counter to the respective __device_add_disk() call.
- *
- * The final removal of the struct gendisk happens when its refcount reaches 0
- * with put_disk(), which should be called after del_gendisk(), if
- * __device_add_disk() was used.
- *
- * Drivers exist which depend on the release of the gendisk to be synchronous,
- * it should not be deferred.
- *
- * Context: can sleep
- */
-void del_gendisk(struct gendisk *disk)
+static void __del_gendisk(struct gendisk *disk)
{
struct request_queue *q = disk->queue;
struct block_device *part;
@@ -764,6 +776,42 @@ void del_gendisk(struct gendisk *disk)
if (start_drain)
blk_unfreeze_release_lock(q);
}
+
+/**
+ * del_gendisk - remove the gendisk
+ * @disk: the struct gendisk to remove
+ *
+ * Removes the gendisk and all its associated resources. This deletes the
+ * partitions associated with the gendisk, and unregisters the associated
+ * request_queue.
+ *
+ * This is the counter to the respective __device_add_disk() call.
+ *
+ * The final removal of the struct gendisk happens when its refcount reaches 0
+ * with put_disk(), which should be called after del_gendisk(), if
+ * __device_add_disk() was used.
+ *
+ * Drivers exist which depend on the release of the gendisk to be synchronous,
+ * it should not be deferred.
+ *
+ * Context: can sleep
+ */
+void del_gendisk(struct gendisk *disk)
+{
+ struct blk_mq_tag_set *set;
+ unsigned int memflags;
+
+ if (!queue_is_mq(disk->queue)) {
+ __del_gendisk(disk);
+ } else {
+ set = disk->queue->tag_set;
+ memflags = memalloc_noio_save();
+ down_read(&set->update_nr_hwq_sema);
+ __del_gendisk(disk);
+ up_read(&set->update_nr_hwq_sema);
+ memalloc_noio_restore(memflags);
+ }
+}
EXPORT_SYMBOL(del_gendisk);
/**
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8eb9b3310167..28bc03b2b0dc 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -9,6 +9,7 @@
#include <linux/prefetch.h>
#include <linux/srcu.h>
#include <linux/rw_hint.h>
+#include <linux/rwsem.h>
struct blk_mq_tags;
struct blk_flush_queue;
@@ -527,6 +528,8 @@ struct blk_mq_tag_set {
struct mutex tag_list_lock;
struct list_head tag_list;
struct srcu_struct *srcu;
+
+ struct rw_semaphore update_nr_hwq_sema;
};
/**
--
2.47.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
` (6 preceding siblings ...)
2025-04-24 15:21 ` [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
2025-04-25 6:33 ` Hannes Reinecke
2025-04-25 12:48 ` Nilay Shroff
2025-04-24 15:21 ` [PATCH V3 09/20] block: simplify elevator reattachment for updating nr_hw_queues Ming Lei
` (12 subsequent siblings)
20 siblings, 2 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Elevator switch code is another `nr_hw_queue` reader in non-fast-IO code
path, so it can't be done if updating `nr_hw_queues` is in-progress.
Take same approach with not allowing add/del disk when updating
nr_hw_queues is in-progress, by grabbing read lock of
set->update_nr_hwq_sema.
Take the nested variant for avoiding the following false positive
splat[1], and this way is correct because:
- the read lock in elv_iosched_store() is not overlapped with the read lock
in adding/deleting disk:
- kobject attribute is only available after the kobject is added and
before it is deleted
-> #4 (&q->q_usage_counter(queue){++++}-{0:0}:
-> #3 (&q->limits_lock){+.+.}-{4:4}:
-> #2 (&disk->open_mutex){+.+.}-{4:4}:
-> #1 (&set->update_nr_hwq_lock){.+.+}-{4:4}:
-> #0 (kn->active#103){++++}-{0:0}:
Link: https://lore.kernel.org/linux-block/aAWv3NPtNIKKvJZc@fedora/ [1]
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/elevator.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/block/elevator.c b/block/elevator.c
index 4400eb8fe54f..56da6ab7691a 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -723,6 +723,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
int ret;
unsigned int memflags;
struct request_queue *q = disk->queue;
+ struct blk_mq_tag_set *set = q->tag_set;
/*
* If the attribute needs to load a module, do it before freezing the
@@ -734,6 +735,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
elv_iosched_load_module(name);
+ down_read_nested(&set->update_nr_hwq_sema, 1);
memflags = blk_mq_freeze_queue(q);
mutex_lock(&q->elevator_lock);
ret = elevator_change(q, name);
@@ -741,6 +743,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
ret = count;
mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
+ up_read(&set->update_nr_hwq_sema);
return ret;
}
--
2.47.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH V3 09/20] block: simplify elevator reattachment for updating nr_hw_queues
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
` (7 preceding siblings ...)
2025-04-24 15:21 ` [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
2025-04-25 6:34 ` Hannes Reinecke
2025-04-25 18:12 ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait Ming Lei
` (11 subsequent siblings)
20 siblings, 2 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
In blk_mq_update_nr_hw_queues(), nr_hw_queues changes and elevator data
depends on it, so elevator has to be reattached.
Now elevator switch isn't allowed during blk_mq_update_nr_hw_queues(), so
we can simply call elevator_change() to reattach elevator sched tags after
nr_hw_queues is updated.
Add elv_update_nr_hw_queues() for blk_mq_update_nr_hw_queues() to
reattach elevator.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 90 +-----------------------------------------------
block/blk.h | 3 +-
block/elevator.c | 32 +++++++++++++----
3 files changed, 28 insertions(+), 97 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1ed2d183f912..3afcddd21586 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4943,88 +4943,10 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
return ret;
}
-/*
- * request_queue and elevator_type pair.
- * It is just used by __blk_mq_update_nr_hw_queues to cache
- * the elevator_type associated with a request_queue.
- */
-struct blk_mq_qe_pair {
- struct list_head node;
- struct request_queue *q;
- struct elevator_type *type;
-};
-
-/*
- * Cache the elevator_type in qe pair list and switch the
- * io scheduler to 'none'
- */
-static bool blk_mq_elv_switch_none(struct list_head *head,
- struct request_queue *q)
-{
- struct blk_mq_qe_pair *qe;
-
- qe = kmalloc(sizeof(*qe), GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY);
- if (!qe)
- return false;
-
- /* Accessing q->elevator needs protection from ->elevator_lock. */
- mutex_lock(&q->elevator_lock);
-
- if (!q->elevator) {
- kfree(qe);
- goto unlock;
- }
-
- INIT_LIST_HEAD(&qe->node);
- qe->q = q;
- qe->type = q->elevator->type;
- /* keep a reference to the elevator module as we'll switch back */
- __elevator_get(qe->type);
- list_add(&qe->node, head);
- elevator_disable(q);
-unlock:
- mutex_unlock(&q->elevator_lock);
-
- return true;
-}
-
-static struct blk_mq_qe_pair *blk_lookup_qe_pair(struct list_head *head,
- struct request_queue *q)
-{
- struct blk_mq_qe_pair *qe;
-
- list_for_each_entry(qe, head, node)
- if (qe->q == q)
- return qe;
-
- return NULL;
-}
-
-static void blk_mq_elv_switch_back(struct list_head *head,
- struct request_queue *q)
-{
- struct blk_mq_qe_pair *qe;
- struct elevator_type *t;
-
- qe = blk_lookup_qe_pair(head, q);
- if (!qe)
- return;
- t = qe->type;
- list_del(&qe->node);
- kfree(qe);
-
- mutex_lock(&q->elevator_lock);
- elevator_switch(q, t);
- /* drop the reference acquired in blk_mq_elv_switch_none */
- elevator_put(t);
- mutex_unlock(&q->elevator_lock);
-}
-
static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
int nr_hw_queues)
{
struct request_queue *q;
- LIST_HEAD(head);
int prev_nr_hw_queues = set->nr_hw_queues;
unsigned int memflags;
int i;
@@ -5042,15 +4964,6 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
list_for_each_entry(q, &set->tag_list, tag_set_list)
blk_mq_freeze_queue_nomemsave(q);
- /*
- * Switch IO scheduler to 'none', cleaning up the data associated
- * with the previous scheduler. We will switch back once we are done
- * updating the new sw to hw queue mappings.
- */
- list_for_each_entry(q, &set->tag_list, tag_set_list)
- if (!blk_mq_elv_switch_none(&head, q))
- goto switch_back;
-
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_debugfs_unregister_hctxs(q);
blk_mq_sysfs_unregister_hctxs(q);
@@ -5084,9 +4997,8 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
blk_mq_debugfs_register_hctxs(q);
}
-switch_back:
list_for_each_entry(q, &set->tag_list, tag_set_list)
- blk_mq_elv_switch_back(&head, q);
+ elv_update_nr_hw_queues(q);
list_for_each_entry(q, &set->tag_list, tag_set_list)
blk_mq_unfreeze_queue_nomemrestore(q);
diff --git a/block/blk.h b/block/blk.h
index 006e3be433d2..2969f4427996 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -319,8 +319,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
bool blk_insert_flush(struct request *rq);
-int elevator_switch(struct request_queue *q, struct elevator_type *new_e);
-void elevator_disable(struct request_queue *q);
+void elv_update_nr_hw_queues(struct request_queue *q);
void elevator_exit(struct request_queue *q);
int elv_register_queue(struct request_queue *q, bool uevent);
void elv_unregister_queue(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index 56da6ab7691a..5705f7056516 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -621,7 +621,7 @@ void elevator_init_mq(struct request_queue *q)
* If switching fails, we are most likely running out of memory and not able
* to restore the old io scheduler, so leaving the io scheduler being none.
*/
-int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
+static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
{
int ret;
@@ -657,7 +657,7 @@ int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
return ret;
}
-void elevator_disable(struct request_queue *q)
+static void elevator_disable(struct request_queue *q)
{
WARN_ON_ONCE(q->mq_freeze_depth == 0);
lockdep_assert_held(&q->elevator_lock);
@@ -677,7 +677,8 @@ void elevator_disable(struct request_queue *q)
/*
* Switch this queue to the given IO scheduler.
*/
-static int elevator_change(struct request_queue *q, const char *elevator_name)
+static int __elevator_change(struct request_queue *q,
+ const char *elevator_name)
{
struct elevator_type *e;
int ret;
@@ -692,9 +693,6 @@ static int elevator_change(struct request_queue *q, const char *elevator_name)
return 0;
}
- if (q->elevator && elevator_match(q->elevator->type, elevator_name))
- return 0;
-
e = elevator_find_get(elevator_name);
if (!e)
return -EINVAL;
@@ -703,6 +701,28 @@ static int elevator_change(struct request_queue *q, const char *elevator_name)
return ret;
}
+static int elevator_change(struct request_queue *q, const char *elevator_name)
+{
+ if (!q->elevator || !elevator_match(q->elevator->type, elevator_name))
+ return __elevator_change(q, elevator_name);
+ return 0;
+}
+
+/*
+ * The I/O scheduler depends on the number of hardware queues, this forces a
+ * reattachment when nr_hw_queues changes.
+ */
+void elv_update_nr_hw_queues(struct request_queue *q)
+{
+ const char *name = "none";
+
+ mutex_lock(&q->elevator_lock);
+ if (q->elevator && !blk_queue_dying(q))
+ name = q->elevator->type->elevator_name;
+ __elevator_change(q, name);
+ mutex_unlock(&q->elevator_lock);
+}
+
static void elv_iosched_load_module(char *elevator_name)
{
struct elevator_type *found;
--
2.47.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
` (8 preceding siblings ...)
2025-04-24 15:21 ` [PATCH V3 09/20] block: simplify elevator reattachment for updating nr_hw_queues Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
2025-04-25 6:35 ` Hannes Reinecke
` (2 more replies)
2025-04-24 15:21 ` [PATCH V3 11/20] block: move queue freezing & elevator_lock into elevator_change() Ming Lei
` (10 subsequent siblings)
20 siblings, 3 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Move blk_unregister_queue() & device_del() after freeze wait, and prepare
for unifying elevator switch.
This way is just fine, since bdev has been unhashed at the beginning of
del_gendisk(), both blk_unregister_queue() & device_del() are dealing
with kobject & debugfs thing only.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/genhd.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/block/genhd.c b/block/genhd.c
index 7f3ae3d23b26..53c5b1e9e1d2 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -733,8 +733,6 @@ static void __del_gendisk(struct gendisk *disk)
bdi_unregister(disk->bdi);
}
- blk_unregister_queue(disk);
-
kobject_put(disk->part0->bd_holder_dir);
kobject_put(disk->slave_dir);
disk->slave_dir = NULL;
@@ -743,10 +741,12 @@ static void __del_gendisk(struct gendisk *disk)
disk->part0->bd_stamp = 0;
sysfs_remove_link(block_depr, dev_name(disk_to_dev(disk)));
pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
- device_del(disk_to_dev(disk));
blk_mq_freeze_queue_wait(q);
+ blk_unregister_queue(disk);
+ device_del(disk_to_dev(disk));
+
blk_throtl_cancel_bios(disk);
blk_sync_queue(q);
--
2.47.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH V3 11/20] block: move queue freezing & elevator_lock into elevator_change()
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
` (9 preceding siblings ...)
2025-04-24 15:21 ` [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
2025-04-25 6:36 ` Hannes Reinecke
` (2 more replies)
2025-04-24 15:21 ` [PATCH V3 12/20] block: add `struct elv_change_ctx` for unifying elevator change Ming Lei
` (9 subsequent siblings)
20 siblings, 3 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Move queue freezing & elevator_lock into elevator_change(), and prepare
for using elevator_change() for setting up & tearing down default elevator
too.
Also add lockdep_assert_held() in __elevator_change() because either
read or write lock is required for changing elevator.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/elevator.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/block/elevator.c b/block/elevator.c
index 5705f7056516..e70cd4b828a5 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -683,6 +683,8 @@ static int __elevator_change(struct request_queue *q,
struct elevator_type *e;
int ret;
+ lockdep_assert_held(&q->tag_set->update_nr_hwq_sema);
+
/* Make sure queue is not in the middle of being removed */
if (!blk_queue_registered(q))
return -ENOENT;
@@ -703,9 +705,16 @@ static int __elevator_change(struct request_queue *q,
static int elevator_change(struct request_queue *q, const char *elevator_name)
{
+ unsigned int memflags;
+ int ret = 0;
+
+ memflags = blk_mq_freeze_queue(q);
+ mutex_lock(&q->elevator_lock);
if (!q->elevator || !elevator_match(q->elevator->type, elevator_name))
- return __elevator_change(q, elevator_name);
- return 0;
+ ret = __elevator_change(q, elevator_name);
+ mutex_unlock(&q->elevator_lock);
+ blk_mq_unfreeze_queue(q, memflags);
+ return ret;
}
/*
@@ -741,7 +750,6 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
char elevator_name[ELV_NAME_MAX];
char *name;
int ret;
- unsigned int memflags;
struct request_queue *q = disk->queue;
struct blk_mq_tag_set *set = q->tag_set;
@@ -756,13 +764,9 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
elv_iosched_load_module(name);
down_read_nested(&set->update_nr_hwq_sema, 1);
- memflags = blk_mq_freeze_queue(q);
- mutex_lock(&q->elevator_lock);
ret = elevator_change(q, name);
if (!ret)
ret = count;
- mutex_unlock(&q->elevator_lock);
- blk_mq_unfreeze_queue(q, memflags);
up_read(&set->update_nr_hwq_sema);
return ret;
}
--
2.47.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH V3 12/20] block: add `struct elv_change_ctx` for unifying elevator change
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
` (10 preceding siblings ...)
2025-04-24 15:21 ` [PATCH V3 11/20] block: move queue freezing & elevator_lock into elevator_change() Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
2025-04-25 6:38 ` Hannes Reinecke
2025-04-25 18:23 ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 13/20] block: " Ming Lei
` (8 subsequent siblings)
20 siblings, 2 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Add `struct elv_change_ctx` and prepare for unifying elevator change by
[__]elevator_change(). With this way, any input & output parameter can
be provided & observed in top helper.
This way helps to move kobject add/delete & debugfs register/unregister
out of ->elevator_lock & freezing queue.
Reviewed-by : Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/elevator.c | 44 +++++++++++++++++++++++++++++---------------
1 file changed, 29 insertions(+), 15 deletions(-)
diff --git a/block/elevator.c b/block/elevator.c
index e70cd4b828a5..4be318b0b4ef 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -45,6 +45,12 @@
#include "blk-wbt.h"
#include "blk-cgroup.h"
+/* Holding context data for changing elevator */
+struct elv_change_ctx {
+ const char *name;
+ bool uevent;
+};
+
static DEFINE_SPINLOCK(elv_list_lock);
static LIST_HEAD(elv_list);
@@ -621,7 +627,8 @@ void elevator_init_mq(struct request_queue *q)
* If switching fails, we are most likely running out of memory and not able
* to restore the old io scheduler, so leaving the io scheduler being none.
*/
-static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
+static int elevator_switch(struct request_queue *q, struct elevator_type *new_e,
+ struct elv_change_ctx *ctx)
{
int ret;
@@ -639,7 +646,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
if (ret)
goto out_unfreeze;
- ret = elv_register_queue(q, true);
+ ret = elv_register_queue(q, ctx->uevent);
if (ret) {
elevator_exit(q);
goto out_unfreeze;
@@ -678,8 +685,9 @@ static void elevator_disable(struct request_queue *q)
* Switch this queue to the given IO scheduler.
*/
static int __elevator_change(struct request_queue *q,
- const char *elevator_name)
+ struct elv_change_ctx *ctx)
{
+ const char *elevator_name = ctx->name;
struct elevator_type *e;
int ret;
@@ -698,20 +706,21 @@ static int __elevator_change(struct request_queue *q,
e = elevator_find_get(elevator_name);
if (!e)
return -EINVAL;
- ret = elevator_switch(q, e);
+ ret = elevator_switch(q, e, ctx);
elevator_put(e);
return ret;
}
-static int elevator_change(struct request_queue *q, const char *elevator_name)
+static int elevator_change(struct request_queue *q,
+ struct elv_change_ctx *ctx)
{
unsigned int memflags;
int ret = 0;
memflags = blk_mq_freeze_queue(q);
mutex_lock(&q->elevator_lock);
- if (!q->elevator || !elevator_match(q->elevator->type, elevator_name))
- ret = __elevator_change(q, elevator_name);
+ if (!q->elevator || !elevator_match(q->elevator->type, ctx->name))
+ ret = __elevator_change(q, ctx);
mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
return ret;
@@ -723,16 +732,19 @@ static int elevator_change(struct request_queue *q, const char *elevator_name)
*/
void elv_update_nr_hw_queues(struct request_queue *q)
{
- const char *name = "none";
+ struct elv_change_ctx ctx = {
+ .name = "none",
+ .uevent = true,
+ };
mutex_lock(&q->elevator_lock);
if (q->elevator && !blk_queue_dying(q))
- name = q->elevator->type->elevator_name;
- __elevator_change(q, name);
+ ctx.name = q->elevator->type->elevator_name;
+ __elevator_change(q, &ctx);
mutex_unlock(&q->elevator_lock);
}
-static void elv_iosched_load_module(char *elevator_name)
+static void elv_iosched_load_module(const char *elevator_name)
{
struct elevator_type *found;
@@ -748,7 +760,9 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
size_t count)
{
char elevator_name[ELV_NAME_MAX];
- char *name;
+ struct elv_change_ctx ctx = {
+ .uevent = true,
+ };
int ret;
struct request_queue *q = disk->queue;
struct blk_mq_tag_set *set = q->tag_set;
@@ -759,12 +773,12 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
* queue is the one for the device storing the module file.
*/
strscpy(elevator_name, buf, sizeof(elevator_name));
- name = strstrip(elevator_name);
+ ctx.name = strstrip(elevator_name);
- elv_iosched_load_module(name);
+ elv_iosched_load_module(ctx.name);
down_read_nested(&set->update_nr_hwq_sema, 1);
- ret = elevator_change(q, name);
+ ret = elevator_change(q, &ctx);
if (!ret)
ret = count;
up_read(&set->update_nr_hwq_sema);
--
2.47.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH V3 13/20] block: unifying elevator change
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
` (11 preceding siblings ...)
2025-04-24 15:21 ` [PATCH V3 12/20] block: add `struct elv_change_ctx` for unifying elevator change Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
2025-04-25 6:39 ` Hannes Reinecke
` (2 more replies)
2025-04-24 15:21 ` [PATCH V3 14/20] block: pass elevator_queue to elv_register_queue & unregister_queue Ming Lei
` (7 subsequent siblings)
20 siblings, 3 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
elevator change is one well-define behavior:
- tear down current elevator if it exists
- setup new elevator
It is supposed to cover any case for changing elevator by single
internal API, typically the following cases:
- setup default elevator in add_disk()
- switch to none in del_disk()
- reset elevator in blk_mq_update_nr_hw_queues()
- switch elevator in sysfs `store` elevator attribute
This patch uses elevator_change() to cover all above cases:
- every elevator switch is serialized with each other: add_disk/del_disk/
store elevator is serialized already, blk_mq_update_nr_hw_queues() uses
srcu for syncing with the other three cases
- for both add_disk()/del_disk(), queue freeze works at atomic mode
or has been froze, so the freeze in elevator_change() won't add extra
delay
- `struct elev_change_ctx` instance holds any info for changing elevator
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-sysfs.c | 18 +++-----
block/blk.h | 5 +-
block/elevator.c | 114 +++++++++++++++++++++++++---------------------
block/genhd.c | 28 ++----------
4 files changed, 75 insertions(+), 90 deletions(-)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 1f9b45b0b9ee..9aa05666f32a 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -869,14 +869,8 @@ int blk_register_queue(struct gendisk *disk)
if (ret)
goto out_unregister_ia_ranges;
+ elevator_set_default(q);
mutex_lock(&q->elevator_lock);
- if (q->elevator) {
- ret = elv_register_queue(q, false);
- if (ret) {
- mutex_unlock(&q->elevator_lock);
- goto out_crypto_sysfs_unregister;
- }
- }
wbt_enable_default(disk);
mutex_unlock(&q->elevator_lock);
@@ -902,8 +896,6 @@ int blk_register_queue(struct gendisk *disk)
return ret;
-out_crypto_sysfs_unregister:
- blk_crypto_sysfs_unregister(disk);
out_unregister_ia_ranges:
disk_unregister_independent_access_ranges(disk);
out_debugfs_remove:
@@ -951,9 +943,11 @@ void blk_unregister_queue(struct gendisk *disk)
blk_mq_sysfs_unregister(disk);
blk_crypto_sysfs_unregister(disk);
- mutex_lock(&q->elevator_lock);
- elv_unregister_queue(q);
- mutex_unlock(&q->elevator_lock);
+ if (q->elevator) {
+ blk_mq_quiesce_queue(q);
+ elevator_set_none(q);
+ blk_mq_unquiesce_queue(q);
+ }
mutex_lock(&q->sysfs_lock);
disk_unregister_independent_access_ranges(disk);
diff --git a/block/blk.h b/block/blk.h
index 2969f4427996..6ba674d1ceba 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -320,9 +320,8 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
bool blk_insert_flush(struct request *rq);
void elv_update_nr_hw_queues(struct request_queue *q);
-void elevator_exit(struct request_queue *q);
-int elv_register_queue(struct request_queue *q, bool uevent);
-void elv_unregister_queue(struct request_queue *q);
+void elevator_set_default(struct request_queue *q);
+void elevator_set_none(struct request_queue *q);
ssize_t part_size_show(struct device *dev, struct device_attribute *attr,
char *buf);
diff --git a/block/elevator.c b/block/elevator.c
index 4be318b0b4ef..5ec66fbd4f24 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -49,6 +49,7 @@
struct elv_change_ctx {
const char *name;
bool uevent;
+ bool init;
};
static DEFINE_SPINLOCK(elv_list_lock);
@@ -154,7 +155,7 @@ static void elevator_release(struct kobject *kobj)
kfree(e);
}
-void elevator_exit(struct request_queue *q)
+static void elevator_exit(struct request_queue *q)
{
struct elevator_queue *e = q->elevator;
@@ -458,7 +459,7 @@ static const struct kobj_type elv_ktype = {
.release = elevator_release,
};
-int elv_register_queue(struct request_queue *q, bool uevent)
+static int elv_register_queue(struct request_queue *q, bool uevent)
{
struct elevator_queue *e = q->elevator;
int error;
@@ -488,7 +489,7 @@ int elv_register_queue(struct request_queue *q, bool uevent)
return error;
}
-void elv_unregister_queue(struct request_queue *q)
+static void elv_unregister_queue(struct request_queue *q)
{
struct elevator_queue *e = q->elevator;
@@ -565,60 +566,16 @@ EXPORT_SYMBOL_GPL(elv_unregister);
* For single queue devices, default to using mq-deadline. If we have multiple
* queues or mq-deadline is not available, default to "none".
*/
-static struct elevator_type *elevator_get_default(struct request_queue *q)
+static bool use_default_elevator(struct request_queue *q)
{
if (q->tag_set->flags & BLK_MQ_F_NO_SCHED_BY_DEFAULT)
- return NULL;
+ return false;
if (q->nr_hw_queues != 1 &&
!blk_mq_is_shared_tags(q->tag_set->flags))
- return NULL;
-
- return elevator_find_get("mq-deadline");
-}
-
-/*
- * Use the default elevator settings. If the chosen elevator initialization
- * fails, fall back to the "none" elevator (no elevator).
- */
-void elevator_init_mq(struct request_queue *q)
-{
- struct elevator_type *e;
- unsigned int memflags;
- int err;
-
- WARN_ON_ONCE(blk_queue_registered(q));
-
- if (unlikely(q->elevator))
- return;
-
- e = elevator_get_default(q);
- if (!e)
- return;
-
- /*
- * We are called before adding disk, when there isn't any FS I/O,
- * so freezing queue plus canceling dispatch work is enough to
- * drain any dispatch activities originated from passthrough
- * requests, then no need to quiesce queue which may add long boot
- * latency, especially when lots of disks are involved.
- *
- * Disk isn't added yet, so verifying queue lock only manually.
- */
- memflags = blk_mq_freeze_queue(q);
-
- blk_mq_cancel_work_sync(q);
-
- err = blk_mq_init_sched(q, e);
-
- blk_mq_unfreeze_queue(q, memflags);
-
- if (err) {
- pr_warn("\"%s\" elevator initialization failed, "
- "falling back to \"none\"\n", e->elevator_name);
- }
+ return false;
- elevator_put(e);
+ return true;
}
/*
@@ -694,7 +651,7 @@ static int __elevator_change(struct request_queue *q,
lockdep_assert_held(&q->tag_set->update_nr_hwq_sema);
/* Make sure queue is not in the middle of being removed */
- if (!blk_queue_registered(q))
+ if (!ctx->init && !blk_queue_registered(q))
return -ENOENT;
if (!strncmp(elevator_name, "none", 4)) {
@@ -718,6 +675,16 @@ static int elevator_change(struct request_queue *q,
int ret = 0;
memflags = blk_mq_freeze_queue(q);
+ /*
+ * May be called before adding disk, when there isn't any FS I/O,
+ * so freezing queue plus canceling dispatch work is enough to
+ * drain any dispatch activities originated from passthrough
+ * requests, then no need to quiesce queue which may add long boot
+ * latency, especially when lots of disks are involved.
+ *
+ * Disk isn't added yet, so verifying queue lock only manually.
+ */
+ blk_mq_cancel_work_sync(q);
mutex_lock(&q->elevator_lock);
if (!q->elevator || !elevator_match(q->elevator->type, ctx->name))
ret = __elevator_change(q, ctx);
@@ -726,6 +693,49 @@ static int elevator_change(struct request_queue *q,
return ret;
}
+/*
+ * Use the default elevator settings. If the chosen elevator initialization
+ * fails, fall back to the "none" elevator (no elevator).
+ */
+void elevator_set_default(struct request_queue *q)
+{
+ struct elv_change_ctx ctx = {
+ .init = true,
+ };
+ int err;
+
+ if (!queue_is_mq(q))
+ return;
+
+ ctx.name = use_default_elevator(q) ? "mq-deadline" : "none";
+ if (!q->elevator && !strcmp(ctx.name, "none"))
+ return;
+ err = elevator_change(q, &ctx);
+ if (err < 0)
+ pr_warn("\"%s\" set elevator failed %d, "
+ "falling back to \"none\"\n", ctx.name, err);
+}
+
+void elevator_set_none(struct request_queue *q)
+{
+ struct elv_change_ctx ctx = {
+ .name = "none",
+ .uevent = true,
+ .init = true,
+ };
+ int err;
+
+ if (!queue_is_mq(q))
+ return;
+
+ if (!q->elevator)
+ return;
+
+ err = elevator_change(q, &ctx);
+ if (err < 0)
+ pr_warn("%s: set none elevator failed %d\n", __func__, err);
+}
+
/*
* The I/O scheduler depends on the number of hardware queues, this forces a
* reattachment when nr_hw_queues changes.
diff --git a/block/genhd.c b/block/genhd.c
index 53c5b1e9e1d2..a50063c4c40f 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -416,12 +416,6 @@ static int __add_disk_fwnode(struct device *parent, struct gendisk *disk,
*/
if (disk->fops->submit_bio || disk->fops->poll_bio)
return -EINVAL;
-
- /*
- * Initialize the I/O scheduler code and pick a default one if
- * needed.
- */
- elevator_init_mq(disk->queue);
} else {
if (!disk->fops->submit_bio)
return -EINVAL;
@@ -438,7 +432,7 @@ static int __add_disk_fwnode(struct device *parent, struct gendisk *disk,
ret = -EINVAL;
if (disk->major) {
if (WARN_ON(!disk->minors))
- goto out_exit_elevator;
+ goto out;
if (disk->minors > DISK_MAX_PARTS) {
pr_err("block: can't allocate more than %d partitions\n",
@@ -448,14 +442,14 @@ static int __add_disk_fwnode(struct device *parent, struct gendisk *disk,
if (disk->first_minor > MINORMASK ||
disk->minors > MINORMASK + 1 ||
disk->first_minor + disk->minors > MINORMASK + 1)
- goto out_exit_elevator;
+ goto out;
} else {
if (WARN_ON(disk->minors))
- goto out_exit_elevator;
+ goto out;
ret = blk_alloc_ext_minor();
if (ret < 0)
- goto out_exit_elevator;
+ goto out;
disk->major = BLOCK_EXT_MAJOR;
disk->first_minor = ret;
}
@@ -564,12 +558,7 @@ static int __add_disk_fwnode(struct device *parent, struct gendisk *disk,
out_free_ext_minor:
if (disk->major == BLOCK_EXT_MAJOR)
blk_free_ext_minor(disk->first_minor);
-out_exit_elevator:
- if (disk->queue->elevator) {
- mutex_lock(&disk->queue->elevator_lock);
- elevator_exit(disk->queue);
- mutex_unlock(&disk->queue->elevator_lock);
- }
+out:
return ret;
}
@@ -755,14 +744,7 @@ static void __del_gendisk(struct gendisk *disk)
if (queue_is_mq(q))
blk_mq_cancel_work_sync(q);
- blk_mq_quiesce_queue(q);
- if (q->elevator) {
- mutex_lock(&q->elevator_lock);
- elevator_exit(q);
- mutex_unlock(&q->elevator_lock);
- }
rq_qos_exit(q);
- blk_mq_unquiesce_queue(q);
/*
* If the disk does not own the queue, allow using passthrough requests
--
2.47.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH V3 14/20] block: pass elevator_queue to elv_register_queue & unregister_queue
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
` (12 preceding siblings ...)
2025-04-24 15:21 ` [PATCH V3 13/20] block: " Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
2025-04-25 6:40 ` Hannes Reinecke
2025-04-24 15:21 ` [PATCH V3 15/20] block: fail to show/store elevator sysfs attribute if elevator is dying Ming Lei
` (6 subsequent siblings)
20 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Pass elevator_queue reference to elv_register_queue() & elv_unregister_queue().
No functional change, and prepare for moving the two out of elevator
lock & freezing queue, when we need to store the old & new elevator
queue in `struct elv_change_ctx` instance, then both two can co-exist
for short while, so we have to pass the exact elevator_queue instance
to elv_register_queue & unregister_queue.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/elevator.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/block/elevator.c b/block/elevator.c
index 5ec66fbd4f24..dde8732b7a26 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -459,9 +459,10 @@ static const struct kobj_type elv_ktype = {
.release = elevator_release,
};
-static int elv_register_queue(struct request_queue *q, bool uevent)
+static int elv_register_queue(struct request_queue *q,
+ struct elevator_queue *e,
+ bool uevent)
{
- struct elevator_queue *e = q->elevator;
int error;
lockdep_assert_held(&q->elevator_lock);
@@ -489,10 +490,9 @@ static int elv_register_queue(struct request_queue *q, bool uevent)
return error;
}
-static void elv_unregister_queue(struct request_queue *q)
+static void elv_unregister_queue(struct request_queue *q,
+ struct elevator_queue *e)
{
- struct elevator_queue *e = q->elevator;
-
lockdep_assert_held(&q->elevator_lock);
if (e && test_and_clear_bit(ELEVATOR_FLAG_REGISTERED, &e->flags)) {
@@ -595,7 +595,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e,
blk_mq_quiesce_queue(q);
if (q->elevator) {
- elv_unregister_queue(q);
+ elv_unregister_queue(q, q->elevator);
elevator_exit(q);
}
@@ -603,7 +603,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e,
if (ret)
goto out_unfreeze;
- ret = elv_register_queue(q, ctx->uevent);
+ ret = elv_register_queue(q, q->elevator, ctx->uevent);
if (ret) {
elevator_exit(q);
goto out_unfreeze;
@@ -628,7 +628,7 @@ static void elevator_disable(struct request_queue *q)
blk_mq_quiesce_queue(q);
- elv_unregister_queue(q);
+ elv_unregister_queue(q, q->elevator);
elevator_exit(q);
blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q);
q->elevator = NULL;
--
2.47.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH V3 15/20] block: fail to show/store elevator sysfs attribute if elevator is dying
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
` (13 preceding siblings ...)
2025-04-24 15:21 ` [PATCH V3 14/20] block: pass elevator_queue to elv_register_queue & unregister_queue Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
2025-04-25 6:45 ` Hannes Reinecke
2025-04-25 18:36 ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 16/20] block: move elv_register[unregister]_queue out of elevator_lock Ming Lei
` (5 subsequent siblings)
20 siblings, 2 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Prepare for moving elv_register[unregister]_queue out of elevator_lock
& queue freezing, so we may have to call elv_unregister_queue() after
elevator ->exit() is called, then there is small window for user to
call into ->show()/store(), and user-after-free can be caused.
Fail to show/store elevator sysfs attribute if elevator is dying by
adding one new flag of ELEVATOR_FLAG_DYNG, which is protected by
elevator ->sysfs_lock.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq-sched.c | 1 +
block/elevator.c | 10 ++++++++--
block/elevator.h | 1 +
3 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 336a15ffecfa..55a0fd105147 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -551,5 +551,6 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
if (e->type->ops.exit_sched)
e->type->ops.exit_sched(e);
blk_mq_sched_tags_teardown(q, flags);
+ set_bit(ELEVATOR_FLAG_DYING, &q->elevator->flags);
q->elevator = NULL;
}
diff --git a/block/elevator.c b/block/elevator.c
index dde8732b7a26..48f2f202af98 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -426,7 +426,10 @@ elv_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
e = container_of(kobj, struct elevator_queue, kobj);
mutex_lock(&e->sysfs_lock);
- error = e->type ? entry->show(e, page) : -ENOENT;
+ if (test_bit(ELEVATOR_FLAG_DYING, &e->flags))
+ error = -ENODEV;
+ else
+ error = e->type ? entry->show(e, page) : -ENOENT;
mutex_unlock(&e->sysfs_lock);
return error;
}
@@ -444,7 +447,10 @@ elv_attr_store(struct kobject *kobj, struct attribute *attr,
e = container_of(kobj, struct elevator_queue, kobj);
mutex_lock(&e->sysfs_lock);
- error = e->type ? entry->store(e, page, length) : -ENOENT;
+ if (test_bit(ELEVATOR_FLAG_DYING, &e->flags))
+ error = -ENODEV;
+ else
+ error = e->type ? entry->store(e, page, length) : -ENOENT;
mutex_unlock(&e->sysfs_lock);
return error;
}
diff --git a/block/elevator.h b/block/elevator.h
index 9198676644a9..76a90a1b7ed6 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -121,6 +121,7 @@ struct elevator_queue
};
#define ELEVATOR_FLAG_REGISTERED 0
+#define ELEVATOR_FLAG_DYING 1
/*
* block elevator interface
--
2.47.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH V3 16/20] block: move elv_register[unregister]_queue out of elevator_lock
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
` (14 preceding siblings ...)
2025-04-24 15:21 ` [PATCH V3 15/20] block: fail to show/store elevator sysfs attribute if elevator is dying Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
2025-04-25 6:46 ` Hannes Reinecke
` (2 more replies)
2025-04-24 15:21 ` [PATCH V3 17/20] block: move debugfs/sysfs register out of freezing queue Ming Lei
` (4 subsequent siblings)
20 siblings, 3 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Move elv_register[unregister]_queue out of ->elevator_lock & queue freezing,
so we can kill many lockdep warnings.
elv_register[unregister]_queue() is serialized, and just dealing with sysfs/
debugfs things, no need to be done with queue frozen.
With this change, elevator's ->exit() is called before calling
elv_unregister_queue, then user may call into ->show()/store() of elevator's
sysfs attributes, and we have covered this issue by adding `ELEVATOR_FLAG_DYNG`.
For blk-mq debugfs, hctx->sched_tags is always checked with ->elevator_lock by
debugfs code, meantime hctx->sched_tags is updated with ->elevator_lock, so
there isn't such issue.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 3 +-
block/elevator.c | 80 ++++++++++++++++++++++++++++++++++++------------
2 files changed, 62 insertions(+), 21 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3afcddd21586..3ecbf73773ea 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4997,11 +4997,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
blk_mq_debugfs_register_hctxs(q);
}
+ /* elv_update_nr_hw_queues() unfreeze queue for us */
list_for_each_entry(q, &set->tag_list, tag_set_list)
elv_update_nr_hw_queues(q);
- list_for_each_entry(q, &set->tag_list, tag_set_list)
- blk_mq_unfreeze_queue_nomemrestore(q);
memalloc_noio_restore(memflags);
/* Free the excess tags when nr_hw_queues shrink. */
diff --git a/block/elevator.c b/block/elevator.c
index 48f2f202af98..aec8081a6be3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -50,6 +50,11 @@ struct elv_change_ctx {
const char *name;
bool uevent;
bool init;
+
+ /* for unregistering old elevator */
+ struct elevator_queue *old;
+ /* for registering new elevator */
+ struct elevator_queue *new;
};
static DEFINE_SPINLOCK(elv_list_lock);
@@ -155,18 +160,18 @@ static void elevator_release(struct kobject *kobj)
kfree(e);
}
-static void elevator_exit(struct request_queue *q)
+static void __elevator_exit(struct request_queue *q)
{
struct elevator_queue *e = q->elevator;
+ lockdep_assert_held(&q->elevator_lock);
+
ioc_clear_queue(q);
blk_mq_sched_free_rqs(q);
mutex_lock(&e->sysfs_lock);
blk_mq_exit_sched(q, e);
mutex_unlock(&e->sysfs_lock);
-
- kobject_put(&e->kobj);
}
static inline void __elv_rqhash_del(struct request *rq)
@@ -471,8 +476,6 @@ static int elv_register_queue(struct request_queue *q,
{
int error;
- lockdep_assert_held(&q->elevator_lock);
-
error = kobject_add(&e->kobj, &q->disk->queue_kobj, "iosched");
if (!error) {
const struct elv_fs_entry *attr = e->type->elevator_attrs;
@@ -499,8 +502,6 @@ static int elv_register_queue(struct request_queue *q,
static void elv_unregister_queue(struct request_queue *q,
struct elevator_queue *e)
{
- lockdep_assert_held(&q->elevator_lock);
-
if (e && test_and_clear_bit(ELEVATOR_FLAG_REGISTERED, &e->flags)) {
kobject_uevent(&e->kobj, KOBJ_REMOVE);
kobject_del(&e->kobj);
@@ -601,19 +602,15 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e,
blk_mq_quiesce_queue(q);
if (q->elevator) {
- elv_unregister_queue(q, q->elevator);
- elevator_exit(q);
+ ctx->old = q->elevator;
+ __elevator_exit(q);
}
ret = blk_mq_init_sched(q, new_e);
if (ret)
goto out_unfreeze;
- ret = elv_register_queue(q, q->elevator, ctx->uevent);
- if (ret) {
- elevator_exit(q);
- goto out_unfreeze;
- }
+ ctx->new = q->elevator;
blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
out_unfreeze:
@@ -627,15 +624,16 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e,
return ret;
}
-static void elevator_disable(struct request_queue *q)
+static void elevator_disable(struct request_queue *q,
+ struct elv_change_ctx *ctx)
{
WARN_ON_ONCE(q->mq_freeze_depth == 0);
lockdep_assert_held(&q->elevator_lock);
blk_mq_quiesce_queue(q);
- elv_unregister_queue(q, q->elevator);
- elevator_exit(q);
+ ctx->old = q->elevator;
+ __elevator_exit(q);
blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q);
q->elevator = NULL;
q->nr_requests = q->tag_set->queue_depth;
@@ -644,6 +642,38 @@ static void elevator_disable(struct request_queue *q)
blk_mq_unquiesce_queue(q);
}
+static void elv_exit_and_release(struct request_queue *q)
+{
+ struct elevator_queue *e;
+ unsigned memflags;
+
+ memflags = blk_mq_freeze_queue(q);
+ mutex_lock(&q->elevator_lock);
+ e = q->elevator;
+ __elevator_exit(q);
+ mutex_unlock(&q->elevator_lock);
+ blk_mq_unfreeze_queue(q, memflags);
+ if (e)
+ kobject_put(&e->kobj);
+}
+
+static int elevator_change_done(struct request_queue *q,
+ struct elv_change_ctx *ctx)
+{
+ int ret = 0;
+
+ if (ctx->old) {
+ elv_unregister_queue(q, ctx->old);
+ kobject_put(&ctx->old->kobj);
+ }
+ if (ctx->new) {
+ ret = elv_register_queue(q, ctx->new, ctx->uevent);
+ if (ret)
+ elv_exit_and_release(q);
+ }
+ return ret;
+}
+
/*
* Switch this queue to the given IO scheduler.
*/
@@ -662,7 +692,7 @@ static int __elevator_change(struct request_queue *q,
if (!strncmp(elevator_name, "none", 4)) {
if (q->elevator)
- elevator_disable(q);
+ elevator_disable(q, ctx);
return 0;
}
@@ -696,6 +726,9 @@ static int elevator_change(struct request_queue *q,
ret = __elevator_change(q, ctx);
mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
+ if (!ret)
+ ret = elevator_change_done(q, ctx);
+
return ret;
}
@@ -745,6 +778,8 @@ void elevator_set_none(struct request_queue *q)
/*
* The I/O scheduler depends on the number of hardware queues, this forces a
* reattachment when nr_hw_queues changes.
+ *
+ * Note that this unfreezes the passed in queue.
*/
void elv_update_nr_hw_queues(struct request_queue *q)
{
@@ -752,12 +787,19 @@ void elv_update_nr_hw_queues(struct request_queue *q)
.name = "none",
.uevent = true,
};
+ int ret = -ENODEV;
+
+ WARN_ON_ONCE(q->mq_freeze_depth == 0);
mutex_lock(&q->elevator_lock);
if (q->elevator && !blk_queue_dying(q))
ctx.name = q->elevator->type->elevator_name;
- __elevator_change(q, &ctx);
+ ret = __elevator_change(q, &ctx);
mutex_unlock(&q->elevator_lock);
+ blk_mq_unfreeze_queue_nomemrestore(q);
+
+ if (!ret)
+ WARN_ON_ONCE(elevator_change_done(q, &ctx));
}
static void elv_iosched_load_module(const char *elevator_name)
--
2.47.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH V3 17/20] block: move debugfs/sysfs register out of freezing queue
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
` (15 preceding siblings ...)
2025-04-24 15:21 ` [PATCH V3 16/20] block: move elv_register[unregister]_queue out of elevator_lock Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
2025-04-25 6:47 ` Hannes Reinecke
2025-04-25 18:38 ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 18/20] block: remove several ->elevator_lock Ming Lei
` (3 subsequent siblings)
20 siblings, 2 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Move debugfs/sysfs register out of freezing queue in
__blk_mq_update_nr_hw_queues(), so that the following lockdep dependency
can be killed:
#2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
#1 (fs_reclaim){+.+.}-{0:0}:
#0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs
And registering/un-registering debugfs/sysfs does not require queue to be
frozen.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ecbf73773ea..420150eb5a45 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4961,14 +4961,14 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
return;
memflags = memalloc_noio_save();
- list_for_each_entry(q, &set->tag_list, tag_set_list)
- blk_mq_freeze_queue_nomemsave(q);
-
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_debugfs_unregister_hctxs(q);
blk_mq_sysfs_unregister_hctxs(q);
}
+ list_for_each_entry(q, &set->tag_list, tag_set_list)
+ blk_mq_freeze_queue_nomemsave(q);
+
if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
goto reregister;
@@ -4991,16 +4991,15 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
blk_mq_map_swqueue(q);
}
+ /* elv_update_nr_hw_queues() unfreeze queue for us */
+ list_for_each_entry(q, &set->tag_list, tag_set_list)
+ elv_update_nr_hw_queues(q);
+
reregister:
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_sysfs_register_hctxs(q);
blk_mq_debugfs_register_hctxs(q);
}
-
- /* elv_update_nr_hw_queues() unfreeze queue for us */
- list_for_each_entry(q, &set->tag_list, tag_set_list)
- elv_update_nr_hw_queues(q);
-
memalloc_noio_restore(memflags);
/* Free the excess tags when nr_hw_queues shrink. */
--
2.47.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH V3 18/20] block: remove several ->elevator_lock
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
` (16 preceding siblings ...)
2025-04-24 15:21 ` [PATCH V3 17/20] block: move debugfs/sysfs register out of freezing queue Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
2025-04-25 6:48 ` Hannes Reinecke
2025-04-25 18:38 ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 19/20] block: move hctx cpuhp add/del out of queue freezing Ming Lei
` (2 subsequent siblings)
20 siblings, 2 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Both blk_mq_map_swqueue() and blk_mq_realloc_hw_ctxs() are called before
the request queue is added to tagset list, so the two won't run concurrently
with blk_mq_update_nr_hw_queues().
When the two functions are only called from queue initialization or
blk_mq_update_nr_hw_queues(), elevator switch can't happen.
So remove these ->elevator_lock uses.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 19 ++++---------------
1 file changed, 4 insertions(+), 15 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 420150eb5a45..e08eda094ae7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4112,8 +4112,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
struct blk_mq_ctx *ctx;
struct blk_mq_tag_set *set = q->tag_set;
- mutex_lock(&q->elevator_lock);
-
queue_for_each_hw_ctx(q, hctx, i) {
cpumask_clear(hctx->cpumask);
hctx->nr_ctx = 0;
@@ -4218,8 +4216,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
hctx->next_cpu = blk_mq_first_mapped_cpu(hctx);
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
}
-
- mutex_unlock(&q->elevator_lock);
}
/*
@@ -4523,16 +4519,9 @@ static void __blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
}
static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
- struct request_queue *q, bool lock)
+ struct request_queue *q)
{
- if (lock) {
- /* protect against switching io scheduler */
- mutex_lock(&q->elevator_lock);
- __blk_mq_realloc_hw_ctxs(set, q);
- mutex_unlock(&q->elevator_lock);
- } else {
- __blk_mq_realloc_hw_ctxs(set, q);
- }
+ __blk_mq_realloc_hw_ctxs(set, q);
/* unregister cpuhp callbacks for exited hctxs */
blk_mq_remove_hw_queues_cpuhp(q);
@@ -4564,7 +4553,7 @@ int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
xa_init(&q->hctx_table);
- blk_mq_realloc_hw_ctxs(set, q, false);
+ blk_mq_realloc_hw_ctxs(set, q);
if (!q->nr_hw_queues)
goto err_hctxs;
@@ -4975,7 +4964,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
fallback:
blk_mq_update_queue_map(set);
list_for_each_entry(q, &set->tag_list, tag_set_list) {
- blk_mq_realloc_hw_ctxs(set, q, true);
+ blk_mq_realloc_hw_ctxs(set, q);
if (q->nr_hw_queues != set->nr_hw_queues) {
int i = prev_nr_hw_queues;
--
2.47.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH V3 19/20] block: move hctx cpuhp add/del out of queue freezing
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
` (17 preceding siblings ...)
2025-04-24 15:21 ` [PATCH V3 18/20] block: remove several ->elevator_lock Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
2025-04-25 6:49 ` Hannes Reinecke
2025-04-24 15:21 ` [PATCH V3 20/20] block: move wbt_enable_default() out of queue freezing from sched ->exit() Ming Lei
2025-04-29 12:00 ` [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Stefan Haberland
20 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
Move hctx cpuhp add/del out of queue freezing for not connecting freeze
lock with cpuhp locks, then lockdep warning can be avoided.
This way is safe because both needn't queue to be frozen and scheduler
switch isn't allowed.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/blk-mq.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e08eda094ae7..add653f84797 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4964,7 +4964,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
fallback:
blk_mq_update_queue_map(set);
list_for_each_entry(q, &set->tag_list, tag_set_list) {
- blk_mq_realloc_hw_ctxs(set, q);
+ __blk_mq_realloc_hw_ctxs(set, q);
if (q->nr_hw_queues != set->nr_hw_queues) {
int i = prev_nr_hw_queues;
@@ -4988,6 +4988,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_sysfs_register_hctxs(q);
blk_mq_debugfs_register_hctxs(q);
+
+ blk_mq_remove_hw_queues_cpuhp(q);
+ blk_mq_add_hw_queues_cpuhp(q);
}
memalloc_noio_restore(memflags);
--
2.47.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* [PATCH V3 20/20] block: move wbt_enable_default() out of queue freezing from sched ->exit()
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
` (18 preceding siblings ...)
2025-04-24 15:21 ` [PATCH V3 19/20] block: move hctx cpuhp add/del out of queue freezing Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
2025-04-25 13:10 ` Nilay Shroff
2025-04-29 10:59 ` Nilay Shroff
2025-04-29 12:00 ` [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Stefan Haberland
20 siblings, 2 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig, Ming Lei
scheduler's ->exit() is called with queue frozen and elevator lock is held, and
wbt_enable_default() can't be called with queue frozen, otherwise the
following lockdep warning is triggered:
#6 (&q->rq_qos_mutex){+.+.}-{4:4}:
#5 (&eq->sysfs_lock){+.+.}-{4:4}:
#4 (&q->elevator_lock){+.+.}-{4:4}:
#3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
#2 (fs_reclaim){+.+.}-{0:0}:
#1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
#0 (&q->debugfs_mutex){+.+.}-{4:4}:
Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
call it from elevator_change_done().
Meantime add disk->rqos_state_mutex for covering wbt state change, which
matches the purpose more than ->elevator_lock.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
block/bfq-iosched.c | 2 +-
block/blk-sysfs.c | 6 ++----
block/blk-wbt.c | 10 ++++++++--
block/elevator.c | 5 +++++
block/elevator.h | 1 +
block/genhd.c | 1 +
include/linux/blkdev.h | 2 ++
7 files changed, 20 insertions(+), 7 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index cc6f59836dcd..0cb1e9873aab 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7211,7 +7211,7 @@ static void bfq_exit_queue(struct elevator_queue *e)
blk_stat_disable_accounting(bfqd->queue);
blk_queue_flag_clear(QUEUE_FLAG_DISABLE_WBT_DEF, bfqd->queue);
- wbt_enable_default(bfqd->queue->disk);
+ set_bit(ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT, &e->flags);
kfree(bfqd);
}
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 9aa05666f32a..fafe9e9e97cc 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -593,7 +593,7 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
return -EINVAL;
memflags = blk_mq_freeze_queue(q);
- mutex_lock(&q->elevator_lock);
+ mutex_lock(&disk->rqos_state_mutex);
rqos = wbt_rq_qos(q);
if (!rqos) {
@@ -622,7 +622,7 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
blk_mq_unquiesce_queue(q);
out:
- mutex_unlock(&q->elevator_lock);
+ mutex_unlock(&disk->rqos_state_mutex);
blk_mq_unfreeze_queue(q, memflags);
return ret;
@@ -870,9 +870,7 @@ int blk_register_queue(struct gendisk *disk)
goto out_unregister_ia_ranges;
elevator_set_default(q);
- mutex_lock(&q->elevator_lock);
wbt_enable_default(disk);
- mutex_unlock(&q->elevator_lock);
blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q);
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 29cd2e33666f..c8588bae1c1b 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -704,6 +704,8 @@ void wbt_enable_default(struct gendisk *disk)
struct rq_qos *rqos;
bool enable = IS_ENABLED(CONFIG_BLK_WBT_MQ);
+ mutex_lock(&disk->rqos_state_mutex);
+
if (blk_queue_disable_wbt(q))
enable = false;
@@ -712,15 +714,17 @@ void wbt_enable_default(struct gendisk *disk)
if (rqos) {
if (enable && RQWB(rqos)->enable_state == WBT_STATE_OFF_DEFAULT)
RQWB(rqos)->enable_state = WBT_STATE_ON_DEFAULT;
- return;
+ goto unlock;
}
/* Queue not registered? Maybe shutting down... */
if (!blk_queue_registered(q))
- return;
+ goto unlock;
if (queue_is_mq(q) && enable)
wbt_init(disk);
+unlock:
+ mutex_unlock(&disk->rqos_state_mutex);
}
EXPORT_SYMBOL_GPL(wbt_enable_default);
@@ -773,11 +777,13 @@ void wbt_disable_default(struct gendisk *disk)
struct rq_wb *rwb;
if (!rqos)
return;
+ mutex_lock(&disk->rqos_state_mutex);
rwb = RQWB(rqos);
if (rwb->enable_state == WBT_STATE_ON_DEFAULT) {
blk_stat_deactivate(rwb->cb);
rwb->enable_state = WBT_STATE_OFF_DEFAULT;
}
+ mutex_unlock(&disk->rqos_state_mutex);
}
EXPORT_SYMBOL_GPL(wbt_disable_default);
diff --git a/block/elevator.c b/block/elevator.c
index aec8081a6be3..a637426da56d 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -663,8 +663,13 @@ static int elevator_change_done(struct request_queue *q,
int ret = 0;
if (ctx->old) {
+ bool enable_wbt = test_bit(ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT,
+ &ctx->old->flags);
+
elv_unregister_queue(q, ctx->old);
kobject_put(&ctx->old->kobj);
+ if (enable_wbt)
+ wbt_enable_default(q->disk);
}
if (ctx->new) {
ret = elv_register_queue(q, ctx->new, ctx->uevent);
diff --git a/block/elevator.h b/block/elevator.h
index 76a90a1b7ed6..a07ce773a38f 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -122,6 +122,7 @@ struct elevator_queue
#define ELEVATOR_FLAG_REGISTERED 0
#define ELEVATOR_FLAG_DYING 1
+#define ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT 2
/*
* block elevator interface
diff --git a/block/genhd.c b/block/genhd.c
index a50063c4c40f..e13be309552a 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1452,6 +1452,7 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
#ifdef CONFIG_BLOCK_HOLDER_DEPRECATED
INIT_LIST_HEAD(&disk->slave_bdevs);
#endif
+ mutex_init(&disk->rqos_state_mutex);
return disk;
out_erase_part0:
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 228b2e5ad493..281d56b910c2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -218,6 +218,8 @@ struct gendisk {
* devices that do not have multiple independent access ranges.
*/
struct blk_independent_access_ranges *ia_ranges;
+
+ struct mutex rqos_state_mutex; /* rqos state change mutex */
};
/**
--
2.47.0
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [PATCH V3 04/20] block: use q->elevator with ->elevator_lock held in elv_iosched_show()
2025-04-24 15:21 ` [PATCH V3 04/20] block: use q->elevator with ->elevator_lock held in elv_iosched_show() Ming Lei
@ 2025-04-25 6:08 ` Hannes Reinecke
2025-04-25 10:53 ` Nilay Shroff
2025-04-25 14:29 ` Christoph Hellwig
2 siblings, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25 6:08 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig
On 4/24/25 17:21, Ming Lei wrote:
> Use q->elevator with ->elevator_lock held in elv_iosched_show(), since
> the local cached elevator reference may become stale after getting
> ->elevator_lock.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> block/elevator.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues
2025-04-24 15:21 ` [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues Ming Lei
@ 2025-04-25 6:33 ` Hannes Reinecke
2025-04-25 11:37 ` Nilay Shroff
` (2 subsequent siblings)
3 siblings, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25 6:33 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig
On 4/24/25 17:21, Ming Lei wrote:
> Both adding/deleting disk code are reader of `nr_hw_queues`, so we can't
> allow them in-progress when updating nr_hw_queues, kernel panic and
> kasan has been reported in [1].
>
> Prevent adding/deleting disk during updating nr_hw_queues by adding
> rw_semaphore to tagset, write lock is grabbed in blk_mq_update_nr_hw_queues(),
> and read lock is acquired when adding/deleting disk.
>
> Also mark GFP_NOIO allocation scope for adding/deleting disk because
> blk_mq_update_nr_hw_queues() is part of some driver's error handler.
>
> This way avoids lot of trouble.
>
> Suggested-by: Nilay Shroff <nilay@linux.ibm.com>
> Reported-by: Nilay Shroff <nilay@linux.ibm.com>
> Closes: https://lore.kernel.org/linux-block/a5896cdb-a59a-4a37-9f99-20522f5d2987@linux.ibm.com/
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> block/blk-mq.c | 4 ++
> block/genhd.c | 94 +++++++++++++++++++++++++++++++-----------
> include/linux/blk-mq.h | 3 ++
> 3 files changed, 78 insertions(+), 23 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress
2025-04-24 15:21 ` [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress Ming Lei
@ 2025-04-25 6:33 ` Hannes Reinecke
2025-04-25 12:48 ` Nilay Shroff
1 sibling, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25 6:33 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig
On 4/24/25 17:21, Ming Lei wrote:
> Elevator switch code is another `nr_hw_queue` reader in non-fast-IO code
> path, so it can't be done if updating `nr_hw_queues` is in-progress.
>
> Take same approach with not allowing add/del disk when updating
> nr_hw_queues is in-progress, by grabbing read lock of
> set->update_nr_hwq_sema.
>
> Take the nested variant for avoiding the following false positive
> splat[1], and this way is correct because:
>
> - the read lock in elv_iosched_store() is not overlapped with the read lock
> in adding/deleting disk:
>
> - kobject attribute is only available after the kobject is added and
> before it is deleted
>
> -> #4 (&q->q_usage_counter(queue){++++}-{0:0}:
> -> #3 (&q->limits_lock){+.+.}-{4:4}:
> -> #2 (&disk->open_mutex){+.+.}-{4:4}:
> -> #1 (&set->update_nr_hwq_lock){.+.+}-{4:4}:
> -> #0 (kn->active#103){++++}-{0:0}:
>
> Link: https://lore.kernel.org/linux-block/aAWv3NPtNIKKvJZc@fedora/ [1]
> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> block/elevator.c | 3 +++
> 1 file changed, 3 insertions(+)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 09/20] block: simplify elevator reattachment for updating nr_hw_queues
2025-04-24 15:21 ` [PATCH V3 09/20] block: simplify elevator reattachment for updating nr_hw_queues Ming Lei
@ 2025-04-25 6:34 ` Hannes Reinecke
2025-04-25 18:12 ` Christoph Hellwig
1 sibling, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25 6:34 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig
On 4/24/25 17:21, Ming Lei wrote:
> In blk_mq_update_nr_hw_queues(), nr_hw_queues changes and elevator data
> depends on it, so elevator has to be reattached.
>
> Now elevator switch isn't allowed during blk_mq_update_nr_hw_queues(), so
> we can simply call elevator_change() to reattach elevator sched tags after
> nr_hw_queues is updated.
>
> Add elv_update_nr_hw_queues() for blk_mq_update_nr_hw_queues() to
> reattach elevator.
>
> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> block/blk-mq.c | 90 +-----------------------------------------------
> block/blk.h | 3 +-
> block/elevator.c | 32 +++++++++++++----
> 3 files changed, 28 insertions(+), 97 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait
2025-04-24 15:21 ` [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait Ming Lei
@ 2025-04-25 6:35 ` Hannes Reinecke
2025-04-25 12:50 ` Nilay Shroff
2025-04-25 14:34 ` Christoph Hellwig
2 siblings, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25 6:35 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig
On 4/24/25 17:21, Ming Lei wrote:
> Move blk_unregister_queue() & device_del() after freeze wait, and prepare
> for unifying elevator switch.
>
> This way is just fine, since bdev has been unhashed at the beginning of
> del_gendisk(), both blk_unregister_queue() & device_del() are dealing
> with kobject & debugfs thing only.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> block/genhd.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 11/20] block: move queue freezing & elevator_lock into elevator_change()
2025-04-24 15:21 ` [PATCH V3 11/20] block: move queue freezing & elevator_lock into elevator_change() Ming Lei
@ 2025-04-25 6:36 ` Hannes Reinecke
2025-04-25 12:54 ` Nilay Shroff
2025-04-25 14:35 ` Christoph Hellwig
2 siblings, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25 6:36 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig
On 4/24/25 17:21, Ming Lei wrote:
> Move queue freezing & elevator_lock into elevator_change(), and prepare
> for using elevator_change() for setting up & tearing down default elevator
> too.
>
> Also add lockdep_assert_held() in __elevator_change() because either
> read or write lock is required for changing elevator.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> block/elevator.c | 18 +++++++++++-------
> 1 file changed, 11 insertions(+), 7 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 12/20] block: add `struct elv_change_ctx` for unifying elevator change
2025-04-24 15:21 ` [PATCH V3 12/20] block: add `struct elv_change_ctx` for unifying elevator change Ming Lei
@ 2025-04-25 6:38 ` Hannes Reinecke
2025-04-25 18:23 ` Christoph Hellwig
1 sibling, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25 6:38 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig
On 4/24/25 17:21, Ming Lei wrote:
> Add `struct elv_change_ctx` and prepare for unifying elevator change by
> [__]elevator_change(). With this way, any input & output parameter can
> be provided & observed in top helper.
>
> This way helps to move kobject add/delete & debugfs register/unregister
> out of ->elevator_lock & freezing queue.
>
> Reviewed-by : Nilay Shroff <nilay@linux.ibm.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> block/elevator.c | 44 +++++++++++++++++++++++++++++---------------
> 1 file changed, 29 insertions(+), 15 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 13/20] block: unifying elevator change
2025-04-24 15:21 ` [PATCH V3 13/20] block: " Ming Lei
@ 2025-04-25 6:39 ` Hannes Reinecke
2025-04-25 13:03 ` Nilay Shroff
2025-04-25 18:30 ` Christoph Hellwig
2 siblings, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25 6:39 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig
On 4/24/25 17:21, Ming Lei wrote:
> elevator change is one well-define behavior:
>
> - tear down current elevator if it exists
>
> - setup new elevator
>
> It is supposed to cover any case for changing elevator by single
> internal API, typically the following cases:
>
> - setup default elevator in add_disk()
>
> - switch to none in del_disk()
>
> - reset elevator in blk_mq_update_nr_hw_queues()
>
> - switch elevator in sysfs `store` elevator attribute
>
> This patch uses elevator_change() to cover all above cases:
>
> - every elevator switch is serialized with each other: add_disk/del_disk/
> store elevator is serialized already, blk_mq_update_nr_hw_queues() uses
> srcu for syncing with the other three cases
>
> - for both add_disk()/del_disk(), queue freeze works at atomic mode
> or has been froze, so the freeze in elevator_change() won't add extra
> delay
>
> - `struct elev_change_ctx` instance holds any info for changing elevator
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> block/blk-sysfs.c | 18 +++-----
> block/blk.h | 5 +-
> block/elevator.c | 114 +++++++++++++++++++++++++---------------------
> block/genhd.c | 28 ++----------
> 4 files changed, 75 insertions(+), 90 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 14/20] block: pass elevator_queue to elv_register_queue & unregister_queue
2025-04-24 15:21 ` [PATCH V3 14/20] block: pass elevator_queue to elv_register_queue & unregister_queue Ming Lei
@ 2025-04-25 6:40 ` Hannes Reinecke
0 siblings, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25 6:40 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig
On 4/24/25 17:21, Ming Lei wrote:
> Pass elevator_queue reference to elv_register_queue() & elv_unregister_queue().
>
> No functional change, and prepare for moving the two out of elevator
> lock & freezing queue, when we need to store the old & new elevator
> queue in `struct elv_change_ctx` instance, then both two can co-exist
> for short while, so we have to pass the exact elevator_queue instance
> to elv_register_queue & unregister_queue.
>
> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> block/elevator.c | 16 ++++++++--------
> 1 file changed, 8 insertions(+), 8 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 15/20] block: fail to show/store elevator sysfs attribute if elevator is dying
2025-04-24 15:21 ` [PATCH V3 15/20] block: fail to show/store elevator sysfs attribute if elevator is dying Ming Lei
@ 2025-04-25 6:45 ` Hannes Reinecke
2025-04-25 18:36 ` Christoph Hellwig
1 sibling, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25 6:45 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig
On 4/24/25 17:21, Ming Lei wrote:
> Prepare for moving elv_register[unregister]_queue out of elevator_lock
> & queue freezing, so we may have to call elv_unregister_queue() after
> elevator ->exit() is called, then there is small window for user to
> call into ->show()/store(), and user-after-free can be caused.
>
> Fail to show/store elevator sysfs attribute if elevator is dying by
> adding one new flag of ELEVATOR_FLAG_DYNG, which is protected by
> elevator ->sysfs_lock.
>
> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> block/blk-mq-sched.c | 1 +
> block/elevator.c | 10 ++++++++--
> block/elevator.h | 1 +
> 3 files changed, 10 insertions(+), 2 deletions(-)
>
Not particularly happy about this, but I fear it's the best we can do
for now.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 16/20] block: move elv_register[unregister]_queue out of elevator_lock
2025-04-24 15:21 ` [PATCH V3 16/20] block: move elv_register[unregister]_queue out of elevator_lock Ming Lei
@ 2025-04-25 6:46 ` Hannes Reinecke
2025-04-25 13:05 ` Nilay Shroff
2025-04-25 18:37 ` Christoph Hellwig
2 siblings, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25 6:46 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig
On 4/24/25 17:21, Ming Lei wrote:
> Move elv_register[unregister]_queue out of ->elevator_lock & queue freezing,
> so we can kill many lockdep warnings.
>
> elv_register[unregister]_queue() is serialized, and just dealing with sysfs/
> debugfs things, no need to be done with queue frozen.
>
> With this change, elevator's ->exit() is called before calling
> elv_unregister_queue, then user may call into ->show()/store() of elevator's
> sysfs attributes, and we have covered this issue by adding `ELEVATOR_FLAG_DYNG`.
>
> For blk-mq debugfs, hctx->sched_tags is always checked with ->elevator_lock by
> debugfs code, meantime hctx->sched_tags is updated with ->elevator_lock, so
> there isn't such issue.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> block/blk-mq.c | 3 +-
> block/elevator.c | 80 ++++++++++++++++++++++++++++++++++++------------
> 2 files changed, 62 insertions(+), 21 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 17/20] block: move debugfs/sysfs register out of freezing queue
2025-04-24 15:21 ` [PATCH V3 17/20] block: move debugfs/sysfs register out of freezing queue Ming Lei
@ 2025-04-25 6:47 ` Hannes Reinecke
2025-04-25 18:38 ` Christoph Hellwig
1 sibling, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25 6:47 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig
On 4/24/25 17:21, Ming Lei wrote:
> Move debugfs/sysfs register out of freezing queue in
> __blk_mq_update_nr_hw_queues(), so that the following lockdep dependency
> can be killed:
>
> #2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
> #1 (fs_reclaim){+.+.}-{0:0}:
> #0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs
>
> And registering/un-registering debugfs/sysfs does not require queue to be
> frozen.
>
> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> block/blk-mq.c | 15 +++++++--------
> 1 file changed, 7 insertions(+), 8 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 18/20] block: remove several ->elevator_lock
2025-04-24 15:21 ` [PATCH V3 18/20] block: remove several ->elevator_lock Ming Lei
@ 2025-04-25 6:48 ` Hannes Reinecke
2025-04-25 18:38 ` Christoph Hellwig
1 sibling, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25 6:48 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig
On 4/24/25 17:21, Ming Lei wrote:
> Both blk_mq_map_swqueue() and blk_mq_realloc_hw_ctxs() are called before
> the request queue is added to tagset list, so the two won't run concurrently
> with blk_mq_update_nr_hw_queues().
>
> When the two functions are only called from queue initialization or
> blk_mq_update_nr_hw_queues(), elevator switch can't happen.
>
> So remove these ->elevator_lock uses.
>
> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> block/blk-mq.c | 19 ++++---------------
> 1 file changed, 4 insertions(+), 15 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 19/20] block: move hctx cpuhp add/del out of queue freezing
2025-04-24 15:21 ` [PATCH V3 19/20] block: move hctx cpuhp add/del out of queue freezing Ming Lei
@ 2025-04-25 6:49 ` Hannes Reinecke
0 siblings, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25 6:49 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig
On 4/24/25 17:21, Ming Lei wrote:
> Move hctx cpuhp add/del out of queue freezing for not connecting freeze
> lock with cpuhp locks, then lockdep warning can be avoided.
>
> This way is safe because both needn't queue to be frozen and scheduler
> switch isn't allowed.
>
> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> block/blk-mq.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e08eda094ae7..add653f84797 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4964,7 +4964,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> fallback:
> blk_mq_update_queue_map(set);
> list_for_each_entry(q, &set->tag_list, tag_set_list) {
> - blk_mq_realloc_hw_ctxs(set, q);
> + __blk_mq_realloc_hw_ctxs(set, q);
>
> if (q->nr_hw_queues != set->nr_hw_queues) {
> int i = prev_nr_hw_queues;
> @@ -4988,6 +4988,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> list_for_each_entry(q, &set->tag_list, tag_set_list) {
> blk_mq_sysfs_register_hctxs(q);
> blk_mq_debugfs_register_hctxs(q);
> +
> + blk_mq_remove_hw_queues_cpuhp(q);
> + blk_mq_add_hw_queues_cpuhp(q);
> }
> memalloc_noio_restore(memflags);
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 04/20] block: use q->elevator with ->elevator_lock held in elv_iosched_show()
2025-04-24 15:21 ` [PATCH V3 04/20] block: use q->elevator with ->elevator_lock held in elv_iosched_show() Ming Lei
2025-04-25 6:08 ` Hannes Reinecke
@ 2025-04-25 10:53 ` Nilay Shroff
2025-04-25 14:29 ` Christoph Hellwig
2 siblings, 0 replies; 71+ messages in thread
From: Nilay Shroff @ 2025-04-25 10:53 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig
On 4/24/25 8:51 PM, Ming Lei wrote:
> Use q->elevator with ->elevator_lock held in elv_iosched_show(), since
> the local cached elevator reference may become stale after getting
> ->elevator_lock.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
Looks good to me:
Reviewed-by : Nilay Shroff <nilay@linux.ibm.com>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues
2025-04-24 15:21 ` [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues Ming Lei
2025-04-25 6:33 ` Hannes Reinecke
@ 2025-04-25 11:37 ` Nilay Shroff
2025-04-25 14:33 ` Christoph Hellwig
2025-04-25 16:46 ` kernel test robot
3 siblings, 0 replies; 71+ messages in thread
From: Nilay Shroff @ 2025-04-25 11:37 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig
On 4/24/25 8:51 PM, Ming Lei wrote:
> Both adding/deleting disk code are reader of `nr_hw_queues`, so we can't
> allow them in-progress when updating nr_hw_queues, kernel panic and
> kasan has been reported in [1].
>
> Prevent adding/deleting disk during updating nr_hw_queues by adding
> rw_semaphore to tagset, write lock is grabbed in blk_mq_update_nr_hw_queues(),
> and read lock is acquired when adding/deleting disk.
>
> Also mark GFP_NOIO allocation scope for adding/deleting disk because
> blk_mq_update_nr_hw_queues() is part of some driver's error handler.
>
> This way avoids lot of trouble.
>
> Suggested-by: Nilay Shroff <nilay@linux.ibm.com>
> Reported-by: Nilay Shroff <nilay@linux.ibm.com>
> Closes: https://lore.kernel.org/linux-block/a5896cdb-a59a-4a37-9f99-20522f5d2987@linux.ibm.com/
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
Looks good to me:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress
2025-04-24 15:21 ` [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress Ming Lei
2025-04-25 6:33 ` Hannes Reinecke
@ 2025-04-25 12:48 ` Nilay Shroff
2025-04-27 2:27 ` Ming Lei
1 sibling, 1 reply; 71+ messages in thread
From: Nilay Shroff @ 2025-04-25 12:48 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig
On 4/24/25 8:51 PM, Ming Lei wrote:
> Elevator switch code is another `nr_hw_queue` reader in non-fast-IO code
> path, so it can't be done if updating `nr_hw_queues` is in-progress.
>
> Take same approach with not allowing add/del disk when updating
> nr_hw_queues is in-progress, by grabbing read lock of
> set->update_nr_hwq_sema.
>
> Take the nested variant for avoiding the following false positive
> splat[1], and this way is correct because:
>
> - the read lock in elv_iosched_store() is not overlapped with the read lock
> in adding/deleting disk:
>
> - kobject attribute is only available after the kobject is added and
> before it is deleted
>
> -> #4 (&q->q_usage_counter(queue){++++}-{0:0}:
> -> #3 (&q->limits_lock){+.+.}-{4:4}:
> -> #2 (&disk->open_mutex){+.+.}-{4:4}:
> -> #1 (&set->update_nr_hwq_lock){.+.+}-{4:4}:
> -> #0 (kn->active#103){++++}-{0:0}:
>
> Link: https://lore.kernel.org/linux-block/aAWv3NPtNIKKvJZc@fedora/ [1]
> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> block/elevator.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/block/elevator.c b/block/elevator.c
> index 4400eb8fe54f..56da6ab7691a 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -723,6 +723,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
> int ret;
> unsigned int memflags;
> struct request_queue *q = disk->queue;
> + struct blk_mq_tag_set *set = q->tag_set;
>
> /*
> * If the attribute needs to load a module, do it before freezing the
> @@ -734,6 +735,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>
> elv_iosched_load_module(name);
>
> + down_read_nested(&set->update_nr_hwq_sema, 1);
Why do we need to add nested read lock here? The lockdep splat[1] which
you reported earlier is possibly due to the same reader lock being acquired
recursively in elv_iosched_store and then elevator_change?
On another note, if we suspect possible one-depth recursion for the same
class of lock then then we should use SINGLE_DEPTH_NESTING (instead of using
1 here) for subclass. But still I am not clear why this lock needs nesting.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait
2025-04-24 15:21 ` [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait Ming Lei
2025-04-25 6:35 ` Hannes Reinecke
@ 2025-04-25 12:50 ` Nilay Shroff
2025-04-25 14:34 ` Christoph Hellwig
2 siblings, 0 replies; 71+ messages in thread
From: Nilay Shroff @ 2025-04-25 12:50 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig
On 4/24/25 8:51 PM, Ming Lei wrote:
> Move blk_unregister_queue() & device_del() after freeze wait, and prepare
> for unifying elevator switch.
>
> This way is just fine, since bdev has been unhashed at the beginning of
> del_gendisk(), both blk_unregister_queue() & device_del() are dealing
> with kobject & debugfs thing only.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
Looks good to me:
Reviewed-by : Nilay Shroff <nilay@linux.ibm.com>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 11/20] block: move queue freezing & elevator_lock into elevator_change()
2025-04-24 15:21 ` [PATCH V3 11/20] block: move queue freezing & elevator_lock into elevator_change() Ming Lei
2025-04-25 6:36 ` Hannes Reinecke
@ 2025-04-25 12:54 ` Nilay Shroff
2025-04-25 14:35 ` Christoph Hellwig
2 siblings, 0 replies; 71+ messages in thread
From: Nilay Shroff @ 2025-04-25 12:54 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig
On 4/24/25 8:51 PM, Ming Lei wrote:
> Move queue freezing & elevator_lock into elevator_change(), and prepare
> for using elevator_change() for setting up & tearing down default elevator
> too.
>
> Also add lockdep_assert_held() in __elevator_change() because either
> read or write lock is required for changing elevator.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
Looks good to me:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 13/20] block: unifying elevator change
2025-04-24 15:21 ` [PATCH V3 13/20] block: " Ming Lei
2025-04-25 6:39 ` Hannes Reinecke
@ 2025-04-25 13:03 ` Nilay Shroff
2025-04-30 0:46 ` Ming Lei
2025-04-25 18:30 ` Christoph Hellwig
2 siblings, 1 reply; 71+ messages in thread
From: Nilay Shroff @ 2025-04-25 13:03 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig
On 4/24/25 8:51 PM, Ming Lei wrote:
> @@ -564,12 +558,7 @@ static int __add_disk_fwnode(struct device *parent, struct gendisk *disk,
> out_free_ext_minor:
> if (disk->major == BLOCK_EXT_MAJOR)
> blk_free_ext_minor(disk->first_minor);
> -out_exit_elevator:
> - if (disk->queue->elevator) {
> - mutex_lock(&disk->queue->elevator_lock);
> - elevator_exit(disk->queue);
> - mutex_unlock(&disk->queue->elevator_lock);
> - }
After restructuring __add_disk_fwnode() we may still need elevator_exit()
in case when blk_register_queue is successful and so we should have set
q->elevator to default. Now if after queue is registered successfully,
for instance, bdi_register fails then in that case we still have to exit
elevator to free its resources, isn't it?
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 16/20] block: move elv_register[unregister]_queue out of elevator_lock
2025-04-24 15:21 ` [PATCH V3 16/20] block: move elv_register[unregister]_queue out of elevator_lock Ming Lei
2025-04-25 6:46 ` Hannes Reinecke
@ 2025-04-25 13:05 ` Nilay Shroff
2025-04-25 18:37 ` Christoph Hellwig
2 siblings, 0 replies; 71+ messages in thread
From: Nilay Shroff @ 2025-04-25 13:05 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig
On 4/24/25 8:51 PM, Ming Lei wrote:
> Move elv_register[unregister]_queue out of ->elevator_lock & queue freezing,
> so we can kill many lockdep warnings.
>
> elv_register[unregister]_queue() is serialized, and just dealing with sysfs/
> debugfs things, no need to be done with queue frozen.
>
> With this change, elevator's ->exit() is called before calling
> elv_unregister_queue, then user may call into ->show()/store() of elevator's
> sysfs attributes, and we have covered this issue by adding `ELEVATOR_FLAG_DYNG`.
>
> For blk-mq debugfs, hctx->sched_tags is always checked with ->elevator_lock by
> debugfs code, meantime hctx->sched_tags is updated with ->elevator_lock, so
> there isn't such issue.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
Looks good to me:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 20/20] block: move wbt_enable_default() out of queue freezing from sched ->exit()
2025-04-24 15:21 ` [PATCH V3 20/20] block: move wbt_enable_default() out of queue freezing from sched ->exit() Ming Lei
@ 2025-04-25 13:10 ` Nilay Shroff
2025-04-29 10:59 ` Nilay Shroff
1 sibling, 0 replies; 71+ messages in thread
From: Nilay Shroff @ 2025-04-25 13:10 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig
On 4/24/25 8:51 PM, Ming Lei wrote:
> scheduler's ->exit() is called with queue frozen and elevator lock is held, and
> wbt_enable_default() can't be called with queue frozen, otherwise the
> following lockdep warning is triggered:
>
> #6 (&q->rq_qos_mutex){+.+.}-{4:4}:
> #5 (&eq->sysfs_lock){+.+.}-{4:4}:
> #4 (&q->elevator_lock){+.+.}-{4:4}:
> #3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
> #2 (fs_reclaim){+.+.}-{0:0}:
> #1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
> #0 (&q->debugfs_mutex){+.+.}-{4:4}:
>
> Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
> call it from elevator_change_done().
>
> Meantime add disk->rqos_state_mutex for covering wbt state change, which
> matches the purpose more than ->elevator_lock.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
Looks good to me:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 02/20] block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag
2025-04-24 15:21 ` [PATCH V3 02/20] block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag Ming Lei
@ 2025-04-25 14:29 ` Christoph Hellwig
0 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 14:29 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig, Hannes Reinecke
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 03/20] block: don't call freeze queue in elevator_switch() and elevator_disable()
2025-04-24 15:21 ` [PATCH V3 03/20] block: don't call freeze queue in elevator_switch() and elevator_disable() Ming Lei
@ 2025-04-25 14:29 ` Christoph Hellwig
0 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 14:29 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig, Hannes Reinecke
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 04/20] block: use q->elevator with ->elevator_lock held in elv_iosched_show()
2025-04-24 15:21 ` [PATCH V3 04/20] block: use q->elevator with ->elevator_lock held in elv_iosched_show() Ming Lei
2025-04-25 6:08 ` Hannes Reinecke
2025-04-25 10:53 ` Nilay Shroff
@ 2025-04-25 14:29 ` Christoph Hellwig
2 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 14:29 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues
2025-04-24 15:21 ` [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues Ming Lei
2025-04-25 6:33 ` Hannes Reinecke
2025-04-25 11:37 ` Nilay Shroff
@ 2025-04-25 14:33 ` Christoph Hellwig
2025-04-25 16:46 ` kernel test robot
3 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 14:33 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Thu, Apr 24, 2025 at 11:21:30PM +0800, Ming Lei wrote:
> + init_rwsem(&set->update_nr_hwq_sema);
Can you please call this update_nr_hwq_lock instead? _sema is a very
unusual naming for a rw semaphore.
> * This function registers the partitioning information in @disk
> * with the kernel. Also attach a fwnode to the disk device.
> */
> -int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
> - const struct attribute_group **groups,
> - struct fwnode_handle *fwnode)
> +static int __add_disk_fwnode(struct device *parent, struct gendisk *disk,
> + const struct attribute_group **groups,
> + struct fwnode_handle *fwnode)
once this becomes internal there is no need for the _fwnode postfix and
this can simply become __add_disk.
This probably wants a lockdep assert that te nr_hwq lock is held for the
blk-mq case.
Also when you use two-tab indentation for the arguments you don't need to
reformat for every little naming change.
Otherwise looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait
2025-04-24 15:21 ` [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait Ming Lei
2025-04-25 6:35 ` Hannes Reinecke
2025-04-25 12:50 ` Nilay Shroff
@ 2025-04-25 14:34 ` Christoph Hellwig
2025-04-28 11:51 ` Ming Lei
2 siblings, 1 reply; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 14:34 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Thu, Apr 24, 2025 at 11:21:33PM +0800, Ming Lei wrote:
> Move blk_unregister_queue() & device_del() after freeze wait, and prepare
> for unifying elevator switch.
>
> This way is just fine, since bdev has been unhashed at the beginning of
> del_gendisk(), both blk_unregister_queue() & device_del() are dealing
> with kobject & debugfs thing only.
Can you please add a reference to the previous attempt here and that
it was backed out again and why we think it's fine this time around?
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 11/20] block: move queue freezing & elevator_lock into elevator_change()
2025-04-24 15:21 ` [PATCH V3 11/20] block: move queue freezing & elevator_lock into elevator_change() Ming Lei
2025-04-25 6:36 ` Hannes Reinecke
2025-04-25 12:54 ` Nilay Shroff
@ 2025-04-25 14:35 ` Christoph Hellwig
2 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 14:35 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues
2025-04-24 15:21 ` [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues Ming Lei
` (2 preceding siblings ...)
2025-04-25 14:33 ` Christoph Hellwig
@ 2025-04-25 16:46 ` kernel test robot
3 siblings, 0 replies; 71+ messages in thread
From: kernel test robot @ 2025-04-25 16:46 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: oe-kbuild-all, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig, Ming Lei
Hi Ming,
kernel test robot noticed the following build warnings:
[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on linus/master v6.15-rc3 next-20250424]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Ming-Lei/block-move-blk_mq_add_queue_tag_set-after-blk_mq_map_swqueue/20250424-232508
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/r/20250424152148.1066220-8-ming.lei%40redhat.com
patch subject: [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues
config: i386-buildonly-randconfig-004-20250425 (https://download.01.org/0day-ci/archive/20250426/202504260007.SFrJbSQ8-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250426/202504260007.SFrJbSQ8-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202504260007.SFrJbSQ8-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> block/genhd.c:406: warning: expecting prototype for add_disk_fwnode(). Prototype was for __add_disk_fwnode() instead
vim +406 block/genhd.c
9301fe73438499 block/genhd.c Christoph Hellwig 2020-09-21 391
^1da177e4c3f41 drivers/block/genhd.c Linus Torvalds 2005-04-16 392 /**
9dfd9ea93aeab5 block/genhd.c Christian Marangi 2024-10-03 393 * add_disk_fwnode - add disk information to kernel list with fwnode
e63a46bef01ff3 block/genhd.c Dan Williams 2016-06-15 394 * @parent: parent device for the disk
^1da177e4c3f41 drivers/block/genhd.c Linus Torvalds 2005-04-16 395 * @disk: per-device partitioning information
fef912bf860e8e block/genhd.c Hannes Reinecke 2018-09-28 396 * @groups: Additional per-device sysfs groups
9dfd9ea93aeab5 block/genhd.c Christian Marangi 2024-10-03 397 * @fwnode: attached disk fwnode
^1da177e4c3f41 drivers/block/genhd.c Linus Torvalds 2005-04-16 398 *
^1da177e4c3f41 drivers/block/genhd.c Linus Torvalds 2005-04-16 399 * This function registers the partitioning information in @disk
9dfd9ea93aeab5 block/genhd.c Christian Marangi 2024-10-03 400 * with the kernel. Also attach a fwnode to the disk device.
^1da177e4c3f41 drivers/block/genhd.c Linus Torvalds 2005-04-16 401 */
54650b49b7412c block/genhd.c Ming Lei 2025-04-24 402 static int __add_disk_fwnode(struct device *parent, struct gendisk *disk,
9dfd9ea93aeab5 block/genhd.c Christian Marangi 2024-10-03 403 const struct attribute_group **groups,
9dfd9ea93aeab5 block/genhd.c Christian Marangi 2024-10-03 404 struct fwnode_handle *fwnode)
d1254a8749711e block/genhd.c Christoph Hellwig 2021-08-04 405
^1da177e4c3f41 drivers/block/genhd.c Linus Torvalds 2005-04-16 @406 {
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 407 struct device *ddev = disk_to_dev(disk);
7c3f828b522b07 block/genhd.c Christoph Hellwig 2021-05-21 408 int ret;
cf0ca9fe5dd9e3 block/genhd.c Peter Zijlstra 2008-04-30 409
3d9a9e9a77c5eb block/genhd.c Ming Lei 2025-01-15 410 if (WARN_ON_ONCE(bdev_nr_sectors(disk->part0) > BLK_DEV_MAX_SECTORS))
3d9a9e9a77c5eb block/genhd.c Ming Lei 2025-01-15 411 return -EINVAL;
3d9a9e9a77c5eb block/genhd.c Ming Lei 2025-01-15 412
6783811569aef2 block/genhd.c Christoph Hellwig 2025-01-06 413 if (queue_is_mq(disk->queue)) {
6783811569aef2 block/genhd.c Christoph Hellwig 2025-01-06 414 /*
6783811569aef2 block/genhd.c Christoph Hellwig 2025-01-06 415 * ->submit_bio and ->poll_bio are bypassed for blk-mq drivers.
6783811569aef2 block/genhd.c Christoph Hellwig 2025-01-06 416 */
6783811569aef2 block/genhd.c Christoph Hellwig 2025-01-06 417 if (disk->fops->submit_bio || disk->fops->poll_bio)
69fe0f29892077 block/genhd.c Ming Lei 2022-03-04 418 return -EINVAL;
69fe0f29892077 block/genhd.c Ming Lei 2022-03-04 419
737eb78e82d52d block/genhd.c Damien Le Moal 2019-09-05 420 /*
6783811569aef2 block/genhd.c Christoph Hellwig 2025-01-06 421 * Initialize the I/O scheduler code and pick a default one if
6783811569aef2 block/genhd.c Christoph Hellwig 2025-01-06 422 * needed.
737eb78e82d52d block/genhd.c Damien Le Moal 2019-09-05 423 */
737eb78e82d52d block/genhd.c Damien Le Moal 2019-09-05 424 elevator_init_mq(disk->queue);
6783811569aef2 block/genhd.c Christoph Hellwig 2025-01-06 425 } else {
6783811569aef2 block/genhd.c Christoph Hellwig 2025-01-06 426 if (!disk->fops->submit_bio)
6783811569aef2 block/genhd.c Christoph Hellwig 2025-01-06 427 return -EINVAL;
ac2b6f9dee8f41 block/genhd.c Al Viro 2024-04-12 428 bdev_set_flag(disk->part0, BD_HAS_SUBMIT_BIO);
6783811569aef2 block/genhd.c Christoph Hellwig 2025-01-06 429 }
9f4107b07b17b5 block/genhd.c Jens Axboe 2023-04-14 430
7c3f828b522b07 block/genhd.c Christoph Hellwig 2021-05-21 431 /*
7c3f828b522b07 block/genhd.c Christoph Hellwig 2021-05-21 432 * If the driver provides an explicit major number it also must provide
7c3f828b522b07 block/genhd.c Christoph Hellwig 2021-05-21 433 * the number of minors numbers supported, and those will be used to
7c3f828b522b07 block/genhd.c Christoph Hellwig 2021-05-21 434 * setup the gendisk.
7c3f828b522b07 block/genhd.c Christoph Hellwig 2021-05-21 435 * Otherwise just allocate the device numbers for both the whole device
7c3f828b522b07 block/genhd.c Christoph Hellwig 2021-05-21 436 * and all partitions from the extended dev_t space.
3e1a7ff8a0a7b9 block/genhd.c Tejun Heo 2008-08-25 437 */
02341a08c9dec5 block/genhd.c Yu Kuai 2022-10-22 438 ret = -EINVAL;
7c3f828b522b07 block/genhd.c Christoph Hellwig 2021-05-21 439 if (disk->major) {
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 440 if (WARN_ON(!disk->minors))
02341a08c9dec5 block/genhd.c Yu Kuai 2022-10-22 441 goto out_exit_elevator;
2e3c73fa0c419f block/genhd.c Christoph Hellwig 2021-05-21 442
2e3c73fa0c419f block/genhd.c Christoph Hellwig 2021-05-21 443 if (disk->minors > DISK_MAX_PARTS) {
2e3c73fa0c419f block/genhd.c Christoph Hellwig 2021-05-21 444 pr_err("block: can't allocate more than %d partitions\n",
2e3c73fa0c419f block/genhd.c Christoph Hellwig 2021-05-21 445 DISK_MAX_PARTS);
2e3c73fa0c419f block/genhd.c Christoph Hellwig 2021-05-21 446 disk->minors = DISK_MAX_PARTS;
2e3c73fa0c419f block/genhd.c Christoph Hellwig 2021-05-21 447 }
4c434392c47778 block/genhd.c Li Nan 2023-12-19 448 if (disk->first_minor > MINORMASK ||
4c434392c47778 block/genhd.c Li Nan 2023-12-19 449 disk->minors > MINORMASK + 1 ||
4c434392c47778 block/genhd.c Li Nan 2023-12-19 450 disk->first_minor + disk->minors > MINORMASK + 1)
02341a08c9dec5 block/genhd.c Yu Kuai 2022-10-22 451 goto out_exit_elevator;
7c3f828b522b07 block/genhd.c Christoph Hellwig 2021-05-21 452 } else {
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 453 if (WARN_ON(disk->minors))
02341a08c9dec5 block/genhd.c Yu Kuai 2022-10-22 454 goto out_exit_elevator;
3e1a7ff8a0a7b9 block/genhd.c Tejun Heo 2008-08-25 455
7c3f828b522b07 block/genhd.c Christoph Hellwig 2021-05-21 456 ret = blk_alloc_ext_minor();
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 457 if (ret < 0)
02341a08c9dec5 block/genhd.c Yu Kuai 2022-10-22 458 goto out_exit_elevator;
7c3f828b522b07 block/genhd.c Christoph Hellwig 2021-05-21 459 disk->major = BLOCK_EXT_MAJOR;
539711d7d6fe38 block/genhd.c Christoph Hellwig 2021-08-24 460 disk->first_minor = ret;
7c3f828b522b07 block/genhd.c Christoph Hellwig 2021-05-21 461 }
7c3f828b522b07 block/genhd.c Christoph Hellwig 2021-05-21 462
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 463 /* delay uevents, until we scanned partition table */
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 464 dev_set_uevent_suppress(ddev, 1);
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 465
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 466 ddev->parent = parent;
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 467 ddev->groups = groups;
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 468 dev_set_name(ddev, "%s", disk->disk_name);
9dfd9ea93aeab5 block/genhd.c Christian Marangi 2024-10-03 469 if (fwnode)
9dfd9ea93aeab5 block/genhd.c Christian Marangi 2024-10-03 470 device_set_node(ddev, fwnode);
8235b5c1e8c1c0 block/genhd.c Christoph Hellwig 2021-08-18 471 if (!(disk->flags & GENHD_FL_HIDDEN))
8235b5c1e8c1c0 block/genhd.c Christoph Hellwig 2021-08-18 472 ddev->devt = MKDEV(disk->major, disk->first_minor);
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 473 ret = device_add(ddev);
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 474 if (ret)
99d8690aae4b2f block/genhd.c Christoph Hellwig 2021-12-21 475 goto out_free_ext_minor;
99d8690aae4b2f block/genhd.c Christoph Hellwig 2021-12-21 476
99d8690aae4b2f block/genhd.c Christoph Hellwig 2021-12-21 477 ret = disk_alloc_events(disk);
99d8690aae4b2f block/genhd.c Christoph Hellwig 2021-12-21 478 if (ret)
99d8690aae4b2f block/genhd.c Christoph Hellwig 2021-12-21 479 goto out_device_del;
99d8690aae4b2f block/genhd.c Christoph Hellwig 2021-12-21 480
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 481 ret = sysfs_create_link(block_depr, &ddev->kobj,
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 482 kobject_name(&ddev->kobj));
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 483 if (ret)
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 484 goto out_device_del;
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 485
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 486 /*
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 487 * avoid probable deadlock caused by allocating memory with
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 488 * GFP_KERNEL in runtime_resume callback of its all ancestor
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 489 * devices
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 490 */
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 491 pm_runtime_set_memalloc_noio(ddev, true);
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 492
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 493 disk->part0->bd_holder_dir =
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 494 kobject_create_and_add("holders", &ddev->kobj);
fe7d064fa3faec block/genhd.c Luis Chamberlain 2021-11-03 495 if (!disk->part0->bd_holder_dir) {
fe7d064fa3faec block/genhd.c Luis Chamberlain 2021-11-03 496 ret = -ENOMEM;
ff53cd52d9bdbf block/genhd.c Thomas Weißschuh 2023-03-18 497 goto out_del_block_link;
fe7d064fa3faec block/genhd.c Luis Chamberlain 2021-11-03 498 }
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 499 disk->slave_dir = kobject_create_and_add("slaves", &ddev->kobj);
fe7d064fa3faec block/genhd.c Luis Chamberlain 2021-11-03 500 if (!disk->slave_dir) {
fe7d064fa3faec block/genhd.c Luis Chamberlain 2021-11-03 501 ret = -ENOMEM;
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 502 goto out_put_holder_dir;
fe7d064fa3faec block/genhd.c Luis Chamberlain 2021-11-03 503 }
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 504
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 505 ret = blk_register_queue(disk);
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 506 if (ret)
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 507 goto out_put_slave_dir;
75f4dca59694df block/genhd.c Christoph Hellwig 2021-08-18 508
9f18db572c97bc block/genhd.c Christoph Hellwig 2021-11-22 509 if (!(disk->flags & GENHD_FL_HIDDEN)) {
8235b5c1e8c1c0 block/genhd.c Christoph Hellwig 2021-08-18 510 ret = bdi_register(disk->bdi, "%u:%u",
8235b5c1e8c1c0 block/genhd.c Christoph Hellwig 2021-08-18 511 disk->major, disk->first_minor);
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 512 if (ret)
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 513 goto out_unregister_queue;
8235b5c1e8c1c0 block/genhd.c Christoph Hellwig 2021-08-18 514 bdi_set_owner(disk->bdi, ddev);
9d5ee6767c8576 block/genhd.c Christoph Hellwig 2021-08-18 515 ret = sysfs_create_link(&ddev->kobj,
9d5ee6767c8576 block/genhd.c Christoph Hellwig 2021-08-18 516 &disk->bdi->dev->kobj, "bdi");
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 517 if (ret)
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 518 goto out_unregister_bdi;
8235b5c1e8c1c0 block/genhd.c Christoph Hellwig 2021-08-18 519
e5cfefa97bccf9 block/genhd.c Yu Kuai 2023-02-17 520 /* Make sure the first partition scan will be proceed */
140ce28dd3bee8 block/genhd.c Christoph Hellwig 2024-05-02 521 if (get_capacity(disk) && disk_has_partscan(disk))
e5cfefa97bccf9 block/genhd.c Yu Kuai 2023-02-17 522 set_bit(GD_NEED_PART_SCAN, &disk->state);
e5cfefa97bccf9 block/genhd.c Yu Kuai 2023-02-17 523
9d5ee6767c8576 block/genhd.c Christoph Hellwig 2021-08-18 524 bdev_add(disk->part0, ddev->devt);
e16e506ccd673a block/genhd.c Christoph Hellwig 2021-11-22 525 if (get_capacity(disk))
05bdb9965305bb block/genhd.c Christoph Hellwig 2023-06-08 526 disk_scan_partitions(disk, BLK_OPEN_READ);
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 527
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 528 /*
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 529 * Announce the disk and partitions after all partitions are
8235b5c1e8c1c0 block/genhd.c Christoph Hellwig 2021-08-18 530 * created. (for hidden disks uevents remain suppressed forever)
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 531 */
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 532 dev_set_uevent_suppress(ddev, 0);
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 533 disk_uevent(disk, KOBJ_ADD);
a0a6314ae774f8 block/genhd.c Christoph Hellwig 2022-10-10 534 } else {
a0a6314ae774f8 block/genhd.c Christoph Hellwig 2022-10-10 535 /*
a0a6314ae774f8 block/genhd.c Christoph Hellwig 2022-10-10 536 * Even if the block_device for a hidden gendisk is not
a0a6314ae774f8 block/genhd.c Christoph Hellwig 2022-10-10 537 * registered, it needs to have a valid bd_dev so that the
a0a6314ae774f8 block/genhd.c Christoph Hellwig 2022-10-10 538 * freeing of the dynamic major works.
a0a6314ae774f8 block/genhd.c Christoph Hellwig 2022-10-10 539 */
a0a6314ae774f8 block/genhd.c Christoph Hellwig 2022-10-10 540 disk->part0->bd_dev = MKDEV(disk->major, disk->first_minor);
8ddcd653257c18 block/genhd.c Christoph Hellwig 2017-11-02 541 }
52b85909f85d06 block/genhd.c Christoph Hellwig 2021-08-18 542
73781b3b81e765 block/genhd.c Christoph Hellwig 2024-06-26 543 blk_apply_bdi_limits(disk->bdi, &disk->queue->limits);
77ea887e433ad8 block/genhd.c Tejun Heo 2010-12-08 544 disk_add_events(disk);
76792055c4c8b2 block/genhd.c Christoph Hellwig 2022-02-15 545 set_bit(GD_ADDED, &disk->state);
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 546 return 0;
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 547
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 548 out_unregister_bdi:
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 549 if (!(disk->flags & GENHD_FL_HIDDEN))
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 550 bdi_unregister(disk->bdi);
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 551 out_unregister_queue:
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 552 blk_unregister_queue(disk);
fa81cbafbf5764 block/genhd.c Chen Zhongjin 2022-10-29 553 rq_qos_exit(disk->queue);
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 554 out_put_slave_dir:
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 555 kobject_put(disk->slave_dir);
d90db3b1c8676b block/genhd.c Christoph Hellwig 2022-11-15 556 disk->slave_dir = NULL;
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 557 out_put_holder_dir:
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 558 kobject_put(disk->part0->bd_holder_dir);
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 559 out_del_block_link:
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 560 sysfs_remove_link(block_depr, dev_name(ddev));
5fa3d1a00c2d4b block/genhd.c Li Nan 2023-12-11 561 pm_runtime_set_memalloc_noio(ddev, false);
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 562 out_device_del:
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 563 device_del(ddev);
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 564 out_free_ext_minor:
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 565 if (disk->major == BLOCK_EXT_MAJOR)
83cbce9574462c block/genhd.c Luis Chamberlain 2021-08-18 566 blk_free_ext_minor(disk->first_minor);
02341a08c9dec5 block/genhd.c Yu Kuai 2022-10-22 567 out_exit_elevator:
1bf70d08cc3b55 block/genhd.c Nilay Shroff 2025-03-04 568 if (disk->queue->elevator) {
1bf70d08cc3b55 block/genhd.c Nilay Shroff 2025-03-04 569 mutex_lock(&disk->queue->elevator_lock);
02341a08c9dec5 block/genhd.c Yu Kuai 2022-10-22 570 elevator_exit(disk->queue);
1bf70d08cc3b55 block/genhd.c Nilay Shroff 2025-03-04 571 mutex_unlock(&disk->queue->elevator_lock);
1bf70d08cc3b55 block/genhd.c Nilay Shroff 2025-03-04 572 }
278167fd2f8ffe block/genhd.c Luis Chamberlain 2021-11-09 573 return ret;
^1da177e4c3f41 drivers/block/genhd.c Linus Torvalds 2005-04-16 574 }
54650b49b7412c block/genhd.c Ming Lei 2025-04-24 575
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 09/20] block: simplify elevator reattachment for updating nr_hw_queues
2025-04-24 15:21 ` [PATCH V3 09/20] block: simplify elevator reattachment for updating nr_hw_queues Ming Lei
2025-04-25 6:34 ` Hannes Reinecke
@ 2025-04-25 18:12 ` Christoph Hellwig
2025-04-29 9:51 ` Ming Lei
1 sibling, 1 reply; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 18:12 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
[-- Attachment #1: Type: text/plain, Size: 448 bytes --]
On Thu, Apr 24, 2025 at 11:21:32PM +0800, Ming Lei wrote:
> +static int __elevator_change(struct request_queue *q,
> + const char *elevator_name)
There's still not good reason for this helper.
I'd suggest you add the two first attached patches before this one
(it'll need a bi of reabsing as all of them are after your entire
sweries right now) and then fold the third one into this, which will
give us less code and a cleaner interface.
[-- Attachment #2: 0001-block-look-up-the-elevator-type-in-elevator_switch.patch --]
[-- Type: text/x-patch, Size: 2036 bytes --]
From 44da16a97ef758d1d9fe7c54c8360388f585d0c5 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Fri, 25 Apr 2025 07:01:39 -0700
Subject: block: look up the elevator type in elevator_switch
That makes the function nicely self-contained and can be used
to avoid code duplication.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
block/elevator.c | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)
diff --git a/block/elevator.c b/block/elevator.c
index a637426da56d..773b8931d874 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -591,14 +591,18 @@ static bool use_default_elevator(struct request_queue *q)
* If switching fails, we are most likely running out of memory and not able
* to restore the old io scheduler, so leaving the io scheduler being none.
*/
-static int elevator_switch(struct request_queue *q, struct elevator_type *new_e,
- struct elv_change_ctx *ctx)
+static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
{
+ struct elevator_type *new_e;
int ret;
WARN_ON_ONCE(q->mq_freeze_depth == 0);
lockdep_assert_held(&q->elevator_lock);
+ new_e = elevator_find_get(ctx->name);
+ if (!new_e)
+ return -EINVAL;
+
blk_mq_quiesce_queue(q);
if (q->elevator) {
@@ -621,6 +625,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e,
new_e->elevator_name);
}
+ elevator_put(new_e);
return ret;
}
@@ -686,8 +691,6 @@ static int __elevator_change(struct request_queue *q,
struct elv_change_ctx *ctx)
{
const char *elevator_name = ctx->name;
- struct elevator_type *e;
- int ret;
lockdep_assert_held(&q->tag_set->update_nr_hwq_sema);
@@ -701,12 +704,7 @@ static int __elevator_change(struct request_queue *q,
return 0;
}
- e = elevator_find_get(elevator_name);
- if (!e)
- return -EINVAL;
- ret = elevator_switch(q, e, ctx);
- elevator_put(e);
- return ret;
+ return elevator_switch(q, ctx);
}
static int elevator_change(struct request_queue *q,
--
2.47.2
[-- Attachment #3: 0002-block-fold-elevator_disable-into-elevator_switch.patch --]
[-- Type: text/x-patch, Size: 3109 bytes --]
From 1bfce1a308b9e46734ed56196b4e9fe31b5a0036 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Fri, 25 Apr 2025 07:10:35 -0700
Subject: block: fold elevator_disable into elevator_switch
This removes duplicate code, and keeps the callers tidy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
block/elevator.c | 59 ++++++++++++++++--------------------------------
1 file changed, 20 insertions(+), 39 deletions(-)
diff --git a/block/elevator.c b/block/elevator.c
index 773b8931d874..59ff0abde920 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -593,15 +593,17 @@ static bool use_default_elevator(struct request_queue *q)
*/
static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
{
- struct elevator_type *new_e;
- int ret;
+ struct elevator_type *new_e = NULL;
+ int ret = 0;
WARN_ON_ONCE(q->mq_freeze_depth == 0);
lockdep_assert_held(&q->elevator_lock);
- new_e = elevator_find_get(ctx->name);
- if (!new_e)
- return -EINVAL;
+ if (strncmp(ctx->name, "none", 4)) {
+ new_e = elevator_find_get(ctx->name);
+ if (!new_e)
+ return -EINVAL;
+ }
blk_mq_quiesce_queue(q);
@@ -610,12 +612,17 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
__elevator_exit(q);
}
- ret = blk_mq_init_sched(q, new_e);
- if (ret)
- goto out_unfreeze;
-
- ctx->new = q->elevator;
- blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+ if (new_e) {
+ ret = blk_mq_init_sched(q, new_e);
+ if (ret)
+ goto out_unfreeze;
+ ctx->new = q->elevator;
+ } else {
+ blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q);
+ q->elevator = NULL;
+ q->nr_requests = q->tag_set->queue_depth;
+ }
+ blk_add_trace_msg(q, "elv switch: %s", ctx->name);
out_unfreeze:
blk_mq_unquiesce_queue(q);
@@ -625,28 +632,11 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
new_e->elevator_name);
}
- elevator_put(new_e);
+ if (new_e)
+ elevator_put(new_e);
return ret;
}
-static void elevator_disable(struct request_queue *q,
- struct elv_change_ctx *ctx)
-{
- WARN_ON_ONCE(q->mq_freeze_depth == 0);
- lockdep_assert_held(&q->elevator_lock);
-
- blk_mq_quiesce_queue(q);
-
- ctx->old = q->elevator;
- __elevator_exit(q);
- blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q);
- q->elevator = NULL;
- q->nr_requests = q->tag_set->queue_depth;
- blk_add_trace_msg(q, "elv switch: none");
-
- blk_mq_unquiesce_queue(q);
-}
-
static void elv_exit_and_release(struct request_queue *q)
{
struct elevator_queue *e;
@@ -690,20 +680,11 @@ static int elevator_change_done(struct request_queue *q,
static int __elevator_change(struct request_queue *q,
struct elv_change_ctx *ctx)
{
- const char *elevator_name = ctx->name;
-
lockdep_assert_held(&q->tag_set->update_nr_hwq_sema);
/* Make sure queue is not in the middle of being removed */
if (!ctx->init && !blk_queue_registered(q))
return -ENOENT;
-
- if (!strncmp(elevator_name, "none", 4)) {
- if (q->elevator)
- elevator_disable(q, ctx);
- return 0;
- }
-
return elevator_switch(q, ctx);
}
--
2.47.2
[-- Attachment #4: 0003-block-remove-__elevator_change.patch --]
[-- Type: text/x-patch, Size: 2756 bytes --]
From dcda3f508e5f938cb27d4b743226ca4d8af75e28 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Fri, 25 Apr 2025 07:18:42 -0700
Subject: block: remove __elevator_change
Not much of a point in sharing code between callers with very different
expectations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
block/elevator.c | 45 ++++++++++++++++++++-------------------------
1 file changed, 20 insertions(+), 25 deletions(-)
diff --git a/block/elevator.c b/block/elevator.c
index 59ff0abde920..b358858387a0 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -674,25 +674,13 @@ static int elevator_change_done(struct request_queue *q,
return ret;
}
-/*
- * Switch this queue to the given IO scheduler.
- */
-static int __elevator_change(struct request_queue *q,
- struct elv_change_ctx *ctx)
-{
- lockdep_assert_held(&q->tag_set->update_nr_hwq_sema);
-
- /* Make sure queue is not in the middle of being removed */
- if (!ctx->init && !blk_queue_registered(q))
- return -ENOENT;
- return elevator_switch(q, ctx);
-}
-
static int elevator_change(struct request_queue *q,
struct elv_change_ctx *ctx)
{
unsigned int memflags;
- int ret = 0;
+ int ret;
+
+ lockdep_assert_held(&q->tag_set->update_nr_hwq_sema);
memflags = blk_mq_freeze_queue(q);
/*
@@ -706,14 +694,20 @@ static int elevator_change(struct request_queue *q,
*/
blk_mq_cancel_work_sync(q);
mutex_lock(&q->elevator_lock);
- if (!q->elevator || !elevator_match(q->elevator->type, ctx->name))
- ret = __elevator_change(q, ctx);
+ /* Make sure queue is not in the middle of being removed */
+ ret = -ENOENT;
+ if (!ctx->init && !blk_queue_registered(q))
+ goto out_unlock;
+ ret = 0;
+ if (q->elevator && elevator_match(q->elevator->type, ctx->name))
+ goto out_unlock;
+ ret = elevator_switch(q, ctx);
+out_unlock:
mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
- if (!ret)
- ret = elevator_change_done(q, ctx);
-
- return ret;
+ if (ret)
+ return ret;
+ return elevator_change_done(q, ctx);
}
/*
@@ -768,17 +762,18 @@ void elevator_set_none(struct request_queue *q)
void elv_update_nr_hw_queues(struct request_queue *q)
{
struct elv_change_ctx ctx = {
- .name = "none",
.uevent = true,
};
- int ret = -ENODEV;
+ int ret = -ENOENT;
+ lockdep_assert_held(&q->tag_set->update_nr_hwq_sema);
WARN_ON_ONCE(q->mq_freeze_depth == 0);
mutex_lock(&q->elevator_lock);
- if (q->elevator && !blk_queue_dying(q))
+ if (blk_queue_registered(q) && !blk_queue_dying(q) && q->elevator) {
ctx.name = q->elevator->type->elevator_name;
- ret = __elevator_change(q, &ctx);
+ ret = elevator_switch(q, &ctx);
+ }
mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue_nomemrestore(q);
--
2.47.2
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [PATCH V3 12/20] block: add `struct elv_change_ctx` for unifying elevator change
2025-04-24 15:21 ` [PATCH V3 12/20] block: add `struct elv_change_ctx` for unifying elevator change Ming Lei
2025-04-25 6:38 ` Hannes Reinecke
@ 2025-04-25 18:23 ` Christoph Hellwig
2025-04-29 15:45 ` Ming Lei
1 sibling, 1 reply; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 18:23 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Thu, Apr 24, 2025 at 11:21:35PM +0800, Ming Lei wrote:
> +struct elv_change_ctx {
> + const char *name;
> + bool uevent;
There's only one caller that wants to supress the uevents. So maybe
invert the polarity so that it only has to be set in one place,
which also documents how setting the initial scheduler is special a
bit better.
> - ret = elv_register_queue(q, true);
> + ret = elv_register_queue(q, ctx->uevent);
.. and pass the ctx on to elv_register_queue instead of converting
paramter types. Although that might be woeth doing later when
another argument derived from ctx gets passed as well.
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 13/20] block: unifying elevator change
2025-04-24 15:21 ` [PATCH V3 13/20] block: " Ming Lei
2025-04-25 6:39 ` Hannes Reinecke
2025-04-25 13:03 ` Nilay Shroff
@ 2025-04-25 18:30 ` Christoph Hellwig
2 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 18:30 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Thu, Apr 24, 2025 at 11:21:36PM +0800, Ming Lei wrote:
> elevator change is one well-define behavior:
Start the sentence captitalized.
> +static bool use_default_elevator(struct request_queue *q)
> {
> if (q->tag_set->flags & BLK_MQ_F_NO_SCHED_BY_DEFAULT)
> + return false;
>
> if (q->nr_hw_queues != 1 &&
> !blk_mq_is_shared_tags(q->tag_set->flags))
> + return false;
>
> + return true;
looking at the only caller I think this is better open coded there.
> @@ -694,7 +651,7 @@ static int __elevator_change(struct request_queue *q,
> lockdep_assert_held(&q->tag_set->update_nr_hwq_sema);
>
> /* Make sure queue is not in the middle of being removed */
> - if (!blk_queue_registered(q))
> + if (!ctx->init && !blk_queue_registered(q))
This init flag is confusingly also set by the teardown, so it is quite
misnamed. But given that none of the locks held here protected the
queue registered flag, why not just move the check out to
elv_iosched_store and remove the entire need for the flag?
> +/*
> + * Use the default elevator settings. If the chosen elevator initialization
> + * fails, fall back to the "none" elevator (no elevator).
> + */
> +void elevator_set_default(struct request_queue *q)
> +{
> + struct elv_change_ctx ctx = {
> + .init = true,
> + };
> + int err;
> +
> + if (!queue_is_mq(q))
> + return;
Please keep this in the caller. Same for the set_none side.
> +
> + ctx.name = use_default_elevator(q) ? "mq-deadline" : "none";
> + if (!q->elevator && !strcmp(ctx.name, "none"))
> + return;
> + err = elevator_change(q, &ctx);
q->elevator can't be set here. And the extra none strcmp is weird
because we have the boolean information avaiable.
Just make this:
struct elv_change_ctx ctx = {
.name = "mq-deadline",
};
...
if (!(q->tag_set->flags & BLK_MQ_F_NO_SCHED_BY_DEFAULT) &&
(q->nr_hw_queues == 1 ||
blk_mq_is_shared_tags(q->tag_set->flags))) {
err = elevator_change(q, &ctx);
> + if (err < 0)
> + pr_warn("\"%s\" set elevator failed %d, "
> + "falling back to \"none\"\n", ctx.name, err);
Please keep the more useful message from the current code.
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 15/20] block: fail to show/store elevator sysfs attribute if elevator is dying
2025-04-24 15:21 ` [PATCH V3 15/20] block: fail to show/store elevator sysfs attribute if elevator is dying Ming Lei
2025-04-25 6:45 ` Hannes Reinecke
@ 2025-04-25 18:36 ` Christoph Hellwig
2025-04-30 1:07 ` Ming Lei
1 sibling, 1 reply; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 18:36 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Thu, Apr 24, 2025 at 11:21:38PM +0800, Ming Lei wrote:
> + if (test_bit(ELEVATOR_FLAG_DYING, &e->flags))
> + error = -ENODEV;
> + else
> + error = e->type ? entry->show(e, page) : -ENOENT;
Weird style mix, I'd expand the check for ->type to a proper else
if here as well. But how can e->type actually be NULL here to start
with?
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 16/20] block: move elv_register[unregister]_queue out of elevator_lock
2025-04-24 15:21 ` [PATCH V3 16/20] block: move elv_register[unregister]_queue out of elevator_lock Ming Lei
2025-04-25 6:46 ` Hannes Reinecke
2025-04-25 13:05 ` Nilay Shroff
@ 2025-04-25 18:37 ` Christoph Hellwig
2 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 18:37 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Thu, Apr 24, 2025 at 11:21:39PM +0800, Ming Lei wrote:
> -static void elevator_exit(struct request_queue *q)
> +static void __elevator_exit(struct request_queue *q)
Why does elevator_exit grow a spurious __ prefix?
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 17/20] block: move debugfs/sysfs register out of freezing queue
2025-04-24 15:21 ` [PATCH V3 17/20] block: move debugfs/sysfs register out of freezing queue Ming Lei
2025-04-25 6:47 ` Hannes Reinecke
@ 2025-04-25 18:38 ` Christoph Hellwig
2025-04-28 11:28 ` Ming Lei
1 sibling, 1 reply; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 18:38 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Thu, Apr 24, 2025 at 11:21:40PM +0800, Ming Lei wrote:
> Move debugfs/sysfs register out of freezing queue in
> __blk_mq_update_nr_hw_queues(), so that the following lockdep dependency
> can be killed:
>
> #2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
> #1 (fs_reclaim){+.+.}-{0:0}:
> #0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs
>
> And registering/un-registering debugfs/sysfs does not require queue to be
> frozen.
Please explain the why it's not required. And talking about lockdep
dependencies is not very useful - it either is an actual lock ordering
issue or a fale positive, but lockdep is just the messenger.
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 18/20] block: remove several ->elevator_lock
2025-04-24 15:21 ` [PATCH V3 18/20] block: remove several ->elevator_lock Ming Lei
2025-04-25 6:48 ` Hannes Reinecke
@ 2025-04-25 18:38 ` Christoph Hellwig
1 sibling, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 18:38 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
This is missing a "critial section: or similar in the Subject line
for the grammar to make sense.
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress
2025-04-25 12:48 ` Nilay Shroff
@ 2025-04-27 2:27 ` Ming Lei
2025-04-28 16:17 ` Nilay Shroff
0 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2025-04-27 2:27 UTC (permalink / raw)
To: Nilay Shroff
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Fri, Apr 25, 2025 at 06:18:33PM +0530, Nilay Shroff wrote:
>
>
> On 4/24/25 8:51 PM, Ming Lei wrote:
> > Elevator switch code is another `nr_hw_queue` reader in non-fast-IO code
> > path, so it can't be done if updating `nr_hw_queues` is in-progress.
> >
> > Take same approach with not allowing add/del disk when updating
> > nr_hw_queues is in-progress, by grabbing read lock of
> > set->update_nr_hwq_sema.
> >
> > Take the nested variant for avoiding the following false positive
> > splat[1], and this way is correct because:
> >
> > - the read lock in elv_iosched_store() is not overlapped with the read lock
> > in adding/deleting disk:
> >
> > - kobject attribute is only available after the kobject is added and
> > before it is deleted
> >
> > -> #4 (&q->q_usage_counter(queue){++++}-{0:0}:
> > -> #3 (&q->limits_lock){+.+.}-{4:4}:
> > -> #2 (&disk->open_mutex){+.+.}-{4:4}:
> > -> #1 (&set->update_nr_hwq_lock){.+.+}-{4:4}:
> > -> #0 (kn->active#103){++++}-{0:0}:
> >
> > Link: https://lore.kernel.org/linux-block/aAWv3NPtNIKKvJZc@fedora/ [1]
> > Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> > block/elevator.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/block/elevator.c b/block/elevator.c
> > index 4400eb8fe54f..56da6ab7691a 100644
> > --- a/block/elevator.c
> > +++ b/block/elevator.c
> > @@ -723,6 +723,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
> > int ret;
> > unsigned int memflags;
> > struct request_queue *q = disk->queue;
> > + struct blk_mq_tag_set *set = q->tag_set;
> >
> > /*
> > * If the attribute needs to load a module, do it before freezing the
> > @@ -734,6 +735,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
> >
> > elv_iosched_load_module(name);
> >
> > + down_read_nested(&set->update_nr_hwq_sema, 1);
>
> Why do we need to add nested read lock here? The lockdep splat[1] which
> you reported earlier is possibly due to the same reader lock being acquired
> recursively in elv_iosched_store and then elevator_change?
The splat isn't related with the nested read lock.
If you replace down_read_nested() with down_read(), the same splat can be
triggered again when running `blktests block/001`.
>
> On another note, if we suspect possible one-depth recursion for the same
> class of lock then then we should use SINGLE_DEPTH_NESTING (instead of using
> 1 here) for subclass. But still I am not clear why this lock needs nesting.
It is just one false positive, because elv_iosched_store() won't happen
when adding disk.
Thanks,
Ming
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 17/20] block: move debugfs/sysfs register out of freezing queue
2025-04-25 18:38 ` Christoph Hellwig
@ 2025-04-28 11:28 ` Ming Lei
0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-28 11:28 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström
On Fri, Apr 25, 2025 at 08:38:01PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 24, 2025 at 11:21:40PM +0800, Ming Lei wrote:
> > Move debugfs/sysfs register out of freezing queue in
> > __blk_mq_update_nr_hw_queues(), so that the following lockdep dependency
> > can be killed:
> >
> > #2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
> > #1 (fs_reclaim){+.+.}-{0:0}:
> > #0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs
> >
> > And registering/un-registering debugfs/sysfs does not require queue to be
> > frozen.
>
> Please explain the why it's not required. And talking about lockdep
> dependencies is not very useful - it either is an actual lock ordering
> issue or a fale positive, but lockdep is just the messenger.
Both kobject delete and debugfs unregister drains in-progress .show/.store
and read, they do not need to freeze queue.
Thanks,
Ming
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait
2025-04-25 14:34 ` Christoph Hellwig
@ 2025-04-28 11:51 ` Ming Lei
0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-28 11:51 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström
On Fri, Apr 25, 2025 at 04:34:41PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 24, 2025 at 11:21:33PM +0800, Ming Lei wrote:
> > Move blk_unregister_queue() & device_del() after freeze wait, and prepare
> > for unifying elevator switch.
> >
> > This way is just fine, since bdev has been unhashed at the beginning of
> > del_gendisk(), both blk_unregister_queue() & device_del() are dealing
> > with kobject & debugfs thing only.
>
> Can you please add a reference to the previous attempt here and that
> it was backed out again and why we think it's fine this time around?
Turns out this patch isn't necessary, we can avoid the small race easily
between elevator switch and del_gendisk().
Will drop this patch in V4.
Thanks,
Ming
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress
2025-04-27 2:27 ` Ming Lei
@ 2025-04-28 16:17 ` Nilay Shroff
2025-04-29 2:43 ` Ming Lei
0 siblings, 1 reply; 71+ messages in thread
From: Nilay Shroff @ 2025-04-28 16:17 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On 4/27/25 7:57 AM, Ming Lei wrote:
> On Fri, Apr 25, 2025 at 06:18:33PM +0530, Nilay Shroff wrote:
>>
>>
>> On 4/24/25 8:51 PM, Ming Lei wrote:
>>> Elevator switch code is another `nr_hw_queue` reader in non-fast-IO code
>>> path, so it can't be done if updating `nr_hw_queues` is in-progress.
>>>
>>> Take same approach with not allowing add/del disk when updating
>>> nr_hw_queues is in-progress, by grabbing read lock of
>>> set->update_nr_hwq_sema.
>>>
>>> Take the nested variant for avoiding the following false positive
>>> splat[1], and this way is correct because:
>>>
>>> - the read lock in elv_iosched_store() is not overlapped with the read lock
>>> in adding/deleting disk:
>>>
>>> - kobject attribute is only available after the kobject is added and
>>> before it is deleted
>>>
>>> -> #4 (&q->q_usage_counter(queue){++++}-{0:0}:
>>> -> #3 (&q->limits_lock){+.+.}-{4:4}:
>>> -> #2 (&disk->open_mutex){+.+.}-{4:4}:
>>> -> #1 (&set->update_nr_hwq_lock){.+.+}-{4:4}:
>>> -> #0 (kn->active#103){++++}-{0:0}:
>>>
>>> Link: https://lore.kernel.org/linux-block/aAWv3NPtNIKKvJZc@fedora/ [1]
>>> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>>> Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>> ---
>>> block/elevator.c | 3 +++
>>> 1 file changed, 3 insertions(+)
>>>
>>> diff --git a/block/elevator.c b/block/elevator.c
>>> index 4400eb8fe54f..56da6ab7691a 100644
>>> --- a/block/elevator.c
>>> +++ b/block/elevator.c
>>> @@ -723,6 +723,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>>> int ret;
>>> unsigned int memflags;
>>> struct request_queue *q = disk->queue;
>>> + struct blk_mq_tag_set *set = q->tag_set;
>>>
>>> /*
>>> * If the attribute needs to load a module, do it before freezing the
>>> @@ -734,6 +735,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>>>
>>> elv_iosched_load_module(name);
>>>
>>> + down_read_nested(&set->update_nr_hwq_sema, 1);
>>
>> Why do we need to add nested read lock here? The lockdep splat[1] which
>> you reported earlier is possibly due to the same reader lock being acquired
>> recursively in elv_iosched_store and then elevator_change?
>
> The splat isn't related with the nested read lock.
>
> If you replace down_read_nested() with down_read(), the same splat can be
> triggered again when running `blktests block/001`.
>
I couldn't recreate it on my setup using above blktests.
>>
>> On another note, if we suspect possible one-depth recursion for the same
>> class of lock then then we should use SINGLE_DEPTH_NESTING (instead of using
>> 1 here) for subclass. But still I am not clear why this lock needs nesting.
>
> It is just one false positive, because elv_iosched_store() won't happen
> when adding disk.
>
Yes, but how could we avoid false positive? It's probably because of commit
ffa1e7ada456 ("block: Make request_queue lockdep splats show up earlier"). How about adding manual dependency of fs-reclaim on freeze-lock after we add
the disk. Currently that manual dependency is added in blk_alloc_queue.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress
2025-04-28 16:17 ` Nilay Shroff
@ 2025-04-29 2:43 ` Ming Lei
2025-04-29 10:22 ` Nilay Shroff
0 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2025-04-29 2:43 UTC (permalink / raw)
To: Nilay Shroff
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Mon, Apr 28, 2025 at 09:47:02PM +0530, Nilay Shroff wrote:
>
>
> On 4/27/25 7:57 AM, Ming Lei wrote:
> > On Fri, Apr 25, 2025 at 06:18:33PM +0530, Nilay Shroff wrote:
> >>
> >>
> >> On 4/24/25 8:51 PM, Ming Lei wrote:
> >>> Elevator switch code is another `nr_hw_queue` reader in non-fast-IO code
> >>> path, so it can't be done if updating `nr_hw_queues` is in-progress.
> >>>
> >>> Take same approach with not allowing add/del disk when updating
> >>> nr_hw_queues is in-progress, by grabbing read lock of
> >>> set->update_nr_hwq_sema.
> >>>
> >>> Take the nested variant for avoiding the following false positive
> >>> splat[1], and this way is correct because:
> >>>
> >>> - the read lock in elv_iosched_store() is not overlapped with the read lock
> >>> in adding/deleting disk:
> >>>
> >>> - kobject attribute is only available after the kobject is added and
> >>> before it is deleted
> >>>
> >>> -> #4 (&q->q_usage_counter(queue){++++}-{0:0}:
> >>> -> #3 (&q->limits_lock){+.+.}-{4:4}:
> >>> -> #2 (&disk->open_mutex){+.+.}-{4:4}:
> >>> -> #1 (&set->update_nr_hwq_lock){.+.+}-{4:4}:
> >>> -> #0 (kn->active#103){++++}-{0:0}:
> >>>
> >>> Link: https://lore.kernel.org/linux-block/aAWv3NPtNIKKvJZc@fedora/ [1]
> >>> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> >>> Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/
> >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>> ---
> >>> block/elevator.c | 3 +++
> >>> 1 file changed, 3 insertions(+)
> >>>
> >>> diff --git a/block/elevator.c b/block/elevator.c
> >>> index 4400eb8fe54f..56da6ab7691a 100644
> >>> --- a/block/elevator.c
> >>> +++ b/block/elevator.c
> >>> @@ -723,6 +723,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
> >>> int ret;
> >>> unsigned int memflags;
> >>> struct request_queue *q = disk->queue;
> >>> + struct blk_mq_tag_set *set = q->tag_set;
> >>>
> >>> /*
> >>> * If the attribute needs to load a module, do it before freezing the
> >>> @@ -734,6 +735,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
> >>>
> >>> elv_iosched_load_module(name);
> >>>
> >>> + down_read_nested(&set->update_nr_hwq_sema, 1);
> >>
> >> Why do we need to add nested read lock here? The lockdep splat[1] which
> >> you reported earlier is possibly due to the same reader lock being acquired
> >> recursively in elv_iosched_store and then elevator_change?
> >
> > The splat isn't related with the nested read lock.
> >
> > If you replace down_read_nested() with down_read(), the same splat can be
> > triggered again when running `blktests block/001`.
> >
> I couldn't recreate it on my setup using above blktests.
It is reproduced in my test vm every time after killing the nested variant:
[ 74.257200] ======================================================
[ 74.259369] WARNING: possible circular locking dependency detected
[ 74.260772] 6.15.0-rc3_ublk+ #547 Not tainted
[ 74.261950] ------------------------------------------------------
[ 74.263281] check/5077 is trying to acquire lock:
[ 74.264492] ffff888105f1fd18 (kn->active#119){++++}-{0:0}, at: __kernfs_remove+0x213/0x680
[ 74.266006]
but task is already holding lock:
[ 74.267998] ffff88828a661e20 (&q->q_usage_counter(queue)#14){++++}-{0:0}, at: del_gendisk+0xe5/0x180
[ 74.269631]
which lock already depends on the new lock.
[ 74.272645]
the existing dependency chain (in reverse order) is:
[ 74.274804]
-> #3 (&q->q_usage_counter(queue)#14){++++}-{0:0}:
[ 74.277009] blk_queue_enter+0x4c2/0x630
[ 74.278218] blk_mq_alloc_request+0x479/0xa00
[ 74.279539] scsi_execute_cmd+0x151/0xba0
[ 74.281078] sr_check_events+0x1bc/0xa40
[ 74.283012] cdrom_check_events+0x5c/0x120
[ 74.284892] disk_check_events+0xbe/0x390
[ 74.286181] disk_check_media_change+0xf1/0x220
[ 74.287455] sr_block_open+0xce/0x230
[ 74.288528] blkdev_get_whole+0x8d/0x200
[ 74.289702] bdev_open+0x614/0xc60
[ 74.290882] blkdev_open+0x1f6/0x360
[ 74.292215] do_dentry_open+0x491/0x1820
[ 74.293309] vfs_open+0x7a/0x440
[ 74.294384] path_openat+0x1b7e/0x2ce0
[ 74.295507] do_filp_open+0x1c5/0x450
[ 74.296616] do_sys_openat2+0xef/0x180
[ 74.297667] __x64_sys_openat+0x10e/0x210
[ 74.298768] do_syscall_64+0x92/0x180
[ 74.299800] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 74.300971]
-> #2 (&disk->open_mutex){+.+.}-{4:4}:
[ 74.302700] __mutex_lock+0x19c/0x1990
[ 74.303682] bdev_open+0x6cd/0xc60
[ 74.304613] bdev_file_open_by_dev+0xc4/0x140
[ 74.306008] disk_scan_partitions+0x191/0x290
[ 74.307716] __add_disk_fwnode+0xd2a/0x1140
[ 74.309394] add_disk_fwnode+0x10e/0x220
[ 74.311039] nvme_alloc_ns+0x1833/0x2c30
[ 74.312669] nvme_scan_ns+0x5a0/0x6f0
[ 74.314151] async_run_entry_fn+0x94/0x540
[ 74.315719] process_one_work+0x86a/0x14a0
[ 74.317287] worker_thread+0x5bb/0xf90
[ 74.318228] kthread+0x371/0x720
[ 74.319085] ret_from_fork+0x31/0x70
[ 74.319941] ret_from_fork_asm+0x1a/0x30
[ 74.320808]
-> #1 (&set->update_nr_hwq_sema){.+.+}-{4:4}:
[ 74.322311] down_read+0x8e/0x470
[ 74.323135] elv_iosched_store+0x17a/0x210
[ 74.324036] queue_attr_store+0x234/0x340
[ 74.324881] kernfs_fop_write_iter+0x39b/0x5a0
[ 74.325771] vfs_write+0x5df/0xec0
[ 74.326514] ksys_write+0xff/0x200
[ 74.327262] do_syscall_64+0x92/0x180
[ 74.328018] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 74.328963]
-> #0 (kn->active#119){++++}-{0:0}:
[ 74.330433] __lock_acquire+0x145f/0x2260
[ 74.331329] lock_acquire+0x163/0x300
[ 74.332221] kernfs_drain+0x39d/0x450
[ 74.333002] __kernfs_remove+0x213/0x680
[ 74.333792] kernfs_remove_by_name_ns+0xa2/0x100
[ 74.334589] remove_files+0x8d/0x1b0
[ 74.335326] sysfs_remove_group+0x7c/0x160
[ 74.336118] sysfs_remove_groups+0x55/0xb0
[ 74.336869] __kobject_del+0x7d/0x1d0
[ 74.337637] kobject_del+0x38/0x60
[ 74.338340] blk_unregister_queue+0x153/0x2c0
[ 74.339125] __del_gendisk+0x252/0x9d0
[ 74.339959] del_gendisk+0xe5/0x180
[ 74.340756] sr_remove+0x7b/0xd0
[ 74.341429] device_release_driver_internal+0x36d/0x520
[ 74.342353] bus_remove_device+0x1ef/0x3f0
[ 74.343172] device_del+0x3be/0x9b0
[ 74.343951] __scsi_remove_device+0x27f/0x340
[ 74.344724] sdev_store_delete+0x87/0x120
[ 74.345508] kernfs_fop_write_iter+0x39b/0x5a0
[ 74.346287] vfs_write+0x5df/0xec0
[ 74.347170] ksys_write+0xff/0x200
[ 74.348312] do_syscall_64+0x92/0x180
[ 74.349519] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 74.350797]
other info that might help us debug this:
[ 74.353554] Chain exists of:
kn->active#119 --> &disk->open_mutex --> &q->q_usage_counter(queue)#14
[ 74.355535] Possible unsafe locking scenario:
[ 74.356650] CPU0 CPU1
[ 74.357328] ---- ----
[ 74.358026] lock(&q->q_usage_counter(queue)#14);
[ 74.358749] lock(&disk->open_mutex);
[ 74.359561] lock(&q->q_usage_counter(queue)#14);
[ 74.360488] lock(kn->active#119);
[ 74.361113]
*** DEADLOCK ***
[ 74.362574] 6 locks held by check/5077:
[ 74.363193] #0: ffff888114640420 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0xff/0x200
[ 74.364274] #1: ffff88829abb6088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x25b/0x5a0
[ 74.365937] #2: ffff8881176ca0e0 (&shost->scan_mutex){+.+.}-{4:4}, at: sdev_store_delete+0x7f/0x120
[ 74.367643] #3: ffff88828521c380 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x90/0x520
[ 74.369464] #4: ffff8881176ca380 (&set->update_nr_hwq_sema){.+.+}-{4:4}, at: del_gendisk+0xdd/0x180
[ 74.370961] #5: ffff88828a661e20 (&q->q_usage_counter(queue)#14){++++}-{0:0}, at: del_gendisk+0xe5/0x180
[ 74.372050]
stack backtrace:
[ 74.373111] CPU: 10 UID: 0 PID: 5077 Comm: check Not tainted 6.15.0-rc3_ublk+ #547 PREEMPT(voluntary)
[ 74.373116] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014
[ 74.373118] Call Trace:
[ 74.373121] <TASK>
[ 74.373123] dump_stack_lvl+0x84/0xd0
[ 74.373129] print_circular_bug.cold+0x185/0x1d0
[ 74.373134] check_noncircular+0x14a/0x170
[ 74.373140] __lock_acquire+0x145f/0x2260
[ 74.373145] lock_acquire+0x163/0x300
[ 74.373149] ? __kernfs_remove+0x213/0x680
[ 74.373155] kernfs_drain+0x39d/0x450
[ 74.373158] ? __kernfs_remove+0x213/0x680
[ 74.373161] ? __pfx_kernfs_drain+0x10/0x10
[ 74.373165] ? find_held_lock+0x2b/0x80
[ 74.373168] ? kernfs_root+0xb0/0x1c0
[ 74.373173] __kernfs_remove+0x213/0x680
[ 74.373176] ? kernfs_find_ns+0x197/0x390
[ 74.373183] kernfs_remove_by_name_ns+0xa2/0x100
[ 74.373186] remove_files+0x8d/0x1b0
[ 74.373191] sysfs_remove_group+0x7c/0x160
[ 74.373194] sysfs_remove_groups+0x55/0xb0
[ 74.373198] __kobject_del+0x7d/0x1d0
[ 74.373203] kobject_del+0x38/0x60
[ 74.373206] blk_unregister_queue+0x153/0x2c0
[ 74.373210] __del_gendisk+0x252/0x9d0
[ 74.373214] ? down_read+0x1a7/0x470
[ 74.373218] ? __pfx___del_gendisk+0x10/0x10
[ 74.373221] ? __pfx_down_read+0x10/0x10
[ 74.373224] ? lockdep_hardirqs_on_prepare+0xdb/0x190
[ 74.373227] ? trace_hardirqs_on+0x18/0x150
[ 74.373231] del_gendisk+0xe5/0x180
[ 74.373235] sr_remove+0x7b/0xd0
[ 74.373239] device_release_driver_internal+0x36d/0x520
[ 74.373243] ? kobject_put+0x5e/0x4a0
[ 74.373246] bus_remove_device+0x1ef/0x3f0
[ 74.373250] device_del+0x3be/0x9b0
[ 74.373254] ? attribute_container_device_trigger+0x181/0x1f0
[ 74.373257] ? __pfx_device_del+0x10/0x10
[ 74.373260] ? __pfx_attribute_container_device_trigger+0x10/0x10
[ 74.373264] __scsi_remove_device+0x27f/0x340
[ 74.373267] sdev_store_delete+0x87/0x120
[ 74.373270] ? __pfx_sysfs_kf_write+0x10/0x10
[ 74.373273] kernfs_fop_write_iter+0x39b/0x5a0
[ 74.373276] ? __pfx_kernfs_fop_write_iter+0x10/0x10
[ 74.373278] vfs_write+0x5df/0xec0
[ 74.373282] ? trace_hardirqs_on+0x18/0x150
[ 74.373285] ? __pfx_vfs_write+0x10/0x10
[ 74.373291] ksys_write+0xff/0x200
[ 74.373295] ? __pfx_ksys_write+0x10/0x10
[ 74.373298] ? fput_close_sync+0xd6/0x160
[ 74.373303] do_syscall_64+0x92/0x180
[ 74.373309] ? trace_hardirqs_on_prepare+0x101/0x150
[ 74.373313] ? lockdep_hardirqs_on_prepare+0xdb/0x190
[ 74.373317] ? syscall_exit_to_user_mode+0x97/0x290
[ 74.373322] ? do_syscall_64+0x9f/0x180
[ 74.373330] ? fput_close+0xd6/0x160
[ 74.373333] ? __pfx_fput_close+0x10/0x10
[ 74.373338] ? filp_close+0x25/0x40
[ 74.373341] ? do_dup2+0x287/0x4f0
[ 74.373346] ? trace_hardirqs_on_prepare+0x101/0x150
[ 74.373348] ? lockdep_hardirqs_on_prepare+0xdb/0x190
[ 74.373351] ? syscall_exit_to_user_mode+0x97/0x290
[ 74.373353] ? do_syscall_64+0x9f/0x180
[ 74.373357] ? trace_hardirqs_on_prepare+0x101/0x150
[ 74.373359] ? lockdep_hardirqs_on_prepare+0xdb/0x190
[ 74.373362] ? syscall_exit_to_user_mode+0x97/0x290
[ 74.373364] ? do_syscall_64+0x9f/0x180
[ 74.373368] ? do_syscall_64+0x9f/0x180
[ 74.373371] ? trace_hardirqs_on_prepare+0x101/0x150
[ 74.373373] ? lockdep_hardirqs_on_prepare+0xdb/0x190
[ 74.373376] ? syscall_exit_to_user_mode+0x97/0x290
[ 74.373379] ? do_syscall_64+0x9f/0x180
[ 74.373382] ? do_syscall_64+0x9f/0x180
[ 74.373385] ? clear_bhb_loop+0x35/0x90
[ 74.373388] ? clear_bhb_loop+0x35/0x90
[ 74.373390] ? clear_bhb_loop+0x35/0x90
[ 74.373393] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 74.373396] RIP: 0033:0x7fa3873a8756
[ 74.373409] Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08
[ 74.373412] RSP: 002b:00007ffede0285d0 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 74.373416] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fa3873a8756
[ 74.373418] RDX: 0000000000000002 RSI: 000056557b3dc7d0 RDI: 0000000000000001
[ 74.373420] RBP: 00007ffede0285f0 R08: 0000000000000000 R09: 0000000000000000
[ 74.373421] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
[ 74.373423] R13: 000056557b3dc7d0 R14: 00007fa3875245c0 R15: 0000000000000000
[ 74.373428] </TASK>
>
> >>
> >> On another note, if we suspect possible one-depth recursion for the same
> >> class of lock then then we should use SINGLE_DEPTH_NESTING (instead of using
> >> 1 here) for subclass. But still I am not clear why this lock needs nesting.
> >
> > It is just one false positive, because elv_iosched_store() won't happen
> > when adding disk.
> >
> Yes, but how could we avoid false positive? It's probably because of commit
> ffa1e7ada456 ("block: Make request_queue lockdep splats show up earlier"). How about adding manual dependency of fs-reclaim on freeze-lock after we add
> the disk. Currently that manual dependency is added in blk_alloc_queue.
Please see the above trace, which isn't related with commit ffa1e7ada456,
and the lock chain doesn't include 'fs_reclaim' at all.
commit ffa1e7ada456 isn't wrong too, which just helps us to expose deadlock
risk early.
Thanks,
Ming
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 09/20] block: simplify elevator reattachment for updating nr_hw_queues
2025-04-25 18:12 ` Christoph Hellwig
@ 2025-04-29 9:51 ` Ming Lei
0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-29 9:51 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström
On Fri, Apr 25, 2025 at 08:12:27PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 24, 2025 at 11:21:32PM +0800, Ming Lei wrote:
> > +static int __elevator_change(struct request_queue *q,
> > + const char *elevator_name)
>
> There's still not good reason for this helper.
With the helper, we can avoid the 'force' reattachment flag.
>
> I'd suggest you add the two first attached patches before this one
> (it'll need a bi of reabsing as all of them are after your entire
> sweries right now) and then fold the third one into this, which will
> give us less code and a cleaner interface.
I'd suggest to add your patches after this one, which changes
elevator_switch() into local symbol.
Thanks,
Ming
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress
2025-04-29 2:43 ` Ming Lei
@ 2025-04-29 10:22 ` Nilay Shroff
2025-04-30 0:54 ` Ming Lei
0 siblings, 1 reply; 71+ messages in thread
From: Nilay Shroff @ 2025-04-29 10:22 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On 4/29/25 8:13 AM, Ming Lei wrote:
>> I couldn't recreate it on my setup using above blktests.
> It is reproduced in my test vm every time after killing the nested variant:
>
> [ 74.257200] ======================================================
> [ 74.259369] WARNING: possible circular locking dependency detected
> [ 74.260772] 6.15.0-rc3_ublk+ #547 Not tainted
> [ 74.261950] ------------------------------------------------------
> [ 74.263281] check/5077 is trying to acquire lock:
> [ 74.264492] ffff888105f1fd18 (kn->active#119){++++}-{0:0}, at: __kernfs_remove+0x213/0x680
> [ 74.266006]
> but task is already holding lock:
> [ 74.267998] ffff88828a661e20 (&q->q_usage_counter(queue)#14){++++}-{0:0}, at: del_gendisk+0xe5/0x180
> [ 74.269631]
> which lock already depends on the new lock.
>
> [ 74.272645]
> the existing dependency chain (in reverse order) is:
> [ 74.274804]
> -> #3 (&q->q_usage_counter(queue)#14){++++}-{0:0}:
> [ 74.277009] blk_queue_enter+0x4c2/0x630
> [ 74.278218] blk_mq_alloc_request+0x479/0xa00
> [ 74.279539] scsi_execute_cmd+0x151/0xba0
> [ 74.281078] sr_check_events+0x1bc/0xa40
> [ 74.283012] cdrom_check_events+0x5c/0x120
> [ 74.284892] disk_check_events+0xbe/0x390
> [ 74.286181] disk_check_media_change+0xf1/0x220
> [ 74.287455] sr_block_open+0xce/0x230
> [ 74.288528] blkdev_get_whole+0x8d/0x200
> [ 74.289702] bdev_open+0x614/0xc60
> [ 74.290882] blkdev_open+0x1f6/0x360
> [ 74.292215] do_dentry_open+0x491/0x1820
> [ 74.293309] vfs_open+0x7a/0x440
> [ 74.294384] path_openat+0x1b7e/0x2ce0
> [ 74.295507] do_filp_open+0x1c5/0x450
> [ 74.296616] do_sys_openat2+0xef/0x180
> [ 74.297667] __x64_sys_openat+0x10e/0x210
> [ 74.298768] do_syscall_64+0x92/0x180
> [ 74.299800] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 74.300971]
> -> #2 (&disk->open_mutex){+.+.}-{4:4}:
> [ 74.302700] __mutex_lock+0x19c/0x1990
> [ 74.303682] bdev_open+0x6cd/0xc60
> [ 74.304613] bdev_file_open_by_dev+0xc4/0x140
> [ 74.306008] disk_scan_partitions+0x191/0x290
> [ 74.307716] __add_disk_fwnode+0xd2a/0x1140
> [ 74.309394] add_disk_fwnode+0x10e/0x220
> [ 74.311039] nvme_alloc_ns+0x1833/0x2c30
> [ 74.312669] nvme_scan_ns+0x5a0/0x6f0
> [ 74.314151] async_run_entry_fn+0x94/0x540
> [ 74.315719] process_one_work+0x86a/0x14a0
> [ 74.317287] worker_thread+0x5bb/0xf90
> [ 74.318228] kthread+0x371/0x720
> [ 74.319085] ret_from_fork+0x31/0x70
> [ 74.319941] ret_from_fork_asm+0x1a/0x30
> [ 74.320808]
> -> #1 (&set->update_nr_hwq_sema){.+.+}-{4:4}:
> [ 74.322311] down_read+0x8e/0x470
> [ 74.323135] elv_iosched_store+0x17a/0x210
> [ 74.324036] queue_attr_store+0x234/0x340
> [ 74.324881] kernfs_fop_write_iter+0x39b/0x5a0
> [ 74.325771] vfs_write+0x5df/0xec0
> [ 74.326514] ksys_write+0xff/0x200
> [ 74.327262] do_syscall_64+0x92/0x180
> [ 74.328018] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 74.328963]
> -> #0 (kn->active#119){++++}-{0:0}:
> [ 74.330433] __lock_acquire+0x145f/0x2260
> [ 74.331329] lock_acquire+0x163/0x300
> [ 74.332221] kernfs_drain+0x39d/0x450
> [ 74.333002] __kernfs_remove+0x213/0x680
> [ 74.333792] kernfs_remove_by_name_ns+0xa2/0x100
> [ 74.334589] remove_files+0x8d/0x1b0
> [ 74.335326] sysfs_remove_group+0x7c/0x160
> [ 74.336118] sysfs_remove_groups+0x55/0xb0
> [ 74.336869] __kobject_del+0x7d/0x1d0
> [ 74.337637] kobject_del+0x38/0x60
> [ 74.338340] blk_unregister_queue+0x153/0x2c0
> [ 74.339125] __del_gendisk+0x252/0x9d0
> [ 74.339959] del_gendisk+0xe5/0x180
> [ 74.340756] sr_remove+0x7b/0xd0
> [ 74.341429] device_release_driver_internal+0x36d/0x520
> [ 74.342353] bus_remove_device+0x1ef/0x3f0
> [ 74.343172] device_del+0x3be/0x9b0
> [ 74.343951] __scsi_remove_device+0x27f/0x340
> [ 74.344724] sdev_store_delete+0x87/0x120
> [ 74.345508] kernfs_fop_write_iter+0x39b/0x5a0
> [ 74.346287] vfs_write+0x5df/0xec0
> [ 74.347170] ksys_write+0xff/0x200
> [ 74.348312] do_syscall_64+0x92/0x180
> [ 74.349519] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 74.350797]
> other info that might help us debug this:
>
> [ 74.353554] Chain exists of:
> kn->active#119 --> &disk->open_mutex --> &q->q_usage_counter(queue)#14
>
> [ 74.355535] Possible unsafe locking scenario:
>
> [ 74.356650] CPU0 CPU1
> [ 74.357328] ---- ----
> [ 74.358026] lock(&q->q_usage_counter(queue)#14);
> [ 74.358749] lock(&disk->open_mutex);
> [ 74.359561] lock(&q->q_usage_counter(queue)#14);
> [ 74.360488] lock(kn->active#119);
> [ 74.361113]
> *** DEADLOCK ***
>
> [ 74.362574] 6 locks held by check/5077:
> [ 74.363193] #0: ffff888114640420 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0xff/0x200
> [ 74.364274] #1: ffff88829abb6088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x25b/0x5a0
> [ 74.365937] #2: ffff8881176ca0e0 (&shost->scan_mutex){+.+.}-{4:4}, at: sdev_store_delete+0x7f/0x120
> [ 74.367643] #3: ffff88828521c380 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x90/0x520
> [ 74.369464] #4: ffff8881176ca380 (&set->update_nr_hwq_sema){.+.+}-{4:4}, at: del_gendisk+0xdd/0x180
> [ 74.370961] #5: ffff88828a661e20 (&q->q_usage_counter(queue)#14){++++}-{0:0}, at: del_gendisk+0xe5/0x180
> [ 74.372050]
This has baffled me as I don't understand how could read lock in
elv_iosched_store (ruuning in context #1) depends on (same) read
lock in add_disk_fwnode (running under another context #2) as
both locks are represented by the same rw semaphore. As we see
above both elv_iosched_store and add_disk_fwnode bot run under
different contexts and so ideally they should be able to run
concurrently acquiring the same read lock.
>>>>
>>>> On another note, if we suspect possible one-depth recursion for the same
>>>> class of lock then then we should use SINGLE_DEPTH_NESTING (instead of using
>>>> 1 here) for subclass. But still I am not clear why this lock needs nesting.
>>> It is just one false positive, because elv_iosched_store() won't happen
>>> when adding disk.
>>>
>> Yes, but how could we avoid false positive? It's probably because of commit
>> ffa1e7ada456 ("block: Make request_queue lockdep splats show up earlier"). How about adding manual dependency of fs-reclaim on freeze-lock after we add
>> the disk. Currently that manual dependency is added in blk_alloc_queue.
> Please see the above trace, which isn't related with commit ffa1e7ada456,
> and the lock chain doesn't include 'fs_reclaim' at all.
>
> commit ffa1e7ada456 isn't wrong too, which just helps us to expose deadlock
> risk early.
Yes I am not against this commit and now looking at the above splat I don't
blame this commit.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 20/20] block: move wbt_enable_default() out of queue freezing from sched ->exit()
2025-04-24 15:21 ` [PATCH V3 20/20] block: move wbt_enable_default() out of queue freezing from sched ->exit() Ming Lei
2025-04-25 13:10 ` Nilay Shroff
@ 2025-04-29 10:59 ` Nilay Shroff
1 sibling, 0 replies; 71+ messages in thread
From: Nilay Shroff @ 2025-04-29 10:59 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig
[-- Attachment #1: Type: text/plain, Size: 7898 bytes --]
On 4/24/25 8:51 PM, Ming Lei wrote:
> scheduler's ->exit() is called with queue frozen and elevator lock is held, and
> wbt_enable_default() can't be called with queue frozen, otherwise the
> following lockdep warning is triggered:
>
> #6 (&q->rq_qos_mutex){+.+.}-{4:4}:
> #5 (&eq->sysfs_lock){+.+.}-{4:4}:
> #4 (&q->elevator_lock){+.+.}-{4:4}:
> #3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
> #2 (fs_reclaim){+.+.}-{0:0}:
> #1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
> #0 (&q->debugfs_mutex){+.+.}-{4:4}:
>
> Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
> call it from elevator_change_done().
>
> Meantime add disk->rqos_state_mutex for covering wbt state change, which
> matches the purpose more than ->elevator_lock.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
While testing this patch on my machine using blktests, I stumbled upon
a lockdep splat shown below.(I could consistently recreate it):
run blktests block/005 at 2025-04-28 06:57:51
======================================================
WARNING: possible circular locking dependency detected
6.15.0-rc2+ #174 Not tainted
------------------------------------------------------
check/8088 is trying to acquire lock:
c0000000a0c03538 (&disk->rqos_state_mutex){+.+.}-{4:4}, at: wbt_disable_default+0x9c/0x118
but task is already holding lock:
c00000005b8f6c38 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0x94/0x214
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock+0x128/0xdd8
elevator_change+0x94/0x214
elv_iosched_store+0x14c/0x1f4
queue_attr_store+0x194/0x1d0
sysfs_kf_write+0xbc/0x110
kernfs_fop_write_iter+0x264/0x384
vfs_write+0x5b0/0x77c
ksys_write+0xa0/0x180
system_call_exception+0x1b0/0x4f0
system_call_vectored_common+0x15c/0x2ec
-> #2 (&q->q_usage_counter(io)#23){++++}-{0:0}:
blk_alloc_queue+0x46c/0x4bc
blk_mq_alloc_queue+0xc0/0x160
__blk_mq_alloc_disk+0x34/0x128
nvme_alloc_ns+0x140/0x1804 [nvme_core]
nvme_scan_ns+0x42c/0x564 [nvme_core]
async_run_entry_fn+0x9c/0x30c
process_one_work+0x514/0xd38
worker_thread+0x390/0x6dc
kthread+0x230/0x278
start_kernel_thread+0x14/0x18
-> #1 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x114/0x150
__kmalloc_cache_noprof+0x70/0x5c0
wbt_init+0x64/0x2fc
wbt_enable_default+0x140/0x15c
elevator_change_done+0x314/0x3a8
elv_iosched_store+0x14c/0x1f4
queue_attr_store+0x194/0x1d0
sysfs_kf_write+0xbc/0x110
kernfs_fop_write_iter+0x264/0x384
vfs_write+0x5b0/0x77c
ksys_write+0xa0/0x180
system_call_exception+0x1b0/0x4f0
system_call_vectored_common+0x15c/0x2ec
-> #0 (&disk->rqos_state_mutex){+.+.}-{4:4}:
__lock_acquire+0x1b5c/0x29f8
lock_acquire+0x23c/0x3f8
__mutex_lock+0x128/0xdd8
wbt_disable_default+0x9c/0x118
bfq_init_queue+0x7b0/0x8c0
blk_mq_init_sched+0x29c/0x3a8
__elevator_change+0x3a4/0x8a4
elevator_change+0x1a4/0x214
elv_iosched_store+0x14c/0x1f4
queue_attr_store+0x194/0x1d0
sysfs_kf_write+0xbc/0x110
kernfs_fop_write_iter+0x264/0x384
vfs_write+0x5b0/0x77c
ksys_write+0xa0/0x180
system_call_exception+0x1b0/0x4f0
system_call_vectored_common+0x15c/0x2ec
other info that might help us debug this:
Chain exists of:
&disk->rqos_state_mutex --> &q->q_usage_counter(io)#23 --> &q->elevator_lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&q->elevator_lock);
lock(&q->q_usage_counter(io)#23);
lock(&q->elevator_lock);
lock(&disk->rqos_state_mutex);
*** DEADLOCK ***
7 locks held by check/8088:
#0: c0000000873f2400 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0xa0/0x180
#1: c00000008c10c088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x1e0/0x384
#2: c000000085239248 (kn->active#57){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x1f8/0x384
#3: c0000000f801c190 (&set->update_nr_hwq_sema){.+.+}-{4:4}, at: elv_iosched_store+0x13c/0x1f4
#4: c00000005b8f6718 (&q->q_usage_counter(io)#23){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
#5: c00000005b8f6750 (&q->q_usage_counter(queue)#21){+.+.}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
#6: c00000005b8f6c38 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0x94/0x214
stack backtrace:
CPU: 26 UID: 0 PID: 8088 Comm: check Kdump: loaded Not tainted 6.15.0-rc2+ #174 VOLUNTARY
Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
Call Trace:
[c0000000d7497240] [c0000000017b9888] dump_stack_lvl+0x100/0x184 (unreliable)
[c0000000d7497270] [c0000000002b546c] print_circular_bug+0x448/0x604
[c0000000d7497320] [c0000000002b5874] check_noncircular+0x24c/0x26c
[c0000000d74973f0] [c0000000002bbb78] __lock_acquire+0x1b5c/0x29f8
[c0000000d7497520] [c0000000002b915c] lock_acquire+0x23c/0x3f8
[c0000000d7497620] [c00000000181277c] __mutex_lock+0x128/0xdd8
[c0000000d7497780] [c000000000c73bf8] wbt_disable_default+0x9c/0x118
[c0000000d74977c0] [c000000000c4c2c0] bfq_init_queue+0x7b0/0x8c0
[c0000000d7497890] [c000000000bff634] blk_mq_init_sched+0x29c/0x3a8
[c0000000d7497910] [c000000000bc2a18] __elevator_change+0x3a4/0x8a4
[c0000000d74979b0] [c000000000bc30bc] elevator_change+0x1a4/0x214
[c0000000d7497a00] [c000000000bc427c] elv_iosched_store+0x14c/0x1f4
[c0000000d7497ae0] [c000000000bd07ec] queue_attr_store+0x194/0x1d0
[c0000000d7497c00] [c000000000a40f00] sysfs_kf_write+0xbc/0x110
[c0000000d7497c50] [c000000000a3cc4c] kernfs_fop_write_iter+0x264/0x384
[c0000000d7497cb0] [c0000000008bb9bc] vfs_write+0x5b0/0x77c
[c0000000d7497d90] [c0000000008bbf88] ksys_write+0xa0/0x180
[c0000000d7497df0] [c000000000039f70] system_call_exception+0x1b0/0x4f0
[c0000000d7497e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
--- interrupt: 3000 at 0x7fffa413b034
NIP: 00007fffa413b034 LR: 00007fffa413b034 CTR: 0000000000000000
REGS: c0000000d7497e80 TRAP: 3000 Not tainted (6.15.0-rc2+)
MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 44422408 XER: 00000000
IRQMASK: 0
GPR00: 0000000000000004 00007ffffd011260 000000010dfa7e00 0000000000000001
GPR04: 000000011c30b720 0000000000000004 0000000000000010 0000000000000001
GPR08: 0000000000000003 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fffa43fab60 000000011c3adbc0 000000010dfa87b8
GPR16: 000000010dfa94d8 0000000020000000 0000000000000000 000000010deb9070
GPR20: 000000010df4beb8 00007ffffd011404 000000010df4f8a0 000000010dfa89bc
GPR24: 000000010dfa8a50 0000000000000000 000000011c30b720 0000000000000004
GPR28: 0000000000000004 00007fffa42418e0 000000011c30b720 0000000000000004
NIP [00007fffa413b034] 0x7fffa413b034
LR [00007fffa413b034] 0x7fffa413b034
--- interrupt: 3000
From the changes in this patch it appears that we have now got this
new "disk->rqos_state_mutex" which we're holding while allocating
the dynamic memory in wbt_init(). One way is to define GFP_NOIO scope
while holding this mutex but then looking further it seems that we
don't really need to keep holding this lock in wbt_enable_default()
all the way till wbt_init returns. Probably, unlocking this mutex as
soon as we're done updating the rqos state should be sufficient.
Similarly, in queue_wb_lat_store we may need to hold this mutex only
while we invoke wbt_set_min_lat. And I just realized that in
queue_wb_lat_show we need to replace q->elevator_lock with
disk->rqos_state_mutex. I have attached the above change for
your reference. You may address this in your next patch.
Thanks,
--Nilay
[-- Attachment #2: wbt.patch --]
[-- Type: text/x-patch, Size: 1984 bytes --]
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fafe9e9e97cc..0248dc9438c3 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -560,7 +560,7 @@ static ssize_t queue_wb_lat_show(struct gendisk *disk, char *page)
ssize_t ret;
struct request_queue *q = disk->queue;
- mutex_lock(&q->elevator_lock);
+ mutex_lock(&disk->rqos_state_mutex);
if (!wbt_rq_qos(q)) {
ret = -EINVAL;
goto out;
@@ -573,7 +573,7 @@ static ssize_t queue_wb_lat_show(struct gendisk *disk, char *page)
ret = sysfs_emit(page, "%llu\n", div_u64(wbt_get_min_lat(q), 1000));
out:
- mutex_unlock(&q->elevator_lock);
+ mutex_unlock(&disk->rqos_state_mutex);
return ret;
}
@@ -593,7 +593,6 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
return -EINVAL;
memflags = blk_mq_freeze_queue(q);
- mutex_lock(&disk->rqos_state_mutex);
rqos = wbt_rq_qos(q);
if (!rqos) {
@@ -618,11 +617,12 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
*/
blk_mq_quiesce_queue(q);
+ mutex_lock(&disk->rqos_state_mutex);
wbt_set_min_lat(q, val);
+ mutex_unlock(&disk->rqos_state_mutex);
blk_mq_unquiesce_queue(q);
out:
- mutex_unlock(&disk->rqos_state_mutex);
blk_mq_unfreeze_queue(q, memflags);
return ret;
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index c8588bae1c1b..74ae7131ada9 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -714,17 +714,17 @@ void wbt_enable_default(struct gendisk *disk)
if (rqos) {
if (enable && RQWB(rqos)->enable_state == WBT_STATE_OFF_DEFAULT)
RQWB(rqos)->enable_state = WBT_STATE_ON_DEFAULT;
- goto unlock;
+ mutex_unlock(&disk->rqos_state_mutex);
+ return;
}
+ mutex_unlock(&disk->rqos_state_mutex);
/* Queue not registered? Maybe shutting down... */
if (!blk_queue_registered(q))
- goto unlock;
+ return;
if (queue_is_mq(q) && enable)
wbt_init(disk);
-unlock:
- mutex_unlock(&disk->rqos_state_mutex);
}
EXPORT_SYMBOL_GPL(wbt_enable_default);
^ permalink raw reply related [flat|nested] 71+ messages in thread
* Re: [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
` (19 preceding siblings ...)
2025-04-24 15:21 ` [PATCH V3 20/20] block: move wbt_enable_default() out of queue freezing from sched ->exit() Ming Lei
@ 2025-04-29 12:00 ` Stefan Haberland
2025-04-29 12:11 ` Ming Lei
20 siblings, 1 reply; 71+ messages in thread
From: Stefan Haberland @ 2025-04-29 12:00 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
Christoph Hellwig
Am 24.04.25 um 17:21 schrieb Ming Lei:
> Hello Jens,
>
> This patchset cleans up elevator change code, and unifying it via single
> helper, meantime moves kobject_add/del & debugfs register/unregister out of
> queue freezing & elevator_lock. This way fixes many lockdep warnings
> reported recently, especially since fs_reclaim is connected with freeze lock
> manually by commit ffa1e7ada456 ("block: Make request_queue lockdep splats
> show up earlier").
>
>
> Thanks,
> Ming
>
> V3:
> - replace srcu with rw_sem for avoiding race between add/del disk &
> elevator switch and updating nr_hw_queues (Nilay Shoff)
>
> - add elv_update_nr_hw_queues() for elevator reattachment in case of
> updating nr_hw_queues, meantime keep elv_change_ctx as local structure
> (Christoph)
>
> - replace ->elevator_lock with disk->rqos_state_mutex for covering wbt
> state change
>
> - add new patch "block: use q->elevator with ->elevator_lock held in elv_iosched_show()"
>
> - small cleanup & commit log improvement
>
> V2:
> - retry add/del disk when blk_mq_update_nr_hw_queues() is in-progress
>
> - swap blk_mq_add_queue_tag_set() with blk_mq_map_swqueue() in
> blk_mq_init_allocated_queue() (Nilay Shroff)
>
> - move ELEVATOR_FLAG_DISABLE_WBT to request queue's flags (Nilay Shoff)
>
> - fix race because of delaying elevator unregister
>
> - define flags of `elv_change_ctx` as `bool` (Christoph)
>
> - improve comment and commit log (Christoph)
>
> Ming Lei (20):
> block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue()
> block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag
> block: don't call freeze queue in elevator_switch() and
> elevator_disable()
> block: use q->elevator with ->elevator_lock held in elv_iosched_show()
> block: add two helpers for registering/un-registering sched debugfs
> block: move sched debugfs register into elvevator_register_queue
> block: prevent adding/deleting disk during updating nr_hw_queues
> block: don't allow to switch elevator if updating nr_hw_queues is
> in-progress
> block: simplify elevator reattachment for updating nr_hw_queues
> block: move blk_unregister_queue() & device_del() after freeze wait
> block: move queue freezing & elevator_lock into elevator_change()
> block: add `struct elv_change_ctx` for unifying elevator change
> block: unifying elevator change
> block: pass elevator_queue to elv_register_queue & unregister_queue
> block: fail to show/store elevator sysfs attribute if elevator is
> dying
> block: move elv_register[unregister]_queue out of elevator_lock
> block: move debugfs/sysfs register out of freezing queue
> block: remove several ->elevator_lock
> block: move hctx cpuhp add/del out of queue freezing
> block: move wbt_enable_default() out of queue freezing from sched
> ->exit()
>
> block/bfq-iosched.c | 6 +-
> block/blk-mq-debugfs.c | 12 +-
> block/blk-mq-sched.c | 41 +++---
> block/blk-mq.c | 132 +++---------------
> block/blk-sysfs.c | 24 ++--
> block/blk-wbt.c | 13 +-
> block/blk.h | 8 +-
> block/elevator.c | 302 ++++++++++++++++++++++++++++-------------
> block/elevator.h | 6 +-
> block/genhd.c | 129 +++++++++++-------
> include/linux/blk-mq.h | 3 +
> include/linux/blkdev.h | 5 +
> 12 files changed, 365 insertions(+), 316 deletions(-)
Hi,
while testing the patchset on s390 I still get the following lockdep splat on each boot:
======================================================
WARNING: possible circular locking dependency detected
6.15.0-rc4-gc2b4d8dcb3d2 #3 Not tainted
------------------------------------------------------
(udev-worker)/1810 is trying to acquire lock:
0000005fb84de3a8 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0x54/0x130
but task is already holding lock:
0000005fb84dde18 (&q->q_usage_counter(io)#34){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x26/0x40
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&q->q_usage_counter(io)#34){++++}-{0:0}:
__lock_acquire+0x6da/0xcc0
lock_acquire.part.0+0x10c/0x290
lock_acquire+0xb0/0x1a0
blk_alloc_queue+0x306/0x340
blk_mq_alloc_queue+0x60/0xd0
scsi_alloc_sdev+0x27c/0x3b0
scsi_probe_and_add_lun+0x31a/0x480
scsi_report_lun_scan+0x382/0x430
__scsi_scan_target+0x11a/0x240
scsi_scan_target+0xdc/0x100
fc_scsi_scan_rport+0xc2/0xd0
process_one_work+0x2a6/0x5d0
worker_thread+0x220/0x410
kthread+0x164/0x2d0
__ret_from_fork+0x3c/0x60
ret_from_fork+0xa/0x38
-> #2 (fs_reclaim){+.+.}-{0:0}:
__lock_acquire+0x6da/0xcc0
lock_acquire.part.0+0x10c/0x290
lock_acquire+0xb0/0x1a0
__fs_reclaim_acquire+0x44/0x50
fs_reclaim_acquire+0xba/0x100
__kmalloc_noprof+0xae/0x5e0
pcpu_alloc_chunk+0x30/0x170
pcpu_create_chunk+0x22/0x130
pcpu_alloc_noprof+0x842/0x970
do_kmem_cache_create+0x1e0/0x4b0
__kmem_cache_create_args+0x238/0x340
register_ftrace_graph+0x438/0x460
trace_selftest_startup_function_graph+0x62/0x260
run_tracer_selftest+0x116/0x1b0
register_tracer+0x192/0x260
do_one_initcall+0x4a/0x180
do_initcalls+0x146/0x170
kernel_init_freeable+0x230/0x270
kernel_init+0x2e/0x188
__ret_from_fork+0x3c/0x60
ret_from_fork+0xa/0x38
-> #1 (pcpu_alloc_mutex){+.+.}-{4:4}:
__lock_acquire+0x6da/0xcc0
lock_acquire.part.0+0x10c/0x290
lock_acquire+0xb0/0x1a0
__mutex_lock+0xae/0xa20
mutex_lock_killable_nested+0x32/0x40
pcpu_alloc_noprof+0x6ea/0x970
sbitmap_init_node+0x11a/0x230
sbitmap_queue_init_node+0x3e/0x1c0
blk_mq_init_tags+0xac/0x140
blk_mq_alloc_map_and_rqs+0xb0/0x180
blk_mq_init_sched+0xfc/0x200
elevator_switch+0x74/0x260
__elevator_change+0xde/0x150
elevator_change+0xe4/0x130
elevator_set_default+0x6c/0xc0
blk_register_queue+0x188/0x210
__add_disk_fwnode+0x202/0x420
add_disk_fwnode+0x7a/0xb0
sd_probe+0x22e/0x3f0
really_probe+0x2ac/0x430
driver_probe_device+0x3c/0xc0
__device_attach_driver+0xc0/0x140
bus_for_each_drv+0x90/0xe0
__device_attach_async_helper+0x94/0xf0
async_run_entry_fn+0x4a/0x180
process_one_work+0x2a6/0x5d0
worker_thread+0x220/0x410
kthread+0x164/0x2d0
__ret_from_fork+0x3c/0x60
ret_from_fork+0xa/0x38
-> #0 (&q->elevator_lock){+.+.}-{4:4}:
check_prev_add+0x16c/0xff0
validate_chain+0x734/0x9f0
__lock_acquire+0x6da/0xcc0
lock_acquire.part.0+0x10c/0x290
lock_acquire+0xb0/0x1a0
__mutex_lock+0xae/0xa20
mutex_lock_nested+0x32/0x40
elevator_change+0x54/0x130
elv_iosched_store+0xec/0x140
kernfs_fop_write_iter+0x166/0x230
vfs_write+0x1ac/0x480
ksys_write+0x7c/0x100
__do_syscall+0x156/0x290
system_call+0x74/0x98
other info that might help us debug this:
Chain exists of:
&q->elevator_lock --> fs_reclaim --> &q->q_usage_counter(io)#34
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&q->q_usage_counter(io)#34);
lock(fs_reclaim);
lock(&q->q_usage_counter(io)#34);
lock(&q->elevator_lock);
*** DEADLOCK ***
6 locks held by (udev-worker)/1810:
#0: 0000005f9d141450 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x7c/0x100
#1: 0000005f8ee55090 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x12a/0x230
#2: 0000005fd0be69a0 (kn->active#68){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x136/0x230
#3: 00000066feee93c0 (&set->update_nr_hwq_sema/1){.+.+}-{4:4}, at: elv_iosched_store+0xde/0x140
#4: 0000005fb84dde18 (&q->q_usage_counter(io)#34){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x26/0x40
#5: 0000005fb84dde58 (&q->q_usage_counter(queue)#11){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x26/0x40
stack backtrace:
CPU: 18 UID: 0 PID: 1810 Comm: (udev-worker) Not tainted 6.15.0-rc4-gc2b4d8dcb3d2 #3 PREEMPT
Hardware name: IBM 9175 ME1 701 (LPAR)
Call Trace:
[<00000162f415409a>] dump_stack_lvl+0xa2/0xe8
[<00000162f424cf9c>] print_circular_bug+0x18c/0x210
[<00000162f424d198>] check_noncircular+0x178/0x190
[<00000162f424e57c>] check_prev_add+0x16c/0xff0
[<00000162f424fb34>] validate_chain+0x734/0x9f0
[<00000162f4251f8a>] __lock_acquire+0x6da/0xcc0
[<00000162f425267c>] lock_acquire.part.0+0x10c/0x290
[<00000162f42528b0>] lock_acquire+0xb0/0x1a0
[<00000162f525695e>] __mutex_lock+0xae/0xa20
[<00000162f5257302>] mutex_lock_nested+0x32/0x40
[<00000162f4c00a54>] elevator_change+0x54/0x130
[<00000162f4c0136c>] elv_iosched_store+0xec/0x140
[<00000162f46b6f66>] kernfs_fop_write_iter+0x166/0x230
[<00000162f45cbe3c>] vfs_write+0x1ac/0x480
[<00000162f45cc2ac>] ksys_write+0x7c/0x100
[<00000162f524e956>] __do_syscall+0x156/0x290
[<00000162f525ff94>] system_call+0x74/0x98
INFO: lockdep is turned off.
The trigger is a udev rule changing the scheduler for the device.
thanks,
Stefan
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning
2025-04-29 12:00 ` [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Stefan Haberland
@ 2025-04-29 12:11 ` Ming Lei
0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-29 12:11 UTC (permalink / raw)
To: Stefan Haberland
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Tue, Apr 29, 2025 at 02:00:27PM +0200, Stefan Haberland wrote:
> Am 24.04.25 um 17:21 schrieb Ming Lei:
>
> > Hello Jens,
> >
> > This patchset cleans up elevator change code, and unifying it via single
> > helper, meantime moves kobject_add/del & debugfs register/unregister out of
> > queue freezing & elevator_lock. This way fixes many lockdep warnings
> > reported recently, especially since fs_reclaim is connected with freeze lock
> > manually by commit ffa1e7ada456 ("block: Make request_queue lockdep splats
> > show up earlier").
> >
> >
> > Thanks,
> > Ming
> >
> > V3:
> > - replace srcu with rw_sem for avoiding race between add/del disk &
> > elevator switch and updating nr_hw_queues (Nilay Shoff)
> >
> > - add elv_update_nr_hw_queues() for elevator reattachment in case of
> > updating nr_hw_queues, meantime keep elv_change_ctx as local structure
> > (Christoph)
> >
> > - replace ->elevator_lock with disk->rqos_state_mutex for covering wbt
> > state change
> >
> > - add new patch "block: use q->elevator with ->elevator_lock held in elv_iosched_show()"
> >
> > - small cleanup & commit log improvement
> >
> > V2:
> > - retry add/del disk when blk_mq_update_nr_hw_queues() is in-progress
> >
> > - swap blk_mq_add_queue_tag_set() with blk_mq_map_swqueue() in
> > blk_mq_init_allocated_queue() (Nilay Shroff)
> >
> > - move ELEVATOR_FLAG_DISABLE_WBT to request queue's flags (Nilay Shoff)
> >
> > - fix race because of delaying elevator unregister
> >
> > - define flags of `elv_change_ctx` as `bool` (Christoph)
> >
> > - improve comment and commit log (Christoph)
> >
> > Ming Lei (20):
> > block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue()
> > block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag
> > block: don't call freeze queue in elevator_switch() and
> > elevator_disable()
> > block: use q->elevator with ->elevator_lock held in elv_iosched_show()
> > block: add two helpers for registering/un-registering sched debugfs
> > block: move sched debugfs register into elvevator_register_queue
> > block: prevent adding/deleting disk during updating nr_hw_queues
> > block: don't allow to switch elevator if updating nr_hw_queues is
> > in-progress
> > block: simplify elevator reattachment for updating nr_hw_queues
> > block: move blk_unregister_queue() & device_del() after freeze wait
> > block: move queue freezing & elevator_lock into elevator_change()
> > block: add `struct elv_change_ctx` for unifying elevator change
> > block: unifying elevator change
> > block: pass elevator_queue to elv_register_queue & unregister_queue
> > block: fail to show/store elevator sysfs attribute if elevator is
> > dying
> > block: move elv_register[unregister]_queue out of elevator_lock
> > block: move debugfs/sysfs register out of freezing queue
> > block: remove several ->elevator_lock
> > block: move hctx cpuhp add/del out of queue freezing
> > block: move wbt_enable_default() out of queue freezing from sched
> > ->exit()
> >
> > block/bfq-iosched.c | 6 +-
> > block/blk-mq-debugfs.c | 12 +-
> > block/blk-mq-sched.c | 41 +++---
> > block/blk-mq.c | 132 +++---------------
> > block/blk-sysfs.c | 24 ++--
> > block/blk-wbt.c | 13 +-
> > block/blk.h | 8 +-
> > block/elevator.c | 302 ++++++++++++++++++++++++++++-------------
> > block/elevator.h | 6 +-
> > block/genhd.c | 129 +++++++++++-------
> > include/linux/blk-mq.h | 3 +
> > include/linux/blkdev.h | 5 +
> > 12 files changed, 365 insertions(+), 316 deletions(-)
>
> Hi,
> while testing the patchset on s390 I still get the following lockdep splat on each boot:
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 6.15.0-rc4-gc2b4d8dcb3d2 #3 Not tainted
> ------------------------------------------------------
> (udev-worker)/1810 is trying to acquire lock:
> 0000005fb84de3a8 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0x54/0x130
>
> but task is already holding lock:
> 0000005fb84dde18 (&q->q_usage_counter(io)#34){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x26/0x40
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 (&q->q_usage_counter(io)#34){++++}-{0:0}:
> __lock_acquire+0x6da/0xcc0
> lock_acquire.part.0+0x10c/0x290
> lock_acquire+0xb0/0x1a0
> blk_alloc_queue+0x306/0x340
> blk_mq_alloc_queue+0x60/0xd0
> scsi_alloc_sdev+0x27c/0x3b0
> scsi_probe_and_add_lun+0x31a/0x480
> scsi_report_lun_scan+0x382/0x430
> __scsi_scan_target+0x11a/0x240
> scsi_scan_target+0xdc/0x100
> fc_scsi_scan_rport+0xc2/0xd0
> process_one_work+0x2a6/0x5d0
> worker_thread+0x220/0x410
> kthread+0x164/0x2d0
> __ret_from_fork+0x3c/0x60
> ret_from_fork+0xa/0x38
>
> -> #2 (fs_reclaim){+.+.}-{0:0}:
> __lock_acquire+0x6da/0xcc0
> lock_acquire.part.0+0x10c/0x290
> lock_acquire+0xb0/0x1a0
> __fs_reclaim_acquire+0x44/0x50
> fs_reclaim_acquire+0xba/0x100
> __kmalloc_noprof+0xae/0x5e0
> pcpu_alloc_chunk+0x30/0x170
> pcpu_create_chunk+0x22/0x130
> pcpu_alloc_noprof+0x842/0x970
> do_kmem_cache_create+0x1e0/0x4b0
> __kmem_cache_create_args+0x238/0x340
> register_ftrace_graph+0x438/0x460
> trace_selftest_startup_function_graph+0x62/0x260
> run_tracer_selftest+0x116/0x1b0
> register_tracer+0x192/0x260
> do_one_initcall+0x4a/0x180
> do_initcalls+0x146/0x170
> kernel_init_freeable+0x230/0x270
> kernel_init+0x2e/0x188
> __ret_from_fork+0x3c/0x60
> ret_from_fork+0xa/0x38
>
> -> #1 (pcpu_alloc_mutex){+.+.}-{4:4}:
> __lock_acquire+0x6da/0xcc0
> lock_acquire.part.0+0x10c/0x290
> lock_acquire+0xb0/0x1a0
> __mutex_lock+0xae/0xa20
> mutex_lock_killable_nested+0x32/0x40
> pcpu_alloc_noprof+0x6ea/0x970
It is one known dependency on percpu alloc lock, and we can't cover every
one in single patchset, especially this patchset is becoming bigger.
And this percpu lock splat becomes much easier to address after the elevator
patchset is merged.
We can move the elevator data allocation out of queue freeze, then the
dependency can be cut.
Thanks
Ming
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 12/20] block: add `struct elv_change_ctx` for unifying elevator change
2025-04-25 18:23 ` Christoph Hellwig
@ 2025-04-29 15:45 ` Ming Lei
0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-29 15:45 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström
On Fri, Apr 25, 2025 at 08:23:41PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 24, 2025 at 11:21:35PM +0800, Ming Lei wrote:
> > +struct elv_change_ctx {
> > + const char *name;
> > + bool uevent;
>
> There's only one caller that wants to supress the uevents. So maybe
> invert the polarity so that it only has to be set in one place,
> which also documents how setting the initial scheduler is special a
> bit better.
OK.
>
> > - ret = elv_register_queue(q, true);
> > + ret = elv_register_queue(q, ctx->uevent);
>
> .. and pass the ctx on to elv_register_queue instead of converting
> paramter types. Although that might be woeth doing later when
> another argument derived from ctx gets passed as well.
ctx isn't very useful for elv_register_queue(), which can't figure out
the exact elevator queue to use from `ctx` since there are two(old, new).
Thanks,
Ming
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 13/20] block: unifying elevator change
2025-04-25 13:03 ` Nilay Shroff
@ 2025-04-30 0:46 ` Ming Lei
0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-30 0:46 UTC (permalink / raw)
To: Nilay Shroff
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Fri, Apr 25, 2025 at 06:33:37PM +0530, Nilay Shroff wrote:
>
>
> On 4/24/25 8:51 PM, Ming Lei wrote:
> > @@ -564,12 +558,7 @@ static int __add_disk_fwnode(struct device *parent, struct gendisk *disk,
> > out_free_ext_minor:
> > if (disk->major == BLOCK_EXT_MAJOR)
> > blk_free_ext_minor(disk->first_minor);
> > -out_exit_elevator:
> > - if (disk->queue->elevator) {
> > - mutex_lock(&disk->queue->elevator_lock);
> > - elevator_exit(disk->queue);
> > - mutex_unlock(&disk->queue->elevator_lock);
> > - }
>
> After restructuring __add_disk_fwnode() we may still need elevator_exit()
> in case when blk_register_queue is successful and so we should have set
> q->elevator to default. Now if after queue is registered successfully,
> for instance, bdi_register fails then in that case we still have to exit
> elevator to free its resources, isn't it?
If blk_register_queue is successful, it is blk_unregister_queue()'s
responsibility for covering it, elevator_exit() isn't needed here.
Thanks,
Ming
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress
2025-04-29 10:22 ` Nilay Shroff
@ 2025-04-30 0:54 ` Ming Lei
0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-30 0:54 UTC (permalink / raw)
To: Nilay Shroff
Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
Thomas Hellström, Christoph Hellwig
On Tue, Apr 29, 2025 at 03:52:51PM +0530, Nilay Shroff wrote:
>
>
> On 4/29/25 8:13 AM, Ming Lei wrote:
> >> I couldn't recreate it on my setup using above blktests.
> > It is reproduced in my test vm every time after killing the nested variant:
> >
> > [ 74.257200] ======================================================
> > [ 74.259369] WARNING: possible circular locking dependency detected
> > [ 74.260772] 6.15.0-rc3_ublk+ #547 Not tainted
> > [ 74.261950] ------------------------------------------------------
> > [ 74.263281] check/5077 is trying to acquire lock:
> > [ 74.264492] ffff888105f1fd18 (kn->active#119){++++}-{0:0}, at: __kernfs_remove+0x213/0x680
> > [ 74.266006]
> > but task is already holding lock:
> > [ 74.267998] ffff88828a661e20 (&q->q_usage_counter(queue)#14){++++}-{0:0}, at: del_gendisk+0xe5/0x180
> > [ 74.269631]
> > which lock already depends on the new lock.
> >
> > [ 74.272645]
> > the existing dependency chain (in reverse order) is:
> > [ 74.274804]
> > -> #3 (&q->q_usage_counter(queue)#14){++++}-{0:0}:
> > [ 74.277009] blk_queue_enter+0x4c2/0x630
> > [ 74.278218] blk_mq_alloc_request+0x479/0xa00
> > [ 74.279539] scsi_execute_cmd+0x151/0xba0
> > [ 74.281078] sr_check_events+0x1bc/0xa40
> > [ 74.283012] cdrom_check_events+0x5c/0x120
> > [ 74.284892] disk_check_events+0xbe/0x390
> > [ 74.286181] disk_check_media_change+0xf1/0x220
> > [ 74.287455] sr_block_open+0xce/0x230
> > [ 74.288528] blkdev_get_whole+0x8d/0x200
> > [ 74.289702] bdev_open+0x614/0xc60
> > [ 74.290882] blkdev_open+0x1f6/0x360
> > [ 74.292215] do_dentry_open+0x491/0x1820
> > [ 74.293309] vfs_open+0x7a/0x440
> > [ 74.294384] path_openat+0x1b7e/0x2ce0
> > [ 74.295507] do_filp_open+0x1c5/0x450
> > [ 74.296616] do_sys_openat2+0xef/0x180
> > [ 74.297667] __x64_sys_openat+0x10e/0x210
> > [ 74.298768] do_syscall_64+0x92/0x180
> > [ 74.299800] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 74.300971]
> > -> #2 (&disk->open_mutex){+.+.}-{4:4}:
> > [ 74.302700] __mutex_lock+0x19c/0x1990
> > [ 74.303682] bdev_open+0x6cd/0xc60
> > [ 74.304613] bdev_file_open_by_dev+0xc4/0x140
> > [ 74.306008] disk_scan_partitions+0x191/0x290
> > [ 74.307716] __add_disk_fwnode+0xd2a/0x1140
> > [ 74.309394] add_disk_fwnode+0x10e/0x220
> > [ 74.311039] nvme_alloc_ns+0x1833/0x2c30
> > [ 74.312669] nvme_scan_ns+0x5a0/0x6f0
> > [ 74.314151] async_run_entry_fn+0x94/0x540
> > [ 74.315719] process_one_work+0x86a/0x14a0
> > [ 74.317287] worker_thread+0x5bb/0xf90
> > [ 74.318228] kthread+0x371/0x720
> > [ 74.319085] ret_from_fork+0x31/0x70
> > [ 74.319941] ret_from_fork_asm+0x1a/0x30
> > [ 74.320808]
> > -> #1 (&set->update_nr_hwq_sema){.+.+}-{4:4}:
> > [ 74.322311] down_read+0x8e/0x470
> > [ 74.323135] elv_iosched_store+0x17a/0x210
> > [ 74.324036] queue_attr_store+0x234/0x340
> > [ 74.324881] kernfs_fop_write_iter+0x39b/0x5a0
> > [ 74.325771] vfs_write+0x5df/0xec0
> > [ 74.326514] ksys_write+0xff/0x200
> > [ 74.327262] do_syscall_64+0x92/0x180
> > [ 74.328018] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 74.328963]
> > -> #0 (kn->active#119){++++}-{0:0}:
> > [ 74.330433] __lock_acquire+0x145f/0x2260
> > [ 74.331329] lock_acquire+0x163/0x300
> > [ 74.332221] kernfs_drain+0x39d/0x450
> > [ 74.333002] __kernfs_remove+0x213/0x680
> > [ 74.333792] kernfs_remove_by_name_ns+0xa2/0x100
> > [ 74.334589] remove_files+0x8d/0x1b0
> > [ 74.335326] sysfs_remove_group+0x7c/0x160
> > [ 74.336118] sysfs_remove_groups+0x55/0xb0
> > [ 74.336869] __kobject_del+0x7d/0x1d0
> > [ 74.337637] kobject_del+0x38/0x60
> > [ 74.338340] blk_unregister_queue+0x153/0x2c0
> > [ 74.339125] __del_gendisk+0x252/0x9d0
> > [ 74.339959] del_gendisk+0xe5/0x180
> > [ 74.340756] sr_remove+0x7b/0xd0
> > [ 74.341429] device_release_driver_internal+0x36d/0x520
> > [ 74.342353] bus_remove_device+0x1ef/0x3f0
> > [ 74.343172] device_del+0x3be/0x9b0
> > [ 74.343951] __scsi_remove_device+0x27f/0x340
> > [ 74.344724] sdev_store_delete+0x87/0x120
> > [ 74.345508] kernfs_fop_write_iter+0x39b/0x5a0
> > [ 74.346287] vfs_write+0x5df/0xec0
> > [ 74.347170] ksys_write+0xff/0x200
> > [ 74.348312] do_syscall_64+0x92/0x180
> > [ 74.349519] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 74.350797]
> > other info that might help us debug this:
> >
> > [ 74.353554] Chain exists of:
> > kn->active#119 --> &disk->open_mutex --> &q->q_usage_counter(queue)#14
> >
> > [ 74.355535] Possible unsafe locking scenario:
> >
> > [ 74.356650] CPU0 CPU1
> > [ 74.357328] ---- ----
> > [ 74.358026] lock(&q->q_usage_counter(queue)#14);
> > [ 74.358749] lock(&disk->open_mutex);
> > [ 74.359561] lock(&q->q_usage_counter(queue)#14);
> > [ 74.360488] lock(kn->active#119);
> > [ 74.361113]
> > *** DEADLOCK ***
> >
> > [ 74.362574] 6 locks held by check/5077:
> > [ 74.363193] #0: ffff888114640420 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0xff/0x200
> > [ 74.364274] #1: ffff88829abb6088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x25b/0x5a0
> > [ 74.365937] #2: ffff8881176ca0e0 (&shost->scan_mutex){+.+.}-{4:4}, at: sdev_store_delete+0x7f/0x120
> > [ 74.367643] #3: ffff88828521c380 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x90/0x520
> > [ 74.369464] #4: ffff8881176ca380 (&set->update_nr_hwq_sema){.+.+}-{4:4}, at: del_gendisk+0xdd/0x180
> > [ 74.370961] #5: ffff88828a661e20 (&q->q_usage_counter(queue)#14){++++}-{0:0}, at: del_gendisk+0xe5/0x180
> > [ 74.372050]
>
> This has baffled me as I don't understand how could read lock in
> elv_iosched_store (ruuning in context #1) depends on (same) read
> lock in add_disk_fwnode (running under another context #2) as
> both locks are represented by the same rw semaphore. As we see
That is why the read lock in elv_iosched_store() is annotated as
nested(new lockdep map) for avoiding the false positive warning, because
the two can't be grabbed at the same time.
> above both elv_iosched_store and add_disk_fwnode bot run under
> different contexts and so ideally they should be able to run
> concurrently acquiring the same read lock.
In theory, it is yes, but reality is that scheduler store attribute
isn't created until disk/queue kobject is added, then switching elevator
can't happen until the kobject/attribute is exposed to userspace, that
is why the nested annotation is correct.
Thanks,
Ming
^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [PATCH V3 15/20] block: fail to show/store elevator sysfs attribute if elevator is dying
2025-04-25 18:36 ` Christoph Hellwig
@ 2025-04-30 1:07 ` Ming Lei
0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-30 1:07 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
Thomas Hellström
On Fri, Apr 25, 2025 at 08:36:14PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 24, 2025 at 11:21:38PM +0800, Ming Lei wrote:
> > + if (test_bit(ELEVATOR_FLAG_DYING, &e->flags))
> > + error = -ENODEV;
> > + else
> > + error = e->type ? entry->show(e, page) : -ENOENT;
>
> Weird style mix, I'd expand the check for ->type to a proper else
> if here as well. But how can e->type actually be NULL here to start
> with?
You are right, e->type can't be NULL, which is assigned in its allocation,
never get cleared.
Thanks,
Ming
^ permalink raw reply [flat|nested] 71+ messages in thread
end of thread, other threads:[~2025-04-30 1:07 UTC | newest]
Thread overview: 71+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
2025-04-24 15:21 ` [PATCH V3 01/20] block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue() Ming Lei
2025-04-24 15:21 ` [PATCH V3 02/20] block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag Ming Lei
2025-04-25 14:29 ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 03/20] block: don't call freeze queue in elevator_switch() and elevator_disable() Ming Lei
2025-04-25 14:29 ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 04/20] block: use q->elevator with ->elevator_lock held in elv_iosched_show() Ming Lei
2025-04-25 6:08 ` Hannes Reinecke
2025-04-25 10:53 ` Nilay Shroff
2025-04-25 14:29 ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 05/20] block: add two helpers for registering/un-registering sched debugfs Ming Lei
2025-04-24 15:21 ` [PATCH V3 06/20] block: move sched debugfs register into elvevator_register_queue Ming Lei
2025-04-24 15:21 ` [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues Ming Lei
2025-04-25 6:33 ` Hannes Reinecke
2025-04-25 11:37 ` Nilay Shroff
2025-04-25 14:33 ` Christoph Hellwig
2025-04-25 16:46 ` kernel test robot
2025-04-24 15:21 ` [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress Ming Lei
2025-04-25 6:33 ` Hannes Reinecke
2025-04-25 12:48 ` Nilay Shroff
2025-04-27 2:27 ` Ming Lei
2025-04-28 16:17 ` Nilay Shroff
2025-04-29 2:43 ` Ming Lei
2025-04-29 10:22 ` Nilay Shroff
2025-04-30 0:54 ` Ming Lei
2025-04-24 15:21 ` [PATCH V3 09/20] block: simplify elevator reattachment for updating nr_hw_queues Ming Lei
2025-04-25 6:34 ` Hannes Reinecke
2025-04-25 18:12 ` Christoph Hellwig
2025-04-29 9:51 ` Ming Lei
2025-04-24 15:21 ` [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait Ming Lei
2025-04-25 6:35 ` Hannes Reinecke
2025-04-25 12:50 ` Nilay Shroff
2025-04-25 14:34 ` Christoph Hellwig
2025-04-28 11:51 ` Ming Lei
2025-04-24 15:21 ` [PATCH V3 11/20] block: move queue freezing & elevator_lock into elevator_change() Ming Lei
2025-04-25 6:36 ` Hannes Reinecke
2025-04-25 12:54 ` Nilay Shroff
2025-04-25 14:35 ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 12/20] block: add `struct elv_change_ctx` for unifying elevator change Ming Lei
2025-04-25 6:38 ` Hannes Reinecke
2025-04-25 18:23 ` Christoph Hellwig
2025-04-29 15:45 ` Ming Lei
2025-04-24 15:21 ` [PATCH V3 13/20] block: " Ming Lei
2025-04-25 6:39 ` Hannes Reinecke
2025-04-25 13:03 ` Nilay Shroff
2025-04-30 0:46 ` Ming Lei
2025-04-25 18:30 ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 14/20] block: pass elevator_queue to elv_register_queue & unregister_queue Ming Lei
2025-04-25 6:40 ` Hannes Reinecke
2025-04-24 15:21 ` [PATCH V3 15/20] block: fail to show/store elevator sysfs attribute if elevator is dying Ming Lei
2025-04-25 6:45 ` Hannes Reinecke
2025-04-25 18:36 ` Christoph Hellwig
2025-04-30 1:07 ` Ming Lei
2025-04-24 15:21 ` [PATCH V3 16/20] block: move elv_register[unregister]_queue out of elevator_lock Ming Lei
2025-04-25 6:46 ` Hannes Reinecke
2025-04-25 13:05 ` Nilay Shroff
2025-04-25 18:37 ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 17/20] block: move debugfs/sysfs register out of freezing queue Ming Lei
2025-04-25 6:47 ` Hannes Reinecke
2025-04-25 18:38 ` Christoph Hellwig
2025-04-28 11:28 ` Ming Lei
2025-04-24 15:21 ` [PATCH V3 18/20] block: remove several ->elevator_lock Ming Lei
2025-04-25 6:48 ` Hannes Reinecke
2025-04-25 18:38 ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 19/20] block: move hctx cpuhp add/del out of queue freezing Ming Lei
2025-04-25 6:49 ` Hannes Reinecke
2025-04-24 15:21 ` [PATCH V3 20/20] block: move wbt_enable_default() out of queue freezing from sched ->exit() Ming Lei
2025-04-25 13:10 ` Nilay Shroff
2025-04-29 10:59 ` Nilay Shroff
2025-04-29 12:00 ` [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Stefan Haberland
2025-04-29 12:11 ` Ming Lei
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox