[PATCH V3 00/20] block: unify elevator changing and fix lockdep warning

public inbox for linux-block@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning
@ 2025-04-24 15:21 Ming Lei
  2025-04-24 15:21 ` [PATCH V3 01/20] block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue() Ming Lei
                   ` (20 more replies)
  0 siblings, 21 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

Hello Jens,

This patchset cleans up elevator change code, and unifying it via single
helper, meantime moves kobject_add/del & debugfs register/unregister out of
queue freezing & elevator_lock. This way fixes many lockdep warnings
reported recently, especially since fs_reclaim is connected with freeze lock
manually by commit ffa1e7ada456 ("block: Make request_queue lockdep splats
show up earlier").


Thanks,
Ming

V3:
	- replace srcu with rw_sem for avoiding race between add/del disk &
	  elevator switch and updating nr_hw_queues (Nilay Shoff)

	- add elv_update_nr_hw_queues() for elevator reattachment in case of
	updating nr_hw_queues, meantime keep elv_change_ctx as local structure
	(Christoph)

	- replace ->elevator_lock with disk->rqos_state_mutex for covering wbt
	state change

	- add new patch "block: use q->elevator with ->elevator_lock held in elv_iosched_show()"

	- small cleanup & commit log improvement

V2:
	- retry add/del disk when blk_mq_update_nr_hw_queues() is in-progress

	- swap blk_mq_add_queue_tag_set() with blk_mq_map_swqueue() in
	blk_mq_init_allocated_queue() (Nilay Shroff)

	- move ELEVATOR_FLAG_DISABLE_WBT to request queue's flags (Nilay Shoff) 

	- fix race because of delaying elevator unregister

	- define flags of `elv_change_ctx` as `bool` (Christoph)

	- improve comment and commit log (Christoph)

Ming Lei (20):
  block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue()
  block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag
  block: don't call freeze queue in elevator_switch() and
    elevator_disable()
  block: use q->elevator with ->elevator_lock held in elv_iosched_show()
  block: add two helpers for registering/un-registering sched debugfs
  block: move sched debugfs register into elvevator_register_queue
  block: prevent adding/deleting disk during updating nr_hw_queues
  block: don't allow to switch elevator if updating nr_hw_queues is
    in-progress
  block: simplify elevator reattachment for updating nr_hw_queues
  block: move blk_unregister_queue() & device_del() after freeze wait
  block: move queue freezing & elevator_lock into elevator_change()
  block: add `struct elv_change_ctx` for unifying elevator change
  block: unifying elevator change
  block: pass elevator_queue to elv_register_queue & unregister_queue
  block: fail to show/store elevator sysfs attribute if elevator is
    dying
  block: move elv_register[unregister]_queue out of elevator_lock
  block: move debugfs/sysfs register out of freezing queue
  block: remove several ->elevator_lock
  block: move hctx cpuhp add/del out of queue freezing
  block: move wbt_enable_default() out of queue freezing from sched
    ->exit()

 block/bfq-iosched.c    |   6 +-
 block/blk-mq-debugfs.c |  12 +-
 block/blk-mq-sched.c   |  41 +++---
 block/blk-mq.c         | 132 +++---------------
 block/blk-sysfs.c      |  24 ++--
 block/blk-wbt.c        |  13 +-
 block/blk.h            |   8 +-
 block/elevator.c       | 302 ++++++++++++++++++++++++++++-------------
 block/elevator.h       |   6 +-
 block/genhd.c          | 129 +++++++++++-------
 include/linux/blk-mq.h |   3 +
 include/linux/blkdev.h |   5 +
 12 files changed, 365 insertions(+), 316 deletions(-)

-- 
2.47.0


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH V3 01/20] block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue()
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
  2025-04-24 15:21 ` [PATCH V3 02/20] block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag Ming Lei
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Hannes Reinecke

Move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue(), and publish
this request queue to tagset after everything is setup.

This way is safe because BLK_MQ_F_TAG_QUEUE_SHARED isn't used by
blk_mq_map_swqueue(), and this flag is mainly checked in fast IO code
path.

Prepare for removing ->elevator_lock from blk_mq_map_swqueue() which
is supposed to be called when elevator switch can't be done.

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Closes: https://lore.kernel.org/linux-block/567cb7ab-23d6-4cee-a915-c8cdac903ddd@linux.ibm.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 554380bfd002..29cfc7ce2e0a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4581,8 +4581,8 @@ int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	q->nr_requests = set->queue_depth;
 
 	blk_mq_init_cpu_queues(q, set->nr_hw_queues);
-	blk_mq_add_queue_tag_set(set, q);
 	blk_mq_map_swqueue(q);
+	blk_mq_add_queue_tag_set(set, q);
 	return 0;
 
 err_hctxs:
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V3 02/20] block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
  2025-04-24 15:21 ` [PATCH V3 01/20] block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue() Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
  2025-04-25 14:29   ` Christoph Hellwig
  2025-04-24 15:21 ` [PATCH V3 03/20] block: don't call freeze queue in elevator_switch() and elevator_disable() Ming Lei
                   ` (18 subsequent siblings)
  20 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Hannes Reinecke

ELEVATOR_FLAG_DISABLE_WBT is only used by BFQ to disallow wbt when BFQ is
in use. The flag is set in BFQ's init(), and cleared in BFQ's exit().

Making it as request queue flag, so that we can avoid to deal with elevator
switch race. Also it isn't graceful to checking one scheduler flag in
wbt_enable_default().

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/bfq-iosched.c    | 4 ++--
 block/blk-mq-debugfs.c | 1 +
 block/blk-wbt.c        | 3 +--
 block/elevator.h       | 1 -
 include/linux/blkdev.h | 3 +++
 5 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index abd80dc13562..cc6f59836dcd 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7210,7 +7210,7 @@ static void bfq_exit_queue(struct elevator_queue *e)
 #endif
 
 	blk_stat_disable_accounting(bfqd->queue);
-	clear_bit(ELEVATOR_FLAG_DISABLE_WBT, &e->flags);
+	blk_queue_flag_clear(QUEUE_FLAG_DISABLE_WBT_DEF, bfqd->queue);
 	wbt_enable_default(bfqd->queue->disk);
 
 	kfree(bfqd);
@@ -7397,7 +7397,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	/* We dispatch from request queue wide instead of hw queue */
 	blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q);
 
-	set_bit(ELEVATOR_FLAG_DISABLE_WBT, &eq->flags);
+	blk_queue_flag_set(QUEUE_FLAG_DISABLE_WBT_DEF, q);
 	wbt_disable_default(q->disk);
 	blk_stat_enable_accounting(q);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 3421b5521fe2..7710c409e432 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -93,6 +93,7 @@ static const char *const blk_queue_flag_name[] = {
 	QUEUE_FLAG_NAME(RQ_ALLOC_TIME),
 	QUEUE_FLAG_NAME(HCTX_ACTIVE),
 	QUEUE_FLAG_NAME(SQ_SCHED),
+	QUEUE_FLAG_NAME(DISABLE_WBT_DEF),
 };
 #undef QUEUE_FLAG_NAME
 
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index f1754d07f7e0..29cd2e33666f 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -704,8 +704,7 @@ void wbt_enable_default(struct gendisk *disk)
 	struct rq_qos *rqos;
 	bool enable = IS_ENABLED(CONFIG_BLK_WBT_MQ);
 
-	if (q->elevator &&
-	    test_bit(ELEVATOR_FLAG_DISABLE_WBT, &q->elevator->flags))
+	if (blk_queue_disable_wbt(q))
 		enable = false;
 
 	/* Throttling already enabled? */
diff --git a/block/elevator.h b/block/elevator.h
index e4e44dfac503..e27af5492cdb 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -121,7 +121,6 @@ struct elevator_queue
 };
 
 #define ELEVATOR_FLAG_REGISTERED	0
-#define ELEVATOR_FLAG_DISABLE_WBT	1
 
 /*
  * block elevator interface
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 678dc38442bf..228b2e5ad493 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -644,6 +644,7 @@ enum {
 	QUEUE_FLAG_RQ_ALLOC_TIME,	/* record rq->alloc_time_ns */
 	QUEUE_FLAG_HCTX_ACTIVE,		/* at least one blk-mq hctx is active */
 	QUEUE_FLAG_SQ_SCHED,		/* single queue style io dispatch */
+	QUEUE_FLAG_DISABLE_WBT_DEF,	/* for sched to disable/enable wbt */
 	QUEUE_FLAG_MAX
 };
 
@@ -679,6 +680,8 @@ void blk_queue_flag_clear(unsigned int flag, struct request_queue *q);
 #define blk_queue_sq_sched(q)	test_bit(QUEUE_FLAG_SQ_SCHED, &(q)->queue_flags)
 #define blk_queue_skip_tagset_quiesce(q) \
 	((q)->limits.features & BLK_FEAT_SKIP_TAGSET_QUIESCE)
+#define blk_queue_disable_wbt(q)	\
+	test_bit(QUEUE_FLAG_DISABLE_WBT_DEF, &(q)->queue_flags)
 
 extern void blk_set_pm_only(struct request_queue *q);
 extern void blk_clear_pm_only(struct request_queue *q);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V3 03/20] block: don't call freeze queue in elevator_switch() and elevator_disable()
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
  2025-04-24 15:21 ` [PATCH V3 01/20] block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue() Ming Lei
  2025-04-24 15:21 ` [PATCH V3 02/20] block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
  2025-04-25 14:29   ` Christoph Hellwig
  2025-04-24 15:21 ` [PATCH V3 04/20] block: use q->elevator with ->elevator_lock held in elv_iosched_show() Ming Lei
                   ` (17 subsequent siblings)
  20 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Hannes Reinecke

Both elevator_switch() and elevator_disable() are only called from the
two code paths, in which queue is guaranteed to be frozen.

So don't call freeze queue in the two functions, also add asserts for
queue freeze.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index b4d08026b02c..5051a98dc08c 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -615,12 +615,11 @@ void elevator_init_mq(struct request_queue *q)
  */
 int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 {
-	unsigned int memflags;
 	int ret;
 
+	WARN_ON_ONCE(q->mq_freeze_depth == 0);
 	lockdep_assert_held(&q->elevator_lock);
 
-	memflags = blk_mq_freeze_queue(q);
 	blk_mq_quiesce_queue(q);
 
 	if (q->elevator) {
@@ -641,7 +640,6 @@ int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 
 out_unfreeze:
 	blk_mq_unquiesce_queue(q);
-	blk_mq_unfreeze_queue(q, memflags);
 
 	if (ret) {
 		pr_warn("elv: switch to \"%s\" failed, falling back to \"none\"\n",
@@ -653,11 +651,9 @@ int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 
 void elevator_disable(struct request_queue *q)
 {
-	unsigned int memflags;
-
+	WARN_ON_ONCE(q->mq_freeze_depth == 0);
 	lockdep_assert_held(&q->elevator_lock);
 
-	memflags = blk_mq_freeze_queue(q);
 	blk_mq_quiesce_queue(q);
 
 	elv_unregister_queue(q);
@@ -668,7 +664,6 @@ void elevator_disable(struct request_queue *q)
 	blk_add_trace_msg(q, "elv switch: none");
 
 	blk_mq_unquiesce_queue(q);
-	blk_mq_unfreeze_queue(q, memflags);
 }
 
 /*
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V3 04/20] block: use q->elevator with ->elevator_lock held in elv_iosched_show()
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (2 preceding siblings ...)
  2025-04-24 15:21 ` [PATCH V3 03/20] block: don't call freeze queue in elevator_switch() and elevator_disable() Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
  2025-04-25  6:08   ` Hannes Reinecke
                     ` (2 more replies)
  2025-04-24 15:21 ` [PATCH V3 05/20] block: add two helpers for registering/un-registering sched debugfs Ming Lei
                   ` (16 subsequent siblings)
  20 siblings, 3 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

Use q->elevator with ->elevator_lock held in elv_iosched_show(), since
the local cached elevator reference may become stale after getting
->elevator_lock.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index 5051a98dc08c..b32815594892 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -739,7 +739,6 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 ssize_t elv_iosched_show(struct gendisk *disk, char *name)
 {
 	struct request_queue *q = disk->queue;
-	struct elevator_queue *eq = q->elevator;
 	struct elevator_type *cur = NULL, *e;
 	int len = 0;
 
@@ -748,7 +747,7 @@ ssize_t elv_iosched_show(struct gendisk *disk, char *name)
 		len += sprintf(name+len, "[none] ");
 	} else {
 		len += sprintf(name+len, "none ");
-		cur = eq->type;
+		cur = q->elevator->type;
 	}
 
 	spin_lock(&elv_list_lock);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V3 05/20] block: add two helpers for registering/un-registering sched debugfs
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (3 preceding siblings ...)
  2025-04-24 15:21 ` [PATCH V3 04/20] block: use q->elevator with ->elevator_lock held in elv_iosched_show() Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
  2025-04-24 15:21 ` [PATCH V3 06/20] block: move sched debugfs register into elvevator_register_queue Ming Lei
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Yu Kuai, Hannes Reinecke

Add blk_mq_sched_reg_debugfs()/blk_mq_sched_unreg_debugfs() to clean up
sched init/exit code a bit.

Register & unregister debugfs for sched & sched_hctx order is changed a
bit, but it is safe because sched & sched_hctx is guaranteed to be ready
when exporting via debugfs.

Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 45 +++++++++++++++++++++++++++++---------------
 1 file changed, 30 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 9b81771774ef..2abc5e0704e8 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -434,6 +434,30 @@ static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
 	return 0;
 }
 
+static void blk_mq_sched_reg_debugfs(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned long i;
+
+	mutex_lock(&q->debugfs_mutex);
+	blk_mq_debugfs_register_sched(q);
+	queue_for_each_hw_ctx(q, hctx, i)
+		blk_mq_debugfs_register_sched_hctx(q, hctx);
+	mutex_unlock(&q->debugfs_mutex);
+}
+
+static void blk_mq_sched_unreg_debugfs(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned long i;
+
+	mutex_lock(&q->debugfs_mutex);
+	queue_for_each_hw_ctx(q, hctx, i)
+		blk_mq_debugfs_unregister_sched_hctx(hctx);
+	blk_mq_debugfs_unregister_sched(q);
+	mutex_unlock(&q->debugfs_mutex);
+}
+
 /* caller must have a reference to @e, will grab another one if successful */
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 {
@@ -467,10 +491,6 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	if (ret)
 		goto err_free_map_and_rqs;
 
-	mutex_lock(&q->debugfs_mutex);
-	blk_mq_debugfs_register_sched(q);
-	mutex_unlock(&q->debugfs_mutex);
-
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (e->ops.init_hctx) {
 			ret = e->ops.init_hctx(hctx, i);
@@ -482,11 +502,11 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 				return ret;
 			}
 		}
-		mutex_lock(&q->debugfs_mutex);
-		blk_mq_debugfs_register_sched_hctx(q, hctx);
-		mutex_unlock(&q->debugfs_mutex);
 	}
 
+	/* sched is initialized, it is ready to export it via debugfs */
+	blk_mq_sched_reg_debugfs(q);
+
 	return 0;
 
 err_free_map_and_rqs:
@@ -524,11 +544,10 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
 	unsigned long i;
 	unsigned int flags = 0;
 
-	queue_for_each_hw_ctx(q, hctx, i) {
-		mutex_lock(&q->debugfs_mutex);
-		blk_mq_debugfs_unregister_sched_hctx(hctx);
-		mutex_unlock(&q->debugfs_mutex);
+	/* unexport via debugfs before exiting sched */
+	blk_mq_sched_unreg_debugfs(q);
 
+	queue_for_each_hw_ctx(q, hctx, i) {
 		if (e->type->ops.exit_hctx && hctx->sched_data) {
 			e->type->ops.exit_hctx(hctx, i);
 			hctx->sched_data = NULL;
@@ -536,10 +555,6 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
 		flags = hctx->flags;
 	}
 
-	mutex_lock(&q->debugfs_mutex);
-	blk_mq_debugfs_unregister_sched(q);
-	mutex_unlock(&q->debugfs_mutex);
-
 	if (e->type->ops.exit_sched)
 		e->type->ops.exit_sched(e);
 	blk_mq_sched_tags_teardown(q, flags);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V3 06/20] block: move sched debugfs register into elvevator_register_queue
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (4 preceding siblings ...)
  2025-04-24 15:21 ` [PATCH V3 05/20] block: add two helpers for registering/un-registering sched debugfs Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
  2025-04-24 15:21 ` [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues Ming Lei
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Yu Kuai, Hannes Reinecke

sched debugfs shares same lifetime with scheduler's kobject, and same
lock(elevator lock), so move sched debugfs register/unregister into
elevator_register_queue() and elevator_unregister_queue().

Then we needn't blk_mq_debugfs_register() for us to register sched
debugfs any more.

Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c | 11 -----------
 block/blk-mq-sched.c   | 11 ++---------
 block/elevator.c       |  8 ++++++++
 block/elevator.h       |  3 +++
 4 files changed, 13 insertions(+), 20 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 7710c409e432..2837a8ce8054 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -625,20 +625,9 @@ void blk_mq_debugfs_register(struct request_queue *q)
 
 	debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);
 
-	/*
-	 * blk_mq_init_sched() attempted to do this already, but q->debugfs_dir
-	 * didn't exist yet (because we don't know what to name the directory
-	 * until the queue is registered to a gendisk).
-	 */
-	if (q->elevator && !q->sched_debugfs_dir)
-		blk_mq_debugfs_register_sched(q);
-
-	/* Similarly, blk_mq_init_hctx() couldn't do this previously. */
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx->debugfs_dir)
 			blk_mq_debugfs_register_hctx(q, hctx);
-		if (q->elevator && !hctx->sched_debugfs_dir)
-			blk_mq_debugfs_register_sched_hctx(q, hctx);
 	}
 
 	if (q->rq_qos) {
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 2abc5e0704e8..336a15ffecfa 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -434,7 +434,7 @@ static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
 	return 0;
 }
 
-static void blk_mq_sched_reg_debugfs(struct request_queue *q)
+void blk_mq_sched_reg_debugfs(struct request_queue *q)
 {
 	struct blk_mq_hw_ctx *hctx;
 	unsigned long i;
@@ -446,7 +446,7 @@ static void blk_mq_sched_reg_debugfs(struct request_queue *q)
 	mutex_unlock(&q->debugfs_mutex);
 }
 
-static void blk_mq_sched_unreg_debugfs(struct request_queue *q)
+void blk_mq_sched_unreg_debugfs(struct request_queue *q)
 {
 	struct blk_mq_hw_ctx *hctx;
 	unsigned long i;
@@ -503,10 +503,6 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 			}
 		}
 	}
-
-	/* sched is initialized, it is ready to export it via debugfs */
-	blk_mq_sched_reg_debugfs(q);
-
 	return 0;
 
 err_free_map_and_rqs:
@@ -544,9 +540,6 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
 	unsigned long i;
 	unsigned int flags = 0;
 
-	/* unexport via debugfs before exiting sched */
-	blk_mq_sched_unreg_debugfs(q);
-
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (e->type->ops.exit_hctx && hctx->sched_data) {
 			e->type->ops.exit_hctx(hctx, i);
diff --git a/block/elevator.c b/block/elevator.c
index b32815594892..4400eb8fe54f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -472,6 +472,11 @@ int elv_register_queue(struct request_queue *q, bool uevent)
 		if (uevent)
 			kobject_uevent(&e->kobj, KOBJ_ADD);
 
+		/*
+		 * Sched is initialized, it is ready to export it via
+		 * debugfs
+		 */
+		blk_mq_sched_reg_debugfs(q);
 		set_bit(ELEVATOR_FLAG_REGISTERED, &e->flags);
 	}
 	return error;
@@ -486,6 +491,9 @@ void elv_unregister_queue(struct request_queue *q)
 	if (e && test_and_clear_bit(ELEVATOR_FLAG_REGISTERED, &e->flags)) {
 		kobject_uevent(&e->kobj, KOBJ_REMOVE);
 		kobject_del(&e->kobj);
+
+		/* unexport via debugfs before exiting sched */
+		blk_mq_sched_unreg_debugfs(q);
 	}
 }
 
diff --git a/block/elevator.h b/block/elevator.h
index e27af5492cdb..9198676644a9 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -181,4 +181,7 @@ extern struct request *elv_rb_find(struct rb_root *, sector_t);
 #define rq_entry_fifo(ptr)	list_entry((ptr), struct request, queuelist)
 #define rq_fifo_clear(rq)	list_del_init(&(rq)->queuelist)
 
+void blk_mq_sched_reg_debugfs(struct request_queue *q);
+void blk_mq_sched_unreg_debugfs(struct request_queue *q);
+
 #endif /* _ELEVATOR_H */
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (5 preceding siblings ...)
  2025-04-24 15:21 ` [PATCH V3 06/20] block: move sched debugfs register into elvevator_register_queue Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
  2025-04-25  6:33   ` Hannes Reinecke
                     ` (3 more replies)
  2025-04-24 15:21 ` [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress Ming Lei
                   ` (13 subsequent siblings)
  20 siblings, 4 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

Both adding/deleting disk code are reader of `nr_hw_queues`, so we can't
allow them in-progress when updating nr_hw_queues, kernel panic and
kasan has been reported in [1].

Prevent adding/deleting disk during updating nr_hw_queues by adding
rw_semaphore to tagset, write lock is grabbed in blk_mq_update_nr_hw_queues(),
and read lock is acquired when adding/deleting disk.

Also mark GFP_NOIO allocation scope for adding/deleting disk because
blk_mq_update_nr_hw_queues() is part of some driver's error handler.

This way avoids lot of trouble.

Suggested-by: Nilay Shroff <nilay@linux.ibm.com>
Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Closes: https://lore.kernel.org/linux-block/a5896cdb-a59a-4a37-9f99-20522f5d2987@linux.ibm.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         |  4 ++
 block/genhd.c          | 94 +++++++++++++++++++++++++++++++-----------
 include/linux/blk-mq.h |  3 ++
 3 files changed, 78 insertions(+), 23 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 29cfc7ce2e0a..1ed2d183f912 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4802,6 +4802,8 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 			goto out_free_srcu;
 	}
 
+	init_rwsem(&set->update_nr_hwq_sema);
+
 	ret = -ENOMEM;
 	set->tags = kcalloc_node(set->nr_hw_queues,
 				 sizeof(struct blk_mq_tags *), GFP_KERNEL,
@@ -5097,9 +5099,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 {
+	down_write(&set->update_nr_hwq_sema);
 	mutex_lock(&set->tag_list_lock);
 	__blk_mq_update_nr_hw_queues(set, nr_hw_queues);
 	mutex_unlock(&set->tag_list_lock);
+	up_write(&set->update_nr_hwq_sema);
 }
 EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
 
diff --git a/block/genhd.c b/block/genhd.c
index c2bd86cd09de..7f3ae3d23b26 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -399,9 +399,9 @@ int disk_scan_partitions(struct gendisk *disk, blk_mode_t mode)
  * This function registers the partitioning information in @disk
  * with the kernel. Also attach a fwnode to the disk device.
  */
-int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
-				 const struct attribute_group **groups,
-				 struct fwnode_handle *fwnode)
+static int __add_disk_fwnode(struct device *parent, struct gendisk *disk,
+			     const struct attribute_group **groups,
+			     struct fwnode_handle *fwnode)
 
 {
 	struct device *ddev = disk_to_dev(disk);
@@ -572,6 +572,37 @@ int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
 	}
 	return ret;
 }
+
+/**
+ * add_disk_fwnode - add disk information to kernel list with fwnode
+ * @parent: parent device for the disk
+ * @disk: per-device partitioning information
+ * @groups: Additional per-device sysfs groups
+ * @fwnode: attached disk fwnode
+ *
+ * This function registers the partitioning information in @disk
+ * with the kernel. Also attach a fwnode to the disk device.
+ */
+int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
+				 const struct attribute_group **groups,
+				 struct fwnode_handle *fwnode)
+{
+	struct blk_mq_tag_set *set;
+	unsigned int memflags;
+	int ret;
+
+	if (!queue_is_mq(disk->queue))
+		return __add_disk_fwnode(parent, disk, groups, fwnode);
+
+	set = disk->queue->tag_set;
+	memflags = memalloc_noio_save();
+	down_read(&set->update_nr_hwq_sema);
+	ret = __add_disk_fwnode(parent, disk, groups, fwnode);
+	up_read(&set->update_nr_hwq_sema);
+	memalloc_noio_restore(memflags);
+
+	return ret;
+}
 EXPORT_SYMBOL_GPL(add_disk_fwnode);
 
 /**
@@ -652,26 +683,7 @@ void blk_mark_disk_dead(struct gendisk *disk)
 }
 EXPORT_SYMBOL_GPL(blk_mark_disk_dead);
 
-/**
- * del_gendisk - remove the gendisk
- * @disk: the struct gendisk to remove
- *
- * Removes the gendisk and all its associated resources. This deletes the
- * partitions associated with the gendisk, and unregisters the associated
- * request_queue.
- *
- * This is the counter to the respective __device_add_disk() call.
- *
- * The final removal of the struct gendisk happens when its refcount reaches 0
- * with put_disk(), which should be called after del_gendisk(), if
- * __device_add_disk() was used.
- *
- * Drivers exist which depend on the release of the gendisk to be synchronous,
- * it should not be deferred.
- *
- * Context: can sleep
- */
-void del_gendisk(struct gendisk *disk)
+static void __del_gendisk(struct gendisk *disk)
 {
 	struct request_queue *q = disk->queue;
 	struct block_device *part;
@@ -764,6 +776,42 @@ void del_gendisk(struct gendisk *disk)
 	if (start_drain)
 		blk_unfreeze_release_lock(q);
 }
+
+/**
+ * del_gendisk - remove the gendisk
+ * @disk: the struct gendisk to remove
+ *
+ * Removes the gendisk and all its associated resources. This deletes the
+ * partitions associated with the gendisk, and unregisters the associated
+ * request_queue.
+ *
+ * This is the counter to the respective __device_add_disk() call.
+ *
+ * The final removal of the struct gendisk happens when its refcount reaches 0
+ * with put_disk(), which should be called after del_gendisk(), if
+ * __device_add_disk() was used.
+ *
+ * Drivers exist which depend on the release of the gendisk to be synchronous,
+ * it should not be deferred.
+ *
+ * Context: can sleep
+ */
+void del_gendisk(struct gendisk *disk)
+{
+	struct blk_mq_tag_set *set;
+	unsigned int memflags;
+
+	if (!queue_is_mq(disk->queue)) {
+		__del_gendisk(disk);
+	} else {
+		set = disk->queue->tag_set;
+		memflags = memalloc_noio_save();
+		down_read(&set->update_nr_hwq_sema);
+		__del_gendisk(disk);
+		up_read(&set->update_nr_hwq_sema);
+		memalloc_noio_restore(memflags);
+	}
+}
 EXPORT_SYMBOL(del_gendisk);
 
 /**
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8eb9b3310167..28bc03b2b0dc 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -9,6 +9,7 @@
 #include <linux/prefetch.h>
 #include <linux/srcu.h>
 #include <linux/rw_hint.h>
+#include <linux/rwsem.h>
 
 struct blk_mq_tags;
 struct blk_flush_queue;
@@ -527,6 +528,8 @@ struct blk_mq_tag_set {
 	struct mutex		tag_list_lock;
 	struct list_head	tag_list;
 	struct srcu_struct	*srcu;
+
+	struct rw_semaphore	update_nr_hwq_sema;
 };
 
 /**
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (6 preceding siblings ...)
  2025-04-24 15:21 ` [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
  2025-04-25  6:33   ` Hannes Reinecke
  2025-04-25 12:48   ` Nilay Shroff
  2025-04-24 15:21 ` [PATCH V3 09/20] block: simplify elevator reattachment for updating nr_hw_queues Ming Lei
                   ` (12 subsequent siblings)
  20 siblings, 2 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

Elevator switch code is another `nr_hw_queue` reader in non-fast-IO code
path, so it can't be done if updating `nr_hw_queues` is in-progress.

Take same approach with not allowing add/del disk when updating
nr_hw_queues is in-progress, by grabbing read lock of
set->update_nr_hwq_sema.

Take the nested variant for avoiding the following false positive
splat[1], and this way is correct because:

- the read lock in elv_iosched_store() is not overlapped with the read lock
in adding/deleting disk:

- kobject attribute is only available after the kobject is added and
before it is deleted

  -> #4 (&q->q_usage_counter(queue){++++}-{0:0}:
  -> #3 (&q->limits_lock){+.+.}-{4:4}:
  -> #2 (&disk->open_mutex){+.+.}-{4:4}:
  -> #1 (&set->update_nr_hwq_lock){.+.+}-{4:4}:
  -> #0 (kn->active#103){++++}-{0:0}:

Link: https://lore.kernel.org/linux-block/aAWv3NPtNIKKvJZc@fedora/ [1]
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/elevator.c b/block/elevator.c
index 4400eb8fe54f..56da6ab7691a 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -723,6 +723,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 	int ret;
 	unsigned int memflags;
 	struct request_queue *q = disk->queue;
+	struct blk_mq_tag_set *set = q->tag_set;
 
 	/*
 	 * If the attribute needs to load a module, do it before freezing the
@@ -734,6 +735,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 
 	elv_iosched_load_module(name);
 
+	down_read_nested(&set->update_nr_hwq_sema, 1);
 	memflags = blk_mq_freeze_queue(q);
 	mutex_lock(&q->elevator_lock);
 	ret = elevator_change(q, name);
@@ -741,6 +743,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 		ret = count;
 	mutex_unlock(&q->elevator_lock);
 	blk_mq_unfreeze_queue(q, memflags);
+	up_read(&set->update_nr_hwq_sema);
 	return ret;
 }
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V3 09/20] block: simplify elevator reattachment for updating nr_hw_queues
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (7 preceding siblings ...)
  2025-04-24 15:21 ` [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
  2025-04-25  6:34   ` Hannes Reinecke
  2025-04-25 18:12   ` Christoph Hellwig
  2025-04-24 15:21 ` [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait Ming Lei
                   ` (11 subsequent siblings)
  20 siblings, 2 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

In blk_mq_update_nr_hw_queues(), nr_hw_queues changes and elevator data
depends on it, so elevator has to be reattached.

Now elevator switch isn't allowed during blk_mq_update_nr_hw_queues(), so
we can simply call elevator_change() to reattach elevator sched tags after
nr_hw_queues is updated.

Add elv_update_nr_hw_queues() for blk_mq_update_nr_hw_queues() to
reattach elevator.

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c   | 90 +-----------------------------------------------
 block/blk.h      |  3 +-
 block/elevator.c | 32 +++++++++++++----
 3 files changed, 28 insertions(+), 97 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1ed2d183f912..3afcddd21586 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4943,88 +4943,10 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 	return ret;
 }
 
-/*
- * request_queue and elevator_type pair.
- * It is just used by __blk_mq_update_nr_hw_queues to cache
- * the elevator_type associated with a request_queue.
- */
-struct blk_mq_qe_pair {
-	struct list_head node;
-	struct request_queue *q;
-	struct elevator_type *type;
-};
-
-/*
- * Cache the elevator_type in qe pair list and switch the
- * io scheduler to 'none'
- */
-static bool blk_mq_elv_switch_none(struct list_head *head,
-		struct request_queue *q)
-{
-	struct blk_mq_qe_pair *qe;
-
-	qe = kmalloc(sizeof(*qe), GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY);
-	if (!qe)
-		return false;
-
-	/* Accessing q->elevator needs protection from ->elevator_lock. */
-	mutex_lock(&q->elevator_lock);
-
-	if (!q->elevator) {
-		kfree(qe);
-		goto unlock;
-	}
-
-	INIT_LIST_HEAD(&qe->node);
-	qe->q = q;
-	qe->type = q->elevator->type;
-	/* keep a reference to the elevator module as we'll switch back */
-	__elevator_get(qe->type);
-	list_add(&qe->node, head);
-	elevator_disable(q);
-unlock:
-	mutex_unlock(&q->elevator_lock);
-
-	return true;
-}
-
-static struct blk_mq_qe_pair *blk_lookup_qe_pair(struct list_head *head,
-						struct request_queue *q)
-{
-	struct blk_mq_qe_pair *qe;
-
-	list_for_each_entry(qe, head, node)
-		if (qe->q == q)
-			return qe;
-
-	return NULL;
-}
-
-static void blk_mq_elv_switch_back(struct list_head *head,
-				  struct request_queue *q)
-{
-	struct blk_mq_qe_pair *qe;
-	struct elevator_type *t;
-
-	qe = blk_lookup_qe_pair(head, q);
-	if (!qe)
-		return;
-	t = qe->type;
-	list_del(&qe->node);
-	kfree(qe);
-
-	mutex_lock(&q->elevator_lock);
-	elevator_switch(q, t);
-	/* drop the reference acquired in blk_mq_elv_switch_none */
-	elevator_put(t);
-	mutex_unlock(&q->elevator_lock);
-}
-
 static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 							int nr_hw_queues)
 {
 	struct request_queue *q;
-	LIST_HEAD(head);
 	int prev_nr_hw_queues = set->nr_hw_queues;
 	unsigned int memflags;
 	int i;
@@ -5042,15 +4964,6 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
 		blk_mq_freeze_queue_nomemsave(q);
 
-	/*
-	 * Switch IO scheduler to 'none', cleaning up the data associated
-	 * with the previous scheduler. We will switch back once we are done
-	 * updating the new sw to hw queue mappings.
-	 */
-	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		if (!blk_mq_elv_switch_none(&head, q))
-			goto switch_back;
-
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_debugfs_unregister_hctxs(q);
 		blk_mq_sysfs_unregister_hctxs(q);
@@ -5084,9 +4997,8 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		blk_mq_debugfs_register_hctxs(q);
 	}
 
-switch_back:
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		blk_mq_elv_switch_back(&head, q);
+		elv_update_nr_hw_queues(q);
 
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
 		blk_mq_unfreeze_queue_nomemrestore(q);
diff --git a/block/blk.h b/block/blk.h
index 006e3be433d2..2969f4427996 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -319,8 +319,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
 
 bool blk_insert_flush(struct request *rq);
 
-int elevator_switch(struct request_queue *q, struct elevator_type *new_e);
-void elevator_disable(struct request_queue *q);
+void elv_update_nr_hw_queues(struct request_queue *q);
 void elevator_exit(struct request_queue *q);
 int elv_register_queue(struct request_queue *q, bool uevent);
 void elv_unregister_queue(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index 56da6ab7691a..5705f7056516 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -621,7 +621,7 @@ void elevator_init_mq(struct request_queue *q)
  * If switching fails, we are most likely running out of memory and not able
  * to restore the old io scheduler, so leaving the io scheduler being none.
  */
-int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
+static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 {
 	int ret;
 
@@ -657,7 +657,7 @@ int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	return ret;
 }
 
-void elevator_disable(struct request_queue *q)
+static void elevator_disable(struct request_queue *q)
 {
 	WARN_ON_ONCE(q->mq_freeze_depth == 0);
 	lockdep_assert_held(&q->elevator_lock);
@@ -677,7 +677,8 @@ void elevator_disable(struct request_queue *q)
 /*
  * Switch this queue to the given IO scheduler.
  */
-static int elevator_change(struct request_queue *q, const char *elevator_name)
+static int __elevator_change(struct request_queue *q,
+			     const char *elevator_name)
 {
 	struct elevator_type *e;
 	int ret;
@@ -692,9 +693,6 @@ static int elevator_change(struct request_queue *q, const char *elevator_name)
 		return 0;
 	}
 
-	if (q->elevator && elevator_match(q->elevator->type, elevator_name))
-		return 0;
-
 	e = elevator_find_get(elevator_name);
 	if (!e)
 		return -EINVAL;
@@ -703,6 +701,28 @@ static int elevator_change(struct request_queue *q, const char *elevator_name)
 	return ret;
 }
 
+static int elevator_change(struct request_queue *q, const char *elevator_name)
+{
+	if (!q->elevator || !elevator_match(q->elevator->type, elevator_name))
+		return __elevator_change(q, elevator_name);
+	return 0;
+}
+
+/*
+ * The I/O scheduler depends on the number of hardware queues, this forces a
+ * reattachment when nr_hw_queues changes.
+ */
+void elv_update_nr_hw_queues(struct request_queue *q)
+{
+	const char *name = "none";
+
+	mutex_lock(&q->elevator_lock);
+	if (q->elevator && !blk_queue_dying(q))
+		name = q->elevator->type->elevator_name;
+	__elevator_change(q, name);
+	mutex_unlock(&q->elevator_lock);
+}
+
 static void elv_iosched_load_module(char *elevator_name)
 {
 	struct elevator_type *found;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (8 preceding siblings ...)
  2025-04-24 15:21 ` [PATCH V3 09/20] block: simplify elevator reattachment for updating nr_hw_queues Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
  2025-04-25  6:35   ` Hannes Reinecke
                     ` (2 more replies)
  2025-04-24 15:21 ` [PATCH V3 11/20] block: move queue freezing & elevator_lock into elevator_change() Ming Lei
                   ` (10 subsequent siblings)
  20 siblings, 3 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

Move blk_unregister_queue() & device_del() after freeze wait, and prepare
for unifying elevator switch.

This way is just fine, since bdev has been unhashed at the beginning of
del_gendisk(), both blk_unregister_queue() & device_del() are dealing
with kobject & debugfs thing only.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/genhd.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 7f3ae3d23b26..53c5b1e9e1d2 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -733,8 +733,6 @@ static void __del_gendisk(struct gendisk *disk)
 		bdi_unregister(disk->bdi);
 	}
 
-	blk_unregister_queue(disk);
-
 	kobject_put(disk->part0->bd_holder_dir);
 	kobject_put(disk->slave_dir);
 	disk->slave_dir = NULL;
@@ -743,10 +741,12 @@ static void __del_gendisk(struct gendisk *disk)
 	disk->part0->bd_stamp = 0;
 	sysfs_remove_link(block_depr, dev_name(disk_to_dev(disk)));
 	pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
-	device_del(disk_to_dev(disk));
 
 	blk_mq_freeze_queue_wait(q);
 
+	blk_unregister_queue(disk);
+	device_del(disk_to_dev(disk));
+
 	blk_throtl_cancel_bios(disk);
 
 	blk_sync_queue(q);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V3 11/20] block: move queue freezing & elevator_lock into elevator_change()
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (9 preceding siblings ...)
  2025-04-24 15:21 ` [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
  2025-04-25  6:36   ` Hannes Reinecke
                     ` (2 more replies)
  2025-04-24 15:21 ` [PATCH V3 12/20] block: add `struct elv_change_ctx` for unifying elevator change Ming Lei
                   ` (9 subsequent siblings)
  20 siblings, 3 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

Move queue freezing & elevator_lock into elevator_change(), and prepare
for using elevator_change() for setting up & tearing down default elevator
too.

Also add lockdep_assert_held() in __elevator_change() because either
read or write lock is required for changing elevator.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index 5705f7056516..e70cd4b828a5 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -683,6 +683,8 @@ static int __elevator_change(struct request_queue *q,
 	struct elevator_type *e;
 	int ret;
 
+	lockdep_assert_held(&q->tag_set->update_nr_hwq_sema);
+
 	/* Make sure queue is not in the middle of being removed */
 	if (!blk_queue_registered(q))
 		return -ENOENT;
@@ -703,9 +705,16 @@ static int __elevator_change(struct request_queue *q,
 
 static int elevator_change(struct request_queue *q, const char *elevator_name)
 {
+	unsigned int memflags;
+	int ret = 0;
+
+	memflags = blk_mq_freeze_queue(q);
+	mutex_lock(&q->elevator_lock);
 	if (!q->elevator || !elevator_match(q->elevator->type, elevator_name))
-		return __elevator_change(q, elevator_name);
-	return 0;
+		ret = __elevator_change(q, elevator_name);
+	mutex_unlock(&q->elevator_lock);
+	blk_mq_unfreeze_queue(q, memflags);
+	return ret;
 }
 
 /*
@@ -741,7 +750,6 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 	char elevator_name[ELV_NAME_MAX];
 	char *name;
 	int ret;
-	unsigned int memflags;
 	struct request_queue *q = disk->queue;
 	struct blk_mq_tag_set *set = q->tag_set;
 
@@ -756,13 +764,9 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 	elv_iosched_load_module(name);
 
 	down_read_nested(&set->update_nr_hwq_sema, 1);
-	memflags = blk_mq_freeze_queue(q);
-	mutex_lock(&q->elevator_lock);
 	ret = elevator_change(q, name);
 	if (!ret)
 		ret = count;
-	mutex_unlock(&q->elevator_lock);
-	blk_mq_unfreeze_queue(q, memflags);
 	up_read(&set->update_nr_hwq_sema);
 	return ret;
 }
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V3 12/20] block: add `struct elv_change_ctx` for unifying elevator change
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (10 preceding siblings ...)
  2025-04-24 15:21 ` [PATCH V3 11/20] block: move queue freezing & elevator_lock into elevator_change() Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
  2025-04-25  6:38   ` Hannes Reinecke
  2025-04-25 18:23   ` Christoph Hellwig
  2025-04-24 15:21 ` [PATCH V3 13/20] block: " Ming Lei
                   ` (8 subsequent siblings)
  20 siblings, 2 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

Add `struct elv_change_ctx` and prepare for unifying elevator change by
[__]elevator_change(). With this way, any input & output parameter can
be provided & observed in top helper.

This way helps to move kobject add/delete & debugfs register/unregister
out of ->elevator_lock & freezing queue.

Reviewed-by : Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 44 +++++++++++++++++++++++++++++---------------
 1 file changed, 29 insertions(+), 15 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index e70cd4b828a5..4be318b0b4ef 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -45,6 +45,12 @@
 #include "blk-wbt.h"
 #include "blk-cgroup.h"
 
+/* Holding context data for changing elevator */
+struct elv_change_ctx {
+	const char *name;
+	bool uevent;
+};
+
 static DEFINE_SPINLOCK(elv_list_lock);
 static LIST_HEAD(elv_list);
 
@@ -621,7 +627,8 @@ void elevator_init_mq(struct request_queue *q)
  * If switching fails, we are most likely running out of memory and not able
  * to restore the old io scheduler, so leaving the io scheduler being none.
  */
-static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
+static int elevator_switch(struct request_queue *q, struct elevator_type *new_e,
+			   struct elv_change_ctx *ctx)
 {
 	int ret;
 
@@ -639,7 +646,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	if (ret)
 		goto out_unfreeze;
 
-	ret = elv_register_queue(q, true);
+	ret = elv_register_queue(q, ctx->uevent);
 	if (ret) {
 		elevator_exit(q);
 		goto out_unfreeze;
@@ -678,8 +685,9 @@ static void elevator_disable(struct request_queue *q)
  * Switch this queue to the given IO scheduler.
  */
 static int __elevator_change(struct request_queue *q,
-			     const char *elevator_name)
+			     struct elv_change_ctx *ctx)
 {
+	const char *elevator_name = ctx->name;
 	struct elevator_type *e;
 	int ret;
 
@@ -698,20 +706,21 @@ static int __elevator_change(struct request_queue *q,
 	e = elevator_find_get(elevator_name);
 	if (!e)
 		return -EINVAL;
-	ret = elevator_switch(q, e);
+	ret = elevator_switch(q, e, ctx);
 	elevator_put(e);
 	return ret;
 }
 
-static int elevator_change(struct request_queue *q, const char *elevator_name)
+static int elevator_change(struct request_queue *q,
+			   struct elv_change_ctx *ctx)
 {
 	unsigned int memflags;
 	int ret = 0;
 
 	memflags = blk_mq_freeze_queue(q);
 	mutex_lock(&q->elevator_lock);
-	if (!q->elevator || !elevator_match(q->elevator->type, elevator_name))
-		ret = __elevator_change(q, elevator_name);
+	if (!q->elevator || !elevator_match(q->elevator->type, ctx->name))
+		ret = __elevator_change(q, ctx);
 	mutex_unlock(&q->elevator_lock);
 	blk_mq_unfreeze_queue(q, memflags);
 	return ret;
@@ -723,16 +732,19 @@ static int elevator_change(struct request_queue *q, const char *elevator_name)
  */
 void elv_update_nr_hw_queues(struct request_queue *q)
 {
-	const char *name = "none";
+	struct elv_change_ctx ctx = {
+		.name	= "none",
+		.uevent	= true,
+	};
 
 	mutex_lock(&q->elevator_lock);
 	if (q->elevator && !blk_queue_dying(q))
-		name = q->elevator->type->elevator_name;
-	__elevator_change(q, name);
+		ctx.name = q->elevator->type->elevator_name;
+	__elevator_change(q, &ctx);
 	mutex_unlock(&q->elevator_lock);
 }
 
-static void elv_iosched_load_module(char *elevator_name)
+static void elv_iosched_load_module(const char *elevator_name)
 {
 	struct elevator_type *found;
 
@@ -748,7 +760,9 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 			  size_t count)
 {
 	char elevator_name[ELV_NAME_MAX];
-	char *name;
+	struct elv_change_ctx ctx = {
+		.uevent = true,
+	};
 	int ret;
 	struct request_queue *q = disk->queue;
 	struct blk_mq_tag_set *set = q->tag_set;
@@ -759,12 +773,12 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 	 * queue is the one for the device storing the module file.
 	 */
 	strscpy(elevator_name, buf, sizeof(elevator_name));
-	name = strstrip(elevator_name);
+	ctx.name = strstrip(elevator_name);
 
-	elv_iosched_load_module(name);
+	elv_iosched_load_module(ctx.name);
 
 	down_read_nested(&set->update_nr_hwq_sema, 1);
-	ret = elevator_change(q, name);
+	ret = elevator_change(q, &ctx);
 	if (!ret)
 		ret = count;
 	up_read(&set->update_nr_hwq_sema);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V3 13/20] block: unifying elevator change
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (11 preceding siblings ...)
  2025-04-24 15:21 ` [PATCH V3 12/20] block: add `struct elv_change_ctx` for unifying elevator change Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
  2025-04-25  6:39   ` Hannes Reinecke
                     ` (2 more replies)
  2025-04-24 15:21 ` [PATCH V3 14/20] block: pass elevator_queue to elv_register_queue & unregister_queue Ming Lei
                   ` (7 subsequent siblings)
  20 siblings, 3 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

elevator change is one well-define behavior:

- tear down current elevator if it exists

- setup new elevator

It is supposed to cover any case for changing elevator by single
internal API, typically the following cases:

- setup default elevator in add_disk()

- switch to none in del_disk()

- reset elevator in blk_mq_update_nr_hw_queues()

- switch elevator in sysfs `store` elevator attribute

This patch uses elevator_change() to cover all above cases:

- every elevator switch is serialized with each other: add_disk/del_disk/
store elevator is serialized already, blk_mq_update_nr_hw_queues() uses
srcu for syncing with the other three cases

- for both add_disk()/del_disk(), queue freeze works at atomic mode
or has been froze, so the freeze in elevator_change() won't add extra
delay

- `struct elev_change_ctx` instance holds any info for changing elevator

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-sysfs.c |  18 +++-----
 block/blk.h       |   5 +-
 block/elevator.c  | 114 +++++++++++++++++++++++++---------------------
 block/genhd.c     |  28 ++----------
 4 files changed, 75 insertions(+), 90 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 1f9b45b0b9ee..9aa05666f32a 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -869,14 +869,8 @@ int blk_register_queue(struct gendisk *disk)
 	if (ret)
 		goto out_unregister_ia_ranges;
 
+	elevator_set_default(q);
 	mutex_lock(&q->elevator_lock);
-	if (q->elevator) {
-		ret = elv_register_queue(q, false);
-		if (ret) {
-			mutex_unlock(&q->elevator_lock);
-			goto out_crypto_sysfs_unregister;
-		}
-	}
 	wbt_enable_default(disk);
 	mutex_unlock(&q->elevator_lock);
 
@@ -902,8 +896,6 @@ int blk_register_queue(struct gendisk *disk)
 
 	return ret;
 
-out_crypto_sysfs_unregister:
-	blk_crypto_sysfs_unregister(disk);
 out_unregister_ia_ranges:
 	disk_unregister_independent_access_ranges(disk);
 out_debugfs_remove:
@@ -951,9 +943,11 @@ void blk_unregister_queue(struct gendisk *disk)
 		blk_mq_sysfs_unregister(disk);
 	blk_crypto_sysfs_unregister(disk);
 
-	mutex_lock(&q->elevator_lock);
-	elv_unregister_queue(q);
-	mutex_unlock(&q->elevator_lock);
+	if (q->elevator) {
+		blk_mq_quiesce_queue(q);
+		elevator_set_none(q);
+		blk_mq_unquiesce_queue(q);
+	}
 
 	mutex_lock(&q->sysfs_lock);
 	disk_unregister_independent_access_ranges(disk);
diff --git a/block/blk.h b/block/blk.h
index 2969f4427996..6ba674d1ceba 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -320,9 +320,8 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
 bool blk_insert_flush(struct request *rq);
 
 void elv_update_nr_hw_queues(struct request_queue *q);
-void elevator_exit(struct request_queue *q);
-int elv_register_queue(struct request_queue *q, bool uevent);
-void elv_unregister_queue(struct request_queue *q);
+void elevator_set_default(struct request_queue *q);
+void elevator_set_none(struct request_queue *q);
 
 ssize_t part_size_show(struct device *dev, struct device_attribute *attr,
 		char *buf);
diff --git a/block/elevator.c b/block/elevator.c
index 4be318b0b4ef..5ec66fbd4f24 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -49,6 +49,7 @@
 struct elv_change_ctx {
 	const char *name;
 	bool uevent;
+	bool init;
 };
 
 static DEFINE_SPINLOCK(elv_list_lock);
@@ -154,7 +155,7 @@ static void elevator_release(struct kobject *kobj)
 	kfree(e);
 }
 
-void elevator_exit(struct request_queue *q)
+static void elevator_exit(struct request_queue *q)
 {
 	struct elevator_queue *e = q->elevator;
 
@@ -458,7 +459,7 @@ static const struct kobj_type elv_ktype = {
 	.release	= elevator_release,
 };
 
-int elv_register_queue(struct request_queue *q, bool uevent)
+static int elv_register_queue(struct request_queue *q, bool uevent)
 {
 	struct elevator_queue *e = q->elevator;
 	int error;
@@ -488,7 +489,7 @@ int elv_register_queue(struct request_queue *q, bool uevent)
 	return error;
 }
 
-void elv_unregister_queue(struct request_queue *q)
+static void elv_unregister_queue(struct request_queue *q)
 {
 	struct elevator_queue *e = q->elevator;
 
@@ -565,60 +566,16 @@ EXPORT_SYMBOL_GPL(elv_unregister);
  * For single queue devices, default to using mq-deadline. If we have multiple
  * queues or mq-deadline is not available, default to "none".
  */
-static struct elevator_type *elevator_get_default(struct request_queue *q)
+static bool use_default_elevator(struct request_queue *q)
 {
 	if (q->tag_set->flags & BLK_MQ_F_NO_SCHED_BY_DEFAULT)
-		return NULL;
+		return false;
 
 	if (q->nr_hw_queues != 1 &&
 	    !blk_mq_is_shared_tags(q->tag_set->flags))
-		return NULL;
-
-	return elevator_find_get("mq-deadline");
-}
-
-/*
- * Use the default elevator settings. If the chosen elevator initialization
- * fails, fall back to the "none" elevator (no elevator).
- */
-void elevator_init_mq(struct request_queue *q)
-{
-	struct elevator_type *e;
-	unsigned int memflags;
-	int err;
-
-	WARN_ON_ONCE(blk_queue_registered(q));
-
-	if (unlikely(q->elevator))
-		return;
-
-	e = elevator_get_default(q);
-	if (!e)
-		return;
-
-	/*
-	 * We are called before adding disk, when there isn't any FS I/O,
-	 * so freezing queue plus canceling dispatch work is enough to
-	 * drain any dispatch activities originated from passthrough
-	 * requests, then no need to quiesce queue which may add long boot
-	 * latency, especially when lots of disks are involved.
-	 *
-	 * Disk isn't added yet, so verifying queue lock only manually.
-	 */
-	memflags = blk_mq_freeze_queue(q);
-
-	blk_mq_cancel_work_sync(q);
-
-	err = blk_mq_init_sched(q, e);
-
-	blk_mq_unfreeze_queue(q, memflags);
-
-	if (err) {
-		pr_warn("\"%s\" elevator initialization failed, "
-			"falling back to \"none\"\n", e->elevator_name);
-	}
+		return false;
 
-	elevator_put(e);
+	return true;
 }
 
 /*
@@ -694,7 +651,7 @@ static int __elevator_change(struct request_queue *q,
 	lockdep_assert_held(&q->tag_set->update_nr_hwq_sema);
 
 	/* Make sure queue is not in the middle of being removed */
-	if (!blk_queue_registered(q))
+	if (!ctx->init && !blk_queue_registered(q))
 		return -ENOENT;
 
 	if (!strncmp(elevator_name, "none", 4)) {
@@ -718,6 +675,16 @@ static int elevator_change(struct request_queue *q,
 	int ret = 0;
 
 	memflags = blk_mq_freeze_queue(q);
+	/*
+	 * May be called before adding disk, when there isn't any FS I/O,
+	 * so freezing queue plus canceling dispatch work is enough to
+	 * drain any dispatch activities originated from passthrough
+	 * requests, then no need to quiesce queue which may add long boot
+	 * latency, especially when lots of disks are involved.
+	 *
+	 * Disk isn't added yet, so verifying queue lock only manually.
+	 */
+	blk_mq_cancel_work_sync(q);
 	mutex_lock(&q->elevator_lock);
 	if (!q->elevator || !elevator_match(q->elevator->type, ctx->name))
 		ret = __elevator_change(q, ctx);
@@ -726,6 +693,49 @@ static int elevator_change(struct request_queue *q,
 	return ret;
 }
 
+/*
+ * Use the default elevator settings. If the chosen elevator initialization
+ * fails, fall back to the "none" elevator (no elevator).
+ */
+void elevator_set_default(struct request_queue *q)
+{
+	struct elv_change_ctx ctx = {
+		.init = true,
+	};
+	int err;
+
+	if (!queue_is_mq(q))
+		return;
+
+	ctx.name = use_default_elevator(q) ? "mq-deadline" : "none";
+	if (!q->elevator && !strcmp(ctx.name, "none"))
+		return;
+	err = elevator_change(q, &ctx);
+	if (err < 0)
+		pr_warn("\"%s\" set elevator failed %d, "
+			"falling back to \"none\"\n", ctx.name, err);
+}
+
+void elevator_set_none(struct request_queue *q)
+{
+	struct elv_change_ctx ctx = {
+		.name	= "none",
+		.uevent = true,
+		.init = true,
+	};
+	int err;
+
+	if (!queue_is_mq(q))
+		return;
+
+	if (!q->elevator)
+		return;
+
+	err = elevator_change(q, &ctx);
+	if (err < 0)
+		pr_warn("%s: set none elevator failed %d\n", __func__, err);
+}
+
 /*
  * The I/O scheduler depends on the number of hardware queues, this forces a
  * reattachment when nr_hw_queues changes.
diff --git a/block/genhd.c b/block/genhd.c
index 53c5b1e9e1d2..a50063c4c40f 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -416,12 +416,6 @@ static int __add_disk_fwnode(struct device *parent, struct gendisk *disk,
 		 */
 		if (disk->fops->submit_bio || disk->fops->poll_bio)
 			return -EINVAL;
-
-		/*
-		 * Initialize the I/O scheduler code and pick a default one if
-		 * needed.
-		 */
-		elevator_init_mq(disk->queue);
 	} else {
 		if (!disk->fops->submit_bio)
 			return -EINVAL;
@@ -438,7 +432,7 @@ static int __add_disk_fwnode(struct device *parent, struct gendisk *disk,
 	ret = -EINVAL;
 	if (disk->major) {
 		if (WARN_ON(!disk->minors))
-			goto out_exit_elevator;
+			goto out;
 
 		if (disk->minors > DISK_MAX_PARTS) {
 			pr_err("block: can't allocate more than %d partitions\n",
@@ -448,14 +442,14 @@ static int __add_disk_fwnode(struct device *parent, struct gendisk *disk,
 		if (disk->first_minor > MINORMASK ||
 		    disk->minors > MINORMASK + 1 ||
 		    disk->first_minor + disk->minors > MINORMASK + 1)
-			goto out_exit_elevator;
+			goto out;
 	} else {
 		if (WARN_ON(disk->minors))
-			goto out_exit_elevator;
+			goto out;
 
 		ret = blk_alloc_ext_minor();
 		if (ret < 0)
-			goto out_exit_elevator;
+			goto out;
 		disk->major = BLOCK_EXT_MAJOR;
 		disk->first_minor = ret;
 	}
@@ -564,12 +558,7 @@ static int __add_disk_fwnode(struct device *parent, struct gendisk *disk,
 out_free_ext_minor:
 	if (disk->major == BLOCK_EXT_MAJOR)
 		blk_free_ext_minor(disk->first_minor);
-out_exit_elevator:
-	if (disk->queue->elevator) {
-		mutex_lock(&disk->queue->elevator_lock);
-		elevator_exit(disk->queue);
-		mutex_unlock(&disk->queue->elevator_lock);
-	}
+out:
 	return ret;
 }
 
@@ -755,14 +744,7 @@ static void __del_gendisk(struct gendisk *disk)
 	if (queue_is_mq(q))
 		blk_mq_cancel_work_sync(q);
 
-	blk_mq_quiesce_queue(q);
-	if (q->elevator) {
-		mutex_lock(&q->elevator_lock);
-		elevator_exit(q);
-		mutex_unlock(&q->elevator_lock);
-	}
 	rq_qos_exit(q);
-	blk_mq_unquiesce_queue(q);
 
 	/*
 	 * If the disk does not own the queue, allow using passthrough requests
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V3 14/20] block: pass elevator_queue to elv_register_queue & unregister_queue
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (12 preceding siblings ...)
  2025-04-24 15:21 ` [PATCH V3 13/20] block: " Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
  2025-04-25  6:40   ` Hannes Reinecke
  2025-04-24 15:21 ` [PATCH V3 15/20] block: fail to show/store elevator sysfs attribute if elevator is dying Ming Lei
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

Pass elevator_queue reference to elv_register_queue() & elv_unregister_queue().

No functional change, and prepare for moving the two out of elevator
lock & freezing queue, when we need to store the old & new elevator
queue in `struct elv_change_ctx` instance, then both two can co-exist
for short while, so we have to pass the exact elevator_queue instance
to elv_register_queue & unregister_queue.

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index 5ec66fbd4f24..dde8732b7a26 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -459,9 +459,10 @@ static const struct kobj_type elv_ktype = {
 	.release	= elevator_release,
 };
 
-static int elv_register_queue(struct request_queue *q, bool uevent)
+static int elv_register_queue(struct request_queue *q,
+			      struct elevator_queue *e,
+			      bool uevent)
 {
-	struct elevator_queue *e = q->elevator;
 	int error;
 
 	lockdep_assert_held(&q->elevator_lock);
@@ -489,10 +490,9 @@ static int elv_register_queue(struct request_queue *q, bool uevent)
 	return error;
 }
 
-static void elv_unregister_queue(struct request_queue *q)
+static void elv_unregister_queue(struct request_queue *q,
+				 struct elevator_queue *e)
 {
-	struct elevator_queue *e = q->elevator;
-
 	lockdep_assert_held(&q->elevator_lock);
 
 	if (e && test_and_clear_bit(ELEVATOR_FLAG_REGISTERED, &e->flags)) {
@@ -595,7 +595,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e,
 	blk_mq_quiesce_queue(q);
 
 	if (q->elevator) {
-		elv_unregister_queue(q);
+		elv_unregister_queue(q, q->elevator);
 		elevator_exit(q);
 	}
 
@@ -603,7 +603,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e,
 	if (ret)
 		goto out_unfreeze;
 
-	ret = elv_register_queue(q, ctx->uevent);
+	ret = elv_register_queue(q, q->elevator, ctx->uevent);
 	if (ret) {
 		elevator_exit(q);
 		goto out_unfreeze;
@@ -628,7 +628,7 @@ static void elevator_disable(struct request_queue *q)
 
 	blk_mq_quiesce_queue(q);
 
-	elv_unregister_queue(q);
+	elv_unregister_queue(q, q->elevator);
 	elevator_exit(q);
 	blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q);
 	q->elevator = NULL;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V3 15/20] block: fail to show/store elevator sysfs attribute if elevator is dying
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (13 preceding siblings ...)
  2025-04-24 15:21 ` [PATCH V3 14/20] block: pass elevator_queue to elv_register_queue & unregister_queue Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
  2025-04-25  6:45   ` Hannes Reinecke
  2025-04-25 18:36   ` Christoph Hellwig
  2025-04-24 15:21 ` [PATCH V3 16/20] block: move elv_register[unregister]_queue out of elevator_lock Ming Lei
                   ` (5 subsequent siblings)
  20 siblings, 2 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

Prepare for moving elv_register[unregister]_queue out of elevator_lock
& queue freezing, so we may have to call elv_unregister_queue() after
elevator ->exit() is called, then there is small window for user to
call into ->show()/store(), and user-after-free can be caused.

Fail to show/store elevator sysfs attribute if elevator is dying by
adding one new flag of ELEVATOR_FLAG_DYNG, which is protected by
elevator ->sysfs_lock.

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c |  1 +
 block/elevator.c     | 10 ++++++++--
 block/elevator.h     |  1 +
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 336a15ffecfa..55a0fd105147 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -551,5 +551,6 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
 	if (e->type->ops.exit_sched)
 		e->type->ops.exit_sched(e);
 	blk_mq_sched_tags_teardown(q, flags);
+	set_bit(ELEVATOR_FLAG_DYING, &q->elevator->flags);
 	q->elevator = NULL;
 }
diff --git a/block/elevator.c b/block/elevator.c
index dde8732b7a26..48f2f202af98 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -426,7 +426,10 @@ elv_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
 
 	e = container_of(kobj, struct elevator_queue, kobj);
 	mutex_lock(&e->sysfs_lock);
-	error = e->type ? entry->show(e, page) : -ENOENT;
+	if (test_bit(ELEVATOR_FLAG_DYING, &e->flags))
+		error = -ENODEV;
+	else
+		error = e->type ? entry->show(e, page) : -ENOENT;
 	mutex_unlock(&e->sysfs_lock);
 	return error;
 }
@@ -444,7 +447,10 @@ elv_attr_store(struct kobject *kobj, struct attribute *attr,
 
 	e = container_of(kobj, struct elevator_queue, kobj);
 	mutex_lock(&e->sysfs_lock);
-	error = e->type ? entry->store(e, page, length) : -ENOENT;
+	if (test_bit(ELEVATOR_FLAG_DYING, &e->flags))
+		error = -ENODEV;
+	else
+		error = e->type ? entry->store(e, page, length) : -ENOENT;
 	mutex_unlock(&e->sysfs_lock);
 	return error;
 }
diff --git a/block/elevator.h b/block/elevator.h
index 9198676644a9..76a90a1b7ed6 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -121,6 +121,7 @@ struct elevator_queue
 };
 
 #define ELEVATOR_FLAG_REGISTERED	0
+#define ELEVATOR_FLAG_DYING		1
 
 /*
  * block elevator interface
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V3 16/20] block: move elv_register[unregister]_queue out of elevator_lock
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (14 preceding siblings ...)
  2025-04-24 15:21 ` [PATCH V3 15/20] block: fail to show/store elevator sysfs attribute if elevator is dying Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
  2025-04-25  6:46   ` Hannes Reinecke
                     ` (2 more replies)
  2025-04-24 15:21 ` [PATCH V3 17/20] block: move debugfs/sysfs register out of freezing queue Ming Lei
                   ` (4 subsequent siblings)
  20 siblings, 3 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

Move elv_register[unregister]_queue out of ->elevator_lock & queue freezing,
so we can kill many lockdep warnings.

elv_register[unregister]_queue() is serialized, and just dealing with sysfs/
debugfs things, no need to be done with queue frozen.

With this change, elevator's ->exit() is called before calling
elv_unregister_queue, then user may call into ->show()/store() of elevator's
sysfs attributes, and we have covered this issue by adding `ELEVATOR_FLAG_DYNG`.

For blk-mq debugfs, hctx->sched_tags is always checked with ->elevator_lock by
debugfs code, meantime hctx->sched_tags is updated with ->elevator_lock, so
there isn't such issue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c   |  3 +-
 block/elevator.c | 80 ++++++++++++++++++++++++++++++++++++------------
 2 files changed, 62 insertions(+), 21 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3afcddd21586..3ecbf73773ea 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4997,11 +4997,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		blk_mq_debugfs_register_hctxs(q);
 	}
 
+	/* elv_update_nr_hw_queues() unfreeze queue for us */
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
 		elv_update_nr_hw_queues(q);
 
-	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		blk_mq_unfreeze_queue_nomemrestore(q);
 	memalloc_noio_restore(memflags);
 
 	/* Free the excess tags when nr_hw_queues shrink. */
diff --git a/block/elevator.c b/block/elevator.c
index 48f2f202af98..aec8081a6be3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -50,6 +50,11 @@ struct elv_change_ctx {
 	const char *name;
 	bool uevent;
 	bool init;
+
+	/* for unregistering old elevator */
+	struct elevator_queue *old;
+	/* for registering new elevator */
+	struct elevator_queue *new;
 };
 
 static DEFINE_SPINLOCK(elv_list_lock);
@@ -155,18 +160,18 @@ static void elevator_release(struct kobject *kobj)
 	kfree(e);
 }
 
-static void elevator_exit(struct request_queue *q)
+static void __elevator_exit(struct request_queue *q)
 {
 	struct elevator_queue *e = q->elevator;
 
+	lockdep_assert_held(&q->elevator_lock);
+
 	ioc_clear_queue(q);
 	blk_mq_sched_free_rqs(q);
 
 	mutex_lock(&e->sysfs_lock);
 	blk_mq_exit_sched(q, e);
 	mutex_unlock(&e->sysfs_lock);
-
-	kobject_put(&e->kobj);
 }
 
 static inline void __elv_rqhash_del(struct request *rq)
@@ -471,8 +476,6 @@ static int elv_register_queue(struct request_queue *q,
 {
 	int error;
 
-	lockdep_assert_held(&q->elevator_lock);
-
 	error = kobject_add(&e->kobj, &q->disk->queue_kobj, "iosched");
 	if (!error) {
 		const struct elv_fs_entry *attr = e->type->elevator_attrs;
@@ -499,8 +502,6 @@ static int elv_register_queue(struct request_queue *q,
 static void elv_unregister_queue(struct request_queue *q,
 				 struct elevator_queue *e)
 {
-	lockdep_assert_held(&q->elevator_lock);
-
 	if (e && test_and_clear_bit(ELEVATOR_FLAG_REGISTERED, &e->flags)) {
 		kobject_uevent(&e->kobj, KOBJ_REMOVE);
 		kobject_del(&e->kobj);
@@ -601,19 +602,15 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e,
 	blk_mq_quiesce_queue(q);
 
 	if (q->elevator) {
-		elv_unregister_queue(q, q->elevator);
-		elevator_exit(q);
+		ctx->old = q->elevator;
+		__elevator_exit(q);
 	}
 
 	ret = blk_mq_init_sched(q, new_e);
 	if (ret)
 		goto out_unfreeze;
 
-	ret = elv_register_queue(q, q->elevator, ctx->uevent);
-	if (ret) {
-		elevator_exit(q);
-		goto out_unfreeze;
-	}
+	ctx->new = q->elevator;
 	blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
 
 out_unfreeze:
@@ -627,15 +624,16 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e,
 	return ret;
 }
 
-static void elevator_disable(struct request_queue *q)
+static void elevator_disable(struct request_queue *q,
+			     struct elv_change_ctx *ctx)
 {
 	WARN_ON_ONCE(q->mq_freeze_depth == 0);
 	lockdep_assert_held(&q->elevator_lock);
 
 	blk_mq_quiesce_queue(q);
 
-	elv_unregister_queue(q, q->elevator);
-	elevator_exit(q);
+	ctx->old = q->elevator;
+	__elevator_exit(q);
 	blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q);
 	q->elevator = NULL;
 	q->nr_requests = q->tag_set->queue_depth;
@@ -644,6 +642,38 @@ static void elevator_disable(struct request_queue *q)
 	blk_mq_unquiesce_queue(q);
 }
 
+static void elv_exit_and_release(struct request_queue *q)
+{
+	struct elevator_queue *e;
+	unsigned memflags;
+
+	memflags = blk_mq_freeze_queue(q);
+	mutex_lock(&q->elevator_lock);
+	e = q->elevator;
+	__elevator_exit(q);
+	mutex_unlock(&q->elevator_lock);
+	blk_mq_unfreeze_queue(q, memflags);
+	if (e)
+		kobject_put(&e->kobj);
+}
+
+static int elevator_change_done(struct request_queue *q,
+				struct elv_change_ctx *ctx)
+{
+	int ret = 0;
+
+	if (ctx->old) {
+		elv_unregister_queue(q, ctx->old);
+		kobject_put(&ctx->old->kobj);
+	}
+	if (ctx->new) {
+		ret = elv_register_queue(q, ctx->new, ctx->uevent);
+		if (ret)
+			elv_exit_and_release(q);
+	}
+	return ret;
+}
+
 /*
  * Switch this queue to the given IO scheduler.
  */
@@ -662,7 +692,7 @@ static int __elevator_change(struct request_queue *q,
 
 	if (!strncmp(elevator_name, "none", 4)) {
 		if (q->elevator)
-			elevator_disable(q);
+			elevator_disable(q, ctx);
 		return 0;
 	}
 
@@ -696,6 +726,9 @@ static int elevator_change(struct request_queue *q,
 		ret = __elevator_change(q, ctx);
 	mutex_unlock(&q->elevator_lock);
 	blk_mq_unfreeze_queue(q, memflags);
+	if (!ret)
+		ret = elevator_change_done(q, ctx);
+
 	return ret;
 }
 
@@ -745,6 +778,8 @@ void elevator_set_none(struct request_queue *q)
 /*
  * The I/O scheduler depends on the number of hardware queues, this forces a
  * reattachment when nr_hw_queues changes.
+ *
+ * Note that this unfreezes the passed in queue.
  */
 void elv_update_nr_hw_queues(struct request_queue *q)
 {
@@ -752,12 +787,19 @@ void elv_update_nr_hw_queues(struct request_queue *q)
 		.name	= "none",
 		.uevent	= true,
 	};
+	int ret = -ENODEV;
+
+	WARN_ON_ONCE(q->mq_freeze_depth == 0);
 
 	mutex_lock(&q->elevator_lock);
 	if (q->elevator && !blk_queue_dying(q))
 		ctx.name = q->elevator->type->elevator_name;
-	__elevator_change(q, &ctx);
+	ret = __elevator_change(q, &ctx);
 	mutex_unlock(&q->elevator_lock);
+	blk_mq_unfreeze_queue_nomemrestore(q);
+
+	if (!ret)
+		WARN_ON_ONCE(elevator_change_done(q, &ctx));
 }
 
 static void elv_iosched_load_module(const char *elevator_name)
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V3 17/20] block: move debugfs/sysfs register out of freezing queue
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (15 preceding siblings ...)
  2025-04-24 15:21 ` [PATCH V3 16/20] block: move elv_register[unregister]_queue out of elevator_lock Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
  2025-04-25  6:47   ` Hannes Reinecke
  2025-04-25 18:38   ` Christoph Hellwig
  2025-04-24 15:21 ` [PATCH V3 18/20] block: remove several ->elevator_lock Ming Lei
                   ` (3 subsequent siblings)
  20 siblings, 2 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

Move debugfs/sysfs register out of freezing queue in
__blk_mq_update_nr_hw_queues(), so that the following lockdep dependency
can be killed:

	#2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
	#1 (fs_reclaim){+.+.}-{0:0}:
	#0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs

And registering/un-registering debugfs/sysfs does not require queue to be
frozen.

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ecbf73773ea..420150eb5a45 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4961,14 +4961,14 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		return;
 
 	memflags = memalloc_noio_save();
-	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		blk_mq_freeze_queue_nomemsave(q);
-
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_debugfs_unregister_hctxs(q);
 		blk_mq_sysfs_unregister_hctxs(q);
 	}
 
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		blk_mq_freeze_queue_nomemsave(q);
+
 	if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
 		goto reregister;
 
@@ -4991,16 +4991,15 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		blk_mq_map_swqueue(q);
 	}
 
+	/* elv_update_nr_hw_queues() unfreeze queue for us */
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		elv_update_nr_hw_queues(q);
+
 reregister:
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_sysfs_register_hctxs(q);
 		blk_mq_debugfs_register_hctxs(q);
 	}
-
-	/* elv_update_nr_hw_queues() unfreeze queue for us */
-	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		elv_update_nr_hw_queues(q);
-
 	memalloc_noio_restore(memflags);
 
 	/* Free the excess tags when nr_hw_queues shrink. */
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V3 18/20] block: remove several ->elevator_lock
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (16 preceding siblings ...)
  2025-04-24 15:21 ` [PATCH V3 17/20] block: move debugfs/sysfs register out of freezing queue Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
  2025-04-25  6:48   ` Hannes Reinecke
  2025-04-25 18:38   ` Christoph Hellwig
  2025-04-24 15:21 ` [PATCH V3 19/20] block: move hctx cpuhp add/del out of queue freezing Ming Lei
                   ` (2 subsequent siblings)
  20 siblings, 2 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

Both blk_mq_map_swqueue() and blk_mq_realloc_hw_ctxs() are called before
the request queue is added to tagset list, so the two won't run concurrently
with blk_mq_update_nr_hw_queues().

When the two functions are only called from queue initialization or
blk_mq_update_nr_hw_queues(), elevator switch can't happen.

So remove these ->elevator_lock uses.

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 19 ++++---------------
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 420150eb5a45..e08eda094ae7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4112,8 +4112,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	struct blk_mq_ctx *ctx;
 	struct blk_mq_tag_set *set = q->tag_set;
 
-	mutex_lock(&q->elevator_lock);
-
 	queue_for_each_hw_ctx(q, hctx, i) {
 		cpumask_clear(hctx->cpumask);
 		hctx->nr_ctx = 0;
@@ -4218,8 +4216,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 		hctx->next_cpu = blk_mq_first_mapped_cpu(hctx);
 		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
 	}
-
-	mutex_unlock(&q->elevator_lock);
 }
 
 /*
@@ -4523,16 +4519,9 @@ static void __blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 }
 
 static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
-				   struct request_queue *q, bool lock)
+				   struct request_queue *q)
 {
-	if (lock) {
-		/* protect against switching io scheduler  */
-		mutex_lock(&q->elevator_lock);
-		__blk_mq_realloc_hw_ctxs(set, q);
-		mutex_unlock(&q->elevator_lock);
-	} else {
-		__blk_mq_realloc_hw_ctxs(set, q);
-	}
+	__blk_mq_realloc_hw_ctxs(set, q);
 
 	/* unregister cpuhp callbacks for exited hctxs */
 	blk_mq_remove_hw_queues_cpuhp(q);
@@ -4564,7 +4553,7 @@ int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 
 	xa_init(&q->hctx_table);
 
-	blk_mq_realloc_hw_ctxs(set, q, false);
+	blk_mq_realloc_hw_ctxs(set, q);
 	if (!q->nr_hw_queues)
 		goto err_hctxs;
 
@@ -4975,7 +4964,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 fallback:
 	blk_mq_update_queue_map(set);
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
-		blk_mq_realloc_hw_ctxs(set, q, true);
+		blk_mq_realloc_hw_ctxs(set, q);
 
 		if (q->nr_hw_queues != set->nr_hw_queues) {
 			int i = prev_nr_hw_queues;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V3 19/20] block: move hctx cpuhp add/del out of queue freezing
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (17 preceding siblings ...)
  2025-04-24 15:21 ` [PATCH V3 18/20] block: remove several ->elevator_lock Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
  2025-04-25  6:49   ` Hannes Reinecke
  2025-04-24 15:21 ` [PATCH V3 20/20] block: move wbt_enable_default() out of queue freezing from sched ->exit() Ming Lei
  2025-04-29 12:00 ` [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Stefan Haberland
  20 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

Move hctx cpuhp add/del out of queue freezing for not connecting freeze
lock with cpuhp locks, then lockdep warning can be avoided.

This way is safe because both needn't queue to be frozen and scheduler
switch isn't allowed.

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e08eda094ae7..add653f84797 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4964,7 +4964,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 fallback:
 	blk_mq_update_queue_map(set);
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
-		blk_mq_realloc_hw_ctxs(set, q);
+		__blk_mq_realloc_hw_ctxs(set, q);
 
 		if (q->nr_hw_queues != set->nr_hw_queues) {
 			int i = prev_nr_hw_queues;
@@ -4988,6 +4988,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_sysfs_register_hctxs(q);
 		blk_mq_debugfs_register_hctxs(q);
+
+		blk_mq_remove_hw_queues_cpuhp(q);
+		blk_mq_add_hw_queues_cpuhp(q);
 	}
 	memalloc_noio_restore(memflags);
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH V3 20/20] block: move wbt_enable_default() out of queue freezing from sched ->exit()
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (18 preceding siblings ...)
  2025-04-24 15:21 ` [PATCH V3 19/20] block: move hctx cpuhp add/del out of queue freezing Ming Lei
@ 2025-04-24 15:21 ` Ming Lei
  2025-04-25 13:10   ` Nilay Shroff
  2025-04-29 10:59   ` Nilay Shroff
  2025-04-29 12:00 ` [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Stefan Haberland
  20 siblings, 2 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-24 15:21 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

scheduler's ->exit() is called with queue frozen and elevator lock is held, and
wbt_enable_default() can't be called with queue frozen, otherwise the
following lockdep warning is triggered:

	#6 (&q->rq_qos_mutex){+.+.}-{4:4}:
	#5 (&eq->sysfs_lock){+.+.}-{4:4}:
	#4 (&q->elevator_lock){+.+.}-{4:4}:
	#3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
	#2 (fs_reclaim){+.+.}-{0:0}:
	#1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
	#0 (&q->debugfs_mutex){+.+.}-{4:4}:

Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
call it from elevator_change_done().

Meantime add disk->rqos_state_mutex for covering wbt state change, which
matches the purpose more than ->elevator_lock.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/bfq-iosched.c    |  2 +-
 block/blk-sysfs.c      |  6 ++----
 block/blk-wbt.c        | 10 ++++++++--
 block/elevator.c       |  5 +++++
 block/elevator.h       |  1 +
 block/genhd.c          |  1 +
 include/linux/blkdev.h |  2 ++
 7 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index cc6f59836dcd..0cb1e9873aab 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7211,7 +7211,7 @@ static void bfq_exit_queue(struct elevator_queue *e)
 
 	blk_stat_disable_accounting(bfqd->queue);
 	blk_queue_flag_clear(QUEUE_FLAG_DISABLE_WBT_DEF, bfqd->queue);
-	wbt_enable_default(bfqd->queue->disk);
+	set_bit(ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT, &e->flags);
 
 	kfree(bfqd);
 }
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 9aa05666f32a..fafe9e9e97cc 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -593,7 +593,7 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
 		return -EINVAL;
 
 	memflags = blk_mq_freeze_queue(q);
-	mutex_lock(&q->elevator_lock);
+	mutex_lock(&disk->rqos_state_mutex);
 
 	rqos = wbt_rq_qos(q);
 	if (!rqos) {
@@ -622,7 +622,7 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
 
 	blk_mq_unquiesce_queue(q);
 out:
-	mutex_unlock(&q->elevator_lock);
+	mutex_unlock(&disk->rqos_state_mutex);
 	blk_mq_unfreeze_queue(q, memflags);
 
 	return ret;
@@ -870,9 +870,7 @@ int blk_register_queue(struct gendisk *disk)
 		goto out_unregister_ia_ranges;
 
 	elevator_set_default(q);
-	mutex_lock(&q->elevator_lock);
 	wbt_enable_default(disk);
-	mutex_unlock(&q->elevator_lock);
 
 	blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q);
 
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 29cd2e33666f..c8588bae1c1b 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -704,6 +704,8 @@ void wbt_enable_default(struct gendisk *disk)
 	struct rq_qos *rqos;
 	bool enable = IS_ENABLED(CONFIG_BLK_WBT_MQ);
 
+	mutex_lock(&disk->rqos_state_mutex);
+
 	if (blk_queue_disable_wbt(q))
 		enable = false;
 
@@ -712,15 +714,17 @@ void wbt_enable_default(struct gendisk *disk)
 	if (rqos) {
 		if (enable && RQWB(rqos)->enable_state == WBT_STATE_OFF_DEFAULT)
 			RQWB(rqos)->enable_state = WBT_STATE_ON_DEFAULT;
-		return;
+		goto unlock;
 	}
 
 	/* Queue not registered? Maybe shutting down... */
 	if (!blk_queue_registered(q))
-		return;
+		goto unlock;
 
 	if (queue_is_mq(q) && enable)
 		wbt_init(disk);
+unlock:
+	mutex_unlock(&disk->rqos_state_mutex);
 }
 EXPORT_SYMBOL_GPL(wbt_enable_default);
 
@@ -773,11 +777,13 @@ void wbt_disable_default(struct gendisk *disk)
 	struct rq_wb *rwb;
 	if (!rqos)
 		return;
+	mutex_lock(&disk->rqos_state_mutex);
 	rwb = RQWB(rqos);
 	if (rwb->enable_state == WBT_STATE_ON_DEFAULT) {
 		blk_stat_deactivate(rwb->cb);
 		rwb->enable_state = WBT_STATE_OFF_DEFAULT;
 	}
+	mutex_unlock(&disk->rqos_state_mutex);
 }
 EXPORT_SYMBOL_GPL(wbt_disable_default);
 
diff --git a/block/elevator.c b/block/elevator.c
index aec8081a6be3..a637426da56d 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -663,8 +663,13 @@ static int elevator_change_done(struct request_queue *q,
 	int ret = 0;
 
 	if (ctx->old) {
+		bool enable_wbt = test_bit(ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT,
+				&ctx->old->flags);
+
 		elv_unregister_queue(q, ctx->old);
 		kobject_put(&ctx->old->kobj);
+		if (enable_wbt)
+			wbt_enable_default(q->disk);
 	}
 	if (ctx->new) {
 		ret = elv_register_queue(q, ctx->new, ctx->uevent);
diff --git a/block/elevator.h b/block/elevator.h
index 76a90a1b7ed6..a07ce773a38f 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -122,6 +122,7 @@ struct elevator_queue
 
 #define ELEVATOR_FLAG_REGISTERED	0
 #define ELEVATOR_FLAG_DYING		1
+#define ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT	2
 
 /*
  * block elevator interface
diff --git a/block/genhd.c b/block/genhd.c
index a50063c4c40f..e13be309552a 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1452,6 +1452,7 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
 #ifdef CONFIG_BLOCK_HOLDER_DEPRECATED
 	INIT_LIST_HEAD(&disk->slave_bdevs);
 #endif
+	mutex_init(&disk->rqos_state_mutex);
 	return disk;
 
 out_erase_part0:
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 228b2e5ad493..281d56b910c2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -218,6 +218,8 @@ struct gendisk {
 	 * devices that do not have multiple independent access ranges.
 	 */
 	struct blk_independent_access_ranges *ia_ranges;
+
+	struct mutex rqos_state_mutex;	/* rqos state change mutex */
 };
 
 /**
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 04/20] block: use q->elevator with ->elevator_lock held in elv_iosched_show()
  2025-04-24 15:21 ` [PATCH V3 04/20] block: use q->elevator with ->elevator_lock held in elv_iosched_show() Ming Lei
@ 2025-04-25  6:08   ` Hannes Reinecke
  2025-04-25 10:53   ` Nilay Shroff
  2025-04-25 14:29   ` Christoph Hellwig
  2 siblings, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25  6:08 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/24/25 17:21, Ming Lei wrote:
> Use q->elevator with ->elevator_lock held in elv_iosched_show(), since
> the local cached elevator reference may become stale after getting
> ->elevator_lock.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/elevator.c | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues
  2025-04-24 15:21 ` [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues Ming Lei
@ 2025-04-25  6:33   ` Hannes Reinecke
  2025-04-25 11:37   ` Nilay Shroff
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25  6:33 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/24/25 17:21, Ming Lei wrote:
> Both adding/deleting disk code are reader of `nr_hw_queues`, so we can't
> allow them in-progress when updating nr_hw_queues, kernel panic and
> kasan has been reported in [1].
> 
> Prevent adding/deleting disk during updating nr_hw_queues by adding
> rw_semaphore to tagset, write lock is grabbed in blk_mq_update_nr_hw_queues(),
> and read lock is acquired when adding/deleting disk.
> 
> Also mark GFP_NOIO allocation scope for adding/deleting disk because
> blk_mq_update_nr_hw_queues() is part of some driver's error handler.
> 
> This way avoids lot of trouble.
> 
> Suggested-by: Nilay Shroff <nilay@linux.ibm.com>
> Reported-by: Nilay Shroff <nilay@linux.ibm.com>
> Closes: https://lore.kernel.org/linux-block/a5896cdb-a59a-4a37-9f99-20522f5d2987@linux.ibm.com/
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c         |  4 ++
>   block/genhd.c          | 94 +++++++++++++++++++++++++++++++-----------
>   include/linux/blk-mq.h |  3 ++
>   3 files changed, 78 insertions(+), 23 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress
  2025-04-24 15:21 ` [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress Ming Lei
@ 2025-04-25  6:33   ` Hannes Reinecke
  2025-04-25 12:48   ` Nilay Shroff
  1 sibling, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25  6:33 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/24/25 17:21, Ming Lei wrote:
> Elevator switch code is another `nr_hw_queue` reader in non-fast-IO code
> path, so it can't be done if updating `nr_hw_queues` is in-progress.
> 
> Take same approach with not allowing add/del disk when updating
> nr_hw_queues is in-progress, by grabbing read lock of
> set->update_nr_hwq_sema.
> 
> Take the nested variant for avoiding the following false positive
> splat[1], and this way is correct because:
> 
> - the read lock in elv_iosched_store() is not overlapped with the read lock
> in adding/deleting disk:
> 
> - kobject attribute is only available after the kobject is added and
> before it is deleted
> 
>    -> #4 (&q->q_usage_counter(queue){++++}-{0:0}:
>    -> #3 (&q->limits_lock){+.+.}-{4:4}:
>    -> #2 (&disk->open_mutex){+.+.}-{4:4}:
>    -> #1 (&set->update_nr_hwq_lock){.+.+}-{4:4}:
>    -> #0 (kn->active#103){++++}-{0:0}:
> 
> Link: https://lore.kernel.org/linux-block/aAWv3NPtNIKKvJZc@fedora/ [1]
> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/elevator.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 09/20] block: simplify elevator reattachment for updating nr_hw_queues
  2025-04-24 15:21 ` [PATCH V3 09/20] block: simplify elevator reattachment for updating nr_hw_queues Ming Lei
@ 2025-04-25  6:34   ` Hannes Reinecke
  2025-04-25 18:12   ` Christoph Hellwig
  1 sibling, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25  6:34 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/24/25 17:21, Ming Lei wrote:
> In blk_mq_update_nr_hw_queues(), nr_hw_queues changes and elevator data
> depends on it, so elevator has to be reattached.
> 
> Now elevator switch isn't allowed during blk_mq_update_nr_hw_queues(), so
> we can simply call elevator_change() to reattach elevator sched tags after
> nr_hw_queues is updated.
> 
> Add elv_update_nr_hw_queues() for blk_mq_update_nr_hw_queues() to
> reattach elevator.
> 
> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c   | 90 +-----------------------------------------------
>   block/blk.h      |  3 +-
>   block/elevator.c | 32 +++++++++++++----
>   3 files changed, 28 insertions(+), 97 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait
  2025-04-24 15:21 ` [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait Ming Lei
@ 2025-04-25  6:35   ` Hannes Reinecke
  2025-04-25 12:50   ` Nilay Shroff
  2025-04-25 14:34   ` Christoph Hellwig
  2 siblings, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25  6:35 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/24/25 17:21, Ming Lei wrote:
> Move blk_unregister_queue() & device_del() after freeze wait, and prepare
> for unifying elevator switch.
> 
> This way is just fine, since bdev has been unhashed at the beginning of
> del_gendisk(), both blk_unregister_queue() & device_del() are dealing
> with kobject & debugfs thing only.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/genhd.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 11/20] block: move queue freezing & elevator_lock into elevator_change()
  2025-04-24 15:21 ` [PATCH V3 11/20] block: move queue freezing & elevator_lock into elevator_change() Ming Lei
@ 2025-04-25  6:36   ` Hannes Reinecke
  2025-04-25 12:54   ` Nilay Shroff
  2025-04-25 14:35   ` Christoph Hellwig
  2 siblings, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25  6:36 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/24/25 17:21, Ming Lei wrote:
> Move queue freezing & elevator_lock into elevator_change(), and prepare
> for using elevator_change() for setting up & tearing down default elevator
> too.
> 
> Also add lockdep_assert_held() in __elevator_change() because either
> read or write lock is required for changing elevator.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/elevator.c | 18 +++++++++++-------
>   1 file changed, 11 insertions(+), 7 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 12/20] block: add `struct elv_change_ctx` for unifying elevator change
  2025-04-24 15:21 ` [PATCH V3 12/20] block: add `struct elv_change_ctx` for unifying elevator change Ming Lei
@ 2025-04-25  6:38   ` Hannes Reinecke
  2025-04-25 18:23   ` Christoph Hellwig
  1 sibling, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25  6:38 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/24/25 17:21, Ming Lei wrote:
> Add `struct elv_change_ctx` and prepare for unifying elevator change by
> [__]elevator_change(). With this way, any input & output parameter can
> be provided & observed in top helper.
> 
> This way helps to move kobject add/delete & debugfs register/unregister
> out of ->elevator_lock & freezing queue.
> 
> Reviewed-by : Nilay Shroff <nilay@linux.ibm.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/elevator.c | 44 +++++++++++++++++++++++++++++---------------
>   1 file changed, 29 insertions(+), 15 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 13/20] block: unifying elevator change
  2025-04-24 15:21 ` [PATCH V3 13/20] block: " Ming Lei
@ 2025-04-25  6:39   ` Hannes Reinecke
  2025-04-25 13:03   ` Nilay Shroff
  2025-04-25 18:30   ` Christoph Hellwig
  2 siblings, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25  6:39 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/24/25 17:21, Ming Lei wrote:
> elevator change is one well-define behavior:
> 
> - tear down current elevator if it exists
> 
> - setup new elevator
> 
> It is supposed to cover any case for changing elevator by single
> internal API, typically the following cases:
> 
> - setup default elevator in add_disk()
> 
> - switch to none in del_disk()
> 
> - reset elevator in blk_mq_update_nr_hw_queues()
> 
> - switch elevator in sysfs `store` elevator attribute
> 
> This patch uses elevator_change() to cover all above cases:
> 
> - every elevator switch is serialized with each other: add_disk/del_disk/
> store elevator is serialized already, blk_mq_update_nr_hw_queues() uses
> srcu for syncing with the other three cases
> 
> - for both add_disk()/del_disk(), queue freeze works at atomic mode
> or has been froze, so the freeze in elevator_change() won't add extra
> delay
> 
> - `struct elev_change_ctx` instance holds any info for changing elevator
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-sysfs.c |  18 +++-----
>   block/blk.h       |   5 +-
>   block/elevator.c  | 114 +++++++++++++++++++++++++---------------------
>   block/genhd.c     |  28 ++----------
>   4 files changed, 75 insertions(+), 90 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 14/20] block: pass elevator_queue to elv_register_queue & unregister_queue
  2025-04-24 15:21 ` [PATCH V3 14/20] block: pass elevator_queue to elv_register_queue & unregister_queue Ming Lei
@ 2025-04-25  6:40   ` Hannes Reinecke
  0 siblings, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25  6:40 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/24/25 17:21, Ming Lei wrote:
> Pass elevator_queue reference to elv_register_queue() & elv_unregister_queue().
> 
> No functional change, and prepare for moving the two out of elevator
> lock & freezing queue, when we need to store the old & new elevator
> queue in `struct elv_change_ctx` instance, then both two can co-exist
> for short while, so we have to pass the exact elevator_queue instance
> to elv_register_queue & unregister_queue.
> 
> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/elevator.c | 16 ++++++++--------
>   1 file changed, 8 insertions(+), 8 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 15/20] block: fail to show/store elevator sysfs attribute if elevator is dying
  2025-04-24 15:21 ` [PATCH V3 15/20] block: fail to show/store elevator sysfs attribute if elevator is dying Ming Lei
@ 2025-04-25  6:45   ` Hannes Reinecke
  2025-04-25 18:36   ` Christoph Hellwig
  1 sibling, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25  6:45 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/24/25 17:21, Ming Lei wrote:
> Prepare for moving elv_register[unregister]_queue out of elevator_lock
> & queue freezing, so we may have to call elv_unregister_queue() after
> elevator ->exit() is called, then there is small window for user to
> call into ->show()/store(), and user-after-free can be caused.
> 
> Fail to show/store elevator sysfs attribute if elevator is dying by
> adding one new flag of ELEVATOR_FLAG_DYNG, which is protected by
> elevator ->sysfs_lock.
> 
> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq-sched.c |  1 +
>   block/elevator.c     | 10 ++++++++--
>   block/elevator.h     |  1 +
>   3 files changed, 10 insertions(+), 2 deletions(-)
> 
Not particularly happy about this, but I fear it's the best we can do
for now.

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 16/20] block: move elv_register[unregister]_queue out of elevator_lock
  2025-04-24 15:21 ` [PATCH V3 16/20] block: move elv_register[unregister]_queue out of elevator_lock Ming Lei
@ 2025-04-25  6:46   ` Hannes Reinecke
  2025-04-25 13:05   ` Nilay Shroff
  2025-04-25 18:37   ` Christoph Hellwig
  2 siblings, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25  6:46 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/24/25 17:21, Ming Lei wrote:
> Move elv_register[unregister]_queue out of ->elevator_lock & queue freezing,
> so we can kill many lockdep warnings.
> 
> elv_register[unregister]_queue() is serialized, and just dealing with sysfs/
> debugfs things, no need to be done with queue frozen.
> 
> With this change, elevator's ->exit() is called before calling
> elv_unregister_queue, then user may call into ->show()/store() of elevator's
> sysfs attributes, and we have covered this issue by adding `ELEVATOR_FLAG_DYNG`.
> 
> For blk-mq debugfs, hctx->sched_tags is always checked with ->elevator_lock by
> debugfs code, meantime hctx->sched_tags is updated with ->elevator_lock, so
> there isn't such issue.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c   |  3 +-
>   block/elevator.c | 80 ++++++++++++++++++++++++++++++++++++------------
>   2 files changed, 62 insertions(+), 21 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 17/20] block: move debugfs/sysfs register out of freezing queue
  2025-04-24 15:21 ` [PATCH V3 17/20] block: move debugfs/sysfs register out of freezing queue Ming Lei
@ 2025-04-25  6:47   ` Hannes Reinecke
  2025-04-25 18:38   ` Christoph Hellwig
  1 sibling, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25  6:47 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/24/25 17:21, Ming Lei wrote:
> Move debugfs/sysfs register out of freezing queue in
> __blk_mq_update_nr_hw_queues(), so that the following lockdep dependency
> can be killed:
> 
> 	#2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
> 	#1 (fs_reclaim){+.+.}-{0:0}:
> 	#0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs
> 
> And registering/un-registering debugfs/sysfs does not require queue to be
> frozen.
> 
> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c | 15 +++++++--------
>   1 file changed, 7 insertions(+), 8 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 18/20] block: remove several ->elevator_lock
  2025-04-24 15:21 ` [PATCH V3 18/20] block: remove several ->elevator_lock Ming Lei
@ 2025-04-25  6:48   ` Hannes Reinecke
  2025-04-25 18:38   ` Christoph Hellwig
  1 sibling, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25  6:48 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/24/25 17:21, Ming Lei wrote:
> Both blk_mq_map_swqueue() and blk_mq_realloc_hw_ctxs() are called before
> the request queue is added to tagset list, so the two won't run concurrently
> with blk_mq_update_nr_hw_queues().
> 
> When the two functions are only called from queue initialization or
> blk_mq_update_nr_hw_queues(), elevator switch can't happen.
> 
> So remove these ->elevator_lock uses.
> 
> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c | 19 ++++---------------
>   1 file changed, 4 insertions(+), 15 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 19/20] block: move hctx cpuhp add/del out of queue freezing
  2025-04-24 15:21 ` [PATCH V3 19/20] block: move hctx cpuhp add/del out of queue freezing Ming Lei
@ 2025-04-25  6:49   ` Hannes Reinecke
  0 siblings, 0 replies; 71+ messages in thread
From: Hannes Reinecke @ 2025-04-25  6:49 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/24/25 17:21, Ming Lei wrote:
> Move hctx cpuhp add/del out of queue freezing for not connecting freeze
> lock with cpuhp locks, then lockdep warning can be avoided.
> 
> This way is safe because both needn't queue to be frozen and scheduler
> switch isn't allowed.
> 
> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e08eda094ae7..add653f84797 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4964,7 +4964,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>   fallback:
>   	blk_mq_update_queue_map(set);
>   	list_for_each_entry(q, &set->tag_list, tag_set_list) {
> -		blk_mq_realloc_hw_ctxs(set, q);
> +		__blk_mq_realloc_hw_ctxs(set, q);
>   
>   		if (q->nr_hw_queues != set->nr_hw_queues) {
>   			int i = prev_nr_hw_queues;
> @@ -4988,6 +4988,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>   	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>   		blk_mq_sysfs_register_hctxs(q);
>   		blk_mq_debugfs_register_hctxs(q);
> +
> +		blk_mq_remove_hw_queues_cpuhp(q);
> +		blk_mq_add_hw_queues_cpuhp(q);
>   	}
>   	memalloc_noio_restore(memflags);
>   
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 04/20] block: use q->elevator with ->elevator_lock held in elv_iosched_show()
  2025-04-24 15:21 ` [PATCH V3 04/20] block: use q->elevator with ->elevator_lock held in elv_iosched_show() Ming Lei
  2025-04-25  6:08   ` Hannes Reinecke
@ 2025-04-25 10:53   ` Nilay Shroff
  2025-04-25 14:29   ` Christoph Hellwig
  2 siblings, 0 replies; 71+ messages in thread
From: Nilay Shroff @ 2025-04-25 10:53 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig



On 4/24/25 8:51 PM, Ming Lei wrote:
> Use q->elevator with ->elevator_lock held in elv_iosched_show(), since
> the local cached elevator reference may become stale after getting
> ->elevator_lock.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Looks good to me:
Reviewed-by : Nilay Shroff <nilay@linux.ibm.com>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues
  2025-04-24 15:21 ` [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues Ming Lei
  2025-04-25  6:33   ` Hannes Reinecke
@ 2025-04-25 11:37   ` Nilay Shroff
  2025-04-25 14:33   ` Christoph Hellwig
  2025-04-25 16:46   ` kernel test robot
  3 siblings, 0 replies; 71+ messages in thread
From: Nilay Shroff @ 2025-04-25 11:37 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig



On 4/24/25 8:51 PM, Ming Lei wrote:
> Both adding/deleting disk code are reader of `nr_hw_queues`, so we can't
> allow them in-progress when updating nr_hw_queues, kernel panic and
> kasan has been reported in [1].
> 
> Prevent adding/deleting disk during updating nr_hw_queues by adding
> rw_semaphore to tagset, write lock is grabbed in blk_mq_update_nr_hw_queues(),
> and read lock is acquired when adding/deleting disk.
> 
> Also mark GFP_NOIO allocation scope for adding/deleting disk because
> blk_mq_update_nr_hw_queues() is part of some driver's error handler.
> 
> This way avoids lot of trouble.
> 
> Suggested-by: Nilay Shroff <nilay@linux.ibm.com>
> Reported-by: Nilay Shroff <nilay@linux.ibm.com>
> Closes: https://lore.kernel.org/linux-block/a5896cdb-a59a-4a37-9f99-20522f5d2987@linux.ibm.com/
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Looks good to me:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress
  2025-04-24 15:21 ` [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress Ming Lei
  2025-04-25  6:33   ` Hannes Reinecke
@ 2025-04-25 12:48   ` Nilay Shroff
  2025-04-27  2:27     ` Ming Lei
  1 sibling, 1 reply; 71+ messages in thread
From: Nilay Shroff @ 2025-04-25 12:48 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig



On 4/24/25 8:51 PM, Ming Lei wrote:
> Elevator switch code is another `nr_hw_queue` reader in non-fast-IO code
> path, so it can't be done if updating `nr_hw_queues` is in-progress.
> 
> Take same approach with not allowing add/del disk when updating
> nr_hw_queues is in-progress, by grabbing read lock of
> set->update_nr_hwq_sema.
> 
> Take the nested variant for avoiding the following false positive
> splat[1], and this way is correct because:
> 
> - the read lock in elv_iosched_store() is not overlapped with the read lock
> in adding/deleting disk:
> 
> - kobject attribute is only available after the kobject is added and
> before it is deleted
> 
>   -> #4 (&q->q_usage_counter(queue){++++}-{0:0}:
>   -> #3 (&q->limits_lock){+.+.}-{4:4}:
>   -> #2 (&disk->open_mutex){+.+.}-{4:4}:
>   -> #1 (&set->update_nr_hwq_lock){.+.+}-{4:4}:
>   -> #0 (kn->active#103){++++}-{0:0}:
> 
> Link: https://lore.kernel.org/linux-block/aAWv3NPtNIKKvJZc@fedora/ [1]
> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/elevator.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/block/elevator.c b/block/elevator.c
> index 4400eb8fe54f..56da6ab7691a 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -723,6 +723,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>  	int ret;
>  	unsigned int memflags;
>  	struct request_queue *q = disk->queue;
> +	struct blk_mq_tag_set *set = q->tag_set;
>  
>  	/*
>  	 * If the attribute needs to load a module, do it before freezing the
> @@ -734,6 +735,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>  
>  	elv_iosched_load_module(name);
>  
> +	down_read_nested(&set->update_nr_hwq_sema, 1);

Why do we need to add nested read lock here? The lockdep splat[1] which
you reported earlier is possibly due to the same reader lock being acquired
recursively in elv_iosched_store and then elevator_change? 
 
On another note, if we suspect possible one-depth recursion for the same 
class of lock then then we should use SINGLE_DEPTH_NESTING (instead of using
1 here) for subclass. But still I am not clear why this lock needs nesting.

Thanks,
--Nilay 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait
  2025-04-24 15:21 ` [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait Ming Lei
  2025-04-25  6:35   ` Hannes Reinecke
@ 2025-04-25 12:50   ` Nilay Shroff
  2025-04-25 14:34   ` Christoph Hellwig
  2 siblings, 0 replies; 71+ messages in thread
From: Nilay Shroff @ 2025-04-25 12:50 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig



On 4/24/25 8:51 PM, Ming Lei wrote:
> Move blk_unregister_queue() & device_del() after freeze wait, and prepare
> for unifying elevator switch.
> 
> This way is just fine, since bdev has been unhashed at the beginning of
> del_gendisk(), both blk_unregister_queue() & device_del() are dealing
> with kobject & debugfs thing only.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Looks good to me:
Reviewed-by : Nilay Shroff <nilay@linux.ibm.com>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 11/20] block: move queue freezing & elevator_lock into elevator_change()
  2025-04-24 15:21 ` [PATCH V3 11/20] block: move queue freezing & elevator_lock into elevator_change() Ming Lei
  2025-04-25  6:36   ` Hannes Reinecke
@ 2025-04-25 12:54   ` Nilay Shroff
  2025-04-25 14:35   ` Christoph Hellwig
  2 siblings, 0 replies; 71+ messages in thread
From: Nilay Shroff @ 2025-04-25 12:54 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig



On 4/24/25 8:51 PM, Ming Lei wrote:
> Move queue freezing & elevator_lock into elevator_change(), and prepare
> for using elevator_change() for setting up & tearing down default elevator
> too.
> 
> Also add lockdep_assert_held() in __elevator_change() because either
> read or write lock is required for changing elevator.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Looks good to me:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 13/20] block: unifying elevator change
  2025-04-24 15:21 ` [PATCH V3 13/20] block: " Ming Lei
  2025-04-25  6:39   ` Hannes Reinecke
@ 2025-04-25 13:03   ` Nilay Shroff
  2025-04-30  0:46     ` Ming Lei
  2025-04-25 18:30   ` Christoph Hellwig
  2 siblings, 1 reply; 71+ messages in thread
From: Nilay Shroff @ 2025-04-25 13:03 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig



On 4/24/25 8:51 PM, Ming Lei wrote:
> @@ -564,12 +558,7 @@ static int __add_disk_fwnode(struct device *parent, struct gendisk *disk,
>  out_free_ext_minor:
>  	if (disk->major == BLOCK_EXT_MAJOR)
>  		blk_free_ext_minor(disk->first_minor);
> -out_exit_elevator:
> -	if (disk->queue->elevator) {
> -		mutex_lock(&disk->queue->elevator_lock);
> -		elevator_exit(disk->queue);
> -		mutex_unlock(&disk->queue->elevator_lock);
> -	}

After restructuring __add_disk_fwnode() we may still need elevator_exit()
in case when blk_register_queue is successful and so we should have set
q->elevator to default. Now if after queue is registered successfully, 
for instance, bdi_register fails then in that case we still have to exit 
elevator to free its resources, isn't it? 

Thanks,
--Nilay 


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 16/20] block: move elv_register[unregister]_queue out of elevator_lock
  2025-04-24 15:21 ` [PATCH V3 16/20] block: move elv_register[unregister]_queue out of elevator_lock Ming Lei
  2025-04-25  6:46   ` Hannes Reinecke
@ 2025-04-25 13:05   ` Nilay Shroff
  2025-04-25 18:37   ` Christoph Hellwig
  2 siblings, 0 replies; 71+ messages in thread
From: Nilay Shroff @ 2025-04-25 13:05 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig



On 4/24/25 8:51 PM, Ming Lei wrote:
> Move elv_register[unregister]_queue out of ->elevator_lock & queue freezing,
> so we can kill many lockdep warnings.
> 
> elv_register[unregister]_queue() is serialized, and just dealing with sysfs/
> debugfs things, no need to be done with queue frozen.
> 
> With this change, elevator's ->exit() is called before calling
> elv_unregister_queue, then user may call into ->show()/store() of elevator's
> sysfs attributes, and we have covered this issue by adding `ELEVATOR_FLAG_DYNG`.
> 
> For blk-mq debugfs, hctx->sched_tags is always checked with ->elevator_lock by
> debugfs code, meantime hctx->sched_tags is updated with ->elevator_lock, so
> there isn't such issue.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Looks good to me:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 20/20] block: move wbt_enable_default() out of queue freezing from sched ->exit()
  2025-04-24 15:21 ` [PATCH V3 20/20] block: move wbt_enable_default() out of queue freezing from sched ->exit() Ming Lei
@ 2025-04-25 13:10   ` Nilay Shroff
  2025-04-29 10:59   ` Nilay Shroff
  1 sibling, 0 replies; 71+ messages in thread
From: Nilay Shroff @ 2025-04-25 13:10 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig



On 4/24/25 8:51 PM, Ming Lei wrote:
> scheduler's ->exit() is called with queue frozen and elevator lock is held, and
> wbt_enable_default() can't be called with queue frozen, otherwise the
> following lockdep warning is triggered:
> 
> 	#6 (&q->rq_qos_mutex){+.+.}-{4:4}:
> 	#5 (&eq->sysfs_lock){+.+.}-{4:4}:
> 	#4 (&q->elevator_lock){+.+.}-{4:4}:
> 	#3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
> 	#2 (fs_reclaim){+.+.}-{0:0}:
> 	#1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
> 	#0 (&q->debugfs_mutex){+.+.}-{4:4}:
> 
> Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
> call it from elevator_change_done().
> 
> Meantime add disk->rqos_state_mutex for covering wbt state change, which
> matches the purpose more than ->elevator_lock.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Looks good to me:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 02/20] block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag
  2025-04-24 15:21 ` [PATCH V3 02/20] block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag Ming Lei
@ 2025-04-25 14:29   ` Christoph Hellwig
  0 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 14:29 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig, Hannes Reinecke

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 03/20] block: don't call freeze queue in elevator_switch() and elevator_disable()
  2025-04-24 15:21 ` [PATCH V3 03/20] block: don't call freeze queue in elevator_switch() and elevator_disable() Ming Lei
@ 2025-04-25 14:29   ` Christoph Hellwig
  0 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 14:29 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig, Hannes Reinecke

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 04/20] block: use q->elevator with ->elevator_lock held in elv_iosched_show()
  2025-04-24 15:21 ` [PATCH V3 04/20] block: use q->elevator with ->elevator_lock held in elv_iosched_show() Ming Lei
  2025-04-25  6:08   ` Hannes Reinecke
  2025-04-25 10:53   ` Nilay Shroff
@ 2025-04-25 14:29   ` Christoph Hellwig
  2 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 14:29 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues
  2025-04-24 15:21 ` [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues Ming Lei
  2025-04-25  6:33   ` Hannes Reinecke
  2025-04-25 11:37   ` Nilay Shroff
@ 2025-04-25 14:33   ` Christoph Hellwig
  2025-04-25 16:46   ` kernel test robot
  3 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 14:33 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

On Thu, Apr 24, 2025 at 11:21:30PM +0800, Ming Lei wrote:
> +	init_rwsem(&set->update_nr_hwq_sema);

Can you please call this update_nr_hwq_lock instead?  _sema is a very
unusual naming for a rw semaphore.

>   * This function registers the partitioning information in @disk
>   * with the kernel. Also attach a fwnode to the disk device.
>   */
> -int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
> -				 const struct attribute_group **groups,
> -				 struct fwnode_handle *fwnode)
> +static int __add_disk_fwnode(struct device *parent, struct gendisk *disk,
> +			     const struct attribute_group **groups,
> +			     struct fwnode_handle *fwnode)

once this becomes internal there is no need for the _fwnode postfix and
this can simply become __add_disk.

This probably wants a lockdep assert that te nr_hwq lock is held for the
blk-mq case.

Also when you use two-tab indentation for the arguments you don't need to
reformat for every little naming change.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait
  2025-04-24 15:21 ` [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait Ming Lei
  2025-04-25  6:35   ` Hannes Reinecke
  2025-04-25 12:50   ` Nilay Shroff
@ 2025-04-25 14:34   ` Christoph Hellwig
  2025-04-28 11:51     ` Ming Lei
  2 siblings, 1 reply; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 14:34 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

On Thu, Apr 24, 2025 at 11:21:33PM +0800, Ming Lei wrote:
> Move blk_unregister_queue() & device_del() after freeze wait, and prepare
> for unifying elevator switch.
> 
> This way is just fine, since bdev has been unhashed at the beginning of
> del_gendisk(), both blk_unregister_queue() & device_del() are dealing
> with kobject & debugfs thing only.

Can you please add a reference to the previous attempt here and that
it was backed out again and why we think it's fine this time around?


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 11/20] block: move queue freezing & elevator_lock into elevator_change()
  2025-04-24 15:21 ` [PATCH V3 11/20] block: move queue freezing & elevator_lock into elevator_change() Ming Lei
  2025-04-25  6:36   ` Hannes Reinecke
  2025-04-25 12:54   ` Nilay Shroff
@ 2025-04-25 14:35   ` Christoph Hellwig
  2 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 14:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues
  2025-04-24 15:21 ` [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues Ming Lei
                     ` (2 preceding siblings ...)
  2025-04-25 14:33   ` Christoph Hellwig
@ 2025-04-25 16:46   ` kernel test robot
  3 siblings, 0 replies; 71+ messages in thread
From: kernel test robot @ 2025-04-25 16:46 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: oe-kbuild-all, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig, Ming Lei

Hi Ming,

kernel test robot noticed the following build warnings:

[auto build test WARNING on axboe-block/for-next]
[also build test WARNING on linus/master v6.15-rc3 next-20250424]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ming-Lei/block-move-blk_mq_add_queue_tag_set-after-blk_mq_map_swqueue/20250424-232508
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link:    https://lore.kernel.org/r/20250424152148.1066220-8-ming.lei%40redhat.com
patch subject: [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues
config: i386-buildonly-randconfig-004-20250425 (https://download.01.org/0day-ci/archive/20250426/202504260007.SFrJbSQ8-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250426/202504260007.SFrJbSQ8-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202504260007.SFrJbSQ8-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> block/genhd.c:406: warning: expecting prototype for add_disk_fwnode(). Prototype was for __add_disk_fwnode() instead


vim +406 block/genhd.c

9301fe73438499 block/genhd.c         Christoph Hellwig 2020-09-21  391  
^1da177e4c3f41 drivers/block/genhd.c Linus Torvalds    2005-04-16  392  /**
9dfd9ea93aeab5 block/genhd.c         Christian Marangi 2024-10-03  393   * add_disk_fwnode - add disk information to kernel list with fwnode
e63a46bef01ff3 block/genhd.c         Dan Williams      2016-06-15  394   * @parent: parent device for the disk
^1da177e4c3f41 drivers/block/genhd.c Linus Torvalds    2005-04-16  395   * @disk: per-device partitioning information
fef912bf860e8e block/genhd.c         Hannes Reinecke   2018-09-28  396   * @groups: Additional per-device sysfs groups
9dfd9ea93aeab5 block/genhd.c         Christian Marangi 2024-10-03  397   * @fwnode: attached disk fwnode
^1da177e4c3f41 drivers/block/genhd.c Linus Torvalds    2005-04-16  398   *
^1da177e4c3f41 drivers/block/genhd.c Linus Torvalds    2005-04-16  399   * This function registers the partitioning information in @disk
9dfd9ea93aeab5 block/genhd.c         Christian Marangi 2024-10-03  400   * with the kernel. Also attach a fwnode to the disk device.
^1da177e4c3f41 drivers/block/genhd.c Linus Torvalds    2005-04-16  401   */
54650b49b7412c block/genhd.c         Ming Lei          2025-04-24  402  static int __add_disk_fwnode(struct device *parent, struct gendisk *disk,
9dfd9ea93aeab5 block/genhd.c         Christian Marangi 2024-10-03  403  			     const struct attribute_group **groups,
9dfd9ea93aeab5 block/genhd.c         Christian Marangi 2024-10-03  404  			     struct fwnode_handle *fwnode)
d1254a8749711e block/genhd.c         Christoph Hellwig 2021-08-04  405  
^1da177e4c3f41 drivers/block/genhd.c Linus Torvalds    2005-04-16 @406  {
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  407  	struct device *ddev = disk_to_dev(disk);
7c3f828b522b07 block/genhd.c         Christoph Hellwig 2021-05-21  408  	int ret;
cf0ca9fe5dd9e3 block/genhd.c         Peter Zijlstra    2008-04-30  409  
3d9a9e9a77c5eb block/genhd.c         Ming Lei          2025-01-15  410  	if (WARN_ON_ONCE(bdev_nr_sectors(disk->part0) > BLK_DEV_MAX_SECTORS))
3d9a9e9a77c5eb block/genhd.c         Ming Lei          2025-01-15  411  		return -EINVAL;
3d9a9e9a77c5eb block/genhd.c         Ming Lei          2025-01-15  412  
6783811569aef2 block/genhd.c         Christoph Hellwig 2025-01-06  413  	if (queue_is_mq(disk->queue)) {
6783811569aef2 block/genhd.c         Christoph Hellwig 2025-01-06  414  		/*
6783811569aef2 block/genhd.c         Christoph Hellwig 2025-01-06  415  		 * ->submit_bio and ->poll_bio are bypassed for blk-mq drivers.
6783811569aef2 block/genhd.c         Christoph Hellwig 2025-01-06  416  		 */
6783811569aef2 block/genhd.c         Christoph Hellwig 2025-01-06  417  		if (disk->fops->submit_bio || disk->fops->poll_bio)
69fe0f29892077 block/genhd.c         Ming Lei          2022-03-04  418  			return -EINVAL;
69fe0f29892077 block/genhd.c         Ming Lei          2022-03-04  419  
737eb78e82d52d block/genhd.c         Damien Le Moal    2019-09-05  420  		/*
6783811569aef2 block/genhd.c         Christoph Hellwig 2025-01-06  421  		 * Initialize the I/O scheduler code and pick a default one if
6783811569aef2 block/genhd.c         Christoph Hellwig 2025-01-06  422  		 * needed.
737eb78e82d52d block/genhd.c         Damien Le Moal    2019-09-05  423  		 */
737eb78e82d52d block/genhd.c         Damien Le Moal    2019-09-05  424  		elevator_init_mq(disk->queue);
6783811569aef2 block/genhd.c         Christoph Hellwig 2025-01-06  425  	} else {
6783811569aef2 block/genhd.c         Christoph Hellwig 2025-01-06  426  		if (!disk->fops->submit_bio)
6783811569aef2 block/genhd.c         Christoph Hellwig 2025-01-06  427  			return -EINVAL;
ac2b6f9dee8f41 block/genhd.c         Al Viro           2024-04-12  428  		bdev_set_flag(disk->part0, BD_HAS_SUBMIT_BIO);
6783811569aef2 block/genhd.c         Christoph Hellwig 2025-01-06  429  	}
9f4107b07b17b5 block/genhd.c         Jens Axboe        2023-04-14  430  
7c3f828b522b07 block/genhd.c         Christoph Hellwig 2021-05-21  431  	/*
7c3f828b522b07 block/genhd.c         Christoph Hellwig 2021-05-21  432  	 * If the driver provides an explicit major number it also must provide
7c3f828b522b07 block/genhd.c         Christoph Hellwig 2021-05-21  433  	 * the number of minors numbers supported, and those will be used to
7c3f828b522b07 block/genhd.c         Christoph Hellwig 2021-05-21  434  	 * setup the gendisk.
7c3f828b522b07 block/genhd.c         Christoph Hellwig 2021-05-21  435  	 * Otherwise just allocate the device numbers for both the whole device
7c3f828b522b07 block/genhd.c         Christoph Hellwig 2021-05-21  436  	 * and all partitions from the extended dev_t space.
3e1a7ff8a0a7b9 block/genhd.c         Tejun Heo         2008-08-25  437  	 */
02341a08c9dec5 block/genhd.c         Yu Kuai           2022-10-22  438  	ret = -EINVAL;
7c3f828b522b07 block/genhd.c         Christoph Hellwig 2021-05-21  439  	if (disk->major) {
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  440  		if (WARN_ON(!disk->minors))
02341a08c9dec5 block/genhd.c         Yu Kuai           2022-10-22  441  			goto out_exit_elevator;
2e3c73fa0c419f block/genhd.c         Christoph Hellwig 2021-05-21  442  
2e3c73fa0c419f block/genhd.c         Christoph Hellwig 2021-05-21  443  		if (disk->minors > DISK_MAX_PARTS) {
2e3c73fa0c419f block/genhd.c         Christoph Hellwig 2021-05-21  444  			pr_err("block: can't allocate more than %d partitions\n",
2e3c73fa0c419f block/genhd.c         Christoph Hellwig 2021-05-21  445  				DISK_MAX_PARTS);
2e3c73fa0c419f block/genhd.c         Christoph Hellwig 2021-05-21  446  			disk->minors = DISK_MAX_PARTS;
2e3c73fa0c419f block/genhd.c         Christoph Hellwig 2021-05-21  447  		}
4c434392c47778 block/genhd.c         Li Nan            2023-12-19  448  		if (disk->first_minor > MINORMASK ||
4c434392c47778 block/genhd.c         Li Nan            2023-12-19  449  		    disk->minors > MINORMASK + 1 ||
4c434392c47778 block/genhd.c         Li Nan            2023-12-19  450  		    disk->first_minor + disk->minors > MINORMASK + 1)
02341a08c9dec5 block/genhd.c         Yu Kuai           2022-10-22  451  			goto out_exit_elevator;
7c3f828b522b07 block/genhd.c         Christoph Hellwig 2021-05-21  452  	} else {
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  453  		if (WARN_ON(disk->minors))
02341a08c9dec5 block/genhd.c         Yu Kuai           2022-10-22  454  			goto out_exit_elevator;
3e1a7ff8a0a7b9 block/genhd.c         Tejun Heo         2008-08-25  455  
7c3f828b522b07 block/genhd.c         Christoph Hellwig 2021-05-21  456  		ret = blk_alloc_ext_minor();
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  457  		if (ret < 0)
02341a08c9dec5 block/genhd.c         Yu Kuai           2022-10-22  458  			goto out_exit_elevator;
7c3f828b522b07 block/genhd.c         Christoph Hellwig 2021-05-21  459  		disk->major = BLOCK_EXT_MAJOR;
539711d7d6fe38 block/genhd.c         Christoph Hellwig 2021-08-24  460  		disk->first_minor = ret;
7c3f828b522b07 block/genhd.c         Christoph Hellwig 2021-05-21  461  	}
7c3f828b522b07 block/genhd.c         Christoph Hellwig 2021-05-21  462  
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  463  	/* delay uevents, until we scanned partition table */
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  464  	dev_set_uevent_suppress(ddev, 1);
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  465  
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  466  	ddev->parent = parent;
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  467  	ddev->groups = groups;
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  468  	dev_set_name(ddev, "%s", disk->disk_name);
9dfd9ea93aeab5 block/genhd.c         Christian Marangi 2024-10-03  469  	if (fwnode)
9dfd9ea93aeab5 block/genhd.c         Christian Marangi 2024-10-03  470  		device_set_node(ddev, fwnode);
8235b5c1e8c1c0 block/genhd.c         Christoph Hellwig 2021-08-18  471  	if (!(disk->flags & GENHD_FL_HIDDEN))
8235b5c1e8c1c0 block/genhd.c         Christoph Hellwig 2021-08-18  472  		ddev->devt = MKDEV(disk->major, disk->first_minor);
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  473  	ret = device_add(ddev);
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  474  	if (ret)
99d8690aae4b2f block/genhd.c         Christoph Hellwig 2021-12-21  475  		goto out_free_ext_minor;
99d8690aae4b2f block/genhd.c         Christoph Hellwig 2021-12-21  476  
99d8690aae4b2f block/genhd.c         Christoph Hellwig 2021-12-21  477  	ret = disk_alloc_events(disk);
99d8690aae4b2f block/genhd.c         Christoph Hellwig 2021-12-21  478  	if (ret)
99d8690aae4b2f block/genhd.c         Christoph Hellwig 2021-12-21  479  		goto out_device_del;
99d8690aae4b2f block/genhd.c         Christoph Hellwig 2021-12-21  480  
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  481  	ret = sysfs_create_link(block_depr, &ddev->kobj,
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  482  				kobject_name(&ddev->kobj));
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  483  	if (ret)
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  484  		goto out_device_del;
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  485  
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  486  	/*
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  487  	 * avoid probable deadlock caused by allocating memory with
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  488  	 * GFP_KERNEL in runtime_resume callback of its all ancestor
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  489  	 * devices
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  490  	 */
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  491  	pm_runtime_set_memalloc_noio(ddev, true);
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  492  
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  493  	disk->part0->bd_holder_dir =
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  494  		kobject_create_and_add("holders", &ddev->kobj);
fe7d064fa3faec block/genhd.c         Luis Chamberlain  2021-11-03  495  	if (!disk->part0->bd_holder_dir) {
fe7d064fa3faec block/genhd.c         Luis Chamberlain  2021-11-03  496  		ret = -ENOMEM;
ff53cd52d9bdbf block/genhd.c         Thomas Weißschuh  2023-03-18  497  		goto out_del_block_link;
fe7d064fa3faec block/genhd.c         Luis Chamberlain  2021-11-03  498  	}
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  499  	disk->slave_dir = kobject_create_and_add("slaves", &ddev->kobj);
fe7d064fa3faec block/genhd.c         Luis Chamberlain  2021-11-03  500  	if (!disk->slave_dir) {
fe7d064fa3faec block/genhd.c         Luis Chamberlain  2021-11-03  501  		ret = -ENOMEM;
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  502  		goto out_put_holder_dir;
fe7d064fa3faec block/genhd.c         Luis Chamberlain  2021-11-03  503  	}
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  504  
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  505  	ret = blk_register_queue(disk);
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  506  	if (ret)
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  507  		goto out_put_slave_dir;
75f4dca59694df block/genhd.c         Christoph Hellwig 2021-08-18  508  
9f18db572c97bc block/genhd.c         Christoph Hellwig 2021-11-22  509  	if (!(disk->flags & GENHD_FL_HIDDEN)) {
8235b5c1e8c1c0 block/genhd.c         Christoph Hellwig 2021-08-18  510  		ret = bdi_register(disk->bdi, "%u:%u",
8235b5c1e8c1c0 block/genhd.c         Christoph Hellwig 2021-08-18  511  				   disk->major, disk->first_minor);
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  512  		if (ret)
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  513  			goto out_unregister_queue;
8235b5c1e8c1c0 block/genhd.c         Christoph Hellwig 2021-08-18  514  		bdi_set_owner(disk->bdi, ddev);
9d5ee6767c8576 block/genhd.c         Christoph Hellwig 2021-08-18  515  		ret = sysfs_create_link(&ddev->kobj,
9d5ee6767c8576 block/genhd.c         Christoph Hellwig 2021-08-18  516  					&disk->bdi->dev->kobj, "bdi");
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  517  		if (ret)
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  518  			goto out_unregister_bdi;
8235b5c1e8c1c0 block/genhd.c         Christoph Hellwig 2021-08-18  519  
e5cfefa97bccf9 block/genhd.c         Yu Kuai           2023-02-17  520  		/* Make sure the first partition scan will be proceed */
140ce28dd3bee8 block/genhd.c         Christoph Hellwig 2024-05-02  521  		if (get_capacity(disk) && disk_has_partscan(disk))
e5cfefa97bccf9 block/genhd.c         Yu Kuai           2023-02-17  522  			set_bit(GD_NEED_PART_SCAN, &disk->state);
e5cfefa97bccf9 block/genhd.c         Yu Kuai           2023-02-17  523  
9d5ee6767c8576 block/genhd.c         Christoph Hellwig 2021-08-18  524  		bdev_add(disk->part0, ddev->devt);
e16e506ccd673a block/genhd.c         Christoph Hellwig 2021-11-22  525  		if (get_capacity(disk))
05bdb9965305bb block/genhd.c         Christoph Hellwig 2023-06-08  526  			disk_scan_partitions(disk, BLK_OPEN_READ);
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  527  
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  528  		/*
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  529  		 * Announce the disk and partitions after all partitions are
8235b5c1e8c1c0 block/genhd.c         Christoph Hellwig 2021-08-18  530  		 * created. (for hidden disks uevents remain suppressed forever)
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  531  		 */
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  532  		dev_set_uevent_suppress(ddev, 0);
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  533  		disk_uevent(disk, KOBJ_ADD);
a0a6314ae774f8 block/genhd.c         Christoph Hellwig 2022-10-10  534  	} else {
a0a6314ae774f8 block/genhd.c         Christoph Hellwig 2022-10-10  535  		/*
a0a6314ae774f8 block/genhd.c         Christoph Hellwig 2022-10-10  536  		 * Even if the block_device for a hidden gendisk is not
a0a6314ae774f8 block/genhd.c         Christoph Hellwig 2022-10-10  537  		 * registered, it needs to have a valid bd_dev so that the
a0a6314ae774f8 block/genhd.c         Christoph Hellwig 2022-10-10  538  		 * freeing of the dynamic major works.
a0a6314ae774f8 block/genhd.c         Christoph Hellwig 2022-10-10  539  		 */
a0a6314ae774f8 block/genhd.c         Christoph Hellwig 2022-10-10  540  		disk->part0->bd_dev = MKDEV(disk->major, disk->first_minor);
8ddcd653257c18 block/genhd.c         Christoph Hellwig 2017-11-02  541  	}
52b85909f85d06 block/genhd.c         Christoph Hellwig 2021-08-18  542  
73781b3b81e765 block/genhd.c         Christoph Hellwig 2024-06-26  543  	blk_apply_bdi_limits(disk->bdi, &disk->queue->limits);
77ea887e433ad8 block/genhd.c         Tejun Heo         2010-12-08  544  	disk_add_events(disk);
76792055c4c8b2 block/genhd.c         Christoph Hellwig 2022-02-15  545  	set_bit(GD_ADDED, &disk->state);
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  546  	return 0;
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  547  
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  548  out_unregister_bdi:
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  549  	if (!(disk->flags & GENHD_FL_HIDDEN))
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  550  		bdi_unregister(disk->bdi);
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  551  out_unregister_queue:
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  552  	blk_unregister_queue(disk);
fa81cbafbf5764 block/genhd.c         Chen Zhongjin     2022-10-29  553  	rq_qos_exit(disk->queue);
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  554  out_put_slave_dir:
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  555  	kobject_put(disk->slave_dir);
d90db3b1c8676b block/genhd.c         Christoph Hellwig 2022-11-15  556  	disk->slave_dir = NULL;
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  557  out_put_holder_dir:
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  558  	kobject_put(disk->part0->bd_holder_dir);
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  559  out_del_block_link:
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  560  	sysfs_remove_link(block_depr, dev_name(ddev));
5fa3d1a00c2d4b block/genhd.c         Li Nan            2023-12-11  561  	pm_runtime_set_memalloc_noio(ddev, false);
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  562  out_device_del:
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  563  	device_del(ddev);
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  564  out_free_ext_minor:
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  565  	if (disk->major == BLOCK_EXT_MAJOR)
83cbce9574462c block/genhd.c         Luis Chamberlain  2021-08-18  566  		blk_free_ext_minor(disk->first_minor);
02341a08c9dec5 block/genhd.c         Yu Kuai           2022-10-22  567  out_exit_elevator:
1bf70d08cc3b55 block/genhd.c         Nilay Shroff      2025-03-04  568  	if (disk->queue->elevator) {
1bf70d08cc3b55 block/genhd.c         Nilay Shroff      2025-03-04  569  		mutex_lock(&disk->queue->elevator_lock);
02341a08c9dec5 block/genhd.c         Yu Kuai           2022-10-22  570  		elevator_exit(disk->queue);
1bf70d08cc3b55 block/genhd.c         Nilay Shroff      2025-03-04  571  		mutex_unlock(&disk->queue->elevator_lock);
1bf70d08cc3b55 block/genhd.c         Nilay Shroff      2025-03-04  572  	}
278167fd2f8ffe block/genhd.c         Luis Chamberlain  2021-11-09  573  	return ret;
^1da177e4c3f41 drivers/block/genhd.c Linus Torvalds    2005-04-16  574  }
54650b49b7412c block/genhd.c         Ming Lei          2025-04-24  575  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 09/20] block: simplify elevator reattachment for updating nr_hw_queues
  2025-04-24 15:21 ` [PATCH V3 09/20] block: simplify elevator reattachment for updating nr_hw_queues Ming Lei
  2025-04-25  6:34   ` Hannes Reinecke
@ 2025-04-25 18:12   ` Christoph Hellwig
  2025-04-29  9:51     ` Ming Lei
  1 sibling, 1 reply; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 18:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 448 bytes --]

On Thu, Apr 24, 2025 at 11:21:32PM +0800, Ming Lei wrote:
> +static int __elevator_change(struct request_queue *q,
> +			     const char *elevator_name)

There's still not good reason for this helper.

I'd suggest you add the two first attached patches before this one
(it'll need a bi of reabsing as all of them are after your entire
sweries right now) and then fold the third one into this, which will
give us less code and a cleaner interface.


[-- Attachment #2: 0001-block-look-up-the-elevator-type-in-elevator_switch.patch --]
[-- Type: text/x-patch, Size: 2036 bytes --]

From 44da16a97ef758d1d9fe7c54c8360388f585d0c5 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Fri, 25 Apr 2025 07:01:39 -0700
Subject: block: look up the elevator type in elevator_switch

That makes the function nicely self-contained and can be used
to avoid code duplication.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/elevator.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index a637426da56d..773b8931d874 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -591,14 +591,18 @@ static bool use_default_elevator(struct request_queue *q)
  * If switching fails, we are most likely running out of memory and not able
  * to restore the old io scheduler, so leaving the io scheduler being none.
  */
-static int elevator_switch(struct request_queue *q, struct elevator_type *new_e,
-			   struct elv_change_ctx *ctx)
+static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
 {
+	struct elevator_type *new_e;
 	int ret;
 
 	WARN_ON_ONCE(q->mq_freeze_depth == 0);
 	lockdep_assert_held(&q->elevator_lock);
 
+	new_e = elevator_find_get(ctx->name);
+	if (!new_e)
+		return -EINVAL;
+
 	blk_mq_quiesce_queue(q);
 
 	if (q->elevator) {
@@ -621,6 +625,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e,
 			new_e->elevator_name);
 	}
 
+	elevator_put(new_e);
 	return ret;
 }
 
@@ -686,8 +691,6 @@ static int __elevator_change(struct request_queue *q,
 			     struct elv_change_ctx *ctx)
 {
 	const char *elevator_name = ctx->name;
-	struct elevator_type *e;
-	int ret;
 
 	lockdep_assert_held(&q->tag_set->update_nr_hwq_sema);
 
@@ -701,12 +704,7 @@ static int __elevator_change(struct request_queue *q,
 		return 0;
 	}
 
-	e = elevator_find_get(elevator_name);
-	if (!e)
-		return -EINVAL;
-	ret = elevator_switch(q, e, ctx);
-	elevator_put(e);
-	return ret;
+	return elevator_switch(q, ctx);
 }
 
 static int elevator_change(struct request_queue *q,
-- 
2.47.2


[-- Attachment #3: 0002-block-fold-elevator_disable-into-elevator_switch.patch --]
[-- Type: text/x-patch, Size: 3109 bytes --]

From 1bfce1a308b9e46734ed56196b4e9fe31b5a0036 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Fri, 25 Apr 2025 07:10:35 -0700
Subject: block: fold elevator_disable into elevator_switch

This removes duplicate code, and keeps the callers tidy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/elevator.c | 59 ++++++++++++++++--------------------------------
 1 file changed, 20 insertions(+), 39 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index 773b8931d874..59ff0abde920 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -593,15 +593,17 @@ static bool use_default_elevator(struct request_queue *q)
  */
 static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
 {
-	struct elevator_type *new_e;
-	int ret;
+	struct elevator_type *new_e = NULL;
+	int ret = 0;
 
 	WARN_ON_ONCE(q->mq_freeze_depth == 0);
 	lockdep_assert_held(&q->elevator_lock);
 
-	new_e = elevator_find_get(ctx->name);
-	if (!new_e)
-		return -EINVAL;
+	if (strncmp(ctx->name, "none", 4)) {
+		new_e = elevator_find_get(ctx->name);
+		if (!new_e)
+			return -EINVAL;
+	}
 
 	blk_mq_quiesce_queue(q);
 
@@ -610,12 +612,17 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
 		__elevator_exit(q);
 	}
 
-	ret = blk_mq_init_sched(q, new_e);
-	if (ret)
-		goto out_unfreeze;
-
-	ctx->new = q->elevator;
-	blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+	if (new_e) {
+		ret = blk_mq_init_sched(q, new_e);
+		if (ret)
+			goto out_unfreeze;
+		ctx->new = q->elevator;
+	} else {
+		blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q);
+		q->elevator = NULL;
+		q->nr_requests = q->tag_set->queue_depth;
+	}
+	blk_add_trace_msg(q, "elv switch: %s", ctx->name);
 
 out_unfreeze:
 	blk_mq_unquiesce_queue(q);
@@ -625,28 +632,11 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
 			new_e->elevator_name);
 	}
 
-	elevator_put(new_e);
+	if (new_e)
+		elevator_put(new_e);
 	return ret;
 }
 
-static void elevator_disable(struct request_queue *q,
-			     struct elv_change_ctx *ctx)
-{
-	WARN_ON_ONCE(q->mq_freeze_depth == 0);
-	lockdep_assert_held(&q->elevator_lock);
-
-	blk_mq_quiesce_queue(q);
-
-	ctx->old = q->elevator;
-	__elevator_exit(q);
-	blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q);
-	q->elevator = NULL;
-	q->nr_requests = q->tag_set->queue_depth;
-	blk_add_trace_msg(q, "elv switch: none");
-
-	blk_mq_unquiesce_queue(q);
-}
-
 static void elv_exit_and_release(struct request_queue *q)
 {
 	struct elevator_queue *e;
@@ -690,20 +680,11 @@ static int elevator_change_done(struct request_queue *q,
 static int __elevator_change(struct request_queue *q,
 			     struct elv_change_ctx *ctx)
 {
-	const char *elevator_name = ctx->name;
-
 	lockdep_assert_held(&q->tag_set->update_nr_hwq_sema);
 
 	/* Make sure queue is not in the middle of being removed */
 	if (!ctx->init && !blk_queue_registered(q))
 		return -ENOENT;
-
-	if (!strncmp(elevator_name, "none", 4)) {
-		if (q->elevator)
-			elevator_disable(q, ctx);
-		return 0;
-	}
-
 	return elevator_switch(q, ctx);
 }
 
-- 
2.47.2


[-- Attachment #4: 0003-block-remove-__elevator_change.patch --]
[-- Type: text/x-patch, Size: 2756 bytes --]

From dcda3f508e5f938cb27d4b743226ca4d8af75e28 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Fri, 25 Apr 2025 07:18:42 -0700
Subject: block: remove __elevator_change

Not much of a point in sharing code between callers with very different
expectations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/elevator.c | 45 ++++++++++++++++++++-------------------------
 1 file changed, 20 insertions(+), 25 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index 59ff0abde920..b358858387a0 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -674,25 +674,13 @@ static int elevator_change_done(struct request_queue *q,
 	return ret;
 }
 
-/*
- * Switch this queue to the given IO scheduler.
- */
-static int __elevator_change(struct request_queue *q,
-			     struct elv_change_ctx *ctx)
-{
-	lockdep_assert_held(&q->tag_set->update_nr_hwq_sema);
-
-	/* Make sure queue is not in the middle of being removed */
-	if (!ctx->init && !blk_queue_registered(q))
-		return -ENOENT;
-	return elevator_switch(q, ctx);
-}
-
 static int elevator_change(struct request_queue *q,
 			   struct elv_change_ctx *ctx)
 {
 	unsigned int memflags;
-	int ret = 0;
+	int ret;
+
+	lockdep_assert_held(&q->tag_set->update_nr_hwq_sema);
 
 	memflags = blk_mq_freeze_queue(q);
 	/*
@@ -706,14 +694,20 @@ static int elevator_change(struct request_queue *q,
 	 */
 	blk_mq_cancel_work_sync(q);
 	mutex_lock(&q->elevator_lock);
-	if (!q->elevator || !elevator_match(q->elevator->type, ctx->name))
-		ret = __elevator_change(q, ctx);
+	/* Make sure queue is not in the middle of being removed */
+	ret = -ENOENT;
+	if (!ctx->init && !blk_queue_registered(q))
+		goto out_unlock;
+	ret = 0;
+	if (q->elevator && elevator_match(q->elevator->type, ctx->name))
+		goto out_unlock;
+	ret = elevator_switch(q, ctx);
+out_unlock:
 	mutex_unlock(&q->elevator_lock);
 	blk_mq_unfreeze_queue(q, memflags);
-	if (!ret)
-		ret = elevator_change_done(q, ctx);
-
-	return ret;
+	if (ret)
+		return ret;
+	return elevator_change_done(q, ctx);
 }
 
 /*
@@ -768,17 +762,18 @@ void elevator_set_none(struct request_queue *q)
 void elv_update_nr_hw_queues(struct request_queue *q)
 {
 	struct elv_change_ctx ctx = {
-		.name	= "none",
 		.uevent	= true,
 	};
-	int ret = -ENODEV;
+	int ret = -ENOENT;
 
+	lockdep_assert_held(&q->tag_set->update_nr_hwq_sema);
 	WARN_ON_ONCE(q->mq_freeze_depth == 0);
 
 	mutex_lock(&q->elevator_lock);
-	if (q->elevator && !blk_queue_dying(q))
+	if (blk_queue_registered(q) && !blk_queue_dying(q) && q->elevator) {
 		ctx.name = q->elevator->type->elevator_name;
-	ret = __elevator_change(q, &ctx);
+		ret = elevator_switch(q, &ctx);
+	}
 	mutex_unlock(&q->elevator_lock);
 	blk_mq_unfreeze_queue_nomemrestore(q);
 
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 12/20] block: add `struct elv_change_ctx` for unifying elevator change
  2025-04-24 15:21 ` [PATCH V3 12/20] block: add `struct elv_change_ctx` for unifying elevator change Ming Lei
  2025-04-25  6:38   ` Hannes Reinecke
@ 2025-04-25 18:23   ` Christoph Hellwig
  2025-04-29 15:45     ` Ming Lei
  1 sibling, 1 reply; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 18:23 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

On Thu, Apr 24, 2025 at 11:21:35PM +0800, Ming Lei wrote:
> +struct elv_change_ctx {
> +	const char *name;
> +	bool uevent;

There's only one caller that wants to supress the uevents.  So maybe
invert the polarity so that it only has to be set in one place,
which also documents how setting the initial scheduler is special a
bit better.

> -	ret = elv_register_queue(q, true);
> +	ret = elv_register_queue(q, ctx->uevent);

.. and pass the ctx on to elv_register_queue instead of converting
paramter types.  Although that might be woeth doing later when
another argument derived from ctx gets passed as well.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 13/20] block: unifying elevator change
  2025-04-24 15:21 ` [PATCH V3 13/20] block: " Ming Lei
  2025-04-25  6:39   ` Hannes Reinecke
  2025-04-25 13:03   ` Nilay Shroff
@ 2025-04-25 18:30   ` Christoph Hellwig
  2 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 18:30 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

On Thu, Apr 24, 2025 at 11:21:36PM +0800, Ming Lei wrote:
> elevator change is one well-define behavior:

Start the sentence captitalized.

> +static bool use_default_elevator(struct request_queue *q)
>  {
>  	if (q->tag_set->flags & BLK_MQ_F_NO_SCHED_BY_DEFAULT)
> +		return false;
>  
>  	if (q->nr_hw_queues != 1 &&
>  	    !blk_mq_is_shared_tags(q->tag_set->flags))
> +		return false;
>  
> +	return true;

looking at the only caller I think this is better open coded there.

> @@ -694,7 +651,7 @@ static int __elevator_change(struct request_queue *q,
>  	lockdep_assert_held(&q->tag_set->update_nr_hwq_sema);
>  
>  	/* Make sure queue is not in the middle of being removed */
> -	if (!blk_queue_registered(q))
> +	if (!ctx->init && !blk_queue_registered(q))

This init flag is confusingly also set by the teardown, so it is quite
misnamed.  But given that none of the locks held here protected the
queue registered flag, why not just move the check out to
elv_iosched_store and remove the entire need for the flag?

> +/*
> + * Use the default elevator settings. If the chosen elevator initialization
> + * fails, fall back to the "none" elevator (no elevator).
> + */
> +void elevator_set_default(struct request_queue *q)
> +{
> +	struct elv_change_ctx ctx = {
> +		.init = true,
> +	};
> +	int err;
> +
> +	if (!queue_is_mq(q))
> +		return;

Please keep this in the caller.  Same for the set_none side.

> +
> +	ctx.name = use_default_elevator(q) ? "mq-deadline" : "none";
> +	if (!q->elevator && !strcmp(ctx.name, "none"))
> +		return;
> +	err = elevator_change(q, &ctx);

q->elevator can't be set here.  And the extra none strcmp is weird
because we have the boolean information avaiable.

Just make this:

	struct elv_change_ctx ctx = {
		.name = "mq-deadline",
	};

	...

 	if (!(q->tag_set->flags & BLK_MQ_F_NO_SCHED_BY_DEFAULT) &&
	    (q->nr_hw_queues == 1 ||
	     blk_mq_is_shared_tags(q->tag_set->flags))) {
		err = elevator_change(q, &ctx);

> +	if (err < 0)
> +		pr_warn("\"%s\" set elevator failed %d, "
> +			"falling back to \"none\"\n", ctx.name, err);

Please keep the more useful message from the current code.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 15/20] block: fail to show/store elevator sysfs attribute if elevator is dying
  2025-04-24 15:21 ` [PATCH V3 15/20] block: fail to show/store elevator sysfs attribute if elevator is dying Ming Lei
  2025-04-25  6:45   ` Hannes Reinecke
@ 2025-04-25 18:36   ` Christoph Hellwig
  2025-04-30  1:07     ` Ming Lei
  1 sibling, 1 reply; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 18:36 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

On Thu, Apr 24, 2025 at 11:21:38PM +0800, Ming Lei wrote:
> +	if (test_bit(ELEVATOR_FLAG_DYING, &e->flags))
> +		error = -ENODEV;
> +	else
> +		error = e->type ? entry->show(e, page) : -ENOENT;

Weird style mix, I'd expand the check for ->type to a proper else
if here as well.  But how can e->type actually be NULL here to start
with?


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 16/20] block: move elv_register[unregister]_queue out of elevator_lock
  2025-04-24 15:21 ` [PATCH V3 16/20] block: move elv_register[unregister]_queue out of elevator_lock Ming Lei
  2025-04-25  6:46   ` Hannes Reinecke
  2025-04-25 13:05   ` Nilay Shroff
@ 2025-04-25 18:37   ` Christoph Hellwig
  2 siblings, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 18:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

On Thu, Apr 24, 2025 at 11:21:39PM +0800, Ming Lei wrote:
> -static void elevator_exit(struct request_queue *q)
> +static void __elevator_exit(struct request_queue *q)

Why does elevator_exit grow a spurious __ prefix?


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 17/20] block: move debugfs/sysfs register out of freezing queue
  2025-04-24 15:21 ` [PATCH V3 17/20] block: move debugfs/sysfs register out of freezing queue Ming Lei
  2025-04-25  6:47   ` Hannes Reinecke
@ 2025-04-25 18:38   ` Christoph Hellwig
  2025-04-28 11:28     ` Ming Lei
  1 sibling, 1 reply; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 18:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

On Thu, Apr 24, 2025 at 11:21:40PM +0800, Ming Lei wrote:
> Move debugfs/sysfs register out of freezing queue in
> __blk_mq_update_nr_hw_queues(), so that the following lockdep dependency
> can be killed:
> 
> 	#2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
> 	#1 (fs_reclaim){+.+.}-{0:0}:
> 	#0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs
> 
> And registering/un-registering debugfs/sysfs does not require queue to be
> frozen.

Please explain the why it's not required.  And talking about lockdep
dependencies is not very useful - it either is an actual lock ordering
issue or a fale positive, but lockdep is just the messenger.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 18/20] block: remove several ->elevator_lock
  2025-04-24 15:21 ` [PATCH V3 18/20] block: remove several ->elevator_lock Ming Lei
  2025-04-25  6:48   ` Hannes Reinecke
@ 2025-04-25 18:38   ` Christoph Hellwig
  1 sibling, 0 replies; 71+ messages in thread
From: Christoph Hellwig @ 2025-04-25 18:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

This is missing a "critial section: or similar in the Subject line
for the grammar to make sense.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress
  2025-04-25 12:48   ` Nilay Shroff
@ 2025-04-27  2:27     ` Ming Lei
  2025-04-28 16:17       ` Nilay Shroff
  0 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2025-04-27  2:27 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

On Fri, Apr 25, 2025 at 06:18:33PM +0530, Nilay Shroff wrote:
> 
> 
> On 4/24/25 8:51 PM, Ming Lei wrote:
> > Elevator switch code is another `nr_hw_queue` reader in non-fast-IO code
> > path, so it can't be done if updating `nr_hw_queues` is in-progress.
> > 
> > Take same approach with not allowing add/del disk when updating
> > nr_hw_queues is in-progress, by grabbing read lock of
> > set->update_nr_hwq_sema.
> > 
> > Take the nested variant for avoiding the following false positive
> > splat[1], and this way is correct because:
> > 
> > - the read lock in elv_iosched_store() is not overlapped with the read lock
> > in adding/deleting disk:
> > 
> > - kobject attribute is only available after the kobject is added and
> > before it is deleted
> > 
> >   -> #4 (&q->q_usage_counter(queue){++++}-{0:0}:
> >   -> #3 (&q->limits_lock){+.+.}-{4:4}:
> >   -> #2 (&disk->open_mutex){+.+.}-{4:4}:
> >   -> #1 (&set->update_nr_hwq_lock){.+.+}-{4:4}:
> >   -> #0 (kn->active#103){++++}-{0:0}:
> > 
> > Link: https://lore.kernel.org/linux-block/aAWv3NPtNIKKvJZc@fedora/ [1]
> > Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/elevator.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/block/elevator.c b/block/elevator.c
> > index 4400eb8fe54f..56da6ab7691a 100644
> > --- a/block/elevator.c
> > +++ b/block/elevator.c
> > @@ -723,6 +723,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
> >  	int ret;
> >  	unsigned int memflags;
> >  	struct request_queue *q = disk->queue;
> > +	struct blk_mq_tag_set *set = q->tag_set;
> >  
> >  	/*
> >  	 * If the attribute needs to load a module, do it before freezing the
> > @@ -734,6 +735,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
> >  
> >  	elv_iosched_load_module(name);
> >  
> > +	down_read_nested(&set->update_nr_hwq_sema, 1);
> 
> Why do we need to add nested read lock here? The lockdep splat[1] which
> you reported earlier is possibly due to the same reader lock being acquired
> recursively in elv_iosched_store and then elevator_change? 

The splat isn't related with the nested read lock.

If you replace down_read_nested() with down_read(), the same splat can be
triggered again when running `blktests block/001`.

>  
> On another note, if we suspect possible one-depth recursion for the same 
> class of lock then then we should use SINGLE_DEPTH_NESTING (instead of using
> 1 here) for subclass. But still I am not clear why this lock needs nesting.

It is just one false positive, because elv_iosched_store() won't happen
when adding disk.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 17/20] block: move debugfs/sysfs register out of freezing queue
  2025-04-25 18:38   ` Christoph Hellwig
@ 2025-04-28 11:28     ` Ming Lei
  0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-28 11:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström

On Fri, Apr 25, 2025 at 08:38:01PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 24, 2025 at 11:21:40PM +0800, Ming Lei wrote:
> > Move debugfs/sysfs register out of freezing queue in
> > __blk_mq_update_nr_hw_queues(), so that the following lockdep dependency
> > can be killed:
> > 
> > 	#2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
> > 	#1 (fs_reclaim){+.+.}-{0:0}:
> > 	#0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs
> > 
> > And registering/un-registering debugfs/sysfs does not require queue to be
> > frozen.
> 
> Please explain the why it's not required.  And talking about lockdep
> dependencies is not very useful - it either is an actual lock ordering
> issue or a fale positive, but lockdep is just the messenger.

Both kobject delete and debugfs unregister drains in-progress .show/.store
and read, they do not need to freeze queue.

Thanks, 
Ming


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait
  2025-04-25 14:34   ` Christoph Hellwig
@ 2025-04-28 11:51     ` Ming Lei
  0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-28 11:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström

On Fri, Apr 25, 2025 at 04:34:41PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 24, 2025 at 11:21:33PM +0800, Ming Lei wrote:
> > Move blk_unregister_queue() & device_del() after freeze wait, and prepare
> > for unifying elevator switch.
> > 
> > This way is just fine, since bdev has been unhashed at the beginning of
> > del_gendisk(), both blk_unregister_queue() & device_del() are dealing
> > with kobject & debugfs thing only.
> 
> Can you please add a reference to the previous attempt here and that
> it was backed out again and why we think it's fine this time around?

Turns out this patch isn't necessary, we can avoid the small race easily
between elevator switch and del_gendisk().

Will drop this patch in V4.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress
  2025-04-27  2:27     ` Ming Lei
@ 2025-04-28 16:17       ` Nilay Shroff
  2025-04-29  2:43         ` Ming Lei
  0 siblings, 1 reply; 71+ messages in thread
From: Nilay Shroff @ 2025-04-28 16:17 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig



On 4/27/25 7:57 AM, Ming Lei wrote:
> On Fri, Apr 25, 2025 at 06:18:33PM +0530, Nilay Shroff wrote:
>>
>>
>> On 4/24/25 8:51 PM, Ming Lei wrote:
>>> Elevator switch code is another `nr_hw_queue` reader in non-fast-IO code
>>> path, so it can't be done if updating `nr_hw_queues` is in-progress.
>>>
>>> Take same approach with not allowing add/del disk when updating
>>> nr_hw_queues is in-progress, by grabbing read lock of
>>> set->update_nr_hwq_sema.
>>>
>>> Take the nested variant for avoiding the following false positive
>>> splat[1], and this way is correct because:
>>>
>>> - the read lock in elv_iosched_store() is not overlapped with the read lock
>>> in adding/deleting disk:
>>>
>>> - kobject attribute is only available after the kobject is added and
>>> before it is deleted
>>>
>>>   -> #4 (&q->q_usage_counter(queue){++++}-{0:0}:
>>>   -> #3 (&q->limits_lock){+.+.}-{4:4}:
>>>   -> #2 (&disk->open_mutex){+.+.}-{4:4}:
>>>   -> #1 (&set->update_nr_hwq_lock){.+.+}-{4:4}:
>>>   -> #0 (kn->active#103){++++}-{0:0}:
>>>
>>> Link: https://lore.kernel.org/linux-block/aAWv3NPtNIKKvJZc@fedora/ [1]
>>> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>>> Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>> ---
>>>  block/elevator.c | 3 +++
>>>  1 file changed, 3 insertions(+)
>>>
>>> diff --git a/block/elevator.c b/block/elevator.c
>>> index 4400eb8fe54f..56da6ab7691a 100644
>>> --- a/block/elevator.c
>>> +++ b/block/elevator.c
>>> @@ -723,6 +723,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>>>  	int ret;
>>>  	unsigned int memflags;
>>>  	struct request_queue *q = disk->queue;
>>> +	struct blk_mq_tag_set *set = q->tag_set;
>>>  
>>>  	/*
>>>  	 * If the attribute needs to load a module, do it before freezing the
>>> @@ -734,6 +735,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>>>  
>>>  	elv_iosched_load_module(name);
>>>  
>>> +	down_read_nested(&set->update_nr_hwq_sema, 1);
>>
>> Why do we need to add nested read lock here? The lockdep splat[1] which
>> you reported earlier is possibly due to the same reader lock being acquired
>> recursively in elv_iosched_store and then elevator_change? 
> 
> The splat isn't related with the nested read lock.
> 
> If you replace down_read_nested() with down_read(), the same splat can be
> triggered again when running `blktests block/001`.
> 
I couldn't recreate it on my setup using above blktests.

>>  
>> On another note, if we suspect possible one-depth recursion for the same 
>> class of lock then then we should use SINGLE_DEPTH_NESTING (instead of using
>> 1 here) for subclass. But still I am not clear why this lock needs nesting.
> 
> It is just one false positive, because elv_iosched_store() won't happen
> when adding disk.
> 
Yes, but how could we avoid false positive? It's probably because of commit 
ffa1e7ada456 ("block: Make request_queue lockdep splats show up earlier"). How about adding manual dependency of fs-reclaim on freeze-lock after we add 
the disk. Currently that manual dependency is added in blk_alloc_queue.

Thanks,
--Nilay


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress
  2025-04-28 16:17       ` Nilay Shroff
@ 2025-04-29  2:43         ` Ming Lei
  2025-04-29 10:22           ` Nilay Shroff
  0 siblings, 1 reply; 71+ messages in thread
From: Ming Lei @ 2025-04-29  2:43 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

On Mon, Apr 28, 2025 at 09:47:02PM +0530, Nilay Shroff wrote:
> 
> 
> On 4/27/25 7:57 AM, Ming Lei wrote:
> > On Fri, Apr 25, 2025 at 06:18:33PM +0530, Nilay Shroff wrote:
> >>
> >>
> >> On 4/24/25 8:51 PM, Ming Lei wrote:
> >>> Elevator switch code is another `nr_hw_queue` reader in non-fast-IO code
> >>> path, so it can't be done if updating `nr_hw_queues` is in-progress.
> >>>
> >>> Take same approach with not allowing add/del disk when updating
> >>> nr_hw_queues is in-progress, by grabbing read lock of
> >>> set->update_nr_hwq_sema.
> >>>
> >>> Take the nested variant for avoiding the following false positive
> >>> splat[1], and this way is correct because:
> >>>
> >>> - the read lock in elv_iosched_store() is not overlapped with the read lock
> >>> in adding/deleting disk:
> >>>
> >>> - kobject attribute is only available after the kobject is added and
> >>> before it is deleted
> >>>
> >>>   -> #4 (&q->q_usage_counter(queue){++++}-{0:0}:
> >>>   -> #3 (&q->limits_lock){+.+.}-{4:4}:
> >>>   -> #2 (&disk->open_mutex){+.+.}-{4:4}:
> >>>   -> #1 (&set->update_nr_hwq_lock){.+.+}-{4:4}:
> >>>   -> #0 (kn->active#103){++++}-{0:0}:
> >>>
> >>> Link: https://lore.kernel.org/linux-block/aAWv3NPtNIKKvJZc@fedora/ [1]
> >>> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> >>> Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/
> >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>> ---
> >>>  block/elevator.c | 3 +++
> >>>  1 file changed, 3 insertions(+)
> >>>
> >>> diff --git a/block/elevator.c b/block/elevator.c
> >>> index 4400eb8fe54f..56da6ab7691a 100644
> >>> --- a/block/elevator.c
> >>> +++ b/block/elevator.c
> >>> @@ -723,6 +723,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
> >>>  	int ret;
> >>>  	unsigned int memflags;
> >>>  	struct request_queue *q = disk->queue;
> >>> +	struct blk_mq_tag_set *set = q->tag_set;
> >>>  
> >>>  	/*
> >>>  	 * If the attribute needs to load a module, do it before freezing the
> >>> @@ -734,6 +735,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
> >>>  
> >>>  	elv_iosched_load_module(name);
> >>>  
> >>> +	down_read_nested(&set->update_nr_hwq_sema, 1);
> >>
> >> Why do we need to add nested read lock here? The lockdep splat[1] which
> >> you reported earlier is possibly due to the same reader lock being acquired
> >> recursively in elv_iosched_store and then elevator_change? 
> > 
> > The splat isn't related with the nested read lock.
> > 
> > If you replace down_read_nested() with down_read(), the same splat can be
> > triggered again when running `blktests block/001`.
> > 
> I couldn't recreate it on my setup using above blktests.

It is reproduced in my test vm every time after killing the nested variant:

[   74.257200] ======================================================
[   74.259369] WARNING: possible circular locking dependency detected
[   74.260772] 6.15.0-rc3_ublk+ #547 Not tainted
[   74.261950] ------------------------------------------------------
[   74.263281] check/5077 is trying to acquire lock:
[   74.264492] ffff888105f1fd18 (kn->active#119){++++}-{0:0}, at: __kernfs_remove+0x213/0x680
[   74.266006]
               but task is already holding lock:
[   74.267998] ffff88828a661e20 (&q->q_usage_counter(queue)#14){++++}-{0:0}, at: del_gendisk+0xe5/0x180
[   74.269631]
               which lock already depends on the new lock.

[   74.272645]
               the existing dependency chain (in reverse order) is:
[   74.274804]
               -> #3 (&q->q_usage_counter(queue)#14){++++}-{0:0}:
[   74.277009]        blk_queue_enter+0x4c2/0x630
[   74.278218]        blk_mq_alloc_request+0x479/0xa00
[   74.279539]        scsi_execute_cmd+0x151/0xba0
[   74.281078]        sr_check_events+0x1bc/0xa40
[   74.283012]        cdrom_check_events+0x5c/0x120
[   74.284892]        disk_check_events+0xbe/0x390
[   74.286181]        disk_check_media_change+0xf1/0x220
[   74.287455]        sr_block_open+0xce/0x230
[   74.288528]        blkdev_get_whole+0x8d/0x200
[   74.289702]        bdev_open+0x614/0xc60
[   74.290882]        blkdev_open+0x1f6/0x360
[   74.292215]        do_dentry_open+0x491/0x1820
[   74.293309]        vfs_open+0x7a/0x440
[   74.294384]        path_openat+0x1b7e/0x2ce0
[   74.295507]        do_filp_open+0x1c5/0x450
[   74.296616]        do_sys_openat2+0xef/0x180
[   74.297667]        __x64_sys_openat+0x10e/0x210
[   74.298768]        do_syscall_64+0x92/0x180
[   74.299800]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   74.300971]
               -> #2 (&disk->open_mutex){+.+.}-{4:4}:
[   74.302700]        __mutex_lock+0x19c/0x1990
[   74.303682]        bdev_open+0x6cd/0xc60
[   74.304613]        bdev_file_open_by_dev+0xc4/0x140
[   74.306008]        disk_scan_partitions+0x191/0x290
[   74.307716]        __add_disk_fwnode+0xd2a/0x1140
[   74.309394]        add_disk_fwnode+0x10e/0x220
[   74.311039]        nvme_alloc_ns+0x1833/0x2c30
[   74.312669]        nvme_scan_ns+0x5a0/0x6f0
[   74.314151]        async_run_entry_fn+0x94/0x540
[   74.315719]        process_one_work+0x86a/0x14a0
[   74.317287]        worker_thread+0x5bb/0xf90
[   74.318228]        kthread+0x371/0x720
[   74.319085]        ret_from_fork+0x31/0x70
[   74.319941]        ret_from_fork_asm+0x1a/0x30
[   74.320808]
               -> #1 (&set->update_nr_hwq_sema){.+.+}-{4:4}:
[   74.322311]        down_read+0x8e/0x470
[   74.323135]        elv_iosched_store+0x17a/0x210
[   74.324036]        queue_attr_store+0x234/0x340
[   74.324881]        kernfs_fop_write_iter+0x39b/0x5a0
[   74.325771]        vfs_write+0x5df/0xec0
[   74.326514]        ksys_write+0xff/0x200
[   74.327262]        do_syscall_64+0x92/0x180
[   74.328018]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   74.328963]
               -> #0 (kn->active#119){++++}-{0:0}:
[   74.330433]        __lock_acquire+0x145f/0x2260
[   74.331329]        lock_acquire+0x163/0x300
[   74.332221]        kernfs_drain+0x39d/0x450
[   74.333002]        __kernfs_remove+0x213/0x680
[   74.333792]        kernfs_remove_by_name_ns+0xa2/0x100
[   74.334589]        remove_files+0x8d/0x1b0
[   74.335326]        sysfs_remove_group+0x7c/0x160
[   74.336118]        sysfs_remove_groups+0x55/0xb0
[   74.336869]        __kobject_del+0x7d/0x1d0
[   74.337637]        kobject_del+0x38/0x60
[   74.338340]        blk_unregister_queue+0x153/0x2c0
[   74.339125]        __del_gendisk+0x252/0x9d0
[   74.339959]        del_gendisk+0xe5/0x180
[   74.340756]        sr_remove+0x7b/0xd0
[   74.341429]        device_release_driver_internal+0x36d/0x520
[   74.342353]        bus_remove_device+0x1ef/0x3f0
[   74.343172]        device_del+0x3be/0x9b0
[   74.343951]        __scsi_remove_device+0x27f/0x340
[   74.344724]        sdev_store_delete+0x87/0x120
[   74.345508]        kernfs_fop_write_iter+0x39b/0x5a0
[   74.346287]        vfs_write+0x5df/0xec0
[   74.347170]        ksys_write+0xff/0x200
[   74.348312]        do_syscall_64+0x92/0x180
[   74.349519]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   74.350797]
               other info that might help us debug this:

[   74.353554] Chain exists of:
                 kn->active#119 --> &disk->open_mutex --> &q->q_usage_counter(queue)#14

[   74.355535]  Possible unsafe locking scenario:

[   74.356650]        CPU0                    CPU1
[   74.357328]        ----                    ----
[   74.358026]   lock(&q->q_usage_counter(queue)#14);
[   74.358749]                                lock(&disk->open_mutex);
[   74.359561]                                lock(&q->q_usage_counter(queue)#14);
[   74.360488]   lock(kn->active#119);
[   74.361113]
                *** DEADLOCK ***

[   74.362574] 6 locks held by check/5077:
[   74.363193]  #0: ffff888114640420 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0xff/0x200
[   74.364274]  #1: ffff88829abb6088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x25b/0x5a0
[   74.365937]  #2: ffff8881176ca0e0 (&shost->scan_mutex){+.+.}-{4:4}, at: sdev_store_delete+0x7f/0x120
[   74.367643]  #3: ffff88828521c380 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x90/0x520
[   74.369464]  #4: ffff8881176ca380 (&set->update_nr_hwq_sema){.+.+}-{4:4}, at: del_gendisk+0xdd/0x180
[   74.370961]  #5: ffff88828a661e20 (&q->q_usage_counter(queue)#14){++++}-{0:0}, at: del_gendisk+0xe5/0x180
[   74.372050]
               stack backtrace:
[   74.373111] CPU: 10 UID: 0 PID: 5077 Comm: check Not tainted 6.15.0-rc3_ublk+ #547 PREEMPT(voluntary)
[   74.373116] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014
[   74.373118] Call Trace:
[   74.373121]  <TASK>
[   74.373123]  dump_stack_lvl+0x84/0xd0
[   74.373129]  print_circular_bug.cold+0x185/0x1d0
[   74.373134]  check_noncircular+0x14a/0x170
[   74.373140]  __lock_acquire+0x145f/0x2260
[   74.373145]  lock_acquire+0x163/0x300
[   74.373149]  ? __kernfs_remove+0x213/0x680
[   74.373155]  kernfs_drain+0x39d/0x450
[   74.373158]  ? __kernfs_remove+0x213/0x680
[   74.373161]  ? __pfx_kernfs_drain+0x10/0x10
[   74.373165]  ? find_held_lock+0x2b/0x80
[   74.373168]  ? kernfs_root+0xb0/0x1c0
[   74.373173]  __kernfs_remove+0x213/0x680
[   74.373176]  ? kernfs_find_ns+0x197/0x390
[   74.373183]  kernfs_remove_by_name_ns+0xa2/0x100
[   74.373186]  remove_files+0x8d/0x1b0
[   74.373191]  sysfs_remove_group+0x7c/0x160
[   74.373194]  sysfs_remove_groups+0x55/0xb0
[   74.373198]  __kobject_del+0x7d/0x1d0
[   74.373203]  kobject_del+0x38/0x60
[   74.373206]  blk_unregister_queue+0x153/0x2c0
[   74.373210]  __del_gendisk+0x252/0x9d0
[   74.373214]  ? down_read+0x1a7/0x470
[   74.373218]  ? __pfx___del_gendisk+0x10/0x10
[   74.373221]  ? __pfx_down_read+0x10/0x10
[   74.373224]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[   74.373227]  ? trace_hardirqs_on+0x18/0x150
[   74.373231]  del_gendisk+0xe5/0x180
[   74.373235]  sr_remove+0x7b/0xd0
[   74.373239]  device_release_driver_internal+0x36d/0x520
[   74.373243]  ? kobject_put+0x5e/0x4a0
[   74.373246]  bus_remove_device+0x1ef/0x3f0
[   74.373250]  device_del+0x3be/0x9b0
[   74.373254]  ? attribute_container_device_trigger+0x181/0x1f0
[   74.373257]  ? __pfx_device_del+0x10/0x10
[   74.373260]  ? __pfx_attribute_container_device_trigger+0x10/0x10
[   74.373264]  __scsi_remove_device+0x27f/0x340
[   74.373267]  sdev_store_delete+0x87/0x120
[   74.373270]  ? __pfx_sysfs_kf_write+0x10/0x10
[   74.373273]  kernfs_fop_write_iter+0x39b/0x5a0
[   74.373276]  ? __pfx_kernfs_fop_write_iter+0x10/0x10
[   74.373278]  vfs_write+0x5df/0xec0
[   74.373282]  ? trace_hardirqs_on+0x18/0x150
[   74.373285]  ? __pfx_vfs_write+0x10/0x10
[   74.373291]  ksys_write+0xff/0x200
[   74.373295]  ? __pfx_ksys_write+0x10/0x10
[   74.373298]  ? fput_close_sync+0xd6/0x160
[   74.373303]  do_syscall_64+0x92/0x180
[   74.373309]  ? trace_hardirqs_on_prepare+0x101/0x150
[   74.373313]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[   74.373317]  ? syscall_exit_to_user_mode+0x97/0x290
[   74.373322]  ? do_syscall_64+0x9f/0x180
[   74.373330]  ? fput_close+0xd6/0x160
[   74.373333]  ? __pfx_fput_close+0x10/0x10
[   74.373338]  ? filp_close+0x25/0x40
[   74.373341]  ? do_dup2+0x287/0x4f0
[   74.373346]  ? trace_hardirqs_on_prepare+0x101/0x150
[   74.373348]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[   74.373351]  ? syscall_exit_to_user_mode+0x97/0x290
[   74.373353]  ? do_syscall_64+0x9f/0x180
[   74.373357]  ? trace_hardirqs_on_prepare+0x101/0x150
[   74.373359]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[   74.373362]  ? syscall_exit_to_user_mode+0x97/0x290
[   74.373364]  ? do_syscall_64+0x9f/0x180
[   74.373368]  ? do_syscall_64+0x9f/0x180
[   74.373371]  ? trace_hardirqs_on_prepare+0x101/0x150
[   74.373373]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[   74.373376]  ? syscall_exit_to_user_mode+0x97/0x290
[   74.373379]  ? do_syscall_64+0x9f/0x180
[   74.373382]  ? do_syscall_64+0x9f/0x180
[   74.373385]  ? clear_bhb_loop+0x35/0x90
[   74.373388]  ? clear_bhb_loop+0x35/0x90
[   74.373390]  ? clear_bhb_loop+0x35/0x90
[   74.373393]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   74.373396] RIP: 0033:0x7fa3873a8756
[   74.373409] Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08
[   74.373412] RSP: 002b:00007ffede0285d0 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[   74.373416] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fa3873a8756
[   74.373418] RDX: 0000000000000002 RSI: 000056557b3dc7d0 RDI: 0000000000000001
[   74.373420] RBP: 00007ffede0285f0 R08: 0000000000000000 R09: 0000000000000000
[   74.373421] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
[   74.373423] R13: 000056557b3dc7d0 R14: 00007fa3875245c0 R15: 0000000000000000
[   74.373428]  </TASK>



> 
> >>  
> >> On another note, if we suspect possible one-depth recursion for the same 
> >> class of lock then then we should use SINGLE_DEPTH_NESTING (instead of using
> >> 1 here) for subclass. But still I am not clear why this lock needs nesting.
> > 
> > It is just one false positive, because elv_iosched_store() won't happen
> > when adding disk.
> > 
> Yes, but how could we avoid false positive? It's probably because of commit 
> ffa1e7ada456 ("block: Make request_queue lockdep splats show up earlier"). How about adding manual dependency of fs-reclaim on freeze-lock after we add 
> the disk. Currently that manual dependency is added in blk_alloc_queue.

Please see the above trace, which isn't related with commit ffa1e7ada456,
and the lock chain doesn't include 'fs_reclaim' at all.

commit ffa1e7ada456 isn't wrong too, which just helps us to expose deadlock
risk early.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 09/20] block: simplify elevator reattachment for updating nr_hw_queues
  2025-04-25 18:12   ` Christoph Hellwig
@ 2025-04-29  9:51     ` Ming Lei
  0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-29  9:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström

On Fri, Apr 25, 2025 at 08:12:27PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 24, 2025 at 11:21:32PM +0800, Ming Lei wrote:
> > +static int __elevator_change(struct request_queue *q,
> > +			     const char *elevator_name)
> 
> There's still not good reason for this helper.

With the helper, we can avoid the 'force' reattachment flag.

> 
> I'd suggest you add the two first attached patches before this one
> (it'll need a bi of reabsing as all of them are after your entire
> sweries right now) and then fold the third one into this, which will
> give us less code and a cleaner interface.

I'd suggest to add your patches after this one, which changes
elevator_switch() into local symbol.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress
  2025-04-29  2:43         ` Ming Lei
@ 2025-04-29 10:22           ` Nilay Shroff
  2025-04-30  0:54             ` Ming Lei
  0 siblings, 1 reply; 71+ messages in thread
From: Nilay Shroff @ 2025-04-29 10:22 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig



On 4/29/25 8:13 AM, Ming Lei wrote:
>> I couldn't recreate it on my setup using above blktests.
> It is reproduced in my test vm every time after killing the nested variant:
> 
> [   74.257200] ======================================================
> [   74.259369] WARNING: possible circular locking dependency detected
> [   74.260772] 6.15.0-rc3_ublk+ #547 Not tainted
> [   74.261950] ------------------------------------------------------
> [   74.263281] check/5077 is trying to acquire lock:
> [   74.264492] ffff888105f1fd18 (kn->active#119){++++}-{0:0}, at: __kernfs_remove+0x213/0x680
> [   74.266006]
>                but task is already holding lock:
> [   74.267998] ffff88828a661e20 (&q->q_usage_counter(queue)#14){++++}-{0:0}, at: del_gendisk+0xe5/0x180
> [   74.269631]
>                which lock already depends on the new lock.
> 
> [   74.272645]
>                the existing dependency chain (in reverse order) is:
> [   74.274804]
>                -> #3 (&q->q_usage_counter(queue)#14){++++}-{0:0}:
> [   74.277009]        blk_queue_enter+0x4c2/0x630
> [   74.278218]        blk_mq_alloc_request+0x479/0xa00
> [   74.279539]        scsi_execute_cmd+0x151/0xba0
> [   74.281078]        sr_check_events+0x1bc/0xa40
> [   74.283012]        cdrom_check_events+0x5c/0x120
> [   74.284892]        disk_check_events+0xbe/0x390
> [   74.286181]        disk_check_media_change+0xf1/0x220
> [   74.287455]        sr_block_open+0xce/0x230
> [   74.288528]        blkdev_get_whole+0x8d/0x200
> [   74.289702]        bdev_open+0x614/0xc60
> [   74.290882]        blkdev_open+0x1f6/0x360
> [   74.292215]        do_dentry_open+0x491/0x1820
> [   74.293309]        vfs_open+0x7a/0x440
> [   74.294384]        path_openat+0x1b7e/0x2ce0
> [   74.295507]        do_filp_open+0x1c5/0x450
> [   74.296616]        do_sys_openat2+0xef/0x180
> [   74.297667]        __x64_sys_openat+0x10e/0x210
> [   74.298768]        do_syscall_64+0x92/0x180
> [   74.299800]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   74.300971]
>                -> #2 (&disk->open_mutex){+.+.}-{4:4}:
> [   74.302700]        __mutex_lock+0x19c/0x1990
> [   74.303682]        bdev_open+0x6cd/0xc60
> [   74.304613]        bdev_file_open_by_dev+0xc4/0x140
> [   74.306008]        disk_scan_partitions+0x191/0x290
> [   74.307716]        __add_disk_fwnode+0xd2a/0x1140
> [   74.309394]        add_disk_fwnode+0x10e/0x220
> [   74.311039]        nvme_alloc_ns+0x1833/0x2c30
> [   74.312669]        nvme_scan_ns+0x5a0/0x6f0
> [   74.314151]        async_run_entry_fn+0x94/0x540
> [   74.315719]        process_one_work+0x86a/0x14a0
> [   74.317287]        worker_thread+0x5bb/0xf90
> [   74.318228]        kthread+0x371/0x720
> [   74.319085]        ret_from_fork+0x31/0x70
> [   74.319941]        ret_from_fork_asm+0x1a/0x30
> [   74.320808]
>                -> #1 (&set->update_nr_hwq_sema){.+.+}-{4:4}:
> [   74.322311]        down_read+0x8e/0x470
> [   74.323135]        elv_iosched_store+0x17a/0x210
> [   74.324036]        queue_attr_store+0x234/0x340
> [   74.324881]        kernfs_fop_write_iter+0x39b/0x5a0
> [   74.325771]        vfs_write+0x5df/0xec0
> [   74.326514]        ksys_write+0xff/0x200
> [   74.327262]        do_syscall_64+0x92/0x180
> [   74.328018]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   74.328963]
>                -> #0 (kn->active#119){++++}-{0:0}:
> [   74.330433]        __lock_acquire+0x145f/0x2260
> [   74.331329]        lock_acquire+0x163/0x300
> [   74.332221]        kernfs_drain+0x39d/0x450
> [   74.333002]        __kernfs_remove+0x213/0x680
> [   74.333792]        kernfs_remove_by_name_ns+0xa2/0x100
> [   74.334589]        remove_files+0x8d/0x1b0
> [   74.335326]        sysfs_remove_group+0x7c/0x160
> [   74.336118]        sysfs_remove_groups+0x55/0xb0
> [   74.336869]        __kobject_del+0x7d/0x1d0
> [   74.337637]        kobject_del+0x38/0x60
> [   74.338340]        blk_unregister_queue+0x153/0x2c0
> [   74.339125]        __del_gendisk+0x252/0x9d0
> [   74.339959]        del_gendisk+0xe5/0x180
> [   74.340756]        sr_remove+0x7b/0xd0
> [   74.341429]        device_release_driver_internal+0x36d/0x520
> [   74.342353]        bus_remove_device+0x1ef/0x3f0
> [   74.343172]        device_del+0x3be/0x9b0
> [   74.343951]        __scsi_remove_device+0x27f/0x340
> [   74.344724]        sdev_store_delete+0x87/0x120
> [   74.345508]        kernfs_fop_write_iter+0x39b/0x5a0
> [   74.346287]        vfs_write+0x5df/0xec0
> [   74.347170]        ksys_write+0xff/0x200
> [   74.348312]        do_syscall_64+0x92/0x180
> [   74.349519]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   74.350797]
>                other info that might help us debug this:
> 
> [   74.353554] Chain exists of:
>                  kn->active#119 --> &disk->open_mutex --> &q->q_usage_counter(queue)#14
> 
> [   74.355535]  Possible unsafe locking scenario:
> 
> [   74.356650]        CPU0                    CPU1
> [   74.357328]        ----                    ----
> [   74.358026]   lock(&q->q_usage_counter(queue)#14);
> [   74.358749]                                lock(&disk->open_mutex);
> [   74.359561]                                lock(&q->q_usage_counter(queue)#14);
> [   74.360488]   lock(kn->active#119);
> [   74.361113]
>                 *** DEADLOCK ***
> 
> [   74.362574] 6 locks held by check/5077:
> [   74.363193]  #0: ffff888114640420 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0xff/0x200
> [   74.364274]  #1: ffff88829abb6088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x25b/0x5a0
> [   74.365937]  #2: ffff8881176ca0e0 (&shost->scan_mutex){+.+.}-{4:4}, at: sdev_store_delete+0x7f/0x120
> [   74.367643]  #3: ffff88828521c380 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x90/0x520
> [   74.369464]  #4: ffff8881176ca380 (&set->update_nr_hwq_sema){.+.+}-{4:4}, at: del_gendisk+0xdd/0x180
> [   74.370961]  #5: ffff88828a661e20 (&q->q_usage_counter(queue)#14){++++}-{0:0}, at: del_gendisk+0xe5/0x180
> [   74.372050]
 
This has baffled me as I don't understand how could read lock in
elv_iosched_store (ruuning in context #1) depends on (same) read
lock in add_disk_fwnode (running under another context #2) as 
both locks are represented by the same rw semaphore. As we see 
above both elv_iosched_store and add_disk_fwnode bot run under
different contexts and so ideally they should be able to run
concurrently acquiring the same read lock.

>>>>  
>>>> On another note, if we suspect possible one-depth recursion for the same 
>>>> class of lock then then we should use SINGLE_DEPTH_NESTING (instead of using
>>>> 1 here) for subclass. But still I am not clear why this lock needs nesting.
>>> It is just one false positive, because elv_iosched_store() won't happen
>>> when adding disk.
>>>
>> Yes, but how could we avoid false positive? It's probably because of commit 
>> ffa1e7ada456 ("block: Make request_queue lockdep splats show up earlier"). How about adding manual dependency of fs-reclaim on freeze-lock after we add 
>> the disk. Currently that manual dependency is added in blk_alloc_queue.
> Please see the above trace, which isn't related with commit ffa1e7ada456,
> and the lock chain doesn't include 'fs_reclaim' at all.
> 
> commit ffa1e7ada456 isn't wrong too, which just helps us to expose deadlock
> risk early.
Yes I am not against this commit and now looking at the above splat I don't
blame this commit.

Thanks,
--Nilay

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 20/20] block: move wbt_enable_default() out of queue freezing from sched ->exit()
  2025-04-24 15:21 ` [PATCH V3 20/20] block: move wbt_enable_default() out of queue freezing from sched ->exit() Ming Lei
  2025-04-25 13:10   ` Nilay Shroff
@ 2025-04-29 10:59   ` Nilay Shroff
  1 sibling, 0 replies; 71+ messages in thread
From: Nilay Shroff @ 2025-04-29 10:59 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 7898 bytes --]



On 4/24/25 8:51 PM, Ming Lei wrote:
> scheduler's ->exit() is called with queue frozen and elevator lock is held, and
> wbt_enable_default() can't be called with queue frozen, otherwise the
> following lockdep warning is triggered:
> 
> 	#6 (&q->rq_qos_mutex){+.+.}-{4:4}:
> 	#5 (&eq->sysfs_lock){+.+.}-{4:4}:
> 	#4 (&q->elevator_lock){+.+.}-{4:4}:
> 	#3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
> 	#2 (fs_reclaim){+.+.}-{0:0}:
> 	#1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
> 	#0 (&q->debugfs_mutex){+.+.}-{4:4}:
> 
> Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
> call it from elevator_change_done().
> 
> Meantime add disk->rqos_state_mutex for covering wbt state change, which
> matches the purpose more than ->elevator_lock.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

While testing this patch on my machine using blktests, I stumbled upon
a lockdep splat shown below.(I could consistently recreate it):

run blktests block/005 at 2025-04-28 06:57:51

======================================================
WARNING: possible circular locking dependency detected
6.15.0-rc2+ #174 Not tainted
------------------------------------------------------
check/8088 is trying to acquire lock:
c0000000a0c03538 (&disk->rqos_state_mutex){+.+.}-{4:4}, at: wbt_disable_default+0x9c/0x118

but task is already holding lock:
c00000005b8f6c38 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0x94/0x214

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #3 (&q->elevator_lock){+.+.}-{4:4}:
       __mutex_lock+0x128/0xdd8
       elevator_change+0x94/0x214
       elv_iosched_store+0x14c/0x1f4
       queue_attr_store+0x194/0x1d0
       sysfs_kf_write+0xbc/0x110
       kernfs_fop_write_iter+0x264/0x384
       vfs_write+0x5b0/0x77c
       ksys_write+0xa0/0x180
       system_call_exception+0x1b0/0x4f0
       system_call_vectored_common+0x15c/0x2ec

-> #2 (&q->q_usage_counter(io)#23){++++}-{0:0}:
       blk_alloc_queue+0x46c/0x4bc
       blk_mq_alloc_queue+0xc0/0x160
       __blk_mq_alloc_disk+0x34/0x128
       nvme_alloc_ns+0x140/0x1804 [nvme_core]
       nvme_scan_ns+0x42c/0x564 [nvme_core]
       async_run_entry_fn+0x9c/0x30c
       process_one_work+0x514/0xd38
       worker_thread+0x390/0x6dc
       kthread+0x230/0x278
       start_kernel_thread+0x14/0x18

-> #1 (fs_reclaim){+.+.}-{0:0}:
       fs_reclaim_acquire+0x114/0x150
       __kmalloc_cache_noprof+0x70/0x5c0
       wbt_init+0x64/0x2fc
       wbt_enable_default+0x140/0x15c
       elevator_change_done+0x314/0x3a8
       elv_iosched_store+0x14c/0x1f4
       queue_attr_store+0x194/0x1d0
       sysfs_kf_write+0xbc/0x110
       kernfs_fop_write_iter+0x264/0x384
       vfs_write+0x5b0/0x77c
       ksys_write+0xa0/0x180
       system_call_exception+0x1b0/0x4f0
       system_call_vectored_common+0x15c/0x2ec

-> #0 (&disk->rqos_state_mutex){+.+.}-{4:4}:
       __lock_acquire+0x1b5c/0x29f8
       lock_acquire+0x23c/0x3f8
       __mutex_lock+0x128/0xdd8
       wbt_disable_default+0x9c/0x118
       bfq_init_queue+0x7b0/0x8c0
       blk_mq_init_sched+0x29c/0x3a8
       __elevator_change+0x3a4/0x8a4
       elevator_change+0x1a4/0x214
       elv_iosched_store+0x14c/0x1f4
       queue_attr_store+0x194/0x1d0
       sysfs_kf_write+0xbc/0x110
       kernfs_fop_write_iter+0x264/0x384
       vfs_write+0x5b0/0x77c
       ksys_write+0xa0/0x180
       system_call_exception+0x1b0/0x4f0
       system_call_vectored_common+0x15c/0x2ec

other info that might help us debug this:

Chain exists of:
  &disk->rqos_state_mutex --> &q->q_usage_counter(io)#23 --> &q->elevator_lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&q->elevator_lock);
                               lock(&q->q_usage_counter(io)#23);
                               lock(&q->elevator_lock);
  lock(&disk->rqos_state_mutex);

 *** DEADLOCK ***

7 locks held by check/8088:
 #0: c0000000873f2400 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0xa0/0x180
 #1: c00000008c10c088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x1e0/0x384
 #2: c000000085239248 (kn->active#57){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x1f8/0x384
 #3: c0000000f801c190 (&set->update_nr_hwq_sema){.+.+}-{4:4}, at: elv_iosched_store+0x13c/0x1f4
 #4: c00000005b8f6718 (&q->q_usage_counter(io)#23){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
 #5: c00000005b8f6750 (&q->q_usage_counter(queue)#21){+.+.}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x28/0x40
 #6: c00000005b8f6c38 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0x94/0x214

stack backtrace:
CPU: 26 UID: 0 PID: 8088 Comm: check Kdump: loaded Not tainted 6.15.0-rc2+ #174 VOLUNTARY
Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
Call Trace:
[c0000000d7497240] [c0000000017b9888] dump_stack_lvl+0x100/0x184 (unreliable)
[c0000000d7497270] [c0000000002b546c] print_circular_bug+0x448/0x604
[c0000000d7497320] [c0000000002b5874] check_noncircular+0x24c/0x26c
[c0000000d74973f0] [c0000000002bbb78] __lock_acquire+0x1b5c/0x29f8
[c0000000d7497520] [c0000000002b915c] lock_acquire+0x23c/0x3f8
[c0000000d7497620] [c00000000181277c] __mutex_lock+0x128/0xdd8
[c0000000d7497780] [c000000000c73bf8] wbt_disable_default+0x9c/0x118
[c0000000d74977c0] [c000000000c4c2c0] bfq_init_queue+0x7b0/0x8c0
[c0000000d7497890] [c000000000bff634] blk_mq_init_sched+0x29c/0x3a8
[c0000000d7497910] [c000000000bc2a18] __elevator_change+0x3a4/0x8a4
[c0000000d74979b0] [c000000000bc30bc] elevator_change+0x1a4/0x214
[c0000000d7497a00] [c000000000bc427c] elv_iosched_store+0x14c/0x1f4
[c0000000d7497ae0] [c000000000bd07ec] queue_attr_store+0x194/0x1d0
[c0000000d7497c00] [c000000000a40f00] sysfs_kf_write+0xbc/0x110
[c0000000d7497c50] [c000000000a3cc4c] kernfs_fop_write_iter+0x264/0x384
[c0000000d7497cb0] [c0000000008bb9bc] vfs_write+0x5b0/0x77c
[c0000000d7497d90] [c0000000008bbf88] ksys_write+0xa0/0x180
[c0000000d7497df0] [c000000000039f70] system_call_exception+0x1b0/0x4f0
[c0000000d7497e50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
--- interrupt: 3000 at 0x7fffa413b034
NIP:  00007fffa413b034 LR: 00007fffa413b034 CTR: 0000000000000000
REGS: c0000000d7497e80 TRAP: 3000   Not tainted  (6.15.0-rc2+)
MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 44422408  XER: 00000000
IRQMASK: 0
GPR00: 0000000000000004 00007ffffd011260 000000010dfa7e00 0000000000000001
GPR04: 000000011c30b720 0000000000000004 0000000000000010 0000000000000001
GPR08: 0000000000000003 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fffa43fab60 000000011c3adbc0 000000010dfa87b8
GPR16: 000000010dfa94d8 0000000020000000 0000000000000000 000000010deb9070
GPR20: 000000010df4beb8 00007ffffd011404 000000010df4f8a0 000000010dfa89bc
GPR24: 000000010dfa8a50 0000000000000000 000000011c30b720 0000000000000004
GPR28: 0000000000000004 00007fffa42418e0 000000011c30b720 0000000000000004
NIP [00007fffa413b034] 0x7fffa413b034
LR [00007fffa413b034] 0x7fffa413b034
--- interrupt: 3000


From the changes in this patch it appears that we have now got this
new "disk->rqos_state_mutex" which we're holding while allocating
the dynamic memory in wbt_init(). One way is to define GFP_NOIO scope
while holding this mutex but then looking further it seems that we 
don't really need to keep holding this lock in wbt_enable_default()
all the way till wbt_init returns. Probably, unlocking this mutex as
soon as we're done updating the rqos state should be sufficient. 

Similarly, in queue_wb_lat_store we may need to hold this mutex only
while we invoke wbt_set_min_lat. And I just realized that in 
queue_wb_lat_show we need to replace q->elevator_lock with 
disk->rqos_state_mutex. I have attached the above change for 
your reference. You may address this in your next patch.

Thanks,
--Nilay

[-- Attachment #2: wbt.patch --]
[-- Type: text/x-patch, Size: 1984 bytes --]

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fafe9e9e97cc..0248dc9438c3 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -560,7 +560,7 @@ static ssize_t queue_wb_lat_show(struct gendisk *disk, char *page)
 	ssize_t ret;
 	struct request_queue *q = disk->queue;
 
-	mutex_lock(&q->elevator_lock);
+	mutex_lock(&disk->rqos_state_mutex);
 	if (!wbt_rq_qos(q)) {
 		ret = -EINVAL;
 		goto out;
@@ -573,7 +573,7 @@ static ssize_t queue_wb_lat_show(struct gendisk *disk, char *page)
 
 	ret = sysfs_emit(page, "%llu\n", div_u64(wbt_get_min_lat(q), 1000));
 out:
-	mutex_unlock(&q->elevator_lock);
+	mutex_unlock(&disk->rqos_state_mutex);
 	return ret;
 }
 
@@ -593,7 +593,6 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
 		return -EINVAL;
 
 	memflags = blk_mq_freeze_queue(q);
-	mutex_lock(&disk->rqos_state_mutex);
 
 	rqos = wbt_rq_qos(q);
 	if (!rqos) {
@@ -618,11 +617,12 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
 	 */
 	blk_mq_quiesce_queue(q);
 
+	mutex_lock(&disk->rqos_state_mutex);
 	wbt_set_min_lat(q, val);
+	mutex_unlock(&disk->rqos_state_mutex);
 
 	blk_mq_unquiesce_queue(q);
 out:
-	mutex_unlock(&disk->rqos_state_mutex);
 	blk_mq_unfreeze_queue(q, memflags);
 
 	return ret;
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index c8588bae1c1b..74ae7131ada9 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -714,17 +714,17 @@ void wbt_enable_default(struct gendisk *disk)
 	if (rqos) {
 		if (enable && RQWB(rqos)->enable_state == WBT_STATE_OFF_DEFAULT)
 			RQWB(rqos)->enable_state = WBT_STATE_ON_DEFAULT;
-		goto unlock;
+		mutex_unlock(&disk->rqos_state_mutex);
+		return;
 	}
+	mutex_unlock(&disk->rqos_state_mutex);
 
 	/* Queue not registered? Maybe shutting down... */
 	if (!blk_queue_registered(q))
-		goto unlock;
+		return;
 
 	if (queue_is_mq(q) && enable)
 		wbt_init(disk);
-unlock:
-	mutex_unlock(&disk->rqos_state_mutex);
 }
 EXPORT_SYMBOL_GPL(wbt_enable_default);
 

^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning
  2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (19 preceding siblings ...)
  2025-04-24 15:21 ` [PATCH V3 20/20] block: move wbt_enable_default() out of queue freezing from sched ->exit() Ming Lei
@ 2025-04-29 12:00 ` Stefan Haberland
  2025-04-29 12:11   ` Ming Lei
  20 siblings, 1 reply; 71+ messages in thread
From: Stefan Haberland @ 2025-04-29 12:00 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

Am 24.04.25 um 17:21 schrieb Ming Lei:

> Hello Jens,
>
> This patchset cleans up elevator change code, and unifying it via single
> helper, meantime moves kobject_add/del & debugfs register/unregister out of
> queue freezing & elevator_lock. This way fixes many lockdep warnings
> reported recently, especially since fs_reclaim is connected with freeze lock
> manually by commit ffa1e7ada456 ("block: Make request_queue lockdep splats
> show up earlier").
>
>
> Thanks,
> Ming
>
> V3:
> 	- replace srcu with rw_sem for avoiding race between add/del disk &
> 	  elevator switch and updating nr_hw_queues (Nilay Shoff)
>
> 	- add elv_update_nr_hw_queues() for elevator reattachment in case of
> 	updating nr_hw_queues, meantime keep elv_change_ctx as local structure
> 	(Christoph)
>
> 	- replace ->elevator_lock with disk->rqos_state_mutex for covering wbt
> 	state change
>
> 	- add new patch "block: use q->elevator with ->elevator_lock held in elv_iosched_show()"
>
> 	- small cleanup & commit log improvement
>
> V2:
> 	- retry add/del disk when blk_mq_update_nr_hw_queues() is in-progress
>
> 	- swap blk_mq_add_queue_tag_set() with blk_mq_map_swqueue() in
> 	blk_mq_init_allocated_queue() (Nilay Shroff)
>
> 	- move ELEVATOR_FLAG_DISABLE_WBT to request queue's flags (Nilay Shoff) 
>
> 	- fix race because of delaying elevator unregister
>
> 	- define flags of `elv_change_ctx` as `bool` (Christoph)
>
> 	- improve comment and commit log (Christoph)
>
> Ming Lei (20):
>   block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue()
>   block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag
>   block: don't call freeze queue in elevator_switch() and
>     elevator_disable()
>   block: use q->elevator with ->elevator_lock held in elv_iosched_show()
>   block: add two helpers for registering/un-registering sched debugfs
>   block: move sched debugfs register into elvevator_register_queue
>   block: prevent adding/deleting disk during updating nr_hw_queues
>   block: don't allow to switch elevator if updating nr_hw_queues is
>     in-progress
>   block: simplify elevator reattachment for updating nr_hw_queues
>   block: move blk_unregister_queue() & device_del() after freeze wait
>   block: move queue freezing & elevator_lock into elevator_change()
>   block: add `struct elv_change_ctx` for unifying elevator change
>   block: unifying elevator change
>   block: pass elevator_queue to elv_register_queue & unregister_queue
>   block: fail to show/store elevator sysfs attribute if elevator is
>     dying
>   block: move elv_register[unregister]_queue out of elevator_lock
>   block: move debugfs/sysfs register out of freezing queue
>   block: remove several ->elevator_lock
>   block: move hctx cpuhp add/del out of queue freezing
>   block: move wbt_enable_default() out of queue freezing from sched
>     ->exit()
>
>  block/bfq-iosched.c    |   6 +-
>  block/blk-mq-debugfs.c |  12 +-
>  block/blk-mq-sched.c   |  41 +++---
>  block/blk-mq.c         | 132 +++---------------
>  block/blk-sysfs.c      |  24 ++--
>  block/blk-wbt.c        |  13 +-
>  block/blk.h            |   8 +-
>  block/elevator.c       | 302 ++++++++++++++++++++++++++++-------------
>  block/elevator.h       |   6 +-
>  block/genhd.c          | 129 +++++++++++-------
>  include/linux/blk-mq.h |   3 +
>  include/linux/blkdev.h |   5 +
>  12 files changed, 365 insertions(+), 316 deletions(-)

Hi,
while testing the patchset on s390 I still get the following lockdep splat on each boot:

======================================================
 WARNING: possible circular locking dependency detected
 6.15.0-rc4-gc2b4d8dcb3d2 #3 Not tainted
 ------------------------------------------------------
 (udev-worker)/1810 is trying to acquire lock:
 0000005fb84de3a8 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0x54/0x130

 but task is already holding lock:
 0000005fb84dde18 (&q->q_usage_counter(io)#34){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x26/0x40

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #3 (&q->q_usage_counter(io)#34){++++}-{0:0}:
        __lock_acquire+0x6da/0xcc0
        lock_acquire.part.0+0x10c/0x290
        lock_acquire+0xb0/0x1a0
        blk_alloc_queue+0x306/0x340
        blk_mq_alloc_queue+0x60/0xd0
        scsi_alloc_sdev+0x27c/0x3b0
        scsi_probe_and_add_lun+0x31a/0x480
        scsi_report_lun_scan+0x382/0x430
        __scsi_scan_target+0x11a/0x240
        scsi_scan_target+0xdc/0x100
        fc_scsi_scan_rport+0xc2/0xd0
        process_one_work+0x2a6/0x5d0
        worker_thread+0x220/0x410
        kthread+0x164/0x2d0
        __ret_from_fork+0x3c/0x60
        ret_from_fork+0xa/0x38

 -> #2 (fs_reclaim){+.+.}-{0:0}:
        __lock_acquire+0x6da/0xcc0
        lock_acquire.part.0+0x10c/0x290
        lock_acquire+0xb0/0x1a0
        __fs_reclaim_acquire+0x44/0x50
        fs_reclaim_acquire+0xba/0x100
        __kmalloc_noprof+0xae/0x5e0
        pcpu_alloc_chunk+0x30/0x170
        pcpu_create_chunk+0x22/0x130
        pcpu_alloc_noprof+0x842/0x970
        do_kmem_cache_create+0x1e0/0x4b0
        __kmem_cache_create_args+0x238/0x340
        register_ftrace_graph+0x438/0x460
        trace_selftest_startup_function_graph+0x62/0x260
        run_tracer_selftest+0x116/0x1b0
        register_tracer+0x192/0x260
        do_one_initcall+0x4a/0x180
        do_initcalls+0x146/0x170
        kernel_init_freeable+0x230/0x270
        kernel_init+0x2e/0x188
        __ret_from_fork+0x3c/0x60
        ret_from_fork+0xa/0x38

 -> #1 (pcpu_alloc_mutex){+.+.}-{4:4}:
        __lock_acquire+0x6da/0xcc0
        lock_acquire.part.0+0x10c/0x290
        lock_acquire+0xb0/0x1a0
        __mutex_lock+0xae/0xa20
        mutex_lock_killable_nested+0x32/0x40
        pcpu_alloc_noprof+0x6ea/0x970
        sbitmap_init_node+0x11a/0x230
        sbitmap_queue_init_node+0x3e/0x1c0
        blk_mq_init_tags+0xac/0x140
        blk_mq_alloc_map_and_rqs+0xb0/0x180
        blk_mq_init_sched+0xfc/0x200
        elevator_switch+0x74/0x260
        __elevator_change+0xde/0x150
        elevator_change+0xe4/0x130
        elevator_set_default+0x6c/0xc0
        blk_register_queue+0x188/0x210
        __add_disk_fwnode+0x202/0x420
        add_disk_fwnode+0x7a/0xb0
        sd_probe+0x22e/0x3f0
        really_probe+0x2ac/0x430
        driver_probe_device+0x3c/0xc0
        __device_attach_driver+0xc0/0x140
        bus_for_each_drv+0x90/0xe0
        __device_attach_async_helper+0x94/0xf0
        async_run_entry_fn+0x4a/0x180
        process_one_work+0x2a6/0x5d0
        worker_thread+0x220/0x410
        kthread+0x164/0x2d0
        __ret_from_fork+0x3c/0x60
        ret_from_fork+0xa/0x38

 -> #0 (&q->elevator_lock){+.+.}-{4:4}:
        check_prev_add+0x16c/0xff0
        validate_chain+0x734/0x9f0
        __lock_acquire+0x6da/0xcc0
        lock_acquire.part.0+0x10c/0x290
        lock_acquire+0xb0/0x1a0
        __mutex_lock+0xae/0xa20
        mutex_lock_nested+0x32/0x40
        elevator_change+0x54/0x130
        elv_iosched_store+0xec/0x140
        kernfs_fop_write_iter+0x166/0x230
        vfs_write+0x1ac/0x480
        ksys_write+0x7c/0x100
        __do_syscall+0x156/0x290
        system_call+0x74/0x98

        other info that might help us debug this:

 Chain exists of:
                 &q->elevator_lock --> fs_reclaim --> &q->q_usage_counter(io)#34

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&q->q_usage_counter(io)#34);
                                lock(fs_reclaim);
                                lock(&q->q_usage_counter(io)#34);
   lock(&q->elevator_lock);

                *** DEADLOCK ***

 6 locks held by (udev-worker)/1810:
  #0: 0000005f9d141450 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x7c/0x100
  #1: 0000005f8ee55090 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x12a/0x230
  #2: 0000005fd0be69a0 (kn->active#68){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x136/0x230
  #3: 00000066feee93c0 (&set->update_nr_hwq_sema/1){.+.+}-{4:4}, at: elv_iosched_store+0xde/0x140
  #4: 0000005fb84dde18 (&q->q_usage_counter(io)#34){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x26/0x40
  #5: 0000005fb84dde58 (&q->q_usage_counter(queue)#11){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x26/0x40

               stack backtrace:
 CPU: 18 UID: 0 PID: 1810 Comm: (udev-worker) Not tainted 6.15.0-rc4-gc2b4d8dcb3d2 #3 PREEMPT
 Hardware name: IBM 9175 ME1 701 (LPAR)
 Call Trace:
  [<00000162f415409a>] dump_stack_lvl+0xa2/0xe8
  [<00000162f424cf9c>] print_circular_bug+0x18c/0x210
  [<00000162f424d198>] check_noncircular+0x178/0x190
  [<00000162f424e57c>] check_prev_add+0x16c/0xff0
  [<00000162f424fb34>] validate_chain+0x734/0x9f0
  [<00000162f4251f8a>] __lock_acquire+0x6da/0xcc0
  [<00000162f425267c>] lock_acquire.part.0+0x10c/0x290
  [<00000162f42528b0>] lock_acquire+0xb0/0x1a0
  [<00000162f525695e>] __mutex_lock+0xae/0xa20
  [<00000162f5257302>] mutex_lock_nested+0x32/0x40
  [<00000162f4c00a54>] elevator_change+0x54/0x130
  [<00000162f4c0136c>] elv_iosched_store+0xec/0x140
  [<00000162f46b6f66>] kernfs_fop_write_iter+0x166/0x230
  [<00000162f45cbe3c>] vfs_write+0x1ac/0x480
  [<00000162f45cc2ac>] ksys_write+0x7c/0x100
  [<00000162f524e956>] __do_syscall+0x156/0x290
  [<00000162f525ff94>] system_call+0x74/0x98
 INFO: lockdep is turned off.


The trigger is a udev rule changing the scheduler for the device.

thanks,
Stefan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning
  2025-04-29 12:00 ` [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Stefan Haberland
@ 2025-04-29 12:11   ` Ming Lei
  0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-29 12:11 UTC (permalink / raw)
  To: Stefan Haberland
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

On Tue, Apr 29, 2025 at 02:00:27PM +0200, Stefan Haberland wrote:
> Am 24.04.25 um 17:21 schrieb Ming Lei:
> 
> > Hello Jens,
> >
> > This patchset cleans up elevator change code, and unifying it via single
> > helper, meantime moves kobject_add/del & debugfs register/unregister out of
> > queue freezing & elevator_lock. This way fixes many lockdep warnings
> > reported recently, especially since fs_reclaim is connected with freeze lock
> > manually by commit ffa1e7ada456 ("block: Make request_queue lockdep splats
> > show up earlier").
> >
> >
> > Thanks,
> > Ming
> >
> > V3:
> > 	- replace srcu with rw_sem for avoiding race between add/del disk &
> > 	  elevator switch and updating nr_hw_queues (Nilay Shoff)
> >
> > 	- add elv_update_nr_hw_queues() for elevator reattachment in case of
> > 	updating nr_hw_queues, meantime keep elv_change_ctx as local structure
> > 	(Christoph)
> >
> > 	- replace ->elevator_lock with disk->rqos_state_mutex for covering wbt
> > 	state change
> >
> > 	- add new patch "block: use q->elevator with ->elevator_lock held in elv_iosched_show()"
> >
> > 	- small cleanup & commit log improvement
> >
> > V2:
> > 	- retry add/del disk when blk_mq_update_nr_hw_queues() is in-progress
> >
> > 	- swap blk_mq_add_queue_tag_set() with blk_mq_map_swqueue() in
> > 	blk_mq_init_allocated_queue() (Nilay Shroff)
> >
> > 	- move ELEVATOR_FLAG_DISABLE_WBT to request queue's flags (Nilay Shoff) 
> >
> > 	- fix race because of delaying elevator unregister
> >
> > 	- define flags of `elv_change_ctx` as `bool` (Christoph)
> >
> > 	- improve comment and commit log (Christoph)
> >
> > Ming Lei (20):
> >   block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue()
> >   block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag
> >   block: don't call freeze queue in elevator_switch() and
> >     elevator_disable()
> >   block: use q->elevator with ->elevator_lock held in elv_iosched_show()
> >   block: add two helpers for registering/un-registering sched debugfs
> >   block: move sched debugfs register into elvevator_register_queue
> >   block: prevent adding/deleting disk during updating nr_hw_queues
> >   block: don't allow to switch elevator if updating nr_hw_queues is
> >     in-progress
> >   block: simplify elevator reattachment for updating nr_hw_queues
> >   block: move blk_unregister_queue() & device_del() after freeze wait
> >   block: move queue freezing & elevator_lock into elevator_change()
> >   block: add `struct elv_change_ctx` for unifying elevator change
> >   block: unifying elevator change
> >   block: pass elevator_queue to elv_register_queue & unregister_queue
> >   block: fail to show/store elevator sysfs attribute if elevator is
> >     dying
> >   block: move elv_register[unregister]_queue out of elevator_lock
> >   block: move debugfs/sysfs register out of freezing queue
> >   block: remove several ->elevator_lock
> >   block: move hctx cpuhp add/del out of queue freezing
> >   block: move wbt_enable_default() out of queue freezing from sched
> >     ->exit()
> >
> >  block/bfq-iosched.c    |   6 +-
> >  block/blk-mq-debugfs.c |  12 +-
> >  block/blk-mq-sched.c   |  41 +++---
> >  block/blk-mq.c         | 132 +++---------------
> >  block/blk-sysfs.c      |  24 ++--
> >  block/blk-wbt.c        |  13 +-
> >  block/blk.h            |   8 +-
> >  block/elevator.c       | 302 ++++++++++++++++++++++++++++-------------
> >  block/elevator.h       |   6 +-
> >  block/genhd.c          | 129 +++++++++++-------
> >  include/linux/blk-mq.h |   3 +
> >  include/linux/blkdev.h |   5 +
> >  12 files changed, 365 insertions(+), 316 deletions(-)
> 
> Hi,
> while testing the patchset on s390 I still get the following lockdep splat on each boot:
> 
> ======================================================
>  WARNING: possible circular locking dependency detected
>  6.15.0-rc4-gc2b4d8dcb3d2 #3 Not tainted
>  ------------------------------------------------------
>  (udev-worker)/1810 is trying to acquire lock:
>  0000005fb84de3a8 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0x54/0x130
> 
>  but task is already holding lock:
>  0000005fb84dde18 (&q->q_usage_counter(io)#34){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x26/0x40
> 
>  which lock already depends on the new lock.
> 
>  the existing dependency chain (in reverse order) is:
> 
>  -> #3 (&q->q_usage_counter(io)#34){++++}-{0:0}:
>         __lock_acquire+0x6da/0xcc0
>         lock_acquire.part.0+0x10c/0x290
>         lock_acquire+0xb0/0x1a0
>         blk_alloc_queue+0x306/0x340
>         blk_mq_alloc_queue+0x60/0xd0
>         scsi_alloc_sdev+0x27c/0x3b0
>         scsi_probe_and_add_lun+0x31a/0x480
>         scsi_report_lun_scan+0x382/0x430
>         __scsi_scan_target+0x11a/0x240
>         scsi_scan_target+0xdc/0x100
>         fc_scsi_scan_rport+0xc2/0xd0
>         process_one_work+0x2a6/0x5d0
>         worker_thread+0x220/0x410
>         kthread+0x164/0x2d0
>         __ret_from_fork+0x3c/0x60
>         ret_from_fork+0xa/0x38
> 
>  -> #2 (fs_reclaim){+.+.}-{0:0}:
>         __lock_acquire+0x6da/0xcc0
>         lock_acquire.part.0+0x10c/0x290
>         lock_acquire+0xb0/0x1a0
>         __fs_reclaim_acquire+0x44/0x50
>         fs_reclaim_acquire+0xba/0x100
>         __kmalloc_noprof+0xae/0x5e0
>         pcpu_alloc_chunk+0x30/0x170
>         pcpu_create_chunk+0x22/0x130
>         pcpu_alloc_noprof+0x842/0x970
>         do_kmem_cache_create+0x1e0/0x4b0
>         __kmem_cache_create_args+0x238/0x340
>         register_ftrace_graph+0x438/0x460
>         trace_selftest_startup_function_graph+0x62/0x260
>         run_tracer_selftest+0x116/0x1b0
>         register_tracer+0x192/0x260
>         do_one_initcall+0x4a/0x180
>         do_initcalls+0x146/0x170
>         kernel_init_freeable+0x230/0x270
>         kernel_init+0x2e/0x188
>         __ret_from_fork+0x3c/0x60
>         ret_from_fork+0xa/0x38
> 
>  -> #1 (pcpu_alloc_mutex){+.+.}-{4:4}:
>         __lock_acquire+0x6da/0xcc0
>         lock_acquire.part.0+0x10c/0x290
>         lock_acquire+0xb0/0x1a0
>         __mutex_lock+0xae/0xa20
>         mutex_lock_killable_nested+0x32/0x40
>         pcpu_alloc_noprof+0x6ea/0x970

It is one known dependency on percpu alloc lock, and we can't cover every
one in single patchset, especially this patchset is becoming bigger.

And this percpu lock splat becomes much easier to address after the elevator
patchset is merged.

We can move the elevator data allocation out of queue freeze, then the
dependency can be cut.


Thanks
Ming


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 12/20] block: add `struct elv_change_ctx` for unifying elevator change
  2025-04-25 18:23   ` Christoph Hellwig
@ 2025-04-29 15:45     ` Ming Lei
  0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-29 15:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström

On Fri, Apr 25, 2025 at 08:23:41PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 24, 2025 at 11:21:35PM +0800, Ming Lei wrote:
> > +struct elv_change_ctx {
> > +	const char *name;
> > +	bool uevent;
> 
> There's only one caller that wants to supress the uevents.  So maybe
> invert the polarity so that it only has to be set in one place,
> which also documents how setting the initial scheduler is special a
> bit better.

OK.

> 
> > -	ret = elv_register_queue(q, true);
> > +	ret = elv_register_queue(q, ctx->uevent);
> 
> .. and pass the ctx on to elv_register_queue instead of converting
> paramter types.  Although that might be woeth doing later when
> another argument derived from ctx gets passed as well.

ctx isn't very useful for elv_register_queue(), which can't figure out
the exact elevator queue to use from `ctx` since there are two(old, new).


Thanks,
Ming


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 13/20] block: unifying elevator change
  2025-04-25 13:03   ` Nilay Shroff
@ 2025-04-30  0:46     ` Ming Lei
  0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-30  0:46 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

On Fri, Apr 25, 2025 at 06:33:37PM +0530, Nilay Shroff wrote:
> 
> 
> On 4/24/25 8:51 PM, Ming Lei wrote:
> > @@ -564,12 +558,7 @@ static int __add_disk_fwnode(struct device *parent, struct gendisk *disk,
> >  out_free_ext_minor:
> >  	if (disk->major == BLOCK_EXT_MAJOR)
> >  		blk_free_ext_minor(disk->first_minor);
> > -out_exit_elevator:
> > -	if (disk->queue->elevator) {
> > -		mutex_lock(&disk->queue->elevator_lock);
> > -		elevator_exit(disk->queue);
> > -		mutex_unlock(&disk->queue->elevator_lock);
> > -	}
> 
> After restructuring __add_disk_fwnode() we may still need elevator_exit()
> in case when blk_register_queue is successful and so we should have set
> q->elevator to default. Now if after queue is registered successfully, 
> for instance, bdi_register fails then in that case we still have to exit 
> elevator to free its resources, isn't it? 

If blk_register_queue is successful, it is blk_unregister_queue()'s
responsibility for covering it, elevator_exit() isn't needed here.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress
  2025-04-29 10:22           ` Nilay Shroff
@ 2025-04-30  0:54             ` Ming Lei
  0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-30  0:54 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

On Tue, Apr 29, 2025 at 03:52:51PM +0530, Nilay Shroff wrote:
> 
> 
> On 4/29/25 8:13 AM, Ming Lei wrote:
> >> I couldn't recreate it on my setup using above blktests.
> > It is reproduced in my test vm every time after killing the nested variant:
> > 
> > [   74.257200] ======================================================
> > [   74.259369] WARNING: possible circular locking dependency detected
> > [   74.260772] 6.15.0-rc3_ublk+ #547 Not tainted
> > [   74.261950] ------------------------------------------------------
> > [   74.263281] check/5077 is trying to acquire lock:
> > [   74.264492] ffff888105f1fd18 (kn->active#119){++++}-{0:0}, at: __kernfs_remove+0x213/0x680
> > [   74.266006]
> >                but task is already holding lock:
> > [   74.267998] ffff88828a661e20 (&q->q_usage_counter(queue)#14){++++}-{0:0}, at: del_gendisk+0xe5/0x180
> > [   74.269631]
> >                which lock already depends on the new lock.
> > 
> > [   74.272645]
> >                the existing dependency chain (in reverse order) is:
> > [   74.274804]
> >                -> #3 (&q->q_usage_counter(queue)#14){++++}-{0:0}:
> > [   74.277009]        blk_queue_enter+0x4c2/0x630
> > [   74.278218]        blk_mq_alloc_request+0x479/0xa00
> > [   74.279539]        scsi_execute_cmd+0x151/0xba0
> > [   74.281078]        sr_check_events+0x1bc/0xa40
> > [   74.283012]        cdrom_check_events+0x5c/0x120
> > [   74.284892]        disk_check_events+0xbe/0x390
> > [   74.286181]        disk_check_media_change+0xf1/0x220
> > [   74.287455]        sr_block_open+0xce/0x230
> > [   74.288528]        blkdev_get_whole+0x8d/0x200
> > [   74.289702]        bdev_open+0x614/0xc60
> > [   74.290882]        blkdev_open+0x1f6/0x360
> > [   74.292215]        do_dentry_open+0x491/0x1820
> > [   74.293309]        vfs_open+0x7a/0x440
> > [   74.294384]        path_openat+0x1b7e/0x2ce0
> > [   74.295507]        do_filp_open+0x1c5/0x450
> > [   74.296616]        do_sys_openat2+0xef/0x180
> > [   74.297667]        __x64_sys_openat+0x10e/0x210
> > [   74.298768]        do_syscall_64+0x92/0x180
> > [   74.299800]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [   74.300971]
> >                -> #2 (&disk->open_mutex){+.+.}-{4:4}:
> > [   74.302700]        __mutex_lock+0x19c/0x1990
> > [   74.303682]        bdev_open+0x6cd/0xc60
> > [   74.304613]        bdev_file_open_by_dev+0xc4/0x140
> > [   74.306008]        disk_scan_partitions+0x191/0x290
> > [   74.307716]        __add_disk_fwnode+0xd2a/0x1140
> > [   74.309394]        add_disk_fwnode+0x10e/0x220
> > [   74.311039]        nvme_alloc_ns+0x1833/0x2c30
> > [   74.312669]        nvme_scan_ns+0x5a0/0x6f0
> > [   74.314151]        async_run_entry_fn+0x94/0x540
> > [   74.315719]        process_one_work+0x86a/0x14a0
> > [   74.317287]        worker_thread+0x5bb/0xf90
> > [   74.318228]        kthread+0x371/0x720
> > [   74.319085]        ret_from_fork+0x31/0x70
> > [   74.319941]        ret_from_fork_asm+0x1a/0x30
> > [   74.320808]
> >                -> #1 (&set->update_nr_hwq_sema){.+.+}-{4:4}:
> > [   74.322311]        down_read+0x8e/0x470
> > [   74.323135]        elv_iosched_store+0x17a/0x210
> > [   74.324036]        queue_attr_store+0x234/0x340
> > [   74.324881]        kernfs_fop_write_iter+0x39b/0x5a0
> > [   74.325771]        vfs_write+0x5df/0xec0
> > [   74.326514]        ksys_write+0xff/0x200
> > [   74.327262]        do_syscall_64+0x92/0x180
> > [   74.328018]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [   74.328963]
> >                -> #0 (kn->active#119){++++}-{0:0}:
> > [   74.330433]        __lock_acquire+0x145f/0x2260
> > [   74.331329]        lock_acquire+0x163/0x300
> > [   74.332221]        kernfs_drain+0x39d/0x450
> > [   74.333002]        __kernfs_remove+0x213/0x680
> > [   74.333792]        kernfs_remove_by_name_ns+0xa2/0x100
> > [   74.334589]        remove_files+0x8d/0x1b0
> > [   74.335326]        sysfs_remove_group+0x7c/0x160
> > [   74.336118]        sysfs_remove_groups+0x55/0xb0
> > [   74.336869]        __kobject_del+0x7d/0x1d0
> > [   74.337637]        kobject_del+0x38/0x60
> > [   74.338340]        blk_unregister_queue+0x153/0x2c0
> > [   74.339125]        __del_gendisk+0x252/0x9d0
> > [   74.339959]        del_gendisk+0xe5/0x180
> > [   74.340756]        sr_remove+0x7b/0xd0
> > [   74.341429]        device_release_driver_internal+0x36d/0x520
> > [   74.342353]        bus_remove_device+0x1ef/0x3f0
> > [   74.343172]        device_del+0x3be/0x9b0
> > [   74.343951]        __scsi_remove_device+0x27f/0x340
> > [   74.344724]        sdev_store_delete+0x87/0x120
> > [   74.345508]        kernfs_fop_write_iter+0x39b/0x5a0
> > [   74.346287]        vfs_write+0x5df/0xec0
> > [   74.347170]        ksys_write+0xff/0x200
> > [   74.348312]        do_syscall_64+0x92/0x180
> > [   74.349519]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [   74.350797]
> >                other info that might help us debug this:
> > 
> > [   74.353554] Chain exists of:
> >                  kn->active#119 --> &disk->open_mutex --> &q->q_usage_counter(queue)#14
> > 
> > [   74.355535]  Possible unsafe locking scenario:
> > 
> > [   74.356650]        CPU0                    CPU1
> > [   74.357328]        ----                    ----
> > [   74.358026]   lock(&q->q_usage_counter(queue)#14);
> > [   74.358749]                                lock(&disk->open_mutex);
> > [   74.359561]                                lock(&q->q_usage_counter(queue)#14);
> > [   74.360488]   lock(kn->active#119);
> > [   74.361113]
> >                 *** DEADLOCK ***
> > 
> > [   74.362574] 6 locks held by check/5077:
> > [   74.363193]  #0: ffff888114640420 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0xff/0x200
> > [   74.364274]  #1: ffff88829abb6088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x25b/0x5a0
> > [   74.365937]  #2: ffff8881176ca0e0 (&shost->scan_mutex){+.+.}-{4:4}, at: sdev_store_delete+0x7f/0x120
> > [   74.367643]  #3: ffff88828521c380 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x90/0x520
> > [   74.369464]  #4: ffff8881176ca380 (&set->update_nr_hwq_sema){.+.+}-{4:4}, at: del_gendisk+0xdd/0x180
> > [   74.370961]  #5: ffff88828a661e20 (&q->q_usage_counter(queue)#14){++++}-{0:0}, at: del_gendisk+0xe5/0x180
> > [   74.372050]
>  
> This has baffled me as I don't understand how could read lock in
> elv_iosched_store (ruuning in context #1) depends on (same) read
> lock in add_disk_fwnode (running under another context #2) as 
> both locks are represented by the same rw semaphore. As we see 

That is why the read lock in elv_iosched_store() is annotated as
nested(new lockdep map) for avoiding the false positive warning, because
the two can't be grabbed at the same time.

> above both elv_iosched_store and add_disk_fwnode bot run under
> different contexts and so ideally they should be able to run
> concurrently acquiring the same read lock.

In theory, it is yes, but reality is that scheduler store attribute
isn't created until disk/queue kobject is added, then switching elevator
can't happen until the kobject/attribute is exposed to userspace, that
is why the nested annotation is correct.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH V3 15/20] block: fail to show/store elevator sysfs attribute if elevator is dying
  2025-04-25 18:36   ` Christoph Hellwig
@ 2025-04-30  1:07     ` Ming Lei
  0 siblings, 0 replies; 71+ messages in thread
From: Ming Lei @ 2025-04-30  1:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström

On Fri, Apr 25, 2025 at 08:36:14PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 24, 2025 at 11:21:38PM +0800, Ming Lei wrote:
> > +	if (test_bit(ELEVATOR_FLAG_DYING, &e->flags))
> > +		error = -ENODEV;
> > +	else
> > +		error = e->type ? entry->show(e, page) : -ENOENT;
> 
> Weird style mix, I'd expand the check for ->type to a proper else
> if here as well.  But how can e->type actually be NULL here to start
> with?

You are right, e->type can't be NULL, which is assigned in its allocation,
never get cleared.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2025-04-30  1:07 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-24 15:21 [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Ming Lei
2025-04-24 15:21 ` [PATCH V3 01/20] block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue() Ming Lei
2025-04-24 15:21 ` [PATCH V3 02/20] block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag Ming Lei
2025-04-25 14:29   ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 03/20] block: don't call freeze queue in elevator_switch() and elevator_disable() Ming Lei
2025-04-25 14:29   ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 04/20] block: use q->elevator with ->elevator_lock held in elv_iosched_show() Ming Lei
2025-04-25  6:08   ` Hannes Reinecke
2025-04-25 10:53   ` Nilay Shroff
2025-04-25 14:29   ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 05/20] block: add two helpers for registering/un-registering sched debugfs Ming Lei
2025-04-24 15:21 ` [PATCH V3 06/20] block: move sched debugfs register into elvevator_register_queue Ming Lei
2025-04-24 15:21 ` [PATCH V3 07/20] block: prevent adding/deleting disk during updating nr_hw_queues Ming Lei
2025-04-25  6:33   ` Hannes Reinecke
2025-04-25 11:37   ` Nilay Shroff
2025-04-25 14:33   ` Christoph Hellwig
2025-04-25 16:46   ` kernel test robot
2025-04-24 15:21 ` [PATCH V3 08/20] block: don't allow to switch elevator if updating nr_hw_queues is in-progress Ming Lei
2025-04-25  6:33   ` Hannes Reinecke
2025-04-25 12:48   ` Nilay Shroff
2025-04-27  2:27     ` Ming Lei
2025-04-28 16:17       ` Nilay Shroff
2025-04-29  2:43         ` Ming Lei
2025-04-29 10:22           ` Nilay Shroff
2025-04-30  0:54             ` Ming Lei
2025-04-24 15:21 ` [PATCH V3 09/20] block: simplify elevator reattachment for updating nr_hw_queues Ming Lei
2025-04-25  6:34   ` Hannes Reinecke
2025-04-25 18:12   ` Christoph Hellwig
2025-04-29  9:51     ` Ming Lei
2025-04-24 15:21 ` [PATCH V3 10/20] block: move blk_unregister_queue() & device_del() after freeze wait Ming Lei
2025-04-25  6:35   ` Hannes Reinecke
2025-04-25 12:50   ` Nilay Shroff
2025-04-25 14:34   ` Christoph Hellwig
2025-04-28 11:51     ` Ming Lei
2025-04-24 15:21 ` [PATCH V3 11/20] block: move queue freezing & elevator_lock into elevator_change() Ming Lei
2025-04-25  6:36   ` Hannes Reinecke
2025-04-25 12:54   ` Nilay Shroff
2025-04-25 14:35   ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 12/20] block: add `struct elv_change_ctx` for unifying elevator change Ming Lei
2025-04-25  6:38   ` Hannes Reinecke
2025-04-25 18:23   ` Christoph Hellwig
2025-04-29 15:45     ` Ming Lei
2025-04-24 15:21 ` [PATCH V3 13/20] block: " Ming Lei
2025-04-25  6:39   ` Hannes Reinecke
2025-04-25 13:03   ` Nilay Shroff
2025-04-30  0:46     ` Ming Lei
2025-04-25 18:30   ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 14/20] block: pass elevator_queue to elv_register_queue & unregister_queue Ming Lei
2025-04-25  6:40   ` Hannes Reinecke
2025-04-24 15:21 ` [PATCH V3 15/20] block: fail to show/store elevator sysfs attribute if elevator is dying Ming Lei
2025-04-25  6:45   ` Hannes Reinecke
2025-04-25 18:36   ` Christoph Hellwig
2025-04-30  1:07     ` Ming Lei
2025-04-24 15:21 ` [PATCH V3 16/20] block: move elv_register[unregister]_queue out of elevator_lock Ming Lei
2025-04-25  6:46   ` Hannes Reinecke
2025-04-25 13:05   ` Nilay Shroff
2025-04-25 18:37   ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 17/20] block: move debugfs/sysfs register out of freezing queue Ming Lei
2025-04-25  6:47   ` Hannes Reinecke
2025-04-25 18:38   ` Christoph Hellwig
2025-04-28 11:28     ` Ming Lei
2025-04-24 15:21 ` [PATCH V3 18/20] block: remove several ->elevator_lock Ming Lei
2025-04-25  6:48   ` Hannes Reinecke
2025-04-25 18:38   ` Christoph Hellwig
2025-04-24 15:21 ` [PATCH V3 19/20] block: move hctx cpuhp add/del out of queue freezing Ming Lei
2025-04-25  6:49   ` Hannes Reinecke
2025-04-24 15:21 ` [PATCH V3 20/20] block: move wbt_enable_default() out of queue freezing from sched ->exit() Ming Lei
2025-04-25 13:10   ` Nilay Shroff
2025-04-29 10:59   ` Nilay Shroff
2025-04-29 12:00 ` [PATCH V3 00/20] block: unify elevator changing and fix lockdep warning Stefan Haberland
2025-04-29 12:11   ` Ming Lei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox