linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning
@ 2025-04-30  4:35 Ming Lei
  2025-04-30  4:35 ` [PATCH V4 01/24] block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue() Ming Lei
                   ` (24 more replies)
  0 siblings, 25 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

Hello Jens,

This patchset cleans up elevator change code, and unifying it via single
helper, meantime moves kobject_add/del & debugfs register/unregister out of
queue freezing & elevator_lock. This way fixes many lockdep warnings
reported recently, especially since fs_reclaim is connected with freeze lock
manually by commit ffa1e7ada456 ("block: Make request_queue lockdep splats
show up earlier").


Thanks,
Ming

V4:
	- pull Christoph's two elevator change/switch cleanup patches first,
	then the following elevator change unifying is simplified a lot(Christoph)

	- fold Nilay's patch into the last patch for avoiding one new
	lock warning on rqos_state_mutex (Nilay Shoff)

	- drop patch "block: move blk_unregister_queue() & device_del() after freeze
	wait" which may cause MD hang during shutdown, and add new patch "block: add
    new helper for disabling elevator switch when deleting disk" for dealing with
    one small race window (Christoph)

	- add patch "block: remove elevator queue's type check in elv_attr_show/store()"
	(Christoph)

    - rename ->update_nr_hwq_sema as ->update_nr_hwq_lock (Christoph)

	- add small patch "block: move blk_queue_registered() check into elv_iosched_store()"

V3:
	- replace srcu with rw_sem for avoiding race between add/del disk &
	  elevator switch and updating nr_hw_queues (Nilay Shoff)

	- add elv_update_nr_hw_queues() for elevator reattachment in case of
	updating nr_hw_queues, meantime keep elv_change_ctx as local structure
	(Christoph)

	- replace ->elevator_lock with disk->rqos_state_mutex for covering wbt
	state change

	- add new patch "block: use q->elevator with ->elevator_lock held in elv_iosched_show()"

	- small cleanup & commit log improvement

V2:
	- retry add/del disk when blk_mq_update_nr_hw_queues() is in-progress

	- swap blk_mq_add_queue_tag_set() with blk_mq_map_swqueue() in
	blk_mq_init_allocated_queue() (Nilay Shroff)

	- move ELEVATOR_FLAG_DISABLE_WBT to request queue's flags (Nilay Shoff) 

	- fix race because of delaying elevator unregister

	- define flags of `elv_change_ctx` as `bool` (Christoph)

	- improve comment and commit log (Christoph)

Christoph Hellwig (2):
  block: look up the elevator type in elevator_switch
  block: fold elevator_disable into elevator_switch

Ming Lei (22):
  block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue()
  block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag
  block: don't call freeze queue in elevator_switch() and
    elevator_disable()
  block: use q->elevator with ->elevator_lock held in elv_iosched_show()
  block: add two helpers for registering/un-registering sched debugfs
  block: move sched debugfs register into elvevator_register_queue
  block: prevent adding/deleting disk during updating nr_hw_queues
  block: don't allow to switch elevator if updating nr_hw_queues is
    in-progress
  block: move blk_queue_registered() check into elv_iosched_store()
  block: simplify elevator reattachment for updating nr_hw_queues
  block: move queue freezing & elevator_lock into elevator_change()
  block: add `struct elv_change_ctx` for unifying elevator change
  block: unifying elevator change
  block: pass elevator_queue to elv_register_queue & unregister_queue
  block: remove elevator queue's type check in elv_attr_show/store()
  block: fail to show/store elevator sysfs attribute if elevator is
    dying
  block: add new helper for disabling elevator switch when deleting disk
  block: move elv_register[unregister]_queue out of elevator_lock
  block: move hctx debugfs/sysfs registering out of freezing queue
  block: don't acquire ->elevator_lock in blk_mq_map_swqueue and
    blk_mq_realloc_hw_ctxs
  block: move hctx cpuhp add/del out of queue freezing
  block: move wbt_enable_default() out of queue freezing from sched
    ->exit()

 block/bfq-iosched.c    |   6 +-
 block/blk-mq-debugfs.c |  13 +-
 block/blk-mq-sched.c   |  41 ++++--
 block/blk-mq.c         | 132 +++--------------
 block/blk-sysfs.c      |  29 ++--
 block/blk-wbt.c        |   9 +-
 block/blk.h            |   8 +-
 block/elevator.c       | 322 ++++++++++++++++++++++++-----------------
 block/elevator.h       |   6 +-
 block/genhd.c          | 151 ++++++++++++-------
 include/linux/blk-mq.h |   3 +
 include/linux/blkdev.h |   8 +
 12 files changed, 369 insertions(+), 359 deletions(-)

-- 
2.47.0


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH V4 01/24] block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue()
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30  4:35 ` [PATCH V4 02/24] block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag Ming Lei
                   ` (23 subsequent siblings)
  24 siblings, 0 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Hannes Reinecke

Move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue(), and publish
this request queue to tagset after everything is setup.

This way is safe because BLK_MQ_F_TAG_QUEUE_SHARED isn't used by
blk_mq_map_swqueue(), and this flag is mainly checked in fast IO code
path.

Prepare for removing ->elevator_lock from blk_mq_map_swqueue() which
is supposed to be called when elevator switch can't be done.

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Closes: https://lore.kernel.org/linux-block/567cb7ab-23d6-4cee-a915-c8cdac903ddd@linux.ibm.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 554380bfd002..29cfc7ce2e0a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4581,8 +4581,8 @@ int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	q->nr_requests = set->queue_depth;
 
 	blk_mq_init_cpu_queues(q, set->nr_hw_queues);
-	blk_mq_add_queue_tag_set(set, q);
 	blk_mq_map_swqueue(q);
+	blk_mq_add_queue_tag_set(set, q);
 	return 0;
 
 err_hctxs:
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 02/24] block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
  2025-04-30  4:35 ` [PATCH V4 01/24] block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue() Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30  4:35 ` [PATCH V4 03/24] block: don't call freeze queue in elevator_switch() and elevator_disable() Ming Lei
                   ` (22 subsequent siblings)
  24 siblings, 0 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Hannes Reinecke

ELEVATOR_FLAG_DISABLE_WBT is only used by BFQ to disallow wbt when BFQ is
in use. The flag is set in BFQ's init(), and cleared in BFQ's exit().

Making it as request queue flag, so that we can avoid to deal with elevator
switch race. Also it isn't graceful to checking one scheduler flag in
wbt_enable_default().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/bfq-iosched.c    | 4 ++--
 block/blk-mq-debugfs.c | 1 +
 block/blk-wbt.c        | 3 +--
 block/elevator.h       | 1 -
 include/linux/blkdev.h | 3 +++
 5 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index abd80dc13562..cc6f59836dcd 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7210,7 +7210,7 @@ static void bfq_exit_queue(struct elevator_queue *e)
 #endif
 
 	blk_stat_disable_accounting(bfqd->queue);
-	clear_bit(ELEVATOR_FLAG_DISABLE_WBT, &e->flags);
+	blk_queue_flag_clear(QUEUE_FLAG_DISABLE_WBT_DEF, bfqd->queue);
 	wbt_enable_default(bfqd->queue->disk);
 
 	kfree(bfqd);
@@ -7397,7 +7397,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	/* We dispatch from request queue wide instead of hw queue */
 	blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q);
 
-	set_bit(ELEVATOR_FLAG_DISABLE_WBT, &eq->flags);
+	blk_queue_flag_set(QUEUE_FLAG_DISABLE_WBT_DEF, q);
 	wbt_disable_default(q->disk);
 	blk_stat_enable_accounting(q);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 3421b5521fe2..7710c409e432 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -93,6 +93,7 @@ static const char *const blk_queue_flag_name[] = {
 	QUEUE_FLAG_NAME(RQ_ALLOC_TIME),
 	QUEUE_FLAG_NAME(HCTX_ACTIVE),
 	QUEUE_FLAG_NAME(SQ_SCHED),
+	QUEUE_FLAG_NAME(DISABLE_WBT_DEF),
 };
 #undef QUEUE_FLAG_NAME
 
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index f1754d07f7e0..29cd2e33666f 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -704,8 +704,7 @@ void wbt_enable_default(struct gendisk *disk)
 	struct rq_qos *rqos;
 	bool enable = IS_ENABLED(CONFIG_BLK_WBT_MQ);
 
-	if (q->elevator &&
-	    test_bit(ELEVATOR_FLAG_DISABLE_WBT, &q->elevator->flags))
+	if (blk_queue_disable_wbt(q))
 		enable = false;
 
 	/* Throttling already enabled? */
diff --git a/block/elevator.h b/block/elevator.h
index e4e44dfac503..e27af5492cdb 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -121,7 +121,6 @@ struct elevator_queue
 };
 
 #define ELEVATOR_FLAG_REGISTERED	0
-#define ELEVATOR_FLAG_DISABLE_WBT	1
 
 /*
  * block elevator interface
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f3d74f9dae8e..9c373cf0eb47 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -644,6 +644,7 @@ enum {
 	QUEUE_FLAG_RQ_ALLOC_TIME,	/* record rq->alloc_time_ns */
 	QUEUE_FLAG_HCTX_ACTIVE,		/* at least one blk-mq hctx is active */
 	QUEUE_FLAG_SQ_SCHED,		/* single queue style io dispatch */
+	QUEUE_FLAG_DISABLE_WBT_DEF,	/* for sched to disable/enable wbt */
 	QUEUE_FLAG_MAX
 };
 
@@ -679,6 +680,8 @@ void blk_queue_flag_clear(unsigned int flag, struct request_queue *q);
 #define blk_queue_sq_sched(q)	test_bit(QUEUE_FLAG_SQ_SCHED, &(q)->queue_flags)
 #define blk_queue_skip_tagset_quiesce(q) \
 	((q)->limits.features & BLK_FEAT_SKIP_TAGSET_QUIESCE)
+#define blk_queue_disable_wbt(q)	\
+	test_bit(QUEUE_FLAG_DISABLE_WBT_DEF, &(q)->queue_flags)
 
 extern void blk_set_pm_only(struct request_queue *q);
 extern void blk_clear_pm_only(struct request_queue *q);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 03/24] block: don't call freeze queue in elevator_switch() and elevator_disable()
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
  2025-04-30  4:35 ` [PATCH V4 01/24] block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue() Ming Lei
  2025-04-30  4:35 ` [PATCH V4 02/24] block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30  4:35 ` [PATCH V4 04/24] block: use q->elevator with ->elevator_lock held in elv_iosched_show() Ming Lei
                   ` (21 subsequent siblings)
  24 siblings, 0 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Hannes Reinecke

Both elevator_switch() and elevator_disable() are only called from the
two code paths, in which queue is guaranteed to be frozen.

So don't call freeze queue in the two functions, also add asserts for
queue freeze.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index b4d08026b02c..5051a98dc08c 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -615,12 +615,11 @@ void elevator_init_mq(struct request_queue *q)
  */
 int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 {
-	unsigned int memflags;
 	int ret;
 
+	WARN_ON_ONCE(q->mq_freeze_depth == 0);
 	lockdep_assert_held(&q->elevator_lock);
 
-	memflags = blk_mq_freeze_queue(q);
 	blk_mq_quiesce_queue(q);
 
 	if (q->elevator) {
@@ -641,7 +640,6 @@ int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 
 out_unfreeze:
 	blk_mq_unquiesce_queue(q);
-	blk_mq_unfreeze_queue(q, memflags);
 
 	if (ret) {
 		pr_warn("elv: switch to \"%s\" failed, falling back to \"none\"\n",
@@ -653,11 +651,9 @@ int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 
 void elevator_disable(struct request_queue *q)
 {
-	unsigned int memflags;
-
+	WARN_ON_ONCE(q->mq_freeze_depth == 0);
 	lockdep_assert_held(&q->elevator_lock);
 
-	memflags = blk_mq_freeze_queue(q);
 	blk_mq_quiesce_queue(q);
 
 	elv_unregister_queue(q);
@@ -668,7 +664,6 @@ void elevator_disable(struct request_queue *q)
 	blk_add_trace_msg(q, "elv switch: none");
 
 	blk_mq_unquiesce_queue(q);
-	blk_mq_unfreeze_queue(q, memflags);
 }
 
 /*
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 04/24] block: use q->elevator with ->elevator_lock held in elv_iosched_show()
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (2 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 03/24] block: don't call freeze queue in elevator_switch() and elevator_disable() Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30  4:35 ` [PATCH V4 05/24] block: add two helpers for registering/un-registering sched debugfs Ming Lei
                   ` (20 subsequent siblings)
  24 siblings, 0 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Hannes Reinecke

Use q->elevator with ->elevator_lock held in elv_iosched_show(), since
the local cached elevator reference may become stale after getting
->elevator_lock.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index 5051a98dc08c..b32815594892 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -739,7 +739,6 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 ssize_t elv_iosched_show(struct gendisk *disk, char *name)
 {
 	struct request_queue *q = disk->queue;
-	struct elevator_queue *eq = q->elevator;
 	struct elevator_type *cur = NULL, *e;
 	int len = 0;
 
@@ -748,7 +747,7 @@ ssize_t elv_iosched_show(struct gendisk *disk, char *name)
 		len += sprintf(name+len, "[none] ");
 	} else {
 		len += sprintf(name+len, "none ");
-		cur = eq->type;
+		cur = q->elevator->type;
 	}
 
 	spin_lock(&elv_list_lock);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 05/24] block: add two helpers for registering/un-registering sched debugfs
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (3 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 04/24] block: use q->elevator with ->elevator_lock held in elv_iosched_show() Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30  4:35 ` [PATCH V4 06/24] block: move sched debugfs register into elvevator_register_queue Ming Lei
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Yu Kuai, Hannes Reinecke

Add blk_mq_sched_reg_debugfs()/blk_mq_sched_unreg_debugfs() to clean up
sched init/exit code a bit.

Register & unregister debugfs for sched & sched_hctx order is changed a
bit, but it is safe because sched & sched_hctx is guaranteed to be ready
when exporting via debugfs.

Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 45 +++++++++++++++++++++++++++++---------------
 1 file changed, 30 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 9b81771774ef..2abc5e0704e8 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -434,6 +434,30 @@ static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
 	return 0;
 }
 
+static void blk_mq_sched_reg_debugfs(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned long i;
+
+	mutex_lock(&q->debugfs_mutex);
+	blk_mq_debugfs_register_sched(q);
+	queue_for_each_hw_ctx(q, hctx, i)
+		blk_mq_debugfs_register_sched_hctx(q, hctx);
+	mutex_unlock(&q->debugfs_mutex);
+}
+
+static void blk_mq_sched_unreg_debugfs(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned long i;
+
+	mutex_lock(&q->debugfs_mutex);
+	queue_for_each_hw_ctx(q, hctx, i)
+		blk_mq_debugfs_unregister_sched_hctx(hctx);
+	blk_mq_debugfs_unregister_sched(q);
+	mutex_unlock(&q->debugfs_mutex);
+}
+
 /* caller must have a reference to @e, will grab another one if successful */
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 {
@@ -467,10 +491,6 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	if (ret)
 		goto err_free_map_and_rqs;
 
-	mutex_lock(&q->debugfs_mutex);
-	blk_mq_debugfs_register_sched(q);
-	mutex_unlock(&q->debugfs_mutex);
-
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (e->ops.init_hctx) {
 			ret = e->ops.init_hctx(hctx, i);
@@ -482,11 +502,11 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 				return ret;
 			}
 		}
-		mutex_lock(&q->debugfs_mutex);
-		blk_mq_debugfs_register_sched_hctx(q, hctx);
-		mutex_unlock(&q->debugfs_mutex);
 	}
 
+	/* sched is initialized, it is ready to export it via debugfs */
+	blk_mq_sched_reg_debugfs(q);
+
 	return 0;
 
 err_free_map_and_rqs:
@@ -524,11 +544,10 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
 	unsigned long i;
 	unsigned int flags = 0;
 
-	queue_for_each_hw_ctx(q, hctx, i) {
-		mutex_lock(&q->debugfs_mutex);
-		blk_mq_debugfs_unregister_sched_hctx(hctx);
-		mutex_unlock(&q->debugfs_mutex);
+	/* unexport via debugfs before exiting sched */
+	blk_mq_sched_unreg_debugfs(q);
 
+	queue_for_each_hw_ctx(q, hctx, i) {
 		if (e->type->ops.exit_hctx && hctx->sched_data) {
 			e->type->ops.exit_hctx(hctx, i);
 			hctx->sched_data = NULL;
@@ -536,10 +555,6 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
 		flags = hctx->flags;
 	}
 
-	mutex_lock(&q->debugfs_mutex);
-	blk_mq_debugfs_unregister_sched(q);
-	mutex_unlock(&q->debugfs_mutex);
-
 	if (e->type->ops.exit_sched)
 		e->type->ops.exit_sched(e);
 	blk_mq_sched_tags_teardown(q, flags);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 06/24] block: move sched debugfs register into elvevator_register_queue
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (4 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 05/24] block: add two helpers for registering/un-registering sched debugfs Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30  4:35 ` [PATCH V4 07/24] block: prevent adding/deleting disk during updating nr_hw_queues Ming Lei
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Yu Kuai, Hannes Reinecke

sched debugfs shares same lifetime with scheduler's kobject, and same
lock(elevator lock), so move sched debugfs register/unregister into
elevator_register_queue() and elevator_unregister_queue().

Then we needn't blk_mq_debugfs_register() for us to register sched
debugfs any more.

Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c | 11 -----------
 block/blk-mq-sched.c   | 11 ++---------
 block/elevator.c       |  8 ++++++++
 block/elevator.h       |  3 +++
 4 files changed, 13 insertions(+), 20 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 7710c409e432..2837a8ce8054 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -625,20 +625,9 @@ void blk_mq_debugfs_register(struct request_queue *q)
 
 	debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);
 
-	/*
-	 * blk_mq_init_sched() attempted to do this already, but q->debugfs_dir
-	 * didn't exist yet (because we don't know what to name the directory
-	 * until the queue is registered to a gendisk).
-	 */
-	if (q->elevator && !q->sched_debugfs_dir)
-		blk_mq_debugfs_register_sched(q);
-
-	/* Similarly, blk_mq_init_hctx() couldn't do this previously. */
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx->debugfs_dir)
 			blk_mq_debugfs_register_hctx(q, hctx);
-		if (q->elevator && !hctx->sched_debugfs_dir)
-			blk_mq_debugfs_register_sched_hctx(q, hctx);
 	}
 
 	if (q->rq_qos) {
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 2abc5e0704e8..336a15ffecfa 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -434,7 +434,7 @@ static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
 	return 0;
 }
 
-static void blk_mq_sched_reg_debugfs(struct request_queue *q)
+void blk_mq_sched_reg_debugfs(struct request_queue *q)
 {
 	struct blk_mq_hw_ctx *hctx;
 	unsigned long i;
@@ -446,7 +446,7 @@ static void blk_mq_sched_reg_debugfs(struct request_queue *q)
 	mutex_unlock(&q->debugfs_mutex);
 }
 
-static void blk_mq_sched_unreg_debugfs(struct request_queue *q)
+void blk_mq_sched_unreg_debugfs(struct request_queue *q)
 {
 	struct blk_mq_hw_ctx *hctx;
 	unsigned long i;
@@ -503,10 +503,6 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 			}
 		}
 	}
-
-	/* sched is initialized, it is ready to export it via debugfs */
-	blk_mq_sched_reg_debugfs(q);
-
 	return 0;
 
 err_free_map_and_rqs:
@@ -544,9 +540,6 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
 	unsigned long i;
 	unsigned int flags = 0;
 
-	/* unexport via debugfs before exiting sched */
-	blk_mq_sched_unreg_debugfs(q);
-
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (e->type->ops.exit_hctx && hctx->sched_data) {
 			e->type->ops.exit_hctx(hctx, i);
diff --git a/block/elevator.c b/block/elevator.c
index b32815594892..4400eb8fe54f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -472,6 +472,11 @@ int elv_register_queue(struct request_queue *q, bool uevent)
 		if (uevent)
 			kobject_uevent(&e->kobj, KOBJ_ADD);
 
+		/*
+		 * Sched is initialized, it is ready to export it via
+		 * debugfs
+		 */
+		blk_mq_sched_reg_debugfs(q);
 		set_bit(ELEVATOR_FLAG_REGISTERED, &e->flags);
 	}
 	return error;
@@ -486,6 +491,9 @@ void elv_unregister_queue(struct request_queue *q)
 	if (e && test_and_clear_bit(ELEVATOR_FLAG_REGISTERED, &e->flags)) {
 		kobject_uevent(&e->kobj, KOBJ_REMOVE);
 		kobject_del(&e->kobj);
+
+		/* unexport via debugfs before exiting sched */
+		blk_mq_sched_unreg_debugfs(q);
 	}
 }
 
diff --git a/block/elevator.h b/block/elevator.h
index e27af5492cdb..9198676644a9 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -181,4 +181,7 @@ extern struct request *elv_rb_find(struct rb_root *, sector_t);
 #define rq_entry_fifo(ptr)	list_entry((ptr), struct request, queuelist)
 #define rq_fifo_clear(rq)	list_del_init(&(rq)->queuelist)
 
+void blk_mq_sched_reg_debugfs(struct request_queue *q);
+void blk_mq_sched_unreg_debugfs(struct request_queue *q);
+
 #endif /* _ELEVATOR_H */
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 07/24] block: prevent adding/deleting disk during updating nr_hw_queues
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (5 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 06/24] block: move sched debugfs register into elvevator_register_queue Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30  4:35 ` [PATCH V4 08/24] block: don't allow to switch elevator if updating nr_hw_queues is in-progress Ming Lei
                   ` (17 subsequent siblings)
  24 siblings, 0 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Hannes Reinecke

Both adding/deleting disk code are reader of `nr_hw_queues`, so we can't
allow them in-progress when updating nr_hw_queues, kernel panic and
kasan has been reported in [1].

Prevent adding/deleting disk during updating nr_hw_queues by adding
rw_semaphore to tagset, write lock is grabbed in blk_mq_update_nr_hw_queues(),
and read lock is acquired when adding/deleting disk.

Also mark GFP_NOIO allocation scope for adding/deleting disk because
blk_mq_update_nr_hw_queues() is part of some driver's error handler.

This way avoids lot of trouble.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Suggested-by: Nilay Shroff <nilay@linux.ibm.com>
Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Closes: https://lore.kernel.org/linux-block/a5896cdb-a59a-4a37-9f99-20522f5d2987@linux.ibm.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         |   4 ++
 block/genhd.c          | 104 ++++++++++++++++++++++++++++-------------
 include/linux/blk-mq.h |   3 ++
 3 files changed, 78 insertions(+), 33 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 29cfc7ce2e0a..3706c0dde2fc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4802,6 +4802,8 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 			goto out_free_srcu;
 	}
 
+	init_rwsem(&set->update_nr_hwq_lock);
+
 	ret = -ENOMEM;
 	set->tags = kcalloc_node(set->nr_hw_queues,
 				 sizeof(struct blk_mq_tags *), GFP_KERNEL,
@@ -5097,9 +5099,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 {
+	down_write(&set->update_nr_hwq_lock);
 	mutex_lock(&set->tag_list_lock);
 	__blk_mq_update_nr_hw_queues(set, nr_hw_queues);
 	mutex_unlock(&set->tag_list_lock);
+	up_write(&set->update_nr_hwq_lock);
 }
 EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
 
diff --git a/block/genhd.c b/block/genhd.c
index c2bd86cd09de..340fb4555963 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -389,19 +389,9 @@ int disk_scan_partitions(struct gendisk *disk, blk_mode_t mode)
 	return ret;
 }
 
-/**
- * add_disk_fwnode - add disk information to kernel list with fwnode
- * @parent: parent device for the disk
- * @disk: per-device partitioning information
- * @groups: Additional per-device sysfs groups
- * @fwnode: attached disk fwnode
- *
- * This function registers the partitioning information in @disk
- * with the kernel. Also attach a fwnode to the disk device.
- */
-int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
-				 const struct attribute_group **groups,
-				 struct fwnode_handle *fwnode)
+static int __add_disk(struct device *parent, struct gendisk *disk,
+		      const struct attribute_group **groups,
+		      struct fwnode_handle *fwnode)
 
 {
 	struct device *ddev = disk_to_dev(disk);
@@ -572,6 +562,37 @@ int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
 	}
 	return ret;
 }
+
+/**
+ * add_disk_fwnode - add disk information to kernel list with fwnode
+ * @parent: parent device for the disk
+ * @disk: per-device partitioning information
+ * @groups: Additional per-device sysfs groups
+ * @fwnode: attached disk fwnode
+ *
+ * This function registers the partitioning information in @disk
+ * with the kernel. Also attach a fwnode to the disk device.
+ */
+int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk,
+				 const struct attribute_group **groups,
+				 struct fwnode_handle *fwnode)
+{
+	struct blk_mq_tag_set *set;
+	unsigned int memflags;
+	int ret;
+
+	if (!queue_is_mq(disk->queue))
+		return __add_disk(parent, disk, groups, fwnode);
+
+	set = disk->queue->tag_set;
+	memflags = memalloc_noio_save();
+	down_read(&set->update_nr_hwq_lock);
+	ret = __add_disk(parent, disk, groups, fwnode);
+	up_read(&set->update_nr_hwq_lock);
+	memalloc_noio_restore(memflags);
+
+	return ret;
+}
 EXPORT_SYMBOL_GPL(add_disk_fwnode);
 
 /**
@@ -652,26 +673,7 @@ void blk_mark_disk_dead(struct gendisk *disk)
 }
 EXPORT_SYMBOL_GPL(blk_mark_disk_dead);
 
-/**
- * del_gendisk - remove the gendisk
- * @disk: the struct gendisk to remove
- *
- * Removes the gendisk and all its associated resources. This deletes the
- * partitions associated with the gendisk, and unregisters the associated
- * request_queue.
- *
- * This is the counter to the respective __device_add_disk() call.
- *
- * The final removal of the struct gendisk happens when its refcount reaches 0
- * with put_disk(), which should be called after del_gendisk(), if
- * __device_add_disk() was used.
- *
- * Drivers exist which depend on the release of the gendisk to be synchronous,
- * it should not be deferred.
- *
- * Context: can sleep
- */
-void del_gendisk(struct gendisk *disk)
+static void __del_gendisk(struct gendisk *disk)
 {
 	struct request_queue *q = disk->queue;
 	struct block_device *part;
@@ -764,6 +766,42 @@ void del_gendisk(struct gendisk *disk)
 	if (start_drain)
 		blk_unfreeze_release_lock(q);
 }
+
+/**
+ * del_gendisk - remove the gendisk
+ * @disk: the struct gendisk to remove
+ *
+ * Removes the gendisk and all its associated resources. This deletes the
+ * partitions associated with the gendisk, and unregisters the associated
+ * request_queue.
+ *
+ * This is the counter to the respective __device_add_disk() call.
+ *
+ * The final removal of the struct gendisk happens when its refcount reaches 0
+ * with put_disk(), which should be called after del_gendisk(), if
+ * __device_add_disk() was used.
+ *
+ * Drivers exist which depend on the release of the gendisk to be synchronous,
+ * it should not be deferred.
+ *
+ * Context: can sleep
+ */
+void del_gendisk(struct gendisk *disk)
+{
+	struct blk_mq_tag_set *set;
+	unsigned int memflags;
+
+	if (!queue_is_mq(disk->queue)) {
+		__del_gendisk(disk);
+	} else {
+		set = disk->queue->tag_set;
+		memflags = memalloc_noio_save();
+		down_read(&set->update_nr_hwq_lock);
+		__del_gendisk(disk);
+		up_read(&set->update_nr_hwq_lock);
+		memalloc_noio_restore(memflags);
+	}
+}
 EXPORT_SYMBOL(del_gendisk);
 
 /**
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8eb9b3310167..ef84d53095a6 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -9,6 +9,7 @@
 #include <linux/prefetch.h>
 #include <linux/srcu.h>
 #include <linux/rw_hint.h>
+#include <linux/rwsem.h>
 
 struct blk_mq_tags;
 struct blk_flush_queue;
@@ -527,6 +528,8 @@ struct blk_mq_tag_set {
 	struct mutex		tag_list_lock;
 	struct list_head	tag_list;
 	struct srcu_struct	*srcu;
+
+	struct rw_semaphore	update_nr_hwq_lock;
 };
 
 /**
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 08/24] block: don't allow to switch elevator if updating nr_hw_queues is in-progress
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (6 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 07/24] block: prevent adding/deleting disk during updating nr_hw_queues Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30 14:09   ` Christoph Hellwig
  2025-04-30  4:35 ` [PATCH V4 09/24] block: look up the elevator type in elevator_switch Ming Lei
                   ` (16 subsequent siblings)
  24 siblings, 1 reply; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Hannes Reinecke

Elevator switch code is another `nr_hw_queue` reader in non-fast-IO code
path, so it can't be done if updating `nr_hw_queues` is in-progress.

Take same approach with not allowing add/del disk when updating
nr_hw_queues is in-progress, by grabbing read lock of
set->update_nr_hwq_sema.

Take the nested variant for avoiding the following false positive
splat[1], and this way is correct because:

- the read lock in elv_iosched_store() is not overlapped with the read lock
in adding/deleting disk:

- storing to kobject attribute is only available after the kobject is added
and before it is deleted

  -> #4 (&q->q_usage_counter(queue){++++}-{0:0}:
  -> #3 (&q->limits_lock){+.+.}-{4:4}:
  -> #2 (&disk->open_mutex){+.+.}-{4:4}:
  -> #1 (&set->update_nr_hwq_lock){.+.+}-{4:4}:
  -> #0 (kn->active#103){++++}-{0:0}:

Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/linux-block/aAWv3NPtNIKKvJZc@fedora/ [1]
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/elevator.c b/block/elevator.c
index 4400eb8fe54f..12296f77efbd 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -723,6 +723,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 	int ret;
 	unsigned int memflags;
 	struct request_queue *q = disk->queue;
+	struct blk_mq_tag_set *set = q->tag_set;
 
 	/*
 	 * If the attribute needs to load a module, do it before freezing the
@@ -734,6 +735,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 
 	elv_iosched_load_module(name);
 
+	down_read_nested(&set->update_nr_hwq_lock, 1);
 	memflags = blk_mq_freeze_queue(q);
 	mutex_lock(&q->elevator_lock);
 	ret = elevator_change(q, name);
@@ -741,6 +743,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 		ret = count;
 	mutex_unlock(&q->elevator_lock);
 	blk_mq_unfreeze_queue(q, memflags);
+	up_read(&set->update_nr_hwq_lock);
 	return ret;
 }
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 09/24] block: look up the elevator type in elevator_switch
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (7 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 08/24] block: don't allow to switch elevator if updating nr_hw_queues is in-progress Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30  6:14   ` Hannes Reinecke
  2025-05-01  6:47   ` Nilay Shroff
  2025-04-30  4:35 ` [PATCH V4 10/24] block: fold elevator_disable into elevator_switch Ming Lei
                   ` (15 subsequent siblings)
  24 siblings, 2 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

From: Christoph Hellwig <hch@lst.de>

That makes the function nicely self-contained and can be used
to avoid code duplication.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c   |  2 +-
 block/blk.h      |  2 +-
 block/elevator.c | 18 ++++++++----------
 3 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3706c0dde2fc..54f1d8b4e35c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -5014,7 +5014,7 @@ static void blk_mq_elv_switch_back(struct list_head *head,
 	kfree(qe);
 
 	mutex_lock(&q->elevator_lock);
-	elevator_switch(q, t);
+	elevator_switch(q, t->elevator_name);
 	/* drop the reference acquired in blk_mq_elv_switch_none */
 	elevator_put(t);
 	mutex_unlock(&q->elevator_lock);
diff --git a/block/blk.h b/block/blk.h
index 328075787814..ac90a8c5a8cf 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -322,7 +322,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
 
 bool blk_insert_flush(struct request *rq);
 
-int elevator_switch(struct request_queue *q, struct elevator_type *new_e);
+int elevator_switch(struct request_queue *q, const char *name);
 void elevator_disable(struct request_queue *q);
 void elevator_exit(struct request_queue *q);
 int elv_register_queue(struct request_queue *q, bool uevent);
diff --git a/block/elevator.c b/block/elevator.c
index 12296f77efbd..3ab4ed17de0f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -621,13 +621,18 @@ void elevator_init_mq(struct request_queue *q)
  * If switching fails, we are most likely running out of memory and not able
  * to restore the old io scheduler, so leaving the io scheduler being none.
  */
-int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
+int elevator_switch(struct request_queue *q, const char *name)
 {
+	struct elevator_type *new_e;
 	int ret;
 
 	WARN_ON_ONCE(q->mq_freeze_depth == 0);
 	lockdep_assert_held(&q->elevator_lock);
 
+	new_e = elevator_find_get(name);
+	if (!new_e)
+		return -EINVAL;
+
 	blk_mq_quiesce_queue(q);
 
 	if (q->elevator) {
@@ -654,6 +659,7 @@ int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 			new_e->elevator_name);
 	}
 
+	elevator_put(new_e);
 	return ret;
 }
 
@@ -679,9 +685,6 @@ void elevator_disable(struct request_queue *q)
  */
 static int elevator_change(struct request_queue *q, const char *elevator_name)
 {
-	struct elevator_type *e;
-	int ret;
-
 	/* Make sure queue is not in the middle of being removed */
 	if (!blk_queue_registered(q))
 		return -ENOENT;
@@ -695,12 +698,7 @@ static int elevator_change(struct request_queue *q, const char *elevator_name)
 	if (q->elevator && elevator_match(q->elevator->type, elevator_name))
 		return 0;
 
-	e = elevator_find_get(elevator_name);
-	if (!e)
-		return -EINVAL;
-	ret = elevator_switch(q, e);
-	elevator_put(e);
-	return ret;
+	return elevator_switch(q, elevator_name);
 }
 
 static void elv_iosched_load_module(char *elevator_name)
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 10/24] block: fold elevator_disable into elevator_switch
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (8 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 09/24] block: look up the elevator type in elevator_switch Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30  6:14   ` Hannes Reinecke
  2025-05-01  6:59   ` Nilay Shroff
  2025-04-30  4:35 ` [PATCH V4 11/24] block: move blk_queue_registered() check into elv_iosched_store() Ming Lei
                   ` (14 subsequent siblings)
  24 siblings, 2 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

From: Christoph Hellwig <hch@lst.de>

This removes duplicate code, and keeps the callers tidy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 61 ++++++++++++++++++------------------------------
 1 file changed, 23 insertions(+), 38 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index 3ab4ed17de0f..df3e59107a2a 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -623,15 +623,17 @@ void elevator_init_mq(struct request_queue *q)
  */
 int elevator_switch(struct request_queue *q, const char *name)
 {
-	struct elevator_type *new_e;
-	int ret;
+	struct elevator_type *new_e = NULL;
+	int ret = 0;
 
 	WARN_ON_ONCE(q->mq_freeze_depth == 0);
 	lockdep_assert_held(&q->elevator_lock);
 
-	new_e = elevator_find_get(name);
-	if (!new_e)
-		return -EINVAL;
+	if (strncmp(name, "none", 4)) {
+		new_e = elevator_find_get(name);
+		if (!new_e)
+			return -EINVAL;
+	}
 
 	blk_mq_quiesce_queue(q);
 
@@ -640,16 +642,21 @@ int elevator_switch(struct request_queue *q, const char *name)
 		elevator_exit(q);
 	}
 
-	ret = blk_mq_init_sched(q, new_e);
-	if (ret)
-		goto out_unfreeze;
-
-	ret = elv_register_queue(q, true);
-	if (ret) {
-		elevator_exit(q);
-		goto out_unfreeze;
+	if (new_e) {
+		ret = blk_mq_init_sched(q, new_e);
+		if (ret)
+			goto out_unfreeze;
+		ret = elv_register_queue(q, true);
+		if (ret) {
+			elevator_exit(q);
+			goto out_unfreeze;
+		}
+	} else {
+		blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q);
+		q->elevator = NULL;
+		q->nr_requests = q->tag_set->queue_depth;
 	}
-	blk_add_trace_msg(q, "elv switch: %s", new_e->elevator_name);
+	blk_add_trace_msg(q, "elv switch: %s", name);
 
 out_unfreeze:
 	blk_mq_unquiesce_queue(q);
@@ -659,27 +666,11 @@ int elevator_switch(struct request_queue *q, const char *name)
 			new_e->elevator_name);
 	}
 
-	elevator_put(new_e);
+	if (new_e)
+		elevator_put(new_e);
 	return ret;
 }
 
-void elevator_disable(struct request_queue *q)
-{
-	WARN_ON_ONCE(q->mq_freeze_depth == 0);
-	lockdep_assert_held(&q->elevator_lock);
-
-	blk_mq_quiesce_queue(q);
-
-	elv_unregister_queue(q);
-	elevator_exit(q);
-	blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q);
-	q->elevator = NULL;
-	q->nr_requests = q->tag_set->queue_depth;
-	blk_add_trace_msg(q, "elv switch: none");
-
-	blk_mq_unquiesce_queue(q);
-}
-
 /*
  * Switch this queue to the given IO scheduler.
  */
@@ -689,12 +680,6 @@ static int elevator_change(struct request_queue *q, const char *elevator_name)
 	if (!blk_queue_registered(q))
 		return -ENOENT;
 
-	if (!strncmp(elevator_name, "none", 4)) {
-		if (q->elevator)
-			elevator_disable(q);
-		return 0;
-	}
-
 	if (q->elevator && elevator_match(q->elevator->type, elevator_name))
 		return 0;
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 11/24] block: move blk_queue_registered() check into elv_iosched_store()
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (9 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 10/24] block: fold elevator_disable into elevator_switch Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30  6:15   ` Hannes Reinecke
  2025-05-01  9:55   ` Nilay Shroff
  2025-04-30  4:35 ` [PATCH V4 12/24] block: simplify elevator reattachment for updating nr_hw_queues Ming Lei
                   ` (13 subsequent siblings)
  24 siblings, 2 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

Move blk_queue_registered() check into elv_iosched_store() and prepare
for using elevator_change() for covering any kind of elevator change in
adding/deleting disk and updating nr_hw_queue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index df3e59107a2a..ac72c4f5a542 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -676,10 +676,6 @@ int elevator_switch(struct request_queue *q, const char *name)
  */
 static int elevator_change(struct request_queue *q, const char *elevator_name)
 {
-	/* Make sure queue is not in the middle of being removed */
-	if (!blk_queue_registered(q))
-		return -ENOENT;
-
 	if (q->elevator && elevator_match(q->elevator->type, elevator_name))
 		return 0;
 
@@ -708,6 +704,10 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 	struct request_queue *q = disk->queue;
 	struct blk_mq_tag_set *set = q->tag_set;
 
+	/* Make sure queue is not in the middle of being removed */
+	if (!blk_queue_registered(q))
+		return -ENOENT;
+
 	/*
 	 * If the attribute needs to load a module, do it before freezing the
 	 * queue to ensure that the module file can be read when the request
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 12/24] block: simplify elevator reattachment for updating nr_hw_queues
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (10 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 11/24] block: move blk_queue_registered() check into elv_iosched_store() Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30  4:35 ` [PATCH V4 13/24] block: move queue freezing & elevator_lock into elevator_change() Ming Lei
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Hannes Reinecke

In blk_mq_update_nr_hw_queues(), nr_hw_queues changes and elevator data
depends on it, and elevator has to be reattached, so call elevator_switch()
to force attachment.

Add elv_update_nr_hw_queues() simply for blk_mq_update_nr_hw_queues() to
reattach elevator, since elevator switch isn't likely when running
blk_mq_update_nr_hw_queues(). This way removes the current switch
none and switch back code.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c   | 90 +-----------------------------------------------
 block/blk.h      |  3 +-
 block/elevator.c | 20 ++++++++++-
 3 files changed, 21 insertions(+), 92 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 54f1d8b4e35c..b920eaedbaa9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4943,88 +4943,10 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 	return ret;
 }
 
-/*
- * request_queue and elevator_type pair.
- * It is just used by __blk_mq_update_nr_hw_queues to cache
- * the elevator_type associated with a request_queue.
- */
-struct blk_mq_qe_pair {
-	struct list_head node;
-	struct request_queue *q;
-	struct elevator_type *type;
-};
-
-/*
- * Cache the elevator_type in qe pair list and switch the
- * io scheduler to 'none'
- */
-static bool blk_mq_elv_switch_none(struct list_head *head,
-		struct request_queue *q)
-{
-	struct blk_mq_qe_pair *qe;
-
-	qe = kmalloc(sizeof(*qe), GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY);
-	if (!qe)
-		return false;
-
-	/* Accessing q->elevator needs protection from ->elevator_lock. */
-	mutex_lock(&q->elevator_lock);
-
-	if (!q->elevator) {
-		kfree(qe);
-		goto unlock;
-	}
-
-	INIT_LIST_HEAD(&qe->node);
-	qe->q = q;
-	qe->type = q->elevator->type;
-	/* keep a reference to the elevator module as we'll switch back */
-	__elevator_get(qe->type);
-	list_add(&qe->node, head);
-	elevator_disable(q);
-unlock:
-	mutex_unlock(&q->elevator_lock);
-
-	return true;
-}
-
-static struct blk_mq_qe_pair *blk_lookup_qe_pair(struct list_head *head,
-						struct request_queue *q)
-{
-	struct blk_mq_qe_pair *qe;
-
-	list_for_each_entry(qe, head, node)
-		if (qe->q == q)
-			return qe;
-
-	return NULL;
-}
-
-static void blk_mq_elv_switch_back(struct list_head *head,
-				  struct request_queue *q)
-{
-	struct blk_mq_qe_pair *qe;
-	struct elevator_type *t;
-
-	qe = blk_lookup_qe_pair(head, q);
-	if (!qe)
-		return;
-	t = qe->type;
-	list_del(&qe->node);
-	kfree(qe);
-
-	mutex_lock(&q->elevator_lock);
-	elevator_switch(q, t->elevator_name);
-	/* drop the reference acquired in blk_mq_elv_switch_none */
-	elevator_put(t);
-	mutex_unlock(&q->elevator_lock);
-}
-
 static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 							int nr_hw_queues)
 {
 	struct request_queue *q;
-	LIST_HEAD(head);
 	int prev_nr_hw_queues = set->nr_hw_queues;
 	unsigned int memflags;
 	int i;
@@ -5042,15 +4964,6 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
 		blk_mq_freeze_queue_nomemsave(q);
 
-	/*
-	 * Switch IO scheduler to 'none', cleaning up the data associated
-	 * with the previous scheduler. We will switch back once we are done
-	 * updating the new sw to hw queue mappings.
-	 */
-	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		if (!blk_mq_elv_switch_none(&head, q))
-			goto switch_back;
-
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_debugfs_unregister_hctxs(q);
 		blk_mq_sysfs_unregister_hctxs(q);
@@ -5084,9 +4997,8 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		blk_mq_debugfs_register_hctxs(q);
 	}
 
-switch_back:
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		blk_mq_elv_switch_back(&head, q);
+		elv_update_nr_hw_queues(q);
 
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
 		blk_mq_unfreeze_queue_nomemrestore(q);
diff --git a/block/blk.h b/block/blk.h
index ac90a8c5a8cf..98fd3a6f5ec9 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -322,8 +322,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
 
 bool blk_insert_flush(struct request *rq);
 
-int elevator_switch(struct request_queue *q, const char *name);
-void elevator_disable(struct request_queue *q);
+void elv_update_nr_hw_queues(struct request_queue *q);
 void elevator_exit(struct request_queue *q);
 int elv_register_queue(struct request_queue *q, bool uevent);
 void elv_unregister_queue(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index ac72c4f5a542..d7c9e1d2f305 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -621,7 +621,7 @@ void elevator_init_mq(struct request_queue *q)
  * If switching fails, we are most likely running out of memory and not able
  * to restore the old io scheduler, so leaving the io scheduler being none.
  */
-int elevator_switch(struct request_queue *q, const char *name)
+static int elevator_switch(struct request_queue *q, const char *name)
 {
 	struct elevator_type *new_e = NULL;
 	int ret = 0;
@@ -682,6 +682,24 @@ static int elevator_change(struct request_queue *q, const char *elevator_name)
 	return elevator_switch(q, elevator_name);
 }
 
+/*
+ * The I/O scheduler depends on the number of hardware queues, this forces a
+ * reattachment when nr_hw_queues changes.
+ */
+void elv_update_nr_hw_queues(struct request_queue *q)
+{
+	const char *name = "none";
+
+	WARN_ON_ONCE(q->mq_freeze_depth == 0);
+
+	mutex_lock(&q->elevator_lock);
+	if (q->elevator && !blk_queue_dying(q) && !blk_queue_registered(q))
+		name = q->elevator->type->elevator_name;
+	/* force to reattach elevator after nr_hw_queue is updated */
+	elevator_switch(q, name);
+	mutex_unlock(&q->elevator_lock);
+}
+
 static void elv_iosched_load_module(char *elevator_name)
 {
 	struct elevator_type *found;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 13/24] block: move queue freezing & elevator_lock into elevator_change()
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (11 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 12/24] block: simplify elevator reattachment for updating nr_hw_queues Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30 14:13   ` Christoph Hellwig
  2025-04-30  4:35 ` [PATCH V4 14/24] block: add `struct elv_change_ctx` for unifying elevator change Ming Lei
                   ` (11 subsequent siblings)
  24 siblings, 1 reply; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Hannes Reinecke

Move queue freezing & elevator_lock into elevator_change(), and prepare
for using elevator_change() for setting up & tearing down default elevator
too.

Also add lockdep_assert_held() in __elevator_change() because either
read or write lock is required for changing elevator.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index d7c9e1d2f305..df1f5451fc5e 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -676,10 +676,19 @@ static int elevator_switch(struct request_queue *q, const char *name)
  */
 static int elevator_change(struct request_queue *q, const char *elevator_name)
 {
-	if (q->elevator && elevator_match(q->elevator->type, elevator_name))
-		return 0;
+	unsigned int memflags;
+	int ret = 0;
 
-	return elevator_switch(q, elevator_name);
+	lockdep_assert_held(&q->tag_set->update_nr_hwq_lock);
+
+	memflags = blk_mq_freeze_queue(q);
+	mutex_lock(&q->elevator_lock);
+	if (!(q->elevator && elevator_match(q->elevator->type,
+				elevator_name)))
+		ret = elevator_switch(q, elevator_name);
+	mutex_unlock(&q->elevator_lock);
+	blk_mq_unfreeze_queue(q, memflags);
+	return ret;
 }
 
 /*
@@ -718,7 +727,6 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 	char elevator_name[ELV_NAME_MAX];
 	char *name;
 	int ret;
-	unsigned int memflags;
 	struct request_queue *q = disk->queue;
 	struct blk_mq_tag_set *set = q->tag_set;
 
@@ -737,13 +745,9 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 	elv_iosched_load_module(name);
 
 	down_read_nested(&set->update_nr_hwq_lock, 1);
-	memflags = blk_mq_freeze_queue(q);
-	mutex_lock(&q->elevator_lock);
 	ret = elevator_change(q, name);
 	if (!ret)
 		ret = count;
-	mutex_unlock(&q->elevator_lock);
-	blk_mq_unfreeze_queue(q, memflags);
 	up_read(&set->update_nr_hwq_lock);
 	return ret;
 }
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 14/24] block: add `struct elv_change_ctx` for unifying elevator change
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (12 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 13/24] block: move queue freezing & elevator_lock into elevator_change() Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30 14:15   ` Christoph Hellwig
  2025-04-30  4:35 ` [PATCH V4 15/24] block: " Ming Lei
                   ` (10 subsequent siblings)
  24 siblings, 1 reply; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Hannes Reinecke

Add `struct elv_change_ctx` and prepare for unifying elevator change by
elevator_change(). With this way, any input & output parameter can
be provided & observed in top helper.

This way helps to move kobject add/delete & debugfs register/unregister
out of ->elevator_lock & freezing queue.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 41 ++++++++++++++++++++++++-----------------
 1 file changed, 24 insertions(+), 17 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index df1f5451fc5e..5188bb0119f0 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -45,6 +45,12 @@
 #include "blk-wbt.h"
 #include "blk-cgroup.h"
 
+/* Holding context data for changing elevator */
+struct elv_change_ctx {
+	const char *name;
+	bool no_uevent;
+};
+
 static DEFINE_SPINLOCK(elv_list_lock);
 static LIST_HEAD(elv_list);
 
@@ -621,7 +627,7 @@ void elevator_init_mq(struct request_queue *q)
  * If switching fails, we are most likely running out of memory and not able
  * to restore the old io scheduler, so leaving the io scheduler being none.
  */
-static int elevator_switch(struct request_queue *q, const char *name)
+static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
 {
 	struct elevator_type *new_e = NULL;
 	int ret = 0;
@@ -629,8 +635,8 @@ static int elevator_switch(struct request_queue *q, const char *name)
 	WARN_ON_ONCE(q->mq_freeze_depth == 0);
 	lockdep_assert_held(&q->elevator_lock);
 
-	if (strncmp(name, "none", 4)) {
-		new_e = elevator_find_get(name);
+	if (strncmp(ctx->name, "none", 4)) {
+		new_e = elevator_find_get(ctx->name);
 		if (!new_e)
 			return -EINVAL;
 	}
@@ -646,7 +652,7 @@ static int elevator_switch(struct request_queue *q, const char *name)
 		ret = blk_mq_init_sched(q, new_e);
 		if (ret)
 			goto out_unfreeze;
-		ret = elv_register_queue(q, true);
+		ret = elv_register_queue(q, !ctx->no_uevent);
 		if (ret) {
 			elevator_exit(q);
 			goto out_unfreeze;
@@ -656,7 +662,7 @@ static int elevator_switch(struct request_queue *q, const char *name)
 		q->elevator = NULL;
 		q->nr_requests = q->tag_set->queue_depth;
 	}
-	blk_add_trace_msg(q, "elv switch: %s", name);
+	blk_add_trace_msg(q, "elv switch: %s", ctx->name);
 
 out_unfreeze:
 	blk_mq_unquiesce_queue(q);
@@ -674,7 +680,7 @@ static int elevator_switch(struct request_queue *q, const char *name)
 /*
  * Switch this queue to the given IO scheduler.
  */
-static int elevator_change(struct request_queue *q, const char *elevator_name)
+static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
 {
 	unsigned int memflags;
 	int ret = 0;
@@ -683,9 +689,8 @@ static int elevator_change(struct request_queue *q, const char *elevator_name)
 
 	memflags = blk_mq_freeze_queue(q);
 	mutex_lock(&q->elevator_lock);
-	if (!(q->elevator && elevator_match(q->elevator->type,
-				elevator_name)))
-		ret = elevator_switch(q, elevator_name);
+	if (!(q->elevator && elevator_match(q->elevator->type, ctx->name)))
+		ret = elevator_switch(q, ctx);
 	mutex_unlock(&q->elevator_lock);
 	blk_mq_unfreeze_queue(q, memflags);
 	return ret;
@@ -697,19 +702,21 @@ static int elevator_change(struct request_queue *q, const char *elevator_name)
  */
 void elv_update_nr_hw_queues(struct request_queue *q)
 {
-	const char *name = "none";
+	struct elv_change_ctx ctx = {
+		.name	= "none",
+	};
 
 	WARN_ON_ONCE(q->mq_freeze_depth == 0);
 
 	mutex_lock(&q->elevator_lock);
 	if (q->elevator && !blk_queue_dying(q) && !blk_queue_registered(q))
-		name = q->elevator->type->elevator_name;
+		ctx.name = q->elevator->type->elevator_name;
 	/* force to reattach elevator after nr_hw_queue is updated */
-	elevator_switch(q, name);
+	elevator_switch(q, &ctx);
 	mutex_unlock(&q->elevator_lock);
 }
 
-static void elv_iosched_load_module(char *elevator_name)
+static void elv_iosched_load_module(const char *elevator_name)
 {
 	struct elevator_type *found;
 
@@ -725,7 +732,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 			  size_t count)
 {
 	char elevator_name[ELV_NAME_MAX];
-	char *name;
+	struct elv_change_ctx ctx = {};
 	int ret;
 	struct request_queue *q = disk->queue;
 	struct blk_mq_tag_set *set = q->tag_set;
@@ -740,12 +747,12 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 	 * queue is the one for the device storing the module file.
 	 */
 	strscpy(elevator_name, buf, sizeof(elevator_name));
-	name = strstrip(elevator_name);
+	ctx.name = strstrip(elevator_name);
 
-	elv_iosched_load_module(name);
+	elv_iosched_load_module(ctx.name);
 
 	down_read_nested(&set->update_nr_hwq_lock, 1);
-	ret = elevator_change(q, name);
+	ret = elevator_change(q, &ctx);
 	if (!ret)
 		ret = count;
 	up_read(&set->update_nr_hwq_lock);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 15/24] block: unifying elevator change
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (13 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 14/24] block: add `struct elv_change_ctx` for unifying elevator change Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30 14:17   ` Christoph Hellwig
  2025-05-01 10:28   ` Nilay Shroff
  2025-04-30  4:35 ` [PATCH V4 16/24] block: pass elevator_queue to elv_register_queue & unregister_queue Ming Lei
                   ` (9 subsequent siblings)
  24 siblings, 2 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Hannes Reinecke

Elevator change is one well-define behavior:

- tear down current elevator if it exists

- setup new elevator

It is supposed to cover any case for changing elevator by single
internal API, typically the following cases:

- setup default elevator in add_disk()

- switch to none in del_disk()

- reset elevator in blk_mq_update_nr_hw_queues()

- switch elevator in sysfs `store` elevator attribute

This patch uses elevator_change() to cover all above cases:

- every elevator switch is serialized with each other: add_disk/del_disk/
store elevator is serialized already, blk_mq_update_nr_hw_queues() uses
srcu for syncing with the other three cases

- for both add_disk()/del_disk(), queue freeze works at atomic mode
or has been froze, so the freeze in elevator_change() won't add extra
delay

- `struct elev_change_ctx` instance holds any info for changing elevator

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-sysfs.c |  19 +++-----
 block/blk.h       |   5 +-
 block/elevator.c  | 116 +++++++++++++++++++++-------------------------
 block/genhd.c     |  28 ++---------
 4 files changed, 67 insertions(+), 101 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 1f9b45b0b9ee..741e607dfab6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -869,14 +869,9 @@ int blk_register_queue(struct gendisk *disk)
 	if (ret)
 		goto out_unregister_ia_ranges;
 
+	if (queue_is_mq(q))
+		elevator_set_default(q);
 	mutex_lock(&q->elevator_lock);
-	if (q->elevator) {
-		ret = elv_register_queue(q, false);
-		if (ret) {
-			mutex_unlock(&q->elevator_lock);
-			goto out_crypto_sysfs_unregister;
-		}
-	}
 	wbt_enable_default(disk);
 	mutex_unlock(&q->elevator_lock);
 
@@ -902,8 +897,6 @@ int blk_register_queue(struct gendisk *disk)
 
 	return ret;
 
-out_crypto_sysfs_unregister:
-	blk_crypto_sysfs_unregister(disk);
 out_unregister_ia_ranges:
 	disk_unregister_independent_access_ranges(disk);
 out_debugfs_remove:
@@ -951,9 +944,11 @@ void blk_unregister_queue(struct gendisk *disk)
 		blk_mq_sysfs_unregister(disk);
 	blk_crypto_sysfs_unregister(disk);
 
-	mutex_lock(&q->elevator_lock);
-	elv_unregister_queue(q);
-	mutex_unlock(&q->elevator_lock);
+	if (queue_is_mq(q)) {
+		blk_mq_quiesce_queue(q);
+		elevator_set_none(q);
+		blk_mq_unquiesce_queue(q);
+	}
 
 	mutex_lock(&q->sysfs_lock);
 	disk_unregister_independent_access_ranges(disk);
diff --git a/block/blk.h b/block/blk.h
index 98fd3a6f5ec9..23b6ed27e5d1 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -323,9 +323,8 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
 bool blk_insert_flush(struct request *rq);
 
 void elv_update_nr_hw_queues(struct request_queue *q);
-void elevator_exit(struct request_queue *q);
-int elv_register_queue(struct request_queue *q, bool uevent);
-void elv_unregister_queue(struct request_queue *q);
+void elevator_set_default(struct request_queue *q);
+void elevator_set_none(struct request_queue *q);
 
 ssize_t part_size_show(struct device *dev, struct device_attribute *attr,
 		char *buf);
diff --git a/block/elevator.c b/block/elevator.c
index 5188bb0119f0..aaec9584ed72 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -154,7 +154,7 @@ static void elevator_release(struct kobject *kobj)
 	kfree(e);
 }
 
-void elevator_exit(struct request_queue *q)
+static void elevator_exit(struct request_queue *q)
 {
 	struct elevator_queue *e = q->elevator;
 
@@ -458,7 +458,7 @@ static const struct kobj_type elv_ktype = {
 	.release	= elevator_release,
 };
 
-int elv_register_queue(struct request_queue *q, bool uevent)
+static int elv_register_queue(struct request_queue *q, bool uevent)
 {
 	struct elevator_queue *e = q->elevator;
 	int error;
@@ -488,7 +488,7 @@ int elv_register_queue(struct request_queue *q, bool uevent)
 	return error;
 }
 
-void elv_unregister_queue(struct request_queue *q)
+static void elv_unregister_queue(struct request_queue *q)
 {
 	struct elevator_queue *e = q->elevator;
 
@@ -561,66 +561,6 @@ void elv_unregister(struct elevator_type *e)
 }
 EXPORT_SYMBOL_GPL(elv_unregister);
 
-/*
- * For single queue devices, default to using mq-deadline. If we have multiple
- * queues or mq-deadline is not available, default to "none".
- */
-static struct elevator_type *elevator_get_default(struct request_queue *q)
-{
-	if (q->tag_set->flags & BLK_MQ_F_NO_SCHED_BY_DEFAULT)
-		return NULL;
-
-	if (q->nr_hw_queues != 1 &&
-	    !blk_mq_is_shared_tags(q->tag_set->flags))
-		return NULL;
-
-	return elevator_find_get("mq-deadline");
-}
-
-/*
- * Use the default elevator settings. If the chosen elevator initialization
- * fails, fall back to the "none" elevator (no elevator).
- */
-void elevator_init_mq(struct request_queue *q)
-{
-	struct elevator_type *e;
-	unsigned int memflags;
-	int err;
-
-	WARN_ON_ONCE(blk_queue_registered(q));
-
-	if (unlikely(q->elevator))
-		return;
-
-	e = elevator_get_default(q);
-	if (!e)
-		return;
-
-	/*
-	 * We are called before adding disk, when there isn't any FS I/O,
-	 * so freezing queue plus canceling dispatch work is enough to
-	 * drain any dispatch activities originated from passthrough
-	 * requests, then no need to quiesce queue which may add long boot
-	 * latency, especially when lots of disks are involved.
-	 *
-	 * Disk isn't added yet, so verifying queue lock only manually.
-	 */
-	memflags = blk_mq_freeze_queue(q);
-
-	blk_mq_cancel_work_sync(q);
-
-	err = blk_mq_init_sched(q, e);
-
-	blk_mq_unfreeze_queue(q, memflags);
-
-	if (err) {
-		pr_warn("\"%s\" elevator initialization failed, "
-			"falling back to \"none\"\n", e->elevator_name);
-	}
-
-	elevator_put(e);
-}
-
 /*
  * Switch to new_e io scheduler.
  *
@@ -688,6 +628,16 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
 	lockdep_assert_held(&q->tag_set->update_nr_hwq_lock);
 
 	memflags = blk_mq_freeze_queue(q);
+	/*
+	 * May be called before adding disk, when there isn't any FS I/O,
+	 * so freezing queue plus canceling dispatch work is enough to
+	 * drain any dispatch activities originated from passthrough
+	 * requests, then no need to quiesce queue which may add long boot
+	 * latency, especially when lots of disks are involved.
+	 *
+	 * Disk isn't added yet, so verifying queue lock only manually.
+	 */
+	blk_mq_cancel_work_sync(q);
 	mutex_lock(&q->elevator_lock);
 	if (!(q->elevator && elevator_match(q->elevator->type, ctx->name)))
 		ret = elevator_switch(q, ctx);
@@ -716,6 +666,46 @@ void elv_update_nr_hw_queues(struct request_queue *q)
 	mutex_unlock(&q->elevator_lock);
 }
 
+/*
+ * Use the default elevator settings. If the chosen elevator initialization
+ * fails, fall back to the "none" elevator (no elevator).
+ */
+void elevator_set_default(struct request_queue *q)
+{
+	struct elv_change_ctx ctx = {
+		.name = "mq-deadline",
+		.no_uevent = true,
+	};
+	int err = 0;
+
+	if (q->tag_set->flags & BLK_MQ_F_NO_SCHED_BY_DEFAULT)
+		return;
+
+	/*
+	 * For single queue devices, default to using mq-deadline. If we
+	 * have multiple queues or mq-deadline is not available, default
+	 * to "none".
+	 */
+	if (elevator_find_get(ctx.name) && (q->nr_hw_queues == 1 ||
+			 blk_mq_is_shared_tags(q->tag_set->flags)))
+		err = elevator_change(q, &ctx);
+	if (err < 0)
+		pr_warn("\"%s\" elevator initialization, failed %d, "
+			"falling back to \"none\"\n", ctx.name, err);
+}
+
+void elevator_set_none(struct request_queue *q)
+{
+	struct elv_change_ctx ctx = {
+		.name	= "none",
+	};
+	int err;
+
+	err = elevator_change(q, &ctx);
+	if (err < 0)
+		pr_warn("%s: set none elevator failed %d\n", __func__, err);
+}
+
 static void elv_iosched_load_module(const char *elevator_name)
 {
 	struct elevator_type *found;
diff --git a/block/genhd.c b/block/genhd.c
index 340fb4555963..0c34cc1a4eae 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -406,12 +406,6 @@ static int __add_disk(struct device *parent, struct gendisk *disk,
 		 */
 		if (disk->fops->submit_bio || disk->fops->poll_bio)
 			return -EINVAL;
-
-		/*
-		 * Initialize the I/O scheduler code and pick a default one if
-		 * needed.
-		 */
-		elevator_init_mq(disk->queue);
 	} else {
 		if (!disk->fops->submit_bio)
 			return -EINVAL;
@@ -428,7 +422,7 @@ static int __add_disk(struct device *parent, struct gendisk *disk,
 	ret = -EINVAL;
 	if (disk->major) {
 		if (WARN_ON(!disk->minors))
-			goto out_exit_elevator;
+			goto out;
 
 		if (disk->minors > DISK_MAX_PARTS) {
 			pr_err("block: can't allocate more than %d partitions\n",
@@ -438,14 +432,14 @@ static int __add_disk(struct device *parent, struct gendisk *disk,
 		if (disk->first_minor > MINORMASK ||
 		    disk->minors > MINORMASK + 1 ||
 		    disk->first_minor + disk->minors > MINORMASK + 1)
-			goto out_exit_elevator;
+			goto out;
 	} else {
 		if (WARN_ON(disk->minors))
-			goto out_exit_elevator;
+			goto out;
 
 		ret = blk_alloc_ext_minor();
 		if (ret < 0)
-			goto out_exit_elevator;
+			goto out;
 		disk->major = BLOCK_EXT_MAJOR;
 		disk->first_minor = ret;
 	}
@@ -554,12 +548,7 @@ static int __add_disk(struct device *parent, struct gendisk *disk,
 out_free_ext_minor:
 	if (disk->major == BLOCK_EXT_MAJOR)
 		blk_free_ext_minor(disk->first_minor);
-out_exit_elevator:
-	if (disk->queue->elevator) {
-		mutex_lock(&disk->queue->elevator_lock);
-		elevator_exit(disk->queue);
-		mutex_unlock(&disk->queue->elevator_lock);
-	}
+out:
 	return ret;
 }
 
@@ -745,14 +734,7 @@ static void __del_gendisk(struct gendisk *disk)
 	if (queue_is_mq(q))
 		blk_mq_cancel_work_sync(q);
 
-	blk_mq_quiesce_queue(q);
-	if (q->elevator) {
-		mutex_lock(&q->elevator_lock);
-		elevator_exit(q);
-		mutex_unlock(&q->elevator_lock);
-	}
 	rq_qos_exit(q);
-	blk_mq_unquiesce_queue(q);
 
 	/*
 	 * If the disk does not own the queue, allow using passthrough requests
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 16/24] block: pass elevator_queue to elv_register_queue & unregister_queue
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (14 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 15/24] block: " Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30 14:18   ` Christoph Hellwig
  2025-04-30  4:35 ` [PATCH V4 17/24] block: remove elevator queue's type check in elv_attr_show/store() Ming Lei
                   ` (8 subsequent siblings)
  24 siblings, 1 reply; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Hannes Reinecke

Pass elevator_queue reference to elv_register_queue() & elv_unregister_queue().

No functional change, and prepare for moving the two out of elevator
lock & freezing queue, when we need to store the old & new elevator
queue in `struct elv_change_ctx` instance, then both two can co-exist
for short while, so we have to pass the exact elevator_queue instance
to elv_register_queue & unregister_queue.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index aaec9584ed72..2b16394972c0 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -458,9 +458,10 @@ static const struct kobj_type elv_ktype = {
 	.release	= elevator_release,
 };
 
-static int elv_register_queue(struct request_queue *q, bool uevent)
+static int elv_register_queue(struct request_queue *q,
+			      struct elevator_queue *e,
+			      bool uevent)
 {
-	struct elevator_queue *e = q->elevator;
 	int error;
 
 	lockdep_assert_held(&q->elevator_lock);
@@ -488,10 +489,9 @@ static int elv_register_queue(struct request_queue *q, bool uevent)
 	return error;
 }
 
-static void elv_unregister_queue(struct request_queue *q)
+static void elv_unregister_queue(struct request_queue *q,
+				 struct elevator_queue *e)
 {
-	struct elevator_queue *e = q->elevator;
-
 	lockdep_assert_held(&q->elevator_lock);
 
 	if (e && test_and_clear_bit(ELEVATOR_FLAG_REGISTERED, &e->flags)) {
@@ -584,7 +584,7 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
 	blk_mq_quiesce_queue(q);
 
 	if (q->elevator) {
-		elv_unregister_queue(q);
+		elv_unregister_queue(q, q->elevator);
 		elevator_exit(q);
 	}
 
@@ -592,7 +592,7 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
 		ret = blk_mq_init_sched(q, new_e);
 		if (ret)
 			goto out_unfreeze;
-		ret = elv_register_queue(q, !ctx->no_uevent);
+		ret = elv_register_queue(q, q->elevator, !ctx->no_uevent);
 		if (ret) {
 			elevator_exit(q);
 			goto out_unfreeze;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 17/24] block: remove elevator queue's type check in elv_attr_show/store()
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (15 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 16/24] block: pass elevator_queue to elv_register_queue & unregister_queue Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30  6:16   ` Hannes Reinecke
                     ` (2 more replies)
  2025-04-30  4:35 ` [PATCH V4 18/24] block: fail to show/store elevator sysfs attribute if elevator is dying Ming Lei
                   ` (7 subsequent siblings)
  24 siblings, 3 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

elevatore queue's type is assigned since its allocation, and never
get cleared until it is released.

So its ->type is always not NULL, remove the unnecessary check.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/elevator.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index 2b16394972c0..86e448bc12ae 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -425,7 +425,7 @@ elv_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
 
 	e = container_of(kobj, struct elevator_queue, kobj);
 	mutex_lock(&e->sysfs_lock);
-	error = e->type ? entry->show(e, page) : -ENOENT;
+	error = entry->show(e, page);
 	mutex_unlock(&e->sysfs_lock);
 	return error;
 }
@@ -443,7 +443,7 @@ elv_attr_store(struct kobject *kobj, struct attribute *attr,
 
 	e = container_of(kobj, struct elevator_queue, kobj);
 	mutex_lock(&e->sysfs_lock);
-	error = e->type ? entry->store(e, page, length) : -ENOENT;
+	error = entry->store(e, page, length);
 	mutex_unlock(&e->sysfs_lock);
 	return error;
 }
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 18/24] block: fail to show/store elevator sysfs attribute if elevator is dying
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (16 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 17/24] block: remove elevator queue's type check in elv_attr_show/store() Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30  6:31   ` Hannes Reinecke
  2025-04-30 14:19   ` Christoph Hellwig
  2025-04-30  4:35 ` [PATCH V4 19/24] block: add new helper for disabling elevator switch when deleting disk Ming Lei
                   ` (6 subsequent siblings)
  24 siblings, 2 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

Prepare for moving elv_register[unregister]_queue out of elevator_lock
& queue freezing, so we may have to call elv_unregister_queue() after
elevator ->exit() is called, then there is small window for user to
call into ->show()/store(), and user-after-free can be caused.

Fail to show/store elevator sysfs attribute if elevator is dying by
adding one new flag of ELEVATOR_FLAG_DYNG, which is protected by
elevator ->sysfs_lock.

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c |  1 +
 block/elevator.c     | 10 ++++++----
 block/elevator.h     |  1 +
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 336a15ffecfa..55a0fd105147 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -551,5 +551,6 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
 	if (e->type->ops.exit_sched)
 		e->type->ops.exit_sched(e);
 	blk_mq_sched_tags_teardown(q, flags);
+	set_bit(ELEVATOR_FLAG_DYING, &q->elevator->flags);
 	q->elevator = NULL;
 }
diff --git a/block/elevator.c b/block/elevator.c
index 86e448bc12ae..fd0bcf22aaee 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -418,14 +418,15 @@ elv_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
 {
 	const struct elv_fs_entry *entry = to_elv(attr);
 	struct elevator_queue *e;
-	ssize_t error;
+	ssize_t error = -ENODEV;
 
 	if (!entry->show)
 		return -EIO;
 
 	e = container_of(kobj, struct elevator_queue, kobj);
 	mutex_lock(&e->sysfs_lock);
-	error = entry->show(e, page);
+	if (!test_bit(ELEVATOR_FLAG_DYING, &e->flags))
+		error = entry->show(e, page);
 	mutex_unlock(&e->sysfs_lock);
 	return error;
 }
@@ -436,14 +437,15 @@ elv_attr_store(struct kobject *kobj, struct attribute *attr,
 {
 	const struct elv_fs_entry *entry = to_elv(attr);
 	struct elevator_queue *e;
-	ssize_t error;
+	ssize_t error = -ENODEV;
 
 	if (!entry->store)
 		return -EIO;
 
 	e = container_of(kobj, struct elevator_queue, kobj);
 	mutex_lock(&e->sysfs_lock);
-	error = entry->store(e, page, length);
+	if (!test_bit(ELEVATOR_FLAG_DYING, &e->flags))
+		error = entry->store(e, page, length);
 	mutex_unlock(&e->sysfs_lock);
 	return error;
 }
diff --git a/block/elevator.h b/block/elevator.h
index 9198676644a9..76a90a1b7ed6 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -121,6 +121,7 @@ struct elevator_queue
 };
 
 #define ELEVATOR_FLAG_REGISTERED	0
+#define ELEVATOR_FLAG_DYING		1
 
 /*
  * block elevator interface
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 19/24] block: add new helper for disabling elevator switch when deleting disk
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (17 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 18/24] block: fail to show/store elevator sysfs attribute if elevator is dying Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30  6:32   ` Hannes Reinecke
  2025-04-30 14:20   ` Christoph Hellwig
  2025-04-30  4:35 ` [PATCH V4 20/24] block: move elv_register[unregister]_queue out of elevator_lock Ming Lei
                   ` (5 subsequent siblings)
  24 siblings, 2 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

Add new helper disable_elv_switch() and new flag QUEUE_FLAG_NO_ELV_SWITCH
for disabling elevator switch before deleting disk:

- originally flag QUEUE_FLAG_REGISTERED is added for preventing elevator
switch during removing disk, but this flag has been used widely for
other purposes, so add one new flag for disabling elevator switch only

- for avoiding deadlock risk, we have to move elevator queue
register/unregister out of elevator lock and queue freeze, which will be
done in next patch. However, this way adds small race window between elevator
switch and deleting ->queue_kobj, in which elevator queue register/unregister
could be run concurrently. The added helper will be used for avoiding the race
in the following patch.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c |  1 +
 block/elevator.c       |  5 ++++-
 block/genhd.c          | 12 ++++++++++++
 include/linux/blkdev.h |  3 +++
 4 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 2837a8ce8054..29b3540dd180 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -94,6 +94,7 @@ static const char *const blk_queue_flag_name[] = {
 	QUEUE_FLAG_NAME(HCTX_ACTIVE),
 	QUEUE_FLAG_NAME(SQ_SCHED),
 	QUEUE_FLAG_NAME(DISABLE_WBT_DEF),
+	QUEUE_FLAG_NAME(NO_ELV_SWITCH),
 };
 #undef QUEUE_FLAG_NAME
 
diff --git a/block/elevator.c b/block/elevator.c
index fd0bcf22aaee..98a754f58de5 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -680,6 +680,9 @@ void elevator_set_default(struct request_queue *q)
 	};
 	int err = 0;
 
+	/* now we allow to switch elevator */
+	blk_queue_flag_clear(QUEUE_FLAG_NO_ELV_SWITCH, q);
+
 	if (q->tag_set->flags & BLK_MQ_F_NO_SCHED_BY_DEFAULT)
 		return;
 
@@ -730,7 +733,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 	struct blk_mq_tag_set *set = q->tag_set;
 
 	/* Make sure queue is not in the middle of being removed */
-	if (!blk_queue_registered(q))
+	if (!blk_queue_registered(q) || blk_queue_no_elv_switch(q))
 		return -ENOENT;
 
 	/*
diff --git a/block/genhd.c b/block/genhd.c
index 0c34cc1a4eae..0e64e7400fb4 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -749,6 +749,15 @@ static void __del_gendisk(struct gendisk *disk)
 		blk_unfreeze_release_lock(q);
 }
 
+static void disable_elv_switch(struct request_queue *q)
+{
+	WARN_ON_ONCE(!queue_is_mq(q));
+
+	mutex_lock(&q->elevator_lock);
+	blk_queue_flag_set(QUEUE_FLAG_NO_ELV_SWITCH, q);
+	mutex_unlock(&q->elevator_lock);
+}
+
 /**
  * del_gendisk - remove the gendisk
  * @disk: the struct gendisk to remove
@@ -777,6 +786,9 @@ void del_gendisk(struct gendisk *disk)
 		__del_gendisk(disk);
 	} else {
 		set = disk->queue->tag_set;
+
+		disable_elv_switch(disk->queue);
+
 		memflags = memalloc_noio_save();
 		down_read(&set->update_nr_hwq_lock);
 		__del_gendisk(disk);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9c373cf0eb47..b15c53fabe9f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -645,6 +645,7 @@ enum {
 	QUEUE_FLAG_HCTX_ACTIVE,		/* at least one blk-mq hctx is active */
 	QUEUE_FLAG_SQ_SCHED,		/* single queue style io dispatch */
 	QUEUE_FLAG_DISABLE_WBT_DEF,	/* for sched to disable/enable wbt */
+	QUEUE_FLAG_NO_ELV_SWITCH,	/* can't switch elevator any more */
 	QUEUE_FLAG_MAX
 };
 
@@ -682,6 +683,8 @@ void blk_queue_flag_clear(unsigned int flag, struct request_queue *q);
 	((q)->limits.features & BLK_FEAT_SKIP_TAGSET_QUIESCE)
 #define blk_queue_disable_wbt(q)	\
 	test_bit(QUEUE_FLAG_DISABLE_WBT_DEF, &(q)->queue_flags)
+#define blk_queue_no_elv_switch(q)	\
+	test_bit(QUEUE_FLAG_NO_ELV_SWITCH, &(q)->queue_flags)
 
 extern void blk_set_pm_only(struct request_queue *q);
 extern void blk_clear_pm_only(struct request_queue *q);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 20/24] block: move elv_register[unregister]_queue out of elevator_lock
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (18 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 19/24] block: add new helper for disabling elevator switch when deleting disk Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-05-01 10:26   ` Nilay Shroff
  2025-04-30  4:35 ` [PATCH V4 21/24] block: move hctx debugfs/sysfs registering out of freezing queue Ming Lei
                   ` (4 subsequent siblings)
  24 siblings, 1 reply; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Hannes Reinecke

Move elv_register[unregister]_queue out of ->elevator_lock & queue freezing,
so we can kill many lockdep warnings.

elv_register[unregister]_queue() is serialized, and just dealing with sysfs/
debugfs things, no need to be done with queue frozen:

- when it is called from adding disk, elevator switch isn't possible
  because ->queue_kobj isn't added yet

- when it is called from deleting disk, disable_elv_switch() is
  responsible for preventing new elevator switch and draining old
  elevator switch.

- when it is called from blk_mq_update_nr_hw_queues(), adding/removing
  disk and elevator switch can't be allowed or in-progress

With this change, elevator's ->exit() is called before calling
elv_unregister_queue, then user may call into ->show()/store() of elevator's
sysfs attributes, and we have covered this issue by adding `ELEVATOR_FLAG_DYNG`.

For blk-mq debugfs, hctx->sched_tags is always checked with ->elevator_lock by
debugfs code, meantime hctx->sched_tags is updated with ->elevator_lock, so
there isn't such issue.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c   |  3 +--
 block/elevator.c | 63 ++++++++++++++++++++++++++++++++++++++----------
 block/genhd.c    |  6 +++++
 3 files changed, 57 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b920eaedbaa9..a4bcfce4c4b9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4997,11 +4997,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		blk_mq_debugfs_register_hctxs(q);
 	}
 
+	/* elv_update_nr_hw_queues() unfreeze queue for us */
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
 		elv_update_nr_hw_queues(q);
 
-	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		blk_mq_unfreeze_queue_nomemrestore(q);
 	memalloc_noio_restore(memflags);
 
 	/* Free the excess tags when nr_hw_queues shrink. */
diff --git a/block/elevator.c b/block/elevator.c
index 98a754f58de5..492a593160ae 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -49,6 +49,11 @@
 struct elv_change_ctx {
 	const char *name;
 	bool no_uevent;
+
+	/* for unregistering old elevator */
+	struct elevator_queue *old;
+	/* for registering new elevator */
+	struct elevator_queue *new;
 };
 
 static DEFINE_SPINLOCK(elv_list_lock);
@@ -158,14 +163,14 @@ static void elevator_exit(struct request_queue *q)
 {
 	struct elevator_queue *e = q->elevator;
 
+	lockdep_assert_held(&q->elevator_lock);
+
 	ioc_clear_queue(q);
 	blk_mq_sched_free_rqs(q);
 
 	mutex_lock(&e->sysfs_lock);
 	blk_mq_exit_sched(q, e);
 	mutex_unlock(&e->sysfs_lock);
-
-	kobject_put(&e->kobj);
 }
 
 static inline void __elv_rqhash_del(struct request *rq)
@@ -466,8 +471,6 @@ static int elv_register_queue(struct request_queue *q,
 {
 	int error;
 
-	lockdep_assert_held(&q->elevator_lock);
-
 	error = kobject_add(&e->kobj, &q->disk->queue_kobj, "iosched");
 	if (!error) {
 		const struct elv_fs_entry *attr = e->type->elevator_attrs;
@@ -494,8 +497,6 @@ static int elv_register_queue(struct request_queue *q,
 static void elv_unregister_queue(struct request_queue *q,
 				 struct elevator_queue *e)
 {
-	lockdep_assert_held(&q->elevator_lock);
-
 	if (e && test_and_clear_bit(ELEVATOR_FLAG_REGISTERED, &e->flags)) {
 		kobject_uevent(&e->kobj, KOBJ_REMOVE);
 		kobject_del(&e->kobj);
@@ -586,7 +587,7 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
 	blk_mq_quiesce_queue(q);
 
 	if (q->elevator) {
-		elv_unregister_queue(q, q->elevator);
+		ctx->old = q->elevator;
 		elevator_exit(q);
 	}
 
@@ -594,11 +595,7 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
 		ret = blk_mq_init_sched(q, new_e);
 		if (ret)
 			goto out_unfreeze;
-		ret = elv_register_queue(q, q->elevator, !ctx->no_uevent);
-		if (ret) {
-			elevator_exit(q);
-			goto out_unfreeze;
-		}
+		ctx->new = q->elevator;
 	} else {
 		blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q);
 		q->elevator = NULL;
@@ -619,6 +616,38 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
 	return ret;
 }
 
+static void elv_exit_and_release(struct request_queue *q)
+{
+	struct elevator_queue *e;
+	unsigned memflags;
+
+	memflags = blk_mq_freeze_queue(q);
+	mutex_lock(&q->elevator_lock);
+	e = q->elevator;
+	elevator_exit(q);
+	mutex_unlock(&q->elevator_lock);
+	blk_mq_unfreeze_queue(q, memflags);
+	if (e)
+		kobject_put(&e->kobj);
+}
+
+static int elevator_change_done(struct request_queue *q,
+				struct elv_change_ctx *ctx)
+{
+	int ret = 0;
+
+	if (ctx->old) {
+		elv_unregister_queue(q, ctx->old);
+		kobject_put(&ctx->old->kobj);
+	}
+	if (ctx->new) {
+		ret = elv_register_queue(q, ctx->new, !ctx->no_uevent);
+		if (ret)
+			elv_exit_and_release(q);
+	}
+	return ret;
+}
+
 /*
  * Switch this queue to the given IO scheduler.
  */
@@ -645,6 +674,9 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
 		ret = elevator_switch(q, ctx);
 	mutex_unlock(&q->elevator_lock);
 	blk_mq_unfreeze_queue(q, memflags);
+	if (!ret)
+		ret = elevator_change_done(q, ctx);
+
 	return ret;
 }
 
@@ -657,6 +689,7 @@ void elv_update_nr_hw_queues(struct request_queue *q)
 	struct elv_change_ctx ctx = {
 		.name	= "none",
 	};
+	int ret = -ENODEV;
 
 	WARN_ON_ONCE(q->mq_freeze_depth == 0);
 
@@ -664,8 +697,12 @@ void elv_update_nr_hw_queues(struct request_queue *q)
 	if (q->elevator && !blk_queue_dying(q) && !blk_queue_registered(q))
 		ctx.name = q->elevator->type->elevator_name;
 	/* force to reattach elevator after nr_hw_queue is updated */
-	elevator_switch(q, &ctx);
+	ret = elevator_switch(q, &ctx);
 	mutex_unlock(&q->elevator_lock);
+	blk_mq_unfreeze_queue_nomemrestore(q);
+
+	if (!ret)
+		WARN_ON_ONCE(elevator_change_done(q, &ctx));
 }
 
 /*
diff --git a/block/genhd.c b/block/genhd.c
index 0e64e7400fb4..59d9febd8c14 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -751,11 +751,17 @@ static void __del_gendisk(struct gendisk *disk)
 
 static void disable_elv_switch(struct request_queue *q)
 {
+	struct blk_mq_tag_set *set = q->tag_set;
+
 	WARN_ON_ONCE(!queue_is_mq(q));
 
 	mutex_lock(&q->elevator_lock);
 	blk_queue_flag_set(QUEUE_FLAG_NO_ELV_SWITCH, q);
 	mutex_unlock(&q->elevator_lock);
+
+	/* wait until in-progress elevator switch is done */
+	down_write(&set->update_nr_hwq_lock);
+	up_write(&set->update_nr_hwq_lock);
 }
 
 /**
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 21/24] block: move hctx debugfs/sysfs registering out of freezing queue
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (19 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 20/24] block: move elv_register[unregister]_queue out of elevator_lock Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30  4:35 ` [PATCH V4 22/24] block: don't acquire ->elevator_lock in blk_mq_map_swqueue and blk_mq_realloc_hw_ctxs Ming Lei
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Hannes Reinecke

Move hctx debugfs/sysfs register out of freezing queue in
__blk_mq_update_nr_hw_queues(), so that the following lockdep dependency
can be killed:

	#2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
	#1 (fs_reclaim){+.+.}-{0:0}:
	#0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs

And registering/un-registering hctx debugfs/sysfs does not require queue to
be frozen:

- hctx sysfs attributes show() are drained when removing kobject, and
  there isn't store() implementation for hctx sysfs attributes

- debugfs entry read() is drained too when removing debugfs directory,
  and there isn't write() implementation for hctx debugfs too

- so it is safe to register/unregister hctx sysfs/debugfs without
  freezing queue because the cod paths changes nothing, and we just
  need to keep hctx live

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a4bcfce4c4b9..33bffacc33db 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4961,14 +4961,14 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		return;
 
 	memflags = memalloc_noio_save();
-	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		blk_mq_freeze_queue_nomemsave(q);
-
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_debugfs_unregister_hctxs(q);
 		blk_mq_sysfs_unregister_hctxs(q);
 	}
 
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		blk_mq_freeze_queue_nomemsave(q);
+
 	if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
 		goto reregister;
 
@@ -4991,16 +4991,15 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		blk_mq_map_swqueue(q);
 	}
 
+	/* elv_update_nr_hw_queues() unfreeze queue for us */
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		elv_update_nr_hw_queues(q);
+
 reregister:
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_sysfs_register_hctxs(q);
 		blk_mq_debugfs_register_hctxs(q);
 	}
-
-	/* elv_update_nr_hw_queues() unfreeze queue for us */
-	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		elv_update_nr_hw_queues(q);
-
 	memalloc_noio_restore(memflags);
 
 	/* Free the excess tags when nr_hw_queues shrink. */
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 22/24] block: don't acquire ->elevator_lock in blk_mq_map_swqueue and blk_mq_realloc_hw_ctxs
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (20 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 21/24] block: move hctx debugfs/sysfs registering out of freezing queue Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30  4:35 ` [PATCH V4 23/24] block: move hctx cpuhp add/del out of queue freezing Ming Lei
                   ` (2 subsequent siblings)
  24 siblings, 0 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Hannes Reinecke

Both blk_mq_map_swqueue() and blk_mq_realloc_hw_ctxs() are called before
the request queue is added to tagset list, so the two won't run concurrently
with blk_mq_update_nr_hw_queues().

When the two functions are only called from queue initialization or
blk_mq_update_nr_hw_queues(), elevator switch can't happen.

So remove ->elevator_lock uses from the two functions.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 19 ++++---------------
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 33bffacc33db..4e6bc065c955 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4112,8 +4112,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	struct blk_mq_ctx *ctx;
 	struct blk_mq_tag_set *set = q->tag_set;
 
-	mutex_lock(&q->elevator_lock);
-
 	queue_for_each_hw_ctx(q, hctx, i) {
 		cpumask_clear(hctx->cpumask);
 		hctx->nr_ctx = 0;
@@ -4218,8 +4216,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 		hctx->next_cpu = blk_mq_first_mapped_cpu(hctx);
 		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
 	}
-
-	mutex_unlock(&q->elevator_lock);
 }
 
 /*
@@ -4523,16 +4519,9 @@ static void __blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 }
 
 static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
-				   struct request_queue *q, bool lock)
+				   struct request_queue *q)
 {
-	if (lock) {
-		/* protect against switching io scheduler  */
-		mutex_lock(&q->elevator_lock);
-		__blk_mq_realloc_hw_ctxs(set, q);
-		mutex_unlock(&q->elevator_lock);
-	} else {
-		__blk_mq_realloc_hw_ctxs(set, q);
-	}
+	__blk_mq_realloc_hw_ctxs(set, q);
 
 	/* unregister cpuhp callbacks for exited hctxs */
 	blk_mq_remove_hw_queues_cpuhp(q);
@@ -4564,7 +4553,7 @@ int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 
 	xa_init(&q->hctx_table);
 
-	blk_mq_realloc_hw_ctxs(set, q, false);
+	blk_mq_realloc_hw_ctxs(set, q);
 	if (!q->nr_hw_queues)
 		goto err_hctxs;
 
@@ -4975,7 +4964,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 fallback:
 	blk_mq_update_queue_map(set);
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
-		blk_mq_realloc_hw_ctxs(set, q, true);
+		blk_mq_realloc_hw_ctxs(set, q);
 
 		if (q->nr_hw_queues != set->nr_hw_queues) {
 			int i = prev_nr_hw_queues;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 23/24] block: move hctx cpuhp add/del out of queue freezing
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (21 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 22/24] block: don't acquire ->elevator_lock in blk_mq_map_swqueue and blk_mq_realloc_hw_ctxs Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30  4:35 ` [PATCH V4 24/24] block: move wbt_enable_default() out of queue freezing from sched ->exit() Ming Lei
  2025-04-30 14:08 ` [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Christoph Hellwig
  24 siblings, 0 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei, Hannes Reinecke

Move hctx cpuhp add/del out of queue freezing for not connecting freeze
lock with cpuhp locks, then lockdep warning can be avoided.

This way is safe because both needn't queue to be frozen and scheduler
switch isn't allowed, with same reason for moving hctx debugfs/sysfs
register out of queue freeze.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4e6bc065c955..682d75f0a0a4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4964,7 +4964,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 fallback:
 	blk_mq_update_queue_map(set);
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
-		blk_mq_realloc_hw_ctxs(set, q);
+		__blk_mq_realloc_hw_ctxs(set, q);
 
 		if (q->nr_hw_queues != set->nr_hw_queues) {
 			int i = prev_nr_hw_queues;
@@ -4988,6 +4988,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_sysfs_register_hctxs(q);
 		blk_mq_debugfs_register_hctxs(q);
+
+		blk_mq_remove_hw_queues_cpuhp(q);
+		blk_mq_add_hw_queues_cpuhp(q);
 	}
 	memalloc_noio_restore(memflags);
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH V4 24/24] block: move wbt_enable_default() out of queue freezing from sched ->exit()
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (22 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 23/24] block: move hctx cpuhp add/del out of queue freezing Ming Lei
@ 2025-04-30  4:35 ` Ming Lei
  2025-04-30  6:33   ` Hannes Reinecke
  2025-04-30 14:08 ` [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Christoph Hellwig
  24 siblings, 1 reply; 51+ messages in thread
From: Ming Lei @ 2025-04-30  4:35 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig, Ming Lei

scheduler's ->exit() is called with queue frozen and elevator lock is held, and
wbt_enable_default() can't be called with queue frozen, otherwise the
following lockdep warning is triggered:

	#6 (&q->rq_qos_mutex){+.+.}-{4:4}:
	#5 (&eq->sysfs_lock){+.+.}-{4:4}:
	#4 (&q->elevator_lock){+.+.}-{4:4}:
	#3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
	#2 (fs_reclaim){+.+.}-{0:0}:
	#1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
	#0 (&q->debugfs_mutex){+.+.}-{4:4}:

Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
call it from elevator_change_done().

Meantime add disk->rqos_state_mutex for covering wbt state change, which
matches the purpose more than ->elevator_lock.

Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/bfq-iosched.c    |  2 +-
 block/blk-sysfs.c      | 10 ++++------
 block/blk-wbt.c        |  6 ++++++
 block/elevator.c       |  5 +++++
 block/elevator.h       |  1 +
 block/genhd.c          |  1 +
 include/linux/blkdev.h |  2 ++
 7 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index cc6f59836dcd..0cb1e9873aab 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7211,7 +7211,7 @@ static void bfq_exit_queue(struct elevator_queue *e)
 
 	blk_stat_disable_accounting(bfqd->queue);
 	blk_queue_flag_clear(QUEUE_FLAG_DISABLE_WBT_DEF, bfqd->queue);
-	wbt_enable_default(bfqd->queue->disk);
+	set_bit(ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT, &e->flags);
 
 	kfree(bfqd);
 }
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 741e607dfab6..01e0ead13278 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -560,7 +560,7 @@ static ssize_t queue_wb_lat_show(struct gendisk *disk, char *page)
 	ssize_t ret;
 	struct request_queue *q = disk->queue;
 
-	mutex_lock(&q->elevator_lock);
+	mutex_lock(&disk->rqos_state_mutex);
 	if (!wbt_rq_qos(q)) {
 		ret = -EINVAL;
 		goto out;
@@ -573,7 +573,7 @@ static ssize_t queue_wb_lat_show(struct gendisk *disk, char *page)
 
 	ret = sysfs_emit(page, "%llu\n", div_u64(wbt_get_min_lat(q), 1000));
 out:
-	mutex_unlock(&q->elevator_lock);
+	mutex_unlock(&disk->rqos_state_mutex);
 	return ret;
 }
 
@@ -593,7 +593,6 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
 		return -EINVAL;
 
 	memflags = blk_mq_freeze_queue(q);
-	mutex_lock(&q->elevator_lock);
 
 	rqos = wbt_rq_qos(q);
 	if (!rqos) {
@@ -618,11 +617,12 @@ static ssize_t queue_wb_lat_store(struct gendisk *disk, const char *page,
 	 */
 	blk_mq_quiesce_queue(q);
 
+	mutex_lock(&disk->rqos_state_mutex);
 	wbt_set_min_lat(q, val);
+	mutex_unlock(&disk->rqos_state_mutex);
 
 	blk_mq_unquiesce_queue(q);
 out:
-	mutex_unlock(&q->elevator_lock);
 	blk_mq_unfreeze_queue(q, memflags);
 
 	return ret;
@@ -871,9 +871,7 @@ int blk_register_queue(struct gendisk *disk)
 
 	if (queue_is_mq(q))
 		elevator_set_default(q);
-	mutex_lock(&q->elevator_lock);
 	wbt_enable_default(disk);
-	mutex_unlock(&q->elevator_lock);
 
 	blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q);
 
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index 29cd2e33666f..74ae7131ada9 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -704,6 +704,8 @@ void wbt_enable_default(struct gendisk *disk)
 	struct rq_qos *rqos;
 	bool enable = IS_ENABLED(CONFIG_BLK_WBT_MQ);
 
+	mutex_lock(&disk->rqos_state_mutex);
+
 	if (blk_queue_disable_wbt(q))
 		enable = false;
 
@@ -712,8 +714,10 @@ void wbt_enable_default(struct gendisk *disk)
 	if (rqos) {
 		if (enable && RQWB(rqos)->enable_state == WBT_STATE_OFF_DEFAULT)
 			RQWB(rqos)->enable_state = WBT_STATE_ON_DEFAULT;
+		mutex_unlock(&disk->rqos_state_mutex);
 		return;
 	}
+	mutex_unlock(&disk->rqos_state_mutex);
 
 	/* Queue not registered? Maybe shutting down... */
 	if (!blk_queue_registered(q))
@@ -773,11 +777,13 @@ void wbt_disable_default(struct gendisk *disk)
 	struct rq_wb *rwb;
 	if (!rqos)
 		return;
+	mutex_lock(&disk->rqos_state_mutex);
 	rwb = RQWB(rqos);
 	if (rwb->enable_state == WBT_STATE_ON_DEFAULT) {
 		blk_stat_deactivate(rwb->cb);
 		rwb->enable_state = WBT_STATE_OFF_DEFAULT;
 	}
+	mutex_unlock(&disk->rqos_state_mutex);
 }
 EXPORT_SYMBOL_GPL(wbt_disable_default);
 
diff --git a/block/elevator.c b/block/elevator.c
index 492a593160ae..936d73cda6ed 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -637,8 +637,13 @@ static int elevator_change_done(struct request_queue *q,
 	int ret = 0;
 
 	if (ctx->old) {
+		bool enable_wbt = test_bit(ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT,
+				&ctx->old->flags);
+
 		elv_unregister_queue(q, ctx->old);
 		kobject_put(&ctx->old->kobj);
+		if (enable_wbt)
+			wbt_enable_default(q->disk);
 	}
 	if (ctx->new) {
 		ret = elv_register_queue(q, ctx->new, !ctx->no_uevent);
diff --git a/block/elevator.h b/block/elevator.h
index 76a90a1b7ed6..a07ce773a38f 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -122,6 +122,7 @@ struct elevator_queue
 
 #define ELEVATOR_FLAG_REGISTERED	0
 #define ELEVATOR_FLAG_DYING		1
+#define ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT	2
 
 /*
  * block elevator interface
diff --git a/block/genhd.c b/block/genhd.c
index 59d9febd8c14..21b4ea191d24 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1460,6 +1460,7 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
 #ifdef CONFIG_BLOCK_HOLDER_DEPRECATED
 	INIT_LIST_HEAD(&disk->slave_bdevs);
 #endif
+	mutex_init(&disk->rqos_state_mutex);
 	return disk;
 
 out_erase_part0:
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b15c53fabe9f..c19ae1877061 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -218,6 +218,8 @@ struct gendisk {
 	 * devices that do not have multiple independent access ranges.
 	 */
 	struct blk_independent_access_ranges *ia_ranges;
+
+	struct mutex rqos_state_mutex;	/* rqos state change mutex */
 };
 
 /**
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 09/24] block: look up the elevator type in elevator_switch
  2025-04-30  4:35 ` [PATCH V4 09/24] block: look up the elevator type in elevator_switch Ming Lei
@ 2025-04-30  6:14   ` Hannes Reinecke
  2025-05-01  6:47   ` Nilay Shroff
  1 sibling, 0 replies; 51+ messages in thread
From: Hannes Reinecke @ 2025-04-30  6:14 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/30/25 06:35, Ming Lei wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> That makes the function nicely self-contained and can be used
> to avoid code duplication.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c   |  2 +-
>   block/blk.h      |  2 +-
>   block/elevator.c | 18 ++++++++----------
>   3 files changed, 10 insertions(+), 12 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 10/24] block: fold elevator_disable into elevator_switch
  2025-04-30  4:35 ` [PATCH V4 10/24] block: fold elevator_disable into elevator_switch Ming Lei
@ 2025-04-30  6:14   ` Hannes Reinecke
  2025-05-01  6:59   ` Nilay Shroff
  1 sibling, 0 replies; 51+ messages in thread
From: Hannes Reinecke @ 2025-04-30  6:14 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/30/25 06:35, Ming Lei wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> This removes duplicate code, and keeps the callers tidy.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/elevator.c | 61 ++++++++++++++++++------------------------------
>   1 file changed, 23 insertions(+), 38 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 11/24] block: move blk_queue_registered() check into elv_iosched_store()
  2025-04-30  4:35 ` [PATCH V4 11/24] block: move blk_queue_registered() check into elv_iosched_store() Ming Lei
@ 2025-04-30  6:15   ` Hannes Reinecke
  2025-05-01  9:55   ` Nilay Shroff
  1 sibling, 0 replies; 51+ messages in thread
From: Hannes Reinecke @ 2025-04-30  6:15 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/30/25 06:35, Ming Lei wrote:
> Move blk_queue_registered() check into elv_iosched_store() and prepare
> for using elevator_change() for covering any kind of elevator change in
> adding/deleting disk and updating nr_hw_queue.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/elevator.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 17/24] block: remove elevator queue's type check in elv_attr_show/store()
  2025-04-30  4:35 ` [PATCH V4 17/24] block: remove elevator queue's type check in elv_attr_show/store() Ming Lei
@ 2025-04-30  6:16   ` Hannes Reinecke
  2025-04-30 14:18   ` Christoph Hellwig
  2025-05-01 10:31   ` Nilay Shroff
  2 siblings, 0 replies; 51+ messages in thread
From: Hannes Reinecke @ 2025-04-30  6:16 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/30/25 06:35, Ming Lei wrote:
> elevatore queue's type is assigned since its allocation, and never
> get cleared until it is released.
> 
> So its ->type is always not NULL, remove the unnecessary check.
> 
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/elevator.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> Reviewed-by: Hannes Reinecke <hare@suse.de.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 18/24] block: fail to show/store elevator sysfs attribute if elevator is dying
  2025-04-30  4:35 ` [PATCH V4 18/24] block: fail to show/store elevator sysfs attribute if elevator is dying Ming Lei
@ 2025-04-30  6:31   ` Hannes Reinecke
  2025-04-30 13:30     ` Ming Lei
  2025-04-30 14:19   ` Christoph Hellwig
  1 sibling, 1 reply; 51+ messages in thread
From: Hannes Reinecke @ 2025-04-30  6:31 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/30/25 06:35, Ming Lei wrote:
> Prepare for moving elv_register[unregister]_queue out of elevator_lock
> & queue freezing, so we may have to call elv_unregister_queue() after
> elevator ->exit() is called, then there is small window for user to
> call into ->show()/store(), and user-after-free can be caused.
> 
> Fail to show/store elevator sysfs attribute if elevator is dying by
> adding one new flag of ELEVATOR_FLAG_DYNG, which is protected by
> elevator ->sysfs_lock.
> 
> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq-sched.c |  1 +
>   block/elevator.c     | 10 ++++++----
>   block/elevator.h     |  1 +
>   3 files changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 336a15ffecfa..55a0fd105147 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -551,5 +551,6 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
>   	if (e->type->ops.exit_sched)
>   		e->type->ops.exit_sched(e);
>   	blk_mq_sched_tags_teardown(q, flags);
> +	set_bit(ELEVATOR_FLAG_DYING, &q->elevator->flags);
>   	q->elevator = NULL;
>   }

set_bit() is unordered; don't you need to take ->sysfs_lock here?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 19/24] block: add new helper for disabling elevator switch when deleting disk
  2025-04-30  4:35 ` [PATCH V4 19/24] block: add new helper for disabling elevator switch when deleting disk Ming Lei
@ 2025-04-30  6:32   ` Hannes Reinecke
  2025-04-30 14:20   ` Christoph Hellwig
  1 sibling, 0 replies; 51+ messages in thread
From: Hannes Reinecke @ 2025-04-30  6:32 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/30/25 06:35, Ming Lei wrote:
> Add new helper disable_elv_switch() and new flag QUEUE_FLAG_NO_ELV_SWITCH
> for disabling elevator switch before deleting disk:
> 
> - originally flag QUEUE_FLAG_REGISTERED is added for preventing elevator
> switch during removing disk, but this flag has been used widely for
> other purposes, so add one new flag for disabling elevator switch only
> 
> - for avoiding deadlock risk, we have to move elevator queue
> register/unregister out of elevator lock and queue freeze, which will be
> done in next patch. However, this way adds small race window between elevator
> switch and deleting ->queue_kobj, in which elevator queue register/unregister
> could be run concurrently. The added helper will be used for avoiding the race
> in the following patch.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq-debugfs.c |  1 +
>   block/elevator.c       |  5 ++++-
>   block/genhd.c          | 12 ++++++++++++
>   include/linux/blkdev.h |  3 +++
>   4 files changed, 20 insertions(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 24/24] block: move wbt_enable_default() out of queue freezing from sched ->exit()
  2025-04-30  4:35 ` [PATCH V4 24/24] block: move wbt_enable_default() out of queue freezing from sched ->exit() Ming Lei
@ 2025-04-30  6:33   ` Hannes Reinecke
  0 siblings, 0 replies; 51+ messages in thread
From: Hannes Reinecke @ 2025-04-30  6:33 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Nilay Shroff, Shinichiro Kawasaki, Thomas Hellström,
	Christoph Hellwig

On 4/30/25 06:35, Ming Lei wrote:
> scheduler's ->exit() is called with queue frozen and elevator lock is held, and
> wbt_enable_default() can't be called with queue frozen, otherwise the
> following lockdep warning is triggered:
> 
> 	#6 (&q->rq_qos_mutex){+.+.}-{4:4}:
> 	#5 (&eq->sysfs_lock){+.+.}-{4:4}:
> 	#4 (&q->elevator_lock){+.+.}-{4:4}:
> 	#3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
> 	#2 (fs_reclaim){+.+.}-{0:0}:
> 	#1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
> 	#0 (&q->debugfs_mutex){+.+.}-{4:4}:
> 
> Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
> call it from elevator_change_done().
> 
> Meantime add disk->rqos_state_mutex for covering wbt state change, which
> matches the purpose more than ->elevator_lock.
> 
> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/bfq-iosched.c    |  2 +-
>   block/blk-sysfs.c      | 10 ++++------
>   block/blk-wbt.c        |  6 ++++++
>   block/elevator.c       |  5 +++++
>   block/elevator.h       |  1 +
>   block/genhd.c          |  1 +
>   include/linux/blkdev.h |  2 ++
>   7 files changed, 20 insertions(+), 7 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 18/24] block: fail to show/store elevator sysfs attribute if elevator is dying
  2025-04-30  6:31   ` Hannes Reinecke
@ 2025-04-30 13:30     ` Ming Lei
  0 siblings, 0 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30 13:30 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

On Wed, Apr 30, 2025 at 08:31:01AM +0200, Hannes Reinecke wrote:
> On 4/30/25 06:35, Ming Lei wrote:
> > Prepare for moving elv_register[unregister]_queue out of elevator_lock
> > & queue freezing, so we may have to call elv_unregister_queue() after
> > elevator ->exit() is called, then there is small window for user to
> > call into ->show()/store(), and user-after-free can be caused.
> > 
> > Fail to show/store elevator sysfs attribute if elevator is dying by
> > adding one new flag of ELEVATOR_FLAG_DYNG, which is protected by
> > elevator ->sysfs_lock.
> > 
> > Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-mq-sched.c |  1 +
> >   block/elevator.c     | 10 ++++++----
> >   block/elevator.h     |  1 +
> >   3 files changed, 8 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index 336a15ffecfa..55a0fd105147 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -551,5 +551,6 @@ void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e)
> >   	if (e->type->ops.exit_sched)
> >   		e->type->ops.exit_sched(e);
> >   	blk_mq_sched_tags_teardown(q, flags);
> > +	set_bit(ELEVATOR_FLAG_DYING, &q->elevator->flags);
> >   	q->elevator = NULL;
> >   }
> 
> set_bit() is unordered; don't you need to take ->sysfs_lock here?

Yes.

blk_mq_exit_sched() is called from elevator_exit() with eq->sysfs_lock held.


thanks,
Ming


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning
  2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
                   ` (23 preceding siblings ...)
  2025-04-30  4:35 ` [PATCH V4 24/24] block: move wbt_enable_default() out of queue freezing from sched ->exit() Ming Lei
@ 2025-04-30 14:08 ` Christoph Hellwig
  2025-04-30 14:13   ` Ming Lei
  24 siblings, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2025-04-30 14:08 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

What tree is this based on?  it fails to apply to all of block-6.15,
for-6.16/block and Linus' master branch for me.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 08/24] block: don't allow to switch elevator if updating nr_hw_queues is in-progress
  2025-04-30  4:35 ` [PATCH V4 08/24] block: don't allow to switch elevator if updating nr_hw_queues is in-progress Ming Lei
@ 2025-04-30 14:09   ` Christoph Hellwig
  0 siblings, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2025-04-30 14:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig, Hannes Reinecke

On Wed, Apr 30, 2025 at 12:35:10PM +0800, Ming Lei wrote:
> Take the nested variant for avoiding the following false positive
> splat[1], and this way is correct because:
> 
> - the read lock in elv_iosched_store() is not overlapped with the read lock
> in adding/deleting disk:
> 
> - storing to kobject attribute is only available after the kobject is added
> and before it is deleted

This needs to go into a code comment, as all nested locking should be
documented.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning
  2025-04-30 14:08 ` [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Christoph Hellwig
@ 2025-04-30 14:13   ` Ming Lei
  0 siblings, 0 replies; 51+ messages in thread
From: Ming Lei @ 2025-04-30 14:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström

On Wed, Apr 30, 2025 at 04:08:37PM +0200, Christoph Hellwig wrote:
> What tree is this based on?  it fails to apply to all of block-6.15,
> for-6.16/block and Linus' master branch for me.

It is based on the latest for-6.16/block:

53ec1abce79c (block/for-6.16/block) brd: use memcpy_{to,from]_page in brd_rw_bve


Thanks,
Ming


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 13/24] block: move queue freezing & elevator_lock into elevator_change()
  2025-04-30  4:35 ` [PATCH V4 13/24] block: move queue freezing & elevator_lock into elevator_change() Ming Lei
@ 2025-04-30 14:13   ` Christoph Hellwig
  0 siblings, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2025-04-30 14:13 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig, Hannes Reinecke

FYI, this is the patch that fails to apply for me.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 14/24] block: add `struct elv_change_ctx` for unifying elevator change
  2025-04-30  4:35 ` [PATCH V4 14/24] block: add `struct elv_change_ctx` for unifying elevator change Ming Lei
@ 2025-04-30 14:15   ` Christoph Hellwig
  0 siblings, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2025-04-30 14:15 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig, Hannes Reinecke

>  void elv_update_nr_hw_queues(struct request_queue *q)
>  {
> -	const char *name = "none";
> +	struct elv_change_ctx ctx = {
> +		.name	= "none",
> +	};
>  
>  	WARN_ON_ONCE(q->mq_freeze_depth == 0);
>  
>  	mutex_lock(&q->elevator_lock);
>  	if (q->elevator && !blk_queue_dying(q) && !blk_queue_registered(q))
> -		name = q->elevator->type->elevator_name;
> +		ctx.name = q->elevator->type->elevator_name;
>  	/* force to reattach elevator after nr_hw_queue is updated */
> -	elevator_switch(q, name);
> +	elevator_switch(q, &ctx);

I don't think this elevator_switch is needed or even a good idea outside
the above if clause.  Moving it in there would also restore the behavior
in the earlier version of this series.

With that ctx.name also doesn't need to be initialized to none at
declaration time.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 15/24] block: unifying elevator change
  2025-04-30  4:35 ` [PATCH V4 15/24] block: " Ming Lei
@ 2025-04-30 14:17   ` Christoph Hellwig
  2025-05-01 10:28   ` Nilay Shroff
  1 sibling, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2025-04-30 14:17 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig, Hannes Reinecke

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 16/24] block: pass elevator_queue to elv_register_queue & unregister_queue
  2025-04-30  4:35 ` [PATCH V4 16/24] block: pass elevator_queue to elv_register_queue & unregister_queue Ming Lei
@ 2025-04-30 14:18   ` Christoph Hellwig
  0 siblings, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2025-04-30 14:18 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig, Hannes Reinecke

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 17/24] block: remove elevator queue's type check in elv_attr_show/store()
  2025-04-30  4:35 ` [PATCH V4 17/24] block: remove elevator queue's type check in elv_attr_show/store() Ming Lei
  2025-04-30  6:16   ` Hannes Reinecke
@ 2025-04-30 14:18   ` Christoph Hellwig
  2025-05-01 10:31   ` Nilay Shroff
  2 siblings, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2025-04-30 14:18 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 18/24] block: fail to show/store elevator sysfs attribute if elevator is dying
  2025-04-30  4:35 ` [PATCH V4 18/24] block: fail to show/store elevator sysfs attribute if elevator is dying Ming Lei
  2025-04-30  6:31   ` Hannes Reinecke
@ 2025-04-30 14:19   ` Christoph Hellwig
  1 sibling, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2025-04-30 14:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 19/24] block: add new helper for disabling elevator switch when deleting disk
  2025-04-30  4:35 ` [PATCH V4 19/24] block: add new helper for disabling elevator switch when deleting disk Ming Lei
  2025-04-30  6:32   ` Hannes Reinecke
@ 2025-04-30 14:20   ` Christoph Hellwig
  1 sibling, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2025-04-30 14:20 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Nilay Shroff, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 09/24] block: look up the elevator type in elevator_switch
  2025-04-30  4:35 ` [PATCH V4 09/24] block: look up the elevator type in elevator_switch Ming Lei
  2025-04-30  6:14   ` Hannes Reinecke
@ 2025-05-01  6:47   ` Nilay Shroff
  1 sibling, 0 replies; 51+ messages in thread
From: Nilay Shroff @ 2025-05-01  6:47 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig



On 4/30/25 10:05 AM, Ming Lei wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> That makes the function nicely self-contained and can be used
> to avoid code duplication.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Looks good to me:
Reviewed-by : Nilay Shroff <nilay@linux.ibm.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 10/24] block: fold elevator_disable into elevator_switch
  2025-04-30  4:35 ` [PATCH V4 10/24] block: fold elevator_disable into elevator_switch Ming Lei
  2025-04-30  6:14   ` Hannes Reinecke
@ 2025-05-01  6:59   ` Nilay Shroff
  1 sibling, 0 replies; 51+ messages in thread
From: Nilay Shroff @ 2025-05-01  6:59 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig



On 4/30/25 10:05 AM, Ming Lei wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> This removes duplicate code, and keeps the callers tidy.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Looks good to me:
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 11/24] block: move blk_queue_registered() check into elv_iosched_store()
  2025-04-30  4:35 ` [PATCH V4 11/24] block: move blk_queue_registered() check into elv_iosched_store() Ming Lei
  2025-04-30  6:15   ` Hannes Reinecke
@ 2025-05-01  9:55   ` Nilay Shroff
  2025-05-02 12:46     ` Ming Lei
  1 sibling, 1 reply; 51+ messages in thread
From: Nilay Shroff @ 2025-05-01  9:55 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig



On 4/30/25 10:05 AM, Ming Lei wrote:
> Move blk_queue_registered() check into elv_iosched_store() and prepare
> for using elevator_change() for covering any kind of elevator change in
> adding/deleting disk and updating nr_hw_queue.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/elevator.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/block/elevator.c b/block/elevator.c
> index df3e59107a2a..ac72c4f5a542 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -676,10 +676,6 @@ int elevator_switch(struct request_queue *q, const char *name)
>   */
>  static int elevator_change(struct request_queue *q, const char *elevator_name)
>  {
> -	/* Make sure queue is not in the middle of being removed */
> -	if (!blk_queue_registered(q))
> -		return -ENOENT;
> -
>  	if (q->elevator && elevator_match(q->elevator->type, elevator_name))
>  		return 0;
>  
> @@ -708,6 +704,10 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>  	struct request_queue *q = disk->queue;
>  	struct blk_mq_tag_set *set = q->tag_set;
>  
> +	/* Make sure queue is not in the middle of being removed */
> +	if (!blk_queue_registered(q))
> +		return -ENOENT;
> +
>  	/*
>  	 * If the attribute needs to load a module, do it before freezing the
>  	 * queue to ensure that the module file can be read when the request

Shouldn't blk_queue_registered needs protection? For instance, I see here a race 
while queue is being removed from del_gendisk and at the same time elevator is
being updated from elv_iosched_store. BTW, I saw that your patch 19/24 adds that 
protection using disable_elv_switch before deleting disk (and so you may want to
fold this patch 11/24 into patch 19/24 ?) but with that also I foresee a thin race
window. I will comment on it while reviewing the patch 19/24.

Thanks,
--Nilay

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 20/24] block: move elv_register[unregister]_queue out of elevator_lock
  2025-04-30  4:35 ` [PATCH V4 20/24] block: move elv_register[unregister]_queue out of elevator_lock Ming Lei
@ 2025-05-01 10:26   ` Nilay Shroff
  2025-05-02 12:50     ` Ming Lei
  0 siblings, 1 reply; 51+ messages in thread
From: Nilay Shroff @ 2025-05-01 10:26 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig,
	Hannes Reinecke



On 4/30/25 10:05 AM, Ming Lei wrote:
> Move elv_register[unregister]_queue out of ->elevator_lock & queue freezing,
> so we can kill many lockdep warnings.
> 
> elv_register[unregister]_queue() is serialized, and just dealing with sysfs/
> debugfs things, no need to be done with queue frozen:
> 
> - when it is called from adding disk, elevator switch isn't possible
>   because ->queue_kobj isn't added yet
> 
> - when it is called from deleting disk, disable_elv_switch() is
>   responsible for preventing new elevator switch and draining old
>   elevator switch.
> 
> - when it is called from blk_mq_update_nr_hw_queues(), adding/removing
>   disk and elevator switch can't be allowed or in-progress
> 
> With this change, elevator's ->exit() is called before calling
> elv_unregister_queue, then user may call into ->show()/store() of elevator's
> sysfs attributes, and we have covered this issue by adding `ELEVATOR_FLAG_DYNG`.
> 
> For blk-mq debugfs, hctx->sched_tags is always checked with ->elevator_lock by
> debugfs code, meantime hctx->sched_tags is updated with ->elevator_lock, so
> there isn't such issue.
> 
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq.c   |  3 +--
>  block/elevator.c | 63 ++++++++++++++++++++++++++++++++++++++----------
>  block/genhd.c    |  6 +++++
>  3 files changed, 57 insertions(+), 15 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b920eaedbaa9..a4bcfce4c4b9 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4997,11 +4997,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>  		blk_mq_debugfs_register_hctxs(q);
>  	}
>  
> +	/* elv_update_nr_hw_queues() unfreeze queue for us */
>  	list_for_each_entry(q, &set->tag_list, tag_set_list)
>  		elv_update_nr_hw_queues(q);
>  
> -	list_for_each_entry(q, &set->tag_list, tag_set_list)
> -		blk_mq_unfreeze_queue_nomemrestore(q);
>  	memalloc_noio_restore(memflags);
>  
>  	/* Free the excess tags when nr_hw_queues shrink. */
> diff --git a/block/elevator.c b/block/elevator.c
> index 98a754f58de5..492a593160ae 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -49,6 +49,11 @@
>  struct elv_change_ctx {
>  	const char *name;
>  	bool no_uevent;
> +
> +	/* for unregistering old elevator */
> +	struct elevator_queue *old;
> +	/* for registering new elevator */
> +	struct elevator_queue *new;
>  };
>  
>  static DEFINE_SPINLOCK(elv_list_lock);
> @@ -158,14 +163,14 @@ static void elevator_exit(struct request_queue *q)
>  {
>  	struct elevator_queue *e = q->elevator;
>  
> +	lockdep_assert_held(&q->elevator_lock);
> +
>  	ioc_clear_queue(q);
>  	blk_mq_sched_free_rqs(q);
>  
>  	mutex_lock(&e->sysfs_lock);
>  	blk_mq_exit_sched(q, e);
>  	mutex_unlock(&e->sysfs_lock);
> -
> -	kobject_put(&e->kobj);
>  }
>  
>  static inline void __elv_rqhash_del(struct request *rq)
> @@ -466,8 +471,6 @@ static int elv_register_queue(struct request_queue *q,
>  {
>  	int error;
>  
> -	lockdep_assert_held(&q->elevator_lock);
> -
>  	error = kobject_add(&e->kobj, &q->disk->queue_kobj, "iosched");
>  	if (!error) {
>  		const struct elv_fs_entry *attr = e->type->elevator_attrs;
> @@ -494,8 +497,6 @@ static int elv_register_queue(struct request_queue *q,
>  static void elv_unregister_queue(struct request_queue *q,
>  				 struct elevator_queue *e)
>  {
> -	lockdep_assert_held(&q->elevator_lock);
> -
>  	if (e && test_and_clear_bit(ELEVATOR_FLAG_REGISTERED, &e->flags)) {
>  		kobject_uevent(&e->kobj, KOBJ_REMOVE);
>  		kobject_del(&e->kobj);
> @@ -586,7 +587,7 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
>  	blk_mq_quiesce_queue(q);
>  
>  	if (q->elevator) {
> -		elv_unregister_queue(q, q->elevator);
> +		ctx->old = q->elevator;
>  		elevator_exit(q);
>  	}
>  
> @@ -594,11 +595,7 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
>  		ret = blk_mq_init_sched(q, new_e);
>  		if (ret)
>  			goto out_unfreeze;
> -		ret = elv_register_queue(q, q->elevator, !ctx->no_uevent);
> -		if (ret) {
> -			elevator_exit(q);
> -			goto out_unfreeze;
> -		}
> +		ctx->new = q->elevator;
>  	} else {
>  		blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q);
>  		q->elevator = NULL;
> @@ -619,6 +616,38 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
>  	return ret;
>  }
>  
> +static void elv_exit_and_release(struct request_queue *q)
> +{
> +	struct elevator_queue *e;
> +	unsigned memflags;
> +
> +	memflags = blk_mq_freeze_queue(q);
> +	mutex_lock(&q->elevator_lock);
> +	e = q->elevator;
> +	elevator_exit(q);
> +	mutex_unlock(&q->elevator_lock);
> +	blk_mq_unfreeze_queue(q, memflags);
> +	if (e)
> +		kobject_put(&e->kobj);
> +}
> +
> +static int elevator_change_done(struct request_queue *q,
> +				struct elv_change_ctx *ctx)
> +{
> +	int ret = 0;
> +
> +	if (ctx->old) {
> +		elv_unregister_queue(q, ctx->old);
> +		kobject_put(&ctx->old->kobj);
> +	}
> +	if (ctx->new) {
> +		ret = elv_register_queue(q, ctx->new, !ctx->no_uevent);
> +		if (ret)
> +			elv_exit_and_release(q);
> +	}
> +	return ret;
> +}
> +
>  /*
>   * Switch this queue to the given IO scheduler.
>   */
> @@ -645,6 +674,9 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
>  		ret = elevator_switch(q, ctx);
>  	mutex_unlock(&q->elevator_lock);
>  	blk_mq_unfreeze_queue(q, memflags);
> +	if (!ret)
> +		ret = elevator_change_done(q, ctx);
> +
>  	return ret;
>  }
>  
> @@ -657,6 +689,7 @@ void elv_update_nr_hw_queues(struct request_queue *q)
>  	struct elv_change_ctx ctx = {
>  		.name	= "none",
>  	};
> +	int ret = -ENODEV;
>  
>  	WARN_ON_ONCE(q->mq_freeze_depth == 0);
>  
> @@ -664,8 +697,12 @@ void elv_update_nr_hw_queues(struct request_queue *q)
>  	if (q->elevator && !blk_queue_dying(q) && !blk_queue_registered(q))
>  		ctx.name = q->elevator->type->elevator_name;
>  	/* force to reattach elevator after nr_hw_queue is updated */
> -	elevator_switch(q, &ctx);
> +	ret = elevator_switch(q, &ctx);
>  	mutex_unlock(&q->elevator_lock);
> +	blk_mq_unfreeze_queue_nomemrestore(q);
> +
> +	if (!ret)
> +		WARN_ON_ONCE(elevator_change_done(q, &ctx));
>  }
>  
>  /*
> diff --git a/block/genhd.c b/block/genhd.c
> index 0e64e7400fb4..59d9febd8c14 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -751,11 +751,17 @@ static void __del_gendisk(struct gendisk *disk)
>  
>  static void disable_elv_switch(struct request_queue *q)
>  {
> +	struct blk_mq_tag_set *set = q->tag_set;
> +
>  	WARN_ON_ONCE(!queue_is_mq(q));
>  
>  	mutex_lock(&q->elevator_lock);
>  	blk_queue_flag_set(QUEUE_FLAG_NO_ELV_SWITCH, q);
>  	mutex_unlock(&q->elevator_lock);
> +
> +	/* wait until in-progress elevator switch is done */
> +	down_write(&set->update_nr_hwq_lock);
> +	up_write(&set->update_nr_hwq_lock);
>  }
>  
>  /**

The disable_elv_switch which is now called just before __del_gendisk 
disables elevator switch using QUEUE_FLAG_NO_ELV_SWITCH. And I also see
write-lock (set->update_nr_hwq_lock) in disable_elv_switch which intends
to wait until in-progress elevator switch is finished but that may not help
because there's a small window in elv_iosched_store where it evaluates 
"elevator-switching-disabled" and then when it actually acquires the 
read-lock (set->update_nr_hwq_lock).

During the above window, if disable_elv_switch runs then we may enter into 
the race, where we'd see elv_iosched_store and __del_gendisk running 
concurrently. 

So we may want to update disable_elv_switch such that setting QUEUE_FLAG_
NO_ELV_SWITCH is protected with write-lock (set->update_nr_hwq_lock) and
if we do that then we may also not need q->elevator_lock in disable_elv_switch.
Or another way to fix it might be to move read-lock (set->update_nr_hwq_lock) 
at the top in elv_iosched_store.  

Thanks,
--Nilay

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 15/24] block: unifying elevator change
  2025-04-30  4:35 ` [PATCH V4 15/24] block: " Ming Lei
  2025-04-30 14:17   ` Christoph Hellwig
@ 2025-05-01 10:28   ` Nilay Shroff
  1 sibling, 0 replies; 51+ messages in thread
From: Nilay Shroff @ 2025-05-01 10:28 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig,
	Hannes Reinecke



On 4/30/25 10:05 AM, Ming Lei wrote:
> Elevator change is one well-define behavior:
> 
> - tear down current elevator if it exists
> 
> - setup new elevator
> 
> It is supposed to cover any case for changing elevator by single
> internal API, typically the following cases:
> 
> - setup default elevator in add_disk()
> 
> - switch to none in del_disk()
> 
> - reset elevator in blk_mq_update_nr_hw_queues()
> 
> - switch elevator in sysfs `store` elevator attribute
> 
> This patch uses elevator_change() to cover all above cases:
> 
> - every elevator switch is serialized with each other: add_disk/del_disk/
> store elevator is serialized already, blk_mq_update_nr_hw_queues() uses
> srcu for syncing with the other three cases
> 
> - for both add_disk()/del_disk(), queue freeze works at atomic mode
> or has been froze, so the freeze in elevator_change() won't add extra
> delay
> 
> - `struct elev_change_ctx` instance holds any info for changing elevator
> 
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Looks good to me:
Reviewed-by : Nilay Shroff <nilay@linux.ibm.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 17/24] block: remove elevator queue's type check in elv_attr_show/store()
  2025-04-30  4:35 ` [PATCH V4 17/24] block: remove elevator queue's type check in elv_attr_show/store() Ming Lei
  2025-04-30  6:16   ` Hannes Reinecke
  2025-04-30 14:18   ` Christoph Hellwig
@ 2025-05-01 10:31   ` Nilay Shroff
  2 siblings, 0 replies; 51+ messages in thread
From: Nilay Shroff @ 2025-05-01 10:31 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block
  Cc: Shinichiro Kawasaki, Thomas Hellström, Christoph Hellwig



On 4/30/25 10:05 AM, Ming Lei wrote:
> elevatore queue's type is assigned since its allocation, and never
> get cleared until it is released.
> 
> So its ->type is always not NULL, remove the unnecessary check.
> 
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Looks good to me:
Reviewed-by : Nilay Shroff <nilay@linux.ibm.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 11/24] block: move blk_queue_registered() check into elv_iosched_store()
  2025-05-01  9:55   ` Nilay Shroff
@ 2025-05-02 12:46     ` Ming Lei
  0 siblings, 0 replies; 51+ messages in thread
From: Ming Lei @ 2025-05-02 12:46 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig

On Thu, May 01, 2025 at 03:25:01PM +0530, Nilay Shroff wrote:
> 
> 
> On 4/30/25 10:05 AM, Ming Lei wrote:
> > Move blk_queue_registered() check into elv_iosched_store() and prepare
> > for using elevator_change() for covering any kind of elevator change in
> > adding/deleting disk and updating nr_hw_queue.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/elevator.c | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block/elevator.c b/block/elevator.c
> > index df3e59107a2a..ac72c4f5a542 100644
> > --- a/block/elevator.c
> > +++ b/block/elevator.c
> > @@ -676,10 +676,6 @@ int elevator_switch(struct request_queue *q, const char *name)
> >   */
> >  static int elevator_change(struct request_queue *q, const char *elevator_name)
> >  {
> > -	/* Make sure queue is not in the middle of being removed */
> > -	if (!blk_queue_registered(q))
> > -		return -ENOENT;
> > -
> >  	if (q->elevator && elevator_match(q->elevator->type, elevator_name))
> >  		return 0;
> >  
> > @@ -708,6 +704,10 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
> >  	struct request_queue *q = disk->queue;
> >  	struct blk_mq_tag_set *set = q->tag_set;
> >  
> > +	/* Make sure queue is not in the middle of being removed */
> > +	if (!blk_queue_registered(q))
> > +		return -ENOENT;
> > +
> >  	/*
> >  	 * If the attribute needs to load a module, do it before freezing the
> >  	 * queue to ensure that the module file can be read when the request
> 
> Shouldn't blk_queue_registered needs protection? For instance, I see here a race 

I think it isn't needed because both blk_queue_flag_set() and blk_queue_registered()
are atomic operation, and almost all check of blk_queue_registered() isn't
protected too.


thanks,
Ming


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH V4 20/24] block: move elv_register[unregister]_queue out of elevator_lock
  2025-05-01 10:26   ` Nilay Shroff
@ 2025-05-02 12:50     ` Ming Lei
  0 siblings, 0 replies; 51+ messages in thread
From: Ming Lei @ 2025-05-02 12:50 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: Jens Axboe, linux-block, Shinichiro Kawasaki,
	Thomas Hellström, Christoph Hellwig, Hannes Reinecke

On Thu, May 01, 2025 at 03:56:03PM +0530, Nilay Shroff wrote:
> 
> 
...

> > diff --git a/block/genhd.c b/block/genhd.c
> > index 0e64e7400fb4..59d9febd8c14 100644
> > --- a/block/genhd.c
> > +++ b/block/genhd.c
> > @@ -751,11 +751,17 @@ static void __del_gendisk(struct gendisk *disk)
> >  
> >  static void disable_elv_switch(struct request_queue *q)
> >  {
> > +	struct blk_mq_tag_set *set = q->tag_set;
> > +
> >  	WARN_ON_ONCE(!queue_is_mq(q));
> >  
> >  	mutex_lock(&q->elevator_lock);
> >  	blk_queue_flag_set(QUEUE_FLAG_NO_ELV_SWITCH, q);
> >  	mutex_unlock(&q->elevator_lock);
> > +
> > +	/* wait until in-progress elevator switch is done */
> > +	down_write(&set->update_nr_hwq_lock);
> > +	up_write(&set->update_nr_hwq_lock);
> >  }
> >  
> >  /**
> 
> The disable_elv_switch which is now called just before __del_gendisk 
> disables elevator switch using QUEUE_FLAG_NO_ELV_SWITCH. And I also see
> write-lock (set->update_nr_hwq_lock) in disable_elv_switch which intends
> to wait until in-progress elevator switch is finished but that may not help
> because there's a small window in elv_iosched_store where it evaluates 
> "elevator-switching-disabled" and then when it actually acquires the 
> read-lock (set->update_nr_hwq_lock).
> 
> During the above window, if disable_elv_switch runs then we may enter into 
> the race, where we'd see elv_iosched_store and __del_gendisk running 
> concurrently.

You are right.

> 
> So we may want to update disable_elv_switch such that setting QUEUE_FLAG_
> NO_ELV_SWITCH is protected with write-lock (set->update_nr_hwq_lock) and
> if we do that then we may also not need q->elevator_lock in disable_elv_switch.
> Or another way to fix it might be to move read-lock (set->update_nr_hwq_lock) 
> at the top in elv_iosched_store.  

Looks the approach is good.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2025-05-02 12:50 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-30  4:35 [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Ming Lei
2025-04-30  4:35 ` [PATCH V4 01/24] block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue() Ming Lei
2025-04-30  4:35 ` [PATCH V4 02/24] block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flag Ming Lei
2025-04-30  4:35 ` [PATCH V4 03/24] block: don't call freeze queue in elevator_switch() and elevator_disable() Ming Lei
2025-04-30  4:35 ` [PATCH V4 04/24] block: use q->elevator with ->elevator_lock held in elv_iosched_show() Ming Lei
2025-04-30  4:35 ` [PATCH V4 05/24] block: add two helpers for registering/un-registering sched debugfs Ming Lei
2025-04-30  4:35 ` [PATCH V4 06/24] block: move sched debugfs register into elvevator_register_queue Ming Lei
2025-04-30  4:35 ` [PATCH V4 07/24] block: prevent adding/deleting disk during updating nr_hw_queues Ming Lei
2025-04-30  4:35 ` [PATCH V4 08/24] block: don't allow to switch elevator if updating nr_hw_queues is in-progress Ming Lei
2025-04-30 14:09   ` Christoph Hellwig
2025-04-30  4:35 ` [PATCH V4 09/24] block: look up the elevator type in elevator_switch Ming Lei
2025-04-30  6:14   ` Hannes Reinecke
2025-05-01  6:47   ` Nilay Shroff
2025-04-30  4:35 ` [PATCH V4 10/24] block: fold elevator_disable into elevator_switch Ming Lei
2025-04-30  6:14   ` Hannes Reinecke
2025-05-01  6:59   ` Nilay Shroff
2025-04-30  4:35 ` [PATCH V4 11/24] block: move blk_queue_registered() check into elv_iosched_store() Ming Lei
2025-04-30  6:15   ` Hannes Reinecke
2025-05-01  9:55   ` Nilay Shroff
2025-05-02 12:46     ` Ming Lei
2025-04-30  4:35 ` [PATCH V4 12/24] block: simplify elevator reattachment for updating nr_hw_queues Ming Lei
2025-04-30  4:35 ` [PATCH V4 13/24] block: move queue freezing & elevator_lock into elevator_change() Ming Lei
2025-04-30 14:13   ` Christoph Hellwig
2025-04-30  4:35 ` [PATCH V4 14/24] block: add `struct elv_change_ctx` for unifying elevator change Ming Lei
2025-04-30 14:15   ` Christoph Hellwig
2025-04-30  4:35 ` [PATCH V4 15/24] block: " Ming Lei
2025-04-30 14:17   ` Christoph Hellwig
2025-05-01 10:28   ` Nilay Shroff
2025-04-30  4:35 ` [PATCH V4 16/24] block: pass elevator_queue to elv_register_queue & unregister_queue Ming Lei
2025-04-30 14:18   ` Christoph Hellwig
2025-04-30  4:35 ` [PATCH V4 17/24] block: remove elevator queue's type check in elv_attr_show/store() Ming Lei
2025-04-30  6:16   ` Hannes Reinecke
2025-04-30 14:18   ` Christoph Hellwig
2025-05-01 10:31   ` Nilay Shroff
2025-04-30  4:35 ` [PATCH V4 18/24] block: fail to show/store elevator sysfs attribute if elevator is dying Ming Lei
2025-04-30  6:31   ` Hannes Reinecke
2025-04-30 13:30     ` Ming Lei
2025-04-30 14:19   ` Christoph Hellwig
2025-04-30  4:35 ` [PATCH V4 19/24] block: add new helper for disabling elevator switch when deleting disk Ming Lei
2025-04-30  6:32   ` Hannes Reinecke
2025-04-30 14:20   ` Christoph Hellwig
2025-04-30  4:35 ` [PATCH V4 20/24] block: move elv_register[unregister]_queue out of elevator_lock Ming Lei
2025-05-01 10:26   ` Nilay Shroff
2025-05-02 12:50     ` Ming Lei
2025-04-30  4:35 ` [PATCH V4 21/24] block: move hctx debugfs/sysfs registering out of freezing queue Ming Lei
2025-04-30  4:35 ` [PATCH V4 22/24] block: don't acquire ->elevator_lock in blk_mq_map_swqueue and blk_mq_realloc_hw_ctxs Ming Lei
2025-04-30  4:35 ` [PATCH V4 23/24] block: move hctx cpuhp add/del out of queue freezing Ming Lei
2025-04-30  4:35 ` [PATCH V4 24/24] block: move wbt_enable_default() out of queue freezing from sched ->exit() Ming Lei
2025-04-30  6:33   ` Hannes Reinecke
2025-04-30 14:08 ` [PATCH V4 00/24] block: unify elevator changing and fix lockdep warning Christoph Hellwig
2025-04-30 14:13   ` Ming Lei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).