* [PATCH 0/3] block: restructure elevator switch path and fix a lockdep splat
From: Nilay Shroff @ 2025-10-16 5:30 UTC (permalink / raw)
To: linux-block; +Cc: ming.lei, hch, yukuai1, axboe, yi.zhang, czhong, gjoyce
Hi,
This patchset reorganizes the elevator switch path used during both
nr_hw_queues update and elv_iosched_store() operations to address a
recently reported lockdep splat [1].
The warning highlights a locking dependency of ->freeze_lock and
->elevator_lock on pcpu_alloc_mutex, created when the Kyber scheduler
dynamically allocates its private scheduling data while those locks are
held. The fix is to ensure that such allocations occur outside the locked
sections, thus eliminating the dependency chain.
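In practice the switch path then follows roughly this ordering (an
illustrative sketch only, not the literal code):

	/* allocate per-elevator data first; may take pcpu_alloc_mutex */
	data = e->ops.alloc_sched_data(q);

	memflags = blk_mq_freeze_queue(q);	/* ->freeze_lock */
	mutex_lock(&q->elevator_lock);
	/* switch the elevator, handing over the pre-allocated data */
	mutex_unlock(&q->elevator_lock);
	blk_mq_unfreeze_queue(q, memflags);

	if (switch_failed)
		e->ops.free_sched_data(data);	/* again outside the locks */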
While working on this, it also became evident that the nr_hw_queues update
code maintains two disjoint xarrays—one for elevator tags and another
for elevator type—both serving the same purpose. Unifying these into a
single elv_change_ctx structure improves clarity and maintainability.
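Concretely, each request_queue in the tagset then has exactly one context
entry, stored in a single xarray indexed by q->id (a sketch of the idea;
the full structure is introduced in patch 1):

	/* one entry per queue instead of two parallel xarrays */
	struct elv_change_ctx *ctx = xa_load(&elv_tbl, q->id);
	/* ctx carries the elevator name, type and sched tags together */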
This series therefore implements three patches:
The first (preparatory) patch unifies the elevator tags and type xarrays. It
combines both xarrays into a single struct elv_change_ctx, simplifying
per-queue elevator state management.
The second patch introduces ->alloc_sched_data and ->free_sched_data
elevator ops to safely allocate and free scheduler data before acquiring
->freeze_lock and ->elevator_lock, preventing the dependency on
pcpu_alloc_mutex.
The third patch converts the Kyber scheduler to use the new methods
introduced in the previous patch. It moves Kyber's scheduler data
allocation and teardown logic from ->init_sched and ->exit_sched into the
new methods, ensuring memory operations are performed outside the locked
sections.
Together, these changes simplify the elevator switch logic and prevent
the reported lockdep splat.
As always, feedback and suggestions are very welcome!
[1] https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/
Thanks,
--Nilay
Nilay Shroff (3):
block: unify elevator tags and type xarrays into struct elv_change_ctx
block: introduce alloc_sched_data and free_sched_data elevator methods
block: define alloc_sched_data and free_sched_data methods for kyber
block/blk-mq-sched.c | 104 ++++++++++++++++++++++++++++++++++++++----
block/blk-mq-sched.h | 35 +++++++++++++-
block/blk-mq.c | 56 ++++++++++++++---------
block/blk.h | 7 ++-
block/elevator.c | 76 ++++++++++++++++--------------
block/elevator.h | 23 +++++++++-
block/kyber-iosched.c | 30 ++++++++----
7 files changed, 252 insertions(+), 79 deletions(-)
--
2.51.0
* [PATCH 1/3] block: unify elevator tags and type xarrays into struct elv_change_ctx
From: Nilay Shroff @ 2025-10-16 5:30 UTC (permalink / raw)
To: linux-block; +Cc: ming.lei, hch, yukuai1, axboe, yi.zhang, czhong, gjoyce
Currently, the nr_hw_queues update path manages two disjoint xarrays —
one for elevator tags and another for elevator type — both used during
elevator switching. Maintaining these two parallel structures for the
same purpose adds unnecessary complexity and potential for mismatched
state.
This patch unifies both xarrays into a single structure, struct
elv_change_ctx, which holds all per-queue elevator change context. A
single xarray, named elv_tbl, now maps each queue (q->id) in a tagset
to its corresponding elv_change_ctx entry, encapsulating the elevator
tags, type and name references.
This unification simplifies the code, improves maintainability, and
clarifies ownership of per-queue elevator state.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-mq-sched.c | 47 ++++++++++++++++++++++++++++++++++------
block/blk-mq-sched.h | 13 +++++++++++
block/blk-mq.c | 51 ++++++++++++++++++++++++++------------------
block/blk.h | 7 +++---
block/elevator.c | 31 ++++++---------------------
block/elevator.h | 15 +++++++++++++
6 files changed, 108 insertions(+), 56 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index d06bb137a743..1c9571136a30 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -453,6 +453,33 @@ void blk_mq_free_sched_tags_batch(struct xarray *et_table,
}
}
+int blk_mq_alloc_sched_ctx_batch(struct xarray *elv_tbl,
+ struct blk_mq_tag_set *set)
+{
+ struct request_queue *q;
+ struct elv_change_ctx *ctx;
+
+ lockdep_assert_held_write(&set->update_nr_hwq_lock);
+
+ list_for_each_entry(q, &set->tag_list, tag_set_list) {
+ ctx = kzalloc(sizeof(struct elv_change_ctx), GFP_KERNEL);
+ if (!ctx)
+ goto out_unwind;
+
+ if (xa_insert(elv_tbl, q->id, ctx, GFP_KERNEL)) {
+ kfree(ctx);
+ goto out_unwind;
+ }
+ }
+ return 0;
+out_unwind:
+ list_for_each_entry_continue_reverse(q, &set->tag_list, tag_set_list) {
+ ctx = xa_load(elv_tbl, q->id);
+ kfree(ctx);
+ }
+ return -ENOMEM;
+}
+
struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
unsigned int nr_hw_queues, unsigned int nr_requests)
{
@@ -498,12 +525,13 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
return NULL;
}
-int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
+int blk_mq_alloc_sched_tags_batch(struct xarray *elv_tbl,
struct blk_mq_tag_set *set, unsigned int nr_hw_queues)
{
+ struct elv_change_ctx *ctx;
struct request_queue *q;
struct elevator_tags *et;
- gfp_t gfp = GFP_NOIO | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
+ int ret = -ENOMEM;
lockdep_assert_held_write(&set->update_nr_hwq_lock);
@@ -520,8 +548,13 @@ int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
blk_mq_default_nr_requests(set));
if (!et)
goto out_unwind;
- if (xa_insert(et_table, q->id, et, gfp))
+
+ ctx = xa_load(elv_tbl, q->id);
+ if (WARN_ON_ONCE(!ctx)) {
+ ret = -ENOENT;
goto out_free_tags;
+ }
+ ctx->et = et;
}
}
return 0;
@@ -530,12 +563,12 @@ int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
out_unwind:
list_for_each_entry_continue_reverse(q, &set->tag_list, tag_set_list) {
if (q->elevator) {
- et = xa_load(et_table, q->id);
- if (et)
- blk_mq_free_sched_tags(et, set);
+ ctx = xa_load(elv_tbl, q->id);
+ if (ctx && ctx->et)
+ blk_mq_free_sched_tags(ctx->et, set);
}
}
- return -ENOMEM;
+ return ret;
}
/* caller must have a reference to @e, will grab another one if successful */
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 8e21a6b1415d..ba67e4e2447b 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -27,11 +27,24 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
unsigned int nr_hw_queues, unsigned int nr_requests);
int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
struct blk_mq_tag_set *set, unsigned int nr_hw_queues);
+int blk_mq_alloc_sched_ctx_batch(struct xarray *elv_tbl,
+ struct blk_mq_tag_set *set);
void blk_mq_free_sched_tags(struct elevator_tags *et,
struct blk_mq_tag_set *set);
void blk_mq_free_sched_tags_batch(struct xarray *et_table,
struct blk_mq_tag_set *set);
+static inline void blk_mq_free_sched_ctx_batch(struct xarray *elv_tbl)
+{
+ unsigned long i;
+ struct elv_change_ctx *ctx;
+
+ xa_for_each(elv_tbl, i, ctx) {
+ xa_erase(elv_tbl, i);
+ kfree(ctx);
+ }
+}
+
static inline void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
{
if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 09f579414161..2e3ebaf877e1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4983,27 +4983,28 @@ struct elevator_tags *blk_mq_update_nr_requests(struct request_queue *q,
* Switch back to the elevator type stored in the xarray.
*/
static void blk_mq_elv_switch_back(struct request_queue *q,
- struct xarray *elv_tbl, struct xarray *et_tbl)
+ struct xarray *elv_tbl)
{
- struct elevator_type *e = xa_load(elv_tbl, q->id);
- struct elevator_tags *t = xa_load(et_tbl, q->id);
+ struct elv_change_ctx *ctx = xa_load(elv_tbl, q->id);
+
+ if (WARN_ON_ONCE(!ctx))
+ return;
/* The elv_update_nr_hw_queues unfreezes the queue. */
- elv_update_nr_hw_queues(q, e, t);
+ elv_update_nr_hw_queues(q, ctx);
/* Drop the reference acquired in blk_mq_elv_switch_none. */
- if (e)
- elevator_put(e);
+ if (ctx->type)
+ elevator_put(ctx->type);
}
/*
- * Stores elevator type in xarray and set current elevator to none. It uses
- * q->id as an index to store the elevator type into the xarray.
+ * Stores elevator name and type in ctx and set current elevator to none.
*/
static int blk_mq_elv_switch_none(struct request_queue *q,
struct xarray *elv_tbl)
{
- int ret = 0;
+ struct elv_change_ctx *ctx;
lockdep_assert_held_write(&q->tag_set->update_nr_hwq_lock);
@@ -5015,10 +5016,11 @@ static int blk_mq_elv_switch_none(struct request_queue *q,
* can't run concurrently.
*/
if (q->elevator) {
+ ctx = xa_load(elv_tbl, q->id);
+ if (WARN_ON_ONCE(!ctx))
+ return -ENOENT;
- ret = xa_insert(elv_tbl, q->id, q->elevator->type, GFP_KERNEL);
- if (WARN_ON_ONCE(ret))
- return ret;
+ ctx->name = q->elevator->type->elevator_name;
/*
* Before we switch elevator to 'none', take a reference to
@@ -5029,9 +5031,14 @@ static int blk_mq_elv_switch_none(struct request_queue *q,
*/
__elevator_get(q->elevator->type);
+ /*
+ * Store elevator type so that we can release the reference
+ * taken above later.
+ */
+ ctx->type = q->elevator->type;
elevator_set_none(q);
}
- return ret;
+ return 0;
}
static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
@@ -5041,7 +5048,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
int prev_nr_hw_queues = set->nr_hw_queues;
unsigned int memflags;
int i;
- struct xarray elv_tbl, et_tbl;
+ struct xarray elv_tbl;
bool queues_frozen = false;
lockdep_assert_held(&set->tag_list_lock);
@@ -5055,11 +5062,12 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
memflags = memalloc_noio_save();
- xa_init(&et_tbl);
- if (blk_mq_alloc_sched_tags_batch(&et_tbl, set, nr_hw_queues) < 0)
- goto out_memalloc_restore;
-
xa_init(&elv_tbl);
+ if (blk_mq_alloc_sched_ctx_batch(&elv_tbl, set) < 0)
+ goto out_xa_destroy;
+
+ if (blk_mq_alloc_sched_tags_batch(&elv_tbl, set, nr_hw_queues) < 0)
+ goto out_free_ctx;
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_debugfs_unregister_hctxs(q);
@@ -5105,7 +5113,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
/* switch_back expects queue to be frozen */
if (!queues_frozen)
blk_mq_freeze_queue_nomemsave(q);
- blk_mq_elv_switch_back(q, &elv_tbl, &et_tbl);
+ blk_mq_elv_switch_back(q, &elv_tbl);
}
list_for_each_entry(q, &set->tag_list, tag_set_list) {
@@ -5116,9 +5124,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
blk_mq_add_hw_queues_cpuhp(q);
}
+out_free_ctx:
+ blk_mq_free_sched_ctx_batch(&elv_tbl);
+out_xa_destroy:
xa_destroy(&elv_tbl);
- xa_destroy(&et_tbl);
-out_memalloc_restore:
memalloc_noio_restore(memflags);
/* Free the excess tags when nr_hw_queues shrink. */
diff --git a/block/blk.h b/block/blk.h
index 170794632135..a7992680f9e1 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -11,8 +11,7 @@
#include <xen/xen.h>
#include "blk-crypto-internal.h"
-struct elevator_type;
-struct elevator_tags;
+struct elv_change_ctx;
/*
* Default upper limit for the software max_sectors limit used for regular I/Os.
@@ -333,8 +332,8 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
bool blk_insert_flush(struct request *rq);
-void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_type *e,
- struct elevator_tags *t);
+void elv_update_nr_hw_queues(struct request_queue *q,
+ struct elv_change_ctx *ctx);
void elevator_set_default(struct request_queue *q);
void elevator_set_none(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index e2ebfbf107b3..cd7bdff205c8 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -45,19 +45,6 @@
#include "blk-wbt.h"
#include "blk-cgroup.h"
-/* Holding context data for changing elevator */
-struct elv_change_ctx {
- const char *name;
- bool no_uevent;
-
- /* for unregistering old elevator */
- struct elevator_queue *old;
- /* for registering new elevator */
- struct elevator_queue *new;
- /* holds sched tags data */
- struct elevator_tags *et;
-};
-
static DEFINE_SPINLOCK(elv_list_lock);
static LIST_HEAD(elv_list);
@@ -706,32 +693,28 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
* The I/O scheduler depends on the number of hardware queues, this forces a
* reattachment when nr_hw_queues changes.
*/
-void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_type *e,
- struct elevator_tags *t)
+void elv_update_nr_hw_queues(struct request_queue *q,
+ struct elv_change_ctx *ctx)
{
struct blk_mq_tag_set *set = q->tag_set;
- struct elv_change_ctx ctx = {};
int ret = -ENODEV;
WARN_ON_ONCE(q->mq_freeze_depth == 0);
- if (e && !blk_queue_dying(q) && blk_queue_registered(q)) {
- ctx.name = e->elevator_name;
- ctx.et = t;
-
+ if (ctx->type && !blk_queue_dying(q) && blk_queue_registered(q)) {
mutex_lock(&q->elevator_lock);
/* force to reattach elevator after nr_hw_queue is updated */
- ret = elevator_switch(q, &ctx);
+ ret = elevator_switch(q, ctx);
mutex_unlock(&q->elevator_lock);
}
blk_mq_unfreeze_queue_nomemrestore(q);
if (!ret)
- WARN_ON_ONCE(elevator_change_done(q, &ctx));
+ WARN_ON_ONCE(elevator_change_done(q, ctx));
/*
* Free sched tags if it's allocated but we couldn't switch elevator.
*/
- if (t && !ctx.new)
- blk_mq_free_sched_tags(t, set);
+ if (ctx->et && !ctx->new)
+ blk_mq_free_sched_tags(ctx->et, set);
}
/*
diff --git a/block/elevator.h b/block/elevator.h
index c4d20155065e..bad43182361e 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -32,6 +32,21 @@ struct elevator_tags {
struct blk_mq_tags *tags[];
};
+/* Holding context data for changing elevator */
+struct elv_change_ctx {
+ const char *name;
+ bool no_uevent;
+
+ /* for unregistering old elevator */
+ struct elevator_queue *old;
+ /* for registering new elevator */
+ struct elevator_queue *new;
+ /* store elevator type */
+ struct elevator_type *type;
+ /* holds sched tags data */
+ struct elevator_tags *et;
+};
+
struct elevator_mq_ops {
int (*init_sched)(struct request_queue *, struct elevator_queue *);
void (*exit_sched)(struct elevator_queue *);
--
2.51.0
* [PATCH 2/3] block: introduce alloc_sched_data and free_sched_data elevator methods
From: Nilay Shroff @ 2025-10-16 5:30 UTC (permalink / raw)
To: linux-block; +Cc: ming.lei, hch, yukuai1, axboe, yi.zhang, czhong, gjoyce
The recent lockdep splat [1] highlights a potential deadlock risk
involving ->elevator_lock and ->freeze_lock dependencies on pcpu_alloc_mutex.
The trace shows that the issue occurs when the Kyber scheduler
allocates dynamic memory for its elevator data during initialization.
To address this, introduce two new elevator operation callbacks:
->alloc_sched_data and ->free_sched_data.
When an elevator implements these methods, they are invoked during
scheduler switch before acquiring ->freeze_lock and ->elevator_lock.
This allows safe allocation and deallocation of per-elevator data
without holding locks that could depend on pcpu_alloc_mutex, effectively
breaking the lock dependency chain and avoiding the reported deadlock
scenario.
[1] https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-mq-sched.c | 68 ++++++++++++++++++++++++++++++++++++++++++--
block/blk-mq-sched.h | 23 ++++++++++++++-
block/blk-mq.c | 7 ++++-
block/elevator.c | 46 +++++++++++++++++++++++-------
block/elevator.h | 8 +++++-
5 files changed, 137 insertions(+), 15 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 1c9571136a30..f1cc2f2428b2 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -453,6 +453,70 @@ void blk_mq_free_sched_tags_batch(struct xarray *et_table,
}
}
+void blk_mq_free_sched_data_batch(struct xarray *elv_tbl,
+ struct blk_mq_tag_set *set)
+{
+ struct request_queue *q;
+ struct elv_change_ctx *ctx;
+
+ lockdep_assert_held_write(&set->update_nr_hwq_lock);
+
+ list_for_each_entry(q, &set->tag_list, tag_set_list) {
+ if (q->elevator) {
+ ctx = xa_load(elv_tbl, q->id);
+ if (WARN_ON_ONCE(!ctx))
+ continue;
+ if (ctx->data)
+ blk_mq_free_sched_data(q->elevator->type,
+ ctx->data);
+ }
+ }
+}
+
+int blk_mq_alloc_sched_data_batch(struct xarray *elv_tbl,
+ struct blk_mq_tag_set *set)
+{
+ struct request_queue *q;
+ struct elv_change_ctx *ctx;
+ int ret = 0;
+
+ lockdep_assert_held_write(&set->update_nr_hwq_lock);
+
+ list_for_each_entry(q, &set->tag_list, tag_set_list) {
+ /*
+ * Accessing q->elevator without holding q->elevator_lock is
+ * safe because we're holding here set->update_nr_hwq_lock in
+ * the writer context. So, scheduler update/switch code (which
+ * acquires the same lock but in the reader context) can't run
+ * concurrently.
+ */
+ if (q->elevator) {
+ ctx = xa_load(elv_tbl, q->id);
+ if (WARN_ON_ONCE(!ctx))
+ return -ENOENT;
+
+ ret = blk_mq_alloc_sched_data(q, q->elevator->type,
+ &ctx->data);
+ if (ret)
+ goto out_unwind;
+ }
+ }
+ return ret;
+
+out_unwind:
+ list_for_each_entry_continue_reverse(q, &set->tag_list, tag_set_list) {
+ if (q->elevator) {
+ ctx = xa_load(elv_tbl, q->id);
+ if (WARN_ON_ONCE(!ctx))
+ continue;
+ if (ctx->data)
+ blk_mq_free_sched_data(q->elevator->type,
+ ctx->data);
+ }
+ }
+ return ret;
+}
+
int blk_mq_alloc_sched_ctx_batch(struct xarray *elv_tbl,
struct blk_mq_tag_set *set)
{
@@ -573,7 +637,7 @@ int blk_mq_alloc_sched_tags_batch(struct xarray *elv_tbl,
/* caller must have a reference to @e, will grab another one if successful */
int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
- struct elevator_tags *et)
+ struct elevator_tags *et, void *data)
{
unsigned int flags = q->tag_set->flags;
struct blk_mq_hw_ctx *hctx;
@@ -581,7 +645,7 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
unsigned long i;
int ret;
- eq = elevator_alloc(q, e, et);
+ eq = elevator_alloc(q, e, et, data);
if (!eq)
return -ENOMEM;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index ba67e4e2447b..23cda157d8dd 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -19,7 +19,7 @@ void __blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
- struct elevator_tags *et);
+ struct elevator_tags *et, void *data);
void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
void blk_mq_sched_free_rqs(struct request_queue *q);
@@ -29,10 +29,31 @@ int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
struct blk_mq_tag_set *set, unsigned int nr_hw_queues);
int blk_mq_alloc_sched_ctx_batch(struct xarray *elv_tbl,
struct blk_mq_tag_set *set);
+int blk_mq_alloc_sched_data_batch(struct xarray *elv_tbl,
+ struct blk_mq_tag_set *set);
void blk_mq_free_sched_tags(struct elevator_tags *et,
struct blk_mq_tag_set *set);
void blk_mq_free_sched_tags_batch(struct xarray *et_table,
struct blk_mq_tag_set *set);
+void blk_mq_free_sched_data_batch(struct xarray *elv_tbl,
+ struct blk_mq_tag_set *set);
+
+static inline int blk_mq_alloc_sched_data(struct request_queue *q,
+ struct elevator_type *e, void **data)
+{
+ if (e && e->ops.alloc_sched_data) {
+ *data = e->ops.alloc_sched_data(q);
+ if (!*data)
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+static inline void blk_mq_free_sched_data(struct elevator_type *e, void *data)
+{
+ if (e && e->ops.free_sched_data)
+ e->ops.free_sched_data(data);
+}
static inline void blk_mq_free_sched_ctx_batch(struct xarray *elv_tbl)
{
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2e3ebaf877e1..0ffec6875db9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -5066,9 +5066,14 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
if (blk_mq_alloc_sched_ctx_batch(&elv_tbl, set) < 0)
goto out_xa_destroy;
- if (blk_mq_alloc_sched_tags_batch(&elv_tbl, set, nr_hw_queues) < 0)
+ if (blk_mq_alloc_sched_data_batch(&elv_tbl, set) < 0)
goto out_free_ctx;
+ if (blk_mq_alloc_sched_tags_batch(&elv_tbl, set, nr_hw_queues) < 0) {
+ blk_mq_free_sched_data_batch(&elv_tbl, set);
+ goto out_free_ctx;
+ }
+
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_debugfs_unregister_hctxs(q);
blk_mq_sysfs_unregister_hctxs(q);
diff --git a/block/elevator.c b/block/elevator.c
index cd7bdff205c8..89f04b359911 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -121,7 +121,9 @@ static struct elevator_type *elevator_find_get(const char *name)
static const struct kobj_type elv_ktype;
struct elevator_queue *elevator_alloc(struct request_queue *q,
- struct elevator_type *e, struct elevator_tags *et)
+ struct elevator_type *e,
+ struct elevator_tags *et,
+ void *data)
{
struct elevator_queue *eq;
@@ -135,6 +137,7 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
mutex_init(&eq->sysfs_lock);
hash_init(eq->hash);
eq->et = et;
+ eq->elevator_data = data;
return eq;
}
@@ -580,7 +583,7 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
}
if (new_e) {
- ret = blk_mq_init_sched(q, new_e, ctx->et);
+ ret = blk_mq_init_sched(q, new_e, ctx->et, ctx->data);
if (ret)
goto out_unfreeze;
ctx->new = q->elevator;
@@ -617,6 +620,7 @@ static void elv_exit_and_release(struct request_queue *q)
blk_mq_unfreeze_queue(q, memflags);
if (e) {
blk_mq_free_sched_tags(e->et, q->tag_set);
+ blk_mq_free_sched_data(e->type, e->elevator_data);
kobject_put(&e->kobj);
}
}
@@ -632,6 +636,7 @@ static int elevator_change_done(struct request_queue *q,
elv_unregister_queue(q, ctx->old);
blk_mq_free_sched_tags(ctx->old->et, q->tag_set);
+ blk_mq_free_sched_data(ctx->old->type, ctx->old->elevator_data);
kobject_put(&ctx->old->kobj);
if (enable_wbt)
wbt_enable_default(q->disk);
@@ -660,6 +665,10 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
blk_mq_default_nr_requests(set));
if (!ctx->et)
return -ENOMEM;
+
+ ret = blk_mq_alloc_sched_data(q, ctx->type, &ctx->data);
+ if (ret)
+ goto free_sched_tags;
}
memflags = blk_mq_freeze_queue(q);
@@ -680,10 +689,18 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
blk_mq_unfreeze_queue(q, memflags);
if (!ret)
ret = elevator_change_done(q, ctx);
+
+ if (ctx->new) /* switching to new elevator is successful */
+ return ret;
+
/*
- * Free sched tags if it's allocated but we couldn't switch elevator.
+ * Free sched tags and data if those were allocated but we couldn't
+ * switch elevator.
*/
- if (ctx->et && !ctx->new)
+ if (ctx->data)
+ blk_mq_free_sched_data(ctx->type, ctx->data);
+free_sched_tags:
+ if (ctx->et)
blk_mq_free_sched_tags(ctx->et, set);
return ret;
@@ -710,11 +727,17 @@ void elv_update_nr_hw_queues(struct request_queue *q,
blk_mq_unfreeze_queue_nomemrestore(q);
if (!ret)
WARN_ON_ONCE(elevator_change_done(q, ctx));
+
+ if (ctx->new) /* switching to new elevator is successful */
+ return;
/*
- * Free sched tags if it's allocated but we couldn't switch elevator.
+ * Free sched tags and data if it's allocated but we couldn't switch
+ * elevator.
*/
- if (ctx->et && !ctx->new)
+ if (ctx->et)
blk_mq_free_sched_tags(ctx->et, set);
+ if (ctx->data)
+ blk_mq_free_sched_data(ctx->type, ctx->data);
}
/*
@@ -728,7 +751,6 @@ void elevator_set_default(struct request_queue *q)
.no_uevent = true,
};
int err;
- struct elevator_type *e;
/* now we allow to switch elevator */
blk_queue_flag_clear(QUEUE_FLAG_NO_ELV_SWITCH, q);
@@ -741,8 +763,8 @@ void elevator_set_default(struct request_queue *q)
* have multiple queues or mq-deadline is not available, default
* to "none".
*/
- e = elevator_find_get(ctx.name);
- if (!e)
+ ctx.type = elevator_find_get(ctx.name);
+ if (!ctx.type)
return;
if ((q->nr_hw_queues == 1 ||
@@ -752,7 +774,7 @@ void elevator_set_default(struct request_queue *q)
pr_warn("\"%s\" elevator initialization, failed %d, falling back to \"none\"\n",
ctx.name, err);
}
- elevator_put(e);
+ elevator_put(ctx.type);
}
void elevator_set_none(struct request_queue *q)
@@ -801,6 +823,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
ctx.name = strstrip(elevator_name);
elv_iosched_load_module(ctx.name);
+ ctx.type = elevator_find_get(ctx.name);
down_read(&set->update_nr_hwq_lock);
if (!blk_queue_no_elv_switch(q)) {
@@ -811,6 +834,9 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
ret = -ENOENT;
}
up_read(&set->update_nr_hwq_lock);
+
+ if (ctx.type)
+ elevator_put(ctx.type);
return ret;
}
diff --git a/block/elevator.h b/block/elevator.h
index bad43182361e..648022e4ec92 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -43,6 +43,8 @@ struct elv_change_ctx {
struct elevator_queue *new;
/* store elevator type */
struct elevator_type *type;
+ /* store elevator data */
+ void *data;
/* holds sched tags data */
struct elevator_tags *et;
};
@@ -53,6 +55,8 @@ struct elevator_mq_ops {
int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
void (*depth_updated)(struct request_queue *);
+ void *(*alloc_sched_data)(struct request_queue *);
+ void (*free_sched_data)(void *);
bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
bool (*bio_merge)(struct request_queue *, struct bio *, unsigned int);
@@ -178,7 +182,9 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *page, size_t count);
extern bool elv_bio_merge_ok(struct request *, struct bio *);
struct elevator_queue *elevator_alloc(struct request_queue *,
- struct elevator_type *, struct elevator_tags *);
+ struct elevator_type *,
+ struct elevator_tags *,
+ void *);
/*
* Helper functions.
--
2.51.0
* [PATCH 3/3] block: define alloc_sched_data and free_sched_data methods for kyber
From: Nilay Shroff @ 2025-10-16 5:30 UTC (permalink / raw)
To: linux-block; +Cc: ming.lei, hch, yukuai1, axboe, yi.zhang, czhong, gjoyce
Currently, the Kyber elevator allocates its private data dynamically in
->init_sched and frees it in ->exit_sched. However, since ->init_sched
is invoked during elevator switch after acquiring both ->freeze_lock and
->elevator_lock, it may trigger the lockdep splat [1] due to dependency
on pcpu_alloc_mutex.
To resolve this, move the elevator data allocation and deallocation
logic from ->init_sched and ->exit_sched into the newly introduced
->alloc_sched_data and ->free_sched_data methods. These callbacks are
invoked before acquiring ->freeze_lock and ->elevator_lock, ensuring
that memory allocation happens safely without introducing additional
locking dependencies.
This change breaks the dependency chain involving pcpu_alloc_mutex and
prevents the reported lockdep warning.
[1] https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/
Reported-by: Changhui Zhong <czhong@redhat.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/kyber-iosched.c | 30 ++++++++++++++++++++++--------
1 file changed, 22 insertions(+), 8 deletions(-)
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 18efd6ef2a2b..c1b36ffd19ce 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -409,30 +409,42 @@ static void kyber_depth_updated(struct request_queue *q)
static int kyber_init_sched(struct request_queue *q, struct elevator_queue *eq)
{
- struct kyber_queue_data *kqd;
-
- kqd = kyber_queue_data_alloc(q);
- if (IS_ERR(kqd))
- return PTR_ERR(kqd);
-
blk_stat_enable_accounting(q);
blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q);
- eq->elevator_data = kqd;
q->elevator = eq;
kyber_depth_updated(q);
return 0;
}
+static void *kyber_alloc_sched_data(struct request_queue *q)
+{
+ struct kyber_queue_data *kqd;
+
+ kqd = kyber_queue_data_alloc(q);
+ if (IS_ERR(kqd))
+ return NULL;
+
+ return kqd;
+}
+
static void kyber_exit_sched(struct elevator_queue *e)
{
struct kyber_queue_data *kqd = e->elevator_data;
- int i;
timer_shutdown_sync(&kqd->timer);
blk_stat_disable_accounting(kqd->q);
+}
+
+static void kyber_free_sched_data(void *elv_data)
+{
+ struct kyber_queue_data *kqd = elv_data;
+ int i;
+
+ if (!kqd)
+ return;
for (i = 0; i < KYBER_NUM_DOMAINS; i++)
sbitmap_queue_free(&kqd->domain_tokens[i]);
@@ -1004,6 +1016,8 @@ static struct elevator_type kyber_sched = {
.exit_sched = kyber_exit_sched,
.init_hctx = kyber_init_hctx,
.exit_hctx = kyber_exit_hctx,
+ .alloc_sched_data = kyber_alloc_sched_data,
+ .free_sched_data = kyber_free_sched_data,
.limit_depth = kyber_limit_depth,
.bio_merge = kyber_bio_merge,
.prepare_request = kyber_prepare_request,
--
2.51.0
* Re: [PATCH 1/3] block: unify elevator tags and type xarrays into struct elv_change_ctx
From: Ming Lei @ 2025-10-22 4:11 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, hch, yukuai1, axboe, yi.zhang, czhong, gjoyce
On Thu, Oct 16, 2025 at 11:00:47AM +0530, Nilay Shroff wrote:
> Currently, the nr_hw_queues update path manages two disjoint xarrays —
> one for elevator tags and another for elevator type — both used during
> elevator switching. Maintaining these two parallel structures for the
> same purpose adds unnecessary complexity and potential for mismatched
> state.
>
> This patch unifies both xarrays into a single structure, struct
> elv_change_ctx, which holds all per-queue elevator change context. A
> single xarray, named elv_tbl, now maps each queue (q->id) in a tagset
> to its corresponding elv_change_ctx entry, encapsulating the elevator
> tags, type and name references.
>
> This unification simplifies the code, improves maintainability, and
> clarifies ownership of per-queue elevator state.
>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> block/blk-mq-sched.c | 47 ++++++++++++++++++++++++++++++++++------
> block/blk-mq-sched.h | 13 +++++++++++
> block/blk-mq.c | 51 ++++++++++++++++++++++++++------------------
> block/blk.h | 7 +++---
> block/elevator.c | 31 ++++++---------------------
> block/elevator.h | 15 +++++++++++++
> 6 files changed, 108 insertions(+), 56 deletions(-)
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index d06bb137a743..1c9571136a30 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -453,6 +453,33 @@ void blk_mq_free_sched_tags_batch(struct xarray *et_table,
> }
> }
>
> +int blk_mq_alloc_sched_ctx_batch(struct xarray *elv_tbl,
> + struct blk_mq_tag_set *set)
> +{
> + struct request_queue *q;
> + struct elv_change_ctx *ctx;
> +
> + lockdep_assert_held_write(&set->update_nr_hwq_lock);
> +
> + list_for_each_entry(q, &set->tag_list, tag_set_list) {
> + ctx = kzalloc(sizeof(struct elv_change_ctx), GFP_KERNEL);
> + if (!ctx)
> + goto out_unwind;
> +
> + if (xa_insert(elv_tbl, q->id, ctx, GFP_KERNEL)) {
> + kfree(ctx);
> + goto out_unwind;
> + }
> + }
> + return 0;
> +out_unwind:
> + list_for_each_entry_continue_reverse(q, &set->tag_list, tag_set_list) {
> + ctx = xa_load(elv_tbl, q->id);
> + kfree(ctx);
> + }
No need to unwind, you can let blk_mq_free_sched_ctx_batch cover cleanup from
callsite. Not to mention you leave the freed `ctx` in the xarray, which is fragile.
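For example, the callsite could then look roughly like this (just a
sketch):

	xa_init(&elv_tbl);
	if (blk_mq_alloc_sched_ctx_batch(&elv_tbl, set) < 0)
		goto out_free_ctx;	/* frees whatever was inserted so far */
	/* ... rest of __blk_mq_update_nr_hw_queues() ... */
out_free_ctx:
	blk_mq_free_sched_ctx_batch(&elv_tbl);
	xa_destroy(&elv_tbl);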
> + return -ENOMEM;
> +}
> +
> struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
> unsigned int nr_hw_queues, unsigned int nr_requests)
> {
> @@ -498,12 +525,13 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
> return NULL;
> }
>
> -int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
> +int blk_mq_alloc_sched_tags_batch(struct xarray *elv_tbl,
> struct blk_mq_tag_set *set, unsigned int nr_hw_queues)
> {
> + struct elv_change_ctx *ctx;
> struct request_queue *q;
> struct elevator_tags *et;
> - gfp_t gfp = GFP_NOIO | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
> + int ret = -ENOMEM;
>
> lockdep_assert_held_write(&set->update_nr_hwq_lock);
>
> @@ -520,8 +548,13 @@ int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
> blk_mq_default_nr_requests(set));
> if (!et)
> goto out_unwind;
> - if (xa_insert(et_table, q->id, et, gfp))
> +
> + ctx = xa_load(elv_tbl, q->id);
> + if (WARN_ON_ONCE(!ctx)) {
> + ret = -ENOENT;
> goto out_free_tags;
> + }
> + ctx->et = et;
> }
> }
> return 0;
> @@ -530,12 +563,12 @@ int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
> out_unwind:
> list_for_each_entry_continue_reverse(q, &set->tag_list, tag_set_list) {
> if (q->elevator) {
> - et = xa_load(et_table, q->id);
> - if (et)
> - blk_mq_free_sched_tags(et, set);
> + ctx = xa_load(elv_tbl, q->id);
> + if (ctx && ctx->et)
> + blk_mq_free_sched_tags(ctx->et, set);
please clear ctx->et when it is freed.
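e.g. something like:

	blk_mq_free_sched_tags(ctx->et, set);
	ctx->et = NULL;	/* don't leave a dangling pointer in the ctx */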
> }
> }
> - return -ENOMEM;
> + return ret;
> }
>
> /* caller must have a reference to @e, will grab another one if successful */
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 8e21a6b1415d..ba67e4e2447b 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -27,11 +27,24 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
> unsigned int nr_hw_queues, unsigned int nr_requests);
> int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
> struct blk_mq_tag_set *set, unsigned int nr_hw_queues);
> +int blk_mq_alloc_sched_ctx_batch(struct xarray *elv_tbl,
> + struct blk_mq_tag_set *set);
> void blk_mq_free_sched_tags(struct elevator_tags *et,
> struct blk_mq_tag_set *set);
> void blk_mq_free_sched_tags_batch(struct xarray *et_table,
> struct blk_mq_tag_set *set);
>
> +static inline void blk_mq_free_sched_ctx_batch(struct xarray *elv_tbl)
> +{
> + unsigned long i;
> + struct elv_change_ctx *ctx;
> +
> + xa_for_each(elv_tbl, i, ctx) {
> + xa_erase(elv_tbl, i);
> + kfree(ctx);
> + }
> +}
> +
It would be more readable to move blk_mq_free_sched_ctx_batch() next to
blk_mq_alloc_sched_ctx_batch().
Thanks,
Ming
* Re: [PATCH 2/3] block: introduce alloc_sched_data and free_sched_data elevator methods
From: Ming Lei @ 2025-10-22 4:39 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, hch, yukuai1, axboe, yi.zhang, czhong, gjoyce
On Thu, Oct 16, 2025 at 11:00:48AM +0530, Nilay Shroff wrote:
> The recent lockdep splat [1] highlights a potential deadlock risk
> involving ->elevator_lock and ->freeze_lock dependencies on -pcpu_alloc_
> mutex. The trace shows that the issue occurs when the Kyber scheduler
> allocates dynamic memory for its elevator data during initialization.
>
> To address this, introduce two new elevator operation callbacks:
> ->alloc_sched_data and ->free_sched_data.
This way looks good.
>
> When an elevator implements these methods, they are invoked during
> scheduler switch before acquiring ->freeze_lock and ->elevator_lock.
> This allows safe allocation and deallocation of per-elevator data
This per-elevator data should be very similar with `struct elevator_tags`
from block layer viewpoint: both have same lifetime, and follow same
allocation constraint(per-cpu lock).
Can we abstract elevator data structure to cover both? Then I guess the
code should be more readable & maintainable, what do you think of this way?
One easiest way could be to add 'void *data' into `struct elevator_tags`,
just the naming of `elevator_tags` is not generic enough, but might not
a big deal.
Thanks,
Ming
* Re: [PATCH 1/3] block: unify elevator tags and type xarrays into struct elv_change_ctx
From: Nilay Shroff @ 2025-10-23 5:53 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, hch, yukuai1, axboe, yi.zhang, czhong, gjoyce
On 10/22/25 9:41 AM, Ming Lei wrote:
> On Thu, Oct 16, 2025 at 11:00:47AM +0530, Nilay Shroff wrote:
>>
>> +int blk_mq_alloc_sched_ctx_batch(struct xarray *elv_tbl,
>> + struct blk_mq_tag_set *set)
>> +{
>> + struct request_queue *q;
>> + struct elv_change_ctx *ctx;
>> +
>> + lockdep_assert_held_write(&set->update_nr_hwq_lock);
>> +
>> + list_for_each_entry(q, &set->tag_list, tag_set_list) {
>> + ctx = kzalloc(sizeof(struct elv_change_ctx), GFP_KERNEL);
>> + if (!ctx)
>> + goto out_unwind;
>> +
>> + if (xa_insert(elv_tbl, q->id, ctx, GFP_KERNEL)) {
>> + kfree(ctx);
>> + goto out_unwind;
>> + }
>> + }
>> + return 0;
>> +out_unwind:
>> + list_for_each_entry_continue_reverse(q, &set->tag_list, tag_set_list) {
>> + ctx = xa_load(elv_tbl, q->id);
>> + kfree(ctx);
>> + }
>
> No need to unwind, you can let blk_mq_free_sched_ctx_batch cover cleanup from
> callsite.
Yes, that makes sense. I’ll drop the unwind logic and rely on
blk_mq_free_sched_ctx_batch() for cleanup at the callsite in the next version.
> Not to mention you leave the freed `ctx` in the xarray, which is fragile.
Good catch! Removing the unwind block will naturally avoid that issue as well.
>> @@ -530,12 +563,12 @@ int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
>> out_unwind:
>> list_for_each_entry_continue_reverse(q, &set->tag_list, tag_set_list) {
>> if (q->elevator) {
>> - et = xa_load(et_table, q->id);
>> - if (et)
>> - blk_mq_free_sched_tags(et, set);
>> + ctx = xa_load(elv_tbl, q->id);
>> + if (ctx && ctx->et)
>> + blk_mq_free_sched_tags(ctx->et, set);
>
> please clear ctx->et when it is freed.
Ack, will fix it in next version.
>> +static inline void blk_mq_free_sched_ctx_batch(struct xarray *elv_tbl)
>> +{
>> + unsigned long i;
>> + struct elv_change_ctx *ctx;
>> +
>> + xa_for_each(elv_tbl, i, ctx) {
>> + xa_erase(elv_tbl, i);
>> + kfree(ctx);
>> + }
>> +}
>> +
>
> It would be more readable to move blk_mq_free_sched_ctx_batch() next to
> blk_mq_alloc_sched_ctx_batch().
>
Agreed — I’ll move blk_mq_free_sched_ctx_batch() next to
blk_mq_alloc_sched_ctx_batch() for better readability.
Thanks,
--Nilay
* Re: [PATCH 2/3] block: introduce alloc_sched_data and free_sched_data elevator methods
From: Nilay Shroff @ 2025-10-23 5:57 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, hch, yukuai1, axboe, yi.zhang, czhong, gjoyce
On 10/22/25 10:09 AM, Ming Lei wrote:
> On Thu, Oct 16, 2025 at 11:00:48AM +0530, Nilay Shroff wrote:
>> The recent lockdep splat [1] highlights a potential deadlock risk
>> involving ->elevator_lock and ->freeze_lock dependencies on -pcpu_alloc_
>> mutex. The trace shows that the issue occurs when the Kyber scheduler
>> allocates dynamic memory for its elevator data during initialization.
>>
>> To address this, introduce two new elevator operation callbacks:
>> ->alloc_sched_data and ->free_sched_data.
>
> This way looks good.
>
>>
>> When an elevator implements these methods, they are invoked during
>> scheduler switch before acquiring ->freeze_lock and ->elevator_lock.
>> This allows safe allocation and deallocation of per-elevator data
>
> This per-elevator data should be very similar with `struct elevator_tags`
> from block layer viewpoint: both have same lifetime, and follow same
> allocation constraint(per-cpu lock).
>
> Can we abstract elevator data structure to cover both? Then I guess the
> code should be more readable & maintainable, what do you think of this way?
>
> One easiest way could be to add 'void *data' into `struct elevator_tags`,
> just the naming of `elevator_tags` is not generic enough, but might not
> a big deal.
>
Hmm, good point! I'd rather suggest if we could instead rename
struct elevator_tags to struct elevator_resources and then
add void *data field to it. Something like this:
struct elevator_tags {
unsigned int nr_hw_queues;
unsigned int nr_requests;
struct blk_mq_tags *tags[];
void *data;
};
What do you think?
Thanks,
--Nilay
* Re: [PATCH 2/3] block: introduce alloc_sched_data and free_sched_data elevator methods
From: Ming Lei @ 2025-10-23 7:48 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, hch, yukuai1, axboe, yi.zhang, czhong, gjoyce
On Thu, Oct 23, 2025 at 11:27:26AM +0530, Nilay Shroff wrote:
>
>
> On 10/22/25 10:09 AM, Ming Lei wrote:
> > On Thu, Oct 16, 2025 at 11:00:48AM +0530, Nilay Shroff wrote:
> >> The recent lockdep splat [1] highlights a potential deadlock risk
> >> involving ->elevator_lock and ->freeze_lock dependencies on -pcpu_alloc_
> >> mutex. The trace shows that the issue occurs when the Kyber scheduler
> >> allocates dynamic memory for its elevator data during initialization.
> >>
> >> To address this, introduce two new elevator operation callbacks:
> >> ->alloc_sched_data and ->free_sched_data.
> >
> > This way looks good.
> >
> >>
> >> When an elevator implements these methods, they are invoked during
> >> scheduler switch before acquiring ->freeze_lock and ->elevator_lock.
> >> This allows safe allocation and deallocation of per-elevator data
> >
> > This per-elevator data should be very similar with `struct elevator_tags`
> > from block layer viewpoint: both have same lifetime, and follow same
> > allocation constraint(per-cpu lock).
> >
> > Can we abstract elevator data structure to cover both? Then I guess the
> > code should be more readable & maintainable, what do you think of this way?
> >
> > One easiest way could be to add 'void *data' into `struct elevator_tags`,
> > just the naming of `elevator_tags` is not generic enough, but might not
> > a big deal.
> >
> Hmm, good point! I'd rather suggest if we could instead rename
> struct elevator_tags to struct elevator_resources and then
> add void *data field to it. Something like this:
>
> struct elevator_tags {
> unsigned int nr_hw_queues;
> unsigned int nr_requests;
> struct blk_mq_tags *tags[];
> void *data;
'data' can't follow `tags[]`.
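tags[] is a flexible array member, so it has to stay the last field; the
new pointer would need to go before it, e.g.:

	struct elevator_tags {
		unsigned int nr_hw_queues;
		unsigned int nr_requests;
		void *data;
		struct blk_mq_tags *tags[];
	};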
> };
>
> What do you think?
It is good. The patch may be split into two:
- add data to `struct elevator_tags` for covering the lockdep issue
- renaming
Then it will become easier for review.
Thanks
Ming
* Re: [PATCH 2/3] block: introduce alloc_sched_data and free_sched_data elevator methods
From: Nilay Shroff @ 2025-10-23 8:28 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, hch, yukuai1, axboe, yi.zhang, czhong, gjoyce
On 10/23/25 1:18 PM, Ming Lei wrote:
> On Thu, Oct 23, 2025 at 11:27:26AM +0530, Nilay Shroff wrote:
>>
>>
>> On 10/22/25 10:09 AM, Ming Lei wrote:
>>> On Thu, Oct 16, 2025 at 11:00:48AM +0530, Nilay Shroff wrote:
>>>> The recent lockdep splat [1] highlights a potential deadlock risk
>>>> involving ->elevator_lock and ->freeze_lock dependencies on -pcpu_alloc_
>>>> mutex. The trace shows that the issue occurs when the Kyber scheduler
>>>> allocates dynamic memory for its elevator data during initialization.
>>>>
>>>> To address this, introduce two new elevator operation callbacks:
>>>> ->alloc_sched_data and ->free_sched_data.
>>>
>>> This way looks good.
>>>
>>>>
>>>> When an elevator implements these methods, they are invoked during
>>>> scheduler switch before acquiring ->freeze_lock and ->elevator_lock.
>>>> This allows safe allocation and deallocation of per-elevator data
>>>
>>> This per-elevator data should be very similar with `struct elevator_tags`
>>> from block layer viewpoint: both have same lifetime, and follow same
>>> allocation constraint(per-cpu lock).
>>>
>>> Can we abstract elevator data structure to cover both? Then I guess the
>>> code should be more readable & maintainable, what do you think of this way?
>>>
>>> One easiest way could be to add 'void *data' into `struct elevator_tags`,
>>> just the naming of `elevator_tags` is not generic enough, but might not
>>> a big deal.
>>>
>> Hmm, good point! I'd rather suggest if we could instead rename
>> struct elevator_tags to struct elevator_resources and then
>> add void *data field to it. Something like this:
>>
>> struct elevator_tags {
>> unsigned int nr_hw_queues;
>> unsigned int nr_requests;
>> struct blk_mq_tags *tags[];
>> void *data;
>
> 'data' can't follow `tags[]`.
yeah it was impromptu :)
>
>> };
>>
>> What do you think?
>
> It is good. The patch may be split into two:
>
> - add data to `struct elevator_tags` for covering the lockdep issue
>
> - renaming
>
> Then it will become easier for review.
>
Alright, I'll implement it in the next patchset.
Thanks,
--Nilay
* Re: [PATCH 2/3] block: introduce alloc_sched_data and free_sched_data elevator methods
From: Nilay Shroff @ 2025-10-27 17:38 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, hch, yukuai1, axboe, yi.zhang, czhong, gjoyce
Hi Ming,
On 10/22/25 10:09 AM, Ming Lei wrote:
> On Thu, Oct 16, 2025 at 11:00:48AM +0530, Nilay Shroff wrote:
>> The recent lockdep splat [1] highlights a potential deadlock risk
>> involving ->elevator_lock and ->freeze_lock dependencies on -pcpu_alloc_
>> mutex. The trace shows that the issue occurs when the Kyber scheduler
>> allocates dynamic memory for its elevator data during initialization.
>>
>> To address this, introduce two new elevator operation callbacks:
>> ->alloc_sched_data and ->free_sched_data.
>
> This way looks good.
>
>>
>> When an elevator implements these methods, they are invoked during
>> scheduler switch before acquiring ->freeze_lock and ->elevator_lock.
>> This allows safe allocation and deallocation of per-elevator data
>
> This per-elevator data should be very similar with `struct elevator_tags`
> from block layer viewpoint: both have same lifetime, and follow same
> allocation constraint(per-cpu lock).
>
> Can we abstract elevator data structure to cover both? Then I guess the
> code should be more readable & maintainable, what do you think of this way?
>
> One easiest way could be to add 'void *data' into `struct elevator_tags`,
> just the naming of `elevator_tags` is not generic enough, but might not
> a big deal.
>
I realized that struct elevator_tags is already a member of struct elevator_queue,
and we also have a separate void *elevator_data member within the same structure.
So, adding void *data directly into struct elevator_tags may not be ideal, as it
would mix two logically distinct resources under a misleading name. Instead, we
can abstract both — void *elevator_data and struct elevator_tags — into a new
structure named struct elevator_resources. For instance:
struct elevator_resources {
void *data;
struct elevator_tags *et;
};
struct elv_change_ctx {
const char *name;
bool no_uevent;
struct elevator_queue *old;
struct elevator_queue *new;
struct elevator_type *type;
struct elevator_resources res;
};
I've just sent out PATCHv3 with the above change. Please review and let me know
if this approach looks good to you.
Thanks,
--Nilay
* Re: [PATCH 2/3] block: introduce alloc_sched_data and free_sched_data elevator methods
From: Ming Lei @ 2025-10-28 2:43 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, hch, yukuai1, axboe, yi.zhang, czhong, gjoyce
On Mon, Oct 27, 2025 at 11:08:13PM +0530, Nilay Shroff wrote:
> Hi Ming,
>
> On 10/22/25 10:09 AM, Ming Lei wrote:
> > On Thu, Oct 16, 2025 at 11:00:48AM +0530, Nilay Shroff wrote:
> >> The recent lockdep splat [1] highlights a potential deadlock risk
> >> involving ->elevator_lock and ->freeze_lock dependencies on -pcpu_alloc_
> >> mutex. The trace shows that the issue occurs when the Kyber scheduler
> >> allocates dynamic memory for its elevator data during initialization.
> >>
> >> To address this, introduce two new elevator operation callbacks:
> >> ->alloc_sched_data and ->free_sched_data.
> >
> > This way looks good.
> >
> >>
> >> When an elevator implements these methods, they are invoked during
> >> scheduler switch before acquiring ->freeze_lock and ->elevator_lock.
> >> This allows safe allocation and deallocation of per-elevator data
> >
> > This per-elevator data should be very similar with `struct elevator_tags`
> > from block layer viewpoint: both have same lifetime, and follow same
> > allocation constraint(per-cpu lock).
> >
> > Can we abstract elevator data structure to cover both? Then I guess the
> > code should be more readable & maintainable, what do you think of this way?
> >
> > One easiest way could be to add 'void *data' into `struct elevator_tags`,
> > just the naming of `elevator_tags` is not generic enough, but might not
> > a big deal.
> >
> I realized that struct elevator_tags is already a member of struct elevator_queue,
> and we also have a separate void *elevator_data member within the same structure.
>
> So, adding void *data directly into struct elevator_tags may not be ideal, as it
> would mix two logically distinct resources under a misleading name. Instead, we
> can abstract both — void *elevator_data and struct elevator_tags — into a new
> structure named struct elevator_resources. For instance:
>
> struct elevator_resources {
> void *data;
> struct elevator_tags *et;
> };
>
> struct elv_change_ctx {
> const char *name;
> bool no_uevent;
> struct elevator_queue *old;
> struct elevator_queue *new;
> struct elevator_type *type;
> struct elevator_resources res;
> };
>
> I've just sent out PATCHv3 with the above change. Please review and let me know
> if this approach looks good to you.
It is fine to add `struct elevator_resources` for further abstraction, but
you need to abstract the related methods too, otherwise patch 3 is still
hard to follow: the existing blk_mq_free_sched_tags() can be renamed to
blk_mq_free_sched_resource() first, and then blk_mq_free_sched_data() is
called only from inside blk_mq_free_sched_resource(), instead of being
called after every blk_mq_free_sched_tags().
Same with blk_mq_alloc_sched_tags_batch()/blk_mq_free_sched_tags_batch():
you can make universal blk_mq_alloc_sched_res_batch()/blk_mq_free_sched_res_batch()
helpers to cover both the tags and the scheduler data, which will also be
easier to extend in the future.
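Something like the following, just to sketch the shape (naming and
signature are illustrative only):

	static void blk_mq_free_sched_resource(struct elevator_type *e,
			struct blk_mq_tag_set *set,
			struct elevator_resources *res)
	{
		if (res->et) {
			blk_mq_free_sched_tags(res->et, set);
			res->et = NULL;
		}
		if (res->data) {
			blk_mq_free_sched_data(e, res->data);
			res->data = NULL;
		}
	}

Then the callers only deal with the resources struct and don't need to
know about the scheduler data at all.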
thanks
Ming
* Re: [PATCH 2/3] block: introduce alloc_sched_data and free_sched_data elevator methods
From: Nilay Shroff @ 2025-10-28 4:51 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, hch, yukuai1, axboe, yi.zhang, czhong, gjoyce
On 10/28/25 8:13 AM, Ming Lei wrote:
> On Mon, Oct 27, 2025 at 11:08:13PM +0530, Nilay Shroff wrote:
>> Hi Ming,
>>
>> On 10/22/25 10:09 AM, Ming Lei wrote:
>>> On Thu, Oct 16, 2025 at 11:00:48AM +0530, Nilay Shroff wrote:
>>>> The recent lockdep splat [1] highlights a potential deadlock risk
>>>> involving ->elevator_lock and ->freeze_lock dependencies on -pcpu_alloc_
>>>> mutex. The trace shows that the issue occurs when the Kyber scheduler
>>>> allocates dynamic memory for its elevator data during initialization.
>>>>
>>>> To address this, introduce two new elevator operation callbacks:
>>>> ->alloc_sched_data and ->free_sched_data.
>>>
>>> This way looks good.
>>>
>>>>
>>>> When an elevator implements these methods, they are invoked during
>>>> scheduler switch before acquiring ->freeze_lock and ->elevator_lock.
>>>> This allows safe allocation and deallocation of per-elevator data
>>>
>>> This per-elevator data should be very similar with `struct elevator_tags`
>>> from block layer viewpoint: both have same lifetime, and follow same
>>> allocation constraint(per-cpu lock).
>>>
>>> Can we abstract elevator data structure to cover both? Then I guess the
>>> code should be more readable & maintainable, what do you think of this way?
>>>
>>> One easiest way could be to add 'void *data' into `struct elevator_tags`,
>>> just the naming of `elevator_tags` is not generic enough, but might not
>>> a big deal.
>>>
>> I realized that struct elevator_tags is already a member of struct elevator_queue,
>> and we also have a separate void *elevator_data member within the same structure.
>>
>> So, adding void *data directly into struct elevator_tags may not be ideal, as it
>> would mix two logically distinct resources under a misleading name. Instead, we
>> can abstract both — void *elevator_data and struct elevator_tags — into a new
>> structure named struct elevator_resources. For instance:
>>
>> struct elevator_resources {
>> void *data;
>> struct elevator_tags *et;
>> };
>>
>> struct elv_change_ctx {
>> const char *name;
>> bool no_uevent;
>> struct elevator_queue *old;
>> struct elevator_queue *new;
>> struct elevator_type *type;
>> struct elevator_resources res;
>> };
>>
>> I've just sent out PATCHv3 with the above change. Please review and let me know
>> if this approach looks good to you.
>
> It is fine to add `struct elevator_resources` for further abstraction, but
> you need to abstract the related methods too, otherwise patch 3 is still
> hard to follow: the existing blk_mq_free_sched_tags() can be renamed to
> blk_mq_free_sched_resource() first, and then blk_mq_free_sched_data() is
> called only from inside blk_mq_free_sched_resource(), instead of being
> called after every blk_mq_free_sched_tags().
> Same with blk_mq_alloc_sched_tags_batch()/blk_mq_free_sched_tags_batch():
> you can make universal blk_mq_alloc_sched_res_batch()/blk_mq_free_sched_res_batch()
> helpers to cover both the tags and the scheduler data, which will also be
> easier to extend in the future.
> to cover both tags & schedule data, and it is easier to extend in future too.
>
Okay, that makes sense. I will restructure the code and prepare a new patchset.
Thanks,
--Nilay