* [PATCHv4 0/5] block: restructure elevator switch path and fix a lockdep splat
@ 2025-11-10 8:14 Nilay Shroff
2025-11-10 8:14 ` [PATCHv4 1/5] block: unify elevator tags and type xarrays into struct elv_change_ctx Nilay Shroff
` (4 more replies)
0 siblings, 5 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-11-10 8:14 UTC (permalink / raw)
To: linux-block; +Cc: ming.lei, hch, axboe, yi.zhang, czhong, yukuai, gjoyce
Hi,
This patchset reorganizes the elevator switch path used during both
nr_hw_queues update and elv_iosched_store() operations to address a
recently reported lockdep splat [1].
The warning highlights a lock dependency from ->freeze_lock and
->elevator_lock onto pcpu_alloc_mutex, created when the Kyber scheduler
dynamically allocates its private scheduling data while those locks are
held. The fix is to ensure that such allocations occur outside the
locked sections, thus eliminating the dependency chain.
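For illustration, here is a minimal sketch of the ordering problem and
the fix (simplified, not verbatim kernel code):

    /* Before: scheduler data allocated with both locks held */
    memflags = blk_mq_freeze_queue(q);  /* ->freeze_lock */
    mutex_lock(&q->elevator_lock);      /* ->elevator_lock */
    e->ops.init_sched(q, eq);           /* kyber: alloc_percpu() takes
                                         * pcpu_alloc_mutex here */

    /* After: the allocation happens first, with neither lock held */
    data = e->ops.alloc_sched_data(q);  /* may take pcpu_alloc_mutex */
    memflags = blk_mq_freeze_queue(q);
    mutex_lock(&q->elevator_lock);
    e->ops.init_sched(q, eq);           /* no allocation in here anymore */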
While working on this, it also became evident that the nr_hw_queues
update code maintains two disjoint xarrays, one for elevator tags and
another for elevator type, both serving the same purpose. Unifying
these into a single elv_change_ctx structure improves clarity and
maintainability.
This series therefore implements five patches:
The first preparatory patch unifies the elevator tags and type xarrays.
It combines both xarrays into a single struct elv_change_ctx,
simplifying per-queue elevator state management.
The second patch aims to group together all elevator-related resources
that share the same lifetime: as a first step, it moves the elevator
tags pointer from struct elv_change_ctx into the newly introduced
struct elevator_resources. A subsequent patch extends struct
elevator_resources to include other elevator-related data.
The third patch introduces the ->alloc_sched_data and ->free_sched_data
elevator ops, which can then be used to safely allocate and free
scheduler data.
The fourth patch builds upon the previous one and starts using the
newly introduced alloc/free sched data methods during elevator switch
and nr_hw_queues update. While doing so, it ensures that scheduler data
allocation and freeing happen outside the ->freeze_lock and
->elevator_lock critical sections, thus preventing any dependency on
pcpu_alloc_mutex.
The last patch of this series converts the Kyber scheduler to use the
new methods introduced in the previous patch. It moves Kyber's
scheduler data allocation and teardown logic from ->init_sched and
->exit_sched into the new methods, ensuring memory operations are
performed outside locked sections.
Together, these changes simplify the elevator switch logic and prevent
the reported lockdep splat.
As always, feedback and suggestions are very welcome!
[1] https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/
Thanks,
--Nilay
changes from v3:
- Split the third patch into two patches to separate the introduction
of ->alloc_sched_data and ->free_sched_data methods from their users.
- Free scheduler tags during sched resource allocation failures using
blk_mq_free_sched_tags() instead of kfree() to avoid kmemleak
(Ming Lei).
- Delay the signature change of elevator_alloc() until the fourth
patch, where we actually start allocating scheduler data during
elevator switch and nr_hw_queues update (Ming Lei).
Link to v3: https://lore.kernel.org/all/20251029103622.205607-1-nilay@linux.ibm.com/
changes from v2:
- Introduce helper functions blk_mq_alloc_sched_res_batch() and
blk_mq_free_sched_res_batch() to encapsulate scheduler resource
(tags and data) allocation and freeing in batch mode. (Ming Lei)
- Introduce helper functions blk_mq_alloc_sched_res() and
blk_mq_free_sched_res() to encapsulate scheduler resource
allocation and freeing. (Ming Lei)
Link to v2: https://lore.kernel.org/all/20251027173631.1081005-1-nilay@linux.ibm.com/
changes from v1:
- Keep blk_mq_free_sched_ctx_batch() and blk_mq_alloc_sched_ctx_batch()
together in the same file (Ming Lei)
- Since each ctx pointer is stored in the xarray right after it's
dynamically allocated, if blk_mq_alloc_sched_ctx_batch() fails to
allocate or insert a ctx pointer then unwinding the allocations is
not necessary. Instead, looping over the xarray to retrieve the
inserted ctx pointers and freeing them is sufficient. So invoke
blk_mq_free_sched_ctx_batch() from the blk_mq_alloc_sched_ctx_batch()
callsite on failure (Ming Lei)
- As both elevator tags and elevator data share the same lifetime
and allocation constraints, abstract both into a new structure
(Ming Lei)
Link to v1: https://lore.kernel.org/all/20251016053057.3457663-1-nilay@linux.ibm.com/
Nilay Shroff (5):
block: unify elevator tags and type xarrays into struct elv_change_ctx
block: move elevator tags into struct elevator_resources
block: introduce alloc_sched_data and free_sched_data elevator methods
block: use {alloc|free}_sched data methods
block: define alloc_sched_data and free_sched_data methods for kyber
block/blk-mq-sched.c | 123 +++++++++++++++++++++++++++++++++---------
block/blk-mq-sched.h | 34 ++++++++++--
block/blk-mq.c | 50 +++++++++--------
block/blk.h | 7 ++-
block/elevator.c | 80 +++++++++++++--------------
block/elevator.h | 26 ++++++++-
block/kyber-iosched.c | 30 ++++++++---
7 files changed, 244 insertions(+), 106 deletions(-)
--
2.51.0
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCHv4 1/5] block: unify elevator tags and type xarrays into struct elv_change_ctx
2025-11-10 8:14 [PATCHv4 0/5] block: restructure elevator switch path and fix a lockdep splat Nilay Shroff
@ 2025-11-10 8:14 ` Nilay Shroff
2025-11-11 6:55 ` Yu Kuai
2025-11-10 8:14 ` [PATCHv4 2/5] block: move elevator tags into struct elevator_resources Nilay Shroff
` (3 subsequent siblings)
4 siblings, 1 reply; 18+ messages in thread
From: Nilay Shroff @ 2025-11-10 8:14 UTC (permalink / raw)
To: linux-block; +Cc: ming.lei, hch, axboe, yi.zhang, czhong, yukuai, gjoyce
Currently, the nr_hw_queues update path manages two disjoint xarrays —
one for elevator tags and another for elevator type — both used during
elevator switching. Maintaining these two parallel structures for the
same purpose adds unnecessary complexity and potential for mismatched
state.
This patch unifies both xarrays into a single structure, struct
elv_change_ctx, which holds all per-queue elevator change context. A
single xarray, named elv_tbl, now maps each queue (q->id) in a tagset
to its corresponding elv_change_ctx entry, encapsulating the elevator
tags, type and name references.
This unification simplifies the code, improves maintainability, and
clarifies ownership of per-queue elevator state.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-mq-sched.c | 76 +++++++++++++++++++++++++++++++++-----------
block/blk-mq-sched.h | 3 ++
block/blk-mq.c | 50 +++++++++++++++++------------
block/blk.h | 7 ++--
block/elevator.c | 31 ++++--------------
block/elevator.h | 15 +++++++++
6 files changed, 115 insertions(+), 67 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index e0bed16485c3..3d9386555a50 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -427,11 +427,11 @@ void blk_mq_free_sched_tags(struct elevator_tags *et,
kfree(et);
}
-void blk_mq_free_sched_tags_batch(struct xarray *et_table,
+void blk_mq_free_sched_tags_batch(struct xarray *elv_tbl,
struct blk_mq_tag_set *set)
{
struct request_queue *q;
- struct elevator_tags *et;
+ struct elv_change_ctx *ctx;
lockdep_assert_held_write(&set->update_nr_hwq_lock);
@@ -444,13 +444,47 @@ void blk_mq_free_sched_tags_batch(struct xarray *et_table,
* concurrently.
*/
if (q->elevator) {
- et = xa_load(et_table, q->id);
- if (unlikely(!et))
+ ctx = xa_load(elv_tbl, q->id);
+ if (!ctx || !ctx->et) {
WARN_ON_ONCE(1);
- else
- blk_mq_free_sched_tags(et, set);
+ continue;
+ }
+ blk_mq_free_sched_tags(ctx->et, set);
+ ctx->et = NULL;
+ }
+ }
+}
+
+void blk_mq_free_sched_ctx_batch(struct xarray *elv_tbl)
+{
+ unsigned long i;
+ struct elv_change_ctx *ctx;
+
+ xa_for_each(elv_tbl, i, ctx) {
+ xa_erase(elv_tbl, i);
+ kfree(ctx);
+ }
+}
+
+int blk_mq_alloc_sched_ctx_batch(struct xarray *elv_tbl,
+ struct blk_mq_tag_set *set)
+{
+ struct request_queue *q;
+ struct elv_change_ctx *ctx;
+
+ lockdep_assert_held_write(&set->update_nr_hwq_lock);
+
+ list_for_each_entry(q, &set->tag_list, tag_set_list) {
+ ctx = kzalloc(sizeof(struct elv_change_ctx), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ if (xa_insert(elv_tbl, q->id, ctx, GFP_KERNEL)) {
+ kfree(ctx);
+ return -ENOMEM;
}
}
+ return 0;
}
struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
@@ -498,12 +532,13 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
return NULL;
}
-int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
+int blk_mq_alloc_sched_tags_batch(struct xarray *elv_tbl,
struct blk_mq_tag_set *set, unsigned int nr_hw_queues)
{
+ struct elv_change_ctx *ctx;
struct request_queue *q;
struct elevator_tags *et;
- gfp_t gfp = GFP_NOIO | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
+ int ret = -ENOMEM;
lockdep_assert_held_write(&set->update_nr_hwq_lock);
@@ -516,26 +551,31 @@ int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
* concurrently.
*/
if (q->elevator) {
- et = blk_mq_alloc_sched_tags(set, nr_hw_queues,
+ ctx = xa_load(elv_tbl, q->id);
+ if (WARN_ON_ONCE(!ctx)) {
+ ret = -ENOENT;
+ goto out_unwind;
+ }
+
+ ctx->et = blk_mq_alloc_sched_tags(set, nr_hw_queues,
blk_mq_default_nr_requests(set));
- if (!et)
+ if (!ctx->et)
goto out_unwind;
- if (xa_insert(et_table, q->id, et, gfp))
- goto out_free_tags;
+
}
}
return 0;
-out_free_tags:
- blk_mq_free_sched_tags(et, set);
out_unwind:
list_for_each_entry_continue_reverse(q, &set->tag_list, tag_set_list) {
if (q->elevator) {
- et = xa_load(et_table, q->id);
- if (et)
- blk_mq_free_sched_tags(et, set);
+ ctx = xa_load(elv_tbl, q->id);
+ if (ctx && ctx->et) {
+ blk_mq_free_sched_tags(ctx->et, set);
+ ctx->et = NULL;
+ }
}
}
- return -ENOMEM;
+ return ret;
}
/* caller must have a reference to @e, will grab another one if successful */
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 8e21a6b1415d..2fddbc91a235 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -27,6 +27,9 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
unsigned int nr_hw_queues, unsigned int nr_requests);
int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
struct blk_mq_tag_set *set, unsigned int nr_hw_queues);
+int blk_mq_alloc_sched_ctx_batch(struct xarray *elv_tbl,
+ struct blk_mq_tag_set *set);
+void blk_mq_free_sched_ctx_batch(struct xarray *elv_tbl);
void blk_mq_free_sched_tags(struct elevator_tags *et,
struct blk_mq_tag_set *set);
void blk_mq_free_sched_tags_batch(struct xarray *et_table,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d626d32f6e57..1f5ef7fc9cda 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4983,27 +4983,28 @@ struct elevator_tags *blk_mq_update_nr_requests(struct request_queue *q,
* Switch back to the elevator type stored in the xarray.
*/
static void blk_mq_elv_switch_back(struct request_queue *q,
- struct xarray *elv_tbl, struct xarray *et_tbl)
+ struct xarray *elv_tbl)
{
- struct elevator_type *e = xa_load(elv_tbl, q->id);
- struct elevator_tags *t = xa_load(et_tbl, q->id);
+ struct elv_change_ctx *ctx = xa_load(elv_tbl, q->id);
+
+ if (WARN_ON_ONCE(!ctx))
+ return;
/* The elv_update_nr_hw_queues unfreezes the queue. */
- elv_update_nr_hw_queues(q, e, t);
+ elv_update_nr_hw_queues(q, ctx);
/* Drop the reference acquired in blk_mq_elv_switch_none. */
- if (e)
- elevator_put(e);
+ if (ctx->type)
+ elevator_put(ctx->type);
}
/*
- * Stores elevator type in xarray and set current elevator to none. It uses
- * q->id as an index to store the elevator type into the xarray.
+ * Stores elevator name and type in ctx and sets the current elevator to none.
*/
static int blk_mq_elv_switch_none(struct request_queue *q,
struct xarray *elv_tbl)
{
- int ret = 0;
+ struct elv_change_ctx *ctx;
lockdep_assert_held_write(&q->tag_set->update_nr_hwq_lock);
@@ -5015,10 +5016,11 @@ static int blk_mq_elv_switch_none(struct request_queue *q,
* can't run concurrently.
*/
if (q->elevator) {
+ ctx = xa_load(elv_tbl, q->id);
+ if (WARN_ON_ONCE(!ctx))
+ return -ENOENT;
- ret = xa_insert(elv_tbl, q->id, q->elevator->type, GFP_KERNEL);
- if (WARN_ON_ONCE(ret))
- return ret;
+ ctx->name = q->elevator->type->elevator_name;
/*
* Before we switch elevator to 'none', take a reference to
@@ -5029,9 +5031,14 @@ static int blk_mq_elv_switch_none(struct request_queue *q,
*/
__elevator_get(q->elevator->type);
+ /*
+ * Store elevator type so that we can release the reference
+ * taken above later.
+ */
+ ctx->type = q->elevator->type;
elevator_set_none(q);
}
- return ret;
+ return 0;
}
static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
@@ -5041,7 +5048,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
int prev_nr_hw_queues = set->nr_hw_queues;
unsigned int memflags;
int i;
- struct xarray elv_tbl, et_tbl;
+ struct xarray elv_tbl;
bool queues_frozen = false;
lockdep_assert_held(&set->tag_list_lock);
@@ -5055,11 +5062,12 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
memflags = memalloc_noio_save();
- xa_init(&et_tbl);
- if (blk_mq_alloc_sched_tags_batch(&et_tbl, set, nr_hw_queues) < 0)
- goto out_memalloc_restore;
-
xa_init(&elv_tbl);
+ if (blk_mq_alloc_sched_ctx_batch(&elv_tbl, set) < 0)
+ goto out_free_ctx;
+
+ if (blk_mq_alloc_sched_tags_batch(&elv_tbl, set, nr_hw_queues) < 0)
+ goto out_free_ctx;
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_debugfs_unregister_hctxs(q);
@@ -5105,7 +5113,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
/* switch_back expects queue to be frozen */
if (!queues_frozen)
blk_mq_freeze_queue_nomemsave(q);
- blk_mq_elv_switch_back(q, &elv_tbl, &et_tbl);
+ blk_mq_elv_switch_back(q, &elv_tbl);
}
list_for_each_entry(q, &set->tag_list, tag_set_list) {
@@ -5116,9 +5124,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
blk_mq_add_hw_queues_cpuhp(q);
}
+out_free_ctx:
+ blk_mq_free_sched_ctx_batch(&elv_tbl);
xa_destroy(&elv_tbl);
- xa_destroy(&et_tbl);
-out_memalloc_restore:
memalloc_noio_restore(memflags);
/* Free the excess tags when nr_hw_queues shrink. */
diff --git a/block/blk.h b/block/blk.h
index 170794632135..a7992680f9e1 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -11,8 +11,7 @@
#include <xen/xen.h>
#include "blk-crypto-internal.h"
-struct elevator_type;
-struct elevator_tags;
+struct elv_change_ctx;
/*
* Default upper limit for the software max_sectors limit used for regular I/Os.
@@ -333,8 +332,8 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
bool blk_insert_flush(struct request *rq);
-void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_type *e,
- struct elevator_tags *t);
+void elv_update_nr_hw_queues(struct request_queue *q,
+ struct elv_change_ctx *ctx);
void elevator_set_default(struct request_queue *q);
void elevator_set_none(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index e2ebfbf107b3..cd7bdff205c8 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -45,19 +45,6 @@
#include "blk-wbt.h"
#include "blk-cgroup.h"
-/* Holding context data for changing elevator */
-struct elv_change_ctx {
- const char *name;
- bool no_uevent;
-
- /* for unregistering old elevator */
- struct elevator_queue *old;
- /* for registering new elevator */
- struct elevator_queue *new;
- /* holds sched tags data */
- struct elevator_tags *et;
-};
-
static DEFINE_SPINLOCK(elv_list_lock);
static LIST_HEAD(elv_list);
@@ -706,32 +693,28 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
* The I/O scheduler depends on the number of hardware queues, this forces a
* reattachment when nr_hw_queues changes.
*/
-void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_type *e,
- struct elevator_tags *t)
+void elv_update_nr_hw_queues(struct request_queue *q,
+ struct elv_change_ctx *ctx)
{
struct blk_mq_tag_set *set = q->tag_set;
- struct elv_change_ctx ctx = {};
int ret = -ENODEV;
WARN_ON_ONCE(q->mq_freeze_depth == 0);
- if (e && !blk_queue_dying(q) && blk_queue_registered(q)) {
- ctx.name = e->elevator_name;
- ctx.et = t;
-
+ if (ctx->type && !blk_queue_dying(q) && blk_queue_registered(q)) {
mutex_lock(&q->elevator_lock);
/* force to reattach elevator after nr_hw_queue is updated */
- ret = elevator_switch(q, &ctx);
+ ret = elevator_switch(q, ctx);
mutex_unlock(&q->elevator_lock);
}
blk_mq_unfreeze_queue_nomemrestore(q);
if (!ret)
- WARN_ON_ONCE(elevator_change_done(q, &ctx));
+ WARN_ON_ONCE(elevator_change_done(q, ctx));
/*
* Free sched tags if it's allocated but we couldn't switch elevator.
*/
- if (t && !ctx.new)
- blk_mq_free_sched_tags(t, set);
+ if (ctx->et && !ctx->new)
+ blk_mq_free_sched_tags(ctx->et, set);
}
/*
diff --git a/block/elevator.h b/block/elevator.h
index c4d20155065e..bad43182361e 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -32,6 +32,21 @@ struct elevator_tags {
struct blk_mq_tags *tags[];
};
+/* Holding context data for changing elevator */
+struct elv_change_ctx {
+ const char *name;
+ bool no_uevent;
+
+ /* for unregistering old elevator */
+ struct elevator_queue *old;
+ /* for registering new elevator */
+ struct elevator_queue *new;
+ /* store elevator type */
+ struct elevator_type *type;
+ /* holds sched tags data */
+ struct elevator_tags *et;
+};
+
struct elevator_mq_ops {
int (*init_sched)(struct request_queue *, struct elevator_queue *);
void (*exit_sched)(struct elevator_queue *);
--
2.51.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCHv4 2/5] block: move elevator tags into struct elevator_resources
2025-11-10 8:14 [PATCHv4 0/5] block: restructure elevator switch path and fix a lockdep splat Nilay Shroff
2025-11-10 8:14 ` [PATCHv4 1/5] block: unify elevator tags and type xarrays into struct elv_change_ctx Nilay Shroff
@ 2025-11-10 8:14 ` Nilay Shroff
2025-11-11 2:52 ` Ming Lei
2025-11-10 8:14 ` [PATCHv4 3/5] block: introduce alloc_sched_data and free_sched_data elevator methods Nilay Shroff
` (2 subsequent siblings)
4 siblings, 1 reply; 18+ messages in thread
From: Nilay Shroff @ 2025-11-10 8:14 UTC (permalink / raw)
To: linux-block; +Cc: ming.lei, hch, axboe, yi.zhang, czhong, yukuai, gjoyce
This patch introduces a new structure, struct elevator_resources, to
group together all elevator-related resources that share the same
lifetime. As a first step, this change moves the elevator tag pointer
from struct elv_change_ctx into the new struct elevator_resources.
Additionally, rename blk_mq_alloc_sched_tags_batch() and
blk_mq_free_sched_tags_batch() to blk_mq_alloc_sched_res_batch() and
blk_mq_free_sched_res_batch(), respectively. Introduce two new wrapper
helpers, blk_mq_alloc_sched_res() and blk_mq_free_sched_res(), around
blk_mq_alloc_sched_tags() and blk_mq_free_sched_tags().
These changes pave the way for consolidating the allocation and freeing
of elevator-specific resources into common helper functions. This
refactoring improves encapsulation and prepares the code for future
extensions, allowing additional elevator-specific data to be added to
struct elevator_resources without cluttering struct elv_change_ctx.
Subsequent patches will extend struct elevator_resources to include
other elevator-related data.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-mq-sched.c | 55 ++++++++++++++++++++++++++++----------------
block/blk-mq-sched.h | 10 +++++---
block/blk-mq.c | 2 +-
block/elevator.c | 31 +++++++++++++------------
block/elevator.h | 9 ++++++--
5 files changed, 66 insertions(+), 41 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 3d9386555a50..c7091ea4dccd 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -427,7 +427,16 @@ void blk_mq_free_sched_tags(struct elevator_tags *et,
kfree(et);
}
-void blk_mq_free_sched_tags_batch(struct xarray *elv_tbl,
+void blk_mq_free_sched_res(struct elevator_resources *res,
+ struct blk_mq_tag_set *set)
+{
+ if (res->et) {
+ blk_mq_free_sched_tags(res->et, set);
+ res->et = NULL;
+ }
+}
+
+void blk_mq_free_sched_res_batch(struct xarray *elv_tbl,
struct blk_mq_tag_set *set)
{
struct request_queue *q;
@@ -445,12 +454,11 @@ void blk_mq_free_sched_tags_batch(struct xarray *elv_tbl,
*/
if (q->elevator) {
ctx = xa_load(elv_tbl, q->id);
- if (!ctx || !ctx->et) {
+ if (!ctx) {
WARN_ON_ONCE(1);
continue;
}
- blk_mq_free_sched_tags(ctx->et, set);
- ctx->et = NULL;
+ blk_mq_free_sched_res(&ctx->res, set);
}
}
}
@@ -532,12 +540,22 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
return NULL;
}
-int blk_mq_alloc_sched_tags_batch(struct xarray *elv_tbl,
+int blk_mq_alloc_sched_res(struct elevator_resources *res,
+ struct blk_mq_tag_set *set, unsigned int nr_hw_queues)
+{
+ res->et = blk_mq_alloc_sched_tags(set, nr_hw_queues,
+ blk_mq_default_nr_requests(set));
+ if (!res->et)
+ return -ENOMEM;
+
+ return 0;
+}
+
+int blk_mq_alloc_sched_res_batch(struct xarray *elv_tbl,
struct blk_mq_tag_set *set, unsigned int nr_hw_queues)
{
struct elv_change_ctx *ctx;
struct request_queue *q;
- struct elevator_tags *et;
int ret = -ENOMEM;
lockdep_assert_held_write(&set->update_nr_hwq_lock);
@@ -557,11 +575,10 @@ int blk_mq_alloc_sched_tags_batch(struct xarray *elv_tbl,
goto out_unwind;
}
- ctx->et = blk_mq_alloc_sched_tags(set, nr_hw_queues,
- blk_mq_default_nr_requests(set));
- if (!ctx->et)
+ ret = blk_mq_alloc_sched_res(&ctx->res, set,
+ nr_hw_queues);
+ if (ret)
goto out_unwind;
-
}
}
return 0;
@@ -569,10 +586,8 @@ int blk_mq_alloc_sched_tags_batch(struct xarray *elv_tbl,
list_for_each_entry_continue_reverse(q, &set->tag_list, tag_set_list) {
if (q->elevator) {
ctx = xa_load(elv_tbl, q->id);
- if (ctx && ctx->et) {
- blk_mq_free_sched_tags(ctx->et, set);
- ctx->et = NULL;
- }
+ if (ctx)
+ blk_mq_free_sched_res(&ctx->res, set);
}
}
return ret;
@@ -580,7 +595,7 @@ int blk_mq_alloc_sched_tags_batch(struct xarray *elv_tbl,
/* caller must have a reference to @e, will grab another one if successful */
int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
- struct elevator_tags *et)
+ struct elevator_resources *res)
{
unsigned int flags = q->tag_set->flags;
struct blk_mq_hw_ctx *hctx;
@@ -588,23 +603,23 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
unsigned long i;
int ret;
- eq = elevator_alloc(q, e, et);
+ eq = elevator_alloc(q, e, res->et);
if (!eq)
return -ENOMEM;
- q->nr_requests = et->nr_requests;
+ q->nr_requests = res->et->nr_requests;
if (blk_mq_is_shared_tags(flags)) {
/* Shared tags are stored at index 0 in @et->tags. */
- q->sched_shared_tags = et->tags[0];
- blk_mq_tag_update_sched_shared_tags(q, et->nr_requests);
+ q->sched_shared_tags = res->et->tags[0];
+ blk_mq_tag_update_sched_shared_tags(q, res->et->nr_requests);
}
queue_for_each_hw_ctx(q, hctx, i) {
if (blk_mq_is_shared_tags(flags))
hctx->sched_tags = q->sched_shared_tags;
else
- hctx->sched_tags = et->tags[i];
+ hctx->sched_tags = res->et->tags[i];
}
ret = e->ops.init_sched(q, eq);
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 2fddbc91a235..97204df76def 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -19,20 +19,24 @@ void __blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
- struct elevator_tags *et);
+ struct elevator_resources *res);
void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
void blk_mq_sched_free_rqs(struct request_queue *q);
struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
unsigned int nr_hw_queues, unsigned int nr_requests);
-int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
+int blk_mq_alloc_sched_res(struct elevator_resources *res,
+ struct blk_mq_tag_set *set, unsigned int nr_hw_queues);
+int blk_mq_alloc_sched_res_batch(struct xarray *elv_tbl,
struct blk_mq_tag_set *set, unsigned int nr_hw_queues);
int blk_mq_alloc_sched_ctx_batch(struct xarray *elv_tbl,
struct blk_mq_tag_set *set);
void blk_mq_free_sched_ctx_batch(struct xarray *elv_tbl);
void blk_mq_free_sched_tags(struct elevator_tags *et,
struct blk_mq_tag_set *set);
-void blk_mq_free_sched_tags_batch(struct xarray *et_table,
+void blk_mq_free_sched_res(struct elevator_resources *res,
+ struct blk_mq_tag_set *set);
+void blk_mq_free_sched_res_batch(struct xarray *et_table,
struct blk_mq_tag_set *set);
static inline void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1f5ef7fc9cda..2535271875bb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -5066,7 +5066,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
if (blk_mq_alloc_sched_ctx_batch(&elv_tbl, set) < 0)
goto out_free_ctx;
- if (blk_mq_alloc_sched_tags_batch(&elv_tbl, set, nr_hw_queues) < 0)
+ if (blk_mq_alloc_sched_res_batch(&elv_tbl, set, nr_hw_queues) < 0)
goto out_free_ctx;
list_for_each_entry(q, &set->tag_list, tag_set_list) {
diff --git a/block/elevator.c b/block/elevator.c
index cd7bdff205c8..7fd3c547833c 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -580,7 +580,7 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
}
if (new_e) {
- ret = blk_mq_init_sched(q, new_e, ctx->et);
+ ret = blk_mq_init_sched(q, new_e, &ctx->res);
if (ret)
goto out_unfreeze;
ctx->new = q->elevator;
@@ -604,7 +604,8 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
return ret;
}
-static void elv_exit_and_release(struct request_queue *q)
+static void elv_exit_and_release(struct elv_change_ctx *ctx,
+ struct request_queue *q)
{
struct elevator_queue *e;
unsigned memflags;
@@ -616,7 +617,7 @@ static void elv_exit_and_release(struct request_queue *q)
mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
if (e) {
- blk_mq_free_sched_tags(e->et, q->tag_set);
+ blk_mq_free_sched_res(&ctx->res, q->tag_set);
kobject_put(&e->kobj);
}
}
@@ -627,11 +628,12 @@ static int elevator_change_done(struct request_queue *q,
int ret = 0;
if (ctx->old) {
+ struct elevator_resources res = {.et = ctx->old->et};
bool enable_wbt = test_bit(ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT,
&ctx->old->flags);
elv_unregister_queue(q, ctx->old);
- blk_mq_free_sched_tags(ctx->old->et, q->tag_set);
+ blk_mq_free_sched_res(&res, q->tag_set);
kobject_put(&ctx->old->kobj);
if (enable_wbt)
wbt_enable_default(q->disk);
@@ -639,7 +641,7 @@ static int elevator_change_done(struct request_queue *q,
if (ctx->new) {
ret = elv_register_queue(q, ctx->new, !ctx->no_uevent);
if (ret)
- elv_exit_and_release(q);
+ elv_exit_and_release(ctx, q);
}
return ret;
}
@@ -656,10 +658,9 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
lockdep_assert_held(&set->update_nr_hwq_lock);
if (strncmp(ctx->name, "none", 4)) {
- ctx->et = blk_mq_alloc_sched_tags(set, set->nr_hw_queues,
- blk_mq_default_nr_requests(set));
- if (!ctx->et)
- return -ENOMEM;
+ ret = blk_mq_alloc_sched_res(&ctx->res, set, set->nr_hw_queues);
+ if (ret)
+ return ret;
}
memflags = blk_mq_freeze_queue(q);
@@ -681,10 +682,10 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
if (!ret)
ret = elevator_change_done(q, ctx);
/*
- * Free sched tags if it's allocated but we couldn't switch elevator.
+ * Free sched resource if it's allocated but we couldn't switch elevator.
*/
- if (ctx->et && !ctx->new)
- blk_mq_free_sched_tags(ctx->et, set);
+ if (!ctx->new)
+ blk_mq_free_sched_res(&ctx->res, set);
return ret;
}
@@ -711,10 +712,10 @@ void elv_update_nr_hw_queues(struct request_queue *q,
if (!ret)
WARN_ON_ONCE(elevator_change_done(q, ctx));
/*
- * Free sched tags if it's allocated but we couldn't switch elevator.
+ * Free sched resource if it's allocated but we couldn't switch elevator.
*/
- if (ctx->et && !ctx->new)
- blk_mq_free_sched_tags(ctx->et, set);
+ if (!ctx->new)
+ blk_mq_free_sched_res(&ctx->res, set);
}
/*
diff --git a/block/elevator.h b/block/elevator.h
index bad43182361e..621a63597249 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -32,6 +32,11 @@ struct elevator_tags {
struct blk_mq_tags *tags[];
};
+struct elevator_resources {
+ /* holds elevator tags */
+ struct elevator_tags *et;
+};
+
/* Holding context data for changing elevator */
struct elv_change_ctx {
const char *name;
@@ -43,8 +48,8 @@ struct elv_change_ctx {
struct elevator_queue *new;
/* store elevator type */
struct elevator_type *type;
- /* holds sched tags data */
- struct elevator_tags *et;
+ /* store elevator resources */
+ struct elevator_resources res;
};
struct elevator_mq_ops {
--
2.51.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCHv4 3/5] block: introduce alloc_sched_data and free_sched_data elevator methods
2025-11-10 8:14 [PATCHv4 0/5] block: restructure elevator switch path and fix a lockdep splat Nilay Shroff
2025-11-10 8:14 ` [PATCHv4 1/5] block: unify elevator tags and type xarrays into struct elv_change_ctx Nilay Shroff
2025-11-10 8:14 ` [PATCHv4 2/5] block: move elevator tags into struct elevator_resources Nilay Shroff
@ 2025-11-10 8:14 ` Nilay Shroff
2025-11-11 2:53 ` Ming Lei
2025-11-11 7:20 ` Yu Kuai
2025-11-10 8:14 ` [PATCHv4 4/5] block: use {alloc|free}_sched data methods Nilay Shroff
2025-11-10 8:14 ` [PATCHv4 5/5] block: define alloc_sched_data and free_sched_data methods for kyber Nilay Shroff
4 siblings, 2 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-11-10 8:14 UTC (permalink / raw)
To: linux-block; +Cc: ming.lei, hch, axboe, yi.zhang, czhong, yukuai, gjoyce
The recent lockdep splat [1] highlights a potential deadlock risk
involving ->elevator_lock and ->freeze_lock dependencies on
pcpu_alloc_mutex. The trace shows that the issue occurs when the Kyber
scheduler allocates dynamic memory for its elevator data during
initialization.
To address this, introduce two new elevator operation callbacks:
->alloc_sched_data and ->free_sched_data. A subsequent patch will
build upon these newly introduced methods to suppress the lockdep
splat [1].
[1] https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/
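As an illustration only (a hypothetical "foo" scheduler and its
foo_data structure, not part of this series), an elevator would wire
up the new callbacks roughly as follows:

    static void *foo_alloc_sched_data(struct request_queue *q)
    {
        /* called without ->freeze_lock or ->elevator_lock held */
        return kzalloc(sizeof(struct foo_data), GFP_KERNEL);
    }

    static void foo_free_sched_data(void *data)
    {
        kfree(data);    /* kfree(NULL) is a no-op */
    }

    static struct elevator_type foo_sched = {
        .ops = {
            .alloc_sched_data   = foo_alloc_sched_data,
            .free_sched_data    = foo_free_sched_data,
            /* other mandatory ops (init_sched, etc.) go here */
        },
        .elevator_name = "foo",
    };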
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-mq-sched.h | 17 +++++++++++++++++
block/elevator.h | 2 ++
2 files changed, 19 insertions(+)
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 97204df76def..d38911d0d9eb 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -39,6 +39,23 @@ void blk_mq_free_sched_res(struct elevator_resources *res,
void blk_mq_free_sched_res_batch(struct xarray *et_table,
struct blk_mq_tag_set *set);
+static inline int blk_mq_alloc_sched_data(struct request_queue *q,
+ struct elevator_type *e, void **data)
+{
+ if (e && e->ops.alloc_sched_data) {
+ *data = e->ops.alloc_sched_data(q);
+ if (!*data)
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+static inline void blk_mq_free_sched_data(struct elevator_type *e, void *data)
+{
+ if (e && e->ops.free_sched_data)
+ e->ops.free_sched_data(data);
+}
+
static inline void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
{
if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
diff --git a/block/elevator.h b/block/elevator.h
index 621a63597249..e34043f6da26 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -58,6 +58,8 @@ struct elevator_mq_ops {
int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
void (*depth_updated)(struct request_queue *);
+ void *(*alloc_sched_data)(struct request_queue *);
+ void (*free_sched_data)(void *);
bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
bool (*bio_merge)(struct request_queue *, struct bio *, unsigned int);
--
2.51.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCHv4 4/5] block: use {alloc|free}_sched data methods
2025-11-10 8:14 [PATCHv4 0/5] block: restructure elevator switch path and fix a lockdep splat Nilay Shroff
` (2 preceding siblings ...)
2025-11-10 8:14 ` [PATCHv4 3/5] block: introduce alloc_sched_data and free_sched_data elevator methods Nilay Shroff
@ 2025-11-10 8:14 ` Nilay Shroff
2025-11-11 2:58 ` Ming Lei
2025-11-10 8:14 ` [PATCHv4 5/5] block: define alloc_sched_data and free_sched_data methods for kyber Nilay Shroff
4 siblings, 1 reply; 18+ messages in thread
From: Nilay Shroff @ 2025-11-10 8:14 UTC (permalink / raw)
To: linux-block; +Cc: ming.lei, hch, axboe, yi.zhang, czhong, yukuai, gjoyce
The previous patch introduced the ->alloc_sched_data and
->free_sched_data methods. This patch builds upon that by using these
methods during elevator switch and nr_hw_queues update.
It also ensures that scheduler-specific data is allocated and freed
through the new callbacks outside of the ->freeze_lock and
->elevator_lock locking contexts, thereby preventing any dependency on
pcpu_alloc_mutex, as sketched below.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-mq-sched.c | 32 ++++++++++++++++++++++++--------
block/blk-mq-sched.h | 8 ++++++--
block/elevator.c | 34 ++++++++++++++++++++++------------
block/elevator.h | 4 +++-
4 files changed, 55 insertions(+), 23 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index c7091ea4dccd..0ea8f0004274 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -428,12 +428,17 @@ void blk_mq_free_sched_tags(struct elevator_tags *et,
}
void blk_mq_free_sched_res(struct elevator_resources *res,
+ struct elevator_type *type,
struct blk_mq_tag_set *set)
{
if (res->et) {
blk_mq_free_sched_tags(res->et, set);
res->et = NULL;
}
+ if (res->data) {
+ blk_mq_free_sched_data(type, res->data);
+ res->data = NULL;
+ }
}
void blk_mq_free_sched_res_batch(struct xarray *elv_tbl,
@@ -458,7 +463,7 @@ void blk_mq_free_sched_res_batch(struct xarray *elv_tbl,
WARN_ON_ONCE(1);
continue;
}
- blk_mq_free_sched_res(&ctx->res, set);
+ blk_mq_free_sched_res(&ctx->res, ctx->type, set);
}
}
}
@@ -540,15 +545,24 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
return NULL;
}
-int blk_mq_alloc_sched_res(struct elevator_resources *res,
- struct blk_mq_tag_set *set, unsigned int nr_hw_queues)
+int blk_mq_alloc_sched_res(struct request_queue *q,
+ struct elevator_type *type,
+ struct elevator_resources *res,
+ struct blk_mq_tag_set *set,
+ unsigned int nr_hw_queues)
{
+ int ret;
+
res->et = blk_mq_alloc_sched_tags(set, nr_hw_queues,
blk_mq_default_nr_requests(set));
if (!res->et)
return -ENOMEM;
- return 0;
+ ret = blk_mq_alloc_sched_data(q, type, &res->data);
+ if (ret)
+ blk_mq_free_sched_tags(res->et, set);
+
+ return ret;
}
int blk_mq_alloc_sched_res_batch(struct xarray *elv_tbl,
@@ -575,19 +589,21 @@ int blk_mq_alloc_sched_res_batch(struct xarray *elv_tbl,
goto out_unwind;
}
- ret = blk_mq_alloc_sched_res(&ctx->res, set,
- nr_hw_queues);
+ ret = blk_mq_alloc_sched_res(q, q->elevator->type,
+ &ctx->res, set, nr_hw_queues);
if (ret)
goto out_unwind;
}
}
return 0;
+
out_unwind:
list_for_each_entry_continue_reverse(q, &set->tag_list, tag_set_list) {
if (q->elevator) {
ctx = xa_load(elv_tbl, q->id);
if (ctx)
- blk_mq_free_sched_res(&ctx->res, set);
+ blk_mq_free_sched_res(&ctx->res,
+ ctx->type, set);
}
}
return ret;
@@ -603,7 +619,7 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
unsigned long i;
int ret;
- eq = elevator_alloc(q, e, res->et);
+ eq = elevator_alloc(q, e, res);
if (!eq)
return -ENOMEM;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index d38911d0d9eb..acd4f1355be6 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -25,8 +25,11 @@ void blk_mq_sched_free_rqs(struct request_queue *q);
struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
unsigned int nr_hw_queues, unsigned int nr_requests);
-int blk_mq_alloc_sched_res(struct elevator_resources *res,
- struct blk_mq_tag_set *set, unsigned int nr_hw_queues);
+int blk_mq_alloc_sched_res(struct request_queue *q,
+ struct elevator_type *type,
+ struct elevator_resources *res,
+ struct blk_mq_tag_set *set,
+ unsigned int nr_hw_queues);
int blk_mq_alloc_sched_res_batch(struct xarray *elv_tbl,
struct blk_mq_tag_set *set, unsigned int nr_hw_queues);
int blk_mq_alloc_sched_ctx_batch(struct xarray *elv_tbl,
@@ -35,6 +38,7 @@ void blk_mq_free_sched_ctx_batch(struct xarray *elv_tbl);
void blk_mq_free_sched_tags(struct elevator_tags *et,
struct blk_mq_tag_set *set);
void blk_mq_free_sched_res(struct elevator_resources *res,
+ struct elevator_type *type,
struct blk_mq_tag_set *set);
void blk_mq_free_sched_res_batch(struct xarray *et_table,
struct blk_mq_tag_set *set);
diff --git a/block/elevator.c b/block/elevator.c
index 7fd3c547833c..67500fbbfaf0 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -121,7 +121,7 @@ static struct elevator_type *elevator_find_get(const char *name)
static const struct kobj_type elv_ktype;
struct elevator_queue *elevator_alloc(struct request_queue *q,
- struct elevator_type *e, struct elevator_tags *et)
+ struct elevator_type *e, struct elevator_resources *res)
{
struct elevator_queue *eq;
@@ -134,7 +134,8 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
kobject_init(&eq->kobj, &elv_ktype);
mutex_init(&eq->sysfs_lock);
hash_init(eq->hash);
- eq->et = et;
+ eq->et = res->et;
+ eq->elevator_data = res->data;
return eq;
}
@@ -617,7 +618,7 @@ static void elv_exit_and_release(struct elv_change_ctx *ctx,
mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
if (e) {
- blk_mq_free_sched_res(&ctx->res, q->tag_set);
+ blk_mq_free_sched_res(&ctx->res, ctx->type, q->tag_set);
kobject_put(&e->kobj);
}
}
@@ -628,12 +629,15 @@ static int elevator_change_done(struct request_queue *q,
int ret = 0;
if (ctx->old) {
- struct elevator_resources res = {.et = ctx->old->et};
+ struct elevator_resources res = {
+ .et = ctx->old->et,
+ .data = ctx->old->elevator_data
+ };
bool enable_wbt = test_bit(ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT,
&ctx->old->flags);
elv_unregister_queue(q, ctx->old);
- blk_mq_free_sched_res(&res, q->tag_set);
+ blk_mq_free_sched_res(&res, ctx->old->type, q->tag_set);
kobject_put(&ctx->old->kobj);
if (enable_wbt)
wbt_enable_default(q->disk);
@@ -658,7 +662,8 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
lockdep_assert_held(&set->update_nr_hwq_lock);
if (strncmp(ctx->name, "none", 4)) {
- ret = blk_mq_alloc_sched_res(&ctx->res, set, set->nr_hw_queues);
+ ret = blk_mq_alloc_sched_res(q, ctx->type, &ctx->res, set,
+ set->nr_hw_queues);
if (ret)
return ret;
}
@@ -681,11 +686,12 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
blk_mq_unfreeze_queue(q, memflags);
if (!ret)
ret = elevator_change_done(q, ctx);
+
/*
* Free sched resource if it's allocated but we couldn't switch elevator.
*/
if (!ctx->new)
- blk_mq_free_sched_res(&ctx->res, set);
+ blk_mq_free_sched_res(&ctx->res, ctx->type, set);
return ret;
}
@@ -711,11 +717,12 @@ void elv_update_nr_hw_queues(struct request_queue *q,
blk_mq_unfreeze_queue_nomemrestore(q);
if (!ret)
WARN_ON_ONCE(elevator_change_done(q, ctx));
+
/*
* Free sched resource if it's allocated but we couldn't switch elevator.
*/
if (!ctx->new)
- blk_mq_free_sched_res(&ctx->res, set);
+ blk_mq_free_sched_res(&ctx->res, ctx->type, set);
}
/*
@@ -729,7 +736,6 @@ void elevator_set_default(struct request_queue *q)
.no_uevent = true,
};
int err;
- struct elevator_type *e;
/* now we allow to switch elevator */
blk_queue_flag_clear(QUEUE_FLAG_NO_ELV_SWITCH, q);
@@ -742,8 +748,8 @@ void elevator_set_default(struct request_queue *q)
* have multiple queues or mq-deadline is not available, default
* to "none".
*/
- e = elevator_find_get(ctx.name);
- if (!e)
+ ctx.type = elevator_find_get(ctx.name);
+ if (!ctx.type)
return;
if ((q->nr_hw_queues == 1 ||
@@ -753,7 +759,7 @@ void elevator_set_default(struct request_queue *q)
pr_warn("\"%s\" elevator initialization, failed %d, falling back to \"none\"\n",
ctx.name, err);
}
- elevator_put(e);
+ elevator_put(ctx.type);
}
void elevator_set_none(struct request_queue *q)
@@ -802,6 +808,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
ctx.name = strstrip(elevator_name);
elv_iosched_load_module(ctx.name);
+ ctx.type = elevator_find_get(ctx.name);
down_read(&set->update_nr_hwq_lock);
if (!blk_queue_no_elv_switch(q)) {
@@ -812,6 +819,9 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
ret = -ENOENT;
}
up_read(&set->update_nr_hwq_lock);
+
+ if (ctx.type)
+ elevator_put(ctx.type);
return ret;
}
diff --git a/block/elevator.h b/block/elevator.h
index e34043f6da26..3ee1d494f48a 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -33,6 +33,8 @@ struct elevator_tags {
};
struct elevator_resources {
+ /* holds elevator data */
+ void *data;
/* holds elevator tags */
struct elevator_tags *et;
};
@@ -185,7 +187,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *page, size_t count);
extern bool elv_bio_merge_ok(struct request *, struct bio *);
struct elevator_queue *elevator_alloc(struct request_queue *,
- struct elevator_type *, struct elevator_tags *);
+ struct elevator_type *, struct elevator_resources *);
/*
* Helper functions.
--
2.51.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCHv4 5/5] block: define alloc_sched_data and free_sched_data methods for kyber
2025-11-10 8:14 [PATCHv4 0/5] block: restructure elevator switch path and fix a lockdep splat Nilay Shroff
` (3 preceding siblings ...)
2025-11-10 8:14 ` [PATCHv4 4/5] block: use {alloc|free}_sched data methods Nilay Shroff
@ 2025-11-10 8:14 ` Nilay Shroff
2025-11-11 3:01 ` Ming Lei
4 siblings, 1 reply; 18+ messages in thread
From: Nilay Shroff @ 2025-11-10 8:14 UTC (permalink / raw)
To: linux-block; +Cc: ming.lei, hch, axboe, yi.zhang, czhong, yukuai, gjoyce
Currently, the Kyber elevator allocates its private data dynamically in
->init_sched and frees it in ->exit_sched. However, since ->init_sched
is invoked during elevator switch after acquiring both ->freeze_lock and
->elevator_lock, it may trigger the lockdep splat [1] due to dependency
on pcpu_alloc_mutex.
To resolve this, move the elevator data allocation and deallocation
logic from ->init_sched and ->exit_sched into the newly introduced
->alloc_sched_data and ->free_sched_data methods. These callbacks are
invoked before acquiring ->freeze_lock and ->elevator_lock, ensuring
that memory allocation happens safely without introducing additional
locking dependencies.
This change breaks the dependency chain involving pcpu_alloc_mutex and
prevents the reported lockdep warning.
[1] https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/
Reported-by: Changhui Zhong <czhong@redhat.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/kyber-iosched.c | 30 ++++++++++++++++++++++--------
1 file changed, 22 insertions(+), 8 deletions(-)
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 18efd6ef2a2b..c1b36ffd19ce 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -409,30 +409,42 @@ static void kyber_depth_updated(struct request_queue *q)
static int kyber_init_sched(struct request_queue *q, struct elevator_queue *eq)
{
- struct kyber_queue_data *kqd;
-
- kqd = kyber_queue_data_alloc(q);
- if (IS_ERR(kqd))
- return PTR_ERR(kqd);
-
blk_stat_enable_accounting(q);
blk_queue_flag_clear(QUEUE_FLAG_SQ_SCHED, q);
- eq->elevator_data = kqd;
q->elevator = eq;
kyber_depth_updated(q);
return 0;
}
+static void *kyber_alloc_sched_data(struct request_queue *q)
+{
+ struct kyber_queue_data *kqd;
+
+ kqd = kyber_queue_data_alloc(q);
+ if (IS_ERR(kqd))
+ return NULL;
+
+ return kqd;
+}
+
static void kyber_exit_sched(struct elevator_queue *e)
{
struct kyber_queue_data *kqd = e->elevator_data;
- int i;
timer_shutdown_sync(&kqd->timer);
blk_stat_disable_accounting(kqd->q);
+}
+
+static void kyber_free_sched_data(void *elv_data)
+{
+ struct kyber_queue_data *kqd = elv_data;
+ int i;
+
+ if (!kqd)
+ return;
for (i = 0; i < KYBER_NUM_DOMAINS; i++)
sbitmap_queue_free(&kqd->domain_tokens[i]);
@@ -1004,6 +1016,8 @@ static struct elevator_type kyber_sched = {
.exit_sched = kyber_exit_sched,
.init_hctx = kyber_init_hctx,
.exit_hctx = kyber_exit_hctx,
+ .alloc_sched_data = kyber_alloc_sched_data,
+ .free_sched_data = kyber_free_sched_data,
.limit_depth = kyber_limit_depth,
.bio_merge = kyber_bio_merge,
.prepare_request = kyber_prepare_request,
--
2.51.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCHv4 2/5] block: move elevator tags into struct elevator_resources
2025-11-10 8:14 ` [PATCHv4 2/5] block: move elevator tags into struct elevator_resources Nilay Shroff
@ 2025-11-11 2:52 ` Ming Lei
2025-11-11 6:49 ` Nilay Shroff
0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2025-11-11 2:52 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, hch, axboe, yi.zhang, czhong, yukuai, gjoyce
On Mon, Nov 10, 2025 at 01:44:49PM +0530, Nilay Shroff wrote:
> This patch introduces a new structure, struct elevator_resources, to
> group together all elevator-related resources that share the same
> lifetime. As a first step, this change moves the elevator tag pointer
> from struct elv_change_ctx into the new struct elevator_resources.
>
> Additionally, rename blk_mq_alloc_sched_tags_batch() and
> blk_mq_free_sched_tags_batch() to blk_mq_alloc_sched_res_batch() and
> blk_mq_free_sched_res_batch(), respectively. Introduce two new wrapper
> helpers, blk_mq_alloc_sched_res() and blk_mq_free_sched_res(), around
> blk_mq_alloc_sched_tags() and blk_mq_free_sched_tags().
>
> These changes pave the way for consolidating the allocation and freeing
> of elevator-specific resources into common helper functions. This
> refactoring improves encapsulation and prepares the code for future
> extensions, allowing additional elevator-specific data to be added to
> struct elevator_resources without cluttering struct elv_change_ctx.
>
> Subsequent patches will extend struct elevator_resources to include
> other elevator-related data.
>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> block/blk-mq-sched.c | 55 ++++++++++++++++++++++++++++----------------
> block/blk-mq-sched.h | 10 +++++---
> block/blk-mq.c | 2 +-
> block/elevator.c | 31 +++++++++++++------------
> block/elevator.h | 9 ++++++--
> 5 files changed, 66 insertions(+), 41 deletions(-)
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 3d9386555a50..c7091ea4dccd 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -427,7 +427,16 @@ void blk_mq_free_sched_tags(struct elevator_tags *et,
> kfree(et);
> }
>
> -void blk_mq_free_sched_tags_batch(struct xarray *elv_tbl,
> +void blk_mq_free_sched_res(struct elevator_resources *res,
> + struct blk_mq_tag_set *set)
> +{
> + if (res->et) {
> + blk_mq_free_sched_tags(res->et, set);
> + res->et = NULL;
> + }
> +}
> +
> +void blk_mq_free_sched_res_batch(struct xarray *elv_tbl,
> struct blk_mq_tag_set *set)
> {
> struct request_queue *q;
> @@ -445,12 +454,11 @@ void blk_mq_free_sched_tags_batch(struct xarray *elv_tbl,
> */
> if (q->elevator) {
> ctx = xa_load(elv_tbl, q->id);
> - if (!ctx || !ctx->et) {
> + if (!ctx) {
> WARN_ON_ONCE(1);
> continue;
> }
> - blk_mq_free_sched_tags(ctx->et, set);
> - ctx->et = NULL;
> + blk_mq_free_sched_res(&ctx->res, set);
> }
> }
> }
> @@ -532,12 +540,22 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
> return NULL;
> }
>
> -int blk_mq_alloc_sched_tags_batch(struct xarray *elv_tbl,
> +int blk_mq_alloc_sched_res(struct elevator_resources *res,
> + struct blk_mq_tag_set *set, unsigned int nr_hw_queues)
In patch 4, a `struct request_queue *` parameter is added to
blk_mq_alloc_sched_res(), so why not add it from the beginning?
Then the `struct blk_mq_tag_set *set` parameter can be avoided.
Similarly for blk_mq_free_sched_res().
This way is more readable, because the scheduler is request_queue
wide.
> +{
> + res->et = blk_mq_alloc_sched_tags(set, nr_hw_queues,
> + blk_mq_default_nr_requests(set));
> + if (!res->et)
> + return -ENOMEM;
> +
> + return 0;
> +}
> +
> +int blk_mq_alloc_sched_res_batch(struct xarray *elv_tbl,
> struct blk_mq_tag_set *set, unsigned int nr_hw_queues)
> {
> struct elv_change_ctx *ctx;
> struct request_queue *q;
> - struct elevator_tags *et;
> int ret = -ENOMEM;
>
> lockdep_assert_held_write(&set->update_nr_hwq_lock);
> @@ -557,11 +575,10 @@ int blk_mq_alloc_sched_tags_batch(struct xarray *elv_tbl,
> goto out_unwind;
> }
>
> - ctx->et = blk_mq_alloc_sched_tags(set, nr_hw_queues,
> - blk_mq_default_nr_requests(set));
> - if (!ctx->et)
> + ret = blk_mq_alloc_sched_res(&ctx->res, set,
> + nr_hw_queues);
> + if (ret)
> goto out_unwind;
> -
> }
> }
> return 0;
> @@ -569,10 +586,8 @@ int blk_mq_alloc_sched_tags_batch(struct xarray *elv_tbl,
> list_for_each_entry_continue_reverse(q, &set->tag_list, tag_set_list) {
> if (q->elevator) {
> ctx = xa_load(elv_tbl, q->id);
> - if (ctx && ctx->et) {
> - blk_mq_free_sched_tags(ctx->et, set);
> - ctx->et = NULL;
> - }
> + if (ctx)
> + blk_mq_free_sched_res(&ctx->res, set);
> }
> }
> return ret;
> @@ -580,7 +595,7 @@ int blk_mq_alloc_sched_tags_batch(struct xarray *elv_tbl,
>
> /* caller must have a reference to @e, will grab another one if successful */
> int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
> - struct elevator_tags *et)
> + struct elevator_resources *res)
> {
> unsigned int flags = q->tag_set->flags;
> struct blk_mq_hw_ctx *hctx;
> @@ -588,23 +603,23 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
> unsigned long i;
> int ret;
>
> - eq = elevator_alloc(q, e, et);
> + eq = elevator_alloc(q, e, res->et);
> if (!eq)
> return -ENOMEM;
>
> - q->nr_requests = et->nr_requests;
> + q->nr_requests = res->et->nr_requests;
>
> if (blk_mq_is_shared_tags(flags)) {
> /* Shared tags are stored at index 0 in @et->tags. */
> - q->sched_shared_tags = et->tags[0];
> - blk_mq_tag_update_sched_shared_tags(q, et->nr_requests);
> + q->sched_shared_tags = res->et->tags[0];
> + blk_mq_tag_update_sched_shared_tags(q, res->et->nr_requests);
> }
>
> queue_for_each_hw_ctx(q, hctx, i) {
> if (blk_mq_is_shared_tags(flags))
> hctx->sched_tags = q->sched_shared_tags;
> else
> - hctx->sched_tags = et->tags[i];
> + hctx->sched_tags = res->et->tags[i];
> }
Adding one local variable 'et' could eliminate all of the above changes;
it looks like you prefer the big patch, but up to you.
Thanks,
Ming
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCHv4 3/5] block: introduce alloc_sched_data and free_sched_data elevator methods
2025-11-10 8:14 ` [PATCHv4 3/5] block: introduce alloc_sched_data and free_sched_data elevator methods Nilay Shroff
@ 2025-11-11 2:53 ` Ming Lei
2025-11-11 7:20 ` Yu Kuai
1 sibling, 0 replies; 18+ messages in thread
From: Ming Lei @ 2025-11-11 2:53 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, hch, axboe, yi.zhang, czhong, yukuai, gjoyce
On Mon, Nov 10, 2025 at 01:44:50PM +0530, Nilay Shroff wrote:
> The recent lockdep splat [1] highlights a potential deadlock risk
> involving ->elevator_lock and ->freeze_lock dependencies on -pcpu_alloc_
> mutex. The trace shows that the issue occurs when the Kyber scheduler
> allocates dynamic memory for its elevator data during initialization.
>
> To address this, introduce two new elevator operation callbacks:
> ->alloc_sched_data and ->free_sched_data. The subsequent patch would
> build upon these newly introduced methods to suppress lockdep splat[1].
>
> [1] https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/
>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Thanks,
Ming
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCHv4 4/5] block: use {alloc|free}_sched data methods
2025-11-10 8:14 ` [PATCHv4 4/5] block: use {alloc|free}_sched data methods Nilay Shroff
@ 2025-11-11 2:58 ` Ming Lei
2025-11-11 6:51 ` Nilay Shroff
0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2025-11-11 2:58 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, hch, axboe, yi.zhang, czhong, yukuai, gjoyce
On Mon, Nov 10, 2025 at 01:44:51PM +0530, Nilay Shroff wrote:
> The previous patch introduced ->alloc_sched_data and
> ->free_sched_data methods. This patch builds upon that
> by now using these methods during elevator switch and
> nr_hw_queue update.
>
> It's also ensured that scheduler-specific data is
> allocated and freed through the new callbacks outside
> of the ->freeze_lock and ->elevator_lock locking contexts,
> thereby preventing any dependency on pcpu_alloc_mutex.
>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> block/blk-mq-sched.c | 32 ++++++++++++++++++++++++--------
> block/blk-mq-sched.h | 8 ++++++--
> block/elevator.c | 34 ++++++++++++++++++++++------------
> block/elevator.h | 4 +++-
> 4 files changed, 55 insertions(+), 23 deletions(-)
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index c7091ea4dccd..0ea8f0004274 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -428,12 +428,17 @@ void blk_mq_free_sched_tags(struct elevator_tags *et,
> }
>
> void blk_mq_free_sched_res(struct elevator_resources *res,
> + struct elevator_type *type,
> struct blk_mq_tag_set *set)
> {
> if (res->et) {
> blk_mq_free_sched_tags(res->et, set);
> res->et = NULL;
> }
> + if (res->data) {
> + blk_mq_free_sched_data(type, res->data);
> + res->data = NULL;
> + }
> }
>
> void blk_mq_free_sched_res_batch(struct xarray *elv_tbl,
> @@ -458,7 +463,7 @@ void blk_mq_free_sched_res_batch(struct xarray *elv_tbl,
> WARN_ON_ONCE(1);
> continue;
> }
> - blk_mq_free_sched_res(&ctx->res, set);
> + blk_mq_free_sched_res(&ctx->res, ctx->type, set);
> }
> }
> }
> @@ -540,15 +545,24 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
> return NULL;
> }
>
> -int blk_mq_alloc_sched_res(struct elevator_resources *res,
> - struct blk_mq_tag_set *set, unsigned int nr_hw_queues)
> +int blk_mq_alloc_sched_res(struct request_queue *q,
> + struct elevator_type *type,
> + struct elevator_resources *res,
> + struct blk_mq_tag_set *set,
> + unsigned int nr_hw_queues)
As mentioned, the `struct request_queue *q` parameter can be added from
the beginning; then `struct blk_mq_tag_set *set` can be avoided.
Otherwise, this patch looks fine.
Thanks,
Ming
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCHv4 5/5] block: define alloc_sched_data and free_sched_data methods for kyber
2025-11-10 8:14 ` [PATCHv4 5/5] block: define alloc_sched_data and free_sched_data methods for kyber Nilay Shroff
@ 2025-11-11 3:01 ` Ming Lei
0 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2025-11-11 3:01 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, hch, axboe, yi.zhang, czhong, yukuai, gjoyce
On Mon, Nov 10, 2025 at 01:44:52PM +0530, Nilay Shroff wrote:
> Currently, the Kyber elevator allocates its private data dynamically in
> ->init_sched and frees it in ->exit_sched. However, since ->init_sched
> is invoked during elevator switch after acquiring both ->freeze_lock and
> ->elevator_lock, it may trigger the lockdep splat [1] due to dependency
> on pcpu_alloc_mutex.
>
> To resolve this, move the elevator data allocation and deallocation
> logic from ->init_sched and ->exit_sched into the newly introduced
> ->alloc_sched_data and ->free_sched_data methods. These callbacks are
> invoked before acquiring ->freeze_lock and ->elevator_lock, ensuring
> that memory allocation happens safely without introducing additional
> locking dependencies.
>
> This change breaks the dependency chain involving pcpu_alloc_mutex and
> prevents the reported lockdep warning.
>
> [1] https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/
>
> Reported-by: Changhui Zhong <czhong@redhat.com>
> Reported-by: Yi Zhang <yi.zhang@redhat.com>
> Closes: https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/
> Tested-by: Yi Zhang <yi.zhang@redhat.com>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Thanks,
Ming
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCHv4 2/5] block: move elevator tags into struct elevator_resources
2025-11-11 2:52 ` Ming Lei
@ 2025-11-11 6:49 ` Nilay Shroff
0 siblings, 0 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-11-11 6:49 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, hch, axboe, yi.zhang, czhong, yukuai, gjoyce
>>
>> -int blk_mq_alloc_sched_tags_batch(struct xarray *elv_tbl,
>> +int blk_mq_alloc_sched_res(struct elevator_resources *res,
>> + struct blk_mq_tag_set *set, unsigned int nr_hw_queues)
>
> In patch 4, `struct request_queue *` is added to the parameters of
> blk_mq_alloc_sched_res(), so why not add it from the beginning?
> Then `struct blk_mq_tag_set *set` can be avoided.
>
> The same applies to blk_mq_free_sched_res().
>
> This way is more readable, because the scheduler is request_queue
> wide.
>
Yes, this makes sense to me, and good point! I'll spin
another version and address it.
>> /* caller must have a reference to @e, will grab another one if successful */
>> int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
>> - struct elevator_tags *et)
>> + struct elevator_resources *res)
>> {
>> unsigned int flags = q->tag_set->flags;
>> struct blk_mq_hw_ctx *hctx;
>> @@ -588,23 +603,23 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
>> unsigned long i;
>> int ret;
>>
>> - eq = elevator_alloc(q, e, et);
>> + eq = elevator_alloc(q, e, res->et);
>> if (!eq)
>> return -ENOMEM;
>>
>> - q->nr_requests = et->nr_requests;
>> + q->nr_requests = res->et->nr_requests;
>>
>> if (blk_mq_is_shared_tags(flags)) {
>> /* Shared tags are stored at index 0 in @et->tags. */
>> - q->sched_shared_tags = et->tags[0];
>> - blk_mq_tag_update_sched_shared_tags(q, et->nr_requests);
>> + q->sched_shared_tags = res->et->tags[0];
>> + blk_mq_tag_update_sched_shared_tags(q, res->et->nr_requests);
>> }
>>
>> queue_for_each_hw_ctx(q, hctx, i) {
>> if (blk_mq_is_shared_tags(flags))
>> hctx->sched_tags = q->sched_shared_tags;
>> else
>> - hctx->sched_tags = et->tags[i];
>> + hctx->sched_tags = res->et->tags[i];
>> }
>
> Adding one local variable 'et' could kill all of the above changes; looks like
> you prefer a big patch, but it's up to you.
>
Yeah, I see what you meant earlier — I think I misunderstood it the first time.
I’m on the same page now and will address this in the next version.
By the way, I also like the slim patch :)
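Something along these lines, presumably (a sketch of the local-variable
approach, assuming the v4 field layout):

	int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
			      struct elevator_resources *res)
	{
		/* one local alias keeps the rest of the function body as-is */
		struct elevator_tags *et = res->et;

		/* ... body unchanged: elevator_alloc(q, e, et),
		 * q->nr_requests = et->nr_requests, et->tags[i], ... */
	}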
Thanks,
--Nilay
* Re: [PATCHv4 4/5] block: use {alloc|free}_sched data methods
2025-11-11 2:58 ` Ming Lei
@ 2025-11-11 6:51 ` Nilay Shroff
0 siblings, 0 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-11-11 6:51 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, hch, axboe, yi.zhang, czhong, yukuai, gjoyce
On 11/11/25 8:28 AM, Ming Lei wrote:
>> @@ -540,15 +545,24 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
>> return NULL;
>> }
>>
>> -int blk_mq_alloc_sched_res(struct elevator_resources *res,
>> - struct blk_mq_tag_set *set, unsigned int nr_hw_queues)
>> +int blk_mq_alloc_sched_res(struct request_queue *q,
>> + struct elevator_type *type,
>> + struct elevator_resources *res,
>> + struct blk_mq_tag_set *set,
>> + unsigned int nr_hw_queues)
> As mentioned, the `struct request_queue *q` parameter can be added from
> the beginning; then `struct blk_mq_tag_set *set` can be avoided.
>
> Otherwise, this patch looks fine.
Ack, will address this in the next version.
Thanks,
--Nilay
* Re: [PATCHv4 1/5] block: unify elevator tags and type xarrays into struct elv_change_ctx
2025-11-10 8:14 ` [PATCHv4 1/5] block: unify elevator tags and type xarrays into struct elv_change_ctx Nilay Shroff
@ 2025-11-11 6:55 ` Yu Kuai
2025-11-11 8:37 ` Nilay Shroff
0 siblings, 1 reply; 18+ messages in thread
From: Yu Kuai @ 2025-11-11 6:55 UTC (permalink / raw)
To: Nilay Shroff, linux-block
Cc: ming.lei, hch, axboe, yi.zhang, czhong, gjoyce, yukuai
Hi,
On 2025/11/10 16:14, Nilay Shroff wrote:
> Currently, the nr_hw_queues update path manages two disjoint xarrays —
> one for elevator tags and another for elevator type — both used during
> elevator switching. Maintaining these two parallel structures for the
> same purpose adds unnecessary complexity and potential for mismatched
> state.
>
> This patch unifies both xarrays into a single structure, struct
> elv_change_ctx, which holds all per-queue elevator change context. A
> single xarray, named elv_tbl, now maps each queue (q->id) in a tagset
> to its corresponding elv_change_ctx entry, encapsulating the elevator
> tags, type and name references.
>
> This unification simplifies the code, improves maintainability, and
> clarifies ownership of per-queue elevator state.
>
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> block/blk-mq-sched.c | 76 +++++++++++++++++++++++++++++++++-----------
> block/blk-mq-sched.h | 3 ++
> block/blk-mq.c | 50 +++++++++++++++++------------
> block/blk.h | 7 ++--
> block/elevator.c | 31 ++++--------------
> block/elevator.h | 15 +++++++++
> 6 files changed, 115 insertions(+), 67 deletions(-)
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index e0bed16485c3..3d9386555a50 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -427,11 +427,11 @@ void blk_mq_free_sched_tags(struct elevator_tags *et,
> kfree(et);
> }
>
> -void blk_mq_free_sched_tags_batch(struct xarray *et_table,
> +void blk_mq_free_sched_tags_batch(struct xarray *elv_tbl,
> struct blk_mq_tag_set *set)
> {
> struct request_queue *q;
> - struct elevator_tags *et;
> + struct elv_change_ctx *ctx;
>
> lockdep_assert_held_write(&set->update_nr_hwq_lock);
>
> @@ -444,13 +444,47 @@ void blk_mq_free_sched_tags_batch(struct xarray *et_table,
> * concurrently.
> */
> if (q->elevator) {
> - et = xa_load(et_table, q->id);
> - if (unlikely(!et))
> + ctx = xa_load(elv_tbl, q->id);
> + if (!ctx || !ctx->et) {
> WARN_ON_ONCE(1);
> - else
> - blk_mq_free_sched_tags(et, set);
> + continue;
> + }
> + blk_mq_free_sched_tags(ctx->et, set);
> + ctx->et = NULL;
> + }
> + }
> +}
> +
> +void blk_mq_free_sched_ctx_batch(struct xarray *elv_tbl)
> +{
> + unsigned long i;
> + struct elv_change_ctx *ctx;
> +
> + xa_for_each(elv_tbl, i, ctx) {
> + xa_erase(elv_tbl, i);
> + kfree(ctx);
> + }
> +}
> +
> +int blk_mq_alloc_sched_ctx_batch(struct xarray *elv_tbl,
> + struct blk_mq_tag_set *set)
> +{
> + struct request_queue *q;
> + struct elv_change_ctx *ctx;
> +
> + lockdep_assert_held_write(&set->update_nr_hwq_lock);
> +
> + list_for_each_entry(q, &set->tag_list, tag_set_list) {
> + ctx = kzalloc(sizeof(struct elv_change_ctx), GFP_KERNEL);
> + if (!ctx)
> + return -ENOMEM;
> +
> + if (xa_insert(elv_tbl, q->id, ctx, GFP_KERNEL)) {
> + kfree(ctx);
> + return -ENOMEM;
> }
> }
> + return 0;
> }
>
> struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
> @@ -498,12 +532,13 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
> return NULL;
> }
>
> -int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
> +int blk_mq_alloc_sched_tags_batch(struct xarray *elv_tbl,
> struct blk_mq_tag_set *set, unsigned int nr_hw_queues)
> {
> + struct elv_change_ctx *ctx;
> struct request_queue *q;
> struct elevator_tags *et;
> - gfp_t gfp = GFP_NOIO | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
> + int ret = -ENOMEM;
>
> lockdep_assert_held_write(&set->update_nr_hwq_lock);
>
> @@ -516,26 +551,31 @@ int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
> * concurrently.
> */
> if (q->elevator) {
> - et = blk_mq_alloc_sched_tags(set, nr_hw_queues,
> + ctx = xa_load(elv_tbl, q->id);
> + if (WARN_ON_ONCE(!ctx)) {
> + ret = -ENOENT;
> + goto out_unwind;
> + }
> +
> + ctx->et = blk_mq_alloc_sched_tags(set, nr_hw_queues,
> blk_mq_default_nr_requests(set));
> - if (!et)
> + if (!ctx->et)
> goto out_unwind;
> - if (xa_insert(et_table, q->id, et, gfp))
> - goto out_free_tags;
> +
> }
> }
> return 0;
> -out_free_tags:
> - blk_mq_free_sched_tags(et, set);
> out_unwind:
> list_for_each_entry_continue_reverse(q, &set->tag_list, tag_set_list) {
> if (q->elevator) {
> - et = xa_load(et_table, q->id);
> - if (et)
> - blk_mq_free_sched_tags(et, set);
> + ctx = xa_load(elv_tbl, q->id);
> + if (ctx && ctx->et) {
> + blk_mq_free_sched_tags(ctx->et, set);
> + ctx->et = NULL;
> + }
> }
> }
> - return -ENOMEM;
> + return ret;
> }
>
> /* caller must have a reference to @e, will grab another one if successful */
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 8e21a6b1415d..2fddbc91a235 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -27,6 +27,9 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
> unsigned int nr_hw_queues, unsigned int nr_requests);
> int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
> struct blk_mq_tag_set *set, unsigned int nr_hw_queues);
> +int blk_mq_alloc_sched_ctx_batch(struct xarray *elv_tbl,
> + struct blk_mq_tag_set *set);
> +void blk_mq_free_sched_ctx_batch(struct xarray *elv_tbl);
> void blk_mq_free_sched_tags(struct elevator_tags *et,
> struct blk_mq_tag_set *set);
> void blk_mq_free_sched_tags_batch(struct xarray *et_table,
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d626d32f6e57..1f5ef7fc9cda 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4983,27 +4983,28 @@ struct elevator_tags *blk_mq_update_nr_requests(struct request_queue *q,
> * Switch back to the elevator type stored in the xarray.
> */
> static void blk_mq_elv_switch_back(struct request_queue *q,
> - struct xarray *elv_tbl, struct xarray *et_tbl)
> + struct xarray *elv_tbl)
> {
> - struct elevator_type *e = xa_load(elv_tbl, q->id);
> - struct elevator_tags *t = xa_load(et_tbl, q->id);
> + struct elv_change_ctx *ctx = xa_load(elv_tbl, q->id);
> +
> + if (WARN_ON_ONCE(!ctx))
> + return;
>
> /* The elv_update_nr_hw_queues unfreezes the queue. */
> - elv_update_nr_hw_queues(q, e, t);
> + elv_update_nr_hw_queues(q, ctx);
>
> /* Drop the reference acquired in blk_mq_elv_switch_none. */
> - if (e)
> - elevator_put(e);
> + if (ctx->type)
> + elevator_put(ctx->type);
> }
>
> /*
> - * Stores elevator type in xarray and set current elevator to none. It uses
> - * q->id as an index to store the elevator type into the xarray.
> + * Stores elevator name and type in ctx and sets current elevator to none.
> */
> static int blk_mq_elv_switch_none(struct request_queue *q,
> struct xarray *elv_tbl)
> {
> - int ret = 0;
> + struct elv_change_ctx *ctx;
>
> lockdep_assert_held_write(&q->tag_set->update_nr_hwq_lock);
>
> @@ -5015,10 +5016,11 @@ static int blk_mq_elv_switch_none(struct request_queue *q,
> * can't run concurrently.
> */
> if (q->elevator) {
> + ctx = xa_load(elv_tbl, q->id);
> + if (WARN_ON_ONCE(!ctx))
> + return -ENOENT;
>
> - ret = xa_insert(elv_tbl, q->id, q->elevator->type, GFP_KERNEL);
> - if (WARN_ON_ONCE(ret))
> - return ret;
> + ctx->name = q->elevator->type->elevator_name;
>
> /*
> * Before we switch elevator to 'none', take a reference to
> @@ -5029,9 +5031,14 @@ static int blk_mq_elv_switch_none(struct request_queue *q,
> */
> __elevator_get(q->elevator->type);
>
> + /*
> + * Store elevator type so that we can release the reference
> + * taken above later.
> + */
> + ctx->type = q->elevator->type;
> elevator_set_none(q);
> }
> - return ret;
> + return 0;
> }
>
> static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> @@ -5041,7 +5048,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> int prev_nr_hw_queues = set->nr_hw_queues;
> unsigned int memflags;
> int i;
> - struct xarray elv_tbl, et_tbl;
> + struct xarray elv_tbl;
> bool queues_frozen = false;
>
> lockdep_assert_held(&set->tag_list_lock);
> @@ -5055,11 +5062,12 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>
> memflags = memalloc_noio_save();
>
> - xa_init(&et_tbl);
> - if (blk_mq_alloc_sched_tags_batch(&et_tbl, set, nr_hw_queues) < 0)
> - goto out_memalloc_restore;
> -
> xa_init(&elv_tbl);
> + if (blk_mq_alloc_sched_ctx_batch(&elv_tbl, set) < 0)
> + goto out_free_ctx;
> +
> + if (blk_mq_alloc_sched_tags_batch(&elv_tbl, set, nr_hw_queues) < 0)
> + goto out_free_ctx;
I feel it's not necessary to separate the two helpers above; just fold
blk_mq_alloc_sched_tags_batch() into blk_mq_alloc_sched_ctx_batch(),
since blk_mq_alloc_sched_tags_batch() is never called separately in
the following patches.
Otherwise, this patch LGTM.
Thanks,
Kuai
>
> list_for_each_entry(q, &set->tag_list, tag_set_list) {
> blk_mq_debugfs_unregister_hctxs(q);
> @@ -5105,7 +5113,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> /* switch_back expects queue to be frozen */
> if (!queues_frozen)
> blk_mq_freeze_queue_nomemsave(q);
> - blk_mq_elv_switch_back(q, &elv_tbl, &et_tbl);
> + blk_mq_elv_switch_back(q, &elv_tbl);
> }
>
> list_for_each_entry(q, &set->tag_list, tag_set_list) {
> @@ -5116,9 +5124,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> blk_mq_add_hw_queues_cpuhp(q);
> }
>
> +out_free_ctx:
> + blk_mq_free_sched_ctx_batch(&elv_tbl);
> xa_destroy(&elv_tbl);
> - xa_destroy(&et_tbl);
> -out_memalloc_restore:
> memalloc_noio_restore(memflags);
>
> /* Free the excess tags when nr_hw_queues shrink. */
> diff --git a/block/blk.h b/block/blk.h
> index 170794632135..a7992680f9e1 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -11,8 +11,7 @@
> #include <xen/xen.h>
> #include "blk-crypto-internal.h"
>
> -struct elevator_type;
> -struct elevator_tags;
> +struct elv_change_ctx;
>
> /*
> * Default upper limit for the software max_sectors limit used for regular I/Os.
> @@ -333,8 +332,8 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
>
> bool blk_insert_flush(struct request *rq);
>
> -void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_type *e,
> - struct elevator_tags *t);
> +void elv_update_nr_hw_queues(struct request_queue *q,
> + struct elv_change_ctx *ctx);
> void elevator_set_default(struct request_queue *q);
> void elevator_set_none(struct request_queue *q);
>
> diff --git a/block/elevator.c b/block/elevator.c
> index e2ebfbf107b3..cd7bdff205c8 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -45,19 +45,6 @@
> #include "blk-wbt.h"
> #include "blk-cgroup.h"
>
> -/* Holding context data for changing elevator */
> -struct elv_change_ctx {
> - const char *name;
> - bool no_uevent;
> -
> - /* for unregistering old elevator */
> - struct elevator_queue *old;
> - /* for registering new elevator */
> - struct elevator_queue *new;
> - /* holds sched tags data */
> - struct elevator_tags *et;
> -};
> -
> static DEFINE_SPINLOCK(elv_list_lock);
> static LIST_HEAD(elv_list);
>
> @@ -706,32 +693,28 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
> * The I/O scheduler depends on the number of hardware queues, this forces a
> * reattachment when nr_hw_queues changes.
> */
> -void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_type *e,
> - struct elevator_tags *t)
> +void elv_update_nr_hw_queues(struct request_queue *q,
> + struct elv_change_ctx *ctx)
> {
> struct blk_mq_tag_set *set = q->tag_set;
> - struct elv_change_ctx ctx = {};
> int ret = -ENODEV;
>
> WARN_ON_ONCE(q->mq_freeze_depth == 0);
>
> - if (e && !blk_queue_dying(q) && blk_queue_registered(q)) {
> - ctx.name = e->elevator_name;
> - ctx.et = t;
> -
> + if (ctx->type && !blk_queue_dying(q) && blk_queue_registered(q)) {
> mutex_lock(&q->elevator_lock);
> /* force to reattach elevator after nr_hw_queue is updated */
> - ret = elevator_switch(q, &ctx);
> + ret = elevator_switch(q, ctx);
> mutex_unlock(&q->elevator_lock);
> }
> blk_mq_unfreeze_queue_nomemrestore(q);
> if (!ret)
> - WARN_ON_ONCE(elevator_change_done(q, &ctx));
> + WARN_ON_ONCE(elevator_change_done(q, ctx));
> /*
> * Free sched tags if it's allocated but we couldn't switch elevator.
> */
> - if (t && !ctx.new)
> - blk_mq_free_sched_tags(t, set);
> + if (ctx->et && !ctx->new)
> + blk_mq_free_sched_tags(ctx->et, set);
> }
>
> /*
> diff --git a/block/elevator.h b/block/elevator.h
> index c4d20155065e..bad43182361e 100644
> --- a/block/elevator.h
> +++ b/block/elevator.h
> @@ -32,6 +32,21 @@ struct elevator_tags {
> struct blk_mq_tags *tags[];
> };
>
> +/* Holding context data for changing elevator */
> +struct elv_change_ctx {
> + const char *name;
> + bool no_uevent;
> +
> + /* for unregistering old elevator */
> + struct elevator_queue *old;
> + /* for registering new elevator */
> + struct elevator_queue *new;
> + /* store elevator type */
> + struct elevator_type *type;
> + /* holds sched tags data */
> + struct elevator_tags *et;
> +};
> +
> struct elevator_mq_ops {
> int (*init_sched)(struct request_queue *, struct elevator_queue *);
> void (*exit_sched)(struct elevator_queue *);
* Re: [PATCHv4 3/5] block: introduce alloc_sched_data and free_sched_data elevator methods
2025-11-10 8:14 ` [PATCHv4 3/5] block: introduce alloc_sched_data and free_sched_data elevator methods Nilay Shroff
2025-11-11 2:53 ` Ming Lei
@ 2025-11-11 7:20 ` Yu Kuai
2025-11-11 8:39 ` Nilay Shroff
1 sibling, 1 reply; 18+ messages in thread
From: Yu Kuai @ 2025-11-11 7:20 UTC (permalink / raw)
To: Nilay Shroff, linux-block
Cc: ming.lei, hch, axboe, yi.zhang, czhong, gjoyce, yukuai
Hi,
On 2025/11/10 16:14, Nilay Shroff wrote:
> The recent lockdep splat [1] highlights a potential deadlock risk
> involving ->elevator_lock and ->freeze_lock dependencies on
> pcpu_alloc_mutex. The trace shows that the issue occurs when the Kyber scheduler
> allocates dynamic memory for its elevator data during initialization.
>
> To address this, introduce two new elevator operation callbacks:
> ->alloc_sched_data and ->free_sched_data. The subsequent patch will
> build upon these newly introduced methods to suppress the lockdep splat [1].
>
> [1] https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/
>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> block/blk-mq-sched.h | 17 +++++++++++++++++
> block/elevator.h | 2 ++
> 2 files changed, 19 insertions(+)
>
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 97204df76def..d38911d0d9eb 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -39,6 +39,23 @@ void blk_mq_free_sched_res(struct elevator_resources *res,
> void blk_mq_free_sched_res_batch(struct xarray *et_table,
> struct blk_mq_tag_set *set);
>
> +static inline int blk_mq_alloc_sched_data(struct request_queue *q,
> + struct elevator_type *e, void **data)
> +{
> + if (e && e->ops.alloc_sched_data) {
> + *data = e->ops.alloc_sched_data(q);
> + if (!*data)
> + return -ENOMEM;
> + }
> + return 0;
> +}
I'm not strongly against this, but instead of using an input parameter
as the output, why not return the data directly? I feel that is more readable.
Perhaps you're considering that, when NULL is returned, the caller can't tell
whether the alloc_sched_data() method is defined or not. That can be solved by
folding the helper into its caller directly; there is only one caller anyway.
In patch 4:
	if (type && type->ops.alloc_sched_data) {
		res->data = type->ops.alloc_sched_data(q);
		if (!res->data) {
			blk_mq_free_sched_tags(res->et, set);
			return -ENOMEM;
		}
	}
It's up to you :)
Thanks,
Kuai
> +
> +static inline void blk_mq_free_sched_data(struct elevator_type *e, void *data)
> +{
> + if (e && e->ops.free_sched_data)
> + e->ops.free_sched_data(data);
> +}
> +
> static inline void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
> {
> if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
> diff --git a/block/elevator.h b/block/elevator.h
> index 621a63597249..e34043f6da26 100644
> --- a/block/elevator.h
> +++ b/block/elevator.h
> @@ -58,6 +58,8 @@ struct elevator_mq_ops {
> int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
> void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
> void (*depth_updated)(struct request_queue *);
> + void *(*alloc_sched_data)(struct request_queue *);
> + void (*free_sched_data)(void *);
>
> bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);
> bool (*bio_merge)(struct request_queue *, struct bio *, unsigned int);
* Re: [PATCHv4 1/5] block: unify elevator tags and type xarrays into struct elv_change_ctx
2025-11-11 6:55 ` Yu Kuai
@ 2025-11-11 8:37 ` Nilay Shroff
2025-11-11 10:02 ` Yu Kuai
0 siblings, 1 reply; 18+ messages in thread
From: Nilay Shroff @ 2025-11-11 8:37 UTC (permalink / raw)
To: yukuai, linux-block; +Cc: ming.lei, hch, axboe, yi.zhang, czhong, gjoyce
On 11/11/25 12:25 PM, Yu Kuai wrote:
>> @@ -5055,11 +5062,12 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>
>> memflags = memalloc_noio_save();
>>
>> - xa_init(&et_tbl);
>> - if (blk_mq_alloc_sched_tags_batch(&et_tbl, set, nr_hw_queues) < 0)
>> - goto out_memalloc_restore;
>> -
>> xa_init(&elv_tbl);
>> + if (blk_mq_alloc_sched_ctx_batch(&elv_tbl, set) < 0)
>> + goto out_free_ctx;
>> +
>> + if (blk_mq_alloc_sched_tags_batch(&elv_tbl, set, nr_hw_queues) < 0)
>> + goto out_free_ctx;
> I feel it's not necessary to separate the two helpers above; just fold
> blk_mq_alloc_sched_tags_batch() into blk_mq_alloc_sched_ctx_batch(),
> since blk_mq_alloc_sched_tags_batch() is never called separately in
> the following patches.
>
Hmm, as the name suggests, blk_mq_alloc_sched_ctx_batch() is meant to
allocate elv_change_ctx structures in batches. So, folding
blk_mq_alloc_sched_tags_batch() into blk_mq_alloc_sched_ctx_batch()
doesn't look correct, since the purpose of blk_mq_alloc_sched_tags_batch()
is to allocate scheduler tags in batches.
That said, we've already folded blk_mq_alloc_sched_tags_batch() into
blk_mq_alloc_sched_res_batch() in a subsequent patch, whose purpose is
to allocate scheduler resources in batches.
So, IMO, keeping blk_mq_alloc_sched_tags_batch() as-is in this patch
and folding it later into blk_mq_alloc_sched_res_batch() seems more
appropriate from a function naming and logical layering point of view.
Thanks,
--Nilay
* Re: [PATCHv4 3/5] block: introduce alloc_sched_data and free_sched_data elevator methods
2025-11-11 7:20 ` Yu Kuai
@ 2025-11-11 8:39 ` Nilay Shroff
0 siblings, 0 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-11-11 8:39 UTC (permalink / raw)
To: yukuai, linux-block; +Cc: ming.lei, hch, axboe, yi.zhang, czhong, gjoyce
On 11/11/25 12:50 PM, Yu Kuai wrote:
>> +static inline int blk_mq_alloc_sched_data(struct request_queue *q,
>> + struct elevator_type *e, void **data)
>> +{
>> + if (e && e->ops.alloc_sched_data) {
>> + *data = e->ops.alloc_sched_data(q);
>> + if (!*data)
>> + return -ENOMEM;
>> + }
>> + return 0;
>> +}
> I'm not strongly against this, but instead of using an input parameter
> as the output, why not return the data directly? I feel that is more readable.
>
> Perhaps you're considering that, when NULL is returned, the caller can't tell
> whether the alloc_sched_data() method is defined or not. That can be solved by
> folding the helper into its caller directly; there is only one caller anyway.
> In patch 4:
>
> 	if (type && type->ops.alloc_sched_data) {
> 		res->data = type->ops.alloc_sched_data(q);
> 		if (!res->data) {
> 			blk_mq_free_sched_tags(res->et, set);
> 			return -ENOMEM;
> 		}
> 	}
>
Yes, this looks good and feasible; I will address this in the next version.
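Roughly, the folded form could look like this (a sketch against the v4
signatures, not the posted code):

	int blk_mq_alloc_sched_res(struct request_queue *q,
			struct elevator_type *type,
			struct elevator_resources *res,
			struct blk_mq_tag_set *set,
			unsigned int nr_hw_queues)
	{
		res->et = blk_mq_alloc_sched_tags(set, nr_hw_queues,
				blk_mq_default_nr_requests(set));
		if (!res->et)
			return -ENOMEM;

		/* the former blk_mq_alloc_sched_data() helper, folded in */
		if (type && type->ops.alloc_sched_data) {
			res->data = type->ops.alloc_sched_data(q);
			if (!res->data) {
				blk_mq_free_sched_tags(res->et, set);
				res->et = NULL;
				return -ENOMEM;
			}
		}
		return 0;
	}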
Thanks,
--Nilay
* Re: [PATCHv4 1/5] block: unify elevator tags and type xarrays into struct elv_change_ctx
2025-11-11 8:37 ` Nilay Shroff
@ 2025-11-11 10:02 ` Yu Kuai
2025-11-11 12:00 ` Nilay Shroff
0 siblings, 1 reply; 18+ messages in thread
From: Yu Kuai @ 2025-11-11 10:02 UTC (permalink / raw)
To: Nilay Shroff, linux-block
Cc: ming.lei, hch, axboe, yi.zhang, czhong, gjoyce, yukuai
Hi,
On 2025/11/11 16:37, Nilay Shroff wrote:
>
> On 11/11/25 12:25 PM, Yu Kuai wrote:
>>> @@ -5055,11 +5062,12 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>
>>> memflags = memalloc_noio_save();
>>>
>>> - xa_init(&et_tbl);
>>> - if (blk_mq_alloc_sched_tags_batch(&et_tbl, set, nr_hw_queues) < 0)
>>> - goto out_memalloc_restore;
>>> -
>>> xa_init(&elv_tbl);
>>> + if (blk_mq_alloc_sched_ctx_batch(&elv_tbl, set) < 0)
>>> + goto out_free_ctx;
>>> +
>>> + if (blk_mq_alloc_sched_tags_batch(&elv_tbl, set, nr_hw_queues) < 0)
>>> + goto out_free_ctx;
>> I feel it's not necessary to separate the two helpers above; just fold
>> blk_mq_alloc_sched_tags_batch() into blk_mq_alloc_sched_ctx_batch(),
>> since blk_mq_alloc_sched_tags_batch() is never called separately in
>> the following patches.
>>
> Hmm, as the name suggests, blk_mq_alloc_sched_ctx_batch() is meant to
> allocate elv_change_ctx structures in batches. So, folding
> blk_mq_alloc_sched_tags_batch() into blk_mq_alloc_sched_ctx_batch()
> doesn’t look correct, since the purpose of blk_mq_alloc_sched_tags_batch()
> is to allocate scheduler tags in batches.
>
> That said, we've already folded blk_mq_alloc_sched_tags_batch() into
> blk_mq_alloc_sched_res_batch() in a subsequent patch, whose purpose is
> to allocate scheduler resources in batches.
>
> So, IMO, keeping blk_mq_alloc_sched_tags_batch() as-is in this patch
> and folding it later into blk_mq_alloc_sched_res_batch() seems more
> appropriate from a function naming and logical layering point of view.
I mean: just remove the helper blk_mq_alloc_sched_tags_batch() and call
blk_mq_alloc_sched_tags() (or _res later) directly from
blk_mq_alloc_sched_ctx_batch().
I think at least there will be fewer lines of code :)
blk_mq_alloc_sched_ctx_batch
  list_for_each_entry
    ctx = kzalloc
    xa_insert
    blk_mq_alloc_sched_res

blk_mq_free_sched_ctx_batch
  xa_for_each
    xa_erase
    blk_mq_free_sched_res
    kfree
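Concretely, something like this (just a sketch following the outline above;
the error-handling details are illustrative):

	int blk_mq_alloc_sched_ctx_batch(struct xarray *elv_tbl,
			struct blk_mq_tag_set *set, unsigned int nr_hw_queues)
	{
		struct elv_change_ctx *ctx;
		struct request_queue *q;

		list_for_each_entry(q, &set->tag_list, tag_set_list) {
			ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
			if (!ctx)
				goto out_free;
			if (xa_insert(elv_tbl, q->id, ctx, GFP_KERNEL)) {
				kfree(ctx);
				goto out_free;
			}
			/* resources allocated in the same per-queue pass */
			if (q->elevator &&
			    blk_mq_alloc_sched_res(q, q->elevator->type,
						   &ctx->res, set, nr_hw_queues))
				goto out_free;
		}
		return 0;

	out_free:
		/* would free both res and ctx entries in this scheme */
		blk_mq_free_sched_ctx_batch(elv_tbl);
		return -ENOMEM;
	}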
If you don't like the name, perhaps it's fine to use
blk_mq_alloc_sched_ctx_and_res_batch.
Still, it's up to you :)
Thanks,
Kuai
>
> Thanks,
> --Nilay
* Re: [PATCHv4 1/5] block: unify elevator tags and type xarrays into struct elv_change_ctx
2025-11-11 10:02 ` Yu Kuai
@ 2025-11-11 12:00 ` Nilay Shroff
0 siblings, 0 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-11-11 12:00 UTC (permalink / raw)
To: yukuai, linux-block; +Cc: ming.lei, hch, axboe, yi.zhang, czhong, gjoyce
On 11/11/25 3:32 PM, Yu Kuai wrote:
> Hi,
>
> On 2025/11/11 16:37, Nilay Shroff wrote:
>>
>> On 11/11/25 12:25 PM, Yu Kuai wrote:
>>>> @@ -5055,11 +5062,12 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>>
>>>> memflags = memalloc_noio_save();
>>>>
>>>> - xa_init(&et_tbl);
>>>> - if (blk_mq_alloc_sched_tags_batch(&et_tbl, set, nr_hw_queues) < 0)
>>>> - goto out_memalloc_restore;
>>>> -
>>>> xa_init(&elv_tbl);
>>>> + if (blk_mq_alloc_sched_ctx_batch(&elv_tbl, set) < 0)
>>>> + goto out_free_ctx;
>>>> +
>>>> + if (blk_mq_alloc_sched_tags_batch(&elv_tbl, set, nr_hw_queues) < 0)
>>>> + goto out_free_ctx;
>>> I feel it's not necessary to separate the two helpers above; just fold
>>> blk_mq_alloc_sched_tags_batch() into blk_mq_alloc_sched_ctx_batch(),
>>> since blk_mq_alloc_sched_tags_batch() is never called separately in
>>> the following patches.
>>>
>> Hmm, as the name suggests, blk_mq_alloc_sched_ctx_batch() is meant to
>> allocate elv_change_ctx structures in batches. So, folding
>> blk_mq_alloc_sched_tags_batch() into blk_mq_alloc_sched_ctx_batch()
>> doesn’t look correct, since the purpose of blk_mq_alloc_sched_tags_batch()
>> is to allocate scheduler tags in batches.
>>
>> That said, we've already folded blk_mq_alloc_sched_tags_batch() into
>> blk_mq_alloc_sched_res_batch() in a subsequent patch, whose purpose is
>> to allocate scheduler resources in batches.
>>
>> So, IMO, keeping blk_mq_alloc_sched_tags_batch() as-is in this patch
>> and folding it later into blk_mq_alloc_sched_res_batch() seems more
>> appropriate from a function naming and logical layering point of view.
>
> I mean: just remove the helper blk_mq_alloc_sched_tags_batch() and call
> blk_mq_alloc_sched_tags() (or _res later) directly from
> blk_mq_alloc_sched_ctx_batch().
>
> I think at least there will be fewer lines of code :)
>
> blk_mq_alloc_sched_ctx_batch
>   list_for_each_entry
>     ctx = kzalloc
>     xa_insert
>     blk_mq_alloc_sched_res
>
> blk_mq_free_sched_ctx_batch
>   xa_for_each
>     xa_erase
>     blk_mq_free_sched_res
>     kfree
>
> If you don't like the name, perhaps it's fine to use
> blk_mq_alloc_sched_ctx_and_res_batch.
>
I understand what you’re proposing. However, I still think it’s better
to keep the allocation and release of the elevator context and resources
separate — not only for logical layering, but also for practical reasons.
For example, when switching the elevator, we may only need to free the
old elevator resources while keeping the context intact. Hence, IMO, it
seems cleaner and more flexible to maintain separate allocation and release
functions for both elevator context and resources, rather than tying them
together in a single API.
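As a sketch of that flexibility (based on the v4 error path in
elv_update_nr_hw_queues(); the surrounding code is assumed, not quoted):

	/* on a failed switch, drop only the resources; the per-queue ctx
	 * stays in elv_tbl and is reclaimed by the ctx-batch free later */
	if (ctx->res.et && !ctx->new)
		blk_mq_free_sched_res(&ctx->res, ctx->type, set);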
Thanks,
--Nilay
end of thread, other threads: [~2025-11-11 12:00 UTC | newest]
Thread overview: 18+ messages
2025-11-10 8:14 [PATCHv4 0/5] block: restructure elevator switch path and fix a lockdep splat Nilay Shroff
2025-11-10 8:14 ` [PATCHv4 1/5] block: unify elevator tags and type xarrays into struct elv_change_ctx Nilay Shroff
2025-11-11 6:55 ` Yu Kuai
2025-11-11 8:37 ` Nilay Shroff
2025-11-11 10:02 ` Yu Kuai
2025-11-11 12:00 ` Nilay Shroff
2025-11-10 8:14 ` [PATCHv4 2/5] block: move elevator tags into struct elevator_resources Nilay Shroff
2025-11-11 2:52 ` Ming Lei
2025-11-11 6:49 ` Nilay Shroff
2025-11-10 8:14 ` [PATCHv4 3/5] block: introduce alloc_sched_data and free_sched_data elevator methods Nilay Shroff
2025-11-11 2:53 ` Ming Lei
2025-11-11 7:20 ` Yu Kuai
2025-11-11 8:39 ` Nilay Shroff
2025-11-10 8:14 ` [PATCHv4 4/5] block: use {alloc|free}_sched data methods Nilay Shroff
2025-11-11 2:58 ` Ming Lei
2025-11-11 6:51 ` Nilay Shroff
2025-11-10 8:14 ` [PATCHv4 5/5] block: define alloc_sched_data and free_sched_data methods for kyber Nilay Shroff
2025-11-11 3:01 ` Ming Lei