* [PATCHv4 0/3] block: move sched_tags allocation/de-allocation outside of locking context
@ 2025-06-24 13:17 Nilay Shroff
2025-06-24 13:17 ` [PATCHv4 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Nilay Shroff @ 2025-06-24 13:17 UTC (permalink / raw)
To: linux-block; +Cc: hch, ming.lei, axboe, sth, gjoyce
Hi,
There have been a few reports[1] indicating potential lockdep warnings due
to a lock dependency from the percpu allocator to the elevator lock. This
patch series aims to eliminate that dependency.
The series consists of three patches:
The first patch is a preparatory patch that moves the elevator queue
allocation logic from ->init_sched into blk_mq_init_sched.
The second patch in the series restructures sched_tags allocation and
deallocation during elevator update/switch operations to ensure these
actions are performed entirely outside the ->freeze_lock and
->elevator_lock. This eliminates the percpu allocator's lock dependency
on the elevator and freeze locks during scheduler transitions.
The third patch introduces batch allocation and deallocation helpers for
sched_tags. These helpers are used during __blk_mq_update_nr_hw_queues()
to decouple sched_tags memory management from both the elevator and freeze
locks, addressing the lockdep concerns in the nr_hw_queues update path.
[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Changes since v3:
- Further split the patchset into a three-patch series so that we can
  have a separate patch for sched_tags batch allocation/deallocation
  (Ming Lei)
- Use Xarray to store and load the sched_tags (Ming Lei)
- Unexport elevator_alloc() as we no longer need to use it outside of
  the block layer core (hch)
- Unwind the sched_tags allocation and free tags when it fails in
  the middle of allocation (hch)
- Move the struct elevator_queue definition from the common header to
  elevator.c as there's no user of it outside elevator.c (Ming Lei, hch)
Link to v3: https://lore.kernel.org/all/20250616173233.3803824-1-nilay@linux.ibm.com/
Changes since v2:
- Split the patch into a two-patch series: the first patch updates the
  ->init_sched elevator API and the second patch handles the sched_tags
  allocation/de-allocation logic (Ming Lei)
- Address the sched_tags allocation/deallocation logic while running in
  the context of an nr_hw_queues update so that we can handle all
  possible scenarios in a single patchset (Ming Lei)
Link to v2: https://lore.kernel.org/all/20250528123638.1029700-1-nilay@linux.ibm.com/
Changes since v1:
- As the lifetimes of the elevator queue and sched_tags are the same,
  allocate and move sched_tags under struct elevator_queue (Ming Lei)
Link to v1: https://lore.kernel.org/all/20250520103425.1259712-1-nilay@linux.ibm.com/
Nilay Shroff (3):
block: move elevator queue allocation logic into blk_mq_init_sched
block: fix lockdep warning caused by lock dependency in
elv_iosched_store
block: fix potential deadlock while running nr_hw_queue update
block/bfq-iosched.c | 13 +--
block/blk-mq-sched.c | 238 +++++++++++++++++++++++++++++-------------
block/blk-mq-sched.h | 19 +++-
block/blk-mq.c | 12 ++-
block/blk.h | 3 +-
block/elevator.c | 52 +++++++--
block/elevator.h | 21 +++-
block/kyber-iosched.c | 11 +-
block/mq-deadline.c | 14 +--
9 files changed, 267 insertions(+), 116 deletions(-)
--
2.49.0
^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCHv4 1/3] block: move elevator queue allocation logic into blk_mq_init_sched
  2025-06-24 13:17 [PATCHv4 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
@ 2025-06-24 13:17 ` Nilay Shroff
  2025-06-24 13:17 ` [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
  2025-06-24 13:17 ` [PATCHv4 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
  2 siblings, 0 replies; 8+ messages in thread
From: Nilay Shroff @ 2025-06-24 13:17 UTC (permalink / raw)
  To: linux-block; +Cc: hch, ming.lei, axboe, sth, gjoyce

In preparation for allocating sched_tags before freezing the request
queue and acquiring ->elevator_lock, move the elevator queue allocation
logic from the elevator ops ->init_sched callback into blk_mq_init_sched.
As elevator_alloc is now only invoked from the block layer core, we don't
need to export it, so unexport the elevator_alloc function.

This refactoring provides a centralized location for elevator queue
initialization, which makes it easier to store pre-allocated sched_tags
in the struct elevator_queue during later changes.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 block/bfq-iosched.c   | 13 +++----------
 block/blk-mq-sched.c  | 11 ++++++++---
 block/elevator.c      |  1 -
 block/elevator.h      |  2 +-
 block/kyber-iosched.c | 11 ++---------
 block/mq-deadline.c   | 14 ++------------
 6 files changed, 16 insertions(+), 36 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0cb1e9873aab..fd26dc1901b0 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7232,22 +7232,16 @@ static void bfq_init_root_group(struct bfq_group *root_group,
 	root_group->sched_data.bfq_class_idle_last_service = jiffies;
 }
 
-static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
+static int bfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct bfq_data *bfqd;
-	struct elevator_queue *eq;
 	unsigned int i;
 	struct blk_independent_access_ranges *ia_ranges = q->disk->ia_ranges;
 
-	eq = elevator_alloc(q, e);
-	if (!eq)
-		return -ENOMEM;
-
 	bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
-	if (!bfqd) {
-		kobject_put(&eq->kobj);
+	if (!bfqd)
 		return -ENOMEM;
-	}
+
 	eq->elevator_data = bfqd;
 
 	spin_lock_irq(&q->queue_lock);
@@ -7405,7 +7399,6 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 
 out_free:
 	kfree(bfqd);
-	kobject_put(&eq->kobj);
 	return -ENOMEM;
 }
 
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 55a0fd105147..359e0704e09b 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -475,10 +475,14 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
 				   BLKDEV_DEFAULT_RQ);
 
+	eq = elevator_alloc(q, e);
+	if (!eq)
+		return -ENOMEM;
+
 	if (blk_mq_is_shared_tags(flags)) {
 		ret = blk_mq_init_sched_shared_tags(q);
 		if (ret)
-			return ret;
+			goto err_put_elevator;
 	}
 
 	queue_for_each_hw_ctx(q, hctx, i) {
@@ -487,7 +491,7 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 			goto err_free_map_and_rqs;
 	}
 
-	ret = e->ops.init_sched(q, e);
+	ret = e->ops.init_sched(q, eq);
 	if (ret)
 		goto err_free_map_and_rqs;
 
@@ -508,7 +512,8 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 err_free_map_and_rqs:
 	blk_mq_sched_free_rqs(q);
 	blk_mq_sched_tags_teardown(q, flags);
-
+err_put_elevator:
+	kobject_put(&eq->kobj);
 	q->elevator = NULL;
 	return ret;
 }
diff --git a/block/elevator.c b/block/elevator.c
index ab22542e6cf0..770874040f79 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -148,7 +148,6 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
 
 	return eq;
 }
-EXPORT_SYMBOL(elevator_alloc);
 
 static void elevator_release(struct kobject *kobj)
 {
diff --git a/block/elevator.h b/block/elevator.h
index a07ce773a38f..a4de5f9ad790 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -24,7 +24,7 @@ struct blk_mq_alloc_data;
 struct blk_mq_hw_ctx;
 
 struct elevator_mq_ops {
-	int (*init_sched)(struct request_queue *, struct elevator_type *);
+	int (*init_sched)(struct request_queue *, struct elevator_queue *);
 	void (*exit_sched)(struct elevator_queue *);
 	int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
 	void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 4dba8405bd01..7b6832cb3a8d 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -402,20 +402,13 @@ static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
 	return ERR_PTR(ret);
 }
 
-static int kyber_init_sched(struct request_queue *q, struct elevator_type *e)
+static int kyber_init_sched(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct kyber_queue_data *kqd;
-	struct elevator_queue *eq;
-
-	eq = elevator_alloc(q, e);
-	if (!eq)
-		return -ENOMEM;
 
 	kqd = kyber_queue_data_alloc(q);
-	if (IS_ERR(kqd)) {
-		kobject_put(&eq->kobj);
+	if (IS_ERR(kqd))
 		return PTR_ERR(kqd);
-	}
 
 	blk_stat_enable_accounting(q);
 
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 2edf1cac06d5..7b6caf30e00a 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -568,20 +568,14 @@ static void dd_exit_sched(struct elevator_queue *e)
 /*
  * initialize elevator private data (deadline_data).
  */
-static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
+static int dd_init_sched(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct deadline_data *dd;
-	struct elevator_queue *eq;
 	enum dd_prio prio;
-	int ret = -ENOMEM;
-
-	eq = elevator_alloc(q, e);
-	if (!eq)
-		return ret;
 
 	dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
 	if (!dd)
-		goto put_eq;
+		return -ENOMEM;
 
 	eq->elevator_data = dd;
 
@@ -608,10 +602,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	q->elevator = eq;
 
 	return 0;
-
-put_eq:
-	kobject_put(&eq->kobj);
-	return ret;
 }
 
 /*
-- 
2.49.0

^ permalink raw reply related	[flat|nested] 8+ messages in thread
* [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
  2025-06-24 13:17 [PATCHv4 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
  2025-06-24 13:17 ` [PATCHv4 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
@ 2025-06-24 13:17 ` Nilay Shroff
  2025-06-26 14:43   ` Ming Lei
  2025-06-24 13:17 ` [PATCHv4 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
  2 siblings, 1 reply; 8+ messages in thread
From: Nilay Shroff @ 2025-06-24 13:17 UTC (permalink / raw)
  To: linux-block; +Cc: hch, ming.lei, axboe, sth, gjoyce

Recent lockdep reports [1] have revealed a potential deadlock caused by a
lock dependency between the percpu allocator lock and the elevator lock.
This issue can be avoided by ensuring that the allocation and release of
scheduler tags (sched_tags) are performed outside the elevator lock.
Furthermore, the queue does not need to remain frozen during these
operations.

To address this, move all sched_tags allocations and deallocations outside
of both the ->elevator_lock and the ->freeze_lock. Since the lifetime of
the elevator queue and its associated sched_tags is closely tied, the
allocated sched_tags are now stored in the elevator queue structure. Then,
during the actual elevator switch (which runs under ->freeze_lock and
->elevator_lock), the pre-allocated sched_tags are assigned to the
appropriate q->hctx. Once the elevator switch is complete and the locks
are released, the old elevator queue and its associated sched_tags are
freed.

This commit specifically addresses the allocation/deallocation of
sched_tags during elevator switching. Note that sched_tags may also be
allocated in other contexts, such as during nr_hw_queues updates.
Supporting that use case will require batch allocation/deallocation,
which will be handled in a follow-up patch.

This restructuring ensures that sched_tags memory management occurs
entirely outside of the ->elevator_lock and ->freeze_lock context,
eliminating the lock dependency problem seen during scheduler updates.

[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/

Reported-by: Stefan Haberland <sth@linux.ibm.com>
Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 block/blk-mq-sched.c | 186 ++++++++++++++++++++++++++-----------------
 block/blk-mq-sched.h |  14 +++-
 block/elevator.c     |  66 +++++++++++++--
 block/elevator.h     |  19 ++++-
 4 files changed, 204 insertions(+), 81 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 359e0704e09b..5d3132ac7777 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -374,64 +374,17 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq,
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge);
 
-static int blk_mq_sched_alloc_map_and_rqs(struct request_queue *q,
-					  struct blk_mq_hw_ctx *hctx,
-					  unsigned int hctx_idx)
-{
-	if (blk_mq_is_shared_tags(q->tag_set->flags)) {
-		hctx->sched_tags = q->sched_shared_tags;
-		return 0;
-	}
-
-	hctx->sched_tags = blk_mq_alloc_map_and_rqs(q->tag_set, hctx_idx,
-						    q->nr_requests);
-
-	if (!hctx->sched_tags)
-		return -ENOMEM;
-	return 0;
-}
-
-static void blk_mq_exit_sched_shared_tags(struct request_queue *queue)
-{
-	blk_mq_free_rq_map(queue->sched_shared_tags);
-	queue->sched_shared_tags = NULL;
-}
-
 /* called in queue's release handler, tagset has gone away */
 static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int flags)
 {
 	struct blk_mq_hw_ctx *hctx;
 	unsigned long i;
 
-	queue_for_each_hw_ctx(q, hctx, i) {
-		if (hctx->sched_tags) {
-			if (!blk_mq_is_shared_tags(flags))
-				blk_mq_free_rq_map(hctx->sched_tags);
-			hctx->sched_tags = NULL;
-		}
-	}
+	queue_for_each_hw_ctx(q, hctx, i)
+		hctx->sched_tags = NULL;
 
 	if (blk_mq_is_shared_tags(flags))
-		blk_mq_exit_sched_shared_tags(q);
-}
-
-static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
-{
-	struct blk_mq_tag_set *set = queue->tag_set;
-
-	/*
-	 * Set initial depth at max so that we don't need to reallocate for
-	 * updating nr_requests.
-	 */
-	queue->sched_shared_tags = blk_mq_alloc_map_and_rqs(set,
-						BLK_MQ_NO_HCTX_IDX,
-						MAX_SCHED_RQ);
-	if (!queue->sched_shared_tags)
-		return -ENOMEM;
-
-	blk_mq_tag_update_sched_shared_tags(queue);
-
-	return 0;
+		q->sched_shared_tags = NULL;
 }
 
 void blk_mq_sched_reg_debugfs(struct request_queue *q)
@@ -458,8 +411,106 @@ void blk_mq_sched_unreg_debugfs(struct request_queue *q)
 	mutex_unlock(&q->debugfs_mutex);
 }
 
+void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set,
+		struct blk_mq_tags **tags, unsigned int nr_hw_queues)
+{
+	unsigned long i;
+
+	if (!tags)
+		return;
+
+	/* Shared tags are stored at index 0 in @tags. */
+	if (blk_mq_is_shared_tags(set->flags))
+		blk_mq_free_map_and_rqs(set, tags[0], BLK_MQ_NO_HCTX_IDX);
+	else {
+		for (i = 0; i < nr_hw_queues; i++)
+			blk_mq_free_map_and_rqs(set, tags[i], i);
+	}
+
+	kfree(tags);
+}
+
+void blk_mq_free_sched_tags(struct elevator_tags *et,
+		struct blk_mq_tag_set *set, int id)
+{
+	struct blk_mq_tags **tags;
+
+	tags = xa_load(&et->tags_table, id);
+	__blk_mq_free_sched_tags(set, tags, et->nr_hw_queues);
+}
+
+struct blk_mq_tags **__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
+		unsigned int nr_hw_queues,
+		unsigned int nr_requests,
+		gfp_t gfp)
+{
+	int i, nr_tags;
+	struct blk_mq_tags **tags;
+
+	if (blk_mq_is_shared_tags(set->flags))
+		nr_tags = 1;
+	else
+		nr_tags = nr_hw_queues;
+
+	tags = kcalloc(nr_tags, sizeof(struct blk_mq_tags *), gfp);
+	if (!tags)
+		return NULL;
+
+	if (blk_mq_is_shared_tags(set->flags)) {
+		/* Shared tags are stored at index 0 in @tags. */
+		tags[0] = blk_mq_alloc_map_and_rqs(set, BLK_MQ_NO_HCTX_IDX,
+				MAX_SCHED_RQ);
+		if (!tags[0])
+			goto out;
+	} else {
+		for (i = 0; i < nr_hw_queues; i++) {
+			tags[i] = blk_mq_alloc_map_and_rqs(set, i, nr_requests);
+			if (!tags[i])
+				goto out_unwind;
+		}
+	}
+
+	return tags;
+out_unwind:
+	while (--i >= 0)
+		blk_mq_free_map_and_rqs(set, tags[i], i);
+out:
+	kfree(tags);
+	return NULL;
+}
+
+int blk_mq_alloc_sched_tags(struct elevator_tags *et,
+		struct blk_mq_tag_set *set, int id)
+{
+	struct blk_mq_tags **tags;
+	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
+
+	/*
+	 * Default to double of smaller one between hw queue_depth and
+	 * 128, since we don't split into sync/async like the old code
+	 * did. Additionally, this is a per-hw queue depth.
+	 */
+	et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
+			BLKDEV_DEFAULT_RQ);
+
+	tags = __blk_mq_alloc_sched_tags(set, et->nr_hw_queues,
+			et->nr_requests, gfp);
+	if (!tags)
+		return -ENOMEM;
+
+	if (xa_insert(&et->tags_table, id, tags, gfp))
+		goto out_free_tags;
+
+	return 0;
+
+out_free_tags:
+	__blk_mq_free_sched_tags(set, tags, et->nr_hw_queues);
+	return -ENOMEM;
+}
+
 /* caller must have a reference to @e, will grab another one if successful */
-int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
+		struct elevator_tags *et)
 {
 	unsigned int flags = q->tag_set->flags;
 	struct blk_mq_hw_ctx *hctx;
@@ -467,40 +518,33 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	unsigned long i;
 	int ret;
 
-	/*
-	 * Default to double of smaller one between hw queue_depth and 128,
-	 * since we don't split into sync/async like the old code did.
-	 * Additionally, this is a per-hw queue depth.
-	 */
-	q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
-				   BLKDEV_DEFAULT_RQ);
-
-	eq = elevator_alloc(q, e);
+	eq = elevator_alloc(q, e, et);
 	if (!eq)
 		return -ENOMEM;
 
+	q->nr_requests = et->nr_requests;
+
 	if (blk_mq_is_shared_tags(flags)) {
-		ret = blk_mq_init_sched_shared_tags(q);
-		if (ret)
-			goto err_put_elevator;
+		/* Shared tags are stored at index 0 in @eq->tags. */
+		q->sched_shared_tags = eq->tags[0];
+		blk_mq_tag_update_sched_shared_tags(q);
 	}
 
 	queue_for_each_hw_ctx(q, hctx, i) {
-		ret = blk_mq_sched_alloc_map_and_rqs(q, hctx, i);
-		if (ret)
-			goto err_free_map_and_rqs;
+		if (blk_mq_is_shared_tags(flags))
+			hctx->sched_tags = q->sched_shared_tags;
+		else
+			hctx->sched_tags = eq->tags[i];
 	}
 
 	ret = e->ops.init_sched(q, eq);
 	if (ret)
-		goto err_free_map_and_rqs;
+		goto out;
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (e->ops.init_hctx) {
 			ret = e->ops.init_hctx(hctx, i);
 			if (ret) {
-				eq = q->elevator;
-				blk_mq_sched_free_rqs(q);
 				blk_mq_exit_sched(q, eq);
 				kobject_put(&eq->kobj);
 				return ret;
@@ -509,10 +553,8 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	}
 
 	return 0;
-err_free_map_and_rqs:
-	blk_mq_sched_free_rqs(q);
+out:
 	blk_mq_sched_tags_teardown(q, flags);
-err_put_elevator:
 	kobject_put(&eq->kobj);
 	q->elevator = NULL;
 	return ret;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 1326526bb733..92aa50b8376a 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -18,7 +18,19 @@ void __blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
 
-int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
+int blk_mq_alloc_sched_tags(struct elevator_tags *et,
+		struct blk_mq_tag_set *set, int id);
+void blk_mq_free_sched_tags(struct elevator_tags *et,
+		struct blk_mq_tag_set *set, int id);
+struct blk_mq_tags **__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
+		unsigned int nr_hw_queues,
+		unsigned int nr_requests,
+		gfp_t gfp);
+void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set,
+		struct blk_mq_tags **tags, unsigned int nr_hw_queues);
+
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
+		struct elevator_tags *et);
 void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
 void blk_mq_sched_free_rqs(struct request_queue *q);
 
diff --git a/block/elevator.c b/block/elevator.c
index 770874040f79..1408894c0396 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -54,6 +54,8 @@ struct elv_change_ctx {
 	struct elevator_queue *old;
 	/* for registering new elevator */
 	struct elevator_queue *new;
+	/* holds sched tags data */
+	struct elevator_tags *et;
 };
 
 static DEFINE_SPINLOCK(elv_list_lock);
@@ -132,7 +134,7 @@ static struct elevator_type *elevator_find_get(const char *name)
 static const struct kobj_type elv_ktype;
 
 struct elevator_queue *elevator_alloc(struct request_queue *q,
-				  struct elevator_type *e)
+		struct elevator_type *e, struct elevator_tags *et)
 {
 	struct elevator_queue *eq;
 
@@ -145,8 +147,15 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
 	kobject_init(&eq->kobj, &elv_ktype);
 	mutex_init(&eq->sysfs_lock);
 	hash_init(eq->hash);
+	eq->nr_hw_queues = et->nr_hw_queues;
+	eq->tags = xa_load(&et->tags_table, q->id);
+	if (WARN_ON_ONCE(!eq->tags))
+		goto out;
 
 	return eq;
+out:
+	kobject_put(&eq->kobj);
+	return NULL;
 }
 
 static void elevator_release(struct kobject *kobj)
@@ -165,7 +174,6 @@ static void elevator_exit(struct request_queue *q)
 	lockdep_assert_held(&q->elevator_lock);
 
 	ioc_clear_queue(q);
-	blk_mq_sched_free_rqs(q);
 
 	mutex_lock(&e->sysfs_lock);
 	blk_mq_exit_sched(q, e);
@@ -591,7 +599,7 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
 	}
 
 	if (new_e) {
-		ret = blk_mq_init_sched(q, new_e);
+		ret = blk_mq_init_sched(q, new_e, ctx->et);
 		if (ret)
 			goto out_unfreeze;
 		ctx->new = q->elevator;
@@ -626,8 +634,10 @@ static void elv_exit_and_release(struct request_queue *q)
 	elevator_exit(q);
 	mutex_unlock(&q->elevator_lock);
 	blk_mq_unfreeze_queue(q, memflags);
-	if (e)
+	if (e) {
+		__blk_mq_free_sched_tags(q->tag_set, e->tags, e->nr_hw_queues);
 		kobject_put(&e->kobj);
+	}
 }
 
 static int elevator_change_done(struct request_queue *q,
@@ -640,6 +650,8 @@ static int elevator_change_done(struct request_queue *q,
 				&ctx->old->flags);
 
 		elv_unregister_queue(q, ctx->old);
+		__blk_mq_free_sched_tags(q->tag_set, ctx->old->tags,
+				ctx->old->nr_hw_queues);
 		kobject_put(&ctx->old->kobj);
 		if (enable_wbt)
 			wbt_enable_default(q->disk);
@@ -658,9 +670,20 @@ static int elevator_change_done(struct request_queue *q,
 static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
 {
 	unsigned int memflags;
+	struct elevator_tags et;
+	struct blk_mq_tag_set *set = q->tag_set;
 	int ret = 0;
 
-	lockdep_assert_held(&q->tag_set->update_nr_hwq_lock);
+	lockdep_assert_held(&set->update_nr_hwq_lock);
+
+	et.nr_hw_queues = set->nr_hw_queues;
+	xa_init(&et.tags_table);
+	ctx->et = &et;
+	if (strncmp(ctx->name, "none", 4)) {
+		ret = blk_mq_alloc_sched_tags(ctx->et, set, q->id);
+		if (ret)
+			goto out;
+	}
 
 	memflags = blk_mq_freeze_queue(q);
 	/*
@@ -680,7 +703,13 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
 	blk_mq_unfreeze_queue(q, memflags);
 	if (!ret)
 		ret = elevator_change_done(q, ctx);
-
+	/*
+	 * Free sched tags if it's allocated but we couldn't switch elevator.
+	 */
+	if (!xa_empty(&ctx->et->tags_table) && !ctx->new)
+		blk_mq_free_sched_tags(ctx->et, set, q->id);
+out:
+	xa_destroy(&ctx->et->tags_table);
 	return ret;
 }
 
@@ -690,11 +719,29 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
  */
 void elv_update_nr_hw_queues(struct request_queue *q)
 {
+	struct blk_mq_tag_set *set = q->tag_set;
+	struct elevator_tags et;
 	struct elv_change_ctx ctx = {};
 	int ret = -ENODEV;
 
 	WARN_ON_ONCE(q->mq_freeze_depth == 0);
 
+	et.nr_hw_queues = set->nr_hw_queues;
+	xa_init(&et.tags_table);
+	ctx.et = &et;
+	/*
+	 * Accessing q->elevator without holding q->elevator_lock is safe here
+	 * because nr_hw_queue update is protected by set->update_nr_hwq_lock
+	 * in the writer context. So, scheduler update/switch code (which
+	 * acquires same lock in the reader context) can't run concurrently.
+	 */
+	if (q->elevator) {
+		if (blk_mq_alloc_sched_tags(ctx.et, set, q->id)) {
+			WARN_ON_ONCE(1);
+			goto out;
+		}
+	}
+
 	mutex_lock(&q->elevator_lock);
 	if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
 		ctx.name = q->elevator->type->elevator_name;
@@ -706,6 +753,13 @@ void elv_update_nr_hw_queues(struct request_queue *q)
 	blk_mq_unfreeze_queue_nomemrestore(q);
 	if (!ret)
 		WARN_ON_ONCE(elevator_change_done(q, &ctx));
+	/*
+	 * Free sched tags if it's allocated but we couldn't switch elevator.
+	 */
+	if (!xa_empty(&ctx.et->tags_table) && !ctx.new)
+		blk_mq_free_sched_tags(ctx.et, set, q->id);
+out:
+	xa_destroy(&ctx.et->tags_table);
 }
 
 /*
diff --git a/block/elevator.h b/block/elevator.h
index a4de5f9ad790..849c264031ca 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -23,6 +23,19 @@ enum elv_merge {
 struct blk_mq_alloc_data;
 struct blk_mq_hw_ctx;
 
+struct elevator_tags {
+	/* num. of hardware queues for which tags are allocated */
+	unsigned int nr_hw_queues;
+	/* depth used while allocating tags */
+	unsigned int nr_requests;
+	/*
+	 * An Xarray table storing shared and per hardware queue sched tags.
+	 * The key to store an entry in Xarray is q->id. Please note that
+	 * shared sched tags are always stored at index 0.
+	 */
+	struct xarray tags_table;
+};
+
 struct elevator_mq_ops {
 	int (*init_sched)(struct request_queue *, struct elevator_queue *);
 	void (*exit_sched)(struct elevator_queue *);
@@ -113,6 +126,8 @@ struct request *elv_rqhash_find(struct request_queue *q, sector_t offset);
 struct elevator_queue
 {
 	struct elevator_type *type;
+	struct blk_mq_tags **tags;
+	unsigned int nr_hw_queues;
 	void *elevator_data;
 	struct kobject kobj;
 	struct mutex sysfs_lock;
@@ -152,8 +167,8 @@ ssize_t elv_iosched_show(struct gendisk *disk, char *page);
 ssize_t elv_iosched_store(struct gendisk *disk, const char *page, size_t count);
 
 extern bool elv_bio_merge_ok(struct request *, struct bio *);
-extern struct elevator_queue *elevator_alloc(struct request_queue *,
-		struct elevator_type *);
+struct elevator_queue *elevator_alloc(struct request_queue *,
+		struct elevator_type *, struct elevator_tags *);
 
 /*
  * Helper functions.
-- 
2.49.0

^ permalink raw reply related	[flat|nested] 8+ messages in thread
* Re: [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store 2025-06-24 13:17 ` [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff @ 2025-06-26 14:43 ` Ming Lei 2025-06-27 4:13 ` Nilay Shroff 0 siblings, 1 reply; 8+ messages in thread From: Ming Lei @ 2025-06-26 14:43 UTC (permalink / raw) To: Nilay Shroff; +Cc: linux-block, hch, axboe, sth, gjoyce On Tue, Jun 24, 2025 at 06:47:04PM +0530, Nilay Shroff wrote: > Recent lockdep reports [1] have revealed a potential deadlock caused by a > lock dependency between the percpu allocator lock and the elevator lock. > This issue can be avoided by ensuring that the allocation and release of > scheduler tags (sched_tags) are performed outside the elevator lock. > Furthermore, the queue does not need to be remain frozen during these > operations. > > To address this, move all sched_tags allocations and deallocations outside > of both the ->elevator_lock and the ->freeze_lock. Since the lifetime of > the elevator queue and its associated sched_tags is closely tied, the > allocated sched_tags are now stored in the elevator queue structure. Then, > during the actual elevator switch (which runs under ->freeze_lock and > ->elevator_lock), the pre-allocated sched_tags are assigned to the > appropriate q->hctx. Once the elevator switch is complete and the locks > are released, the old elevator queue and its associated sched_tags are > freed. > > This commit specifically addresses the allocation/deallocation of sched_ > tags during elevator switching. Note that sched_tags may also be allocated > in other contexts, such as during nr_hw_queues updates. Supporting that > use case will require batch allocation/deallocation, which will be handled > in a follow-up patch. 
> > This restructuring ensures that sched_tags memory management occurs > entirely outside of the ->elevator_lock and ->freeze_lock context, > eliminating the lock dependency problem seen during scheduler updates. > > [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ > > Reported-by: Stefan Haberland <sth@linux.ibm.com> > Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ > Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> > --- > block/blk-mq-sched.c | 186 ++++++++++++++++++++++++++----------------- > block/blk-mq-sched.h | 14 +++- > block/elevator.c | 66 +++++++++++++-- > block/elevator.h | 19 ++++- > 4 files changed, 204 insertions(+), 81 deletions(-) > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c > index 359e0704e09b..5d3132ac7777 100644 > --- a/block/blk-mq-sched.c > +++ b/block/blk-mq-sched.c > @@ -374,64 +374,17 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq, > } > EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge); > > -static int blk_mq_sched_alloc_map_and_rqs(struct request_queue *q, > - struct blk_mq_hw_ctx *hctx, > - unsigned int hctx_idx) > -{ > - if (blk_mq_is_shared_tags(q->tag_set->flags)) { > - hctx->sched_tags = q->sched_shared_tags; > - return 0; > - } > - > - hctx->sched_tags = blk_mq_alloc_map_and_rqs(q->tag_set, hctx_idx, > - q->nr_requests); > - > - if (!hctx->sched_tags) > - return -ENOMEM; > - return 0; > -} > - > -static void blk_mq_exit_sched_shared_tags(struct request_queue *queue) > -{ > - blk_mq_free_rq_map(queue->sched_shared_tags); > - queue->sched_shared_tags = NULL; > -} > - > /* called in queue's release handler, tagset has gone away */ > static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int flags) > { > struct blk_mq_hw_ctx *hctx; > unsigned long i; > > - queue_for_each_hw_ctx(q, hctx, i) { > - if (hctx->sched_tags) { > - if (!blk_mq_is_shared_tags(flags)) > - 
blk_mq_free_rq_map(hctx->sched_tags); > - hctx->sched_tags = NULL; > - } > - } > + queue_for_each_hw_ctx(q, hctx, i) > + hctx->sched_tags = NULL; > > if (blk_mq_is_shared_tags(flags)) > - blk_mq_exit_sched_shared_tags(q); > -} > - > -static int blk_mq_init_sched_shared_tags(struct request_queue *queue) > -{ > - struct blk_mq_tag_set *set = queue->tag_set; > - > - /* > - * Set initial depth at max so that we don't need to reallocate for > - * updating nr_requests. > - */ > - queue->sched_shared_tags = blk_mq_alloc_map_and_rqs(set, > - BLK_MQ_NO_HCTX_IDX, > - MAX_SCHED_RQ); > - if (!queue->sched_shared_tags) > - return -ENOMEM; > - > - blk_mq_tag_update_sched_shared_tags(queue); > - > - return 0; > + q->sched_shared_tags = NULL; > } > > void blk_mq_sched_reg_debugfs(struct request_queue *q) > @@ -458,8 +411,106 @@ void blk_mq_sched_unreg_debugfs(struct request_queue *q) > mutex_unlock(&q->debugfs_mutex); > } > > +void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set, > + struct blk_mq_tags **tags, unsigned int nr_hw_queues) > +{ > + unsigned long i; > + > + if (!tags) > + return; > + > + /* Shared tags are stored at index 0 in @tags. 
*/ > + if (blk_mq_is_shared_tags(set->flags)) > + blk_mq_free_map_and_rqs(set, tags[0], BLK_MQ_NO_HCTX_IDX); > + else { > + for (i = 0; i < nr_hw_queues; i++) > + blk_mq_free_map_and_rqs(set, tags[i], i); > + } > + > + kfree(tags); > +} > + > +void blk_mq_free_sched_tags(struct elevator_tags *et, > + struct blk_mq_tag_set *set, int id) > +{ > + struct blk_mq_tags **tags; > + > + tags = xa_load(&et->tags_table, id); > + __blk_mq_free_sched_tags(set, tags, et->nr_hw_queues); > +} > + > +struct blk_mq_tags **__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set, > + unsigned int nr_hw_queues, > + unsigned int nr_requests, > + gfp_t gfp) > +{ > + int i, nr_tags; > + struct blk_mq_tags **tags; > + > + if (blk_mq_is_shared_tags(set->flags)) > + nr_tags = 1; > + else > + nr_tags = nr_hw_queues; > + > + tags = kcalloc(nr_tags, sizeof(struct blk_mq_tags *), gfp); > + if (!tags) > + return NULL; > + > + if (blk_mq_is_shared_tags(set->flags)) { > + /* Shared tags are stored at index 0 in @tags. */ > + tags[0] = blk_mq_alloc_map_and_rqs(set, BLK_MQ_NO_HCTX_IDX, > + MAX_SCHED_RQ); > + if (!tags[0]) > + goto out; > + } else { > + for (i = 0; i < nr_hw_queues; i++) { > + tags[i] = blk_mq_alloc_map_and_rqs(set, i, nr_requests); > + if (!tags[i]) > + goto out_unwind; > + } > + } > + > + return tags; > +out_unwind: > + while (--i >= 0) > + blk_mq_free_map_and_rqs(set, tags[i], i); > +out: > + kfree(tags); > + return NULL; > +} > + > +int blk_mq_alloc_sched_tags(struct elevator_tags *et, > + struct blk_mq_tag_set *set, int id) > +{ > + struct blk_mq_tags **tags; > + gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY; > + > + /* > + * Default to double of smaller one between hw queue_depth and > + * 128, since we don't split into sync/async like the old code > + * did. Additionally, this is a per-hw queue depth. 
> + */
> + et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
> + BLKDEV_DEFAULT_RQ);
> +
> + tags = __blk_mq_alloc_sched_tags(set, et->nr_hw_queues,
> + et->nr_requests, gfp);
> + if (!tags)
> + return -ENOMEM;
> +
> + if (xa_insert(&et->tags_table, id, tags, gfp))
> + goto out_free_tags;

Not sure why you add `tags_table` into `elevator_tags`; it looks very confusing just from the naming, not to mention that only a single element is stored in the table in the elevator switch code path, since the queue id is the key.

Wrt. the xarray suggestion, I meant adding an xarray local variable in __blk_mq_update_nr_hw_queues(); then you can store the allocated `elevator_tags` in the xarray via the ->id of the request_queue.

Thanks,
Ming
* Re: [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
2025-06-26 14:43 ` Ming Lei
@ 2025-06-27 4:13 ` Nilay Shroff
2025-06-27 7:58 ` Ming Lei
0 siblings, 1 reply; 8+ messages in thread
From: Nilay Shroff @ 2025-06-27 4:13 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, hch, axboe, sth, gjoyce
On 6/26/25 8:13 PM, Ming Lei wrote:
> On Tue, Jun 24, 2025 at 06:47:04PM +0530, Nilay Shroff wrote:
>> Recent lockdep reports [1] have revealed a potential deadlock caused by a
>> lock dependency between the percpu allocator lock and the elevator lock.
>> This issue can be avoided by ensuring that the allocation and release of
>> scheduler tags (sched_tags) are performed outside the elevator lock.
>> Furthermore, the queue does not need to remain frozen during these
>> operations.
>>
>> To address this, move all sched_tags allocations and deallocations outside
>> of both the ->elevator_lock and the ->freeze_lock. Since the lifetime of
>> the elevator queue and its associated sched_tags is closely tied, the
>> allocated sched_tags are now stored in the elevator queue structure. Then,
>> during the actual elevator switch (which runs under ->freeze_lock and
>> ->elevator_lock), the pre-allocated sched_tags are assigned to the
>> appropriate q->hctx. Once the elevator switch is complete and the locks
>> are released, the old elevator queue and its associated sched_tags are
>> freed.
>>
>> This commit specifically addresses the allocation/deallocation of
>> sched_tags during elevator switching. Note that sched_tags may also be
>> allocated in other contexts, such as during nr_hw_queues updates.
>> Supporting that use case will require batch allocation/deallocation,
>> which will be handled in a follow-up patch.
>>
>> This restructuring ensures that sched_tags memory management occurs
>> entirely outside of the ->elevator_lock and ->freeze_lock context,
>> eliminating the lock dependency problem seen during scheduler updates.
>> >> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ >> >> Reported-by: Stefan Haberland <sth@linux.ibm.com> >> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ >> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> >> --- >> block/blk-mq-sched.c | 186 ++++++++++++++++++++++++++----------------- >> block/blk-mq-sched.h | 14 +++- >> block/elevator.c | 66 +++++++++++++-- >> block/elevator.h | 19 ++++- >> 4 files changed, 204 insertions(+), 81 deletions(-) >> >> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c >> index 359e0704e09b..5d3132ac7777 100644 >> --- a/block/blk-mq-sched.c >> +++ b/block/blk-mq-sched.c >> @@ -374,64 +374,17 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq, >> } >> EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge); >> >> -static int blk_mq_sched_alloc_map_and_rqs(struct request_queue *q, >> - struct blk_mq_hw_ctx *hctx, >> - unsigned int hctx_idx) >> -{ >> - if (blk_mq_is_shared_tags(q->tag_set->flags)) { >> - hctx->sched_tags = q->sched_shared_tags; >> - return 0; >> - } >> - >> - hctx->sched_tags = blk_mq_alloc_map_and_rqs(q->tag_set, hctx_idx, >> - q->nr_requests); >> - >> - if (!hctx->sched_tags) >> - return -ENOMEM; >> - return 0; >> -} >> - >> -static void blk_mq_exit_sched_shared_tags(struct request_queue *queue) >> -{ >> - blk_mq_free_rq_map(queue->sched_shared_tags); >> - queue->sched_shared_tags = NULL; >> -} >> - >> /* called in queue's release handler, tagset has gone away */ >> static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int flags) >> { >> struct blk_mq_hw_ctx *hctx; >> unsigned long i; >> >> - queue_for_each_hw_ctx(q, hctx, i) { >> - if (hctx->sched_tags) { >> - if (!blk_mq_is_shared_tags(flags)) >> - blk_mq_free_rq_map(hctx->sched_tags); >> - hctx->sched_tags = NULL; >> - } >> - } >> + queue_for_each_hw_ctx(q, hctx, i) >> + hctx->sched_tags = NULL; >> >> if (blk_mq_is_shared_tags(flags)) 
>> - blk_mq_exit_sched_shared_tags(q); >> -} >> - >> -static int blk_mq_init_sched_shared_tags(struct request_queue *queue) >> -{ >> - struct blk_mq_tag_set *set = queue->tag_set; >> - >> - /* >> - * Set initial depth at max so that we don't need to reallocate for >> - * updating nr_requests. >> - */ >> - queue->sched_shared_tags = blk_mq_alloc_map_and_rqs(set, >> - BLK_MQ_NO_HCTX_IDX, >> - MAX_SCHED_RQ); >> - if (!queue->sched_shared_tags) >> - return -ENOMEM; >> - >> - blk_mq_tag_update_sched_shared_tags(queue); >> - >> - return 0; >> + q->sched_shared_tags = NULL; >> } >> >> void blk_mq_sched_reg_debugfs(struct request_queue *q) >> @@ -458,8 +411,106 @@ void blk_mq_sched_unreg_debugfs(struct request_queue *q) >> mutex_unlock(&q->debugfs_mutex); >> } >> >> +void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set, >> + struct blk_mq_tags **tags, unsigned int nr_hw_queues) >> +{ >> + unsigned long i; >> + >> + if (!tags) >> + return; >> + >> + /* Shared tags are stored at index 0 in @tags. 
*/ >> + if (blk_mq_is_shared_tags(set->flags)) >> + blk_mq_free_map_and_rqs(set, tags[0], BLK_MQ_NO_HCTX_IDX); >> + else { >> + for (i = 0; i < nr_hw_queues; i++) >> + blk_mq_free_map_and_rqs(set, tags[i], i); >> + } >> + >> + kfree(tags); >> +} >> + >> +void blk_mq_free_sched_tags(struct elevator_tags *et, >> + struct blk_mq_tag_set *set, int id) >> +{ >> + struct blk_mq_tags **tags; >> + >> + tags = xa_load(&et->tags_table, id); >> + __blk_mq_free_sched_tags(set, tags, et->nr_hw_queues); >> +} >> + >> +struct blk_mq_tags **__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set, >> + unsigned int nr_hw_queues, >> + unsigned int nr_requests, >> + gfp_t gfp) >> +{ >> + int i, nr_tags; >> + struct blk_mq_tags **tags; >> + >> + if (blk_mq_is_shared_tags(set->flags)) >> + nr_tags = 1; >> + else >> + nr_tags = nr_hw_queues; >> + >> + tags = kcalloc(nr_tags, sizeof(struct blk_mq_tags *), gfp); >> + if (!tags) >> + return NULL; >> + >> + if (blk_mq_is_shared_tags(set->flags)) { >> + /* Shared tags are stored at index 0 in @tags. */ >> + tags[0] = blk_mq_alloc_map_and_rqs(set, BLK_MQ_NO_HCTX_IDX, >> + MAX_SCHED_RQ); >> + if (!tags[0]) >> + goto out; >> + } else { >> + for (i = 0; i < nr_hw_queues; i++) { >> + tags[i] = blk_mq_alloc_map_and_rqs(set, i, nr_requests); >> + if (!tags[i]) >> + goto out_unwind; >> + } >> + } >> + >> + return tags; >> +out_unwind: >> + while (--i >= 0) >> + blk_mq_free_map_and_rqs(set, tags[i], i); >> +out: >> + kfree(tags); >> + return NULL; >> +} >> + >> +int blk_mq_alloc_sched_tags(struct elevator_tags *et, >> + struct blk_mq_tag_set *set, int id) >> +{ >> + struct blk_mq_tags **tags; >> + gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY; >> + >> + /* >> + * Default to double of smaller one between hw queue_depth and >> + * 128, since we don't split into sync/async like the old code >> + * did. Additionally, this is a per-hw queue depth. 
>> + */
>> + et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
>> + BLKDEV_DEFAULT_RQ);
>> +
>> + tags = __blk_mq_alloc_sched_tags(set, et->nr_hw_queues,
>> + et->nr_requests, gfp);
>> + if (!tags)
>> + return -ENOMEM;
>> +
>> + if (xa_insert(&et->tags_table, id, tags, gfp))
>> + goto out_free_tags;
>
> Not sure why you add `tags_table` into `elevator_tags`, it looks very
> confusing just from naming, not mention only single element is stored to
> the table in elevator switch code path since queue id is the key.
>
Yeah, but if you look at the next patch, where we add/remove tags in batch, you'd realise that the xarray may have multiple entries inserted when it's called from the nr_hw_queue update path.

> Wrt. xarray suggestion, I meant to add on xarray local variable in
> __blk_mq_update_nr_hw_queues(), then you can store allocated `elevator_tags`
> to the xarray via ->id of request_queue.
>
I think I now understand what you have been asking for - there was some confusion earlier. You are suggesting that we insert a pointer to 'struct elevator_tags' into a local xarray when we allocate sched_tags in batch from the nr_hw_queue update context, using q->id as the key. Later, from elv_update_nr_hw_queues(), we would retrieve the relevant 'struct elevator_tags' pointer from that local xarray (using q->id) and pass it on to the elevator_switch function.

This way we would not need the ->tags_table inside 'struct elevator_tags'. Additionally, blk_mq_alloc_sched_tags() shall return the allocated 'struct elevator_tags' to the caller. The definition of 'struct elevator_tags' would then look like this:

struct elevator_tags {
	/* num. of hardware queues for which tags are allocated */
	unsigned int nr_hw_queues;
	/* depth used while allocating tags */
	unsigned int nr_requests;
	/* The index 0 in @tags is used to store shared sched tags */
	struct blk_mq_tags **tags;
};

This seems like a clean and straightforward solution that should be easy to implement without much hassle. Please let me know if this still needs adjustment.

Thanks,
--Nilay
* Re: [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store 2025-06-27 4:13 ` Nilay Shroff @ 2025-06-27 7:58 ` Ming Lei 2025-06-27 9:50 ` Nilay Shroff 0 siblings, 1 reply; 8+ messages in thread From: Ming Lei @ 2025-06-27 7:58 UTC (permalink / raw) To: Nilay Shroff; +Cc: linux-block, hch, axboe, sth, gjoyce On Fri, Jun 27, 2025 at 09:43:29AM +0530, Nilay Shroff wrote: > > > On 6/26/25 8:13 PM, Ming Lei wrote: > > On Tue, Jun 24, 2025 at 06:47:04PM +0530, Nilay Shroff wrote: > >> Recent lockdep reports [1] have revealed a potential deadlock caused by a > >> lock dependency between the percpu allocator lock and the elevator lock. > >> This issue can be avoided by ensuring that the allocation and release of > >> scheduler tags (sched_tags) are performed outside the elevator lock. > >> Furthermore, the queue does not need to be remain frozen during these > >> operations. > >> > >> To address this, move all sched_tags allocations and deallocations outside > >> of both the ->elevator_lock and the ->freeze_lock. Since the lifetime of > >> the elevator queue and its associated sched_tags is closely tied, the > >> allocated sched_tags are now stored in the elevator queue structure. Then, > >> during the actual elevator switch (which runs under ->freeze_lock and > >> ->elevator_lock), the pre-allocated sched_tags are assigned to the > >> appropriate q->hctx. Once the elevator switch is complete and the locks > >> are released, the old elevator queue and its associated sched_tags are > >> freed. > >> > >> This commit specifically addresses the allocation/deallocation of sched_ > >> tags during elevator switching. Note that sched_tags may also be allocated > >> in other contexts, such as during nr_hw_queues updates. Supporting that > >> use case will require batch allocation/deallocation, which will be handled > >> in a follow-up patch. 
> >> > >> This restructuring ensures that sched_tags memory management occurs > >> entirely outside of the ->elevator_lock and ->freeze_lock context, > >> eliminating the lock dependency problem seen during scheduler updates. > >> > >> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ > >> > >> Reported-by: Stefan Haberland <sth@linux.ibm.com> > >> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ > >> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> > >> --- > >> block/blk-mq-sched.c | 186 ++++++++++++++++++++++++++----------------- > >> block/blk-mq-sched.h | 14 +++- > >> block/elevator.c | 66 +++++++++++++-- > >> block/elevator.h | 19 ++++- > >> 4 files changed, 204 insertions(+), 81 deletions(-) > >> > >> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c > >> index 359e0704e09b..5d3132ac7777 100644 > >> --- a/block/blk-mq-sched.c > >> +++ b/block/blk-mq-sched.c > >> @@ -374,64 +374,17 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq, > >> } > >> EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge); > >> > >> -static int blk_mq_sched_alloc_map_and_rqs(struct request_queue *q, > >> - struct blk_mq_hw_ctx *hctx, > >> - unsigned int hctx_idx) > >> -{ > >> - if (blk_mq_is_shared_tags(q->tag_set->flags)) { > >> - hctx->sched_tags = q->sched_shared_tags; > >> - return 0; > >> - } > >> - > >> - hctx->sched_tags = blk_mq_alloc_map_and_rqs(q->tag_set, hctx_idx, > >> - q->nr_requests); > >> - > >> - if (!hctx->sched_tags) > >> - return -ENOMEM; > >> - return 0; > >> -} > >> - > >> -static void blk_mq_exit_sched_shared_tags(struct request_queue *queue) > >> -{ > >> - blk_mq_free_rq_map(queue->sched_shared_tags); > >> - queue->sched_shared_tags = NULL; > >> -} > >> - > >> /* called in queue's release handler, tagset has gone away */ > >> static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int flags) > >> { > >> struct blk_mq_hw_ctx *hctx; > >> 
unsigned long i; > >> > >> - queue_for_each_hw_ctx(q, hctx, i) { > >> - if (hctx->sched_tags) { > >> - if (!blk_mq_is_shared_tags(flags)) > >> - blk_mq_free_rq_map(hctx->sched_tags); > >> - hctx->sched_tags = NULL; > >> - } > >> - } > >> + queue_for_each_hw_ctx(q, hctx, i) > >> + hctx->sched_tags = NULL; > >> > >> if (blk_mq_is_shared_tags(flags)) > >> - blk_mq_exit_sched_shared_tags(q); > >> -} > >> - > >> -static int blk_mq_init_sched_shared_tags(struct request_queue *queue) > >> -{ > >> - struct blk_mq_tag_set *set = queue->tag_set; > >> - > >> - /* > >> - * Set initial depth at max so that we don't need to reallocate for > >> - * updating nr_requests. > >> - */ > >> - queue->sched_shared_tags = blk_mq_alloc_map_and_rqs(set, > >> - BLK_MQ_NO_HCTX_IDX, > >> - MAX_SCHED_RQ); > >> - if (!queue->sched_shared_tags) > >> - return -ENOMEM; > >> - > >> - blk_mq_tag_update_sched_shared_tags(queue); > >> - > >> - return 0; > >> + q->sched_shared_tags = NULL; > >> } > >> > >> void blk_mq_sched_reg_debugfs(struct request_queue *q) > >> @@ -458,8 +411,106 @@ void blk_mq_sched_unreg_debugfs(struct request_queue *q) > >> mutex_unlock(&q->debugfs_mutex); > >> } > >> > >> +void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set, > >> + struct blk_mq_tags **tags, unsigned int nr_hw_queues) > >> +{ > >> + unsigned long i; > >> + > >> + if (!tags) > >> + return; > >> + > >> + /* Shared tags are stored at index 0 in @tags. 
*/ > >> + if (blk_mq_is_shared_tags(set->flags)) > >> + blk_mq_free_map_and_rqs(set, tags[0], BLK_MQ_NO_HCTX_IDX); > >> + else { > >> + for (i = 0; i < nr_hw_queues; i++) > >> + blk_mq_free_map_and_rqs(set, tags[i], i); > >> + } > >> + > >> + kfree(tags); > >> +} > >> + > >> +void blk_mq_free_sched_tags(struct elevator_tags *et, > >> + struct blk_mq_tag_set *set, int id) > >> +{ > >> + struct blk_mq_tags **tags; > >> + > >> + tags = xa_load(&et->tags_table, id); > >> + __blk_mq_free_sched_tags(set, tags, et->nr_hw_queues); > >> +} > >> + > >> +struct blk_mq_tags **__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set, > >> + unsigned int nr_hw_queues, > >> + unsigned int nr_requests, > >> + gfp_t gfp) > >> +{ > >> + int i, nr_tags; > >> + struct blk_mq_tags **tags; > >> + > >> + if (blk_mq_is_shared_tags(set->flags)) > >> + nr_tags = 1; > >> + else > >> + nr_tags = nr_hw_queues; > >> + > >> + tags = kcalloc(nr_tags, sizeof(struct blk_mq_tags *), gfp); > >> + if (!tags) > >> + return NULL; > >> + > >> + if (blk_mq_is_shared_tags(set->flags)) { > >> + /* Shared tags are stored at index 0 in @tags. */ > >> + tags[0] = blk_mq_alloc_map_and_rqs(set, BLK_MQ_NO_HCTX_IDX, > >> + MAX_SCHED_RQ); > >> + if (!tags[0]) > >> + goto out; > >> + } else { > >> + for (i = 0; i < nr_hw_queues; i++) { > >> + tags[i] = blk_mq_alloc_map_and_rqs(set, i, nr_requests); > >> + if (!tags[i]) > >> + goto out_unwind; > >> + } > >> + } > >> + > >> + return tags; > >> +out_unwind: > >> + while (--i >= 0) > >> + blk_mq_free_map_and_rqs(set, tags[i], i); > >> +out: > >> + kfree(tags); > >> + return NULL; > >> +} > >> + > >> +int blk_mq_alloc_sched_tags(struct elevator_tags *et, > >> + struct blk_mq_tag_set *set, int id) > >> +{ > >> + struct blk_mq_tags **tags; > >> + gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY; > >> + > >> + /* > >> + * Default to double of smaller one between hw queue_depth and > >> + * 128, since we don't split into sync/async like the old code > >> + * did. 
Additionally, this is a per-hw queue depth. > >> + */ > >> + et->nr_requests = 2 * min_t(unsigned int, set->queue_depth, > >> + BLKDEV_DEFAULT_RQ); > >> + > >> + tags = __blk_mq_alloc_sched_tags(set, et->nr_hw_queues, > >> + et->nr_requests, gfp); > >> + if (!tags) > >> + return -ENOMEM; > >> + > >> + if (xa_insert(&et->tags_table, id, tags, gfp)) > >> + goto out_free_tags; > > > > Not sure why you add `tags_table` into `elevator_tags`, it looks very > > confusing just from naming, not mention only single element is stored to > > the table in elevator switch code path since queue id is the key. > > > Yeah but if you look at next patch where we add/remove tags in batch > then you'd realise that xarray may have multiple entries inserted when > it's called from nr_hw_queue update. > > > Wrt. xarray suggestion, I meant to add on xarray local variable in > > __blk_mq_update_nr_hw_queues(), then you can store allocated `elevator_tags` > > to the xarray via ->id of request_queue. > > > I think I got it now what you have been asking for - there was some confusion > earlier. So you are suggesting to insert pointer to 'struct elevator_tags' > in a local Xarray when we allocate sched_tags in batch from nr_hw_queue update > context. And the key to insert 'struct elevator_tags' into Xarray would be q->id. > Then later, from elv_update_nr_hw_queues() we should retrieve the relavent > pointer to 'struct elevator_tags' from that local Xarray (using q->id) and pass > on the retrieved elevator_tags to elevator_switch function. > > So this way we would not require using ->tags_table inside 'struct elevator_tags'. > Additionally, blk_mq_alloc_sched_tags() shall return allocated 'struct elevator_tags' > to the caller. And the definition of the 'struct elevator_tags' would look like this: > > struct elevator_tags { > /* num. 
of hardware queues for which tags are allocated */
> unsigned int nr_hw_queues;
> /* depth used while allocating tags */
> unsigned int nr_requests;
> /* The index 0 in @tags is used to store shared sched tags */
> struct blk_mq_tags **tags;

s/struct blk_mq_tags **tags/struct blk_mq_tags *tags[];

Then we can save one extra failure-handling step.

>> This seems like a clean and straightforward solution that should be easy to implement
>> without much hassle. Please let me know if this still needs adjustment.

I think we are on the same page now.

Thanks,
Ming
* Re: [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
2025-06-27 7:58 ` Ming Lei
@ 2025-06-27 9:50 ` Nilay Shroff
0 siblings, 0 replies; 8+ messages in thread
From: Nilay Shroff @ 2025-06-27 9:50 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, hch, axboe, sth, gjoyce
On 6/27/25 1:28 PM, Ming Lei wrote:
> On Fri, Jun 27, 2025 at 09:43:29AM +0530, Nilay Shroff wrote:
>>
>> On 6/26/25 8:13 PM, Ming Lei wrote:
>> struct elevator_tags {
>> /* num. of hardware queues for which tags are allocated */
>> unsigned int nr_hw_queues;
>> /* depth used while allocating tags */
>> unsigned int nr_requests;
>> /* The index 0 in @tags is used to store shared sched tags */
>> struct blk_mq_tags **tags;
>
> s/struct blk_mq_tags **tags/struct blk_mq_tags *tags[];
>
> Then we can save one extra failure handling.

Yeah, I think that's the advantage of the flex array!

>> This seems like a clean and straightforward solution that should be easy to implement
>> without much hassle. Please let me know if this still needs adjustment.
>
> I think we are in same page now.

Great!

Thanks,
--Nilay
* [PATCHv4 3/3] block: fix potential deadlock while running nr_hw_queue update 2025-06-24 13:17 [PATCHv4 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff 2025-06-24 13:17 ` [PATCHv4 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff 2025-06-24 13:17 ` [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff @ 2025-06-24 13:17 ` Nilay Shroff 2 siblings, 0 replies; 8+ messages in thread From: Nilay Shroff @ 2025-06-24 13:17 UTC (permalink / raw) To: linux-block; +Cc: hch, ming.lei, axboe, sth, gjoyce Move scheduler tags (sched_tags) allocation and deallocation outside both the ->elevator_lock and ->freeze_lock when updating nr_hw_queues. This change breaks the dependency chain from the percpu allocator lock to the elevator lock, helping to prevent potential deadlocks, as observed in the reported lockdep splat[1]. This commit introduces batch allocation and deallocation helpers for sched_tags, which are now used from within __blk_mq_update_nr_hw_queues routine while iterating through the tagset. With this change, all sched_tags memory management is handled entirely outside the ->elevator_lock and the ->freeze_lock context, thereby eliminating the lock dependency that could otherwise manifest during nr_hw_queues updates. 
[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ Reported-by: Stefan Haberland <sth@linux.ibm.com> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> --- block/blk-mq-sched.c | 51 ++++++++++++++++++++++++++++++++++++++++++++ block/blk-mq-sched.h | 5 +++++ block/blk-mq.c | 12 ++++++++++- block/blk.h | 3 ++- block/elevator.c | 23 ++------------------ 5 files changed, 71 insertions(+), 23 deletions(-) diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index 5d3132ac7777..acdc03718ebd 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -508,6 +508,57 @@ int blk_mq_alloc_sched_tags(struct elevator_tags *et, return -ENOMEM; } +int blk_mq_alloc_sched_tags_batch(struct elevator_tags *et, + struct blk_mq_tag_set *set) +{ + struct request_queue *q; + + lockdep_assert_held_write(&set->update_nr_hwq_lock); + + list_for_each_entry(q, &set->tag_list, tag_set_list) { + /* + * Accessing q->elevator without holding q->elevator_lock is + * safe because we're holding here set->update_nr_hwq_lock in + * the writer context. So, scheduler update/switch code (which + * acquires the same lock but in the reader context) can't run + * concurrently. + */ + if (q->elevator) { + if (blk_mq_alloc_sched_tags(et, set, q->id)) + goto out_unwind; + } + } + return 0; + +out_unwind: + list_for_each_entry_continue_reverse(q, &set->tag_list, tag_set_list) { + if (q->elevator) + blk_mq_free_sched_tags(et, set, q->id); + } + + return -ENOMEM; +} + +void blk_mq_free_sched_tags_batch(struct elevator_tags *et, + struct blk_mq_tag_set *set) +{ + struct request_queue *q; + + lockdep_assert_held_write(&set->update_nr_hwq_lock); + + list_for_each_entry(q, &set->tag_list, tag_set_list) { + /* + * Accessing q->elevator without holding q->elevator_lock is + * safe because we're holding here set->update_nr_hwq_lock in + * the writer context. 
So, scheduler update/switch code (which + * acquires the same lock but in the reader context) can't run + * concurrently. + */ + if (q->elevator) + blk_mq_free_sched_tags(et, set, q->id); + } +} + /* caller must have a reference to @e, will grab another one if successful */ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e, struct elevator_tags *et) diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h index 92aa50b8376a..4b3bf8946ae2 100644 --- a/block/blk-mq-sched.h +++ b/block/blk-mq-sched.h @@ -29,6 +29,11 @@ struct blk_mq_tags **__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set, void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set, struct blk_mq_tags **tags, unsigned int nr_hw_queues); +int blk_mq_alloc_sched_tags_batch(struct elevator_tags *et, + struct blk_mq_tag_set *set); +void blk_mq_free_sched_tags_batch(struct elevator_tags *et, + struct blk_mq_tag_set *set); + int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e, struct elevator_tags *et); void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e); diff --git a/block/blk-mq.c b/block/blk-mq.c index 4806b867e37d..a06f184f1d9a 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -4970,6 +4970,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues) { struct request_queue *q; + struct elevator_tags et; int prev_nr_hw_queues = set->nr_hw_queues; unsigned int memflags; int i; @@ -4984,6 +4985,12 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, return; memflags = memalloc_noio_save(); + + et.nr_hw_queues = nr_hw_queues; + xa_init(&et.tags_table); + if (blk_mq_alloc_sched_tags_batch(&et, set) < 0) + goto memalloc_restore; + list_for_each_entry(q, &set->tag_list, tag_set_list) { blk_mq_debugfs_unregister_hctxs(q); blk_mq_sysfs_unregister_hctxs(q); @@ -4995,6 +5002,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) 
{ list_for_each_entry(q, &set->tag_list, tag_set_list) blk_mq_unfreeze_queue_nomemrestore(q); + blk_mq_free_sched_tags_batch(&et, set); goto reregister; } @@ -5019,7 +5027,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, /* elv_update_nr_hw_queues() unfreeze queue for us */ list_for_each_entry(q, &set->tag_list, tag_set_list) - elv_update_nr_hw_queues(q); + elv_update_nr_hw_queues(q, &et); reregister: list_for_each_entry(q, &set->tag_list, tag_set_list) { @@ -5029,7 +5037,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, blk_mq_remove_hw_queues_cpuhp(q); blk_mq_add_hw_queues_cpuhp(q); } +memalloc_restore: memalloc_noio_restore(memflags); + xa_destroy(&et.tags_table); /* Free the excess tags when nr_hw_queues shrink. */ for (i = set->nr_hw_queues; i < prev_nr_hw_queues; i++) diff --git a/block/blk.h b/block/blk.h index 37ec459fe656..a312518fb8f3 100644 --- a/block/blk.h +++ b/block/blk.h @@ -12,6 +12,7 @@ #include "blk-crypto-internal.h" struct elevator_type; +struct elevator_tags; #define BLK_DEV_MAX_SECTORS (LLONG_MAX >> 9) #define BLK_MIN_SEGMENT_SIZE 4096 @@ -321,7 +322,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list, bool blk_insert_flush(struct request *rq); -void elv_update_nr_hw_queues(struct request_queue *q); +void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_tags *et); void elevator_set_default(struct request_queue *q); void elevator_set_none(struct request_queue *q); diff --git a/block/elevator.c b/block/elevator.c index 1408894c0396..4272f9bc7e11 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -717,31 +717,14 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx) * The I/O scheduler depends on the number of hardware queues, this forces a * reattachment when nr_hw_queues changes. 
 */
-void elv_update_nr_hw_queues(struct request_queue *q)
+void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_tags *et)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
-	struct elevator_tags et;
-	struct elv_change_ctx ctx = {};
+	struct elv_change_ctx ctx = {.et = et};
 	int ret = -ENODEV;

 	WARN_ON_ONCE(q->mq_freeze_depth == 0);

-	et.nr_hw_queues = set->nr_hw_queues;
-	xa_init(&et.tags_table);
-	ctx.et = &et;
-	/*
-	 * Accessing q->elevator without holding q->elevator_lock is safe here
-	 * because nr_hw_queue update is protected by set->update_nr_hwq_lock
-	 * in the writer context. So, scheduler update/switch code (which
-	 * acquires same lock in the reader context) can't run concurrently.
-	 */
-	if (q->elevator) {
-		if (blk_mq_alloc_sched_tags(ctx.et, set, q->id)) {
-			WARN_ON_ONCE(1);
-			goto out;
-		}
-	}
-
 	mutex_lock(&q->elevator_lock);
 	if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
 		ctx.name = q->elevator->type->elevator_name;
@@ -758,8 +741,6 @@ void elv_update_nr_hw_queues(struct request_queue *q)
 	 */
 	if (!xa_empty(&ctx.et->tags_table) && !ctx.new)
 		blk_mq_free_sched_tags(ctx.et, set, q->id);
-out:
-	xa_destroy(&ctx.et->tags_table);
 }

 /*
--
2.49.0