[PATCHv4 0/3] block: move sched_tags allocation/de-allocation outside of locking context

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCHv4 0/3] block: move sched_tags allocation/de-allocation outside of locking context
@ 2025-06-24 13:17 Nilay Shroff
  2025-06-24 13:17 ` [PATCHv4 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Nilay Shroff @ 2025-06-24 13:17 UTC (permalink / raw)
  To: linux-block; +Cc: hch, ming.lei, axboe, sth, gjoyce

Hi,

There have been a few reports[1] indicating potential lockdep warnings due
to a lock dependency from the percpu allocator to the elevator lock. This
patch series aims to eliminate that dependency.

The series consists of three patches:
The first patch is preparatory patch and just move elevator queue
allocation logic from ->init_sched into blk_mq_init_sched.

The second patch in the series restructures sched_tags allocation and
deallocation during elevator update/switch operations to ensure these
actions are performed entirely outside the ->freeze_lock and ->elevator_
lock. This eliminates the percpu allocator’s lock dependency on the
elevator and freeze lock during scheduler transitions.

The third patch introduces batch allocation and deallocation helpers for
sched_tags. These helpers are used during __blk_mq_update_nr_hw_queues()
to decouple sched_tags memory management from both the elevator and freeze
locks, addressing the lockdep concerns in the nr_hw_queues update path.

[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/

Changes since v3:
    - Further split the patchset into three patch series so that we can
      have a separate patch for sched_tags batch allocation/deallocation
      (Ming Lei)
    - Use Xarray to store and load the sched_tags (Ming Lei)
    - Unexport elevator_alloc() as we no longer need to use it outside
      of block layer core (hch)
    - unwind the sched_tags allocation and free tags when we it fails in
      the middle of allocation (hch)
    - Move struct elevator_queue header from commin header to elevator.c
      as there's no user of it outside elevator.c (Ming Lei, hch)
Link to v3: https://lore.kernel.org/all/20250616173233.3803824-1-nilay@linux.ibm.com/

Change since v2:
    - Split the patch into a two patch series. The first patch updates
      ->init_sched elevator API change and second patch handles the sched
      tags allocation/de-allocation logic (Ming Lei)
    - Address sched tags allocation/deallocation logic while running in the
      context of nr_hw_queue update so that we can handle all possible
      scenarios in a single patchest (Ming Lei)
Link to v2: https://lore.kernel.org/all/20250528123638.1029700-1-nilay@linux.ibm.com/

Changes since v1:
    - As the lifetime of elevator queue and sched tags are same, allocate
      and move sched tags under struct elevator_queue (Ming Lei)
Link to v1: https://lore.kernel.org/all/20250520103425.1259712-1-nilay@linux.ibm.com/

Nilay Shroff (3):
  block: move elevator queue allocation logic into blk_mq_init_sched
  block: fix lockdep warning caused by lock dependency in
    elv_iosched_store
  block: fix potential deadlock while running nr_hw_queue update

 block/bfq-iosched.c   |  13 +--
 block/blk-mq-sched.c  | 238 +++++++++++++++++++++++++++++-------------
 block/blk-mq-sched.h  |  19 +++-
 block/blk-mq.c        |  12 ++-
 block/blk.h           |   3 +-
 block/elevator.c      |  52 +++++++--
 block/elevator.h      |  21 +++-
 block/kyber-iosched.c |  11 +-
 block/mq-deadline.c   |  14 +--
 9 files changed, 267 insertions(+), 116 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCHv4 1/3] block: move elevator queue allocation logic into blk_mq_init_sched
  2025-06-24 13:17 [PATCHv4 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
@ 2025-06-24 13:17 ` Nilay Shroff
  2025-06-24 13:17 ` [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
  2025-06-24 13:17 ` [PATCHv4 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
  2 siblings, 0 replies; 8+ messages in thread
From: Nilay Shroff @ 2025-06-24 13:17 UTC (permalink / raw)
  To: linux-block; +Cc: hch, ming.lei, axboe, sth, gjoyce

In preparation for allocating sched_tags before freezing the request
queue and acquiring ->elevator_lock, move the elevator queue allocation
logic from the elevator ops ->init_sched callback into blk_mq_init_sched.
As elevator_alloc is now only invoked from block layer core, we don't
need to export it, so unexport elevator_alloc function.

This refactoring provides a centralized location for elevator queue
initialization, which makes it easier to store pre-allocated sched_tags
in the struct elevator_queue during later changes.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 block/bfq-iosched.c   | 13 +++----------
 block/blk-mq-sched.c  | 11 ++++++++---
 block/elevator.c      |  1 -
 block/elevator.h      |  2 +-
 block/kyber-iosched.c | 11 ++---------
 block/mq-deadline.c   | 14 ++------------
 6 files changed, 16 insertions(+), 36 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0cb1e9873aab..fd26dc1901b0 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7232,22 +7232,16 @@ static void bfq_init_root_group(struct bfq_group *root_group,
 	root_group->sched_data.bfq_class_idle_last_service = jiffies;
 }
 
-static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
+static int bfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct bfq_data *bfqd;
-	struct elevator_queue *eq;
 	unsigned int i;
 	struct blk_independent_access_ranges *ia_ranges = q->disk->ia_ranges;
 
-	eq = elevator_alloc(q, e);
-	if (!eq)
-		return -ENOMEM;
-
 	bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
-	if (!bfqd) {
-		kobject_put(&eq->kobj);
+	if (!bfqd)
 		return -ENOMEM;
-	}
+
 	eq->elevator_data = bfqd;
 
 	spin_lock_irq(&q->queue_lock);
@@ -7405,7 +7399,6 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 
 out_free:
 	kfree(bfqd);
-	kobject_put(&eq->kobj);
 	return -ENOMEM;
 }
 
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 55a0fd105147..359e0704e09b 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -475,10 +475,14 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
 				   BLKDEV_DEFAULT_RQ);
 
+	eq = elevator_alloc(q, e);
+	if (!eq)
+		return -ENOMEM;
+
 	if (blk_mq_is_shared_tags(flags)) {
 		ret = blk_mq_init_sched_shared_tags(q);
 		if (ret)
-			return ret;
+			goto err_put_elevator;
 	}
 
 	queue_for_each_hw_ctx(q, hctx, i) {
@@ -487,7 +491,7 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 			goto err_free_map_and_rqs;
 	}
 
-	ret = e->ops.init_sched(q, e);
+	ret = e->ops.init_sched(q, eq);
 	if (ret)
 		goto err_free_map_and_rqs;
 
@@ -508,7 +512,8 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 err_free_map_and_rqs:
 	blk_mq_sched_free_rqs(q);
 	blk_mq_sched_tags_teardown(q, flags);
-
+err_put_elevator:
+	kobject_put(&eq->kobj);
 	q->elevator = NULL;
 	return ret;
 }
diff --git a/block/elevator.c b/block/elevator.c
index ab22542e6cf0..770874040f79 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -148,7 +148,6 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
 
 	return eq;
 }
-EXPORT_SYMBOL(elevator_alloc);
 
 static void elevator_release(struct kobject *kobj)
 {
diff --git a/block/elevator.h b/block/elevator.h
index a07ce773a38f..a4de5f9ad790 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -24,7 +24,7 @@ struct blk_mq_alloc_data;
 struct blk_mq_hw_ctx;
 
 struct elevator_mq_ops {
-	int (*init_sched)(struct request_queue *, struct elevator_type *);
+	int (*init_sched)(struct request_queue *, struct elevator_queue *);
 	void (*exit_sched)(struct elevator_queue *);
 	int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
 	void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 4dba8405bd01..7b6832cb3a8d 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -402,20 +402,13 @@ static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
 	return ERR_PTR(ret);
 }
 
-static int kyber_init_sched(struct request_queue *q, struct elevator_type *e)
+static int kyber_init_sched(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct kyber_queue_data *kqd;
-	struct elevator_queue *eq;
-
-	eq = elevator_alloc(q, e);
-	if (!eq)
-		return -ENOMEM;
 
 	kqd = kyber_queue_data_alloc(q);
-	if (IS_ERR(kqd)) {
-		kobject_put(&eq->kobj);
+	if (IS_ERR(kqd))
 		return PTR_ERR(kqd);
-	}
 
 	blk_stat_enable_accounting(q);
 
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 2edf1cac06d5..7b6caf30e00a 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -568,20 +568,14 @@ static void dd_exit_sched(struct elevator_queue *e)
 /*
  * initialize elevator private data (deadline_data).
  */
-static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
+static int dd_init_sched(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct deadline_data *dd;
-	struct elevator_queue *eq;
 	enum dd_prio prio;
-	int ret = -ENOMEM;
-
-	eq = elevator_alloc(q, e);
-	if (!eq)
-		return ret;
 
 	dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
 	if (!dd)
-		goto put_eq;
+		return -ENOMEM;
 
 	eq->elevator_data = dd;
 
@@ -608,10 +602,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 
 	q->elevator = eq;
 	return 0;
-
-put_eq:
-	kobject_put(&eq->kobj);
-	return ret;
 }
 
 /*
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
  2025-06-24 13:17 [PATCHv4 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
  2025-06-24 13:17 ` [PATCHv4 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
@ 2025-06-24 13:17 ` Nilay Shroff
  2025-06-26 14:43   ` Ming Lei
  2025-06-24 13:17 ` [PATCHv4 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
  2 siblings, 1 reply; 8+ messages in thread
From: Nilay Shroff @ 2025-06-24 13:17 UTC (permalink / raw)
  To: linux-block; +Cc: hch, ming.lei, axboe, sth, gjoyce

Recent lockdep reports [1] have revealed a potential deadlock caused by a
lock dependency between the percpu allocator lock and the elevator lock.
This issue can be avoided by ensuring that the allocation and release of
scheduler tags (sched_tags) are performed outside the elevator lock.
Furthermore, the queue does not need to be remain frozen during these
operations.

To address this, move all sched_tags allocations and deallocations outside
of both the ->elevator_lock and the ->freeze_lock. Since the lifetime of
the elevator queue and its associated sched_tags is closely tied, the
allocated sched_tags are now stored in the elevator queue structure. Then,
during the actual elevator switch (which runs under ->freeze_lock and
->elevator_lock), the pre-allocated sched_tags are assigned to the
appropriate q->hctx. Once the elevator switch is complete and the locks
are released, the old elevator queue and its associated sched_tags are
freed.

This commit specifically addresses the allocation/deallocation of sched_
tags during elevator switching. Note that sched_tags may also be allocated
in other contexts, such as during nr_hw_queues updates. Supporting that
use case will require batch allocation/deallocation, which will be handled
in a follow-up patch.

This restructuring ensures that sched_tags memory management occurs
entirely outside of the ->elevator_lock and ->freeze_lock context,
eliminating the lock dependency problem seen during scheduler updates.

[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/

Reported-by: Stefan Haberland <sth@linux.ibm.com>
Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 block/blk-mq-sched.c | 186 ++++++++++++++++++++++++++-----------------
 block/blk-mq-sched.h |  14 +++-
 block/elevator.c     |  66 +++++++++++++--
 block/elevator.h     |  19 ++++-
 4 files changed, 204 insertions(+), 81 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 359e0704e09b..5d3132ac7777 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -374,64 +374,17 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq,
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge);
 
-static int blk_mq_sched_alloc_map_and_rqs(struct request_queue *q,
-					  struct blk_mq_hw_ctx *hctx,
-					  unsigned int hctx_idx)
-{
-	if (blk_mq_is_shared_tags(q->tag_set->flags)) {
-		hctx->sched_tags = q->sched_shared_tags;
-		return 0;
-	}
-
-	hctx->sched_tags = blk_mq_alloc_map_and_rqs(q->tag_set, hctx_idx,
-						    q->nr_requests);
-
-	if (!hctx->sched_tags)
-		return -ENOMEM;
-	return 0;
-}
-
-static void blk_mq_exit_sched_shared_tags(struct request_queue *queue)
-{
-	blk_mq_free_rq_map(queue->sched_shared_tags);
-	queue->sched_shared_tags = NULL;
-}
-
 /* called in queue's release handler, tagset has gone away */
 static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int flags)
 {
 	struct blk_mq_hw_ctx *hctx;
 	unsigned long i;
 
-	queue_for_each_hw_ctx(q, hctx, i) {
-		if (hctx->sched_tags) {
-			if (!blk_mq_is_shared_tags(flags))
-				blk_mq_free_rq_map(hctx->sched_tags);
-			hctx->sched_tags = NULL;
-		}
-	}
+	queue_for_each_hw_ctx(q, hctx, i)
+		hctx->sched_tags = NULL;
 
 	if (blk_mq_is_shared_tags(flags))
-		blk_mq_exit_sched_shared_tags(q);
-}
-
-static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
-{
-	struct blk_mq_tag_set *set = queue->tag_set;
-
-	/*
-	 * Set initial depth at max so that we don't need to reallocate for
-	 * updating nr_requests.
-	 */
-	queue->sched_shared_tags = blk_mq_alloc_map_and_rqs(set,
-						BLK_MQ_NO_HCTX_IDX,
-						MAX_SCHED_RQ);
-	if (!queue->sched_shared_tags)
-		return -ENOMEM;
-
-	blk_mq_tag_update_sched_shared_tags(queue);
-
-	return 0;
+		q->sched_shared_tags = NULL;
 }
 
 void blk_mq_sched_reg_debugfs(struct request_queue *q)
@@ -458,8 +411,106 @@ void blk_mq_sched_unreg_debugfs(struct request_queue *q)
 	mutex_unlock(&q->debugfs_mutex);
 }
 
+void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set,
+		struct blk_mq_tags **tags, unsigned int nr_hw_queues)
+{
+	unsigned long i;
+
+	if (!tags)
+		return;
+
+	/* Shared tags are stored at index 0 in @tags. */
+	if (blk_mq_is_shared_tags(set->flags))
+		blk_mq_free_map_and_rqs(set, tags[0], BLK_MQ_NO_HCTX_IDX);
+	else {
+		for (i = 0; i < nr_hw_queues; i++)
+			blk_mq_free_map_and_rqs(set, tags[i], i);
+	}
+
+	kfree(tags);
+}
+
+void blk_mq_free_sched_tags(struct elevator_tags *et,
+		struct blk_mq_tag_set *set, int id)
+{
+	struct blk_mq_tags **tags;
+
+	tags = xa_load(&et->tags_table, id);
+	__blk_mq_free_sched_tags(set, tags, et->nr_hw_queues);
+}
+
+struct blk_mq_tags **__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
+				unsigned int nr_hw_queues,
+				unsigned int nr_requests,
+				gfp_t gfp)
+{
+	int i, nr_tags;
+	struct blk_mq_tags **tags;
+
+	if (blk_mq_is_shared_tags(set->flags))
+		nr_tags = 1;
+	else
+		nr_tags = nr_hw_queues;
+
+	tags = kcalloc(nr_tags, sizeof(struct blk_mq_tags *), gfp);
+	if (!tags)
+		return NULL;
+
+	if (blk_mq_is_shared_tags(set->flags)) {
+		/* Shared tags are stored at index 0 in @tags. */
+		tags[0] = blk_mq_alloc_map_and_rqs(set, BLK_MQ_NO_HCTX_IDX,
+					MAX_SCHED_RQ);
+		if (!tags[0])
+			goto out;
+	} else {
+		for (i = 0; i < nr_hw_queues; i++) {
+			tags[i] = blk_mq_alloc_map_and_rqs(set, i, nr_requests);
+			if (!tags[i])
+				goto out_unwind;
+		}
+	}
+
+	return tags;
+out_unwind:
+	while (--i >= 0)
+		blk_mq_free_map_and_rqs(set, tags[i], i);
+out:
+	kfree(tags);
+	return NULL;
+}
+
+int blk_mq_alloc_sched_tags(struct elevator_tags *et,
+		struct blk_mq_tag_set *set, int id)
+{
+	struct blk_mq_tags **tags;
+	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
+
+	/*
+	 * Default to double of smaller one between hw queue_depth and
+	 * 128, since we don't split into sync/async like the old code
+	 * did. Additionally, this is a per-hw queue depth.
+	 */
+	et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
+			BLKDEV_DEFAULT_RQ);
+
+	tags = __blk_mq_alloc_sched_tags(set, et->nr_hw_queues,
+			et->nr_requests, gfp);
+	if (!tags)
+		return -ENOMEM;
+
+	if (xa_insert(&et->tags_table, id, tags, gfp))
+		goto out_free_tags;
+
+	return 0;
+
+out_free_tags:
+	__blk_mq_free_sched_tags(set, tags, et->nr_hw_queues);
+	return -ENOMEM;
+}
+
 /* caller must have a reference to @e, will grab another one if successful */
-int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
+		struct elevator_tags *et)
 {
 	unsigned int flags = q->tag_set->flags;
 	struct blk_mq_hw_ctx *hctx;
@@ -467,40 +518,33 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	unsigned long i;
 	int ret;
 
-	/*
-	 * Default to double of smaller one between hw queue_depth and 128,
-	 * since we don't split into sync/async like the old code did.
-	 * Additionally, this is a per-hw queue depth.
-	 */
-	q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
-				   BLKDEV_DEFAULT_RQ);
-
-	eq = elevator_alloc(q, e);
+	eq = elevator_alloc(q, e, et);
 	if (!eq)
 		return -ENOMEM;
 
+	q->nr_requests = et->nr_requests;
+
 	if (blk_mq_is_shared_tags(flags)) {
-		ret = blk_mq_init_sched_shared_tags(q);
-		if (ret)
-			goto err_put_elevator;
+		/* Shared tags are stored at index 0 in @eq->tags. */
+		q->sched_shared_tags = eq->tags[0];
+		blk_mq_tag_update_sched_shared_tags(q);
 	}
 
 	queue_for_each_hw_ctx(q, hctx, i) {
-		ret = blk_mq_sched_alloc_map_and_rqs(q, hctx, i);
-		if (ret)
-			goto err_free_map_and_rqs;
+		if (blk_mq_is_shared_tags(flags))
+			hctx->sched_tags = q->sched_shared_tags;
+		else
+			hctx->sched_tags = eq->tags[i];
 	}
 
 	ret = e->ops.init_sched(q, eq);
 	if (ret)
-		goto err_free_map_and_rqs;
+		goto out;
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (e->ops.init_hctx) {
 			ret = e->ops.init_hctx(hctx, i);
 			if (ret) {
-				eq = q->elevator;
-				blk_mq_sched_free_rqs(q);
 				blk_mq_exit_sched(q, eq);
 				kobject_put(&eq->kobj);
 				return ret;
@@ -509,10 +553,8 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	}
 	return 0;
 
-err_free_map_and_rqs:
-	blk_mq_sched_free_rqs(q);
+out:
 	blk_mq_sched_tags_teardown(q, flags);
-err_put_elevator:
 	kobject_put(&eq->kobj);
 	q->elevator = NULL;
 	return ret;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 1326526bb733..92aa50b8376a 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -18,7 +18,19 @@ void __blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
 
-int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
+int blk_mq_alloc_sched_tags(struct elevator_tags *et,
+		struct blk_mq_tag_set *set, int id);
+void blk_mq_free_sched_tags(struct elevator_tags *et,
+		struct blk_mq_tag_set *set, int id);
+struct blk_mq_tags **__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
+					unsigned int nr_hw_queues,
+					unsigned int nr_requests,
+					gfp_t gfp);
+void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set,
+		struct blk_mq_tags **tags, unsigned int nr_hw_queues);
+
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
+		struct elevator_tags *et);
 void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
 void blk_mq_sched_free_rqs(struct request_queue *q);
 
diff --git a/block/elevator.c b/block/elevator.c
index 770874040f79..1408894c0396 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -54,6 +54,8 @@ struct elv_change_ctx {
 	struct elevator_queue *old;
 	/* for registering new elevator */
 	struct elevator_queue *new;
+	/* holds sched tags data */
+	struct elevator_tags *et;
 };
 
 static DEFINE_SPINLOCK(elv_list_lock);
@@ -132,7 +134,7 @@ static struct elevator_type *elevator_find_get(const char *name)
 static const struct kobj_type elv_ktype;
 
 struct elevator_queue *elevator_alloc(struct request_queue *q,
-				  struct elevator_type *e)
+		struct elevator_type *e, struct elevator_tags *et)
 {
 	struct elevator_queue *eq;
 
@@ -145,8 +147,15 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
 	kobject_init(&eq->kobj, &elv_ktype);
 	mutex_init(&eq->sysfs_lock);
 	hash_init(eq->hash);
+	eq->nr_hw_queues = et->nr_hw_queues;
+	eq->tags = xa_load(&et->tags_table, q->id);
+	if (WARN_ON_ONCE(!eq->tags))
+		goto out;
 
 	return eq;
+out:
+	kobject_put(&eq->kobj);
+	return NULL;
 }
 
 static void elevator_release(struct kobject *kobj)
@@ -165,7 +174,6 @@ static void elevator_exit(struct request_queue *q)
 	lockdep_assert_held(&q->elevator_lock);
 
 	ioc_clear_queue(q);
-	blk_mq_sched_free_rqs(q);
 
 	mutex_lock(&e->sysfs_lock);
 	blk_mq_exit_sched(q, e);
@@ -591,7 +599,7 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
 	}
 
 	if (new_e) {
-		ret = blk_mq_init_sched(q, new_e);
+		ret = blk_mq_init_sched(q, new_e, ctx->et);
 		if (ret)
 			goto out_unfreeze;
 		ctx->new = q->elevator;
@@ -626,8 +634,10 @@ static void elv_exit_and_release(struct request_queue *q)
 	elevator_exit(q);
 	mutex_unlock(&q->elevator_lock);
 	blk_mq_unfreeze_queue(q, memflags);
-	if (e)
+	if (e) {
+		__blk_mq_free_sched_tags(q->tag_set, e->tags, e->nr_hw_queues);
 		kobject_put(&e->kobj);
+	}
 }
 
 static int elevator_change_done(struct request_queue *q,
@@ -640,6 +650,8 @@ static int elevator_change_done(struct request_queue *q,
 				&ctx->old->flags);
 
 		elv_unregister_queue(q, ctx->old);
+		__blk_mq_free_sched_tags(q->tag_set, ctx->old->tags,
+				ctx->old->nr_hw_queues);
 		kobject_put(&ctx->old->kobj);
 		if (enable_wbt)
 			wbt_enable_default(q->disk);
@@ -658,9 +670,20 @@ static int elevator_change_done(struct request_queue *q,
 static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
 {
 	unsigned int memflags;
+	struct elevator_tags et;
+	struct blk_mq_tag_set *set = q->tag_set;
 	int ret = 0;
 
-	lockdep_assert_held(&q->tag_set->update_nr_hwq_lock);
+	lockdep_assert_held(&set->update_nr_hwq_lock);
+
+	et.nr_hw_queues = set->nr_hw_queues;
+	xa_init(&et.tags_table);
+	ctx->et = &et;
+	if (strncmp(ctx->name, "none", 4)) {
+		ret = blk_mq_alloc_sched_tags(ctx->et, set, q->id);
+		if (ret)
+			goto out;
+	}
 
 	memflags = blk_mq_freeze_queue(q);
 	/*
@@ -680,7 +703,13 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
 	blk_mq_unfreeze_queue(q, memflags);
 	if (!ret)
 		ret = elevator_change_done(q, ctx);
-
+	/*
+	 * Free sched tags if it's allocated but we couldn't switch elevator.
+	 */
+	if (!xa_empty(&ctx->et->tags_table) && !ctx->new)
+		blk_mq_free_sched_tags(ctx->et, set, q->id);
+out:
+	xa_destroy(&ctx->et->tags_table);
 	return ret;
 }
 
@@ -690,11 +719,29 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
  */
 void elv_update_nr_hw_queues(struct request_queue *q)
 {
+	struct blk_mq_tag_set *set = q->tag_set;
+	struct elevator_tags et;
 	struct elv_change_ctx ctx = {};
 	int ret = -ENODEV;
 
 	WARN_ON_ONCE(q->mq_freeze_depth == 0);
 
+	et.nr_hw_queues = set->nr_hw_queues;
+	xa_init(&et.tags_table);
+	ctx.et = &et;
+	/*
+	 * Accessing q->elevator without holding q->elevator_lock is safe here
+	 * because nr_hw_queue update is protected by set->update_nr_hwq_lock
+	 * in the writer context. So, scheduler update/switch code (which
+	 * acquires same lock in the reader context) can't run concurrently.
+	 */
+	if (q->elevator) {
+		if (blk_mq_alloc_sched_tags(ctx.et, set, q->id)) {
+			WARN_ON_ONCE(1);
+			goto out;
+		}
+	}
+
 	mutex_lock(&q->elevator_lock);
 	if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
 		ctx.name = q->elevator->type->elevator_name;
@@ -706,6 +753,13 @@ void elv_update_nr_hw_queues(struct request_queue *q)
 	blk_mq_unfreeze_queue_nomemrestore(q);
 	if (!ret)
 		WARN_ON_ONCE(elevator_change_done(q, &ctx));
+	/*
+	 * Free sched tags if it's allocated but we couldn't switch elevator.
+	 */
+	if (!xa_empty(&ctx.et->tags_table) && !ctx.new)
+		blk_mq_free_sched_tags(ctx.et, set, q->id);
+out:
+	xa_destroy(&ctx.et->tags_table);
 }
 
 /*
diff --git a/block/elevator.h b/block/elevator.h
index a4de5f9ad790..849c264031ca 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -23,6 +23,19 @@ enum elv_merge {
 struct blk_mq_alloc_data;
 struct blk_mq_hw_ctx;
 
+struct elevator_tags {
+	/* num. of hardware queues for which tags are allocated */
+	unsigned int nr_hw_queues;
+	/* depth used while allocating tags */
+	unsigned int nr_requests;
+	/*
+	 * An Xarray table storing shared and per hardware queue sched tags.
+	 * The key to store an entry in Xarray is q->id. Please note that
+	 * shared sched tags are always stored at index 0.
+	 */
+	struct xarray tags_table;
+};
+
 struct elevator_mq_ops {
 	int (*init_sched)(struct request_queue *, struct elevator_queue *);
 	void (*exit_sched)(struct elevator_queue *);
@@ -113,6 +126,8 @@ struct request *elv_rqhash_find(struct request_queue *q, sector_t offset);
 struct elevator_queue
 {
 	struct elevator_type *type;
+	struct blk_mq_tags **tags;
+	unsigned int nr_hw_queues;
 	void *elevator_data;
 	struct kobject kobj;
 	struct mutex sysfs_lock;
@@ -152,8 +167,8 @@ ssize_t elv_iosched_show(struct gendisk *disk, char *page);
 ssize_t elv_iosched_store(struct gendisk *disk, const char *page, size_t count);
 
 extern bool elv_bio_merge_ok(struct request *, struct bio *);
-extern struct elevator_queue *elevator_alloc(struct request_queue *,
-					struct elevator_type *);
+struct elevator_queue *elevator_alloc(struct request_queue *,
+		struct elevator_type *, struct elevator_tags *);
 
 /*
  * Helper functions.
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCHv4 3/3] block: fix potential deadlock while running nr_hw_queue update
  2025-06-24 13:17 [PATCHv4 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
  2025-06-24 13:17 ` [PATCHv4 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
  2025-06-24 13:17 ` [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
@ 2025-06-24 13:17 ` Nilay Shroff
  2 siblings, 0 replies; 8+ messages in thread
From: Nilay Shroff @ 2025-06-24 13:17 UTC (permalink / raw)
  To: linux-block; +Cc: hch, ming.lei, axboe, sth, gjoyce

Move scheduler tags (sched_tags) allocation and deallocation outside
both the ->elevator_lock and ->freeze_lock when updating nr_hw_queues.
This change breaks the dependency chain from the percpu allocator lock
to the elevator lock, helping to prevent potential deadlocks, as
observed in the reported lockdep splat[1].

This commit introduces batch allocation and deallocation helpers for
sched_tags, which are now used from within __blk_mq_update_nr_hw_queues
routine while iterating through the tagset.

With this change, all sched_tags memory management is handled entirely
outside the ->elevator_lock and the ->freeze_lock context, thereby
eliminating the lock dependency that could otherwise manifest during
nr_hw_queues updates.

[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/

Reported-by: Stefan Haberland <sth@linux.ibm.com>
Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 block/blk-mq-sched.c | 51 ++++++++++++++++++++++++++++++++++++++++++++
 block/blk-mq-sched.h |  5 +++++
 block/blk-mq.c       | 12 ++++++++++-
 block/blk.h          |  3 ++-
 block/elevator.c     | 23 ++------------------
 5 files changed, 71 insertions(+), 23 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 5d3132ac7777..acdc03718ebd 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -508,6 +508,57 @@ int blk_mq_alloc_sched_tags(struct elevator_tags *et,
 	return -ENOMEM;
 }
 
+int blk_mq_alloc_sched_tags_batch(struct elevator_tags *et,
+		struct blk_mq_tag_set *set)
+{
+	struct request_queue *q;
+
+	lockdep_assert_held_write(&set->update_nr_hwq_lock);
+
+	list_for_each_entry(q, &set->tag_list, tag_set_list) {
+		/*
+		 * Accessing q->elevator without holding q->elevator_lock is
+		 * safe because we're holding here set->update_nr_hwq_lock in
+		 * the writer context. So, scheduler update/switch code (which
+		 * acquires the same lock but in the reader context) can't run
+		 * concurrently.
+		 */
+		if (q->elevator) {
+			if (blk_mq_alloc_sched_tags(et, set, q->id))
+				goto out_unwind;
+		}
+	}
+	return 0;
+
+out_unwind:
+	list_for_each_entry_continue_reverse(q, &set->tag_list, tag_set_list) {
+		if (q->elevator)
+			blk_mq_free_sched_tags(et, set, q->id);
+	}
+
+	return -ENOMEM;
+}
+
+void blk_mq_free_sched_tags_batch(struct elevator_tags *et,
+		struct blk_mq_tag_set *set)
+{
+	struct request_queue *q;
+
+	lockdep_assert_held_write(&set->update_nr_hwq_lock);
+
+	list_for_each_entry(q, &set->tag_list, tag_set_list) {
+		/*
+		 * Accessing q->elevator without holding q->elevator_lock is
+		 * safe because we're holding here set->update_nr_hwq_lock in
+		 * the writer context. So, scheduler update/switch code (which
+		 * acquires the same lock but in the reader context) can't run
+		 * concurrently.
+		 */
+		if (q->elevator)
+			blk_mq_free_sched_tags(et, set, q->id);
+	}
+}
+
 /* caller must have a reference to @e, will grab another one if successful */
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
 		struct elevator_tags *et)
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 92aa50b8376a..4b3bf8946ae2 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -29,6 +29,11 @@ struct blk_mq_tags **__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
 void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set,
 		struct blk_mq_tags **tags, unsigned int nr_hw_queues);
 
+int blk_mq_alloc_sched_tags_batch(struct elevator_tags *et,
+		struct blk_mq_tag_set *set);
+void blk_mq_free_sched_tags_batch(struct elevator_tags *et,
+		struct blk_mq_tag_set *set);
+
 int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
 		struct elevator_tags *et);
 void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4806b867e37d..a06f184f1d9a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4970,6 +4970,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 							int nr_hw_queues)
 {
 	struct request_queue *q;
+	struct elevator_tags et;
 	int prev_nr_hw_queues = set->nr_hw_queues;
 	unsigned int memflags;
 	int i;
@@ -4984,6 +4985,12 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		return;
 
 	memflags = memalloc_noio_save();
+
+	et.nr_hw_queues = nr_hw_queues;
+	xa_init(&et.tags_table);
+	if (blk_mq_alloc_sched_tags_batch(&et, set) < 0)
+		goto memalloc_restore;
+
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_debugfs_unregister_hctxs(q);
 		blk_mq_sysfs_unregister_hctxs(q);
@@ -4995,6 +5002,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) {
 		list_for_each_entry(q, &set->tag_list, tag_set_list)
 			blk_mq_unfreeze_queue_nomemrestore(q);
+		blk_mq_free_sched_tags_batch(&et, set);
 		goto reregister;
 	}
 
@@ -5019,7 +5027,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 
 	/* elv_update_nr_hw_queues() unfreeze queue for us */
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		elv_update_nr_hw_queues(q);
+		elv_update_nr_hw_queues(q, &et);
 
 reregister:
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
@@ -5029,7 +5037,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		blk_mq_remove_hw_queues_cpuhp(q);
 		blk_mq_add_hw_queues_cpuhp(q);
 	}
+memalloc_restore:
 	memalloc_noio_restore(memflags);
+	xa_destroy(&et.tags_table);
 
 	/* Free the excess tags when nr_hw_queues shrink. */
 	for (i = set->nr_hw_queues; i < prev_nr_hw_queues; i++)
diff --git a/block/blk.h b/block/blk.h
index 37ec459fe656..a312518fb8f3 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -12,6 +12,7 @@
 #include "blk-crypto-internal.h"
 
 struct elevator_type;
+struct elevator_tags;
 
 #define	BLK_DEV_MAX_SECTORS	(LLONG_MAX >> 9)
 #define	BLK_MIN_SEGMENT_SIZE	4096
@@ -321,7 +322,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
 
 bool blk_insert_flush(struct request *rq);
 
-void elv_update_nr_hw_queues(struct request_queue *q);
+void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_tags *et);
 void elevator_set_default(struct request_queue *q);
 void elevator_set_none(struct request_queue *q);
 
diff --git a/block/elevator.c b/block/elevator.c
index 1408894c0396..4272f9bc7e11 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -717,31 +717,14 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
  * The I/O scheduler depends on the number of hardware queues, this forces a
  * reattachment when nr_hw_queues changes.
  */
-void elv_update_nr_hw_queues(struct request_queue *q)
+void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_tags *et)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
-	struct elevator_tags et;
-	struct elv_change_ctx ctx = {};
+	struct elv_change_ctx ctx = {.et = et};
 	int ret = -ENODEV;
 
 	WARN_ON_ONCE(q->mq_freeze_depth == 0);
 
-	et.nr_hw_queues = set->nr_hw_queues;
-	xa_init(&et.tags_table);
-	ctx.et = &et;
-	/*
-	 * Accessing q->elevator without holding q->elevator_lock is safe here
-	 * because nr_hw_queue update is protected by set->update_nr_hwq_lock
-	 * in the writer context. So, scheduler update/switch code (which
-	 * acquires same lock in the reader context) can't run concurrently.
-	 */
-	if (q->elevator) {
-		if (blk_mq_alloc_sched_tags(ctx.et, set, q->id)) {
-			WARN_ON_ONCE(1);
-			goto out;
-		}
-	}
-
 	mutex_lock(&q->elevator_lock);
 	if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
 		ctx.name = q->elevator->type->elevator_name;
@@ -758,8 +741,6 @@ void elv_update_nr_hw_queues(struct request_queue *q)
 	 */
 	if (!xa_empty(&ctx.et->tags_table) && !ctx.new)
 		blk_mq_free_sched_tags(ctx.et, set, q->id);
-out:
-	xa_destroy(&ctx.et->tags_table);
 }
 
 /*
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
  2025-06-24 13:17 ` [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
@ 2025-06-26 14:43   ` Ming Lei
  2025-06-27  4:13     ` Nilay Shroff
  0 siblings, 1 reply; 8+ messages in thread
From: Ming Lei @ 2025-06-26 14:43 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: linux-block, hch, axboe, sth, gjoyce

On Tue, Jun 24, 2025 at 06:47:04PM +0530, Nilay Shroff wrote:
> Recent lockdep reports [1] have revealed a potential deadlock caused by a
> lock dependency between the percpu allocator lock and the elevator lock.
> This issue can be avoided by ensuring that the allocation and release of
> scheduler tags (sched_tags) are performed outside the elevator lock.
> Furthermore, the queue does not need to be remain frozen during these
> operations.
> 
> To address this, move all sched_tags allocations and deallocations outside
> of both the ->elevator_lock and the ->freeze_lock. Since the lifetime of
> the elevator queue and its associated sched_tags is closely tied, the
> allocated sched_tags are now stored in the elevator queue structure. Then,
> during the actual elevator switch (which runs under ->freeze_lock and
> ->elevator_lock), the pre-allocated sched_tags are assigned to the
> appropriate q->hctx. Once the elevator switch is complete and the locks
> are released, the old elevator queue and its associated sched_tags are
> freed.
> 
> This commit specifically addresses the allocation/deallocation of sched_
> tags during elevator switching. Note that sched_tags may also be allocated
> in other contexts, such as during nr_hw_queues updates. Supporting that
> use case will require batch allocation/deallocation, which will be handled
> in a follow-up patch.
> 
> This restructuring ensures that sched_tags memory management occurs
> entirely outside of the ->elevator_lock and ->freeze_lock context,
> eliminating the lock dependency problem seen during scheduler updates.
> 
> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
> 
> Reported-by: Stefan Haberland <sth@linux.ibm.com>
> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
>  block/blk-mq-sched.c | 186 ++++++++++++++++++++++++++-----------------
>  block/blk-mq-sched.h |  14 +++-
>  block/elevator.c     |  66 +++++++++++++--
>  block/elevator.h     |  19 ++++-
>  4 files changed, 204 insertions(+), 81 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 359e0704e09b..5d3132ac7777 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -374,64 +374,17 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq,
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge);
>  
> -static int blk_mq_sched_alloc_map_and_rqs(struct request_queue *q,
> -					  struct blk_mq_hw_ctx *hctx,
> -					  unsigned int hctx_idx)
> -{
> -	if (blk_mq_is_shared_tags(q->tag_set->flags)) {
> -		hctx->sched_tags = q->sched_shared_tags;
> -		return 0;
> -	}
> -
> -	hctx->sched_tags = blk_mq_alloc_map_and_rqs(q->tag_set, hctx_idx,
> -						    q->nr_requests);
> -
> -	if (!hctx->sched_tags)
> -		return -ENOMEM;
> -	return 0;
> -}
> -
> -static void blk_mq_exit_sched_shared_tags(struct request_queue *queue)
> -{
> -	blk_mq_free_rq_map(queue->sched_shared_tags);
> -	queue->sched_shared_tags = NULL;
> -}
> -
>  /* called in queue's release handler, tagset has gone away */
>  static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int flags)
>  {
>  	struct blk_mq_hw_ctx *hctx;
>  	unsigned long i;
>  
> -	queue_for_each_hw_ctx(q, hctx, i) {
> -		if (hctx->sched_tags) {
> -			if (!blk_mq_is_shared_tags(flags))
> -				blk_mq_free_rq_map(hctx->sched_tags);
> -			hctx->sched_tags = NULL;
> -		}
> -	}
> +	queue_for_each_hw_ctx(q, hctx, i)
> +		hctx->sched_tags = NULL;
>  
>  	if (blk_mq_is_shared_tags(flags))
> -		blk_mq_exit_sched_shared_tags(q);
> -}
> -
> -static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
> -{
> -	struct blk_mq_tag_set *set = queue->tag_set;
> -
> -	/*
> -	 * Set initial depth at max so that we don't need to reallocate for
> -	 * updating nr_requests.
> -	 */
> -	queue->sched_shared_tags = blk_mq_alloc_map_and_rqs(set,
> -						BLK_MQ_NO_HCTX_IDX,
> -						MAX_SCHED_RQ);
> -	if (!queue->sched_shared_tags)
> -		return -ENOMEM;
> -
> -	blk_mq_tag_update_sched_shared_tags(queue);
> -
> -	return 0;
> +		q->sched_shared_tags = NULL;
>  }
>  
>  void blk_mq_sched_reg_debugfs(struct request_queue *q)
> @@ -458,8 +411,106 @@ void blk_mq_sched_unreg_debugfs(struct request_queue *q)
>  	mutex_unlock(&q->debugfs_mutex);
>  }
>  
> +void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set,
> +		struct blk_mq_tags **tags, unsigned int nr_hw_queues)
> +{
> +	unsigned long i;
> +
> +	if (!tags)
> +		return;
> +
> +	/* Shared tags are stored at index 0 in @tags. */
> +	if (blk_mq_is_shared_tags(set->flags))
> +		blk_mq_free_map_and_rqs(set, tags[0], BLK_MQ_NO_HCTX_IDX);
> +	else {
> +		for (i = 0; i < nr_hw_queues; i++)
> +			blk_mq_free_map_and_rqs(set, tags[i], i);
> +	}
> +
> +	kfree(tags);
> +}
> +
> +void blk_mq_free_sched_tags(struct elevator_tags *et,
> +		struct blk_mq_tag_set *set, int id)
> +{
> +	struct blk_mq_tags **tags;
> +
> +	tags = xa_load(&et->tags_table, id);
> +	__blk_mq_free_sched_tags(set, tags, et->nr_hw_queues);
> +}
> +
> +struct blk_mq_tags **__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
> +				unsigned int nr_hw_queues,
> +				unsigned int nr_requests,
> +				gfp_t gfp)
> +{
> +	int i, nr_tags;
> +	struct blk_mq_tags **tags;
> +
> +	if (blk_mq_is_shared_tags(set->flags))
> +		nr_tags = 1;
> +	else
> +		nr_tags = nr_hw_queues;
> +
> +	tags = kcalloc(nr_tags, sizeof(struct blk_mq_tags *), gfp);
> +	if (!tags)
> +		return NULL;
> +
> +	if (blk_mq_is_shared_tags(set->flags)) {
> +		/* Shared tags are stored at index 0 in @tags. */
> +		tags[0] = blk_mq_alloc_map_and_rqs(set, BLK_MQ_NO_HCTX_IDX,
> +					MAX_SCHED_RQ);
> +		if (!tags[0])
> +			goto out;
> +	} else {
> +		for (i = 0; i < nr_hw_queues; i++) {
> +			tags[i] = blk_mq_alloc_map_and_rqs(set, i, nr_requests);
> +			if (!tags[i])
> +				goto out_unwind;
> +		}
> +	}
> +
> +	return tags;
> +out_unwind:
> +	while (--i >= 0)
> +		blk_mq_free_map_and_rqs(set, tags[i], i);
> +out:
> +	kfree(tags);
> +	return NULL;
> +}
> +
> +int blk_mq_alloc_sched_tags(struct elevator_tags *et,
> +		struct blk_mq_tag_set *set, int id)
> +{
> +	struct blk_mq_tags **tags;
> +	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
> +
> +	/*
> +	 * Default to double of smaller one between hw queue_depth and
> +	 * 128, since we don't split into sync/async like the old code
> +	 * did. Additionally, this is a per-hw queue depth.
> +	 */
> +	et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
> +			BLKDEV_DEFAULT_RQ);
> +
> +	tags = __blk_mq_alloc_sched_tags(set, et->nr_hw_queues,
> +			et->nr_requests, gfp);
> +	if (!tags)
> +		return -ENOMEM;
> +
> +	if (xa_insert(&et->tags_table, id, tags, gfp))
> +		goto out_free_tags;

Not sure why you add `tags_table` into `elevator_tags`, it looks very
confusing just from naming, not mention only single element is stored to
the table in elevator switch code path since queue id is the key.

Wrt. xarray suggestion, I meant to add on xarray local variable in
__blk_mq_update_nr_hw_queues(), then you can store allocated `elevator_tags`
to the xarray via ->id of request_queue.



Thanks,
Ming


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
  2025-06-26 14:43   ` Ming Lei
@ 2025-06-27  4:13     ` Nilay Shroff
  2025-06-27  7:58       ` Ming Lei
  0 siblings, 1 reply; 8+ messages in thread
From: Nilay Shroff @ 2025-06-27  4:13 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, hch, axboe, sth, gjoyce



On 6/26/25 8:13 PM, Ming Lei wrote:
> On Tue, Jun 24, 2025 at 06:47:04PM +0530, Nilay Shroff wrote:
>> Recent lockdep reports [1] have revealed a potential deadlock caused by a
>> lock dependency between the percpu allocator lock and the elevator lock.
>> This issue can be avoided by ensuring that the allocation and release of
>> scheduler tags (sched_tags) are performed outside the elevator lock.
>> Furthermore, the queue does not need to be remain frozen during these
>> operations.
>>
>> To address this, move all sched_tags allocations and deallocations outside
>> of both the ->elevator_lock and the ->freeze_lock. Since the lifetime of
>> the elevator queue and its associated sched_tags is closely tied, the
>> allocated sched_tags are now stored in the elevator queue structure. Then,
>> during the actual elevator switch (which runs under ->freeze_lock and
>> ->elevator_lock), the pre-allocated sched_tags are assigned to the
>> appropriate q->hctx. Once the elevator switch is complete and the locks
>> are released, the old elevator queue and its associated sched_tags are
>> freed.
>>
>> This commit specifically addresses the allocation/deallocation of sched_
>> tags during elevator switching. Note that sched_tags may also be allocated
>> in other contexts, such as during nr_hw_queues updates. Supporting that
>> use case will require batch allocation/deallocation, which will be handled
>> in a follow-up patch.
>>
>> This restructuring ensures that sched_tags memory management occurs
>> entirely outside of the ->elevator_lock and ->freeze_lock context,
>> eliminating the lock dependency problem seen during scheduler updates.
>>
>> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
>>
>> Reported-by: Stefan Haberland <sth@linux.ibm.com>
>> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>> ---
>>  block/blk-mq-sched.c | 186 ++++++++++++++++++++++++++-----------------
>>  block/blk-mq-sched.h |  14 +++-
>>  block/elevator.c     |  66 +++++++++++++--
>>  block/elevator.h     |  19 ++++-
>>  4 files changed, 204 insertions(+), 81 deletions(-)
>>
>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>> index 359e0704e09b..5d3132ac7777 100644
>> --- a/block/blk-mq-sched.c
>> +++ b/block/blk-mq-sched.c
>> @@ -374,64 +374,17 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq,
>>  }
>>  EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge);
>>  
>> -static int blk_mq_sched_alloc_map_and_rqs(struct request_queue *q,
>> -					  struct blk_mq_hw_ctx *hctx,
>> -					  unsigned int hctx_idx)
>> -{
>> -	if (blk_mq_is_shared_tags(q->tag_set->flags)) {
>> -		hctx->sched_tags = q->sched_shared_tags;
>> -		return 0;
>> -	}
>> -
>> -	hctx->sched_tags = blk_mq_alloc_map_and_rqs(q->tag_set, hctx_idx,
>> -						    q->nr_requests);
>> -
>> -	if (!hctx->sched_tags)
>> -		return -ENOMEM;
>> -	return 0;
>> -}
>> -
>> -static void blk_mq_exit_sched_shared_tags(struct request_queue *queue)
>> -{
>> -	blk_mq_free_rq_map(queue->sched_shared_tags);
>> -	queue->sched_shared_tags = NULL;
>> -}
>> -
>>  /* called in queue's release handler, tagset has gone away */
>>  static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int flags)
>>  {
>>  	struct blk_mq_hw_ctx *hctx;
>>  	unsigned long i;
>>  
>> -	queue_for_each_hw_ctx(q, hctx, i) {
>> -		if (hctx->sched_tags) {
>> -			if (!blk_mq_is_shared_tags(flags))
>> -				blk_mq_free_rq_map(hctx->sched_tags);
>> -			hctx->sched_tags = NULL;
>> -		}
>> -	}
>> +	queue_for_each_hw_ctx(q, hctx, i)
>> +		hctx->sched_tags = NULL;
>>  
>>  	if (blk_mq_is_shared_tags(flags))
>> -		blk_mq_exit_sched_shared_tags(q);
>> -}
>> -
>> -static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
>> -{
>> -	struct blk_mq_tag_set *set = queue->tag_set;
>> -
>> -	/*
>> -	 * Set initial depth at max so that we don't need to reallocate for
>> -	 * updating nr_requests.
>> -	 */
>> -	queue->sched_shared_tags = blk_mq_alloc_map_and_rqs(set,
>> -						BLK_MQ_NO_HCTX_IDX,
>> -						MAX_SCHED_RQ);
>> -	if (!queue->sched_shared_tags)
>> -		return -ENOMEM;
>> -
>> -	blk_mq_tag_update_sched_shared_tags(queue);
>> -
>> -	return 0;
>> +		q->sched_shared_tags = NULL;
>>  }
>>  
>>  void blk_mq_sched_reg_debugfs(struct request_queue *q)
>> @@ -458,8 +411,106 @@ void blk_mq_sched_unreg_debugfs(struct request_queue *q)
>>  	mutex_unlock(&q->debugfs_mutex);
>>  }
>>  
>> +void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set,
>> +		struct blk_mq_tags **tags, unsigned int nr_hw_queues)
>> +{
>> +	unsigned long i;
>> +
>> +	if (!tags)
>> +		return;
>> +
>> +	/* Shared tags are stored at index 0 in @tags. */
>> +	if (blk_mq_is_shared_tags(set->flags))
>> +		blk_mq_free_map_and_rqs(set, tags[0], BLK_MQ_NO_HCTX_IDX);
>> +	else {
>> +		for (i = 0; i < nr_hw_queues; i++)
>> +			blk_mq_free_map_and_rqs(set, tags[i], i);
>> +	}
>> +
>> +	kfree(tags);
>> +}
>> +
>> +void blk_mq_free_sched_tags(struct elevator_tags *et,
>> +		struct blk_mq_tag_set *set, int id)
>> +{
>> +	struct blk_mq_tags **tags;
>> +
>> +	tags = xa_load(&et->tags_table, id);
>> +	__blk_mq_free_sched_tags(set, tags, et->nr_hw_queues);
>> +}
>> +
>> +struct blk_mq_tags **__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
>> +				unsigned int nr_hw_queues,
>> +				unsigned int nr_requests,
>> +				gfp_t gfp)
>> +{
>> +	int i, nr_tags;
>> +	struct blk_mq_tags **tags;
>> +
>> +	if (blk_mq_is_shared_tags(set->flags))
>> +		nr_tags = 1;
>> +	else
>> +		nr_tags = nr_hw_queues;
>> +
>> +	tags = kcalloc(nr_tags, sizeof(struct blk_mq_tags *), gfp);
>> +	if (!tags)
>> +		return NULL;
>> +
>> +	if (blk_mq_is_shared_tags(set->flags)) {
>> +		/* Shared tags are stored at index 0 in @tags. */
>> +		tags[0] = blk_mq_alloc_map_and_rqs(set, BLK_MQ_NO_HCTX_IDX,
>> +					MAX_SCHED_RQ);
>> +		if (!tags[0])
>> +			goto out;
>> +	} else {
>> +		for (i = 0; i < nr_hw_queues; i++) {
>> +			tags[i] = blk_mq_alloc_map_and_rqs(set, i, nr_requests);
>> +			if (!tags[i])
>> +				goto out_unwind;
>> +		}
>> +	}
>> +
>> +	return tags;
>> +out_unwind:
>> +	while (--i >= 0)
>> +		blk_mq_free_map_and_rqs(set, tags[i], i);
>> +out:
>> +	kfree(tags);
>> +	return NULL;
>> +}
>> +
>> +int blk_mq_alloc_sched_tags(struct elevator_tags *et,
>> +		struct blk_mq_tag_set *set, int id)
>> +{
>> +	struct blk_mq_tags **tags;
>> +	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
>> +
>> +	/*
>> +	 * Default to double of smaller one between hw queue_depth and
>> +	 * 128, since we don't split into sync/async like the old code
>> +	 * did. Additionally, this is a per-hw queue depth.
>> +	 */
>> +	et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
>> +			BLKDEV_DEFAULT_RQ);
>> +
>> +	tags = __blk_mq_alloc_sched_tags(set, et->nr_hw_queues,
>> +			et->nr_requests, gfp);
>> +	if (!tags)
>> +		return -ENOMEM;
>> +
>> +	if (xa_insert(&et->tags_table, id, tags, gfp))
>> +		goto out_free_tags;
> 
> Not sure why you add `tags_table` into `elevator_tags`, it looks very
> confusing just from naming, not mention only single element is stored to
> the table in elevator switch code path since queue id is the key.
> 
Yeah but if you look at next patch where we add/remove tags in batch
then you'd realise that xarray may have multiple entries inserted when
it's called from nr_hw_queue update.

> Wrt. xarray suggestion, I meant to add on xarray local variable in
> __blk_mq_update_nr_hw_queues(), then you can store allocated `elevator_tags`
> to the xarray via ->id of request_queue.
> 
I think I got it now what you have been asking for - there was some confusion 
earlier. So you are suggesting to insert pointer to 'struct elevator_tags' 
in a local Xarray when we allocate sched_tags in batch from nr_hw_queue update
context. And the key to insert 'struct elevator_tags' into Xarray would be q->id.
Then later, from elv_update_nr_hw_queues() we should retrieve the relavent
pointer to 'struct elevator_tags' from that local Xarray (using q->id) and pass
on the retrieved elevator_tags to elevator_switch function.

So this way we would not require using ->tags_table inside 'struct elevator_tags'.
Additionally, blk_mq_alloc_sched_tags() shall return allocated 'struct elevator_tags'
to the caller. And the definition of the 'struct elevator_tags' would look like this:

struct elevator_tags {
	/* num. of hardware queues for which tags are allocated */
	unsigned int nr_hw_queues;
	/* depth used while allocating tags */
	unsigned int nr_requests;
        /* The index 0 in @tags is used to store shared sched tags */
	struct blk_mq_tags **tags;	
};

This seems like a clean and straightforward solution that should be easy to implement
without much hassle. Please let me know if this still needs adjustment.

Thanks,
--Nilay


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
  2025-06-27  4:13     ` Nilay Shroff
@ 2025-06-27  7:58       ` Ming Lei
  2025-06-27  9:50         ` Nilay Shroff
  0 siblings, 1 reply; 8+ messages in thread
From: Ming Lei @ 2025-06-27  7:58 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: linux-block, hch, axboe, sth, gjoyce

On Fri, Jun 27, 2025 at 09:43:29AM +0530, Nilay Shroff wrote:
> 
> 
> On 6/26/25 8:13 PM, Ming Lei wrote:
> > On Tue, Jun 24, 2025 at 06:47:04PM +0530, Nilay Shroff wrote:
> >> Recent lockdep reports [1] have revealed a potential deadlock caused by a
> >> lock dependency between the percpu allocator lock and the elevator lock.
> >> This issue can be avoided by ensuring that the allocation and release of
> >> scheduler tags (sched_tags) are performed outside the elevator lock.
> >> Furthermore, the queue does not need to be remain frozen during these
> >> operations.
> >>
> >> To address this, move all sched_tags allocations and deallocations outside
> >> of both the ->elevator_lock and the ->freeze_lock. Since the lifetime of
> >> the elevator queue and its associated sched_tags is closely tied, the
> >> allocated sched_tags are now stored in the elevator queue structure. Then,
> >> during the actual elevator switch (which runs under ->freeze_lock and
> >> ->elevator_lock), the pre-allocated sched_tags are assigned to the
> >> appropriate q->hctx. Once the elevator switch is complete and the locks
> >> are released, the old elevator queue and its associated sched_tags are
> >> freed.
> >>
> >> This commit specifically addresses the allocation/deallocation of sched_
> >> tags during elevator switching. Note that sched_tags may also be allocated
> >> in other contexts, such as during nr_hw_queues updates. Supporting that
> >> use case will require batch allocation/deallocation, which will be handled
> >> in a follow-up patch.
> >>
> >> This restructuring ensures that sched_tags memory management occurs
> >> entirely outside of the ->elevator_lock and ->freeze_lock context,
> >> eliminating the lock dependency problem seen during scheduler updates.
> >>
> >> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
> >>
> >> Reported-by: Stefan Haberland <sth@linux.ibm.com>
> >> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
> >> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> >> ---
> >>  block/blk-mq-sched.c | 186 ++++++++++++++++++++++++++-----------------
> >>  block/blk-mq-sched.h |  14 +++-
> >>  block/elevator.c     |  66 +++++++++++++--
> >>  block/elevator.h     |  19 ++++-
> >>  4 files changed, 204 insertions(+), 81 deletions(-)
> >>
> >> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> >> index 359e0704e09b..5d3132ac7777 100644
> >> --- a/block/blk-mq-sched.c
> >> +++ b/block/blk-mq-sched.c
> >> @@ -374,64 +374,17 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq,
> >>  }
> >>  EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge);
> >>  
> >> -static int blk_mq_sched_alloc_map_and_rqs(struct request_queue *q,
> >> -					  struct blk_mq_hw_ctx *hctx,
> >> -					  unsigned int hctx_idx)
> >> -{
> >> -	if (blk_mq_is_shared_tags(q->tag_set->flags)) {
> >> -		hctx->sched_tags = q->sched_shared_tags;
> >> -		return 0;
> >> -	}
> >> -
> >> -	hctx->sched_tags = blk_mq_alloc_map_and_rqs(q->tag_set, hctx_idx,
> >> -						    q->nr_requests);
> >> -
> >> -	if (!hctx->sched_tags)
> >> -		return -ENOMEM;
> >> -	return 0;
> >> -}
> >> -
> >> -static void blk_mq_exit_sched_shared_tags(struct request_queue *queue)
> >> -{
> >> -	blk_mq_free_rq_map(queue->sched_shared_tags);
> >> -	queue->sched_shared_tags = NULL;
> >> -}
> >> -
> >>  /* called in queue's release handler, tagset has gone away */
> >>  static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int flags)
> >>  {
> >>  	struct blk_mq_hw_ctx *hctx;
> >>  	unsigned long i;
> >>  
> >> -	queue_for_each_hw_ctx(q, hctx, i) {
> >> -		if (hctx->sched_tags) {
> >> -			if (!blk_mq_is_shared_tags(flags))
> >> -				blk_mq_free_rq_map(hctx->sched_tags);
> >> -			hctx->sched_tags = NULL;
> >> -		}
> >> -	}
> >> +	queue_for_each_hw_ctx(q, hctx, i)
> >> +		hctx->sched_tags = NULL;
> >>  
> >>  	if (blk_mq_is_shared_tags(flags))
> >> -		blk_mq_exit_sched_shared_tags(q);
> >> -}
> >> -
> >> -static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
> >> -{
> >> -	struct blk_mq_tag_set *set = queue->tag_set;
> >> -
> >> -	/*
> >> -	 * Set initial depth at max so that we don't need to reallocate for
> >> -	 * updating nr_requests.
> >> -	 */
> >> -	queue->sched_shared_tags = blk_mq_alloc_map_and_rqs(set,
> >> -						BLK_MQ_NO_HCTX_IDX,
> >> -						MAX_SCHED_RQ);
> >> -	if (!queue->sched_shared_tags)
> >> -		return -ENOMEM;
> >> -
> >> -	blk_mq_tag_update_sched_shared_tags(queue);
> >> -
> >> -	return 0;
> >> +		q->sched_shared_tags = NULL;
> >>  }
> >>  
> >>  void blk_mq_sched_reg_debugfs(struct request_queue *q)
> >> @@ -458,8 +411,106 @@ void blk_mq_sched_unreg_debugfs(struct request_queue *q)
> >>  	mutex_unlock(&q->debugfs_mutex);
> >>  }
> >>  
> >> +void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set,
> >> +		struct blk_mq_tags **tags, unsigned int nr_hw_queues)
> >> +{
> >> +	unsigned long i;
> >> +
> >> +	if (!tags)
> >> +		return;
> >> +
> >> +	/* Shared tags are stored at index 0 in @tags. */
> >> +	if (blk_mq_is_shared_tags(set->flags))
> >> +		blk_mq_free_map_and_rqs(set, tags[0], BLK_MQ_NO_HCTX_IDX);
> >> +	else {
> >> +		for (i = 0; i < nr_hw_queues; i++)
> >> +			blk_mq_free_map_and_rqs(set, tags[i], i);
> >> +	}
> >> +
> >> +	kfree(tags);
> >> +}
> >> +
> >> +void blk_mq_free_sched_tags(struct elevator_tags *et,
> >> +		struct blk_mq_tag_set *set, int id)
> >> +{
> >> +	struct blk_mq_tags **tags;
> >> +
> >> +	tags = xa_load(&et->tags_table, id);
> >> +	__blk_mq_free_sched_tags(set, tags, et->nr_hw_queues);
> >> +}
> >> +
> >> +struct blk_mq_tags **__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
> >> +				unsigned int nr_hw_queues,
> >> +				unsigned int nr_requests,
> >> +				gfp_t gfp)
> >> +{
> >> +	int i, nr_tags;
> >> +	struct blk_mq_tags **tags;
> >> +
> >> +	if (blk_mq_is_shared_tags(set->flags))
> >> +		nr_tags = 1;
> >> +	else
> >> +		nr_tags = nr_hw_queues;
> >> +
> >> +	tags = kcalloc(nr_tags, sizeof(struct blk_mq_tags *), gfp);
> >> +	if (!tags)
> >> +		return NULL;
> >> +
> >> +	if (blk_mq_is_shared_tags(set->flags)) {
> >> +		/* Shared tags are stored at index 0 in @tags. */
> >> +		tags[0] = blk_mq_alloc_map_and_rqs(set, BLK_MQ_NO_HCTX_IDX,
> >> +					MAX_SCHED_RQ);
> >> +		if (!tags[0])
> >> +			goto out;
> >> +	} else {
> >> +		for (i = 0; i < nr_hw_queues; i++) {
> >> +			tags[i] = blk_mq_alloc_map_and_rqs(set, i, nr_requests);
> >> +			if (!tags[i])
> >> +				goto out_unwind;
> >> +		}
> >> +	}
> >> +
> >> +	return tags;
> >> +out_unwind:
> >> +	while (--i >= 0)
> >> +		blk_mq_free_map_and_rqs(set, tags[i], i);
> >> +out:
> >> +	kfree(tags);
> >> +	return NULL;
> >> +}
> >> +
> >> +int blk_mq_alloc_sched_tags(struct elevator_tags *et,
> >> +		struct blk_mq_tag_set *set, int id)
> >> +{
> >> +	struct blk_mq_tags **tags;
> >> +	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
> >> +
> >> +	/*
> >> +	 * Default to double of smaller one between hw queue_depth and
> >> +	 * 128, since we don't split into sync/async like the old code
> >> +	 * did. Additionally, this is a per-hw queue depth.
> >> +	 */
> >> +	et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
> >> +			BLKDEV_DEFAULT_RQ);
> >> +
> >> +	tags = __blk_mq_alloc_sched_tags(set, et->nr_hw_queues,
> >> +			et->nr_requests, gfp);
> >> +	if (!tags)
> >> +		return -ENOMEM;
> >> +
> >> +	if (xa_insert(&et->tags_table, id, tags, gfp))
> >> +		goto out_free_tags;
> > 
> > Not sure why you add `tags_table` into `elevator_tags`, it looks very
> > confusing just from naming, not mention only single element is stored to
> > the table in elevator switch code path since queue id is the key.
> > 
> Yeah but if you look at next patch where we add/remove tags in batch
> then you'd realise that xarray may have multiple entries inserted when
> it's called from nr_hw_queue update.
> 
> > Wrt. xarray suggestion, I meant to add on xarray local variable in
> > __blk_mq_update_nr_hw_queues(), then you can store allocated `elevator_tags`
> > to the xarray via ->id of request_queue.
> > 
> I think I got it now what you have been asking for - there was some confusion 
> earlier. So you are suggesting to insert pointer to 'struct elevator_tags' 
> in a local Xarray when we allocate sched_tags in batch from nr_hw_queue update
> context. And the key to insert 'struct elevator_tags' into Xarray would be q->id.
> Then later, from elv_update_nr_hw_queues() we should retrieve the relavent
> pointer to 'struct elevator_tags' from that local Xarray (using q->id) and pass
> on the retrieved elevator_tags to elevator_switch function.
> 
> So this way we would not require using ->tags_table inside 'struct elevator_tags'.
> Additionally, blk_mq_alloc_sched_tags() shall return allocated 'struct elevator_tags'
> to the caller. And the definition of the 'struct elevator_tags' would look like this:
> 
> struct elevator_tags {
> 	/* num. of hardware queues for which tags are allocated */
> 	unsigned int nr_hw_queues;
> 	/* depth used while allocating tags */
> 	unsigned int nr_requests;
>         /* The index 0 in @tags is used to store shared sched tags */
> 	struct blk_mq_tags **tags;	

s/struct blk_mq_tags **tags/struct blk_mq_tags *tags[];	

Then we can save one extra failure handling.

> 
> This seems like a clean and straightforward solution that should be easy to implement
> without much hassle. Please let me know if this still needs adjustment.

I think we are in same page now.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
  2025-06-27  7:58       ` Ming Lei
@ 2025-06-27  9:50         ` Nilay Shroff
  0 siblings, 0 replies; 8+ messages in thread
From: Nilay Shroff @ 2025-06-27  9:50 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, hch, axboe, sth, gjoyce



On 6/27/25 1:28 PM, Ming Lei wrote:
> On Fri, Jun 27, 2025 at 09:43:29AM +0530, Nilay Shroff wrote:
>>
>>
>> On 6/26/25 8:13 PM, Ming Lei wrote:

>> struct elevator_tags {
>> 	/* num. of hardware queues for which tags are allocated */
>> 	unsigned int nr_hw_queues;
>> 	/* depth used while allocating tags */
>> 	unsigned int nr_requests;
>>         /* The index 0 in @tags is used to store shared sched tags */
>> 	struct blk_mq_tags **tags;	
> 
> s/struct blk_mq_tags **tags/struct blk_mq_tags *tags[];	
> 
> Then we can save one extra failure handling.
Yeah I think that's the advantage with the flex array!
> 
>>
>> This seems like a clean and straightforward solution that should be easy to implement
>> without much hassle. Please let me know if this still needs adjustment.
> 
> I think we are in same page now.
>
Great!
 
Thanks,
--Nilay

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-06-27  9:50 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-24 13:17 [PATCHv4 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
2025-06-24 13:17 ` [PATCHv4 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
2025-06-24 13:17 ` [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
2025-06-26 14:43   ` Ming Lei
2025-06-27  4:13     ` Nilay Shroff
2025-06-27  7:58       ` Ming Lei
2025-06-27  9:50         ` Nilay Shroff
2025-06-24 13:17 ` [PATCHv4 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.