* [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context
@ 2025-07-01 8:18 Nilay Shroff
2025-07-01 8:18 ` [PATCHv7 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
` (3 more replies)
0 siblings, 4 replies; 9+ messages in thread
From: Nilay Shroff @ 2025-07-01 8:18 UTC (permalink / raw)
To: linux-block; +Cc: hch, ming.lei, hare, axboe, sth, gjoyce
Hi,
There have been a few reports[1] indicating potential lockdep warnings due
to a lock dependency from the percpu allocator to the elevator lock. This
patch series aims to eliminate that dependency.
The series consists of three patches:
The first patch is a preparatory patch that moves the elevator queue
allocation logic from ->init_sched into blk_mq_init_sched.
The second patch in the series restructures sched_tags allocation and
deallocation during elevator update/switch operations to ensure these
actions are performed entirely outside the ->freeze_lock and
->elevator_lock. This eliminates the percpu allocator’s lock dependency
on the elevator and freeze locks during scheduler transitions.
The third patch introduces batch allocation and deallocation helpers for
sched_tags. These helpers are used during __blk_mq_update_nr_hw_queues()
to decouple sched_tags memory management from both the elevator and freeze
locks, addressing the lockdep concerns in the nr_hw_queues update path.
[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Changes since v6:
- Add a warning when loading elevator tags from the xarray yields nothing
(Hannes Reinecke)
- Use elevator tags instead of xarray table as a function argument to
elv_update_nr_hw_queues (Ming Lei)
Link to v6: https://lore.kernel.org/all/20250630054756.54532-1-nilay@linux.ibm.com/
Changes since v5:
- Fixed smatch warning reported by kernel test robot here:
https://lore.kernel.org/all/202506300509.2S1tygch-lkp@intel.com/
Link to v5: https://lore.kernel.org/all/20250627175544.1063910-1-nilay@linux.ibm.com/
Changes since v4:
- Define a local Xarray variable in __blk_mq_update_nr_hw_queues to store
sched_tags, instead of storing it in an Xarray defined in 'struct elevator_tags'
(Ming Lei)
Link to v4: https://lore.kernel.org/all/20250624131716.630465-1-nilay@linux.ibm.com/
Changes since v3:
- Further split the patchset into a three-patch series so that we can
have a separate patch for sched_tags batch allocation/deallocation
(Ming Lei)
- Use Xarray to store and load the sched_tags (Ming Lei)
- Unexport elevator_alloc() as we no longer need to use it outside
of the block layer core (hch)
- Unwind the sched_tags allocation and free tags when it fails in
the middle of allocation (hch)
- Move the struct elevator_queue definition from the common header to
elevator.c as there's no user of it outside elevator.c (Ming Lei, hch)
Link to v3: https://lore.kernel.org/all/20250616173233.3803824-1-nilay@linux.ibm.com/
Changes since v2:
- Split the patch into a two-patch series. The first patch updates
the ->init_sched elevator API and the second patch handles the
sched_tags allocation/de-allocation logic (Ming Lei)
- Address the sched_tags allocation/deallocation logic while running in
the context of an nr_hw_queues update so that we can handle all possible
scenarios in a single patchset (Ming Lei)
Link to v2: https://lore.kernel.org/all/20250528123638.1029700-1-nilay@linux.ibm.com/
Changes since v1:
- As the lifetime of elevator queue and sched tags are same, allocate
and move sched tags under struct elevator_queue (Ming Lei)
Link to v1: https://lore.kernel.org/all/20250520103425.1259712-1-nilay@linux.ibm.com/
Nilay Shroff (3):
block: move elevator queue allocation logic into blk_mq_init_sched
block: fix lockdep warning caused by lock dependency in
elv_iosched_store
block: fix potential deadlock while running nr_hw_queue update
block/bfq-iosched.c | 13 +--
block/blk-mq-sched.c | 223 ++++++++++++++++++++++++++++--------------
block/blk-mq-sched.h | 12 ++-
block/blk-mq.c | 16 ++-
block/blk.h | 2 +-
block/elevator.c | 50 ++++++++--
block/elevator.h | 16 ++-
block/kyber-iosched.c | 11 +--
block/mq-deadline.c | 14 +--
9 files changed, 241 insertions(+), 116 deletions(-)
--
2.50.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCHv7 1/3] block: move elevator queue allocation logic into blk_mq_init_sched
2025-07-01 8:18 [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
@ 2025-07-01 8:18 ` Nilay Shroff
2025-07-01 8:18 ` [PATCHv7 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
` (2 subsequent siblings)
3 siblings, 0 replies; 9+ messages in thread
From: Nilay Shroff @ 2025-07-01 8:18 UTC (permalink / raw)
To: linux-block; +Cc: hch, ming.lei, hare, axboe, sth, gjoyce
In preparation for allocating sched_tags before freezing the request
queue and acquiring ->elevator_lock, move the elevator queue allocation
logic from the elevator ops ->init_sched callback into blk_mq_init_sched.
As elevator_alloc is now only invoked from the block layer core, it no
longer needs to be exported, so unexport the elevator_alloc function.
This refactoring provides a centralized location for elevator queue
initialization, which makes it easier to store pre-allocated sched_tags
in the struct elevator_queue during later changes.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/bfq-iosched.c | 13 +++----------
block/blk-mq-sched.c | 11 ++++++++---
block/elevator.c | 1 -
block/elevator.h | 2 +-
block/kyber-iosched.c | 11 ++---------
block/mq-deadline.c | 14 ++------------
6 files changed, 16 insertions(+), 36 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0cb1e9873aab..fd26dc1901b0 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7232,22 +7232,16 @@ static void bfq_init_root_group(struct bfq_group *root_group,
root_group->sched_data.bfq_class_idle_last_service = jiffies;
}
-static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
+static int bfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
{
struct bfq_data *bfqd;
- struct elevator_queue *eq;
unsigned int i;
struct blk_independent_access_ranges *ia_ranges = q->disk->ia_ranges;
- eq = elevator_alloc(q, e);
- if (!eq)
- return -ENOMEM;
-
bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
- if (!bfqd) {
- kobject_put(&eq->kobj);
+ if (!bfqd)
return -ENOMEM;
- }
+
eq->elevator_data = bfqd;
spin_lock_irq(&q->queue_lock);
@@ -7405,7 +7399,6 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
out_free:
kfree(bfqd);
- kobject_put(&eq->kobj);
return -ENOMEM;
}
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 55a0fd105147..359e0704e09b 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -475,10 +475,14 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
BLKDEV_DEFAULT_RQ);
+ eq = elevator_alloc(q, e);
+ if (!eq)
+ return -ENOMEM;
+
if (blk_mq_is_shared_tags(flags)) {
ret = blk_mq_init_sched_shared_tags(q);
if (ret)
- return ret;
+ goto err_put_elevator;
}
queue_for_each_hw_ctx(q, hctx, i) {
@@ -487,7 +491,7 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
goto err_free_map_and_rqs;
}
- ret = e->ops.init_sched(q, e);
+ ret = e->ops.init_sched(q, eq);
if (ret)
goto err_free_map_and_rqs;
@@ -508,7 +512,8 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
err_free_map_and_rqs:
blk_mq_sched_free_rqs(q);
blk_mq_sched_tags_teardown(q, flags);
-
+err_put_elevator:
+ kobject_put(&eq->kobj);
q->elevator = NULL;
return ret;
}
diff --git a/block/elevator.c b/block/elevator.c
index ab22542e6cf0..770874040f79 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -148,7 +148,6 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
return eq;
}
-EXPORT_SYMBOL(elevator_alloc);
static void elevator_release(struct kobject *kobj)
{
diff --git a/block/elevator.h b/block/elevator.h
index a07ce773a38f..a4de5f9ad790 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -24,7 +24,7 @@ struct blk_mq_alloc_data;
struct blk_mq_hw_ctx;
struct elevator_mq_ops {
- int (*init_sched)(struct request_queue *, struct elevator_type *);
+ int (*init_sched)(struct request_queue *, struct elevator_queue *);
void (*exit_sched)(struct elevator_queue *);
int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 4dba8405bd01..7b6832cb3a8d 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -402,20 +402,13 @@ static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
return ERR_PTR(ret);
}
-static int kyber_init_sched(struct request_queue *q, struct elevator_type *e)
+static int kyber_init_sched(struct request_queue *q, struct elevator_queue *eq)
{
struct kyber_queue_data *kqd;
- struct elevator_queue *eq;
-
- eq = elevator_alloc(q, e);
- if (!eq)
- return -ENOMEM;
kqd = kyber_queue_data_alloc(q);
- if (IS_ERR(kqd)) {
- kobject_put(&eq->kobj);
+ if (IS_ERR(kqd))
return PTR_ERR(kqd);
- }
blk_stat_enable_accounting(q);
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 2edf1cac06d5..7b6caf30e00a 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -568,20 +568,14 @@ static void dd_exit_sched(struct elevator_queue *e)
/*
* initialize elevator private data (deadline_data).
*/
-static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
+static int dd_init_sched(struct request_queue *q, struct elevator_queue *eq)
{
struct deadline_data *dd;
- struct elevator_queue *eq;
enum dd_prio prio;
- int ret = -ENOMEM;
-
- eq = elevator_alloc(q, e);
- if (!eq)
- return ret;
dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
if (!dd)
- goto put_eq;
+ return -ENOMEM;
eq->elevator_data = dd;
@@ -608,10 +602,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
q->elevator = eq;
return 0;
-
-put_eq:
- kobject_put(&eq->kobj);
- return ret;
}
/*
--
2.50.0
* [PATCHv7 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
2025-07-01 8:18 [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
2025-07-01 8:18 ` [PATCHv7 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
@ 2025-07-01 8:18 ` Nilay Shroff
2025-07-01 10:52 ` Hannes Reinecke
2025-07-01 8:19 ` [PATCHv7 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
2025-07-02 13:53 ` [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context Yi Zhang
3 siblings, 1 reply; 9+ messages in thread
From: Nilay Shroff @ 2025-07-01 8:18 UTC (permalink / raw)
To: linux-block; +Cc: hch, ming.lei, hare, axboe, sth, gjoyce
Recent lockdep reports [1] have revealed a potential deadlock caused by a
lock dependency between the percpu allocator lock and the elevator lock.
This issue can be avoided by ensuring that the allocation and release of
scheduler tags (sched_tags) are performed outside the elevator lock.
Furthermore, the queue does not need to remain frozen during these
operations.
To address this, move all sched_tags allocations and deallocations outside
of both the ->elevator_lock and the ->freeze_lock. Since the lifetimes of
the elevator queue and its associated sched_tags are closely tied, the
allocated sched_tags are now stored in the elevator queue structure. Then,
during the actual elevator switch (which runs under ->freeze_lock and
->elevator_lock), the pre-allocated sched_tags are assigned to the
appropriate q->hctx. Once the elevator switch is complete and the locks
are released, the old elevator queue and its associated sched_tags are
freed.
This commit specifically addresses the allocation/deallocation of
sched_tags during elevator switching. Note that sched_tags may also be
allocated
in other contexts, such as during nr_hw_queues updates. Supporting that
use case will require batch allocation/deallocation, which will be handled
in a follow-up patch.
This restructuring ensures that sched_tags memory management occurs
entirely outside of the ->elevator_lock and ->freeze_lock context,
eliminating the lock dependency problem seen during scheduler updates.
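The ordering above can be illustrated with a small userspace sketch (not
kernel code; the names switch_tags, active_tags, and elevator_lock are
illustrative only, standing in for the real elevator switch path): memory
is allocated before the lock is taken, only the pointer hand-off happens
under the lock, and the old memory is freed after the lock is dropped, so
the allocator's internal locks can never nest inside the critical section:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for sched_tags and ->elevator_lock. */
struct tags { int depth; };

static pthread_mutex_t elevator_lock = PTHREAD_MUTEX_INITIALIZER;
static struct tags *active_tags;

static int switch_tags(int new_depth)
{
	struct tags *new_tags, *old_tags;

	/* Phase 1: allocate with no locks held, so the allocator's own
	 * locking never becomes a dependency of elevator_lock. */
	new_tags = malloc(sizeof(*new_tags));
	if (!new_tags)
		return -1;
	new_tags->depth = new_depth;

	/* Phase 2: only the pointer swap runs under the lock. */
	pthread_mutex_lock(&elevator_lock);
	old_tags = active_tags;
	active_tags = new_tags;
	pthread_mutex_unlock(&elevator_lock);

	/* Phase 3: free the old tags after dropping the lock. */
	free(old_tags);
	return 0;
}
```

The same shape appears in the patch: blk_mq_alloc_sched_tags() before
blk_mq_freeze_queue(), and blk_mq_free_sched_tags() after the unfreeze.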
[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Reported-by: Stefan Haberland <sth@linux.ibm.com>
Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-mq-sched.c | 155 +++++++++++++++++++++++--------------------
block/blk-mq-sched.h | 8 ++-
block/elevator.c | 47 +++++++++++--
block/elevator.h | 14 +++-
4 files changed, 144 insertions(+), 80 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 359e0704e09b..2d6d1ebdd8fb 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -374,64 +374,17 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq,
}
EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge);
-static int blk_mq_sched_alloc_map_and_rqs(struct request_queue *q,
- struct blk_mq_hw_ctx *hctx,
- unsigned int hctx_idx)
-{
- if (blk_mq_is_shared_tags(q->tag_set->flags)) {
- hctx->sched_tags = q->sched_shared_tags;
- return 0;
- }
-
- hctx->sched_tags = blk_mq_alloc_map_and_rqs(q->tag_set, hctx_idx,
- q->nr_requests);
-
- if (!hctx->sched_tags)
- return -ENOMEM;
- return 0;
-}
-
-static void blk_mq_exit_sched_shared_tags(struct request_queue *queue)
-{
- blk_mq_free_rq_map(queue->sched_shared_tags);
- queue->sched_shared_tags = NULL;
-}
-
/* called in queue's release handler, tagset has gone away */
static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int flags)
{
struct blk_mq_hw_ctx *hctx;
unsigned long i;
- queue_for_each_hw_ctx(q, hctx, i) {
- if (hctx->sched_tags) {
- if (!blk_mq_is_shared_tags(flags))
- blk_mq_free_rq_map(hctx->sched_tags);
- hctx->sched_tags = NULL;
- }
- }
+ queue_for_each_hw_ctx(q, hctx, i)
+ hctx->sched_tags = NULL;
if (blk_mq_is_shared_tags(flags))
- blk_mq_exit_sched_shared_tags(q);
-}
-
-static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
-{
- struct blk_mq_tag_set *set = queue->tag_set;
-
- /*
- * Set initial depth at max so that we don't need to reallocate for
- * updating nr_requests.
- */
- queue->sched_shared_tags = blk_mq_alloc_map_and_rqs(set,
- BLK_MQ_NO_HCTX_IDX,
- MAX_SCHED_RQ);
- if (!queue->sched_shared_tags)
- return -ENOMEM;
-
- blk_mq_tag_update_sched_shared_tags(queue);
-
- return 0;
+ q->sched_shared_tags = NULL;
}
void blk_mq_sched_reg_debugfs(struct request_queue *q)
@@ -458,8 +411,75 @@ void blk_mq_sched_unreg_debugfs(struct request_queue *q)
mutex_unlock(&q->debugfs_mutex);
}
+void blk_mq_free_sched_tags(struct elevator_tags *et,
+ struct blk_mq_tag_set *set)
+{
+ unsigned long i;
+
+ /* Shared tags are stored at index 0 in @tags. */
+ if (blk_mq_is_shared_tags(set->flags))
+ blk_mq_free_map_and_rqs(set, et->tags[0], BLK_MQ_NO_HCTX_IDX);
+ else {
+ for (i = 0; i < et->nr_hw_queues; i++)
+ blk_mq_free_map_and_rqs(set, et->tags[i], i);
+ }
+
+ kfree(et);
+}
+
+struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
+ unsigned int nr_hw_queues)
+{
+ unsigned int nr_tags;
+ int i;
+ struct elevator_tags *et;
+ gfp_t gfp = GFP_NOIO | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
+
+ if (blk_mq_is_shared_tags(set->flags))
+ nr_tags = 1;
+ else
+ nr_tags = nr_hw_queues;
+
+ et = kmalloc(sizeof(struct elevator_tags) +
+ nr_tags * sizeof(struct blk_mq_tags *), gfp);
+ if (!et)
+ return NULL;
+ /*
+ * Default to double of smaller one between hw queue_depth and
+ * 128, since we don't split into sync/async like the old code
+ * did. Additionally, this is a per-hw queue depth.
+ */
+ et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
+ BLKDEV_DEFAULT_RQ);
+ et->nr_hw_queues = nr_hw_queues;
+
+ if (blk_mq_is_shared_tags(set->flags)) {
+ /* Shared tags are stored at index 0 in @tags. */
+ et->tags[0] = blk_mq_alloc_map_and_rqs(set, BLK_MQ_NO_HCTX_IDX,
+ MAX_SCHED_RQ);
+ if (!et->tags[0])
+ goto out;
+ } else {
+ for (i = 0; i < et->nr_hw_queues; i++) {
+ et->tags[i] = blk_mq_alloc_map_and_rqs(set, i,
+ et->nr_requests);
+ if (!et->tags[i])
+ goto out_unwind;
+ }
+ }
+
+ return et;
+out_unwind:
+ while (--i >= 0)
+ blk_mq_free_map_and_rqs(set, et->tags[i], i);
+out:
+ kfree(et);
+ return NULL;
+}
+
/* caller must have a reference to @e, will grab another one if successful */
-int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
+ struct elevator_tags *et)
{
unsigned int flags = q->tag_set->flags;
struct blk_mq_hw_ctx *hctx;
@@ -467,40 +487,33 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
unsigned long i;
int ret;
- /*
- * Default to double of smaller one between hw queue_depth and 128,
- * since we don't split into sync/async like the old code did.
- * Additionally, this is a per-hw queue depth.
- */
- q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
- BLKDEV_DEFAULT_RQ);
-
- eq = elevator_alloc(q, e);
+ eq = elevator_alloc(q, e, et);
if (!eq)
return -ENOMEM;
+ q->nr_requests = et->nr_requests;
+
if (blk_mq_is_shared_tags(flags)) {
- ret = blk_mq_init_sched_shared_tags(q);
- if (ret)
- goto err_put_elevator;
+ /* Shared tags are stored at index 0 in @et->tags. */
+ q->sched_shared_tags = et->tags[0];
+ blk_mq_tag_update_sched_shared_tags(q);
}
queue_for_each_hw_ctx(q, hctx, i) {
- ret = blk_mq_sched_alloc_map_and_rqs(q, hctx, i);
- if (ret)
- goto err_free_map_and_rqs;
+ if (blk_mq_is_shared_tags(flags))
+ hctx->sched_tags = q->sched_shared_tags;
+ else
+ hctx->sched_tags = et->tags[i];
}
ret = e->ops.init_sched(q, eq);
if (ret)
- goto err_free_map_and_rqs;
+ goto out;
queue_for_each_hw_ctx(q, hctx, i) {
if (e->ops.init_hctx) {
ret = e->ops.init_hctx(hctx, i);
if (ret) {
- eq = q->elevator;
- blk_mq_sched_free_rqs(q);
blk_mq_exit_sched(q, eq);
kobject_put(&eq->kobj);
return ret;
@@ -509,10 +522,8 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
}
return 0;
-err_free_map_and_rqs:
- blk_mq_sched_free_rqs(q);
+out:
blk_mq_sched_tags_teardown(q, flags);
-err_put_elevator:
kobject_put(&eq->kobj);
q->elevator = NULL;
return ret;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 1326526bb733..0cde00cd1c47 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -18,10 +18,16 @@ void __blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
-int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
+ struct elevator_tags *et);
void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
void blk_mq_sched_free_rqs(struct request_queue *q);
+struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
+ unsigned int nr_hw_queues);
+void blk_mq_free_sched_tags(struct elevator_tags *et,
+ struct blk_mq_tag_set *set);
+
static inline void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
{
if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
diff --git a/block/elevator.c b/block/elevator.c
index 770874040f79..50f4b78efe66 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -54,6 +54,8 @@ struct elv_change_ctx {
struct elevator_queue *old;
/* for registering new elevator */
struct elevator_queue *new;
+ /* holds sched tags data */
+ struct elevator_tags *et;
};
static DEFINE_SPINLOCK(elv_list_lock);
@@ -132,7 +134,7 @@ static struct elevator_type *elevator_find_get(const char *name)
static const struct kobj_type elv_ktype;
struct elevator_queue *elevator_alloc(struct request_queue *q,
- struct elevator_type *e)
+ struct elevator_type *e, struct elevator_tags *et)
{
struct elevator_queue *eq;
@@ -145,6 +147,7 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
kobject_init(&eq->kobj, &elv_ktype);
mutex_init(&eq->sysfs_lock);
hash_init(eq->hash);
+ eq->et = et;
return eq;
}
@@ -165,7 +168,6 @@ static void elevator_exit(struct request_queue *q)
lockdep_assert_held(&q->elevator_lock);
ioc_clear_queue(q);
- blk_mq_sched_free_rqs(q);
mutex_lock(&e->sysfs_lock);
blk_mq_exit_sched(q, e);
@@ -591,7 +593,7 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
}
if (new_e) {
- ret = blk_mq_init_sched(q, new_e);
+ ret = blk_mq_init_sched(q, new_e, ctx->et);
if (ret)
goto out_unfreeze;
ctx->new = q->elevator;
@@ -626,8 +628,10 @@ static void elv_exit_and_release(struct request_queue *q)
elevator_exit(q);
mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
- if (e)
+ if (e) {
+ blk_mq_free_sched_tags(e->et, q->tag_set);
kobject_put(&e->kobj);
+ }
}
static int elevator_change_done(struct request_queue *q,
@@ -640,6 +644,7 @@ static int elevator_change_done(struct request_queue *q,
&ctx->old->flags);
elv_unregister_queue(q, ctx->old);
+ blk_mq_free_sched_tags(ctx->old->et, q->tag_set);
kobject_put(&ctx->old->kobj);
if (enable_wbt)
wbt_enable_default(q->disk);
@@ -658,9 +663,16 @@ static int elevator_change_done(struct request_queue *q,
static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
{
unsigned int memflags;
+ struct blk_mq_tag_set *set = q->tag_set;
int ret = 0;
- lockdep_assert_held(&q->tag_set->update_nr_hwq_lock);
+ lockdep_assert_held(&set->update_nr_hwq_lock);
+
+ if (strncmp(ctx->name, "none", 4)) {
+ ctx->et = blk_mq_alloc_sched_tags(set, set->nr_hw_queues);
+ if (!ctx->et)
+ return -ENOMEM;
+ }
memflags = blk_mq_freeze_queue(q);
/*
@@ -680,6 +692,11 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
blk_mq_unfreeze_queue(q, memflags);
if (!ret)
ret = elevator_change_done(q, ctx);
+ /*
+ * Free sched tags if it's allocated but we couldn't switch elevator.
+ */
+ if (ctx->et && !ctx->new)
+ blk_mq_free_sched_tags(ctx->et, set);
return ret;
}
@@ -690,11 +707,26 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
*/
void elv_update_nr_hw_queues(struct request_queue *q)
{
+ struct blk_mq_tag_set *set = q->tag_set;
struct elv_change_ctx ctx = {};
int ret = -ENODEV;
WARN_ON_ONCE(q->mq_freeze_depth == 0);
+ /*
+ * Accessing q->elevator without holding q->elevator_lock is safe here
+ * because nr_hw_queue update is protected by set->update_nr_hwq_lock
+ * in the writer context. So, scheduler update/switch code (which
+ * acquires same lock in the reader context) can't run concurrently.
+ */
+ if (q->elevator) {
+ ctx.et = blk_mq_alloc_sched_tags(set, set->nr_hw_queues);
+ if (!ctx.et) {
+ WARN_ON_ONCE(1);
+ return;
+ }
+ }
+
mutex_lock(&q->elevator_lock);
if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
ctx.name = q->elevator->type->elevator_name;
@@ -706,6 +738,11 @@ void elv_update_nr_hw_queues(struct request_queue *q)
blk_mq_unfreeze_queue_nomemrestore(q);
if (!ret)
WARN_ON_ONCE(elevator_change_done(q, &ctx));
+ /*
+ * Free sched tags if it's allocated but we couldn't switch elevator.
+ */
+ if (ctx.et && !ctx.new)
+ blk_mq_free_sched_tags(ctx.et, set);
}
/*
diff --git a/block/elevator.h b/block/elevator.h
index a4de5f9ad790..adc5c157e17e 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -23,6 +23,15 @@ enum elv_merge {
struct blk_mq_alloc_data;
struct blk_mq_hw_ctx;
+struct elevator_tags {
+ /* num. of hardware queues for which tags are allocated */
+ unsigned int nr_hw_queues;
+ /* depth used while allocating tags */
+ unsigned int nr_requests;
+ /* shared tag is stored at index 0 */
+ struct blk_mq_tags *tags[];
+};
+
struct elevator_mq_ops {
int (*init_sched)(struct request_queue *, struct elevator_queue *);
void (*exit_sched)(struct elevator_queue *);
@@ -113,6 +122,7 @@ struct request *elv_rqhash_find(struct request_queue *q, sector_t offset);
struct elevator_queue
{
struct elevator_type *type;
+ struct elevator_tags *et;
void *elevator_data;
struct kobject kobj;
struct mutex sysfs_lock;
@@ -152,8 +162,8 @@ ssize_t elv_iosched_show(struct gendisk *disk, char *page);
ssize_t elv_iosched_store(struct gendisk *disk, const char *page, size_t count);
extern bool elv_bio_merge_ok(struct request *, struct bio *);
-extern struct elevator_queue *elevator_alloc(struct request_queue *,
- struct elevator_type *);
+struct elevator_queue *elevator_alloc(struct request_queue *,
+ struct elevator_type *, struct elevator_tags *);
/*
* Helper functions.
--
2.50.0
* [PATCHv7 3/3] block: fix potential deadlock while running nr_hw_queue update
2025-07-01 8:18 [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
2025-07-01 8:18 ` [PATCHv7 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
2025-07-01 8:18 ` [PATCHv7 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
@ 2025-07-01 8:19 ` Nilay Shroff
2025-07-01 11:00 ` Hannes Reinecke
2025-07-02 13:53 ` [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context Yi Zhang
3 siblings, 1 reply; 9+ messages in thread
From: Nilay Shroff @ 2025-07-01 8:19 UTC (permalink / raw)
To: linux-block; +Cc: hch, ming.lei, hare, axboe, sth, gjoyce
Move scheduler tags (sched_tags) allocation and deallocation outside
both the ->elevator_lock and ->freeze_lock when updating nr_hw_queues.
This change breaks the dependency chain from the percpu allocator lock
to the elevator lock, helping to prevent potential deadlocks, as
observed in the reported lockdep splat[1].
This commit introduces batch allocation and deallocation helpers for
sched_tags, which are now used from within __blk_mq_update_nr_hw_queues
routine while iterating through the tagset.
With this change, all sched_tags memory management is handled entirely
outside the ->elevator_lock and the ->freeze_lock context, thereby
eliminating the lock dependency that could otherwise manifest during
nr_hw_queues updates.
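The batch helper has to cope with an allocation failing partway through
the tagset. A minimal userspace sketch of that unwind-on-failure pattern
(alloc_batch and slots are illustrative names, mirroring the out_unwind
label in blk_mq_alloc_sched_tags): on failure, everything allocated so far
is released in reverse order so the caller never sees a half-built batch:

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate n slots; on any failure, free what was already allocated
 * and report an error, leaving slots[] all-NULL. */
static int alloc_batch(int **slots, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		slots[i] = malloc(sizeof(int));
		if (!slots[i])
			goto out_unwind;
		*slots[i] = i;
	}
	return 0;

out_unwind:
	/* Walk back over the successfully allocated entries. */
	while (--i >= 0) {
		free(slots[i]);
		slots[i] = NULL;
	}
	return -1;
}
```

In the patch itself the batch additionally records each allocation in an
xarray keyed by queue id, so the later switch and free paths can look the
tags up per-queue.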
[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Reported-by: Stefan Haberland <sth@linux.ibm.com>
Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-mq-sched.c | 65 ++++++++++++++++++++++++++++++++++++++++++++
block/blk-mq-sched.h | 4 +++
block/blk-mq.c | 16 +++++++++--
block/blk.h | 3 +-
block/elevator.c | 6 ++--
5 files changed, 88 insertions(+), 6 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 2d6d1ebdd8fb..e2ce4a28e6c9 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -427,6 +427,32 @@ void blk_mq_free_sched_tags(struct elevator_tags *et,
kfree(et);
}
+void blk_mq_free_sched_tags_batch(struct xarray *et_table,
+ struct blk_mq_tag_set *set)
+{
+ struct request_queue *q;
+ struct elevator_tags *et;
+
+ lockdep_assert_held_write(&set->update_nr_hwq_lock);
+
+ list_for_each_entry(q, &set->tag_list, tag_set_list) {
+ /*
+ * Accessing q->elevator without holding q->elevator_lock is
+ * safe because we're holding here set->update_nr_hwq_lock in
+ * the writer context. So, scheduler update/switch code (which
+ * acquires the same lock but in the reader context) can't run
+ * concurrently.
+ */
+ if (q->elevator) {
+ et = xa_load(et_table, q->id);
+ if (unlikely(!et))
+ WARN_ON_ONCE(1);
+ else
+ blk_mq_free_sched_tags(et, set);
+ }
+ }
+}
+
struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
unsigned int nr_hw_queues)
{
@@ -477,6 +503,45 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
return NULL;
}
+int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
+ struct blk_mq_tag_set *set, unsigned int nr_hw_queues)
+{
+ struct request_queue *q;
+ struct elevator_tags *et;
+ gfp_t gfp = GFP_NOIO | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
+
+ lockdep_assert_held_write(&set->update_nr_hwq_lock);
+
+ list_for_each_entry(q, &set->tag_list, tag_set_list) {
+ /*
+ * Accessing q->elevator without holding q->elevator_lock is
+ * safe because we're holding here set->update_nr_hwq_lock in
+ * the writer context. So, scheduler update/switch code (which
+ * acquires the same lock but in the reader context) can't run
+ * concurrently.
+ */
+ if (q->elevator) {
+ et = blk_mq_alloc_sched_tags(set, nr_hw_queues);
+ if (!et)
+ goto out_unwind;
+ if (xa_insert(et_table, q->id, et, gfp))
+ goto out_free_tags;
+ }
+ }
+ return 0;
+out_free_tags:
+ blk_mq_free_sched_tags(et, set);
+out_unwind:
+ list_for_each_entry_continue_reverse(q, &set->tag_list, tag_set_list) {
+ if (q->elevator) {
+ et = xa_load(et_table, q->id);
+ if (et)
+ blk_mq_free_sched_tags(et, set);
+ }
+ }
+ return -ENOMEM;
+}
+
/* caller must have a reference to @e, will grab another one if successful */
int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
struct elevator_tags *et)
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 0cde00cd1c47..b554e1d55950 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -25,8 +25,12 @@ void blk_mq_sched_free_rqs(struct request_queue *q);
struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
unsigned int nr_hw_queues);
+int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
+ struct blk_mq_tag_set *set, unsigned int nr_hw_queues);
void blk_mq_free_sched_tags(struct elevator_tags *et,
struct blk_mq_tag_set *set);
+void blk_mq_free_sched_tags_batch(struct xarray *et_table,
+ struct blk_mq_tag_set *set);
static inline void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
{
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4806b867e37d..af7aae699ede 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4972,6 +4972,8 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
struct request_queue *q;
int prev_nr_hw_queues = set->nr_hw_queues;
unsigned int memflags;
+ struct xarray et_table;
+ struct elevator_tags *et;
int i;
lockdep_assert_held(&set->tag_list_lock);
@@ -4984,6 +4986,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
return;
memflags = memalloc_noio_save();
+
+ xa_init(&et_table);
+ if (blk_mq_alloc_sched_tags_batch(&et_table, set, nr_hw_queues) < 0)
+ goto out_memalloc_restore;
+
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_debugfs_unregister_hctxs(q);
blk_mq_sysfs_unregister_hctxs(q);
@@ -4995,6 +5002,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) {
list_for_each_entry(q, &set->tag_list, tag_set_list)
blk_mq_unfreeze_queue_nomemrestore(q);
+ blk_mq_free_sched_tags_batch(&et_table, set);
goto reregister;
}
@@ -5018,8 +5026,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
}
/* elv_update_nr_hw_queues() unfreeze queue for us */
- list_for_each_entry(q, &set->tag_list, tag_set_list)
- elv_update_nr_hw_queues(q);
+ list_for_each_entry(q, &set->tag_list, tag_set_list) {
+ et = xa_load(&et_table, q->id);
+ elv_update_nr_hw_queues(q, et);
+ }
reregister:
list_for_each_entry(q, &set->tag_list, tag_set_list) {
@@ -5029,7 +5039,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
blk_mq_remove_hw_queues_cpuhp(q);
blk_mq_add_hw_queues_cpuhp(q);
}
+out_memalloc_restore:
memalloc_noio_restore(memflags);
+ xa_destroy(&et_table);
/* Free the excess tags when nr_hw_queues shrink. */
for (i = set->nr_hw_queues; i < prev_nr_hw_queues; i++)
diff --git a/block/blk.h b/block/blk.h
index 37ec459fe656..a312518fb8f3 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -12,6 +12,7 @@
#include "blk-crypto-internal.h"
struct elevator_type;
+struct elevator_tags;
#define BLK_DEV_MAX_SECTORS (LLONG_MAX >> 9)
#define BLK_MIN_SEGMENT_SIZE 4096
@@ -321,7 +322,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
bool blk_insert_flush(struct request *rq);
-void elv_update_nr_hw_queues(struct request_queue *q);
+void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_tags *et);
void elevator_set_default(struct request_queue *q);
void elevator_set_none(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index 50f4b78efe66..92a0a4851583 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -705,7 +705,7 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
* The I/O scheduler depends on the number of hardware queues, this forces a
* reattachment when nr_hw_queues changes.
*/
-void elv_update_nr_hw_queues(struct request_queue *q)
+void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_tags *et)
{
struct blk_mq_tag_set *set = q->tag_set;
struct elv_change_ctx ctx = {};
@@ -720,11 +720,11 @@ void elv_update_nr_hw_queues(struct request_queue *q)
* acquires same lock in the reader context) can't run concurrently.
*/
if (q->elevator) {
- ctx.et = blk_mq_alloc_sched_tags(set, set->nr_hw_queues);
- if (!ctx.et) {
+ if (!et) {
WARN_ON_ONCE(1);
return;
}
+ ctx.et = et;
}
mutex_lock(&q->elevator_lock);
--
2.50.0
* Re: [PATCHv7 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
2025-07-01 8:18 ` [PATCHv7 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
@ 2025-07-01 10:52 ` Hannes Reinecke
0 siblings, 0 replies; 9+ messages in thread
From: Hannes Reinecke @ 2025-07-01 10:52 UTC (permalink / raw)
To: Nilay Shroff, linux-block; +Cc: hch, ming.lei, axboe, sth, gjoyce
On 7/1/25 10:18, Nilay Shroff wrote:
> Recent lockdep reports [1] have revealed a potential deadlock caused by a
> lock dependency between the percpu allocator lock and the elevator lock.
> This issue can be avoided by ensuring that the allocation and release of
> scheduler tags (sched_tags) are performed outside the elevator lock.
> Furthermore, the queue does not need to remain frozen during these
> operations.
>
> To address this, move all sched_tags allocations and deallocations outside
> of both the ->elevator_lock and the ->freeze_lock. Since the lifetime of
> the elevator queue and its associated sched_tags is closely tied, the
> allocated sched_tags are now stored in the elevator queue structure. Then,
> during the actual elevator switch (which runs under ->freeze_lock and
> ->elevator_lock), the pre-allocated sched_tags are assigned to the
> appropriate q->hctx. Once the elevator switch is complete and the locks
> are released, the old elevator queue and its associated sched_tags are
> freed.
>
> This commit specifically addresses the allocation/deallocation of
> sched_tags during elevator switching. Note that sched_tags may also be allocated
> in other contexts, such as during nr_hw_queues updates. Supporting that
> use case will require batch allocation/deallocation, which will be handled
> in a follow-up patch.
>
> This restructuring ensures that sched_tags memory management occurs
> entirely outside of the ->elevator_lock and ->freeze_lock context,
> eliminating the lock dependency problem seen during scheduler updates.
>
> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
>
> Reported-by: Stefan Haberland <sth@linux.ibm.com>
> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> block/blk-mq-sched.c | 155 +++++++++++++++++++++++--------------------
> block/blk-mq-sched.h | 8 ++-
> block/elevator.c | 47 +++++++++++--
> block/elevator.h | 14 +++-
> 4 files changed, 144 insertions(+), 80 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCHv7 3/3] block: fix potential deadlock while running nr_hw_queue update
2025-07-01 8:19 ` [PATCHv7 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
@ 2025-07-01 11:00 ` Hannes Reinecke
0 siblings, 0 replies; 9+ messages in thread
From: Hannes Reinecke @ 2025-07-01 11:00 UTC (permalink / raw)
To: Nilay Shroff, linux-block; +Cc: hch, ming.lei, axboe, sth, gjoyce
On 7/1/25 10:19, Nilay Shroff wrote:
> Move scheduler tags (sched_tags) allocation and deallocation outside
> both the ->elevator_lock and ->freeze_lock when updating nr_hw_queues.
> This change breaks the dependency chain from the percpu allocator lock
> to the elevator lock, helping to prevent potential deadlocks, as
> observed in the reported lockdep splat[1].
>
> This commit introduces batch allocation and deallocation helpers for
> sched_tags, which are now used from within __blk_mq_update_nr_hw_queues
> routine while iterating through the tagset.
>
> With this change, all sched_tags memory management is handled entirely
> outside the ->elevator_lock and the ->freeze_lock context, thereby
> eliminating the lock dependency that could otherwise manifest during
> nr_hw_queues updates.
>
> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
>
> Reported-by: Stefan Haberland <sth@linux.ibm.com>
> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> block/blk-mq-sched.c | 65 ++++++++++++++++++++++++++++++++++++++++++++
> block/blk-mq-sched.h | 4 +++
> block/blk-mq.c | 16 +++++++++--
> block/blk.h | 3 +-
> block/elevator.c | 6 ++--
> 5 files changed, 88 insertions(+), 6 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context
2025-07-01 8:18 [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
` (2 preceding siblings ...)
2025-07-01 8:19 ` [PATCHv7 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
@ 2025-07-02 13:53 ` Yi Zhang
2025-07-02 14:17 ` Nilay Shroff
3 siblings, 1 reply; 9+ messages in thread
From: Yi Zhang @ 2025-07-02 13:53 UTC (permalink / raw)
To: Nilay Shroff
Cc: linux-block, hch, ming.lei, hare, axboe, sth, gjoyce,
Shinichiro Kawasaki
Hi Nilay
With the patch on the latest linux-block/for-next, I reproduced the
following WARNING with blktests block/005; here is the full log:
[ 342.845331] run blktests block/005 at 2025-07-02 09:48:55
[ 343.835605] ======================================================
[ 343.841783] WARNING: possible circular locking dependency detected
[ 343.847966] 6.16.0-rc4.fix+ #3 Not tainted
[ 343.852073] ------------------------------------------------------
[ 343.858250] check/1365 is trying to acquire lock:
[ 343.862957] ffffffff98141db0 (pcpu_alloc_mutex){+.+.}-{4:4}, at:
pcpu_alloc_noprof+0x8eb/0xd70
[ 343.871587]
but task is already holding lock:
[ 343.877421] ffff888300cfb040 (&q->elevator_lock){+.+.}-{4:4}, at:
elevator_change+0x152/0x530
[ 343.885958]
which lock already depends on the new lock.
[ 343.894131]
the existing dependency chain (in reverse order) is:
[ 343.901609]
-> #3 (&q->elevator_lock){+.+.}-{4:4}:
[ 343.907891] __lock_acquire+0x6f1/0xc00
[ 343.912259] lock_acquire.part.0+0xb6/0x240
[ 343.916966] __mutex_lock+0x17b/0x1690
[ 343.921247] elevator_change+0x152/0x530
[ 343.925692] elv_iosched_store+0x205/0x2f0
[ 343.930312] queue_attr_store+0x23b/0x300
[ 343.934853] kernfs_fop_write_iter+0x357/0x530
[ 343.939829] vfs_write+0x9bc/0xf60
[ 343.943763] ksys_write+0xf3/0x1d0
[ 343.947695] do_syscall_64+0x8c/0x3d0
[ 343.951883] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 343.957462]
-> #2 (&q->q_usage_counter(io)#4){++++}-{0:0}:
[ 343.964440] __lock_acquire+0x6f1/0xc00
[ 343.968799] lock_acquire.part.0+0xb6/0x240
[ 343.973507] blk_alloc_queue+0x5c5/0x710
[ 343.977959] blk_mq_alloc_queue+0x14e/0x240
[ 343.982666] __blk_mq_alloc_disk+0x15/0xd0
[ 343.987294] nvme_alloc_ns+0x208/0x1690 [nvme_core]
[ 343.992727] nvme_scan_ns+0x362/0x4c0 [nvme_core]
[ 343.997978] async_run_entry_fn+0x96/0x4f0
[ 344.002599] process_one_work+0x8cd/0x1950
[ 344.007226] worker_thread+0x58d/0xcf0
[ 344.011499] kthread+0x3d8/0x7a0
[ 344.015259] ret_from_fork+0x406/0x510
[ 344.019532] ret_from_fork_asm+0x1a/0x30
[ 344.023980]
-> #1 (fs_reclaim){+.+.}-{0:0}:
[ 344.029654] __lock_acquire+0x6f1/0xc00
[ 344.034015] lock_acquire.part.0+0xb6/0x240
[ 344.038727] fs_reclaim_acquire+0x103/0x150
[ 344.043433] prepare_alloc_pages+0x15f/0x600
[ 344.048230] __alloc_frozen_pages_noprof+0x14a/0x3a0
[ 344.053722] __alloc_pages_noprof+0xd/0x1d0
[ 344.058438] pcpu_alloc_pages.constprop.0+0x104/0x420
[ 344.064017] pcpu_populate_chunk+0x38/0x80
[ 344.068644] pcpu_alloc_noprof+0x650/0xd70
[ 344.073265] iommu_dma_init_fq+0x183/0x730
[ 344.077893] iommu_dma_init_domain+0x566/0x990
[ 344.082866] iommu_setup_dma_ops+0xca/0x230
[ 344.087571] bus_iommu_probe+0x1f8/0x4a0
[ 344.092020] iommu_device_register+0x153/0x240
[ 344.096993] iommu_init_pci+0x53c/0x1040
[ 344.101447] amd_iommu_init_pci+0xb6/0x5c0
[ 344.106066] state_next+0xaf7/0xff0
[ 344.110080] iommu_go_to_state+0x21/0x80
[ 344.114535] amd_iommu_init+0x15/0x70
[ 344.118728] pci_iommu_init+0x29/0x70
[ 344.122914] do_one_initcall+0x100/0x5a0
[ 344.127361] do_initcalls+0x138/0x1d0
[ 344.131556] kernel_init_freeable+0x8b7/0xbd0
[ 344.136442] kernel_init+0x1b/0x1f0
[ 344.140456] ret_from_fork+0x406/0x510
[ 344.144735] ret_from_fork_asm+0x1a/0x30
[ 344.149182]
-> #0 (pcpu_alloc_mutex){+.+.}-{4:4}:
[ 344.155379] check_prev_add+0xf1/0xce0
[ 344.159653] validate_chain+0x470/0x580
[ 344.164019] __lock_acquire+0x6f1/0xc00
[ 344.168378] lock_acquire.part.0+0xb6/0x240
[ 344.173085] __mutex_lock+0x17b/0x1690
[ 344.177365] pcpu_alloc_noprof+0x8eb/0xd70
[ 344.181984] kyber_queue_data_alloc+0x16d/0x660
[ 344.187047] kyber_init_sched+0x14/0x90
[ 344.191413] blk_mq_init_sched+0x264/0x4e0
[ 344.196033] elevator_switch+0x186/0x6a0
[ 344.200478] elevator_change+0x305/0x530
[ 344.204924] elv_iosched_store+0x205/0x2f0
[ 344.209545] queue_attr_store+0x23b/0x300
[ 344.214084] kernfs_fop_write_iter+0x357/0x530
[ 344.219051] vfs_write+0x9bc/0xf60
[ 344.222976] ksys_write+0xf3/0x1d0
[ 344.226902] do_syscall_64+0x8c/0x3d0
[ 344.231088] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 344.236660]
other info that might help us debug this:
[ 344.244662] Chain exists of:
pcpu_alloc_mutex --> &q->q_usage_counter(io)#4 -->
&q->elevator_lock
[ 344.256587] Possible unsafe locking scenario:
[ 344.262504] CPU0 CPU1
[ 344.267037] ---- ----
[ 344.271570] lock(&q->elevator_lock);
[ 344.275330] lock(&q->q_usage_counter(io)#4);
[ 344.282298] lock(&q->elevator_lock);
[ 344.288573] lock(pcpu_alloc_mutex);
[ 344.292249]
*** DEADLOCK ***
[ 344.298168] 7 locks held by check/1365:
[ 344.302015] #0: ffff888124330448 (sb_writers#4){.+.+}-{0:0}, at:
ksys_write+0xf3/0x1d0
[ 344.310039] #1: ffff8883285fc490 (&of->mutex#2){+.+.}-{4:4}, at:
kernfs_fop_write_iter+0x216/0x530
[ 344.319104] #2: ffff8882004b5f08 (kn->active#86){.+.+}-{0:0}, at:
kernfs_fop_write_iter+0x239/0x530
[ 344.328258] #3: ffff8883032d41a8
(&set->update_nr_hwq_lock){.+.+}-{4:4}, at:
elv_iosched_store+0x1c0/0x2f0
[ 344.338014] #4: ffff888300cfaab0
(&q->q_usage_counter(io)#4){++++}-{0:0}, at:
blk_mq_freeze_queue_nomemsave+0xe/0x20
[ 344.348640] #5: ffff888300cfaaf0
(&q->q_usage_counter(queue)#4){+.+.}-{0:0}, at:
blk_mq_freeze_queue_nomemsave+0xe/0x20
[ 344.359525] #6: ffff888300cfb040 (&q->elevator_lock){+.+.}-{4:4},
at: elevator_change+0x152/0x530
[ 344.368502]
stack backtrace:
[ 344.372866] CPU: 8 UID: 0 PID: 1365 Comm: check Not tainted
6.16.0-rc4.fix+ #3 PREEMPT(voluntary)
[ 344.372872] Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS
2.17.0 12/04/2024
[ 344.372876] Call Trace:
[ 344.372879] <TASK>
[ 344.372883] dump_stack_lvl+0x7e/0xc0
[ 344.372891] print_circular_bug+0xdc/0xf0
[ 344.372901] check_noncircular+0x131/0x150
[ 344.372905] ? local_clock_noinstr+0x9/0xc0
[ 344.372912] ? srso_return_thunk+0x5/0x5f
[ 344.372918] ? srso_return_thunk+0x5/0x5f
[ 344.372929] check_prev_add+0xf1/0xce0
[ 344.372934] ? srso_return_thunk+0x5/0x5f
[ 344.372938] ? add_chain_cache+0x111/0x320
[ 344.372948] validate_chain+0x470/0x580
[ 344.372953] ? srso_return_thunk+0x5/0x5f
[ 344.372963] __lock_acquire+0x6f1/0xc00
[ 344.372976] lock_acquire.part.0+0xb6/0x240
[ 344.372981] ? pcpu_alloc_noprof+0x8eb/0xd70
[ 344.372991] ? srso_return_thunk+0x5/0x5f
[ 344.372995] ? rcu_is_watching+0x11/0xb0
[ 344.373001] ? srso_return_thunk+0x5/0x5f
[ 344.373005] ? lock_acquire+0x10e/0x160
[ 344.373015] __mutex_lock+0x17b/0x1690
[ 344.373020] ? pcpu_alloc_noprof+0x8eb/0xd70
[ 344.373025] ? __kasan_kmalloc+0x7b/0x90
[ 344.373030] ? kyber_queue_data_alloc+0x7e/0x660
[ 344.373035] ? kyber_init_sched+0x14/0x90
[ 344.373041] ? elevator_change+0x305/0x530
[ 344.373045] ? pcpu_alloc_noprof+0x8eb/0xd70
[ 344.373050] ? queue_attr_store+0x23b/0x300
[ 344.373055] ? kernfs_fop_write_iter+0x357/0x530
[ 344.373059] ? vfs_write+0x9bc/0xf60
[ 344.373064] ? ksys_write+0xf3/0x1d0
[ 344.373068] ? do_syscall_64+0x8c/0x3d0
[ 344.373075] ? __pfx___mutex_lock+0x10/0x10
[ 344.373079] ? srso_return_thunk+0x5/0x5f
[ 344.373084] ? lock_acquire.part.0+0xb6/0x240
[ 344.373088] ? srso_return_thunk+0x5/0x5f
[ 344.373093] ? srso_return_thunk+0x5/0x5f
[ 344.373097] ? find_held_lock+0x32/0x90
[ 344.373101] ? srso_return_thunk+0x5/0x5f
[ 344.373105] ? local_clock_noinstr+0x9/0xc0
[ 344.373111] ? srso_return_thunk+0x5/0x5f
[ 344.373116] ? __lock_release+0x1a2/0x2c0
[ 344.373123] ? srso_return_thunk+0x5/0x5f
[ 344.373127] ? mark_held_locks+0x50/0x80
[ 344.373134] ? _raw_spin_unlock_irqrestore+0x59/0x70
[ 344.373139] ? srso_return_thunk+0x5/0x5f
[ 344.373143] ? lockdep_hardirqs_on+0x78/0x100
[ 344.373149] ? srso_return_thunk+0x5/0x5f
[ 344.373159] ? pcpu_alloc_noprof+0x8eb/0xd70
[ 344.373164] pcpu_alloc_noprof+0x8eb/0xd70
[ 344.373169] ? __kmalloc_cache_node_noprof+0x3a7/0x4d0
[ 344.373179] ? xa_find_after+0x1ac/0x310
[ 344.373187] ? srso_return_thunk+0x5/0x5f
[ 344.373192] ? kasan_save_track+0x10/0x30
[ 344.373200] kyber_queue_data_alloc+0x16d/0x660
[ 344.373207] ? debug_mutex_init+0x33/0x70
[ 344.373216] kyber_init_sched+0x14/0x90
[ 344.373223] blk_mq_init_sched+0x264/0x4e0
[ 344.373235] ? __pfx_blk_mq_init_sched+0x10/0x10
[ 344.373239] ? srso_return_thunk+0x5/0x5f
[ 344.373253] elevator_switch+0x186/0x6a0
[ 344.373262] elevator_change+0x305/0x530
[ 344.373272] elv_iosched_store+0x205/0x2f0
[ 344.373279] ? __pfx_elv_iosched_store+0x10/0x10
[ 344.373285] ? srso_return_thunk+0x5/0x5f
[ 344.373294] ? srso_return_thunk+0x5/0x5f
[ 344.373298] ? __lock_acquired+0xda/0x250
[ 344.373306] ? srso_return_thunk+0x5/0x5f
[ 344.373310] ? rcu_is_watching+0x11/0xb0
[ 344.373315] ? srso_return_thunk+0x5/0x5f
[ 344.373324] queue_attr_store+0x23b/0x300
[ 344.373333] ? __pfx_queue_attr_store+0x10/0x10
[ 344.373344] ? srso_return_thunk+0x5/0x5f
[ 344.373349] ? srso_return_thunk+0x5/0x5f
[ 344.373354] ? find_held_lock+0x32/0x90
[ 344.373358] ? local_clock_noinstr+0x9/0xc0
[ 344.373365] ? srso_return_thunk+0x5/0x5f
[ 344.373369] ? __lock_release+0x1a2/0x2c0
[ 344.373377] ? sysfs_file_kobj+0xb5/0x1c0
[ 344.373383] ? srso_return_thunk+0x5/0x5f
[ 344.373390] ? srso_return_thunk+0x5/0x5f
[ 344.373394] ? sysfs_file_kobj+0xbf/0x1c0
[ 344.373399] ? srso_return_thunk+0x5/0x5f
[ 344.373408] ? __pfx_sysfs_kf_write+0x10/0x10
[ 344.373412] kernfs_fop_write_iter+0x357/0x530
[ 344.373422] vfs_write+0x9bc/0xf60
[ 344.373433] ? __pfx_vfs_write+0x10/0x10
[ 344.373449] ? find_held_lock+0x32/0x90
[ 344.373454] ? local_clock_noinstr+0x9/0xc0
[ 344.373460] ? srso_return_thunk+0x5/0x5f
[ 344.373470] ksys_write+0xf3/0x1d0
[ 344.373478] ? __pfx_ksys_write+0x10/0x10
[ 344.373484] ? srso_return_thunk+0x5/0x5f
[ 344.373491] ? srso_return_thunk+0x5/0x5f
[ 344.373503] do_syscall_64+0x8c/0x3d0
[ 344.373508] ? srso_return_thunk+0x5/0x5f
[ 344.373512] ? do_user_addr_fault+0x489/0xb10
[ 344.373519] ? srso_return_thunk+0x5/0x5f
[ 344.373523] ? rcu_is_watching+0x11/0xb0
[ 344.373528] ? srso_return_thunk+0x5/0x5f
[ 344.373534] ? srso_return_thunk+0x5/0x5f
[ 344.373542] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 344.373547] RIP: 0033:0x7ff3e4afe377
[ 344.373552] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7
0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00
00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89
74 24
[ 344.373557] RSP: 002b:00007ffc83840088 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[ 344.373562] RAX: ffffffffffffffda RBX: 00007ff3e4bfa780 RCX: 00007ff3e4afe377
[ 344.373565] RDX: 0000000000000006 RSI: 00005555d2ec8240 RDI: 0000000000000001
[ 344.373568] RBP: 0000000000000006 R08: 0000000000000000 R09: 00007ff3e4bb0d40
[ 344.373571] R10: 00007ff3e4bb0c40 R11: 0000000000000246 R12: 0000000000000006
[ 344.373574] R13: 00005555d2ec8240 R14: 0000000000000006 R15: 00007ff3e4bf59e0
[ 344.373592] </TASK>
On Tue, Jul 1, 2025 at 4:20 PM Nilay Shroff <nilay@linux.ibm.com> wrote:
>
> Hi,
>
> There have been a few reports[1] indicating potential lockdep warnings due
> to a lock dependency from the percpu allocator to the elevator lock. This
> patch series aims to eliminate that dependency.
>
> The series consists of three patches:
> The first patch is a preparatory patch that just moves the elevator queue
> allocation logic from ->init_sched into blk_mq_init_sched.
>
> The second patch in the series restructures sched_tags allocation and
> deallocation during elevator update/switch operations to ensure these
> actions are performed entirely outside the ->freeze_lock and
> ->elevator_lock. This eliminates the percpu allocator’s lock dependency on the
> elevator and freeze lock during scheduler transitions.
>
> The third patch introduces batch allocation and deallocation helpers for
> sched_tags. These helpers are used during __blk_mq_update_nr_hw_queues()
> to decouple sched_tags memory management from both the elevator and freeze
> locks, addressing the lockdep concerns in the nr_hw_queues update path.
>
> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
>
> Changes since v6:
> - Add warning when loading elevator tags from an xarray yields nothing
> (Hannes Reinecke)
> - Use elevator tags instead of xarray table as a function argument to
> elv_update_nr_hw_queues (Ming Lei)
> Link to v6: https://lore.kernel.org/all/20250630054756.54532-1-nilay@linux.ibm.com/
>
> Changes since v5:
> - Fixed smatch warning reported by kernel test robot here:
> https://lore.kernel.org/all/202506300509.2S1tygch-lkp@intel.com/
> Link to v5: https://lore.kernel.org/all/20250627175544.1063910-1-nilay@linux.ibm.com/
>
> Changes since v4:
> - Define a local Xarray variable in __blk_mq_update_nr_hw_queues to store
> sched_tags, instead of storing it in an Xarray defined in 'struct elevator_tags'
> (Ming Lei)
> Link to v4: https://lore.kernel.org/all/20250624131716.630465-1-nilay@linux.ibm.com/
>
> Changes since v3:
> - Further split the patchset into three patch series so that we can
> have a separate patch for sched_tags batch allocation/deallocation
> (Ming Lei)
> - Use Xarray to store and load the sched_tags (Ming Lei)
> - Unexport elevator_alloc() as we no longer need to use it outside
> of block layer core (hch)
> - unwind the sched_tags allocation and free tags when it fails in
> the middle of allocation (hch)
> - Move struct elevator_queue header from common header to elevator.c
> as there's no user of it outside elevator.c (Ming Lei, hch)
> Link to v3: https://lore.kernel.org/all/20250616173233.3803824-1-nilay@linux.ibm.com/
>
> Changes since v2:
> - Split the patch into a two-patch series. The first patch updates the
> ->init_sched elevator API and the second patch handles the sched
> tags allocation/de-allocation logic (Ming Lei)
> - Address sched tags allocation/deallocation logic while running in the
> context of nr_hw_queue update so that we can handle all possible
> scenarios in a single patchset (Ming Lei)
> Link to v2: https://lore.kernel.org/all/20250528123638.1029700-1-nilay@linux.ibm.com/
>
> Changes since v1:
> - As the lifetimes of the elevator queue and sched tags are the same,
> allocate and move sched tags under struct elevator_queue (Ming Lei)
> Link to v1: https://lore.kernel.org/all/20250520103425.1259712-1-nilay@linux.ibm.com/
>
> Nilay Shroff (3):
> block: move elevator queue allocation logic into blk_mq_init_sched
> block: fix lockdep warning caused by lock dependency in
> elv_iosched_store
> block: fix potential deadlock while running nr_hw_queue update
>
> block/bfq-iosched.c | 13 +--
> block/blk-mq-sched.c | 223 ++++++++++++++++++++++++++++--------------
> block/blk-mq-sched.h | 12 ++-
> block/blk-mq.c | 16 ++-
> block/blk.h | 2 +-
> block/elevator.c | 50 ++++++++--
> block/elevator.h | 16 ++-
> block/kyber-iosched.c | 11 +--
> block/mq-deadline.c | 14 +--
> 9 files changed, 241 insertions(+), 116 deletions(-)
>
> --
> 2.50.0
>
>
--
Best Regards,
Yi Zhang
* Re: [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context
2025-07-02 13:53 ` [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context Yi Zhang
@ 2025-07-02 14:17 ` Nilay Shroff
2025-07-02 14:41 ` Yi Zhang
0 siblings, 1 reply; 9+ messages in thread
From: Nilay Shroff @ 2025-07-02 14:17 UTC (permalink / raw)
To: Yi Zhang
Cc: linux-block, hch, ming.lei, hare, axboe, sth, gjoyce,
Shinichiro Kawasaki
On 7/2/25 7:23 PM, Yi Zhang wrote:
> Hi Nilay
>
> With the patch on the latest linux-block/for-next, I reproduced the
> following WARNING with blktests block/005, here is the full log:
>
> [ 342.845331] run blktests block/005 at 2025-07-02 09:48:55
>
> [ 343.835605] ======================================================
> [ 343.841783] WARNING: possible circular locking dependency detected
> [ 343.847966] 6.16.0-rc4.fix+ #3 Not tainted
> [ 343.852073] ------------------------------------------------------
> [ 343.858250] check/1365 is trying to acquire lock:
> [ 343.862957] ffffffff98141db0 (pcpu_alloc_mutex){+.+.}-{4:4}, at:
> pcpu_alloc_noprof+0x8eb/0xd70
> [ 343.871587]
> but task is already holding lock:
> [ 343.877421] ffff888300cfb040 (&q->elevator_lock){+.+.}-{4:4}, at:
> elevator_change+0x152/0x530
> [ 343.885958]
> which lock already depends on the new lock.
>
> [ 343.894131]
> the existing dependency chain (in reverse order) is:
> [ 343.901609]
> -> #3 (&q->elevator_lock){+.+.}-{4:4}:
> [ 343.907891] __lock_acquire+0x6f1/0xc00
> [ 343.912259] lock_acquire.part.0+0xb6/0x240
> [ 343.916966] __mutex_lock+0x17b/0x1690
> [ 343.921247] elevator_change+0x152/0x530
> [ 343.925692] elv_iosched_store+0x205/0x2f0
> [ 343.930312] queue_attr_store+0x23b/0x300
> [ 343.934853] kernfs_fop_write_iter+0x357/0x530
> [ 343.939829] vfs_write+0x9bc/0xf60
> [ 343.943763] ksys_write+0xf3/0x1d0
> [ 343.947695] do_syscall_64+0x8c/0x3d0
> [ 343.951883] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 343.957462]
> -> #2 (&q->q_usage_counter(io)#4){++++}-{0:0}:
> [ 343.964440] __lock_acquire+0x6f1/0xc00
> [ 343.968799] lock_acquire.part.0+0xb6/0x240
> [ 343.973507] blk_alloc_queue+0x5c5/0x710
> [ 343.977959] blk_mq_alloc_queue+0x14e/0x240
> [ 343.982666] __blk_mq_alloc_disk+0x15/0xd0
> [ 343.987294] nvme_alloc_ns+0x208/0x1690 [nvme_core]
> [ 343.992727] nvme_scan_ns+0x362/0x4c0 [nvme_core]
> [ 343.997978] async_run_entry_fn+0x96/0x4f0
> [ 344.002599] process_one_work+0x8cd/0x1950
> [ 344.007226] worker_thread+0x58d/0xcf0
> [ 344.011499] kthread+0x3d8/0x7a0
> [ 344.015259] ret_from_fork+0x406/0x510
> [ 344.019532] ret_from_fork_asm+0x1a/0x30
> [ 344.023980]
> -> #1 (fs_reclaim){+.+.}-{0:0}:
> [ 344.029654] __lock_acquire+0x6f1/0xc00
> [ 344.034015] lock_acquire.part.0+0xb6/0x240
> [ 344.038727] fs_reclaim_acquire+0x103/0x150
> [ 344.043433] prepare_alloc_pages+0x15f/0x600
> [ 344.048230] __alloc_frozen_pages_noprof+0x14a/0x3a0
> [ 344.053722] __alloc_pages_noprof+0xd/0x1d0
> [ 344.058438] pcpu_alloc_pages.constprop.0+0x104/0x420
> [ 344.064017] pcpu_populate_chunk+0x38/0x80
> [ 344.068644] pcpu_alloc_noprof+0x650/0xd70
> [ 344.073265] iommu_dma_init_fq+0x183/0x730
> [ 344.077893] iommu_dma_init_domain+0x566/0x990
> [ 344.082866] iommu_setup_dma_ops+0xca/0x230
> [ 344.087571] bus_iommu_probe+0x1f8/0x4a0
> [ 344.092020] iommu_device_register+0x153/0x240
> [ 344.096993] iommu_init_pci+0x53c/0x1040
> [ 344.101447] amd_iommu_init_pci+0xb6/0x5c0
> [ 344.106066] state_next+0xaf7/0xff0
> [ 344.110080] iommu_go_to_state+0x21/0x80
> [ 344.114535] amd_iommu_init+0x15/0x70
> [ 344.118728] pci_iommu_init+0x29/0x70
> [ 344.122914] do_one_initcall+0x100/0x5a0
> [ 344.127361] do_initcalls+0x138/0x1d0
> [ 344.131556] kernel_init_freeable+0x8b7/0xbd0
> [ 344.136442] kernel_init+0x1b/0x1f0
> [ 344.140456] ret_from_fork+0x406/0x510
> [ 344.144735] ret_from_fork_asm+0x1a/0x30
> [ 344.149182]
> -> #0 (pcpu_alloc_mutex){+.+.}-{4:4}:
> [ 344.155379] check_prev_add+0xf1/0xce0
> [ 344.159653] validate_chain+0x470/0x580
> [ 344.164019] __lock_acquire+0x6f1/0xc00
> [ 344.168378] lock_acquire.part.0+0xb6/0x240
> [ 344.173085] __mutex_lock+0x17b/0x1690
> [ 344.177365] pcpu_alloc_noprof+0x8eb/0xd70
> [ 344.181984] kyber_queue_data_alloc+0x16d/0x660
> [ 344.187047] kyber_init_sched+0x14/0x90
> [ 344.191413] blk_mq_init_sched+0x264/0x4e0
> [ 344.196033] elevator_switch+0x186/0x6a0
> [ 344.200478] elevator_change+0x305/0x530
> [ 344.204924] elv_iosched_store+0x205/0x2f0
> [ 344.209545] queue_attr_store+0x23b/0x300
> [ 344.214084] kernfs_fop_write_iter+0x357/0x530
> [ 344.219051] vfs_write+0x9bc/0xf60
> [ 344.222976] ksys_write+0xf3/0x1d0
> [ 344.226902] do_syscall_64+0x8c/0x3d0
> [ 344.231088] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 344.236660]
Thanks for the report!
I see that the above warning is different from the one addressed by the
current patchset. In the warning you've reported, the kyber elevator
allocates per-CPU data after acquiring ->elevator_lock, which introduces
a percpu allocator lock dependency on ->elevator_lock.
In contrast, the current patchset addresses a separate issue [1] that arises
due to elevator tags allocation. This allocation occurs while both ->freeze_lock
and ->elevator_lock are held. Internally, elevator tags allocation sets up
per-CPU sbitmap->alloc_hint, which also introduces a similar per-CPU lock
dependency on ->elevator_lock.
That said, I plan to address the issue you've just reported in a separate
patch once the current patchset is merged.
Thanks,
--Nilay
[1]https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
* Re: [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context
2025-07-02 14:17 ` Nilay Shroff
@ 2025-07-02 14:41 ` Yi Zhang
0 siblings, 0 replies; 9+ messages in thread
From: Yi Zhang @ 2025-07-02 14:41 UTC (permalink / raw)
To: Nilay Shroff
Cc: linux-block, hch, ming.lei, hare, axboe, sth, gjoyce,
Shinichiro Kawasaki
On Wed, Jul 2, 2025 at 10:17 PM Nilay Shroff <nilay@linux.ibm.com> wrote:
>
>
>
> On 7/2/25 7:23 PM, Yi Zhang wrote:
> > Hi Nilay
> >
> > With the patch on the latest linux-block/for-next, I reproduced the
> > following WARNING with blktests block/005, here is the full log:
> >
> > [ 342.845331] run blktests block/005 at 2025-07-02 09:48:55
> >
> > [ 343.835605] ======================================================
> > [ 343.841783] WARNING: possible circular locking dependency detected
> > [ 343.847966] 6.16.0-rc4.fix+ #3 Not tainted
> > [ 343.852073] ------------------------------------------------------
> > [ 343.858250] check/1365 is trying to acquire lock:
> > [ 343.862957] ffffffff98141db0 (pcpu_alloc_mutex){+.+.}-{4:4}, at:
> > pcpu_alloc_noprof+0x8eb/0xd70
> > [ 343.871587]
> > but task is already holding lock:
> > [ 343.877421] ffff888300cfb040 (&q->elevator_lock){+.+.}-{4:4}, at:
> > elevator_change+0x152/0x530
> > [ 343.885958]
> > which lock already depends on the new lock.
> >
> > [ 343.894131]
> > the existing dependency chain (in reverse order) is:
> > [ 343.901609]
> > -> #3 (&q->elevator_lock){+.+.}-{4:4}:
> > [ 343.907891] __lock_acquire+0x6f1/0xc00
> > [ 343.912259] lock_acquire.part.0+0xb6/0x240
> > [ 343.916966] __mutex_lock+0x17b/0x1690
> > [ 343.921247] elevator_change+0x152/0x530
> > [ 343.925692] elv_iosched_store+0x205/0x2f0
> > [ 343.930312] queue_attr_store+0x23b/0x300
> > [ 343.934853] kernfs_fop_write_iter+0x357/0x530
> > [ 343.939829] vfs_write+0x9bc/0xf60
> > [ 343.943763] ksys_write+0xf3/0x1d0
> > [ 343.947695] do_syscall_64+0x8c/0x3d0
> > [ 343.951883] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 343.957462]
> > -> #2 (&q->q_usage_counter(io)#4){++++}-{0:0}:
> > [ 343.964440] __lock_acquire+0x6f1/0xc00
> > [ 343.968799] lock_acquire.part.0+0xb6/0x240
> > [ 343.973507] blk_alloc_queue+0x5c5/0x710
> > [ 343.977959] blk_mq_alloc_queue+0x14e/0x240
> > [ 343.982666] __blk_mq_alloc_disk+0x15/0xd0
> > [ 343.987294] nvme_alloc_ns+0x208/0x1690 [nvme_core]
> > [ 343.992727] nvme_scan_ns+0x362/0x4c0 [nvme_core]
> > [ 343.997978] async_run_entry_fn+0x96/0x4f0
> > [ 344.002599] process_one_work+0x8cd/0x1950
> > [ 344.007226] worker_thread+0x58d/0xcf0
> > [ 344.011499] kthread+0x3d8/0x7a0
> > [ 344.015259] ret_from_fork+0x406/0x510
> > [ 344.019532] ret_from_fork_asm+0x1a/0x30
> > [ 344.023980]
> > -> #1 (fs_reclaim){+.+.}-{0:0}:
> > [ 344.029654] __lock_acquire+0x6f1/0xc00
> > [ 344.034015] lock_acquire.part.0+0xb6/0x240
> > [ 344.038727] fs_reclaim_acquire+0x103/0x150
> > [ 344.043433] prepare_alloc_pages+0x15f/0x600
> > [ 344.048230] __alloc_frozen_pages_noprof+0x14a/0x3a0
> > [ 344.053722] __alloc_pages_noprof+0xd/0x1d0
> > [ 344.058438] pcpu_alloc_pages.constprop.0+0x104/0x420
> > [ 344.064017] pcpu_populate_chunk+0x38/0x80
> > [ 344.068644] pcpu_alloc_noprof+0x650/0xd70
> > [ 344.073265] iommu_dma_init_fq+0x183/0x730
> > [ 344.077893] iommu_dma_init_domain+0x566/0x990
> > [ 344.082866] iommu_setup_dma_ops+0xca/0x230
> > [ 344.087571] bus_iommu_probe+0x1f8/0x4a0
> > [ 344.092020] iommu_device_register+0x153/0x240
> > [ 344.096993] iommu_init_pci+0x53c/0x1040
> > [ 344.101447] amd_iommu_init_pci+0xb6/0x5c0
> > [ 344.106066] state_next+0xaf7/0xff0
> > [ 344.110080] iommu_go_to_state+0x21/0x80
> > [ 344.114535] amd_iommu_init+0x15/0x70
> > [ 344.118728] pci_iommu_init+0x29/0x70
> > [ 344.122914] do_one_initcall+0x100/0x5a0
> > [ 344.127361] do_initcalls+0x138/0x1d0
> > [ 344.131556] kernel_init_freeable+0x8b7/0xbd0
> > [ 344.136442] kernel_init+0x1b/0x1f0
> > [ 344.140456] ret_from_fork+0x406/0x510
> > [ 344.144735] ret_from_fork_asm+0x1a/0x30
> > [ 344.149182]
> > -> #0 (pcpu_alloc_mutex){+.+.}-{4:4}:
> > [ 344.155379] check_prev_add+0xf1/0xce0
> > [ 344.159653] validate_chain+0x470/0x580
> > [ 344.164019] __lock_acquire+0x6f1/0xc00
> > [ 344.168378] lock_acquire.part.0+0xb6/0x240
> > [ 344.173085] __mutex_lock+0x17b/0x1690
> > [ 344.177365] pcpu_alloc_noprof+0x8eb/0xd70
> > [ 344.181984] kyber_queue_data_alloc+0x16d/0x660
> > [ 344.187047] kyber_init_sched+0x14/0x90
> > [ 344.191413] blk_mq_init_sched+0x264/0x4e0
> > [ 344.196033] elevator_switch+0x186/0x6a0
> > [ 344.200478] elevator_change+0x305/0x530
> > [ 344.204924] elv_iosched_store+0x205/0x2f0
> > [ 344.209545] queue_attr_store+0x23b/0x300
> > [ 344.214084] kernfs_fop_write_iter+0x357/0x530
> > [ 344.219051] vfs_write+0x9bc/0xf60
> > [ 344.222976] ksys_write+0xf3/0x1d0
> > [ 344.226902] do_syscall_64+0x8c/0x3d0
> > [ 344.231088] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 344.236660]
>
> Thanks for the report!
>
> I see that the above warning is different from the one addressed by the
> current patchset. In the warning you've reported, the kyber elevator
> allocates per-CPU data after acquiring ->elevator_lock, which introduces
> a per-CPU lock dependency on the ->elevator_lock.
>
> In contrast, the current patchset addresses a separate issue [1] that arises
> from elevator tag allocation, which occurs while both ->freeze_lock and
> ->elevator_lock are held. Internally, elevator tag allocation sets up the
> per-CPU sbitmap->alloc_hint, which introduces a similar per-CPU lock
> dependency on ->elevator_lock.
>
> That said, I'll plan to address the issue you've just reported in a separate
> patch, once the current patchset is merged.
OK, the issue [1] was easy to reproduce with blktests block/005 in my
environment, which is why I re-ran the test on this patchset, and that is
when this new WARNING triggered. Anyway, thanks for the info; I will test
your new patch later, thanks.
>
> Thanks,
> --Nilay
>
> [1]https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
>
>
>
--
Best Regards,
Yi Zhang
Thread overview: 9+ messages
2025-07-01 8:18 [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
2025-07-01 8:18 ` [PATCHv7 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
2025-07-01 8:18 ` [PATCHv7 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
2025-07-01 10:52 ` Hannes Reinecke
2025-07-01 8:19 ` [PATCHv7 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
2025-07-01 11:00 ` Hannes Reinecke
2025-07-02 13:53 ` [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context Yi Zhang
2025-07-02 14:17 ` Nilay Shroff
2025-07-02 14:41 ` Yi Zhang