* [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context
@ 2025-07-01 8:18 Nilay Shroff
2025-07-01 8:18 ` [PATCHv7 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
` (3 more replies)
0 siblings, 4 replies; 9+ messages in thread
From: Nilay Shroff @ 2025-07-01 8:18 UTC (permalink / raw)
To: linux-block; +Cc: hch, ming.lei, hare, axboe, sth, gjoyce
Hi,
There have been a few reports[1] indicating potential lockdep warnings due
to a lock dependency from the percpu allocator to the elevator lock. This
patch series aims to eliminate that dependency.
The series consists of three patches:
The first patch is a preparatory patch that moves the elevator queue
allocation logic from ->init_sched into blk_mq_init_sched.
The second patch in the series restructures sched_tags allocation and
deallocation during elevator update/switch operations to ensure these
actions are performed entirely outside the ->freeze_lock and
->elevator_lock. This eliminates the percpu allocator’s lock dependency
on the elevator and freeze locks during scheduler transitions.
The third patch introduces batch allocation and deallocation helpers for
sched_tags. These helpers are used during __blk_mq_update_nr_hw_queues()
to decouple sched_tags memory management from both the elevator and freeze
locks, addressing the lockdep concerns in the nr_hw_queues update path.
[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Changes since v6:
- Add a warning when loading elevator tags from the xarray yields nothing
(Hannes Reinecke)
- Use elevator tags instead of xarray table as a function argument to
elv_update_nr_hw_queues (Ming Lei)
Link to v6: https://lore.kernel.org/all/20250630054756.54532-1-nilay@linux.ibm.com/
Changes since v5:
- Fixed smatch warning reported by kernel test robot here:
https://lore.kernel.org/all/202506300509.2S1tygch-lkp@intel.com/
Link to v5: https://lore.kernel.org/all/20250627175544.1063910-1-nilay@linux.ibm.com/
Changes since v4:
- Define a local Xarray variable in __blk_mq_update_nr_hw_queues to store
sched_tags, instead of storing it in an Xarray defined in 'struct elevator_tags'
(Ming Lei)
Link to v4: https://lore.kernel.org/all/20250624131716.630465-1-nilay@linux.ibm.com/
Changes since v3:
- Further split the patchset into a three-patch series so that we can
have a separate patch for sched_tags batch allocation/deallocation
(Ming Lei)
- Use Xarray to store and load the sched_tags (Ming Lei)
- Unexport elevator_alloc() as we no longer need to use it outside
of the block layer core (hch)
- Unwind the sched_tags allocation and free tags when it fails in
the middle of allocation (hch)
- Move the struct elevator_queue definition from the common header to
elevator.c as there's no user of it outside elevator.c (Ming Lei, hch)
Link to v3: https://lore.kernel.org/all/20250616173233.3803824-1-nilay@linux.ibm.com/
Changes since v2:
- Split the patch into a two-patch series. The first patch updates
the ->init_sched elevator API and the second patch handles the
sched_tags allocation/de-allocation logic (Ming Lei)
- Address the sched_tags allocation/deallocation logic while running in
the context of an nr_hw_queues update so that we can handle all possible
scenarios in a single patchset (Ming Lei)
Link to v2: https://lore.kernel.org/all/20250528123638.1029700-1-nilay@linux.ibm.com/
Changes since v1:
- As the lifetime of elevator queue and sched tags are same, allocate
and move sched tags under struct elevator_queue (Ming Lei)
Link to v1: https://lore.kernel.org/all/20250520103425.1259712-1-nilay@linux.ibm.com/
Nilay Shroff (3):
block: move elevator queue allocation logic into blk_mq_init_sched
block: fix lockdep warning caused by lock dependency in
elv_iosched_store
block: fix potential deadlock while running nr_hw_queue update
block/bfq-iosched.c | 13 +--
block/blk-mq-sched.c | 223 ++++++++++++++++++++++++++++--------------
block/blk-mq-sched.h | 12 ++-
block/blk-mq.c | 16 ++-
block/blk.h | 2 +-
block/elevator.c | 50 ++++++++--
block/elevator.h | 16 ++-
block/kyber-iosched.c | 11 +--
block/mq-deadline.c | 14 +--
9 files changed, 241 insertions(+), 116 deletions(-)
--
2.50.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCHv7 1/3] block: move elevator queue allocation logic into blk_mq_init_sched
2025-07-01 8:18 [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
@ 2025-07-01 8:18 ` Nilay Shroff
2025-07-01 8:18 ` [PATCHv7 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
` (2 subsequent siblings)
3 siblings, 0 replies; 9+ messages in thread
From: Nilay Shroff @ 2025-07-01 8:18 UTC (permalink / raw)
To: linux-block; +Cc: hch, ming.lei, hare, axboe, sth, gjoyce
In preparation for allocating sched_tags before freezing the request
queue and acquiring ->elevator_lock, move the elevator queue allocation
logic from the elevator ops ->init_sched callback into blk_mq_init_sched.
As elevator_alloc is now only invoked from the block layer core, it no
longer needs to be exported, so unexport the elevator_alloc function.
This refactoring provides a centralized location for elevator queue
initialization, which makes it easier to store pre-allocated sched_tags
in the struct elevator_queue during later changes.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/bfq-iosched.c | 13 +++----------
block/blk-mq-sched.c | 11 ++++++++---
block/elevator.c | 1 -
block/elevator.h | 2 +-
block/kyber-iosched.c | 11 ++---------
block/mq-deadline.c | 14 ++------------
6 files changed, 16 insertions(+), 36 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0cb1e9873aab..fd26dc1901b0 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7232,22 +7232,16 @@ static void bfq_init_root_group(struct bfq_group *root_group,
root_group->sched_data.bfq_class_idle_last_service = jiffies;
}
-static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
+static int bfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
{
struct bfq_data *bfqd;
- struct elevator_queue *eq;
unsigned int i;
struct blk_independent_access_ranges *ia_ranges = q->disk->ia_ranges;
- eq = elevator_alloc(q, e);
- if (!eq)
- return -ENOMEM;
-
bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
- if (!bfqd) {
- kobject_put(&eq->kobj);
+ if (!bfqd)
return -ENOMEM;
- }
+
eq->elevator_data = bfqd;
spin_lock_irq(&q->queue_lock);
@@ -7405,7 +7399,6 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
out_free:
kfree(bfqd);
- kobject_put(&eq->kobj);
return -ENOMEM;
}
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 55a0fd105147..359e0704e09b 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -475,10 +475,14 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
BLKDEV_DEFAULT_RQ);
+ eq = elevator_alloc(q, e);
+ if (!eq)
+ return -ENOMEM;
+
if (blk_mq_is_shared_tags(flags)) {
ret = blk_mq_init_sched_shared_tags(q);
if (ret)
- return ret;
+ goto err_put_elevator;
}
queue_for_each_hw_ctx(q, hctx, i) {
@@ -487,7 +491,7 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
goto err_free_map_and_rqs;
}
- ret = e->ops.init_sched(q, e);
+ ret = e->ops.init_sched(q, eq);
if (ret)
goto err_free_map_and_rqs;
@@ -508,7 +512,8 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
err_free_map_and_rqs:
blk_mq_sched_free_rqs(q);
blk_mq_sched_tags_teardown(q, flags);
-
+err_put_elevator:
+ kobject_put(&eq->kobj);
q->elevator = NULL;
return ret;
}
diff --git a/block/elevator.c b/block/elevator.c
index ab22542e6cf0..770874040f79 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -148,7 +148,6 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
return eq;
}
-EXPORT_SYMBOL(elevator_alloc);
static void elevator_release(struct kobject *kobj)
{
diff --git a/block/elevator.h b/block/elevator.h
index a07ce773a38f..a4de5f9ad790 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -24,7 +24,7 @@ struct blk_mq_alloc_data;
struct blk_mq_hw_ctx;
struct elevator_mq_ops {
- int (*init_sched)(struct request_queue *, struct elevator_type *);
+ int (*init_sched)(struct request_queue *, struct elevator_queue *);
void (*exit_sched)(struct elevator_queue *);
int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 4dba8405bd01..7b6832cb3a8d 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -402,20 +402,13 @@ static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
return ERR_PTR(ret);
}
-static int kyber_init_sched(struct request_queue *q, struct elevator_type *e)
+static int kyber_init_sched(struct request_queue *q, struct elevator_queue *eq)
{
struct kyber_queue_data *kqd;
- struct elevator_queue *eq;
-
- eq = elevator_alloc(q, e);
- if (!eq)
- return -ENOMEM;
kqd = kyber_queue_data_alloc(q);
- if (IS_ERR(kqd)) {
- kobject_put(&eq->kobj);
+ if (IS_ERR(kqd))
return PTR_ERR(kqd);
- }
blk_stat_enable_accounting(q);
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 2edf1cac06d5..7b6caf30e00a 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -568,20 +568,14 @@ static void dd_exit_sched(struct elevator_queue *e)
/*
* initialize elevator private data (deadline_data).
*/
-static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
+static int dd_init_sched(struct request_queue *q, struct elevator_queue *eq)
{
struct deadline_data *dd;
- struct elevator_queue *eq;
enum dd_prio prio;
- int ret = -ENOMEM;
-
- eq = elevator_alloc(q, e);
- if (!eq)
- return ret;
dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
if (!dd)
- goto put_eq;
+ return -ENOMEM;
eq->elevator_data = dd;
@@ -608,10 +602,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
q->elevator = eq;
return 0;
-
-put_eq:
- kobject_put(&eq->kobj);
- return ret;
}
/*
--
2.50.0
* [PATCHv7 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
2025-07-01 8:18 [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
2025-07-01 8:18 ` [PATCHv7 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
@ 2025-07-01 8:18 ` Nilay Shroff
2025-07-01 10:52 ` Hannes Reinecke
2025-07-01 8:19 ` [PATCHv7 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
2025-07-02 13:53 ` [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context Yi Zhang
3 siblings, 1 reply; 9+ messages in thread
From: Nilay Shroff @ 2025-07-01 8:18 UTC (permalink / raw)
To: linux-block; +Cc: hch, ming.lei, hare, axboe, sth, gjoyce
Recent lockdep reports [1] have revealed a potential deadlock caused by a
lock dependency between the percpu allocator lock and the elevator lock.
This issue can be avoided by ensuring that the allocation and release of
scheduler tags (sched_tags) are performed outside the elevator lock.
Furthermore, the queue does not need to remain frozen during these
operations.
To address this, move all sched_tags allocations and deallocations outside
of both the ->elevator_lock and the ->freeze_lock. Since the lifetimes of
the elevator queue and its associated sched_tags are closely tied, the
allocated sched_tags are now stored in the elevator queue structure. Then,
during the actual elevator switch (which runs under ->freeze_lock and
->elevator_lock), the pre-allocated sched_tags are assigned to the
appropriate q->hctx. Once the elevator switch is complete and the locks
are released, the old elevator queue and its associated sched_tags are
freed.
This commit specifically addresses the allocation/deallocation of
sched_tags during elevator switching. Note that sched_tags may also be
allocated
in other contexts, such as during nr_hw_queues updates. Supporting that
use case will require batch allocation/deallocation, which will be handled
in a follow-up patch.
This restructuring ensures that sched_tags memory management occurs
entirely outside of the ->elevator_lock and ->freeze_lock context,
eliminating the lock dependency problem seen during scheduler updates.
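The ordering above can be illustrated with a small userspace sketch (not
kernel code; the names switch_tags, active_tags, and elevator_lock are
illustrative only, standing in for the real elevator switch path): memory
is allocated before the lock is taken, only the pointer hand-off happens
under the lock, and the old memory is freed after the lock is dropped, so
the allocator's internal locks can never nest inside the critical section:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for sched_tags and ->elevator_lock. */
struct tags { int depth; };

static pthread_mutex_t elevator_lock = PTHREAD_MUTEX_INITIALIZER;
static struct tags *active_tags;

static int switch_tags(int new_depth)
{
	struct tags *new_tags, *old_tags;

	/* Phase 1: allocate with no locks held, so the allocator's own
	 * locking never becomes a dependency of elevator_lock. */
	new_tags = malloc(sizeof(*new_tags));
	if (!new_tags)
		return -1;
	new_tags->depth = new_depth;

	/* Phase 2: only the pointer swap runs under the lock. */
	pthread_mutex_lock(&elevator_lock);
	old_tags = active_tags;
	active_tags = new_tags;
	pthread_mutex_unlock(&elevator_lock);

	/* Phase 3: free the old tags after dropping the lock. */
	free(old_tags);
	return 0;
}
```

The same shape appears in the patch: blk_mq_alloc_sched_tags() before
blk_mq_freeze_queue(), and blk_mq_free_sched_tags() after the unfreeze.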
[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Reported-by: Stefan Haberland <sth@linux.ibm.com>
Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-mq-sched.c | 155 +++++++++++++++++++++++--------------------
block/blk-mq-sched.h | 8 ++-
block/elevator.c | 47 +++++++++++--
block/elevator.h | 14 +++-
4 files changed, 144 insertions(+), 80 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 359e0704e09b..2d6d1ebdd8fb 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -374,64 +374,17 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq,
}
EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge);
-static int blk_mq_sched_alloc_map_and_rqs(struct request_queue *q,
- struct blk_mq_hw_ctx *hctx,
- unsigned int hctx_idx)
-{
- if (blk_mq_is_shared_tags(q->tag_set->flags)) {
- hctx->sched_tags = q->sched_shared_tags;
- return 0;
- }
-
- hctx->sched_tags = blk_mq_alloc_map_and_rqs(q->tag_set, hctx_idx,
- q->nr_requests);
-
- if (!hctx->sched_tags)
- return -ENOMEM;
- return 0;
-}
-
-static void blk_mq_exit_sched_shared_tags(struct request_queue *queue)
-{
- blk_mq_free_rq_map(queue->sched_shared_tags);
- queue->sched_shared_tags = NULL;
-}
-
/* called in queue's release handler, tagset has gone away */
static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int flags)
{
struct blk_mq_hw_ctx *hctx;
unsigned long i;
- queue_for_each_hw_ctx(q, hctx, i) {
- if (hctx->sched_tags) {
- if (!blk_mq_is_shared_tags(flags))
- blk_mq_free_rq_map(hctx->sched_tags);
- hctx->sched_tags = NULL;
- }
- }
+ queue_for_each_hw_ctx(q, hctx, i)
+ hctx->sched_tags = NULL;
if (blk_mq_is_shared_tags(flags))
- blk_mq_exit_sched_shared_tags(q);
-}
-
-static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
-{
- struct blk_mq_tag_set *set = queue->tag_set;
-
- /*
- * Set initial depth at max so that we don't need to reallocate for
- * updating nr_requests.
- */
- queue->sched_shared_tags = blk_mq_alloc_map_and_rqs(set,
- BLK_MQ_NO_HCTX_IDX,
- MAX_SCHED_RQ);
- if (!queue->sched_shared_tags)
- return -ENOMEM;
-
- blk_mq_tag_update_sched_shared_tags(queue);
-
- return 0;
+ q->sched_shared_tags = NULL;
}
void blk_mq_sched_reg_debugfs(struct request_queue *q)
@@ -458,8 +411,75 @@ void blk_mq_sched_unreg_debugfs(struct request_queue *q)
mutex_unlock(&q->debugfs_mutex);
}
+void blk_mq_free_sched_tags(struct elevator_tags *et,
+ struct blk_mq_tag_set *set)
+{
+ unsigned long i;
+
+ /* Shared tags are stored at index 0 in @tags. */
+ if (blk_mq_is_shared_tags(set->flags))
+ blk_mq_free_map_and_rqs(set, et->tags[0], BLK_MQ_NO_HCTX_IDX);
+ else {
+ for (i = 0; i < et->nr_hw_queues; i++)
+ blk_mq_free_map_and_rqs(set, et->tags[i], i);
+ }
+
+ kfree(et);
+}
+
+struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
+ unsigned int nr_hw_queues)
+{
+ unsigned int nr_tags;
+ int i;
+ struct elevator_tags *et;
+ gfp_t gfp = GFP_NOIO | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
+
+ if (blk_mq_is_shared_tags(set->flags))
+ nr_tags = 1;
+ else
+ nr_tags = nr_hw_queues;
+
+ et = kmalloc(sizeof(struct elevator_tags) +
+ nr_tags * sizeof(struct blk_mq_tags *), gfp);
+ if (!et)
+ return NULL;
+ /*
+ * Default to double of smaller one between hw queue_depth and
+ * 128, since we don't split into sync/async like the old code
+ * did. Additionally, this is a per-hw queue depth.
+ */
+ et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
+ BLKDEV_DEFAULT_RQ);
+ et->nr_hw_queues = nr_hw_queues;
+
+ if (blk_mq_is_shared_tags(set->flags)) {
+ /* Shared tags are stored at index 0 in @tags. */
+ et->tags[0] = blk_mq_alloc_map_and_rqs(set, BLK_MQ_NO_HCTX_IDX,
+ MAX_SCHED_RQ);
+ if (!et->tags[0])
+ goto out;
+ } else {
+ for (i = 0; i < et->nr_hw_queues; i++) {
+ et->tags[i] = blk_mq_alloc_map_and_rqs(set, i,
+ et->nr_requests);
+ if (!et->tags[i])
+ goto out_unwind;
+ }
+ }
+
+ return et;
+out_unwind:
+ while (--i >= 0)
+ blk_mq_free_map_and_rqs(set, et->tags[i], i);
+out:
+ kfree(et);
+ return NULL;
+}
+
/* caller must have a reference to @e, will grab another one if successful */
-int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
+ struct elevator_tags *et)
{
unsigned int flags = q->tag_set->flags;
struct blk_mq_hw_ctx *hctx;
@@ -467,40 +487,33 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
unsigned long i;
int ret;
- /*
- * Default to double of smaller one between hw queue_depth and 128,
- * since we don't split into sync/async like the old code did.
- * Additionally, this is a per-hw queue depth.
- */
- q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
- BLKDEV_DEFAULT_RQ);
-
- eq = elevator_alloc(q, e);
+ eq = elevator_alloc(q, e, et);
if (!eq)
return -ENOMEM;
+ q->nr_requests = et->nr_requests;
+
if (blk_mq_is_shared_tags(flags)) {
- ret = blk_mq_init_sched_shared_tags(q);
- if (ret)
- goto err_put_elevator;
+ /* Shared tags are stored at index 0 in @et->tags. */
+ q->sched_shared_tags = et->tags[0];
+ blk_mq_tag_update_sched_shared_tags(q);
}
queue_for_each_hw_ctx(q, hctx, i) {
- ret = blk_mq_sched_alloc_map_and_rqs(q, hctx, i);
- if (ret)
- goto err_free_map_and_rqs;
+ if (blk_mq_is_shared_tags(flags))
+ hctx->sched_tags = q->sched_shared_tags;
+ else
+ hctx->sched_tags = et->tags[i];
}
ret = e->ops.init_sched(q, eq);
if (ret)
- goto err_free_map_and_rqs;
+ goto out;
queue_for_each_hw_ctx(q, hctx, i) {
if (e->ops.init_hctx) {
ret = e->ops.init_hctx(hctx, i);
if (ret) {
- eq = q->elevator;
- blk_mq_sched_free_rqs(q);
blk_mq_exit_sched(q, eq);
kobject_put(&eq->kobj);
return ret;
@@ -509,10 +522,8 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
}
return 0;
-err_free_map_and_rqs:
- blk_mq_sched_free_rqs(q);
+out:
blk_mq_sched_tags_teardown(q, flags);
-err_put_elevator:
kobject_put(&eq->kobj);
q->elevator = NULL;
return ret;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 1326526bb733..0cde00cd1c47 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -18,10 +18,16 @@ void __blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
-int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
+ struct elevator_tags *et);
void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
void blk_mq_sched_free_rqs(struct request_queue *q);
+struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
+ unsigned int nr_hw_queues);
+void blk_mq_free_sched_tags(struct elevator_tags *et,
+ struct blk_mq_tag_set *set);
+
static inline void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
{
if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
diff --git a/block/elevator.c b/block/elevator.c
index 770874040f79..50f4b78efe66 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -54,6 +54,8 @@ struct elv_change_ctx {
struct elevator_queue *old;
/* for registering new elevator */
struct elevator_queue *new;
+ /* holds sched tags data */
+ struct elevator_tags *et;
};
static DEFINE_SPINLOCK(elv_list_lock);
@@ -132,7 +134,7 @@ static struct elevator_type *elevator_find_get(const char *name)
static const struct kobj_type elv_ktype;
struct elevator_queue *elevator_alloc(struct request_queue *q,
- struct elevator_type *e)
+ struct elevator_type *e, struct elevator_tags *et)
{
struct elevator_queue *eq;
@@ -145,6 +147,7 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
kobject_init(&eq->kobj, &elv_ktype);
mutex_init(&eq->sysfs_lock);
hash_init(eq->hash);
+ eq->et = et;
return eq;
}
@@ -165,7 +168,6 @@ static void elevator_exit(struct request_queue *q)
lockdep_assert_held(&q->elevator_lock);
ioc_clear_queue(q);
- blk_mq_sched_free_rqs(q);
mutex_lock(&e->sysfs_lock);
blk_mq_exit_sched(q, e);
@@ -591,7 +593,7 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
}
if (new_e) {
- ret = blk_mq_init_sched(q, new_e);
+ ret = blk_mq_init_sched(q, new_e, ctx->et);
if (ret)
goto out_unfreeze;
ctx->new = q->elevator;
@@ -626,8 +628,10 @@ static void elv_exit_and_release(struct request_queue *q)
elevator_exit(q);
mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
- if (e)
+ if (e) {
+ blk_mq_free_sched_tags(e->et, q->tag_set);
kobject_put(&e->kobj);
+ }
}
static int elevator_change_done(struct request_queue *q,
@@ -640,6 +644,7 @@ static int elevator_change_done(struct request_queue *q,
&ctx->old->flags);
elv_unregister_queue(q, ctx->old);
+ blk_mq_free_sched_tags(ctx->old->et, q->tag_set);
kobject_put(&ctx->old->kobj);
if (enable_wbt)
wbt_enable_default(q->disk);
@@ -658,9 +663,16 @@ static int elevator_change_done(struct request_queue *q,
static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
{
unsigned int memflags;
+ struct blk_mq_tag_set *set = q->tag_set;
int ret = 0;
- lockdep_assert_held(&q->tag_set->update_nr_hwq_lock);
+ lockdep_assert_held(&set->update_nr_hwq_lock);
+
+ if (strncmp(ctx->name, "none", 4)) {
+ ctx->et = blk_mq_alloc_sched_tags(set, set->nr_hw_queues);
+ if (!ctx->et)
+ return -ENOMEM;
+ }
memflags = blk_mq_freeze_queue(q);
/*
@@ -680,6 +692,11 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
blk_mq_unfreeze_queue(q, memflags);
if (!ret)
ret = elevator_change_done(q, ctx);
+ /*
+ * Free sched tags if it's allocated but we couldn't switch elevator.
+ */
+ if (ctx->et && !ctx->new)
+ blk_mq_free_sched_tags(ctx->et, set);
return ret;
}
@@ -690,11 +707,26 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
*/
void elv_update_nr_hw_queues(struct request_queue *q)
{
+ struct blk_mq_tag_set *set = q->tag_set;
struct elv_change_ctx ctx = {};
int ret = -ENODEV;
WARN_ON_ONCE(q->mq_freeze_depth == 0);
+ /*
+ * Accessing q->elevator without holding q->elevator_lock is safe here
+ * because nr_hw_queue update is protected by set->update_nr_hwq_lock
+ * in the writer context. So, scheduler update/switch code (which
+ * acquires same lock in the reader context) can't run concurrently.
+ */
+ if (q->elevator) {
+ ctx.et = blk_mq_alloc_sched_tags(set, set->nr_hw_queues);
+ if (!ctx.et) {
+ WARN_ON_ONCE(1);
+ return;
+ }
+ }
+
mutex_lock(&q->elevator_lock);
if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
ctx.name = q->elevator->type->elevator_name;
@@ -706,6 +738,11 @@ void elv_update_nr_hw_queues(struct request_queue *q)
blk_mq_unfreeze_queue_nomemrestore(q);
if (!ret)
WARN_ON_ONCE(elevator_change_done(q, &ctx));
+ /*
+ * Free sched tags if it's allocated but we couldn't switch elevator.
+ */
+ if (ctx.et && !ctx.new)
+ blk_mq_free_sched_tags(ctx.et, set);
}
/*
diff --git a/block/elevator.h b/block/elevator.h
index a4de5f9ad790..adc5c157e17e 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -23,6 +23,15 @@ enum elv_merge {
struct blk_mq_alloc_data;
struct blk_mq_hw_ctx;
+struct elevator_tags {
+ /* num. of hardware queues for which tags are allocated */
+ unsigned int nr_hw_queues;
+ /* depth used while allocating tags */
+ unsigned int nr_requests;
+ /* shared tag is stored at index 0 */
+ struct blk_mq_tags *tags[];
+};
+
struct elevator_mq_ops {
int (*init_sched)(struct request_queue *, struct elevator_queue *);
void (*exit_sched)(struct elevator_queue *);
@@ -113,6 +122,7 @@ struct request *elv_rqhash_find(struct request_queue *q, sector_t offset);
struct elevator_queue
{
struct elevator_type *type;
+ struct elevator_tags *et;
void *elevator_data;
struct kobject kobj;
struct mutex sysfs_lock;
@@ -152,8 +162,8 @@ ssize_t elv_iosched_show(struct gendisk *disk, char *page);
ssize_t elv_iosched_store(struct gendisk *disk, const char *page, size_t count);
extern bool elv_bio_merge_ok(struct request *, struct bio *);
-extern struct elevator_queue *elevator_alloc(struct request_queue *,
- struct elevator_type *);
+struct elevator_queue *elevator_alloc(struct request_queue *,
+ struct elevator_type *, struct elevator_tags *);
/*
* Helper functions.
--
2.50.0
* [PATCHv7 3/3] block: fix potential deadlock while running nr_hw_queue update
2025-07-01 8:18 [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
2025-07-01 8:18 ` [PATCHv7 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
2025-07-01 8:18 ` [PATCHv7 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
@ 2025-07-01 8:19 ` Nilay Shroff
2025-07-01 11:00 ` Hannes Reinecke
2025-07-02 13:53 ` [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context Yi Zhang
3 siblings, 1 reply; 9+ messages in thread
From: Nilay Shroff @ 2025-07-01 8:19 UTC (permalink / raw)
To: linux-block; +Cc: hch, ming.lei, hare, axboe, sth, gjoyce
Move scheduler tags (sched_tags) allocation and deallocation outside
both the ->elevator_lock and ->freeze_lock when updating nr_hw_queues.
This change breaks the dependency chain from the percpu allocator lock
to the elevator lock, helping to prevent potential deadlocks, as
observed in the reported lockdep splat[1].
This commit introduces batch allocation and deallocation helpers for
sched_tags, which are now used from within __blk_mq_update_nr_hw_queues
routine while iterating through the tagset.
With this change, all sched_tags memory management is handled entirely
outside the ->elevator_lock and the ->freeze_lock context, thereby
eliminating the lock dependency that could otherwise manifest during
nr_hw_queues updates.
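The batch helper has to cope with an allocation failing partway through
the tagset. A minimal userspace sketch of that unwind-on-failure pattern
(alloc_batch and slots are illustrative names, mirroring the out_unwind
label in blk_mq_alloc_sched_tags): on failure, everything allocated so far
is released in reverse order so the caller never sees a half-built batch:

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate n slots; on any failure, free what was already allocated
 * and report an error, leaving slots[] all-NULL. */
static int alloc_batch(int **slots, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		slots[i] = malloc(sizeof(int));
		if (!slots[i])
			goto out_unwind;
		*slots[i] = i;
	}
	return 0;

out_unwind:
	/* Walk back over the successfully allocated entries. */
	while (--i >= 0) {
		free(slots[i]);
		slots[i] = NULL;
	}
	return -1;
}
```

In the patch itself the batch additionally records each allocation in an
xarray keyed by queue id, so the later switch and free paths can look the
tags up per-queue.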
[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Reported-by: Stefan Haberland <sth@linux.ibm.com>
Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-mq-sched.c | 65 ++++++++++++++++++++++++++++++++++++++++++++
block/blk-mq-sched.h | 4 +++
block/blk-mq.c | 16 +++++++++--
block/blk.h | 3 +-
block/elevator.c | 6 ++--
5 files changed, 88 insertions(+), 6 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 2d6d1ebdd8fb..e2ce4a28e6c9 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -427,6 +427,32 @@ void blk_mq_free_sched_tags(struct elevator_tags *et,
kfree(et);
}
+void blk_mq_free_sched_tags_batch(struct xarray *et_table,
+ struct blk_mq_tag_set *set)
+{
+ struct request_queue *q;
+ struct elevator_tags *et;
+
+ lockdep_assert_held_write(&set->update_nr_hwq_lock);
+
+ list_for_each_entry(q, &set->tag_list, tag_set_list) {
+ /*
+ * Accessing q->elevator without holding q->elevator_lock is
+ * safe because we're holding here set->update_nr_hwq_lock in
+ * the writer context. So, scheduler update/switch code (which
+ * acquires the same lock but in the reader context) can't run
+ * concurrently.
+ */
+ if (q->elevator) {
+ et = xa_load(et_table, q->id);
+ if (unlikely(!et))
+ WARN_ON_ONCE(1);
+ else
+ blk_mq_free_sched_tags(et, set);
+ }
+ }
+}
+
struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
unsigned int nr_hw_queues)
{
@@ -477,6 +503,45 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
return NULL;
}
+int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
+ struct blk_mq_tag_set *set, unsigned int nr_hw_queues)
+{
+ struct request_queue *q;
+ struct elevator_tags *et;
+ gfp_t gfp = GFP_NOIO | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
+
+ lockdep_assert_held_write(&set->update_nr_hwq_lock);
+
+ list_for_each_entry(q, &set->tag_list, tag_set_list) {
+ /*
+ * Accessing q->elevator without holding q->elevator_lock is
+ * safe because we're holding here set->update_nr_hwq_lock in
+ * the writer context. So, scheduler update/switch code (which
+ * acquires the same lock but in the reader context) can't run
+ * concurrently.
+ */
+ if (q->elevator) {
+ et = blk_mq_alloc_sched_tags(set, nr_hw_queues);
+ if (!et)
+ goto out_unwind;
+ if (xa_insert(et_table, q->id, et, gfp))
+ goto out_free_tags;
+ }
+ }
+ return 0;
+out_free_tags:
+ blk_mq_free_sched_tags(et, set);
+out_unwind:
+ list_for_each_entry_continue_reverse(q, &set->tag_list, tag_set_list) {
+ if (q->elevator) {
+ et = xa_load(et_table, q->id);
+ if (et)
+ blk_mq_free_sched_tags(et, set);
+ }
+ }
+ return -ENOMEM;
+}
+
/* caller must have a reference to @e, will grab another one if successful */
int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
struct elevator_tags *et)
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 0cde00cd1c47..b554e1d55950 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -25,8 +25,12 @@ void blk_mq_sched_free_rqs(struct request_queue *q);
struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
unsigned int nr_hw_queues);
+int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
+ struct blk_mq_tag_set *set, unsigned int nr_hw_queues);
void blk_mq_free_sched_tags(struct elevator_tags *et,
struct blk_mq_tag_set *set);
+void blk_mq_free_sched_tags_batch(struct xarray *et_table,
+ struct blk_mq_tag_set *set);
static inline void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
{
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4806b867e37d..af7aae699ede 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4972,6 +4972,8 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
struct request_queue *q;
int prev_nr_hw_queues = set->nr_hw_queues;
unsigned int memflags;
+ struct xarray et_table;
+ struct elevator_tags *et;
int i;
lockdep_assert_held(&set->tag_list_lock);
@@ -4984,6 +4986,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
return;
memflags = memalloc_noio_save();
+
+ xa_init(&et_table);
+ if (blk_mq_alloc_sched_tags_batch(&et_table, set, nr_hw_queues) < 0)
+ goto out_memalloc_restore;
+
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_debugfs_unregister_hctxs(q);
blk_mq_sysfs_unregister_hctxs(q);
@@ -4995,6 +5002,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) {
list_for_each_entry(q, &set->tag_list, tag_set_list)
blk_mq_unfreeze_queue_nomemrestore(q);
+ blk_mq_free_sched_tags_batch(&et_table, set);
goto reregister;
}
@@ -5018,8 +5026,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
}
/* elv_update_nr_hw_queues() unfreeze queue for us */
- list_for_each_entry(q, &set->tag_list, tag_set_list)
- elv_update_nr_hw_queues(q);
+ list_for_each_entry(q, &set->tag_list, tag_set_list) {
+ et = xa_load(&et_table, q->id);
+ elv_update_nr_hw_queues(q, et);
+ }
reregister:
list_for_each_entry(q, &set->tag_list, tag_set_list) {
@@ -5029,7 +5039,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
blk_mq_remove_hw_queues_cpuhp(q);
blk_mq_add_hw_queues_cpuhp(q);
}
+out_memalloc_restore:
memalloc_noio_restore(memflags);
+ xa_destroy(&et_table);
/* Free the excess tags when nr_hw_queues shrink. */
for (i = set->nr_hw_queues; i < prev_nr_hw_queues; i++)
diff --git a/block/blk.h b/block/blk.h
index 37ec459fe656..a312518fb8f3 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -12,6 +12,7 @@
#include "blk-crypto-internal.h"
struct elevator_type;
+struct elevator_tags;
#define BLK_DEV_MAX_SECTORS (LLONG_MAX >> 9)
#define BLK_MIN_SEGMENT_SIZE 4096
@@ -321,7 +322,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
bool blk_insert_flush(struct request *rq);
-void elv_update_nr_hw_queues(struct request_queue *q);
+void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_tags *et);
void elevator_set_default(struct request_queue *q);
void elevator_set_none(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index 50f4b78efe66..92a0a4851583 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -705,7 +705,7 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
* The I/O scheduler depends on the number of hardware queues, this forces a
* reattachment when nr_hw_queues changes.
*/
-void elv_update_nr_hw_queues(struct request_queue *q)
+void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_tags *et)
{
struct blk_mq_tag_set *set = q->tag_set;
struct elv_change_ctx ctx = {};
@@ -720,11 +720,11 @@ void elv_update_nr_hw_queues(struct request_queue *q)
* acquires same lock in the reader context) can't run concurrently.
*/
if (q->elevator) {
- ctx.et = blk_mq_alloc_sched_tags(set, set->nr_hw_queues);
- if (!ctx.et) {
+ if (!et) {
WARN_ON_ONCE(1);
return;
}
+ ctx.et = et;
}
mutex_lock(&q->elevator_lock);
--
2.50.0
* Re: [PATCHv7 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
2025-07-01 8:18 ` [PATCHv7 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
@ 2025-07-01 10:52 ` Hannes Reinecke
0 siblings, 0 replies; 9+ messages in thread
From: Hannes Reinecke @ 2025-07-01 10:52 UTC (permalink / raw)
To: Nilay Shroff, linux-block; +Cc: hch, ming.lei, axboe, sth, gjoyce
On 7/1/25 10:18, Nilay Shroff wrote:
> Recent lockdep reports [1] have revealed a potential deadlock caused by a
> lock dependency between the percpu allocator lock and the elevator lock.
> This issue can be avoided by ensuring that the allocation and release of
> scheduler tags (sched_tags) are performed outside the elevator lock.
> Furthermore, the queue does not need to remain frozen during these
> operations.
>
> To address this, move all sched_tags allocations and deallocations outside
> of both the ->elevator_lock and the ->freeze_lock. Since the lifetime of
> the elevator queue and its associated sched_tags is closely tied, the
> allocated sched_tags are now stored in the elevator queue structure. Then,
> during the actual elevator switch (which runs under ->freeze_lock and
> ->elevator_lock), the pre-allocated sched_tags are assigned to the
> appropriate q->hctx. Once the elevator switch is complete and the locks
> are released, the old elevator queue and its associated sched_tags are
> freed.
>
> This commit specifically addresses the allocation/deallocation of
> sched_tags during elevator switching. Note that sched_tags may also be allocated
> in other contexts, such as during nr_hw_queues updates. Supporting that
> use case will require batch allocation/deallocation, which will be handled
> in a follow-up patch.
>
> This restructuring ensures that sched_tags memory management occurs
> entirely outside of the ->elevator_lock and ->freeze_lock context,
> eliminating the lock dependency problem seen during scheduler updates.
>
> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
>
> Reported-by: Stefan Haberland <sth@linux.ibm.com>
> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> block/blk-mq-sched.c | 155 +++++++++++++++++++++++--------------------
> block/blk-mq-sched.h | 8 ++-
> block/elevator.c | 47 +++++++++++--
> block/elevator.h | 14 +++-
> 4 files changed, 144 insertions(+), 80 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCHv7 3/3] block: fix potential deadlock while running nr_hw_queue update
2025-07-01 8:19 ` [PATCHv7 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
@ 2025-07-01 11:00 ` Hannes Reinecke
0 siblings, 0 replies; 9+ messages in thread
From: Hannes Reinecke @ 2025-07-01 11:00 UTC (permalink / raw)
To: Nilay Shroff, linux-block; +Cc: hch, ming.lei, axboe, sth, gjoyce
On 7/1/25 10:19, Nilay Shroff wrote:
> Move scheduler tags (sched_tags) allocation and deallocation outside
> both the ->elevator_lock and ->freeze_lock when updating nr_hw_queues.
> This change breaks the dependency chain from the percpu allocator lock
> to the elevator lock, helping to prevent potential deadlocks, as
> observed in the reported lockdep splat[1].
>
> This commit introduces batch allocation and deallocation helpers for
> sched_tags, which are now used from within __blk_mq_update_nr_hw_queues
> routine while iterating through the tagset.
>
> With this change, all sched_tags memory management is handled entirely
> outside the ->elevator_lock and the ->freeze_lock context, thereby
> eliminating the lock dependency that could otherwise manifest during
> nr_hw_queues updates.
>
> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
>
> Reported-by: Stefan Haberland <sth@linux.ibm.com>
> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> block/blk-mq-sched.c | 65 ++++++++++++++++++++++++++++++++++++++++++++
> block/blk-mq-sched.h | 4 +++
> block/blk-mq.c | 16 +++++++++--
> block/blk.h | 3 +-
> block/elevator.c | 6 ++--
> 5 files changed, 88 insertions(+), 6 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context
2025-07-01 8:18 [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
` (2 preceding siblings ...)
2025-07-01 8:19 ` [PATCHv7 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
@ 2025-07-02 13:53 ` Yi Zhang
2025-07-02 14:17 ` Nilay Shroff
3 siblings, 1 reply; 9+ messages in thread
From: Yi Zhang @ 2025-07-02 13:53 UTC (permalink / raw)
To: Nilay Shroff
Cc: linux-block, hch, ming.lei, hare, axboe, sth, gjoyce,
Shinichiro Kawasaki
Hi Nilay
With the patch on the latest linux-block/for-next, I reproduced the
following WARNING with blktests block/005; here is the full log:
[ 342.845331] run blktests block/005 at 2025-07-02 09:48:55
[ 343.835605] ======================================================
[ 343.841783] WARNING: possible circular locking dependency detected
[ 343.847966] 6.16.0-rc4.fix+ #3 Not tainted
[ 343.852073] ------------------------------------------------------
[ 343.858250] check/1365 is trying to acquire lock:
[ 343.862957] ffffffff98141db0 (pcpu_alloc_mutex){+.+.}-{4:4}, at:
pcpu_alloc_noprof+0x8eb/0xd70
[ 343.871587]
but task is already holding lock:
[ 343.877421] ffff888300cfb040 (&q->elevator_lock){+.+.}-{4:4}, at:
elevator_change+0x152/0x530
[ 343.885958]
which lock already depends on the new lock.
[ 343.894131]
the existing dependency chain (in reverse order) is:
[ 343.901609]
-> #3 (&q->elevator_lock){+.+.}-{4:4}:
[ 343.907891] __lock_acquire+0x6f1/0xc00
[ 343.912259] lock_acquire.part.0+0xb6/0x240
[ 343.916966] __mutex_lock+0x17b/0x1690
[ 343.921247] elevator_change+0x152/0x530
[ 343.925692] elv_iosched_store+0x205/0x2f0
[ 343.930312] queue_attr_store+0x23b/0x300
[ 343.934853] kernfs_fop_write_iter+0x357/0x530
[ 343.939829] vfs_write+0x9bc/0xf60
[ 343.943763] ksys_write+0xf3/0x1d0
[ 343.947695] do_syscall_64+0x8c/0x3d0
[ 343.951883] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 343.957462]
-> #2 (&q->q_usage_counter(io)#4){++++}-{0:0}:
[ 343.964440] __lock_acquire+0x6f1/0xc00
[ 343.968799] lock_acquire.part.0+0xb6/0x240
[ 343.973507] blk_alloc_queue+0x5c5/0x710
[ 343.977959] blk_mq_alloc_queue+0x14e/0x240
[ 343.982666] __blk_mq_alloc_disk+0x15/0xd0
[ 343.987294] nvme_alloc_ns+0x208/0x1690 [nvme_core]
[ 343.992727] nvme_scan_ns+0x362/0x4c0 [nvme_core]
[ 343.997978] async_run_entry_fn+0x96/0x4f0
[ 344.002599] process_one_work+0x8cd/0x1950
[ 344.007226] worker_thread+0x58d/0xcf0
[ 344.011499] kthread+0x3d8/0x7a0
[ 344.015259] ret_from_fork+0x406/0x510
[ 344.019532] ret_from_fork_asm+0x1a/0x30
[ 344.023980]
-> #1 (fs_reclaim){+.+.}-{0:0}:
[ 344.029654] __lock_acquire+0x6f1/0xc00
[ 344.034015] lock_acquire.part.0+0xb6/0x240
[ 344.038727] fs_reclaim_acquire+0x103/0x150
[ 344.043433] prepare_alloc_pages+0x15f/0x600
[ 344.048230] __alloc_frozen_pages_noprof+0x14a/0x3a0
[ 344.053722] __alloc_pages_noprof+0xd/0x1d0
[ 344.058438] pcpu_alloc_pages.constprop.0+0x104/0x420
[ 344.064017] pcpu_populate_chunk+0x38/0x80
[ 344.068644] pcpu_alloc_noprof+0x650/0xd70
[ 344.073265] iommu_dma_init_fq+0x183/0x730
[ 344.077893] iommu_dma_init_domain+0x566/0x990
[ 344.082866] iommu_setup_dma_ops+0xca/0x230
[ 344.087571] bus_iommu_probe+0x1f8/0x4a0
[ 344.092020] iommu_device_register+0x153/0x240
[ 344.096993] iommu_init_pci+0x53c/0x1040
[ 344.101447] amd_iommu_init_pci+0xb6/0x5c0
[ 344.106066] state_next+0xaf7/0xff0
[ 344.110080] iommu_go_to_state+0x21/0x80
[ 344.114535] amd_iommu_init+0x15/0x70
[ 344.118728] pci_iommu_init+0x29/0x70
[ 344.122914] do_one_initcall+0x100/0x5a0
[ 344.127361] do_initcalls+0x138/0x1d0
[ 344.131556] kernel_init_freeable+0x8b7/0xbd0
[ 344.136442] kernel_init+0x1b/0x1f0
[ 344.140456] ret_from_fork+0x406/0x510
[ 344.144735] ret_from_fork_asm+0x1a/0x30
[ 344.149182]
-> #0 (pcpu_alloc_mutex){+.+.}-{4:4}:
[ 344.155379] check_prev_add+0xf1/0xce0
[ 344.159653] validate_chain+0x470/0x580
[ 344.164019] __lock_acquire+0x6f1/0xc00
[ 344.168378] lock_acquire.part.0+0xb6/0x240
[ 344.173085] __mutex_lock+0x17b/0x1690
[ 344.177365] pcpu_alloc_noprof+0x8eb/0xd70
[ 344.181984] kyber_queue_data_alloc+0x16d/0x660
[ 344.187047] kyber_init_sched+0x14/0x90
[ 344.191413] blk_mq_init_sched+0x264/0x4e0
[ 344.196033] elevator_switch+0x186/0x6a0
[ 344.200478] elevator_change+0x305/0x530
[ 344.204924] elv_iosched_store+0x205/0x2f0
[ 344.209545] queue_attr_store+0x23b/0x300
[ 344.214084] kernfs_fop_write_iter+0x357/0x530
[ 344.219051] vfs_write+0x9bc/0xf60
[ 344.222976] ksys_write+0xf3/0x1d0
[ 344.226902] do_syscall_64+0x8c/0x3d0
[ 344.231088] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 344.236660]
other info that might help us debug this:
[ 344.244662] Chain exists of:
pcpu_alloc_mutex --> &q->q_usage_counter(io)#4 -->
&q->elevator_lock
[ 344.256587] Possible unsafe locking scenario:
[ 344.262504] CPU0 CPU1
[ 344.267037] ---- ----
[ 344.271570] lock(&q->elevator_lock);
[ 344.275330] lock(&q->q_usage_counter(io)#4);
[ 344.282298] lock(&q->elevator_lock);
[ 344.288573] lock(pcpu_alloc_mutex);
[ 344.292249]
*** DEADLOCK ***
[ 344.298168] 7 locks held by check/1365:
[ 344.302015] #0: ffff888124330448 (sb_writers#4){.+.+}-{0:0}, at:
ksys_write+0xf3/0x1d0
[ 344.310039] #1: ffff8883285fc490 (&of->mutex#2){+.+.}-{4:4}, at:
kernfs_fop_write_iter+0x216/0x530
[ 344.319104] #2: ffff8882004b5f08 (kn->active#86){.+.+}-{0:0}, at:
kernfs_fop_write_iter+0x239/0x530
[ 344.328258] #3: ffff8883032d41a8
(&set->update_nr_hwq_lock){.+.+}-{4:4}, at:
elv_iosched_store+0x1c0/0x2f0
[ 344.338014] #4: ffff888300cfaab0
(&q->q_usage_counter(io)#4){++++}-{0:0}, at:
blk_mq_freeze_queue_nomemsave+0xe/0x20
[ 344.348640] #5: ffff888300cfaaf0
(&q->q_usage_counter(queue)#4){+.+.}-{0:0}, at:
blk_mq_freeze_queue_nomemsave+0xe/0x20
[ 344.359525] #6: ffff888300cfb040 (&q->elevator_lock){+.+.}-{4:4},
at: elevator_change+0x152/0x530
[ 344.368502]
stack backtrace:
[ 344.372866] CPU: 8 UID: 0 PID: 1365 Comm: check Not tainted
6.16.0-rc4.fix+ #3 PREEMPT(voluntary)
[ 344.372872] Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS
2.17.0 12/04/2024
[ 344.372876] Call Trace:
[ 344.372879] <TASK>
[ 344.372883] dump_stack_lvl+0x7e/0xc0
[ 344.372891] print_circular_bug+0xdc/0xf0
[ 344.372901] check_noncircular+0x131/0x150
[ 344.372905] ? local_clock_noinstr+0x9/0xc0
[ 344.372912] ? srso_return_thunk+0x5/0x5f
[ 344.372918] ? srso_return_thunk+0x5/0x5f
[ 344.372929] check_prev_add+0xf1/0xce0
[ 344.372934] ? srso_return_thunk+0x5/0x5f
[ 344.372938] ? add_chain_cache+0x111/0x320
[ 344.372948] validate_chain+0x470/0x580
[ 344.372953] ? srso_return_thunk+0x5/0x5f
[ 344.372963] __lock_acquire+0x6f1/0xc00
[ 344.372976] lock_acquire.part.0+0xb6/0x240
[ 344.372981] ? pcpu_alloc_noprof+0x8eb/0xd70
[ 344.372991] ? srso_return_thunk+0x5/0x5f
[ 344.372995] ? rcu_is_watching+0x11/0xb0
[ 344.373001] ? srso_return_thunk+0x5/0x5f
[ 344.373005] ? lock_acquire+0x10e/0x160
[ 344.373015] __mutex_lock+0x17b/0x1690
[ 344.373020] ? pcpu_alloc_noprof+0x8eb/0xd70
[ 344.373025] ? __kasan_kmalloc+0x7b/0x90
[ 344.373030] ? kyber_queue_data_alloc+0x7e/0x660
[ 344.373035] ? kyber_init_sched+0x14/0x90
[ 344.373041] ? elevator_change+0x305/0x530
[ 344.373045] ? pcpu_alloc_noprof+0x8eb/0xd70
[ 344.373050] ? queue_attr_store+0x23b/0x300
[ 344.373055] ? kernfs_fop_write_iter+0x357/0x530
[ 344.373059] ? vfs_write+0x9bc/0xf60
[ 344.373064] ? ksys_write+0xf3/0x1d0
[ 344.373068] ? do_syscall_64+0x8c/0x3d0
[ 344.373075] ? __pfx___mutex_lock+0x10/0x10
[ 344.373079] ? srso_return_thunk+0x5/0x5f
[ 344.373084] ? lock_acquire.part.0+0xb6/0x240
[ 344.373088] ? srso_return_thunk+0x5/0x5f
[ 344.373093] ? srso_return_thunk+0x5/0x5f
[ 344.373097] ? find_held_lock+0x32/0x90
[ 344.373101] ? srso_return_thunk+0x5/0x5f
[ 344.373105] ? local_clock_noinstr+0x9/0xc0
[ 344.373111] ? srso_return_thunk+0x5/0x5f
[ 344.373116] ? __lock_release+0x1a2/0x2c0
[ 344.373123] ? srso_return_thunk+0x5/0x5f
[ 344.373127] ? mark_held_locks+0x50/0x80
[ 344.373134] ? _raw_spin_unlock_irqrestore+0x59/0x70
[ 344.373139] ? srso_return_thunk+0x5/0x5f
[ 344.373143] ? lockdep_hardirqs_on+0x78/0x100
[ 344.373149] ? srso_return_thunk+0x5/0x5f
[ 344.373159] ? pcpu_alloc_noprof+0x8eb/0xd70
[ 344.373164] pcpu_alloc_noprof+0x8eb/0xd70
[ 344.373169] ? __kmalloc_cache_node_noprof+0x3a7/0x4d0
[ 344.373179] ? xa_find_after+0x1ac/0x310
[ 344.373187] ? srso_return_thunk+0x5/0x5f
[ 344.373192] ? kasan_save_track+0x10/0x30
[ 344.373200] kyber_queue_data_alloc+0x16d/0x660
[ 344.373207] ? debug_mutex_init+0x33/0x70
[ 344.373216] kyber_init_sched+0x14/0x90
[ 344.373223] blk_mq_init_sched+0x264/0x4e0
[ 344.373235] ? __pfx_blk_mq_init_sched+0x10/0x10
[ 344.373239] ? srso_return_thunk+0x5/0x5f
[ 344.373253] elevator_switch+0x186/0x6a0
[ 344.373262] elevator_change+0x305/0x530
[ 344.373272] elv_iosched_store+0x205/0x2f0
[ 344.373279] ? __pfx_elv_iosched_store+0x10/0x10
[ 344.373285] ? srso_return_thunk+0x5/0x5f
[ 344.373294] ? srso_return_thunk+0x5/0x5f
[ 344.373298] ? __lock_acquired+0xda/0x250
[ 344.373306] ? srso_return_thunk+0x5/0x5f
[ 344.373310] ? rcu_is_watching+0x11/0xb0
[ 344.373315] ? srso_return_thunk+0x5/0x5f
[ 344.373324] queue_attr_store+0x23b/0x300
[ 344.373333] ? __pfx_queue_attr_store+0x10/0x10
[ 344.373344] ? srso_return_thunk+0x5/0x5f
[ 344.373349] ? srso_return_thunk+0x5/0x5f
[ 344.373354] ? find_held_lock+0x32/0x90
[ 344.373358] ? local_clock_noinstr+0x9/0xc0
[ 344.373365] ? srso_return_thunk+0x5/0x5f
[ 344.373369] ? __lock_release+0x1a2/0x2c0
[ 344.373377] ? sysfs_file_kobj+0xb5/0x1c0
[ 344.373383] ? srso_return_thunk+0x5/0x5f
[ 344.373390] ? srso_return_thunk+0x5/0x5f
[ 344.373394] ? sysfs_file_kobj+0xbf/0x1c0
[ 344.373399] ? srso_return_thunk+0x5/0x5f
[ 344.373408] ? __pfx_sysfs_kf_write+0x10/0x10
[ 344.373412] kernfs_fop_write_iter+0x357/0x530
[ 344.373422] vfs_write+0x9bc/0xf60
[ 344.373433] ? __pfx_vfs_write+0x10/0x10
[ 344.373449] ? find_held_lock+0x32/0x90
[ 344.373454] ? local_clock_noinstr+0x9/0xc0
[ 344.373460] ? srso_return_thunk+0x5/0x5f
[ 344.373470] ksys_write+0xf3/0x1d0
[ 344.373478] ? __pfx_ksys_write+0x10/0x10
[ 344.373484] ? srso_return_thunk+0x5/0x5f
[ 344.373491] ? srso_return_thunk+0x5/0x5f
[ 344.373503] do_syscall_64+0x8c/0x3d0
[ 344.373508] ? srso_return_thunk+0x5/0x5f
[ 344.373512] ? do_user_addr_fault+0x489/0xb10
[ 344.373519] ? srso_return_thunk+0x5/0x5f
[ 344.373523] ? rcu_is_watching+0x11/0xb0
[ 344.373528] ? srso_return_thunk+0x5/0x5f
[ 344.373534] ? srso_return_thunk+0x5/0x5f
[ 344.373542] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 344.373547] RIP: 0033:0x7ff3e4afe377
[ 344.373552] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7
0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00
00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89
74 24
[ 344.373557] RSP: 002b:00007ffc83840088 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[ 344.373562] RAX: ffffffffffffffda RBX: 00007ff3e4bfa780 RCX: 00007ff3e4afe377
[ 344.373565] RDX: 0000000000000006 RSI: 00005555d2ec8240 RDI: 0000000000000001
[ 344.373568] RBP: 0000000000000006 R08: 0000000000000000 R09: 00007ff3e4bb0d40
[ 344.373571] R10: 00007ff3e4bb0c40 R11: 0000000000000246 R12: 0000000000000006
[ 344.373574] R13: 00005555d2ec8240 R14: 0000000000000006 R15: 00007ff3e4bf59e0
[ 344.373592] </TASK>
On Tue, Jul 1, 2025 at 4:20 PM Nilay Shroff <nilay@linux.ibm.com> wrote:
>
> Hi,
>
> There have been a few reports[1] indicating potential lockdep warnings due
> to a lock dependency from the percpu allocator to the elevator lock. This
> patch series aims to eliminate that dependency.
>
> The series consists of three patches:
> The first patch is a preparatory patch that just moves the elevator queue
> allocation logic from ->init_sched into blk_mq_init_sched.
>
> The second patch in the series restructures sched_tags allocation and
> deallocation during elevator update/switch operations to ensure these
> actions are performed entirely outside the ->freeze_lock and
> ->elevator_lock. This eliminates the percpu allocator’s lock dependency on the
> elevator and freeze lock during scheduler transitions.
>
> The third patch introduces batch allocation and deallocation helpers for
> sched_tags. These helpers are used during __blk_mq_update_nr_hw_queues()
> to decouple sched_tags memory management from both the elevator and freeze
> locks, addressing the lockdep concerns in the nr_hw_queues update path.
>
> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
>
> Changes since v6:
> - Add warning when loading elevator tags from an xarray yields nothing
> (Hannes Reinecke)
> - Use elevator tags instead of xarray table as a function argument to
> elv_update_nr_hw_queues (Ming Lei)
> Link to v6: https://lore.kernel.org/all/20250630054756.54532-1-nilay@linux.ibm.com/
>
> Changes since v5:
> - Fixed smatch warning reported by kernel test robot here:
> https://lore.kernel.org/all/202506300509.2S1tygch-lkp@intel.com/
> Link to v5: https://lore.kernel.org/all/20250627175544.1063910-1-nilay@linux.ibm.com/
>
> Changes since v4:
> - Define a local Xarray variable in __blk_mq_update_nr_hw_queues to store
> sched_tags, instead of storing it in an Xarray defined in 'struct elevator_tags'
> (Ming Lei)
> Link to v4: https://lore.kernel.org/all/20250624131716.630465-1-nilay@linux.ibm.com/
>
> Changes since v3:
> - Further split the patchset into three patch series so that we can
> have a separate patch for sched_tags batch allocation/deallocation
> (Ming Lei)
> - Use Xarray to store and load the sched_tags (Ming Lei)
> - Unexport elevator_alloc() as we no longer need to use it outside
> of block layer core (hch)
> - unwind the sched_tags allocation and free tags when it fails in
> the middle of allocation (hch)
> - Move struct elevator_queue header from common header to elevator.c
> as there's no user of it outside elevator.c (Ming Lei, hch)
> Link to v3: https://lore.kernel.org/all/20250616173233.3803824-1-nilay@linux.ibm.com/
>
> Changes since v2:
> - Split the patch into a two-patch series. The first patch updates the
> ->init_sched elevator API and the second patch handles the sched
> tags allocation/de-allocation logic (Ming Lei)
> - Address sched tags allocation/deallocation logic while running in the
> context of nr_hw_queue update so that we can handle all possible
> scenarios in a single patchset (Ming Lei)
> Link to v2: https://lore.kernel.org/all/20250528123638.1029700-1-nilay@linux.ibm.com/
>
> Changes since v1:
> - As the lifetimes of the elevator queue and sched tags are the same,
> allocate and move sched tags under struct elevator_queue (Ming Lei)
> Link to v1: https://lore.kernel.org/all/20250520103425.1259712-1-nilay@linux.ibm.com/
>
> Nilay Shroff (3):
> block: move elevator queue allocation logic into blk_mq_init_sched
> block: fix lockdep warning caused by lock dependency in
> elv_iosched_store
> block: fix potential deadlock while running nr_hw_queue update
>
> block/bfq-iosched.c | 13 +--
> block/blk-mq-sched.c | 223 ++++++++++++++++++++++++++++--------------
> block/blk-mq-sched.h | 12 ++-
> block/blk-mq.c | 16 ++-
> block/blk.h | 2 +-
> block/elevator.c | 50 ++++++++--
> block/elevator.h | 16 ++-
> block/kyber-iosched.c | 11 +--
> block/mq-deadline.c | 14 +--
> 9 files changed, 241 insertions(+), 116 deletions(-)
>
> --
> 2.50.0
>
>
--
Best Regards,
Yi Zhang
* Re: [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context
2025-07-02 13:53 ` [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context Yi Zhang
@ 2025-07-02 14:17 ` Nilay Shroff
2025-07-02 14:41 ` Yi Zhang
0 siblings, 1 reply; 9+ messages in thread
From: Nilay Shroff @ 2025-07-02 14:17 UTC (permalink / raw)
To: Yi Zhang
Cc: linux-block, hch, ming.lei, hare, axboe, sth, gjoyce,
Shinichiro Kawasaki
On 7/2/25 7:23 PM, Yi Zhang wrote:
> Hi Nilay
>
> With the patch on the latest linux-block/for-next, I reproduced the
> following WARNING with blktests block/005, here is the full log:
>
> [ 342.845331] run blktests block/005 at 2025-07-02 09:48:55
>
> [ 343.835605] ======================================================
> [ 343.841783] WARNING: possible circular locking dependency detected
> [ 343.847966] 6.16.0-rc4.fix+ #3 Not tainted
> [ 343.852073] ------------------------------------------------------
> [ 343.858250] check/1365 is trying to acquire lock:
> [ 343.862957] ffffffff98141db0 (pcpu_alloc_mutex){+.+.}-{4:4}, at:
> pcpu_alloc_noprof+0x8eb/0xd70
> [ 343.871587]
> but task is already holding lock:
> [ 343.877421] ffff888300cfb040 (&q->elevator_lock){+.+.}-{4:4}, at:
> elevator_change+0x152/0x530
> [ 343.885958]
> which lock already depends on the new lock.
>
> [ 343.894131]
> the existing dependency chain (in reverse order) is:
> [ 343.901609]
> -> #3 (&q->elevator_lock){+.+.}-{4:4}:
> [ 343.907891] __lock_acquire+0x6f1/0xc00
> [ 343.912259] lock_acquire.part.0+0xb6/0x240
> [ 343.916966] __mutex_lock+0x17b/0x1690
> [ 343.921247] elevator_change+0x152/0x530
> [ 343.925692] elv_iosched_store+0x205/0x2f0
> [ 343.930312] queue_attr_store+0x23b/0x300
> [ 343.934853] kernfs_fop_write_iter+0x357/0x530
> [ 343.939829] vfs_write+0x9bc/0xf60
> [ 343.943763] ksys_write+0xf3/0x1d0
> [ 343.947695] do_syscall_64+0x8c/0x3d0
> [ 343.951883] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 343.957462]
> -> #2 (&q->q_usage_counter(io)#4){++++}-{0:0}:
> [ 343.964440] __lock_acquire+0x6f1/0xc00
> [ 343.968799] lock_acquire.part.0+0xb6/0x240
> [ 343.973507] blk_alloc_queue+0x5c5/0x710
> [ 343.977959] blk_mq_alloc_queue+0x14e/0x240
> [ 343.982666] __blk_mq_alloc_disk+0x15/0xd0
> [ 343.987294] nvme_alloc_ns+0x208/0x1690 [nvme_core]
> [ 343.992727] nvme_scan_ns+0x362/0x4c0 [nvme_core]
> [ 343.997978] async_run_entry_fn+0x96/0x4f0
> [ 344.002599] process_one_work+0x8cd/0x1950
> [ 344.007226] worker_thread+0x58d/0xcf0
> [ 344.011499] kthread+0x3d8/0x7a0
> [ 344.015259] ret_from_fork+0x406/0x510
> [ 344.019532] ret_from_fork_asm+0x1a/0x30
> [ 344.023980]
> -> #1 (fs_reclaim){+.+.}-{0:0}:
> [ 344.029654] __lock_acquire+0x6f1/0xc00
> [ 344.034015] lock_acquire.part.0+0xb6/0x240
> [ 344.038727] fs_reclaim_acquire+0x103/0x150
> [ 344.043433] prepare_alloc_pages+0x15f/0x600
> [ 344.048230] __alloc_frozen_pages_noprof+0x14a/0x3a0
> [ 344.053722] __alloc_pages_noprof+0xd/0x1d0
> [ 344.058438] pcpu_alloc_pages.constprop.0+0x104/0x420
> [ 344.064017] pcpu_populate_chunk+0x38/0x80
> [ 344.068644] pcpu_alloc_noprof+0x650/0xd70
> [ 344.073265] iommu_dma_init_fq+0x183/0x730
> [ 344.077893] iommu_dma_init_domain+0x566/0x990
> [ 344.082866] iommu_setup_dma_ops+0xca/0x230
> [ 344.087571] bus_iommu_probe+0x1f8/0x4a0
> [ 344.092020] iommu_device_register+0x153/0x240
> [ 344.096993] iommu_init_pci+0x53c/0x1040
> [ 344.101447] amd_iommu_init_pci+0xb6/0x5c0
> [ 344.106066] state_next+0xaf7/0xff0
> [ 344.110080] iommu_go_to_state+0x21/0x80
> [ 344.114535] amd_iommu_init+0x15/0x70
> [ 344.118728] pci_iommu_init+0x29/0x70
> [ 344.122914] do_one_initcall+0x100/0x5a0
> [ 344.127361] do_initcalls+0x138/0x1d0
> [ 344.131556] kernel_init_freeable+0x8b7/0xbd0
> [ 344.136442] kernel_init+0x1b/0x1f0
> [ 344.140456] ret_from_fork+0x406/0x510
> [ 344.144735] ret_from_fork_asm+0x1a/0x30
> [ 344.149182]
> -> #0 (pcpu_alloc_mutex){+.+.}-{4:4}:
> [ 344.155379] check_prev_add+0xf1/0xce0
> [ 344.159653] validate_chain+0x470/0x580
> [ 344.164019] __lock_acquire+0x6f1/0xc00
> [ 344.168378] lock_acquire.part.0+0xb6/0x240
> [ 344.173085] __mutex_lock+0x17b/0x1690
> [ 344.177365] pcpu_alloc_noprof+0x8eb/0xd70
> [ 344.181984] kyber_queue_data_alloc+0x16d/0x660
> [ 344.187047] kyber_init_sched+0x14/0x90
> [ 344.191413] blk_mq_init_sched+0x264/0x4e0
> [ 344.196033] elevator_switch+0x186/0x6a0
> [ 344.200478] elevator_change+0x305/0x530
> [ 344.204924] elv_iosched_store+0x205/0x2f0
> [ 344.209545] queue_attr_store+0x23b/0x300
> [ 344.214084] kernfs_fop_write_iter+0x357/0x530
> [ 344.219051] vfs_write+0x9bc/0xf60
> [ 344.222976] ksys_write+0xf3/0x1d0
> [ 344.226902] do_syscall_64+0x8c/0x3d0
> [ 344.231088] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 344.236660]
Thanks for the report!
I see that the above warning is different from the one addressed by the
current patchset. In the warning you've reported, the kyber elevator
allocates per-CPU data after acquiring ->elevator_lock, which introduces
a percpu allocator lock dependency on ->elevator_lock.
In contrast, the current patchset addresses a separate issue [1] that arises
due to elevator tags allocation. This allocation occurs while both ->freeze_lock
and ->elevator_lock are held. Internally, elevator tags allocation sets up
per-CPU sbitmap->alloc_hint, which also introduces a similar per-CPU lock
dependency on ->elevator_lock.
That said, I plan to address the issue you've just reported in a separate
patch once the current patchset is merged.
Thanks,
--Nilay
[1]https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
* Re: [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context
2025-07-02 14:17 ` Nilay Shroff
@ 2025-07-02 14:41 ` Yi Zhang
0 siblings, 0 replies; 9+ messages in thread
From: Yi Zhang @ 2025-07-02 14:41 UTC (permalink / raw)
To: Nilay Shroff
Cc: linux-block, hch, ming.lei, hare, axboe, sth, gjoyce,
Shinichiro Kawasaki
On Wed, Jul 2, 2025 at 10:17 PM Nilay Shroff <nilay@linux.ibm.com> wrote:
>
>
>
> On 7/2/25 7:23 PM, Yi Zhang wrote:
> > Hi Nilay
> >
> > With the patch on the latest linux-block/for-next, I reproduced the
> > following WARNING with blktests block/005, here is the full log:
> >
> > [ 342.845331] run blktests block/005 at 2025-07-02 09:48:55
> >
> > [ 343.835605] ======================================================
> > [ 343.841783] WARNING: possible circular locking dependency detected
> > [ 343.847966] 6.16.0-rc4.fix+ #3 Not tainted
> > [ 343.852073] ------------------------------------------------------
> > [ 343.858250] check/1365 is trying to acquire lock:
> > [ 343.862957] ffffffff98141db0 (pcpu_alloc_mutex){+.+.}-{4:4}, at:
> > pcpu_alloc_noprof+0x8eb/0xd70
> > [ 343.871587]
> > but task is already holding lock:
> > [ 343.877421] ffff888300cfb040 (&q->elevator_lock){+.+.}-{4:4}, at:
> > elevator_change+0x152/0x530
> > [ 343.885958]
> > which lock already depends on the new lock.
> >
> > [ 343.894131]
> > the existing dependency chain (in reverse order) is:
> > [ 343.901609]
> > -> #3 (&q->elevator_lock){+.+.}-{4:4}:
> > [ 343.907891] __lock_acquire+0x6f1/0xc00
> > [ 343.912259] lock_acquire.part.0+0xb6/0x240
> > [ 343.916966] __mutex_lock+0x17b/0x1690
> > [ 343.921247] elevator_change+0x152/0x530
> > [ 343.925692] elv_iosched_store+0x205/0x2f0
> > [ 343.930312] queue_attr_store+0x23b/0x300
> > [ 343.934853] kernfs_fop_write_iter+0x357/0x530
> > [ 343.939829] vfs_write+0x9bc/0xf60
> > [ 343.943763] ksys_write+0xf3/0x1d0
> > [ 343.947695] do_syscall_64+0x8c/0x3d0
> > [ 343.951883] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 343.957462]
> > -> #2 (&q->q_usage_counter(io)#4){++++}-{0:0}:
> > [ 343.964440] __lock_acquire+0x6f1/0xc00
> > [ 343.968799] lock_acquire.part.0+0xb6/0x240
> > [ 343.973507] blk_alloc_queue+0x5c5/0x710
> > [ 343.977959] blk_mq_alloc_queue+0x14e/0x240
> > [ 343.982666] __blk_mq_alloc_disk+0x15/0xd0
> > [ 343.987294] nvme_alloc_ns+0x208/0x1690 [nvme_core]
> > [ 343.992727] nvme_scan_ns+0x362/0x4c0 [nvme_core]
> > [ 343.997978] async_run_entry_fn+0x96/0x4f0
> > [ 344.002599] process_one_work+0x8cd/0x1950
> > [ 344.007226] worker_thread+0x58d/0xcf0
> > [ 344.011499] kthread+0x3d8/0x7a0
> > [ 344.015259] ret_from_fork+0x406/0x510
> > [ 344.019532] ret_from_fork_asm+0x1a/0x30
> > [ 344.023980]
> > -> #1 (fs_reclaim){+.+.}-{0:0}:
> > [ 344.029654] __lock_acquire+0x6f1/0xc00
> > [ 344.034015] lock_acquire.part.0+0xb6/0x240
> > [ 344.038727] fs_reclaim_acquire+0x103/0x150
> > [ 344.043433] prepare_alloc_pages+0x15f/0x600
> > [ 344.048230] __alloc_frozen_pages_noprof+0x14a/0x3a0
> > [ 344.053722] __alloc_pages_noprof+0xd/0x1d0
> > [ 344.058438] pcpu_alloc_pages.constprop.0+0x104/0x420
> > [ 344.064017] pcpu_populate_chunk+0x38/0x80
> > [ 344.068644] pcpu_alloc_noprof+0x650/0xd70
> > [ 344.073265] iommu_dma_init_fq+0x183/0x730
> > [ 344.077893] iommu_dma_init_domain+0x566/0x990
> > [ 344.082866] iommu_setup_dma_ops+0xca/0x230
> > [ 344.087571] bus_iommu_probe+0x1f8/0x4a0
> > [ 344.092020] iommu_device_register+0x153/0x240
> > [ 344.096993] iommu_init_pci+0x53c/0x1040
> > [ 344.101447] amd_iommu_init_pci+0xb6/0x5c0
> > [ 344.106066] state_next+0xaf7/0xff0
> > [ 344.110080] iommu_go_to_state+0x21/0x80
> > [ 344.114535] amd_iommu_init+0x15/0x70
> > [ 344.118728] pci_iommu_init+0x29/0x70
> > [ 344.122914] do_one_initcall+0x100/0x5a0
> > [ 344.127361] do_initcalls+0x138/0x1d0
> > [ 344.131556] kernel_init_freeable+0x8b7/0xbd0
> > [ 344.136442] kernel_init+0x1b/0x1f0
> > [ 344.140456] ret_from_fork+0x406/0x510
> > [ 344.144735] ret_from_fork_asm+0x1a/0x30
> > [ 344.149182]
> > -> #0 (pcpu_alloc_mutex){+.+.}-{4:4}:
> > [ 344.155379] check_prev_add+0xf1/0xce0
> > [ 344.159653] validate_chain+0x470/0x580
> > [ 344.164019] __lock_acquire+0x6f1/0xc00
> > [ 344.168378] lock_acquire.part.0+0xb6/0x240
> > [ 344.173085] __mutex_lock+0x17b/0x1690
> > [ 344.177365] pcpu_alloc_noprof+0x8eb/0xd70
> > [ 344.181984] kyber_queue_data_alloc+0x16d/0x660
> > [ 344.187047] kyber_init_sched+0x14/0x90
> > [ 344.191413] blk_mq_init_sched+0x264/0x4e0
> > [ 344.196033] elevator_switch+0x186/0x6a0
> > [ 344.200478] elevator_change+0x305/0x530
> > [ 344.204924] elv_iosched_store+0x205/0x2f0
> > [ 344.209545] queue_attr_store+0x23b/0x300
> > [ 344.214084] kernfs_fop_write_iter+0x357/0x530
> > [ 344.219051] vfs_write+0x9bc/0xf60
> > [ 344.222976] ksys_write+0xf3/0x1d0
> > [ 344.226902] do_syscall_64+0x8c/0x3d0
> > [ 344.231088] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 344.236660]
>
> Thanks for the report!
>
> I see that the above warning is different from the one addressed by the
> current patchset. In the warning you've reported, the kyber elevator
> allocates per-CPU data after acquiring ->elevator_lock, which introduces
> a per-CPU lock dependency on the ->elevator_lock.
>
> In contrast, the current patchset addresses a separate issue [1] that arises
> from elevator tag allocation, which occurs while both ->freeze_lock and
> ->elevator_lock are held. Internally, elevator tag allocation sets up the
> per-CPU sbitmap->alloc_hint, which introduces a similar per-CPU lock
> dependency on ->elevator_lock.
>
> That said, I'll plan to address the issue you've just reported in a separate
> patch, once the current patchset is merged.
OK, the issue [1] was easy to reproduce with blktests block/005 in my
environment, which is why I re-ran the test on this patchset, and that is
when this new WARNING triggered. Anyway, thanks for the info; I will test
your new patch later, thanks.
>
> Thanks,
> --Nilay
>
> [1]https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
>
>
>
--
Best Regards,
Yi Zhang
Thread overview: 9+ messages
2025-07-01 8:18 [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
2025-07-01 8:18 ` [PATCHv7 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
2025-07-01 8:18 ` [PATCHv7 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
2025-07-01 10:52 ` Hannes Reinecke
2025-07-01 8:19 ` [PATCHv7 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
2025-07-01 11:00 ` Hannes Reinecke
2025-07-02 13:53 ` [PATCHv7 0/3] block: move sched_tags allocation/de-allocation outside of locking context Yi Zhang
2025-07-02 14:17 ` Nilay Shroff
2025-07-02 14:41 ` Yi Zhang