* [PATCHv6 0/3] block: move sched_tags allocation/de-allocation outside of locking context
@ 2025-06-30 5:21 Nilay Shroff
2025-06-30 5:21 ` [PATCHv6 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Nilay Shroff @ 2025-06-30 5:21 UTC (permalink / raw)
To: linux-block; +Cc: hch, ming.lei, axboe, sth, lkp, gjoyce
Hi,
There have been a few reports[1] indicating potential lockdep warnings due
to a lock dependency from the percpu allocator to the elevator lock. This
patch series aims to eliminate that dependency.
The series consists of three patches:
The first patch is preparatory patch and just move elevator queue
allocation logic from ->init_sched into blk_mq_init_sched.
The second patch in the series restructures sched_tags allocation and
deallocation during elevator update/switch operations to ensure these
actions are performed entirely outside the ->freeze_lock and ->elevator_
lock. This eliminates the percpu allocator’s lock dependency on the
elevator and freeze lock during scheduler transitions.
The third patch introduces batch allocation and deallocation helpers for
sched_tags. These helpers are used during __blk_mq_update_nr_hw_queues()
to decouple sched_tags memory management from both the elevator and freeze
locks, addressing the lockdep concerns in the nr_hw_queues update path.
[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Changes since v5:
- Fixed smatch warning reported by kernel test robot here:
https://lore.kernel.org/all/202506300509.2S1tygch-lkp@intel.com/
Link to v5: https://lore.kernel.org/all/20250627175544.1063910-1-nilay@linux.ibm.com/
Changes since v4:
- Define a local Xarray variable in __blk_mq_update_nr_hw_queues to store
sched_tags, instead of storing it in an Xarray defined in 'struct elevator_tags'
(Ming Lei)
Link to v4: https://lore.kernel.org/all/20250624131716.630465-1-nilay@linux.ibm.com/
Changes since v3:
- Further split the patchset into three patch series so that we can
have a separate patch for sched_tags batch allocation/deallocation
(Ming Lei)
- Use Xarray to store and load the sched_tags (Ming Lei)
- Unexport elevator_alloc() as we no longer need to use it outside
of block layer core (hch)
- unwind the sched_tags allocation and free tags when we it fails in
the middle of allocation (hch)
- Move struct elevator_queue header from commin header to elevator.c
as there's no user of it outside elevator.c (Ming Lei, hch)
Link to v3: https://lore.kernel.org/all/20250616173233.3803824-1-nilay@linux.ibm.com/
Change since v2:
- Split the patch into a two patch series. The first patch updates
->init_sched elevator API change and second patch handles the sched
tags allocation/de-allocation logic (Ming Lei)
- Address sched tags allocation/deallocation logic while running in the
context of nr_hw_queue update so that we can handle all possible
scenarios in a single patchest (Ming Lei)
Link to v2: https://lore.kernel.org/all/20250528123638.1029700-1-nilay@linux.ibm.com/
Changes since v1:
- As the lifetime of elevator queue and sched tags are same, allocate
and move sched tags under struct elevator_queue (Ming Lei)
Link to v1: https://lore.kernel.org/all/20250520103425.1259712-1-nilay@linux.ibm.com/
Nilay Shroff (3):
block: move elevator queue allocation logic into blk_mq_init_sched
block: fix lockdep warning caused by lock dependency in
elv_iosched_store
block: fix potential deadlock while running nr_hw_queue update
block/bfq-iosched.c | 13 +--
block/blk-mq-sched.c | 220 ++++++++++++++++++++++++++++--------------
block/blk-mq-sched.h | 12 ++-
block/blk-mq.c | 11 ++-
block/blk.h | 2 +-
block/elevator.c | 50 ++++++++--
block/elevator.h | 16 ++-
block/kyber-iosched.c | 11 +--
block/mq-deadline.c | 14 +--
9 files changed, 234 insertions(+), 115 deletions(-)
--
2.49.0
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCHv6 1/3] block: move elevator queue allocation logic into blk_mq_init_sched
2025-06-30 5:21 [PATCHv6 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
@ 2025-06-30 5:21 ` Nilay Shroff
2025-06-30 6:13 ` Christoph Hellwig
2025-06-30 6:17 ` Hannes Reinecke
2025-06-30 5:21 ` [PATCHv6 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
2025-06-30 5:21 ` [PATCHv6 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
2 siblings, 2 replies; 15+ messages in thread
From: Nilay Shroff @ 2025-06-30 5:21 UTC (permalink / raw)
To: linux-block; +Cc: hch, ming.lei, axboe, sth, lkp, gjoyce
In preparation for allocating sched_tags before freezing the request
queue and acquiring ->elevator_lock, move the elevator queue allocation
logic from the elevator ops ->init_sched callback into blk_mq_init_sched.
As elevator_alloc is now only invoked from block layer core, we don't
need to export it, so unexport elevator_alloc function.
This refactoring provides a centralized location for elevator queue
initialization, which makes it easier to store pre-allocated sched_tags
in the struct elevator_queue during later changes.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/bfq-iosched.c | 13 +++----------
block/blk-mq-sched.c | 11 ++++++++---
block/elevator.c | 1 -
block/elevator.h | 2 +-
block/kyber-iosched.c | 11 ++---------
block/mq-deadline.c | 14 ++------------
6 files changed, 16 insertions(+), 36 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0cb1e9873aab..fd26dc1901b0 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7232,22 +7232,16 @@ static void bfq_init_root_group(struct bfq_group *root_group,
root_group->sched_data.bfq_class_idle_last_service = jiffies;
}
-static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
+static int bfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
{
struct bfq_data *bfqd;
- struct elevator_queue *eq;
unsigned int i;
struct blk_independent_access_ranges *ia_ranges = q->disk->ia_ranges;
- eq = elevator_alloc(q, e);
- if (!eq)
- return -ENOMEM;
-
bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
- if (!bfqd) {
- kobject_put(&eq->kobj);
+ if (!bfqd)
return -ENOMEM;
- }
+
eq->elevator_data = bfqd;
spin_lock_irq(&q->queue_lock);
@@ -7405,7 +7399,6 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
out_free:
kfree(bfqd);
- kobject_put(&eq->kobj);
return -ENOMEM;
}
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 55a0fd105147..359e0704e09b 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -475,10 +475,14 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
BLKDEV_DEFAULT_RQ);
+ eq = elevator_alloc(q, e);
+ if (!eq)
+ return -ENOMEM;
+
if (blk_mq_is_shared_tags(flags)) {
ret = blk_mq_init_sched_shared_tags(q);
if (ret)
- return ret;
+ goto err_put_elevator;
}
queue_for_each_hw_ctx(q, hctx, i) {
@@ -487,7 +491,7 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
goto err_free_map_and_rqs;
}
- ret = e->ops.init_sched(q, e);
+ ret = e->ops.init_sched(q, eq);
if (ret)
goto err_free_map_and_rqs;
@@ -508,7 +512,8 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
err_free_map_and_rqs:
blk_mq_sched_free_rqs(q);
blk_mq_sched_tags_teardown(q, flags);
-
+err_put_elevator:
+ kobject_put(&eq->kobj);
q->elevator = NULL;
return ret;
}
diff --git a/block/elevator.c b/block/elevator.c
index ab22542e6cf0..770874040f79 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -148,7 +148,6 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
return eq;
}
-EXPORT_SYMBOL(elevator_alloc);
static void elevator_release(struct kobject *kobj)
{
diff --git a/block/elevator.h b/block/elevator.h
index a07ce773a38f..a4de5f9ad790 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -24,7 +24,7 @@ struct blk_mq_alloc_data;
struct blk_mq_hw_ctx;
struct elevator_mq_ops {
- int (*init_sched)(struct request_queue *, struct elevator_type *);
+ int (*init_sched)(struct request_queue *, struct elevator_queue *);
void (*exit_sched)(struct elevator_queue *);
int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 4dba8405bd01..7b6832cb3a8d 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -402,20 +402,13 @@ static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
return ERR_PTR(ret);
}
-static int kyber_init_sched(struct request_queue *q, struct elevator_type *e)
+static int kyber_init_sched(struct request_queue *q, struct elevator_queue *eq)
{
struct kyber_queue_data *kqd;
- struct elevator_queue *eq;
-
- eq = elevator_alloc(q, e);
- if (!eq)
- return -ENOMEM;
kqd = kyber_queue_data_alloc(q);
- if (IS_ERR(kqd)) {
- kobject_put(&eq->kobj);
+ if (IS_ERR(kqd))
return PTR_ERR(kqd);
- }
blk_stat_enable_accounting(q);
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 2edf1cac06d5..7b6caf30e00a 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -568,20 +568,14 @@ static void dd_exit_sched(struct elevator_queue *e)
/*
* initialize elevator private data (deadline_data).
*/
-static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
+static int dd_init_sched(struct request_queue *q, struct elevator_queue *eq)
{
struct deadline_data *dd;
- struct elevator_queue *eq;
enum dd_prio prio;
- int ret = -ENOMEM;
-
- eq = elevator_alloc(q, e);
- if (!eq)
- return ret;
dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
if (!dd)
- goto put_eq;
+ return -ENOMEM;
eq->elevator_data = dd;
@@ -608,10 +602,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
q->elevator = eq;
return 0;
-
-put_eq:
- kobject_put(&eq->kobj);
- return ret;
}
/*
--
2.50.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCHv6 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
2025-06-30 5:21 [PATCHv6 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
2025-06-30 5:21 ` [PATCHv6 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
@ 2025-06-30 5:21 ` Nilay Shroff
2025-06-30 6:15 ` Christoph Hellwig
` (2 more replies)
2025-06-30 5:21 ` [PATCHv6 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
2 siblings, 3 replies; 15+ messages in thread
From: Nilay Shroff @ 2025-06-30 5:21 UTC (permalink / raw)
To: linux-block; +Cc: hch, ming.lei, axboe, sth, lkp, gjoyce
Recent lockdep reports [1] have revealed a potential deadlock caused by a
lock dependency between the percpu allocator lock and the elevator lock.
This issue can be avoided by ensuring that the allocation and release of
scheduler tags (sched_tags) are performed outside the elevator lock.
Furthermore, the queue does not need to be remain frozen during these
operations.
To address this, move all sched_tags allocations and deallocations outside
of both the ->elevator_lock and the ->freeze_lock. Since the lifetime of
the elevator queue and its associated sched_tags is closely tied, the
allocated sched_tags are now stored in the elevator queue structure. Then,
during the actual elevator switch (which runs under ->freeze_lock and
->elevator_lock), the pre-allocated sched_tags are assigned to the
appropriate q->hctx. Once the elevator switch is complete and the locks
are released, the old elevator queue and its associated sched_tags are
freed.
This commit specifically addresses the allocation/deallocation of sched_
tags during elevator switching. Note that sched_tags may also be allocated
in other contexts, such as during nr_hw_queues updates. Supporting that
use case will require batch allocation/deallocation, which will be handled
in a follow-up patch.
This restructuring ensures that sched_tags memory management occurs
entirely outside of the ->elevator_lock and ->freeze_lock context,
eliminating the lock dependency problem seen during scheduler updates.
[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Reported-by: Stefan Haberland <sth@linux.ibm.com>
Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-mq-sched.c | 155 +++++++++++++++++++++++--------------------
block/blk-mq-sched.h | 8 ++-
block/elevator.c | 47 +++++++++++--
block/elevator.h | 14 +++-
4 files changed, 144 insertions(+), 80 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 359e0704e09b..2d6d1ebdd8fb 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -374,64 +374,17 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq,
}
EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge);
-static int blk_mq_sched_alloc_map_and_rqs(struct request_queue *q,
- struct blk_mq_hw_ctx *hctx,
- unsigned int hctx_idx)
-{
- if (blk_mq_is_shared_tags(q->tag_set->flags)) {
- hctx->sched_tags = q->sched_shared_tags;
- return 0;
- }
-
- hctx->sched_tags = blk_mq_alloc_map_and_rqs(q->tag_set, hctx_idx,
- q->nr_requests);
-
- if (!hctx->sched_tags)
- return -ENOMEM;
- return 0;
-}
-
-static void blk_mq_exit_sched_shared_tags(struct request_queue *queue)
-{
- blk_mq_free_rq_map(queue->sched_shared_tags);
- queue->sched_shared_tags = NULL;
-}
-
/* called in queue's release handler, tagset has gone away */
static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int flags)
{
struct blk_mq_hw_ctx *hctx;
unsigned long i;
- queue_for_each_hw_ctx(q, hctx, i) {
- if (hctx->sched_tags) {
- if (!blk_mq_is_shared_tags(flags))
- blk_mq_free_rq_map(hctx->sched_tags);
- hctx->sched_tags = NULL;
- }
- }
+ queue_for_each_hw_ctx(q, hctx, i)
+ hctx->sched_tags = NULL;
if (blk_mq_is_shared_tags(flags))
- blk_mq_exit_sched_shared_tags(q);
-}
-
-static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
-{
- struct blk_mq_tag_set *set = queue->tag_set;
-
- /*
- * Set initial depth at max so that we don't need to reallocate for
- * updating nr_requests.
- */
- queue->sched_shared_tags = blk_mq_alloc_map_and_rqs(set,
- BLK_MQ_NO_HCTX_IDX,
- MAX_SCHED_RQ);
- if (!queue->sched_shared_tags)
- return -ENOMEM;
-
- blk_mq_tag_update_sched_shared_tags(queue);
-
- return 0;
+ q->sched_shared_tags = NULL;
}
void blk_mq_sched_reg_debugfs(struct request_queue *q)
@@ -458,8 +411,75 @@ void blk_mq_sched_unreg_debugfs(struct request_queue *q)
mutex_unlock(&q->debugfs_mutex);
}
+void blk_mq_free_sched_tags(struct elevator_tags *et,
+ struct blk_mq_tag_set *set)
+{
+ unsigned long i;
+
+ /* Shared tags are stored at index 0 in @tags. */
+ if (blk_mq_is_shared_tags(set->flags))
+ blk_mq_free_map_and_rqs(set, et->tags[0], BLK_MQ_NO_HCTX_IDX);
+ else {
+ for (i = 0; i < et->nr_hw_queues; i++)
+ blk_mq_free_map_and_rqs(set, et->tags[i], i);
+ }
+
+ kfree(et);
+}
+
+struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
+ unsigned int nr_hw_queues)
+{
+ unsigned int nr_tags;
+ int i;
+ struct elevator_tags *et;
+ gfp_t gfp = GFP_NOIO | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
+
+ if (blk_mq_is_shared_tags(set->flags))
+ nr_tags = 1;
+ else
+ nr_tags = nr_hw_queues;
+
+ et = kmalloc(sizeof(struct elevator_tags) +
+ nr_tags * sizeof(struct blk_mq_tags *), gfp);
+ if (!et)
+ return NULL;
+ /*
+ * Default to double of smaller one between hw queue_depth and
+ * 128, since we don't split into sync/async like the old code
+ * did. Additionally, this is a per-hw queue depth.
+ */
+ et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
+ BLKDEV_DEFAULT_RQ);
+ et->nr_hw_queues = nr_hw_queues;
+
+ if (blk_mq_is_shared_tags(set->flags)) {
+ /* Shared tags are stored at index 0 in @tags. */
+ et->tags[0] = blk_mq_alloc_map_and_rqs(set, BLK_MQ_NO_HCTX_IDX,
+ MAX_SCHED_RQ);
+ if (!et->tags[0])
+ goto out;
+ } else {
+ for (i = 0; i < et->nr_hw_queues; i++) {
+ et->tags[i] = blk_mq_alloc_map_and_rqs(set, i,
+ et->nr_requests);
+ if (!et->tags[i])
+ goto out_unwind;
+ }
+ }
+
+ return et;
+out_unwind:
+ while (--i >= 0)
+ blk_mq_free_map_and_rqs(set, et->tags[i], i);
+out:
+ kfree(et);
+ return NULL;
+}
+
/* caller must have a reference to @e, will grab another one if successful */
-int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
+ struct elevator_tags *et)
{
unsigned int flags = q->tag_set->flags;
struct blk_mq_hw_ctx *hctx;
@@ -467,40 +487,33 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
unsigned long i;
int ret;
- /*
- * Default to double of smaller one between hw queue_depth and 128,
- * since we don't split into sync/async like the old code did.
- * Additionally, this is a per-hw queue depth.
- */
- q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
- BLKDEV_DEFAULT_RQ);
-
- eq = elevator_alloc(q, e);
+ eq = elevator_alloc(q, e, et);
if (!eq)
return -ENOMEM;
+ q->nr_requests = et->nr_requests;
+
if (blk_mq_is_shared_tags(flags)) {
- ret = blk_mq_init_sched_shared_tags(q);
- if (ret)
- goto err_put_elevator;
+ /* Shared tags are stored at index 0 in @et->tags. */
+ q->sched_shared_tags = et->tags[0];
+ blk_mq_tag_update_sched_shared_tags(q);
}
queue_for_each_hw_ctx(q, hctx, i) {
- ret = blk_mq_sched_alloc_map_and_rqs(q, hctx, i);
- if (ret)
- goto err_free_map_and_rqs;
+ if (blk_mq_is_shared_tags(flags))
+ hctx->sched_tags = q->sched_shared_tags;
+ else
+ hctx->sched_tags = et->tags[i];
}
ret = e->ops.init_sched(q, eq);
if (ret)
- goto err_free_map_and_rqs;
+ goto out;
queue_for_each_hw_ctx(q, hctx, i) {
if (e->ops.init_hctx) {
ret = e->ops.init_hctx(hctx, i);
if (ret) {
- eq = q->elevator;
- blk_mq_sched_free_rqs(q);
blk_mq_exit_sched(q, eq);
kobject_put(&eq->kobj);
return ret;
@@ -509,10 +522,8 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
}
return 0;
-err_free_map_and_rqs:
- blk_mq_sched_free_rqs(q);
+out:
blk_mq_sched_tags_teardown(q, flags);
-err_put_elevator:
kobject_put(&eq->kobj);
q->elevator = NULL;
return ret;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 1326526bb733..0cde00cd1c47 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -18,10 +18,16 @@ void __blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
-int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
+ struct elevator_tags *et);
void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
void blk_mq_sched_free_rqs(struct request_queue *q);
+struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
+ unsigned int nr_hw_queues);
+void blk_mq_free_sched_tags(struct elevator_tags *et,
+ struct blk_mq_tag_set *set);
+
static inline void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
{
if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
diff --git a/block/elevator.c b/block/elevator.c
index 770874040f79..50f4b78efe66 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -54,6 +54,8 @@ struct elv_change_ctx {
struct elevator_queue *old;
/* for registering new elevator */
struct elevator_queue *new;
+ /* holds sched tags data */
+ struct elevator_tags *et;
};
static DEFINE_SPINLOCK(elv_list_lock);
@@ -132,7 +134,7 @@ static struct elevator_type *elevator_find_get(const char *name)
static const struct kobj_type elv_ktype;
struct elevator_queue *elevator_alloc(struct request_queue *q,
- struct elevator_type *e)
+ struct elevator_type *e, struct elevator_tags *et)
{
struct elevator_queue *eq;
@@ -145,6 +147,7 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
kobject_init(&eq->kobj, &elv_ktype);
mutex_init(&eq->sysfs_lock);
hash_init(eq->hash);
+ eq->et = et;
return eq;
}
@@ -165,7 +168,6 @@ static void elevator_exit(struct request_queue *q)
lockdep_assert_held(&q->elevator_lock);
ioc_clear_queue(q);
- blk_mq_sched_free_rqs(q);
mutex_lock(&e->sysfs_lock);
blk_mq_exit_sched(q, e);
@@ -591,7 +593,7 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
}
if (new_e) {
- ret = blk_mq_init_sched(q, new_e);
+ ret = blk_mq_init_sched(q, new_e, ctx->et);
if (ret)
goto out_unfreeze;
ctx->new = q->elevator;
@@ -626,8 +628,10 @@ static void elv_exit_and_release(struct request_queue *q)
elevator_exit(q);
mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
- if (e)
+ if (e) {
+ blk_mq_free_sched_tags(e->et, q->tag_set);
kobject_put(&e->kobj);
+ }
}
static int elevator_change_done(struct request_queue *q,
@@ -640,6 +644,7 @@ static int elevator_change_done(struct request_queue *q,
&ctx->old->flags);
elv_unregister_queue(q, ctx->old);
+ blk_mq_free_sched_tags(ctx->old->et, q->tag_set);
kobject_put(&ctx->old->kobj);
if (enable_wbt)
wbt_enable_default(q->disk);
@@ -658,9 +663,16 @@ static int elevator_change_done(struct request_queue *q,
static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
{
unsigned int memflags;
+ struct blk_mq_tag_set *set = q->tag_set;
int ret = 0;
- lockdep_assert_held(&q->tag_set->update_nr_hwq_lock);
+ lockdep_assert_held(&set->update_nr_hwq_lock);
+
+ if (strncmp(ctx->name, "none", 4)) {
+ ctx->et = blk_mq_alloc_sched_tags(set, set->nr_hw_queues);
+ if (!ctx->et)
+ return -ENOMEM;
+ }
memflags = blk_mq_freeze_queue(q);
/*
@@ -680,6 +692,11 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
blk_mq_unfreeze_queue(q, memflags);
if (!ret)
ret = elevator_change_done(q, ctx);
+ /*
+ * Free sched tags if it's allocated but we couldn't switch elevator.
+ */
+ if (ctx->et && !ctx->new)
+ blk_mq_free_sched_tags(ctx->et, set);
return ret;
}
@@ -690,11 +707,26 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
*/
void elv_update_nr_hw_queues(struct request_queue *q)
{
+ struct blk_mq_tag_set *set = q->tag_set;
struct elv_change_ctx ctx = {};
int ret = -ENODEV;
WARN_ON_ONCE(q->mq_freeze_depth == 0);
+ /*
+ * Accessing q->elevator without holding q->elevator_lock is safe here
+ * because nr_hw_queue update is protected by set->update_nr_hwq_lock
+ * in the writer context. So, scheduler update/switch code (which
+ * acquires same lock in the reader context) can't run concurrently.
+ */
+ if (q->elevator) {
+ ctx.et = blk_mq_alloc_sched_tags(set, set->nr_hw_queues);
+ if (!ctx.et) {
+ WARN_ON_ONCE(1);
+ return;
+ }
+ }
+
mutex_lock(&q->elevator_lock);
if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
ctx.name = q->elevator->type->elevator_name;
@@ -706,6 +738,11 @@ void elv_update_nr_hw_queues(struct request_queue *q)
blk_mq_unfreeze_queue_nomemrestore(q);
if (!ret)
WARN_ON_ONCE(elevator_change_done(q, &ctx));
+ /*
+ * Free sched tags if it's allocated but we couldn't switch elevator.
+ */
+ if (ctx.et && !ctx.new)
+ blk_mq_free_sched_tags(ctx.et, set);
}
/*
diff --git a/block/elevator.h b/block/elevator.h
index a4de5f9ad790..adc5c157e17e 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -23,6 +23,15 @@ enum elv_merge {
struct blk_mq_alloc_data;
struct blk_mq_hw_ctx;
+struct elevator_tags {
+ /* num. of hardware queues for which tags are allocated */
+ unsigned int nr_hw_queues;
+ /* depth used while allocating tags */
+ unsigned int nr_requests;
+ /* shared tag is stored at index 0 */
+ struct blk_mq_tags *tags[];
+};
+
struct elevator_mq_ops {
int (*init_sched)(struct request_queue *, struct elevator_queue *);
void (*exit_sched)(struct elevator_queue *);
@@ -113,6 +122,7 @@ struct request *elv_rqhash_find(struct request_queue *q, sector_t offset);
struct elevator_queue
{
struct elevator_type *type;
+ struct elevator_tags *et;
void *elevator_data;
struct kobject kobj;
struct mutex sysfs_lock;
@@ -152,8 +162,8 @@ ssize_t elv_iosched_show(struct gendisk *disk, char *page);
ssize_t elv_iosched_store(struct gendisk *disk, const char *page, size_t count);
extern bool elv_bio_merge_ok(struct request *, struct bio *);
-extern struct elevator_queue *elevator_alloc(struct request_queue *,
- struct elevator_type *);
+struct elevator_queue *elevator_alloc(struct request_queue *,
+ struct elevator_type *, struct elevator_tags *);
/*
* Helper functions.
--
2.50.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCHv6 3/3] block: fix potential deadlock while running nr_hw_queue update
2025-06-30 5:21 [PATCHv6 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
2025-06-30 5:21 ` [PATCHv6 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
2025-06-30 5:21 ` [PATCHv6 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
@ 2025-06-30 5:21 ` Nilay Shroff
2025-06-30 6:16 ` Christoph Hellwig
` (2 more replies)
2 siblings, 3 replies; 15+ messages in thread
From: Nilay Shroff @ 2025-06-30 5:21 UTC (permalink / raw)
To: linux-block; +Cc: hch, ming.lei, axboe, sth, lkp, gjoyce
Move scheduler tags (sched_tags) allocation and deallocation outside
both the ->elevator_lock and ->freeze_lock when updating nr_hw_queues.
This change breaks the dependency chain from the percpu allocator lock
to the elevator lock, helping to prevent potential deadlocks, as
observed in the reported lockdep splat[1].
This commit introduces batch allocation and deallocation helpers for
sched_tags, which are now used from within __blk_mq_update_nr_hw_queues
routine while iterating through the tagset.
With this change, all sched_tags memory management is handled entirely
outside the ->elevator_lock and the ->freeze_lock context, thereby
eliminating the lock dependency that could otherwise manifest during
nr_hw_queues updates.
[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Reported-by: Stefan Haberland <sth@linux.ibm.com>
Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-mq-sched.c | 63 ++++++++++++++++++++++++++++++++++++++++++++
block/blk-mq-sched.h | 4 +++
block/blk-mq.c | 11 +++++++-
block/blk.h | 2 +-
block/elevator.c | 4 +--
5 files changed, 80 insertions(+), 4 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 2d6d1ebdd8fb..da802df34a8c 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -427,6 +427,30 @@ void blk_mq_free_sched_tags(struct elevator_tags *et,
kfree(et);
}
+void blk_mq_free_sched_tags_batch(struct xarray *et_table,
+ struct blk_mq_tag_set *set)
+{
+ struct request_queue *q;
+ struct elevator_tags *et;
+
+ lockdep_assert_held_write(&set->update_nr_hwq_lock);
+
+ list_for_each_entry(q, &set->tag_list, tag_set_list) {
+ /*
+ * Accessing q->elevator without holding q->elevator_lock is
+ * safe because we're holding here set->update_nr_hwq_lock in
+ * the writer context. So, scheduler update/switch code (which
+ * acquires the same lock but in the reader context) can't run
+ * concurrently.
+ */
+ if (q->elevator) {
+ et = xa_load(et_table, q->id);
+ if (et)
+ blk_mq_free_sched_tags(et, set);
+ }
+ }
+}
+
struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
unsigned int nr_hw_queues)
{
@@ -477,6 +501,45 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
return NULL;
}
+int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
+ struct blk_mq_tag_set *set, unsigned int nr_hw_queues)
+{
+ struct request_queue *q;
+ struct elevator_tags *et;
+ gfp_t gfp = GFP_NOIO | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
+
+ lockdep_assert_held_write(&set->update_nr_hwq_lock);
+
+ list_for_each_entry(q, &set->tag_list, tag_set_list) {
+ /*
+ * Accessing q->elevator without holding q->elevator_lock is
+ * safe because we're holding here set->update_nr_hwq_lock in
+ * the writer context. So, scheduler update/switch code (which
+ * acquires the same lock but in the reader context) can't run
+ * concurrently.
+ */
+ if (q->elevator) {
+ et = blk_mq_alloc_sched_tags(set, nr_hw_queues);
+ if (!et)
+ goto out_unwind;
+ if (xa_insert(et_table, q->id, et, gfp))
+ goto out_free_tags;
+ }
+ }
+ return 0;
+out_free_tags:
+ blk_mq_free_sched_tags(et, set);
+out_unwind:
+ list_for_each_entry_continue_reverse(q, &set->tag_list, tag_set_list) {
+ if (q->elevator) {
+ et = xa_load(et_table, q->id);
+ if (et)
+ blk_mq_free_sched_tags(et, set);
+ }
+ }
+ return -ENOMEM;
+}
+
/* caller must have a reference to @e, will grab another one if successful */
int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
struct elevator_tags *et)
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 0cde00cd1c47..b554e1d55950 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -25,8 +25,12 @@ void blk_mq_sched_free_rqs(struct request_queue *q);
struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
unsigned int nr_hw_queues);
+int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
+ struct blk_mq_tag_set *set, unsigned int nr_hw_queues);
void blk_mq_free_sched_tags(struct elevator_tags *et,
struct blk_mq_tag_set *set);
+void blk_mq_free_sched_tags_batch(struct xarray *et_table,
+ struct blk_mq_tag_set *set);
static inline void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
{
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4806b867e37d..a68b658ce07b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4972,6 +4972,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
struct request_queue *q;
int prev_nr_hw_queues = set->nr_hw_queues;
unsigned int memflags;
+ struct xarray et_table;
int i;
lockdep_assert_held(&set->tag_list_lock);
@@ -4984,6 +4985,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
return;
memflags = memalloc_noio_save();
+
+ xa_init(&et_table);
+ if (blk_mq_alloc_sched_tags_batch(&et_table, set, nr_hw_queues) < 0)
+ goto out_memalloc_restore;
+
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_debugfs_unregister_hctxs(q);
blk_mq_sysfs_unregister_hctxs(q);
@@ -4995,6 +5001,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) {
list_for_each_entry(q, &set->tag_list, tag_set_list)
blk_mq_unfreeze_queue_nomemrestore(q);
+ blk_mq_free_sched_tags_batch(&et_table, set);
goto reregister;
}
@@ -5019,7 +5026,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
/* elv_update_nr_hw_queues() unfreeze queue for us */
list_for_each_entry(q, &set->tag_list, tag_set_list)
- elv_update_nr_hw_queues(q);
+ elv_update_nr_hw_queues(q, &et_table);
reregister:
list_for_each_entry(q, &set->tag_list, tag_set_list) {
@@ -5029,7 +5036,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
blk_mq_remove_hw_queues_cpuhp(q);
blk_mq_add_hw_queues_cpuhp(q);
}
+out_memalloc_restore:
memalloc_noio_restore(memflags);
+ xa_destroy(&et_table);
/* Free the excess tags when nr_hw_queues shrink. */
for (i = set->nr_hw_queues; i < prev_nr_hw_queues; i++)
diff --git a/block/blk.h b/block/blk.h
index 37ec459fe656..c6d1d1458388 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -321,7 +321,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
bool blk_insert_flush(struct request *rq);
-void elv_update_nr_hw_queues(struct request_queue *q);
+void elv_update_nr_hw_queues(struct request_queue *q, struct xarray *et_table);
void elevator_set_default(struct request_queue *q);
void elevator_set_none(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index 50f4b78efe66..8ba8b869d5a4 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -705,7 +705,7 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
* The I/O scheduler depends on the number of hardware queues, this forces a
* reattachment when nr_hw_queues changes.
*/
-void elv_update_nr_hw_queues(struct request_queue *q)
+void elv_update_nr_hw_queues(struct request_queue *q, struct xarray *et_table)
{
struct blk_mq_tag_set *set = q->tag_set;
struct elv_change_ctx ctx = {};
@@ -720,7 +720,7 @@ void elv_update_nr_hw_queues(struct request_queue *q)
* acquires same lock in the reader context) can't run concurrently.
*/
if (q->elevator) {
- ctx.et = blk_mq_alloc_sched_tags(set, set->nr_hw_queues);
+ ctx.et = xa_load(et_table, q->id);
if (!ctx.et) {
WARN_ON_ONCE(1);
return;
--
2.50.0
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCHv6 1/3] block: move elevator queue allocation logic into blk_mq_init_sched
2025-06-30 5:21 ` [PATCHv6 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
@ 2025-06-30 6:13 ` Christoph Hellwig
2025-06-30 6:17 ` Hannes Reinecke
1 sibling, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2025-06-30 6:13 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, ming.lei, axboe, sth, lkp, gjoyce
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv6 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
2025-06-30 5:21 ` [PATCHv6 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
@ 2025-06-30 6:15 ` Christoph Hellwig
2025-06-30 6:20 ` Hannes Reinecke
2025-07-01 3:19 ` Ming Lei
2 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2025-06-30 6:15 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, ming.lei, axboe, sth, lkp, gjoyce
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv6 3/3] block: fix potential deadlock while running nr_hw_queue update
2025-06-30 5:21 ` [PATCHv6 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
@ 2025-06-30 6:16 ` Christoph Hellwig
2025-06-30 6:24 ` Hannes Reinecke
2025-07-01 3:24 ` Ming Lei
2 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2025-06-30 6:16 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, hch, ming.lei, axboe, sth, lkp, gjoyce
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv6 1/3] block: move elevator queue allocation logic into blk_mq_init_sched
2025-06-30 5:21 ` [PATCHv6 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
2025-06-30 6:13 ` Christoph Hellwig
@ 2025-06-30 6:17 ` Hannes Reinecke
1 sibling, 0 replies; 15+ messages in thread
From: Hannes Reinecke @ 2025-06-30 6:17 UTC (permalink / raw)
To: Nilay Shroff, linux-block; +Cc: hch, ming.lei, axboe, sth, lkp, gjoyce
On 6/30/25 07:21, Nilay Shroff wrote:
> In preparation for allocating sched_tags before freezing the request
> queue and acquiring ->elevator_lock, move the elevator queue allocation
> logic from the elevator ops ->init_sched callback into blk_mq_init_sched.
> As elevator_alloc is now only invoked from block layer core, we don't
> need to export it, so unexport elevator_alloc function.
>
> This refactoring provides a centralized location for elevator queue
> initialization, which makes it easier to store pre-allocated sched_tags
> in the struct elevator_queue during later changes.
>
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> block/bfq-iosched.c | 13 +++----------
> block/blk-mq-sched.c | 11 ++++++++---
> block/elevator.c | 1 -
> block/elevator.h | 2 +-
> block/kyber-iosched.c | 11 ++---------
> block/mq-deadline.c | 14 ++------------
> 6 files changed, 16 insertions(+), 36 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv6 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
2025-06-30 5:21 ` [PATCHv6 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
2025-06-30 6:15 ` Christoph Hellwig
@ 2025-06-30 6:20 ` Hannes Reinecke
2025-06-30 6:48 ` Nilay Shroff
2025-07-01 3:19 ` Ming Lei
2 siblings, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2025-06-30 6:20 UTC (permalink / raw)
To: Nilay Shroff, linux-block; +Cc: hch, ming.lei, axboe, sth, lkp, gjoyce
On 6/30/25 07:21, Nilay Shroff wrote:
> Recent lockdep reports [1] have revealed a potential deadlock caused by a
> lock dependency between the percpu allocator lock and the elevator lock.
> This issue can be avoided by ensuring that the allocation and release of
> scheduler tags (sched_tags) are performed outside the elevator lock.
> Furthermore, the queue does not need to be remain frozen during these
> operations.
>
> To address this, move all sched_tags allocations and deallocations outside
> of both the ->elevator_lock and the ->freeze_lock. Since the lifetime of
> the elevator queue and its associated sched_tags is closely tied, the
> allocated sched_tags are now stored in the elevator queue structure. Then,
> during the actual elevator switch (which runs under ->freeze_lock and
> ->elevator_lock), the pre-allocated sched_tags are assigned to the
> appropriate q->hctx. Once the elevator switch is complete and the locks
> are released, the old elevator queue and its associated sched_tags are
> freed.
>
> This commit specifically addresses the allocation/deallocation of sched_
> tags during elevator switching. Note that sched_tags may also be allocated
> in other contexts, such as during nr_hw_queues updates. Supporting that
> use case will require batch allocation/deallocation, which will be handled
> in a follow-up patch.
>
> This restructuring ensures that sched_tags memory management occurs
> entirely outside of the ->elevator_lock and ->freeze_lock context,
> eliminating the lock dependency problem seen during scheduler updates.
>
> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
>
> Reported-by: Stefan Haberland <sth@linux.ibm.com>
> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> block/blk-mq-sched.c | 155 +++++++++++++++++++++++--------------------
> block/blk-mq-sched.h | 8 ++-
> block/elevator.c | 47 +++++++++++--
> block/elevator.h | 14 +++-
> 4 files changed, 144 insertions(+), 80 deletions(-)
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 359e0704e09b..2d6d1ebdd8fb 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -374,64 +374,17 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq,
> }
> EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge);
>
> -static int blk_mq_sched_alloc_map_and_rqs(struct request_queue *q,
> - struct blk_mq_hw_ctx *hctx,
> - unsigned int hctx_idx)
> -{
> - if (blk_mq_is_shared_tags(q->tag_set->flags)) {
> - hctx->sched_tags = q->sched_shared_tags;
> - return 0;
> - }
> -
> - hctx->sched_tags = blk_mq_alloc_map_and_rqs(q->tag_set, hctx_idx,
> - q->nr_requests);
> -
> - if (!hctx->sched_tags)
> - return -ENOMEM;
> - return 0;
> -}
> -
> -static void blk_mq_exit_sched_shared_tags(struct request_queue *queue)
> -{
> - blk_mq_free_rq_map(queue->sched_shared_tags);
> - queue->sched_shared_tags = NULL;
> -}
> -
> /* called in queue's release handler, tagset has gone away */
> static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int flags)
> {
> struct blk_mq_hw_ctx *hctx;
> unsigned long i;
>
> - queue_for_each_hw_ctx(q, hctx, i) {
> - if (hctx->sched_tags) {
> - if (!blk_mq_is_shared_tags(flags))
> - blk_mq_free_rq_map(hctx->sched_tags);
> - hctx->sched_tags = NULL;
> - }
> - }
> + queue_for_each_hw_ctx(q, hctx, i)
> + hctx->sched_tags = NULL;
>
> if (blk_mq_is_shared_tags(flags))
> - blk_mq_exit_sched_shared_tags(q);
> -}
> -
> -static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
> -{
> - struct blk_mq_tag_set *set = queue->tag_set;
> -
> - /*
> - * Set initial depth at max so that we don't need to reallocate for
> - * updating nr_requests.
> - */
> - queue->sched_shared_tags = blk_mq_alloc_map_and_rqs(set,
> - BLK_MQ_NO_HCTX_IDX,
> - MAX_SCHED_RQ);
> - if (!queue->sched_shared_tags)
> - return -ENOMEM;
> -
> - blk_mq_tag_update_sched_shared_tags(queue);
> -
> - return 0;
> + q->sched_shared_tags = NULL;
> }
>
> void blk_mq_sched_reg_debugfs(struct request_queue *q)
> @@ -458,8 +411,75 @@ void blk_mq_sched_unreg_debugfs(struct request_queue *q)
> mutex_unlock(&q->debugfs_mutex);
> }
>
> +void blk_mq_free_sched_tags(struct elevator_tags *et,
> + struct blk_mq_tag_set *set)
> +{
> + unsigned long i;
> +
> + /* Shared tags are stored at index 0 in @tags. */
> + if (blk_mq_is_shared_tags(set->flags))
> + blk_mq_free_map_and_rqs(set, et->tags[0], BLK_MQ_NO_HCTX_IDX);
> + else {
> + for (i = 0; i < et->nr_hw_queues; i++)
> + blk_mq_free_map_and_rqs(set, et->tags[i], i);
> + }
> +
> + kfree(et);
> +}
> +
> +struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
> + unsigned int nr_hw_queues)
> +{
> + unsigned int nr_tags;
> + int i;
> + struct elevator_tags *et;
> + gfp_t gfp = GFP_NOIO | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
> +
> + if (blk_mq_is_shared_tags(set->flags))
> + nr_tags = 1;
> + else
> + nr_tags = nr_hw_queues;
> +
> + et = kmalloc(sizeof(struct elevator_tags) +
> + nr_tags * sizeof(struct blk_mq_tags *), gfp);
> + if (!et)
> + return NULL;
> + /*
> + * Default to double of smaller one between hw queue_depth and
> + * 128, since we don't split into sync/async like the old code
> + * did. Additionally, this is a per-hw queue depth.
> + */
> + et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
> + BLKDEV_DEFAULT_RQ);
> + et->nr_hw_queues = nr_hw_queues;
> +
> + if (blk_mq_is_shared_tags(set->flags)) {
> + /* Shared tags are stored at index 0 in @tags. */
> + et->tags[0] = blk_mq_alloc_map_and_rqs(set, BLK_MQ_NO_HCTX_IDX,
> + MAX_SCHED_RQ);
> + if (!et->tags[0])
> + goto out;
> + } else {
> + for (i = 0; i < et->nr_hw_queues; i++) {
> + et->tags[i] = blk_mq_alloc_map_and_rqs(set, i,
> + et->nr_requests);
> + if (!et->tags[i])
> + goto out_unwind;
> + }
> + }
> +
> + return et;
> +out_unwind:
> + while (--i >= 0)
> + blk_mq_free_map_and_rqs(set, et->tags[i], i);
> +out:
> + kfree(et);
> + return NULL;
> +}
> +
As smatch stated, the unwind pattern is a bit odd.
Maybe move the unwind into the 'else' branch, and us a conditional
to invoke it:
if (i < et->nr_hw_queues)
while (--i >= 0)
blk_mq_free_map_and_request()
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv6 3/3] block: fix potential deadlock while running nr_hw_queue update
2025-06-30 5:21 ` [PATCHv6 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
2025-06-30 6:16 ` Christoph Hellwig
@ 2025-06-30 6:24 ` Hannes Reinecke
2025-06-30 6:57 ` Nilay Shroff
2025-07-01 3:24 ` Ming Lei
2 siblings, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2025-06-30 6:24 UTC (permalink / raw)
To: Nilay Shroff, linux-block; +Cc: hch, ming.lei, axboe, sth, lkp, gjoyce
On 6/30/25 07:21, Nilay Shroff wrote:
> Move scheduler tags (sched_tags) allocation and deallocation outside
> both the ->elevator_lock and ->freeze_lock when updating nr_hw_queues.
> This change breaks the dependency chain from the percpu allocator lock
> to the elevator lock, helping to prevent potential deadlocks, as
> observed in the reported lockdep splat[1].
>
> This commit introduces batch allocation and deallocation helpers for
> sched_tags, which are now used from within __blk_mq_update_nr_hw_queues
> routine while iterating through the tagset.
>
> With this change, all sched_tags memory management is handled entirely
> outside the ->elevator_lock and the ->freeze_lock context, thereby
> eliminating the lock dependency that could otherwise manifest during
> nr_hw_queues updates.
>
> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
>
> Reported-by: Stefan Haberland <sth@linux.ibm.com>
> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> block/blk-mq-sched.c | 63 ++++++++++++++++++++++++++++++++++++++++++++
> block/blk-mq-sched.h | 4 +++
> block/blk-mq.c | 11 +++++++-
> block/blk.h | 2 +-
> block/elevator.c | 4 +--
> 5 files changed, 80 insertions(+), 4 deletions(-)
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 2d6d1ebdd8fb..da802df34a8c 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -427,6 +427,30 @@ void blk_mq_free_sched_tags(struct elevator_tags *et,
> kfree(et);
> }
>
> +void blk_mq_free_sched_tags_batch(struct xarray *et_table,
> + struct blk_mq_tag_set *set)
> +{
> + struct request_queue *q;
> + struct elevator_tags *et;
> +
> + lockdep_assert_held_write(&set->update_nr_hwq_lock);
> +
> + list_for_each_entry(q, &set->tag_list, tag_set_list) {
> + /*
> + * Accessing q->elevator without holding q->elevator_lock is
> + * safe because we're holding here set->update_nr_hwq_lock in
> + * the writer context. So, scheduler update/switch code (which
> + * acquires the same lock but in the reader context) can't run
> + * concurrently.
> + */
> + if (q->elevator) {
> + et = xa_load(et_table, q->id);
> + if (et)
> + blk_mq_free_sched_tags(et, set);
> + }
> + }
> +}
> +
Odd. Do we allow empty sched_tags?
Why isn't this an error (or at least a warning)?
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv6 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
2025-06-30 6:20 ` Hannes Reinecke
@ 2025-06-30 6:48 ` Nilay Shroff
0 siblings, 0 replies; 15+ messages in thread
From: Nilay Shroff @ 2025-06-30 6:48 UTC (permalink / raw)
To: Hannes Reinecke, linux-block; +Cc: hch, ming.lei, axboe, sth, lkp, gjoyce
On 6/30/25 11:50 AM, Hannes Reinecke wrote:
>> +struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
>> + unsigned int nr_hw_queues)
>> +{
>> + unsigned int nr_tags;
>> + int i;
>> + struct elevator_tags *et;
>> + gfp_t gfp = GFP_NOIO | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
>> +
>> + if (blk_mq_is_shared_tags(set->flags))
>> + nr_tags = 1;
>> + else
>> + nr_tags = nr_hw_queues;
>> +
>> + et = kmalloc(sizeof(struct elevator_tags) +
>> + nr_tags * sizeof(struct blk_mq_tags *), gfp);
>> + if (!et)
>> + return NULL;
>> + /*
>> + * Default to double of smaller one between hw queue_depth and
>> + * 128, since we don't split into sync/async like the old code
>> + * did. Additionally, this is a per-hw queue depth.
>> + */
>> + et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
>> + BLKDEV_DEFAULT_RQ);
>> + et->nr_hw_queues = nr_hw_queues;
>> +
>> + if (blk_mq_is_shared_tags(set->flags)) {
>> + /* Shared tags are stored at index 0 in @tags. */
>> + et->tags[0] = blk_mq_alloc_map_and_rqs(set, BLK_MQ_NO_HCTX_IDX,
>> + MAX_SCHED_RQ);
>> + if (!et->tags[0])
>> + goto out;
>> + } else {
>> + for (i = 0; i < et->nr_hw_queues; i++) {
>> + et->tags[i] = blk_mq_alloc_map_and_rqs(set, i,
>> + et->nr_requests);
>> + if (!et->tags[i])
>> + goto out_unwind;
>> + }
>> + }
>> +
>> + return et;
>> +out_unwind:
>> + while (--i >= 0)
>> + blk_mq_free_map_and_rqs(set, et->tags[i], i);
>> +out:
>> + kfree(et);
>> + return NULL;
>> +}
>> +
>
> As smatch stated, the unwind pattern is a bit odd.
> Maybe move the unwind into the 'else' branch, and us a conditional
> to invoke it:
>
> if (i < et->nr_hw_queues)
> while (--i >= 0)
> blk_mq_free_map_and_request()
>
I believe the 'if (i < et->nr_hw_queues)' check is unnecessary here. When
we jump to the @out_unwind label, @i is always less than @et->nr_hw_queues
because the for loop exits early (on allocation failure) before reaching
the upper bound. If @i had reached @et->nr_hw_queues, the loop would have
completed and we wouldn't jump to @out_unwind at all — we’d simply return
@et instead.
The Smatch flagged the unwind loop due to the use of an unsigned @i in the
previous patch. In that case, if the first allocation (i == 0) fails, then
'--i' underflows to UINT_MAX, and the condition '--i >= 0' is always true —
hence the warning.
This patch corrects that by declaring @i as a signed int, so that
'--i >= 0' behaves as expected and avoids the Smatch warning.
So, I don't think an extra condition like 'if (i < et->nr_hw_queues)' is
needed around the unwind loop. Agreed?
Thnaks,
--Nilay
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv6 3/3] block: fix potential deadlock while running nr_hw_queue update
2025-06-30 6:24 ` Hannes Reinecke
@ 2025-06-30 6:57 ` Nilay Shroff
0 siblings, 0 replies; 15+ messages in thread
From: Nilay Shroff @ 2025-06-30 6:57 UTC (permalink / raw)
To: Hannes Reinecke, linux-block; +Cc: hch, ming.lei, axboe, sth, lkp, gjoyce
On 6/30/25 11:54 AM, Hannes Reinecke wrote:
> On 6/30/25 07:21, Nilay Shroff wrote:
>> Move scheduler tags (sched_tags) allocation and deallocation outside
>> both the ->elevator_lock and ->freeze_lock when updating nr_hw_queues.
>> This change breaks the dependency chain from the percpu allocator lock
>> to the elevator lock, helping to prevent potential deadlocks, as
>> observed in the reported lockdep splat[1].
>>
>> This commit introduces batch allocation and deallocation helpers for
>> sched_tags, which are now used from within __blk_mq_update_nr_hw_queues
>> routine while iterating through the tagset.
>>
>> With this change, all sched_tags memory management is handled entirely
>> outside the ->elevator_lock and the ->freeze_lock context, thereby
>> eliminating the lock dependency that could otherwise manifest during
>> nr_hw_queues updates.
>>
>> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
>>
>> Reported-by: Stefan Haberland <sth@linux.ibm.com>
>> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>> ---
>> block/blk-mq-sched.c | 63 ++++++++++++++++++++++++++++++++++++++++++++
>> block/blk-mq-sched.h | 4 +++
>> block/blk-mq.c | 11 +++++++-
>> block/blk.h | 2 +-
>> block/elevator.c | 4 +--
>> 5 files changed, 80 insertions(+), 4 deletions(-)
>>
>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>> index 2d6d1ebdd8fb..da802df34a8c 100644
>> --- a/block/blk-mq-sched.c
>> +++ b/block/blk-mq-sched.c
>> @@ -427,6 +427,30 @@ void blk_mq_free_sched_tags(struct elevator_tags *et,
>> kfree(et);
>> }
>> +void blk_mq_free_sched_tags_batch(struct xarray *et_table,
>> + struct blk_mq_tag_set *set)
>> +{
>> + struct request_queue *q;
>> + struct elevator_tags *et;
>> +
>> + lockdep_assert_held_write(&set->update_nr_hwq_lock);
>> +
>> + list_for_each_entry(q, &set->tag_list, tag_set_list) {
>> + /*
>> + * Accessing q->elevator without holding q->elevator_lock is
>> + * safe because we're holding here set->update_nr_hwq_lock in
>> + * the writer context. So, scheduler update/switch code (which
>> + * acquires the same lock but in the reader context) can't run
>> + * concurrently.
>> + */
>> + if (q->elevator) {
>> + et = xa_load(et_table, q->id);
>> + if (et)
>> + blk_mq_free_sched_tags(et, set);
>> + }
>> + }
>> +}
>> +
> Odd. Do we allow empty sched_tags?
> Why isn't this an error (or at least a warning)?
>
We don't allow empty sched_tags and xa_load is not supposed to return an
empty sched_tags. It's just a paranoia guard that we evaluate whether @et
is non-NULL. As this function returns nothing we don't return any error
code to the caller here. But yes, we may add a warning here if that seems
reasonable. I'd add a 'WARN_ON(!et)' and update it in the next patch.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv6 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
2025-06-30 5:21 ` [PATCHv6 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
2025-06-30 6:15 ` Christoph Hellwig
2025-06-30 6:20 ` Hannes Reinecke
@ 2025-07-01 3:19 ` Ming Lei
2 siblings, 0 replies; 15+ messages in thread
From: Ming Lei @ 2025-07-01 3:19 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, hch, axboe, sth, lkp, gjoyce
On Mon, Jun 30, 2025 at 10:51:55AM +0530, Nilay Shroff wrote:
> Recent lockdep reports [1] have revealed a potential deadlock caused by a
> lock dependency between the percpu allocator lock and the elevator lock.
> This issue can be avoided by ensuring that the allocation and release of
> scheduler tags (sched_tags) are performed outside the elevator lock.
> Furthermore, the queue does not need to be remain frozen during these
> operations.
>
> To address this, move all sched_tags allocations and deallocations outside
> of both the ->elevator_lock and the ->freeze_lock. Since the lifetime of
> the elevator queue and its associated sched_tags is closely tied, the
> allocated sched_tags are now stored in the elevator queue structure. Then,
> during the actual elevator switch (which runs under ->freeze_lock and
> ->elevator_lock), the pre-allocated sched_tags are assigned to the
> appropriate q->hctx. Once the elevator switch is complete and the locks
> are released, the old elevator queue and its associated sched_tags are
> freed.
>
> This commit specifically addresses the allocation/deallocation of sched_
> tags during elevator switching. Note that sched_tags may also be allocated
> in other contexts, such as during nr_hw_queues updates. Supporting that
> use case will require batch allocation/deallocation, which will be handled
> in a follow-up patch.
>
> This restructuring ensures that sched_tags memory management occurs
> entirely outside of the ->elevator_lock and ->freeze_lock context,
> eliminating the lock dependency problem seen during scheduler updates.
>
> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
>
> Reported-by: Stefan Haberland <sth@linux.ibm.com>
> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Thanks,
Ming
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv6 3/3] block: fix potential deadlock while running nr_hw_queue update
2025-06-30 5:21 ` [PATCHv6 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
2025-06-30 6:16 ` Christoph Hellwig
2025-06-30 6:24 ` Hannes Reinecke
@ 2025-07-01 3:24 ` Ming Lei
2025-07-01 5:20 ` Nilay Shroff
2 siblings, 1 reply; 15+ messages in thread
From: Ming Lei @ 2025-07-01 3:24 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, hch, axboe, sth, lkp, gjoyce
On Mon, Jun 30, 2025 at 10:51:56AM +0530, Nilay Shroff wrote:
> Move scheduler tags (sched_tags) allocation and deallocation outside
> both the ->elevator_lock and ->freeze_lock when updating nr_hw_queues.
> This change breaks the dependency chain from the percpu allocator lock
> to the elevator lock, helping to prevent potential deadlocks, as
> observed in the reported lockdep splat[1].
>
> This commit introduces batch allocation and deallocation helpers for
> sched_tags, which are now used from within __blk_mq_update_nr_hw_queues
> routine while iterating through the tagset.
>
> With this change, all sched_tags memory management is handled entirely
> outside the ->elevator_lock and the ->freeze_lock context, thereby
> eliminating the lock dependency that could otherwise manifest during
> nr_hw_queues updates.
>
> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
>
> Reported-by: Stefan Haberland <sth@linux.ibm.com>
> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> block/blk-mq-sched.c | 63 ++++++++++++++++++++++++++++++++++++++++++++
> block/blk-mq-sched.h | 4 +++
> block/blk-mq.c | 11 +++++++-
> block/blk.h | 2 +-
> block/elevator.c | 4 +--
> 5 files changed, 80 insertions(+), 4 deletions(-)
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 2d6d1ebdd8fb..da802df34a8c 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -427,6 +427,30 @@ void blk_mq_free_sched_tags(struct elevator_tags *et,
> kfree(et);
> }
>
> +void blk_mq_free_sched_tags_batch(struct xarray *et_table,
> + struct blk_mq_tag_set *set)
> +{
> + struct request_queue *q;
> + struct elevator_tags *et;
> +
> + lockdep_assert_held_write(&set->update_nr_hwq_lock);
> +
> + list_for_each_entry(q, &set->tag_list, tag_set_list) {
> + /*
> + * Accessing q->elevator without holding q->elevator_lock is
> + * safe because we're holding here set->update_nr_hwq_lock in
> + * the writer context. So, scheduler update/switch code (which
> + * acquires the same lock but in the reader context) can't run
> + * concurrently.
> + */
> + if (q->elevator) {
> + et = xa_load(et_table, q->id);
> + if (et)
> + blk_mq_free_sched_tags(et, set);
> + }
> + }
> +}
> +
> struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
> unsigned int nr_hw_queues)
> {
> @@ -477,6 +501,45 @@ struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
> return NULL;
> }
>
> +int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
> + struct blk_mq_tag_set *set, unsigned int nr_hw_queues)
> +{
> + struct request_queue *q;
> + struct elevator_tags *et;
> + gfp_t gfp = GFP_NOIO | __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
> +
> + lockdep_assert_held_write(&set->update_nr_hwq_lock);
> +
> + list_for_each_entry(q, &set->tag_list, tag_set_list) {
> + /*
> + * Accessing q->elevator without holding q->elevator_lock is
> + * safe because we're holding here set->update_nr_hwq_lock in
> + * the writer context. So, scheduler update/switch code (which
> + * acquires the same lock but in the reader context) can't run
> + * concurrently.
> + */
> + if (q->elevator) {
> + et = blk_mq_alloc_sched_tags(set, nr_hw_queues);
> + if (!et)
> + goto out_unwind;
> + if (xa_insert(et_table, q->id, et, gfp))
> + goto out_free_tags;
> + }
> + }
> + return 0;
> +out_free_tags:
> + blk_mq_free_sched_tags(et, set);
> +out_unwind:
> + list_for_each_entry_continue_reverse(q, &set->tag_list, tag_set_list) {
> + if (q->elevator) {
> + et = xa_load(et_table, q->id);
> + if (et)
> + blk_mq_free_sched_tags(et, set);
> + }
> + }
> + return -ENOMEM;
> +}
> +
> /* caller must have a reference to @e, will grab another one if successful */
> int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
> struct elevator_tags *et)
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 0cde00cd1c47..b554e1d55950 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -25,8 +25,12 @@ void blk_mq_sched_free_rqs(struct request_queue *q);
>
> struct elevator_tags *blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
> unsigned int nr_hw_queues);
> +int blk_mq_alloc_sched_tags_batch(struct xarray *et_table,
> + struct blk_mq_tag_set *set, unsigned int nr_hw_queues);
> void blk_mq_free_sched_tags(struct elevator_tags *et,
> struct blk_mq_tag_set *set);
> +void blk_mq_free_sched_tags_batch(struct xarray *et_table,
> + struct blk_mq_tag_set *set);
>
> static inline void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
> {
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4806b867e37d..a68b658ce07b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4972,6 +4972,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> struct request_queue *q;
> int prev_nr_hw_queues = set->nr_hw_queues;
> unsigned int memflags;
> + struct xarray et_table;
> int i;
>
> lockdep_assert_held(&set->tag_list_lock);
> @@ -4984,6 +4985,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> return;
>
> memflags = memalloc_noio_save();
> +
> + xa_init(&et_table);
> + if (blk_mq_alloc_sched_tags_batch(&et_table, set, nr_hw_queues) < 0)
> + goto out_memalloc_restore;
> +
> list_for_each_entry(q, &set->tag_list, tag_set_list) {
> blk_mq_debugfs_unregister_hctxs(q);
> blk_mq_sysfs_unregister_hctxs(q);
> @@ -4995,6 +5001,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) {
> list_for_each_entry(q, &set->tag_list, tag_set_list)
> blk_mq_unfreeze_queue_nomemrestore(q);
> + blk_mq_free_sched_tags_batch(&et_table, set);
> goto reregister;
> }
>
> @@ -5019,7 +5026,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>
> /* elv_update_nr_hw_queues() unfreeze queue for us */
> list_for_each_entry(q, &set->tag_list, tag_set_list)
> - elv_update_nr_hw_queues(q);
> + elv_update_nr_hw_queues(q, &et_table);
>
> reregister:
> list_for_each_entry(q, &set->tag_list, tag_set_list) {
> @@ -5029,7 +5036,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> blk_mq_remove_hw_queues_cpuhp(q);
> blk_mq_add_hw_queues_cpuhp(q);
> }
> +out_memalloc_restore:
> memalloc_noio_restore(memflags);
> + xa_destroy(&et_table);
>
> /* Free the excess tags when nr_hw_queues shrink. */
> for (i = set->nr_hw_queues; i < prev_nr_hw_queues; i++)
> diff --git a/block/blk.h b/block/blk.h
> index 37ec459fe656..c6d1d1458388 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -321,7 +321,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
>
> bool blk_insert_flush(struct request *rq);
>
> -void elv_update_nr_hw_queues(struct request_queue *q);
> +void elv_update_nr_hw_queues(struct request_queue *q, struct xarray *et_table);
> void elevator_set_default(struct request_queue *q);
> void elevator_set_none(struct request_queue *q);
>
> diff --git a/block/elevator.c b/block/elevator.c
> index 50f4b78efe66..8ba8b869d5a4 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -705,7 +705,7 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
> * The I/O scheduler depends on the number of hardware queues, this forces a
> * reattachment when nr_hw_queues changes.
> */
> -void elv_update_nr_hw_queues(struct request_queue *q)
> +void elv_update_nr_hw_queues(struct request_queue *q, struct xarray *et_table)
et_table isn't necessary to expose to elv_update_nr_hw_queues(), and it is
less readable than passing 'struct elevator_tags *' directly, but it can be
one followup improvement.
Anyway:
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Thanks,
Ming
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCHv6 3/3] block: fix potential deadlock while running nr_hw_queue update
2025-07-01 3:24 ` Ming Lei
@ 2025-07-01 5:20 ` Nilay Shroff
0 siblings, 0 replies; 15+ messages in thread
From: Nilay Shroff @ 2025-07-01 5:20 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, hch, axboe, sth, lkp, gjoyce
On 7/1/25 8:54 AM, Ming Lei wrote:
>> diff --git a/block/elevator.c b/block/elevator.c
>> index 50f4b78efe66..8ba8b869d5a4 100644
>> --- a/block/elevator.c
>> +++ b/block/elevator.c
>> @@ -705,7 +705,7 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
>> * The I/O scheduler depends on the number of hardware queues, this forces a
>> * reattachment when nr_hw_queues changes.
>> */
>> -void elv_update_nr_hw_queues(struct request_queue *q)
>> +void elv_update_nr_hw_queues(struct request_queue *q, struct xarray *et_table)
>
> et_table isn't necessary to expose to elv_update_nr_hw_queues(), and it is
> less readable than passing 'struct elevator_tags *' directly, but it can be
> one followup improvement.
>
Yeah, makes sense... I will update this in the subsequent patchset.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2025-07-01 5:20 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-30 5:21 [PATCHv6 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
2025-06-30 5:21 ` [PATCHv6 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
2025-06-30 6:13 ` Christoph Hellwig
2025-06-30 6:17 ` Hannes Reinecke
2025-06-30 5:21 ` [PATCHv6 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
2025-06-30 6:15 ` Christoph Hellwig
2025-06-30 6:20 ` Hannes Reinecke
2025-06-30 6:48 ` Nilay Shroff
2025-07-01 3:19 ` Ming Lei
2025-06-30 5:21 ` [PATCHv6 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
2025-06-30 6:16 ` Christoph Hellwig
2025-06-30 6:24 ` Hannes Reinecke
2025-06-30 6:57 ` Nilay Shroff
2025-07-01 3:24 ` Ming Lei
2025-07-01 5:20 ` Nilay Shroff
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox