* [PATCHv3 0/2] block: move sched_tags allocation/de-allocation outside of locking context
@ 2025-06-16 17:32 Nilay Shroff
2025-06-16 17:32 ` [PATCHv3 1/2] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
2025-06-16 17:32 ` [PATCHv3 2/2] block: fix lock dependency between percpu alloc lock and elevator lock Nilay Shroff
0 siblings, 2 replies; 14+ messages in thread
From: Nilay Shroff @ 2025-06-16 17:32 UTC (permalink / raw)
To: linux-block; +Cc: ming.lei, hch, axboe, sth, gjoyce
There have been a few reports of potential lockdep splats caused by the
percpu allocator lock's dependency on the elevator lock. This patchset
aims to fix that dependency.
This patchset contains two patches. The first is a preparatory patch
that just moves the elevator queue allocation logic from ->init_sched
into blk_mq_init_sched. The second patch in the series restructures the
sched tags management code so that sched tags allocation and de-allocation
now occur entirely outside of ->freeze_lock and ->elevator_lock,
eliminating the percpu lock dependency on the elevator lock that could
otherwise be triggered during scheduler updates or hardware queue updates.
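The overall pattern of the fix can be sketched in userspace C, with a
pthread mutex standing in for ->elevator_lock and plain malloc standing in
for the percpu-backed tag allocation (all names here are illustrative, not
the real block-layer API): allocate with no locks held, do only pointer
assignment under the lock, and free the old state after unlocking.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for sched_tags and the elevator lock. */
struct tags { int depth; int *slots; };

static pthread_mutex_t elevator_lock = PTHREAD_MUTEX_INITIALIZER;
static struct tags *active_tags;

/* Step 1: allocate with no locks held, so the allocator's own locks
 * can never nest inside elevator_lock. */
static struct tags *tags_alloc(int depth)
{
	struct tags *t = malloc(sizeof(*t));

	if (!t)
		return NULL;
	t->depth = depth;
	t->slots = calloc(depth, sizeof(*t->slots));
	if (!t->slots) {
		free(t);
		return NULL;
	}
	return t;
}

/* Step 2: the switch under the lock is only pointer assignment. */
static struct tags *tags_switch(struct tags *new_tags)
{
	struct tags *old;

	pthread_mutex_lock(&elevator_lock);
	old = active_tags;
	active_tags = new_tags;
	pthread_mutex_unlock(&elevator_lock);
	return old;	/* Step 3: caller frees 'old' outside the lock */
}

static void tags_free(struct tags *t)
{
	if (t)
		free(t->slots);
	free(t);
}
```

Because no allocation or free happens while elevator_lock is held, no
ordering between the allocator's internal locks and elevator_lock is ever
established, which is exactly what breaks the reported lockdep chain.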
Changes since v2:
- Split the patch into a two-patch series. The first patch updates the
->init_sched elevator API and the second patch handles the sched
tags allocation/de-allocation logic (Ming Lei)
- Address the sched tags allocation/de-allocation logic while running in
the context of an nr_hw_queues update so that we can handle all possible
scenarios in a single patchset (Ming Lei)
Link to v2: https://lore.kernel.org/all/20250528123638.1029700-1-nilay@linux.ibm.com/
Changes since v1:
- As the lifetimes of the elevator queue and sched tags are the same,
allocate sched tags and store them under struct elevator_queue (Ming Lei)
Link to v1: https://lore.kernel.org/all/20250520103425.1259712-1-nilay@linux.ibm.com/
Nilay Shroff (2):
block: move elevator queue allocation logic into blk_mq_init_sched
block: fix lock dependency between percpu alloc lock and elevator lock
block/bfq-iosched.c | 13 +--
block/blk-mq-sched.c | 247 +++++++++++++++++++++++++++++++-----------
block/blk-mq-sched.h | 13 ++-
block/blk-mq.c | 14 ++-
block/blk.h | 4 +-
block/elevator.c | 82 ++++++++++----
block/elevator.h | 30 ++++-
block/kyber-iosched.c | 11 +-
block/mq-deadline.c | 14 +--
9 files changed, 308 insertions(+), 120 deletions(-)
--
2.49.0
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCHv3 1/2] block: move elevator queue allocation logic into blk_mq_init_sched
2025-06-16 17:32 [PATCHv3 0/2] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
@ 2025-06-16 17:32 ` Nilay Shroff
2025-06-17 15:07 ` Ming Lei
2025-06-23 5:56 ` Christoph Hellwig
2025-06-16 17:32 ` [PATCHv3 2/2] block: fix lock dependency between percpu alloc lock and elevator lock Nilay Shroff
1 sibling, 2 replies; 14+ messages in thread
From: Nilay Shroff @ 2025-06-16 17:32 UTC (permalink / raw)
To: linux-block; +Cc: ming.lei, hch, axboe, sth, gjoyce
In preparation for allocating sched_tags before freezing the request
queue and acquiring ->elevator_lock, move the elevator queue allocation
logic from the elevator ops ->init_sched callback into blk_mq_init_sched.
This refactoring provides a centralized location for elevator queue
initialization, which makes it easier to store pre-allocated sched_tags
in the struct elevator_queue during later changes.
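The shape of this refactor can be sketched in userspace C under simplified,
hypothetical types (the real elevator_queue and elevator ops are far
richer): the core allocates the elevator queue once and passes it to the
per-scheduler callback, which now only fills in its private data, so the
allocation and error path that every scheduler previously duplicated live
in a single place.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical simplified type, not the real kernel struct. */
struct elevator_queue { void *elevator_data; };

/* After the change, ->init_sched receives an already-allocated eq. */
typedef int (*init_sched_fn)(struct elevator_queue *eq);

/* Example callback: only allocates and attaches its private data. */
static int sched_private_init(struct elevator_queue *eq)
{
	eq->elevator_data = calloc(1, 64);
	return eq->elevator_data ? 0 : -ENOMEM;
}

/* The core allocates eq once, centralizing the teardown that each
 * scheduler previously did itself (elevator_alloc + kobject_put). */
static struct elevator_queue *core_init_sched(init_sched_fn init)
{
	struct elevator_queue *eq = calloc(1, sizeof(*eq));

	if (!eq)
		return NULL;
	if (init(eq)) {
		free(eq);	/* single centralized error path */
		return NULL;
	}
	return eq;
}
```

With the allocation hoisted into the caller, storing extra pre-allocated
state (such as sched_tags) in the elevator queue later only touches this
one central function.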
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/bfq-iosched.c | 13 +++----------
block/blk-mq-sched.c | 7 ++++++-
block/elevator.h | 2 +-
block/kyber-iosched.c | 11 ++---------
block/mq-deadline.c | 14 ++------------
5 files changed, 14 insertions(+), 33 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0cb1e9873aab..fd26dc1901b0 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7232,22 +7232,16 @@ static void bfq_init_root_group(struct bfq_group *root_group,
root_group->sched_data.bfq_class_idle_last_service = jiffies;
}
-static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
+static int bfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
{
struct bfq_data *bfqd;
- struct elevator_queue *eq;
unsigned int i;
struct blk_independent_access_ranges *ia_ranges = q->disk->ia_ranges;
- eq = elevator_alloc(q, e);
- if (!eq)
- return -ENOMEM;
-
bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
- if (!bfqd) {
- kobject_put(&eq->kobj);
+ if (!bfqd)
return -ENOMEM;
- }
+
eq->elevator_data = bfqd;
spin_lock_irq(&q->queue_lock);
@@ -7405,7 +7399,6 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
out_free:
kfree(bfqd);
- kobject_put(&eq->kobj);
return -ENOMEM;
}
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 55a0fd105147..d914eb9d61a6 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -475,6 +475,10 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
BLKDEV_DEFAULT_RQ);
+ eq = elevator_alloc(q, e);
+ if (!eq)
+ return -ENOMEM;
+
if (blk_mq_is_shared_tags(flags)) {
ret = blk_mq_init_sched_shared_tags(q);
if (ret)
@@ -487,7 +491,7 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
goto err_free_map_and_rqs;
}
- ret = e->ops.init_sched(q, e);
+ ret = e->ops.init_sched(q, eq);
if (ret)
goto err_free_map_and_rqs;
@@ -509,6 +513,7 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
blk_mq_sched_free_rqs(q);
blk_mq_sched_tags_teardown(q, flags);
+ kobject_put(&eq->kobj);
q->elevator = NULL;
return ret;
}
diff --git a/block/elevator.h b/block/elevator.h
index a07ce773a38f..a4de5f9ad790 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -24,7 +24,7 @@ struct blk_mq_alloc_data;
struct blk_mq_hw_ctx;
struct elevator_mq_ops {
- int (*init_sched)(struct request_queue *, struct elevator_type *);
+ int (*init_sched)(struct request_queue *, struct elevator_queue *);
void (*exit_sched)(struct elevator_queue *);
int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 4dba8405bd01..7b6832cb3a8d 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -402,20 +402,13 @@ static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
return ERR_PTR(ret);
}
-static int kyber_init_sched(struct request_queue *q, struct elevator_type *e)
+static int kyber_init_sched(struct request_queue *q, struct elevator_queue *eq)
{
struct kyber_queue_data *kqd;
- struct elevator_queue *eq;
-
- eq = elevator_alloc(q, e);
- if (!eq)
- return -ENOMEM;
kqd = kyber_queue_data_alloc(q);
- if (IS_ERR(kqd)) {
- kobject_put(&eq->kobj);
+ if (IS_ERR(kqd))
return PTR_ERR(kqd);
- }
blk_stat_enable_accounting(q);
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 2edf1cac06d5..7b6caf30e00a 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -568,20 +568,14 @@ static void dd_exit_sched(struct elevator_queue *e)
/*
* initialize elevator private data (deadline_data).
*/
-static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
+static int dd_init_sched(struct request_queue *q, struct elevator_queue *eq)
{
struct deadline_data *dd;
- struct elevator_queue *eq;
enum dd_prio prio;
- int ret = -ENOMEM;
-
- eq = elevator_alloc(q, e);
- if (!eq)
- return ret;
dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
if (!dd)
- goto put_eq;
+ return -ENOMEM;
eq->elevator_data = dd;
@@ -608,10 +602,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
q->elevator = eq;
return 0;
-
-put_eq:
- kobject_put(&eq->kobj);
- return ret;
}
/*
--
2.49.0
* [PATCHv3 2/2] block: fix lock dependency between percpu alloc lock and elevator lock
2025-06-16 17:32 [PATCHv3 0/2] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
2025-06-16 17:32 ` [PATCHv3 1/2] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
@ 2025-06-16 17:32 ` Nilay Shroff
2025-06-18 3:06 ` Ming Lei
2025-06-23 6:10 ` Christoph Hellwig
1 sibling, 2 replies; 14+ messages in thread
From: Nilay Shroff @ 2025-06-16 17:32 UTC (permalink / raw)
To: linux-block; +Cc: ming.lei, hch, axboe, sth, gjoyce
Recent lockdep reports [1] have revealed a potential deadlock caused by a
lock dependency between the percpu allocator lock and the elevator lock.
This issue can be avoided by ensuring that the allocation and release of
scheduler tags (sched_tags) are performed outside the elevator lock.
Furthermore, the queue does not need to remain frozen during these
operations.
To address this, move all sched_tags allocations and deallocations outside
of both the ->elevator_lock and the ->freeze_lock. Moreover, since the
lifetimes of the elevator queue and its associated sched_tags are closely
tied, the allocated sched_tags are now stored in the elevator queue
structure. Then, during the actual elevator switch (which runs under
->freeze_lock and ->elevator_lock), the pre-allocated sched_tags are
assigned to the appropriate q->hctx. Once the elevator switch is complete
and the locks are released, the old elevator queue and its associated
sched_tags are freed.
This restructuring ensures that sched_tags memory management occurs
entirely outside of the ->elevator_lock and ->freeze_lock context,
eliminating the lock dependency problem seen during scheduler updates
or hardware queue updates.
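For the nr_hw_queues-update path, where several queues share one tag_set,
the scheme amounts to pre-allocating one elevator_tags per queue (keyed by
q->id) before freezing, then doing a lock-free-of-allocations lookup during
the locked switch. A minimal userspace sketch with hypothetical simplified
types:

```c
#include <stdlib.h>

/* Hypothetical stand-in: only the lookup key matters here. */
struct elevator_tags { int id; /* q->id the tags were allocated for */ };

/* Before freezing: one elevator_tags per queue with a scheduler. */
static struct elevator_tags **prealloc_tags(const int *qids, int count)
{
	struct elevator_tags **tags = calloc(count, sizeof(*tags));
	int i;

	if (!tags)
		return NULL;
	for (i = 0; i < count; i++) {
		tags[i] = calloc(1, sizeof(**tags));
		if (!tags[i])
			goto fail;
		tags[i]->id = qids[i];
	}
	return tags;
fail:
	while (i--)		/* free only the entries already made */
		free(tags[i]);
	free(tags);
	return NULL;
}

/* Under the lock: no allocation, just find this queue's entry. */
static struct elevator_tags *tags_lookup(struct elevator_tags **tags,
					 int count, int qid)
{
	int i;

	for (i = 0; i < count; i++)
		if (tags[i]->id == qid)
			return tags[i];
	return NULL;
}
```

Any entry that was pre-allocated but never consumed by a successful switch
is freed after the locks are dropped, mirroring the "tags && !ctx.new"
cleanup in the patch.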
[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Reported-by: Stefan Haberland <sth@linux.ibm.com>
Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-mq-sched.c | 244 +++++++++++++++++++++++++++++++------------
block/blk-mq-sched.h | 13 ++-
block/blk-mq.c | 14 ++-
block/blk.h | 4 +-
block/elevator.c | 82 +++++++++++----
block/elevator.h | 28 ++++-
6 files changed, 296 insertions(+), 89 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index d914eb9d61a6..dd00d8c9fb78 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -374,27 +374,17 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq,
}
EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge);
-static int blk_mq_sched_alloc_map_and_rqs(struct request_queue *q,
- struct blk_mq_hw_ctx *hctx,
- unsigned int hctx_idx)
+static void blk_mq_init_sched_tags(struct request_queue *q,
+ struct blk_mq_hw_ctx *hctx,
+ unsigned int hctx_idx,
+ struct elevator_queue *eq)
{
if (blk_mq_is_shared_tags(q->tag_set->flags)) {
hctx->sched_tags = q->sched_shared_tags;
- return 0;
+ return;
}
- hctx->sched_tags = blk_mq_alloc_map_and_rqs(q->tag_set, hctx_idx,
- q->nr_requests);
-
- if (!hctx->sched_tags)
- return -ENOMEM;
- return 0;
-}
-
-static void blk_mq_exit_sched_shared_tags(struct request_queue *queue)
-{
- blk_mq_free_rq_map(queue->sched_shared_tags);
- queue->sched_shared_tags = NULL;
+ hctx->sched_tags = eq->tags->u.tags[hctx_idx];
}
/* called in queue's release handler, tagset has gone away */
@@ -403,35 +393,18 @@ static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int fla
struct blk_mq_hw_ctx *hctx;
unsigned long i;
- queue_for_each_hw_ctx(q, hctx, i) {
- if (hctx->sched_tags) {
- if (!blk_mq_is_shared_tags(flags))
- blk_mq_free_rq_map(hctx->sched_tags);
- hctx->sched_tags = NULL;
- }
- }
+ queue_for_each_hw_ctx(q, hctx, i)
+ hctx->sched_tags = NULL;
if (blk_mq_is_shared_tags(flags))
- blk_mq_exit_sched_shared_tags(q);
+ q->sched_shared_tags = NULL;
}
-static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
+static void blk_mq_init_sched_shared_tags(struct request_queue *queue,
+ struct elevator_queue *eq)
{
- struct blk_mq_tag_set *set = queue->tag_set;
-
- /*
- * Set initial depth at max so that we don't need to reallocate for
- * updating nr_requests.
- */
- queue->sched_shared_tags = blk_mq_alloc_map_and_rqs(set,
- BLK_MQ_NO_HCTX_IDX,
- MAX_SCHED_RQ);
- if (!queue->sched_shared_tags)
- return -ENOMEM;
-
+ queue->sched_shared_tags = eq->tags->u.shared_tags;
blk_mq_tag_update_sched_shared_tags(queue);
-
- return 0;
}
void blk_mq_sched_reg_debugfs(struct request_queue *q)
@@ -458,8 +431,165 @@ void blk_mq_sched_unreg_debugfs(struct request_queue *q)
mutex_unlock(&q->debugfs_mutex);
}
+void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set,
+ struct elevator_tags *tags)
+{
+ unsigned long i;
+
+ if (!tags)
+ return;
+
+ if (blk_mq_is_shared_tags(set->flags)) {
+ if (tags->u.shared_tags) {
+ blk_mq_free_rqs(set, tags->u.shared_tags,
+ BLK_MQ_NO_HCTX_IDX);
+ blk_mq_free_rq_map(tags->u.shared_tags);
+ }
+ goto out;
+ }
+
+ if (!tags->u.tags)
+ goto out;
+
+ for (i = 0; i < tags->nr_hw_queues; i++) {
+ if (tags->u.tags[i]) {
+ blk_mq_free_rqs(set, tags->u.tags[i], i);
+ blk_mq_free_rq_map(tags->u.tags[i]);
+ }
+ }
+
+ kfree(tags->u.tags);
+out:
+ kfree(tags);
+}
+
+void blk_mq_free_sched_tags(struct blk_mq_tag_set *set,
+ struct elevator_tags **tags)
+{
+ unsigned long i, count = 0;
+ struct request_queue *q;
+
+ lockdep_assert_held_write(&set->update_nr_hwq_lock);
+
+ if (!tags)
+ return;
+
+ /*
+ * Accessing q->elevator without holding q->elevator_lock is safe
+ * because we're holding here set->update_nr_hwq_lock in the writer
+ * context. So, scheduler update/switch code (which acquires the same
+ * lock but in the reader context) can't run concurrently.
+ */
+ list_for_each_entry(q, &set->tag_list, tag_set_list) {
+ if (q->elevator)
+ count++;
+ }
+
+ for (i = 0; i < count; i++)
+ __blk_mq_free_sched_tags(set, tags[i]);
+
+ kfree(tags);
+}
+
+struct elevator_tags *__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
+ unsigned int nr_hw_queues, int id)
+{
+ unsigned int i;
+ struct elevator_tags *tags;
+ gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
+
+ tags = kcalloc(1, sizeof(struct elevator_tags), gfp);
+ if (!tags)
+ return NULL;
+
+ tags->id = id;
+ /*
+ * Default to double of smaller one between hw queue_depth and
+ * 128, since we don't split into sync/async like the old code
+ * did. Additionally, this is a per-hw queue depth.
+ */
+ tags->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
+ BLKDEV_DEFAULT_RQ);
+
+ if (blk_mq_is_shared_tags(set->flags)) {
+
+ tags->u.shared_tags = blk_mq_alloc_map_and_rqs(set,
+ BLK_MQ_NO_HCTX_IDX,
+ MAX_SCHED_RQ);
+ if (!tags->u.shared_tags)
+ goto out;
+
+ return tags;
+ }
+
+ tags->u.tags = kcalloc(nr_hw_queues, sizeof(struct blk_mq_tags *), gfp);
+ if (!tags->u.tags)
+ goto out;
+
+ tags->nr_hw_queues = nr_hw_queues;
+ for (i = 0; i < nr_hw_queues; i++) {
+ tags->u.tags[i] = blk_mq_alloc_map_and_rqs(set, i,
+ tags->nr_requests);
+ if (!tags->u.tags[i])
+ goto out;
+ }
+
+ return tags;
+
+out:
+ __blk_mq_free_sched_tags(set, tags);
+ return NULL;
+}
+
+int blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
+ unsigned int nr_hw_queues,
+ struct elevator_tags ***elv_tags)
+{
+ unsigned long idx = 0, count = 0;
+ struct request_queue *q;
+ struct elevator_tags **tags;
+ gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
+
+ lockdep_assert_held_write(&set->update_nr_hwq_lock);
+ /*
+ * Accessing q->elevator without holding q->elevator_lock is safe
+ * because we're holding here set->update_nr_hwq_lock in the writer
+ * context. So, scheduler update/switch code (which acquires the same
+ * lock but in the reader context) can't run concurrently.
+ */
+ list_for_each_entry(q, &set->tag_list, tag_set_list) {
+ if (q->elevator)
+ count++;
+ }
+
+ if (!count)
+ return 0;
+
+ tags = kcalloc(count, sizeof(struct elevator_tags *), gfp);
+ if (!tags)
+ return -ENOMEM;
+
+ list_for_each_entry(q, &set->tag_list, tag_set_list) {
+ if (q->elevator) {
+ tags[idx] = __blk_mq_alloc_sched_tags(set, nr_hw_queues,
+ q->id);
+ if (!tags[idx])
+ goto out;
+
+ idx++;
+ }
+ }
+
+ *elv_tags = tags;
+ return count;
+out:
+ blk_mq_free_sched_tags(set, tags);
+ return -ENOMEM;
+}
+
/* caller must have a reference to @e, will grab another one if successful */
-int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
+ struct elevator_tags *tags)
{
unsigned int flags = q->tag_set->flags;
struct blk_mq_hw_ctx *hctx;
@@ -467,40 +597,26 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
unsigned long i;
int ret;
- /*
- * Default to double of smaller one between hw queue_depth and 128,
- * since we don't split into sync/async like the old code did.
- * Additionally, this is a per-hw queue depth.
- */
- q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
- BLKDEV_DEFAULT_RQ);
-
- eq = elevator_alloc(q, e);
+ eq = elevator_alloc(q, e, tags);
if (!eq)
return -ENOMEM;
- if (blk_mq_is_shared_tags(flags)) {
- ret = blk_mq_init_sched_shared_tags(q);
- if (ret)
- return ret;
- }
+ q->nr_requests = tags->nr_requests;
- queue_for_each_hw_ctx(q, hctx, i) {
- ret = blk_mq_sched_alloc_map_and_rqs(q, hctx, i);
- if (ret)
- goto err_free_map_and_rqs;
- }
+ if (blk_mq_is_shared_tags(flags))
+ blk_mq_init_sched_shared_tags(q, eq);
+
+ queue_for_each_hw_ctx(q, hctx, i)
+ blk_mq_init_sched_tags(q, hctx, i, eq);
ret = e->ops.init_sched(q, eq);
if (ret)
- goto err_free_map_and_rqs;
+ goto out;
queue_for_each_hw_ctx(q, hctx, i) {
if (e->ops.init_hctx) {
ret = e->ops.init_hctx(hctx, i);
if (ret) {
- eq = q->elevator;
- blk_mq_sched_free_rqs(q);
blk_mq_exit_sched(q, eq);
kobject_put(&eq->kobj);
return ret;
@@ -509,10 +625,8 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
}
return 0;
-err_free_map_and_rqs:
- blk_mq_sched_free_rqs(q);
+out:
blk_mq_sched_tags_teardown(q, flags);
-
kobject_put(&eq->kobj);
q->elevator = NULL;
return ret;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 1326526bb733..01c43192b98a 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -18,7 +18,18 @@ void __blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
-int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
+void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set,
+ struct elevator_tags *tags);
+void blk_mq_free_sched_tags(struct blk_mq_tag_set *set,
+ struct elevator_tags **elv_t);
+int blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
+ unsigned int nr_hw_queues,
+ struct elevator_tags ***tags);
+struct elevator_tags *__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
+ unsigned int nr_hw_queues, int id);
+
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
+ struct elevator_tags *elv_t);
void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
void blk_mq_sched_free_rqs(struct request_queue *q);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4806b867e37d..b8b616c79742 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4970,6 +4970,8 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
int nr_hw_queues)
{
struct request_queue *q;
+ int count;
+ struct elevator_tags **elv_tags = NULL;
int prev_nr_hw_queues = set->nr_hw_queues;
unsigned int memflags;
int i;
@@ -4984,6 +4986,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
return;
memflags = memalloc_noio_save();
+
+ count = blk_mq_alloc_sched_tags(set, nr_hw_queues, &elv_tags);
+ if (count < 0)
+ goto out;
+
list_for_each_entry(q, &set->tag_list, tag_set_list) {
blk_mq_debugfs_unregister_hctxs(q);
blk_mq_sysfs_unregister_hctxs(q);
@@ -4995,6 +5002,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) {
list_for_each_entry(q, &set->tag_list, tag_set_list)
blk_mq_unfreeze_queue_nomemrestore(q);
+
+ blk_mq_free_sched_tags(set, elv_tags);
+ elv_tags = NULL;
goto reregister;
}
@@ -5019,7 +5029,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
/* elv_update_nr_hw_queues() unfreeze queue for us */
list_for_each_entry(q, &set->tag_list, tag_set_list)
- elv_update_nr_hw_queues(q);
+ elv_update_nr_hw_queues(q, elv_tags, count);
reregister:
list_for_each_entry(q, &set->tag_list, tag_set_list) {
@@ -5029,6 +5039,8 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
blk_mq_remove_hw_queues_cpuhp(q);
blk_mq_add_hw_queues_cpuhp(q);
}
+ kfree(elv_tags);
+out:
memalloc_noio_restore(memflags);
/* Free the excess tags when nr_hw_queues shrink. */
diff --git a/block/blk.h b/block/blk.h
index 37ec459fe656..cd50305da84e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -10,6 +10,7 @@
#include <linux/timekeeping.h>
#include <xen/xen.h>
#include "blk-crypto-internal.h"
+#include "elevator.h"
struct elevator_type;
@@ -321,7 +322,8 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
bool blk_insert_flush(struct request *rq);
-void elv_update_nr_hw_queues(struct request_queue *q);
+void elv_update_nr_hw_queues(struct request_queue *q,
+ struct elevator_tags **elv_tags, int count);
void elevator_set_default(struct request_queue *q);
void elevator_set_none(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index ab22542e6cf0..849dbe347cb2 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -45,17 +45,6 @@
#include "blk-wbt.h"
#include "blk-cgroup.h"
-/* Holding context data for changing elevator */
-struct elv_change_ctx {
- const char *name;
- bool no_uevent;
-
- /* for unregistering old elevator */
- struct elevator_queue *old;
- /* for registering new elevator */
- struct elevator_queue *new;
-};
-
static DEFINE_SPINLOCK(elv_list_lock);
static LIST_HEAD(elv_list);
@@ -132,7 +121,7 @@ static struct elevator_type *elevator_find_get(const char *name)
static const struct kobj_type elv_ktype;
struct elevator_queue *elevator_alloc(struct request_queue *q,
- struct elevator_type *e)
+ struct elevator_type *e, struct elevator_tags *tags)
{
struct elevator_queue *eq;
@@ -142,6 +131,7 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
__elevator_get(e);
eq->type = e;
+ eq->tags = tags;
kobject_init(&eq->kobj, &elv_ktype);
mutex_init(&eq->sysfs_lock);
hash_init(eq->hash);
@@ -166,13 +156,25 @@ static void elevator_exit(struct request_queue *q)
lockdep_assert_held(&q->elevator_lock);
ioc_clear_queue(q);
- blk_mq_sched_free_rqs(q);
mutex_lock(&e->sysfs_lock);
blk_mq_exit_sched(q, e);
mutex_unlock(&e->sysfs_lock);
}
+static struct elevator_tags *elevator_tags_lookup(struct elevator_tags **tags,
+ struct request_queue *q, int count)
+{
+ int i;
+
+ for (i = 0; i < count; i++) {
+ if (tags[i]->id == q->id)
+ return tags[i];
+ }
+
+ return NULL;
+}
+
static inline void __elv_rqhash_del(struct request *rq)
{
hash_del(&rq->hash);
@@ -570,7 +572,8 @@ EXPORT_SYMBOL_GPL(elv_unregister);
* If switching fails, we are most likely running out of memory and not able
* to restore the old io scheduler, so leaving the io scheduler being none.
*/
-static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
+static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx,
+ struct elevator_tags *tags)
{
struct elevator_type *new_e = NULL;
int ret = 0;
@@ -592,7 +595,7 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
}
if (new_e) {
- ret = blk_mq_init_sched(q, new_e);
+ ret = blk_mq_init_sched(q, new_e, tags);
if (ret)
goto out_unfreeze;
ctx->new = q->elevator;
@@ -627,8 +630,10 @@ static void elv_exit_and_release(struct request_queue *q)
elevator_exit(q);
mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
- if (e)
+ if (e) {
+ __blk_mq_free_sched_tags(q->tag_set, e->tags);
kobject_put(&e->kobj);
+ }
}
static int elevator_change_done(struct request_queue *q,
@@ -641,6 +646,7 @@ static int elevator_change_done(struct request_queue *q,
&ctx->old->flags);
elv_unregister_queue(q, ctx->old);
+ __blk_mq_free_sched_tags(q->tag_set, ctx->old->tags);
kobject_put(&ctx->old->kobj);
if (enable_wbt)
wbt_enable_default(q->disk);
@@ -659,9 +665,17 @@ static int elevator_change_done(struct request_queue *q,
static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
{
unsigned int memflags;
+ struct elevator_tags *tags = NULL;
+ struct blk_mq_tag_set *set = q->tag_set;
int ret = 0;
- lockdep_assert_held(&q->tag_set->update_nr_hwq_lock);
+ lockdep_assert_held(&set->update_nr_hwq_lock);
+
+ if (strncmp(ctx->name, "none", 4)) {
+ tags = __blk_mq_alloc_sched_tags(set, set->nr_hw_queues, q->id);
+ if (!tags)
+ return -ENOMEM;
+ }
memflags = blk_mq_freeze_queue(q);
/*
@@ -676,11 +690,16 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
blk_mq_cancel_work_sync(q);
mutex_lock(&q->elevator_lock);
if (!(q->elevator && elevator_match(q->elevator->type, ctx->name)))
- ret = elevator_switch(q, ctx);
+ ret = elevator_switch(q, ctx, tags);
mutex_unlock(&q->elevator_lock);
blk_mq_unfreeze_queue(q, memflags);
if (!ret)
ret = elevator_change_done(q, ctx);
+ /*
+ * Free sched tags if it's allocated but we couldn't switch elevator.
+ */
+ if (tags && !ctx->new)
+ __blk_mq_free_sched_tags(set, tags);
return ret;
}
@@ -689,24 +708,47 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
* The I/O scheduler depends on the number of hardware queues, this forces a
* reattachment when nr_hw_queues changes.
*/
-void elv_update_nr_hw_queues(struct request_queue *q)
+void elv_update_nr_hw_queues(struct request_queue *q,
+ struct elevator_tags **elv_tags, int count)
{
+ struct elevator_tags *tags = NULL;
+ struct blk_mq_tag_set *set = q->tag_set;
struct elv_change_ctx ctx = {};
int ret = -ENODEV;
WARN_ON_ONCE(q->mq_freeze_depth == 0);
+ /*
+ * Accessing q->elevator without holding q->elevator_lock is safe here
+ * because nr_hw_queue update is protected by set->update_nr_hwq_lock
+ * in the writer context. So, scheduler update/switch code (which
+ * acquires same lock in the reader context) can't run concurrently.
+ */
+ if (q->elevator) {
+ tags = elevator_tags_lookup(elv_tags, q, count);
+ if (!tags) {
+ WARN_ON(1);
+ goto out_unfreeze;
+ }
+ }
+
mutex_lock(&q->elevator_lock);
if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
ctx.name = q->elevator->type->elevator_name;
/* force to reattach elevator after nr_hw_queue is updated */
- ret = elevator_switch(q, &ctx);
+ ret = elevator_switch(q, &ctx, tags);
}
mutex_unlock(&q->elevator_lock);
+out_unfreeze:
blk_mq_unfreeze_queue_nomemrestore(q);
if (!ret)
WARN_ON_ONCE(elevator_change_done(q, &ctx));
+ /*
+ * Free sched tags if it's allocated but we couldn't switch elevator.
+ */
+ if (tags && !ctx.new)
+ __blk_mq_free_sched_tags(set, tags);
}
/*
diff --git a/block/elevator.h b/block/elevator.h
index a4de5f9ad790..0b92121005cf 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -23,6 +23,17 @@ enum elv_merge {
struct blk_mq_alloc_data;
struct blk_mq_hw_ctx;
+/* Holding context data for changing elevator */
+struct elv_change_ctx {
+ const char *name;
+ bool no_uevent;
+
+ /* for unregistering old elevator */
+ struct elevator_queue *old;
+ /* for registering new elevator */
+ struct elevator_queue *new;
+};
+
struct elevator_mq_ops {
int (*init_sched)(struct request_queue *, struct elevator_queue *);
void (*exit_sched)(struct elevator_queue *);
@@ -107,12 +118,27 @@ void elv_rqhash_add(struct request_queue *q, struct request *rq);
void elv_rqhash_reposition(struct request_queue *q, struct request *rq);
struct request *elv_rqhash_find(struct request_queue *q, sector_t offset);
+struct elevator_tags {
+ /* num. of hardware queues for which tags are allocated */
+ unsigned int nr_hw_queues;
+ /* depth used while allocating tags */
+ unsigned int nr_requests;
+ /* request queue id (q->id) used during elevator_tags_lookup() */
+ int id;
+ union {
+ /* tags shared across all queues */
+ struct blk_mq_tags *shared_tags;
+ /* array of blk_mq_tags pointers, one per hardware queue. */
+ struct blk_mq_tags **tags;
+ } u;
+};
/*
* each queue has an elevator_queue associated with it
*/
struct elevator_queue
{
struct elevator_type *type;
+ struct elevator_tags *tags;
void *elevator_data;
struct kobject kobj;
struct mutex sysfs_lock;
@@ -153,7 +179,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *page, size_t count);
extern bool elv_bio_merge_ok(struct request *, struct bio *);
extern struct elevator_queue *elevator_alloc(struct request_queue *,
- struct elevator_type *);
+ struct elevator_type *, struct elevator_tags *);
/*
* Helper functions.
--
2.49.0
* Re: [PATCHv3 1/2] block: move elevator queue allocation logic into blk_mq_init_sched
2025-06-16 17:32 ` [PATCHv3 1/2] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
@ 2025-06-17 15:07 ` Ming Lei
2025-06-20 14:39 ` Nilay Shroff
2025-06-23 5:56 ` Christoph Hellwig
1 sibling, 1 reply; 14+ messages in thread
From: Ming Lei @ 2025-06-17 15:07 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, hch, axboe, sth, gjoyce
On Mon, Jun 16, 2025 at 11:02:25PM +0530, Nilay Shroff wrote:
> In preparation for allocating sched_tags before freezing the request
> queue and acquiring ->elevator_lock, move the elevator queue allocation
> logic from the elevator ops ->init_sched callback into blk_mq_init_sched.
>
> This refactoring provides a centralized location for elevator queue
> initialization, which makes it easier to store pre-allocated sched_tags
> in the struct elevator_queue during later changes.
>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> block/bfq-iosched.c | 13 +++----------
> block/blk-mq-sched.c | 7 ++++++-
> block/elevator.h | 2 +-
> block/kyber-iosched.c | 11 ++---------
> block/mq-deadline.c | 14 ++------------
> 5 files changed, 14 insertions(+), 33 deletions(-)
>
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 0cb1e9873aab..fd26dc1901b0 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -7232,22 +7232,16 @@ static void bfq_init_root_group(struct bfq_group *root_group,
> root_group->sched_data.bfq_class_idle_last_service = jiffies;
> }
>
> -static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
> +static int bfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
> {
> struct bfq_data *bfqd;
> - struct elevator_queue *eq;
> unsigned int i;
> struct blk_independent_access_ranges *ia_ranges = q->disk->ia_ranges;
>
> - eq = elevator_alloc(q, e);
> - if (!eq)
> - return -ENOMEM;
> -
> bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
> - if (!bfqd) {
> - kobject_put(&eq->kobj);
> + if (!bfqd)
> return -ENOMEM;
> - }
> +
> eq->elevator_data = bfqd;
>
> spin_lock_irq(&q->queue_lock);
> @@ -7405,7 +7399,6 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
>
> out_free:
> kfree(bfqd);
> - kobject_put(&eq->kobj);
> return -ENOMEM;
> }
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 55a0fd105147..d914eb9d61a6 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -475,6 +475,10 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
> q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
> BLKDEV_DEFAULT_RQ);
>
> + eq = elevator_alloc(q, e);
> + if (!eq)
> + return -ENOMEM;
> +
> if (blk_mq_is_shared_tags(flags)) {
> ret = blk_mq_init_sched_shared_tags(q);
> if (ret)
The above failure needs to be handled by kobject_put(&eq->kobj).
Otherwise, feel free to add:
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Thanks,
Ming
* Re: [PATCHv3 2/2] block: fix lock dependency between percpu alloc lock and elevator lock
2025-06-16 17:32 ` [PATCHv3 2/2] block: fix lock dependency between percpu alloc lock and elevator lock Nilay Shroff
@ 2025-06-18 3:06 ` Ming Lei
2025-06-18 6:52 ` Nilay Shroff
2025-06-23 6:10 ` Christoph Hellwig
1 sibling, 1 reply; 14+ messages in thread
From: Ming Lei @ 2025-06-18 3:06 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, hch, axboe, sth, gjoyce
On Mon, Jun 16, 2025 at 11:02:26PM +0530, Nilay Shroff wrote:
> Recent lockdep reports [1] have revealed a potential deadlock caused by a
> lock dependency between the percpu allocator lock and the elevator lock.
> This issue can be avoided by ensuring that the allocation and release of
> scheduler tags (sched_tags) are performed outside the elevator lock.
> Furthermore, the queue does not need to remain frozen during these
> operations.
>
> To address this, move all sched_tags allocations and deallocations outside
> of both the ->elevator_lock and the ->freeze_lock. Moreover, since the
> lifetime of the elevator queue and its associated sched_tags is closely
> tied, the allocated sched_tags are now stored in the elevator queue
> structure. Then, during the actual elevator switch (which runs under
> ->freeze_lock and ->elevator_lock), the pre-allocated sched_tags are
> assigned to the appropriate q->hctx. Once the elevator switch is complete
> and the locks are released, the old elevator queue and its associated
> sched_tags are freed.
>
> This restructuring ensures that sched_tags memory management occurs
> entirely outside of the ->elevator_lock and ->freeze_lock context,
> eliminating the lock dependency problem seen during scheduler updates
> or hardware queue updates.
Nice!
>
> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
>
> Reported-by: Stefan Haberland <sth@linux.ibm.com>
> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> block/blk-mq-sched.c | 244 +++++++++++++++++++++++++++++++------------
> block/blk-mq-sched.h | 13 ++-
> block/blk-mq.c | 14 ++-
> block/blk.h | 4 +-
> block/elevator.c | 82 +++++++++++----
> block/elevator.h | 28 ++++-
> 6 files changed, 296 insertions(+), 89 deletions(-)
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index d914eb9d61a6..dd00d8c9fb78 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -374,27 +374,17 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq,
> }
> EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge);
>
> -static int blk_mq_sched_alloc_map_and_rqs(struct request_queue *q,
> - struct blk_mq_hw_ctx *hctx,
> - unsigned int hctx_idx)
> +static void blk_mq_init_sched_tags(struct request_queue *q,
> + struct blk_mq_hw_ctx *hctx,
> + unsigned int hctx_idx,
> + struct elevator_queue *eq)
> {
> if (blk_mq_is_shared_tags(q->tag_set->flags)) {
> hctx->sched_tags = q->sched_shared_tags;
> - return 0;
> + return;
> }
>
> - hctx->sched_tags = blk_mq_alloc_map_and_rqs(q->tag_set, hctx_idx,
> - q->nr_requests);
> -
> - if (!hctx->sched_tags)
> - return -ENOMEM;
> - return 0;
> -}
> -
> -static void blk_mq_exit_sched_shared_tags(struct request_queue *queue)
> -{
> - blk_mq_free_rq_map(queue->sched_shared_tags);
> - queue->sched_shared_tags = NULL;
> + hctx->sched_tags = eq->tags->u.tags[hctx_idx];
> }
>
> /* called in queue's release handler, tagset has gone away */
> @@ -403,35 +393,18 @@ static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int fla
> struct blk_mq_hw_ctx *hctx;
> unsigned long i;
>
> - queue_for_each_hw_ctx(q, hctx, i) {
> - if (hctx->sched_tags) {
> - if (!blk_mq_is_shared_tags(flags))
> - blk_mq_free_rq_map(hctx->sched_tags);
> - hctx->sched_tags = NULL;
> - }
> - }
> + queue_for_each_hw_ctx(q, hctx, i)
> + hctx->sched_tags = NULL;
>
> if (blk_mq_is_shared_tags(flags))
> - blk_mq_exit_sched_shared_tags(q);
> + q->sched_shared_tags = NULL;
> }
>
> -static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
> +static void blk_mq_init_sched_shared_tags(struct request_queue *queue,
> + struct elevator_queue *eq)
> {
> - struct blk_mq_tag_set *set = queue->tag_set;
> -
> - /*
> - * Set initial depth at max so that we don't need to reallocate for
> - * updating nr_requests.
> - */
> - queue->sched_shared_tags = blk_mq_alloc_map_and_rqs(set,
> - BLK_MQ_NO_HCTX_IDX,
> - MAX_SCHED_RQ);
> - if (!queue->sched_shared_tags)
> - return -ENOMEM;
> -
> + queue->sched_shared_tags = eq->tags->u.shared_tags;
> blk_mq_tag_update_sched_shared_tags(queue);
> -
> - return 0;
> }
>
> void blk_mq_sched_reg_debugfs(struct request_queue *q)
> @@ -458,8 +431,165 @@ void blk_mq_sched_unreg_debugfs(struct request_queue *q)
> mutex_unlock(&q->debugfs_mutex);
> }
>
> +void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set,
> + struct elevator_tags *tags)
> +{
> + unsigned long i;
> +
> + if (!tags)
> + return;
> +
> + if (blk_mq_is_shared_tags(set->flags)) {
> + if (tags->u.shared_tags) {
> + blk_mq_free_rqs(set, tags->u.shared_tags,
> + BLK_MQ_NO_HCTX_IDX);
> + blk_mq_free_rq_map(tags->u.shared_tags);
> + }
> + goto out;
> + }
> +
> + if (!tags->u.tags)
> + goto out;
> +
> + for (i = 0; i < tags->nr_hw_queues; i++) {
> + if (tags->u.tags[i]) {
> + blk_mq_free_rqs(set, tags->u.tags[i], i);
> + blk_mq_free_rq_map(tags->u.tags[i]);
> + }
> + }
> +
> + kfree(tags->u.tags);
> +out:
> + kfree(tags);
> +}
> +
> +void blk_mq_free_sched_tags(struct blk_mq_tag_set *set,
> + struct elevator_tags **tags)
> +{
> + unsigned long i, count = 0;
> + struct request_queue *q;
> +
> + lockdep_assert_held_write(&set->update_nr_hwq_lock);
> +
> + if (!tags)
> + return;
> +
> + /*
> + * Accessing q->elevator without holding q->elevator_lock is safe
> + * because we're holding here set->update_nr_hwq_lock in the writer
> + * context. So, scheduler update/switch code (which acquires the same
> + * lock but in the reader context) can't run concurrently.
> + */
> + list_for_each_entry(q, &set->tag_list, tag_set_list) {
> + if (q->elevator)
> + count++;
> + }
> +
> + for (i = 0; i < count; i++)
> + __blk_mq_free_sched_tags(set, tags[i]);
> +
> + kfree(tags);
> +}
> +
> +struct elevator_tags *__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
> + unsigned int nr_hw_queues, int id)
> +{
> + unsigned int i;
> + struct elevator_tags *tags;
> + gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
> +
> + tags = kcalloc(1, sizeof(struct elevator_tags), gfp);
> + if (!tags)
> + return NULL;
> +
> + tags->id = id;
> + /*
> + * Default to double of smaller one between hw queue_depth and
> + * 128, since we don't split into sync/async like the old code
> + * did. Additionally, this is a per-hw queue depth.
> + */
> + tags->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
> + BLKDEV_DEFAULT_RQ);
> +
> + if (blk_mq_is_shared_tags(set->flags)) {
> +
> + tags->u.shared_tags = blk_mq_alloc_map_and_rqs(set,
> + BLK_MQ_NO_HCTX_IDX,
> + MAX_SCHED_RQ);
> + if (!tags->u.shared_tags)
> + goto out;
> +
> + return tags;
> + }
> +
> + tags->u.tags = kcalloc(nr_hw_queues, sizeof(struct blk_mq_tags *), gfp);
> + if (!tags->u.tags)
> + goto out;
> +
> + tags->nr_hw_queues = nr_hw_queues;
> + for (i = 0; i < nr_hw_queues; i++) {
> + tags->u.tags[i] = blk_mq_alloc_map_and_rqs(set, i,
> + tags->nr_requests);
> + if (!tags->u.tags[i])
> + goto out;
> + }
> +
> + return tags;
> +
> +out:
> + __blk_mq_free_sched_tags(set, tags);
> + return NULL;
> +}
> +
> +int blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
> + unsigned int nr_hw_queues,
> + struct elevator_tags ***elv_tags)
We seldom see triple-indirect pointers in the Linux kernel, :-)
> +{
> + unsigned long idx = 0, count = 0;
> + struct request_queue *q;
> + struct elevator_tags **tags;
> + gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
> +
> + lockdep_assert_held_write(&set->update_nr_hwq_lock);
> + /*
> + * Accessing q->elevator without holding q->elevator_lock is safe
> + * because we're holding here set->update_nr_hwq_lock in the writer
> + * context. So, scheduler update/switch code (which acquires the same
> + * lock but in the reader context) can't run concurrently.
> + */
> + list_for_each_entry(q, &set->tag_list, tag_set_list) {
> + if (q->elevator)
> + count++;
> + }
> +
> + if (!count)
> + return 0;
> +
> + tags = kcalloc(count, sizeof(struct elevator_tags *), gfp);
> + if (!tags)
> + return -ENOMEM;
> +
> + list_for_each_entry(q, &set->tag_list, tag_set_list) {
> + if (q->elevator) {
> + tags[idx] = __blk_mq_alloc_sched_tags(set, nr_hw_queues,
> + q->id);
> + if (!tags[idx])
> + goto out;
> +
> + idx++;
> + }
> + }
> +
> + *elv_tags = tags;
> + return count;
> +out:
> + blk_mq_free_sched_tags(set, tags);
> + return -ENOMEM;
> +}
I believe a lot of code can be saved if you use an xarray to store the
allocated 'struct elevator_tags', with q->id as the key.
Also, the handling for updating nr_hw_queues introduces new batch alloc & free
logic and should be done in a standalone patch. In my experience, bugs are
easier to hide in a big patch, which is harder to review.
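Something along these lines, for example (an untested sketch against the
xarray API; the table name and surrounding variables are illustrative):

```c
/* Sketch: keep the pre-allocated elevator_tags in an xarray keyed by
 * q->id instead of a pointer array (illustrative, untested). */
struct xarray elv_tbl;

xa_init(&elv_tbl);

list_for_each_entry(q, &set->tag_list, tag_set_list) {
	if (!q->elevator)
		continue;
	et = __blk_mq_alloc_sched_tags(set, nr_hw_queues, q->id);
	if (!et || xa_insert(&elv_tbl, q->id, et, gfp))
		goto out_free;
}

/* later, in elv_update_nr_hw_queues(), the linear lookup collapses to: */
tags = xa_load(&elv_tbl, q->id);
```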
> +
> /* caller must have a reference to @e, will grab another one if successful */
> -int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
> +int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
> + struct elevator_tags *tags)
> {
> unsigned int flags = q->tag_set->flags;
> struct blk_mq_hw_ctx *hctx;
> @@ -467,40 +597,26 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
> unsigned long i;
> int ret;
>
> - /*
> - * Default to double of smaller one between hw queue_depth and 128,
> - * since we don't split into sync/async like the old code did.
> - * Additionally, this is a per-hw queue depth.
> - */
> - q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
> - BLKDEV_DEFAULT_RQ);
> -
> - eq = elevator_alloc(q, e);
> + eq = elevator_alloc(q, e, tags);
> if (!eq)
> return -ENOMEM;
>
> - if (blk_mq_is_shared_tags(flags)) {
> - ret = blk_mq_init_sched_shared_tags(q);
> - if (ret)
> - return ret;
> - }
> + q->nr_requests = tags->nr_requests;
>
> - queue_for_each_hw_ctx(q, hctx, i) {
> - ret = blk_mq_sched_alloc_map_and_rqs(q, hctx, i);
> - if (ret)
> - goto err_free_map_and_rqs;
> - }
> + if (blk_mq_is_shared_tags(flags))
> + blk_mq_init_sched_shared_tags(q, eq);
> +
> + queue_for_each_hw_ctx(q, hctx, i)
> + blk_mq_init_sched_tags(q, hctx, i, eq);
>
> ret = e->ops.init_sched(q, eq);
> if (ret)
> - goto err_free_map_and_rqs;
> + goto out;
>
> queue_for_each_hw_ctx(q, hctx, i) {
> if (e->ops.init_hctx) {
> ret = e->ops.init_hctx(hctx, i);
> if (ret) {
> - eq = q->elevator;
> - blk_mq_sched_free_rqs(q);
> blk_mq_exit_sched(q, eq);
> kobject_put(&eq->kobj);
> return ret;
> @@ -509,10 +625,8 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
> }
> return 0;
>
> -err_free_map_and_rqs:
> - blk_mq_sched_free_rqs(q);
> +out:
> blk_mq_sched_tags_teardown(q, flags);
> -
> kobject_put(&eq->kobj);
> q->elevator = NULL;
> return ret;
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 1326526bb733..01c43192b98a 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -18,7 +18,18 @@ void __blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
>
> void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
>
> -int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
> +void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set,
> + struct elevator_tags *tags);
> +void blk_mq_free_sched_tags(struct blk_mq_tag_set *set,
> + struct elevator_tags **elv_t);
> +int blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
> + unsigned int nr_hw_queues,
> + struct elevator_tags ***tags);
> +struct elevator_tags *__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
> + unsigned int nr_hw_queues, int id);
> +
> +int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
> + struct elevator_tags *elv_t);
> void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
> void blk_mq_sched_free_rqs(struct request_queue *q);
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4806b867e37d..b8b616c79742 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4970,6 +4970,8 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> int nr_hw_queues)
> {
> struct request_queue *q;
> + int count;
> + struct elevator_tags **elv_tags = NULL;
> int prev_nr_hw_queues = set->nr_hw_queues;
> unsigned int memflags;
> int i;
> @@ -4984,6 +4986,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> return;
>
> memflags = memalloc_noio_save();
> +
> + count = blk_mq_alloc_sched_tags(set, nr_hw_queues, &elv_tags);
> + if (count < 0)
> + goto out;
> +
> list_for_each_entry(q, &set->tag_list, tag_set_list) {
> blk_mq_debugfs_unregister_hctxs(q);
> blk_mq_sysfs_unregister_hctxs(q);
> @@ -4995,6 +5002,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) {
> list_for_each_entry(q, &set->tag_list, tag_set_list)
> blk_mq_unfreeze_queue_nomemrestore(q);
> +
> + blk_mq_free_sched_tags(set, elv_tags);
> + elv_tags = NULL;
> goto reregister;
> }
>
> @@ -5019,7 +5029,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>
> /* elv_update_nr_hw_queues() unfreeze queue for us */
> list_for_each_entry(q, &set->tag_list, tag_set_list)
> - elv_update_nr_hw_queues(q);
> + elv_update_nr_hw_queues(q, elv_tags, count);
>
> reregister:
> list_for_each_entry(q, &set->tag_list, tag_set_list) {
> @@ -5029,6 +5039,8 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> blk_mq_remove_hw_queues_cpuhp(q);
> blk_mq_add_hw_queues_cpuhp(q);
> }
> + kfree(elv_tags);
> +out:
> memalloc_noio_restore(memflags);
>
> /* Free the excess tags when nr_hw_queues shrink. */
> diff --git a/block/blk.h b/block/blk.h
> index 37ec459fe656..cd50305da84e 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -10,6 +10,7 @@
> #include <linux/timekeeping.h>
> #include <xen/xen.h>
> #include "blk-crypto-internal.h"
> +#include "elevator.h"
>
> struct elevator_type;
>
> @@ -321,7 +322,8 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
>
> bool blk_insert_flush(struct request *rq);
>
> -void elv_update_nr_hw_queues(struct request_queue *q);
> +void elv_update_nr_hw_queues(struct request_queue *q,
> + struct elevator_tags **elv_tags, int count);
> void elevator_set_default(struct request_queue *q);
> void elevator_set_none(struct request_queue *q);
>
> diff --git a/block/elevator.c b/block/elevator.c
> index ab22542e6cf0..849dbe347cb2 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -45,17 +45,6 @@
> #include "blk-wbt.h"
> #include "blk-cgroup.h"
>
> -/* Holding context data for changing elevator */
> -struct elv_change_ctx {
> - const char *name;
> - bool no_uevent;
> -
> - /* for unregistering old elevator */
> - struct elevator_queue *old;
> - /* for registering new elevator */
> - struct elevator_queue *new;
> -};
> -
> static DEFINE_SPINLOCK(elv_list_lock);
> static LIST_HEAD(elv_list);
>
> @@ -132,7 +121,7 @@ static struct elevator_type *elevator_find_get(const char *name)
> static const struct kobj_type elv_ktype;
>
> struct elevator_queue *elevator_alloc(struct request_queue *q,
> - struct elevator_type *e)
> + struct elevator_type *e, struct elevator_tags *tags)
> {
> struct elevator_queue *eq;
>
> @@ -142,6 +131,7 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
>
> __elevator_get(e);
> eq->type = e;
> + eq->tags = tags;
> kobject_init(&eq->kobj, &elv_ktype);
> mutex_init(&eq->sysfs_lock);
> hash_init(eq->hash);
> @@ -166,13 +156,25 @@ static void elevator_exit(struct request_queue *q)
> lockdep_assert_held(&q->elevator_lock);
>
> ioc_clear_queue(q);
> - blk_mq_sched_free_rqs(q);
>
> mutex_lock(&e->sysfs_lock);
> blk_mq_exit_sched(q, e);
> mutex_unlock(&e->sysfs_lock);
> }
>
> +static struct elevator_tags *elevator_tags_lookup(struct elevator_tags **tags,
> + struct request_queue *q, int count)
> +{
> + int i;
> +
> + for (i = 0; i < count; i++) {
> + if (tags[i]->id == q->id)
> + return tags[i];
> + }
> +
> + return NULL;
> +}
> +
> static inline void __elv_rqhash_del(struct request *rq)
> {
> hash_del(&rq->hash);
> @@ -570,7 +572,8 @@ EXPORT_SYMBOL_GPL(elv_unregister);
> * If switching fails, we are most likely running out of memory and not able
> * to restore the old io scheduler, so leaving the io scheduler being none.
> */
> -static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
> +static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx,
> + struct elevator_tags *tags)
> {
> struct elevator_type *new_e = NULL;
> int ret = 0;
> @@ -592,7 +595,7 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
> }
>
> if (new_e) {
> - ret = blk_mq_init_sched(q, new_e);
> + ret = blk_mq_init_sched(q, new_e, tags);
> if (ret)
> goto out_unfreeze;
> ctx->new = q->elevator;
> @@ -627,8 +630,10 @@ static void elv_exit_and_release(struct request_queue *q)
> elevator_exit(q);
> mutex_unlock(&q->elevator_lock);
> blk_mq_unfreeze_queue(q, memflags);
> - if (e)
> + if (e) {
> + __blk_mq_free_sched_tags(q->tag_set, e->tags);
> kobject_put(&e->kobj);
> + }
> }
>
> static int elevator_change_done(struct request_queue *q,
> @@ -641,6 +646,7 @@ static int elevator_change_done(struct request_queue *q,
> &ctx->old->flags);
>
> elv_unregister_queue(q, ctx->old);
> + __blk_mq_free_sched_tags(q->tag_set, ctx->old->tags);
> kobject_put(&ctx->old->kobj);
> if (enable_wbt)
> wbt_enable_default(q->disk);
> @@ -659,9 +665,17 @@ static int elevator_change_done(struct request_queue *q,
> static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
> {
> unsigned int memflags;
> + struct elevator_tags *tags = NULL;
> + struct blk_mq_tag_set *set = q->tag_set;
> int ret = 0;
>
> - lockdep_assert_held(&q->tag_set->update_nr_hwq_lock);
> + lockdep_assert_held(&set->update_nr_hwq_lock);
> +
> + if (strncmp(ctx->name, "none", 4)) {
> + tags = __blk_mq_alloc_sched_tags(set, set->nr_hw_queues, q->id);
> + if (!tags)
> + return -ENOMEM;
> + }
>
> memflags = blk_mq_freeze_queue(q);
> /*
> @@ -676,11 +690,16 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
> blk_mq_cancel_work_sync(q);
> mutex_lock(&q->elevator_lock);
> if (!(q->elevator && elevator_match(q->elevator->type, ctx->name)))
> - ret = elevator_switch(q, ctx);
> + ret = elevator_switch(q, ctx, tags);
tags may be passed via 'ctx'.
> mutex_unlock(&q->elevator_lock);
> blk_mq_unfreeze_queue(q, memflags);
> if (!ret)
> ret = elevator_change_done(q, ctx);
> + /*
> + * Free sched tags if it's allocated but we couldn't switch elevator.
> + */
> + if (tags && !ctx->new)
> + __blk_mq_free_sched_tags(set, tags);
Maybe you can let elevator_change_done() cover the failure handling, and
move the above two lines into elevator_change_done().
>
> return ret;
> }
> @@ -689,24 +708,47 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
> * The I/O scheduler depends on the number of hardware queues, this forces a
> * reattachment when nr_hw_queues changes.
> */
> -void elv_update_nr_hw_queues(struct request_queue *q)
> +void elv_update_nr_hw_queues(struct request_queue *q,
> + struct elevator_tags **elv_tags, int count)
> {
> + struct elevator_tags *tags = NULL;
> + struct blk_mq_tag_set *set = q->tag_set;
> struct elv_change_ctx ctx = {};
> int ret = -ENODEV;
>
> WARN_ON_ONCE(q->mq_freeze_depth == 0);
>
> + /*
> + * Accessing q->elevator without holding q->elevator_lock is safe here
> + * because nr_hw_queue update is protected by set->update_nr_hwq_lock
> + * in the writer context. So, scheduler update/switch code (which
> + * acquires same lock in the reader context) can't run concurrently.
> + */
> + if (q->elevator) {
> + tags = elevator_tags_lookup(elv_tags, q, count);
> + if (!tags) {
> + WARN_ON(1);
> + goto out_unfreeze;
> + }
> + }
> +
> mutex_lock(&q->elevator_lock);
> if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
> ctx.name = q->elevator->type->elevator_name;
>
> /* force to reattach elevator after nr_hw_queue is updated */
> - ret = elevator_switch(q, &ctx);
> + ret = elevator_switch(q, &ctx, tags);
> }
> mutex_unlock(&q->elevator_lock);
> +out_unfreeze:
> blk_mq_unfreeze_queue_nomemrestore(q);
> if (!ret)
> WARN_ON_ONCE(elevator_change_done(q, &ctx));
> + /*
> + * Free sched tags if it's allocated but we couldn't switch elevator.
> + */
> + if (tags && !ctx.new)
> + __blk_mq_free_sched_tags(set, tags);
> }
>
> /*
> diff --git a/block/elevator.h b/block/elevator.h
> index a4de5f9ad790..0b92121005cf 100644
> --- a/block/elevator.h
> +++ b/block/elevator.h
> @@ -23,6 +23,17 @@ enum elv_merge {
> struct blk_mq_alloc_data;
> struct blk_mq_hw_ctx;
>
> +/* Holding context data for changing elevator */
> +struct elv_change_ctx {
> + const char *name;
> + bool no_uevent;
> +
> + /* for unregistering old elevator */
> + struct elevator_queue *old;
> + /* for registering new elevator */
> + struct elevator_queue *new;
> +};
> +
'elv_change_ctx' is only used in block/elevator.c, not sure why you
moved it to the header.
> struct elevator_mq_ops {
> int (*init_sched)(struct request_queue *, struct elevator_queue *);
> void (*exit_sched)(struct elevator_queue *);
> @@ -107,12 +118,27 @@ void elv_rqhash_add(struct request_queue *q, struct request *rq);
> void elv_rqhash_reposition(struct request_queue *q, struct request *rq);
> struct request *elv_rqhash_find(struct request_queue *q, sector_t offset);
>
> +struct elevator_tags {
> + /* num. of hardware queues for which tags are allocated */
> + unsigned int nr_hw_queues;
> + /* depth used while allocating tags */
> + unsigned int nr_requests;
> + /* request queue id (q->id) used during elevator_tags_lookup() */
> + int id;
> + union {
> + /* tags shared across all queues */
> + struct blk_mq_tags *shared_tags;
> + /* array of blk_mq_tags pointers, one per hardware queue. */
> + struct blk_mq_tags **tags;
> + } u;
> +};
I feel it is simpler to just save the shared tags in the 1st item of the
array; then you can store the 'struct blk_mq_tags' pointers in a flex array
of 'struct elevator_tags', saving one extra allocation & a lot of failure
handling code.
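A minimal userspace sketch of that layout (hypothetical; a stub stands in
for struct blk_mq_tags, and the field names follow the quoted struct):

```c
#include <stdlib.h>

/* Stub standing in for the kernel's struct blk_mq_tags. */
struct blk_mq_tags { unsigned int nr_tags; };

/*
 * Hypothetical layout following the suggestion: one flexible array of
 * per-hctx tag pointers; for shared tags only slot 0 is used, so a
 * single allocation covers both cases and the union goes away.
 */
struct elevator_tags {
	unsigned int nr_hw_queues;
	unsigned int nr_requests;
	int id;
	struct blk_mq_tags *tags[];	/* shared tags live in tags[0] */
};

static struct elevator_tags *alloc_elv_tags(unsigned int nr_hw_queues)
{
	/* one zeroed allocation: header plus the pointer array */
	struct elevator_tags *et = calloc(1, sizeof(*et) +
			nr_hw_queues * sizeof(struct blk_mq_tags *));
	if (et)
		et->nr_hw_queues = nr_hw_queues;
	return et;
}
```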
Thanks,
Ming
* Re: [PATCHv3 2/2] block: fix lock dependency between percpu alloc lock and elevator lock
2025-06-18 3:06 ` Ming Lei
@ 2025-06-18 6:52 ` Nilay Shroff
0 siblings, 0 replies; 14+ messages in thread
From: Nilay Shroff @ 2025-06-18 6:52 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, hch, axboe, sth, gjoyce
On 6/18/25 8:36 AM, Ming Lei wrote:
>> +int blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
>> + unsigned int nr_hw_queues,
>> + struct elevator_tags ***elv_tags)
>
> We seldom see triple-indirect pointers in the Linux kernel, :-)
Yeah, let's see if we can avoid it by using an xarray as you suggested.
>
>> +{
>> + unsigned long idx = 0, count = 0;
>> + struct request_queue *q;
>> + struct elevator_tags **tags;
>> + gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
>> +
>> + lockdep_assert_held_write(&set->update_nr_hwq_lock);
>> + /*
>> + * Accessing q->elevator without holding q->elevator_lock is safe
>> + * because we're holding here set->update_nr_hwq_lock in the writer
>> + * context. So, scheduler update/switch code (which acquires the same
>> + * lock but in the reader context) can't run concurrently.
>> + */
>> + list_for_each_entry(q, &set->tag_list, tag_set_list) {
>> + if (q->elevator)
>> + count++;
>> + }
>> +
>> + if (!count)
>> + return 0;
>> +
>> + tags = kcalloc(count, sizeof(struct elevator_tags *), gfp);
>> + if (!tags)
>> + return -ENOMEM;
>> +
>> + list_for_each_entry(q, &set->tag_list, tag_set_list) {
>> + if (q->elevator) {
>> + tags[idx] = __blk_mq_alloc_sched_tags(set, nr_hw_queues,
>> + q->id);
>> + if (!tags[idx])
>> + goto out;
>> +
>> + idx++;
>> + }
>> + }
>> +
>> + *elv_tags = tags;
>> + return count;
>> +out:
>> + blk_mq_free_sched_tags(set, tags);
>> + return -ENOMEM;
>> +}
>
> I believe a lot of code can be saved if you use an xarray to store the
> allocated 'struct elevator_tags', with q->id as the key.
>
I think using an xarray is a nice idea! I will rewrite this code
using an xarray in the next revision.
> Also, the handling for updating nr_hw_queues introduces new batch alloc &
> free logic and should be done in a standalone patch. In my experience, bugs
> are easier to hide in a big patch, which is harder to review.
>
Sure, I'll further split the patch in the next revision.
>> @@ -676,11 +690,16 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
>> blk_mq_cancel_work_sync(q);
>> mutex_lock(&q->elevator_lock);
>> if (!(q->elevator && elevator_match(q->elevator->type, ctx->name)))
>> - ret = elevator_switch(q, ctx);
>> + ret = elevator_switch(q, ctx, tags);
>
> tags may be passed via 'ctx'.
Hmm, OK, we can do that. I will add @tags to @ctx.
>
>> mutex_unlock(&q->elevator_lock);
>> blk_mq_unfreeze_queue(q, memflags);
>> if (!ret)
>> ret = elevator_change_done(q, ctx);
>> + /*
>> + * Free sched tags if it's allocated but we couldn't switch elevator.
>> + */
>> + if (tags && !ctx->new)
>> + __blk_mq_free_sched_tags(set, tags);
>
> Maybe you can let elevator_change_done() cover the failure handling, and
> move the above two lines into elevator_change_done().
Yes, precisely, I thought about it, but later realized that
elevator_change_done() may not always be called. For instance, if
elevator_switch() fails and returns a non-zero value, then
elevator_change_done() would be skipped. But even in the
elevator_switch() failure case, we'd have already allocated the
sched_tags, so we still have to free them.
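i.e., schematically (not the actual code):

```c
ret = elevator_switch(q, ctx, tags);		/* may fail ...              */

if (!ret)
	ret = elevator_change_done(q, ctx);	/* ... then this is skipped  */

/* so elevator_change() itself must free the pre-allocated tags: */
if (tags && !ctx->new)
	__blk_mq_free_sched_tags(set, tags);
```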
>
>> diff --git a/block/elevator.h b/block/elevator.h
>> index a4de5f9ad790..0b92121005cf 100644
>> --- a/block/elevator.h
>> +++ b/block/elevator.h
>> @@ -23,6 +23,17 @@ enum elv_merge {
>> struct blk_mq_alloc_data;
>> struct blk_mq_hw_ctx;
>>
>> +/* Holding context data for changing elevator */
>> +struct elv_change_ctx {
>> + const char *name;
>> + bool no_uevent;
>> +
>> + /* for unregistering old elevator */
>> + struct elevator_queue *old;
>> + /* for registering new elevator */
>> + struct elevator_queue *new;
>> +};
>> +
>
> 'elv_change_ctx' is only used in block/elevator.c, not sure why you
> moved it to the header.
Oh yes, in the previous version of this patch it was used outside of
elevator.c, so I moved it into a common header, but later in this patch
I missed moving it back into elevator.c now that the only user of
'elv_change_ctx' is in elevator.c. I will fix this in the next
revision.
>> struct elevator_mq_ops {
>> int (*init_sched)(struct request_queue *, struct elevator_queue *);
>> void (*exit_sched)(struct elevator_queue *);
>> @@ -107,12 +118,27 @@ void elv_rqhash_add(struct request_queue *q, struct request *rq);
>> void elv_rqhash_reposition(struct request_queue *q, struct request *rq);
>> struct request *elv_rqhash_find(struct request_queue *q, sector_t offset);
>>
>> +struct elevator_tags {
>> + /* num. of hardware queues for which tags are allocated */
>> + unsigned int nr_hw_queues;
>> + /* depth used while allocating tags */
>> + unsigned int nr_requests;
>> + /* request queue id (q->id) used during elevator_tags_lookup() */
>> + int id;
>> + union {
>> + /* tags shared across all queues */
>> + struct blk_mq_tags *shared_tags;
>> + /* array of blk_mq_tags pointers, one per hardware queue. */
>> + struct blk_mq_tags **tags;
>> + } u;
>> +};
>
> I feel it is simpler to just save the shared tags in the 1st item of the
> array; then you can store the 'struct blk_mq_tags' pointers in a flex array
> of 'struct elevator_tags', saving one extra allocation & a lot of failure
> handling code.
>
Yes, sounds good! I will send out another patchset with the above
suggested changes.
Thanks,
--Nilay
* Re: [PATCHv3 1/2] block: move elevator queue allocation logic into blk_mq_init_sched
2025-06-17 15:07 ` Ming Lei
@ 2025-06-20 14:39 ` Nilay Shroff
2025-06-20 15:17 ` Ming Lei
0 siblings, 1 reply; 14+ messages in thread
From: Nilay Shroff @ 2025-06-20 14:39 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, hch, axboe, sth, gjoyce
On 6/17/25 8:37 PM, Ming Lei wrote:
> On Mon, Jun 16, 2025 at 11:02:25PM +0530, Nilay Shroff wrote:
>> In preparation for allocating sched_tags before freezing the request
>> queue and acquiring ->elevator_lock, move the elevator queue allocation
>> logic from the elevator ops ->init_sched callback into blk_mq_init_sched.
>>
>> This refactoring provides a centralized location for elevator queue
>> initialization, which makes it easier to store pre-allocated sched_tags
>> in the struct elevator_queue during later changes.
>>
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>> ---
[...]
>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>> index 55a0fd105147..d914eb9d61a6 100644
>> --- a/block/blk-mq-sched.c
>> +++ b/block/blk-mq-sched.c
>> @@ -475,6 +475,10 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
>> q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
>> BLKDEV_DEFAULT_RQ);
>>
>> + eq = elevator_alloc(q, e);
>> + if (!eq)
>> + return -ENOMEM;
>> +
>> if (blk_mq_is_shared_tags(flags)) {
>> ret = blk_mq_init_sched_shared_tags(q);
>> if (ret)
>
> The above failure needs to be handled by kobject_put(&eq->kobj).
I think here the elevator_alloc() failure occurs before we initialize
eq->kobj. So we don't need to handle it with kobject_put(&eq->kobj)
and instead simply returning -ENOMEM should be sufficient. Agree?
>
> Otherwise, feel free to add:
>
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
>
Thanks,
--Nilay
* Re: [PATCHv3 1/2] block: move elevator queue allocation logic into blk_mq_init_sched
2025-06-20 14:39 ` Nilay Shroff
@ 2025-06-20 15:17 ` Ming Lei
2025-06-20 16:13 ` Nilay Shroff
0 siblings, 1 reply; 14+ messages in thread
From: Ming Lei @ 2025-06-20 15:17 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, hch, axboe, sth, gjoyce
On Fri, Jun 20, 2025 at 08:09:01PM +0530, Nilay Shroff wrote:
>
>
> On 6/17/25 8:37 PM, Ming Lei wrote:
> > On Mon, Jun 16, 2025 at 11:02:25PM +0530, Nilay Shroff wrote:
> >> In preparation for allocating sched_tags before freezing the request
> >> queue and acquiring ->elevator_lock, move the elevator queue allocation
> >> logic from the elevator ops ->init_sched callback into blk_mq_init_sched.
> >>
> >> This refactoring provides a centralized location for elevator queue
> >> initialization, which makes it easier to store pre-allocated sched_tags
> >> in the struct elevator_queue during later changes.
> >>
> >> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> >> ---
>
> [...]
>
> >> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> >> index 55a0fd105147..d914eb9d61a6 100644
> >> --- a/block/blk-mq-sched.c
> >> +++ b/block/blk-mq-sched.c
> >> @@ -475,6 +475,10 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
> >> q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
> >> BLKDEV_DEFAULT_RQ);
> >>
> >> + eq = elevator_alloc(q, e);
> >> + if (!eq)
> >> + return -ENOMEM;
> >> +
> >> if (blk_mq_is_shared_tags(flags)) {
> >> ret = blk_mq_init_sched_shared_tags(q);
> >> if (ret)
> >
> > The above failure needs to be handled by kobject_put(&eq->kobj).
>
> I think here the elevator_alloc() failure occurs before we initialize
> eq->kobj. So we don't need to handle it with kobject_put(&eq->kobj)
> and instead simply returning -ENOMEM should be sufficient. Agree?
I meant the failure from blk_mq_init_sched_shared_tags(), which has to
call kobject_put() for correct cleanup.
Thanks,
Ming
* Re: [PATCHv3 1/2] block: move elevator queue allocation logic into blk_mq_init_sched
2025-06-20 15:17 ` Ming Lei
@ 2025-06-20 16:13 ` Nilay Shroff
0 siblings, 0 replies; 14+ messages in thread
From: Nilay Shroff @ 2025-06-20 16:13 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, hch, axboe, sth, gjoyce
On 6/20/25 8:47 PM, Ming Lei wrote:
> On Fri, Jun 20, 2025 at 08:09:01PM +0530, Nilay Shroff wrote:
>>
>>
>> On 6/17/25 8:37 PM, Ming Lei wrote:
>>> On Mon, Jun 16, 2025 at 11:02:25PM +0530, Nilay Shroff wrote:
>>>> In preparation for allocating sched_tags before freezing the request
>>>> queue and acquiring ->elevator_lock, move the elevator queue allocation
>>>> logic from the elevator ops ->init_sched callback into blk_mq_init_sched.
>>>>
>>>> This refactoring provides a centralized location for elevator queue
>>>> initialization, which makes it easier to store pre-allocated sched_tags
>>>> in the struct elevator_queue during later changes.
>>>>
>>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>>> ---
>>
>> [...]
>>
>>>> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
>>>> index 55a0fd105147..d914eb9d61a6 100644
>>>> --- a/block/blk-mq-sched.c
>>>> +++ b/block/blk-mq-sched.c
>>>> @@ -475,6 +475,10 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
>>>> q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
>>>> BLKDEV_DEFAULT_RQ);
>>>>
>>>> + eq = elevator_alloc(q, e);
>>>> + if (!eq)
>>>> + return -ENOMEM;
>>>> +
>>>> if (blk_mq_is_shared_tags(flags)) {
>>>> ret = blk_mq_init_sched_shared_tags(q);
>>>> if (ret)
>>>
>>> The above failure needs to be handled by kobject_put(&eq->kobj).
>>
>> I think here the elevator_alloc() failure occurs before we initialize
>> eq->kobj. So we don't need to handle it with kobject_put(&eq->kobj)
>> and instead simply returning -ENOMEM should be sufficient. Agree?
>
> I meant the failure from blk_mq_init_sched_shared_tags(), which has to
> call kobject_put() for correct cleanup.
>
Oh I see... yes we need to call kobject_put() here.
Will do it in the next series.
Thanks,
--Nilay
* Re: [PATCHv3 1/2] block: move elevator queue allocation logic into blk_mq_init_sched
2025-06-16 17:32 ` [PATCHv3 1/2] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
2025-06-17 15:07 ` Ming Lei
@ 2025-06-23 5:56 ` Christoph Hellwig
2025-06-23 9:14 ` Nilay Shroff
1 sibling, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2025-06-23 5:56 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, ming.lei, hch, axboe, sth, gjoyce
With this elevator_alloc can be unexported now.
* Re: [PATCHv3 2/2] block: fix lock dependency between percpu alloc lock and elevator lock
2025-06-16 17:32 ` [PATCHv3 2/2] block: fix lock dependency between percpu alloc lock and elevator lock Nilay Shroff
2025-06-18 3:06 ` Ming Lei
@ 2025-06-23 6:10 ` Christoph Hellwig
2025-06-23 9:33 ` Nilay Shroff
1 sibling, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2025-06-23 6:10 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-block, ming.lei, hch, axboe, sth, gjoyce
On Mon, Jun 16, 2025 at 11:02:26PM +0530, Nilay Shroff wrote:
> +static void blk_mq_init_sched_tags(struct request_queue *q,
> + struct blk_mq_hw_ctx *hctx,
> + unsigned int hctx_idx,
> + struct elevator_queue *eq)
> {
> if (blk_mq_is_shared_tags(q->tag_set->flags)) {
> hctx->sched_tags = q->sched_shared_tags;
> + return;
> }
>
> + hctx->sched_tags = eq->tags->u.tags[hctx_idx];
> }
Given how trivial this function now is, please inline it in the only
caller. That should also allow moving the blk_mq_is_shared_tags
shared out of the loop over all hw contexts, and merge it with the
check right next to it.
> +static void blk_mq_init_sched_shared_tags(struct request_queue *queue,
> + struct elevator_queue *eq)
> {
> + queue->sched_shared_tags = eq->tags->u.shared_tags;
> blk_mq_tag_update_sched_shared_tags(queue);
> }
This helper can also just be open coded in the caller now.
> + if (blk_mq_is_shared_tags(set->flags)) {
> + if (tags->u.shared_tags) {
> + blk_mq_free_rqs(set, tags->u.shared_tags,
> + BLK_MQ_NO_HCTX_IDX);
> + blk_mq_free_rq_map(tags->u.shared_tags);
> + }
> + goto out;
> + }
> +
> + if (!tags->u.tags)
> + goto out;
> +
> + for (i = 0; i < tags->nr_hw_queues; i++) {
> + if (tags->u.tags[i]) {
> + blk_mq_free_rqs(set, tags->u.tags[i], i);
> + blk_mq_free_rq_map(tags->u.tags[i]);
> + }
> + }
Maybe restructure this a bit:
if (blk_mq_is_shared_tags(set->flags)) {
..
} else if (tags->u.tags) {
}
kfree(tags);
to have a simpler flow and remove the need for the "goto out"?
> + tags = kcalloc(1, sizeof(struct elevator_tags), gfp);
This can use plain kzalloc.
> + if (blk_mq_is_shared_tags(set->flags)) {
> +
> + tags->u.shared_tags = blk_mq_alloc_map_and_rqs(set,
The empty line above is a bit odd.
> + BLK_MQ_NO_HCTX_IDX,
> + MAX_SCHED_RQ);
> + if (!tags->u.shared_tags)
> + goto out;
> +
> + return tags;
> + }
> +
> + tags->u.tags = kcalloc(nr_hw_queues, sizeof(struct blk_mq_tags *), gfp);
> + if (!tags->u.tags)
> + goto out;
> +
> + tags->nr_hw_queues = nr_hw_queues;
> + for (i = 0; i < nr_hw_queues; i++) {
> + tags->u.tags[i] = blk_mq_alloc_map_and_rqs(set, i,
> + tags->nr_requests);
> + if (!tags->u.tags[i])
> + goto out;
> + }
> +
> + return tags;
> +
> +out:
> + __blk_mq_free_sched_tags(set, tags);
Is __blk_mq_free_sched_tags really the right thing here vs just unwinding
what this function did?
> + /*
> + * Accessing q->elevator without holding q->elevator_lock is safe
> + * because we're holding here set->update_nr_hwq_lock in the writer
> + * context. So, scheduler update/switch code (which acquires the same
> + * lock but in the reader context) can't run concurrently.
> + */
> + list_for_each_entry(q, &set->tag_list, tag_set_list) {
> + if (q->elevator)
> + count++;
> + }
Maybe add a helper for this code and the comment?
> - lockdep_assert_held(&q->tag_set->update_nr_hwq_lock);
> + lockdep_assert_held(&set->update_nr_hwq_lock);
> +
> + if (strncmp(ctx->name, "none", 4)) {
This is a check for not having an elevator so far, right? Wouldn't
!q->elevator be the more obvious check for that? Or am I missing
something why that's not safe here?
> diff --git a/block/elevator.h b/block/elevator.h
> index a4de5f9ad790..0b92121005cf 100644
> --- a/block/elevator.h
> +++ b/block/elevator.h
> @@ -23,6 +23,17 @@ enum elv_merge {
> struct blk_mq_alloc_data;
> struct blk_mq_hw_ctx;
>
> +/* Holding context data for changing elevator */
> +struct elv_change_ctx {
> + const char *name;
> + bool no_uevent;
> +
> + /* for unregistering old elevator */
> + struct elevator_queue *old;
> + /* for registering new elevator */
> + struct elevator_queue *new;
> +};
No need to move this, it is still only used in elevator.c.
> extern struct elevator_queue *elevator_alloc(struct request_queue *,
> - struct elevator_type *);
> + struct elevator_type *, struct elevator_tags *);
Drop the extern while you're at it.
* Re: [PATCHv3 1/2] block: move elevator queue allocation logic into blk_mq_init_sched
2025-06-23 5:56 ` Christoph Hellwig
@ 2025-06-23 9:14 ` Nilay Shroff
0 siblings, 0 replies; 14+ messages in thread
From: Nilay Shroff @ 2025-06-23 9:14 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-block, ming.lei, axboe, sth, gjoyce
On 6/23/25 11:26 AM, Christoph Hellwig wrote:
> With this elevator_alloc can be unexported now.
>
Yes, it should be unexported. It will be handled in the next patchset.
Thanks,
--Nilay
* Re: [PATCHv3 2/2] block: fix lock dependency between percpu alloc lock and elevator lock
2025-06-23 6:10 ` Christoph Hellwig
@ 2025-06-23 9:33 ` Nilay Shroff
2025-06-23 13:36 ` Christoph Hellwig
0 siblings, 1 reply; 14+ messages in thread
From: Nilay Shroff @ 2025-06-23 9:33 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-block, ming.lei, axboe, sth, gjoyce
On 6/23/25 11:40 AM, Christoph Hellwig wrote:
> On Mon, Jun 16, 2025 at 11:02:26PM +0530, Nilay Shroff wrote:
>> +static void blk_mq_init_sched_tags(struct request_queue *q,
>> + struct blk_mq_hw_ctx *hctx,
>> + unsigned int hctx_idx,
>> + struct elevator_queue *eq)
>> {
>> if (blk_mq_is_shared_tags(q->tag_set->flags)) {
>> hctx->sched_tags = q->sched_shared_tags;
>> + return;
>> }
>>
>> + hctx->sched_tags = eq->tags->u.tags[hctx_idx];
>> }
>
> Given how trivial this function now is, please inline it in the only
> caller. That should also allow moving the blk_mq_is_shared_tags
> shared out of the loop over all hw contexts, and merge it with the
> check right next to it.
>
Okay, will do it in the next patchset.
>> +static void blk_mq_init_sched_shared_tags(struct request_queue *queue,
>> + struct elevator_queue *eq)
>> {
>> + queue->sched_shared_tags = eq->tags->u.shared_tags;
>> blk_mq_tag_update_sched_shared_tags(queue);
>> }
>
> This helper can also just be open coded in the caller now.
Yes agreed...
>
>> + if (blk_mq_is_shared_tags(set->flags)) {
>> + if (tags->u.shared_tags) {
>> + blk_mq_free_rqs(set, tags->u.shared_tags,
>> + BLK_MQ_NO_HCTX_IDX);
>> + blk_mq_free_rq_map(tags->u.shared_tags);
>> + }
>> + goto out;
>> + }
>> +
>> + if (!tags->u.tags)
>> + goto out;
>> +
>> + for (i = 0; i < tags->nr_hw_queues; i++) {
>> + if (tags->u.tags[i]) {
>> + blk_mq_free_rqs(set, tags->u.tags[i], i);
>> + blk_mq_free_rq_map(tags->u.tags[i]);
>> + }
>> + }
>
> Maybe restructure this a bit:
>
> if (blk_mq_is_shared_tags(set->flags)) {
> ..
> } else if (tags->u.tags) {
> }
>
> kfree(tags);
>
> to have a simpler flow and remove the need for the "goto out"?
>
Okay, but as you know I am rewriting this logic using an XArray, as
Ming pointed out in his last email. So this code will be restructured,
and I will make sure your comment above is addressed in the new
logic.
>> + tags = kcalloc(1, sizeof(struct elevator_tags), gfp);
>
> This can use plain kzalloc.
>
>> + if (blk_mq_is_shared_tags(set->flags)) {
>> +
>> + tags->u.shared_tags = blk_mq_alloc_map_and_rqs(set,
>
> The empty line above is a bit odd.
Yes, I will address this in the next patchset.
>
>> + BLK_MQ_NO_HCTX_IDX,
>> + MAX_SCHED_RQ);
>> + if (!tags->u.shared_tags)
>> + goto out;
>> +
>> + return tags;
>> + }
>> +
>> + tags->u.tags = kcalloc(nr_hw_queues, sizeof(struct blk_mq_tags *), gfp);
>> + if (!tags->u.tags)
>> + goto out;
>> +
>> + tags->nr_hw_queues = nr_hw_queues;
>> + for (i = 0; i < nr_hw_queues; i++) {
>> + tags->u.tags[i] = blk_mq_alloc_map_and_rqs(set, i,
>> + tags->nr_requests);
>> + if (!tags->u.tags[i])
>> + goto out;
>> + }
>> +
>> + return tags;
>> +
>> +out:
>> + __blk_mq_free_sched_tags(set, tags);
>
> Is __blk_mq_free_sched_tags really the right thing here vs just unwinding
> what this function did?
>
Hmm, __blk_mq_free_sched_tags actually releases everything that is allocated
in this function (so there won't be any leak); however, it really _does_not_
free the resources in reverse order. It just loops starting from the first
nr_hw_queue index and releases whatever was allocated. But I think you're right,
we could probably have a more structured way of unwinding things here instead
of calling __blk_mq_free_sched_tags. I will handle this in the next patch.
>> + /*
>> + * Accessing q->elevator without holding q->elevator_lock is safe
>> + * because we're holding here set->update_nr_hwq_lock in the writer
>> + * context. So, scheduler update/switch code (which acquires the same
>> + * lock but in the reader context) can't run concurrently.
>> + */
>> + list_for_each_entry(q, &set->tag_list, tag_set_list) {
>> + if (q->elevator)
>> + count++;
>> + }
>
> Maybe add a helper for this code and the comment?
Yeah, but as I mentioned above, with the XArray code being added we may
not need the above loop. So let's see; if needed, I will add a macro
with a proper comment in the next patch.
>
>> - lockdep_assert_held(&q->tag_set->update_nr_hwq_lock);
>> + lockdep_assert_held(&set->update_nr_hwq_lock);
>> +
>> + if (strncmp(ctx->name, "none", 4)) {
>
> This is a check for not having an elevator so far, right? Wouldn't
> !q->elevator be the more obvious check for that? Or am I missing
> something why that's not safe here?
>
This code runs in the context of an elevator switch, not as part of an
nr_hw_queues update. Hence, at this point, q->elevator has not yet been
updated to the new elevator we’re switching to, so accessing q->elevator
here would be incorrect. Since we've already stored the name of the target
elevator in ctx->name, we use that instead of referencing q->elevator here.
>> diff --git a/block/elevator.h b/block/elevator.h
>> index a4de5f9ad790..0b92121005cf 100644
>> --- a/block/elevator.h
>> +++ b/block/elevator.h
>> @@ -23,6 +23,17 @@ enum elv_merge {
>> struct blk_mq_alloc_data;
>> struct blk_mq_hw_ctx;
>>
>> +/* Holding context data for changing elevator */
>> +struct elv_change_ctx {
>> + const char *name;
>> + bool no_uevent;
>> +
>> + /* for unregistering old elevator */
>> + struct elevator_queue *old;
>> + /* for registering new elevator */
>> + struct elevator_queue *new;
>> +};
>
> No need to move this, it is still only used in elevator.c.
>
Yes, agreed. I've moved it back to its original place in
elevator.c.
>> extern struct elevator_queue *elevator_alloc(struct request_queue *,
>> - struct elevator_type *);
>> + struct elevator_type *, struct elevator_tags *);
>
> Drop the extern while you're at it.
>
Yeah sure.
Thanks,
--Nilay
* Re: [PATCHv3 2/2] block: fix lock dependency between percpu alloc lock and elevator lock
2025-06-23 9:33 ` Nilay Shroff
@ 2025-06-23 13:36 ` Christoph Hellwig
0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2025-06-23 13:36 UTC (permalink / raw)
To: Nilay Shroff; +Cc: Christoph Hellwig, linux-block, ming.lei, axboe, sth, gjoyce
On Mon, Jun 23, 2025 at 03:03:27PM +0530, Nilay Shroff wrote:
> >
> > This is a check for not having an elevator so far, right? Wouldn't
> > !q->elevator be the more obvious check for that? Or am I missing
> > something why that's not safe here?
> >
> This code runs in the context of an elevator switch, not as part of an
> nr_hw_queues update. Hence, at this point, q->elevator has not yet been
> updated to the new elevator we’re switching to, so accessing q->elevator
> here would be incorrect. Since we've already stored the name of the target
> elevator in ctx->name, we use that instead of referencing q->elevator here.
Makes sense, thanks.
End of thread, other threads: [~2025-06-23 13:36 UTC | newest]
Thread overview: 14+ messages
-- links below jump to the message on this page --
2025-06-16 17:32 [PATCHv3 0/2] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
2025-06-16 17:32 ` [PATCHv3 1/2] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
2025-06-17 15:07 ` Ming Lei
2025-06-20 14:39 ` Nilay Shroff
2025-06-20 15:17 ` Ming Lei
2025-06-20 16:13 ` Nilay Shroff
2025-06-23 5:56 ` Christoph Hellwig
2025-06-23 9:14 ` Nilay Shroff
2025-06-16 17:32 ` [PATCHv3 2/2] block: fix lock dependency between percpu alloc lock and elevator lock Nilay Shroff
2025-06-18 3:06 ` Ming Lei
2025-06-18 6:52 ` Nilay Shroff
2025-06-23 6:10 ` Christoph Hellwig
2025-06-23 9:33 ` Nilay Shroff
2025-06-23 13:36 ` Christoph Hellwig