* [PATCHv4 0/3] block: move sched_tags allocation/de-allocation outside of locking context
@ 2025-06-24 13:17 Nilay Shroff
2025-06-24 13:17 ` [PATCHv4 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Nilay Shroff @ 2025-06-24 13:17 UTC (permalink / raw)
To: linux-block; +Cc: hch, ming.lei, axboe, sth, gjoyce
Hi,
There have been a few reports[1] indicating potential lockdep warnings due
to a lock dependency from the percpu allocator to the elevator lock. This
patch series aims to eliminate that dependency.
The series consists of three patches:
The first patch is a preparatory patch that moves the elevator queue
allocation logic from ->init_sched into blk_mq_init_sched.
The second patch in the series restructures sched_tags allocation and
deallocation during elevator update/switch operations to ensure these
actions are performed entirely outside the ->freeze_lock and
->elevator_lock. This eliminates the percpu allocator's lock dependency
on the elevator and freeze locks during scheduler transitions.
The third patch introduces batch allocation and deallocation helpers for
sched_tags. These helpers are used during __blk_mq_update_nr_hw_queues()
to decouple sched_tags memory management from both the elevator and freeze
locks, addressing the lockdep concerns in the nr_hw_queues update path.
[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Changes since v3:
- Further split the patchset into a three-patch series so that we can
  have a separate patch for sched_tags batch allocation/deallocation
  (Ming Lei)
- Use Xarray to store and load the sched_tags (Ming Lei)
- Unexport elevator_alloc() as we no longer need to use it outside of
  the block layer core (hch)
- Unwind the sched_tags allocation and free tags when it fails in
  the middle of allocation (hch)
- Move the struct elevator_queue definition from the common header to
  elevator.c as there's no user of it outside elevator.c (Ming Lei, hch)
Link to v3: https://lore.kernel.org/all/20250616173233.3803824-1-nilay@linux.ibm.com/
Changes since v2:
- Split the patch into a two-patch series: the first patch updates the
  ->init_sched elevator API and the second patch handles the sched_tags
  allocation/de-allocation logic (Ming Lei)
- Address the sched_tags allocation/deallocation logic while running in
  the context of an nr_hw_queues update so that we can handle all
  possible scenarios in a single patchset (Ming Lei)
Link to v2: https://lore.kernel.org/all/20250528123638.1029700-1-nilay@linux.ibm.com/
Changes since v1:
- As the lifetimes of the elevator queue and sched_tags are the same,
  allocate and move sched_tags under struct elevator_queue (Ming Lei)
Link to v1: https://lore.kernel.org/all/20250520103425.1259712-1-nilay@linux.ibm.com/
Nilay Shroff (3):
block: move elevator queue allocation logic into blk_mq_init_sched
block: fix lockdep warning caused by lock dependency in
elv_iosched_store
block: fix potential deadlock while running nr_hw_queue update
block/bfq-iosched.c | 13 +--
block/blk-mq-sched.c | 238 +++++++++++++++++++++++++++++-------------
block/blk-mq-sched.h | 19 +++-
block/blk-mq.c | 12 ++-
block/blk.h | 3 +-
block/elevator.c | 52 +++++++--
block/elevator.h | 21 +++-
block/kyber-iosched.c | 11 +-
block/mq-deadline.c | 14 +--
9 files changed, 267 insertions(+), 116 deletions(-)
--
2.49.0
^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCHv4 1/3] block: move elevator queue allocation logic into blk_mq_init_sched
  2025-06-24 13:17 [PATCHv4 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
@ 2025-06-24 13:17 ` Nilay Shroff
  2025-06-24 13:17 ` [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
  2025-06-24 13:17 ` [PATCHv4 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
  2 siblings, 0 replies; 8+ messages in thread
From: Nilay Shroff @ 2025-06-24 13:17 UTC (permalink / raw)
  To: linux-block; +Cc: hch, ming.lei, axboe, sth, gjoyce

In preparation for allocating sched_tags before freezing the request
queue and acquiring ->elevator_lock, move the elevator queue allocation
logic from the elevator ops ->init_sched callback into blk_mq_init_sched.
As elevator_alloc is now only invoked from the block layer core, we don't
need to export it, so unexport the elevator_alloc function.

This refactoring provides a centralized location for elevator queue
initialization, which makes it easier to store pre-allocated sched_tags
in the struct elevator_queue during later changes.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 block/bfq-iosched.c   | 13 +++----------
 block/blk-mq-sched.c  | 11 ++++++++---
 block/elevator.c      |  1 -
 block/elevator.h      |  2 +-
 block/kyber-iosched.c | 11 ++---------
 block/mq-deadline.c   | 14 ++------------
 6 files changed, 16 insertions(+), 36 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0cb1e9873aab..fd26dc1901b0 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7232,22 +7232,16 @@ static void bfq_init_root_group(struct bfq_group *root_group,
 	root_group->sched_data.bfq_class_idle_last_service = jiffies;
 }
 
-static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
+static int bfq_init_queue(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct bfq_data *bfqd;
-	struct elevator_queue *eq;
 	unsigned int i;
 	struct blk_independent_access_ranges *ia_ranges = q->disk->ia_ranges;
 
-	eq = elevator_alloc(q, e);
-	if (!eq)
-		return -ENOMEM;
-
 	bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
-	if (!bfqd) {
-		kobject_put(&eq->kobj);
+	if (!bfqd)
 		return -ENOMEM;
-	}
+
 	eq->elevator_data = bfqd;
 
 	spin_lock_irq(&q->queue_lock);
@@ -7405,7 +7399,6 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 
 out_free:
 	kfree(bfqd);
-	kobject_put(&eq->kobj);
 	return -ENOMEM;
 }
 
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 55a0fd105147..359e0704e09b 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -475,10 +475,14 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
 				   BLKDEV_DEFAULT_RQ);
 
+	eq = elevator_alloc(q, e);
+	if (!eq)
+		return -ENOMEM;
+
 	if (blk_mq_is_shared_tags(flags)) {
 		ret = blk_mq_init_sched_shared_tags(q);
 		if (ret)
-			return ret;
+			goto err_put_elevator;
 	}
 
 	queue_for_each_hw_ctx(q, hctx, i) {
@@ -487,7 +491,7 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 			goto err_free_map_and_rqs;
 	}
 
-	ret = e->ops.init_sched(q, e);
+	ret = e->ops.init_sched(q, eq);
 	if (ret)
 		goto err_free_map_and_rqs;
 
@@ -508,7 +512,8 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 err_free_map_and_rqs:
 	blk_mq_sched_free_rqs(q);
 	blk_mq_sched_tags_teardown(q, flags);
-
+err_put_elevator:
+	kobject_put(&eq->kobj);
 	q->elevator = NULL;
 	return ret;
 }
diff --git a/block/elevator.c b/block/elevator.c
index ab22542e6cf0..770874040f79 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -148,7 +148,6 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
 
 	return eq;
 }
-EXPORT_SYMBOL(elevator_alloc);
 
 static void elevator_release(struct kobject *kobj)
 {
diff --git a/block/elevator.h b/block/elevator.h
index a07ce773a38f..a4de5f9ad790 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -24,7 +24,7 @@ struct blk_mq_alloc_data;
 struct blk_mq_hw_ctx;
 
 struct elevator_mq_ops {
-	int (*init_sched)(struct request_queue *, struct elevator_type *);
+	int (*init_sched)(struct request_queue *, struct elevator_queue *);
 	void (*exit_sched)(struct elevator_queue *);
 	int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);
 	void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 4dba8405bd01..7b6832cb3a8d 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -402,20 +402,13 @@ static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
 	return ERR_PTR(ret);
 }
 
-static int kyber_init_sched(struct request_queue *q, struct elevator_type *e)
+static int kyber_init_sched(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct kyber_queue_data *kqd;
-	struct elevator_queue *eq;
-
-	eq = elevator_alloc(q, e);
-	if (!eq)
-		return -ENOMEM;
 
 	kqd = kyber_queue_data_alloc(q);
-	if (IS_ERR(kqd)) {
-		kobject_put(&eq->kobj);
+	if (IS_ERR(kqd))
 		return PTR_ERR(kqd);
-	}
 
 	blk_stat_enable_accounting(q);
 
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 2edf1cac06d5..7b6caf30e00a 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -568,20 +568,14 @@ static void dd_exit_sched(struct elevator_queue *e)
 /*
  * initialize elevator private data (deadline_data).
  */
-static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
+static int dd_init_sched(struct request_queue *q, struct elevator_queue *eq)
 {
 	struct deadline_data *dd;
-	struct elevator_queue *eq;
 	enum dd_prio prio;
-	int ret = -ENOMEM;
-
-	eq = elevator_alloc(q, e);
-	if (!eq)
-		return ret;
 
 	dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
 	if (!dd)
-		goto put_eq;
+		return -ENOMEM;
 
 	eq->elevator_data = dd;
 
@@ -608,10 +602,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	q->elevator = eq;
 
 	return 0;
-
-put_eq:
-	kobject_put(&eq->kobj);
-	return ret;
 }
 
 /*
-- 
2.49.0

^ permalink raw reply related	[flat|nested] 8+ messages in thread
* [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
  2025-06-24 13:17 [PATCHv4 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
  2025-06-24 13:17 ` [PATCHv4 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
@ 2025-06-24 13:17 ` Nilay Shroff
  2025-06-26 14:43   ` Ming Lei
  2025-06-24 13:17 ` [PATCHv4 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
  2 siblings, 1 reply; 8+ messages in thread
From: Nilay Shroff @ 2025-06-24 13:17 UTC (permalink / raw)
  To: linux-block; +Cc: hch, ming.lei, axboe, sth, gjoyce

Recent lockdep reports [1] have revealed a potential deadlock caused by a
lock dependency between the percpu allocator lock and the elevator lock.
This issue can be avoided by ensuring that the allocation and release of
scheduler tags (sched_tags) are performed outside the elevator lock.
Furthermore, the queue does not need to remain frozen during these
operations.

To address this, move all sched_tags allocations and deallocations outside
of both the ->elevator_lock and the ->freeze_lock. Since the lifetime of
the elevator queue and its associated sched_tags is closely tied, the
allocated sched_tags are now stored in the elevator queue structure. Then,
during the actual elevator switch (which runs under ->freeze_lock and
->elevator_lock), the pre-allocated sched_tags are assigned to the
appropriate q->hctx. Once the elevator switch is complete and the locks
are released, the old elevator queue and its associated sched_tags are
freed.

This commit specifically addresses the allocation/deallocation of
sched_tags during elevator switching. Note that sched_tags may also be
allocated in other contexts, such as during nr_hw_queues updates.
Supporting that use case will require batch allocation/deallocation,
which will be handled in a follow-up patch.

This restructuring ensures that sched_tags memory management occurs
entirely outside of the ->elevator_lock and ->freeze_lock context,
eliminating the lock dependency problem seen during scheduler updates.

[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/

Reported-by: Stefan Haberland <sth@linux.ibm.com>
Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 block/blk-mq-sched.c | 186 ++++++++++++++++++++++++++-----------------
 block/blk-mq-sched.h |  14 +++-
 block/elevator.c     |  66 +++++++++++++--
 block/elevator.h     |  19 ++++-
 4 files changed, 204 insertions(+), 81 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 359e0704e09b..5d3132ac7777 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -374,64 +374,17 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq,
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge);
 
-static int blk_mq_sched_alloc_map_and_rqs(struct request_queue *q,
-					  struct blk_mq_hw_ctx *hctx,
-					  unsigned int hctx_idx)
-{
-	if (blk_mq_is_shared_tags(q->tag_set->flags)) {
-		hctx->sched_tags = q->sched_shared_tags;
-		return 0;
-	}
-
-	hctx->sched_tags = blk_mq_alloc_map_and_rqs(q->tag_set, hctx_idx,
-						    q->nr_requests);
-
-	if (!hctx->sched_tags)
-		return -ENOMEM;
-	return 0;
-}
-
-static void blk_mq_exit_sched_shared_tags(struct request_queue *queue)
-{
-	blk_mq_free_rq_map(queue->sched_shared_tags);
-	queue->sched_shared_tags = NULL;
-}
-
 /* called in queue's release handler, tagset has gone away */
 static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int flags)
 {
 	struct blk_mq_hw_ctx *hctx;
 	unsigned long i;
 
-	queue_for_each_hw_ctx(q, hctx, i) {
-		if (hctx->sched_tags) {
-			if (!blk_mq_is_shared_tags(flags))
-				blk_mq_free_rq_map(hctx->sched_tags);
-			hctx->sched_tags = NULL;
-		}
-	}
+	queue_for_each_hw_ctx(q, hctx, i)
+		hctx->sched_tags = NULL;
 
 	if (blk_mq_is_shared_tags(flags))
-		blk_mq_exit_sched_shared_tags(q);
-}
-
-static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
-{
-	struct blk_mq_tag_set *set = queue->tag_set;
-
-	/*
-	 * Set initial depth at max so that we don't need to reallocate for
-	 * updating nr_requests.
-	 */
-	queue->sched_shared_tags = blk_mq_alloc_map_and_rqs(set,
-						BLK_MQ_NO_HCTX_IDX,
-						MAX_SCHED_RQ);
-	if (!queue->sched_shared_tags)
-		return -ENOMEM;
-
-	blk_mq_tag_update_sched_shared_tags(queue);
-
-	return 0;
+		q->sched_shared_tags = NULL;
 }
 
 void blk_mq_sched_reg_debugfs(struct request_queue *q)
@@ -458,8 +411,106 @@ void blk_mq_sched_unreg_debugfs(struct request_queue *q)
 	mutex_unlock(&q->debugfs_mutex);
 }
 
+void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set,
+		struct blk_mq_tags **tags, unsigned int nr_hw_queues)
+{
+	unsigned long i;
+
+	if (!tags)
+		return;
+
+	/* Shared tags are stored at index 0 in @tags. */
+	if (blk_mq_is_shared_tags(set->flags))
+		blk_mq_free_map_and_rqs(set, tags[0], BLK_MQ_NO_HCTX_IDX);
+	else {
+		for (i = 0; i < nr_hw_queues; i++)
+			blk_mq_free_map_and_rqs(set, tags[i], i);
+	}
+
+	kfree(tags);
+}
+
+void blk_mq_free_sched_tags(struct elevator_tags *et,
+		struct blk_mq_tag_set *set, int id)
+{
+	struct blk_mq_tags **tags;
+
+	tags = xa_load(&et->tags_table, id);
+	__blk_mq_free_sched_tags(set, tags, et->nr_hw_queues);
+}
+
+struct blk_mq_tags **__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
+		unsigned int nr_hw_queues,
+		unsigned int nr_requests,
+		gfp_t gfp)
+{
+	int i, nr_tags;
+	struct blk_mq_tags **tags;
+
+	if (blk_mq_is_shared_tags(set->flags))
+		nr_tags = 1;
+	else
+		nr_tags = nr_hw_queues;
+
+	tags = kcalloc(nr_tags, sizeof(struct blk_mq_tags *), gfp);
+	if (!tags)
+		return NULL;
+
+	if (blk_mq_is_shared_tags(set->flags)) {
+		/* Shared tags are stored at index 0 in @tags. */
+		tags[0] = blk_mq_alloc_map_and_rqs(set, BLK_MQ_NO_HCTX_IDX,
+				MAX_SCHED_RQ);
+		if (!tags[0])
+			goto out;
+	} else {
+		for (i = 0; i < nr_hw_queues; i++) {
+			tags[i] = blk_mq_alloc_map_and_rqs(set, i, nr_requests);
+			if (!tags[i])
+				goto out_unwind;
+		}
+	}
+
+	return tags;
+out_unwind:
+	while (--i >= 0)
+		blk_mq_free_map_and_rqs(set, tags[i], i);
+out:
+	kfree(tags);
+	return NULL;
+}
+
+int blk_mq_alloc_sched_tags(struct elevator_tags *et,
+		struct blk_mq_tag_set *set, int id)
+{
+	struct blk_mq_tags **tags;
+	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
+
+	/*
+	 * Default to double of smaller one between hw queue_depth and
+	 * 128, since we don't split into sync/async like the old code
+	 * did. Additionally, this is a per-hw queue depth.
+	 */
+	et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
+			BLKDEV_DEFAULT_RQ);
+
+	tags = __blk_mq_alloc_sched_tags(set, et->nr_hw_queues,
+			et->nr_requests, gfp);
+	if (!tags)
+		return -ENOMEM;
+
+	if (xa_insert(&et->tags_table, id, tags, gfp))
+		goto out_free_tags;
+
+	return 0;
+
+out_free_tags:
+	__blk_mq_free_sched_tags(set, tags, et->nr_hw_queues);
+	return -ENOMEM;
+}
+
 /* caller must have a reference to @e, will grab another one if successful */
-int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
+		struct elevator_tags *et)
 {
 	unsigned int flags = q->tag_set->flags;
 	struct blk_mq_hw_ctx *hctx;
@@ -467,40 +518,33 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	unsigned long i;
 	int ret;
 
-	/*
-	 * Default to double of smaller one between hw queue_depth and 128,
-	 * since we don't split into sync/async like the old code did.
-	 * Additionally, this is a per-hw queue depth.
-	 */
-	q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
-				   BLKDEV_DEFAULT_RQ);
-
-	eq = elevator_alloc(q, e);
+	eq = elevator_alloc(q, e, et);
 	if (!eq)
 		return -ENOMEM;
 
+	q->nr_requests = et->nr_requests;
+
 	if (blk_mq_is_shared_tags(flags)) {
-		ret = blk_mq_init_sched_shared_tags(q);
-		if (ret)
-			goto err_put_elevator;
+		/* Shared tags are stored at index 0 in @eq->tags. */
+		q->sched_shared_tags = eq->tags[0];
+		blk_mq_tag_update_sched_shared_tags(q);
 	}
 
 	queue_for_each_hw_ctx(q, hctx, i) {
-		ret = blk_mq_sched_alloc_map_and_rqs(q, hctx, i);
-		if (ret)
-			goto err_free_map_and_rqs;
+		if (blk_mq_is_shared_tags(flags))
+			hctx->sched_tags = q->sched_shared_tags;
+		else
+			hctx->sched_tags = eq->tags[i];
 	}
 
 	ret = e->ops.init_sched(q, eq);
 	if (ret)
-		goto err_free_map_and_rqs;
+		goto out;
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (e->ops.init_hctx) {
 			ret = e->ops.init_hctx(hctx, i);
 			if (ret) {
-				eq = q->elevator;
-				blk_mq_sched_free_rqs(q);
 				blk_mq_exit_sched(q, eq);
 				kobject_put(&eq->kobj);
 				return ret;
@@ -509,10 +553,8 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 	}
 
 	return 0;
-err_free_map_and_rqs:
-	blk_mq_sched_free_rqs(q);
+out:
 	blk_mq_sched_tags_teardown(q, flags);
-err_put_elevator:
 	kobject_put(&eq->kobj);
 	q->elevator = NULL;
 	return ret;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 1326526bb733..92aa50b8376a 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -18,7 +18,19 @@ void __blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
 
-int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e);
+int blk_mq_alloc_sched_tags(struct elevator_tags *et,
+		struct blk_mq_tag_set *set, int id);
+void blk_mq_free_sched_tags(struct elevator_tags *et,
+		struct blk_mq_tag_set *set, int id);
+struct blk_mq_tags **__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
+		unsigned int nr_hw_queues,
+		unsigned int nr_requests,
+		gfp_t gfp);
+void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set,
+		struct blk_mq_tags **tags, unsigned int nr_hw_queues);
+
+int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e,
+		struct elevator_tags *et);
 void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e);
 void blk_mq_sched_free_rqs(struct request_queue *q);
 
diff --git a/block/elevator.c b/block/elevator.c
index 770874040f79..1408894c0396 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -54,6 +54,8 @@ struct elv_change_ctx {
 	struct elevator_queue *old;
 	/* for registering new elevator */
 	struct elevator_queue *new;
+	/* holds sched tags data */
+	struct elevator_tags *et;
 };
 
 static DEFINE_SPINLOCK(elv_list_lock);
@@ -132,7 +134,7 @@ static struct elevator_type *elevator_find_get(const char *name)
 static const struct kobj_type elv_ktype;
 
 struct elevator_queue *elevator_alloc(struct request_queue *q,
-				  struct elevator_type *e)
+		struct elevator_type *e, struct elevator_tags *et)
 {
 	struct elevator_queue *eq;
 
@@ -145,8 +147,15 @@ struct elevator_queue *elevator_alloc(struct request_queue *q,
 	kobject_init(&eq->kobj, &elv_ktype);
 	mutex_init(&eq->sysfs_lock);
 	hash_init(eq->hash);
+	eq->nr_hw_queues = et->nr_hw_queues;
+	eq->tags = xa_load(&et->tags_table, q->id);
+	if (WARN_ON_ONCE(!eq->tags))
+		goto out;
 
 	return eq;
+out:
+	kobject_put(&eq->kobj);
+	return NULL;
 }
 
 static void elevator_release(struct kobject *kobj)
@@ -165,7 +174,6 @@ static void elevator_exit(struct request_queue *q)
 	lockdep_assert_held(&q->elevator_lock);
 
 	ioc_clear_queue(q);
-	blk_mq_sched_free_rqs(q);
 
 	mutex_lock(&e->sysfs_lock);
 	blk_mq_exit_sched(q, e);
@@ -591,7 +599,7 @@ static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx)
 	}
 
 	if (new_e) {
-		ret = blk_mq_init_sched(q, new_e);
+		ret = blk_mq_init_sched(q, new_e, ctx->et);
 		if (ret)
 			goto out_unfreeze;
 		ctx->new = q->elevator;
@@ -626,8 +634,10 @@ static void elv_exit_and_release(struct request_queue *q)
 	elevator_exit(q);
 	mutex_unlock(&q->elevator_lock);
 	blk_mq_unfreeze_queue(q, memflags);
-	if (e)
+	if (e) {
+		__blk_mq_free_sched_tags(q->tag_set, e->tags, e->nr_hw_queues);
 		kobject_put(&e->kobj);
+	}
 }
 
 static int elevator_change_done(struct request_queue *q,
@@ -640,6 +650,8 @@ static int elevator_change_done(struct request_queue *q,
 				&ctx->old->flags);
 
 		elv_unregister_queue(q, ctx->old);
+		__blk_mq_free_sched_tags(q->tag_set, ctx->old->tags,
+				ctx->old->nr_hw_queues);
 		kobject_put(&ctx->old->kobj);
 		if (enable_wbt)
 			wbt_enable_default(q->disk);
@@ -658,9 +670,20 @@ static int elevator_change_done(struct request_queue *q,
 static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
 {
 	unsigned int memflags;
+	struct elevator_tags et;
+	struct blk_mq_tag_set *set = q->tag_set;
 	int ret = 0;
 
-	lockdep_assert_held(&q->tag_set->update_nr_hwq_lock);
+	lockdep_assert_held(&set->update_nr_hwq_lock);
+
+	et.nr_hw_queues = set->nr_hw_queues;
+	xa_init(&et.tags_table);
+	ctx->et = &et;
+	if (strncmp(ctx->name, "none", 4)) {
+		ret = blk_mq_alloc_sched_tags(ctx->et, set, q->id);
+		if (ret)
+			goto out;
+	}
 
 	memflags = blk_mq_freeze_queue(q);
 	/*
@@ -680,7 +703,13 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
 	blk_mq_unfreeze_queue(q, memflags);
 	if (!ret)
 		ret = elevator_change_done(q, ctx);
-
+	/*
+	 * Free sched tags if it's allocated but we couldn't switch elevator.
+	 */
+	if (!xa_empty(&ctx->et->tags_table) && !ctx->new)
+		blk_mq_free_sched_tags(ctx->et, set, q->id);
+out:
+	xa_destroy(&ctx->et->tags_table);
 	return ret;
 }
 
@@ -690,11 +719,29 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
  */
 void elv_update_nr_hw_queues(struct request_queue *q)
 {
+	struct blk_mq_tag_set *set = q->tag_set;
+	struct elevator_tags et;
 	struct elv_change_ctx ctx = {};
 	int ret = -ENODEV;
 
 	WARN_ON_ONCE(q->mq_freeze_depth == 0);
 
+	et.nr_hw_queues = set->nr_hw_queues;
+	xa_init(&et.tags_table);
+	ctx.et = &et;
+	/*
+	 * Accessing q->elevator without holding q->elevator_lock is safe here
+	 * because nr_hw_queue update is protected by set->update_nr_hwq_lock
+	 * in the writer context. So, scheduler update/switch code (which
+	 * acquires same lock in the reader context) can't run concurrently.
+	 */
+	if (q->elevator) {
+		if (blk_mq_alloc_sched_tags(ctx.et, set, q->id)) {
+			WARN_ON_ONCE(1);
+			goto out;
+		}
+	}
+
 	mutex_lock(&q->elevator_lock);
 	if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
 		ctx.name = q->elevator->type->elevator_name;
@@ -706,6 +753,13 @@ void elv_update_nr_hw_queues(struct request_queue *q)
 	blk_mq_unfreeze_queue_nomemrestore(q);
 	if (!ret)
 		WARN_ON_ONCE(elevator_change_done(q, &ctx));
+	/*
+	 * Free sched tags if it's allocated but we couldn't switch elevator.
+	 */
+	if (!xa_empty(&ctx.et->tags_table) && !ctx.new)
+		blk_mq_free_sched_tags(ctx.et, set, q->id);
+out:
+	xa_destroy(&ctx.et->tags_table);
 }
 
 /*
diff --git a/block/elevator.h b/block/elevator.h
index a4de5f9ad790..849c264031ca 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -23,6 +23,19 @@ enum elv_merge {
 struct blk_mq_alloc_data;
 struct blk_mq_hw_ctx;
 
+struct elevator_tags {
+	/* num. of hardware queues for which tags are allocated */
+	unsigned int nr_hw_queues;
+	/* depth used while allocating tags */
+	unsigned int nr_requests;
+	/*
+	 * An Xarray table storing shared and per hardware queue sched tags.
+	 * The key to store an entry in Xarray is q->id. Please note that
+	 * shared sched tags are always stored at index 0.
+	 */
+	struct xarray tags_table;
+};
+
 struct elevator_mq_ops {
 	int (*init_sched)(struct request_queue *, struct elevator_queue *);
 	void (*exit_sched)(struct elevator_queue *);
@@ -113,6 +126,8 @@ struct request *elv_rqhash_find(struct request_queue *q, sector_t offset);
 struct elevator_queue
 {
 	struct elevator_type *type;
+	struct blk_mq_tags **tags;
+	unsigned int nr_hw_queues;
 	void *elevator_data;
 	struct kobject kobj;
 	struct mutex sysfs_lock;
@@ -152,8 +167,8 @@ ssize_t elv_iosched_show(struct gendisk *disk, char *page);
 ssize_t elv_iosched_store(struct gendisk *disk, const char *page, size_t count);
 
 extern bool elv_bio_merge_ok(struct request *, struct bio *);
-extern struct elevator_queue *elevator_alloc(struct request_queue *,
-		struct elevator_type *);
+struct elevator_queue *elevator_alloc(struct request_queue *,
+		struct elevator_type *, struct elevator_tags *);
 
 /*
  * Helper functions.
-- 
2.49.0

^ permalink raw reply related	[flat|nested] 8+ messages in thread
* Re: [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store 2025-06-24 13:17 ` [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff @ 2025-06-26 14:43 ` Ming Lei 2025-06-27 4:13 ` Nilay Shroff 0 siblings, 1 reply; 8+ messages in thread From: Ming Lei @ 2025-06-26 14:43 UTC (permalink / raw) To: Nilay Shroff; +Cc: linux-block, hch, axboe, sth, gjoyce On Tue, Jun 24, 2025 at 06:47:04PM +0530, Nilay Shroff wrote: > Recent lockdep reports [1] have revealed a potential deadlock caused by a > lock dependency between the percpu allocator lock and the elevator lock. > This issue can be avoided by ensuring that the allocation and release of > scheduler tags (sched_tags) are performed outside the elevator lock. > Furthermore, the queue does not need to be remain frozen during these > operations. > > To address this, move all sched_tags allocations and deallocations outside > of both the ->elevator_lock and the ->freeze_lock. Since the lifetime of > the elevator queue and its associated sched_tags is closely tied, the > allocated sched_tags are now stored in the elevator queue structure. Then, > during the actual elevator switch (which runs under ->freeze_lock and > ->elevator_lock), the pre-allocated sched_tags are assigned to the > appropriate q->hctx. Once the elevator switch is complete and the locks > are released, the old elevator queue and its associated sched_tags are > freed. > > This commit specifically addresses the allocation/deallocation of sched_ > tags during elevator switching. Note that sched_tags may also be allocated > in other contexts, such as during nr_hw_queues updates. Supporting that > use case will require batch allocation/deallocation, which will be handled > in a follow-up patch. 
> > This restructuring ensures that sched_tags memory management occurs > entirely outside of the ->elevator_lock and ->freeze_lock context, > eliminating the lock dependency problem seen during scheduler updates. > > [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ > > Reported-by: Stefan Haberland <sth@linux.ibm.com> > Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ > Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> > --- > block/blk-mq-sched.c | 186 ++++++++++++++++++++++++++----------------- > block/blk-mq-sched.h | 14 +++- > block/elevator.c | 66 +++++++++++++-- > block/elevator.h | 19 ++++- > 4 files changed, 204 insertions(+), 81 deletions(-) > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c > index 359e0704e09b..5d3132ac7777 100644 > --- a/block/blk-mq-sched.c > +++ b/block/blk-mq-sched.c > @@ -374,64 +374,17 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq, > } > EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge); > > -static int blk_mq_sched_alloc_map_and_rqs(struct request_queue *q, > - struct blk_mq_hw_ctx *hctx, > - unsigned int hctx_idx) > -{ > - if (blk_mq_is_shared_tags(q->tag_set->flags)) { > - hctx->sched_tags = q->sched_shared_tags; > - return 0; > - } > - > - hctx->sched_tags = blk_mq_alloc_map_and_rqs(q->tag_set, hctx_idx, > - q->nr_requests); > - > - if (!hctx->sched_tags) > - return -ENOMEM; > - return 0; > -} > - > -static void blk_mq_exit_sched_shared_tags(struct request_queue *queue) > -{ > - blk_mq_free_rq_map(queue->sched_shared_tags); > - queue->sched_shared_tags = NULL; > -} > - > /* called in queue's release handler, tagset has gone away */ > static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int flags) > { > struct blk_mq_hw_ctx *hctx; > unsigned long i; > > - queue_for_each_hw_ctx(q, hctx, i) { > - if (hctx->sched_tags) { > - if (!blk_mq_is_shared_tags(flags)) > - 
blk_mq_free_rq_map(hctx->sched_tags); > - hctx->sched_tags = NULL; > - } > - } > + queue_for_each_hw_ctx(q, hctx, i) > + hctx->sched_tags = NULL; > > if (blk_mq_is_shared_tags(flags)) > - blk_mq_exit_sched_shared_tags(q); > -} > - > -static int blk_mq_init_sched_shared_tags(struct request_queue *queue) > -{ > - struct blk_mq_tag_set *set = queue->tag_set; > - > - /* > - * Set initial depth at max so that we don't need to reallocate for > - * updating nr_requests. > - */ > - queue->sched_shared_tags = blk_mq_alloc_map_and_rqs(set, > - BLK_MQ_NO_HCTX_IDX, > - MAX_SCHED_RQ); > - if (!queue->sched_shared_tags) > - return -ENOMEM; > - > - blk_mq_tag_update_sched_shared_tags(queue); > - > - return 0; > + q->sched_shared_tags = NULL; > } > > void blk_mq_sched_reg_debugfs(struct request_queue *q) > @@ -458,8 +411,106 @@ void blk_mq_sched_unreg_debugfs(struct request_queue *q) > mutex_unlock(&q->debugfs_mutex); > } > > +void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set, > + struct blk_mq_tags **tags, unsigned int nr_hw_queues) > +{ > + unsigned long i; > + > + if (!tags) > + return; > + > + /* Shared tags are stored at index 0 in @tags. 
*/ > + if (blk_mq_is_shared_tags(set->flags)) > + blk_mq_free_map_and_rqs(set, tags[0], BLK_MQ_NO_HCTX_IDX); > + else { > + for (i = 0; i < nr_hw_queues; i++) > + blk_mq_free_map_and_rqs(set, tags[i], i); > + } > + > + kfree(tags); > +} > + > +void blk_mq_free_sched_tags(struct elevator_tags *et, > + struct blk_mq_tag_set *set, int id) > +{ > + struct blk_mq_tags **tags; > + > + tags = xa_load(&et->tags_table, id); > + __blk_mq_free_sched_tags(set, tags, et->nr_hw_queues); > +} > + > +struct blk_mq_tags **__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set, > + unsigned int nr_hw_queues, > + unsigned int nr_requests, > + gfp_t gfp) > +{ > + int i, nr_tags; > + struct blk_mq_tags **tags; > + > + if (blk_mq_is_shared_tags(set->flags)) > + nr_tags = 1; > + else > + nr_tags = nr_hw_queues; > + > + tags = kcalloc(nr_tags, sizeof(struct blk_mq_tags *), gfp); > + if (!tags) > + return NULL; > + > + if (blk_mq_is_shared_tags(set->flags)) { > + /* Shared tags are stored at index 0 in @tags. */ > + tags[0] = blk_mq_alloc_map_and_rqs(set, BLK_MQ_NO_HCTX_IDX, > + MAX_SCHED_RQ); > + if (!tags[0]) > + goto out; > + } else { > + for (i = 0; i < nr_hw_queues; i++) { > + tags[i] = blk_mq_alloc_map_and_rqs(set, i, nr_requests); > + if (!tags[i]) > + goto out_unwind; > + } > + } > + > + return tags; > +out_unwind: > + while (--i >= 0) > + blk_mq_free_map_and_rqs(set, tags[i], i); > +out: > + kfree(tags); > + return NULL; > +} > + > +int blk_mq_alloc_sched_tags(struct elevator_tags *et, > + struct blk_mq_tag_set *set, int id) > +{ > + struct blk_mq_tags **tags; > + gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY; > + > + /* > + * Default to double of smaller one between hw queue_depth and > + * 128, since we don't split into sync/async like the old code > + * did. Additionally, this is a per-hw queue depth. 
> + */
> + et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
> + BLKDEV_DEFAULT_RQ);
> +
> + tags = __blk_mq_alloc_sched_tags(set, et->nr_hw_queues,
> + et->nr_requests, gfp);
> + if (!tags)
> + return -ENOMEM;
> +
> + if (xa_insert(&et->tags_table, id, tags, gfp))
> + goto out_free_tags;

Not sure why you add `tags_table` into `elevator_tags`; it looks very confusing just from the naming, not to mention that only a single element is stored in the table in the elevator switch code path, since the queue id is the key.

Wrt. the xarray suggestion, I meant adding an xarray local variable in __blk_mq_update_nr_hw_queues(); then you can store the allocated `elevator_tags` in the xarray via the ->id of the request_queue.

Thanks,
Ming
* Re: [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
2025-06-26 14:43 ` Ming Lei
@ 2025-06-27 4:13 ` Nilay Shroff
2025-06-27 7:58 ` Ming Lei
0 siblings, 1 reply; 8+ messages in thread
From: Nilay Shroff @ 2025-06-27 4:13 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, hch, axboe, sth, gjoyce
On 6/26/25 8:13 PM, Ming Lei wrote:
> On Tue, Jun 24, 2025 at 06:47:04PM +0530, Nilay Shroff wrote:
>> Recent lockdep reports [1] have revealed a potential deadlock caused by a
>> lock dependency between the percpu allocator lock and the elevator lock.
>> This issue can be avoided by ensuring that the allocation and release of
>> scheduler tags (sched_tags) are performed outside the elevator lock.
>> Furthermore, the queue does not need to remain frozen during these
>> operations.
>>
>> To address this, move all sched_tags allocations and deallocations outside
>> of both the ->elevator_lock and the ->freeze_lock. Since the lifetime of
>> the elevator queue and its associated sched_tags is closely tied, the
>> allocated sched_tags are now stored in the elevator queue structure. Then,
>> during the actual elevator switch (which runs under ->freeze_lock and
>> ->elevator_lock), the pre-allocated sched_tags are assigned to the
>> appropriate q->hctx. Once the elevator switch is complete and the locks
>> are released, the old elevator queue and its associated sched_tags are
>> freed.
>>
>> This commit specifically addresses the allocation/deallocation of
>> sched_tags during elevator switching. Note that sched_tags may also be
>> allocated in other contexts, such as during nr_hw_queues updates.
>> Supporting that use case will require batch allocation/deallocation,
>> which will be handled in a follow-up patch.
>>
>> This restructuring ensures that sched_tags memory management occurs
>> entirely outside of the ->elevator_lock and ->freeze_lock context,
>> eliminating the lock dependency problem seen during scheduler updates.
>> >> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ >> >> Reported-by: Stefan Haberland <sth@linux.ibm.com> >> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ >> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> >> --- >> block/blk-mq-sched.c | 186 ++++++++++++++++++++++++++----------------- >> block/blk-mq-sched.h | 14 +++- >> block/elevator.c | 66 +++++++++++++-- >> block/elevator.h | 19 ++++- >> 4 files changed, 204 insertions(+), 81 deletions(-) >> >> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c >> index 359e0704e09b..5d3132ac7777 100644 >> --- a/block/blk-mq-sched.c >> +++ b/block/blk-mq-sched.c >> @@ -374,64 +374,17 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq, >> } >> EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge); >> >> -static int blk_mq_sched_alloc_map_and_rqs(struct request_queue *q, >> - struct blk_mq_hw_ctx *hctx, >> - unsigned int hctx_idx) >> -{ >> - if (blk_mq_is_shared_tags(q->tag_set->flags)) { >> - hctx->sched_tags = q->sched_shared_tags; >> - return 0; >> - } >> - >> - hctx->sched_tags = blk_mq_alloc_map_and_rqs(q->tag_set, hctx_idx, >> - q->nr_requests); >> - >> - if (!hctx->sched_tags) >> - return -ENOMEM; >> - return 0; >> -} >> - >> -static void blk_mq_exit_sched_shared_tags(struct request_queue *queue) >> -{ >> - blk_mq_free_rq_map(queue->sched_shared_tags); >> - queue->sched_shared_tags = NULL; >> -} >> - >> /* called in queue's release handler, tagset has gone away */ >> static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int flags) >> { >> struct blk_mq_hw_ctx *hctx; >> unsigned long i; >> >> - queue_for_each_hw_ctx(q, hctx, i) { >> - if (hctx->sched_tags) { >> - if (!blk_mq_is_shared_tags(flags)) >> - blk_mq_free_rq_map(hctx->sched_tags); >> - hctx->sched_tags = NULL; >> - } >> - } >> + queue_for_each_hw_ctx(q, hctx, i) >> + hctx->sched_tags = NULL; >> >> if (blk_mq_is_shared_tags(flags)) 
>> - blk_mq_exit_sched_shared_tags(q); >> -} >> - >> -static int blk_mq_init_sched_shared_tags(struct request_queue *queue) >> -{ >> - struct blk_mq_tag_set *set = queue->tag_set; >> - >> - /* >> - * Set initial depth at max so that we don't need to reallocate for >> - * updating nr_requests. >> - */ >> - queue->sched_shared_tags = blk_mq_alloc_map_and_rqs(set, >> - BLK_MQ_NO_HCTX_IDX, >> - MAX_SCHED_RQ); >> - if (!queue->sched_shared_tags) >> - return -ENOMEM; >> - >> - blk_mq_tag_update_sched_shared_tags(queue); >> - >> - return 0; >> + q->sched_shared_tags = NULL; >> } >> >> void blk_mq_sched_reg_debugfs(struct request_queue *q) >> @@ -458,8 +411,106 @@ void blk_mq_sched_unreg_debugfs(struct request_queue *q) >> mutex_unlock(&q->debugfs_mutex); >> } >> >> +void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set, >> + struct blk_mq_tags **tags, unsigned int nr_hw_queues) >> +{ >> + unsigned long i; >> + >> + if (!tags) >> + return; >> + >> + /* Shared tags are stored at index 0 in @tags. 
*/ >> + if (blk_mq_is_shared_tags(set->flags)) >> + blk_mq_free_map_and_rqs(set, tags[0], BLK_MQ_NO_HCTX_IDX); >> + else { >> + for (i = 0; i < nr_hw_queues; i++) >> + blk_mq_free_map_and_rqs(set, tags[i], i); >> + } >> + >> + kfree(tags); >> +} >> + >> +void blk_mq_free_sched_tags(struct elevator_tags *et, >> + struct blk_mq_tag_set *set, int id) >> +{ >> + struct blk_mq_tags **tags; >> + >> + tags = xa_load(&et->tags_table, id); >> + __blk_mq_free_sched_tags(set, tags, et->nr_hw_queues); >> +} >> + >> +struct blk_mq_tags **__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set, >> + unsigned int nr_hw_queues, >> + unsigned int nr_requests, >> + gfp_t gfp) >> +{ >> + int i, nr_tags; >> + struct blk_mq_tags **tags; >> + >> + if (blk_mq_is_shared_tags(set->flags)) >> + nr_tags = 1; >> + else >> + nr_tags = nr_hw_queues; >> + >> + tags = kcalloc(nr_tags, sizeof(struct blk_mq_tags *), gfp); >> + if (!tags) >> + return NULL; >> + >> + if (blk_mq_is_shared_tags(set->flags)) { >> + /* Shared tags are stored at index 0 in @tags. */ >> + tags[0] = blk_mq_alloc_map_and_rqs(set, BLK_MQ_NO_HCTX_IDX, >> + MAX_SCHED_RQ); >> + if (!tags[0]) >> + goto out; >> + } else { >> + for (i = 0; i < nr_hw_queues; i++) { >> + tags[i] = blk_mq_alloc_map_and_rqs(set, i, nr_requests); >> + if (!tags[i]) >> + goto out_unwind; >> + } >> + } >> + >> + return tags; >> +out_unwind: >> + while (--i >= 0) >> + blk_mq_free_map_and_rqs(set, tags[i], i); >> +out: >> + kfree(tags); >> + return NULL; >> +} >> + >> +int blk_mq_alloc_sched_tags(struct elevator_tags *et, >> + struct blk_mq_tag_set *set, int id) >> +{ >> + struct blk_mq_tags **tags; >> + gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY; >> + >> + /* >> + * Default to double of smaller one between hw queue_depth and >> + * 128, since we don't split into sync/async like the old code >> + * did. Additionally, this is a per-hw queue depth. 
>> + */
>> + et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
>> + BLKDEV_DEFAULT_RQ);
>> +
>> + tags = __blk_mq_alloc_sched_tags(set, et->nr_hw_queues,
>> + et->nr_requests, gfp);
>> + if (!tags)
>> + return -ENOMEM;
>> +
>> + if (xa_insert(&et->tags_table, id, tags, gfp))
>> + goto out_free_tags;
>
> Not sure why you add `tags_table` into `elevator_tags`, it looks very
> confusing just from naming, not mention only single element is stored to
> the table in elevator switch code path since queue id is the key.
>
Yeah, but if you look at the next patch, where we add/remove tags in batch, you'd realise that the xarray may have multiple entries inserted when it's called from the nr_hw_queue update path.

> Wrt. xarray suggestion, I meant to add on xarray local variable in
> __blk_mq_update_nr_hw_queues(), then you can store allocated `elevator_tags`
> to the xarray via ->id of request_queue.
>
I think I now understand what you have been asking for - there was some confusion earlier. You are suggesting that we insert a pointer to 'struct elevator_tags' into a local xarray when we allocate sched_tags in batch from the nr_hw_queue update context, using q->id as the key. Later, from elv_update_nr_hw_queues(), we would retrieve the relevant 'struct elevator_tags' pointer from that local xarray (using q->id) and pass it on to the elevator_switch function.

This way we would not need the ->tags_table inside 'struct elevator_tags'. Additionally, blk_mq_alloc_sched_tags() shall return the allocated 'struct elevator_tags' to the caller. The definition of 'struct elevator_tags' would then look like this:

struct elevator_tags {
	/* num. of hardware queues for which tags are allocated */
	unsigned int nr_hw_queues;
	/* depth used while allocating tags */
	unsigned int nr_requests;
	/* The index 0 in @tags is used to store shared sched tags */
	struct blk_mq_tags **tags;
};

This seems like a clean and straightforward solution that should be easy to implement without much hassle. Please let me know if this still needs adjustment.

Thanks,
--Nilay
* Re: [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store 2025-06-27 4:13 ` Nilay Shroff @ 2025-06-27 7:58 ` Ming Lei 2025-06-27 9:50 ` Nilay Shroff 0 siblings, 1 reply; 8+ messages in thread From: Ming Lei @ 2025-06-27 7:58 UTC (permalink / raw) To: Nilay Shroff; +Cc: linux-block, hch, axboe, sth, gjoyce On Fri, Jun 27, 2025 at 09:43:29AM +0530, Nilay Shroff wrote: > > > On 6/26/25 8:13 PM, Ming Lei wrote: > > On Tue, Jun 24, 2025 at 06:47:04PM +0530, Nilay Shroff wrote: > >> Recent lockdep reports [1] have revealed a potential deadlock caused by a > >> lock dependency between the percpu allocator lock and the elevator lock. > >> This issue can be avoided by ensuring that the allocation and release of > >> scheduler tags (sched_tags) are performed outside the elevator lock. > >> Furthermore, the queue does not need to be remain frozen during these > >> operations. > >> > >> To address this, move all sched_tags allocations and deallocations outside > >> of both the ->elevator_lock and the ->freeze_lock. Since the lifetime of > >> the elevator queue and its associated sched_tags is closely tied, the > >> allocated sched_tags are now stored in the elevator queue structure. Then, > >> during the actual elevator switch (which runs under ->freeze_lock and > >> ->elevator_lock), the pre-allocated sched_tags are assigned to the > >> appropriate q->hctx. Once the elevator switch is complete and the locks > >> are released, the old elevator queue and its associated sched_tags are > >> freed. > >> > >> This commit specifically addresses the allocation/deallocation of sched_ > >> tags during elevator switching. Note that sched_tags may also be allocated > >> in other contexts, such as during nr_hw_queues updates. Supporting that > >> use case will require batch allocation/deallocation, which will be handled > >> in a follow-up patch. 
> >> > >> This restructuring ensures that sched_tags memory management occurs > >> entirely outside of the ->elevator_lock and ->freeze_lock context, > >> eliminating the lock dependency problem seen during scheduler updates. > >> > >> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ > >> > >> Reported-by: Stefan Haberland <sth@linux.ibm.com> > >> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ > >> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> > >> --- > >> block/blk-mq-sched.c | 186 ++++++++++++++++++++++++++----------------- > >> block/blk-mq-sched.h | 14 +++- > >> block/elevator.c | 66 +++++++++++++-- > >> block/elevator.h | 19 ++++- > >> 4 files changed, 204 insertions(+), 81 deletions(-) > >> > >> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c > >> index 359e0704e09b..5d3132ac7777 100644 > >> --- a/block/blk-mq-sched.c > >> +++ b/block/blk-mq-sched.c > >> @@ -374,64 +374,17 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq, > >> } > >> EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge); > >> > >> -static int blk_mq_sched_alloc_map_and_rqs(struct request_queue *q, > >> - struct blk_mq_hw_ctx *hctx, > >> - unsigned int hctx_idx) > >> -{ > >> - if (blk_mq_is_shared_tags(q->tag_set->flags)) { > >> - hctx->sched_tags = q->sched_shared_tags; > >> - return 0; > >> - } > >> - > >> - hctx->sched_tags = blk_mq_alloc_map_and_rqs(q->tag_set, hctx_idx, > >> - q->nr_requests); > >> - > >> - if (!hctx->sched_tags) > >> - return -ENOMEM; > >> - return 0; > >> -} > >> - > >> -static void blk_mq_exit_sched_shared_tags(struct request_queue *queue) > >> -{ > >> - blk_mq_free_rq_map(queue->sched_shared_tags); > >> - queue->sched_shared_tags = NULL; > >> -} > >> - > >> /* called in queue's release handler, tagset has gone away */ > >> static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int flags) > >> { > >> struct blk_mq_hw_ctx *hctx; > >> 
unsigned long i; > >> > >> - queue_for_each_hw_ctx(q, hctx, i) { > >> - if (hctx->sched_tags) { > >> - if (!blk_mq_is_shared_tags(flags)) > >> - blk_mq_free_rq_map(hctx->sched_tags); > >> - hctx->sched_tags = NULL; > >> - } > >> - } > >> + queue_for_each_hw_ctx(q, hctx, i) > >> + hctx->sched_tags = NULL; > >> > >> if (blk_mq_is_shared_tags(flags)) > >> - blk_mq_exit_sched_shared_tags(q); > >> -} > >> - > >> -static int blk_mq_init_sched_shared_tags(struct request_queue *queue) > >> -{ > >> - struct blk_mq_tag_set *set = queue->tag_set; > >> - > >> - /* > >> - * Set initial depth at max so that we don't need to reallocate for > >> - * updating nr_requests. > >> - */ > >> - queue->sched_shared_tags = blk_mq_alloc_map_and_rqs(set, > >> - BLK_MQ_NO_HCTX_IDX, > >> - MAX_SCHED_RQ); > >> - if (!queue->sched_shared_tags) > >> - return -ENOMEM; > >> - > >> - blk_mq_tag_update_sched_shared_tags(queue); > >> - > >> - return 0; > >> + q->sched_shared_tags = NULL; > >> } > >> > >> void blk_mq_sched_reg_debugfs(struct request_queue *q) > >> @@ -458,8 +411,106 @@ void blk_mq_sched_unreg_debugfs(struct request_queue *q) > >> mutex_unlock(&q->debugfs_mutex); > >> } > >> > >> +void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set, > >> + struct blk_mq_tags **tags, unsigned int nr_hw_queues) > >> +{ > >> + unsigned long i; > >> + > >> + if (!tags) > >> + return; > >> + > >> + /* Shared tags are stored at index 0 in @tags. 
*/ > >> + if (blk_mq_is_shared_tags(set->flags)) > >> + blk_mq_free_map_and_rqs(set, tags[0], BLK_MQ_NO_HCTX_IDX); > >> + else { > >> + for (i = 0; i < nr_hw_queues; i++) > >> + blk_mq_free_map_and_rqs(set, tags[i], i); > >> + } > >> + > >> + kfree(tags); > >> +} > >> + > >> +void blk_mq_free_sched_tags(struct elevator_tags *et, > >> + struct blk_mq_tag_set *set, int id) > >> +{ > >> + struct blk_mq_tags **tags; > >> + > >> + tags = xa_load(&et->tags_table, id); > >> + __blk_mq_free_sched_tags(set, tags, et->nr_hw_queues); > >> +} > >> + > >> +struct blk_mq_tags **__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set, > >> + unsigned int nr_hw_queues, > >> + unsigned int nr_requests, > >> + gfp_t gfp) > >> +{ > >> + int i, nr_tags; > >> + struct blk_mq_tags **tags; > >> + > >> + if (blk_mq_is_shared_tags(set->flags)) > >> + nr_tags = 1; > >> + else > >> + nr_tags = nr_hw_queues; > >> + > >> + tags = kcalloc(nr_tags, sizeof(struct blk_mq_tags *), gfp); > >> + if (!tags) > >> + return NULL; > >> + > >> + if (blk_mq_is_shared_tags(set->flags)) { > >> + /* Shared tags are stored at index 0 in @tags. */ > >> + tags[0] = blk_mq_alloc_map_and_rqs(set, BLK_MQ_NO_HCTX_IDX, > >> + MAX_SCHED_RQ); > >> + if (!tags[0]) > >> + goto out; > >> + } else { > >> + for (i = 0; i < nr_hw_queues; i++) { > >> + tags[i] = blk_mq_alloc_map_and_rqs(set, i, nr_requests); > >> + if (!tags[i]) > >> + goto out_unwind; > >> + } > >> + } > >> + > >> + return tags; > >> +out_unwind: > >> + while (--i >= 0) > >> + blk_mq_free_map_and_rqs(set, tags[i], i); > >> +out: > >> + kfree(tags); > >> + return NULL; > >> +} > >> + > >> +int blk_mq_alloc_sched_tags(struct elevator_tags *et, > >> + struct blk_mq_tag_set *set, int id) > >> +{ > >> + struct blk_mq_tags **tags; > >> + gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY; > >> + > >> + /* > >> + * Default to double of smaller one between hw queue_depth and > >> + * 128, since we don't split into sync/async like the old code > >> + * did. 
Additionally, this is a per-hw queue depth. > >> + */ > >> + et->nr_requests = 2 * min_t(unsigned int, set->queue_depth, > >> + BLKDEV_DEFAULT_RQ); > >> + > >> + tags = __blk_mq_alloc_sched_tags(set, et->nr_hw_queues, > >> + et->nr_requests, gfp); > >> + if (!tags) > >> + return -ENOMEM; > >> + > >> + if (xa_insert(&et->tags_table, id, tags, gfp)) > >> + goto out_free_tags; > > > > Not sure why you add `tags_table` into `elevator_tags`, it looks very > > confusing just from naming, not mention only single element is stored to > > the table in elevator switch code path since queue id is the key. > > > Yeah but if you look at next patch where we add/remove tags in batch > then you'd realise that xarray may have multiple entries inserted when > it's called from nr_hw_queue update. > > > Wrt. xarray suggestion, I meant to add on xarray local variable in > > __blk_mq_update_nr_hw_queues(), then you can store allocated `elevator_tags` > > to the xarray via ->id of request_queue. > > > I think I got it now what you have been asking for - there was some confusion > earlier. So you are suggesting to insert pointer to 'struct elevator_tags' > in a local Xarray when we allocate sched_tags in batch from nr_hw_queue update > context. And the key to insert 'struct elevator_tags' into Xarray would be q->id. > Then later, from elv_update_nr_hw_queues() we should retrieve the relavent > pointer to 'struct elevator_tags' from that local Xarray (using q->id) and pass > on the retrieved elevator_tags to elevator_switch function. > > So this way we would not require using ->tags_table inside 'struct elevator_tags'. > Additionally, blk_mq_alloc_sched_tags() shall return allocated 'struct elevator_tags' > to the caller. And the definition of the 'struct elevator_tags' would look like this: > > struct elevator_tags { > /* num. 
of hardware queues for which tags are allocated */
> unsigned int nr_hw_queues;
> /* depth used while allocating tags */
> unsigned int nr_requests;
> /* The index 0 in @tags is used to store shared sched tags */
> struct blk_mq_tags **tags;

s/struct blk_mq_tags **tags/struct blk_mq_tags *tags[];

Then we can save one extra failure-handling step.

>> This seems like a clean and straightforward solution that should be easy to implement
>> without much hassle. Please let me know if this still needs adjustment.

I think we are on the same page now.

Thanks,
Ming
* Re: [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
2025-06-27 7:58 ` Ming Lei
@ 2025-06-27 9:50 ` Nilay Shroff
0 siblings, 0 replies; 8+ messages in thread
From: Nilay Shroff @ 2025-06-27 9:50 UTC (permalink / raw)
To: Ming Lei; +Cc: linux-block, hch, axboe, sth, gjoyce
On 6/27/25 1:28 PM, Ming Lei wrote:
> On Fri, Jun 27, 2025 at 09:43:29AM +0530, Nilay Shroff wrote:
>>
>> On 6/26/25 8:13 PM, Ming Lei wrote:
>> struct elevator_tags {
>> /* num. of hardware queues for which tags are allocated */
>> unsigned int nr_hw_queues;
>> /* depth used while allocating tags */
>> unsigned int nr_requests;
>> /* The index 0 in @tags is used to store shared sched tags */
>> struct blk_mq_tags **tags;
>
> s/struct blk_mq_tags **tags/struct blk_mq_tags *tags[];
>
> Then we can save one extra failure handling.

Yeah, I think that's the advantage of the flex array!

>> This seems like a clean and straightforward solution that should be easy to implement
>> without much hassle. Please let me know if this still needs adjustment.
>
> I think we are in same page now.

Great!

Thanks,
--Nilay
* [PATCHv4 3/3] block: fix potential deadlock while running nr_hw_queue update 2025-06-24 13:17 [PATCHv4 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff 2025-06-24 13:17 ` [PATCHv4 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff 2025-06-24 13:17 ` [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff @ 2025-06-24 13:17 ` Nilay Shroff 2 siblings, 0 replies; 8+ messages in thread From: Nilay Shroff @ 2025-06-24 13:17 UTC (permalink / raw) To: linux-block; +Cc: hch, ming.lei, axboe, sth, gjoyce Move scheduler tags (sched_tags) allocation and deallocation outside both the ->elevator_lock and ->freeze_lock when updating nr_hw_queues. This change breaks the dependency chain from the percpu allocator lock to the elevator lock, helping to prevent potential deadlocks, as observed in the reported lockdep splat[1]. This commit introduces batch allocation and deallocation helpers for sched_tags, which are now used from within __blk_mq_update_nr_hw_queues routine while iterating through the tagset. With this change, all sched_tags memory management is handled entirely outside the ->elevator_lock and the ->freeze_lock context, thereby eliminating the lock dependency that could otherwise manifest during nr_hw_queues updates. 
[1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ Reported-by: Stefan Haberland <sth@linux.ibm.com> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> --- block/blk-mq-sched.c | 51 ++++++++++++++++++++++++++++++++++++++++++++ block/blk-mq-sched.h | 5 +++++ block/blk-mq.c | 12 ++++++++++- block/blk.h | 3 ++- block/elevator.c | 23 ++------------------ 5 files changed, 71 insertions(+), 23 deletions(-) diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index 5d3132ac7777..acdc03718ebd 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -508,6 +508,57 @@ int blk_mq_alloc_sched_tags(struct elevator_tags *et, return -ENOMEM; } +int blk_mq_alloc_sched_tags_batch(struct elevator_tags *et, + struct blk_mq_tag_set *set) +{ + struct request_queue *q; + + lockdep_assert_held_write(&set->update_nr_hwq_lock); + + list_for_each_entry(q, &set->tag_list, tag_set_list) { + /* + * Accessing q->elevator without holding q->elevator_lock is + * safe because we're holding here set->update_nr_hwq_lock in + * the writer context. So, scheduler update/switch code (which + * acquires the same lock but in the reader context) can't run + * concurrently. + */ + if (q->elevator) { + if (blk_mq_alloc_sched_tags(et, set, q->id)) + goto out_unwind; + } + } + return 0; + +out_unwind: + list_for_each_entry_continue_reverse(q, &set->tag_list, tag_set_list) { + if (q->elevator) + blk_mq_free_sched_tags(et, set, q->id); + } + + return -ENOMEM; +} + +void blk_mq_free_sched_tags_batch(struct elevator_tags *et, + struct blk_mq_tag_set *set) +{ + struct request_queue *q; + + lockdep_assert_held_write(&set->update_nr_hwq_lock); + + list_for_each_entry(q, &set->tag_list, tag_set_list) { + /* + * Accessing q->elevator without holding q->elevator_lock is + * safe because we're holding here set->update_nr_hwq_lock in + * the writer context. 
So, scheduler update/switch code (which + * acquires the same lock but in the reader context) can't run + * concurrently. + */ + if (q->elevator) + blk_mq_free_sched_tags(et, set, q->id); + } +} + /* caller must have a reference to @e, will grab another one if successful */ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e, struct elevator_tags *et) diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h index 92aa50b8376a..4b3bf8946ae2 100644 --- a/block/blk-mq-sched.h +++ b/block/blk-mq-sched.h @@ -29,6 +29,11 @@ struct blk_mq_tags **__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set, void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set, struct blk_mq_tags **tags, unsigned int nr_hw_queues); +int blk_mq_alloc_sched_tags_batch(struct elevator_tags *et, + struct blk_mq_tag_set *set); +void blk_mq_free_sched_tags_batch(struct elevator_tags *et, + struct blk_mq_tag_set *set); + int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e, struct elevator_tags *et); void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e); diff --git a/block/blk-mq.c b/block/blk-mq.c index 4806b867e37d..a06f184f1d9a 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -4970,6 +4970,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues) { struct request_queue *q; + struct elevator_tags et; int prev_nr_hw_queues = set->nr_hw_queues; unsigned int memflags; int i; @@ -4984,6 +4985,12 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, return; memflags = memalloc_noio_save(); + + et.nr_hw_queues = nr_hw_queues; + xa_init(&et.tags_table); + if (blk_mq_alloc_sched_tags_batch(&et, set) < 0) + goto memalloc_restore; + list_for_each_entry(q, &set->tag_list, tag_set_list) { blk_mq_debugfs_unregister_hctxs(q); blk_mq_sysfs_unregister_hctxs(q); @@ -4995,6 +5002,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) 
{ list_for_each_entry(q, &set->tag_list, tag_set_list) blk_mq_unfreeze_queue_nomemrestore(q); + blk_mq_free_sched_tags_batch(&et, set); goto reregister; } @@ -5019,7 +5027,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, /* elv_update_nr_hw_queues() unfreeze queue for us */ list_for_each_entry(q, &set->tag_list, tag_set_list) - elv_update_nr_hw_queues(q); + elv_update_nr_hw_queues(q, &et); reregister: list_for_each_entry(q, &set->tag_list, tag_set_list) { @@ -5029,7 +5037,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, blk_mq_remove_hw_queues_cpuhp(q); blk_mq_add_hw_queues_cpuhp(q); } +memalloc_restore: memalloc_noio_restore(memflags); + xa_destroy(&et.tags_table); /* Free the excess tags when nr_hw_queues shrink. */ for (i = set->nr_hw_queues; i < prev_nr_hw_queues; i++) diff --git a/block/blk.h b/block/blk.h index 37ec459fe656..a312518fb8f3 100644 --- a/block/blk.h +++ b/block/blk.h @@ -12,6 +12,7 @@ #include "blk-crypto-internal.h" struct elevator_type; +struct elevator_tags; #define BLK_DEV_MAX_SECTORS (LLONG_MAX >> 9) #define BLK_MIN_SEGMENT_SIZE 4096 @@ -321,7 +322,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list, bool blk_insert_flush(struct request *rq); -void elv_update_nr_hw_queues(struct request_queue *q); +void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_tags *et); void elevator_set_default(struct request_queue *q); void elevator_set_none(struct request_queue *q); diff --git a/block/elevator.c b/block/elevator.c index 1408894c0396..4272f9bc7e11 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -717,31 +717,14 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx) * The I/O scheduler depends on the number of hardware queues, this forces a * reattachment when nr_hw_queues changes. 
 */
-void elv_update_nr_hw_queues(struct request_queue *q)
+void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_tags *et)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
-	struct elevator_tags et;
-	struct elv_change_ctx ctx = {};
+	struct elv_change_ctx ctx = {.et = et};
 	int ret = -ENODEV;

 	WARN_ON_ONCE(q->mq_freeze_depth == 0);

-	et.nr_hw_queues = set->nr_hw_queues;
-	xa_init(&et.tags_table);
-	ctx.et = &et;
-	/*
-	 * Accessing q->elevator without holding q->elevator_lock is safe here
-	 * because nr_hw_queue update is protected by set->update_nr_hwq_lock
-	 * in the writer context. So, scheduler update/switch code (which
-	 * acquires same lock in the reader context) can't run concurrently.
-	 */
-	if (q->elevator) {
-		if (blk_mq_alloc_sched_tags(ctx.et, set, q->id)) {
-			WARN_ON_ONCE(1);
-			goto out;
-		}
-	}
-
 	mutex_lock(&q->elevator_lock);
 	if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
 		ctx.name = q->elevator->type->elevator_name;
@@ -758,8 +741,6 @@ void elv_update_nr_hw_queues(struct request_queue *q)
 	 */
 	if (!xa_empty(&ctx.et->tags_table) && !ctx.new)
 		blk_mq_free_sched_tags(ctx.et, set, q->id);
-out:
-	xa_destroy(&ctx.et->tags_table);
 }

 /*
--
2.49.0