From: Ming Lei <ming.lei@redhat.com>
To: Nilay Shroff <nilay@linux.ibm.com>
Cc: linux-block@vger.kernel.org, hch@lst.de, axboe@kernel.dk,
sth@linux.ibm.com, gjoyce@ibm.com
Subject: Re: [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store
Date: Thu, 26 Jun 2025 22:43:00 +0800 [thread overview]
Message-ID: <aF1cdHXunW5aqaci@fedora> (raw)
In-Reply-To: <20250624131716.630465-3-nilay@linux.ibm.com>
On Tue, Jun 24, 2025 at 06:47:04PM +0530, Nilay Shroff wrote:
> Recent lockdep reports [1] have revealed a potential deadlock caused by a
> lock dependency between the percpu allocator lock and the elevator lock.
> This issue can be avoided by ensuring that the allocation and release of
> scheduler tags (sched_tags) are performed outside the elevator lock.
> Furthermore, the queue does not need to be remain frozen during these
> operations.
>
> To address this, move all sched_tags allocations and deallocations outside
> of both the ->elevator_lock and the ->freeze_lock. Since the lifetime of
> the elevator queue and its associated sched_tags is closely tied, the
> allocated sched_tags are now stored in the elevator queue structure. Then,
> during the actual elevator switch (which runs under ->freeze_lock and
> ->elevator_lock), the pre-allocated sched_tags are assigned to the
> appropriate q->hctx. Once the elevator switch is complete and the locks
> are released, the old elevator queue and its associated sched_tags are
> freed.
>
> This commit specifically addresses the allocation/deallocation of sched_
> tags during elevator switching. Note that sched_tags may also be allocated
> in other contexts, such as during nr_hw_queues updates. Supporting that
> use case will require batch allocation/deallocation, which will be handled
> in a follow-up patch.
>
> This restructuring ensures that sched_tags memory management occurs
> entirely outside of the ->elevator_lock and ->freeze_lock context,
> eliminating the lock dependency problem seen during scheduler updates.
>
> [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
>
> Reported-by: Stefan Haberland <sth@linux.ibm.com>
> Closes: https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> block/blk-mq-sched.c | 186 ++++++++++++++++++++++++++-----------------
> block/blk-mq-sched.h | 14 +++-
> block/elevator.c | 66 +++++++++++++--
> block/elevator.h | 19 ++++-
> 4 files changed, 204 insertions(+), 81 deletions(-)
>
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 359e0704e09b..5d3132ac7777 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -374,64 +374,17 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq,
> }
> EXPORT_SYMBOL_GPL(blk_mq_sched_try_insert_merge);
>
> -static int blk_mq_sched_alloc_map_and_rqs(struct request_queue *q,
> - struct blk_mq_hw_ctx *hctx,
> - unsigned int hctx_idx)
> -{
> - if (blk_mq_is_shared_tags(q->tag_set->flags)) {
> - hctx->sched_tags = q->sched_shared_tags;
> - return 0;
> - }
> -
> - hctx->sched_tags = blk_mq_alloc_map_and_rqs(q->tag_set, hctx_idx,
> - q->nr_requests);
> -
> - if (!hctx->sched_tags)
> - return -ENOMEM;
> - return 0;
> -}
> -
> -static void blk_mq_exit_sched_shared_tags(struct request_queue *queue)
> -{
> - blk_mq_free_rq_map(queue->sched_shared_tags);
> - queue->sched_shared_tags = NULL;
> -}
> -
> /* called in queue's release handler, tagset has gone away */
> static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int flags)
> {
> struct blk_mq_hw_ctx *hctx;
> unsigned long i;
>
> - queue_for_each_hw_ctx(q, hctx, i) {
> - if (hctx->sched_tags) {
> - if (!blk_mq_is_shared_tags(flags))
> - blk_mq_free_rq_map(hctx->sched_tags);
> - hctx->sched_tags = NULL;
> - }
> - }
> + queue_for_each_hw_ctx(q, hctx, i)
> + hctx->sched_tags = NULL;
>
> if (blk_mq_is_shared_tags(flags))
> - blk_mq_exit_sched_shared_tags(q);
> -}
> -
> -static int blk_mq_init_sched_shared_tags(struct request_queue *queue)
> -{
> - struct blk_mq_tag_set *set = queue->tag_set;
> -
> - /*
> - * Set initial depth at max so that we don't need to reallocate for
> - * updating nr_requests.
> - */
> - queue->sched_shared_tags = blk_mq_alloc_map_and_rqs(set,
> - BLK_MQ_NO_HCTX_IDX,
> - MAX_SCHED_RQ);
> - if (!queue->sched_shared_tags)
> - return -ENOMEM;
> -
> - blk_mq_tag_update_sched_shared_tags(queue);
> -
> - return 0;
> + q->sched_shared_tags = NULL;
> }
>
> void blk_mq_sched_reg_debugfs(struct request_queue *q)
> @@ -458,8 +411,106 @@ void blk_mq_sched_unreg_debugfs(struct request_queue *q)
> mutex_unlock(&q->debugfs_mutex);
> }
>
> +void __blk_mq_free_sched_tags(struct blk_mq_tag_set *set,
> + struct blk_mq_tags **tags, unsigned int nr_hw_queues)
> +{
> + unsigned long i;
> +
> + if (!tags)
> + return;
> +
> + /* Shared tags are stored at index 0 in @tags. */
> + if (blk_mq_is_shared_tags(set->flags))
> + blk_mq_free_map_and_rqs(set, tags[0], BLK_MQ_NO_HCTX_IDX);
> + else {
> + for (i = 0; i < nr_hw_queues; i++)
> + blk_mq_free_map_and_rqs(set, tags[i], i);
> + }
> +
> + kfree(tags);
> +}
> +
> +void blk_mq_free_sched_tags(struct elevator_tags *et,
> + struct blk_mq_tag_set *set, int id)
> +{
> + struct blk_mq_tags **tags;
> +
> + tags = xa_load(&et->tags_table, id);
> + __blk_mq_free_sched_tags(set, tags, et->nr_hw_queues);
> +}
> +
> +struct blk_mq_tags **__blk_mq_alloc_sched_tags(struct blk_mq_tag_set *set,
> + unsigned int nr_hw_queues,
> + unsigned int nr_requests,
> + gfp_t gfp)
> +{
> + int i, nr_tags;
> + struct blk_mq_tags **tags;
> +
> + if (blk_mq_is_shared_tags(set->flags))
> + nr_tags = 1;
> + else
> + nr_tags = nr_hw_queues;
> +
> + tags = kcalloc(nr_tags, sizeof(struct blk_mq_tags *), gfp);
> + if (!tags)
> + return NULL;
> +
> + if (blk_mq_is_shared_tags(set->flags)) {
> + /* Shared tags are stored at index 0 in @tags. */
> + tags[0] = blk_mq_alloc_map_and_rqs(set, BLK_MQ_NO_HCTX_IDX,
> + MAX_SCHED_RQ);
> + if (!tags[0])
> + goto out;
> + } else {
> + for (i = 0; i < nr_hw_queues; i++) {
> + tags[i] = blk_mq_alloc_map_and_rqs(set, i, nr_requests);
> + if (!tags[i])
> + goto out_unwind;
> + }
> + }
> +
> + return tags;
> +out_unwind:
> + while (--i >= 0)
> + blk_mq_free_map_and_rqs(set, tags[i], i);
> +out:
> + kfree(tags);
> + return NULL;
> +}
> +
> +int blk_mq_alloc_sched_tags(struct elevator_tags *et,
> + struct blk_mq_tag_set *set, int id)
> +{
> + struct blk_mq_tags **tags;
> + gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
> +
> + /*
> + * Default to double of smaller one between hw queue_depth and
> + * 128, since we don't split into sync/async like the old code
> + * did. Additionally, this is a per-hw queue depth.
> + */
> + et->nr_requests = 2 * min_t(unsigned int, set->queue_depth,
> + BLKDEV_DEFAULT_RQ);
> +
> + tags = __blk_mq_alloc_sched_tags(set, et->nr_hw_queues,
> + et->nr_requests, gfp);
> + if (!tags)
> + return -ENOMEM;
> +
> + if (xa_insert(&et->tags_table, id, tags, gfp))
> + goto out_free_tags;
Not sure why you add `tags_table` into `elevator_tags`, it looks very
confusing just from naming, not mention only single element is stored to
the table in elevator switch code path since queue id is the key.
Wrt. xarray suggestion, I meant to add on xarray local variable in
__blk_mq_update_nr_hw_queues(), then you can store allocated `elevator_tags`
to the xarray via ->id of request_queue.
Thanks,
Ming
next prev parent reply other threads:[~2025-06-26 14:43 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-06-24 13:17 [PATCHv4 0/3] block: move sched_tags allocation/de-allocation outside of locking context Nilay Shroff
2025-06-24 13:17 ` [PATCHv4 1/3] block: move elevator queue allocation logic into blk_mq_init_sched Nilay Shroff
2025-06-24 13:17 ` [PATCHv4 2/3] block: fix lockdep warning caused by lock dependency in elv_iosched_store Nilay Shroff
2025-06-26 14:43 ` Ming Lei [this message]
2025-06-27 4:13 ` Nilay Shroff
2025-06-27 7:58 ` Ming Lei
2025-06-27 9:50 ` Nilay Shroff
2025-06-24 13:17 ` [PATCHv4 3/3] block: fix potential deadlock while running nr_hw_queue update Nilay Shroff
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aF1cdHXunW5aqaci@fedora \
--to=ming.lei@redhat.com \
--cc=axboe@kernel.dk \
--cc=gjoyce@ibm.com \
--cc=hch@lst.de \
--cc=linux-block@vger.kernel.org \
--cc=nilay@linux.ibm.com \
--cc=sth@linux.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.