From: Nilay Shroff <nilay@linux.ibm.com>
To: linux-block@vger.kernel.org
Cc: ming.lei@redhat.com, hch@lst.de, axboe@kernel.dk,
yi.zhang@redhat.com, czhong@redhat.com, yukuai@fnnas.com,
gjoyce@ibm.com
Subject: [PATCHv7 0/5] block: restructure elevator switch path and fix a lockdep splat
Date: Thu, 13 Nov 2025 14:28:17 +0530
Message-ID: <20251113090619.2030737-1-nilay@linux.ibm.com>
Hi,
This patchset reorganizes the elevator switch path used during both
nr_hw_queues update and elv_iosched_store() operations to address a
recently reported lockdep splat [1].
The warning highlights a locking dependency of ->freeze_lock and
->elevator_lock on pcpu_alloc_mutex, triggered when the Kyber scheduler
dynamically allocates its private scheduling data while those locks are
held. The fix is to ensure that such allocations occur outside the locked
sections, thus eliminating the dependency chain.
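The idea of allocating outside the locked section can be illustrated with
a minimal userspace sketch. This is not the kernel code from the series:
struct queue, struct sched_data and switch_sched() are hypothetical names,
and a pthread mutex stands in for ->elevator_lock.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
struct sched_data {
	int depth;
};

struct queue {
	pthread_mutex_t elevator_lock;	/* stands in for ->elevator_lock */
	struct sched_data *sched_data;
};

/* Replace the queue's scheduler data; returns 0 on success, -1 on failure. */
static int switch_sched(struct queue *q, int depth)
{
	struct sched_data *new_data, *old_data;

	/*
	 * Allocate before taking the lock, so any lock the allocator
	 * takes internally is never acquired while elevator_lock is held.
	 */
	new_data = malloc(sizeof(*new_data));
	if (!new_data)
		return -1;
	new_data->depth = depth;

	/* The locked section only swaps pointers. */
	pthread_mutex_lock(&q->elevator_lock);
	old_data = q->sched_data;
	q->sched_data = new_data;
	pthread_mutex_unlock(&q->elevator_lock);

	/* Free the old data only after the lock has been dropped. */
	free(old_data);
	return 0;
}
```

With this shape the allocator is only ever entered with no queue lock
held, so no ordering between the two can be recorded.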
While working on this, it also became evident that the nr_hw_queues update
code maintains two disjoint xarrays, one for elevator tags and another
for elevator type, both serving the same purpose. Unifying these into a
single elv_change_ctx structure improves clarity and maintainability.
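As a rough illustration of the unification, the sketch below keeps one
per-queue record holding both the type and the tags, instead of two
parallel lookups. Everything here is an assumption made for illustration:
the field names, the _sketch suffix, and the fixed array that stands in
for an xarray keyed by queue id do not come from the patches.

```c
#include <stddef.h>

/* Placeholder types; the real kernel structures carry far more state. */
struct elevator_tags {
	unsigned int nr_hw_queues;
};

struct elevator_type {
	const char *name;
};

/* One per-queue record instead of two separate per-queue lookups. */
struct elv_change_ctx_sketch {
	struct elevator_type *type;
	struct elevator_tags *tags;
};

#define MAX_QUEUES 8

/* A fixed array stands in for an xarray indexed by queue id. */
struct ctx_table {
	struct elv_change_ctx_sketch *slot[MAX_QUEUES];
};

/* Store ctx under id; fails if id is out of range or already in use. */
static int ctx_store(struct ctx_table *t, unsigned int id,
		     struct elv_change_ctx_sketch *ctx)
{
	if (id >= MAX_QUEUES || t->slot[id])
		return -1;
	t->slot[id] = ctx;
	return 0;
}

/* Look up the record for id; returns NULL if absent or out of range. */
static struct elv_change_ctx_sketch *ctx_load(const struct ctx_table *t,
					      unsigned int id)
{
	return id < MAX_QUEUES ? t->slot[id] : NULL;
}
```

A single lookup then yields both pieces of per-queue elevator state, and
there is only one container to populate and tear down.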
This series therefore implements five patches:
The first preparatory patch unifies the elevator tags and type xarrays. It
combines both xarrays into a single struct elv_change_ctx, simplifying
per-queue elevator state management.
The second patch aims to group together all elevator-related resources
that share the same lifetime. As a first step, it moves the elevator
tags pointer from struct elv_change_ctx into the newly introduced
struct elevator_resources. A subsequent patch extends struct
elevator_resources to include other elevator-related data.
The third patch introduces the ->alloc_sched_data and ->free_sched_data
elevator ops, which can then be used to safely allocate and free
scheduler data.
The fourth patch builds upon the third and starts using the newly
introduced alloc/free sched data methods during elevator switch and
nr_hw_queues update. In doing so, it ensures that sched data allocation
and freeing do not happen while ->freeze_lock and ->elevator_lock are
held, thus preventing their dependency on pcpu_alloc_mutex.
The last patch of this series converts the Kyber scheduler to use the
new methods introduced earlier in the series. It moves Kyber's scheduler
data allocation and teardown logic from ->init_sched and ->exit_sched
into the new methods, ensuring memory operations are performed outside
locked sections.
Together, these changes simplify the elevator switch logic and prevent
the reported lockdep splat.
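For readers unfamiliar with optional ops, here is a hedged userspace
sketch of the pattern behind ->alloc_sched_data and ->free_sched_data.
The struct layout, the callback signatures and the demo_* callbacks are
assumptions made for illustration; only the two method names and the
"return NULL when the method is not defined" behaviour (see the v7
changelog below) come from this series. How the real caller tells an
allocation failure apart from a missing method is not modelled here.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a request queue. */
struct queue_sketch {
	unsigned int nr_hw_queues;
};

/* Both callbacks are optional; a scheduler may leave them NULL. */
struct elevator_ops_sketch {
	void *(*alloc_sched_data)(struct queue_sketch *q);
	void (*free_sched_data)(void *data);
};

/* Returns NULL when no alloc method is defined (or when it fails). */
static void *alloc_sched_data(const struct elevator_ops_sketch *ops,
			      struct queue_sketch *q)
{
	if (!ops || !ops->alloc_sched_data)
		return NULL;
	return ops->alloc_sched_data(q);
}

/* Safe to call with a missing method or a NULL data pointer. */
static void free_sched_data(const struct elevator_ops_sketch *ops, void *data)
{
	if (ops && ops->free_sched_data && data)
		ops->free_sched_data(data);
}

/* Example callbacks: one zeroed counter per hardware queue. */
static void *demo_alloc(struct queue_sketch *q)
{
	return calloc(q->nr_hw_queues, sizeof(unsigned int));
}

static void demo_free(void *data)
{
	free(data);
}
```

A scheduler that needs private data fills in both callbacks, while one
that does not leaves them unset and the wrappers become no-ops.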
As always, feedback and suggestions are very welcome!
[1] https://lore.kernel.org/all/CAGVVp+VNW4M-5DZMNoADp6o2VKFhi7KxWpTDkcnVyjO0=-D5+A@mail.gmail.com/
Thanks,
--Nilay
changes from v6:
- Update blk_mq_alloc_sched_data() to return NULL instead
of a sentinel pointer when the ->alloc_sched_data method is not
defined. (Yu Kuai)
Link to v6: https://lore.kernel.org/all/20251112132249.1791304-1-nilay@linux.ibm.com/
changes from v5:
- Update blk_mq_alloc_sched_data() to return a sentinel pointer
when the ->alloc_sched_data method is not defined. Also update
its caller accordingly to use the IS_ERR() macro to distinguish
between success and failure cases.
Link to v5: https://lore.kernel.org/all/20251112052848.1433256-1-nilay@linux.ibm.com/
changes from v4:
- Update the signature of blk_mq_alloc_sched_res() in patch #2
and add a struct request_queue * parameter. This allows direct
access to blk_mq_tag_set from the request queue, removing the
need for a separate struct blk_mq_tag_set * argument. (Ming Lei)
- Update blk_mq_init_sched() to add a local variable et of type
struct elevator_tags *. This avoids additional code changes,
as we can avoid dereferencing struct elevator_resources *res
to reach it. (Ming Lei)
- Simplify blk_mq_alloc_sched_data() to return a pointer to the
allocated scheduler data on success, or NULL on failure. Update
the caller accordingly to use the return value directly instead
of passing **sched_data as an additional argument. (Yu Kuai)
Link to v4: https://lore.kernel.org/all/20251110081457.1006206-1-nilay@linux.ibm.com/
changes from v3:
- Split the third patch into two patches to separate the introduction
of ->alloc_sched_data and ->free_sched_data methods from their users.
- Free scheduler tags during sched resource allocation failures using
blk_mq_free_sched_tags() instead of kfree() to avoid kmemleak
(Ming Lei).
- Delay the signature change of elevator_alloc() until the fourth
patch, where we actually start allocating scheduler data during
elevator switch and nr_hw_queue_update (Ming Lei).
Link to v3: https://lore.kernel.org/all/20251029103622.205607-1-nilay@linux.ibm.com/
changes from v2:
- Introduce helper functions blk_mq_alloc_sched_res_batch() and
blk_mq_free_sched_res_batch() to encapsulate scheduler resource
(tags and data) allocation and freeing in batch mode. (Ming Lei)
- Introduce helper functions blk_mq_alloc_sched_res() and
blk_mq_free_sched_res() to encapsulate scheduler resource
allocation and freeing. (Ming Lei)
Link to v2: https://lore.kernel.org/all/20251027173631.1081005-1-nilay@linux.ibm.com/
changes from v1:
- Keep blk_mq_free_sched_ctx_batch() and blk_mq_alloc_sched_ctx_batch()
together in the same file (Ming Lei)
- Since the ctx pointer is stored in the xarray after it's dynamically
allocated, if blk_mq_alloc_sched_ctx_batch() fails to allocate or
insert a ctx pointer in the xarray then unwinding the allocation is not
necessary. Instead, looping over the xarray to retrieve the inserted
ctx pointers and freeing them should be sufficient. So invoke
blk_mq_free_sched_ctx_batch() from the blk_mq_alloc_sched_ctx_batch()
callsite on failure (Ming Lei)
- As both elevator tags and elevator data share the same lifetime
and allocation constraints, abstract both into a new structure
(Ming Lei)
Link to v1: https://lore.kernel.org/all/20251016053057.3457663-1-nilay@linux.ibm.com/
Nilay Shroff (5):
block: unify elevator tags and type xarrays into struct elv_change_ctx
block: move elevator tags into struct elevator_resources
block: introduce alloc_sched_data and free_sched_data elevator methods
block: use {alloc|free}_sched data methods
block: define alloc_sched_data and free_sched_data methods for kyber
block/blk-mq-sched.c | 117 ++++++++++++++++++++++++++++++++++--------
block/blk-mq-sched.h | 40 +++++++++++++--
block/blk-mq.c | 50 ++++++++++--------
block/blk.h | 7 ++-
block/elevator.c | 80 +++++++++++++----------------
block/elevator.h | 26 +++++++++-
block/kyber-iosched.c | 30 ++++++++---
7 files changed, 248 insertions(+), 102 deletions(-)
--
2.51.0