From: Aaron Tomlin <atomlin@atomlin.com>
To: axboe@kernel.dk, rostedt@goodmis.org, mhiramat@kernel.org,
mathieu.desnoyers@efficios.com
Cc: bvanassche@acm.org, johannes.thumshirn@wdc.com, kch@nvidia.com,
dlemoal@kernel.org, ritesh.list@gmail.com, loberman@redhat.com,
neelx@suse.com, sean@ashe.io, mproche@gmail.com,
chjohnst@gmail.com, linux-block@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org
Subject: [PATCH v6 0/2] blk-mq: introduce tag starvation observability
Date: Sun, 17 May 2026 17:36:12 -0400
Message-ID: <20260517213614.350367-1-atomlin@atomlin.com>
Hi Jens, Steve, Masami,
In high-performance storage environments, particularly when utilising RAID
controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe latency
spikes can occur when fast devices are starved of available tags.
Currently, diagnosing this queue contention requires deploying dynamic
kprobes or inferring sleep states; there is no simple, out-of-the-box
diagnostic path.
This short series introduces dedicated, low-overhead observability for tag
exhaustion events in the block layer:
  - Patch 1 introduces the "block_rq_tag_wait" tracepoint in the tag
    allocation slow-path to capture precise, event-based starvation.
  - Patch 2 complements this by exposing "wait_on_hw_tag" and
    "wait_on_sched_tag" per-CPU counters via debugfs for quick,
    point-in-time cumulative polling.
Together, these provide storage engineers with zero-configuration
mechanisms to definitively identify shared-tag bottlenecks.
Please let me know your thoughts.
Changes since v5 [1]:
  - Replaced this_cpu_inc() with raw_cpu_inc() within
    blk_mq_debugfs_inc_wait_tags(). This resolves a preemption warning
    triggered under CONFIG_DEBUG_PREEMPT=y, as the routine is invoked from a
    preemptible context immediately prior to io_schedule(). This adjustment
    deliberately prioritises the reduction of execution overhead over
    absolute statistical precision for this diagnostic interface.
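
To make the precision trade-off above concrete, here is a minimal userspace
sketch; it is not code from the patch. It assumes an architecture where a
non-atomic increment is a separate load and store, and it replays one unlucky
interleaving by hand rather than relying on real preemption. The names slot,
slot_load(), slot_store() and replay_lost_update() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one CPU's slot of a per-CPU counter. */
struct slot {
	uint64_t count;
};

/* A non-atomic increment, split into its load and store halves. */
static uint64_t slot_load(const struct slot *s)
{
	return s->count;
}

static void slot_store(struct slot *s, uint64_t v)
{
	s->count = v;
}

/*
 * Replay one interleaving of two logical increments on the same slot:
 * task A loads, is preempted, task B completes a full increment, then
 * task A resumes and stores a value derived from its stale load.
 * Returns the final counter value.
 */
static uint64_t replay_lost_update(struct slot *s)
{
	uint64_t a_seen = slot_load(s);		/* task A: load */

	slot_store(s, slot_load(s) + 1);	/* task B: full increment */
	slot_store(s, a_seen + 1);		/* task A: store from stale load */

	return s->count;
}
```

Two increments were issued but the slot only advances by one. For a
diagnostic counter that is polled to spot trends, an occasional lost update
of this kind is the accepted cost.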
Changes since v4 [2]:
  - Prevented a NULL pointer dereference in the tracepoint fast-assign for
    disk-less request queues by checking q->disk before resolving the
    dev_t
  - Fixed a use-after-free (UAF) and a permanent memory leak by decoupling
    the per-CPU counter allocation from the volatile debugfs lifecycle and
    tying it directly to the core hctx lifecycle (i.e., blk_mq_init_hctx()
    and blk_mq_exit_hctx())
  - Fixed a potential compiler-induced double fetch by wrapping the per-CPU
    pointer evaluations with READ_ONCE() in blk_mq_debugfs_inc_wait_tags()
  - Passed the appropriate gfp_t flags down to the allocation routines to
    maintain the strict GFP_NOIO context
  - Updated kernel-doc descriptions to clarify that the NULL pointer
    checks guard against memory allocation failures under pressure, rather
    than initialisation race conditions
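
As an illustration of the READ_ONCE() item above, the following userspace
sketch shows the general pattern of snapshotting a pointer once, so that the
NULL check and the later dereference operate on the same value. It is not the
patch code: the READ_ONCE() macro here is a simplified approximation of the
kernel's, and struct hctx_sketch, its field names and inc_wait_tags() are
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace approximation of the kernel's READ_ONCE(). */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical stand-in for the hctx fields that hold the counters. */
struct hctx_sketch {
	uint64_t *wait_on_hw_tag;	/* NULL if the allocation failed */
	uint64_t *wait_on_sched_tag;	/* NULL if the allocation failed */
};

/*
 * Bump the counter for the relevant tag pool.
 * Returns 1 if a counter was incremented, 0 if accounting was skipped.
 */
static int inc_wait_tags(struct hctx_sketch *h, int is_sched_tag)
{
	/* Load the pointer exactly once; check and use the same snapshot. */
	uint64_t *ctr = is_sched_tag ? READ_ONCE(h->wait_on_sched_tag)
				     : READ_ONCE(h->wait_on_hw_tag);

	if (!ctr)
		return 0;	/* no counter available: skip silently */

	(*ctr)++;
	return 1;
}
```

Without the single snapshot, a compiler would be free to reload the pointer
between the check and the increment, which is the double fetch the changelog
entry refers to.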
Changes since v3 [3]:
  - Transitioned tracking architecture from shared atomic_t variables to
    dynamically allocated per-CPU counters to resolve cache line bouncing
    (Bart Van Assche)
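
The idea behind the move to per-CPU counters can be sketched in userspace as
follows. This is only an illustration under stated assumptions, not the
kernel implementation: NR_SLOTS, CACHE_LINE, struct percpu_counter_sketch,
counter_inc() and counter_sum() are all hypothetical, and the CPU index is
passed in explicitly instead of being derived from the running CPU.

```c
#include <assert.h>
#include <stdint.h>

#define NR_SLOTS	4	/* hypothetical CPU count for the sketch */
#define CACHE_LINE	64	/* assumed cache line size in bytes */

/* One slot per CPU, aligned so that no two slots share a cache line. */
struct padded_slot {
	_Alignas(CACHE_LINE) uint64_t count;
};

struct percpu_counter_sketch {
	struct padded_slot slot[NR_SLOTS];
};

/* Writers only touch their own slot, so increments do not contend. */
static void counter_inc(struct percpu_counter_sketch *c, unsigned int cpu)
{
	assert(cpu < NR_SLOTS);
	c->slot[cpu].count++;
}

/* Readers pay the cost instead: sum every slot at read time. */
static uint64_t counter_sum(const struct percpu_counter_sketch *c)
{
	uint64_t total = 0;

	for (unsigned int cpu = 0; cpu < NR_SLOTS; cpu++)
		total += c->slot[cpu].count;

	return total;
}
```

With a single shared atomic, every increment pulls the same cache line to the
writing CPU. Here the hot path stays local, and the rare debugfs read does
the aggregation.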
Changes since v2 [4]:
- Added "Reviewed-by:" and "Tested-by:" tags for patch 1
- Evaluate is_sched_tag directly within TP_fast_assign (Steven Rostedt)
- Introduced atomic counters via debugfs
Changes since v1 [5]:
- Improved the description of the trace point (Damien Le Moal)
- Removed the redundant "active requests" (Laurence Oberman)
- Introduced pool-specific starvation tracking
[1]: https://lore.kernel.org/lkml/20260427020142.358912-1-atomlin@atomlin.com/
[2]: https://lore.kernel.org/lkml/20260419023036.1419514-1-atomlin@atomlin.com/
[3]: https://lore.kernel.org/lkml/20260319221956.332770-1-atomlin@atomlin.com/
[4]: https://lore.kernel.org/lkml/20260319015300.287653-1-atomlin@atomlin.com/
[5]: https://lore.kernel.org/lkml/20260317182835.258183-1-atomlin@atomlin.com/
Aaron Tomlin (2):
  blk-mq: add tracepoint block_rq_tag_wait
  blk-mq: expose tag starvation counts via debugfs
 block/blk-mq-debugfs.c       | 109 +++++++++++++++++++++++++++++++++++
 block/blk-mq-debugfs.h       |  19 ++++++
 block/blk-mq-tag.c           |   8 +++
 block/blk-mq.c               |   5 ++
 include/linux/blk-mq.h       |  12 ++++
 include/trace/events/block.h |  43 ++++++++++++++
 6 files changed, 196 insertions(+)
--
2.51.0