From: Aaron Tomlin <atomlin@atomlin.com>
To: axboe@kernel.dk, rostedt@goodmis.org, mhiramat@kernel.org,
mathieu.desnoyers@efficios.com
Cc: bvanassche@acm.org, johannes.thumshirn@wdc.com, kch@nvidia.com,
dlemoal@kernel.org, ritesh.list@gmail.com, loberman@redhat.com,
neelx@suse.com, sean@ashe.io, mproche@gmail.com,
chjohnst@gmail.com, linux-block@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org
Subject: [PATCH v5 0/2] blk-mq: introduce tag starvation observability
Date: Sun, 26 Apr 2026 22:01:40 -0400
Message-ID: <20260427020142.358912-1-atomlin@atomlin.com>
Hi Jens, Steve, Masami,
In high-performance storage environments, particularly when utilising RAID
controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe latency
spikes can occur when fast devices are starved of available tags.
Currently, diagnosing this specific queue contention requires deploying
dynamic kprobes or inferring sleep states; there is no simple,
out-of-the-box diagnostic path.
This short series introduces dedicated, low-overhead observability for tag
exhaustion events in the block layer:
- Patch 1 introduces the "block_rq_tag_wait" tracepoint in the tag
allocation slow-path to record each starvation event as it occurs.
- Patch 2 complements this by exposing "wait_on_hw_tag" and
"wait_on_sched_tag" per-CPU counters via debugfs for quick,
point-in-time cumulative polling.
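
Because the counters are cumulative, a poller needs two samples and a
subtraction to see how many waits occurred in an interval. The userspace
sketch below illustrates that step only. It is not code from this series:
the helper names are hypothetical, and it assumes each debugfs file holds
a single decimal number, which the cover letter does not state.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse the text read from one counter file.
 * Assumption: the file holds one decimal value, optionally followed
 * by a newline. Returns 0 on success, -1 on malformed input.
 */
static int parse_counter(const char *text, unsigned long long *out)
{
	char *end;
	unsigned long long v;

	errno = 0;
	v = strtoull(text, &end, 10);
	if (errno != 0 || end == text)
		return -1;
	if (*end != '\0' && *end != '\n')
		return -1;
	*out = v;
	return 0;
}

/*
 * Number of waits between two samples of a cumulative counter.
 * Unsigned subtraction stays correct across a single wrap-around.
 */
static unsigned long long counter_delta(unsigned long long prev,
					unsigned long long curr)
{
	return curr - prev;
}
```

A monitoring loop would read the file, call parse_counter(), sleep, read
again, and report counter_delta() divided by the interval.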
Together, these provide storage engineers with zero-configuration
mechanisms to definitively identify shared-tag bottlenecks.
Please let me know your thoughts.
Changes since v4 [1]:
- Prevented a NULL pointer dereference in the tracepoint fast-assign for
disk-less request queues by safely checking q->disk before resolving the
dev_t
- Fixed a Use-After-Free (UAF) and permanent memory leak by decoupling
the per-CPU counter allocation from the volatile debugfs lifecycle and
tying it directly to the core hctx lifecycle (i.e., blk_mq_init_hctx()
and blk_mq_exit_hctx())
- Fixed a potential compiler double-fetch bug by wrapping the per-CPU
pointer evaluations with READ_ONCE() in blk_mq_debugfs_inc_wait_tags()
- Passed the appropriate gfp_t flags down to the allocation routines to
maintain the strict GFP_NOIO context
- Updated kernel-doc descriptions to clarify that the NULL pointer
checks guard against memory allocation failures under pressure, rather
than initialisation race conditions
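
The double-fetch concern above can be pictured with a small userspace
model. This is a sketch under assumptions, not the code in patch 2: the
struct and function names are hypothetical, and a volatile-qualified load
stands in for the kernel's READ_ONCE(). The point is that the pointer is
loaded exactly once, and both the NULL check and the increment use that
one snapshot.

```c
/* Hypothetical stand-in for a hardware context with an optional counter. */
struct hctx_model {
	unsigned long *wait_count;	/* NULL if allocation failed */
};

/*
 * Increment the counter if it exists. Returns 1 if counted, 0 if skipped.
 * The volatile access forces a single load of the pointer, so the
 * compiler cannot re-read it between the check and the use.
 */
static int model_inc_wait(struct hctx_model *h)
{
	unsigned long *p = *(unsigned long * volatile *)&h->wait_count;

	if (!p)
		return 0;
	(*p)++;
	return 1;
}
```

Without the single load, a compiler would be free to test one read of the
pointer and dereference a second, separate read.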
Changes since v3 [2]:
- Transitioned tracking architecture from shared atomic_t variables to
dynamically allocated per-CPU counters to resolve cache line bouncing
(Bart Van Assche)
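
For readers unfamiliar with the trade-off, the following userspace model
shows the general shape of per-CPU counting: each CPU increments its own
cache-line-sized slot, and a reader sums the slots on demand. It is an
illustration only. The names, the fixed slot count and the 64-byte line
size are assumptions, and the series itself uses the kernel's dynamically
allocated per-CPU storage rather than a static array.

```c
#include <stddef.h>

#define NR_SLOTS	4	/* assumed number of CPUs in this model */
#define CACHE_LINE	64	/* assumed cache line size in bytes */

/* One slot per CPU, aligned so that no two slots share a cache line. */
struct wait_slot {
	_Alignas(CACHE_LINE) unsigned long count;
};

struct wait_counter {
	struct wait_slot slot[NR_SLOTS];
};

/* Writer side: touch only the slot that belongs to the calling CPU. */
static void wait_counter_inc(struct wait_counter *c, unsigned int cpu)
{
	c->slot[cpu % NR_SLOTS].count++;
}

/* Reader side: fold all slots into one cumulative total. */
static unsigned long wait_counter_sum(const struct wait_counter *c)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < NR_SLOTS; i++)
		total += c->slot[i].count;
	return total;
}
```

A shared atomic_t makes every increment contend for the same cache line,
whereas here the cost moves to the infrequent read path.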
Changes since v2 [3]:
- Added "Reviewed-by:" and "Tested-by:" tags for patch 1
- Evaluate is_sched_tag directly within TP_fast_assign (Steven Rostedt)
- Introduced atomic counters via debugfs
Changes since v1 [4]:
- Improved the description of the trace point (Damien Le Moal)
- Removed the redundant "active requests" (Laurence Oberman)
- Introduced pool-specific starvation tracking
[1]: https://lore.kernel.org/lkml/20260419023036.1419514-1-atomlin@atomlin.com/
[2]: https://lore.kernel.org/lkml/20260319221956.332770-1-atomlin@atomlin.com/
[3]: https://lore.kernel.org/lkml/20260319015300.287653-1-atomlin@atomlin.com/
[4]: https://lore.kernel.org/lkml/20260317182835.258183-1-atomlin@atomlin.com/
Aaron Tomlin (2):
blk-mq: add tracepoint block_rq_tag_wait
blk-mq: expose tag starvation counts via debugfs
block/blk-mq-debugfs.c | 109 +++++++++++++++++++++++++++++++++++
block/blk-mq-debugfs.h | 19 ++++++
block/blk-mq-tag.c | 8 +++
block/blk-mq.c | 5 ++
include/linux/blk-mq.h | 12 ++++
include/trace/events/block.h | 43 ++++++++++++++
6 files changed, 196 insertions(+)
--
2.51.0