* [PATCH v3 0/2] blk-mq: introduce tag starvation observability
From: Aaron Tomlin @ 2026-03-19 22:19 UTC (permalink / raw)
To: axboe, rostedt, mhiramat, mathieu.desnoyers
Cc: johannes.thumshirn, kch, bvanassche, dlemoal, ritesh.list,
loberman, neelx, sean, mproche, chjohnst, linux-block,
linux-kernel, linux-trace-kernel
Hi Jens, Steve, Masami,
In high-performance storage environments, particularly when utilising RAID
controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe latency
spikes can occur when fast devices are starved of available tags.
Currently, diagnosing this specific queue contention requires deploying
dynamic kprobes or inferring sleep states from scheduler events; there is
no simple, out-of-the-box diagnostic path.
This short series introduces dedicated, low-overhead observability for tag
exhaustion events in the block layer:
- Patch 1 introduces the "block_rq_tag_wait" tracepoint in the tag
allocation slow path to capture individual starvation events as they
occur.
- Patch 2 complements this by exposing "wait_on_hw_tag" and
"wait_on_sched_tag" atomic counters via debugfs for quick,
point-in-time cumulative polling.
Together, these provide storage engineers with zero-configuration
mechanisms to definitively identify shared-tag bottlenecks.
Please let me know your thoughts.
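As a rough illustration of how the cumulative debugfs counters from patch 2
might be consumed by a monitoring script (the values and the 10-second
interval below are hypothetical; on a live system the reads would come from
/sys/kernel/debug/block/<dev>/hctxN/wait_on_hw_tag), two snapshots can be
turned into a starvation rate:

```shell
# Illustrative only: these values stand in for two reads of
# .../hctxN/wait_on_hw_tag taken 10 seconds apart.
t0=103260        # first snapshot (cumulative waits)
t1=104890        # second snapshot, 10s later (hypothetical)
interval=10

# Delta over interval gives waits per second for this hctx
rate=$(( (t1 - t0) / interval ))
echo "wait_on_hw_tag rate: ${rate}/s"
```

Because the counters are cumulative, only deltas between snapshots are
meaningful; an absolute value says nothing about when the starvation
occurred.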
Changes since v2 [1]:
- Added "Reviewed-by:" and "Tested-by:" tags for patch 1
- Evaluate is_sched_tag directly within TP_fast_assign (Steven Rostedt)
- Introduced atomic counters via debugfs
Changes since v1 [2]:
- Improved the description of the trace point (Damien Le Moal)
- Removed the redundant "active requests" (Laurence Oberman)
- Introduced pool-specific starvation tracking
[1]: https://lore.kernel.org/lkml/20260319015300.287653-1-atomlin@atomlin.com/
[2]: https://lore.kernel.org/lkml/20260317182835.258183-1-atomlin@atomlin.com/
Aaron Tomlin (2):
blk-mq: add tracepoint block_rq_tag_wait
blk-mq: expose tag starvation counts via debugfs
block/blk-mq-debugfs.c | 56 ++++++++++++++++++++++++++++++++++++
block/blk-mq-debugfs.h | 7 +++++
block/blk-mq-tag.c | 8 ++++++
include/linux/blk-mq.h | 10 +++++++
include/trace/events/block.h | 43 +++++++++++++++++++++++++++
5 files changed, 124 insertions(+)
--
2.51.0
* [PATCH v3 1/2] blk-mq: add tracepoint block_rq_tag_wait
From: Aaron Tomlin @ 2026-03-19 22:19 UTC (permalink / raw)
To: axboe, rostedt, mhiramat, mathieu.desnoyers
Cc: johannes.thumshirn, kch, bvanassche, dlemoal, ritesh.list,
loberman, neelx, sean, mproche, chjohnst, linux-block,
linux-kernel, linux-trace-kernel
In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices (e.g. SSDs) sharing the same
blk_mq_tag_set are starved of hardware tags.
Currently, diagnosing this specific hardware queue contention is
difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
forces the current thread to block uninterruptibly via io_schedule().
While this can be inferred via sched:sched_switch or dynamically
traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
dedicated, out-of-the-box observability for this event.
This patch introduces the block_rq_tag_wait tracepoint in the tag
allocation slow path. It fires immediately before the thread yields
the CPU, exposing the exact hardware context (hctx) that is starved, the
specific pool experiencing starvation (hardware or software scheduler),
and the total pool depth.
This provides storage engineers and performance monitoring agents
with a zero-configuration, low-overhead mechanism to definitively
identify shared-tag bottlenecks and tune I/O schedulers or cgroup
throttling accordingly.
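For reference, a consumer only needs the TP_printk format below; once the
event is enabled (via the usual tracefs path,
/sys/kernel/tracing/events/block/block_rq_tag_wait/enable), each line can
be parsed with standard tools. A sketch against a fabricated sample line
(the task name, timestamp, and field values are invented; sed is just one
way to extract the fields):

```shell
# Fabricated sample line matching the TP_printk format in this patch
line='fio-1234 [002] d.h. 5071.943: block_rq_tag_wait: 259,0 hctx=3 starved on scheduler tags (depth=64)'

# Pull out the hardware context, the starved pool, and the pool depth
hctx=$(printf '%s\n' "$line"  | sed -n 's/.*hctx=\([0-9]*\).*/\1/p')
pool=$(printf '%s\n' "$line"  | sed -n 's/.*starved on \([a-z]*\) tags.*/\1/p')
depth=$(printf '%s\n' "$line" | sed -n 's/.*depth=\([0-9]*\).*/\1/p')
echo "$hctx $pool $depth"
```

Aggregating such lines per hctx over a run gives an event-accurate picture
of which queues are starving and on which pool.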
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
---
block/blk-mq-tag.c | 4 ++++
include/trace/events/block.h | 43 ++++++++++++++++++++++++++++++++++++
2 files changed, 47 insertions(+)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 33946cdb5716..66138dd043d4 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -13,6 +13,7 @@
#include <linux/kmemleak.h>
#include <linux/delay.h>
+#include <trace/events/block.h>
#include "blk.h"
#include "blk-mq.h"
#include "blk-mq-sched.h"
@@ -187,6 +188,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
if (tag != BLK_MQ_NO_TAG)
break;
+ trace_block_rq_tag_wait(data->q, data->hctx,
+ data->rq_flags & RQF_SCHED_TAGS);
+
bt_prev = bt;
io_schedule();
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 6aa79e2d799c..71554b94e4d0 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -226,6 +226,49 @@ DECLARE_EVENT_CLASS(block_rq,
IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
);
+/**
+ * block_rq_tag_wait - triggered when a request is starved of a tag
+ * @q: request queue of the target device
+ * @hctx: hardware context of the request experiencing starvation
+ * @is_sched_tag: indicates whether the starved pool is the software scheduler
+ *
+ * Called immediately before the submitting context is forced to block due
+ * to the exhaustion of available tags (i.e., physical hardware driver tags
+ * or software scheduler tags). This trace point indicates that the context
+ * will be placed into an uninterruptible state via io_schedule() until an
+ * active request completes and relinquishes its assigned tag.
+ */
+TRACE_EVENT(block_rq_tag_wait,
+
+ TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx, bool is_sched_tag),
+
+ TP_ARGS(q, hctx, is_sched_tag),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( u32, hctx_id )
+ __field( u32, nr_tags )
+ __field( bool, is_sched_tag )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = disk_devt(q->disk);
+ __entry->hctx_id = hctx->queue_num;
+ __entry->is_sched_tag = is_sched_tag;
+
+ if (is_sched_tag)
+ __entry->nr_tags = hctx->sched_tags->nr_tags;
+ else
+ __entry->nr_tags = hctx->tags->nr_tags;
+ ),
+
+ TP_printk("%d,%d hctx=%u starved on %s tags (depth=%u)",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->hctx_id,
+ __entry->is_sched_tag ? "scheduler" : "hardware",
+ __entry->nr_tags)
+);
+
/**
* block_rq_insert - insert block operation request into queue
* @rq: block IO operation request
--
2.51.0
* [PATCH v3 2/2] blk-mq: expose tag starvation counts via debugfs
From: Aaron Tomlin @ 2026-03-19 22:19 UTC (permalink / raw)
To: axboe, rostedt, mhiramat, mathieu.desnoyers
Cc: johannes.thumshirn, kch, bvanassche, dlemoal, ritesh.list,
loberman, neelx, sean, mproche, chjohnst, linux-block,
linux-kernel, linux-trace-kernel
In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices are starved of available
tags.
This patch introduces two new debugfs attributes for each block
hardware queue:
- /sys/kernel/debug/block/[device]/hctxN/wait_on_hw_tag
- /sys/kernel/debug/block/[device]/hctxN/wait_on_sched_tag
These files expose atomic counters that increment each time a submitting
context is forced into an uninterruptible sleep via io_schedule() due to
the complete exhaustion of physical driver tags or software scheduler
tags, respectively.
To guarantee zero performance overhead for production kernels compiled
without debugfs, the underlying atomic_t variables and their associated
increment routines are strictly guarded behind CONFIG_BLK_DEBUG_FS.
When this configuration is disabled, the tracking logic compiles down
to a safe no-op.
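Since the counters are exposed per hardware queue, a total across all
queues of a device is a one-liner. Sketched here against a scratch
directory that mimics the debugfs layout with invented values, since the
real files live under /sys/kernel/debug/block/<dev>/:

```shell
# Mimic the debugfs layout with a scratch directory; on a live system the
# equivalent glob is /sys/kernel/debug/block/<dev>/hctx*/wait_on_hw_tag.
d=$(mktemp -d)
mkdir -p "$d/hctx0" "$d/hctx1"
echo 103260 > "$d/hctx0/wait_on_hw_tag"
echo 540    > "$d/hctx1/wait_on_hw_tag"

# Sum the cumulative hardware-tag waits across all hardware queues
cat "$d"/hctx*/wait_on_hw_tag | awk '{s += $1} END {print "total hw tag waits:", s}'
rm -r "$d"
```

A strongly skewed per-hctx distribution (rather than an even spread) is the
typical signature of a shared-tag bottleneck hitting one submission path.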
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
block/blk-mq-debugfs.c | 56 ++++++++++++++++++++++++++++++++++++++++++
block/blk-mq-debugfs.h | 7 ++++++
block/blk-mq-tag.c | 4 +++
include/linux/blk-mq.h | 10 ++++++++
4 files changed, 77 insertions(+)
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 28167c9baa55..078561d7da38 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -483,6 +483,42 @@ static int hctx_dispatch_busy_show(void *data, struct seq_file *m)
return 0;
}
+/**
+ * hctx_wait_on_hw_tag_show - display hardware tag starvation count
+ * @data: generic pointer to the associated hardware context (hctx)
+ * @m: seq_file pointer for debugfs output formatting
+ *
+ * Prints the cumulative number of times a submitting context was forced
+ * to block due to the exhaustion of physical hardware driver tags.
+ *
+ * Return: 0 on success.
+ */
+static int hctx_wait_on_hw_tag_show(void *data, struct seq_file *m)
+{
+ struct blk_mq_hw_ctx *hctx = data;
+
+ seq_printf(m, "%d\n", atomic_read(&hctx->wait_on_hw_tag));
+ return 0;
+}
+
+/**
+ * hctx_wait_on_sched_tag_show - display scheduler tag starvation count
+ * @data: generic pointer to the associated hardware context (hctx)
+ * @m: seq_file pointer for debugfs output formatting
+ *
+ * Prints the cumulative number of times a submitting context was forced
+ * to block due to the exhaustion of software scheduler tags.
+ *
+ * Return: 0 on success.
+ */
+static int hctx_wait_on_sched_tag_show(void *data, struct seq_file *m)
+{
+ struct blk_mq_hw_ctx *hctx = data;
+
+ seq_printf(m, "%d\n", atomic_read(&hctx->wait_on_sched_tag));
+ return 0;
+}
+
#define CTX_RQ_SEQ_OPS(name, type) \
static void *ctx_##name##_rq_list_start(struct seq_file *m, loff_t *pos) \
__acquires(&ctx->lock) \
@@ -598,6 +634,8 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
{"active", 0400, hctx_active_show},
{"dispatch_busy", 0400, hctx_dispatch_busy_show},
{"type", 0400, hctx_type_show},
+ {"wait_on_hw_tag", 0400, hctx_wait_on_hw_tag_show},
+ {"wait_on_sched_tag", 0400, hctx_wait_on_sched_tag_show},
{},
};
@@ -814,3 +852,21 @@ void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx)
debugfs_remove_recursive(hctx->sched_debugfs_dir);
hctx->sched_debugfs_dir = NULL;
}
+
+/**
+ * blk_mq_debugfs_inc_wait_tags - increment the tag starvation counters
+ * @hctx: hardware context associated with the tag allocation
+ * @is_sched: boolean indicating whether the starved pool is the software scheduler
+ *
+ * Evaluates the exhausted tag pool and increments the appropriate debugfs
+ * starvation counter. This is invoked immediately before the submitting
+ * context is forced into an uninterruptible sleep via io_schedule().
+ */
+void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
+ bool is_sched)
+{
+ if (is_sched)
+ atomic_inc(&hctx->wait_on_sched_tag);
+ else
+ atomic_inc(&hctx->wait_on_hw_tag);
+}
diff --git a/block/blk-mq-debugfs.h b/block/blk-mq-debugfs.h
index 49bb1aaa83dc..2cda555d5730 100644
--- a/block/blk-mq-debugfs.h
+++ b/block/blk-mq-debugfs.h
@@ -34,6 +34,8 @@ void blk_mq_debugfs_register_sched_hctx(struct request_queue *q,
void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx);
void blk_mq_debugfs_register_rq_qos(struct request_queue *q);
+void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
+ bool is_sched);
#else
static inline void blk_mq_debugfs_register(struct request_queue *q)
{
@@ -77,6 +79,11 @@ static inline void blk_mq_debugfs_register_rq_qos(struct request_queue *q)
{
}
+static inline void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
+ bool is_sched)
+{
+}
+
#endif
#if defined(CONFIG_BLK_DEV_ZONED) && defined(CONFIG_BLK_DEBUG_FS)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 66138dd043d4..3cc6a97a87a0 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -17,6 +17,7 @@
#include "blk.h"
#include "blk-mq.h"
#include "blk-mq-sched.h"
+#include "blk-mq-debugfs.h"
/*
* Recalculate wakeup batch when tag is shared by hctx.
@@ -191,6 +192,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
trace_block_rq_tag_wait(data->q, data->hctx,
data->rq_flags & RQF_SCHED_TAGS);
+ blk_mq_debugfs_inc_wait_tags(data->hctx,
+ data->rq_flags & RQF_SCHED_TAGS);
+
bt_prev = bt;
io_schedule();
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 18a2388ba581..f3d8ea93b23f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -453,6 +453,16 @@ struct blk_mq_hw_ctx {
struct dentry *debugfs_dir;
/** @sched_debugfs_dir: debugfs directory for the scheduler. */
struct dentry *sched_debugfs_dir;
+ /**
+ * @wait_on_hw_tag: Cumulative counter incremented each time a submitting
+ * context is forced to block due to physical hardware driver tag exhaustion.
+ */
+ atomic_t wait_on_hw_tag;
+ /**
+ * @wait_on_sched_tag: Cumulative counter incremented each time a submitting
+ * context is forced to block due to software scheduler tag exhaustion.
+ */
+ atomic_t wait_on_sched_tag;
#endif
/**
--
2.51.0
* Re: [PATCH v3 2/2] blk-mq: expose tag starvation counts via debugfs
From: Laurence Oberman @ 2026-03-20 15:08 UTC (permalink / raw)
To: Aaron Tomlin, axboe, rostedt, mhiramat, mathieu.desnoyers
Cc: johannes.thumshirn, kch, bvanassche, dlemoal, ritesh.list, neelx,
sean, mproche, chjohnst, linux-block, linux-kernel,
linux-trace-kernel
On Thu, 2026-03-19 at 18:19 -0400, Aaron Tomlin wrote:
> [...]
For [PATCH v3 2/2] blk-mq: expose tag starvation counts via debugfs
Tested-by: Laurence Oberman <loberman@redhat.com>
Every 10.0s: grep . /sys/kernel/debug/block/nvme0n1/hctx0/wait_on_*        rhel95: Fri Mar 20 11:04:15 2026

/sys/kernel/debug/block/nvme0n1/hctx0/wait_on_hw_tag:103260  <--- cumulative
/sys/kernel/debug/block/nvme0n1/hctx0/wait_on_sched_tag:0
The patch looks good to me, but others will need to confirm.
Reviewed-by: Laurence Oberman <loberman@redhat.com>