public inbox for linux-block@vger.kernel.org
* [PATCH] blk-mq: add tracepoint block_rq_tag_wait
@ 2026-03-17 18:28 Aaron Tomlin
  2026-03-17 23:38 ` Damien Le Moal
  0 siblings, 1 reply; 5+ messages in thread
From: Aaron Tomlin @ 2026-03-17 18:28 UTC (permalink / raw)
  To: axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: johannes.thumshirn, kch, bvanassche, dlemoal, ritesh.list, neelx,
	sean, mproche, linux-block, linux-kernel, linux-trace-kernel

In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices (e.g. SSDs) sharing the same
blk_mq_tag_set are starved of hardware tags.

Currently, diagnosing this specific hardware queue contention is
difficult. When the tag pool is exhausted, blk_mq_get_tag() forces the
current thread to block uninterruptibly via io_schedule().
While this can be inferred via sched:sched_switch or dynamically
traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
dedicated, out-of-the-box observability for this event.

This patch introduces the block_rq_tag_wait static tracepoint in
the tag allocation slow path. It triggers immediately before the
thread yields the CPU, exposing the exact hardware context (hctx)
that is starved, the total pool size, and the current active request
count.

This provides storage engineers and performance monitoring agents
with a zero-configuration, low-overhead mechanism to definitively
identify shared-tag bottlenecks and tune I/O schedulers or cgroup
throttling accordingly.

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 block/blk-mq-tag.c           |  3 +++
 include/trace/events/block.h | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 33946cdb5716..f50993e86ca5 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -13,6 +13,7 @@
 #include <linux/kmemleak.h>
 
 #include <linux/delay.h>
+#include <trace/events/block.h>
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
@@ -187,6 +188,8 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		if (tag != BLK_MQ_NO_TAG)
 			break;
 
+		trace_block_rq_tag_wait(data->q, data->hctx);
+
 		bt_prev = bt;
 		io_schedule();
 
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 6aa79e2d799c..48e2ba433c87 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -226,6 +226,42 @@ DECLARE_EVENT_CLASS(block_rq,
 		  IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
 );
 
+/**
+ * block_rq_tag_wait - triggered when an I/O request is starved of a tag
+ * @q: queue containing the request
+ * @hctx: hardware context (queue) experiencing starvation
+ *
+ * Called immediately before the submitting thread is forced to block due
+ * to the exhaustion of available hardware tags. This tracepoint indicates
+ * that the thread will be placed into an uninterruptible state via
+ * io_schedule() until an active block I/O operation completes and
+ * relinquishes its assigned tag.
+ */
+TRACE_EVENT(block_rq_tag_wait,
+
+	TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx),
+
+	TP_ARGS(q, hctx),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
+		__field( u32,		hctx_id			)
+		__field( u32,		nr_tags			)
+		__field( u32,		active_requests		)
+	),
+
+	TP_fast_assign(
+		__entry->dev		  = q->disk ? disk_devt(q->disk) : 0;
+		__entry->hctx_id	  = hctx ? hctx->queue_num : 0;
+		__entry->nr_tags	  = hctx && hctx->tags ? hctx->tags->nr_tags : 0;
+		__entry->active_requests  = hctx ? atomic_read(&hctx->nr_active) : 0;
+	),
+
+	TP_printk("%d,%d hctx=%u starved (active=%u/%u)",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->hctx_id, __entry->active_requests, __entry->nr_tags)
+);
+
 /**
  * block_rq_insert - insert block operation request into queue
  * @rq: block IO operation request
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] blk-mq: add tracepoint block_rq_tag_wait
  2026-03-17 18:28 [PATCH] blk-mq: add tracepoint block_rq_tag_wait Aaron Tomlin
@ 2026-03-17 23:38 ` Damien Le Moal
  2026-03-18 13:10   ` Laurence Oberman
  2026-03-18 13:21   ` Aaron Tomlin
  0 siblings, 2 replies; 5+ messages in thread
From: Damien Le Moal @ 2026-03-17 23:38 UTC (permalink / raw)
  To: Aaron Tomlin, axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: johannes.thumshirn, kch, bvanassche, ritesh.list, neelx, sean,
	mproche, linux-block, linux-kernel, linux-trace-kernel

On 2026/03/18 3:28, Aaron Tomlin wrote:
> In high-performance storage environments, particularly when utilising
> RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
> latency spikes can occur when fast devices (e.g. SSDs) sharing the same
> blk_mq_tag_set are starved of hardware tags.
> 
> Currently, diagnosing this specific hardware queue contention is
> difficult. When the tag pool is exhausted, blk_mq_get_tag() forces the
> current thread to block uninterruptibly via io_schedule().
> While this can be inferred via sched:sched_switch or dynamically
> traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
> dedicated, out-of-the-box observability for this event.
> 
> This patch introduces the block_rq_tag_wait static tracepoint in
> the tag allocation slow path. It triggers immediately before the
> thread yields the CPU, exposing the exact hardware context (hctx)
> that is starved, the total pool size, and the current active request
> count.
> 
> This provides storage engineers and performance monitoring agents
> with a zero-configuration, low-overhead mechanism to definitively
> identify shared-tag bottlenecks and tune I/O schedulers or cgroup
> throttling accordingly.
> 
> Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>

Looks OK to me, but I have some suggestions below.

> ---
>  block/blk-mq-tag.c           |  3 +++
>  include/trace/events/block.h | 36 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 39 insertions(+)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 33946cdb5716..f50993e86ca5 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -13,6 +13,7 @@
>  #include <linux/kmemleak.h>
>  
>  #include <linux/delay.h>
> +#include <trace/events/block.h>
>  #include "blk.h"
>  #include "blk-mq.h"
>  #include "blk-mq-sched.h"
> @@ -187,6 +188,8 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>  		if (tag != BLK_MQ_NO_TAG)
>  			break;
>  
> +		trace_block_rq_tag_wait(data->q, data->hctx);
> +
>  		bt_prev = bt;
>  		io_schedule();
>  
> diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> index 6aa79e2d799c..48e2ba433c87 100644
> --- a/include/trace/events/block.h
> +++ b/include/trace/events/block.h
> @@ -226,6 +226,42 @@ DECLARE_EVENT_CLASS(block_rq,
>  		  IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
>  );
>  
> +/**
> + * block_rq_tag_wait - triggered when an I/O request is starved of a tag

when an I/O request -> when a request

> + * @q: queue containing the request

request queue of the target device

("containing" is odd here)

> + * @hctx: hardware context (queue) experiencing starvation

hardware context of the request

> + *
> + * Called immediately before the submitting thread is forced to block due

the submitting thread -> the submitting context

> + * to the exhaustion of available hardware tags. This tracepoint indicates

s/tracepoint/trace point

> + * that the thread will be placed into an uninterruptible state via

s/thread/context

> + * io_schedule() until an active block I/O operation completes and
> + * relinquishes its assigned tag.

until an active request completes

(BIOs do not have tags).

> + */
> +TRACE_EVENT(block_rq_tag_wait,
> +
> +	TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx),
> +
> +	TP_ARGS(q, hctx),
> +
> +	TP_STRUCT__entry(
> +		__field( dev_t,		dev			)
> +		__field( u32,		hctx_id			)
> +		__field( u32,		nr_tags			)
> +		__field( u32,		active_requests		)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev		  = q->disk ? disk_devt(q->disk) : 0;

I do not think that q->disk can ever be NULL when there is a request being
submitted.

> +		__entry->hctx_id	  = hctx ? hctx->queue_num : 0;
> +		__entry->nr_tags	  = hctx && hctx->tags ? hctx->tags->nr_tags : 0;
> +		__entry->active_requests  = hctx ? atomic_read(&hctx->nr_active) : 0;
> +	),
> +
> +	TP_printk("%d,%d hctx=%u starved (active=%u/%u)",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->hctx_id, __entry->active_requests, __entry->nr_tags)
> +);
> +
>  /**
>   * block_rq_insert - insert block operation request into queue
>   * @rq: block IO operation request


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH] blk-mq: add tracepoint block_rq_tag_wait
  2026-03-17 23:38 ` Damien Le Moal
@ 2026-03-18 13:10   ` Laurence Oberman
  2026-03-18 13:21   ` Aaron Tomlin
  1 sibling, 0 replies; 5+ messages in thread
From: Laurence Oberman @ 2026-03-18 13:10 UTC (permalink / raw)
  To: Damien Le Moal, Aaron Tomlin, axboe, rostedt, mhiramat,
	mathieu.desnoyers
  Cc: johannes.thumshirn, kch, bvanassche, ritesh.list, neelx, sean,
	mproche, linux-block, linux-kernel, linux-trace-kernel

On Wed, 2026-03-18 at 08:38 +0900, Damien Le Moal wrote:
> On 2026/03/18 3:28, Aaron Tomlin wrote:
> > [...]
> 
> Looks OK to me, but I have some suggestions below.
> 
> [...]

This visibility will be very useful. I plan to test it fully.
Updates to follow
Thanks
Laurence Oberman



* Re: [PATCH] blk-mq: add tracepoint block_rq_tag_wait
  2026-03-17 23:38 ` Damien Le Moal
  2026-03-18 13:10   ` Laurence Oberman
@ 2026-03-18 13:21   ` Aaron Tomlin
  2026-03-19  0:22     ` Aaron Tomlin
  1 sibling, 1 reply; 5+ messages in thread
From: Aaron Tomlin @ 2026-03-18 13:21 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: axboe, rostedt, mhiramat, mathieu.desnoyers, johannes.thumshirn,
	kch, bvanassche, ritesh.list, neelx, sean, mproche, linux-block,
	linux-kernel, linux-trace-kernel


On Wed, Mar 18, 2026 at 08:38:20AM +0900, Damien Le Moal wrote:
> Looks OK to me, but I have some suggestions below.

Hi Damien,

Thank you for your feedback.

> > +/**
> > + * block_rq_tag_wait - triggered when an I/O request is starved of a tag
> 
> when an I/O request -> when a request

Acknowledged.

> 
> > + * @q: queue containing the request
> 
> request queue of the target device
> 
> ("containing" is odd here)

Acknowledged.

> > + * @hctx: hardware context (queue) experiencing starvation
> 
> hardware context of the request

Acknowledged.

> > + *
> > + * Called immediately before the submitting thread is forced to block due
> 
> the submitting thread -> the submitting context

Acknowledged.

> 
> > + * to the exhaustion of available hardware tags. This tracepoint indicates
> 
> s/tracepoint/trace point

Acknowledged.

> 
> > + * that the thread will be placed into an uninterruptible state via
> 
> s/thread/context

Acknowledged.

> 
> > + * io_schedule() until an active block I/O operation completes and
> > + * relinquishes its assigned tag.
> 
> until an active request completes
> 

Acknowledged.

> > + */
> > +TRACE_EVENT(block_rq_tag_wait,
> > +
> > +	TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx),
> > +
> > +	TP_ARGS(q, hctx),
> > +
> > +	TP_STRUCT__entry(
> > +		__field( dev_t,		dev			)
> > +		__field( u32,		hctx_id			)
> > +		__field( u32,		nr_tags			)
> > +		__field( u32,		active_requests		)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->dev		  = q->disk ? disk_devt(q->disk) : 0;
> 
> I do not think that q->disk can ever be NULL when there is a request being
> submitted.

Yes, I agree. In theory, a race with disk_release() cannot occur since the
gendisk reference counter would still be elevated here.

> 
> > +		__entry->hctx_id	  = hctx ? hctx->queue_num : 0;
> > +		__entry->nr_tags	  = hctx && hctx->tags ? hctx->tags->nr_tags : 0;
> > +		__entry->active_requests  = hctx ? atomic_read(&hctx->nr_active) : 0;
> > +	),
> > +
> > +	TP_printk("%d,%d hctx=%u starved (active=%u/%u)",
> > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > +		  __entry->hctx_id, __entry->active_requests, __entry->nr_tags)
> > +);
> > +
> >  /**
> >   * block_rq_insert - insert block operation request into queue
> >   * @rq: block IO operation request


Kind regards,
-- 
Aaron Tomlin



* Re: [PATCH] blk-mq: add tracepoint block_rq_tag_wait
  2026-03-18 13:21   ` Aaron Tomlin
@ 2026-03-19  0:22     ` Aaron Tomlin
  0 siblings, 0 replies; 5+ messages in thread
From: Aaron Tomlin @ 2026-03-19  0:22 UTC (permalink / raw)
  To: Damien Le Moal, loberman
  Cc: axboe, rostedt, mhiramat, mathieu.desnoyers, johannes.thumshirn,
	kch, bvanassche, ritesh.list, neelx, sean, mproche, chjohnst,
	linux-block, linux-kernel, linux-trace-kernel


On Wed, Mar 18, 2026 at 09:21:23AM -0400, Aaron Tomlin wrote:
> On Wed, Mar 18, 2026 at 08:38:20AM +0900, Damien Le Moal wrote:
> > Looks OK to me, but I have some suggestions below.

Hi Damien, Laurence,

Upon reviewing the source code once more, it is apparent that tracking
"active requests" in this specific trace point is essentially redundant.
If a thread is forced to invoke io_schedule(), the number of active
requests necessarily equals the total number of tags.

In practice, however, it would almost always print active=0 in the
following scenarios:

    1.  "mq-deadline" Scheduler Starvation: The thread sleeps waiting for a
        scheduler tag. Because the request has not been dispatched to
        hardware yet, blk_mq_inc_active_requests() was never called.
        hctx->nr_active is 0.

    2.  NVMe Hardware Starvation, "none" scheduler: The thread sleeps
        waiting for a hardware tag. Because NVMe drives do not share tags,
        blk_mq_inc_active_requests() returns early to save CPU cycles.
        hctx->nr_active remains 0.

    3.  RAID Hardware Starvation, "none" scheduler: The thread sleeps
        waiting for a shared hardware tag. Because it is HCTX_SHARED, the
        kernel tracks the active requests in
        hctx->queue->nr_active_requests_shared_tags. The local
        hctx->nr_active counter is completely bypassed and remains 0.

Rather than attempting to print the active count, the trace point should be
modified to indicate exactly which pool experienced starvation: the
hardware pool or the software scheduler pool.

I will submit a follow-up patch.


Kind regards,
-- 
Aaron Tomlin



end of thread, other threads:[~2026-03-19  0:22 UTC | newest]

Thread overview: 5+ messages:
2026-03-17 18:28 [PATCH] blk-mq: add tracepoint block_rq_tag_wait Aaron Tomlin
2026-03-17 23:38 ` Damien Le Moal
2026-03-18 13:10   ` Laurence Oberman
2026-03-18 13:21   ` Aaron Tomlin
2026-03-19  0:22     ` Aaron Tomlin
