Linux Trace Kernel
From: John Garry <john.g.garry@oracle.com>
To: Aaron Tomlin <atomlin@atomlin.com>,
	axboe@kernel.dk, rostedt@goodmis.org, mhiramat@kernel.org,
	mathieu.desnoyers@efficios.com
Cc: bvanassche@acm.org, johannes.thumshirn@wdc.com, kch@nvidia.com,
	dlemoal@kernel.org, ritesh.list@gmail.com, loberman@redhat.com,
	neelx@suse.com, sean@ashe.io, mproche@gmail.com,
	chjohnst@gmail.com, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org
Subject: Re: [PATCH v6 2/2] blk-mq: expose tag starvation counts via debugfs
Date: Mon, 18 May 2026 09:14:49 +0100
Message-ID: <fc307bd1-2c41-4bb1-8a10-b9ffde685d30@oracle.com>
In-Reply-To: <20260517213614.350367-3-atomlin@atomlin.com>

On 17/05/2026 22:36, Aaron Tomlin wrote:
> In high-performance storage environments, particularly when utilising
> RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
> latency spikes can occur when fast devices are starved of available
> tags.
> 
> This patch introduces two new debugfs attributes for each block
> hardware queue:
>    - /sys/kernel/debug/block/[device]/hctxN/wait_on_hw_tag
>    - /sys/kernel/debug/block/[device]/hctxN/wait_on_sched_tag

How would these counters be used? The description only says that
latency spikes may occur and that here are two new counters.

> 
> These files expose atomic counters that increment each time a submitting
> context is forced into an uninterruptible sleep via io_schedule() due to
> the complete exhaustion of physical driver tags or software scheduler
> tags, respectively.
> 
> To ensure negligible performance overhead even in production
> environments where CONFIG_BLK_DEBUG_FS is actively enabled, this
> tracking logic utilises dynamically allocated per-CPU counters. When
> this configuration is disabled, the tracking logic compiles down to a
> safe no-op.

How does one normalise the measured values? During a period of high
contention, many threads may be waiting for a driver tag and the value
in wait_on_hw_tag may jump considerably. How do you normalise this
value for meaningful analysis?
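
To illustrate what normalisation could look like, here is a minimal
user-space sketch, not anything taken from the patch. It assumes the
counter is sampled twice together with a monotonic timestamp, and that a
count of completed I/Os for the same interval is available from
elsewhere. The names tag_wait_sample, tag_wait_rate() and
tag_wait_per_io() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sample: one read of wait_on_hw_tag plus the time of the read. */
struct tag_wait_sample {
	uint64_t count;   /* cumulative value read from the debugfs file */
	uint64_t time_ns; /* monotonic timestamp of the read, in nanoseconds */
};

/*
 * Tag waits per second between two samples.
 * Returns 0.0 when no time has elapsed, so the caller never divides by zero.
 * Assumes a 64-bit cumulative counter, so the unsigned subtraction stays
 * correct even if the counter wraps once between the two reads.
 */
static double tag_wait_rate(const struct tag_wait_sample *prev,
			    const struct tag_wait_sample *cur)
{
	assert(prev && cur);

	if (cur->time_ns <= prev->time_ns)
		return 0.0;

	uint64_t delta = cur->count - prev->count;
	double secs = (double)(cur->time_ns - prev->time_ns) / 1e9;

	return (double)delta / secs;
}

/*
 * Tag waits per completed I/O over the same interval.
 * io_delta is assumed to come from another source of I/O completion counts.
 */
static double tag_wait_per_io(uint64_t wait_delta, uint64_t io_delta)
{
	return io_delta ? (double)wait_delta / (double)io_delta : 0.0;
}
```

A rate (waits per second) makes two observation windows of different
length comparable, while waits per I/O separates "more waits because of
more I/O" from "more waits per unit of work". Whether either ratio is
the right one for this use case is exactly the open question.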

> 
> Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
> ---
>   block/blk-mq-debugfs.c | 109 +++++++++++++++++++++++++++++++++++++++++
>   block/blk-mq-debugfs.h |  19 +++++++
>   block/blk-mq-tag.c     |   4 ++
>   block/blk-mq.c         |   5 ++
>   include/linux/blk-mq.h |  12 +++++
>   5 files changed, 149 insertions(+)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 047ec887456b..a94ffc2eacdf 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -7,6 +7,7 @@
>   #include <linux/blkdev.h>
>   #include <linux/build_bug.h>
>   #include <linux/debugfs.h>
> +#include <linux/percpu.h>
>   
>   #include "blk.h"
>   #include "blk-mq.h"
> @@ -484,6 +485,54 @@ static int hctx_dispatch_busy_show(void *data, struct seq_file *m)
>   	return 0;
>   }
>   
> +/**
> + * hctx_wait_on_hw_tag_show - display hardware tag starvation count
> + * @data: generic pointer to the associated hardware context (hctx)
> + * @m: seq_file pointer for debugfs output formatting
> + *
> + * Prints the cumulative number of times a submitting context was forced
> + * to block due to the exhaustion of physical hardware driver tags.
> + *
> + * Return: 0 on success.
> + */
> +static int hctx_wait_on_hw_tag_show(void *data, struct seq_file *m)
> +{
> +	struct blk_mq_hw_ctx *hctx = data;
> +	unsigned long count = 0;
> +	int cpu;
> +
> +	if (hctx->wait_on_hw_tag) {
> +		for_each_possible_cpu(cpu)
> +			count += *per_cpu_ptr(hctx->wait_on_hw_tag, cpu);
> +	}
> +	seq_printf(m, "%lu\n", count);
> +	return 0;
> +}
> +
> +/**
> + * hctx_wait_on_sched_tag_show - display scheduler tag starvation count
> + * @data: generic pointer to the associated hardware context (hctx)
> + * @m: seq_file pointer for debugfs output formatting
> + *
> + * Prints the cumulative number of times a submitting context was forced
> + * to block due to the exhaustion of software scheduler tags.
> + *
> + * Return: 0 on success.
> + */
> +static int hctx_wait_on_sched_tag_show(void *data, struct seq_file *m)
> +{
> +	struct blk_mq_hw_ctx *hctx = data;
> +	unsigned long count = 0;
> +	int cpu;
> +
> +	if (hctx->wait_on_sched_tag) {
> +		for_each_possible_cpu(cpu)
> +			count += *per_cpu_ptr(hctx->wait_on_sched_tag, cpu);
> +	}
> +	seq_printf(m, "%lu\n", count);
> +	return 0;
> +}
> +
>   #define CTX_RQ_SEQ_OPS(name, type)					\
>   static void *ctx_##name##_rq_list_start(struct seq_file *m, loff_t *pos) \
>   	__acquires(&ctx->lock)						\
> @@ -599,6 +648,8 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
>   	{"active", 0400, hctx_active_show},
>   	{"dispatch_busy", 0400, hctx_dispatch_busy_show},
>   	{"type", 0400, hctx_type_show},
> +	{"wait_on_hw_tag", 0400, hctx_wait_on_hw_tag_show},
> +	{"wait_on_sched_tag", 0400, hctx_wait_on_sched_tag_show},
>   	{},
>   };
>   
> @@ -815,3 +866,61 @@ void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx)
>   	debugfs_remove_recursive(hctx->sched_debugfs_dir);
>   	hctx->sched_debugfs_dir = NULL;
>   }
> +
> +/**
> + * blk_mq_debugfs_alloc_hctx_stats - Allocate per-cpu starvation statistics
> + * @hctx: hardware context associated with the tag allocation
> + * @gfp: memory allocation flags
> + *
> + * Allocates the per-cpu memory for tracking hardware and scheduler tag
> + * starvation.
> + */
> +void blk_mq_debugfs_alloc_hctx_stats(struct blk_mq_hw_ctx *hctx, gfp_t gfp)
> +{
> +	if (!hctx->wait_on_hw_tag)
> +		hctx->wait_on_hw_tag = alloc_percpu_gfp(unsigned long,
> +							gfp);
> +	if (!hctx->wait_on_sched_tag)
> +		hctx->wait_on_sched_tag = alloc_percpu_gfp(unsigned long,
> +							   gfp);
> +}
> +
> +/**
> + * blk_mq_debugfs_free_hctx_stats - Free per-cpu starvation statistics
> + * @hctx: hardware context associated with the tag allocation
> + *
> + * Frees the per-cpu memory used for tracking hardware and scheduler tag
> + * starvation. This must only be called during hardware queue teardown when
> + * the queue is safely frozen and no active I/O submissions can race to
> + * increment the statistics.
> + */
> +void blk_mq_debugfs_free_hctx_stats(struct blk_mq_hw_ctx *hctx)
> +{
> +	free_percpu(hctx->wait_on_hw_tag);
> +	hctx->wait_on_hw_tag = NULL;
> +	free_percpu(hctx->wait_on_sched_tag);
> +	hctx->wait_on_sched_tag = NULL;
> +}
> +
> +/**
> + * blk_mq_debugfs_inc_wait_tags - increment the tag starvation counters
> + * @hctx: hardware context associated with the tag allocation
> + * @is_sched: true if the starved pool is the software scheduler
> + *
> + * Evaluates the exhausted tag pool and safely increments the appropriate
> + * per-cpu debugfs starvation counter.
> + *
> + * Note: The per-cpu pointers are explicitly checked to prevent a NULL
> + * pointer dereference in the event that the system was under heavy memory
> + * pressure and the initial per-cpu allocation failed.
> + */
> +void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
> +				  bool is_sched)
> +{
> +	unsigned long __percpu *tags = is_sched ?
> +			READ_ONCE(hctx->wait_on_sched_tag) :
> +			READ_ONCE(hctx->wait_on_hw_tag);
> +
> +	if (likely(tags))
> +		raw_cpu_inc(*tags);
> +}
> diff --git a/block/blk-mq-debugfs.h b/block/blk-mq-debugfs.h
> index 49bb1aaa83dc..7a7c0f376a2b 100644
> --- a/block/blk-mq-debugfs.h
> +++ b/block/blk-mq-debugfs.h
> @@ -17,6 +17,8 @@ struct blk_mq_debugfs_attr {
>   	const struct seq_operations *seq_ops;
>   };
>   
> +void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
> +				  bool is_sched);
>   int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq);
>   int blk_mq_debugfs_rq_show(struct seq_file *m, void *v);
>   
> @@ -26,6 +28,9 @@ void blk_mq_debugfs_register_hctx(struct request_queue *q,
>   void blk_mq_debugfs_unregister_hctx(struct blk_mq_hw_ctx *hctx);
>   void blk_mq_debugfs_register_hctxs(struct request_queue *q);
>   void blk_mq_debugfs_unregister_hctxs(struct request_queue *q);
> +void blk_mq_debugfs_alloc_hctx_stats(struct blk_mq_hw_ctx *hctx,
> +				     gfp_t gfp);
> +void blk_mq_debugfs_free_hctx_stats(struct blk_mq_hw_ctx *hctx);
>   
>   void blk_mq_debugfs_register_sched(struct request_queue *q);
>   void blk_mq_debugfs_unregister_sched(struct request_queue *q);
> @@ -35,6 +40,11 @@ void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx);
>   
>   void blk_mq_debugfs_register_rq_qos(struct request_queue *q);
>   #else
> +static inline void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
> +						bool is_sched)
> +{
> +}
> +
>   static inline void blk_mq_debugfs_register(struct request_queue *q)
>   {
>   }
> @@ -56,6 +66,15 @@ static inline void blk_mq_debugfs_unregister_hctxs(struct request_queue *q)
>   {
>   }
>   
> +static inline void blk_mq_debugfs_alloc_hctx_stats(struct blk_mq_hw_ctx *hctx,
> +						   gfp_t gfp)
> +{
> +}
> +
> +static inline void blk_mq_debugfs_free_hctx_stats(struct blk_mq_hw_ctx *hctx)
> +{
> +}
> +
>   static inline void blk_mq_debugfs_register_sched(struct request_queue *q)
>   {
>   }
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 66138dd043d4..3cc6a97a87a0 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -17,6 +17,7 @@
>   #include "blk.h"
>   #include "blk-mq.h"
>   #include "blk-mq-sched.h"
> +#include "blk-mq-debugfs.h"
>   
>   /*
>    * Recalculate wakeup batch when tag is shared by hctx.
> @@ -191,6 +192,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>   		trace_block_rq_tag_wait(data->q, data->hctx,
>   					data->rq_flags & RQF_SCHED_TAGS);
>   
> +		blk_mq_debugfs_inc_wait_tags(data->hctx,
> +					     data->rq_flags & RQF_SCHED_TAGS);
> +
>   		bt_prev = bt;
>   		io_schedule();
>   
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4c5c16cce4f8..cd52bf6f82ce 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3991,6 +3991,8 @@ static void blk_mq_exit_hctx(struct request_queue *q,
>   			blk_free_flush_queue_callback);
>   	hctx->fq = NULL;
>   
> +	blk_mq_debugfs_free_hctx_stats(hctx);
> +
>   	spin_lock(&q->unused_hctx_lock);
>   	list_add(&hctx->hctx_list, &q->unused_hctx_list);
>   	spin_unlock(&q->unused_hctx_lock);
> @@ -4016,6 +4018,8 @@ static int blk_mq_init_hctx(struct request_queue *q,
>   {
>   	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
>   
> +	blk_mq_debugfs_alloc_hctx_stats(hctx, gfp);
> +
>   	hctx->fq = blk_alloc_flush_queue(hctx->numa_node, set->cmd_size, gfp);
>   	if (!hctx->fq)
>   		goto fail;
> @@ -4041,6 +4045,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
>   	blk_free_flush_queue(hctx->fq);
>   	hctx->fq = NULL;
>    fail:
> +	blk_mq_debugfs_free_hctx_stats(hctx);
>   	return -1;
>   }
>   
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 18a2388ba581..41d61488d683 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -453,6 +453,18 @@ struct blk_mq_hw_ctx {
>   	struct dentry		*debugfs_dir;
>   	/** @sched_debugfs_dir:	debugfs directory for the scheduler. */
>   	struct dentry		*sched_debugfs_dir;
> +	/**
> +	 * @wait_on_hw_tag: Cumulative per-cpu counter incremented each
> +	 * time a submitting context is forced to block due to physical
> +	 * hardware tag exhaustion.
> +	 */
> +	unsigned long __percpu	*wait_on_hw_tag;
> +	/**
> +	 * @wait_on_sched_tag: Cumulative per-cpu counter incremented each
> +	 * time a submitting context is forced to block due to software
> +	 * scheduler tag exhaustion.
> +	 */
> +	unsigned long __percpu	*wait_on_sched_tag;
>   #endif
>   
>   	/**
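
For readers unfamiliar with the pattern used by the two show functions
above, the following is a simplified, single-threaded user-space model
of a sharded counter: each increment touches only one slot, and a read
sums all slots. It is an analogy only and does not use the kernel
per-CPU API; NR_SLOTS, struct sharded_counter, sharded_inc() and
sharded_sum() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SLOTS 4 /* hypothetical stand-in for the number of possible CPUs */

/* One slot per (modelled) CPU; the total is only computed on read. */
struct sharded_counter {
	unsigned long slot[NR_SLOTS];
};

/* Increment the slot belonging to the given (modelled) CPU. */
static void sharded_inc(struct sharded_counter *c, unsigned int cpu)
{
	assert(c);
	c->slot[cpu % NR_SLOTS]++;
}

/* Sum every slot, mirroring the loop over all possible CPUs on read. */
static unsigned long sharded_sum(const struct sharded_counter *c)
{
	unsigned long sum = 0;

	assert(c);
	for (size_t i = 0; i < NR_SLOTS; i++)
		sum += c->slot[i];

	return sum;
}
```

The design trades a more expensive read (one pass over every slot) for a
cheap increment on the hot path, which fits a counter that is updated on
I/O submission but only read occasionally through debugfs.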


Thread overview: 5+ messages
2026-05-17 21:36 [PATCH v6 0/2] blk-mq: introduce tag starvation observability Aaron Tomlin
2026-05-17 21:36 ` [PATCH v6 1/2] blk-mq: add tracepoint block_rq_tag_wait Aaron Tomlin
2026-05-17 21:36 ` [PATCH v6 2/2] blk-mq: expose tag starvation counts via debugfs Aaron Tomlin
2026-05-18  8:14   ` John Garry [this message]
2026-05-18 13:31 ` [PATCH v6 0/2] blk-mq: introduce tag starvation observability Jens Axboe
