Linux Trace Kernel
 help / color / mirror / Atom feed
* Re: [PATCH 7/7] selftests/bpf: Add tests for uprobe nop10 red zone clobbering
From: Jiri Olsa @ 2026-05-18  7:30 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: oleg, peterz, mingo, mhiramat, andrii, bpf, linux-trace-kernel,
	ast, daniel, martin.lau, eddyz87, yonghong.song, clm,
	ihor.solodrai
In-Reply-To: <e91d31e0c1378fe970db6385c7af419e97af0c2acd37b3b8775a2fac02ea2361@mail.kernel.org>

On Thu, May 14, 2026 at 02:55:27PM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > index 82b3c0ce9253..d553485e7db5 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> 
> [ ... ]
> 
> > diff --git a/tools/testing/selftests/bpf/prog_tests/usdt.c b/tools/testing/selftests/bpf/prog_tests/usdt.c
> > index be34c4087ff5..606601ccdc42 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/usdt.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/usdt.c
> 
> [ ... ]
> 
> > diff --git a/tools/testing/selftests/bpf/progs/test_usdt.c b/tools/testing/selftests/bpf/progs/test_usdt.c
> > index f00cb52874e0..0ee78fb050a1 100644
> > --- a/tools/testing/selftests/bpf/progs/test_usdt.c
> > +++ b/tools/testing/selftests/bpf/progs/test_usdt.c
> 
> [ ... ]
> 
> > diff --git a/tools/testing/selftests/bpf/usdt_2.c b/tools/testing/selftests/bpf/usdt_2.c
> > index b359b389f6c0..5e38f8605b02 100644
> > --- a/tools/testing/selftests/bpf/usdt_2.c
> > +++ b/tools/testing/selftests/bpf/usdt_2.c
> > @@ -13,4 +13,17 @@ void usdt_2(void)
> >  	USDT(optimized_attach, usdt_2);
> >  }
> >
> > +static volatile unsigned long usdt_red_zone_arg1 = 0xDEADBEEF;
> > +static volatile unsigned long usdt_red_zone_arg2 = 0xCAFEBABE;
> > +static volatile unsigned long usdt_red_zone_arg3 = 0xFEEDFACE;
> > +
> > +void __attribute__((noinline)) usdt_red_zone_trigger(void)
> > +{
> > +	unsigned long a1 = usdt_red_zone_arg1;
> > +	unsigned long a2 = usdt_red_zone_arg2;
> > +	unsigned long a3 = usdt_red_zone_arg3;
> > +
> > +	USDT(optimized_attach, usdt_red_zone, a1, a2, a3);
> > +}
> > +
> >  #endif
> 
> Is this test reliable across different compiler versions and optimizations?
> 
> The USDT macro uses the "nor" inline assembly constraint, which allows the
> compiler to allocate these operands to CPU registers. Because there is no
> register pressure in usdt_red_zone_trigger(), modern compilers might allocate
> a1, a2, and a3 directly to registers rather than the stack's red zone.
> 
> If the operands are placed in registers, the uprobe CALL optimization
> clobbering [rsp-8] will not corrupt them. This could cause the test to pass
> even on buggy kernels, creating a false positive.
> 
> Would it be safer to explicitly force these operands into the red zone using
> inline assembly constraints, rather than depending on the compiler's register
> allocator?

answered by Andrii in:
  https://lore.kernel.org/bpf/CAEf4BzZ-nASpoF+-vLYZ1D8eWRVd5JWAuhrwx8XZNrX7ZPY5_Q@mail.gmail.com/

jirka

^ permalink raw reply

* Re: [PATCH 06/13] verification/rvgen: Convert __fill_verify_guards_func() to Lark
From: Nam Cao @ 2026-05-18  7:21 UTC (permalink / raw)
  To: Wander Lairson Costa
  Cc: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <agd1Tb8RDUqfjaGn@wcosta-defaultstring.rmtbr.csb>

Wander Lairson Costa <wander@redhat.com> writes:
>> +        if not self.has_guard:
>> +            return
>
> The signature of function says this function return a list, instead of
> None.

Can you share the tools you are using to catch these? Or did you notice
that yourself?

Nam

^ permalink raw reply

* Re: [PATCH 03/13] verification/rvgen: Implement state and transition parser based on Lark
From: Nam Cao @ 2026-05-18  7:19 UTC (permalink / raw)
  To: Wander Lairson Costa
  Cc: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <agdtxdk-JSyGLZ6o@wcosta-defaultstring.rmtbr.csb>

Wander Lairson Costa <wander@redhat.com> writes:
> On Tue, May 05, 2026 at 08:59:24AM +0200, Nam Cao wrote:
>> +        self.rules = [[c, None]]
>
> Here self.rules is a list of lists...
>
>> +
>> +    def chain(self, op: str, c: ConstraintCondition):
>> +        self.rules[-1][1] = op
>> +        self.rules.append((c, None))
>
> ... but here it is a list of tuples.

Thanks. I will switch it entirely to list of tuples.

Nam

^ permalink raw reply

* Re: [PATCH 02/13] verification/rvgen: Introduce a parse tree for automata using Lark
From: Nam Cao @ 2026-05-18  7:18 UTC (permalink / raw)
  To: Wander Lairson Costa
  Cc: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <agdnvQPLLiLBIxw7@wcosta-defaultstring.rmtbr.csb>

Wander Lairson Costa <wander@redhat.com> writes:
> On Tue, May 05, 2026 at 08:59:23AM +0200, Nam Cao wrote:
>> +    ID: /[_a-zA-Z][_a-zA-Z0-9]+/
>
> This regex rejects symbol character symbol. Is that intentional?

It wasn't intentional. This is blindly copied from the existing regex.

Let me switch to Lark's CNAME.

Nam

^ permalink raw reply

* Re: [PATCH 01/13] verification/rvgen: Switch LTL parser to Lark
From: Nam Cao @ 2026-05-18  7:15 UTC (permalink / raw)
  To: Wander Lairson Costa
  Cc: Gabriele Monaco, Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <agdBcW8_GTP3YuNs@wcosta-defaultstring.rmtbr.csb>

Wander Lairson Costa <wander@redhat.com> writes:
>> +VARIABLE: /[A-Z_0-9]+/
>
> Is it ok to start variable names with a number? (unless I am reading the
> regex wrong).

Thanks for pointing that out. Let me switch to Lark's built-in CNAME.

>> +    def LITERAL(self, args):
>> +        return ASTNode(Variable(args))
>
> shouldn't this be `return ASTNode(Literal(args) == "true")`?

Yeah that's broken, should be
    return ASTNode(Literal(args == "true"))

Nam

^ permalink raw reply

* Re: [PATCH 9/9] rv: Mandate deallocation for per-obj monitors
From: Gabriele Monaco @ 2026-05-18  6:36 UTC (permalink / raw)
  To: Wen Yang
  Cc: Nam Cao, linux-kernel, Steven Rostedt, Masami Hiramatsu,
	linux-trace-kernel
In-Reply-To: <c4a3c4fc-7efb-4eb0-81d6-cc7a0d51f8fa@linux.dev>

On Sun, 2026-05-17 at 17:52 +0800, Wen Yang wrote:
> 
> One gap: tools/verification/rvgen/rvgen/templates/dot2k/main.c uses 
> RV_MON_%%MONITOR_TYPE%% but generates no deallocation code, may fail
> to build with a -Wunused-function warning.
> 

Thanks for the review!

That's technically the purpose of this patch, we don't know exactly how is the
per-obj monitor going to deallocate, so we make sure build fails if they don't
set up a way.

This combined with the fact per-obj monitors aren't really documented (yet),
makes it quite confusing, doesn't it?

Would you prefer we always generate a dummy hook calling da_destroy_storage()
and let the user decide what to do with it without forcing (obscure) compiler
warnings?

Thanks,
Gabriele

> 
> --
> Best wishes,
> Wen
> 
> On 5/12/26 22:02, Gabriele Monaco wrote:
> > The per-object monitors use a hash tables and dynamic allocation of the
> > monitor storage, functions to clean a monitor that is no longer needed
> > are provided but nothing ensures the monitor actually uses them.
> > 
> > Remove the inline specifier on the deallocation function to let the
> > compiler warn in case it isn't referenced. If the monitor really doesn't
> > need one (for instance because instances will never cease to exist
> > before disabling the monitor), the da_skip_deallocation() helper macro
> > can be used to silence the warning.
> > 
> > Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> > ---
> >   include/rv/da_monitor.h                      | 14 +++++++++++++-
> >   kernel/trace/rv/monitors/deadline/deadline.h |  5 ++++-
> >   2 files changed, 17 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> > index 402d3b935c08..378d23ab7dfb 100644
> > --- a/include/rv/da_monitor.h
> > +++ b/include/rv/da_monitor.h
> > @@ -489,8 +489,11 @@ static inline monitor_target
> > da_get_target_by_id(da_id_type id)
> >    * locks.
> >    * This function includes an RCU read-side critical section to synchronise
> >    * against da_monitor_destroy().
> > + * NOTE: inline is omitted on purpose to let the compiler warn if this
> > function
> > + * is never referenced. For monitors that don't require a deallocation
> > hook,
> > + * da_skip_deallocation() can be used.
> >    */
> > -static inline void da_destroy_storage(da_id_type id)
> > +static void da_destroy_storage(da_id_type id)
> >   {
> >   	struct da_monitor_storage *mon_storage;
> >   
> > @@ -504,6 +507,15 @@ static inline void da_destroy_storage(da_id_type id)
> >   	kfree_rcu(mon_storage, rcu);
> >   }
> >   
> > +/*
> > + * da_skip_deallocation - explicitly mark a deallocation function as not
> > required
> > + *
> > + * Only use when you are absolutely sure the monitor doesn't require a
> > + * deallocation hook (i.e. it's not possible for an object to finish
> > existing
> > + * when the monitor is still running).
> > + */
> > +#define da_skip_deallocation(hook) ((void)hook)
> > +
> >   static void da_monitor_reset_all(void)
> >   {
> >   	struct da_monitor_storage *mon_storage;
> > diff --git a/kernel/trace/rv/monitors/deadline/deadline.h
> > b/kernel/trace/rv/monitors/deadline/deadline.h
> > index 78fca873d61e..c39fd79148c2 100644
> > --- a/kernel/trace/rv/monitors/deadline/deadline.h
> > +++ b/kernel/trace/rv/monitors/deadline/deadline.h
> > @@ -194,7 +194,10 @@ static void __maybe_unused handle_newtask(void *data,
> > struct task_struct *task,
> >   		da_create_storage(EXPAND_ID_TASK(task), NULL);
> >   }
> >   
> > -static void __maybe_unused handle_exit(void *data, struct task_struct *p,
> > bool group_dead)
> > +/*
> > + * Deallocation hook, use da_skip_deallocation() when not necessary
> > + */
> > +static void handle_exit(void *data, struct task_struct *p, bool group_dead)
> >   {
> >   	if (p->policy == SCHED_DEADLINE)
> >   		da_destroy_storage(get_entity_id(&p->dl, DL_TASK,
> > DL_TASK));


^ permalink raw reply

* Re: [PATCH v2] tracing/probes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-05-18  6:17 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux trace kernel, bpf, Masami Hiramatsu,
	Mathieu Desnoyers, Mark Rutland, Peter Zijlstra, Namhyung Kim,
	Takaya Saeki, Douglas Raillard, Tom Zanussi, Andrew Morton,
	Thomas Gleixner, Ian Rogers, Jiri Olsa
In-Reply-To: <20260516173310.1dbad146@fedora>

On Sat, 16 May 2026 17:33:09 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

>  Function arguments at kretprobe
>  -------------------------------
> diff --git a/kernel/trace/trace_btf.c b/kernel/trace/trace_btf.c
> index 00172f301f25..be88cc4d97dd 100644
> --- a/kernel/trace/trace_btf.c
> +++ b/kernel/trace/trace_btf.c
> @@ -120,3 +120,114 @@ const struct btf_member *btf_find_struct_member(struct btf *btf,
>  	return member;
>  }
>  
> +#define BITS_ROUNDDOWN_BYTES(bits) ((bits) >> 3)
> +
> +static int find_member(const char *ptr, struct btf *btf,
> +		       const struct btf_type **type, int level)

We already have a similar function.

/*
 * Find a member of data structure/union by name and return it.
 * Return NULL if not found, or -EINVAL if parameter is invalid.
 * If the member is an member of anonymous union/structure, the offset
 * of that anonymous union/structure is stored into @anon_offset. Caller
 * can calculate the correct offset from the root data structure by
 * adding anon_offset to the member's offset.
 */
const struct btf_member *btf_find_struct_member(struct btf *btf,
						const struct btf_type *type,
						const char *member_name,
						u32 *anon_offset)

And this seems useful for you,


> +{
> +	const struct btf_member *member;
> +	const struct btf_type *t = *type;
> +	int i;
> +
> +	/* Max of 3 depth of anonymous structures */
> +	if (level > 3)
> +		return -E2BIG;
> +
> +	for_each_member(i, t, member) {
> +		const char *tname = btf_name_by_offset(btf, member->name_off);
> +
> +		if (strcmp(ptr, tname) == 0) {
> +			int offset = __btf_member_bit_offset(t, member);
> +			*type = btf_type_by_id(btf, member->type);
> +			return BITS_ROUNDDOWN_BYTES(offset);
> +		}
> +
> +		/* Handle anonymous structures */
> +		if (strlen(tname))
> +			continue;
> +
> +		*type = btf_type_by_id(btf, member->type);
> +		if (btf_type_is_struct(*type)) {
> +			int offset = find_member(ptr, btf, type, level + 1);
> +
> +			if (offset < 0)
> +				continue;
> +
> +			return offset + BITS_ROUNDDOWN_BYTES(member->offset);
> +		}
> +	}
> +
> +	return -ENOENT;
> +}
> +
> +/**
> + * btf_find_offset - Find an offset of a member for a structure
> + * @arg: A structure name followed by one or more members
> + * @offset_p: A pointer to where to store the offset
> + *
> + * Will parse @arg with the expected format of: struct.member[[.member]..]
> + * It is delimited by '.'. The first item must be a structure type.
> + * The next are its members. If the member is also of a structure type it
> + * another member may follow ".member".
> + *
> + * Note, @arg is modified but will be put back to what it was on return.
> + *
> + * Returns: 0 on success and -EINVAL if no '.' is present
> + *    or -ENXIO if the structure or member is not found.
> + *    Returns -EINVAL if BTF is not defined.
> + *  On success, @offset_p will contain the offset of the member specified
> + *    by @arg.
> + */
> +int btf_find_offset(char *arg, long *offset_p)

And this can share the inner loop of parse_btf_field().
(BTW, parse_btf_field() supports bitfield, but this does not.)

> +{
> +	const struct btf_type *t;
> +	struct btf *btf;

As Sashiko pointed, btf = NULL here.
But I think we would better to have DEFINE_FREE(btf_put).

> +	long offset = 0;
> +	char *ptr;
> +	int ret;
> +	s32 id;
> +
> +	ptr = strchr(arg, '.');
> +	if (!ptr)
> +		return -EINVAL;
> +
> +	*ptr = '\0';
> +
> +	ret = -ENXIO;
> +	id = bpf_find_btf_id(arg, BTF_KIND_STRUCT, &btf);
> +	if (id < 0)
> +		goto error;
> +
> +	/* Get BTF_KIND_FUNC type */
> +	t = btf_type_by_id(btf, id);
> +
> +	/* May allow more than one member, as long as they are structures */
> +	do {
> +		ret = -ENXIO;
> +		if (!t || !btf_type_is_struct(t))
> +			goto error;
> +
> +		*ptr++ = '.';
> +		arg = ptr;
> +		ptr = strchr(ptr, '.');
> +		if (ptr)
> +			*ptr = '\0';
> +

So, if you don't share the inner loop, you can just reuse
btf_field_struct_member() as below (this should be equivalent
to find_member()):

		field = btf_find_struct_member(btf, t, arg, &anon_offs);
		if (IS_ERR_OR_NULL(field)) {
			ret = field ? PTR_ERR(field) : -ENOENT;
			goto error;
		}

		offset += field->offset + anon_offs;

> +
> +	} while (ptr);
> +
> +	btf_put(btf);
> +	*offset_p = offset;
> +	return 0;
> +
> +error:
> +	btf_put(btf);
> +	if (ptr)
> +		*ptr = '.';
> +	return ret;
> +}
> diff --git a/kernel/trace/trace_btf.h b/kernel/trace/trace_btf.h
> index 4bc44bc261e6..7b0797a6050b 100644
> --- a/kernel/trace/trace_btf.h
> +++ b/kernel/trace/trace_btf.h
> @@ -9,3 +9,13 @@ const struct btf_member *btf_find_struct_member(struct btf *btf,
>  						const struct btf_type *type,
>  						const char *member_name,
>  						u32 *anon_offset);
> +
> +#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
> +/* Will modify arg, but will put it back before returning. */
> +int btf_find_offset(char *arg, long *offset);
> +#else
> +static inline int btf_find_offset(char *arg, long *offset)
> +{
> +	return -EINVAL;
> +}
> +#endif
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index e0d3a0da26af..6fcede2de1a5 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -1165,7 +1165,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
>  
>  	case '+':	/* deref memory */
>  	case '-':
> -		if (arg[1] == 'u') {
> +		if (arg[1] == 'u' && isdigit(arg[2])) {

As Sashiko pointed, This can be broken if STRUCT name starts
with "u[0-9]". What about using "()" for this case?

e.g. 

+u(user_unreg.disable_addr)(uarg)


>  			deref = FETCH_OP_UDEREF;
>  			arg[1] = arg[0];
>  			arg++;
> @@ -1178,7 +1178,22 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
>  			return -EINVAL;
>  		}
>  		*tmp = '\0';
> -		ret = kstrtol(arg, 0, &offset);
> +		if (arg[0] != '-' && !isdigit(*arg)) {
> +			int err = 0;
> +			ret = btf_find_offset(arg, &offset);
> +			switch (ret) {
> +			case -ENODEV: err = TP_ERR_NOSUP_BTFARG; break;
> +			case -E2BIG: err = TP_ERR_MEMBER_TOO_DEEP; break;
> +			case -EINVAL: err = TP_ERR_BAD_STRUCT_FMT; break;
> +			case -ENXIO: err = TP_ERR_BAD_BTF_TID; break;

This should have -ENOENT and default case for unknown error.
(There is TP_ERR_NO_BTF_FIELD already)

> +			}
> +			if (err)
> +				__trace_probe_log_err(ctx->offset, err);
> +			if (ret < 0)
> +				return ret;
> +		} else {
> +			ret = kstrtol(arg, 0, &offset);
> +		}
>  		if (ret) {
>  			trace_probe_log_err(ctx->offset, BAD_DEREF_OFFS);
>  			break;
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 262d8707a3df..d649bb9f5b7c 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -563,7 +563,9 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
>  	C(NEED_STRING_TYPE,	"$comm and immediate-string only accepts string type"),\
>  	C(TOO_MANY_ARGS,	"Too many arguments are specified"),	\
>  	C(TOO_MANY_EARGS,	"Too many entry arguments specified"),	\
> -	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),
> +	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"), \
> +	C(MEMBER_TOO_DEEP,	"Too many indirections of anonymous structure"), \
> +	C(BAD_STRUCT_FMT,	"Unknown BTF structure"),
>  
>  #undef C
>  #define C(a, b)		TP_ERR_##a
> -- 
> 2.53.0
> 

Thanks,

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* [PATCH v6 2/2] blk-mq: expose tag starvation counts via debugfs
From: Aaron Tomlin @ 2026-05-17 21:36 UTC (permalink / raw)
  To: axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
	loberman, neelx, sean, mproche, chjohnst, linux-block,
	linux-kernel, linux-trace-kernel
In-Reply-To: <20260517213614.350367-1-atomlin@atomlin.com>

In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices are starved of available
tags.

This patch introduces two new debugfs attributes for each block
hardware queue:
  - /sys/kernel/debug/block/[device]/hctxN/wait_on_hw_tag
  - /sys/kernel/debug/block/[device]/hctxN/wait_on_sched_tag

These files expose atomic counters that increment each time a submitting
context is forced into an uninterruptible sleep via io_schedule() due to
the complete exhaustion of physical driver tags or software scheduler
tags, respectively.

To ensure negligible performance overhead even in production
environments where CONFIG_BLK_DEBUG_FS is actively enabled, this
tracking logic utilises dynamically allocated per-CPU counters. When
this configuration is disabled, the tracking logic compiles down to a
safe no-op.

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 block/blk-mq-debugfs.c | 109 +++++++++++++++++++++++++++++++++++++++++
 block/blk-mq-debugfs.h |  19 +++++++
 block/blk-mq-tag.c     |   4 ++
 block/blk-mq.c         |   5 ++
 include/linux/blk-mq.h |  12 +++++
 5 files changed, 149 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 047ec887456b..a94ffc2eacdf 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -7,6 +7,7 @@
 #include <linux/blkdev.h>
 #include <linux/build_bug.h>
 #include <linux/debugfs.h>
+#include <linux/percpu.h>
 
 #include "blk.h"
 #include "blk-mq.h"
@@ -484,6 +485,54 @@ static int hctx_dispatch_busy_show(void *data, struct seq_file *m)
 	return 0;
 }
 
+/**
+ * hctx_wait_on_hw_tag_show - display hardware tag starvation count
+ * @data: generic pointer to the associated hardware context (hctx)
+ * @m: seq_file pointer for debugfs output formatting
+ *
+ * Prints the cumulative number of times a submitting context was forced
+ * to block due to the exhaustion of physical hardware driver tags.
+ *
+ * Return: 0 on success.
+ */
+static int hctx_wait_on_hw_tag_show(void *data, struct seq_file *m)
+{
+	struct blk_mq_hw_ctx *hctx = data;
+	unsigned long count = 0;
+	int cpu;
+
+	if (hctx->wait_on_hw_tag) {
+		for_each_possible_cpu(cpu)
+			count += *per_cpu_ptr(hctx->wait_on_hw_tag, cpu);
+	}
+	seq_printf(m, "%lu\n", count);
+	return 0;
+}
+
+/**
+ * hctx_wait_on_sched_tag_show - display scheduler tag starvation count
+ * @data: generic pointer to the associated hardware context (hctx)
+ * @m: seq_file pointer for debugfs output formatting
+ *
+ * Prints the cumulative number of times a submitting context was forced
+ * to block due to the exhaustion of software scheduler tags.
+ *
+ * Return: 0 on success.
+ */
+static int hctx_wait_on_sched_tag_show(void *data, struct seq_file *m)
+{
+	struct blk_mq_hw_ctx *hctx = data;
+	unsigned long count = 0;
+	int cpu;
+
+	if (hctx->wait_on_sched_tag) {
+		for_each_possible_cpu(cpu)
+			count += *per_cpu_ptr(hctx->wait_on_sched_tag, cpu);
+	}
+	seq_printf(m, "%lu\n", count);
+	return 0;
+}
+
 #define CTX_RQ_SEQ_OPS(name, type)					\
 static void *ctx_##name##_rq_list_start(struct seq_file *m, loff_t *pos) \
 	__acquires(&ctx->lock)						\
@@ -599,6 +648,8 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
 	{"active", 0400, hctx_active_show},
 	{"dispatch_busy", 0400, hctx_dispatch_busy_show},
 	{"type", 0400, hctx_type_show},
+	{"wait_on_hw_tag", 0400, hctx_wait_on_hw_tag_show},
+	{"wait_on_sched_tag", 0400, hctx_wait_on_sched_tag_show},
 	{},
 };
 
@@ -815,3 +866,61 @@ void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx)
 	debugfs_remove_recursive(hctx->sched_debugfs_dir);
 	hctx->sched_debugfs_dir = NULL;
 }
+
+/**
+ * blk_mq_debugfs_alloc_hctx_stats - Allocate per-cpu starvation statistics
+ * @hctx: hardware context associated with the tag allocation
+ * @gfp: memory allocation flags
+ *
+ * Allocates the per-cpu memory for tracking hardware and scheduler tag
+ * starvation.
+ */
+void blk_mq_debugfs_alloc_hctx_stats(struct blk_mq_hw_ctx *hctx, gfp_t gfp)
+{
+	if (!hctx->wait_on_hw_tag)
+		hctx->wait_on_hw_tag = alloc_percpu_gfp(unsigned long,
+							gfp);
+	if (!hctx->wait_on_sched_tag)
+		hctx->wait_on_sched_tag = alloc_percpu_gfp(unsigned long,
+							   gfp);
+}
+
+/**
+ * blk_mq_debugfs_free_hctx_stats - Free per-cpu starvation statistics
+ * @hctx: hardware context associated with the tag allocation
+ *
+ * Frees the per-cpu memory used for tracking hardware and scheduler tag
+ * starvation. This must only be called during hardware queue teardown when
+ * the queue is safely frozen and no active I/O submissions can race to
+ * increment the statistics.
+ */
+void blk_mq_debugfs_free_hctx_stats(struct blk_mq_hw_ctx *hctx)
+{
+	free_percpu(hctx->wait_on_hw_tag);
+	hctx->wait_on_hw_tag = NULL;
+	free_percpu(hctx->wait_on_sched_tag);
+	hctx->wait_on_sched_tag = NULL;
+}
+
+/**
+ * blk_mq_debugfs_inc_wait_tags - increment the tag starvation counters
+ * @hctx: hardware context associated with the tag allocation
+ * @is_sched: true if the starved pool is the software scheduler
+ *
+ * Evaluates the exhausted tag pool and safely increments the appropriate
+ * per-cpu debugfs starvation counter.
+ *
+ * Note: The per-cpu pointers are explicitly checked to prevent a NULL
+ * pointer dereference in the event that the system was under heavy memory
+ * pressure and the initial per-cpu allocation failed.
+ */
+void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
+				  bool is_sched)
+{
+	unsigned long __percpu *tags = is_sched ?
+			READ_ONCE(hctx->wait_on_sched_tag) :
+			READ_ONCE(hctx->wait_on_hw_tag);
+
+	if (likely(tags))
+		raw_cpu_inc(*tags);
+}
diff --git a/block/blk-mq-debugfs.h b/block/blk-mq-debugfs.h
index 49bb1aaa83dc..7a7c0f376a2b 100644
--- a/block/blk-mq-debugfs.h
+++ b/block/blk-mq-debugfs.h
@@ -17,6 +17,8 @@ struct blk_mq_debugfs_attr {
 	const struct seq_operations *seq_ops;
 };
 
+void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
+				  bool is_sched);
 int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq);
 int blk_mq_debugfs_rq_show(struct seq_file *m, void *v);
 
@@ -26,6 +28,9 @@ void blk_mq_debugfs_register_hctx(struct request_queue *q,
 void blk_mq_debugfs_unregister_hctx(struct blk_mq_hw_ctx *hctx);
 void blk_mq_debugfs_register_hctxs(struct request_queue *q);
 void blk_mq_debugfs_unregister_hctxs(struct request_queue *q);
+void blk_mq_debugfs_alloc_hctx_stats(struct blk_mq_hw_ctx *hctx,
+				     gfp_t gfp);
+void blk_mq_debugfs_free_hctx_stats(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_debugfs_register_sched(struct request_queue *q);
 void blk_mq_debugfs_unregister_sched(struct request_queue *q);
@@ -35,6 +40,11 @@ void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_debugfs_register_rq_qos(struct request_queue *q);
 #else
+static inline void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
+						bool is_sched)
+{
+}
+
 static inline void blk_mq_debugfs_register(struct request_queue *q)
 {
 }
@@ -56,6 +66,15 @@ static inline void blk_mq_debugfs_unregister_hctxs(struct request_queue *q)
 {
 }
 
+static inline void blk_mq_debugfs_alloc_hctx_stats(struct blk_mq_hw_ctx *hctx,
+						   gfp_t gfp)
+{
+}
+
+static inline void blk_mq_debugfs_free_hctx_stats(struct blk_mq_hw_ctx *hctx)
+{
+}
+
 static inline void blk_mq_debugfs_register_sched(struct request_queue *q)
 {
 }
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 66138dd043d4..3cc6a97a87a0 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -17,6 +17,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
+#include "blk-mq-debugfs.h"
 
 /*
  * Recalculate wakeup batch when tag is shared by hctx.
@@ -191,6 +192,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		trace_block_rq_tag_wait(data->q, data->hctx,
 					data->rq_flags & RQF_SCHED_TAGS);
 
+		blk_mq_debugfs_inc_wait_tags(data->hctx,
+					     data->rq_flags & RQF_SCHED_TAGS);
+
 		bt_prev = bt;
 		io_schedule();
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c5c16cce4f8..cd52bf6f82ce 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3991,6 +3991,8 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 			blk_free_flush_queue_callback);
 	hctx->fq = NULL;
 
+	blk_mq_debugfs_free_hctx_stats(hctx);
+
 	spin_lock(&q->unused_hctx_lock);
 	list_add(&hctx->hctx_list, &q->unused_hctx_list);
 	spin_unlock(&q->unused_hctx_lock);
@@ -4016,6 +4018,8 @@ static int blk_mq_init_hctx(struct request_queue *q,
 {
 	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
 
+	blk_mq_debugfs_alloc_hctx_stats(hctx, gfp);
+
 	hctx->fq = blk_alloc_flush_queue(hctx->numa_node, set->cmd_size, gfp);
 	if (!hctx->fq)
 		goto fail;
@@ -4041,6 +4045,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	blk_free_flush_queue(hctx->fq);
 	hctx->fq = NULL;
  fail:
+	blk_mq_debugfs_free_hctx_stats(hctx);
 	return -1;
 }
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 18a2388ba581..41d61488d683 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -453,6 +453,18 @@ struct blk_mq_hw_ctx {
 	struct dentry		*debugfs_dir;
 	/** @sched_debugfs_dir:	debugfs directory for the scheduler. */
 	struct dentry		*sched_debugfs_dir;
+	/**
+	 * @wait_on_hw_tag: Cumulative per-cpu counter incremented each
+	 * time a submitting context is forced to block due to physical
+	 * hardware tag exhaustion.
+	 */
+	unsigned long __percpu	*wait_on_hw_tag;
+	/**
+	 * @wait_on_sched_tag: Cumulative per-cpu counter incremented each
+	 * time a submitting context is forced to block due to software
+	 * scheduler tag exhaustion.
+	 */
+	unsigned long __percpu	*wait_on_sched_tag;
 #endif
 
 	/**
-- 
2.51.0


^ permalink raw reply related

* [PATCH v6 1/2] blk-mq: add tracepoint block_rq_tag_wait
From: Aaron Tomlin @ 2026-05-17 21:36 UTC (permalink / raw)
  To: axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
	loberman, neelx, sean, mproche, chjohnst, linux-block,
	linux-kernel, linux-trace-kernel
In-Reply-To: <20260517213614.350367-1-atomlin@atomlin.com>

In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices (SSDs) are starved of hardware
tags when sharing the same blk_mq_tag_set.

Currently, diagnosing this specific hardware queue contention is
difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
forces the current thread to block uninterruptible via io_schedule().
While this can be inferred via sched:sched_switch or dynamically
traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
dedicated, out-of-the-box observability for this event.

This patch introduces the block_rq_tag_wait trace point in the tag
allocation slow-path. It triggers immediately before the thread yields
the CPU, exposing the exact hardware context (hctx) that is starved, the
specific pool experiencing starvation (hardware or software scheduler),
and the total pool depth.

This provides storage engineers and performance monitoring agents
with a zero-configuration, low-overhead mechanism to definitively
identify shared-tag bottlenecks and tune I/O schedulers or cgroup
throttling accordingly.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 block/blk-mq-tag.c           |  4 ++++
 include/trace/events/block.h | 43 ++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 33946cdb5716..66138dd043d4 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -13,6 +13,7 @@
 #include <linux/kmemleak.h>
 
 #include <linux/delay.h>
+#include <trace/events/block.h>
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
@@ -187,6 +188,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		if (tag != BLK_MQ_NO_TAG)
 			break;
 
+		trace_block_rq_tag_wait(data->q, data->hctx,
+					data->rq_flags & RQF_SCHED_TAGS);
+
 		bt_prev = bt;
 		io_schedule();
 
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 6aa79e2d799c..d6b9a6bc3c63 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -226,6 +226,49 @@ DECLARE_EVENT_CLASS(block_rq,
 		  IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
 );
 
+/**
+ * block_rq_tag_wait - triggered when a request is starved of a tag
+ * @q: request queue of the target device
+ * @hctx: hardware context of the request experiencing starvation
+ * @is_sched_tag: indicates whether the starved pool is the software scheduler
+ *
+ * Called immediately before the submitting context is forced to block due
+ * to the exhaustion of available tags (i.e., physical hardware driver tags
+ * or software scheduler tags). This trace point indicates that the context
+ * will be placed into an uninterruptible state via io_schedule() until an
+ * active request completes and relinquishes its assigned tag.
+ */
+TRACE_EVENT(block_rq_tag_wait,
+
+	TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx, bool is_sched_tag),
+
+	TP_ARGS(q, hctx, is_sched_tag),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
+		__field( u32,		hctx_id			)
+		__field( u32,		nr_tags			)
+		__field( bool,		is_sched_tag		)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= q->disk ? disk_devt(q->disk) : 0;
+		__entry->hctx_id	= hctx->queue_num;
+		__entry->is_sched_tag	= is_sched_tag;
+
+		if (is_sched_tag)
+			__entry->nr_tags = hctx->sched_tags->nr_tags;
+		else
+			__entry->nr_tags = hctx->tags->nr_tags;
+	),
+
+	TP_printk("%d,%d hctx=%u starved on %s tags (depth=%u)",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->hctx_id,
+		  __entry->is_sched_tag ? "scheduler" : "hardware",
+		  __entry->nr_tags)
+);
+
 /**
  * block_rq_insert - insert block operation request into queue
  * @rq: block IO operation request
-- 
2.51.0


^ permalink raw reply related

* [PATCH v6 0/2] blk-mq: introduce tag starvation observability
From: Aaron Tomlin @ 2026-05-17 21:36 UTC (permalink / raw)
  To: axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
	loberman, neelx, sean, mproche, chjohnst, linux-block,
	linux-kernel, linux-trace-kernel

Hi Jens, Steve, Masami,

In high-performance storage environments, particularly when utilising RAID
controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe latency
spikes can occur when fast devices are starved of available tags.
Currently, diagnosing this specific queue contention requires deploying
dynamic kprobes or inferring sleep states, which lacks a simple,
out-of-the-box diagnostic path.

This short series introduces dedicated, low-overhead observability for tag
exhaustion events in the block layer:

  - Patch 1 introduces the "block_rq_tag_wait" tracepoint in the tag
    allocation slow-path to capture precise, event-based starvation.

  - Patch 2 complements this by exposing "wait_on_hw_tag" and
    "wait_on_sched_tag" per-CPU counters via debugfs for quick,
    point-in-time cumulative polling.

Together, these provide storage engineers with zero-configuration
mechanisms to definitively identify shared-tag bottlenecks.

Please let me know your thoughts.


Changes since v5 [1]:
 - Replaced this_cpu_inc() with raw_cpu_inc() within
   blk_mq_debugfs_inc_wait_tags(). This resolves a preemption warning
   triggered under CONFIG_DEBUG_PREEMPT=y, as the routine is invoked from a
   preemptible context immediately prior to io_schedule(). This adjustment
   deliberately prioritises the reduction of execution overhead over
   absolute statistical precision for this diagnostic interface.

Changes since v4 [2]:
 - Prevented a NULL pointer dereference in the tracepoint fast-assign for
   disk-less request queues by safely checking q->disk before resolving the
   dev_t

 - Fixed a Use-After-Free (UAF) and permanent memory leak by decoupling
   the per-CPU counter allocation from the volatile debugfs lifecycle and
   tying it directly to the core hctx lifecycle (i.e., blk_mq_init_hctx()
   and blk_mq_exit_hctx())

 - Fixed a potential compiler double-fetch bug by wrapping the per-CPU
   pointer evaluations with READ_ONCE() in blk_mq_debugfs_inc_wait_tags()

 - Passed the appropriate gfp_t flags down to the allocation routines to
   maintain the strict GFP_NOIO context

 - Updated kernel-doc descriptions to clarify that the NULL pointer 
   checks guard against memory allocation failures under pressure, rather 
   than initialisation race conditions

Changes since v3 [3]:
 - Transitioned tracking architecture from shared atomic_t variables to
   dynamically allocated per-CPU counters to resolve cache line bouncing
   (Bart Van Assche)

Changes since v2 [4]:
 - Added "Reviewed-by:" and "Tested-by:" tags for patch 1

 - Evaluate is_sched_tag directly within TP_fast_assign (Steven Rostedt)

 - Introduced atomic counters via debugfs 

Changes since v1 [5]:
 - Improved the description of the trace point (Damien Le Moal)

 - Removed the redundant "active requests" (Laurence Oberman)

 - Introduced pool-specific starvation tracking

[1]: https://lore.kernel.org/lkml/20260427020142.358912-1-atomlin@atomlin.com/
[2]: https://lore.kernel.org/lkml/20260419023036.1419514-1-atomlin@atomlin.com/
[3]: https://lore.kernel.org/lkml/20260319221956.332770-1-atomlin@atomlin.com/
[4]: https://lore.kernel.org/lkml/20260319015300.287653-1-atomlin@atomlin.com/
[5]: https://lore.kernel.org/lkml/20260317182835.258183-1-atomlin@atomlin.com/


Aaron Tomlin (2):
  blk-mq: add tracepoint block_rq_tag_wait
  blk-mq: expose tag starvation counts via debugfs

 block/blk-mq-debugfs.c       | 109 +++++++++++++++++++++++++++++++++++
 block/blk-mq-debugfs.h       |  19 ++++++
 block/blk-mq-tag.c           |   8 +++
 block/blk-mq.c               |   5 ++
 include/linux/blk-mq.h       |  12 ++++
 include/trace/events/block.h |  43 ++++++++++++++
 6 files changed, 196 insertions(+)

-- 
2.51.0


^ permalink raw reply

* Re: [PATCH v5 2/2] blk-mq: expose tag starvation counts via debugfs
From: Aaron Tomlin @ 2026-05-17 21:11 UTC (permalink / raw)
  To: axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
	loberman, neelx, sean, mproche, chjohnst, linux-block,
	linux-kernel, linux-trace-kernel
In-Reply-To: <20260427020142.358912-3-atomlin@atomlin.com>

On Sun, Apr 26, 2026 at 10:01:42PM -0400, Aaron Tomlin wrote:
> +/**
> + * blk_mq_debugfs_inc_wait_tags - increment the tag starvation counters
> + * @hctx: hardware context associated with the tag allocation
> + * @is_sched: true if the starved pool is the software scheduler
> + *
> + * Evaluates the exhausted tag pool and safely increments the appropriate
> + * per-cpu debugfs starvation counter.
> + *
> + * Note: The per-cpu pointers are explicitly checked to prevent a NULL
> + * pointer dereference in the event that the system was under heavy memory
> + * pressure and the initial per-cpu allocation failed.
> + */
> +void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
> +				  bool is_sched)
> +{
> +	unsigned long __percpu *tags = is_sched ?
> +			READ_ONCE(hctx->wait_on_sched_tag) :
> +			READ_ONCE(hctx->wait_on_hw_tag);
> +
> +	if (likely(tags))
> +		this_cpu_inc(*tags);
> +}

Hi Jens,

I have realised that this particular code path, invoked from
blk_mq_get_tag() immediately prior to io_schedule(), is in fact, a
preemptible context. Consequently, utilising this_cpu_inc() here will
invariably trigger a warning when operating under a kernel with
CONFIG_DEBUG_PREEMPT=y enabled.

To rectify this, I intend to transition to the use of raw_cpu_inc(). Given
that this is solely for a debugfs interface, I believe it is far more
prudent to prioritise the mitigation of execution overhead over absolute
statistical precision, should a preemption race occur.


Kind regards,
-- 
Aaron Tomlin

^ permalink raw reply

* Re: [PATCH] Re: Re: [RFC PATCH v2 04/10] rv/da: add pre-allocated storage pool for per-object monitors
From: Wen Yang @ 2026-05-17 17:13 UTC (permalink / raw)
  To: Gabriele Monaco; +Cc: linux-kernel, linux-trace-kernel, rostedt
In-Reply-To: <20260515083002.106512-1-gmonaco@redhat.com>

Hi Gabriele,

Thanks again for the thorough review and for the design sketch.
Below is a consolidated reply grouped by patch, describing what has been
implemented in v3.


-- Patch 02: per-task slot ordering / ha_monitor_reset_env

Dropped from v3; v3 is rebased on top of your series [1].

[1] - 
https://lore.kernel.org/lkml/20260512140250.262190-8-gmonaco@redhat.com


-- Patch 03: verificationtest-ktap

 > And isn't it clearer to do:
 >   dir=$(realpath "$(dirname "$0")")

Agreed and applied:

   dir=$(realpath "$(dirname "$0")")
   export PATH="$dir:$PATH"
   "$dir/../ftrace/ftracetest" -K -v --rv "$dir"

 > Then if you really really need to call it directly, do you need to
 > override PATH?

Yes.  The ftracetest check_requires logic calls `command -v <binary>` to
satisfy `requires: <name>:program` directives.  Without the script's
directory in PATH those checks evaluate to exit_unsupported and test cases
are skipped rather than run.  The make path avoids this only because make
sets OUTDIR and the runner appends it to PATH internally.


-- Patch 04: pre-allocated storage pool

 > Since you're using spinlocks, isn't that going to sleep on PREEMPT_RT?

User-mode uprobe handlers run with preempt_count == 0, fully preemptible on
both PREEMPT_RT and non-PREEMPT_RT.  The strongest evidence is in tlob
itself: tlob_start_task() takes a mutex and calls kmem_cache_zalloc(...,
GFP_KERNEL) from the uprobe entry handler; both are illegal in atomic
context and would trigger lockdep splats immediately.

On PREEMPT_RT, spinlock_t becoming a sleeping lock in the uprobe handler 
iscfine: both call sites (da_create_or_get_pool() from the handler and
da_pool_return_cb() from the rcuc kthread) are in sleepable task context.


 > We can have a macro DA_MON_ALLOCATION_STRATEGY = {DA_ALLOC_AUTO,
 > DA_ALLOC_POOL, DA_ALLOC_MANUAL} where DA_MON_POOL also requires
 > DA_MON_POOL_SIZE to be defined (force that with an #error).
 >
 > Anyway, this way you probably wouldn't need to define a different init
 > function and let everything handled more transparently.
 >
 > Also you don't need to call da_create_or_get() explicitly,
 > da_handle_start_event() should do it for you.

Agreed on all counts.  We plan to implement this in v3 as follows.
The three strategies would be a compile-time selection in da_monitor.h:

   DA_ALLOC_AUTO   (default) - lock-free kmalloc_nolock on the hot path;
                               unbounded capacity.

   DA_ALLOC_POOL             - pre-allocated fixed-size pool; 
DA_MON_POOL_SIZE
                               required, enforced with #error if missing.

   DA_ALLOC_MANUAL           - caller pre-inserts storage via
                               da_create_empty_storage() before the first
                               da_handle_start_event(); the framework only
                               links the target field.

da_monitor_init_prealloc() would be removed; da_monitor_init() would
select pool or kmalloc initialisation internally based on the strategy.

da_handle_start_event() and da_handle_start_run_event() would both call
da_prepare_storage() at compile time:

   DA_ALLOC_AUTO   -> da_create_storage()        (kmalloc_nolock)
   DA_ALLOC_POOL   -> da_create_or_get_pool()
   DA_ALLOC_MANUAL -> da_fill_empty_storage()    (link target into pre-
                      allocated slot; no allocation on the hot path)

No explicit da_create_or_get() call would be needed in any monitor.

da_create_or_get_kmalloc() would be removed: as you noted, a caller that
uses kmalloc_nolock does so because locking is forbidden; a GFP_KERNEL
fallback is equally forbidden if the lockless attempt fails, so the
function has no viable use case.

tlob would define:
   #define DA_MON_ALLOCATION_STRATEGY DA_ALLOC_POOL
   #define DA_MON_POOL_SIZE            TLOB_MAX_MONITORED

nomiss would define:
   #define DA_MON_ALLOCATION_STRATEGY DA_ALLOC_MANUAL

and call da_create_empty_storage() from handle_sys_enter() (the
sched_setscheduler syscall path), which runs in safe task context;
da_fill_empty_storage() would then link the sched_dl_entity target on
the first da_handle_start_run_event() call in handle_sched_switch().


-- Patch 05: generic uprobe infrastructure

Carried unchanged into v3 (as part of the 08-b split described below).


-- Patch 06: rvgen __init arrow reset

Thanks, carried unchanged into v3.


 > Why don't you make it a separate event (e.g. "start_tlob") [...] then
 > you also wouldn't need to call reset() and start_timer() manually.

Good suggestion.  We plan to use a dedicated start_tlob event instead,
with a self-loop in tlob.dot:

   "running" -> "running" [ label = "start;reset(clk_elapsed)" ]

da_handle_start_run_event(task->pid, ws, start_tlob) would put the
monitor into running and deliver start_tlob, which resets clk_elapsed
and arms the budget hrtimer via the generated ha_setup_invariants() —
no manual reset() or start_timer() calls needed.

One guard would be added in tlob's ha_setup_invariants() to make the
self-loop work correctly:

   if (next_state == curr_state && event != start_tlob)
       return;

Without this, the start_tlob self-loop would be treated the same as any
repeated switch_in (already running) and ha_setup_invariants() would
return early, leaving the timer unarmed.  Does this look right to you?


-- Patch 08: tlob monitor

--- Patch structure ---

 > Could you have everything that isn't strictly tlob-related in another
 > patch.

Agreed.  With the ioctl interface deferred (see below), v3 would keep
patch 08 as the tlob monitor only:

   05-b: rv: extend uprobe API with three-phase detach helpers
         (rv_uprobe.c, rv_uprobe.h, rv_uprobe_detach refactoring)
         — extension of patch 05, independent of tlob

   08:   rv/tlob: add the tlob monitor itself
         (tlob.c, tlob.h, tlob_trace.h, Kconfig/Makefile, Documentation,
          rv_trace.h include; ha_monitor.h EVENT_NONE_LBL override
          bundled here as it is only needed by tlob)

The chardev infrastructure (rv_chardev.c, rv.h additions) and the UAPI
header (include/uapi/linux/rv.h) would move to a follow-up series
together with the ioctl self-instrumentation feature.

--- ioctl interface design ---

 > I'm not particularly fond of ioctls, they aren't that flexible and in
 > this way I don't really see an added value.
 > [...] cannot the same thing be achieved using uprobes alone, e.g. by
 > registering a function address or the current instruction pointer?
 > [...] wouldn't a sysfs/tracefs file achieve a similar purpose without
 > much of the boilerplate code?

Fair point.  We plan to ship v3 with the tracefs/uprobe interface only
and defer the ioctl (/dev/rv chardev) to a follow-up series once there
is a concrete in-tree user that requires it.

The unique value of the ioctl is that TLOB_IOCTL_TRACE_STOP returns a
synchronous per-call result (-EOVERFLOW or 0) to the calling thread,
which neither uprobes nor tracefs writes can provide.  We want to keep
that option open for later, but agree it should not block the initial
tlob submission.

Does this approach work for you?

What is your preference?

--- Handler simplification ---

 > Perhaps keep the handler simpler by moving this reporting to a helper
 > function and use guard(rcu)() there.

Done.  The accumulation logic is extracted into three inline helpers, each
using scoped_guard(rcu) and returning bool (true if the task is monitored):

   tlob_acc_running(prev)   - accumulate running_ns on sched-out
   tlob_acc_waiting(next)   - accumulate waiting_ns on sched-in
   tlob_acc_sleeping(task)  - accumulate sleeping_ns on wakeup

handle_sched_switch() and handle_sched_wakeup() become one-liners:

   static void handle_sched_switch(...)
   {
       bool prev_preempted = (prev_state == 0);

       if (tlob_acc_running(prev))
           da_handle_event(prev->pid, NULL,
                           prev_preempted ? preempt_tlob : sleep_tlob);
       if (tlob_acc_waiting(next))
           da_handle_event(next->pid, NULL, switch_in_tlob);
   }

 > You probably don't need these. da_handle_event should skip tasks without
 > a monitor.

Agreed; the do_prev/do_next flags are gone.  The helpers return false
for unmonitored tasks, and da_handle_event() skips them too — both paths
are no-ops for tasks with no pool entry.

--- scoped_guard(rcu) ---

 > That should be a scoped_guard(rcu), definitely use guards if you have
 > return paths, the compiler is going to clean up (unlock) for you.

Applied to all RCU-protected sections in tlob_start_task() and
tlob_stop_task().  tlob_start_task() now uses guard(mutex) for the
serialised duplicate-check (replacing the explicit mutex_lock/unlock),
and tlob_stop_task() uses scoped_guard(rcu) for the atomic CAS section:

   scoped_guard(rcu) {
       ws = da_get_target_by_id(task->pid);
       if (!ws)
           return -ESRCH;
       ...
       if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0)
           return -EAGAIN;
   }

--- tlob_stop_all removal ---

 > All this function does should be done by da_monitor_destroy. We could
 > add a way to pass some additional deallocation for all the other cleanup
 > you're doing on each storage.  Something like a da_extra_cleanup() you
 > can define as whatever you need and gets called in all per-obj
 > destruction paths.

Agreed.  tlob_stop_all() (~50 lines) has been removed entirely.

A da_extra_cleanup() hook macro is introduced in da_monitor.h: the default
is a no-op; a monitor may override it before including the header.  tlob
defines:

   static inline void tlob_extra_cleanup(struct da_monitor *da_mon)
   {
       struct ha_monitor *ha_mon = to_ha_monitor(da_mon);
       struct tlob_task_state *ws = da_get_target(ha_mon);

       if (!ws)
           return;
       if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0)
           return;
       ha_cancel_timer_sync(ha_mon);
       atomic_dec(&tlob_num_monitored);
       put_task_struct(ws->task);
       call_rcu(&ws->rcu, tlob_free_rcu);
   }
   #define da_extra_cleanup tlob_extra_cleanup

da_monitor_destroy() iterates remaining entries via da_extra_cleanup +
hash_del_rcu + call_rcu, then waits for all callbacks via rcu_barrier().
tlob's disable path is now simply:

   static void __tlob_destroy_monitor(void)
   {
       da_monitor_destroy();
   }

--- EVENT_NONE_LBL ---

 > Why don't you just override EVENT_NONE_LBL (and if you prefer call it
 > MONITOR_TIMER_EVENT_NAME) without the need for another function?

Done.  model_get_timer_event_name() has been removed from automata.h.
In ha_monitor.h, EVENT_NONE_LBL is now overridable:

   #ifndef EVENT_NONE_LBL
   #define EVENT_NONE_LBL "none"
   #endif

tlob.c defines it before including the model header:

   #define EVENT_NONE_LBL "budget_exceeded"

The two call sites in ha_monitor.h that previously called
model_get_timer_event_name() now use EVENT_NONE_LBL directly.

--- KUnit config / tristate ---

 > Do you need to add this here? Since you have a patch adding KUnit tests
 > to tlob, cannot you put everything kunit-related there?
 > I couldn't build it as module.

Agreed on moving the Kconfig entry to patch 09.

The module build issue is fixed by exporting the symbols needed by the
test via EXPORT_SYMBOL_IF_KUNIT (EXPORTED_FOR_KUNIT_TESTING namespace);
tlob_kunit.c imports the namespace with MODULE_IMPORT_NS.  We plan to
keep tristate rather than changing to bool.  Does that work for you?

--- detail_env_tlob tracepoint ---

 > Since you are not documenting the detail_env_tlob tracepoint, is it
 > something really required? I would at the very least document its usage.

Fair point.  detail_env_tlob emits (running_ns, waiting_ns, sleeping_ns)
so the user can see which phase consumed the budget: high sleeping_ns
indicates I/O latency, high waiting_ns indicates scheduler pressure, high
running_ns indicates a compute overrun.  Without this breakdown the user
only knows the total elapsed time exceeded the threshold, not why.


--- Documentation ---

 > This is standard tracepoints usage, there's nothing about tlob we should
 > document here.
 > Same here, standard RV [for enable/desc tracefs files].
 > And this is duplicating what mentioned above about uprobes, isn't it?

Agreed.  The following have been removed:

   - "Violation events" section: generic trace-cmd examples and cat-trace
     instructions (standard tracepoints usage).
   - tracefs files: "enable (rw)" and "desc (ro)" entries (standard RV).
   - tracefs files: "monitor (rw)" description condensed to one line with
     a cross-reference to the uprobes section above.

In their place, a new "Violation tracepoints" subsection documents both
tlob-specific tracepoints with fields and a worked example:

   error_env_tlob: id, state, event ("budget_exceeded"), env ("clk_elapsed")

   detail_env_tlob: id, threshold_us, running_ns, waiting_ns, sleeping_ns
     Use sleeping_ns to diagnose I/O latency, waiting_ns for scheduler
     pressure, running_ns for compute overruns.

   Example:
     trace-cmd record -e error_env_tlob -e detail_env_tlob &
     # ... run workload ...
     trace-cmd report

 > Is kernel code going to use this API? RV monitors are meant to be
 > enabled by userspace. What's the use-case here?

Agreed.  The uprobe interface is driven from userspace; tlob_start_task()
and tlob_stop_task() are the internal implementation functions it calls,
not a public API for external kernel modules.  The hypothetical
kernel-module use case would be removed from the documentation; the
kernel-doc block is retained for code maintainers.

 > That's probably a bit too detailed for this page. If you really want
 > this information somewhere couldn't it stay in the code?

Agreed; moved to comments in handle_sched_switch() and
handle_sched_wakeup().  The "Limitations" subsection is retained.


-- Patch 09: KUnit tests

 > What caught my eyes are tests enrolling tracepoints handlers. If you
 > go there you're no longer doing unit testing, what's the advantage of
 > testing the entire monitor here over doing that in selftests?

Agreed.  The three suites that register tracepoint handlers or create
kthreads (tlob_sched_integration, tlob_trace_output, tlob_violation_react)
have been removed from KUnit and will be added to selftests in v3.

Two pure unit test suites remain in KUnit:

   tlob_task_api:
     Tests tlob_start_task / tlob_stop_task return values (-ENODEV,
     -EALREADY, -ESRCH, -EOVERFLOW, -ENOSPC, -ERANGE) via direct calls
     (these functions are the internal implementation used by both the
     uprobe and, in future, the ioctl interface).
     No tracepoints, no scheduling.

   tlob_uprobe_format:
     Tests the uprobe line parser (tlob_parse_uprobe_line,
     tlob_parse_remove_line) against valid and invalid input strings.
     Pure string parsing; no scheduling, no tracepoints.

This also resolves the tristate-vs-bool issue: with only pure unit tests
there is no dependency on sched_setscheduler_nocheck, so bool is correct.


--  Patch 10: selftests

--- PREEMPT_RT RCU stall ---

 > I run it on a VM and have it hanging at step 9 [...] rcu_preempt stall.
 > Did you see that? Am I doing something wrong?

Thanks for reporting.  The patch changed ha_monitor.h from
HRTIMER_MODE_REL_HARD (the existing upstream value) to REL_SOFT; the
stall appeared on PREEMPT_RT after that change.  We have not fully
confirmed whether REL_SOFT is the root cause — REL_SOFT defers the
callback to the ktimers kthread, which could starve rcu_preempt under
certain PREEMPT_RT configurations, but other factors may be involved.

We plan to revert to HRTIMER_MODE_REL_HARD at both sites in ha_monitor.h
as the conservative choice:

   ha_setup_timer():     HRTIMER_MODE_REL_SOFT -> HRTIMER_MODE_REL_HARD
   ha_start_timer_ns():  HRTIMER_MODE_REL_SOFT -> HRTIMER_MODE_REL_HARD

Do you have more insight into the stall, or does REL_HARD resolve it on
your setup?

--- Selftest structure ---

 > This should be tested together with the other monitors (enable/disable),
 > we could at most expand those with the check_requires.
 > Let's focus on tlob-only features in this patch.

Agreed.  In v3 we plan to drop tracefs.tc (covered by the generic
rv_monitor_enable_disable.tc) and keep only the six uprobe-specific
test cases under test.d/tlob/

ioctl.tc is deferred with the ioctl interface to the follow-up series.
The KUnit integration tests (sched_switch accounting, budget-expiry
tracepoint) would be moved to selftests as additional test cases.

--
Thanks,
Wen


On 5/15/26 16:30, Gabriele Monaco wrote:
> So this is what I meant. It's quick and dirty but seems to work as far
> as I could test it.
> 
> I didn't change too much around to avoid confusing more, but it probably
> needs a refactor for the functions positions and names. Some AI can do
> that later after we agree on how it should look.
> 
> The main idea is (using current function names):
> 
> da_handle_start_[run_]_event() calls da_prepare_storage(), this
> makes sure the storage is there and usable based on the strategy:
> 1. da_create_storage() plain allocation with kmalloc_nolock
> 2. da_create_or_get_pool() get a slot from the pool
> 3. da_fill_empty_storage() only set the target in a storage manually
>     allocated before
> 
> The reason why you'd need 3. is that since da_handle_start_event() is
> called from a tracepoint, you may in no way be able to allocate from
> there, then you use manually somewhere else with
> da_create_empty_storage() if you don't have the target and
> da_create_or_get() if you do (this one is misleading, we should probably
> simplify further).
> 
> The newly created 2. might be useful if you aren't on preempt-rt and
> cannot sleep but also don't want a manual allocation (beware of lock
> dependencies, it doesn't always work).
> 
> Now, I left your da_create_or_get_kmalloc() unwired because I don't
> really see the use case (you use kmalloc_nolock because you cannot lock,
> so if it fails you don't try a kmalloc). But if we really want to offer
> a possibility to allocate with GFP_KERNEL, we can make 1. more
> configurable.
> 
> Does this make sense to you?
> 
> Thanks,
> Gabriele
> 
> ---
>   include/rv/da_monitor.h                  | 160 ++++++++++-------------
>   kernel/trace/rv/monitors/nomiss/nomiss.c |   2 +-
>   kernel/trace/rv/monitors/tlob/tlob.c     |  15 +--
>   3 files changed, 74 insertions(+), 103 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index 74aa95d9a284..3b4a36245531 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -21,6 +21,7 @@
>   #include <linux/sched.h>
>   #include <linux/slab.h>
>   #include <linux/hashtable.h>
> +#include <linux/mempool.h>
>   
>   /*
>    * Per-cpu variables require a unique name although static in some
> @@ -67,6 +68,35 @@ static struct rv_monitor rv_this;
>   #define da_id_type int
>   #endif
>   
> +#define DA_ALLOC_AUTO 0
> +#define DA_ALLOC_POOL 1
> +#define DA_ALLOC_MANUAL 2
> +
> +/*
> + * Allow the per-object monitors to run allocation manually, necessary if the
> + * start condition is in a context problematic for allocation (e.g. scheduling).
> + * In such case, if the storage was pre-allocated without a target, set it now.
> + */
> +#ifndef DA_MON_ALLOCATION_STRATEGY
> +#define DA_MON_ALLOCATION_STRATEGY DA_ALLOC_AUTO
> +#endif
> +#ifndef DA_MON_POOL_SIZE
> +#define DA_MON_POOL_SIZE 0
> +#endif
> +#if DA_MON_ALLOCATION_STRATEGY == DA_ALLOC_MANUAL
> +#define da_prepare_storage da_fill_empty_storage
> +
> +#elif DA_MON_ALLOCATION_STRATEGY == DA_ALLOC_POOL
> +#define da_prepare_storage da_create_or_get_pool
> +#if DA_MON_POOL_SIZE == 0
> +#error "DA_ALLOC_POOL requires DA_MON_POOL_SIZE to be non-zero"
> +#endif
> +
> +#else
> +#define da_prepare_storage da_create_storage
> +#endif /* DA_MON_ALLOCATION_STRATEGY */
> +
> +
>   static void react(enum states curr_state, enum events event)
>   {
>   	rv_react(&rv_this,
> @@ -448,62 +478,38 @@ static inline monitor_target da_get_target_by_id(da_id_type id)
>   }
>   
>   /*
> - * Per-object pool state.
> - *
> - * Zero-initialised by default (storage == NULL ⟹ kmalloc mode).  A monitor
> - * opts into pool mode by calling da_monitor_init_prealloc(N) instead of
> - * da_monitor_init(), which sets storage to a non-NULL kcalloc'd array.
> - *
> - * Because every field is wrapped in this struct and the struct itself is a
> - * per-TU static, each monitor that includes this header gets a completely
> - * independent pool.  A kmalloc monitor (e.g. nomiss) and a pool monitor
> - * (e.g. tlob) therefore coexist without any interference.
> - *
> - * da_pool_return_cb runs from softirq on non-PREEMPT_RT, so irqsave is
> - * required to prevent deadlock with task-context callers.  On PREEMPT_RT
> - * it runs from an rcuc kthread where spinlock_t is a sleeping lock.
> - */
> -struct da_per_obj_pool {
> -	struct da_monitor_storage  *storage;  /* non-NULL ⟹ pool mode */
> -	struct da_monitor_storage **free;     /* kmalloc'd pointer stack */
> -	unsigned int                free_top;
> -	spinlock_t                  lock;
> -};
> -
> -static struct da_per_obj_pool da_pool = {
> -	.lock = __SPIN_LOCK_UNLOCKED(da_pool.lock),
> -};
> + * Per-object pool state using kmem_cache and mempool.
> + */
> +static struct kmem_cache *da_mon_cache;
> +static mempool_t *da_mon_pool;
>   
>   static void da_pool_return_cb(struct rcu_head *head)
>   {
>   	struct da_monitor_storage *ms =
>   		container_of(head, struct da_monitor_storage, rcu);
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&da_pool.lock, flags);
> -	da_pool.free[da_pool.free_top++] = ms;
> -	spin_unlock_irqrestore(&da_pool.lock, flags);
> +	mempool_free(ms, da_mon_pool);
>   }
>   
> -/* Pops a slot from the pre-allocated pool; returns -ENOSPC if exhausted. */
> -static inline int da_create_or_get_pool(da_id_type id, monitor_target target)
> +/* Pops a slot from the pre-allocated pool; returns NULL if exhausted. */
> +static inline struct da_monitor *da_create_or_get_pool(da_id_type id,
> +						       monitor_target target,
> +						       struct da_monitor *da_mon)
>   {
>   	struct da_monitor_storage *mon_storage;
> -	unsigned long flags;
>   
> -	spin_lock_irqsave(&da_pool.lock, flags);
> -	if (!da_pool.free_top) {
> -		spin_unlock_irqrestore(&da_pool.lock, flags);
> -		return -ENOSPC;
> -	}
> -	mon_storage = da_pool.free[--da_pool.free_top];
> -	spin_unlock_irqrestore(&da_pool.lock, flags);
> +	if (da_mon)
> +		return da_mon;
>   
> +	mon_storage = mempool_alloc_preallocated(da_mon_pool);
> +	if (!mon_storage)
> +		return NULL;
> +
> +	memset(mon_storage, 0, sizeof(*mon_storage));
>   	mon_storage->id = id;
>   	mon_storage->target = target;
>   	guard(rcu)();
>   	hash_add_rcu(da_monitor_ht, &mon_storage->node, id);
> -	return 0;
> +	return &mon_storage->rv.da_mon;
>   }
>   
>   /*
> @@ -547,11 +553,12 @@ static inline int da_create_or_get_kmalloc(da_id_type id, monitor_target target)
>   }
>   
>   /* Create the per-object storage if not already there. */
> -static inline int da_create_or_get(da_id_type id, monitor_target target)
> +// NOTE: this is only needed for manual allocation!
> +// we can refactor to have it only defined there, leaving it for now
> +static inline void da_create_or_get(da_id_type id, monitor_target target)
>   {
> -	if (da_pool.storage)
> -		return da_create_or_get_pool(id, target);
> -	return da_create_or_get_kmalloc(id, target);
> +	guard(rcu)();
> +	da_create_storage(id, target, da_get_monitor(id, target));
>   }
>   
>   /*
> @@ -573,7 +580,7 @@ static inline void da_destroy_storage(da_id_type id)
>   		return;
>   	da_monitor_reset_hook(&mon_storage->rv.da_mon);
>   	hash_del_rcu(&mon_storage->node);
> -	if (da_pool.storage)
> +	if (DA_MON_ALLOCATION_STRATEGY == DA_ALLOC_POOL)
>   		call_rcu(&mon_storage->rcu, da_pool_return_cb);
>   	else
>   		kfree_rcu(mon_storage, rcu);
> @@ -591,41 +598,26 @@ static __maybe_unused void da_monitor_reset_all(void)
>   }
>   
>   /*
> - * da_monitor_init_prealloc - initialise with a pre-allocated storage pool
> - *
> - * Allocates @prealloc_count storage slots up-front so that da_create_or_get()
> - * and da_destroy_storage() never call kmalloc/kfree.  Must be called instead
> - * of da_monitor_init() for monitors that require pool mode.
> + * da_monitor_init - initialise in kmalloc mode (no pre-allocation)
>    */
> -static inline int da_monitor_init_prealloc(unsigned int prealloc_count)
> +static inline int da_monitor_init(void)
>   {
>   	hash_init(da_monitor_ht);
> +	if (DA_MON_ALLOCATION_STRATEGY != DA_ALLOC_POOL)
> +		return 0;
>   
> -	da_pool.storage = kcalloc(prealloc_count, sizeof(*da_pool.storage),
> -				  GFP_KERNEL);
> -	if (!da_pool.storage)
> +	da_mon_cache = kmem_cache_create(__stringify(DA_MON_NAME) "_cache",
> +					 sizeof(struct da_monitor_storage),
> +					 0, 0, NULL);
> +	if (!da_mon_cache)
>   		return -ENOMEM;
>   
> -	da_pool.free = kmalloc_array(prealloc_count, sizeof(*da_pool.free),
> -				     GFP_KERNEL);
> -	if (!da_pool.free) {
> -		kfree(da_pool.storage);
> -		da_pool.storage = NULL;
> +	da_mon_pool = mempool_create_slab_pool(DA_MON_POOL_SIZE, da_mon_cache);
> +	if (!da_mon_pool) {
> +		kmem_cache_destroy(da_mon_cache);
> +		da_mon_cache = NULL;
>   		return -ENOMEM;
>   	}
> -
> -	da_pool.free_top = 0;
> -	for (unsigned int i = 0; i < prealloc_count; i++)
> -		da_pool.free[da_pool.free_top++] = &da_pool.storage[i];
> -	return 0;
> -}
> -
> -/*
> - * da_monitor_init - initialise in kmalloc mode (no pre-allocation)
> - */
> -static inline int da_monitor_init(void)
> -{
> -	hash_init(da_monitor_ht);
>   	return 0;
>   }
>   
> @@ -641,11 +633,10 @@ static inline void da_monitor_destroy_pool(void)
>   	 * pending callback.
>   	 */
>   	rcu_barrier();
> -	kfree(da_pool.storage);
> -	da_pool.storage = NULL;
> -	kfree(da_pool.free);
> -	da_pool.free = NULL;
> -	da_pool.free_top = 0;
> +	mempool_destroy(da_mon_pool);
> +	da_mon_pool = NULL;
> +	kmem_cache_destroy(da_mon_cache);
> +	da_mon_cache = NULL;
>   }
>   
>   static inline void da_monitor_destroy_kmalloc(void)
> @@ -676,23 +667,12 @@ static inline void da_monitor_destroy_kmalloc(void)
>    */
>   static inline void da_monitor_destroy(void)
>   {
> -	if (da_pool.storage)
> +	if (DA_MON_ALLOCATION_STRATEGY == DA_ALLOC_POOL)
>   		da_monitor_destroy_pool();
>   	else
>   		da_monitor_destroy_kmalloc();
>   }
>   
> -/*
> - * Allow the per-object monitors to run allocation manually, necessary if the
> - * start condition is in a context problematic for allocation (e.g. scheduling).
> - * In such case, if the storage was pre-allocated without a target, set it now.
> - */
> -#ifdef DA_SKIP_AUTO_ALLOC
> -#define da_prepare_storage da_fill_empty_storage
> -#else
> -#define da_prepare_storage da_create_storage
> -#endif /* DA_SKIP_AUTO_ALLOC */
> -
>   #endif /* RV_MON_TYPE */
>   
>   #if RV_MON_TYPE == RV_MON_GLOBAL || RV_MON_TYPE == RV_MON_PER_CPU
> diff --git a/kernel/trace/rv/monitors/nomiss/nomiss.c b/kernel/trace/rv/monitors/nomiss/nomiss.c
> index 31f90f3638d8..f089cc0e2f10 100644
> --- a/kernel/trace/rv/monitors/nomiss/nomiss.c
> +++ b/kernel/trace/rv/monitors/nomiss/nomiss.c
> @@ -18,7 +18,7 @@
>   #define RV_MON_TYPE RV_MON_PER_OBJ
>   #define HA_TIMER_TYPE HA_TIMER_WHEEL
>   /* The start condition is on sched_switch, it's dangerous to allocate there */
> -#define DA_SKIP_AUTO_ALLOC
> +#define DA_MON_ALLOCATION_STRATEGY DA_ALLOC_MANUAL
>   typedef struct sched_dl_entity *monitor_target;
>   #include "nomiss.h"
>   #include <rv/ha_monitor.h>
> diff --git a/kernel/trace/rv/monitors/tlob/tlob.c b/kernel/trace/rv/monitors/tlob/tlob.c
> index 90e7035a0b55..486a6bac5cf9 100644
> --- a/kernel/trace/rv/monitors/tlob/tlob.c
> +++ b/kernel/trace/rv/monitors/tlob/tlob.c
> @@ -88,8 +88,8 @@ struct tlob_task_state {
>   
>   #define RV_MON_TYPE RV_MON_PER_OBJ
>   #define HA_TIMER_TYPE HA_TIMER_HRTIMER
> -/* Pool mode: da_handle_start_event uses da_fill_empty_storage, not kmalloc. */
> -#define DA_SKIP_AUTO_ALLOC
> +#define DA_MON_ALLOCATION_STRATEGY DA_ALLOC_POOL
> +#define DA_MON_POOL_SIZE TLOB_MAX_MONITORED
>   
>   /* Type for da_monitor_storage.target; must be defined before the includes. */
>   typedef struct tlob_task_state *monitor_target;
> @@ -428,7 +428,6 @@ int tlob_start_task(struct task_struct *task, u64 threshold_us)
>   	struct da_monitor *da_mon;
>   	struct ha_monitor *ha_mon;
>   	u64 now_ns;
> -	int ret;
>   
>   	if (!da_monitor_enabled())
>   		return -ENODEV;
> @@ -457,14 +456,6 @@ int tlob_start_task(struct task_struct *task, u64 threshold_us)
>   	ws->last_ts = ktime_get();
>   	raw_spin_lock_init(&ws->entry_lock);
>   
> -	/* Claim a pool slot (no kmalloc; DA_SKIP_AUTO_ALLOC + prealloc). */
> -	ret = da_create_or_get(task->pid, ws);
> -	if (ret) {
> -		put_task_struct(task);
> -		kmem_cache_free(tlob_state_cache, ws);
> -		return ret;
> -	}
> -
>   	atomic_inc(&tlob_num_monitored);
>   
>   	/* Hold RCU across handle + timer setup to keep da_mon valid. */
> @@ -955,7 +946,7 @@ static int __tlob_init_monitor(void)
>   
>   	atomic_set(&tlob_num_monitored, 0);
>   
> -	retval = da_monitor_init_prealloc(TLOB_MAX_MONITORED);
> +	retval = da_monitor_init();
>   	if (retval) {
>   		kmem_cache_destroy(tlob_state_cache);
>   		tlob_state_cache = NULL;
> 

^ permalink raw reply

* Re: Race condition in __modify_ftrace_direct() between tmp_ops registration and direct_functions hash update
From: Steven Rostedt @ 2026-05-17 16:53 UTC (permalink / raw)
  To: Afi0, security, linux-kernel, linux-trace-kernel, mhiramat,
	Greg KH, Jiri Olsa
In-Reply-To: <CAEABq7dxnaLrTOhmD+tKnDenmZTUQD8sG=eoxe72mi_gwaus6g@mail.gmail.com>

[ RESEND - I didn't realize you replied to me privately. Adding back Cc list ]

On Sun, 17 May 2026 15:16:17 +0000
Afi0 <capyenglishlite@gmail.com> wrote:

> Hi Steven,
> 
> Thanks for the detailed feedback, and for adding Jiri.
> 
> You're right to challenge this. Let me clarify the exact scenario:
> 
> The race is not about direct being NULL before assignment. The issue arises
> specifically in the *modification* path where an existing non-NULL direct
> is being replaced:
> 
>    1. Caller holds a valid trampoline at address old_addr
>    2. Caller calls modify_ftrace_direct(ops, new_addr)
>    3. __modify_ftrace_direct() registers tmp_ops -> ftrace starts using
>    tmp_ops
>    4. *Window opens:* CPUs entering traced function read entry -> direct =
>    old_addr via ftrace_find_rec_direct()
>    5. Caller, believing the update is complete after modify_ftrace_direct()
>    returns, frees old_addr
>    6. entry->direct = new_addr executes - too late, CPUs already jumped to
>    freed memory
> 
> The key assumption being violated: the caller cannot know when it is safe
> to free old_addr because modify_ftrace_direct() returns before entry ->
> direct is updated. The API implies atomicity that isn't guaranteed.

But __modify_ftrace_direct() calls unregister_ftrace_function(&tmp_ops).

Hmm, tmp_ops being static may be considered part of the core kernel in
which case the FTRACE_OPS_DYNAMIC is not set and the synchronization
will not be called from the unregister function.

> 
> If the convention is that callers *must* never free the old trampoline
> until some explicit synchronization point after modify_ftrace_direct()
> returns, then you're correct that this is a caller bug rather than a bug in
> __modify_ftrace_direct() itself. Could you point me to documentation of
> this requirement? I may have misread the contract.

I'll let Jiri answer this part, but it does seem that there should be a
synchronization to make sure that the code is freed. BPF is the only
user of this, and this is a new feature.

Jiri, if the modify_ftrace_direct() is used to change the trampoline,
what synchronization is done to make make sure it's not called before
being freed?

-- Steve

^ permalink raw reply

* Re: Race condition in __modify_ftrace_direct() between tmp_ops registration and direct_functions hash update
From: Steven Rostedt @ 2026-05-17 13:15 UTC (permalink / raw)
  To: Afi0
  Cc: security, linux-kernel, linux-trace-kernel, mhiramat, Greg KH,
	Jiri Olsa
In-Reply-To: <CAEABq7fMcvHpp4+59Mt-QdgGNpWhOqrGWHKmy+qt3tJSYb69kg@mail.gmail.com>


Added Jiri as he works on this code.

On Sun, 17 May 2026 06:24:11 +0000
Afi0 <capyenglishlite@gmail.com> wrote:

> Hi list,
> 
> Apologies for initially sending only to Greg. Resending to the full list as
> requested.
> ------------------------------
> 
> Component: kernel/trace/ftrace.c Function: __modify_ftrace_direct()
> Affected versions: Linux kernel 5.15+ Type: TOCTOU / Race condition CVSS
> 3.1: AV:L/AC:H/PR:L/UI:N/S:C/C:H/I:H/A:H - 7.8 (High)
> 
> SUMMARY
> 
> A race condition exists in __modify_ftrace_direct() between the
> registration of tmp_ops into ftrace_ops_list and the subsequent update of
> direct_functions hash entries. During this window, concurrent CPUs
> executing traced functions will read the stale direct call address via
> ftrace_find_rec_direct() and jump to it, while the caller may have already
> invalidated or freed the old trampoline memory.

What the above doesn't describe is how the direct was stale to begin
with. Before the assignment, it should be NULL and not a problem, and
if was being modified, the current trampoline that direct points to
should *NOT* be freed before calling this. Otherwise, that itself is a
bug.

-- Steve


> 
> VULNERABLE CODE
> 
> err = register_ftrace_function_nolock(&tmp_ops);[race window:
> ftrace_ops_list_func now active, direct_functions not yet
> updated]mutex_lock(&ftrace_lock);entry->direct = addr;   /* update
> happens here, too late */mutex_unlock(&ftrace_lock);
> 
> IMPACT
> 
> CPU executing traced function reads stale direct_functions entry during the
> race window. arch_ftrace_set_direct_caller() redirects execution to
> potentially freed or invalidated trampoline memory. Use-after-free in
> executable code context on SMP systems.
> 
> TRIGGER
> 
> Requires CAP_PERFMON or CAP_SYS_ADMIN directly. Also reachable via BPF
> trampolines (kernel/bpf/trampoline.c calls __modify_ftrace_direct()
> internally) with CAP_BPF + CAP_PERFMON, default in many CI/CD container
> runtimes. Live patching via klp_patch_func() also goes through this path.
> 
> SUGGESTED FIX
> 
> Update entry->direct under ftrace_lock BEFORE registering tmp_ops. Add
> smp_wmb() between the store and registration to ensure ordering on
> weakly-ordered architectures.
> 
> Patch attached as 0001-ftrace-fix-race-in-__modify_ftrace_direct.patch
> 
> Fixes: 0567d6809440 ("ftrace: Add modify_ftrace_direct()")
> 
> Thanks,
> 
>  Afi0


^ permalink raw reply

* Re: [PATCH 1/7] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Jiri Olsa @ 2026-05-17 11:45 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <CAEf4BzZxNLEeyDE2aJVm=T=RzDayaHzAbHxXQqn6M6uqa45obA@mail.gmail.com>

On Fri, May 15, 2026 at 01:31:31PM -0700, Andrii Nakryiko wrote:
> On Thu, May 14, 2026 at 6:53 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > Andrii reported an issue with optimized uprobes [1] that can clobber
> > redzone area with call instruction storing return address on stack
> > where user code may keep temporary data without adjusting rsp.
> >
> > Fixing this by moving the optimized uprobes on top of 10-bytes nop
> > instruction, so we can squeeze another instruction to escape the
> > redzone area before doing the call, like:
> >
> >   lea -0x80(%rsp), %rsp
> >   call tramp
> >
> > Note the lea instruction is used to adjust the rsp register without
> > changing the flags.
> 
> I think it should be very loudly explained that we can't go back to
> nop10 and have to do short jump over patched sequence (and why).

there's comment in swbp_unoptimize:

         * We have optimized nop10 (lea, call), changing it to 'jmp rel8' to
         * end of the 10-byte slot instead of restoring the original nop10,
         * because we could have thread already inside lea instruction.

I'll add it in here as well

> 
> >
> > The optimized uprobe performance stays the same:
> >
> >         uprobe-nop     :    3.129 ± 0.013M/s
> >         uprobe-push    :    3.045 ± 0.006M/s
> >         uprobe-ret     :    1.095 ± 0.004M/s
> >   -->   uprobe-nop10   :    7.170 ± 0.020M/s
> >         uretprobe-nop  :    2.143 ± 0.021M/s
> >         uretprobe-push :    2.090 ± 0.000M/s
> >         uretprobe-ret  :    0.942 ± 0.000M/s
> >   -->   uretprobe-nop10:    3.381 ± 0.003M/s
> >         usdt-nop       :    3.245 ± 0.004M/s
> >   -->   usdt-nop10     :    7.256 ± 0.023M/s
> >
> > [1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > Reported-by: Andrii Nakryiko <andrii@kernel.org>
> > Closes: https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  arch/x86/kernel/uprobes.c | 121 +++++++++++++++++++++++++++-----------
> >  1 file changed, 86 insertions(+), 35 deletions(-)
> >
> > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > index ebb1baf1eb1d..f7c4101a4039 100644
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> > @@ -636,9 +636,21 @@ struct uprobe_trampoline {
> >         unsigned long           vaddr;
> >  };
> >
> > +#define LEA_INSN_SIZE          5
> > +#define OPT_INSN_SIZE          (LEA_INSN_SIZE + CALL_INSN_SIZE)
> > +#define OPT_JMP8_OFFSET                (OPT_INSN_SIZE - JMP8_INSN_SIZE)
> > +#define REDZONE_SIZE           0x80
> > +
> > +static const u8 lea_rsp[] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };
> > +
> > +static bool is_lea_insn(const uprobe_opcode_t *insn)
> > +{
> > +       return !memcmp(insn, lea_rsp, LEA_INSN_SIZE);
> > +}
> > +
> >  static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
> >  {
> > -       long delta = (long)(vaddr + 5 - vtramp);
> > +       long delta = (long)(vaddr + OPT_INSN_SIZE - vtramp);
> >
> >         return delta >= INT_MIN && delta <= INT_MAX;
> >  }
> > @@ -651,7 +663,7 @@ static unsigned long find_nearest_trampoline(unsigned long vaddr)
> >         };
> >         unsigned long low_limit, high_limit;
> >         unsigned long low_tramp, high_tramp;
> > -       unsigned long call_end = vaddr + 5;
> > +       unsigned long call_end = vaddr + OPT_INSN_SIZE;
> >
> >         if (check_add_overflow(call_end, INT_MIN, &low_limit))
> >                 low_limit = PAGE_SIZE;
> > @@ -826,8 +838,8 @@ SYSCALL_DEFINE0(uprobe)
> 
> should we change -ENXIO to -EPROTO or some other distinct error code,
> so libbpf can avoid using nop5 attachment on kernels new enough to
> support nop5 optimization, but old enough to not do this properly with
> nop10?

right, I'll take that change as well

thanks,
jirka

^ permalink raw reply

* [PATCH] ftrace: fix race in __modify_ftrace_direct() between tmp_ops registration and direct_functions update
From: Andrii Kuchmenko @ 2026-05-17 11:01 UTC (permalink / raw)
  To: linux-trace-kernel
  Cc: rostedt, mhiramat, linux-kernel, Andrii Kuchmenko, stable

In __modify_ftrace_direct(), register_ftrace_function_nolock() makes
tmp_ops visible in ftrace_ops_list before entry->direct is updated
under ftrace_lock. During this window any CPU entering the traced
function calls call_direct_funcs(), reads the old address from
direct_functions via RCU, and jumps to it via
arch_ftrace_set_direct_caller(). If the caller freed or invalidated
the old trampoline before calling modify_ftrace_direct(), this is a
use-after-free in executable code context.

The race window:

  CPU 0 (__modify_ftrace_direct)       CPU 1 (executing traced func)
  ──────────────────────────────       ──────────────────────────────
  register_ftrace_function_nolock()
    -> tmp_ops visible in ops_list
                                        call_direct_funcs()
                                          ftrace_find_rec_direct() -> old_addr
                                          arch_ftrace_set_direct_caller(old_addr)
                                          jump to old_addr  <- UAF if freed
  mutex_lock(&ftrace_lock)
  entry->direct = addr   <- too late
  mutex_unlock(&ftrace_lock)

Fix: update entry->direct under ftrace_lock BEFORE registering tmp_ops.
Any CPU that observes tmp_ops in ftrace_ops_list after this point will
already see the new address when it calls ftrace_find_rec_direct().
Add smp_wmb() between the store and the registration to ensure the
write is visible on weakly-ordered architectures before tmp_ops
becomes observable via ftrace_ops_list.

On error from register_ftrace_function_nolock(), restore entry->direct
to old_addr since tmp_ops never became visible to other CPUs.

This affects all callers of __modify_ftrace_direct(), including:
  - modify_ftrace_direct() used by kernel modules and live patching
  - modify_ftrace_direct_nolock() used by BPF trampolines
    (kernel/bpf/trampoline.c) reachable with CAP_BPF + CAP_PERFMON

Fixes: 0567d6809440 ("ftrace: Add modify_ftrace_direct()")
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Andrii Kuchmenko <capyenglishlite@gmail.com>
---
 kernel/trace/ftrace.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index a1b2c3d4e5f6..b7c8d9e0f1a2 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -5950,6 +5950,7 @@ static int __modify_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
 	struct ftrace_func_entry *entry;
 	struct ftrace_ops tmp_ops;
+	unsigned long old_addr;
 	int err;
 
 	lockdep_assert_held(&direct_mutex);
@@ -5960,22 +5961,36 @@ static int __modify_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
 	if (!entry)
 		return -ENODEV;
 
-	/*
-	 * tmp_ops is registered into ftrace_ops_list here, making it
-	 * visible to all CPUs executing the traced function. However,
-	 * entry->direct is not updated until after this call returns,
-	 * leaving a window where CPUs read the stale (possibly freed)
-	 * direct call address via ftrace_find_rec_direct().
-	 */
-	err = register_ftrace_function_nolock(&tmp_ops);
-	if (err)
-		return err;
-
+	/* Save old address in case we need to roll back on error. */
+	old_addr = entry->direct;
+
+	/*
+	 * Update entry->direct BEFORE registering tmp_ops into
+	 * ftrace_ops_list. This closes the race window where a CPU
+	 * executing the traced function could read the old (potentially
+	 * freed) direct call address between tmp_ops becoming visible
+	 * and entry->direct being updated.
+	 *
+	 * Any CPU that observes tmp_ops in ftrace_ops_list after the
+	 * smp_wmb() below is guaranteed to see the new address when
+	 * it calls ftrace_find_rec_direct().
+	 */
 	mutex_lock(&ftrace_lock);
 	entry->direct = addr;
 	mutex_unlock(&ftrace_lock);
 
+	/*
+	 * Ensure entry->direct store is ordered before tmp_ops
+	 * becomes visible via ftrace_ops_list on weakly-ordered archs.
+	 */
+	smp_wmb();
+
+	err = register_ftrace_function_nolock(&tmp_ops);
+	if (err) {
+		/* tmp_ops never became visible; safe to restore old_addr. */
+		mutex_lock(&ftrace_lock);
+		entry->direct = old_addr;
+		mutex_unlock(&ftrace_lock);
+		return err;
+	}
+
 	/*
 	 * Now that tmp_ops is registered and entry->direct is updated,
 	 * unregister the original ops and clean up.
-- 
2.39.0

^ permalink raw reply related

* Re: [PATCH 9/9] rv: Mandate deallocation for per-obj monitors
From: Wen Yang @ 2026-05-17  9:52 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Masami Hiramatsu,
	linux-trace-kernel
  Cc: Nam Cao
In-Reply-To: <20260512140250.262190-10-gmonaco@redhat.com>


One gap: tools/verification/rvgen/rvgen/templates/dot2k/main.c uses 
RV_MON_%%MONITOR_TYPE%% but generates no deallocation code, may fail
to build with a -Wunused-function warning.


--
Best wishes,
Wen

On 5/12/26 22:02, Gabriele Monaco wrote:
> The per-object monitors use a hash tables and dynamic allocation of the
> monitor storage, functions to clean a monitor that is no longer needed
> are provided but nothing ensures the monitor actually uses them.
> 
> Remove the inline specifier on the deallocation function to let the
> compiler warn in case it isn't referenced. If the monitor really doesn't
> need one (for instance because instances will never cease to exist
> before disabling the monitor), the da_skip_deallocation() helper macro
> can be used to silence the warning.
> 
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   include/rv/da_monitor.h                      | 14 +++++++++++++-
>   kernel/trace/rv/monitors/deadline/deadline.h |  5 ++++-
>   2 files changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index 402d3b935c08..378d23ab7dfb 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -489,8 +489,11 @@ static inline monitor_target da_get_target_by_id(da_id_type id)
>    * locks.
>    * This function includes an RCU read-side critical section to synchronise
>    * against da_monitor_destroy().
> + * NOTE: inline is omitted on purpose to let the compiler warn if this function
> + * is never referenced. For monitors that don't require a deallocation hook,
> + * da_skip_deallocation() can be used.
>    */
> -static inline void da_destroy_storage(da_id_type id)
> +static void da_destroy_storage(da_id_type id)
>   {
>   	struct da_monitor_storage *mon_storage;
>   
> @@ -504,6 +507,15 @@ static inline void da_destroy_storage(da_id_type id)
>   	kfree_rcu(mon_storage, rcu);
>   }
>   
> +/*
> + * da_skip_deallocation - explicitly mark a deallocation function as not required
> + *
> + * Only use when you are absolutely sure the monitor doesn't require a
> + * deallocation hook (i.e. it's not possible for an object to finish existing
> + * when the monitor is still running).
> + */
> +#define da_skip_deallocation(hook) ((void)hook)
> +
>   static void da_monitor_reset_all(void)
>   {
>   	struct da_monitor_storage *mon_storage;
> diff --git a/kernel/trace/rv/monitors/deadline/deadline.h b/kernel/trace/rv/monitors/deadline/deadline.h
> index 78fca873d61e..c39fd79148c2 100644
> --- a/kernel/trace/rv/monitors/deadline/deadline.h
> +++ b/kernel/trace/rv/monitors/deadline/deadline.h
> @@ -194,7 +194,10 @@ static void __maybe_unused handle_newtask(void *data, struct task_struct *task,
>   		da_create_storage(EXPAND_ID_TASK(task), NULL);
>   }
>   
> -static void __maybe_unused handle_exit(void *data, struct task_struct *p, bool group_dead)
> +/*
> + * Deallocation hook, use da_skip_deallocation() when not necessary
> + */
> +static void handle_exit(void *data, struct task_struct *p, bool group_dead)
>   {
>   	if (p->policy == SCHED_DEADLINE)
>   		da_destroy_storage(get_entity_id(&p->dl, DL_TASK, DL_TASK));

^ permalink raw reply

* Re: [PATCH 8/9] rv: Add automatic cleanup handlers for per-task HA monitors
From: Wen Yang @ 2026-05-17  9:40 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Nam Cao,
	linux-trace-kernel
In-Reply-To: <20260512140250.262190-9-gmonaco@redhat.com>

ha_cancel_timer_sync() is the right choice for task exit; the
ha_mon_initializing guard correctly handles the init-window race.

One issue: after ha_monitor_disable_hook(), an in-flight
ha_handle_sched_process_exit() handler may still be executing.  It
reads task_mon_slot via da_get_monitor() (&p->rv[task_mon_slot]);
da_monitor_sync_hook() = synchronize_rcu() cannot drain it because
tracepoint handlers run outside any RCU read-side section.  If
rv_put_task_monitor_slot() writes RV_PER_TASK_MONITOR_INIT to
task_mon_slot first, the handler dereferences an OOB index.

This is the same race Patch 5 closes for PER_OBJ with
tracepoint_synchronize_unregister(); the PER_TASK da_monitor_destroy()
needs the same call (and so does every other PER_TASK monitor, not only
the new exit handler).

Could you add tracepoint_synchronize_unregister() to the PER_TASK
da_monitor_destroy() ?  Alternatively, we can carry the fix on top of
your series.

--
Best wishes,
Wen


On 5/12/26 22:02, Gabriele Monaco wrote:
> Hybrid automata monitors may start timers, depending on the model, these
> may remain active on an exiting task and cause false positives or even
> access freed memory.
> 
> Add an enable/disable hook in the HA code, currently only populated by
> the per-task handler for registration and deregistration.
> This hooks to the sched_process_exit event and ensures the timer is
> stopped for every exiting task. The handler is enabled automatically but
> may be disabled, for instance if the monitor uses the event for another
> purpose (but should still manually ensure timers are stopped).
> 
> Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   include/rv/ha_monitor.h | 44 +++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 44 insertions(+)
> 
> diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
> index 11ae85bad492..1bdf866e9c63 100644
> --- a/include/rv/ha_monitor.h
> +++ b/include/rv/ha_monitor.h
> @@ -28,6 +28,7 @@ static inline void ha_monitor_init_env(struct da_monitor *da_mon);
>   static inline void ha_monitor_reset_env(struct da_monitor *da_mon);
>   static inline void ha_setup_timer(struct ha_monitor *ha_mon);
>   static inline bool ha_cancel_timer(struct ha_monitor *ha_mon);
> +static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon);
>   static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
>   					 enum states curr_state,
>   					 enum events event,
> @@ -38,6 +39,26 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
>   #define da_monitor_reset_hook ha_monitor_reset_env
>   #define da_monitor_sync_hook() synchronize_rcu()
>   
> +#if !defined(HA_SKIP_AUTO_CLEANUP) && RV_MON_TYPE == RV_MON_PER_TASK
> +/*
> + * Automatic cleanup handlers for per-task HA monitors, only skip if you know
> + * what you are doing (e.g. you want to implement cleanup manually in another
> + * handler doing more things).
> + */
> +static void ha_handle_sched_process_exit(void *data, struct task_struct *p,
> +					 bool group_dead);
> +
> +#define ha_monitor_enable_hook()                                             \
> +	rv_attach_trace_probe(__stringify(MONITOR_NAME), sched_process_exit, \
> +			      ha_handle_sched_process_exit)
> +#define ha_monitor_disable_hook()                                            \
> +	rv_detach_trace_probe(__stringify(MONITOR_NAME), sched_process_exit, \
> +			      ha_handle_sched_process_exit)
> +#else
> +#define ha_monitor_enable_hook()
> +#define ha_monitor_disable_hook()
> +#endif
> +
>   #include <rv/da_monitor.h>
>   #include <linux/seq_buf.h>
>   
> @@ -124,12 +145,14 @@ static int ha_monitor_init(void)
>   
>   	ha_mon_initializing = true;
>   	ret = da_monitor_init();
> +	ha_monitor_enable_hook();
>   	ha_mon_initializing = false;
>   	return ret;
>   }
>   
>   static void ha_monitor_destroy(void)
>   {
> +	ha_monitor_disable_hook();
>   	da_monitor_destroy();
>   }
>   
> @@ -230,6 +253,18 @@ static inline void ha_trace_error_env(struct ha_monitor *ha_mon,
>   {
>   	CONCATENATE(trace_error_env_, MONITOR_NAME)(id, curr_state, event, env);
>   }
> +
> +#if !defined(HA_SKIP_AUTO_CLEANUP) && RV_MON_TYPE == RV_MON_PER_TASK
> +static void ha_handle_sched_process_exit(void *data, struct task_struct *p,
> +					 bool group_dead)
> +{
> +	struct da_monitor *da_mon = da_get_monitor(p);
> +
> +	if (likely(!ha_monitor_uninitialized(da_mon)))
> +		ha_cancel_timer_sync(to_ha_monitor(da_mon));
> +}
> +#endif
> +
>   #endif /* RV_MON_TYPE */
>   
>   /*
> @@ -455,6 +490,10 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
>   {
>   	return timer_delete(&ha_mon->timer);
>   }
> +static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon)
> +{
> +	timer_delete_sync(&ha_mon->timer);
> +}
>   #elif HA_TIMER_TYPE == HA_TIMER_HRTIMER
>   /*
>    * Helper functions to handle the monitor timer.
> @@ -506,6 +545,10 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
>   {
>   	return hrtimer_try_to_cancel(&ha_mon->hrtimer) == 1;
>   }
> +static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon)
> +{
> +	hrtimer_cancel(&ha_mon->hrtimer);
> +}
>   #else /* HA_TIMER_NONE */
>   /*
>    * Start function is intentionally not defined, monitors using timers must
> @@ -516,6 +559,7 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
>   {
>   	return false;
>   }
> +static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon) { }
>   #endif
>   
>   #endif

^ permalink raw reply

* Re: [PATCH 7/9] rv: Do not rely on clean monitor when initialising HA
From: Wen Yang @ 2026-05-17  9:15 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Masami Hiramatsu,
	Nam Cao, linux-trace-kernel
In-Reply-To: <20260512140250.262190-8-gmonaco@redhat.com>


The ha_mon_initializing flag correctly
guards the init path against stale monitoring=1 (from either a previous
different monitor type or a race during teardown of the previous instance).


Reviewed-by: Wen Yang <wen.yang@linux.dev>


On 5/12/26 22:02, Gabriele Monaco wrote:
> Hybrid Automata monitors hook into the DA implementation when doing
> da_monitor_reset(). This function is called both on initialisation and
> teardown, HA monitors try to cancel a timer only when it's initialised
> relying on the da_mon->monitoring flag. This flag could however be
> corrupted during initialisation. This happens for instance on per-task
> monitors that share the same storage with different type of monitors
> like LTL or in case of races during a previous teardown.
> 
> Stop relying on the monitoring flag during initialisation, assume that
> can have any value, so skip timer cancellation in any case when a local
> flag is set. New monitors (e.g. new tasks) are always zero-initialised
> so they are safe.
> 
> Reported-by: Wen Yang <wen.yang@linux.dev>
> Closes: https://lore.kernel.org/lkml/d02c656aada7d071f083460a5c9a454363669b61.1778522945.git.wen.yang@linux.dev
> Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   include/rv/ha_monitor.h                       | 31 ++++++++++++++++++-
>   kernel/trace/rv/monitors/nomiss/nomiss.c      |  4 +--
>   kernel/trace/rv/monitors/opid/opid.c          |  4 +--
>   kernel/trace/rv/monitors/stall/stall.c        |  4 +--
>   .../rvgen/rvgen/templates/dot2k/main.c        |  4 +--
>   5 files changed, 38 insertions(+), 9 deletions(-)
> 
> diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
> index 47ff1a41febe..11ae85bad492 100644
> --- a/include/rv/ha_monitor.h
> +++ b/include/rv/ha_monitor.h
> @@ -116,6 +116,35 @@ static enum hrtimer_restart ha_monitor_timer_callback(struct hrtimer *hrtimer);
>   #define ha_get_ns() 0
>   #endif /* HA_CLK_NS */
>   
> +static bool ha_mon_initializing;
> +
> +static int ha_monitor_init(void)
> +{
> +	int ret;
> +
> +	ha_mon_initializing = true;
> +	ret = da_monitor_init();
> +	ha_mon_initializing = false;
> +	return ret;
> +}
> +
> +static void ha_monitor_destroy(void)
> +{
> +	da_monitor_destroy();
> +}
> +
> +/*
> + * ha_monitor_uninitialized - are fields like the timer initialized?
> + *
> + * On a clean monitor, we can assume an active monitor (monitoring) is
> + * initialized, however the monitoring field cannot be trusted during
> + * initialization.
> + */
> +static inline bool ha_monitor_uninitialized(struct da_monitor *da_mon)
> +{
> +	return ha_mon_initializing || !da_monitoring(da_mon);
> +}
> +
>   /* Should be supplied by the monitor */
>   static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs env, u64 time_ns);
>   static bool ha_verify_constraint(struct ha_monitor *ha_mon,
> @@ -160,7 +189,7 @@ static inline void ha_monitor_reset_env(struct da_monitor *da_mon)
>   	struct ha_monitor *ha_mon = to_ha_monitor(da_mon);
>   
>   	/* Initialisation resets the monitor before initialising the timer */
> -	if (likely(da_monitoring(da_mon)))
> +	if (likely(!ha_monitor_uninitialized(da_mon)))
>   		ha_cancel_timer(ha_mon);
>   }
>   
> diff --git a/kernel/trace/rv/monitors/nomiss/nomiss.c b/kernel/trace/rv/monitors/nomiss/nomiss.c
> index 31f90f3638d8..8ead8783c29f 100644
> --- a/kernel/trace/rv/monitors/nomiss/nomiss.c
> +++ b/kernel/trace/rv/monitors/nomiss/nomiss.c
> @@ -227,7 +227,7 @@ static int enable_nomiss(void)
>   {
>   	int retval;
>   
> -	retval = da_monitor_init();
> +	retval = ha_monitor_init();
>   	if (retval)
>   		return retval;
>   
> @@ -263,7 +263,7 @@ static void disable_nomiss(void)
>   	rv_detach_trace_probe("nomiss", sched_switch, handle_sched_switch);
>   	rv_detach_trace_probe("nomiss", sched_wakeup, handle_sched_wakeup);
>   
> -	da_monitor_destroy();
> +	ha_monitor_destroy();
>   }
>   
>   static struct rv_monitor rv_this = {
> diff --git a/kernel/trace/rv/monitors/opid/opid.c b/kernel/trace/rv/monitors/opid/opid.c
> index 4594c7c46601..2922318c6112 100644
> --- a/kernel/trace/rv/monitors/opid/opid.c
> +++ b/kernel/trace/rv/monitors/opid/opid.c
> @@ -73,7 +73,7 @@ static int enable_opid(void)
>   {
>   	int retval;
>   
> -	retval = da_monitor_init();
> +	retval = ha_monitor_init();
>   	if (retval)
>   		return retval;
>   
> @@ -90,7 +90,7 @@ static void disable_opid(void)
>   	rv_detach_trace_probe("opid", sched_set_need_resched_tp, handle_sched_need_resched);
>   	rv_detach_trace_probe("opid", sched_waking, handle_sched_waking);
>   
> -	da_monitor_destroy();
> +	ha_monitor_destroy();
>   }
>   
>   /*
> diff --git a/kernel/trace/rv/monitors/stall/stall.c b/kernel/trace/rv/monitors/stall/stall.c
> index 9ccfda6b0e73..3c38fb1a0159 100644
> --- a/kernel/trace/rv/monitors/stall/stall.c
> +++ b/kernel/trace/rv/monitors/stall/stall.c
> @@ -103,7 +103,7 @@ static int enable_stall(void)
>   {
>   	int retval;
>   
> -	retval = da_monitor_init();
> +	retval = ha_monitor_init();
>   	if (retval)
>   		return retval;
>   
> @@ -120,7 +120,7 @@ static void disable_stall(void)
>   	rv_detach_trace_probe("stall", sched_switch, handle_sched_switch);
>   	rv_detach_trace_probe("stall", sched_wakeup, handle_sched_wakeup);
>   
> -	da_monitor_destroy();
> +	ha_monitor_destroy();
>   }
>   
>   static struct rv_monitor rv_this = {
> diff --git a/tools/verification/rvgen/rvgen/templates/dot2k/main.c b/tools/verification/rvgen/rvgen/templates/dot2k/main.c
> index bf0999f6657a..889446760e3c 100644
> --- a/tools/verification/rvgen/rvgen/templates/dot2k/main.c
> +++ b/tools/verification/rvgen/rvgen/templates/dot2k/main.c
> @@ -35,7 +35,7 @@ static int enable_%%MODEL_NAME%%(void)
>   {
>   	int retval;
>   
> -	retval = da_monitor_init();
> +	retval = %%MONITOR_CLASS%%_monitor_init();
>   	if (retval)
>   		return retval;
>   
> @@ -50,7 +50,7 @@ static void disable_%%MODEL_NAME%%(void)
>   
>   %%TRACEPOINT_DETACH%%
>   
> -	da_monitor_destroy();
> +	%%MONITOR_CLASS%%_monitor_destroy();
>   }
>   
>   /*

^ permalink raw reply

* Re: [PATCH 6/9] rv: Ensure synchronous cleanup for HA monitors
From: Wen Yang @ 2026-05-17  9:12 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Nam Cao,
	linux-trace-kernel
In-Reply-To: <20260512140250.262190-7-gmonaco@redhat.com>

The guard(rcu)() + synchronize_rcu() mechanism for HA timer callbacks
is correct.

One concern: TOCTOU between the pre-check and guard(rcu)().

da_monitor_reset() calls reset_hook BEFORE clearing monitoring:

   da_monitor_reset_hook(da_mon);        /* ha_cancel_timer [async]   */
   WRITE_ONCE(da_mon->monitoring, 0);    /* cleared AFTER reset_hook  */
   da_mon->curr_state = model_get_initial_state();

This may creates a window where the callback pre-check passes but the
monitor is reset before guard(rcu)() is acquired:

   /* __ha_monitor_timer_callback() */
   if (unlikely(!da_monitor_handling_event(&ha_mon->da_mon)))
       return;

   /* passes: monitoring=1
    *
    * WINDOW ─ CPU A runs da_monitor_reset_all() here:
    *   ha_cancel_timer()  [returns: callback is running, cannot cancel]
    *   WRITE_ONCE(monitoring, 0)
    *   curr_state = model_get_initial_state()
    */
   guard(rcu)();
   curr_state = READ_ONCE(ha_mon->da_mon.curr_state);  /* initial_state */
   /* no second da_monitoring() check */
   ha_react(curr_state, EVENT_NONE, env_string.buffer); /* spurious call */
   ha_trace_error_env(ha_mon, ...);                     /* fires 
unconditionally */

Result: spurious ha_trace_error_env() for initial_state.  For existing
monitors (stall/nomiss/opid), model_should_send_event_env(initial, NONE)
returns false, so no false-positive reaction, but the trace event fires.
Monitors where initial_state carries a constraint would produce a false
positive.

Proposed fix : re-check inside the RCU critical section:

   guard(rcu)();
   if (unlikely(!da_monitoring(&ha_mon->da_mon)))  /* re-check here */
       return;
   curr_state = READ_ONCE(ha_mon->da_mon.curr_state);


--
Best wishes,
Wen


On 5/12/26 22:02, Gabriele Monaco wrote:
> HA monitors may start timers, all cleanup functions currently stop the
> timers asynchronously to avoid sleeping in the wrong context.
> Nothing makes sure running callbacks terminate on cleanup.
> 
> Run the entire HA timer callback in an RCU read-side critical section,
> this way we can simply synchronize_rcu() with any pending timer and are
> sure any cleanup using kfree_rcu() runs after callbacks terminated.
> Additionally make sure any unlikely callback running late won't run any
> code if the monitor is marked as disabled.
> 
> Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> Fixes: 4a24127bd6cb ("rv: Add support for per-object monitors in DA/HA")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   include/rv/da_monitor.h | 23 +++++++++++++++++++----
>   include/rv/ha_monitor.h | 18 ++++++++++++++++--
>   2 files changed, 35 insertions(+), 6 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index a4a13b62d1a4..402d3b935c08 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -57,6 +57,15 @@ static struct rv_monitor rv_this;
>   #define da_monitor_reset_hook(da_mon)
>   #endif
>   
> +/*
> + * Hook to allow the implementation of hybrid automata: define it with a
> + * function that waits for the termination of all monitors background
> + * activities (e.g. all timers). This hook can sleep.
> + */
> +#ifndef da_monitor_sync_hook
> +#define da_monitor_sync_hook()
> +#endif
> +
>   /*
>    * Type for the target id, default to int but can be overridden.
>    * A long type can work as hash table key (PER_OBJ) but will be downgraded to
> @@ -179,6 +188,7 @@ static inline int da_monitor_init(void)
>   static inline void da_monitor_destroy(void)
>   {
>   	da_monitor_reset_all();
> +	da_monitor_sync_hook();
>   }
>   
>   #ifndef da_implicit_guard
> @@ -232,6 +242,7 @@ static inline int da_monitor_init(void)
>   static inline void da_monitor_destroy(void)
>   {
>   	da_monitor_reset_all();
> +	da_monitor_sync_hook();
>   }
>   
>   #ifndef da_implicit_guard
> @@ -319,6 +330,7 @@ static inline void da_monitor_destroy(void)
>   	}
>   
>   	da_monitor_reset_all();
> +	da_monitor_sync_hook();
>   
>   	rv_put_task_monitor_slot(task_mon_slot);
>   	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
> @@ -497,10 +509,9 @@ static void da_monitor_reset_all(void)
>   	struct da_monitor_storage *mon_storage;
>   	int bkt;
>   
> -	rcu_read_lock();
> +	guard(rcu)();
>   	hash_for_each_rcu(da_monitor_ht, bkt, mon_storage, node)
>   		da_monitor_reset(&mon_storage->rv.da_mon);
> -	rcu_read_unlock();
>   }
>   
>   static inline int da_monitor_init(void)
> @@ -516,13 +527,17 @@ static inline void da_monitor_destroy(void)
>   	int bkt;
>   
>   	tracepoint_synchronize_unregister();
> +	scoped_guard(rcu) {
> +		hash_for_each_rcu(da_monitor_ht, bkt, mon_storage, node) {
> +			da_monitor_reset_hook(&mon_storage->rv.da_mon);
> +		}
> +	}
> +	da_monitor_sync_hook();
>   	/*
>   	 * This function is called after all probes are disabled and no longer
>   	 * pending, we can safely assume no concurrent user.
>   	 */
> -	synchronize_rcu();
>   	hash_for_each_safe(da_monitor_ht, bkt, tmp, mon_storage, node) {
> -		da_monitor_reset_hook(&mon_storage->rv.da_mon);
>   		hash_del_rcu(&mon_storage->node);
>   		kfree(mon_storage);
>   	}
> diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
> index d59507e8cb30..47ff1a41febe 100644
> --- a/include/rv/ha_monitor.h
> +++ b/include/rv/ha_monitor.h
> @@ -36,6 +36,7 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
>   #define da_monitor_event_hook ha_monitor_handle_constraint
>   #define da_monitor_init_hook ha_monitor_init_env
>   #define da_monitor_reset_hook ha_monitor_reset_env
> +#define da_monitor_sync_hook() synchronize_rcu()
>   
>   #include <rv/da_monitor.h>
>   #include <linux/seq_buf.h>
> @@ -237,12 +238,25 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
>   	return false;
>   }
>   
> +/*
> + * __ha_monitor_timer_callback - generic callback representation
> + *
> + * This callback runs in an RCU read-side critical section to allow the
> + * destruction sequence to easily synchronize_rcu() with all pending timer
> + * after asynchronously disabling them.
> + */
>   static inline void __ha_monitor_timer_callback(struct ha_monitor *ha_mon)
>   {
> -	enum states curr_state = READ_ONCE(ha_mon->da_mon.curr_state);
>   	DECLARE_SEQ_BUF(env_string, ENV_BUFFER_SIZE);
> -	u64 time_ns = ha_get_ns();
> +	enum states curr_state;
> +	u64 time_ns;
> +
> +	if (unlikely(!da_monitor_handling_event(&ha_mon->da_mon)))
> +		return;
>   
> +	guard(rcu)();
> +	curr_state = READ_ONCE(ha_mon->da_mon.curr_state);
> +	time_ns = ha_get_ns();
>   	ha_get_env_string(&env_string, ha_mon, time_ns);
>   	ha_react(curr_state, EVENT_NONE, env_string.buffer);
>   	ha_trace_error_env(ha_mon, model_get_state_name(curr_state),

^ permalink raw reply

* Re: [PATCH 5/9] rv: Ensure all pending probes terminate on per-obj monitor destroy
From: Wen Yang @ 2026-05-17  9:01 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Nam Cao,
	linux-trace-kernel
In-Reply-To: <20260512140250.262190-6-gmonaco@redhat.com>


Correct for PER_OBJ: kfree() is called directly (not kfree_rcu()) after
hash_del_rcu(), so we must guarantee no concurrent reader holds a 
pointer to mon_storage.  tracepoint_synchronize_unregister() — which 
calls synchronize_rcu_tasks_trace() then synchronize_srcu  —
drains all in-flight probe handlers before the hash walk, making the
subsequent kfree() safe.


Reviewed-by: Wen Yang <wen.yang@linux.dev>



On 5/12/26 22:02, Gabriele Monaco wrote:
> The monitor disable/destroy sequence detaches all probes and resets the
> monitor's data, however it doesn't wait for pending probes. This is an
> issue with per-object monitors, which free the monitor storage.
> 
> Call tracepoint_synchronize_unregister() to make sure to wait for all
> pending probes before destroying the monitor storage.
> 
> Fixes: 4a24127bd6cb ("rv: Add support for per-object monitors in DA/HA")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   include/rv/da_monitor.h | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index a9fd284195ee..a4a13b62d1a4 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -515,9 +515,10 @@ static inline void da_monitor_destroy(void)
>   	struct hlist_node *tmp;
>   	int bkt;
>   
> +	tracepoint_synchronize_unregister();
>   	/*
> -	 * This function is called after all probes are disabled, we need only
> -	 * worry about concurrency against old events.
> +	 * This function is called after all probes are disabled and no longer
> +	 * pending, we can safely assume no concurrent user.
>   	 */
>   	synchronize_rcu();
>   	hash_for_each_safe(da_monitor_ht, bkt, tmp, mon_storage, node) {

^ permalink raw reply

* Re: [PATCH 4/9] rv: Prevent task migration while handling per-CPU events
From: Wen Yang @ 2026-05-17  8:57 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, linux-trace-kernel; +Cc: Nam Cao
In-Reply-To: <20260512140250.262190-5-gmonaco@redhat.com>


The da_implicit_guard() mechanism is a clean way to inject a monitor-
type-specific guard without cluttering the hot-path helpers.


Reviewed-by: Wen Yang <wen.yang@linux.dev>



On 5/12/26 22:02, Gabriele Monaco wrote:
> Tracepoint handlers are now fully preemptible. When a per-CPU monitor
> handles an event, it retrieves the monitor state using a per-CPU
> pointer. If the event itself doesn't disable preemption, the task can
> migrate to a different CPU and we risk updating the wrong monitor.
> 
> Mitigate this by explicitly disabling task migration before acquiring
> the monitor pointer. This cannot guarantee the monitor runs on the
> correct CPU but reduces the race condition window and prevents warnings.
> 
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   include/rv/da_monitor.h | 11 +++++++++++
>   1 file changed, 11 insertions(+)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index 0b7028df08fb..a9fd284195ee 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -181,6 +181,10 @@ static inline void da_monitor_destroy(void)
>   	da_monitor_reset_all();
>   }
>   
> +#ifndef da_implicit_guard
> +#define da_implicit_guard()
> +#endif
> +
>   #elif RV_MON_TYPE == RV_MON_PER_CPU
>   /*
>    * Functions to define, init and get a per-cpu monitor.
> @@ -230,6 +234,10 @@ static inline void da_monitor_destroy(void)
>   	da_monitor_reset_all();
>   }
>   
> +#ifndef da_implicit_guard
> +#define da_implicit_guard() guard(migrate)()
> +#endif
> +
>   #elif RV_MON_TYPE == RV_MON_PER_TASK
>   /*
>    * Functions to define, init and get a per-task monitor.
> @@ -677,6 +685,7 @@ static inline bool __da_handle_start_run_event(struct da_monitor *da_mon,
>    */
>   static inline void da_handle_event(enum events event)
>   {
> +	da_implicit_guard();
>   	__da_handle_event(da_get_monitor(), event, 0);
>   }
>   
> @@ -692,6 +701,7 @@ static inline void da_handle_event(enum events event)
>    */
>   static inline bool da_handle_start_event(enum events event)
>   {
> +	da_implicit_guard();
>   	return __da_handle_start_event(da_get_monitor(), event, 0);
>   }
>   
> @@ -703,6 +713,7 @@ static inline bool da_handle_start_event(enum events event)
>    */
>   static inline bool da_handle_start_run_event(enum events event)
>   {
> +	da_implicit_guard();
>   	return __da_handle_start_run_event(da_get_monitor(), event, 0);
>   }
>   

^ permalink raw reply

* Re: [PATCH 3/9] rv: Reset per-task DA monitors before releasing the slot
From: Wen Yang @ 2026-05-17  8:55 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Nam Cao,
	linux-trace-kernel
In-Reply-To: <20260512140250.262190-4-gmonaco@redhat.com>


The fix is correct: task_mon_slot = RV_PER_TASK_MONITOR_INIT
equals CONFIG_RV_PER_TASK_MONITORS, which is one past the end of rv[],
so calling da_monitor_reset_all() after rv_put_task_monitor_slot()
would write into whatever memory follows task_struct.rv[] — which is
randomised and can get quite nasty, as you noted in the review thread.

Overlap note: . 
https://lore.kernel.org/all/f654a17c671469fd8fc9ea438daf2266d05068d4.camel@redhat.com/

We will coordinate to avoid redundancy;
we are happy to defer to your version here.


Reviewed-by: Wen Yang <wen.yang@linux.dev>



On 5/12/26 22:02, Gabriele Monaco wrote:
> Per-task monitors use task_mon_slot to determine which slot in the array
> to use for the monitor. During destruction, this slot is returned but
> this is done before resetting the monitor. As a result, the monitor's
> reset is in fact resetting a slot that is outside of the array
> (RV_PER_TASK_MONITOR_INIT).
> 
> Release the slot only after the reset to avoid out-of-bound memory
> access.
> 
> Fixes: 30984ccf31b7f ("rv: Refactor da_monitor to minimise macros")
> Fixes: 792575348ff70 ("rv/include: Add deterministic automata monitor definition via C macros")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   include/rv/da_monitor.h | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index 250888812125..0b7028df08fb 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -309,10 +309,11 @@ static inline void da_monitor_destroy(void)
>   		WARN_ONCE(1, "Disabling a disabled monitor: " __stringify(MONITOR_NAME));
>   		return;
>   	}
> -	rv_put_task_monitor_slot(task_mon_slot);
> -	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
>   
>   	da_monitor_reset_all();
> +
> +	rv_put_task_monitor_slot(task_mon_slot);
> +	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
>   }
>   
>   #elif RV_MON_TYPE == RV_MON_PER_OBJ

^ permalink raw reply

* Re: [PATCH 2/9] rv: Fix read_lock scope in per-task DA cleanup
From: Wen Yang @ 2026-05-17  8:51 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Nam Cao,
	linux-trace-kernel
In-Reply-To: <20260512140250.262190-3-gmonaco@redhat.com>


Correct.  tasklist_lock is only needed to guard 
for_each_process_thread(); idle tasks are bound to their CPU and are not 
subject to exit/reparent, so no lock is needed for for_each_present_cpu().


Reviewed-by: Wen Yang <wen.yang@linux.dev>


On 5/12/26 22:02, Gabriele Monaco wrote:
> The da_monitor_reset_all() function for per-task monitors takes
> tasklist_lock while iterating over tasks, then keeps it also while
> iterating over idle tasks (one per CPU). The latter is not necessary
> since the lock needs to guard only for_each_process_thread().
> 
> Use a scoped_guard for more compact syntax and adjust the scope only
> where the lock is necessary.
> 
> Fixes: 30984ccf31b7f ("rv: Refactor da_monitor to minimise macros")
> Fixes: 8259cb14a7068 ("rv: Reset per-task monitors also for idle tasks")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   include/rv/da_monitor.h | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> index 39765ff6f098..250888812125 100644
> --- a/include/rv/da_monitor.h
> +++ b/include/rv/da_monitor.h
> @@ -272,12 +272,12 @@ static void da_monitor_reset_all(void)
>   	struct task_struct *g, *p;
>   	int cpu;
>   
> -	read_lock(&tasklist_lock);
> -	for_each_process_thread(g, p)
> -		da_monitor_reset(da_get_monitor(p));
> +	scoped_guard(read_lock, &tasklist_lock) {
> +		for_each_process_thread(g, p)
> +			da_monitor_reset(da_get_monitor(p));
> +	}
>   	for_each_present_cpu(cpu)
>   		da_monitor_reset(da_get_monitor(idle_task(cpu)));
> -	read_unlock(&tasklist_lock);
>   }
>   
>   /*

^ permalink raw reply

* Re: [PATCH 1/9] rv: Fix __user specifier usage in extract_params()
From: Wen Yang @ 2026-05-17  8:48 UTC (permalink / raw)
  To: Gabriele Monaco, linux-kernel, Steven Rostedt, Masami Hiramatsu,
	Nam Cao, linux-trace-kernel
  Cc: kernel test robot
In-Reply-To: <20260512140250.262190-2-gmonaco@redhat.com>


Correct.  __user annotates the pointer type, not the stack copy.

Reviewed-by: Wen Yang <wen.yang@linux.dev>


On 5/12/26 22:02, Gabriele Monaco wrote:
> The attributes variables extracted from syscalls in the helper are both
> defined with the __user specifier although only the actual pointer to
> user data should be marked.
> 
> Remove the __user specifier from attr.
> 
> Reported-by: kernel test robot <lkp@intel.com>
> Closes: https://lore.kernel.org/oe-kbuild-all/202604150820.Ny143u6X-lkp@intel.com
> Fixes: b133207deb72 ("rv: Add nomiss deadline monitor")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   kernel/trace/rv/monitors/deadline/deadline.h | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/rv/monitors/deadline/deadline.h b/kernel/trace/rv/monitors/deadline/deadline.h
> index 0bbfd2543329..78fca873d61e 100644
> --- a/kernel/trace/rv/monitors/deadline/deadline.h
> +++ b/kernel/trace/rv/monitors/deadline/deadline.h
> @@ -95,7 +95,8 @@ static inline u8 get_server_type(struct task_struct *tsk)
>   static inline int extract_params(struct pt_regs *regs, long id, pid_t *pid_out)
>   {
>   	size_t size = offsetofend(struct sched_attr, sched_flags);
> -	struct sched_attr __user *uattr, attr;
> +	struct sched_attr __user *uattr;
> +	struct sched_attr attr;
>   	int new_policy = -1, ret;
>   	unsigned long args[6];
>   

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox