Linux Trace Kernel


* Re: [PATCH v8] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-06-02  0:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers,
	Mark Rutland, Peter Zijlstra, Namhyung Kim, Takaya Saeki,
	Douglas Raillard, Tom Zanussi, Andrew Morton, Thomas Gleixner,
	Ian Rogers, Jiri Olsa
In-Reply-To: <20260601133129.4a1e9dec@gandalf.local.home>

On Mon, 1 Jun 2026 13:31:29 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Mon, 1 Jun 2026 13:07:46 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > Changes since v7: https://patch.msgid.link/20260529110442.0967a64c@fedora
> > 
> > - Add error message in parse_btf_args() for failed parsing of TEVENT.
> >   (Sashiko)
> > 
> > - Remove TPARG_FL_TYPECAST and just use ctx->struct_btf instead.
> >   The flag was redundant and added unnecessary complexity.
> > 
> > - Restructure to keep the lifetime of the TYPECAST to the end of
> >   traceprobe_parse_probe_arg_body(). This allows the last_type to stay
> >   around in case there's not a type parameter and then btf can still be
> >   used.
> >   (Sashiko and Masami Hiramatsu)
> 
> And I rebased onto probes/for-next
> 

Thanks, but it seems Sashiko failed to apply it (because it is using the
linux-trace/HEAD branch?). Hmm, we may always need a "base-id" tag.

Thanks,

> -- Steve


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [PATCH v7] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-06-02  0:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux trace kernel, Mathieu Desnoyers, Mark Rutland,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Douglas Raillard,
	Tom Zanussi, Andrew Morton, Thomas Gleixner, Ian Rogers,
	Jiri Olsa
In-Reply-To: <20260601122126.5ebbd7e7@fedora>

On Mon, 1 Jun 2026 12:21:26 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Sun, 31 May 2026 10:14:58 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> 
> > > > Does this prematurely release the BTF struct reference?
> > > > If TPARG_FL_TYPECAST is unset here and ctx->struct_btf is put, won't
> > > > later steps in traceprobe_parse_probe_arg_body() (like
> > > > find_fetch_type_from_btf_type()) fail to properly infer struct field sizes?
> > > > When ctx_btf(ctx) is called later without TPARG_FL_TYPECAST set, it
> > > > will evaluate to ctx->btf (which is NULL for eprobes).
> > > > Could this potentially lead to silent defaults, such as 64-bit reads for
> > > > smaller fields, or fail to inject pointer dereferences for string fields,
> > > > while also leaving ctx->last_type pointing to a prematurely released BTF
> > > > object?  
> > > 
> > > Does this mean we need to set ctx->last_type to NULL here too?  
> > 
> > No, since the member we refer can be different from unsigned long.
> > When we don't have ":type" suffix, we use BTF type information to
> > decide appropriate type.
> > 
> > > 
> > > Because everything above is pretty much the expected behavior. The put is
> > > *not* premature. The last_struct and struct_btf are both set to NULL. I
> > > guess the only thing missing is to reset last_type as well.  
> > 
> > No, as I explained, the last_type is used to determine the member type
> > when user does not specify the ":type" suffix.
> > 
> > So, what we need to do is deferring the btf_put(struct_btf) as below:
> > (no build test yet.)
> 
> OK, but I don't think we want the struct_btf to exist beyond a single
> arg like the btf descriptor does. How about this (on top of this change),
> where it clears the struct_btf at the end of traceprobe_parse_probe_arg_body()?
> 
> Also, I see the flag as being redundant and use the existence of
> struct_btf to denote that it's parsing a typedef struct.

Ah, indeed. OK, let me check the v8 patch.

Thanks!

> 
> -- Steve
> 
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index 9246e9c3d066..56b7dc406ca1 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -397,8 +397,7 @@ static bool btf_type_is_char_array(struct btf *btf, const struct btf_type *type)
>  
>  static struct btf *ctx_btf(struct traceprobe_parse_context *ctx)
>  {
> -	return ctx->flags & TPARG_FL_TYPECAST ?
> -		ctx->struct_btf : ctx->btf;
> +	return ctx->struct_btf ? : ctx->btf;
>  }
>  
>  static int check_prepare_btf_string_fetch(char *typename,
> @@ -531,6 +530,15 @@ static int query_btf_context(struct traceprobe_parse_context *ctx)
>  	return 0;
>  }
>  
> +static void clear_struct_btf(struct traceprobe_parse_context *ctx)
> +{
> +	if (ctx->struct_btf) {
> +		btf_put(ctx->struct_btf);
> +		ctx->struct_btf = NULL;
> +		ctx->last_struct = NULL;
> +	}
> +}
> +
>  static void clear_btf_context(struct traceprobe_parse_context *ctx)
>  {
>  	if (ctx->btf) {
> @@ -579,7 +587,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
>  	struct fetch_insn *code = *pcode;
>  	const struct btf_member *field;
>  	u32 bitoffs, anon_offs;
> -	bool is_struct = ctx->flags & TPARG_FL_TYPECAST;
> +	bool is_struct = ctx->struct_btf != NULL;
>  	struct btf *btf = ctx_btf(ctx);
>  	char *next;
>  	int is_ptr;
> @@ -690,7 +698,7 @@ static int parse_btf_arg(char *varname,
>  		ret = parse_trace_event(varname, code, ctx);
>  		if (ret < 0)
>  			return ret;
> -		if (WARN_ON_ONCE(!(ctx->flags & TPARG_FL_TYPECAST)))
> +		if (WARN_ON_ONCE(ctx->struct_btf == NULL))
>  			return -EINVAL;
>  		type = ctx->last_struct;
>  		goto found_type;
> @@ -804,21 +812,19 @@ static int parse_btf_bitfield(struct fetch_insn **pcode,
>  
>  static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
>  {
> +	struct btf *btf = NULL;
>  	int id;
>  
> -	if (!ctx->struct_btf) {
> -		struct btf *btf;
> -
> -		id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
> -		if (id < 0)
> -			return id;
> -		ctx->struct_btf = btf;
> -	} else {
> -		id = btf_find_by_name_kind(ctx->struct_btf, sname, BTF_KIND_STRUCT);
> -		if (id < 0)
> -			return id;
> +	/* Could be a for a structure in a different module */
> +	if (ctx->struct_btf) {
> +		btf_put(ctx->struct_btf);
> +		ctx->struct_btf = NULL;
>  	}
>  
> +	id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
> +	if (id < 0)
> +		return id;
> +	ctx->struct_btf = btf;
>  	ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
>  	return 0;
>  }
> @@ -848,25 +854,23 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
>  
>  	if (ret < 0) {
>  		trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
> -		ret = -EINVAL;
> -		goto out_put;
> +		return -EINVAL;
>  	}
>  
> -	ctx->flags |= TPARG_FL_TYPECAST;
>  	tmp++;
>  
>  	ctx->offset += tmp - arg;
>  	ret = parse_btf_arg(tmp, pcode, end, ctx);
> -	ctx->flags &= ~TPARG_FL_TYPECAST;
> -	ctx->last_struct = NULL;
> -out_put:
> -	btf_put(ctx->struct_btf);
> -	ctx->struct_btf = NULL;
>  	return ret;
>  }
>  
>  #else /* !CONFIG_PROBE_EVENTS_BTF_ARGS */
>  
> +static void clear_struct_btf(struct traceprobe_parse_context *ctx)
> +{
> +	ctx->struct_btf = NULL;
> +}
> +
>  static void clear_btf_context(struct traceprobe_parse_context *ctx)
>  {
>  	ctx->btf = NULL;
> @@ -1673,6 +1677,9 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
>  	}
>  	kfree(tmp);
>  
> +	/* struct_btf should not be passed to other arguments */
> +	clear_struct_btf(ctx);
> +
>  	return ret;
>  }
>  
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 952e3d7582b8..83565f1634db 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -394,7 +394,6 @@ static inline int traceprobe_get_entry_data_size(struct trace_probe *tp)
>   * TPARG_FL_KERNEL and TPARG_FL_USER are also mutually exclusive.
>   * TPARG_FL_FPROBE and TPARG_FL_TPOINT are optional but it should be with
>   * TPARG_FL_KERNEL.
> - * TPARG_FL_TYPECAST is set if an argument was typecast to a structure.
>   */
>  #define TPARG_FL_RETURN BIT(0)
>  #define TPARG_FL_KERNEL BIT(1)
> @@ -403,7 +402,6 @@ static inline int traceprobe_get_entry_data_size(struct trace_probe *tp)
>  #define TPARG_FL_USER   BIT(4)
>  #define TPARG_FL_FPROBE BIT(5)
>  #define TPARG_FL_TPOINT BIT(6)
> -#define TPARG_FL_TYPECAST BIT(7)
>  #define TPARG_FL_LOC_MASK	GENMASK(4, 0)
>  
>  static inline bool tparg_is_function_entry(unsigned int flags)


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>
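
For readers following the lifetime discussion above, here is a minimal
userspace sketch of the idea in the quoted diff: the reference taken by a
typecast is kept until the end of the per-argument parse, so type inference
without a ":type" suffix can still see it, and it is then dropped so it
never leaks into the next argument. Every toy_* name is hypothetical and
only models the behavior; none of this is the kernel's BTF API.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a refcounted struct btf; not the kernel type. */
struct toy_btf {
	int refcnt;
};

/* Toy parse context holding only the fields discussed in the thread. */
struct toy_ctx {
	struct toy_btf *struct_btf;	/* reference taken by a typecast */
	const char *last_struct;	/* struct found through struct_btf */
};

static void toy_btf_put(struct toy_btf *btf)
{
	assert(btf->refcnt > 0);
	btf->refcnt--;
}

/* Drop the typecast reference, if any. Safe to call more than once. */
static void toy_clear_struct_btf(struct toy_ctx *ctx)
{
	if (ctx->struct_btf) {
		toy_btf_put(ctx->struct_btf);
		ctx->struct_btf = NULL;
		ctx->last_struct = NULL;
	}
}

/* A new typecast replaces any earlier reference instead of reusing it. */
static void toy_query_struct(struct toy_ctx *ctx, struct toy_btf *btf,
			     const char *sname)
{
	if (ctx->struct_btf) {
		toy_btf_put(ctx->struct_btf);
		ctx->struct_btf = NULL;
	}
	btf->refcnt++;		/* the lookup hands back a new reference */
	ctx->struct_btf = btf;
	ctx->last_struct = sname;
}

/*
 * Parse one argument. Returns 1 if the typecast BTF was still visible at
 * the point where the member type would be inferred, 0 otherwise.
 */
static int toy_parse_arg_body(struct toy_ctx *ctx, struct toy_btf *btf,
			      const char *sname)
{
	int visible;

	toy_query_struct(ctx, btf, sname);
	/* Type inference without a ":type" suffix would run here. */
	visible = ctx->struct_btf != NULL;
	/* struct_btf must not be passed on to other arguments. */
	toy_clear_struct_btf(ctx);
	return visible;
}
```

The point of the sketch is only the ordering: the put happens once, at the
end of the argument body, and the clear helper is idempotent so error paths
can call it without tracking whether a typecast was seen.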


* [PATCH] tracing/events: Expand ring buffer for in-kernel event enables
From: Manjunath Patil @ 2026-06-01 23:24 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel,
	Manjunath Patil

Ftrace keeps trace arrays at a boot-minimum ring-buffer size until
tracing is used. Tracefs event-enable paths already call
tracing_update_buffers() before enabling events, but the exported
in-kernel helpers trace_set_clr_event() and trace_array_set_clr_event()
directly enable events through __ftrace_set_clr_event().

This can leave events enabled by in-kernel users recording into the tiny
boot-minimum buffer instead of the configured default-sized buffer. Any
caller that enables events through these exported helpers observes
different buffer-expansion behavior than a userspace tracefs event enable.

Expand the relevant trace array before enabling events through the
exported in-kernel helpers, matching the tracefs event-enable behavior.
Disabling events remains unchanged.

Assisted-by: Codex:gpt-5
Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>
---
 kernel/trace/trace_events.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index c46e623e7e0d..3ce5b0121c5c 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1479,10 +1479,22 @@ int ftrace_set_clr_event(struct trace_array *tr, char *buf, int set)
 int trace_set_clr_event(const char *system, const char *event, int set)
 {
 	struct trace_array *tr = top_trace_array();
+	int ret;
 
 	if (!tr)
 		return -ENODEV;
 
+	/*
+	 * Keep in-kernel event enabling consistent with tracefs event
+	 * enabling: once an event is being enabled, expand the boot-minimum
+	 * ring buffer to the configured default size before records arrive.
+	 */
+	if (set) {
+		ret = tracing_update_buffers(tr);
+		if (ret < 0)
+			return ret;
+	}
+
 	return __ftrace_set_clr_event(tr, NULL, system, event, set, NULL);
 }
 EXPORT_SYMBOL_GPL(trace_set_clr_event);
@@ -1504,11 +1516,24 @@ int trace_array_set_clr_event(struct trace_array *tr, const char *system,
 		const char *event, bool enable)
 {
 	int set;
+	int ret;
 
 	if (!tr)
 		return -ENOENT;
 
 	set = (enable == true) ? 1 : 0;
+
+	/*
+	 * Keep in-kernel event enabling consistent with tracefs event
+	 * enabling: once an event is being enabled, expand the boot-minimum
+	 * ring buffer to the configured default size before records arrive.
+	 */
+	if (set) {
+		ret = tracing_update_buffers(tr);
+		if (ret < 0)
+			return ret;
+	}
+
 	return __ftrace_set_clr_event(tr, NULL, system, event, set, NULL);
 }
 EXPORT_SYMBOL_GPL(trace_array_set_clr_event);

base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
-- 
2.47.3
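
As a rough illustration of the behavior this patch aims for, the toy model
below expands a buffer from a boot-minimum size to a default size the first
time an event is enabled, and leaves the disable path alone. The toy_*
names and both size constants are made up for the example; the real code
calls tracing_update_buffers() as shown in the diff above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder sizes for the example only; not the kernel's values. */
#define TOY_BOOT_MIN_SIZE	4096u
#define TOY_DEFAULT_SIZE	(1024u * 1024u)

/* Toy stand-in for a trace array. */
struct toy_trace_array {
	size_t buf_size;
	bool expanded;
	int enabled_events;
};

/* Expand once; later calls are no-ops. */
static int toy_update_buffers(struct toy_trace_array *tr)
{
	if (!tr->expanded) {
		tr->buf_size = TOY_DEFAULT_SIZE;
		tr->expanded = true;
	}
	return 0;
}

/* Enable (set != 0) or disable (set == 0) one event. */
static int toy_set_clr_event(struct toy_trace_array *tr, int set)
{
	int ret;

	if (!tr)
		return -ENODEV;

	if (set) {
		/* Expand before the event can start producing records. */
		ret = toy_update_buffers(tr);
		if (ret < 0)
			return ret;
		tr->enabled_events++;
	} else if (tr->enabled_events > 0) {
		/* Disabling never touches the buffer size. */
		tr->enabled_events--;
	}
	return 0;
}
```

Doing the expansion before the enable, rather than after, means no record
can land in the small buffer between the two steps.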



* Re: [PATCH v7 09/42] KVM: guest_memfd: Add base support for KVM_SET_MEMORY_ATTRIBUTES2
From: Michael Roth @ 2026-06-01 23:14 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-9-2f0fae496530@google.com>

On Fri, May 22, 2026 at 05:17:51PM -0700, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> Introduce base support for KVM_SET_MEMORY_ATTRIBUTES2 in guest_memfd, which
> just updates attributes tracked by guest_memfd.
> 
> Validate input fields in general. Guard usage of KVM_SET_MEMORY_ATTRIBUTES2
> by making sure requested attributes are supported for this instance of kvm.
> 
> A new KVM_SET_MEMORY_ATTRIBUTES2 is defined to support writes (unlike
> KVM_SET_MEMORY_ATTRIBUTES) in addition to reads so it can provide error
> details to userspace. This will be used in a later patch.
> 
> The two ioctls use their corresponding structs with no overlap, but
> backward compatibility is baked in for future support of
> KVM_SET_MEMORY_ATTRIBUTES2 and struct kvm_memory_attributes2 in the VM
> ioctl.
> 
> The process of setting memory attributes is set up such that the later half
> will not fail due to allocation. Any necessary checks are performed before
> the point of no return.
> 
> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> Co-developed-by: Sean Christoperson <seanjc@google.com>
> Signed-off-by: Sean Christoperson <seanjc@google.com>

Typo on the "person".

(Sent this earlier but looks like some of my emails never hit the
list so re-sending. Apologies if this is a dupe).

Thanks,

Mike

> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---


* [PATCH 2/2] rtla/timerlat: Add tests for option parsing with attached arguments
From: John Kacur @ 2026-06-01 21:15 UTC (permalink / raw)
  To: Steven Rostedt, Tomas Glozar, linux-trace-kernel
  Cc: Costa Shulyupin, Wander Lairson Costa, Crystal Wood,
	Luis Claudio R . Goncalves, linux-kernel
In-Reply-To: <20260601211538.381649-1-jkacur@redhat.com>

Add tests to verify that numeric arguments work correctly with both
attached and detached formats:
  -p 100        (short with space)
  -p100         (short without space)
  --period=100  (long with =)
  --period 100  (long with space)

These tests prevent regression of the bug fixed in commit eefa8af46ff7
("rtla/timerlat: Fix parsing of short options with attached arguments")
where -p100 was incorrectly parsed as multiple separate options.

The tests verify that:
1. All four argument formats succeed (exit code 0)
2. None trigger the "no-irq and no-thread" error that occurred when
   the bug was present

Assisted-by: Claude:claude-sonnet-4-5
Signed-off-by: John Kacur <jkacur@redhat.com>
---
 tools/tracing/rtla/tests/timerlat.t | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/tools/tracing/rtla/tests/timerlat.t b/tools/tracing/rtla/tests/timerlat.t
index fd4935fd7b49..1a63301f5d70 100644
--- a/tools/tracing/rtla/tests/timerlat.t
+++ b/tools/tracing/rtla/tests/timerlat.t
@@ -42,6 +42,16 @@ check "verify -c/--cpus" \
 check "hist test in nanoseconds" \
 	"timerlat hist -i 2 -c 0 -n -d 10s" 2 "ns"
 
+# Option parsing tests - verify attached numeric arguments work correctly
+check "verify -p with space" \
+	"timerlat hist -p 100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+check "verify -p without space (attached argument)" \
+	"timerlat hist -p100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+check "verify --period with equals" \
+	"timerlat hist --period=100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+check "verify --period with space" \
+	"timerlat hist --period 100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+
 # Actions tests
 check "trace output through -t" \
 	"timerlat hist -T 2 -t" 2 "^  Saving trace to timerlat_trace.txt$"
-- 
2.54.0



* [PATCH 1/2] rtla/timerlat: Fix parsing of short options with attached arguments
From: John Kacur @ 2026-06-01 21:15 UTC (permalink / raw)
  To: Steven Rostedt, Tomas Glozar, linux-trace-kernel
  Cc: Costa Shulyupin, Wander Lairson Costa, Crystal Wood,
	Luis Claudio R . Goncalves, linux-kernel

The timerlat hist command fails to parse short options with attached
numeric arguments (e.g., -p100) due to conflicts between digit characters
used as option values and numeric arguments to other options.

This issue was discovered when testing rtla 7.1.0-rc6 with rteval,
which passes arguments in the compact -p100 format. The rteval tests
failed with the confusing error "no-irq and no-thread set, there is
nothing to do here" even though neither option was specified.

The root cause is two-fold:

1. Digit characters ('0'-'9') were used as short option values for
   long-only options like --no-irq, --no-thread, etc. This caused
   getopt_auto() to generate an option string like 'a:b:...:u0123456:7:8:9'
   which made getopt treat digits as valid option characters.

2. The two-phase option parsing approach (alternating calls between
   common_parse_options() and local option parsing) confused getopt's
   internal state when encountering arguments like -p100.

When a user passed -p100, getopt would incorrectly parse it as four
separate options: -p, -1, -0, and -0, silently setting the no_irq and
no_thread flags instead of recognizing "100" as the period argument.

The two-phase parsing was introduced in commit 850cd24cb6d6 ("tools/rtla:
Add common_parse_options()") which first appeared in v7.0-rc1. Prior to
that commit, -p100 worked correctly. The digit characters as option
values existed since the original timerlat implementation, but only
became problematic when combined with the two-phase parsing approach.

Fix this by:

1. Eliminating digit characters from the option string by filtering them
   out in getopt_auto(). This prevents conflicts with numeric arguments.

2. Refactoring timerlat_hist_parse_args() to use single-pass option
   parsing. Instead of alternating between common_parse_options() and
   local parsing, merge all options (common and local) into a single
   option table and parse them in one pass. This matches the approach
   used by cyclictest and other tools.

With these changes, all argument formats work correctly:
  -p 100        (short with space)
  -p100         (short without space)
  --period=100  (long with =)
  --period 100  (long with space)

This maintains compatibility with existing usage while enabling the
compact -p100 format that users expect from similar tools.

Assisted-by: Claude:claude-sonnet-4-5
Signed-off-by: John Kacur <jkacur@redhat.com>
---
 tools/tracing/rtla/src/common.c        |  4 ++
 tools/tracing/rtla/src/timerlat_hist.c | 55 ++++++++++++++++++++++++--
 2 files changed, 56 insertions(+), 3 deletions(-)

diff --git a/tools/tracing/rtla/src/common.c b/tools/tracing/rtla/src/common.c
index 35e3d3aa922e..c2fd051c562c 100644
--- a/tools/tracing/rtla/src/common.c
+++ b/tools/tracing/rtla/src/common.c
@@ -65,6 +65,10 @@ int getopt_auto(int argc, char **argv, const struct option *long_opts)
 		if (long_opts[i].val < 32 || long_opts[i].val > 127)
 			continue;

+		/* Skip digit characters to avoid conflicts with numeric arguments */
+		if (long_opts[i].val >= '0' && long_opts[i].val <= '9')
+			continue;
+
 		if (n + 4 >= sizeof(opts))
 			fatal("optstring buffer overflow");

diff --git a/tools/tracing/rtla/src/timerlat_hist.c b/tools/tracing/rtla/src/timerlat_hist.c
index 79142af4f566..c0b6d7c30114 100644
--- a/tools/tracing/rtla/src/timerlat_hist.c
+++ b/tools/tracing/rtla/src/timerlat_hist.c
@@ -787,11 +787,24 @@ static struct common_params
 		static struct option long_options[] = {
 			{"auto",		required_argument,	0, 'a'},
 			{"bucket-size",		required_argument,	0, 'b'},
+			/* Common options */
+			{"cpus",		required_argument,	0, 'c'},
+			{"cgroup",		optional_argument,	0, 'C'},
+			{"debug",		no_argument,		0, 'D'},
+			{"duration",		required_argument,	0, 'd'},
+			{"event",		required_argument,	0, 'e'},
+			/* End common options */
 			{"entries",		required_argument,	0, 'E'},
 			{"help",		no_argument,		0, 'h'},
+			/* Common option */
+			{"house-keeping",	required_argument,	0, 'H'},
+			/* End common option */
 			{"irq",			required_argument,	0, 'i'},
 			{"nano",		no_argument,		0, 'n'},
 			{"period",		required_argument,	0, 'p'},
+			/* Common option */
+			{"priority",		required_argument,	0, 'P'},
+			/* End common option */
 			{"stack",		required_argument,	0, 's'},
 			{"thread",		required_argument,	0, 'T'},
 			{"trace",		optional_argument,	0, 't'},
@@ -819,9 +832,6 @@ static struct common_params
 			{0, 0, 0, 0}
 		};

-		if (common_parse_options(argc, argv, &params->common))
-			continue;
-
 		c = getopt_auto(argc, argv, long_options);

 		/* detect the end of the options. */
@@ -850,6 +860,35 @@ static struct common_params
 			    params->common.hist.bucket_size >= 1000000)
 				fatal("Bucket size needs to be > 0 and <= 1000000");
 			break;
+		case 'c':
+			if (parse_cpu_set(optarg, &params->common.monitored_cpus))
+				fatal("Invalid -c cpu list");
+			params->common.cpus = optarg;
+			break;
+		case 'C':
+			params->common.cgroup = 1;
+			params->common.cgroup_name = parse_optional_arg(argc, argv);
+			break;
+		case 'D':
+			config_debug = 1;
+			break;
+		case 'd':
+			params->common.duration = parse_seconds_duration(optarg);
+			if (!params->common.duration)
+				fatal("Invalid -d duration");
+			break;
+		case 'e':
+			{
+				struct trace_events *tevent;
+				tevent = trace_event_alloc(optarg);
+				if (!tevent)
+					fatal("Error alloc trace event");
+
+				if (params->common.events)
+					tevent->next = params->common.events;
+				params->common.events = tevent;
+			}
+			break;
 		case 'E':
 			params->common.hist.entries = get_llong_from_str(optarg);
 			if (params->common.hist.entries < 10 ||
@@ -860,6 +899,11 @@ static struct common_params
 		case '?':
 			timerlat_hist_usage();
 			break;
+		case 'H':
+			params->common.hk_cpus = 1;
+			if (parse_cpu_set(optarg, &params->common.hk_cpu_set))
+				fatal("Error parsing house keeping CPUs");
+			break;
 		case 'i':
 			params->common.stop_us = get_llong_from_str(optarg);
 			break;
@@ -874,6 +918,11 @@ static struct common_params
 			if (params->timerlat_period_us > 1000000)
 				fatal("Period longer than 1 s");
 			break;
+		case 'P':
+			if (parse_prio(optarg, &params->common.sched_param) == -1)
+				fatal("Invalid -P priority");
+			params->common.set_sched = 1;
+			break;
 		case 's':
 			params->print_stack = get_llong_from_str(optarg);
 			break;
-- 
2.54.0
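
To make the two parts of the fix concrete, here is a small standalone
sketch, not the rtla code itself. build_optstring() is a hypothetical
stand-in for getopt_auto()'s option-string generation that skips digit
values, and parse_period() parses everything in a single getopt_long()
pass. The mapping of '0' and '1' to --no-irq and --no-thread is assumed
for the example.

```c
#include <getopt.h>
#include <stddef.h>
#include <stdlib.h>

/* Example table; the digit values are long-only options (assumed mapping). */
static const struct option toy_long_opts[] = {
	{"period",	required_argument,	0, 'p'},
	{"trace",	optional_argument,	0, 't'},
	{"no-irq",	no_argument,		0, '0'},
	{"no-thread",	no_argument,		0, '1'},
	{0, 0, 0, 0}
};

/* Build a short-option string from a long-option table, skipping digits. */
static int build_optstring(const struct option *long_opts, char *buf,
			   size_t len)
{
	size_t n = 0;

	for (int i = 0; long_opts[i].name; i++) {
		int val = long_opts[i].val;

		/* Not usable as a short option character. */
		if (val < 32 || val > 127)
			continue;
		/* Digits stay long-only so "-p100" cannot be split up. */
		if (val >= '0' && val <= '9')
			continue;
		if (n + 4 >= len)
			return -1;

		buf[n++] = (char)val;
		if (long_opts[i].has_arg == required_argument) {
			buf[n++] = ':';
		} else if (long_opts[i].has_arg == optional_argument) {
			buf[n++] = ':';
			buf[n++] = ':';
		}
	}
	buf[n] = '\0';
	return 0;
}

/* Single-pass parse; returns the period, or -1 if none was given. */
static long parse_period(int argc, char **argv)
{
	char opts[64];
	long period = -1;
	int c;

	if (build_optstring(toy_long_opts, opts, sizeof(opts)))
		return -1;

	optind = 1;	/* restart scanning for a new argument vector */
	while ((c = getopt_long(argc, argv, opts, toy_long_opts, NULL)) != -1) {
		if (c == 'p')
			period = strtol(optarg, NULL, 10);
	}
	return period;
}
```

With the digits filtered out, the generated string for this table is
"p:t::", and --no-irq and --no-thread remain reachable through their long
names only.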


* Re: [PATCH v2] unwind: Add sframe_(un)register() system calls
From: Andrii Nakryiko @ 2026-06-01 17:57 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux Trace Kernel, bpf, Masami Hiramatsu,
	Mathieu Desnoyers, Jens Remus, Josh Poimboeuf, Peter Zijlstra,
	Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
	Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
	Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
	Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Heiko Carstens,
	Vasily Gorbik, Thomas Weißschuh
In-Reply-To: <20260528222051.60b38433@fedora>

On Thu, May 28, 2026 at 7:20 PM Steven Rostedt <rostedt@kernel.org> wrote:
>
> On Thu, 28 May 2026 16:01:06 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> >
> > [...]
> >
> > >   * Architecture-specific system calls
> > > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > > index a627acc8fb5f..17042d7e5e87 100644
> > > --- a/include/uapi/asm-generic/unistd.h
> > > +++ b/include/uapi/asm-generic/unistd.h
> > > @@ -863,8 +863,13 @@ __SYSCALL(__NR_listns, sys_listns)
> > >  #define __NR_rseq_slice_yield 471
> > >  __SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
> > >
> > > +#define __NR_sframe_register 472
> > > +__SYSCALL(__NR_sframe_register, sys_sframe_register)
> > > +#define __NR_sframe_unregister 473
> > > +__SYSCALL(__NR_sframe_unregister, sys_sframe_unregister)
> > > +
> > >  #undef __NR_syscalls
> > > -#define __NR_syscalls 472
> > > +#define __NR_syscalls 474
> > >
> > >  /*
> > >   * 32 bit systems traditionally used different
> > > diff --git a/include/uapi/linux/sframe.h b/include/uapi/linux/sframe.h
> > > new file mode 100644
> > > index 000000000000..d3c9f88b024b
> > > --- /dev/null
> > > +++ b/include/uapi/linux/sframe.h
> > > @@ -0,0 +1,12 @@
> > > +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> > > +#ifndef _UAPI_LINUX_SFRAME_H
> > > +#define _UAPI_LINUX_SFRAME_H
> > > +
> > > +struct sframe_setup {
> >
> > I'd add `u64 flags;` field for easier and nicer extensibility. Check
> > in the kernel that it is set to zero, future kernels will allow some
> > of the bits to be set.
>
> That sounds reasonable.
>
> >
> > And I still think that prctl() instead of a separate sframe-specific
> > syscall is the way to go. I see no reason for sframe-specific set of
> > syscalls just to set a bit of extra metadata for the entire process.
> > That seems to be the job of prctl().
>
> I personally do not have a preference. I've just heard a lot from
> others where they want to avoid extending an ioctl() like system call
> or even create a new multiplexer syscall.
>
> If we can get a consensus of using prctl() or adding a separate system
> call, I'll go with whatever that is.

prctl() is an already existing multiplexing syscall used to provide
some per-process (or sometimes per-thread, it seems) hints and
options. Please consider sending a prctl() extension and CC me, and
let's see what arguments people have against extending an already
existing syscall.

>
> >
> > > +       __u64                   sframe_start;
> > > +       __u64                   sframe_size;
> > > +       __u64                   text_start;
> > > +       __u64                   text_size;
> > > +};
> > > +
> >
> > [...]
> >
> > > +
> > > +/**
> > > + * sys_sframe_register - register an address for user space stacktrace walking.
> > > + * @data: Structure of sframe data used to register the sframe section
> > > + * @size: The size of the given structure.
> > > + *
> > > + * This system call is used by dynamic library utilities to inform the kernel
> > > + * of meta data that it loaded that can be used by the kernel to know how
> > > + * to stack walk the given text locations.
> > > + *
> > > + * Return: 0 if successful, otherwise a negative error.
> > > + */
> > > +SYSCALL_DEFINE2(sframe_register, struct sframe_setup __user *, data, size_t, size)
> > > +{
> > > +       struct sframe_setup sframe;
> > > +
> > > +       if (sizeof(sframe) != size)
> > > +               return -EINVAL;
> >
> > This seems overly aggressive. It seems like the pattern is to allow
> > sizes both smaller and bigger:
> >   - if user-provided size is smaller than what kernel knows about,
> > treat missing fields as zeroes
>
> Well, that could work with unregister, but for register that isn't
> quite useful, as all fields should be filled (well, if we add flags,
> that may not be 100% true).
>

This is a question of API design. If newly added fields are optional
by default, this works great. And even if you are adding some fields
that in the future will be mandatory (or it could be mandatory based
on flags), then it's super easy to error out if they are not set.
We've been doing this for years now in bpf() syscall and it works
pretty well overall, while also keeping user-space (libbpf, for
instance) side *much* simpler. I don't want to imagine bpf() syscall
which in each kernel version enforces a different size of bpf_attr
union...

> >   - if user-provided size is bigger, then check that space after
> > fields that kernel recognizes are all zeroes.
>
> That is dangerous. A zero with greater size could mean something. If
> the size is greater than expected it should simply fail and let user
> space call it again with the older version.
>

Could, but it shouldn't if we extend API reasonably. And if it so
happens that zero will be meaningful, then you add a new flag that has
to be set if that field is present. This is a solved problem.
Requiring user space to use differently-sized structs for different
kernel versions is much-much worse.

> >
> > This allows extensibility without having to change user space code all
> > the time. Old code will provide smaller struct without new (presumably
> > optional) fields, while newer code can use newer and larger struct
> > size, but as long as it clears extra fields old kernel will be fine
> > with that.
>
> The old size will always work, thus old code will always continue to
> work. If we extend the system call, then it must handle both the older
> size as well as the newer size. User space would not need to change. It
> would only change if it wanted to use a new feature, and if it wants to
> work with older kernels it would need to try the bigger size first and
> if that fails, it knows the kernel doesn't support that new feature and
> then user space can figure out what to do. Either use the old system
> call or abort.

See above, many added features are typically optional (e.g., imagine
some extra bits of information that go along with the currently existing
mandatory sframe data). And it's easy to write user space code that can
automatically and gracefully "downgrade" by detecting that the kernel
doesn't support some feature and thus just not setting the field,
leaving it zero. But you won't have to track what the right size of
the struct should be, which in your API headers is already larger because
you compiled something on newer kernel headers.

Believe me, this is the right way to go with this kind of extendable binary API.

>
> -- Steve
>
> >
> > > +
> > > +       if (copy_from_user(&sframe, data, size))
> > > +               return -EFAULT;
> > > +
> > > +       return sframe_add_section(sframe.sframe_start,
> > > +                                 sframe.sframe_start + sframe.sframe_size,
> > > +                                 sframe.text_start,
> > > +                                 sframe.text_start + sframe.text_size);
> > > +}
> > > +
> >
> > [...]
>
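
The size-handling policy argued for in this reply can be sketched in plain
userspace C as below. A smaller user struct is zero-extended, and a larger
one is accepted only if the bytes the receiver does not know about are all
zero. As far as I know the kernel's copy_struct_from_user() helper follows
a similar policy, but this sketch is only an emulation: the function name
copy_struct_checked(), the _v1/_v2 struct names, and the trailing flags
field are assumptions made for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout from the patch under discussion. */
struct sframe_setup_v1 {
	uint64_t sframe_start;
	uint64_t sframe_size;
	uint64_t text_start;
	uint64_t text_size;
};

/* Hypothetical future layout with one extra, optional field. */
struct sframe_setup_v2 {
	uint64_t sframe_start;
	uint64_t sframe_size;
	uint64_t text_start;
	uint64_t text_size;
	uint64_t flags;
};

/*
 * Copy a caller-provided struct of usize bytes into a receiver struct of
 * ksize bytes. Returns 0 on success or -E2BIG if the caller set bytes the
 * receiver does not understand.
 */
static int copy_struct_checked(void *dst, size_t ksize,
			       const void *src, size_t usize)
{
	const unsigned char *p = src;
	size_t size = usize < ksize ? usize : ksize;

	/* Larger than known: the unknown tail must be all zeroes. */
	for (size_t i = ksize; i < usize; i++) {
		if (p[i])
			return -E2BIG;
	}

	/* Smaller than known: missing fields read as zero. */
	memset(dst, 0, ksize);
	memcpy(dst, src, size);
	return 0;
}
```

Under this policy a caller built against newer headers keeps working on an
older receiver as long as it leaves the new fields zero, which is the
"graceful downgrade" described above.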


* Re: [PATCH 0/4] bootconfig: embed kernel.* cmdline at build time
From: Breno Leitao @ 2026-06-01 17:56 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andrew Morton, Nathan Chancellor, paulmck, Nicolas Schier,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, kernel-team
In-Reply-To: <20260529001519.14ca9dbe92fb2622249137c6@kernel.org>

On Fri, May 29, 2026 at 12:15:19AM +0900, Masami Hiramatsu wrote:
> On Wed, 27 May 2026 09:41:33 -0700
> Breno Leitao <leitao@debian.org> wrote:
> 
> > The userspace pieces (xbc_snprint_cmdline() in lib/, tools/bootconfig -C)
> > already landed; this series wires the rendered cmdline into the kernel.
> > 
> > Motivation: today the embedded bootconfig is parsed at runtime, after
> > parse_early_param() has already run, so early_param() handlers can't
> > see embedded values. Folding the kernel.* subtree into the cmdline at
> > build time gives a CONFIG_CMDLINE-equivalent for embedded-bootconfig
> > users without forcing them to maintain two cmdline sources.
> > 
> > Behaviorally, the "kernel" subtree is rendered to a flat string at
> > build time and stashed in .init.rodata. setup_arch() prepends it to
> > boot_command_line before parse_early_param() runs. Overflow is a soft
> > error: the helper logs and leaves boot_command_line untouched rather
> > than panicking, so an oversized embedded bconf cannot brick a boot.
> > 
> 
> Thanks Breno, yes, that is what I think about.
> Let me check it. And could you also check Sashiko's comments?

Yes, I've spent some time on them, and it did in fact report some good
points. I will fix those and resend.

Thanks!
--breno


* Re: [PATCH v8] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-06-01 17:31 UTC (permalink / raw)
  To: LKML, Linux Trace Kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Peter Zijlstra,
	Namhyung Kim, Takaya Saeki, Douglas Raillard, Tom Zanussi,
	Andrew Morton, Thomas Gleixner, Ian Rogers, Jiri Olsa
In-Reply-To: <20260601130746.2139d926@gandalf.local.home>

On Mon, 1 Jun 2026 13:07:46 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> Changes since v7: https://patch.msgid.link/20260529110442.0967a64c@fedora
> 
> - Add error message in parse_btf_args() for failed parsing of TEVENT.
>   (Sashiko)
> 
> - Remove TPARG_FL_TYPECAST and just use ctx->struct_btf instead.
>   The flag was redundant and added unnecessary complexity.
> 
> - Restructure to keep the lifetime of the TYPECAST to the end of
>   traceprobe_parse_probe_arg_body(). This allows the last_type to stay
>   around in case there's not a type parameter and then btf can still be
>   used.
>   (Sashiko and Masami Hiramatsu)

And I rebased onto probes/for-next

-- Steve


* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
From: Lorenzo Stoakes @ 2026-06-01 17:08 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Andrew Morton, Gerald Schaefer, Nico Pache, linux-doc,
	linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, linux-s390, linux-next
In-Reply-To: <20260601155808.2755103A59-agordeev@linux.ibm.com>

On Mon, Jun 01, 2026 at 05:58:08PM +0200, Alexander Gordeev wrote:
> On Fri, May 22, 2026 at 01:47:24PM -0700, Andrew Morton wrote:
>
> Hi Andrew et al,
>
> > On Fri, 22 May 2026 08:59:55 -0600 Nico Pache <npache@redhat.com> wrote:
> >
> > > The following series provides khugepaged with the capability to collapse
> > > anonymous memory regions to mTHPs.
> >
> > Thanks, I've update mm.git's mm-unstable branch to this version.
> >
> > It sounds like I might be dropping it soon, haven't started looking at
> > that yet.  But let's at least eyeball the latest version at this time.
> >
> > Sashiko was able to apply this, so the base-it-on-hotfixes thing worked
> > well, thanks.  The AI checking made a few allegations:
>
> This series appears to cause hangs on s390 in linux-next.
> The issue is not easily reproducible, so it is not yet confirmed.
> Any ideas for a reliable reproducer that exercises the code path below?
>
>     [ 2749.385719] sysrq: Show Blocked State
>     [ 2749.385730] task:khugepaged      state:D stack:0     pid:209   tgid:209   ppid:2      task_flags:0x200040 flags:0x00000000
>     [ 2749.385735] Call Trace:
>     [ 2749.385736]  [<0000017f63c8b226>] __schedule+0x316/0x890
>     [ 2749.385740]  [<0000017f63c8b7dc>] schedule+0x3c/0xc0
>     [ 2749.385743]  [<0000017f63c8b888>] schedule_preempt_disabled+0x28/0x40
>     [ 2749.385746]  [<0000017f63c902ea>] rwsem_down_write_slowpath+0x2fa/0x8b0
>     [ 2749.385749]  [<0000017f63c90910>] down_write+0x70/0x80
>     [ 2749.385752]  [<0000017f6313407a>] collapse_huge_page+0x2ea/0x9e0
>     [ 2749.385755]  [<0000017f6313491e>] mthp_collapse+0x1ae/0x1f0
>     [ 2749.385757]  [<0000017f63134fda>] collapse_scan_pmd+0x67a/0x8f0
>     [ 2749.385760]  [<0000017f6313751a>] collapse_single_pmd+0x15a/0x260
>     [ 2749.385762]  [<0000017f6313792c>] collapse_scan_mm_slot.constprop.0+0x30c/0x470
>     [ 2749.385765]  [<0000017f63137cb6>] khugepaged+0x226/0x240
>     [ 2749.385768]  [<0000017f62db3128>] kthread+0x148/0x170
>     [ 2749.385770]  [<0000017f62d2c238>] __ret_from_fork+0x48/0x220
>     [ 2749.385772]  [<0000017f63c95d0a>] ret_from_fork+0xa/0x30
>
> Thanks!

Hi Alexander,

Thanks for the report.

It's a pity it isn't reproducible. I had Claude have a look at it and it couldn't
find a definite issue with the code at v18; all the locks seem balanced internally.

Things it highlighted FWIW:

- Far more mmap_write_lock()'s being taken - the stack-based approach calls
  collapse_huge_page() multiple times per-PMD, each of which entails an mmap read
  lock/unlock and mmap write lock.

- anon_vma write lock held for a much longer period over partial collapse.

So maybe these are triggering issues rather than being the cause of them per se?

If you happen to see it again could you give the output for:

'echo t > /proc/sysrq-trigger' so we can track who holds the contended lock and
get more details on it?

Also the .config would be useful.

I'm guessing you've also not enabled mTHP in any way on the system?

Repro-wise you could also:

# echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
# echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

To get khugepaged going more aggressively:

$ for f in /sys/kernel/mm/transparent_hugepage/hugepages-*; do echo always | sudo tee $f/enabled; done

Then maybe some stress-ng like:

$ sudo stress-ng --vm 4 --vm-bytes 2G --vm-method all --timeout 5m

(or maybe something more refined :)?

Maybe some of this will help repro more reliably?

Cheers, Lorenzo


* [PATCH v8] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-06-01 17:07 UTC (permalink / raw)
  To: LKML, Linux Trace Kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Peter Zijlstra,
	Namhyung Kim, Takaya Saeki, Douglas Raillard, Tom Zanussi,
	Andrew Morton, Thomas Gleixner, Ian Rogers, Jiri Olsa

From: Steven Rostedt <rostedt@goodmis.org>

Add syntax to the parsing of eprobes to be able to typecast a trace event
field that is a pointer to a structure.

Currently, a dereference must be a number, where the user has to figure
out manually the offset of a member of a structure that they want to
dereference.

But event probes that record a field that happens to be a pointer to
a structure cannot dereference these values with BTF naming; they
must use numerical offsets.

For example, to find out what device a sk_buff is pointing to in the
net_dev_xmit trace event, one must first use gdb to find the offsets of the
members of the structures:

 (gdb) p &((struct sk_buff *)0)->dev
 $1 = (struct net_device **) 0x10
 (gdb) p &((struct net_device *)0)->name
 $2 = (char (*)[16]) 0x118

And then use the raw numbers to dereference:

  # echo 'e:xmit net.net_dev_xmit +0x118(+0x10($skbaddr)):string' >> dynamic_events
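
As a rough userspace illustration of where such raw numbers come from (this
is not part of the patch), the sketch below uses offsetof() on mock
structures. The names mock_sk_buff, mock_net_device and format_fetcharg()
are hypothetical, and the layouts are invented so that, on a typical LP64
build, the offsets happen to match the gdb output above; the real kernel
structures differ by configuration and version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins; NOT the real kernel layouts. */
struct mock_net_device {
	char pad[0x118];	/* assumed padding so that name lands at 0x118 */
	char name[16];
};

struct mock_sk_buff {
	void *next;
	void *prev;
	struct mock_net_device *dev;	/* 0x10 when pointers are 8 bytes */
};

/*
 * Build a numeric-offset fetch argument of the form
 * "+<name offset>(+<dev offset>($skbaddr)):string".
 * Returns the snprintf() result.
 */
static int format_fetcharg(char *buf, size_t len)
{
	return snprintf(buf, len, "+0x%zx(+0x%zx($skbaddr)):string",
			offsetof(struct mock_net_device, name),
			offsetof(struct mock_sk_buff, dev));
}
```

The point of the sketch is only that these offsets are layout details the
user has to look up by hand and that can change between builds.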

If BTF is in the kernel, then instead, the skbaddr can be typecast to
sk_buff and use the normal dereference logic.

  # echo 'e:xmit net.net_dev_xmit (sk_buff)skbaddr->dev->name:string' >> dynamic_events
  # echo 1 > events/eprobes/xmit/enable
  # cat trace
[..]
    sshd-session-1022    [000] b..2.   860.249343: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.250061: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.250142: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.263553: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.283820: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.302716: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.322905: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.342828: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.362268: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.382335: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.400856: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.419893: xmit: (net.net_dev_xmit) arg1="enp7s0"

The syntax is simply: (STRUCT)FIELD->MEMBER[->MEMBER..]
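
A minimal userspace sketch of how such an argument could be split into the
structure name and the remaining field expression, loosely mirroring the
strchr() step in handle_typecast() from the patch below. The helper name
split_typecast() and its return convention are assumptions made for
illustration; the kernel code works on a writable buffer and reports errors
through trace_probe_log_err() instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Split "(STRUCT)FIELD->MEMBER..." into the struct name (copied into
 * sname) and the remainder after the closing parenthesis (*rest).
 * Returns 0 on success, -1 if the argument is malformed or the name
 * does not fit into sname.
 */
static int split_typecast(const char *arg, char *sname, size_t slen,
			  const char **rest)
{
	const char *close;
	size_t n;

	if (arg[0] != '(')
		return -1;

	close = strchr(arg, ')');
	if (!close)
		return -1;	/* no matching ')' */

	n = (size_t)(close - (arg + 1));
	if (n == 0 || n >= slen)
		return -1;	/* empty name or name too long */

	memcpy(sname, arg + 1, n);
	sname[n] = '\0';
	*rest = close + 1;
	return 0;
}
```

For "(sk_buff)skbaddr->dev->name" this yields the name "sk_buff" and the
remainder "skbaddr->dev->name", which is the part the patch hands on to
parse_btf_arg().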

Also add comments around the #else and #endif of #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
to know what they are for.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v7: https://patch.msgid.link/20260529110442.0967a64c@fedora

- Add error message in parse_btf_args() for failed parsing of TEVENT.
  (Sashiko)

- Remove TPARG_FL_TYPECAST and just use ctx->struct_btf instead.
  The flag was redundant and added unnecessary complexity.

- Restructure to keep the lifetime of the TYPECAST to the end of
  traceprobe_parse_probe_arg_body(). This allows the last_type to stay
  around in case there's not a type parameter and then btf can still be
  used.
  (Sashiko and Masami Hiramatsu)

 Documentation/trace/eprobetrace.rst |   4 +
 kernel/trace/trace_probe.c          | 173 +++++++++++++++++++++++-----
 kernel/trace/trace_probe.h          |   5 +-
 3 files changed, 154 insertions(+), 28 deletions(-)

diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index 89b5157cfab8..fe3602540569 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -46,6 +46,10 @@ Synopsis of eprobe_events
 		  (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
                   "string", "ustring", "symbol", "symstr" and "bitfield" are
                   supported.
+  (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+                  a pointer to STRUCT and then dereference the pointer defined by
+                  ->MEMBER. Note that when this is used, the FIELD name does not
+                  need to be prefixed with a '$'.
 
 Types
 -----
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 695310571b08..fd1caa1f9723 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -332,6 +332,23 @@ static int parse_trace_event_arg(char *arg, struct fetch_insn *code,
 	return -ENOENT;
 }
 
+static int parse_trace_event(char *arg, struct fetch_insn *code,
+			     struct traceprobe_parse_context *ctx)
+{
+	int ret;
+
+	if (code->data)
+		return -EFAULT;
+	ret = parse_trace_event_arg(arg, code, ctx);
+	if (!ret)
+		return 0;
+	if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
+		code->op = FETCH_OP_COMM;
+		return 0;
+	}
+	return -EINVAL;
+}
+
 #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
 
 static u32 btf_type_int(const struct btf_type *t)
@@ -376,11 +393,16 @@ static bool btf_type_is_char_array(struct btf *btf, const struct btf_type *type)
 		&& BTF_INT_BITS(intdata) == 8;
 }
 
+static struct btf *ctx_btf(struct traceprobe_parse_context *ctx)
+{
+	return ctx->struct_btf ? : ctx->btf;
+}
+
 static int check_prepare_btf_string_fetch(char *typename,
 				struct fetch_insn **pcode,
 				struct traceprobe_parse_context *ctx)
 {
-	struct btf *btf = ctx->btf;
+	struct btf *btf = ctx_btf(ctx);
 
 	if (!btf || !ctx->last_type)
 		return 0;
@@ -506,6 +528,15 @@ static int query_btf_context(struct traceprobe_parse_context *ctx)
 	return 0;
 }
 
+static void clear_struct_btf(struct traceprobe_parse_context *ctx)
+{
+	if (ctx->struct_btf) {
+		btf_put(ctx->struct_btf);
+		ctx->struct_btf = NULL;
+		ctx->last_struct = NULL;
+	}
+}
+
 static void clear_btf_context(struct traceprobe_parse_context *ctx)
 {
 	if (ctx->btf) {
@@ -554,22 +585,29 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 	struct fetch_insn *code = *pcode;
 	const struct btf_member *field;
 	u32 bitoffs, anon_offs;
+	bool is_struct = ctx->struct_btf != NULL;
+	struct btf *btf = ctx_btf(ctx);
 	char *next;
 	int is_ptr;
 	s32 tid;
 
 	do {
-		/* Outer loop for solving arrow operator ('->') */
-		if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
-			trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
-			return -EINVAL;
-		}
-		/* Convert a struct pointer type to a struct type */
-		type = btf_type_skip_modifiers(ctx->btf, type->type, &tid);
-		if (!type) {
-			trace_probe_log_err(ctx->offset, BAD_BTF_TID);
-			return -EINVAL;
+		if (!is_struct) {
+			/* Outer loop for solving arrow operator ('->') */
+			if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
+				trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
+				return -EINVAL;
+			}
+
+			/* Convert a struct pointer type to a struct type */
+			type = btf_type_skip_modifiers(btf, type->type, &tid);
+			if (!type) {
+				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+				return -EINVAL;
+			}
 		}
+		/* Only the first type can skip being a pointer */
+		is_struct = false;
 
 		bitoffs = 0;
 		do {
@@ -580,7 +618,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 				return is_ptr;
 
 			anon_offs = 0;
-			field = btf_find_struct_member(ctx->btf, type, fieldname,
+			field = btf_find_struct_member(btf, type, fieldname,
 						       &anon_offs);
 			if (IS_ERR(field)) {
 				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
@@ -602,7 +640,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 				ctx->last_bitsize = 0;
 			}
 
-			type = btf_type_skip_modifiers(ctx->btf, field->type, &tid);
+			type = btf_type_skip_modifiers(btf, field->type, &tid);
 			if (!type) {
 				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
 				return -EINVAL;
@@ -640,7 +678,7 @@ static int parse_btf_arg(char *varname,
 	int i, is_ptr, ret;
 	u32 tid;
 
-	if (WARN_ON_ONCE(!ctx->funcname))
+	if (WARN_ON_ONCE(!ctx->funcname && !(ctx->flags & TPARG_FL_TEVENT)))
 		return -EINVAL;
 
 	is_ptr = split_next_field(varname, &field, ctx);
@@ -653,6 +691,19 @@ static int parse_btf_arg(char *varname,
 		return -EOPNOTSUPP;
 	}
 
+	if (ctx->flags & TPARG_FL_TEVENT) {
+		ret = parse_trace_event(varname, code, ctx);
+		if (ret < 0) {
+			trace_probe_log_err(ctx->offset, BAD_ATTACH_ARG);
+			return ret;
+		}
+		/* TEVENT is only here via a typecast */
+		if (WARN_ON_ONCE(ctx->struct_btf == NULL))
+			return -EINVAL;
+		type = ctx->last_struct;
+		goto found_type;
+	}
+
 	if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
 		code->op = FETCH_OP_RETVAL;
 		/* Check whether the function return type is not void */
@@ -709,6 +760,7 @@ static int parse_btf_arg(char *varname,
 
 found:
 	type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
+found_type:
 	if (!type) {
 		trace_probe_log_err(ctx->offset, BAD_BTF_TID);
 		return -EINVAL;
@@ -727,7 +779,7 @@ static int parse_btf_arg(char *varname,
 static const struct fetch_type *find_fetch_type_from_btf_type(
 					struct traceprobe_parse_context *ctx)
 {
-	struct btf *btf = ctx->btf;
+	struct btf *btf = ctx_btf(ctx);
 	const char *typestr = NULL;
 
 	if (btf && ctx->last_type)
@@ -758,7 +810,67 @@ static int parse_btf_bitfield(struct fetch_insn **pcode,
 	return 0;
 }
 
-#else
+static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
+{
+	struct btf *btf = NULL;
+	int id;
+
+	/* A struct_btf should only be used by a single argument */
+	if (WARN_ON_ONCE(ctx->struct_btf)) {
+		btf_put(ctx->struct_btf);
+		ctx->struct_btf = NULL;
+	}
+
+	id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
+	if (id < 0)
+		return id;
+	ctx->struct_btf = btf;
+	ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
+	return 0;
+}
+
+static int handle_typecast(char *arg, struct fetch_insn **pcode,
+			   struct fetch_insn *end,
+			   struct traceprobe_parse_context *ctx)
+{
+	char *tmp;
+	int ret;
+
+	/* Currently this only works for eprobes */
+	if (!(ctx->flags & TPARG_FL_TEVENT)) {
+		trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
+		return -EINVAL;
+	}
+
+	tmp = strchr(arg, ')');
+	if (!tmp) {
+		trace_probe_log_err(ctx->offset + strlen(arg),
+				    DEREF_OPEN_BRACE);
+		return -EINVAL;
+	}
+	*tmp = '\0';
+	ret = query_btf_struct(arg + 1, ctx);
+	*tmp = ')';
+
+	if (ret < 0) {
+		trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
+		return -EINVAL;
+	}
+
+	tmp++;
+
+	ctx->offset += tmp - arg;
+	ret = parse_btf_arg(tmp, pcode, end, ctx);
+	return ret;
+}
+
+#else /* !CONFIG_PROBE_EVENTS_BTF_ARGS */
+
+static void clear_struct_btf(struct traceprobe_parse_context *ctx)
+{
+	ctx->struct_btf = NULL;
+}
+
 static void clear_btf_context(struct traceprobe_parse_context *ctx)
 {
 	ctx->btf = NULL;
@@ -794,7 +906,15 @@ static int check_prepare_btf_string_fetch(char *typename,
 	return 0;
 }
 
-#endif
+static int handle_typecast(char *arg, struct fetch_insn **pcode,
+			   struct fetch_insn *end,
+			   struct traceprobe_parse_context *ctx)
+{
+	trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
+	return -EOPNOTSUPP;
+}
+
+#endif /* CONFIG_PROBE_EVENTS_BTF_ARGS */
 
 #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
 
@@ -948,16 +1068,9 @@ static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
 	int len;
 
 	if (ctx->flags & TPARG_FL_TEVENT) {
-		if (code->data)
-			return -EFAULT;
-		ret = parse_trace_event_arg(arg, code, ctx);
-		if (!ret)
-			return 0;
-		if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
-			code->op = FETCH_OP_COMM;
-			return 0;
-		}
-		goto inval;
+		if (parse_trace_event(arg, code, ctx) < 0)
+			goto inval;
+		return 0;
 	}
 
 	if (str_has_prefix(arg, "retval")) {
@@ -1224,6 +1337,9 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
 				code->op = FETCH_OP_IMM;
 		}
 		break;
+	case '(':
+		ret = handle_typecast(arg, pcode, end, ctx);
+		break;
 	default:
 		if (isalpha(arg[0]) || arg[0] == '_') {	/* BTF variable */
 			if (!tparg_is_function_entry(ctx->flags) &&
@@ -1556,6 +1672,9 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
 	}
 	kfree(tmp);
 
+	/* struct_btf should not be passed to other arguments */
+	clear_struct_btf(ctx);
+
 	return ret;
 }
 
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 1076f1df347b..15758cc11fc6 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -422,7 +422,9 @@ struct traceprobe_parse_context {
 	const struct btf_param *params;	/* Parameter of the function */
 	s32 nr_params;			/* The number of the parameters */
 	struct btf *btf;		/* The BTF to be used */
+	struct btf *struct_btf;		/* The BTF to be used for structs */
 	const struct btf_type *last_type;	/* Saved type */
+	const struct btf_type *last_struct;	/* Saved structure */
 	u32 last_bitoffs;		/* Saved bitoffs */
 	u32 last_bitsize;		/* Saved bitsize */
 	struct trace_probe *tp;
@@ -563,7 +565,8 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
 	C(NEED_STRING_TYPE,	"$comm and immediate-string only accepts string type"),\
 	C(TOO_MANY_ARGS,	"Too many arguments are specified"),	\
 	C(TOO_MANY_EARGS,	"Too many entry arguments specified"),	\
-	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),
+	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),  \
+	C(TYPECAST_NOT_EVENT,	"Typecasts are only for eprobe fields"),
 
 #undef C
 #define C(a, b)		TP_ERR_##a
-- 
2.53.0



* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
From: Nico Pache @ 2026-06-01 17:05 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Andrew Morton, Gerald Schaefer, linux-doc, linux-kernel, linux-mm,
	linux-trace-kernel, aarcange, anshuman.khandual, apopple, baohua,
	baolin.wang, byungchul, catalin.marinas, cl, corbet, dave.hansen,
	david, dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh,
	jglisse, joshua.hahnjy, kas, lance.yang, liam, ljs,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
	linux-s390, linux-next
In-Reply-To: <20260601155808.2755103A59-agordeev@linux.ibm.com>

On Mon, Jun 1, 2026 at 9:58 AM Alexander Gordeev <agordeev@linux.ibm.com> wrote:
>
> On Fri, May 22, 2026 at 01:47:24PM -0700, Andrew Morton wrote:
>
> Hi Andrew et al,
>
> > On Fri, 22 May 2026 08:59:55 -0600 Nico Pache <npache@redhat.com> wrote:
> >
> > > The following series provides khugepaged with the capability to collapse
> > > anonymous memory regions to mTHPs.
> >
> > Thanks, I've update mm.git's mm-unstable branch to this version.
> >
> > It sounds like I might be dropping it soon, haven't started looking at
> > that yet.  But let's at least eyeball the latest version at this time.
> >
> > Sashiko was able to apply this, so the base-it-on-hotfixes thing worked
> > well, thanks.  The AI checking made a few allegations:
>
> This series appears to cause hangs on s390 in linux-next.
> The issue is not easily reproducible, so it is not yet confirmed.
> Any ideas for a reliable reproducer that exercises the code path below?

Hi,

Thanks for the report!

Was this caught by syzbot? If so, can you provide a link?

Also, can you tell us whether any of the mTHP sysfs settings were enabled?

Based on the report, it looks like we are dealing with more lock
contention (due to holding the write lock longer). We could switch to
a trylock, but that might cause us to lose some collapse attempts
(which will be retried later, so probably fine). I'm ok with that
approach if it prevents these potential regressions.

Cheers,
-- Nico

>
>     [ 2749.385719] sysrq: Show Blocked State
>     [ 2749.385730] task:khugepaged      state:D stack:0     pid:209   tgid:209   ppid:2      task_flags:0x200040 flags:0x00000000
>     [ 2749.385735] Call Trace:
>     [ 2749.385736]  [<0000017f63c8b226>] __schedule+0x316/0x890
>     [ 2749.385740]  [<0000017f63c8b7dc>] schedule+0x3c/0xc0
>     [ 2749.385743]  [<0000017f63c8b888>] schedule_preempt_disabled+0x28/0x40
>     [ 2749.385746]  [<0000017f63c902ea>] rwsem_down_write_slowpath+0x2fa/0x8b0
>     [ 2749.385749]  [<0000017f63c90910>] down_write+0x70/0x80
>     [ 2749.385752]  [<0000017f6313407a>] collapse_huge_page+0x2ea/0x9e0
>     [ 2749.385755]  [<0000017f6313491e>] mthp_collapse+0x1ae/0x1f0
>     [ 2749.385757]  [<0000017f63134fda>] collapse_scan_pmd+0x67a/0x8f0
>     [ 2749.385760]  [<0000017f6313751a>] collapse_single_pmd+0x15a/0x260
>     [ 2749.385762]  [<0000017f6313792c>] collapse_scan_mm_slot.constprop.0+0x30c/0x470
>     [ 2749.385765]  [<0000017f63137cb6>] khugepaged+0x226/0x240
>     [ 2749.385768]  [<0000017f62db3128>] kthread+0x148/0x170
>     [ 2749.385770]  [<0000017f62d2c238>] __ret_from_fork+0x48/0x220
>     [ 2749.385772]  [<0000017f63c95d0a>] ret_from_fork+0xa/0x30
>
> Thanks!
>



* Re: PATCH v7] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-06-01 16:21 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: LKML, Linux trace kernel, Mathieu Desnoyers, Mark Rutland,
	Peter Zijlstra, Namhyung Kim, Takaya Saeki, Douglas Raillard,
	Tom Zanussi, Andrew Morton, Thomas Gleixner, Ian Rogers,
	Jiri Olsa
In-Reply-To: <20260531101458.c8ee22f6222a3fc224cc5328@kernel.org>

On Sun, 31 May 2026 10:14:58 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > > Does this prematurely release the BTF struct reference?
> > > If TPARG_FL_TYPECAST is unset here and ctx->struct_btf is put, won't
> > > later steps in traceprobe_parse_probe_arg_body() (like
> > > find_fetch_type_from_btf_type()) fail to properly infer struct field sizes?
> > > When ctx_btf(ctx) is called later without TPARG_FL_TYPECAST set, it
> > > will evaluate to ctx->btf (which is NULL for eprobes).
> > > Could this potentially lead to silent defaults, such as 64-bit reads for
> > > smaller fields, or fail to inject pointer dereferences for string fields,
> > > while also leaving ctx->last_type pointing to a prematurely released BTF
> > > object?  
> > 
> > Does this mean we need to set ctx->last_type to NULL here too?  
> 
> No, since the member we refer can be different from unsigned long.
> When we don't have ":type" suffix, we use BTF type information to
> decide appropriate type.
> 
> > 
> > Because everything above is pretty much the expected behavior. The put is
> > *not* premature. The last_struct and struct_btf are both set to NULL. I
> > guess the only thing missing is to reset last_type as well.  
> 
> No, as I explained, the last_type is used to determine the member type
> when user does not specify the ":type" suffix.
> 
> So, what we need to do is deferring the btf_put(struct_btf) as below:
> (no build test yet.)

OK, but I don't think we want the struct_btf to exist beyond a single
arg like the btf descriptor does. How about this (on top of this change),
where it clears the struct_btf at the end of traceprobe_parse_probe_arg_body()?

Also, I see the flag as redundant, so I use the existence of
struct_btf to denote that it's parsing a typecast struct.
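
The intended lifetime can be sketched in userspace roughly as follows. All
names here (mock_btf, mock_ctx, mock_parse_arg() and friends) are
hypothetical stand-ins and the reference count is a plain integer; this is
only an illustration of "take the reference in the typecast handler, drop
it once at the end of the argument", not the kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct btf and the parse context. */
struct mock_btf {
	int refcnt;
};

struct mock_ctx {
	struct mock_btf *struct_btf;	/* set only while a typecast arg is parsed */
	const char *last_type;		/* type info that depends on struct_btf */
};

static void mock_btf_put(struct mock_btf *btf)
{
	btf->refcnt--;
}

/* Typecast handler: take a reference and remember it in the context. */
static void mock_query_struct(struct mock_ctx *ctx, struct mock_btf *btf)
{
	btf->refcnt++;
	ctx->struct_btf = btf;
	ctx->last_type = "char[16]";
}

/* Drop the per-argument reference; safe to call when none is held. */
static void mock_clear_struct_btf(struct mock_ctx *ctx)
{
	if (ctx->struct_btf) {
		mock_btf_put(ctx->struct_btf);
		ctx->struct_btf = NULL;
	}
}

/*
 * Parse one argument. The reference stays valid through the type
 * inference step (the case without a ":type" suffix) and is released
 * exactly once before returning. Returns 1 if inference could run.
 */
static int mock_parse_arg(struct mock_ctx *ctx, struct mock_btf *btf)
{
	int inferred;

	mock_query_struct(ctx, btf);
	inferred = ctx->struct_btf != NULL && ctx->last_type != NULL;
	mock_clear_struct_btf(ctx);
	return inferred;
}
```

With this shape the existence of struct_btf alone tells the parser that a
typecast is in effect, which is why a separate flag adds nothing.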

-- Steve

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 9246e9c3d066..56b7dc406ca1 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -397,8 +397,7 @@ static bool btf_type_is_char_array(struct btf *btf, const struct btf_type *type)
 
 static struct btf *ctx_btf(struct traceprobe_parse_context *ctx)
 {
-	return ctx->flags & TPARG_FL_TYPECAST ?
-		ctx->struct_btf : ctx->btf;
+	return ctx->struct_btf ? : ctx->btf;
 }
 
 static int check_prepare_btf_string_fetch(char *typename,
@@ -531,6 +530,15 @@ static int query_btf_context(struct traceprobe_parse_context *ctx)
 	return 0;
 }
 
+static void clear_struct_btf(struct traceprobe_parse_context *ctx)
+{
+	if (ctx->struct_btf) {
+		btf_put(ctx->struct_btf);
+		ctx->struct_btf = NULL;
+		ctx->last_struct = NULL;
+	}
+}
+
 static void clear_btf_context(struct traceprobe_parse_context *ctx)
 {
 	if (ctx->btf) {
@@ -579,7 +587,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 	struct fetch_insn *code = *pcode;
 	const struct btf_member *field;
 	u32 bitoffs, anon_offs;
-	bool is_struct = ctx->flags & TPARG_FL_TYPECAST;
+	bool is_struct = ctx->struct_btf != NULL;
 	struct btf *btf = ctx_btf(ctx);
 	char *next;
 	int is_ptr;
@@ -690,7 +698,7 @@ static int parse_btf_arg(char *varname,
 		ret = parse_trace_event(varname, code, ctx);
 		if (ret < 0)
 			return ret;
-		if (WARN_ON_ONCE(!(ctx->flags & TPARG_FL_TYPECAST)))
+		if (WARN_ON_ONCE(ctx->struct_btf == NULL))
 			return -EINVAL;
 		type = ctx->last_struct;
 		goto found_type;
@@ -804,21 +812,19 @@ static int parse_btf_bitfield(struct fetch_insn **pcode,
 
 static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
 {
+	struct btf *btf = NULL;
 	int id;
 
-	if (!ctx->struct_btf) {
-		struct btf *btf;
-
-		id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
-		if (id < 0)
-			return id;
-		ctx->struct_btf = btf;
-	} else {
-		id = btf_find_by_name_kind(ctx->struct_btf, sname, BTF_KIND_STRUCT);
-		if (id < 0)
-			return id;
+	/* Could be for a structure in a different module */
+	if (ctx->struct_btf) {
+		btf_put(ctx->struct_btf);
+		ctx->struct_btf = NULL;
 	}
 
+	id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
+	if (id < 0)
+		return id;
+	ctx->struct_btf = btf;
 	ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
 	return 0;
 }
@@ -848,25 +854,23 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 
 	if (ret < 0) {
 		trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
-		ret = -EINVAL;
-		goto out_put;
+		return -EINVAL;
 	}
 
-	ctx->flags |= TPARG_FL_TYPECAST;
 	tmp++;
 
 	ctx->offset += tmp - arg;
 	ret = parse_btf_arg(tmp, pcode, end, ctx);
-	ctx->flags &= ~TPARG_FL_TYPECAST;
-	ctx->last_struct = NULL;
-out_put:
-	btf_put(ctx->struct_btf);
-	ctx->struct_btf = NULL;
 	return ret;
 }
 
 #else /* !CONFIG_PROBE_EVENTS_BTF_ARGS */
 
+static void clear_struct_btf(struct traceprobe_parse_context *ctx)
+{
+	ctx->struct_btf = NULL;
+}
+
 static void clear_btf_context(struct traceprobe_parse_context *ctx)
 {
 	ctx->btf = NULL;
@@ -1673,6 +1677,9 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
 	}
 	kfree(tmp);
 
+	/* struct_btf should not be passed to other arguments */
+	clear_struct_btf(ctx);
+
 	return ret;
 }
 
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 952e3d7582b8..83565f1634db 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -394,7 +394,6 @@ static inline int traceprobe_get_entry_data_size(struct trace_probe *tp)
  * TPARG_FL_KERNEL and TPARG_FL_USER are also mutually exclusive.
  * TPARG_FL_FPROBE and TPARG_FL_TPOINT are optional but it should be with
  * TPARG_FL_KERNEL.
- * TPARG_FL_TYPECAST is set if an argument was typecast to a structure.
  */
 #define TPARG_FL_RETURN BIT(0)
 #define TPARG_FL_KERNEL BIT(1)
@@ -403,7 +402,6 @@ static inline int traceprobe_get_entry_data_size(struct trace_probe *tp)
 #define TPARG_FL_USER   BIT(4)
 #define TPARG_FL_FPROBE BIT(5)
 #define TPARG_FL_TPOINT BIT(6)
-#define TPARG_FL_TYPECAST BIT(7)
 #define TPARG_FL_LOC_MASK	GENMASK(4, 0)
 
 static inline bool tparg_is_function_entry(unsigned int flags)


* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Lance Yang @ 2026-06-01 16:07 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
	peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
	rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
	surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe, usama.arif
In-Reply-To: <0bb49c47-8c41-478c-847e-b9154c75e59c@kernel.org>



On 2026/6/1 23:05, David Hildenbrand (Arm) wrote:
> On 6/1/26 17:00, Nico Pache wrote:
>> On Mon, Jun 1, 2026 at 5:14 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>>>
>>> On 6/1/26 12:47, Lance Yang wrote:
>>>>
>>>>
>>>>
>>>> Ah, cool! __folio_mark_uptodate() already does the job :P
>>>>
>>>> So yeah, no extra smp_wmb() needed here!
>>>
>>> Yeah. BTW, I think we'd need a spin_lock_nested(), so @Nico, treat my code as a
>>> draft.
>>
>> Okay, I read the above and did some investigating.
>>
>> I will try to implement and verify the changes you suggested :)
>>
>> Or an even crazier idea... what if we ensure MIPS checks for PMD_none
>> before walking a PTE table?
> 
> But how would they update the cache then correctly?
> 
> I'm too non-MIPS to know the answer :)

Right, that's my concern as well ...

If MIPS sees pmd_none(), it has no PTE table to walk, so it also has
no way to do the cache update it wanted to do, I guess :)

But, the PTE table is not really gone there. khugepaged only cleared
the PMD temporarily while still using the old PTE table through _pmd.

So I'd go with David's suggestion:

"
Best to make sure the page table is already installed when updating
the entries.
"

Cheers, Lance


* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
From: Alexander Gordeev @ 2026-06-01 15:58 UTC (permalink / raw)
  To: Andrew Morton, Gerald Schaefer
  Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe, linux-s390, linux-next
In-Reply-To: <20260522134724.f4f11941a85ef18b307d16ae@linux-foundation.org>

On Fri, May 22, 2026 at 01:47:24PM -0700, Andrew Morton wrote:

Hi Andrew et al,

> On Fri, 22 May 2026 08:59:55 -0600 Nico Pache <npache@redhat.com> wrote:
> 
> > The following series provides khugepaged with the capability to collapse
> > anonymous memory regions to mTHPs.
> 
> Thanks, I've updated mm.git's mm-unstable branch to this version.
> 
> It sounds like I might be dropping it soon, haven't started looking at
> that yet.  But let's at least eyeball the latest version at this time.
> 
> Sashiko was able to apply this, so the base-it-on-hotfixes thing worked
> well, thanks.  The AI checking made a few allegations:

This series appears to cause hangs on s390 in linux-next.
The issue is not easily reproducible, so it is not yet confirmed.
Any ideas for a reliable reproducer that exercises the code path below?

    [ 2749.385719] sysrq: Show Blocked State
    [ 2749.385730] task:khugepaged      state:D stack:0     pid:209   tgid:209   ppid:2      task_flags:0x200040 flags:0x00000000
    [ 2749.385735] Call Trace:
    [ 2749.385736]  [<0000017f63c8b226>] __schedule+0x316/0x890
    [ 2749.385740]  [<0000017f63c8b7dc>] schedule+0x3c/0xc0
    [ 2749.385743]  [<0000017f63c8b888>] schedule_preempt_disabled+0x28/0x40
    [ 2749.385746]  [<0000017f63c902ea>] rwsem_down_write_slowpath+0x2fa/0x8b0
    [ 2749.385749]  [<0000017f63c90910>] down_write+0x70/0x80
    [ 2749.385752]  [<0000017f6313407a>] collapse_huge_page+0x2ea/0x9e0
    [ 2749.385755]  [<0000017f6313491e>] mthp_collapse+0x1ae/0x1f0
    [ 2749.385757]  [<0000017f63134fda>] collapse_scan_pmd+0x67a/0x8f0
    [ 2749.385760]  [<0000017f6313751a>] collapse_single_pmd+0x15a/0x260
    [ 2749.385762]  [<0000017f6313792c>] collapse_scan_mm_slot.constprop.0+0x30c/0x470
    [ 2749.385765]  [<0000017f63137cb6>] khugepaged+0x226/0x240
    [ 2749.385768]  [<0000017f62db3128>] kthread+0x148/0x170
    [ 2749.385770]  [<0000017f62d2c238>] __ret_from_fork+0x48/0x220
    [ 2749.385772]  [<0000017f63c95d0a>] ret_from_fork+0xa/0x30

Thanks!


* [PATCH v4 13/13] verification/rvgen: Generate cleanup hook for per-obj monitor
From: Gabriele Monaco @ 2026-06-01 15:38 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, linux-trace-kernel
  Cc: Nam Cao, Wen Yang
In-Reply-To: <20260601153840.124372-1-gmonaco@redhat.com>

Per-object monitors can allocate memory dynamically. This memory is
required for the lifetime of the object and should then be freed with
the appropriate call.

Make the generation scripts emit a cleanup function that the user needs
to wire to the appropriate event (e.g. sched_process_exit for tasks).
The function can be safely removed if the object never ceases to exist
before the monitor is disabled (e.g. if the monitor follows only static
variables).

Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
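As a rough illustration of what the generator change produces, the
sketch below reproduces only the new per-object branch as a standalone
function. The function name fill_cleanup_handler() and its parameters
are hypothetical and exist only for this example; the emitted strings
are the ones added in the diff below.

```python
def fill_cleanup_handler(monitor_type: str, cleanup_marker: str = "obj_cleanup") -> str:
    """Return the C skeleton of the cleanup handler, or "" if not per-object.

    Hypothetical helper: in the patch this logic lives inside the dot2k class.
    """
    buff = []
    if monitor_type == "per_obj":
        buff.append("/* XXX: obj is being destroyed, remove if not required (e.g. obj is static) */")
        buff.append(f"static void handle_{cleanup_marker}(void *data, /* XXX: fill header */)")
        buff.append("{")
        buff.append("\tint id = /* XXX: how do I get the id? */;")
        buff.append("\tda_destroy_storage(id);")
        buff.append("}")
        buff.append("")
    return "\n".join(buff)


# Only per-object monitors get the skeleton; other types get nothing.
print(fill_cleanup_handler("per_obj"))
```

The XXX markers are left for the user to fill in, as with the other
generated handlers.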
 tools/verification/rvgen/rvgen/dot2k.py | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/tools/verification/rvgen/rvgen/dot2k.py b/tools/verification/rvgen/rvgen/dot2k.py
index 110cfd69e..3060aa4b9 100644
--- a/tools/verification/rvgen/rvgen/dot2k.py
+++ b/tools/verification/rvgen/rvgen/dot2k.py
@@ -17,6 +17,9 @@ from .automata import _EventConstraintKey, _StateConstraintKey, AutomataError
 class dot2k(Monitor, Dot2c):
     template_dir = "dot2k"
 
+    # only needed for the per-obj cleanup hook
+    cleanup_marker = "obj_cleanup"
+
     def __init__(self, file_path, MonitorType, extra_params={}):
         self.monitor_type = MonitorType
         Monitor.__init__(self, extra_params)
@@ -56,18 +59,30 @@ class dot2k(Monitor, Dot2c):
                 buff.append(f"\tda_{handle}({event}{self.enum_suffix});")
             buff.append("}")
             buff.append("")
+        if self.monitor_type == "per_obj":
+            buff.append("/* XXX: obj is being destroyed, remove if not required (e.g. obj is static) */")
+            buff.append(f"static void handle_{self.cleanup_marker}(void *data, /* XXX: fill header */)")
+            buff.append("{")
+            buff.append("\tint id = /* XXX: how do I get the id? */;")
+            buff.append("\tda_destroy_storage(id);")
+            buff.append("}")
+            buff.append("")
         return '\n'.join(buff)
 
     def fill_tracepoint_attach_probe(self) -> str:
         buff = []
         for event in self.events:
             buff.append(f"\trv_attach_trace_probe(\"{self.name}\", /* XXX: tracepoint */, handle_{event});")
+        if self.monitor_type == "per_obj":
+            buff.append(f"\trv_attach_trace_probe(\"{self.name}\", /* XXX: cleanup tracepoint */, handle_{self.cleanup_marker});")
         return '\n'.join(buff)
 
     def fill_tracepoint_detach_helper(self) -> str:
         buff = []
         for event in self.events:
             buff.append(f"\trv_detach_trace_probe(\"{self.name}\", /* XXX: tracepoint */, handle_{event});")
+        if self.monitor_type == "per_obj":
+            buff.append(f"\trv_detach_trace_probe(\"{self.name}\", /* XXX: cleanup tracepoint */, handle_{self.cleanup_marker});")
         return '\n'.join(buff)
 
     def fill_model_h_header(self) -> list[str]:
-- 
2.54.0



* [PATCH v4 12/13] rv: Fix read_lock scope in per-task DA cleanup
From: Gabriele Monaco @ 2026-06-01 15:38 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, linux-trace-kernel
  Cc: Wen Yang, Nam Cao
In-Reply-To: <20260601153840.124372-1-gmonaco@redhat.com>

The da_monitor_reset_all() function for per-task monitors takes
tasklist_lock while iterating over tasks and keeps holding it while
iterating over idle tasks (one per CPU). The latter is not necessary
since the lock only needs to guard for_each_process_thread().

Use a scoped_guard for more compact syntax and restrict the lock scope
to where it is necessary.

Reviewed-by: Wen Yang <wen.yang@linux.dev>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/da_monitor.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 34b8fba9e..08e5d0c59 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -334,12 +334,12 @@ static void __da_monitor_reset_all(void (*reset)(struct da_monitor *))
 	struct task_struct *g, *p;
 	int cpu;
 
-	read_lock(&tasklist_lock);
-	for_each_process_thread(g, p)
-		reset(da_get_monitor(p));
+	scoped_guard(read_lock, &tasklist_lock) {
+		for_each_process_thread(g, p)
+			reset(da_get_monitor(p));
+	}
 	for_each_present_cpu(cpu)
 		reset(da_get_monitor(idle_task(cpu)));
-	read_unlock(&tasklist_lock);
 }
 
 static void da_monitor_reset_all(void)
-- 
2.54.0



* [PATCH v4 11/13] verification/rvgen: Fix suffix strip in dot2k
From: Gabriele Monaco @ 2026-06-01 15:38 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, linux-trace-kernel
  Cc: Nam Cao, Wen Yang
In-Reply-To: <20260601153840.124372-1-gmonaco@redhat.com>

__start_to_invariant_check() and __get_constraint_env() parse the
environment variable's name from sources that have it padded with the
monitor name. This is removed using rstrip(), which is not meant to
strip a substring but rather a set of characters.

Use removesuffix() to actually get rid of the trailing _<monitor name>.

Fixes: a82adadb16894 ("verification/rvgen: Add support for Hybrid Automata")
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
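A minimal sketch of the difference, using made-up monitor and
environment names picked only to expose the problem (they are not taken
from any existing model):

```python
# Hypothetical names: environment "clk_last" padded with the monitor name.
monitor = "stall"
env = "clk_last_stall"

# rstrip() treats its argument as a set of characters ('_', 's', 't', 'a', 'l')
# and keeps stripping for as long as the last character belongs to that set.
stripped = env.rstrip(f"_{monitor}")

# removesuffix() (Python 3.9+) removes the exact trailing substring, once.
removed = env.removesuffix(f"_{monitor}")

print(stripped)  # clk
print(removed)   # clk_last
```

With rstrip() the lookup would only succeed for environment names that
do not end in any character of the monitor name.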
 tools/verification/rvgen/rvgen/dot2k.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/verification/rvgen/rvgen/dot2k.py b/tools/verification/rvgen/rvgen/dot2k.py
index e6f476b90..110cfd69e 100644
--- a/tools/verification/rvgen/rvgen/dot2k.py
+++ b/tools/verification/rvgen/rvgen/dot2k.py
@@ -215,14 +215,14 @@ class ha2k(dot2k):
     def __get_constraint_env(self, constr: str) -> str:
         """Extract the second argument from an ha_ function"""
         env = constr.split("(")[1].split()[1].rstrip(")").rstrip(",")
-        assert env.rstrip(f"_{self.name}") in self.envs
+        assert env.removesuffix(f"_{self.name}") in self.envs
         return env
 
     def __start_to_invariant_check(self, constr: str) -> str:
         # by default assume the timer has ns expiration
         env = self.__get_constraint_env(constr)
         clock_type = "ns"
-        if self.env_types.get(env.rstrip(f"_{self.name}")) == "j":
+        if self.env_types.get(env.removesuffix(f"_{self.name}")) == "j":
             clock_type = "jiffy"
 
         return f"return ha_check_invariant_{clock_type}(ha_mon, {env}, time_ns)"
-- 
2.54.0



* [PATCH v4 10/13] rv: Use 0 to check preemption enabled in opid
From: Gabriele Monaco @ 2026-06-01 15:38 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, Masami Hiramatsu,
	linux-trace-kernel
  Cc: Nam Cao, Wen Yang
In-Reply-To: <20260601153840.124372-1-gmonaco@redhat.com>

Tracepoint handlers no longer run with preemption disabled by default
since a46023d5616 ("tracing: Guard __DECLARE_TRACE() use of
__DO_TRACE_CALL() with SRCU-fast"). The opid monitor should now treat a
preemption count of 1 as preemption disabled.

Change the rule for preempt_off to preempt > 0.

Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
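The effect of the threshold change can be modelled outside the kernel.
The sketch below is only a model: the function names are hypothetical,
and PREEMPT_MASK is assumed to be the low byte of preempt_count, as in
include/linux/preempt.h.

```python
# Assumption: the preempt-disable depth lives in the low 8 bits of preempt_count.
PREEMPT_MASK = 0x000000FF


def preempt_off_old(preempt_count: int) -> bool:
    """Old rule: the tracepoint itself added 1, so a depth of 1 meant 'enabled'."""
    return (preempt_count & PREEMPT_MASK) > 1


def preempt_off_new(preempt_count: int) -> bool:
    """New rule: the tracepoint adds nothing, so any non-zero depth means 'disabled'."""
    return (preempt_count & PREEMPT_MASK) > 0


# The caller disabled preemption once and the tracepoint no longer adds its own
# increment: the handler observes a depth of 1.
observed = 1
print(preempt_off_old(observed), preempt_off_new(observed))
```

Bits above the mask (e.g. softirq counts) do not affect either rule.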
 kernel/trace/rv/monitors/opid/opid.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/kernel/trace/rv/monitors/opid/opid.c b/kernel/trace/rv/monitors/opid/opid.c
index 2922318c6..3b6a85e81 100644
--- a/kernel/trace/rv/monitors/opid/opid.c
+++ b/kernel/trace/rv/monitors/opid/opid.c
@@ -22,14 +22,8 @@ static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs_opid env, u64 time_ns
 	if (env == irq_off_opid)
 		return irqs_disabled();
 	else if (env == preempt_off_opid) {
-		/*
-		 * If CONFIG_PREEMPTION is enabled, then the tracepoint itself disables
-		 * preemption (adding one to the preempt_count). Since we are
-		 * interested in the preempt_count at the time the tracepoint was
-		 * hit, we consider 1 as still enabled.
-		 */
 		if (IS_ENABLED(CONFIG_PREEMPTION))
-			return (preempt_count() & PREEMPT_MASK) > 1;
+			return (preempt_count() & PREEMPT_MASK) > 0;
 		return true;
 	}
 	return ENV_INVALID_VALUE;
-- 
2.54.0



* [PATCH v4 03/13] rv: Prevent in-flight per-task handlers from using invalid slots
From: Gabriele Monaco @ 2026-06-01 15:38 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, linux-trace-kernel
  Cc: Wen Yang, Nam Cao
In-Reply-To: <20260601153840.124372-1-gmonaco@redhat.com>

Per-task monitors use a slot in the task_struct->rv[] array and store
it locally (e.g. task_mon_slot). This slot is returned during the
destruction process, but handlers can currently still be running while
the slot is being returned, and this race may lead to accessing an
invalid slot.

Synchronise with all in-flight tracepoint handlers using
tracepoint_synchronize_unregister() before returning the slot.

Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
Fixes: a9769a5b9878 ("rv: Add support for LTL monitors")
Suggested-by: Wen Yang <wen.yang@linux.dev>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/da_monitor.h  | 4 ++++
 include/rv/ltl_monitor.h | 1 +
 2 files changed, 5 insertions(+)

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 1459fb3df..cc97cc5df 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -302,6 +302,9 @@ static int da_monitor_init(void)
 
 /*
  * da_monitor_destroy - return the allocated slot
+ *
+ * Wait for all in-flight handlers before returning the slot to avoid
+ * out-of-bound accesses.
  */
 static inline void da_monitor_destroy(void)
 {
@@ -310,6 +313,7 @@ static inline void da_monitor_destroy(void)
 		return;
 	}
 
+	tracepoint_synchronize_unregister();
 	da_monitor_reset_all();
 
 	rv_put_task_monitor_slot(task_mon_slot);
diff --git a/include/rv/ltl_monitor.h b/include/rv/ltl_monitor.h
index eff60cd61..38e792401 100644
--- a/include/rv/ltl_monitor.h
+++ b/include/rv/ltl_monitor.h
@@ -77,6 +77,7 @@ static void ltl_monitor_destroy(void)
 {
 	rv_detach_trace_probe(name, task_newtask, handle_task_newtask);
 
+	tracepoint_synchronize_unregister();
 	rv_put_task_monitor_slot(ltl_monitor_slot);
 	ltl_monitor_slot = RV_PER_TASK_MONITOR_INIT;
 }
-- 
2.54.0



* [PATCH v4 09/13] rv: Prevent task migration while handling per-CPU events
From: Gabriele Monaco @ 2026-06-01 15:38 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, linux-trace-kernel
  Cc: Wen Yang, Nam Cao
In-Reply-To: <20260601153840.124372-1-gmonaco@redhat.com>

Tracepoint handlers are fully preemptible after a46023d5616 ("tracing:
Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast"). When
a per-CPU monitor handles an event, it retrieves the monitor state using
a per-CPU pointer. If the event itself doesn't disable preemption, the
task can migrate to a different CPU and we risk updating the wrong
monitor.

Mitigate this by explicitly disabling task migration before acquiring
the monitor pointer. This cannot guarantee the monitor runs on the
correct CPU but reduces the race condition window and prevents warnings.

Reviewed-by: Wen Yang <wen.yang@linux.dev>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/da_monitor.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 1f440c781..34b8fba9e 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -218,6 +218,10 @@ static inline void da_monitor_destroy(void)
 	da_monitor_sync_hook();
 }
 
+#ifndef da_implicit_guard
+#define da_implicit_guard()
+#endif
+
 #elif RV_MON_TYPE == RV_MON_PER_CPU
 /*
  * Functions to define, init and get a per-cpu monitor.
@@ -284,6 +288,10 @@ static inline void da_monitor_destroy(void)
 	da_monitor_sync_hook();
 }
 
+#ifndef da_implicit_guard
+#define da_implicit_guard() guard(migrate)()
+#endif
+
 #elif RV_MON_TYPE == RV_MON_PER_TASK
 /*
  * Functions to define, init and get a per-task monitor.
@@ -756,6 +764,7 @@ static inline bool __da_handle_start_run_event(struct da_monitor *da_mon,
  */
 static inline void da_handle_event(enum events event)
 {
+	da_implicit_guard();
 	__da_handle_event(da_get_monitor(), event, 0);
 }
 
@@ -771,6 +780,7 @@ static inline void da_handle_event(enum events event)
  */
 static inline bool da_handle_start_event(enum events event)
 {
+	da_implicit_guard();
 	return __da_handle_start_event(da_get_monitor(), event, 0);
 }
 
@@ -782,6 +792,7 @@ static inline bool da_handle_start_event(enum events event)
  */
 static inline bool da_handle_start_run_event(enum events event)
 {
+	da_implicit_guard();
 	return __da_handle_start_run_event(da_get_monitor(), event, 0);
 }
 
-- 
2.54.0



* [PATCH v4 08/13] rv: Ensure synchronous cleanup for HA monitors
From: Gabriele Monaco @ 2026-06-01 15:38 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, linux-trace-kernel
  Cc: Nam Cao, Wen Yang
In-Reply-To: <20260601153840.124372-1-gmonaco@redhat.com>

HA monitors may start timers. All cleanup functions currently stop the
timers asynchronously to avoid sleeping in the wrong context, but
nothing makes sure running callbacks terminate on cleanup.

Run the entire HA timer callback in an RCU read-side critical section.
This way we can simply synchronize_rcu() with any pending timer and be
sure any cleanup using kfree_rcu() runs after the callbacks terminated.
Additionally, make sure any unlikely callback running late won't run any
code if the monitor is marked as disabled or if destruction started.
Use memory barriers to serialise with racing resets.

Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
Fixes: 4a24127bd6cb ("rv: Add support for per-object monitors in DA/HA")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/da_monitor.h | 19 ++++++++++++++++---
 include/rv/ha_monitor.h | 29 ++++++++++++++++++++++++++---
 2 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index ec9bc88bd..1f440c781 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -57,6 +57,15 @@ static struct rv_monitor rv_this;
 #define da_monitor_reset_hook(da_mon)
 #endif
 
+/*
+ * Hook to allow the implementation of hybrid automata: define it with a
+ * function that waits for the termination of all monitors background
+ * activities (e.g. all timers). This hook can sleep.
+ */
+#ifndef da_monitor_sync_hook
+#define da_monitor_sync_hook()
+#endif
+
 /*
  * Type for the target id, default to int but can be overridden.
  * A long type can work as hash table key (PER_OBJ) but will be downgraded to
@@ -82,7 +91,8 @@ static void react(enum states curr_state, enum events event)
 static inline void da_monitor_reset_state(struct da_monitor *da_mon)
 {
 	WRITE_ONCE(da_mon->monitoring, 0);
-	da_mon->curr_state = model_get_initial_state();
+	/* Pair with load in __ha_monitor_timer_callback */
+	smp_store_release(&da_mon->curr_state, model_get_initial_state());
 }
 
 /*
@@ -205,6 +215,7 @@ static inline int da_monitor_init(void)
 static inline void da_monitor_destroy(void)
 {
 	da_monitor_reset_all();
+	da_monitor_sync_hook();
 }
 
 #elif RV_MON_TYPE == RV_MON_PER_CPU
@@ -270,6 +281,7 @@ static inline int da_monitor_init(void)
 static inline void da_monitor_destroy(void)
 {
 	da_monitor_reset_all();
+	da_monitor_sync_hook();
 }
 
 #elif RV_MON_TYPE == RV_MON_PER_TASK
@@ -367,6 +379,7 @@ static inline void da_monitor_destroy(void)
 
 	tracepoint_synchronize_unregister();
 	da_monitor_reset_all();
+	da_monitor_sync_hook();
 
 	rv_put_task_monitor_slot(task_mon_slot);
 	task_mon_slot = RV_PER_TASK_MONITOR_INIT;
@@ -573,13 +586,13 @@ static inline void da_monitor_destroy(void)
 	int bkt;
 
 	tracepoint_synchronize_unregister();
+	da_monitor_reset_all();
+	da_monitor_sync_hook();
 	/*
 	 * This function is called after all probes are disabled and no longer
 	 * pending, we can safely assume no concurrent user.
 	 */
-	synchronize_rcu();
 	hash_for_each_safe(da_monitor_ht, bkt, tmp, mon_storage, node) {
-		da_monitor_reset_hook(&mon_storage->rv.da_mon);
 		hash_del_rcu(&mon_storage->node);
 		kfree(mon_storage);
 	}
diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
index 4002b5247..28d3c74ca 100644
--- a/include/rv/ha_monitor.h
+++ b/include/rv/ha_monitor.h
@@ -37,6 +37,7 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
 #define da_monitor_event_hook ha_monitor_handle_constraint
 #define da_monitor_init_hook ha_monitor_init_env
 #define da_monitor_reset_hook ha_monitor_reset_env
+#define da_monitor_sync_hook() synchronize_rcu()
 
 #if !defined(HA_SKIP_AUTO_CLEANUP) && RV_MON_TYPE == RV_MON_PER_TASK
 /*
@@ -136,10 +137,13 @@ static enum hrtimer_restart ha_monitor_timer_callback(struct hrtimer *hrtimer);
 #define ha_get_ns() 0
 #endif /* HA_CLK_NS */
 
+static bool ha_mon_destroying;
+
 static int ha_monitor_init(void)
 {
 	int ret;
 
+	WRITE_ONCE(ha_mon_destroying, false);
 	ret = da_monitor_init();
 	if (ret == 0)
 		ha_monitor_enable_hook();
@@ -148,6 +152,7 @@ static int ha_monitor_init(void)
 
 static void ha_monitor_destroy(void)
 {
+	WRITE_ONCE(ha_mon_destroying, true);
 	ha_monitor_disable_hook();
 	da_monitor_destroy();
 }
@@ -288,12 +293,30 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
 	return false;
 }
 
+/*
+ * __ha_monitor_timer_callback - generic callback representation
+ *
+ * This callback runs in an RCU read-side critical section to allow the
+ * destruction sequence to easily synchronize_rcu() with all pending timers
+ * after asynchronously disabling them. The ha_mon_destroying check ensures
+ * any callback entering the RCU section after synchronize_rcu() completes
+ * will see the flag and bail out immediately.
+ */
 static inline void __ha_monitor_timer_callback(struct ha_monitor *ha_mon)
 {
-	enum states curr_state = READ_ONCE(ha_mon->da_mon.curr_state);
 	DECLARE_SEQ_BUF(env_string, ENV_BUFFER_SIZE);
-	u64 time_ns = ha_get_ns();
-
+	enum states curr_state;
+	u64 time_ns;
+
+	guard(rcu)();
+	if (unlikely(READ_ONCE(ha_mon_destroying)))
+		return;
+	/* Ensure consistent curr_state if we race with da_monitor_reset */
+	curr_state = smp_load_acquire(&ha_mon->da_mon.curr_state);
+	if (unlikely(!da_monitor_handling_event(&ha_mon->da_mon)))
+		return;
+
+	time_ns = ha_get_ns();
 	ha_get_env_string(&env_string, ha_mon, time_ns);
 	ha_react(curr_state, EVENT_NONE, env_string.buffer);
 	ha_trace_error_env(ha_mon, model_get_state_name(curr_state),
-- 
2.54.0



* [PATCH v4 07/13] rv: Add automatic cleanup handlers for per-task HA monitors
From: Gabriele Monaco @ 2026-06-01 15:38 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, Masami Hiramatsu,
	linux-trace-kernel
  Cc: Nam Cao, Wen Yang
In-Reply-To: <20260601153840.124372-1-gmonaco@redhat.com>

Hybrid automata monitors may start timers. Depending on the model, these
may remain active on an exiting task and cause false positives or even
access freed memory.

Add an enable/disable hook in the HA code, currently only populated by
the per-task handler for registration and deregistration.
The hook attaches to the sched_process_exit event and ensures the timer
is stopped for every exiting task. The handler is enabled automatically
but may be disabled, for instance if the monitor uses the event for
another purpose (in which case the monitor should still manually ensure
timers are stopped).

Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/ha_monitor.h                       | 60 +++++++++++++++++++
 kernel/trace/rv/monitors/nomiss/nomiss.c      |  4 +-
 kernel/trace/rv/monitors/opid/opid.c          |  4 +-
 kernel/trace/rv/monitors/stall/stall.c        |  4 +-
 .../rvgen/rvgen/templates/dot2k/main.c        |  4 +-
 5 files changed, 68 insertions(+), 8 deletions(-)

diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
index bd8705556..4002b5247 100644
--- a/include/rv/ha_monitor.h
+++ b/include/rv/ha_monitor.h
@@ -28,6 +28,7 @@ static inline void ha_monitor_init_env(struct da_monitor *da_mon);
 static inline void ha_monitor_reset_env(struct da_monitor *da_mon);
 static inline void ha_setup_timer(struct ha_monitor *ha_mon);
 static inline bool ha_cancel_timer(struct ha_monitor *ha_mon);
+static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon);
 static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
 					 enum states curr_state,
 					 enum events event,
@@ -37,6 +38,26 @@ static bool ha_monitor_handle_constraint(struct da_monitor *da_mon,
 #define da_monitor_init_hook ha_monitor_init_env
 #define da_monitor_reset_hook ha_monitor_reset_env
 
+#if !defined(HA_SKIP_AUTO_CLEANUP) && RV_MON_TYPE == RV_MON_PER_TASK
+/*
+ * Automatic cleanup handlers for per-task HA monitors, only skip if you know
+ * what you are doing (e.g. you want to implement cleanup manually in another
+ * handler doing more things).
+ */
+static void ha_handle_sched_process_exit(void *data, struct task_struct *p,
+					 bool group_dead);
+
+#define ha_monitor_enable_hook()                                             \
+	rv_attach_trace_probe(__stringify(MONITOR_NAME), sched_process_exit, \
+			      ha_handle_sched_process_exit)
+#define ha_monitor_disable_hook()                                            \
+	rv_detach_trace_probe(__stringify(MONITOR_NAME), sched_process_exit, \
+			      ha_handle_sched_process_exit)
+#else
+#define ha_monitor_enable_hook() ((void)0)
+#define ha_monitor_disable_hook() ((void)0)
+#endif
+
 #include <rv/da_monitor.h>
 #include <linux/seq_buf.h>
 
@@ -115,6 +136,22 @@ static enum hrtimer_restart ha_monitor_timer_callback(struct hrtimer *hrtimer);
 #define ha_get_ns() 0
 #endif /* HA_CLK_NS */
 
+static int ha_monitor_init(void)
+{
+	int ret;
+
+	ret = da_monitor_init();
+	if (ret == 0)
+		ha_monitor_enable_hook();
+	return ret;
+}
+
+static void ha_monitor_destroy(void)
+{
+	ha_monitor_disable_hook();
+	da_monitor_destroy();
+}
+
 /* Should be supplied by the monitor */
 static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs env, u64 time_ns);
 static bool ha_verify_constraint(struct ha_monitor *ha_mon,
@@ -200,6 +237,20 @@ static inline void ha_trace_error_env(struct ha_monitor *ha_mon,
 {
 	CONCATENATE(trace_error_env_, MONITOR_NAME)(id, curr_state, event, env);
 }
+
+#if !defined(HA_SKIP_AUTO_CLEANUP) && RV_MON_TYPE == RV_MON_PER_TASK
+static void ha_handle_sched_process_exit(void *data, struct task_struct *p,
+					 bool group_dead)
+{
+	struct da_monitor *da_mon = da_get_monitor(p);
+
+	if (likely(da_monitoring(da_mon))) {
+		da_monitor_reset(da_mon);
+		ha_cancel_timer_sync(to_ha_monitor(da_mon));
+	}
+}
+#endif
+
 #endif /* RV_MON_TYPE */
 
 /*
@@ -412,6 +463,10 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
 {
 	return timer_delete(&ha_mon->timer);
 }
+static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon)
+{
+	timer_delete_sync(&ha_mon->timer);
+}
 #elif HA_TIMER_TYPE == HA_TIMER_HRTIMER
 /*
  * Helper functions to handle the monitor timer.
@@ -463,6 +518,10 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
 {
 	return hrtimer_try_to_cancel(&ha_mon->hrtimer) == 1;
 }
+static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon)
+{
+	hrtimer_cancel(&ha_mon->hrtimer);
+}
 #else /* HA_TIMER_NONE */
 /*
  * Start function is intentionally not defined, monitors using timers must
@@ -473,6 +532,7 @@ static inline bool ha_cancel_timer(struct ha_monitor *ha_mon)
 {
 	return false;
 }
+static inline void ha_cancel_timer_sync(struct ha_monitor *ha_mon) { }
 #endif
 
 #endif
diff --git a/kernel/trace/rv/monitors/nomiss/nomiss.c b/kernel/trace/rv/monitors/nomiss/nomiss.c
index 31f90f363..8ead8783c 100644
--- a/kernel/trace/rv/monitors/nomiss/nomiss.c
+++ b/kernel/trace/rv/monitors/nomiss/nomiss.c
@@ -227,7 +227,7 @@ static int enable_nomiss(void)
 {
 	int retval;
 
-	retval = da_monitor_init();
+	retval = ha_monitor_init();
 	if (retval)
 		return retval;
 
@@ -263,7 +263,7 @@ static void disable_nomiss(void)
 	rv_detach_trace_probe("nomiss", sched_switch, handle_sched_switch);
 	rv_detach_trace_probe("nomiss", sched_wakeup, handle_sched_wakeup);
 
-	da_monitor_destroy();
+	ha_monitor_destroy();
 }
 
 static struct rv_monitor rv_this = {
diff --git a/kernel/trace/rv/monitors/opid/opid.c b/kernel/trace/rv/monitors/opid/opid.c
index 4594c7c46..2922318c6 100644
--- a/kernel/trace/rv/monitors/opid/opid.c
+++ b/kernel/trace/rv/monitors/opid/opid.c
@@ -73,7 +73,7 @@ static int enable_opid(void)
 {
 	int retval;
 
-	retval = da_monitor_init();
+	retval = ha_monitor_init();
 	if (retval)
 		return retval;
 
@@ -90,7 +90,7 @@ static void disable_opid(void)
 	rv_detach_trace_probe("opid", sched_set_need_resched_tp, handle_sched_need_resched);
 	rv_detach_trace_probe("opid", sched_waking, handle_sched_waking);
 
-	da_monitor_destroy();
+	ha_monitor_destroy();
 }
 
 /*
diff --git a/kernel/trace/rv/monitors/stall/stall.c b/kernel/trace/rv/monitors/stall/stall.c
index 9ccfda6b0..3c38fb1a0 100644
--- a/kernel/trace/rv/monitors/stall/stall.c
+++ b/kernel/trace/rv/monitors/stall/stall.c
@@ -103,7 +103,7 @@ static int enable_stall(void)
 {
 	int retval;
 
-	retval = da_monitor_init();
+	retval = ha_monitor_init();
 	if (retval)
 		return retval;
 
@@ -120,7 +120,7 @@ static void disable_stall(void)
 	rv_detach_trace_probe("stall", sched_switch, handle_sched_switch);
 	rv_detach_trace_probe("stall", sched_wakeup, handle_sched_wakeup);
 
-	da_monitor_destroy();
+	ha_monitor_destroy();
 }
 
 static struct rv_monitor rv_this = {
diff --git a/tools/verification/rvgen/rvgen/templates/dot2k/main.c b/tools/verification/rvgen/rvgen/templates/dot2k/main.c
index bf0999f66..889446760 100644
--- a/tools/verification/rvgen/rvgen/templates/dot2k/main.c
+++ b/tools/verification/rvgen/rvgen/templates/dot2k/main.c
@@ -35,7 +35,7 @@ static int enable_%%MODEL_NAME%%(void)
 {
 	int retval;
 
-	retval = da_monitor_init();
+	retval = %%MONITOR_CLASS%%_monitor_init();
 	if (retval)
 		return retval;
 
@@ -50,7 +50,7 @@ static void disable_%%MODEL_NAME%%(void)
 
 %%TRACEPOINT_DETACH%%
 
-	da_monitor_destroy();
+	%%MONITOR_CLASS%%_monitor_destroy();
 }
 
 /*
-- 
2.54.0



* [PATCH v4 06/13] rv: Do not rely on clean monitor when initialising HA
From: Gabriele Monaco @ 2026-06-01 15:38 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, linux-trace-kernel
  Cc: Wen Yang, Nam Cao
In-Reply-To: <20260601153840.124372-1-gmonaco@redhat.com>

Hybrid Automata monitors hook into the DA implementation when doing
da_monitor_reset(). This function is called both on initialisation and
teardown. HA monitors try to cancel a timer only when it's initialised,
relying on the da_mon->monitoring flag. This flag could however be
corrupted during initialisation. This happens for instance on per-task
monitors that share the same storage with different types of monitors
like LTL or in case of races during a previous teardown.

Stop relying on the monitoring flag during initialisation and assume it
can have any value: use a separate da_monitor_reset_state() that skips
timer cancellation.
New monitors (e.g. new tasks) are always zero-initialised so it is safe
to rely on the monitoring flag for those.

Reported-by: Wen Yang <wen.yang@linux.dev>
Closes: https://lore.kernel.org/lkml/d02c656aada7d071f083460a5c9a454363669b61.1778522945.git.wen.yang@linux.dev
Suggested-by: Nam Cao <namcao@linutronix.de>
Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
Reviewed-by: Wen Yang <wen.yang@linux.dev>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/da_monitor.h | 91 +++++++++++++++++++++++++++++++++--------
 include/rv/ha_monitor.h |  2 +-
 2 files changed, 76 insertions(+), 17 deletions(-)

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index 60dc39f26..ec9bc88bd 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -76,14 +76,22 @@ static void react(enum states curr_state, enum events event)
 		 model_get_state_name(curr_state));
 }
 
+/*
+ * da_monitor_reset_state - reset a monitor and setting it to init state
+ */
+static inline void da_monitor_reset_state(struct da_monitor *da_mon)
+{
+	WRITE_ONCE(da_mon->monitoring, 0);
+	da_mon->curr_state = model_get_initial_state();
+}
+
 /*
  * da_monitor_reset - reset a monitor and setting it to init state
  */
 static inline void da_monitor_reset(struct da_monitor *da_mon)
 {
 	da_monitor_reset_hook(da_mon);
-	WRITE_ONCE(da_mon->monitoring, 0);
-	da_mon->curr_state = model_get_initial_state();
+	da_monitor_reset_state(da_mon);
 }
 
 /*
@@ -158,12 +166,28 @@ static struct da_monitor *da_get_monitor(void)
 	return &DA_MON_NAME;
 }
 
+/*
+ * __da_monitor_reset_all - reset the single monitor
+ */
+static void __da_monitor_reset_all(void (*reset)(struct da_monitor *))
+{
+	reset(da_get_monitor());
+}
+
 /*
  * da_monitor_reset_all - reset the single monitor
  */
 static void da_monitor_reset_all(void)
 {
-	da_monitor_reset(da_get_monitor());
+	__da_monitor_reset_all(da_monitor_reset);
+}
+
+/*
+ * da_monitor_reset_state_all - reset the single monitor
+ */
+static inline void da_monitor_reset_state_all(void)
+{
+	__da_monitor_reset_all(da_monitor_reset_state);
 }
 
 /*
@@ -171,7 +195,7 @@ static void da_monitor_reset_all(void)
  */
 static inline int da_monitor_init(void)
 {
-	da_monitor_reset_all();
+	da_monitor_reset_state_all();
 	return 0;
 }
 
@@ -202,25 +226,41 @@ static struct da_monitor *da_get_monitor(void)
 }
 
 /*
- * da_monitor_reset_all - reset all CPUs' monitor
+ * __da_monitor_reset_all - reset all CPUs' monitor
  */
-static void da_monitor_reset_all(void)
+static void __da_monitor_reset_all(void (*reset)(struct da_monitor *))
 {
 	struct da_monitor *da_mon;
 	int cpu;
 
 	for_each_cpu(cpu, cpu_online_mask) {
 		da_mon = per_cpu_ptr(&DA_MON_NAME, cpu);
-		da_monitor_reset(da_mon);
+		reset(da_mon);
 	}
 }
 
+/*
+ * da_monitor_reset_all - reset all CPUs' monitor
+ */
+static void da_monitor_reset_all(void)
+{
+	__da_monitor_reset_all(da_monitor_reset);
+}
+
+/*
+ * da_monitor_reset_state_all - reset all CPUs' monitor
+ */
+static inline void da_monitor_reset_state_all(void)
+{
+	__da_monitor_reset_all(da_monitor_reset_state);
+}
+
 /*
  * da_monitor_init - initialize all CPUs' monitor
  */
 static inline int da_monitor_init(void)
 {
-	da_monitor_reset_all();
+	da_monitor_reset_state_all();
 	return 0;
 }
 
@@ -269,19 +309,29 @@ static inline da_id_type da_get_id(struct da_monitor *da_mon)
 	return da_get_target(da_mon)->pid;
 }
 
-static void da_monitor_reset_all(void)
+static void __da_monitor_reset_all(void (*reset)(struct da_monitor *))
 {
 	struct task_struct *g, *p;
 	int cpu;
 
 	read_lock(&tasklist_lock);
 	for_each_process_thread(g, p)
-		da_monitor_reset(da_get_monitor(p));
+		reset(da_get_monitor(p));
 	for_each_present_cpu(cpu)
-		da_monitor_reset(da_get_monitor(idle_task(cpu)));
+		reset(da_get_monitor(idle_task(cpu)));
 	read_unlock(&tasklist_lock);
 }
 
+static void da_monitor_reset_all(void)
+{
+	__da_monitor_reset_all(da_monitor_reset);
+}
+
+static inline void da_monitor_reset_state_all(void)
+{
+	__da_monitor_reset_all(da_monitor_reset_state);
+}
+
 /*
  * da_monitor_init - initialize the per-task monitor
  *
@@ -298,7 +348,7 @@ static int da_monitor_init(void)
 
 	task_mon_slot = slot;
 
-	da_monitor_reset_all();
+	da_monitor_reset_state_all();
 	return 0;
 }
 
@@ -490,15 +540,24 @@ static inline void da_destroy_storage(da_id_type id)
 	kfree_rcu(mon_storage, rcu);
 }
 
-static void da_monitor_reset_all(void)
+static void __da_monitor_reset_all(void (*reset)(struct da_monitor *))
 {
 	struct da_monitor_storage *mon_storage;
 	int bkt;
 
-	rcu_read_lock();
+	guard(rcu)();
 	hash_for_each_rcu(da_monitor_ht, bkt, mon_storage, node)
-		da_monitor_reset(&mon_storage->rv.da_mon);
-	rcu_read_unlock();
+		reset(&mon_storage->rv.da_mon);
+}
+
+static void da_monitor_reset_all(void)
+{
+	__da_monitor_reset_all(da_monitor_reset);
+}
+
+static inline void da_monitor_reset_state_all(void)
+{
+	__da_monitor_reset_all(da_monitor_reset_state);
 }
 
 static inline int da_monitor_init(void)
diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
index d59507e8c..bd8705556 100644
--- a/include/rv/ha_monitor.h
+++ b/include/rv/ha_monitor.h
@@ -153,12 +153,12 @@ static inline void ha_monitor_init_env(struct da_monitor *da_mon)
  * Called from a hook in the DA reset functions, it supplies the da_mon
  * corresponding to the current ha_mon.
  * Not all hybrid automata require the timer, still clear it for simplicity.
+ * Monitors that never started have an uninitialized timer; do not stop those.
  */
 static inline void ha_monitor_reset_env(struct da_monitor *da_mon)
 {
 	struct ha_monitor *ha_mon = to_ha_monitor(da_mon);
 
-	/* Initialisation resets the monitor before initialising the timer */
 	if (likely(da_monitoring(da_mon)))
 		ha_cancel_timer(ha_mon);
 }
-- 
2.54.0



* [PATCH v4 05/13] rv: Fix monitor start ordering and memory ordering for monitoring flag
From: Gabriele Monaco @ 2026-06-01 15:38 UTC (permalink / raw)
  To: linux-kernel, Steven Rostedt, Gabriele Monaco, linux-trace-kernel
  Cc: Wen Yang, Nam Cao
In-Reply-To: <20260601153840.124372-1-gmonaco@redhat.com>

From: Wen Yang <wen.yang@linux.dev>

da_monitor_start() sets monitoring=1 before calling da_monitor_init_hook(),
which may race with the sched_switch handler:

  da_monitor_start()               sched_switch handler
  -------------------------        ---------------------------------
  da_mon->monitoring = 1;
                                   if (da_monitoring(da_mon))  /* true  */
                                       ha_start_timer_ns(...);
                                       /* hrtimer->base == NULL, crash */
  da_monitor_init_hook(da_mon);
  /* hrtimer_setup() sets base */

Fix the ordering and pair with release/acquire semantics:

  da_monitor_init_hook(da_mon);
  smp_store_release(&da_mon->monitoring, 1);    /* da_monitor_start()  */
  return smp_load_acquire(&da_mon->monitoring); /* da_monitoring()     */

On ARM64 a plain STR + LDR does not form a release-acquire pair, so
the load can observe monitoring=1 while hrtimer->base is still NULL.
The plain accesses are also data races under KCSAN.

Use WRITE_ONCE for the monitoring=0 store in da_monitor_reset() to
cover the reset path.
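
For readers less familiar with the pattern, the publish/observe ordering
described above can be sketched in userspace with C11 atomics. This is only
an illustration under stated assumptions, not the kernel code: struct
demo_monitor, demo_monitor_start(), demo_monitoring(), demo_handler() and
demo_run() are hypothetical names, and memory_order_release/acquire stand in
for smp_store_release()/smp_load_acquire().

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct da_monitor plus its timer state. */
struct demo_monitor {
	int timer_base;		/* plays the role of hrtimer->base */
	int observed;		/* what the reader saw after the flag was set */
	atomic_int monitoring;	/* plays the role of da_mon->monitoring */
};

/* Initialise everything first, then publish the flag with release semantics. */
static void demo_monitor_start(struct demo_monitor *mon)
{
	mon->timer_base = 42;	/* stands in for da_monitor_init_hook() */
	atomic_store_explicit(&mon->monitoring, 1, memory_order_release);
}

/* Acquire load: a reader that sees 1 also sees the initialised timer_base. */
static int demo_monitoring(struct demo_monitor *mon)
{
	return atomic_load_explicit(&mon->monitoring, memory_order_acquire);
}

/* Plays the role of the sched_switch handler running on another CPU. */
static void *demo_handler(void *arg)
{
	struct demo_monitor *mon = arg;

	while (!demo_monitoring(mon))
		;		/* spin until the monitor is published */
	mon->observed = mon->timer_base;
	return NULL;
}

/* Run writer and reader concurrently and return what the reader observed. */
static int demo_run(void)
{
	struct demo_monitor mon = { .timer_base = 0, .observed = -1 };
	pthread_t reader;

	atomic_init(&mon.monitoring, 0);
	if (pthread_create(&reader, NULL, demo_handler, &mon))
		return -1;
	demo_monitor_start(&mon);
	pthread_join(reader, NULL);
	return mon.observed;
}
```

With the release store placed after the initialisation, a reader that observes
monitoring=1 is guaranteed to read the initialised timer_base. If the flag
were set first with a plain store, as in the old ordering, that guarantee
would not hold.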

Fixes: 792575348ff7 ("rv/include: Add deterministic automata monitor definition via C macros")
Signed-off-by: Wen Yang <wen.yang@linux.dev>
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/rv/da_monitor.h | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
index a7e103654..60dc39f26 100644
--- a/include/rv/da_monitor.h
+++ b/include/rv/da_monitor.h
@@ -82,7 +82,7 @@ static void react(enum states curr_state, enum events event)
 static inline void da_monitor_reset(struct da_monitor *da_mon)
 {
 	da_monitor_reset_hook(da_mon);
-	da_mon->monitoring = 0;
+	WRITE_ONCE(da_mon->monitoring, 0);
 	da_mon->curr_state = model_get_initial_state();
 }
 
@@ -95,8 +95,9 @@ static inline void da_monitor_reset(struct da_monitor *da_mon)
 static inline void da_monitor_start(struct da_monitor *da_mon)
 {
 	da_mon->curr_state = model_get_initial_state();
-	da_mon->monitoring = 1;
 	da_monitor_init_hook(da_mon);
+	/* Pairs with smp_load_acquire in da_monitoring(). */
+	smp_store_release(&da_mon->monitoring, 1);
 }
 
 /*
@@ -104,7 +105,8 @@ static inline void da_monitor_start(struct da_monitor *da_mon)
  */
 static inline bool da_monitoring(struct da_monitor *da_mon)
 {
-	return da_mon->monitoring;
+	/* Pairs with smp_store_release in da_monitor_start(). */
+	return smp_load_acquire(&da_mon->monitoring);
 }
 
 /*
-- 
2.54.0


