Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* [PATCH v3 4/7] tracing/probes: Support field specifier option for typecast
From: Masami Hiramatsu (Google) @ 2026-06-14 14:54 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178144880282.159464.16882854283219530040.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Add a field specifier option for the typecast. This works like
container_of() macro.

    (STRUCT[,FIELD[.FIELD2...]])VAR

This is equivalent to :

    container_of(VAR, struct STRUCT, FIELD[.FIELD2...])

For example:

 echo "f tick_nohz_handler next_tick=(tick_sched,sched_timer)timer->next_tick" >> dynamic_events

This will trace tick_nohz_handler() with its tick_sched::next_tick which
is converted from @timer by contianer_of(tick, struct tick_sched, sched_timer).
So, if you enabkle both fprobes:tick_nohz_handler__entry and
timer:hrtimer_expire_entry events, we will see something like:


          <idle>-0       [002] d.h1.  3778.087272: hrtimer_expire_entry: hrtimer=00000000d63db328 f
unction=tick_nohz_handler now=3777450051040
          <idle>-0       [002] d.h1.  3778.087281: tick_nohz_handler__entry: (tick_nohz_handler+0x4
/0x140) next_tick=3777450000000


Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v3:
  - Fix error caret position.
 Changes in v2:
  - Use byteoffset for typecast field offset instead of bitoffset. This fixes negative modulo calculation.
  - Check whether a field is specified after typecast.
  - Reject if typecast field option  has arrow operator.
---
 Documentation/trace/eprobetrace.rst |    5 +
 Documentation/trace/fprobetrace.rst |    8 +-
 Documentation/trace/kprobetrace.rst |    8 +-
 kernel/trace/trace.c                |    4 -
 kernel/trace/trace_probe.c          |  179 ++++++++++++++++++++++++-----------
 kernel/trace/trace_probe.h          |    5 +
 6 files changed, 142 insertions(+), 67 deletions(-)

diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index cd0b4aa7f896..680e0af43d5d 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -49,7 +49,10 @@ Synopsis of eprobe_events
   (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
                   a pointer to STRUCT and then derference the pointer defined by
                   ->MEMBER. Note that when this is used, the FIELD name does not
-                  need to be prefixed with a '$'.
+                  need to be prefixed with a '$'. ASGN can be specified optionally.
+		  If ASGN is specified, FIELD will be cast to the same offset
+		  position as the ASGN member, rather than to the beginning of
+		  the STRUCT.
   (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
 		  also be used with another FETCHARG instead of FIELD.
 
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 6b8bb27bb62d..290a9e6f7491 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -57,10 +57,12 @@ Synopsis of fprobe-events
                   (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
                   (x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
                   and bitfield are supported.
-  (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+  (STRUCT[,ASGN])FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
                   a pointer to STRUCT and then derference the pointer defined by
-                  ->MEMBER.
-  (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+                  ->MEMBER. ASGN can be specified optionally. If ASGN is specified,
+		  FIELD will be cast to the same offset position as the ASGN member,
+		  rather than to the beginning of the STRUCT.
+  (STRUCT[,ASGN])(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
                  also be used with another FETCHARG instead of FIELD.
 
   (\*1) This is available only when BTF is enabled.
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index c4382765d5b2..a62707e6a9f2 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -61,11 +61,13 @@ Synopsis of kprobe_events
 		  (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
                   "string", "ustring", "symbol", "symstr" and bitfield are
                   supported.
-  (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+  (STRUCT[,ASGN])FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
                   a pointer to STRUCT and then derference the pointer defined by
                   ->MEMBER. Note that this is available only when the probe is
-		   on function entry.
-  (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+		   on function entry. ASGN can be specified optionally. If ASGN
+		   is specified, FIELD will be cast to the same offset position
+		   as the ASGN member, rather than to the beginning of the STRUCT.
+  (STRUCT[,ASGN])(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
                  also be used with another FETCHARG instead of FIELD.
 
   (\*1) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4f70318918c2..0e36af853199 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4325,8 +4325,8 @@ static const char readme_msg[] =
 #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
 	"\t           $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
 #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
-	"\t           [(structname)]<argname>[->field[->field|.field...]],\n"
-	"\t           [(structname)](fetcharg)->field[->field|.field...],\n"
+	"\t           [(structname[,field])]<argname>[->field[->field|.field...]],\n"
+	"\t           [(structname[,field])](fetcharg)->field[->field|.field...],\n"
 #endif
 #else
 	"\t           $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index dba73aaa8ade..3ead93de2d93 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -574,6 +574,65 @@ static int split_next_field(char *varname, char **next_field,
 	return ret;
 }
 
+/* Inner loop for solving dot operator ('.'). Return bit-offset of the given field */
+static int get_bitoffset_of_field(char **pfieldname, const struct btf_type **ptype,
+				  struct traceprobe_parse_context *ctx)
+{
+	const struct btf_type *type = *ptype;
+	const struct btf_member *field;
+	struct btf *btf = ctx_btf(ctx);
+	char *fieldname = *pfieldname;
+	int bitoffs = 0;
+	u32 anon_offs;
+	char *next;
+	int is_ptr;
+	s32 tid;
+
+	do {
+		next = NULL;
+		is_ptr = split_next_field(fieldname, &next, ctx);
+		if (is_ptr < 0)
+			return is_ptr;
+
+		anon_offs = 0;
+		field = btf_find_struct_member(btf, type, fieldname,
+						&anon_offs);
+		if (IS_ERR(field)) {
+			trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+			return PTR_ERR(field);
+		}
+		if (!field) {
+			trace_probe_log_err(ctx->offset, NO_BTF_FIELD);
+			return -ENOENT;
+		}
+		/* Add anonymous structure/union offset */
+		bitoffs += anon_offs;
+
+		/* Accumulate the bit-offsets of the dot-connected fields */
+		if (btf_type_kflag(type)) {
+			bitoffs += BTF_MEMBER_BIT_OFFSET(field->offset);
+			ctx->last_bitsize = BTF_MEMBER_BITFIELD_SIZE(field->offset);
+		} else {
+			bitoffs += field->offset;
+			ctx->last_bitsize = 0;
+		}
+
+		type = btf_type_skip_modifiers(btf, field->type, &tid);
+		if (!type) {
+			trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+			return -EINVAL;
+		}
+
+		if (next)
+			ctx->offset += next - fieldname;
+		fieldname = next;
+	} while (!is_ptr && fieldname);
+
+	*pfieldname = fieldname;
+	*ptype = type;
+
+	return bitoffs;
+}
 /*
  * Parse the field of data structure. The @type must be a pointer type
  * pointing the target data structure type.
@@ -583,16 +642,14 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 			   struct traceprobe_parse_context *ctx)
 {
 	struct fetch_insn *code = *pcode;
-	const struct btf_member *field;
-	u32 bitoffs, anon_offs;
-	bool is_struct = ctx->struct_btf != NULL;
 	struct btf *btf = ctx_btf(ctx);
-	char *next;
-	int is_ptr;
+	bool is_first_field = true;
+	int bitoffs;
 	s32 tid;
 
 	do {
-		if (!is_struct) {
+		/* For the first field of typecast, @type will be the target structure type. */
+		if (!(is_first_field && ctx->struct_btf)) {
 			/* Outer loop for solving arrow operator ('->') */
 			if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
 				trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
@@ -606,60 +663,25 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 				return -EINVAL;
 			}
 		}
-		/* Only the first type can skip being a pointer */
-		is_struct = false;
-
-		bitoffs = 0;
-		do {
-			/* Inner loop for solving dot operator ('.') */
-			next = NULL;
-			is_ptr = split_next_field(fieldname, &next, ctx);
-			if (is_ptr < 0)
-				return is_ptr;
-
-			anon_offs = 0;
-			field = btf_find_struct_member(btf, type, fieldname,
-						       &anon_offs);
-			if (IS_ERR(field)) {
-				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
-				return PTR_ERR(field);
-			}
-			if (!field) {
-				trace_probe_log_err(ctx->offset, NO_BTF_FIELD);
-				return -ENOENT;
-			}
-			/* Add anonymous structure/union offset */
-			bitoffs += anon_offs;
-
-			/* Accumulate the bit-offsets of the dot-connected fields */
-			if (btf_type_kflag(type)) {
-				bitoffs += BTF_MEMBER_BIT_OFFSET(field->offset);
-				ctx->last_bitsize = BTF_MEMBER_BITFIELD_SIZE(field->offset);
-			} else {
-				bitoffs += field->offset;
-				ctx->last_bitsize = 0;
-			}
-
-			type = btf_type_skip_modifiers(btf, field->type, &tid);
-			if (!type) {
-				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
-				return -EINVAL;
-			}
-
-			ctx->offset += next - fieldname;
-			fieldname = next;
-		} while (!is_ptr && fieldname);
 
+		bitoffs = get_bitoffset_of_field(&fieldname, &type, ctx);
+		if (bitoffs < 0)
+			return bitoffs;
 		if (++code == end) {
 			trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
 			return -EINVAL;
 		}
 		code->op = FETCH_OP_DEREF;	/* TODO: user deref support */
 		code->offset = bitoffs / 8;
+		if (is_first_field && ctx->struct_btf) {
+			/* The first field can be typecasted with field option. */
+			code->offset -= ctx->prefix_byteoffs;
+		}
 		*pcode = code;
 
 		ctx->last_bitoffs = bitoffs % 8;
 		ctx->last_type = type;
+		is_first_field = false;
 	} while (fieldname);
 
 	return 0;
@@ -690,6 +712,11 @@ static int parse_btf_arg(char *varname,
 				    NOSUP_DAT_ARG);
 		return -EOPNOTSUPP;
 	}
+	if (!field && ctx->struct_btf) {
+		/* Typecast without field option is not supported */
+		trace_probe_log_err(ctx->offset + strlen(varname), TYPECAST_REQ_FIELD);
+		return -EOPNOTSUPP;
+	}
 
 	if (ctx->flags & TPARG_FL_TEVENT) {
 		ret = parse_trace_event(varname, code, ctx);
@@ -700,8 +727,7 @@ static int parse_btf_arg(char *varname,
 		/* TEVENT is only here via a typecast */
 		if (WARN_ON_ONCE(ctx->struct_btf == NULL))
 			return -EINVAL;
-		type = ctx->last_struct;
-		goto found_type;
+		goto found;
 	}
 
 	if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
@@ -763,7 +789,6 @@ static int parse_btf_arg(char *varname,
 		type = ctx->last_struct;
 	else
 		type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
-found_type:
 	if (!type) {
 		trace_probe_log_err(ctx->offset, BAD_BTF_TID);
 		return -EINVAL;
@@ -832,6 +857,46 @@ static int query_btf_struct(const char *sname, struct traceprobe_parse_context *
 	return 0;
 }
 
+static int parse_btf_casttype(char *casttype, struct traceprobe_parse_context *ctx)
+{
+	char *field;
+	int ret;
+
+	/* Field option - evaluated later. */
+	field = strchr(casttype, ',');
+	if (field)
+		*field++ = '\0';
+
+	ret = query_btf_struct(casttype, ctx);
+	if (ret < 0) {
+		trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
+		return -EINVAL;
+	}
+
+	if (field) {
+		struct btf_type *type = (struct btf_type *)ctx->last_struct;
+
+		ctx->offset += field - casttype;
+		ret = get_bitoffset_of_field(&field, &ctx->last_struct, ctx);
+		if (ret < 0)
+			return ret;
+		if (ret % 8) {
+			trace_probe_log_err(ctx->offset, TYPECAST_NOT_ALIGNED);
+			return -EINVAL;
+		}
+		if (field != NULL) {
+			/* this means @field skips an arrow operator ("->"). */
+			trace_probe_log_err(ctx->offset - 2, TYPECAST_BAD_ARROW);
+			return -EINVAL;
+		}
+		ctx->prefix_byteoffs = ret / 8;
+		/* Restore the original struct type (overwritten by get_bitoffset_of_field) */
+		ctx->last_struct = type;
+	}
+
+	return ret;
+}
+
 /* Find the matching closing parenthesis for a given opening parenthesis. */
 static char *find_matched_close_paren(char *s)
 {
@@ -915,11 +980,10 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 		nested = true;
 	}
 
-	ret = query_btf_struct(arg + 1, ctx);
-	if (ret < 0) {
-		trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
-		return -EINVAL;
-	}
+	ctx->offset = orig_offset + 1;	/* for the '(' */
+	ret = parse_btf_casttype(arg + 1, ctx);
+	if (ret < 0)
+		return ret;
 
 	ctx->offset = orig_offset + tmp - arg;
 	/* If it is nested, tmp points to the field name. */
@@ -927,6 +991,7 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 		ret = parse_btf_field(tmp, ctx->last_struct, pcode, end, ctx);
 	else
 		ret = parse_btf_arg(tmp, pcode, end, ctx);
+	ctx->prefix_byteoffs = 0;
 	return ret;
 }
 
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 982d32a5df8b..44f113faae61 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -436,6 +436,7 @@ struct traceprobe_parse_context {
 	unsigned int flags;
 	int offset;
 	int nested_level;
+	int prefix_byteoffs;	/* The byte offset of the prefix field of typecast */
 };
 
 #define TRACEPROBE_MAX_NESTED_LEVEL 3
@@ -576,7 +577,9 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
 	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),  \
 	C(TYPECAST_NOT_EVENT,	"Typecasts are only for eprobe fields"), \
 	C(TYPECAST_REQ_FIELD,	"Typecast requires a field access"),	\
-	C(TOO_MANY_NESTED,	"Too many nested typecasts/dereferences"),
+	C(TOO_MANY_NESTED,	"Too many nested typecasts/dereferences"), \
+	C(TYPECAST_NOT_ALIGNED,	"Typecast field option is not byte-aligned"), \
+	C(TYPECAST_BAD_ARROW,	"Typecast field option does not support -> operator"),
 
 #undef C
 #define C(a, b)		TP_ERR_##a


^ permalink raw reply related

* [PATCH v3 3/7] tracing/probes: Support nested typecast
From: Masami Hiramatsu (Google) @ 2026-06-14 14:53 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178144880282.159464.16882854283219530040.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

When we hit an open parenthesis right after typecast closing
parenthesis, it means we have nested typecast. This allows us to
typecast a generic data member in a structure to a pointer to
another structure.

For example, to cast a DATA_MEMBER of VAR structure to STRUCT pointer
and get MEMBER value.

  (STRUCT)(VAR->DATA_MEMBER)->MEMBER

Also, we can nest typecast.

  (STRUCT1)((STRUCT2)$ARG->FIELD2)->FIELD1

Currently the max nest level is limited to 3.

This also allows user to use typecasting for registers or stacks on
kprobe events. e.g.

  (STRUCT)(%ax)->MEMBER

  (STRUCT)($stack0)->MEMBER


Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v2:
  - Fix to skip "->" after closing parenthetsis.
---
 Documentation/trace/eprobetrace.rst |    2 +
 Documentation/trace/fprobetrace.rst |    2 +
 Documentation/trace/kprobetrace.rst |    2 +
 kernel/trace/trace.c                |    1 
 kernel/trace/trace_probe.c          |   76 ++++++++++++++++++++++++++++++++---
 kernel/trace/trace_probe.h          |    7 +++
 6 files changed, 82 insertions(+), 8 deletions(-)

diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index fe3602540569..cd0b4aa7f896 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -50,6 +50,8 @@ Synopsis of eprobe_events
                   a pointer to STRUCT and then derference the pointer defined by
                   ->MEMBER. Note that when this is used, the FIELD name does not
                   need to be prefixed with a '$'.
+  (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+		  also be used with another FETCHARG instead of FIELD.
 
 Types
 -----
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 7435ded2d66d..6b8bb27bb62d 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -60,6 +60,8 @@ Synopsis of fprobe-events
   (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
                   a pointer to STRUCT and then derference the pointer defined by
                   ->MEMBER.
+  (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+                 also be used with another FETCHARG instead of FIELD.
 
   (\*1) This is available only when BTF is enabled.
   (\*2) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index f73614997d52..c4382765d5b2 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -65,6 +65,8 @@ Synopsis of kprobe_events
                   a pointer to STRUCT and then derference the pointer defined by
                   ->MEMBER. Note that this is available only when the probe is
 		   on function entry.
+  (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+                 also be used with another FETCHARG instead of FIELD.
 
   (\*1) only for the probe on function entry (offs == 0). Note, this argument access
         is best effort, because depending on the argument type, it may be passed on
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index aa93e7b01146..4f70318918c2 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4326,6 +4326,7 @@ static const char readme_msg[] =
 	"\t           $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
 #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
 	"\t           [(structname)]<argname>[->field[->field|.field...]],\n"
+	"\t           [(structname)](fetcharg)->field[->field|.field...],\n"
 #endif
 #else
 	"\t           $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 9158f1f22a62..dba73aaa8ade 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -832,10 +832,35 @@ static int query_btf_struct(const char *sname, struct traceprobe_parse_context *
 	return 0;
 }
 
+/* Find the matching closing parenthesis for a given opening parenthesis. */
+static char *find_matched_close_paren(char *s)
+{
+	char *p = s;
+	int count = 0;
+
+	while (*p) {
+		if (*p == '(')
+			count++;
+		else if (*p == ')') {
+			if (--count == 0)
+				return p;
+		}
+		p++;
+	}
+	return NULL;
+}
+
+static int
+parse_probe_arg(char *arg, const struct fetch_type *type,
+		struct fetch_insn **pcode, struct fetch_insn *end,
+		struct traceprobe_parse_context *ctx);
+
 static int handle_typecast(char *arg, struct fetch_insn **pcode,
 			   struct fetch_insn *end,
 			   struct traceprobe_parse_context *ctx)
 {
+	int orig_offset = ctx->offset;
+	bool nested = false;
 	char *tmp;
 	int ret;
 
@@ -852,19 +877,56 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 				    DEREF_OPEN_BRACE);
 		return -EINVAL;
 	}
-	*tmp = '\0';
-	ret = query_btf_struct(arg + 1, ctx);
-	*tmp = ')';
+	*tmp++ = '\0';
+
+	/* Handle the nested structure like (STRUCT)(VAR->FIELD)->... */
+	if (*tmp == '(') {
+		char *close = find_matched_close_paren(tmp);
+
+		ctx->offset += tmp - arg;
+		if (!close) {
+			trace_probe_log_err(ctx->offset, DEREF_OPEN_BRACE);
+			return -EINVAL;
+		}
+		/* We expect a field access for typecast */
+		if (close[1] != '-' || close[2] != '>') {
+			trace_probe_log_err(ctx->offset + close - tmp + 1,
+					    TYPECAST_REQ_FIELD);
+			return -EINVAL;
+		}
 
+		ctx->nested_level++;
+		if (ctx->nested_level > TRACEPROBE_MAX_NESTED_LEVEL) {
+			trace_probe_log_err(ctx->offset, TOO_MANY_NESTED);
+			return -E2BIG;
+		}
+		*close = '\0';
+
+		ctx->offset += 1;	/* for the '(' */
+		/* We need to parse the nested one */
+		ret = parse_probe_arg(tmp + 1, find_fetch_type(NULL, ctx->flags),
+				pcode, end, ctx);
+		if (ret < 0)
+			return ret;
+		ctx->nested_level--;
+		clear_struct_btf(ctx);
+
+		tmp = close + 3;/* Skip "->" after closing parenthesis */
+		nested = true;
+	}
+
+	ret = query_btf_struct(arg + 1, ctx);
 	if (ret < 0) {
 		trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
 		return -EINVAL;
 	}
 
-	tmp++;
-
-	ctx->offset += tmp - arg;
-	ret = parse_btf_arg(tmp, pcode, end, ctx);
+	ctx->offset = orig_offset + tmp - arg;
+	/* If it is nested, tmp points to the field name. */
+	if (nested)
+		ret = parse_btf_field(tmp, ctx->last_struct, pcode, end, ctx);
+	else
+		ret = parse_btf_arg(tmp, pcode, end, ctx);
 	return ret;
 }
 
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 883938a74aee..982d32a5df8b 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -435,8 +435,11 @@ struct traceprobe_parse_context {
 	struct trace_probe *tp;
 	unsigned int flags;
 	int offset;
+	int nested_level;
 };
 
+#define TRACEPROBE_MAX_NESTED_LEVEL 3
+
 extern int traceprobe_parse_probe_arg(struct trace_probe *tp, int i,
 				      const char *argv,
 				      struct traceprobe_parse_context *ctx);
@@ -571,7 +574,9 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
 	C(TOO_MANY_ARGS,	"Too many arguments are specified"),	\
 	C(TOO_MANY_EARGS,	"Too many entry arguments specified"),	\
 	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),  \
-	C(TYPECAST_NOT_EVENT,	"Typecasts are only for eprobe fields"),
+	C(TYPECAST_NOT_EVENT,	"Typecasts are only for eprobe fields"), \
+	C(TYPECAST_REQ_FIELD,	"Typecast requires a field access"),	\
+	C(TOO_MANY_NESTED,	"Too many nested typecasts/dereferences"),
 
 #undef C
 #define C(a, b)		TP_ERR_##a


^ permalink raw reply related

* [PATCH v3 2/7] tracing/probes: Support typecast for various probe events
From: Masami Hiramatsu (Google) @ 2026-06-14 14:53 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178144880282.159464.16882854283219530040.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Support BTF typecast feature on other probe events, but only if it is
kernel function entry or return, and must use function parameter name
or $retval. This means you can do:

  (STRUCT)PARAM->MEMBER

but you can not do (this should be enabled by nesting support)

  (STRUCT)%reg->MEMBER

To support other probe events, we just need to use last_struct type
when we find a function parameter in parse_btf_arg().

This also update <tracefs>/README file to show struct typecast.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v3:
  - Clarify the limitation.
 Changes in v2:
  - Fix to re-enable typecast on eprobe.
---
 Documentation/trace/fprobetrace.rst |    3 +++
 Documentation/trace/kprobetrace.rst |    4 ++++
 kernel/trace/trace.c                |    2 +-
 kernel/trace/trace_probe.c          |   14 +++++++++-----
 kernel/trace/trace_probe.h          |    5 +++++
 5 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index b4c2ca3d02c1..7435ded2d66d 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -57,6 +57,9 @@ Synopsis of fprobe-events
                   (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
                   (x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
                   and bitfield are supported.
+  (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+                  a pointer to STRUCT and then derference the pointer defined by
+                  ->MEMBER.
 
   (\*1) This is available only when BTF is enabled.
   (\*2) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 3b6791c17e9b..f73614997d52 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -61,6 +61,10 @@ Synopsis of kprobe_events
 		  (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
                   "string", "ustring", "symbol", "symstr" and bitfield are
                   supported.
+  (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+                  a pointer to STRUCT and then derference the pointer defined by
+                  ->MEMBER. Note that this is available only when the probe is
+		   on function entry.
 
   (\*1) only for the probe on function entry (offs == 0). Note, this argument access
         is best effort, because depending on the argument type, it may be passed on
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..aa93e7b01146 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4325,7 +4325,7 @@ static const char readme_msg[] =
 #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
 	"\t           $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
 #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
-	"\t           <argname>[->field[->field|.field...]],\n"
+	"\t           [(structname)]<argname>[->field[->field|.field...]],\n"
 #endif
 #else
 	"\t           $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index fd1caa1f9723..9158f1f22a62 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -759,7 +759,10 @@ static int parse_btf_arg(char *varname,
 	return -ENOENT;
 
 found:
-	type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
+	if (ctx->struct_btf)
+		type = ctx->last_struct;
+	else
+		type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
 found_type:
 	if (!type) {
 		trace_probe_log_err(ctx->offset, BAD_BTF_TID);
@@ -836,10 +839,11 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
 	char *tmp;
 	int ret;
 
-	/* Currently this only works for eprobes */
-	if (!(ctx->flags & TPARG_FL_TEVENT)) {
-		trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
-		return -EINVAL;
+	if (!(tparg_is_event_probe(ctx->flags) ||
+	      tparg_is_function_entry(ctx->flags) ||
+	      tparg_is_function_return(ctx->flags))) {
+		trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
+		return -EOPNOTSUPP;
 	}
 
 	tmp = strchr(arg, ')');
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 15758cc11fc6..883938a74aee 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -414,6 +414,11 @@ static inline bool tparg_is_function_return(unsigned int flags)
 	return (flags & TPARG_FL_LOC_MASK) == (TPARG_FL_KERNEL | TPARG_FL_RETURN);
 }
 
+static inline bool tparg_is_event_probe(unsigned int flags)
+{
+	return !!(flags & TPARG_FL_TEVENT);
+}
+
 struct traceprobe_parse_context {
 	struct trace_event_call *event;
 	/* BTF related parameters */


^ permalink raw reply related

* [PATCH v3 1/7] tracing/events: Fix to check the simple_tsk_fn creation
From: Masami Hiramatsu (Google) @ 2026-06-14 14:53 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178144880282.159464.16882854283219530040.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Sashiko pointed that this sample code does not correctly handle the
failure of thread creation because kthread_run() can return -errno.

Check the simple_tsk_fn is correctly initialized (created) or not.

Link: https://sashiko.dev/#/patchset/178092865666.163648.10457567771536160909.stgit%40devnote2

Fixes: 9cfe06f8cd5c ("tracing/events: add trace-events-sample")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v3:
   - Recover the usage counter.
---
 samples/trace_events/trace-events-sample.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/samples/trace_events/trace-events-sample.c b/samples/trace_events/trace-events-sample.c
index ecc7db237f2e..82344a78e471 100644
--- a/samples/trace_events/trace-events-sample.c
+++ b/samples/trace_events/trace-events-sample.c
@@ -107,6 +107,11 @@ int foo_bar_reg(void)
 	 * for consistency sake, we still take the thread_mutex.
 	 */
 	simple_tsk_fn = kthread_run(simple_thread_fn, NULL, "event-sample-fn");
+	if (IS_ERR_OR_NULL(simple_tsk_fn)) {
+		pr_err("Failed to create simple_thread_fn");
+		simple_thread_cnt--;
+		simple_tsk_fn = NULL;
+	}
  out:
 	mutex_unlock(&thread_mutex);
 	return 0;


^ permalink raw reply related

* [PATCH v3 0/7] tracing/probes: Add more typecast features
From: Masami Hiramatsu (Google) @ 2026-06-14 14:53 UTC (permalink / raw)
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest

Hi,

Here is the 3rd version of series to introduce more typecast features
to probe events. The previous version is here:

 https://lore.kernel.org/all/178105268094.21760.13668249930524377840.stgit@devnote2/

In this version, I fixed various problems Sashiko reviewed and add new
test cases.

Steve introduced BTF typecast feature for eprobe[1].
This series extends it and add more options:

1. Expanding BTF typecast to kprobe and fprobe.
   (currently only function entry/exit)

2. Introduce container_of like typecast. This adds a "assigned
   member" option to the typecast.

   (STRUCT,MEMBER)VAR->ANOTHER_MEMBER

   This casts VAR to STRUCT type but the VAR is as the address
   of STRUCT.MEMBER. In C, it is:

   container_of(VAR, STRUCT, MEMBER)->ANOTHER_MEMBER

3. Support nested typecast, e.g.

   (STRUCT)((STRUCT2)VAR->MEMBER2)->MEMBER

   the nest level must be smaller than 3.

4. Add $current variable to point "current" task_struct.
   This is useful with typecast, e.g.

   (task_struct)$current->pid

5. per-cpu dereference support.

   Intrdouce this_cpu_read(VAR) and this_cpu_ptr(VAR) to
   access per-cpu data on the current CPU (accessing other CPU
   data is not stable, because it can be changed.)

   You can access the member of per-cpu data structure using
   typecast like:

   (STRUCT)this_cpu_ptr(VAR)->MEMBER


And added a test script to test part of them.

[1] https://lore.kernel.org/all/20260601130746.2139d926@gandalf.local.home/


---

Masami Hiramatsu (Google) (7):
      tracing/events: Fix to check the simple_tsk_fn creation
      tracing/probes: Support typecast for various probe events
      tracing/probes: Support nested typecast
      tracing/probes: Support field specifier option for typecast
      tracing/probes: Add $current variable support
      tracing/probes: Add this_cpu_read() and this_cpu_ptr() dereference method to fetcharg
      tracing/probes: Add a new testcase for BTF typecasts


 Documentation/trace/eprobetrace.rst                |    9 
 Documentation/trace/fprobetrace.rst                |   10 
 Documentation/trace/kprobetrace.rst                |   11 +
 kernel/trace/trace.c                               |    6 
 kernel/trace/trace_probe.c                         |  413 +++++++++++++++-----
 kernel/trace/trace_probe.h                         |   19 +
 kernel/trace/trace_probe_tmpl.h                    |   33 +-
 samples/trace_events/trace-events-sample.c         |   45 ++
 samples/trace_events/trace-events-sample.h         |   34 ++
 .../ftrace/test.d/dynevent/btf_probe_event.tc      |   51 ++
 .../ftrace/test.d/dynevent/fprobe_syntax_errors.tc |    9 
 .../ftrace/test.d/kprobe/kprobe_syntax_errors.tc   |    9 
 .../ftrace/test.d/kprobe/uprobe_syntax_errors.tc   |    5 
 13 files changed, 540 insertions(+), 114 deletions(-)
 create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc

--
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* [PATCH 1/1] tools/tracing/rtla: fix missing unistd include
From: Andreas Ziegler @ 2026-06-14  9:28 UTC (permalink / raw)
  To: Steven Rostedt, Tomas Glozar
  Cc: linux-trace-kernel, linux-kernel, Andreas Ziegler

Compiling RTLA 7.1-rc6 with GCC 16 and uClibc as standard library fails
with these errors:

src/common.c: In function ‘set_signals’:
src/common.c:40:17: error: implicit declaration of function ‘alarm’ [-Wimplicit-function-declaration]
   40 |                 alarm(params->duration);
      |                 ^~~~~
src/common.c: In function ‘common_apply_config’:
src/common.c:187:44: error: implicit declaration of function ‘getpid’; did you mean ‘getpt’? [-Wimplicit-function-declaration]
  187 |                 retval = sched_setaffinity(getpid(), sizeof(params->hk_cpu_set),
      |                                            ^~~~~~
      |                                            getpt
In file included from src/common.c:9:
src/common.c: In function ‘run_tool’:
src/common.c:262:19: error: implicit declaration of function ‘sysconf’; did you mean ‘sscanf’? [-Wimplicit-function-declaration]
  262 |         nr_cpus = get_nprocs_conf();
      |                   ^~~~~~~~~~~~~~~
src/common.c:262:19: error: ‘_SC_NPROCESSORS_CONF’ undeclared (first use in this function)
  262 |         nr_cpus = get_nprocs_conf();
      |                   ^~~~~~~~~~~~~~~
src/common.c:262:19: note: each undeclared identifier is reported only once for each function it appears in
src/common.c:370:17: error: implicit declaration of function ‘sleep’ [-Wimplicit-function-declaration]
  370 |                 sleep(1);
      |                 ^~~~~

Restore the missing unistd.h include.

Fixes: <115b06a00875> (tools/rtla: Consolidate nr_cpus usage across all tools)

Signed-off-by: Andreas Ziegler <br025@umbiko.net>
---
 tools/tracing/rtla/src/common.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/tracing/rtla/src/common.c b/tools/tracing/rtla/src/common.c
index 35e3d3aa922e..5c5398d20f40 100644
--- a/tools/tracing/rtla/src/common.c
+++ b/tools/tracing/rtla/src/common.c
@@ -5,6 +5,7 @@
 #include <signal.h>
 #include <stdlib.h>
 #include <string.h>
+#include <unistd.h>
 #include <getopt.h>
 #include <sys/sysinfo.h>
 
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH v3 0/9] rv/tlob: Add task latency over budget RV monitor
From: Wen Yang @ 2026-06-13 16:00 UTC (permalink / raw)
  To: Gabriele Monaco; +Cc: Steven Rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <cover.1780847473.git.wen.yang@linux.dev>


Hi Gabriele,

Gentle ping on this series. Please let me know if there are any
concerns or if further changes are needed.

Thanks for your time,
Wen


On 6/8/26 00:13, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> This series introduces tlob (task latency over budget), a per-task
> hybrid automaton RV monitor that measures elapsed wall-clock time across
> a user-delimited code section and fires when the time exceeds a
> configurable budget.
> 
> The series applies cleanly on top of:
>    [1] git://git.kernel.org/pub/scm/linux/kernel/git/gmonaco/linux.git rv-fixes-7.1
>        "rv fixes for v7.1"
> 
> Background
> ----------
> The existing wwnr monitor uses a two-state DA to detect tasks that are
> woken but never run.  tlob extends the RV framework to a three-state
> hybrid automaton:
> 
>    running  (initial) -- on CPU
>    waiting             -- in the scheduler runqueue, not yet on CPU
>    sleeping            -- blocked on a lock, I/O, or similar resource
> 
> A single HA clock invariant, clk_elapsed < BUDGET_NS(), is active in
> all states.  The framework enforces it via a per-task hrtimer.  On
> expiry, error_env_tlob is emitted, followed by detail_env_tlob which
> carries a per-state time breakdown (running_ns, waiting_ns, sleeping_ns)
> that pinpoints whether the overrun occurred in the running, waiting, or
> sleeping state.
> 
> Userspace interface
> -------------------
> Tasks are registered for monitoring by writing to the tracefs monitor
> file:
> 
>    # echo "p /path/to/binary:START_OFFSET STOP_OFFSET threshold=NS" \
>          > /sys/kernel/tracing/rv/monitors/tlob/monitor
> 
> Two uprobes are registered at START_OFFSET (entry) and STOP_OFFSET
> (exit) of the delimited section.  When a task executes the entry uprobe,
> the monitor starts; when the task reaches the exit uprobe or the budget
> expires, monitoring stops and the slot is returned to the pool.
> 
> Multiple uprobe pairs can be registered for the same binary or different
> binaries.  Each task can have at most one active monitoring session; if
> a task hits a start uprobe while already monitored, the prior session is
> cancelled and a new one begins.
> 
> Series structure
> ----------------
> Patch 1: rv/da: introduce DA_MON_ALLOCATION_STRATEGY
>    Consolidates per-object DA storage allocation under a compile-time
>    selector with three strategies:
>      DA_ALLOC_AUTO   (default) - lock-free kmalloc_nolock; unbounded
>      DA_ALLOC_POOL             - pre-allocated fixed-size pool
>      DA_ALLOC_MANUAL           - caller pre-inserts storage
> 
>    da_handle_start_event() and da_handle_start_run_event() call
>    da_prepare_storage() which resolves at compile time to the correct
>    allocation function.
> 
>    This patch also includes critical correctness fixes for the pool
>    implementation:
>    - Add tracepoint_synchronize_unregister() in da_monitor_destroy_pool()
>      to fix UAF where in-flight handlers access freed pool storage
>    - Fix duplicate hash entry race in da_create_or_get_pool() via
>      concurrent-insert detection under RCU
>    - Add capacity field to fix build error (DA_MON_POOL_SIZE undeclared
>      in da_pool_return_cb)
> 
> Patch 2: rv: add generic uprobe infrastructure for RV monitors
>    Introduces rv_uprobe, a thin wrapper around uprobe_consumer for RV
>    monitors.  Provides rv_uprobe_register(), rv_uprobe_unregister(),
>    and rv_uprobe_sync() for safe teardown.
> 
> Patch 3: rv/tlob: add tlob model DOT file
>    The formal model used to generate tlob.h.
> 
> Patch 4: rv/ha: fix ha_invariant_passed_ns silent bypass of invariant check
>    Fixes a bug where ha_invariant_passed_ns() returned 0 early when
>    env_store was invalid (U64_MAX), leaving it at U64_MAX and causing
>    ha_check_invariant_ns() to always pass.  The fix calls ha_reset_clk_ns()
>    then ha_set_invariant_ns() on first use.
> 
> Patch 5: rv/ha: make da_monitor_reset_hook and EVENT_NONE_LBL overridable
>    Allows tlob to override EVENT_NONE_LBL for its start_tlob self-loop.
> 
> Patch 6: rv/tlob: add tlob hybrid automaton monitor
>    The main tlob implementation, including:
>    - Three-state HA (running/waiting/sleeping)
>    - Per-task hrtimer enforcement (HRTIMER_MODE_REL_HARD)
>    - DA_ALLOC_POOL for allocation-free hot path
>    - Uprobe registration via tracefs monitor file
>    - Per-state time accumulation (running_ns, waiting_ns, sleeping_ns)
>    - HIGH_RES_TIMERS dependency in Kconfig
> 
> Patches 7-9: Tests
>    - KUnit tests for tlob monitor
>    - Selftest infrastructure fixes
>    - tlob selftests (uprobe binding, state tracking, violation detection)
> 
> Changes since v2
> ----------------
> All feedback from Gabriele Monaco has been addressed:
> 
> -- Patch 02 (per-task slot ordering / ha_monitor_reset_env):
>     Dropped from v3; rebased on top of Gabriele's series [1].
> 
> -- Patch 03 (verificationtest-ktap):
>     Changed to use realpath for robustness as suggested.
> 
> -- Patch 04 (pre-allocated storage pool):
>     Complete redesign as DA_MON_ALLOCATION_STRATEGY:
>     - Three strategies (AUTO/POOL/MANUAL) via compile-time macro
>     - da_monitor_init_prealloc() removed; da_monitor_init() selects
>       internally
>     - da_create_or_get_kmalloc() removed (no viable use case)
>     - nomiss updated to use DA_ALLOC_MANUAL
>     - da_extra_cleanup() hook added for per-entry teardown
> 
>     Critical bug fixes included in this patch:
>     - tracepoint_synchronize_unregister() added to da_monitor_destroy_pool()
>       to prevent UAF from in-flight handlers accessing freed pool storage
>     - Duplicate hash entry race fixed in da_create_or_get_pool() via
>       concurrent-insert detection and slot return under RCU
>     - capacity field added to fix DA_MON_POOL_SIZE undeclared build error
> 
> -- Patch 05 (generic uprobe infrastructure):
>     Carried unchanged into v3.
> 
> -- Patch 06 (rvgen __init arrow reset):
>     Carried unchanged into v3.
> 
> -- Patch 08 (tlob monitor):
>     Split and refactored:
>     - ioctl interface deferred to follow-up series (tracefs-only in v3)
>     - Handler simplification: three inline helpers (tlob_acc_running/
>       waiting/sleeping) with scoped_guard(rcu)
>     - do_prev/do_next flags removed (da_handle_event skips unmonitored)
>     - scoped_guard(rcu) and guard(mutex) applied throughout
>     - tlob_stop_all() removed; da_extra_cleanup() hook used instead
>     - start_tlob self-loop added to DOT model as suggested
>     - ha_setup_invariants() guards against redundant timer restart
>     - HIGH_RES_TIMERS dependency added to Kconfig
> 
> Additional improvements in v3
> ------------------------------
> Beyond the v2 feedback, this version includes:
> 
> 1. Simplified tlob monitor implementation:
>     - Removed redundant tlob_num_monitored atomic counter
>       (da_handle_event already handles unmonitored tasks via hash lookup)
>     - Eliminated extra cacheline touch on every sched_switch/sched_wakeup
>     - Several repeated pattern simplifications.
> 
> 2. Extracted common accumulation logic:
>     - __tlob_acc() using offsetof() replaces three nearly-identical functions
>     - Reduces code duplication while maintaining type safety
> 
> 3. Complete test coverage:
>     - KUnit tests for core functionality
>     - Comprehensive selftests for uprobe integration, state tracking,
>       and violation detection
> 
> Testing
> -------
> All patches have been tested on:
> - x86_64 with CONFIG_PREEMPT_RT
> - All KUnit tests pass
> - All selftests pass with verificationtest-ktap
> 
>    
> [1] git://git.kernel.org/pub/scm/linux/kernel/git/gmonaco/linux.git rv-fixes-7.1
>      "rv fixes for v7.1"
> 
> 
> Wen Yang (9):
>    rv/da: introduce DA_MON_ALLOCATION_STRATEGY
>    rv: add generic uprobe infrastructure for RV monitors
>    rv/tlob: add tlob model DOT file
>    rv/ha: fix ha_invariant_passed_ns silent bypass of invariant check
>    rv/ha: make da_monitor_reset_hook and EVENT_NONE_LBL overridable
>    rv/tlob: add tlob hybrid automaton monitor
>    rv/tlob: add KUnit tests for the tlob monitor
>    selftests/verification: fix verificationtest-ktap for out-of-tree
>      execution
>    selftests/verification: add tlob selftests
> 
>   Documentation/trace/rv/index.rst              |   1 +
>   Documentation/trace/rv/monitor_tlob.rst       | 177 ++++
>   include/rv/da_monitor.h                       | 276 ++++-
>   include/rv/ha_monitor.h                       |  22 +-
>   include/rv/rv_uprobe.h                        | 119 +++
>   kernel/trace/rv/Kconfig                       |   5 +
>   kernel/trace/rv/Makefile                      |   3 +
>   kernel/trace/rv/monitors/nomiss/nomiss.c      |   6 +-
>   kernel/trace/rv/monitors/tlob/.kunitconfig    |   6 +
>   kernel/trace/rv/monitors/tlob/Kconfig         |  19 +
>   kernel/trace/rv/monitors/tlob/tlob.c          | 968 ++++++++++++++++++
>   kernel/trace/rv/monitors/tlob/tlob.h          | 148 +++
>   kernel/trace/rv/monitors/tlob/tlob_kunit.c    |  92 ++
>   kernel/trace/rv/monitors/tlob/tlob_trace.h    |  49 +
>   kernel/trace/rv/rv_trace.h                    |   1 +
>   kernel/trace/rv/rv_uprobe.c                   | 182 ++++
>   .../testing/selftests/verification/.gitignore |   2 +
>   tools/testing/selftests/verification/Makefile |  19 +-
>   .../verification/test.d/tlob/Makefile         |  20 +
>   .../verification/test.d/tlob/test.d/functions |   1 +
>   .../verification/test.d/tlob/tlob_sym.c       | 189 ++++
>   .../verification/test.d/tlob/tlob_target.c    | 138 +++
>   .../verification/test.d/tlob/uprobe_bind.tc   |  37 +
>   .../test.d/tlob/uprobe_detail_running.tc      |  51 +
>   .../test.d/tlob/uprobe_detail_sleeping.tc     |  50 +
>   .../test.d/tlob/uprobe_detail_waiting.tc      |  66 ++
>   .../verification/test.d/tlob/uprobe_multi.tc  |  64 ++
>   .../test.d/tlob/uprobe_no_event.tc            |  19 +
>   .../test.d/tlob/uprobe_violation.tc           |  67 ++
>   .../verification/verificationtest-ktap        |   4 +-
>   tools/verification/models/tlob.dot            |  22 +
>   31 files changed, 2789 insertions(+), 34 deletions(-)
>   create mode 100644 Documentation/trace/rv/monitor_tlob.rst
>   create mode 100644 include/rv/rv_uprobe.h
>   create mode 100644 kernel/trace/rv/monitors/tlob/.kunitconfig
>   create mode 100644 kernel/trace/rv/monitors/tlob/Kconfig
>   create mode 100644 kernel/trace/rv/monitors/tlob/tlob.c
>   create mode 100644 kernel/trace/rv/monitors/tlob/tlob.h
>   create mode 100644 kernel/trace/rv/monitors/tlob/tlob_kunit.c
>   create mode 100644 kernel/trace/rv/monitors/tlob/tlob_trace.h
>   create mode 100644 kernel/trace/rv/rv_uprobe.c
>   create mode 100644 tools/testing/selftests/verification/test.d/tlob/Makefile
>   create mode 100644 tools/testing/selftests/verification/test.d/tlob/test.d/functions
>   create mode 100644 tools/testing/selftests/verification/test.d/tlob/tlob_sym.c
>   create mode 100644 tools/testing/selftests/verification/test.d/tlob/tlob_target.c
>   create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
>   create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_detail_running.tc
>   create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc
>   create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc
>   create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc
>   create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc
>   create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc
>   create mode 100644 tools/verification/models/tlob.dot
> 

^ permalink raw reply

* Re: [RESEND][PATCH v2] unwind: Add sframe_(un)register() system calls
From: Fangrui Song @ 2026-06-13  4:18 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux Trace Kernel, bpf, Masami Hiramatsu,
	Mathieu Desnoyers, Jens Remus, Josh Poimboeuf, Peter Zijlstra,
	Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
	Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
	Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
	Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Heiko Carstens,
	Vasily Gorbik, Thomas Weißschuh
In-Reply-To: <20260611072249.2222fd9d@gandalf.local.home>

On Thu, Jun 11, 2026 at 4:23 AM Steven Rostedt <rostedt@kernel.org> wrote:
>
> On Thu, 11 Jun 2026 00:00:25 -0700
> Fangrui Song <i@maskray.me> wrote:
>
> > Hi Steven,
> >
> > This is not an objection to deferred userspace unwinding itself -- my
> > concern is narrower: these syscalls permanently encode the kernel's
> > commitment to the SFrame format family at exactly the moment the
> > format's size trajectory is heading the wrong way, and while arguably
> > superior formats exist.
> >
> > I raised related size concerns about SFrame's viability for userspace
> > stack walking earlier:
> > https://lore.kernel.org/all/3xd4fqvwflefvsjjoagytoi3y3sf7lxqjremhe2zo5tounihe4@3ftafgryadsr/
> > ("Concerns about SFrame viability for userspace stack walking")
> >
> > SFrame v3 is even larger than v2.
> >
> > For comparison: Microsoft is currently upstreaming its Windows x64
> > Unwind V3 implementation to LLVM, which will make a side-by-side reading
> > of the two formats straightforward. Unwind V3 provides correct
> > exception-handling unwind -- full prologue replay, SEH handlers,
> > funclets -- and supports Intel APX. SFrame v3 provides stack tracing
> > only, no EH, yet comes out larger than .eh_frame. A format revision that
> > adds capability without adding bulk is demonstrably achievable; SFrame
> > v3 went the other way.
>
> My main concern is simplicity in implementation on the kernel side. One
> thing we would like to avoid is any interpreter that becomes basically
> executing user space code to perform the stack tracing (i.e. DWARF). I
> haven't looked at the Windows x64 but will do so.

I recognize interpreter complexity as a real and valid concern (though
I believe the concern can be alleviated with a good use of LLM
auditing).
But the SFrame family's size problem is more pressing than it's being
given credit for, because of a point that applies specifically to
x86-64: SFrame cannot *replace* .eh_frame there, only add to it (to
keep debugging and C++ exception handling working).

.eh_frame+.eh_frame_hdr almost takes 9%; adding .sframe adds another 9% on top.
I don't think the perceived benefit on x86-64 justifies near-doubling
the unwind metadata footprint.


x86-64 experiments with x64 Unwind V3 for ELF:

                    unwind B     VM B      % of VM
  sqlite3 -O0
    .eh_frame(+hdr)   42,120   446,735      9.43%
    Win64 v1          24,536   438,609      5.59%
    Win64 v3          33,296   450,889      7.38%
    SFrame v3         39,105   440,727      8.87%

  sqlite3 -O1
    .eh_frame(+hdr)   44,424   479,348      9.27%
    Win64 v1          15,496   450,724      3.44%
    Win64 v3          21,816   458,916      4.75%
    SFrame v3         44,756   477,776      9.37%

  CGExpr.cpp -O3
    .eh_frame(+hdr)   27,040   372,416      7.26%
    Win64 v1          10,720   356,080      3.01%
    Win64 v3          15,320   360,185      4.25%
    SFrame v3         27,628   370,912      7.45%

> >
> > I understand IBM is doubling down on SFrame for their s390x and ppc64,
>
> That's because this is currently the only way s390 can perform stack
> walking in user space.

Hmm, I think that DWARF .eh_frame already works on s390x, but I take
your word that .sframe might be appealing for some use cases.

Perhaps s390x and x86-64 shouldn't be decided by the same vote.

> > but I'm not convinced the size overhead of v3 will make it appealing on
> > x86-64. I have learned that the person driving their SFrame work at
> > Google had left and the SFrame at data center effort was being
> > reevaluated per a toolchain manager.

> I believe the person who left Google that was driving the SFrame work was
> me ;-)
>
> Thanks,
>
> -- Steve

Ha, I wasn't referring to you. I had a different toolchain engineer in mind.
But that was a secondhand aside and not the point I want to rest
anything on, so I'll set it aside and stick to the bytes.

-----

Separately, on maturity and sequencing:

On timing: the SFrame v3 binutils set was first posted 2025-12-09; the
2.46 branch was cut 2026-01-17 and released 2026-02-08 with v3
generation. That's a ~5-week window, and I'm not convinced there's
been time for rigorous review of the v3 design; if flaws surface we're
at a v4, and kernel adoption still wouldn't be a given.

^ permalink raw reply

* Re: [PATCH] rtla: Simplify osnoise tracer option setting code
From: Crystal Wood @ 2026-06-12 17:54 UTC (permalink / raw)
  To: Tomas Glozar, Steven Rostedt
  Cc: John Kacur, Luis Goncalves, Costa Shulyupin, Wander Lairson Costa,
	LKML, linux-trace-kernel
In-Reply-To: <20260612115121.54862-1-tglozar@redhat.com>

On Fri, 2026-06-12 at 13:51 +0200, Tomas Glozar wrote:
> Each osnoise tracer option (in /sys/kernel/tracing/osnoise) used by RTLA
> requires four functions to be defined:
> 
> - static osnoise_get_<opt>() - to get the current value of the option
>   and save it into struct osnoise_context's orig_<opt> field,
> - osnoise_set_<opt>() - to set the value of the option requested by the
>   user after reading and saving the original with osnoise_get_<opt>(),
>   and save it into <opt> field of struct osnoise_context,
> - osnoise_restore_<opt>() - restore the value recorded in orig_<opt>,
> - static osnoise_put_<opt>() - restore the value recorded in orig_<opt>
>   and update <opt> to reflect that.
> 
> The logic is duplicated for all the options, except for cpus (which is
> the only string option) and period/runtime (which are handled together
> and feature extra checks).

Thanks for taking this on... this is one of the things that was bugging me
during consolidation work that I didn't get to.

> Deduplicate the logic using a set of macros featuring the X macro
> pattern, defined in src/common.h:
> 
> - OSNOISE_LL_OPTIONS, which invokes OSNOISE_LL_OPTION macro for all
>   "long long" options,
> - OSNOISE_FLAG_OPTIONS, which invokes OSNOISE_FLAG_OPTION macro for all
>   flag (boolean values in osnoise/options file) options.
> 
> The list macros are then invoked in four places:
> 
> - for struct osnoise_context fields in src/common.h,
> - for function declarations, moved into src/common.h from
>   src/osnoise.h,
> - for function definitions in src/osnoise.c,
> - for context initialization and restoration, in osnoise_context_alloc()
>   and osnoise_put_context(), both in src/osnoise.c.
> 
> OSNOISE_LL_OPTIONS takes three options: name - struct osnoise_context
> field name (written "<opt>" above), path - filename inside
> /sys/kernel/tracing/osnoise passed to libtracefs, and init_val - initial
> value of struct fields, corresponding to an otherwise invalid option
> (some options use OSNOISE_OPTION_INIT_VAL = -1, some use
> OSNOISE_TIME_INIT_VAL = 0).

Can we simplify by always using -1?  Especially since that's already
treated as the universal "invalid" by osnoise_read_ll_config().

FWIW using "init val" to mean "invalid" rather than "default" is a bit
unintuitive.

> OSNOISE_FLAG_OPTION is similar, but instead of path, it takes the option
> string inside /sys/kernel/tracing/osnoise/options (opt_string), and no
> init_val, as it is purely boolean (0 or 1).
> 
> Previously, for options timerlat_align and osnoise_workload, the return
> value of osnoise_set_<opt>() distinguished between -2 (option cannot be
> set) and -1 (option not present). This distinction is expanded for all
> options for consistency; for most options, it is currently not used,
> only osnoise_workload is implemented to avoid error on -1 on older RTLA
> versions.

"on -1 on"?

> The change overall has two main benefits: it makes it much simpler to
> add a new option, as well as to change existing logic consistently for
> all of them. It also makes the code shorter by a bit over 500 lines.
> 
> There is no intentional user-visible change coming from the refactoring.
> osnoise_restore_<opt>() for flag options now sets <opt> instead of
> orig_<opt>. As the latter is also set by osnoise_put_<opt>(), plus long
> long options set <opt> in both the old and new implementation, the old
> behavior was likely a mistake, and should not matter for now, as the
> options are only restored once at the end of tracing and neither <opt>
> nor orig_<opt> field is read again.
> 
> Assisted-by: Claude:claude-opus-4-6
> Signed-off-by: Tomas Glozar <tglozar@redhat.com>
> ---
>  tools/tracing/rtla/src/common.h  |  79 +--
>  tools/tracing/rtla/src/osnoise.c | 836 ++++++-------------------------
>  tools/tracing/rtla/src/osnoise.h |  22 -
>  3 files changed, 188 insertions(+), 749 deletions(-)

While we're at it, can we move this code to common.c, and drop
"osnoise" from the names, to move closer to using that only for the
actual osnoise mode?

Or if we really want to namespace things that are specific to the
osnoise subsystem (i.e. everything implemented in trace_osnoise.c) but
not specific with respect to the osnoise/timerlat split, I'd suggest
something different like "osn_".

> + * Long long option get/set/restore/put functions, generated from OSNOISE_LL_OPTIONS.
> + */
> +#define OSNOISE_LL_OPTION(name, path, init_val)						\
> +static long long									\
> +osnoise_get_##name(struct osnoise_context *context)					\
> +{											\
> +	long long name;									\
> +											\
> +	if (context->name != (init_val))						\
> +		return context->name;							\
> +											\
> +	if (context->orig_##name != (init_val))						\
> +		return context->orig_##name;						\
> +											\
> +	name = osnoise_read_ll_config(path);						\
> +	if (name < 0)									\
> +		return (init_val);							\
> +											\
> +	context->orig_##name = name;							\
> +	return name;									\
> +}											\
> +											\
> +int osnoise_set_##name(struct osnoise_context *context, long long name)			\
> +{											\
> +	long long curr = osnoise_get_##name(context);					\
> +	int retval;									\
> +											\
> +	if (curr == (init_val))								\
> +		return -1;								\
> +											\
> +	retval = osnoise_write_ll_config(path, name);					\
> +	if (retval < 0)									\
> +		return -2;								\
> +											\
> +	context->name = name;								\
> +	return 0;									\
> +}											\

Using "name" for the value is confusing... "val" would be better.

> +											\
> +void osnoise_restore_##name(struct osnoise_context *context)				\
> +{											\
> +	int retval;									\
> +											\
> +	if (context->orig_##name == (init_val))						\
> +		return;									\
> +											\
> +	if (context->orig_##name == context->name)					\
> +		goto out_done_##name;							\
> +											\
> +	retval = osnoise_write_ll_config(path, context->orig_##name);			\
> +	if (retval < 0)									\
> +		err_msg("Could not restore original " #name "\n");			\
> +											\
> +out_done_##name:									\
> +	context->name = (init_val);							\
> +}											\

Why does the label need to have ##name in it?

> +											\
> +static void osnoise_put_##name(struct osnoise_context *context)				\
> +{											\
> +	osnoise_restore_##name(context);						\
> +											\
> +	if (context->orig_##name == (init_val))						\
> +		return;									\
> +											\
> +	context->orig_##name = (init_val);						\
> +}
[snip]
> +/*
> + * Flag option get/set/restore/put functions, generated from OSNOISE_FLAG_OPTIONS.
> + */
> +#define OSNOISE_FLAG_OPTION(name, option_str)						\
> +static int osnoise_get_##name(struct osnoise_context *context)				\
> +{											\
> +	if (context->opt_##name != OSNOISE_OPTION_INIT_VAL)				\
> +		return context->opt_##name;						\
> +											\
> +	if (context->orig_opt_##name != OSNOISE_OPTION_INIT_VAL)			\
> +		return context->orig_opt_##name;					\
> +											\
> +	context->orig_opt_##name = osnoise_options_get_option(option_str);		\
> +	return context->orig_opt_##name;						\
> +}											\
> +											\
> +int osnoise_set_##name(struct osnoise_context *context, bool onoff)			\
> +{											\
> +	int val = osnoise_get_##name(context);						\
> +	int retval;									\
> +											\
> +	if (val == OSNOISE_OPTION_INIT_VAL)						\
> +		return -1;								\
> +											\
> +	if (val == onoff)								\
> +		return 0;								\
> +											\
> +	retval = osnoise_options_set_option(option_str, onoff);				\
> +	if (retval < 0)									\
> +		return -2;								\
> +											\
> +	context->opt_##name = onoff;							\
> +	return 0;									\
> +}											\
> +											\
> +void osnoise_restore_##name(struct osnoise_context *context)				\
> +{											\
> +	int retval;									\
> +											\
> +	if (context->orig_opt_##name == OSNOISE_OPTION_INIT_VAL)			\
> +		return;									\
> +											\
> +	if (context->orig_opt_##name == context->opt_##name)				\
> +		goto out_done_##name;							\
> +											\
> +	retval = osnoise_options_set_option(option_str, context->orig_opt_##name);	\
> +	if (retval < 0)									\
> +		err_msg("Could not restore original " option_str " option\n");		\
> +											\
> +out_done_##name:									\
> +	context->opt_##name = OSNOISE_OPTION_INIT_VAL;					\
> +}											\
> +											\
> +static void osnoise_put_##name(struct osnoise_context *context)				\
> +{											\
> +	osnoise_restore_##name(context);						\
> +											\
> +	if (context->orig_opt_##name == OSNOISE_OPTION_INIT_VAL)			\
> +		return;									\
> +											\
> +	context->orig_opt_##name = OSNOISE_OPTION_INIT_VAL;				\
> +}

Can we reduce the amount of code we put in macros by moving some of the
logic to osnoise_read/write_ll_config() and osnoise_get/set_optino()? 
Or a non-macro wrapper around them if there are other callers that need
the current behavior.

Something like (assuming universal -1 invalid):

static int osn_read_ll_config(const char *rel_path, long long *val, long long *orig)
static int osn_write_ll_config(const char *rel_path, long long *val, long long *orig)
static int osn_get_option(const char *name, int *val, int *orig)
static int osn_set_option(const char *name, int *val, int *orig)

-Crystal (who wishes we were using a modern language that didn't require all
this macro stuff)


^ permalink raw reply

* Re: [RFC PATCH 1/2] random: Expose boot ID to other subsystems
From: Jason A. Donenfeld @ 2026-06-12 17:04 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Theodore Ts'o, Steven Rostedt, Mathieu Desnoyers,
	linux-kernel, linux-trace-kernel
In-Reply-To: <177937542892.2596845.4271730537688894501.stgit@mhiramat.tok.corp.google.com>

On Thu, May 21, 2026 at 11:57:09PM +0900, Masami Hiramatsu (Google) wrote:
> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 
> Add get_boot_id() to expose current boot ID to other kernel subsystems.
> Note that since this is only meaningful if user can access it via sysctl,
> it returns NULL if CONFIG_SYSCTL=n.

Wouldn't this be nice to have even on !SYSCTL systems? Why disable it for this
case?

> +/**
> + * get_boot_id - return the boot ID UUID
> + *
> + * This function returns a pointer to the boot ID UUID, which is generated on
> + * demand the first time this function is called. The boot ID is a UUID that
> + * is unique to each boot of the system.
> + */
> +const u8 *get_boot_id(void)
> +{
> +	static DEFINE_SPINLOCK(bootid_spinlock);
> +
> +	spin_lock(&bootid_spinlock);
> +	if (!sysctl_bootid[8])
> +		generate_random_uuid(sysctl_bootid);
> +	spin_unlock(&bootid_spinlock);
> +
> +	return sysctl_bootid;
> +}
> +
>  /*
>   * This function is used to return both the bootid UUID, and random
>   * UUID. The difference is in whether table->data is NULL; if it is,
> @@ -1638,12 +1657,8 @@ static int proc_do_uuid(const struct ctl_table *table, int write, void *buf,
>  		uuid = tmp_uuid;
>  		generate_random_uuid(uuid);
>  	} else {
> -		static DEFINE_SPINLOCK(bootid_spinlock);
> -
> -		spin_lock(&bootid_spinlock);
> -		if (!uuid[8])
> -			generate_random_uuid(uuid);
> -		spin_unlock(&bootid_spinlock);
> +		/* Ensure that the boot ID is initialized. */
> +		get_boot_id();

I find this a little odd, this implicit behavior now that sysctl_bootid ==
uuid. But perhaps that's the cleanest approach there is?

^ permalink raw reply

* Re: [PATCH] tracing: fprobe: Remove __packed from generic __fprobe_header
From: Steven Rostedt @ 2026-06-12 16:36 UTC (permalink / raw)
  To: Markus Schneider-Pargmann
  Cc: Mathieu Desnoyers, David Laight, Masami Hiramatsu (Google),
	Heiko Carstens, linux-kernel, linux-trace-kernel
In-Reply-To: <DJ7328G40P9R.YB03MWLT8GQF@baylibre.com>

On Fri, 12 Jun 2026 14:51:58 +0200
"Markus Schneider-Pargmann" <msp@baylibre.com> wrote:

>   fgraph_data = fgraph_reserve_data(gops->idx, reserved_words * sizeof(long));
> 
> fgraph_reserve_data() returns a pointer into an unsigned long array
> ret_stack. ret_stack is allocated with

Correct. It is in fact a requirement that fgraph_reserve_data() returns
a long aligned pointer.

-- Steve

^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-12 15:29 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Balbir Singh, lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <ainFROZ3WrGioyuY@gourry-fedora-PF4VCD3F>

On Wed, Jun 10, 2026 at 04:12:52PM -0400, Gregory Price wrote:
> On Wed, Jun 10, 2026 at 08:59:59PM +0200, David Hildenbrand (Arm) wrote:
> > > 
> > > I understand this question in two ways:
> > > 
> > >   1) Can we disallow PAGE allocation and limit this to FOLIO allocation
> > 
> > Yes. Can we only allow folios to be allocated from private memory nodes. So let
> > me reply to that one below.
> > 
> ... snip ...
> > 
> > At LSF/MM we talked about how GFP flags are bad and how deriving stuff from the
> > context might be better. I think there was also talk about how the memalloc_*
> > interface might be a better way forward. Maybe we would start giving the
> > allocator more context ("we are allocating a folio").
> > 
> > The following is incomplete (esp. hugetlb stuff I assume), just as some idea:
> >
> 
> I will still probably send the next RFC version tomorrow or friday,
> as I want to get some eyes on the __GFP_PRIVATE-less pattern.
> 
> Also, I made a new `anondax` driver which enables userland testing
> of this functionality without any specialty hardware.
> 

(apologies for the length of this email: this will all be covered in
the coming cover letter, but I just wanted to share a bit of a preview)

===

Just another small update - I am planning to post the RFC today once i
get some mild cleanup done.  It will be based on the dax atomic hotplug

https://lore.kernel.org/linux-mm/20260605211911.2160954-1-gourry@gourry.net/

But a couple specific details regarding the memalloc pieces that i've
learned the past couple of days playing with it.

1) memalloc_folio is required to ensure non-folio allocations don't land
   on the private node, even if it happens within a memalloc_private
   context.  Since memalloc_folio may be useful in contexts outside of
   private nodes, I kept this as a separate flag.

   If we think there will *never* be additional users of memalloc_folio,
   then we could fold _folio into _private to save the flag for now and
   add it back when we actually need it.

2) memalloc_private is needed to unlock private nodes, but in the
   original NOFALLBACK-only design, you also needed __GFP_THISNODE.

   This is *highly* restrictive.  I found when playing with mbind that
   MPOL_BIND + __GFP_THISNODE generates a WARN (valid WARN, it normally
   implies a bug). 

   That leads me to #3

3) If a private node is opted into something like Demotion (the node is
   a demotion target) or mbind(), such that normal kernel operation can
   place memory there - it's *pseudo-private*, and should actually land
   in it's own FALLBACK list (reachable without __GFP_THISNODE, but not
   reachable as a normal fallback allocation target).

I'm still playing with this, but I think we can even omit the
__GFP_THISNODE requirement (my initial feeling that __GFP_THISNODE
didn't buy us anything in particular seems to have panned out).

At the end of the day, this makes the whole memalloc_private_save()
pattern a heck of a lot cleaner than trying fiddle with GFP.

I think you will all enjoy how clean the code ends up, and how easily
testable it is.

As a testbed I've implement an anondax (we can discuss naming) that
adds some sample NODE_PRIVATE_OPT_* flags so you can do the following.

I'm including this in the next RFC - but we can hack the entire thing
off (including the OPT flags) if we prefer to just get the base set in
without a new driver as a start.

echo 1 > dax0.0/reclaim   # kswapd and reclaim run normally on this node
echo 1 > dax0.0/demotion  # it is a demotion target
echo 1 > dax0.0/mbind     # mbind() can target this node for anon-vma's
echo 1 > dax0.0/madvise   # allow madvise() to operate on its folios
echo 1 > dax0.0/numa_balance  # allow numa balancing for this node
echo 1 > dax0.0/ltpin     # allow GUP longterm pin to operate normally
echo * > dax0.0/adistance # set the adistance for hotplug time
echo * > dax0.0/hotplug   # same as kmem/hotplug

This also means *existing hardware* can leverage private nodes if
they're capable of generating a dax device.

I've even gotten it such that you can put a private node above dram in
the adistance heirarchy - which means demotion flows downward from
device to CPU, but allocations don't default or fallback there.

This seems *immediately* useful for a variety of use cases.

~Gregory

^ permalink raw reply

* Re: [PATCHv7 bpf-next 03/29] ftrace: Add add_ftrace_hash_entry function
From: Steven Rostedt @ 2026-06-12 14:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, Jiri Olsa, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, bpf, linux-trace-kernel,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	Menglong Dong
In-Reply-To: <DJ6EGJ8S87HP.2WOTGYK374XKI@gmail.com>

On Thu, 11 Jun 2026 10:35:11 -0700
"Alexei Starovoitov" <alexei.starovoitov@gmail.com> wrote:

> AI finds things to consider, but when they're considered and postponned
> to future it doesn't understand that and keep reporting the same thing
> every revision. So it might look like that patches are landing with
> outstanding AI complains, but this is not the case.

Well, if someone just asked me to give an ack then I would have. But I
have other things to work on. Especially since everything I do at the
moment is 100% hobby related. The chores my wife gives me now have
priority ;-)

I'll start a new job come Monday.

> 
> btw since patches touch ftrace from time to time should we add your
> ftrace testsuite to bpf CI ?

That's actually a good idea.

> How automated is it?

Very. In fact it's public. Although it's been a few years since I
updated the git repos.

I have two qemu images (currently private, but I can update them and
share). Where one is a 32bit x86 image and the other is a 64bit image.

The 64bit image hostname is called tracetest and the 32bit image's
hostname is tracetest-32. Both with root password of test0000.

The tests loaded on the image is here:

  https://github.com/rostedt/ftrace-tests

And the ktests I run are here:

  https://github.com/rostedt/ftrace-ktests

I would run the ktest like;

  ktest.pl -DPATCH_CHECKOUT:=<SHA/BRANCH> -DPATCH_START:=<first-commit> tracetest-64.conf

And in another window

  ktest.pl -DPATCH_CHECKOUT:=<SHA/BRANCH> -DPATCH_START:=<first-commit> tracetest-32.conf

For example:

  ktest.pl -DPATCH_CHECKOUT:=trace/ftrace/core -DPATCH_START:=b5d6d3f73d0bac4a7e3a061372f6da166fc6ee5c tracetest-64.conf

And that will run 40 tests (I added some more since my last push, so
github doesn't have 40) and build, boot, install, test on the qemu
64bit image. It starts out testing commits from
b5d6d3f73d0bac4a7e3a061372f6da166fc6ee5c and going through to
trace/ftrace/core. Note, the PATCH_START needs to be in the history of
the PATCH_CHECKOUT otherwise the test will fail.

If you want to run these, let me know and I can help with the setup.
It's what I gave Masami to test as well.

> 
> >  
> >> 
> >> While at it, please review Mykyta's set:
> >> https://patchwork.kernel.org/user/todo/netdevbpf/?series=1096695
> >> 
> >> It's also been pending for almost a month now.  
> >
> > Have a better link? I just get a blank page as "TODO" is set to what I have.  
> 
> Ohh. I meant this set:
> https://lore.kernel.org/bpf/CAEf4BzZFjsEv3aLktwdCZF6EXoCL+eefX+6xa3XGrhBmfO1SqA@mail.gmail.com/
> where you said that you'll think more about it after pto.

I came back on Tuesday and have yet to catch up on all the email I
ignored while away :-p

> Would be great to land it now for this merge window, so we have
> discoverability right now and if better approach comes in the future
> we can adjust to it later.
>  

I'll take a look at it.

-- Steve

^ permalink raw reply

* Re: [GIT PULL] RTLA changes for 7.2
From: Tomas Glozar @ 2026-06-12 14:19 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Costa Shulyupin, Crystal Wood, LKML, linux-trace-kernel
In-Reply-To: <20260529130643.3080315-1-tglozar@redhat.com>

pá 29. 5. 2026 v 15:16 odesílatel Tomas Glozar <tglozar@redhat.com> napsal:
>
> ----------------------------------------------------------------
> Costa Shulyupin (1):
>       tools/rtla: Fix --dump-tasks usage in timerlat
>
> Crystal Wood (1):
>       rtla: Stop the record trace on interrupt
>
> Tomas Glozar (24):
>       rtla/tests: Cover both top and hist tools where possible
>       rtla/tests: Add get_workload_pids() helper
>       rtla/tests: Check -c/--cpus thread affinity
>       rtla/tests: Use negative match when testing --aa-only
>       rtla/tests: Extend timerlat top --aa-only coverage
>       rtla/tests: Cover all hist options in runtime tests
>       rtla/tests: Add runtime test for -H/--house-keeping
>       rtla/tests: Add runtime test for -k and -u options
>       rtla/tests: Add runtime tests for -C/--cgroup
>       rtla/tests: Add unit tests for actions module
>       rtla/actions: Restore continue flag in actions_perform()
>       rtla/tests: Add unit test for restoring continue flag
>       rtla/tests: Run runtime tests in temporary directory
>       rtla/tests: Add runtime tests for restoring continue flag
>       rtla: Add libsubcmd dependency
>       tools subcmd: support optarg as separate argument
>       tools subcmd: allow parsing distinct --opt and --no-opt
>       rtla: Parse cmdline using libsubcmd

This will create a conflict with a fix in master/7.1 [1] as it removes
the code that is fixed by the patch, as reported in linux-next already
[2]. The resolution is trivial (just remove the new, fixed code, as it
is entirely replaced by a new implementation that doesn't have the
bug) but a small note in the final pull request might be useful, so
that it's clear we know about it.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e9e41d3035032ed6053d8bad7b7077e1cb3a6540
[2] https://lore.kernel.org/linux-next/aimDwlNq_RRLjg6X@sirena.co.uk/T/#u

>       rtla/tests: Add unit tests for _parse_args() functions
>       rtla/tests: Add unit tests for CLI option callbacks
>       rtla/timerlat: Add -A/--aligned CLI option
>       rtla/tests: Add unit tests for -A/--aligned option
>       Documentation/rtla: Add -A/--aligned option
>       rtla: Document tests in README
>

Tomas


^ permalink raw reply

* Re: [PATCH] tracing: fprobe: Remove __packed from generic __fprobe_header
From: Markus Schneider-Pargmann @ 2026-06-12 12:51 UTC (permalink / raw)
  To: Mathieu Desnoyers, Steven Rostedt, David Laight
  Cc: Masami Hiramatsu (Google),
	Markus Schneider-Pargmann (The Capable Hub), Heiko Carstens,
	linux-kernel, linux-trace-kernel
In-Reply-To: <0ea2ae74-7452-4ba5-9549-59197c766c25@efficios.com>

[-- Attachment #1: Type: text/plain, Size: 3036 bytes --]

Hi,

On Wed Jun 10, 2026 at 10:05 PM CEST, Mathieu Desnoyers wrote:
> On 2026-06-10 15:51, Steven Rostedt wrote:
>> On Wed, 10 Jun 2026 12:06:59 +0100
>> David Laight <david.laight.linux@gmail.com> wrote:
>> 
>>> So you only want __packed on structures that might be misaligned and those
>>> that contain misaligned members.
>>>
>>> If the structure is only guaranteed to be 32bit aligned then use __packed
>>> __aligned(4) so that two 32bit accesses get used instead of 8 8bit ones.
>>>
>>> -- David
>>>
>>>>
>>>> Thank you,
>>>>    
>>>>> Signed-off-by: Markus Schneider-Pargmann (The Capable Hub) <msp@baylibre.com>
>>>>> ---
>>>>>   kernel/trace/fprobe.c | 2 +-
>>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
>>>>> index cc49ebd2a773..21751dcdb7b9 100644
>>>>> --- a/kernel/trace/fprobe.c
>>>>> +++ b/kernel/trace/fprobe.c
>>>>> @@ -181,7 +181,7 @@ static inline void read_fprobe_header(unsigned long *stack,
>>>>>   struct __fprobe_header {
>>>>>   	struct fprobe *fp;
>>>>>   	unsigned long size_words;
>>>>> -} __packed;
>>>>> +};
>>>>>   
>> 
>> Does "__packed" really do anything between a pointer and a long?
>
> If that structure is allocated at a non-void-ptr-aligned address, the
> packed attribute will ensure that the compiler don't emit instructions
> that require aligned loads/stores when accessing those fields.
>
> It does not change the layout of the structure per se in this specific
> case, but it informs the compiler about the lack of guarantees about
> alignment for the entire structure.
>
> x86 32/64 cannot care less about this, but it's relevant on other
> architectures.

Thanks for your feedback. I checked this before submitting the patch.
The struct is always aligned to sizeof(long):

struct __fprobe_header is only ever accessed through
read_fprobe_header() and write_fprobe_header(). Since the read will only
read what we have previously written, only the write part is relevant
here. write_fprobe_header() is only called from fprobe_fgraph_entry():

  if (write_fprobe_header(&fgraph_data[used], fp, size_words))
    used += FPROBE_HEADER_SIZE_IN_LONG + size_words;

used is always kept aligned to sizeof(long), in fact the above snippet
is the only part where it is actually changed. fgraph_data is assigned
here:

  fgraph_data = fgraph_reserve_data(gops->idx, reserved_words * sizeof(long));

fgraph_reserve_data() returns a pointer into an unsigned long array
ret_stack. ret_stack is allocated with

  ret_stack = kmem_cache_alloc(fgraph_stack_cachep, GFP_KERNEL);

and fgraph_stack_cachep is allocated with

  fgraph_stack_cachep = kmem_cache_create("fgraph_stack",
                                          SHADOW_STACK_SIZE,
                                          SHADOW_STACK_SIZE, 0, NULL);

So as far as I can see everything is sizeof(long) aligned here and it is
not allocated at a non-void-ptr-aligned address.

Best
Markus

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 289 bytes --]

^ permalink raw reply

* Re: [PATCH v4] rethook: Remove the running task check in rethook_find_ret_addr()
From: Masami Hiramatsu @ 2026-06-12 12:20 UTC (permalink / raw)
  To: XIAO WU
  Cc: sashiko-reviews, Petr Mladek, Peter Zijlstra, Tengda Wu,
	Mathieu Desnoyers, Alexei Starovoitov, Steven Rostedt,
	linux-kernel, linux-trace-kernel, live-patching
In-Reply-To: <tencent_3D17DC5BE32C8A51D938AF50F221321F6206@qq.com>

On Thu, 11 Jun 2026 23:53:36 +0800
XIAO WU <xiaowu.417@qq.com> wrote:

> Hi Tengda,
> 
> Sashiko [1] reviewed this patch and found that removing the
> task_is_running() check exposes stack unwinders to real crashes — not
> just "invalid information."  A PoC confirms this: a KASAN panic triggers
> within seconds when /proc/<pid>/stack reads the stack of a task that is
> concurrently running a kretprobe.

Hmm, why /proc/<pid>/stack unwind stack so unreliable way...
That should stop the target process, because it is exposed to
userspace. Thus it should work as safe as possible.

Anyway, thanks for reporting with the test program.

> 
> [1] 
> https://sashiko.dev/#/patchset/20260610013658.1837963-1-wutengda%40huaweicloud.com
> 
>  > diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
>  > index 5a8bdf88999a..1e7fdebe3cd5 100644
>  > --- a/kernel/trace/rethook.c
>  > +++ b/kernel/trace/rethook.c
>  > @@ -250,9 +251,6 @@ unsigned long rethook_find_ret_addr(struct 
> task_struct *tsk, unsigned long frame
>  >      if (WARN_ON_ONCE(!cur))
>  >          return 0;
>  >
>  > -    if (tsk != current && task_is_running(tsk))
>  > -        return 0;
>  > -
>  >      do {
>  >          ret = __rethook_find_ret_addr(tsk, cur);
>  >          if (!ret)
> 
> The commit message states:
> 
>  > The iteration is already safe from crashes because
>  > unwind_next_frame() holds RCU and rethook_node structures are
>  > RCU-freed; even if the iteration goes off the rails and returns
>  > invalid information, it will not crash.
> 
> There are two problems with this claim, both reproducible.
> 
> **Problem 1: stack-out-of-bounds in unwind_next_frame itself**
> 
> The PoC below reliably triggers the following KASAN panic — not in the
> rethook list traversal, but inside unwind_next_frame():
> 
> [ 1833.494623] BUG: KASAN: stack-out-of-bounds in 
> unwind_next_frame+0x861/0x2080
> [ 1833.494651] Read of size 2 at addr ffffc90003e6f5f0 by task poc/9854
> [ 1833.494707] Call Trace:
> [ 1833.494719]  dump_stack_lvl+0x116/0x1f0
> [ 1833.494743]  print_report+0xf4/0x600
> [ 1833.494788]  kasan_report+0xe0/0x110
> [ 1833.494836]  unwind_next_frame+0x861/0x2080
> [ 1833.494948]  arch_stack_walk+0x99/0x100
> [ 1833.495000]  stack_trace_save_tsk+0x16a/0x200
> [ 1833.495054]  proc_pid_stack+0x173/0x2b0
> [ 1833.495103]  seq_read_iter+0x519/0x12d0
> [ 1833.495166]  seq_read+0x3b7/0x590
> [ 1833.495297]  vfs_read+0x1f5/0xd20
> [ 1833.495497]  ksys_read+0x135/0x250
> [ 1833.495549]  do_syscall_64+0x129/0x850
> [ 1833.495566]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [ 1833.498894] Kernel panic - not syncing: KASAN: panic_on_warn set ...
> 
> page last free pid 9737 tgid 9737 stack trace:
>   do_sys_openat2+0xbf/0x260          <-- target task inside kretprobe
>   __x64_sys_openat+0x179/0x210
> 
> This crash has nothing to do with rethook_node lifetimes or RCU.  It
> happens because the ORC unwinder reads stack memory while the target
> task concurrently executes a kretprobe trampoline that modifies return
> addresses.  The unwinder follows corrupted frame data past valid stack
> boundaries.  RCU protection of rethook_node structures is irrelevant —
> this crash occurs at the stack frame interpretation level, before any
> rethook list traversal.

OK, in that case, I think we should not allow list traversal.
I think without freezing the target task, accessing /proc/<pid>/stack
is potentially dangerous. Shouldn't we fix this at first?

> 
> The old task_is_running() check prevented the unwinder from attempting
> to unwind a running task's stack in the first place.
> 
> **Problem 2: use-after-free via rethook_node recycling**
> 
> Even if the stack-out-of-bounds above were addressed, a second crash
> path exists in the rethook list traversal itself.
> 
> rethook_recycle() immediately pushes nodes back to the objpool without
> an RCU grace period:
> 
>    kernel/trace/rethook.c:
>    void rethook_recycle(struct rethook_node *node)
>    {
>            ...
>            objpool_push(node, &node->rethook->pool);
>    }
> 
> Meanwhile, unwind_next_frame() in arch/x86/kernel/unwind_orc.c drops
> RCU between frames while the cursor (*cur) persists across iterations:
> 
>    arch/x86/kernel/unwind_orc.c:
>    bool unwind_next_frame(...)
>    {
>            ...
>            guard(rcu)();    // RCU held for one frame
>            ...
>    }                        // RCU dropped here
> 
> When the unwinder calls __rethook_find_ret_addr() in the next frame
> iteration, it does:
> 
>    struct llist_node *first = tsk->rethooks.first;
>    ...
>    *cur = first;
>    ...
>    node = node->next;       // node may have been recycled
> 
> If the target task returns from a probed function between frames, its
> rethook_node is recycled and can be instantly reallocated to another
> task.  The unwinder's stale cursor then dereferences a freed pointer,
> leading to use-after-free.


OK, this is still real problem. We should use call_rcu() to return
the object back to objpool.

Thanks!

> 
> ## Reproducer
> 
> The PoC sets up a kretprobe on do_sys_openat2, creates hot-loop threads
> calling open(), and concurrently reads /proc/<tid>/stack.  The race
> triggers within seconds (Problem 1 above; Problem 2 may reproduce on
> kernels without KASAN or with different timing).
> 
> Build:  gcc -static -pthread -o poc poc.c
> Run:    ./poc [runtime_seconds]
> Needs:  root, CONFIG_KASAN=y
> 
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/wait.h>
> #include <sys/syscall.h>
> #include <sched.h>
> #include <fcntl.h>
> #include <errno.h>
> #include <signal.h>
> #include <pthread.h>
> #include <dirent.h>
> 
> #define TRACE "/sys/kernel/tracing"
> 
> volatile int stop = 0;
> 
> static int tfs(const char *f, const char *b)
> {
>      char p[256]; int fd, r;
>      snprintf(p, 256, "%s/%s", TRACE, f);
>      fd = open(p, O_WRONLY | O_TRUNC);
>      if (fd < 0) {
>          system("mount -t tracefs tracefs /sys/kernel/tracing 2>/dev/null");
>          usleep(50000);
>          fd = open(p, O_WRONLY | O_TRUNC);
>      }
>      if (fd < 0) return -1;
>      r = write(fd, b, strlen(b));
>      close(fd);
>      return r < 0 ? -1 : 0;
> }
> 
> void *hot_thread(void *arg)
> {
>      while (!__atomic_load_n(&stop, __ATOMIC_RELAXED)) {
>          int fd = open("/dev/null", O_RDONLY);
>          if (fd >= 0) close(fd);
>      }
>      return NULL;
> }
> 
> void *reader_thread(void *arg)
> {
>      pid_t target = *(pid_t *)arg;
>      char path[64], buf[8192];
>      snprintf(path, 64, "/proc/%d/stack", target);
>      while (!__atomic_load_n(&stop, __ATOMIC_RELAXED)) {
>          int fd = open(path, O_RDONLY);
>          if (fd >= 0) { read(fd, buf, 8191); close(fd); }
>      }
>      return NULL;
> }
> 
> void sigh(int s) { stop = 1; }
> 
> int main(int argc, char *argv[])
> {
>      int runtime = 120;
>      if (argc > 1) runtime = atoi(argv[1]);
> 
>      printf("rethook race PoC\n");
>      if (geteuid()) { printf("root needed\n"); return 1; }
>      signal(SIGINT, sigh);
> 
>      pthread_t hot[4], rdr[4];
>      pid_t hot_tids[4];
>      int pairs = 4;
> 
>      for (int c = 0; c < runtime / 5 && !stop; c++) {
>          tfs("events/kprobes/myretprobe/enable", "0");
>          tfs("kprobe_events", "-:myretprobe");
>          usleep(100);
>          tfs("kprobe_events", "r:myretprobe do_sys_openat2 $retval");
>          tfs("events/kprobes/myretprobe/enable", "1");
> 
>          pid_t main_tid = syscall(SYS_gettid);
> 
>          for (int i = 0; i < pairs; i++)
>              pthread_create(&hot[i], NULL, hot_thread, NULL);
> 
>          usleep(300000);
> 
>          {
>              DIR *d = opendir("/proc/self/task");
>              int cnt = 0;
>              if (d) {
>                  struct dirent *de;
>                  while ((de = readdir(d)) != NULL && cnt < pairs) {
>                      pid_t t = atoi(de->d_name);
>                      if (t > 0 && t != main_tid)
>                          hot_tids[cnt++] = t;
>                  }
>                  closedir(d);
>              }
>              for (int i = 0; i < cnt; i++)
>                  pthread_create(&rdr[i], NULL, reader_thread, &hot_tids[i]);
>          }
> 
>          printf("round %d\n", c);
>          sleep(5);
> 
>          stop = 1;
>          usleep(100000);
> 
>          for (int i = 0; i < pairs; i++) pthread_join(hot[i], NULL);
>          for (int i = 0; i < pairs; i++) pthread_join(rdr[i], NULL);
> 
>          stop = 0;
>          usleep(1000);
>      }
> 
>      tfs("events/kprobes/myretprobe/enable", "0");
>      tfs("kprobe_events", "-:myretprobe");
>      printf("Done\n");
>      return 0;
> }
> 
> ## Summary
> 
> The v4 commit message claims the iteration "will not crash," but the PoC
> demonstrates a reproducible KASAN panic:
> 
> 1. stack-out-of-bounds in unwind_next_frame (ORC unwinder reads
>     concurrently-modified stack frames of a running task)
> 
> 2. Potential use-after-free in __rethook_find_ret_addr (rethook nodes
>     recycled without RCU grace period, cursor persists across RCU drops)
> 
> The old task_is_running() check was racy but served as a practical
> safety net.  Removing it without adding equivalent protection in the
> callers (proc_pid_stack, BPF stack walkers) exposes users to kernel
> panics via /proc/<pid>/stack on any task running a kretprobe.
> 
> Thanks,
> Xiao
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* [PATCH] rtla: Simplify osnoise tracer option setting code
From: Tomas Glozar @ 2026-06-12 11:51 UTC (permalink / raw)
  To: Steven Rostedt, Tomas Glozar
  Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
	Wander Lairson Costa, LKML, linux-trace-kernel

Each osnoise tracer option (in /sys/kernel/tracing/osnoise) used by RTLA
requires four functions to be defined:

- static osnoise_get_<opt>() - to get the current value of the option
  and save it into struct osnoise_context's orig_<opt> field,
- osnoise_set_<opt>() - to set the value of the option requested by the
  user after reading and saving the original with osnoise_get_<opt>(),
  and save it into <opt> field of struct osnoise_context,
- osnoise_restore_<opt>() - restore the value recorded in orig_<opt>,
- static osnoise_put_<opt>() - restore the value recorded in orig_<opt>
  and update <opt> to reflect that.

The logic is duplicated for all the options, except for cpus (which is
the only string option) and period/runtime (which are handled together
and feature extra checks).

Deduplicate the logic using a set of macros featuring the X macro
pattern, defined in src/common.h:

- OSNOISE_LL_OPTIONS, which invokes OSNOISE_LL_OPTION macro for all
  "long long" options,
- OSNOISE_FLAG_OPTIONS, which invokes OSNOISE_FLAG_OPTION macro for all
  flag (boolean values in osnoise/options file) options.

The list macros are then invoked in four places:

- for struct osnoise_context fields in src/common.h,
- for function declarations, moved into src/common.h from
  src/osnoise.h,
- for function definitions in src/osnoise.c,
- for context initialization and restoration, in osnoise_context_alloc()
  and osnoise_put_context(), both in src/osnoise.c.

OSNOISE_LL_OPTIONS takes three options: name - struct osnoise_context
field name (written "<opt>" above), path - filename inside
/sys/kernel/tracing/osnoise passed to libtracefs, and init_val - initial
value of struct fields, corresponding to an otherwise invalid option
(some options use OSNOISE_OPTION_INIT_VAL = -1, some use
OSNOISE_TIME_INIT_VAL = 0).

OSNOISE_FLAG_OPTION is similar, but instead of path, it takes the option
string inside /sys/kernel/tracing/osnoise/options (opt_string), and no
init_val, as it is purely boolean (0 or 1).

Previously, for options timerlat_align and osnoise_workload, the return
value of osnoise_set_<opt>() distinguished between -2 (option cannot be
set) and -1 (option not present). This distinction is expanded for all
options for consistency; for most options, it is currently not used,
only osnoise_workload is implemented to avoid error on -1 on older RTLA
versions.

The change overall has two main benefits: it makes it much simpler to
add a new option, as well as to change existing logic consistently for
all of them. It also makes the code shorter by a bit over 500 lines.

There is no intentional user-visible change coming from the refactoring.
osnoise_restore_<opt>() for flag options now sets <opt> instead of
orig_<opt>. As the latter is also set by osnoise_put_<opt>(), plus long
long options set <opt> in both the old and new implementation, the old
behavior was likely a mistake, and should not matter for now, as the
options are only restored once at the end of tracing and neither <opt>
nor orig_<opt> field is read again.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---
 tools/tracing/rtla/src/common.h  |  79 +--
 tools/tracing/rtla/src/osnoise.c | 836 ++++++-------------------------
 tools/tracing/rtla/src/osnoise.h |  22 -
 3 files changed, 188 insertions(+), 749 deletions(-)

diff --git a/tools/tracing/rtla/src/common.h b/tools/tracing/rtla/src/common.h
index 04b287a03f6d..47233b0781c7 100644
--- a/tools/tracing/rtla/src/common.h
+++ b/tools/tracing/rtla/src/common.h
@@ -6,9 +6,35 @@
 #include "trace.h"
 #include "utils.h"
 
+/*
+ * OSNOISE_LL_OPTIONS - list of long long options backed by tracefs files.
+ *   OSNOISE_LL_OPTION(field_name, tracefs_path, init_value)
+ *
+ * OSNOISE_FLAG_OPTIONS - list of boolean options backed by osnoise/options.
+ *   OSNOISE_FLAG_OPTION(field_name, option_string)
+ */
+#define OSNOISE_LL_OPTIONS \
+	OSNOISE_LL_OPTION(stop_us,		"osnoise/stop_tracing_us",	 OSNOISE_OPTION_INIT_VAL) \
+	OSNOISE_LL_OPTION(stop_total_us,	"osnoise/stop_tracing_total_us", OSNOISE_OPTION_INIT_VAL) \
+	OSNOISE_LL_OPTION(print_stack,		"osnoise/print_stack",		 OSNOISE_OPTION_INIT_VAL) \
+	OSNOISE_LL_OPTION(tracing_thresh,	"tracing_thresh",		 OSNOISE_OPTION_INIT_VAL) \
+	OSNOISE_LL_OPTION(timerlat_period_us,	"osnoise/timerlat_period_us",	 OSNOISE_TIME_INIT_VAL)   \
+	OSNOISE_LL_OPTION(timerlat_align_us,	"osnoise/timerlat_align_us",	 OSNOISE_OPTION_INIT_VAL)
+
+#define OSNOISE_FLAG_OPTIONS \
+	OSNOISE_FLAG_OPTION(irq_disable,	"OSNOISE_IRQ_DISABLE") \
+	OSNOISE_FLAG_OPTION(workload,		"OSNOISE_WORKLOAD") \
+	OSNOISE_FLAG_OPTION(timerlat_align,	"TIMERLAT_ALIGN")
+
 /*
  * osnoise_context - read, store, write, restore osnoise configs.
  */
+#define OSNOISE_LL_OPTION(name, path, init_val)		\
+	long long		orig_##name;		\
+	long long		name;
+#define OSNOISE_FLAG_OPTION(name, option_str)		\
+	int			orig_opt_##name;	\
+	int			opt_##name;
 struct osnoise_context {
 	int			flags;
 	int			ref;
@@ -24,42 +50,11 @@ struct osnoise_context {
 	unsigned long long	orig_period_us;
 	unsigned long long	period_us;
 
-	/* 0 as init value */
-	long long		orig_timerlat_period_us;
-	long long		timerlat_period_us;
-
-	/* 0 as init value */
-	long long		orig_tracing_thresh;
-	long long		tracing_thresh;
-
-	/* -1 as init value because 0 is disabled */
-	long long		orig_stop_us;
-	long long		stop_us;
-
-	/* -1 as init value because 0 is disabled */
-	long long		orig_stop_total_us;
-	long long		stop_total_us;
-
-	/* -1 as init value because 0 is disabled */
-	long long		orig_print_stack;
-	long long		print_stack;
-
-	/* -1 as init value because 0 is off */
-	int			orig_opt_irq_disable;
-	int			opt_irq_disable;
-
-	/* -1 as init value because 0 is off */
-	int			orig_opt_workload;
-	int			opt_workload;
-
-	/* -1 as init value because 0 is off */
-	int			orig_opt_timerlat_align;
-	int			opt_timerlat_align;
-
-	/* 0 as init value */
-	unsigned long long	orig_timerlat_align_us;
-	unsigned long long	timerlat_align_us;
+	OSNOISE_LL_OPTIONS
+	OSNOISE_FLAG_OPTIONS
 };
+#undef OSNOISE_LL_OPTION
+#undef OSNOISE_FLAG_OPTION
 
 extern volatile int stop_tracing;
 
@@ -173,15 +168,21 @@ common_threshold_handler(const struct osnoise_tool *tool);
 int osnoise_set_cpus(struct osnoise_context *context, char *cpus);
 void osnoise_restore_cpus(struct osnoise_context *context);
 
-int osnoise_set_workload(struct osnoise_context *context, bool onoff);
+#define OSNOISE_LL_OPTION(name, path, init_val)					\
+	int osnoise_set_##name(struct osnoise_context *context, long long name);	\
+	void osnoise_restore_##name(struct osnoise_context *context);
+#define OSNOISE_FLAG_OPTION(name, option_str)					\
+	int osnoise_set_##name(struct osnoise_context *context, bool onoff);	\
+	void osnoise_restore_##name(struct osnoise_context *context);
+OSNOISE_LL_OPTIONS
+OSNOISE_FLAG_OPTIONS
+#undef OSNOISE_LL_OPTION
+#undef OSNOISE_FLAG_OPTION
 
 void osnoise_destroy_tool(struct osnoise_tool *top);
 struct osnoise_tool *osnoise_init_tool(char *tool_name);
 struct osnoise_tool *osnoise_init_trace_tool(const char *tracer);
 bool osnoise_trace_is_off(struct osnoise_tool *tool, struct osnoise_tool *record);
-int osnoise_set_stop_us(struct osnoise_context *context, long long stop_us);
-int osnoise_set_stop_total_us(struct osnoise_context *context,
-			      long long stop_total_us);
 
 int common_apply_config(struct osnoise_tool *tool, struct common_params *params);
 int top_main_loop(struct osnoise_tool *tool);
diff --git a/tools/tracing/rtla/src/osnoise.c b/tools/tracing/rtla/src/osnoise.c
index 4ff5dad013b1..7f15d00b431e 100644
--- a/tools/tracing/rtla/src/osnoise.c
+++ b/tools/tracing/rtla/src/osnoise.c
@@ -345,480 +345,73 @@ void osnoise_put_runtime_period(struct osnoise_context *context)
 }
 
 /*
- * osnoise_get_timerlat_period_us - read and save the original "timerlat_period_us"
- */
-static long long
-osnoise_get_timerlat_period_us(struct osnoise_context *context)
-{
-	long long timerlat_period_us;
-
-	if (context->timerlat_period_us != OSNOISE_TIME_INIT_VAL)
-		return context->timerlat_period_us;
-
-	if (context->orig_timerlat_period_us != OSNOISE_TIME_INIT_VAL)
-		return context->orig_timerlat_period_us;
-
-	timerlat_period_us = osnoise_read_ll_config("osnoise/timerlat_period_us");
-	if (timerlat_period_us < 0)
-		goto out_err;
-
-	context->orig_timerlat_period_us = timerlat_period_us;
-	return timerlat_period_us;
-
-out_err:
-	return OSNOISE_TIME_INIT_VAL;
-}
-
-/*
- * osnoise_set_timerlat_period_us - set "timerlat_period_us"
- */
-int osnoise_set_timerlat_period_us(struct osnoise_context *context, long long timerlat_period_us)
-{
-	long long curr_timerlat_period_us = osnoise_get_timerlat_period_us(context);
-	int retval;
-
-	if (curr_timerlat_period_us == OSNOISE_TIME_INIT_VAL)
-		return -1;
-
-	retval = osnoise_write_ll_config("osnoise/timerlat_period_us", timerlat_period_us);
-	if (retval < 0)
-		return -1;
-
-	context->timerlat_period_us = timerlat_period_us;
-
-	return 0;
-}
-
-/*
- * osnoise_restore_timerlat_period_us - restore "timerlat_period_us"
- */
-void osnoise_restore_timerlat_period_us(struct osnoise_context *context)
-{
-	int retval;
-
-	if (context->orig_timerlat_period_us == OSNOISE_TIME_INIT_VAL)
-		return;
-
-	if (context->orig_timerlat_period_us == context->timerlat_period_us)
-		goto out_done;
-
-	retval = osnoise_write_ll_config("osnoise/timerlat_period_us", context->orig_timerlat_period_us);
-	if (retval < 0)
-		err_msg("Could not restore original osnoise timerlat_period_us\n");
-
-out_done:
-	context->timerlat_period_us = OSNOISE_TIME_INIT_VAL;
-}
-
-/*
- * osnoise_put_timerlat_period_us - restore original values and cleanup data
- */
-void osnoise_put_timerlat_period_us(struct osnoise_context *context)
-{
-	osnoise_restore_timerlat_period_us(context);
-
-	if (context->orig_timerlat_period_us == OSNOISE_TIME_INIT_VAL)
-		return;
-
-	context->orig_timerlat_period_us = OSNOISE_TIME_INIT_VAL;
-}
-
-/*
- * osnoise_get_timerlat_align_us - read and save the original "timerlat_align_us"
- */
-static long long
-osnoise_get_timerlat_align_us(struct osnoise_context *context)
-{
-	long long timerlat_align_us;
-
-	if (context->timerlat_align_us != OSNOISE_OPTION_INIT_VAL)
-		return context->timerlat_align_us;
-
-	if (context->orig_timerlat_align_us != OSNOISE_OPTION_INIT_VAL)
-		return context->orig_timerlat_align_us;
-
-	timerlat_align_us = osnoise_read_ll_config("osnoise/timerlat_align_us");
-	if (timerlat_align_us < 0)
-		goto out_err;
-
-	context->orig_timerlat_align_us = timerlat_align_us;
-	return timerlat_align_us;
-
-out_err:
-	return OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_set_timerlat_align_us - set "timerlat_align_us"
- */
-int osnoise_set_timerlat_align_us(struct osnoise_context *context, long long timerlat_align_us)
-{
-	long long curr_timerlat_align_us = osnoise_get_timerlat_align_us(context);
-	int retval;
-
-	if (curr_timerlat_align_us == OSNOISE_OPTION_INIT_VAL)
-		return -1;
-
-	retval = osnoise_write_ll_config("osnoise/timerlat_align_us", timerlat_align_us);
-	if (retval < 0)
-		return -1;
-
-	context->timerlat_align_us = timerlat_align_us;
-
-	return 0;
-}
-
-/*
- * osnoise_restore_timerlat_align_us - restore "timerlat_align_us"
- */
-void osnoise_restore_timerlat_align_us(struct osnoise_context *context)
-{
-	int retval;
-
-	if (context->orig_timerlat_align_us == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	if (context->orig_timerlat_align_us == context->timerlat_align_us)
-		goto out_done;
-
-	retval = osnoise_write_ll_config("osnoise/timerlat_align_us",
-				   context->orig_timerlat_align_us);
-	if (retval < 0)
-		err_msg("Could not restore original osnoise timerlat_align_us\n");
-
-out_done:
-	context->timerlat_align_us = OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_put_timerlat_align_us - restore original values and cleanup data
- */
-void osnoise_put_timerlat_align_us(struct osnoise_context *context)
-{
-	osnoise_restore_timerlat_align_us(context);
-
-	if (context->orig_timerlat_align_us == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	context->orig_timerlat_align_us = OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_get_stop_us - read and save the original "stop_tracing_us"
- */
-static long long
-osnoise_get_stop_us(struct osnoise_context *context)
-{
-	long long stop_us;
-
-	if (context->stop_us != OSNOISE_OPTION_INIT_VAL)
-		return context->stop_us;
-
-	if (context->orig_stop_us != OSNOISE_OPTION_INIT_VAL)
-		return context->orig_stop_us;
-
-	stop_us = osnoise_read_ll_config("osnoise/stop_tracing_us");
-	if (stop_us < 0)
-		goto out_err;
-
-	context->orig_stop_us = stop_us;
-	return stop_us;
-
-out_err:
-	return OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_set_stop_us - set "stop_tracing_us"
- */
-int osnoise_set_stop_us(struct osnoise_context *context, long long stop_us)
-{
-	long long curr_stop_us = osnoise_get_stop_us(context);
-	int retval;
-
-	if (curr_stop_us == OSNOISE_OPTION_INIT_VAL)
-		return -1;
-
-	retval = osnoise_write_ll_config("osnoise/stop_tracing_us", stop_us);
-	if (retval < 0)
-		return -1;
-
-	context->stop_us = stop_us;
-
-	return 0;
-}
-
-/*
- * osnoise_restore_stop_us - restore the original "stop_tracing_us"
- */
-void osnoise_restore_stop_us(struct osnoise_context *context)
-{
-	int retval;
-
-	if (context->orig_stop_us == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	if (context->orig_stop_us == context->stop_us)
-		goto out_done;
-
-	retval = osnoise_write_ll_config("osnoise/stop_tracing_us", context->orig_stop_us);
-	if (retval < 0)
-		err_msg("Could not restore original osnoise stop_us\n");
-
-out_done:
-	context->stop_us = OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_put_stop_us - restore original values and cleanup data
- */
-void osnoise_put_stop_us(struct osnoise_context *context)
-{
-	osnoise_restore_stop_us(context);
-
-	if (context->orig_stop_us == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	context->orig_stop_us = OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_get_stop_total_us - read and save the original "stop_tracing_total_us"
- */
-static long long
-osnoise_get_stop_total_us(struct osnoise_context *context)
-{
-	long long stop_total_us;
-
-	if (context->stop_total_us != OSNOISE_OPTION_INIT_VAL)
-		return context->stop_total_us;
-
-	if (context->orig_stop_total_us != OSNOISE_OPTION_INIT_VAL)
-		return context->orig_stop_total_us;
-
-	stop_total_us = osnoise_read_ll_config("osnoise/stop_tracing_total_us");
-	if (stop_total_us < 0)
-		goto out_err;
-
-	context->orig_stop_total_us = stop_total_us;
-	return stop_total_us;
-
-out_err:
-	return OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_set_stop_total_us - set "stop_tracing_total_us"
- */
-int osnoise_set_stop_total_us(struct osnoise_context *context, long long stop_total_us)
-{
-	long long curr_stop_total_us = osnoise_get_stop_total_us(context);
-	int retval;
-
-	if (curr_stop_total_us == OSNOISE_OPTION_INIT_VAL)
-		return -1;
-
-	retval = osnoise_write_ll_config("osnoise/stop_tracing_total_us", stop_total_us);
-	if (retval < 0)
-		return -1;
-
-	context->stop_total_us = stop_total_us;
-
-	return 0;
-}
-
-/*
- * osnoise_restore_stop_total_us - restore the original "stop_tracing_total_us"
- */
-void osnoise_restore_stop_total_us(struct osnoise_context *context)
-{
-	int retval;
-
-	if (context->orig_stop_total_us == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	if (context->orig_stop_total_us == context->stop_total_us)
-		goto out_done;
-
-	retval = osnoise_write_ll_config("osnoise/stop_tracing_total_us",
-			context->orig_stop_total_us);
-	if (retval < 0)
-		err_msg("Could not restore original osnoise stop_total_us\n");
-
-out_done:
-	context->stop_total_us = OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_put_stop_total_us - restore original values and cleanup data
- */
-void osnoise_put_stop_total_us(struct osnoise_context *context)
-{
-	osnoise_restore_stop_total_us(context);
-
-	if (context->orig_stop_total_us == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	context->orig_stop_total_us = OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_get_print_stack - read and save the original "print_stack"
- */
-static long long
-osnoise_get_print_stack(struct osnoise_context *context)
-{
-	long long print_stack;
-
-	if (context->print_stack != OSNOISE_OPTION_INIT_VAL)
-		return context->print_stack;
-
-	if (context->orig_print_stack != OSNOISE_OPTION_INIT_VAL)
-		return context->orig_print_stack;
-
-	print_stack = osnoise_read_ll_config("osnoise/print_stack");
-	if (print_stack < 0)
-		goto out_err;
-
-	context->orig_print_stack = print_stack;
-	return print_stack;
-
-out_err:
-	return OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_set_print_stack - set "print_stack"
- */
-int osnoise_set_print_stack(struct osnoise_context *context, long long print_stack)
-{
-	long long curr_print_stack = osnoise_get_print_stack(context);
-	int retval;
-
-	if (curr_print_stack == OSNOISE_OPTION_INIT_VAL)
-		return -1;
-
-	retval = osnoise_write_ll_config("osnoise/print_stack", print_stack);
-	if (retval < 0)
-		return -1;
-
-	context->print_stack = print_stack;
-
-	return 0;
-}
-
-/*
- * osnoise_restore_print_stack - restore the original "print_stack"
- */
-void osnoise_restore_print_stack(struct osnoise_context *context)
-{
-	int retval;
-
-	if (context->orig_print_stack == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	if (context->orig_print_stack == context->print_stack)
-		goto out_done;
-
-	retval = osnoise_write_ll_config("osnoise/print_stack", context->orig_print_stack);
-	if (retval < 0)
-		err_msg("Could not restore original osnoise print_stack\n");
-
-out_done:
-	context->print_stack = OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_put_print_stack - restore original values and cleanup data
- */
-void osnoise_put_print_stack(struct osnoise_context *context)
-{
-	osnoise_restore_print_stack(context);
-
-	if (context->orig_print_stack == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	context->orig_print_stack = OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_get_tracing_thresh - read and save the original "tracing_thresh"
- */
-static long long
-osnoise_get_tracing_thresh(struct osnoise_context *context)
-{
-	long long tracing_thresh;
-
-	if (context->tracing_thresh != OSNOISE_OPTION_INIT_VAL)
-		return context->tracing_thresh;
-
-	if (context->orig_tracing_thresh != OSNOISE_OPTION_INIT_VAL)
-		return context->orig_tracing_thresh;
-
-	tracing_thresh = osnoise_read_ll_config("tracing_thresh");
-	if (tracing_thresh < 0)
-		goto out_err;
-
-	context->orig_tracing_thresh = tracing_thresh;
-	return tracing_thresh;
-
-out_err:
-	return OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_set_tracing_thresh - set "tracing_thresh"
- */
-int osnoise_set_tracing_thresh(struct osnoise_context *context, long long tracing_thresh)
-{
-	long long curr_tracing_thresh = osnoise_get_tracing_thresh(context);
-	int retval;
-
-	if (curr_tracing_thresh == OSNOISE_OPTION_INIT_VAL)
-		return -1;
-
-	retval = osnoise_write_ll_config("tracing_thresh", tracing_thresh);
-	if (retval < 0)
-		return -1;
-
-	context->tracing_thresh = tracing_thresh;
-
-	return 0;
-}
-
-/*
- * osnoise_restore_tracing_thresh - restore the original "tracing_thresh"
- */
-void osnoise_restore_tracing_thresh(struct osnoise_context *context)
-{
-	int retval;
-
-	if (context->orig_tracing_thresh == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	if (context->orig_tracing_thresh == context->tracing_thresh)
-		goto out_done;
-
-	retval = osnoise_write_ll_config("tracing_thresh", context->orig_tracing_thresh);
-	if (retval < 0)
-		err_msg("Could not restore original tracing_thresh\n");
-
-out_done:
-	context->tracing_thresh = OSNOISE_OPTION_INIT_VAL;
-}
-
-/*
- * osnoise_put_tracing_thresh - restore original values and cleanup data
- */
-void osnoise_put_tracing_thresh(struct osnoise_context *context)
-{
-	osnoise_restore_tracing_thresh(context);
-
-	if (context->orig_tracing_thresh == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	context->orig_tracing_thresh = OSNOISE_OPTION_INIT_VAL;
-}
+ * Long long option get/set/restore/put functions, generated from OSNOISE_LL_OPTIONS.
+ */
+#define OSNOISE_LL_OPTION(name, path, init_val)						\
+static long long									\
+osnoise_get_##name(struct osnoise_context *context)					\
+{											\
+	long long name;									\
+											\
+	if (context->name != (init_val))						\
+		return context->name;							\
+											\
+	if (context->orig_##name != (init_val))						\
+		return context->orig_##name;						\
+											\
+	name = osnoise_read_ll_config(path);						\
+	if (name < 0)									\
+		return (init_val);							\
+											\
+	context->orig_##name = name;							\
+	return name;									\
+}											\
+											\
+int osnoise_set_##name(struct osnoise_context *context, long long name)			\
+{											\
+	long long curr = osnoise_get_##name(context);					\
+	int retval;									\
+											\
+	if (curr == (init_val))								\
+		return -1;								\
+											\
+	retval = osnoise_write_ll_config(path, name);					\
+	if (retval < 0)									\
+		return -2;								\
+											\
+	context->name = name;								\
+	return 0;									\
+}											\
+											\
+void osnoise_restore_##name(struct osnoise_context *context)				\
+{											\
+	int retval;									\
+											\
+	if (context->orig_##name == (init_val))						\
+		return;									\
+											\
+	if (context->orig_##name == context->name)					\
+		goto out_done_##name;							\
+											\
+	retval = osnoise_write_ll_config(path, context->orig_##name);			\
+	if (retval < 0)									\
+		err_msg("Could not restore original " #name "\n");			\
+											\
+out_done_##name:									\
+	context->name = (init_val);							\
+}											\
+											\
+static void osnoise_put_##name(struct osnoise_context *context)				\
+{											\
+	osnoise_restore_##name(context);						\
+											\
+	if (context->orig_##name == (init_val))						\
+		return;									\
+											\
+	context->orig_##name = (init_val);						\
+}
+OSNOISE_LL_OPTIONS
+#undef OSNOISE_LL_OPTION
 
 static int osnoise_options_get_option(char *option)
 {
@@ -866,188 +459,70 @@ static int osnoise_options_set_option(char *option, bool onoff)
 	return tracefs_instance_file_write(NULL, "osnoise/options", no_option);
 }
 
-static int osnoise_get_irq_disable(struct osnoise_context *context)
-{
-	if (context->opt_irq_disable != OSNOISE_OPTION_INIT_VAL)
-		return context->opt_irq_disable;
-
-	if (context->orig_opt_irq_disable != OSNOISE_OPTION_INIT_VAL)
-		return context->orig_opt_irq_disable;
-
-	context->orig_opt_irq_disable = osnoise_options_get_option("OSNOISE_IRQ_DISABLE");
-
-	return context->orig_opt_irq_disable;
-}
-
-int osnoise_set_irq_disable(struct osnoise_context *context, bool onoff)
-{
-	int opt_irq_disable = osnoise_get_irq_disable(context);
-	int retval;
-
-	if (opt_irq_disable == OSNOISE_OPTION_INIT_VAL)
-		return -1;
-
-	if (opt_irq_disable == onoff)
-		return 0;
-
-	retval = osnoise_options_set_option("OSNOISE_IRQ_DISABLE", onoff);
-	if (retval < 0)
-		return -1;
-
-	context->opt_irq_disable = onoff;
-
-	return 0;
-}
-
-static void osnoise_restore_irq_disable(struct osnoise_context *context)
-{
-	int retval;
-
-	if (context->orig_opt_irq_disable == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	if (context->orig_opt_irq_disable == context->opt_irq_disable)
-		goto out_done;
-
-	retval = osnoise_options_set_option("OSNOISE_IRQ_DISABLE", context->orig_opt_irq_disable);
-	if (retval < 0)
-		err_msg("Could not restore original OSNOISE_IRQ_DISABLE option\n");
-
-out_done:
-	context->orig_opt_irq_disable = OSNOISE_OPTION_INIT_VAL;
-}
-
-static void osnoise_put_irq_disable(struct osnoise_context *context)
-{
-	osnoise_restore_irq_disable(context);
-
-	if (context->orig_opt_irq_disable == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	context->orig_opt_irq_disable = OSNOISE_OPTION_INIT_VAL;
-}
-
-static int osnoise_get_workload(struct osnoise_context *context)
-{
-	if (context->opt_workload != OSNOISE_OPTION_INIT_VAL)
-		return context->opt_workload;
-
-	if (context->orig_opt_workload != OSNOISE_OPTION_INIT_VAL)
-		return context->orig_opt_workload;
-
-	context->orig_opt_workload = osnoise_options_get_option("OSNOISE_WORKLOAD");
-
-	return context->orig_opt_workload;
-}
-
-int osnoise_set_workload(struct osnoise_context *context, bool onoff)
-{
-	int opt_workload = osnoise_get_workload(context);
-	int retval;
-
-	if (opt_workload == OSNOISE_OPTION_INIT_VAL)
-		return -1;
-
-	if (opt_workload == onoff)
-		return 0;
-
-	retval = osnoise_options_set_option("OSNOISE_WORKLOAD", onoff);
-	if (retval < 0)
-		return -2;
-
-	context->opt_workload = onoff;
-
-	return 0;
-}
-
-static void osnoise_restore_workload(struct osnoise_context *context)
-{
-	int retval;
-
-	if (context->orig_opt_workload == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	if (context->orig_opt_workload == context->opt_workload)
-		goto out_done;
-
-	retval = osnoise_options_set_option("OSNOISE_WORKLOAD", context->orig_opt_workload);
-	if (retval < 0)
-		err_msg("Could not restore original OSNOISE_WORKLOAD option\n");
-
-out_done:
-	context->orig_opt_workload = OSNOISE_OPTION_INIT_VAL;
-}
-
-static void osnoise_put_workload(struct osnoise_context *context)
-{
-	osnoise_restore_workload(context);
-
-	if (context->orig_opt_workload == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	context->orig_opt_workload = OSNOISE_OPTION_INIT_VAL;
-}
-
-static int osnoise_get_timerlat_align(struct osnoise_context *context)
-{
-	if (context->opt_timerlat_align != OSNOISE_OPTION_INIT_VAL)
-		return context->opt_timerlat_align;
-
-	if (context->orig_opt_timerlat_align != OSNOISE_OPTION_INIT_VAL)
-		return context->orig_opt_timerlat_align;
-
-	context->orig_opt_timerlat_align = osnoise_options_get_option("TIMERLAT_ALIGN");
-
-	return context->orig_opt_timerlat_align;
-}
-
-int osnoise_set_timerlat_align(struct osnoise_context *context, bool onoff)
-{
-	int opt_timerlat_align = osnoise_get_timerlat_align(context);
-	int retval;
-
-	if (opt_timerlat_align == OSNOISE_OPTION_INIT_VAL)
-		return -1;
-
-	if (opt_timerlat_align == onoff)
-		return 0;
-
-	retval = osnoise_options_set_option("TIMERLAT_ALIGN", onoff);
-	if (retval < 0)
-		return -2;
-
-	context->opt_timerlat_align = onoff;
-
-	return 0;
-}
-
-static void osnoise_restore_timerlat_align(struct osnoise_context *context)
-{
-	int retval;
-
-	if (context->orig_opt_timerlat_align == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	if (context->orig_opt_timerlat_align == context->opt_timerlat_align)
-		goto out_done;
-
-	retval = osnoise_options_set_option("TIMERLAT_ALIGN", context->orig_opt_timerlat_align);
-	if (retval < 0)
-		err_msg("Could not restore original TIMERLAT_ALIGN option\n");
-
-out_done:
-	context->orig_opt_timerlat_align = OSNOISE_OPTION_INIT_VAL;
-}
-
-static void osnoise_put_timerlat_align(struct osnoise_context *context)
-{
-	osnoise_restore_timerlat_align(context);
-
-	if (context->orig_opt_timerlat_align == OSNOISE_OPTION_INIT_VAL)
-		return;
-
-	context->orig_opt_timerlat_align = OSNOISE_OPTION_INIT_VAL;
-}
+/*
+ * Flag option get/set/restore/put functions, generated from OSNOISE_FLAG_OPTIONS.
+ */
+#define OSNOISE_FLAG_OPTION(name, option_str)						\
+static int osnoise_get_##name(struct osnoise_context *context)				\
+{											\
+	if (context->opt_##name != OSNOISE_OPTION_INIT_VAL)				\
+		return context->opt_##name;						\
+											\
+	if (context->orig_opt_##name != OSNOISE_OPTION_INIT_VAL)			\
+		return context->orig_opt_##name;					\
+											\
+	context->orig_opt_##name = osnoise_options_get_option(option_str);		\
+	return context->orig_opt_##name;						\
+}											\
+											\
+int osnoise_set_##name(struct osnoise_context *context, bool onoff)			\
+{											\
+	int val = osnoise_get_##name(context);						\
+	int retval;									\
+											\
+	if (val == OSNOISE_OPTION_INIT_VAL)						\
+		return -1;								\
+											\
+	if (val == onoff)								\
+		return 0;								\
+											\
+	retval = osnoise_options_set_option(option_str, onoff);				\
+	if (retval < 0)									\
+		return -2;								\
+											\
+	context->opt_##name = onoff;							\
+	return 0;									\
+}											\
+											\
+void osnoise_restore_##name(struct osnoise_context *context)				\
+{											\
+	int retval;									\
+											\
+	if (context->orig_opt_##name == OSNOISE_OPTION_INIT_VAL)			\
+		return;									\
+											\
+	if (context->orig_opt_##name == context->opt_##name)				\
+		goto out_done_##name;							\
+											\
+	retval = osnoise_options_set_option(option_str, context->orig_opt_##name);	\
+	if (retval < 0)									\
+		err_msg("Could not restore original " option_str " option\n");		\
+											\
+out_done_##name:									\
+	context->opt_##name = OSNOISE_OPTION_INIT_VAL;					\
+}											\
+											\
+static void osnoise_put_##name(struct osnoise_context *context)				\
+{											\
+	osnoise_restore_##name(context);						\
+											\
+	if (context->orig_opt_##name == OSNOISE_OPTION_INIT_VAL)			\
+		return;									\
+											\
+	context->orig_opt_##name = OSNOISE_OPTION_INIT_VAL;				\
+}
+OSNOISE_FLAG_OPTIONS
+#undef OSNOISE_FLAG_OPTION
 
 enum {
 	FLAG_CONTEXT_NEWLY_CREATED	= (1 << 0),
@@ -1083,29 +558,16 @@ struct osnoise_context *osnoise_context_alloc(void)
 
 	context = calloc_fatal(1, sizeof(*context));
 
-	context->orig_stop_us		= OSNOISE_OPTION_INIT_VAL;
-	context->stop_us		= OSNOISE_OPTION_INIT_VAL;
-
-	context->orig_stop_total_us	= OSNOISE_OPTION_INIT_VAL;
-	context->stop_total_us		= OSNOISE_OPTION_INIT_VAL;
-
-	context->orig_print_stack	= OSNOISE_OPTION_INIT_VAL;
-	context->print_stack		= OSNOISE_OPTION_INIT_VAL;
-
-	context->orig_tracing_thresh	= OSNOISE_OPTION_INIT_VAL;
-	context->tracing_thresh		= OSNOISE_OPTION_INIT_VAL;
-
-	context->orig_opt_irq_disable	= OSNOISE_OPTION_INIT_VAL;
-	context->opt_irq_disable	= OSNOISE_OPTION_INIT_VAL;
-
-	context->orig_opt_workload	= OSNOISE_OPTION_INIT_VAL;
-	context->opt_workload		= OSNOISE_OPTION_INIT_VAL;
-
-	context->orig_opt_timerlat_align	= OSNOISE_OPTION_INIT_VAL;
-	context->opt_timerlat_align		= OSNOISE_OPTION_INIT_VAL;
-
-	context->orig_timerlat_align_us	= OSNOISE_OPTION_INIT_VAL;
-	context->timerlat_align_us	= OSNOISE_OPTION_INIT_VAL;
+#define OSNOISE_LL_OPTION(name, path, init_val)			\
+	context->orig_##name	 = (init_val);			\
+	context->name		 = (init_val);
+#define OSNOISE_FLAG_OPTION(name, option_str)			\
+	context->orig_opt_##name = OSNOISE_OPTION_INIT_VAL; 	\
+	context->opt_##name	 = OSNOISE_OPTION_INIT_VAL;
+	OSNOISE_LL_OPTIONS
+	OSNOISE_FLAG_OPTIONS
+#undef OSNOISE_LL_OPTION
+#undef OSNOISE_FLAG_OPTION
 
 	osnoise_get_context(context);
 
@@ -1128,15 +590,13 @@ void osnoise_put_context(struct osnoise_context *context)
 
 	osnoise_put_cpus(context);
 	osnoise_put_runtime_period(context);
-	osnoise_put_stop_us(context);
-	osnoise_put_stop_total_us(context);
-	osnoise_put_timerlat_period_us(context);
-	osnoise_put_print_stack(context);
-	osnoise_put_tracing_thresh(context);
-	osnoise_put_irq_disable(context);
-	osnoise_put_workload(context);
-	osnoise_put_timerlat_align(context);
-	osnoise_put_timerlat_align_us(context);
+
+#define OSNOISE_LL_OPTION(name, path, init_val)	osnoise_put_##name(context);
+#define OSNOISE_FLAG_OPTION(name, option_str)	osnoise_put_##name(context);
+	OSNOISE_LL_OPTIONS
+	OSNOISE_FLAG_OPTIONS
+#undef OSNOISE_LL_OPTION
+#undef OSNOISE_FLAG_OPTION
 
 	free(context);
 }
diff --git a/tools/tracing/rtla/src/osnoise.h b/tools/tracing/rtla/src/osnoise.h
index 340ff5a64e6e..3d1852bffed8 100644
--- a/tools/tracing/rtla/src/osnoise.h
+++ b/tools/tracing/rtla/src/osnoise.h
@@ -34,28 +34,6 @@ int osnoise_set_runtime_period(struct osnoise_context *context,
 			       unsigned long long period);
 void osnoise_restore_runtime_period(struct osnoise_context *context);
 
-void osnoise_restore_stop_us(struct osnoise_context *context);
-void osnoise_restore_stop_total_us(struct osnoise_context *context);
-
-int osnoise_set_timerlat_period_us(struct osnoise_context *context,
-				   long long timerlat_period_us);
-void osnoise_restore_timerlat_period_us(struct osnoise_context *context);
-
-int osnoise_set_tracing_thresh(struct osnoise_context *context,
-			       long long tracing_thresh);
-void osnoise_restore_tracing_thresh(struct osnoise_context *context);
-
-void osnoise_restore_print_stack(struct osnoise_context *context);
-int osnoise_set_print_stack(struct osnoise_context *context,
-			    long long print_stack);
-
-int osnoise_set_timerlat_align_us(struct osnoise_context *context,
-				  long long timerlat_align_us);
-void osnoise_restore_timerlat_align_us(struct osnoise_context *context);
-
-int osnoise_set_timerlat_align(struct osnoise_context *context, bool onoff);
-
-int osnoise_set_irq_disable(struct osnoise_context *context, bool onoff);
 void osnoise_report_missed_events(struct osnoise_tool *tool);
 int osnoise_apply_config(struct osnoise_tool *tool, struct osnoise_params *params);
 
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH] sparc64: uprobes: add missing break
From: Andreas Larsson @ 2026-06-12  9:13 UTC (permalink / raw)
  To: Rosen Penev, linux-kernel
  Cc: Masami Hiramatsu, Oleg Nesterov, Peter Zijlstra, David S. Miller,
	open list:UPROBES, open list:SPARC + UltraSPARC (sparc/sparc64)
In-Reply-To: <20260506031815.779909-1-rosenp@gmail.com>

On 2026-05-06 05:18, Rosen Penev wrote:
> Missing fallthrough causes failure with newer compilers:
> 
> arch/sparc/kernel/uprobes.c:284:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough]
>   284 |         default:
>       |         ^
> arch/sparc/kernel/uprobes.c:284:2: note: insert 'break;' to avoid fall-through
>   284 |         default:
>       |         ^
>       |         break;
> 
> Signed-off-by: Rosen Penev <rosenp@gmail.com>
> ---
>  arch/sparc/kernel/uprobes.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/sparc/kernel/uprobes.c b/arch/sparc/kernel/uprobes.c
> index 305017bec164..c8cac64e9988 100644
> --- a/arch/sparc/kernel/uprobes.c
> +++ b/arch/sparc/kernel/uprobes.c
> @@ -280,6 +280,7 @@ int arch_uprobe_exception_notify(struct notifier_block *self,
>  	case DIE_SSTEP:
>  		if (uprobe_post_sstep_notifier(args->regs))
>  			ret = NOTIFY_STOP;
> +		break;
>  
>  	default:
>  		break;

Reviewed-by: Andreas Larsson <andreas@gaisler.com>

Picking this up to my for-next.

Thanks,
Andreas


^ permalink raw reply

* Re: [RFC PATCH 1/2] tracing/osnoise: Sample IPI counts
From: Valentin Schneider @ 2026-06-12  8:53 UTC (permalink / raw)
  To: Tomas Glozar
  Cc: Crystal Wood, linux-kernel, linux-trace-kernel, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Costa Shulyupin,
	Ivan Pravdin
In-Reply-To: <CAP4=nvTOV9jSermE_E_eUh29ks7B9Nv1Nkq1DfvA5nLRv=5hxg@mail.gmail.com>

On 11/06/26 13:55, Tomas Glozar wrote:
> čt 11. 6. 2026 v 12:31 odesílatel Valentin Schneider
> <vschneid@redhat.com> napsal:
>> >
>> > Isn't that precisely what the ipi tracepoints used by this
>> > implementation (ipi:ipi_send_cpu) are for?
>> >
>>
>> Well, these catch the emission of the IPI, which is great for investigation
>> - slap a stacktrace trigger and you (most of the time) get the source of
>> your interference.
>>
>> However Crystal's point is that on x86 (and I assume other archs) receiving
>> & handling these IPIs is "special" and doesn't go through the generic irq
>> subsystem and thus has to be tracked separately, which is why osnoise has
>> this fairly lengthy osnoise_arch_register() thing.
>>
>
> Ah, right. This is not IPI specific, though, IIUC - Intel also has
> other IRQs that have to be traced using Intel-specific trace points,
> like irq_vectors:local_timer, which is also handled in
> osnoise_arch_register(). On ARM from what I recall, most (all?) IRQs
> are traced with irq:* tracepoints.
>
> So there are two parts to this:
>
> - Detecting interference from IPIs firing as osnoise:irq_noise (to be
> analyzed by timerlat auto analysis, and also will appear by default in
> trace output if enabled, regardless of the tool, as all osnoise:*
> tracepoints are enabled there). This is done locally using the already
> existing path (no race hazard), but requires arch-specific detection.
>
> - Counting IPIs when they are being sent. This is the new feature, and
> the count is being recorded in osnoise_sample.
>
> I guess that means that if there were a generic IPI interface, it
> would be easier to use that for IPI counting, as the event would be
> CPU-local? As you say, for tracing of the IPI source, the sending
> tracepoints are better, and that you can already dump the stack trace
> of with --event/--trigger. timerlat auto-analysis could be extended to
> connect the specific IPI to the IRQ noise and display its stack trace
> automatically, instead of manually analyzing the trace output.
>

Right, at least for the smp_call stuff (which includes irq_work) we can
leverage:

  csd_queue_cpu (on the sending CPU)
  csd_func_start (on the receiving CPU)

by indexing on the @csd address; once upon a time [1] I had this:

  $ echo 'hist:keys=cpu,csd.hex:ts=common_timestamp.usecs:src=common_cpu' >\
       /sys/kernel/tracing/events/csd/csd_queue_cpu/trigger
  $ echo 'csd_latency unsigned int src_cpu; '\
       'unsigned int dst_cpu; '\
       'unsigned long csd; u64 time' >\
       /sys/kernel/tracing/synthetic_events

  $ echo 'hist:keys=common_cpu,csd.hex:
  time=common_timestamp.usecs-$ts:
  onmatch(csd.csd_queue_cpu).trace(csd_latency,$src,common_cpu,csd,$time)' >\
       /sys/kernel/tracing/events/csd/csd_function_entry/trigger

  $ trace-cmd record -e 'synthetic:csd_latency' hackbench
  $ trace-cmd report
  <idle>-0     [001]   115.236810: csd_latency:          src_cpu=7, dst_cpu=1, csd=18446612682588476192, time=134
  <idle>-0     [000]   115.240676: csd_latency:          src_cpu=7, dst_cpu=0, csd=18446612682588214048, time=103
  <idle>-0     [009]   115.241320: csd_latency:          src_cpu=7, dst_cpu=9, csd=18446612682143963384, time=83
  <idle>-0     [007]   115.242817: csd_latency:          src_cpu=8, dst_cpu=7, csd=18446612682150759032, time=93
  <idle>-0     [005]   115.247802: csd_latency:          src_cpu=7, dst_cpu=5, csd=18446612682144441144, time=114
  <idle>-0     [005]   115.271775: csd_latency:          src_cpu=7, dst_cpu=5, csd=18446612682144441144, time=151
  <idle>-0     [000]   115.279620: csd_latency:          src_cpu=7, dst_cpu=0, csd=18446612682588214048, time=87
  <idle>-0     [000]   115.281727: csd_latency:          src_cpu=7, dst_cpu=0, csd=18446612682588214048, time=101

[1]: https://lore.kernel.org/lkml/xhsmh4jn8y8vt.mognet@vschneid.remote.csb/

I believe you're right that leveraging this would be useful for
timerlat-aa; I'll add it to my todolist :-)

>> >> Isn't this racy to do from a different CPU?  Both in terms of the
>> >> counter, and the timing of the increment relative to when the IPI is
>> >> actually received.  Not necessarily a huge deal if you only care about
>> >> zero versus bignum, but still.  At least worth a comment, if we go with
>> >> this approach.
>> >>
>> >
>> > I also think it's a bit confusing, especially as the other accesses to
>> > osn_var are cpu-local, but here, "cpu" is the *target* CPU, not the
>> > current CPU. Not sure how expensive it would be to do atomic_add for
>> > that, at least it's something to consider.
>> >
>>
>> I suppose that could be an argument for doing that stat aggregation in
>> userspace osnoise - event handlers are run after the fact via
>> tracefs_iterate_raw_events(), it's all inherently slower since it's just
>> increments of one (one per handled event) but it's also all done in
>> userspace on a control thread and doesn't bog down the kernelspace.
>>
>
> You can also do per-cpu counters in-kernel and sum them in the end,
> but that would take cpus^2 space (indexed by [current_cpu,
> target_cpu]). The question is whether there could be enough samples to
> overload sample collection (like it happens for timerlat, which
> collects data in-kernel using BPF instead).
>
> In-kernel counting can be tested with " --event ipi:ipi_send_cpu
> --trigger hist:key=cpu" - IIRC, tracefs histograms use atomic
> operations (via tracing_map) to protect the entries from races in
> multi thread access. Of course, that is inferior to what the patchset
> implements, as it doesn't record which osnoise cycle the IPI was sent
> in, nor can record cpumask IPIs.
>

I suppose I'll need to go do some benchmarking, but I'm starting to lean
towards the side of atomic incs for IPI counts being okay considering the
sort of latencies we track.

>
> Tomas


^ permalink raw reply

* Re: [PATCH v2 0/4] mm: split the file's i_mmap tree for NUMA
From: Huang Shijie @ 2026-06-12  7:02 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, viro, brauner, jack, muchun.song, osalvador, david, surenb,
	mjguzik, liam, vbabka, shakeel.butt, rppt, mhocko, corbet, skhan,
	linux, dinguyen, schuster.simon, James.Bottomley, deller, djbw,
	willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, pfalcato,
	riel, harry, will, brian.ruley, rmk+kernel, dave.anglin, linux-mm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-parisc,
	linux-fsdevel, nvdimm, linux-perf-users, linux-trace-kernel,
	zhongyuan, fangbaoshun, yingzhiwei
In-Reply-To: <airY5q_SspdbQDbi@lucifer>

On Thu, Jun 11, 2026 at 05:00:49PM +0100, Lorenzo Stoakes wrote:
> Hi Huang,
> 
> You seem to be replacing the file rmap altogether here, so you really ought
> to have sent this as an RFC so we could discuss it as a community first.
No problem.

> 
> Especially so as Pedro had publicly mentioned his plans to implement
> something similar here, so coordination would have been appreciated.
Yes. I am very happy to work with Pedro.

> 
> Anyway, as Pedro has pointed out, the code is overly complicated, it's far
> too configurable (not always a good thing), and the locking implementation
> is questionable.
I can make the code more simple. :)

> 
> You seem to be adding a whole bunch of open-coded complexity too, which is
> not something we want. Abstraction is key for the rmap.
> 
> You're also not adding any kdoc comments or really many comments at all,
> and you've not added any tests (though perhaps it's difficult given how
> core this is).
Got it.

> 
> So I would suggest that perhaps any respin should be sent as an RFC so we
> can engage in that conversation and ensure we're all on the same page?
> 
> Especially since Pedro plans to send an alternative, simpler, solution I
> believe.
> 
> It's also not helpful that you haven't examined the non-NUMA case :)
> perhaps your particular server behaves a certain way that this approach
> aids, but regresses other NUMA configurations?

emm. I ever hoped someone can help me to test this patch set on the non-NUMA
server.

It seems I should find some non-NUMA server before I send out the patch set. :)

> 
> We'd really need to be sure of this before accepting invasive changes like
> this.
Okay.

Thanks
Huang Shijie


^ permalink raw reply

* Re: [PATCH v2 3/4] mm/fs: split the file's i_mmap tree
From: Huang Shijie @ 2026-06-12  6:44 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: akpm, viro, brauner, jack, muchun.song, osalvador, david, surenb,
	mjguzik, liam, ljs, vbabka, shakeel.butt, rppt, mhocko, corbet,
	skhan, linux, dinguyen, schuster.simon, James.Bottomley, deller,
	djbw, willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, riel, harry,
	will, brian.ruley, rmk+kernel, dave.anglin, linux-mm, linux-doc,
	linux-kernel, linux-arm-kernel, linux-parisc, linux-fsdevel,
	nvdimm, linux-perf-users, linux-trace-kernel, zhongyuan,
	fangbaoshun, yingzhiwei
In-Reply-To: <aiqFgGbIo1Psy3pI@pedro-suse.lan>

On Thu, Jun 11, 2026 at 12:11:27PM +0100, Pedro Falcato wrote:
> Hi,
> 
> On Thu, Jun 11, 2026 at 02:18:59PM +0800, Huang Shijie wrote:
> > In the UnixBench tests, there is a test "execl" which tests
> > the execve system call.
> >   For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs.
> > When we test our server with "./Run -c 384 execl",
> > the test result is not good enough. The i_mmap locks contended heavily on
> > "libc.so" and "ld.so". The i_mmap tree for "libc.so" can be
> > over 6000 VMAs, all the VMAs can be in different NUMA mode. The insert/remove
> > operations do not run quickly enough.
> 
> I _really_ would have appreciated some coordination here, because I said I was
> going to take a look at it. I have something that I think is much simpler
Okay, no problem. 

I waited for more then a month, I thought you are busy at other
things. So I spent more then a week to finish the patch set v2.


> in practice. These patches are also way too complex to be dropped just before
> the merge window.
> 
> Some comments:
> 
> > 
> >  In order to reduce the competition of the i_mmap lock, this patch does
> > following:
> >    1.) Split the single i_mmap tree into several sibling trees:
> >        Each tree has a lock. The CONFIG_SPLIT_I_MMAP is used to
> >        turn on/off this feature.
> 
> There is no need for a config option. This needs to Just Work.
> 
> >    2.) Introduce a new field "tree_idx" for vm_area_struct to save the
> >        sibling tree index for this VMA.
> 
> This is possibly contentious, but there are holes in vm_area_struct.
> So I think this is fine.
> 
> >    3.) Introduce a new field "vma_count" for address_space.
> >        The new mapping_mapped() will use it.
> >    4.) Rewrite the vma_interval_tree_foreach()
> >    5.) Rewrite the lock functions.	
> > 
> >  After this patch, the VMA insert/remove operations will work faster,
> > and we can get over 400% performance improvement with the above test.
> > 
> > Signed-off-by: Huang Shijie <huangsj@hygon.cn>
> > ---
> >  fs/Kconfig               |   8 ++
> >  fs/hugetlbfs/inode.c     |  20 ++++-
> >  fs/inode.c               |  75 ++++++++++++++++-
> >  include/linux/fs.h       | 174 ++++++++++++++++++++++++++++++++++++++-
> >  include/linux/mm.h       |  80 ++++++++++++++++++
> >  include/linux/mm_types.h |   3 +
> >  mm/internal.h            |   3 +-
> >  mm/mmap.c                |  11 ++-
> >  mm/nommu.c               |  23 ++++--
> >  mm/pagewalk.c            |   2 +-
> >  mm/vma.c                 |  72 +++++++++++-----
> >  mm/vma_init.c            |   3 +
> >  12 files changed, 436 insertions(+), 38 deletions(-)
> > 
> > diff --git a/fs/Kconfig b/fs/Kconfig
> > index 43cb06de297f..e24804f70432 100644
> > --- a/fs/Kconfig
> > +++ b/fs/Kconfig
> > @@ -9,6 +9,14 @@ menu "File systems"
> >  config DCACHE_WORD_ACCESS
> >         bool
> >  
> > +config SPLIT_I_MMAP
> > +	bool "Split the file's i_mmap to several trees"
> > +	default n
> > +	help
> > +	  Split the file's i_mmap to several trees, each tree has a separate
> > +	  lock. This will reduce the lock contention of file's i_mmap tree,
> > +	  but it will cost more memory for per inode.
> > +
> >  config VALIDATE_FS_PARSER
> >  	bool "Validate filesystem parameter description"
> >  	help
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index da5b41ea5bdd..68d8308418dd 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -891,6 +891,23 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
> >   */
> >  static struct lock_class_key hugetlbfs_i_mmap_rwsem_key;
> >  
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +static void hugetlbfs_lockdep_set_class(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++) {
> > +		lockdep_set_class(&mapping->i_mmap[i].rwsem,
> > +				&hugetlbfs_i_mmap_rwsem_key);
> > +	}
> > +}
> > +#else
> > +static void hugetlbfs_lockdep_set_class(struct address_space *mapping)
> > +{
> > +	lockdep_set_class(&mapping->i_mmap_rwsem, &hugetlbfs_i_mmap_rwsem_key);
> > +}
> > +#endif
> > +
> >  static struct inode *hugetlbfs_get_inode(struct super_block *sb,
> >  					struct mnt_idmap *idmap,
> >  					struct inode *dir,
> > @@ -915,8 +932,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
> >  
> >  		inode->i_ino = get_next_ino();
> >  		inode_init_owner(idmap, inode, dir, mode);
> > -		lockdep_set_class(&inode->i_mapping->i_mmap_rwsem,
> > -				&hugetlbfs_i_mmap_rwsem_key);
> > +		hugetlbfs_lockdep_set_class(inode->i_mapping);
> >  		inode->i_mapping->a_ops = &hugetlbfs_aops;
> >  		simple_inode_init_ts(inode);
> >  		info->resv_map = resv_map;
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 62c579a0cf7d..cb67ae83f5b3 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -214,6 +214,70 @@ static int no_open(struct inode *inode, struct file *file)
> >  	return -ENXIO;
> >  }
> >  
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +int split_tree_num;
> > +static int split_tree_align __maybe_unused = 32;
> > +
> > +static void __init init_split_tree_num(void)
> > +{
> > +#ifdef CONFIG_NUMA
> > +	split_tree_num = nr_node_ids;
> > +#else
> > +	split_tree_num = ALIGN(nr_cpu_ids, split_tree_align);
> > +#endif
> > +}
> 
> Again, too configurable. I think you're too stuck up on the NUMA case -

If you do not care about the NUMA. The performance will _NOT_ get improved
in our NUMA server. I had ever tested code which do not care about the NUMA,
and I got a bad performance. Avoid the remote access is a very important
thing for the NUMA server.

> which does not matter for many people - and may actively harm NUMA users. If
> I have a 128 core 2 NUMA node system, what should I shard by?
It is easy to extend the tree number for NUMA. :)

For the 128 core 2 NUMA, we can extend to more trees, such as:
   Two trees for each NUMA node.

> 
> > +
> > +static void free_mapping_i_mmap(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	if (!mapping->i_mmap)
> > +		return;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		kfree(mapping->i_mmap[i]);
> > +
> > +	kfree(mapping->i_mmap);
> > +	mapping->i_mmap = NULL;
> > +}
> > +
> > +static int init_mapping_i_mmap(struct address_space *mapping, gfp_t gfp)
> > +{
> > +	struct i_mmap_tree *tree;
> > +	int i;
> > +
> > +	/* The extra one is used as terminator in vma_interval_tree_foreach() */
> > +	mapping->i_mmap = kzalloc(sizeof(tree) * (split_tree_num + 1), gfp);
> > +	if (!mapping->i_mmap)
> > +		return -ENOMEM;
> > +
> > +	for (i = 0; i < split_tree_num; i++) {
> > +		tree = kzalloc_node(sizeof(*tree), gfp, i);
> > +		if (!tree)
> > +			goto nomem;
> > +
> > +		tree->root = RB_ROOT_CACHED;
> > +		init_rwsem(&tree->rwsem);
> 
> This (as-is) should blow up with lockdep + the locking loops down there.
okay, I will check it later.

thanks a lot.
> 
> > +
> > +		mapping->i_mmap[i] = tree;
> > +	}
> > +	return 0;
> > +nomem:
> > +	free_mapping_i_mmap(mapping);
> > +	return -ENOMEM;
> > +}
> 
> Honestly, it's likely that a simple static array in struct address_space
The array size is not fixed, so we cannot add a static array in address_space.

> suffices. I would not go through the trouble of getting everything very
> tight and NUMA correct.
> 
> > +#else
> > +static int init_mapping_i_mmap(struct address_space *mapping, gfp_t gfp)
> > +{
> > +	mapping->i_mmap = RB_ROOT_CACHED;
> > +	init_rwsem(&mapping->i_mmap_rwsem);
> > +	return 0;
> > +}
> > +
> > +static void free_mapping_i_mmap(struct address_space *mapping) { }
> > +static void __init init_split_tree_num(void) {}
> > +#endif
> > +
> >  /**
> >   * inode_init_always_gfp - perform inode structure initialisation
> >   * @sb: superblock inode belongs to
> > @@ -302,9 +366,14 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
> >  #endif
> >  	inode->i_flctx = NULL;
> >  
> > -	if (unlikely(security_inode_alloc(inode, gfp)))
> > +	if (init_mapping_i_mmap(mapping, gfp))
> >  		return -ENOMEM;
> >  
> > +	if (unlikely(security_inode_alloc(inode, gfp))) {
> > +		free_mapping_i_mmap(mapping);
> > +		return -ENOMEM;
> > +	}
> > +
> >  	this_cpu_inc(nr_inodes);
> >  
> >  	return 0;
> > @@ -380,6 +449,7 @@ void __destroy_inode(struct inode *inode)
> >  	if (inode->i_default_acl && !is_uncached_acl(inode->i_default_acl))
> >  		posix_acl_release(inode->i_default_acl);
> >  #endif
> > +	free_mapping_i_mmap(&inode->i_data);
> >  	this_cpu_dec(nr_inodes);
> >  }
> >  EXPORT_SYMBOL(__destroy_inode);
> > @@ -480,9 +550,7 @@ EXPORT_SYMBOL(inc_nlink);
> >  static void __address_space_init_once(struct address_space *mapping)
> >  {
> >  	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
> > -	init_rwsem(&mapping->i_mmap_rwsem);
> >  	spin_lock_init(&mapping->i_private_lock);
> > -	mapping->i_mmap = RB_ROOT_CACHED;
> >  }
> >  
> >  void address_space_init_once(struct address_space *mapping)
> > @@ -2619,6 +2687,7 @@ void __init inode_init(void)
> >  					&i_hash_mask,
> >  					0,
> >  					0);
> > +	init_split_tree_num();
> >  }
> >  
> >  void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index cd46615b8f53..f4b3645b61df 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -450,6 +450,25 @@ struct mapping_metadata_bhs {
> >  	struct list_head list;	/* The list of bhs (b_assoc_buffers) */
> >  };
> >  
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +/*
> > + * struct i_mmap_tree - A single sibling tree of the file's split i_mmap.
> > + * @root: The red/black interval tree root.
> > + * @rwsem: Protects insert/remove operations on this sibling tree.
> > + * @vma_count: Number of VMAs in this sibling tree.
> > + *
> > + * When CONFIG_SPLIT_I_MMAP is enabled, the file's single i_mmap tree is
> > + * split into split_tree_num sibling trees, each with its own lock. This
> > + * reduces lock contention by allowing concurrent VMA insert/remove
> > + * operations on different sibling trees.
> > + */
> > +struct i_mmap_tree {
> > +	struct rb_root_cached	root;
> > +	struct rw_semaphore	rwsem;
> > +	atomic_t		vma_count;
> 
> I don't see what you need this vma_count for? I get the one in address_space,
> but this one does not seem useful.
For non-NUMA case, we can use it to determine which tree we should put the new
VMA.
Round-robin is not good enough for a dynamic system.

> 
> > +};
> > +#endif
> > +
> >  /**
> >   * struct address_space - Contents of a cacheable, mappable object.
> >   * @host: Owner, either the inode or the block_device.
> > @@ -461,8 +480,13 @@ struct mapping_metadata_bhs {
> >   * @gfp_mask: Memory allocation flags to use for allocating pages.
> >   * @i_mmap_writable: Number of VM_SHARED, VM_MAYWRITE mappings.
> >   * @nr_thps: Number of THPs in the pagecache (non-shmem only).
> > - * @i_mmap: Tree of private and shared mappings.
> > - * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
> > + * @i_mmap: Tree of private and shared mappings. When CONFIG_SPLIT_I_MMAP
> > + *   is enabled, this is an array of split_tree_num struct i_mmap_tree
> > + *   pointers (plus a NULL terminator).
> 
> NULL terminator wastes more memory, so I would really strongly avoid it as
> well.
any better idea?

> 
> > + * @vma_count: Total number of VMAs across all sibling trees (only when
> > + *   CONFIG_SPLIT_I_MMAP is enabled). Used by mapping_mapped().
> > + * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable (only when
> > + *   CONFIG_SPLIT_I_MMAP is disabled; otherwise per-tree rwsem is used).
> 
> So, there are very good reasons why you still need an i_mmap_rwsem protecting
> state, even with split mmap trees. Which I'll go into later.
> 
> >   * @nrpages: Number of page entries, protected by the i_pages lock.
> >   * @writeback_index: Writeback starts here.
> >   * @a_ops: Methods.
> > @@ -480,14 +504,19 @@ struct address_space {
> >  	/* number of thp, only for non-shmem files */
> >  	atomic_t		nr_thps;
> >  #endif
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +	struct i_mmap_tree	**i_mmap;
> > +	atomic_t		vma_count;
> > +#else
> >  	struct rb_root_cached	i_mmap;
> > +	struct rw_semaphore	i_mmap_rwsem;
> > +#endif
> >  	unsigned long		nrpages;
> >  	pgoff_t			writeback_index;
> >  	const struct address_space_operations *a_ops;
> >  	unsigned long		flags;
> >  	errseq_t		wb_err;
> >  	spinlock_t		i_private_lock;
> > -	struct rw_semaphore	i_mmap_rwsem;
> 
> See d3b1a9a778e1 ("fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.")
Got it.
> 
> >  } __attribute__((aligned(sizeof(long)))) __randomize_layout;
> >  	/*
> >  	 * On most architectures that alignment is already the case; but
> > @@ -508,6 +537,133 @@ static inline bool mapping_tagged(const struct address_space *mapping, xa_mark_t
> >  	return xa_marked(&mapping->i_pages, tag);
> >  }
> >  
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +static inline int mapping_mapped(const struct address_space *mapping)
> > +{
> > +	return	atomic_read(&mapping->vma_count);
> 
> Now that I think of it, I don't think we need atomic_t, only unsigned long +
> READ_ONCE() suffices. Increments can race just fine, we don't expect any 
> consistency there - if you want consistency you probably hold the i_mmap lock.
> 
okay. I will check it.

> > +}
> > +
> > +static inline void inc_mapping_vma(struct address_space *mapping,
> > +				struct vm_area_struct *vma)
> > +{
> > +	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
> > +
> > +	atomic_inc(&tree->vma_count);
> > +	atomic_inc(&mapping->vma_count);
> > +}
> > +
> > +static inline void dec_mapping_vma(struct address_space *mapping,
> > +				struct vm_area_struct *vma)
> > +{
> > +	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
> > +
> > +	atomic_dec(&tree->vma_count);
> > +	atomic_dec(&mapping->vma_count);
> > +}
> 
> This probably shouldn't be in linux/fs.h.
> 
> > +
> > +static inline struct rb_root_cached *get_i_mmap_root(struct address_space *mapping)
> > +{
> > +	return (struct rb_root_cached *)mapping->i_mmap;
> > +}
> > +
> > +static inline void i_mmap_tree_lock_write(struct address_space *mapping,
> > +					struct vm_area_struct *vma)
> > +{
> > +	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
> > +
> > +	down_write(&tree->rwsem);
> > +}
> > +
> > +static inline void i_mmap_tree_unlock_write(struct address_space *mapping,
> > +					struct vm_area_struct *vma)
> > +{
> > +	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
> > +
> > +	up_write(&tree->rwsem);
> > +}
> > +
> > +#define i_mmap_lock_write_prepare(mapping)
> > +#define i_mmap_unlock_write_complete(mapping)
> 
> It's unclear to me why you added write_prepare() and write_complete().
> 
> > +
> > +extern int split_tree_num;
> > +static inline void i_mmap_lock_write(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		down_write(&mapping->i_mmap[i]->rwsem);
> 
> Oof, this is an incredibly large hammer. This is basically why I think keeping
> i_mmap_rwsem (in a different form) is required. You do not want to take $nr_cpus
> locks (read _or_ write). For my design, I keep i_mmap_rwsem, but I invert its
> meaning - taking it in write = I'm reading from the tree; taking it in read =
> I'm writing to the tree. This provides some lighter-weight exclusion between
> rmap walks and rmap tree manipulation.
okay, it seem your method is better. I am waiting for your patch.

> 
> _Technically_, you shouldn't need to always take a lock when manipulating the
> tree. A pattern like mnt_hold_writers()/mnt_get_write_access() can probably
> work well. But it may be too complex ATM.
> 
> 
> Also, note that you pretty much do not want i_mmap_lock_write() users after
> the conversion is done.
> 
> > +}
> > +
> > +static inline int i_mmap_trylock_write(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++) {
> > +		if (!down_write_trylock(&mapping->i_mmap[i]->rwsem)) {
> > +			while (i--)
> > +				up_write(&mapping->i_mmap[i]->rwsem);
> > +			return 0;
> > +		}
> > +	}
> > +	return 1;
> > +}
> > +
> > +static inline void i_mmap_unlock_write(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		up_write(&mapping->i_mmap[i]->rwsem);
> > +}
> > +
> > +static inline int i_mmap_trylock_read(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++) {
> > +		if (!down_read_trylock(&mapping->i_mmap[i]->rwsem)) {
> > +			while (i--)
> > +				up_read(&mapping->i_mmap[i]->rwsem);
> > +			return 0;
> > +		}
> > +	}
> > +	return 1;
> > +}
> > +
> > +static inline void i_mmap_lock_read(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		down_read(&mapping->i_mmap[i]->rwsem);
> > +}
> > +
> > +static inline void i_mmap_unlock_read(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		up_read(&mapping->i_mmap[i]->rwsem);
> > +}
> > +
> > +static inline void i_mmap_assert_locked(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		lockdep_assert_held(&mapping->i_mmap[i]->rwsem);
> > +}
> > +
> > +static inline void i_mmap_assert_write_locked(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		lockdep_assert_held_write(&mapping->i_mmap[i]->rwsem);
> > +}
> > +
> > +#else
> > +
> >  static inline void i_mmap_lock_write(struct address_space *mapping)
> >  {
> >  	down_write(&mapping->i_mmap_rwsem);
> > @@ -561,6 +717,18 @@ static inline struct rb_root_cached *get_i_mmap_root(struct address_space *mappi
> >  	return &mapping->i_mmap;
> >  }
> >  
> > +static inline void inc_mapping_vma(struct address_space *mapping,
> > +				struct vm_area_struct *vma) { }
> > +static inline void dec_mapping_vma(struct address_space *mapping,
> > +				struct vm_area_struct *vma) { }
> > +
> > +#define i_mmap_lock_write_prepare(mapping)	i_mmap_lock_write(mapping)
> > +#define i_mmap_unlock_write_complete(mapping)	i_mmap_unlock_write(mapping)
> > +#define i_mmap_tree_lock_write(mapping, vma)
> > +#define i_mmap_tree_unlock_write(mapping, vma)
> > +
> > +#endif
> > +
> >  /*
> >   * Might pages of this file have been modified in userspace?
> >   * Note that i_mmap_writable counts all VM_SHARED, VM_MAYWRITE vmas: do_mmap
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 0a45c6a8b9f2..9aa8119fa9bf 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4041,11 +4041,91 @@ struct vm_area_struct *vma_interval_tree_iter_first(struct rb_root_cached *root,
> >  struct vm_area_struct *vma_interval_tree_iter_next(struct vm_area_struct *node,
> >  				unsigned long start, unsigned long last);
> >  
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +extern int split_tree_num;
> > +
> > +static inline int smallest_tree_idx(struct file *file)
> > +{
> > +	struct address_space *mapping = file->f_mapping;
> > +	int tmp = INT_MAX, count;
> > +	int i, j = 0;
> > +
> > +	/*
> > +	 * Since a not 100% accurate value is still okay,
> > +	 * we do not need any lock here.
> > +	 */
> > +	for (i = 0; i < split_tree_num; i++) {
> > +		count = atomic_read(&mapping->i_mmap[i]->vma_count);
> > +		if (count < tmp) {
> > +			j = i;
> > +			tmp = count;
> > +			if (!tmp)
> > +				break;
> > +		}
> > +	}
> 
> Ohh, I see why you want the per-subtree vma_count now. But is this a net-win?
It keep the trees as even as possible.

> I think doing something like vma-pointer-hashing or just smp_processor_id()
> would work a-ok.
> 
> > +	return j;
> > +}
> > +
> > +static inline void vma_set_tree_idx(struct vm_area_struct *vma)
> > +{
> > +#ifdef CONFIG_NUMA
> > +	vma->tree_idx = numa_node_id();
> > +#else
> > +	vma->tree_idx = smallest_tree_idx(vma->vm_file);
> > +#endif
> > +}
> > +
> > +static inline struct rb_root_cached *get_rb_root(struct vm_area_struct *vma,
> > +					struct address_space *mapping)
> > +{
> > +	return &mapping->i_mmap[vma->tree_idx]->root;
> > +}
> > +
> > +/* Find the first valid VMA in the sibling trees */
> > +static inline struct vm_area_struct *first_vma(struct i_mmap_tree ***__r,
> > +				unsigned long start, unsigned long last)
> > +{
> > +	struct vm_area_struct *vma = NULL;
> > +	struct i_mmap_tree **tree = *__r;
> > +	struct rb_root_cached *root;
> > +
> > +	while (*tree) {
> > +		root = &(*tree)->root;
> > +		tree++;
> > +		vma = vma_interval_tree_iter_first(root, start, last);
> > +		if (vma)
> > +			break;
> > +	}
> > +
> > +	/* Save for the next loop */
> > +	*__r = tree;
> > +	return vma;
> > +}
> > +
> > +/*
> > + * Please use get_i_mmap_root() to get the @root.
> > + * @_tmp is referenced to avoid unused variable warning.
> > + */
> > +#define vma_interval_tree_foreach(vma, root, start, last)		\
> > +	for (struct i_mmap_tree **_r = (struct i_mmap_tree **)(root),	\
> > +		**_tmp = (vma = first_vma(&_r, start, last)) ? _r : NULL;\
> > +	     ((_tmp && vma) || (vma = first_vma(&_r, start, last)));	\
> > +		vma = vma_interval_tree_iter_next(vma, start, last))
> > +#else
> >  /* Please use get_i_mmap_root() to get the @root */
> >  #define vma_interval_tree_foreach(vma, root, start, last)		\
> >  	for (vma = vma_interval_tree_iter_first(root, start, last);	\
> >  	     vma; vma = vma_interval_tree_iter_next(vma, start, last))
> >  
> > +static inline void vma_set_tree_idx(struct vm_area_struct *vma) { }
> > +
> > +static inline struct rb_root_cached *get_rb_root(struct vm_area_struct *vma,
> > +					struct address_space *mapping)
> > +{
> > +	return &mapping->i_mmap;
> > +}
> > +#endif
> > +
> >  void anon_vma_interval_tree_insert(struct anon_vma_chain *node,
> >  				   struct rb_root_cached *root);
> >  void anon_vma_interval_tree_remove(struct anon_vma_chain *node,
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index a308e2c23b82..8d6aab3346ce 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -1072,6 +1072,9 @@ struct vm_area_struct {
> >  #ifdef __HAVE_PFNMAP_TRACKING
> >  	struct pfnmap_track_ctx *pfnmap_track_ctx;
> >  #endif
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +	int tree_idx;			/* The sibling tree index for the VMA */
> > +#endif
> 
> FTR the struct hole isn't here, but right after vm_lock_seq or vm_refcnt in
> most configs.
okay, thanks.
I did not notice the struct hole issue.
> 
> >  } __randomize_layout;
> >  
> >  /* Clears all bits in the VMA flags bitmap, non-atomically. */
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 5a2ddcf68e0b..2d35cacffd19 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -1888,7 +1888,8 @@ static inline void maybe_rmap_unlock_action(struct vm_area_struct *vma,
> >  
> >  	VM_WARN_ON_ONCE(vma_is_anonymous(vma));
> >  	file = vma->vm_file;
> > -	i_mmap_unlock_write(file->f_mapping);
> > +	i_mmap_tree_unlock_write(file->f_mapping, vma);
> > +	i_mmap_unlock_write_complete(file->f_mapping);
> >  	action->hide_from_rmap_until_complete = false;
> >  }
> >  
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index d714fdb357e5..70036ec9dcaa 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1825,15 +1825,20 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> >  			struct address_space *mapping = file->f_mapping;
> >  
> >  			get_file(file);
> > -			i_mmap_lock_write(mapping);
> > +			i_mmap_lock_write_prepare(mapping);
> > +			i_mmap_tree_lock_write(mapping, mpnt);
> > +
> >  			if (vma_is_shared_maywrite(tmp))
> >  				mapping_allow_writable(mapping);
> >  			flush_dcache_mmap_lock(mapping);
> >  			/* insert tmp into the share list, just after mpnt */
> >  			vma_interval_tree_insert_after(tmp, mpnt,
> > -					get_i_mmap_root(mapping));
> > +					get_rb_root(mpnt, mapping));
> > +			inc_mapping_vma(mapping, tmp);
> 
> Honestly, would prefer to hide all of these details from mmap.
yes, we can. 

But we need to change the functions in mm/interval_tree.c

> 
> >  			flush_dcache_mmap_unlock(mapping);
> > -			i_mmap_unlock_write(mapping);
> > +
> > +			i_mmap_tree_unlock_write(mapping, mpnt);
> > +			i_mmap_unlock_write_complete(mapping);
> >  		}
> >  
> >  		if (!(tmp->vm_flags & VM_WIPEONFORK))
> > diff --git a/mm/nommu.c b/mm/nommu.c
> > index 0f18ffc658e9..1f2c60a220f6 100644
> > --- a/mm/nommu.c
> > +++ b/mm/nommu.c
> > @@ -567,11 +567,16 @@ static void setup_vma_to_mm(struct vm_area_struct *vma, struct mm_struct *mm)
> >  	if (vma->vm_file) {
> >  		struct address_space *mapping = vma->vm_file->f_mapping;
> >  
> > -		i_mmap_lock_write(mapping);
> > +		i_mmap_lock_write_prepare(mapping);
> > +		i_mmap_tree_lock_write(mapping, vma);
> > +
> >  		flush_dcache_mmap_lock(mapping);
> > -		vma_interval_tree_insert(vma, get_i_mmap_root(mapping));
> > +		vma_interval_tree_insert(vma, get_rb_root(vma, mapping));
> > +		inc_mapping_vma(mapping, vma);
> >  		flush_dcache_mmap_unlock(mapping);
> > -		i_mmap_unlock_write(mapping);
> > +
> > +		i_mmap_tree_unlock_write(mapping, vma);
> > +		i_mmap_unlock_write_complete(mapping);
> >  	}
> >  }
> >  
> > @@ -583,11 +588,16 @@ static void cleanup_vma_from_mm(struct vm_area_struct *vma)
> >  		struct address_space *mapping;
> >  		mapping = vma->vm_file->f_mapping;
> >  
> > -		i_mmap_lock_write(mapping);
> > +		i_mmap_lock_write_prepare(mapping);
> > +		i_mmap_tree_lock_write(mapping, vma);
> > +
> >  		flush_dcache_mmap_lock(mapping);
> > -		vma_interval_tree_remove(vma, get_i_mmap_root(mapping));
> > +		vma_interval_tree_remove(vma, get_rb_root(vma, mapping));
> > +		dec_mapping_vma(mapping, vma);
> >  		flush_dcache_mmap_unlock(mapping);
> > -		i_mmap_unlock_write(mapping);
> > +
> > +		i_mmap_tree_unlock_write(mapping, vma);
> > +		i_mmap_unlock_write_complete(mapping);
> >  	}
> >  }
> >  
> > @@ -1063,6 +1073,7 @@ unsigned long do_mmap(struct file *file,
> >  	if (file) {
> >  		region->vm_file = get_file(file);
> >  		vma->vm_file = get_file(file);
> > +		vma_set_tree_idx(vma);
> 
> This is unrelated, shouldn't be done here.
> 
> >  	}
> >  
> >  	down_write(&nommu_region_sem);
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index 8df1b5077951..d5745519d95a 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -809,7 +809,7 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
> >  	if (!check_ops_safe(ops))
> >  		return -EINVAL;
> >  
> > -	lockdep_assert_held(&mapping->i_mmap_rwsem);
> > +	i_mmap_assert_locked(mapping);
> 
> This kind of conversion should be done in a separate step.
> 
> >  	vma_interval_tree_foreach(vma, get_i_mmap_root(mapping), first_index,
> >  				  first_index + nr - 1) {
> >  		/* Clip to the vma */
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 6159650c1b42..2055758064a9 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -234,22 +234,23 @@ static void __vma_link_file(struct vm_area_struct *vma,
> >  		mapping_allow_writable(mapping);
> >  
> >  	flush_dcache_mmap_lock(mapping);
> > -	vma_interval_tree_insert(vma, get_i_mmap_root(mapping));
> > +	vma_interval_tree_insert(vma, get_rb_root(vma, mapping));
> > +	inc_mapping_vma(mapping, vma);
> 
> inc_mapping_vma() should probably be done implicitly by insertion?
Yes, we can. 
It is more grace to hide it in vma_interval_tree_insert.

> 
> >  	flush_dcache_mmap_unlock(mapping);
> >  }
> >  
> > -/*
> > - * Requires inode->i_mapping->i_mmap_rwsem
> > - */
> >  static void __remove_shared_vm_struct(struct vm_area_struct *vma,
> >  				      struct address_space *mapping)
> >  {
> > +	i_mmap_tree_lock_write(mapping, vma);
> >  	if (vma_is_shared_maywrite(vma))
> >  		mapping_unmap_writable(mapping);
> >  
> >  	flush_dcache_mmap_lock(mapping);
> > -	vma_interval_tree_remove(vma, get_i_mmap_root(mapping));
> > +	vma_interval_tree_remove(vma, get_rb_root(vma, mapping));
> > +	dec_mapping_vma(mapping, vma);
> >  	flush_dcache_mmap_unlock(mapping);
> > +	i_mmap_tree_unlock_write(mapping, vma);
> >  }
> >  
> >  /*
> > @@ -297,8 +298,9 @@ static void vma_prepare(struct vma_prepare *vp)
> >  			uprobe_munmap(vp->adj_next, vp->adj_next->vm_start,
> >  				      vp->adj_next->vm_end);
> >  
> > -		i_mmap_lock_write(vp->mapping);
> > +		i_mmap_lock_write_prepare(vp->mapping);
> >  		if (vp->insert && vp->insert->vm_file) {
> > +			i_mmap_tree_lock_write(vp->mapping, vp->insert);
> >  			/*
> >  			 * Put into interval tree now, so instantiated pages
> >  			 * are visible to arm/parisc __flush_dcache_page
> > @@ -307,6 +309,7 @@ static void vma_prepare(struct vma_prepare *vp)
> >  			 */
> >  			__vma_link_file(vp->insert,
> >  					vp->insert->vm_file->f_mapping);
> > +			i_mmap_tree_unlock_write(vp->mapping, vp->insert);
> >  		}
> >  	}
> >  
> > @@ -318,12 +321,17 @@ static void vma_prepare(struct vma_prepare *vp)
> >  	}
> >  
> >  	if (vp->file) {
> > +		i_mmap_tree_lock_write(vp->mapping, vp->vma);
> >  		flush_dcache_mmap_lock(vp->mapping);
> >  		vma_interval_tree_remove(vp->vma,
> > -					get_i_mmap_root(vp->mapping));
> > -		if (vp->adj_next)
> > +					get_rb_root(vp->vma, vp->mapping));
> > +		dec_mapping_vma(vp->mapping, vp->vma);
> > +		if (vp->adj_next) {
> > +			i_mmap_tree_lock_write(vp->mapping, vp->adj_next);
> >  			vma_interval_tree_remove(vp->adj_next,
> > -					get_i_mmap_root(vp->mapping));
> > +					get_rb_root(vp->adj_next, vp->mapping));
> > +			dec_mapping_vma(vp->mapping, vp->adj_next);
> > +		}
> >  	}
> >  
> >  }
> > @@ -340,12 +348,17 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> >  			 struct mm_struct *mm)
> >  {
> >  	if (vp->file) {
> > -		if (vp->adj_next)
> > +		if (vp->adj_next) {
> >  			vma_interval_tree_insert(vp->adj_next,
> > -					get_i_mmap_root(vp->mapping));
> > +					get_rb_root(vp->adj_next, vp->mapping));
> > +			inc_mapping_vma(vp->mapping, vp->adj_next);
> > +			i_mmap_tree_unlock_write(vp->mapping, vp->adj_next);
> > +		}
> >  		vma_interval_tree_insert(vp->vma,
> > -					get_i_mmap_root(vp->mapping));
> > +					get_rb_root(vp->vma, vp->mapping));
> > +		inc_mapping_vma(vp->mapping, vp->vma);
> >  		flush_dcache_mmap_unlock(vp->mapping);
> > +		i_mmap_tree_unlock_write(vp->mapping, vp->vma);
> >  	}
> >  
> >  	if (vp->remove && vp->file) {
> > @@ -370,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> >  	}
> >  
> >  	if (vp->file) {
> > -		i_mmap_unlock_write(vp->mapping);
> > +		i_mmap_unlock_write_complete(vp->mapping);
> >  
> >  		if (!vp->skip_vma_uprobe) {
> >  			uprobe_mmap(vp->vma);
> > @@ -1799,12 +1812,12 @@ static void unlink_file_vma_batch_process(struct unlink_vma_file_batch *vb)
> >  	int i;
> >  
> >  	mapping = vb->vmas[0]->vm_file->f_mapping;
> > -	i_mmap_lock_write(mapping);
> > +	i_mmap_lock_write_prepare(mapping);
> >  	for (i = 0; i < vb->count; i++) {
> >  		VM_WARN_ON_ONCE(vb->vmas[i]->vm_file->f_mapping != mapping);
> >  		__remove_shared_vm_struct(vb->vmas[i], mapping);
> >  	}
> > -	i_mmap_unlock_write(mapping);
> > +	i_mmap_unlock_write_complete(mapping);
> >  
> >  	unlink_file_vma_batch_init(vb);
> >  }
> > @@ -1836,10 +1849,13 @@ static void vma_link_file(struct vm_area_struct *vma, bool hold_rmap_lock)
> >  
> >  	if (file) {
> >  		mapping = file->f_mapping;
> > -		i_mmap_lock_write(mapping);
> > +		i_mmap_lock_write_prepare(mapping);
> > +		i_mmap_tree_lock_write(mapping, vma);
> >  		__vma_link_file(vma, mapping);
> > -		if (!hold_rmap_lock)
> > -			i_mmap_unlock_write(mapping);
> > +		if (!hold_rmap_lock) {
> > +			i_mmap_tree_unlock_write(mapping, vma);
> > +			i_mmap_unlock_write_complete(mapping);
> > +		}
> >  	}
> >  }
> >  
> > @@ -2164,6 +2180,23 @@ static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma)
> >  	}
> >  }
> 
> I can but hope that all of the above is quite simplified before we get to the
> "making file rmap more complicated" bit.
:(
If we can do not care about the ARM device, we can make it simple.

Thanks
Huang Shijie


^ permalink raw reply

* Re: [PATCH v2 1/4] mm: use mapping_mapped to simplify the code
From: Huang Shijie @ 2026-06-12  6:03 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, viro, brauner, jack, muchun.song, osalvador, david, surenb,
	mjguzik, liam, vbabka, shakeel.butt, rppt, mhocko, corbet, skhan,
	linux, dinguyen, schuster.simon, James.Bottomley, deller, djbw,
	willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, pfalcato,
	riel, harry, will, brian.ruley, rmk+kernel, dave.anglin, linux-mm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-parisc,
	linux-fsdevel, nvdimm, linux-perf-users, linux-trace-kernel,
	zhongyuan, fangbaoshun, yingzhiwei
In-Reply-To: <airZn524Ip8VsWra@lucifer>

Hi Lorenzo & Pedro,
On Thu, Jun 11, 2026 at 04:52:54PM +0100, Lorenzo Stoakes wrote:
> On Thu, Jun 11, 2026 at 02:18:57PM +0800, Huang Shijie wrote:
> > Use mapping_mapped() to simplify the code, make
> > the code tidy and clean.
> >
> > Signed-off-by: Huang Shijie <huangsj@hygon.cn>
> 
> Yeah as Pedro said this one could just be sent separately, and I in fact
> suggest you do that :) So:
> 
Thank you Pedro and Lorenzo.
I can send a separate patch later.

Thanks
Huang Shijie


^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Zenghui Yu @ 2026-06-12  5:09 UTC (permalink / raw)
  To: Gregory Price
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, guanhao.wang
In-Reply-To: <ainFROZ3WrGioyuY@gourry-fedora-PF4VCD3F>

[ trim the Cc list ]

Hi Gregory,

On 2026/6/11 4:12, Gregory Price wrote:

> I will still probably send the next RFC version tomorrow or friday,
> as I want to get some eyes on the __GFP_PRIVATE-less pattern.

Could you please Cc me in the next version? I appreciate that and would be
happy to follow this work.

Thanks,
Zenghui

^ permalink raw reply

* Re: [PATCH v4] rethook: Remove the running task check in rethook_find_ret_addr()
From: Tengda Wu @ 2026-06-12  1:54 UTC (permalink / raw)
  To: XIAO WU, sashiko-reviews, Masami Hiramatsu, Petr Mladek,
	Peter Zijlstra
  Cc: Mathieu Desnoyers, Alexei Starovoitov, Steven Rostedt,
	linux-kernel, linux-trace-kernel, live-patching
In-Reply-To: <tencent_3D17DC5BE32C8A51D938AF50F221321F6206@qq.com>

Hi Xiao,

Thank you very much for your detailed analysis and verification.

On 2026/6/11 23:53, XIAO WU wrote:
> Hi Tengda,
> 
> Sashiko [1] reviewed this patch and found that removing the
> task_is_running() check exposes stack unwinders to real crashes — not
> just "invalid information."  A PoC confirms this: a KASAN panic triggers
> within seconds when /proc/<pid>/stack reads the stack of a task that is
> concurrently running a kretprobe.
> 
> [1] https://sashiko.dev/#/patchset/20260610013658.1837963-1-wutengda%40huaweicloud.com
> 
>> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
>> index 5a8bdf88999a..1e7fdebe3cd5 100644
>> --- a/kernel/trace/rethook.c
>> +++ b/kernel/trace/rethook.c
>> @@ -250,9 +251,6 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
>>      if (WARN_ON_ONCE(!cur))
>>          return 0;
>>
>> -    if (tsk != current && task_is_running(tsk))
>> -        return 0;
>> -
>>      do {
>>          ret = __rethook_find_ret_addr(tsk, cur);
>>          if (!ret)
> 
> The commit message states:
> 
>> The iteration is already safe from crashes because
>> unwind_next_frame() holds RCU and rethook_node structures are
>> RCU-freed; even if the iteration goes off the rails and returns
>> invalid information, it will not crash.
> 
> There are two problems with this claim, both reproducible.
> 
> **Problem 1: stack-out-of-bounds in unwind_next_frame itself**
> 
> The PoC below reliably triggers the following KASAN panic — not in the
> rethook list traversal, but inside unwind_next_frame():
> 
> [ 1833.494623] BUG: KASAN: stack-out-of-bounds in unwind_next_frame+0x861/0x2080
> [ 1833.494651] Read of size 2 at addr ffffc90003e6f5f0 by task poc/9854
> [ 1833.494707] Call Trace:
> [ 1833.494719]  dump_stack_lvl+0x116/0x1f0
> [ 1833.494743]  print_report+0xf4/0x600
> [ 1833.494788]  kasan_report+0xe0/0x110
> [ 1833.494836]  unwind_next_frame+0x861/0x2080
> [ 1833.494948]  arch_stack_walk+0x99/0x100
> [ 1833.495000]  stack_trace_save_tsk+0x16a/0x200
> [ 1833.495054]  proc_pid_stack+0x173/0x2b0
> [ 1833.495103]  seq_read_iter+0x519/0x12d0
> [ 1833.495166]  seq_read+0x3b7/0x590
> [ 1833.495297]  vfs_read+0x1f5/0xd20
> [ 1833.495497]  ksys_read+0x135/0x250
> [ 1833.495549]  do_syscall_64+0x129/0x850
> [ 1833.495566]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [ 1833.498894] Kernel panic - not syncing: KASAN: panic_on_warn set ...
> 
> page last free pid 9737 tgid 9737 stack trace:
>  do_sys_openat2+0xbf/0x260          <-- target task inside kretprobe
>  __x64_sys_openat+0x179/0x210
> 
> This crash has nothing to do with rethook_node lifetimes or RCU.  It
> happens because the ORC unwinder reads stack memory while the target
> task concurrently executes a kretprobe trampoline that modifies return
> addresses.  The unwinder follows corrupted frame data past valid stack
> boundaries.  RCU protection of rethook_node structures is irrelevant —
> this crash occurs at the stack frame interpretation level, before any
> rethook list traversal.
> 
> The old task_is_running() check prevented the unwinder from attempting
> to unwind a running task's stack in the first place.
> 

Yes, I was able to reproduce the issue locally using your PoC. The problem
does exist as you described. I need to take a deeper look and figure out how
to properly fix it.

> **Problem 2: use-after-free via rethook_node recycling**
> 
> Even if the stack-out-of-bounds above were addressed, a second crash
> path exists in the rethook list traversal itself.
> 
> rethook_recycle() immediately pushes nodes back to the objpool without
> an RCU grace period:
> 
>   kernel/trace/rethook.c:
>   void rethook_recycle(struct rethook_node *node)
>   {
>           ...
>           objpool_push(node, &node->rethook->pool);
>   }
> 
> Meanwhile, unwind_next_frame() in arch/x86/kernel/unwind_orc.c drops
> RCU between frames while the cursor (*cur) persists across iterations:
> 
>   arch/x86/kernel/unwind_orc.c:
>   bool unwind_next_frame(...)
>   {
>           ...
>           guard(rcu)();    // RCU held for one frame
>           ...
>   }                        // RCU dropped here
> 
> When the unwinder calls __rethook_find_ret_addr() in the next frame
> iteration, it does:
> 
>   struct llist_node *first = tsk->rethooks.first;
>   ...
>   *cur = first;
>   ...
>   node = node->next;       // node may have been recycled
> 
> If the target task returns from a probed function between frames, its
> rethook_node is recycled and can be instantly reallocated to another
> task.  The unwinder's stale cursor then dereferences a freed pointer,
> leading to use-after-free.
> 

Yes, Sashiko also pointed this out. You have opened this issue for further
analysis to clarify its fault model. It appears to be a pre-existing issue
that may require a separate patch to resolve.

> ## Reproducer
> 
> The PoC sets up a kretprobe on do_sys_openat2, creates hot-loop threads
> calling open(), and concurrently reads /proc/<tid>/stack.  The race
> triggers within seconds (Problem 1 above; Problem 2 may reproduce on
> kernels without KASAN or with different timing).
> 
> Build:  gcc -static -pthread -o poc poc.c
> Run:    ./poc [runtime_seconds]
> Needs:  root, CONFIG_KASAN=y
> 
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/wait.h>
> #include <sys/syscall.h>
> #include <sched.h>
> #include <fcntl.h>
> #include <errno.h>
> #include <signal.h>
> #include <pthread.h>
> #include <dirent.h>
> 
> #define TRACE "/sys/kernel/tracing"
> 
> volatile int stop = 0;
> 
> static int tfs(const char *f, const char *b)
> {
>     char p[256]; int fd, r;
>     snprintf(p, 256, "%s/%s", TRACE, f);
>     fd = open(p, O_WRONLY | O_TRUNC);
>     if (fd < 0) {
>         system("mount -t tracefs tracefs /sys/kernel/tracing 2>/dev/null");
>         usleep(50000);
>         fd = open(p, O_WRONLY | O_TRUNC);
>     }
>     if (fd < 0) return -1;
>     r = write(fd, b, strlen(b));
>     close(fd);
>     return r < 0 ? -1 : 0;
> }
> 
> void *hot_thread(void *arg)
> {
>     while (!__atomic_load_n(&stop, __ATOMIC_RELAXED)) {
>         int fd = open("/dev/null", O_RDONLY);
>         if (fd >= 0) close(fd);
>     }
>     return NULL;
> }
> 
> void *reader_thread(void *arg)
> {
>     pid_t target = *(pid_t *)arg;
>     char path[64], buf[8192];
>     snprintf(path, 64, "/proc/%d/stack", target);
>     while (!__atomic_load_n(&stop, __ATOMIC_RELAXED)) {
>         int fd = open(path, O_RDONLY);
>         if (fd >= 0) { read(fd, buf, 8191); close(fd); }
>     }
>     return NULL;
> }
> 
> void sigh(int s) { stop = 1; }
> 
> int main(int argc, char *argv[])
> {
>     int runtime = 120;
>     if (argc > 1) runtime = atoi(argv[1]);
> 
>     printf("rethook race PoC\n");
>     if (geteuid()) { printf("root needed\n"); return 1; }
>     signal(SIGINT, sigh);
> 
>     pthread_t hot[4], rdr[4];
>     pid_t hot_tids[4];
>     int pairs = 4;
> 
>     for (int c = 0; c < runtime / 5 && !stop; c++) {
>         tfs("events/kprobes/myretprobe/enable", "0");
>         tfs("kprobe_events", "-:myretprobe");
>         usleep(100);
>         tfs("kprobe_events", "r:myretprobe do_sys_openat2 $retval");
>         tfs("events/kprobes/myretprobe/enable", "1");
> 
>         pid_t main_tid = syscall(SYS_gettid);
> 
>         for (int i = 0; i < pairs; i++)
>             pthread_create(&hot[i], NULL, hot_thread, NULL);
> 
>         usleep(300000);
> 
>         {
>             DIR *d = opendir("/proc/self/task");
>             int cnt = 0;
>             if (d) {
>                 struct dirent *de;
>                 while ((de = readdir(d)) != NULL && cnt < pairs) {
>                     pid_t t = atoi(de->d_name);
>                     if (t > 0 && t != main_tid)
>                         hot_tids[cnt++] = t;
>                 }
>                 closedir(d);
>             }
>             for (int i = 0; i < cnt; i++)
>                 pthread_create(&rdr[i], NULL, reader_thread, &hot_tids[i]);
>         }
> 
>         printf("round %d\n", c);
>         sleep(5);
> 
>         stop = 1;
>         usleep(100000);
> 
>         for (int i = 0; i < pairs; i++) pthread_join(hot[i], NULL);
>         for (int i = 0; i < pairs; i++) pthread_join(rdr[i], NULL);
> 
>         stop = 0;
>         usleep(1000);
>     }
> 
>     tfs("events/kprobes/myretprobe/enable", "0");
>     tfs("kprobe_events", "-:myretprobe");
>     printf("Done\n");
>     return 0;
> }
> 
> ## Summary
> 
> The v4 commit message claims the iteration "will not crash," but the PoC
> demonstrates a reproducible KASAN panic:
> 
> 1. stack-out-of-bounds in unwind_next_frame (ORC unwinder reads
>    concurrently-modified stack frames of a running task)
> 
> 2. Potential use-after-free in __rethook_find_ret_addr (rethook nodes
>    recycled without RCU grace period, cursor persists across RCU drops)
> 
> The old task_is_running() check was racy but served as a practical
> safety net.  Removing it without adding equivalent protection in the
> callers (proc_pid_stack, BPF stack walkers) exposes users to kernel
> panics via /proc/<pid>/stack on any task running a kretprobe.

Once again, I truly appreciate your thorough review and testing.
I'm not sure if I can fully resolve these issues, but if I succeed, I will
send out a v5.

Best regards,
Tengda


^ permalink raw reply

* Re: [RFC PATCH 1/2] tracing/osnoise: Sample IPI counts
From: Crystal Wood @ 2026-06-11 20:49 UTC (permalink / raw)
  To: Valentin Schneider, Tomas Glozar
  Cc: linux-kernel, linux-trace-kernel, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Costa Shulyupin,
	Ivan Pravdin
In-Reply-To: <xhsmh33yt2wtc.mognet@vschneid-thinkpadt14sgen2i.remote.csb>

On Thu, 2026-06-11 at 12:30 +0200, Valentin Schneider wrote:
> On 11/06/26 10:59, Tomas Glozar wrote:
> > [just replying to comments, will do a full review later]
> > 
> > st 10. 6. 2026 v 21:51 odesílatel Crystal Wood <crwood@redhat.com> napsal:
> > > 
> > > On Wed, 2026-06-10 at 15:04 +0200, Valentin Schneider wrote:
> > > > Osnoise already implictly accounts IPIs via its IRQ tracking,
> > > 
> > > Does it?  It seems that IPIs bypass the kernel/irq subsystem on some
> > > arches (including x86, but not ARM).
> > > 
> > > It would be nice to solve this properly by adding generic ipi
> > > entry/exit tracing (similar to what ARM already has).
> > > 
> > 
> > Isn't that precisely what the ipi tracepoints used by this
> > implementation (ipi:ipi_send_cpu) are for?
> > 
> 
> Well, these catch the emission of the IPI, which is great for investigation
> - slap a stacktrace trigger and you (most of the time) get the source of
> your interference.
> 
> However Crystal's point is that on x86 (and I assume other archs) receiving
> & handling these IPIs is "special" and doesn't go through the generic irq
> subsystem and thus has to be tracked separately, which is why osnoise has
> this fairly lengthy osnoise_arch_register() thing.

Oh, I missed the arch hook.  I feel better now :-)

(I'd feel better if it didn't rely on osnoise-specific arch code being
updated to match if some new interrupt path pops up, but oh well.)


> > 
> > > > Alternatively I can have this be purely supported in userspace osnoise by
> > > > hooking into the IPI events and counting IPIs separately from the osnoise
> > > > events.
> > > 
> > > One benefit I could see of doing this in kernel osnoise would be if you
> > > could atomically correlate the count with the particular noise
> > > interval, but this patch doesn't do that.
> > > 
> > 
> > The count is already reported by cycle on the kernel side in the
> > patchset, right? It's only missing in the current RTLA (userspace)
> > part, as there is no statistic using the information. But it can still
> > be collected through custom histogram triggers.

Not sure I follow... this patchset reports a count of IPIs, not cycle
info, but the count is based on when the IPIs were sent, not received. 
The IPI send events capture cycle info, but that's not what this
patchset adds.

I'm not sure that it really matters though.  I had been thinking of this
more like the interference count, which is atomic with respect to a
single noise (and thus the sender of the noise would be outside that
window).  But this count is reported over the entire osnoise sample
period, so a little slop is probably OK.

-Crystal

> 


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox