* [PATCH v11 02/11] tracing/probes: Support dumping fetcharg program for debugging dynamic events
From: Masami Hiramatsu (Google) @ 2026-06-26 14:14 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178248325671.841606.17344906774310339507.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
For debugging probe events, it is helpful to verify the compiled
fetch instructions for each probe argument. This introduces a new
kernel config CONFIG_PROBE_EVENTS_DUMP_FETCHARG to decode the
instruction sequence of each argument and display it under a
commented line starting with '#' immediately following the dynamic
event definition (such as in dynamic_events, kprobe_events,
uprobe_events, etc.).
For example:
/sys/kernel/tracing # cat dynamic_events
p:kprobes/p_vfs_read_0 vfs_read arg1=+0(file):ustring arg2=%ax:x16
# arg1: ARG(0) -> ST_USTRING(offset=0,size=4) -> END
# arg2: REG(80) -> ST_RAW(size=2) -> END
Assisted-by: Antigravity:gemini-3.5-flash
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v8:
- State this feature is only for debugging probe events.
- Fix dependency list after description in Kconfig.
Changes in v7:
- Show trace event field name for FETCH_OP_TP_ARG.
- Show immediate string value for FETCH_OP_IMMSTR.
- Fix style issues warned by checkpatch.pl.
Changes in v6:
- Newly added.
---
kernel/trace/Kconfig | 12 +++++
kernel/trace/trace_eprobe.c | 2 +
kernel/trace/trace_fprobe.c | 2 +
kernel/trace/trace_kprobe.c | 2 +
kernel/trace/trace_probe.c | 96 +++++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace_probe.h | 79 +++++++++++++++++++++--------------
kernel/trace/trace_uprobe.c | 3 +
7 files changed, 164 insertions(+), 32 deletions(-)
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 084f34dc6c9f..0ab5916575a9 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -779,6 +779,18 @@ config PROBE_EVENTS_BTF_ARGS
kernel function entry or a tracepoint.
This is available only if BTF (BPF Type Format) support is enabled.
+config PROBE_EVENTS_DUMP_FETCHARG
+ bool "Dump of dynamic probe event fetch-arguments"
+ depends on PROBE_EVENTS
+ default n
+ help
+ This shows the dump of fetch-arguments of dynamic probe events
+ alongside their event definitions in the dynamic_events file
+ as comment lines. This is useful to debug the probe events.
+ Since this exposes the raw values in the dynamic_events file,
+ it might be a security risk. Only enable it if you need to debug
+ probe events themselves.
+
config KPROBE_EVENTS
depends on KPROBES
depends on HAVE_REGS_AND_STACK_ACCESS_API
diff --git a/kernel/trace/trace_eprobe.c b/kernel/trace/trace_eprobe.c
index 50518b071414..462c31145733 100644
--- a/kernel/trace/trace_eprobe.c
+++ b/kernel/trace/trace_eprobe.c
@@ -87,6 +87,8 @@ static int eprobe_dyn_event_show(struct seq_file *m, struct dyn_event *ev)
seq_printf(m, " %s=%s", ep->tp.args[i].name, ep->tp.args[i].comm);
seq_putc(m, '\n');
+ trace_probe_dump_args(m, &ep->tp);
+
return 0;
}
diff --git a/kernel/trace/trace_fprobe.c b/kernel/trace/trace_fprobe.c
index 4d1abbf66229..536781cd4c47 100644
--- a/kernel/trace/trace_fprobe.c
+++ b/kernel/trace/trace_fprobe.c
@@ -1449,6 +1449,8 @@ static int trace_fprobe_show(struct seq_file *m, struct dyn_event *ev)
seq_printf(m, " %s=%s", tf->tp.args[i].name, tf->tp.args[i].comm);
seq_putc(m, '\n');
+ trace_probe_dump_args(m, &tf->tp);
+
return 0;
}
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index a8420e6abb56..cfa807d8e760 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1320,6 +1320,8 @@ static int trace_kprobe_show(struct seq_file *m, struct dyn_event *ev)
seq_printf(m, " %s=%s", tk->tp.args[i].name, tk->tp.args[i].comm);
seq_putc(m, '\n');
+ trace_probe_dump_args(m, &tk->tp);
+
return 0;
}
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 2ce7d62471cb..0908019aea12 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -2403,3 +2403,99 @@ int trace_probe_print_args(struct trace_seq *s, struct probe_arg *args, int nr_a
}
return 0;
}
+
+#ifdef CONFIG_PROBE_EVENTS_DUMP_FETCHARG
+
+struct fetch_op_decode {
+ const char *name;
+ void (*decode)(struct seq_file *m, struct fetch_insn *insn);
+};
+
+static const struct fetch_op_decode fetch_op_decode[];
+
+static void fetcharg_decode_none(struct seq_file *m, struct fetch_insn *insn)
+{
+ seq_puts(m, fetch_op_decode[insn->op].name);
+}
+
+static void fetcharg_decode_param(struct seq_file *m, struct fetch_insn *insn)
+{
+ seq_printf(m, "%s(%u)", fetch_op_decode[insn->op].name, insn->param);
+}
+
+static void fetcharg_decode_imm(struct seq_file *m, struct fetch_insn *insn)
+{
+ seq_printf(m, "%s(0x%lx)", fetch_op_decode[insn->op].name, insn->immediate);
+}
+
+static void fetcharg_decode_string(struct seq_file *m, struct fetch_insn *insn)
+{
+ seq_printf(m, "%s(%s)", fetch_op_decode[insn->op].name, (char *)insn->data);
+}
+
+static void fetcharg_decode_symbol(struct seq_file *m, struct fetch_insn *insn)
+{
+ seq_printf(m, "%s(%s)", fetch_op_decode[insn->op].name, (char *)insn->data);
+}
+
+static void fetcharg_decode_offset(struct seq_file *m, struct fetch_insn *insn)
+{
+ seq_printf(m, "%s(offset=%d)", fetch_op_decode[insn->op].name, insn->offset);
+}
+
+static void fetcharg_decode_store(struct seq_file *m, struct fetch_insn *insn)
+{
+ if (insn->op == FETCH_OP_ST_RAW)
+ seq_printf(m, "%s(size=%u)", fetch_op_decode[insn->op].name, insn->size);
+ else
+ seq_printf(m, "%s(offset=%d,size=%u)", fetch_op_decode[insn->op].name,
+ insn->offset, insn->size);
+}
+
+static void fetcharg_decode_bf(struct seq_file *m, struct fetch_insn *insn)
+{
+ seq_printf(m, "%s(basesize=%u,lshift=%u,rshift=%u)",
+ fetch_op_decode[insn->op].name, insn->basesize, insn->lshift, insn->rshift);
+}
+
+static void fetcharg_decode_tp_arg(struct seq_file *m, struct fetch_insn *insn)
+{
+ struct ftrace_event_field *field = insn->data;
+
+ seq_printf(m, "%s(%s)", fetch_op_decode[insn->op].name, field->name);
+}
+
+#define FETCH_OP(opname, decode_fn) \
+ [FETCH_OP_##opname] = { .name = #opname, .decode = fetcharg_decode_##decode_fn }
+
+static const struct fetch_op_decode fetch_op_decode[] = FETCH_OP_LIST;
+#undef FETCH_OP
+
+static void trace_probe_dump_arg(struct seq_file *m, struct probe_arg *parg)
+{
+ int i;
+
+ seq_printf(m, "# %s: ", parg->name);
+ for (i = 0; i < FETCH_INSN_MAX; i++) {
+ struct fetch_insn *insn = parg->code + i;
+
+ if (insn->op >= ARRAY_SIZE(fetch_op_decode) || !fetch_op_decode[insn->op].decode)
+ seq_printf(m, "unknown(%d)", insn->op);
+ else
+ fetch_op_decode[insn->op].decode(m, insn);
+
+ if (insn->op == FETCH_OP_END)
+ break;
+ seq_puts(m, " -> ");
+ }
+ seq_putc(m, '\n');
+}
+
+void trace_probe_dump_args(struct seq_file *m, struct trace_probe *tp)
+{
+ int i;
+
+ for (i = 0; i < tp->nr_args; i++)
+ trace_probe_dump_arg(m, &tp->args[i]);
+}
+#endif /* CONFIG_PROBE_EVENTS_DUMP_FETCHARG */
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 2e0d8384ee5c..e36cfe39e9a8 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -83,38 +83,46 @@ static nokprobe_inline u32 update_data_loc(u32 loc, int consumed)
/* Printing function type */
typedef int (*print_type_func_t)(struct trace_seq *, void *, void *);
-enum fetch_op {
- FETCH_OP_NOP = 0,
- // Stage 1 (load) ops
- FETCH_OP_REG, /* Register : .param = offset */
- FETCH_OP_STACK, /* Stack : .param = index */
- FETCH_OP_STACKP, /* Stack pointer */
- FETCH_OP_RETVAL, /* Return value */
- FETCH_OP_IMM, /* Immediate : .immediate */
- FETCH_OP_COMM, /* Current comm */
- FETCH_OP_ARG, /* Function argument : .param */
- FETCH_OP_FOFFS, /* File offset: .immediate */
- FETCH_OP_IMMSTR, /* Allocated string: .data */
- FETCH_OP_EDATA, /* Entry data: .offset */
- // Stage 2 (dereference) op
- FETCH_OP_DEREF, /* Dereference: .offset */
- FETCH_OP_UDEREF, /* User-space Dereference: .offset */
- // Stage 3 (store) ops
- FETCH_OP_ST_RAW, /* Raw: .size */
- FETCH_OP_ST_MEM, /* Mem: .offset, .size */
- FETCH_OP_ST_UMEM, /* Mem: .offset, .size */
- FETCH_OP_ST_STRING, /* String: .offset, .size */
- FETCH_OP_ST_USTRING, /* User String: .offset, .size */
- FETCH_OP_ST_SYMSTR, /* Kernel Symbol String: .offset, .size */
- FETCH_OP_ST_EDATA, /* Store Entry Data: .offset */
- // Stage 4 (modify) op
- FETCH_OP_MOD_BF, /* Bitfield: .basesize, .lshift, .rshift */
- // Stage 5 (loop) op
- FETCH_OP_LP_ARRAY, /* Array: .param = loop count */
- FETCH_OP_TP_ARG, /* Trace Point argument */
- FETCH_OP_END,
- FETCH_NOP_SYMBOL, /* Unresolved Symbol holder */
-};
+#define FETCH_OP_LIST { \
+ /* Stage 1 (load) ops */ \
+ FETCH_OP(NOP, none), /* NOP */ \
+ FETCH_OP(REG, param), /* Register: .param = offset */ \
+ FETCH_OP(STACK, param), /* Stack: .param = index */ \
+ FETCH_OP(STACKP, none), /* Stack pointer */ \
+ FETCH_OP(RETVAL, none), /* Return value */ \
+ FETCH_OP(IMM, imm), /* Immediate: .immediate */ \
+ FETCH_OP(COMM, none), /* Current comm */ \
+ FETCH_OP(ARG, param), /* Argument: .param = index */ \
+ FETCH_OP(FOFFS, imm), /* File offset: .immediate */ \
+ FETCH_OP(IMMSTR, string), /* Allocated string: .data */ \
+ FETCH_OP(EDATA, offset), /* Entry data: .offset */ \
+ FETCH_OP(TP_ARG, tp_arg), /* Tracepoint argument: .data */\
+ /* Stage 2 (dereference) ops */ \
+ FETCH_OP(DEREF, offset), /* Dereference: .offset */ \
+ FETCH_OP(UDEREF, offset), /* User-space dereference: .offset */\
+ /* Stage 3 (store) ops */ \
+ FETCH_OP(ST_RAW, store), /* Raw value: .size */ \
+ FETCH_OP(ST_MEM, store), /* Memory: .offset, .size */ \
+ FETCH_OP(ST_UMEM, store), /* User memory: .offset, .size */\
+ FETCH_OP(ST_STRING, store), /* String: .offset, .size */ \
+ FETCH_OP(ST_USTRING, store), /* User string: .offset, .size */\
+ FETCH_OP(ST_SYMSTR, store), /* Symbol name: .offset, .size */\
+ FETCH_OP(ST_EDATA, offset), /* Entry data: .offset */ \
+ /* Stage 4 (modify) op */ \
+ FETCH_OP(MOD_BF, bf), /* Bitfield: .basesize, .lshift, .rshift*/\
+ /* Stage 5 (loop) op */ \
+ FETCH_OP(LP_ARRAY, param), /* Loop array: .param = count */\
+ /* End */ \
+ FETCH_OP(END, none), \
+ /* Unresolved Symbol holder */ \
+ FETCH_OP(NOP_SYMBOL, symbol), /* Non loaded symbol: .data = symbol name */\
+}
+
+#define FETCH_OP(opname, decode_fn) FETCH_OP_##opname
+enum fetch_op FETCH_OP_LIST;
+#undef FETCH_OP
+
+#define FETCH_NOP_SYMBOL FETCH_OP_NOP_SYMBOL
struct fetch_insn {
enum fetch_op op;
@@ -370,6 +378,13 @@ bool trace_probe_match_command_args(struct trace_probe *tp,
int trace_probe_create(const char *raw_command, int (*createfn)(int, const char **));
int trace_probe_print_args(struct trace_seq *s, struct probe_arg *args, int nr_args,
u8 *data, void *field);
+#ifdef CONFIG_PROBE_EVENTS_DUMP_FETCHARG
+void trace_probe_dump_args(struct seq_file *m, struct trace_probe *tp);
+#else
+static inline void trace_probe_dump_args(struct seq_file *m, struct trace_probe *tp)
+{
+}
+#endif
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
int traceprobe_get_entry_data_size(struct trace_probe *tp);
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index c274346853d1..b2e264a4b96c 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -765,6 +765,9 @@ static int trace_uprobe_show(struct seq_file *m, struct dyn_event *ev)
seq_printf(m, " %s=%s", tu->tp.args[i].name, tu->tp.args[i].comm);
seq_putc(m, '\n');
+
+ trace_probe_dump_args(m, &tu->tp);
+
return 0;
}
^ permalink raw reply related
* [PATCH v11 03/11] tools/bootconfig: Ignore comment lines in dynamic_events/kprobe_events file
From: Masami Hiramatsu (Google) @ 2026-06-26 14:14 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178248325671.841606.17344906774310339507.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Since dynamic_events/kprobe_events files show the fetcharg debug
information as comment lines, its reader needs to ignore it.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
tools/bootconfig/scripts/ftrace2bconf.sh | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/bootconfig/scripts/ftrace2bconf.sh b/tools/bootconfig/scripts/ftrace2bconf.sh
index 1603801cf126..8eed445c295e 100755
--- a/tools/bootconfig/scripts/ftrace2bconf.sh
+++ b/tools/bootconfig/scripts/ftrace2bconf.sh
@@ -57,6 +57,8 @@ EOF
kprobe_event_options() {
cat $TRACEFS/kprobe_events | while read p args; do
case $p in
+ \#*)
+ continue;;
r*)
cat 1>&2 << EOF
# WARN: A return probe found but it is not supported by bootconfig. Skip it.
^ permalink raw reply related
* [PATCH v11 04/11] perf/probe: Ignore comment lines in dynamic_events/kprobe_events file
From: Masami Hiramatsu (Google) @ 2026-06-26 14:14 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178248325671.841606.17344906774310339507.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Since dynamic_events/kprobe_events files show the fetcharg debug
information as comment lines, its reader needs to ignore it.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
tools/perf/util/probe-file.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/perf/util/probe-file.c b/tools/perf/util/probe-file.c
index 4032572cbf55..4d12693a83b3 100644
--- a/tools/perf/util/probe-file.c
+++ b/tools/perf/util/probe-file.c
@@ -197,6 +197,8 @@ struct strlist *probe_file__get_rawlist(int fd)
idx = strlen(p) - 1;
if (p[idx] == '\n')
p[idx] = '\0';
+ if (buf[0] == '#')
+ continue;
ret = strlist__add(sl, buf);
if (ret < 0) {
pr_debug("strlist__add failed (%d)\n", ret);
^ permalink raw reply related
* [PATCH v11 05/11] tracing/probes: Support typecast for various probe events
From: Masami Hiramatsu (Google) @ 2026-06-26 14:15 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178248325671.841606.17344906774310339507.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Support BTF typecast feature on other probe events, but only if it is
kernel function entry or return, and must use function parameter name
or $retval. This means you can do:
(STRUCT)PARAM->MEMBER
Note: you can not use other variables like $stackN, %reg etc. That
needs nesting support.
To support other probe events, we just need to use last_struct type
when we find a function parameter in parse_btf_arg().
This also updates <tracefs>/README file to show struct typecast.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v5:
- Add comments about $retval with typecast.
- Even if the type of retvalue is not known, if user specifies typecast,
use it for its type.
Changes in v3:
- Clarify the limitation.
Changes in v2:
- Fix to re-enable typecast on eprobe.
---
Documentation/trace/fprobetrace.rst | 3 +++
Documentation/trace/kprobetrace.rst | 4 ++++
kernel/trace/trace.c | 2 +-
kernel/trace/trace_probe.c | 23 +++++++++++++++++------
kernel/trace/trace_probe.h | 5 +++++
5 files changed, 30 insertions(+), 7 deletions(-)
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index b4c2ca3d02c1..7435ded2d66d 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -57,6 +57,9 @@ Synopsis of fprobe-events
(u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
(x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
and bitfield are supported.
+ (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ a pointer to STRUCT and then derference the pointer defined by
+ ->MEMBER.
(\*1) This is available only when BTF is enabled.
(\*2) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 3b6791c17e9b..f73614997d52 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -61,6 +61,10 @@ Synopsis of kprobe_events
(x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
"string", "ustring", "symbol", "symstr" and bitfield are
supported.
+ (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ a pointer to STRUCT and then derference the pointer defined by
+ ->MEMBER. Note that this is available only when the probe is
+ on function entry.
(\*1) only for the probe on function entry (offs == 0). Note, this argument access
is best effort, because depending on the argument type, it may be passed on
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 1146b83b711a..280a3dccd13f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4322,7 +4322,7 @@ static const char readme_msg[] =
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
"\t $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
- "\t <argname>[->field[->field|.field...]],\n"
+ "\t [(structname)]<argname>[->field[->field|.field...]],\n"
#endif
#else
"\t $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 0908019aea12..e6cc9f3d6c8b 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -699,7 +699,7 @@ static int parse_btf_arg(char *varname,
if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
code->op = FETCH_OP_RETVAL;
- /* Check whether the function return type is not void */
+ /* Check whether the function return type is not void, even with typecast. */
if (query_btf_context(ctx) == 0) {
if (ctx->proto->type == 0) {
trace_probe_log_err(ctx->offset, NO_RETVAL);
@@ -708,6 +708,13 @@ static int parse_btf_arg(char *varname,
tid = ctx->proto->type;
goto found;
}
+ /*
+ * Even if we can not find appropriate BTF info, we can still access
+ * the field via typecast.
+ */
+ if (ctx->struct_btf)
+ goto found;
+
if (field) {
trace_probe_log_err(ctx->offset + field - varname,
NO_BTF_ENTRY);
@@ -752,7 +759,10 @@ static int parse_btf_arg(char *varname,
return -ENOENT;
found:
- type = btf_type_skip_modifiers(ctx->btf, tid, NULL);
+ if (ctx->struct_btf)
+ type = ctx->last_struct;
+ else
+ type = btf_type_skip_modifiers(ctx->btf, tid, NULL);
found_type:
if (!type) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
@@ -829,10 +839,11 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
char *tmp;
int ret;
- /* Currently this only works for eprobes */
- if (!(ctx->flags & TPARG_FL_TEVENT)) {
- trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
- return -EINVAL;
+ if (!(tparg_is_event_probe(ctx->flags) ||
+ tparg_is_function_entry(ctx->flags) ||
+ tparg_is_function_return(ctx->flags))) {
+ trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
+ return -EOPNOTSUPP;
}
tmp = strchr(arg, ')');
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index e36cfe39e9a8..aa72e2ffdd93 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -429,6 +429,11 @@ static inline bool tparg_is_function_return(unsigned int flags)
return (flags & TPARG_FL_LOC_MASK) == (TPARG_FL_KERNEL | TPARG_FL_RETURN);
}
+static inline bool tparg_is_event_probe(unsigned int flags)
+{
+ return !!(flags & TPARG_FL_TEVENT);
+}
+
struct traceprobe_parse_context {
struct trace_event_call *event;
/* BTF related parameters */
^ permalink raw reply related
* [PATCH v11 06/11] tracing/probes: Support nested typecast
From: Masami Hiramatsu (Google) @ 2026-06-26 14:15 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178248325671.841606.17344906774310339507.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When we hit an open parenthesis right after typecast closing
parenthesis, it means we have nested typecast. This allows us to
typecast a generic data member in a structure to a pointer to
another structure.
For example, to cast a DATA_MEMBER of VAR structure to STRUCT pointer
and get MEMBER value.
(STRUCT)(VAR->DATA_MEMBER)->MEMBER
Also, we can nest typecast.
(STRUCT1)((STRUCT2)$ARG->FIELD2)->FIELD1
Currently the max nest level is limited to 3.
This also allows user to use typecasting for registers or stacks on
kprobe events. e.g.
(STRUCT)(%ax)->MEMBER
(STRUCT)($stack0)->MEMBER
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v11:
- Fix to return -EINVAL if WARN_ON_ONCE() is hit.
Changes in v6:
- Add a WARN_ON_ONCE check for leaking nested_level (it must not happen.)
Changes in v4:
- Use orig_offset for reporting NO_PTR_STRCT error.
Changes in v2:
- Fix to skip "->" after closing parenthetsis.
---
Documentation/trace/eprobetrace.rst | 2 +
Documentation/trace/fprobetrace.rst | 2 +
Documentation/trace/kprobetrace.rst | 2 +
kernel/trace/trace.c | 1
kernel/trace/trace_probe.c | 83 ++++++++++++++++++++++++++++++++---
kernel/trace/trace_probe.h | 7 +++
6 files changed, 88 insertions(+), 9 deletions(-)
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index fe3602540569..cd0b4aa7f896 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -50,6 +50,8 @@ Synopsis of eprobe_events
a pointer to STRUCT and then derference the pointer defined by
->MEMBER. Note that when this is used, the FIELD name does not
need to be prefixed with a '$'.
+ (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+ also be used with another FETCHARG instead of FIELD.
Types
-----
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 7435ded2d66d..6b8bb27bb62d 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -60,6 +60,8 @@ Synopsis of fprobe-events
(STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
a pointer to STRUCT and then derference the pointer defined by
->MEMBER.
+ (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+ also be used with another FETCHARG instead of FIELD.
(\*1) This is available only when BTF is enabled.
(\*2) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index f73614997d52..c4382765d5b2 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -65,6 +65,8 @@ Synopsis of kprobe_events
a pointer to STRUCT and then derference the pointer defined by
->MEMBER. Note that this is available only when the probe is
on function entry.
+ (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+ also be used with another FETCHARG instead of FIELD.
(\*1) only for the probe on function entry (offs == 0). Note, this argument access
is best effort, because depending on the argument type, it may be passed on
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 280a3dccd13f..e56ee034c486 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4323,6 +4323,7 @@ static const char readme_msg[] =
"\t $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
"\t [(structname)]<argname>[->field[->field|.field...]],\n"
+ "\t [(structname)](fetcharg)->field[->field|.field...],\n"
#endif
#else
"\t $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index e6cc9f3d6c8b..827ae04f6351 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -832,10 +832,35 @@ static int query_btf_struct(const char *sname, struct traceprobe_parse_context *
return 0;
}
+/* Find the matching closing parenthesis for a given opening parenthesis. */
+static char *find_matched_close_paren(char *s)
+{
+ char *p = s;
+ int count = 0;
+
+ while (*p) {
+ if (*p == '(')
+ count++;
+ else if (*p == ')') {
+ if (--count == 0)
+ return p;
+ }
+ p++;
+ }
+ return NULL;
+}
+
+static int
+parse_probe_arg(char *arg, const struct fetch_type *type,
+ struct fetch_insn **pcode, struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx);
+
static int handle_typecast(char *arg, struct fetch_insn **pcode,
struct fetch_insn *end,
struct traceprobe_parse_context *ctx)
{
+ int orig_offset = ctx->offset;
+ bool nested = false;
char *tmp;
int ret;
@@ -852,19 +877,56 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
DEREF_OPEN_BRACE);
return -EINVAL;
}
- *tmp = '\0';
- ret = query_btf_struct(arg + 1, ctx);
- *tmp = ')';
+ *tmp++ = '\0';
+ /* Handle the nested structure like (STRUCT)(VAR->FIELD)->... */
+ if (*tmp == '(') {
+ char *close = find_matched_close_paren(tmp);
+
+ ctx->offset += tmp - arg;
+ if (!close) {
+ trace_probe_log_err(ctx->offset, DEREF_OPEN_BRACE);
+ return -EINVAL;
+ }
+ /* We expect a field access for typecast */
+ if (close[1] != '-' || close[2] != '>') {
+ trace_probe_log_err(ctx->offset + close - tmp + 1,
+ TYPECAST_REQ_FIELD);
+ return -EINVAL;
+ }
+
+ ctx->nested_level++;
+ if (ctx->nested_level > TRACEPROBE_MAX_NESTED_LEVEL) {
+ trace_probe_log_err(ctx->offset, TOO_MANY_NESTED);
+ return -E2BIG;
+ }
+ *close = '\0';
+
+ ctx->offset += 1; /* for the '(' */
+ /* We need to parse the nested one */
+ ret = parse_probe_arg(tmp + 1, find_fetch_type(NULL, ctx->flags),
+ pcode, end, ctx);
+ if (ret < 0)
+ return ret;
+ ctx->nested_level--;
+ clear_struct_btf(ctx);
+
+ tmp = close + 3;/* Skip "->" after closing parenthesis */
+ nested = true;
+ }
+
+ ret = query_btf_struct(arg + 1, ctx);
if (ret < 0) {
- trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
+ trace_probe_log_err(orig_offset + 1, NO_PTR_STRCT);
return -EINVAL;
}
- tmp++;
-
- ctx->offset += tmp - arg;
- ret = parse_btf_arg(tmp, pcode, end, ctx);
+ ctx->offset = orig_offset + tmp - arg;
+ /* If it is nested, tmp points to the field name. */
+ if (nested)
+ ret = parse_btf_field(tmp, ctx->last_struct, pcode, end, ctx);
+ else
+ ret = parse_btf_arg(tmp, pcode, end, ctx);
return ret;
}
@@ -1638,6 +1700,11 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
ctx);
if (ret < 0)
goto fail;
+ /* nested_level must be 0 here, otherwise there is a bug. */
+ if (WARN_ON_ONCE(ctx->nested_level)) {
+ ret = -EINVAL;
+ goto fail;
+ }
/* Update storing type if BTF is available */
if (IS_ENABLED(CONFIG_PROBE_EVENTS_BTF_ARGS) &&
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index aa72e2ffdd93..7d71925244e8 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -450,8 +450,11 @@ struct traceprobe_parse_context {
struct trace_probe *tp;
unsigned int flags;
int offset;
+ int nested_level;
};
+#define TRACEPROBE_MAX_NESTED_LEVEL 3
+
extern int traceprobe_parse_probe_arg(struct trace_probe *tp, int i,
const char *argv,
struct traceprobe_parse_context *ctx);
@@ -587,7 +590,9 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
C(TOO_MANY_ARGS, "Too many arguments are specified"), \
C(TOO_MANY_EARGS, "Too many entry arguments specified"), \
C(EVENT_TOO_BIG, "Event too big (too many fields?)"), \
- C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"),
+ C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"), \
+ C(TYPECAST_REQ_FIELD, "Typecast requires a field access"), \
+ C(TOO_MANY_NESTED, "Too many nested typecasts/dereferences"),
#undef C
#define C(a, b) TP_ERR_##a
^ permalink raw reply related
* [PATCH v11 07/11] tracing/probes: Type casting always involves nested calls
From: Masami Hiramatsu (Google) @ 2026-06-26 14:15 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178248325671.841606.17344906774310339507.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
This allows type casting to various fetchargs without parentheses
by recursively calling parse_probe_arg on the target when type
casting is used.
For example, this allows the following expressions:
- (STRUCT)%REG->FIELD
- (STRUCT)$stackN->FIELD
- (STRUCT)@SYM->FIELD
Note that @SYM+/-OFFSET with typecast needs parentheses like:
- (STRUCT)(@SYM-8)->FIELD
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v8:
- Fix caret position in error case.
- Add a comment about @SYM+/-OFFSET without parentheses.
Changes in v7:
- Prohibit using @SYM+/-OFFSET without parentheses.
- Cleanup parse_btf_arg() since ctx->struct_btf is always NULL now.
Changes in v6:
- Newly added.
---
kernel/trace/trace_probe.c | 123 ++++++++++++++++++++++++++------------------
kernel/trace/trace_probe.h | 4 +
2 files changed, 75 insertions(+), 52 deletions(-)
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 827ae04f6351..1b97b125e9cb 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -684,19 +684,6 @@ static int parse_btf_arg(char *varname,
return -EOPNOTSUPP;
}
- if (ctx->flags & TPARG_FL_TEVENT) {
- ret = parse_trace_event(varname, code, ctx);
- if (ret < 0) {
- trace_probe_log_err(ctx->offset, BAD_ATTACH_ARG);
- return ret;
- }
- /* TEVENT is only here via a typecast */
- if (WARN_ON_ONCE(ctx->struct_btf == NULL))
- return -EINVAL;
- type = ctx->last_struct;
- goto found_type;
- }
-
if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
code->op = FETCH_OP_RETVAL;
/* Check whether the function return type is not void, even with typecast. */
@@ -708,13 +695,6 @@ static int parse_btf_arg(char *varname,
tid = ctx->proto->type;
goto found;
}
- /*
- * Even if we can not find appropriate BTF info, we can still access
- * the field via typecast.
- */
- if (ctx->struct_btf)
- goto found;
-
if (field) {
trace_probe_log_err(ctx->offset + field - varname,
NO_BTF_ENTRY);
@@ -759,11 +739,7 @@ static int parse_btf_arg(char *varname,
return -ENOENT;
found:
- if (ctx->struct_btf)
- type = ctx->last_struct;
- else
- type = btf_type_skip_modifiers(ctx->btf, tid, NULL);
-found_type:
+ type = btf_type_skip_modifiers(ctx->btf, tid, NULL);
if (!type) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
return -EINVAL;
@@ -860,7 +836,7 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
struct traceprobe_parse_context *ctx)
{
int orig_offset = ctx->offset;
- bool nested = false;
+ char *close;
char *tmp;
int ret;
@@ -871,6 +847,17 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
return -EOPNOTSUPP;
}
+ /*
+ * Always consider the token after typecast as a nested call
+ * For example: (STRUCT)VAR->FIELD and (STRUCT)(VAR)->FIELD are same.
+ * VAR is solved in the nested call.
+ */
+ ctx->nested_level++;
+ if (ctx->nested_level > TRACEPROBE_MAX_NESTED_LEVEL) {
+ trace_probe_log_err(ctx->offset, TOO_MANY_NESTED);
+ return -E2BIG;
+ }
+
tmp = strchr(arg, ')');
if (!tmp) {
trace_probe_log_err(ctx->offset + strlen(arg),
@@ -879,11 +866,10 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
}
*tmp++ = '\0';
- /* Handle the nested structure like (STRUCT)(VAR->FIELD)->... */
+ ctx->offset += tmp - arg;
if (*tmp == '(') {
- char *close = find_matched_close_paren(tmp);
+ close = find_matched_close_paren(tmp);
- ctx->offset += tmp - arg;
if (!close) {
trace_probe_log_err(ctx->offset, DEREF_OPEN_BRACE);
return -EINVAL;
@@ -894,27 +880,66 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
TYPECAST_REQ_FIELD);
return -EINVAL;
}
-
- ctx->nested_level++;
- if (ctx->nested_level > TRACEPROBE_MAX_NESTED_LEVEL) {
- trace_probe_log_err(ctx->offset, TOO_MANY_NESTED);
- return -E2BIG;
+ /* Skip '(' */
+ ctx->offset += 1;
+ tmp++;
+ } else if (*tmp == '+' || *tmp == '-') {
+ /* Dereference can have another field access inside it. */
+ char *open = strchr(tmp + 1, '(');
+
+ if (!open) {
+ trace_probe_log_err(ctx->offset,
+ DEREF_NEED_BRACE);
+ return -EINVAL;
+ }
+ close = find_matched_close_paren(open);
+ if (!close) {
+ trace_probe_log_err(ctx->offset + strlen(tmp),
+ DEREF_OPEN_BRACE);
+ return -EINVAL;
+ }
+ close++;
+ /* We expect a field access for typecast */
+ if (close[0] != '-' || close[1] != '>') {
+ trace_probe_log_err(ctx->offset + close - tmp,
+ TYPECAST_REQ_FIELD);
+ return -EINVAL;
+ }
+ } else {
+ if (tmp[0] == '@') {
+ /* @sym+offset is not allowed without parenthesized */
+ close = strpbrk(tmp, "+-");
+ if (close && isdigit(close[1])) {
+ trace_probe_log_err(ctx->offset,
+ TYPECAST_SYM_OFFSET);
+ return -EINVAL;
+ }
}
- *close = '\0';
+ /* Inner variable name */
+ close = strchr(tmp, '-');
+ if (!close || close[1] != '>') {
+ trace_probe_log_err(ctx->offset + strlen(tmp),
+ TYPECAST_REQ_FIELD);
+ return -EINVAL;
+ }
+ }
+ *close = '\0';
- ctx->offset += 1; /* for the '(' */
- /* We need to parse the nested one */
- ret = parse_probe_arg(tmp + 1, find_fetch_type(NULL, ctx->flags),
- pcode, end, ctx);
- if (ret < 0)
- return ret;
- ctx->nested_level--;
- clear_struct_btf(ctx);
+ /* We need to parse the nested one */
+ ret = parse_probe_arg(tmp, find_fetch_type(NULL, ctx->flags),
+ pcode, end, ctx);
+ if (ret < 0)
+ return ret;
+ ctx->nested_level--;
+ clear_struct_btf(ctx);
- tmp = close + 3;/* Skip "->" after closing parenthesis */
- nested = true;
- }
+ /* Let tmp point the field name. */
+ if (close[1] == '-')
+ tmp = close + 3; /* Skip "->" after closing parenthesis */
+ else
+ tmp = close + 2; /* Skip ">" after inner variable name */
+ /* resolve the typecast struct name */
ret = query_btf_struct(arg + 1, ctx);
if (ret < 0) {
trace_probe_log_err(orig_offset + 1, NO_PTR_STRCT);
@@ -922,11 +947,7 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
}
ctx->offset = orig_offset + tmp - arg;
- /* If it is nested, tmp points to the field name. */
- if (nested)
- ret = parse_btf_field(tmp, ctx->last_struct, pcode, end, ctx);
- else
- ret = parse_btf_arg(tmp, pcode, end, ctx);
+ ret = parse_btf_field(tmp, ctx->last_struct, pcode, end, ctx);
return ret;
}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 7d71925244e8..f4fbe3010978 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -453,6 +453,7 @@ struct traceprobe_parse_context {
int nested_level;
};
+/* Each typecast consumes nested level. So the max number of typecast is 3. */
#define TRACEPROBE_MAX_NESTED_LEVEL 3
extern int traceprobe_parse_probe_arg(struct trace_probe *tp, int i,
@@ -592,7 +593,8 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
C(EVENT_TOO_BIG, "Event too big (too many fields?)"), \
C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"), \
C(TYPECAST_REQ_FIELD, "Typecast requires a field access"), \
- C(TOO_MANY_NESTED, "Too many nested typecasts/dereferences"),
+ C(TOO_MANY_NESTED, "Too many nested typecasts/dereferences"), \
+ C(TYPECAST_SYM_OFFSET, "@SYM+/-OFFSET with typecast needs parentheses")
#undef C
#define C(a, b) TP_ERR_##a
^ permalink raw reply related
* [PATCH v11 08/11] tracing/probes: Support field specifier option for typecast
From: Masami Hiramatsu (Google) @ 2026-06-26 14:15 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178248325671.841606.17344906774310339507.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add a field specifier option for the typecast. This works like
container_of() macro.
(STRUCT[,FIELD[.FIELD2...]])VAR
This is equivalent to :
container_of(VAR, struct STRUCT, FIELD[.FIELD2...])
For example:
echo "f tick_nohz_handler next_tick=(tick_sched,sched_timer)timer->next_tick" >> dynamic_events
This will trace tick_nohz_handler() with its tick_sched::next_tick which
is converted from @timer by contianer_of(tick, struct tick_sched, sched_timer).
So, if you enabkle both fprobes:tick_nohz_handler__entry and
timer:hrtimer_expire_entry events, we will see something like:
<idle>-0 [002] d.h1. 3778.087272: hrtimer_expire_entry: hrtimer=00000000d63db328 f
unction=tick_nohz_handler now=3777450051040
<idle>-0 [002] d.h1. 3778.087281: tick_nohz_handler__entry: (tick_nohz_handler+0x4
/0x140) next_tick=3777450000000
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v6:
- Update according to the allways nested patch.
Changes in v3:
- Fix error caret position.
Changes in v2:
- Use byteoffset for typecast field offset instead of bitoffset. This fixes negative modulo calculation.
- Check whether a field is specified after typecast.
- Reject if typecast field option has arrow operator.
---
Documentation/trace/eprobetrace.rst | 5 +
Documentation/trace/fprobetrace.rst | 8 +-
Documentation/trace/kprobetrace.rst | 8 +-
kernel/trace/trace.c | 4 -
kernel/trace/trace_probe.c | 169 ++++++++++++++++++++++++-----------
kernel/trace/trace_probe.h | 5 +
6 files changed, 135 insertions(+), 64 deletions(-)
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index cd0b4aa7f896..680e0af43d5d 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -49,7 +49,10 @@ Synopsis of eprobe_events
(STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
a pointer to STRUCT and then derference the pointer defined by
->MEMBER. Note that when this is used, the FIELD name does not
- need to be prefixed with a '$'.
+ need to be prefixed with a '$'. ASGN can be specified optionally.
+ If ASGN is specified, FIELD will be cast to the same offset
+ position as the ASGN member, rather than to the beginning of
+ the STRUCT.
(STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
also be used with another FETCHARG instead of FIELD.
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 6b8bb27bb62d..290a9e6f7491 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -57,10 +57,12 @@ Synopsis of fprobe-events
(u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
(x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
and bitfield are supported.
- (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ (STRUCT[,ASGN])FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
a pointer to STRUCT and then derference the pointer defined by
- ->MEMBER.
- (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+ ->MEMBER. ASGN can be specified optionally. If ASGN is specified,
+ FIELD will be cast to the same offset position as the ASGN member,
+ rather than to the beginning of the STRUCT.
+ (STRUCT[,ASGN])(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
also be used with another FETCHARG instead of FIELD.
(\*1) This is available only when BTF is enabled.
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index c4382765d5b2..a62707e6a9f2 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -61,11 +61,13 @@ Synopsis of kprobe_events
(x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
"string", "ustring", "symbol", "symstr" and bitfield are
supported.
- (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ (STRUCT[,ASGN])FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
a pointer to STRUCT and then derference the pointer defined by
->MEMBER. Note that this is available only when the probe is
- on function entry.
- (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+ on function entry. ASGN can be specified optionally. If ASGN
+ is specified, FIELD will be cast to the same offset position
+ as the ASGN member, rather than to the beginning of the STRUCT.
+ (STRUCT[,ASGN])(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
also be used with another FETCHARG instead of FIELD.
(\*1) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e56ee034c486..5670c4b91dc0 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4322,8 +4322,8 @@ static const char readme_msg[] =
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
"\t $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
- "\t [(structname)]<argname>[->field[->field|.field...]],\n"
- "\t [(structname)](fetcharg)->field[->field|.field...],\n"
+ "\t [(structname[,field])]<argname>[->field[->field|.field...]],\n"
+ "\t [(structname[,field])](fetcharg)->field[->field|.field...],\n"
#endif
#else
"\t $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 1b97b125e9cb..fd006b415c68 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -568,6 +568,64 @@ static int split_next_field(char *varname, char **next_field,
return ret;
}
+/* Inner loop for solving dot operator ('.'). Return bit-offset of the given field */
+static int get_bitoffset_of_field(char **pfieldname, const struct btf_type **ptype,
+ struct traceprobe_parse_context *ctx)
+{
+ const struct btf_type *type = *ptype;
+ const struct btf_member *field;
+ struct btf *btf = ctx_btf(ctx);
+ char *fieldname = *pfieldname;
+ int bitoffs = 0;
+ u32 anon_offs;
+ char *next;
+ int is_ptr;
+
+ do {
+ next = NULL;
+ is_ptr = split_next_field(fieldname, &next, ctx);
+ if (is_ptr < 0)
+ return is_ptr;
+
+ anon_offs = 0;
+ field = btf_find_struct_member(btf, type, fieldname,
+ &anon_offs);
+ if (IS_ERR(field)) {
+ trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+ return PTR_ERR(field);
+ }
+ if (!field) {
+ trace_probe_log_err(ctx->offset, NO_BTF_FIELD);
+ return -ENOENT;
+ }
+ /* Add anonymous structure/union offset */
+ bitoffs += anon_offs;
+
+ /* Accumulate the bit-offsets of the dot-connected fields */
+ if (btf_type_kflag(type)) {
+ bitoffs += BTF_MEMBER_BIT_OFFSET(field->offset);
+ ctx->last_bitsize = BTF_MEMBER_BITFIELD_SIZE(field->offset);
+ } else {
+ bitoffs += field->offset;
+ ctx->last_bitsize = 0;
+ }
+
+ type = btf_type_skip_modifiers(btf, field->type, NULL);
+ if (!type) {
+ trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+ return -EINVAL;
+ }
+
+ if (next)
+ ctx->offset += next - fieldname;
+ fieldname = next;
+ } while (!is_ptr && fieldname);
+
+ *pfieldname = fieldname;
+ *ptype = type;
+
+ return bitoffs;
+}
/*
* Parse the field of data structure. The @type must be a pointer type
* pointing the target data structure type.
@@ -577,15 +635,13 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
struct traceprobe_parse_context *ctx)
{
struct fetch_insn *code = *pcode;
- const struct btf_member *field;
- u32 bitoffs, anon_offs;
- bool is_struct = ctx->struct_btf != NULL;
struct btf *btf = ctx_btf(ctx);
- char *next;
- int is_ptr;
+ bool is_first_field = true;
+ int bitoffs;
do {
- if (!is_struct) {
+ /* For the first field of typecast, @type will be the target structure type. */
+ if (!(is_first_field && ctx->struct_btf)) {
/* Outer loop for solving arrow operator ('->') */
if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
@@ -599,60 +655,25 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
return -EINVAL;
}
}
- /* Only the first type can skip being a pointer */
- is_struct = false;
-
- bitoffs = 0;
- do {
- /* Inner loop for solving dot operator ('.') */
- next = NULL;
- is_ptr = split_next_field(fieldname, &next, ctx);
- if (is_ptr < 0)
- return is_ptr;
-
- anon_offs = 0;
- field = btf_find_struct_member(btf, type, fieldname,
- &anon_offs);
- if (IS_ERR(field)) {
- trace_probe_log_err(ctx->offset, BAD_BTF_TID);
- return PTR_ERR(field);
- }
- if (!field) {
- trace_probe_log_err(ctx->offset, NO_BTF_FIELD);
- return -ENOENT;
- }
- /* Add anonymous structure/union offset */
- bitoffs += anon_offs;
-
- /* Accumulate the bit-offsets of the dot-connected fields */
- if (btf_type_kflag(type)) {
- bitoffs += BTF_MEMBER_BIT_OFFSET(field->offset);
- ctx->last_bitsize = BTF_MEMBER_BITFIELD_SIZE(field->offset);
- } else {
- bitoffs += field->offset;
- ctx->last_bitsize = 0;
- }
-
- type = btf_type_skip_modifiers(btf, field->type, NULL);
- if (!type) {
- trace_probe_log_err(ctx->offset, BAD_BTF_TID);
- return -EINVAL;
- }
-
- ctx->offset += next - fieldname;
- fieldname = next;
- } while (!is_ptr && fieldname);
+ bitoffs = get_bitoffset_of_field(&fieldname, &type, ctx);
+ if (bitoffs < 0)
+ return bitoffs;
if (++code == end) {
trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
return -EINVAL;
}
code->op = FETCH_OP_DEREF; /* TODO: user deref support */
code->offset = bitoffs / 8;
+ if (is_first_field && ctx->struct_btf) {
+ /* The first field can be typecasted with field option. */
+ code->offset -= ctx->prefix_byteoffs;
+ }
*pcode = code;
ctx->last_bitoffs = bitoffs % 8;
ctx->last_type = type;
+ is_first_field = false;
} while (fieldname);
return 0;
@@ -808,6 +829,46 @@ static int query_btf_struct(const char *sname, struct traceprobe_parse_context *
return 0;
}
+static int parse_btf_casttype(char *casttype, struct traceprobe_parse_context *ctx)
+{
+ char *field;
+ int ret;
+
+ /* Field option - evaluated later. */
+ field = strchr(casttype, ',');
+ if (field)
+ *field++ = '\0';
+
+ ret = query_btf_struct(casttype, ctx);
+ if (ret < 0) {
+ trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
+ return -EINVAL;
+ }
+
+ if (field) {
+ struct btf_type *type = (struct btf_type *)ctx->last_struct;
+
+ ctx->offset += field - casttype;
+ ret = get_bitoffset_of_field(&field, &ctx->last_struct, ctx);
+ if (ret < 0)
+ return ret;
+ if (ret % 8) {
+ trace_probe_log_err(ctx->offset, TYPECAST_NOT_ALIGNED);
+ return -EINVAL;
+ }
+ if (field != NULL) {
+ /* this means @field skips an arrow operator ("->"). */
+ trace_probe_log_err(ctx->offset - 2, TYPECAST_BAD_ARROW);
+ return -EINVAL;
+ }
+ ctx->prefix_byteoffs = ret / 8;
+ /* Restore the original struct type (overwritten by get_bitoffset_of_field) */
+ ctx->last_struct = type;
+ }
+
+ return ret;
+}
+
/* Find the matching closing parenthesis for a given opening parenthesis. */
static char *find_matched_close_paren(char *s)
{
@@ -940,14 +1001,14 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
tmp = close + 2; /* Skip ">" after inner variable name */
/* resolve the typecast struct name */
- ret = query_btf_struct(arg + 1, ctx);
- if (ret < 0) {
- trace_probe_log_err(orig_offset + 1, NO_PTR_STRCT);
- return -EINVAL;
- }
+ ctx->offset = orig_offset + 1; /* for the '(' */
+ ret = parse_btf_casttype(arg + 1, ctx);
+ if (ret < 0)
+ return ret;
ctx->offset = orig_offset + tmp - arg;
ret = parse_btf_field(tmp, ctx->last_struct, pcode, end, ctx);
+ ctx->prefix_byteoffs = 0;
return ret;
}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index f4fbe3010978..e7fcc77f51fc 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -451,6 +451,7 @@ struct traceprobe_parse_context {
unsigned int flags;
int offset;
int nested_level;
+ int prefix_byteoffs; /* The byte offset of the prefix field of typecast */
};
/* Each typecast consumes nested level. So the max number of typecast is 3. */
@@ -594,7 +595,9 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"), \
C(TYPECAST_REQ_FIELD, "Typecast requires a field access"), \
C(TOO_MANY_NESTED, "Too many nested typecasts/dereferences"), \
- C(TYPECAST_SYM_OFFSET, "@SYM+/-OFFSET with typecast needs parentheses")
+ C(TYPECAST_SYM_OFFSET, "@SYM+/-OFFSET with typecast needs parentheses") \
+ C(TYPECAST_NOT_ALIGNED, "Typecast field option is not byte-aligned"), \
+ C(TYPECAST_BAD_ARROW, "Typecast field option does not support -> operator"),
#undef C
#define C(a, b) TP_ERR_##a
^ permalink raw reply related
* [PATCH v11 09/11] tracing/probes: Add $current variable support
From: Masami Hiramatsu (Google) @ 2026-06-26 14:15 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178248325671.841606.17344906774310339507.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Since we can use the BTF to cast value to a structure pointer type,
it is useful to introduce "$current" special variable support to
fetcharg.
User can define a fetcharg to access current task_struct properties
using BTF info. e.g.
$current->cpus_ptr
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v8:
- Avoid uninitialized ctx->btf issue on $current without typecast.
Changes in v7:
- Fix to use force-typecast for task_struct implicitly.
Changes in v6:
- Rebased on dump fetcharg patch.
- Remove function name/eprobe requirement for $current.
Changes in v5:
- Use s32 for bof_find_btf_id().
Changes in v4:
- Add $current in README when CONFIG_HAVE_FUNCTION_ARG_ACCESS_API=y case.
- Fix to prohibit using $current in eprobes and address based kprobes.
Changes in v3:
- Remove $current support from eprobes (because eprobes is only for event)
- Prohibit uprobes to use $current.
Changes in v2:
- Support to parse $current in parse_btf_arg().
- If no typecast on $current, it automatically casted to task_struct.
- Check error case if $current follows something except for "-".
---
Documentation/trace/fprobetrace.rst | 1 +
Documentation/trace/kprobetrace.rst | 1 +
kernel/trace/trace.c | 4 ++--
kernel/trace/trace_probe.c | 37 ++++++++++++++++++++++++++++++++++-
kernel/trace/trace_probe.h | 1 +
kernel/trace/trace_probe_tmpl.h | 3 +++
6 files changed, 44 insertions(+), 3 deletions(-)
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 290a9e6f7491..3392cab016b3 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -50,6 +50,7 @@ Synopsis of fprobe-events
$argN : Fetch the Nth function argument. (N >= 1) (\*2)
$retval : Fetch return value.(\*3)
$comm : Fetch current task comm.
+ $current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5)
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index a62707e6a9f2..81e4fe38791d 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -53,6 +53,7 @@ Synopsis of kprobe_events
$argN : Fetch the Nth function argument. (N >= 1) (\*1)
$retval : Fetch return value.(\*2)
$comm : Fetch current task comm.
+ $current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 5670c4b91dc0..2b0b4f9acb2e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4320,13 +4320,13 @@ static const char readme_msg[] =
"\t args: <name>=fetcharg[:type]\n"
"\t fetcharg: (%<register>|$<efield>), @<address>, @<symbol>[+|-<offset>],\n"
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
- "\t $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
+ "\t $stack<index>, $stack, $retval, $comm, $arg<N>, $current\n"
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
"\t [(structname[,field])]<argname>[->field[->field|.field...]],\n"
"\t [(structname[,field])](fetcharg)->field[->field|.field...],\n"
#endif
#else
- "\t $stack<index>, $stack, $retval, $comm,\n"
+ "\t $stack<index>, $stack, $retval, $comm, $current\n"
#endif
"\t +|-[u]<offset>(<fetcharg>), \\imm-value, \\\"imm-string\"\n"
"\t kernel return probes support: $retval, $arg<N>, $comm\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index fd006b415c68..999dec84275d 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -692,7 +692,9 @@ static int parse_btf_arg(char *varname,
int i, is_ptr, ret;
u32 tid;
- if (!ctx->funcname && !(ctx->flags & TPARG_FL_TEVENT))
+ /* Note: field is not separated at this point, so check prefix. */
+ if (!str_has_prefix(varname, "$current") &&
+ !ctx->funcname && !(ctx->flags & TPARG_FL_TEVENT))
return -EINVAL;
is_ptr = split_next_field(varname, &field, ctx);
@@ -705,6 +707,20 @@ static int parse_btf_arg(char *varname,
return -EOPNOTSUPP;
}
+ if (!strcmp(varname, "$current")) {
+ code->op = FETCH_OP_CURRENT;
+ /* If no typecast is specified for $current, use task_struct by default */
+ ret = bpf_find_btf_id("task_struct", BTF_KIND_STRUCT, &ctx->struct_btf);
+ if (ret < 0) {
+ trace_probe_log_err(ctx->offset, NO_BTF_ENTRY);
+ return -ENOENT;
+ }
+ tid = (u32)ret;
+ type = ctx->last_struct =
+ btf_type_skip_modifiers(ctx->struct_btf, tid, NULL);
+ goto found_type;
+ }
+
if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
code->op = FETCH_OP_RETVAL;
/* Check whether the function return type is not void, even with typecast. */
@@ -761,6 +777,7 @@ static int parse_btf_arg(char *varname,
found:
type = btf_type_skip_modifiers(ctx->btf, tid, NULL);
+found_type:
if (!type) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
return -EINVAL;
@@ -1270,6 +1287,24 @@ static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
return 0;
}
+ /* $current returns the address of the current task_struct. */
+ if (str_has_prefix(arg, "current")) {
+ /* $current is only supported by kernel probe. */
+ if (!(ctx->flags & TPARG_FL_KERNEL)) {
+ err = TP_ERR_BAD_VAR;
+ goto inval;
+ }
+ arg += strlen("current");
+ if (*arg == '-' && IS_ENABLED(CONFIG_PROBE_EVENTS_BTF_ARGS))
+ return parse_btf_arg(orig_arg, pcode, end, ctx);
+
+ if (*arg != '\0')
+ goto inval;
+
+ code->op = FETCH_OP_CURRENT;
+ return 0;
+ }
+
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
len = str_has_prefix(arg, "arg");
if (len) {
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index e7fcc77f51fc..053f72fdaece 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -92,6 +92,7 @@ typedef int (*print_type_func_t)(struct trace_seq *, void *, void *);
FETCH_OP(RETVAL, none), /* Return value */ \
FETCH_OP(IMM, imm), /* Immediate: .immediate */ \
FETCH_OP(COMM, none), /* Current comm */ \
+ FETCH_OP(CURRENT, none), /* Current task_struct address */\
FETCH_OP(ARG, param), /* Argument: .param = index */ \
FETCH_OP(FOFFS, imm), /* File offset: .immediate */ \
FETCH_OP(IMMSTR, string), /* Allocated string: .data */ \
diff --git a/kernel/trace/trace_probe_tmpl.h b/kernel/trace/trace_probe_tmpl.h
index 51436f19083b..d0e9662cde00 100644
--- a/kernel/trace/trace_probe_tmpl.h
+++ b/kernel/trace/trace_probe_tmpl.h
@@ -112,6 +112,9 @@ process_common_fetch_insn(struct fetch_insn *code, unsigned long *val)
case FETCH_OP_IMMSTR:
*val = (unsigned long)code->data;
break;
+ case FETCH_OP_CURRENT:
+ *val = (unsigned long)current;
+ break;
default:
return -EILSEQ;
}
^ permalink raw reply related
* [PATCH v11 10/11] tracing/probes: Add this_cpu_read() and this_cpu_ptr() dereference method to fetcharg
From: Masami Hiramatsu (Google) @ 2026-06-26 14:15 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178248325671.841606.17344906774310339507.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When tracing the kernel local variables, sometimes we need to get the
CPU local variables. To access it, current simple dereference is not
enough.
Thus, introduce a special this_cpu_read() dereference to access per-cpu
variable for the current CPU (accessing other CPU variable may race with
updates on other CPUs). Also this_cpu_ptr() is for accessing per-cpu
pointer.
Those are working as same as the kernel percpu macro.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v11:
- Remove this_cpu_*() from eprobetrace.rst.
Changes in v10:
- Prohibit this_cpu_*() for eprobe events.
Changes in v9:
- Prohibit this_cpu_*() for non kernel probes.
Changes in v6:
- Rebased on dump fetcharg patch.
- Fix to fetch static percpu variable with @SYM correctly.
Changes in v5:
- Simplify this_cpu_read() into +0(this_cpu_ptr()).
Changes in v3:
- Remove NULL check for percpu var because it is just an offset, could be 0.
- Simplify process_fetch_insn_bottom() code.
- If the last operation is this_cpu_read(), read only memory of the specific
size (of type).
Changes in v2:
- Drop +CPU/+PCPU and introduce this_cpu_read() and this_cpu_ptr().
- Support these method with BTF typecast.
- Just check the base address is NOT NULL instead of is_kernel_percpu_address().
---
Documentation/trace/fprobetrace.rst | 2
Documentation/trace/kprobetrace.rst | 2
kernel/trace/trace.c | 1
kernel/trace/trace_probe.c | 152 ++++++++++++++++++++++++++---------
kernel/trace/trace_probe.h | 6 +
kernel/trace/trace_probe_tmpl.h | 22 ++++-
6 files changed, 139 insertions(+), 46 deletions(-)
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 3392cab016b3..3439bc9bd351 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -52,6 +52,8 @@ Synopsis of fprobe-events
$comm : Fetch current task comm.
$current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5)
+ this_cpu_read(FETCHARG) : Read the value of the per-CPU variable FETCHARG on the current CPU.
+ this_cpu_ptr(FETCHARG) : Get the address of the per-CPU variable FETCHARG on the current CPU.
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 81e4fe38791d..9ae330eb0a52 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -55,6 +55,8 @@ Synopsis of kprobe_events
$comm : Fetch current task comm.
$current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+ this_cpu_read(FETCHARG) : Read the value of the per-CPU variable FETCHARG on the current CPU.
+ this_cpu_ptr(FETCHARG) : Get the address of the per-CPU variable FETCHARG on the current CPU.
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2b0b4f9acb2e..c9e182d40059 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4329,6 +4329,7 @@ static const char readme_msg[] =
"\t $stack<index>, $stack, $retval, $comm, $current\n"
#endif
"\t +|-[u]<offset>(<fetcharg>), \\imm-value, \\\"imm-string\"\n"
+ "\t this_cpu_read(<fetcharg>), this_cpu_ptr(<fetcharg>)\n"
"\t kernel return probes support: $retval, $arg<N>, $comm\n"
"\t type: s8/16/32/64, u8/16/32/64, x8/16/32/64, char, string, symbol,\n"
"\t b<bit-width>@<bit-offset>/<container-size>, ustring,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 999dec84275d..18c212122344 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -345,6 +345,109 @@ static int parse_trace_event(char *arg, struct fetch_insn *code,
return -EINVAL;
}
+/* this_cpu_* parser */
+#define THIS_CPU_PTR_PREFIX "this_cpu_ptr("
+#define THIS_CPU_READ_PREFIX "this_cpu_read("
+#define THIS_CPU_PTR_LEN (sizeof(THIS_CPU_PTR_PREFIX) - 1)
+#define THIS_CPU_READ_LEN (sizeof(THIS_CPU_READ_PREFIX) - 1)
+
+static int
+parse_probe_arg(char *arg, const struct fetch_type *type,
+ struct fetch_insn **pcode, struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx);
+
+/* handle dereference nested call */
+static inline int handle_dereference(char *arg, struct fetch_insn **pcode,
+ struct fetch_insn *end, struct traceprobe_parse_context *ctx,
+ int deref, long offset)
+{
+ const struct fetch_type *type = find_fetch_type(NULL, ctx->flags);
+ struct fetch_insn *code = *pcode;
+ int cur_offs = ctx->offset;
+ char *tmp;
+ int ret;
+
+ tmp = strrchr(arg, ')');
+ if (!tmp) {
+ trace_probe_log_err(ctx->offset + strlen(arg),
+ DEREF_OPEN_BRACE);
+ return -EINVAL;
+ }
+
+ *tmp = '\0';
+ ret = parse_probe_arg(arg, type, &code, end, ctx);
+ if (ret)
+ return ret;
+ ctx->offset = cur_offs;
+ if (code->op == FETCH_OP_COMM || code->op == FETCH_OP_IMMSTR) {
+ trace_probe_log_err(ctx->offset, COMM_CANT_DEREF);
+ return -EINVAL;
+ }
+
+ /*
+ * this_cpu_ptr(@SYM) does not use SYM value, but use SYM address.
+ * So we overwrite the last FETCH_OP_DEREF with FETCH_OP_CPU_PTR.
+ */
+ if (!(deref == FETCH_OP_CPU_PTR && *arg == '@')) {
+ code++;
+ if (code == end) {
+ trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
+ return -EINVAL;
+ }
+ }
+ *pcode = code;
+
+ code->op = deref;
+ code->offset = offset;
+ /* Reset the last type if used */
+ ctx->last_type = NULL;
+ return 0;
+}
+
+static int parse_this_cpu(char *arg, struct fetch_insn **pcode,
+ struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx)
+{
+ struct fetch_insn *code;
+ bool is_ptr = false;
+ int ret;
+
+ /*
+ * This is only for kernel probes, excluding eprobe, because per-cpu
+ * pointer should not be recorded by events.
+ */
+ if (!(ctx->flags & TPARG_FL_KERNEL) ||
+ (ctx->flags & TPARG_FL_TEVENT)) {
+ trace_probe_log_err(ctx->offset, NOSUP_PERCPU);
+ return -EINVAL;
+ }
+ if (str_has_prefix(arg, THIS_CPU_PTR_PREFIX)) {
+ arg += THIS_CPU_PTR_LEN;
+ ctx->offset += THIS_CPU_PTR_LEN;
+ is_ptr = true;
+ } else if (str_has_prefix(arg, THIS_CPU_READ_PREFIX)) {
+ arg += THIS_CPU_READ_LEN;
+ ctx->offset += THIS_CPU_READ_LEN;
+ } else
+ return -EINVAL;
+
+ ret = handle_dereference(arg, pcode, end, ctx, FETCH_OP_CPU_PTR, 0);
+ if (ret || is_ptr)
+ return ret;
+
+ /* this_cpu_read(VAR) -> +0(this_cpu_ptr(VAR)) */
+ code = *pcode;
+ code++;
+ if (code == end) {
+ trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
+ return -EINVAL;
+ }
+ code->op = FETCH_OP_DEREF;
+ code->offset = 0;
+ *pcode = code;
+ return 0;
+}
+
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
static u32 btf_type_int(const struct btf_type *t)
@@ -904,11 +1007,6 @@ static char *find_matched_close_paren(char *s)
return NULL;
}
-static int
-parse_probe_arg(char *arg, const struct fetch_type *type,
- struct fetch_insn **pcode, struct fetch_insn *end,
- struct traceprobe_parse_context *ctx);
-
static int handle_typecast(char *arg, struct fetch_insn **pcode,
struct fetch_insn *end,
struct traceprobe_parse_context *ctx)
@@ -961,7 +1059,9 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
/* Skip '(' */
ctx->offset += 1;
tmp++;
- } else if (*tmp == '+' || *tmp == '-') {
+ } else if (*tmp == '+' || *tmp == '-' ||
+ str_has_prefix(tmp, THIS_CPU_PTR_PREFIX) ||
+ str_has_prefix(tmp, THIS_CPU_READ_PREFIX)) {
/* Dereference can have another field access inside it. */
char *open = strchr(tmp + 1, '(');
@@ -1481,36 +1581,9 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
}
ctx->offset += (tmp + 1 - arg) + (arg[0] != '-' ? 1 : 0);
arg = tmp + 1;
- tmp = strrchr(arg, ')');
- if (!tmp) {
- trace_probe_log_err(ctx->offset + strlen(arg),
- DEREF_OPEN_BRACE);
- return -EINVAL;
- } else {
- const struct fetch_type *t2 = find_fetch_type(NULL, ctx->flags);
- int cur_offs = ctx->offset;
-
- *tmp = '\0';
- ret = parse_probe_arg(arg, t2, &code, end, ctx);
- if (ret)
- break;
- ctx->offset = cur_offs;
- if (code->op == FETCH_OP_COMM ||
- code->op == FETCH_OP_IMMSTR) {
- trace_probe_log_err(ctx->offset, COMM_CANT_DEREF);
- return -EINVAL;
- }
- if (++code == end) {
- trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
- return -EINVAL;
- }
- *pcode = code;
-
- code->op = deref;
- code->offset = offset;
- /* Reset the last type if used */
- ctx->last_type = NULL;
- }
+ ret = handle_dereference(arg, pcode, end, ctx, deref, offset);
+ if (ret < 0)
+ return ret;
break;
case '\\': /* Immediate value */
if (arg[1] == '"') { /* Immediate string */
@@ -1531,7 +1604,10 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
ret = handle_typecast(arg, pcode, end, ctx);
break;
default:
- if (isalpha(arg[0]) || arg[0] == '_') {
+ if (str_has_prefix(arg, THIS_CPU_PTR_PREFIX) ||
+ str_has_prefix(arg, THIS_CPU_READ_PREFIX)) {
+ ret = parse_this_cpu(arg, pcode, end, ctx);
+ } else if (isalpha(arg[0]) || arg[0] == '_') {
/* BTF variable or event field*/
if (ctx->flags & TPARG_FL_TEVENT) {
ret = parse_trace_event(arg, *pcode, ctx);
@@ -1548,8 +1624,8 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
return -EINVAL;
}
ret = parse_btf_arg(arg, pcode, end, ctx);
- break;
}
+ break;
}
if (!ret && code->op == FETCH_OP_NOP) {
/* Parsed, but do not find fetch method */
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 053f72fdaece..e6268a8dc378 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -101,6 +101,7 @@ typedef int (*print_type_func_t)(struct trace_seq *, void *, void *);
/* Stage 2 (dereference) ops */ \
FETCH_OP(DEREF, offset), /* Dereference: .offset */ \
FETCH_OP(UDEREF, offset), /* User-space dereference: .offset */\
+ FETCH_OP(CPU_PTR, none), /* Per-CPU pointer: .offset */ \
/* Stage 3 (store) ops */ \
FETCH_OP(ST_RAW, store), /* Raw value: .size */ \
FETCH_OP(ST_MEM, store), /* Memory: .offset, .size */ \
@@ -596,9 +597,10 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"), \
C(TYPECAST_REQ_FIELD, "Typecast requires a field access"), \
C(TOO_MANY_NESTED, "Too many nested typecasts/dereferences"), \
- C(TYPECAST_SYM_OFFSET, "@SYM+/-OFFSET with typecast needs parentheses") \
+ C(TYPECAST_SYM_OFFSET, "@SYM+/-OFFSET with typecast needs parentheses"), \
C(TYPECAST_NOT_ALIGNED, "Typecast field option is not byte-aligned"), \
- C(TYPECAST_BAD_ARROW, "Typecast field option does not support -> operator"),
+ C(TYPECAST_BAD_ARROW, "Typecast field option does not support -> operator"), \
+ C(NOSUP_PERCPU, "Per-cpu variable access is only for kernel probes"),
#undef C
#define C(a, b) TP_ERR_##a
diff --git a/kernel/trace/trace_probe_tmpl.h b/kernel/trace/trace_probe_tmpl.h
index d0e9662cde00..8db12f758fda 100644
--- a/kernel/trace/trace_probe_tmpl.h
+++ b/kernel/trace/trace_probe_tmpl.h
@@ -129,25 +129,35 @@ process_fetch_insn_bottom(struct fetch_insn *code, unsigned long val,
struct fetch_insn *s3 = NULL;
int total = 0, ret = 0, i = 0;
u32 loc = 0;
- unsigned long lval = val;
+ unsigned long lval, llval = val;
stage2:
/* 2nd stage: dereference memory if needed */
do {
- if (code->op == FETCH_OP_DEREF) {
- lval = val;
+ lval = val;
+ switch (code->op) {
+ case FETCH_OP_DEREF:
ret = probe_mem_read(&val, (void *)val + code->offset,
sizeof(val));
- } else if (code->op == FETCH_OP_UDEREF) {
- lval = val;
+ break;
+ case FETCH_OP_UDEREF:
ret = probe_mem_read_user(&val,
(void *)val + code->offset, sizeof(val));
- } else
break;
+ case FETCH_OP_CPU_PTR:
+ val = (unsigned long)this_cpu_ptr((void __percpu *)val);
+ ret = 0;
+ break;
+ default:
+ lval = llval;
+ goto out;
+ }
if (ret)
return ret;
+ llval = lval;
code++;
} while (1);
+out:
s3 = code;
stage3:
^ permalink raw reply related
* [PATCH v11 11/11] tracing/probes: Add a new testcase for BTF typecasts
From: Masami Hiramatsu (Google) @ 2026-06-26 14:16 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178248325671.841606.17344906774310339507.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
With the introduction of container_of-style BTF typecasting and
per-CPU variable access support in trace probes, we need a way to
verify their functionality and prevent regressions.
Add a new ftrace kselftest and update the trace event sample module
to test and validate these features.
Specifically, update the trace-events-sample module to set up a
periodic timer whose callback accesses a per-CPU counter. Introduce
a new sample trace event, foo_timer_fn, to trace this callback
and log the current counter value.
Then, add a new test case, btf_probe_event.tc, which defines a
dynamic probe on the timer callback. The probe uses BTF typecasting
to recover the parent structure from the timer argument and
this_cpu_read() to fetch the per-CPU counter. The test verifies
the integrity of the implementation by ensuring the values
recorded by the dynamic probe match those from the static tracepoint.
Assisted-by: Antigravity:gemini-3.5-flash
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v11:
- nit: fix the error code in comment.
Changes in v10:
- Add a check for $current and this_cpu_* for eprobe
Changes in v9:
- Add a testcase for checking new syntax.
Changes in v8:
- Add more test cases.
Changes in v6:
- Update testcase according to changes.
Changes in v5:
- Add more syntax test cases.
Changes in v4:
- Fix uprobe $current test.
Changes in v3:
- Add syntax test case.
- Update testcase to use this_cpu_read()
Changes in v2:
- Use timer_shutdown_sync() instead of timer_delete_sync() for teardown.
---
samples/trace_events/trace-events-sample.c | 40 +++++++
samples/trace_events/trace-events-sample.h | 34 ++++++
.../ftrace/test.d/dynevent/btf_probe_event.tc | 51 ++++++++++
.../test.d/dynevent/btf_typecast_accepted.tc | 107 ++++++++++++++++++++
.../test.d/dynevent/eprobes_syntax_errors.tc | 9 ++
.../ftrace/test.d/dynevent/fprobe_syntax_errors.tc | 12 ++
.../ftrace/test.d/kprobe/kprobe_syntax_errors.tc | 12 ++
.../ftrace/test.d/kprobe/uprobe_syntax_errors.tc | 5 +
8 files changed, 265 insertions(+), 5 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_typecast_accepted.tc
diff --git a/samples/trace_events/trace-events-sample.c b/samples/trace_events/trace-events-sample.c
index 0b7a6efdb247..ca5d98c360cb 100644
--- a/samples/trace_events/trace-events-sample.c
+++ b/samples/trace_events/trace-events-sample.c
@@ -94,6 +94,20 @@ static int simple_thread_fn(void *arg)
static DEFINE_MUTEX(thread_mutex);
static int simple_thread_cnt;
+static struct foo_timer_data *foo_timer_data;
+
+static void sample_timer_cb(struct timer_list *t)
+{
+ struct foo_timer_data *data = container_of(t, struct foo_timer_data, timer);
+
+ get_cpu();
+ trace_foo_timer_fn(data);
+ (*this_cpu_ptr(data->counter))++;
+ put_cpu();
+
+ mod_timer(t, jiffies + HZ);
+}
+
int foo_bar_reg(void)
{
mutex_lock(&thread_mutex);
@@ -132,9 +146,27 @@ void foo_bar_unreg(void)
static int __init trace_event_init(void)
{
+ foo_timer_data = kzalloc_obj(*foo_timer_data, GFP_KERNEL);
+ if (!foo_timer_data)
+ return -ENOMEM;
+
+ foo_timer_data->name = "sample_timer_counter";
+ foo_timer_data->counter = alloc_percpu(int);
+ if (!foo_timer_data->counter) {
+ kfree(foo_timer_data);
+ return -ENOMEM;
+ }
+
+ timer_setup(&foo_timer_data->timer, sample_timer_cb, 0);
+ mod_timer(&foo_timer_data->timer, jiffies + HZ);
+
simple_tsk = kthread_run(simple_thread, NULL, "event-sample");
- if (IS_ERR(simple_tsk))
- return -1;
+ if (IS_ERR(simple_tsk)) {
+ timer_shutdown_sync(&foo_timer_data->timer);
+ free_percpu(foo_timer_data->counter);
+ kfree(foo_timer_data);
+ return PTR_ERR(simple_tsk);
+ }
return 0;
}
@@ -147,6 +179,10 @@ static void __exit trace_event_exit(void)
kthread_stop(simple_tsk_fn);
simple_tsk_fn = NULL;
mutex_unlock(&thread_mutex);
+
+ timer_shutdown_sync(&foo_timer_data->timer);
+ free_percpu(foo_timer_data->counter);
+ kfree(foo_timer_data);
}
module_init(trace_event_init);
diff --git a/samples/trace_events/trace-events-sample.h b/samples/trace_events/trace-events-sample.h
index 1a05fc153353..816848a456a2 100644
--- a/samples/trace_events/trace-events-sample.h
+++ b/samples/trace_events/trace-events-sample.h
@@ -247,12 +247,14 @@
*/
/*
- * It is OK to have helper functions in the file, but they need to be protected
- * from being defined more than once. Remember, this file gets included more
- * than once.
+ * It is OK to have helper functions and data structures in the file, but they
+ * need to be protected from being defined more than once. Remember, this file
+ * gets included more than once.
*/
#ifndef __TRACE_EVENT_SAMPLE_HELPER_FUNCTIONS
#define __TRACE_EVENT_SAMPLE_HELPER_FUNCTIONS
+#include <linux/timer.h>
+
static inline int __length_of(const int *list)
{
int i;
@@ -270,6 +272,13 @@ enum {
TRACE_SAMPLE_BAR = 4,
TRACE_SAMPLE_ZOO = 8,
};
+
+struct foo_timer_data {
+ const char *name;
+ struct timer_list timer;
+ int __percpu *counter;
+};
+
#endif
/*
@@ -595,6 +604,25 @@ TRACE_EVENT(foo_rel_loc,
__get_rel_bitmask(bitmask),
__get_rel_cpumask(cpumask))
);
+
+TRACE_EVENT(foo_timer_fn,
+
+ TP_PROTO(struct foo_timer_data *data),
+
+ TP_ARGS(data),
+
+ TP_STRUCT__entry(
+ __string( name, data->name )
+ __field( int, count )
+ ),
+
+ TP_fast_assign(
+ __assign_str(name);
+ __entry->count = *this_cpu_ptr(data->counter);
+ ),
+
+ TP_printk("name=%s count=%d", __get_str(name), __entry->count)
+);
#endif
/***** NOTICE! The #if protection ends here. *****/
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc b/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
new file mode 100644
index 000000000000..96791e120b7d
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
@@ -0,0 +1,51 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: BTF event with typecast and percpu access
+# requires: dynamic_events "this_cpu_read(<fetcharg>)":README "[(structname[,field])]<argname>[->field[->field|.field...]]":README
+
+# Check if the sample module is loaded
+if ! lsmod | grep -q trace_events_sample; then
+ modprobe trace-events-sample || exit_unsupported
+fi
+
+echo 0 > events/enable
+echo > dynamic_events
+
+# The sample_timer_cb(struct timer_list *t) is called.
+# We want to check (STRUCT,FIELD)VAR typecast and this_cpu_read() access.
+# (foo_timer_data,timer)t converts t to struct foo_timer_data * using container_of.
+# data->counter is a per-cpu pointer to int.
+# this_cpu_read(data->counter) should give the value of the counter.
+
+echo 'f:mysample/myevent sample_timer_cb name=(foo_timer_data,timer)t->name:string count=this_cpu_read((foo_timer_data,timer)t->counter)' >> dynamic_events
+
+echo 1 > events/mysample/myevent/enable
+echo 1 > events/sample-trace/foo_timer_fn/enable
+
+sleep 2
+
+echo 0 > events/mysample/myevent/enable
+echo 0 > events/sample-trace/foo_timer_fn/enable
+
+# Compare the values.
+MATCH=0
+while read line; do
+ if echo $line | grep -q "foo_timer_fn:"; then
+ NAME=`echo $line | sed 's/.*name=\([^ ]*\) .*/\1/'`
+ COUNT=`echo $line | sed 's/.*count=\([^ ]*\).*/\1/'`
+ if grep -q "myevent:.*name=\"${NAME}\" count=$COUNT" trace; then
+ MATCH=$((MATCH+1))
+ fi
+ fi
+done < trace
+
+if [ $MATCH -eq 0 ]; then
+ echo "No matching events found"
+ exit_fail
+fi
+
+# Clean up
+echo 0 > events/mysample/myevent/enable
+echo 0 > events/sample-trace/foo_timer_fn/enable
+echo > dynamic_events
+clear_trace
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/btf_typecast_accepted.tc b/tools/testing/selftests/ftrace/test.d/dynevent/btf_typecast_accepted.tc
new file mode 100644
index 000000000000..acf0b5a917d3
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/btf_typecast_accepted.tc
@@ -0,0 +1,107 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: BTF typecast and percpu access syntax validation
+# requires: dynamic_events "this_cpu_read(<fetcharg>)":README "[(structname[,field])]<argname>[->field[->field|.field...]]":README
+
+KPROBES=
+FPROBES=
+
+if grep -qF "p[:[<group>/][<event>]] <place> [<args>]" README ; then
+ KPROBES=yes
+fi
+if grep -qF "f[:[<group>/][<event>]] <func-name>[%return] [<args>]" README ; then
+ FPROBES=yes
+fi
+
+if [ -z "$KPROBES" -a -z "$FPROBES" ] ; then
+ exit_unsupported
+fi
+
+echo 0 > events/enable
+echo > dynamic_events
+
+# Load trace-events-sample module if available to have per-CPU counter structure defined
+if ! lsmod | grep -q trace_events_sample; then
+ modprobe trace-events-sample || true
+fi
+
+if [ "$FPROBES" ] ; then
+ # 1. Test basic typecast on fprobe
+ echo 'f:fpevent1 vfs_read name=(file)file->f_path.dentry->d_name.name:string' >> dynamic_events
+ # 2. Test parenthesized typecast target on fprobe
+ echo 'f:fpevent2 vfs_read name=(file)(file)->f_path.dentry->d_name.name:string' >> dynamic_events
+ # 3. Test nested typecasts on fprobe
+ echo 'f:fpevent3 vfs_read name=(dentry)((file)file->f_path.dentry)->d_name.name:string' >> dynamic_events
+ # 4. Test container_of-style typecast with field option on fprobe
+ echo 'f:fpevent4 vfs_read name=(file,f_path)file->f_mode' >> dynamic_events
+ # 5. Test typecast on return value on fprobe
+ echo 'f:fpevent5 vfs_read%return name=(file)$retval->f_path.dentry->d_name.name:string' >> dynamic_events
+ # 6. Test $current variable support on fprobe
+ echo 'f:fpevent6 vfs_read pid=$current->pid' >> dynamic_events
+ echo 'f:fpevent7 vfs_read pid=(task_struct)$current->pid' >> dynamic_events
+ echo 'f:fpevent8 vfs_read pid=(task_struct,group_leader)$current->pid' >> dynamic_events
+
+ # Test this_cpu_read and this_cpu_ptr on fprobe
+ if lsmod | grep -q trace_events_sample; then
+ echo 'f:fpevent9 sample_timer_cb name=(foo_timer_data,timer)t->name:string count=this_cpu_read((foo_timer_data,timer)t->counter)' >> dynamic_events
+ echo 'f:fpevent10 sample_timer_cb ptr=this_cpu_ptr((foo_timer_data,timer)t->counter)' >> dynamic_events
+ fi
+fi
+
+if [ "$KPROBES" ] ; then
+ # 7. Test basic typecast on kprobe
+ echo 'p:kpevent1 vfs_read name=(file)file->f_path.dentry->d_name.name:string' >> dynamic_events
+ # 8. Test parenthesized typecast target on kprobe
+ echo 'p:kpevent2 vfs_read name=(file)(file)->f_path.dentry->d_name.name:string' >> dynamic_events
+ # 9. Test nested typecasts on kprobe
+ echo 'p:kpevent3 vfs_read name=(dentry)((file)file->f_path.dentry)->d_name.name:string' >> dynamic_events
+ # 10. Test container_of-style typecast with field option on kprobe
+ echo 'p:kpevent4 vfs_read name=(file,f_path)file->f_mode' >> dynamic_events
+ # 11. Test typecast on return value on kretprobe
+ echo 'r:kpevent5 vfs_read name=(file)$retval->f_path.dentry->d_name.name:string' >> dynamic_events
+ # 12. Test $current variable support on kprobe
+ echo 'p:kpevent6 vfs_read pid=$current->pid' >> dynamic_events
+ echo 'p:kpevent7 vfs_read pid=(task_struct)$current->pid' >> dynamic_events
+ echo 'p:kpevent8 vfs_read pid=(task_struct,group_leader)$current->pid' >> dynamic_events
+
+ # Test this_cpu_read and this_cpu_ptr on kprobe
+ if lsmod | grep -q trace_events_sample; then
+ echo 'p:kpevent9 sample_timer_cb name=(foo_timer_data,timer)t->name:string count=this_cpu_read((foo_timer_data,timer)t->counter)' >> dynamic_events
+ echo 'p:kpevent10 sample_timer_cb ptr=this_cpu_ptr((foo_timer_data,timer)t->counter)' >> dynamic_events
+ fi
+fi
+
+# Verify the events exist in dynamic_events
+if [ "$FPROBES" ] ; then
+ grep -q "fpevent1 " dynamic_events
+ grep -q "fpevent2 " dynamic_events
+ grep -q "fpevent3 " dynamic_events
+ grep -q "fpevent4 " dynamic_events
+ grep -q "fpevent5 " dynamic_events
+ grep -q "fpevent6 " dynamic_events
+ grep -q "fpevent7 " dynamic_events
+ grep -q "fpevent8 " dynamic_events
+ if lsmod | grep -q trace_events_sample; then
+ grep -q "fpevent9 " dynamic_events
+ grep -q "fpevent10 " dynamic_events
+ fi
+fi
+
+if [ "$KPROBES" ] ; then
+ grep -q "kpevent1 " dynamic_events
+ grep -q "kpevent2 " dynamic_events
+ grep -q "kpevent3 " dynamic_events
+ grep -q "kpevent4 " dynamic_events
+ grep -q "kpevent5 " dynamic_events
+ grep -q "kpevent6 " dynamic_events
+ grep -q "kpevent7 " dynamic_events
+ grep -q "kpevent8 " dynamic_events
+ if lsmod | grep -q trace_events_sample; then
+ grep -q "kpevent9 " dynamic_events
+ grep -q "kpevent10 " dynamic_events
+ fi
+fi
+
+# Clean up
+echo > dynamic_events
+clear_trace
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc
index 0e65e787e426..1d6d1cf94f16 100644
--- a/tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc
@@ -21,8 +21,17 @@ check_error 'e:foo/^bar.1 syscalls/sys_enter_openat' # BAD_EVENT_NAME
check_error 'e:foo/bar syscalls/sys_enter_openat arg=^$foo' # BAD_ATTACH_ARG
+check_error 'e:foo/bar syscalls/sys_enter_openat arg=^COMM' # NO_EVENT_FIELD
+if grep -q '\\$current' README; then
+ check_error 'e:foo/bar syscalls/sys_enter_openat arg=^current' # NO_EVENT_FIELD
+fi
+
if grep -q '<attached-group>\.<attached-event>.*\[if <filter>\]' README; then
check_error 'e:foo/bar syscalls/sys_enter_openat if ^' # NO_EP_FILTER
fi
+if grep -q 'this_cpu_read(<fetcharg>)' README; then
+ check_error 'e:foo/bar syscalls/sys_enter_openat arg=^this_cpu_read(file)' # NOSUP_PERCPU
+fi
+
exit 0
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
index fee479295e2f..e9d7e6919c7f 100644
--- a/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
@@ -112,6 +112,18 @@ check_error 'f vfs_read%return $retval->^foo' # NO_PTR_STRCT
check_error 'f vfs_read file->^foo' # NO_BTF_FIELD
check_error 'f vfs_read file^-.foo' # BAD_HYPHEN
check_error 'f vfs_read ^file:string' # BAD_TYPE4STR
+if grep -qF "[(structname" README ; then
+check_error 'f vfs_read arg1=(task_struct)file^' # TYPECAST_REQ_FIELD
+check_error 'f vfs_read arg1=(a)((b)((c)(^(d)file->d)->c)->b)->a' # TOO_MANY_NESTED
+check_error 'f vfs_read arg1=(task_struct,^in_execve)file->comm' # TYPECAST_NOT_ALIGNED
+check_error 'f vfs_read arg1=(task_struct,^foo_bar)file->pid' # NO_BTF_FIELD
+check_error 'f vfs_read arg1=(^task_struct1234)file->pid' # NO_PTR_STRCT
+check_error 'f vfs_read arg1=(task_struct,se^->group_node)file->comm' # TYPECAST_BAD_ARROW
+check_error 'f vfs_read arg1=(task_struct,^->pid)file->comm' # NO_BTF_FIELD
+check_error 'f vfs_read arg1=(task_struct,^.pid)file->comm' # NO_BTF_FIELD
+check_error 'f vfs_read arg1=(task_struct,^.)file->comm' # NO_BTF_FIELD
+check_error 'f vfs_read arg1=(task_struct)^@symbol+10->comm' # TYPECAST_SYM_OFFSET
+fi
fi
else
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
index 8f1c58f0c239..21ce8414459f 100644
--- a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
@@ -115,6 +115,18 @@ check_error 'p vfs_read+20 ^$arg*' # NOFENTRY_ARGS
check_error 'p vfs_read ^hoge' # NO_BTFARG
check_error 'p kfree ^$arg10' # NO_BTFARG (exceed the number of parameters)
check_error 'r kfree ^$retval' # NO_RETVAL
+if grep -qF "[(structname" README ; then
+check_error 'p vfs_read arg1=(task_struct)file^' # TYPECAST_REQ_FIELD
+check_error 'p vfs_read arg1=(a)((b)((c)(^(d)file->d)->c)->b)->a' # TOO_MANY_NESTED
+check_error 'p vfs_read arg1=(task_struct,^in_execve)file->comm' # TYPECAST_NOT_ALIGNED
+check_error 'p vfs_read arg1=(task_struct,^foo_bar)file->pid' # NO_BTF_FIELD
+check_error 'p vfs_read arg1=(^task_struct1234)file->pid' # NO_PTR_STRCT
+check_error 'p vfs_read arg1=(task_struct,se^->group_node)file->comm' # TYPECAST_BAD_ARROW
+check_error 'p vfs_read arg1=(task_struct,^->pid)file->comm' # NO_BTF_FIELD
+check_error 'p vfs_read arg1=(task_struct,^.pid)file->comm' # NO_BTF_FIELD
+check_error 'p vfs_read arg1=(task_struct,^.)file->comm' # NO_BTF_FIELD
+check_error 'p vfs_read arg1=(task_struct)^@symbol+10->comm' # TYPECAST_SYM_OFFSET
+fi
else
check_error 'p vfs_read ^$arg*' # NOSUP_BTFARG
fi
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc
index c817158b99db..e12dc967ec76 100644
--- a/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc
@@ -28,4 +28,9 @@ if grep -q ".*symstr.*" README; then
check_error 'p /bin/sh:10 $stack0:^symstr' # BAD_TYPE
fi
+# $current is not supported by uprobe
+if grep -q "\$current.*" README; then
+check_error 'p /bin/sh:10 ^$current:u8' # BAD_VAR
+fi
+
exit 0
^ permalink raw reply related
* Re: [PATCH v7 0/9] bootconfig: embed kernel.* cmdline at build time
From: Masami Hiramatsu @ 2026-06-26 14:33 UTC (permalink / raw)
To: Breno Leitao
Cc: Andrew Morton, Nathan Chancellor, paulmck, Nicolas Schier,
Nick Desaulniers, Bill Wendling, Justin Stitt, Jonathan Corbet,
Shuah Khan, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, linux-kernel,
linux-trace-kernel, linux-kbuild, bpf, llvm, linux-doc,
kernel-team, Nicolas Schier
In-Reply-To: <20260626-bootconfig_using_tools-v7-0-24ab72139c29@debian.org>
On Fri, 26 Jun 2026 05:50:09 -0700
Breno Leitao <leitao@debian.org> wrote:
> The userspace pieces (xbc_snprint_cmdline() in lib/, tools/bootconfig -C)
> already landed; this series wires the rendered cmdline into the kernel.
>
> Motivation: today the embedded bootconfig is parsed at runtime, after
> parse_early_param() has already run, so early_param() handlers can't
> see embedded values. Folding the kernel.* subtree into the cmdline at
> build time gives a CONFIG_CMDLINE-equivalent for embedded-bootconfig
> users without forcing them to maintain two cmdline sources.
>
> Behaviorally, the "kernel" subtree is rendered to a flat string at
> build time and stashed in .init.rodata. setup_arch() prepends it to
> boot_command_line before parse_early_param() runs. Overflow is a soft
> error: the helper logs and leaves boot_command_line untouched rather
> than panicking, so an oversized embedded bconf cannot brick a boot.
>
Thanks for update!! This looks good to me.
Let me pick it and test it.
Thanks,
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> Changes in v7:
> - The runtime opt-in now shares one helper instead of open-coding its
> own. (Masami)
> - bootconfig_cmdline_requested() moved into generic lib code (Masami)
> - Link to v6: https://lore.kernel.org/r/20260623-bootconfig_using_tools-v6-0-640c2f587a3c@debian.org
>
> Changes in v6:
> - renamed CONFIG_BOOT_CONFIG_EMBED_CMDLINE to
> CONFIG_CMDLINE_FROM_BOOTCONFIG
> - prepend embedded bootconfig cmdline before parse_early_param
> - Link to v5: https://lore.kernel.org/r/20260617-bootconfig_using_tools-v5-0-fd589a9cc5e3@debian.org
>
> Changes in v5:
> - Patch 3 (Kconfig): drop the redundant "depends on BOOT_CONFIG_EMBED"
> from CMDLINE_FROM_BOOTCONFIG; Julian Braha.
> - Patch 6 (Documentation): spell out how the embedded cmdline interacts
> with the bootloader cmdline, an initrd bootconfig, and the embedded
> bootconfig
> - Link to v4: https://lore.kernel.org/r/20260609-bootconfig_using_tools-v4-0-73c463f03a97@debian.org
>
> Changes in v4:
> - Patch 3 (build pipeline): clear CROSS_COMPILE= in the kernel-side
> tools/bootconfig sub-make. Without it, an LLVM=1 cross build
> inherits CROSS_COMPILE and tools/scripts/Makefile.include injects
> --target=/--sysroot= into the host clang, producing a target
> binary that fails to exec.
> - Patch 3 (build pipeline): place embedded-cmdline.S in its own
> .init.rodata.embed_cmdline subsection ("a") so ld.lld does not
> see a section-type mismatch against lib/bootconfig-data.S's
> writable .init.rodata ("aw"). The linker's *(.init.rodata
> .init.rodata.*) glob still folds it into the init image.
> - Patch 6 (x86/setup): also accept the bootconfig=<anything> form
> via cmdline_find_option(), matching the runtime parse_args() loop.
> Without it, bootconfig=0/=off would skip the early prepend but
> still trigger the late runtime apply -- a split-brain state.
> - New patch 7: document CONFIG_CMDLINE_FROM_BOOTCONFIG in
> Documentation/admin-guide/bootconfig.rst (semantics, opt-in,
> precedence, overflow behavior, example).
> - Link to v3: https://lore.kernel.org/r/20260608-bootconfig_using_tools-v3-0-4ddd079a0696@debian.org
>
> Changes in v3:
> - Patch 3: Move HOSTCC override to the kernel-side rule; tool keeps
> $(CC) for standalone/cross builds.
> - Patch 6: Drop the false fail-safe wording; document the
> BOOT_CONFIG_FORCE=y default interaction.
> - Link to v2:
> https://lore.kernel.org/r/20260605-bootconfig_using_tools-v2-0-d309f544b5f7@debian.org
>
> Changes in v2 (addressing review of v1):
> - Split out a standalone fix for the NULL-pointer arithmetic in
> xbc_snprint_cmdline() so the build-time render cannot trip host
> UBSan/FORTIFY_SOURCE.
> - Rework the leaf-root handling: instead of returning early, skip @root
> inside the loop so a root carrying both a value and subkeys
> (kernel = x together with kernel.foo = bar) still renders its
> descendant keys.
> - Build tools/bootconfig with $(HOSTCC) so cross-compiled (ARCH=...)
> builds render the cmdline on the build host instead of failing with
> "Exec format error".
> - Mark the embedded cmdline section read-only (drop the "w" flag from
> .init.rodata).
> - Add a make-clean hook so tools/bootconfig artifacts are removed by
> make clean.
> - Gate the x86 prepend on "bootconfig" being present on the command
> line (or CONFIG_BOOT_CONFIG_FORCE), matching the init.* opt-in
> semantics documented in bootconfig.rst and preserving fail-safe
> recovery: dropping "bootconfig" from the bootloader cmdline now also
> disables the embedded kernel.* keys.
> - Link to v1: https://patch.msgid.link/20260527-bootconfig_using_tools-v1-0-b6906a86e7d5@debian.org
>
> ---
> Breno Leitao (9):
> bootconfig: fix NULL-pointer arithmetic in xbc_snprint_cmdline()
> bootconfig: render descendant keys when xbc_snprint_cmdline() root has a value
> bootconfig: render embedded bootconfig as a kernel cmdline at build time
> bootconfig: clean build-time tools/bootconfig from make clean
> bootconfig: add xbc_prepend_embedded_cmdline() helper
> Documentation: bootconfig: document build-time cmdline rendering
> x86/setup: prepend embedded bootconfig cmdline before parse_early_param
> bootconfig: skip runtime kernel.* render once prepended early
> init/main.c: use bootconfig_cmdline_requested() for the runtime opt-in
>
> Documentation/admin-guide/bootconfig.rst | 81 ++++++++++++++++
> MAINTAINERS | 1 +
> Makefile | 27 +++++-
> arch/x86/Kconfig | 1 +
> arch/x86/kernel/setup.c | 14 ++-
> include/linux/bootconfig.h | 14 +++
> init/Kconfig | 36 +++++++
> init/main.c | 52 +++++-----
> lib/Makefile | 16 +++
> lib/bootconfig.c | 162 +++++++++++++++++++++++++++++--
> lib/embedded-cmdline.S | 16 +++
> tools/bootconfig/Makefile | 4 +-
> 12 files changed, 388 insertions(+), 36 deletions(-)
> ---
> base-commit: a87737435cfa134f9cdcc696ba3080759d04cf72
> change-id: 20260508-bootconfig_using_tools-cfa7aa9d6a5a
>
> Best regards,
> --
> Breno Leitao <leitao@debian.org>
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH v7 0/9] bootconfig: embed kernel.* cmdline at build time
From: Breno Leitao @ 2026-06-26 14:53 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Andrew Morton, Nathan Chancellor, paulmck, Nicolas Schier,
Nick Desaulniers, Bill Wendling, Justin Stitt, Jonathan Corbet,
Shuah Khan, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, linux-kernel,
linux-trace-kernel, linux-kbuild, bpf, llvm, linux-doc,
kernel-team, Nicolas Schier
In-Reply-To: <20260626233327.b5c9c8de494acdde4ddf5c02@kernel.org>
On Fri, Jun 26, 2026 at 11:33:27PM +0900, Masami Hiramatsu wrote:
> > The userspace pieces (xbc_snprint_cmdline() in lib/, tools/bootconfig -C)
> > already landed; this series wires the rendered cmdline into the kernel.
> >
> > Motivation: today the embedded bootconfig is parsed at runtime, after
> > parse_early_param() has already run, so early_param() handlers can't
> > see embedded values. Folding the kernel.* subtree into the cmdline at
> > build time gives a CONFIG_CMDLINE-equivalent for embedded-bootconfig
> > users without forcing them to maintain two cmdline sources.
> >
> > Behaviorally, the "kernel" subtree is rendered to a flat string at
> > build time and stashed in .init.rodata. setup_arch() prepends it to
> > boot_command_line before parse_early_param() runs. Overflow is a soft
> > error: the helper logs and leaves boot_command_line untouched rather
> > than panicking, so an oversized embedded bconf cannot brick a boot.
> >
>
> Thanks for update!! This looks good to me.
> Let me pick it and test it.
This is great. Thanks for it and for the support so far.
--breno
^ permalink raw reply
* Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Ackerley Tng @ 2026-06-26 15:28 UTC (permalink / raw)
To: Yan Zhao
Cc: Sean Christopherson, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <aj3TGLGWT1kMFIVH@yzhao56-desk.sh.intel.com>
Yan Zhao <yan.y.zhao@intel.com> writes:
> On Thu, Jun 25, 2026 at 05:07:23PM -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>>
>> > On Wed, Jun 24, 2026 at 04:00:32PM -0700, Ackerley Tng wrote:
>> >> Sean Christopherson <seanjc@google.com> writes:
>> >>
>> >> > On Tue, Jun 23, 2026, Yan Zhao wrote:
>> >> >> On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:
>> >> >> > On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:
>> >> >> > > On Mon, Jun 22, 2026, Yan Zhao wrote:
>> >> >> > > > On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 Relay wrote:
>> >> >> > > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> >> >> > > > > index ffe9d0db58c59..56d10333c61a7 100644
>> >> >> > > > > --- a/arch/x86/kvm/vmx/tdx.c
>> >> >> > > > > +++ b/arch/x86/kvm/vmx/tdx.c
>> >> >> > > > > @@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>> >> >> > > > > if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
>> >> >> > > > > return -EIO;
>> >> >> > > > >
>> >> >> > > > > - if (!src_page)
>> >> >> > > > > - return -EOPNOTSUPP;
>> >> >> > > > > + if (!src_page) {
>> >> >> > > > > + if (!gmem_in_place_conversion)
>> >> >> > > > When userspace turns on gmem_in_place_conversion while creating guest_memfd
>> >> >> > > > without the MMAP flag, the absence of src_page should still be treated as an
>> >> >> > > > error.
>> >> >> > >
>> >> >> > > Why MMAP?
>> >> >> > Hmm, I was showing a scenario that in-place conversion couldn't occur.
>> >> >> > I didn't mean that with the MMAP flag, mmap() and user write must occur.
>> >> >> >
>> >> >> > > Shouldn't this be a general "if (!src_page && !up-to-date)"? Just
>> >> >> > > because userspace _can_ mmap() the memory doesn't mean userspace _has_ mmap()'d
>> >> >> > > and written memory. And when write() lands, MMAP wouldn't be necessary to
>> >> >> > > initialize the memory.
>> >> >> > Do you mean using up-to-date flag as below?
>> >> >
>> >> > Yes? I didn't actually look at the implementation details.
>> >> >
>> >> >> > if (!src_page) {
>> >> >> > src_page = pfn_to_page(pfn);
>> >> >> > if (!folio_test_uptodate(page_folio(src_page)))
>> >> >> > return -EOPNOTSUPP;
>> >> >> > }
>> >>
>> >> Yan is right that with the earlier patch "Zero page while getting pfn",
>> >> folio_test_uptodate() here will always return true.
>> >>
>> >> Actually, this is an alternative fix for the issue Sashiko pointed out
>> >> on v7 where userspace can do a populate() (either TDX or SNP) without
>> >> first allocating the page, with src_address == NULL, and leak
>> >> uninitialized memory into the guest.
>> >>
>> >> Advantage of using the uptodate check in populate: if the host never
>> >> allocates the page, populate doesn't incur zeroing before writing the
>> >> page anyway in populate().
>> >>
>> >> Disadvantage: Both TDX and SNP will have to implement this uptodate
>> >> check. guest_memfd can't check centrally because for SNP, for a
>> >> PAGE_TYPE_ZERO, !src_page should be allowed with a !uptodate page since
>> >> firmware will zero and there's no leakage of uninitialized host memory?
>> > Another disadvantage: the uptodate flag is per-folio. What if the folio
>> > is only partially initialized by the userspace especially after huge page is
>> > supported?
>> >
>>
>> Good point on huge pages!
>>
>> The uptodate flag on the folio in guest_memfd means "this folio has been
>> written to". As of now (before patch at [1]), this happens when
>>
>> + folio is zeroed on first use by userspace
>> + folio is zeroed on first use of the guest
>> + folio is populated
>>
>> When huge pages are supported, the folio can't partially be initialized?
>>
>> On allocation, if any part is shared, we split the page. The parts are
>> separate folios that have their own uptodate flags.
>>
>> On splitting, if the huge page is uptodate, the split pages will also be
>> uptodate. If the huge page is not uptodate, the split pages won't be
>> uptodate, but that's ok since they will be marked uptodate on first use.
>>
>> On merging, the non-uptodate parts have to be zeroed and then marked
> If that's true, it would be good.
>
>> uptodate. Any parts that are in use would have been marked uptodate
>> already, so there's no overwriting data that is in use. I'll need to
>> think more about when it's safe to zero.
>>
>> I'm still on the fence between the two options
>>
>> 1. Using uptodate check in populate to reject src_pages that have never
>> been written to or
>> 2. Always zero before populate
> 2 does not work?
> The flow is
> 1. mmap gmem_fd, make GFN shared, and write initial content.
> 2. convert GFN to private
> 3. invoke ioctl to trigger populate.
>
This flow is correct, is what users of in-place conversion should do.
"Always" is the wrong word, I should have said "zero if not uptodate
before populate", as in, with patch at [1].
By doing the zeroing in __kvm_gmem_get_pfn instead, by the time populate
gets the pfn, the page would be zeroed, either because userspace faulted
it in, and the zeroing happened in kvm_gmem_fault_user_mapping(), or if
userspace never faulted it in, the zeroing would happen because
populate() allocated the page.
>> but whether the uptodate flag is per-folio or not doesn't affect these
>> two options in terms of fixing the leak of uninitialized host memory,
>> right?
> yes, provided "On merging, the non-uptodate parts have to be zeroed and then
> marked uptodate".
>
Thank you so much for bringing this up, I hadn't considered this
before. I'll do that when I get to guest_memfd hugepage restructuring.
>> >
>> >> >> Another concern with this fix is that:
>> >> >> commit "KVM: guest_memfd: Zero page while getting pfn" [1] always marks the
>> >> >> folio uptodate before reaching post_populate().
>> >> >>
>> >> >> [1] https://lore.kernel.org/all/20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com/
>> >> >>
>> >> >> > One concern is that TDX now does not much care about the up-to-date flag since
>> >> >> > TDX doesn't rely on the flag to clear pages on conversions.
>> >> >> > I'm not sure if the flag can be reliably checked in this case. e.g.,
>> >> >> > now the whole folio is marked up-to-date even if only part of it is faulted by
>> >> >> > user access.
>> >> >> > Ensuring that the up-to-date flag works correctly with huge page support seems
>> >> >> > to have more effort than introducing a dedicated flag for TDX.
>> >> >> >
>> >> >> > > > Additionally, to properly enable in-place copying for the TDX initial memory
>> >> >> > > > region, userspace must not only specify source_addr to NULL, but also follow
>> >> >> > > > a specific sequence (where steps 1/2/3/7 are required only for in-place copy):
>> >> >> > > > 1. create guest_memfd with MMAP flag
>> >> >> > > > 2. mmap the guest_memfd.
>> >> >> > > > 3. convert the initial memory range to shared.
>> >> >> > > > 4. copy initial content to the source page.
>> >> >> > > > 5. convert the initial memory range to private
>> >> >> > > > 6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
>> >> >> > > > 7. do not unmap the source backend.
>> >> >> > > >
>> >> >> > > > So, would it be reasonable to introduce a dedicated flag that allows userspace
>> >> >> > > > to explicitly opt into the in-place copy functionality? e.g.,
>> >> >> > >
>> >> >> > > Why? It's userspace's responsibility to get the above right. If userspace fails
>> >> >> > > to provide a src_page when it doesn't want in-place copy, that's a userspace bug.
>> >>
>> >> Yan, is your concern that userspace forgot to update the code and
>> >> forgets to provide a src_page, and if we keep the "Zero page while
>> > Yes. Previously, it would be rejected after GUP fails.
>> >
>>
>> I see, didn't realize previously it would be rejected because GUP
>> fails. GUP failed because it wasn't faulted into the host?
> GUP fails if 0 is not a valid user address.
> But GUP would not fail if 0 is a valid address. e.g., in below scenario:
>
> #include <sys/mman.h>
> #include <stdio.h>
> int main(void)
> {
> void *p=mmap((void*)0,4096,PROT_READ|PROT_WRITE, MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS,-1,0);
> if (p==MAP_FAILED) {
> perror("mmap");
> return 1;
> }
> *(char*)0='Y';
> printf("addr0=%p val=%c\n",p,*(char*)0);
> return 0;
> }
>
>
>> That's kind of orthogonal, I don't think GUP fail leading to rejecting
>> populate was meant to help userspace catch these issues. GUP would also
>> fail if the user did mmap(), write to it, unmap using
>> madvise(MADV_DONTNEED), then forget and pass 0 as src_address.
> The original uAPI did not explicitly define 0 as an invalid uaddr. Whether 0 was
> rejected depended on whether the user mmap()'d address 0. If 0 was a valid
> mapping, populate() could proceed.
>
> commit 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating
> guest memory") changed the behavior though. It would return -EOPNOTSUPP for a 0
> uaddr.
>
I see, I only looked at this after commit 2a62345b3052.
> But if a user configures 0 uaddr as valid, writes to it, and then passes 0 as
> source_addr(not from gmem), I'm not sure if it's good for the kernel to silently
> treat 0 uaddr as an identifier for in-place copy from the private PFN in gmem.
>
I'd say the original uAPI perhaps just didn't document 0 as an
unsupported uaddr. Given that commit 2a62345b3052 already merged, uAPI
was perhaps accidentally changed and no customer complained, I think we
can move forward with 0 as an invalid src_address? I wouldn't think
anyone relies on 0 intentionally being a valid address.
I could document that, if it helps?
>
>> >> getting pfn" patch, ends up with the guest silently having a zero page?
>> >> I think that would be found quite early in userspace VMM testing...
>> > I actually encountered this during testing this patch.
>> > I update most code path to follow this sequence. However, still some corner ones
>> > for TDVF HOB, which are less obvious and harder to update.
>> > The TD just booted up and hang silently.
>> >
>>
>> I think this is just the life of a close-to-hardware software engineer
>> :P no errors, got stuck somewhere, root cause is some unitialized
>> thing.
>>
>> >> >> > I mean if userspace specifies a NULL source_addr by mistake, it's better for
>> >> >> > kernel to detect this mistake, similar to how it validates whether source_addr
>> >> >> > is PAGE_ALIGNED.
>> >> >
>> >> > The alignment case is different. If userspace provides an unaligned value, KVM
>> >> > *can't* do what userspace is asking because hardware and thus KVM only supports
>> >> > converting on page boundaries.
>> >> >
>> >> > For a NULL source, KVM can still do what userspace is asking. Rejecting userspace's
>> >> > request would then be making assumptions about what userspace wants.
>> >> >
>> >>
>> >> Also, +1 on this, what if userspace, knowing that pages are zeroed on
>> >> allocation, actually wants to rely on that to get a zero page in the guest?
>> > What if 0 uaddr is a valid address? :)
>> >
>> >> >> > Since userspace already needs to perform additional steps to enable in-place
>> >> >> > copy, specifying a dedicated flag to indicate that the NULL source_addr is
>> >> >> > intentional seems like a reasonable burden.
>> >> >
>> >> > I don't see how it adds any value. I wouldn't be at all surprised if most VMMs
>> >> > just wen up with code that does:
>> >> >
>> >> > if (in-place) {
>> >> > src = NULL;
>> >> > flags |= KVM_TDX_IN_PLACE_COPY_INITIAL_MEMORY_REGION;
>> >> > }
>> >>
^ permalink raw reply
* [PATCH v10 1/6] mm/memory-failure: drop dead error_states[] entry for reserved pages
From: Breno Leitao @ 2026-06-26 15:33 UTC (permalink / raw)
To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Liam R. Howlett
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
linux-trace-kernel, kernel-team
In-Reply-To: <20260626-ecc_panic-v10-0-6dacb8ad024d@debian.org>
The first entry of error_states[],
{ reserved, reserved, MF_MSG_KERNEL, me_kernel },
is unreachable. identify_page_state() has two callers, and neither
one can dispatch a PG_reserved page to me_kernel():
* memory_failure() reaches identify_page_state() only after
get_hwpoison_page() returned 1. get_any_page() reaches that
return only via __get_hwpoison_page(), which only takes a
refcount when the page is HWPoisonHandlable().
HWPoisonHandlable() is an allowlist for LRU, free-buddy, and
(for soft-offline) movable_ops pages -- PG_reserved pages do
not satisfy any of these, so they fail with -EBUSY/-EIO long
before identify_page_state() runs.
* try_memory_failure_hugetlb() reaches identify_page_state() only
via the MF_HUGETLB_IN_USED branch, where the page is necessarily
a hugetlb folio. hugetlb folios don't carry PG_reserved at that
point: hugetlb_folio_init_vmemmap() calls __folio_clear_reserved()
during init, so the reserved entry would not match even if it
were still present.
me_kernel() never executes and the entry exists only to be matched
against by code that cannot see it.
Drop the entry, the me_kernel() helper, and the now-unused
"reserved" macro. Leave the MF_MSG_KERNEL enum value in place: it
remains part of the tracepoint and pr_err() string tables, and
follow-on work to classify unrecoverable kernel pages can reuse it
without churning the user-visible enum.
No functional change.
Suggested-by: David Hildenbrand <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
mm/memory-failure.c | 14 --------------
1 file changed, 14 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 51508a55c4055..f4d3e6e20e13f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -980,17 +980,6 @@ static bool has_extra_refcount(struct page_state *ps, struct page *p,
return false;
}
-/*
- * Error hit kernel page.
- * Do nothing, try to be lucky and not touch this instead. For a few cases we
- * could be more sophisticated.
- */
-static int me_kernel(struct page_state *ps, struct page *p)
-{
- unlock_page(p);
- return MF_IGNORED;
-}
-
/*
* Page in unknown state. Do nothing.
* This is a catch-all in case we fail to make sense of the page state.
@@ -1199,10 +1188,8 @@ static int me_huge_page(struct page_state *ps, struct page *p)
#define mlock (1UL << PG_mlocked)
#define lru (1UL << PG_lru)
#define head (1UL << PG_head)
-#define reserved (1UL << PG_reserved)
static struct page_state error_states[] = {
- { reserved, reserved, MF_MSG_KERNEL, me_kernel },
/*
* free pages are specially detected outside this table:
* PG_buddy pages only make a small fraction of all free pages.
@@ -1234,7 +1221,6 @@ static struct page_state error_states[] = {
#undef mlock
#undef lru
#undef head
-#undef reserved
static void update_per_node_mf_stats(unsigned long pfn,
enum mf_result result)
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH v10 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Breno Leitao @ 2026-06-26 15:33 UTC (permalink / raw)
To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Liam R. Howlett
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
linux-trace-kernel, kernel-team
In-Reply-To: <20260626-ecc_panic-v10-0-6dacb8ad024d@debian.org>
get_any_page() collapses every HWPoisonHandlable() rejection into a
single -EIO via the __get_hwpoison_page() -> -EBUSY -> shake_page()
-> retry path. That is correct for the transient case (a userspace
folio briefly off LRU during migration or compaction, which a later
shake can drag back), but wrong for stable kernel-owned pages: slab,
page-table, large-kmalloc and PG_reserved pages will never become
HWPoisonHandlable(), so the retry loop is wasted work and the final
-EIO loses the "this is structurally unrecoverable" information.
memory_failure() then maps -EIO into MF_MSG_GET_HWPOISON, which the
panic-on-unrecoverable sysctl deliberately does not act on.
Introduce is_kernel_owned_page(), a small predicate that positively
identifies pages the hwpoison handler cannot recover from:
is_kernel_owned_page(p) :=
PageReserved(p) ||
PageSlab(head) || PageTable(head) || PageLargeKmalloc(head)
where head = compound_head(p).
PG_reserved is a per-page flag (PF_NO_COMPOUND) and is tested on the
page directly. The slab, page-table and large-kmalloc page-type bits
are only stored on the head page, so those tests resolve the compound
head first, then re-read compound_head(page) afterwards: a concurrent
split or compound free that moves head invalidates the just-read flags
and the loop retries. The lookup still takes no refcount, mirroring
the rest of get_any_page(); the recheck closes the common split race,
and a residual free->alloc->free in the same window can only mis-tag
a genuinely poisoned page, never reclassify a handlable one.
No MF_SOFT_OFFLINE / page_has_movable_ops() opt-out is needed: a
movable_ops page is always PageOffline or PageZsmalloc, whose
page_type is mutually exclusive with slab, page-table and
large-kmalloc, and it never carries PG_reserved, so it can never
match any of the checks above.
The list is intentionally not exhaustive. vmalloc and kernel-stack
pages, for example, do not carry a page_type bit and would need a
different oracle; they keep going through the existing retry path
unchanged. This is the smallest set we can identify with certainty
by page type.
Wire the helper into the top of get_any_page() to short-circuit
those pages before the retry loop runs. On a hit, drop the caller's
MF_COUNT_INCREASED reference (if any) and return -ENOTRECOVERABLE
straight away. Pages outside the helper's positive list still take
the existing retry path and return -EIO, leaving operator-visible
behaviour for those cases unchanged.
Extend the unhandlable-page pr_err() to fire for either errno and
update the get_hwpoison_page() kerneldoc to document the new return.
memory_failure() still folds every negative return into
MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
this patch on its own only changes the errno that soft_offline_page()
can propagate to its callers. A follow-up wires -ENOTRECOVERABLE
through memory_failure() and reports MF_MSG_KERNEL for the
unrecoverable cases, which is what the
panic_on_unrecoverable_memory_failure sysctl observes.
Suggested-by: David Hildenbrand <david@kernel.org>
Suggested-by: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
mm/memory-failure.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 48 insertions(+), 2 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f4d3e6e20e13f..d08fbd0d8c39f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1325,6 +1325,36 @@ static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
return PageLRU(page) || is_free_buddy_page(page);
}
+/*
+ * Positive identification of pages the hwpoison handler cannot recover:
+ * pages owned by kernel internals with no userspace mapping to unmap, no
+ * file mapping to invalidate, and no migration target.
+ */
+static inline bool is_kernel_owned_page(struct page *page)
+{
+ struct page *head;
+ bool kernel_owned;
+
+ /* PG_reserved is a per-page flag, never set on a compound page. */
+ if (PageReserved(page))
+ return true;
+
+ /*
+ * Page-type bits live only on the head page, so resolve any tail
+ * first. The check takes no refcount; recheck the head afterwards
+ * so a concurrent split or compound free cannot leave us trusting
+ * a stale view. A free->alloc->free in the same window is still
+ * possible but closing it would require taking a reference here.
+ */
+retry:
+ head = compound_head(page);
+ kernel_owned = PageSlab(head) || PageTable(head) ||
+ PageLargeKmalloc(head);
+ if (head != compound_head(page))
+ goto retry;
+ return kernel_owned;
+}
+
static int __get_hwpoison_page(struct page *page, unsigned long flags)
{
struct folio *folio = page_folio(page);
@@ -1371,6 +1401,19 @@ static int get_any_page(struct page *p, unsigned long flags)
if (flags & MF_COUNT_INCREASED)
count_increased = true;
+ /*
+ * Page types we know are kernel-owned and cannot be recovered.
+ * Short-circuit before the shake_page() / retry loop, which
+ * cannot turn any of these into something HWPoisonHandlable().
+ * Drop the caller's reference if MF_COUNT_INCREASED took one.
+ */
+ if (is_kernel_owned_page(p)) {
+ if (count_increased)
+ put_page(p);
+ ret = -ENOTRECOVERABLE;
+ goto out;
+ }
+
try_again:
if (!count_increased) {
ret = __get_hwpoison_page(p, flags);
@@ -1418,7 +1461,7 @@ static int get_any_page(struct page *p, unsigned long flags)
ret = -EIO;
}
out:
- if (ret == -EIO)
+ if (ret == -EIO || ret == -ENOTRECOVERABLE)
pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
return ret;
@@ -1475,7 +1518,10 @@ static int __get_unpoison_page(struct page *page)
* -EIO for pages on which we can not handle memory errors,
* -EBUSY when get_hwpoison_page() has raced with page lifecycle
* operations like allocation and free,
- * -EHWPOISON when the page is hwpoisoned and taken off from buddy.
+ * -EHWPOISON when the page is hwpoisoned and taken off from buddy,
+ * -ENOTRECOVERABLE for kernel-owned pages identified by
+ * is_kernel_owned_page() (PG_reserved, slab,
+ * page-table, large-kmalloc) that the handler cannot recover.
*/
static int get_hwpoison_page(struct page *p, unsigned long flags)
{
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH v10 3/6] mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages
From: Breno Leitao @ 2026-06-26 15:33 UTC (permalink / raw)
To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Liam R. Howlett
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
linux-trace-kernel, kernel-team
In-Reply-To: <20260626-ecc_panic-v10-0-6dacb8ad024d@debian.org>
The previous patch teaches get_any_page() to return -ENOTRECOVERABLE
for stable unhandlable kernel pages (PG_reserved, slab, page tables,
large-kmalloc). memory_failure() still folds every negative return
into MF_MSG_GET_HWPOISON, so callers that want to react to the
unrecoverable cases (a panic option, smarter logging) cannot tell
them apart from transient page-allocator races.
Turn the post-call branch into a switch over the get_hwpoison_page()
return code: map -ENOTRECOVERABLE to MF_MSG_KERNEL and any other
negative return to MF_MSG_GET_HWPOISON. case 0 keeps the existing
free-buddy / kernel-high-order handling and case 1 falls through to
the rest of memory_failure() unchanged.
The MF_MSG_KERNEL label and tracepoint string are kept as
"reserved kernel page" to avoid breaking userspace tools that match
on those literals; the enum value still adequately tags the failure
even though it now also covers slab, page tables and large-kmalloc
pages.
Suggested-by: David Hildenbrand <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
mm/memory-failure.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d08fbd0d8c39f..8e2aa2fafc14e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2434,7 +2434,8 @@ int memory_failure(unsigned long pfn, int flags)
* that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
*/
res = get_hwpoison_page(p, flags);
- if (!res) {
+ switch (res) {
+ case 0:
if (is_free_buddy_page(p)) {
if (take_page_off_buddy(p)) {
page_ref_inc(p);
@@ -2453,7 +2454,19 @@ int memory_failure(unsigned long pfn, int flags)
res = action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
}
goto unlock_mutex;
- } else if (res < 0) {
+ case 1:
+ /* Got a refcount on a handlable page. */
+ break;
+ case -ENOTRECOVERABLE:
+ /*
+ * Stable unhandlable kernel-owned page (PG_reserved,
+ * slab, page tables, large-kmalloc).
+ * No recovery possible.
+ */
+ res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
+ goto unlock_mutex;
+ default:
+ /* Transient lifecycle race with the page allocator. */
res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
goto unlock_mutex;
}
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH v10 4/6] mm/memory-failure: add panic option for unrecoverable pages
From: Breno Leitao @ 2026-06-26 15:33 UTC (permalink / raw)
To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Liam R. Howlett
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
linux-trace-kernel, kernel-team
In-Reply-To: <20260626-ecc_panic-v10-0-6dacb8ad024d@debian.org>
Add a sysctl panic_on_unrecoverable_memory_failure (disabled by
default) that triggers a kernel panic when memory_failure()
encounters pages that cannot be recovered. This provides a clean
crash with useful debug information rather than allowing silent
data corruption or a delayed crash at an unrelated code path.
Panic eligibility is intentionally narrow: only MF_MSG_KERNEL with
result == MF_IGNORED panics. After the previous patch, MF_MSG_KERNEL
covers PG_reserved pages and the kernel-owned pages promoted from
get_hwpoison_page() via -ENOTRECOVERABLE (slab, page tables,
large-kmalloc).
All other action types are excluded:
- MF_MSG_GET_HWPOISON and MF_MSG_KERNEL_HIGH_ORDER can be reached by
transient refcount races with the page allocator (an in-flight buddy
allocation has refcount 0 and is no longer on the buddy free list,
briefly), and panicking on them would risk killing the box for what
is actually a recoverable userspace page.
- MF_MSG_UNKNOWN means identify_page_state() could not classify the
page; that is precisely the wrong basis for a panic decision.
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
mm/memory-failure.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 8e2aa2fafc14e..611160c98c6f6 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -74,6 +74,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;
static int sysctl_enable_soft_offline __read_mostly = 1;
+static int sysctl_panic_on_unrecoverable_mf __read_mostly;
+
atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
static bool hw_memory_failure __read_mostly = false;
@@ -155,6 +157,15 @@ static const struct ctl_table memory_failure_table[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_ONE,
+ },
+ {
+ .procname = "panic_on_unrecoverable_memory_failure",
+ .data = &sysctl_panic_on_unrecoverable_mf,
+ .maxlen = sizeof(sysctl_panic_on_unrecoverable_mf),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
}
};
@@ -1255,6 +1266,15 @@ static void update_per_node_mf_stats(unsigned long pfn,
++mf_stats->total;
}
+static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
+ enum mf_result result)
+{
+ if (!sysctl_panic_on_unrecoverable_mf)
+ return false;
+
+ return type == MF_MSG_KERNEL && result == MF_IGNORED;
+}
+
/*
* "Dirty/Clean" indication is not 100% accurate due to the possibility of
* setting PG_dirty outside page lock. See also comment above set_page_dirty().
@@ -1272,6 +1292,9 @@ static int action_result(unsigned long pfn, enum mf_action_page_type type,
pr_err("%#lx: recovery action for %s: %s\n",
pfn, action_page_types[type], action_name[result]);
+ if (panic_on_unrecoverable_mf(type, result))
+ panic("Memory failure: %#lx: unrecoverable page", pfn);
+
return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
}
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH v10 5/6] Documentation: document panic_on_unrecoverable_memory_failure sysctl
From: Breno Leitao @ 2026-06-26 15:33 UTC (permalink / raw)
To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Liam R. Howlett
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
linux-trace-kernel, kernel-team
In-Reply-To: <20260626-ecc_panic-v10-0-6dacb8ad024d@debian.org>
Add documentation for the new vm.panic_on_unrecoverable_memory_failure
sysctl, describing which failures trigger a panic (kernel-owned pages
the handler cannot recover) and which are intentionally left out
(transient allocator races and unclassified pages).
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Documentation/admin-guide/sysctl/vm.rst | 80 +++++++++++++++++++++++++++++++++
1 file changed, 80 insertions(+)
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index b9b0c218bfb44..22cc54cac3b21 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -67,6 +67,7 @@ Currently, these files are in /proc/sys/vm:
- page-cluster
- page_lock_unfairness
- panic_on_oom
+- panic_on_unrecoverable_memory_failure
- percpu_pagelist_high_fraction
- stat_interval
- stat_refresh
@@ -925,6 +926,85 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
why oom happens. You can get snapshot.
+panic_on_unrecoverable_memory_failure
+======================================
+
+When a hardware memory error (e.g. multi-bit ECC) hits a kernel page
+that cannot be recovered by the memory failure handler, the default
+behaviour is to ignore the error and continue operation. This is
+dangerous because the corrupted data remains accessible to the kernel,
+risking silent data corruption or a delayed crash when the poisoned
+memory is next accessed.
+
+When enabled, this sysctl triggers a panic on memory failure events
+hitting kernel-owned pages that the handler cannot recover:
+``PageReserved`` (firmware reservations, kernel image, vDSO, zero
+page, and similar memblock-reserved regions), ``PageSlab``,
+``PageTable``, and ``PageLargeKmalloc``. These are owned by the
+kernel and the memory failure handler cannot reliably evict their
+contents.
+
+Other unrecoverable kernel-owned populations (vmalloc allocations,
+kernel stack pages, ...) are not currently covered because the
+handler has no page-type signal that distinguishes them from a
+userspace folio temporarily off the LRU during migration or
+compaction. Such pages still go through the standard
+MF_MSG_GET_HWPOISON path: ``PG_hwpoison`` is set on them and a
+delayed crash on the next access remains possible. Coverage may
+grow as the handler gains stronger kernel-ownership signals.
+
+Recoverable failure paths are also intentionally left out: in-flight
+buddy allocations and other transient races with the page allocator
+can reach the same diagnostic, and panicking on them would risk
+killing the box for a page destined for userspace where the standard
+SIGBUS recovery path applies. Pages whose state could not be
+classified at all are not covered either, since an unknown state is
+not a sound basis for a panic decision.
+
+For many environments it is preferable to panic immediately with a clean
+crash dump that captures the original error context, rather than to
+continue and face a random crash later whose cause is difficult to
+diagnose.
+
+Use cases
+---------
+
+This option is most useful in environments where unattributed crashes
+are expensive to debug or where data integrity must take precedence
+over availability:
+
+* Large fleets, where multi-bit ECC errors on kernel pages are observed
+ regularly and post-mortem analysis of an unrelated downstream crash
+ (often seconds to minutes after the original error) consumes
+ significant engineering effort.
+
+* Systems configured with kdump, where panicking at the moment of the
+ hardware error produces a vmcore that still contains the faulting
+ address, the affected page state, and the originating MCE/GHES
+ record — context that is typically lost by the time a delayed crash
+ occurs.
+
+* High-availability clusters that rely on fast, deterministic node
+ failure for failover, and prefer an immediate panic over silent data
+ corruption propagating to replicas or persistent storage.
+
+* Kernel and platform developers reproducing hwpoison issues with
+ tools such as ``mce-inject`` or error-injection debugfs interfaces,
+ where panicking on the unrecoverable path makes regressions
+ immediately visible instead of surfacing as later, unrelated
+ failures.
+
+= =====================================================================
+0 Try to continue operation (default).
+1 Panic immediately. If the ``panic`` sysctl is also non-zero then the
+ machine will be rebooted.
+= =====================================================================
+
+Example::
+
+ echo 1 > /proc/sys/vm/panic_on_unrecoverable_memory_failure
+
+
percpu_pagelist_high_fraction
=============================
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH v10 6/6] selftests/mm: add hwpoison-panic destructive test
From: Breno Leitao @ 2026-06-26 15:33 UTC (permalink / raw)
To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Liam R. Howlett
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
linux-trace-kernel, kernel-team
In-Reply-To: <20260626-ecc_panic-v10-0-6dacb8ad024d@debian.org>
Add a destructive selftest that verifies
vm.panic_on_unrecoverable_memory_failure actually panics when a
hwpoison error hits a kernel-owned page.
Three "kinds" of kernel-owned page can be targeted, selectable via
the script's first positional argument (default: rodata):
rodata - a PG_reserved page in the kernel rodata range, sourced
from the "Kernel rodata" sub-resource of "System RAM" in
/proc/iomem. That entry is reported on every major
architecture and guarantees the chosen PFN is backed by
struct page (an online System RAM range, not a firmware
hole), is PG_reserved, and is read-only -- so even if
the panic fails to fire for some reason, the resulting
PG_hwpoison marker on rodata does not corrupt writable
kernel state.
slab - a slab page found by walking /proc/kpageflags for the
first PFN with KPF_SLAB set (and KPF_HWPOISON / KPF_NOPAGE
/ KPF_COMPOUND_TAIL clear). Exercises the get_any_page()
path on a non PG_reserved kernel-owned page and so
catches regressions where get_any_page() collapses
kernel-owned pages into a transient -EIO instead of
-ENOTRECOVERABLE.
pgtable - same as slab, but the PFN is selected via KPF_PGTABLE.
PageLargeKmalloc, the fourth page type matched by
is_kernel_owned_page(), is intentionally not covered: it is a
PAGE_TYPE_OPS flag with no /proc/kpageflags bit, so selecting such
a PFN from userspace is not feasible. The slab and pgtable
variants already exercise the same get_any_page() positive-check
branch.
The script enables the sysctl and writes the selected physical
address to /sys/devices/system/memory/hard_offline_page. A
successful run crashes the kernel with
Memory failure: <pfn>: unrecoverable page
A return from the inject means no panic fired. Before reporting, the
script restores the sysctl and best-effort unpoisons the target PFN
through the hwpoison debugfs interface (hard_offline_page() injects
with MF_SW_SIMULATED, so the page stays unpoisonable), then re-reads
/proc/kpageflags: a PFN that is still the kernel-owned type it selected
is a genuine failure, while one that raced to a different type before
the inject is skipped as inconclusive. Test outcome is therefore
observed externally (serial console, kdump) rather than from the
script's own exit code.
The script is intentionally NOT wired into run_vmtests.sh: every
successful run panics the kernel, which is incompatible with the
sequential "run each category in the same VM" model that
run_vmtests.sh assumes. It is also not registered as a TEST_PROGS /
ksft_* wrapper so a default kselftest run does not opt itself into
a panic. The script is meant to be executed manually inside a
disposable VM (e.g. virtme-ng), one variant per VM boot, and
requires RUN_DESTRUCTIVE=1 in the environment as a safety net.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
tools/testing/selftests/mm/Makefile | 4 +
tools/testing/selftests/mm/hwpoison-panic.sh | 249 +++++++++++++++++++++++++++
2 files changed, 253 insertions(+)
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index e6df968f0971c..ed321ae709dac 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -174,6 +174,10 @@ TEST_PROGS += ksft_userfaultfd.sh
TEST_PROGS += ksft_vma_merge.sh
TEST_PROGS += ksft_vmalloc.sh
+# Destructive: every successful run panics the kernel. Installed and
+# kept executable, but not run from a default kselftest invocation.
+TEST_PROGS_EXTENDED += hwpoison-panic.sh
+
TEST_FILES := test_vmalloc.sh
TEST_FILES += test_hmm.sh
TEST_FILES += va_high_addr_switch.sh
diff --git a/tools/testing/selftests/mm/hwpoison-panic.sh b/tools/testing/selftests/mm/hwpoison-panic.sh
new file mode 100755
index 0000000000000..aafc06e895d01
--- /dev/null
+++ b/tools/testing/selftests/mm/hwpoison-panic.sh
@@ -0,0 +1,249 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Verify vm.panic_on_unrecoverable_memory_failure by injecting a hwpoison
+# error on a kernel-owned page and confirming the kernel panics.
+#
+# Three "kinds" of kernel-owned page can be targeted, selectable via the
+# first positional argument (default: rodata):
+#
+# rodata - a PG_reserved page in the kernel rodata range
+# (sourced from /proc/iomem "Kernel rodata"). Exercises
+# memory_failure() -> get_any_page() on a PageReserved page.
+#
+# slab - a slab page found via /proc/kpageflags (KPF_SLAB).
+# Exercises memory_failure() -> get_any_page() on a non
+# PG_reserved kernel-owned page. This path is what catches
+# regressions where get_any_page() collapses kernel-owned
+# pages into a transient -EIO instead of -ENOTRECOVERABLE.
+#
+# pgtable - a page-table page found via /proc/kpageflags (KPF_PGTABLE).
+# Same path as slab, different page type.
+#
+# This test is DESTRUCTIVE: a successful run crashes the kernel. It is
+# meant to be executed inside a disposable VM (e.g. virtme-ng) with a
+# serial console captured by the harness. It is skipped unless the
+# caller opts in via RUN_DESTRUCTIVE=1.
+#
+# Test passes externally: the kernel must panic with
+# "Memory failure: <pfn>: unrecoverable page"
+# A return from the inject means no panic fired: that is a failure,
+# unless the target PFN raced to a different page type before injection,
+# in which case the run is inconclusive and is skipped.
+#
+# Author: Breno Leitao <leitao@debian.org>
+
+set -u
+
+ksft_skip=4
+sysctl_path=/proc/sys/vm/panic_on_unrecoverable_memory_failure
+inject_path=/sys/devices/system/memory/hard_offline_page
+kpageflags_path=/proc/kpageflags
+unpoison_path=/sys/kernel/debug/hwpoison/unpoison-pfn
+
+# /proc/kpageflags bit positions (see include/uapi/linux/kernel-page-flags.h)
+KPF_SLAB=7
+KPF_COMPOUND_TAIL=16
+KPF_HWPOISON=19
+KPF_NOPAGE=20
+KPF_PGTABLE=26
+KPF_RESERVED=32
+
+pagesize=$(getconf PAGE_SIZE)
+
+kind=${1:-rodata}
+
+ksft_print() { echo "# $*"; }
+ksft_exit_skip() { ksft_print "$*"; exit "$ksft_skip"; }
+ksft_exit_fail() { echo "not ok 1 $*"; exit 1; }
+
+if [ "$(id -u)" -ne 0 ]; then
+ ksft_exit_skip "must run as root"
+fi
+
+if [ ! -w "$sysctl_path" ]; then
+ ksft_exit_skip "$sysctl_path not present (kernel without the sysctl?)"
+fi
+
+if [ ! -w "$inject_path" ]; then
+ ksft_exit_skip "$inject_path not present (no MEMORY_HOTPLUG?)"
+fi
+
+if [ "${RUN_DESTRUCTIVE:-0}" != "1" ]; then
+ ksft_exit_skip "destructive test; re-run with RUN_DESTRUCTIVE=1 inside a disposable VM"
+fi
+
+# Pick a PFN inside the kernel image rodata region of /proc/iomem.
+# This is preferred over a top-level "Reserved" entry because top-level
+# Reserved ranges are often firmware holes that have no backing struct
+# page; pfn_to_online_page() returns NULL on those and memory_failure()
+# bails out with -ENXIO before reaching the panic path.
+#
+# "Kernel rodata" is reported as a sub-resource of "System RAM" on every
+# major architecture, which guarantees:
+# - the PFN is backed by struct page (within an online memory range);
+# - PG_reserved is set on the page (kernel image area);
+# - the memory is read-only, so setting PG_hwpoison on it does not
+# corrupt writable kernel state if the panic somehow does not fire.
+#
+# /proc/iomem entries look like (indented for sub-resources):
+# " 02500000-02ffffff : Kernel rodata"
+pick_rodata_phys_addr() {
+ awk -v pagesize="$(getconf PAGE_SIZE)" '
+ # Convert a hex string to a number without relying on the gawk-only
+ # strtonum(). mawk lacks it and would otherwise spuriously skip
+ # this test on distros that ship mawk as /usr/bin/awk.
+ function hex2num(s, n, i, c, v) {
+ n = 0
+ for (i = 1; i <= length(s); i++) {
+ c = tolower(substr(s, i, 1))
+ v = index("0123456789abcdef", c) - 1
+ if (v < 0)
+ return -1
+ n = n * 16 + v
+ }
+ return n
+ }
+ /: Kernel rodata[[:space:]]*$/ {
+ sub(/^[[:space:]]+/, "")
+ n = split($0, a, /[- ]/)
+ start = hex2num(a[1])
+ end = hex2num(a[2])
+ if (end <= start)
+ next
+ # Page-align upward and emit the first byte of that page.
+ pfn = int((start + pagesize - 1) / pagesize)
+ printf "0x%x\n", pfn * pagesize
+ exit 0
+ }
+ ' /proc/iomem
+}
+
+# Walk /proc/kpageflags and return the phys addr of the first PFN that
+# has bit $1 set, with KPF_HWPOISON, KPF_NOPAGE and KPF_COMPOUND_TAIL
+# all clear (so we attack a real, non-tail, not-already-poisoned page).
+#
+# We skip the first 16 MiB of PFNs to step past low-memory special
+# ranges (BIOS/EFI/ACPI/etc.) that often are PG_reserved and would not
+# exhibit the slab/pgtable type we are looking for.
+pick_kpageflags_phys_addr() {
+ local want_bit=$1
+ local pagesize skip_pfn
+
+ [ -r "$kpageflags_path" ] || return
+
+ pagesize=$(getconf PAGE_SIZE)
+ skip_pfn=$(((16 * 1024 * 1024) / pagesize))
+
+ od -An -tx8 -v -w8 -j "$((skip_pfn * 8))" "$kpageflags_path" 2>/dev/null | \
+ awk -v want_bit="$want_bit" \
+ -v hwp_bit="$KPF_HWPOISON" \
+ -v nopage_bit="$KPF_NOPAGE" \
+ -v tail_bit="$KPF_COMPOUND_TAIL" \
+ -v base_pfn="$skip_pfn" \
+ -v pagesize="$pagesize" '
+ # Test whether bit "b" is set in the 16-hex-digit value "hex".
+ # Done with substring + per-digit lookup so we never rely on awk
+ # bitwise operators (mawk lacks them), 64-bit FP precision or the
+ # gawk-only strtonum().
+ function bit_set(hex, b, di, bi, c, v) {
+ di = int(b / 4)
+ bi = b - di * 4
+ c = substr(hex, length(hex) - di, 1)
+ v = index("0123456789abcdef", tolower(c)) - 1
+ if (bi == 0) return (v % 2) == 1
+ if (bi == 1) return int(v / 2) % 2 == 1
+ if (bi == 2) return int(v / 4) % 2 == 1
+ return int(v / 8) % 2 == 1
+ }
+ {
+ gsub(/^[[:space:]]+/, "")
+ h = $1
+ if (bit_set(h, want_bit) &&
+ !bit_set(h, hwp_bit) &&
+ !bit_set(h, nopage_bit) &&
+ !bit_set(h, tail_bit)) {
+ pfn = base_pfn + NR - 1
+ printf "0x%x\n", pfn * pagesize
+ exit 0
+ }
+ }
+ '
+}
+
+# Return 0 if /proc/kpageflags bit $2 is set for PFN $1, 1 if it is
+# clear, or 2 if the word cannot be read. Used to re-confirm the target
+# page type after a non-panicking inject.
+kpageflags_bit_set() {
+ local word
+
+ word=$(od -An -tx8 -v -j "$(($1 * 8))" -N 8 "$kpageflags_path" 2>/dev/null | tr -d '[:space:]')
+ [ -n "$word" ] || return 2
+ (( (16#$word >> $2) & 1 ))
+}
+
+# Best-effort: drop the PG_hwpoison marker set by the inject so a failed
+# run does not leave a poisoned page behind. hard_offline_page() injects
+# with MF_SW_SIMULATED, so the page stays unpoisonable through the
+# hwpoison debugfs interface (needs CONFIG_HWPOISON_INJECT + debugfs).
+try_unpoison() {
+ [ -w "$unpoison_path" ] || return 0
+ echo "$1" > "$unpoison_path" 2>/dev/null || true
+}
+
+case "$kind" in
+rodata)
+ phys_addr=$(pick_rodata_phys_addr)
+ recheck_bit=$KPF_RESERVED
+ missing_msg='no "Kernel rodata" entry in /proc/iomem'
+ ;;
+slab)
+ phys_addr=$(pick_kpageflags_phys_addr "$KPF_SLAB")
+ recheck_bit=$KPF_SLAB
+ missing_msg="no usable slab PFN found in $kpageflags_path"
+ ;;
+pgtable)
+ phys_addr=$(pick_kpageflags_phys_addr "$KPF_PGTABLE")
+ recheck_bit=$KPF_PGTABLE
+ missing_msg="no usable page-table PFN found in $kpageflags_path"
+ ;;
+*)
+ ksft_exit_fail "unknown kind '$kind' (expected: rodata|slab|pgtable)"
+ ;;
+esac
+
+if [ -z "$phys_addr" ]; then
+ ksft_exit_skip "$missing_msg"
+fi
+
+ksft_print "enabling $sysctl_path"
+prior=$(cat "$sysctl_path")
+echo 1 > "$sysctl_path" || ksft_exit_fail "failed to enable sysctl"
+
+pfn=$((phys_addr / pagesize))
+ksft_print "injecting hwpoison at phys 0x$(printf '%x' "$phys_addr") (pfn 0x$(printf '%x' "$pfn"), kind=$kind)"
+ksft_print "expecting kernel panic: 'Memory failure: <pfn>: unrecoverable page'"
+
+# A successful run never returns from the inject -- it panics the kernel.
+# Reaching the code below therefore means no panic fired. Note whether
+# the write itself succeeded, then put the machine back: restore the
+# sysctl and best-effort unpoison the page we just marked.
+if echo "$phys_addr" > "$inject_path"; then
+ verdict="inject returned without panic; sysctl ineffective"
+else
+ verdict="inject failed before reaching the panic path"
+fi
+
+echo "$prior" > "$sysctl_path"
+try_unpoison "$pfn"
+
+# The page type can change between selection and injection (e.g. a slab
+# or page-table page is freed and reused). Only treat a missing panic as
+# a failure if the target PFN is still the kernel-owned type we aimed at;
+# if it raced to another type the run is inconclusive, so skip instead.
+kpageflags_bit_set "$pfn" "$recheck_bit"
+case $? in
+0) ksft_exit_fail "$verdict (page still $kind)" ;;
+1) ksft_exit_skip "target PFN no longer $kind; raced before inject, inconclusive" ;;
+*) ksft_exit_fail "$verdict (could not reconfirm page type via $kpageflags_path)" ;;
+esac
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH v10 0/6] mm/memory-failure: add panic option for unrecoverable pages
From: Breno Leitao @ 2026-06-26 15:33 UTC (permalink / raw)
To: Miaohe Lin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Shuah Khan, Naoya Horiguchi, Jonathan Corbet, Shuah Khan,
Liam R. Howlett, lance.yang, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Liam R. Howlett
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest, Breno Leitao,
linux-trace-kernel, kernel-team
A multi-bit ECC error on a kernel-owned page that the memory failure
handler cannot recover is currently swallowed: PG_hwpoison is set, the
event is logged, and the kernel keeps running. The corrupted memory
remains accessible to the kernel and either drives silent data
corruption or surfaces seconds-to-minutes later as an apparently
unrelated crash. In a large fleet that delayed, unattributable crash
turns into significant engineering effort to root-cause; in a kdump
configuration, by the time the crash happens the original error
context (faulting PFN, MCE/GHES record, page state) is long gone.
This series adds an opt-in sysctl,
vm.panic_on_unrecoverable_memory_failure, that converts an
unrecoverable kernel-page hwpoison event into an immediate panic with
a clean dmesg/vmcore that still contains the original failure
context. The default is disabled so existing workloads see no
change.
There is a selftest that test different cases, and I tested it using
the following variants:
┌─────────┬──────────┬───────────────────────────────────────────────────────────┐
│ Variant │ PFN │ Result │
├─────────┼──────────┼───────────────────────────────────────────────────────────┤
│ rodata │ 0x2600 │ Panic with "Memory failure: 0x2600: unrecoverable page" │
├─────────┼──────────┼───────────────────────────────────────────────────────────┤
│ slab │ 0x100032 │ Panic with "Memory failure: 0x100032: unrecoverable page" │
├─────────┼──────────┼───────────────────────────────────────────────────────────┤
│ pgtable │ 0x100000 │ Panic with "Memory failure: 0x100000: unrecoverable page" │
└─────────┴──────────┴───────────────────────────────────────────────────────────┘
Each one shows the same call trace, exactly the path the series builds:
hard_offline_page_store
→ memory_failure
→ action_result
→ panic("Memory failure: %#lx: unrecoverable page")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v10:
- EDITME: describe what is new in this series revision.
- EDITME: use bulletpoints and terse descriptions.
- Link to v9: https://lore.kernel.org/r/20260609-ecc_panic-v9-0-432a74002e74@debian.org
Changes in v9:
- HWPoisonKernelOwned(): wrap the head-page checks in a
compound_head() recheck loop so a concurrent split or compound free
cannot leave us trusting a stale view (Miaohe, Lance, David).
- selftest: drop the gawk-only strtonum() in hwpoison-panic.sh; do the
hex parsing with a small index()-based helper so the test no longer
spuriously skips itself on mawk-based distros (Sashiko).
- selftest: move hwpoison-panic.sh from TEST_FILES to
TEST_PROGS_EXTENDED so the script is installed executable rather
than as a non-executable data file (Sashiko).
- Link to v8: https://patch.msgid.link/20260527-ecc_panic-v8-0-9ea0cfa16bb0@debian.org
Changes in v8:
- Commit message rewording (David)
- Add HWPoisonKernelOwned() helper (Lance)
- Removed patch "mm/memory-failure: short-circuit PG_reserved before get_hwpoison_page()"
- Broaden the selftest (Lance)
- Link to v7: https://patch.msgid.link/20260513-ecc_panic-v7-0-be2e578e61da@debian.org
Changes in v7:
- Move the PG_reserved / unhandlable-kernel-page classification into
get_any_page() and surface it via -ENOTRECOVERABLE, per David
Hildenbrand's and Lance Yang's review of v6. This drops the
is_reserved snapshot in memory_failure() and the mf_get_page_status
enum / out-parameter introduced in v6.
- Restructure the post-call branch in memory_failure() as a switch
over the get_hwpoison_page() return code (David).
- Drop the "reserved" qualifier from the MF_MSG_KERNEL label and the
matching tracepoint string; the enum now covers both PG_reserved
pages and other unhandlable kernel pages.
- Squash the former patches 1/4 ("MF_MSG_KERNEL for reserved pages")
and 2/4 ("classify get_any_page() failures by reason") into a
single classification patch; the series is now 3 patches.
- Simplify panic_on_unrecoverable_mf() to a single return statement
(David).
- Link to v6: https://patch.msgid.link/20260511-ecc_panic-v6-0-183012ba7d4b@debian.org
Changes in v6:
- Dropped the selftest given the value was not clear
- Get the status of the failure from get_any_page()
- Small nits from different people/AIs.
- Link to v5: https://patch.msgid.link/20260424-ecc_panic-v5-0-a35f4b50425c@debian.org
Changes in v5:
- Add vm.panic_on_unrecoverable_memory_failure sysctl to panic on
unrecoverable kernel page hwpoison events (reserved pages, refcount-0
non-buddy pages, unknown state), with a recheck to avoid racing with
concurrent buddy allocations. (Miaohe)
- Distinguish reserved pages as MF_MSG_KERNEL in memory_failure(),
document the new sysctl in Documentation/admin-guide/sysctl/vm.rst,
and add a selftest verifying SIGBUS recovery on userspace pages still
works when the sysctl is enabled. (Miaohe)
- Added a selftest
- Link to v4:
https://patch.msgid.link/20260415-ecc_panic-v4-0-2d0277f8f601@debian.org
Changes in v4:
- Drop CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option.
- Split the reserved page classification (MF_MSG_KERNEL) into its own
patch, separate from the panic mechanism.
- Document why the buddy allocator TOCTOU race (between
get_hwpoison_page() and is_free_buddy_page()) cannot cause false
positives: PG_hwpoison is set beforehand and check_new_page() in the
page allocator rejects hwpoisoned pages.
- Document the narrow LRU isolation race window for MF_MSG_UNKNOWN and
its mitigation via identify_page_state()'s two-pass design.
- Explicitly document why MF_MSG_GET_HWPOISON is excluded from the
panic conditions (shared path with transient races and non-reserved
kernel memory).
- Link to v3: https://patch.msgid.link/20260413-ecc_panic-v3-0-1dcbb2f12bc4@debian.org
Changes in v3:
- Rename is_unrecoverable_memory_failure() to panic_on_unrecoverable_mf()
as suggested by maintainer.
- Add CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option,
similar to CONFIG_BOOTPARAM_HARDLOCKUP_PANIC.
- Add documentation for the sysctl and CONFIG option.
- Add code comments documenting the panic condition design rationale and
how the retry mechanism mitigates false positives from buddy allocator
races.
- Link to v2: https://patch.msgid.link/20260331-ecc_panic-v2-0-9e40d0f64f7a@debian.org
Changes in v2:
- Panic on MF_MSG_KERNEL, MF_MSG_KERNEL_HIGH_ORDER and MF_MSG_UNKNOWN
instead of MF_MSG_GET_HWPOISON.
- Report MF_MSG_KERNEL for reserved pages when get_hwpoison_page() fails
instead of MF_MSG_GET_HWPOISON.
- Link to v1: https://patch.msgid.link/20260323-ecc_panic-v1-0-72a1921726c5@debian.org
To: Miaohe Lin <linmiaohe@huawei.com>
To: Naoya Horiguchi <nao.horiguchi@gmail.com>
To: Andrew Morton <akpm@linux-foundation.org>
To: Steven Rostedt <rostedt@goodmis.org>
To: Masami Hiramatsu <mhiramat@kernel.org>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Jonathan Corbet <corbet@lwn.net>
To: Shuah Khan <skhan@linuxfoundation.org>
To: David Hildenbrand <david@kernel.org>
To: Lorenzo Stoakes <ljs@kernel.org>
To: "Liam R. Howlett" <liam@infradead.org>
To: Vlastimil Babka <vbabka@kernel.org>
To: Mike Rapoport <rppt@kernel.org>
To: Suren Baghdasaryan <surenb@google.com>
To: Michal Hocko <mhocko@suse.com>
To: Shuah Khan <shuah@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
---
Breno Leitao (6):
mm/memory-failure: drop dead error_states[] entry for reserved pages
mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages
mm/memory-failure: add panic option for unrecoverable pages
Documentation: document panic_on_unrecoverable_memory_failure sysctl
selftests/mm: add hwpoison-panic destructive test
Documentation/admin-guide/sysctl/vm.rst | 80 +++++++++
mm/memory-failure.c | 104 +++++++++--
tools/testing/selftests/mm/Makefile | 4 +
tools/testing/selftests/mm/hwpoison-panic.sh | 249 +++++++++++++++++++++++++++
4 files changed, 419 insertions(+), 18 deletions(-)
---
base-commit: 30ffa8de54e5cc80d93fd211ca134d1764a7011f
change-id: 20260323-ecc_panic-4e473b83087c
Best regards,
--
Breno Leitao <leitao@debian.org>
^ permalink raw reply
* Re: [PATCH] tracing: eprobe: read the complete FILTER_PTR_STRING pointer
From: Masami Hiramatsu @ 2026-06-26 15:45 UTC (permalink / raw)
To: Steven Rostedt
Cc: Martin Kaiser, Masami Hiramatsu (Google), linux-trace-kernel,
linux-kernel
In-Reply-To: <20260626064223.60ffeeaa@fedora>
On Fri, 26 Jun 2026 06:42:23 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Fri, 26 Jun 2026 12:20:36 +0200
> Martin Kaiser <martin@kaiser.cx> wrote:
>
> > > That is, to have +u0() say "this is going to be dereferencing user space".
> >
> > > I'll add Martin's patch and see if it makes the above work.
> >
> > I've just tried your command with my patch. It works for me, filenames are
> > logged correctly.
>
> Yep, this definitely looks like a fix. We have;
>
> addr = rec + field->offset;
>
> Where addr points to the location of the field on the ring buffer, thus
> your change to make it:
>
> val = *(unsigned long *)addr;
>
> Reads the full "long size" of the event on the ring buffer, instead of
> reading just one byte. It is "val" that gets dereferenced later by the
> probe logic (the "+0u()"), which has all the protections we need.
>
> I'll queue this up.
I've already queued this on my probes/core branch.
(which will be probes/fixes)
Thanks,
>
> Thanks!
>
> -- Steve
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH v10 0/6] mm/memory-failure: add panic option for unrecoverable pages
From: Andrew Morton @ 2026-06-26 16:27 UTC (permalink / raw)
To: Breno Leitao
Cc: Miaohe Lin, David Hildenbrand, Lorenzo Stoakes, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
Naoya Horiguchi, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
lance.yang, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team
In-Reply-To: <20260626-ecc_panic-v10-0-6dacb8ad024d@debian.org>
On Fri, 26 Jun 2026 08:33:14 -0700 Breno Leitao <leitao@debian.org> wrote:
> A multi-bit ECC error on a kernel-owned page that the memory failure
> handler cannot recover is currently swallowed: PG_hwpoison is set, the
> event is logged, and the kernel keeps running. The corrupted memory
> remains accessible to the kernel and either drives silent data
> corruption or surfaces seconds-to-minutes later as an apparently
> unrelated crash. In a large fleet that delayed, unattributable crash
> turns into significant engineering effort to root-cause; in a kdump
> configuration, by the time the crash happens the original error
> context (faulting PFN, MCE/GHES record, page state) is long gone.
>
> This series adds an opt-in sysctl,
> vm.panic_on_unrecoverable_memory_failure, that converts an
> unrecoverable kernel-page hwpoison event into an immediate panic with
> a clean dmesg/vmcore that still contains the original failure
> context. The default is disabled so existing workloads see no
> change.
Cool, thanks. I added this to mm.git's mm-new branch. Next week I'll
move it into the mm-unstable branch, where it will receive linux-next
exposure.
Sashiko identified a few possible things, some pre-existing:
https://sashiko.dev/#/patchset/20260626-ecc_panic-v10-0-6dacb8ad024d@debian.org
^ permalink raw reply
* Re: [PATCH v4 2/2] tracing: Remove trace_printk.h from kernel.h
From: Nathan Chancellor @ 2026-06-26 19:03 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Linus Torvalds,
Sebastian Andrzej Siewior, John Ogness, Thomas Gleixner,
Peter Zijlstra, Julia Lawall, Yury Norov, linux-doc, linux-kbuild,
linuxppc-dev, dri-devel, linux-stm32, linux-arm-kernel,
linux-rdma, linux-usb, linux-ext4, linux-nfs, kvm, intel-gfx
In-Reply-To: <20260626045119.659d1e6b@fedora>
On Fri, Jun 26, 2026 at 04:51:19AM -0400, Steven Rostedt wrote:
> On Thu, 25 Jun 2026 16:41:58 -0700
> Nathan Chancellor <nathan@kernel.org> wrote:
>
>
> > The following diff resolves it for me, should I send it as a separate
> > patch or do you want to just fold it in with a note?
> >
> > diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
> > index 621566345406..2301a701ffbb 100644
> > --- a/include/linux/lockdep.h
> > +++ b/include/linux/lockdep.h
> > @@ -10,6 +10,7 @@
> > #ifndef __LINUX_LOCKDEP_H
> > #define __LINUX_LOCKDEP_H
> >
> > +#include <linux/instruction_pointer.h>
>
> Ah, so the reason for this breakage is because lockdep was relying on
> instruction_pointer.h, that just happened to be included in kernel.h
> via trace_printk.h.
Correct.
> This is a separate issue, so it should be a separate patch. I'll add it
> as patch 1 of this series.
Sounds good, thanks!
> Can you send me the config you used. This didn't trigger in my tests.
It is a plain allmodconfig, for example on arm:
$ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- allmodconfig lib/test_context-analysis.o
In file included from include/linux/local_lock_internal.h:8,
from include/linux/local_lock.h:5,
from lib/test_context-analysis.c:9:
include/linux/local_lock_internal.h: In function 'local_lock_acquire':
include/linux/lockdep.h:541:87: error: '_THIS_IP_' undeclared (first use in this function)
541 | #define lock_map_acquire(l) lock_acquire_exclusive(l, 0, 0, NULL, _THIS_IP_)
| ^~~~~~~~~
include/linux/lockdep.h:509:88: note: in definition of macro 'lock_acquire_exclusive'
509 | #define lock_acquire_exclusive(l, s, t, n, i) lock_acquire(l, s, t, 0, 1, n, i)
| ^
include/linux/local_lock_internal.h:46:9: note: in expansion of macro 'lock_map_acquire'
46 | lock_map_acquire(&l->dep_map);
| ^~~~~~~~~~~~~~~~
include/linux/lockdep.h:541:87: note: each undeclared identifier is reported only once for each function it appears in
...
I also reproduced it on top of allnoconfig:
$ cat allno.config
CONFIG_CONTEXT_ANALYSIS_TEST=y
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_EXPERT=y
CONFIG_MMU=y
CONFIG_RUNTIME_TESTING_MENU=y
$ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- KCONFIG_ALLCONFIG=1 clean allnoconfig lib/test_context-analysis.o
<same error as above>
--
Cheers,
Nathan
^ permalink raw reply
* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default\
From: Sean Christopherson @ 2026-06-26 19:06 UTC (permalink / raw)
To: Yan Zhao
Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <aj3H2sxymOYTWTnE@yzhao56-desk.sh.intel.com>
On Fri, Jun 26, 2026, Yan Zhao wrote:
> On Thu, Jun 25, 2026 at 07:36:28AM -0700, Sean Christopherson wrote:
> > On Thu, Jun 25, 2026, Yan Zhao wrote:
> > And I'm not remotely convinced that prepending allow_ to the param will help
> > end users diagnose "unexpected" memory consumption, in quotes because anyone that
> > is deploying a stack that utilizes out-of-place conversion absolutely needs to
> > understand and plan for the additional memory consumption. I.e. if the memory
> > consumption is "unexpected" to the end user, they likely have far bigger problems.
> My first impression of gmem_in_place_conversion=true was that it enforces gmem
> in-place conversion. However, it actually only enforces per-gmem private/shared
> attribute.
> My worry was that people might think it's a kernel bug if userspace can still
> have shared memory from other sources after they configured
> gmem_in_place_conversion=true.
Ah, I see where you're coming from. FWIW, truly enforcing in-place conversion
is flat out impossible. E.g. userspace can simply replace the memslot, at which
point the memory effectively reverts to shared.
> However, I have no strong opinion if you think gmem_in_place_conversion is good,
> and with the above documentation. :)
Ya, I think this largely a documentation problem. I agree that a param name
like gmem_private_memory_attributes would be more precise, but I think it'd be
far less informative for the vast majority of users that only care whether or
not KVM can do in-place conversion, and don't care about how that is done.
^ permalink raw reply
* [PATCH rfc 0/2] Improvements to ftrace comm[] handling
From: David Laight @ 2026-06-26 21:23 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Michal Koutný
Cc: David Laight
RFC because they are untested and I want to send them before going
on holiday for 2 weeks (should still have email access).
The first patch avoids a lot of 'potentially unsized' string
functions by embedding the char[] use to hold task->comm[] in a
structure.
The second adjsuts the data structure used to cache the task
names in the thread switch code.
David Laight (2):
tracing: Embed 'char comm[16]' in a structure
tracing: Keep pid and comm[] in the same structure
kernel/trace/blktrace.c | 28 +++----
kernel/trace/trace.c | 3 +-
kernel/trace/trace.h | 9 ++-
kernel/trace/trace_events_filter.c | 2 +-
kernel/trace/trace_events_hist.c | 26 +++---
kernel/trace/trace_functions_graph.c | 10 +--
kernel/trace/trace_output.c | 24 +++---
kernel/trace/trace_sched_switch.c | 113 ++++++++++++---------------
8 files changed, 101 insertions(+), 114 deletions(-)
--
2.39.5
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox