* [PATCH v10 8/9] tracing/probes: Add this_cpu_read() and this_cpu_ptr() dereference method to fetcharg
From: Masami Hiramatsu (Google) @ 2026-06-26 2:11 UTC
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178243982430.790911.17439694390021542101.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When tracing kernel local variables, we sometimes also need to read
per-CPU variables. The existing simple dereference is not enough to
access them.
Thus, introduce a special this_cpu_read() dereference that reads a per-cpu
variable on the current CPU (reading another CPU's copy may race with
updates on that CPU). Also add this_cpu_ptr(), which returns the address
of the per-cpu variable on the current CPU.
Both work the same way as the kernel percpu macros of the same names.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v10:
- Prohibit this_cpu_*() for eprobe events.
Changes in v9:
- Prohibit this_cpu_*() for non kernel probes.
Changes in v6:
- Rebased on dump fetcharg patch.
- Fix to fetch static percpu variable with @SYM correctly.
Changes in v5:
- Simplify this_cpu_read() into +0(this_cpu_ptr()).
Changes in v3:
- Remove NULL check for percpu var because it is just an offset, could be 0.
- Simplify process_fetch_insn_bottom() code.
- If the last operation is this_cpu_read(), read only memory of the specific
size (of type).
Changes in v2:
- Drop +CPU/+PCPU and introduce this_cpu_read() and this_cpu_ptr().
- Support these methods with BTF typecast.
- Just check the base address is NOT NULL instead of is_kernel_percpu_address().
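
For illustration, the new fetchargs could appear in a probe definition as
sketched below. This is a minimal sketch, not part of the patch:
percpu_probe_def() is a hypothetical helper, and the probed function and
typecast are borrowed from the selftest in patch 9/9, so they assume the
trace-events-sample module is loaded. The script only prints the
definitions and does not touch tracefs.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch): build an fprobe definition
# that wraps a fetcharg in this_cpu_read() or this_cpu_ptr().
# $1 = [group/]event name, $2 = probed function,
# $3 = "read" or "ptr",    $4 = fetcharg naming the per-CPU variable
percpu_probe_def() {
	case "$3" in
	read) printf 'f:%s %s val=this_cpu_read(%s)\n' "$1" "$2" "$4" ;;
	ptr)  printf 'f:%s %s ptr=this_cpu_ptr(%s)\n' "$1" "$2" "$4" ;;
	*)    return 1 ;;
	esac
}

# Print the definitions instead of installing them.
percpu_probe_def mysample/count sample_timer_cb read '(foo_timer_data,timer)t->counter'
percpu_probe_def mysample/ptr sample_timer_cb ptr '(foo_timer_data,timer)t->counter'
```

On a kernel with this series applied, appending the printed lines to
dynamic_events (as the selftest does) should install the probes.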
---
Documentation/trace/eprobetrace.rst | 2
Documentation/trace/fprobetrace.rst | 2
Documentation/trace/kprobetrace.rst | 2
kernel/trace/trace.c | 1
kernel/trace/trace_probe.c | 152 ++++++++++++++++++++++++++---------
kernel/trace/trace_probe.h | 6 +
kernel/trace/trace_probe_tmpl.h | 22 ++++-
7 files changed, 141 insertions(+), 46 deletions(-)
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index 680e0af43d5d..279396951b34 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -39,6 +39,8 @@ Synopsis of eprobe_events
@SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
$comm : Fetch current task comm.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+ this_cpu_read(FETCHARG) : Read the value of the per-CPU variable FETCHARG on the current CPU.
+ this_cpu_ptr(FETCHARG) : Get the address of the per-CPU variable FETCHARG on the current CPU.
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 3392cab016b3..3439bc9bd351 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -52,6 +52,8 @@ Synopsis of fprobe-events
$comm : Fetch current task comm.
$current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5)
+ this_cpu_read(FETCHARG) : Read the value of the per-CPU variable FETCHARG on the current CPU.
+ this_cpu_ptr(FETCHARG) : Get the address of the per-CPU variable FETCHARG on the current CPU.
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 81e4fe38791d..9ae330eb0a52 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -55,6 +55,8 @@ Synopsis of kprobe_events
$comm : Fetch current task comm.
$current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+ this_cpu_read(FETCHARG) : Read the value of the per-CPU variable FETCHARG on the current CPU.
+ this_cpu_ptr(FETCHARG) : Get the address of the per-CPU variable FETCHARG on the current CPU.
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2b0b4f9acb2e..c9e182d40059 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4329,6 +4329,7 @@ static const char readme_msg[] =
"\t $stack<index>, $stack, $retval, $comm, $current\n"
#endif
"\t +|-[u]<offset>(<fetcharg>), \\imm-value, \\\"imm-string\"\n"
+ "\t this_cpu_read(<fetcharg>), this_cpu_ptr(<fetcharg>)\n"
"\t kernel return probes support: $retval, $arg<N>, $comm\n"
"\t type: s8/16/32/64, u8/16/32/64, x8/16/32/64, char, string, symbol,\n"
"\t b<bit-width>@<bit-offset>/<container-size>, ustring,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index eb58b70ae082..0bd02bc0ee0f 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -345,6 +345,109 @@ static int parse_trace_event(char *arg, struct fetch_insn *code,
return -EINVAL;
}
+/* this_cpu_* parser */
+#define THIS_CPU_PTR_PREFIX "this_cpu_ptr("
+#define THIS_CPU_READ_PREFIX "this_cpu_read("
+#define THIS_CPU_PTR_LEN (sizeof(THIS_CPU_PTR_PREFIX) - 1)
+#define THIS_CPU_READ_LEN (sizeof(THIS_CPU_READ_PREFIX) - 1)
+
+static int
+parse_probe_arg(char *arg, const struct fetch_type *type,
+ struct fetch_insn **pcode, struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx);
+
+/* handle dereference nested call */
+static inline int handle_dereference(char *arg, struct fetch_insn **pcode,
+ struct fetch_insn *end, struct traceprobe_parse_context *ctx,
+ int deref, long offset)
+{
+ const struct fetch_type *type = find_fetch_type(NULL, ctx->flags);
+ struct fetch_insn *code = *pcode;
+ int cur_offs = ctx->offset;
+ char *tmp;
+ int ret;
+
+ tmp = strrchr(arg, ')');
+ if (!tmp) {
+ trace_probe_log_err(ctx->offset + strlen(arg),
+ DEREF_OPEN_BRACE);
+ return -EINVAL;
+ }
+
+ *tmp = '\0';
+ ret = parse_probe_arg(arg, type, &code, end, ctx);
+ if (ret)
+ return ret;
+ ctx->offset = cur_offs;
+ if (code->op == FETCH_OP_COMM || code->op == FETCH_OP_IMMSTR) {
+ trace_probe_log_err(ctx->offset, COMM_CANT_DEREF);
+ return -EINVAL;
+ }
+
+ /*
+ * this_cpu_ptr(@SYM) uses the address of SYM, not its value, so
+ * overwrite the last FETCH_OP_DEREF with FETCH_OP_CPU_PTR.
+ */
+ if (!(deref == FETCH_OP_CPU_PTR && *arg == '@')) {
+ code++;
+ if (code == end) {
+ trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
+ return -EINVAL;
+ }
+ }
+ *pcode = code;
+
+ code->op = deref;
+ code->offset = offset;
+ /* Reset the last type if used */
+ ctx->last_type = NULL;
+ return 0;
+}
+
+static int parse_this_cpu(char *arg, struct fetch_insn **pcode,
+ struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx)
+{
+ struct fetch_insn *code;
+ bool is_ptr = false;
+ int ret;
+
+ /*
+ * This is only for kernel probes, excluding eprobe, because per-cpu
+ * pointer should not be recorded by events.
+ */
+ if (!(ctx->flags & TPARG_FL_KERNEL) ||
+ (ctx->flags & TPARG_FL_TEVENT)) {
+ trace_probe_log_err(ctx->offset, NOSUP_PERCPU);
+ return -EINVAL;
+ }
+ if (str_has_prefix(arg, THIS_CPU_PTR_PREFIX)) {
+ arg += THIS_CPU_PTR_LEN;
+ ctx->offset += THIS_CPU_PTR_LEN;
+ is_ptr = true;
+ } else if (str_has_prefix(arg, THIS_CPU_READ_PREFIX)) {
+ arg += THIS_CPU_READ_LEN;
+ ctx->offset += THIS_CPU_READ_LEN;
+ } else
+ return -EINVAL;
+
+ ret = handle_dereference(arg, pcode, end, ctx, FETCH_OP_CPU_PTR, 0);
+ if (ret || is_ptr)
+ return ret;
+
+ /* this_cpu_read(VAR) -> +0(this_cpu_ptr(VAR)) */
+ code = *pcode;
+ code++;
+ if (code == end) {
+ trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
+ return -EINVAL;
+ }
+ code->op = FETCH_OP_DEREF;
+ code->offset = 0;
+ *pcode = code;
+ return 0;
+}
+
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
static u32 btf_type_int(const struct btf_type *t)
@@ -904,11 +1007,6 @@ static char *find_matched_close_paren(char *s)
return NULL;
}
-static int
-parse_probe_arg(char *arg, const struct fetch_type *type,
- struct fetch_insn **pcode, struct fetch_insn *end,
- struct traceprobe_parse_context *ctx);
-
static int handle_typecast(char *arg, struct fetch_insn **pcode,
struct fetch_insn *end,
struct traceprobe_parse_context *ctx)
@@ -961,7 +1059,9 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
/* Skip '(' */
ctx->offset += 1;
tmp++;
- } else if (*tmp == '+' || *tmp == '-') {
+ } else if (*tmp == '+' || *tmp == '-' ||
+ str_has_prefix(tmp, THIS_CPU_PTR_PREFIX) ||
+ str_has_prefix(tmp, THIS_CPU_READ_PREFIX)) {
/* Dereference can have another field access inside it. */
char *open = strchr(tmp + 1, '(');
@@ -1481,36 +1581,9 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
}
ctx->offset += (tmp + 1 - arg) + (arg[0] != '-' ? 1 : 0);
arg = tmp + 1;
- tmp = strrchr(arg, ')');
- if (!tmp) {
- trace_probe_log_err(ctx->offset + strlen(arg),
- DEREF_OPEN_BRACE);
- return -EINVAL;
- } else {
- const struct fetch_type *t2 = find_fetch_type(NULL, ctx->flags);
- int cur_offs = ctx->offset;
-
- *tmp = '\0';
- ret = parse_probe_arg(arg, t2, &code, end, ctx);
- if (ret)
- break;
- ctx->offset = cur_offs;
- if (code->op == FETCH_OP_COMM ||
- code->op == FETCH_OP_IMMSTR) {
- trace_probe_log_err(ctx->offset, COMM_CANT_DEREF);
- return -EINVAL;
- }
- if (++code == end) {
- trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
- return -EINVAL;
- }
- *pcode = code;
-
- code->op = deref;
- code->offset = offset;
- /* Reset the last type if used */
- ctx->last_type = NULL;
- }
+ ret = handle_dereference(arg, pcode, end, ctx, deref, offset);
+ if (ret < 0)
+ return ret;
break;
case '\\': /* Immediate value */
if (arg[1] == '"') { /* Immediate string */
@@ -1531,7 +1604,10 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
ret = handle_typecast(arg, pcode, end, ctx);
break;
default:
- if (isalpha(arg[0]) || arg[0] == '_') {
+ if (str_has_prefix(arg, THIS_CPU_PTR_PREFIX) ||
+ str_has_prefix(arg, THIS_CPU_READ_PREFIX)) {
+ ret = parse_this_cpu(arg, pcode, end, ctx);
+ } else if (isalpha(arg[0]) || arg[0] == '_') {
/* BTF variable or event field*/
if (ctx->flags & TPARG_FL_TEVENT) {
ret = parse_trace_event(arg, *pcode, ctx);
@@ -1548,8 +1624,8 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
return -EINVAL;
}
ret = parse_btf_arg(arg, pcode, end, ctx);
- break;
}
+ break;
}
if (!ret && code->op == FETCH_OP_NOP) {
/* Parsed, but do not find fetch method */
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 053f72fdaece..e6268a8dc378 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -101,6 +101,7 @@ typedef int (*print_type_func_t)(struct trace_seq *, void *, void *);
/* Stage 2 (dereference) ops */ \
FETCH_OP(DEREF, offset), /* Dereference: .offset */ \
FETCH_OP(UDEREF, offset), /* User-space dereference: .offset */\
+ FETCH_OP(CPU_PTR, none), /* Per-CPU pointer: .offset */ \
/* Stage 3 (store) ops */ \
FETCH_OP(ST_RAW, store), /* Raw value: .size */ \
FETCH_OP(ST_MEM, store), /* Memory: .offset, .size */ \
@@ -596,9 +597,10 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"), \
C(TYPECAST_REQ_FIELD, "Typecast requires a field access"), \
C(TOO_MANY_NESTED, "Too many nested typecasts/dereferences"), \
- C(TYPECAST_SYM_OFFSET, "@SYM+/-OFFSET with typecast needs parentheses") \
+ C(TYPECAST_SYM_OFFSET, "@SYM+/-OFFSET with typecast needs parentheses"), \
C(TYPECAST_NOT_ALIGNED, "Typecast field option is not byte-aligned"), \
- C(TYPECAST_BAD_ARROW, "Typecast field option does not support -> operator"),
+ C(TYPECAST_BAD_ARROW, "Typecast field option does not support -> operator"), \
+ C(NOSUP_PERCPU, "Per-cpu variable access is only for kernel probes"),
#undef C
#define C(a, b) TP_ERR_##a
diff --git a/kernel/trace/trace_probe_tmpl.h b/kernel/trace/trace_probe_tmpl.h
index d0e9662cde00..8db12f758fda 100644
--- a/kernel/trace/trace_probe_tmpl.h
+++ b/kernel/trace/trace_probe_tmpl.h
@@ -129,25 +129,35 @@ process_fetch_insn_bottom(struct fetch_insn *code, unsigned long val,
struct fetch_insn *s3 = NULL;
int total = 0, ret = 0, i = 0;
u32 loc = 0;
- unsigned long lval = val;
+ unsigned long lval, llval = val;
stage2:
/* 2nd stage: dereference memory if needed */
do {
- if (code->op == FETCH_OP_DEREF) {
- lval = val;
+ lval = val;
+ switch (code->op) {
+ case FETCH_OP_DEREF:
ret = probe_mem_read(&val, (void *)val + code->offset,
sizeof(val));
- } else if (code->op == FETCH_OP_UDEREF) {
- lval = val;
+ break;
+ case FETCH_OP_UDEREF:
ret = probe_mem_read_user(&val,
(void *)val + code->offset, sizeof(val));
- } else
break;
+ case FETCH_OP_CPU_PTR:
+ val = (unsigned long)this_cpu_ptr((void __percpu *)val);
+ ret = 0;
+ break;
+ default:
+ lval = llval;
+ goto out;
+ }
if (ret)
return ret;
+ llval = lval;
code++;
} while (1);
+out:
s3 = code;
stage3:
* [PATCH v10 9/9] tracing/probes: Add a new testcase for BTF typecasts
From: Masami Hiramatsu (Google) @ 2026-06-26 2:11 UTC
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178243982430.790911.17439694390021542101.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
With the introduction of container_of-style BTF typecasting and
per-CPU variable access support in trace probes, we need a way to
verify their functionality and prevent regressions.
Add a new ftrace kselftest and update the trace event sample module
to test and validate these features.
Specifically, update the trace-events-sample module to set up a
periodic timer whose callback accesses a per-CPU counter. Introduce
a new sample trace event, foo_timer_fn, to trace this callback
and log the current counter value.
Then, add a new test case, btf_probe_event.tc, which defines a
dynamic probe on the timer callback. The probe uses BTF typecasting
to recover the parent structure from the timer argument and
this_cpu_read() to fetch the per-CPU counter. The test verifies
the integrity of the implementation by ensuring the values
recorded by the dynamic probe match those from the static tracepoint.
Assisted-by: Antigravity:gemini-3.5-flash
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v10:
- Add a check for $current and this_cpu_* for eprobe
Changes in v9:
- Add a testcase for checking new syntax.
Changes in v8:
- Add more test cases.
Changes in v6:
- Update testcase according to changes.
Changes in v5:
- Add more syntax test cases.
Changes in v4:
- Fix uprobe $current test.
Changes in v3:
- Add syntax test case.
- Update testcase to use this_cpu_read()
Changes in v2:
- Use timer_shutdown_sync() instead of timer_delete_sync() for teardown.
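
The comparison step described above can be sketched in isolation as
follows. This is only a sketch under stated assumptions: extract_field()
and events_match() are hypothetical helpers, and the two sample lines are
made up for illustration, modeled on the patterns that btf_probe_event.tc
greps for rather than copied from a real trace.

```shell
#!/bin/sh
# Hypothetical helper: print the value of "<field>=" from one trace line.
# $1 = field name, $2 = trace line
extract_field() {
	printf '%s\n' "$2" | sed -n "s/.*$1=\([^ ]*\).*/\1/p"
}

# Hypothetical helper: succeed if the dynamic-probe line ($2) carries the
# same name and count as the static tracepoint line ($1).
events_match() {
	name=$(extract_field name "$1")
	count=$(extract_field count "$1")
	case "$2" in
	*"name=\"$name\" count=$count") return 0 ;;
	*) return 1 ;;
	esac
}

# Illustrative lines only; real trace output carries more columns.
static_line='foo_timer_fn: name=sample_timer_counter count=3'
dynamic_line='myevent: (sample_timer_cb+0x0/0x40) name="sample_timer_counter" count=3'

if events_match "$static_line" "$dynamic_line"; then
	echo "match"
else
	echo "mismatch"
fi
```

Unlike the grep in the test case, events_match() anchors the count at the
end of the line, so a count of 3 does not also match 30.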
---
samples/trace_events/trace-events-sample.c | 40 +++++++
samples/trace_events/trace-events-sample.h | 34 ++++++
.../ftrace/test.d/dynevent/btf_probe_event.tc | 51 ++++++++++
.../test.d/dynevent/btf_typecast_accepted.tc | 107 ++++++++++++++++++++
.../test.d/dynevent/eprobes_syntax_errors.tc | 9 ++
.../ftrace/test.d/dynevent/fprobe_syntax_errors.tc | 12 ++
.../ftrace/test.d/kprobe/kprobe_syntax_errors.tc | 12 ++
.../ftrace/test.d/kprobe/uprobe_syntax_errors.tc | 5 +
8 files changed, 265 insertions(+), 5 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_typecast_accepted.tc
diff --git a/samples/trace_events/trace-events-sample.c b/samples/trace_events/trace-events-sample.c
index 0b7a6efdb247..ca5d98c360cb 100644
--- a/samples/trace_events/trace-events-sample.c
+++ b/samples/trace_events/trace-events-sample.c
@@ -94,6 +94,20 @@ static int simple_thread_fn(void *arg)
static DEFINE_MUTEX(thread_mutex);
static int simple_thread_cnt;
+static struct foo_timer_data *foo_timer_data;
+
+static void sample_timer_cb(struct timer_list *t)
+{
+ struct foo_timer_data *data = container_of(t, struct foo_timer_data, timer);
+
+ get_cpu();
+ trace_foo_timer_fn(data);
+ (*this_cpu_ptr(data->counter))++;
+ put_cpu();
+
+ mod_timer(t, jiffies + HZ);
+}
+
int foo_bar_reg(void)
{
mutex_lock(&thread_mutex);
@@ -132,9 +146,27 @@ void foo_bar_unreg(void)
static int __init trace_event_init(void)
{
+ foo_timer_data = kzalloc_obj(*foo_timer_data, GFP_KERNEL);
+ if (!foo_timer_data)
+ return -ENOMEM;
+
+ foo_timer_data->name = "sample_timer_counter";
+ foo_timer_data->counter = alloc_percpu(int);
+ if (!foo_timer_data->counter) {
+ kfree(foo_timer_data);
+ return -ENOMEM;
+ }
+
+ timer_setup(&foo_timer_data->timer, sample_timer_cb, 0);
+ mod_timer(&foo_timer_data->timer, jiffies + HZ);
+
simple_tsk = kthread_run(simple_thread, NULL, "event-sample");
- if (IS_ERR(simple_tsk))
- return -1;
+ if (IS_ERR(simple_tsk)) {
+ timer_shutdown_sync(&foo_timer_data->timer);
+ free_percpu(foo_timer_data->counter);
+ kfree(foo_timer_data);
+ return PTR_ERR(simple_tsk);
+ }
return 0;
}
@@ -147,6 +179,10 @@ static void __exit trace_event_exit(void)
kthread_stop(simple_tsk_fn);
simple_tsk_fn = NULL;
mutex_unlock(&thread_mutex);
+
+ timer_shutdown_sync(&foo_timer_data->timer);
+ free_percpu(foo_timer_data->counter);
+ kfree(foo_timer_data);
}
module_init(trace_event_init);
diff --git a/samples/trace_events/trace-events-sample.h b/samples/trace_events/trace-events-sample.h
index 1a05fc153353..816848a456a2 100644
--- a/samples/trace_events/trace-events-sample.h
+++ b/samples/trace_events/trace-events-sample.h
@@ -247,12 +247,14 @@
*/
/*
- * It is OK to have helper functions in the file, but they need to be protected
- * from being defined more than once. Remember, this file gets included more
- * than once.
+ * It is OK to have helper functions and data structures in the file, but they
+ * need to be protected from being defined more than once. Remember, this file
+ * gets included more than once.
*/
#ifndef __TRACE_EVENT_SAMPLE_HELPER_FUNCTIONS
#define __TRACE_EVENT_SAMPLE_HELPER_FUNCTIONS
+#include <linux/timer.h>
+
static inline int __length_of(const int *list)
{
int i;
@@ -270,6 +272,13 @@ enum {
TRACE_SAMPLE_BAR = 4,
TRACE_SAMPLE_ZOO = 8,
};
+
+struct foo_timer_data {
+ const char *name;
+ struct timer_list timer;
+ int __percpu *counter;
+};
+
#endif
/*
@@ -595,6 +604,25 @@ TRACE_EVENT(foo_rel_loc,
__get_rel_bitmask(bitmask),
__get_rel_cpumask(cpumask))
);
+
+TRACE_EVENT(foo_timer_fn,
+
+ TP_PROTO(struct foo_timer_data *data),
+
+ TP_ARGS(data),
+
+ TP_STRUCT__entry(
+ __string( name, data->name )
+ __field( int, count )
+ ),
+
+ TP_fast_assign(
+ __assign_str(name);
+ __entry->count = *this_cpu_ptr(data->counter);
+ ),
+
+ TP_printk("name=%s count=%d", __get_str(name), __entry->count)
+);
#endif
/***** NOTICE! The #if protection ends here. *****/
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc b/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
new file mode 100644
index 000000000000..96791e120b7d
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
@@ -0,0 +1,51 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: BTF event with typecast and percpu access
+# requires: dynamic_events "this_cpu_read(<fetcharg>)":README "[(structname[,field])]<argname>[->field[->field|.field...]]":README
+
+# Check if the sample module is loaded
+if ! lsmod | grep -q trace_events_sample; then
+ modprobe trace-events-sample || exit_unsupported
+fi
+
+echo 0 > events/enable
+echo > dynamic_events
+
+# The sample_timer_cb(struct timer_list *t) is called.
+# We want to check (STRUCT,FIELD)VAR typecast and this_cpu_read() access.
+# (foo_timer_data,timer)t converts t to struct foo_timer_data * using container_of.
+# data->counter is a per-cpu pointer to int.
+# this_cpu_read(data->counter) should give the value of the counter.
+
+echo 'f:mysample/myevent sample_timer_cb name=(foo_timer_data,timer)t->name:string count=this_cpu_read((foo_timer_data,timer)t->counter)' >> dynamic_events
+
+echo 1 > events/mysample/myevent/enable
+echo 1 > events/sample-trace/foo_timer_fn/enable
+
+sleep 2
+
+echo 0 > events/mysample/myevent/enable
+echo 0 > events/sample-trace/foo_timer_fn/enable
+
+# Compare the values.
+MATCH=0
+while read line; do
+ if echo $line | grep -q "foo_timer_fn:"; then
+ NAME=`echo $line | sed 's/.*name=\([^ ]*\) .*/\1/'`
+ COUNT=`echo $line | sed 's/.*count=\([^ ]*\).*/\1/'`
+ if grep -q "myevent:.*name=\"${NAME}\" count=$COUNT" trace; then
+ MATCH=$((MATCH+1))
+ fi
+ fi
+done < trace
+
+if [ $MATCH -eq 0 ]; then
+ echo "No matching events found"
+ exit_fail
+fi
+
+# Clean up
+echo 0 > events/mysample/myevent/enable
+echo 0 > events/sample-trace/foo_timer_fn/enable
+echo > dynamic_events
+clear_trace
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/btf_typecast_accepted.tc b/tools/testing/selftests/ftrace/test.d/dynevent/btf_typecast_accepted.tc
new file mode 100644
index 000000000000..acf0b5a917d3
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/btf_typecast_accepted.tc
@@ -0,0 +1,107 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: BTF typecast and percpu access syntax validation
+# requires: dynamic_events "this_cpu_read(<fetcharg>)":README "[(structname[,field])]<argname>[->field[->field|.field...]]":README
+
+KPROBES=
+FPROBES=
+
+if grep -qF "p[:[<group>/][<event>]] <place> [<args>]" README ; then
+ KPROBES=yes
+fi
+if grep -qF "f[:[<group>/][<event>]] <func-name>[%return] [<args>]" README ; then
+ FPROBES=yes
+fi
+
+if [ -z "$KPROBES" -a -z "$FPROBES" ] ; then
+ exit_unsupported
+fi
+
+echo 0 > events/enable
+echo > dynamic_events
+
+# Load trace-events-sample module if available to have per-CPU counter structure defined
+if ! lsmod | grep -q trace_events_sample; then
+ modprobe trace-events-sample || true
+fi
+
+if [ "$FPROBES" ] ; then
+ # 1. Test basic typecast on fprobe
+ echo 'f:fpevent1 vfs_read name=(file)file->f_path.dentry->d_name.name:string' >> dynamic_events
+ # 2. Test parenthesized typecast target on fprobe
+ echo 'f:fpevent2 vfs_read name=(file)(file)->f_path.dentry->d_name.name:string' >> dynamic_events
+ # 3. Test nested typecasts on fprobe
+ echo 'f:fpevent3 vfs_read name=(dentry)((file)file->f_path.dentry)->d_name.name:string' >> dynamic_events
+ # 4. Test container_of-style typecast with field option on fprobe
+ echo 'f:fpevent4 vfs_read name=(file,f_path)file->f_mode' >> dynamic_events
+ # 5. Test typecast on return value on fprobe
+ echo 'f:fpevent5 vfs_read%return name=(file)$retval->f_path.dentry->d_name.name:string' >> dynamic_events
+ # 6. Test $current variable support on fprobe
+ echo 'f:fpevent6 vfs_read pid=$current->pid' >> dynamic_events
+ echo 'f:fpevent7 vfs_read pid=(task_struct)$current->pid' >> dynamic_events
+ echo 'f:fpevent8 vfs_read pid=(task_struct,group_leader)$current->pid' >> dynamic_events
+
+ # Test this_cpu_read and this_cpu_ptr on fprobe
+ if lsmod | grep -q trace_events_sample; then
+ echo 'f:fpevent9 sample_timer_cb name=(foo_timer_data,timer)t->name:string count=this_cpu_read((foo_timer_data,timer)t->counter)' >> dynamic_events
+ echo 'f:fpevent10 sample_timer_cb ptr=this_cpu_ptr((foo_timer_data,timer)t->counter)' >> dynamic_events
+ fi
+fi
+
+if [ "$KPROBES" ] ; then
+ # 7. Test basic typecast on kprobe
+ echo 'p:kpevent1 vfs_read name=(file)file->f_path.dentry->d_name.name:string' >> dynamic_events
+ # 8. Test parenthesized typecast target on kprobe
+ echo 'p:kpevent2 vfs_read name=(file)(file)->f_path.dentry->d_name.name:string' >> dynamic_events
+ # 9. Test nested typecasts on kprobe
+ echo 'p:kpevent3 vfs_read name=(dentry)((file)file->f_path.dentry)->d_name.name:string' >> dynamic_events
+ # 10. Test container_of-style typecast with field option on kprobe
+ echo 'p:kpevent4 vfs_read name=(file,f_path)file->f_mode' >> dynamic_events
+ # 11. Test typecast on return value on kretprobe
+ echo 'r:kpevent5 vfs_read name=(file)$retval->f_path.dentry->d_name.name:string' >> dynamic_events
+ # 12. Test $current variable support on kprobe
+ echo 'p:kpevent6 vfs_read pid=$current->pid' >> dynamic_events
+ echo 'p:kpevent7 vfs_read pid=(task_struct)$current->pid' >> dynamic_events
+ echo 'p:kpevent8 vfs_read pid=(task_struct,group_leader)$current->pid' >> dynamic_events
+
+ # Test this_cpu_read and this_cpu_ptr on kprobe
+ if lsmod | grep -q trace_events_sample; then
+ echo 'p:kpevent9 sample_timer_cb name=(foo_timer_data,timer)t->name:string count=this_cpu_read((foo_timer_data,timer)t->counter)' >> dynamic_events
+ echo 'p:kpevent10 sample_timer_cb ptr=this_cpu_ptr((foo_timer_data,timer)t->counter)' >> dynamic_events
+ fi
+fi
+
+# Verify the events exist in dynamic_events
+if [ "$FPROBES" ] ; then
+ grep -q "fpevent1 " dynamic_events
+ grep -q "fpevent2 " dynamic_events
+ grep -q "fpevent3 " dynamic_events
+ grep -q "fpevent4 " dynamic_events
+ grep -q "fpevent5 " dynamic_events
+ grep -q "fpevent6 " dynamic_events
+ grep -q "fpevent7 " dynamic_events
+ grep -q "fpevent8 " dynamic_events
+ if lsmod | grep -q trace_events_sample; then
+ grep -q "fpevent9 " dynamic_events
+ grep -q "fpevent10 " dynamic_events
+ fi
+fi
+
+if [ "$KPROBES" ] ; then
+ grep -q "kpevent1 " dynamic_events
+ grep -q "kpevent2 " dynamic_events
+ grep -q "kpevent3 " dynamic_events
+ grep -q "kpevent4 " dynamic_events
+ grep -q "kpevent5 " dynamic_events
+ grep -q "kpevent6 " dynamic_events
+ grep -q "kpevent7 " dynamic_events
+ grep -q "kpevent8 " dynamic_events
+ if lsmod | grep -q trace_events_sample; then
+ grep -q "kpevent9 " dynamic_events
+ grep -q "kpevent10 " dynamic_events
+ fi
+fi
+
+# Clean up
+echo > dynamic_events
+clear_trace
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc
index 0e65e787e426..ecfd50187fa7 100644
--- a/tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc
@@ -21,8 +21,17 @@ check_error 'e:foo/^bar.1 syscalls/sys_enter_openat' # BAD_EVENT_NAME
check_error 'e:foo/bar syscalls/sys_enter_openat arg=^$foo' # BAD_ATTACH_ARG
+check_error 'e:foo/bar syscalls/sys_enter_openat arg=^COMM' # NO_EVENT_FIELD
+if grep -q '\\$current' README; then
+ check_error 'e:foo/bar syscalls/sys_enter_openat arg=^current' # NO_EVENT_FIELD
+fi
+
if grep -q '<attached-group>\.<attached-event>.*\[if <filter>\]' README; then
check_error 'e:foo/bar syscalls/sys_enter_openat if ^' # NO_EP_FILTER
fi
+if grep -q 'this_cpu_read(<fetcharg>)' README; then
+ check_error 'e:foo/bar syscalls/sys_enter_openat arg=^this_cpu_read(file)' # NOSUP_PERCPU
+fi
+
exit 0
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
index fee479295e2f..e9d7e6919c7f 100644
--- a/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
@@ -112,6 +112,18 @@ check_error 'f vfs_read%return $retval->^foo' # NO_PTR_STRCT
check_error 'f vfs_read file->^foo' # NO_BTF_FIELD
check_error 'f vfs_read file^-.foo' # BAD_HYPHEN
check_error 'f vfs_read ^file:string' # BAD_TYPE4STR
+if grep -qF "[(structname" README ; then
+check_error 'f vfs_read arg1=(task_struct)file^' # TYPECAST_REQ_FIELD
+check_error 'f vfs_read arg1=(a)((b)((c)(^(d)file->d)->c)->b)->a' # TOO_MANY_NESTED
+check_error 'f vfs_read arg1=(task_struct,^in_execve)file->comm' # TYPECAST_NOT_ALIGNED
+check_error 'f vfs_read arg1=(task_struct,^foo_bar)file->pid' # NO_BTF_FIELD
+check_error 'f vfs_read arg1=(^task_struct1234)file->pid' # NO_PTR_STRCT
+check_error 'f vfs_read arg1=(task_struct,se^->group_node)file->comm' # TYPECAST_BAD_ARROW
+check_error 'f vfs_read arg1=(task_struct,^->pid)file->comm' # NO_BTF_FIELD
+check_error 'f vfs_read arg1=(task_struct,^.pid)file->comm' # NO_BTF_FIELD
+check_error 'f vfs_read arg1=(task_struct,^.)file->comm' # NO_BTF_FIELD
+check_error 'f vfs_read arg1=(task_struct)^@symbol+10->comm' # TYPECAST_SYM_OFFSET
+fi
fi
else
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
index 8f1c58f0c239..21ce8414459f 100644
--- a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
@@ -115,6 +115,18 @@ check_error 'p vfs_read+20 ^$arg*' # NOFENTRY_ARGS
check_error 'p vfs_read ^hoge' # NO_BTFARG
check_error 'p kfree ^$arg10' # NO_BTFARG (exceed the number of parameters)
check_error 'r kfree ^$retval' # NO_RETVAL
+if grep -qF "[(structname" README ; then
+check_error 'p vfs_read arg1=(task_struct)file^' # TYPECAST_REQ_FIELD
+check_error 'p vfs_read arg1=(a)((b)((c)(^(d)file->d)->c)->b)->a' # TOO_MANY_NESTED
+check_error 'p vfs_read arg1=(task_struct,^in_execve)file->comm' # TYPECAST_NOT_ALIGNED
+check_error 'p vfs_read arg1=(task_struct,^foo_bar)file->pid' # NO_BTF_FIELD
+check_error 'p vfs_read arg1=(^task_struct1234)file->pid' # NO_PTR_STRCT
+check_error 'p vfs_read arg1=(task_struct,se^->group_node)file->comm' # TYPECAST_BAD_ARROW
+check_error 'p vfs_read arg1=(task_struct,^->pid)file->comm' # NO_BTF_FIELD
+check_error 'p vfs_read arg1=(task_struct,^.pid)file->comm' # NO_BTF_FIELD
+check_error 'p vfs_read arg1=(task_struct,^.)file->comm' # NO_BTF_FIELD
+check_error 'p vfs_read arg1=(task_struct)^@symbol+10->comm' # TYPECAST_SYM_OFFSET
+fi
else
check_error 'p vfs_read ^$arg*' # NOSUP_BTFARG
fi
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc
index c817158b99db..e12dc967ec76 100644
--- a/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc
@@ -28,4 +28,9 @@ if grep -q ".*symstr.*" README; then
check_error 'p /bin/sh:10 $stack0:^symstr' # BAD_TYPE
fi
+# $current is not supported by uprobe
+if grep -q "\$current.*" README; then
+check_error 'p /bin/sh:10 ^$current:u8' # BAD_VAR
+fi
+
exit 0
* Re: [PATCH v10 2/9] tracing/probes: Support dumping fetcharg program for debugging dynamic events
From: Masami Hiramatsu @ 2026-06-26 4:21 UTC
To: Masami Hiramatsu (Google)
Cc: Steven Rostedt, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178243984380.790911.16958862880276016609.stgit@devnote2>
Sashiko made a good comment:
> commit ce5985c5ce46daa952ae907992e02e403e601c04
> Author: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> tracing/probes: Support dumping fetcharg program for debugging dynamic events
> This commit introduces a new configuration option
> CONFIG_PROBE_EVENTS_DUMP_FETCHARG to debug dynamic events. When enabled, it
> appends commented lines containing the fetch instructions to dynamic events in
> tracefs files. This helps in verifying the compiled fetch instructions for
> each probe argument.
> > diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> > index a8420e6abb56a..cfa807d8e760f 100644
> > --- a/kernel/trace/trace_kprobe.c
> > +++ b/kernel/trace/trace_kprobe.c
> > @@ -1320,6 +1320,8 @@ static int trace_kprobe_show(struct seq_file *m, struct dyn_event *ev)
> > seq_printf(m, " %s=%s", tk->tp.args[i].name, tk->tp.args[i].comm);
> > seq_putc(m, '\n');
> >
> > + trace_probe_dump_args(m, &tk->tp);
> > +
> Does enabling CONFIG_PROBE_EVENTS_DUMP_FETCHARG break in-tree userspace tools
> that parse kprobe_events?
> For example, perf probe reads the kprobe_events file directly. It passes
> the unexpected # lines to parse_probe_trace_command() in
> tools/perf/util/probe-file.c:__probe_file__get_namelist(), which returns
> -EINVAL and immediately aborts the event listing process.
> Similarly, the ftrace2bconf script does not ignore # lines when reading
> kprobe_events, which leads it to mistakenly emit invalid bootconfig syntax
> in tools/bootconfig/scripts/ftrace2bconf.sh:kprobe_event_options()
> (e.g. ftrace.event.kprobes.#.probes += ...).
Yeah, those tools need to be updated. In any case, tools that read
files in tracefs should skip lines starting with #; those are comment
lines (e.g. trace file, hist file, etc.).
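As a rough illustration of that skipping rule (a minimal sketch, not taken from perf or ftrace2bconf; the helper name list_probe_defs is hypothetical), a shell reader could drop the comment lines before parsing the event definitions:

```shell
#!/bin/sh
# Hypothetical helper: print the event definitions from a dynamic-event
# file such as kprobe_events, dropping blank lines and '#' comment lines
# (including comment lines that start with leading whitespace).
list_probe_defs() {
    # $1: path to the events file (assumed to be readable)
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1"
}
```

A caller would then run something like `list_probe_defs /sys/kernel/tracing/kprobe_events` and parse only the remaining lines; the tracefs mount point is an assumption and may differ on a given system.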
Thanks,
--
Masami Hiramatsu
* Re: [PATCHv4 00/13] uprobes/x86: Fix red zone issue for optimized uprobes
From: Andrii Nakryiko @ 2026-06-26 5:44 UTC (permalink / raw)
To: Jiri Olsa, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu
Cc: Oleg Nesterov, Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <CAEf4Bzbrd8xs5sSwEPR336=7z2FGcdXVtV-aVZ4W1zSjHkwwcg@mail.gmail.com>
On Mon, Jun 8, 2026 at 1:48 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Jun 3, 2026 at 11:59 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Tue, May 26, 2026 at 10:58:27PM +0200, Jiri Olsa wrote:
> > > hi,
> > > Andrii reported an issue with optimized uprobes [1] that can clobber
> > > redzone area with call instruction storing return address on stack
> > > where user code may keep temporary data without adjusting rsp.
> > >
> > > Fixing this by moving the optimized uprobes on top of 10-bytes nop
> > > instruction, so we can squeeze another instruction to escape the
> > > redzone area before doing the call.
> > >
> > > Note we need upstream update first for patch 3 (github.com/libbpf/usdt),
> > > if we decide to take this change.
> > >
> > > thanks,
> > > jirka
> > >
> > >
> > > v1: https://lore.kernel.org/bpf/20260514135342.22130-1-jolsa@kernel.org/
> > > v2: https://lore.kernel.org/bpf/20260518105957.123445-1-jolsa@kernel.org/
> > > v3: https://lore.kernel.org/bpf/20260521124411.31133-1-jolsa@kernel.org/
> > >
> > > v4 changes:
> > > > - do not use 2nd int3 (on +5 offset) because the call instruction
> > > > is always the same for the given nop10 address [Andrii/Peter]
> > > > - unmap unused trampoline vma after unsuccessful optimization [sashiko]
> > > > - small change to patch#2: moved user_64bit_mode earlier in the path
> > > > and pass/use mm_struct pointer directly from arch_uprobe_optimize
> > > > instead of getting current->mm
> > > Andrii, keeping your ack, please shout otherwise
> >
> > hi,
> > I think bots did not find anything substantial, I have just small
> > selftests changes queued for v5
> >
> > any other feedback/review would be great
> >
>
> one small nit only, otherwise LGTM.
>
> Peter, Masami, Ingo, should this go through tip tree or should we
> route this through bpf-next tree? I think we are fine either way, but
> might be more convenient to route through bpf-next given libbpf and
> BPF selftest changes.
>
I'll assume that no one has any objections to routing this through
bpf-next. We got reviews from Oleg, so that's great. Jiri, it seems like
you will make small adjustments and send v5; please do, and then, unless
someone raises issues in the meantime, this will go through bpf-next.
Thanks!
> If so, I'd appreciate another look at first 5 patches by Peter, if
> that's ok. Thanks!
>
>
>
> > thanks,
> > jirka
> >
> >
> > >
> > > v3 changes:
> > > - use nop10 update suggested by Peter in [2]
> > > - remove struct uprobe_trampoline object, use vma objects directly instead
> > > - selftests fixes [sashiko]
> > > - ack from Andrii
> > >
> > > v2 changes:
> > > - several selftest fixes [sashiko]
> > > > - consolidate is_lea_insn and is_call_insn into a single check [Jakub Sitnicki]
> > > - use proper mm_struct object in __in_uprobe_trampoline check [sashiko]
> > > - allow to copy uprobe trampolines vma objects on fork [sashiko]
> > > - change uprobe syscall detection error from -ENXIO to -EPROTO [Andrii]
> > > - added fork/clone tests
> > > - I kept the selftest changes and nop5->nop10 changes in separate
> > > commits for easier review, we can squash them later if we want to keep
> > > bisect working properly
> > >
> > >
> > > [1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > > [2] https://lore.kernel.org/bpf/20260518104306.GU3102624@noisy.programming.kicks-ass.net/#t
> > > ---
> > > Andrii Nakryiko (1):
> > > selftests/bpf: Add tests for uprobe nop10 red zone clobbering
> > >
> > > Jiri Olsa (12):
> > > uprobes/x86: Use proper mm_struct in __in_uprobe_trampoline
> > > uprobes/x86: Remove struct uprobe_trampoline object
> > > uprobes/x86: Allow to copy uprobe trampolines on fork
> > > uprobes/x86: Unmap trampoline vma object in case it's unused
> > > uprobes/x86: Move optimized uprobe from nop5 to nop10
> > > libbpf: Change has_nop_combo to work on top of nop10
> > > libbpf: Detect uprobe syscall with new error
> > > selftests/bpf: Emit nop,nop10 instructions combo for x86_64 arch
> > > selftests/bpf: Change uprobe syscall tests to use nop10
> > > selftests/bpf: Change uprobe/usdt trigger bench code to use nop10
> > > selftests/bpf: Add reattach tests for uprobe syscall
> > > selftests/bpf: Add tests for forked/cloned optimized uprobes
> > >
> > > arch/x86/kernel/uprobes.c | 379 +++++++++++++++++++++++++++++++++++++++++++-----------------------------
> > > include/linux/uprobes.h | 5 -
> > > kernel/events/uprobes.c | 10 --
> > > kernel/fork.c | 1 -
> > > tools/lib/bpf/features.c | 4 +-
> > > tools/lib/bpf/usdt.c | 16 +--
> > > tools/testing/selftests/bpf/bench.c | 20 ++--
> > > tools/testing/selftests/bpf/benchs/bench_trigger.c | 38 ++++----
> > > tools/testing/selftests/bpf/benchs/run_bench_uprobes.sh | 2 +-
> > > tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c | 307 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
> > > tools/testing/selftests/bpf/prog_tests/usdt.c | 74 ++++++++++++--
> > > tools/testing/selftests/bpf/progs/test_usdt.c | 25 +++++
> > > tools/testing/selftests/bpf/usdt.h | 2 +-
> > > tools/testing/selftests/bpf/usdt_2.c | 15 ++-
> > > 14 files changed, 653 insertions(+), 245 deletions(-)
* Re: [PATCH v9 6/6] selftests/mm: add hwpoison-panic destructive test
From: Miaohe Lin @ 2026-06-26 7:07 UTC (permalink / raw)
To: Breno Leitao
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
Jonathan Corbet, Shuah Khan, Liam R. Howlett, lance.yang,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260609-ecc_panic-v9-6-432a74002e74@debian.org>
On 2026/6/9 18:57, Breno Leitao wrote:
> Add a destructive selftest that verifies
> vm.panic_on_unrecoverable_memory_failure actually panics when a
> hwpoison error hits a kernel-owned page.
>
> Three "kinds" of kernel-owned page can be targeted, selectable via
> the script's first positional argument (default: rodata):
>
> rodata - a PG_reserved page in the kernel rodata range, sourced
> from the "Kernel rodata" sub-resource of "System RAM" in
> /proc/iomem. That entry is reported on every major
> architecture and guarantees the chosen PFN is backed by
> struct page (an online System RAM range, not a firmware
> hole), is PG_reserved, and is read-only -- so even if
> the panic fails to fire for some reason, the resulting
> PG_hwpoison marker on rodata does not corrupt writable
> kernel state.
>
> slab - a slab page found by walking /proc/kpageflags for the
> first PFN with KPF_SLAB set (and KPF_HWPOISON / KPF_NOPAGE
> / KPF_COMPOUND_TAIL clear). Exercises the get_any_page()
> path on a non PG_reserved kernel-owned page and so
> catches regressions where get_any_page() collapses
> kernel-owned pages into a transient -EIO instead of
> -ENOTRECOVERABLE.
>
> pgtable - same as slab, but the PFN is selected via KPF_PGTABLE.
>
> PageLargeKmalloc, the fourth page type matched by
> HWPoisonKernelOwned(), is intentionally not covered: it is a
> PAGE_TYPE_OPS flag with no /proc/kpageflags bit, so selecting such
> a PFN from userspace is not feasible. The slab and pgtable
> variants already exercise the same get_any_page() positive-check
> branch.
>
> The script enables the sysctl and writes the selected physical
> address to /sys/devices/system/memory/hard_offline_page. A
> successful run crashes the kernel with
>
> Memory failure: <pfn>: unrecoverable page
>
> A return from the inject means the panic did not fire and the test
> fails. Test outcome is therefore observed externally (serial
> console, kdump) rather than from the script's own exit code.
>
> The script is intentionally NOT wired into run_vmtests.sh: every
> successful run panics the kernel, which is incompatible with the
> sequential "run each category in the same VM" model that
> run_vmtests.sh assumes. It is also not registered as a TEST_PROGS /
> ksft_* wrapper so a default kselftest run does not opt itself into
> a panic. The script is meant to be executed manually inside a
> disposable VM (e.g. virtme-ng), one variant per VM boot, and
> requires RUN_DESTRUCTIVE=1 in the environment as a safety net.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
Looks good to me with two comments below.
> ---
> tools/testing/selftests/mm/Makefile | 4 +
> tools/testing/selftests/mm/hwpoison-panic.sh | 208 +++++++++++++++++++++++++++
> 2 files changed, 212 insertions(+)
>
> diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
> index e6df968f0971..ed321ae709da 100644
> --- a/tools/testing/selftests/mm/Makefile
> +++ b/tools/testing/selftests/mm/Makefile
> @@ -174,6 +174,10 @@ TEST_PROGS += ksft_userfaultfd.sh
> TEST_PROGS += ksft_vma_merge.sh
> TEST_PROGS += ksft_vmalloc.sh
>
> +# Destructive: every successful run panics the kernel. Installed and
> +# kept executable, but not run from a default kselftest invocation.
> +TEST_PROGS_EXTENDED += hwpoison-panic.sh
> +
> TEST_FILES := test_vmalloc.sh
> TEST_FILES += test_hmm.sh
> TEST_FILES += va_high_addr_switch.sh
> diff --git a/tools/testing/selftests/mm/hwpoison-panic.sh b/tools/testing/selftests/mm/hwpoison-panic.sh
> new file mode 100755
> index 000000000000..fe58e7638a8b
> --- /dev/null
> +++ b/tools/testing/selftests/mm/hwpoison-panic.sh
> @@ -0,0 +1,208 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# Verify vm.panic_on_unrecoverable_memory_failure by injecting a hwpoison
> +# error on a kernel-owned page and confirming the kernel panics.
> +#
> +# Three "kinds" of kernel-owned page can be targeted, selectable via the
> +# first positional argument (default: rodata):
> +#
> +# rodata - a PG_reserved page in the kernel rodata range
> +# (sourced from /proc/iomem "Kernel rodata"). Exercises
> +# memory_failure() -> get_any_page() on a PageReserved page.
> +#
> +# slab - a slab page found via /proc/kpageflags (KPF_SLAB).
> +# Exercises memory_failure() -> get_any_page() on a non
> +# PG_reserved kernel-owned page. This path is what catches
> +# regressions where get_any_page() collapses kernel-owned
> +# pages into a transient -EIO instead of -ENOTRECOVERABLE.
> +#
> +# pgtable - a page-table page found via /proc/kpageflags (KPF_PGTABLE).
> +# Same path as slab, different page type.
> +#
> +# This test is DESTRUCTIVE: a successful run crashes the kernel. It is
> +# meant to be executed inside a disposable VM (e.g. virtme-ng) with a
> +# serial console captured by the harness. It is skipped unless the
> +# caller opts in via RUN_DESTRUCTIVE=1.
> +#
> +# Test passes externally: the kernel must panic with
> +# "Memory failure: <pfn>: unrecoverable page"
> +# A return from the inject means the panic did not fire and the test
> +# fails.
> +#
> +# Author: Breno Leitao <leitao@debian.org>
> +
> +set -u
> +
> +ksft_skip=4
> +sysctl_path=/proc/sys/vm/panic_on_unrecoverable_memory_failure
> +inject_path=/sys/devices/system/memory/hard_offline_page
> +kpageflags_path=/proc/kpageflags
> +
> +# /proc/kpageflags bit positions (see include/uapi/linux/kernel-page-flags.h)
> +KPF_SLAB=7
> +KPF_COMPOUND_TAIL=16
> +KPF_HWPOISON=19
> +KPF_NOPAGE=20
> +KPF_PGTABLE=26
> +
> +kind=${1:-rodata}
> +
> +ksft_print() { echo "# $*"; }
> +ksft_exit_skip() { ksft_print "$*"; exit "$ksft_skip"; }
> +ksft_exit_fail() { echo "not ok 1 $*"; exit 1; }
> +
> +if [ "$(id -u)" -ne 0 ]; then
> + ksft_exit_skip "must run as root"
> +fi
> +
> +if [ ! -w "$sysctl_path" ]; then
> + ksft_exit_skip "$sysctl_path not present (kernel without the sysctl?)"
> +fi
> +
> +if [ ! -w "$inject_path" ]; then
> + ksft_exit_skip "$inject_path not present (no MEMORY_HOTPLUG?)"
> +fi
> +
> +if [ "${RUN_DESTRUCTIVE:-0}" != "1" ]; then
> + ksft_exit_skip "destructive test; re-run with RUN_DESTRUCTIVE=1 inside a disposable VM"
> +fi
> +
> +# Pick a PFN inside the kernel image rodata region of /proc/iomem.
> +# This is preferred over a top-level "Reserved" entry because top-level
> +# Reserved ranges are often firmware holes that have no backing struct
> +# page; pfn_to_online_page() returns NULL on those and memory_failure()
> +# bails out with -ENXIO before reaching the panic path.
> +#
> +# "Kernel rodata" is reported as a sub-resource of "System RAM" on every
> +# major architecture, which guarantees:
> +# - the PFN is backed by struct page (within an online memory range);
> +# - PG_reserved is set on the page (kernel image area);
> +# - the memory is read-only, so setting PG_hwpoison on it does not
> +# corrupt writable kernel state if the panic somehow does not fire.
> +#
> +# /proc/iomem entries look like (indented for sub-resources):
> +# " 02500000-02ffffff : Kernel rodata"
> +pick_rodata_phys_addr() {
> + awk -v pagesize="$(getconf PAGE_SIZE)" '
> + # Convert a hex string to a number without relying on the gawk-only
> + # strtonum(). mawk lacks it and would otherwise spuriously skip
> + # this test on distros that ship mawk as /usr/bin/awk.
> + function hex2num(s, n, i, c, v) {
> + n = 0
> + for (i = 1; i <= length(s); i++) {
> + c = tolower(substr(s, i, 1))
> + v = index("0123456789abcdef", c) - 1
> + if (v < 0)
> + return -1
> + n = n * 16 + v
> + }
> + return n
> + }
> + /: Kernel rodata[[:space:]]*$/ {
> + sub(/^[[:space:]]+/, "")
> + n = split($0, a, /[- ]/)
> + start = hex2num(a[1])
> + end = hex2num(a[2])
> + if (end <= start)
> + next
> + # Page-align upward and emit the first byte of that page.
> + pfn = int((start + pagesize - 1) / pagesize)
> + printf "0x%x\n", pfn * pagesize
> + exit 0
> + }
> + ' /proc/iomem
> +}
> +
> +# Walk /proc/kpageflags and return the phys addr of the first PFN that
> +# has bit $1 set, with KPF_HWPOISON, KPF_NOPAGE and KPF_COMPOUND_TAIL
> +# all clear (so we attack a real, non-tail, not-already-poisoned page).
> +#
> +# We skip the first 16 MiB of PFNs to step past low-memory special
> +# ranges (BIOS/EFI/ACPI/etc.) that often are PG_reserved and would not
> +# exhibit the slab/pgtable type we are looking for.
> +pick_kpageflags_phys_addr() {
> + local want_bit=$1
> + local pagesize skip_pfn
> +
> + [ -r "$kpageflags_path" ] || return
> +
> + pagesize=$(getconf PAGE_SIZE)
> + skip_pfn=$(((16 * 1024 * 1024) / pagesize))
> +
> + od -An -tx8 -v -w8 -j "$((skip_pfn * 8))" "$kpageflags_path" 2>/dev/null | \
> + awk -v want_bit="$want_bit" \
> + -v hwp_bit="$KPF_HWPOISON" \
> + -v nopage_bit="$KPF_NOPAGE" \
> + -v tail_bit="$KPF_COMPOUND_TAIL" \
> + -v base_pfn="$skip_pfn" \
> + -v pagesize="$pagesize" '
> + # Test whether bit "b" is set in the 16-hex-digit value "hex".
> + # Done with substring + per-digit lookup so we never rely on awk
> + # bitwise operators (mawk lacks them), 64-bit FP precision or the
> + # gawk-only strtonum().
> + function bit_set(hex, b, di, bi, c, v) {
> + di = int(b / 4)
> + bi = b - di * 4
> + c = substr(hex, length(hex) - di, 1)
> + v = index("0123456789abcdef", tolower(c)) - 1
> + if (bi == 0) return (v % 2) == 1
> + if (bi == 1) return int(v / 2) % 2 == 1
> + if (bi == 2) return int(v / 4) % 2 == 1
> + return int(v / 8) % 2 == 1
> + }
> + {
> + gsub(/^[[:space:]]+/, "")
> + h = $1
> + if (bit_set(h, want_bit) &&
> + !bit_set(h, hwp_bit) &&
> + !bit_set(h, nopage_bit) &&
> + !bit_set(h, tail_bit)) {
> + pfn = base_pfn + NR - 1
> + printf "0x%x\n", pfn * pagesize
> + exit 0
> + }
> + }
> + '
> +}
> +
> +case "$kind" in
> +rodata)
> + phys_addr=$(pick_rodata_phys_addr)
> + missing_msg='no "Kernel rodata" entry in /proc/iomem'
> + ;;
> +slab)
> + phys_addr=$(pick_kpageflags_phys_addr "$KPF_SLAB")
> + missing_msg="no usable slab PFN found in $kpageflags_path"
> + ;;
> +pgtable)
> + phys_addr=$(pick_kpageflags_phys_addr "$KPF_PGTABLE")
> + missing_msg="no usable page-table PFN found in $kpageflags_path"
> + ;;
> +*)
> + ksft_exit_fail "unknown kind '$kind' (expected: rodata|slab|pgtable)"
> + ;;
> +esac
> +
> +if [ -z "$phys_addr" ]; then
> + ksft_exit_skip "$missing_msg"
> +fi
> +
> +ksft_print "enabling $sysctl_path"
> +prior=$(cat "$sysctl_path")
> +echo 1 > "$sysctl_path" || ksft_exit_fail "failed to enable sysctl"
> +
> +ksft_print "injecting hwpoison at phys 0x$(printf '%x' "$phys_addr") (kind=$kind)"
> +ksft_print "expecting kernel panic: 'Memory failure: <pfn>: unrecoverable page'"
> +
> +# If this returns, the kernel did not panic → test failed. Restore the
> +# sysctl before reporting so the system is left as we found it.
> +if echo "$phys_addr" > "$inject_path"; then
> + echo "$prior" > "$sysctl_path"
> + ksft_exit_fail "inject returned without panic; sysctl ineffective"
In case of failure, should we recheck the page type? There is a window between
getting phys_addr and injecting the hwpoison in which the page type could change.
> +fi
> +
> +# Write failed (e.g. -EINVAL on offlining a non-online region): also a
> +# failure for this test, since we expected the panic path.
> +echo "$prior" > "$sysctl_path"
> +ksft_exit_fail "inject failed before reaching the panic path"
Should we unpoison the pfn in case of failure?
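To make both questions concrete, here is a hedged sketch of what such a recheck and cleanup could look like. The helper names (kpf_bit_set, pfn_still_has_bit, try_unpoison_pfn) are hypothetical and not part of the patch. It assumes od prints each flags word as 16 hex digits, and that the debugfs file hwpoison/unpoison-pfn is available, which needs CONFIG_HWPOISON_INJECT and a mounted debugfs. Note that unpoison-pfn takes a PFN, while hard_offline_page takes a physical address.

```shell
#!/bin/bash
# Return 0 if bit $2 is set in the hex flags word $1 (no "0x" prefix).
kpf_bit_set() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Hypothetical recheck: does PFN $1 still have kpageflags bit $2 set?
# kpageflags_path defaults to /proc/kpageflags (one 8-byte word per PFN).
pfn_still_has_bit() {
    flags=$(od -An -tx8 -v -j "$(( $1 * 8 ))" -N 8 \
        "${kpageflags_path:-/proc/kpageflags}" 2>/dev/null | tr -d '[:space:]')
    [ -n "$flags" ] && kpf_bit_set "$flags" "$2"
}

# Hypothetical cleanup: ask the hwpoison injector to unpoison PFN $1.
# Returns non-zero if the debugfs file is missing or not writable.
try_unpoison_pfn() {
    unpoison=${unpoison_path:-/sys/kernel/debug/hwpoison/unpoison-pfn}
    [ -w "$unpoison" ] || return 1
    printf '0x%x\n' "$1" > "$unpoison"
}
```

In the failure branches the script could compute the PFN as `$((phys_addr / pagesize))`, report whether the page type changed under it, and then attempt the unpoison. Whether unpoisoning a rodata page is desirable at all is left open here.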
Thanks.
* [PATCH] fgraph: Use trace_seq_putc() in print_graph_return()
From: Markus Elfring @ 2026-06-26 7:36 UTC (permalink / raw)
To: linux-trace-kernel, Masami Hiramatsu, Mathieu Desnoyers,
Steven Rostedt
Cc: LKML, kernel-janitors, Mark Brown, Mark Rutland,
Woradorn Laodhanadhaworn
From: Markus Elfring <elfring@users.sourceforge.net>
Date: Fri, 26 Jun 2026 09:24:18 +0200
A single closing curly bracket should be put into a trace sequence buffer.
Thus use the corresponding function “trace_seq_putc”.
The source code was transformed by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
kernel/trace/trace_functions_graph.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
index 0d2d3a2ea7dd..ff7cb1a76b95 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -1349,7 +1349,7 @@ print_graph_return(struct ftrace_graph_ret_entry *retentry, struct trace_seq *s,
* that if the funcgraph-tail option is enabled.
*/
if (func_match && !(flags & TRACE_GRAPH_PRINT_TAIL))
- trace_seq_puts(s, "}");
+ trace_seq_putc(s, '}');
else
trace_seq_printf(s, "} /* %ps */", (void *)func);
}
--
2.54.0
* Re: [PATCH v3 1/8] scripts/sorttable: Handle RISC-V patchable ftrace entries
From: patchwork-bot+linux-riscv @ 2026-06-26 8:21 UTC (permalink / raw)
To: Wang Han
Cc: linux-riscv, pjw, palmer, aou, rostedt, alex, mhiramat,
mark.rutland, catalin.marinas, cp0613, andybnac, bjorn, debug,
puranjay, conor.dooley, jpoimboe, jikos, mbenes, pmladek,
joe.lawrence, shuah, peterz, mingo, acme, namhyung, oliver.yang,
xueshuai, zhuo.song, jkchen, linux-kernel, linux-trace-kernel,
live-patching, linux-kselftest, linux-perf-users
In-Reply-To: <20260609063002.3943001-1-wanghan@linux.alibaba.com>
Hello:
This series was applied to riscv/linux.git (fixes)
by Paul Walmsley <pjw@kernel.org>:
On Tue, 9 Jun 2026 14:29:52 +0800 you wrote:
> RISC-V uses -fpatchable-function-entry=8,4 when the compressed ISA is
> enabled and -fpatchable-function-entry=4,2 otherwise. In both cases, the
> patchable NOP area starts 8 bytes before the function symbol address.
> The __mcount_loc entries therefore point at the patchable NOP area
> associated with a function, while nm reports the function symbol at the
> entry address used for the function range check.
>
> [...]
Here is the summary with links:
- [v3,1/8] scripts/sorttable: Handle RISC-V patchable ftrace entries
https://git.kernel.org/riscv/c/57ad674d032b
- [v3,2/8] riscv: stacktrace: Add frame record metadata
(no matching commit)
- [v3,3/8] riscv: stacktrace: disable KASAN and KCOV instrumentation for stacktrace.o
(no matching commit)
- [v3,4/8] riscv: ftrace: always preserve s0 in dynamic ftrace register frame
(no matching commit)
- [v3,5/8] riscv: stacktrace: introduce stack-bound tracking helpers
(no matching commit)
- [v3,6/8] riscv: stacktrace: switch to frame-pointer based unwinder
(no matching commit)
- [v3,7/8] riscv: Kconfig: enable HAVE_RELIABLE_STACKTRACE and HAVE_LIVEPATCH
(no matching commit)
- [v3,8/8] selftests/livepatch: Add RISC-V syscall wrapper prefix
(no matching commit)
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [RFC PATCH 1/3] mm/compaction: skip isolate mlocked folios when compact_unevictable_allowed=0
From: Alexander Krabler @ 2026-06-26 8:45 UTC (permalink / raw)
To: Wandun, Vlastimil Babka (SUSE), linux-mm@kvack.org,
linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
linux-rt-devel@lists.linux.dev
Cc: akpm@linux-foundation.org, surenb@google.com, mhocko@suse.com,
jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com,
rostedt@goodmis.org, mhiramat@kernel.org,
mathieu.desnoyers@efficios.com, david@kernel.org, ljs@kernel.org,
liam@infradead.org, rppt@kernel.org, bigeasy@linutronix.de,
clrkwllms@kernel.org, Hugh Dickins
In-Reply-To: <ca1115c0-1509-453a-8235-08e381a3da6f@gmail.com>
On 6/24/26 13:08, Wandun wrote:
> On 6/22/26 17:55, Vlastimil Babka (SUSE) wrote:
>> On 6/18/26 13:43, Wandun wrote:
>>> Yes, I wrote a test case that can reproduce it in a few seconds.
>>>
>>> The test case contains 3 steps:
>>> 1. mlockall
>>> 2. mmap file(2GB) + trigger file write page fault;
>>> 3. during step 1, trigger compact via /proc/sys/vm/compact_memory
>>>
>>>
>>> My reproduction environment is qemu with 4GB ram, 8 core, aarch64,
>>> preempt_rt and includes the tracepoint in patch 02.
>>> After running the reproduction program for a few seconds, the
>>> following output appears.
>>>
>>> repro-403 [004] ....1 101.270505: mm_compaction_isolate_folio: pfn=0x71e3a mode=0x0 flags=referenced|uptodate|mlocked
>>> repro-403 [004] ....1 101.270507: mm_compaction_isolate_folio: pfn=0x71e3b mode=0x0 flags=referenced|uptodate|mlocked
>>> repro-403 [004] ....1 101.270513: mm_compaction_isolate_folio: pfn=0x71e3c mode=0x0 flags=referenced|uptodate|mlocked
>>> repro-403 [004] ....1 101.270515: mm_compaction_isolate_folio: pfn=0x71e3d mode=0x0 flags=uptodate|mlocked
>>> repro-403 [004] ....1 101.270517: mm_compaction_isolate_folio: pfn=0x71e3e mode=0x0 flags=uptodate|mlocked
>>> repro-403 [004] ....1 101.270520: mm_compaction_isolate_folio: pfn=0x71e3f mode=0x0 flags=uptodate|mlocked
I applied your PATCH 2/3 to our kernel and checked with your reproducer.
I get similar output, e.g.
t_compact-2148 [005] ....1 515.320221: mm_compaction_isolate_folio: pfn=0xe66c2 mode=0x0 flags=referenced|uptodate|active|swapbacked|mlocked
With your first patch applied, the number of these messages decreases.
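For comparing runs with and without the first patch, a count is easier to reason about than eyeballing the log. The following is only a sketch: the helper name count_mlocked_isolations is hypothetical, and it assumes each event sits on a single line in the saved trace file, as in the tracefs trace output rather than in this wrapped mail.

```shell
#!/bin/sh
# Hypothetical helper: count mm_compaction_isolate_folio events whose
# flags field contains "mlocked" as a whole '|'-separated flag name.
count_mlocked_isolations() {
    # $1: path to a saved copy of the trace output
    grep -E -c 'mm_compaction_isolate_folio:.*flags=([^[:space:]]*\|)?mlocked([|[:space:]]|$)' "$1"
}
```

Running it on a trace captured before and after applying a patch gives two numbers to compare. It only counts isolation events, so it says nothing about whether the migration-entry wait itself was hit.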
I was not able to apply your third patch to our (older) kernel.
However, we were not able to reproduce the actual race
(mlockall() process waiting on a migration PTE),
neither in the past nor now. That race might be hard to trigger.
> IIUC, more accurately, it is the migration entry in the page table that is really bad for
> an RT process. Isolating a page doesn't modify the page table, so memory
> access continues as usual. Hence a new idea:
>
> S1. In the mlock[all] syscall, if mlock_vma_pages_range hits a migration entry,
> it should wait for the migration to complete.
>
> S2. During the unmap phase of memory migration, prevent a page from being unmapped
> if the page's associated vma is marked with VM_LOCKED, similar to how reclaim is
> disabled for pages in a VM_LOCKED vma (try_to_unmap_one).
>
>
> For a page handled during the mlock[all] syscall:
> - if migration has already finished, there is nothing to do;
> - if migration is in progress and the migration entry is already filled, we
> wait (S1);
> - if the page is in flight, about to be isolated/migrated, S2 prevents the unmap.
>
> For a page handled during a page fault: VM_LOCKED is already set on the vma,
> so S2 guarantees it will not be unmapped, hence no migration entry.
I do not understand all the details of this, but it looks good;
the S1 case especially makes a lot of sense to me.
Nitpick: I suggest switching the order of PATCH 1 and 2 for the next iteration,
introducing the tracepoint first and then improving the situation.
Thanks a lot for looking into this issue!
Best regards,
Alexander
--
KUKA Deutschland GmbH Board of Directors: Michael Jürgens (Chairman), Johan Naten, Hui Zhang Registered Office: Augsburg HRB 14914
* Re: [PATCH v4 2/2] tracing: Remove trace_printk.h from kernel.h
From: Steven Rostedt @ 2026-06-26 8:51 UTC (permalink / raw)
To: Nathan Chancellor
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
Mathieu Desnoyers, Andrew Morton, Linus Torvalds,
Sebastian Andrzej Siewior, John Ogness, Thomas Gleixner,
Peter Zijlstra, Julia Lawall, Yury Norov, linux-doc, linux-kbuild,
linuxppc-dev, dri-devel, linux-stm32, linux-arm-kernel,
linux-rdma, linux-usb, linux-ext4, linux-nfs, kvm, intel-gfx
In-Reply-To: <20260625234158.GA261868@ax162>
On Thu, 25 Jun 2026 16:41:58 -0700
Nathan Chancellor <nathan@kernel.org> wrote:
> The following diff resolves it for me, should I send it as a separate
> patch or do you want to just fold it in with a note?
>
> diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
> index 621566345406..2301a701ffbb 100644
> --- a/include/linux/lockdep.h
> +++ b/include/linux/lockdep.h
> @@ -10,6 +10,7 @@
> #ifndef __LINUX_LOCKDEP_H
> #define __LINUX_LOCKDEP_H
>
> +#include <linux/instruction_pointer.h>
Ah, so the reason for this breakage is that lockdep was relying on
instruction_pointer.h, which just happened to be included in kernel.h
via trace_printk.h.
This is a separate issue, so it should be a separate patch. I'll add it
as patch 1 of this series.
Can you send me the config you used? This didn't trigger in my tests.
Thanks,
-- Steve
> #include <linux/lockdep_types.h>
> #include <linux/smp.h>
> #include <asm/percpu.h>
* Re: [RFC PATCH 1/3] mm/compaction: skip isolate mlocked folios when compact_unevictable_allowed=0
From: Sebastian Andrzej Siewior @ 2026-06-26 9:26 UTC (permalink / raw)
To: Wandun Chen
Cc: linux-mm, linux-kernel, linux-trace-kernel, linux-rt-devel, akpm,
vbabka, surenb, mhocko, jackmanb, hannes, ziy, rostedt, mhiramat,
mathieu.desnoyers, david, ljs, liam, rppt, clrkwllms,
Alexander.Krabler
In-Reply-To: <20260604023812.3700316-2-chenwandun1@gmail.com>
On 2026-06-04 10:38:10 [+0800], Wandun Chen wrote:
…
> Reported-by: Alexander Krabler <Alexander.Krabler@kuka.com>
> Closes: https://lore.kernel.org/all/DU0PR01MB10385345F7153F334100981888259A@DU0PR01MB10385.eurprd01.prod.exchangelabs.com/
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Wandun Chen <chenwandun@lixiang.com>
> Link: https://lore.kernel.org/all/33275585-f2db-4779-89f0-3ae24b455a67@suse.cz/ [1]
Is it possible to get a Fixes tag on the final fix so that it can be
backported to stable?
Sebastian
* Re: [RFC PATCH 1/3] mm/compaction: skip isolate mlocked folios when compact_unevictable_allowed=0
From: Wandun @ 2026-06-26 9:38 UTC (permalink / raw)
To: Alexander Krabler, Vlastimil Babka (SUSE), linux-mm@kvack.org,
linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
linux-rt-devel@lists.linux.dev
Cc: akpm@linux-foundation.org, surenb@google.com, mhocko@suse.com,
jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com,
rostedt@goodmis.org, mhiramat@kernel.org,
mathieu.desnoyers@efficios.com, david@kernel.org, ljs@kernel.org,
liam@infradead.org, rppt@kernel.org, bigeasy@linutronix.de,
clrkwllms@kernel.org, Hugh Dickins
In-Reply-To: <PR3PR01MB6666C11E08516555C153D4FD82EB2@PR3PR01MB6666.eurprd01.prod.exchangelabs.com>
On 6/26/26 16:45, Alexander Krabler wrote:
> On 6/24/26 13:08, Wandun wrote:
>> On 6/22/26 17:55, Vlastimil Babka (SUSE) wrote:
>>> On 6/18/26 13:43, Wandun wrote:
>>>> Yes, I wrote a test case that can reproduce it in a few seconds.
>>>>
>>>> The test case contains 3 steps:
>>>> 1. mlockall
>>>> 2. mmap file(2GB) + trigger file write page fault;
>>>> 3. during step 1, trigger compact via /proc/sys/vm/compact_memory
>>>>
>>>>
>>>> My reproduction environment is qemu with 4GB ram, 8 core, aarch64,
>>>> preempt_rt and includes the tracepoint in patch 02.
>>>> After running the reproduction program for a few seconds, the
>>>> following output appears.
>>>>
>>>> repro-403 [004] ....1 101.270505: mm_compaction_isolate_folio: pfn=0x71e3a mode=0x0 flags=referenced|uptodate|mlocked
>>>> repro-403 [004] ....1 101.270507: mm_compaction_isolate_folio: pfn=0x71e3b mode=0x0 flags=referenced|uptodate|mlocked
>>>> repro-403 [004] ....1 101.270513: mm_compaction_isolate_folio: pfn=0x71e3c mode=0x0 flags=referenced|uptodate|mlocked
>>>> repro-403 [004] ....1 101.270515: mm_compaction_isolate_folio: pfn=0x71e3d mode=0x0 flags=uptodate|mlocked
>>>> repro-403 [004] ....1 101.270517: mm_compaction_isolate_folio: pfn=0x71e3e mode=0x0 flags=uptodate|mlocked
>>>> repro-403 [004] ....1 101.270520: mm_compaction_isolate_folio: pfn=0x71e3f mode=0x0 flags=uptodate|mlocked
>
> I applied your PATCH 2/3 to our kernel and checked with your reproducer.
> I get similar output, e.g.
> t_compact-2148 [005] ....1 515.320221: mm_compaction_isolate_folio: pfn=0xe66c2 mode=0x0 flags=referenced|uptodate|active|swapbacked|mlocked
>
> With your first patch applied, the number of these messages decreases.
Some of the mlocked but not unevictable pages have been filtered out, so
the messages decrease, but the race is still there.
> I was not able to apply your third patch to our (older) kernel.
Patch 3 is not relevant to your case: the problem in your report is caused by kcompactd,
not by CMA allocation, so it would be of no use to you.
>
> However, we were not able to reproduce the actual race
> (mlockall() process waiting on a migration PTE),
> neither in the past nor now. That race might be hard to trigger.
It is not hard to trigger that case. I added a debug message, shown below,
and lots of messages appear within a few seconds.
diff --cc mm/memory.c
index ff338c2abe92,ff338c2abe92..6552b3b14f78
--- a/mm/memory.c
+++ b/mm/memory.c
@@@ -4768,6 -4768,6 +4768,8 @@@ vm_fault_t do_swap_page(struct vm_faul
if (softleaf_is_migration(entry)) {
migration_entry_wait(vma->vm_mm, vmf->pmd,
vmf->address);
+ if (!strcmp(current->comm, "repro"))
+ pr_err("============== hit ================\n");
} else if (softleaf_is_device_exclusive(entry)) {
vmf->page = softleaf_to_page(entry);
ret = remove_device_exclusive_entry(vmf);
Best regards,
Wandun
>
>> IIUC, more accurately, it is the migration entry in the page table that is really bad
>> for an RT process: isolating a page doesn't modify the page table, so memory
>> access continues as usual. That led to a new idea.
>>
>> S1. In the mlock[all] syscall, if mlock_vma_pages_range hit a migration entry,
>> then, it should wait for the migration to complete.
>>
>> S2. During the unmap phase of memory migration, prevent a page from being unmapped
>> if the page's associated vma is marked with VM_LOCKED, similar to how reclaim is
>> disabled for pages in a VM_LOCKED vma (try_to_unmap_one).
>>
>>
>> For a page handled during the mlock[all] syscall:
>> - if migration has already finished, there is nothing to do;
>> - if migration is in progress and the migration entry is already filled, we
>> wait (S1)
>> - if the page is in flight, about to be isolated/migrated, S2 prevents the unmap.
>>
>> For a page handled during a page fault: VM_LOCKED is already set on the vma,
>> so S2 guarantees it will not be unmapped, hence no migration entry.
>
> I do not understand all details of this, but it looks good,
> especially the S1 case makes a lot of sense for me.
>
> Nitpick: I suggest to switch order of PATCH 1 and 2 for the next iteration,
> introducing the tracepoint first and then improve the situation.
>
> Thanks a lot for looking into this issue!
>
> Best regards,
> Alexander
>
> --
>
> KUKA Deutschland GmbH Board of Directors: Michael Jürgens (Chairman), Johan Naten, Hui Zhang Registered Office: Augsburg HRB 14914
^ permalink raw reply
* Re: [RFC PATCH 1/3] mm/compaction: skip isolate mlocked folios when compact_unevictable_allowed=0
From: Wandun @ 2026-06-26 9:39 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-mm, linux-kernel, linux-trace-kernel, linux-rt-devel, akpm,
vbabka, surenb, mhocko, jackmanb, hannes, ziy, rostedt, mhiramat,
mathieu.desnoyers, david, ljs, liam, rppt, clrkwllms,
Alexander.Krabler
In-Reply-To: <20260626092606.7BgipTin@linutronix.de>
On 6/26/26 17:26, Sebastian Andrzej Siewior wrote:
> On 2026-06-04 10:38:10 [+0800], Wandun Chen wrote:
> …
>> Reported-by: Alexander Krabler <Alexander.Krabler@kuka.com>
>> Closes: https://lore.kernel.org/all/DU0PR01MB10385345F7153F334100981888259A@DU0PR01MB10385.eurprd01.prod.exchangelabs.com/
>> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
>> Signed-off-by: Wandun Chen <chenwandun@lixiang.com>
>> Link: https://lore.kernel.org/all/33275585-f2db-4779-89f0-3ae24b455a67@suse.cz/ [1]
>
> Is it possible to get a Fixes tag on the final fix so that it can be
> backported to stable?
Got it.
Best regards,
Wandun
>
> Sebastian
^ permalink raw reply
* Re: [PATCHv4 05/13] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Oleg Nesterov @ 2026-06-26 9:43 UTC (permalink / raw)
To: Jiri Olsa
Cc: Peter Zijlstra, Ingo Molnar, Masami Hiramatsu, Andrii Nakryiko,
bpf, linux-trace-kernel
In-Reply-To: <20260526205840.173790-6-jolsa@kernel.org>
On 05/26, Jiri Olsa wrote:
>
> which means we need to allow 0x2e prefix which maps to INAT_PFX_CS
> attribute in is_prefix_bad function.
...
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -266,7 +266,6 @@ static bool is_prefix_bad(struct insn *insn)
> attr = inat_get_opcode_attribute(p);
> switch (attr) {
> case INAT_MAKE_PREFIX(INAT_PFX_ES):
> - case INAT_MAKE_PREFIX(INAT_PFX_CS):
I know nothing about how x86 CPU works, so let me ask...
What if insn->x86_64 is false? Is it safe to allow the CS prefix in
this case?
Oleg.
^ permalink raw reply
* Re: [PATCH] tracing: eprobe: read the complete FILTER_PTR_STRING pointer
From: Steven Rostedt @ 2026-06-26 9:54 UTC (permalink / raw)
To: Masami Hiramatsu (Google); +Cc: Martin Kaiser, linux-trace-kernel, linux-kernel
In-Reply-To: <20260622125815.7416792c020bd3d81c01e51b@kernel.org>
On Mon, 22 Jun 2026 12:58:15 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> The problem is that the event does not provide the information that
> the string is in user space or not. But actually, for syscall events
> all data pointed by syscall parameter should be in the user space.
I think we should make this work then:
echo 'e:open syscalls.sys_enter_openat file=+u0($filename):ustring' > dynamic_events
That is, to have +u0() say "this is going to be dereferencing user space".
I'll add Martin's patch and see if it makes the above work.
-- Steve
^ permalink raw reply
* Re: [PATCH] tracing: eprobe: read the complete FILTER_PTR_STRING pointer
From: Martin Kaiser @ 2026-06-26 10:20 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu (Google), linux-trace-kernel, linux-kernel
In-Reply-To: <20260626055440.76c28d25@fedora>
Thus wrote Steven Rostedt (rostedt@goodmis.org):
> On Mon, 22 Jun 2026 12:58:15 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > The problem is that the event does not provide the information that
> > the string is in user space or not. But actually, for syscall events
> > all data pointed by syscall parameter should be in the user space.
> I think we should make this work then:
> echo 'e:open syscalls.sys_enter_openat file=+u0($filename):ustring' > dynamic_events
> That is, to have +u0() say "this is going to be dereferencing user space".
> I'll add Martin's patch and see if it makes the above work.
I've just tried your command with my patch. It works for me, filenames are
logged correctly.
Martin
> -- Steve
^ permalink raw reply
* Re: [RFC PATCH v2 0/4] tracing/osnoise: Track IPIs
From: Steven Rostedt @ 2026-06-26 10:26 UTC (permalink / raw)
To: Valentin Schneider
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu,
Mathieu Desnoyers, Tomas Glozar, Costa Shulyupin, Crystal Wood,
John Kacur, Ivan Pravdin, Jonathan Corbet
In-Reply-To: <20260617131803.2988989-1-vschneid@redhat.com>
On Wed, 17 Jun 2026 15:17:55 +0200
Valentin Schneider <vschneid@redhat.com> wrote:
> Hi folks,
>
> So I've seen a few times now reports of latency spikes caused by IPIs, usually
> because of isolation misconfiguration, but only detected at the tail end of
> e.g. a 24h timerlat run.
>
> It's not because those IPIs are rare, but rather that they don't by themselves
> cause a monitored CPU to reach the latency threshold, it's usually a combined
> interference that gets us there.
>
> I'd like to make it easier to detect such misconfigurations and thus IPIs
> hitting supposedly-isolated CPUs. I initially kludged a timerlat option to stop
> tracing as soon as an IPI was sent to a monitored CPU, regardless of the latency
> threshold. It sort of did the trick, but Tomáš convinced me timerlat wasn't
> really the place for that.
>
> So here's IPI tracking added to osnoise. This time around fully in userspace, as
> Tomáš pointed out to me that this will make it a lot easier to deploy to older
> kernels.
>
> Based on top of linux/next at 'next-20260616' to have the latest libsubcmd
> changes.
>
Hi Valentin,
My new job actually makes me very interested in IPI interference, and
this patch set looks *very* interesting. I'm currently finishing up my
orientation and hopefully next week I can start catching up on all my
email.
I'll try to take a deeper look at this in the coming weeks.
-- Steve
^ permalink raw reply
* Re: [PATCH v3] mm/lruvec: trace LRU add drains and drain-all requests
From: Steven Rostedt @ 2026-06-26 10:23 UTC (permalink / raw)
To: Vlastimil Babka (SUSE)
Cc: Shakeel Butt, David Hildenbrand (Arm), JP Kobryn, linux-mm, willy,
usama.arif, akpm, mhocko, mhiramat, mathieu.desnoyers, kasong,
qi.zheng, baohua, axelrasmussen, yuanchu, weixugc, chrisl,
shikemeng, nphamcs, baoquan.he, youngjun.park, linux-kernel,
linux-trace-kernel
In-Reply-To: <1136baf3-3967-4202-9eaa-5fd667c235cf@kernel.org>
On Wed, 17 Jun 2026 20:18:57 +0200
"Vlastimil Babka (SUSE)" <vbabka@kernel.org> wrote:
> Yeah and I don't recall ever that a change to a mm tracepoint would ever
> break someone who'd complain and we'd have to revert it. These are niche
> enough. So I think the risk is low.
Note, we have literally thousands of trace events already, so the
chances of one being required by an application are rather low.
Especially since access still requires root access, which limits it to
administration tooling.
That said, if you know of a tool that uses trace events, then those
that it is likely to use can become an ABI. For mm trace events,
rasdaemon is the tool to worry about.
-- Steve
^ permalink raw reply
* Re: [PATCH] tracing: eprobe: read the complete FILTER_PTR_STRING pointer
From: Steven Rostedt @ 2026-06-26 10:42 UTC (permalink / raw)
To: Martin Kaiser; +Cc: Masami Hiramatsu (Google), linux-trace-kernel, linux-kernel
In-Reply-To: <aj5SdK9gUIVoPmmE@akranes.kaiser.cx>
On Fri, 26 Jun 2026 12:20:36 +0200
Martin Kaiser <martin@kaiser.cx> wrote:
> > That is, to have +u0() say "this is going to be dereferencing user space".
>
> > I'll add Martin's patch and see if it makes the above work.
>
> I've just tried your command with my patch. It works for me, filenames are
> logged correctly.
Yep, this definitely looks like a fix. We have:
addr = rec + field->offset;
Where addr points to the location of the field on the ring buffer, thus
your change to make it:
val = *(unsigned long *)addr;
reads the full "long size" of the event on the ring buffer, instead of
reading just one byte. It is "val" that gets dereferenced later by the
probe logic (the "+u0()"), which has all the protections we need.
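The difference described above can be sketched in plain userspace C. This is
only an illustration under stated assumptions: read_field_byte() and
read_field_long() are hypothetical helper names, the record is an ordinary
byte array rather than a ring-buffer event, and the one-byte variant is an
assumption about what the pre-fix read amounted to. memcpy() is used so the
sketch is safe for unaligned offsets in userspace; the patch itself uses a
direct cast.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of the pre-fix behavior: only the first byte stored
 * at the field's offset is read, so most of the pointer value is lost.
 */
static unsigned long read_field_byte(const unsigned char *rec, size_t offset)
{
	assert(rec != NULL);
	return rec[offset];
}

/*
 * Model of the fixed behavior: read sizeof(unsigned long) bytes from the
 * record, i.e. the complete pointer value stored in the field.
 */
static unsigned long read_field_long(const unsigned char *rec, size_t offset)
{
	unsigned long val;

	assert(rec != NULL);
	memcpy(&val, rec + offset, sizeof(val));
	return val;
}
```

With a pointer-sized value stored at the field offset, read_field_long()
returns the whole value, while read_field_byte() can never return more than
0xff, which is why the later dereference needs the full-width read.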
I'll queue this up.
Thanks!
-- Steve
^ permalink raw reply
* Re: [PATCH 0/2] rtla: Add tests for option parsing with attached arguments
From: Tomas Glozar @ 2026-06-26 12:22 UTC (permalink / raw)
To: John Kacur; +Cc: linux-trace-kernel, Steven Rostedt, linux-kernel
In-Reply-To: <20260602155210.60439-1-jkacur@redhat.com>
út 2. 6. 2026 v 17:52 odesílatel John Kacur <jkacur@redhat.com> napsal:
>
> Note: Patch 1/2 is a resend of the timerlat hist tests sent previously.
> Patch 2/2 adds tests for the remaining rtla commands.
>
Ah, this confused me. I saw cover letters, 1/2, and 2/2 with almost
the same title and content, and wondered what was going on.
> Signed-off-by: John Kacur <jkacur@redhat.com>
>
> John Kacur (2):
> rtla/timerlat: Add tests for option parsing with attached arguments
> rtla: Add tests for option parsing with attached arguments
>
> tools/tracing/rtla/tests/hwnoise.t | 10 ++++++++++
> tools/tracing/rtla/tests/osnoise.t | 18 ++++++++++++++++++
> tools/tracing/rtla/tests/timerlat.t | 18 ++++++++++++++++++
This should be either one patch, or three patches (hwnoise, osnoise,
timerlat); especially timerlat top and timerlat hist should not be
split between two commits, as the tests are in the same file. Also,
identical tests for both top and hist should use check_top_hist or
check_top_q_hist, see commit c15c55c01e48 ("rtla/tests: Cover both top
and hist tools where possible").
Anyway, I don't think this needs runtime tests. CLI unit tests (well,
actually more like integration tests, as RTLA design doesn't have
proper isolated unit tests) in tools/tracing/rtla/tests/unit/*_cli.c
should be able to fully cover this, with the benefit of being much
faster and not requiring root or any kernel features. Do you have any
concerns that cannot be covered by unit testing?
> 3 files changed, 46 insertions(+)
>
> --
> 2.54.0
>
Tomas
^ permalink raw reply
* Re: [RFC PATCH v2 0/4] tracing/osnoise: Track IPIs
From: Valentin Schneider @ 2026-06-26 12:25 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu,
Mathieu Desnoyers, Tomas Glozar, Costa Shulyupin, Crystal Wood,
John Kacur, Ivan Pravdin, Jonathan Corbet
In-Reply-To: <20260626062658.7f95bcad@fedora>
On 26/06/26 06:26, Steven Rostedt wrote:
> On Wed, 17 Jun 2026 15:17:55 +0200
> Valentin Schneider <vschneid@redhat.com> wrote:
>
>> Hi folks,
>>
>> So I've seen a few times now reports of latency spikes caused by IPIs, usually
>> because of isolation misconfiguration, but only detected at the tail end of
>> e.g. a 24h timerlat run.
>>
>> It's not because those IPIs are rare, but rather that they don't by themselves
>> cause a monitored CPU to reach the latency threshold, it's usually a combined
>> interference that gets us there.
>>
>> I'd like to make it easier to detect such misconfigurations and thus IPIs
>> hitting supposedly-isolated CPUs. I initially kludged a timerlat option to stop
>> tracing as soon as an IPI was sent to a monitored CPU, regardless of the latency
>> threshold. It sort of did the trick, but Tomáš convinced me timerlat wasn't
>> really the place for that.
>>
>> So here's IPI tracking added to osnoise. This time around fully in userspace, as
>> Tomáš pointed out to me that this will make it a lot easier to deploy to older
>> kernels.
>>
>> Based on top of linux/next at 'next-20260616' to have the latest libsubcmd
>> changes.
>>
>
> Hi Valentin,
>
> My new job actually makes me very interested in IPI interference, and
> this patch set looks *very* interesting. I'm currently finishing up my
> orientation and hopefully next week I can start catching up on all my
> email.
>
Welcome back :-) If IPIs are your thing, you may also have a look at
[1]. I'm working on a v10 following some (surprisingly) useful feedback
from Sashiko.
[1]: https://lore.kernel.org/lkml/20260505082355.1982003-1-vschneid@redhat.com/
> I'll try to take a deeper look at this in the coming weeks.
>
Thanks!
> -- Steve
^ permalink raw reply
* [PATCH v7 1/9] bootconfig: fix NULL-pointer arithmetic in xbc_snprint_cmdline()
From: Breno Leitao @ 2026-06-26 12:50 UTC (permalink / raw)
To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
Nicolas Schier, Nick Desaulniers, Bill Wendling, Justin Stitt,
Jonathan Corbet, Shuah Khan
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
bpf, llvm, linux-doc, Breno Leitao, kernel-team
In-Reply-To: <20260626-bootconfig_using_tools-v7-0-24ab72139c29@debian.org>
xbc_snprint_cmdline() is meant to be called twice: first with
buf=NULL, size=0 to probe the rendered length, then with a real
buffer to fill it (the standard snprintf() two-pass pattern). The
probe call makes the function compute "buf + size" (NULL + 0) and,
on every iteration, advance "buf += ret" from that NULL base and
pass the result back into snprintf().
Pointer arithmetic on a NULL pointer is undefined behavior. It is
harmless in the in-kernel callers today, but the follow-up patches
run this same code in the userspace tools/bootconfig parser at kernel
build time, where host UBSan / FORTIFY_SOURCE abort the build.
Track a running written length (size_t) instead of mutating @buf, and
only form "buf + len" when @buf is non-NULL. snprintf(NULL, 0, ...)
is itself well defined and returns the would-be length, so the
two-pass "probe then fill" usage returns identical byte counts.
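As an illustration only (not the kernel code), the probe-then-fill pattern
with a running length can be sketched in userspace C. The names
render_pairs() and space_left() are hypothetical, and the key/value arrays
stand in for the bootconfig tree walk. Unlike the patch, this sketch also
avoids forming "buf + len" once len has reached size, which is an extra
precaution assumed here rather than something the patch does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Space left in the buffer; 0 once len has reached or passed size. */
static size_t space_left(size_t len, size_t size)
{
	return len < size ? size - len : 0;
}

/*
 * Render "key=value " pairs. With buf == NULL and size == 0 this only
 * measures; no pointer arithmetic is ever done on a NULL buf.
 * Returns the would-be length (excluding the NUL) or a negative error.
 */
static int render_pairs(char *buf, size_t size, const char *const *keys,
			const char *const *vals, size_t n)
{
	size_t len = 0;
	size_t i;
	int ret;

	assert(keys != NULL && vals != NULL);
	for (i = 0; i < n; i++) {
		/* Quote values that contain whitespace. */
		const char *q = strpbrk(vals[i], " \t\r\n") ? "\"" : "";
		/* Only form buf + len while it stays inside the buffer. */
		char *dst = (buf && len < size) ? buf + len : NULL;

		ret = snprintf(dst, dst ? space_left(len, size) : 0,
			       "%s=%s%s%s ", keys[i], q, vals[i], q);
		if (ret < 0)
			return ret;
		len += (size_t)ret;	/* track length, never mutate buf */
	}
	return (int)len;
}
```

A caller would first invoke render_pairs(NULL, 0, ...) to learn the length,
allocate length + 1 bytes, and call it again with the real buffer; both
calls are expected to return the same count.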
Signed-off-by: Breno Leitao <leitao@debian.org>
---
lib/bootconfig.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)
diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index f445b7703fdd9..2ed9ee3dc81c7 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -427,10 +427,18 @@ static char xbc_namebuf[XBC_KEYLEN_MAX] __initdata;
int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root)
{
struct xbc_node *knode, *vnode;
- char *end = buf + size;
const char *val, *q;
+ size_t len = 0;
int ret;
+ /*
+ * Track the running written length rather than advancing @buf, so we
+ * never form "buf + size" or "buf += ret" while @buf is NULL (the
+ * size-probe call passes buf=NULL, size=0). NULL pointer arithmetic
+ * is undefined behavior and trips host UBSan / FORTIFY_SOURCE when
+ * this renderer runs at kernel build time. snprintf(NULL, 0, ...)
+ * itself is well defined and returns the would-be length.
+ */
xbc_node_for_each_key_value(root, knode, val) {
ret = xbc_node_compose_key_after(root, knode,
xbc_namebuf, XBC_KEYLEN_MAX);
@@ -439,10 +447,11 @@ int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root)
vnode = xbc_node_get_child(knode);
if (!vnode) {
- ret = snprintf(buf, rest(buf, end), "%s ", xbc_namebuf);
+ ret = snprintf(buf ? buf + len : NULL, rest(len, size),
+ "%s ", xbc_namebuf);
if (ret < 0)
return ret;
- buf += ret;
+ len += ret;
continue;
}
xbc_array_for_each_value(vnode, val) {
@@ -452,15 +461,15 @@ int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root)
* whitespace.
*/
q = strpbrk(val, " \t\r\n") ? "\"" : "";
- ret = snprintf(buf, rest(buf, end), "%s=%s%s%s ",
- xbc_namebuf, q, val, q);
+ ret = snprintf(buf ? buf + len : NULL, rest(len, size),
+ "%s=%s%s%s ", xbc_namebuf, q, val, q);
if (ret < 0)
return ret;
- buf += ret;
+ len += ret;
}
}
- return buf - (end - size);
+ return len;
}
#undef rest
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH v7 0/9] bootconfig: embed kernel.* cmdline at build time
From: Breno Leitao @ 2026-06-26 12:50 UTC (permalink / raw)
To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
Nicolas Schier, Nick Desaulniers, Bill Wendling, Justin Stitt,
Jonathan Corbet, Shuah Khan
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
bpf, llvm, linux-doc, Breno Leitao, kernel-team, Nicolas Schier
The userspace pieces (xbc_snprint_cmdline() in lib/, tools/bootconfig -C)
already landed; this series wires the rendered cmdline into the kernel.
Motivation: today the embedded bootconfig is parsed at runtime, after
parse_early_param() has already run, so early_param() handlers can't
see embedded values. Folding the kernel.* subtree into the cmdline at
build time gives a CONFIG_CMDLINE-equivalent for embedded-bootconfig
users without forcing them to maintain two cmdline sources.
Behaviorally, the "kernel" subtree is rendered to a flat string at
build time and stashed in .init.rodata. setup_arch() prepends it to
boot_command_line before parse_early_param() runs. Overflow is a soft
error: the helper logs and leaves boot_command_line untouched rather
than panicking, so an oversized embedded bconf cannot brick a boot.
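The soft-error behavior described above can be sketched in userspace C. This
is a minimal illustration, not the series' xbc_prepend_embedded_cmdline():
prepend_cmdline() is a hypothetical name, the buffer capacity stands in for
COMMAND_LINE_SIZE, and the logging done by the real helper is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: prepend @extra (plus a separating space) to the
 * NUL-terminated string in @cmdline, whose buffer holds @cap bytes.
 * Returns 0 on success, -1 on overflow; on overflow @cmdline is untouched.
 */
static int prepend_cmdline(char *cmdline, size_t cap, const char *extra)
{
	size_t elen, clen, sep;

	assert(cmdline != NULL && extra != NULL);
	elen = strlen(extra);
	clen = strlen(cmdline);
	if (elen == 0)
		return 0;	/* nothing to prepend */

	sep = clen ? 1 : 0;	/* separator only if something follows */
	if (elen + sep + clen + 1 > cap)
		return -1;	/* soft error: leave the cmdline as it was */

	/* Shift the existing text (including its NUL) to make room. */
	memmove(cmdline + elen + sep, cmdline, clen + 1);
	memcpy(cmdline, extra, elen);
	if (sep)
		cmdline[elen] = ' ';
	return 0;
}
```

The size check happens before any byte is moved, so a too-large embedded
string cannot leave a half-written command line behind.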
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v7:
- The runtime opt-in now shares one helper instead of open-coding its
own. (Masami)
- bootconfig_cmdline_requested() moved into generic lib code (Masami)
- Link to v6: https://lore.kernel.org/r/20260623-bootconfig_using_tools-v6-0-640c2f587a3c@debian.org
Changes in v6:
- renamed CONFIG_BOOT_CONFIG_EMBED_CMDLINE to
CONFIG_CMDLINE_FROM_BOOTCONFIG
- prepend embedded bootconfig cmdline before parse_early_param
- Link to v5: https://lore.kernel.org/r/20260617-bootconfig_using_tools-v5-0-fd589a9cc5e3@debian.org
Changes in v5:
- Patch 3 (Kconfig): drop the redundant "depends on BOOT_CONFIG_EMBED"
from CMDLINE_FROM_BOOTCONFIG; Julian Braha.
- Patch 6 (Documentation): spell out how the embedded cmdline interacts
with the bootloader cmdline, an initrd bootconfig, and the embedded
bootconfig
- Link to v4: https://lore.kernel.org/r/20260609-bootconfig_using_tools-v4-0-73c463f03a97@debian.org
Changes in v4:
- Patch 3 (build pipeline): clear CROSS_COMPILE= in the kernel-side
tools/bootconfig sub-make. Without it, an LLVM=1 cross build
inherits CROSS_COMPILE and tools/scripts/Makefile.include injects
--target=/--sysroot= into the host clang, producing a target
binary that fails to exec.
- Patch 3 (build pipeline): place embedded-cmdline.S in its own
.init.rodata.embed_cmdline subsection ("a") so ld.lld does not
see a section-type mismatch against lib/bootconfig-data.S's
writable .init.rodata ("aw"). The linker's *(.init.rodata
.init.rodata.*) glob still folds it into the init image.
- Patch 6 (x86/setup): also accept the bootconfig=<anything> form
via cmdline_find_option(), matching the runtime parse_args() loop.
Without it, bootconfig=0/=off would skip the early prepend but
still trigger the late runtime apply -- a split-brain state.
- New patch 7: document CONFIG_CMDLINE_FROM_BOOTCONFIG in
Documentation/admin-guide/bootconfig.rst (semantics, opt-in,
precedence, overflow behavior, example).
- Link to v3: https://lore.kernel.org/r/20260608-bootconfig_using_tools-v3-0-4ddd079a0696@debian.org
Changes in v3:
- Patch 3: Move HOSTCC override to the kernel-side rule; tool keeps
$(CC) for standalone/cross builds.
- Patch 6: Drop the false fail-safe wording; document the
BOOT_CONFIG_FORCE=y default interaction.
- Link to v2:
https://lore.kernel.org/r/20260605-bootconfig_using_tools-v2-0-d309f544b5f7@debian.org
Changes in v2 (addressing review of v1):
- Split out a standalone fix for the NULL-pointer arithmetic in
xbc_snprint_cmdline() so the build-time render cannot trip host
UBSan/FORTIFY_SOURCE.
- Rework the leaf-root handling: instead of returning early, skip @root
inside the loop so a root carrying both a value and subkeys
(kernel = x together with kernel.foo = bar) still renders its
descendant keys.
- Build tools/bootconfig with $(HOSTCC) so cross-compiled (ARCH=...)
builds render the cmdline on the build host instead of failing with
"Exec format error".
- Mark the embedded cmdline section read-only (drop the "w" flag from
.init.rodata).
- Add a make-clean hook so tools/bootconfig artifacts are removed by
make clean.
- Gate the x86 prepend on "bootconfig" being present on the command
line (or CONFIG_BOOT_CONFIG_FORCE), matching the init.* opt-in
semantics documented in bootconfig.rst and preserving fail-safe
recovery: dropping "bootconfig" from the bootloader cmdline now also
disables the embedded kernel.* keys.
- Link to v1: https://patch.msgid.link/20260527-bootconfig_using_tools-v1-0-b6906a86e7d5@debian.org
---
Breno Leitao (9):
bootconfig: fix NULL-pointer arithmetic in xbc_snprint_cmdline()
bootconfig: render descendant keys when xbc_snprint_cmdline() root has a value
bootconfig: render embedded bootconfig as a kernel cmdline at build time
bootconfig: clean build-time tools/bootconfig from make clean
bootconfig: add xbc_prepend_embedded_cmdline() helper
Documentation: bootconfig: document build-time cmdline rendering
x86/setup: prepend embedded bootconfig cmdline before parse_early_param
bootconfig: skip runtime kernel.* render once prepended early
init/main.c: use bootconfig_cmdline_requested() for the runtime opt-in
Documentation/admin-guide/bootconfig.rst | 81 ++++++++++++++++
MAINTAINERS | 1 +
Makefile | 27 +++++-
arch/x86/Kconfig | 1 +
arch/x86/kernel/setup.c | 14 ++-
include/linux/bootconfig.h | 14 +++
init/Kconfig | 36 +++++++
init/main.c | 52 +++++-----
lib/Makefile | 16 +++
lib/bootconfig.c | 162 +++++++++++++++++++++++++++++--
lib/embedded-cmdline.S | 16 +++
tools/bootconfig/Makefile | 4 +-
12 files changed, 388 insertions(+), 36 deletions(-)
---
base-commit: a87737435cfa134f9cdcc696ba3080759d04cf72
change-id: 20260508-bootconfig_using_tools-cfa7aa9d6a5a
Best regards,
--
Breno Leitao <leitao@debian.org>
^ permalink raw reply
* [PATCH v7 2/9] bootconfig: render descendant keys when xbc_snprint_cmdline() root has a value
From: Breno Leitao @ 2026-06-26 12:50 UTC (permalink / raw)
To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
Nicolas Schier, Nick Desaulniers, Bill Wendling, Justin Stitt,
Jonathan Corbet, Shuah Khan
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
bpf, llvm, linux-doc, Breno Leitao, kernel-team
In-Reply-To: <20260626-bootconfig_using_tools-v7-0-24ab72139c29@debian.org>
xbc_node_for_each_key_value() walks to the first leaf under @root, and
when @root is itself a leaf it yields @root. That happens not only for
an empty "kernel {}" subtree, but also when @root carries both a value
and subkeys, e.g.
kernel = x
kernel.foo = bar
Here @root ("kernel") is a leaf because its first child is the value
node "x", so the iterator returns @root first. Feeding @root back into
xbc_node_compose_key_after(root, root) returns -EINVAL, which the only
in-kernel caller papers over with a "len <= 0" check -- but the
follow-up tools/bootconfig -C user propagates the error and turns such
a bootconfig into a build failure. Worse, short-circuiting the whole
call on a leaf @root would silently drop the valid "kernel.foo = bar"
descendant that this patch should render.
Skip @root inside the loop instead of bailing out: the value-only entry
is dropped (it is rendered through the "kernel" cmdline path, not here),
while real descendant keys are still emitted. An entirely empty subtree
now renders nothing and returns 0 rather than -EINVAL, matching the
"nothing to render is not an error" semantics expected by the new
build-time caller.
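The "skip @root, keep the descendants" behavior can be shown with a toy
model in userspace C. Everything here is hypothetical: struct toy_node,
toy_compose_key() and toy_render() are stand-ins invented for illustration,
and the "visited" array simply lists the nodes an iterator would yield; it
is not the xbc_node API.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Toy node: a key relative to the root (NULL for the root) and a value. */
struct toy_node {
	const char *relkey;
	const char *val;
};

/* Fails for the root, like composing a key "after" the node itself. */
static int toy_compose_key(const struct toy_node *root,
			   const struct toy_node *node, char *buf, size_t size)
{
	if (node == root)
		return -EINVAL;
	return snprintf(buf, size, "%s", node->relkey);
}

/*
 * Render every visited node as "key=val ", skipping the root if the
 * iteration yields it. Returns the length written, 0 if nothing rendered.
 */
static int toy_render(const struct toy_node *root,
		      const struct toy_node *const *visited, size_t n,
		      char *out, size_t size)
{
	char key[64];
	size_t len = 0, i;
	int ret;

	assert(out != NULL && size > 0);
	out[0] = '\0';
	for (i = 0; i < n; i++) {
		if (visited[i] == root)
			continue;	/* empty or value-only root */
		ret = toy_compose_key(root, visited[i], key, sizeof(key));
		if (ret < 0)
			return ret;
		ret = snprintf(out + len, size - len, "%s=%s ", key,
			       visited[i]->val);
		if (ret < 0)
			return ret;
		if ((size_t)ret >= size - len)
			return -ENOSPC;	/* toy model: no truncation */
		len += (size_t)ret;
	}
	return (int)len;
}
```

In this model, a root carrying the value "x" plus a child "foo = bar" renders
"foo=bar ", and a root with no descendants renders nothing and returns 0
instead of an error.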
Signed-off-by: Breno Leitao <leitao@debian.org>
---
lib/bootconfig.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index 2ed9ee3dc81c7..926094d97397e 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -440,6 +440,17 @@ int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root)
* itself is well defined and returns the would-be length.
*/
xbc_node_for_each_key_value(root, knode, val) {
+ /*
+ * An empty or value-only @root (e.g. "kernel {}" or
+ * "kernel = x", possibly alongside "kernel.foo = bar")
+ * yields @root itself here. Skip it: composing a key for it
+ * would fail with -EINVAL, yet any real descendant keys must
+ * still be rendered. An entirely empty subtree then renders
+ * nothing and returns 0 rather than an error.
+ */
+ if (knode == root)
+ continue;
+
ret = xbc_node_compose_key_after(root, knode,
xbc_namebuf, XBC_KEYLEN_MAX);
if (ret < 0)
--
2.53.0-Meta
^ permalink raw reply related
* [PATCH v7 3/9] bootconfig: render embedded bootconfig as a kernel cmdline at build time
From: Breno Leitao @ 2026-06-26 12:50 UTC (permalink / raw)
To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
Nicolas Schier, Nick Desaulniers, Bill Wendling, Justin Stitt,
Jonathan Corbet, Shuah Khan
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
bpf, llvm, linux-doc, Breno Leitao, kernel-team, Nicolas Schier
In-Reply-To: <20260626-bootconfig_using_tools-v7-0-24ab72139c29@debian.org>
Add the build-time pipeline that renders the "kernel" subtree of
CONFIG_BOOT_CONFIG_EMBED_FILE into a flat cmdline string and stashes
it in .init.rodata as embedded_kernel_cmdline[]. A follow-up patch
adds the runtime helper that prepends this string to boot_command_line
during early architecture setup so parse_early_param() sees the values.
The build wires up:
tools/bootconfig -C kernel - userspace tool already shared with
lib/bootconfig.c, used here in -C mode
to render a bootconfig file to a cmdline
lib/embedded-cmdline.S - .incbin's the rendered text plus a NUL
(listed under the EXTRA BOOT CONFIG
MAINTAINERS entry)
lib/Makefile rule - runs tools/bootconfig at build time
Makefile prepare dep - ensures tools/bootconfig is built first,
same pattern as tools/objtool and
tools/bpf/resolve_btfids
Drop the test target from tools/bootconfig/Makefile's default 'all'
recipe so that hooking the binary into the kernel build does not run
test-bootconfig.sh on every prepare. The tests stay available as
'make -C tools/bootconfig test', matching the convention of
tools/objtool and tools/bpf/resolve_btfids whose 'all' targets only
build the binary.
Require BOOT_CONFIG_EMBED_FILE to be non-empty before the new option
can be enabled, otherwise tools/bootconfig -C runs against an empty
file and prints a parse error on every kernel build.
The feature gates on CONFIG_ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG, a
silent symbol arches select once they've wired the prepend call into
setup_arch(). No arch selects it in this patch, so the user-visible
CONFIG_CMDLINE_FROM_BOOTCONFIG is not yet enableable; when an arch
later opts in, the runtime behavior is added by the follow-up patches.
tools/bootconfig also installs on target systems, so its own Makefile
keeps $(CC) and stays cross-buildable as a standalone tool. The kernel
build, which runs the tool on the build host during prepare, instead
forces CC=$(HOSTCC) from a dedicated tools/bootconfig rule and clears
CROSS_COMPILE= in the sub-make. Without that clear, an LLVM=1 cross
build would inherit CROSS_COMPILE and tools/scripts/Makefile.include
would inject --target=/--sysroot= flags into the host clang invocation,
producing a target binary that fails to exec ("Exec format error").
embedded-cmdline.S places the rendered string in its own .init.rodata
subsection (.init.rodata.embed_cmdline) with the "a" (allocatable,
read-only) flag and %progbits. lib/bootconfig-data.S already places
the embedded bootconfig blob in .init.rodata with the "aw" flag
(xbc_init() rewrites separators in place, so that data must be
writable). Using a distinct subsection name avoids the ld.lld section-
type mismatch that would otherwise arise from mixing "a" and "aw"
under the same name; the linker's "*(.init.rodata .init.rodata.*)"
glob still folds both into the init image and frees them after boot.
A follow-up patch wires the build-time tools/bootconfig into the
top-level clean target.
Reviewed-by: Nicolas Schier <n.schier@fritz.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
MAINTAINERS | 1 +
Makefile | 16 ++++++++++++++++
init/Kconfig | 36 ++++++++++++++++++++++++++++++++++++
lib/Makefile | 16 ++++++++++++++++
lib/embedded-cmdline.S | 16 ++++++++++++++++
tools/bootconfig/Makefile | 2 +-
6 files changed, 86 insertions(+), 1 deletion(-)
diff --git a/MAINTAINERS b/MAINTAINERS
index 57656ec0e9d5d..953231df1911d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9844,6 +9844,7 @@ F: fs/proc/bootconfig.c
F: include/linux/bootconfig.h
F: lib/bootconfig-data.S
F: lib/bootconfig.c
+F: lib/embedded-cmdline.S
F: tools/bootconfig/*
F: tools/bootconfig/scripts/*
diff --git a/Makefile b/Makefile
index bf196c6df5b92..5255aa35a2e51 100644
--- a/Makefile
+++ b/Makefile
@@ -1545,6 +1545,22 @@ prepare: tools/bpf/resolve_btfids
endif
endif
+# tools/bootconfig renders the embedded bootconfig into a cmdline at build time.
+ifdef CONFIG_CMDLINE_FROM_BOOTCONFIG
+prepare: tools/bootconfig
+endif
+
+# tools/bootconfig is run on the build host during prepare, so force a host
+# binary here; its own Makefile keeps $(CC) for standalone and cross builds.
+# CROSS_COMPILE= is cleared so tools/scripts/Makefile.include does not inject
+# the target's --target=/--sysroot= flags into the host clang invocation under
+# LLVM=1 cross builds (which would produce a target binary that fails to exec).
+tools/bootconfig: export CC := $(HOSTCC)
+tools/bootconfig: FORCE
+ $(Q)mkdir -p $(objtree)/tools
+ $(Q)$(MAKE) O=$(abspath $(objtree)) subdir=tools -C $(srctree)/tools/ \
+ bootconfig CROSS_COMPILE=
+
# The tools build system is not a part of Kbuild and tends to introduce
# its own unique issues. If you need to integrate a new tool into Kbuild,
# please consider locating that tool outside the tools/ tree and using the
diff --git a/init/Kconfig b/init/Kconfig
index 5230d4879b1c8..598690ec313a2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1566,6 +1566,42 @@ config BOOT_CONFIG_EMBED_FILE
This bootconfig will be used if there is no initrd or no other
bootconfig in the initrd.
+config ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
+ bool
+ help
+ Silent symbol; no C code reads it directly. Architectures
+ select it once their setup_arch() calls
+ xbc_prepend_embedded_cmdline() before parse_early_param().
+ Its only role is to gate the user-visible
+ CMDLINE_FROM_BOOTCONFIG option per-arch, the same
+ ARCH_SUPPORTS_* idiom used by ARCH_SUPPORTS_CFI, etc.
+
+config CMDLINE_FROM_BOOTCONFIG
+ bool "Render embedded bootconfig as kernel cmdline at build time"
+ depends on BOOT_CONFIG_EMBED_FILE != ""
+ depends on ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
+ depends on CMDLINE = ""
+ default n
+ help
+ Render the "kernel" subtree of the embedded bootconfig file into a
+ flat cmdline string at kernel build time and prepend it to
+ boot_command_line during early architecture setup. This makes
+ early_param() handlers (e.g. mem=, earlycon=, loglevel=) see the
+ values supplied via the embedded bootconfig.
+
+ The runtime bootconfig parser is unaffected, so tree-structured
+ consumers such as ftrace boot-time tracing keep working.
+
+ Note: when an initrd also carries a bootconfig, its "kernel"
+ subtree is still parsed at runtime, but the embedded "kernel"
+ keys remain in boot_command_line for parse_early_param() and
+ end up later than the initrd keys in saved_command_line, so
+ parse_args() last-wins favors the embedded values. If you need
+ initrd to override embedded kernel.* keys, leave this option
+ off.
+
+ If unsure, say N.
+
config CMDLINE_LOG_WRAP_IDEAL_LEN
int "Length to try to wrap the cmdline when logged at boot"
default 1021
diff --git a/lib/Makefile b/lib/Makefile
index 7f75cc6edf94a..4ccdce2fd5e5b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -273,6 +273,22 @@ filechk_defbconf = cat $(or $(real-prereqs), /dev/null)
$(obj)/default.bconf: $(CONFIG_BOOT_CONFIG_EMBED_FILE) FORCE
$(call filechk,defbconf)
+obj-$(CONFIG_CMDLINE_FROM_BOOTCONFIG) += embedded-cmdline.o
+$(obj)/embedded-cmdline.o: $(obj)/embedded_cmdline.bin
+
+# Render the bootconfig "kernel" subtree to a flat cmdline string using
+# the userspace tools/bootconfig parser (-C mode). The runtime prepend
+# helper enforces COMMAND_LINE_SIZE at boot, so no build-time size
+# check is performed here (COMMAND_LINE_SIZE is an arch header
+# constant, not a Kconfig value).
+quiet_cmd_render_cmdline = BCONF2C $@
+ cmd_render_cmdline = \
+ $(objtree)/tools/bootconfig/bootconfig -C $< > $@
+
+targets += embedded_cmdline.bin
+$(obj)/embedded_cmdline.bin: $(obj)/default.bconf $(objtree)/tools/bootconfig/bootconfig FORCE
+ $(call if_changed,render_cmdline)
+
obj-$(CONFIG_RBTREE_TEST) += rbtree_test.o
obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o
diff --git a/lib/embedded-cmdline.S b/lib/embedded-cmdline.S
new file mode 100644
index 0000000000000..bda81b4a42bea
--- /dev/null
+++ b/lib/embedded-cmdline.S
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Embed the build-time-rendered bootconfig "kernel" subtree as a flat
+ * cmdline string. setup_arch() prepends this to boot_command_line on
+ * architectures that select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+ * Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+ */
+ .section .init.rodata.embed_cmdline, "a", %progbits
+ .global embedded_kernel_cmdline
+embedded_kernel_cmdline:
+ .incbin "lib/embedded_cmdline.bin"
+ .byte 0
+ .global embedded_kernel_cmdline_end
+embedded_kernel_cmdline_end:
diff --git a/tools/bootconfig/Makefile b/tools/bootconfig/Makefile
index 90eb47c9d8de6..4e82fd9553cde 100644
--- a/tools/bootconfig/Makefile
+++ b/tools/bootconfig/Makefile
@@ -15,7 +15,7 @@ override CFLAGS += -Wall -g -I$(CURDIR)/include
ALL_TARGETS := bootconfig
ALL_PROGRAMS := $(patsubst %,$(OUTPUT)%,$(ALL_TARGETS))
-all: $(ALL_PROGRAMS) test
+all: $(ALL_PROGRAMS)
$(OUTPUT)bootconfig: main.c include/linux/bootconfig.h $(LIBSRC)
$(CC) $(filter %.c,$^) $(CFLAGS) $(LDFLAGS) -o $@
--
2.53.0-Meta
* [PATCH v7 4/9] bootconfig: clean build-time tools/bootconfig from make clean
From: Breno Leitao @ 2026-06-26 12:50 UTC (permalink / raw)
To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
Nicolas Schier, Nick Desaulniers, Bill Wendling, Justin Stitt,
Jonathan Corbet, Shuah Khan
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
bpf, llvm, linux-doc, Breno Leitao, kernel-team, Nicolas Schier
In-Reply-To: <20260626-bootconfig_using_tools-v7-0-24ab72139c29@debian.org>
The previous patch builds tools/bootconfig during 'make prepare' to
render the embedded bootconfig cmdline, but nothing removes it on
'make clean', leaving the compiled tool and its objects behind.
Wire a bootconfig_clean hook into the top-level clean target so that
'make clean' removes them, the same way it already cleans the
prepare-wired tools/objtool and tools/bpf/resolve_btfids.
The hook runs tools/bootconfig's Makefile via $(MAKE), which the kernel
build invokes with -rR (MAKEFLAGS += -rR). -rR drops the built-in $(RM)
variable, so the existing "$(RM) -f ..." clean recipe would expand to a
bare "-f ..." and fail. Spell the recipe with a literal "rm -f" so it
keeps working both standalone and when invoked from Kbuild.
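The -rR effect described above can be reproduced outside the kernel tree.
The following is only a minimal sketch, assuming GNU make is installed; the
scratch directory, the "show" target and the shell variable names are
hypothetical and are not part of this patch.

```shell
# Hypothetical scratch Makefile, used only for this demonstration.
tmpdir=$(mktemp -d)
printf 'show:\n\t@echo "RM=[$(RM)]"\n' > "$tmpdir/Makefile"

# Default invocation: GNU make predefines RM as "rm -f".
with_builtins=$(make -s -f "$tmpdir/Makefile" show)

# With -rR (the flags Kbuild passes down), built-in rules and variables
# are disabled, so $(RM) expands to nothing.
without_builtins=$(make -s -rR -f "$tmpdir/Makefile" show)

echo "$with_builtins"
echo "$without_builtins"
rm -rf "$tmpdir"
```

With GNU make, the first line printed should be "RM=[rm -f]" and the second
"RM=[]", which is why a recipe spelled "$(RM) -f ..." degrades to a bare
"-f ..." under Kbuild while a literal "rm -f" does not.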
Reviewed-by: Nicolas Schier <n.schier@fritz.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Makefile | 11 ++++++++++-
tools/bootconfig/Makefile | 2 +-
2 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/Makefile b/Makefile
index 5255aa35a2e51..20a2bcacde3b8 100644
--- a/Makefile
+++ b/Makefile
@@ -1587,6 +1587,15 @@ ifneq ($(wildcard $(objtool_O)),)
$(Q)$(MAKE) -sC $(abs_srctree)/tools/objtool O=$(objtool_O) srctree=$(abs_srctree) $(patsubst objtool_%,%,$@)
endif
+PHONY += bootconfig_clean
+
+bootconfig_O = $(abspath $(objtree))/tools/bootconfig
+
+bootconfig_clean:
+ifneq ($(wildcard $(bootconfig_O)),)
+ $(Q)$(MAKE) -sC $(srctree)/tools/bootconfig O=$(bootconfig_O) clean
+endif
+
tools/: FORCE
$(Q)mkdir -p $(objtree)/tools
$(Q)$(MAKE) O=$(abspath $(objtree)) subdir=tools -C $(srctree)/tools/
@@ -1757,7 +1766,7 @@ vmlinuxclean:
$(Q)$(CONFIG_SHELL) $(srctree)/scripts/link-vmlinux.sh clean
$(Q)$(if $(ARCH_POSTLINK), $(MAKE) -f $(ARCH_POSTLINK) clean)
-clean: archclean vmlinuxclean resolve_btfids_clean objtool_clean
+clean: archclean vmlinuxclean resolve_btfids_clean objtool_clean bootconfig_clean
# mrproper - Delete all generated files, including .config
#
diff --git a/tools/bootconfig/Makefile b/tools/bootconfig/Makefile
index 4e82fd9553cde..3cb8066d5141b 100644
--- a/tools/bootconfig/Makefile
+++ b/tools/bootconfig/Makefile
@@ -27,4 +27,4 @@ install: $(ALL_PROGRAMS)
install $(OUTPUT)bootconfig $(DESTDIR)$(bindir)
clean:
- $(RM) -f $(OUTPUT)*.o $(ALL_PROGRAMS)
+ rm -f $(OUTPUT)*.o $(ALL_PROGRAMS)
--
2.53.0-Meta