* Re: [PATCH 2/2] selftests/ftrace: Account for 8-byte aligned trace_marker_raw events
From: Hui Wang @ 2026-06-08 14:51 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: rostedt, mathieu.desnoyers, pjw, linux-trace-kernel, shuah,
wangfushuai, linux-kselftest
In-Reply-To: <20260608181716.726cb9c81d41d49095e7f3cf@kernel.org>
On 6/8/26 17:17, Masami Hiramatsu (Google) wrote:
> On Sun, 7 Jun 2026 15:24:31 +0800
> Hui Wang <hui.wang@canonical.com> wrote:
>
[...]
> + for config_file in \
> + /boot/config-$uname_r \
> + /lib/modules/$uname_r/config \
> + /lib/modules/$uname_r/build/.config
>
> Hmm, also I don't like this, because this highly depends on the environment.
> Instead, we can add CONFIG_IKCONFIG_PROC=y in tools/testing/selftests/ftrace/config.
>
> Thank you,
>
Thanks for the review. I'll address all other comments in v2.
I have a concern about this specific point. On Ubuntu kernels, both
CONFIG_IKCONFIG and CONFIG_IKCONFIG_PROC are disabled by default, so
/proc/config.gz does not exist. If we drop the /boot/config-$(uname -r)
lookup and rely solely on /proc/config.gz, this test would become
unresolved on every Ubuntu kernel — a regression, since it works on
those kernels today.
There is also existing precedent for the /boot/config-$(uname -r)
fallback: tools/testing/selftests/mm/va_high_addr_switch.sh checks
/proc/config.gz first and falls back to /boot/config-$(uname -r).
So how about we keep /boot/config-$(uname -r) as a fallback, but drop
the /lib/modules/... paths you objected to. And add ftrace/config as you
suggested here.
Thanks,
Hui.
>> + do
>> + if [ -f "$config_file" ]; then
>> + grep -Eq "^${config}=(y|m)$" "$config_file"
>> + return $?
>> + fi
>> + done
>> +
>> + return 2
>> +}
>> +
>> LOCALHOST=127.0.0.1
>>
>> yield() {
>> --
>> 2.43.0
>>
>>
>
^ permalink raw reply
* Re: [PATCH v2 6/6] x86/setup: prepend embedded bootconfig cmdline before parse_early_param
From: Breno Leitao @ 2026-06-08 14:41 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Andrew Morton, Nathan Chancellor, paulmck, Nicolas Schier,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
bpf, kernel-team
In-Reply-To: <20260608191928.d7d2dea899b94f05d397f891@kernel.org>
On Mon, Jun 08, 2026 at 07:19:28PM +0900, Masami Hiramatsu wrote:
> On Fri, 05 Jun 2026 05:03:37 -0700
> Breno Leitao <leitao@debian.org> wrote:
>
> > Call xbc_prepend_embedded_cmdline() in setup_arch() right after the
> > CONFIG_CMDLINE merge and before strscpy(command_line, ...) so the
> > build-time-rendered embedded bootconfig "kernel" subtree is part of
> > boot_command_line by the time parse_early_param() runs. early_param()
> > handlers (mem=, earlycon=, loglevel=, ...) now see values supplied via
> > CONFIG_BOOT_CONFIG_EMBED_FILE without parsing bootconfig at runtime.
> >
> > Gate the prepend on the bootconfig opt-in: only fold in the embedded
> > kernel.* keys when "bootconfig" is present on the command line, or
> > CONFIG_BOOT_CONFIG_FORCE is set. Applying the embedded cmdline
> > unconditionally would (a) diverge from how embedded init.* keys are
> > treated and (b) break fail-safe recovery: a malformed embedded
> > console=/mem= could panic the boot with no way for the admin to disable
> > it by dropping "bootconfig" from the bootloader cmdline.
> > cmdline_find_option_bool() runs before parse_early_param(), so the gate
> > is cheap and correctly ordered.
> >
> > Select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG so the user-visible
> > CONFIG_BOOT_CONFIG_EMBED_CMDLINE option becomes selectable on x86.
>
> This seems like a dummy config. what code does depend on this flag?
No C code reads ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG directly — it's
a silent gating symbol, the same ARCH_SUPPORTS_* idiom as
ARCH_SUPPORTS_CFI, ARCH_SUPPORTS_LTO_CLANG, etc.
Its only role is the depends on line of BOOT_CONFIG_EMBED_CMDLINE: an
arch selects it once its setup_arch() calls
xbc_prepend_embedded_cmdline(), and that makes the user-visible
BOOT_CONFIG_EMBED_CMDLINE selectable.
Right now, only x86 supports embedded bootconfig, thus, only x86 does
the following (last patch):
config X86
+ select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
So, no other platform can see CONFIG_BOOT_CONFIG_EMBED_CMDLINE.
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -378,12 +378,15 @@ static void __init setup_boot_config(void)
> > int pos, ret;
> > size_t size;
> > char *err;
> > + bool from_embedded = false;
> >
> > /* Cut out the bootconfig data even if we have no bootconfig option */
> > data = get_boot_config_from_initrd(&size);
> > /* If there is no bootconfig in initrd, try embedded one. */
> > - if (!data)
> > + if (!data) {
> > data = xbc_get_embedded_bootconfig(&size);
> > + from_embedded = true;
>
> Even from embedded bootconfig, if the arch set
> ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG=n, this must be applied to
> the cmdline as we are doing.
Right — that path is preserved. When the arch doesn't select
ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG, BOOT_CONFIG_EMBED_CMDLINE is
unselectable, so xbc_embedded_cmdline_applied() is the no-op stub
returning false.
> > strscpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
> > err = parse_args("bootconfig", tmp_cmdline, NULL, 0, 0, 0, NULL,
> > @@ -421,8 +424,17 @@ static void __init setup_boot_config(void)
> > } else {
> > xbc_get_info(&ret, NULL);
> > pr_info("Load bootconfig: %ld bytes %d nodes\n", (long)size, ret);
> > - /* keys starting with "kernel." are passed via cmdline */
> > - extra_command_line = xbc_make_cmdline("kernel");
> > + /*
> > + * keys starting with "kernel." are passed via cmdline. When
> > + * this bootconfig came from the embedded source and
> > + * setup_arch() already prepended the rendered "kernel" subtree
> > + * to boot_command_line, rendering again here would duplicate
> > + * the keys in saved_command_line and make accumulating handlers
> > + * (console=, earlycon=, ...) re-register the same value. Skip
> > + * only when the prepend really happened.
>
> Also, this should mention ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG=n case.
Ack, I will update
Thanks for the review,
--breno
^ permalink raw reply
* [RFC PATCH 7/7] tracing/probes: Add a new testcase for BTF typecasts
From: Masami Hiramatsu (Google) @ 2026-06-08 14:25 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178092865666.163648.10457567771536160909.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
With the introduction of container_of-style BTF typecasting and
per-CPU variable access support in trace probes, we need a way to
verify their functionality and prevent regressions.
Add a new ftrace kselftest and update the trace event sample module to
test and validate these features.
Specifically, update the trace-events-sample module to set up a periodic
timer whose callback accesses a per-CPU counter. Introduce a new sample
trace event, foo_timer_fn, to trace this callback and log the current
counter value.
Then, add a new test case, btf_probe_event.tc, which defines a dynamic
probe on the timer callback. The probe uses BTF typecasting to recover
the parent structure from the timer argument and +CPU() to fetch the
per-CPU counter. The test verifies the integrity of the implementation
by ensuring the values recorded by the dynamic probe match those from
the static tracepoint.
Assisted-by: Antigravity:gemini-3.5-flash
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
samples/trace_events/trace-events-sample.c | 38 ++++++++++++++-
samples/trace_events/trace-events-sample.h | 34 ++++++++++++-
.../ftrace/test.d/dynevent/btf_probe_event.tc | 52 ++++++++++++++++++++
3 files changed, 120 insertions(+), 4 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
diff --git a/samples/trace_events/trace-events-sample.c b/samples/trace_events/trace-events-sample.c
index ecc7db237f2e..770315812218 100644
--- a/samples/trace_events/trace-events-sample.c
+++ b/samples/trace_events/trace-events-sample.c
@@ -94,6 +94,20 @@ static int simple_thread_fn(void *arg)
static DEFINE_MUTEX(thread_mutex);
static int simple_thread_cnt;
+static struct foo_timer_data *foo_timer_data;
+
+static void sample_timer_cb(struct timer_list *t)
+{
+ struct foo_timer_data *data = container_of(t, struct foo_timer_data, timer);
+
+ get_cpu();
+ trace_foo_timer_fn(data);
+ (*this_cpu_ptr(data->counter))++;
+ put_cpu();
+
+ mod_timer(t, jiffies + HZ);
+}
+
int foo_bar_reg(void)
{
mutex_lock(&thread_mutex);
@@ -128,9 +142,27 @@ void foo_bar_unreg(void)
static int __init trace_event_init(void)
{
+ foo_timer_data = kzalloc_obj(*foo_timer_data, GFP_KERNEL);
+ if (!foo_timer_data)
+ return -ENOMEM;
+
+ foo_timer_data->name = "sample_timer_counter";
+ foo_timer_data->counter = alloc_percpu(int);
+ if (!foo_timer_data->counter) {
+ kfree(foo_timer_data);
+ return -ENOMEM;
+ }
+
+ timer_setup(&foo_timer_data->timer, sample_timer_cb, 0);
+ mod_timer(&foo_timer_data->timer, jiffies + HZ);
+
simple_tsk = kthread_run(simple_thread, NULL, "event-sample");
- if (IS_ERR(simple_tsk))
+ if (IS_ERR(simple_tsk)) {
+ timer_delete_sync(&foo_timer_data->timer);
+ free_percpu(foo_timer_data->counter);
+ kfree(foo_timer_data);
return -1;
+ }
return 0;
}
@@ -143,6 +175,10 @@ static void __exit trace_event_exit(void)
kthread_stop(simple_tsk_fn);
simple_tsk_fn = NULL;
mutex_unlock(&thread_mutex);
+
+ timer_delete_sync(&foo_timer_data->timer);
+ free_percpu(foo_timer_data->counter);
+ kfree(foo_timer_data);
}
module_init(trace_event_init);
diff --git a/samples/trace_events/trace-events-sample.h b/samples/trace_events/trace-events-sample.h
index 1a05fc153353..816848a456a2 100644
--- a/samples/trace_events/trace-events-sample.h
+++ b/samples/trace_events/trace-events-sample.h
@@ -247,12 +247,14 @@
*/
/*
- * It is OK to have helper functions in the file, but they need to be protected
- * from being defined more than once. Remember, this file gets included more
- * than once.
+ * It is OK to have helper functions and data structures in the file, but they
+ * need to be protected from being defined more than once. Remember, this file
+ * gets included more than once.
*/
#ifndef __TRACE_EVENT_SAMPLE_HELPER_FUNCTIONS
#define __TRACE_EVENT_SAMPLE_HELPER_FUNCTIONS
+#include <linux/timer.h>
+
static inline int __length_of(const int *list)
{
int i;
@@ -270,6 +272,13 @@ enum {
TRACE_SAMPLE_BAR = 4,
TRACE_SAMPLE_ZOO = 8,
};
+
+struct foo_timer_data {
+ const char *name;
+ struct timer_list timer;
+ int __percpu *counter;
+};
+
#endif
/*
@@ -595,6 +604,25 @@ TRACE_EVENT(foo_rel_loc,
__get_rel_bitmask(bitmask),
__get_rel_cpumask(cpumask))
);
+
+TRACE_EVENT(foo_timer_fn,
+
+ TP_PROTO(struct foo_timer_data *data),
+
+ TP_ARGS(data),
+
+ TP_STRUCT__entry(
+ __string( name, data->name )
+ __field( int, count )
+ ),
+
+ TP_fast_assign(
+ __assign_str(name);
+ __entry->count = *this_cpu_ptr(data->counter);
+ ),
+
+ TP_printk("name=%s count=%d", __get_str(name), __entry->count)
+);
#endif
/***** NOTICE! The #if protection ends here. *****/
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc b/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
new file mode 100644
index 000000000000..f1980650dbe2
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
@@ -0,0 +1,52 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: BTF event with typecast and percpu access
+# requires: dynamic_events " +CPU(<fetcharg>)":README "[(structname[,field])]<argname>[->field[->field|.field...]]":README
+
+# Check if the sample module is loaded
+if ! lsmod | grep -q trace_events_sample; then
+ modprobe trace-events-sample || exit_unsupported
+fi
+
+echo 0 > events/enable
+echo > dynamic_events
+
+# The sample_timer_cb(struct timer_list *t) is called.
+# We want to check (STRUCT,FIELD)VAR typecast and +PCPU() dereference.
+# (foo_timer_data,timer)t converts t to struct foo_timer_data * using container_of.
+# data->counter is a per-cpu pointer to int.
+# +PCPU(data->counter) should give the per-cpu address of the counter.
+# *+PCPU(data->counter) should give the value of the counter.
+
+echo 'f:mysample/myevent sample_timer_cb name=(foo_timer_data,timer)t->name:string count=+CPU((foo_timer_data,timer)t->counter)' >> dynamic_events
+
+echo 1 > events/mysample/myevent/enable
+echo 1 > events/sample-trace/foo_timer_fn/enable
+
+sleep 2
+
+echo 0 > events/mysample/myevent/enable
+echo 0 > events/sample-trace/foo_timer_fn/enable
+
+# Compare the values.
+MATCH=0
+while read line; do
+ if echo $line | grep -q "foo_timer_fn:"; then
+ NAME=`echo $line | sed 's/.*name=\([^ ]*\) .*/\1/'`
+ COUNT=`echo $line | sed 's/.*count=\([^ ]*\).*/\1/'`
+ if grep -q "myevent:.*name=\"${NAME}\" count=$COUNT" trace; then
+ MATCH=$((MATCH+1))
+ fi
+ fi
+done < trace
+
+if [ $MATCH -eq 0 ]; then
+ echo "No matching events found"
+ exit_fail
+fi
+
+# Clean up
+echo 0 > events/mysample/myevent/enable
+echo 0 > events/sample-trace/foo_timer_fn/enable
+echo > dynamic_events
+clear_trace
^ permalink raw reply related
* [RFC PATCH 6/7] tracing/probes: Support reserved this_cpu_ptr() method
From: Masami Hiramatsu (Google) @ 2026-06-08 14:25 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178092865666.163648.10457567771536160909.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
The +PCPU() dereference operator was introduced in trace probes to
access a per-CPU pointer of a CPU local variable. However, kernel
developers are more familiar with the "this_cpu_ptr()" macro.
To make trace probe syntax more intuitive and aligned with standard
kernel macros, introduce support for "this_cpu_ptr(<fetcharg>)" as a
reserved method.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
kernel/trace/trace.c | 2 +-
kernel/trace/trace_probe.c | 7 +++++++
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2b8c8ac4036a..60ab839d0867 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4332,7 +4332,7 @@ static const char readme_msg[] =
"\t $stack<index>, $stack, $retval, $comm, $current\n"
#endif
"\t +|-[u]<offset>(<fetcharg>), \\imm-value, \\\"imm-string\"\n"
- "\t +CPU(<fetcharg>), +PCPU(<fetcharg>)\n"
+ "\t +CPU(<fetcharg>), +PCPU(<fetcharg>), this_cpu_ptr(<fetcharg>)\n"
"\t kernel return probes support: $retval, $arg<N>, $comm\n"
"\t type: s8/16/32/64, u8/16/32/64, x8/16/32/64, char, string, symbol,\n"
"\t b<bit-width>@<bit-offset>/<container-size>, ustring,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index fa6757222fe6..27be0664cdf3 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -1315,6 +1315,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
struct fetch_insn **pcode, struct fetch_insn *end,
struct traceprobe_parse_context *ctx)
{
+ static const char *THIS_CPU_PTR_STR = "this_cpu_ptr(";
struct fetch_insn *code = *pcode;
unsigned long param;
int deref = FETCH_OP_DEREF;
@@ -1426,6 +1427,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
ctx->offset += (tmp + 1 - arg) + (arg[0] != '-' ? 1 : 0);
arg = tmp + 1;
}
+handle_deref:
tmp = strrchr(arg, ')');
if (!tmp) {
trace_probe_log_err(ctx->offset + strlen(arg),
@@ -1476,6 +1478,11 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
ret = handle_typecast(arg, pcode, end, ctx);
break;
default:
+ if (str_has_prefix(arg, THIS_CPU_PTR_STR)) {
+ arg += strlen(THIS_CPU_PTR_STR);
+ deref = FETCH_OP_CPU_PTR;
+ goto handle_deref;
+ }
if (isalpha(arg[0]) || arg[0] == '_') { /* BTF variable */
if (!tparg_is_function_entry(ctx->flags) &&
!tparg_is_function_return(ctx->flags)) {
^ permalink raw reply related
* [RFC PATCH 5/7] tracing/probes: Add +CPU() and +PCPU() dereference method to fetcharg
From: Masami Hiramatsu (Google) @ 2026-06-08 14:25 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178092865666.163648.10457567771536160909.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When tracing the kernel local variables, sometimes we need to get the
CPU local variables. To access it, current simple dereference is not
enough.
Thus, introduce a special +CPU() dereference to access per-cpu variable
for the current CPU (accessing other CPU variable may race with
updates on other CPUs). Also +PCPU() is for accessing per-cpu pointer.
+CPU(pcp)
is equal to
this_cpu_read(pcp)
And
+PCPU(pcp)
is equal to
this_cpu_ptr(pcp)
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Documentation/trace/eprobetrace.rst | 3 ++
Documentation/trace/fprobetrace.rst | 3 ++
Documentation/trace/kprobetrace.rst | 3 ++
kernel/trace/trace.c | 1 +
kernel/trace/trace_probe.c | 48 +++++++++++++++++++++--------------
kernel/trace/trace_probe.h | 2 +
kernel/trace/trace_probe_tmpl.h | 30 ++++++++++++++++++----
7 files changed, 65 insertions(+), 25 deletions(-)
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index dcf92d5b4175..0c7878df02f6 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -40,6 +40,9 @@ Synopsis of eprobe_events
$comm : Fetch current task comm.
$current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+ +CPU(FETCHARG) : Fetch memory at FETCHARG address on the CPU specified by CPU.
+ This is useful for fetching per-CPU variables.
+ +PCPU(FETCHARG) : Fetch memory address at FETCHARG address on the per-CPU area.
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 3392cab016b3..c851f98bb310 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -52,6 +52,9 @@ Synopsis of fprobe-events
$comm : Fetch current task comm.
$current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5)
+ +CPU(FETCHARG) : Fetch memory at FETCHARG address on the CPU specified by CPU.
+ This is useful for fetching per-CPU variables.
+ +PCPU(FETCHARG) : Fetch memory address at FETCHARG address on the per-CPU area.
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 81e4fe38791d..bc806fd82a91 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -55,6 +55,9 @@ Synopsis of kprobe_events
$comm : Fetch current task comm.
$current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+ +CPU(FETCHARG) : Fetch memory at FETCHARG address on the CPU specified by CPU.
+ This is useful for fetching per-CPU variables.
+ +PCPU(FETCHARG) : Fetch memory address at FETCHARG address on the per-CPU area.
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e185a006cb08..2b8c8ac4036a 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4332,6 +4332,7 @@ static const char readme_msg[] =
"\t $stack<index>, $stack, $retval, $comm, $current\n"
#endif
"\t +|-[u]<offset>(<fetcharg>), \\imm-value, \\\"imm-string\"\n"
+ "\t +CPU(<fetcharg>), +PCPU(<fetcharg>)\n"
"\t kernel return probes support: $retval, $arg<N>, $comm\n"
"\t type: s8/16/32/64, u8/16/32/64, x8/16/32/64, char, string, symbol,\n"
"\t b<bit-width>@<bit-offset>/<container-size>, ustring,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 2c5deb1e1463..fa6757222fe6 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -1396,26 +1396,36 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
case '+': /* deref memory */
case '-':
- if (arg[1] == 'u') {
- deref = FETCH_OP_UDEREF;
- arg[1] = arg[0];
- arg++;
- }
- if (arg[0] == '+')
- arg++; /* Skip '+', because kstrtol() rejects it. */
- tmp = strchr(arg, '(');
- if (!tmp) {
- trace_probe_log_err(ctx->offset, DEREF_NEED_BRACE);
- return -EINVAL;
- }
- *tmp = '\0';
- ret = kstrtol(arg, 0, &offset);
- if (ret) {
- trace_probe_log_err(ctx->offset, BAD_DEREF_OFFS);
- break;
+ if (str_has_prefix(arg, "+CPU(")) {
+ deref = FETCH_OP_DEREF_CPU;
+ arg += 5;
+ ctx->offset += 5;
+ } else if (str_has_prefix(arg, "+PCPU(")) {
+ deref = FETCH_OP_CPU_PTR;
+ arg += 6;
+ ctx->offset += 6;
+ } else {
+ if (arg[1] == 'u') {
+ deref = FETCH_OP_UDEREF;
+ arg[1] = arg[0];
+ arg++;
+ }
+ if (arg[0] == '+')
+ arg++; /* Skip '+', because kstrtol() rejects it. */
+ tmp = strchr(arg, '(');
+ if (!tmp) {
+ trace_probe_log_err(ctx->offset, DEREF_NEED_BRACE);
+ return -EINVAL;
+ }
+ *tmp = '\0';
+ ret = kstrtol(arg, 0, &offset);
+ if (ret) {
+ trace_probe_log_err(ctx->offset, BAD_DEREF_OFFS);
+ break;
+ }
+ ctx->offset += (tmp + 1 - arg) + (arg[0] != '-' ? 1 : 0);
+ arg = tmp + 1;
}
- ctx->offset += (tmp + 1 - arg) + (arg[0] != '-' ? 1 : 0);
- arg = tmp + 1;
tmp = strrchr(arg, ')');
if (!tmp) {
trace_probe_log_err(ctx->offset + strlen(arg),
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index f2b31089779c..bec04bcc4226 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -100,6 +100,8 @@ enum fetch_op {
// Stage 2 (dereference) op
FETCH_OP_DEREF, /* Dereference: .offset */
FETCH_OP_UDEREF, /* User-space Dereference: .offset */
+ FETCH_OP_DEREF_CPU, /* Per-CPU Dereference for this CPU */
+ FETCH_OP_CPU_PTR, /* Per-CPU pointer for this CPU */
// Stage 3 (store) ops
FETCH_OP_ST_RAW, /* Raw: .size */
FETCH_OP_ST_MEM, /* Mem: .offset, .size */
diff --git a/kernel/trace/trace_probe_tmpl.h b/kernel/trace/trace_probe_tmpl.h
index f630930288d2..82d753decf48 100644
--- a/kernel/trace/trace_probe_tmpl.h
+++ b/kernel/trace/trace_probe_tmpl.h
@@ -129,25 +129,43 @@ process_fetch_insn_bottom(struct fetch_insn *code, unsigned long val,
struct fetch_insn *s3 = NULL;
int total = 0, ret = 0, i = 0;
u32 loc = 0;
- unsigned long lval = val;
+ unsigned long lval, llval = val;
stage2:
/* 2nd stage: dereference memory if needed */
do {
- if (code->op == FETCH_OP_DEREF) {
- lval = val;
+ lval = val;
+ switch (code->op) {
+ case FETCH_OP_DEREF:
ret = probe_mem_read(&val, (void *)val + code->offset,
sizeof(val));
- } else if (code->op == FETCH_OP_UDEREF) {
- lval = val;
+ break;
+ case FETCH_OP_UDEREF:
ret = probe_mem_read_user(&val,
(void *)val + code->offset, sizeof(val));
- } else
break;
+ case FETCH_OP_DEREF_CPU:
+ case FETCH_OP_CPU_PTR:
+ if (!is_kernel_percpu_address(val)) {
+ ret = -EFAULT;
+ break;
+ }
+ val = (unsigned long)this_cpu_ptr((void __percpu *)val);
+ if (code->op == FETCH_OP_DEREF_CPU)
+ ret = probe_mem_read(&val, (void *)val, sizeof(val));
+ else
+ ret = 0;
+ break;
+ default:
+ lval = llval;
+ goto out;
+ }
if (ret)
return ret;
+ llval = lval;
code++;
} while (1);
+out:
s3 = code;
stage3:
^ permalink raw reply related
* [RFC PATCH 4/7] tracing/probes: Add $current variable support
From: Masami Hiramatsu (Google) @ 2026-06-08 14:24 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178092865666.163648.10457567771536160909.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Since we can use the BTF to cast value to a structure pointer type,
it is useful to introduce "$current" special variable support to
fetcharg.
User can define a fetcharg to access current task_struct properties
using BTF typecast (or dereference - but this may be complicated) e.g.
(task_struct)$current->cpus_ptr
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Documentation/trace/eprobetrace.rst | 1 +
Documentation/trace/fprobetrace.rst | 1 +
Documentation/trace/kprobetrace.rst | 1 +
kernel/trace/trace.c | 2 +-
kernel/trace/trace_probe.c | 6 ++++++
kernel/trace/trace_probe.h | 1 +
kernel/trace/trace_probe_tmpl.h | 3 +++
7 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index 680e0af43d5d..dcf92d5b4175 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -38,6 +38,7 @@ Synopsis of eprobe_events
@ADDR : Fetch memory at ADDR (ADDR should be in kernel)
@SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
$comm : Fetch current task comm.
+ $current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 290a9e6f7491..3392cab016b3 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -50,6 +50,7 @@ Synopsis of fprobe-events
$argN : Fetch the Nth function argument. (N >= 1) (\*2)
$retval : Fetch return value.(\*3)
$comm : Fetch current task comm.
+ $current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5)
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index a62707e6a9f2..81e4fe38791d 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -53,6 +53,7 @@ Synopsis of kprobe_events
$argN : Fetch the Nth function argument. (N >= 1) (\*1)
$retval : Fetch return value.(\*2)
$comm : Fetch current task comm.
+ $current : Fetch the address of the current task_struct.
+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
\IMM : Store an immediate value to the argument.
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 0e36af853199..e185a006cb08 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4329,7 +4329,7 @@ static const char readme_msg[] =
"\t [(structname[,field])](fetcharg)->field[->field|.field...],\n"
#endif
#else
- "\t $stack<index>, $stack, $retval, $comm,\n"
+ "\t $stack<index>, $stack, $retval, $comm, $current\n"
#endif
"\t +|-[u]<offset>(<fetcharg>), \\imm-value, \\\"imm-string\"\n"
"\t kernel return probes support: $retval, $arg<N>, $comm\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index ff0b619e9a90..2c5deb1e1463 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -1235,6 +1235,12 @@ static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
return 0;
}
+ /* $current returns the address of the current task_struct. */
+ if (strcmp(arg, "current") == 0) {
+ code->op = FETCH_OP_CURRENT;
+ return 0;
+ }
+
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
len = str_has_prefix(arg, "arg");
if (len) {
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index b1a54da3c761..f2b31089779c 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -96,6 +96,7 @@ enum fetch_op {
FETCH_OP_FOFFS, /* File offset: .immediate */
FETCH_OP_DATA, /* Allocated data: .data */
FETCH_OP_EDATA, /* Entry data: .offset */
+ FETCH_OP_CURRENT, /* Current task_struct address */
// Stage 2 (dereference) op
FETCH_OP_DEREF, /* Dereference: .offset */
FETCH_OP_UDEREF, /* User-space Dereference: .offset */
diff --git a/kernel/trace/trace_probe_tmpl.h b/kernel/trace/trace_probe_tmpl.h
index f39b37fcdb3b..f630930288d2 100644
--- a/kernel/trace/trace_probe_tmpl.h
+++ b/kernel/trace/trace_probe_tmpl.h
@@ -112,6 +112,9 @@ process_common_fetch_insn(struct fetch_insn *code, unsigned long *val)
case FETCH_OP_DATA:
*val = (unsigned long)code->data;
break;
+ case FETCH_OP_CURRENT:
+ *val = (unsigned long)current;
+ break;
default:
return -EILSEQ;
}
^ permalink raw reply related
* [RFC PATCH 3/7] tracing/probes: Support field specifier option for typecast
From: Masami Hiramatsu (Google) @ 2026-06-08 14:24 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178092865666.163648.10457567771536160909.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add a field specifier option for the typecast. This works like
container_of() macro.
(STRUCT[,FIELD[.FIELD2...]])VAR
This is equivalent to :
container_of(VAR, struct STRUCT, FIELD[.FIELD2...])
For example:
echo "f tick_nohz_handler next_tick=(tick_sched,sched_timer)timer->next_tick" >> dynamic_events
This will trace tick_nohz_handler() with its tick_sched::next_tick which
is converted from @timer by contianer_of(tick, struct tick_sched, sched_timer).
So, if you enabkle both fprobes:tick_nohz_handler__entry and
timer:hrtimer_expire_entry events, we will see something like:
<idle>-0 [002] d.h1. 3778.087272: hrtimer_expire_entry: hrtimer=00000000d63db328 f
unction=tick_nohz_handler now=3777450051040
<idle>-0 [002] d.h1. 3778.087281: tick_nohz_handler__entry: (tick_nohz_handler+0x4
/0x140) next_tick=3777450000000
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Documentation/trace/eprobetrace.rst | 5 +
Documentation/trace/fprobetrace.rst | 8 +-
Documentation/trace/kprobetrace.rst | 8 +-
kernel/trace/trace.c | 4 -
kernel/trace/trace_probe.c | 169 +++++++++++++++++++++++------------
kernel/trace/trace_probe.h | 4 +
6 files changed, 131 insertions(+), 67 deletions(-)
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index cd0b4aa7f896..680e0af43d5d 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -49,7 +49,10 @@ Synopsis of eprobe_events
(STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
a pointer to STRUCT and then derference the pointer defined by
->MEMBER. Note that when this is used, the FIELD name does not
- need to be prefixed with a '$'.
+ need to be prefixed with a '$'. ASGN can be specified optionally.
+ If ASGN is specified, FIELD will be cast to the same offset
+ position as the ASGN member, rather than to the beginning of
+ the STRUCT.
(STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
also be used with another FETCHARG instead of FIELD.
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 6b8bb27bb62d..290a9e6f7491 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -57,10 +57,12 @@ Synopsis of fprobe-events
(u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
(x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
and bitfield are supported.
- (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ (STRUCT[,ASGN])FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
a pointer to STRUCT and then derference the pointer defined by
- ->MEMBER.
- (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+ ->MEMBER. ASGN can be specified optionally. If ASGN is specified,
+ FIELD will be cast to the same offset position as the ASGN member,
+ rather than to the beginning of the STRUCT.
+ (STRUCT[,ASGN])(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
also be used with another FETCHARG instead of FIELD.
(\*1) This is available only when BTF is enabled.
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index c4382765d5b2..a62707e6a9f2 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -61,11 +61,13 @@ Synopsis of kprobe_events
(x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
"string", "ustring", "symbol", "symstr" and bitfield are
supported.
- (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ (STRUCT[,ASGN])FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
a pointer to STRUCT and then derference the pointer defined by
->MEMBER. Note that this is available only when the probe is
- on function entry.
- (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+ on function entry. ASGN can be specified optionally. If ASGN
+ is specified, FIELD will be cast to the same offset position
+ as the ASGN member, rather than to the beginning of the STRUCT.
+ (STRUCT[,ASGN])(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
also be used with another FETCHARG instead of FIELD.
(\*1) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4f70318918c2..0e36af853199 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4325,8 +4325,8 @@ static const char readme_msg[] =
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
"\t $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
- "\t [(structname)]<argname>[->field[->field|.field...]],\n"
- "\t [(structname)](fetcharg)->field[->field|.field...],\n"
+ "\t [(structname[,field])]<argname>[->field[->field|.field...]],\n"
+ "\t [(structname[,field])](fetcharg)->field[->field|.field...],\n"
#endif
#else
"\t $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index ddd9b1b63a17..ff0b619e9a90 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -574,6 +574,65 @@ static int split_next_field(char *varname, char **next_field,
return ret;
}
+/* Inner loop for solving dot operator ('.'). Return bit-offset of the given field */
+static int get_bitoffset_of_field(char **pfieldname, const struct btf_type **ptype,
+ struct traceprobe_parse_context *ctx)
+{
+ const struct btf_type *type = *ptype;
+ const struct btf_member *field;
+ struct btf *btf = ctx_btf(ctx);
+ char *fieldname = *pfieldname;
+ int bitoffs = 0;
+ u32 anon_offs;
+ char *next;
+ int is_ptr;
+ s32 tid;
+
+ do {
+ next = NULL;
+ is_ptr = split_next_field(fieldname, &next, ctx);
+ if (is_ptr < 0)
+ return is_ptr;
+
+ anon_offs = 0;
+ field = btf_find_struct_member(btf, type, fieldname,
+ &anon_offs);
+ if (IS_ERR(field)) {
+ trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+ return PTR_ERR(field);
+ }
+ if (!field) {
+ trace_probe_log_err(ctx->offset, NO_BTF_FIELD);
+ return -ENOENT;
+ }
+ /* Add anonymous structure/union offset */
+ bitoffs += anon_offs;
+
+ /* Accumulate the bit-offsets of the dot-connected fields */
+ if (btf_type_kflag(type)) {
+ bitoffs += BTF_MEMBER_BIT_OFFSET(field->offset);
+ ctx->last_bitsize = BTF_MEMBER_BITFIELD_SIZE(field->offset);
+ } else {
+ bitoffs += field->offset;
+ ctx->last_bitsize = 0;
+ }
+
+ type = btf_type_skip_modifiers(btf, field->type, &tid);
+ if (!type) {
+ trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+ return -EINVAL;
+ }
+
+ if (next)
+ ctx->offset += next - fieldname;
+ fieldname = next;
+ } while (!is_ptr && fieldname);
+
+ *pfieldname = fieldname;
+ *ptype = type;
+
+ return bitoffs;
+}
/*
* Parse the field of data structure. The @type must be a pointer type
* pointing the target data structure type.
@@ -583,16 +642,14 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
struct traceprobe_parse_context *ctx)
{
struct fetch_insn *code = *pcode;
- const struct btf_member *field;
- u32 bitoffs, anon_offs;
- bool is_struct = ctx->struct_btf != NULL;
struct btf *btf = ctx_btf(ctx);
- char *next;
- int is_ptr;
+ bool is_first_field = true;
+ int bitoffs;
s32 tid;
do {
- if (!is_struct) {
+ /* For the first field of typecast, @type will be the target structure type. */
+ if (!(is_first_field && ctx->struct_btf)) {
/* Outer loop for solving arrow operator ('->') */
if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
@@ -606,60 +663,25 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
return -EINVAL;
}
}
- /* Only the first type can skip being a pointer */
- is_struct = false;
-
- bitoffs = 0;
- do {
- /* Inner loop for solving dot operator ('.') */
- next = NULL;
- is_ptr = split_next_field(fieldname, &next, ctx);
- if (is_ptr < 0)
- return is_ptr;
-
- anon_offs = 0;
- field = btf_find_struct_member(btf, type, fieldname,
- &anon_offs);
- if (IS_ERR(field)) {
- trace_probe_log_err(ctx->offset, BAD_BTF_TID);
- return PTR_ERR(field);
- }
- if (!field) {
- trace_probe_log_err(ctx->offset, NO_BTF_FIELD);
- return -ENOENT;
- }
- /* Add anonymous structure/union offset */
- bitoffs += anon_offs;
-
- /* Accumulate the bit-offsets of the dot-connected fields */
- if (btf_type_kflag(type)) {
- bitoffs += BTF_MEMBER_BIT_OFFSET(field->offset);
- ctx->last_bitsize = BTF_MEMBER_BITFIELD_SIZE(field->offset);
- } else {
- bitoffs += field->offset;
- ctx->last_bitsize = 0;
- }
-
- type = btf_type_skip_modifiers(btf, field->type, &tid);
- if (!type) {
- trace_probe_log_err(ctx->offset, BAD_BTF_TID);
- return -EINVAL;
- }
-
- ctx->offset += next - fieldname;
- fieldname = next;
- } while (!is_ptr && fieldname);
+ bitoffs = get_bitoffset_of_field(&fieldname, &type, ctx);
+ if (bitoffs < 0)
+ return bitoffs;
if (++code == end) {
trace_probe_log_err(ctx->offset, TOO_MANY_OPS);
return -EINVAL;
}
code->op = FETCH_OP_DEREF; /* TODO: user deref support */
+ if (is_first_field && ctx->struct_btf) {
+ /* The first field can be typecasted with field option. */
+ bitoffs -= ctx->prefix_bitoffs;
+ }
code->offset = bitoffs / 8;
*pcode = code;
ctx->last_bitoffs = bitoffs % 8;
ctx->last_type = type;
+ is_first_field = false;
} while (fieldname);
return 0;
@@ -700,8 +722,7 @@ static int parse_btf_arg(char *varname,
/* TEVENT is only here via a typecast */
if (WARN_ON_ONCE(ctx->struct_btf == NULL))
return -EINVAL;
- type = ctx->last_struct;
- goto found_type;
+ goto found;
}
if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
@@ -763,7 +784,6 @@ static int parse_btf_arg(char *varname,
type = ctx->last_struct;
else
type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
-found_type:
if (!type) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
return -EINVAL;
@@ -832,6 +852,41 @@ static int query_btf_struct(const char *sname, struct traceprobe_parse_context *
return 0;
}
+static int parse_btf_casttype(char *casttype, struct traceprobe_parse_context *ctx)
+{
+ char *field;
+ int ret;
+
+ /* Field option - evaluated later. */
+ field = strchr(casttype, ',');
+ if (field)
+ *field++ = '\0';
+
+ ret = query_btf_struct(casttype, ctx);
+ if (ret < 0) {
+ trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
+ return -EINVAL;
+ }
+
+ if (field) {
+ struct btf_type *type = (struct btf_type *)ctx->last_struct;
+
+ ctx->offset += field - casttype;
+ ret = get_bitoffset_of_field(&field, &ctx->last_struct, ctx);
+ if (ret < 0)
+ return ret;
+ if (ret % 8) {
+ trace_probe_log_err(ctx->offset, TYPECAST_NOT_ALIGNED);
+ return -EINVAL;
+ }
+ ctx->prefix_bitoffs = ret;
+ /* Restore the original struct type (overwritten by get_bitoffset_of_field) */
+ ctx->last_struct = type;
+ }
+
+ return ret;
+}
+
/* Find the matching closing parenthesis for a given opening parenthesis. */
static char *find_matched_close_paren(char *s)
{
@@ -913,11 +968,10 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
nested = true;
}
- ret = query_btf_struct(arg + 1, ctx);
- if (ret < 0) {
- trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
- return -EINVAL;
- }
+ ctx->offset = orig_offset + 1; /* for the '(' */
+ ret = parse_btf_casttype(arg + 1, ctx);
+ if (ret < 0)
+ return ret;
ctx->offset = orig_offset + tmp - arg;
/* If it is nested, tmp points to the field name. */
@@ -925,6 +979,7 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
ret = parse_btf_field(tmp, ctx->last_struct, pcode, end, ctx);
else
ret = parse_btf_arg(tmp, pcode, end, ctx);
+ ctx->prefix_bitoffs = 0;
return ret;
}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 8dcc65e4e1db..b1a54da3c761 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -431,6 +431,7 @@ struct traceprobe_parse_context {
unsigned int flags;
int offset;
int nested_level;
+ int prefix_bitoffs; /* The bit offset of the prefix field of typecast */
};
#define TRACEPROBE_MAX_NESTED_LEVEL 3
@@ -571,7 +572,8 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
C(EVENT_TOO_BIG, "Event too big (too many fields?)"), \
C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"), \
C(TYPECAST_REQ_FIELD, "Typecast requires a field access"), \
- C(TOO_MANY_NESTED, "Too many nested typecasts/dereferences"),
+ C(TOO_MANY_NESTED, "Too many nested typecasts/dereferences"), \
+ C(TYPECAST_NOT_ALIGNED, "Typecast field option is not byte-aligned"),
#undef C
#define C(a, b) TP_ERR_##a
^ permalink raw reply related
* [RFC PATCH 2/7] tracing/probes: Support nested typecast
From: Masami Hiramatsu (Google) @ 2026-06-08 14:24 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178092865666.163648.10457567771536160909.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When we hit an open parenthesis right after typecast closing
parenthesis, it means we have nested typecast. This allows us to
typecast a generic data member in a structure to a pointer to
another structure.
For example, to cast a DATA_MEMBER of VAR structure to STRUCT pointer
and get MEMBER value.
(STRUCT)(VAR->DATA_MEMBER)->MEMBER
Also, we can nest typecast.
(STRUCT1)((STRUCT2)$ARG->FIELD2)->FIELD1
Currently the max nest level is limited to 3.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Documentation/trace/eprobetrace.rst | 2 +
Documentation/trace/fprobetrace.rst | 2 +
Documentation/trace/kprobetrace.rst | 2 +
kernel/trace/trace.c | 1
kernel/trace/trace_probe.c | 76 ++++++++++++++++++++++++++++++++---
kernel/trace/trace_probe.h | 7 +++
6 files changed, 82 insertions(+), 8 deletions(-)
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index fe3602540569..cd0b4aa7f896 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -50,6 +50,8 @@ Synopsis of eprobe_events
a pointer to STRUCT and then derference the pointer defined by
->MEMBER. Note that when this is used, the FIELD name does not
need to be prefixed with a '$'.
+ (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+ also be used with another FETCHARG instead of FIELD.
Types
-----
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index 7435ded2d66d..6b8bb27bb62d 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -60,6 +60,8 @@ Synopsis of fprobe-events
(STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
a pointer to STRUCT and then derference the pointer defined by
->MEMBER.
+ (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+ also be used with another FETCHARG instead of FIELD.
(\*1) This is available only when BTF is enabled.
(\*2) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index f73614997d52..c4382765d5b2 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -65,6 +65,8 @@ Synopsis of kprobe_events
a pointer to STRUCT and then derference the pointer defined by
->MEMBER. Note that this is available only when the probe is
on function entry.
+ (STRUCT)(FETCHARG)->MEMBER[->MEMBER] : typecast can nest, so the above can
+ also be used with another FETCHARG instead of FIELD.
(\*1) only for the probe on function entry (offs == 0). Note, this argument access
is best effort, because depending on the argument type, it may be passed on
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index aa93e7b01146..4f70318918c2 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4326,6 +4326,7 @@ static const char readme_msg[] =
"\t $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
"\t [(structname)]<argname>[->field[->field|.field...]],\n"
+ "\t [(structname)](fetcharg)->field[->field|.field...],\n"
#endif
#else
"\t $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 609b156986c5..ddd9b1b63a17 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -832,10 +832,35 @@ static int query_btf_struct(const char *sname, struct traceprobe_parse_context *
return 0;
}
+/* Find the matching closing parenthesis for a given opening parenthesis. */
+static char *find_matched_close_paren(char *s)
+{
+ char *p = s;
+ int count = 0;
+
+ while (*p) {
+ if (*p == '(')
+ count++;
+ else if (*p == ')') {
+ if (--count == 0)
+ return p;
+ }
+ p++;
+ }
+ return NULL;
+}
+
+static int
+parse_probe_arg(char *arg, const struct fetch_type *type,
+ struct fetch_insn **pcode, struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx);
+
static int handle_typecast(char *arg, struct fetch_insn **pcode,
struct fetch_insn *end,
struct traceprobe_parse_context *ctx)
{
+ int orig_offset = ctx->offset;
+ bool nested = false;
char *tmp;
int ret;
@@ -850,19 +875,56 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
DEREF_OPEN_BRACE);
return -EINVAL;
}
- *tmp = '\0';
- ret = query_btf_struct(arg + 1, ctx);
- *tmp = ')';
+ *tmp++ = '\0';
+
+ /* Handle the nested structure like (STRUCT)(VAR->FIELD)->... */
+ if (*tmp == '(') {
+ char *close = find_matched_close_paren(tmp);
+
+ ctx->offset += tmp - arg;
+ if (!close) {
+ trace_probe_log_err(ctx->offset, DEREF_OPEN_BRACE);
+ return -EINVAL;
+ }
+ /* We expect a field access for typecast */
+ if (close[1] != '-' || close[2] != '>') {
+ trace_probe_log_err(ctx->offset + close - tmp + 1,
+ TYPECAST_REQ_FIELD);
+ return -EINVAL;
+ }
+ ctx->nested_level++;
+ if (ctx->nested_level > TRACEPROBE_MAX_NESTED_LEVEL) {
+ trace_probe_log_err(ctx->offset, TOO_MANY_NESTED);
+ return -E2BIG;
+ }
+ *close = '\0';
+
+ ctx->offset += 1; /* for the '(' */
+ /* We need to parse the nested one */
+ ret = parse_probe_arg(tmp + 1, find_fetch_type(NULL, ctx->flags),
+ pcode, end, ctx);
+ if (ret < 0)
+ return ret;
+ ctx->nested_level--;
+ clear_struct_btf(ctx);
+
+ tmp = close + 1;
+ nested = true;
+ }
+
+ ret = query_btf_struct(arg + 1, ctx);
if (ret < 0) {
trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
return -EINVAL;
}
- tmp++;
-
- ctx->offset += tmp - arg;
- ret = parse_btf_arg(tmp, pcode, end, ctx);
+ ctx->offset = orig_offset + tmp - arg;
+ /* If it is nested, tmp points to the field name. */
+ if (nested)
+ ret = parse_btf_field(tmp, ctx->last_struct, pcode, end, ctx);
+ else
+ ret = parse_btf_arg(tmp, pcode, end, ctx);
return ret;
}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 15758cc11fc6..8dcc65e4e1db 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -430,8 +430,11 @@ struct traceprobe_parse_context {
struct trace_probe *tp;
unsigned int flags;
int offset;
+ int nested_level;
};
+#define TRACEPROBE_MAX_NESTED_LEVEL 3
+
extern int traceprobe_parse_probe_arg(struct trace_probe *tp, int i,
const char *argv,
struct traceprobe_parse_context *ctx);
@@ -566,7 +569,9 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
C(TOO_MANY_ARGS, "Too many arguments are specified"), \
C(TOO_MANY_EARGS, "Too many entry arguments specified"), \
C(EVENT_TOO_BIG, "Event too big (too many fields?)"), \
- C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"),
+ C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"), \
+ C(TYPECAST_REQ_FIELD, "Typecast requires a field access"), \
+ C(TOO_MANY_NESTED, "Too many nested typecasts/dereferences"),
#undef C
#define C(a, b) TP_ERR_##a
^ permalink raw reply related
* [RFC PATCH 1/7] tracing/probes: Support typecast for various probe events
From: Masami Hiramatsu (Google) @ 2026-06-08 14:24 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178092865666.163648.10457567771536160909.stgit@devnote2>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Support BTF typecast feature on other probe events (but only if it is
kernel function entry or return.)
To support other probe events, we just need to use last_struct type
when we find a function parameter in parse_btf_arg().
This also update <tracefs>/README file to show struct typecast.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Documentation/trace/fprobetrace.rst | 3 +++
Documentation/trace/kprobetrace.rst | 4 ++++
kernel/trace/trace.c | 2 +-
kernel/trace/trace_probe.c | 12 +++++++-----
4 files changed, 15 insertions(+), 6 deletions(-)
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index b4c2ca3d02c1..7435ded2d66d 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -57,6 +57,9 @@ Synopsis of fprobe-events
(u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
(x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
and bitfield are supported.
+ (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ a pointer to STRUCT and then derference the pointer defined by
+ ->MEMBER.
(\*1) This is available only when BTF is enabled.
(\*2) only for the probe on function entry (offs == 0). Note, this argument access
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 3b6791c17e9b..f73614997d52 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -61,6 +61,10 @@ Synopsis of kprobe_events
(x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
"string", "ustring", "symbol", "symstr" and bitfield are
supported.
+ (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ a pointer to STRUCT and then derference the pointer defined by
+ ->MEMBER. Note that this is available only when the probe is
+ on function entry.
(\*1) only for the probe on function entry (offs == 0). Note, this argument access
is best effort, because depending on the argument type, it may be passed on
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..aa93e7b01146 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4325,7 +4325,7 @@ static const char readme_msg[] =
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
"\t $stack<index>, $stack, $retval, $comm, $arg<N>,\n"
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
- "\t <argname>[->field[->field|.field...]],\n"
+ "\t [(structname)]<argname>[->field[->field|.field...]],\n"
#endif
#else
"\t $stack<index>, $stack, $retval, $comm,\n"
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index fd1caa1f9723..609b156986c5 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -759,7 +759,10 @@ static int parse_btf_arg(char *varname,
return -ENOENT;
found:
- type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
+ if (ctx->struct_btf)
+ type = ctx->last_struct;
+ else
+ type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
found_type:
if (!type) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
@@ -836,10 +839,9 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
char *tmp;
int ret;
- /* Currently this only works for eprobes */
- if (!(ctx->flags & TPARG_FL_TEVENT)) {
- trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
- return -EINVAL;
+ if (!(tparg_is_function_entry(ctx->flags) || tparg_is_function_return(ctx->flags))) {
+ trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
+ return -EOPNOTSUPP;
}
tmp = strchr(arg, ')');
^ permalink raw reply related
* [RFC PATCH 0/7] tracing/probes: Add more typecast features
From: Masami Hiramatsu (Google) @ 2026-06-08 14:24 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest
Hi,
Here is a series of patches to introduce more typecast features
to probe events, which includes 1. expanding BTF typecast to
fprobe and kprobe events, 2. introducing container_of like typecst
option, 3. supporting nested typecast, 4. adding $current special
variable support, 5. adding per-cpu dereference support, 6. adding
a testcase to check typecasts.
Steve introduced BTF typecast feature for eprobe[1].
This series extends it and add more options:
1. Expanding BTF typecast to kprobe and fprobe.
(currently only function entry/exit)
2. Introduce container_of like typecast. This adds a "assigned
member" option to the typecast.
(STRUCT,MEMBER)VAR->ANOTHER_MEMBER
This casts VAR to STRUCT type but the VAR is as the address
of STRUCT.MEMBER. In C, it is:
container_of(VAR, STRUCT, MEMBER)->ANOTHER_MEMBER
3. Support nested typecast, e.g.
(STRUCT)((STRUCT2)VAR->MEMBER2)->MEMBER
the nest level must be smaller than 3.
4. Add $current variable to point "current" task_struct.
This is useful with typecast, e.g.
(task_struct)$current->pid
5. per-cpu dereference support.
+CPU(VAR) is the same as this_cpu_read(VAR), and
+PCPU(VAR) is the same as this_cpu_ptr(VAR).
Also, "this_cpu_ptr(VAR)" is available. This is good
with nesting expression.
(STRUCT)(this_cpu_ptr(VAR))->MEMBER
(However, it might be better to allow a special way to omit
parentheses for thi_cpu_ptr())
And added a test script to test part of them.
[1] https://lore.kernel.org/all/20260601130746.2139d926@gandalf.local.home/
---
Masami Hiramatsu (Google) (7):
tracing/probes: Support typecast for various probe events
tracing/probes: Support nested typecast
tracing/probes: Support field specifier option for typecast
tracing/probes: Add $current variable support
tracing/probes: Add +CPU() and +PCPU() dereference method to fetcharg
tracing/probes: Support reserved this_cpu_ptr() method
tracing/probes: Add a new testcase for BTF typecasts
Documentation/trace/eprobetrace.rst | 11 +
Documentation/trace/fprobetrace.rst | 11 +
Documentation/trace/kprobetrace.rst | 12 +
kernel/trace/trace.c | 6
kernel/trace/trace_probe.c | 312 +++++++++++++++-----
kernel/trace/trace_probe.h | 12 +
kernel/trace/trace_probe_tmpl.h | 33 ++
samples/trace_events/trace-events-sample.c | 38 ++
samples/trace_events/trace-events-sample.h | 34 ++
.../ftrace/test.d/dynevent/btf_probe_event.tc | 52 +++
10 files changed, 422 insertions(+), 99 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
--
Signature
^ permalink raw reply
* Re: [PATCH v8 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Breno Leitao @ 2026-06-08 14:15 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Miaohe Lin, linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team, Lance Yang, Andrew Morton,
Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <f1a742be-80cb-4256-b1f9-e50a0f83cb15@kernel.org>
On Fri, Jun 05, 2026 at 11:42:53AM +0200, David Hildenbrand (Arm) wrote:
> On 6/5/26 11:35, Breno Leitao wrote:
> > On Wed, Jun 03, 2026 at 10:33:04AM +0800, Miaohe Lin wrote:
> >> On 2026/6/2 17:41, David Hildenbrand (Arm) wrote:
> >>>
> >>> Races are fine. We might miss some pages, but that can happen on races either way.
> >>>
> >>>
> >>> I'd just do something like
> >>>
> >>> if (PageReserved(page))
> >>> return true;
> >>>
> >>> head = compound_head(page);
> >>
> >> If @head is split just after compound_head. And then @head is freed into buddy and re-allocated as slab
> >> page while @page is still in the buddy. We would panic on this scene as @head is PageSlab. But we were
> >> supposed to successfully handle @page. Or am I miss something?
> >
> > You're right that it is racy, but I think it is an acceptable race here.
> >
>
> I mean, any such races can currently already happen one way or the other?
>
> Really, the only way to not get races is to tryget the (compound)page,
> revalidate that the page is still part of the compound page.
>
> I'm not sure if that's really a good idea.
>
> But my memory is a bit vague in which scenarios we already hold a page reference
> here to prevent any concurrent freeing?
No, we don't hold one here in the case that matters.
HWPoisonKernelOwned() runs at the very top of get_any_page(), before
try_again: and before __get_hwpoison_page(). The first refcount taken in
the whole path is the folio_try_get() inside __get_hwpoison_page(), which
runs *after* the short-circuit.
So get_any_page() itself never holds a reference at the check -- the only way
one exists is if the caller passed MF_COUNT_INCREASED (count_increased ==
true).
So on the MCE/GHES path -- the one this panic option exists for -- no
reference is held when HWPoisonKernelOwned() does its compound_head() +
PageSlab()/PageTable()/PageLargeKmalloc() checks.
Given that, I'd rather keep it racy and take no refcount than add a
tryget + revalidate purely for this check. As I've said earleir, an operator
who enabled it has chosen to crash rather than run on corrupted memory;
mis-attributing one such rare, genuinely-poisoned page is within that contract.
^ permalink raw reply
* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
From: Peter Zijlstra @ 2026-06-08 14:06 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: bpf, Tengda Wu, Steven Rostedt, Mathieu Desnoyers,
Alexei Starovoitov, linux-trace-kernel, linux-kernel,
Josh Poimboeuf, jikos, mbenes, pmladek
In-Reply-To: <20260608220811.d4a0b58961cfb9eeb6bbbccb@kernel.org>
On Mon, Jun 08, 2026 at 10:08:11PM +0900, Masami Hiramatsu wrote:
> > > Anyway, I'm wondering what the purpose of this check here is, there is
> > > no real comment, and commit 5120d167e21c ("rethook: Remove warning
> > > messages printed for finding return address of a frame.") is just pure
> > > voodoo as well.
> >
> > FWIW, you should have had this discussion then.
>
> Indeed. The rethook is making a shadow stack by list, thus caller must
> guarantee the target process is blocked at least during this function.
>
> The commit messages suggest that when BPF takes a backtrace, it also
> includes other running tasks. Is that safe?
Well, you get to keep the pieces. At this point safe only pertains to
'doesn't-crash', all correctness is out the window.
I always forget the crazy BPF does ;-)
> > > Also, note the comment that goes with the usage of
> > > task_on_another_cpu(); that thing is racy as all heck.
> > >
> > > So it really comes down to what the purpose of this check is.
>
> This check has been introduced when it is copied from
> kretprobe_find_ret_addr(). It has the comment:
>
> * The @tsk must be 'current' or a task which is not running. @fp is a hint
>
> IIRC, I added this check to explicitly verify this condition.
Right, but it is a prescriptive comment, not an explanatory one. That
is, it doesn't explain the condition.
> > > I suspect the issue at hand is that tsk->rethook elements, such as
> > > iterated by __rethook_find_ret_addr() are not safe to be accessed for a
> > > running task.
> > >
> > > Notably while rethook_recycle() has some RCU thing on, that objpool
> > > thing (and the recycle name itself) seems to strongly suggest iterating
> > > these things is not sound (you could start with things from this task,
> > > hit a recycled entry and continue iterating rethooks from another task).
> > >
> > > Also note that the current check is also racy, nothing really prevents a
> > > wakeup from happening right after you observe task_is_running() being
> > > false. The task can then get scheduled in on another CPU and tear down
> > > its rethooks concurrent with __rethook_find_ret_addr().
>
> Yeah, but is there any way to ensure the task is blocked? Even if it is
> blocked, like TASK_UNINTERRUPTIBLE, unless holding the actual lock in
> the rethook, it may not be possible to ensure it?
>
> Of course, we could give up on checking within this function and leave
> everything to the caller to guarantee - as kretprobe does.
>
> BTW, the reason why we made it possible to pass tasks other than current
> is that the stack unwinding code itself supported unwinding tasks other
> than current, so we had no choice but to create this interface.
>
> However, it is a bad idea to check this in deep inside of unwinding.
This, you cannot take locks in unwinding. The only thing you can do is
try to do the best you can without crashing.
Typically unwind only happens on self -- this is natural, a task crashes
and unwinds itself, or a task does something (takes a lock, hits a
tracepoint, etc) and takes a snapshot of its own stack, and this is
safe.
Things like live-patch use task_call_func(), which ensures the callback
function is done while holding sufficient locks for the task to not
change state.
^ permalink raw reply
* Re: [syzbot] [trace?] KASAN: use-after-free Write in ring_buffer_read_page
From: Yash Suthar @ 2026-06-08 14:01 UTC (permalink / raw)
To: linux-trace-kernel
Cc: linux-kernel, rostedt, mhiramat, glider, nogikh,
mathieu.desnoyers, syzkaller-bugs, syzbot+2dd9d02f60775ce5c1fb
In-Reply-To: <CAG_fn=WVfi9=M8VjJQt8jCpXf-0D30-BQDCZZM+FX_0d8V+yhA@mail.gmail.com>
From syzbot initial report, I noticed that
ring_buffer_read_page() checks data_page->order against
buffer->subbuf_order , whereas in while
ring_buffer_subbuf_order_set() updates subbuf_size before
replacing the old pages. I think this allows reader to
use older spare page while observing a newer sub-buffer size.
That could explain the report (KASAN UAF, memset of 16308
bytes order 2 into an order 0 spare), while the AI
reproducer may hit a related race later in via copy_to_user()
tracing_buffers_read().
Before spending more time on a fix, does this sound
correct, or i am missing something in between?
Sincerely,
Yash Suthar
^ permalink raw reply
* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
From: Masami Hiramatsu @ 2026-06-08 13:08 UTC (permalink / raw)
To: Peter Zijlstra, bpf
Cc: Tengda Wu, Masami Hiramatsu, Steven Rostedt, Mathieu Desnoyers,
Alexei Starovoitov, linux-trace-kernel, linux-kernel,
Josh Poimboeuf, jikos, mbenes, pmladek
In-Reply-To: <20260608102326.GB3161497@noisy.programming.kicks-ass.net>
On Mon, 8 Jun 2026 12:23:26 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jun 08, 2026 at 11:34:49AM +0200, Peter Zijlstra wrote:
> >
> > +Live patching folks
> >
> > On Mon, Jun 08, 2026 at 09:52:37AM +0800, Tengda Wu wrote:
> >
> > > Background: We are verifying the support of live patches for functions that
> > > have a kretprobe. The specific verification method is as follows:
> > >
> > > We construct a function foo() that calls bar():
> > >
> > > void bar(void)
> > > {
> > > for (;;) {
> > > schedule();
> > > }
> > > }
> > >
> > > void foo(void)
> > > {
> > > bar();
> > > }
> > >
> > > A kretprobe is attached to bar():
> > >
> > > echo 'r:rp1 bar' > /sys/kernel/tracing/kprobe_events
> > > echo 1 > /sys/kernel/tracing/events/kprobes/rp1/enable
> > >
> > > Then foo() is triggered. The expected behavior is that bar() will call
> > > schedule() and yield the CPU.
> > >
> > > After that, the live patch is activated to attempt replacing the implementation
> > > of foo(). The expectation is that this should succeed.
> >
> > This wholly depends on how foo() calls bar(), if it is a normal call,
> > then no, it should not succeed, because foo() is still on the stack.
> >
> > If it is a tail-call, then yes, because foo() is no longer relevant.
> >
> > > However, in reality, because the task that called schedule() is still in the
> > > RUNNING state,
> >
> > So calling schedule() without setting state is dodgy in the first place.
> > Who is doing this? All wait primitives will set this to
> > TASK_UNINTERRUPTIBLE or something along those lines.
> >
> > > the condition task_is_running(tsk) inside rethook_find_ret_addr()
> > > is not satisfied, causing the function to return early. This, in turn,
> > > prevents stack_trace_save_tsk_reliable() from determining the stack as
> > > reliable, leading to a failure in activating the live patch.
> > >
> > > **Not sure if this is correct:**
> > >
> > > We believe that after a task voluntarily calls schedule(), when the stack
> > > is expected to be reliable, it is a safe time to activate a live patch.
> >
> > Calling schedule() without setting state is a no-op and really shouldn't
> > count much at all.
> >
> > > Additionally, a similar tsk->on_cpu check can be found elsewhere in the
> > > kernel (See task_on_another_cpu() in arch/x86/include/asm/unwind.h).
> > > Therefore, we propose changing the task_is_running(tsk) condition to
> > > tsk->on_cpu.
> >
> > Anyway, I'm wondering what the purpose of this check here is, there is
> > no real comment, and commit 5120d167e21c ("rethook: Remove warning
> > messages printed for finding return address of a frame.") is just pure
> > voodoo as well.
>
> FWIW, you should have had this discussion then.
Indeed. The rethook is making a shadow stack by list, thus caller must
guarantee the target process is blocked at least during this function.
The commit messages suggest that when BPF takes a backtrace, it also
includes other running tasks. Is that safe?
>
> > Also, note the comment that goes with the usage of
> > task_on_another_cpu(); that thing is racy as all heck.
> >
> > So it really comes down to what the purpose of this check is.
This check has been introduced when it is copied from
kretprobe_find_ret_addr(). It has the comment:
* The @tsk must be 'current' or a task which is not running. @fp is a hint
IIRC, I added this check to explicitly verify this condition.
> >
> > I suspect the issue at hand is that tsk->rethook elements, such as
> > iterated by __rethook_find_ret_addr() are not safe to be accessed for a
> > running task.
> >
> > Notably while rethook_recycle() has some RCU thing on, that objpool
> > thing (and the recycle name itself) seems to strongly suggest iterating
> > these things is not sound (you could start with things from this task,
> > hit a recycled entry and continue iterating rethooks from another task).
> >
> > Also note that the current check is also racy, nothing really prevents a
> > wakeup from happening right after you observe task_is_running() being
> > false. The task can then get scheduled in on another CPU and tear down
> > its rethooks concurrent with __rethook_find_ret_addr().
Yeah, but is there any way to ensure the task is blocked? Even if it is
blocked, like TASK_UNINTERRUPTIBLE, unless holding the actual lock in
the rethook, it may not be possible to ensure it?
Of course, we could give up on checking within this function and leave
everything to the caller to guarantee - as kretprobe does.
BTW, the reason why we made it possible to pass tasks other than current
is that the stack unwinding code itself supported unwinding tasks other
than current, so we had no choice but to create this interface.
However, it is a bad idea to check this in deep inside of unwinding.
> > Now, livepatch itself calls unwind from a proper context, but unwinds in
> > general are not. This rethook stuff doesn't seem to be sound in general.
>
> I suspect just entirely removing the check is the sanest option at this
> point. Callers that do it right (livepatch) are guaranteed consistent
> data, and the rest gets whatever pieces.
Agreed.
>
> Notably, unwind_next() holds rcu, so the iteration is protected from any
> of those rethook_node things getting freed. Its just that the iteration
> can go sideways and you might not get a sane answer.
>
> The very worst possible option is getting stuck in an infinite loop when
> concurrent with agressive rethook re-use or something daft like that,
> but that seems extremely unlikely.
OK, thanks for your review!
Tengda, can you send a patch to just remove the check?
Thank you,
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH] mm/memory-failure: trace: change memory_failure_event to ras subsystem
From: Lance Yang @ 2026-06-08 12:14 UTC (permalink / raw)
To: xieyuanbin1
Cc: david, qiuxu.zhuo, bp, akpm, rostedt, linmiaohe, nao.horiguchi,
mhiramat, mchehab+huawei, tony.luck, yi1.lai, linux-edac,
linux-kernel, linux-mm, linux-trace-kernel, torvalds, lilinjie8,
liaohua4, Lance Yang
In-Reply-To: <20260605081213.154660-1-xieyuanbin1@huawei.com>
On Fri, Jun 05, 2026 at 04:12:13PM +0800, Xie Yuanbin wrote:
>For historical version, commit 97f0b1345219 ("tracing: add trace event
>for memory-failure") introduced memory_failure_event in ras subsystem.
>commit 31807483d395 ("mm/memory-failure: remove the selection of RAS")
>changed memory_failure_event to memory_failure subsystem. This breaks
>the backward compatibility, some user programs rely on it.
>
>Change memory_failure_event to ras subsystem to keep backward
>compatibility.
>
>Fixes: 31807483d395 ("mm/memory-failure: remove the selection of RAS")
>
>Reported-by: Yi Lai <yi1.lai@intel.com>
>Reported-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
>Closes: https://lore.kernel.org/linux-mm/CY8PR11MB7134346A3E4BB28ECA28D6E989132@CY8PR11MB7134.namprd11.prod.outlook.com
>Cc: David Hildenbrand <david@kernel.org>
>Cc: Steven Rostedt <rostedt@goodmis.org>
>Cc: Borislav Petkov <bp@alien8.de>
>Cc: Andrew Morton <akpm@linux-foundation.org>
>Cc: Miaohe Lin <linmiaohe@huawei.com>
>Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com>
>---
> include/trace/events/memory-failure.h | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
>diff --git a/include/trace/events/memory-failure.h b/include/trace/events/memory-failure.h
>index aa57cc8f896b..7a8ee5d1a44e 100644
>--- a/include/trace/events/memory-failure.h
>+++ b/include/trace/events/memory-failure.h
>@@ -1,6 +1,10 @@
> /* SPDX-License-Identifier: GPL-2.0 */
> #undef TRACE_SYSTEM
>-#define TRACE_SYSTEM memory_failure
>+/*
>+ * For historical versions, memory_failure_event is in ras subsystem,
>+ * some user programs depend on it.
>+ */
>+#define TRACE_SYSTEM ras
> #define TRACE_INCLUDE_FILE memory-failure
>
> #if !defined(_TRACE_MEMORY_FAILURE_H) || defined(TRACE_HEADER_MULTI_READ)
>--
Thanks. Feel free to add:
Reviewed-by: Lance Yang <lance.yang@linux.dev>
^ permalink raw reply
* Re: [PATCH] mm/memory-failure: trace: change memory_failure_event to ras subsystem
From: Miaohe Lin @ 2026-06-08 11:27 UTC (permalink / raw)
To: Xie Yuanbin
Cc: linux-edac, linux-kernel, linux-mm, linux-trace-kernel, torvalds,
lilinjie8, liaohua4, david, qiuxu.zhuo, bp, akpm, rostedt,
nao.horiguchi, mhiramat, mchehab+huawei, tony.luck, yi1.lai
In-Reply-To: <20260605081213.154660-1-xieyuanbin1@huawei.com>
On 2026/6/5 16:12, Xie Yuanbin wrote:
> For historical version, commit 97f0b1345219 ("tracing: add trace event
> for memory-failure") introduced memory_failure_event in ras subsystem.
> commit 31807483d395 ("mm/memory-failure: remove the selection of RAS")
> changed memory_failure_event to memory_failure subsystem. This breaks
> the backward compatibility, some user programs rely on it.
>
> Change memory_failure_event to ras subsystem to keep backward
> compatibility.
>
> Fixes: 31807483d395 ("mm/memory-failure: remove the selection of RAS")
With David's comment addressed:
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Thanks.
.
^ permalink raw reply
* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
From: Peter Zijlstra @ 2026-06-08 10:23 UTC (permalink / raw)
To: Tengda Wu
Cc: Masami Hiramatsu, Steven Rostedt, Mathieu Desnoyers,
Alexei Starovoitov, linux-trace-kernel, linux-kernel,
Josh Poimboeuf, jikos, mbenes, pmladek
In-Reply-To: <20260608093449.GH4149641@noisy.programming.kicks-ass.net>
On Mon, Jun 08, 2026 at 11:34:49AM +0200, Peter Zijlstra wrote:
>
> +Live patching folks
>
> On Mon, Jun 08, 2026 at 09:52:37AM +0800, Tengda Wu wrote:
>
> > Background: We are verifying the support of live patches for functions that
> > have a kretprobe. The specific verification method is as follows:
> >
> > We construct a function foo() that calls bar():
> >
> > void bar(void)
> > {
> > for (;;) {
> > schedule();
> > }
> > }
> >
> > void foo(void)
> > {
> > bar();
> > }
> >
> > A kretprobe is attached to bar():
> >
> > echo 'r:rp1 bar' > /sys/kernel/tracing/kprobe_events
> > echo 1 > /sys/kernel/tracing/events/kprobes/rp1/enable
> >
> > Then foo() is triggered. The expected behavior is that bar() will call
> > schedule() and yield the CPU.
> >
> > After that, the live patch is activated to attempt replacing the implementation
> > of foo(). The expectation is that this should succeed.
>
> This wholly depends on how foo() calls bar(), if it is a normal call,
> then no, it should not succeed, because foo() is still on the stack.
>
> If it is a tail-call, then yes, because foo() is no longer relevant.
>
> > However, in reality, because the task that called schedule() is still in the
> > RUNNING state,
>
> So calling schedule() without setting state is dodgy in the first place.
> Who is doing this? All wait primitives will set this to
> TASK_UNINTERRUPTIBLE or something along those lines.
>
> > the condition task_is_running(tsk) inside rethook_find_ret_addr()
> > is not satisfied, causing the function to return early. This, in turn,
> > prevents stack_trace_save_tsk_reliable() from determining the stack as
> > reliable, leading to a failure in activating the live patch.
> >
> > **Not sure if this is correct:**
> >
> > We believe that after a task voluntarily calls schedule(), when the stack
> > is expected to be reliable, it is a safe time to activate a live patch.
>
> Calling schedule() without setting state is a no-op and really shouldn't
> count much at all.
>
> > Additionally, a similar tsk->on_cpu check can be found elsewhere in the
> > kernel (See task_on_another_cpu() in arch/x86/include/asm/unwind.h).
> > Therefore, we propose changing the task_is_running(tsk) condition to
> > tsk->on_cpu.
>
> Anyway, I'm wondering what the purpose of this check here is, there is
> no real comment, and commit 5120d167e21c ("rethook: Remove warning
> messages printed for finding return address of a frame.") is just pure
> voodoo as well.
FWIW, you should have had this discussion then.
> Also, note the comment that goes with the usage of
> task_on_another_cpu(); that thing is racy as all heck.
>
> So it really comes down to what the purpose of this check is.
>
> I suspect the issue at hand is that tsk->rethook elements, such as
> iterated by __rethook_find_ret_addr() are not safe to be accessed for a
> running task.
>
> Notably while rethook_recycle() has some RCU thing on, that objpool
> thing (and the recycle name itself) seems to strongly suggest iterating
> these things is not sound (you could start with things from this task,
> hit a recycled entry and continue iterating rethooks from another task).
>
> Also note that the current check is also racy, nothing really prevents a
> wakeup from happening right after you observe task_is_running() being
> false. The task can then get scheduled in on another CPU and tear down
> its rethooks concurrent with __rethook_find_ret_addr().
>
>
> Now, livepatch itself calls unwind from a proper context, but unwinds in
> general are not. This rethook stuff doesn't seem to be sound in general.
I suspect just entirely removing the check is the sanest option at this
point. Callers that do it right (livepatch) are guaranteed consistent
data, and the rest gets whatever pieces.
Notably, unwind_next() holds rcu, so the iteration is protected from any
of those rethook_node things getting freed. Its just that the iteration
can go sideways and you might not get a sane answer.
The very worst possible option is getting stuck in an infinite loop when
concurrent with agressive rethook re-use or something daft like that,
but that seems extremely unlikely.
^ permalink raw reply
* Re: [PATCH v2 6/6] x86/setup: prepend embedded bootconfig cmdline before parse_early_param
From: Masami Hiramatsu @ 2026-06-08 10:19 UTC (permalink / raw)
To: Breno Leitao
Cc: Andrew Morton, Nathan Chancellor, paulmck, Nicolas Schier,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
bpf, kernel-team
In-Reply-To: <20260605-bootconfig_using_tools-v2-6-d309f544b5f7@debian.org>
On Fri, 05 Jun 2026 05:03:37 -0700
Breno Leitao <leitao@debian.org> wrote:
> Call xbc_prepend_embedded_cmdline() in setup_arch() right after the
> CONFIG_CMDLINE merge and before strscpy(command_line, ...) so the
> build-time-rendered embedded bootconfig "kernel" subtree is part of
> boot_command_line by the time parse_early_param() runs. early_param()
> handlers (mem=, earlycon=, loglevel=, ...) now see values supplied via
> CONFIG_BOOT_CONFIG_EMBED_FILE without parsing bootconfig at runtime.
>
> Gate the prepend on the bootconfig opt-in: only fold in the embedded
> kernel.* keys when "bootconfig" is present on the command line, or
> CONFIG_BOOT_CONFIG_FORCE is set. Applying the embedded cmdline
> unconditionally would (a) diverge from how embedded init.* keys are
> treated and (b) break fail-safe recovery: a malformed embedded
> console=/mem= could panic the boot with no way for the admin to disable
> it by dropping "bootconfig" from the bootloader cmdline.
> cmdline_find_option_bool() runs before parse_early_param(), so the gate
> is cheap and correctly ordered.
>
> Select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG so the user-visible
> CONFIG_BOOT_CONFIG_EMBED_CMDLINE option becomes selectable on x86.
This seems like a dummy config. what code does depend on this flag?
>
> With this select in place, setup_boot_config() in init/main.c would
> otherwise render the embedded "kernel" subtree a second time via
> xbc_make_cmdline("kernel") into extra_command_line, duplicating every
> embedded kernel.* key in saved_command_line and making accumulating
> handlers (console=, earlycon=, ...) register the same value twice. Skip
> that render only when xbc_prepend_embedded_cmdline() actually prepended
> the keys, reported by xbc_embedded_cmdline_applied().
>
> Keying the skip on the prepend itself, rather than re-deriving the
> opt-in, keeps the two paths consistent even when setup_arch() and the
> runtime parser detect "bootconfig" differently (e.g. "bootconfig=1"):
> the keys are then rendered at runtime instead of being dropped.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/kernel/setup.c | 16 ++++++++++++++++
> init/main.c | 18 +++++++++++++++---
> 3 files changed, 32 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index f24810015234..f839795692b4 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -126,6 +126,7 @@ config X86
> select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
> select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP if NR_CPUS <= 4096
> select ARCH_SUPPORTS_CFI if X86_64
> + select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
> select ARCH_USES_CFI_TRAPS if X86_64 && CFI
> select ARCH_SUPPORTS_LTO_CLANG
> select ARCH_SUPPORTS_LTO_CLANG_THIN
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 46882ce79c3a..26a82a41f44c 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -6,6 +6,7 @@
> * parts of early kernel initialization.
> */
> #include <linux/acpi.h>
> +#include <linux/bootconfig.h>
> #include <linux/console.h>
> #include <linux/cpu.h>
> #include <linux/crash_dump.h>
> @@ -36,6 +37,7 @@
> #include <asm/bios_ebda.h>
> #include <asm/bugs.h>
> #include <asm/cacheinfo.h>
> +#include <asm/cmdline.h>
> #include <asm/coco.h>
> #include <asm/cpu.h>
> #include <asm/efi.h>
> @@ -924,6 +926,20 @@ void __init setup_arch(char **cmdline_p)
> builtin_cmdline_added = true;
> #endif
>
> + /*
> + * Honor the same opt-in as the runtime bootconfig parser: only fold
> + * the embedded kernel.* keys into the cmdline when "bootconfig" is
> + * present on the command line (or CONFIG_BOOT_CONFIG_FORCE is set).
> + * This keeps fail-safe recovery working -- dropping "bootconfig" from
> + * the bootloader cmdline disables the embedded keys -- so a malformed
> + * embedded console=/mem= cannot brick a boot with no way out. It also
> + * matches setup_boot_config(), which bails out under the same
> + * condition before parsing the embedded bootconfig at runtime.
> + */
> + if (IS_ENABLED(CONFIG_BOOT_CONFIG_FORCE) ||
> + cmdline_find_option_bool(boot_command_line, "bootconfig"))
> + xbc_prepend_embedded_cmdline(boot_command_line, COMMAND_LINE_SIZE);
> +
> strscpy(command_line, boot_command_line, COMMAND_LINE_SIZE);
> *cmdline_p = command_line;
>
> diff --git a/init/main.c b/init/main.c
> index e363232b428b..567f641a5731 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -378,12 +378,15 @@ static void __init setup_boot_config(void)
> int pos, ret;
> size_t size;
> char *err;
> + bool from_embedded = false;
>
> /* Cut out the bootconfig data even if we have no bootconfig option */
> data = get_boot_config_from_initrd(&size);
> /* If there is no bootconfig in initrd, try embedded one. */
> - if (!data)
> + if (!data) {
> data = xbc_get_embedded_bootconfig(&size);
> + from_embedded = true;
Even from embedded bootconfig, if the arch set
ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG=n, this must be applied to
the cmdline as we are doing.
> + }
>
> strscpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
> err = parse_args("bootconfig", tmp_cmdline, NULL, 0, 0, 0, NULL,
> @@ -421,8 +424,17 @@ static void __init setup_boot_config(void)
> } else {
> xbc_get_info(&ret, NULL);
> pr_info("Load bootconfig: %ld bytes %d nodes\n", (long)size, ret);
> - /* keys starting with "kernel." are passed via cmdline */
> - extra_command_line = xbc_make_cmdline("kernel");
> + /*
> + * keys starting with "kernel." are passed via cmdline. When
> + * this bootconfig came from the embedded source and
> + * setup_arch() already prepended the rendered "kernel" subtree
> + * to boot_command_line, rendering again here would duplicate
> + * the keys in saved_command_line and make accumulating handlers
> + * (console=, earlycon=, ...) re-register the same value. Skip
> + * only when the prepend really happened.
Also, this should mention ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG=n case.
Thank you,
> + */
> + if (!from_embedded || !xbc_embedded_cmdline_applied())
> + extra_command_line = xbc_make_cmdline("kernel");
> /* Also, "init." keys are init arguments */
> extra_init_args = xbc_make_cmdline("init");
> }
>
> --
> 2.53.0-Meta
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH] rethook: Use tsk->on_cpu to check task execution state
From: Peter Zijlstra @ 2026-06-08 9:34 UTC (permalink / raw)
To: Tengda Wu
Cc: Masami Hiramatsu, Steven Rostedt, Mathieu Desnoyers,
Alexei Starovoitov, linux-trace-kernel, linux-kernel,
Josh Poimboeuf, jikos, mbenes, pmladek
In-Reply-To: <679a1c8f-1e4d-4ae5-83e1-d0068e6de1a6@huaweicloud.com>
+Live patching folks
On Mon, Jun 08, 2026 at 09:52:37AM +0800, Tengda Wu wrote:
> Background: We are verifying the support of live patches for functions that
> have a kretprobe. The specific verification method is as follows:
>
> We construct a function foo() that calls bar():
>
> void bar(void)
> {
> for (;;) {
> schedule();
> }
> }
>
> void foo(void)
> {
> bar();
> }
>
> A kretprobe is attached to bar():
>
> echo 'r:rp1 bar' > /sys/kernel/tracing/kprobe_events
> echo 1 > /sys/kernel/tracing/events/kprobes/rp1/enable
>
> Then foo() is triggered. The expected behavior is that bar() will call
> schedule() and yield the CPU.
>
> After that, the live patch is activated to attempt replacing the implementation
> of foo(). The expectation is that this should succeed.
This wholly depends on how foo() calls bar(), if it is a normal call,
then no, it should not succeed, because foo() is still on the stack.
If it is a tail-call, then yes, because foo() is no longer relevant.
> However, in reality, because the task that called schedule() is still in the
> RUNNING state,
So calling schedule() without setting state is dodgy in the first place.
Who is doing this? All wait primitives will set this to
TASK_UNINTERRUPTIBLE or something along those lines.
> the condition task_is_running(tsk) inside rethook_find_ret_addr()
> is not satisfied, causing the function to return early. This, in turn,
> prevents stack_trace_save_tsk_reliable() from determining the stack as
> reliable, leading to a failure in activating the live patch.
>
> **Not sure if this is correct:**
>
> We believe that after a task voluntarily calls schedule(), when the stack
> is expected to be reliable, it is a safe time to activate a live patch.
Calling schedule() without setting state is a no-op and really shouldn't
count much at all.
> Additionally, a similar tsk->on_cpu check can be found elsewhere in the
> kernel (See task_on_another_cpu() in arch/x86/include/asm/unwind.h).
> Therefore, we propose changing the task_is_running(tsk) condition to
> tsk->on_cpu.
Anyway, I'm wondering what the purpose of this check here is, there is
no real comment, and commit 5120d167e21c ("rethook: Remove warning
messages printed for finding return address of a frame.") is just pure
voodoo as well.
Also, note the comment that goes with the usage of
task_on_another_cpu(); that thing is racy as all heck.
So it really comes down to what the purpose of this check is.
I suspect the issue at hand is that tsk->rethook elements, such as
iterated by __rethook_find_ret_addr() are not safe to be accessed for a
running task.
Notably while rethook_recycle() has some RCU thing on, that objpool
thing (and the recycle name itself) seems to strongly suggest iterating
these things is not sound (you could start with things from this task,
hit a recycled entry and continue iterating rethooks from another task).
Also note that the current check is also racy, nothing really prevents a
wakeup from happening right after you observe task_is_running() being
false. The task can then get scheduled in on another CPU and tear down
its rethooks concurrent with __rethook_find_ret_addr().
Now, livepatch itself calls unwind from a proper context, but unwinds in
general are not. This rethook stuff doesn't seem to be sound in general.
^ permalink raw reply
* Re: [PATCH 2/2] selftests/ftrace: Account for 8-byte aligned trace_marker_raw events
From: Masami Hiramatsu @ 2026-06-08 9:17 UTC (permalink / raw)
To: Hui Wang
Cc: rostedt, mathieu.desnoyers, pjw, linux-trace-kernel, shuah,
wangfushuai, linux-kselftest
In-Reply-To: <20260607072431.125633-3-hui.wang@canonical.com>
On Sun, 7 Jun 2026 15:24:31 +0800
Hui Wang <hui.wang@canonical.com> wrote:
> trace_marker_raw.tc assumes that the raw marker payload length
> reported in trace_pipe is the result of int((id + 3) / 4) * 4, but
> that is not true on kernels with CONFIG_HAVE_64BIT_ALIGNED_ACCESS
> enabled.
>
> With forced 8-byte alignment, the ring buffer event forces 8-byte
> alignment. The event length is stored in array[0], the payload data
> and id are placed in a struct raw_data_entry which is stored starting
> at array[1]. In this case, the printed payload data length is 8*N+4
> bytes.
>
> To make the testcase pass in this case, add a kconfig_enabled() helper
> and use it to detect CONFIG_HAVE_64BIT_ALIGNED_ACCESS so
> trace_marker_raw.tc can calculate the expected length correctly.
>
Hmm this fix lacks consideration for the environment.
> Assisted-by: Copilot:gpt-5.5
> Signed-off-by: Hui Wang <hui.wang@canonical.com>
> ---
> .../ftrace/test.d/00basic/trace_marker_raw.tc | 16 +++++++--
> .../testing/selftests/ftrace/test.d/functions | 33 +++++++++++++++++++
> 2 files changed, 46 insertions(+), 3 deletions(-)
>
> diff --git a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
> index 8e905d4fe6dd..beda0f8627b3 100644
> --- a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
> +++ b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
> @@ -15,6 +15,11 @@ is_little_endian() {
> }
>
> little=`is_little_endian`
> +raw_data_align=4
> +
> +if kconfig_enabled CONFIG_HAVE_64BIT_ALIGNED_ACCESS; then
Checking Kconfig is OK, but in this case, if the existence of the
dependent Kconfig file itself cannot be confirmed, this test should
return an unresolved error.
> + raw_data_align=8
> +fi
>
> make_str() {
> id=$1
> @@ -60,7 +65,8 @@ test_multiple_writes() {
> echo stop > trace_marker
>
> # Check to make sure the number of entries is the id (rounded up by 4)
> - awk '/.*: # [0-9a-f]* / {
> + # or is (((id + 3) rounded by 8) + 4) if raw_data_align is 8
> + awk -v data_align=$raw_data_align '/.*: # [0-9a-f]* / {
> print;
> cnt = -1;
> for (i = 0; i < NF; i++) {
> @@ -69,8 +75,12 @@ test_multiple_writes() {
> i++;
> cnt = strtonum("0x" $i);
> num = NF - (i + 1);
> - # The number of items is always rounded up by 4
> - cnt2 = int((cnt + 3) / 4) * 4;
> + # The number of items is rounded up by 4
> + # or is (8 * N + 4) if data_align is 8
> + if (data_align == 4)
> + cnt2 = int((cnt + 3) / 4) * 4;
> + else
> + cnt2 = int((cnt + 3) / 8) * 8 + 4;
> if (cnt2 != num) {
> exit 1;
> }
> diff --git a/tools/testing/selftests/ftrace/test.d/functions b/tools/testing/selftests/ftrace/test.d/functions
> index 826141e299e5..0f778087d81b 100644
> --- a/tools/testing/selftests/ftrace/test.d/functions
> +++ b/tools/testing/selftests/ftrace/test.d/functions
> @@ -177,6 +177,39 @@ check_awk_strtonum() { # strtonum is GNU awk extension
> awk 'BEGIN{strtonum("0x1")}'
> }
>
> +# a helper to check if a kconfig is enabled or not
> +# return value: 0 (if kconfig is enabled)
> +# 1 (if kconfig is not enabled)
> +# 2 (if the config files don't exist or are unreadable)
> +kconfig_enabled() { # config-name
> + local config="$1"
> + local uname_r=`uname -r`
> + local config_file
> +
> + case "$config" in
> + CONFIG_*) ;;
> + *) config="CONFIG_$config" ;;
> + esac
> +
> + if [ -f /proc/config.gz ] && zgrep --version >/dev/null 2>&1; then
> + zgrep -Eq "^${config}=(y|m)$" /proc/config.gz 2>/dev/null
Do not use zgrep (this requires to install zgrep pacakge) in this test,
instead, use more widely available `gzip -dc | grep ...`.
I would like to keep this runnable on a minimum environment.
> + return $?
> + fi
> +
> + for config_file in \
> + /boot/config-$uname_r \
> + /lib/modules/$uname_r/config \
> + /lib/modules/$uname_r/build/.config
Hmm, also I don't like this, because this highly depends on the environment.
Instead, we can add CONFIG_IKCONFIG_PROC=y in tools/testing/selftests/ftrace/config.
Thank you,
> + do
> + if [ -f "$config_file" ]; then
> + grep -Eq "^${config}=(y|m)$" "$config_file"
> + return $?
> + fi
> + done
> +
> + return 2
> +}
> +
> LOCALHOST=127.0.0.1
>
> yield() {
> --
> 2.43.0
>
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH 1/2] ring-buffer: Fix event length with forced 8-byte alignment
From: Masami Hiramatsu @ 2026-06-08 9:02 UTC (permalink / raw)
To: Hui Wang
Cc: rostedt, mathieu.desnoyers, pjw, linux-trace-kernel, shuah,
wangfushuai, linux-kselftest
In-Reply-To: <20260607072431.125633-2-hui.wang@canonical.com>
On Sun, 7 Jun 2026 15:24:30 +0800
Hui Wang <hui.wang@canonical.com> wrote:
> When RB_FORCE_8BYTE_ALIGNMENT is true, rb_calculate_event_length()
> reserves the space of event->array[0] for placing the data length and
> rb_update_event() stores the data length in event->array[0]
> accordingly. As a result the whole event length will add extra 4 bytes
> for sizeof(event.array[0]) unconditionally.
>
> But ring_buffer_event_length() only subtracts the
> sizeof(event->array[0]) for events larger than RB_MAX_SMALL_DATA +
> sizeof(event->array[0]). As a result, small events on architectures
> with RB_FORCE_8BYTE_ALIGNMENT=true report a data length that is 4
> bytes larger than expected.
>
> To fix it, add the RB_FORCE_8BYTE_ALIGNMENT as a condition to subtract
> the size of that length field whenever RB_FORCE_8BYTE_ALIGNMENT is
> true.
>
> This issue is observed in a riscv64 kernel with
> CONFIG_HAVE_64BIT_ALIGNED_ACCESS set to y, when we run ftrace selftest
> trace_marker_raw.tc, we get the weird log: for cases where the id is
> 1..100, the number of data field is 8*N, but once id exceeds 100, the
> number of data field becomes 8*N+4:
> # 1 buf: 58 00 00 00 80 5e d1 63 (number of data field is 8*1)
> ...
> # a buf: 58 ... (number of data field is 8*2)
> ...
> # 64 buf: 58 ... (number of data field is 8*13)
> # 65 buf: 58 ... (number of data field is 8*13+4)
>
> After applying this change, the number of data field keeps being 8*N+4
> consistently.
>
Good catch!
This looks good to me.
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Thanks,
> Fixes: 2271048d1b3b ("ring-buffer: Do 8 byte alignment for 64 bit that can not handle 4 byte align")
> Signed-off-by: Hui Wang <hui.wang@canonical.com>
> ---
> kernel/trace/ring_buffer.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 56a328e94395..d9af2bbaf9c0 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -270,7 +270,8 @@ unsigned ring_buffer_event_length(struct ring_buffer_event *event)
> if (event->type_len > RINGBUF_TYPE_DATA_TYPE_LEN_MAX)
> return length;
> length -= RB_EVNT_HDR_SIZE;
> - if (length > RB_MAX_SMALL_DATA + sizeof(event->array[0]))
> + if (length > RB_MAX_SMALL_DATA + sizeof(event->array[0]) ||
> + RB_FORCE_8BYTE_ALIGNMENT)
> length -= sizeof(event->array[0]);
> return length;
> }
> --
> 2.43.0
>
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [PATCH v3 13/13] verification/rvgen: Remove dead code
From: Nam Cao @ 2026-06-08 8:57 UTC (permalink / raw)
To: Gabriele Monaco, Steven Rostedt, Wander Lairson Costa,
linux-trace-kernel, linux-kernel
Cc: Nam Cao
In-Reply-To: <cover.1780908661.git.namcao@linutronix.de>
The conversion to use Lark left some dead code behind. Remove them.
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
---
v3: delete unused import
---
tools/verification/rvgen/rvgen/automata.py | 157 ---------------------
tools/verification/rvgen/rvgen/dot2k.py | 29 +---
2 files changed, 1 insertion(+), 185 deletions(-)
diff --git a/tools/verification/rvgen/rvgen/automata.py b/tools/verification/rvgen/rvgen/automata.py
index a3be327c2a73..bfc997f630d7 100644
--- a/tools/verification/rvgen/rvgen/automata.py
+++ b/tools/verification/rvgen/rvgen/automata.py
@@ -9,9 +9,6 @@
# Documentation/trace/rv/deterministic_automata.rst
import ntpath
-import re
-from typing import Iterator
-from itertools import islice
import lark
@@ -356,19 +353,6 @@ class State:
self.name = name
self.inv = inv
-class _ConstraintKey:
- """Base class for constraint keys."""
-
-class _StateConstraintKey(_ConstraintKey, int):
- """Key for a state constraint. Under the hood just state_id."""
- def __new__(cls, state_id: int):
- return super().__new__(cls, state_id)
-
-class _EventConstraintKey(_ConstraintKey, tuple):
- """Key for an event constraint. Under the hood just tuple(state_id,event_id)."""
- def __new__(cls, state_id: int, event_id: int):
- return super().__new__(cls, (state_id, event_id))
-
class AutomataError(Exception):
"""Exception raised for errors in automata parsing and validation.
@@ -387,28 +371,10 @@ class Automata:
invalid_state_str = "INVALID_STATE"
init_marker = "__init_"
- node_marker = "{node"
- # val can be numerical, uppercase (constant or macro), lowercase (parameter or function)
- # only numerical values should have units
- constraint_rule = re.compile(r"""
- ^
- (?P<env>[a-zA-Z_][a-zA-Z0-9_]+) # C-like identifier for the env var
- (?P<op>[!<=>]{1,2}) # operator
- (?P<val>
- [0-9]+ | # numerical value
- [A-Z_]+\(\) | # macro
- [A-Z_]+ | # constant
- [a-z_]+\(\) | # function
- [a-z_]+ # parameter
- )
- (?P<unit>[a-z]{1,2})? # optional unit for numerical values
- """, re.VERBOSE)
- constraint_reset = re.compile(r"^reset\((?P<env>[a-zA-Z_][a-zA-Z0-9_]+)\)")
def __init__(self, file_path, model_name=None):
self.__dot_path = file_path
self.name = model_name or self.__get_model_name()
- self.__dot_lines = self.__open_dot()
self.__parse_tree = ParseTree(file_path)
self.transitions = self.__parse_transitions()
self.states, self.initial_state, self.final_states = self.__parse_states()
@@ -435,57 +401,6 @@ class Automata:
return model_name
- def __open_dot(self) -> list[str]:
- dot_lines = []
- try:
- with open(self.__dot_path) as dot_file:
- dot_lines = dot_file.readlines()
- except OSError as exc:
- raise AutomataError(exc.strerror) from exc
-
- if not dot_lines:
- raise AutomataError(f"{self.__dot_path} is empty")
-
- # checking the first line:
- line = dot_lines[0].split()
-
- if len(line) < 2 or line[0] != "digraph" or line[1] != "state_automaton":
- raise AutomataError(f"Not a valid .dot format: {self.__dot_path}")
-
- return dot_lines
-
- def __get_cursor_begin_states(self) -> int:
- for cursor, line in enumerate(self.__dot_lines):
- split_line = line.split()
-
- if len(split_line) and split_line[0] == self.node_marker:
- return cursor
-
- raise AutomataError("Could not find a beginning state")
-
- def __get_cursor_begin_events(self) -> int:
- state = 0
- cursor = 0 # make pyright happy
-
- for cursor, line in enumerate(self.__dot_lines):
- line = line.split()
- if not line:
- continue
-
- if state == 0:
- if line[0] == self.node_marker:
- state = 1
- elif line[0] != self.node_marker:
- break
- else:
- raise AutomataError("Could not find beginning event")
-
- cursor += 1 # skip initial state transition
- if cursor == len(self.__dot_lines):
- raise AutomataError("Dot file ended after event beginning")
-
- return cursor
-
def __parse_transitions(self):
transitions = []
@@ -542,51 +457,6 @@ class Automata:
states.insert(0, initial_state)
return states, initial_state, final_states
- def __get_state_variables(self) -> tuple[list[str], str, list[str]]:
- # wait for node declaration
- states = []
- final_states = []
- initial_state = ""
-
- has_final_states = False
- cursor = self.__get_cursor_begin_states()
-
- # process nodes
- for line in islice(self.__dot_lines, cursor, None):
- split_line = line.split()
- if not split_line or split_line[0] != self.node_marker:
- break
-
- raw_state = split_line[-1]
-
- # "enabled_fired"}; -> enabled_fired
- state = raw_state.replace('"', '').replace('};', '').replace(',', '_')
- if state.startswith(self.init_marker):
- initial_state = state[len(self.init_marker):]
- else:
- states.append(state)
- if "doublecircle" in line:
- final_states.append(state)
- has_final_states = True
-
- if "ellipse" in line:
- final_states.append(state)
- has_final_states = True
-
- if not initial_state:
- raise AutomataError("The automaton doesn't have an initial state")
-
- states = sorted(set(states))
- states.remove(initial_state)
-
- # Insert the initial state at the beginning of the states
- states.insert(0, initial_state)
-
- if not has_final_states:
- final_states.append(initial_state)
-
- return states, initial_state, final_states
-
def __get_event_variables(self) -> tuple[list[str], list[str]]:
events: list[str] = []
envs: list[str] = []
@@ -609,26 +479,6 @@ class Automata:
return sorted(set(events)), sorted(set(envs))
- def _split_constraint_expr(self, constr: list[str]) -> Iterator[tuple[str,
- str | None]]:
- """
- Get a list of strings of the type constr1 && constr2 and returns a list of
- constraints and separators: [[constr1,"&&"],[constr2,None]]
- """
- exprs = []
- seps = []
- for c in constr:
- while "&&" in c or "||" in c:
- a = c.find("&&")
- o = c.find("||")
- pos = a if o < 0 or 0 < a < o else o
- exprs.append(c[:pos].replace(" ", ""))
- seps.append(c[pos:pos + 2].replace(" ", ""))
- c = c[pos + 2:].replace(" ", "")
- exprs.append(c)
- seps.append(None)
- return zip(exprs, seps)
-
def __extract_env_var(self, constraint: ConstraintCondition):
if constraint.unit:
self.env_types[constraint.env] = constraint.unit
@@ -697,10 +547,3 @@ class Automata:
def is_hybrid_automata(self) -> bool:
return bool(self.envs)
-
- def is_event_constraint(self, key: _ConstraintKey) -> bool:
- """
- Given the key in self.constraints return true if it is an event
- constraint, false if it is a state constraint
- """
- return isinstance(key, _EventConstraintKey)
diff --git a/tools/verification/rvgen/rvgen/dot2k.py b/tools/verification/rvgen/rvgen/dot2k.py
index e4b6c7c09170..a080b92334f9 100644
--- a/tools/verification/rvgen/rvgen/dot2k.py
+++ b/tools/verification/rvgen/rvgen/dot2k.py
@@ -8,12 +8,9 @@
# For further information, see:
# Documentation/trace/rv/da_monitor_synthesis.rst
-from collections import deque
from .dot2c import Dot2c
from .generator import Monitor
-from .automata import _EventConstraintKey, _StateConstraintKey, AutomataError
-from .automata import ConstraintCondition
-
+from .automata import ConstraintCondition, AutomataError
class dot2k(Monitor, Dot2c):
template_dir = "dot2k"
@@ -217,9 +214,6 @@ class ha2k(dot2k):
value *= 10**9
return str(value) + "ull"
- def __parse_single_constraint(self, rule: dict, value: str) -> str:
- return f"ha_get_env(ha_mon, {rule["env"]}{self.enum_suffix}, time_ns) {rule["op"]} {value}"
-
def __parse_guard_rule(self, rule) -> list[str]:
buff = []
for c, sep in rule.rules:
@@ -233,12 +227,6 @@ class ha2k(dot2k):
buff.append(cond)
return buff
- def __get_constraint_env(self, constr: str) -> str:
- """Extract the second argument from an ha_ function"""
- env = constr.split("(")[1].split()[1].rstrip(")").rstrip(",")
- assert env.removesuffix(f"_{self.name}") in self.envs
- return env
-
def __start_to_invariant_check(self, inv: ConstraintCondition) -> str:
# by default assume the timer has ns expiration
clock_type = "ns"
@@ -249,21 +237,6 @@ class ha2k(dot2k):
return f"return ha_check_invariant_{clock_type}(ha_mon, {inv.env}_{self.name}, time_ns, {value})"
- def __start_to_conv(self, constr: str) -> str:
- """
- Undo the storage conversion done by ha_start_timer_
- """
- return "ha_inv_to_guard" + constr[constr.find("("):]
-
- def __parse_timer_constraint(self, rule: dict, value: str) -> str:
- # by default assume the timer has ns expiration
- clock_type = "ns"
- if self.env_types.get(rule["env"]) == "j":
- clock_type = "jiffy"
-
- return (f"ha_start_timer_{clock_type}(ha_mon, {rule["env"]}{self.enum_suffix},"
- f" {value}, time_ns)")
-
def __parse_invariant(self, inv):
# by default assume the timer has ns expiration
clock_type = "ns"
--
2.47.3
^ permalink raw reply related
* [PATCH v3 12/13] verification/rvgen: Remove the old state variables
From: Nam Cao @ 2026-06-08 8:57 UTC (permalink / raw)
To: Gabriele Monaco, Steven Rostedt, Wander Lairson Costa,
linux-trace-kernel, linux-kernel
Cc: Nam Cao
In-Reply-To: <cover.1780908661.git.namcao@linutronix.de>
The state variables (states, initial_state, final_states) only capture the
states' names and have less information than their Lark-based counterparts.
Switch to use the new state variables and delete these old ones.
Signed-off-by: Nam Cao <namcao@linutronix.de>
---
tools/verification/rvgen/rvgen/automata.py | 9 ++++-----
tools/verification/rvgen/rvgen/dot2c.py | 10 +++++-----
tools/verification/rvgen/rvgen/dot2k.py | 8 ++++----
3 files changed, 13 insertions(+), 14 deletions(-)
diff --git a/tools/verification/rvgen/rvgen/automata.py b/tools/verification/rvgen/rvgen/automata.py
index 4c302f5cba68..a3be327c2a73 100644
--- a/tools/verification/rvgen/rvgen/automata.py
+++ b/tools/verification/rvgen/rvgen/automata.py
@@ -411,8 +411,7 @@ class Automata:
self.__dot_lines = self.__open_dot()
self.__parse_tree = ParseTree(file_path)
self.transitions = self.__parse_transitions()
- self._states, self._initial_state, self._final_states = self.__parse_states()
- self.states, self.initial_state, self.final_states = self.__get_state_variables()
+ self.states, self.initial_state, self.final_states = self.__parse_states()
self.env_types = {}
self.env_stored = set()
self.constraint_vars = set()
@@ -603,7 +602,7 @@ class Automata:
envs.append(c.env)
self.__extract_env_var(c)
- for state in self._states:
+ for state in self.states:
if state.inv:
envs.append(state.inv.env)
self.__extract_env_var(state.inv)
@@ -639,7 +638,7 @@ class Automata:
def __create_matrix(self) -> list[list[str]]:
# transform the array into a dictionary
events = self.events
- states = [s.name for s in self._states]
+ states = [s.name for s in self.states]
events_dict = {}
states_dict = {}
nr_event = 0
@@ -675,7 +674,7 @@ class Automata:
for j in range(len(self.states)):
if self.function[j][i] != self.invalid_state_str:
curr_event_used += 1
- if self.function[j][i] == self.initial_state:
+ if self.function[j][i] == self.initial_state.name:
curr_event_will_init += 1
if self.function[0][i] != self.invalid_state_str:
curr_event_from_init = True
diff --git a/tools/verification/rvgen/rvgen/dot2c.py b/tools/verification/rvgen/rvgen/dot2c.py
index fc85ba1f649e..22938ce1bf6c 100644
--- a/tools/verification/rvgen/rvgen/dot2c.py
+++ b/tools/verification/rvgen/rvgen/dot2c.py
@@ -29,10 +29,10 @@ class Dot2c(Automata):
def __get_enum_states_content(self) -> list[str]:
buff = []
- buff.append(f"\t{self.initial_state}{self.enum_suffix},")
+ buff.append(f"\t{self.initial_state.name}{self.enum_suffix},")
for state in self.states:
if state != self.initial_state:
- buff.append(f"\t{state}{self.enum_suffix},")
+ buff.append(f"\t{state.name}{self.enum_suffix},")
buff.append(f"\tstate_max{self.enum_suffix},")
return buff
@@ -142,7 +142,7 @@ class Dot2c(Automata):
def format_aut_init_states_string(self) -> list[str]:
buff = []
buff.append("\t.state_names = {")
- buff.append(self.__get_string_vector_per_line_content(self.states))
+ buff.append(self.__get_string_vector_per_line_content([s.name for s in self.states]))
buff.append("\t},")
return buff
@@ -159,7 +159,7 @@ class Dot2c(Automata):
return buff
def __get_max_strlen_of_states(self) -> int:
- max_state_name = len(max(self.states, key=len))
+ max_state_name = max((len(s.name) for s in self.states))
return max(max_state_name, len(self.invalid_state_str))
def get_aut_init_function(self) -> str:
@@ -199,7 +199,7 @@ class Dot2c(Automata):
return buff
def get_aut_init_initial_state(self) -> str:
- return self.initial_state
+ return self.initial_state.name
def format_aut_init_initial_state(self) -> list[str]:
buff = []
diff --git a/tools/verification/rvgen/rvgen/dot2k.py b/tools/verification/rvgen/rvgen/dot2k.py
index dc6d6f33729b..e4b6c7c09170 100644
--- a/tools/verification/rvgen/rvgen/dot2k.py
+++ b/tools/verification/rvgen/rvgen/dot2k.py
@@ -179,7 +179,7 @@ class ha2k(dot2k):
self.trace_h = self._read_template_file("trace_hybrid.h")
self.has_invariant = False
self.has_guard = False
- for state in self._states:
+ for state in self.states:
if state.inv:
self.has_invariant = True
for transition in self.transitions:
@@ -314,7 +314,7 @@ f"""static inline bool ha_verify_invariants(struct ha_monitor *ha_mon,
{{"""]
_else = ""
- for state in self._states:
+ for state in self.states:
if not state.inv:
continue
@@ -382,7 +382,7 @@ f"""static inline void ha_setup_invariants(struct ha_monitor *ha_mon,
buff.append(f"\tif ({condition_str})\n\t\treturn;")
_else = ""
- for state in self._states:
+ for state in self.states:
inv = state.inv
if not inv:
continue
@@ -391,7 +391,7 @@ f"""static inline void ha_setup_invariants(struct ha_monitor *ha_mon,
buff.append(f"\t\t{inv};")
_else = "else "
- for state in self._states:
+ for state in self.states:
inv = state.inv
if not inv:
continue
--
2.47.3
^ permalink raw reply related
* [PATCH v3 07/13] rv: Simplify hybrid automata monitors's clock variables
From: Nam Cao @ 2026-06-08 8:57 UTC (permalink / raw)
To: Gabriele Monaco, Steven Rostedt, Wander Lairson Costa,
linux-trace-kernel, linux-kernel
Cc: Nam Cao
In-Reply-To: <cover.1780908661.git.namcao@linutronix.de>
Hybrid automata monitors's clock variables have two different
representations:
- The invariant representation, which is the timestamp when the invariant
expires
- The guard representation, which is the timestamp when the clock is last
reset
This dual representation makes the logic quite difficult to follow (well,
at least for me). It also complicates the monitors and the generation tool,
as it requires conversion back and forth between the representation.
Simplify by using the clock variables for a single purpose: storing the
time stamp since the clock is last reset.
This also allows simplifying rvgen, which will be done in a follow-up
commit.
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
---
v3:
- fix wrongly passing expire to ha_invariant_passed_jiffy()
- typo
---
include/rv/ha_monitor.h | 64 ++++++------------------
kernel/trace/rv/monitors/nomiss/nomiss.c | 18 +------
kernel/trace/rv/monitors/stall/stall.c | 2 +-
3 files changed, 19 insertions(+), 65 deletions(-)
diff --git a/include/rv/ha_monitor.h b/include/rv/ha_monitor.h
index 28d3c74cabfc..9144b4c06f3f 100644
--- a/include/rv/ha_monitor.h
+++ b/include/rv/ha_monitor.h
@@ -327,19 +327,8 @@ static inline void __ha_monitor_timer_callback(struct ha_monitor *ha_mon)
}
/*
- * The clock variables have 2 different representations in the env_store:
- * - The guard representation is the timestamp of the last reset
- * - The invariant representation is the timestamp when the invariant expires
- * As the representations are incompatible, care must be taken when switching
- * between them: the invariant representation can only be used when starting a
- * timer when the previous representation was guard (e.g. no other invariant
- * started since the last reset operation).
- * Likewise, switching from invariant to guard representation without a reset
- * can be done only by subtracting the exact value used to start the invariant.
- *
- * Reading the environment variable (ha_get_clk) also reflects this difference
- * any reads in states that have an invariant return the (possibly negative)
- * time since expiration, other reads return the time since last reset.
+ * The clock variables store the time epoch - the timestamp when the clock was last reset.
+ * They are read by subtracting the time epoch from the current time.
*/
/*
@@ -353,31 +342,21 @@ static inline void ha_reset_clk_ns(struct ha_monitor *ha_mon, enum envs env, u64
{
WRITE_ONCE(ha_mon->env_store[env], time_ns);
}
-static inline void ha_set_invariant_ns(struct ha_monitor *ha_mon, enum envs env,
- u64 value, u64 time_ns)
-{
- WRITE_ONCE(ha_mon->env_store[env], time_ns + value);
-}
-static inline bool ha_check_invariant_ns(struct ha_monitor *ha_mon,
- enum envs env, u64 time_ns)
+static inline bool ha_check_invariant_ns(struct ha_monitor *ha_mon, enum envs env,
+ u64 time_ns, u64 expire_ns)
{
- return READ_ONCE(ha_mon->env_store[env]) >= time_ns;
+ return READ_ONCE(ha_mon->env_store[env]) >= time_ns - expire_ns;
}
/*
* ha_invariant_passed_ns - prepare the invariant and return the time since reset
*/
-static inline u64 ha_invariant_passed_ns(struct ha_monitor *ha_mon, enum envs env,
- u64 expire, u64 time_ns)
+static inline u64 ha_invariant_passed_ns(struct ha_monitor *ha_mon, enum envs env, u64 time_ns)
{
- u64 passed = 0;
-
if (env < 0 || env >= ENV_MAX_STORED)
return 0;
if (ha_monitor_env_invalid(ha_mon, env))
return 0;
- passed = ha_get_env(ha_mon, env, time_ns);
- ha_set_invariant_ns(ha_mon, env, expire - passed, time_ns);
- return passed;
+ return ha_get_env(ha_mon, env, time_ns);
}
/*
@@ -391,32 +370,21 @@ static inline void ha_reset_clk_jiffy(struct ha_monitor *ha_mon, enum envs env)
{
WRITE_ONCE(ha_mon->env_store[env], get_jiffies_64());
}
-static inline void ha_set_invariant_jiffy(struct ha_monitor *ha_mon,
- enum envs env, u64 value)
+static inline bool ha_check_invariant_jiffy(struct ha_monitor *ha_mon, enum envs env,
+ u64 time_ns, u64 expire_jiffy)
{
- WRITE_ONCE(ha_mon->env_store[env], get_jiffies_64() + value);
-}
-static inline bool ha_check_invariant_jiffy(struct ha_monitor *ha_mon,
- enum envs env, u64 time_ns)
-{
- return time_after64(READ_ONCE(ha_mon->env_store[env]), get_jiffies_64());
-
+ return time_after64(READ_ONCE(ha_mon->env_store[env]), get_jiffies_64() - expire_jiffy);
}
/*
* ha_invariant_passed_jiffy - prepare the invariant and return the time since reset
*/
-static inline u64 ha_invariant_passed_jiffy(struct ha_monitor *ha_mon, enum envs env,
- u64 expire, u64 time_ns)
+static inline u64 ha_invariant_passed_jiffy(struct ha_monitor *ha_mon, enum envs env, u64 time_ns)
{
- u64 passed = 0;
-
if (env < 0 || env >= ENV_MAX_STORED)
return 0;
if (ha_monitor_env_invalid(ha_mon, env))
return 0;
- passed = ha_get_env(ha_mon, env, time_ns);
- ha_set_invariant_jiffy(ha_mon, env, expire - passed);
- return passed;
+ return ha_get_env(ha_mon, env, time_ns);
}
/*
@@ -463,14 +431,14 @@ static inline void ha_setup_timer(struct ha_monitor *ha_mon)
static inline void ha_start_timer_jiffy(struct ha_monitor *ha_mon, enum envs env,
u64 expire, u64 time_ns)
{
- u64 passed = ha_invariant_passed_jiffy(ha_mon, env, expire, time_ns);
+ u64 passed = ha_invariant_passed_jiffy(ha_mon, env, time_ns);
mod_timer(&ha_mon->timer, get_jiffies_64() + expire - passed);
}
static inline void ha_start_timer_ns(struct ha_monitor *ha_mon, enum envs env,
u64 expire, u64 time_ns)
{
- u64 passed = ha_invariant_passed_ns(ha_mon, env, expire, time_ns);
+ u64 passed = ha_invariant_passed_ns(ha_mon, env, time_ns);
ha_start_timer_jiffy(ha_mon, ENV_MAX_STORED,
nsecs_to_jiffies(expire - passed + TICK_NSEC - 1), time_ns);
@@ -516,7 +484,7 @@ static inline void ha_start_timer_ns(struct ha_monitor *ha_mon, enum envs env,
u64 expire, u64 time_ns)
{
int mode = HRTIMER_MODE_REL_HARD;
- u64 passed = ha_invariant_passed_ns(ha_mon, env, expire, time_ns);
+ u64 passed = ha_invariant_passed_ns(ha_mon, env, time_ns);
if (RV_MON_TYPE == RV_MON_PER_CPU)
mode |= HRTIMER_MODE_PINNED;
@@ -525,7 +493,7 @@ static inline void ha_start_timer_ns(struct ha_monitor *ha_mon, enum envs env,
static inline void ha_start_timer_jiffy(struct ha_monitor *ha_mon, enum envs env,
u64 expire, u64 time_ns)
{
- u64 passed = ha_invariant_passed_jiffy(ha_mon, env, expire, time_ns);
+ u64 passed = ha_invariant_passed_jiffy(ha_mon, env, time_ns);
ha_start_timer_ns(ha_mon, ENV_MAX_STORED,
jiffies_to_nsecs(expire - passed), time_ns);
diff --git a/kernel/trace/rv/monitors/nomiss/nomiss.c b/kernel/trace/rv/monitors/nomiss/nomiss.c
index 8ead8783c29f..515ece5ce0ca 100644
--- a/kernel/trace/rv/monitors/nomiss/nomiss.c
+++ b/kernel/trace/rv/monitors/nomiss/nomiss.c
@@ -57,24 +57,12 @@ static inline bool ha_verify_invariants(struct ha_monitor *ha_mon,
enum states next_state, u64 time_ns)
{
if (curr_state == ready_nomiss)
- return ha_check_invariant_ns(ha_mon, clk_nomiss, time_ns);
+ return ha_check_invariant_ns(ha_mon, clk_nomiss, time_ns, DEADLINE_NS(ha_mon));
else if (curr_state == running_nomiss)
- return ha_check_invariant_ns(ha_mon, clk_nomiss, time_ns);
+ return ha_check_invariant_ns(ha_mon, clk_nomiss, time_ns, DEADLINE_NS(ha_mon));
return true;
}
-static inline void ha_convert_inv_guard(struct ha_monitor *ha_mon,
- enum states curr_state, enum events event,
- enum states next_state, u64 time_ns)
-{
- if (curr_state == next_state)
- return;
- if (curr_state == ready_nomiss)
- ha_inv_to_guard(ha_mon, clk_nomiss, DEADLINE_NS(ha_mon), time_ns);
- else if (curr_state == running_nomiss)
- ha_inv_to_guard(ha_mon, clk_nomiss, DEADLINE_NS(ha_mon), time_ns);
-}
-
static inline bool ha_verify_guards(struct ha_monitor *ha_mon,
enum states curr_state, enum events event,
enum states next_state, u64 time_ns)
@@ -122,8 +110,6 @@ static bool ha_verify_constraint(struct ha_monitor *ha_mon,
if (!ha_verify_invariants(ha_mon, curr_state, event, next_state, time_ns))
return false;
- ha_convert_inv_guard(ha_mon, curr_state, event, next_state, time_ns);
-
if (!ha_verify_guards(ha_mon, curr_state, event, next_state, time_ns))
return false;
diff --git a/kernel/trace/rv/monitors/stall/stall.c b/kernel/trace/rv/monitors/stall/stall.c
index 3c38fb1a0159..b265578f845c 100644
--- a/kernel/trace/rv/monitors/stall/stall.c
+++ b/kernel/trace/rv/monitors/stall/stall.c
@@ -38,7 +38,7 @@ static inline bool ha_verify_invariants(struct ha_monitor *ha_mon,
enum states next_state, u64 time_ns)
{
if (curr_state == enqueued_stall)
- return ha_check_invariant_jiffy(ha_mon, clk_stall, time_ns);
+ return ha_check_invariant_jiffy(ha_mon, clk_stall, time_ns, threshold_jiffies);
return true;
}
--
2.47.3
^ permalink raw reply related
* [PATCH v3 09/13] verification/rvgen: Delete __parse_constraint()
From: Nam Cao @ 2026-06-08 8:57 UTC (permalink / raw)
To: Gabriele Monaco, Steven Rostedt, Wander Lairson Costa,
linux-trace-kernel, linux-kernel
Cc: Nam Cao
In-Reply-To: <cover.1780908661.git.namcao@linutronix.de>
All previous users of self.invariants and self.guards have been converted
to the Lark parser, delete __parse_constraints() and its associates.
Signed-off-by: Nam Cao <namcao@linutronix.de>
---
tools/verification/rvgen/rvgen/dot2k.py | 67 ++-----------------------
1 file changed, 4 insertions(+), 63 deletions(-)
diff --git a/tools/verification/rvgen/rvgen/dot2k.py b/tools/verification/rvgen/rvgen/dot2k.py
index 6d346a718a39..a38ef735a861 100644
--- a/tools/verification/rvgen/rvgen/dot2k.py
+++ b/tools/verification/rvgen/rvgen/dot2k.py
@@ -177,7 +177,6 @@ class ha2k(dot2k):
if not self.is_hybrid_automata():
raise AutomataError("Detected deterministic automaton, use the 'da' class")
self.trace_h = self._read_template_file("trace_hybrid.h")
- self.__parse_constraints()
self.has_invariant = False
self.has_guard = False
for state in self._states:
@@ -304,64 +303,6 @@ class ha2k(dot2k):
separator = "\n\t\t " if sum(len(r) for r in rules) > 80 else " "
return ["res = " + separator.join(rules) + ";"]
- def __validate_constraint(self, key: tuple[int, int] | int, constr: str,
- rule, reset) -> None:
- # event constrains are tuples and allow both rules and reset
- # state constraints are only used for expirations (e.g. clk<N)
- if self.is_event_constraint(key):
- if not rule and not reset:
- raise AutomataError("Unrecognised event constraint "
- f"({self.states[key[0]]}/{self.events[key[1]]}: {constr})")
- if rule and (rule["env"] in self.env_types and
- rule["env"] not in self.env_stored):
- raise AutomataError("Clocks in hybrid automata always require a storage"
- f" ({rule["env"]})")
- else:
- if not rule:
- raise AutomataError("Unrecognised state constraint "
- f"({self.states[key]}: {constr})")
- if rule["env"] not in self.env_stored:
- raise AutomataError("State constraints always require a storage "
- f"({rule["env"]})")
- if rule["op"] not in ["<", "<="]:
- raise AutomataError("State constraints must be clock expirations like"
- f" clk<N ({rule.string})")
-
- def __parse_constraints(self) -> None:
- self.guards: dict[_EventConstraintKey, str] = {}
- self.invariants: dict[_StateConstraintKey, str] = {}
- for key, constraint in self.constraints.items():
- rules = []
- resets = []
- for c, sep in self._split_constraint_expr(constraint):
- rule = self.constraint_rule.search(c)
- reset = self.constraint_reset.search(c)
- self.__validate_constraint(key, c, rule, reset)
- if rule:
- value = rule["val"]
- value_len = len(rule["val"])
- unit = None
- if rule.groupdict().get("unit"):
- value_len += len(rule["unit"])
- unit = rule["unit"]
- c = c[:-(value_len)]
- value = self.__adjust_value(value, unit)
- if self.is_event_constraint(key):
- c = self.__parse_single_constraint(rule, value)
- if sep:
- c += f" {sep}"
- else:
- c = self.__parse_timer_constraint(rule, value)
- rules.append(c)
- if reset:
- c = f"ha_reset_env(ha_mon, {reset["env"]}{self.enum_suffix}, time_ns)"
- resets.append(c)
- if self.is_event_constraint(key):
- res = self.__format_guard_rules(rules) + resets
- self.guards[key] = ";".join(res)
- else:
- self.invariants[key] = rules[0]
-
def __fill_verify_invariants_func(self) -> list[str]:
if not self.has_invariant:
return []
@@ -486,15 +427,15 @@ f"""static bool ha_verify_constraint(struct ha_monitor *ha_mon,
\t\t\t\t enum {self.enum_states_def} next_state, u64 time_ns)
{{""")
- if self.invariants:
+ if self.has_invariant:
buff.append("\tif (!ha_verify_invariants(ha_mon, curr_state, "
"event, next_state, time_ns))\n\t\treturn false;\n")
- if self.guards:
+ if self.has_guard:
buff.append("\tif (!ha_verify_guards(ha_mon, curr_state, event, "
"next_state, time_ns))\n\t\treturn false;\n")
- if self.invariants:
+ if self.has_invariant:
buff.append("\tha_setup_invariants(ha_mon, curr_state, event, next_state, time_ns);\n")
buff.append("\treturn true;\n}\n")
@@ -571,7 +512,7 @@ f"""static bool ha_verify_constraint(struct ha_monitor *ha_mon,
return self.__fill_hybrid_get_reset_functions() + self.__fill_constr_func()
def _fill_timer_type(self) -> list:
- if self.invariants:
+ if self.has_invariant:
return [
"/* XXX: If the monitor has several instances, consider HA_TIMER_WHEEL */",
"#define HA_TIMER_TYPE HA_TIMER_HRTIMER"
--
2.47.3
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox