Linux Trace Kernel


* [PATCH v13 11/11] tracing/probes: Add a new testcase for BTF typecasts
From: Masami Hiramatsu (Google) @ 2026-06-29  6:14 UTC
  To: Steven Rostedt, Mathieu Desnoyers
  Cc: Jonathan Corbet, Shuah Khan, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <178271357142.1176915.7193483024740701480.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

With the introduction of container_of-style BTF typecasting and
per-CPU variable access support in trace probes, we need a way to
verify their functionality and prevent regressions.

Add a new ftrace kselftest and update the trace event sample module
to test and validate these features.

Specifically, update the trace-events-sample module to set up a
periodic timer whose callback accesses a per-CPU counter. Introduce
a new sample trace event, foo_timer_fn, to trace this callback
and log the current counter value.

Then, add a new test case, btf_probe_event.tc, which defines a
dynamic probe on the timer callback. The probe uses BTF typecasting
to recover the parent structure from the timer argument and
this_cpu_read() to fetch the per-CPU counter. The test verifies
the integrity of the implementation by ensuring the values
recorded by the dynamic probe match those from the static tracepoint.
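
As a rough illustration of what the (foo_timer_data,timer)t typecast
computes, the userspace sketch below steps back from a pointer to the
embedded timer member to its enclosing structure, the same arithmetic as
container_of(). The *_stub types, container_of_stub() and the two helper
functions are hypothetical stand-ins for this example only; the real probe
resolves offsets from BTF, and the real counter is a __percpu pointer that
this_cpu_read() adjusts for the current CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for struct timer_list (hypothetical, simplified layout). */
struct timer_list_stub {
	unsigned long expires;
};

/* Mirrors the field order of struct foo_timer_data in the patch. */
struct foo_timer_data_stub {
	const char *name;
	struct timer_list_stub timer;
	int *counter;	/* plain pointer here; the kernel field is int __percpu * */
};

/* Simplified container_of: step back from a member to its enclosing struct. */
#define container_of_stub(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Conceptually what "(foo_timer_data,timer)t->name" resolves to. */
static const char *name_from_timer(struct timer_list_stub *t)
{
	struct foo_timer_data_stub *data =
		container_of_stub(t, struct foo_timer_data_stub, timer);

	return data->name;
}

/* Conceptually "(foo_timer_data,timer)t->counter", minus the per-CPU offset. */
static int count_from_timer(struct timer_list_stub *t)
{
	struct foo_timer_data_stub *data =
		container_of_stub(t, struct foo_timer_data_stub, timer);

	return *data->counter;
}
```

Passing the address of the embedded timer member to either helper yields
the fields of the parent object, which is the property the test relies on
when it compares the dynamic probe output with the static tracepoint.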

Assisted-by: Antigravity:gemini-3.5-flash
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v12:
  - Fix current support check in eprobe testcase.
  - Fix to return UNRESOLVED error if sample module is not found.
  - Always check this_cpu_* in btf_typecast_accepted.tc.
 Changes in v11:
  - nit: fix the error code in comment.
 Changes in v10:
  - Add a check for $current and this_cpu_* for eprobe
 Changes in v9:
  - Add a testcase for checking new syntax.
 Changes in v8:
  - Add more test cases.
 Changes in v6:
  - Update testcase according to changes.
 Changes in v5:
  - Add more syntax test cases.
 Changes in v4:
  - Fix uprobe $current test.
 Changes in v3:
  - Add syntax test case.
  - Update testcase to use this_cpu_read()
 Changes in v2:
  - Use timer_shutdown_sync() instead of timer_delete_sync() for teardown.
---
 samples/trace_events/trace-events-sample.c         |   40 +++++++-
 samples/trace_events/trace-events-sample.h         |   34 ++++++-
 .../ftrace/test.d/dynevent/btf_probe_event.tc      |   51 ++++++++++
 .../test.d/dynevent/btf_typecast_accepted.tc       |  103 ++++++++++++++++++++
 .../test.d/dynevent/eprobes_syntax_errors.tc       |    9 ++
 .../ftrace/test.d/dynevent/fprobe_syntax_errors.tc |   12 ++
 .../ftrace/test.d/kprobe/kprobe_syntax_errors.tc   |   12 ++
 .../ftrace/test.d/kprobe/uprobe_syntax_errors.tc   |    5 +
 8 files changed, 261 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/btf_typecast_accepted.tc

diff --git a/samples/trace_events/trace-events-sample.c b/samples/trace_events/trace-events-sample.c
index 0b7a6efdb247..ca5d98c360cb 100644
--- a/samples/trace_events/trace-events-sample.c
+++ b/samples/trace_events/trace-events-sample.c
@@ -94,6 +94,20 @@ static int simple_thread_fn(void *arg)
 static DEFINE_MUTEX(thread_mutex);
 static int simple_thread_cnt;
 
+static struct foo_timer_data *foo_timer_data;
+
+static void sample_timer_cb(struct timer_list *t)
+{
+	struct foo_timer_data *data = container_of(t, struct foo_timer_data, timer);
+
+	get_cpu();
+	trace_foo_timer_fn(data);
+	(*this_cpu_ptr(data->counter))++;
+	put_cpu();
+
+	mod_timer(t, jiffies + HZ);
+}
+
 int foo_bar_reg(void)
 {
 	mutex_lock(&thread_mutex);
@@ -132,9 +146,27 @@ void foo_bar_unreg(void)
 
 static int __init trace_event_init(void)
 {
+	foo_timer_data = kzalloc_obj(*foo_timer_data, GFP_KERNEL);
+	if (!foo_timer_data)
+		return -ENOMEM;
+
+	foo_timer_data->name = "sample_timer_counter";
+	foo_timer_data->counter = alloc_percpu(int);
+	if (!foo_timer_data->counter) {
+		kfree(foo_timer_data);
+		return -ENOMEM;
+	}
+
+	timer_setup(&foo_timer_data->timer, sample_timer_cb, 0);
+	mod_timer(&foo_timer_data->timer, jiffies + HZ);
+
 	simple_tsk = kthread_run(simple_thread, NULL, "event-sample");
-	if (IS_ERR(simple_tsk))
-		return -1;
+	if (IS_ERR(simple_tsk)) {
+		timer_shutdown_sync(&foo_timer_data->timer);
+		free_percpu(foo_timer_data->counter);
+		kfree(foo_timer_data);
+		return PTR_ERR(simple_tsk);
+	}
 
 	return 0;
 }
@@ -147,6 +179,10 @@ static void __exit trace_event_exit(void)
 		kthread_stop(simple_tsk_fn);
 	simple_tsk_fn = NULL;
 	mutex_unlock(&thread_mutex);
+
+	timer_shutdown_sync(&foo_timer_data->timer);
+	free_percpu(foo_timer_data->counter);
+	kfree(foo_timer_data);
 }
 
 module_init(trace_event_init);
diff --git a/samples/trace_events/trace-events-sample.h b/samples/trace_events/trace-events-sample.h
index 1a05fc153353..816848a456a2 100644
--- a/samples/trace_events/trace-events-sample.h
+++ b/samples/trace_events/trace-events-sample.h
@@ -247,12 +247,14 @@
  */
 
 /*
- * It is OK to have helper functions in the file, but they need to be protected
- * from being defined more than once. Remember, this file gets included more
- * than once.
+ * It is OK to have helper functions and data structures in the file, but they
+ * need to be protected from being defined more than once. Remember, this file
+ * gets included more than once.
  */
 #ifndef __TRACE_EVENT_SAMPLE_HELPER_FUNCTIONS
 #define __TRACE_EVENT_SAMPLE_HELPER_FUNCTIONS
+#include <linux/timer.h>
+
 static inline int __length_of(const int *list)
 {
 	int i;
@@ -270,6 +272,13 @@ enum {
 	TRACE_SAMPLE_BAR = 4,
 	TRACE_SAMPLE_ZOO = 8,
 };
+
+struct foo_timer_data {
+	const char		*name;
+	struct timer_list	timer;
+	int __percpu		*counter;
+};
+
 #endif
 
 /*
@@ -595,6 +604,25 @@ TRACE_EVENT(foo_rel_loc,
 		  __get_rel_bitmask(bitmask),
 		  __get_rel_cpumask(cpumask))
 );
+
+TRACE_EVENT(foo_timer_fn,
+
+	TP_PROTO(struct foo_timer_data *data),
+
+	TP_ARGS(data),
+
+	TP_STRUCT__entry(
+		__string(	name,			data->name	)
+		__field(	int,			count		)
+	),
+
+	TP_fast_assign(
+		__assign_str(name);
+		__entry->count	= *this_cpu_ptr(data->counter);
+	),
+
+	TP_printk("name=%s count=%d", __get_str(name), __entry->count)
+);
 #endif
 
 /***** NOTICE! The #if protection ends here. *****/
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc b/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
new file mode 100644
index 000000000000..bf71368c31a4
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/btf_probe_event.tc
@@ -0,0 +1,51 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: BTF event with typecast and percpu access
+# requires: dynamic_events "this_cpu_read(<fetcharg>)":README "[(structname[,field])]<argname>[->field[->field|.field...]]":README
+
+# Check if the sample module is loaded
+if ! lsmod | grep -q trace_events_sample; then
+  modprobe trace-events-sample || exit_unresolved
+fi
+
+echo 0 > events/enable
+echo > dynamic_events
+
+# The sample_timer_cb(struct timer_list *t) is called.
+# We want to check (STRUCT,FIELD)VAR typecast and this_cpu_read() access.
+# (foo_timer_data,timer)t converts t to struct foo_timer_data * using container_of.
+# data->counter is a per-cpu pointer to int.
+# this_cpu_read(data->counter) should give the value of the counter.
+
+echo 'f:mysample/myevent sample_timer_cb name=(foo_timer_data,timer)t->name:string count=this_cpu_read((foo_timer_data,timer)t->counter)' >> dynamic_events
+
+echo 1 > events/mysample/myevent/enable
+echo 1 > events/sample-trace/foo_timer_fn/enable
+
+sleep 2
+
+echo 0 > events/mysample/myevent/enable
+echo 0 > events/sample-trace/foo_timer_fn/enable
+
+# Compare the values.
+MATCH=0
+while read line; do
+  if echo $line | grep -q "foo_timer_fn:"; then
+    NAME=`echo $line | sed 's/.*name=\([^ ]*\) .*/\1/'`
+    COUNT=`echo $line | sed 's/.*count=\([^ ]*\).*/\1/'`
+    if grep -q "myevent:.*name=\"${NAME}\" count=$COUNT" trace; then
+       MATCH=$((MATCH+1))
+    fi
+  fi
+done < trace
+
+if [ $MATCH -eq 0 ]; then
+  echo "No matching events found"
+  exit_fail
+fi
+
+# Clean up
+echo 0 > events/mysample/myevent/enable
+echo 0 > events/sample-trace/foo_timer_fn/enable
+echo > dynamic_events
+clear_trace
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/btf_typecast_accepted.tc b/tools/testing/selftests/ftrace/test.d/dynevent/btf_typecast_accepted.tc
new file mode 100644
index 000000000000..dd5552727054
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/btf_typecast_accepted.tc
@@ -0,0 +1,103 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: BTF typecast and percpu access syntax validation
+# requires: dynamic_events "this_cpu_read(<fetcharg>)":README "[(structname[,field])]<argname>[->field[->field|.field...]]":README
+
+KPROBES=
+FPROBES=
+
+if grep -qF "p[:[<group>/][<event>]] <place> [<args>]" README ; then
+  KPROBES=yes
+fi
+if grep -qF "f[:[<group>/][<event>]] <func-name>[%return] [<args>]" README ; then
+  FPROBES=yes
+fi
+
+if [ -z "$KPROBES" -a -z "$FPROBES" ] ; then
+  exit_unsupported
+fi
+
+echo 0 > events/enable
+echo > dynamic_events
+
+# Load the trace-events-sample module so the per-CPU counter structure is defined
+if ! lsmod | grep -q trace_events_sample; then
+  modprobe trace-events-sample || exit_unresolved
+fi
+
+if [ "$FPROBES" ] ; then
+  # 1. Test basic typecast on fprobe
+  echo 'f:fpevent1 vfs_read name=(file)file->f_path.dentry->d_name.name:string' >> dynamic_events
+  # 2. Test parenthesized typecast target on fprobe
+  echo 'f:fpevent2 vfs_read name=(file)(file)->f_path.dentry->d_name.name:string' >> dynamic_events
+  # 3. Test nested typecasts on fprobe
+  echo 'f:fpevent3 vfs_read name=(dentry)((file)file->f_path.dentry)->d_name.name:string' >> dynamic_events
+  # 4. Test container_of-style typecast with field option on fprobe
+  echo 'f:fpevent4 vfs_read name=(file,f_path)file->f_mode' >> dynamic_events
+  # 5. Test typecast on return value on fprobe
+  echo 'f:fpevent5 vfs_read%return name=(file)$retval->f_path.dentry->d_name.name:string' >> dynamic_events
+  # 6. Test $current variable support on fprobe
+  echo 'f:fpevent6 vfs_read pid=$current->pid' >> dynamic_events
+  echo 'f:fpevent7 vfs_read pid=(task_struct)$current->pid' >> dynamic_events
+  echo 'f:fpevent8 vfs_read pid=(task_struct,group_leader)$current->pid' >> dynamic_events
+
+  # Test this_cpu_read and this_cpu_ptr on fprobe
+  echo 'f:fpevent9 sample_timer_cb name=(foo_timer_data,timer)t->name:string count=this_cpu_read((foo_timer_data,timer)t->counter)' >> dynamic_events
+  echo 'f:fpevent10 sample_timer_cb ptr=this_cpu_ptr((foo_timer_data,timer)t->counter)' >> dynamic_events
+fi
+
+if [ "$KPROBES" ] ; then
+  # 7. Test basic typecast on kprobe
+  echo 'p:kpevent1 vfs_read name=(file)file->f_path.dentry->d_name.name:string' >> dynamic_events
+  # 8. Test parenthesized typecast target on kprobe
+  echo 'p:kpevent2 vfs_read name=(file)(file)->f_path.dentry->d_name.name:string' >> dynamic_events
+  # 9. Test nested typecasts on kprobe
+  echo 'p:kpevent3 vfs_read name=(dentry)((file)file->f_path.dentry)->d_name.name:string' >> dynamic_events
+  # 10. Test container_of-style typecast with field option on kprobe
+  echo 'p:kpevent4 vfs_read name=(file,f_path)file->f_mode' >> dynamic_events
+  # 11. Test typecast on return value on kretprobe
+  echo 'r:kpevent5 vfs_read name=(file)$retval->f_path.dentry->d_name.name:string' >> dynamic_events
+  # 12. Test $current variable support on kprobe
+  echo 'p:kpevent6 vfs_read pid=$current->pid' >> dynamic_events
+  echo 'p:kpevent7 vfs_read pid=(task_struct)$current->pid' >> dynamic_events
+  echo 'p:kpevent8 vfs_read pid=(task_struct,group_leader)$current->pid' >> dynamic_events
+
+  # Test this_cpu_read and this_cpu_ptr on kprobe
+  echo 'p:kpevent9 sample_timer_cb name=(foo_timer_data,timer)t->name:string count=this_cpu_read((foo_timer_data,timer)t->counter)' >> dynamic_events
+  echo 'p:kpevent10 sample_timer_cb ptr=this_cpu_ptr((foo_timer_data,timer)t->counter)' >> dynamic_events
+fi
+
+# Verify the events exist in dynamic_events
+if [ "$FPROBES" ] ; then
+  grep -q "fpevent1 " dynamic_events
+  grep -q "fpevent2 " dynamic_events
+  grep -q "fpevent3 " dynamic_events
+  grep -q "fpevent4 " dynamic_events
+  grep -q "fpevent5 " dynamic_events
+  grep -q "fpevent6 " dynamic_events
+  grep -q "fpevent7 " dynamic_events
+  grep -q "fpevent8 " dynamic_events
+  if lsmod | grep -q trace_events_sample; then
+    grep -q "fpevent9 " dynamic_events
+    grep -q "fpevent10 " dynamic_events
+  fi
+fi
+
+if [ "$KPROBES" ] ; then
+  grep -q "kpevent1 " dynamic_events
+  grep -q "kpevent2 " dynamic_events
+  grep -q "kpevent3 " dynamic_events
+  grep -q "kpevent4 " dynamic_events
+  grep -q "kpevent5 " dynamic_events
+  grep -q "kpevent6 " dynamic_events
+  grep -q "kpevent7 " dynamic_events
+  grep -q "kpevent8 " dynamic_events
+  if lsmod | grep -q trace_events_sample; then
+    grep -q "kpevent9 " dynamic_events
+    grep -q "kpevent10 " dynamic_events
+  fi
+fi
+
+# Clean up
+echo > dynamic_events
+clear_trace
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc
index 0e65e787e426..c2e3f9d19f13 100644
--- a/tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/eprobes_syntax_errors.tc
@@ -21,8 +21,17 @@ check_error 'e:foo/^bar.1 syscalls/sys_enter_openat'	# BAD_EVENT_NAME
 
 check_error 'e:foo/bar syscalls/sys_enter_openat arg=^$foo'	# BAD_ATTACH_ARG
 
+check_error 'e:foo/bar syscalls/sys_enter_openat arg=^COMM'	# NO_EVENT_FIELD
+if grep -q "\$current.*" README; then
+  check_error 'e:foo/bar syscalls/sys_enter_openat arg=^current'	# NO_EVENT_FIELD
+fi
+
 if grep -q '<attached-group>\.<attached-event>.*\[if <filter>\]' README; then
   check_error 'e:foo/bar syscalls/sys_enter_openat if ^'	# NO_EP_FILTER
 fi
 
+if grep -q 'this_cpu_read(<fetcharg>)' README; then
+  check_error 'e:foo/bar syscalls/sys_enter_openat arg=^this_cpu_read(file)'	# NOSUP_PERCPU
+fi
+
 exit 0
diff --git a/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
index fee479295e2f..e9d7e6919c7f 100644
--- a/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/dynevent/fprobe_syntax_errors.tc
@@ -112,6 +112,18 @@ check_error 'f vfs_read%return $retval->^foo'	# NO_PTR_STRCT
 check_error 'f vfs_read file->^foo'		# NO_BTF_FIELD
 check_error 'f vfs_read file^-.foo'		# BAD_HYPHEN
 check_error 'f vfs_read ^file:string'		# BAD_TYPE4STR
+if grep -qF "[(structname" README ; then
+check_error 'f vfs_read arg1=(task_struct)file^'		# TYPECAST_REQ_FIELD
+check_error 'f vfs_read arg1=(a)((b)((c)(^(d)file->d)->c)->b)->a'	# TOO_MANY_NESTED
+check_error 'f vfs_read arg1=(task_struct,^in_execve)file->comm'	# TYPECAST_NOT_ALIGNED
+check_error 'f vfs_read arg1=(task_struct,^foo_bar)file->pid'	# NO_BTF_FIELD
+check_error 'f vfs_read arg1=(^task_struct1234)file->pid'	# NO_PTR_STRCT
+check_error 'f vfs_read arg1=(task_struct,se^->group_node)file->comm'	# TYPECAST_BAD_ARROW
+check_error 'f vfs_read arg1=(task_struct,^->pid)file->comm'	# NO_BTF_FIELD
+check_error 'f vfs_read arg1=(task_struct,^.pid)file->comm'	# NO_BTF_FIELD
+check_error 'f vfs_read arg1=(task_struct,^.)file->comm'	# NO_BTF_FIELD
+check_error 'f vfs_read arg1=(task_struct)^@symbol+10->comm'	# TYPECAST_SYM_OFFSET
+fi
 fi
 
 else
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
index 8f1c58f0c239..21ce8414459f 100644
--- a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc
@@ -115,6 +115,18 @@ check_error 'p vfs_read+20 ^$arg*'		# NOFENTRY_ARGS
 check_error 'p vfs_read ^hoge'			# NO_BTFARG
 check_error 'p kfree ^$arg10'			# NO_BTFARG (exceed the number of parameters)
 check_error 'r kfree ^$retval'			# NO_RETVAL
+if grep -qF "[(structname" README ; then
+check_error 'p vfs_read arg1=(task_struct)file^'		# TYPECAST_REQ_FIELD
+check_error 'p vfs_read arg1=(a)((b)((c)(^(d)file->d)->c)->b)->a'	# TOO_MANY_NESTED
+check_error 'p vfs_read arg1=(task_struct,^in_execve)file->comm'	# TYPECAST_NOT_ALIGNED
+check_error 'p vfs_read arg1=(task_struct,^foo_bar)file->pid'	# NO_BTF_FIELD
+check_error 'p vfs_read arg1=(^task_struct1234)file->pid'		# NO_PTR_STRCT
+check_error 'p vfs_read arg1=(task_struct,se^->group_node)file->comm'	# TYPECAST_BAD_ARROW
+check_error 'p vfs_read arg1=(task_struct,^->pid)file->comm'	# NO_BTF_FIELD
+check_error 'p vfs_read arg1=(task_struct,^.pid)file->comm'	# NO_BTF_FIELD
+check_error 'p vfs_read arg1=(task_struct,^.)file->comm'	# NO_BTF_FIELD
+check_error 'p vfs_read arg1=(task_struct)^@symbol+10->comm'	# TYPECAST_SYM_OFFSET
+fi
 else
 check_error 'p vfs_read ^$arg*'			# NOSUP_BTFARG
 fi
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc
index c817158b99db..e12dc967ec76 100644
--- a/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc
@@ -28,4 +28,9 @@ if grep -q ".*symstr.*" README; then
 check_error 'p /bin/sh:10 $stack0:^symstr'	# BAD_TYPE
 fi
 
+# $current is not supported by uprobe
+if grep -q "\$current.*" README; then
+check_error 'p /bin/sh:10 ^$current:u8'	# BAD_VAR
+fi
+
 exit 0



* [PATCH v4 2/7] riscv: stacktrace: disable KASAN and KCOV instrumentation for stacktrace.o
From: Wang Han @ 2026-06-29  6:42 UTC
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Alexandre Ghiti, linux-riscv, Oleg Nesterov, Steven Rostedt,
	Masami Hiramatsu, Mark Rutland, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, oliver.yang, xueshuai, zhuo.song, jkchen,
	Marcos Paulo de Souza, linux-kernel, linux-trace-kernel,
	linux-perf-users, live-patching, linux-kselftest

KASAN records stack traces for every alloc/free, which means it walks
the unwinder very frequently. Instrumenting the stack trace collection
code itself adds substantial overhead and makes the traces themselves
noisier.

KCOV instruments every basic-block edge. The unwinder is a hot path,
especially with KASAN enabled, so KCOV instrumentation has the same kind
of cost and noise problem here.

Mark stacktrace.o as not KASAN- or KCOV-instrumented, matching the x86
treatment of its stack unwinding code. RISC-V keeps the relevant unwinder
code in stacktrace.o, so a single translation-unit annotation covers the
equivalent scope. This prepares the ground for the upcoming reliable
unwinder, but the change is valid on its own.

Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 arch/riscv/kernel/Makefile | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index cabb99cadfb6..c565a72a36f3 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -44,6 +44,12 @@ CFLAGS_REMOVE_return_address.o	= $(CC_FLAGS_FTRACE)
 CFLAGS_REMOVE_sbi_ecall.o = $(CC_FLAGS_FTRACE)
 endif

+# When KASAN is enabled, a stack trace is recorded for every alloc/free, which
+# can significantly impact performance. Avoid instrumenting the stack trace
+# collection code to minimize this impact.
+KASAN_SANITIZE_stacktrace.o := n
+KCOV_INSTRUMENT_stacktrace.o := n
+
 always-$(KBUILD_BUILTIN) += vmlinux.lds

 obj-y	+= head.o
-- 
2.43.0


* [PATCH v4 4/7] riscv: stacktrace: introduce stack-bound tracking helpers
From: Wang Han @ 2026-06-29  6:42 UTC
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Alexandre Ghiti, linux-riscv, Oleg Nesterov, Steven Rostedt,
	Masami Hiramatsu, Mark Rutland, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, oliver.yang, xueshuai, zhuo.song, jkchen,
	Marcos Paulo de Souza, linux-kernel, linux-trace-kernel,
	linux-perf-users, live-patching, linux-kselftest

A reliable unwinder needs to validate that every frame record it reads
is fully contained in a known kernel stack, and it needs to refuse to
walk back into a stack it has already left. Add the building blocks
for that:

  * struct stack_info / struct unwind_state in a new
    asm/stacktrace/common.h, modelled on the arm64 reference
    implementation.
  * stackinfo_get_irq() / stackinfo_get_task() / stackinfo_get_overflow()
    plus the corresponding on_*_stack() predicates in asm/stacktrace.h,
    so callers can ask "is this object on stack X?" by stack kind
    rather than open-coded address arithmetic.
  * unwind_init_common(), unwind_find_stack() and
    unwind_consume_stack() helpers that enforce the
    forward-progress-only invariant required for reliability.

No existing user is wired up to these helpers in this commit; the
unwinder switch comes in a follow-up. The header changes leave
on_thread_stack() with the same semantics as before, just expressed in
terms of the new helpers.
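
The forward-progress-only invariant can be tried out in userspace with the
sketch below. It copies the containment check and the bookkeeping of
stackinfo_on_stack() and unwind_consume_stack() from this patch into two
standalone functions; the names on_stack() and consume(), and passing the
current stack as a separate pointer instead of a struct unwind_state, are
simplifications made for this example only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct stack_info {
	unsigned long low;
	unsigned long high;
};

/* Same containment test as stackinfo_on_stack(): reject unknown stacks,
 * objects starting below the stack, wrap-around, and objects ending past it. */
static bool on_stack(const struct stack_info *info,
		     unsigned long sp, unsigned long size)
{
	if (!info->low)
		return false;

	if (sp < info->low || sp + size < sp || sp + size > info->high)
		return false;

	return true;
}

/* Same bookkeeping as unwind_consume_stack(): make @info the current stack,
 * mark its slot as unknown, and raise the lower bound past the object so it
 * cannot be consumed twice. @info may alias @cur. */
static void consume(struct stack_info *cur, struct stack_info *info,
		    unsigned long sp, unsigned long size)
{
	struct stack_info tmp = *info;

	info->low = 0;
	info->high = 0;
	*cur = tmp;
	cur->low = sp + size;
}
```

After a frame record at sp is consumed, any address at or below it no
longer passes on_stack() for the current stack, and the slot the stack was
taken from reads as unknown, so a later step cannot return to it.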

Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 arch/riscv/include/asm/stacktrace.h        |  65 ++++++++-
 arch/riscv/include/asm/stacktrace/common.h | 159 +++++++++++++++++++++
 2 files changed, 222 insertions(+), 2 deletions(-)
 create mode 100644 arch/riscv/include/asm/stacktrace/common.h

diff --git a/arch/riscv/include/asm/stacktrace.h b/arch/riscv/include/asm/stacktrace.h
index b1495a7e06ce..bc87c4940379 100644
--- a/arch/riscv/include/asm/stacktrace.h
+++ b/arch/riscv/include/asm/stacktrace.h
@@ -3,8 +3,13 @@
 #ifndef _ASM_RISCV_STACKTRACE_H
 #define _ASM_RISCV_STACKTRACE_H
 
+#include <linux/percpu.h>
 #include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+
+#include <asm/irq_stack.h>
 #include <asm/ptrace.h>
+#include <asm/stacktrace/common.h>
 
 struct stackframe {
 	unsigned long fp;
@@ -16,14 +21,70 @@ extern void notrace walk_stackframe(struct task_struct *task, struct pt_regs *re
 extern void dump_backtrace(struct pt_regs *regs, struct task_struct *task,
 			   const char *loglvl);
 
-static inline bool on_thread_stack(void)
+/*
+ * IRQ stack accessors
+ */
+static inline struct stack_info stackinfo_get_irq(void)
+{
+	unsigned long low = (unsigned long)raw_cpu_read(irq_stack_ptr);
+	unsigned long high = low + IRQ_STACK_SIZE;
+
+	return (struct stack_info) {
+		.low = low,
+		.high = high,
+	};
+}
+
+static inline bool on_irq_stack(unsigned long sp, unsigned long size)
+{
+	struct stack_info info = stackinfo_get_irq();
+
+	return stackinfo_on_stack(&info, sp, size);
+}
+
+/*
+ * Task stack accessors
+ */
+static inline struct stack_info stackinfo_get_task(const struct task_struct *tsk)
 {
-	return !(((unsigned long)(current->stack) ^ current_stack_pointer) & ~(THREAD_SIZE - 1));
+	unsigned long low = (unsigned long)task_stack_page(tsk);
+	unsigned long high = low + THREAD_SIZE;
+
+	return (struct stack_info) {
+		.low = low,
+		.high = high,
+	};
+}
+
+static inline bool on_task_stack(const struct task_struct *tsk,
+				 unsigned long sp, unsigned long size)
+{
+	struct stack_info info = stackinfo_get_task(tsk);
+
+	return stackinfo_on_stack(&info, sp, size);
 }
 
+/*
+ * True if the current stack pointer lies within the current task's stack.
+ */
+#define on_thread_stack()	(on_task_stack(current, current_stack_pointer, 1))
 
+/*
+ * Overflow stack accessors
+ */
 #ifdef CONFIG_VMAP_STACK
 DECLARE_PER_CPU(unsigned long [OVERFLOW_STACK_SIZE/sizeof(long)], overflow_stack);
+
+static inline struct stack_info stackinfo_get_overflow(void)
+{
+	unsigned long low = (unsigned long)raw_cpu_ptr(overflow_stack);
+	unsigned long high = low + OVERFLOW_STACK_SIZE;
+
+	return (struct stack_info) {
+		.low = low,
+		.high = high,
+	};
+}
 #endif /* CONFIG_VMAP_STACK */
 
 #endif /* _ASM_RISCV_STACKTRACE_H */
diff --git a/arch/riscv/include/asm/stacktrace/common.h b/arch/riscv/include/asm/stacktrace/common.h
new file mode 100644
index 000000000000..360a26e34349
--- /dev/null
+++ b/arch/riscv/include/asm/stacktrace/common.h
@@ -0,0 +1,159 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * RISC-V common stack unwinder types and helpers.
+ *
+ * See: arch/arm64/include/asm/stacktrace/common.h for the reference
+ * implementation.
+ *
+ * Copyright (C) 2026
+ */
+#ifndef __ASM_RISCV_STACKTRACE_COMMON_H
+#define __ASM_RISCV_STACKTRACE_COMMON_H
+
+#include <linux/compiler.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+
+#include <asm/stacktrace/frame.h>
+
+/**
+ * struct stack_info - describes the bounds of a stack.
+ *
+ * @low:  The lowest valid address on the stack.
+ * @high: The highest valid address on the stack.
+ */
+struct stack_info {
+	unsigned long low;
+	unsigned long high;
+};
+
+/**
+ * struct unwind_state - state used for robust unwinding.
+ *
+ * @fp:        The fp value in the frame record (or the real fp).
+ * @pc:        The ra value in the frame record (or the real ra).
+ *
+ * @stack:     The stack currently being unwound.
+ * @stacks:    An array of stacks which can be unwound.
+ * @nr_stacks: The number of stacks in @stacks.
+ */
+struct unwind_state {
+	unsigned long fp;
+	unsigned long pc;
+
+	struct stack_info stack;
+	struct stack_info *stacks;
+	int nr_stacks;
+};
+
+/**
+ * stackinfo_get_unknown() - Get an unknown stack_info.
+ *
+ * Return: a stack_info with low and high set to 0.
+ */
+static inline struct stack_info stackinfo_get_unknown(void)
+{
+	return (struct stack_info) {
+		.low = 0,
+		.high = 0,
+	};
+}
+
+/**
+ * stackinfo_on_stack() - Check whether an object is fully within a stack.
+ *
+ * @info: The stack to check against.
+ * @sp:   The base address of the object.
+ * @size: The size of the object.
+ *
+ * Return: true if the object is fully contained within the stack.
+ */
+static inline bool stackinfo_on_stack(const struct stack_info *info,
+				      unsigned long sp, unsigned long size)
+{
+	if (!info->low)
+		return false;
+
+	if (sp < info->low || sp + size < sp || sp + size > info->high)
+		return false;
+
+	return true;
+}
+
+/**
+ * unwind_init_common() - Initialize the common parts of the unwind state.
+ *
+ * @state: the unwind state to initialize.
+ */
+static inline void unwind_init_common(struct unwind_state *state)
+{
+	state->stack = stackinfo_get_unknown();
+}
+
+/**
+ * unwind_find_stack() - Find the accessible stack which entirely contains an
+ * object.
+ *
+ * @state: the current unwind state.
+ * @sp:    the base address of the object.
+ * @size:  the size of the object.
+ *
+ * Return: a pointer to the relevant stack_info if found; NULL otherwise.
+ */
+static inline struct stack_info *unwind_find_stack(struct unwind_state *state,
+						   unsigned long sp,
+						   unsigned long size)
+{
+	struct stack_info *info = &state->stack;
+
+	if (stackinfo_on_stack(info, sp, size))
+		return info;
+
+	for (int i = 0; i < state->nr_stacks; i++) {
+		info = &state->stacks[i];
+		if (stackinfo_on_stack(info, sp, size))
+			return info;
+	}
+
+	return NULL;
+}
+
+/**
+ * unwind_consume_stack() - Update stack boundaries so that future unwind steps
+ * cannot consume this object again.
+ *
+ * @state: the current unwind state.
+ * @info:  the stack_info of the stack containing the object.
+ * @sp:    the base address of the object.
+ * @size:  the size of the object.
+ *
+ * Stack transitions are strictly one-way, and once we've
+ * transitioned from one stack to another, it's never valid to
+ * unwind back to the old stack.
+ *
+ * Note that stacks can nest in several valid orders, e.g.
+ *
+ *   TASK -> IRQ -> OVERFLOW
+ *
+ * ... so we do not check the specific order of stack
+ * transitions.
+ */
+static inline void unwind_consume_stack(struct unwind_state *state,
+					struct stack_info *info,
+					unsigned long sp,
+					unsigned long size)
+{
+	struct stack_info tmp;
+
+	tmp = *info;
+	*info = stackinfo_get_unknown();
+	state->stack = tmp;
+
+	/*
+	 * Future unwind steps can only consume stack above this frame record.
+	 * Update the current stack to start immediately above it.
+	 */
+	state->stack.low = sp + size;
+}
+
+#endif /* __ASM_RISCV_STACKTRACE_COMMON_H */
-- 
2.43.0


* [PATCH v4 7/7] selftests/livepatch: Add RISC-V syscall wrapper prefix
From: Wang Han @ 2026-06-29  6:42 UTC
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Alexandre Ghiti, linux-riscv, Oleg Nesterov, Steven Rostedt,
	Masami Hiramatsu, Mark Rutland, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, oliver.yang, xueshuai, zhuo.song, jkchen,
	Marcos Paulo de Souza, linux-kernel, linux-trace-kernel,
	linux-perf-users, live-patching, linux-kselftest

The syscall livepatch selftest resolves and patches a syscall wrapper
symbol. To use that test for RISC-V livepatch validation, add the
RISC-V FN_PREFIX definition for ARCH_HAS_SYSCALL_WRAPPER.

Without this macro, the syscall livepatch selftest cannot resolve the
RISC-V target symbol, and the syscall-related livepatch test fails on
RISC-V.
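
To show why the prefix matters for symbol resolution, here is a minimal
userspace sketch of turning FN_PREFIX into the name of the wrapper symbol
by stringifying it and concatenating it with the syscall name. The
STRINGIFY macros and patched_symbol() are written for this example, and
the choice of sys_getpid as the target is an assumption about the
selftest, not something stated in this patch.

```c
#include <assert.h>
#include <string.h>

/* Two-level stringification so FN_PREFIX is expanded before it is quoted. */
#define STRINGIFY_1(...) #__VA_ARGS__
#define STRINGIFY(...) STRINGIFY_1(__VA_ARGS__)

/* The definition this patch adds for RISC-V. */
#define FN_PREFIX __riscv_

/* Adjacent string literals are concatenated at compile time, giving the
 * full wrapper symbol name (target syscall assumed for illustration). */
static const char *patched_symbol(void)
{
	return STRINGIFY(FN_PREFIX) "sys_getpid";
}
```

With an empty FN_PREFIX the same expression degrades to the bare syscall
name, which is why an architecture with syscall wrappers but without a
matching prefix ends up looking for a symbol that does not exist.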

Reviewed-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 .../testing/selftests/livepatch/test_modules/test_klp_syscall.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/livepatch/test_modules/test_klp_syscall.c b/tools/testing/selftests/livepatch/test_modules/test_klp_syscall.c
index 08aacc0e14de..9baa2a5f84c9 100644
--- a/tools/testing/selftests/livepatch/test_modules/test_klp_syscall.c
+++ b/tools/testing/selftests/livepatch/test_modules/test_klp_syscall.c
@@ -24,6 +24,8 @@
     #define FN_PREFIX __s390x_
   #elif defined(__aarch64__)
     #define FN_PREFIX __arm64_
+  #elif defined(__riscv)
+    #define FN_PREFIX __riscv_
   #elif defined(__powerpc__)
     #define FN_PREFIX
   #else
-- 
2.43.0


* [PATCH v4 6/7] riscv: Kconfig: enable HAVE_RELIABLE_STACKTRACE and HAVE_LIVEPATCH
From: Wang Han @ 2026-06-29  6:42 UTC
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Alexandre Ghiti, linux-riscv, Oleg Nesterov, Steven Rostedt,
	Masami Hiramatsu, Mark Rutland, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, oliver.yang, xueshuai, zhuo.song, jkchen,
	Marcos Paulo de Souza, linux-kernel, linux-trace-kernel,
	linux-perf-users, live-patching, linux-kselftest

Now that the metadata frame records, the kunwind state machine and
arch_stack_walk_reliable() are all in place, advertise the capability
to the rest of the kernel:

  * select HAVE_RELIABLE_STACKTRACE under FRAME_POINTER && 64BIT, so
    only the configurations with the tested metadata records and
    FP-based reliable walker enable it.
  * select HAVE_LIVEPATCH under the same condition and source
    kernel/livepatch/Kconfig so the livepatch menu is reachable from
    the RISC-V configuration.

The 64BIT dependency is conservative scoping rather than a hard
technical requirement: the metadata frame record, kunwind state machine
and arch_stack_walk_reliable() also build on RV32, and the IRQ-stack
frame-record adjustment fixes a latent RV32 issue. However, the syscall
livepatch selftest and module relocation path have only been exercised
on RV64 QEMU virt so far. The 64BIT gate can be relaxed in a follow-up
once RV32 has equivalent coverage.

This is split out from the unwinder change so the policy decision and
the implementation can be reviewed and reverted independently.

Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 arch/riscv/Kconfig | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index cb3d85abf595..4d09fab682ac 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -186,6 +186,7 @@ config RISCV
 	select HAVE_KRETPROBES
 	# https://github.com/ClangBuiltLinux/linux/issues/1881
 	select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
+	select HAVE_LIVEPATCH if FRAME_POINTER && 64BIT
 	select HAVE_MOVE_PMD
 	select HAVE_MOVE_PUD
 	select HAVE_PAGE_SIZE_4KB
@@ -196,6 +197,7 @@ config RISCV
 	select HAVE_POSIX_CPU_TIMERS_TASK_WORK
 	select HAVE_PREEMPT_DYNAMIC_KEY
 	select HAVE_REGS_AND_STACK_ACCESS_API
+	select HAVE_RELIABLE_STACKTRACE if FRAME_POINTER && 64BIT
 	select HAVE_RETHOOK
 	select HAVE_RSEQ
 	select HAVE_RUST if RUSTC_SUPPORTS_RISCV && CC_IS_CLANG
@@ -1392,3 +1394,5 @@ endmenu # "CPU Power Management"
 source "arch/riscv/kvm/Kconfig"
 
 source "drivers/acpi/Kconfig"
+
+source "kernel/livepatch/Kconfig"
-- 
2.43.0


* [PATCH v4 0/7] riscv: Add reliable stack unwinding for livepatch
From: Wang Han @ 2026-06-29  6:42 UTC
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Alexandre Ghiti, linux-riscv, Oleg Nesterov, Steven Rostedt,
	Masami Hiramatsu, Mark Rutland, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, oliver.yang, xueshuai, zhuo.song, jkchen,
	Marcos Paulo de Souza, linux-kernel, linux-trace-kernel,
	linux-perf-users, live-patching, linux-kselftest

Hi,

This series adds the RISC-V architecture pieces needed by livepatch:
dynamic ftrace must preserve the frame pointer context that livepatch
uses for redirection, and the stack unwinder must be reliable enough for
the livepatch consistency model.

The reliable unwinder is based on frame records and explicit metadata at
task and exception boundaries. It is intentionally conservative: it
rejects ambiguous states instead of trying to unwind through them. The
series then enables HAVE_RELIABLE_STACKTRACE and HAVE_LIVEPATCH for
64-bit RISC-V kernels built with frame pointers
(FRAME_POINTER && 64BIT).
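
As a rough illustration of the conservative policy described above, the
following user-space sketch walks a simulated chain of {fp, ra} records
and gives up as soon as anything looks ambiguous. Everything in it
(sim_record, sim_walk, SIM_META_TYPE_FINAL, the pointer-linked layout)
is hypothetical and only models the idea; it is not the unwinder added
by this series.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the FINAL metadata type. */
#define SIM_META_TYPE_FINAL 1UL

struct sim_record {
	struct sim_record *fp; /* caller's record, NULL in a metadata record */
	uintptr_t ra;          /* return address, 0 in a metadata record */
	unsigned long type;    /* read only when fp and ra are both zero */
};

/*
 * Walk from 'rec' towards the stack base. Returns the number of
 * ordinary records consumed, or -1 as soon as the chain is ambiguous.
 */
static int sim_walk(const struct sim_record *rec, size_t max_records)
{
	int consumed = 0;

	while (max_records--) {
		if (!rec)
			return -1;
		if (!rec->fp && !rec->ra) {
			/* Metadata record: only FINAL ends the walk cleanly. */
			return rec->type == SIM_META_TYPE_FINAL ? consumed : -1;
		}
		/* Half-zero records and non-increasing links are rejected. */
		if (!rec->fp || !rec->ra || rec->fp <= rec)
			return -1;
		rec = rec->fp;
		consumed++;
	}
	return -1; /* budget exhausted without reaching a FINAL record */
}
```

The sketch deliberately has a single success path (reaching a FINAL
record); every other outcome is reported as a failure rather than a
best-effort trace.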

The first patch from v3, "scripts/sorttable: Handle RISC-V patchable
ftrace entries", has been picked up by Paul and is already present in
riscv/for-next as commit 57ad674d032b. It is therefore dropped from this
v4. The remaining patches are rebased on the latest riscv/for-next.

Base:
  riscv/for-next 798246e5edfb ("riscv: acpi: Enable ARCH_HAS_ACPI_TABLE_UPGRADE")

Previous versions:
  v3: https://lore.kernel.org/r/cover.194d76e3a15b.v3.riscv-livepatch.wanghan@linux.alibaba.com
  v2: https://lore.kernel.org/r/20260528082310.1994388-1-wanghan@linux.alibaba.com
  v1: https://lore.kernel.org/r/20260527123530.2593918-1-wanghan@linux.alibaba.com

Changes since v3:
  * Drop the accepted sorttable fix, now commit 57ad674d032b in
    riscv/for-next.
  * Rebase the remaining 7 patches on the latest riscv/for-next.
  * Adapt the frame metadata patch to the existing call_on_irq_stack()
    RV32 frame-pointer ABI fix by keeping metadata frame-record offsets
    distinct from the s0-relative STACKFRAME_* offsets.
  * Adapt the livepatch syscall selftest prefix change to the current
    CONFIG_ARCH_HAS_SYSCALL_WRAPPER wrapper logic.

Validation:
  * Built with riscv64-unknown-linux-gnu-gcc 15.2.0 and the existing
    configs/riscv_livepatch_config, including RISCV_ISA_C=y, EFI=y and
    ACPI=y.
  * make -C linux O=$PWD/build/linux/riscv ARCH=riscv \
      CROSS_COMPILE=riscv64-unknown-linux-gnu- -j$(nproc) Image modules
    passed.
  * make kernel-debug ARCH=riscv GDB=riscv64-unknown-linux-gnu-gdb
    passed.
  * livepatch selftest modules built successfully via ./test-livepatch.sh
    riscv.
  * QEMU RISC-V livepatch selftests passed with PASS: 7, SKIP: 1,
    FAIL: 0. The only skip is test-kprobe.sh because this config does
    not enable CONFIG_KPROBES_ON_FTRACE.
  * The ftrace function graph subset passed with 3 passed, 0 failed and
    3 unsupported tests.

Wang Han (7):
  riscv: stacktrace: Add frame record metadata
  riscv: stacktrace: disable KASAN and KCOV instrumentation for
    stacktrace.o
  riscv: ftrace: always preserve s0 in dynamic ftrace register frame
  riscv: stacktrace: introduce stack-bound tracking helpers
  riscv: stacktrace: switch to frame-pointer based unwinder
  riscv: Kconfig: enable HAVE_RELIABLE_STACKTRACE and HAVE_LIVEPATCH
  selftests/livepatch: Add RISC-V syscall wrapper prefix

 arch/riscv/Kconfig                            |   4 +
 arch/riscv/include/asm/ptrace.h               |   9 +
 arch/riscv/include/asm/stacktrace.h           |  65 +-
 arch/riscv/include/asm/stacktrace/common.h    | 159 +++++
 arch/riscv/include/asm/stacktrace/frame.h     |  53 ++
 arch/riscv/kernel/Makefile                    |   6 +
 arch/riscv/kernel/asm-offsets.c               |   6 +
 arch/riscv/kernel/entry.S                     |  39 +-
 arch/riscv/kernel/ftrace.c                    |   6 +-
 arch/riscv/kernel/head.S                      |  23 +
 arch/riscv/kernel/mcount-dyn.S                |   4 -
 arch/riscv/kernel/perf_callchain.c            |   2 +-
 arch/riscv/kernel/process.c                   |  33 +-
 arch/riscv/kernel/stacktrace.c                | 559 +++++++++++++++---
 .../livepatch/test_modules/test_klp_syscall.c |   2 +
 15 files changed, 864 insertions(+), 106 deletions(-)
 create mode 100644 arch/riscv/include/asm/stacktrace/common.h
 create mode 100644 arch/riscv/include/asm/stacktrace/frame.h

Range-diff against v3:
1:  8cef363cfed7 ! 1:  efd99ce56c1e riscv: stacktrace: Add frame record metadata
    @@ Commit message
             the secondary CPU path sets up s0 before smp_callin() so idle-task
             unwinding does not inherit an undefined caller frame;
           * copy_thread creates matching final metadata records for new kernel
    -        and user tasks, and keeps s0 available for the frame-pointer chain;
    -      * call_on_irq_stack still reserves an aligned stack slot, but links the
    -        saved {fp, ra} with the raw frame-record size so s0 points at the
    -        RISC-V frame record rather than past the alignment padding.
    +        and user tasks, and keeps s0 available for the frame-pointer chain.
     
    -    The call_on_irq_stack adjustment fixes a latent RV32 issue. On RV64,
    -    sizeof(struct stackframe) is equal to the stack alignment, so the old
    -    s0 value happened to point just above the saved {fp, ra}. On RV32, the
    -    raw frame record is 8 bytes while the reserved stack slot is 16-byte
    -    aligned, so the old s0 value pointed into the padding. Using the raw
    -    record size makes s0 point above the saved frame record on both RV32
    -    and RV64 while still reserving the aligned slot.
    +    Keep the embedded metadata-record field offsets distinct from the
    +    s0-relative STACKFRAME_* offsets used by call_on_irq_stack(), because
    +    the latter describe a frame record relative to s0 rather than to the
    +    record base.
     
         These changes keep s0 reserved for the frame-pointer chain at task and
    -    stack-switch boundaries.
    +    exception boundaries.
     
         Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
     
    @@ arch/riscv/kernel/asm-offsets.c: void asm_offsets(void)
      
      	OFFSET(HIBERN_PBE_ADDR, pbe, address);
     @@ arch/riscv/kernel/asm-offsets.c: void asm_offsets(void)
    - 	OFFSET(SBI_HART_BOOT_STACK_PTR_OFFSET, sbi_hart_boot_data, stack_ptr);
    - 
      	DEFINE(STACKFRAME_SIZE_ON_STACK, ALIGN(sizeof(struct stackframe), STACK_ALIGN));
    + 	DEFINE(STACKFRAME_FP, offsetof(struct stackframe, fp) - sizeof(struct stackframe));
    + 	DEFINE(STACKFRAME_RA, offsetof(struct stackframe, ra) - sizeof(struct stackframe));
     +	DEFINE(STACKFRAME_RECORD_SIZE, sizeof(struct stackframe));
    - 	OFFSET(STACKFRAME_FP, stackframe, fp);
    - 	OFFSET(STACKFRAME_RA, stackframe, ra);
    ++	OFFSET(FRAME_RECORD_FP, frame_record, fp);
    ++	OFFSET(FRAME_RECORD_RA, frame_record, ra);
      #ifdef CONFIG_FUNCTION_TRACER
    + 	DEFINE(FTRACE_OPS_FUNC,		offsetof(struct ftrace_ops, func));
    + #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
     
      ## arch/riscv/kernel/entry.S ##
     @@
    @@ arch/riscv/kernel/entry.S: SYM_CODE_START(handle_exception)
     +	 * Create a metadata frame record. The unwinder will use this to
     +	 * identify and unwind exception boundaries.
     +	 */
    -+	REG_S zero, (S_STACKFRAME + STACKFRAME_FP)(sp) /* stackframe.record.fp = 0 */
    -+	REG_S zero, (S_STACKFRAME + STACKFRAME_RA)(sp) /* stackframe.record.ra = 0 */
    ++	REG_S zero, (S_STACKFRAME + FRAME_RECORD_FP)(sp) /* stackframe.record.fp = 0 */
    ++	REG_S zero, (S_STACKFRAME + FRAME_RECORD_RA)(sp) /* stackframe.record.ra = 0 */
     +#ifdef CONFIG_RISCV_M_MODE
     +	li t0, SR_MPP
     +	and t0, s1, t0
    @@ arch/riscv/kernel/entry.S: SYM_CODE_START_LOCAL(handle_kernel_stack_overflow)
     +	 * pt_regs boundary and the unwinder can resume from the pre-overflow
     +	 * frame pointer saved in PT_S0.
     +	 */
    -+	REG_S zero, (S_STACKFRAME + STACKFRAME_FP)(sp)
    -+	REG_S zero, (S_STACKFRAME + STACKFRAME_RA)(sp)
    ++	REG_S zero, (S_STACKFRAME + FRAME_RECORD_FP)(sp)
    ++	REG_S zero, (S_STACKFRAME + FRAME_RECORD_RA)(sp)
     +	li t0, FRAME_META_TYPE_PT_REGS
     +	REG_S t0, S_STACKFRAME_TYPE(sp)
     +	addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
    @@ arch/riscv/kernel/entry.S: ASM_NOKPROBE(handle_kernel_stack_overflow)
      	move a2, sp /* pt_regs */
      	call ret_from_fork_kernel
      	j ret_from_exception
    -@@ arch/riscv/kernel/entry.S: SYM_FUNC_START(call_on_irq_stack)
    - 	addi	sp, sp, -STACKFRAME_SIZE_ON_STACK
    - 	REG_S	ra, STACKFRAME_RA(sp)
    - 	REG_S	s0, STACKFRAME_FP(sp)
    --	addi	s0, sp, STACKFRAME_SIZE_ON_STACK
    -+	addi	s0, sp, STACKFRAME_RECORD_SIZE
    - 
    - 	/* Switch to the per-CPU shadow call stack */
    - 	scs_save_current
    -@@ arch/riscv/kernel/entry.S: SYM_FUNC_START(call_on_irq_stack)
    - 	scs_load_current
    - 
    - 	/* Switch back to the thread stack and restore ra and s0 */
    --	addi	sp, s0, -STACKFRAME_SIZE_ON_STACK
    -+	addi	sp, s0, -STACKFRAME_RECORD_SIZE
    - 	REG_L	ra, STACKFRAME_RA(sp)
    - 	REG_L	s0, STACKFRAME_FP(sp)
    - 	addi	sp, sp, STACKFRAME_SIZE_ON_STACK
     
      ## arch/riscv/kernel/head.S ##
     @@
    @@ arch/riscv/kernel/head.S: SYM_CODE_START(_start_kernel)
     +	 * fp/s0 points above the metadata record (RISC-V
     +	 * convention).
     +	 */
    -+	REG_S zero, (S_STACKFRAME + STACKFRAME_FP)(sp)
    -+	REG_S zero, (S_STACKFRAME + STACKFRAME_RA)(sp)
    ++	REG_S zero, (S_STACKFRAME + FRAME_RECORD_FP)(sp)
    ++	REG_S zero, (S_STACKFRAME + FRAME_RECORD_RA)(sp)
     +	li t0, FRAME_META_TYPE_FINAL
     +	REG_S t0, S_STACKFRAME_TYPE(sp)
     +	addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
2:  237864b66d78 = 2:  bc0af8ec1976 riscv: stacktrace: disable KASAN and KCOV instrumentation for stacktrace.o
3:  e6035966a35a = 3:  ab3fbed66fff riscv: ftrace: always preserve s0 in dynamic ftrace register frame
4:  d132087ea01e = 4:  856d1e31408a riscv: stacktrace: introduce stack-bound tracking helpers
5:  02adea3ece82 = 5:  58aa4435e2ee riscv: stacktrace: switch to frame-pointer based unwinder
6:  c7d7dbe7a8a1 = 6:  500f9d9eeac0 riscv: Kconfig: enable HAVE_RELIABLE_STACKTRACE and HAVE_LIVEPATCH
7:  ae94a234b34a < -:  ------------ selftests/livepatch: Add RISC-V syscall wrapper prefix
-:  ------------ > 7:  3dcfa694c207 selftests/livepatch: Add RISC-V syscall wrapper prefix

base-commit: 798246e5edfb3aa0b2d6dca46f41014d0b99b209
-- 
2.43.0


* [PATCH v4 1/7] riscv: stacktrace: Add frame record metadata
From: Wang Han @ 2026-06-29  6:42 UTC
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Alexandre Ghiti, linux-riscv, Oleg Nesterov, Steven Rostedt,
	Masami Hiramatsu, Mark Rutland, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, oliver.yang, xueshuai, zhuo.song, jkchen,
	Marcos Paulo de Souza, linux-kernel, linux-trace-kernel,
	linux-perf-users, live-patching, linux-kselftest

Reliable frame-pointer unwinding needs an explicit way to identify
exception boundaries and the final entry frame. The existing unwinder
infers those boundaries from return addresses, which is too loose for a
future reliable unwinder.

Add a small metadata frame record to pt_regs and initialize it on
exception entry, kernel stack overflow, kernel thread fork, user fork,
and early idle task setup. The record uses a zero {fp, ra} sentinel plus
a type field so a later unwinder can distinguish a final user-to-kernel
boundary from a nested kernel pt_regs boundary.
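
A minimal user-space sketch of how a consumer might interpret such a
record is shown here. The two structs and the FRAME_META_TYPE_* values
mirror the definitions in this patch; enum meta_kind and classify_meta()
are hypothetical names used only for illustration.

```c
#define FRAME_META_TYPE_NONE    0
#define FRAME_META_TYPE_FINAL   1
#define FRAME_META_TYPE_PT_REGS 2

struct frame_record {
	unsigned long fp;
	unsigned long ra;
};

struct frame_record_meta {
	struct frame_record record;
	unsigned long type;
};

/* Hypothetical outcome codes for this sketch only. */
enum meta_kind {
	META_NOT_METADATA,
	META_FINAL,
	META_PT_REGS,
	META_INVALID,
};

static enum meta_kind classify_meta(const struct frame_record_meta *meta)
{
	/* The zero {fp, ra} sentinel is what marks a metadata record. */
	if (meta->record.fp || meta->record.ra)
		return META_NOT_METADATA;

	switch (meta->type) {
	case FRAME_META_TYPE_FINAL:
		return META_FINAL;
	case FRAME_META_TYPE_PT_REGS:
		return META_PT_REGS;
	default:
		/* NONE and every other value are reserved. */
		return META_INVALID;
	}
}
```

Note that the type field is only examined once the sentinel has been
seen, so an ordinary record never has its neighbouring memory
interpreted as a type.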

This follows the arm64 metadata frame-record model, adapted to the
RISC-V {fp, ra} frame record convention.

The metadata is established at the RISC-V entry boundaries that need an
explicit unwind marker:

  * exception entry clears the metadata {fp, ra} pair and uses SPP
    (or MPP in M-mode) to record whether the pt_regs frame is the final
    user-to-kernel boundary or a nested kernel boundary;
  * the kernel stack overflow path builds a nested pt_regs metadata
    record on the overflow stack so an unwinder can resume from the
    pre-overflow s0 saved in PT_S0;
  * _start_kernel builds the init task's final metadata record, while
    the secondary CPU path sets up s0 before smp_callin() so idle-task
    unwinding does not inherit an undefined caller frame;
  * copy_thread creates matching final metadata records for new kernel
    and user tasks, and keeps s0 available for the frame-pointer chain.
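
The exception-entry case in the list above can be sketched in C as
follows. This is only a model of the assembly in entry.S: the helper
name meta_type_from_status() and its m_mode parameter are hypothetical,
while the SR_SPP and SR_MPP masks are the standard sstatus.SPP (bit 8)
and mstatus.MPP (bits 12:11) fields.

```c
#define SR_SPP 0x00000100UL /* sstatus.SPP: previous privilege was S-mode */
#define SR_MPP 0x00001800UL /* mstatus.MPP: previous privilege field */

#define FRAME_META_TYPE_FINAL   1
#define FRAME_META_TYPE_PT_REGS 2

/*
 * Pick the metadata type from the saved status register. A non-zero
 * previous-privilege field means the trap came from kernel context, so
 * the pt_regs frame is a nested boundary; zero means it came from user
 * mode and the frame is the final one on this stack.
 */
static unsigned long meta_type_from_status(unsigned long status, int m_mode)
{
	unsigned long prev = status & (m_mode ? SR_MPP : SR_SPP);

	return prev ? FRAME_META_TYPE_PT_REGS : FRAME_META_TYPE_FINAL;
}
```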

Keep the embedded metadata-record field offsets distinct from the
s0-relative STACKFRAME_* offsets used by call_on_irq_stack(), because
the latter describe a frame record relative to s0 rather than to the
record base.
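
To make the difference between the two offset families concrete, here
is a small user-space sketch. The macro values are computed the same way
as the asm-offsets.c definitions in this patch; the helper names
(s0_for_record, addr_via_s0, addr_via_base) are hypothetical and exist
only for this example.

```c
#include <stddef.h>

/* Both layouts are a plain {fp, ra} pair, as in the kernel headers. */
struct stackframe { unsigned long fp; unsigned long ra; };
struct frame_record { unsigned long fp; unsigned long ra; };

/* s0-relative offsets: negative, because s0 points above the record. */
#define STACKFRAME_FP \
	((long)offsetof(struct stackframe, fp) - (long)sizeof(struct stackframe))
#define STACKFRAME_RA \
	((long)offsetof(struct stackframe, ra) - (long)sizeof(struct stackframe))

/* Record-base-relative offsets: what the embedded metadata record needs. */
#define FRAME_RECORD_FP ((long)offsetof(struct frame_record, fp))
#define FRAME_RECORD_RA ((long)offsetof(struct frame_record, ra))
#define STACKFRAME_RECORD_SIZE ((long)sizeof(struct stackframe))

/* s0 for a record that starts at 'base' (RISC-V convention). */
static unsigned char *s0_for_record(unsigned char *base)
{
	return base + STACKFRAME_RECORD_SIZE;
}

/* Address of a slot when starting from s0. */
static unsigned char *addr_via_s0(unsigned char *s0, long s0_rel_off)
{
	return s0 + s0_rel_off;
}

/* Address of a slot when starting from the record base. */
static unsigned char *addr_via_base(unsigned char *base, long base_rel_off)
{
	return base + base_rel_off;
}
```

Both families name the same two slots; they differ only in the anchor
they are measured from, which is why mixing them up in
(S_STACKFRAME + ...)(sp) expressions would address memory below the
record.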

These changes keep s0 reserved for the frame-pointer chain at task and
exception boundaries.

Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 arch/riscv/include/asm/ptrace.h           |  9 ++++
 arch/riscv/include/asm/stacktrace/frame.h | 53 +++++++++++++++++++++++
 arch/riscv/kernel/asm-offsets.c           |  6 +++
 arch/riscv/kernel/entry.S                 | 39 ++++++++++++++++-
 arch/riscv/kernel/head.S                  | 23 ++++++++++
 arch/riscv/kernel/process.c               | 33 +++++++++++++-
 6 files changed, 159 insertions(+), 4 deletions(-)
 create mode 100644 arch/riscv/include/asm/stacktrace/frame.h

diff --git a/arch/riscv/include/asm/ptrace.h b/arch/riscv/include/asm/ptrace.h
index addc8188152f..4b9b0f279214 100644
--- a/arch/riscv/include/asm/ptrace.h
+++ b/arch/riscv/include/asm/ptrace.h
@@ -8,6 +8,7 @@
 
 #include <uapi/asm/ptrace.h>
 #include <asm/csr.h>
+#include <asm/stacktrace/frame.h>
 #include <linux/compiler.h>
 
 #ifndef __ASSEMBLER__
@@ -53,6 +54,14 @@ struct pt_regs {
 	unsigned long cause;
 	/* a0 value before the syscall */
 	unsigned long orig_a0;
+
+	/*
+	 * This frame record is entirely zeroed on exception entry, allowing the
+	 * unwinder to identify exception boundaries. The type field encodes
+	 * whether the exception was taken from user (FINAL) or kernel (PT_REGS)
+	 * mode.
+	 */
+	struct frame_record_meta stackframe;
 };
 
 #define PTRACE_SYSEMU			0x1f
diff --git a/arch/riscv/include/asm/stacktrace/frame.h b/arch/riscv/include/asm/stacktrace/frame.h
new file mode 100644
index 000000000000..5720a6c65fe8
--- /dev/null
+++ b/arch/riscv/include/asm/stacktrace/frame.h
@@ -0,0 +1,53 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __ASM_RISCV_STACKTRACE_FRAME_H
+#define __ASM_RISCV_STACKTRACE_FRAME_H
+
+/*
+ * See: arch/arm64/include/asm/stacktrace/frame.h for the reference
+ * implementation.
+ */
+
+/*
+ * - FRAME_META_TYPE_NONE
+ *
+ *   This value is reserved.
+ *
+ * - FRAME_META_TYPE_FINAL
+ *
+ *   The record is the last entry on the stack.
+ *   Unwinding should terminate successfully.
+ *
+ * - FRAME_META_TYPE_PT_REGS
+ *
+ *   The record is embedded within a struct pt_regs, recording the registers at
+ *   an arbitrary point in time.
+ *   Unwinding should consume pt_regs::epc, followed by pt_regs::ra.
+ *
+ * Note: all other values are reserved and should result in unwinding
+ * terminating with an error.
+ */
+#define FRAME_META_TYPE_NONE		0
+#define FRAME_META_TYPE_FINAL		1
+#define FRAME_META_TYPE_PT_REGS		2
+
+#ifndef __ASSEMBLER__
+/*
+ * A standard RISC-V frame record.
+ */
+struct frame_record {
+	unsigned long fp;
+	unsigned long ra;
+};
+
+/*
+ * A metadata frame record indicating a special unwind.
+ * The record::{fp,ra} fields must be zero to indicate the presence of
+ * metadata.
+ */
+struct frame_record_meta {
+	struct frame_record record;
+	unsigned long type;
+};
+#endif /* __ASSEMBLER__ */
+
+#endif /* __ASM_RISCV_STACKTRACE_FRAME_H */
diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
index a75f0cfea1e9..bc8e8cd7130a 100644
--- a/arch/riscv/kernel/asm-offsets.c
+++ b/arch/riscv/kernel/asm-offsets.c
@@ -131,6 +131,9 @@ void asm_offsets(void)
 	OFFSET(PT_BADADDR, pt_regs, badaddr);
 	OFFSET(PT_CAUSE, pt_regs, cause);
 
+	DEFINE(S_STACKFRAME,		offsetof(struct pt_regs, stackframe));
+	DEFINE(S_STACKFRAME_TYPE,	offsetof(struct pt_regs, stackframe.type));
+
 	OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
 
 	OFFSET(HIBERN_PBE_ADDR, pbe, address);
@@ -503,6 +506,9 @@ void asm_offsets(void)
 	DEFINE(STACKFRAME_SIZE_ON_STACK, ALIGN(sizeof(struct stackframe), STACK_ALIGN));
 	DEFINE(STACKFRAME_FP, offsetof(struct stackframe, fp) - sizeof(struct stackframe));
 	DEFINE(STACKFRAME_RA, offsetof(struct stackframe, ra) - sizeof(struct stackframe));
+	DEFINE(STACKFRAME_RECORD_SIZE, sizeof(struct stackframe));
+	OFFSET(FRAME_RECORD_FP, frame_record, fp);
+	OFFSET(FRAME_RECORD_RA, frame_record, ra);
 #ifdef CONFIG_FUNCTION_TRACER
 	DEFINE(FTRACE_OPS_FUNC,		offsetof(struct ftrace_ops, func));
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 08df724e13b9..d1cfb28f9180 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -11,6 +11,7 @@
 #include <asm/asm.h>
 #include <asm/csr.h>
 #include <asm/scs.h>
+#include <asm/stacktrace/frame.h>
 #include <asm/unistd.h>
 #include <asm/page.h>
 #include <asm/thread_info.h>
@@ -198,6 +199,27 @@ SYM_CODE_START(handle_exception)
 	REG_S s4, PT_CAUSE(sp)
 	REG_S s5, PT_TP(sp)
 
+	/*
+	 * Create a metadata frame record. The unwinder will use this to
+	 * identify and unwind exception boundaries.
+	 */
+	REG_S zero, (S_STACKFRAME + FRAME_RECORD_FP)(sp) /* stackframe.record.fp = 0 */
+	REG_S zero, (S_STACKFRAME + FRAME_RECORD_RA)(sp) /* stackframe.record.ra = 0 */
+#ifdef CONFIG_RISCV_M_MODE
+	li t0, SR_MPP
+	and t0, s1, t0
+#else
+	andi t0, s1, SR_SPP
+#endif
+	bnez t0, 1f
+	li t0, FRAME_META_TYPE_FINAL
+	j 2f
+1:
+	li t0, FRAME_META_TYPE_PT_REGS
+2:
+	REG_S t0, S_STACKFRAME_TYPE(sp)
+	addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
+
 	/*
 	 * Set the scratch register to 0, so that if a recursive exception
 	 * occurs, the exception vector knows it came from the kernel
@@ -354,6 +376,19 @@ SYM_CODE_START_LOCAL(handle_kernel_stack_overflow)
 	REG_S s3, PT_BADADDR(sp)
 	REG_S s4, PT_CAUSE(sp)
 	REG_S s5, PT_TP(sp)
+
+	/*
+	 * Create a metadata frame record for the overflow pt_regs. The
+	 * overflow path is entered from kernel context, so this is a nested
+	 * pt_regs boundary and the unwinder can resume from the pre-overflow
+	 * frame pointer saved in PT_S0.
+	 */
+	REG_S zero, (S_STACKFRAME + FRAME_RECORD_FP)(sp)
+	REG_S zero, (S_STACKFRAME + FRAME_RECORD_RA)(sp)
+	li t0, FRAME_META_TYPE_PT_REGS
+	REG_S t0, S_STACKFRAME_TYPE(sp)
+	addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
+
 	move a0, sp
 	tail handle_bad_stack
 SYM_CODE_END(handle_kernel_stack_overflow)
@@ -362,8 +397,8 @@ ASM_NOKPROBE(handle_kernel_stack_overflow)
 
 SYM_CODE_START(ret_from_fork_kernel_asm)
 	call schedule_tail
-	move a0, s1 /* fn_arg */
-	move a1, s0 /* fn */
+	move a0, s3 /* fn_arg */
+	move a1, s2 /* fn */
 	move a2, sp /* pt_regs */
 	call ret_from_fork_kernel
 	j ret_from_exception
diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index f6a8ca49e627..341b2d3facbc 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -14,6 +14,7 @@
 #include <asm/hwcap.h>
 #include <asm/image.h>
 #include <asm/scs.h>
+#include <asm/stacktrace/frame.h>
 #include <asm/usercfi.h>
 #include "efi-header.S"
 
@@ -177,6 +178,14 @@ secondary_start_sbi:
 	REG_S a0, (a1)
 1:
 #endif
+
+	/*
+	 * Set up the frame pointer for the secondary idle task so reliable
+	 * stack unwinding terminates at the metadata frame in task_pt_regs().
+	 * Without this, the first frame records can inherit an undefined caller
+	 * fp and unwind past smp_callin() into .Lsecondary_park.
+	 */
+	addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
 	scs_load_current
 	call smp_callin
 #endif /* CONFIG_SMP */
@@ -305,6 +314,20 @@ SYM_CODE_START(_start_kernel)
 	la tp, init_task
 	la sp, init_thread_union + THREAD_SIZE
 	addi sp, sp, -PT_SIZE_ON_STACK
+
+	/*
+	 * Set up a metadata frame record for the init task so that
+	 * the unwinder can identify the outermost frame by its
+	 * {fp, ra} = {0, 0} sentinel at the bottom of pt_regs.
+	 * fp/s0 points above the metadata record (RISC-V
+	 * convention).
+	 */
+	REG_S zero, (S_STACKFRAME + FRAME_RECORD_FP)(sp)
+	REG_S zero, (S_STACKFRAME + FRAME_RECORD_RA)(sp)
+	li t0, FRAME_META_TYPE_FINAL
+	REG_S t0, S_STACKFRAME_TYPE(sp)
+	addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
+
 #if defined(CONFIG_RISCV_SBI) && defined(CONFIG_RISCV_USER_CFI)
 	li a7, SBI_EXT_FWFT
 	li a6, SBI_EXT_FWFT_SET
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index b2df7f72241a..0dc90bf7a652 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -258,8 +258,23 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 		/* Supervisor/Machine, irqs on: */
 		childregs->status = SR_PP | SR_PIE;
 
-		p->thread.s[0] = (unsigned long)args->fn;
-		p->thread.s[1] = (unsigned long)args->fn_arg;
+		/*
+		 * Set up a metadata frame record at the bottom of the
+		 * stack for the unwinder. Use FRAME_META_TYPE_FINAL
+		 * since this is the outermost kernel entry for the new
+		 * task. The frame_record::{fp,ra} are already zero from
+		 * memset().
+		 *
+		 * fp/s0 points above the metadata record (RISC-V
+		 * convention). fn and fn_arg are passed via s2/s3,
+		 * keeping s0 available for the frame pointer chain.
+		 */
+		childregs->stackframe.type = FRAME_META_TYPE_FINAL;
+
+		p->thread.s[0] = (unsigned long)(&childregs->stackframe)
+				+ sizeof(struct frame_record);
+		p->thread.s[2] = (unsigned long)args->fn;
+		p->thread.s[3] = (unsigned long)args->fn_arg;
 		p->thread.ra = (unsigned long)ret_from_fork_kernel_asm;
 	} else {
 		/* allocate new shadow stack if needed. In case of CLONE_VM we have to */
@@ -278,6 +293,20 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 		if (clone_flags & CLONE_SETTLS)
 			childregs->tp = tls;
 		childregs->a0 = 0; /* Return value of fork() */
+
+		/*
+		 * Set up the unwind boundary: ensure the metadata
+		 * frame record has its {fp,ra} sentinel zeroed and
+		 * point fp/s0 above the metadata record. Mark it as
+		 * FINAL since this is the outermost kernel entry for
+		 * the new task.
+		 */
+		childregs->stackframe.record.fp = 0;
+		childregs->stackframe.record.ra = 0;
+		childregs->stackframe.type = FRAME_META_TYPE_FINAL;
+		p->thread.s[0] = (unsigned long)(&childregs->stackframe)
+				+ sizeof(struct frame_record);
+
 		p->thread.ra = (unsigned long)ret_from_fork_user_asm;
 	}
 	p->thread.riscv_v_flags = 0;
-- 
2.43.0


* [PATCH v4 5/7] riscv: stacktrace: switch to frame-pointer based unwinder
From: Wang Han @ 2026-06-29  6:42 UTC
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Alexandre Ghiti, linux-riscv, Oleg Nesterov, Steven Rostedt,
	Masami Hiramatsu, Mark Rutland, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, oliver.yang, xueshuai, zhuo.song, jkchen,
	Marcos Paulo de Souza, linux-kernel, linux-trace-kernel,
	linux-perf-users, live-patching, linux-kselftest

Replace the open-coded frame-pointer walker in arch_stack_walk() with a
robust kunwind state machine, modelled on arch/arm64/kernel/stacktrace.c
and retargeted to the RISC-V {fp, ra} frame record convention. The new
walker tracks stack bounds, consumes frame records monotonically,
understands the metadata pt_regs records added by the earlier frame
record metadata patch in this series, and recovers return addresses
replaced by function graph tracing and kretprobes.
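
The "tracks stack bounds, consumes frame records monotonically" part can
be pictured with the following user-space sketch, loosely modelled on
the arm64 approach of raising the lower bound after each accepted
record. The names sim_stack_info, sim_on_stack() and sim_consume() are
hypothetical and are not the helpers introduced by this series.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the still-unconsumed part of one stack. */
struct sim_stack_info {
	uintptr_t low;  /* lowest address a record may still start at */
	uintptr_t high; /* one past the highest valid address */
};

static bool sim_on_stack(const struct sim_stack_info *info,
			 uintptr_t addr, size_t size)
{
	if (!info->low)
		return false; /* unknown stack */
	if (addr < info->low || addr + size < addr || addr + size > info->high)
		return false;
	return true;
}

/*
 * Accept a record at 'addr' only if it is aligned and lies within the
 * remaining bounds, then raise 'low' so that no later record can start
 * at or below it.
 */
static bool sim_consume(struct sim_stack_info *info, uintptr_t addr,
			size_t size)
{
	if (addr % sizeof(unsigned long))
		return false;
	if (!sim_on_stack(info, addr, size))
		return false;
	info->low = addr + size;
	return true;
}
```

Because the lower bound only ever moves up, a corrupted or looping
frame-pointer chain is rejected after a bounded number of steps instead
of being followed indefinitely.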

This commit introduces arch_stack_walk_reliable() but does not yet
select HAVE_RELIABLE_STACKTRACE; that is done in a follow-up Kconfig
patch so this commit can be reviewed and bisected as a pure unwinder
replacement. Until that Kconfig change lands, livepatch remains
disabled and arch_stack_walk_reliable() has no in-tree caller.

Three related callers are updated to keep the same frame-record
assumptions everywhere:

  * Function graph tracing: the old RISC-V unwinder matched function
    graph return-stack entries by the saved return-address slot. That
    was consistent with the static mcount path, but not with the dynamic
    ftrace path where the parent slot is ftrace_regs::ra. Use the
    architectural frame pointer as the function graph return-address
    cookie, matching the kunwind walker.

  * Perf callchains: route kernel callchain collection through
    arch_stack_walk() so perf sees the same frame-pointer unwind
    behaviour as dump_stack() and the upcoming livepatch path.

  * dump_backtrace() / __get_wchan() / show_stack(): these now go
    through arch_stack_walk(); the explicit "Call Trace:" header is
    moved into dump_backtrace() to preserve the original output.
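
For the function graph item above, the reason both sides must agree on
the cookie can be shown with a simplified user-space model. This is not
the ftrace implementation: sim_ret_entry, sim_ret_addr() and
SIM_RETURN_HOOKER are hypothetical, and the real lookup is done by
ftrace_graph_ret_addr().

```c
#include <stddef.h>

/* Hypothetical marker standing in for the address of return_hooker. */
#define SIM_RETURN_HOOKER 0xdeadUL

/* One saved entry: the original return address and its lookup cookie. */
struct sim_ret_entry {
	unsigned long ret;
	const void *retp;
};

/*
 * If 'pc' was replaced by the return hook, find the original return
 * address by matching 'cookie' against the saved entries, newest first.
 * An unmatched cookie leaves the hooked address in place, which a
 * reliable unwinder would have to treat as a failure.
 */
static unsigned long sim_ret_addr(const struct sim_ret_entry *stack,
				  size_t depth, unsigned long pc,
				  const void *cookie)
{
	if (pc != SIM_RETURN_HOOKER)
		return pc;

	for (size_t i = depth; i-- > 0; ) {
		if (stack[i].retp == cookie)
			return stack[i].ret;
	}
	return pc;
}
```

In this model, an entry recorded with the frame pointer as its cookie is
only found when the unwinder also looks it up by frame pointer; looking
it up by the address of the return-address slot does not match.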

The non-frame-pointer fallback walker is kept untouched for
!CONFIG_FRAME_POINTER builds.

Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 arch/riscv/kernel/ftrace.c         |   6 +-
 arch/riscv/kernel/perf_callchain.c |   2 +-
 arch/riscv/kernel/stacktrace.c     | 559 ++++++++++++++++++++++++-----
 3 files changed, 471 insertions(+), 96 deletions(-)

diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c
index b430edfb83f4..5d55199a9230 100644
--- a/arch/riscv/kernel/ftrace.c
+++ b/arch/riscv/kernel/ftrace.c
@@ -242,7 +242,8 @@ void prepare_ftrace_return(unsigned long *parent, unsigned long self_addr,
 	 */
 	old = *parent;
 
-	if (!function_graph_enter(old, self_addr, frame_pointer, parent))
+	if (!function_graph_enter(old, self_addr, frame_pointer,
+				  (void *)frame_pointer))
 		*parent = return_hooker;
 }
 
@@ -264,7 +265,8 @@ void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
 	 */
 	old = *parent;
 
-	if (!function_graph_enter_regs(old, ip, frame_pointer, parent, fregs))
+	if (!function_graph_enter_regs(old, ip, frame_pointer,
+				       (void *)frame_pointer, fregs))
 		*parent = return_hooker;
 }
 #endif /* CONFIG_DYNAMIC_FTRACE */
diff --git a/arch/riscv/kernel/perf_callchain.c b/arch/riscv/kernel/perf_callchain.c
index b465bc9eb870..436af96ea59c 100644
--- a/arch/riscv/kernel/perf_callchain.c
+++ b/arch/riscv/kernel/perf_callchain.c
@@ -44,5 +44,5 @@ void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
 		return;
 	}
 
-	walk_stackframe(NULL, regs, fill_callchain, entry);
+	arch_stack_walk(fill_callchain, entry, NULL, regs);
 }
diff --git a/arch/riscv/kernel/stacktrace.c b/arch/riscv/kernel/stacktrace.c
index c7555447149b..c43bf9a84207 100644
--- a/arch/riscv/kernel/stacktrace.c
+++ b/arch/riscv/kernel/stacktrace.c
@@ -11,98 +11,16 @@
 #include <linux/sched/task_stack.h>
 #include <linux/stacktrace.h>
 #include <linux/ftrace.h>
+#include <linux/kprobes.h>
+#include <linux/llist.h>
 
 #include <asm/stacktrace.h>
 
-#ifdef CONFIG_FRAME_POINTER
-
 /*
- * This disables KASAN checking when reading a value from another task's stack,
- * since the other task could be running on another CPU and could have poisoned
- * the stack in the meantime.
+ * Non-frame-pointer fallback unwinder.
+ * Only compiled when CONFIG_FRAME_POINTER is not enabled.
  */
-#define READ_ONCE_TASK_STACK(task, x)			\
-({							\
-	unsigned long val;				\
-	unsigned long addr = x;				\
-	if ((task) == current)				\
-		val = READ_ONCE(addr);			\
-	else						\
-		val = READ_ONCE_NOCHECK(addr);		\
-	val;						\
-})
-
-extern asmlinkage void handle_exception(void);
-extern unsigned long ret_from_exception_end;
-
-static inline int fp_is_valid(unsigned long fp, unsigned long sp)
-{
-	unsigned long low, high;
-
-	low = sp + sizeof(struct stackframe);
-	high = ALIGN(sp, THREAD_SIZE);
-
-	return !(fp < low || fp > high || fp & 0x07);
-}
-
-void notrace walk_stackframe(struct task_struct *task, struct pt_regs *regs,
-			     bool (*fn)(void *, unsigned long), void *arg)
-{
-	unsigned long fp, sp, pc;
-	int graph_idx = 0;
-	int level = 0;
-
-	if (regs) {
-		fp = frame_pointer(regs);
-		sp = user_stack_pointer(regs);
-		pc = instruction_pointer(regs);
-	} else if (task == NULL || task == current) {
-		fp = (unsigned long)__builtin_frame_address(0);
-		sp = current_stack_pointer;
-		pc = (unsigned long)walk_stackframe;
-		level = -1;
-	} else {
-		/* task blocked in __switch_to */
-		fp = task->thread.s[0];
-		sp = task->thread.sp;
-		pc = task->thread.ra;
-	}
-
-	for (;;) {
-		struct stackframe *frame;
-
-		if (unlikely(!__kernel_text_address(pc) || (level++ >= 0 && !fn(arg, pc))))
-			break;
-
-		if (unlikely(!fp_is_valid(fp, sp)))
-			break;
-
-		/* Unwind stack frame */
-		frame = (struct stackframe *)fp - 1;
-		sp = fp;
-		if (regs && (regs->epc == pc) && fp_is_valid(frame->ra, sp)) {
-			/* We hit function where ra is not saved on the stack */
-			fp = frame->ra;
-			pc = regs->ra;
-		} else {
-			fp = READ_ONCE_TASK_STACK(task, frame->fp);
-			pc = READ_ONCE_TASK_STACK(task, frame->ra);
-			pc = ftrace_graph_ret_addr(task, &graph_idx, pc,
-						   &frame->ra);
-			if (pc >= (unsigned long)handle_exception &&
-			    pc < (unsigned long)&ret_from_exception_end) {
-				if (unlikely(!fn(arg, pc)))
-					break;
-
-				pc = ((struct pt_regs *)sp)->epc;
-				fp = ((struct pt_regs *)sp)->s0;
-			}
-		}
-
-	}
-}
-
-#else /* !CONFIG_FRAME_POINTER */
+#ifndef CONFIG_FRAME_POINTER
 
 void notrace walk_stackframe(struct task_struct *task,
 	struct pt_regs *regs, bool (*fn)(void *, unsigned long), void *arg)
@@ -133,7 +51,12 @@ void notrace walk_stackframe(struct task_struct *task,
 	}
 }
 
-#endif /* CONFIG_FRAME_POINTER */
+#endif /* !CONFIG_FRAME_POINTER */
+
+/*
+ * Common trace helpers.
+ * These are used by both the FP (kunwind) and non-FP (walk_stackframe) paths.
+ */
 
 static bool print_trace_address(void *arg, unsigned long pc)
 {
@@ -146,12 +69,12 @@ static bool print_trace_address(void *arg, unsigned long pc)
 noinline void dump_backtrace(struct pt_regs *regs, struct task_struct *task,
 		    const char *loglvl)
 {
-	walk_stackframe(task, regs, print_trace_address, (void *)loglvl);
+	printk("%sCall Trace:\n", loglvl);
+	arch_stack_walk(print_trace_address, (void *)loglvl, task, regs);
 }
 
 void show_stack(struct task_struct *task, unsigned long *sp, const char *loglvl)
 {
-	pr_cont("%sCall Trace:\n", loglvl);
 	dump_backtrace(NULL, task, loglvl);
 }
 
@@ -171,17 +94,467 @@ unsigned long __get_wchan(struct task_struct *task)
 
 	if (!try_get_task_stack(task))
 		return 0;
-	walk_stackframe(task, NULL, save_wchan, &pc);
+	arch_stack_walk(save_wchan, &pc, task, NULL);
 	put_task_stack(task);
 	return pc;
 }
 
-noinline noinstr void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
-		     struct task_struct *task, struct pt_regs *regs)
+/*
+ * Frame-pointer-based kernel unwind infrastructure.
+ * Only compiled when CONFIG_FRAME_POINTER is enabled.
+ *
+ * See: arch/arm64/kernel/stacktrace.c for the reference implementation.
+ */
+#ifdef CONFIG_FRAME_POINTER
+
+/*
+ * Per-cpu stacks are only accessible when unwinding the current task in a
+ * non-preemptible context.
+ */
+#define STACKINFO_CPU(task, name)				\
+	({							\
+		(((task) == current) && !preemptible())		\
+			? stackinfo_get_##name()		\
+			: stackinfo_get_unknown();		\
+	})
+
+enum kunwind_source {
+	KUNWIND_SOURCE_UNKNOWN,
+	KUNWIND_SOURCE_FRAME,
+	KUNWIND_SOURCE_CALLER,
+	KUNWIND_SOURCE_TASK,
+	KUNWIND_SOURCE_REGS_PC,
+};
+
+union unwind_flags {
+	unsigned long	all;
+	struct {
+		unsigned long	fgraph : 1,
+				kretprobe : 1;
+	};
+};
+
+/*
+ * Kernel unwind state
+ *
+ * @common:    Common unwind state.
+ * @task:      The task being unwound.
+ * @graph_idx: Used by ftrace_graph_ret_addr() for optimized stack unwinding.
+ * @kr_cur:    When KRETPROBES is selected, holds the kretprobe instance
+ *             associated with the most recently encountered replacement ra
+ *             value.
+ */
+struct kunwind_state {
+	struct unwind_state common;
+	struct task_struct *task;
+	int graph_idx;
+#ifdef CONFIG_KRETPROBES
+	struct llist_node *kr_cur;
+#endif
+	enum kunwind_source source;
+	union unwind_flags flags;
+	struct pt_regs *regs;
+};
+
+static __always_inline void
+kunwind_init(struct kunwind_state *state,
+	     struct task_struct *task)
+{
+	unwind_init_common(&state->common);
+	state->task = task;
+	state->source = KUNWIND_SOURCE_UNKNOWN;
+	state->flags.all = 0;
+	state->regs = NULL;
+}
+
+/*
+ * Start an unwind from a pt_regs.
+ *
+ * The unwind will begin at the PC within the regs.
+ *
+ * The regs must be on a stack currently owned by the calling task.
+ */
+static __always_inline void
+kunwind_init_from_regs(struct kunwind_state *state,
+		       struct pt_regs *regs)
+{
+	kunwind_init(state, current);
+
+	state->regs = regs;
+	state->common.fp = frame_pointer(regs);
+	state->common.pc = instruction_pointer(regs);
+	state->source = KUNWIND_SOURCE_REGS_PC;
+}
+
+/*
+ * Start an unwind from a caller.
+ *
+ * The unwind will begin at the caller of whichever function this is inlined
+ * into.
+ *
+ * The function which invokes this must be noinline.
+ */
+static __always_inline void
+kunwind_init_from_caller(struct kunwind_state *state)
+{
+	unsigned long fp = (unsigned long)__builtin_frame_address(0);
+	struct frame_record *record = (struct frame_record *)fp - 1;
+
+	kunwind_init(state, current);
+
+	state->common.fp = READ_ONCE(record->fp);
+	state->common.pc = READ_ONCE(record->ra);
+	state->source = KUNWIND_SOURCE_CALLER;
+}
+
+/*
+ * Start an unwind from a blocked task.
+ *
+ * The unwind will begin at the blocked task's saved PC (i.e. the caller of
+ * __switch_to).
+ *
+ * The caller should ensure the task is blocked in __switch_to for the
+ * duration of the unwind, or the unwind will be bogus. It is never valid to
+ * call this for the current task.
+ */
+static __always_inline void
+kunwind_init_from_task(struct kunwind_state *state,
+		       struct task_struct *task)
+{
+	kunwind_init(state, task);
+
+	state->common.fp = task->thread.s[0];
+	state->common.pc = task->thread.ra;
+	state->source = KUNWIND_SOURCE_TASK;
+}
+
+static __always_inline int
+kunwind_recover_return_address(struct kunwind_state *state)
+{
+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
+	if (state->task->ret_stack &&
+	    state->common.pc == (unsigned long)return_to_handler) {
+		unsigned long orig_pc;
+
+		orig_pc = ftrace_graph_ret_addr(state->task, &state->graph_idx,
+						state->common.pc,
+						(void *)state->common.fp);
+		if (state->common.pc == orig_pc) {
+			WARN_ON_ONCE(state->task == current);
+			return -EINVAL;
+		}
+		state->common.pc = orig_pc;
+		state->flags.fgraph = 1;
+	}
+#endif /* CONFIG_FUNCTION_GRAPH_TRACER */
+
+#ifdef CONFIG_KRETPROBES
+	if (is_kretprobe_trampoline(state->common.pc)) {
+		unsigned long orig_pc;
+
+		orig_pc = kretprobe_find_ret_addr(state->task,
+						  (void *)state->common.fp,
+						  &state->kr_cur);
+		if (!orig_pc)
+			return -EINVAL;
+		state->common.pc = orig_pc;
+		state->flags.kretprobe = 1;
+	}
+#endif /* CONFIG_KRETPROBES */
+
+	return 0;
+}
+
+/*
+ * When we reach an exception boundary marked by a metadata frame record,
+ * extract pt_regs from the stack and continue unwinding from the saved
+ * context (epc and s0/fp).
+ *
+ * On RISC-V, fp points above the metadata record, so the record's
+ * frame_record portion is at fp - sizeof(struct frame_record).
+ */
+static __always_inline int
+kunwind_next_regs_pc(struct kunwind_state *state)
+{
+	struct stack_info *info;
+	unsigned long fp = state->common.fp;
+	struct pt_regs *regs;
+
+	regs = container_of((unsigned long *)(fp - sizeof(struct frame_record)),
+			    struct pt_regs, stackframe.record.fp);
+
+	info = unwind_find_stack(&state->common, (unsigned long)regs,
+				 sizeof(*regs));
+	if (!info)
+		return -EINVAL;
+
+	unwind_consume_stack(&state->common, info, (unsigned long)regs,
+			     sizeof(*regs));
+
+	state->regs = regs;
+	state->common.pc = regs->epc;
+	state->common.fp = frame_pointer(regs);
+	state->source = KUNWIND_SOURCE_REGS_PC;
+	return 0;
+}
+
+/*
+ * Handle a metadata frame record embedded in pt_regs.
+ *
+ * On RISC-V, fp points above the record (fp = metadata + 16), so the
+ * frame_record_meta starts at fp - sizeof(struct frame_record).
+ *
+ * FRAME_META_TYPE_FINAL: This is the outermost exception entry
+ *   (user -> kernel). Unwinding terminates successfully.
+ * FRAME_META_TYPE_PT_REGS: This is a nested exception entry
+ *   (kernel -> kernel). Continue unwinding from the saved context.
+ */
+static __always_inline int
+kunwind_next_frame_record_meta(struct kunwind_state *state)
+{
+	struct task_struct *tsk = state->task;
+	unsigned long fp = state->common.fp;
+	unsigned long meta_base = fp - sizeof(struct frame_record);
+	struct frame_record_meta *meta;
+	struct stack_info *info;
+
+	info = unwind_find_stack(&state->common, meta_base, sizeof(*meta));
+	if (!info)
+		return -EINVAL;
+
+	meta = (struct frame_record_meta *)meta_base;
+	switch (READ_ONCE(meta->type)) {
+	case FRAME_META_TYPE_FINAL:
+		if (meta == &task_pt_regs(tsk)->stackframe)
+			return -ENOENT;
+		WARN_ON_ONCE(tsk == current);
+		return -EINVAL;
+	case FRAME_META_TYPE_PT_REGS:
+		return kunwind_next_regs_pc(state);
+	default:
+		WARN_ON_ONCE(tsk == current);
+		return -EINVAL;
+	}
+}
+
+/*
+ * Unwind from one frame record to the next.
+ *
+ * On RISC-V, the frame record sits at fp - sizeof(struct frame_record),
+ * immediately below the address pointed to by fp/s0. This applies to both
+ * normal frame records and metadata frame records (embedded in pt_regs).
+ *
+ * A metadata record is identified by both fp and ra being zero in the
+ * frame_record portion, with a type value following at fp + 16.
+ */
+static __always_inline int
+kunwind_next_frame_record(struct kunwind_state *state)
+{
+	unsigned long fp = state->common.fp;
+	struct frame_record *record;
+	struct stack_info *info;
+	unsigned long new_fp, new_pc;
+	unsigned long record_base;
+
+	if (fp & 0x7)
+		return -EINVAL;
+
+	record_base = fp - sizeof(*record);
+
+	info = unwind_find_stack(&state->common, record_base, sizeof(*record));
+	if (!info)
+		return -EINVAL;
+
+	record = (struct frame_record *)record_base;
+	new_fp = READ_ONCE(record->fp);
+	new_pc = READ_ONCE(record->ra);
+
+	if (!new_fp && !new_pc)
+		return kunwind_next_frame_record_meta(state);
+
+	unwind_consume_stack(&state->common, info, record_base,
+			     sizeof(*record));
+
+	state->common.fp = new_fp;
+	state->common.pc = new_pc;
+	state->source = KUNWIND_SOURCE_FRAME;
+
+	return 0;
+}
+
+/*
+ * Unwind from one frame record (A) to the next frame record (B).
+ *
+ * We terminate early if the location of B indicates a malformed chain of frame
+ * records (e.g. a cycle), determined based on the location and fp value of A
+ * and the location (but not the fp value) of B.
+ */
+static __always_inline int
+kunwind_next(struct kunwind_state *state)
+{
+	int err;
+
+	state->flags.all = 0;
+
+	switch (state->source) {
+	case KUNWIND_SOURCE_FRAME:
+	case KUNWIND_SOURCE_CALLER:
+	case KUNWIND_SOURCE_TASK:
+	case KUNWIND_SOURCE_REGS_PC:
+		err = kunwind_next_frame_record(state);
+		break;
+	default:
+		err = -EINVAL;
+	}
+
+	if (err)
+		return err;
+
+	return kunwind_recover_return_address(state);
+}
+
+typedef bool (*kunwind_consume_fn)(const struct kunwind_state *state, void *cookie);
+
+static __always_inline int
+do_kunwind(struct kunwind_state *state, kunwind_consume_fn consume_state,
+	   void *cookie)
+{
+	int ret;
+
+	ret = kunwind_recover_return_address(state);
+	if (ret)
+		return ret;
+
+	while (1) {
+		if (!consume_state(state, cookie))
+			return -EINVAL;
+		ret = kunwind_next(state);
+		if (ret == -ENOENT)
+			return 0;
+		if (ret < 0)
+			return ret;
+	}
+}
+
+static __always_inline int
+kunwind_stack_walk(kunwind_consume_fn consume_state,
+		   void *cookie, struct task_struct *task,
+		   struct pt_regs *regs)
+{
+	struct task_struct *tsk = task ?: current;
+	struct stack_info stacks[] = {
+		stackinfo_get_task(tsk),
+		STACKINFO_CPU(tsk, irq),
+#ifdef CONFIG_VMAP_STACK
+		STACKINFO_CPU(tsk, overflow),
+#endif
+	};
+	struct kunwind_state state = {
+		.common = {
+			.stacks = stacks,
+			.nr_stacks = ARRAY_SIZE(stacks),
+		},
+	};
+
+	if (regs) {
+		if (tsk != current)
+			return -EINVAL;
+		kunwind_init_from_regs(&state, regs);
+	} else if (tsk == current) {
+		kunwind_init_from_caller(&state);
+	} else {
+		kunwind_init_from_task(&state, tsk);
+	}
+
+	return do_kunwind(&state, consume_state, cookie);
+}
+
+struct kunwind_consume_entry_data {
+	stack_trace_consume_fn consume_entry;
+	void *cookie;
+};
+
+static __always_inline bool
+arch_kunwind_consume_entry(const struct kunwind_state *state, void *cookie)
+{
+	struct kunwind_consume_entry_data *data = cookie;
+
+	return data->consume_entry(data->cookie, state->common.pc);
+}
+
+static __always_inline bool
+arch_reliable_kunwind_consume_entry(const struct kunwind_state *state, void *cookie)
+{
+	/*
+	 * At an exception boundary we can reliably consume the saved PC. We do
+	 * not know whether ra was live when the exception was taken, and
+	 * so we cannot perform the next unwind step reliably.
+	 *
+	 * All that matters is whether the *entire* unwind is reliable, so give
+	 * up as soon as we hit an exception boundary.
+	 */
+	if (state->source == KUNWIND_SOURCE_REGS_PC)
+		return false;
+
+	return arch_kunwind_consume_entry(state, cookie);
+}
+
+#endif /* CONFIG_FRAME_POINTER */
+
+/*
+ * arch_stack_walk - dual implementation.
+ *
+ * When CONFIG_FRAME_POINTER is enabled, uses the kunwind infrastructure for
+ * robust frame-pointer-based unwinding, consistent with arch_stack_walk_reliable.
+ *
+ * When CONFIG_FRAME_POINTER is disabled, falls back to the simple stack scan
+ * in walk_stackframe().
+ */
+#ifdef CONFIG_FRAME_POINTER
+
+noinline noinstr void arch_stack_walk(stack_trace_consume_fn consume_entry,
+				      void *cookie, struct task_struct *task,
+				      struct pt_regs *regs)
+{
+	struct kunwind_consume_entry_data data = {
+		.consume_entry = consume_entry,
+		.cookie = cookie,
+	};
+
+	kunwind_stack_walk(arch_kunwind_consume_entry, &data, task, regs);
+}
+
+#else
+
+noinline noinstr void arch_stack_walk(stack_trace_consume_fn consume_entry,
+				      void *cookie, struct task_struct *task,
+				      struct pt_regs *regs)
 {
 	walk_stackframe(task, regs, consume_entry, cookie);
 }
 
+#endif /* CONFIG_FRAME_POINTER */
+
+/*
+ * Reliable stack walk for livepatch (CONFIG_FRAME_POINTER only).
+ */
+#ifdef CONFIG_FRAME_POINTER
+
+noinline noinstr int arch_stack_walk_reliable(stack_trace_consume_fn consume_entry,
+					      void *cookie,
+					      struct task_struct *task)
+{
+	struct kunwind_consume_entry_data data = {
+		.consume_entry = consume_entry,
+		.cookie = cookie,
+	};
+
+	return kunwind_stack_walk(arch_reliable_kunwind_consume_entry, &data,
+				  task, NULL);
+}
+
+#endif /* CONFIG_FRAME_POINTER */
+
 /*
  * Get the return address for a single stackframe and return a pointer to the
  * next frame tail.
-- 
2.43.0

* [PATCH v4 3/7] riscv: ftrace: always preserve s0 in dynamic ftrace register frame
From: Wang Han @ 2026-06-29  6:42 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Alexandre Ghiti, linux-riscv, Oleg Nesterov, Steven Rostedt,
	Masami Hiramatsu, Mark Rutland, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, oliver.yang, xueshuai, zhuo.song, jkchen,
	Marcos Paulo de Souza, linux-kernel, linux-trace-kernel,
	linux-perf-users, live-patching, linux-kselftest

struct __arch_ftrace_regs declares s0 unconditionally, and both
ftrace_regs_get_frame_pointer() and ftrace_partial_regs() read it
unconditionally. But the SAVE_ABI_REGS / RESTORE_ABI_REGS macros in
mcount-dyn.S only stored s0 under HAVE_FUNCTION_GRAPH_FP_TEST
(CONFIG_FUNCTION_GRAPH_TRACER && CONFIG_FRAME_POINTER). With
CONFIG_FRAME_POINTER=n the slot held whatever was on the stack before,
so any callback going through ftrace_partial_regs() saw a garbage
regs->s0. RISC-V kernels default to FRAME_POINTER=y, which is why this
has not bitten in practice.

Save and restore s0 unconditionally in the dynamic ftrace ABI register
frame. This fixes the latent garbage-s0 case, brings the dynamic ftrace
path in line with the static _mcount path (mcount.S SAVE_ABI_STATE
already saves s0 unconditionally), and matches the frame layout already
documented in the comment above SAVE_ABI_REGS. It is also a prerequisite
for the upcoming reliable unwinder, which reads
ftrace_regs_get_frame_pointer(fregs) directly.

The cost is one extra REG_S/REG_L pair per traced call, negligible
compared to the overall ftrace cost; the existing FREGS_SIZE_ON_STACK
already reserved the slot, so no extra stack space is used.

Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 arch/riscv/kernel/mcount-dyn.S | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/arch/riscv/kernel/mcount-dyn.S b/arch/riscv/kernel/mcount-dyn.S
index 082fe0b0e3c0..26c55fba8fec 100644
--- a/arch/riscv/kernel/mcount-dyn.S
+++ b/arch/riscv/kernel/mcount-dyn.S
@@ -85,9 +85,7 @@
 	addi	sp, sp, -FREGS_SIZE_ON_STACK
 	REG_S	t0,  FREGS_EPC(sp)
 	REG_S	x1,  FREGS_RA(sp)
-#ifdef HAVE_FUNCTION_GRAPH_FP_TEST
 	REG_S	x8,  FREGS_S0(sp)
-#endif
 	REG_S	x6,  FREGS_T1(sp)
 #ifdef CONFIG_CC_IS_CLANG
 	REG_S	x7,  FREGS_T2(sp)
@@ -113,9 +111,7 @@
 	.macro RESTORE_ABI_REGS
 	REG_L	t0, FREGS_EPC(sp)
 	REG_L	x1, FREGS_RA(sp)
-#ifdef HAVE_FUNCTION_GRAPH_FP_TEST
 	REG_L	x8, FREGS_S0(sp)
-#endif
 	REG_L	x6,  FREGS_T1(sp)
 #ifdef CONFIG_CC_IS_CLANG
 	REG_L	x7,  FREGS_T2(sp)
-- 
2.43.0

* Re: [PATCH v3 04/17] tools/rv: Add selftests
From: Gabriele Monaco @ 2026-06-29  6:51 UTC (permalink / raw)
  To: Wen Yang, linux-trace-kernel, linux-kernel
  Cc: Nam Cao, Steven Rostedt, Thomas Weissschuh, Tomas Glozar,
	John Kacur
In-Reply-To: <0c230c01-77c8-4c9e-9f49-ecb9555402cf@linux.dev>

Please cut down the context a bit more next time; it makes it much
easier to find your review.

On Mon, 2026-06-29 at 01:10 +0800, Wen Yang wrote:
> On 6/25/26 20:14, Gabriele Monaco wrote:
> > +	eval "$TIMEOUT" "$command" &> check_output.$$ &
> > +	bgpid=$!
> > +	pid=$(pgrep -f "${command%%[|;&>]*}" | tail -n1)
> 
> The pgrep may run immediately after the background fork, before the
> child process has had time to exec.

Yeah, I'm aware of this, but I ignored it for now since I have never
seen it cause trouble in practice.

I could add some delay waiting for the task like:

  while [ -z "$pid" ]; do
    sleep .5
    pid=$(pgrep -f "${command%%[|;&>]*}" | tail -n1)
  done

Probably with a maximum of N retries, in case the task never started or
we messed up the pattern.
That may still race if the command exits before we pgrep it, but in
practice that shouldn't be a problem in our tests.

Any better ideas? We cannot really rely on the shell's $! because command
is run through a combination of eval+timeout and we'd get the wrong pid.

Thanks,
Gabriele

* Re: [PATCH v3 14/17] verification/rvgen: Add selftests for rvgen kunit
From: Gabriele Monaco @ 2026-06-29  7:04 UTC (permalink / raw)
  To: Wen Yang, linux-trace-kernel, linux-kernel
  Cc: Nam Cao, Steven Rostedt, Thomas Weissschuh, Tomas Glozar,
	John Kacur
In-Reply-To: <53f2d13c-cc1d-48fd-95f0-2c1ff1edfc88@linux.dev>

On Mon, 2026-06-29 at 01:06 +0800, Wen Yang wrote:
> On 6/25/26 20:14, Gabriele Monaco wrote:
> > +static void handle_example_event(void *data, /* XXX: fill header */)
> > +{
> > +	ltl_atom_update(task, LTL_EVENT_A, true/false);
> > +}
> > +
> > +static int enable_test_ltl_kunit(void)
> > +{
> > +	int retval;
> > +
> > +	retval = ltl_monitor_init();
> > +	if (retval)
> > +		return retval;
> > +
> > +	rv_attach_trace_probe("test_ltl_kunit", /* XXX: tracepoint */,
> > handle_example_event);
> > +
> > +	return 0;
> > +}
> > +
> > +static void disable_test_ltl_kunit(void)
> > +{
> > +	rv_detach_trace_probe("test_ltl_kunit", /* XXX: tracepoint */,
> > handle_sample_event);
> > +
> 
> one typo:
>         handle_sample_event should be handle_example_event.

Keep in mind that those files are not ready to build, users need to
touch them anyway after generation from rvgen. Nevertheless, this has
been inconsistent for a while and I should fix it.

> > +	ltl_monitor_destroy();
> > +}
...
> > +MODULE_LICENSE("GPL");
> > +MODULE_AUTHOR(/* TODO */);
> 
> Please use a valid string here.

Likewise, this is not supposed to build. We are just validating what
rvgen produces, and that's the expected output; the user will need to
fill it with an appropriate string.

LTL uses a different approach compared to DA/HA in this template; I'm not
sure it's worth aligning the two.

Thanks,
Gabriele

> > +MODULE_DESCRIPTION("test_ltl_kunit: auto-generated");


* [PATCH v4 RESEND 0/7] riscv: Add reliable stack unwinding for livepatch
From: Wang Han @ 2026-06-29  7:27 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Alexandre Ghiti, linux-riscv, Oleg Nesterov, Steven Rostedt,
	Masami Hiramatsu, Mark Rutland, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, oliver.yang, xueshuai, zhuo.song, jkchen,
	Marcos Paulo de Souza, linux-kernel, linux-trace-kernel,
	linux-perf-users, live-patching, linux-kselftest

Hi,

Sorry for the noise and inconvenience caused by the previous unthreaded
v4 submission.

This is a resend of v4 only to fix mail threading. The patch contents
are unchanged from the previous v4.

Previous unthreaded v4:
  https://lore.kernel.org/r/20260629064228.3195856-1-wanghan@linux.alibaba.com

This series adds the RISC-V architecture pieces needed by livepatch:
dynamic ftrace must preserve the frame pointer context that livepatch
uses for redirection, and the stack unwinder must be reliable enough for
the livepatch consistency model.

The reliable unwinder is based on frame records and explicit metadata at
task and exception boundaries. It is intentionally conservative: it
rejects ambiguous states instead of trying to unwind through them. The
series then enables HAVE_RELIABLE_STACKTRACE and HAVE_LIVEPATCH for
64-bit RISC-V with dynamic ftrace.
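
As a rough illustration of the conservative approach described above,
here is a userspace sketch that walks a chain of {fp, ra} records in
which fp points just above each record, and gives up on anything
ambiguous. The names walk_frames, low, high and pcs are hypothetical,
and the "fp must strictly increase" check is a simplified stand-in for
the stack-bound tracking in this series, not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model of a RISC-V {fp, ra} frame record (illustrative only). */
struct frame_record {
	uintptr_t fp;
	uintptr_t ra;
};

/*
 * Walk frame records that live inside [low, high).
 *
 * fp points just above a record, so the record starts at
 * fp - sizeof(struct frame_record). A zero {fp, ra} pair is treated as
 * the terminating sentinel.
 *
 * Returns the number of return addresses stored in pcs[], or -1 if the
 * chain is misaligned, out of bounds, non-progressing, or too long.
 */
static int walk_frames(uintptr_t fp, uintptr_t low, uintptr_t high,
		       uintptr_t *pcs, int max)
{
	int n = 0;

	assert(low < high);

	for (;;) {
		const struct frame_record *rec;

		/* Reject a misaligned frame pointer. */
		if (fp & (sizeof(uintptr_t) - 1))
			return -1;
		/* The whole record must sit inside the stack bounds. */
		if (fp < low + sizeof(*rec) || fp > high)
			return -1;

		rec = (const struct frame_record *)(fp - sizeof(*rec));

		/* Zero {fp, ra}: sentinel reached, the walk is complete. */
		if (!rec->fp && !rec->ra)
			return n;
		/* Assumption of this sketch: each step moves to a higher fp. */
		if (rec->fp <= fp)
			return -1;
		if (n == max)
			return -1;

		pcs[n++] = rec->ra;
		fp = rec->fp;
	}
}
```

A caller would pass the bounds of one stack and a small output array; a
record that points back at itself or at a lower address makes the walk
fail rather than loop.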

The first patch from v3, "scripts/sorttable: Handle RISC-V patchable
ftrace entries", has been picked up by Paul and is already present in
riscv/for-next as commit 57ad674d032b. It is therefore dropped from this
v4. The remaining patches are rebased on the latest riscv/for-next.

Base:
  riscv/for-next 798246e5edfb ("riscv: acpi: Enable ARCH_HAS_ACPI_TABLE_UPGRADE")

Previous versions:
  v3: https://lore.kernel.org/r/cover.194d76e3a15b.v3.riscv-livepatch.wanghan@linux.alibaba.com
  v2: https://lore.kernel.org/r/20260528082310.1994388-1-wanghan@linux.alibaba.com
  v1: https://lore.kernel.org/r/20260527123530.2593918-1-wanghan@linux.alibaba.com

Changes since v3:
  * Drop the accepted sorttable fix, now commit 57ad674d032b in
    riscv/for-next.
  * Rebase the remaining 7 patches on the latest riscv/for-next.
  * Adapt the frame metadata patch to the existing call_on_irq_stack()
    RV32 frame-pointer ABI fix by keeping metadata frame-record offsets
    distinct from the s0-relative STACKFRAME_* offsets.
  * Adapt the livepatch syscall selftest prefix change to the current
    CONFIG_ARCH_HAS_SYSCALL_WRAPPER wrapper logic.

Validation:
  * Built with riscv64-unknown-linux-gnu-gcc 15.2.0 and the existing
    configs/riscv_livepatch_config, including RISCV_ISA_C=y, EFI=y and
    ACPI=y.
  * make -C linux O=$PWD/build/linux/riscv ARCH=riscv \
      CROSS_COMPILE=riscv64-unknown-linux-gnu- -j$(nproc) Image modules
    passed.
  * make kernel-debug ARCH=riscv GDB=riscv64-unknown-linux-gnu-gdb
    passed.
  * livepatch selftest modules built successfully via ./test-livepatch.sh
    riscv.
  * QEMU RISC-V livepatch selftests passed with PASS: 7, SKIP: 1,
    FAIL: 0. The only skip is test-kprobe.sh because this config does
    not enable CONFIG_KPROBES_ON_FTRACE.
  * The ftrace function graph subset passed with 3 passed, 0 failed and
    3 unsupported tests.

Wang Han (7):
  riscv: stacktrace: Add frame record metadata
  riscv: stacktrace: disable KASAN and KCOV instrumentation for
    stacktrace.o
  riscv: ftrace: always preserve s0 in dynamic ftrace register frame
  riscv: stacktrace: introduce stack-bound tracking helpers
  riscv: stacktrace: switch to frame-pointer based unwinder
  riscv: Kconfig: enable HAVE_RELIABLE_STACKTRACE and HAVE_LIVEPATCH
  selftests/livepatch: Add RISC-V syscall wrapper prefix

 arch/riscv/Kconfig                            |   4 +
 arch/riscv/include/asm/ptrace.h               |   9 +
 arch/riscv/include/asm/stacktrace.h           |  65 +-
 arch/riscv/include/asm/stacktrace/common.h    | 159 +++++
 arch/riscv/include/asm/stacktrace/frame.h     |  53 ++
 arch/riscv/kernel/Makefile                    |   6 +
 arch/riscv/kernel/asm-offsets.c               |   6 +
 arch/riscv/kernel/entry.S                     |  39 +-
 arch/riscv/kernel/ftrace.c                    |   6 +-
 arch/riscv/kernel/head.S                      |  23 +
 arch/riscv/kernel/mcount-dyn.S                |   4 -
 arch/riscv/kernel/perf_callchain.c            |   2 +-
 arch/riscv/kernel/process.c                   |  33 +-
 arch/riscv/kernel/stacktrace.c                | 559 +++++++++++++++---
 .../livepatch/test_modules/test_klp_syscall.c |   2 +
 15 files changed, 864 insertions(+), 106 deletions(-)
 create mode 100644 arch/riscv/include/asm/stacktrace/common.h
 create mode 100644 arch/riscv/include/asm/stacktrace/frame.h

Range-diff against v3:
1:  8cef363cfed7 ! 1:  efd99ce56c1e riscv: stacktrace: Add frame record metadata
    @@ Commit message
             the secondary CPU path sets up s0 before smp_callin() so idle-task
             unwinding does not inherit an undefined caller frame;
           * copy_thread creates matching final metadata records for new kernel
    -        and user tasks, and keeps s0 available for the frame-pointer chain;
    -      * call_on_irq_stack still reserves an aligned stack slot, but links the
    -        saved {fp, ra} with the raw frame-record size so s0 points at the
    -        RISC-V frame record rather than past the alignment padding.
    +        and user tasks, and keeps s0 available for the frame-pointer chain.
     
    -    The call_on_irq_stack adjustment fixes a latent RV32 issue. On RV64,
    -    sizeof(struct stackframe) is equal to the stack alignment, so the old
    -    s0 value happened to point just above the saved {fp, ra}. On RV32, the
    -    raw frame record is 8 bytes while the reserved stack slot is 16-byte
    -    aligned, so the old s0 value pointed into the padding. Using the raw
    -    record size makes s0 point above the saved frame record on both RV32
    -    and RV64 while still reserving the aligned slot.
    +    Keep the embedded metadata-record field offsets distinct from the
    +    s0-relative STACKFRAME_* offsets used by call_on_irq_stack(), because
    +    the latter describe a frame record relative to s0 rather than to the
    +    record base.
     
         These changes keep s0 reserved for the frame-pointer chain at task and
    -    stack-switch boundaries.
    +    exception boundaries.
     
         Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
     
    @@ arch/riscv/kernel/asm-offsets.c: void asm_offsets(void)
      
      	OFFSET(HIBERN_PBE_ADDR, pbe, address);
     @@ arch/riscv/kernel/asm-offsets.c: void asm_offsets(void)
    - 	OFFSET(SBI_HART_BOOT_STACK_PTR_OFFSET, sbi_hart_boot_data, stack_ptr);
    - 
      	DEFINE(STACKFRAME_SIZE_ON_STACK, ALIGN(sizeof(struct stackframe), STACK_ALIGN));
    + 	DEFINE(STACKFRAME_FP, offsetof(struct stackframe, fp) - sizeof(struct stackframe));
    + 	DEFINE(STACKFRAME_RA, offsetof(struct stackframe, ra) - sizeof(struct stackframe));
     +	DEFINE(STACKFRAME_RECORD_SIZE, sizeof(struct stackframe));
    - 	OFFSET(STACKFRAME_FP, stackframe, fp);
    - 	OFFSET(STACKFRAME_RA, stackframe, ra);
    ++	OFFSET(FRAME_RECORD_FP, frame_record, fp);
    ++	OFFSET(FRAME_RECORD_RA, frame_record, ra);
      #ifdef CONFIG_FUNCTION_TRACER
    + 	DEFINE(FTRACE_OPS_FUNC,		offsetof(struct ftrace_ops, func));
    + #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
     
      ## arch/riscv/kernel/entry.S ##
     @@
    @@ arch/riscv/kernel/entry.S: SYM_CODE_START(handle_exception)
     +	 * Create a metadata frame record. The unwinder will use this to
     +	 * identify and unwind exception boundaries.
     +	 */
    -+	REG_S zero, (S_STACKFRAME + STACKFRAME_FP)(sp) /* stackframe.record.fp = 0 */
    -+	REG_S zero, (S_STACKFRAME + STACKFRAME_RA)(sp) /* stackframe.record.ra = 0 */
    ++	REG_S zero, (S_STACKFRAME + FRAME_RECORD_FP)(sp) /* stackframe.record.fp = 0 */
    ++	REG_S zero, (S_STACKFRAME + FRAME_RECORD_RA)(sp) /* stackframe.record.ra = 0 */
     +#ifdef CONFIG_RISCV_M_MODE
     +	li t0, SR_MPP
     +	and t0, s1, t0
    @@ arch/riscv/kernel/entry.S: SYM_CODE_START_LOCAL(handle_kernel_stack_overflow)
     +	 * pt_regs boundary and the unwinder can resume from the pre-overflow
     +	 * frame pointer saved in PT_S0.
     +	 */
    -+	REG_S zero, (S_STACKFRAME + STACKFRAME_FP)(sp)
    -+	REG_S zero, (S_STACKFRAME + STACKFRAME_RA)(sp)
    ++	REG_S zero, (S_STACKFRAME + FRAME_RECORD_FP)(sp)
    ++	REG_S zero, (S_STACKFRAME + FRAME_RECORD_RA)(sp)
     +	li t0, FRAME_META_TYPE_PT_REGS
     +	REG_S t0, S_STACKFRAME_TYPE(sp)
     +	addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
    @@ arch/riscv/kernel/entry.S: ASM_NOKPROBE(handle_kernel_stack_overflow)
      	move a2, sp /* pt_regs */
      	call ret_from_fork_kernel
      	j ret_from_exception
    -@@ arch/riscv/kernel/entry.S: SYM_FUNC_START(call_on_irq_stack)
    - 	addi	sp, sp, -STACKFRAME_SIZE_ON_STACK
    - 	REG_S	ra, STACKFRAME_RA(sp)
    - 	REG_S	s0, STACKFRAME_FP(sp)
    --	addi	s0, sp, STACKFRAME_SIZE_ON_STACK
    -+	addi	s0, sp, STACKFRAME_RECORD_SIZE
    - 
    - 	/* Switch to the per-CPU shadow call stack */
    - 	scs_save_current
    -@@ arch/riscv/kernel/entry.S: SYM_FUNC_START(call_on_irq_stack)
    - 	scs_load_current
    - 
    - 	/* Switch back to the thread stack and restore ra and s0 */
    --	addi	sp, s0, -STACKFRAME_SIZE_ON_STACK
    -+	addi	sp, s0, -STACKFRAME_RECORD_SIZE
    - 	REG_L	ra, STACKFRAME_RA(sp)
    - 	REG_L	s0, STACKFRAME_FP(sp)
    - 	addi	sp, sp, STACKFRAME_SIZE_ON_STACK
     
      ## arch/riscv/kernel/head.S ##
     @@
    @@ arch/riscv/kernel/head.S: SYM_CODE_START(_start_kernel)
     +	 * fp/s0 points above the metadata record (RISC-V
     +	 * convention).
     +	 */
    -+	REG_S zero, (S_STACKFRAME + STACKFRAME_FP)(sp)
    -+	REG_S zero, (S_STACKFRAME + STACKFRAME_RA)(sp)
    ++	REG_S zero, (S_STACKFRAME + FRAME_RECORD_FP)(sp)
    ++	REG_S zero, (S_STACKFRAME + FRAME_RECORD_RA)(sp)
     +	li t0, FRAME_META_TYPE_FINAL
     +	REG_S t0, S_STACKFRAME_TYPE(sp)
     +	addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
2:  237864b66d78 = 2:  bc0af8ec1976 riscv: stacktrace: disable KASAN and KCOV instrumentation for stacktrace.o
3:  e6035966a35a = 3:  ab3fbed66fff riscv: ftrace: always preserve s0 in dynamic ftrace register frame
4:  d132087ea01e = 4:  856d1e31408a riscv: stacktrace: introduce stack-bound tracking helpers
5:  02adea3ece82 = 5:  58aa4435e2ee riscv: stacktrace: switch to frame-pointer based unwinder
6:  c7d7dbe7a8a1 = 6:  500f9d9eeac0 riscv: Kconfig: enable HAVE_RELIABLE_STACKTRACE and HAVE_LIVEPATCH
7:  ae94a234b34a < -:  ------------ selftests/livepatch: Add RISC-V syscall wrapper prefix
-:  ------------ > 7:  3dcfa694c207 selftests/livepatch: Add RISC-V syscall wrapper prefix

base-commit: 798246e5edfb3aa0b2d6dca46f41014d0b99b209
-- 
2.43.0

* [PATCH v4 RESEND 1/7] riscv: stacktrace: Add frame record metadata
From: Wang Han @ 2026-06-29  7:27 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Alexandre Ghiti, linux-riscv, Oleg Nesterov, Steven Rostedt,
	Masami Hiramatsu, Mark Rutland, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, oliver.yang, xueshuai, zhuo.song, jkchen,
	Marcos Paulo de Souza, linux-kernel, linux-trace-kernel,
	linux-perf-users, live-patching, linux-kselftest
In-Reply-To: <20260629072713.3273743-1-wanghan@linux.alibaba.com>

Reliable frame-pointer unwinding needs an explicit way to identify
exception boundaries and the final entry frame. The existing unwinder
infers those boundaries from return addresses, which is too loose for a
future reliable unwinder.

Add a small metadata frame record to pt_regs and initialize it on
exception entry, kernel stack overflow, kernel thread fork, user fork,
and early idle task setup. The record uses a zero {fp, ra} sentinel plus
a type field so a later unwinder can distinguish a final user-to-kernel
boundary from a nested kernel pt_regs boundary.

This follows the arm64 metadata frame-record model, adapted to the
RISC-V {fp, ra} frame record convention.
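
The sentinel-plus-type idea can be sketched in userspace as follows.
This is only an illustration: the numeric values given to
FRAME_META_TYPE_FINAL and FRAME_META_TYPE_PT_REGS, the enum boundary
type and the classify_record() helper are assumptions made for the
example, not definitions taken from this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of the {fp, ra} frame record (illustrative only). */
struct frame_record {
	uintptr_t fp;
	uintptr_t ra;
};

/* Placeholder values for this sketch; the real ones live in frame.h. */
enum {
	FRAME_META_TYPE_FINAL = 1,
	FRAME_META_TYPE_PT_REGS = 2,
};

/* A frame record followed by a type word, as embedded in pt_regs. */
struct frame_record_meta {
	struct frame_record record;
	uintptr_t type;
};

/* Hypothetical result type used only by this example. */
enum boundary {
	BOUNDARY_NONE,		/* ordinary frame record, keep unwinding */
	BOUNDARY_FINAL,		/* user-to-kernel entry, unwind is done */
	BOUNDARY_NESTED,	/* kernel-to-kernel pt_regs, resume from it */
	BOUNDARY_INVALID,	/* zero sentinel with an unknown type */
};

static enum boundary classify_record(const struct frame_record_meta *meta)
{
	assert(meta);

	/* A non-zero fp or ra means this is a normal frame record. */
	if (meta->record.fp || meta->record.ra)
		return BOUNDARY_NONE;

	/* Zero {fp, ra}: the type word says which boundary this is. */
	switch (meta->type) {
	case FRAME_META_TYPE_FINAL:
		return BOUNDARY_FINAL;
	case FRAME_META_TYPE_PT_REGS:
		return BOUNDARY_NESTED;
	default:
		return BOUNDARY_INVALID;
	}
}
```

Treating an unknown type as invalid, rather than guessing, matches the
goal of letting a later unwinder reject ambiguous states.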

The metadata is established at the RISC-V entry boundaries that need an
explicit unwind marker:

  * exception entry clears the metadata {fp, ra} pair and uses SPP
    (or MPP in M-mode) to record whether the pt_regs frame is the final
    user-to-kernel boundary or a nested kernel boundary;
  * the kernel stack overflow path builds a nested pt_regs metadata
    record on the overflow stack so an unwinder can resume from the
    pre-overflow s0 saved in PT_S0;
  * _start_kernel builds the init task's final metadata record, while
    the secondary CPU path sets up s0 before smp_callin() so idle-task
    unwinding does not inherit an undefined caller frame;
  * copy_thread creates matching final metadata records for new kernel
    and user tasks, and keeps s0 available for the frame-pointer chain.

Keep the embedded metadata-record field offsets distinct from the
s0-relative STACKFRAME_* offsets used by call_on_irq_stack(), because
the latter describe a frame record relative to s0 rather than to the
record base.
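
To make the distinction concrete, the following stand-alone sketch
recomputes both kinds of offset in user space. The struct definitions
mirror the ones used in this series; the *_OFF macro names and
same_slot() are made up for the example and only stand in for the
constants that asm-offsets.c generates.

```c
#include <stddef.h>

struct stackframe {
	unsigned long fp;
	unsigned long ra;
};

struct frame_record {
	unsigned long fp;
	unsigned long ra;
};

/* s0-relative: s0 points just above the record, so offsets are negative. */
#define S0_REL_FP_OFF \
	((long)offsetof(struct stackframe, fp) - (long)sizeof(struct stackframe))
#define S0_REL_RA_OFF \
	((long)offsetof(struct stackframe, ra) - (long)sizeof(struct stackframe))

/* Record-base-relative: offsets start at zero. */
#define REC_FP_OFF	((long)offsetof(struct frame_record, fp))
#define REC_RA_OFF	((long)offsetof(struct frame_record, ra))

/* Returns 1 if both addressing schemes name the same fp and ra slots. */
static int same_slot(const struct frame_record *rec)
{
	const char *base = (const char *)rec;
	const char *s0 = base + sizeof(*rec);	/* s0 sits above the record */

	return (s0 + S0_REL_FP_OFF == base + REC_FP_OFF) &&
	       (s0 + S0_REL_RA_OFF == base + REC_RA_OFF);
}
```

Both schemes reach the same memory, but the constants differ by the size
of one record, so mixing them up would address the wrong slot.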

These changes keep s0 reserved for the frame-pointer chain at task and
exception boundaries.

Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 arch/riscv/include/asm/ptrace.h           |  9 ++++
 arch/riscv/include/asm/stacktrace/frame.h | 53 +++++++++++++++++++++++
 arch/riscv/kernel/asm-offsets.c           |  6 +++
 arch/riscv/kernel/entry.S                 | 39 ++++++++++++++++-
 arch/riscv/kernel/head.S                  | 23 ++++++++++
 arch/riscv/kernel/process.c               | 33 +++++++++++++-
 6 files changed, 159 insertions(+), 4 deletions(-)
 create mode 100644 arch/riscv/include/asm/stacktrace/frame.h

diff --git a/arch/riscv/include/asm/ptrace.h b/arch/riscv/include/asm/ptrace.h
index addc8188152f..4b9b0f279214 100644
--- a/arch/riscv/include/asm/ptrace.h
+++ b/arch/riscv/include/asm/ptrace.h
@@ -8,6 +8,7 @@
 
 #include <uapi/asm/ptrace.h>
 #include <asm/csr.h>
+#include <asm/stacktrace/frame.h>
 #include <linux/compiler.h>
 
 #ifndef __ASSEMBLER__
@@ -53,6 +54,14 @@ struct pt_regs {
 	unsigned long cause;
 	/* a0 value before the syscall */
 	unsigned long orig_a0;
+
+	/*
+	 * This frame record is entirely zeroed on exception entry, allowing the
+	 * unwinder to identify exception boundaries. The type field encodes
+	 * whether the exception was taken from user (FINAL) or kernel (PT_REGS)
+	 * mode.
+	 */
+	struct frame_record_meta stackframe;
 };
 
 #define PTRACE_SYSEMU			0x1f
diff --git a/arch/riscv/include/asm/stacktrace/frame.h b/arch/riscv/include/asm/stacktrace/frame.h
new file mode 100644
index 000000000000..5720a6c65fe8
--- /dev/null
+++ b/arch/riscv/include/asm/stacktrace/frame.h
@@ -0,0 +1,53 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __ASM_RISCV_STACKTRACE_FRAME_H
+#define __ASM_RISCV_STACKTRACE_FRAME_H
+
+/*
+ * See: arch/arm64/include/asm/stacktrace/frame.h for the reference
+ * implementation.
+ */
+
+/*
+ * - FRAME_META_TYPE_NONE
+ *
+ *   This value is reserved.
+ *
+ * - FRAME_META_TYPE_FINAL
+ *
+ *   The record is the last entry on the stack.
+ *   Unwinding should terminate successfully.
+ *
+ * - FRAME_META_TYPE_PT_REGS
+ *
+ *   The record is embedded within a struct pt_regs, recording the registers at
+ *   an arbitrary point in time.
+ *   Unwinding should consume pt_regs::epc, followed by pt_regs::ra.
+ *
+ * Note: all other values are reserved and should result in unwinding
+ * terminating with an error.
+ */
+#define FRAME_META_TYPE_NONE		0
+#define FRAME_META_TYPE_FINAL		1
+#define FRAME_META_TYPE_PT_REGS		2
+
+#ifndef __ASSEMBLER__
+/*
+ * A standard RISC-V frame record.
+ */
+struct frame_record {
+	unsigned long fp;
+	unsigned long ra;
+};
+
+/*
+ * A metadata frame record indicating a special unwind.
+ * The record::{fp,ra} fields must be zero to indicate the presence of
+ * metadata.
+ */
+struct frame_record_meta {
+	struct frame_record record;
+	unsigned long type;
+};
+#endif /* __ASSEMBLER__ */
+
+#endif /* __ASM_RISCV_STACKTRACE_FRAME_H */
diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
index a75f0cfea1e9..bc8e8cd7130a 100644
--- a/arch/riscv/kernel/asm-offsets.c
+++ b/arch/riscv/kernel/asm-offsets.c
@@ -131,6 +131,9 @@ void asm_offsets(void)
 	OFFSET(PT_BADADDR, pt_regs, badaddr);
 	OFFSET(PT_CAUSE, pt_regs, cause);
 
+	DEFINE(S_STACKFRAME,		offsetof(struct pt_regs, stackframe));
+	DEFINE(S_STACKFRAME_TYPE,	offsetof(struct pt_regs, stackframe.type));
+
 	OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
 
 	OFFSET(HIBERN_PBE_ADDR, pbe, address);
@@ -503,6 +506,9 @@ void asm_offsets(void)
 	DEFINE(STACKFRAME_SIZE_ON_STACK, ALIGN(sizeof(struct stackframe), STACK_ALIGN));
 	DEFINE(STACKFRAME_FP, offsetof(struct stackframe, fp) - sizeof(struct stackframe));
 	DEFINE(STACKFRAME_RA, offsetof(struct stackframe, ra) - sizeof(struct stackframe));
+	DEFINE(STACKFRAME_RECORD_SIZE, sizeof(struct stackframe));
+	OFFSET(FRAME_RECORD_FP, frame_record, fp);
+	OFFSET(FRAME_RECORD_RA, frame_record, ra);
 #ifdef CONFIG_FUNCTION_TRACER
 	DEFINE(FTRACE_OPS_FUNC,		offsetof(struct ftrace_ops, func));
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
index 08df724e13b9..d1cfb28f9180 100644
--- a/arch/riscv/kernel/entry.S
+++ b/arch/riscv/kernel/entry.S
@@ -11,6 +11,7 @@
 #include <asm/asm.h>
 #include <asm/csr.h>
 #include <asm/scs.h>
+#include <asm/stacktrace/frame.h>
 #include <asm/unistd.h>
 #include <asm/page.h>
 #include <asm/thread_info.h>
@@ -198,6 +199,27 @@ SYM_CODE_START(handle_exception)
 	REG_S s4, PT_CAUSE(sp)
 	REG_S s5, PT_TP(sp)
 
+	/*
+	 * Create a metadata frame record. The unwinder will use this to
+	 * identify and unwind exception boundaries.
+	 */
+	REG_S zero, (S_STACKFRAME + FRAME_RECORD_FP)(sp) /* stackframe.record.fp = 0 */
+	REG_S zero, (S_STACKFRAME + FRAME_RECORD_RA)(sp) /* stackframe.record.ra = 0 */
+#ifdef CONFIG_RISCV_M_MODE
+	li t0, SR_MPP
+	and t0, s1, t0
+#else
+	andi t0, s1, SR_SPP
+#endif
+	bnez t0, 1f
+	li t0, FRAME_META_TYPE_FINAL
+	j 2f
+1:
+	li t0, FRAME_META_TYPE_PT_REGS
+2:
+	REG_S t0, S_STACKFRAME_TYPE(sp)
+	addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
+
 	/*
 	 * Set the scratch register to 0, so that if a recursive exception
 	 * occurs, the exception vector knows it came from the kernel
@@ -354,6 +376,19 @@ SYM_CODE_START_LOCAL(handle_kernel_stack_overflow)
 	REG_S s3, PT_BADADDR(sp)
 	REG_S s4, PT_CAUSE(sp)
 	REG_S s5, PT_TP(sp)
+
+	/*
+	 * Create a metadata frame record for the overflow pt_regs. The
+	 * overflow path is entered from kernel context, so this is a nested
+	 * pt_regs boundary and the unwinder can resume from the pre-overflow
+	 * frame pointer saved in PT_S0.
+	 */
+	REG_S zero, (S_STACKFRAME + FRAME_RECORD_FP)(sp)
+	REG_S zero, (S_STACKFRAME + FRAME_RECORD_RA)(sp)
+	li t0, FRAME_META_TYPE_PT_REGS
+	REG_S t0, S_STACKFRAME_TYPE(sp)
+	addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
+
 	move a0, sp
 	tail handle_bad_stack
 SYM_CODE_END(handle_kernel_stack_overflow)
@@ -362,8 +397,8 @@ ASM_NOKPROBE(handle_kernel_stack_overflow)
 
 SYM_CODE_START(ret_from_fork_kernel_asm)
 	call schedule_tail
-	move a0, s1 /* fn_arg */
-	move a1, s0 /* fn */
+	move a0, s3 /* fn_arg */
+	move a1, s2 /* fn */
 	move a2, sp /* pt_regs */
 	call ret_from_fork_kernel
 	j ret_from_exception
diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index f6a8ca49e627..341b2d3facbc 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -14,6 +14,7 @@
 #include <asm/hwcap.h>
 #include <asm/image.h>
 #include <asm/scs.h>
+#include <asm/stacktrace/frame.h>
 #include <asm/usercfi.h>
 #include "efi-header.S"
 
@@ -177,6 +178,14 @@ secondary_start_sbi:
 	REG_S a0, (a1)
 1:
 #endif
+
+	/*
+	 * Set up the frame pointer for the secondary idle task so reliable
+	 * stack unwinding terminates at the metadata frame in task_pt_regs().
+	 * Without this, the first frame records can inherit an undefined caller
+	 * fp and unwind past smp_callin() into .Lsecondary_park.
+	 */
+	addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
 	scs_load_current
 	call smp_callin
 #endif /* CONFIG_SMP */
@@ -305,6 +314,20 @@ SYM_CODE_START(_start_kernel)
 	la tp, init_task
 	la sp, init_thread_union + THREAD_SIZE
 	addi sp, sp, -PT_SIZE_ON_STACK
+
+	/*
+	 * Set up a metadata frame record for the init task so that
+	 * the unwinder can identify the outermost frame by its
+	 * {fp, ra} = {0, 0} sentinel at the bottom of pt_regs.
+	 * fp/s0 points above the metadata record (RISC-V
+	 * convention).
+	 */
+	REG_S zero, (S_STACKFRAME + FRAME_RECORD_FP)(sp)
+	REG_S zero, (S_STACKFRAME + FRAME_RECORD_RA)(sp)
+	li t0, FRAME_META_TYPE_FINAL
+	REG_S t0, S_STACKFRAME_TYPE(sp)
+	addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
+
 #if defined(CONFIG_RISCV_SBI) && defined(CONFIG_RISCV_USER_CFI)
 	li a7, SBI_EXT_FWFT
 	li a6, SBI_EXT_FWFT_SET
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index b2df7f72241a..0dc90bf7a652 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -258,8 +258,23 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 		/* Supervisor/Machine, irqs on: */
 		childregs->status = SR_PP | SR_PIE;
 
-		p->thread.s[0] = (unsigned long)args->fn;
-		p->thread.s[1] = (unsigned long)args->fn_arg;
+		/*
+		 * Set up a metadata frame record at the bottom of the
+		 * stack for the unwinder. Use FRAME_META_TYPE_FINAL
+		 * since this is the outermost kernel entry for the new
+		 * task. The frame_record::{fp,ra} are already zero from
+		 * memset().
+		 *
+		 * fp/s0 points above the metadata record (RISC-V
+		 * convention). fn and fn_arg are passed via s2/s3,
+		 * keeping s0 available for the frame pointer chain.
+		 */
+		childregs->stackframe.type = FRAME_META_TYPE_FINAL;
+
+		p->thread.s[0] = (unsigned long)(&childregs->stackframe)
+				+ sizeof(struct frame_record);
+		p->thread.s[2] = (unsigned long)args->fn;
+		p->thread.s[3] = (unsigned long)args->fn_arg;
 		p->thread.ra = (unsigned long)ret_from_fork_kernel_asm;
 	} else {
 		/* allocate new shadow stack if needed. In case of CLONE_VM we have to */
@@ -278,6 +293,20 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 		if (clone_flags & CLONE_SETTLS)
 			childregs->tp = tls;
 		childregs->a0 = 0; /* Return value of fork() */
+
+		/*
+		 * Set up the unwind boundary: ensure the metadata
+		 * frame record has its {fp,ra} sentinel zeroed and
+		 * point fp/s0 above the metadata record. Mark it as
+		 * FINAL since this is the outermost kernel entry for
+		 * the new task.
+		 */
+		childregs->stackframe.record.fp = 0;
+		childregs->stackframe.record.ra = 0;
+		childregs->stackframe.type = FRAME_META_TYPE_FINAL;
+		p->thread.s[0] = (unsigned long)(&childregs->stackframe)
+				+ sizeof(struct frame_record);
+
 		p->thread.ra = (unsigned long)ret_from_fork_user_asm;
 	}
 	p->thread.riscv_v_flags = 0;
-- 
2.43.0

^ permalink raw reply related

* [PATCH v4 RESEND 2/7] riscv: stacktrace: disable KASAN and KCOV instrumentation for stacktrace.o
From: Wang Han @ 2026-06-29  7:27 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Alexandre Ghiti, linux-riscv, Oleg Nesterov, Steven Rostedt,
	Masami Hiramatsu, Mark Rutland, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, oliver.yang, xueshuai, zhuo.song, jkchen,
	Marcos Paulo de Souza, linux-kernel, linux-trace-kernel,
	linux-perf-users, live-patching, linux-kselftest
In-Reply-To: <20260629072713.3273743-1-wanghan@linux.alibaba.com>

KASAN records stack traces for every alloc/free, which means it invokes
the unwinder very frequently. Instrumenting the stack trace collection
code itself adds substantial overhead and makes the traces themselves
noisier.

KCOV instruments every basic-block edge. The unwinder is a hot path,
especially with KASAN enabled, so KCOV instrumentation has the same kind
of cost and noise problem here.

Mark stacktrace.o as not KASAN- or KCOV-instrumented, matching the x86
treatment of its stack unwinding code. RISC-V keeps the relevant unwinder
code in stacktrace.o, so a single translation-unit annotation covers the
equivalent scope. This prepares for the upcoming reliable unwinder, but
the change is valid on its own.

Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 arch/riscv/kernel/Makefile | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index cabb99cadfb6..c565a72a36f3 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -44,6 +44,12 @@ CFLAGS_REMOVE_return_address.o	= $(CC_FLAGS_FTRACE)
 CFLAGS_REMOVE_sbi_ecall.o = $(CC_FLAGS_FTRACE)
 endif

+# When KASAN is enabled, a stack trace is recorded for every alloc/free, which
+# can significantly impact performance. Avoid instrumenting the stack trace
+# collection code to minimize this impact.
+KASAN_SANITIZE_stacktrace.o := n
+KCOV_INSTRUMENT_stacktrace.o := n
+
 always-$(KBUILD_BUILTIN) += vmlinux.lds

 obj-y	+= head.o
-- 
2.43.0

^ permalink raw reply related

* [PATCH v4 RESEND 4/7] riscv: stacktrace: introduce stack-bound tracking helpers
From: Wang Han @ 2026-06-29  7:27 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Alexandre Ghiti, linux-riscv, Oleg Nesterov, Steven Rostedt,
	Masami Hiramatsu, Mark Rutland, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, oliver.yang, xueshuai, zhuo.song, jkchen,
	Marcos Paulo de Souza, linux-kernel, linux-trace-kernel,
	linux-perf-users, live-patching, linux-kselftest
In-Reply-To: <20260629072713.3273743-1-wanghan@linux.alibaba.com>

A reliable unwinder needs to validate that every frame record it reads
is fully contained in a known kernel stack, and it needs to refuse to
walk back into a stack it has already left. Add the building blocks
for that:

  * struct stack_info / struct unwind_state in a new
    asm/stacktrace/common.h, modelled on the arm64 reference
    implementation.
  * stackinfo_get_irq() / stackinfo_get_task() / stackinfo_get_overflow()
    plus the corresponding on_*_stack() predicates in asm/stacktrace.h,
    so callers can ask "is this object on stack X?" by stack kind
    rather than open-coded address arithmetic.
  * unwind_init_common(), unwind_find_stack() and
    unwind_consume_stack() helpers that enforce the
    forward-progress-only invariant required for reliability.
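
The forward-progress rule can be tried out in user space with the small
sketch below. stackinfo_on_stack() follows the helper added by this
patch; consume() is a hypothetical single-stack simplification of
unwind_consume_stack() that leaves out the stack-transition handling.

```c
#include <stdbool.h>

struct stack_info {
	unsigned long low;
	unsigned long high;
};

/* True only if [sp, sp + size) lies fully inside a known stack. */
static bool stackinfo_on_stack(const struct stack_info *info,
			       unsigned long sp, unsigned long size)
{
	if (!info->low)
		return false;	/* unknown stack */

	if (sp < info->low || sp + size < sp || sp + size > info->high)
		return false;	/* below, wrapping, or past the top */

	return true;
}

/*
 * Hypothetical simplification: after reading an object at sp, raise the
 * lower bound so the same object (or anything below it) is rejected.
 */
static void consume(struct stack_info *stack,
		    unsigned long sp, unsigned long size)
{
	stack->low = sp + size;
}
```

After consume(), a second lookup of the same record fails, which is what
keeps a corrupted frame-pointer chain from looping.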

No existing user is wired up to these helpers in this commit; the
unwinder switch comes in a follow-up. The header changes leave
on_thread_stack() with the same semantics as before, just expressed in
terms of the new helpers.

Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 arch/riscv/include/asm/stacktrace.h        |  65 ++++++++-
 arch/riscv/include/asm/stacktrace/common.h | 159 +++++++++++++++++++++
 2 files changed, 222 insertions(+), 2 deletions(-)
 create mode 100644 arch/riscv/include/asm/stacktrace/common.h

diff --git a/arch/riscv/include/asm/stacktrace.h b/arch/riscv/include/asm/stacktrace.h
index b1495a7e06ce..bc87c4940379 100644
--- a/arch/riscv/include/asm/stacktrace.h
+++ b/arch/riscv/include/asm/stacktrace.h
@@ -3,8 +3,13 @@
 #ifndef _ASM_RISCV_STACKTRACE_H
 #define _ASM_RISCV_STACKTRACE_H
 
+#include <linux/percpu.h>
 #include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+
+#include <asm/irq_stack.h>
 #include <asm/ptrace.h>
+#include <asm/stacktrace/common.h>
 
 struct stackframe {
 	unsigned long fp;
@@ -16,14 +21,70 @@ extern void notrace walk_stackframe(struct task_struct *task, struct pt_regs *re
 extern void dump_backtrace(struct pt_regs *regs, struct task_struct *task,
 			   const char *loglvl);
 
-static inline bool on_thread_stack(void)
+/*
+ * IRQ stack accessors
+ */
+static inline struct stack_info stackinfo_get_irq(void)
+{
+	unsigned long low = (unsigned long)raw_cpu_read(irq_stack_ptr);
+	unsigned long high = low + IRQ_STACK_SIZE;
+
+	return (struct stack_info) {
+		.low = low,
+		.high = high,
+	};
+}
+
+static inline bool on_irq_stack(unsigned long sp, unsigned long size)
+{
+	struct stack_info info = stackinfo_get_irq();
+
+	return stackinfo_on_stack(&info, sp, size);
+}
+
+/*
+ * Task stack accessors
+ */
+static inline struct stack_info stackinfo_get_task(const struct task_struct *tsk)
 {
-	return !(((unsigned long)(current->stack) ^ current_stack_pointer) & ~(THREAD_SIZE - 1));
+	unsigned long low = (unsigned long)task_stack_page(tsk);
+	unsigned long high = low + THREAD_SIZE;
+
+	return (struct stack_info) {
+		.low = low,
+		.high = high,
+	};
+}
+
+static inline bool on_task_stack(const struct task_struct *tsk,
+				 unsigned long sp, unsigned long size)
+{
+	struct stack_info info = stackinfo_get_task(tsk);
+
+	return stackinfo_on_stack(&info, sp, size);
 }
 
+/*
+ * True if the current stack pointer is on the current task's stack.
+ */
+#define on_thread_stack()	(on_task_stack(current, current_stack_pointer, 1))
 
+/*
+ * Overflow stack accessors
+ */
 #ifdef CONFIG_VMAP_STACK
 DECLARE_PER_CPU(unsigned long [OVERFLOW_STACK_SIZE/sizeof(long)], overflow_stack);
+
+static inline struct stack_info stackinfo_get_overflow(void)
+{
+	unsigned long low = (unsigned long)raw_cpu_ptr(overflow_stack);
+	unsigned long high = low + OVERFLOW_STACK_SIZE;
+
+	return (struct stack_info) {
+		.low = low,
+		.high = high,
+	};
+}
 #endif /* CONFIG_VMAP_STACK */
 
 #endif /* _ASM_RISCV_STACKTRACE_H */
diff --git a/arch/riscv/include/asm/stacktrace/common.h b/arch/riscv/include/asm/stacktrace/common.h
new file mode 100644
index 000000000000..360a26e34349
--- /dev/null
+++ b/arch/riscv/include/asm/stacktrace/common.h
@@ -0,0 +1,159 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * RISC-V common stack unwinder types and helpers.
+ *
+ * See: arch/arm64/include/asm/stacktrace/common.h for the reference
+ * implementation.
+ *
+ * Copyright (C) 2026
+ */
+#ifndef __ASM_RISCV_STACKTRACE_COMMON_H
+#define __ASM_RISCV_STACKTRACE_COMMON_H
+
+#include <linux/compiler.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+
+#include <asm/stacktrace/frame.h>
+
+/**
+ * struct stack_info - describes the bounds of a stack.
+ *
+ * @low:  The lowest valid address on the stack.
+ * @high: The highest valid address on the stack.
+ */
+struct stack_info {
+	unsigned long low;
+	unsigned long high;
+};
+
+/**
+ * struct unwind_state - state used for robust unwinding.
+ *
+ * @fp:        The fp value in the frame record (or the real fp).
+ * @pc:        The ra value in the frame record (or the real ra).
+ *
+ * @stack:     The stack currently being unwound.
+ * @stacks:    An array of stacks which can be unwound.
+ * @nr_stacks: The number of stacks in @stacks.
+ */
+struct unwind_state {
+	unsigned long fp;
+	unsigned long pc;
+
+	struct stack_info stack;
+	struct stack_info *stacks;
+	int nr_stacks;
+};
+
+/**
+ * stackinfo_get_unknown() - Get an unknown stack_info.
+ *
+ * Return: a stack_info with low and high set to 0.
+ */
+static inline struct stack_info stackinfo_get_unknown(void)
+{
+	return (struct stack_info) {
+		.low = 0,
+		.high = 0,
+	};
+}
+
+/**
+ * stackinfo_on_stack() - Check whether an object is fully within a stack.
+ *
+ * @info: The stack to check against.
+ * @sp:   The base address of the object.
+ * @size: The size of the object.
+ *
+ * Return: true if the object is fully contained within the stack.
+ */
+static inline bool stackinfo_on_stack(const struct stack_info *info,
+				      unsigned long sp, unsigned long size)
+{
+	if (!info->low)
+		return false;
+
+	if (sp < info->low || sp + size < sp || sp + size > info->high)
+		return false;
+
+	return true;
+}
+
+/**
+ * unwind_init_common() - Initialize the common parts of the unwind state.
+ *
+ * @state: the unwind state to initialize.
+ */
+static inline void unwind_init_common(struct unwind_state *state)
+{
+	state->stack = stackinfo_get_unknown();
+}
+
+/**
+ * unwind_find_stack() - Find the accessible stack which entirely contains an
+ * object.
+ *
+ * @state: the current unwind state.
+ * @sp:    the base address of the object.
+ * @size:  the size of the object.
+ *
+ * Return: a pointer to the relevant stack_info if found; NULL otherwise.
+ */
+static inline struct stack_info *unwind_find_stack(struct unwind_state *state,
+						   unsigned long sp,
+						   unsigned long size)
+{
+	struct stack_info *info = &state->stack;
+
+	if (stackinfo_on_stack(info, sp, size))
+		return info;
+
+	for (int i = 0; i < state->nr_stacks; i++) {
+		info = &state->stacks[i];
+		if (stackinfo_on_stack(info, sp, size))
+			return info;
+	}
+
+	return NULL;
+}
+
+/**
+ * unwind_consume_stack() - Update stack boundaries so that future unwind steps
+ * cannot consume this object again.
+ *
+ * @state: the current unwind state.
+ * @info:  the stack_info of the stack containing the object.
+ * @sp:    the base address of the object.
+ * @size:  the size of the object.
+ *
+ * Stack transitions are strictly one-way, and once we've
+ * transitioned from one stack to another, it's never valid to
+ * unwind back to the old stack.
+ *
+ * Note that stacks can nest in several valid orders, e.g.
+ *
+ *   TASK -> IRQ -> OVERFLOW
+ *
+ * ... so we do not check the specific order of stack
+ * transitions.
+ */
+static inline void unwind_consume_stack(struct unwind_state *state,
+					struct stack_info *info,
+					unsigned long sp,
+					unsigned long size)
+{
+	struct stack_info tmp;
+
+	tmp = *info;
+	*info = stackinfo_get_unknown();
+	state->stack = tmp;
+
+	/*
+	 * Future unwind steps can only consume stack above this frame record.
+	 * Update the current stack to start immediately above it.
+	 */
+	state->stack.low = sp + size;
+}
+
+#endif /* __ASM_RISCV_STACKTRACE_COMMON_H */
-- 
2.43.0

^ permalink raw reply related

* [PATCH v4 RESEND 5/7] riscv: stacktrace: switch to frame-pointer based unwinder
From: Wang Han @ 2026-06-29  7:27 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Alexandre Ghiti, linux-riscv, Oleg Nesterov, Steven Rostedt,
	Masami Hiramatsu, Mark Rutland, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, oliver.yang, xueshuai, zhuo.song, jkchen,
	Marcos Paulo de Souza, linux-kernel, linux-trace-kernel,
	linux-perf-users, live-patching, linux-kselftest
In-Reply-To: <20260629072713.3273743-1-wanghan@linux.alibaba.com>

Replace the open-coded frame-pointer walker in arch_stack_walk() with a
robust kunwind state machine, modelled on arch/arm64/kernel/stacktrace.c
and retargeted to the RISC-V {fp, ra} frame record convention. The new
walker tracks stack bounds, consumes frame records monotonically,
understands the metadata pt_regs records added earlier in this series
by the frame record metadata patch, and recovers return addresses
replaced by function graph tracing and kretprobes.
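
The core of such a walker can be modelled in user space as shown below.
This is only a sketch under simplifying assumptions: a single stack, no
pt_regs resumption, and no fgraph or kretprobe recovery. The function
walk_records() and its parameters are hypothetical and do not appear in
the patch; the stack is treated as an array of words where fp points
just above an {fp, ra} pair and a metadata record keeps its type word
directly after that pair.

```c
#include <stdint.h>

#define FRAME_META_TYPE_FINAL	1
#define REC_SIZE		(2 * sizeof(uintptr_t))

/*
 * Walk {fp, ra} records inside [low, high). Stores each return address
 * in out[] and returns their count once a FINAL metadata record is
 * reached, or -1 if the chain is malformed.
 */
static int walk_records(uintptr_t fp, uintptr_t low, uintptr_t high,
			uintptr_t *out, int max)
{
	int n = 0;

	for (;;) {
		const uintptr_t *rec;

		/* The record below fp must lie fully inside the stack. */
		if (fp > high || fp < low || fp - low < REC_SIZE)
			return -1;
		if (fp % sizeof(uintptr_t))
			return -1;

		rec = (const uintptr_t *)(fp - REC_SIZE);

		/* Zero {fp, ra}: metadata record, type word follows it. */
		if (rec[0] == 0 && rec[1] == 0) {
			if (high - fp < sizeof(uintptr_t))
				return -1;
			return rec[2] == FRAME_META_TYPE_FINAL ? n : -1;
		}

		if (n == max)
			return -1;
		out[n++] = rec[1];

		/* Consume: never revisit this record or anything below. */
		low = fp;
		fp = rec[0];
	}
}
```

Raising low after every step is the monotonic consumption mentioned
above: a record whose saved fp points back down the stack is rejected
instead of being followed.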

This commit introduces arch_stack_walk_reliable() but does not yet
select HAVE_RELIABLE_STACKTRACE; that is done in a follow-up Kconfig
patch so this commit can be reviewed and bisected as a pure unwinder
replacement. Until that Kconfig change lands, livepatch is not enabled
and arch_stack_walk_reliable() has no in-tree caller.

Three related callers are updated to keep the same frame-record
assumptions everywhere:

  * Function graph tracing: the old RISC-V unwinder matched function
    graph return-stack entries by the saved return-address slot. That
    was consistent with the static mcount path, but not with the dynamic
    ftrace path where the parent slot is ftrace_regs::ra. Use the
    architectural frame pointer as the function graph return-address
    cookie, matching the kunwind walker.

  * Perf callchains: route kernel callchain collection through
    arch_stack_walk() so perf sees the same frame-pointer unwind
    behaviour as dump_stack() and the upcoming livepatch path.

  * dump_backtrace() / __get_wchan() / show_stack(): these now go
    through arch_stack_walk(); the explicit "Call Trace:" header is
    moved into dump_backtrace() to preserve the original output.

The non-frame-pointer fallback walker is kept untouched for
!CONFIG_FRAME_POINTER builds.

Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 arch/riscv/kernel/ftrace.c         |   6 +-
 arch/riscv/kernel/perf_callchain.c |   2 +-
 arch/riscv/kernel/stacktrace.c     | 559 ++++++++++++++++++++++++-----
 3 files changed, 471 insertions(+), 96 deletions(-)

diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c
index b430edfb83f4..5d55199a9230 100644
--- a/arch/riscv/kernel/ftrace.c
+++ b/arch/riscv/kernel/ftrace.c
@@ -242,7 +242,8 @@ void prepare_ftrace_return(unsigned long *parent, unsigned long self_addr,
 	 */
 	old = *parent;
 
-	if (!function_graph_enter(old, self_addr, frame_pointer, parent))
+	if (!function_graph_enter(old, self_addr, frame_pointer,
+				  (void *)frame_pointer))
 		*parent = return_hooker;
 }
 
@@ -264,7 +265,8 @@ void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
 	 */
 	old = *parent;
 
-	if (!function_graph_enter_regs(old, ip, frame_pointer, parent, fregs))
+	if (!function_graph_enter_regs(old, ip, frame_pointer,
+				       (void *)frame_pointer, fregs))
 		*parent = return_hooker;
 }
 #endif /* CONFIG_DYNAMIC_FTRACE */
diff --git a/arch/riscv/kernel/perf_callchain.c b/arch/riscv/kernel/perf_callchain.c
index b465bc9eb870..436af96ea59c 100644
--- a/arch/riscv/kernel/perf_callchain.c
+++ b/arch/riscv/kernel/perf_callchain.c
@@ -44,5 +44,5 @@ void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
 		return;
 	}
 
-	walk_stackframe(NULL, regs, fill_callchain, entry);
+	arch_stack_walk(fill_callchain, entry, NULL, regs);
 }
diff --git a/arch/riscv/kernel/stacktrace.c b/arch/riscv/kernel/stacktrace.c
index c7555447149b..c43bf9a84207 100644
--- a/arch/riscv/kernel/stacktrace.c
+++ b/arch/riscv/kernel/stacktrace.c
@@ -11,98 +11,16 @@
 #include <linux/sched/task_stack.h>
 #include <linux/stacktrace.h>
 #include <linux/ftrace.h>
+#include <linux/kprobes.h>
+#include <linux/llist.h>
 
 #include <asm/stacktrace.h>
 
-#ifdef CONFIG_FRAME_POINTER
-
 /*
- * This disables KASAN checking when reading a value from another task's stack,
- * since the other task could be running on another CPU and could have poisoned
- * the stack in the meantime.
+ * Non-frame-pointer fallback unwinder.
+ * Only compiled when CONFIG_FRAME_POINTER is not enabled.
  */
-#define READ_ONCE_TASK_STACK(task, x)			\
-({							\
-	unsigned long val;				\
-	unsigned long addr = x;				\
-	if ((task) == current)				\
-		val = READ_ONCE(addr);			\
-	else						\
-		val = READ_ONCE_NOCHECK(addr);		\
-	val;						\
-})
-
-extern asmlinkage void handle_exception(void);
-extern unsigned long ret_from_exception_end;
-
-static inline int fp_is_valid(unsigned long fp, unsigned long sp)
-{
-	unsigned long low, high;
-
-	low = sp + sizeof(struct stackframe);
-	high = ALIGN(sp, THREAD_SIZE);
-
-	return !(fp < low || fp > high || fp & 0x07);
-}
-
-void notrace walk_stackframe(struct task_struct *task, struct pt_regs *regs,
-			     bool (*fn)(void *, unsigned long), void *arg)
-{
-	unsigned long fp, sp, pc;
-	int graph_idx = 0;
-	int level = 0;
-
-	if (regs) {
-		fp = frame_pointer(regs);
-		sp = user_stack_pointer(regs);
-		pc = instruction_pointer(regs);
-	} else if (task == NULL || task == current) {
-		fp = (unsigned long)__builtin_frame_address(0);
-		sp = current_stack_pointer;
-		pc = (unsigned long)walk_stackframe;
-		level = -1;
-	} else {
-		/* task blocked in __switch_to */
-		fp = task->thread.s[0];
-		sp = task->thread.sp;
-		pc = task->thread.ra;
-	}
-
-	for (;;) {
-		struct stackframe *frame;
-
-		if (unlikely(!__kernel_text_address(pc) || (level++ >= 0 && !fn(arg, pc))))
-			break;
-
-		if (unlikely(!fp_is_valid(fp, sp)))
-			break;
-
-		/* Unwind stack frame */
-		frame = (struct stackframe *)fp - 1;
-		sp = fp;
-		if (regs && (regs->epc == pc) && fp_is_valid(frame->ra, sp)) {
-			/* We hit function where ra is not saved on the stack */
-			fp = frame->ra;
-			pc = regs->ra;
-		} else {
-			fp = READ_ONCE_TASK_STACK(task, frame->fp);
-			pc = READ_ONCE_TASK_STACK(task, frame->ra);
-			pc = ftrace_graph_ret_addr(task, &graph_idx, pc,
-						   &frame->ra);
-			if (pc >= (unsigned long)handle_exception &&
-			    pc < (unsigned long)&ret_from_exception_end) {
-				if (unlikely(!fn(arg, pc)))
-					break;
-
-				pc = ((struct pt_regs *)sp)->epc;
-				fp = ((struct pt_regs *)sp)->s0;
-			}
-		}
-
-	}
-}
-
-#else /* !CONFIG_FRAME_POINTER */
+#ifndef CONFIG_FRAME_POINTER
 
 void notrace walk_stackframe(struct task_struct *task,
 	struct pt_regs *regs, bool (*fn)(void *, unsigned long), void *arg)
@@ -133,7 +51,12 @@ void notrace walk_stackframe(struct task_struct *task,
 	}
 }
 
-#endif /* CONFIG_FRAME_POINTER */
+#endif /* !CONFIG_FRAME_POINTER */
+
+/*
+ * Common trace helpers.
+ * These are used by both the FP (kunwind) and non-FP (walk_stackframe) paths.
+ */
 
 static bool print_trace_address(void *arg, unsigned long pc)
 {
@@ -146,12 +69,12 @@ static bool print_trace_address(void *arg, unsigned long pc)
 noinline void dump_backtrace(struct pt_regs *regs, struct task_struct *task,
 		    const char *loglvl)
 {
-	walk_stackframe(task, regs, print_trace_address, (void *)loglvl);
+	printk("%sCall Trace:\n", loglvl);
+	arch_stack_walk(print_trace_address, (void *)loglvl, task, regs);
 }
 
 void show_stack(struct task_struct *task, unsigned long *sp, const char *loglvl)
 {
-	pr_cont("%sCall Trace:\n", loglvl);
 	dump_backtrace(NULL, task, loglvl);
 }
 
@@ -171,17 +94,467 @@ unsigned long __get_wchan(struct task_struct *task)
 
 	if (!try_get_task_stack(task))
 		return 0;
-	walk_stackframe(task, NULL, save_wchan, &pc);
+	arch_stack_walk(save_wchan, &pc, task, NULL);
 	put_task_stack(task);
 	return pc;
 }
 
-noinline noinstr void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
-		     struct task_struct *task, struct pt_regs *regs)
+/*
+ * Frame-pointer-based kernel unwind infrastructure.
+ * Only compiled when CONFIG_FRAME_POINTER is enabled.
+ *
+ * See: arch/arm64/kernel/stacktrace.c for the reference implementation.
+ */
+#ifdef CONFIG_FRAME_POINTER
+
+/*
+ * Per-cpu stacks are only accessible when unwinding the current task in a
+ * non-preemptible context.
+ */
+#define STACKINFO_CPU(task, name)				\
+	({							\
+		(((task) == current) && !preemptible())		\
+			? stackinfo_get_##name()		\
+			: stackinfo_get_unknown();		\
+	})
+
+enum kunwind_source {
+	KUNWIND_SOURCE_UNKNOWN,
+	KUNWIND_SOURCE_FRAME,
+	KUNWIND_SOURCE_CALLER,
+	KUNWIND_SOURCE_TASK,
+	KUNWIND_SOURCE_REGS_PC,
+};
+
+union unwind_flags {
+	unsigned long	all;
+	struct {
+		unsigned long	fgraph : 1,
+				kretprobe : 1;
+	};
+};
+
+/*
+ * Kernel unwind state
+ *
+ * @common:    Common unwind state.
+ * @task:      The task being unwound.
+ * @graph_idx: Used by ftrace_graph_ret_addr() for optimized stack unwinding.
+ * @kr_cur:    When KRETPROBES is selected, holds the kretprobe instance
+ *             associated with the most recently encountered replacement ra
+ *             value.
+ */
+struct kunwind_state {
+	struct unwind_state common;
+	struct task_struct *task;
+	int graph_idx;
+#ifdef CONFIG_KRETPROBES
+	struct llist_node *kr_cur;
+#endif
+	enum kunwind_source source;
+	union unwind_flags flags;
+	struct pt_regs *regs;
+};
+
+static __always_inline void
+kunwind_init(struct kunwind_state *state,
+	     struct task_struct *task)
+{
+	unwind_init_common(&state->common);
+	state->task = task;
+	state->source = KUNWIND_SOURCE_UNKNOWN;
+	state->flags.all = 0;
+	state->regs = NULL;
+}
+
+/*
+ * Start an unwind from a pt_regs.
+ *
+ * The unwind will begin at the PC within the regs.
+ *
+ * The regs must be on a stack currently owned by the calling task.
+ */
+static __always_inline void
+kunwind_init_from_regs(struct kunwind_state *state,
+		       struct pt_regs *regs)
+{
+	kunwind_init(state, current);
+
+	state->regs = regs;
+	state->common.fp = frame_pointer(regs);
+	state->common.pc = instruction_pointer(regs);
+	state->source = KUNWIND_SOURCE_REGS_PC;
+}
+
+/*
+ * Start an unwind from a caller.
+ *
+ * The unwind will begin at the caller of whichever function this is inlined
+ * into.
+ *
+ * The function which invokes this must be noinline.
+ */
+static __always_inline void
+kunwind_init_from_caller(struct kunwind_state *state)
+{
+	unsigned long fp = (unsigned long)__builtin_frame_address(0);
+	struct frame_record *record = (struct frame_record *)fp - 1;
+
+	kunwind_init(state, current);
+
+	state->common.fp = READ_ONCE(record->fp);
+	state->common.pc = READ_ONCE(record->ra);
+	state->source = KUNWIND_SOURCE_CALLER;
+}
+
+/*
+ * Start an unwind from a blocked task.
+ *
+ * The unwind will begin at the blocked task's saved PC (i.e. the caller of
+ * __switch_to).
+ *
+ * The caller should ensure the task is blocked in __switch_to for the
+ * duration of the unwind, or the unwind will be bogus. It is never valid to
+ * call this for the current task.
+ */
+static __always_inline void
+kunwind_init_from_task(struct kunwind_state *state,
+		       struct task_struct *task)
+{
+	kunwind_init(state, task);
+
+	state->common.fp = task->thread.s[0];
+	state->common.pc = task->thread.ra;
+	state->source = KUNWIND_SOURCE_TASK;
+}
+
+static __always_inline int
+kunwind_recover_return_address(struct kunwind_state *state)
+{
+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
+	if (state->task->ret_stack &&
+	    state->common.pc == (unsigned long)return_to_handler) {
+		unsigned long orig_pc;
+
+		orig_pc = ftrace_graph_ret_addr(state->task, &state->graph_idx,
+						state->common.pc,
+						(void *)state->common.fp);
+		if (state->common.pc == orig_pc) {
+			WARN_ON_ONCE(state->task == current);
+			return -EINVAL;
+		}
+		state->common.pc = orig_pc;
+		state->flags.fgraph = 1;
+	}
+#endif /* CONFIG_FUNCTION_GRAPH_TRACER */
+
+#ifdef CONFIG_KRETPROBES
+	if (is_kretprobe_trampoline(state->common.pc)) {
+		unsigned long orig_pc;
+
+		orig_pc = kretprobe_find_ret_addr(state->task,
+						  (void *)state->common.fp,
+						  &state->kr_cur);
+		if (!orig_pc)
+			return -EINVAL;
+		state->common.pc = orig_pc;
+		state->flags.kretprobe = 1;
+	}
+#endif /* CONFIG_KRETPROBES */
+
+	return 0;
+}
+
+/*
+ * When we reach an exception boundary marked by a metadata frame record,
+ * extract pt_regs from the stack and continue unwinding from the saved
+ * context (epc and s0/fp).
+ *
+ * On RISC-V, fp points above the metadata record, so the record's
+ * frame_record portion is at fp - sizeof(struct frame_record).
+ */
+static __always_inline int
+kunwind_next_regs_pc(struct kunwind_state *state)
+{
+	struct stack_info *info;
+	unsigned long fp = state->common.fp;
+	struct pt_regs *regs;
+
+	regs = container_of((unsigned long *)(fp - sizeof(struct frame_record)),
+			    struct pt_regs, stackframe.record.fp);
+
+	info = unwind_find_stack(&state->common, (unsigned long)regs,
+				 sizeof(*regs));
+	if (!info)
+		return -EINVAL;
+
+	unwind_consume_stack(&state->common, info, (unsigned long)regs,
+			     sizeof(*regs));
+
+	state->regs = regs;
+	state->common.pc = regs->epc;
+	state->common.fp = frame_pointer(regs);
+	state->source = KUNWIND_SOURCE_REGS_PC;
+	return 0;
+}
+
+/*
+ * Handle a metadata frame record embedded in pt_regs.
+ *
+ * On RISC-V, fp points above the record (fp = metadata + 16), so the
+ * frame_record_meta starts at fp - sizeof(struct frame_record).
+ *
+ * FRAME_META_TYPE_FINAL: This is the outermost exception entry
+ *   (user -> kernel). Unwinding terminates successfully.
+ * FRAME_META_TYPE_PT_REGS: This is a nested exception entry
+ *   (kernel -> kernel). Continue unwinding from the saved context.
+ */
+static __always_inline int
+kunwind_next_frame_record_meta(struct kunwind_state *state)
+{
+	struct task_struct *tsk = state->task;
+	unsigned long fp = state->common.fp;
+	unsigned long meta_base = fp - sizeof(struct frame_record);
+	struct frame_record_meta *meta;
+	struct stack_info *info;
+
+	info = unwind_find_stack(&state->common, meta_base, sizeof(*meta));
+	if (!info)
+		return -EINVAL;
+
+	meta = (struct frame_record_meta *)meta_base;
+	switch (READ_ONCE(meta->type)) {
+	case FRAME_META_TYPE_FINAL:
+		if (meta == &task_pt_regs(tsk)->stackframe)
+			return -ENOENT;
+		WARN_ON_ONCE(tsk == current);
+		return -EINVAL;
+	case FRAME_META_TYPE_PT_REGS:
+		return kunwind_next_regs_pc(state);
+	default:
+		WARN_ON_ONCE(tsk == current);
+		return -EINVAL;
+	}
+}
+
+/*
+ * Unwind from one frame record to the next.
+ *
+ * On RISC-V, the frame record sits at fp - sizeof(struct frame_record),
+ * immediately below the address pointed to by fp/s0. This applies to both
+ * normal frame records and metadata frame records (embedded in pt_regs).
+ *
+ * A metadata record is identified by both fp and ra being zero in the
+ * frame_record portion, with a type value following the record.
+ */
+static __always_inline int
+kunwind_next_frame_record(struct kunwind_state *state)
+{
+	unsigned long fp = state->common.fp;
+	struct frame_record *record;
+	struct stack_info *info;
+	unsigned long new_fp, new_pc;
+	unsigned long record_base;
+
+	if (fp & 0x7)
+		return -EINVAL;
+
+	record_base = fp - sizeof(*record);
+
+	info = unwind_find_stack(&state->common, record_base, sizeof(*record));
+	if (!info)
+		return -EINVAL;
+
+	record = (struct frame_record *)record_base;
+	new_fp = READ_ONCE(record->fp);
+	new_pc = READ_ONCE(record->ra);
+
+	if (!new_fp && !new_pc)
+		return kunwind_next_frame_record_meta(state);
+
+	unwind_consume_stack(&state->common, info, record_base,
+			     sizeof(*record));
+
+	state->common.fp = new_fp;
+	state->common.pc = new_pc;
+	state->source = KUNWIND_SOURCE_FRAME;
+
+	return 0;
+}
+
+/*
+ * Unwind from one frame record (A) to the next frame record (B).
+ *
+ * We terminate early if the location of B indicates a malformed chain of frame
+ * records (e.g. a cycle), determined based on the location and fp value of A
+ * and the location (but not the fp value) of B.
+ */
+static __always_inline int
+kunwind_next(struct kunwind_state *state)
+{
+	int err;
+
+	state->flags.all = 0;
+
+	switch (state->source) {
+	case KUNWIND_SOURCE_FRAME:
+	case KUNWIND_SOURCE_CALLER:
+	case KUNWIND_SOURCE_TASK:
+	case KUNWIND_SOURCE_REGS_PC:
+		err = kunwind_next_frame_record(state);
+		break;
+	default:
+		err = -EINVAL;
+	}
+
+	if (err)
+		return err;
+
+	return kunwind_recover_return_address(state);
+}
+
+typedef bool (*kunwind_consume_fn)(const struct kunwind_state *state, void *cookie);
+
+static __always_inline int
+do_kunwind(struct kunwind_state *state, kunwind_consume_fn consume_state,
+	   void *cookie)
+{
+	int ret;
+
+	ret = kunwind_recover_return_address(state);
+	if (ret)
+		return ret;
+
+	while (1) {
+		if (!consume_state(state, cookie))
+			return -EINVAL;
+		ret = kunwind_next(state);
+		if (ret == -ENOENT)
+			return 0;
+		if (ret < 0)
+			return ret;
+	}
+}
+
+static __always_inline int
+kunwind_stack_walk(kunwind_consume_fn consume_state,
+		   void *cookie, struct task_struct *task,
+		   struct pt_regs *regs)
+{
+	struct task_struct *tsk = task ?: current;
+	struct stack_info stacks[] = {
+		stackinfo_get_task(tsk),
+		STACKINFO_CPU(tsk, irq),
+#ifdef CONFIG_VMAP_STACK
+		STACKINFO_CPU(tsk, overflow),
+#endif
+	};
+	struct kunwind_state state = {
+		.common = {
+			.stacks = stacks,
+			.nr_stacks = ARRAY_SIZE(stacks),
+		},
+	};
+
+	if (regs) {
+		if (tsk != current)
+			return -EINVAL;
+		kunwind_init_from_regs(&state, regs);
+	} else if (tsk == current) {
+		kunwind_init_from_caller(&state);
+	} else {
+		kunwind_init_from_task(&state, tsk);
+	}
+
+	return do_kunwind(&state, consume_state, cookie);
+}
+
+struct kunwind_consume_entry_data {
+	stack_trace_consume_fn consume_entry;
+	void *cookie;
+};
+
+static __always_inline bool
+arch_kunwind_consume_entry(const struct kunwind_state *state, void *cookie)
+{
+	struct kunwind_consume_entry_data *data = cookie;
+
+	return data->consume_entry(data->cookie, state->common.pc);
+}
+
+static __always_inline bool
+arch_reliable_kunwind_consume_entry(const struct kunwind_state *state, void *cookie)
+{
+	/*
+	 * At an exception boundary we can reliably consume the saved PC. We do
+	 * not know whether ra was live when the exception was taken, and
+	 * so we cannot perform the next unwind step reliably.
+	 *
+	 * All that matters is whether the *entire* unwind is reliable, so give
+	 * up as soon as we hit an exception boundary.
+	 */
+	if (state->source == KUNWIND_SOURCE_REGS_PC)
+		return false;
+
+	return arch_kunwind_consume_entry(state, cookie);
+}
+
+#endif /* CONFIG_FRAME_POINTER */
+
+/*
+ * arch_stack_walk - dual implementation.
+ *
+ * When CONFIG_FRAME_POINTER is enabled, uses the kunwind infrastructure for
+ * robust frame-pointer-based unwinding, consistent with arch_stack_walk_reliable.
+ *
+ * When CONFIG_FRAME_POINTER is disabled, falls back to the simple stack scan
+ * in walk_stackframe().
+ */
+#ifdef CONFIG_FRAME_POINTER
+
+noinline noinstr void arch_stack_walk(stack_trace_consume_fn consume_entry,
+				      void *cookie, struct task_struct *task,
+				      struct pt_regs *regs)
+{
+	struct kunwind_consume_entry_data data = {
+		.consume_entry = consume_entry,
+		.cookie = cookie,
+	};
+
+	kunwind_stack_walk(arch_kunwind_consume_entry, &data, task, regs);
+}
+
+#else
+
+noinline noinstr void arch_stack_walk(stack_trace_consume_fn consume_entry,
+				      void *cookie, struct task_struct *task,
+				      struct pt_regs *regs)
 {
 	walk_stackframe(task, regs, consume_entry, cookie);
 }
 
+#endif /* CONFIG_FRAME_POINTER */
+
+/*
+ * Reliable stack walk for livepatch (CONFIG_FRAME_POINTER only).
+ */
+#ifdef CONFIG_FRAME_POINTER
+
+noinline noinstr int arch_stack_walk_reliable(stack_trace_consume_fn consume_entry,
+					      void *cookie,
+					      struct task_struct *task)
+{
+	struct kunwind_consume_entry_data data = {
+		.consume_entry = consume_entry,
+		.cookie = cookie,
+	};
+
+	return kunwind_stack_walk(arch_reliable_kunwind_consume_entry, &data,
+				  task, NULL);
+}
+
+#endif /* CONFIG_FRAME_POINTER */
+
 /*
  * Get the return address for a single stackframe and return a pointer to the
  * next frame tail.
-- 
2.43.0


* [PATCH v4 RESEND 7/7] selftests/livepatch: Add RISC-V syscall wrapper prefix
From: Wang Han @ 2026-06-29  7:27 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Alexandre Ghiti, linux-riscv, Oleg Nesterov, Steven Rostedt,
	Masami Hiramatsu, Mark Rutland, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, oliver.yang, xueshuai, zhuo.song, jkchen,
	Marcos Paulo de Souza, linux-kernel, linux-trace-kernel,
	linux-perf-users, live-patching, linux-kselftest
In-Reply-To: <20260629072713.3273743-1-wanghan@linux.alibaba.com>

The syscall livepatch selftest resolves and patches a syscall wrapper
symbol. To use that test for RISC-V livepatch validation, add the
RISC-V FN_PREFIX definition for ARCH_HAS_SYSCALL_WRAPPER.

Without this macro, the syscall livepatch selftest cannot resolve the
RISC-V target symbol, and the syscall-related livepatch test fails on
RISC-V.
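
As a rough illustration of why the prefix matters, the sketch below builds
the name of the wrapper symbol from FN_PREFIX. It assumes the prefix is
combined with the syscall name by stringification; DEMO_STR(),
demo_getpid_wrapper_name() and demo_has_prefix() are hypothetical helpers
and are not taken from the selftest.

```c
#include <assert.h>
#include <string.h>

/* Assumption: fixed here for illustration; the selftest selects it per architecture. */
#define FN_PREFIX __riscv_

/* Two-level stringification so FN_PREFIX is expanded before being quoted. */
#define DEMO_STR_(x)	#x
#define DEMO_STR(x)	DEMO_STR_(x)

/* Hypothetical helper: symbol name of the getpid syscall wrapper. */
static const char *demo_getpid_wrapper_name(void)
{
	/* Adjacent string literals are concatenated by the compiler. */
	return DEMO_STR(FN_PREFIX) "sys_getpid";
}

/* Hypothetical helper: does the symbol name carry the expected prefix? */
static int demo_has_prefix(const char *sym, const char *prefix)
{
	return strncmp(sym, prefix, strlen(prefix)) == 0;
}
```

With an empty FN_PREFIX the same expression yields plain "sys_getpid",
which is the name a lookup would use on an architecture without a
matching branch and which does not exist as a wrapper symbol on RISC-V.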

Reviewed-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 .../testing/selftests/livepatch/test_modules/test_klp_syscall.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/livepatch/test_modules/test_klp_syscall.c b/tools/testing/selftests/livepatch/test_modules/test_klp_syscall.c
index 08aacc0e14de..9baa2a5f84c9 100644
--- a/tools/testing/selftests/livepatch/test_modules/test_klp_syscall.c
+++ b/tools/testing/selftests/livepatch/test_modules/test_klp_syscall.c
@@ -24,6 +24,8 @@
     #define FN_PREFIX __s390x_
   #elif defined(__aarch64__)
     #define FN_PREFIX __arm64_
+  #elif defined(__riscv)
+    #define FN_PREFIX __riscv_
   #elif defined(__powerpc__)
     #define FN_PREFIX
   #else
-- 
2.43.0


* [PATCH v4 RESEND 3/7] riscv: ftrace: always preserve s0 in dynamic ftrace register frame
From: Wang Han @ 2026-06-29  7:27 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Alexandre Ghiti, linux-riscv, Oleg Nesterov, Steven Rostedt,
	Masami Hiramatsu, Mark Rutland, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, oliver.yang, xueshuai, zhuo.song, jkchen,
	Marcos Paulo de Souza, linux-kernel, linux-trace-kernel,
	linux-perf-users, live-patching, linux-kselftest
In-Reply-To: <20260629072713.3273743-1-wanghan@linux.alibaba.com>

struct __arch_ftrace_regs declares s0 unconditionally, and both
ftrace_regs_get_frame_pointer() and ftrace_partial_regs() read it
unconditionally. But the SAVE_ABI_REGS / RESTORE_ABI_REGS macros in
mcount-dyn.S only stored s0 under HAVE_FUNCTION_GRAPH_FP_TEST
(CONFIG_FUNCTION_GRAPH_TRACER && CONFIG_FRAME_POINTER). With
CONFIG_FRAME_POINTER=n the slot held whatever was on the stack before,
so any callback going through ftrace_partial_regs() saw a garbage
regs->s0. RISC-V kernels default to FRAME_POINTER=y, which is why this
has not bitten in practice.

Save and restore s0 unconditionally in the dynamic ftrace ABI register
frame. This fixes the latent garbage-s0 case, brings the dynamic ftrace
path in line with the static _mcount path (mcount.S SAVE_ABI_STATE
already saves s0 unconditionally), and matches the frame layout already
documented in the comment above SAVE_ABI_REGS. It is also a prerequisite
for the upcoming reliable unwinder, which reads
ftrace_regs_get_frame_pointer(fregs) directly.

The cost is one extra REG_S/REG_L pair per traced call, negligible
compared to the overall ftrace cost; the existing FREGS_SIZE_ON_STACK
already reserved the slot, so no extra stack space is used.
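
The failure mode can be modelled in user space. The sketch below is a
simplified simulation, not kernel code: struct demo_fregs, struct demo_cpu
and the demo_*() functions are hypothetical, and the 0xdeadbeef pattern
stands in for whatever the stack slot held before the frame was set up.

```c
#include <assert.h>

/* Hypothetical, reduced model of the register frame; not the kernel layout. */
struct demo_fregs {
	unsigned long epc;
	unsigned long ra;
	unsigned long s0;
};

/* Hypothetical model of the live registers at the ftrace entry point. */
struct demo_cpu {
	unsigned long t0;	/* value stored into the epc slot */
	unsigned long ra;
	unsigned long s0;	/* frame pointer */
};

#define DEMO_STALE_PATTERN 0xdeadbeefUL

/* Model of the save macro: the s0 store is optional, as before the patch. */
static void demo_save_abi_regs(struct demo_fregs *frame,
			       const struct demo_cpu *cpu, int save_s0)
{
	frame->epc = cpu->t0;
	frame->ra = cpu->ra;
	if (save_s0)
		frame->s0 = cpu->s0;
}

/* Returns the s0 value a callback would read back from the frame. */
static unsigned long demo_read_s0(int save_s0)
{
	const struct demo_cpu cpu = { .t0 = 0x1000, .ra = 0x2000, .s0 = 0x3000 };
	struct demo_fregs frame;

	/* Pre-fill the frame with stale data, as a reused stack slot would be. */
	frame.epc = frame.ra = frame.s0 = DEMO_STALE_PATTERN;
	demo_save_abi_regs(&frame, &cpu, save_s0);
	return frame.s0;
}
```

Calling demo_read_s0() with save_s0 set to 0 models the old behaviour with
CONFIG_FRAME_POINTER=n and returns the stale pattern; with save_s0 set to
1 it returns the live frame pointer, which is what an unconditional reader
of the slot expects.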

Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 arch/riscv/kernel/mcount-dyn.S | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/arch/riscv/kernel/mcount-dyn.S b/arch/riscv/kernel/mcount-dyn.S
index 082fe0b0e3c0..26c55fba8fec 100644
--- a/arch/riscv/kernel/mcount-dyn.S
+++ b/arch/riscv/kernel/mcount-dyn.S
@@ -85,9 +85,7 @@
 	addi	sp, sp, -FREGS_SIZE_ON_STACK
 	REG_S	t0,  FREGS_EPC(sp)
 	REG_S	x1,  FREGS_RA(sp)
-#ifdef HAVE_FUNCTION_GRAPH_FP_TEST
 	REG_S	x8,  FREGS_S0(sp)
-#endif
 	REG_S	x6,  FREGS_T1(sp)
 #ifdef CONFIG_CC_IS_CLANG
 	REG_S	x7,  FREGS_T2(sp)
@@ -113,9 +111,7 @@
 	.macro RESTORE_ABI_REGS
 	REG_L	t0, FREGS_EPC(sp)
 	REG_L	x1, FREGS_RA(sp)
-#ifdef HAVE_FUNCTION_GRAPH_FP_TEST
 	REG_L	x8, FREGS_S0(sp)
-#endif
 	REG_L	x6,  FREGS_T1(sp)
 #ifdef CONFIG_CC_IS_CLANG
 	REG_L	x7,  FREGS_T2(sp)
-- 
2.43.0


* [PATCH v4 RESEND 6/7] riscv: Kconfig: enable HAVE_RELIABLE_STACKTRACE and HAVE_LIVEPATCH
From: Wang Han @ 2026-06-29  7:27 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Alexandre Ghiti, linux-riscv, Oleg Nesterov, Steven Rostedt,
	Masami Hiramatsu, Mark Rutland, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, oliver.yang, xueshuai, zhuo.song, jkchen,
	Marcos Paulo de Souza, linux-kernel, linux-trace-kernel,
	linux-perf-users, live-patching, linux-kselftest
In-Reply-To: <20260629072713.3273743-1-wanghan@linux.alibaba.com>

Now that the metadata frame records, the kunwind state machine and
arch_stack_walk_reliable() are all in place, advertise the capability
to the rest of the kernel:

  * select HAVE_RELIABLE_STACKTRACE under FRAME_POINTER && 64BIT, so
    only the configurations with the tested metadata records and
    FP-based reliable walker enable it.
  * select HAVE_LIVEPATCH under the same condition and source
    kernel/livepatch/Kconfig so the livepatch menu is reachable from
    the RISC-V configuration.

The 64BIT dependency is conservative scoping rather than a hard
technical requirement: the metadata frame record, kunwind state machine
and arch_stack_walk_reliable() also build on RV32, and the IRQ-stack
frame-record adjustment fixes a latent RV32 issue. However, the syscall
livepatch selftest and module relocation path have only been exercised
on RV64 QEMU virt so far. The 64BIT gate can be relaxed in a follow-up
once RV32 has equivalent coverage.

This is split out from the unwinder change so the policy decision and
the implementation can be reviewed and reverted independently.

Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 arch/riscv/Kconfig | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index cb3d85abf595..4d09fab682ac 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -186,6 +186,7 @@ config RISCV
 	select HAVE_KRETPROBES
 	# https://github.com/ClangBuiltLinux/linux/issues/1881
 	select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
+	select HAVE_LIVEPATCH if FRAME_POINTER && 64BIT
 	select HAVE_MOVE_PMD
 	select HAVE_MOVE_PUD
 	select HAVE_PAGE_SIZE_4KB
@@ -196,6 +197,7 @@ config RISCV
 	select HAVE_POSIX_CPU_TIMERS_TASK_WORK
 	select HAVE_PREEMPT_DYNAMIC_KEY
 	select HAVE_REGS_AND_STACK_ACCESS_API
+	select HAVE_RELIABLE_STACKTRACE if FRAME_POINTER && 64BIT
 	select HAVE_RETHOOK
 	select HAVE_RSEQ
 	select HAVE_RUST if RUSTC_SUPPORTS_RISCV && CC_IS_CLANG
@@ -1392,3 +1394,5 @@ endmenu # "CPU Power Management"
 source "arch/riscv/kvm/Kconfig"
 
 source "drivers/acpi/Kconfig"
+
+source "kernel/livepatch/Kconfig"
-- 
2.43.0


* [PATCH 1/4] rtla: Allow unsetting non-list custom-callback CLI options
From: Tomas Glozar @ 2026-06-29  8:36 UTC (permalink / raw)
  To: Steven Rostedt, Tomas Glozar
  Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
	Wander Lairson Costa, LKML, linux-trace-kernel

libsubcmd implicitly allows the user to unset already set options using
a "no-" prefix for long options. For example, if debug output is enabled
like this:

$ rtla timerlat -D
Loading BPF program
reading osnoise/timerlat_period_us returned 1000
setting osnoise/timerlat_period_us to 1000
reading osnoise/print_stack returned 0
setting osnoise/print_stack to 0
...
<timerlat top>

it can be unset by a subsequent --no-debug:

$ rtla timerlat -D --no-debug
...
<timerlat top>

Currently, this works only for boolean options. Extend the feature to
all options by handling the "unset" argument in the opt_*() callbacks
defined in cli_p.h, except for list options, i.e. options that can be
passed multiple times (--event, --filter, --trigger, --on-threshold,
--on-end).

This allows, for example, unsetting of int/long long options, e.g. "-p":

$ rtla timerlat -D -p100 --no-period
...
setting osnoise/timerlat_period_us to 1000
...

By default, options in the params struct are reset to zero. A constant
is added for every parameter with a different default value, which is
then used both in <tool>_parse_args() while setting the initial value
and in opt_*() when unsetting the option. This refactoring ensures that
no "magic number" is duplicated.

The default value for opt_llong_callback() and opt_int_callback() is
passed in struct option's defval field; new macros
RTLA_OPT_{LLONG,INT}{,_DEFVAL} are added to define the field
conveniently. The default value for other callbacks is hardcoded inside
each callback's unset logic.
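
The two ways of carrying the default can be sketched outside of rtla as
follows. This is a minimal model under stated assumptions: struct
demo_option is a reduced, hypothetical stand-in for libsubcmd's struct
option, the demo_*() callbacks only mirror the unset branch described
above, parsing is simplified to atoi()/atoll(), and the value 5000 is
purely illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, reduced stand-in for libsubcmd's struct option. */
struct demo_option {
	void *value;		/* where the parsed value is stored */
	intptr_t defval;	/* default restored when the option is unset */
};

static const int demo_default_idle_state = -2;		/* non-zero default */
static const long long demo_default_threshold = 5000;	/* illustrative only */

/* int variant: defval carries the default value itself. */
static int demo_int_callback(const struct demo_option *opt, const char *arg,
			     int unset)
{
	int *value = opt->value;

	if (unset) {
		*value = (int)opt->defval;
		return 0;
	}
	if (!arg)
		return -1;
	*value = atoi(arg);	/* simplified parsing, no error checking */
	return 0;
}

/* long long variant: defval is a pointer to the default, or 0 meaning zero. */
static int demo_llong_callback(const struct demo_option *opt, const char *arg,
			       int unset)
{
	long long *value = opt->value;

	if (unset) {
		*value = opt->defval ? *(const long long *)opt->defval : 0;
		return 0;
	}
	if (!arg)
		return -1;
	*value = atoll(arg);	/* simplified parsing, no error checking */
	return 0;
}
```

Passing a pointer for the long long case keeps the scheme usable where
intptr_t is narrower than long long, at the cost of requiring the default
to live in a named object. A defval of 0 covers the common case of a zero
default in both variants.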

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---
 tools/tracing/rtla/src/cli.c   |  42 +++----
 tools/tracing/rtla/src/cli_p.h | 219 +++++++++++++++++++++++++++------
 2 files changed, 194 insertions(+), 67 deletions(-)

diff --git a/tools/tracing/rtla/src/cli.c b/tools/tracing/rtla/src/cli.c
index c5279c9875310..fb8c972c0746b 100644
--- a/tools/tracing/rtla/src/cli.c
+++ b/tools/tracing/rtla/src/cli.c
@@ -192,10 +192,10 @@ struct common_params *osnoise_hist_parse_args(int argc, char **argv)
 	actions_init(&params->common.threshold_actions);
 	actions_init(&params->common.end_actions);
 
-	/* display data in microseconds */
-	params->common.output_divisor = 1000;
-	params->common.hist.bucket_size = 1;
-	params->common.hist.entries = 256;
+	/* set default values */
+	params->common.output_divisor = default_output_divisor;
+	params->common.hist.bucket_size = default_bucket_size;
+	params->common.hist.entries = default_entries;
 
 	argc = parse_options(argc, (const char **)argv,
 			     osnoise_hist_options, osnoise_hist_usage,
@@ -280,19 +280,15 @@ struct common_params *timerlat_top_parse_args(int argc, char **argv)
 	actions_init(&params->common.threshold_actions);
 	actions_init(&params->common.end_actions);
 
-	/* disabled by default */
-	params->dma_latency = -1;
-	params->deepest_idle_state = -2;
-
-	/* display data in microseconds */
-	params->common.output_divisor = 1000;
+	/* set default values */
+	params->dma_latency = default_dma_latency;
+	params->deepest_idle_state = default_deepest_idle_state;
+	params->common.output_divisor = default_output_divisor;
+	params->stack_format = default_stack_format;
 
 	/* default to BPF mode */
 	params->mode = TRACING_MODE_BPF;
 
-	/* default to truncate stack format */
-	params->stack_format = STACK_FORMAT_TRUNCATE;
-
 	argc = parse_options(argc, (const char **)argv,
 			     timerlat_top_options, timerlat_top_usage,
 			     common_parse_options_flags);
@@ -403,23 +399,17 @@ struct common_params *timerlat_hist_parse_args(int argc, char **argv)
 	actions_init(&params->common.threshold_actions);
 	actions_init(&params->common.end_actions);
 
-	/* disabled by default */
-	params->dma_latency = -1;
-
-	/* disabled by default */
-	params->deepest_idle_state = -2;
-
-	/* display data in microseconds */
-	params->common.output_divisor = 1000;
-	params->common.hist.bucket_size = 1;
-	params->common.hist.entries = 256;
+	/* set default values */
+	params->dma_latency = default_dma_latency;
+	params->deepest_idle_state = default_deepest_idle_state;
+	params->common.output_divisor = default_output_divisor;
+	params->common.hist.bucket_size = default_bucket_size;
+	params->common.hist.entries = default_entries;
+	params->stack_format = default_stack_format;
 
 	/* default to BPF mode */
 	params->mode = TRACING_MODE_BPF;
 
-	/* default to truncate stack format */
-	params->stack_format = STACK_FORMAT_TRUNCATE;
-
 	argc = parse_options(argc, (const char **)argv,
 			     timerlat_hist_options, timerlat_hist_usage,
 			     common_parse_options_flags);
diff --git a/tools/tracing/rtla/src/cli_p.h b/tools/tracing/rtla/src/cli_p.h
index 3c939de9abf02..3a93dba60215b 100644
--- a/tools/tracing/rtla/src/cli_p.h
+++ b/tools/tracing/rtla/src/cli_p.h
@@ -22,6 +22,38 @@ struct timerlat_cb_data {
 	char *trace_output;
 };
 
+/*
+ * Non-zero default values for parameters
+ */
+static const int default_dma_latency = -1; /* -1 = unset */
+static const int default_deepest_idle_state = -2; /* -1 = disable all, -2 = unset */
+static const int default_output_divisor = 1000;
+static const int default_bucket_size = 1;
+static const int default_entries = 256;
+static const enum stack_format default_stack_format = STACK_FORMAT_TRUNCATE;
+
+/*
+ * Shorthand macros for integer/long long command line options using
+ * opt_int_callback/opt_llong_callback, with variants that set defval.
+ *
+ * Note: defval's type is intptr_t. opt_int_callback interprets it directly as
+ * an int, opt_llong_callback interprets it as a pointer to a long long, as
+ * long long does not fit into intptr_t on 32-bit architectures.
+ */
+#define RTLA_OPT_LLONG(s, l, v, a, h) \
+	OPT_CALLBACK(s, l, v, a, h, opt_llong_callback)
+
+#define RTLA_OPT_LLONG_DEFVAL(s, l, v, a, h, d) { .type = OPTION_CALLBACK, \
+	.short_name = (s), .long_name = (l), .value = (v), .argh = (a), \
+	.help = (h), .callback = opt_llong_callback, .defval = (intptr_t)(d) }
+
+#define RTLA_OPT_INT(s, l, v, a, h) \
+	OPT_CALLBACK(s, l, v, a, h, opt_int_callback)
+
+#define RTLA_OPT_INT_DEFVAL(s, l, v, a, h, d) { .type = OPTION_CALLBACK, \
+	.short_name = (s), .long_name = (l), .value = (v), .argh = (a), \
+	.help = (h), .callback = opt_int_callback, .defval = (intptr_t)(d) }
+
 /*
  * Macros for command line options common to all tools
  *
@@ -108,14 +140,12 @@ struct timerlat_cb_data {
 #define RTLA_OPT_QUIET OPT_BOOLEAN('q', "quiet", &params->common.quiet, \
 	"print only a summary at the end")
 
-#define RTLA_OPT_TRACE_BUFFER_SIZE OPT_CALLBACK(0, "trace-buffer-size", \
+#define RTLA_OPT_TRACE_BUFFER_SIZE RTLA_OPT_INT(0, "trace-buffer-size", \
 	&params->common.buffer_size, "kB", \
-	"set the per-cpu trace buffer size in kB", \
-	opt_int_callback)
+	"set the per-cpu trace buffer size in kB")
 
-#define RTLA_OPT_WARM_UP OPT_CALLBACK(0, "warm-up", &params->common.warmup, "s", \
-	"let the workload run for s seconds before collecting data", \
-	opt_int_callback)
+#define RTLA_OPT_WARM_UP RTLA_OPT_INT(0, "warm-up", &params->common.warmup, "s", \
+	"let the workload run for s seconds before collecting data")
 
 #define RTLA_OPT_AUTO(cb) OPT_CALLBACK('a', "auto", &cb_data, "us", \
 	"set automatic trace mode, stopping the session if argument in us sample is hit", \
@@ -143,7 +173,12 @@ static int opt_llong_callback(const struct option *opt, const char *arg, int uns
 {
 	long long *value = opt->value;
 
-	if (unset || !arg)
+	if (unset) {
+		*value = opt->defval ? *(long long *)opt->defval : 0;
+		return 0;
+	}
+
+	if (!arg)
 		return -1;
 
 	*value = get_llong_from_str((char *)arg);
@@ -154,7 +189,12 @@ static int opt_int_callback(const struct option *opt, const char *arg, int unset
 {
 	int *value = opt->value;
 
-	if (unset || !arg)
+	if (unset) {
+		*value = (int)opt->defval;
+		return 0;
+	}
+
+	if (!arg)
 		return -1;
 
 	if (strtoi(arg, value))
@@ -168,7 +208,13 @@ static int opt_cpus_cb(const struct option *opt, const char *arg, int unset)
 	struct common_params *params = opt->value;
 	int retval;
 
-	if (unset || !arg)
+	if (unset) {
+		CPU_ZERO(&params->monitored_cpus);
+		params->cpus = NULL;
+		return 0;
+	}
+
+	if (!arg)
 		return -1;
 
 	retval = parse_cpu_set((char *)arg, &params->monitored_cpus);
@@ -183,8 +229,11 @@ static int opt_cgroup_cb(const struct option *opt, const char *arg, int unset)
 {
 	struct common_params *params = opt->value;
 
-	if (unset)
-		return -1;
+	if (unset) {
+		params->cgroup = 0;
+		params->cgroup_name = NULL;
+		return 0;
+	}
 
 	params->cgroup = 1;
 	params->cgroup_name = (char *)arg;
@@ -199,7 +248,12 @@ static int opt_duration_cb(const struct option *opt, const char *arg, int unset)
 {
 	struct common_params *params = opt->value;
 
-	if (unset || !arg)
+	if (unset) {
+		params->duration = 0;
+		return 0;
+	}
+
+	if (!arg)
 		return -1;
 
 	params->duration = parse_seconds_duration((char *)arg);
@@ -233,7 +287,13 @@ static int opt_housekeeping_cb(const struct option *opt, const char *arg, int un
 	struct common_params *params = opt->value;
 	int retval;
 
-	if (unset || !arg)
+	if (unset) {
+		params->hk_cpus = 0;
+		CPU_ZERO(&params->hk_cpu_set);
+		return 0;
+	}
+
+	if (!arg)
 		return -1;
 
 	params->hk_cpus = 1;
@@ -249,7 +309,13 @@ static int opt_priority_cb(const struct option *opt, const char *arg, int unset)
 	struct common_params *params = opt->value;
 	int retval;
 
-	if (unset || !arg)
+	if (unset) {
+		memset(&params->sched_param, 0, sizeof(params->sched_param));
+		params->set_sched = 0;
+		return 0;
+	}
+
+	if (!arg)
 		return -1;
 
 	retval = parse_prio((char *)arg, &params->sched_param);
@@ -301,9 +367,8 @@ static int opt_filter_cb(const struct option *opt, const char *arg, int unset)
 	"osnoise runtime in us", \
 	opt_osnoise_runtime_cb)
 
-#define OSNOISE_OPT_THRESHOLD OPT_CALLBACK('T', "threshold", &params->threshold, "us", \
-	"the minimum delta to be considered a noise", \
-	opt_llong_callback)
+#define OSNOISE_OPT_THRESHOLD RTLA_OPT_LLONG('T', "threshold", &params->threshold, "us", \
+	"the minimum delta to be considered a noise")
 
 /*
  * Callback functions for command line options for osnoise tools
@@ -315,7 +380,14 @@ static int opt_osnoise_auto_cb(const struct option *opt, const char *arg, int un
 	struct osnoise_params *params = cb_data->params;
 	long long auto_thresh;
 
-	if (unset || !arg)
+	if (unset) {
+		params->common.stop_us = 0;
+		params->threshold = 0;
+		cb_data->trace_output = NULL;
+		return 0;
+	}
+
+	if (!arg)
 		return -1;
 
 	auto_thresh = get_llong_from_str((char *)arg);
@@ -332,7 +404,12 @@ static int opt_osnoise_period_cb(const struct option *opt, const char *arg, int
 {
 	unsigned long long *period = opt->value;
 
-	if (unset || !arg)
+	if (unset) {
+		*period = 0;
+		return 0;
+	}
+
+	if (!arg)
 		return -1;
 
 	*period = get_llong_from_str((char *)arg);
@@ -346,7 +423,12 @@ static int opt_osnoise_runtime_cb(const struct option *opt, const char *arg, int
 {
 	unsigned long long *runtime = opt->value;
 
-	if (unset || !arg)
+	if (unset) {
+		*runtime = 0;
+		return 0;
+	}
+
+	if (!arg)
 		return -1;
 
 	*runtime = get_llong_from_str((char *)arg);
@@ -360,8 +442,10 @@ static int opt_osnoise_trace_output_cb(const struct option *opt, const char *arg
 {
 	const char **trace_output = opt->value;
 
-	if (unset)
-		return -1;
+	if (unset) {
+		*trace_output = NULL;
+		return 0;
+	}
 
 	if (!arg) {
 		*trace_output = "osnoise_trace.txt";
@@ -412,9 +496,8 @@ static int opt_osnoise_on_end_cb(const struct option *opt, const char *arg, int
 	"timerlat period in us", \
 	opt_timerlat_period_cb)
 
-#define TIMERLAT_OPT_STACK OPT_CALLBACK('s', "stack", &params->print_stack, "us", \
-	"save the stack trace at the IRQ if a thread latency is higher than the argument in us", \
-	opt_llong_callback)
+#define TIMERLAT_OPT_STACK RTLA_OPT_LLONG('s', "stack", &params->print_stack, "us", \
+	"save the stack trace at the IRQ if a thread latency is higher than the argument in us")
 
 #define TIMERLAT_OPT_NANO OPT_CALLBACK_NOOPT('n', "nano", params, NULL, \
 	"display data in nanoseconds", \
@@ -424,10 +507,10 @@ static int opt_osnoise_on_end_cb(const struct option *opt, const char *arg, int
 	"set /dev/cpu_dma_latency latency <us> to reduce exit from idle latency", \
 	opt_dma_latency_cb)
 
-#define TIMERLAT_OPT_DEEPEST_IDLE_STATE OPT_CALLBACK(0, "deepest-idle-state", \
+#define TIMERLAT_OPT_DEEPEST_IDLE_STATE RTLA_OPT_INT_DEFVAL(0, "deepest-idle-state", \
 	&params->deepest_idle_state, "n", \
 	"only go down to idle state n on cpus used by timerlat to reduce exit from idle latency", \
-	opt_int_callback)
+	default_deepest_idle_state)
 
 #define TIMERLAT_OPT_AA_ONLY OPT_CALLBACK(0, "aa-only", params, "us", \
 	"stop if <us> latency is hit, only printing the auto analysis (reduces CPU usage)", \
@@ -459,7 +542,12 @@ static int opt_timerlat_period_cb(const struct option *opt, const char *arg, int
 {
 	long long *period = opt->value;
 
-	if (unset || !arg)
+	if (unset) {
+		*period = 0;
+		return 0;
+	}
+
+	if (!arg)
 		return -1;
 
 	*period = get_llong_from_str((char *)arg);
@@ -475,7 +563,15 @@ static int opt_timerlat_auto_cb(const struct option *opt, const char *arg, int u
 	struct timerlat_params *params = cb_data->params;
 	long long auto_thresh;
 
-	if (unset || !arg)
+	if (unset) {
+		params->common.stop_total_us = 0;
+		params->common.stop_us = 0;
+		params->print_stack = 0;
+		cb_data->trace_output = NULL;
+		return 0;
+	}
+
+	if (!arg)
 		return -1;
 
 	auto_thresh = get_llong_from_str((char *)arg);
@@ -494,7 +590,12 @@ static int opt_dma_latency_cb(const struct option *opt, const char *arg, int uns
 	int *dma_latency = opt->value;
 	int retval;
 
-	if (unset || !arg)
+	if (unset) {
+		*dma_latency = default_dma_latency;
+		return 0;
+	}
+
+	if (!arg)
 		return -1;
 
 	retval = strtoi((char *)arg, dma_latency);
@@ -511,7 +612,15 @@ static int opt_aa_only_cb(const struct option *opt, const char *arg, int unset)
 	struct timerlat_params *params = opt->value;
 	long long auto_thresh;
 
-	if (unset || !arg)
+	if (unset) {
+		params->common.stop_total_us = 0;
+		params->common.stop_us = 0;
+		params->print_stack = 0;
+		params->common.aa_only = 0;
+		return 0;
+	}
+
+	if (!arg)
 		return -1;
 
 	auto_thresh = get_llong_from_str((char *)arg);
@@ -527,8 +636,10 @@ static int opt_timerlat_trace_output_cb(const struct option *opt, const char *ar
 {
 	const char **trace_output = opt->value;
 
-	if (unset)
-		return -1;
+	if (unset) {
+		*trace_output = NULL;
+		return 0;
+	}
 
 	if (!arg) {
 		*trace_output = "timerlat_trace.txt";
@@ -576,8 +687,11 @@ static int opt_user_threads_cb(const struct option *opt, const char *arg, int un
 {
 	struct timerlat_params *params = opt->value;
 
-	if (unset)
-		return -1;
+	if (unset) {
+		params->common.user_workload = false;
+		params->common.user_data = false;
+		return 0;
+	}
 
 	params->common.user_workload = true;
 	params->common.user_data = true;
@@ -589,8 +703,10 @@ static int opt_nano_cb(const struct option *opt, const char *arg, int unset)
 {
 	struct timerlat_params *params = opt->value;
 
-	if (unset)
-		return -1;
+	if (unset) {
+		params->common.output_divisor = default_output_divisor;
+		return 0;
+	}
 
 	params->common.output_divisor = 1;
 
@@ -601,7 +717,12 @@ static int opt_stack_format_cb(const struct option *opt, const char *arg, int un
 {
 	int *format = opt->value;
 
-	if (unset || !arg)
+	if (unset) {
+		*format = default_stack_format;
+		return 0;
+	}
+
+	if (!arg)
 		return -1;
 
 	*format = parse_stack_format((char *)arg);
@@ -616,7 +737,13 @@ static int opt_timerlat_align_cb(const struct option *opt, const char *arg, int
 {
 	struct timerlat_params *params = opt->value;
 
-	if (unset || !arg)
+	if (unset) {
+		params->timerlat_align = false;
+		params->timerlat_align_us = 0;
+		return 0;
+	}
+
+	if (!arg)
 		return -1;
 
 	params->timerlat_align = true;
@@ -662,7 +789,12 @@ static int opt_bucket_size_cb(const struct option *opt, const char *arg, int uns
 {
 	int *bucket_size = opt->value;
 
-	if (unset || !arg)
+	if (unset) {
+		*bucket_size = default_bucket_size;
+		return 0;
+	}
+
+	if (!arg)
 		return -1;
 
 	*bucket_size = get_llong_from_str((char *)arg);
@@ -676,7 +808,12 @@ static int opt_entries_cb(const struct option *opt, const char *arg, int unset)
 {
 	int *entries = opt->value;
 
-	if (unset || !arg)
+	if (unset) {
+		*entries = default_entries;
+		return 0;
+	}
+
+	if (!arg)
 		return -1;
 
 	*entries = get_llong_from_str((char *)arg);
-- 
2.54.0



* [PATCH 2/4] rtla: Add unit tests for unset in opt callbacks
From: Tomas Glozar @ 2026-06-29  8:36 UTC (permalink / raw)
  To: Steven Rostedt, Tomas Glozar
  Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
	Wander Lairson Costa, LKML, linux-trace-kernel
In-Reply-To: <20260629083654.1548925-1-tglozar@redhat.com>

For each opt callback that implements unsetting, test that unsetting the
option correctly restores the specified default value.

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---
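For readers who do not have patch 1/4 at hand, the pattern these tests
exercise can be sketched roughly as below. This is only an illustration:
struct sketch_option and sketch_int_cb() are hypothetical names, not rtla's
struct option or opt_int_callback(), and the parsing is simplified.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical stand-in for the option descriptor; not rtla's struct option. */
struct sketch_option {
	void *value;	/* where the parsed value is stored */
	int defval;	/* value restored when the option is unset */
};

/*
 * Callback in the style of the ones under test: "unset" restores the
 * default and succeeds, a missing argument is an error, and anything
 * else is parsed and stored.
 */
int sketch_int_cb(const struct sketch_option *opt, const char *arg, int unset)
{
	int *value = opt->value;
	char *end;
	long parsed;

	assert(value != NULL);

	if (unset) {
		*value = opt->defval;
		return 0;
	}

	if (!arg)
		return -1;

	errno = 0;
	parsed = strtol(arg, &end, 10);
	if (errno || end == arg || *end != '\0')
		return -1;
	if (parsed < INT_MIN || parsed > INT_MAX)
		return -1;

	*value = (int)parsed;
	return 0;
}
```

Each _unset test in the diff follows the same two steps: call the callback
once with an argument and unset == 0, then once with a NULL argument and
unset == 1, and check that the stored value is back at its default.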
 .../rtla/tests/unit/cli_opt_callback.c        | 295 ++++++++++++++++++
 1 file changed, 295 insertions(+)

diff --git a/tools/tracing/rtla/tests/unit/cli_opt_callback.c b/tools/tracing/rtla/tests/unit/cli_opt_callback.c
index 4a406af42821b..413a04f898fbb 100644
--- a/tools/tracing/rtla/tests/unit/cli_opt_callback.c
+++ b/tools/tracing/rtla/tests/unit/cli_opt_callback.c
@@ -40,6 +40,30 @@ START_TEST(test_opt_llong_callback_min)
 }
 END_TEST
 
+START_TEST(test_opt_llong_callback_unset)
+{
+	long long test_value = 0;
+	const struct option opt = TEST_CALLBACK(&test_value, opt_llong_callback);
+
+	ck_assert_int_eq(opt_llong_callback(&opt, "1234567890", 0), 0);
+	ck_assert_int_eq(opt_llong_callback(&opt, NULL, 1), 0);
+	ck_assert_int_eq(test_value, 0);
+}
+END_TEST
+
+START_TEST(test_opt_llong_callback_unset_defval)
+{
+	long long test_value = 0;
+	const long long default_value = 42;
+	const struct option opt = RTLA_OPT_LLONG_DEFVAL('t', "test", &test_value, "test value",
+							"test help", &default_value);
+
+	ck_assert_int_eq(opt_llong_callback(&opt, "1234567890", 0), 0);
+	ck_assert_int_eq(opt_llong_callback(&opt, NULL, 1), 0);
+	ck_assert_int_eq(test_value, default_value);
+}
+END_TEST
+
 START_TEST(test_opt_int_callback_simple)
 {
 	int test_value = 0;
@@ -90,6 +114,29 @@ START_TEST(test_opt_int_callback_non_numeric_suffix)
 }
 END_TEST
 
+START_TEST(test_opt_int_callback_unset)
+{
+	int test_value = 0;
+	const struct option opt = TEST_CALLBACK(&test_value, opt_int_callback);
+
+	ck_assert_int_eq(opt_int_callback(&opt, "1234567890", 0), 0);
+	ck_assert_int_eq(opt_int_callback(&opt, NULL, 1), 0);
+	ck_assert_int_eq(test_value, 0);
+}
+END_TEST
+
+START_TEST(test_opt_int_callback_unset_defval)
+{
+	int test_value = 0;
+	const struct option opt = RTLA_OPT_INT_DEFVAL('t', "test", &test_value, "test value",
+						      "test help", 42);
+
+	ck_assert_int_eq(opt_int_callback(&opt, "1234567890", 0), 0);
+	ck_assert_int_eq(opt_int_callback(&opt, NULL, 1), 0);
+	ck_assert_int_eq(test_value, 42);
+}
+END_TEST
+
 START_TEST(test_opt_cpus_cb)
 {
 	struct common_params params = {0};
@@ -134,6 +181,18 @@ START_TEST(test_opt_cgroup_cb_equals)
 }
 END_TEST
 
+START_TEST(test_opt_cgroup_cb_unset)
+{
+	struct common_params params = {0};
+	const struct option opt = TEST_CALLBACK(&params, opt_cgroup_cb);
+
+	ck_assert_int_eq(opt_cgroup_cb(&opt, "cgroup", 0), 0);
+	ck_assert_int_eq(opt_cgroup_cb(&opt, NULL, 1), 0);
+	ck_assert_int_eq(params.cgroup, 0);
+	ck_assert_ptr_null(params.cgroup_name);
+}
+END_TEST
+
 START_TEST(test_opt_duration_cb)
 {
 	struct common_params params = {0};
@@ -154,6 +213,17 @@ START_TEST(test_opt_duration_cb_invalid)
 }
 END_TEST
 
+START_TEST(test_opt_duration_cb_unset)
+{
+	struct common_params params = {0};
+	const struct option opt = TEST_CALLBACK(&params, opt_duration_cb);
+
+	ck_assert_int_eq(opt_duration_cb(&opt, "1m", 0), 0);
+	ck_assert_int_eq(opt_duration_cb(&opt, NULL, 1), 0);
+	ck_assert_int_eq(params.duration, 0);
+}
+END_TEST
+
 START_TEST(test_opt_event_cb)
 {
 	struct trace_events *events = NULL;
@@ -205,6 +275,19 @@ START_TEST(test_opt_housekeeping_cb_invalid)
 }
 END_TEST
 
+START_TEST(test_opt_housekeeping_cb_unset)
+{
+	struct common_params params = {0};
+	const struct option opt = TEST_CALLBACK(&params, opt_housekeeping_cb);
+
+	nr_cpus = 4;
+	ck_assert_int_eq(opt_housekeeping_cb(&opt, "0-3", 0), 0);
+	ck_assert_int_eq(opt_housekeeping_cb(&opt, NULL, 1), 0);
+	ck_assert_int_eq(params.hk_cpus, 0);
+	ck_assert_int_eq(CPU_COUNT(&params.hk_cpu_set), 0);
+}
+END_TEST
+
 START_TEST(test_opt_priority_cb)
 {
 	struct common_params params = {0};
@@ -226,6 +309,18 @@ START_TEST(test_opt_priority_cb_invalid)
 }
 END_TEST
 
+START_TEST(test_opt_priority_cb_unset)
+{
+	struct common_params params = {0};
+	const struct option opt = TEST_CALLBACK(&params, opt_priority_cb);
+
+	ck_assert_int_eq(opt_priority_cb(&opt, "f:95", 0), 0);
+	ck_assert_int_eq(opt_priority_cb(&opt, NULL, 1), 0);
+	ck_assert_int_eq(params.sched_param.sched_policy, 0);
+	ck_assert_int_eq(params.sched_param.sched_priority, 0);
+}
+END_TEST
+
 START_TEST(test_opt_trigger_cb)
 {
 	struct trace_events *events = trace_event_alloc("sched:sched_switch");
@@ -279,6 +374,20 @@ START_TEST(test_opt_osnoise_auto_cb)
 }
 END_TEST
 
+START_TEST(test_opt_osnoise_auto_cb_unset)
+{
+	struct osnoise_params params = {0};
+	struct osnoise_cb_data cb_data = {&params};
+	const struct option opt = TEST_CALLBACK(&cb_data, opt_osnoise_auto_cb);
+
+	ck_assert_int_eq(opt_osnoise_auto_cb(&opt, "10", 0), 0);
+	ck_assert_int_eq(opt_osnoise_auto_cb(&opt, NULL, 1), 0);
+	ck_assert_int_eq(params.common.stop_us, 0);
+	ck_assert_int_eq(params.threshold, 0);
+	ck_assert_ptr_null(cb_data.trace_output);
+}
+END_TEST
+
 START_TEST(test_opt_osnoise_period_cb)
 {
 	unsigned long long period = 0;
@@ -299,6 +408,17 @@ START_TEST(test_opt_osnoise_period_cb_invalid)
 }
 END_TEST
 
+START_TEST(test_opt_osnoise_period_cb_unset)
+{
+	unsigned long long period = 0;
+	const struct option opt = TEST_CALLBACK(&period, opt_osnoise_period_cb);
+
+	ck_assert_int_eq(opt_osnoise_period_cb(&opt, "1000000", 0), 0);
+	ck_assert_int_eq(opt_osnoise_period_cb(&opt, NULL, 1), 0);
+	ck_assert_int_eq(period, 0);
+}
+END_TEST
+
 START_TEST(test_opt_osnoise_runtime_cb)
 {
 	unsigned long long runtime = 0;
@@ -319,6 +439,17 @@ START_TEST(test_opt_osnoise_runtime_cb_invalid)
 }
 END_TEST
 
+START_TEST(test_opt_osnoise_runtime_cb_unset)
+{
+	unsigned long long runtime = 0;
+	const struct option opt = TEST_CALLBACK(&runtime, opt_osnoise_runtime_cb);
+
+	ck_assert_int_eq(opt_osnoise_runtime_cb(&opt, "900000", 0), 0);
+	ck_assert_int_eq(opt_osnoise_runtime_cb(&opt, NULL, 1), 0);
+	ck_assert_int_eq(runtime, 0);
+}
+END_TEST
+
 START_TEST(test_opt_osnoise_trace_output_cb)
 {
 	const char *trace_output = NULL;
@@ -339,6 +470,17 @@ START_TEST(test_opt_osnoise_trace_output_cb_noarg)
 }
 END_TEST
 
+START_TEST(test_opt_osnoise_trace_output_cb_unset)
+{
+	const char *trace_output = NULL;
+	const struct option opt = TEST_CALLBACK(&trace_output, opt_osnoise_trace_output_cb);
+
+	ck_assert_int_eq(opt_osnoise_trace_output_cb(&opt, "trace.txt", 0), 0);
+	ck_assert_int_eq(opt_osnoise_trace_output_cb(&opt, NULL, 1), 0);
+	ck_assert_ptr_null(trace_output);
+}
+END_TEST
+
 START_TEST(test_opt_osnoise_on_threshold_cb)
 {
 	struct actions actions = {0};
@@ -403,6 +545,17 @@ START_TEST(test_opt_timerlat_period_cb_invalid)
 }
 END_TEST
 
+START_TEST(test_opt_timerlat_period_cb_unset)
+{
+	long long period = 0;
+	const struct option opt = TEST_CALLBACK(&period, opt_timerlat_period_cb);
+
+	ck_assert_int_eq(opt_timerlat_period_cb(&opt, "1000", 0), 0);
+	ck_assert_int_eq(opt_timerlat_period_cb(&opt, NULL, 1), 0);
+	ck_assert_int_eq(period, 0);
+}
+END_TEST
+
 START_TEST(test_opt_timerlat_auto_cb)
 {
 	struct timerlat_params params = {0};
@@ -417,6 +570,21 @@ START_TEST(test_opt_timerlat_auto_cb)
 }
 END_TEST
 
+START_TEST(test_opt_timerlat_auto_cb_unset)
+{
+	struct timerlat_params params = {0};
+	struct timerlat_cb_data cb_data = {&params};
+	const struct option opt = TEST_CALLBACK(&cb_data, opt_timerlat_auto_cb);
+
+	ck_assert_int_eq(opt_timerlat_auto_cb(&opt, "10", 0), 0);
+	ck_assert_int_eq(opt_timerlat_auto_cb(&opt, NULL, 1), 0);
+	ck_assert_int_eq(params.common.stop_us, 0);
+	ck_assert_int_eq(params.common.stop_total_us, 0);
+	ck_assert_int_eq(params.print_stack, 0);
+	ck_assert_ptr_null(cb_data.trace_output);
+}
+END_TEST
+
 START_TEST(test_opt_dma_latency_cb)
 {
 	int dma_latency = 0;
@@ -447,6 +615,17 @@ START_TEST(test_opt_dma_latency_cb_max)
 }
 END_TEST
 
+START_TEST(test_opt_dma_latency_cb_unset)
+{
+	int dma_latency = 0;
+	const struct option opt = TEST_CALLBACK(&dma_latency, opt_dma_latency_cb);
+
+	ck_assert_int_eq(opt_dma_latency_cb(&opt, "1000", 0), 0);
+	ck_assert_int_eq(opt_dma_latency_cb(&opt, NULL, 1), 0);
+	ck_assert_int_eq(dma_latency, default_dma_latency);
+}
+END_TEST
+
 START_TEST(test_opt_aa_only_cb)
 {
 	struct timerlat_params params = {0};
@@ -460,6 +639,20 @@ START_TEST(test_opt_aa_only_cb)
 }
 END_TEST
 
+START_TEST(test_opt_aa_only_cb_unset)
+{
+	struct timerlat_params params = {0};
+	const struct option opt = TEST_CALLBACK(&params, opt_aa_only_cb);
+
+	ck_assert_int_eq(opt_aa_only_cb(&opt, "10", 0), 0);
+	ck_assert_int_eq(opt_aa_only_cb(&opt, NULL, 1), 0);
+	ck_assert_int_eq(params.common.stop_us, 0);
+	ck_assert_int_eq(params.common.stop_total_us, 0);
+	ck_assert_int_eq(params.print_stack, 0);
+	ck_assert_int_eq(params.common.aa_only, 0);
+}
+END_TEST
+
 START_TEST(test_opt_timerlat_trace_output_cb)
 {
 	const char *trace_output = NULL;
@@ -480,6 +673,17 @@ START_TEST(test_opt_timerlat_trace_output_cb_noarg)
 }
 END_TEST
 
+START_TEST(test_opt_timerlat_trace_output_cb_unset)
+{
+	const char *trace_output = NULL;
+	const struct option opt = TEST_CALLBACK(&trace_output, opt_timerlat_trace_output_cb);
+
+	ck_assert_int_eq(opt_timerlat_trace_output_cb(&opt, "trace.txt", 0), 0);
+	ck_assert_int_eq(opt_timerlat_trace_output_cb(&opt, NULL, 1), 0);
+	ck_assert_ptr_null(trace_output);
+}
+END_TEST
+
 START_TEST(test_opt_timerlat_on_threshold_cb)
 {
 	struct actions actions = {0};
@@ -535,6 +739,18 @@ START_TEST(test_opt_user_threads_cb)
 }
 END_TEST
 
+START_TEST(test_opt_user_threads_cb_unset)
+{
+	struct timerlat_params params = {0};
+	const struct option opt = TEST_CALLBACK(&params, opt_user_threads_cb);
+
+	ck_assert_int_eq(opt_user_threads_cb(&opt, NULL, 0), 0);
+	ck_assert_int_eq(opt_user_threads_cb(&opt, NULL, 1), 0);
+	ck_assert_int_eq(params.common.user_workload, 0);
+	ck_assert_int_eq(params.common.user_data, 0);
+}
+END_TEST
+
 START_TEST(test_opt_nano_cb)
 {
 	struct timerlat_params params = {0};
@@ -545,6 +761,17 @@ START_TEST(test_opt_nano_cb)
 }
 END_TEST
 
+START_TEST(test_opt_nano_cb_unset)
+{
+	struct timerlat_params params = {0};
+	const struct option opt = TEST_CALLBACK(&params, opt_nano_cb);
+
+	ck_assert_int_eq(opt_nano_cb(&opt, NULL, 0), 0);
+	ck_assert_int_eq(opt_nano_cb(&opt, NULL, 1), 0);
+	ck_assert_int_eq(params.common.output_divisor, default_output_divisor);
+}
+END_TEST
+
 START_TEST(test_opt_timerlat_align_cb)
 {
 	struct timerlat_params params = {0};
@@ -556,6 +783,18 @@ START_TEST(test_opt_timerlat_align_cb)
 }
 END_TEST
 
+START_TEST(test_opt_timerlat_align_cb_unset)
+{
+	struct timerlat_params params = {0};
+	const struct option opt = TEST_CALLBACK(&params, opt_timerlat_align_cb);
+
+	ck_assert_int_eq(opt_timerlat_align_cb(&opt, "500", 0), 0);
+	ck_assert_int_eq(opt_timerlat_align_cb(&opt, NULL, 1), 0);
+	ck_assert_int_eq(params.timerlat_align, 0);
+	ck_assert_int_eq(params.timerlat_align_us, 0);
+}
+END_TEST
+
 START_TEST(test_opt_stack_format_cb)
 {
 	int stack_format = 0;
@@ -576,6 +815,17 @@ START_TEST(test_opt_stack_format_cb_invalid)
 }
 END_TEST
 
+START_TEST(test_opt_stack_format_cb_unset)
+{
+	int stack_format = 0;
+	const struct option opt = TEST_CALLBACK(&stack_format, opt_stack_format_cb);
+
+	ck_assert_int_eq(opt_stack_format_cb(&opt, "full", 0), 0);
+	ck_assert_int_eq(opt_stack_format_cb(&opt, NULL, 1), 0);
+	ck_assert_int_eq(stack_format, default_stack_format);
+}
+END_TEST
+
 START_TEST(test_opt_bucket_size_cb)
 {
 	int bucket_size = 0;
@@ -606,6 +856,17 @@ START_TEST(test_opt_bucket_size_max)
 }
 END_TEST
 
+START_TEST(test_opt_bucket_size_cb_unset)
+{
+	int bucket_size = 0;
+	const struct option opt = TEST_CALLBACK(&bucket_size, opt_bucket_size_cb);
+
+	ck_assert_int_eq(opt_bucket_size_cb(&opt, "100", 0), 0);
+	ck_assert_int_eq(opt_bucket_size_cb(&opt, NULL, 1), 0);
+	ck_assert_int_eq(bucket_size, default_bucket_size);
+}
+END_TEST
+
 START_TEST(test_opt_entries_cb)
 {
 	int entries = 0;
@@ -636,6 +897,17 @@ START_TEST(test_opt_entries_max)
 }
 END_TEST
 
+START_TEST(test_opt_entries_cb_unset)
+{
+	int entries = 0;
+	const struct option opt = TEST_CALLBACK(&entries, opt_entries_cb);
+
+	ck_assert_int_eq(opt_entries_cb(&opt, "100", 0), 0);
+	ck_assert_int_eq(opt_entries_cb(&opt, NULL, 1), 0);
+	ck_assert_int_eq(entries, default_entries);
+}
+END_TEST
+
 Suite *cli_opt_callback_suite(void)
 {
 	Suite *s = suite_create("cli_opt_callback");
@@ -645,23 +917,31 @@ Suite *cli_opt_callback_suite(void)
 	tcase_add_test(tc, test_opt_llong_callback_simple);
 	tcase_add_test(tc, test_opt_llong_callback_max);
 	tcase_add_test(tc, test_opt_llong_callback_min);
+	tcase_add_test(tc, test_opt_llong_callback_unset);
+	tcase_add_test(tc, test_opt_llong_callback_unset_defval);
 	tcase_add_test(tc, test_opt_int_callback_simple);
 	tcase_add_test(tc, test_opt_int_callback_max);
 	tcase_add_test(tc, test_opt_int_callback_min);
 	tcase_add_test(tc, test_opt_int_callback_non_numeric);
 	tcase_add_test(tc, test_opt_int_callback_non_numeric_suffix);
+	tcase_add_test(tc, test_opt_int_callback_unset);
+	tcase_add_test(tc, test_opt_int_callback_unset_defval);
 	tcase_add_test(tc, test_opt_cpus_cb);
 	tcase_add_exit_test(tc, test_opt_cpus_cb_invalid, EXIT_FAILURE);
 	tcase_add_test(tc, test_opt_cgroup_cb);
 	tcase_add_test(tc, test_opt_cgroup_cb_equals);
+	tcase_add_test(tc, test_opt_cgroup_cb_unset);
 	tcase_add_test(tc, test_opt_duration_cb);
+	tcase_add_test(tc, test_opt_duration_cb_unset);
 	tcase_add_exit_test(tc, test_opt_duration_cb_invalid, EXIT_FAILURE);
 	tcase_add_test(tc, test_opt_event_cb);
 	tcase_add_test(tc, test_opt_event_cb_multiple);
 	tcase_add_test(tc, test_opt_housekeeping_cb);
 	tcase_add_exit_test(tc, test_opt_housekeeping_cb_invalid, EXIT_FAILURE);
+	tcase_add_test(tc, test_opt_housekeeping_cb_unset);
 	tcase_add_test(tc, test_opt_priority_cb);
 	tcase_add_exit_test(tc, test_opt_priority_cb_invalid, EXIT_FAILURE);
+	tcase_add_test(tc, test_opt_priority_cb_unset);
 	tcase_add_test(tc, test_opt_trigger_cb);
 	tcase_add_exit_test(tc, test_opt_trigger_cb_no_event, EXIT_FAILURE);
 	tcase_add_test(tc, test_opt_filter_cb);
@@ -670,12 +950,16 @@ Suite *cli_opt_callback_suite(void)
 
 	tc = tcase_create("osnoise");
 	tcase_add_test(tc, test_opt_osnoise_auto_cb);
+	tcase_add_test(tc, test_opt_osnoise_auto_cb_unset);
 	tcase_add_test(tc, test_opt_osnoise_period_cb);
+	tcase_add_test(tc, test_opt_osnoise_period_cb_unset);
 	tcase_add_exit_test(tc, test_opt_osnoise_period_cb_invalid, EXIT_FAILURE);
 	tcase_add_test(tc, test_opt_osnoise_runtime_cb);
 	tcase_add_exit_test(tc, test_opt_osnoise_runtime_cb_invalid, EXIT_FAILURE);
+	tcase_add_test(tc, test_opt_osnoise_runtime_cb_unset);
 	tcase_add_test(tc, test_opt_osnoise_trace_output_cb);
 	tcase_add_test(tc, test_opt_osnoise_trace_output_cb_noarg);
+	tcase_add_test(tc, test_opt_osnoise_trace_output_cb_unset);
 	tcase_add_test(tc, test_opt_osnoise_on_threshold_cb);
 	tcase_add_exit_test(tc, test_opt_osnoise_on_threshold_cb_invalid, EXIT_FAILURE);
 	tcase_add_test(tc, test_opt_osnoise_on_end_cb);
@@ -685,31 +969,42 @@ Suite *cli_opt_callback_suite(void)
 	tc = tcase_create("timerlat");
 	tcase_add_test(tc, test_opt_timerlat_period_cb);
 	tcase_add_exit_test(tc, test_opt_timerlat_period_cb_invalid, EXIT_FAILURE);
+	tcase_add_test(tc, test_opt_timerlat_period_cb_unset);
 	tcase_add_test(tc, test_opt_timerlat_auto_cb);
+	tcase_add_test(tc, test_opt_timerlat_auto_cb_unset);
 	tcase_add_test(tc, test_opt_dma_latency_cb);
 	tcase_add_exit_test(tc, test_opt_dma_latency_cb_min, EXIT_FAILURE);
 	tcase_add_exit_test(tc, test_opt_dma_latency_cb_max, EXIT_FAILURE);
+	tcase_add_test(tc, test_opt_dma_latency_cb_unset);
 	tcase_add_test(tc, test_opt_aa_only_cb);
+	tcase_add_test(tc, test_opt_aa_only_cb_unset);
 	tcase_add_test(tc, test_opt_timerlat_trace_output_cb);
 	tcase_add_test(tc, test_opt_timerlat_trace_output_cb_noarg);
+	tcase_add_test(tc, test_opt_timerlat_trace_output_cb_unset);
 	tcase_add_test(tc, test_opt_timerlat_on_threshold_cb);
 	tcase_add_exit_test(tc, test_opt_timerlat_on_threshold_cb_invalid, EXIT_FAILURE);
 	tcase_add_test(tc, test_opt_timerlat_on_end_cb);
 	tcase_add_exit_test(tc, test_opt_timerlat_on_end_cb_invalid, EXIT_FAILURE);
 	tcase_add_test(tc, test_opt_user_threads_cb);
+	tcase_add_test(tc, test_opt_user_threads_cb_unset);
 	tcase_add_test(tc, test_opt_nano_cb);
+	tcase_add_test(tc, test_opt_nano_cb_unset);
 	tcase_add_test(tc, test_opt_stack_format_cb);
 	tcase_add_exit_test(tc, test_opt_stack_format_cb_invalid, EXIT_FAILURE);
+	tcase_add_test(tc, test_opt_stack_format_cb_unset);
 	tcase_add_test(tc, test_opt_timerlat_align_cb);
+	tcase_add_test(tc, test_opt_timerlat_align_cb_unset);
 	suite_add_tcase(s, tc);
 
 	tc = tcase_create("histogram");
 	tcase_add_test(tc, test_opt_bucket_size_cb);
 	tcase_add_exit_test(tc, test_opt_bucket_size_min, EXIT_FAILURE);
 	tcase_add_exit_test(tc, test_opt_bucket_size_max, EXIT_FAILURE);
+	tcase_add_test(tc, test_opt_bucket_size_cb_unset);
 	tcase_add_test(tc, test_opt_entries_cb);
 	tcase_add_exit_test(tc, test_opt_entries_min, EXIT_FAILURE);
 	tcase_add_exit_test(tc, test_opt_entries_max, EXIT_FAILURE);
+	tcase_add_test(tc, test_opt_entries_cb_unset);
 	suite_add_tcase(s, tc);
 
 	return s;
-- 
2.54.0



* [PATCH 3/4] rtla: Add unit tests for CLI with unset
From: Tomas Glozar @ 2026-06-29  8:36 UTC (permalink / raw)
  To: Steven Rostedt, Tomas Glozar
  Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
	Wander Lairson Costa, LKML, linux-trace-kernel
In-Reply-To: <20260629083654.1548925-1-tglozar@redhat.com>

Test parsing of a command line that sets an option and then unsets it back
to its default value, in all tools.

Only two CLI tests are added for each tool: the short period option (-p ...
--no-period) and the long period option (--period ... --no-period). The
logic specific to individual options is already covered by the opt callback
tests.

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---
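As a rough illustration of what the new CLI tests check, the sketch below
walks an argument vector and lets a later --no-period undo an earlier -p or
--period. It is not the parser rtla uses; sketch_parse_period() and
SKETCH_DEFAULT_PERIOD are hypothetical, and the default of 0 is chosen only
to match the value the tests assert on.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical default; chosen to match the value the tests assert on. */
#define SKETCH_DEFAULT_PERIOD 0LL

/*
 * Scan argv left to right. "-p N" and "--period N" set the period,
 * "--no-period" puts it back to the default. The last option wins.
 * Number validation is omitted for brevity.
 */
int sketch_parse_period(int argc, const char **argv, long long *period)
{
	int i;

	assert(period != NULL);

	*period = SKETCH_DEFAULT_PERIOD;

	for (i = 0; i < argc; i++) {
		if (!strcmp(argv[i], "--no-period")) {
			*period = SKETCH_DEFAULT_PERIOD;
		} else if (!strcmp(argv[i], "-p") || !strcmp(argv[i], "--period")) {
			if (++i >= argc)
				return -1;	/* option given without a value */
			*period = strtoll(argv[i], NULL, 10);
		}
	}

	return 0;
}
```

Both the short and the long spelling end up on the same code path, which is
why the tests only need one short and one long case per tool.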
 .../tracing/rtla/tests/unit/osnoise_hist_cli.c | 18 ++++++++++++++++++
 .../tracing/rtla/tests/unit/osnoise_top_cli.c  | 18 ++++++++++++++++++
 .../rtla/tests/unit/timerlat_hist_cli.c        | 18 ++++++++++++++++++
 .../tracing/rtla/tests/unit/timerlat_top_cli.c | 18 ++++++++++++++++++
 4 files changed, 72 insertions(+)

diff --git a/tools/tracing/rtla/tests/unit/osnoise_hist_cli.c b/tools/tracing/rtla/tests/unit/osnoise_hist_cli.c
index 3661529f93dc9..221985e6759f0 100644
--- a/tools/tracing/rtla/tests/unit/osnoise_hist_cli.c
+++ b/tools/tracing/rtla/tests/unit/osnoise_hist_cli.c
@@ -37,6 +37,22 @@ START_TEST(test_period_long)
 }
 END_TEST
 
+START_TEST(test_period_unset_short)
+{
+	PARSE_ARGS("osnoise", "hist", "-p", "100000", "--no-period");
+
+	ck_assert_int_eq(osn_params->period, 0);
+}
+END_TEST
+
+START_TEST(test_period_unset_long)
+{
+	PARSE_ARGS("osnoise", "hist", "--period", "100000", "--no-period");
+
+	ck_assert_int_eq(osn_params->period, 0);
+}
+END_TEST
+
 START_TEST(test_runtime_short)
 {
 	PARSE_ARGS("osnoise", "hist", "-r", "95000");
@@ -481,6 +497,8 @@ Suite *osnoise_hist_cli_suite(void)
 	tc = tcase_create("tracing_options");
 	tcase_add_test(tc, test_period_short);
 	tcase_add_test(tc, test_period_long);
+	tcase_add_test(tc, test_period_unset_short);
+	tcase_add_test(tc, test_period_unset_long);
 	tcase_add_test(tc, test_runtime_short);
 	tcase_add_test(tc, test_runtime_long);
 	tcase_add_test(tc, test_stop_short);
diff --git a/tools/tracing/rtla/tests/unit/osnoise_top_cli.c b/tools/tracing/rtla/tests/unit/osnoise_top_cli.c
index f3a8633cc84e8..057dbe574b079 100644
--- a/tools/tracing/rtla/tests/unit/osnoise_top_cli.c
+++ b/tools/tracing/rtla/tests/unit/osnoise_top_cli.c
@@ -37,6 +37,22 @@ START_TEST(test_period_long)
 }
 END_TEST
 
+START_TEST(test_period_unset_short)
+{
+	PARSE_ARGS("osnoise", "top", "-p", "100000", "--no-period");
+
+	ck_assert_int_eq(osn_params->period, 0);
+}
+END_TEST
+
+START_TEST(test_period_unset_long)
+{
+	PARSE_ARGS("osnoise", "top", "--period", "100000", "--no-period");
+
+	ck_assert_int_eq(osn_params->period, 0);
+}
+END_TEST
+
 START_TEST(test_runtime_short)
 {
 	PARSE_ARGS("osnoise", "top", "-r", "95000");
@@ -433,6 +449,8 @@ Suite *osnoise_top_cli_suite(void)
 	tc = tcase_create("tracing_options");
 	tcase_add_test(tc, test_period_short);
 	tcase_add_test(tc, test_period_long);
+	tcase_add_test(tc, test_period_unset_short);
+	tcase_add_test(tc, test_period_unset_long);
 	tcase_add_test(tc, test_runtime_short);
 	tcase_add_test(tc, test_runtime_long);
 	tcase_add_test(tc, test_stop_short);
diff --git a/tools/tracing/rtla/tests/unit/timerlat_hist_cli.c b/tools/tracing/rtla/tests/unit/timerlat_hist_cli.c
index 968bf962f53f0..d8dd9d752636e 100644
--- a/tools/tracing/rtla/tests/unit/timerlat_hist_cli.c
+++ b/tools/tracing/rtla/tests/unit/timerlat_hist_cli.c
@@ -55,6 +55,22 @@ START_TEST(test_period_long)
 }
 END_TEST
 
+START_TEST(test_period_unset_short)
+{
+	PARSE_ARGS("timerlat", "hist", "-p", "200", "--no-period");
+
+	ck_assert_int_eq(tlat_params->timerlat_period_us, 0);
+}
+END_TEST
+
+START_TEST(test_period_unset_long)
+{
+	PARSE_ARGS("timerlat", "hist", "--period", "200", "--no-period");
+
+	ck_assert_int_eq(tlat_params->timerlat_period_us, 0);
+}
+END_TEST
+
 START_TEST(test_stack_short)
 {
 	PARSE_ARGS("timerlat", "hist", "-s", "20");
@@ -629,6 +645,8 @@ Suite *timerlat_hist_cli_suite(void)
 	tcase_add_test(tc, test_irq_long);
 	tcase_add_test(tc, test_period_short);
 	tcase_add_test(tc, test_period_long);
+	tcase_add_test(tc, test_period_unset_short);
+	tcase_add_test(tc, test_period_unset_long);
 	tcase_add_test(tc, test_stack_short);
 	tcase_add_test(tc, test_stack_long);
 	tcase_add_test(tc, test_thread_short);
diff --git a/tools/tracing/rtla/tests/unit/timerlat_top_cli.c b/tools/tracing/rtla/tests/unit/timerlat_top_cli.c
index 33aa6588d503b..e9fb1a86ab8c4 100644
--- a/tools/tracing/rtla/tests/unit/timerlat_top_cli.c
+++ b/tools/tracing/rtla/tests/unit/timerlat_top_cli.c
@@ -55,6 +55,22 @@ START_TEST(test_period_long)
 }
 END_TEST
 
+START_TEST(test_period_unset_short)
+{
+	PARSE_ARGS("timerlat", "top", "-p", "200", "--no-period");
+
+	ck_assert_int_eq(tlat_params->timerlat_period_us, 0);
+}
+END_TEST
+
+START_TEST(test_period_unset_long)
+{
+	PARSE_ARGS("timerlat", "top", "--period", "200", "--no-period");
+
+	ck_assert_int_eq(tlat_params->timerlat_period_us, 0);
+}
+END_TEST
+
 START_TEST(test_stack_short)
 {
 	PARSE_ARGS("timerlat", "top", "-s", "20");
@@ -571,6 +587,8 @@ Suite *timerlat_top_cli_suite(void)
 	tcase_add_test(tc, test_irq_long);
 	tcase_add_test(tc, test_period_short);
 	tcase_add_test(tc, test_period_long);
+	tcase_add_test(tc, test_period_unset_short);
+	tcase_add_test(tc, test_period_unset_long);
 	tcase_add_test(tc, test_stack_short);
 	tcase_add_test(tc, test_stack_long);
 	tcase_add_test(tc, test_thread_short);
-- 
2.54.0



* [PATCH 4/4] Documentation/rtla: Document unsetting options
From: Tomas Glozar @ 2026-06-29  8:36 UTC (permalink / raw)
  To: Steven Rostedt, Tomas Glozar
  Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
	Wander Lairson Costa, LKML, linux-trace-kernel
In-Reply-To: <20260629083654.1548925-1-tglozar@redhat.com>

Add an appendix documenting how to unset options in RTLA. For options
where unsetting is currently not supported, add a note to the
respective section.

An additional note is added for --on-threshold trace. As it is
considered distinct from --trace, it is not reverted by --no-trace.

Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---
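The "--trace --no-auto" behaviour described in the appendix follows from how
a multi-parameter option is unset. A minimal sketch, assuming a simplified
parameter struct: struct sketch_params, sketch_auto_cb() and
sketch_trace_cb() are hypothetical and only loosely modelled on the
callbacks in patch 1/4.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameter set; field names only loosely follow the series. */
struct sketch_params {
	long long stop_us;
	long long print_stack;
	const char *trace_output;
};

/*
 * An "--auto"-like option sets several parameters at once, so unsetting
 * it clears all of them, including trace output that was enabled
 * separately by a "--trace"-like option.
 */
int sketch_auto_cb(struct sketch_params *p, long long thresh, int unset)
{
	assert(p != NULL);

	if (unset) {
		p->stop_us = 0;
		p->print_stack = 0;
		p->trace_output = NULL;
		return 0;
	}

	p->stop_us = thresh;
	p->print_stack = thresh;
	if (!p->trace_output)
		p->trace_output = "sketch_trace.txt";
	return 0;
}

/* A "--trace"-like option touches only the trace output. */
int sketch_trace_cb(struct sketch_params *p, const char *file, int unset)
{
	assert(p != NULL);

	if (unset)
		p->trace_output = NULL;
	else
		p->trace_output = file ? file : "sketch_trace.txt";
	return 0;
}
```

Calling sketch_trace_cb() with a file name and then sketch_auto_cb() with
unset == 1 leaves trace_output NULL, which mirrors the documented case.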
 Documentation/tools/rtla/common_appendix.txt    | 17 +++++++++++++++++
 Documentation/tools/rtla/common_options.txt     | 13 ++++++++++++-
 .../tools/rtla/common_osnoise_options.txt       |  4 ++++
 .../tools/rtla/common_timerlat_options.txt      |  4 ++++
 4 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/Documentation/tools/rtla/common_appendix.txt b/Documentation/tools/rtla/common_appendix.txt
index 68cb15840d3a9..ad610ed02a240 100644
--- a/Documentation/tools/rtla/common_appendix.txt
+++ b/Documentation/tools/rtla/common_appendix.txt
@@ -1,5 +1,22 @@
 .. SPDX-License-Identifier: GPL-2.0
 
+UNSETTING OPTIONS
+=================
+
+The effect of most command line options can be reverted by prepending "no-" to
+the long variant of the option, for example:
+
+$ rtla timerlat top -p 100 --no-period
+
+resets the period back to the default value of 1000 us.
+
+If a command line option sets multiple RTLA parameters at once, the inverted
+option will revert all of them, even if they were not set by the particular
+option. For example, since using "--auto" implies "--trace", specifying
+"--trace --no-auto" will also disable trace output, just as if "--no-trace"
+had been specified.
+
+
 SIGINT BEHAVIOR
 ===============
 
diff --git a/Documentation/tools/rtla/common_options.txt b/Documentation/tools/rtla/common_options.txt
index 6caa51d029347..38da1cf443a48 100644
--- a/Documentation/tools/rtla/common_options.txt
+++ b/Documentation/tools/rtla/common_options.txt
@@ -22,10 +22,14 @@
 
         Enable an event in the trace (**-t**) session. The argument can be a specific event, e.g., **-e** *sched:sched_switch*, or all events of a system group, e.g., **-e** *sched*. Multiple **-e** are allowed. It is only active when **-t** or **-a** are set.
 
+        This option cannot be unset.
+
 **--filter** *<filter>*
 
         Filter the previous **-e** *sys:event* event with *<filter>*. For further information about event filtering see https://www.kernel.org/doc/html/latest/trace/events.html#event-filtering.
 
+        This option cannot be unset.
+
 **--trigger** *<trigger>*
         Enable a trace event trigger to the previous **-e** *sys:event*.
         If the *hist:* trigger is activated, the output histogram will be automatically saved to a file named *system_event_hist.txt*.
@@ -37,6 +41,8 @@
 
         For further information about event trigger see https://www.kernel.org/doc/html/latest/trace/events.html#event-triggers.
 
+        This option cannot be unset.
+
 **-P**, **--priority** *o:prio|r:prio|f:prio|d:runtime:period*
 
         Set scheduling parameters to the |tool| tracer threads, the format to set the priority are:
@@ -78,7 +84,8 @@
 
           Saves trace output, optionally taking a filename. Alternative to -t/--trace.
           Note that unlike -t/--trace, specifying this multiple times will result in
-          the trace being saved multiple times.
+          the trace being saved multiple times, and --no-trace will not disable trace
+          output when enabled through this option.
 
         - *signal,num=<sig>,pid=<pid>*
 
@@ -107,6 +114,8 @@
 
         |actionsperf|
 
+        This option cannot be unset.
+
 **--on-end** *action*
 
         Defines an action to be executed at the end of tracing.
@@ -124,6 +133,8 @@
 
         This runs rtla with the default options, and saves trace output at the end.
 
+        This option cannot be unset.
+
 **-h**, **--help**
 
         Print help menu.
diff --git a/Documentation/tools/rtla/common_osnoise_options.txt b/Documentation/tools/rtla/common_osnoise_options.txt
index bd3c4f4991939..5fc70c0016158 100644
--- a/Documentation/tools/rtla/common_osnoise_options.txt
+++ b/Documentation/tools/rtla/common_osnoise_options.txt
@@ -24,11 +24,15 @@
         Stop the trace if a single sample is higher than the argument in microseconds.
         If **-T** is set, it will also save the trace to the output.
 
+        This option cannot be unset.
+
 **-S**, **--stop-total** *us*
 
         Stop the trace if the total sample is higher than the argument in microseconds.
         If **-T** is set, it will also save the trace to the output.
 
+        This option cannot be unset.
+
 **-T**, **--threshold** *us*
 
         Specify the minimum delta between two time reads to be considered noise.
diff --git a/Documentation/tools/rtla/common_timerlat_options.txt b/Documentation/tools/rtla/common_timerlat_options.txt
index 100840f4c0ed0..e36898438a0b0 100644
--- a/Documentation/tools/rtla/common_timerlat_options.txt
+++ b/Documentation/tools/rtla/common_timerlat_options.txt
@@ -23,10 +23,14 @@
 
         Stop trace if the *IRQ* latency is higher than the argument in us.
 
+        This option cannot be unset.
+
 **-T**, **--thread** *us*
 
         Stop trace if the *Thread* latency is higher than the argument in us.
 
+        This option cannot be unset.
+
 **-s**, **--stack** *us*
 
         Save the stack trace at the *IRQ* if a *Thread* latency is higher than the
-- 
2.54.0



* Re: [RFC PATCH 1/3] mm/compaction: skip isolate mlocked folios when compact_unevictable_allowed=0
From: Wandun @ 2026-06-29  9:07 UTC (permalink / raw)
  To: Alexander Krabler, Vlastimil Babka (SUSE), linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
	linux-rt-devel@lists.linux.dev
  Cc: akpm@linux-foundation.org, surenb@google.com, mhocko@suse.com,
	jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com,
	rostedt@goodmis.org, mhiramat@kernel.org,
	mathieu.desnoyers@efficios.com, david@kernel.org, ljs@kernel.org,
	liam@infradead.org, rppt@kernel.org, bigeasy@linutronix.de,
	clrkwllms@kernel.org, Hugh Dickins
In-Reply-To: <PR3PR01MB6666EC8D53E75F742B37270282EB2@PR3PR01MB6666.eurprd01.prod.exchangelabs.com>



On 6/26/26 21:42, Alexander Krabler wrote:
> On 6/26/26 11:38, Wandun wrote:
>> On 6/26/26 16:45, Alexander Krabler wrote:
>>> However, we were not able to reproduce the actual race
>>> (mlockall() process waiting on a migration PTE),
>>> not in the past, not now. Might be hard to trigger that race.
>>
>> It is not hard to trigger that case. I added a debug message, as shown below,
>> and lots of messages appear within a few seconds.
>>
>> diff --cc mm/memory.c
>> index ff338c2abe92,ff338c2abe92..6552b3b14f78
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@@ -4768,6 -4768,6 +4768,8 @@@ vm_fault_t do_swap_page(struct vm_faul
>>                 if (softleaf_is_migration(entry)) {
>>                         migration_entry_wait(vma->vm_mm, vmf->pmd,
>>                                              vmf->address);
>> +                       if (!strcmp(current->comm, "repro"))
>> +                               pr_err("============== hit ================\n");
>>                 } else if (softleaf_is_device_exclusive(entry)) {
>>                         vmf->page = softleaf_to_page(entry);
>>                         ret = remove_device_exclusive_entry(vmf);
> 
> I have set a kprobe on migration_entry_wait that logs into an ftrace buffer
> (including the kernel stack trace).
> Yes, this function is hit, but only inside the mmap syscall, which is okay,
> since memory allocation is not realtime-safe.
> 
>            repro-2090    [002] d....   811.129549: frt_migration_entry_wait: (migration_entry_wait+0x0/0x100)
>            repro-2090    [002] d....   811.129553: <stack trace>
>  => migration_entry_wait
>  => __handle_mm_fault
>  => handle_mm_fault
>  => __get_user_pages
>  => populate_vma_page_range
>  => __mm_populate
>  => vm_mmap_pgoff
>  => ksys_mmap_pgoff
>  => __arm64_sys_mmap
>  => el0_svc_common.constprop.0
>  => do_el0_svc
>  => el0_svc
>  => el0t_64_sync_handler
>  => el0t_64_sync
> 
> The original race was an instruction abort interrupt out of nothing, due
> to the migration PTE set by kcompactd.
> I see this kind of race quite often in non-mlockall() processes,
> but can't reproduce it in memory-locked processes.
> 
> Example:
>           podman-832     [000] d....   812.447820: frt_migration_entry_wait: (migration_entry_wait+0x0/0x100)
>           podman-832     [000] d....   812.447823: <stack trace>
>  => migration_entry_wait
>  => __handle_mm_fault
>  => handle_mm_fault
>  => do_page_fault
>  => do_translation_fault
>  => do_mem_abort
>  => el0_da
>  => el0t_64_sync_handler
>  => el0t_64_sync

Hi, Alexander

In terms of root cause, there is no fundamental difference between these
two call stacks. I modified the reproduction program, and it can still
reproduce the second call stack (although less frequently). The complete
reproduction program is as follows:


#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/sysinfo.h>
#include <unistd.h>

#define PAGE_SIZE       4096
#define NR_PAGES        10000

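static void *worker_fn(void *arg)
{
	int fd = (long)arg;
	size_t len = NR_PAGES * PAGE_SIZE;

	while (1) {
		/*
		 * Truncate to zero and back so the file's pages are freed and
		 * must be faulted in again. Errors are deliberately ignored;
		 * the empty bodies only silence unused-result warnings.
		 */
		if (ftruncate(fd, 0) < 0) {}
		if (ftruncate(fd, len) < 0) {}

		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			continue;

		/* Future mappings get locked as their pages are faulted in. */
		mlockall(MCL_ONFAULT | MCL_FUTURE);

		/* Touch every byte so each page of the mapping is faulted in. */
		for (int i = 0; i < NR_PAGES; i++) {
			for (int j = 0; j < PAGE_SIZE; j++) {
				p[i * PAGE_SIZE + j] = 1;
			}
		}

		usleep(200);
		munmap(p, len);
	}
	return NULL;
}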

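static void *compact_fn(void *arg)
{
	(void)arg;
	/* Writing "1" here asks the kernel to compact memory. */
	int fd = open("/proc/sys/vm/compact_memory", O_WRONLY);
	if (fd < 0)
		return NULL;

	/* Trigger compaction every 5 ms to race with the workers. */
	while (1) {
		if (write(fd, "1", 1) < 0) {}
		usleep(5000);
	}
}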

int main(void)
{
	int nproc = sysconf(_SC_NPROCESSORS_ONLN);
	if (nproc < 1)
		nproc = 1;

	int *fds = calloc((size_t)nproc, sizeof(int));
	if (!fds)
		return 1;

	size_t len = NR_PAGES * PAGE_SIZE;
	for (int i = 0; i < nproc; i++) {
		char path[64];
		snprintf(path, sizeof(path), "./repro_%d.dat", i);
		unlink(path);
		fds[i] = open(path, O_RDWR | O_CREAT, 0600);
		if (fds[i] < 0)
			return 1;
		if (ftruncate(fds[i], len) < 0)
			return 1;
	}

	printf("repro: %d workers, %d pages, Ctrl-C to stop\n",
	       nproc, NR_PAGES);

	pthread_t compact;
	pthread_create(&compact, NULL, compact_fn, NULL);

	pthread_t *threads = calloc((size_t)nproc, sizeof(pthread_t));
	if (!threads)
		return 1;
	for (int i = 0; i < nproc; i++)
		pthread_create(&threads[i], NULL, worker_fn, (void *)(long)fds[i]);

	pthread_join(compact, NULL);
	return 0;
}
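
The debug message quoted above only fires when current->comm is "repro",
so the binary either has to be named repro or the name has to be set
explicitly. A minimal sketch of the latter follows, assuming that debug patch
is applied; set_repro_comm() is a hypothetical helper and not part of the
program above:

```c
#include <string.h>
#include <sys/prctl.h>

/* Hypothetical helper, not part of the reproducer above. */
static int set_repro_comm(void)
{
	char name[16] = { 0 };	/* PR_GET_NAME needs at least 16 bytes */

	/* Sets the comm of the calling thread only. */
	if (prctl(PR_SET_NAME, "repro", 0, 0, 0) < 0)
		return -1;

	/* Read it back to confirm what the kernel now reports as comm. */
	if (prctl(PR_GET_NAME, name, 0, 0, 0) < 0)
		return -1;

	return strcmp(name, "repro") == 0 ? 0 : -1;
}
```

Calling it at the top of main(), before any pthread_create(), should be
enough, since new threads start with a copy of the creating thread's comm.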




> 
> Thanks,
> Alexander



* Re: [RFC PATCH 00/40] mm: reliable 1GB page allocation
From: Lorenzo Stoakes @ 2026-06-29  9:29 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, kernel-team, linux-mm, david, willy, surenb, hannes,
	ziy, usama.arif, fvdl, Andrew Morton, Jonathan Corbet,
	Chris Mason, David Sterba, Vlastimil Babka, Steven Rostedt,
	Masami Hiramatsu, Rafael J. Wysocki, Oscar Salvador,
	Mike Rapoport, linux-doc, linux-btrfs, linux-trace-kernel,
	linux-pm, linux-cxl, Linus Torvalds
In-Reply-To: <528e3a5fbc27c9dc7a098121c32b7679b4c9962a.camel@surriel.com>

TL;DR - please don't send unfiltered LLM code to list _at all_. If you want
to share it, link to a repo.

On Sat, Jun 27, 2026 at 09:36:51AM -0400, Rik van Riel wrote:
> That is the one reason I sent out RFC code before it
> is ready. I am looking for feedback on the concepts
> in this series.
...
> Once I know what I need to do, coming up with a
> cleaner implementation is very doable.
...
> The mess in the RFC is the result of trying something
> that seemed right, watching it fail in some subtle
> way, and trying to fix it up.
...
> > But the execution has to be _completely_ rethought.
>
> There's no argument there.
...
> > Another issue here is maintainer time - even this _extremely_ light-
> > touch
> > review has taken me a few hours (of my weekend :). To review it in
> > detail
> > would take probably DAYS of dedicated work.
>
> I suspect there is a mismatch in expectations here.
>
> I already knew this code has to be totally redone.

I'm glad we are in agreement on this :)

But in general I feel you have sent this and at least one other series like this
without being as clear as you should have been.

I hate to belabour the point but just to be clear:

* You label one patch [DO-NOT-MERGE], but none of the others (implying they
  are candidates for being merged) [0] and the cover letter has TODOs,
  including trivia like naming, but nothing about the code.

* You sent a non-RFC series with identical code quality issues [1]
  recently.

* Until I pointed it out, you were responding to other review here as if
  the series was genuinely intended for (eventual) merge:

  - "This is a userspace-visible removal. Writes to
     /proc/sys/vm/watermark_boost_factor will now return -ENOENT instead of
     being accepted, breaking userspace." [2]

     <-: "I'll just drop this patch for now." [3]

  - "I left a small code nit inline, but whether you take that suggestion
     or leave it, you can add Reviewed-by: ..." [4]

    <-: "I sent it with this series mostly because it's needed to make the
    series work, and to provide context on why it's needed. I'm happy to
    resend it with a GFP mask passed in by each caller. That would look
    better, indeed!" [5]

So to be concrete, if you send really rough code, use [pre-RFC] or [DO NOT
MERGE] (on the series as a whole) to make that clear and say so in the
cover letter VERY VERY clearly.

Or, you can put it in a repo somewhere and link it in an email discussing
the concepts (like I did with scalable CoW for instance).

Also if people respond to the series as if it isn't pre-RFC, I'd suggest in
your replies saying something like 'I intend to completely rework all this
anyway' or something like that! :)

> How do people feel about splitting up the free lists,
> so each gigabyte (well, PUD sized) chunk of memory
> has its own free lists?
>
> How can we balance the desire for higher-order kernel
> allocations, against the desire to preserve gigabyte
> sized chunks of memory that can be used for user space?
...
> That's another big question. How do we balance the
> desire to keep compaction overhead low with the desire
> to do higher order allocations almost everywhere?
>
> >
...
>
> I am just hoping to figure out what I should be
> doing on a conceptual level, before figuring out
> how to do it cleanly.
>
...
>
> I was looking for feedback on the basic concepts
> and design in the patch series, but failed to
> clearly communicate that.
>
> You provided some detailed feedback on the code,
> but as of yet nobody has really provided any
> opinions on things like whether it is desirable
> at all to have the free lists per gigablock,
> or whether we need to come up with some totally
> different approach.
>
> How do we better communicate that kind of thing
> in the future?
>
> Is that something to spell out more clearly in
> the cover letter?
>
> Is that kind of feedback something developers
> could even reasonably ask for? (if not, how do
> we figure out what maintainers want?)

As above, firstly make it clear that the code you are sending for review is
not to be reviewed so people don't waste highly contended maintainer time
on that! :)

Also, you didn't respond to my point regarding cc'ing the right people -
but that's clearly something you need to get right if you want this kind of
feedback to start with.

For instance, you didn't cc the page allocator maintainer (Vlastimil) on a
series that is fundamentally changing the page allocator. That's not going
to help with feedback.

In general, this area of the page allocator and compaction isn't my
specialism in the kernel so I can't give you the in-depth feedback you need
on that.

But I do have thoughts in general as to how to achieve what you want here:

Firstly - you should try to summarise what you're doing here and what
you're changing alongside the trade-offs as clearly as you can in the cover
letter.

Then highlight what it is you need feedback on, broken out into clear
questions or points that make it easy for people to respond to.

And _you have already done this_ in your reply here:

* "How do people feel about splitting up the free lists, so each gigabyte
   (well, PUD sized) chunk of memory has its own free lists?"

* "How can we balance the desire for higher-order kernel allocations,
  against the desire to preserve gigabyte sized chunks of memory that can
  be used for user space?"

* "How do we balance the desire to keep compaction overhead low with the
   desire to do higher order allocations almost everywhere?"

I think a really good way of doing this would be to start out with
something like:

	Right now compaction often fails to achieve what we need, with
	fragmentation occurring anyway and (for instance) THP stalling on
	the availability of higher order folios.

etc. etc.

Summarising _the problem_.

Then a section about your proposed solution, e.g.:

	I propose a means by which we proactively achieve gigabyte-sized
	pageblocks with logic which maintains these as physically
	contiguous under both ordinary and contended workloads

Then list out the "secret sauce" of your approach, e.g.:

	This works by arranging memory such that unmovable allocations are
	grouped at <blah blah blah> etc.

Then raise your questions e.g.:

	I'd like to ask the community - how do people feel about splitting
	up the free lists, so each gigabyte (well, PUD sized) chunk of
	memory has its own free lists? <etc. etc.>

Then make it clear whether this is an RFC that is ready for primetime or
not:

	This series is simply intended as a proof-of-concept - PLEASE DO
	NOT REVIEW THE CODE per-se, but rather comment on the concepts!

(And obviously as above, if that _is_ what you intend, underline it with
[DO NOT MERGE] or [pre-RFC] or something like that).

I'd also very strongly suggest (as I did in my original reply) breaking out
parts that can be broken out as prerequisite series.

If you're doing something good or useful _anyway_ then just send that
separately first, and have later work rely on the earlier work.

There's no rush, this is huge and will take time.

A final KEY point:

NEVER submit unfiltered code generated by an LLM to the list in _any_
form. If you want people to access code like that to test or something,
then put it in a remote repo and link to it.

The code is SO overly complicated and SO messy that it's really difficult
for people to understand what's actually going on.

At the heart of what you need here is CLARITY.

You need to CLEARLY communicate what it is you're doing so busy maintainers
can examine it. That's the _only_ way you're going to get something like
this merged.

The LLM-generated code is so awful that ain't nobody got the time to try to
understand what it's doing.

The workload for this really has to be on submitters, not maintainers.

And what you've done, even if not intended, is workslopping, and that's
really not acceptable. Quoting the kernel process on tool-generated content
[6]:

"If tools permit you to generate a contribution automatically, expect
additional scrutiny in proportion to how much of it was generated.

As with the output of any tooling, the result may be incorrect or
inappropriate. You are expected to understand and to be able to defend
everything you submit. If you are unable to do so, then do not submit the
resulting changes.

If you do so anyway, maintainers are entitled to reject your series without
detailed review."

As per this and my previous reply, AI slop doesn't scale, even as an RFC -
I won't have time to reply like this in future, and we will just have to
reject your series out of hand, which helps nobody.

>
>
> --
> All Rights Reversed.

Thanks, Lorenzo

[0]:https://lore.kernel.org/all/20260520150018.2491267-41-riel@surriel.com/
[1]:https://lore.kernel.org/linux-mm/20260616190300.1509639-1-riel@surriel.com/
[2]:https://lore.kernel.org/all/20260526140204.1390573-1-usama.arif@linux.dev/
[3]:https://lore.kernel.org/all/2ecf71858845e7d14c718b1a6845389cb78b986e.camel@surriel.com/
[4]:https://lore.kernel.org/all/20260520174749.GA1458531@zen.localdomain/
[5]:https://lore.kernel.org/all/daa29c92f055d028a5b3ec0e42cfb1ee1496a593.camel@surriel.com/
[6]:https://docs.kernel.org/process/generated-content.html
