Linux Trace Kernel
 help / color / mirror / Atom feed
* [PATCH v2 6/8] riscv: stacktrace: switch to frame-pointer based unwinder
From: Wang Han @ 2026-05-28  8:23 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Steven Rostedt, Alexandre Ghiti, Masami Hiramatsu, Mark Rutland,
	Catalin Marinas, Chen Pei, Andy Chiu, Björn Töpel,
	Deepak Gupta, Puranjay Mohan, Conor Dooley, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, oliver.yang, xueshuai, zhuo.song, jkchen,
	linux-riscv, linux-kernel, linux-trace-kernel, live-patching,
	linux-kselftest, linux-perf-users
In-Reply-To: <20260527123530.2593918-1-wanghan@linux.alibaba.com>

Replace the open-coded frame-pointer walker in arch_stack_walk() with a
robust kunwind state machine, modelled on arch/arm64/kernel/stacktrace.c
and retargeted to the RISC-V {fp, ra} frame record convention. The new
walker tracks stack bounds, consumes frame records monotonically,
understands the metadata pt_regs records added in the previous frame
record metadata patch, and recovers return addresses replaced by
function graph tracing and kretprobes.

This commit introduces arch_stack_walk_reliable() but does not yet
select HAVE_RELIABLE_STACKTRACE; that is done in a follow-up Kconfig
patch so this commit can be reviewed and bisected as a pure unwinder
replacement. Until that Kconfig change lands, livepatch is not yet
enabled and arch_stack_walk_reliable() has no in-tree caller.

Three related callers are updated to keep the same frame-record
assumptions everywhere:

  * Function graph tracing: the old RISC-V unwinder matched function
    graph return-stack entries by the saved return-address slot. That
    was consistent with the static mcount path, but not with the dynamic
    ftrace path where the parent slot is ftrace_regs::ra. Use the
    architectural frame pointer as the function graph return-address
    cookie, matching the kunwind walker.

  * Perf callchains: route kernel callchain collection through
    arch_stack_walk() so perf sees the same frame-pointer unwind
    behaviour as dump_stack() and the upcoming livepatch path.

  * dump_backtrace() / __get_wchan() / show_stack(): these now go
    through arch_stack_walk(); the explicit "Call Trace:" header is
    moved into dump_backtrace() to preserve the original output.

The non-frame-pointer fallback walker is kept untouched for
!CONFIG_FRAME_POINTER builds.

Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 arch/riscv/kernel/ftrace.c         |   6 +-
 arch/riscv/kernel/perf_callchain.c |   2 +-
 arch/riscv/kernel/stacktrace.c     | 560 ++++++++++++++++++++++++-----
 3 files changed, 472 insertions(+), 96 deletions(-)

diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c
index b430edfb83f4..5d55199a9230 100644
--- a/arch/riscv/kernel/ftrace.c
+++ b/arch/riscv/kernel/ftrace.c
@@ -242,7 +242,8 @@ void prepare_ftrace_return(unsigned long *parent, unsigned long self_addr,
 	 */
 	old = *parent;
 
-	if (!function_graph_enter(old, self_addr, frame_pointer, parent))
+	if (!function_graph_enter(old, self_addr, frame_pointer,
+				  (void *)frame_pointer))
 		*parent = return_hooker;
 }
 
@@ -264,7 +265,8 @@ void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
 	 */
 	old = *parent;
 
-	if (!function_graph_enter_regs(old, ip, frame_pointer, parent, fregs))
+	if (!function_graph_enter_regs(old, ip, frame_pointer,
+				       (void *)frame_pointer, fregs))
 		*parent = return_hooker;
 }
 #endif /* CONFIG_DYNAMIC_FTRACE */
diff --git a/arch/riscv/kernel/perf_callchain.c b/arch/riscv/kernel/perf_callchain.c
index b465bc9eb870..436af96ea59c 100644
--- a/arch/riscv/kernel/perf_callchain.c
+++ b/arch/riscv/kernel/perf_callchain.c
@@ -44,5 +44,5 @@ void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
 		return;
 	}
 
-	walk_stackframe(NULL, regs, fill_callchain, entry);
+	arch_stack_walk(fill_callchain, entry, NULL, regs);
 }
diff --git a/arch/riscv/kernel/stacktrace.c b/arch/riscv/kernel/stacktrace.c
index 2692d3a06afa..0d76320b3a29 100644
--- a/arch/riscv/kernel/stacktrace.c
+++ b/arch/riscv/kernel/stacktrace.c
@@ -11,98 +11,16 @@
 #include <linux/sched/task_stack.h>
 #include <linux/stacktrace.h>
 #include <linux/ftrace.h>
+#include <linux/kprobes.h>
+#include <linux/llist.h>
 
 #include <asm/stacktrace.h>
 
-#ifdef CONFIG_FRAME_POINTER
-
 /*
- * This disables KASAN checking when reading a value from another task's stack,
- * since the other task could be running on another CPU and could have poisoned
- * the stack in the meantime.
+ * Non-frame-pointer fallback unwinder.
+ * Only compiled when CONFIG_FRAME_POINTER is not enabled.
  */
-#define READ_ONCE_TASK_STACK(task, x)			\
-({							\
-	unsigned long val;				\
-	unsigned long addr = x;				\
-	if ((task) == current)				\
-		val = READ_ONCE(addr);			\
-	else						\
-		val = READ_ONCE_NOCHECK(addr);		\
-	val;						\
-})
-
-extern asmlinkage void handle_exception(void);
-extern unsigned long ret_from_exception_end;
-
-static inline int fp_is_valid(unsigned long fp, unsigned long sp)
-{
-	unsigned long low, high;
-
-	low = sp + sizeof(struct stackframe);
-	high = ALIGN(sp, THREAD_SIZE);
-
-	return !(fp < low || fp > high || fp & 0x07);
-}
-
-void notrace walk_stackframe(struct task_struct *task, struct pt_regs *regs,
-			     bool (*fn)(void *, unsigned long), void *arg)
-{
-	unsigned long fp, sp, pc;
-	int graph_idx = 0;
-	int level = 0;
-
-	if (regs) {
-		fp = frame_pointer(regs);
-		sp = user_stack_pointer(regs);
-		pc = instruction_pointer(regs);
-	} else if (task == NULL || task == current) {
-		fp = (unsigned long)__builtin_frame_address(0);
-		sp = current_stack_pointer;
-		pc = (unsigned long)walk_stackframe;
-		level = -1;
-	} else {
-		/* task blocked in __switch_to */
-		fp = task->thread.s[0];
-		sp = task->thread.sp;
-		pc = task->thread.ra;
-	}
-
-	for (;;) {
-		struct stackframe *frame;
-
-		if (unlikely(!__kernel_text_address(pc) || (level++ >= 0 && !fn(arg, pc))))
-			break;
-
-		if (unlikely(!fp_is_valid(fp, sp)))
-			break;
-
-		/* Unwind stack frame */
-		frame = (struct stackframe *)fp - 1;
-		sp = fp;
-		if (regs && (regs->epc == pc) && fp_is_valid(frame->ra, sp)) {
-			/* We hit function where ra is not saved on the stack */
-			fp = frame->ra;
-			pc = regs->ra;
-		} else {
-			fp = READ_ONCE_TASK_STACK(task, frame->fp);
-			pc = READ_ONCE_TASK_STACK(task, frame->ra);
-			pc = ftrace_graph_ret_addr(task, &graph_idx, pc,
-						   &frame->ra);
-			if (pc >= (unsigned long)handle_exception &&
-			    pc < (unsigned long)&ret_from_exception_end) {
-				if (unlikely(!fn(arg, pc)))
-					break;
-
-				pc = ((struct pt_regs *)sp)->epc;
-				fp = ((struct pt_regs *)sp)->s0;
-			}
-		}
-
-	}
-}
-
-#else /* !CONFIG_FRAME_POINTER */
+#ifndef CONFIG_FRAME_POINTER
 
 void notrace walk_stackframe(struct task_struct *task,
 	struct pt_regs *regs, bool (*fn)(void *, unsigned long), void *arg)
@@ -133,7 +51,12 @@ void notrace walk_stackframe(struct task_struct *task,
 	}
 }
 
-#endif /* CONFIG_FRAME_POINTER */
+#endif /* !CONFIG_FRAME_POINTER */
+
+/*
+ * Common trace helpers.
+ * These are used by both the FP (kunwind) and non-FP (walk_stackframe) paths.
+ */
 
 static bool print_trace_address(void *arg, unsigned long pc)
 {
@@ -146,12 +69,12 @@ static bool print_trace_address(void *arg, unsigned long pc)
 noinline void dump_backtrace(struct pt_regs *regs, struct task_struct *task,
 		    const char *loglvl)
 {
-	walk_stackframe(task, regs, print_trace_address, (void *)loglvl);
+	printk("%sCall Trace:\n", loglvl);
+	arch_stack_walk(print_trace_address, (void *)loglvl, task, regs);
 }
 
 void show_stack(struct task_struct *task, unsigned long *sp, const char *loglvl)
 {
-	pr_cont("%sCall Trace:\n", loglvl);
 	dump_backtrace(NULL, task, loglvl);
 }
 
@@ -171,17 +94,468 @@ unsigned long __get_wchan(struct task_struct *task)
 
 	if (!try_get_task_stack(task))
 		return 0;
-	walk_stackframe(task, NULL, save_wchan, &pc);
+	arch_stack_walk(save_wchan, &pc, task, NULL);
 	put_task_stack(task);
 	return pc;
 }
 
-noinline noinstr void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
-		     struct task_struct *task, struct pt_regs *regs)
+/*
+ * Frame-pointer-based kernel unwind infrastructure.
+ * Only compiled when CONFIG_FRAME_POINTER is enabled.
+ *
+ * See: arch/arm64/kernel/stacktrace.c for the reference implementation.
+ */
+#ifdef CONFIG_FRAME_POINTER
+
+/*
+ * Per-cpu stacks are only accessible when unwinding the current task in a
+ * non-preemptible context.
+ */
+#define STACKINFO_CPU(task, name)				\
+	({							\
+		(((task) == current) && !preemptible())		\
+			? stackinfo_get_##name()		\
+			: stackinfo_get_unknown();		\
+	})
+
+enum kunwind_source {
+	KUNWIND_SOURCE_UNKNOWN,
+	KUNWIND_SOURCE_FRAME,
+	KUNWIND_SOURCE_CALLER,
+	KUNWIND_SOURCE_TASK,
+	KUNWIND_SOURCE_REGS_PC,
+};
+
+union unwind_flags {
+	unsigned long	all;
+	struct {
+		unsigned long	fgraph : 1,
+				kretprobe : 1;
+	};
+};
+
+/*
+ * Kernel unwind state
+ *
+ * @common:    Common unwind state.
+ * @task:      The task being unwound.
+ * @graph_idx: Used by ftrace_graph_ret_addr() for optimized stack unwinding.
+ * @kr_cur:    When KRETPROBES is selected, holds the kretprobe instance
+ *             associated with the most recently encountered replacement ra
+ *             value.
+ */
+struct kunwind_state {
+	struct unwind_state common;
+	struct task_struct *task;
+	int graph_idx;
+#ifdef CONFIG_KRETPROBES
+	struct llist_node *kr_cur;
+#endif
+	enum kunwind_source source;
+	union unwind_flags flags;
+	struct pt_regs *regs;
+};
+
+static __always_inline void
+kunwind_init(struct kunwind_state *state,
+	     struct task_struct *task)
+{
+	unwind_init_common(&state->common);
+	state->task = task;
+	state->source = KUNWIND_SOURCE_UNKNOWN;
+	state->flags.all = 0;
+	state->regs = NULL;
+}
+
+/*
+ * Start an unwind from a pt_regs.
+ *
+ * The unwind will begin at the PC within the regs.
+ *
+ * The regs must be on a stack currently owned by the calling task.
+ */
+static __always_inline void
+kunwind_init_from_regs(struct kunwind_state *state,
+		       struct pt_regs *regs)
+{
+	kunwind_init(state, current);
+
+	state->regs = regs;
+	state->common.fp = frame_pointer(regs);
+	state->common.pc = instruction_pointer(regs);
+	state->source = KUNWIND_SOURCE_REGS_PC;
+}
+
+/*
+ * Start an unwind from a caller.
+ *
+ * The unwind will begin at the caller of whichever function this is inlined
+ * into.
+ *
+ * The function which invokes this must be noinline.
+ */
+static __always_inline void
+kunwind_init_from_caller(struct kunwind_state *state)
+{
+	unsigned long fp = (unsigned long)__builtin_frame_address(0);
+	struct frame_record *record = (struct frame_record *)fp - 1;
+
+	kunwind_init(state, current);
+
+	state->common.fp = READ_ONCE(record->fp);
+	state->common.pc = READ_ONCE(record->ra);
+	state->source = KUNWIND_SOURCE_CALLER;
+}
+
+/*
+ * Start an unwind from a blocked task.
+ *
+ * The unwind will begin at the blocked task's saved PC (i.e. the caller of
+ * __switch_to).
+ *
+ * The caller should ensure the task is blocked in __switch_to for the
+ * duration of the unwind, or the unwind will be bogus. It is never valid to
+ * call this for the current task.
+ */
+static __always_inline void
+kunwind_init_from_task(struct kunwind_state *state,
+		       struct task_struct *task)
+{
+	kunwind_init(state, task);
+
+	state->common.fp = task->thread.s[0];
+	state->common.pc = task->thread.ra;
+	state->source = KUNWIND_SOURCE_TASK;
+}
+
+static __always_inline int
+kunwind_recover_return_address(struct kunwind_state *state)
+{
+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
+	if (state->task->ret_stack &&
+	    state->common.pc == (unsigned long)return_to_handler) {
+		unsigned long orig_pc;
+
+		orig_pc = ftrace_graph_ret_addr(state->task, &state->graph_idx,
+						state->common.pc,
+						(void *)state->common.fp);
+		if (state->common.pc == orig_pc) {
+			WARN_ON_ONCE(state->task == current);
+			return -EINVAL;
+		}
+		state->common.pc = orig_pc;
+		state->flags.fgraph = 1;
+	}
+#endif /* CONFIG_FUNCTION_GRAPH_TRACER */
+
+#ifdef CONFIG_KRETPROBES
+	if (is_kretprobe_trampoline(state->common.pc)) {
+		unsigned long orig_pc;
+
+		orig_pc = kretprobe_find_ret_addr(state->task,
+						  (void *)state->common.fp,
+						  &state->kr_cur);
+		if (!orig_pc)
+			return -EINVAL;
+		state->common.pc = orig_pc;
+		state->flags.kretprobe = 1;
+	}
+#endif /* CONFIG_KRETPROBES */
+
+	return 0;
+}
+
+/*
+ * When we reach an exception boundary marked by a metadata frame record,
+ * extract pt_regs from the stack and continue unwinding from the saved
+ * context (epc and s0/fp).
+ *
+ * On RISC-V, fp points above the metadata record, so the record's
+ * frame_record portion is at fp - sizeof(struct frame_record).
+ */
+static __always_inline int
+kunwind_next_regs_pc(struct kunwind_state *state)
+{
+	struct stack_info *info;
+	unsigned long fp = state->common.fp;
+	struct pt_regs *regs;
+
+	regs = container_of((unsigned long *)(fp - sizeof(struct frame_record)),
+			    struct pt_regs, stackframe.record.fp);
+
+	info = unwind_find_stack(&state->common, (unsigned long)regs,
+				 sizeof(*regs));
+	if (!info)
+		return -EINVAL;
+
+	unwind_consume_stack(&state->common, info, (unsigned long)regs,
+			     sizeof(*regs));
+
+	state->regs = regs;
+	state->common.pc = regs->epc;
+	state->common.fp = frame_pointer(regs);
+	state->regs = NULL;
+	state->source = KUNWIND_SOURCE_REGS_PC;
+	return 0;
+}
+
+/*
+ * Handle a metadata frame record embedded in pt_regs.
+ *
+ * On RISC-V, fp points above the record (fp = metadata + 16), so the
+ * frame_record_meta starts at fp - sizeof(struct frame_record).
+ *
+ * FRAME_META_TYPE_FINAL: This is the outermost exception entry
+ *   (user -> kernel). Unwinding terminates successfully.
+ * FRAME_META_TYPE_PT_REGS: This is a nested exception entry
+ *   (kernel -> kernel). Continue unwinding from the saved context.
+ */
+static __always_inline int
+kunwind_next_frame_record_meta(struct kunwind_state *state)
+{
+	struct task_struct *tsk = state->task;
+	unsigned long fp = state->common.fp;
+	unsigned long meta_base = fp - sizeof(struct frame_record);
+	struct frame_record_meta *meta;
+	struct stack_info *info;
+
+	info = unwind_find_stack(&state->common, meta_base, sizeof(*meta));
+	if (!info)
+		return -EINVAL;
+
+	meta = (struct frame_record_meta *)meta_base;
+	switch (READ_ONCE(meta->type)) {
+	case FRAME_META_TYPE_FINAL:
+		if (meta == &task_pt_regs(tsk)->stackframe)
+			return -ENOENT;
+		WARN_ON_ONCE(tsk == current);
+		return -EINVAL;
+	case FRAME_META_TYPE_PT_REGS:
+		return kunwind_next_regs_pc(state);
+	default:
+		WARN_ON_ONCE(tsk == current);
+		return -EINVAL;
+	}
+}
+
+/*
+ * Unwind from one frame record to the next.
+ *
+ * On RISC-V, the frame record sits at fp - sizeof(struct frame_record),
+ * immediately below the address pointed to by fp/s0. This applies to both
+ * normal frame records and metadata frame records (embedded in pt_regs).
+ *
+ * A metadata record is identified by both fp and ra being zero in the
+ * frame_record portion, with a type value following at fp + 16.
+ */
+static __always_inline int
+kunwind_next_frame_record(struct kunwind_state *state)
+{
+	unsigned long fp = state->common.fp;
+	struct frame_record *record;
+	struct stack_info *info;
+	unsigned long new_fp, new_pc;
+	unsigned long record_base;
+
+	if (fp & 0x7)
+		return -EINVAL;
+
+	record_base = fp - sizeof(*record);
+
+	info = unwind_find_stack(&state->common, record_base, sizeof(*record));
+	if (!info)
+		return -EINVAL;
+
+	record = (struct frame_record *)record_base;
+	new_fp = READ_ONCE(record->fp);
+	new_pc = READ_ONCE(record->ra);
+
+	if (!new_fp && !new_pc)
+		return kunwind_next_frame_record_meta(state);
+
+	unwind_consume_stack(&state->common, info, record_base,
+			     sizeof(*record));
+
+	state->common.fp = new_fp;
+	state->common.pc = new_pc;
+	state->source = KUNWIND_SOURCE_FRAME;
+
+	return 0;
+}
+
+/*
+ * Unwind from one frame record (A) to the next frame record (B).
+ *
+ * We terminate early if the location of B indicates a malformed chain of frame
+ * records (e.g. a cycle), determined based on the location and fp value of A
+ * and the location (but not the fp value) of B.
+ */
+static __always_inline int
+kunwind_next(struct kunwind_state *state)
+{
+	int err;
+
+	state->flags.all = 0;
+
+	switch (state->source) {
+	case KUNWIND_SOURCE_FRAME:
+	case KUNWIND_SOURCE_CALLER:
+	case KUNWIND_SOURCE_TASK:
+	case KUNWIND_SOURCE_REGS_PC:
+		err = kunwind_next_frame_record(state);
+		break;
+	default:
+		err = -EINVAL;
+	}
+
+	if (err)
+		return err;
+
+	return kunwind_recover_return_address(state);
+}
+
+typedef bool (*kunwind_consume_fn)(const struct kunwind_state *state, void *cookie);
+
+static __always_inline int
+do_kunwind(struct kunwind_state *state, kunwind_consume_fn consume_state,
+	   void *cookie)
+{
+	int ret;
+
+	ret = kunwind_recover_return_address(state);
+	if (ret)
+		return ret;
+
+	while (1) {
+		if (!consume_state(state, cookie))
+			return -EINVAL;
+		ret = kunwind_next(state);
+		if (ret == -ENOENT)
+			return 0;
+		if (ret < 0)
+			return ret;
+	}
+}
+
+static __always_inline int
+kunwind_stack_walk(kunwind_consume_fn consume_state,
+		   void *cookie, struct task_struct *task,
+		   struct pt_regs *regs)
+{
+	struct task_struct *tsk = task ?: current;
+	struct stack_info stacks[] = {
+		stackinfo_get_task(tsk),
+		STACKINFO_CPU(tsk, irq),
+#ifdef CONFIG_VMAP_STACK
+		STACKINFO_CPU(tsk, overflow),
+#endif
+	};
+	struct kunwind_state state = {
+		.common = {
+			.stacks = stacks,
+			.nr_stacks = ARRAY_SIZE(stacks),
+		},
+	};
+
+	if (regs) {
+		if (tsk != current)
+			return -EINVAL;
+		kunwind_init_from_regs(&state, regs);
+	} else if (tsk == current) {
+		kunwind_init_from_caller(&state);
+	} else {
+		kunwind_init_from_task(&state, tsk);
+	}
+
+	return do_kunwind(&state, consume_state, cookie);
+}
+
+struct kunwind_consume_entry_data {
+	stack_trace_consume_fn consume_entry;
+	void *cookie;
+};
+
+static __always_inline bool
+arch_kunwind_consume_entry(const struct kunwind_state *state, void *cookie)
+{
+	struct kunwind_consume_entry_data *data = cookie;
+
+	return data->consume_entry(data->cookie, state->common.pc);
+}
+
+static __always_inline bool
+arch_reliable_kunwind_consume_entry(const struct kunwind_state *state, void *cookie)
+{
+	/*
+	 * At an exception boundary we can reliably consume the saved PC. We do
+	 * not know whether the LR was live when the exception was taken, and
+	 * so we cannot perform the next unwind step reliably.
+	 *
+	 * All that matters is whether the *entire* unwind is reliable, so give
+	 * up as soon as we hit an exception boundary.
+	 */
+	if (state->source == KUNWIND_SOURCE_REGS_PC)
+		return false;
+
+	return arch_kunwind_consume_entry(state, cookie);
+}
+
+#endif /* CONFIG_FRAME_POINTER */
+
+/*
+ * arch_stack_walk - dual implementation.
+ *
+ * When CONFIG_FRAME_POINTER is enabled, uses the kunwind infrastructure for
+ * robust frame-pointer-based unwinding, consistent with arch_stack_walk_reliable.
+ *
+ * When CONFIG_FRAME_POINTER is disabled, falls back to the simple stack scan
+ * in walk_stackframe().
+ */
+#ifdef CONFIG_FRAME_POINTER
+
+noinline noinstr void arch_stack_walk(stack_trace_consume_fn consume_entry,
+				      void *cookie, struct task_struct *task,
+				      struct pt_regs *regs)
+{
+	struct kunwind_consume_entry_data data = {
+		.consume_entry = consume_entry,
+		.cookie = cookie,
+	};
+
+	kunwind_stack_walk(arch_kunwind_consume_entry, &data, task, regs);
+}
+
+#else
+
+noinline noinstr void arch_stack_walk(stack_trace_consume_fn consume_entry,
+				      void *cookie, struct task_struct *task,
+				      struct pt_regs *regs)
 {
 	walk_stackframe(task, regs, consume_entry, cookie);
 }
 
+#endif /* CONFIG_FRAME_POINTER */
+
+/*
+ * Reliable stack walk for livepatch (CONFIG_FRAME_POINTER only).
+ */
+#ifdef CONFIG_FRAME_POINTER
+
+noinline noinstr int arch_stack_walk_reliable(stack_trace_consume_fn consume_entry,
+					      void *cookie,
+					      struct task_struct *task)
+{
+	struct kunwind_consume_entry_data data = {
+		.consume_entry = consume_entry,
+		.cookie = cookie,
+	};
+
+	return kunwind_stack_walk(arch_reliable_kunwind_consume_entry, &data,
+				  task, NULL);
+}
+
+#endif /* CONFIG_FRAME_POINTER */
+
 /*
  * Get the return address for a single stackframe and return a pointer to the
  * next frame tail.
-- 
2.43.0


^ permalink raw reply related

* [PATCH v2 8/8] selftests/livepatch: Add RISC-V syscall wrapper prefix
From: Wang Han @ 2026-05-28  8:23 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Steven Rostedt, Alexandre Ghiti, Masami Hiramatsu, Mark Rutland,
	Catalin Marinas, Chen Pei, Andy Chiu, Björn Töpel,
	Deepak Gupta, Puranjay Mohan, Conor Dooley, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, oliver.yang, xueshuai, zhuo.song, jkchen,
	linux-riscv, linux-kernel, linux-trace-kernel, live-patching,
	linux-kselftest, linux-perf-users
In-Reply-To: <20260527123530.2593918-1-wanghan@linux.alibaba.com>

The syscall livepatch selftest resolves and patches a syscall wrapper
symbol. To use that test for RISC-V livepatch validation, add the
RISC-V FN_PREFIX definition for ARCH_HAS_SYSCALL_WRAPPER.

Without this macro, the syscall livepatch selftest cannot resolve the
RISC-V target symbol, and the syscall-related livepatch test fails on
RISC-V.

Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 .../testing/selftests/livepatch/test_modules/test_klp_syscall.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/livepatch/test_modules/test_klp_syscall.c b/tools/testing/selftests/livepatch/test_modules/test_klp_syscall.c
index dd802783ea84..275e4b10cf59 100644
--- a/tools/testing/selftests/livepatch/test_modules/test_klp_syscall.c
+++ b/tools/testing/selftests/livepatch/test_modules/test_klp_syscall.c
@@ -18,6 +18,8 @@
 #define FN_PREFIX __s390x_
 #elif defined(__aarch64__)
 #define FN_PREFIX __arm64_
+#elif defined(__riscv)
+#define FN_PREFIX __riscv_
 #else
 /* powerpc does not select ARCH_HAS_SYSCALL_WRAPPER */
 #define FN_PREFIX
-- 
2.43.0


^ permalink raw reply related

* [PATCH v2 5/8] riscv: stacktrace: introduce stack-bound tracking helpers
From: Wang Han @ 2026-05-28  8:23 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Steven Rostedt, Alexandre Ghiti, Masami Hiramatsu, Mark Rutland,
	Catalin Marinas, Chen Pei, Andy Chiu, Björn Töpel,
	Deepak Gupta, Puranjay Mohan, Conor Dooley, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, oliver.yang, xueshuai, zhuo.song, jkchen,
	linux-riscv, linux-kernel, linux-trace-kernel, live-patching,
	linux-kselftest, linux-perf-users
In-Reply-To: <20260527123530.2593918-1-wanghan@linux.alibaba.com>

A reliable unwinder needs to validate that every frame record it reads
is fully contained in a known kernel stack, and it needs to refuse to
walk back into a stack it has already left. Add the building blocks
for that:

  * struct stack_info / struct unwind_state in a new
    asm/stacktrace/common.h, modelled on the arm64 reference
    implementation.
  * stackinfo_get_irq() / stackinfo_get_task() / stackinfo_get_overflow()
    plus the corresponding on_*_stack() predicates in asm/stacktrace.h,
    so callers can ask "is this object on stack X?" by stack kind
    rather than open-coded address arithmetic.
  * unwind_init_common(), unwind_find_stack() and
    unwind_consume_stack() helpers that enforce the
    forward-progress-only invariant required for reliability.

No existing user is wired up to these helpers in this commit; the
unwinder switch comes in a follow-up. The header changes leave
on_thread_stack() with the same semantics as before, just expressed in
terms of the new helpers.

Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 arch/riscv/include/asm/stacktrace.h        |  65 ++++++++-
 arch/riscv/include/asm/stacktrace/common.h | 159 +++++++++++++++++++++
 2 files changed, 222 insertions(+), 2 deletions(-)
 create mode 100644 arch/riscv/include/asm/stacktrace/common.h

diff --git a/arch/riscv/include/asm/stacktrace.h b/arch/riscv/include/asm/stacktrace.h
index b1495a7e06ce..bc87c4940379 100644
--- a/arch/riscv/include/asm/stacktrace.h
+++ b/arch/riscv/include/asm/stacktrace.h
@@ -3,8 +3,13 @@
 #ifndef _ASM_RISCV_STACKTRACE_H
 #define _ASM_RISCV_STACKTRACE_H
 
+#include <linux/percpu.h>
 #include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+
+#include <asm/irq_stack.h>
 #include <asm/ptrace.h>
+#include <asm/stacktrace/common.h>
 
 struct stackframe {
 	unsigned long fp;
@@ -16,14 +21,70 @@ extern void notrace walk_stackframe(struct task_struct *task, struct pt_regs *re
 extern void dump_backtrace(struct pt_regs *regs, struct task_struct *task,
 			   const char *loglvl);
 
-static inline bool on_thread_stack(void)
+/*
+ * IRQ stack accessors
+ */
+static inline struct stack_info stackinfo_get_irq(void)
+{
+	unsigned long low = (unsigned long)raw_cpu_read(irq_stack_ptr);
+	unsigned long high = low + IRQ_STACK_SIZE;
+
+	return (struct stack_info) {
+		.low = low,
+		.high = high,
+	};
+}
+
+static inline bool on_irq_stack(unsigned long sp, unsigned long size)
+{
+	struct stack_info info = stackinfo_get_irq();
+
+	return stackinfo_on_stack(&info, sp, size);
+}
+
+/*
+ * Task stack accessors
+ */
+static inline struct stack_info stackinfo_get_task(const struct task_struct *tsk)
 {
-	return !(((unsigned long)(current->stack) ^ current_stack_pointer) & ~(THREAD_SIZE - 1));
+	unsigned long low = (unsigned long)task_stack_page(tsk);
+	unsigned long high = low + THREAD_SIZE;
+
+	return (struct stack_info) {
+		.low = low,
+		.high = high,
+	};
+}
+
+static inline bool on_task_stack(const struct task_struct *tsk,
+				 unsigned long sp, unsigned long size)
+{
+	struct stack_info info = stackinfo_get_task(tsk);
+
+	return stackinfo_on_stack(&info, sp, size);
 }
 
+/*
+ * Cast is necessary since current->stack is an opaque ptr.
+ */
+#define on_thread_stack()	(on_task_stack(current, current_stack_pointer, 1))
 
+/*
+ * Overflow stack accessors
+ */
 #ifdef CONFIG_VMAP_STACK
 DECLARE_PER_CPU(unsigned long [OVERFLOW_STACK_SIZE/sizeof(long)], overflow_stack);
+
+static inline struct stack_info stackinfo_get_overflow(void)
+{
+	unsigned long low = (unsigned long)raw_cpu_ptr(overflow_stack);
+	unsigned long high = low + OVERFLOW_STACK_SIZE;
+
+	return (struct stack_info) {
+		.low = low,
+		.high = high,
+	};
+}
 #endif /* CONFIG_VMAP_STACK */
 
 #endif /* _ASM_RISCV_STACKTRACE_H */
diff --git a/arch/riscv/include/asm/stacktrace/common.h b/arch/riscv/include/asm/stacktrace/common.h
new file mode 100644
index 000000000000..87d6d40672f3
--- /dev/null
+++ b/arch/riscv/include/asm/stacktrace/common.h
@@ -0,0 +1,159 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * RISC-V common stack unwinder types and helpers.
+ *
+ * See: arch/arm64/include/asm/stacktrace/common.h for the reference
+ * implementation.
+ *
+ * Copyright (C) 2024
+ */
+#ifndef __ASM_RISCV_STACKTRACE_COMMON_H
+#define __ASM_RISCV_STACKTRACE_COMMON_H
+
+#include <linux/compiler.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+
+#include <asm/stacktrace/frame.h>
+
+/**
+ * struct stack_info - describes the bounds of a stack.
+ *
+ * @low:  The lowest valid address on the stack.
+ * @high: The highest valid address on the stack.
+ */
+struct stack_info {
+	unsigned long low;
+	unsigned long high;
+};
+
+/**
+ * struct unwind_state - state used for robust unwinding.
+ *
+ * @fp:        The fp value in the frame record (or the real fp).
+ * @pc:        The ra value in the frame record (or the real ra).
+ *
+ * @stack:     The stack currently being unwound.
+ * @stacks:    An array of stacks which can be unwound.
+ * @nr_stacks: The number of stacks in @stacks.
+ */
+struct unwind_state {
+	unsigned long fp;
+	unsigned long pc;
+
+	struct stack_info stack;
+	struct stack_info *stacks;
+	int nr_stacks;
+};
+
+/**
+ * stackinfo_get_unknown() - Get an unknown stack_info.
+ *
+ * Return: a stack_info with low and high set to 0.
+ */
+static inline struct stack_info stackinfo_get_unknown(void)
+{
+	return (struct stack_info) {
+		.low = 0,
+		.high = 0,
+	};
+}
+
+/**
+ * stackinfo_on_stack() - Check whether an object is fully within a stack.
+ *
+ * @info: The stack to check against.
+ * @sp:   The base address of the object.
+ * @size: The size of the object.
+ *
+ * Return: true if the object is fully contained within the stack.
+ */
+static inline bool stackinfo_on_stack(const struct stack_info *info,
+				      unsigned long sp, unsigned long size)
+{
+	if (!info->low)
+		return false;
+
+	if (sp < info->low || sp + size < sp || sp + size > info->high)
+		return false;
+
+	return true;
+}
+
+/**
+ * unwind_init_common() - Initialize the common parts of the unwind state.
+ *
+ * @state: the unwind state to initialize.
+ */
+static inline void unwind_init_common(struct unwind_state *state)
+{
+	state->stack = stackinfo_get_unknown();
+}
+
+/**
+ * unwind_find_stack() - Find the accessible stack which entirely contains an
+ * object.
+ *
+ * @state: the current unwind state.
+ * @sp:    the base address of the object.
+ * @size:  the size of the object.
+ *
+ * Return: a pointer to the relevant stack_info if found; NULL otherwise.
+ */
+static inline struct stack_info *unwind_find_stack(struct unwind_state *state,
+						   unsigned long sp,
+						   unsigned long size)
+{
+	struct stack_info *info = &state->stack;
+
+	if (stackinfo_on_stack(info, sp, size))
+		return info;
+
+	for (int i = 0; i < state->nr_stacks; i++) {
+		info = &state->stacks[i];
+		if (stackinfo_on_stack(info, sp, size))
+			return info;
+	}
+
+	return NULL;
+}
+
+/**
+ * unwind_consume_stack() - Update stack boundaries so that future unwind steps
+ * cannot consume this object again.
+ *
+ * @state: the current unwind state.
+ * @info:  the stack_info of the stack containing the object.
+ * @sp:    the base address of the object.
+ * @size:  the size of the object.
+ *
+ * Stack transitions are strictly one-way, and once we've
+ * transitioned from one stack to another, it's never valid to
+ * unwind back to the old stack.
+ *
+ * Note that stacks can nest in several valid orders, e.g.
+ *
+ *   TASK -> IRQ -> OVERFLOW
+ *
+ * ... so we do not check the specific order of stack
+ * transitions.
+ */
+static inline void unwind_consume_stack(struct unwind_state *state,
+					struct stack_info *info,
+					unsigned long sp,
+					unsigned long size)
+{
+	struct stack_info tmp;
+
+	tmp = *info;
+	*info = stackinfo_get_unknown();
+	state->stack = tmp;
+
+	/*
+	 * Future unwind steps can only consume stack above this frame record.
+	 * Update the current stack to start immediately above it.
+	 */
+	state->stack.low = sp + size;
+}
+
+#endif /* __ASM_RISCV_STACKTRACE_COMMON_H */
-- 
2.43.0


^ permalink raw reply related

* [PATCH v2 1/8] scripts/sorttable: Handle RISC-V patchable ftrace entries
From: Wang Han @ 2026-05-28  8:23 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Steven Rostedt, Alexandre Ghiti, Masami Hiramatsu, Mark Rutland,
	Catalin Marinas, Chen Pei, Andy Chiu, Björn Töpel,
	Deepak Gupta, Puranjay Mohan, Conor Dooley, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, oliver.yang, xueshuai, zhuo.song, jkchen,
	linux-riscv, linux-kernel, linux-trace-kernel, live-patching,
	linux-kselftest, linux-perf-users
In-Reply-To: <20260527123530.2593918-1-wanghan@linux.alibaba.com>

RISC-V uses -fpatchable-function-entry=8,4 when the compressed ISA is
enabled and -fpatchable-function-entry=4,2 otherwise. In both cases, the
patchable NOP area starts 8 bytes before the function symbol address.
The __mcount_loc entries therefore point at the patchable NOP area
associated with a function, while nm reports the function symbol at the
entry address used for the function range check.

After RISC-V selected HAVE_BUILDTIME_MCOUNT_SORT, sorttable started
applying that range check at build time. Without allowing entries just
before the reported function address, the mcount sorter treats valid
RISC-V ftrace callsites as invalid weak-function entries and writes
them back as zero. The resulting kernel boots with no ftrace entries,
breaking dynamic ftrace and users such as livepatch.

The failure is silent during the final link because zeroing weak-function
entries is an expected sorttable operation. At boot, those zero entries
are skipped by ftrace_process_locs(), so the only obvious symptom is that
the vmlinux ftrace table has lost valid callsites and ftrace users cannot
attach to them.

CONFIG_FTRACE_SORT_STARTUP_TEST also reports the table as sorted in this
state: it only checks that the __mcount_loc entries are in ascending
order, which a fully zeroed table trivially satisfies. The original
commit relied on this check and did not see the regression.

On an affected RISC-V QEMU boot with both CONFIG_FTRACE_SORT_STARTUP_TEST
and CONFIG_FTRACE_STARTUP_TEST enabled, the sort check still passes
while ftrace reports zero usable entries and the early selftests fail:

  [    0.000000] ftrace section at ffffffff8101da98 sorted properly
  [    0.000000] ftrace: allocating 0 entries in 128 pages
  [    0.054999] Testing tracer function: .. no entries found ..FAILED!
  [    0.172407] tracer: function failed selftest, disabling
  [    0.178186] Failed to init function_graph tracer, init returned -19

Handle RISC-V like arm64 for the function-range check and allow
patchable entries up to 8 bytes before the function address.

With this fix, a RISC-V QEMU smoke boot with ftrace startup tests shows
the vmlinux ftrace table is populated and dynamic ftrace still works:

  [    0.000000] ftrace: allocating 46749 entries in 184 pages
  [    0.051115] Testing tracer function: PASSED
  [    1.283782] Testing dynamic ftrace: PASSED
  [    6.275456] Testing tracer function_graph: PASSED

Fixes: 0ca1724b56af ("riscv: ftrace: select HAVE_BUILDTIME_MCOUNT_SORT")
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/all/20260527113028.4b21a5de@fedora/
Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 scripts/sorttable.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/scripts/sorttable.c b/scripts/sorttable.c
index e8ed11c680c6..4c10e85bb5af 100644
--- a/scripts/sorttable.c
+++ b/scripts/sorttable.c
@@ -891,17 +891,21 @@ static int do_file(char const *const fname, void *addr)
 	table_sort_t custom_sort = NULL;
 
 	switch (elf_map_machine(ehdr)) {
-	case EM_AARCH64:
 #ifdef MCOUNT_SORT_ENABLED
+	case EM_AARCH64:
 		sort_reloc = true;
 		rela_type = 0x403;
-		/* arm64 uses patchable function entry placing before function */
+		/* fallthrough */
+	case EM_RISCV:
+		/* arm64 and RISC-V place patchable entries before the function */
 		before_func = 8;
+#else
+	case EM_AARCH64:
+	case EM_RISCV:
 #endif
 		/* fallthrough */
 	case EM_386:
 	case EM_LOONGARCH:
-	case EM_RISCV:
 	case EM_S390:
 	case EM_X86_64:
 		custom_sort = sort_relative_table_with_data;
-- 
2.43.0


^ permalink raw reply related

* [PATCH v2 4/8] riscv: ftrace: always preserve s0 in dynamic ftrace register frame
From: Wang Han @ 2026-05-28  8:23 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Steven Rostedt, Alexandre Ghiti, Masami Hiramatsu, Mark Rutland,
	Catalin Marinas, Chen Pei, Andy Chiu, Björn Töpel,
	Deepak Gupta, Puranjay Mohan, Conor Dooley, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, oliver.yang, xueshuai, zhuo.song, jkchen,
	linux-riscv, linux-kernel, linux-trace-kernel, live-patching,
	linux-kselftest, linux-perf-users
In-Reply-To: <20260527123530.2593918-1-wanghan@linux.alibaba.com>

The dynamic ftrace entry/exit only saved s0 (the architectural frame
pointer) when HAVE_FUNCTION_GRAPH_FP_TEST was selected. The upcoming
reliable frame-pointer unwinder needs s0 to be present in
ftrace_regs unconditionally so it can use the frame pointer as the
function-graph return-address cookie regardless of FP_TEST.

Save and restore s0 unconditionally in the dynamic ftrace ABI register
frame. The cost is one extra REG_S/REG_L pair per traced call, which is
negligible compared to the overall ftrace cost; the benefit is a
consistent ftrace_regs layout for the unwinder.

Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
---
 arch/riscv/kernel/mcount-dyn.S | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/arch/riscv/kernel/mcount-dyn.S b/arch/riscv/kernel/mcount-dyn.S
index 082fe0b0e3c0..26c55fba8fec 100644
--- a/arch/riscv/kernel/mcount-dyn.S
+++ b/arch/riscv/kernel/mcount-dyn.S
@@ -85,9 +85,7 @@
 	addi	sp, sp, -FREGS_SIZE_ON_STACK
 	REG_S	t0,  FREGS_EPC(sp)
 	REG_S	x1,  FREGS_RA(sp)
-#ifdef HAVE_FUNCTION_GRAPH_FP_TEST
 	REG_S	x8,  FREGS_S0(sp)
-#endif
 	REG_S	x6,  FREGS_T1(sp)
 #ifdef CONFIG_CC_IS_CLANG
 	REG_S	x7,  FREGS_T2(sp)
@@ -113,9 +111,7 @@
 	.macro RESTORE_ABI_REGS
 	REG_L	t0, FREGS_EPC(sp)
 	REG_L	x1, FREGS_RA(sp)
-#ifdef HAVE_FUNCTION_GRAPH_FP_TEST
 	REG_L	x8, FREGS_S0(sp)
-#endif
 	REG_L	x6,  FREGS_T1(sp)
 #ifdef CONFIG_CC_IS_CLANG
 	REG_L	x7,  FREGS_T2(sp)
-- 
2.43.0


^ permalink raw reply related

* [PATCH v2 0/8] riscv: Add reliable stack unwinding for livepatch
From: Wang Han @ 2026-05-28  8:23 UTC (permalink / raw)
  To: Paul Walmsley, Palmer Dabbelt, Albert Ou
  Cc: Steven Rostedt, Alexandre Ghiti, Masami Hiramatsu, Mark Rutland,
	Catalin Marinas, Chen Pei, Andy Chiu, Björn Töpel,
	Deepak Gupta, Puranjay Mohan, Conor Dooley, Josh Poimboeuf,
	Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
	Shuah Khan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, oliver.yang, xueshuai, zhuo.song, jkchen,
	linux-riscv, linux-kernel, linux-trace-kernel, live-patching,
	linux-kselftest, linux-perf-users
In-Reply-To: <20260527123530.2593918-1-wanghan@linux.alibaba.com>

Changelog
=========

v2:
  * Patch 1 (scripts/sorttable): refactor the switch in do_file() to
    group EM_AARCH64 and EM_RISCV under a single MCOUNT_SORT_ENABLED
    block sharing before_func = 8, per Steven Rostedt's suggestion [1].
    Future architectures with patchable function entries can add their
    own case to that group. No functional change.
    Also fix a stray empty line between the Fixes: tag and the
    Signed-off-by trailer (b4-checks "empty lines surround the Fixes
    tag"), and add Suggested-by: Steven Rostedt with a Link: to the
    review thread.
  * No other patches changed.

[1] https://lore.kernel.org/all/20260527113028.4b21a5de@fedora/

v1: https://lore.kernel.org/all/20260527123530.2593918-1-wanghan@linux.alibaba.com/

Problem
=======

Livepatch relies on HAVE_RELIABLE_STACKTRACE to decide whether a task
can safely switch to a patched implementation. RISC-V has a
frame-pointer stack walker, but it is not yet reliable enough for
livepatch. Three pieces are missing:

  * arch_stack_walk_reliable() itself, plus the strict stack-bound
    checks and forward-progress invariants a reliable unwinder needs.
  * Explicit unwind metadata at exception, task-entry and IRQ-stack
    boundaries, so the unwinder can distinguish a final user-to-kernel
    transition from a nested kernel pt_regs frame instead of guessing
    from return addresses.
  * Agreement between the ftrace function-graph, perf callchain and
    mcount paths and the same frame-record assumptions used by the
    reliable unwinder.

There is also a prerequisite ftrace issue on the current riscv/for-next
base. Commit 0ca1724b56af ("riscv: ftrace: select
HAVE_BUILDTIME_MCOUNT_SORT") enabled build-time sorting of the mcount
table. RISC-V uses patchable function entries, and the recorded patch
site is placed before the function symbol. scripts/sorttable currently
does not take that RISC-V layout into account, so valid ftrace sites
can be filtered out before the kernel boots.

Solution
========

Patch 1 fixes scripts/sorttable so the RISC-V build-time mcount sort
path accepts patchable function entries which precede the function
symbol. The fix carries a Fixes: tag for commit 0ca1724b56af ("riscv:
ftrace: select HAVE_BUILDTIME_MCOUNT_SORT") and is otherwise
independent; it can be picked into the RISC-V tree on its own if
preferred.

Patches 2-7 add the reliable unwinder in small, individually
reviewable steps. The design follows the same FP + metadata model
arm64 already uses for livepatch in production: the metadata frame
record in pt_regs, the unwind-state stack-bound bookkeeping, the
exception boundary handling, and the fgraph / kretprobe return-address
recovery are direct adaptations of arch/arm64/kernel/stacktrace.c,
retargeted to the RISC-V {fp, ra} frame record convention.

  * Patch 2 adds frame-record metadata for the RISC-V stack walker.
    Low-level entry and task setup code records whether a frame is a
    normal frame, an exception frame, or a task-entry boundary, so the
    reliable unwinder can validate what it is walking instead of
    guessing from the return address.
  * Patch 3 stops KASAN from instrumenting stacktrace.o, matching the
    arm, arm64 and x86 treatment of their stack unwinding code.
  * Patch 4 always preserves s0 in the dynamic ftrace register frame so
    the unwinder can use the architectural frame pointer as the
    function-graph return-address cookie regardless of FP_TEST.
  * Patch 5 introduces stack_info / unwind_state and the
    forward-progress-only stack-bound helpers that the reliable
    unwinder is built on. No caller is wired up yet.
  * Patch 6 switches arch_stack_walk() to the new frame-pointer based
    unwinder, adds arch_stack_walk_reliable() (still without an
    in-tree caller), routes perf callchains through arch_stack_walk(),
    and updates the function-graph cookie to match.
  * Patch 7 selects HAVE_RELIABLE_STACKTRACE and HAVE_LIVEPATCH under
    FRAME_POINTER && 64BIT and exposes the livepatch menu, finally
    enabling livepatch on RISC-V.

Two alternative directions were considered and deferred:

  * ORC, as used on x86, gives reliable unwinding without runtime FP
    cost, but requires RISC-V objtool stack validation, ORC metadata
    generation, and the runtime ORC unwinder. That is a much larger
    dependency chain than what this series adds.

  * SFrame is the more likely long-term replacement for FP-based
    unwinding on architectures without ORC. Kernel SFrame support is
    still under development and the currently documented SFrame ABI
    set does not cover RISC-V, so making RISC-V livepatch depend on
    SFrame would block it on toolchain and kernel infrastructure that
    is not available yet. SFrame is a replacement rather than an
    extension of the metadata frame record introduced here, so when it
    lands the metadata can be retired together with the FP unwinder.
    The interim cost (~24 bytes added to pt_regs and a handful of
    instructions on exception entry, fork and early init) is bounded
    and limited to FRAME_POINTER=y configurations, which is what the
    RISC-V kernel already builds with for stack tracing today.
    Selecting HAVE_RELIABLE_STACKTRACE under FRAME_POINTER && 64BIT
    therefore does not introduce a new build-time dependency relative
    to the status quo.

This is useful now because livepatch is increasingly important for
long-running server deployments where rebooting for critical fixes is
expensive, and recent RISC-V work (dynamic ftrace and patchable
function entries) has put the rest of the livepatch infrastructure in
place.

Module-side klp relocations rely on the existing RISC-V
apply_relocate_add(); the syscall livepatch selftest exercises the
full klp_apply_section_relocs() -> apply_relocate_add() path on RISC-V.

Patch 8 adds the RISC-V syscall wrapper prefix used by the livepatch
syscall selftest module. Without this, the syscall livepatch selftest
cannot resolve the expected target symbol on RISC-V.

Testing
=======

The series is based on riscv/for-next commit 0ca1724b56af ("riscv:
ftrace: select HAVE_BUILDTIME_MCOUNT_SORT").

Build and static checks:

  * git diff --check riscv/for-next..HEAD
  * scripts/checkpatch.pl --strict for each patch
  * RISC-V Image and modules build clean with:
      - gcc 15.2 (riscv64-unknown-linux-gnu-)
      - LLVM=1 clang 18.1.3
      - LLVM=1 clang 21.1.1
  * Each intermediate commit (patches 1-7) was built individually on
    riscv/for-next to confirm bisectability; all 7 intermediate trees
    plus the final HEAD compile clean.
  * livepatch selftest module build

The unfixed build-time sort path was reproduced under QEMU:

  ftrace: allocating 0 entries in 128 pages
  Testing tracer function: .. no entries found ..FAILED!
  Failed to init function_graph tracer, init returned -19

With the sorttable fix applied, the same QEMU boot finds the expected
ftrace entries and the ftrace startup tests pass:

  ftrace: allocating 46749 entries in 184 pages
  Testing tracer function: PASSED
  Testing dynamic ftrace: PASSED
  Testing tracer function_graph: PASSED

With all eight patches applied, RISC-V QEMU virt boots with SMP=2,
SMP=4, and SMP=8 completed the livepatch and tracing smoke tests. The
livepatch selftest result was the same in all runs:

  livepatch selftests: PASS: 7, SKIP: 1, FAIL: 0

Across these boots, the kernel brought up the requested CPU count and
the startup ftrace tests passed, including dynamic ftrace and
function_graph. The function graph selftests reported passed: 3,
failed: 0, unsupported: 3, and LKDTM WARNING_MESSAGE produced the
expected Call Trace and powered off normally.

The livepatch selftest skip is test-kprobe.sh. The test requires
CONFIG_KPROBES_ON_FTRACE, which is not provided by the current RISC-V
configuration.

Wang Han (8):
  scripts/sorttable: Handle RISC-V patchable ftrace entries
  riscv: stacktrace: Add frame record metadata
  riscv: stacktrace: disable KASAN instrumentation for stacktrace.o
  riscv: ftrace: always preserve s0 in dynamic ftrace register frame
  riscv: stacktrace: introduce stack-bound tracking helpers
  riscv: stacktrace: switch to frame-pointer based unwinder
  riscv: Kconfig: enable HAVE_RELIABLE_STACKTRACE and HAVE_LIVEPATCH
  selftests/livepatch: Add RISC-V syscall wrapper prefix

 arch/riscv/Kconfig                            |   4 +
 arch/riscv/include/asm/ptrace.h               |   9 +
 arch/riscv/include/asm/stacktrace.h           |  65 +-
 arch/riscv/include/asm/stacktrace/common.h    | 159 +++++
 arch/riscv/include/asm/stacktrace/frame.h     |  53 ++
 arch/riscv/kernel/Makefile                    |   5 +
 arch/riscv/kernel/asm-offsets.c               |   4 +
 arch/riscv/kernel/entry.S                     |  30 +-
 arch/riscv/kernel/ftrace.c                    |   6 +-
 arch/riscv/kernel/head.S                      |  23 +
 arch/riscv/kernel/mcount-dyn.S                |   4 -
 arch/riscv/kernel/perf_callchain.c            |   2 +-
 arch/riscv/kernel/process.c                   |  31 +-
 arch/riscv/kernel/stacktrace.c                | 560 +++++++++++++++---
 scripts/sorttable.c                           |  10 +-
 .../livepatch/test_modules/test_klp_syscall.c |   2 +
 16 files changed, 856 insertions(+), 111 deletions(-)
 create mode 100644 arch/riscv/include/asm/stacktrace/common.h
 create mode 100644 arch/riscv/include/asm/stacktrace/frame.h


base-commit: 0ca1724b56af054e304a9f3f60623b02a81aba3f
-- 
2.43.0


^ permalink raw reply

* Re: [PATCH 1/8] scripts/sorttable: Handle RISC-V patchable ftrace entries
From: Wang Han @ 2026-05-28  5:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Masami Hiramatsu, Mark Rutland, Catalin Marinas, Chen Pei,
	Andy Chiu, Björn Töpel, Deepak Gupta, Puranjay Mohan,
	Conor Dooley, Josh Poimboeuf, Jiri Kosina, Miroslav Benes,
	Petr Mladek, Joe Lawrence, Shuah Khan, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, linux-riscv,
	linux-kernel, linux-trace-kernel, live-patching, linux-kselftest,
	linux-perf-users
In-Reply-To: <20260527113028.4b21a5de@fedora>

[Resend: my first reply went out as a private message to Steve only,
due to a local git send-email config quirk that dropped the Cc list.
Re-sending now with the original cc list so the discussion is
on-record. Sorry for the duplicate, Steve.]

On Wed, 27 May 2026 11:30:28 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> So basically RISCV has the same problem as ARM64 with patchable
> entries. As this may happen for other archs in the future, I would like
> to group them together like this:
[...]
> does the above work for you? (Although I didn't even compile test it).

Yes, this is clearly better - a single grouped block makes future
patchable-entry architectures trivial to add. I will fold it into v2
with two small adjustments to keep it compiling cleanly:

  - s/case RISCV/case EM_RISCV/ (two places).
  - Put the shared "before_func = 8" on its own line under
    case EM_RISCV: with a standard /* fallthrough */ comment,
    otherwise GCC -Wimplicit-fallthrough warns between EM_AARCH64
    and EM_RISCV.

Resulting switch:

	switch (elf_map_machine(ehdr)) {
	#ifdef MCOUNT_SORT_ENABLED
	case EM_AARCH64:
		sort_reloc = true;
		rela_type = 0x403;
		/* fallthrough */
	case EM_RISCV:
		/* arm64 and RISC-V place patchable entries before the function */
		before_func = 8;
	#else
	case EM_AARCH64:
	case EM_RISCV:
	#endif
		/* fallthrough */
	case EM_386:
	case EM_LOONGARCH:
	case EM_S390:
	case EM_X86_64:
		custom_sort = sort_relative_table_with_data;
		break;

Built scripts/sorttable with the kernel host build (both with and
without MCOUNT_SORT_ENABLED), no warnings. I'll add your Suggested-by
and send v2 shortly.

Thanks!
Wang Han

^ permalink raw reply

* [PATCH v2] selftests/ftrace: Fix trace_marker_raw test on 64K page kernels
From: Tianchen Ding @ 2026-05-28  2:24 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Shuah Khan
  Cc: linux-kernel, linux-trace-kernel, linux-kselftest
In-Reply-To: <20260527095438.1794905-1-dtcccc@linux.alibaba.com>

On ARM64 kernels with 64K pages, the trace_marker_raw test fails because
bash's printf builtin uses stdio buffering which splits output into
multiple small write() calls to the tracefs file. Since each individual
write is within TRACE_MARKER_MAX_SIZE (4096), they all succeed, causing
the "too big" write test to incorrectly pass.

Fix by piping make_str output through dd with iflag=fullblock to
guarantee a single atomic write() syscall to trace_marker_raw.

Fixes: 37f46601383a ("selftests/tracing: Add basic test for trace_marker_raw file")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
---
v2:
Update comment about 64K pages.
---
 .../selftests/ftrace/test.d/00basic/trace_marker_raw.tc    | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
index 8e905d4fe6dd..f68f1901f65f 100644
--- a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
+++ b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
@@ -43,8 +43,11 @@ write_buffer() {
 	id=$1
 	size=$2
 
-	# write the string into the raw marker
-	make_str $id $size > trace_marker_raw
+	# Pipe through dd to ensure a single atomic write() syscall
+	# on architectures with 64K pages, where shell's printf builtin
+	# uses stdio buffering which may split the output into multiple
+	# writes.
+	make_str $id $size | dd of=trace_marker_raw bs=`expr $size + 4` iflag=fullblock
 }
 
 
-- 
2.39.3


^ permalink raw reply related

* Re: [PATCH bpf-next] x86/ftrace: relocate %rip-relative percpu refs in dynamic trampolines
From: Peter Zijlstra @ 2026-05-27 21:11 UTC (permalink / raw)
  To: Alexis Lothoré (eBPF Foundation)
  Cc: Steven Rostedt, Masami Hiramatsu, Mark Rutland, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Uros Bizjak, Thomas Petazzoni, Ingo Molnar, linux-kernel,
	linux-trace-kernel, bpf, ebpf, Bastien Curutchet
In-Reply-To: <20260527-fix_call_depth_in_trampoline-v1-1-1c1abc8ae310@bootlin.com>

On Wed, May 27, 2026 at 09:12:31PM +0200, Alexis Lothoré (eBPF Foundation) wrote:
> With CONFIG_CALL_DEPTH_TRACKING enabled on an x86 retbleed-affected
> platform (eg: Skylake), with retbleed=stuff, registering a dynamic
> ftrace trampoline crashes on the first call into the traced function:
> 

> 
> This small reproducer allows to easily trigger the crash:
> 
>   # echo 'p __x64_sys_clock_nanosleep' > /sys/kernel/tracing/kprobe_events
>   # echo 1 > /sys/kernel/tracing/events/kprobes/p___x64_sys_clock_nanosleep_0/enable
>   # usleep 1
> 
> Monitoring the crash under GDB points to the exact instruction in charge
> of incrementing the call depth:
> 
>   sarq $5, %gs:__x86_call_depth(%rip)
> 
> This instruction matches the one inserted by the ftrace_regs_caller from
> ftrace_64.S. This emitted code was likely working fine until the
> introduction of commit 59bec00ace28 ("x86/percpu: Introduce
> %rip-relative addressing to PER_CPU_VAR()"): it has made the call depth
> accounting addressing relative to $rip, instead of being based on an
> absolute address. As this code exact location depends on where the
> trampoline lives in memory, the corresponding displacement needs to be
> adjusted at runtime to actually correctly find the per-cpu
> __x86_call_depth value, otherwise the targeted address is wrong, leading
> to the page fault seen above.
> 
> Fix the %rip-relative displacement of the copied CALL_DEPTH_ACCOUNT
> instruction (from ftrace_regs_caller) by calling
> text_poke_apply_relocation(), as it is done for example by the x86 BPF
> JIT compiler through x86_call_depth_emit_accounting(). This corrects
> both CALL_DEPTH_ACCOUNT slots, in ftrace_caller and ftrace_regs_caller.
> 
> Fixes: 59bec00ace28 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()")
> Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
> ---
>  arch/x86/kernel/ftrace.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> index 0543b57f54ee..357df1b2922c 100644
> --- a/arch/x86/kernel/ftrace.c
> +++ b/arch/x86/kernel/ftrace.c
> @@ -375,6 +375,13 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
>  			goto fail;
>  	}
>  
> +	/*
> +	 * Generated trampoline may contain rip-relative addressing which
> +	 * displacement needs to be fixed
> +	 */
> +	text_poke_apply_relocation(trampoline, trampoline, size,
> +				   (void *)start_offset, size);
> +
>  	/*
>  	 * The address of the ftrace_ops that is used for this trampoline
>  	 * is stored at the end of the trampoline. This will be used to

I went and had a quick grep through the tree to see if there are more
sites that were missed in the conversion (commit 17bce3b2ae2d), but I
couldn't find another one.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>


^ permalink raw reply

* Re: [PATCH v16 03/20] unwind_user/sframe: Store .sframe section data in per-mm maple tree
From: Steven Rostedt @ 2026-05-27 20:20 UTC (permalink / raw)
  To: Jens Remus
  Cc: linux-kernel, linux-trace-kernel, x86, Josh Poimboeuf,
	Indu Bhagat, Peter Zijlstra, Dylan Hatch, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Mathieu Desnoyers, Kees Cook, Sam James, bpf, linux-mm,
	Namhyung Kim, Andrii Nakryiko, Jose E. Marchesi, Beau Belgrave,
	Florian Weimer, Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
	Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
	Ilya Leoshkevich, Steven Rostedt (Google)
In-Reply-To: <20260521142546.3908498-4-jremus@linux.ibm.com>

On Thu, 21 May 2026 16:25:29 +0200
Jens Remus <jremus@linux.ibm.com> wrote:

>  int sframe_remove_section(unsigned long sframe_start)
>  {
> -	return -ENOSYS;
> +	struct mm_struct *mm = current->mm;
> +	struct sframe_section *sec;
> +	unsigned long index = 0;
> +	bool found = false;
> +	int ret = 0;
> +
> +	guard(srcu)(&sframe_srcu);
> +
> +	mt_for_each(&mm->sframe_mt, sec, index, ULONG_MAX) {
> +		if (sec->sframe_start == sframe_start) {
> +			found = true;
> +			ret |= __sframe_remove_section(mm, sec);

Because this is all internal data, the __sframe_remove_section() should
never fail. Perhaps we should add a WARN_ON() if it does?

		if (sec->sframe_start == sframe_start) {
			ret |= __sframe_remove_section(mm, sec);
			WARN_ON(!found && ret);
			found = true;

-- Steve



> +		}
> +	}
> +
> +	if (!found || ret)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +

^ permalink raw reply

* Re: [PATCH v16 02/20] unwind_user/sframe: Add support for reading .sframe headers
From: Steven Rostedt @ 2026-05-27 20:09 UTC (permalink / raw)
  To: Jens Remus
  Cc: linux-kernel, linux-trace-kernel, x86, Josh Poimboeuf,
	Indu Bhagat, Peter Zijlstra, Dylan Hatch, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Mathieu Desnoyers, Kees Cook, Sam James, bpf, linux-mm,
	Namhyung Kim, Andrii Nakryiko, Jose E. Marchesi, Beau Belgrave,
	Florian Weimer, Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
	Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
	Ilya Leoshkevich, Steven Rostedt (Google)
In-Reply-To: <20260521142546.3908498-3-jremus@linux.ibm.com>

On Thu, 21 May 2026 16:25:28 +0200
Jens Remus <jremus@linux.ibm.com> wrote:

> +static int sframe_read_header(struct sframe_section *sec)
> +{
> +	unsigned long header_end, fdes_start, fdes_end, fres_start, fres_end;
> +	struct sframe_header shdr;
> +	unsigned int num_fdes;
> +
> +	if (copy_from_user(&shdr, (void __user *)sec->sframe_start, sizeof(shdr))) {
> +		dbg("header usercopy failed\n");
> +		return -EFAULT;
> +	}
> +
> +	if (shdr.preamble.magic != SFRAME_MAGIC ||
> +	    shdr.preamble.version != SFRAME_VERSION_3 ||
> +	    !(shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
> +	    !(shdr.preamble.flags & SFRAME_F_FDE_FUNC_START_PCREL) ||
> +	    shdr.auxhdr_len) {
> +		dbg("bad/unsupported sframe header\n");
> +		return -EINVAL;
> +	}
> +
> +	if (!shdr.num_fdes || !shdr.num_fres) {
> +		dbg("no fde/fre entries\n");
> +		return -EINVAL;
> +	}
> +
> +	header_end = sec->sframe_start + SFRAME_HEADER_SIZE(shdr);
> +	if (header_end >= sec->sframe_end) {
> +		dbg("header doesn't fit in section\n");
> +		return -EINVAL;
> +	}
> +
> +	num_fdes   = shdr.num_fdes;
> +	fdes_start = header_end + shdr.fdes_off;
> +	fdes_end   = fdes_start + (num_fdes * sizeof(struct sframe_fde_v3));
> +
> +	fres_start = header_end + shdr.fres_off;
> +	fres_end   = fres_start + shdr.fre_len;
> +
> +	if (fres_start < fdes_end || fres_end > sec->sframe_end) {
> +		dbg("inconsistent fde/fre offsets\n");
> +		return -EINVAL;
> +	}

I really think we should also test if fres_start >= fres_end. Because
if fres_start is near sec->sframe_end and on 32bit archs fre_len makes
fres_end overflow, this doesn't catch it.

Even if it doesn't cause issues later, I rather have it caught here.

-- Steve

> +
> +	sec->num_fdes		= num_fdes;
> +	sec->fdes_start		= fdes_start;
> +	sec->fres_start		= fres_start;
> +	sec->fres_end		= fres_end;
> +
> +	sec->ra_off		= shdr.cfa_fixed_ra_offset;
> +	sec->fp_off		= shdr.cfa_fixed_fp_offset;
> +
> +	return 0;
> +}

^ permalink raw reply

* Re: [PATCH 6.12] x86/fgraph: Fix return_to_handler regs.rsp value
From: Sasha Levin @ 2026-05-27 19:49 UTC (permalink / raw)
  To: stable, gregkh
  Cc: Sasha Levin, jolsa, rostedt, mhiramat, tglx, mingo, bp, x86,
	linux-trace-kernel, bpf, Andrii Nakryiko, Gyokhan Kochmarla
In-Reply-To: <20260526192324.79459-1-gyokhan@amazon.de>

> commit 8bc11700e0d23d4fdb7d8d5a73b2e95de427cabc upstream.

Queued for 6.12.y, thanks.

--
Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH 6.12] tracing: Fix the bug where bpf_get_stackid returns -EFAULT on the ARM64
From: Sasha Levin @ 2026-05-27 19:49 UTC (permalink / raw)
  To: stable, gregkh
  Cc: Sasha Levin, yangfeng, rostedt, mhiramat, mark.rutland,
	catalin.marinas, will, jolsa, linux-trace-kernel,
	linux-arm-kernel, bpf, Gyokhan Kochmarla
In-Reply-To: <20260526192012.76223-1-gyokhan@amazon.de>

> commit fd2f74f8f3d3c1a524637caf5bead9757fae4332 upstream.
>
> When using bpf_program__attach_kprobe_multi_opts on ARM64 to hook a BPF program
> that contains the bpf_get_stackid function, the BPF program fails
> to obtain the stack trace and returns -EFAULT.

Queued for 6.12.y, thanks.

--
Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH v8 0/6] mm/memory-failure: add panic option for unrecoverable pages
From: Andrew Morton @ 2026-05-27 19:39 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Miaohe Lin, David Hildenbrand, Lorenzo Stoakes, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	Naoya Horiguchi, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Liam R. Howlett,
	linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team, Lance Yang
In-Reply-To: <20260527-ecc_panic-v8-0-9ea0cfa16bb0@debian.org>

On Wed, 27 May 2026 07:06:13 -0700 Breno Leitao <leitao@debian.org> wrote:

> A multi-bit ECC error on a kernel-owned page that the memory failure
> handler cannot recover is currently swallowed: PG_hwpoison is set, the
> event is logged, and the kernel keeps running.  The corrupted memory
> remains accessible to the kernel and either drives silent data
> corruption or surfaces seconds-to-minutes later as an apparently
> unrelated crash.  In a large fleet that delayed, unattributable crash
> turns into significant engineering effort to root-cause; in a kdump
> configuration, by the time the crash happens the original error
> context (faulting PFN, MCE/GHES record, page state) is long gone.
> 
> This series adds an opt-in sysctl,
> vm.panic_on_unrecoverable_memory_failure, that converts an
> unrecoverable kernel-page hwpoison event into an immediate panic with
> a clean dmesg/vmcore that still contains the original failure
> context.  The default is disabled so existing workloads see no
> change.

Thanks.  That does seem useful.

I'll pass at this time, due to -rc5 and not-very-reviewed.

AI review said a few things.  It claims to have found one pre-existing
issue.

	https://sashiko.dev/#/patchset/20260527-ecc_panic-v8-0-9ea0cfa16bb0@debian.org

^ permalink raw reply

* Re: [PATCH v2] scripts: Have make TAGS not include structure members
From: Masatake YAMATO @ 2026-05-27 19:36 UTC (permalink / raw)
  To: peterz
  Cc: rostedt, linux-kernel, linux-trace-kernel, linux-kbuild, akpm,
	masahiroy, geert, mmarek, hamo.by, sboyd
In-Reply-To: <20260527162914.GH3102624@noisy.programming.kicks-ass.net>

From: Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH v2] scripts: Have make TAGS not include structure members
Date: Wed, 27 May 2026 18:29:14 +0200

> On Wed, May 27, 2026 at 12:11:44PM -0400, Steven Rostedt wrote:
>> From: Steven Rostedt <rostedt@goodmis.org>
>> 
>> It is really annoying when I use emacs TAGS to search for something
>> like "dev_name" and have to go through 12 iterations before I find the
>> function "dev_name". I really do not care about structures that include
>> "dev_name" as one of its fields, and I'm sure pretty much all other
>> developers do not care either.
>> 
>> There's a "remove_structs" variable used by the scripts/tags.sh, which
>> I'm guessing is suppose to remove these structures from the TAGS file,
>> but it must do a poor job at it, as I'm always hitting structures when
>> I want the actual declaration.
>> 
>> Luckily, the etags program comes with an option "--no-members", which does
>> exactly what I want, and I'm sure all other kernel developers want too.
>> 
>> Create a new "no_members" variable and assign it to "--no-members" for the
>> "TAGS" case and pass that to the etags program to remove structures.
>> 
>> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
>> ---
>> Changes since v1: https://lore.kernel.org/all/20131115093645.6dc03918@gandalf.local.home/
>> 
>> - Use a no_members variable instead of hard coding the --no-members into
>>   the etags call, as that can break some "tags" cases. (Michal Marek)
> 
> Yeah, I often use member tags.
> 
> The tags file have a 'kind' field, what you want is for emacs to order
> on kind and prefer 'f' over 'm'.

citre (https://github.com/universal-ctags/citre) utilizes the kinds field
of tags file when soriting the list of definitions. citre runs on emacs.

https://github.com/universal-ctags/citre/blob/master/citre-lang-c.el#L28 explains
the parts of heuristics used in citre:

;; - When the symbol is after "->" or ".", tags of member kind are sorted above
;;   others.
;; - When the symbol is before "(", tags of function/macro kinds are sorted
;;   above others.
;; - When the cursor is on the header file in a "#include" directive, the
;;   header file itself, and the references to that header file (if tagged) is
;;   found as its definitions.  References that uses paths can't be found.
;;   Also, file names will be used for auto-completion.

I use citre and ctags daily for reading kernel code. Quite comfortable.

Masatake YAMATO

> 
> The alternative is switching to use emacs-lsp, that way the editor knows
> the kind of symbol you want. If you're on a function call, it should
> only consider 'f' tags. Whereas if the cursor is on a member deref, it
> should only consider 'm'.
> 


^ permalink raw reply

* Re: [PATCH bpf-next] x86/ftrace: relocate %rip-relative percpu refs in dynamic trampolines
From: Steven Rostedt @ 2026-05-27 19:30 UTC (permalink / raw)
  To: Alexis Lothoré
  Cc: Masami Hiramatsu, Mark Rutland, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Uros Bizjak,
	Thomas Petazzoni, Ingo Molnar, linux-kernel, linux-trace-kernel,
	bpf, ebpf, Bastien Curutchet
In-Reply-To: <DITPAGA18EUH.26LNYC9K9KX1P@bootlin.com>

On Wed, 27 May 2026 21:19:47 +0200
Alexis Lothoré <alexis.lothore@bootlin.com> wrote:

> On Wed May 27, 2026 at 9:08 PM CEST, Alexis Lothoré (eBPF Foundation) wrote:
> 
> [...]
> 
> > Fixes: 59bec00ace28 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()")
> > Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>  
> 
> Please ignore the 'bpf-next' target for the patch, this is a mistake (I
> used my bpf tree to prepare this patch), I am really targeting the trace
> tree here.

Actually, it should go through the x86 tree.

Acked-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve


^ permalink raw reply

* Re: [PATCH bpf-next] x86/ftrace: relocate %rip-relative percpu refs in dynamic trampolines
From: Alexis Lothoré @ 2026-05-27 19:19 UTC (permalink / raw)
  To: Alexis Lothoré (eBPF Foundation), Steven Rostedt,
	Masami Hiramatsu, Mark Rutland, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Uros Bizjak
  Cc: Thomas Petazzoni, Ingo Molnar, linux-kernel, linux-trace-kernel,
	bpf, ebpf, Bastien Curutchet
In-Reply-To: <20260527-fix_call_depth_in_trampoline-v1-1-d0292bfe7eed@bootlin.com>

On Wed May 27, 2026 at 9:08 PM CEST, Alexis Lothoré (eBPF Foundation) wrote:

[...]

> Fixes: 59bec00ace28 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()")
> Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>

Please ignore the 'bpf-next' target for the patch, this is a mistake (I
used my bpf tree to prepare this patch), I am really targeting the trace
tree here.

Alexis

-- 
Alexis Lothoré, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


^ permalink raw reply

* [PATCH bpf-next] x86/ftrace: relocate %rip-relative percpu refs in dynamic trampolines
From: Alexis Lothoré (eBPF Foundation) @ 2026-05-27 19:12 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mark Rutland, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Uros Bizjak
  Cc: Thomas Petazzoni, Ingo Molnar, linux-kernel, linux-trace-kernel,
	bpf, ebpf, Bastien Curutchet,
	Alexis Lothoré (eBPF Foundation)

With CONFIG_CALL_DEPTH_TRACKING enabled on an x86 retbleed-affected
platform (eg: Skylake), with retbleed=stuff, registering a dynamic
ftrace trampoline crashes on the first call into the traced function:

  [    9.630365] BUG: unable to handle page fault for address: ffff88817ae18880
  [    9.630365] #PF: supervisor write access in kernel mode
  [    9.630365] #PF: error_code(0x0002) - not-present page
  [    9.630365] PGD 4b53067 P4D 4b53067 PUD 0
  [    9.630365] Oops: Oops: 0002 [#1] SMP PTI
  [    9.630365] CPU: 3 UID: 0 PID: 187 Comm: usleep Not tainted 7.0.10 #243 PREEMPT(full)
  [    9.630365] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014
  [    9.630365] RIP: 0010:0xffffffffc0400058
  [    9.630365] Code: 24 78 00 00 00 00 48 89 ea 48 89 54 24 20 48 8b b4 24 b8 00 00 00 48 8b bc 24 b0 00 00 00 48 89 bc 24 80 00 00 00 48 83 ef 05 <65> 48 c1 3d 1f a8 b6 02 05 48 8b 15 f6 00 00 00 4c 89 3c 24 4c 89
  [    9.630365] RSP: 0018:ffffc90000a3fe60 EFLAGS: 00010382
  [    9.630365] RAX: ffffffffffffffff RBX: ffffc90000a3ff58 RCX: 0000000000000000
  [    9.630365] RDX: ffffc90000a3ff48 RSI: ffffffff82198e40 RDI: ffffffff813f5654
  [    9.630365] RBP: ffffc90000a3ff48 R08: 0000000000000000 R09: 0000000000000000
  [    9.630365] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8881030d3780
  [    9.630365] R13: 00000000000000e6 R14: 0000000000000000 R15: 0000000000000000
  [    9.630365] FS:  00007f081d131740(0000) GS:ffff8881b7eae000(0000) knlGS:0000000000000000
  [    9.630365] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [    9.630365] CR2: ffff88817ae18880 CR3: 00000001033dc006 CR4: 00000000003706f0
  [    9.630365] Call Trace:
  [    9.630365]  <TASK>
  [    9.630365]  ? find_held_lock+0x2b/0x80
  [    9.630365]  ? exc_page_fault+0x74/0x220
  [    9.630365]  ? lock_release+0xe1/0x320
  [    9.630365]  ? __x64_sys_clock_nanosleep+0x9/0x1a0
  [    9.630365]  ? lockdep_hardirqs_on_prepare+0xd9/0x190
  [    9.630365]  ? trace_hardirqs_on+0x18/0x100
  [    9.630365]  __x64_sys_clock_nanosleep+0x9/0x1a0
  [    9.630365]  do_syscall_64+0x100/0x5f0
  [    9.630365]  ? exc_page_fault+0x1e0/0x220
  [    9.630365]  ? call_depth_return_thunk+0x2a/0xd0
  [    9.630365]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [    9.630365] RIP: 0033:0x7f081d20ad83
  [    9.630365] Code: ff ff c3 0f 1f 40 00 83 ff 03 74 7b 83 ff 02 b8 fa ff ff ff 49 89 ca 0f 44 f8 80 3d c6 d2 10 00 00 74 14 b8 e6 00 00 00 0f 05 <f7> d8 c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec 28 48 89 54 24 10
  [    9.630365] RSP: 002b:00007ffd539e1328 EFLAGS: 00000202 ORIG_RAX: 00000000000000e6
  [    9.630365] RAX: ffffffffffffffda RBX: 0000000000000103 RCX: 00007f081d20ad83
  [    9.630365] RDX: 00007ffd539e1340 RSI: 0000000000000000 RDI: 0000000000000000
  [    9.630365] RBP: 00007ffd539e14f8 R08: 0000000000000000 R09: 0000000000000000
  [    9.630365] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
  [    9.630365] R13: 00007ffd539e1510 R14: 00007f081d370000 R15: 00005619c9d78338
  [    9.630365]  </TASK>
  [    9.630365] Modules linked in:
  [    9.630365] CR2: ffff88817ae18880
  [    9.630365] ---[ end trace 0000000000000000 ]---
  [    9.630365] RIP: 0010:0xffffffffc0400058
  [    9.630365] Code: 24 78 00 00 00 00 48 89 ea 48 89 54 24 20 48 8b b4 24 b8 00 00 00 48 8b bc 24 b0 00 00 00 48 89 bc 24 80 00 00 00 48 83 ef 05 <65> 48 c1 3d 1f a8 b6 02 05 48 8b 15 f6 00 00 00 4c 89 3c 24 4c 89
  [    9.630365] RSP: 0018:ffffc90000a3fe60 EFLAGS: 00010382
  [    9.630365] RAX: ffffffffffffffff RBX: ffffc90000a3ff58 RCX: 0000000000000000
  [    9.630365] RDX: ffffc90000a3ff48 RSI: ffffffff82198e40 RDI: ffffffff813f5654
  [    9.630365] RBP: ffffc90000a3ff48 R08: 0000000000000000 R09: 0000000000000000
  [    9.630365] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8881030d3780
  [    9.630365] R13: 00000000000000e6 R14: 0000000000000000 R15: 0000000000000000
  [    9.630365] FS:  00007f081d131740(0000) GS:ffff8881b7eae000(0000) knlGS:0000000000000000
  [    9.630365] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [    9.630365] CR2: ffff88817ae18880 CR3: 00000001033dc006 CR4: 00000000003706f0
  [    9.630365] Kernel panic - not syncing: Fatal exception

This small reproducer allows to easily trigger the crash:

  # echo 'p __x64_sys_clock_nanosleep' > /sys/kernel/tracing/kprobe_events
  # echo 1 > /sys/kernel/tracing/events/kprobes/p___x64_sys_clock_nanosleep_0/enable
  # usleep 1

Monitoring the crash under GDB points to the exact instruction in charge
of incrementing the call depth:

  sarq $5, %gs:__x86_call_depth(%rip)

This instruction matches the one inserted by the ftrace_regs_caller from
ftrace_64.S. This emitted code was likely working fine until the
introduction of commit 59bec00ace28 ("x86/percpu: Introduce
%rip-relative addressing to PER_CPU_VAR()"): it has made the call depth
accounting addressing relative to $rip, instead of being based on an
absolute address. As this code exact location depends on where the
trampoline lives in memory, the corresponding displacement needs to be
adjusted at runtime to actually correctly find the per-cpu
__x86_call_depth value, otherwise the targeted address is wrong, leading
to the page fault seen above.

Fix the %rip-relative displacement of the copied CALL_DEPTH_ACCOUNT
instruction (from ftrace_regs_caller) by calling
text_poke_apply_relocation(), as it is done for example by the x86 BPF
JIT compiler through x86_call_depth_emit_accounting(). This corrects
both CALL_DEPTH_ACCOUNT slots, in ftrace_caller and ftrace_regs_caller.

Fixes: 59bec00ace28 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()")
Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
---
 arch/x86/kernel/ftrace.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 0543b57f54ee..357df1b2922c 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -375,6 +375,13 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 			goto fail;
 	}
 
+	/*
+	 * Generated trampoline may contain rip-relative addressing which
+	 * displacement needs to be fixed
+	 */
+	text_poke_apply_relocation(trampoline, trampoline, size,
+				   (void *)start_offset, size);
+
 	/*
 	 * The address of the ftrace_ops that is used for this trampoline
 	 * is stored at the end of the trampoline. This will be used to

---
base-commit: aef70d0806e39b83f1fbecc32c72cc328751292a
change-id: 20260527-fix_call_depth_in_trampoline-80bc56930c8f

Best regards,
--  
Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>


^ permalink raw reply related

* Re: [PATCH v2] scripts: Have make TAGS not include structure members
From: Steven Rostedt @ 2026-05-27 18:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Linux trace kernel, linux-kbuild, Andrew Morton,
	Masahiro Yamada, Masatake YAMATO, Geert Uytterhoeven,
	Michal Marek, Yang Bai
In-Reply-To: <20260527162914.GH3102624@noisy.programming.kicks-ass.net>

On Wed, 27 May 2026 18:29:14 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> Yeah, I often use member tags.
> 
> The tags file have a 'kind' field, what you want is for emacs to order
> on kind and prefer 'f' over 'm'.
> 
> The alternative is switching to use emacs-lsp, that way the editor knows
> the kind of symbol you want. If you're on a function call, it should
> only consider 'f' tags. Whereas if the cursor is on a member deref, it
> should only consider 'm'.

OK, so in addition to my procrastination of sending out this patch, I
finally changed my .emacs file to have "Meta-." call
xref-find-definitions instead of find-tags.

The xref-find-definitions gives a list of all the tags it finds and you
can search for the function. In the example of "dev_name", I simply
searched for "dev_name(" and it found the function immediately.

As "find-tags" has been deprecated back in 2016 (10 years ago!), and
xref-find-definitions doesn't suffer as much as 'find-tags' does with
respect to member tags. I'll simply drop this patch.

I can also finally archive the conversation I have in my INBOX! ;-)

-- Steve

^ permalink raw reply

* [PATCH 3/4] bootconfig: add xbc_prepend_embedded_cmdline() helper
From: Breno Leitao @ 2026-05-27 16:41 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260527-bootconfig_using_tools-v1-0-b6906a86e7d5@debian.org>

Add a helper that prepends the build-time-rendered embedded bootconfig
"kernel" subtree (embedded_kernel_cmdline[] from embedded-cmdline.S) to
a cmdline buffer with a separating space. Architectures call this from
setup_arch() before parse_early_param() so early_param() handlers
(mem=, earlycon=, loglevel=, ...) see values supplied via the embedded
bootconfig.

On overflow the helper logs an error and leaves the cmdline untouched
rather than panicking. Booting without the embedded values is better
than refusing to boot, and the error tells the user why their embedded
keys are missing.

When CONFIG_BOOT_CONFIG_EMBED_CMDLINE=n, the public declaration in
<linux/bootconfig.h> resolves to a no-op stub so callers compile
unchanged.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 include/linux/bootconfig.h |  7 +++++++
 lib/bootconfig.c           | 48 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+)

diff --git a/include/linux/bootconfig.h b/include/linux/bootconfig.h
index 1c7f3b74ffcf..dcb0c86cbc54 100644
--- a/include/linux/bootconfig.h
+++ b/include/linux/bootconfig.h
@@ -308,4 +308,11 @@ static inline const char *xbc_get_embedded_bootconfig(size_t *size)
 }
 #endif
 
+/* Build-time-rendered bootconfig cmdline prepended in setup_arch() */
+#ifdef CONFIG_BOOT_CONFIG_EMBED_CMDLINE
+void __init xbc_prepend_embedded_cmdline(char *dst, size_t size);
+#else
+static inline void xbc_prepend_embedded_cmdline(char *dst, size_t size) { }
+#endif
+
 #endif
diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index 3a102c9122f7..10c62c8600c8 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -19,6 +19,7 @@
 #include <linux/errno.h>
 #include <linux/cache.h>
 #include <linux/compiler.h>
+#include <linux/printk.h>
 #include <linux/sprintf.h>
 #include <linux/memblock.h>
 #include <linux/string.h>
@@ -34,6 +35,53 @@ const char * __init xbc_get_embedded_bootconfig(size_t *size)
 	return (*size) ? embedded_bootconfig_data : NULL;
 }
 #endif
+
+#ifdef CONFIG_BOOT_CONFIG_EMBED_CMDLINE
+/* embedded_kernel_cmdline is defined in embedded-cmdline.S */
+extern __visible const char embedded_kernel_cmdline[];
+extern __visible const char embedded_kernel_cmdline_end[];
+
+/**
+ * xbc_prepend_embedded_cmdline() - Prepend embedded bootconfig cmdline
+ * @dst: cmdline buffer to prepend into (must already contain a NUL byte)
+ * @size: total capacity of @dst in bytes
+ *
+ * Prepend the build-time-rendered "kernel" subtree of the embedded
+ * bootconfig to @dst. The rendered string already ends with a single
+ * space (the xbc_snprint_cmdline() invariant), which serves as the
+ * separator between the embedded keys and any existing content of @dst.
+ * On overflow, log an error and leave @dst untouched rather than
+ * silently truncating: booting without the embedded values is better
+ * than refusing to boot, and the error message tells the user why
+ * their embedded keys are missing.
+ *
+ * Intended to be called from setup_arch() before parse_early_param() so
+ * that early_param() handlers see the embedded values.
+ */
+void __init xbc_prepend_embedded_cmdline(char *dst, size_t size)
+{
+	size_t embed_len = embedded_kernel_cmdline_end - embedded_kernel_cmdline;
+	size_t dst_len;
+
+	if (!size || embed_len <= 1)	/* trailing NUL only */
+		return;
+	embed_len--;			/* exclude trailing NUL byte */
+
+	dst_len = strnlen(dst, size);
+	if (embed_len + dst_len + 1 > size) {
+		pr_err("embedded bootconfig cmdline (%zu bytes) does not fit in COMMAND_LINE_SIZE with %zu bytes already used; ignoring embedded values\n",
+		       embed_len, dst_len);
+		return;
+	}
+
+	if (dst_len)
+		memmove(dst + embed_len, dst, dst_len + 1);
+	else
+		dst[embed_len] = '\0';
+	memcpy(dst, embedded_kernel_cmdline, embed_len);
+}
+#endif
+
 #endif
 
 /*

-- 
2.54.0


^ permalink raw reply related

* [PATCH 4/4] x86/setup: prepend embedded bootconfig cmdline before parse_early_param
From: Breno Leitao @ 2026-05-27 16:41 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260527-bootconfig_using_tools-v1-0-b6906a86e7d5@debian.org>

Call xbc_prepend_embedded_cmdline() in setup_arch() right after the
CONFIG_CMDLINE merge and before strscpy(command_line, ...) so the
build-time-rendered embedded bootconfig "kernel" subtree is part of
boot_command_line by the time parse_early_param() runs. early_param()
handlers (mem=, earlycon=, loglevel=, ...) now see values supplied via
CONFIG_BOOT_CONFIG_EMBED_FILE without parsing bootconfig at runtime.

Select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG so the user-visible
CONFIG_BOOT_CONFIG_EMBED_CMDLINE option becomes selectable on x86.

With this select in place, setup_boot_config() in init/main.c would
otherwise render the embedded "kernel" subtree a second time via
xbc_make_cmdline("kernel") and prepend it to saved_command_line /
static_command_line through extra_command_line, duplicating every
embedded kernel.* key in /proc/cmdline and causing accumulating
handlers (console=, earlycon=, ...) to register the same value twice.
Track whether the bootconfig data came from the embedded source and
skip the duplicate render in that case.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 arch/x86/Kconfig        |  1 +
 arch/x86/kernel/setup.c |  3 +++
 init/main.c             | 19 ++++++++++++++++---
 3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f24810015234..f839795692b4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -126,6 +126,7 @@ config X86
 	select ARCH_SUPPORTS_NUMA_BALANCING	if X86_64
 	select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP	if NR_CPUS <= 4096
 	select ARCH_SUPPORTS_CFI		if X86_64
+	select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
 	select ARCH_USES_CFI_TRAPS		if X86_64 && CFI
 	select ARCH_SUPPORTS_LTO_CLANG
 	select ARCH_SUPPORTS_LTO_CLANG_THIN
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 46882ce79c3a..592c4c79c974 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -6,6 +6,7 @@
  * parts of early kernel initialization.
  */
 #include <linux/acpi.h>
+#include <linux/bootconfig.h>
 #include <linux/console.h>
 #include <linux/cpu.h>
 #include <linux/crash_dump.h>
@@ -924,6 +925,8 @@ void __init setup_arch(char **cmdline_p)
 	builtin_cmdline_added = true;
 #endif
 
+	xbc_prepend_embedded_cmdline(boot_command_line, COMMAND_LINE_SIZE);
+
 	strscpy(command_line, boot_command_line, COMMAND_LINE_SIZE);
 	*cmdline_p = command_line;
 
diff --git a/init/main.c b/init/main.c
index e363232b428b..8264bfa97aa2 100644
--- a/init/main.c
+++ b/init/main.c
@@ -378,12 +378,15 @@ static void __init setup_boot_config(void)
 	int pos, ret;
 	size_t size;
 	char *err;
+	bool from_embedded = false;
 
 	/* Cut out the bootconfig data even if we have no bootconfig option */
 	data = get_boot_config_from_initrd(&size);
 	/* If there is no bootconfig in initrd, try embedded one. */
-	if (!data)
+	if (!data) {
 		data = xbc_get_embedded_bootconfig(&size);
+		from_embedded = true;
+	}
 
 	strscpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
 	err = parse_args("bootconfig", tmp_cmdline, NULL, 0, 0, 0, NULL,
@@ -421,8 +424,18 @@ static void __init setup_boot_config(void)
 	} else {
 		xbc_get_info(&ret, NULL);
 		pr_info("Load bootconfig: %ld bytes %d nodes\n", (long)size, ret);
-		/* keys starting with "kernel." are passed via cmdline */
-		extra_command_line = xbc_make_cmdline("kernel");
+		/*
+		 * keys starting with "kernel." are passed via cmdline. When
+		 * BOOT_CONFIG_EMBED_CMDLINE is enabled and this bootconfig
+		 * came from the embedded source, setup_arch() already
+		 * prepended the rendered "kernel" subtree to
+		 * boot_command_line; rendering again here would duplicate
+		 * the keys in saved_command_line / static_command_line and
+		 * cause accumulating handlers (console=, earlycon=, ...) to
+		 * re-register the same value.
+		 */
+		if (!IS_ENABLED(CONFIG_BOOT_CONFIG_EMBED_CMDLINE) || !from_embedded)
+			extra_command_line = xbc_make_cmdline("kernel");
 		/* Also, "init." keys are init arguments */
 		extra_init_args = xbc_make_cmdline("init");
 	}

-- 
2.54.0


^ permalink raw reply related

* [PATCH 2/4] bootconfig: render embedded bootconfig as a kernel cmdline at build time
From: Breno Leitao @ 2026-05-27 16:41 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260527-bootconfig_using_tools-v1-0-b6906a86e7d5@debian.org>

Add the build-time pipeline that renders the "kernel" subtree of
CONFIG_BOOT_CONFIG_EMBED_FILE into a flat cmdline string and stashes
it in .init.rodata as embedded_kernel_cmdline[]. A follow-up patch
adds the runtime helper that prepends this string to boot_command_line
during early architecture setup so parse_early_param() sees the values.

The build wires up:
  tools/bootconfig -C kernel - userspace tool already shared with
                               lib/bootconfig.c, used here in -C mode
                               to render a bootconfig file to a cmdline
  lib/embedded-cmdline.S     - .incbin's the rendered text plus a NUL
  lib/Makefile rule          - runs tools/bootconfig at build time
  Makefile prepare dep       - ensures tools/bootconfig is built first,
                               same pattern as tools/objtool and
                               tools/bpf/resolve_btfids

Drop the test target from tools/bootconfig/Makefile's default 'all'
recipe so that hooking the binary into the kernel build does not run
test-bootconfig.sh on every prepare. The tests stay available as
'make -C tools/bootconfig test', matching the convention of
tools/objtool and tools/bpf/resolve_btfids whose 'all' targets only
build the binary.

Require BOOT_CONFIG_EMBED_FILE to be non-empty before the new option
can be enabled, otherwise tools/bootconfig -C runs against an empty
file and prints a parse error on every kernel build.

The feature gates on CONFIG_ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG, a
silent symbol arches select once they've wired the prepend call into
setup_arch(). No arch selects it in this patch, so the user-visible
CONFIG_BOOT_CONFIG_EMBED_CMDLINE is not yet enableable; when an arch
later opts in, the runtime behavior is added by the follow-up patches.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 Makefile                  |  5 +++++
 init/Kconfig              | 33 +++++++++++++++++++++++++++++++++
 lib/Makefile              | 16 ++++++++++++++++
 lib/embedded-cmdline.S    | 16 ++++++++++++++++
 tools/bootconfig/Makefile |  2 +-
 5 files changed, 71 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index d59f703f9797..3ee259d00a9a 100644
--- a/Makefile
+++ b/Makefile
@@ -1543,6 +1543,11 @@ prepare: tools/bpf/resolve_btfids
 endif
 endif
 
+# lib/Makefile invokes tools/bootconfig to render the embedded bconf to cmdline.
+ifdef CONFIG_BOOT_CONFIG_EMBED_CMDLINE
+prepare: tools/bootconfig
+endif
+
 # The tools build system is not a part of Kbuild and tends to introduce
 # its own unique issues. If you need to integrate a new tool into Kbuild,
 # please consider locating that tool outside the tools/ tree and using the
diff --git a/init/Kconfig b/init/Kconfig
index ca35184532dc..5f491a5ac4b8 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1569,6 +1569,39 @@ config BOOT_CONFIG_EMBED_FILE
 	  This bootconfig will be used if there is no initrd or no other
 	  bootconfig in the initrd.
 
+config ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
+	bool
+	help
+	  Selected by architectures whose setup_arch() prepends the
+	  build-time-rendered embedded bootconfig cmdline to
+	  boot_command_line before parse_early_param() runs.
+
+config BOOT_CONFIG_EMBED_CMDLINE
+	bool "Render embedded bootconfig as kernel cmdline at build time"
+	depends on BOOT_CONFIG_EMBED
+	depends on BOOT_CONFIG_EMBED_FILE != ""
+	depends on ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
+	default n
+	help
+	  Render the "kernel" subtree of the embedded bootconfig file into a
+	  flat cmdline string at kernel build time and prepend it to
+	  boot_command_line during early architecture setup. This makes
+	  early_param() handlers (e.g. mem=, earlycon=, loglevel=) see the
+	  values supplied via the embedded bootconfig.
+
+	  The runtime bootconfig parser is unaffected, so tree-structured
+	  consumers such as ftrace boot-time tracing keep working.
+
+	  Note: when an initrd also carries a bootconfig, its "kernel"
+	  subtree is still parsed at runtime, but the embedded "kernel"
+	  keys remain in boot_command_line for parse_early_param() and
+	  end up later than the initrd keys in saved_command_line, so
+	  parse_args() last-wins favors the embedded values. If you need
+	  initrd to override embedded kernel.* keys, leave this option
+	  off.
+
+	  If unsure, say N.
+
 config CMDLINE_LOG_WRAP_IDEAL_LEN
 	int "Length to try to wrap the cmdline when logged at boot"
 	default 1021
diff --git a/lib/Makefile b/lib/Makefile
index 6e72d2c1cce7..9de0ac7732a2 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -273,6 +273,22 @@ filechk_defbconf = cat $(or $(real-prereqs), /dev/null)
 $(obj)/default.bconf: $(CONFIG_BOOT_CONFIG_EMBED_FILE) FORCE
 	$(call filechk,defbconf)
 
+obj-$(CONFIG_BOOT_CONFIG_EMBED_CMDLINE) += embedded-cmdline.o
+$(obj)/embedded-cmdline.o: $(obj)/embedded_cmdline.bin
+
+# Render the bootconfig "kernel" subtree to a flat cmdline string using
+# the userspace tools/bootconfig parser (-C mode). The runtime prepend
+# helper enforces COMMAND_LINE_SIZE at boot, so no build-time size
+# check is performed here (COMMAND_LINE_SIZE is an arch header
+# constant, not a Kconfig value).
+quiet_cmd_render_cmdline = BCONF2C $@
+      cmd_render_cmdline = \
+	$(objtree)/tools/bootconfig/bootconfig -C $< > $@
+
+targets += embedded_cmdline.bin
+$(obj)/embedded_cmdline.bin: $(obj)/default.bconf $(objtree)/tools/bootconfig/bootconfig FORCE
+	$(call if_changed,render_cmdline)
+
 obj-$(CONFIG_RBTREE_TEST) += rbtree_test.o
 obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o
 
diff --git a/lib/embedded-cmdline.S b/lib/embedded-cmdline.S
new file mode 100644
index 000000000000..7e2e1d81af96
--- /dev/null
+++ b/lib/embedded-cmdline.S
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Embed the build-time-rendered bootconfig "kernel" subtree as a flat
+ * cmdline string. setup_arch() prepends this to boot_command_line on
+ * architectures that select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+ * Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+ */
+	.section .init.rodata, "aw"
+	.global embedded_kernel_cmdline
+embedded_kernel_cmdline:
+	.incbin "lib/embedded_cmdline.bin"
+	.byte 0
+	.global embedded_kernel_cmdline_end
+embedded_kernel_cmdline_end:
diff --git a/tools/bootconfig/Makefile b/tools/bootconfig/Makefile
index 90eb47c9d8de..4e82fd9553cd 100644
--- a/tools/bootconfig/Makefile
+++ b/tools/bootconfig/Makefile
@@ -15,7 +15,7 @@ override CFLAGS += -Wall -g -I$(CURDIR)/include
 ALL_TARGETS := bootconfig
 ALL_PROGRAMS := $(patsubst %,$(OUTPUT)%,$(ALL_TARGETS))
 
-all: $(ALL_PROGRAMS) test
+all: $(ALL_PROGRAMS)
 
 $(OUTPUT)bootconfig: main.c include/linux/bootconfig.h $(LIBSRC)
 	$(CC) $(filter %.c,$^) $(CFLAGS) $(LDFLAGS) -o $@

-- 
2.54.0


^ permalink raw reply related

* [PATCH 1/4] bootconfig: return 0 from xbc_snprint_cmdline() for a leaf root
From: Breno Leitao @ 2026-05-27 16:41 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260527-bootconfig_using_tools-v1-0-b6906a86e7d5@debian.org>

Returning -EINVAL when @root has no descendant key nodes is a quirky
result for a renderer: "nothing to render" is not an error. The only
existing caller, xbc_make_cmdline(), papers over it with a `len <= 0`
check, so the misbehavior is harmless today. The new -C user in
tools/bootconfig added by the follow-up patches propagates the error
and turns an empty "kernel {}" subtree into a build failure.

Short-circuit the leaf-root case and return 0 so the rendered length
matches the rendered content.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 lib/bootconfig.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index f445b7703fdd..3a102c9122f7 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -431,6 +431,16 @@ int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root)
 	const char *val, *q;
 	int ret;
 
+	/*
+	 * A leaf @root (e.g. an empty "kernel {}" subtree, or a key whose
+	 * only child is a value node) has no descendant key/value pairs to
+	 * render. The leaf-finding iterator below would otherwise return
+	 * @root itself, which xbc_node_compose_key_after() rejects with
+	 * -EINVAL.
+	 */
+	if (root && xbc_node_is_leaf(root))
+		return 0;
+
 	xbc_node_for_each_key_value(root, knode, val) {
 		ret = xbc_node_compose_key_after(root, knode,
 					xbc_namebuf, XBC_KEYLEN_MAX);

-- 
2.54.0


^ permalink raw reply related

* [PATCH 0/4] bootconfig: embed kernel.* cmdline at build time
From: Breno Leitao @ 2026-05-27 16:41 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team

The userspace pieces (xbc_snprint_cmdline() in lib/, tools/bootconfig -C)
already landed; this series wires the rendered cmdline into the kernel.

Motivation: today the embedded bootconfig is parsed at runtime, after
parse_early_param() has already run, so early_param() handlers can't
see embedded values. Folding the kernel.* subtree into the cmdline at
build time gives a CONFIG_CMDLINE-equivalent for embedded-bootconfig
users without forcing them to maintain two cmdline sources.

Behaviorally, the "kernel" subtree is rendered to a flat string at
build time and stashed in .init.rodata. setup_arch() prepends it to
boot_command_line before parse_early_param() runs. Overflow is a soft
error: the helper logs and leaves boot_command_line untouched rather
than panicking, so an oversized embedded bconf cannot brick a boot.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Breno Leitao (4):
      bootconfig: return 0 from xbc_snprint_cmdline() for a leaf root
      bootconfig: render embedded bootconfig as a kernel cmdline at build time
      bootconfig: add xbc_prepend_embedded_cmdline() helper
      x86/setup: prepend embedded bootconfig cmdline before parse_early_param

 Makefile                   |  5 ++++
 arch/x86/Kconfig           |  1 +
 arch/x86/kernel/setup.c    |  3 +++
 include/linux/bootconfig.h |  7 ++++++
 init/Kconfig               | 33 ++++++++++++++++++++++++++
 init/main.c                | 19 ++++++++++++---
 lib/Makefile               | 16 +++++++++++++
 lib/bootconfig.c           | 58 ++++++++++++++++++++++++++++++++++++++++++++++
 lib/embedded-cmdline.S     | 16 +++++++++++++
 tools/bootconfig/Makefile  |  2 +-
 10 files changed, 156 insertions(+), 4 deletions(-)
---
base-commit: e7e28506af98ce4e1059e5ec59334b335c00a246
change-id: 20260508-bootconfig_using_tools-cfa7aa9d6a5a

Best regards,
--  
Breno Leitao <leitao@debian.org>


^ permalink raw reply

* Re: [PATCH v2] scripts: Have make TAGS not include structure members'
From: Peter Zijlstra @ 2026-05-27 16:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux trace kernel, linux-kbuild, Andrew Morton,
	Masahiro Yamada, Masatake YAMATO, Geert Uytterhoeven,
	Michal Marek, Yang Bai, Stephen Boyd
In-Reply-To: <20260527162914.GH3102624@noisy.programming.kicks-ass.net>

On Wed, May 27, 2026 at 06:29:14PM +0200, Peter Zijlstra wrote:
> On Wed, May 27, 2026 at 12:11:44PM -0400, Steven Rostedt wrote:
> > From: Steven Rostedt <rostedt@goodmis.org>
> > 
> > It is really annoying when I use emacs TAGS to search for something
> > like "dev_name" and have to go through 12 iterations before I find the
> > function "dev_name". I really do not care about structures that include
> > "dev_name" as one of its fields, and I'm sure pretty much all other
> > developers do not care either.
> > 
> > There's a "remove_structs" variable used by the scripts/tags.sh, which
> > I'm guessing is suppose to remove these structures from the TAGS file,
> > but it must do a poor job at it, as I'm always hitting structures when
> > I want the actual declaration.
> > 
> > Luckily, the etags program comes with an option "--no-members", which does
> > exactly what I want, and I'm sure all other kernel developers want too.
> > 
> > Create a new "no_members" variable and assign it to "--no-members" for the
> > "TAGS" case and pass that to the etags program to remove structures.
> > 
> > Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> > ---
> > Changes since v1: https://lore.kernel.org/all/20131115093645.6dc03918@gandalf.local.home/
> > 
> > - Use a no_members variable instead of hard coding the --no-members into
> >   the etags call, as that can break some "tags" cases. (Michal Marek)
> 
> Yeah, I often use member tags.
> 
> The tags file have a 'kind' field, what you want is for emacs to order
> on kind and prefer 'f' over 'm'.
> 
> The alternative is switching to use emacs-lsp, that way the editor knows
> the kind of symbol you want. If you're on a function call, it should
> only consider 'f' tags. Whereas if the cursor is on a member deref, it
> should only consider 'm'.

That said, setting up clangd on the kernel tree is rather more painful
that I'd like it to be :-(

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox