Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* Re: [PATCH] unwind: Add sframe_(un)register() system calls
From: Steven Rostedt @ 2026-05-22 11:18 UTC (permalink / raw)
  To: Jens Remus
  Cc: LKML, Linux Trace Kernel, bpf, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
	Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
	Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
	Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
	Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Heiko Carstens,
	Vasily Gorbik
In-Reply-To: <d82ea328-758e-4e41-a58e-46f3137feca7@linux.ibm.com>

On Fri, 22 May 2026 11:43:06 +0200
Jens Remus <jremus@linux.ibm.com> wrote:

> On 5/22/2026 12:35 AM, Steven Rostedt wrote:
> > From: Steven Rostedt <rostedt@goodmis.org>
> > 
> > Add system calls to register and unregister sframes that can be used by
> > dynamic linkers to tell the kernel where the sframe section is in memory
> > for libraries it loads.  
> 
> Why two separate system calls?  Can't that be one single stacktracectl?
> Could they at least be non-sframe specific, e.g. stracktrace_register
> and stracktrace_unregister, so that if one would implement e.g. unwind
> user dwarf/eh_frame in the future one could pass ehframe_start and
> ehframe_end in addition to sframe_start and sframe_end?

Talking with everyone at LSF/MM/BPF the consensus was to avoid an ioctl
like system call. Everyone hates them. They told me that a system call
should do one thing. They wanted a separate system call to register and to
unregister.

Note this also helps to see what the user is doing via monitoring via
ftrace, strace, and security wise via LSMs and seccomp.

> 
> > 
> > Both system calls take a pointer to a new structure:
> > 
> >   struct sframe_setup {
> > 	unsigned long		sframe_start;
> > 	unsigned long		sframe_size;
> > 	unsigned long		text_start;
> > 	unsigned long		text_size;
> >   };
> > 
> > and a size of the passed in structure. If the system call needs to be
> > extended, then the structure could be changed and the size of that
> > structure will tell the kernel that it is the new version. If the kernel
> > does not recognize the structure size, it will return -EINVAL.
> > 
> >   sframe_start - The virtual address of the sframe section
> >   sframe_size  - The length of the sframe section
> >   text_start   - the text section the sframe represents
> >   test_size    - the length of the section
> > 
> > If other stack tracing functionality is added, it will require a new
> > system call.
> > 
> > The unregister only needs the sframe_start and requires all the rest of
> > the fields to be 0. In the future, if more can be done, then user space
> > can update the other values and check the return code to see if the kernel
> > supports it.
> > 
> > Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> > ---
> > 
> > Based on top of Jens patches here:
> > 
> >   https://lore.kernel.org/linux-trace-kernel/20260520154004.3845823-1-jremus@linux.ibm.com/
> > 
> > [ Note, I tested this with the same program from the RFC patch ]
> > 
> > Changes from RFC: https://patch.msgid.link/20260429114355.6c712e6a@gandalf.local.home
> > 
> > - Remove the ioctl() like system call for a unique system call for each
> >   functionality. Right now there's two functionalities:
> >    1. register sframe section
> >    2. unregister sframe sections
> > 
> > - Added taking a lock around the mtree logic in __sframe_remove_section()
> >   as Sashiko mentioned that there could be races from user space
> >   registering and unregistering sframe sections at the same time.  
> 
> Doesn't sframe_add_section() then also need likewise?

Ah, I saw the lock grabbed on the vma lookup. It should also be done for the
mtree_insert_range(). Thanks, will fix.

> 
> > 
> > - Removed [RFC] from subject as I believe this is more likely the way
> >   this system call will be done.  
> 
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h  
> 
> > @@ -999,6 +999,8 @@ asmlinkage long sys_lsm_get_self_attr(unsigned int attr, struct lsm_ctx __user *
> >  asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
> >  				      u32 size, u32 flags);
> >  asmlinkage long sys_lsm_list_modules(u64 __user *ids, u32 __user *size, u32 flags);
> > +asmlinkage long sys_sframe_register(void *data, unsigned int size);
> > +asmlinkage long sys_sframe_unregister(void *data, unsigned int size);
> >  
> >  /*
> >   * Architecture-specific system calls  
> 
> 
> > diff --git a/include/uapi/linux/sframe.h b/include/uapi/linux/sframe.h  
> 
> > @@ -0,0 +1,12 @@
> > +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> > +#ifndef _UAPI_LINUX_SFRAME_H
> > +#define _UAPI_LINUX_SFRAME_H
> > +
> > +struct sframe_setup {
> > +	unsigned long		sframe_start;
> > +	unsigned long		sframe_size;
> > +	unsigned long		text_start;
> > +	unsigned long		text_size;
> > +};
> > +
> > +#endif /* _UAPI_LINUX_SFRAME_H */  
> 
> > diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c  
> 
> > @@ -842,9 +844,11 @@ static void sframe_free_srcu(struct rcu_head *rcu)
> >  static int __sframe_remove_section(struct mm_struct *mm,
> >  				   struct sframe_section *sec)
> >  {
> > -	if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
> > -		dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
> > -		return -EINVAL;
> > +	scoped_guard(mmap_read_lock, mm) {  
> 
> Why is a read lock sufficient? Doesn't that allow multiple readers?
> How does that prevent a concurrent modification of the mm->sframe_mt?

That was a cut and paste error. I meant to change it to a write lock, but
got distracted :-p Thanks, will fix.

> 
> > +		if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
> > +			dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
> > +			return -EINVAL;
> > +		}  
> 
> Is (or why not) likewise required in sframe_add_section() for the
> mtree_insert_range()?
> 
> Wasn't the reported issue that while mt_for_each() in
> sframe_remove_section() there could be concurrent mtree_erase() in
> __sframe_remove_section() followed by mtree_insert_range() in
> sframe_add_section(), so that the mt_for_each() could get confused?

I'll take a closer look. But let me fix the obvious bugs first.

-- Steve

> 
> >  	}
> >  
> >  	call_srcu(&sframe_srcu, &sec->rcu, sframe_free_srcu);
> > @@ -936,3 +940,56 @@ void sframe_free_mm(struct mm_struct *mm)
> >  
> >  	mtree_destroy(&mm->sframe_mt);
> >  }

^ permalink raw reply

* [PATCH v11 2/6] x86/asm: Avoid emitting DWARF CFI for non-VDSO
From: Jens Remus @ 2026-05-22 11:04 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
	Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James,
	Andy Lutomirski
  Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
	Jose E. Marchesi, Beau Belgrave, Florian Weimer,
	Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
	Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
	Ilya Leoshkevich, Steven Rostedt (Google)
In-Reply-To: <20260522110427.2816637-1-jremus@linux.ibm.com>

From: Josh Poimboeuf <jpoimboe@kernel.org>

It was decided years ago that .cfi_* annotations aren't maintainable in
the kernel.  They were replaced by objtool unwind hints.  For the kernel
proper, ensure the CFI_* macros don't do anything.

On the other hand the VDSO library *does* use them, so user space can
unwind through it.

Make sure these macros only work for VDSO.  They aren't actually being
used outside of VDSO anyway, so there's no functional change.

[ Jens Remus: Define CFI_SIGNAL_FRAME for !BUILD_VDSO. ]

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
 arch/x86/include/asm/dwarf2.h | 52 ++++++++++++++++++++++++-----------
 1 file changed, 36 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/dwarf2.h b/arch/x86/include/asm/dwarf2.h
index 09c9684d3ad6..13e2e64ef265 100644
--- a/arch/x86/include/asm/dwarf2.h
+++ b/arch/x86/include/asm/dwarf2.h
@@ -6,6 +6,15 @@
 #warning "asm/dwarf2.h should be only included in pure assembly files"
 #endif
 
+#ifdef BUILD_VDSO
+
+	/*
+	 * For the vDSO, emit both runtime unwind information and debug
+	 * symbols for the .dbg file.
+	 */
+
+	.cfi_sections .eh_frame, .debug_frame
+
 #define CFI_STARTPROC		.cfi_startproc
 #define CFI_ENDPROC		.cfi_endproc
 #define CFI_DEF_CFA		.cfi_def_cfa
@@ -22,21 +31,32 @@
 #define CFI_ESCAPE		.cfi_escape
 #define CFI_SIGNAL_FRAME	.cfi_signal_frame
 
-#ifndef BUILD_VDSO
-	/*
-	 * Emit CFI data in .debug_frame sections, not .eh_frame sections.
-	 * The latter we currently just discard since we don't do DWARF
-	 * unwinding at runtime.  So only the offline DWARF information is
-	 * useful to anyone.  Note we should not use this directive if we
-	 * ever decide to enable DWARF unwinding at runtime.
-	 */
-	.cfi_sections .debug_frame
-#else
-	 /*
-	  * For the vDSO, emit both runtime unwind information and debug
-	  * symbols for the .dbg file.
-	  */
-	.cfi_sections .eh_frame, .debug_frame
-#endif
+#else /* !BUILD_VDSO */
+
+/*
+ * On x86, these macros aren't used outside VDSO.  As well they shouldn't be:
+ * they're fragile and very difficult to maintain.
+ */
+
+.macro nocfi args:vararg
+.endm
+
+#define CFI_STARTPROC		nocfi
+#define CFI_ENDPROC		nocfi
+#define CFI_DEF_CFA		nocfi
+#define CFI_DEF_CFA_REGISTER	nocfi
+#define CFI_DEF_CFA_OFFSET	nocfi
+#define CFI_ADJUST_CFA_OFFSET	nocfi
+#define CFI_OFFSET		nocfi
+#define CFI_REL_OFFSET		nocfi
+#define CFI_REGISTER		nocfi
+#define CFI_RESTORE		nocfi
+#define CFI_REMEMBER_STATE	nocfi
+#define CFI_RESTORE_STATE	nocfi
+#define CFI_UNDEFINED		nocfi
+#define CFI_ESCAPE		nocfi
+#define CFI_SIGNAL_FRAME	nocfi
+
+#endif /* !BUILD_VDSO */
 
 #endif /* _ASM_X86_DWARF2_H */
-- 
2.51.0


^ permalink raw reply related

* [PATCH v11 6/6] x86/vdso: Enable sframe generation in VDSO
From: Jens Remus @ 2026-05-22 11:04 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
	Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James,
	Andy Lutomirski
  Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
	Jose E. Marchesi, Beau Belgrave, Florian Weimer,
	Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
	Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
	Ilya Leoshkevich, Steven Rostedt (Google)
In-Reply-To: <20260522110427.2816637-1-jremus@linux.ibm.com>

From: Josh Poimboeuf <jpoimboe@kernel.org>

Enable .sframe generation in the VDSO library so kernel and user space
can unwind through it.

Starting with binutils 2.46 both GNU assembler and GNU linker
exclusively support generating and merging .sframe in SFrame V3 format.
For x86 SFrame is only supported for x86-64.  Not for x86-32 nor x32.

Test whether the assembler supports option '--gsframe-3' to explicitly
select SFrame V3 format.  Note that testing using Kconfig macro
'as-option' is not sufficient, as GNU assembler will accept the option
for any target, regardless of whether it is actually capable to generate
.sframe for it, as long the input does not trigger the generation.
Therefore it is necessary to use Kconfig macro 'as-instr' to provide
minimal CFI directives that trigger generation of .sframe.

For x86-64 VDSO, only if supported by the assembler, generate .sframe,
collect it, mark it as KEEP, and generate a GNU_SFRAME program table
entry.

For x86-32 and x32 VDSOs, given SFrame is not supported, do not generate
any .sframe nor GNU_SFRAME program table entry.  Instead explicitly
discard any .sframe.  The latter is required for x32 VDSO, as it is
built from x86-64 VDSO objects (potentially with .sframe) converted to
x32.  In this regard discarding .sframe also prevents potential
issues with linkers, such as GNU linker prior to binutils 2.46 commit
7487c98ff07a ("x32: Allow R_X86_64_PC64 for SFrame V3"), that do not
support R_X86_64_PC64 relocations in x32, like those found in .sframe
in SFrame V3 format.

[ Jens Remus: Add support for SFrame V3.  Prevent GNU_SFRAME program
table entry to empty .sframe section.  Reword commit message. ]

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---

Notes (jremus):
    Changes in v9:
    - Always define KEEP_SFRAME to either true/false in specific VDSO linker
      scripts and use #if instead of #ifdef in common one. (Peter)
    - Reword commit message to provide more details.

 arch/Kconfig                                 |  7 +++++++
 arch/x86/entry/vdso/common/vdso-layout.lds.S | 15 +++++++++++++++
 arch/x86/entry/vdso/vdso32/vdso32.lds.S      |  3 +++
 arch/x86/entry/vdso/vdso64/Makefile          |  1 +
 arch/x86/entry/vdso/vdso64/vdso64.lds.S      |  2 ++
 arch/x86/entry/vdso/vdso64/vdsox32.lds.S     |  6 ++++++
 6 files changed, 34 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index e86880045158..79aef9b67645 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -479,6 +479,13 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
 	  It uses the same command line parameters, and sysctl interface,
 	  as the generic hardlockup detectors.
 
+config AS_SFRAME
+	bool
+
+config AS_SFRAME3
+	def_bool $(as-instr,.cfi_startproc\n.cfi_endproc,-Wa$(comma)--gsframe-3)
+	select AS_SFRAME
+
 config UNWIND_USER
 	bool
 
diff --git a/arch/x86/entry/vdso/common/vdso-layout.lds.S b/arch/x86/entry/vdso/common/vdso-layout.lds.S
index 856b8b9d278c..c486b07b195a 100644
--- a/arch/x86/entry/vdso/common/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/common/vdso-layout.lds.S
@@ -60,6 +60,13 @@ SECTIONS
 		*(.eh_frame.*)
 	}					:text
 
+#if KEEP_SFRAME
+	.sframe		: {
+		KEEP (*(.sframe))
+		*(.sframe.*)
+	}					:text	:sframe
+#endif
+
 	/*
 	 * Text is well-separated from actual data: there's plenty of
 	 * stuff that isn't used at runtime in between.
@@ -80,6 +87,10 @@ SECTIONS
 		*(.discard)
 		*(.discard.*)
 		*(__bug_table)
+#if !KEEP_SFRAME
+		*(.sframe)
+		*(.sframe.*)
+#endif
 	}
 }
 
@@ -89,6 +100,7 @@ SECTIONS
 #define PT_GNU_EH_FRAME	0x6474e550
 #define PT_GNU_STACK	0x6474e551
 #define PT_GNU_PROPERTY	0x6474e553
+#define PT_GNU_SFRAME	0x6474e554
 
 /*
  * We must supply the ELF program headers explicitly to get just one
@@ -104,6 +116,9 @@ PHDRS
 	dynamic		PT_DYNAMIC	PF_R;
 	note		PT_NOTE		PF_R;
 	eh_frame_hdr	PT_GNU_EH_FRAME PF_R;
+#if KEEP_SFRAME
+	sframe		PT_GNU_SFRAME	PF_R;
+#endif
 	gnu_stack	PT_GNU_STACK	PF_RW;
 	gnu_property	PT_GNU_PROPERTY	PF_R;
 }
diff --git a/arch/x86/entry/vdso/vdso32/vdso32.lds.S b/arch/x86/entry/vdso/vdso32/vdso32.lds.S
index 55554f80d930..a18b65749ce3 100644
--- a/arch/x86/entry/vdso/vdso32/vdso32.lds.S
+++ b/arch/x86/entry/vdso/vdso32/vdso32.lds.S
@@ -11,6 +11,9 @@
 
 #define BUILD_VDSO32
 
+/* Discard .sframe if any.  SFrame does not support x86-32. */
+#define KEEP_SFRAME	0
+
 #include "common/vdso-layout.lds.S"
 
 /* The ELF entry point can be used to set the AT_SYSINFO value.  */
diff --git a/arch/x86/entry/vdso/vdso64/Makefile b/arch/x86/entry/vdso/vdso64/Makefile
index bfffaf1aeecc..459f8026531e 100644
--- a/arch/x86/entry/vdso/vdso64/Makefile
+++ b/arch/x86/entry/vdso/vdso64/Makefile
@@ -14,6 +14,7 @@ vobjs-$(CONFIG_X86_SGX)		+= vsgx.o
 
 # Compilation flags
 flags-y				:= -DBUILD_VDSO64 -m64 -mcmodel=small
+flags-$(CONFIG_AS_SFRAME3)	+= -Wa,--gsframe-3
 
 # The location of this include matters!
 include $(src)/../common/Makefile.include
diff --git a/arch/x86/entry/vdso/vdso64/vdso64.lds.S b/arch/x86/entry/vdso/vdso64/vdso64.lds.S
index 5ce3f2b6373a..6685cf385fc1 100644
--- a/arch/x86/entry/vdso/vdso64/vdso64.lds.S
+++ b/arch/x86/entry/vdso/vdso64/vdso64.lds.S
@@ -9,6 +9,8 @@
 
 #define BUILD_VDSO64
 
+#define KEEP_SFRAME	IS_ENABLED(CONFIG_AS_SFRAME)
+
 #include "common/vdso-layout.lds.S"
 
 /*
diff --git a/arch/x86/entry/vdso/vdso64/vdsox32.lds.S b/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
index 3dbd20c8dacc..5270fd0bdd0f 100644
--- a/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
+++ b/arch/x86/entry/vdso/vdso64/vdsox32.lds.S
@@ -9,6 +9,12 @@
 
 #define BUILD_VDSOX32
 
+/*
+ * Discard .sframe from x86-64 compiles.  SFrame does not support x32 and
+ * it contains R_X86_64_PC64 relocations, which linkers may not expect.
+ */
+#define KEEP_SFRAME	0
+
 #include "common/vdso-layout.lds.S"
 
 /*
-- 
2.51.0


^ permalink raw reply related

* [PATCH v11 4/6] x86/vdso: Use SYM_FUNC_{START,END} in __kernel_vsyscall()
From: Jens Remus @ 2026-05-22 11:04 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
	Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James,
	Andy Lutomirski
  Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
	Jose E. Marchesi, Beau Belgrave, Florian Weimer,
	Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
	Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
	Ilya Leoshkevich, Steven Rostedt (Google)
In-Reply-To: <20260522110427.2816637-1-jremus@linux.ibm.com>

From: Josh Poimboeuf <jpoimboe@kernel.org>

Use SYM_FUNC_{START,END} instead of all the boilerplate.  No functional
change.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
 arch/x86/entry/vdso/vdso32/system_call.S | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/arch/x86/entry/vdso/vdso32/system_call.S b/arch/x86/entry/vdso/vdso32/system_call.S
index 9157cf9c5749..a90f4f7de396 100644
--- a/arch/x86/entry/vdso/vdso32/system_call.S
+++ b/arch/x86/entry/vdso/vdso32/system_call.S
@@ -9,11 +9,7 @@
 #include <asm/alternative.h>
 
 	.text
-	.globl __kernel_vsyscall
-	.type __kernel_vsyscall,@function
-	ALIGN
-__kernel_vsyscall:
-	CFI_STARTPROC
+SYM_FUNC_START(__kernel_vsyscall)
 
 	/*
 	 * If using int $0x80, there is no reason to muck about with the
@@ -85,7 +81,5 @@ SYM_INNER_LABEL(int80_landing_pad, SYM_L_GLOBAL)
 	CFI_RESTORE		ecx
 	CFI_ADJUST_CFA_OFFSET	-4
 	RET
-	CFI_ENDPROC
-
-	.size __kernel_vsyscall,.-__kernel_vsyscall
+SYM_FUNC_END(__kernel_vsyscall)
 	.previous
-- 
2.51.0


^ permalink raw reply related

* [PATCH v11 3/6] x86/asm: Use CFI_* macros in SYM_FUNC_* macros so they can be added to VDSO
From: Jens Remus @ 2026-05-22 11:04 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
	Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James,
	Andy Lutomirski
  Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
	Jose E. Marchesi, Beau Belgrave, Florian Weimer,
	Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
	Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
	Ilya Leoshkevich, Steven Rostedt (Google)
In-Reply-To: <20260522110427.2816637-1-jremus@linux.ibm.com>

From: Josh Poimboeuf <jpoimboe@kernel.org>

Add CFI_STARTPROC and CFI_ENDPROC annotations to the SYM_FUNC_* macros
so the VDSO asm functions don't need to add them manually.  Note this
only affects VDSO, the CFI_* macros are empty for the kernel proper.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
 arch/x86/entry/vdso/common/vdso-layout.lds.S  |  2 +-
 .../x86/entry/vdso/vdso64/vgetrandom-chacha.S |  2 --
 arch/x86/entry/vdso/vdso64/vsgx.S             |  4 ---
 arch/x86/include/asm/linkage.h                | 33 +++++++++++++++----
 arch/x86/include/asm/vdso.h                   |  1 -
 5 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/vdso/common/vdso-layout.lds.S b/arch/x86/entry/vdso/common/vdso-layout.lds.S
index a1e30be3e83d..856b8b9d278c 100644
--- a/arch/x86/entry/vdso/common/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/common/vdso-layout.lds.S
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-#include <asm/vdso.h>
+#include <asm/page_types.h>
 #include <asm/vdso/vsyscall.h>
 #include <vdso/datapage.h>
 
diff --git a/arch/x86/entry/vdso/vdso64/vgetrandom-chacha.S b/arch/x86/entry/vdso/vdso64/vgetrandom-chacha.S
index cc82da9216fb..a33212594731 100644
--- a/arch/x86/entry/vdso/vdso64/vgetrandom-chacha.S
+++ b/arch/x86/entry/vdso/vdso64/vgetrandom-chacha.S
@@ -22,7 +22,6 @@ CONSTANTS:	.octa 0x6b20657479622d323320646e61707865
  *	rcx: number of 64-byte blocks to write to output
  */
 SYM_FUNC_START(__arch_chacha20_blocks_nostack)
-	CFI_STARTPROC
 .set	output,		%rdi
 .set	key,		%rsi
 .set	counter,	%rdx
@@ -175,5 +174,4 @@ SYM_FUNC_START(__arch_chacha20_blocks_nostack)
 	pxor		temp,temp
 
 	ret
-	CFI_ENDPROC
 SYM_FUNC_END(__arch_chacha20_blocks_nostack)
diff --git a/arch/x86/entry/vdso/vdso64/vsgx.S b/arch/x86/entry/vdso/vdso64/vsgx.S
index 37a3d4c02366..c0342238c976 100644
--- a/arch/x86/entry/vdso/vdso64/vsgx.S
+++ b/arch/x86/entry/vdso/vdso64/vsgx.S
@@ -24,8 +24,6 @@
 .section .text, "ax"
 
 SYM_FUNC_START(__vdso_sgx_enter_enclave)
-	/* Prolog */
-	.cfi_startproc
 	push	%rbp
 	.cfi_adjust_cfa_offset	8
 	.cfi_rel_offset		%rbp, 0
@@ -143,8 +141,6 @@ SYM_FUNC_START(__vdso_sgx_enter_enclave)
 	jle	.Lout
 	jmp	.Lenter_enclave
 
-	.cfi_endproc
-
 _ASM_VDSO_EXTABLE_HANDLE(.Lenclu_eenter_eresume, .Lhandle_exception)
 
 SYM_FUNC_END(__vdso_sgx_enter_enclave)
diff --git a/arch/x86/include/asm/linkage.h b/arch/x86/include/asm/linkage.h
index a7294656ad90..c2ca8117376f 100644
--- a/arch/x86/include/asm/linkage.h
+++ b/arch/x86/include/asm/linkage.h
@@ -40,6 +40,10 @@
 
 #ifdef __ASSEMBLER__
 
+#ifndef LINKER_SCRIPT
+#include <asm/dwarf2.h>
+#endif
+
 #if defined(CONFIG_MITIGATION_RETHUNK) && !defined(__DISABLE_EXPORTS) && !defined(BUILD_VDSO)
 #define RET	jmp __x86_return_thunk
 #else /* CONFIG_MITIGATION_RETPOLINE */
@@ -112,34 +116,51 @@
 # define SYM_FUNC_ALIAS_MEMFUNC	SYM_FUNC_ALIAS
 #endif
 
+#define __SYM_FUNC_START				\
+	CFI_STARTPROC ASM_NL
+
+#define __SYM_FUNC_END					\
+	CFI_ENDPROC ASM_NL
+
 /* SYM_TYPED_FUNC_START -- use for indirectly called globals, w/ CFI type */
 #define SYM_TYPED_FUNC_START(name)				\
 	SYM_TYPED_START(name, SYM_L_GLOBAL, SYM_F_ALIGN)	\
+	__SYM_FUNC_START					\
 	ENDBR
 
 /* SYM_FUNC_START -- use for global functions */
 #define SYM_FUNC_START(name)				\
-	SYM_START(name, SYM_L_GLOBAL, SYM_F_ALIGN)
+	SYM_START(name, SYM_L_GLOBAL, SYM_F_ALIGN)	\
+	__SYM_FUNC_START
 
 /* SYM_FUNC_START_NOALIGN -- use for global functions, w/o alignment */
 #define SYM_FUNC_START_NOALIGN(name)			\
-	SYM_START(name, SYM_L_GLOBAL, SYM_A_NONE)
+	SYM_START(name, SYM_L_GLOBAL, SYM_A_NONE)	\
+	__SYM_FUNC_START
 
 /* SYM_FUNC_START_LOCAL -- use for local functions */
 #define SYM_FUNC_START_LOCAL(name)			\
-	SYM_START(name, SYM_L_LOCAL, SYM_F_ALIGN)
+	SYM_START(name, SYM_L_LOCAL, SYM_F_ALIGN)	\
+	__SYM_FUNC_START
 
 /* SYM_FUNC_START_LOCAL_NOALIGN -- use for local functions, w/o alignment */
 #define SYM_FUNC_START_LOCAL_NOALIGN(name)		\
-	SYM_START(name, SYM_L_LOCAL, SYM_A_NONE)
+	SYM_START(name, SYM_L_LOCAL, SYM_A_NONE)	\
+	__SYM_FUNC_START
 
 /* SYM_FUNC_START_WEAK -- use for weak functions */
 #define SYM_FUNC_START_WEAK(name)			\
-	SYM_START(name, SYM_L_WEAK, SYM_F_ALIGN)
+	SYM_START(name, SYM_L_WEAK, SYM_F_ALIGN)	\
+	__SYM_FUNC_START
 
 /* SYM_FUNC_START_WEAK_NOALIGN -- use for weak functions, w/o alignment */
 #define SYM_FUNC_START_WEAK_NOALIGN(name)		\
-	SYM_START(name, SYM_L_WEAK, SYM_A_NONE)
+	SYM_START(name, SYM_L_WEAK, SYM_A_NONE)		\
+	__SYM_FUNC_START
+
+#define SYM_FUNC_END(name)				\
+	__SYM_FUNC_END					\
+	SYM_END(name, SYM_T_FUNC)
 
 /*
  * Expose 'sym' to the startup code in arch/x86/boot/startup/, by emitting an
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index f2d49212ae90..bbe270483e3e 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -2,7 +2,6 @@
 #ifndef _ASM_X86_VDSO_H
 #define _ASM_X86_VDSO_H
 
-#include <asm/page_types.h>
 #include <linux/linkage.h>
 #include <linux/init.h>
 
-- 
2.51.0


^ permalink raw reply related

* [PATCH v11 5/6] x86/vdso: Use CFI macros in __vdso_sgx_enter_enclave()
From: Jens Remus @ 2026-05-22 11:04 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
	Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James,
	Andy Lutomirski
  Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
	Jose E. Marchesi, Beau Belgrave, Florian Weimer,
	Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
	Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
	Ilya Leoshkevich, Steven Rostedt (Google)
In-Reply-To: <20260522110427.2816637-1-jremus@linux.ibm.com>

From: Josh Poimboeuf <jpoimboe@kernel.org>

Use the CFI macros instead of the raw .cfi_* directives to be consistent
with the rest of the VDSO asm.  It's also easier on the eyes.

No functional changes.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
 arch/x86/entry/vdso/vdso64/vsgx.S | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/x86/entry/vdso/vdso64/vsgx.S b/arch/x86/entry/vdso/vdso64/vsgx.S
index c0342238c976..76efbeb1e287 100644
--- a/arch/x86/entry/vdso/vdso64/vsgx.S
+++ b/arch/x86/entry/vdso/vdso64/vsgx.S
@@ -25,12 +25,12 @@
 
 SYM_FUNC_START(__vdso_sgx_enter_enclave)
 	push	%rbp
-	.cfi_adjust_cfa_offset	8
-	.cfi_rel_offset		%rbp, 0
+	CFI_ADJUST_CFA_OFFSET	8
+	CFI_REL_OFFSET		%rbp, 0
 	mov	%rsp, %rbp
-	.cfi_def_cfa_register	%rbp
+	CFI_DEF_CFA_REGISTER	%rbp
 	push	%rbx
-	.cfi_rel_offset		%rbx, -8
+	CFI_REL_OFFSET		%rbx, -8
 
 	mov	%ecx, %eax
 .Lenter_enclave:
@@ -77,13 +77,11 @@ SYM_FUNC_START(__vdso_sgx_enter_enclave)
 .Lout:
 	pop	%rbx
 	leave
-	.cfi_def_cfa		%rsp, 8
+	CFI_DEF_CFA		%rsp, 8
 	RET
 
-	/* The out-of-line code runs with the pre-leave stack frame. */
-	.cfi_def_cfa		%rbp, 16
-
 .Linvalid_input:
+	CFI_DEF_CFA		%rbp, 16
 	mov	$(-EINVAL), %eax
 	jmp	.Lout
 
-- 
2.51.0


^ permalink raw reply related

* [PATCH v11 1/6] x86/vdso: Fix DWARF generation for getrandom()
From: Jens Remus @ 2026-05-22 11:04 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
	Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James,
	Andy Lutomirski
  Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
	Jose E. Marchesi, Beau Belgrave, Florian Weimer,
	Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
	Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
	Ilya Leoshkevich, Steven Rostedt (Google)
In-Reply-To: <20260522110427.2816637-1-jremus@linux.ibm.com>

From: Josh Poimboeuf <jpoimboe@kernel.org>

Add CFI annotations to the VDSO implementation of getrandom() so it will
have valid DWARF unwinding metadata.

Fixes: 33385150ac45 ("x86: vdso: Wire up getrandom() vDSO implementation")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
 arch/x86/entry/vdso/vdso64/vgetrandom-chacha.S | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/vdso/vdso64/vgetrandom-chacha.S b/arch/x86/entry/vdso/vdso64/vgetrandom-chacha.S
index bcba5639b8ee..cc82da9216fb 100644
--- a/arch/x86/entry/vdso/vdso64/vgetrandom-chacha.S
+++ b/arch/x86/entry/vdso/vdso64/vgetrandom-chacha.S
@@ -4,7 +4,7 @@
  */
 
 #include <linux/linkage.h>
-#include <asm/frame.h>
+#include <asm/dwarf2.h>
 
 .section	.rodata, "a"
 .align 16
@@ -22,7 +22,7 @@ CONSTANTS:	.octa 0x6b20657479622d323320646e61707865
  *	rcx: number of 64-byte blocks to write to output
  */
 SYM_FUNC_START(__arch_chacha20_blocks_nostack)
-
+	CFI_STARTPROC
 .set	output,		%rdi
 .set	key,		%rsi
 .set	counter,	%rdx
@@ -175,4 +175,5 @@ SYM_FUNC_START(__arch_chacha20_blocks_nostack)
 	pxor		temp,temp
 
 	ret
+	CFI_ENDPROC
 SYM_FUNC_END(__arch_chacha20_blocks_nostack)
-- 
2.51.0


^ permalink raw reply related

* [PATCH v11 0/6] x86/vdso: VDSO updates and fixes for sframes
From: Jens Remus @ 2026-05-22 11:04 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
	Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James,
	Andy Lutomirski
  Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
	Jose E. Marchesi, Beau Belgrave, Florian Weimer,
	Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
	Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
	Ilya Leoshkevich

This enables generation of SFrame V3 stack trace information for VDSO on
x86-64.  It's a continuation of Josh's and Steve's work:

   https://lore.kernel.org/all/cover.1737511963.git.jpoimboe@kernel.org/
   https://lore.kernel.org/all/20250425023750.669174660@goodmis.org/

This series focuses only on the VDSO code. They are helpful fixes
and updates that doesn't rely on sframes (although the last patch
is sframe related).

This series applies on top of tip:master (b07a332d9cbb):

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git  master

Like the unwind user sframe series [1] it depends on the binutils 2.46
release to be used to build the VDSO with SFrame V3 stack trace
information (using the assembler option --gsframe-3).

[1]: [PATCH v14 00/19] unwind_deferred: Implement sframe handling,
    https://lore.kernel.org/all/20260505121718.3572346-1-jremus@linux.ibm.com/

Changes in v11:
- Rebased on tip:master (b07a332d9cbb).
- This time with correct version in cover letter.

Changes in v10:
- Rebased on tip:master (a375022d6383).

Changes in v9 (see indivicual patch notes):
- Always define KEEP_SFRAME to either true/false in specific VDSO linker
  scripts and use #if instead of #ifdef in common one. (Peter)
- Reword patch 6 commit message to provide more details.
- Note: Binutils 2.46 with SFrame V3 support has been released.

Changes in v8:
- Discard .sframe for x32 and x86-32 VDSOs. (Josh/Indu)
- Define CFI_SIGNAL_FRAME for !BUILD_VDSO.
- Drop .cfi_sections .sframe in dwarf2.h in favor of the explicitly
  specified more specific assembler option --gsframe-3.
- Incorporate missing changes and review feedback from Steven's v6
  (I erroneously based my v6 on Steven's v5):
  - Reword patch 3 commit subject to Steven's v6 one.
  - Remove SYM_F_ALIGN in __vdso_sgx_enter_enclave(). (Josh)

Changes in v7:
- Rebase on H. Peter Anvin's vDSO changes on tip:x86/entry. (Peter)
- Simplify adding assembler option -Wa,--gsframe-3.  Add for vdso64
  only.
- Align to .eh_frame and mark .sframe as KEEP in vDSO linker script.
  Note that GNU linker 2.46 will mark .sframe as KEEP in its default
  linker script as well.

Changes in v6:
- SFrame V3 support (SFrame V2 is not supported).
- Prevent GNU_SFRAME program table entry to empty .sframe section.
- Integrate v5 review feedback. (Josh)

Regards,
Jens


Josh Poimboeuf (6):
  x86/vdso: Fix DWARF generation for getrandom()
  x86/asm: Avoid emitting DWARF CFI for non-VDSO
  x86/asm: Use CFI_* macros in SYM_FUNC_* macros so they can be added to
    VDSO
  x86/vdso: Use SYM_FUNC_{START,END} in __kernel_vsyscall()
  x86/vdso: Use CFI macros in __vdso_sgx_enter_enclave()
  x86/vdso: Enable sframe generation in VDSO

 arch/Kconfig                                  |  7 +++
 arch/x86/entry/vdso/common/vdso-layout.lds.S  | 17 +++++-
 arch/x86/entry/vdso/vdso32/system_call.S      | 10 +---
 arch/x86/entry/vdso/vdso32/vdso32.lds.S       |  3 ++
 arch/x86/entry/vdso/vdso64/Makefile           |  1 +
 arch/x86/entry/vdso/vdso64/vdso64.lds.S       |  2 +
 arch/x86/entry/vdso/vdso64/vdsox32.lds.S      |  6 +++
 .../x86/entry/vdso/vdso64/vgetrandom-chacha.S |  3 +-
 arch/x86/entry/vdso/vdso64/vsgx.S             | 18 +++----
 arch/x86/include/asm/dwarf2.h                 | 52 +++++++++++++------
 arch/x86/include/asm/linkage.h                | 33 +++++++++---
 arch/x86/include/asm/vdso.h                   |  1 -
 12 files changed, 107 insertions(+), 46 deletions(-)

-- 
2.51.0


^ permalink raw reply

* [PATCH v2 3/3] trace: add documentation, selftest and tooling for stackmap
From: Li Pengfei @ 2026-05-22 10:40 UTC (permalink / raw)
  To: linux-trace-kernel
  Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28,
	lkp
In-Reply-To: <20260522104017.1668638-1-lipengfei28@xiaomi.com>

From: Pengfei Li <lipengfei28@xiaomi.com>

Add supporting files for the ftrace stackmap feature:

Documentation/trace/ftrace-stackmap.rst:
  Documentation covering design, usage, tracefs interface, binary
  format, and performance characteristics. Added to the 'Core Tracing
  Frameworks' toctree in Documentation/trace/index.rst. Documents:
  - Reset requires tracing to be stopped first
  - Boot-time activation via trace_options=stackmap
  - bits parameter range [10, 18] and worst-case memory usage
  - tracefs file modes (0640 / 0440)
  - Best-effort snapshot semantics for stack_map_bin
  - Counter naming: successes (events served), drops, success_rate

tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc:
  Functional selftest verifying:
  - stackmap tracefs nodes exist
  - enabling stackmap + stacktrace produces stack_id events
  - stack_map_stat shows non-zero successes and zero drops
  - reset clears entries when tracing is stopped
  - reset is rejected (-EBUSY) while tracing is active
  Uses an EXIT trap to restore options/stackmap and options/stacktrace
  on any exit path.

tools/tracing/stackmap_dump.py:
  Python script to parse the binary stack_map_bin export.
  Features:
  - Automatic endianness detection via magic number
  - Batched addr2line via stdin (avoids ARG_MAX with large stacks)
  - JSON output mode
  - Top-N filtering by ref_count

Binary format: all fields are native-endian. The parser detects
byte order by reading the magic value (0x464D5342 = 'FSMB').

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202605160010.fakzGVVq-lkp@intel.com/
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
 Documentation/trace/ftrace-stackmap.rst       | 145 +++++++++++++++++
 Documentation/trace/index.rst                 |   1 +
 .../ftrace/test.d/ftrace/stackmap-basic.tc    | 100 ++++++++++++
 tools/tracing/stackmap_dump.py                | 150 ++++++++++++++++++
 4 files changed, 396 insertions(+)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100755 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100755 tools/tracing/stackmap_dump.py

diff --git a/Documentation/trace/ftrace-stackmap.rst b/Documentation/trace/ftrace-stackmap.rst
new file mode 100644
index 000000000000..1230d44d1d23
--- /dev/null
+++ b/Documentation/trace/ftrace-stackmap.rst
@@ -0,0 +1,145 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Ftrace Stack Map
+======================
+
+:Author: Pengfei Li <lipengfei28@xiaomi.com>
+
+Overview
+========
+
+The ftrace stack map provides stack trace deduplication for the ftrace
+ring buffer. When enabled, instead of storing full kernel stack traces
+(typically 80-160 bytes each) in the ring buffer for every event, ftrace
+stores only a 4-byte ``stack_id``. The full stacks are maintained in a
+separate hash table and exported via tracefs for userspace to resolve.
+
+This is inspired by eBPF's ``BPF_MAP_TYPE_STACK_TRACE`` but integrated
+into ftrace's infrastructure, requiring no userspace daemon.
+
+Configuration
+=============
+
+Enable ``CONFIG_FTRACE_STACKMAP=y`` in the kernel config.
+
+Kernel command line parameters:
+
+- ``ftrace_stackmap.bits=N`` - Set map capacity to 2^N unique stacks
+  (default: 14 → 16384 stacks; valid range: 10-18).
+
+  At ``bits=18`` the kernel reserves roughly 130 MB of vmalloc memory
+  for the element pool. Each ``open()`` of ``stack_map_bin`` may
+  briefly allocate a similar amount for a snapshot. The cap is set
+  intentionally to bound memory usage.
+
+Usage
+=====
+
+Enable stack deduplication::
+
+    echo 1 > /sys/kernel/debug/tracing/options/stackmap
+    echo 1 > /sys/kernel/debug/tracing/options/stacktrace
+    echo function > /sys/kernel/debug/tracing/current_tracer
+
+The trace output will show ``<stack_id N>`` instead of full stack traces::
+
+    sh-1234 [006] d.h.. 123.456789: <stack_id 42>
+
+To view the actual stacks::
+
+    cat /sys/kernel/debug/tracing/stack_map
+
+Output format::
+
+    stack_id 42 [ref 1337, depth 8]
+      [0] schedule+0x48/0xc0
+      [1] schedule_timeout+0x1c/0x30
+      ...
+
+To view statistics::
+
+    cat /sys/kernel/debug/tracing/stack_map_stat
+
+Output::
+
+    entries:      2500 / 16384
+    table_size:   32768
+    successes:    148923
+    drops:        0
+    success_rate: 100%
+
+To reset the stack map (tracing must be stopped first)::
+
+    echo 0 > /sys/kernel/debug/tracing/tracing_on
+    echo 0 > /sys/kernel/debug/tracing/stack_map
+
+Reset returns ``-EBUSY`` if tracing is currently active, or if another
+reset is already in progress.
+
+Boot-time activation
+====================
+
+The stackmap option can be enabled from the kernel command line::
+
+    trace_options=stackmap,stacktrace
+
+Trace events that fire before the tracefs filesystem is initialized
+(``fs_initcall`` time) fall back to recording full stack traces; once
+``ftrace_stackmap_create()`` runs, subsequent events are deduplicated.
+The crossover is automatic and lossless — no events are dropped, but
+early-boot stacks recorded before the crossover are not deduplicated.
+
+Tracefs Nodes
+=============
+
+The stack_map files are owned by root and not world-readable
+(``stack_map``: 0640; ``stack_map_stat`` and ``stack_map_bin``: 0440).
+
+``stack_map``
+    Text export of all deduplicated stacks with symbol resolution.
+    Writing ``0`` or ``reset`` clears all entries (only when tracing
+    is stopped).
+
+``stack_map_stat``
+    Statistics: entry count, hits, drops, and hit rate.
+
+``stack_map_bin``
+    Binary export for efficient userspace consumption. Format:
+
+    - Header (16 bytes): magic(u32) + version(u32) + nr_stacks(u32) + reserved(u32)
+    - Per stack: stack_id(u32) + nr(u32) + ref_count(u32) + reserved(u32) + ips(u64 × nr)
+
+    All fields are written in the kernel's native byte order.
+    Userspace tools detect endianness by reading the magic value.
+    Magic: ``0x464D5342`` ('FSMB'), Version: 2.
+
+    The export is a best-effort snapshot allocated at ``open()``;
+    concurrent inserts during the snapshot may be truncated. A
+    bounds check ensures no overflow.
+
+Design
+======
+
+The stack map is modeled after ``tracing_map.c`` (used by hist triggers),
+using a lock-free design based on Dr. Cliff Click's non-blocking hash table
+algorithm:
+
+- **Lookup/Insert**: Lock-free via ``cmpxchg``, safe in NMI/IRQ/any context
+- **Memory**: Pre-allocated element pool, zero allocation on the hot path
+  (no GFP_ATOMIC failures under memory pressure)
+- **Collision**: Linear probing with a 2x over-provisioned table; probe
+  length is bounded so worst-case insert/lookup is O(1)
+- **Scope**: Currently supports the global trace instance
+- **Hash**: 32-bit jhash with a per-instance random seed; full ``memcmp``
+  confirms matches
+
+Performance
+===========
+
+Typical results on ARM64 Android device (function tracer, 2 seconds):
+
+- Unique stacks: ~3000
+- Hit rate: 84-98% (depends on workload diversity)
+- Ring buffer savings: ~80% for stack data
+- Overhead per event: ~50ns (one jhash + hash table lookup)
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5d9bf4694d5d..ac8b1141c23a 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -33,6 +33,7 @@ the Linux kernel.
    ftrace
    ftrace-design
    ftrace-uses
+   ftrace-stackmap
    kprobes
    kprobetrace
    fprobetrace
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
new file mode 100755
index 000000000000..34e4e31ff7a1
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
@@ -0,0 +1,100 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap basic functionality
+# requires: stack_map options/stackmap
+
+# Test that ftrace stackmap deduplication works:
+# 1. Enable stackmap + stacktrace options
+# 2. Run function tracer briefly
+# 3. Verify stack_map has entries
+# 4. Verify stack_map_stat shows successes and zero drops
+# 5. Verify trace contains <stack_id> events
+# 6. Verify reset works when tracing is stopped
+# 7. Verify reset is rejected (-EBUSY) while tracing is active
+
+fail() {
+    echo "FAIL: $1"
+    exit_fail
+}
+
+# Restore state on any exit (success, fail, or interrupt) so a
+# half-finished test does not leave stacktrace/stackmap enabled.
+cleanup() {
+    disable_tracing 2>/dev/null
+    echo nop > current_tracer 2>/dev/null
+    echo 0 > options/stackmap 2>/dev/null
+    echo 0 > options/stacktrace 2>/dev/null
+}
+trap cleanup EXIT
+
+disable_tracing
+clear_trace
+
+# Verify stackmap files exist
+test -f stack_map      || fail "stack_map file missing"
+test -f stack_map_stat || fail "stack_map_stat file missing"
+test -f stack_map_bin  || fail "stack_map_bin file missing"
+
+# Enable stackmap dedup
+echo 1 > options/stackmap
+echo 1 > options/stacktrace
+
+# Run function tracer briefly
+echo function > current_tracer
+enable_tracing
+sleep 1
+disable_tracing
+echo nop > current_tracer
+echo 0 > options/stackmap
+
+# Check stack_map_stat has entries (default empty to avoid [: too many args)
+entries=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+: "${entries:=0}"
+if [ "$entries" -eq 0 ]; then
+    fail "stackmap has zero entries after tracing"
+fi
+
+# Check successes > 0
+successes=$(cat stack_map_stat | grep "^successes:" | awk '{print $2}')
+: "${successes:=0}"
+if [ "$successes" -eq 0 ]; then
+    fail "stackmap has zero successes"
+fi
+
+# Check drops == 0 (pool should be large enough for 1s trace)
+drops=$(cat stack_map_stat | grep "^drops:" | awk '{print $2}')
+: "${drops:=0}"
+if [ "$drops" -ne 0 ]; then
+    fail "stackmap had $drops drops (pool exhausted?)"
+fi
+
+# Check stack_map text output is parseable
+first_id=$(cat stack_map | grep "^stack_id" | head -1 | awk '{print $2}')
+if [ -z "$first_id" ]; then
+    fail "stack_map output has no stack_id entries"
+fi
+
+# Check trace has stack_id events
+count=$(grep -c "stack_id" trace || true)
+if [ "$count" -eq 0 ]; then
+    fail "trace has no <stack_id> events"
+fi
+
+# Test reset (tracing must be stopped — disable_tracing was called above)
+echo 0 > stack_map
+entries_after=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+: "${entries_after:=-1}"
+if [ "$entries_after" -ne 0 ]; then
+    fail "stackmap reset did not clear entries (got $entries_after)"
+fi
+
+# Test that reset is rejected while tracing is active
+enable_tracing
+if echo 0 > stack_map 2>/dev/null; then
+    disable_tracing
+    fail "stackmap reset should fail while tracing is active"
+fi
+disable_tracing
+
+echo "stackmap basic test passed: $entries unique stacks, $successes successes, $drops drops"
+exit 0
diff --git a/tools/tracing/stackmap_dump.py b/tools/tracing/stackmap_dump.py
new file mode 100755
index 000000000000..fc5d0c9cf0af
--- /dev/null
+++ b/tools/tracing/stackmap_dump.py
@@ -0,0 +1,150 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+"""
+stackmap_dump.py - Parse and display ftrace stack_map_bin binary export.
+
+Usage:
+    # Pull from device and parse
+    adb pull /sys/kernel/debug/tracing/stack_map_bin /tmp/stack_map.bin
+    python3 stackmap_dump.py /tmp/stack_map.bin
+
+    # With vmlinux for offline symbol resolution
+    python3 stackmap_dump.py /tmp/stack_map.bin --vmlinux vmlinux
+
+    # JSON output for tooling
+    python3 stackmap_dump.py /tmp/stack_map.bin --json
+"""
+
+import struct
+import sys
+import argparse
+import json
+import subprocess
+
+MAGIC = 0x464D5342  # 'FSMB'
+HEADER_SIZE = 16  # 4 x u32
+ENTRY_SIZE = 16   # 4 x u32
+
+
+def detect_endianness(data):
+    """Detect byte order from magic number in header."""
+    if len(data) < 4:
+        raise ValueError("File too small")
+    magic_le = struct.unpack_from('<I', data, 0)[0]
+    if magic_le == MAGIC:
+        return '<'
+    magic_be = struct.unpack_from('>I', data, 0)[0]
+    if magic_be == MAGIC:
+        return '>'
+    raise ValueError(f"Bad magic: 0x{magic_le:08x} (neither LE nor BE)")
+
+
+def batch_addr2line(vmlinux, addrs):
+    """Resolve multiple addresses in one addr2line invocation."""
+    if not addrs:
+        return {}
+    try:
+        # Feed addresses on stdin to avoid ARG_MAX limits with large
+        # numbers of addresses (one stack can have 30+ frames; a
+        # snapshot can have thousands of unique stacks).
+        stdin = '\n'.join(hex(a) for a in addrs) + '\n'
+        result = subprocess.run(
+            ['addr2line', '-f', '-e', vmlinux],
+            input=stdin, capture_output=True, text=True, timeout=60
+        )
+        lines = result.stdout.split('\n')
+        # addr2line outputs 2 lines per address: function name + source location
+        symbols = {}
+        for i, addr in enumerate(addrs):
+            idx = i * 2
+            if idx < len(lines) and lines[idx] and lines[idx] != '??':
+                symbols[addr] = lines[idx]
+        return symbols
+    except (subprocess.TimeoutExpired, FileNotFoundError) as e:
+        print(f"warning: addr2line failed: {e}", file=sys.stderr)
+        return {}
+
+
+def parse_stackmap_bin(data):
+    """Parse binary stackmap data, yield (stack_id, ref_count, [ips])."""
+    if len(data) < HEADER_SIZE:
+        raise ValueError("File too small for header")
+
+    endian = detect_endianness(data)
+    header_fmt = f'{endian}IIII'
+    entry_fmt = f'{endian}IIII'
+
+    magic, version, nr_stacks, _ = struct.unpack_from(header_fmt, data, 0)
+    if version not in (1, 2):
+        raise ValueError(f"Unsupported version: {version}")
+
+    offset = HEADER_SIZE
+    for _ in range(nr_stacks):
+        if offset + ENTRY_SIZE > len(data):
+            break
+        stack_id, nr, ref_count, _ = struct.unpack_from(entry_fmt, data, offset)
+        offset += ENTRY_SIZE
+
+        ips_size = nr * 8
+        if offset + ips_size > len(data):
+            break
+        ips = struct.unpack_from(f'{endian}{nr}Q', data, offset)
+        offset += ips_size
+
+        yield stack_id, ref_count, list(ips)
+
+
+def main():
+    parser = argparse.ArgumentParser(description='Parse ftrace stack_map_bin')
+    parser.add_argument('file', help='Path to stack_map_bin file')
+    parser.add_argument('--vmlinux', help='Path to vmlinux for symbol resolution')
+    parser.add_argument('--json', action='store_true', help='JSON output')
+    parser.add_argument('--top', type=int, default=0,
+                        help='Show only top N stacks by ref_count')
+    args = parser.parse_args()
+
+    with open(args.file, 'rb') as f:
+        data = f.read()
+
+    stacks = list(parse_stackmap_bin(data))
+
+    if args.top > 0:
+        stacks.sort(key=lambda x: x[1], reverse=True)
+        stacks = stacks[:args.top]
+
+    # Batch symbol resolution
+    symbols = {}
+    if args.vmlinux:
+        all_addrs = set()
+        for _, _, ips in stacks:
+            all_addrs.update(ips)
+        symbols = batch_addr2line(args.vmlinux, list(all_addrs))
+
+    if args.json:
+        output = []
+        for stack_id, ref_count, ips in stacks:
+            entry = {
+                'stack_id': stack_id,
+                'ref_count': ref_count,
+                'ips': [f'0x{ip:x}' for ip in ips]
+            }
+            if args.vmlinux:
+                entry['symbols'] = [symbols.get(ip, f'0x{ip:x}')
+                                    for ip in ips]
+            output.append(entry)
+        print(json.dumps(output, indent=2))
+    else:
+        for stack_id, ref_count, ips in stacks:
+            print(f"stack_id {stack_id} [ref {ref_count}, depth {len(ips)}]")
+            for i, ip in enumerate(ips):
+                sym = symbols.get(ip, '')
+                if sym:
+                    sym = f' {sym}'
+                print(f"  [{i}] 0x{ip:x}{sym}")
+            print()
+
+    print(f"Total: {len(stacks)} unique stacks", file=sys.stderr)
+
+
+if __name__ == '__main__':
+    main()
-- 
2.34.1


^ permalink raw reply related

* [PATCH v2 2/3] trace: integrate stackmap into ftrace stack recording path
From: Li Pengfei @ 2026-05-22 10:40 UTC (permalink / raw)
  To: linux-trace-kernel
  Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28,
	lkp
In-Reply-To: <20260522104017.1668638-1-lipengfei28@xiaomi.com>

From: Pengfei Li <lipengfei28@xiaomi.com>

Add TRACE_STACK_ID event type and integrate ftrace_stackmap into
__ftrace_trace_stack(). When the 'stackmap' trace option is enabled,
the stack recording path stores a 4-byte stack_id in the ring buffer
instead of the full stack trace.

Changes:
- New TRACE_STACK_ID in trace_type enum
- New stack_id_entry in trace_entries.h
- New TRACE_ITER(STACKMAP) trace option flag; when CONFIG_FTRACE_STACKMAP
  is disabled, TRACE_ITER_STACKMAP_BIT is defined as -1 so that
  TRACE_ITER(STACKMAP) evaluates to 0 (following the existing pattern
  used by TRACE_ITER_PROF_TEXT_OFFSET)
- Modified __ftrace_trace_stack() to call ftrace_stackmap_get_id()
  when the stackmap option is active
- Stackmap pointer read with smp_load_acquire(), published with
  smp_store_release() to ensure proper initialization ordering
- NULL check on tr->stackmap prevents dereference if creation failed
  or if used on a secondary trace instance (graceful fallback)
- ftrace_stackmap_create() takes the owning trace_array so the
  stackmap can later check tracing state during reset
- Added stack_id print handler in trace_output.c

Fallback behavior: if stackmap returns an error (pool exhausted,
resetting, or NULL pointer), the full stack trace is recorded as
before — no new failure modes introduced.

Note: stackmap is currently initialized only for the global trace
instance. Secondary instances fall back to full stack recording.

Usage:
  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace

Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
 kernel/trace/trace.c         | 66 ++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.h         | 16 +++++++++
 kernel/trace/trace_entries.h | 15 ++++++++
 kernel/trace/trace_output.c  | 23 +++++++++++++
 4 files changed, 120 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..49a675dffad5 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -57,6 +57,7 @@
 
 #include "trace.h"
 #include "trace_output.h"
+#include "trace_stackmap.h"
 
 #ifdef CONFIG_FTRACE_STARTUP_TEST
 /*
@@ -2184,6 +2185,43 @@ void __ftrace_trace_stack(struct trace_array *tr,
 	}
 #endif
 
+#ifdef CONFIG_FTRACE_STACKMAP
+	/*
+	 * If stackmap dedup is enabled, try to store only the stack_id
+	 * in the ring buffer instead of the full stack trace.
+	 */
+	if (tr->trace_flags & TRACE_ITER(STACKMAP)) {
+		struct ftrace_stackmap *smap;
+		struct stack_id_entry *sid_entry;
+		int sid;
+
+		smap = smp_load_acquire(&tr->stackmap);
+		if (!smap)
+			goto full_stack;
+
+		sid = ftrace_stackmap_get_id(smap, fstack->calls, nr_entries);
+		if (sid >= 0) {
+			event = __trace_buffer_lock_reserve(buffer,
+					TRACE_STACK_ID,
+					sizeof(*sid_entry), trace_ctx);
+			if (!event)
+				goto out;
+			sid_entry = ring_buffer_event_data(event);
+			sid_entry->stack_id = sid;
+			/*
+			 * stack_id is a synthetic side-event attached to a
+			 * primary trace event that was already subject to
+			 * filtering. No per-event filter is defined for
+			 * TRACE_STACK_ID, so commit unconditionally.
+			 */
+			__buffer_unlock_commit(buffer, event);
+			goto out;
+		}
+		/* Fall through to full stack on stackmap failure */
+	}
+full_stack:
+#endif
+
 	event = __trace_buffer_lock_reserve(buffer, TRACE_STACK,
 				    struct_size(entry, caller, nr_entries),
 				    trace_ctx);
@@ -9222,6 +9260,34 @@ static __init void tracer_init_tracefs_work_func(struct work_struct *work)
 			NULL, &tracing_dyn_info_fops);
 #endif
 
+#ifdef CONFIG_FTRACE_STACKMAP
+	{
+		struct ftrace_stackmap *smap;
+
+		smap = ftrace_stackmap_create(&global_trace);
+		if (!IS_ERR(smap)) {
+			/*
+			 * Use smp_store_release to ensure the stackmap
+			 * structure is fully initialized before publishing
+			 * the pointer to concurrent trace event readers.
+			 */
+			smp_store_release(&global_trace.stackmap, smap);
+			trace_create_file("stack_map", TRACE_MODE_WRITE, NULL,
+					smap, &ftrace_stackmap_fops);
+			trace_create_file("stack_map_stat", TRACE_MODE_READ, NULL,
+					smap, &ftrace_stackmap_stat_fops);
+			trace_create_file("stack_map_bin", TRACE_MODE_READ, NULL,
+					smap, &ftrace_stackmap_bin_fops);
+		} else {
+			pr_warn("ftrace stackmap init failed, dedup disabled\n");
+			/*
+			 * global_trace.stackmap is already NULL from kzalloc;
+			 * leaving it NULL ensures the load-acquire in
+			 * __ftrace_trace_stack falls back to full stack.
+			 */
+		}
+	}
+#endif
 	create_trace_instances(NULL);
 
 	update_tracer_options();
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..7e7d5e5a35ff 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -57,6 +57,7 @@ enum trace_type {
 	TRACE_TIMERLAT,
 	TRACE_RAW_DATA,
 	TRACE_FUNC_REPEATS,
+	TRACE_STACK_ID,
 
 	__TRACE_LAST_TYPE,
 };
@@ -453,6 +454,9 @@ struct trace_array {
 	struct cond_snapshot	*cond_snapshot;
 #endif
 	struct trace_func_repeats	__percpu *last_func_repeats;
+#ifdef CONFIG_FTRACE_STACKMAP
+	struct ftrace_stackmap		*stackmap;
+#endif
 	/*
 	 * On boot up, the ring buffer is set to the minimum size, so that
 	 * we do not waste memory on systems that are not using tracing.
@@ -579,6 +583,8 @@ extern void __ftrace_bad_type(void);
 			  TRACE_GRAPH_RET);		\
 		IF_ASSIGN(var, ent, struct func_repeats_entry,		\
 			  TRACE_FUNC_REPEATS);				\
+		IF_ASSIGN(var, ent, struct stack_id_entry,		\
+			  TRACE_STACK_ID);				\
 		__ftrace_bad_type();					\
 	} while (0)
 
@@ -1449,7 +1455,16 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 # define STACK_FLAGS
 #endif
 
+#ifdef CONFIG_FTRACE_STACKMAP
+# define STACKMAP_FLAGS				\
+			C(STACKMAP,		"stackmap"),
+#else
+# define STACKMAP_FLAGS
+# define TRACE_ITER_STACKMAP_BIT	-1
+#endif
+
 #ifdef CONFIG_FUNCTION_PROFILER
+
 # define PROFILER_FLAGS					\
 		C(PROF_TEXT_OFFSET,	"prof-text-offset"),
 # ifdef CONFIG_FUNCTION_GRAPH_TRACER
@@ -1506,6 +1521,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 		FUNCTION_FLAGS					\
 		FGRAPH_FLAGS					\
 		STACK_FLAGS					\
+		STACKMAP_FLAGS					\
 		BRANCH_FLAGS					\
 		PROFILER_FLAGS					\
 		FPROFILE_FLAGS
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 54417468fdeb..89ed14b7e5fd 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -250,6 +250,21 @@ FTRACE_ENTRY(user_stack, userstack_entry,
 		 (void *)__entry->caller[6], (void *)__entry->caller[7])
 );
 
+/*
+ * Stack ID entry - stores only a stack_id referencing the stackmap.
+ * Used when CONFIG_FTRACE_STACKMAP is enabled to deduplicate stacks.
+ */
+FTRACE_ENTRY(stack_id, stack_id_entry,
+
+	TRACE_STACK_ID,
+
+	F_STRUCT(
+		__field(	int,		stack_id	)
+	),
+
+	F_printk("<stack_id %d>", __entry->stack_id)
+);
+
 /*
  * trace_printk entry:
  */
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a5ad76175d10..68678ea88159 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1517,6 +1517,28 @@ static struct trace_event trace_user_stack_event = {
 	.funcs		= &trace_user_stack_funcs,
 };
 
+/* TRACE_STACK_ID */
+static enum print_line_t trace_stack_id_print(struct trace_iterator *iter,
+					      int flags, struct trace_event *event)
+{
+	struct stack_id_entry *field;
+	struct trace_seq *s = &iter->seq;
+
+	trace_assign_type(field, iter->ent);
+	trace_seq_printf(s, "<stack_id %d>\n", field->stack_id);
+
+	return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_stack_id_funcs = {
+	.trace		= trace_stack_id_print,
+};
+
+static struct trace_event trace_stack_id_event = {
+	.type		= TRACE_STACK_ID,
+	.funcs		= &trace_stack_id_funcs,
+};
+
 /* TRACE_HWLAT */
 static enum print_line_t
 trace_hwlat_print(struct trace_iterator *iter, int flags,
@@ -1908,6 +1930,7 @@ static struct trace_event *events[] __initdata = {
 	&trace_wake_event,
 	&trace_stack_event,
 	&trace_user_stack_event,
+	&trace_stack_id_event,
 	&trace_bputs_event,
 	&trace_bprint_event,
 	&trace_print_event,
-- 
2.34.1


^ permalink raw reply related

* [PATCH v2 1/3] trace: add lock-free stackmap for stack trace deduplication
From: Li Pengfei @ 2026-05-22 10:40 UTC (permalink / raw)
  To: linux-trace-kernel
  Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28,
	lkp
In-Reply-To: <20260522104017.1668638-1-lipengfei28@xiaomi.com>

From: Pengfei Li <lipengfei28@xiaomi.com>

Add a lock-free hash map (ftrace_stackmap) that deduplicates kernel
stack traces for the ftrace ring buffer. Instead of storing full
stack traces (80-160 bytes each) in the ring buffer for every event,
ftrace can store a 4-byte stack_id when the stackmap option is enabled.

The implementation is modeled after tracing_map.c (used by hist
triggers), using the same lock-free design based on Dr. Cliff Click's
non-blocking hash table algorithm:

- Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
- Pre-allocated element pool (zero allocation on hot path)
- Linear probing with 2x over-provisioned table; probe length is
  bounded by FTRACE_STACKMAP_MAX_PROBE so worst-case insert/lookup
  is O(1) even when the table is heavily loaded with claimed-but-
  empty slots from pool exhaustion
- Single global instance (initialized for the global trace array)

The stackmap is exported via three tracefs nodes:
- stack_map: text export with symbol resolution (mode 0640)
- stack_map_stat: counters (entries, successes, drops, success_rate)
- stack_map_bin: binary export (all fields native-endian)

Counter naming:
- 'successes' counts events that were successfully assigned a
  stack_id (covers both first-time inserts and dedup hits).
- 'drops' counts events that fell back to recording the full stack
  (pool exhausted, probe limit reached, or reset in progress).
- 'success_rate' is successes / (successes + drops).

Reset semantics:
- Reset is a control-path operation only allowed when tracing is
  stopped on the owning trace_array. Online reset (with tracing
  active) is intentionally not supported to keep the proof
  obligations small.
- Reset uses atomic_cmpxchg() to claim the resetting flag, then
  verifies tracer_tracing_is_on() returns false. The resetting
  flag itself blocks subsequent get_id() callers; userspace
  re-enabling tracing after our check still cannot let new
  insertions through.
- synchronize_rcu() drains in-flight get_id() callers from the
  ftrace callback path, which runs preempt-disabled.
- Reset clears the resetting flag with atomic_set_release() so a
  subsequent get_id() observes a fully cleared map.
- Concurrent reset returns -EBUSY; reset while tracing is active
  returns -EBUSY.

Concurrency notes:
- entry->val publication uses smp_store_release() paired with
  smp_load_acquire() in all dereferencing readers (lookup, seq_show,
  bin_open). seq_start/seq_next only check val for NULL and use
  READ_ONCE().
- elt->nr is read with READ_ONCE() and clamped to MAX_DEPTH before
  use in seq_show and bin_open.
- Pool exhaustion: stackmap_get_elt() short-circuits via
  atomic_read() before the contended atomic RMW, avoiding cacheline
  contention once the pool is full. Slots that win cmpxchg but
  cannot get an elt are left 'claimed but empty'; subsequent
  lookups treat val==NULL as a miss and probe past them. The
  bounded probe length keeps per-event cost O(1).

Hash key:
- Per-instance random seed stored in the stackmap struct (no
  global state), seeded at create time.
- 32-bit jhash is forced to 1 if it lands on 0 (which is the
  free-slot sentinel). Full memcmp confirms matches.

Memory:
- Single flat vmalloc for the element pool (no per-elt kzalloc).
- bits parameter clamped to [10, 18]: at the maximum bits=18, the
  element pool is ~130 MB and a stack_map_bin snapshot may briefly
  allocate another ~130 MB.
- struct stackmap_bin_snapshot uses u64 (not size_t) for its size
  field so data[] is 8-byte aligned on both 32-bit and 64-bit
  architectures, avoiding alignment faults when writing u64 IPs
  on strict-alignment architectures.

Kernel command line parameter:
- ftrace_stackmap.bits=N: set map capacity (2^N unique stacks,
  range 10-18, default 14)

Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
 kernel/trace/Kconfig          |  21 ++
 kernel/trace/Makefile         |   1 +
 kernel/trace/trace_stackmap.c | 643 ++++++++++++++++++++++++++++++++++
 kernel/trace/trace_stackmap.h |  56 +++
 4 files changed, 721 insertions(+)
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..2a63fd2c9a96 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -412,6 +412,27 @@ config STACK_TRACER
 
 	  Say N if unsure.
 
+config FTRACE_STACKMAP
+	bool "Ftrace stack map deduplication"
+	depends on TRACING
+	depends on STACKTRACE
+	select KALLSYMS
+	help
+	  This enables a global stack trace hash table for ftrace, inspired
+	  by eBPF's BPF_MAP_TYPE_STACK_TRACE. When enabled, ftrace can store
+	  only a stack_id in the ring buffer instead of the full stack trace,
+	  significantly reducing trace buffer usage when the same call stacks
+	  appear repeatedly.
+
+	  The deduplicated stacks are exported via:
+	    /sys/kernel/debug/tracing/stack_map
+
+	  Writing to this file resets the stack map. Reading shows all unique
+	  stacks with their stack_id and reference count.
+
+	  Say Y if you want to reduce ftrace buffer usage for stack traces.
+	  Say N if unsure.
+
 config TRACE_PREEMPT_TOGGLE
 	bool
 	help
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 1decdce8cbef..f1b6175099cc 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -85,6 +85,7 @@ obj-$(CONFIG_HWLAT_TRACER) += trace_hwlat.o
 obj-$(CONFIG_OSNOISE_TRACER) += trace_osnoise.o
 obj-$(CONFIG_NOP_TRACER) += trace_nop.o
 obj-$(CONFIG_STACK_TRACER) += trace_stack.o
+obj-$(CONFIG_FTRACE_STACKMAP) += trace_stackmap.o
 obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o
 obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
diff --git a/kernel/trace/trace_stackmap.c b/kernel/trace/trace_stackmap.c
new file mode 100644
index 000000000000..b23a60e9286c
--- /dev/null
+++ b/kernel/trace/trace_stackmap.c
@@ -0,0 +1,643 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Ftrace Stack Map - Lock-free stack trace deduplication for ftrace
+ *
+ * Modeled after tracing_map.c (used by hist triggers), this provides
+ * a lock-free hash map optimized for the ftrace hot path. The design
+ * is based on Dr. Cliff Click's non-blocking hash table algorithm.
+ *
+ * Key properties:
+ * - Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
+ * - Pre-allocated element pool (zero allocation on hot path)
+ * - Linear probing with 2x over-provisioned table; probe length
+ *   bounded by FTRACE_STACKMAP_MAX_PROBE to keep worst-case lookup
+ *   cost constant even when the table is heavily loaded
+ * - Single global instance (initialized for the global trace array)
+ *
+ * Reset is a control-path operation, only allowed when tracing is
+ * stopped on the owning trace_array. The protocol is:
+ *
+ *   - atomic_cmpxchg(&resetting, 0, 1) atomically claims reset rights
+ *     and blocks new get_id() callers (they observe resetting=1 and
+ *     return -EINVAL).
+ *   - tracer_tracing_is_on() is checked AFTER the cmpxchg, so the
+ *     resetting flag itself prevents new insertions even if userspace
+ *     re-enables tracing immediately after the check.
+ *   - synchronize_rcu() drains in-flight get_id() callers from the
+ *     ftrace callback path, which runs with preemption disabled.
+ *
+ * Online reset (with tracing active) is intentionally not supported
+ * to keep the design simple and the proof obligations small.
+ *
+ * The 32-bit jhash of the stack IPs is the hash table key. On hash
+ * collision, linear probing finds the next slot and full memcmp
+ * confirms the match.
+ *
+ * Concurrent userspace readers (cat stack_map / stack_map_bin) get
+ * a best-effort snapshot. They are coherent with the hot path
+ * (smp_load_acquire on entry->val), but they are not coherent with
+ * a concurrent reset; since reset requires tracing to be stopped,
+ * mid-iteration reset can produce truncated or partial output but
+ * never crashes.
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/jhash.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+#include <linux/random.h>
+#include <linux/rcupdate.h>
+#include <linux/log2.h>
+
+#include "trace.h"
+#include "trace_stackmap.h"
+
+/*
+ * Bound the linear-probe scan length. With a 2x over-provisioned table,
+ * a well-distributed hash gives very short probe chains. Capping at 64
+ * keeps worst-case lookup O(1) even when the table is heavily loaded
+ * with claimed-but-empty slots from pool exhaustion.
+ */
+#define FTRACE_STACKMAP_MAX_PROBE	64
+
+/*
+ * Each pre-allocated element holds one unique stack trace.
+ * Fixed size: MAX_DEPTH entries regardless of actual depth.
+ */
+struct stackmap_elt {
+	u32		nr;		/* actual number of IPs */
+	atomic_t	ref_count;
+	unsigned long	ips[FTRACE_STACKMAP_MAX_DEPTH];
+};
+
+/*
+ * Hash table entry: a 32-bit key (jhash of stack) + pointer to elt.
+ * key == 0 means the slot is free.
+ */
+struct stackmap_entry {
+	u32			key;	/* 0 = free, non-zero = jhash */
+	struct stackmap_elt	*val;	/* NULL until fully published */
+};
+
+struct ftrace_stackmap {
+	struct trace_array	*tr;		/* owning trace_array */
+	unsigned int		map_bits;
+	unsigned int		map_size;	/* 1 << (map_bits + 1) */
+	unsigned int		max_elts;	/* 1 << map_bits */
+	u32			hash_seed;	/* per-instance jhash seed */
+	atomic_t		next_elt;	/* index into elts pool */
+	struct stackmap_entry	*entries;	/* hash table */
+	struct stackmap_elt	*elts;		/* flat element pool */
+	atomic_t		resetting;
+	atomic64_t		successes;	/* events served (hits + new inserts) */
+	atomic64_t		drops;
+};
+
+/*
+ * Cap the bits parameter to keep worst-case allocations bounded:
+ *   bits=18 → 256K elts, 512K slots, ~130 MB elt pool, ~130 MB bin
+ *             export.
+ * Smaller workloads should use the default (14) which gives 16K elts
+ * (~8 MB pool); bump bits via the ftrace_stackmap.bits= kernel
+ * parameter for higher unique-stack capacity.
+ */
+#define FTRACE_STACKMAP_BITS_MIN	10
+#define FTRACE_STACKMAP_BITS_MAX	18
+#define FTRACE_STACKMAP_BITS_DEFAULT	14
+
+static unsigned int stackmap_map_bits = FTRACE_STACKMAP_BITS_DEFAULT;
+static int __init stackmap_bits_setup(char *str)
+{
+	unsigned long val;
+
+	if (kstrtoul(str, 0, &val))
+		return -EINVAL;
+	val = clamp_val(val, FTRACE_STACKMAP_BITS_MIN, FTRACE_STACKMAP_BITS_MAX);
+	stackmap_map_bits = val;
+	return 0;
+}
+early_param("ftrace_stackmap.bits", stackmap_bits_setup);
+
+/* --- Element pool --- */
+
+static struct stackmap_elt *stackmap_get_elt(struct ftrace_stackmap *smap)
+{
+	int idx;
+
+	/*
+	 * Fast-path early-out once the pool is fully consumed. Avoids
+	 * the contended atomic RMW on next_elt for every traced event
+	 * after the pool is exhausted.
+	 */
+	if (atomic_read(&smap->next_elt) >= smap->max_elts)
+		return NULL;
+
+	idx = atomic_fetch_add_unless(&smap->next_elt, 1, smap->max_elts);
+	if (idx < smap->max_elts)
+		return &smap->elts[idx];
+	return NULL;
+}
+
+/* --- Create / Destroy / Reset --- */
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr)
+{
+	struct ftrace_stackmap *smap;
+	unsigned int bits;
+
+	smap = kzalloc(sizeof(*smap), GFP_KERNEL);
+	if (!smap)
+		return ERR_PTR(-ENOMEM);
+
+	/* Defensive clamp: reject bogus bits even if early_param is bypassed. */
+	bits = clamp_val(stackmap_map_bits,
+			 FTRACE_STACKMAP_BITS_MIN,
+			 FTRACE_STACKMAP_BITS_MAX);
+
+	smap->tr = tr;
+	smap->map_bits = bits;
+	smap->max_elts = 1U << bits;
+	smap->map_size = 1U << (bits + 1);	/* 2x over-provision */
+	BUG_ON(!is_power_of_2(smap->map_size));
+
+	smap->entries = vzalloc(sizeof(*smap->entries) * smap->map_size);
+	if (!smap->entries) {
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/*
+	 * Single large vmalloc of the element pool, indexed flat.
+	 * At bits=16 this is 64K * sizeof(struct stackmap_elt). The
+	 * struct is ~520 B (8 + 4 + 4 + 64*8), so total ~33 MB.
+	 */
+	smap->elts = vzalloc(sizeof(*smap->elts) * (size_t)smap->max_elts);
+	if (!smap->elts) {
+		vfree(smap->entries);
+		kfree(smap);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	smap->hash_seed = get_random_u32();
+	atomic_set(&smap->next_elt, 0);
+	atomic_set(&smap->resetting, 0);
+	atomic64_set(&smap->successes, 0);
+	atomic64_set(&smap->drops, 0);
+
+	return smap;
+}
+
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap)
+{
+	if (!smap || IS_ERR(smap))
+		return;
+	vfree(smap->elts);
+	vfree(smap->entries);
+	kfree(smap);
+}
+
+/**
+ * ftrace_stackmap_reset - clear all entries in the stackmap
+ * @smap: the stackmap to reset
+ *
+ * Returns 0 on success, -EBUSY if another reset is already in
+ * progress, or if tracing is currently active on the owning
+ * trace_array.
+ *
+ * Online reset (with tracing active) is not supported. Caller must
+ * stop tracing first (echo 0 > tracing_on).
+ *
+ * Caller is process context (typically sysfs write handler).
+ *
+ * Protocol:
+ *   1. Atomically claim reset rights via cmpxchg on @resetting.
+ *   2. Verify tracing is stopped on @smap->tr; if not, release the
+ *      claim and return -EBUSY. The resetting flag itself blocks
+ *      any subsequent get_id() callers.
+ *   3. synchronize_rcu() drains in-flight get_id() callers from the
+ *      ftrace callback path (which runs preempt-disabled).
+ *   4. memset entries, elts, and counters.
+ *   5. Release the resetting flag with release semantics so any new
+ *      get_id() observes a fully cleared map.
+ */
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap)
+{
+	if (!smap)
+		return 0;
+
+	if (atomic_cmpxchg(&smap->resetting, 0, 1) != 0)
+		return -EBUSY;
+
+	if (smap->tr && tracer_tracing_is_on(smap->tr)) {
+		atomic_set(&smap->resetting, 0);
+		return -EBUSY;
+	}
+
+	/*
+	 * synchronize_rcu() itself is a full barrier; no extra smp_mb()
+	 * is needed before it. It drains in-flight ftrace callbacks that
+	 * may have already passed the resetting check with the old value.
+	 */
+	synchronize_rcu();
+
+	memset(smap->entries, 0, sizeof(*smap->entries) * smap->map_size);
+	memset(smap->elts, 0, sizeof(*smap->elts) * (size_t)smap->max_elts);
+
+	atomic_set(&smap->next_elt, 0);
+	atomic64_set(&smap->successes, 0);
+	atomic64_set(&smap->drops, 0);
+
+	/* Release resetting=0 so new get_id() observes a cleared map. */
+	atomic_set_release(&smap->resetting, 0);
+	return 0;
+}
+
+/* --- Core: get_id (lock-free, NMI-safe) --- */
+
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+			   unsigned long *ips, unsigned int nr_entries)
+{
+	u32 key_hash, idx, test_key, trace_len;
+	struct stackmap_entry *entry;
+	struct stackmap_elt *val;
+	int probes = 0;
+
+	if (!smap || !nr_entries || atomic_read(&smap->resetting))
+		return -EINVAL;
+	if (nr_entries > FTRACE_STACKMAP_MAX_DEPTH)
+		nr_entries = FTRACE_STACKMAP_MAX_DEPTH;
+
+	trace_len = nr_entries * sizeof(unsigned long);
+	/*
+	 * jhash2() requires the length in u32 units and the data to be
+	 * u32-aligned. On 64-bit kernels sizeof(unsigned long)==8, so
+	 * trace_len is always a multiple of 8 (hence of 4). Use jhash2
+	 * directly; the cast to u32* is safe because ips[] is naturally
+	 * aligned to sizeof(unsigned long) >= 4.
+	 */
+	key_hash = jhash2((const u32 *)ips, trace_len / sizeof(u32),
+			  smap->hash_seed);
+	if (key_hash == 0)
+		key_hash = 1;	/* 0 means free slot */
+
+	idx = key_hash >> (32 - (smap->map_bits + 1));
+
+	while (probes < FTRACE_STACKMAP_MAX_PROBE) {
+		idx &= (smap->map_size - 1);
+		entry = &smap->entries[idx];
+		test_key = entry->key;
+
+		if (test_key == key_hash) {
+			/*
+			 * smp_load_acquire pairs with smp_store_release in
+			 * the publisher below; ensures we see fully-formed
+			 * elt fields (nr, ips, ref_count) before dereference.
+			 */
+			val = smp_load_acquire(&entry->val);
+			if (val && val->nr == nr_entries &&
+			    memcmp(val->ips, ips, trace_len) == 0) {
+				atomic_inc(&val->ref_count);
+				atomic64_inc(&smap->successes);
+				return (int)idx;
+			}
+			/*
+			 * val == NULL: another CPU is mid-insert, or this
+			 * slot is "claimed but empty" (pool exhausted).
+			 * val != NULL but mismatch: 32-bit hash collision
+			 * with a different stack. In both cases, advance.
+			 */
+		} else if (!test_key) {
+			/* Free slot: try to claim it */
+			if (cmpxchg(&entry->key, 0, key_hash) == 0) {
+				struct stackmap_elt *elt;
+
+				elt = stackmap_get_elt(smap);
+				if (!elt) {
+					/*
+					 * Pool exhausted. We claimed this
+					 * slot with cmpxchg but cannot fill
+					 * it. Leave key set so the slot
+					 * stays "claimed but empty" — future
+					 * lookups treat val==NULL as a miss
+					 * and probe past it. Cannot revert
+					 * key=0 without racing other CPUs.
+					 */
+					atomic64_inc(&smap->drops);
+					return -ENOSPC;
+				}
+
+				elt->nr = nr_entries;
+				atomic_set(&elt->ref_count, 1);
+				memcpy(elt->ips, ips, trace_len);
+
+				/*
+				 * Publish elt with release semantics so the
+				 * reader's smp_load_acquire can safely
+				 * dereference val->nr / val->ips.
+				 */
+				smp_store_release(&entry->val, elt);
+				atomic64_inc(&smap->successes);
+				return (int)idx;
+			}
+			/* cmpxchg failed; another CPU claimed this slot. */
+		}
+
+		idx++;
+		probes++;
+	}
+
+	atomic64_inc(&smap->drops);
+	return -ENOSPC;
+}
+
+/* --- Text export: /sys/kernel/debug/tracing/stack_map --- */
+
+struct stackmap_seq_private {
+	struct ftrace_stackmap	*smap;
+};
+
+static void *stackmap_seq_start(struct seq_file *m, loff_t *pos)
+{
+	struct stackmap_seq_private *priv = m->private;
+	struct ftrace_stackmap *smap = priv->smap;
+	u32 i;
+
+	if (!smap)
+		return NULL;
+	for (i = *pos; i < smap->map_size; i++) {
+		if (smap->entries[i].key && READ_ONCE(smap->entries[i].val)) {
+			*pos = i;
+			return &smap->entries[i];
+		}
+	}
+	return NULL;
+}
+
+static void *stackmap_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct stackmap_seq_private *priv = m->private;
+	struct ftrace_stackmap *smap = priv->smap;
+	u32 i;
+
+	if (!smap)
+		return NULL;
+	for (i = *pos + 1; i < smap->map_size; i++) {
+		if (smap->entries[i].key && READ_ONCE(smap->entries[i].val)) {
+			*pos = i;
+			return &smap->entries[i];
+		}
+	}
+	return NULL;
+}
+
+static void stackmap_seq_stop(struct seq_file *m, void *v) { }
+
+static int stackmap_seq_show(struct seq_file *m, void *v)
+{
+	struct stackmap_entry *entry = v;
+	struct stackmap_elt *elt = smp_load_acquire(&entry->val);
+	struct stackmap_seq_private *priv = m->private;
+	u32 idx = entry - priv->smap->entries;
+	u32 i, nr;
+
+	if (!elt)
+		return 0;
+
+	nr = READ_ONCE(elt->nr);
+	if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+		nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+	seq_printf(m, "stack_id %u [ref %u, depth %u]\n",
+		   idx, atomic_read(&elt->ref_count), nr);
+	for (i = 0; i < nr; i++)
+		seq_printf(m, "  [%u] %pS\n", i, (void *)elt->ips[i]);
+	seq_putc(m, '\n');
+	return 0;
+}
+
+static const struct seq_operations stackmap_seq_ops = {
+	.start	= stackmap_seq_start,
+	.next	= stackmap_seq_next,
+	.stop	= stackmap_seq_stop,
+	.show	= stackmap_seq_show,
+};
+
+static int stackmap_open(struct inode *inode, struct file *file)
+{
+	struct stackmap_seq_private *priv;
+	struct seq_file *m;
+	int ret;
+
+	ret = seq_open_private(file, &stackmap_seq_ops,
+			       sizeof(struct stackmap_seq_private));
+	if (ret)
+		return ret;
+	m = file->private_data;
+	priv = m->private;
+	priv->smap = inode->i_private;
+	return 0;
+}
+
+/*
+ * Accept exactly "0" or "reset" (optionally followed by a single newline).
+ */
+static bool stackmap_write_is_reset(const char *buf, size_t n)
+{
+	if (n > 0 && buf[n - 1] == '\n')
+		n--;
+	return (n == 1 && buf[0] == '0') ||
+	       (n == 5 && memcmp(buf, "reset", 5) == 0);
+}
+
+static ssize_t stackmap_write(struct file *file, const char __user *ubuf,
+			      size_t count, loff_t *ppos)
+{
+	struct seq_file *m = file->private_data;
+	struct stackmap_seq_private *priv = m->private;
+	char buf[8];
+	size_t n = min(count, sizeof(buf) - 1);
+	int ret;
+
+	if (n == 0)
+		return -EINVAL;
+	if (copy_from_user(buf, ubuf, n))
+		return -EFAULT;
+	buf[n] = '\0';
+
+	if (!stackmap_write_is_reset(buf, n))
+		return -EINVAL;
+
+	/*
+	 * ftrace_stackmap_reset() atomically claims reset rights via
+	 * cmpxchg and returns -EBUSY if another reset is in progress
+	 * or if tracing is active.
+	 */
+	ret = ftrace_stackmap_reset(priv->smap);
+	if (ret)
+		return ret;
+	return count;
+}
+
+const struct file_operations ftrace_stackmap_fops = {
+	.open		= stackmap_open,
+	.read		= seq_read,
+	.write		= stackmap_write,
+	.llseek		= seq_lseek,
+	.release	= seq_release_private,
+};
+
+/* --- Stats --- */
+
+static int stackmap_stat_show(struct seq_file *m, void *v)
+{
+	struct ftrace_stackmap *smap = m->private;
+	u32 entries;
+	u64 successes, drops;
+
+	if (!smap) {
+		seq_puts(m, "stackmap not initialized\n");
+		return 0;
+	}
+
+	entries = atomic_read(&smap->next_elt);
+	successes = atomic64_read(&smap->successes);
+	drops = atomic64_read(&smap->drops);
+
+	seq_printf(m, "entries:      %u / %u\n", entries, smap->max_elts);
+	seq_printf(m, "table_size:   %u\n", smap->map_size);
+	seq_printf(m, "successes:    %llu\n", successes);
+	seq_printf(m, "drops:        %llu\n", drops);
+	if (successes + drops > 0)
+		seq_printf(m, "success_rate: %llu%%\n",
+			   successes * 100 / (successes + drops));
+	return 0;
+}
+
+static int stackmap_stat_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, stackmap_stat_show, inode->i_private);
+}
+
+const struct file_operations ftrace_stackmap_stat_fops = {
+	.open		= stackmap_stat_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+/* --- Binary export --- */
+
+struct stackmap_bin_snapshot {
+	/*
+	 * Use u64 (not size_t) so data[] is 8-byte aligned on both
+	 * 32-bit and 64-bit architectures. The IP array within data[]
+	 * is accessed as u64*, which would alignment-fault on strict
+	 * architectures (e.g. older ARM, SPARC) if data[] started at
+	 * a 4-byte boundary.
+	 */
+	u64	size;
+	char	data[];
+};
+
+static int stackmap_bin_open(struct inode *inode, struct file *file)
+{
+	struct ftrace_stackmap *smap = inode->i_private;
+	struct stackmap_bin_snapshot *snap;
+	struct ftrace_stackmap_bin_header *hdr;
+	size_t alloc_size, off;
+	u32 nr_entries, i, nr_stacks;
+
+	if (!smap)
+		return -ENODEV;
+
+	/*
+	 * Worst-case allocation size: every populated entry uses a
+	 * full-depth stack. The (+1) gives one slack slot in case a
+	 * concurrent insert lands between this snapshot and iteration.
+	 * The loop below performs an explicit bounds check anyway.
+	 *
+	 * At bits=16 this caps at ~33 MB. The file is mode 0440
+	 * (TRACE_MODE_READ), so only privileged users can open it.
+	 */
+	nr_entries = atomic_read(&smap->next_elt);
+	alloc_size = sizeof(*hdr) + (nr_entries + 1) *
+		     (sizeof(struct ftrace_stackmap_bin_entry) +
+		      FTRACE_STACKMAP_MAX_DEPTH * sizeof(u64));
+
+	snap = vmalloc(sizeof(*snap) + alloc_size);
+	if (!snap)
+		return -ENOMEM;
+
+	hdr = (struct ftrace_stackmap_bin_header *)snap->data;
+	hdr->magic = FTRACE_STACKMAP_BIN_MAGIC;
+	hdr->version = FTRACE_STACKMAP_BIN_VERSION;
+	hdr->reserved = 0;
+	off = sizeof(*hdr);
+	nr_stacks = 0;
+
+	for (i = 0; i < smap->map_size; i++) {
+		struct stackmap_entry *entry = &smap->entries[i];
+		struct stackmap_elt *elt;
+		struct ftrace_stackmap_bin_entry *e;
+		u64 *ips_out;
+		u32 k, nr;
+
+		if (!entry->key)
+			continue;
+		elt = smp_load_acquire(&entry->val);
+		if (!elt)
+			continue;
+
+		nr = READ_ONCE(elt->nr);
+		if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+			nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+		/* Bounds check: stop if we would overflow the allocation. */
+		if (off + sizeof(*e) + nr * sizeof(u64) > alloc_size)
+			break;
+
+		e = (struct ftrace_stackmap_bin_entry *)(snap->data + off);
+		e->stack_id = i;
+		e->nr = nr;
+		e->ref_count = atomic_read(&elt->ref_count);
+		e->reserved = 0;
+		off += sizeof(*e);
+
+		ips_out = (u64 *)(snap->data + off);
+		for (k = 0; k < nr; k++)
+			ips_out[k] = (u64)elt->ips[k];
+		off += nr * sizeof(u64);
+		nr_stacks++;
+	}
+
+	hdr->nr_stacks = nr_stacks;
+	snap->size = off;
+	file->private_data = snap;
+	return 0;
+}
+
+static ssize_t stackmap_bin_read(struct file *file, char __user *ubuf,
+				 size_t count, loff_t *ppos)
+{
+	struct stackmap_bin_snapshot *snap = file->private_data;
+
+	if (!snap)
+		return -EINVAL;
+	return simple_read_from_buffer(ubuf, count, ppos, snap->data, snap->size);
+}
+
+static int stackmap_bin_release(struct inode *inode, struct file *file)
+{
+	vfree(file->private_data);
+	return 0;
+}
+
+const struct file_operations ftrace_stackmap_bin_fops = {
+	.open		= stackmap_bin_open,
+	.read		= stackmap_bin_read,
+	.llseek		= default_llseek,
+	.release	= stackmap_bin_release,
+};
diff --git a/kernel/trace/trace_stackmap.h b/kernel/trace/trace_stackmap.h
new file mode 100644
index 000000000000..da51ed919e2c
--- /dev/null
+++ b/kernel/trace/trace_stackmap.h
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _TRACE_STACKMAP_H
+#define _TRACE_STACKMAP_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+
+#define FTRACE_STACKMAP_MAX_DEPTH	64
+
+/* Binary export format */
+#define FTRACE_STACKMAP_BIN_MAGIC	0x464D5342	/* 'FSMB' */
+#define FTRACE_STACKMAP_BIN_VERSION	2
+
+struct ftrace_stackmap_bin_header {
+	u32 magic;
+	u32 version;
+	u32 nr_stacks;
+	u32 reserved;
+};
+
+struct ftrace_stackmap_bin_entry {
+	u32 stack_id;
+	u32 nr;
+	u32 ref_count;
+	u32 reserved;
+	/* followed by u64 ips[nr] */
+};
+
+struct trace_array;
+
+#ifdef CONFIG_FTRACE_STACKMAP
+
+struct ftrace_stackmap;
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr);
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap);
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+			   unsigned long *ips, unsigned int nr_entries);
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap);
+
+extern const struct file_operations ftrace_stackmap_fops;
+extern const struct file_operations ftrace_stackmap_stat_fops;
+extern const struct file_operations ftrace_stackmap_bin_fops;
+
+#else
+
+struct ftrace_stackmap;
+static inline struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr) { return NULL; }
+static inline void ftrace_stackmap_destroy(struct ftrace_stackmap *s) { }
+static inline int ftrace_stackmap_get_id(struct ftrace_stackmap *s,
+					 unsigned long *ips, unsigned int n)
+{ return -ENOSYS; }
+static inline int ftrace_stackmap_reset(struct ftrace_stackmap *s) { return 0; }
+
+#endif
+#endif /* _TRACE_STACKMAP_H */
-- 
2.34.1


^ permalink raw reply related

* [RFC PATCH v2 0/3] trace: stack trace deduplication for ftrace ring buffer
From: Li Pengfei @ 2026-05-22 10:40 UTC (permalink / raw)
  To: linux-trace-kernel
  Cc: rostedt, mhiramat, linux-kernel, cmllamas, zhangbo56, lipengfei28,
	lkp
In-Reply-To: <20260514034916.2162517-1-lipengfei28@xiaomi.com>

From: Pengfei Li <lipengfei28@xiaomi.com>

Hi Steven, all,

This is v2 of the ftrace stackmap series. It addresses the Sashiko
review at [1] and incorporates the kernel test robot's toctree fix.

The series adds stack trace deduplication to ftrace. When the
stacktrace option is enabled, the ring buffer stores a 4-byte
stack_id instead of a full kernel stack trace, while the full
stacks are exported via tracefs.

Problem
=======

With stacktrace enabled, each trace event stores a full kernel
stack (typically 10-20 frames x 8 bytes = 80-160 bytes). On
production devices with 4-8 MB trace buffers, this fills the
buffer in seconds, limiting the usefulness of boot-time tracing
and always-on performance monitoring.

Design
======

The implementation is a lock-free hash map modeled after
tracing_map.c, as suggested by Steven [2]:

- lock-free insert via cmpxchg, safe in NMI/IRQ/any context
- pre-allocated element pool, so there is no allocation on the hot path
- linear probing with a 2x over-provisioned table
- bounded probe length to keep worst-case lookup/insert cost bounded
- currently implemented for the global trace instance

The ring buffer stores only stack_id. Full stacks are exported via:

  /sys/kernel/debug/tracing/stack_map
  /sys/kernel/debug/tracing/stack_map_stat
  /sys/kernel/debug/tracing/stack_map_bin

Reset semantics
===============

Reset is treated as a control-path operation and is only supported
when tracing is stopped on the owning trace_array. Online reset is
intentionally not supported.

The reset path:
- atomically claims reset rights via cmpxchg
- rejects reset with -EBUSY if tracing is active
- blocks new get_id() callers via the resetting flag
- waits for in-flight ftrace callback paths with synchronize_rcu()
- clears the map and releases resetting with release semantics

Why not reuse tracing_map.c
===========================

This series follows the same overall lock-free approach, but uses a
purpose-built structure. tracing_map.c is designed for histogram-style
aggregation with fixed-size keys and value fields, while this use case
needs variable-length stack storage plus reference counting.

Why not reuse BPF stackmap
==========================

BPF_MAP_TYPE_STACK_TRACE addresses a similar problem, but requires a
BPF program and the BPF runtime. This series keeps the functionality
inside ftrace and available without CONFIG_BPF.

Unlike BPF stackmap, which may replace entries on collision, this
design keeps stack_id stable once assigned, which is important because
ring buffer events may reference that stack_id long after insertion.

Test results
============

Platform: ARM64 Qualcomm SM8850 (8 cores), kernel 6.12, bits=14,
tracing sched_switch + kmem_cache_alloc with stacktrace trigger,
5-second capture, default ring buffer.

Per-event payload (measured from tracing stats):

  Event                    Full stack    Stackmap    Reduction
  ---------------------    ----------    --------    ---------
  sched_switch             102 B/entry   48 B/entry    -53%
  kmem_cache_alloc         111 B/entry   44 B/entry    -60%

In the same 5-second capture window, the smaller per-event footprint
translated to many more retained events before wraparound. For
sched_switch:

  - without stackmap:      43,950 retained entries
  - with stackmap:      1,710,044 retained entries

During the same runs, the stackmap observed a few thousand unique
stacks and no drops.

Boot-time activation is also supported via:

  trace_options=stackmap,stacktrace

Events that occur before stackmap initialization fall back to full
stack traces; later events are deduplicated. This transition does
not itself drop events, but early boot stacks recorded before
initialization are not deduplicated.

QEMU validation
===============

The series also runs cleanly in QEMU on aarch64 (mainline,
qemu-system-aarch64, 2 vCPU, virt machine, busybox initrd).

A post-init smoke test verified:
- stack_map, stack_map_stat, stack_map_bin, and options/stackmap exist
- enabling stackmap + stacktrace produces stack_id events
- stack_map_stat shows non-zero successes and zero drops
- reset is rejected with -EBUSY while tracing is active
- reset clears the map when tracing is stopped
- stack_map_bin magic is correct

Changes since RFC v1
====================

- tightened reset semantics: reset now requires tracing to be stopped
  and returns -EBUSY if tracing is active or another reset is in progress
- fixed publication/consumption ordering with smp_store_release() /
  smp_load_acquire()
- bounded probe length and added pool-exhaustion fast-path handling
- moved hash_seed into struct ftrace_stackmap
- switched the element pool to a single flat vmalloc allocation
- bounded bits range to [10, 18] to limit worst-case memory usage
- fixed TRACE_ITER(STACKMAP) handling
- tightened stack_map reset input parsing
- renamed stat counters to "successes" / "success_rate" so the meaning
  is unambiguous (counts events served, including first-time inserts)
- added documentation, selftest coverage, and userspace dump tooling

Known limitations
=================

- Per-instance stackmap support is not included in this series.
- The stackmap currently covers kernel stacks only.
- stack_map_bin is a best-effort snapshot, not a fully atomic export.
- trace-cmd / libtraceevent integration is left for follow-up once the
  binary format settles.

Usage
=====

  echo 1 > /sys/kernel/debug/tracing/options/stackmap
  echo 1 > /sys/kernel/debug/tracing/options/stacktrace

[1] https://sashiko.dev/?list=org.kernel.vger.linux-trace-kernel#/patchset/20260514034916.2162517-1-lipengfei28%40xiaomi.com
[2] https://lore.kernel.org/all/20260513085145.30dd23e0@fedora/

Pengfei Li (3):
  trace: add lock-free stackmap for stack trace deduplication
  trace: integrate stackmap into ftrace stack recording path
  trace: add documentation, selftest and tooling for stackmap

 Documentation/trace/ftrace-stackmap.rst       | 145 ++++
 Documentation/trace/index.rst                 |   1 +
 kernel/trace/Kconfig                          |  21 +
 kernel/trace/Makefile                         |   1 +
 kernel/trace/trace.c                          |  66 ++
 kernel/trace/trace.h                          |  16 +
 kernel/trace/trace_entries.h                  |  15 +
 kernel/trace/trace_output.c                   |  23 +
 kernel/trace/trace_stackmap.c                 | 643 ++++++++++++++++++
 kernel/trace/trace_stackmap.h                 |  56 ++
 .../ftrace/test.d/ftrace/stackmap-basic.tc    | 100 +++
 tools/tracing/stackmap_dump.py                | 150 ++++
 12 files changed, 1237 insertions(+)
 create mode 100644 Documentation/trace/ftrace-stackmap.rst
 create mode 100644 kernel/trace/trace_stackmap.c
 create mode 100644 kernel/trace/trace_stackmap.h
 create mode 100755 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
 create mode 100755 tools/tracing/stackmap_dump.py

-- 
2.34.1

^ permalink raw reply

* Re: [PATCH v3 0/2] Add trace events for Qualcomm GENI SPI drivers
From: Mark Brown @ 2026-05-19 13:13 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Praveen Talari
  Cc: linux-kernel, linux-trace-kernel, linux-arm-msm, linux-spi,
	mukesh.savaliya, aniket.randive, chandana.chiluveru,
	jyothi.seerapu, Konrad Dybcio
In-Reply-To: <20260518-add-tracepoints-for-qcom-geni-spi-v3-0-7928f6810a79@oss.qualcomm.com>

On Mon, 18 May 2026 22:30:50 +0530, Praveen Talari wrote:
> Add trace events for Qualcomm GENI SPI drivers
> 
> Add tracepoints to the Qualcomm GENI (Generic Interface) SPI driver.
> These trace events enable runtime debugging and performance analysis
> of SPI operations.
> 
> The trace events capture SPI clock configuration, setup parameters,
> transfer details, interrupt status.
> 
> [...]

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-7.2

Thanks!

[1/2] spi: qcom-geni: trace: Add trace events for Qualcomm GENI SPI
      https://git.kernel.org/broonie/spi/c/bf62d130b1f2
[2/2] spi: qcom-geni: Add trace events for Qualcomm GENI SPI driver
      https://git.kernel.org/broonie/spi/c/b5687af4af89

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

^ permalink raw reply

* Re: [PATCH v3 2/2] serial: qcom-geni: Add tracepoints for Qualcomm GENI serial driver
From: Greg Kroah-Hartman @ 2026-05-22  9:47 UTC (permalink / raw)
  To: Praveen Talari
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jiri Slaby,
	Konrad Dybcio, linux-kernel, linux-trace-kernel, linux-arm-msm,
	linux-serial, Mukesh Kumar Savaliya, Aniket Randive,
	chandana.chiluveru, jyothi.seerapu
In-Reply-To: <20260518-add-tracepoints-for-qcom-geni-serial-v3-2-b4addb151376@oss.qualcomm.com>

On Mon, May 18, 2026 at 11:26:56PM +0530, Praveen Talari wrote:
> Add tracing to the Qualcomm GENI serial driver to improve runtime
> observability.
> 
> Trace hooks are added at key points including termios and clock
> configuration, manual control get/set, interrupt handling, and data
> TX/RX paths.
> 
> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
> Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
> ---
> v2->v3:
> - Updated commit text(removed example as it was available on cover
>   letter).
> ---
>  drivers/tty/serial/qcom_geni_serial.c | 27 +++++++++++++++++++++++----
>  1 file changed, 23 insertions(+), 4 deletions(-)

This patch did not apply to my tree :(

^ permalink raw reply

* Re: [PATCH] unwind: Add sframe_(un)register() system calls
From: Jens Remus @ 2026-05-22  9:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux Trace Kernel, bpf, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
	Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
	Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
	Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
	Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Heiko Carstens,
	Vasily Gorbik
In-Reply-To: <20260521183532.7a145c8a@gandalf.local.home>

On 5/22/2026 12:35 AM, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
> 
> Add system calls to register and unregister sframes that can be used by
> dynamic linkers to tell the kernel where the sframe section is in memory
> for libraries it loads.

Why two separate system calls?  Can't that be one single stacktracectl?
Could they at least be non-sframe specific, e.g. stracktrace_register
and stracktrace_unregister, so that if one would implement e.g. unwind
user dwarf/eh_frame in the future one could pass ehframe_start and
ehframe_end in addition to sframe_start and sframe_end?

> 
> Both system calls take a pointer to a new structure:
> 
>   struct sframe_setup {
> 	unsigned long		sframe_start;
> 	unsigned long		sframe_size;
> 	unsigned long		text_start;
> 	unsigned long		text_size;
>   };
> 
> and a size of the passed in structure. If the system call needs to be
> extended, then the structure could be changed and the size of that
> structure will tell the kernel that it is the new version. If the kernel
> does not recognize the structure size, it will return -EINVAL.
> 
>   sframe_start - The virtual address of the sframe section
>   sframe_size  - The length of the sframe section
>   text_start   - the text section the sframe represents
>   test_size    - the length of the section
> 
> If other stack tracing functionality is added, it will require a new
> system call.
> 
> The unregister only needs the sframe_start and requires all the rest of
> the fields to be 0. In the future, if more can be done, then user space
> can update the other values and check the return code to see if the kernel
> supports it.
> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
> 
> Based on top of Jens patches here:
> 
>   https://lore.kernel.org/linux-trace-kernel/20260520154004.3845823-1-jremus@linux.ibm.com/
> 
> [ Note, I tested this with the same program from the RFC patch ]
> 
> Changes from RFC: https://patch.msgid.link/20260429114355.6c712e6a@gandalf.local.home
> 
> - Remove the ioctl() like system call for a unique system call for each
>   functionality. Right now there's two functionalities:
>    1. register sframe section
>    2. unregister sframe sections
> 
> - Added taking a lock around the mtree logic in __sframe_remove_section()
>   as Sashiko mentioned that there could be races from user space
>   registering and unregistering sframe sections at the same time.

Doesn't sframe_add_section() then also need likewise?

> 
> - Removed [RFC] from subject as I believe this is more likely the way
>   this system call will be done.

> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h

> @@ -999,6 +999,8 @@ asmlinkage long sys_lsm_get_self_attr(unsigned int attr, struct lsm_ctx __user *
>  asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
>  				      u32 size, u32 flags);
>  asmlinkage long sys_lsm_list_modules(u64 __user *ids, u32 __user *size, u32 flags);
> +asmlinkage long sys_sframe_register(void *data, unsigned int size);
> +asmlinkage long sys_sframe_unregister(void *data, unsigned int size);
>  
>  /*
>   * Architecture-specific system calls


> diff --git a/include/uapi/linux/sframe.h b/include/uapi/linux/sframe.h

> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_SFRAME_H
> +#define _UAPI_LINUX_SFRAME_H
> +
> +struct sframe_setup {
> +	unsigned long		sframe_start;
> +	unsigned long		sframe_size;
> +	unsigned long		text_start;
> +	unsigned long		text_size;
> +};
> +
> +#endif /* _UAPI_LINUX_SFRAME_H */

> diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c

> @@ -842,9 +844,11 @@ static void sframe_free_srcu(struct rcu_head *rcu)
>  static int __sframe_remove_section(struct mm_struct *mm,
>  				   struct sframe_section *sec)
>  {
> -	if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
> -		dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
> -		return -EINVAL;
> +	scoped_guard(mmap_read_lock, mm) {

Why is a read lock sufficient? Doesn't that allow multiple readers?
How does that prevent a concurrent modification of the mm->sframe_mt?

> +		if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
> +			dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
> +			return -EINVAL;
> +		}

Is (or why not) likewise required in sframe_add_section() for the
mtree_insert_range()?

Wasn't the reported issue that while mt_for_each() in
sframe_remove_section() there could be concurrent mtree_erase() in
__sframe_remove_section() followed by mtree_insert_range() in
sframe_add_section(), so that the mt_for_each() could get confused?

>  	}
>  
>  	call_srcu(&sframe_srcu, &sec->rcu, sframe_free_srcu);
> @@ -936,3 +940,56 @@ void sframe_free_mm(struct mm_struct *mm)
>  
>  	mtree_destroy(&mm->sframe_mt);
>  }
> +
> +/**
> + * sys_sframe_register - register an address for user space stacktrace walking.
> + * @data: Structure of sframe data used to register the sframe section
> + * @size: The size of the given structure.
> + *
> + * This system call is used by dynamic library utilities to inform the kernel
> + * of meta data that it loaded that can be used by the kernel to know how
> + * to stack walk the given text locations.
> + *
> + * Return: 0 if successful, otherwise a negative error.
> + */
> +SYSCALL_DEFINE2(sframe_register, __user struct sframe_setup *, data, unsigned int, size)
> +{
> +	struct sframe_setup sframe;
> +
> +	if (sizeof(sframe) != size)
> +		return -EINVAL;
> +
> +	if (copy_from_user(&sframe, data, size))
> +		return -EFAULT;
> +
> +	return sframe_add_section(sframe.sframe_start,
> +				  sframe.sframe_start + sframe.sframe_size,
> +				  sframe.text_start,
> +				  sframe.text_start + sframe.text_size);
> +}
> +
> +/**
> + * sys_sframe_unregister - unregister an sframe address
> + * @data: Structure of sframe data used to register the sframe section
> + * @size: The size of the given structure.
> + *
> + * The data->sframe_start is the only value that is used. The rest must
> + * be zero.
> + *
> + * Return: 0 if successful, otherwise a negative error.
> + */
> +SYSCALL_DEFINE2(sframe_unregister, __user struct sframe_setup *, data, unsigned int, size)
> +{
> +	struct sframe_setup sframe;
> +
> +	if (sizeof(sframe) != size)
> +		return -EINVAL;
> +
> +	if (copy_from_user(&sframe, data, size))
> +		return -EFAULT;
> +
> +	if (sframe.sframe_size || sframe.text_start || sframe.text_size)
> +		return -EINVAL;
> +
> +	return sframe_remove_section(sframe.sframe_start);
> +}
Thanks and regards,
Jens
-- 
Jens Remus
Linux on Z Development (D3303)
jremus@de.ibm.com / jremus@linux.ibm.com

IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Ehningen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/


^ permalink raw reply

* Re: [PATCH] ftrace: Use flexible array for hash buckets
From: kernel test robot @ 2026-05-22  8:45 UTC (permalink / raw)
  To: Rosen Penev
  Cc: oe-lkp, lkp, linux-kernel, linux-trace-kernel, Steven Rostedt,
	Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, oliver.sang
In-Reply-To: <20260520220030.16887-1-rosenp@gmail.com>



Hello,

kernel test robot noticed "BUG:KASAN:global-out-of-bounds_in_ftrace_find_rec_direct" on:

commit: 42ed22b1b4ae179c78e088477950eb1dd1dc9e90 ("[PATCH] ftrace: Use flexible array for hash buckets")
url: https://github.com/intel-lab-lkp/linux/commits/Rosen-Penev/ftrace-Use-flexible-array-for-hash-buckets/20260521-060312
base: https://git.kernel.org/cgit/linux/kernel/git/trace/linux-trace for-next
patch link: https://lore.kernel.org/all/20260520220030.16887-1-rosenp@gmail.com/
patch subject: [PATCH] ftrace: Use flexible array for hash buckets

in testcase: boot

config: x86_64-rhel-9.4-kunit
compiler: gcc-14
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 32G

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202605221023.2bcc5f10-lkp@intel.com



[    9.849798][    T1] BUG: KASAN: global-out-of-bounds in ftrace_find_rec_direct (trace/ftrace.c:1172 (discriminator 2) trace/ftrace.c:2611 (discriminator 2))
[    9.849798][    T1] Read of size 8 at addr ffffffff9675bd48 by task swapper/0/1
[    9.849798][    T1]
[    9.849798][    T1] CPU: 1 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.1.0-rc3+ #1 PREEMPT(lazy)
[    9.849798][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[    9.849798][    T1] Call Trace:
[    9.849798][    T1]  <TASK>
[    9.849798][    T1]  dump_stack_lvl (dump_stack.c:94 dump_stack.c:120)
[    9.849798][    T1]  print_address_description+0x70/0x300
[    9.849798][    T1]  print_report (kasan/report.c:482)
[    9.849798][    T1]  ? __virt_addr_valid (linux/mmzone.h:2198 (discriminator 1) linux/mmzone.h:2280 (discriminator 1) x86/mm/physaddr.c:54 (discriminator 1))
[    9.849798][    T1]  ? ftrace_find_rec_direct (trace/ftrace.c:1172 (discriminator 2) trace/ftrace.c:2611 (discriminator 2))
[    9.849798][    T1]  kasan_report (kasan/report.c:595)
[    9.849798][    T1]  ? ftrace_find_rec_direct (trace/ftrace.c:1172 (discriminator 2) trace/ftrace.c:2611 (discriminator 2))
[    9.849798][    T1]  ? __pfx_trace_selftest_dynamic_test_func (??:?)
[    9.849798][    T1]  ftrace_find_rec_direct (trace/ftrace.c:1172 (discriminator 2) trace/ftrace.c:2611 (discriminator 2))
[    9.849798][    T1]  register_ftrace_direct (trace/ftrace.c:6057)
[    9.849798][    T1]  ? __pfx_ftrace_stub_direct_tramp (x86/kernel/ftrace_64.S:318)
[    9.849798][    T1]  trace_selftest_startup_function_graph (trace/trace_selftest.c:1139)
[    9.849798][    T1]  ? __pfx_trace_selftest_startup_function_graph (trace/trace_selftest.c:754)
[    9.849798][    T1]  run_tracer_selftest (trace/trace.c:1335)
[    9.849798][    T1]  register_tracer (trace/trace.c:1375 trace/trace.c:1492)
[    9.849798][    T1]  ? __pfx_init_graph_trace (trace/trace_functions_graph.c:1811)
[    9.849798][    T1]  do_one_initcall (main.c:1347)
[    9.849798][    T1]  ? __pfx_do_one_initcall (trace/events/initcall.h:10)
[    9.849798][    T1]  ? asm_sysvec_apic_timer_interrupt (x86/include/asm/idtentry.h:569)
[    9.849798][    T1]  ? ret_from_fork_asm (x86/entry/entry_64.S:245)
[    9.849798][    T1]  do_initcalls (main.c:1409 (discriminator 1) main.c:1425 (discriminator 1))
[    9.849798][    T1]  kernel_init_freeable (main.c:1445 main.c:1658)
[    9.849798][    T1]  ? __pfx_kernel_init_freeable (main.c:1626)
[    9.849798][    T1]  ? __pfx_schedule_timeout (??:?)
[    9.849798][    T1]  ? __pfx__raw_spin_lock_irq (locking/spinlock.c:183)
[    9.849798][    T1]  ? __pfx_kernel_init (main.c:717)
[    9.849798][    T1]  kernel_init (main.c:1548)
[    9.849798][    T1]  ? __pfx_kernel_init (main.c:717)
[    9.849798][    T1]  ret_from_fork (x86/kernel/process.c:158)
[    9.849798][    T1]  ? __pfx_ret_from_fork (x86/include/asm/entry-common.h:54)
[    9.849798][    T1]  ? switch_fpu (linux/instrumented.h:82 asm-generic/bitops/instrumented-non-atomic.h:141 linux/thread_info.h:133 linux/sched.h:2066 x86/include/asm/fpu/sched.h:34)
[    9.849798][    T1]  ? __switch_to (x86/kernel/process_64.c:619)
[    9.849798][    T1]  ? __switch_to_asm (x86/entry/entry_64.S:206)
[    9.849798][    T1]  ? __pfx_kernel_init (main.c:717)
[    9.849798][    T1]  ret_from_fork_asm (x86/entry/entry_64.S:245)
[    9.849798][    T1]  </TASK>
[    9.849798][    T1]
[    9.849798][    T1] The buggy address belongs to the variable:
[    9.849798][    T1]  empty_hash+0x28/0xa0
[    9.849798][    T1]
[    9.849798][    T1] The buggy address belongs to the physical page:
[    9.849798][    T1] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x77d95b
[    9.849798][    T1] flags: 0x17ffffc0002000(reserved|node=0|zone=2|lastcpupid=0x1fffff)
[    9.849798][    T1] raw: 0017ffffc0002000 ffffea001df656c8 ffffea001df656c8 0000000000000000
[    9.849798][    T1] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
[    9.849798][    T1] page dumped because: kasan: bad access detected
[    9.849798][    T1]
[    9.849798][    T1] Memory state around the buggy address:
[    9.849798][    T1]  ffffffff9675bc00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[    9.849798][    T1]  ffffffff9675bc80: 00 00 00 00 00 00 f9 f9 f9 f9 f9 f9 00 00 00 00
[    9.849798][    T1] >ffffffff9675bd00: f9 f9 f9 f9 00 00 00 00 00 f9 f9 f9 f9 f9 f9 f9
[    9.849798][    T1]                                               ^
[    9.849798][    T1]  ffffffff9675bd80: 00 02 f9 f9 f9 f9 f9 f9 00 00 00 00 00 00 00 00
[    9.849798][    T1]  ffffffff9675be00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[    9.849798][    T1] ==================================================================
[    9.904835][    T1] Disabling lock debugging due to kernel taint


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20260522/202605221023.2bcc5f10-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply

* Re: [PATCH] Re: Re: [RFC PATCH v2 04/10] rv/da: add pre-allocated storage pool for per-object monitors
From: Gabriele Monaco @ 2026-05-22  6:45 UTC (permalink / raw)
  To: Wen Yang; +Cc: linux-kernel, linux-trace-kernel, rostedt
In-Reply-To: <7917085b-fef1-4017-b289-eefe84332ff5@linux.dev>

Hi Wen,

On Fri, 2026-05-22 at 01:40 +0800, Wen Yang wrote:
> Hi Gabriele, 
> 
> No specific reason for REL_SOFT — not intentional, reverting to 
> REL_HARD.
>  
> Reproduced the stall on the same config (PREEMPT_RT + 
> PROVE_LOCKING/PROVE_RCU). 
> 
> Root cause is a cleanup ordering bug in uprobe_detail_waiting.tc, 
> unrelated to REL_SOFT/REL_HARD:
>  
> 
>    # original cleanup — wrong order 
> 
>    echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR"  # (A) 
> 
>    kill "$hog_pid"                                              # (B) 
> 
>  
>                                                               (A) 
> triggers synchronize_srcu() in the kernel. But tlob_target is stuck 
> mid-uprobe_notify_resume holding an SRCU read lock, preempted by the 
> FIFO-99 hog -> so the reader never finishes and (B) is never reached. 
> rcuc/0 (a kthread on PREEMPT_RT) is also starved by the hog -> RCU stall.

great you found the issue and solution. Wonder why lockdep wasn't more
informative, but probably the issue was so frequent to hog that too.


>    Fix: kill the hog first: 
>  
> 
>    kill "$hog_pid"; wait "$hog_pid" 
> 
>    echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR" 
> 
> 
> On the PREEMPT_RT it is more reliably triggered there because rcuc/0
> runs as a preemptible kthread rather than in softirq, making it easier
> for the hog to monopolise the CPU long enough to hit the stall.
> 
> Thank you for the thorough review and valuable suggestions. We are 
> working through all of them and running the full test suite.
> We expect to post v3 within the next two days.

Alright, sounds good.

Thanks,
Gabriele


^ permalink raw reply

* [PATCH v6] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-22  2:50 UTC (permalink / raw)
  To: LKML, Linux trace kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Peter Zijlstra,
	Namhyung Kim, Takaya Saeki, Douglas Raillard, Tom Zanussi,
	Andrew Morton, Thomas Gleixner, Ian Rogers, Jiri Olsa

From: Steven Rostedt <rostedt@goodmis.org>

Add syntax to the parsing of eprobes to be able to typecast a trace event
field that is a pointer to a structure.

Currently, a dereference must be a number, where the user has to figure
out manually the offset of a member of a structure that they want to
dereference.

But for event probes that records a field that happens to be a pointer to
a structure, it cannot dereference these values with BTF naming, but
must use numerical offsets.

For example, to find out what device a sk_buff is pointing to in the
net_dev_xmit trace event, one must first use gdb to find the offsets of the
members of the structures:

 (gdb) p &((struct sk_buff *)0)->dev
 $1 = (struct net_device **) 0x10
 (gdb) p &((struct net_device *)0)->name
 $2 = (char (*)[16]) 0x118

And then use the raw numbers to dereference:

  # echo 'e:xmit net.net_dev_xmit +0x118(+0x10($skbaddr)):string' >> dynamic_events

If BTF is in the kernel, then instead, the skbaddr can be typecast to
sk_buff and use the normal dereference logic.

  # echo 'e:xmit net.net_dev_xmit (sk_buff)skbaddr->dev->name:string' >> dynamic_events
  # echo 1 > events/eprobes/xmit/enable
  # cat trace
[..]
    sshd-session-1022    [000] b..2.   860.249343: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.250061: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.250142: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.263553: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.283820: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.302716: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.322905: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.342828: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.362268: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.382335: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.400856: xmit: (net.net_dev_xmit) arg1="enp7s0"
    sshd-session-1022    [000] b..2.   860.419893: xmit: (net.net_dev_xmit) arg1="enp7s0"

The syntax is simply: (STRUCT)(FIELD)->MEMBER[->MEMBER..]

Also add comments around the #else and #endif of #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
to know what they are for.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---

Changes since v5: https://patch.msgid.link/20260519130144.40e71a00@fedora

- Restructure parse_btf_arg() to only use typecast when TEVENT flag is set
  (eprobes).

- Move found_type label to still test for (!type) in case ctx->last_struct
  is NULL for some reason.

- Restructure the code to be more specific to when to use ctx->struct_btf
  and use ctx->btf for places that the typecast will not be used.

- Move parse_trace_event() function outside of BTF #ifdef as it is used
  outside of it as well.

- Move the code in the '(' switch statement into a helper function called
  handle_typecast(). This will be a stub when BTF is not configured in.

- Have the ctx->last_struct and ctx->struct_btf only set within the
  handle_typecast() function, and have that function reset it too.

- Add comments to the #else and #endif of the #ifdef
  CONFIG_PROBE_EVENTS_BTF_ARGS to know what they are for.

 Documentation/trace/eprobetrace.rst |   4 +
 kernel/trace/trace_probe.c          | 171 +++++++++++++++++++++++-----
 kernel/trace/trace_probe.h          |   7 +-
 3 files changed, 152 insertions(+), 30 deletions(-)

diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index 89b5157cfab8..fe3602540569 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -46,6 +46,10 @@ Synopsis of eprobe_events
 		  (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
                   "string", "ustring", "symbol", "symstr" and "bitfield" are
                   supported.
+  (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+                  a pointer to STRUCT and then derference the pointer defined by
+                  ->MEMBER. Note that when this is used, the FIELD name does not
+                  need to be prefixed with a '$'.
 
 Types
 -----
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index e0d3a0da26af..4a205cebda60 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -332,6 +332,25 @@ static int parse_trace_event_arg(char *arg, struct fetch_insn *code,
 	return -ENOENT;
 }
 
+static int parse_trace_event(char *arg, struct fetch_insn *code,
+			     struct traceprobe_parse_context *ctx)
+{
+	int ret;
+
+	if (code->data)
+		return -EFAULT;
+	ret = parse_trace_event_arg(arg, code, ctx);
+	if (!ret)
+		return 0;
+	if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
+		code->op = FETCH_OP_COMM;
+		return 0;
+	}
+	/* backward compatibility */
+	ctx->offset = 0;
+	return -EINVAL;
+}
+
 #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
 
 static u32 btf_type_int(const struct btf_type *t)
@@ -376,11 +395,17 @@ static bool btf_type_is_char_array(struct btf *btf, const struct btf_type *type)
 		&& BTF_INT_BITS(intdata) == 8;
 }
 
+static struct btf *ctx_btf(struct traceprobe_parse_context *ctx)
+{
+	return ctx->flags & TPARG_FL_TYPECAST ?
+		ctx->struct_btf : ctx->btf;
+}
+
 static int check_prepare_btf_string_fetch(char *typename,
 				struct fetch_insn **pcode,
 				struct traceprobe_parse_context *ctx)
 {
-	struct btf *btf = ctx->btf;
+	struct btf *btf = ctx_btf(ctx);
 
 	if (!btf || !ctx->last_type)
 		return 0;
@@ -554,22 +579,29 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 	struct fetch_insn *code = *pcode;
 	const struct btf_member *field;
 	u32 bitoffs, anon_offs;
+	bool is_struct = ctx->flags & TPARG_FL_TYPECAST;
+	struct btf *btf = ctx_btf(ctx);
 	char *next;
 	int is_ptr;
 	s32 tid;
 
 	do {
-		/* Outer loop for solving arrow operator ('->') */
-		if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
-			trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
-			return -EINVAL;
-		}
-		/* Convert a struct pointer type to a struct type */
-		type = btf_type_skip_modifiers(ctx->btf, type->type, &tid);
-		if (!type) {
-			trace_probe_log_err(ctx->offset, BAD_BTF_TID);
-			return -EINVAL;
+		if (!is_struct) {
+			/* Outer loop for solving arrow operator ('->') */
+			if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
+				trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
+				return -EINVAL;
+			}
+
+			/* Convert a struct pointer type to a struct type */
+			type = btf_type_skip_modifiers(btf, type->type, &tid);
+			if (!type) {
+				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+				return -EINVAL;
+			}
 		}
+		/* Only the first type can skip being a pointer */
+		is_struct = false;
 
 		bitoffs = 0;
 		do {
@@ -580,7 +612,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 				return is_ptr;
 
 			anon_offs = 0;
-			field = btf_find_struct_member(ctx->btf, type, fieldname,
+			field = btf_find_struct_member(btf, type, fieldname,
 						       &anon_offs);
 			if (IS_ERR(field)) {
 				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
@@ -602,7 +634,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 				ctx->last_bitsize = 0;
 			}
 
-			type = btf_type_skip_modifiers(ctx->btf, field->type, &tid);
+			type = btf_type_skip_modifiers(btf, field->type, &tid);
 			if (!type) {
 				trace_probe_log_err(ctx->offset, BAD_BTF_TID);
 				return -EINVAL;
@@ -627,6 +659,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
 	return 0;
 }
 
+
 static int __store_entry_arg(struct trace_probe *tp, int argnum);
 
 static int parse_btf_arg(char *varname,
@@ -640,7 +673,7 @@ static int parse_btf_arg(char *varname,
 	int i, is_ptr, ret;
 	u32 tid;
 
-	if (WARN_ON_ONCE(!ctx->funcname))
+	if (WARN_ON_ONCE(!ctx->funcname && !(ctx->flags & TPARG_FL_TEVENT)))
 		return -EINVAL;
 
 	is_ptr = split_next_field(varname, &field, ctx);
@@ -653,6 +686,20 @@ static int parse_btf_arg(char *varname,
 		return -EOPNOTSUPP;
 	}
 
+	if (ctx->flags & TPARG_FL_TEVENT) {
+		int ret;
+
+		ret = parse_trace_event(varname, code, ctx);
+		if (ret < 0)
+			return ret;
+
+		if (ctx->flags & TPARG_FL_TYPECAST) {
+			type = ctx->last_struct;
+			goto found_type;
+		}
+		return 0;
+	}
+
 	if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
 		code->op = FETCH_OP_RETVAL;
 		/* Check whether the function return type is not void */
@@ -709,6 +756,7 @@ static int parse_btf_arg(char *varname,
 
 found:
 	type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
+found_type:
 	if (!type) {
 		trace_probe_log_err(ctx->offset, BAD_BTF_TID);
 		return -EINVAL;
@@ -727,7 +775,7 @@ static int parse_btf_arg(char *varname,
 static const struct fetch_type *find_fetch_type_from_btf_type(
 					struct traceprobe_parse_context *ctx)
 {
-	struct btf *btf = ctx->btf;
+	struct btf *btf = ctx_btf(ctx);
 	const char *typestr = NULL;
 
 	if (btf && ctx->last_type)
@@ -758,7 +806,70 @@ static int parse_btf_bitfield(struct fetch_insn **pcode,
 	return 0;
 }
 
-#else
+static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
+{
+	int id;
+
+	if (!ctx->struct_btf) {
+		struct btf *btf;
+
+		id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
+		if (id < 0)
+			return id;
+		ctx->struct_btf = btf;
+	} else {
+		id = btf_find_by_name_kind(ctx->struct_btf, sname, BTF_KIND_STRUCT);
+		if (id < 0)
+			return id;
+	}
+
+	ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
+	return 0;
+}
+
+static int handle_typecast(char *arg, struct fetch_insn **pcode,
+			   struct fetch_insn *end,
+			   struct traceprobe_parse_context *ctx)
+{
+	char *tmp;
+	int ret;
+
+	/* Currently this only works for eprobes */
+	if (!(ctx->flags & TPARG_FL_TEVENT)) {
+		trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
+		return -EINVAL;
+	}
+
+	tmp = strchr(arg, ')');
+	if (!tmp) {
+		trace_probe_log_err(ctx->offset + strlen(arg),
+				    DEREF_OPEN_BRACE);
+		return -EINVAL;
+	}
+	*tmp = '\0';
+	ret = query_btf_struct(arg + 1, ctx);
+	*tmp = ')';
+
+	if (ret < 0) {
+		trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
+		ret = -EINVAL;
+		goto out_put;
+	}
+
+	ctx->flags |= TPARG_FL_TYPECAST;
+	tmp++;
+
+	ctx->offset += tmp - arg;
+	ret = parse_btf_arg(tmp, pcode, end, ctx);
+	ctx->flags &= ~TPARG_FL_TYPECAST;
+	ctx->last_struct = NULL;
+out_put:
+	btf_put(ctx->struct_btf);
+	return ret;
+}
+
+#else /* !CONFIG_PROBE_EVENTS_BTF_ARGS */
+
 static void clear_btf_context(struct traceprobe_parse_context *ctx)
 {
 	ctx->btf = NULL;
@@ -794,7 +905,15 @@ static int check_prepare_btf_string_fetch(char *typename,
 	return 0;
 }
 
-#endif
+static int handle_typecast(char *arg, struct fetch_insn **pcode,
+			   struct fetch_insn *end,
+			   struct traceprobe_parse_context *ctx)
+{
+	trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
+	return -EOPNOTSUPP;
+}
+
+#endif /* CONFIG_PROBE_EVENTS_BTF_ARGS */
 
 #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
 
@@ -953,18 +1072,9 @@ static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
 	int len;
 
 	if (ctx->flags & TPARG_FL_TEVENT) {
-		if (code->data)
-			return -EFAULT;
-		ret = parse_trace_event_arg(arg, code, ctx);
-		if (!ret)
-			return 0;
-		if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
-			code->op = FETCH_OP_COMM;
-			return 0;
-		}
-		/* backward compatibility */
-		ctx->offset = 0;
-		goto inval;
+		if (parse_trace_event(arg, code, ctx) < 0)
+			goto inval;
+		return 0;
 	}
 
 	if (str_has_prefix(arg, "retval")) {
@@ -1231,6 +1341,9 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
 				code->op = FETCH_OP_IMM;
 		}
 		break;
+	case '(':
+		ret = handle_typecast(arg, pcode, end, ctx);
+		break;
 	default:
 		if (isalpha(arg[0]) || arg[0] == '_') {	/* BTF variable */
 			if (!tparg_is_function_entry(ctx->flags) &&
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 262d8707a3df..952e3d7582b8 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -394,6 +394,7 @@ static inline int traceprobe_get_entry_data_size(struct trace_probe *tp)
  * TPARG_FL_KERNEL and TPARG_FL_USER are also mutually exclusive.
  * TPARG_FL_FPROBE and TPARG_FL_TPOINT are optional but it should be with
  * TPARG_FL_KERNEL.
+ * TPARG_FL_TYPECAST is set if an argument was typecast to a structure.
  */
 #define TPARG_FL_RETURN BIT(0)
 #define TPARG_FL_KERNEL BIT(1)
@@ -402,6 +403,7 @@ static inline int traceprobe_get_entry_data_size(struct trace_probe *tp)
 #define TPARG_FL_USER   BIT(4)
 #define TPARG_FL_FPROBE BIT(5)
 #define TPARG_FL_TPOINT BIT(6)
+#define TPARG_FL_TYPECAST BIT(7)
 #define TPARG_FL_LOC_MASK	GENMASK(4, 0)
 
 static inline bool tparg_is_function_entry(unsigned int flags)
@@ -422,7 +424,9 @@ struct traceprobe_parse_context {
 	const struct btf_param *params;	/* Parameter of the function */
 	s32 nr_params;			/* The number of the parameters */
 	struct btf *btf;		/* The BTF to be used */
+	struct btf *struct_btf;		/* The BTF to be used for structs */
 	const struct btf_type *last_type;	/* Saved type */
+	const struct btf_type *last_struct;	/* Saved structure */
 	u32 last_bitoffs;		/* Saved bitoffs */
 	u32 last_bitsize;		/* Saved bitsize */
 	struct trace_probe *tp;
@@ -563,7 +567,8 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
 	C(NEED_STRING_TYPE,	"$comm and immediate-string only accepts string type"),\
 	C(TOO_MANY_ARGS,	"Too many arguments are specified"),	\
 	C(TOO_MANY_EARGS,	"Too many entry arguments specified"),	\
-	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),
+	C(EVENT_TOO_BIG,	"Event too big (too many fields?)"),  \
+	C(TYPECAST_NOT_EVENT,	"Typecasts are only for eprobe fields"),
 
 #undef C
 #define C(a, b)		TP_ERR_##a
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH v2] tracing/osnoise: Array printk init and cleanup
From: Crystal Wood @ 2026-05-21 23:48 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-trace-kernel, John Kacur, Tomas Glozar, Costa Shulyupin,
	Wander Lairson Costa, sashiko-bot, sashiko-reviews
In-Reply-To: <20260520162810.3112a9d6@gandalf.local.home>

On Wed, 2026-05-20 at 16:28 -0400, Steven Rostedt wrote:
> [ Replying to Sashiko: https://sashiko.dev/?list=org.kernel.vger.linux-trace-kernel#/patchset/20260511223035.1475676-1-crwood%40redhat.com ]
> 
> > commit 859dc1eded9e36304f27e72643f28f30f327f078
> > Author: Crystal Wood <crwood@redhat.com>
> > tracing/osnoise: Array printk init and cleanup
> > This commit initializes the trace array buffer for osnoise instances on
> > creation. It also introduces the osnoise_print() helper function and switches
> > the tracer to use trace_array_printk() instead of trace_array_printk_buf().
> > > Use trace_array_printk() instead of trace_array_printk_buf(), as we're
> > > only writing to the main buffer (of a non-main instance) anyway -- and
> 
> > Is this description accurate? Can the osnoise tracer also be enabled on the
> > global trace array, for example by writing to current_tracer?
> > If it is used on the global instance, trace_array_init_printk() explicitly
> > returns -EINVAL and refuses to allocate a buffer, and trace_array_printk()
> > will return 0 without printing.
> 
> Hmm, I guess it would drop any of these prints if it's using the top level
> trace array. But I believe this is for use with RTLA which always creates a
> trace instance.

It doesn't *have* to be used with rtla, but you won't get the printks if
you use the main instance.  The parenthetical was just meant to avoid
potential misunderstanding of the phrase "main buffer".

> > > diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> > > index 75678053b21c5..2be188768ab42 100644
> > > --- a/kernel/trace/trace_osnoise.c
> > > +++ b/kernel/trace/trace_osnoise.c
> > > @@ -83,6 +83,22 @@ struct osnoise_instance {
> > >  
> > >  static struct list_head osnoise_instances;
> > >  
> > > +static void osnoise_print(const char *fmt, ...)
> > > +{
> > > +	struct osnoise_instance *inst;
> > > +	struct trace_array *tr;
> > > +	va_list ap;
> > > +
> > > +	rcu_read_lock();
> > > +	list_for_each_entry_rcu(inst, &osnoise_instances, list) {
> > > +		tr = inst->tr;
> > > +		va_start(ap, fmt);
> > > +		trace_array_vprintk(tr, _RET_IP_, fmt, ap);
> 
> > Does this code create a use-after-free on the trace array if an instance is
> > removed concurrently?

If so, it was already an issue with osnoise_taint(),
osnoise_stop_tracing(), osnoise_stop_exception(), etc.  Wouldn't be
surprising, as this file has a number of other synchronization issues as
well.

> > When a user deletes a trace instance via rmdir, the unregister function
> > removes the instance from the list using list_del_rcu(). However, the removal
> > routine does not appear to wait for an RCU grace period before freeing the
> > trace array itself.
> > Could a concurrent execution of this loop inside the rcu_read_lock() section
> > still access the unlinked instance, read the freed inst->tr, and pass it to
> > trace_array_vprintk()? This appears to be an existing issue, but it still
> > affects the loop here.
> 
> Hmm, this is interesting. osnoise keeps track of its own instances via a
> osnoise_instances list. But it only use kfree_rcu() to free the list
> descriptor but doesn't take care of the tr being freed before hand!
> 
> Something like this could work [not even compiled]
> 
> diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> index 75678053b21c..bda1e0e0d2e1 100644
> --- a/kernel/trace/trace_osnoise.c
> +++ b/kernel/trace/trace_osnoise.c
> @@ -476,8 +476,11 @@ static void print_osnoise_headers(struct seq_file *s)
>  										\
>  	rcu_read_lock();							\
>  	list_for_each_entry_rcu(inst, &osnoise_instances, list) {		\
> +		if (trace_array_get(inst->tr) < 0)				\
> +			continue;						\
>  		buffer = inst->tr->array_buffer.buffer;				\
>  		trace_array_printk_buf(buffer, _THIS_IP_, msg);			\
> +		trace_array_put(inst->tr);					\
>  	}									\
>  	rcu_read_unlock();							\
>  	osnoise_data.tainted = true;						\

OK, I'll prepare a v3.

> > > @@ -1189,10 +1198,10 @@ static __always_inline void osnoise_stop_exception(char *msg, int cpu)
> > >  	rcu_read_lock();
> > >  	list_for_each_entry_rcu(inst, &osnoise_instances, list) {
> > >  		tr = inst->tr;
> > > -		trace_array_printk_buf(tr->array_buffer.buffer, _THIS_IP_,
> > > -				       "stop tracing hit on cpu %d due to exception: %s\n",
> > > -				       smp_processor_id(),
> > > -				       msg);
> > > +		trace_array_printk(tr, _THIS_IP_,
> > > +				   "stop tracing hit on cpu %d due to exception: %s\n",
> > > +				   smp_processor_id(),
> > > +				   msg);
> 
> > Does this switch to trace_array_printk() silently drop messages when tracing
> > on the global instance?
> > Looking at trace_array_printk(), it explicitly returns 0 if the
> > TRACE_ARRAY_FL_GLOBAL flag is set on the trace array:
> > kernel/trace/trace_printk.c:trace_array_printk() {
> >     ...
> > 	if (tr->flags & TRACE_ARRAY_FL_GLOBAL)
> > 		return 0;
> >     ...
> > }
> > This means stop and exception logs would be discarded if the tracer is
> > running on the global instance.
> 
> Yep. But maybe that's not an issue?

I'd actually consider it a fix, if the policy is actually about not
allowing tracers to "spam" the main instance, rather than just avoiding
the percpu allocation.  Especially for osnoise_stop_exception(), which
is called only one place, that already printed the same message with
osnoise_taint(). :-P

As I mentioned in the v1 patch, if trace_array_printk_buf() is going to
bypass the global instance check, should probably be internal to the
core trace code.

-Crystal


^ permalink raw reply

* [PATCH] unwind: Add sframe_(un)register() system calls
From: Steven Rostedt @ 2026-05-21 22:35 UTC (permalink / raw)
  To: LKML, Linux Trace Kernel, bpf
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Jens Remus, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Linus Torvalds, Andrew Morton,
	Florian Weimer, Kees Cook, Carlos O'Donell, Sam James,
	Dylan Hatch, Borislav Petkov, Dave Hansen, David Hildenbrand,
	H. Peter Anvin, Liam R. Howlett, Lorenzo Stoakes, Michal Hocko,
	Mike Rapoport, Suren Baghdasaryan, Vlastimil Babka,
	Heiko Carstens, Vasily Gorbik

From: Steven Rostedt <rostedt@goodmis.org>

Add system calls to register and unregister sframes that can be used by
dynamic linkers to tell the kernel where the sframe section is in memory
for libraries it loads.

Both system calls take a pointer to a new structure:

  struct sframe_setup {
	unsigned long		sframe_start;
	unsigned long		sframe_size;
	unsigned long		text_start;
	unsigned long		text_size;
  };

and a size of the passed in structure. If the system call needs to be
extended, then the structure could be changed and the size of that
structure will tell the kernel that it is the new version. If the kernel
does not recognize the structure size, it will return -EINVAL.

  sframe_start - The virtual address of the sframe section
  sframe_size  - The length of the sframe section
  text_start   - the text section the sframe represents
  test_size    - the length of the section

If other stack tracing functionality is added, it will require a new
system call.

The unregister only needs the sframe_start and requires all the rest of
the fields to be 0. In the future, if more can be done, then user space
can update the other values and check the return code to see if the kernel
supports it.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---

Based on top of Jens patches here:

  https://lore.kernel.org/linux-trace-kernel/20260520154004.3845823-1-jremus@linux.ibm.com/

[ Note, I tested this with the same program from the RFC patch ]

Changes from RFC: https://patch.msgid.link/20260429114355.6c712e6a@gandalf.local.home

- Remove the ioctl() like system call for a unique system call for each
  functionality. Right now there's two functionalities:
   1. register sframe section
   2. unregister sframe sections

- Added taking a lock around the mtree logic in __sframe_remove_section()
  as Sashiko mentioned that there could be races from user space
  registering and unregistering sframe sections at the same time.

- Removed [RFC] from subject as I believe this is more likely the way
  this system call will be done.

 arch/alpha/kernel/syscalls/syscall.tbl      |  2 +
 arch/arm/tools/syscall.tbl                  |  2 +
 arch/arm64/tools/syscall_32.tbl             |  2 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  2 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  2 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  2 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  2 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  2 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  2 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  2 +
 arch/s390/kernel/syscalls/syscall.tbl       |  3 +
 arch/sh/kernel/syscalls/syscall.tbl         |  2 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  2 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  2 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  2 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  2 +
 include/linux/syscalls.h                    |  2 +
 include/uapi/asm-generic/unistd.h           |  7 ++-
 include/uapi/linux/sframe.h                 | 12 ++++
 kernel/sys_ni.c                             |  3 +
 kernel/unwind/sframe.c                      | 63 ++++++++++++++++++++-
 scripts/syscall.tbl                         |  2 +
 22 files changed, 118 insertions(+), 4 deletions(-)
 create mode 100644 include/uapi/linux/sframe.h

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index f31b7afffc34..f0639b831f2a 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -511,3 +511,5 @@
 579	common	file_setattr			sys_file_setattr
 580	common	listns				sys_listns
 581	common	rseq_slice_yield		sys_rseq_slice_yield
+582	common	sframe_register			sys_sframe_register
+583	common	sframe_unregister		sys_sframe_unregister
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 94351e22bfcf..887b242ffb25 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -486,3 +486,5 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	sframe_register			sys_sframe_register
+473	common	sframe_unregister		sys_sframe_unregister
diff --git a/arch/arm64/tools/syscall_32.tbl b/arch/arm64/tools/syscall_32.tbl
index 62d93d88e0fe..c820f1ff718c 100644
--- a/arch/arm64/tools/syscall_32.tbl
+++ b/arch/arm64/tools/syscall_32.tbl
@@ -483,3 +483,5 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	sframe_register			sys_sframe_register
+473	common	sframe_unregister		sys_sframe_unregister
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 248934257101..4c7f17f0364b 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -471,3 +471,5 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	sframe_register			sys_sframe_register
+473	common	sframe_unregister		sys_sframe_unregister
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 223d26303627..e8dc2cc149f4 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -477,3 +477,5 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	sframe_register			sys_sframe_register
+473	common	sframe_unregister		sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 7430714e2b8f..d0bae05d16af 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -410,3 +410,5 @@
 469	n32	file_setattr			sys_file_setattr
 470	n32	listns				sys_listns
 471	n32	rseq_slice_yield		sys_rseq_slice_yield
+472	n32	sframe_register			sys_sframe_register
+473	n32	sframe_unregister		sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 630aab9e5425..2e200de6a58c 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -386,3 +386,5 @@
 469	n64	file_setattr			sys_file_setattr
 470	n64	listns				sys_listns
 471	n64	rseq_slice_yield		sys_rseq_slice_yield
+472	n64	sframe_register			sys_sframe_register
+473	n64	sframe_unregister		sys_sframe_unregister
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 128653112284..0e3b82011ae2 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -459,3 +459,5 @@
 469	o32	file_setattr			sys_file_setattr
 470	o32	listns				sys_listns
 471	o32	rseq_slice_yield		sys_rseq_slice_yield
+472	o32	sframe_register			sys_sframe_register
+473	o32	sframe_unregister		sys_sframe_unregister
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c6331dad9461..e0758ef8667d 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -470,3 +470,5 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	sframe_register			sys_sframe_register
+473	common	sframe_unregister		sys_sframe_unregister
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 4fcc7c58a105..eda40c4f4f2f 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -562,3 +562,5 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	nospu	rseq_slice_yield		sys_rseq_slice_yield
+472	nospu	sframe_register			sys_sframe_register
+473	nospu	sframe_unregister		sys_sframe_unregister
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 09a7ef04d979..52519e2acdc8 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -398,3 +398,6 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	stacktrace_setup		sys_stacktrace_setup
+472	common	sframe_register			sys_sframe_register
+473	common	sframe_unregister		sys_sframe_unregister
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 70b315cbe710..62ac7b1b4dd4 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -475,3 +475,5 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	sframe_register			sys_sframe_register
+473	common	sframe_unregister		sys_sframe_unregister
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 7e71bf7fcd14..f92273ae608a 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -517,3 +517,5 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	sframe_register			sys_sframe_register
+473	common	sframe_unregister		sys_sframe_unregister
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index f832ebd2d79b..409a50df3b21 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -477,3 +477,5 @@
 469	i386	file_setattr		sys_file_setattr
 470	i386	listns			sys_listns
 471	i386	rseq_slice_yield	sys_rseq_slice_yield
+472	i386	sframe_register		sys_sframe_register
+473	i386	sframe_unregister	sys_sframe_unregister
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 524155d655da..9b7c5a449751 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -396,6 +396,8 @@
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
 471	common	rseq_slice_yield	sys_rseq_slice_yield
+472	common	sframe_register		sys_sframe_register
+473	common	sframe_unregister	sys_sframe_unregister
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index a9bca4e484de..037b8040f69d 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -442,3 +442,5 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	sframe_register'		sys_sframe_register
+473	common	sframe_unregister		sys_sframe_unregister
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f5639d5ac331..992ccc401c5e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -999,6 +999,8 @@ asmlinkage long sys_lsm_get_self_attr(unsigned int attr, struct lsm_ctx __user *
 asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
 				      u32 size, u32 flags);
 asmlinkage long sys_lsm_list_modules(u64 __user *ids, u32 __user *size, u32 flags);
+asmlinkage long sys_sframe_register(void *data, unsigned int size);
+asmlinkage long sys_sframe_unregister(void *data, unsigned int size);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a627acc8fb5f..17042d7e5e87 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -863,8 +863,13 @@ __SYSCALL(__NR_listns, sys_listns)
 #define __NR_rseq_slice_yield 471
 __SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
 
+#define __NR_sframe_register 472
+__SYSCALL(__NR_sframe_register, sys_sframe_register)
+#define __NR_sframe_unregister 473
+__SYSCALL(__NR_sframe_unregister, sys_sframe_unregister)
+
 #undef __NR_syscalls
-#define __NR_syscalls 472
+#define __NR_syscalls 474
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/sframe.h b/include/uapi/linux/sframe.h
new file mode 100644
index 000000000000..137a2ebf91f4
--- /dev/null
+++ b/include/uapi/linux/sframe.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_SFRAME_H
+#define _UAPI_LINUX_SFRAME_H
+
+struct sframe_setup {
+	unsigned long		sframe_start;
+	unsigned long		sframe_size;
+	unsigned long		text_start;
+	unsigned long		text_size;
+};
+
+#endif /* _UAPI_LINUX_SFRAME_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index add3032da16f..eca5293f5d40 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -394,3 +394,6 @@ COND_SYSCALL(rseq_slice_yield);
 
 COND_SYSCALL(uretprobe);
 COND_SYSCALL(uprobe);
+
+COND_SYSCALL(sframe_register);
+COND_SYSCALL(sframe_unregister);
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index db88d993dff1..9956f1e3aba1 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -12,8 +12,10 @@
 #include <linux/mm.h>
 #include <linux/string_helpers.h>
 #include <linux/sframe.h>
+#include <linux/syscalls.h>
 #include <asm/unwind_user_sframe.h>
 #include <linux/unwind_user_types.h>
+#include <uapi/linux/sframe.h>
 
 #include "sframe.h"
 #include "sframe_debug.h"
@@ -842,9 +844,11 @@ static void sframe_free_srcu(struct rcu_head *rcu)
 static int __sframe_remove_section(struct mm_struct *mm,
 				   struct sframe_section *sec)
 {
-	if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
-		dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
-		return -EINVAL;
+	scoped_guard(mmap_read_lock, mm) {
+		if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
+			dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
+			return -EINVAL;
+		}
 	}
 
 	call_srcu(&sframe_srcu, &sec->rcu, sframe_free_srcu);
@@ -936,3 +940,56 @@ void sframe_free_mm(struct mm_struct *mm)
 
 	mtree_destroy(&mm->sframe_mt);
 }
+
+/**
+ * sys_sframe_register - register an address for user space stacktrace walking.
+ * @data: Structure of sframe data used to register the sframe section
+ * @size: The size of the given structure.
+ *
+ * This system call is used by dynamic library utilities to inform the kernel
+ * of meta data that it loaded that can be used by the kernel to know how
+ * to stack walk the given text locations.
+ *
+ * Return: 0 if successful, otherwise a negative error.
+ */
+SYSCALL_DEFINE2(sframe_register, __user struct sframe_setup *, data, unsigned int, size)
+{
+	struct sframe_setup sframe;
+
+	if (sizeof(sframe) != size)
+		return -EINVAL;
+
+	if (copy_from_user(&sframe, data, size))
+		return -EFAULT;
+
+	return sframe_add_section(sframe.sframe_start,
+				  sframe.sframe_start + sframe.sframe_size,
+				  sframe.text_start,
+				  sframe.text_start + sframe.text_size);
+}
+
+/**
+ * sys_sframe_unregister - unregister an sframe address
+ * @data: Structure of sframe data used to register the sframe section
+ * @size: The size of the given structure.
+ *
+ * The data->sframe_start is the only value that is used. The rest must
+ * be zero.
+ *
+ * Return: 0 if successful, otherwise a negative error.
+ */
+SYSCALL_DEFINE2(sframe_unregister, __user struct sframe_setup *, data, unsigned int, size)
+{
+	struct sframe_setup sframe;
+
+	if (sizeof(sframe) != size)
+		return -EINVAL;
+
+	if (copy_from_user(&sframe, data, size))
+		return -EFAULT;
+
+	if (sframe.sframe_size || sframe.text_start || sframe.text_size)
+		return -EINVAL;
+
+	return sframe_remove_section(sframe.sframe_start);
+}
diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
index 7a42b32b6577..46ec22b50042 100644
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -412,3 +412,5 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	sframe_register			sys_sframe_register
+473	common	sframe_unregister		sys_sframe_unregister
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH] tracing: Replace BUG_ON with lockdep_assert_held in uprobe_buffer functions
From: Steven Rostedt @ 2026-05-21 22:16 UTC (permalink / raw)
  To: Yash Suthar
  Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel,
	skhan
In-Reply-To: <20260521192846.8306-1-yashsuthar983@gmail.com>

On Fri, 22 May 2026 00:58:46 +0530
Yash Suthar <yashsuthar983@gmail.com> wrote:

> Replace BUG_ON(!mutex_is_locked(&event_mutex)) with
> lockdep_assert_held(&event_mutex) in uprobe_buffer_enable() and
> uprobe_buffer_disable().
> 
> BUG_ON() will crash the kernel. mutex_is_locked() only checks
> if any task holds lock,but not the caller task. lockdep_assert_held()
> also check current task for lock and no crash on true condition.
> 
> Signed-off-by: Yash Suthar <yashsuthar983@gmail.com>

This looks good to me.

Acked-by: Steven Rostedt <rostedt@goodmis.org>

Masami, do you want to take this?

-- Steve

> ---
>  kernel/trace/trace_uprobe.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index 2cabf8a23ec5..aee0960d0cf7 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -912,7 +912,7 @@ static int uprobe_buffer_enable(void)
>  {
>  	int ret = 0;
>  
> -	BUG_ON(!mutex_is_locked(&event_mutex));
> +	lockdep_assert_held(&event_mutex);
>  
>  	if (uprobe_buffer_refcnt++ == 0) {
>  		ret = uprobe_buffer_init();
> @@ -927,7 +927,7 @@ static void uprobe_buffer_disable(void)
>  {
>  	int cpu;
>  
> -	BUG_ON(!mutex_is_locked(&event_mutex));
> +	lockdep_assert_held(&event_mutex);
>  
>  	if (--uprobe_buffer_refcnt == 0) {
>  		for_each_possible_cpu(cpu)


^ permalink raw reply

* Re: [PATCH v6 21/43] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Ackerley Tng @ 2026-05-21 21:27 UTC (permalink / raw)
  To: Sean Christopherson, Fuad Tabba
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <ag8G7Wq5PbEdKloG@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Thu, May 21, 2026, Fuad Tabba wrote:
>> Hi,
>>
>> On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
>> <devnull+ackerleytng.google.com@kernel.org> wrote:
>> >
>> > From: Michael Roth <michael.roth@amd.com>
>> >
>> > For vm_memory_attributes=1, in-place conversion/population is not
>> > supported, so the initial contents necessarily must need to come
>> > from a separate src address, which is enforced by the current
>> > implementation. However, for vm_memory_attributes=0, it is possible for
>> > guest memory to be initialized directly from userspace by mmap()'ing the
>> > guest_memfd and writing to it while the corresponding GPA ranges are in
>> > a 'shared' state before converting them to the 'private' state expected
>> > by KVM_SEV_SNP_LAUNCH_UPDATE.
>> >
>> > Update the handling/documentation for KVM_SEV_SNP_LAUNCH_UPDATE to allow
>> > for 'uaddr' to be set to NULL when vm_memory_attributes=0, which
>> > SNP_LAUNCH_UPDATE will then use to determine when it should/shouldn't
>> > copy in data from a separate memory location. Continue to enforce
>> > non-NULL for the original vm_memory_attributes=1 case.
>> >
>> > Signed-off-by: Michael Roth <michael.roth@amd.com>
>> > [Added src_page check in error handling path when the firmware command fails]
>> > [Dropped ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES]
>> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>>
>> I'm not very familiar with the SEV-SNP populate flows, but it looks
>> like Sashiko is on to something:
>> https://sashiko.dev/#/patchset/20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4%40google.com?part=21
>>
>> - a potential read-only page overwrite, because src_page is acquired
>> via get_user_pages_fast() without the FOLL_WRITE flag, but is then
>> overwritten via memcpy
>
> Oof, yeah, that's bad.  Adding FOLL_WRITE to kvm_gmem_populate() feels wrong, and
> could break uABI, but doing gup() in SNP code would reintroduce the AB-BA issue
> with filemap_invalidate_lock().
>
> Aha!  Not if we use get_user_page_fast_only().  Ugh, but then we'd have to plumb
> the userspace address into the post-populated callback.
>
> Hrm.  Given that no one has yelled about overwriting their CPUID page, and given
> that the CPUID page is likely dynamically created and thus is unlikely to be a
> read-only mapping (e.g. versus the initial image), maybe this?
>

Overwriting the CPUID page is by design, I think. IIUC if the SNP
firmware doesn't like something about the CPUID page, it can update
src_page and then return an error to userspace.

Userspace should then check if it agrees with the updated CPUID contents
and then retry if it agrees.

> diff --git arch/x86/kvm/svm/sev.c arch/x86/kvm/svm/sev.c
> index 37d4cfa5d980..c73c028d72c1 100644
> --- arch/x86/kvm/svm/sev.c
> +++ arch/x86/kvm/svm/sev.c
> @@ -2456,6 +2456,7 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
>         sev_populate_args.type = params.type;
>
>         count = kvm_gmem_populate(kvm, params.gfn_start, src, npages,
> +                                 params.type == KVM_SEV_SNP_PAGE_TYPE_CPUID,

I think this makes sense given that writing to src_page can only happen
when params.type == KVM_SEV_SNP_PAGE_TYPE_CPUID (this is explicitly one
of the guards in sev_gmem_post_populate()):

	/*
	 * If the firmware command failed handle the reclaim and cleanup of that
	 * PFN before reporting an error.
	 *
	 * Additionally, when invalid CPUID function entries are detected,
	 * firmware writes the expected values into the page and leaves it
	 * unencrypted so it can be used for debugging and error-reporting.
	 *
	 * Copy this page back into the source buffer so userspace can use this
	 * information to provide information on which CPUID leaves/fields
	 * failed CPUID validation.
	 */
	if (ret && !snp_page_reclaim(kvm, pfn) &&
	    sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID &&
	    sev_populate_args->fw_error == SEV_RET_INVALID_PARAM && src_page) {
		void *src_vaddr = kmap_local_page(src_page);
		void *dst_vaddr = kmap_local_pfn(pfn);

		memcpy(src_vaddr, dst_vaddr, PAGE_SIZE);

		kunmap_local(src_vaddr);
		kunmap_local(dst_vaddr);
	}

>                                   sev_gmem_post_populate, &sev_populate_args);
>         if (count < 0) {
>                 argp->error = sev_populate_args.fw_error;
> diff --git arch/x86/kvm/vmx/tdx.c arch/x86/kvm/vmx/tdx.c
> index f97bcf580e6d..33f35be4455b 100644
> --- arch/x86/kvm/vmx/tdx.c
> +++ arch/x86/kvm/vmx/tdx.c
> @@ -3188,7 +3188,7 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
>                 };
>                 gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
>                                              u64_to_user_ptr(region.source_addr),
> -                                            1, tdx_gmem_post_populate, &arg);
> +                                            1, false, tdx_gmem_post_populate, &arg);

And TDX doesn't try to write src_page, so this is good too.

>                 if (gmem_ret < 0) {
>                         ret = gmem_ret;
>                         break;
> diff --git include/linux/kvm_host.h include/linux/kvm_host.h
> index 61a3430957f2..b83cda2870ba 100644
> --- include/linux/kvm_host.h
> +++ include/linux/kvm_host.h
> @@ -2596,7 +2596,8 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
>  typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>                                     struct page *page, void *opaque);
>
> -long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
> +long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src,
> +                      long npages, bool writable,

What do you think of need_writable_src instead of just writable for the
variable name?

>                        kvm_gmem_populate_cb post_populate, void *opaque);
>  #endif
>
> diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
> index a35a55571a2d..6553d4e032ce 100644
> --- virt/kvm/guest_memfd.c
> +++ virt/kvm/guest_memfd.c
> @@ -858,7 +858,8 @@ static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
>         return ret;
>  }
>
> -long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
> +long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src,
> +                      long npages, bool writable,
>                        kvm_gmem_populate_cb post_populate, void *opaque)
>  {
>         struct kvm_memory_slot *slot;
> @@ -892,8 +893,9 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
>
>                 if (src) {
>                         unsigned long uaddr = (unsigned long)src + i * PAGE_SIZE;
> +                       unsigned int flags = writable ? FOLL_WRITE : 0;

How about using FOLL_WRITE | FOLL_NOFAULT so if it weren't writable to
start with, don't CoW, just error out?

Like you said above the CPUID page provided as src_page would have been
written to before, so it should have been mapped as writable.

>
> -                       ret = get_user_pages_fast(uaddr, 1, 0, &src_page);
> +                       ret = get_user_pages_fast(uaddr, 1, flags, &src_page);

If we stick with FOLL_WRITE, this also solves the case where a read-only
mapping or global zero page are provided as src_page, since
get_user_pages_fast() will do a copy-on-write if those were the inputs,
making it writable before the write happens (on failure) in
sev_gmem_post_populate().

>                         if (ret < 0)
>                                 break;
>                         if (ret != 1) {
>
>> - an ordering violation with the kunmap_local() calls
>
> Yeesh, that's a new one for me.  Thankfully this is 64-bit only, so it's not an
> issue.
>
>> These predate this patch series and are just being touched by the
>> 'src_page' addition, but if Sashiko's right, these should probably be
>> fixed sooner rather than later.
>
> Yeah, ditto with the offset wrapping case.

^ permalink raw reply

* [PATCH] tracing: Replace BUG_ON with lockdep_assert_held in uprobe_buffer functions
From: Yash Suthar @ 2026-05-21 19:28 UTC (permalink / raw)
  To: rostedt, mhiramat
  Cc: mathieu.desnoyers, linux-kernel, linux-trace-kernel, skhan,
	Yash Suthar

Replace BUG_ON(!mutex_is_locked(&event_mutex)) with
lockdep_assert_held(&event_mutex) in uprobe_buffer_enable() and
uprobe_buffer_disable().

BUG_ON() will crash the kernel. mutex_is_locked() only checks
if any task holds lock,but not the caller task. lockdep_assert_held()
also check current task for lock and no crash on true condition.

Signed-off-by: Yash Suthar <yashsuthar983@gmail.com>
---
 kernel/trace/trace_uprobe.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 2cabf8a23ec5..aee0960d0cf7 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -912,7 +912,7 @@ static int uprobe_buffer_enable(void)
 {
 	int ret = 0;
 
-	BUG_ON(!mutex_is_locked(&event_mutex));
+	lockdep_assert_held(&event_mutex);
 
 	if (uprobe_buffer_refcnt++ == 0) {
 		ret = uprobe_buffer_init();
@@ -927,7 +927,7 @@ static void uprobe_buffer_disable(void)
 {
 	int cpu;
 
-	BUG_ON(!mutex_is_locked(&event_mutex));
+	lockdep_assert_held(&event_mutex);
 
 	if (--uprobe_buffer_refcnt == 0) {
 		for_each_possible_cpu(cpu)
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH 0/3] Fix out-of-tree build of some tools
From: Ian Rogers @ 2026-05-21 17:50 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Steven Rostedt, Tomas Glozar, linux-perf-users,
	linux-trace-kernel
In-Reply-To: <ag8X7gcDw6jpJsLq@decadent.org.uk>

On Thu, May 21, 2026 at 7:57 AM Ben Hutchings <benh@debian.org> wrote:
>
> perf and rtla currently don't fully support out-of-tree builds, as
> they may still create files in their source directory.  This series
> fixes all the instances of this problem that I have found.
>
> Ben.

Thanks for the cleanup! I ran Sashiko locally and it noted that we
should also fix the clean steps to remove these files (I'll copy the
full text below).
Reviewed-by: Ian Rogers <irogers@google.com>

```
commit cf212be834ce6b0df5cccdbc4ad977a917ff8e48
   Author: Ben Hutchings <benh@debian.org>

   perf tools: Put Python egg info in output directory

   Installing the Python extension currently creates an egg-info
   directory in the source tree, preventing a fully out-of-tree build.

   Add the necessary runes to the setup.py comamnd line to relocate the
   egg-info directory in an out-of-tree build.

   > diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
   > index fc92d6ceac5b..b9a9bdddb89a 100644
   > --- a/tools/perf/Makefile.perf
   > +++ b/tools/perf/Makefile.perf
   > @@ -905,7 +910,9 @@ install-bin: install-tools install-tests
   >  install: install-bin try-install-man
   >
   >  install-python_ext:
   > -       $(PYTHON_WORD) util/setup.py --quiet install
--root='/$(DESTDIR_SQ)'
   > +       $(PYTHON_WORD) util/setup.py --quiet \
   > +               $(if $(OUTPUT),egg_info --egg-base $(OUTPUT)) \
   > +               install --root='/$(DESTDIR_SQ)'

   This relocates the egg-info directory to the output directory, which
   helps keep the source tree clean. However, does the `python-clean`
   target need to be updated to clean up this directory?

   Currently, `python-clean` is defined as:

   python-clean := $(call QUIET_CLEAN, python) $(RM) -r
$(PYTHON_EXTBUILD) $(OUTPUT)python/perf*.so

   Should we also clean `$(OUTPUT)/perf.egg-info` (or `perf.egg-info` for
   in-tree builds) to ensure `make clean` removes all generated python
   metadata? For example:

   python-clean := $(call QUIET_CLEAN, python) \
                   $(RM) -r $(PYTHON_EXTBUILD) $(OUTPUT)python/perf*.so \
                   $(OUTPUT)perf.egg-info perf.egg-info

   commit dd754ea75966ec7ade41500afeb32d944f20a8f9
   Author: Ben Hutchings <benh@debian.org>

   perf tools: Put Python bytecode in output directory

   The PMU events are processed into C sources by Python scripts,
which
   normally results in writing bytecode for each module into the
source
   tree.  This prevents a fully out-of-tree build.

   To fix this, set $PYTHONPYCACHEPREFIX to relocate the bytecode cache
   directory in an out-of-tree build.

   > diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
   > index fc92d6ceac5b..b9a9bdddb89a 100644
   > --- a/tools/perf/Makefile.perf
   > +++ b/tools/perf/Makefile.perf
   > @@ -400,6 +400,11 @@ PYTHON_EXTBUILD_LIB :=
$(PYTHON_EXTBUILD)lib/
   >  PYTHON_EXTBUILD_TMP := $(PYTHON_EXTBUILD)tmp/
   >  export PYTHON_EXTBUILD_LIB PYTHON_EXTBUILD_TMP
   >
   > +# Put Python bytecode in output directory
   > +ifdef OUTPUT
   > +export PYTHONPYCACHEPREFIX := $(OUTPUT)/__pycache__
   > +endif
   > +
   >  python-clean := $(call QUIET_CLEAN, python) $(RM) -r
$(PYTHON_EXTBUILD) $(OUTPUT)python/perf*.so

   Similar to the egg-info relocation, does `python-clean` need to
clean up
   the relocated `__pycache__` directory under `$(OUTPUT)`?

   If `PYTHONPYCACHEPREFIX` is exported to `$(OUTPUT)/__pycache__`, running
   `make clean` currently leaves this directory behind. Should we add it to
   the `RM` list in `python-clean`? For example:

   python-clean := $(call QUIET_CLEAN, python) \
                   $(RM) -r $(PYTHON_EXTBUILD)
$(OUTPUT)python/perf*.so \
                   $(if $(OUTPUT),$(OUTPUT)/__pycache__)
```

Thanks,
Ian

> Ben Hutchings (3):
>   rtla: Fix output files in source tree
>   perf tools: Put Python egg info in output directory
>   perf tools: Put Python bytecode in output directory
>
>  tools/perf/Makefile.perf            |  9 ++++++++-
>  tools/tracing/rtla/Makefile         | 31 ++++++++++++++++++-----------
>  tools/tracing/rtla/tests/timerlat.t |  4 ++--
>  3 files changed, 29 insertions(+), 15 deletions(-)
>

^ permalink raw reply

* Re: [PATCH] Re: Re: [RFC PATCH v2 04/10] rv/da: add pre-allocated storage pool for per-object monitors
From: Wen Yang @ 2026-05-21 17:40 UTC (permalink / raw)
  To: Gabriele Monaco; +Cc: linux-kernel, linux-trace-kernel, rostedt
In-Reply-To: <5183dc18d63b617ab0f19290e8a790fa6898f372.camel@redhat.com>



On 5/19/26 19:14, Gabriele Monaco wrote:
> Hi Wen,
> 
> On Mon, 2026-05-18 at 01:13 +0800, Wen Yang wrote:
>>
>> Yes.  The ftracetest check_requires logic calls `command -v <binary>` to
>> satisfy `requires: <name>:program` directives.  Without the script's
>> directory in PATH those checks evaluate to exit_unsupported and test cases
>> are skipped rather than run.  The make path avoids this only because make
>> sets OUTDIR and the runner appends it to PATH internally.
>>
> 
> So you're overriding PATH so the selftest's binaries can be found from
> the test, right?
> 
> Wouldn't it be simpler to just put the absolute paths in the tests and
> don't touch PATH.
> 
> If the selftests are run via makefile, it ensures the required binaries
> are built and available, so there's no need to go through the `requires:
> <name>:program` infrastructure (that's more about what's installed on
> the system.
> 
> Or if you don't want anything hardcoded, you could pass the $OUTDIR from
> the environment and use that in scripts, whatever looks cleaner.
> 
> Does it make sense to you?
> 
>>
>> -- Patch 04: pre-allocated storage pool
>>
>>   > Since you're using spinlocks, isn't that going to sleep on PREEMPT_RT?
>>
>> User-mode uprobe handlers run with preempt_count == 0, fully preemptible on
>> both PREEMPT_RT and non-PREEMPT_RT.  The strongest evidence is in tlob
>> itself: tlob_start_task() takes a mutex and calls kmem_cache_zalloc(...,
>> GFP_KERNEL) from the uprobe entry handler; both are illegal in atomic
>> context and would trigger lockdep splats immediately.
>>
>> On PREEMPT_RT, spinlock_t becoming a sleeping lock in the uprobe handler
>> iscfine: both call sites (da_create_or_get_pool() from the handler and
>> da_pool_return_cb() from the rcuc kthread) are in sleepable task context.
>>
> 
> Yeah exactly, the uprobe is fine with anything (even the automatic
> `kmalloc_nolock`), but sure preallocation at least guarantees the slots
> are there.
> 
>>   > We can have a macro DA_MON_ALLOCATION_STRATEGY = {DA_ALLOC_AUTO,
>>   > DA_ALLOC_POOL, DA_ALLOC_MANUAL} where DA_MON_POOL also requires
>>   > DA_MON_POOL_SIZE to be defined (force that with an #error).
>>   >
>>   > Anyway, this way you probably wouldn't need to define a different init
>>   > function and let everything handled more transparently.
>>   >
>>   > Also you don't need to call da_create_or_get() explicitly,
>>   > da_handle_start_event() should do it for you.
>>
>> Agreed on all counts.  We plan to implement this in v3 as follows.
>> The three strategies would be a compile-time selection in da_monitor.h:
>>
>>     DA_ALLOC_AUTO   (default) - lock-free kmalloc_nolock on the hot path;
>>                                 unbounded capacity.
>>
>>     DA_ALLOC_POOL             - pre-allocated fixed-size pool;
>> DA_MON_POOL_SIZE
>>                                 required, enforced with #error if missing.
>>
>>     DA_ALLOC_MANUAL           - caller pre-inserts storage via
>>                                 da_create_empty_storage() before the first
>>                                 da_handle_start_event(); the framework only
>>                                 links the target field.
>>
>> da_monitor_init_prealloc() would be removed; da_monitor_init() would
>> select pool or kmalloc initialisation internally based on the strategy.
>>
>> da_handle_start_event() and da_handle_start_run_event() would both call
>> da_prepare_storage() at compile time:
>>
>>     DA_ALLOC_AUTO   -> da_create_storage()        (kmalloc_nolock)
>>     DA_ALLOC_POOL   -> da_create_or_get_pool()
>>     DA_ALLOC_MANUAL -> da_fill_empty_storage()    (link target into pre-
>>                        allocated slot; no allocation on the hot path)
>>
>> No explicit da_create_or_get() call would be needed in any monitor.
>>
>> da_create_or_get_kmalloc() would be removed: as you noted, a caller that
>> uses kmalloc_nolock does so because locking is forbidden; a GFP_KERNEL
>> fallback is equally forbidden if the lockless attempt fails, so the
>> function has no viable use case.
>>
>> tlob would define:
>>     #define DA_MON_ALLOCATION_STRATEGY DA_ALLOC_POOL
>>     #define DA_MON_POOL_SIZE            TLOB_MAX_MONITORED
>>
>> nomiss would define:
>>     #define DA_MON_ALLOCATION_STRATEGY DA_ALLOC_MANUAL
>>
>> and call da_create_empty_storage() from handle_sys_enter() (the
>> sched_setscheduler syscall path), which runs in safe task context;
>> da_fill_empty_storage() would then link the sched_dl_entity target on
>> the first da_handle_start_run_event() call in handle_sched_switch().
> 
> Yeah good point, there's no need to make it a special path even if we
> have the target ready, da_handle_start_run_event() can do it just fine.
> 
>>
>>
>> -- Patch 05: generic uprobe infrastructure
>>
>> Carried unchanged into v3 (as part of the 08-b split described below).
>>
>>
>> -- Patch 06: rvgen __init arrow reset
>>
>> Thanks, carried unchanged into v3.
>>
> 
> Well, if you don't need reset() on the __init arrow we can drop this,
> right? Also it doesn't seem fully wired with the rest and requires a
> separate event to do handle_monitor_start(), which can be only just
> another handler for tlob, nothing general.
> 
>>
>>   > Why don't you make it a separate event (e.g. "start_tlob") [...] then
>>   > you also wouldn't need to call reset() and start_timer() manually.
>>
>> Good suggestion.  We plan to use a dedicated start_tlob event instead,
>> with a self-loop in tlob.dot:
>>
>>     "running" -> "running" [ label = "start;reset(clk_elapsed)" ]
>>
>> da_handle_start_run_event(task->pid, ws, start_tlob) would put the
>> monitor into running and deliver start_tlob, which resets clk_elapsed
>> and arms the budget hrtimer via the generated ha_setup_invariants() —
>> no manual reset() or start_timer() calls needed.
>>
>> One guard would be added in tlob's ha_setup_invariants() to make the
>> self-loop work correctly:
>>
>>     if (next_state == curr_state && event != start_tlob)
>>         return;
>>
>> Without this, the start_tlob self-loop would be treated the same as any
>> repeated switch_in (already running) and ha_setup_invariants() would
>> return early, leaving the timer unarmed.  Does this look right to you?
>>
> 
> If you just add a separate event rvgen should take care of everything,
> you should be able to take ha_verify_constraint() and friends as-is from
> the generated code.
> 
> But yeah, that's what it would end up doing.
> 
>>
>> -- Patch 08: tlob monitor
>>
>> --- Patch structure ---
>>
>>   > Could you have everything that isn't strictly tlob-related in another
>>   > patch.
>>
>> Agreed.  With the ioctl interface deferred (see below), v3 would keep
>> patch 08 as the tlob monitor only:
>>
>>     05-b: rv: extend uprobe API with three-phase detach helpers
>>           (rv_uprobe.c, rv_uprobe.h, rv_uprobe_detach refactoring)
>>           — extension of patch 05, independent of tlob
>>
>>     08:   rv/tlob: add the tlob monitor itself
>>           (tlob.c, tlob.h, tlob_trace.h, Kconfig/Makefile, Documentation,
>>            rv_trace.h include; ha_monitor.h EVENT_NONE_LBL override
>>            bundled here as it is only needed by tlob)
>>
>> The chardev infrastructure (rv_chardev.c, rv.h additions) and the UAPI
>> header (include/uapi/linux/rv.h) would move to a follow-up series
>> together with the ioctl self-instrumentation feature.
>>
>> --- ioctl interface design ---
>>
>>   > I'm not particularly fond of ioctls, they aren't that flexible and in
>>   > this way I don't really see an added value.
>>   > [...] cannot the same thing be achieved using uprobes alone, e.g. by
>>   > registering a function address or the current instruction pointer?
>>   > [...] wouldn't a sysfs/tracefs file achieve a similar purpose without
>>   > much of the boilerplate code?
>>
>> Fair point.  We plan to ship v3 with the tracefs/uprobe interface only
>> and defer the ioctl (/dev/rv chardev) to a follow-up series once there
>> is a concrete in-tree user that requires it.
>>
>> The unique value of the ioctl is that TLOB_IOCTL_TRACE_STOP returns a
>> synchronous per-call result (-EOVERFLOW or 0) to the calling thread,
>> which neither uprobes nor tracefs writes can provide.  We want to keep
>> that option open for later, but agree it should not block the initial
>> tlob submission.
>>
>> Does this approach work for you?
>>
>> What is your preference?
> 
> Yeah looks good to me.
> Ioctls are cumbersome to set up also for the user, perhaps another sysfs
> file in the monitor directory would keep the control entirely in tlob.c
> and give you roughly the same value with easier setup.
> 
> Heck we might even think of an RV reactor that does that: e.g. creates a
> file where reads sleep until the first reaction (-EOVERFLOW) and returns
> 0 in other scenarios. I'm gonna have a thought on that, but anyway I
> don't see why a sysfs file cannot do this.
> 
> Let's defer it for now.
> 
>>
>> --- Handler simplification ---
>>
>>   > Perhaps keep the handler simpler by moving this reporting to a helper
>>   > function and use guard(rcu)() there.
>>
>> Done.  The accumulation logic is extracted into three inline helpers, each
>> using scoped_guard(rcu) and returning bool (true if the task is monitored):
>>
>>     tlob_acc_running(prev)   - accumulate running_ns on sched-out
>>     tlob_acc_waiting(next)   - accumulate waiting_ns on sched-in
>>     tlob_acc_sleeping(task)  - accumulate sleeping_ns on wakeup
>>
>> handle_sched_switch() and handle_sched_wakeup() become one-liners:
>>
>>     static void handle_sched_switch(...)
>>     {
>>         bool prev_preempted = (prev_state == 0);
>>
>>         if (tlob_acc_running(prev))
>>             da_handle_event(prev->pid, NULL,
>>                             prev_preempted ? preempt_tlob : sleep_tlob);
>>         if (tlob_acc_waiting(next))
>>             da_handle_event(next->pid, NULL, switch_in_tlob);
>>     }
> 
> Yeah sounds good.
> 
>>
>>   > You probably don't need these. da_handle_event should skip tasks without
>>   > a monitor.
>>
>> Agreed; the do_prev/do_next flags are gone.  The helpers return false
>> for unmonitored tasks, and da_handle_event() skips them too — both paths
>> are no-ops for tasks with no pool entry.
>>
>> --- scoped_guard(rcu) ---
>>
>>   > That should be a scoped_guard(rcu), definitely use guards if you have
>>   > return paths, the compiler is going to clean up (unlock) for you.
>>
>> Applied to all RCU-protected sections in tlob_start_task() and
>> tlob_stop_task().  tlob_start_task() now uses guard(mutex) for the
>> serialised duplicate-check (replacing the explicit mutex_lock/unlock),
>> and tlob_stop_task() uses scoped_guard(rcu) for the atomic CAS section:
>>
>>     scoped_guard(rcu) {
>>         ws = da_get_target_by_id(task->pid);
>>         if (!ws)
>>             return -ESRCH;
>>         ...
>>         if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0)
>>             return -EAGAIN;
>>     }
> 
> Perfect.
> 
>>
>> --- tlob_stop_all removal ---
>>
>>   > All this function does should be done by da_monitor_destroy. We could
>>   > add a way to pass some additional deallocation for all the other cleanup
>>   > you're doing on each storage.  Something like a da_extra_cleanup() you
>>   > can define as whatever you need and gets called in all per-obj
>>   > destruction paths.
>>
>> Agreed.  tlob_stop_all() (~50 lines) has been removed entirely.
>>
>> A da_extra_cleanup() hook macro is introduced in da_monitor.h: the default
>> is a no-op; a monitor may override it before including the header.  tlob
>> defines:
>>
>>     static inline void tlob_extra_cleanup(struct da_monitor *da_mon)
>>     {
>>         struct ha_monitor *ha_mon = to_ha_monitor(da_mon);
>>         struct tlob_task_state *ws = da_get_target(ha_mon);
>>
>>         if (!ws)
>>             return;
>>         if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0)
>>             return;
> 
>>         ha_cancel_timer_sync(ha_mon);
> After my patch making timer callbacks RCU read-side critical section,
> you won't need that, just let the usual reset asynchronously stop the
> timer and put everything that needs it stopped in your RCU callback.
> 
> Of course make sure the timer was stopped before this extra cleanup, so
> put the macro accordingly.
> 
> I don't think da_extra_cleanup in general should be expected to sleep
> and call_rcu should do the heavy lifting (it may run from any tracepoint).
> 
> Anyway we can see it later after that's merged.
> 
>>         atomic_dec(&tlob_num_monitored);
>>         put_task_struct(ws->task);
>>         call_rcu(&ws->rcu, tlob_free_rcu);
>>     }
>>     #define da_extra_cleanup tlob_extra_cleanup
>>
>> da_monitor_destroy() iterates remaining entries via da_extra_cleanup +
>> hash_del_rcu + call_rcu, then waits for all callbacks via rcu_barrier().
>> tlob's disable path is now simply:
>>
>>     static void __tlob_destroy_monitor(void)
>>     {
>>         da_monitor_destroy();
>>     }
> 
> Looks good, let's see the full picture.
> 
>> --- EVENT_NONE_LBL ---
>>
>>   > Why don't you just override EVENT_NONE_LBL (and if you prefer call it
>>   > MONITOR_TIMER_EVENT_NAME) without the need for another function?
>>
>> Done.  model_get_timer_event_name() has been removed from automata.h.
>> In ha_monitor.h, EVENT_NONE_LBL is now overridable:
>>
>>     #ifndef EVENT_NONE_LBL
>>     #define EVENT_NONE_LBL "none"
>>     #endif
>>
>> tlob.c defines it before including the model header:
>>
>>     #define EVENT_NONE_LBL "budget_exceeded"
>>
>> The two call sites in ha_monitor.h that previously called
>> model_get_timer_event_name() now use EVENT_NONE_LBL directly.
>>
>> --- KUnit config / tristate ---
>>
>>   > Do you need to add this here? Since you have a patch adding KUnit tests
>>   > to tlob, cannot you put everything kunit-related there?
>>   > I couldn't build it as module.
>>
>> Agreed on moving the Kconfig entry to patch 09.
>>
>> The module build issue is fixed by exporting the symbols needed by the
>> test via EXPORT_SYMBOL_IF_KUNIT (EXPORTED_FOR_KUNIT_TESTING namespace);
>> tlob_kunit.c imports the namespace with MODULE_IMPORT_NS.  We plan to
>> keep tristate rather than changing to bool.  Does that work for you?
> 
> Yeah it's good as long as it works as module too.
> 
> I might have a look at making my patch module-ready, for now it just
> can't work but I wonder if we can do something nicer to allow it
> (like in your case a bunch of exports, a separate file and a standalone
> testcase, perhaps all wrapped in some helper).
> 
>>
>> --- detail_env_tlob tracepoint ---
>>
>>   > Since you are not documenting the detail_env_tlob tracepoint, is it
>>   > something really required? I would at the very least document its usage.
>>
>> Fair point.  detail_env_tlob emits (running_ns, waiting_ns, sleeping_ns)
>> so the user can see which phase consumed the budget: high sleeping_ns
>> indicates I/O latency, high waiting_ns indicates scheduler pressure, high
>> running_ns indicates a compute overrun.  Without this breakdown the user
>> only knows the total elapsed time exceeded the threshold, not why.
>>
> 
> Alright, then this can go into the docs.
> 
>>
>> --- Documentation ---
>>
>>   > This is standard tracepoints usage, there's nothing about tlob we should
>>   > document here.
>>   > Same here, standard RV [for enable/desc tracefs files].
>>   > And this is duplicating what mentioned above about uprobes, isn't it?
>>
>> Agreed.  The following have been removed:
>>
>>     - "Violation events" section: generic trace-cmd examples and cat-trace
>>       instructions (standard tracepoints usage).
>>     - tracefs files: "enable (rw)" and "desc (ro)" entries (standard RV).
>>     - tracefs files: "monitor (rw)" description condensed to one line with
>>       a cross-reference to the uprobes section above.
>>
>> In their place, a new "Violation tracepoints" subsection documents both
>> tlob-specific tracepoints with fields and a worked example:
>>
>>     error_env_tlob: id, state, event ("budget_exceeded"), env ("clk_elapsed")
>>
>>     detail_env_tlob: id, threshold_us, running_ns, waiting_ns, sleeping_ns
>>       Use sleeping_ns to diagnose I/O latency, waiting_ns for scheduler
>>       pressure, running_ns for compute overruns.
>>
>>     Example:
>>       trace-cmd record -e error_env_tlob -e detail_env_tlob &
>>       # ... run workload ...
>>       trace-cmd report
> 
> Yeah sounds good, also pointing out to enable the monitor.
> We might think of a general way to do this kind of thing in
> tools/rv, although detail_env_tlob is non-standard.
> 
>>   > Is kernel code going to use this API? RV monitors are meant to be
>>   > enabled by userspace. What's the use-case here?
>>
>> Agreed.  The uprobe interface is driven from userspace; tlob_start_task()
>> and tlob_stop_task() are the internal implementation functions it calls,
>> not a public API for external kernel modules.  The hypothetical
>> kernel-module use case would be removed from the documentation; the
>> kernel-doc block is retained for code maintainers.
>>
>>   > That's probably a bit too detailed for this page. If you really want
>>   > this information somewhere couldn't it stay in the code?
>>
>> Agreed; moved to comments in handle_sched_switch() and
>> handle_sched_wakeup().  The "Limitations" subsection is retained.
>>
>> -- Patch 09: KUnit tests
>>
>>   > What caught my eyes are tests enrolling tracepoints handlers. If you
>>   > go there you're no longer doing unit testing, what's the advantage of
>>   > testing the entire monitor here over doing that in selftests?
>>
>> Agreed.  The three suites that register tracepoint handlers or create
>> kthreads (tlob_sched_integration, tlob_trace_output, tlob_violation_react)
>> have been removed from KUnit and will be added to selftests in v3.
>>
>> Two pure unit test suites remain in KUnit:
>>
>>     tlob_task_api:
>>       Tests tlob_start_task / tlob_stop_task return values (-ENODEV,
>>       -EALREADY, -ESRCH, -EOVERFLOW, -ENOSPC, -ERANGE) via direct calls
>>       (these functions are the internal implementation used by both the
>>       uprobe and, in future, the ioctl interface).
>>       No tracepoints, no scheduling.
>>
>>     tlob_uprobe_format:
>>       Tests the uprobe line parser (tlob_parse_uprobe_line,
>>       tlob_parse_remove_line) against valid and invalid input strings.
>>       Pure string parsing; no scheduling, no tracepoints.
>>
>> This also resolves the tristate-vs-bool issue: with only pure unit tests
>> there is no dependency on sched_setscheduler_nocheck, so bool is correct.
>>
> 
> Yeah looks good.
> 
>>
>> --  Patch 10: selftests
>>
>> --- PREEMPT_RT RCU stall ---
>>
>>   > I run it on a VM and have it hanging at step 9 [...] rcu_preempt stall.
>>   > Did you see that? Am I doing something wrong?
>>
>> Thanks for reporting.  The patch changed ha_monitor.h from
>> HRTIMER_MODE_REL_HARD (the existing upstream value) to REL_SOFT; the
>> stall appeared on PREEMPT_RT after that change.  We have not fully
>> confirmed whether REL_SOFT is the root cause — REL_SOFT defers the
>> callback to the ktimers kthread, which could starve rcu_preempt under
>> certain PREEMPT_RT configurations, but other factors may be involved.
>>
>> We plan to revert to HRTIMER_MODE_REL_HARD at both sites in ha_monitor.h
>> as the conservative choice:
>>
>>     ha_setup_timer():     HRTIMER_MODE_REL_SOFT -> HRTIMER_MODE_REL_HARD
>>     ha_start_timer_ns():  HRTIMER_MODE_REL_SOFT -> HRTIMER_MODE_REL_HARD
>>
>> Do you have more insight into the stall, or does REL_HARD resolve it on
>> your setup?
> 
> Right, good point, any specific reason why you wanted REL_SOFT?
> 
> I indeed always test under PREEMPT_RT but I still see the same splat
> also after reverting REL_HARD..
> Could you reproduce it on your setup?
> 
> My config is nothing special: what vng gives you adding
> PREEMPT_RT/RCU_PREEMPT and lockdep (PROVE_LOCKING/PROVE_RCU).
> 

Hi Gabriele, 

 

No specific reason for REL_SOFT — not intentional, reverting to 
REL_HARD.
 

Reproduced the stall on the same config (PREEMPT_RT + 
PROVE_LOCKING/PROVE_RCU). 

Root cause is a cleanup ordering bug in uprobe_detail_waiting.tc, 
unrelated to REL_SOFT/REL_HARD:
 

   # original cleanup — wrong order 

   echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR"  # (A) 

   kill "$hog_pid"                                              # (B) 

 
                                                              (A) 
triggers synchronize_srcu() in the kernel. But tlob_target is stuck 
mid-uprobe_notify_resume holding an SRCU read lock, preempted by the 
FIFO-99 hog -> so the reader never finishes and (B) is never reached. 
rcuc/0 (a kthread on PREEMPT_RT) is also starved by the hog -> RCU stall.
 

   Fix: kill the hog first: 
 

   kill "$hog_pid"; wait "$hog_pid" 

   echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR" 


On the PREEMPT_RT it is more reliably triggered there because rcuc/0
runs as a preemptible kthread rather than in softirq, making it easier
for the hog to monopolise the CPU long enough to hit the stall.

Thank you for the thorough review and valuable suggestions. We are 
working through all of them and running the full test suite.
We expect to post v3 within the next two days.

--
Best wishes,
Wen


>>
>> --- Selftest structure ---
>>
>>   > This should be tested together with the other monitors (enable/disable),
>>   > we could at most expand those with the check_requires.
>>   > Let's focus on tlob-only features in this patch.
>>
>> Agreed.  In v3 we plan to drop tracefs.tc (covered by the generic
>> rv_monitor_enable_disable.tc) and keep only the six uprobe-specific
>> test cases under test.d/tlob/
>>
>> ioctl.tc is deferred with the ioctl interface to the follow-up series.
>> The KUnit integration tests (sched_switch accounting, budget-expiry
>> tracepoint) would be moved to selftests as additional test cases.
>>
> 
> Thanks,
> Gabriele
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox