* [PATCH v3 00/19] unwind, perf: sframe user space unwinding
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
This has all the changes discussed in v2, plus VDSO sframe support and
Namhyung's perf tool patches (see detailed changelog below).
I did quite a bit of testing and it seems to work well. It still needs
some binutils and glibc patches, which I'll send in a reply.
Questions for perf experts:
- Is the perf_event lifetime managed correctly, or do we need to do
something to ensure it exists in unwind_user_task_work()?
Alternatively, is the original perf_event even needed in
unwind_user_task_work(), or can a new one be created on demand?
- Is --call-graph=sframe needed for consistency?
- Should perf use the context cookie? Note that because the callback
is usually only called once for multiple NMIs in the same entry
context, it's possible for the PERF_RECORD_CALLCHAIN_DEFERRED event
to arrive *before* some of the corresponding kernel events. The
context cookie disambiguates the corner cases.
Based on tip/master.
Also at:
git://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git sframe-v3
v3:
- move the "deferred" logic out of perf and into unwind_user with new
unwind_user_deferred() interface [Steven, Mathieu]
- add more sframe sanity checks [Steven]
- make frame pointers optional depending on arch [Jens]
- fix perf event output [Namhyung]
- include Namhyung's perf tool patches
- enable sframe generation in VDSO
- fix build errors [robot]
v2: https://lore.kernel.org/cover.1726268190.git.jpoimboe@kernel.org
- rebase on v6.11-rc7
- reorganize the patches to add sframe first
- change to sframe v2
- add new perf event type: PERF_RECORD_CALLCHAIN_DEFERRED
- add new perf attribute: defer_callchain
v1: https://lore.kernel.org/cover.1699487758.git.jpoimboe@kernel.org
Some distros have started compiling frame pointers into all their
packages to enable the kernel to do system-wide profiling of user space.
Unfortunately that creates a runtime performance penalty across the
entire system. Using DWARF (or .eh_frame) instead isn't feasible
because of complexity and slowness.
For in-kernel unwinding we solved this problem with the creation of the
ORC unwinder for x86_64. Similarly, for user space, the GNU assembler
generates the SFrame ("Simple Frame") format, with v2 supported starting
with binutils 2.41.
These patches add support for unwinding user space from the kernel using
SFrame with perf. It should be easy to add user unwinding support for
other components like ftrace.
There were two main challenges:
1) Finding .sframe sections in shared/dlopened libraries
The kernel has no visibility into the contents of shared libraries.
This was solved by adding a PR_ADD_SFRAME option to prctl() which
allows the runtime linker to manually provide the in-memory address
of an .sframe section to the kernel (see the sketch after this list).
2) Dealing with page faults
Keeping all binaries' sframe data pinned would likely waste a lot of
memory. Instead, read it from user space on demand. That can't be
done from perf NMI context due to page faults, so defer the unwind to
the next user exit. Since the NMI handler doesn't do exit work,
self-IPI and then schedule task work to be run on exit from the IPI.
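For illustration, here's a minimal sketch of the registration calls a
runtime linker could make, using the prctl() values added by this
series (the text range is optional; passing 0/0 lets the kernel locate
the file's single executable segment):

	#include <sys/prctl.h>

	/* Values from this series; not yet in installed uapi headers. */
	#define PR_ADD_SFRAME		74
	#define PR_REMOVE_SFRAME	75

	/* Register a loaded object's .sframe section with the kernel. */
	static int sframe_register(unsigned long sframe_addr,
				   unsigned long text_start,
				   unsigned long text_end)
	{
		return prctl(PR_ADD_SFRAME, sframe_addr, text_start,
			     text_end, 0);
	}

	/* Unregister it again, e.g. on dlclose(). */
	static int sframe_unregister(unsigned long sframe_addr)
	{
		return prctl(PR_REMOVE_SFRAME, sframe_addr, 0, 0, 0);
	}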
Special thanks to Indu for the original concept, and to Steven and Peter
for helping a lot with the design. And to Steven for letting me do it ;-)
Josh Poimboeuf (15):
x86/vdso: Fix DWARF generation for getrandom()
x86/asm: Avoid emitting DWARF CFI for non-VDSO
x86/asm: Fix VDSO DWARF generation with kernel IBT enabled
x86/vdso: Use SYM_FUNC_{START,END} in __kernel_vsyscall()
x86/vdso: Use CFI macros in __vdso_sgx_enter_enclave()
x86/vdso: Enable sframe generation in VDSO
unwind: Add user space unwinding API
unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_FP
unwind: Introduce sframe user space unwinding
unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_SFRAME
unwind: Add deferred user space unwinding API
perf: Remove get_perf_callchain() 'init_nr' argument
perf: Remove get_perf_callchain() 'crosstask' argument
perf: Simplify get_perf_callchain() user logic
perf: Add deferred user callchains
Namhyung Kim (4):
perf tools: Minimal CALLCHAIN_DEFERRED support
perf record: Enable defer_callchain for user callchains
perf script: Display PERF_RECORD_CALLCHAIN_DEFERRED
perf tools: Merge deferred user callchains
arch/Kconfig | 14 +
arch/x86/Kconfig | 2 +
arch/x86/entry/vdso/Makefile | 6 +-
arch/x86/entry/vdso/vdso-layout.lds.S | 5 +-
arch/x86/entry/vdso/vdso32/system_call.S | 10 +-
arch/x86/entry/vdso/vgetrandom-chacha.S | 3 +-
arch/x86/entry/vdso/vsgx.S | 19 +-
arch/x86/include/asm/dwarf2.h | 40 ++-
arch/x86/include/asm/linkage.h | 29 +-
arch/x86/include/asm/mmu.h | 2 +-
arch/x86/include/asm/unwind_user.h | 11 +
arch/x86/include/asm/vdso.h | 1 -
fs/binfmt_elf.c | 35 +-
include/linux/entry-common.h | 3 +
include/linux/mm_types.h | 3 +
include/linux/perf_event.h | 12 +-
include/linux/sched.h | 5 +
include/linux/sframe.h | 41 +++
include/linux/unwind_user.h | 99 ++++++
include/uapi/linux/elf.h | 1 +
include/uapi/linux/perf_event.h | 22 +-
include/uapi/linux/prctl.h | 3 +
kernel/Makefile | 1 +
kernel/bpf/stackmap.c | 14 +-
kernel/events/callchain.c | 47 +--
kernel/events/core.c | 70 +++-
kernel/fork.c | 14 +
kernel/sys.c | 11 +
kernel/unwind/Makefile | 2 +
kernel/unwind/sframe.c | 380 ++++++++++++++++++++++
kernel/unwind/sframe.h | 215 ++++++++++++
kernel/unwind/user.c | 318 ++++++++++++++++++
mm/init-mm.c | 6 +
tools/include/uapi/linux/perf_event.h | 22 +-
tools/lib/perf/include/perf/event.h | 7 +
tools/perf/Documentation/perf-script.txt | 5 +
tools/perf/builtin-script.c | 92 ++++++
tools/perf/util/callchain.c | 24 ++
tools/perf/util/callchain.h | 3 +
tools/perf/util/event.c | 1 +
tools/perf/util/evlist.c | 1 +
tools/perf/util/evlist.h | 1 +
tools/perf/util/evsel.c | 32 +-
tools/perf/util/evsel.h | 1 +
tools/perf/util/machine.c | 1 +
tools/perf/util/perf_event_attr_fprintf.c | 1 +
tools/perf/util/sample.h | 3 +-
tools/perf/util/session.c | 78 +++++
tools/perf/util/tool.c | 2 +
tools/perf/util/tool.h | 4 +-
50 files changed, 1634 insertions(+), 88 deletions(-)
create mode 100644 arch/x86/include/asm/unwind_user.h
create mode 100644 include/linux/sframe.h
create mode 100644 include/linux/unwind_user.h
create mode 100644 kernel/unwind/Makefile
create mode 100644 kernel/unwind/sframe.c
create mode 100644 kernel/unwind/sframe.h
create mode 100644 kernel/unwind/user.c
--
2.47.0
* [PATCH v3 01/19] x86/vdso: Fix DWARF generation for getrandom()
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC
Enable DWARF generation for the VDSO implementation of getrandom().
Fixes: 33385150ac45 ("x86: vdso: Wire up getrandom() vDSO implementation")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/entry/vdso/vgetrandom-chacha.S | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/x86/entry/vdso/vgetrandom-chacha.S b/arch/x86/entry/vdso/vgetrandom-chacha.S
index bcba5639b8ee..cc82da9216fb 100644
--- a/arch/x86/entry/vdso/vgetrandom-chacha.S
+++ b/arch/x86/entry/vdso/vgetrandom-chacha.S
@@ -4,7 +4,7 @@
*/
#include <linux/linkage.h>
-#include <asm/frame.h>
+#include <asm/dwarf2.h>
.section .rodata, "a"
.align 16
@@ -22,7 +22,7 @@ CONSTANTS: .octa 0x6b20657479622d323320646e61707865
* rcx: number of 64-byte blocks to write to output
*/
SYM_FUNC_START(__arch_chacha20_blocks_nostack)
-
+ CFI_STARTPROC
.set output, %rdi
.set key, %rsi
.set counter, %rdx
@@ -175,4 +175,5 @@ SYM_FUNC_START(__arch_chacha20_blocks_nostack)
pxor temp,temp
ret
+ CFI_ENDPROC
SYM_FUNC_END(__arch_chacha20_blocks_nostack)
--
2.47.0
* [PATCH v3 02/19] x86/asm: Avoid emitting DWARF CFI for non-VDSO
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC
VDSO is the only part of the "kernel" using DWARF CFI directives. For
the kernel proper, ensure the CFI_* macros don't do anything.
These macros aren't yet being used outside of VDSO, so there's no
functional change.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/include/asm/dwarf2.h | 37 +++++++++++++++++++++++++++--------
1 file changed, 29 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/dwarf2.h b/arch/x86/include/asm/dwarf2.h
index 430fca13bb56..b1aa3fcd5bca 100644
--- a/arch/x86/include/asm/dwarf2.h
+++ b/arch/x86/include/asm/dwarf2.h
@@ -6,6 +6,15 @@
#warning "asm/dwarf2.h should be only included in pure assembly files"
#endif
+#ifdef BUILD_VDSO
+
+ /*
+ * For the vDSO, emit both runtime unwind information and debug
+ * symbols for the .dbg file.
+ */
+
+ .cfi_sections .eh_frame, .debug_frame
+
#define CFI_STARTPROC .cfi_startproc
#define CFI_ENDPROC .cfi_endproc
#define CFI_DEF_CFA .cfi_def_cfa
@@ -21,7 +30,8 @@
#define CFI_UNDEFINED .cfi_undefined
#define CFI_ESCAPE .cfi_escape
-#ifndef BUILD_VDSO
+#else /* !BUILD_VDSO */
+
/*
* Emit CFI data in .debug_frame sections, not .eh_frame sections.
* The latter we currently just discard since we don't do DWARF
@@ -29,13 +39,24 @@
* useful to anyone. Note we should not use this directive if we
* ever decide to enable DWARF unwinding at runtime.
*/
+
.cfi_sections .debug_frame
-#else
- /*
- * For the vDSO, emit both runtime unwind information and debug
- * symbols for the .dbg file.
- */
- .cfi_sections .eh_frame, .debug_frame
-#endif
+
+#define CFI_STARTPROC
+#define CFI_ENDPROC
+#define CFI_DEF_CFA
+#define CFI_DEF_CFA_REGISTER
+#define CFI_DEF_CFA_OFFSET
+#define CFI_ADJUST_CFA_OFFSET
+#define CFI_OFFSET
+#define CFI_REL_OFFSET
+#define CFI_REGISTER
+#define CFI_RESTORE
+#define CFI_REMEMBER_STATE
+#define CFI_RESTORE_STATE
+#define CFI_UNDEFINED
+#define CFI_ESCAPE
+
+#endif /* !BUILD_VDSO */
#endif /* _ASM_X86_DWARF2_H */
--
2.47.0
* [PATCH v3 03/19] x86/asm: Fix VDSO DWARF generation with kernel IBT enabled
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC
The DWARF .cfi_startproc annotation needs to be at the very beginning of
a function. But with kernel IBT that doesn't happen as ENDBR is
sneakily embedded in SYM_FUNC_START. So the DWARF unwinding info is
wrong at the beginning of the VDSO functions.
Fix it by adding CFI_STARTPROC and CFI_ENDPROC to SYM_FUNC_START_* and
SYM_FUNC_END respectively. Note this only affects VDSO, as the CFI_*
macros are empty for the kernel proper.
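For illustration, the resulting directive order in a VDSO function with
IBT enabled (a simplified sketch, not the literal macro expansion):

	Before (unwind info starts after the ENDBR):

		func:
			endbr64
			.cfi_startproc
			...
			.cfi_endproc

	After (unwind info covers the whole function):

		func:
			.cfi_startproc
			endbr64
			...
			.cfi_endproc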
Fixes: c4691712b546 ("x86/linkage: Add ENDBR to SYM_FUNC_START*()")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/entry/vdso/vdso-layout.lds.S | 2 +-
arch/x86/entry/vdso/vgetrandom-chacha.S | 2 --
arch/x86/entry/vdso/vsgx.S | 4 ----
arch/x86/include/asm/linkage.h | 29 +++++++++++++++++++------
arch/x86/include/asm/vdso.h | 1 -
5 files changed, 23 insertions(+), 15 deletions(-)
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S
index bafa73f09e92..a42c7d4a33da 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -1,5 +1,5 @@
/* SPDX-License-Identifier: GPL-2.0 */
-#include <asm/vdso.h>
+#include <asm/page_types.h>
/*
* Linker script for vDSO. This is an ELF shared object prelinked to
diff --git a/arch/x86/entry/vdso/vgetrandom-chacha.S b/arch/x86/entry/vdso/vgetrandom-chacha.S
index cc82da9216fb..a33212594731 100644
--- a/arch/x86/entry/vdso/vgetrandom-chacha.S
+++ b/arch/x86/entry/vdso/vgetrandom-chacha.S
@@ -22,7 +22,6 @@ CONSTANTS: .octa 0x6b20657479622d323320646e61707865
* rcx: number of 64-byte blocks to write to output
*/
SYM_FUNC_START(__arch_chacha20_blocks_nostack)
- CFI_STARTPROC
.set output, %rdi
.set key, %rsi
.set counter, %rdx
@@ -175,5 +174,4 @@ SYM_FUNC_START(__arch_chacha20_blocks_nostack)
pxor temp,temp
ret
- CFI_ENDPROC
SYM_FUNC_END(__arch_chacha20_blocks_nostack)
diff --git a/arch/x86/entry/vdso/vsgx.S b/arch/x86/entry/vdso/vsgx.S
index 37a3d4c02366..c0342238c976 100644
--- a/arch/x86/entry/vdso/vsgx.S
+++ b/arch/x86/entry/vdso/vsgx.S
@@ -24,8 +24,6 @@
.section .text, "ax"
SYM_FUNC_START(__vdso_sgx_enter_enclave)
- /* Prolog */
- .cfi_startproc
push %rbp
.cfi_adjust_cfa_offset 8
.cfi_rel_offset %rbp, 0
@@ -143,8 +141,6 @@ SYM_FUNC_START(__vdso_sgx_enter_enclave)
jle .Lout
jmp .Lenter_enclave
- .cfi_endproc
-
_ASM_VDSO_EXTABLE_HANDLE(.Lenclu_eenter_eresume, .Lhandle_exception)
SYM_FUNC_END(__vdso_sgx_enter_enclave)
diff --git a/arch/x86/include/asm/linkage.h b/arch/x86/include/asm/linkage.h
index dc31b13b87a0..2866d57ef907 100644
--- a/arch/x86/include/asm/linkage.h
+++ b/arch/x86/include/asm/linkage.h
@@ -40,6 +40,10 @@
#ifdef __ASSEMBLY__
+#ifndef LINKER_SCRIPT
+#include <asm/dwarf2.h>
+#endif
+
#if defined(CONFIG_MITIGATION_RETHUNK) && !defined(__DISABLE_EXPORTS) && !defined(BUILD_VDSO)
#define RET jmp __x86_return_thunk
#else /* CONFIG_MITIGATION_RETPOLINE */
@@ -112,40 +116,51 @@
# define SYM_FUNC_ALIAS_MEMFUNC SYM_FUNC_ALIAS
#endif
+#define __SYM_FUNC_START \
+ CFI_STARTPROC ASM_NL \
+ ENDBR
+
+#define __SYM_FUNC_END \
+ CFI_ENDPROC ASM_NL
+
/* SYM_TYPED_FUNC_START -- use for indirectly called globals, w/ CFI type */
#define SYM_TYPED_FUNC_START(name) \
SYM_TYPED_START(name, SYM_L_GLOBAL, SYM_F_ALIGN) \
- ENDBR
+ __SYM_FUNC_START
/* SYM_FUNC_START -- use for global functions */
#define SYM_FUNC_START(name) \
SYM_START(name, SYM_L_GLOBAL, SYM_F_ALIGN) \
- ENDBR
+ __SYM_FUNC_START
/* SYM_FUNC_START_NOALIGN -- use for global functions, w/o alignment */
#define SYM_FUNC_START_NOALIGN(name) \
SYM_START(name, SYM_L_GLOBAL, SYM_A_NONE) \
- ENDBR
+ __SYM_FUNC_START
/* SYM_FUNC_START_LOCAL -- use for local functions */
#define SYM_FUNC_START_LOCAL(name) \
SYM_START(name, SYM_L_LOCAL, SYM_F_ALIGN) \
- ENDBR
+ __SYM_FUNC_START
/* SYM_FUNC_START_LOCAL_NOALIGN -- use for local functions, w/o alignment */
#define SYM_FUNC_START_LOCAL_NOALIGN(name) \
SYM_START(name, SYM_L_LOCAL, SYM_A_NONE) \
- ENDBR
+ __SYM_FUNC_START
/* SYM_FUNC_START_WEAK -- use for weak functions */
#define SYM_FUNC_START_WEAK(name) \
SYM_START(name, SYM_L_WEAK, SYM_F_ALIGN) \
- ENDBR
+ __SYM_FUNC_START
/* SYM_FUNC_START_WEAK_NOALIGN -- use for weak functions, w/o alignment */
#define SYM_FUNC_START_WEAK_NOALIGN(name) \
SYM_START(name, SYM_L_WEAK, SYM_A_NONE) \
- ENDBR
+ __SYM_FUNC_START
+
+#define SYM_FUNC_END(name) \
+ __SYM_FUNC_END \
+ SYM_END(name, SYM_T_FUNC)
#endif /* _ASM_X86_LINKAGE_H */
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index d7f6592b74a9..0111c349bbc5 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -2,7 +2,6 @@
#ifndef _ASM_X86_VDSO_H
#define _ASM_X86_VDSO_H
-#include <asm/page_types.h>
#include <linux/linkage.h>
#include <linux/init.h>
--
2.47.0
* [PATCH v3 04/19] x86/vdso: Use SYM_FUNC_{START,END} in __kernel_vsyscall()
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC
Use SYM_FUNC_{START,END} instead of all the boilerplate. No functional
change.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/entry/vdso/vdso32/system_call.S | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/arch/x86/entry/vdso/vdso32/system_call.S b/arch/x86/entry/vdso/vdso32/system_call.S
index d33c6513fd2c..bdc576548240 100644
--- a/arch/x86/entry/vdso/vdso32/system_call.S
+++ b/arch/x86/entry/vdso/vdso32/system_call.S
@@ -9,11 +9,7 @@
#include <asm/alternative.h>
.text
- .globl __kernel_vsyscall
- .type __kernel_vsyscall,@function
- ALIGN
-__kernel_vsyscall:
- CFI_STARTPROC
+SYM_FUNC_START(__kernel_vsyscall)
/*
* Reshuffle regs so that all of any of the entry instructions
* will preserve enough state.
@@ -79,7 +75,5 @@ SYM_INNER_LABEL(int80_landing_pad, SYM_L_GLOBAL)
CFI_RESTORE ecx
CFI_ADJUST_CFA_OFFSET -4
RET
- CFI_ENDPROC
-
- .size __kernel_vsyscall,.-__kernel_vsyscall
+SYM_FUNC_END(__kernel_vsyscall)
.previous
--
2.47.0
* [PATCH v3 05/19] x86/vdso: Use CFI macros in __vdso_sgx_enter_enclave()
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC
Use the CFI macros instead of the raw .cfi_* directives to be consistent
with the rest of the VDSO asm. It's also easier on the eyes.
No functional changes.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/entry/vdso/vsgx.S | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/arch/x86/entry/vdso/vsgx.S b/arch/x86/entry/vdso/vsgx.S
index c0342238c976..8d7b8eb45c50 100644
--- a/arch/x86/entry/vdso/vsgx.S
+++ b/arch/x86/entry/vdso/vsgx.S
@@ -24,13 +24,14 @@
.section .text, "ax"
SYM_FUNC_START(__vdso_sgx_enter_enclave)
+ SYM_F_ALIGN
push %rbp
- .cfi_adjust_cfa_offset 8
- .cfi_rel_offset %rbp, 0
+ CFI_ADJUST_CFA_OFFSET 8
+ CFI_REL_OFFSET %rbp, 0
mov %rsp, %rbp
- .cfi_def_cfa_register %rbp
+ CFI_DEF_CFA_REGISTER %rbp
push %rbx
- .cfi_rel_offset %rbx, -8
+ CFI_REL_OFFSET %rbx, -8
mov %ecx, %eax
.Lenter_enclave:
@@ -77,13 +78,11 @@ SYM_FUNC_START(__vdso_sgx_enter_enclave)
.Lout:
pop %rbx
leave
- .cfi_def_cfa %rsp, 8
+ CFI_DEF_CFA %rsp, 8
RET
- /* The out-of-line code runs with the pre-leave stack frame. */
- .cfi_def_cfa %rbp, 16
-
.Linvalid_input:
+ CFI_DEF_CFA %rbp, 16
mov $(-EINVAL), %eax
jmp .Lout
--
2.47.0
* [PATCH v3 06/19] x86/vdso: Enable sframe generation in VDSO
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC
Enable sframe generation in the VDSO library so kernel and user space
can unwind through it.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/entry/vdso/Makefile | 6 ++++--
arch/x86/entry/vdso/vdso-layout.lds.S | 3 +++
arch/x86/include/asm/dwarf2.h | 5 ++++-
3 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index c9216ac4fb1e..75ae9e093a2d 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -47,13 +47,15 @@ quiet_cmd_vdso2c = VDSO2C $@
$(obj)/vdso-image-%.c: $(obj)/vdso%.so.dbg $(obj)/vdso%.so $(obj)/vdso2c FORCE
$(call if_changed,vdso2c)
+SFRAME_CFLAGS := $(call as-option,-Wa$(comma)-gsframe,)
+
#
# Don't omit frame pointers for ease of userspace debugging, but do
# optimize sibling calls.
#
CFL := $(PROFILING) -mcmodel=small -fPIC -O2 -fasynchronous-unwind-tables -m64 \
$(filter -g%,$(KBUILD_CFLAGS)) -fno-stack-protector \
- -fno-omit-frame-pointer -foptimize-sibling-calls \
+ -fno-omit-frame-pointer $(SFRAME_CFLAGS) -foptimize-sibling-calls \
-DDISABLE_BRANCH_PROFILING -DBUILD_VDSO
ifdef CONFIG_MITIGATION_RETPOLINE
@@ -63,7 +65,7 @@ endif
endif
$(vobjs): KBUILD_CFLAGS := $(filter-out $(PADDING_CFLAGS) $(CC_FLAGS_LTO) $(CC_FLAGS_CFI) $(RANDSTRUCT_CFLAGS) $(GCC_PLUGINS_CFLAGS) $(RETPOLINE_CFLAGS),$(KBUILD_CFLAGS)) $(CFL)
-$(vobjs): KBUILD_AFLAGS += -DBUILD_VDSO
+$(vobjs): KBUILD_AFLAGS += -DBUILD_VDSO $(SFRAME_CFLAGS)
#
# vDSO code runs in userspace and -pg doesn't help with profiling anyway.
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S
index a42c7d4a33da..f6cd6654bb9e 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -69,6 +69,7 @@ SECTIONS
.eh_frame_hdr : { *(.eh_frame_hdr) } :text :eh_frame_hdr
.eh_frame : { KEEP (*(.eh_frame)) } :text
+ .sframe : { *(.sframe) } :text :sframe
/*
* Text is well-separated from actual data: there's plenty of
@@ -97,6 +98,7 @@ SECTIONS
* Very old versions of ld do not recognize this name token; use the constant.
*/
#define PT_GNU_EH_FRAME 0x6474e550
+#define PT_GNU_SFRAME 0x6474e554
/*
* We must supply the ELF program headers explicitly to get just one
@@ -108,4 +110,5 @@ PHDRS
dynamic PT_DYNAMIC FLAGS(4); /* PF_R */
note PT_NOTE FLAGS(4); /* PF_R */
eh_frame_hdr PT_GNU_EH_FRAME;
+ sframe PT_GNU_SFRAME;
}
diff --git a/arch/x86/include/asm/dwarf2.h b/arch/x86/include/asm/dwarf2.h
index b1aa3fcd5bca..1a49492817a1 100644
--- a/arch/x86/include/asm/dwarf2.h
+++ b/arch/x86/include/asm/dwarf2.h
@@ -12,8 +12,11 @@
* For the vDSO, emit both runtime unwind information and debug
* symbols for the .dbg file.
*/
-
+#ifdef __x86_64__
+ .cfi_sections .eh_frame, .debug_frame, .sframe
+#else
.cfi_sections .eh_frame, .debug_frame
+#endif
#define CFI_STARTPROC .cfi_startproc
#define CFI_ENDPROC .cfi_endproc
--
2.47.0
* [PATCH v3 07/19] unwind: Add user space unwinding API
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC
Introduce a user space unwinder API which provides a generic way to
unwind user stacks.
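For illustration, a minimal usage sketch from task context (the helper
name is hypothetical; the interfaces are the ones added here):

	#include <linux/kernel.h>
	#include <linux/unwind_user.h>

	static void print_current_user_stack(void)
	{
		unsigned long buf[64];
		struct unwind_stacktrace trace = { .entries = buf };
		unsigned int i;

		/* Walk current's user stack (frame pointers for now). */
		if (unwind_user(&trace, ARRAY_SIZE(buf)))
			return;

		for (i = 0; i < trace.nr; i++)
			pr_info("  #%u: 0x%lx\n", i, trace.entries[i]);
	}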
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/Kconfig | 7 +++
include/linux/unwind_user.h | 41 +++++++++++++++
kernel/Makefile | 1 +
kernel/unwind/Makefile | 1 +
kernel/unwind/user.c | 99 +++++++++++++++++++++++++++++++++++++
5 files changed, 149 insertions(+)
create mode 100644 include/linux/unwind_user.h
create mode 100644 kernel/unwind/Makefile
create mode 100644 kernel/unwind/user.c
diff --git a/arch/Kconfig b/arch/Kconfig
index 7a95c1052cd5..ee8ec97ea0ef 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -435,6 +435,13 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
It uses the same command line parameters, and sysctl interface,
as the generic hardlockup detectors.
+config UNWIND_USER
+ bool
+
+config HAVE_UNWIND_USER_FP
+ bool
+ select UNWIND_USER
+
config HAVE_PERF_REGS
bool
help
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
new file mode 100644
index 000000000000..9d28db06f33f
--- /dev/null
+++ b/include/linux/unwind_user.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_H
+#define _LINUX_UNWIND_USER_H
+
+#include <linux/types.h>
+
+enum unwind_user_type {
+ UNWIND_USER_TYPE_FP,
+};
+
+struct unwind_stacktrace {
+ unsigned int nr;
+ unsigned long *entries;
+};
+
+struct unwind_user_frame {
+ s32 cfa_off;
+ s32 ra_off;
+ s32 fp_off;
+ bool use_fp;
+};
+
+struct unwind_user_state {
+ unsigned long ip;
+ unsigned long sp;
+ unsigned long fp;
+ enum unwind_user_type type;
+ bool done;
+};
+
+/* Synchronous interfaces: */
+
+int unwind_user_start(struct unwind_user_state *state);
+int unwind_user_next(struct unwind_user_state *state);
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
+
+#define for_each_user_frame(state) \
+ for (unwind_user_start((state)); !(state)->done; unwind_user_next((state)))
+
+#endif /* _LINUX_UNWIND_USER_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 87866b037fbe..6cb4b0e02a34 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,6 +50,7 @@ obj-y += rcu/
obj-y += livepatch/
obj-y += dma/
obj-y += entry/
+obj-y += unwind/
obj-$(CONFIG_MODULES) += module/
obj-$(CONFIG_KCMP) += kcmp.o
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
new file mode 100644
index 000000000000..349ce3677526
--- /dev/null
+++ b/kernel/unwind/Makefile
@@ -0,0 +1 @@
+ obj-$(CONFIG_UNWIND_USER) += user.o
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
new file mode 100644
index 000000000000..54b989810a0e
--- /dev/null
+++ b/kernel/unwind/user.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+* Generic interfaces for unwinding user space
+*
+* Copyright (C) 2024 Josh Poimboeuf <jpoimboe@kernel.org>
+*/
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+#include <linux/unwind_user.h>
+#include <linux/uaccess.h>
+#include <asm/unwind_user.h>
+
+static struct unwind_user_frame fp_frame = {
+ ARCH_INIT_USER_FP_FRAME
+};
+
+int unwind_user_next(struct unwind_user_state *state)
+{
+ struct unwind_user_frame _frame;
+ struct unwind_user_frame *frame = &_frame;
+ unsigned long prev_ip, cfa, fp, ra = 0;
+
+ if (state->done)
+ return -EINVAL;
+
+ prev_ip = state->ip;
+
+ switch (state->type) {
+ case UNWIND_USER_TYPE_FP:
+ frame = &fp_frame;
+ break;
+ default:
+ BUG();
+ }
+
+ cfa = (frame->use_fp ? state->fp : state->sp) + frame->cfa_off;
+
+ if (frame->ra_off && get_user(ra, (unsigned long __user *)(cfa + frame->ra_off)))
+ goto the_end;
+
+ if (ra == prev_ip)
+ goto the_end;
+
+ if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
+ goto the_end;
+
+ state->sp = cfa;
+ state->ip = ra;
+ if (frame->fp_off)
+ state->fp = fp;
+
+ return 0;
+
+the_end:
+ state->done = true;
+ return -EINVAL;
+}
+
+int unwind_user_start(struct unwind_user_state *state)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+
+ memset(state, 0, sizeof(*state));
+
+ if (!current->mm) {
+ state->done = true;
+ return -EINVAL;
+ }
+
+ state->type = UNWIND_USER_TYPE_FP;
+
+ state->sp = user_stack_pointer(regs);
+ state->ip = instruction_pointer(regs);
+ state->fp = frame_pointer(regs);
+
+ return 0;
+}
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries)
+{
+ struct unwind_user_state state;
+
+ trace->nr = 0;
+
+ if (!max_entries)
+ return -EINVAL;
+
+ if (!current->mm)
+ return 0;
+
+ for_each_user_frame(&state) {
+ trace->entries[trace->nr++] = state.ip;
+ if (trace->nr >= max_entries)
+ break;
+ }
+
+ return 0;
+}
--
2.47.0
* [PATCH v3 08/19] unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_FP
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC
Use ARCH_INIT_USER_FP_FRAME to describe how frame pointers are unwound
on x86, and enable CONFIG_HAVE_UNWIND_USER_FP accordingly so the
unwind_user interfaces can be used.
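For reference, these values encode the standard x86_64 frame layout
created by the usual "push %rbp; mov %rsp, %rbp" prologue. A sketch of
the arithmetic, using the unwind_user convention that CFA = FP + cfa_off:

	/*
	 * CFA (caller's %rsp before the call) = %rbp + 16 -> .cfa_off =  2 * sizeof(long)
	 * return address at CFA - 8  (= %rbp + 8)         -> .ra_off  = -1 * sizeof(long)
	 * saved caller %rbp at CFA - 16 (= %rbp + 0)      -> .fp_off  = -2 * sizeof(long)
	 */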
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/unwind_user.h | 11 +++++++++++
2 files changed, 12 insertions(+)
create mode 100644 arch/x86/include/asm/unwind_user.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0bdb7a394f59..f91098d6f535 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -289,6 +289,7 @@ config X86
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL
select HAVE_UNSTABLE_SCHED_CLOCK
+ select HAVE_UNWIND_USER_FP if X86_64
select HAVE_USER_RETURN_NOTIFIER
select HAVE_GENERIC_VDSO
select VDSO_GETRANDOM if X86_64
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
new file mode 100644
index 000000000000..19df26a65132
--- /dev/null
+++ b/arch/x86/include/asm/unwind_user.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_UNWIND_USER_H
+#define _ASM_X86_UNWIND_USER_H
+
+#define ARCH_INIT_USER_FP_FRAME \
+ .ra_off = (s32)sizeof(long) * -1, \
+ .cfa_off = (s32)sizeof(long) * 2, \
+ .fp_off = (s32)sizeof(long) * -2, \
+ .use_fp = true,
+
+#endif /* _ASM_X86_UNWIND_USER_H */
--
2.47.0
* [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC
Some distros have started compiling frame pointers into all their
packages to enable the kernel to do system-wide profiling of user space.
Unfortunately that creates a runtime performance penalty across the
entire system. Using DWARF instead isn't feasible due to the complexity
it would add to the kernel.
For in-kernel unwinding we solved this problem with the creation of the
ORC unwinder for x86_64. Similarly, for user space, the GNU assembler
generates the SFrame format, with SFrame v2 supported starting with
binutils 2.41. SFrame is a simpler alternative to .eh_frame and gets
placed in the .sframe section.
Add support for unwinding user space using sframe.
More information about sframe can be found here:
- https://lwn.net/Articles/932209/
- https://lwn.net/Articles/940686/
- https://sourceware.org/binutils/docs/sframe-spec.html
Glibc support is needed to implement the prctl() calls to tell the
kernel where the .sframe segments are.
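As a rough sketch of what that runtime-linker support could look like
(hypothetical helper; the PT_GNU_SFRAME and PR_ADD_SFRAME values are
taken from this series, and errors are ignored for brevity):

	#include <link.h>
	#include <sys/prctl.h>

	#ifndef PT_GNU_SFRAME
	#define PT_GNU_SFRAME	0x6474e554
	#endif
	#define PR_ADD_SFRAME	74

	/*
	 * dl_iterate_phdr() callback: register each loaded object's
	 * .sframe section.  A 0/0 text range lets the kernel locate
	 * the object's single executable segment.
	 */
	static int register_sframe_cb(struct dl_phdr_info *info,
				      size_t size, void *data)
	{
		for (int i = 0; i < info->dlpi_phnum; i++) {
			const ElfW(Phdr) *ph = &info->dlpi_phdr[i];

			if (ph->p_type == PT_GNU_SFRAME) {
				prctl(PR_ADD_SFRAME,
				      info->dlpi_addr + ph->p_vaddr, 0, 0, 0);
				break;
			}
		}
		return 0;
	}

Registering everything already loaded would then be a single
dl_iterate_phdr(register_sframe_cb, NULL) call.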
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/Kconfig | 4 +
arch/x86/include/asm/mmu.h | 2 +-
fs/binfmt_elf.c | 35 +++-
include/linux/mm_types.h | 3 +
include/linux/sframe.h | 41 ++++
include/linux/unwind_user.h | 2 +
include/uapi/linux/elf.h | 1 +
include/uapi/linux/prctl.h | 3 +
kernel/fork.c | 10 +
kernel/sys.c | 11 ++
kernel/unwind/Makefile | 3 +-
kernel/unwind/sframe.c | 380 ++++++++++++++++++++++++++++++++++++
kernel/unwind/sframe.h | 215 ++++++++++++++++++++
kernel/unwind/user.c | 24 ++-
mm/init-mm.c | 6 +
15 files changed, 732 insertions(+), 8 deletions(-)
create mode 100644 include/linux/sframe.h
create mode 100644 kernel/unwind/sframe.c
create mode 100644 kernel/unwind/sframe.h
diff --git a/arch/Kconfig b/arch/Kconfig
index ee8ec97ea0ef..e769c39dd221 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -442,6 +442,10 @@ config HAVE_UNWIND_USER_FP
bool
select UNWIND_USER
+config HAVE_UNWIND_USER_SFRAME
+ bool
+ select UNWIND_USER
+
config HAVE_PERF_REGS
bool
help
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index ce4677b8b735..12ea831978cc 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -73,7 +73,7 @@ typedef struct {
.context = { \
.ctx_id = 1, \
.lock = __MUTEX_INITIALIZER(mm.context.lock), \
- }
+ },
void leave_mm(void);
#define leave_mm leave_mm
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 06dc4a57ba78..434c548f0837 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -47,6 +47,7 @@
#include <linux/dax.h>
#include <linux/uaccess.h>
#include <linux/rseq.h>
+#include <linux/sframe.h>
#include <asm/param.h>
#include <asm/page.h>
@@ -633,11 +634,13 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
unsigned long no_base, struct elf_phdr *interp_elf_phdata,
struct arch_elf_state *arch_state)
{
- struct elf_phdr *eppnt;
+ struct elf_phdr *eppnt, *sframe_phdr = NULL;
unsigned long load_addr = 0;
int load_addr_set = 0;
unsigned long error = ~0UL;
unsigned long total_size;
+ unsigned long start_code = ~0UL;
+ unsigned long end_code = 0;
int i;
/* First of all, some simple consistency checks */
@@ -659,7 +662,8 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
eppnt = interp_elf_phdata;
for (i = 0; i < interp_elf_ex->e_phnum; i++, eppnt++) {
- if (eppnt->p_type == PT_LOAD) {
+ switch (eppnt->p_type) {
+ case PT_LOAD: {
int elf_type = MAP_PRIVATE;
int elf_prot = make_prot(eppnt->p_flags, arch_state,
true, true);
@@ -688,7 +692,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
/*
* Check to see if the section's size will overflow the
* allowed task size. Note that p_filesz must always be
- * <= p_memsize so it's only necessary to check p_memsz.
+ * <= p_memsz so it's only necessary to check p_memsz.
*/
k = load_addr + eppnt->p_vaddr;
if (BAD_ADDR(k) ||
@@ -698,9 +702,24 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
error = -ENOMEM;
goto out;
}
+
+ if ((eppnt->p_flags & PF_X) && k < start_code)
+ start_code = k;
+
+ if ((eppnt->p_flags & PF_X) && k + eppnt->p_filesz > end_code)
+ end_code = k + eppnt->p_filesz;
+ break;
+ }
+ case PT_GNU_SFRAME:
+ sframe_phdr = eppnt;
+ break;
}
}
+ if (sframe_phdr)
+ sframe_add_section(load_addr + sframe_phdr->p_vaddr,
+ start_code, end_code);
+
error = load_addr;
out:
return error;
@@ -823,7 +842,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
int first_pt_load = 1;
unsigned long error;
struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
- struct elf_phdr *elf_property_phdata = NULL;
+ struct elf_phdr *elf_property_phdata = NULL, *sframe_phdr = NULL;
unsigned long elf_brk;
int retval, i;
unsigned long elf_entry;
@@ -931,6 +950,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
executable_stack = EXSTACK_DISABLE_X;
break;
+ case PT_GNU_SFRAME:
+ sframe_phdr = elf_ppnt;
+ break;
+
case PT_LOPROC ... PT_HIPROC:
retval = arch_elf_pt_proc(elf_ex, elf_ppnt,
bprm->file, false,
@@ -1321,6 +1344,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
task_pid_nr(current), retval);
}
+ if (sframe_phdr)
+ sframe_add_section(load_bias + sframe_phdr->p_vaddr,
+ start_code, end_code);
+
regs = current_pt_regs();
#ifdef ELF_PLAT_INIT
/*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 381d22eba088..6e7561c1a5fc 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1052,6 +1052,9 @@ struct mm_struct {
#endif
} lru_gen;
#endif /* CONFIG_LRU_GEN_WALKS_MMU */
+#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
+ struct maple_tree sframe_mt;
+#endif
} __randomize_layout;
/*
diff --git a/include/linux/sframe.h b/include/linux/sframe.h
new file mode 100644
index 000000000000..d167e01817c4
--- /dev/null
+++ b/include/linux/sframe.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SFRAME_H
+#define _LINUX_SFRAME_H
+
+#include <linux/mm_types.h>
+
+struct unwind_user_frame;
+
+#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
+
+#define INIT_MM_SFRAME .sframe_mt = MTREE_INIT(sframe_mt, 0),
+
+extern void sframe_free_mm(struct mm_struct *mm);
+
+/* text_start, text_end, file_name are optional */
+extern int sframe_add_section(unsigned long sframe_addr, unsigned long text_start,
+ unsigned long text_end);
+
+extern int sframe_remove_section(unsigned long sframe_addr);
+extern int sframe_find(unsigned long ip, struct unwind_user_frame *frame);
+
+static inline bool current_has_sframe(void)
+{
+ struct mm_struct *mm = current->mm;
+
+ return mm && !mtree_empty(&mm->sframe_mt);
+}
+
+#else /* !CONFIG_HAVE_UNWIND_USER_SFRAME */
+
+static inline void sframe_free_mm(struct mm_struct *mm) {}
+
+static inline int sframe_add_section(unsigned long sframe_addr, unsigned long text_start, unsigned long text_end) { return -EINVAL; }
+static inline int sframe_remove_section(unsigned long sframe_addr) { return -EINVAL; }
+static inline int sframe_find(unsigned long ip, struct unwind_user_frame *frame) { return -EINVAL; }
+
+static inline bool current_has_sframe(void) { return false; }
+
+#endif /* CONFIG_HAVE_UNWIND_USER_SFRAME */
+
+#endif /* _LINUX_SFRAME_H */
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index 9d28db06f33f..cde0fde4923e 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -5,7 +5,9 @@
#include <linux/types.h>
enum unwind_user_type {
+ UNWIND_USER_TYPE_NONE,
UNWIND_USER_TYPE_FP,
+ UNWIND_USER_TYPE_SFRAME,
};
struct unwind_stacktrace {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index b9935988da5c..4dc3f0ca5af5 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -39,6 +39,7 @@ typedef __s64 Elf64_Sxword;
#define PT_GNU_STACK (PT_LOOS + 0x474e551)
#define PT_GNU_RELRO (PT_LOOS + 0x474e552)
#define PT_GNU_PROPERTY (PT_LOOS + 0x474e553)
+#define PT_GNU_SFRAME (PT_LOOS + 0x474e554)
/* ARM MTE memory tag segment type */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 35791791a879..69511077c910 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -328,4 +328,7 @@ struct prctl_mm_map {
# define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC 0x10 /* Clear the aspect on exec */
# define PR_PPC_DEXCR_CTRL_MASK 0x1f
+#define PR_ADD_SFRAME 74
+#define PR_REMOVE_SFRAME 75
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index c056ea95fe8c..60f14fbab956 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -105,6 +105,7 @@
#include <linux/rseq.h>
#include <uapi/linux/pidfd.h>
#include <linux/pidfs.h>
+#include <linux/sframe.h>
#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -924,6 +925,7 @@ void __mmdrop(struct mm_struct *mm)
mm_pasid_drop(mm);
mm_destroy_cid(mm);
percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
+ sframe_free_mm(mm);
free_mm(mm);
}
@@ -1251,6 +1253,13 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
#endif
}
+static void mm_init_sframe(struct mm_struct *mm)
+{
+#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
+ mt_init(&mm->sframe_mt);
+#endif
+}
+
static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
struct user_namespace *user_ns)
{
@@ -1282,6 +1291,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm->pmd_huge_pte = NULL;
#endif
mm_init_uprobes_state(mm);
+ mm_init_sframe(mm);
hugetlb_count_init(mm);
if (current->mm) {
diff --git a/kernel/sys.c b/kernel/sys.c
index 4da31f28fda8..7d05a67727db 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -64,6 +64,7 @@
#include <linux/rcupdate.h>
#include <linux/uidgid.h>
#include <linux/cred.h>
+#include <linux/sframe.h>
#include <linux/nospec.h>
@@ -2784,6 +2785,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_RISCV_SET_ICACHE_FLUSH_CTX:
error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
break;
+ case PR_ADD_SFRAME:
+ if (arg5)
+ return -EINVAL;
+ error = sframe_add_section(arg2, arg3, arg4);
+ break;
+ case PR_REMOVE_SFRAME:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = sframe_remove_section(arg2);
+ break;
default:
error = -EINVAL;
break;
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
index 349ce3677526..f70380d7a6a6 100644
--- a/kernel/unwind/Makefile
+++ b/kernel/unwind/Makefile
@@ -1 +1,2 @@
- obj-$(CONFIG_UNWIND_USER) += user.o
+ obj-$(CONFIG_UNWIND_USER) += user.o
+ obj-$(CONFIG_HAVE_UNWIND_USER_SFRAME) += sframe.o
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
new file mode 100644
index 000000000000..933e47696e29
--- /dev/null
+++ b/kernel/unwind/sframe.c
@@ -0,0 +1,380 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define pr_fmt(fmt) "sframe: " fmt
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/srcu.h>
+#include <linux/uaccess.h>
+#include <linux/mm.h>
+#include <linux/sframe.h>
+#include <linux/unwind_user.h>
+
+#include "sframe.h"
+
+#define SFRAME_FILENAME_LEN 32
+
+struct sframe_section {
+ struct rcu_head rcu;
+
+ unsigned long sframe_addr;
+ unsigned long text_addr;
+
+ unsigned long fdes_addr;
+ unsigned long fres_addr;
+ unsigned int fdes_nr;
+ signed char ra_off;
+ signed char fp_off;
+};
+
+DEFINE_STATIC_SRCU(sframe_srcu);
+
+#define __SFRAME_GET_USER(out, user_ptr, type) \
+({ \
+ type __tmp; \
+ if (get_user(__tmp, (type __user *)user_ptr)) \
+ return -EFAULT; \
+ user_ptr += sizeof(__tmp); \
+ out = __tmp; \
+})
+
+#define SFRAME_GET_USER(out, user_ptr, size) \
+({ \
+ switch (size) { \
+ case 1: \
+ __SFRAME_GET_USER(out, user_ptr, u8); \
+ break; \
+ case 2: \
+ __SFRAME_GET_USER(out, user_ptr, u16); \
+ break; \
+ case 4: \
+ __SFRAME_GET_USER(out, user_ptr, u32); \
+ break; \
+ default: \
+ return -EINVAL; \
+ } \
+})
+
+static unsigned char fre_type_to_size(unsigned char fre_type)
+{
+ if (fre_type > 2)
+ return 0;
+ return 1 << fre_type;
+}
+
+static unsigned char offset_size_enum_to_size(unsigned char off_size)
+{
+ if (off_size > 2)
+ return 0;
+ return 1 << off_size;
+}
+
+static int find_fde(struct sframe_section *sec, unsigned long ip,
+ struct sframe_fde *fde)
+{
+ struct sframe_fde __user *first, *last, *found = NULL;
+ u32 ip_off, func_off_low = 0, func_off_high = -1;
+
+ ip_off = ip - sec->sframe_addr;
+
+ first = (void __user *)sec->fdes_addr;
+ last = first + sec->fdes_nr - 1;
+ while (first <= last) {
+ struct sframe_fde __user *mid;
+ u32 func_off;
+
+ mid = first + ((last - first) / 2);
+
+ if (get_user(func_off, (s32 __user *)mid))
+ return -EFAULT;
+
+ if (ip_off >= func_off) {
+ /* validate sort order */
+ if (func_off < func_off_low)
+ return -EINVAL;
+
+ func_off_low = func_off;
+
+ found = mid;
+ first = mid + 1;
+ } else {
+ /* validate sort order */
+ if (func_off > func_off_high)
+ return -EINVAL;
+
+ func_off_high = func_off;
+
+ last = mid - 1;
+ }
+ }
+
+ if (!found)
+ return -EINVAL;
+
+ if (copy_from_user(fde, found, sizeof(*fde)))
+ return -EFAULT;
+
+ /* check for gaps */
+ if (ip_off < fde->start_addr || ip_off >= fde->start_addr + fde->size)
+ return -EINVAL;
+
+ return 0;
+}
+
+static int find_fre(struct sframe_section *sec, struct sframe_fde *fde,
+ unsigned long ip, struct unwind_user_frame *frame)
+{
+ unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
+ unsigned char fre_type = SFRAME_FUNC_FRE_TYPE(fde->info);
+ unsigned char offset_count, offset_size;
+ s32 cfa_off, ra_off, fp_off, ip_off;
+ void __user *f, *last_f = NULL;
+ unsigned char addr_size;
+ u32 last_fre_ip_off = 0;
+ u8 fre_info = 0;
+ int i;
+
+ addr_size = fre_type_to_size(fre_type);
+ if (!addr_size)
+ return -EINVAL;
+
+ ip_off = ip - (sec->sframe_addr + fde->start_addr);
+
+ f = (void __user *)sec->fres_addr + fde->fres_off;
+
+ for (i = 0; i < fde->fres_num; i++) {
+ u32 fre_ip_off;
+
+ SFRAME_GET_USER(fre_ip_off, f, addr_size);
+
+ if (fre_ip_off < last_fre_ip_off)
+ return -EINVAL;
+
+ last_fre_ip_off = fre_ip_off;
+
+ if (fde_type == SFRAME_FDE_TYPE_PCINC) {
+ if (ip_off < fre_ip_off)
+ break;
+ } else {
+ /* SFRAME_FDE_TYPE_PCMASK */
+ if (ip_off % fde->rep_size < fre_ip_off)
+ break;
+ }
+
+ SFRAME_GET_USER(fre_info, f, 1);
+
+ offset_count = SFRAME_FRE_OFFSET_COUNT(fre_info);
+ offset_size = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(fre_info));
+
+ if (!offset_count || !offset_size)
+ return -EINVAL;
+
+ last_f = f;
+ f += offset_count * offset_size;
+ }
+
+ if (!last_f)
+ return -EINVAL;
+
+ f = last_f;
+
+ SFRAME_GET_USER(cfa_off, f, offset_size);
+ offset_count--;
+
+ ra_off = sec->ra_off;
+ if (!ra_off) {
+ if (!offset_count--)
+ return -EINVAL;
+
+ SFRAME_GET_USER(ra_off, f, offset_size);
+ }
+
+ fp_off = sec->fp_off;
+ if (!fp_off && offset_count) {
+ offset_count--;
+ SFRAME_GET_USER(fp_off, f, offset_size);
+ }
+
+ if (offset_count)
+ return -EINVAL;
+
+ frame->cfa_off = cfa_off;
+ frame->ra_off = ra_off;
+ frame->fp_off = fp_off;
+ frame->use_fp = SFRAME_FRE_CFA_BASE_REG_ID(fre_info) == SFRAME_BASE_REG_FP;
+
+ return 0;
+}
+
+int sframe_find(unsigned long ip, struct unwind_user_frame *frame)
+{
+ struct mm_struct *mm = current->mm;
+ struct sframe_section *sec;
+ struct sframe_fde fde;
+ int ret = -EINVAL;
+
+ if (!mm)
+ return -EINVAL;
+
+ guard(srcu)(&sframe_srcu);
+
+ sec = mtree_load(&mm->sframe_mt, ip);
+ if (!sec)
+ return ret;
+
+ ret = find_fde(sec, ip, &fde);
+ if (ret)
+ return ret;
+
+ ret = find_fre(sec, &fde, ip, frame);
+ if (ret)
+ return ret;
+
+ return 0;
+}
+
+static int __sframe_add_section(unsigned long sframe_addr,
+ unsigned long text_start,
+ unsigned long text_end)
+{
+ struct maple_tree *sframe_mt = &current->mm->sframe_mt;
+ struct sframe_section *sec;
+ struct sframe_header shdr;
+ unsigned long header_end;
+ int ret;
+
+ if (copy_from_user(&shdr, (void __user *)sframe_addr, sizeof(shdr)))
+ return -EFAULT;
+
+ if (shdr.preamble.magic != SFRAME_MAGIC ||
+ shdr.preamble.version != SFRAME_VERSION_2 ||
+ !(shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
+ shdr.auxhdr_len || !shdr.num_fdes || !shdr.num_fres ||
+ shdr.fdes_off > shdr.fres_off) {
+ return -EINVAL;
+ }
+
+ sec = kmalloc(sizeof(*sec), GFP_KERNEL);
+ if (!sec)
+ return -ENOMEM;
+
+ header_end = sframe_addr + SFRAME_HDR_SIZE(shdr);
+
+ sec->sframe_addr = sframe_addr;
+ sec->text_addr = text_start;
+ sec->fdes_addr = header_end + shdr.fdes_off;
+ sec->fres_addr = header_end + shdr.fres_off;
+ sec->fdes_nr = shdr.num_fdes;
+ sec->ra_off = shdr.cfa_fixed_ra_offset;
+ sec->fp_off = shdr.cfa_fixed_fp_offset;
+
+ ret = mtree_insert_range(sframe_mt, text_start, text_end, sec, GFP_KERNEL);
+ if (ret) {
+ kfree(sec);
+ return ret;
+ }
+
+ return 0;
+}
+
+int sframe_add_section(unsigned long sframe_addr, unsigned long text_start,
+ unsigned long text_end)
+{
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *sframe_vma;
+
+ mmap_read_lock(mm);
+
+ sframe_vma = vma_lookup(mm, sframe_addr);
+ if (!sframe_vma)
+ goto err_unlock;
+
+ if (text_start && text_end) {
+ struct vm_area_struct *text_vma;
+
+ text_vma = vma_lookup(mm, text_start);
+ if (!text_vma || !(text_vma->vm_flags & VM_EXEC))
+ goto err_unlock;
+
+ if (PAGE_ALIGN(text_end) != text_vma->vm_end)
+ goto err_unlock;
+ } else {
+ struct vm_area_struct *vma, *text_vma = NULL;
+ VMA_ITERATOR(vmi, mm, 0);
+
+ for_each_vma(vmi, vma) {
+ if (vma->vm_file != sframe_vma->vm_file ||
+ !(vma->vm_flags & VM_EXEC))
+ continue;
+
+ if (text_vma) {
+ pr_warn_once("%s[%d]: multiple EXEC segments unsupported\n",
+ current->comm, current->pid);
+ goto err_unlock;
+ }
+
+ text_vma = vma;
+ }
+
+ if (!text_vma)
+ goto err_unlock;
+
+ text_start = text_vma->vm_start;
+ text_end = text_vma->vm_end;
+ }
+
+ mmap_read_unlock(mm);
+
+ return __sframe_add_section(sframe_addr, text_start, text_end);
+
+err_unlock:
+ mmap_read_unlock(mm);
+ return -EINVAL;
+}
+
+static void sframe_free_srcu(struct rcu_head *rcu)
+{
+ struct sframe_section *sec = container_of(rcu, struct sframe_section, rcu);
+
+ kfree(sec);
+}
+
+static int __sframe_remove_section(struct mm_struct *mm,
+ struct sframe_section *sec)
+{
+ sec = mtree_erase(&mm->sframe_mt, sec->text_addr);
+ if (!sec)
+ return -EINVAL;
+
+ call_srcu(&sframe_srcu, &sec->rcu, sframe_free_srcu);
+
+ return 0;
+}
+
+int sframe_remove_section(unsigned long sframe_addr)
+{
+ struct mm_struct *mm = current->mm;
+ struct sframe_section *sec;
+ unsigned long index = 0;
+
+ mt_for_each(&mm->sframe_mt, sec, index, ULONG_MAX) {
+ if (sec->sframe_addr == sframe_addr)
+ return __sframe_remove_section(mm, sec);
+ }
+
+ return -EINVAL;
+}
+
+void sframe_free_mm(struct mm_struct *mm)
+{
+ struct sframe_section *sec;
+ unsigned long index = 0;
+
+ if (!mm)
+ return;
+
+ mt_for_each(&mm->sframe_mt, sec, index, ULONG_MAX)
+ kfree(sec);
+
+ mtree_destroy(&mm->sframe_mt);
+}
diff --git a/kernel/unwind/sframe.h b/kernel/unwind/sframe.h
new file mode 100644
index 000000000000..11b9be7ad82e
--- /dev/null
+++ b/kernel/unwind/sframe.h
@@ -0,0 +1,215 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ *
+ * This file contains definitions for the SFrame stack tracing format, which is
+ * documented at https://sourceware.org/binutils/docs
+ */
+#ifndef _SFRAME_H
+#define _SFRAME_H
+
+#include <linux/types.h>
+
+#define SFRAME_VERSION_1 1
+#define SFRAME_VERSION_2 2
+#define SFRAME_MAGIC 0xdee2
+
+/* Function Descriptor Entries are sorted on PC. */
+#define SFRAME_F_FDE_SORTED 0x1
+/* Frame-pointer based stack tracing. Defined, but not set. */
+#define SFRAME_F_FRAME_POINTER 0x2
+
+#define SFRAME_CFA_FIXED_FP_INVALID 0
+#define SFRAME_CFA_FIXED_RA_INVALID 0
+
+/* Supported ABIs/Arch. */
+#define SFRAME_ABI_AARCH64_ENDIAN_BIG 1 /* AARCH64 big endian. */
+#define SFRAME_ABI_AARCH64_ENDIAN_LITTLE 2 /* AARCH64 little endian. */
+#define SFRAME_ABI_AMD64_ENDIAN_LITTLE 3 /* AMD64 little endian. */
+
+/* SFrame FRE types. */
+#define SFRAME_FRE_TYPE_ADDR1 0
+#define SFRAME_FRE_TYPE_ADDR2 1
+#define SFRAME_FRE_TYPE_ADDR4 2
+
+/*
+ * SFrame Function Descriptor Entry types.
+ *
+ * The SFrame format has two possible representations for functions. The
+ * choice of which type to use is made according to the instruction patterns
+ * in the relevant program stub.
+ */
+
+/* Unwinders perform a (PC >= FRE_START_ADDR) to look up a matching FRE. */
+#define SFRAME_FDE_TYPE_PCINC 0
+/*
+ * Unwinders perform a (PC & FRE_START_ADDR_AS_MASK >= FRE_START_ADDR_AS_MASK)
+ * to look up a matching FRE. Typical usecases are pltN entries, trampolines
+ * etc.
+ */
+#define SFRAME_FDE_TYPE_PCMASK 1
+
+/**
+ * struct sframe_preamble - SFrame Preamble.
+ * @magic: Magic number (SFRAME_MAGIC).
+ * @version: Format version number (SFRAME_VERSION).
+ * @flags: Various flags.
+ */
+struct sframe_preamble {
+ u16 magic;
+ u8 version;
+ u8 flags;
+} __packed;
+
+/**
+ * struct sframe_header - SFrame Header.
+ * @preamble: SFrame preamble.
+ * @abi_arch: Identify the arch (including endianness) and ABI.
+ * @cfa_fixed_fp_offset: Offset for the Frame Pointer (FP) from CFA may be
+ * fixed for some ABIs (e.g., in AMD64 when -fno-omit-frame-pointer is
+ * used). When fixed, this field specifies the fixed stack frame offset
+ * and the individual FREs do not need to track it. When not fixed, it
+ * is set to SFRAME_CFA_FIXED_FP_INVALID, and the individual FREs may
+ * provide the applicable stack frame offset, if any.
+ * @cfa_fixed_ra_offset: Offset for the Return Address from CFA is fixed for
+ * some ABIs. When fixed, this field specifies the fixed stack frame
+ * offset and the individual FREs do not need to track it. When not
+ * fixed, it is set to SFRAME_CFA_FIXED_RA_INVALID.
+ * @auxhdr_len: Number of bytes making up the auxiliary header, if any.
+ * Some ABI/arch, in the future, may use this space for extending the
+ * information in SFrame header. Auxiliary header is contained in bytes
+ * sequentially following the sframe_header.
+ * @num_fdes: Number of SFrame FDEs in this SFrame section.
+ * @num_fres: Number of SFrame Frame Row Entries.
+ * @fre_len: Number of bytes in the SFrame Frame Row Entry section.
+ * @fdes_off: Offset of SFrame Function Descriptor Entry section.
+ * @fres_off: Offset of SFrame Frame Row Entry section.
+ */
+struct sframe_header {
+ struct sframe_preamble preamble;
+ u8 abi_arch;
+ s8 cfa_fixed_fp_offset;
+ s8 cfa_fixed_ra_offset;
+ u8 auxhdr_len;
+ u32 num_fdes;
+ u32 num_fres;
+ u32 fre_len;
+ u32 fdes_off;
+ u32 fres_off;
+} __packed;
+
+#define SFRAME_HDR_SIZE(sframe_hdr) \
+ ((sizeof(struct sframe_header) + (sframe_hdr).auxhdr_len))
+
+/* Two possible keys for executable (instruction) pointers signing. */
+#define SFRAME_AARCH64_PAUTH_KEY_A 0 /* Key A. */
+#define SFRAME_AARCH64_PAUTH_KEY_B 1 /* Key B. */
+
+/**
+ * struct sframe_fde - SFrame Function Descriptor Entry.
+ * @start_addr: Function start address. Encoded as a signed offset,
+ * relative to the current FDE.
+ * @size: Size of the function in bytes.
+ * @fres_off: Offset of the first SFrame Frame Row Entry of the function,
+ * relative to the beginning of the SFrame Frame Row Entry sub-section.
+ * @fres_num: Number of frame row entries for the function.
+ * @info: Additional information for deciphering the stack trace
+ * information for the function. Contains information about SFrame FRE
+ * type, SFrame FDE type, PAC authorization A/B key, etc.
+ * @rep_size: Block size for SFRAME_FDE_TYPE_PCMASK.
+ * @padding: Unused.
+ */
+struct sframe_fde {
+ s32 start_addr;
+ u32 size;
+ u32 fres_off;
+ u32 fres_num;
+ u8 info;
+ u8 rep_size;
+ u16 padding;
+} __packed;
+
+/*
+ * 'func_info' in SFrame FDE contains additional information for deciphering
+ * the stack trace information for the function. In V1, the information is
+ * organized as follows:
+ * - 4-bits: Identify the FRE type used for the function.
+ * - 1-bit: Identify the FDE type of the function - mask or inc.
+ * - 1-bit: PAC authorization A/B key (aarch64).
+ * - 2-bits: Unused.
+ * ---------------------------------------------------------------------
+ * | Unused | PAC auth A/B key (aarch64) | FDE type | FRE type |
+ * | | Unused (amd64) | | |
+ * ---------------------------------------------------------------------
+ * 8 6 5 4 0
+ */
+
+/* Note: Set PAC auth key to SFRAME_AARCH64_PAUTH_KEY_A by default. */
+#define SFRAME_FUNC_INFO(fde_type, fre_enc_type) \
+ (((SFRAME_AARCH64_PAUTH_KEY_A & 0x1) << 5) | \
+ (((fde_type) & 0x1) << 4) | ((fre_enc_type) & 0xf))
+
+#define SFRAME_FUNC_FRE_TYPE(data) ((data) & 0xf)
+#define SFRAME_FUNC_FDE_TYPE(data) (((data) >> 4) & 0x1)
+#define SFRAME_FUNC_PAUTH_KEY(data) (((data) >> 5) & 0x1)
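+
+/*
+ * Worked example (illustrative): SFRAME_FUNC_INFO(SFRAME_FDE_TYPE_PCMASK,
+ * SFRAME_FRE_TYPE_ADDR4) == 0x12 -- FRE type ADDR4 (2) in bits 0-3, FDE
+ * type PCMASK (1) in bit 4, PAC key A (0) in bit 5.
+ */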
+
+/*
+ * Size of stack frame offsets in an SFrame Frame Row Entry. A single
+ * SFrame FRE has all offsets of the same size. Offset size may vary
+ * across frame row entries.
+ */
+#define SFRAME_FRE_OFFSET_1B 0
+#define SFRAME_FRE_OFFSET_2B 1
+#define SFRAME_FRE_OFFSET_4B 2
+
+/* An SFrame Frame Row Entry can be SP or FP based. */
+#define SFRAME_BASE_REG_FP 0
+#define SFRAME_BASE_REG_SP 1
+
+/*
+ * The index at which a specific offset is presented in the variable length
+ * bytes of an FRE.
+ */
+#define SFRAME_FRE_CFA_OFFSET_IDX 0
+/*
+ * The RA stack offset, if present, will always be at index 1 in the variable
+ * length bytes of the FRE.
+ */
+#define SFRAME_FRE_RA_OFFSET_IDX 1
+/*
+ * The FP stack offset may appear at index 1 or 2, depending on whether the
+ * ABI tracks RA.
+ */
+#define SFRAME_FRE_FP_OFFSET_IDX 2
+
+/*
+ * 'fre_info' in SFrame FRE contains information about:
+ * - 1 bit: base reg for CFA
+ * - 4 bits: Number of offsets (N). A value of up to 3 is allowed to track
+ * all three of CFA, FP and RA (fixed implicit order).
+ * - 2 bits: information about size of the offsets (S) in bytes.
+ * Valid values are SFRAME_FRE_OFFSET_1B, SFRAME_FRE_OFFSET_2B,
+ * SFRAME_FRE_OFFSET_4B
+ * - 1 bit: Mangled RA state bit (aarch64 only).
+ * ---------------------------------------------------------------
+ * | Mangled-RA (aarch64) | Size of | Number of | base_reg |
+ * | Unused (amd64) | offsets | offsets | |
+ * ---------------------------------------------------------------
+ * 8 7 5 1 0
+ */
+
+/* Note: Set mangled_ra_p to zero by default. */
+#define SFRAME_FRE_INFO(base_reg_id, offset_num, offset_size) \
+ (((0 & 0x1) << 7) | (((offset_size) & 0x3) << 5) | \
+ (((offset_num) & 0xf) << 1) | ((base_reg_id) & 0x1))
+
+/* Set the mangled_ra_p bit as indicated. */
+#define SFRAME_FRE_INFO_UPDATE_MANGLED_RA_P(mangled_ra_p, fre_info) \
+ ((((mangled_ra_p) & 0x1) << 7) | ((fre_info) & 0x7f))
+
+#define SFRAME_FRE_CFA_BASE_REG_ID(data) ((data) & 0x1)
+#define SFRAME_FRE_OFFSET_COUNT(data) (((data) >> 1) & 0xf)
+#define SFRAME_FRE_OFFSET_SIZE(data) (((data) >> 5) & 0x3)
+#define SFRAME_FRE_MANGLED_RA_P(data) (((data) >> 7) & 0x1)
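+
+/*
+ * Worked example (illustrative): SFRAME_FRE_INFO(SFRAME_BASE_REG_SP, 3,
+ * SFRAME_FRE_OFFSET_1B) == 0x07 -- SP base register (1) in bit 0, three
+ * offsets (CFA, RA, FP) in bits 1-4, 1-byte offset size (0) in bits 5-6.
+ */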
+
+#endif /* _SFRAME_H */
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 54b989810a0e..8e47c80e3e54 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -8,12 +8,17 @@
#include <linux/sched.h>
#include <linux/sched/task_stack.h>
#include <linux/unwind_user.h>
+#include <linux/sframe.h>
#include <linux/uaccess.h>
-#include <asm/unwind_user.h>
+#ifdef CONFIG_HAVE_UNWIND_USER_FP
+#include <asm/unwind_user.h>
static struct unwind_user_frame fp_frame = {
ARCH_INIT_USER_FP_FRAME
};
+#else
+static struct unwind_user_frame fp_frame;
+#endif
int unwind_user_next(struct unwind_user_state *state)
{
@@ -30,6 +35,16 @@ int unwind_user_next(struct unwind_user_state *state)
case UNWIND_USER_TYPE_FP:
frame = &fp_frame;
break;
+ case UNWIND_USER_TYPE_SFRAME:
+ if (sframe_find(state->ip, frame)) {
+ if (!IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+ goto the_end;
+
+ frame = &fp_frame;
+ }
+ break;
+ case UNWIND_USER_TYPE_NONE:
+ goto the_end;
default:
BUG();
}
@@ -68,7 +83,12 @@ int unwind_user_start(struct unwind_user_state *state)
return -EINVAL;
}
- state->type = UNWIND_USER_TYPE_FP;
+ if (current_has_sframe())
+ state->type = UNWIND_USER_TYPE_SFRAME;
+ else if (IS_ENABLED(CONFIG_UNWIND_USER_FP))
+ state->type = UNWIND_USER_TYPE_FP;
+ else
+ state->type = UNWIND_USER_TYPE_NONE;
state->sp = user_stack_pointer(regs);
state->ip = instruction_pointer(regs);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 24c809379274..8eb1b122b7bf 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -11,12 +11,17 @@
#include <linux/atomic.h>
#include <linux/user_namespace.h>
#include <linux/iommu.h>
+#include <linux/sframe.h>
#include <asm/mmu.h>
#ifndef INIT_MM_CONTEXT
#define INIT_MM_CONTEXT(name)
#endif
+#ifndef INIT_MM_SFRAME
+#define INIT_MM_SFRAME
+#endif
+
const struct vm_operations_struct vma_dummy_vm_ops;
/*
@@ -45,6 +50,7 @@ struct mm_struct init_mm = {
.user_ns = &init_user_ns,
.cpu_bitmap = CPU_BITS_NONE,
INIT_MM_CONTEXT(init_mm)
+ INIT_MM_SFRAME
};
void setup_initial_init_mm(void *start_code, void *end_code,
--
2.47.0
^ permalink raw reply related [flat|nested] 119+ messages in thread
* [PATCH v3 10/19] unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_SFRAME
2024-10-28 21:47 [PATCH v3 00/19] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (8 preceding siblings ...)
2024-10-28 21:47 ` [PATCH v3 09/19] unwind: Introduce sframe user space unwinding Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
2024-10-29 13:14 ` Peter Zijlstra
2024-10-28 21:47 ` [PATCH v3 11/19] unwind: Add deferred user space unwinding API Josh Poimboeuf
` (12 subsequent siblings)
22 siblings, 2 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
Binutils 2.41 supports generating sframe v2 for x86_64. It works well
in testing so enable it.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f91098d6f535..3e6f4c80c5b5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -290,6 +290,7 @@ config X86
select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL
select HAVE_UNSTABLE_SCHED_CLOCK
select HAVE_UNWIND_USER_FP if X86_64
+ select HAVE_UNWIND_USER_SFRAME if X86_64
select HAVE_USER_RETURN_NOTIFIER
select HAVE_GENERIC_VDSO
select VDSO_GETRANDOM if X86_64
--
2.47.0
^ permalink raw reply related [flat|nested] 119+ messages in thread
* [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-28 21:47 [PATCH v3 00/19] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (9 preceding siblings ...)
2024-10-28 21:47 ` [PATCH v3 10/19] unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_SFRAME Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
` (4 more replies)
2024-10-28 21:47 ` [PATCH v3 12/19] perf: Remove get_perf_callchain() 'init_nr' argument Josh Poimboeuf
` (11 subsequent siblings)
22 siblings, 5 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
Add unwind_user_deferred() which allows callers to schedule task work to
unwind the user space stack before returning to user space. This solves
several problems for its callers:
- Ensure the unwind happens in task context even if the caller may
  be running in interrupt context.
- Only do the unwind once, even if called multiple times either by the
same caller or multiple callers.
- Create a "context context" cookie which allows trace post-processing
to correlate kernel unwinds/traces with the user unwind.
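
For illustration, a minimal sketch of how a caller might use this API
(names prefixed with my_ are hypothetical, and note unwind_user_deferred()
must not be called from NMI context):

    static struct unwind_callback my_cb;

    /* Runs in task context shortly before the return to user space. */
    static void my_unwind_done(struct unwind_stacktrace *trace,
                               u64 ctx_cookie, void *data)
    {
            for (unsigned int i = 0; i < trace->nr; i++)
                    my_emit_user_frame(ctx_cookie, trace->entries[i]);
    }

    /* Once, at init: */
    ret = unwind_user_register(&my_cb, my_unwind_done);

    /* From a (non-NMI) event handler: */
    u64 cookie;
    ret = unwind_user_deferred(&my_cb, &cookie, my_data);
    /* Tag the corresponding kernel trace with @cookie for stitching. */
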
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
include/linux/entry-common.h | 3 +
include/linux/sched.h | 5 +
include/linux/unwind_user.h | 56 ++++++++++
kernel/fork.c | 4 +
kernel/unwind/user.c | 199 +++++++++++++++++++++++++++++++++++
5 files changed, 267 insertions(+)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 1e50cdb83ae5..efbe8f964f31 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -12,6 +12,7 @@
#include <linux/resume_user_mode.h>
#include <linux/tick.h>
#include <linux/kmsan.h>
+#include <linux/unwind_user.h>
#include <asm/entry-common.h>
@@ -111,6 +112,8 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
CT_WARN_ON(__ct_state() != CT_STATE_USER);
user_exit_irqoff();
+ unwind_enter_from_user_mode();
+
instrumentation_begin();
kmsan_unpoison_entry_regs(regs);
trace_hardirqs_off_finish();
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5007a8e2d640..31b6f1d763ef 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -47,6 +47,7 @@
#include <linux/livepatch_sched.h>
#include <linux/uidgid_types.h>
#include <asm/kmap_size.h>
+#include <linux/unwind_user.h>
/* task_struct member predeclarations (sorted alphabetically): */
struct audit_context;
@@ -1592,6 +1593,10 @@ struct task_struct {
struct user_event_mm *user_event_mm;
#endif
+#ifdef CONFIG_UNWIND_USER
+ struct unwind_task_info unwind_task_info;
+#endif
+
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index cde0fde4923e..98e236c843b1 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -3,6 +3,9 @@
#define _LINUX_UNWIND_USER_H
#include <linux/types.h>
+#include <linux/percpu-defs.h>
+
+#define UNWIND_MAX_CALLBACKS 4
enum unwind_user_type {
UNWIND_USER_TYPE_NONE,
@@ -30,6 +33,26 @@ struct unwind_user_state {
bool done;
};
+struct unwind_task_info {
+ u64 ctx_cookie;
+ u32 pending_callbacks;
+ u64 last_cookies[UNWIND_MAX_CALLBACKS];
+ void *privs[UNWIND_MAX_CALLBACKS];
+ unsigned long *entries;
+ struct callback_head work;
+};
+
+typedef void (*unwind_callback_t)(struct unwind_stacktrace *trace,
+ u64 ctx_cookie, void *data);
+
+struct unwind_callback {
+ unwind_callback_t func;
+ int idx;
+};
+
+
+#ifdef CONFIG_UNWIND_USER
+
/* Synchronous interfaces: */
int unwind_user_start(struct unwind_user_state *state);
@@ -40,4 +63,37 @@ int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
#define for_each_user_frame(state) \
for (unwind_user_start((state)); !(state)->done; unwind_user_next((state)))
+
+/* Asynchronous interfaces: */
+
+void unwind_task_init(struct task_struct *task);
+void unwind_task_free(struct task_struct *task);
+
+int unwind_user_register(struct unwind_callback *callback, unwind_callback_t func);
+int unwind_user_unregister(struct unwind_callback *callback);
+
+int unwind_user_deferred(struct unwind_callback *callback, u64 *ctx_cookie, void *data);
+
+DECLARE_PER_CPU(u64, unwind_ctx_ctr);
+
+static __always_inline void unwind_enter_from_user_mode(void)
+{
+ __this_cpu_inc(unwind_ctx_ctr);
+}
+
+
+#else /* !CONFIG_UNWIND_USER */
+
+static inline void unwind_task_init(struct task_struct *task) {}
+static inline void unwind_task_free(struct task_struct *task) {}
+
+static inline int unwind_user_register(struct unwind_callback *callback, unwind_callback_t func) { return -ENOSYS; }
+static inline int unwind_user_unregister(struct unwind_callback *callback) { return -ENOSYS; }
+
+static inline int unwind_user_deferred(struct unwind_callback *callback, u64 *ctx_cookie, void *data) { return -ENOSYS; }
+
+static inline void unwind_enter_from_user_mode(void) {}
+
+#endif /* !CONFIG_UNWIND_USER */
+
#endif /* _LINUX_UNWIND_USER_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 60f14fbab956..d7580067853d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -105,6 +105,7 @@
#include <linux/rseq.h>
#include <uapi/linux/pidfd.h>
#include <linux/pidfs.h>
+#include <linux/unwind_user.h>
#include <linux/sframe.h>
#include <asm/pgalloc.h>
@@ -972,6 +973,7 @@ void __put_task_struct(struct task_struct *tsk)
WARN_ON(refcount_read(&tsk->usage));
WARN_ON(tsk == current);
+ unwind_task_free(tsk);
sched_ext_free(tsk);
io_uring_free(tsk);
cgroup_free(tsk);
@@ -2348,6 +2350,8 @@ __latent_entropy struct task_struct *copy_process(
p->bpf_ctx = NULL;
#endif
+ unwind_task_init(p);
+
/* Perform scheduler related setup. Assign this task to a CPU. */
retval = sched_fork(clone_flags, p);
if (retval)
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 8e47c80e3e54..ed7759c56551 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -10,6 +10,11 @@
#include <linux/unwind_user.h>
#include <linux/sframe.h>
#include <linux/uaccess.h>
+#include <linux/slab.h>
+#include <linux/task_work.h>
+#include <linux/mm.h>
+
+#define UNWIND_MAX_ENTRIES 512
#ifdef CONFIG_HAVE_UNWIND_USER_FP
#include <asm/unwind_user.h>
@@ -20,6 +25,12 @@ static struct unwind_user_frame fp_frame = {
static struct unwind_user_frame fp_frame;
#endif
+static struct unwind_callback *callbacks[UNWIND_MAX_CALLBACKS];
+static DECLARE_RWSEM(callbacks_rwsem);
+
+/* Counter for entries from user space */
+DEFINE_PER_CPU(u64, unwind_ctx_ctr);
+
int unwind_user_next(struct unwind_user_state *state)
{
struct unwind_user_frame _frame;
@@ -117,3 +128,191 @@ int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries)
return 0;
}
+
+/*
+ * The "context cookie" is a unique identifier which allows post-processing to
+ * correlate kernel trace(s) with user unwinds. It has the CPU id in the
+ * highest 16 bits and a per-CPU entry counter in the lower 48 bits.
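+ * For example (illustrative), CPU 3 with entry counter 0x2a yields the
+ * cookie 0x000300000000002a.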
+ */
+static u64 ctx_to_cookie(u64 cpu, u64 ctx)
+{
+ BUILD_BUG_ON(NR_CPUS > 65535);
+ return (ctx & ((1UL << 48) - 1)) | (cpu << 48);
+}
+
+/*
+ * Schedule a user space unwind to be done in task work before exiting the
+ * kernel.
+ *
+ * The @callback must have previously been registered with
+ * unwind_user_register().
+ *
+ * The @ctx_cookie output is a unique identifier which will also be passed to
+ * the callback function. It can be used to stitch kernel and user traces
+ * together in post-processing.
+ *
+ * If there are multiple calls to this function for a given @callback, the
+ * cookie will usually be the same and the callback will only be called once.
+ *
+ * The only exception is when the task has migrated to another CPU, *and* this
+ * is called while the task work is running (or has already run). Then a new
+ * cookie will be generated and the callback will be called again for the new
+ * cookie.
+ */
+int unwind_user_deferred(struct unwind_callback *callback, u64 *ctx_cookie, void *data)
+{
+ struct unwind_task_info *info = &current->unwind_task_info;
+ u64 cookie = info->ctx_cookie;
+ int idx = callback->idx;
+
+ if (WARN_ON_ONCE(in_nmi()))
+ return -EINVAL;
+
+ if (WARN_ON_ONCE(!callback->func || idx < 0))
+ return -EINVAL;
+
+ if (!current->mm)
+ return -EINVAL;
+
+ guard(irqsave)();
+
+ if (cookie && (info->pending_callbacks & (1 << idx)))
+ goto done;
+
+ /*
+ * If this is the first call from *any* caller since the most recent
+ * entry from user space, initialize the task context cookie and
+ * schedule the task work.
+ */
+ if (!cookie) {
+ u64 ctx_ctr = __this_cpu_read(unwind_ctx_ctr);
+ u64 cpu = raw_smp_processor_id();
+
+ cookie = ctx_to_cookie(cpu, ctx_ctr);
+
+ /*
+ * If called after task work has sent an unwind to the callback
+ * function but before the exit to user space, skip it as the
+ * previous call to the callback function should suffice.
+ *
+ * The only exception is if this task has migrated to another
+ * CPU since the first call to unwind_user_deferred(). The
+ * per-CPU context counter will have changed which will result
+ * in a new cookie and another unwind (see comment above
+ * function).
+ */
+ if (cookie == info->last_cookies[idx])
+ goto done;
+
+ info->ctx_cookie = cookie;
+ task_work_add(current, &info->work, TWA_RESUME);
+ }
+
+ info->pending_callbacks |= (1 << idx);
+ info->privs[idx] = data;
+ info->last_cookies[idx] = cookie;
+
+done:
+ if (ctx_cookie)
+ *ctx_cookie = cookie;
+ return 0;
+}
+
+static void unwind_user_task_work(struct callback_head *head)
+{
+ struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
+ struct task_struct *task = container_of(info, struct task_struct, unwind_task_info);
+ void *privs[UNWIND_MAX_CALLBACKS];
+ struct unwind_stacktrace trace;
+ unsigned long pending;
+ u64 cookie = 0;
+ int i;
+
+ BUILD_BUG_ON(UNWIND_MAX_CALLBACKS > 32);
+
+ if (WARN_ON_ONCE(task != current))
+ return;
+
+ if (WARN_ON_ONCE(!info->ctx_cookie || !info->pending_callbacks))
+ return;
+
+ scoped_guard(irqsave) {
+ pending = info->pending_callbacks;
+ cookie = info->ctx_cookie;
+
+ info->pending_callbacks = 0;
+ info->ctx_cookie = 0;
+ memcpy(privs, info->privs, sizeof(void *) * UNWIND_MAX_CALLBACKS);
+ }
+
+ if (!info->entries) {
+ info->entries = kmalloc(UNWIND_MAX_ENTRIES * sizeof(long),
+ GFP_KERNEL);
+ if (!info->entries)
+ return;
+ }
+
+ trace.entries = info->entries;
+ trace.nr = 0;
+ unwind_user(&trace, UNWIND_MAX_ENTRIES);
+
+ guard(rwsem_read)(&callbacks_rwsem);
+
+ for_each_set_bit(i, &pending, UNWIND_MAX_CALLBACKS) {
+ if (callbacks[i])
+ callbacks[i]->func(&trace, cookie, privs[i]);
+ }
+}
+
+int unwind_user_register(struct unwind_callback *callback, unwind_callback_t func)
+{
+ scoped_guard(rwsem_write, &callbacks_rwsem) {
+ for (int i = 0; i < UNWIND_MAX_CALLBACKS; i++) {
+ if (!callbacks[i]) {
+ callback->func = func;
+ callback->idx = i;
+ callbacks[i] = callback;
+ return 0;
+ }
+ }
+ }
+
+ callback->func = NULL;
+ callback->idx = -1;
+ return -ENOSPC;
+}
+
+int unwind_user_unregister(struct unwind_callback *callback)
+{
+ if (callback->idx < 0)
+ return -EINVAL;
+
+ scoped_guard(rwsem_write, &callbacks_rwsem)
+ callbacks[callback->idx] = NULL;
+
+ callback->func = NULL;
+ callback->idx = -1;
+
+ return 0;
+}
+
+void unwind_task_init(struct task_struct *task)
+{
+ struct unwind_task_info *info = &task->unwind_task_info;
+
+ info->entries = NULL;
+ info->pending_callbacks = 0;
+ info->ctx_cookie = 0;
+
+ memset(info->last_cookies, 0, sizeof(u64) * UNWIND_MAX_CALLBACKS);
+ memset(info->privs, 0, sizeof(void *) * UNWIND_MAX_CALLBACKS);
+
+ init_task_work(&info->work, unwind_user_task_work);
+}
+
+void unwind_task_free(struct task_struct *task)
+{
+ struct unwind_task_info *info = &task->unwind_task_info;
+
+ kfree(info->entries);
+}
--
2.47.0
^ permalink raw reply related [flat|nested] 119+ messages in thread
* [PATCH v3 12/19] perf: Remove get_perf_callchain() 'init_nr' argument
2024-10-28 21:47 [PATCH v3 00/19] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (10 preceding siblings ...)
2024-10-28 21:47 ` [PATCH v3 11/19] unwind: Add deferred user space unwinding API Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 13/19] perf: Remove get_perf_callchain() 'crosstask' argument Josh Poimboeuf
` (10 subsequent siblings)
22 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Namhyung Kim
The 'init_nr' argument has double duty: it's used to initialize both the
number of contexts and the number of stack entries. That's confusing
and the callers always pass zero anyway. Hard code the zero.
Acked-by: Namhyung Kim <Namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
include/linux/perf_event.h | 2 +-
kernel/bpf/stackmap.c | 4 ++--
kernel/events/callchain.c | 12 ++++++------
kernel/events/core.c | 2 +-
4 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fb908843f209..1e956ff9acd3 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1568,7 +1568,7 @@ DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
extern struct perf_callchain_entry *
-get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
+get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
u32 max_stack, bool crosstask, bool add_mark);
extern int get_callchain_buffers(int max_stack);
extern void put_callchain_buffers(void);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 3615c06b7dfa..ec3a57a5fba1 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -314,7 +314,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
if (max_depth > sysctl_perf_event_max_stack)
max_depth = sysctl_perf_event_max_stack;
- trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
+ trace = get_perf_callchain(regs, kernel, user, max_depth,
false, false);
if (unlikely(!trace))
@@ -451,7 +451,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
else if (kernel && task)
trace = get_callchain_entry_for_task(task, max_depth);
else
- trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
+ trace = get_perf_callchain(regs, kernel, user, max_depth,
crosstask, false);
if (unlikely(!trace) || trace->nr < skip) {
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 8a47e52a454f..83834203e144 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -216,7 +216,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
}
struct perf_callchain_entry *
-get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
+get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
u32 max_stack, bool crosstask, bool add_mark)
{
struct perf_callchain_entry *entry;
@@ -227,11 +227,11 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
if (!entry)
return NULL;
- ctx.entry = entry;
- ctx.max_stack = max_stack;
- ctx.nr = entry->nr = init_nr;
- ctx.contexts = 0;
- ctx.contexts_maxed = false;
+ ctx.entry = entry;
+ ctx.max_stack = max_stack;
+ ctx.nr = entry->nr = 0;
+ ctx.contexts = 0;
+ ctx.contexts_maxed = false;
if (kernel && !user_mode(regs)) {
if (add_mark)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index df27d08a7232..1654d6e7c148 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7800,7 +7800,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
if (!kernel && !user)
return &__empty_callchain;
- callchain = get_perf_callchain(regs, 0, kernel, user,
+ callchain = get_perf_callchain(regs, kernel, user,
max_stack, crosstask, true);
return callchain ?: &__empty_callchain;
}
--
2.47.0
^ permalink raw reply related [flat|nested] 119+ messages in thread
* [PATCH v3 13/19] perf: Remove get_perf_callchain() 'crosstask' argument
2024-10-28 21:47 [PATCH v3 00/19] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (11 preceding siblings ...)
2024-10-28 21:47 ` [PATCH v3 12/19] perf: Remove get_perf_callchain() 'init_nr' argument Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
2024-10-28 21:48 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 14/19] perf: Simplify get_perf_callchain() user logic Josh Poimboeuf
` (9 subsequent siblings)
22 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
get_perf_callchain() doesn't support cross-task unwinding, so it doesn't
make much sense to have 'crosstask' as an argument.
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
include/linux/perf_event.h | 2 +-
kernel/bpf/stackmap.c | 12 ++++--------
kernel/events/callchain.c | 6 +-----
kernel/events/core.c | 9 +++++----
4 files changed, 11 insertions(+), 18 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1e956ff9acd3..788f6971d32d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1569,7 +1569,7 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
extern struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
- u32 max_stack, bool crosstask, bool add_mark);
+ u32 max_stack, bool add_mark);
extern int get_callchain_buffers(int max_stack);
extern void put_callchain_buffers(void);
extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index ec3a57a5fba1..ee9701337912 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -314,8 +314,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
if (max_depth > sysctl_perf_event_max_stack)
max_depth = sysctl_perf_event_max_stack;
- trace = get_perf_callchain(regs, kernel, user, max_depth,
- false, false);
+ trace = get_perf_callchain(regs, kernel, user, max_depth, false);
if (unlikely(!trace))
/* couldn't fetch the stack trace */
@@ -430,10 +429,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
if (task && user && !user_mode(regs))
goto err_fault;
- /* get_perf_callchain does not support crosstask user stack walking
- * but returns an empty stack instead of NULL.
- */
- if (crosstask && user) {
+ /* get_perf_callchain() does not support crosstask stack walking */
+ if (crosstask) {
err = -EOPNOTSUPP;
goto clear;
}
@@ -451,8 +448,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
else if (kernel && task)
trace = get_callchain_entry_for_task(task, max_depth);
else
- trace = get_perf_callchain(regs, kernel, user, max_depth,
- crosstask, false);
+ trace = get_perf_callchain(regs, kernel, user, max_depth, false);
if (unlikely(!trace) || trace->nr < skip) {
if (may_fault)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 83834203e144..655fb25a725b 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -217,7 +217,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
- u32 max_stack, bool crosstask, bool add_mark)
+ u32 max_stack, bool add_mark)
{
struct perf_callchain_entry *entry;
struct perf_callchain_entry_ctx ctx;
@@ -248,9 +248,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
}
if (regs) {
- if (crosstask)
- goto exit_put;
-
if (add_mark)
perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
@@ -260,7 +257,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
}
}
-exit_put:
put_callchain_entry(rctx);
return entry;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1654d6e7c148..ebf143aa427b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7792,16 +7792,17 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
{
bool kernel = !event->attr.exclude_callchain_kernel;
bool user = !event->attr.exclude_callchain_user;
- /* Disallow cross-task user callchains. */
- bool crosstask = event->ctx->task && event->ctx->task != current;
const u32 max_stack = event->attr.sample_max_stack;
struct perf_callchain_entry *callchain;
if (!kernel && !user)
return &__empty_callchain;
- callchain = get_perf_callchain(regs, kernel, user,
- max_stack, crosstask, true);
+ /* Disallow cross-task callchains. */
+ if (event->ctx->task && event->ctx->task != current)
+ return &__empty_callchain;
+
+ callchain = get_perf_callchain(regs, kernel, user, max_stack, true);
return callchain ?: &__empty_callchain;
}
--
2.47.0
^ permalink raw reply related [flat|nested] 119+ messages in thread
* [PATCH v3 14/19] perf: Simplify get_perf_callchain() user logic
2024-10-28 21:47 [PATCH v3 00/19] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (12 preceding siblings ...)
2024-10-28 21:47 ` [PATCH v3 13/19] perf: Remove get_perf_callchain() 'crosstask' argument Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
2024-10-28 21:48 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 15/19] perf: Add deferred user callchains Josh Poimboeuf
` (8 subsequent siblings)
22 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
Simplify the get_perf_callchain() user logic a bit. task_pt_regs()
should never be NULL.
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
kernel/events/callchain.c | 20 +++++++++-----------
1 file changed, 9 insertions(+), 11 deletions(-)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 655fb25a725b..2278402b7ac9 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -241,22 +241,20 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
if (user) {
if (!user_mode(regs)) {
- if (current->mm)
- regs = task_pt_regs(current);
- else
- regs = NULL;
+ if (!current->mm)
+ goto exit_put;
+ regs = task_pt_regs(current);
}
- if (regs) {
- if (add_mark)
- perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
+ if (add_mark)
+ perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
- start_entry_idx = entry->nr;
- perf_callchain_user(&ctx, regs);
- fixup_uretprobe_trampoline_entries(entry, start_entry_idx);
- }
+ start_entry_idx = entry->nr;
+ perf_callchain_user(&ctx, regs);
+ fixup_uretprobe_trampoline_entries(entry, start_entry_idx);
}
+exit_put:
put_callchain_entry(rctx);
return entry;
--
2.47.0
^ permalink raw reply related [flat|nested] 119+ messages in thread
* [PATCH v3 15/19] perf: Add deferred user callchains
2024-10-28 21:47 [PATCH v3 00/19] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (13 preceding siblings ...)
2024-10-28 21:47 ` [PATCH v3 14/19] perf: Simplify get_perf_callchain() user logic Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
2024-10-28 21:48 ` Josh Poimboeuf
` (2 more replies)
2024-10-28 21:47 ` [PATCH v3 16/19] perf tools: Minimal CALLCHAIN_DEFERRED support Josh Poimboeuf
` (7 subsequent siblings)
22 siblings, 3 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
Instead of attempting to unwind user space from the NMI handler, defer
it to run in task context by sending a self-IPI and then scheduling the
unwind to run in the IRQ's exit task work before returning to user space.
This allows the user stack page to be paged in if needed, avoids
duplicate unwinds for kernel-bound workloads, and prepares for SFrame
unwinding (so .sframe sections can be paged in on demand).
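
For post-processing, a sample whose callchain ends in
PERF_CONTEXT_USER_DEFERRED is held until the matching
PERF_RECORD_CALLCHAIN_DEFERRED record arrives for the same task. A rough
consumer-side sketch (hypothetical types and helpers, not the perf tool
code; assumes s->ips has room for the user frames):

    static void stitch_user_chain(struct pending_sample *s,
                                  const u64 *user_ips, u64 user_nr)
    {
            /* Drop the trailing PERF_CONTEXT_USER_DEFERRED marker. */
            s->nr--;
            memcpy(&s->ips[s->nr], user_ips, user_nr * sizeof(*user_ips));
            s->nr += user_nr;
    }
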
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/Kconfig | 3 ++
include/linux/perf_event.h | 10 ++++-
include/uapi/linux/perf_event.h | 22 +++++++++-
kernel/bpf/stackmap.c | 6 +--
kernel/events/callchain.c | 11 ++++-
kernel/events/core.c | 63 ++++++++++++++++++++++++++-
tools/include/uapi/linux/perf_event.h | 22 +++++++++-
7 files changed, 129 insertions(+), 8 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index e769c39dd221..33449485eafd 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -446,6 +446,9 @@ config HAVE_UNWIND_USER_SFRAME
bool
select UNWIND_USER
+config HAVE_PERF_CALLCHAIN_DEFERRED
+ bool
+
config HAVE_PERF_REGS
bool
help
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 788f6971d32d..2193b3d16820 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -808,9 +808,11 @@ struct perf_event {
unsigned long pending_addr; /* SIGTRAP */
struct irq_work pending_irq;
struct irq_work pending_disable_irq;
+ struct irq_work pending_unwind_irq;
struct callback_head pending_task;
unsigned int pending_work;
struct rcuwait pending_work_wait;
+ unsigned int pending_unwind;
atomic_t event_limit;
@@ -1569,12 +1571,18 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
extern struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
- u32 max_stack, bool add_mark);
+ u32 max_stack, bool add_mark, bool defer_user);
extern int get_callchain_buffers(int max_stack);
extern void put_callchain_buffers(void);
extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
extern void put_callchain_entry(int rctx);
+#ifdef CONFIG_HAVE_PERF_CALLCHAIN_DEFERRED
+extern void perf_callchain_user_deferred(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
+#else
+static inline void perf_callchain_user_deferred(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs) {}
+#endif
+
extern int sysctl_perf_event_max_stack;
extern int sysctl_perf_event_max_contexts_per_stack;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 4842c36fdf80..6d0524b7d082 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -460,7 +460,8 @@ struct perf_event_attr {
inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
remove_on_exec : 1, /* event is removed from task on exec */
sigtrap : 1, /* send synchronous SIGTRAP on event */
- __reserved_1 : 26;
+ defer_callchain: 1, /* generate PERF_RECORD_CALLCHAIN_DEFERRED records */
+ __reserved_1 : 25;
union {
__u32 wakeup_events; /* wakeup every n events */
@@ -1217,6 +1218,24 @@ enum perf_event_type {
*/
PERF_RECORD_AUX_OUTPUT_HW_ID = 21,
+ /*
+ * This user callchain capture was deferred until shortly before
+ * returning to user space. Previous samples would have kernel
+ * callchains only and they need to be stitched with this to make full
+ * callchains.
+ *
+ * TODO: do PERF_SAMPLE_{REGS,STACK}_USER also need deferral?
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u64 ctx_cookie;
+ * u64 nr;
+ * u64 ips[nr];
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_CALLCHAIN_DEFERRED = 22,
+
PERF_RECORD_MAX, /* non-ABI */
};
@@ -1247,6 +1266,7 @@ enum perf_callchain_context {
PERF_CONTEXT_HV = (__u64)-32,
PERF_CONTEXT_KERNEL = (__u64)-128,
PERF_CONTEXT_USER = (__u64)-512,
+ PERF_CONTEXT_USER_DEFERRED = (__u64)-640,
PERF_CONTEXT_GUEST = (__u64)-2048,
PERF_CONTEXT_GUEST_KERNEL = (__u64)-2176,
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index ee9701337912..f073ebaf9c30 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -314,8 +314,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
if (max_depth > sysctl_perf_event_max_stack)
max_depth = sysctl_perf_event_max_stack;
- trace = get_perf_callchain(regs, kernel, user, max_depth, false);
-
+ trace = get_perf_callchain(regs, kernel, user, max_depth, false, false);
if (unlikely(!trace))
/* couldn't fetch the stack trace */
return -EFAULT;
@@ -448,7 +447,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
else if (kernel && task)
trace = get_callchain_entry_for_task(task, max_depth);
else
- trace = get_perf_callchain(regs, kernel, user, max_depth, false);
+ trace = get_perf_callchain(regs, kernel, user, max_depth,
+ false, false);
if (unlikely(!trace) || trace->nr < skip) {
if (may_fault)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 2278402b7ac9..eeb15ba0137f 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -217,7 +217,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
- u32 max_stack, bool add_mark)
+ u32 max_stack, bool add_mark, bool defer_user)
{
struct perf_callchain_entry *entry;
struct perf_callchain_entry_ctx ctx;
@@ -246,6 +246,15 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
regs = task_pt_regs(current);
}
+ if (defer_user) {
+ /*
+ * Foretell the coming of PERF_RECORD_CALLCHAIN_DEFERRED
+ * which can be stitched to this one.
+ */
+ perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
+ goto exit_put;
+ }
+
if (add_mark)
perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ebf143aa427b..bf97b2fa8a9c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -55,11 +55,14 @@
#include <linux/pgtable.h>
#include <linux/buildid.h>
#include <linux/task_work.h>
+#include <linux/unwind_user.h>
#include "internal.h"
#include <asm/irq_regs.h>
+static struct unwind_callback perf_unwind_callback_cb;
+
typedef int (*remote_function_f)(void *);
struct remote_function_call {
@@ -6955,6 +6958,53 @@ static void perf_pending_irq(struct irq_work *entry)
perf_swevent_put_recursion_context(rctx);
}
+static void perf_pending_unwind_irq(struct irq_work *entry)
+{
+ struct perf_event *event = container_of(entry, struct perf_event, pending_unwind_irq);
+
+ if (event->pending_unwind) {
+ unwind_user_deferred(&perf_unwind_callback_cb, NULL, event);
+ event->pending_unwind = 0;
+ }
+}
+
+struct perf_callchain_deferred_event {
+ struct perf_event_header header;
+ u64 ctx_cookie;
+ u64 nr;
+ u64 ips[];
+};
+
+static void perf_event_callchain_deferred(struct unwind_stacktrace *trace,
+ u64 ctx_cookie, void *_data)
+{
+ struct perf_callchain_deferred_event deferred_event;
+ u64 callchain_context = PERF_CONTEXT_USER;
+ struct perf_output_handle handle;
+ struct perf_event *event = _data;
+ struct perf_sample_data data;
+ u64 nr = trace->nr + 1 /* callchain_context */;
+
+ deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
+ deferred_event.header.misc = PERF_RECORD_MISC_USER;
+ deferred_event.header.size = sizeof(deferred_event) + (nr * sizeof(u64));
+
+ deferred_event.ctx_cookie = ctx_cookie;
+ deferred_event.nr = nr;
+
+ perf_event_header__init_id(&deferred_event.header, &data, event);
+
+ if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
+ return;
+
+ perf_output_put(&handle, deferred_event);
+ perf_output_put(&handle, callchain_context);
+ perf_output_copy(&handle, trace->entries, trace->nr * sizeof(u64));
+ perf_event__output_id_sample(event, &handle, &data);
+
+ perf_output_end(&handle);
+}
+
static void perf_pending_task(struct callback_head *head)
{
struct perf_event *event = container_of(head, struct perf_event, pending_task);
@@ -7794,6 +7844,8 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
bool user = !event->attr.exclude_callchain_user;
const u32 max_stack = event->attr.sample_max_stack;
struct perf_callchain_entry *callchain;
+ bool defer_user = IS_ENABLED(CONFIG_UNWIND_USER) &&
+ event->attr.defer_callchain;
if (!kernel && !user)
return &__empty_callchain;
@@ -7802,7 +7854,14 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
if (event->ctx->task && event->ctx->task != current)
return &__empty_callchain;
- callchain = get_perf_callchain(regs, kernel, user, max_stack, true);
+ callchain = get_perf_callchain(regs, kernel, user, max_stack, true,
+ defer_user);
+
+ if (user && defer_user && !event->pending_unwind) {
+ event->pending_unwind = 1;
+ irq_work_queue(&event->pending_unwind_irq);
+ }
+
return callchain ?: &__empty_callchain;
}
@@ -12171,6 +12230,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
init_waitqueue_head(&event->waitq);
init_irq_work(&event->pending_irq, perf_pending_irq);
+ event->pending_unwind_irq = IRQ_WORK_INIT_HARD(perf_pending_unwind_irq);
event->pending_disable_irq = IRQ_WORK_INIT_HARD(perf_pending_disable);
init_task_work(&event->pending_task, perf_pending_task);
rcuwait_init(&event->pending_work_wait);
@@ -14093,6 +14153,7 @@ void __init perf_event_init(void)
perf_tp_register();
perf_event_init_cpu(smp_processor_id());
register_reboot_notifier(&perf_reboot_notifier);
+ unwind_user_register(&perf_unwind_callback_cb, perf_event_callchain_deferred);
ret = init_hw_breakpoint();
WARN(ret, "hw_breakpoint initialization failed with: %d", ret);
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 4842c36fdf80..6d0524b7d082 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -460,7 +460,8 @@ struct perf_event_attr {
inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
remove_on_exec : 1, /* event is removed from task on exec */
sigtrap : 1, /* send synchronous SIGTRAP on event */
- __reserved_1 : 26;
+ defer_callchain: 1, /* generate PERF_RECORD_CALLCHAIN_DEFERRED records */
+ __reserved_1 : 25;
union {
__u32 wakeup_events; /* wakeup every n events */
@@ -1217,6 +1218,24 @@ enum perf_event_type {
*/
PERF_RECORD_AUX_OUTPUT_HW_ID = 21,
+ /*
+ * This user callchain capture was deferred until shortly before
+ * returning to user space. Previous samples would have kernel
+ * callchains only and they need to be stitched with this to make full
+ * callchains.
+ *
+ * TODO: do PERF_SAMPLE_{REGS,STACK}_USER also need deferral?
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u64 ctx_cookie;
+ * u64 nr;
+ * u64 ips[nr];
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_CALLCHAIN_DEFERRED = 22,
+
PERF_RECORD_MAX, /* non-ABI */
};
@@ -1247,6 +1266,7 @@ enum perf_callchain_context {
PERF_CONTEXT_HV = (__u64)-32,
PERF_CONTEXT_KERNEL = (__u64)-128,
PERF_CONTEXT_USER = (__u64)-512,
+ PERF_CONTEXT_USER_DEFERRED = (__u64)-640,
PERF_CONTEXT_GUEST = (__u64)-2048,
PERF_CONTEXT_GUEST_KERNEL = (__u64)-2176,
--
2.47.0
^ permalink raw reply related [flat|nested] 119+ messages in thread
* [PATCH v3 16/19] perf tools: Minimal CALLCHAIN_DEFERRED support
2024-10-28 21:47 [PATCH v3 00/19] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (14 preceding siblings ...)
2024-10-28 21:47 ` [PATCH v3 15/19] perf: Add deferred user callchains Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
2024-10-28 21:48 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 17/19] perf record: Enable defer_callchain for user callchains Josh Poimboeuf
` (6 subsequent siblings)
22 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
From: Namhyung Kim <namhyung@kernel.org>
Add a new event type for deferred callchains and a new callback for the
struct perf_tool. For now it doesn't actually handle the deferred
callchains; it just marks the sample if it has PERF_CONTEXT_USER_DEFERRED
in the callchain array.
With this change, perf report can at least dump the raw data. Recording
such data requires the next commit to enable attr.defer_callchain, but if
you already have a data file, it'll show the following result.
$ perf report -D
...
0x5fe0@perf.data [0x40]: event: 22
.
. ... raw event: size 64 bytes
. 0000: 16 00 00 00 02 00 40 00 02 00 00 00 00 00 00 00 ......@.........
. 0010: 00 fe ff ff ff ff ff ff 4b d3 3f 25 45 7f 00 00 ........K.?%E...
. 0020: 21 03 00 00 21 03 00 00 43 02 12 ab 05 00 00 00 !...!...C.......
. 0030: 00 00 00 00 00 00 00 00 09 00 00 00 00 00 00 00 ................
0 24344920643 0x5fe0 [0x40]: PERF_RECORD_CALLCHAIN_DEFERRED(IP, 0x2): 801/801: 0
... FP chain: nr:2
..... 0: fffffffffffffe00
..... 1: 00007f45253fd34b
: unhandled!
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
tools/lib/perf/include/perf/event.h | 7 +++++++
tools/perf/util/event.c | 1 +
tools/perf/util/evsel.c | 15 +++++++++++++++
tools/perf/util/machine.c | 1 +
tools/perf/util/perf_event_attr_fprintf.c | 1 +
tools/perf/util/sample.h | 3 ++-
tools/perf/util/session.c | 17 +++++++++++++++++
tools/perf/util/tool.c | 1 +
tools/perf/util/tool.h | 3 ++-
9 files changed, 47 insertions(+), 2 deletions(-)
diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
index 37bb7771d914..f643a6a2b9fc 100644
--- a/tools/lib/perf/include/perf/event.h
+++ b/tools/lib/perf/include/perf/event.h
@@ -151,6 +151,12 @@ struct perf_record_switch {
__u32 next_prev_tid;
};
+struct perf_record_callchain_deferred {
+ struct perf_event_header header;
+ __u64 nr;
+ __u64 ips[];
+};
+
struct perf_record_header_attr {
struct perf_event_header header;
struct perf_event_attr attr;
@@ -494,6 +500,7 @@ union perf_event {
struct perf_record_read read;
struct perf_record_throttle throttle;
struct perf_record_sample sample;
+ struct perf_record_callchain_deferred callchain_deferred;
struct perf_record_bpf_event bpf;
struct perf_record_ksymbol ksymbol;
struct perf_record_text_poke_event text_poke;
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index aac96d5d1917..8cdec373db44 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -58,6 +58,7 @@ static const char *perf_event__names[] = {
[PERF_RECORD_CGROUP] = "CGROUP",
[PERF_RECORD_TEXT_POKE] = "TEXT_POKE",
[PERF_RECORD_AUX_OUTPUT_HW_ID] = "AUX_OUTPUT_HW_ID",
+ [PERF_RECORD_CALLCHAIN_DEFERRED] = "CALLCHAIN_DEFERRED",
[PERF_RECORD_HEADER_ATTR] = "ATTR",
[PERF_RECORD_HEADER_EVENT_TYPE] = "EVENT_TYPE",
[PERF_RECORD_HEADER_TRACING_DATA] = "TRACING_DATA",
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index dbf9c8cee3c5..701092d6b1b6 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2676,6 +2676,18 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
data->data_src = PERF_MEM_DATA_SRC_NONE;
data->vcpu = -1;
+ if (event->header.type == PERF_RECORD_CALLCHAIN_DEFERRED) {
+ const u64 max_callchain_nr = UINT64_MAX / sizeof(u64);
+
+ data->callchain = (struct ip_callchain *)&event->callchain_deferred.nr;
+ if (data->callchain->nr > max_callchain_nr)
+ return -EFAULT;
+
+ if (evsel->core.attr.sample_id_all)
+ perf_evsel__parse_id_sample(evsel, event, data);
+ return 0;
+ }
+
if (event->header.type != PERF_RECORD_SAMPLE) {
if (!evsel->core.attr.sample_id_all)
return 0;
@@ -2806,6 +2818,9 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
if (data->callchain->nr > max_callchain_nr)
return -EFAULT;
sz = data->callchain->nr * sizeof(u64);
+ if (evsel->core.attr.defer_callchain && data->callchain->nr >= 1 &&
+ data->callchain->ips[data->callchain->nr - 1] == PERF_CONTEXT_USER_DEFERRED)
+ data->deferred_callchain = true;
OVERFLOW_CHECK(array, sz, max_size);
array = (void *)array + sz;
}
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index fad227b625d1..f367577c91ff 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -2085,6 +2085,7 @@ static int add_callchain_ip(struct thread *thread,
*cpumode = PERF_RECORD_MISC_KERNEL;
break;
case PERF_CONTEXT_USER:
+ case PERF_CONTEXT_USER_DEFERRED:
*cpumode = PERF_RECORD_MISC_USER;
break;
default:
diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index 59fbbba79697..113845b35110 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -321,6 +321,7 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
PRINT_ATTRf(inherit_thread, p_unsigned);
PRINT_ATTRf(remove_on_exec, p_unsigned);
PRINT_ATTRf(sigtrap, p_unsigned);
+ PRINT_ATTRf(defer_callchain, p_unsigned);
PRINT_ATTRn("{ wakeup_events, wakeup_watermark }", wakeup_events, p_unsigned, false);
PRINT_ATTRf(bp_type, p_unsigned);
diff --git a/tools/perf/util/sample.h b/tools/perf/util/sample.h
index 70b2c3135555..010659dc80f8 100644
--- a/tools/perf/util/sample.h
+++ b/tools/perf/util/sample.h
@@ -108,7 +108,8 @@ struct perf_sample {
u16 p_stage_cyc;
u16 retire_lat;
};
- bool no_hw_idx; /* No hw_idx collected in branch_stack */
+ bool no_hw_idx; /* No hw_idx collected in branch_stack */
+ bool deferred_callchain; /* Has deferred user callchains */
char insn[MAX_INSN];
void *raw_data;
struct ip_callchain *callchain;
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index dbaf07bf6c5f..1248a0317a2f 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -714,6 +714,7 @@ static perf_event__swap_op perf_event__swap_ops[] = {
[PERF_RECORD_CGROUP] = perf_event__cgroup_swap,
[PERF_RECORD_TEXT_POKE] = perf_event__text_poke_swap,
[PERF_RECORD_AUX_OUTPUT_HW_ID] = perf_event__all64_swap,
+ [PERF_RECORD_CALLCHAIN_DEFERRED] = perf_event__all64_swap,
[PERF_RECORD_HEADER_ATTR] = perf_event__hdr_attr_swap,
[PERF_RECORD_HEADER_EVENT_TYPE] = perf_event__event_type_swap,
[PERF_RECORD_HEADER_TRACING_DATA] = perf_event__tracing_data_swap,
@@ -1107,6 +1108,19 @@ static void dump_sample(struct evsel *evsel, union perf_event *event,
sample_read__printf(sample, evsel->core.attr.read_format);
}
+static void dump_deferred_callchain(struct evsel *evsel, union perf_event *event,
+ struct perf_sample *sample)
+{
+ if (!dump_trace)
+ return;
+
+ printf("(IP, 0x%x): %d/%d: %#" PRIx64 "\n",
+ event->header.misc, sample->pid, sample->tid, sample->ip);
+
+ if (evsel__has_callchain(evsel))
+ callchain__printf(evsel, sample);
+}
+
static void dump_read(struct evsel *evsel, union perf_event *event)
{
struct perf_record_read *read_event = &event->read;
@@ -1327,6 +1341,9 @@ static int machines__deliver_event(struct machines *machines,
return tool->text_poke(tool, event, sample, machine);
case PERF_RECORD_AUX_OUTPUT_HW_ID:
return tool->aux_output_hw_id(tool, event, sample, machine);
+ case PERF_RECORD_CALLCHAIN_DEFERRED:
+ dump_deferred_callchain(evsel, event, sample);
+ return tool->callchain_deferred(tool, event, sample, evsel, machine);
default:
++evlist->stats.nr_unknown_events;
return -1;
diff --git a/tools/perf/util/tool.c b/tools/perf/util/tool.c
index 3b7f390f26eb..e78f16de912e 100644
--- a/tools/perf/util/tool.c
+++ b/tools/perf/util/tool.c
@@ -259,6 +259,7 @@ void perf_tool__init(struct perf_tool *tool, bool ordered_events)
tool->read = process_event_sample_stub;
tool->throttle = process_event_stub;
tool->unthrottle = process_event_stub;
+ tool->callchain_deferred = process_event_sample_stub;
tool->attr = process_event_synth_attr_stub;
tool->event_update = process_event_synth_event_update_stub;
tool->tracing_data = process_event_synth_tracing_data_stub;
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index db1c7642b0d1..9987bbde6d5e 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -42,7 +42,8 @@ enum show_feature_header {
struct perf_tool {
event_sample sample,
- read;
+ read,
+ callchain_deferred;
event_op mmap,
mmap2,
comm,
--
2.47.0
^ permalink raw reply related [flat|nested] 119+ messages in thread
* [PATCH v3 17/19] perf record: Enable defer_callchain for user callchains
2024-10-28 21:47 [PATCH v3 00/19] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (15 preceding siblings ...)
2024-10-28 21:47 ` [PATCH v3 16/19] perf tools: Minimal CALLCHAIN_DEFERRED support Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
2024-10-28 21:48 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 18/19] perf script: Display PERF_RECORD_CALLCHAIN_DEFERRED Josh Poimboeuf
` (5 subsequent siblings)
22 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
From: Namhyung Kim <namhyung@kernel.org>
And add the missing feature detection logic to clear the flag on old
kernels.
$ perf record -g -vv true
...
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0 (PERF_COUNT_HW_CPU_CYCLES)
{ sample_period, sample_freq } 4000
sample_type IP|TID|TIME|CALLCHAIN|PERIOD
read_format ID|LOST
disabled 1
inherit 1
mmap 1
comm 1
freq 1
enable_on_exec 1
task 1
sample_id_all 1
mmap2 1
comm_exec 1
ksymbol 1
bpf_event 1
defer_callchain 1
------------------------------------------------------------
sys_perf_event_open: pid 162755 cpu 0 group_fd -1 flags 0x8
sys_perf_event_open failed, error -22
switching off deferred callchain support
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
tools/perf/util/evsel.c | 17 ++++++++++++++++-
tools/perf/util/evsel.h | 1 +
2 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 701092d6b1b6..ad89644b32f2 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -912,6 +912,14 @@ static void __evsel__config_callchain(struct evsel *evsel, struct record_opts *o
}
}
+ if (param->record_mode == CALLCHAIN_FP && !attr->exclude_callchain_user) {
+ /*
+ * Enable deferred callchains optimistically. It'll be switched
+ * off later if the kernel doesn't support it.
+ */
+ attr->defer_callchain = 1;
+ }
+
if (function) {
pr_info("Disabling user space callchains for function trace event.\n");
attr->exclude_callchain_user = 1;
@@ -2089,6 +2097,8 @@ static int __evsel__prepare_open(struct evsel *evsel, struct perf_cpu_map *cpus,
static void evsel__disable_missing_features(struct evsel *evsel)
{
+ if (perf_missing_features.defer_callchain)
+ evsel->core.attr.defer_callchain = 0;
if (perf_missing_features.branch_counters)
evsel->core.attr.branch_sample_type &= ~PERF_SAMPLE_BRANCH_COUNTERS;
if (perf_missing_features.read_lost)
@@ -2144,7 +2154,12 @@ bool evsel__detect_missing_features(struct evsel *evsel)
* Must probe features in the order they were added to the
* perf_event_attr interface.
*/
- if (!perf_missing_features.branch_counters &&
+ if (!perf_missing_features.defer_callchain &&
+ evsel->core.attr.defer_callchain) {
+ perf_missing_features.defer_callchain = true;
+ pr_debug2("switching off deferred callchain support\n");
+ return true;
+ } else if (!perf_missing_features.branch_counters &&
(evsel->core.attr.branch_sample_type & PERF_SAMPLE_BRANCH_COUNTERS)) {
perf_missing_features.branch_counters = true;
pr_debug2("switching off branch counters support\n");
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 15e745a9a798..f0a1e1d78942 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -221,6 +221,7 @@ struct perf_missing_features {
bool weight_struct;
bool read_lost;
bool branch_counters;
+ bool defer_callchain;
};
extern struct perf_missing_features perf_missing_features;
--
2.47.0
* [PATCH v3 18/19] perf script: Display PERF_RECORD_CALLCHAIN_DEFERRED
2024-10-28 21:47 [PATCH v3 00/19] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (16 preceding siblings ...)
2024-10-28 21:47 ` [PATCH v3 17/19] perf record: Enable defer_callchain for user callchains Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
2024-10-28 21:48 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 19/19] perf tools: Merge deferred user callchains Josh Poimboeuf
` (4 subsequent siblings)
22 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
From: Namhyung Kim <namhyung@kernel.org>
Handle the deferred callchains in the script output.
$ perf script
perf 801 [000] 18.031793: 1 cycles:P:
ffffffff91a14c36 __intel_pmu_enable_all.isra.0+0x56 ([kernel.kallsyms])
ffffffff91d373e9 perf_ctx_enable+0x39 ([kernel.kallsyms])
ffffffff91d36af7 event_function+0xd7 ([kernel.kallsyms])
ffffffff91d34222 remote_function+0x42 ([kernel.kallsyms])
ffffffff91c1ebe1 generic_exec_single+0x61 ([kernel.kallsyms])
ffffffff91c1edac smp_call_function_single+0xec ([kernel.kallsyms])
ffffffff91d37a9d event_function_call+0x10d ([kernel.kallsyms])
ffffffff91d33557 perf_event_for_each_child+0x37 ([kernel.kallsyms])
ffffffff91d47324 _perf_ioctl+0x204 ([kernel.kallsyms])
ffffffff91d47c43 perf_ioctl+0x33 ([kernel.kallsyms])
ffffffff91e2f216 __x64_sys_ioctl+0x96 ([kernel.kallsyms])
ffffffff9265f1ae do_syscall_64+0x9e ([kernel.kallsyms])
ffffffff92800130 entry_SYSCALL_64+0xb0 ([kernel.kallsyms])
perf 801 [000] 18.031814: DEFERRED CALLCHAIN
7fb5fc22034b __GI___ioctl+0x3b (/usr/lib/x86_64-linux-gnu/libc.so.6)
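A tool opts in by registering the new callback; condensed from the
cmd_script() hunk below:

    struct perf_tool tool;

    perf_tool__init(&tool, /*ordered_events=*/true);
    tool.sample             = process_sample_event;
    tool.callchain_deferred = process_deferred_sample_event;  /* new */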
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
tools/perf/builtin-script.c | 89 +++++++++++++++++++++++++++++++++++++
1 file changed, 89 insertions(+)
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index a644787fa9e1..311580e25f5b 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -2540,6 +2540,93 @@ static int process_sample_event(const struct perf_tool *tool,
return ret;
}
+static int process_deferred_sample_event(const struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct evsel *evsel,
+ struct machine *machine)
+{
+ struct perf_script *scr = container_of(tool, struct perf_script, tool);
+ struct perf_event_attr *attr = &evsel->core.attr;
+ struct evsel_script *es = evsel->priv;
+ unsigned int type = output_type(attr->type);
+ struct addr_location al;
+ FILE *fp = es->fp;
+ int ret = 0;
+
+ if (output[type].fields == 0)
+ return 0;
+
+ /* Set thread to NULL to indicate al is not initialized */
+ addr_location__init(&al);
+
+ if (perf_time__ranges_skip_sample(scr->ptime_range, scr->range_num,
+ sample->time)) {
+ goto out_put;
+ }
+
+ if (debug_mode) {
+ if (sample->time < last_timestamp) {
+ pr_err("Samples misordered, previous: %" PRIu64
+ " this: %" PRIu64 "\n", last_timestamp,
+ sample->time);
+ nr_unordered++;
+ }
+ last_timestamp = sample->time;
+ goto out_put;
+ }
+
+ if (filter_cpu(sample))
+ goto out_put;
+
+ if (machine__resolve(machine, &al, sample) < 0) {
+ pr_err("problem processing %d event, skipping it.\n",
+ event->header.type);
+ ret = -1;
+ goto out_put;
+ }
+
+ if (al.filtered)
+ goto out_put;
+
+ if (!show_event(sample, evsel, al.thread, &al, NULL))
+ goto out_put;
+
+ if (evswitch__discard(&scr->evswitch, evsel))
+ goto out_put;
+
+ perf_sample__fprintf_start(scr, sample, al.thread, evsel,
+ PERF_RECORD_CALLCHAIN_DEFERRED, fp);
+ fprintf(fp, "DEFERRED CALLCHAIN");
+
+ if (PRINT_FIELD(IP)) {
+ struct callchain_cursor *cursor = NULL;
+
+ if (symbol_conf.use_callchain && sample->callchain) {
+ cursor = get_tls_callchain_cursor();
+ if (thread__resolve_callchain(al.thread, cursor, evsel,
+ sample, NULL, NULL,
+ scripting_max_stack)) {
+ pr_info("cannot resolve deferred callchains\n");
+ cursor = NULL;
+ }
+ }
+
+ fputc(cursor ? '\n' : ' ', fp);
+ sample__fprintf_sym(sample, &al, 0, output[type].print_ip_opts,
+ cursor, symbol_conf.bt_stop_list, fp);
+ }
+
+ fprintf(fp, "\n");
+
+ if (verbose > 0)
+ fflush(fp);
+
+out_put:
+ addr_location__exit(&al);
+ return ret;
+}
+
// Used when scr->per_event_dump is not set
static struct evsel_script es_stdout;
@@ -4325,6 +4412,7 @@ int cmd_script(int argc, const char **argv)
perf_tool__init(&script.tool, !unsorted_dump);
script.tool.sample = process_sample_event;
+ script.tool.callchain_deferred = process_deferred_sample_event;
script.tool.mmap = perf_event__process_mmap;
script.tool.mmap2 = perf_event__process_mmap2;
script.tool.comm = perf_event__process_comm;
@@ -4351,6 +4439,7 @@ int cmd_script(int argc, const char **argv)
script.tool.throttle = process_throttle_event;
script.tool.unthrottle = process_throttle_event;
script.tool.ordering_requires_timestamps = true;
+ script.tool.merge_deferred_callchains = false;
session = perf_session__new(&data, &script.tool);
if (IS_ERR(session))
return PTR_ERR(session);
--
2.47.0
* [PATCH v3 19/19] perf tools: Merge deferred user callchains
2024-10-28 21:47 [PATCH v3 00/19] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (17 preceding siblings ...)
2024-10-28 21:47 ` [PATCH v3 18/19] perf script: Display PERF_RECORD_CALLCHAIN_DEFERRED Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
2024-10-28 21:48 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 00/19] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (3 subsequent siblings)
22 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
From: Namhyung Kim <namhyung@kernel.org>
Save samples with deferred callchains in a separate list and deliver
them after merging the user callchains. If users don't want the merge,
they can set tool->merge_deferred_callchains to false to prevent this
behavior.
With the previous result, perf script will now show the merged
callchains.
$ perf script
perf 801 [000] 18.031793: 1 cycles:P:
ffffffff91a14c36 __intel_pmu_enable_all.isra.0+0x56 ([kernel.kallsyms])
ffffffff91d373e9 perf_ctx_enable+0x39 ([kernel.kallsyms])
ffffffff91d36af7 event_function+0xd7 ([kernel.kallsyms])
ffffffff91d34222 remote_function+0x42 ([kernel.kallsyms])
ffffffff91c1ebe1 generic_exec_single+0x61 ([kernel.kallsyms])
ffffffff91c1edac smp_call_function_single+0xec ([kernel.kallsyms])
ffffffff91d37a9d event_function_call+0x10d ([kernel.kallsyms])
ffffffff91d33557 perf_event_for_each_child+0x37 ([kernel.kallsyms])
ffffffff91d47324 _perf_ioctl+0x204 ([kernel.kallsyms])
ffffffff91d47c43 perf_ioctl+0x33 ([kernel.kallsyms])
ffffffff91e2f216 __x64_sys_ioctl+0x96 ([kernel.kallsyms])
ffffffff9265f1ae do_syscall_64+0x9e ([kernel.kallsyms])
ffffffff92800130 entry_SYSCALL_64+0xb0 ([kernel.kallsyms])
7fb5fc22034b __GI___ioctl+0x3b (/usr/lib/x86_64-linux-gnu/libc.so.6)
...
The old output can be obtained using the --no-merge-callchains option.
Also, perf report can now show the user callchain entry at the end.
$ perf report --no-children --percent-limit=0 --stdio -q -S __intel_pmu_enable_all.isra.0
# symbol: __intel_pmu_enable_all.isra.0
0.00% perf [kernel.kallsyms]
|
---__intel_pmu_enable_all.isra.0
perf_ctx_enable
event_function
remote_function
generic_exec_single
smp_call_function_single
event_function_call
perf_event_for_each_child
_perf_ioctl
perf_ioctl
__x64_sys_ioctl
do_syscall_64
entry_SYSCALL_64
__GI___ioctl
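The merge itself is a simple array splice: the original callchain ends
with a PERF_CONTEXT_USER_DEFERRED marker, which gets dropped and
replaced by the ips from the later CALLCHAIN_DEFERRED record.
Condensed from sample__merge_deferred_callchain() below ('orig' and
'def' stand for the two samples; error handling trimmed):

    u64 nr_orig = orig->callchain->nr - 1;  /* drop trailing marker */
    u64 nr_def  = def->callchain->nr;
    struct ip_callchain *chain;

    chain = calloc(1 + nr_orig + nr_def, sizeof(u64));  /* +1 for ->nr */
    chain->nr = nr_orig + nr_def;
    memcpy(chain->ips, orig->callchain->ips, nr_orig * sizeof(u64));
    memcpy(&chain->ips[nr_orig], def->callchain->ips, nr_def * sizeof(u64));
    orig->callchain = chain;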
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
tools/perf/Documentation/perf-script.txt | 5 ++
tools/perf/builtin-script.c | 5 +-
tools/perf/util/callchain.c | 24 +++++++++
tools/perf/util/callchain.h | 3 ++
tools/perf/util/evlist.c | 1 +
tools/perf/util/evlist.h | 1 +
tools/perf/util/session.c | 63 +++++++++++++++++++++++-
tools/perf/util/tool.c | 1 +
tools/perf/util/tool.h | 1 +
9 files changed, 102 insertions(+), 2 deletions(-)
diff --git a/tools/perf/Documentation/perf-script.txt b/tools/perf/Documentation/perf-script.txt
index b72866ef270b..69f018b3d199 100644
--- a/tools/perf/Documentation/perf-script.txt
+++ b/tools/perf/Documentation/perf-script.txt
@@ -518,6 +518,11 @@ include::itrace.txt[]
The known limitations include exception handing such as
setjmp/longjmp will have calls/returns not match.
+--merge-callchains::
+ Enable merging deferred user callchains if available. This is the
+ default behavior. If you want to see separate CALLCHAIN_DEFERRED
+ records for some reason, use --no-merge-callchains explicitly.
+
:GMEXAMPLECMD: script
:GMEXAMPLESUBCMD:
include::guest-files.txt[]
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 311580e25f5b..e3acf4979c36 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -4031,6 +4031,7 @@ int cmd_script(int argc, const char **argv)
bool header_only = false;
bool script_started = false;
bool unsorted_dump = false;
+ bool merge_deferred_callchains = true;
char *rec_script_path = NULL;
char *rep_script_path = NULL;
struct perf_session *session;
@@ -4184,6 +4185,8 @@ int cmd_script(int argc, const char **argv)
"Guest code can be found in hypervisor process"),
OPT_BOOLEAN('\0', "stitch-lbr", &script.stitch_lbr,
"Enable LBR callgraph stitching approach"),
+ OPT_BOOLEAN('\0', "merge-callchains", &merge_deferred_callchains,
+ "Enable merge deferred user callchains"),
OPTS_EVSWITCH(&script.evswitch),
OPT_END()
};
@@ -4439,7 +4442,7 @@ int cmd_script(int argc, const char **argv)
script.tool.throttle = process_throttle_event;
script.tool.unthrottle = process_throttle_event;
script.tool.ordering_requires_timestamps = true;
- script.tool.merge_deferred_callchains = false;
+ script.tool.merge_deferred_callchains = merge_deferred_callchains;
session = perf_session__new(&data, &script.tool);
if (IS_ERR(session))
return PTR_ERR(session);
diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c
index 0c7564747a14..d1114491c3da 100644
--- a/tools/perf/util/callchain.c
+++ b/tools/perf/util/callchain.c
@@ -1832,3 +1832,27 @@ int sample__for_each_callchain_node(struct thread *thread, struct evsel *evsel,
}
return 0;
}
+
+int sample__merge_deferred_callchain(struct perf_sample *sample_orig,
+ struct perf_sample *sample_callchain)
+{
+ u64 nr_orig = sample_orig->callchain->nr - 1;
+ u64 nr_deferred = sample_callchain->callchain->nr;
+ struct ip_callchain *callchain;
+
+ callchain = calloc(1 + nr_orig + nr_deferred, sizeof(u64));
+ if (callchain == NULL) {
+ sample_orig->deferred_callchain = false;
+ return -ENOMEM;
+ }
+
+ callchain->nr = nr_orig + nr_deferred;
+ /* copy except for the last PERF_CONTEXT_USER_DEFERRED */
+ memcpy(callchain->ips, sample_orig->callchain->ips, nr_orig * sizeof(u64));
+ /* copy deferred user callchains */
+ memcpy(&callchain->ips[nr_orig], sample_callchain->callchain->ips,
+ nr_deferred * sizeof(u64));
+
+ sample_orig->callchain = callchain;
+ return 0;
+}
diff --git a/tools/perf/util/callchain.h b/tools/perf/util/callchain.h
index 86ed9e4d04f9..89785125ed25 100644
--- a/tools/perf/util/callchain.h
+++ b/tools/perf/util/callchain.h
@@ -317,4 +317,7 @@ int sample__for_each_callchain_node(struct thread *thread, struct evsel *evsel,
struct perf_sample *sample, int max_stack,
bool symbols, callchain_iter_fn cb, void *data);
+int sample__merge_deferred_callchain(struct perf_sample *sample_orig,
+ struct perf_sample *sample_callchain);
+
#endif /* __PERF_CALLCHAIN_H */
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index f14b7e6ff1dc..f27d8c4a22aa 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -81,6 +81,7 @@ void evlist__init(struct evlist *evlist, struct perf_cpu_map *cpus,
evlist->ctl_fd.ack = -1;
evlist->ctl_fd.pos = -1;
evlist->nr_br_cntr = -1;
+ INIT_LIST_HEAD(&evlist->deferred_samples);
}
struct evlist *evlist__new(void)
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index bcc1c6984bb5..c26379366554 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -84,6 +84,7 @@ struct evlist {
int pos; /* index at evlist core object to check signals */
} ctl_fd;
struct event_enable_timer *eet;
+ struct list_head deferred_samples;
};
struct evsel_str_handler {
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 1248a0317a2f..e0a21b896b57 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1256,6 +1256,56 @@ static int evlist__deliver_sample(struct evlist *evlist, const struct perf_tool
&sample->read.one, machine);
}
+struct deferred_event {
+ struct list_head list;
+ union perf_event *event;
+};
+
+static int evlist__deliver_deferred_samples(struct evlist *evlist,
+ const struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct machine *machine)
+{
+ struct deferred_event *de, *tmp;
+ struct evsel *evsel;
+ int ret = 0;
+
+ if (!tool->merge_deferred_callchains) {
+ evsel = evlist__id2evsel(evlist, sample->id);
+ return tool->callchain_deferred(tool, event, sample,
+ evsel, machine);
+ }
+
+ list_for_each_entry_safe(de, tmp, &evlist->deferred_samples, list) {
+ struct perf_sample orig_sample;
+
+ ret = evlist__parse_sample(evlist, de->event, &orig_sample);
+ if (ret < 0) {
+ pr_err("failed to parse original sample\n");
+ break;
+ }
+
+ if (sample->tid != orig_sample.tid)
+ continue;
+
+ evsel = evlist__id2evsel(evlist, orig_sample.id);
+ sample__merge_deferred_callchain(&orig_sample, sample);
+ ret = evlist__deliver_sample(evlist, tool, de->event,
+ &orig_sample, evsel, machine);
+
+ if (orig_sample.deferred_callchain)
+ free(orig_sample.callchain);
+
+ list_del(&de->list);
+ free(de);
+
+ if (ret)
+ break;
+ }
+ return ret;
+}
+
static int machines__deliver_event(struct machines *machines,
struct evlist *evlist,
union perf_event *event,
@@ -1284,6 +1334,16 @@ static int machines__deliver_event(struct machines *machines,
return 0;
}
dump_sample(evsel, event, sample, perf_env__arch(machine->env));
+ if (sample->deferred_callchain && tool->merge_deferred_callchains) {
+ struct deferred_event *de = malloc(sizeof(*de));
+
+ if (de == NULL)
+ return -ENOMEM;
+
+ de->event = event;
+ list_add_tail(&de->list, &evlist->deferred_samples);
+ return 0;
+ }
return evlist__deliver_sample(evlist, tool, event, sample, evsel, machine);
case PERF_RECORD_MMAP:
return tool->mmap(tool, event, sample, machine);
@@ -1343,7 +1403,8 @@ static int machines__deliver_event(struct machines *machines,
return tool->aux_output_hw_id(tool, event, sample, machine);
case PERF_RECORD_CALLCHAIN_DEFERRED:
dump_deferred_callchain(evsel, event, sample);
- return tool->callchain_deferred(tool, event, sample, evsel, machine);
+ return evlist__deliver_deferred_samples(evlist, tool, event,
+ sample, machine);
default:
++evlist->stats.nr_unknown_events;
return -1;
diff --git a/tools/perf/util/tool.c b/tools/perf/util/tool.c
index e78f16de912e..385043e06627 100644
--- a/tools/perf/util/tool.c
+++ b/tools/perf/util/tool.c
@@ -238,6 +238,7 @@ void perf_tool__init(struct perf_tool *tool, bool ordered_events)
tool->cgroup_events = false;
tool->no_warn = false;
tool->show_feat_hdr = SHOW_FEAT_NO_HEADER;
+ tool->merge_deferred_callchains = true;
tool->sample = process_event_sample_stub;
tool->mmap = process_event_stub;
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index 9987bbde6d5e..d06580478ab1 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -87,6 +87,7 @@ struct perf_tool {
bool cgroup_events;
bool no_warn;
bool dont_split_sample_group;
+ bool merge_deferred_callchains;
enum show_feature_header show_feat_hdr;
};
--
2.47.0
* [PATCH v3 00/19] unwind, perf: sframe user space unwinding
2024-10-28 21:47 [PATCH v3 00/19] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (18 preceding siblings ...)
2024-10-28 21:47 ` [PATCH v3 19/19] perf tools: Merge deferred user callchains Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
2024-10-28 21:54 ` Josh Poimboeuf
` (2 subsequent siblings)
22 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
This has all the changes discussed in v2, plus VDSO sframe support and
Namhyung's perf tool patches (see detailed changelog below).
I did quite a bit of testing, it seems to work well. It still needs
some binutils and glibc patches which I'll send in a reply.
Questions for perf experts:
- Is the perf_event lifetime managed correctly or do we need to do
something to ensure it exists in unwind_user_task_work()?
Or alternatively is the original perf_event even needed in
unwind_user_task_work() or can a new one be created on demand?
- Is --call-graph=sframe needed for consistency?
- Should perf use the context cookie? Note that because the callback
is usually only called once for multiple NMIs in the same entry
context, it's possible for the PERF_RECORD_CALLCHAIN_DEFERRED event
to arrive *before* some of the corresponding kernel events. The
context cookie disambiguates the corner cases.
Based on tip/master.
Also at:
git://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git sframe-v3
v3:
- move the "deferred" logic out of perf and into unwind_user with new
unwind_user_deferred() interface [Steven, Mathieu]
- add more sframe sanity checks [Steven]
- make frame pointers optional depending on arch [Jens]
- fix perf event output [Namhyung]
- include Namhyung's perf tool patches
- enable sframe generation in VDSO
- fix build errors [robot]
v2: https://lore.kernel.org/cover.1726268190.git.jpoimboe@kernel.org
- rebase on v6.11-rc7
- reorganize the patches to add sframe first
- change to sframe v2
- add new perf event type: PERF_RECORD_CALLCHAIN_DEFERRED
- add new perf attribute: defer_callchain
v1: https://lore.kernel.org/cover.1699487758.git.jpoimboe@kernel.org
Some distros have started compiling frame pointers into all their
packages to enable the kernel to do system-wide profiling of user space.
Unfortunately that creates a runtime performance penalty across the
entire system. Using DWARF (or .eh_frame) instead isn't feasible
because of complexity and slowness.
For in-kernel unwinding we solved this problem with the creation of the
ORC unwinder for x86_64. Similarly, for user space the GNU assembler
has created the SFrame ("Simple Frame") v2 format starting with binutils
2.41.
These patches add support for unwinding user space from the kernel using
SFrame with perf. It should be easy to add user unwinding support for
other components like ftrace.
There were two main challenges:
1) Finding .sframe sections in shared/dlopened libraries
The kernel has no visibility to the contents of shared libraries.
This was solved by adding a PR_ADD_SFRAME option to prctl() which
allows the runtime linker to manually provide the in-memory address
of an .sframe section to the kernel.
2) Dealing with page faults
Keeping all binaries' sframe data pinned would likely waste a lot of
memory. Instead, read it from user space on demand. That can't be
done from perf NMI context due to page faults, so defer the unwind to
the next user exit. Since the NMI handler doesn't do exit work,
self-IPI and then schedule task work to be run on exit from the IPI.
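A bare-bones sketch of that flow, for illustration only (the series
wraps it in the unwind_user_deferred() API; per-task bookkeeping and
re-arming are omitted here):

    #include <linux/irq_work.h>
    #include <linux/sched.h>
    #include <linux/task_work.h>

    static struct callback_head unwind_twork;

    /* runs on the task's way back to user space; faulting is fine here */
    static void unwind_task_work(struct callback_head *head)
    {
            /* read .sframe from user memory, unwind, emit the event */
    }

    /* the "self-IPI" half: ordinary IRQ context, right after the NMI */
    static void unwind_irq_work(struct irq_work *work)
    {
            init_task_work(&unwind_twork, unwind_task_work);
            task_work_add(current, &unwind_twork, TWA_RESUME);
    }
    static DEFINE_IRQ_WORK(unwind_iwork, unwind_irq_work);

    /* from NMI (perf sample) context: */
    static void defer_user_unwind(void)
    {
            irq_work_queue(&unwind_iwork);
    }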
Special thanks to Indu for the original concept, and to Steven and Peter
for helping a lot with the design. And to Steven for letting me do it ;-)
Josh Poimboeuf (15):
x86/vdso: Fix DWARF generation for getrandom()
x86/asm: Avoid emitting DWARF CFI for non-VDSO
x86/asm: Fix VDSO DWARF generation with kernel IBT enabled
x86/vdso: Use SYM_FUNC_{START,END} in __kernel_vsyscall()
x86/vdso: Use CFI macros in __vdso_sgx_enter_enclave()
x86/vdso: Enable sframe generation in VDSO
unwind: Add user space unwinding API
unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_FP
unwind: Introduce sframe user space unwinding
unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_SFRAME
unwind: Add deferred user space unwinding API
perf: Remove get_perf_callchain() 'init_nr' argument
perf: Remove get_perf_callchain() 'crosstask' argument
perf: Simplify get_perf_callchain() user logic
perf: Add deferred user callchains
Namhyung Kim (4):
perf tools: Minimal CALLCHAIN_DEFERRED support
perf record: Enable defer_callchain for user callchains
perf script: Display PERF_RECORD_CALLCHAIN_DEFERRED
perf tools: Merge deferred user callchains
arch/Kconfig | 14 +
arch/x86/Kconfig | 2 +
arch/x86/entry/vdso/Makefile | 6 +-
arch/x86/entry/vdso/vdso-layout.lds.S | 5 +-
arch/x86/entry/vdso/vdso32/system_call.S | 10 +-
arch/x86/entry/vdso/vgetrandom-chacha.S | 3 +-
arch/x86/entry/vdso/vsgx.S | 19 +-
arch/x86/include/asm/dwarf2.h | 40 ++-
arch/x86/include/asm/linkage.h | 29 +-
arch/x86/include/asm/mmu.h | 2 +-
arch/x86/include/asm/unwind_user.h | 11 +
arch/x86/include/asm/vdso.h | 1 -
fs/binfmt_elf.c | 35 +-
include/linux/entry-common.h | 3 +
include/linux/mm_types.h | 3 +
include/linux/perf_event.h | 12 +-
include/linux/sched.h | 5 +
include/linux/sframe.h | 41 +++
include/linux/unwind_user.h | 99 ++++++
include/uapi/linux/elf.h | 1 +
include/uapi/linux/perf_event.h | 22 +-
include/uapi/linux/prctl.h | 3 +
kernel/Makefile | 1 +
kernel/bpf/stackmap.c | 14 +-
kernel/events/callchain.c | 47 +--
kernel/events/core.c | 70 +++-
kernel/fork.c | 14 +
kernel/sys.c | 11 +
kernel/unwind/Makefile | 2 +
kernel/unwind/sframe.c | 380 ++++++++++++++++++++++
kernel/unwind/sframe.h | 215 ++++++++++++
kernel/unwind/user.c | 318 ++++++++++++++++++
mm/init-mm.c | 6 +
tools/include/uapi/linux/perf_event.h | 22 +-
tools/lib/perf/include/perf/event.h | 7 +
tools/perf/Documentation/perf-script.txt | 5 +
tools/perf/builtin-script.c | 92 ++++++
tools/perf/util/callchain.c | 24 ++
tools/perf/util/callchain.h | 3 +
tools/perf/util/event.c | 1 +
tools/perf/util/evlist.c | 1 +
tools/perf/util/evlist.h | 1 +
tools/perf/util/evsel.c | 32 +-
tools/perf/util/evsel.h | 1 +
tools/perf/util/machine.c | 1 +
tools/perf/util/perf_event_attr_fprintf.c | 1 +
tools/perf/util/sample.h | 3 +-
tools/perf/util/session.c | 78 +++++
tools/perf/util/tool.c | 2 +
tools/perf/util/tool.h | 4 +-
50 files changed, 1634 insertions(+), 88 deletions(-)
create mode 100644 arch/x86/include/asm/unwind_user.h
create mode 100644 include/linux/sframe.h
create mode 100644 include/linux/unwind_user.h
create mode 100644 kernel/unwind/Makefile
create mode 100644 kernel/unwind/sframe.c
create mode 100644 kernel/unwind/sframe.h
create mode 100644 kernel/unwind/user.c
--
2.47.0
* [PATCH v3 01/19] x86/vdso: Fix DWARF generation for getrandom()
2024-10-28 21:47 ` [PATCH v3 01/19] x86/vdso: Fix DWARF generation for getrandom() Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
0 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
Enable DWARF generation for the VDSO implementation of getrandom().
Fixes: 33385150ac45 ("x86: vdso: Wire up getrandom() vDSO implementation")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/entry/vdso/vgetrandom-chacha.S | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/x86/entry/vdso/vgetrandom-chacha.S b/arch/x86/entry/vdso/vgetrandom-chacha.S
index bcba5639b8ee..cc82da9216fb 100644
--- a/arch/x86/entry/vdso/vgetrandom-chacha.S
+++ b/arch/x86/entry/vdso/vgetrandom-chacha.S
@@ -4,7 +4,7 @@
*/
#include <linux/linkage.h>
-#include <asm/frame.h>
+#include <asm/dwarf2.h>
.section .rodata, "a"
.align 16
@@ -22,7 +22,7 @@ CONSTANTS: .octa 0x6b20657479622d323320646e61707865
* rcx: number of 64-byte blocks to write to output
*/
SYM_FUNC_START(__arch_chacha20_blocks_nostack)
-
+ CFI_STARTPROC
.set output, %rdi
.set key, %rsi
.set counter, %rdx
@@ -175,4 +175,5 @@ SYM_FUNC_START(__arch_chacha20_blocks_nostack)
pxor temp,temp
ret
+ CFI_ENDPROC
SYM_FUNC_END(__arch_chacha20_blocks_nostack)
--
2.47.0
* [PATCH v3 02/19] x86/asm: Avoid emitting DWARF CFI for non-VDSO
2024-10-28 21:47 ` [PATCH v3 02/19] x86/asm: Avoid emitting DWARF CFI for non-VDSO Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
2024-10-30 17:19 ` Jens Remus
1 sibling, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
VDSO is the only part of the "kernel" using DWARF CFI directives. For
the kernel proper, ensure the CFI_* macros don't do anything.
These macros aren't yet being used outside of VDSO, so there's no
functional change.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/include/asm/dwarf2.h | 37 +++++++++++++++++++++++++++--------
1 file changed, 29 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/dwarf2.h b/arch/x86/include/asm/dwarf2.h
index 430fca13bb56..b1aa3fcd5bca 100644
--- a/arch/x86/include/asm/dwarf2.h
+++ b/arch/x86/include/asm/dwarf2.h
@@ -6,6 +6,15 @@
#warning "asm/dwarf2.h should be only included in pure assembly files"
#endif
+#ifdef BUILD_VDSO
+
+ /*
+ * For the vDSO, emit both runtime unwind information and debug
+ * symbols for the .dbg file.
+ */
+
+ .cfi_sections .eh_frame, .debug_frame
+
#define CFI_STARTPROC .cfi_startproc
#define CFI_ENDPROC .cfi_endproc
#define CFI_DEF_CFA .cfi_def_cfa
@@ -21,7 +30,8 @@
#define CFI_UNDEFINED .cfi_undefined
#define CFI_ESCAPE .cfi_escape
-#ifndef BUILD_VDSO
+#else /* !BUILD_VDSO */
+
/*
* Emit CFI data in .debug_frame sections, not .eh_frame sections.
* The latter we currently just discard since we don't do DWARF
@@ -29,13 +39,24 @@
* useful to anyone. Note we should not use this directive if we
* ever decide to enable DWARF unwinding at runtime.
*/
+
.cfi_sections .debug_frame
-#else
- /*
- * For the vDSO, emit both runtime unwind information and debug
- * symbols for the .dbg file.
- */
- .cfi_sections .eh_frame, .debug_frame
-#endif
+
+#define CFI_STARTPROC
+#define CFI_ENDPROC
+#define CFI_DEF_CFA
+#define CFI_DEF_CFA_REGISTER
+#define CFI_DEF_CFA_OFFSET
+#define CFI_ADJUST_CFA_OFFSET
+#define CFI_OFFSET
+#define CFI_REL_OFFSET
+#define CFI_REGISTER
+#define CFI_RESTORE
+#define CFI_REMEMBER_STATE
+#define CFI_RESTORE_STATE
+#define CFI_UNDEFINED
+#define CFI_ESCAPE
+
+#endif /* !BUILD_VDSO */
#endif /* _ASM_X86_DWARF2_H */
--
2.47.0
* [PATCH v3 03/19] x86/asm: Fix VDSO DWARF generation with kernel IBT enabled
2024-10-28 21:47 ` [PATCH v3 03/19] x86/asm: Fix VDSO DWARF generation with kernel IBT enabled Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
0 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
The DWARF .cfi_startproc annotation needs to be at the very beginning of
a function. But with kernel IBT that doesn't happen as ENDBR is
sneakily embedded in SYM_FUNC_START. So the DWARF unwinding info is
wrong at the beginning of the VDSO functions.
Fix it by adding CFI_STARTPROC and CFI_ENDPROC to SYM_FUNC_START_* and
SYM_FUNC_END respectively. Note this only affects VDSO, as the CFI_*
macros are empty for the kernel proper.
Fixes: c4691712b546 ("x86/linkage: Add ENDBR to SYM_FUNC_START*()")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/entry/vdso/vdso-layout.lds.S | 2 +-
arch/x86/entry/vdso/vgetrandom-chacha.S | 2 --
arch/x86/entry/vdso/vsgx.S | 4 ----
arch/x86/include/asm/linkage.h | 29 +++++++++++++++++++------
arch/x86/include/asm/vdso.h | 1 -
5 files changed, 23 insertions(+), 15 deletions(-)
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S
index bafa73f09e92..a42c7d4a33da 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -1,5 +1,5 @@
/* SPDX-License-Identifier: GPL-2.0 */
-#include <asm/vdso.h>
+#include <asm/page_types.h>
/*
* Linker script for vDSO. This is an ELF shared object prelinked to
diff --git a/arch/x86/entry/vdso/vgetrandom-chacha.S b/arch/x86/entry/vdso/vgetrandom-chacha.S
index cc82da9216fb..a33212594731 100644
--- a/arch/x86/entry/vdso/vgetrandom-chacha.S
+++ b/arch/x86/entry/vdso/vgetrandom-chacha.S
@@ -22,7 +22,6 @@ CONSTANTS: .octa 0x6b20657479622d323320646e61707865
* rcx: number of 64-byte blocks to write to output
*/
SYM_FUNC_START(__arch_chacha20_blocks_nostack)
- CFI_STARTPROC
.set output, %rdi
.set key, %rsi
.set counter, %rdx
@@ -175,5 +174,4 @@ SYM_FUNC_START(__arch_chacha20_blocks_nostack)
pxor temp,temp
ret
- CFI_ENDPROC
SYM_FUNC_END(__arch_chacha20_blocks_nostack)
diff --git a/arch/x86/entry/vdso/vsgx.S b/arch/x86/entry/vdso/vsgx.S
index 37a3d4c02366..c0342238c976 100644
--- a/arch/x86/entry/vdso/vsgx.S
+++ b/arch/x86/entry/vdso/vsgx.S
@@ -24,8 +24,6 @@
.section .text, "ax"
SYM_FUNC_START(__vdso_sgx_enter_enclave)
- /* Prolog */
- .cfi_startproc
push %rbp
.cfi_adjust_cfa_offset 8
.cfi_rel_offset %rbp, 0
@@ -143,8 +141,6 @@ SYM_FUNC_START(__vdso_sgx_enter_enclave)
jle .Lout
jmp .Lenter_enclave
- .cfi_endproc
-
_ASM_VDSO_EXTABLE_HANDLE(.Lenclu_eenter_eresume, .Lhandle_exception)
SYM_FUNC_END(__vdso_sgx_enter_enclave)
diff --git a/arch/x86/include/asm/linkage.h b/arch/x86/include/asm/linkage.h
index dc31b13b87a0..2866d57ef907 100644
--- a/arch/x86/include/asm/linkage.h
+++ b/arch/x86/include/asm/linkage.h
@@ -40,6 +40,10 @@
#ifdef __ASSEMBLY__
+#ifndef LINKER_SCRIPT
+#include <asm/dwarf2.h>
+#endif
+
#if defined(CONFIG_MITIGATION_RETHUNK) && !defined(__DISABLE_EXPORTS) && !defined(BUILD_VDSO)
#define RET jmp __x86_return_thunk
#else /* CONFIG_MITIGATION_RETPOLINE */
@@ -112,40 +116,51 @@
# define SYM_FUNC_ALIAS_MEMFUNC SYM_FUNC_ALIAS
#endif
+#define __SYM_FUNC_START \
+ CFI_STARTPROC ASM_NL \
+ ENDBR
+
+#define __SYM_FUNC_END \
+ CFI_ENDPROC ASM_NL
+
/* SYM_TYPED_FUNC_START -- use for indirectly called globals, w/ CFI type */
#define SYM_TYPED_FUNC_START(name) \
SYM_TYPED_START(name, SYM_L_GLOBAL, SYM_F_ALIGN) \
- ENDBR
+ __SYM_FUNC_START
/* SYM_FUNC_START -- use for global functions */
#define SYM_FUNC_START(name) \
SYM_START(name, SYM_L_GLOBAL, SYM_F_ALIGN) \
- ENDBR
+ __SYM_FUNC_START
/* SYM_FUNC_START_NOALIGN -- use for global functions, w/o alignment */
#define SYM_FUNC_START_NOALIGN(name) \
SYM_START(name, SYM_L_GLOBAL, SYM_A_NONE) \
- ENDBR
+ __SYM_FUNC_START
/* SYM_FUNC_START_LOCAL -- use for local functions */
#define SYM_FUNC_START_LOCAL(name) \
SYM_START(name, SYM_L_LOCAL, SYM_F_ALIGN) \
- ENDBR
+ __SYM_FUNC_START
/* SYM_FUNC_START_LOCAL_NOALIGN -- use for local functions, w/o alignment */
#define SYM_FUNC_START_LOCAL_NOALIGN(name) \
SYM_START(name, SYM_L_LOCAL, SYM_A_NONE) \
- ENDBR
+ __SYM_FUNC_START
/* SYM_FUNC_START_WEAK -- use for weak functions */
#define SYM_FUNC_START_WEAK(name) \
SYM_START(name, SYM_L_WEAK, SYM_F_ALIGN) \
- ENDBR
+ __SYM_FUNC_START
/* SYM_FUNC_START_WEAK_NOALIGN -- use for weak functions, w/o alignment */
#define SYM_FUNC_START_WEAK_NOALIGN(name) \
SYM_START(name, SYM_L_WEAK, SYM_A_NONE) \
- ENDBR
+ __SYM_FUNC_START
+
+#define SYM_FUNC_END(name) \
+ __SYM_FUNC_END \
+ SYM_END(name, SYM_T_FUNC)
#endif /* _ASM_X86_LINKAGE_H */
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index d7f6592b74a9..0111c349bbc5 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -2,7 +2,6 @@
#ifndef _ASM_X86_VDSO_H
#define _ASM_X86_VDSO_H
-#include <asm/page_types.h>
#include <linux/linkage.h>
#include <linux/init.h>
--
2.47.0
* [PATCH v3 04/19] x86/vdso: Use SYM_FUNC_{START,END} in __kernel_vsyscall()
2024-10-28 21:47 ` [PATCH v3 04/19] x86/vdso: Use SYM_FUNC_{START,END} in __kernel_vsyscall() Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
0 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
Use SYM_FUNC_{START,END} instead of all the boilerplate. No functional
change.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/entry/vdso/vdso32/system_call.S | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/arch/x86/entry/vdso/vdso32/system_call.S b/arch/x86/entry/vdso/vdso32/system_call.S
index d33c6513fd2c..bdc576548240 100644
--- a/arch/x86/entry/vdso/vdso32/system_call.S
+++ b/arch/x86/entry/vdso/vdso32/system_call.S
@@ -9,11 +9,7 @@
#include <asm/alternative.h>
.text
- .globl __kernel_vsyscall
- .type __kernel_vsyscall,@function
- ALIGN
-__kernel_vsyscall:
- CFI_STARTPROC
+SYM_FUNC_START(__kernel_vsyscall)
/*
* Reshuffle regs so that all of any of the entry instructions
* will preserve enough state.
@@ -79,7 +75,5 @@ SYM_INNER_LABEL(int80_landing_pad, SYM_L_GLOBAL)
CFI_RESTORE ecx
CFI_ADJUST_CFA_OFFSET -4
RET
- CFI_ENDPROC
-
- .size __kernel_vsyscall,.-__kernel_vsyscall
+SYM_FUNC_END(__kernel_vsyscall)
.previous
--
2.47.0
* [PATCH v3 05/19] x86/vdso: Use CFI macros in __vdso_sgx_enter_enclave()
2024-10-28 21:47 ` [PATCH v3 05/19] x86/vdso: Use CFI macros in __vdso_sgx_enter_enclave() Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
0 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
Use the CFI macros instead of the raw .cfi_* directives to be consistent
with the rest of the VDSO asm. It's also easier on the eyes.
No functional changes.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/entry/vdso/vsgx.S | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/arch/x86/entry/vdso/vsgx.S b/arch/x86/entry/vdso/vsgx.S
index c0342238c976..8d7b8eb45c50 100644
--- a/arch/x86/entry/vdso/vsgx.S
+++ b/arch/x86/entry/vdso/vsgx.S
@@ -24,13 +24,14 @@
.section .text, "ax"
SYM_FUNC_START(__vdso_sgx_enter_enclave)
+ SYM_F_ALIGN
push %rbp
- .cfi_adjust_cfa_offset 8
- .cfi_rel_offset %rbp, 0
+ CFI_ADJUST_CFA_OFFSET 8
+ CFI_REL_OFFSET %rbp, 0
mov %rsp, %rbp
- .cfi_def_cfa_register %rbp
+ CFI_DEF_CFA_REGISTER %rbp
push %rbx
- .cfi_rel_offset %rbx, -8
+ CFI_REL_OFFSET %rbx, -8
mov %ecx, %eax
.Lenter_enclave:
@@ -77,13 +78,11 @@ SYM_FUNC_START(__vdso_sgx_enter_enclave)
.Lout:
pop %rbx
leave
- .cfi_def_cfa %rsp, 8
+ CFI_DEF_CFA %rsp, 8
RET
- /* The out-of-line code runs with the pre-leave stack frame. */
- .cfi_def_cfa %rbp, 16
-
.Linvalid_input:
+ CFI_DEF_CFA %rbp, 16
mov $(-EINVAL), %eax
jmp .Lout
--
2.47.0
* [PATCH v3 06/19] x86/vdso: Enable sframe generation in VDSO
2024-10-28 21:47 ` [PATCH v3 06/19] x86/vdso: Enable sframe generation in VDSO Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
2024-10-30 18:20 ` Jens Remus
1 sibling, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
Enable sframe generation in the VDSO library so kernel and user space
can unwind through it.
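For what it's worth, the result can be sanity-checked with a binutils
2.41+ readelf (assuming its --sframe option):

    $ readelf --sframe arch/x86/entry/vdso/vdso64.so.dbg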
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/entry/vdso/Makefile | 6 ++++--
arch/x86/entry/vdso/vdso-layout.lds.S | 3 +++
arch/x86/include/asm/dwarf2.h | 5 ++++-
3 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index c9216ac4fb1e..75ae9e093a2d 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -47,13 +47,15 @@ quiet_cmd_vdso2c = VDSO2C $@
$(obj)/vdso-image-%.c: $(obj)/vdso%.so.dbg $(obj)/vdso%.so $(obj)/vdso2c FORCE
$(call if_changed,vdso2c)
+SFRAME_CFLAGS := $(call as-option,-Wa$(comma)-gsframe,)
+
#
# Don't omit frame pointers for ease of userspace debugging, but do
# optimize sibling calls.
#
CFL := $(PROFILING) -mcmodel=small -fPIC -O2 -fasynchronous-unwind-tables -m64 \
$(filter -g%,$(KBUILD_CFLAGS)) -fno-stack-protector \
- -fno-omit-frame-pointer -foptimize-sibling-calls \
+ -fno-omit-frame-pointer $(SFRAME_CFLAGS) -foptimize-sibling-calls \
-DDISABLE_BRANCH_PROFILING -DBUILD_VDSO
ifdef CONFIG_MITIGATION_RETPOLINE
@@ -63,7 +65,7 @@ endif
endif
$(vobjs): KBUILD_CFLAGS := $(filter-out $(PADDING_CFLAGS) $(CC_FLAGS_LTO) $(CC_FLAGS_CFI) $(RANDSTRUCT_CFLAGS) $(GCC_PLUGINS_CFLAGS) $(RETPOLINE_CFLAGS),$(KBUILD_CFLAGS)) $(CFL)
-$(vobjs): KBUILD_AFLAGS += -DBUILD_VDSO
+$(vobjs): KBUILD_AFLAGS += -DBUILD_VDSO $(SFRAME_CFLAGS)
#
# vDSO code runs in userspace and -pg doesn't help with profiling anyway.
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S
index a42c7d4a33da..f6cd6654bb9e 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -69,6 +69,7 @@ SECTIONS
.eh_frame_hdr : { *(.eh_frame_hdr) } :text :eh_frame_hdr
.eh_frame : { KEEP (*(.eh_frame)) } :text
+ .sframe : { *(.sframe) } :text :sframe
/*
* Text is well-separated from actual data: there's plenty of
@@ -97,6 +98,7 @@ SECTIONS
* Very old versions of ld do not recognize this name token; use the constant.
*/
#define PT_GNU_EH_FRAME 0x6474e550
+#define PT_GNU_SFRAME 0x6474e554
/*
* We must supply the ELF program headers explicitly to get just one
@@ -108,4 +110,5 @@ PHDRS
dynamic PT_DYNAMIC FLAGS(4); /* PF_R */
note PT_NOTE FLAGS(4); /* PF_R */
eh_frame_hdr PT_GNU_EH_FRAME;
+ sframe PT_GNU_SFRAME;
}
diff --git a/arch/x86/include/asm/dwarf2.h b/arch/x86/include/asm/dwarf2.h
index b1aa3fcd5bca..1a49492817a1 100644
--- a/arch/x86/include/asm/dwarf2.h
+++ b/arch/x86/include/asm/dwarf2.h
@@ -12,8 +12,11 @@
* For the vDSO, emit both runtime unwind information and debug
* symbols for the .dbg file.
*/
-
+#ifdef __x86_64__
+ .cfi_sections .eh_frame, .debug_frame, .sframe
+#else
.cfi_sections .eh_frame, .debug_frame
+#endif
#define CFI_STARTPROC .cfi_startproc
#define CFI_ENDPROC .cfi_endproc
--
2.47.0
* [PATCH v3 07/19] unwind: Add user space unwinding API
2024-10-28 21:47 ` [PATCH v3 07/19] unwind: Add user space unwinding API Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
2024-12-06 10:29 ` Jens Remus
1 sibling, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
Introduce a user space unwinder API which provides a generic way to
unwind user stacks.
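A hypothetical in-kernel caller, to show the intended usage (not part
of this patch; it must run in a context where the current task's user
stack is accessible):

    #include <linux/kernel.h>
    #include <linux/printk.h>
    #include <linux/unwind_user.h>

    static void dump_current_user_stack(void)
    {
            unsigned long ips[64];
            struct unwind_stacktrace trace = { .entries = ips };
            unsigned int i;

            if (unwind_user(&trace, ARRAY_SIZE(ips)))
                    return;

            for (i = 0; i < trace.nr; i++)
                    pr_info("  #%u: %lx\n", i, trace.entries[i]);
    }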
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/Kconfig | 7 +++
include/linux/unwind_user.h | 41 +++++++++++++++
kernel/Makefile | 1 +
kernel/unwind/Makefile | 1 +
kernel/unwind/user.c | 99 +++++++++++++++++++++++++++++++++++++
5 files changed, 149 insertions(+)
create mode 100644 include/linux/unwind_user.h
create mode 100644 kernel/unwind/Makefile
create mode 100644 kernel/unwind/user.c
diff --git a/arch/Kconfig b/arch/Kconfig
index 7a95c1052cd5..ee8ec97ea0ef 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -435,6 +435,13 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
It uses the same command line parameters, and sysctl interface,
as the generic hardlockup detectors.
+config UNWIND_USER
+ bool
+
+config HAVE_UNWIND_USER_FP
+ bool
+ select UNWIND_USER
+
config HAVE_PERF_REGS
bool
help
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
new file mode 100644
index 000000000000..9d28db06f33f
--- /dev/null
+++ b/include/linux/unwind_user.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_H
+#define _LINUX_UNWIND_USER_H
+
+#include <linux/types.h>
+
+enum unwind_user_type {
+ UNWIND_USER_TYPE_FP,
+};
+
+struct unwind_stacktrace {
+ unsigned int nr;
+ unsigned long *entries;
+};
+
+struct unwind_user_frame {
+ s32 cfa_off;
+ s32 ra_off;
+ s32 fp_off;
+ bool use_fp;
+};
+
+struct unwind_user_state {
+ unsigned long ip;
+ unsigned long sp;
+ unsigned long fp;
+ enum unwind_user_type type;
+ bool done;
+};
+
+/* Synchronous interfaces: */
+
+int unwind_user_start(struct unwind_user_state *state);
+int unwind_user_next(struct unwind_user_state *state);
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
+
+#define for_each_user_frame(state) \
+ for (unwind_user_start((state)); !(state)->done; unwind_user_next((state)))
+
+#endif /* _LINUX_UNWIND_USER_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 87866b037fbe..6cb4b0e02a34 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,6 +50,7 @@ obj-y += rcu/
obj-y += livepatch/
obj-y += dma/
obj-y += entry/
+obj-y += unwind/
obj-$(CONFIG_MODULES) += module/
obj-$(CONFIG_KCMP) += kcmp.o
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
new file mode 100644
index 000000000000..349ce3677526
--- /dev/null
+++ b/kernel/unwind/Makefile
@@ -0,0 +1 @@
+ obj-$(CONFIG_UNWIND_USER) += user.o
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
new file mode 100644
index 000000000000..54b989810a0e
--- /dev/null
+++ b/kernel/unwind/user.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+* Generic interfaces for unwinding user space
+*
+* Copyright (C) 2024 Josh Poimboeuf <jpoimboe@kernel.org>
+*/
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+#include <linux/unwind_user.h>
+#include <linux/uaccess.h>
+#include <asm/unwind_user.h>
+
+static struct unwind_user_frame fp_frame = {
+ ARCH_INIT_USER_FP_FRAME
+};
+
+int unwind_user_next(struct unwind_user_state *state)
+{
+ struct unwind_user_frame _frame;
+ struct unwind_user_frame *frame = &_frame;
+ unsigned long prev_ip, cfa, fp, ra = 0;
+
+ if (state->done)
+ return -EINVAL;
+
+ prev_ip = state->ip;
+
+ switch (state->type) {
+ case UNWIND_USER_TYPE_FP:
+ frame = &fp_frame;
+ break;
+ default:
+ BUG();
+ }
+
+ cfa = (frame->use_fp ? state->fp : state->sp) + frame->cfa_off;
+
+ if (frame->ra_off && get_user(ra, (unsigned long __user *)(cfa + frame->ra_off)))
+ goto the_end;
+
+ if (ra == prev_ip)
+ goto the_end;
+
+ if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
+ goto the_end;
+
+ state->sp = cfa;
+ state->ip = ra;
+ if (frame->fp_off)
+ state->fp = fp;
+
+ return 0;
+
+the_end:
+ state->done = true;
+ return -EINVAL;
+}
+
+int unwind_user_start(struct unwind_user_state *state)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+
+ memset(state, 0, sizeof(*state));
+
+ if (!current->mm) {
+ state->done = true;
+ return -EINVAL;
+ }
+
+ state->type = UNWIND_USER_TYPE_FP;
+
+ state->sp = user_stack_pointer(regs);
+ state->ip = instruction_pointer(regs);
+ state->fp = frame_pointer(regs);
+
+ return 0;
+}
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries)
+{
+ struct unwind_user_state state;
+
+ trace->nr = 0;
+
+ if (!max_entries)
+ return -EINVAL;
+
+ if (!current->mm)
+ return 0;
+
+ for_each_user_frame(&state) {
+ trace->entries[trace->nr++] = state.ip;
+ if (trace->nr >= max_entries)
+ break;
+ }
+
+ return 0;
+}
--
2.47.0
* [PATCH v3 08/19] unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_FP
2024-10-28 21:47 ` [PATCH v3 08/19] unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_FP Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
2024-10-29 13:13 ` Peter Zijlstra
1 sibling, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
Use ARCH_INIT_USER_FP_FRAME to describe how frame pointers are unwound
on x86, and enable CONFIG_HAVE_UNWIND_USER_FP accordingly so the
unwind_user interfaces can be used.
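Spelled out, for the classic x86-64 prologue (push %rbp; mov
%rsp,%rbp), one unwind step works out as below. This is just the
pointer math implied by the offsets, with the get_user() checks
omitted:

    /* frame layout: [rbp] = caller's rbp, [rbp+8] = return address */
    cfa     = fp + 2 * sizeof(long);                      /* .cfa_off */
    ra      = *(unsigned long *)(cfa - 1 * sizeof(long)); /* .ra_off  */
    prev_fp = *(unsigned long *)(cfa - 2 * sizeof(long)); /* .fp_off  */

    /* next frame: sp = cfa, ip = ra, fp = prev_fp */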
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/unwind_user.h | 11 +++++++++++
2 files changed, 12 insertions(+)
create mode 100644 arch/x86/include/asm/unwind_user.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0bdb7a394f59..f91098d6f535 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -289,6 +289,7 @@ config X86
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL
select HAVE_UNSTABLE_SCHED_CLOCK
+ select HAVE_UNWIND_USER_FP if X86_64
select HAVE_USER_RETURN_NOTIFIER
select HAVE_GENERIC_VDSO
select VDSO_GETRANDOM if X86_64
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
new file mode 100644
index 000000000000..19df26a65132
--- /dev/null
+++ b/arch/x86/include/asm/unwind_user.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_UNWIND_USER_H
+#define _ASM_X86_UNWIND_USER_H
+
+#define ARCH_INIT_USER_FP_FRAME \
+ .ra_off = (s32)sizeof(long) * -1, \
+ .cfa_off = (s32)sizeof(long) * 2, \
+ .fp_off = (s32)sizeof(long) * -2, \
+ .use_fp = true,
+
+#endif /* _ASM_X86_UNWIND_USER_H */
--
2.47.0
* [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-10-28 21:47 ` [PATCH v3 09/19] unwind: Introduce sframe user space unwinding Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
2024-10-29 13:27 ` Peter Zijlstra
` (6 subsequent siblings)
7 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
Some distros have started compiling frame pointers into all their
packages to enable the kernel to do system-wide profiling of user space.
Unfortunately that creates a runtime performance penalty across the
entire system. Using DWARF instead isn't feasible due to the complexity
it would add to the kernel.
For in-kernel unwinding we solved this problem with the creation of the
ORC unwinder for x86_64. Similarly, for user space the GNU assembler
has created the sframe format; sframe v2 is generated starting with
binutils 2.41.
Sframe is a simpler version of .eh_frame. It gets placed in the .sframe
section.
Add support for unwinding user space using sframe.
More information about sframe can be found here:
- https://lwn.net/Articles/932209/
- https://lwn.net/Articles/940686/
- https://sourceware.org/binutils/docs/sframe-spec.html
Glibc support is needed to implement the prctl() calls to tell the
kernel where the .sframe segments are.
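For reference, the expected usage from the runtime linker would look
something like this (hypothetical; sframe_addr, text_start and
text_end stand for the relevant in-memory addresses of the just-mapped
DSO):

    #include <sys/prctl.h>

    #ifndef PR_ADD_SFRAME
    #define PR_ADD_SFRAME           74
    #define PR_REMOVE_SFRAME        75
    #endif

    /* after mmap()ing a DSO, register its .sframe section: */
    prctl(PR_ADD_SFRAME, sframe_addr, text_start, text_end, 0);

    /* ... and unregister it on dlclose(): */
    prctl(PR_REMOVE_SFRAME, sframe_addr, 0, 0, 0);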
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/Kconfig | 4 +
arch/x86/include/asm/mmu.h | 2 +-
fs/binfmt_elf.c | 35 +++-
include/linux/mm_types.h | 3 +
include/linux/sframe.h | 41 ++++
include/linux/unwind_user.h | 2 +
include/uapi/linux/elf.h | 1 +
include/uapi/linux/prctl.h | 3 +
kernel/fork.c | 10 +
kernel/sys.c | 11 ++
kernel/unwind/Makefile | 3 +-
kernel/unwind/sframe.c | 380 ++++++++++++++++++++++++++++++++++++
kernel/unwind/sframe.h | 215 ++++++++++++++++++++
kernel/unwind/user.c | 24 ++-
mm/init-mm.c | 6 +
15 files changed, 732 insertions(+), 8 deletions(-)
create mode 100644 include/linux/sframe.h
create mode 100644 kernel/unwind/sframe.c
create mode 100644 kernel/unwind/sframe.h
diff --git a/arch/Kconfig b/arch/Kconfig
index ee8ec97ea0ef..e769c39dd221 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -442,6 +442,10 @@ config HAVE_UNWIND_USER_FP
bool
select UNWIND_USER
+config HAVE_UNWIND_USER_SFRAME
+ bool
+ select UNWIND_USER
+
config HAVE_PERF_REGS
bool
help
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index ce4677b8b735..12ea831978cc 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -73,7 +73,7 @@ typedef struct {
.context = { \
.ctx_id = 1, \
.lock = __MUTEX_INITIALIZER(mm.context.lock), \
- }
+ },
void leave_mm(void);
#define leave_mm leave_mm
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 06dc4a57ba78..434c548f0837 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -47,6 +47,7 @@
#include <linux/dax.h>
#include <linux/uaccess.h>
#include <linux/rseq.h>
+#include <linux/sframe.h>
#include <asm/param.h>
#include <asm/page.h>
@@ -633,11 +634,13 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
unsigned long no_base, struct elf_phdr *interp_elf_phdata,
struct arch_elf_state *arch_state)
{
- struct elf_phdr *eppnt;
+ struct elf_phdr *eppnt, *sframe_phdr = NULL;
unsigned long load_addr = 0;
int load_addr_set = 0;
unsigned long error = ~0UL;
unsigned long total_size;
+ unsigned long start_code = ~0UL;
+ unsigned long end_code = 0;
int i;
/* First of all, some simple consistency checks */
@@ -659,7 +662,8 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
eppnt = interp_elf_phdata;
for (i = 0; i < interp_elf_ex->e_phnum; i++, eppnt++) {
- if (eppnt->p_type == PT_LOAD) {
+ switch (eppnt->p_type) {
+ case PT_LOAD: {
int elf_type = MAP_PRIVATE;
int elf_prot = make_prot(eppnt->p_flags, arch_state,
true, true);
@@ -688,7 +692,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
/*
* Check to see if the section's size will overflow the
* allowed task size. Note that p_filesz must always be
- * <= p_memsize so it's only necessary to check p_memsz.
+ * <= p_memsz so it's only necessary to check p_memsz.
*/
k = load_addr + eppnt->p_vaddr;
if (BAD_ADDR(k) ||
@@ -698,9 +702,24 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
error = -ENOMEM;
goto out;
}
+
+ if ((eppnt->p_flags & PF_X) && k < start_code)
+ start_code = k;
+
+ if ((eppnt->p_flags & PF_X) && k + eppnt->p_filesz > end_code)
+ end_code = k + eppnt->p_filesz;
+ break;
+ }
+ case PT_GNU_SFRAME:
+ sframe_phdr = eppnt;
+ break;
}
}
+ if (sframe_phdr)
+ sframe_add_section(load_addr + sframe_phdr->p_vaddr,
+ start_code, end_code);
+
error = load_addr;
out:
return error;
@@ -823,7 +842,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
int first_pt_load = 1;
unsigned long error;
struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
- struct elf_phdr *elf_property_phdata = NULL;
+ struct elf_phdr *elf_property_phdata = NULL, *sframe_phdr = NULL;
unsigned long elf_brk;
int retval, i;
unsigned long elf_entry;
@@ -931,6 +950,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
executable_stack = EXSTACK_DISABLE_X;
break;
+ case PT_GNU_SFRAME:
+ sframe_phdr = elf_ppnt;
+ break;
+
case PT_LOPROC ... PT_HIPROC:
retval = arch_elf_pt_proc(elf_ex, elf_ppnt,
bprm->file, false,
@@ -1321,6 +1344,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
task_pid_nr(current), retval);
}
+ if (sframe_phdr)
+ sframe_add_section(load_bias + sframe_phdr->p_vaddr,
+ start_code, end_code);
+
regs = current_pt_regs();
#ifdef ELF_PLAT_INIT
/*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 381d22eba088..6e7561c1a5fc 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1052,6 +1052,9 @@ struct mm_struct {
#endif
} lru_gen;
#endif /* CONFIG_LRU_GEN_WALKS_MMU */
+#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
+ struct maple_tree sframe_mt;
+#endif
} __randomize_layout;
/*
diff --git a/include/linux/sframe.h b/include/linux/sframe.h
new file mode 100644
index 000000000000..d167e01817c4
--- /dev/null
+++ b/include/linux/sframe.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SFRAME_H
+#define _LINUX_SFRAME_H
+
+#include <linux/mm_types.h>
+
+struct unwind_user_frame;
+
+#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
+
+#define INIT_MM_SFRAME .sframe_mt = MTREE_INIT(sframe_mt, 0),
+
+extern void sframe_free_mm(struct mm_struct *mm);
+
+/* text_start, text_end, file_name are optional */
+extern int sframe_add_section(unsigned long sframe_addr, unsigned long text_start,
+ unsigned long text_end);
+
+extern int sframe_remove_section(unsigned long sframe_addr);
+extern int sframe_find(unsigned long ip, struct unwind_user_frame *frame);
+
+static inline bool current_has_sframe(void)
+{
+ struct mm_struct *mm = current->mm;
+
+ return mm && !mtree_empty(&mm->sframe_mt);
+}
+
+#else /* !CONFIG_HAVE_UNWIND_USER_SFRAME */
+
+static inline void sframe_free_mm(struct mm_struct *mm) {}
+
+static inline int sframe_add_section(unsigned long sframe_addr, unsigned long text_start, unsigned long text_end) { return -EINVAL; }
+static inline int sframe_remove_section(unsigned long sframe_addr) { return -EINVAL; }
+static inline int sframe_find(unsigned long ip, struct unwind_user_frame *frame) { return -EINVAL; }
+
+static inline bool current_has_sframe(void) { return false; }
+
+#endif /* CONFIG_HAVE_UNWIND_USER_SFRAME */
+
+#endif /* _LINUX_SFRAME_H */
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index 9d28db06f33f..cde0fde4923e 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -5,7 +5,9 @@
#include <linux/types.h>
enum unwind_user_type {
+ UNWIND_USER_TYPE_NONE,
UNWIND_USER_TYPE_FP,
+ UNWIND_USER_TYPE_SFRAME,
};
struct unwind_stacktrace {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index b9935988da5c..4dc3f0ca5af5 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -39,6 +39,7 @@ typedef __s64 Elf64_Sxword;
#define PT_GNU_STACK (PT_LOOS + 0x474e551)
#define PT_GNU_RELRO (PT_LOOS + 0x474e552)
#define PT_GNU_PROPERTY (PT_LOOS + 0x474e553)
+#define PT_GNU_SFRAME (PT_LOOS + 0x474e554)
/* ARM MTE memory tag segment type */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 35791791a879..69511077c910 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -328,4 +328,7 @@ struct prctl_mm_map {
# define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC 0x10 /* Clear the aspect on exec */
# define PR_PPC_DEXCR_CTRL_MASK 0x1f
+#define PR_ADD_SFRAME 74
+#define PR_REMOVE_SFRAME 75
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index c056ea95fe8c..60f14fbab956 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -105,6 +105,7 @@
#include <linux/rseq.h>
#include <uapi/linux/pidfd.h>
#include <linux/pidfs.h>
+#include <linux/sframe.h>
#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -924,6 +925,7 @@ void __mmdrop(struct mm_struct *mm)
mm_pasid_drop(mm);
mm_destroy_cid(mm);
percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
+ sframe_free_mm(mm);
free_mm(mm);
}
@@ -1251,6 +1253,13 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
#endif
}
+static void mm_init_sframe(struct mm_struct *mm)
+{
+#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
+ mt_init(&mm->sframe_mt);
+#endif
+}
+
static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
struct user_namespace *user_ns)
{
@@ -1282,6 +1291,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm->pmd_huge_pte = NULL;
#endif
mm_init_uprobes_state(mm);
+ mm_init_sframe(mm);
hugetlb_count_init(mm);
if (current->mm) {
diff --git a/kernel/sys.c b/kernel/sys.c
index 4da31f28fda8..7d05a67727db 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -64,6 +64,7 @@
#include <linux/rcupdate.h>
#include <linux/uidgid.h>
#include <linux/cred.h>
+#include <linux/sframe.h>
#include <linux/nospec.h>
@@ -2784,6 +2785,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_RISCV_SET_ICACHE_FLUSH_CTX:
error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
break;
+ case PR_ADD_SFRAME:
+ if (arg5)
+ return -EINVAL;
+ error = sframe_add_section(arg2, arg3, arg4);
+ break;
+ case PR_REMOVE_SFRAME:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = sframe_remove_section(arg2);
+ break;
default:
error = -EINVAL;
break;
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
index 349ce3677526..f70380d7a6a6 100644
--- a/kernel/unwind/Makefile
+++ b/kernel/unwind/Makefile
@@ -1 +1,2 @@
- obj-$(CONFIG_UNWIND_USER) += user.o
+ obj-$(CONFIG_UNWIND_USER) += user.o
+ obj-$(CONFIG_HAVE_UNWIND_USER_SFRAME) += sframe.o
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
new file mode 100644
index 000000000000..933e47696e29
--- /dev/null
+++ b/kernel/unwind/sframe.c
@@ -0,0 +1,380 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define pr_fmt(fmt) "sframe: " fmt
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/srcu.h>
+#include <linux/uaccess.h>
+#include <linux/mm.h>
+#include <linux/sframe.h>
+#include <linux/unwind_user.h>
+
+#include "sframe.h"
+
+#define SFRAME_FILENAME_LEN 32
+
+struct sframe_section {
+ struct rcu_head rcu;
+
+ unsigned long sframe_addr;
+ unsigned long text_addr;
+
+ unsigned long fdes_addr;
+ unsigned long fres_addr;
+ unsigned int fdes_nr;
+ signed char ra_off;
+ signed char fp_off;
+};
+
+DEFINE_STATIC_SRCU(sframe_srcu);
+
+#define __SFRAME_GET_USER(out, user_ptr, type) \
+({ \
+ type __tmp; \
+ if (get_user(__tmp, (type __user *)user_ptr)) \
+ return -EFAULT; \
+ user_ptr += sizeof(__tmp); \
+ out = __tmp; \
+})
+
+#define SFRAME_GET_USER(out, user_ptr, size) \
+({ \
+ switch (size) { \
+ case 1: \
+ __SFRAME_GET_USER(out, user_ptr, u8); \
+ break; \
+ case 2: \
+ __SFRAME_GET_USER(out, user_ptr, u16); \
+ break; \
+ case 4: \
+ __SFRAME_GET_USER(out, user_ptr, u32); \
+ break; \
+ default: \
+ return -EINVAL; \
+ } \
+})
+
+static unsigned char fre_type_to_size(unsigned char fre_type)
+{
+ if (fre_type > 2)
+ return 0;
+ return 1 << fre_type;
+}
+
+static unsigned char offset_size_enum_to_size(unsigned char off_size)
+{
+ if (off_size > 2)
+ return 0;
+ return 1 << off_size;
+}
+
+static int find_fde(struct sframe_section *sec, unsigned long ip,
+ struct sframe_fde *fde)
+{
+ struct sframe_fde __user *first, *last, *found = NULL;
+ u32 ip_off, func_off_low = 0, func_off_high = -1;
+
+ ip_off = ip - sec->sframe_addr;
+
+ first = (void __user *)sec->fdes_addr;
+ last = first + sec->fdes_nr - 1;
+ while (first <= last) {
+ struct sframe_fde __user *mid;
+ u32 func_off;
+
+ mid = first + ((last - first) / 2);
+
+ if (get_user(func_off, (s32 __user *)mid))
+ return -EFAULT;
+
+ if (ip_off >= func_off) {
+ /* validate sort order */
+ if (func_off < func_off_low)
+ return -EINVAL;
+
+ func_off_low = func_off;
+
+ found = mid;
+ first = mid + 1;
+ } else {
+ /* validate sort order */
+ if (func_off > func_off_high)
+ return -EINVAL;
+
+ func_off_high = func_off;
+
+ last = mid - 1;
+ }
+ }
+
+ if (!found)
+ return -EINVAL;
+
+ if (copy_from_user(fde, found, sizeof(*fde)))
+ return -EFAULT;
+
+ /* check for gaps */
+ if (ip_off < fde->start_addr || ip_off >= fde->start_addr + fde->size)
+ return -EINVAL;
+
+ return 0;
+}
+
+static int find_fre(struct sframe_section *sec, struct sframe_fde *fde,
+ unsigned long ip, struct unwind_user_frame *frame)
+{
+ unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
+ unsigned char fre_type = SFRAME_FUNC_FRE_TYPE(fde->info);
+ unsigned char offset_count, offset_size;
+ s32 cfa_off, ra_off, fp_off, ip_off;
+ void __user *f, *last_f = NULL;
+ unsigned char addr_size;
+ u32 last_fre_ip_off = 0;
+ u8 fre_info = 0;
+ int i;
+
+ addr_size = fre_type_to_size(fre_type);
+ if (!addr_size)
+ return -EINVAL;
+
+ ip_off = ip - (sec->sframe_addr + fde->start_addr);
+
+ f = (void __user *)sec->fres_addr + fde->fres_off;
+
+ for (i = 0; i < fde->fres_num; i++) {
+ u32 fre_ip_off;
+
+ SFRAME_GET_USER(fre_ip_off, f, addr_size);
+
+ if (fre_ip_off < last_fre_ip_off)
+ return -EINVAL;
+
+ last_fre_ip_off = fre_ip_off;
+
+ if (fde_type == SFRAME_FDE_TYPE_PCINC) {
+ if (ip_off < fre_ip_off)
+ break;
+ } else {
+ /* SFRAME_FDE_TYPE_PCMASK */
+ if (ip_off % fde->rep_size < fre_ip_off)
+ break;
+ }
+
+ SFRAME_GET_USER(fre_info, f, 1);
+
+ offset_count = SFRAME_FRE_OFFSET_COUNT(fre_info);
+ offset_size = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(fre_info));
+
+ if (!offset_count || !offset_size)
+ return -EINVAL;
+
+ last_f = f;
+ f += offset_count * offset_size;
+ }
+
+ if (!last_f)
+ return -EINVAL;
+
+ f = last_f;
+
+ SFRAME_GET_USER(cfa_off, f, offset_size);
+ offset_count--;
+
+ ra_off = sec->ra_off;
+ if (!ra_off) {
+ if (!offset_count--)
+ return -EINVAL;
+
+ SFRAME_GET_USER(ra_off, f, offset_size);
+ }
+
+ fp_off = sec->fp_off;
+ if (!fp_off && offset_count) {
+ offset_count--;
+ SFRAME_GET_USER(fp_off, f, offset_size);
+ }
+
+ if (offset_count)
+ return -EINVAL;
+
+ frame->cfa_off = cfa_off;
+ frame->ra_off = ra_off;
+ frame->fp_off = fp_off;
+ frame->use_fp = SFRAME_FRE_CFA_BASE_REG_ID(fre_info) == SFRAME_BASE_REG_FP;
+
+ return 0;
+}
+
+int sframe_find(unsigned long ip, struct unwind_user_frame *frame)
+{
+ struct mm_struct *mm = current->mm;
+ struct sframe_section *sec;
+ struct sframe_fde fde;
+ int ret = -EINVAL;
+
+ if (!mm)
+ return -EINVAL;
+
+ guard(srcu)(&sframe_srcu);
+
+ sec = mtree_load(&mm->sframe_mt, ip);
+ if (!sec)
+ return ret;
+
+ ret = find_fde(sec, ip, &fde);
+ if (ret)
+ return ret;
+
+ ret = find_fre(sec, &fde, ip, frame);
+ if (ret)
+ return ret;
+
+ return 0;
+}
+
+static int __sframe_add_section(unsigned long sframe_addr,
+ unsigned long text_start,
+ unsigned long text_end)
+{
+ struct maple_tree *sframe_mt = &current->mm->sframe_mt;
+ struct sframe_section *sec;
+ struct sframe_header shdr;
+ unsigned long header_end;
+ int ret;
+
+ if (copy_from_user(&shdr, (void __user *)sframe_addr, sizeof(shdr)))
+ return -EFAULT;
+
+ if (shdr.preamble.magic != SFRAME_MAGIC ||
+ shdr.preamble.version != SFRAME_VERSION_2 ||
+ !(shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
+ shdr.auxhdr_len || !shdr.num_fdes || !shdr.num_fres ||
+ shdr.fdes_off > shdr.fres_off) {
+ return -EINVAL;
+ }
+
+ sec = kmalloc(sizeof(*sec), GFP_KERNEL);
+ if (!sec)
+ return -ENOMEM;
+
+ header_end = sframe_addr + SFRAME_HDR_SIZE(shdr);
+
+ sec->sframe_addr = sframe_addr;
+ sec->text_addr = text_start;
+ sec->fdes_addr = header_end + shdr.fdes_off;
+ sec->fres_addr = header_end + shdr.fres_off;
+ sec->fdes_nr = shdr.num_fdes;
+ sec->ra_off = shdr.cfa_fixed_ra_offset;
+ sec->fp_off = shdr.cfa_fixed_fp_offset;
+
+ ret = mtree_insert_range(sframe_mt, text_start, text_end, sec, GFP_KERNEL);
+ if (ret) {
+ kfree(sec);
+ return ret;
+ }
+
+ return 0;
+}
+
+int sframe_add_section(unsigned long sframe_addr, unsigned long text_start,
+ unsigned long text_end)
+{
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *sframe_vma;
+
+ mmap_read_lock(mm);
+
+ sframe_vma = vma_lookup(mm, sframe_addr);
+ if (!sframe_vma)
+ goto err_unlock;
+
+ if (text_start && text_end) {
+ struct vm_area_struct *text_vma;
+
+ text_vma = vma_lookup(mm, text_start);
+ if (!text_vma || !(text_vma->vm_flags & VM_EXEC))
+ goto err_unlock;
+
+ if (PAGE_ALIGN(text_end) != text_vma->vm_end)
+ goto err_unlock;
+ } else {
+ struct vm_area_struct *vma, *text_vma = NULL;
+ VMA_ITERATOR(vmi, mm, 0);
+
+ for_each_vma(vmi, vma) {
+ if (vma->vm_file != sframe_vma->vm_file ||
+ !(vma->vm_flags & VM_EXEC))
+ continue;
+
+ if (text_vma) {
+ pr_warn_once("%s[%d]: multiple EXEC segments unsupported\n",
+ current->comm, current->pid);
+ goto err_unlock;
+ }
+
+ text_vma = vma;
+ }
+
+ if (!text_vma)
+ goto err_unlock;
+
+ text_start = text_vma->vm_start;
+ text_end = text_vma->vm_end;
+ }
+
+ mmap_read_unlock(mm);
+
+ return __sframe_add_section(sframe_addr, text_start, text_end);
+
+err_unlock:
+ mmap_read_unlock(mm);
+ return -EINVAL;
+}
+
+static void sframe_free_srcu(struct rcu_head *rcu)
+{
+ struct sframe_section *sec = container_of(rcu, struct sframe_section, rcu);
+
+ kfree(sec);
+}
+
+static int __sframe_remove_section(struct mm_struct *mm,
+ struct sframe_section *sec)
+{
+ sec = mtree_erase(&mm->sframe_mt, sec->text_addr);
+ if (!sec)
+ return -EINVAL;
+
+ call_srcu(&sframe_srcu, &sec->rcu, sframe_free_srcu);
+
+ return 0;
+}
+
+int sframe_remove_section(unsigned long sframe_addr)
+{
+ struct mm_struct *mm = current->mm;
+ struct sframe_section *sec;
+ unsigned long index = 0;
+
+ mt_for_each(&mm->sframe_mt, sec, index, ULONG_MAX) {
+ if (sec->sframe_addr == sframe_addr)
+ return __sframe_remove_section(mm, sec);
+ }
+
+ return -EINVAL;
+}
+
+void sframe_free_mm(struct mm_struct *mm)
+{
+ struct sframe_section *sec;
+ unsigned long index = 0;
+
+ if (!mm)
+ return;
+
+ mt_for_each(&mm->sframe_mt, sec, index, ULONG_MAX)
+ kfree(sec);
+
+ mtree_destroy(&mm->sframe_mt);
+}
diff --git a/kernel/unwind/sframe.h b/kernel/unwind/sframe.h
new file mode 100644
index 000000000000..11b9be7ad82e
--- /dev/null
+++ b/kernel/unwind/sframe.h
@@ -0,0 +1,215 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ *
+ * This file contains definitions for the SFrame stack tracing format, which is
+ * documented at https://sourceware.org/binutils/docs
+ */
+#ifndef _SFRAME_H
+#define _SFRAME_H
+
+#include <linux/types.h>
+
+#define SFRAME_VERSION_1 1
+#define SFRAME_VERSION_2 2
+#define SFRAME_MAGIC 0xdee2
+
+/* Function Descriptor Entries are sorted on PC. */
+#define SFRAME_F_FDE_SORTED 0x1
+/* Frame-pointer based stack tracing. Defined, but not set. */
+#define SFRAME_F_FRAME_POINTER 0x2
+
+#define SFRAME_CFA_FIXED_FP_INVALID 0
+#define SFRAME_CFA_FIXED_RA_INVALID 0
+
+/* Supported ABIs/Arch. */
+#define SFRAME_ABI_AARCH64_ENDIAN_BIG 1 /* AARCH64 big endian. */
+#define SFRAME_ABI_AARCH64_ENDIAN_LITTLE 2 /* AARCH64 little endian. */
+#define SFRAME_ABI_AMD64_ENDIAN_LITTLE 3 /* AMD64 little endian. */
+
+/* SFrame FRE types. */
+#define SFRAME_FRE_TYPE_ADDR1 0
+#define SFRAME_FRE_TYPE_ADDR2 1
+#define SFRAME_FRE_TYPE_ADDR4 2
+
+/*
+ * SFrame Function Descriptor Entry types.
+ *
+ * The SFrame format has two possible representations for functions. The
+ * choice of which type to use is made according to the instruction patterns
+ * in the relevant program stub.
+ */
+
+/* Unwinders perform a (PC >= FRE_START_ADDR) to look up a matching FRE. */
+#define SFRAME_FDE_TYPE_PCINC 0
+/*
+ * Unwinders perform a (PC & FRE_START_ADDR_AS_MASK >= FRE_START_ADDR_AS_MASK)
+ * to look up a matching FRE. Typical use cases are pltN entries,
+ * trampolines, etc.
+ */
+#define SFRAME_FDE_TYPE_PCMASK 1
+
+/**
+ * struct sframe_preamble - SFrame Preamble.
+ * @magic: Magic number (SFRAME_MAGIC).
+ * @version: Format version number (SFRAME_VERSION).
+ * @flags: Various flags.
+ */
+struct sframe_preamble {
+ u16 magic;
+ u8 version;
+ u8 flags;
+} __packed;
+
+/**
+ * struct sframe_header - SFrame Header.
+ * @preamble: SFrame preamble.
+ * @abi_arch: Identify the arch (including endianness) and ABI.
+ * @cfa_fixed_fp_offset: Offset for the Frame Pointer (FP) from CFA may be
+ * fixed for some ABIs (e.g., in AMD64 when -fno-omit-frame-pointer is
+ * used). When fixed, this field specifies the fixed stack frame offset
+ * and the individual FREs do not need to track it. When not fixed, it
+ * is set to SFRAME_CFA_FIXED_FP_INVALID, and the individual FREs may
+ * provide the applicable stack frame offset, if any.
+ * @cfa_fixed_ra_offset: Offset for the Return Address from CFA is fixed for
+ * some ABIs. When fixed, this field specifies the fixed stack frame
+ * offset and the individual FREs do not need to track it. When not
+ * fixed, it is set to SFRAME_CFA_FIXED_RA_INVALID.
+ * @auxhdr_len: Number of bytes making up the auxiliary header, if any.
+ * Some ABI/arch, in the future, may use this space for extending the
+ * information in SFrame header. Auxiliary header is contained in bytes
+ * sequentially following the sframe_header.
+ * @num_fdes: Number of SFrame FDEs in this SFrame section.
+ * @num_fres: Number of SFrame Frame Row Entries.
+ * @fre_len: Number of bytes in the SFrame Frame Row Entry section.
+ * @fdes_off: Offset of SFrame Function Descriptor Entry section.
+ * @fres_off: Offset of SFrame Frame Row Entry section.
+ */
+struct sframe_header {
+ struct sframe_preamble preamble;
+ u8 abi_arch;
+ s8 cfa_fixed_fp_offset;
+ s8 cfa_fixed_ra_offset;
+ u8 auxhdr_len;
+ u32 num_fdes;
+ u32 num_fres;
+ u32 fre_len;
+ u32 fdes_off;
+ u32 fres_off;
+} __packed;
+
+#define SFRAME_HDR_SIZE(sframe_hdr) \
+ ((sizeof(struct sframe_header) + (sframe_hdr).auxhdr_len))
+
+/* Two possible keys for executable (instruction) pointers signing. */
+#define SFRAME_AARCH64_PAUTH_KEY_A 0 /* Key A. */
+#define SFRAME_AARCH64_PAUTH_KEY_B 1 /* Key B. */
+
+/**
+ * struct sframe_fde - SFrame Function Descriptor Entry.
+ * @start_addr: Function start address. Encoded as a signed offset,
+ * relative to the current FDE.
+ * @size: Size of the function in bytes.
+ * @fres_off: Offset of the first SFrame Frame Row Entry of the function,
+ * relative to the beginning of the SFrame Frame Row Entry sub-section.
+ * @fres_num: Number of frame row entries for the function.
+ * @info: Additional information for deciphering the stack trace
+ * information for the function. Contains information about SFrame FRE
+ * type, SFrame FDE type, PAC authorization A/B key, etc.
+ * @rep_size: Block size for SFRAME_FDE_TYPE_PCMASK
+ * @padding: Unused
+ */
+struct sframe_fde {
+ s32 start_addr;
+ u32 size;
+ u32 fres_off;
+ u32 fres_num;
+ u8 info;
+ u8 rep_size;
+ u16 padding;
+} __packed;
+
+/*
+ * 'func_info' in SFrame FDE contains additional information for deciphering
+ * the stack trace information for the function. In V1, the information is
+ * organized as follows:
+ * - 4-bits: Identify the FRE type used for the function.
+ * - 1-bit: Identify the FDE type of the function - mask or inc.
+ * - 1-bit: PAC authorization A/B key (aarch64).
+ * - 2-bits: Unused.
+ * ---------------------------------------------------------------------
+ * | Unused | PAC auth A/B key (aarch64) | FDE type | FRE type |
+ * | | Unused (amd64) | | |
+ * ---------------------------------------------------------------------
+ * 8 6 5 4 0
+ */
+
+/* Note: Set PAC auth key to SFRAME_AARCH64_PAUTH_KEY_A by default. */
+#define SFRAME_FUNC_INFO(fde_type, fre_enc_type) \
+ (((SFRAME_AARCH64_PAUTH_KEY_A & 0x1) << 5) | \
+ (((fde_type) & 0x1) << 4) | ((fre_enc_type) & 0xf))
+
+#define SFRAME_FUNC_FRE_TYPE(data) ((data) & 0xf)
+#define SFRAME_FUNC_FDE_TYPE(data) (((data) >> 4) & 0x1)
+#define SFRAME_FUNC_PAUTH_KEY(data) (((data) >> 5) & 0x1)
+
+/*
+ * Size of stack frame offsets in an SFrame Frame Row Entry. A single
+ * SFrame FRE has all offsets of the same size. Offset size may vary
+ * across frame row entries.
+ */
+#define SFRAME_FRE_OFFSET_1B 0
+#define SFRAME_FRE_OFFSET_2B 1
+#define SFRAME_FRE_OFFSET_4B 2
+
+/* An SFrame Frame Row Entry can be SP or FP based. */
+#define SFRAME_BASE_REG_FP 0
+#define SFRAME_BASE_REG_SP 1
+
+/*
+ * The index at which a specific offset is presented in the variable length
+ * bytes of an FRE.
+ */
+#define SFRAME_FRE_CFA_OFFSET_IDX 0
+/*
+ * The RA stack offset, if present, will always be at index 1 in the variable
+ * length bytes of the FRE.
+ */
+#define SFRAME_FRE_RA_OFFSET_IDX 1
+/*
+ * The FP stack offset may appear at offset 1 or 2, depending on the ABI as RA
+ * may or may not be tracked.
+ */
+#define SFRAME_FRE_FP_OFFSET_IDX 2
+
+/*
+ * 'fre_info' in SFrame FRE contains information about:
+ * - 1 bit: base reg for CFA
+ * - 4 bits: Number of offsets (N). A value of up to 3 is allowed to track
+ * all three of CFA, FP and RA (fixed implicit order).
+ * - 2 bits: information about size of the offsets (S) in bytes.
+ * Valid values are SFRAME_FRE_OFFSET_1B, SFRAME_FRE_OFFSET_2B,
+ * SFRAME_FRE_OFFSET_4B
+ * - 1 bit: Mangled RA state bit (aarch64 only).
+ * ---------------------------------------------------------------
+ * | Mangled-RA (aarch64) | Size of | Number of | base_reg |
+ * | Unused (amd64) | offsets | offsets | |
+ * ---------------------------------------------------------------
+ * 8 7 5 1 0
+ */
+
+/* Note: Set mangled_ra_p to zero by default. */
+#define SFRAME_FRE_INFO(base_reg_id, offset_num, offset_size) \
+ (((0 & 0x1) << 7) | (((offset_size) & 0x3) << 5) | \
+ (((offset_num) & 0xf) << 1) | ((base_reg_id) & 0x1))
+
+/* Set the mangled_ra_p bit as indicated. */
+#define SFRAME_FRE_INFO_UPDATE_MANGLED_RA_P(mangled_ra_p, fre_info) \
+ ((((mangled_ra_p) & 0x1) << 7) | ((fre_info) & 0x7f))
+
+#define SFRAME_FRE_CFA_BASE_REG_ID(data) ((data) & 0x1)
+#define SFRAME_FRE_OFFSET_COUNT(data) (((data) >> 1) & 0xf)
+#define SFRAME_FRE_OFFSET_SIZE(data) (((data) >> 5) & 0x3)
+#define SFRAME_FRE_MANGLED_RA_P(data) (((data) >> 7) & 0x1)
+
+#endif /* _SFRAME_H */
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 54b989810a0e..8e47c80e3e54 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -8,12 +8,17 @@
#include <linux/sched.h>
#include <linux/sched/task_stack.h>
#include <linux/unwind_user.h>
+#include <linux/sframe.h>
#include <linux/uaccess.h>
-#include <asm/unwind_user.h>
+#ifdef CONFIG_HAVE_UNWIND_USER_FP
+#include <asm/unwind_user.h>
static struct unwind_user_frame fp_frame = {
ARCH_INIT_USER_FP_FRAME
};
+#else
+static struct unwind_user_frame fp_frame;
+#endif
int unwind_user_next(struct unwind_user_state *state)
{
@@ -30,6 +35,16 @@ int unwind_user_next(struct unwind_user_state *state)
case UNWIND_USER_TYPE_FP:
frame = &fp_frame;
break;
+ case UNWIND_USER_TYPE_SFRAME:
+ if (sframe_find(state->ip, frame)) {
+ if (!IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+ goto the_end;
+
+ frame = &fp_frame;
+ }
+ break;
+ case UNWIND_USER_TYPE_NONE:
+ goto the_end;
default:
BUG();
}
@@ -68,7 +83,12 @@ int unwind_user_start(struct unwind_user_state *state)
return -EINVAL;
}
- state->type = UNWIND_USER_TYPE_FP;
+ if (current_has_sframe())
+ state->type = UNWIND_USER_TYPE_SFRAME;
+ else if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+ state->type = UNWIND_USER_TYPE_FP;
+ else
+ state->type = UNWIND_USER_TYPE_NONE;
state->sp = user_stack_pointer(regs);
state->ip = instruction_pointer(regs);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 24c809379274..8eb1b122b7bf 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -11,12 +11,17 @@
#include <linux/atomic.h>
#include <linux/user_namespace.h>
#include <linux/iommu.h>
+#include <linux/sframe.h>
#include <asm/mmu.h>
#ifndef INIT_MM_CONTEXT
#define INIT_MM_CONTEXT(name)
#endif
+#ifndef INIT_MM_SFRAME
+#define INIT_MM_SFRAME
+#endif
+
const struct vm_operations_struct vma_dummy_vm_ops;
/*
@@ -45,6 +50,7 @@ struct mm_struct init_mm = {
.user_ns = &init_user_ns,
.cpu_bitmap = CPU_BITS_NONE,
INIT_MM_CONTEXT(init_mm)
+ INIT_MM_SFRAME
};
void setup_initial_init_mm(void *start_code, void *end_code,
--
2.47.0
* [PATCH v3 10/19] unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_SFRAME
2024-10-28 21:47 ` [PATCH v3 10/19] unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_SFRAME Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
2024-10-29 13:14 ` Peter Zijlstra
1 sibling, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
Binutils 2.41 supports generating sframe v2 for x86_64. It works well
in testing so enable it.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f91098d6f535..3e6f4c80c5b5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -290,6 +290,7 @@ config X86
select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL
select HAVE_UNSTABLE_SCHED_CLOCK
select HAVE_UNWIND_USER_FP if X86_64
+ select HAVE_UNWIND_USER_SFRAME if X86_64
select HAVE_USER_RETURN_NOTIFIER
select HAVE_GENERIC_VDSO
select VDSO_GETRANDOM if X86_64
--
2.47.0
* [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-28 21:47 ` [PATCH v3 11/19] unwind: Add deferred user space unwinding API Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
2024-10-29 13:48 ` Peter Zijlstra
` (3 subsequent siblings)
4 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
Add unwind_user_deferred() which allows callers to schedule task work to
unwind the user space stack before returning to user space. This solves
several problems for its callers:
- Ensure the unwind happens in task context even if the caller may be
running in interrupt context.
- Only do the unwind once, even if called multiple times either by the
same caller or multiple callers.
- Create a "context cookie" which allows trace post-processing to
correlate kernel unwinds/traces with the user unwind.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
include/linux/entry-common.h | 3 +
include/linux/sched.h | 5 +
include/linux/unwind_user.h | 56 ++++++++++
kernel/fork.c | 4 +
kernel/unwind/user.c | 199 +++++++++++++++++++++++++++++++++++
5 files changed, 267 insertions(+)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 1e50cdb83ae5..efbe8f964f31 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -12,6 +12,7 @@
#include <linux/resume_user_mode.h>
#include <linux/tick.h>
#include <linux/kmsan.h>
+#include <linux/unwind_user.h>
#include <asm/entry-common.h>
@@ -111,6 +112,8 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
CT_WARN_ON(__ct_state() != CT_STATE_USER);
user_exit_irqoff();
+ unwind_enter_from_user_mode();
+
instrumentation_begin();
kmsan_unpoison_entry_regs(regs);
trace_hardirqs_off_finish();
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5007a8e2d640..31b6f1d763ef 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -47,6 +47,7 @@
#include <linux/livepatch_sched.h>
#include <linux/uidgid_types.h>
#include <asm/kmap_size.h>
+#include <linux/unwind_user.h>
/* task_struct member predeclarations (sorted alphabetically): */
struct audit_context;
@@ -1592,6 +1593,10 @@ struct task_struct {
struct user_event_mm *user_event_mm;
#endif
+#ifdef CONFIG_UNWIND_USER
+ struct unwind_task_info unwind_task_info;
+#endif
+
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index cde0fde4923e..98e236c843b1 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -3,6 +3,9 @@
#define _LINUX_UNWIND_USER_H
#include <linux/types.h>
+#include <linux/percpu-defs.h>
+
+#define UNWIND_MAX_CALLBACKS 4
enum unwind_user_type {
UNWIND_USER_TYPE_NONE,
@@ -30,6 +33,26 @@ struct unwind_user_state {
bool done;
};
+struct unwind_task_info {
+ u64 ctx_cookie;
+ u32 pending_callbacks;
+ u64 last_cookies[UNWIND_MAX_CALLBACKS];
+ void *privs[UNWIND_MAX_CALLBACKS];
+ unsigned long *entries;
+ struct callback_head work;
+};
+
+typedef void (*unwind_callback_t)(struct unwind_stacktrace *trace,
+ u64 ctx_cookie, void *data);
+
+struct unwind_callback {
+ unwind_callback_t func;
+ int idx;
+};
+
+
+#ifdef CONFIG_UNWIND_USER
+
/* Synchronous interfaces: */
int unwind_user_start(struct unwind_user_state *state);
@@ -40,4 +63,37 @@ int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
#define for_each_user_frame(state) \
for (unwind_user_start((state)); !(state)->done; unwind_user_next((state)))
+
+/* Asynchronous interfaces: */
+
+void unwind_task_init(struct task_struct *task);
+void unwind_task_free(struct task_struct *task);
+
+int unwind_user_register(struct unwind_callback *callback, unwind_callback_t func);
+int unwind_user_unregister(struct unwind_callback *callback);
+
+int unwind_user_deferred(struct unwind_callback *callback, u64 *ctx_cookie, void *data);
+
+DECLARE_PER_CPU(u64, unwind_ctx_ctr);
+
+static __always_inline void unwind_enter_from_user_mode(void)
+{
+ __this_cpu_inc(unwind_ctx_ctr);
+}
+
+
+#else /* !CONFIG_UNWIND_USER */
+
+static inline void unwind_task_init(struct task_struct *task) {}
+static inline void unwind_task_free(struct task_struct *task) {}
+
+static inline int unwind_user_register(struct unwind_callback *callback, unwind_callback_t func) { return -ENOSYS; }
+static inline int unwind_user_unregister(struct unwind_callback *callback) { return -ENOSYS; }
+
+static inline int unwind_user_deferred(struct unwind_callback *callback, u64 *ctx_cookie, void *data) { return -ENOSYS; }
+
+static inline void unwind_enter_from_user_mode(void) {}
+
+#endif /* !CONFIG_UNWIND_USER */
+
#endif /* _LINUX_UNWIND_USER_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 60f14fbab956..d7580067853d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -105,6 +105,7 @@
#include <linux/rseq.h>
#include <uapi/linux/pidfd.h>
#include <linux/pidfs.h>
+#include <linux/unwind_user.h>
#include <linux/sframe.h>
#include <asm/pgalloc.h>
@@ -972,6 +973,7 @@ void __put_task_struct(struct task_struct *tsk)
WARN_ON(refcount_read(&tsk->usage));
WARN_ON(tsk == current);
+ unwind_task_free(tsk);
sched_ext_free(tsk);
io_uring_free(tsk);
cgroup_free(tsk);
@@ -2348,6 +2350,8 @@ __latent_entropy struct task_struct *copy_process(
p->bpf_ctx = NULL;
#endif
+ unwind_task_init(p);
+
/* Perform scheduler related setup. Assign this task to a CPU. */
retval = sched_fork(clone_flags, p);
if (retval)
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 8e47c80e3e54..ed7759c56551 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -10,6 +10,11 @@
#include <linux/unwind_user.h>
#include <linux/sframe.h>
#include <linux/uaccess.h>
+#include <linux/slab.h>
+#include <linux/task_work.h>
+#include <linux/mm.h>
+
+#define UNWIND_MAX_ENTRIES 512
#ifdef CONFIG_HAVE_UNWIND_USER_FP
#include <asm/unwind_user.h>
@@ -20,6 +25,12 @@ static struct unwind_user_frame fp_frame = {
static struct unwind_user_frame fp_frame;
#endif
+static struct unwind_callback *callbacks[UNWIND_MAX_CALLBACKS];
+static DECLARE_RWSEM(callbacks_rwsem);
+
+/* Counter for entries from user space */
+DEFINE_PER_CPU(u64, unwind_ctx_ctr);
+
int unwind_user_next(struct unwind_user_state *state)
{
struct unwind_user_frame _frame;
@@ -117,3 +128,191 @@ int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries)
return 0;
}
+
+/*
+ * The "context cookie" is a unique identifier which allows post-processing to
+ * correlate kernel trace(s) with user unwinds. It has the CPU id in the
+ * highest 16 bits and a per-CPU entry counter in the lower 48 bits.
+ */
+static u64 ctx_to_cookie(u64 cpu, u64 ctx)
+{
+ BUILD_BUG_ON(NR_CPUS > 65535);
+ return (ctx & ((1UL << 48) - 1)) | (cpu << 48);
+}
+
+/*
+ * Schedule a user space unwind to be done in task work before exiting the
+ * kernel.
+ *
+ * The @callback must have previously been registered with
+ * unwind_user_register().
+ *
+ * The @cookie output is a unique identifier which will also be passed to the
+ * callback function. It can be used to stitch kernel and user traces together
+ * in post-processing.
+ *
+ * If there are multiple calls to this function for a given @callback, the
+ * cookie will usually be the same and the callback will only be called once.
+ *
+ * The only exception is when the task has migrated to another CPU, *and* this
+ * is called while the task work is running (or has already run). Then a new
+ * cookie will be generated and the callback will be called again for the new
+ * cookie.
+ */
+int unwind_user_deferred(struct unwind_callback *callback, u64 *ctx_cookie, void *data)
+{
+ struct unwind_task_info *info = &current->unwind_task_info;
+ u64 cookie = info->ctx_cookie;
+ int idx = callback->idx;
+
+ if (WARN_ON_ONCE(in_nmi()))
+ return -EINVAL;
+
+ if (WARN_ON_ONCE(!callback->func || idx < 0))
+ return -EINVAL;
+
+ if (!current->mm)
+ return -EINVAL;
+
+ guard(irqsave)();
+
+ if (cookie && (info->pending_callbacks & (1 << idx)))
+ goto done;
+
+ /*
+ * If this is the first call from *any* caller since the most recent
+ * entry from user space, initialize the task context cookie and
+ * schedule the task work.
+ */
+ if (!cookie) {
+ u64 ctx_ctr = __this_cpu_read(unwind_ctx_ctr);
+ u64 cpu = raw_smp_processor_id();
+
+ cookie = ctx_to_cookie(cpu, ctx_ctr);
+
+ /*
+ * If called after task work has sent an unwind to the callback
+ * function but before the exit to user space, skip it as the
+ * previous call to the callback function should suffice.
+ *
+ * The only exception is if this task has migrated to another
+ * CPU since the first call to unwind_user_deferred(). The
+ * per-CPU context counter will have changed which will result
+ * in a new cookie and another unwind (see comment above
+ * function).
+ */
+ if (cookie == info->last_cookies[idx])
+ goto done;
+
+ info->ctx_cookie = cookie;
+ task_work_add(current, &info->work, TWA_RESUME);
+ }
+
+ info->pending_callbacks |= (1 << idx);
+ info->privs[idx] = data;
+ info->last_cookies[idx] = cookie;
+
+done:
+ if (ctx_cookie)
+ *ctx_cookie = cookie;
+ return 0;
+}
+
+static void unwind_user_task_work(struct callback_head *head)
+{
+ struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
+ struct task_struct *task = container_of(info, struct task_struct, unwind_task_info);
+ void *privs[UNWIND_MAX_CALLBACKS];
+ struct unwind_stacktrace trace;
+ unsigned long pending;
+ u64 cookie = 0;
+ int i;
+
+ BUILD_BUG_ON(UNWIND_MAX_CALLBACKS > 32);
+
+ if (WARN_ON_ONCE(task != current))
+ return;
+
+ if (WARN_ON_ONCE(!info->ctx_cookie || !info->pending_callbacks))
+ return;
+
+ scoped_guard(irqsave) {
+ pending = info->pending_callbacks;
+ cookie = info->ctx_cookie;
+
+ info->pending_callbacks = 0;
+ info->ctx_cookie = 0;
+ memcpy(privs, info->privs, sizeof(void *) * UNWIND_MAX_CALLBACKS);
+ }
+
+ if (!info->entries) {
+ info->entries = kmalloc(UNWIND_MAX_ENTRIES * sizeof(long),
+ GFP_KERNEL);
+ if (!info->entries)
+ return;
+ }
+
+ trace.entries = info->entries;
+ trace.nr = 0;
+ unwind_user(&trace, UNWIND_MAX_ENTRIES);
+
+ guard(rwsem_read)(&callbacks_rwsem);
+
+ for_each_set_bit(i, &pending, UNWIND_MAX_CALLBACKS) {
+ if (callbacks[i])
+ callbacks[i]->func(&trace, cookie, privs[i]);
+ }
+}
+
+int unwind_user_register(struct unwind_callback *callback, unwind_callback_t func)
+{
+ scoped_guard(rwsem_write, &callbacks_rwsem) {
+ for (int i = 0; i < UNWIND_MAX_CALLBACKS; i++) {
+ if (!callbacks[i]) {
+ callback->func = func;
+ callback->idx = i;
+ callbacks[i] = callback;
+ return 0;
+ }
+ }
+ }
+
+ callback->func = NULL;
+ callback->idx = -1;
+ return -ENOSPC;
+}
+
+int unwind_user_unregister(struct unwind_callback *callback)
+{
+ if (callback->idx < 0)
+ return -EINVAL;
+
+ scoped_guard(rwsem_write, &callbacks_rwsem)
+ callbacks[callback->idx] = NULL;
+
+ callback->func = NULL;
+ callback->idx = -1;
+
+ return 0;
+}
+
+void unwind_task_init(struct task_struct *task)
+{
+ struct unwind_task_info *info = &task->unwind_task_info;
+
+ info->entries = NULL;
+ info->pending_callbacks = 0;
+ info->ctx_cookie = 0;
+
+ memset(info->last_cookies, 0, sizeof(info->last_cookies));
+ memset(info->privs, 0, sizeof(info->privs));
+
+ init_task_work(&info->work, unwind_user_task_work);
+}
+
+void unwind_task_free(struct task_struct *task)
+{
+ struct unwind_task_info *info = &task->unwind_task_info;
+
+ kfree(info->entries);
+}
--
2.47.0
* [PATCH v3 12/19] perf: Remove get_perf_callchain() 'init_nr' argument
2024-10-28 21:47 ` [PATCH v3 12/19] perf: Remove get_perf_callchain() 'init_nr' argument Josh Poimboeuf
@ 2024-10-28 21:47 ` Josh Poimboeuf
0 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:47 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Namhyung Kim
The 'init_nr' argument has double duty: it's used to initialize both the
number of contexts and the number of stack entries. That's confusing
and the callers always pass zero anyway. Hard code the zero.
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
include/linux/perf_event.h | 2 +-
kernel/bpf/stackmap.c | 4 ++--
kernel/events/callchain.c | 12 ++++++------
kernel/events/core.c | 2 +-
4 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fb908843f209..1e956ff9acd3 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1568,7 +1568,7 @@ DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
extern struct perf_callchain_entry *
-get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
+get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
u32 max_stack, bool crosstask, bool add_mark);
extern int get_callchain_buffers(int max_stack);
extern void put_callchain_buffers(void);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 3615c06b7dfa..ec3a57a5fba1 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -314,7 +314,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
if (max_depth > sysctl_perf_event_max_stack)
max_depth = sysctl_perf_event_max_stack;
- trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
+ trace = get_perf_callchain(regs, kernel, user, max_depth,
false, false);
if (unlikely(!trace))
@@ -451,7 +451,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
else if (kernel && task)
trace = get_callchain_entry_for_task(task, max_depth);
else
- trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
+ trace = get_perf_callchain(regs, kernel, user, max_depth,
crosstask, false);
if (unlikely(!trace) || trace->nr < skip) {
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 8a47e52a454f..83834203e144 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -216,7 +216,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
}
struct perf_callchain_entry *
-get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
+get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
u32 max_stack, bool crosstask, bool add_mark)
{
struct perf_callchain_entry *entry;
@@ -227,11 +227,11 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
if (!entry)
return NULL;
- ctx.entry = entry;
- ctx.max_stack = max_stack;
- ctx.nr = entry->nr = init_nr;
- ctx.contexts = 0;
- ctx.contexts_maxed = false;
+ ctx.entry = entry;
+ ctx.max_stack = max_stack;
+ ctx.nr = entry->nr = 0;
+ ctx.contexts = 0;
+ ctx.contexts_maxed = false;
if (kernel && !user_mode(regs)) {
if (add_mark)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index df27d08a7232..1654d6e7c148 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7800,7 +7800,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
if (!kernel && !user)
return &__empty_callchain;
- callchain = get_perf_callchain(regs, 0, kernel, user,
+ callchain = get_perf_callchain(regs, kernel, user,
max_stack, crosstask, true);
return callchain ?: &__empty_callchain;
}
--
2.47.0
* [PATCH v3 13/19] perf: Remove get_perf_callchain() 'crosstask' argument
2024-10-28 21:47 ` [PATCH v3 13/19] perf: Remove get_perf_callchain() 'crosstask' argument Josh Poimboeuf
@ 2024-10-28 21:48 ` Josh Poimboeuf
0 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:48 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
get_perf_callchain() doesn't support cross-task unwinding, so it doesn't
make much sense to have 'crosstask' as an argument.
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
include/linux/perf_event.h | 2 +-
kernel/bpf/stackmap.c | 12 ++++--------
kernel/events/callchain.c | 6 +-----
kernel/events/core.c | 9 +++++----
4 files changed, 11 insertions(+), 18 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1e956ff9acd3..788f6971d32d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1569,7 +1569,7 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
extern struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
- u32 max_stack, bool crosstask, bool add_mark);
+ u32 max_stack, bool add_mark);
extern int get_callchain_buffers(int max_stack);
extern void put_callchain_buffers(void);
extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index ec3a57a5fba1..ee9701337912 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -314,8 +314,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
if (max_depth > sysctl_perf_event_max_stack)
max_depth = sysctl_perf_event_max_stack;
- trace = get_perf_callchain(regs, kernel, user, max_depth,
- false, false);
+ trace = get_perf_callchain(regs, kernel, user, max_depth, false);
if (unlikely(!trace))
/* couldn't fetch the stack trace */
@@ -430,10 +429,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
if (task && user && !user_mode(regs))
goto err_fault;
- /* get_perf_callchain does not support crosstask user stack walking
- * but returns an empty stack instead of NULL.
- */
- if (crosstask && user) {
+ /* get_perf_callchain() does not support crosstask stack walking */
+ if (crosstask) {
err = -EOPNOTSUPP;
goto clear;
}
@@ -451,8 +448,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
else if (kernel && task)
trace = get_callchain_entry_for_task(task, max_depth);
else
- trace = get_perf_callchain(regs, kernel, user, max_depth,
- crosstask, false);
+ trace = get_perf_callchain(regs, kernel, user, max_depth, false);
if (unlikely(!trace) || trace->nr < skip) {
if (may_fault)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 83834203e144..655fb25a725b 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -217,7 +217,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
- u32 max_stack, bool crosstask, bool add_mark)
+ u32 max_stack, bool add_mark)
{
struct perf_callchain_entry *entry;
struct perf_callchain_entry_ctx ctx;
@@ -248,9 +248,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
}
if (regs) {
- if (crosstask)
- goto exit_put;
-
if (add_mark)
perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
@@ -260,7 +257,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
}
}
-exit_put:
put_callchain_entry(rctx);
return entry;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1654d6e7c148..ebf143aa427b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7792,16 +7792,17 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
{
bool kernel = !event->attr.exclude_callchain_kernel;
bool user = !event->attr.exclude_callchain_user;
- /* Disallow cross-task user callchains. */
- bool crosstask = event->ctx->task && event->ctx->task != current;
const u32 max_stack = event->attr.sample_max_stack;
struct perf_callchain_entry *callchain;
if (!kernel && !user)
return &__empty_callchain;
- callchain = get_perf_callchain(regs, kernel, user,
- max_stack, crosstask, true);
+ /* Disallow cross-task callchains. */
+ if (event->ctx->task && event->ctx->task != current)
+ return &__empty_callchain;
+
+ callchain = get_perf_callchain(regs, kernel, user, max_stack, true);
return callchain ?: &__empty_callchain;
}
--
2.47.0
* [PATCH v3 14/19] perf: Simplify get_perf_callchain() user logic
2024-10-28 21:47 ` [PATCH v3 14/19] perf: Simplify get_perf_callchain() user logic Josh Poimboeuf
@ 2024-10-28 21:48 ` Josh Poimboeuf
0 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:48 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
Simplify the get_perf_callchain() user logic a bit. task_pt_regs()
should never be NULL.
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
kernel/events/callchain.c | 20 +++++++++-----------
1 file changed, 9 insertions(+), 11 deletions(-)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 655fb25a725b..2278402b7ac9 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -241,22 +241,20 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
if (user) {
if (!user_mode(regs)) {
- if (current->mm)
- regs = task_pt_regs(current);
- else
- regs = NULL;
+ if (!current->mm)
+ goto exit_put;
+ regs = task_pt_regs(current);
}
- if (regs) {
- if (add_mark)
- perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
+ if (add_mark)
+ perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
- start_entry_idx = entry->nr;
- perf_callchain_user(&ctx, regs);
- fixup_uretprobe_trampoline_entries(entry, start_entry_idx);
- }
+ start_entry_idx = entry->nr;
+ perf_callchain_user(&ctx, regs);
+ fixup_uretprobe_trampoline_entries(entry, start_entry_idx);
}
+exit_put:
put_callchain_entry(rctx);
return entry;
--
2.47.0
* [PATCH v3 15/19] perf: Add deferred user callchains
2024-10-28 21:47 ` [PATCH v3 15/19] perf: Add deferred user callchains Josh Poimboeuf
@ 2024-10-28 21:48 ` Josh Poimboeuf
2024-10-29 14:06 ` Peter Zijlstra
2024-11-06 9:45 ` Jens Remus
2 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:48 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
Instead of attempting to unwind user space from the NMI handler, defer
it to run in task context by sending a self-IPI and then scheduling the
unwind to run in the IRQ's exit task work before returning to user space.
This allows the user stack page to be paged in if needed, avoids
duplicate unwinds for kernel-bound workloads, and prepares for SFrame
unwinding (so .sframe sections can be paged in on demand).
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/Kconfig | 3 ++
include/linux/perf_event.h | 10 ++++-
include/uapi/linux/perf_event.h | 22 +++++++++-
kernel/bpf/stackmap.c | 6 +--
kernel/events/callchain.c | 11 ++++-
kernel/events/core.c | 63 ++++++++++++++++++++++++++-
tools/include/uapi/linux/perf_event.h | 22 +++++++++-
7 files changed, 129 insertions(+), 8 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index e769c39dd221..33449485eafd 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -446,6 +446,9 @@ config HAVE_UNWIND_USER_SFRAME
bool
select UNWIND_USER
+config HAVE_PERF_CALLCHAIN_DEFERRED
+ bool
+
config HAVE_PERF_REGS
bool
help
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 788f6971d32d..2193b3d16820 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -808,9 +808,11 @@ struct perf_event {
unsigned long pending_addr; /* SIGTRAP */
struct irq_work pending_irq;
struct irq_work pending_disable_irq;
+ struct irq_work pending_unwind_irq;
struct callback_head pending_task;
unsigned int pending_work;
struct rcuwait pending_work_wait;
+ unsigned int pending_unwind;
atomic_t event_limit;
@@ -1569,12 +1571,18 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
extern struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
- u32 max_stack, bool add_mark);
+ u32 max_stack, bool add_mark, bool defer_user);
extern int get_callchain_buffers(int max_stack);
extern void put_callchain_buffers(void);
extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
extern void put_callchain_entry(int rctx);
+#ifdef CONFIG_HAVE_PERF_CALLCHAIN_DEFERRED
+extern void perf_callchain_user_deferred(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
+#else
+static inline void perf_callchain_user_deferred(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs) {}
+#endif
+
extern int sysctl_perf_event_max_stack;
extern int sysctl_perf_event_max_contexts_per_stack;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 4842c36fdf80..6d0524b7d082 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -460,7 +460,8 @@ struct perf_event_attr {
inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
remove_on_exec : 1, /* event is removed from task on exec */
sigtrap : 1, /* send synchronous SIGTRAP on event */
- __reserved_1 : 26;
+ defer_callchain: 1, /* generate PERF_RECORD_CALLCHAIN_DEFERRED records */
+ __reserved_1 : 25;
union {
__u32 wakeup_events; /* wakeup every n events */
@@ -1217,6 +1218,24 @@ enum perf_event_type {
*/
PERF_RECORD_AUX_OUTPUT_HW_ID = 21,
+ /*
+ * This user callchain capture was deferred until shortly before
+ * returning to user space. Previous samples would have kernel
+ * callchains only and they need to be stitched with this to make full
+ * callchains.
+ *
+ * TODO: do PERF_SAMPLE_{REGS,STACK}_USER also need deferral?
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u64 ctx_cookie;
+ * u64 nr;
+ * u64 ips[nr];
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_CALLCHAIN_DEFERRED = 22,
+
PERF_RECORD_MAX, /* non-ABI */
};
@@ -1247,6 +1266,7 @@ enum perf_callchain_context {
PERF_CONTEXT_HV = (__u64)-32,
PERF_CONTEXT_KERNEL = (__u64)-128,
PERF_CONTEXT_USER = (__u64)-512,
+ PERF_CONTEXT_USER_DEFERRED = (__u64)-640,
PERF_CONTEXT_GUEST = (__u64)-2048,
PERF_CONTEXT_GUEST_KERNEL = (__u64)-2176,
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index ee9701337912..f073ebaf9c30 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -314,8 +314,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
if (max_depth > sysctl_perf_event_max_stack)
max_depth = sysctl_perf_event_max_stack;
- trace = get_perf_callchain(regs, kernel, user, max_depth, false);
-
+ trace = get_perf_callchain(regs, kernel, user, max_depth, false, false);
if (unlikely(!trace))
/* couldn't fetch the stack trace */
return -EFAULT;
@@ -448,7 +447,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
else if (kernel && task)
trace = get_callchain_entry_for_task(task, max_depth);
else
- trace = get_perf_callchain(regs, kernel, user, max_depth, false);
+ trace = get_perf_callchain(regs, kernel, user, max_depth,
+ false, false);
if (unlikely(!trace) || trace->nr < skip) {
if (may_fault)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 2278402b7ac9..eeb15ba0137f 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -217,7 +217,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
- u32 max_stack, bool add_mark)
+ u32 max_stack, bool add_mark, bool defer_user)
{
struct perf_callchain_entry *entry;
struct perf_callchain_entry_ctx ctx;
@@ -246,6 +246,15 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
regs = task_pt_regs(current);
}
+ if (defer_user) {
+ /*
+ * Foretell the coming of PERF_RECORD_CALLCHAIN_DEFERRED
+ * which can be stitched to this one.
+ */
+ perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
+ goto exit_put;
+ }
+
if (add_mark)
perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ebf143aa427b..bf97b2fa8a9c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -55,11 +55,14 @@
#include <linux/pgtable.h>
#include <linux/buildid.h>
#include <linux/task_work.h>
+#include <linux/unwind_user.h>
#include "internal.h"
#include <asm/irq_regs.h>
+static struct unwind_callback perf_unwind_callback_cb;
+
typedef int (*remote_function_f)(void *);
struct remote_function_call {
@@ -6955,6 +6958,53 @@ static void perf_pending_irq(struct irq_work *entry)
perf_swevent_put_recursion_context(rctx);
}
+static void perf_pending_unwind_irq(struct irq_work *entry)
+{
+ struct perf_event *event = container_of(entry, struct perf_event, pending_unwind_irq);
+
+ if (event->pending_unwind) {
+ unwind_user_deferred(&perf_unwind_callback_cb, NULL, event);
+ event->pending_unwind = 0;
+ }
+}
+
+struct perf_callchain_deferred_event {
+ struct perf_event_header header;
+ u64 ctx_cookie;
+ u64 nr;
+ u64 ips[];
+};
+
+static void perf_event_callchain_deferred(struct unwind_stacktrace *trace,
+ u64 ctx_cookie, void *_data)
+{
+ struct perf_callchain_deferred_event deferred_event;
+ u64 callchain_context = PERF_CONTEXT_USER;
+ struct perf_output_handle handle;
+ struct perf_event *event = _data;
+ struct perf_sample_data data;
+ u64 nr = trace->nr + 1 /* callchain_context */;
+
+ deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
+ deferred_event.header.misc = PERF_RECORD_MISC_USER;
+ deferred_event.header.size = sizeof(deferred_event) + (nr * sizeof(u64));
+
+ deferred_event.ctx_cookie = ctx_cookie;
+ deferred_event.nr = nr;
+
+ perf_event_header__init_id(&deferred_event.header, &data, event);
+
+ if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
+ return;
+
+ perf_output_put(&handle, deferred_event);
+ perf_output_put(&handle, callchain_context);
+ perf_output_copy(&handle, trace->entries, trace->nr * sizeof(u64));
+ perf_event__output_id_sample(event, &handle, &data);
+
+ perf_output_end(&handle);
+}
+
static void perf_pending_task(struct callback_head *head)
{
struct perf_event *event = container_of(head, struct perf_event, pending_task);
@@ -7794,6 +7844,8 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
bool user = !event->attr.exclude_callchain_user;
const u32 max_stack = event->attr.sample_max_stack;
struct perf_callchain_entry *callchain;
+ bool defer_user = IS_ENABLED(CONFIG_UNWIND_USER) &&
+ event->attr.defer_callchain;
if (!kernel && !user)
return &__empty_callchain;
@@ -7802,7 +7854,14 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
if (event->ctx->task && event->ctx->task != current)
return &__empty_callchain;
- callchain = get_perf_callchain(regs, kernel, user, max_stack, true);
+ callchain = get_perf_callchain(regs, kernel, user, max_stack, true,
+ defer_user);
+
+ if (user && defer_user && !event->pending_unwind) {
+ event->pending_unwind = 1;
+ irq_work_queue(&event->pending_unwind_irq);
+ }
+
return callchain ?: &__empty_callchain;
}
@@ -12171,6 +12230,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
init_waitqueue_head(&event->waitq);
init_irq_work(&event->pending_irq, perf_pending_irq);
+ event->pending_unwind_irq = IRQ_WORK_INIT_HARD(perf_pending_unwind_irq);
event->pending_disable_irq = IRQ_WORK_INIT_HARD(perf_pending_disable);
init_task_work(&event->pending_task, perf_pending_task);
rcuwait_init(&event->pending_work_wait);
@@ -14093,6 +14153,7 @@ void __init perf_event_init(void)
perf_tp_register();
perf_event_init_cpu(smp_processor_id());
register_reboot_notifier(&perf_reboot_notifier);
+ unwind_user_register(&perf_unwind_callback_cb, perf_event_callchain_deferred);
ret = init_hw_breakpoint();
WARN(ret, "hw_breakpoint initialization failed with: %d", ret);
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 4842c36fdf80..6d0524b7d082 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -460,7 +460,8 @@ struct perf_event_attr {
inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
remove_on_exec : 1, /* event is removed from task on exec */
sigtrap : 1, /* send synchronous SIGTRAP on event */
- __reserved_1 : 26;
+ defer_callchain: 1, /* generate PERF_RECORD_CALLCHAIN_DEFERRED records */
+ __reserved_1 : 25;
union {
__u32 wakeup_events; /* wakeup every n events */
@@ -1217,6 +1218,24 @@ enum perf_event_type {
*/
PERF_RECORD_AUX_OUTPUT_HW_ID = 21,
+ /*
+ * This user callchain capture was deferred until shortly before
+ * returning to user space. Previous samples would have kernel
+ * callchains only and they need to be stitched with this to make full
+ * callchains.
+ *
+ * TODO: do PERF_SAMPLE_{REGS,STACK}_USER also need deferral?
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u64 ctx_cookie;
+ * u64 nr;
+ * u64 ips[nr];
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_CALLCHAIN_DEFERRED = 22,
+
PERF_RECORD_MAX, /* non-ABI */
};
@@ -1247,6 +1266,7 @@ enum perf_callchain_context {
PERF_CONTEXT_HV = (__u64)-32,
PERF_CONTEXT_KERNEL = (__u64)-128,
PERF_CONTEXT_USER = (__u64)-512,
+ PERF_CONTEXT_USER_DEFERRED = (__u64)-640,
PERF_CONTEXT_GUEST = (__u64)-2048,
PERF_CONTEXT_GUEST_KERNEL = (__u64)-2176,
--
2.47.0
* [PATCH v3 16/19] perf tools: Minimal CALLCHAIN_DEFERRED support
2024-10-28 21:47 ` [PATCH v3 16/19] perf tools: Minimal CALLCHAIN_DEFERRED support Josh Poimboeuf
@ 2024-10-28 21:48 ` Josh Poimboeuf
0 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:48 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
From: Namhyung Kim <namhyung@kernel.org>
Add a new event type for deferred callchains and a new callback for the
struct perf_tool. For now it doesn't actually handle the deferred
callchains; it just marks the sample if it has PERF_CONTEXT_USER_DEFERRED
in the callchain array.
At least perf report can dump the raw data with this change. This
requires the next commit to enable attr.defer_callchain, but if you
already have a data file, it'll show the following result.
$ perf report -D
...
0x5fe0@perf.data [0x40]: event: 22
.
. ... raw event: size 64 bytes
. 0000: 16 00 00 00 02 00 40 00 02 00 00 00 00 00 00 00 ......@.........
. 0010: 00 fe ff ff ff ff ff ff 4b d3 3f 25 45 7f 00 00 ........K.?%E...
. 0020: 21 03 00 00 21 03 00 00 43 02 12 ab 05 00 00 00 !...!...C.......
. 0030: 00 00 00 00 00 00 00 00 09 00 00 00 00 00 00 00 ................
0 24344920643 0x5fe0 [0x40]: PERF_RECORD_CALLCHAIN_DEFERRED(IP, 0x2): 801/801: 0
... FP chain: nr:2
..... 0: fffffffffffffe00
..... 1: 00007f45253fd34b
: unhandled!
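For reference, the detection boils down to checking the last entry of the
FP chain. A rough standalone sketch of that check (assuming the tools'
struct ip_callchain layout of { nr; ips[] } and the UAPI value of
PERF_CONTEXT_USER_DEFERRED; illustrative only, not the tree code):

#include <stdbool.h>
#include <stdint.h>

#define PERF_CONTEXT_USER_DEFERRED	((uint64_t)-640)

struct ip_callchain {
	uint64_t nr;
	uint64_t ips[];
};

/* True if the sample's user part was deferred to a later record. */
static bool callchain_is_deferred(const struct ip_callchain *chain)
{
	return chain->nr >= 1 &&
	       chain->ips[chain->nr - 1] == PERF_CONTEXT_USER_DEFERRED;
}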
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
tools/lib/perf/include/perf/event.h | 7 +++++++
tools/perf/util/event.c | 1 +
tools/perf/util/evsel.c | 15 +++++++++++++++
tools/perf/util/machine.c | 1 +
tools/perf/util/perf_event_attr_fprintf.c | 1 +
tools/perf/util/sample.h | 3 ++-
tools/perf/util/session.c | 17 +++++++++++++++++
tools/perf/util/tool.c | 1 +
tools/perf/util/tool.h | 3 ++-
9 files changed, 47 insertions(+), 2 deletions(-)
diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
index 37bb7771d914..f643a6a2b9fc 100644
--- a/tools/lib/perf/include/perf/event.h
+++ b/tools/lib/perf/include/perf/event.h
@@ -151,6 +151,12 @@ struct perf_record_switch {
__u32 next_prev_tid;
};
+struct perf_record_callchain_deferred {
+ struct perf_event_header header;
+ __u64 nr;
+ __u64 ips[];
+};
+
struct perf_record_header_attr {
struct perf_event_header header;
struct perf_event_attr attr;
@@ -494,6 +500,7 @@ union perf_event {
struct perf_record_read read;
struct perf_record_throttle throttle;
struct perf_record_sample sample;
+ struct perf_record_callchain_deferred callchain_deferred;
struct perf_record_bpf_event bpf;
struct perf_record_ksymbol ksymbol;
struct perf_record_text_poke_event text_poke;
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index aac96d5d1917..8cdec373db44 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -58,6 +58,7 @@ static const char *perf_event__names[] = {
[PERF_RECORD_CGROUP] = "CGROUP",
[PERF_RECORD_TEXT_POKE] = "TEXT_POKE",
[PERF_RECORD_AUX_OUTPUT_HW_ID] = "AUX_OUTPUT_HW_ID",
+ [PERF_RECORD_CALLCHAIN_DEFERRED] = "CALLCHAIN_DEFERRED",
[PERF_RECORD_HEADER_ATTR] = "ATTR",
[PERF_RECORD_HEADER_EVENT_TYPE] = "EVENT_TYPE",
[PERF_RECORD_HEADER_TRACING_DATA] = "TRACING_DATA",
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index dbf9c8cee3c5..701092d6b1b6 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2676,6 +2676,18 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
data->data_src = PERF_MEM_DATA_SRC_NONE;
data->vcpu = -1;
+ if (event->header.type == PERF_RECORD_CALLCHAIN_DEFERRED) {
+ const u64 max_callchain_nr = UINT64_MAX / sizeof(u64);
+
+ data->callchain = (struct ip_callchain *)&event->callchain_deferred.nr;
+ if (data->callchain->nr > max_callchain_nr)
+ return -EFAULT;
+
+ if (evsel->core.attr.sample_id_all)
+ perf_evsel__parse_id_sample(evsel, event, data);
+ return 0;
+ }
+
if (event->header.type != PERF_RECORD_SAMPLE) {
if (!evsel->core.attr.sample_id_all)
return 0;
@@ -2806,6 +2818,9 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
if (data->callchain->nr > max_callchain_nr)
return -EFAULT;
sz = data->callchain->nr * sizeof(u64);
+ if (evsel->core.attr.defer_callchain && data->callchain->nr >= 1 &&
+ data->callchain->ips[data->callchain->nr - 1] == PERF_CONTEXT_USER_DEFERRED)
+ data->deferred_callchain = true;
OVERFLOW_CHECK(array, sz, max_size);
array = (void *)array + sz;
}
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index fad227b625d1..f367577c91ff 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -2085,6 +2085,7 @@ static int add_callchain_ip(struct thread *thread,
*cpumode = PERF_RECORD_MISC_KERNEL;
break;
case PERF_CONTEXT_USER:
+ case PERF_CONTEXT_USER_DEFERRED:
*cpumode = PERF_RECORD_MISC_USER;
break;
default:
diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index 59fbbba79697..113845b35110 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -321,6 +321,7 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
PRINT_ATTRf(inherit_thread, p_unsigned);
PRINT_ATTRf(remove_on_exec, p_unsigned);
PRINT_ATTRf(sigtrap, p_unsigned);
+ PRINT_ATTRf(defer_callchain, p_unsigned);
PRINT_ATTRn("{ wakeup_events, wakeup_watermark }", wakeup_events, p_unsigned, false);
PRINT_ATTRf(bp_type, p_unsigned);
diff --git a/tools/perf/util/sample.h b/tools/perf/util/sample.h
index 70b2c3135555..010659dc80f8 100644
--- a/tools/perf/util/sample.h
+++ b/tools/perf/util/sample.h
@@ -108,7 +108,8 @@ struct perf_sample {
u16 p_stage_cyc;
u16 retire_lat;
};
- bool no_hw_idx; /* No hw_idx collected in branch_stack */
+ bool no_hw_idx; /* No hw_idx collected in branch_stack */
+ bool deferred_callchain; /* Has deferred user callchains */
char insn[MAX_INSN];
void *raw_data;
struct ip_callchain *callchain;
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index dbaf07bf6c5f..1248a0317a2f 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -714,6 +714,7 @@ static perf_event__swap_op perf_event__swap_ops[] = {
[PERF_RECORD_CGROUP] = perf_event__cgroup_swap,
[PERF_RECORD_TEXT_POKE] = perf_event__text_poke_swap,
[PERF_RECORD_AUX_OUTPUT_HW_ID] = perf_event__all64_swap,
+ [PERF_RECORD_CALLCHAIN_DEFERRED] = perf_event__all64_swap,
[PERF_RECORD_HEADER_ATTR] = perf_event__hdr_attr_swap,
[PERF_RECORD_HEADER_EVENT_TYPE] = perf_event__event_type_swap,
[PERF_RECORD_HEADER_TRACING_DATA] = perf_event__tracing_data_swap,
@@ -1107,6 +1108,19 @@ static void dump_sample(struct evsel *evsel, union perf_event *event,
sample_read__printf(sample, evsel->core.attr.read_format);
}
+static void dump_deferred_callchain(struct evsel *evsel, union perf_event *event,
+ struct perf_sample *sample)
+{
+ if (!dump_trace)
+ return;
+
+ printf("(IP, 0x%x): %d/%d: %#" PRIx64 "\n",
+ event->header.misc, sample->pid, sample->tid, sample->ip);
+
+ if (evsel__has_callchain(evsel))
+ callchain__printf(evsel, sample);
+}
+
static void dump_read(struct evsel *evsel, union perf_event *event)
{
struct perf_record_read *read_event = &event->read;
@@ -1327,6 +1341,9 @@ static int machines__deliver_event(struct machines *machines,
return tool->text_poke(tool, event, sample, machine);
case PERF_RECORD_AUX_OUTPUT_HW_ID:
return tool->aux_output_hw_id(tool, event, sample, machine);
+ case PERF_RECORD_CALLCHAIN_DEFERRED:
+ dump_deferred_callchain(evsel, event, sample);
+ return tool->callchain_deferred(tool, event, sample, evsel, machine);
default:
++evlist->stats.nr_unknown_events;
return -1;
diff --git a/tools/perf/util/tool.c b/tools/perf/util/tool.c
index 3b7f390f26eb..e78f16de912e 100644
--- a/tools/perf/util/tool.c
+++ b/tools/perf/util/tool.c
@@ -259,6 +259,7 @@ void perf_tool__init(struct perf_tool *tool, bool ordered_events)
tool->read = process_event_sample_stub;
tool->throttle = process_event_stub;
tool->unthrottle = process_event_stub;
+ tool->callchain_deferred = process_event_sample_stub;
tool->attr = process_event_synth_attr_stub;
tool->event_update = process_event_synth_event_update_stub;
tool->tracing_data = process_event_synth_tracing_data_stub;
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index db1c7642b0d1..9987bbde6d5e 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -42,7 +42,8 @@ enum show_feature_header {
struct perf_tool {
event_sample sample,
- read;
+ read,
+ callchain_deferred;
event_op mmap,
mmap2,
comm,
--
2.47.0
* [PATCH v3 17/19] perf record: Enable defer_callchain for user callchains
2024-10-28 21:47 ` [PATCH v3 17/19] perf record: Enable defer_callchain for user callchains Josh Poimboeuf
@ 2024-10-28 21:48 ` Josh Poimboeuf
0 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:48 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
From: Namhyung Kim <namhyung@kernel.org>
Also add the missing feature detection logic to clear the flag on old
kernels.
$ perf record -g -vv true
...
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0 (PERF_COUNT_HW_CPU_CYCLES)
{ sample_period, sample_freq } 4000
sample_type IP|TID|TIME|CALLCHAIN|PERIOD
read_format ID|LOST
disabled 1
inherit 1
mmap 1
comm 1
freq 1
enable_on_exec 1
task 1
sample_id_all 1
mmap2 1
comm_exec 1
ksymbol 1
bpf_event 1
defer_callchain 1
------------------------------------------------------------
sys_perf_event_open: pid 162755 cpu 0 group_fd -1 flags 0x8
sys_perf_event_open failed, error -22
switching off deferred callchain support
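The fallback follows the usual missing-feature pattern: request the bit
optimistically and retry without it when the kernel rejects the attr with
EINVAL (error -22 above). A rough standalone sketch of that pattern
(assumes headers new enough to declare the defer_callchain bit; names are
illustrative):

#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/perf_event.h>

static int open_with_fallback(struct perf_event_attr *attr, pid_t pid, int cpu)
{
	int fd;

	attr->defer_callchain = 1;	/* try deferred callchains first */
	fd = syscall(SYS_perf_event_open, attr, pid, cpu, -1, 0);
	if (fd < 0 && errno == EINVAL) {
		attr->defer_callchain = 0;	/* old kernel: switch it off */
		fd = syscall(SYS_perf_event_open, attr, pid, cpu, -1, 0);
	}
	return fd;
}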
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
tools/perf/util/evsel.c | 17 ++++++++++++++++-
tools/perf/util/evsel.h | 1 +
2 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 701092d6b1b6..ad89644b32f2 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -912,6 +912,14 @@ static void __evsel__config_callchain(struct evsel *evsel, struct record_opts *o
}
}
+ if (param->record_mode == CALLCHAIN_FP && !attr->exclude_callchain_user) {
+ /*
+ * Enable deferred callchains optimistically. It'll be switched
+ * off later if the kernel doesn't support it.
+ */
+ attr->defer_callchain = 1;
+ }
+
if (function) {
pr_info("Disabling user space callchains for function trace event.\n");
attr->exclude_callchain_user = 1;
@@ -2089,6 +2097,8 @@ static int __evsel__prepare_open(struct evsel *evsel, struct perf_cpu_map *cpus,
static void evsel__disable_missing_features(struct evsel *evsel)
{
+ if (perf_missing_features.defer_callchain)
+ evsel->core.attr.defer_callchain = 0;
if (perf_missing_features.branch_counters)
evsel->core.attr.branch_sample_type &= ~PERF_SAMPLE_BRANCH_COUNTERS;
if (perf_missing_features.read_lost)
@@ -2144,7 +2154,12 @@ bool evsel__detect_missing_features(struct evsel *evsel)
* Must probe features in the order they were added to the
* perf_event_attr interface.
*/
- if (!perf_missing_features.branch_counters &&
+ if (!perf_missing_features.defer_callchain &&
+ evsel->core.attr.defer_callchain) {
+ perf_missing_features.defer_callchain = true;
+ pr_debug2("switching off deferred callchain support\n");
+ return true;
+ } else if (!perf_missing_features.branch_counters &&
(evsel->core.attr.branch_sample_type & PERF_SAMPLE_BRANCH_COUNTERS)) {
perf_missing_features.branch_counters = true;
pr_debug2("switching off branch counters support\n");
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 15e745a9a798..f0a1e1d78942 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -221,6 +221,7 @@ struct perf_missing_features {
bool weight_struct;
bool read_lost;
bool branch_counters;
+ bool defer_callchain;
};
extern struct perf_missing_features perf_missing_features;
--
2.47.0
* [PATCH v3 18/19] perf script: Display PERF_RECORD_CALLCHAIN_DEFERRED
2024-10-28 21:47 ` [PATCH v3 18/19] perf script: Display PERF_RECORD_CALLCHAIN_DEFERRED Josh Poimboeuf
@ 2024-10-28 21:48 ` Josh Poimboeuf
0 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:48 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
From: Namhyung Kim <namhyung@kernel.org>
Handle the deferred callchains in the script output.
$ perf script
perf 801 [000] 18.031793: 1 cycles:P:
ffffffff91a14c36 __intel_pmu_enable_all.isra.0+0x56 ([kernel.kallsyms])
ffffffff91d373e9 perf_ctx_enable+0x39 ([kernel.kallsyms])
ffffffff91d36af7 event_function+0xd7 ([kernel.kallsyms])
ffffffff91d34222 remote_function+0x42 ([kernel.kallsyms])
ffffffff91c1ebe1 generic_exec_single+0x61 ([kernel.kallsyms])
ffffffff91c1edac smp_call_function_single+0xec ([kernel.kallsyms])
ffffffff91d37a9d event_function_call+0x10d ([kernel.kallsyms])
ffffffff91d33557 perf_event_for_each_child+0x37 ([kernel.kallsyms])
ffffffff91d47324 _perf_ioctl+0x204 ([kernel.kallsyms])
ffffffff91d47c43 perf_ioctl+0x33 ([kernel.kallsyms])
ffffffff91e2f216 __x64_sys_ioctl+0x96 ([kernel.kallsyms])
ffffffff9265f1ae do_syscall_64+0x9e ([kernel.kallsyms])
ffffffff92800130 entry_SYSCALL_64+0xb0 ([kernel.kallsyms])
perf 801 [000] 18.031814: DEFERRED CALLCHAIN
7fb5fc22034b __GI___ioctl+0x3b (/usr/lib/x86_64-linux-gnu/libc.so.6)
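Other tools can hook the same record by setting the perf_tool callback
added in the previous patch. A rough sketch of the wiring, mirroring what
cmd_script() does in the diff below (my_deferred_handler is a hypothetical
callback using the event_sample signature from tools/perf/util/tool.h):

/* Hypothetical handler: consume or print the deferred user callchain. */
static int my_deferred_handler(const struct perf_tool *tool,
			       union perf_event *event,
			       struct perf_sample *sample,
			       struct evsel *evsel,
			       struct machine *machine)
{
	return 0;
}

static void setup_tool(struct perf_tool *tool)
{
	/* defaults every callback to a stub, then override ours */
	perf_tool__init(tool, /*ordered_events=*/true);
	tool->callchain_deferred = my_deferred_handler;
}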
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
tools/perf/builtin-script.c | 89 +++++++++++++++++++++++++++++++++++++
1 file changed, 89 insertions(+)
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index a644787fa9e1..311580e25f5b 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -2540,6 +2540,93 @@ static int process_sample_event(const struct perf_tool *tool,
return ret;
}
+static int process_deferred_sample_event(const struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct evsel *evsel,
+ struct machine *machine)
+{
+ struct perf_script *scr = container_of(tool, struct perf_script, tool);
+ struct perf_event_attr *attr = &evsel->core.attr;
+ struct evsel_script *es = evsel->priv;
+ unsigned int type = output_type(attr->type);
+ struct addr_location al;
+ FILE *fp = es->fp;
+ int ret = 0;
+
+ if (output[type].fields == 0)
+ return 0;
+
+	/* Initialize al; its thread stays NULL until machine__resolve() below */
+ addr_location__init(&al);
+
+ if (perf_time__ranges_skip_sample(scr->ptime_range, scr->range_num,
+ sample->time)) {
+ goto out_put;
+ }
+
+ if (debug_mode) {
+ if (sample->time < last_timestamp) {
+ pr_err("Samples misordered, previous: %" PRIu64
+ " this: %" PRIu64 "\n", last_timestamp,
+ sample->time);
+ nr_unordered++;
+ }
+ last_timestamp = sample->time;
+ goto out_put;
+ }
+
+ if (filter_cpu(sample))
+ goto out_put;
+
+ if (machine__resolve(machine, &al, sample) < 0) {
+ pr_err("problem processing %d event, skipping it.\n",
+ event->header.type);
+ ret = -1;
+ goto out_put;
+ }
+
+ if (al.filtered)
+ goto out_put;
+
+ if (!show_event(sample, evsel, al.thread, &al, NULL))
+ goto out_put;
+
+ if (evswitch__discard(&scr->evswitch, evsel))
+ goto out_put;
+
+ perf_sample__fprintf_start(scr, sample, al.thread, evsel,
+ PERF_RECORD_CALLCHAIN_DEFERRED, fp);
+ fprintf(fp, "DEFERRED CALLCHAIN");
+
+ if (PRINT_FIELD(IP)) {
+ struct callchain_cursor *cursor = NULL;
+
+ if (symbol_conf.use_callchain && sample->callchain) {
+ cursor = get_tls_callchain_cursor();
+ if (thread__resolve_callchain(al.thread, cursor, evsel,
+ sample, NULL, NULL,
+ scripting_max_stack)) {
+ pr_info("cannot resolve deferred callchains\n");
+ cursor = NULL;
+ }
+ }
+
+ fputc(cursor ? '\n' : ' ', fp);
+ sample__fprintf_sym(sample, &al, 0, output[type].print_ip_opts,
+ cursor, symbol_conf.bt_stop_list, fp);
+ }
+
+ fprintf(fp, "\n");
+
+ if (verbose > 0)
+ fflush(fp);
+
+out_put:
+ addr_location__exit(&al);
+ return ret;
+}
+
// Used when scr->per_event_dump is not set
static struct evsel_script es_stdout;
@@ -4325,6 +4412,7 @@ int cmd_script(int argc, const char **argv)
perf_tool__init(&script.tool, !unsorted_dump);
script.tool.sample = process_sample_event;
+ script.tool.callchain_deferred = process_deferred_sample_event;
script.tool.mmap = perf_event__process_mmap;
script.tool.mmap2 = perf_event__process_mmap2;
script.tool.comm = perf_event__process_comm;
@@ -4351,6 +4439,7 @@ int cmd_script(int argc, const char **argv)
script.tool.throttle = process_throttle_event;
script.tool.unthrottle = process_throttle_event;
script.tool.ordering_requires_timestamps = true;
+ script.tool.merge_deferred_callchains = false;
session = perf_session__new(&data, &script.tool);
if (IS_ERR(session))
return PTR_ERR(session);
--
2.47.0
* [PATCH v3 19/19] perf tools: Merge deferred user callchains
2024-10-28 21:47 ` [PATCH v3 19/19] perf tools: Merge deferred user callchains Josh Poimboeuf
@ 2024-10-28 21:48 ` Josh Poimboeuf
0 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:48 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
From: Namhyung Kim <namhyung@kernel.org>
Save samples with deferred callchains in a separate list and deliver
them after merging the user callchains. If users don't want them merged,
they can set tool->merge_deferred_callchains to false to prevent the
behavior.
With the previous result in place, perf script now shows the merged callchains.
$ perf script
perf 801 [000] 18.031793: 1 cycles:P:
ffffffff91a14c36 __intel_pmu_enable_all.isra.0+0x56 ([kernel.kallsyms])
ffffffff91d373e9 perf_ctx_enable+0x39 ([kernel.kallsyms])
ffffffff91d36af7 event_function+0xd7 ([kernel.kallsyms])
ffffffff91d34222 remote_function+0x42 ([kernel.kallsyms])
ffffffff91c1ebe1 generic_exec_single+0x61 ([kernel.kallsyms])
ffffffff91c1edac smp_call_function_single+0xec ([kernel.kallsyms])
ffffffff91d37a9d event_function_call+0x10d ([kernel.kallsyms])
ffffffff91d33557 perf_event_for_each_child+0x37 ([kernel.kallsyms])
ffffffff91d47324 _perf_ioctl+0x204 ([kernel.kallsyms])
ffffffff91d47c43 perf_ioctl+0x33 ([kernel.kallsyms])
ffffffff91e2f216 __x64_sys_ioctl+0x96 ([kernel.kallsyms])
ffffffff9265f1ae do_syscall_64+0x9e ([kernel.kallsyms])
ffffffff92800130 entry_SYSCALL_64+0xb0 ([kernel.kallsyms])
7fb5fc22034b __GI___ioctl+0x3b (/usr/lib/x86_64-linux-gnu/libc.so.6)
...
The old output can be obtained using the --no-merge-callchains option.
perf report can also get the user callchain entry at the end.
$ perf report --no-children --percent-limit=0 --stdio -q -S __intel_pmu_enable_all.isra.0
# symbol: __intel_pmu_enable_all.isra.0
0.00% perf [kernel.kallsyms]
|
---__intel_pmu_enable_all.isra.0
perf_ctx_enable
event_function
remote_function
generic_exec_single
smp_call_function_single
event_function_call
perf_event_for_each_child
_perf_ioctl
perf_ioctl
__x64_sys_ioctl
do_syscall_64
entry_SYSCALL_64
__GI___ioctl
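The merge itself is simple array surgery: drop the trailing
PERF_CONTEXT_USER_DEFERRED marker from the original chain and append the
deferred user entries, as sample__merge_deferred_callchain() does in the
diff below. A rough standalone rendering of that step:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct ip_callchain {
	uint64_t nr;
	uint64_t ips[];
};

/*
 * Returns a newly allocated merged chain, or NULL on allocation failure.
 * 'orig' is assumed to end with the PERF_CONTEXT_USER_DEFERRED marker,
 * which is replaced by the entries from 'deferred'.
 */
static struct ip_callchain *merge_chains(const struct ip_callchain *orig,
					 const struct ip_callchain *deferred)
{
	uint64_t nr_orig = orig->nr - 1;	/* drop the marker */
	struct ip_callchain *out;

	out = calloc(1 + nr_orig + deferred->nr, sizeof(uint64_t));
	if (!out)
		return NULL;

	out->nr = nr_orig + deferred->nr;
	memcpy(out->ips, orig->ips, nr_orig * sizeof(uint64_t));
	memcpy(&out->ips[nr_orig], deferred->ips,
	       deferred->nr * sizeof(uint64_t));
	return out;
}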
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
tools/perf/Documentation/perf-script.txt | 5 ++
tools/perf/builtin-script.c | 5 +-
tools/perf/util/callchain.c | 24 +++++++++
tools/perf/util/callchain.h | 3 ++
tools/perf/util/evlist.c | 1 +
tools/perf/util/evlist.h | 1 +
tools/perf/util/session.c | 63 +++++++++++++++++++++++-
tools/perf/util/tool.c | 1 +
tools/perf/util/tool.h | 1 +
9 files changed, 102 insertions(+), 2 deletions(-)
diff --git a/tools/perf/Documentation/perf-script.txt b/tools/perf/Documentation/perf-script.txt
index b72866ef270b..69f018b3d199 100644
--- a/tools/perf/Documentation/perf-script.txt
+++ b/tools/perf/Documentation/perf-script.txt
@@ -518,6 +518,11 @@ include::itrace.txt[]
The known limitations include exception handing such as
setjmp/longjmp will have calls/returns not match.
+--merge-callchains::
+ Enable merging deferred user callchains if available. This is the
+ default behavior. If you want to see separate CALLCHAIN_DEFERRED
+ records for some reason, use --no-merge-callchains explicitly.
+
:GMEXAMPLECMD: script
:GMEXAMPLESUBCMD:
include::guest-files.txt[]
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 311580e25f5b..e3acf4979c36 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -4031,6 +4031,7 @@ int cmd_script(int argc, const char **argv)
bool header_only = false;
bool script_started = false;
bool unsorted_dump = false;
+ bool merge_deferred_callchains = true;
char *rec_script_path = NULL;
char *rep_script_path = NULL;
struct perf_session *session;
@@ -4184,6 +4185,8 @@ int cmd_script(int argc, const char **argv)
"Guest code can be found in hypervisor process"),
OPT_BOOLEAN('\0', "stitch-lbr", &script.stitch_lbr,
"Enable LBR callgraph stitching approach"),
+ OPT_BOOLEAN('\0', "merge-callchains", &merge_deferred_callchains,
+ "Enable merge deferred user callchains"),
OPTS_EVSWITCH(&script.evswitch),
OPT_END()
};
@@ -4439,7 +4442,7 @@ int cmd_script(int argc, const char **argv)
script.tool.throttle = process_throttle_event;
script.tool.unthrottle = process_throttle_event;
script.tool.ordering_requires_timestamps = true;
- script.tool.merge_deferred_callchains = false;
+ script.tool.merge_deferred_callchains = merge_deferred_callchains;
session = perf_session__new(&data, &script.tool);
if (IS_ERR(session))
return PTR_ERR(session);
diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c
index 0c7564747a14..d1114491c3da 100644
--- a/tools/perf/util/callchain.c
+++ b/tools/perf/util/callchain.c
@@ -1832,3 +1832,27 @@ int sample__for_each_callchain_node(struct thread *thread, struct evsel *evsel,
}
return 0;
}
+
+int sample__merge_deferred_callchain(struct perf_sample *sample_orig,
+ struct perf_sample *sample_callchain)
+{
+ u64 nr_orig = sample_orig->callchain->nr - 1;
+ u64 nr_deferred = sample_callchain->callchain->nr;
+ struct ip_callchain *callchain;
+
+ callchain = calloc(1 + nr_orig + nr_deferred, sizeof(u64));
+ if (callchain == NULL) {
+ sample_orig->deferred_callchain = false;
+ return -ENOMEM;
+ }
+
+ callchain->nr = nr_orig + nr_deferred;
+ /* copy except for the last PERF_CONTEXT_USER_DEFERRED */
+ memcpy(callchain->ips, sample_orig->callchain->ips, nr_orig * sizeof(u64));
+ /* copy deferred user callchains */
+ memcpy(&callchain->ips[nr_orig], sample_callchain->callchain->ips,
+ nr_deferred * sizeof(u64));
+
+ sample_orig->callchain = callchain;
+ return 0;
+}
diff --git a/tools/perf/util/callchain.h b/tools/perf/util/callchain.h
index 86ed9e4d04f9..89785125ed25 100644
--- a/tools/perf/util/callchain.h
+++ b/tools/perf/util/callchain.h
@@ -317,4 +317,7 @@ int sample__for_each_callchain_node(struct thread *thread, struct evsel *evsel,
struct perf_sample *sample, int max_stack,
bool symbols, callchain_iter_fn cb, void *data);
+int sample__merge_deferred_callchain(struct perf_sample *sample_orig,
+ struct perf_sample *sample_callchain);
+
#endif /* __PERF_CALLCHAIN_H */
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index f14b7e6ff1dc..f27d8c4a22aa 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -81,6 +81,7 @@ void evlist__init(struct evlist *evlist, struct perf_cpu_map *cpus,
evlist->ctl_fd.ack = -1;
evlist->ctl_fd.pos = -1;
evlist->nr_br_cntr = -1;
+ INIT_LIST_HEAD(&evlist->deferred_samples);
}
struct evlist *evlist__new(void)
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index bcc1c6984bb5..c26379366554 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -84,6 +84,7 @@ struct evlist {
int pos; /* index at evlist core object to check signals */
} ctl_fd;
struct event_enable_timer *eet;
+ struct list_head deferred_samples;
};
struct evsel_str_handler {
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 1248a0317a2f..e0a21b896b57 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1256,6 +1256,56 @@ static int evlist__deliver_sample(struct evlist *evlist, const struct perf_tool
&sample->read.one, machine);
}
+struct deferred_event {
+ struct list_head list;
+ union perf_event *event;
+};
+
+static int evlist__deliver_deferred_samples(struct evlist *evlist,
+ const struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct machine *machine)
+{
+ struct deferred_event *de, *tmp;
+ struct evsel *evsel;
+ int ret = 0;
+
+ if (!tool->merge_deferred_callchains) {
+ evsel = evlist__id2evsel(evlist, sample->id);
+ return tool->callchain_deferred(tool, event, sample,
+ evsel, machine);
+ }
+
+ list_for_each_entry_safe(de, tmp, &evlist->deferred_samples, list) {
+ struct perf_sample orig_sample;
+
+ ret = evlist__parse_sample(evlist, de->event, &orig_sample);
+ if (ret < 0) {
+ pr_err("failed to parse original sample\n");
+ break;
+ }
+
+ if (sample->tid != orig_sample.tid)
+ continue;
+
+ evsel = evlist__id2evsel(evlist, orig_sample.id);
+ sample__merge_deferred_callchain(&orig_sample, sample);
+ ret = evlist__deliver_sample(evlist, tool, de->event,
+ &orig_sample, evsel, machine);
+
+ if (orig_sample.deferred_callchain)
+ free(orig_sample.callchain);
+
+ list_del(&de->list);
+ free(de);
+
+ if (ret)
+ break;
+ }
+ return ret;
+}
+
static int machines__deliver_event(struct machines *machines,
struct evlist *evlist,
union perf_event *event,
@@ -1284,6 +1334,16 @@ static int machines__deliver_event(struct machines *machines,
return 0;
}
dump_sample(evsel, event, sample, perf_env__arch(machine->env));
+ if (sample->deferred_callchain && tool->merge_deferred_callchains) {
+ struct deferred_event *de = malloc(sizeof(*de));
+
+ if (de == NULL)
+ return -ENOMEM;
+
+ de->event = event;
+ list_add_tail(&de->list, &evlist->deferred_samples);
+ return 0;
+ }
return evlist__deliver_sample(evlist, tool, event, sample, evsel, machine);
case PERF_RECORD_MMAP:
return tool->mmap(tool, event, sample, machine);
@@ -1343,7 +1403,8 @@ static int machines__deliver_event(struct machines *machines,
return tool->aux_output_hw_id(tool, event, sample, machine);
case PERF_RECORD_CALLCHAIN_DEFERRED:
dump_deferred_callchain(evsel, event, sample);
- return tool->callchain_deferred(tool, event, sample, evsel, machine);
+ return evlist__deliver_deferred_samples(evlist, tool, event,
+ sample, machine);
default:
++evlist->stats.nr_unknown_events;
return -1;
diff --git a/tools/perf/util/tool.c b/tools/perf/util/tool.c
index e78f16de912e..385043e06627 100644
--- a/tools/perf/util/tool.c
+++ b/tools/perf/util/tool.c
@@ -238,6 +238,7 @@ void perf_tool__init(struct perf_tool *tool, bool ordered_events)
tool->cgroup_events = false;
tool->no_warn = false;
tool->show_feat_hdr = SHOW_FEAT_NO_HEADER;
+ tool->merge_deferred_callchains = true;
tool->sample = process_event_sample_stub;
tool->mmap = process_event_stub;
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index 9987bbde6d5e..d06580478ab1 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -87,6 +87,7 @@ struct perf_tool {
bool cgroup_events;
bool no_warn;
bool dont_split_sample_group;
+ bool merge_deferred_callchains;
enum show_feature_header show_feat_hdr;
};
--
2.47.0
* Re: [PATCH v3 00/19] unwind, perf: sframe user space unwinding
2024-10-28 21:47 [PATCH v3 00/19] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (19 preceding siblings ...)
2024-10-28 21:47 ` [PATCH v3 00/19] unwind, perf: sframe user space unwinding Josh Poimboeuf
@ 2024-10-28 21:54 ` Josh Poimboeuf
2024-10-28 23:55 ` Josh Poimboeuf
2024-10-29 14:08 ` Peter Zijlstra
22 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 21:54 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
[-- Attachment #1: Type: text/plain, Size: 157 bytes --]
On Mon, Oct 28, 2024 at 02:47:27PM -0700, Josh Poimboeuf wrote:
> It still needs some binutils and glibc patches
binutils+glibc patches attached.
--
Josh
[-- Attachment #2: 0001-binutils.patch --]
[-- Type: text/plain, Size: 1708 bytes --]
From: Indu Bhagat <indu.bhagat@oracle.com>
Date: Fri, 25 Oct 2024 14:33:19 -0700
To: jpoimboe@kernel.org
Subject: [PATCH 1/3] ld: fix PR/32297
X-Mailer: git-send-email 2.43.0
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=utf-8
From: Josh Poimboeuf <jpoimboe@kernel.org>
When _creating_ SFrame information for the linker-created .plt.sec, the
code correctly checks for the presence of .plt.sec. When _writing_ the
SFrame section for the corresponding .plt.sec, however, the conditionals
were wrongly checking for splt. This was causing an assertion failure at
link time.
This issue has been known to affect the glibc build with SFrame enabled.
No testcase is added just yet. A later commit ensures correct SFrame
stack trace information is created for .plt.got. A test case (where only
.plt and .plt.got are created) is added then.
PR/32297 sframe: bfd assertion with empty main on IBT enabled system
ChangeLog:
PR/32297
* bfd/elfxx-x86.c (_bfd_x86_elf_late_size_sections): Check for
plt_second member not for splt.
---
bfd/elfxx-x86.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/bfd/elfxx-x86.c b/bfd/elfxx-x86.c
index dd951b91f50..0d83ce57d4f 100644
--- a/bfd/elfxx-x86.c
+++ b/bfd/elfxx-x86.c
@@ -2680,8 +2680,8 @@ _bfd_x86_elf_late_size_sections (bfd *output_bfd,
_bfd_x86_elf_write_sframe_plt (output_bfd, info, SFRAME_PLT);
if (htab->plt_second_sframe != NULL
- && htab->elf.splt != NULL
- && htab->elf.splt->size != 0
+ && htab->plt_second != NULL
+ && htab->plt_second->size != 0
&& htab->plt_second_sframe->contents == NULL)
_bfd_x86_elf_write_sframe_plt (output_bfd, info, SFRAME_PLT_SEC);
}
--
2.43.0
[-- Attachment #3: 0002-binutils.patch --]
[-- Type: text/plain, Size: 9166 bytes --]
From: Indu Bhagat <indu.bhagat@oracle.com>
Date: Fri, 25 Oct 2024 14:33:20 -0700
To: jpoimboe@kernel.org
Subject: [PATCH 2/3] ld: fix wrong SFrame info for lazy IBT PLT
X-Mailer: git-send-email 2.43.0
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=utf-8
From: Josh Poimboeuf <jpoimboe@kernel.org>
Fix PR/32296 sframe: wrong SFrame info for pltN and .plt.sec for -z ibtplt
The x86 psABI defines a 2-PLT scheme for IBT which uses .plt and
.plt.sec entries. It was observed that the SFrame information for the
.plt.sec section was incorrect. The erroneous assumption was that SFrame
stack trace information for .plt.sec with lazy binding is the same as
SFrame stack trace information for .plt with lazy binding. This is now
corrected by initializing a new SFrame PLT helper object,
elf_x86_64_sframe_ibt_plt, for lazy PLT with IBT.
Add a test case where linking with -z ibtplt generates .plt.sec entries
and ensure correct SFrame information is generated for it.
ChangeLog:
PR/32296
* bfd/elf64-x86-64.c (elf_x86_64_sframe_ibt_pltn_fre2): New
definition elf_x86_64_sframe_ibt_plt. Use it in
elf_x86_64_sframe_plt.
(elf_x86_64_link_setup_gnu_properties): Lazy IBT PLT entries are
different from lazy PLT.
* bfd/elfxx-x86.c (_bfd_x86_elf_create_sframe_plt): Adjust for
SFrame for IBT PLT.
* ld/testsuite/ld-x86-64/x86-64.exp: Add new test.
* ld/testsuite/ld-x86-64/sframe-ibt-plt-1.d: New test.
---
bfd/elf64-x86-64.c | 34 +++++++++++++++++++----
bfd/elfxx-x86.c | 29 ++++++++++---------
ld/testsuite/ld-x86-64/sframe-ibt-plt-1.d | 33 ++++++++++++++++++++++
ld/testsuite/ld-x86-64/x86-64.exp | 1 +
4 files changed, 79 insertions(+), 18 deletions(-)
create mode 100644 ld/testsuite/ld-x86-64/sframe-ibt-plt-1.d
diff --git a/bfd/elf64-x86-64.c b/bfd/elf64-x86-64.c
index 4330bbd1648..abbebbe92b4 100644
--- a/bfd/elf64-x86-64.c
+++ b/bfd/elf64-x86-64.c
@@ -906,6 +906,14 @@ static const sframe_frame_row_entry elf_x86_64_sframe_pltn_fre2 =
SFRAME_V1_FRE_INFO (SFRAME_BASE_REG_SP, 1, SFRAME_FRE_OFFSET_1B) /* FRE info. */
};
+/* .sframe FRE covering the .plt section entry for IBT. */
+static const sframe_frame_row_entry elf_x86_64_sframe_ibt_pltn_fre2 =
+{
+ 9, /* SFrame FRE start address. */
+ {16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, /* 12 bytes. */
+ SFRAME_V1_FRE_INFO (SFRAME_BASE_REG_SP, 1, SFRAME_FRE_OFFSET_1B) /* FRE info. */
+};
+
/* .sframe FRE covering the second .plt section entry. */
static const sframe_frame_row_entry elf_x86_64_sframe_sec_pltn_fre1 =
{
@@ -930,7 +938,7 @@ static const struct elf_x86_sframe_plt elf_x86_64_sframe_non_lazy_plt =
{ &elf_x86_64_sframe_null_fre }
};
-/* SFrame helper object for lazy PLT. Also used for IBT enabled PLT. */
+/* SFrame helper object for lazy PLT. */
static const struct elf_x86_sframe_plt elf_x86_64_sframe_plt =
{
LAZY_PLT_ENTRY_SIZE,
@@ -942,9 +950,25 @@ static const struct elf_x86_sframe_plt elf_x86_64_sframe_plt =
/* Array of SFrame FREs for plt. */
{ &elf_x86_64_sframe_pltn_fre1, &elf_x86_64_sframe_pltn_fre2 },
NON_LAZY_PLT_ENTRY_SIZE,
- 1, /* Number of FREs for PLTn for second PLT. */
- /* FREs for second plt (stack trace info for .plt.got is
- identical). Used when IBT or non-lazy PLT is in effect. */
+ 1, /* Number of FREs for second PLT. */
+ /* Array of SFrame FREs for second PLT. */
+ { &elf_x86_64_sframe_sec_pltn_fre1 }
+};
+
+/* SFrame helper object for lazy PLT with IBT. */
+static const struct elf_x86_sframe_plt elf_x86_64_sframe_ibt_plt =
+{
+ LAZY_PLT_ENTRY_SIZE,
+ 2, /* Number of FREs for PLT0. */
+ /* Array of SFrame FREs for plt0. */
+ { &elf_x86_64_sframe_plt0_fre1, &elf_x86_64_sframe_plt0_fre2 },
+ LAZY_PLT_ENTRY_SIZE,
+ 2, /* Number of FREs for PLTn. */
+ /* Array of SFrame FREs for plt. */
+ { &elf_x86_64_sframe_pltn_fre1, &elf_x86_64_sframe_ibt_pltn_fre2 },
+ LAZY_PLT_ENTRY_SIZE,
+ 1, /* Number of FREs for second PLT. */
+ /* Array of SFrame FREs for second plt. */
{ &elf_x86_64_sframe_sec_pltn_fre1 }
};
@@ -5678,7 +5702,7 @@ elf_x86_64_link_setup_gnu_properties (struct bfd_link_info *info)
{
init_table.sframe_lazy_plt = &elf_x86_64_sframe_plt;
init_table.sframe_non_lazy_plt = &elf_x86_64_sframe_non_lazy_plt;
- init_table.sframe_lazy_ibt_plt = &elf_x86_64_sframe_plt;
+ init_table.sframe_lazy_ibt_plt = &elf_x86_64_sframe_ibt_plt;
init_table.sframe_non_lazy_ibt_plt = &elf_x86_64_sframe_non_lazy_plt;
}
else
diff --git a/bfd/elfxx-x86.c b/bfd/elfxx-x86.c
index 0d83ce57d4f..cb90f688b6f 100644
--- a/bfd/elfxx-x86.c
+++ b/bfd/elfxx-x86.c
@@ -1831,7 +1831,6 @@ _bfd_x86_elf_create_sframe_plt (bfd *output_bfd,
struct elf_x86_link_hash_table *htab;
const struct elf_backend_data *bed;
- bool plt0_generated_p;
unsigned int plt0_entry_size;
unsigned char func_info;
uint32_t fre_type;
@@ -1845,14 +1844,11 @@ _bfd_x86_elf_create_sframe_plt (bfd *output_bfd,
unsigned plt_entry_size = 0;
unsigned int num_pltn_fres = 0;
unsigned int num_pltn_entries = 0;
+ const sframe_frame_row_entry * const *pltn_fres;
bed = get_elf_backend_data (output_bfd);
htab = elf_x86_hash_table (info, bed->target_id);
/* Whether SFrame stack trace info for plt0 is to be generated. */
- plt0_generated_p = htab->plt.has_plt0;
- plt0_entry_size
- = (plt0_generated_p) ? htab->sframe_plt->plt0_entry_size : 0;
-
switch (plt_sec_type)
{
case SFRAME_PLT:
@@ -1860,7 +1856,10 @@ _bfd_x86_elf_create_sframe_plt (bfd *output_bfd,
ectx = &htab->plt_cfe_ctx;
dpltsec = htab->elf.splt;
- plt_entry_size = htab->plt.plt_entry_size;
+ plt0_entry_size
+ = htab->plt.has_plt0 ? htab->sframe_plt->plt0_entry_size : 0;
+ plt_entry_size = htab->sframe_plt->pltn_entry_size;
+ pltn_fres = htab->sframe_plt->pltn_fres;
num_pltn_fres = htab->sframe_plt->pltn_num_fres;
num_pltn_entries
= (dpltsec->size - plt0_entry_size) / plt_entry_size;
@@ -1870,12 +1869,15 @@ _bfd_x86_elf_create_sframe_plt (bfd *output_bfd,
case SFRAME_PLT_SEC:
{
ectx = &htab->plt_second_cfe_ctx;
- /* FIXME - this or htab->plt_second_sframe ? */
- dpltsec = htab->plt_second_eh_frame;
+ dpltsec = htab->plt_second;
+
+ plt0_entry_size = 0;
plt_entry_size = htab->sframe_plt->sec_pltn_entry_size;
+ pltn_fres = htab->sframe_plt->sec_pltn_fres;
num_pltn_fres = htab->sframe_plt->sec_pltn_num_fres;
num_pltn_entries = dpltsec->size / plt_entry_size;
+
break;
}
default:
@@ -1897,7 +1899,7 @@ _bfd_x86_elf_create_sframe_plt (bfd *output_bfd,
/* Add SFrame FDE and the associated FREs for plt0 if plt0 has been
generated. */
- if (plt0_generated_p)
+ if (plt0_entry_size)
{
/* Add SFrame FDE for plt0, the function start address is updated later
at _bfd_elf_merge_section_sframe time. */
@@ -1934,16 +1936,17 @@ _bfd_x86_elf_create_sframe_plt (bfd *output_bfd,
plt0_entry_size, /* func start addr. */
dpltsec->size - plt0_entry_size,
func_info,
- 16,
+ plt_entry_size,
0 /* Num FREs. */);
sframe_frame_row_entry pltn_fre;
- /* Now add the FREs for pltn. Simply adding the two FREs suffices due
+ /* Now add the FREs for pltn. Simply adding the FREs suffices due
to the usage of SFRAME_FDE_TYPE_PCMASK above. */
for (unsigned int j = 0; j < num_pltn_fres; j++)
{
- pltn_fre = *(htab->sframe_plt->pltn_fres[j]);
- sframe_encoder_add_fre (*ectx, 1, &pltn_fre);
+ unsigned int func_idx = plt0_entry_size ? 1 : 0;
+ pltn_fre = *(pltn_fres[j]);
+ sframe_encoder_add_fre (*ectx, func_idx, &pltn_fre);
}
}
diff --git a/ld/testsuite/ld-x86-64/sframe-ibt-plt-1.d b/ld/testsuite/ld-x86-64/sframe-ibt-plt-1.d
new file mode 100644
index 00000000000..26be4dfc6a0
--- /dev/null
+++ b/ld/testsuite/ld-x86-64/sframe-ibt-plt-1.d
@@ -0,0 +1,33 @@
+#as: --gsframe
+#source: ibt-plt-3.s
+#objdump: --sframe=.sframe
+#ld: -shared -z ibtplt --no-rosegment
+#name: SFrame for IBT PLT .plt.sec
+
+.*: +file format .*
+
+Contents of the SFrame section .sframe:
+ Header :
+
+ Version: SFRAME_VERSION_2
+ Flags: SFRAME_F_FDE_SORTED
+ CFA fixed RA offset: \-8
+#...
+
+ Function Index :
+
+ func idx \[0\]: pc = 0x1000, size = 16 bytes
+ STARTPC +CFA +FP +RA +
+ 0+1000 +sp\+16 +u +f +
+ 0+1006 +sp\+24 +u +f +
+
+ func idx \[1\]: pc = 0x1010, size = 32 bytes
+ STARTPC\[m\] +CFA +FP +RA +
+ 0+0000 +sp\+8 +u +f +
+ 0+0009 +sp\+16 +u +f +
+
+ func idx \[2\]: pc = 0x1030, size = 32 bytes
+ STARTPC\[m\] +CFA +FP +RA +
+ 0+0000 +sp\+8 +u +f +
+
+#...
diff --git a/ld/testsuite/ld-x86-64/x86-64.exp b/ld/testsuite/ld-x86-64/x86-64.exp
index bd7574d6965..0632397bccc 100644
--- a/ld/testsuite/ld-x86-64/x86-64.exp
+++ b/ld/testsuite/ld-x86-64/x86-64.exp
@@ -547,6 +547,7 @@ run_dump_test "pr32191-x32"
if { ![skip_sframe_tests] } {
run_dump_test "sframe-simple-1"
run_dump_test "sframe-plt-1"
+ run_dump_test "sframe-ibt-plt-1"
}
if ![istarget "x86_64-*-linux*"] {
--
2.43.0
[-- Attachment #4: 0003-binutils.patch --]
[-- Type: text/plain, Size: 14386 bytes --]
From: Indu Bhagat <indu.bhagat@oracle.com>
Date: Fri, 25 Oct 2024 14:33:21 -0700
To: jpoimboe@kernel.org
Cc: Indu Bhagat <indu.bhagat@oracle.com>
Subject: [PATCH 3/3] ld: generate SFrame stack trace info for .plt.got
X-Mailer: git-send-email 2.43.0
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=utf-8
PR/32298 sframe: no SFrame stack trace info generated for .plt.got
Add support to generate SFrame stack trace info for the .plt.got section.
Enhance the current definition of struct elf_x86_sframe_plt to include
initialized SFrame FDEs/FREs applicable to the .plt.got section. There are
two variants of .plt.got entries: 16-byte and 8-byte.
8 byte:
ff 25 00 00 00 00 jmpq *name@GOTPCREL(%rip)
66 90 xchg %ax,%ax
16 byte:
f3 0f 1e fa endbr64
ff 25 66 2f 00 00 jmpq *name@GOTPCREL(%rip)
66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
For the test case, define some application symbols such that their PLT
entries are placed in .plt.got, and ensure SFrame information is generated
with and without -z ibtplt.
ChangeLog:
PR/32298
* bfd/elf64-x86-64.c (elf_x86_64_link_setup_gnu_properties):
PLT GOT entry size is different for IBT vs non IBT PLTs.
* bfd/elfxx-x86.c (enum dynobj_sframe_plt_type): New enum for
SFRAME_PLT_GOT.
(_bfd_x86_elf_create_sframe_plt): Handle SFRAME_PLT_GOT.
(_bfd_x86_elf_write_sframe_plt): Likewise.
(_bfd_x86_elf_late_size_sections): Likewise.
(_bfd_x86_elf_finish_dynamic_sections): Likewise.
* bfd/elfxx-x86.h (struct elf_x86_sframe_plt): Add new members
to keep information about PLT GOT entries.
(struct elf_x86_link_hash_table): Add support for creating
SFrame section for .plt.got.
* ld/testsuite/ld-x86-64/x86-64.exp: Add new tests.
* ld/testsuite/ld-x86-64/sframe-pltgot-1.d: New test.
* ld/testsuite/ld-x86-64/sframe-pltgot-1.s: New test.
* ld/testsuite/ld-x86-64/sframe-pltgot-2.d: New test.
---
bfd/elf64-x86-64.c | 42 ++++++++++--
bfd/elfxx-x86.c | 86 ++++++++++++++++++++++--
bfd/elfxx-x86.h | 6 ++
ld/testsuite/ld-x86-64/sframe-pltgot-1.d | 28 ++++++++
ld/testsuite/ld-x86-64/sframe-pltgot-1.s | 15 +++++
ld/testsuite/ld-x86-64/sframe-pltgot-2.d | 28 ++++++++
ld/testsuite/ld-x86-64/x86-64.exp | 2 +
7 files changed, 198 insertions(+), 9 deletions(-)
create mode 100644 ld/testsuite/ld-x86-64/sframe-pltgot-1.d
create mode 100644 ld/testsuite/ld-x86-64/sframe-pltgot-1.s
create mode 100644 ld/testsuite/ld-x86-64/sframe-pltgot-2.d
diff --git a/bfd/elf64-x86-64.c b/bfd/elf64-x86-64.c
index abbebbe92b4..a62fa621f99 100644
--- a/bfd/elf64-x86-64.c
+++ b/bfd/elf64-x86-64.c
@@ -922,7 +922,7 @@ static const sframe_frame_row_entry elf_x86_64_sframe_sec_pltn_fre1 =
SFRAME_V1_FRE_INFO (SFRAME_BASE_REG_SP, 1, SFRAME_FRE_OFFSET_1B) /* FRE info. */
};
-/* SFrame helper object for non-lazy PLT. Also used for IBT enabled PLT. */
+/* SFrame helper object for non-lazy PLT. */
static const struct elf_x86_sframe_plt elf_x86_64_sframe_non_lazy_plt =
{
LAZY_PLT_ENTRY_SIZE,
@@ -935,7 +935,31 @@ static const struct elf_x86_sframe_plt elf_x86_64_sframe_non_lazy_plt =
{ &elf_x86_64_sframe_sec_pltn_fre1, &elf_x86_64_sframe_null_fre },
0,
0, /* There is no second PLT necessary. */
- { &elf_x86_64_sframe_null_fre }
+ { &elf_x86_64_sframe_null_fre },
+ NON_LAZY_PLT_ENTRY_SIZE,
+ 1, /* Number of FREs for PLT GOT. */
+ /* Array of SFrame FREs for PLT GOT. */
+ { &elf_x86_64_sframe_null_fre },
+};
+
+/* SFrame helper object for non-lazy IBT enabled PLT. */
+static const struct elf_x86_sframe_plt elf_x86_64_sframe_non_lazy_ibt_plt =
+{
+ LAZY_PLT_ENTRY_SIZE,
+ 2, /* Number of FREs for PLT0. */
+ /* Array of SFrame FREs for plt0. */
+ { &elf_x86_64_sframe_plt0_fre1, &elf_x86_64_sframe_plt0_fre2 },
+ LAZY_PLT_ENTRY_SIZE,
+ 1, /* Number of FREs for PLTn. */
+ /* Array of SFrame FREs for plt. */
+ { &elf_x86_64_sframe_sec_pltn_fre1, &elf_x86_64_sframe_null_fre },
+ 0,
+ 0, /* There is no second PLT necessary. */
+ { &elf_x86_64_sframe_null_fre },
+ LAZY_PLT_ENTRY_SIZE,
+ 1, /* Number of FREs for PLT GOT. */
+ /* Array of SFrame FREs for PLT GOT. */
+ { &elf_x86_64_sframe_null_fre },
};
/* SFrame helper object for lazy PLT. */
@@ -952,7 +976,11 @@ static const struct elf_x86_sframe_plt elf_x86_64_sframe_plt =
NON_LAZY_PLT_ENTRY_SIZE,
1, /* Number of FREs for second PLT. */
/* Array of SFrame FREs for second PLT. */
- { &elf_x86_64_sframe_sec_pltn_fre1 }
+ { &elf_x86_64_sframe_sec_pltn_fre1 },
+ NON_LAZY_PLT_ENTRY_SIZE,
+ 1, /* Number of FREs for PLT GOT. */
+ /* Array of SFrame FREs for PLT GOT. */
+ { &elf_x86_64_sframe_null_fre },
};
/* SFrame helper object for lazy PLT with IBT. */
@@ -969,7 +997,11 @@ static const struct elf_x86_sframe_plt elf_x86_64_sframe_ibt_plt =
LAZY_PLT_ENTRY_SIZE,
1, /* Number of FREs for second PLT. */
/* Array of SFrame FREs for second plt. */
- { &elf_x86_64_sframe_sec_pltn_fre1 }
+ { &elf_x86_64_sframe_sec_pltn_fre1 },
+ LAZY_PLT_ENTRY_SIZE,
+ 1, /* Number of FREs for PLT GOT. */
+ /* Array of SFrame FREs for PLT GOT. */
+ { &elf_x86_64_sframe_null_fre },
};
/* These are the standard parameters. */
@@ -5703,7 +5735,7 @@ elf_x86_64_link_setup_gnu_properties (struct bfd_link_info *info)
init_table.sframe_lazy_plt = &elf_x86_64_sframe_plt;
init_table.sframe_non_lazy_plt = &elf_x86_64_sframe_non_lazy_plt;
init_table.sframe_lazy_ibt_plt = &elf_x86_64_sframe_ibt_plt;
- init_table.sframe_non_lazy_ibt_plt = &elf_x86_64_sframe_non_lazy_plt;
+ init_table.sframe_non_lazy_ibt_plt = &elf_x86_64_sframe_non_lazy_ibt_plt;
}
else
{
diff --git a/bfd/elfxx-x86.c b/bfd/elfxx-x86.c
index cb90f688b6f..0843803171b 100644
--- a/bfd/elfxx-x86.c
+++ b/bfd/elfxx-x86.c
@@ -1817,7 +1817,8 @@ elf_x86_relative_reloc_compare (const void *pa, const void *pb)
enum dynobj_sframe_plt_type
{
SFRAME_PLT = 1,
- SFRAME_PLT_SEC = 2
+ SFRAME_PLT_SEC = 2,
+ SFRAME_PLT_GOT = 3,
};
/* Create SFrame stack trace info for the plt entries in the .plt section
@@ -1880,6 +1881,21 @@ _bfd_x86_elf_create_sframe_plt (bfd *output_bfd,
break;
}
+ case SFRAME_PLT_GOT:
+ {
+ ectx = &htab->plt_got_cfe_ctx;
+ dpltsec = htab->plt_got;
+
+ plt0_entry_size = 0;
+
+ plt_entry_size = htab->sframe_plt->plt_got_entry_size;
+ pltn_fres = htab->sframe_plt->plt_got_fres;
+ num_pltn_fres = htab->sframe_plt->plt_got_num_fres;
+ num_pltn_entries = dpltsec->size / plt_entry_size;
+
+ break;
+ }
+
default:
/* No other value is possible. */
return false;
@@ -1984,6 +2000,10 @@ _bfd_x86_elf_write_sframe_plt (bfd *output_bfd,
ectx = htab->plt_second_cfe_ctx;
sec = htab->plt_second_sframe;
break;
+ case SFRAME_PLT_GOT:
+ ectx = htab->plt_got_cfe_ctx;
+ sec = htab->plt_got_sframe;
+ break;
default:
/* No other value is possible. */
return false;
@@ -2511,7 +2531,18 @@ _bfd_x86_elf_late_size_sections (bfd *output_bfd,
htab->plt_sframe->size = sizeof (sframe_header) + 1;
}
- /* FIXME - generate for .plt.got ? */
+ if (htab->plt_got_sframe != NULL
+ && htab->plt_got != NULL
+ && htab->plt_got->size != 0
+ && !bfd_is_abs_section (htab->plt_got->output_section))
+ {
+ _bfd_x86_elf_create_sframe_plt (output_bfd, info, SFRAME_PLT_GOT);
+ /* FIXME - Dirty Hack. Set the size to something non-zero for now,
+ so that the section does not get stripped out below. The precise
+ size of this section is known only when the contents are
+ serialized in _bfd_x86_elf_write_sframe_plt. */
+ htab->plt_got_sframe->size = sizeof (sframe_header) + 1;
+ }
if (htab->plt_second_sframe != NULL
&& htab->plt_second != NULL
@@ -2578,6 +2609,7 @@ _bfd_x86_elf_late_size_sections (bfd *output_bfd,
|| s == htab->plt_second_eh_frame
|| s == htab->plt_sframe
|| s == htab->plt_second_sframe
+ || s == htab->plt_got_sframe
|| s == htab->elf.sdynbss
|| s == htab->elf.sdynrelro)
{
@@ -2622,7 +2654,8 @@ _bfd_x86_elf_late_size_sections (bfd *output_bfd,
/* Skip allocating contents for .sframe section as it is written
out differently. See below. */
- if ((s == htab->plt_sframe) || (s == htab->plt_second_sframe))
+ if ((s == htab->plt_sframe) || (s == htab->plt_second_sframe)
+ || (s == htab->plt_got_sframe))
continue;
/* NB: Initially, the iplt section has minimal alignment to
@@ -2687,6 +2720,12 @@ _bfd_x86_elf_late_size_sections (bfd *output_bfd,
&& htab->plt_second->size != 0
&& htab->plt_second_sframe->contents == NULL)
_bfd_x86_elf_write_sframe_plt (output_bfd, info, SFRAME_PLT_SEC);
+
+ if (htab->plt_got_sframe != NULL
+ && htab->plt_got != NULL
+ && htab->plt_got->size != 0
+ && htab->plt_got_sframe->contents == NULL)
+ _bfd_x86_elf_write_sframe_plt (output_bfd, info, SFRAME_PLT_GOT);
}
if (resolved_plt != NULL
@@ -2997,6 +3036,34 @@ _bfd_x86_elf_finish_dynamic_sections (bfd *output_bfd,
return NULL;
}
}
+
+ if (htab->plt_got_sframe != NULL
+ && htab->plt_got_sframe->contents != NULL)
+ {
+ if (htab->plt_got != NULL
+ && htab->plt_got->size != 0
+ && (htab->plt_got->flags & SEC_EXCLUDE) == 0
+ && htab->plt_got->output_section != NULL
+ && htab->plt_got_sframe->output_section != NULL)
+ {
+ bfd_vma plt_start = htab->plt_got->output_section->vma;
+ bfd_vma sframe_start
+ = (htab->plt_got_sframe->output_section->vma
+ + htab->plt_got_sframe->output_offset
+ + PLT_SFRAME_FDE_START_OFFSET);
+ bfd_put_signed_32 (dynobj, plt_start - sframe_start,
+ htab->plt_got_sframe->contents
+ + PLT_SFRAME_FDE_START_OFFSET);
+ }
+ if (htab->plt_got_sframe->sec_info_type == SEC_INFO_TYPE_SFRAME)
+ {
+ if (! _bfd_elf_merge_section_sframe (output_bfd, info,
+ htab->plt_got_sframe,
+ htab->plt_got_sframe->contents))
+ return NULL;
+ }
+ }
+
if (htab->elf.sgot && htab->elf.sgot->size > 0)
elf_section_data (htab->elf.sgot->output_section)->this_hdr.sh_entsize
= htab->got_entry_size;
@@ -4764,7 +4831,18 @@ _bfd_x86_elf_link_setup_gnu_properties
htab->plt_second_sframe = sec;
}
- /* FIXME - add later for plt_got. */
+
+ /* .plt.got. */
+ if (htab->plt_got != NULL)
+ {
+ sec = bfd_make_section_anyway_with_flags (dynobj,
+ ".sframe",
+ flags);
+ if (sec == NULL)
+ info->callbacks->einfo (_("%F%P: failed to create PLT GOT .sframe section\n"));
+
+ htab->plt_got_sframe = sec;
+ }
}
}
diff --git a/bfd/elfxx-x86.h b/bfd/elfxx-x86.h
index b042d45c282..cd26e8fc6f2 100644
--- a/bfd/elfxx-x86.h
+++ b/bfd/elfxx-x86.h
@@ -401,6 +401,10 @@ struct elf_x86_sframe_plt
unsigned int sec_pltn_entry_size;
unsigned int sec_pltn_num_fres;
const sframe_frame_row_entry *sec_pltn_fres[SFRAME_PLTN_MAX_NUM_FRES];
+
+ unsigned int plt_got_entry_size;
+ unsigned int plt_got_num_fres;
+ const sframe_frame_row_entry *plt_got_fres[SFRAME_PLTN_MAX_NUM_FRES];
};
struct elf_x86_lazy_plt_layout
@@ -606,6 +610,8 @@ struct elf_x86_link_hash_table
asection *plt_sframe;
sframe_encoder_ctx *plt_second_cfe_ctx;
asection *plt_second_sframe;
+ sframe_encoder_ctx *plt_got_cfe_ctx;
+ asection *plt_got_sframe;
/* Parameters describing PLT generation, lazy or non-lazy. */
struct elf_x86_plt_layout plt;
diff --git a/ld/testsuite/ld-x86-64/sframe-pltgot-1.d b/ld/testsuite/ld-x86-64/sframe-pltgot-1.d
new file mode 100644
index 00000000000..e26a741b635
--- /dev/null
+++ b/ld/testsuite/ld-x86-64/sframe-pltgot-1.d
@@ -0,0 +1,28 @@
+#as: --gsframe
+#source: sframe-pltgot-1.s
+#objdump: --sframe=.sframe
+#ld: -shared -z ibtplt --no-rosegment
+#name: SFrame for .plt.got
+
+.*: +file format .*
+
+Contents of the SFrame section .sframe:
+ Header :
+
+ Version: SFRAME_VERSION_2
+ Flags: SFRAME_F_FDE_SORTED
+ CFA fixed RA offset: \-8
+#...
+
+ Function Index :
+
+ func idx \[0\]: pc = 0x1000, size = 16 bytes
+ STARTPC +CFA +FP +RA +
+ 0+1000 +sp\+16 +u +f +
+ 0+1006 +sp\+24 +u +f +
+
+ func idx \[1\]: pc = 0x1010, size = 64 bytes
+ STARTPC\[m\] +CFA +FP +RA +
+ 0+0000 +sp\+16 +u +f +
+
+#...
diff --git a/ld/testsuite/ld-x86-64/sframe-pltgot-1.s b/ld/testsuite/ld-x86-64/sframe-pltgot-1.s
new file mode 100644
index 00000000000..e596e846439
--- /dev/null
+++ b/ld/testsuite/ld-x86-64/sframe-pltgot-1.s
@@ -0,0 +1,15 @@
+ .text
+ .globl foo
+ .type foo, @function
+foo:
+ .cfi_startproc
+ call func1@plt
+ movq func1@GOTPCREL(%rip), %rax
+ call func2@plt
+ movq func2@GOTPCREL(%rip), %rax
+ call func3@plt
+ movq func3@GOTPCREL(%rip), %rax
+ call func4@plt
+ movq func4@GOTPCREL(%rip), %rax
+ .cfi_endproc
+ .section .note.GNU-stack,"",@progbits
diff --git a/ld/testsuite/ld-x86-64/sframe-pltgot-2.d b/ld/testsuite/ld-x86-64/sframe-pltgot-2.d
new file mode 100644
index 00000000000..7a99d12faa8
--- /dev/null
+++ b/ld/testsuite/ld-x86-64/sframe-pltgot-2.d
@@ -0,0 +1,28 @@
+#as: --gsframe
+#source: sframe-pltgot-1.s
+#objdump: --sframe=.sframe
+#ld: -shared --no-rosegment
+#name: SFrame for .plt.got
+
+.*: +file format .*
+
+Contents of the SFrame section .sframe:
+ Header :
+
+ Version: SFRAME_VERSION_2
+ Flags: SFRAME_F_FDE_SORTED
+ CFA fixed RA offset: \-8
+#...
+
+ Function Index :
+
+ func idx \[0\]: pc = 0x1000, size = 16 bytes
+ STARTPC +CFA +FP +RA +
+ 0+1000 +sp\+16 +u +f +
+ 0+1006 +sp\+24 +u +f +
+
+ func idx \[1\]: pc = 0x1010, size = 32 bytes
+ STARTPC\[m\] +CFA +FP +RA +
+ 0+0000 +sp\+16 +u +f +
+
+#...
diff --git a/ld/testsuite/ld-x86-64/x86-64.exp b/ld/testsuite/ld-x86-64/x86-64.exp
index 0632397bccc..a66d28eb0d1 100644
--- a/ld/testsuite/ld-x86-64/x86-64.exp
+++ b/ld/testsuite/ld-x86-64/x86-64.exp
@@ -548,6 +548,8 @@ if { ![skip_sframe_tests] } {
run_dump_test "sframe-simple-1"
run_dump_test "sframe-plt-1"
run_dump_test "sframe-ibt-plt-1"
+ run_dump_test "sframe-pltgot-1"
+ run_dump_test "sframe-pltgot-2"
}
if ![istarget "x86_64-*-linux*"] {
--
2.43.0
[-- Attachment #5: glibc.patch --]
[-- Type: text/plain, Size: 2267 bytes --]
diff --git a/elf/dl-load.c b/elf/dl-load.c
index 2923b1141d..ffaf722f5f 100644
--- a/elf/dl-load.c
+++ b/elf/dl-load.c
@@ -29,6 +29,7 @@
#include <bits/wordsize.h>
#include <sys/mman.h>
#include <sys/param.h>
+#include <sys/prctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <gnu/lib-names.h>
@@ -88,6 +89,10 @@ struct filebuf
#define STRING(x) __STRING (x)
+#ifndef PT_GNU_SFRAME
+#define PT_GNU_SFRAME 0x6474e554
+#endif
+
int __stack_prot attribute_hidden attribute_relro
#if _STACK_GROWS_DOWN && defined PROT_GROWSDOWN
@@ -1213,6 +1218,10 @@ _dl_map_object_from_fd (const char *name, const char *origname, int fd,
l->l_relro_addr = ph->p_vaddr;
l->l_relro_size = ph->p_memsz;
break;
+
+ case PT_GNU_SFRAME:
+ l->l_sframe_addr = ph->p_vaddr;
+ break;
}
if (__glibc_unlikely (nloadcmds == 0))
@@ -1376,6 +1385,13 @@ cannot enable executable stack as shared object requires");
break;
}
+#define PR_ADD_SFRAME 74
+ if (l->l_sframe_addr != 0)
+ {
+ l->l_sframe_addr += l->l_addr;
+ __prctl(PR_ADD_SFRAME, l->l_sframe_addr, NULL, NULL, NULL);
+ }
+
/* We are done mapping in the file. We no longer need the descriptor. */
if (__glibc_unlikely (__close_nocancel (fd) != 0))
{
diff --git a/elf/dl-unmap-segments.h b/elf/dl-unmap-segments.h
index 1ec507e887..c4a7e9e0ed 100644
--- a/elf/dl-unmap-segments.h
+++ b/elf/dl-unmap-segments.h
@@ -21,14 +21,20 @@
#include <link.h>
#include <sys/mman.h>
+#include <sys/prctl.h>
/* _dl_map_segments ensures that any whole pages in gaps between segments
are filled in with PROT_NONE mappings. So we can just unmap the whole
range in one fell swoop. */
+#define PR_REMOVE_SFRAME 75
+
static __always_inline void
_dl_unmap_segments (struct link_map *l)
{
+ if (l->l_sframe_addr != 0)
+ __prctl(PR_REMOVE_SFRAME, l->l_sframe_addr, NULL, NULL, NULL);
+
__munmap ((void *) l->l_map_start, l->l_map_end - l->l_map_start);
}
diff --git a/include/link.h b/include/link.h
index c6af095d87..36ac75680f 100644
--- a/include/link.h
+++ b/include/link.h
@@ -348,6 +348,8 @@ struct link_map
ElfW(Addr) l_relro_addr;
size_t l_relro_size;
+ ElfW(Addr) l_sframe_addr;
+
unsigned long long int l_serial;
};
* Re: [PATCH v3 00/19] unwind, perf: sframe user space unwinding
2024-10-28 21:47 [PATCH v3 00/19] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (20 preceding siblings ...)
2024-10-28 21:54 ` Josh Poimboeuf
@ 2024-10-28 23:55 ` Josh Poimboeuf
2024-10-29 14:08 ` Peter Zijlstra
22 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-28 23:55 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Mon, Oct 28, 2024 at 02:47:27PM -0700, Josh Poimboeuf wrote:
> This has all the changes discussed in v2, plus VDSO sframe support and
> Namhyung's perf tool patches (see detailed changelog below).
>
> I did quite a bit of testing, it seems to work well. It still needs
> some binutils and glibc patches which I'll send in a reply.
>
> Questions for perf experts:
>
> - Is the perf_event lifetime managed correctly or do we need to do
> something to ensure it exists in unwind_user_task_work()?
>
> Or alternatively is the original perf_event even needed in
> unwind_user_task_work() or can a new one be created on demand?
>
> - Is --call-graph=sframe needed for consistency?
>
> - Should perf use the context cookie? Note that because the callback
> is usually only called once for multiple NMIs in the same entry
> context, it's possible for the PERF_RECORD_CALLCHAIN_DEFERRED event
> to arrive *before* some of the corresponding kernel events. The
> context cookie disambiguates the corner cases.
>
> Based on tip/master.
>
> Also at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git sframe-v3
Argh, apparently it's a bad idea to pass "*.patch" twice on the
git-send-email cmdline ;-) Sorry for sending it twice!
--
Josh
* Re: [PATCH v3 08/19] unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_FP
2024-10-28 21:47 ` [PATCH v3 08/19] unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_FP Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
@ 2024-10-29 13:13 ` Peter Zijlstra
2024-10-29 16:31 ` Josh Poimboeuf
1 sibling, 1 reply; 119+ messages in thread
From: Peter Zijlstra @ 2024-10-29 13:13 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Mon, Oct 28, 2024 at 02:47:35PM -0700, Josh Poimboeuf wrote:
> Use ARCH_INIT_USER_FP_FRAME to describe how frame pointers are unwound
> on x86, and enable CONFIG_HAVE_UNWIND_USER_FP accordingly so the
> unwind_user interfaces can be used.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/unwind_user.h | 11 +++++++++++
> 2 files changed, 12 insertions(+)
> create mode 100644 arch/x86/include/asm/unwind_user.h
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 0bdb7a394f59..f91098d6f535 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -289,6 +289,7 @@ config X86
> select HAVE_SYSCALL_TRACEPOINTS
> select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL
> select HAVE_UNSTABLE_SCHED_CLOCK
> + select HAVE_UNWIND_USER_FP if X86_64
> select HAVE_USER_RETURN_NOTIFIER
> select HAVE_GENERIC_VDSO
> select VDSO_GETRANDOM if X86_64
> diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
> new file mode 100644
> index 000000000000..19df26a65132
> --- /dev/null
> +++ b/arch/x86/include/asm/unwind_user.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_UNWIND_USER_H
> +#define _ASM_X86_UNWIND_USER_H
> +
> +#define ARCH_INIT_USER_FP_FRAME \
> + .ra_off = (s32)sizeof(long) * -1, \
> + .cfa_off = (s32)sizeof(long) * 2, \
> + .fp_off = (s32)sizeof(long) * -2, \
> + .use_fp = true,
> +
> +#endif /* _ASM_X86_UNWIND_USER_H */
What about compat / 32bit userspace?
* Re: [PATCH v3 10/19] unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_SFRAME
2024-10-28 21:47 ` [PATCH v3 10/19] unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_SFRAME Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
@ 2024-10-29 13:14 ` Peter Zijlstra
1 sibling, 0 replies; 119+ messages in thread
From: Peter Zijlstra @ 2024-10-29 13:14 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Mon, Oct 28, 2024 at 02:47:37PM -0700, Josh Poimboeuf wrote:
> Binutils 2.41 supports generating sframe v2 for x86_64. It works well
> in testing so enable it.
There is no sframe for 32bit?
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-10-28 21:47 ` [PATCH v3 09/19] unwind: Introduce sframe user space unwinding Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
@ 2024-10-29 13:27 ` Peter Zijlstra
2024-10-29 16:50 ` Josh Poimboeuf
2024-10-29 23:32 ` Andrii Nakryiko
` (5 subsequent siblings)
7 siblings, 1 reply; 119+ messages in thread
From: Peter Zijlstra @ 2024-10-29 13:27 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Mon, Oct 28, 2024 at 02:47:36PM -0700, Josh Poimboeuf wrote:
> +static int __sframe_add_section(unsigned long sframe_addr,
> + unsigned long text_start,
> + unsigned long text_end)
> +{
> + struct maple_tree *sframe_mt = &current->mm->sframe_mt;
> + struct sframe_section *sec;
> + struct sframe_header shdr;
> + unsigned long header_end;
> + int ret;
> +
> + if (copy_from_user(&shdr, (void __user *)sframe_addr, sizeof(shdr)))
> + return -EFAULT;
> +
> + if (shdr.preamble.magic != SFRAME_MAGIC ||
> + shdr.preamble.version != SFRAME_VERSION_2 ||
> + !(shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
> + shdr.auxhdr_len || !shdr.num_fdes || !shdr.num_fres ||
> + shdr.fdes_off > shdr.fres_off) {
> + return -EINVAL;
> + }
> +
> + sec = kmalloc(sizeof(*sec), GFP_KERNEL);
> + if (!sec)
> + return -ENOMEM;
> +
> + header_end = sframe_addr + SFRAME_HDR_SIZE(shdr);
> +
> + sec->sframe_addr = sframe_addr;
> + sec->text_addr = text_start;
> + sec->fdes_addr = header_end + shdr.fdes_off;
> + sec->fres_addr = header_end + shdr.fres_off;
> + sec->fdes_nr = shdr.num_fdes;
> + sec->ra_off = shdr.cfa_fixed_ra_offset;
> + sec->fp_off = shdr.cfa_fixed_fp_offset;
> +
> + ret = mtree_insert_range(sframe_mt, text_start, text_end, sec, GFP_KERNEL);
> + if (ret) {
> + kfree(sec);
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> +int sframe_add_section(unsigned long sframe_addr, unsigned long text_start,
> + unsigned long text_end)
> +{
> + struct mm_struct *mm = current->mm;
> + struct vm_area_struct *sframe_vma;
> +
> + mmap_read_lock(mm);
DEFINE_GUARD(mmap_read_lock, struct mm_struct *,
mmap_read_lock(_T), mmap_read_unlock(_T))
in include/linux/mmap_lock.h ?
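With such a guard, the err_unlock label below can go away entirely; a
minimal sketch of the resulting shape (assuming the guard above gets
added):

	scoped_guard(mmap_read_lock, mm) {
		sframe_vma = vma_lookup(mm, sframe_addr);
		if (!sframe_vma)
			return -EINVAL;

		/* ... same VMA checks as below, with plain returns
		 * instead of 'goto err_unlock' ... */
	}

	return __sframe_add_section(sframe_addr, text_start, text_end);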
> +
> + sframe_vma = vma_lookup(mm, sframe_addr);
> + if (!sframe_vma)
> + goto err_unlock;
> +
> + if (text_start && text_end) {
> + struct vm_area_struct *text_vma;
> +
> + text_vma = vma_lookup(mm, text_start);
> + if (!(text_vma->vm_flags & VM_EXEC))
> + goto err_unlock;
> +
> + if (PAGE_ALIGN(text_end) != text_vma->vm_end)
> + goto err_unlock;
> + } else {
> + struct vm_area_struct *vma, *text_vma = NULL;
> + VMA_ITERATOR(vmi, mm, 0);
> +
> + for_each_vma(vmi, vma) {
> + if (vma->vm_file != sframe_vma->vm_file ||
> + !(vma->vm_flags & VM_EXEC))
> + continue;
> +
> + if (text_vma) {
> + pr_warn_once("%s[%d]: multiple EXEC segments unsupported\n",
> + current->comm, current->pid);
> + goto err_unlock;
> + }
> +
> + text_vma = vma;
> + }
> +
> + if (!text_vma)
> + goto err_unlock;
> +
> + text_start = text_vma->vm_start;
> + text_end = text_vma->vm_end;
> + }
> +
> + mmap_read_unlock(mm);
> +
> + return __sframe_add_section(sframe_addr, text_start, text_end);
> +
> +err_unlock:
> + mmap_read_unlock(mm);
> + return -EINVAL;
> +}
> +int sframe_remove_section(unsigned long sframe_addr)
> +{
> + struct mm_struct *mm = current->mm;
> + struct sframe_section *sec;
> + unsigned long index = 0;
> +
> + mt_for_each(&mm->sframe_mt, sec, index, ULONG_MAX) {
> + if (sec->sframe_addr == sframe_addr)
> + return __sframe_remove_section(mm, sec);
> + }
> +
> + return -EINVAL;
> +}
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -64,6 +64,7 @@
> #include <linux/rcupdate.h>
> #include <linux/uidgid.h>
> #include <linux/cred.h>
> +#include <linux/sframe.h>
>
> #include <linux/nospec.h>
>
> @@ -2784,6 +2785,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> case PR_RISCV_SET_ICACHE_FLUSH_CTX:
> error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
> break;
> + case PR_ADD_SFRAME:
> + if (arg5)
> + return -EINVAL;
> + error = sframe_add_section(arg2, arg3, arg4);
> + break;
> + case PR_REMOVE_SFRAME:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + error = sframe_remove_section(arg2);
> + break;
> default:
> error = -EINVAL;
> break;
So I realize that mtree has an internal lock, but are we sure we don't
want a lock around those prctl()s?
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-28 21:47 ` [PATCH v3 11/19] unwind: Add deferred user space unwinding API Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
@ 2024-10-29 13:48 ` Peter Zijlstra
2024-10-29 16:51 ` Josh Poimboeuf
2024-10-29 13:49 ` Peter Zijlstra
` (2 subsequent siblings)
4 siblings, 1 reply; 119+ messages in thread
From: Peter Zijlstra @ 2024-10-29 13:48 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Mon, Oct 28, 2024 at 02:47:38PM -0700, Josh Poimboeuf wrote:
> +/*
> + * The "context cookie" is a unique identifier which allows post-processing to
> + * correlate kernel trace(s) with user unwinds. It has the CPU id in the
> + * highest 16 bits and a per-CPU entry counter in the lower 48 bits.
> + */
> +static u64 ctx_to_cookie(u64 cpu, u64 ctx)
> +{
> + BUILD_BUG_ON(NR_CPUS > 65535);
> + return (ctx & ((1UL << 48) - 1)) | cpu;
> +}
Did you mean to: (cpu << 48) ?
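That is, presumably, to match the comment above (CPU id in the highest
16 bits):

	return (ctx & ((1UL << 48) - 1)) | (cpu << 48);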
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-28 21:47 ` [PATCH v3 11/19] unwind: Add deferred user space unwinding API Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
2024-10-29 13:48 ` Peter Zijlstra
@ 2024-10-29 13:49 ` Peter Zijlstra
2024-10-29 17:05 ` Josh Poimboeuf
2024-10-29 13:56 ` Peter Zijlstra
2024-10-29 23:32 ` Andrii Nakryiko
4 siblings, 1 reply; 119+ messages in thread
From: Peter Zijlstra @ 2024-10-29 13:49 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Mon, Oct 28, 2024 at 02:47:38PM -0700, Josh Poimboeuf wrote:
> +static void unwind_user_task_work(struct callback_head *head)
> +{
...
> + guard(rwsem_read)(&callbacks_rwsem);
> +
> + for_each_set_bit(i, &pending, UNWIND_MAX_CALLBACKS) {
> + if (callbacks[i])
> + callbacks[i]->func(&trace, cookie, privs[i]);
> + }
I'm fairly sure people will come with pitchforks for that read-lock
there. They scale like shit on big systems. Please use SRCU or somesuch.
> +}
> +
> +int unwind_user_register(struct unwind_callback *callback, unwind_callback_t func)
> +{
> + scoped_guard(rwsem_write, &callbacks_rwsem) {
> + for (int i = 0; i < UNWIND_MAX_CALLBACKS; i++) {
> + if (!callbacks[i]) {
> + callback->func = func;
> + callback->idx = i;
> + callbacks[i] = callback;
> + return 0;
> + }
> + }
> + }
> +
> + callback->func = NULL;
> + callback->idx = -1;
> + return -ENOSPC;
> +}
> +
> +int unwind_user_unregister(struct unwind_callback *callback)
> +{
> + if (callback->idx < 0)
> + return -EINVAL;
> +
> + scoped_guard(rwsem_write, &callbacks_rwsem)
> + callbacks[callback->idx] = NULL;
> +
> + callback->func = NULL;
> + callback->idx = -1;
> +
> + return 0;
> +}
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-28 21:47 ` [PATCH v3 11/19] unwind: Add deferred user space unwinding API Josh Poimboeuf
` (2 preceding siblings ...)
2024-10-29 13:49 ` Peter Zijlstra
@ 2024-10-29 13:56 ` Peter Zijlstra
2024-10-29 17:17 ` Josh Poimboeuf
2024-10-29 23:32 ` Andrii Nakryiko
4 siblings, 1 reply; 119+ messages in thread
From: Peter Zijlstra @ 2024-10-29 13:56 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Mon, Oct 28, 2024 at 02:47:38PM -0700, Josh Poimboeuf wrote:
> + * The only exception is when the task has migrated to another CPU, *and* this
> + * is called while the task work is running (or has already run). Then a new
> + * cookie will be generated and the callback will be called again for the new
> + * cookie.
So that's a bit crap. The user stack won't change for having been
migrated.
So perf can readily use the full u64 cookie value as a sequence number,
since the whole perf record will already have the TID of the task in.
Mixing in this CPU number for no good reason and causing trouble like
this just doesn't make sense to me.
If ftrace needs brain damage like this, can't we push this to the user?
That is, do away with the per-cpu sequence crap, and add a per-task
counter that is incremented for every return-to-userspace.
* Re: [PATCH v3 15/19] perf: Add deferred user callchains
2024-10-28 21:47 ` [PATCH v3 15/19] perf: Add deferred user callchains Josh Poimboeuf
2024-10-28 21:48 ` Josh Poimboeuf
@ 2024-10-29 14:06 ` Peter Zijlstra
2024-11-06 9:45 ` Jens Remus
2 siblings, 0 replies; 119+ messages in thread
From: Peter Zijlstra @ 2024-10-29 14:06 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Mon, Oct 28, 2024 at 02:47:42PM -0700, Josh Poimboeuf wrote:
> diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> index 2278402b7ac9..eeb15ba0137f 100644
> --- a/kernel/events/callchain.c
> +++ b/kernel/events/callchain.c
> @@ -217,7 +217,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
>
> struct perf_callchain_entry *
> get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> - u32 max_stack, bool add_mark)
> + u32 max_stack, bool add_mark, bool defer_user)
> {
> struct perf_callchain_entry *entry;
> struct perf_callchain_entry_ctx ctx;
> @@ -246,6 +246,15 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> regs = task_pt_regs(current);
> }
>
> + if (defer_user) {
> + /*
> + * Foretell the coming of PERF_RECORD_CALLCHAIN_DEFERRED
> + * which can be stitched to this one.
> + */
> + perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
> + goto exit_put;
> + }
> +
> if (add_mark)
> perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
Given that the whole deferred thing doesn't handle COMPAT, this will
break things, no?
* Re: [PATCH v3 00/19] unwind, perf: sframe user space unwinding
2024-10-28 21:47 [PATCH v3 00/19] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (21 preceding siblings ...)
2024-10-28 23:55 ` Josh Poimboeuf
@ 2024-10-29 14:08 ` Peter Zijlstra
22 siblings, 0 replies; 119+ messages in thread
From: Peter Zijlstra @ 2024-10-29 14:08 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Mon, Oct 28, 2024 at 02:47:27PM -0700, Josh Poimboeuf wrote:
> - Is the perf_event lifetime managed correctly or do we need to do
> something to ensure it exists in unwind_user_task_work()?
You probably want a sync/cancel somewhere near the start of
_free_event().
* Re: [PATCH v3 08/19] unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_FP
2024-10-29 13:13 ` Peter Zijlstra
@ 2024-10-29 16:31 ` Josh Poimboeuf
2024-10-29 18:08 ` Peter Zijlstra
0 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-29 16:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Tue, Oct 29, 2024 at 02:13:03PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 28, 2024 at 02:47:35PM -0700, Josh Poimboeuf wrote:
> > Use ARCH_INIT_USER_FP_FRAME to describe how frame pointers are unwound
> > on x86, and enable CONFIG_HAVE_UNWIND_USER_FP accordingly so the
> > unwind_user interfaces can be used.
> >
> > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> > ---
> > arch/x86/Kconfig | 1 +
> > arch/x86/include/asm/unwind_user.h | 11 +++++++++++
> > 2 files changed, 12 insertions(+)
> > create mode 100644 arch/x86/include/asm/unwind_user.h
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 0bdb7a394f59..f91098d6f535 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -289,6 +289,7 @@ config X86
> > select HAVE_SYSCALL_TRACEPOINTS
> > select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL
> > select HAVE_UNSTABLE_SCHED_CLOCK
> > + select HAVE_UNWIND_USER_FP if X86_64
> > select HAVE_USER_RETURN_NOTIFIER
> > select HAVE_GENERIC_VDSO
> > select VDSO_GETRANDOM if X86_64
> > diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
> > new file mode 100644
> > index 000000000000..19df26a65132
> > --- /dev/null
> > +++ b/arch/x86/include/asm/unwind_user.h
> > @@ -0,0 +1,11 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ASM_X86_UNWIND_USER_H
> > +#define _ASM_X86_UNWIND_USER_H
> > +
> > +#define ARCH_INIT_USER_FP_FRAME \
> > + .ra_off = (s32)sizeof(long) * -1, \
> > + .cfa_off = (s32)sizeof(long) * 2, \
> > + .fp_off = (s32)sizeof(long) * -2, \
> > + .use_fp = true,
> > +
> > +#endif /* _ASM_X86_UNWIND_USER_H */
>
> What about compat / 32bit userspace?
Sframe doesn't support 32-bit binaries at the moment. Does anybody
actually care?
You're right this regresses existing perf behavior for the frame pointer
case. I'll try to fix that.
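For the record, a compat variant would presumably just shrink the
frame slots to 32 bits and get selected per-task at unwind time; a
rough sketch (macro name hypothetical):

	#define ARCH_INIT_USER_COMPAT_FP_FRAME			\
		.ra_off  = (s32)sizeof(u32) * -1,		\
		.cfa_off = (s32)sizeof(u32) *  2,		\
		.fp_off  = (s32)sizeof(u32) * -2,		\
		.use_fp  = true,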
--
Josh
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-10-29 13:27 ` Peter Zijlstra
@ 2024-10-29 16:50 ` Josh Poimboeuf
2024-10-29 18:10 ` Peter Zijlstra
0 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-29 16:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Tue, Oct 29, 2024 at 02:27:09PM +0100, Peter Zijlstra wrote:
> > +int sframe_add_section(unsigned long sframe_addr, unsigned long text_start,
> > + unsigned long text_end)
> > +{
> > + struct mm_struct *mm = current->mm;
> > + struct vm_area_struct *sframe_vma;
> > +
> > + mmap_read_lock(mm);
>
> DEFINE_GUARD(mmap_read_lock, struct mm_struct *,
> mmap_read_lock(_T), mmap_read_unlock(_T))
>
> in include/linux/mmap_lock.h ?
Will do.
> > @@ -2784,6 +2785,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> > case PR_RISCV_SET_ICACHE_FLUSH_CTX:
> > error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
> > break;
> > + case PR_ADD_SFRAME:
> > + if (arg5)
> > + return -EINVAL;
> > + error = sframe_add_section(arg2, arg3, arg4);
> > + break;
> > + case PR_REMOVE_SFRAME:
> > + if (arg3 || arg4 || arg5)
> > + return -EINVAL;
> > + error = sframe_remove_section(arg2);
> > + break;
> > default:
> > error = -EINVAL;
> > break;
>
> So I realize that mtree has an internal lock, but are we sure we don't
> want a lock around those prctl()s?
Not that I can tell? It relies on the mtree internal locking for
atomicity.
For sframe_remove_section() it uses srcu to delay the freeing until all
sframe_find()'s have completed.
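Roughly this pattern, i.e. erase first, free after a grace period
(sketch; assumes sframe_section embeds an rcu_head, plus a global
sframe_srcu and a free callback along these lines):

	static int __sframe_remove_section(struct mm_struct *mm,
					   struct sframe_section *sec)
	{
		/* mtree_erase() returns the removed entry (or NULL) */
		if (!mtree_erase(&mm->sframe_mt, sec->text_addr))
			return -EINVAL;

		/* free only after all SRCU readers are done */
		call_srcu(&sframe_srcu, &sec->rcu, sframe_free_srcu);
		return 0;
	}

with sframe_find() doing its mtree_load() and user-space reads inside
srcu_read_lock(&sframe_srcu) / srcu_read_unlock().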
--
Josh
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-29 13:48 ` Peter Zijlstra
@ 2024-10-29 16:51 ` Josh Poimboeuf
0 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-29 16:51 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Tue, Oct 29, 2024 at 02:48:07PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 28, 2024 at 02:47:38PM -0700, Josh Poimboeuf wrote:
>
> > +/*
> > + * The "context cookie" is a unique identifier which allows post-processing to
> > + * correlate kernel trace(s) with user unwinds. It has the CPU id in the
> > + * highest 16 bits and a per-CPU entry counter in the lower 48 bits.
> > + */
> > +static u64 ctx_to_cookie(u64 cpu, u64 ctx)
> > +{
> > + BUILD_BUG_ON(NR_CPUS > 65535);
> > + return (ctx & ((1UL << 48) - 1)) | cpu;
> > +}
>
> Did you mean to: (cpu << 48) ?
Indeed... that was the victim of a late refactoring.
--
Josh
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-29 13:49 ` Peter Zijlstra
@ 2024-10-29 17:05 ` Josh Poimboeuf
2024-10-29 18:11 ` Peter Zijlstra
0 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-29 17:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Tue, Oct 29, 2024 at 02:49:18PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 28, 2024 at 02:47:38PM -0700, Josh Poimboeuf wrote:
>
> > +static void unwind_user_task_work(struct callback_head *head)
> > +{
> ...
> > + guard(rwsem_read)(&callbacks_rwsem);
> > +
> > + for_each_set_bit(i, &pending, UNWIND_MAX_CALLBACKS) {
> > + if (callbacks[i])
> > + callbacks[i]->func(&trace, cookie, privs[i]);
> > + }
>
> I'm fairly sure people will come with pitchforks for that read-lock
> there. They scale like shit on big systems. Please use SRCU or somesuch.
I'd expect that unwind_user_{register,unregister}() would only be called
once per tracing component during boot so there wouldn't be any
contention.
But I think I can make SRCU work here.
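Something like this, perhaps (untested sketch; the callbacks array
becomes __rcu and register/unregister still serialize on a mutex):

	DEFINE_STATIC_SRCU(callbacks_srcu);

	/* in unwind_user_task_work(): */
	int idx = srcu_read_lock(&callbacks_srcu);

	for_each_set_bit(i, &pending, UNWIND_MAX_CALLBACKS) {
		struct unwind_callback *cb;

		cb = srcu_dereference(callbacks[i], &callbacks_srcu);
		if (cb)
			cb->func(&trace, cookie, privs[i]);
	}

	srcu_read_unlock(&callbacks_srcu, idx);

	/* in unwind_user_unregister(): */
	rcu_assign_pointer(callbacks[callback->idx], NULL);
	synchronize_srcu(&callbacks_srcu);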
--
Josh
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-29 13:56 ` Peter Zijlstra
@ 2024-10-29 17:17 ` Josh Poimboeuf
2024-10-29 17:47 ` Mathieu Desnoyers
0 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-29 17:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Tue, Oct 29, 2024 at 02:56:17PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 28, 2024 at 02:47:38PM -0700, Josh Poimboeuf wrote:
>
> > + * The only exception is when the task has migrated to another CPU, *and* this
> > + * is called while the task work is running (or has already run). Then a new
> > + * cookie will be generated and the callback will be called again for the new
> > + * cookie.
>
> So that's a bit crap. The user stack won't change for having been
> migrated.
>
> So perf can readily use the full u64 cookie value as a sequence number,
> since the whole perf record will already have the TID of the task in.
> Mixing in this CPU number for no good reason and causing trouble like
> this just doesn't make sense to me.
>
> If ftrace needs brain damage like this, can't we push this to the user?
>
> That is, do away with the per-cpu sequence crap, and add a per-task
> counter that is incremented for every return-to-userspace.
That would definitely make things easier for me, though IIRC Steven and
Mathieu had some concerns about TID wrapping over days/months/years.
With that mindset I suppose the per-CPU counter could also wrap, though
that could be mitigated by making the cookie a struct with more bits.
--
Josh
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-29 17:17 ` Josh Poimboeuf
@ 2024-10-29 17:47 ` Mathieu Desnoyers
2024-10-29 18:20 ` Peter Zijlstra
2024-10-29 18:34 ` Josh Poimboeuf
0 siblings, 2 replies; 119+ messages in thread
From: Mathieu Desnoyers @ 2024-10-29 17:47 UTC (permalink / raw)
To: Josh Poimboeuf, Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Florian Weimer, Andy Lutomirski
On 2024-10-29 13:17, Josh Poimboeuf wrote:
> On Tue, Oct 29, 2024 at 02:56:17PM +0100, Peter Zijlstra wrote:
>> On Mon, Oct 28, 2024 at 02:47:38PM -0700, Josh Poimboeuf wrote:
>>
>>> + * The only exception is when the task has migrated to another CPU, *and* this
>>> + * is called while the task work is running (or has already run). Then a new
>>> + * cookie will be generated and the callback will be called again for the new
>>> + * cookie.
>>
>> So that's a bit crap. The user stack won't change for having been
>> migrated.
>>
>> So perf can readily use the full u64 cookie value as a sequence number,
>> since the whole perf record will already have the TID of the task in.
>> Mixing in this CPU number for no good reason and causing trouble like
>> this just doesn't make sense to me.
>>
>> If ftrace needs brain damage like this, can't we push this to the user?
>>
>> That is, do away with the per-cpu sequence crap, and add a per-task
>> counter that is incremented for every return-to-userspace.
>
> That would definitely make things easier for me, though IIRC Steven and
> Mathieu had some concerns about TID wrapping over days/months/years.
>
> With that mindset I suppose the per-CPU counter could also wrap, though
> that could be mitigated by making the cookie a struct with more bits.
>
AFAIR, the scheme we discussed in Prague was different than the
implementation here.
We discussed having a free-running counter per-cpu, and combining it
with the cpu number as top (or low) bits, to effectively make a 64-bit
value that is unique across the entire system, but without requiring a
global counter with its associated cache line bouncing.
Here is part where the implementation here differs from our discussion:
I recall we discussed keeping a snapshot of the counter value within
the task struct of the thread. So we only snapshot the per-cpu value
on first use after entering the kernel, and after that we use the same
per-cpu value snapshot (from task struct) up until return to userspace.
We clear the task struct cookie snapshot on return to userspace.
This way, even if the thread is migrated during the system call, the
cookie value does not change: it simply depends on the point where it
was first snapshotted (either before or after migration). From that
point until return to userspace, we just use the per-thread snapshot
value.
This should allow us to keep a global cookie semantic (no need to
tie this to tracer-specific knowledge about current TID), without the
migration corner cases discussed in the comment above.
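For concreteness, a sketch of that scheme (all names hypothetical;
assumes a DEFINE_PER_CPU(u64, unwind_ctx_ctr)):

	static u64 get_ctx_cookie(void)
	{
		u64 cookie = current->unwind_task_info.ctx_cookie;

		/* Already snapshotted during this kernel entry? */
		if (cookie)
			return cookie;

		scoped_guard(preempt) {
			u64 cpu = raw_smp_processor_id();
			u64 ctx = __this_cpu_inc_return(unwind_ctx_ctr);

			cookie = ctx_to_cookie(cpu, ctx);
		}

		current->unwind_task_info.ctx_cookie = cookie;
		return cookie;
	}

	/* and on return to userspace: */
	current->unwind_task_info.ctx_cookie = 0;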
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [PATCH v3 08/19] unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_FP
2024-10-29 16:31 ` Josh Poimboeuf
@ 2024-10-29 18:08 ` Peter Zijlstra
0 siblings, 0 replies; 119+ messages in thread
From: Peter Zijlstra @ 2024-10-29 18:08 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Tue, Oct 29, 2024 at 09:31:10AM -0700, Josh Poimboeuf wrote:
> On Tue, Oct 29, 2024 at 02:13:03PM +0100, Peter Zijlstra wrote:
> > > +#define ARCH_INIT_USER_FP_FRAME \
> > > + .ra_off = (s32)sizeof(long) * -1, \
> > > + .cfa_off = (s32)sizeof(long) * 2, \
> > > + .fp_off = (s32)sizeof(long) * -2, \
> > > + .use_fp = true,
> > > +
> > > +#endif /* _ASM_X86_UNWIND_USER_H */
> >
> > What about compat / 32bit userspace?
>
> Sframe doesn't support 32-bit binaries at the moment. Does anybody
> actually care?
Dunno, I think the only 32bit code I've recently run is in selftests and
wine32 -- because Monkey Island rules :-)
> You're right this regresses existing perf behavior for the frame pointer
> case. I'll try to fix that.
Thanks. This patch is the frame pointer unwinder, and that really
should do 32bit.
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-10-29 16:50 ` Josh Poimboeuf
@ 2024-10-29 18:10 ` Peter Zijlstra
0 siblings, 0 replies; 119+ messages in thread
From: Peter Zijlstra @ 2024-10-29 18:10 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Tue, Oct 29, 2024 at 09:50:18AM -0700, Josh Poimboeuf wrote:
> On Tue, Oct 29, 2024 at 02:27:09PM +0100, Peter Zijlstra wrote:
> > > +int sframe_add_section(unsigned long sframe_addr, unsigned long text_start,
> > > + unsigned long text_end)
> > > +{
> > > + struct mm_struct *mm = current->mm;
> > > + struct vm_area_struct *sframe_vma;
> > > +
> > > + mmap_read_lock(mm);
> >
> > DEFINE_GUARD(mmap_read_lock, struct mm_struct *,
> > mmap_read_lock(_T), mmap_read_unlock(_T))
> >
> > in include/linux/mmap_lock.h ?
>
> Will do.
>
> > > @@ -2784,6 +2785,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> > > case PR_RISCV_SET_ICACHE_FLUSH_CTX:
> > > error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
> > > break;
> > > + case PR_ADD_SFRAME:
> > > + if (arg5)
> > > + return -EINVAL;
> > > + error = sframe_add_section(arg2, arg3, arg4);
> > > + break;
> > > + case PR_REMOVE_SFRAME:
> > > + if (arg3 || arg4 || arg5)
> > > + return -EINVAL;
> > > + error = sframe_remove_section(arg2);
> > > + break;
> > > default:
> > > error = -EINVAL;
> > > break;
> >
> > So I realize that mtree has an internal lock, but are we sure we don't
> > want a lock around those prctl()s?
>
> Not that I can tell? It relies on the mtree internal locking for
> atomicity.
>
> For sframe_remove_section() it uses srcu to delay the freeing until all
> sframe_find()'s have completed.
Yeah, it does all that. I was sorta looking at all that kmalloc and
copy-from-user stuff, but I suppose you can race that without problem.
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-29 17:05 ` Josh Poimboeuf
@ 2024-10-29 18:11 ` Peter Zijlstra
0 siblings, 0 replies; 119+ messages in thread
From: Peter Zijlstra @ 2024-10-29 18:11 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Tue, Oct 29, 2024 at 10:05:26AM -0700, Josh Poimboeuf wrote:
> On Tue, Oct 29, 2024 at 02:49:18PM +0100, Peter Zijlstra wrote:
> > On Mon, Oct 28, 2024 at 02:47:38PM -0700, Josh Poimboeuf wrote:
> >
> > > +static void unwind_user_task_work(struct callback_head *head)
> > > +{
> > ...
> > > + guard(rwsem_read)(&callbacks_rwsem);
> > > +
> > > + for_each_set_bit(i, &pending, UNWIND_MAX_CALLBACKS) {
> > > + if (callbacks[i])
> > > + callbacks[i]->func(&trace, cookie, privs[i]);
> > > + }
> >
> > I'm fairly sure people will come with pitchforks for that read-lock
> > there. They scale like shit on big systems. Please use SRCU or somesuch.
>
> I'd expect that unwind_user_{register,unregister}() would only be called
> once per tracing component during boot so there wouldn't be any
> contention.
The read-lock does an atomic op on the lock word; try and do that with
200+ CPUs and things get really sad.
> But I think I can make SRCU work here.
Thanks!
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-29 17:47 ` Mathieu Desnoyers
@ 2024-10-29 18:20 ` Peter Zijlstra
2024-10-30 6:17 ` Steven Rostedt
2024-10-29 18:34 ` Josh Poimboeuf
1 sibling, 1 reply; 119+ messages in thread
From: Peter Zijlstra @ 2024-10-29 18:20 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Josh Poimboeuf, x86, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski
On Tue, Oct 29, 2024 at 01:47:59PM -0400, Mathieu Desnoyers wrote:
> On 2024-10-29 13:17, Josh Poimboeuf wrote:
> > On Tue, Oct 29, 2024 at 02:56:17PM +0100, Peter Zijlstra wrote:
> > > On Mon, Oct 28, 2024 at 02:47:38PM -0700, Josh Poimboeuf wrote:
> > >
> > > > + * The only exception is when the task has migrated to another CPU, *and* this
> > > > + * is called while the task work is running (or has already run). Then a new
> > > > + * cookie will be generated and the callback will be called again for the new
> > > > + * cookie.
> > >
> > > So that's a bit crap. The user stack won't change for having been
> > > migrated.
> > >
> > > So perf can readily use the full u64 cookie value as a sequence number,
> > > since the whole perf record will already have the TID of the task in.
> > > Mixing in this CPU number for no good reason and causing trouble like
> > > this just doesn't make sense to me.
> > >
> > > If ftrace needs brain damage like this, can't we push this to the user?
> > >
> > > That is, do away with the per-cpu sequence crap, and add a per-task
> > > counter that is incremented for every return-to-userspace.
> >
> > That would definitely make things easier for me, though IIRC Steven and
> > Mathieu had some concerns about TID wrapping over days/months/years.
> >
> > With that mindset I suppose the per-CPU counter could also wrap, though
> > that could be mitigated by making the cookie a struct with more bits.
> >
>
> AFAIR, the scheme we discussed in Prague was different than the
> implementation here.
>
> We discussed having a free-running counter per-cpu, and combining it
> with the cpu number as top (or low) bits, to effectively make a 64-bit
> value that is unique across the entire system, but without requiring a
> global counter with its associated cache line bouncing.
>
> Here is part where the implementation here differs from our discussion:
> I recall we discussed keeping a snapshot of the counter value within
> the task struct of the thread. So we only snapshot the per-cpu value
> on first use after entering the kernel, and after that we use the same
> per-cpu value snapshot (from task struct) up until return to userspace.
> We clear the task struct cookie snapshot on return to userspace.
>
> This way, even if the thread is migrated during the system call, the
> cookie value does not change: it simply depends on the point where it
> was first snapshotted (either before or after migration). From that
> point until return to userspace, we just use the per-thread snapshot
> value.
>
> This should allow us to keep a global cookie semantic (no need to
> tie this to tracer-specific knowledge about current TID), without the
> migration corner cases discussed in the comment above.
The 48:16 bit split gives you uniqueness for around 78 hours at 1GHz.
But seriously, perf doesn't need this. It really only needs a sequence
number if you care to stitch over a LOST packet (and I can't say I care
about that case much) -- and doing that right doesn't really take much
at all.
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-29 17:47 ` Mathieu Desnoyers
2024-10-29 18:20 ` Peter Zijlstra
@ 2024-10-29 18:34 ` Josh Poimboeuf
2024-10-30 13:44 ` Mathieu Desnoyers
1 sibling, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-29 18:34 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Peter Zijlstra, x86, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski
On Tue, Oct 29, 2024 at 01:47:59PM -0400, Mathieu Desnoyers wrote:
> > > If ftrace needs brain damage like this, can't we push this to the user?
> > >
> > > That is, do away with the per-cpu sequence crap, and add a per-task
> > > counter that is incremented for every return-to-userspace.
> >
> > That would definitely make things easier for me, though IIRC Steven and
> > Mathieu had some concerns about TID wrapping over days/months/years.
> >
> > With that mindset I suppose the per-CPU counter could also wrap, though
> > that could be mitigated by making the cookie a struct with more bits.
> >
>
> AFAIR, the scheme we discussed in Prague was different than the
> implementation here.
It does differ a bit. I'll explain why below.
> We discussed having a free-running counter per-cpu, and combining it
> with the cpu number as top (or low) bits, to effectively make a 64-bit
> value that is unique across the entire system, but without requiring a
> global counter with its associated cache line bouncing.
>
> Here is part where the implementation here differs from our discussion:
> I recall we discussed keeping a snapshot of the counter value within
> the task struct of the thread. So we only snapshot the per-cpu value
> on first use after entering the kernel, and after that we use the same
> per-cpu value snapshot (from task struct) up until return to userspace.
> We clear the task struct cookie snapshot on return to userspace.
Right, so adding some details to this, what I remember specifically
deciding on:
- In unwind_user_deferred(), if task cookie is 0, it snapshots the
per-cpu counter, stores the old value in the task cookie, and
increments the new value (with the CPU # in the top 16 bits).
- Future calls to unwind_user_deferred() see the task cookie is set and
reuse the same cookie.
- Then the task work (which runs right before exiting the kernel)
clears the task cookie (with interrupts disabled), does the unwind,
and calls the callbacks. It clears the task cookie so that any
future calls to unwind_user_deferred() (either before exiting the
kernel or after next entry) are guaranteed to get an unwind.
That's what I had implemented originally. It had a major performance
issue, particularly for long stacks (bash sometimes had 300+ stack
frames in my testing).
The task work runs with interrupts enabled. So if another PMU interrupt
and call to unwind_user_deferred() happens after the task work clears
the task cookie but before kernel exit, a new cookie is generated and an
additional task work is scheduled. For long stacks, performance gets
really bad, dominated by all the extra unnecessary unwinding.
So I changed the design a bit:
- Increment a per-cpu counter at kernel entry before interrupts are
enabled.
- In unwind_user_deferred(), if task cookie is 0, it sets the task
cookie based on the per-cpu counter, like before. However if this
cookie was the last one used by this callback+task, it skips the
callback altogether.
So it eliminates duplicate unwinds except for the CPU migration case.
If I change the entry code to increment a per-task counter instead of a
per-cpu counter then this problem goes away. I was just concerned about
the performance impacts of doing that on every entry.
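For comparison, the per-task variant would be just (sketch, field name
hypothetical):

	static __always_inline void unwind_enter_from_user_mode(void)
	{
		/* non-atomic: only ever touched by current */
		current->unwind_task_info.entry_ctr++;
	}

i.e. one non-atomic increment of a task-local counter on entry, with
the cookie reduced to that counter (tracers needing global uniqueness
can combine it with the TID themselves).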
--
Josh
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-10-28 21:47 ` [PATCH v3 09/19] unwind: Introduce sframe user space unwinding Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
2024-10-29 13:27 ` Peter Zijlstra
@ 2024-10-29 23:32 ` Andrii Nakryiko
2024-10-30 5:53 ` Josh Poimboeuf
2024-11-05 17:40 ` Steven Rostedt
` (4 subsequent siblings)
7 siblings, 1 reply; 119+ messages in thread
From: Andrii Nakryiko @ 2024-10-29 23:32 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Mon, Oct 28, 2024 at 2:48 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> Some distros have started compiling frame pointers into all their
> packages to enable the kernel to do system-wide profiling of user space.
> Unfortunately that creates a runtime performance penalty across the
> entire system. Using DWARF instead isn't feasible due to the complexity
> it would add to the kernel.
>
> For in-kernel unwinding we solved this problem with the creation of the
> ORC unwinder for x86_64. Similarly, for user space the GNU assembler
> has created the sframe format starting with binutils 2.41 for sframe v2.
> Sframe is a simpler version of .eh_frame. It gets placed in the .sframe
> section.
>
> Add support for unwinding user space using sframe.
>
> More information about sframe can be found here:
>
> - https://lwn.net/Articles/932209/
> - https://lwn.net/Articles/940686/
> - https://sourceware.org/binutils/docs/sframe-spec.html
>
> Glibc support is needed to implement the prctl() calls to tell the
> kernel where the .sframe segments are.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> ---
> arch/Kconfig | 4 +
> arch/x86/include/asm/mmu.h | 2 +-
> fs/binfmt_elf.c | 35 +++-
> include/linux/mm_types.h | 3 +
> include/linux/sframe.h | 41 ++++
> include/linux/unwind_user.h | 2 +
> include/uapi/linux/elf.h | 1 +
> include/uapi/linux/prctl.h | 3 +
> kernel/fork.c | 10 +
> kernel/sys.c | 11 ++
> kernel/unwind/Makefile | 3 +-
> kernel/unwind/sframe.c | 380 ++++++++++++++++++++++++++++++++++++
> kernel/unwind/sframe.h | 215 ++++++++++++++++++++
> kernel/unwind/user.c | 24 ++-
> mm/init-mm.c | 6 +
> 15 files changed, 732 insertions(+), 8 deletions(-)
> create mode 100644 include/linux/sframe.h
> create mode 100644 kernel/unwind/sframe.c
> create mode 100644 kernel/unwind/sframe.h
>
It feels like this patch is trying to do too much: new UAPI
introduction, SFrame format definitions, unwinder integration, etc,
etc. Do you think it can be split further into more focused, smaller
patches?
> @@ -688,7 +692,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
> /*
> * Check to see if the section's size will overflow the
> * allowed task size. Note that p_filesz must always be
> - * <= p_memsize so it's only necessary to check p_memsz.
> + * <= p_memsz so it's only necessary to check p_memsz.
> */
> k = load_addr + eppnt->p_vaddr;
> if (BAD_ADDR(k) ||
> @@ -698,9 +702,24 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
> error = -ENOMEM;
> goto out;
> }
> +
> + if ((eppnt->p_flags & PF_X) && k < start_code)
> + start_code = k;
> +
> + if ((eppnt->p_flags & PF_X) && k + eppnt->p_filesz > end_code)
> + end_code = k + eppnt->p_filesz;
> + break;
> + }
> + case PT_GNU_SFRAME:
> + sframe_phdr = eppnt;
If I understand correctly, there is supposed to be only one sframe
segment, is that right? Should we validate that?
> + break;
> }
> }
>
> + if (sframe_phdr)
> + sframe_add_section(load_addr + sframe_phdr->p_vaddr,
> + start_code, end_code);
> +
no error checking?
> error = load_addr;
> out:
> return error;
> @@ -823,7 +842,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
> int first_pt_load = 1;
> unsigned long error;
> struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
> - struct elf_phdr *elf_property_phdata = NULL;
> + struct elf_phdr *elf_property_phdata = NULL, *sframe_phdr = NULL;
> unsigned long elf_brk;
> int retval, i;
> unsigned long elf_entry;
> @@ -931,6 +950,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
> executable_stack = EXSTACK_DISABLE_X;
> break;
>
> + case PT_GNU_SFRAME:
> + sframe_phdr = elf_ppnt;
> + break;
> +
> case PT_LOPROC ... PT_HIPROC:
> retval = arch_elf_pt_proc(elf_ex, elf_ppnt,
> bprm->file, false,
> @@ -1321,6 +1344,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
> task_pid_nr(current), retval);
> }
>
> + if (sframe_phdr)
> + sframe_add_section(load_bias + sframe_phdr->p_vaddr,
> + start_code, end_code);
> +
error checking missing?
> regs = current_pt_regs();
> #ifdef ELF_PLAT_INIT
> /*
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 381d22eba088..6e7561c1a5fc 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1052,6 +1052,9 @@ struct mm_struct {
> #endif
> } lru_gen;
> #endif /* CONFIG_LRU_GEN_WALKS_MMU */
> +#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
> + struct maple_tree sframe_mt;
> +#endif
> } __randomize_layout;
>
> /*
> diff --git a/include/linux/sframe.h b/include/linux/sframe.h
> new file mode 100644
> index 000000000000..d167e01817c4
> --- /dev/null
> +++ b/include/linux/sframe.h
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_SFRAME_H
> +#define _LINUX_SFRAME_H
> +
> +#include <linux/mm_types.h>
> +
> +struct unwind_user_frame;
> +
> +#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
> +
> +#define INIT_MM_SFRAME .sframe_mt = MTREE_INIT(sframe_mt, 0),
> +
> +extern void sframe_free_mm(struct mm_struct *mm);
> +
> +/* text_start, text_end, file_name are optional */
What file_name? Was that an extra argument that got removed?
> +extern int sframe_add_section(unsigned long sframe_addr, unsigned long text_start,
> + unsigned long text_end);
> +
> +extern int sframe_remove_section(unsigned long sframe_addr);
> +extern int sframe_find(unsigned long ip, struct unwind_user_frame *frame);
> +
> +static inline bool current_has_sframe(void)
> +{
> + struct mm_struct *mm = current->mm;
> +
> + return mm && !mtree_empty(&mm->sframe_mt);
> +}
> +
[...]
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 4da31f28fda8..7d05a67727db 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -64,6 +64,7 @@
> #include <linux/rcupdate.h>
> #include <linux/uidgid.h>
> #include <linux/cred.h>
> +#include <linux/sframe.h>
>
> #include <linux/nospec.h>
>
> @@ -2784,6 +2785,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> case PR_RISCV_SET_ICACHE_FLUSH_CTX:
> error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
> break;
> + case PR_ADD_SFRAME:
> + if (arg5)
> + return -EINVAL;
> + error = sframe_add_section(arg2, arg3, arg4);
Wouldn't it be better to make this interface extensible from the get
go? Instead of passing 3 arguments with fixed meanings, why not pass a
pointer to an extensible binary struct, as seems to be the trend
nowadays with nicely extensible APIs? See [0] for one such example
(specifically, struct procmap_query). Seems more prudent, as we'll
most probably be adding flags, options, extra information, etc.
[0] https://lore.kernel.org/linux-mm/20240627170900.1672542-3-andrii@kernel.org/
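E.g. something along these lines (hypothetical, mirroring the
procmap_query style):

	struct sframe_add_args {
		__u64 size;	/* sizeof(self), for versioning */
		__u64 flags;
		__u64 sframe_addr;
		__u64 text_start;
		__u64 text_end;
	};

Then prctl(PR_ADD_SFRAME, &args, 0, 0, 0) copies the struct in and
validates 'size', so new fields can be appended later without needing
a new prctl option.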
> + break;
> + case PR_REMOVE_SFRAME:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + error = sframe_remove_section(arg2);
> + break;
> default:
> error = -EINVAL;
> break;
[...]
> +static unsigned char fre_type_to_size(unsigned char fre_type)
> +{
> + if (fre_type > 2)
> + return 0;
> + return 1 << fre_type;
> +}
> +
> +static unsigned char offset_size_enum_to_size(unsigned char off_size)
> +{
> + if (off_size > 2)
> + return 0;
> + return 1 << off_size;
> +}
> +
> +static int find_fde(struct sframe_section *sec, unsigned long ip,
> + struct sframe_fde *fde)
> +{
> + struct sframe_fde __user *first, *last, *found = NULL;
> + u32 ip_off, func_off_low = 0, func_off_high = -1;
> +
> + ip_off = ip - sec->sframe_addr;
What if ip_off is larger than 4GB? An ELF section can be bigger than
4GB, right? And does that mean SFrame doesn't support executables with
text bigger than 4GB?
> +
> + first = (void __user *)sec->fdes_addr;
> + last = first + sec->fdes_nr;
> + while (first <= last) {
> + struct sframe_fde __user *mid;
> + u32 func_off;
> +
> + mid = first + ((last - first) / 2);
> +
> + if (get_user(func_off, (s32 __user *)mid))
> + return -EFAULT;
> +
> + if (ip_off >= func_off) {
> + /* validate sort order */
> + if (func_off < func_off_low)
> + return -EINVAL;
> +
> + func_off_low = func_off;
> +
> + found = mid;
> + first = mid + 1;
> + } else {
> + /* validate sort order */
> + if (func_off > func_off_high)
> + return -EINVAL;
> +
> + func_off_high = func_off;
> +
> + last = mid - 1;
> + }
> + }
> +
> + if (!found)
> + return -EINVAL;
> +
> + if (copy_from_user(fde, found, sizeof(*fde)))
> + return -EFAULT;
> +
> + /* check for gaps */
> + if (ip_off < fde->start_addr || ip_off >= fde->start_addr + fde->size)
> + return -EINVAL;
> +
> + return 0;
> +}
> +
[...]
> +int sframe_add_section(unsigned long sframe_addr, unsigned long text_start,
> + unsigned long text_end)
> +{
> + struct mm_struct *mm = current->mm;
> + struct vm_area_struct *sframe_vma;
> +
> + mmap_read_lock(mm);
> +
> + sframe_vma = vma_lookup(mm, sframe_addr);
> + if (!sframe_vma)
> + goto err_unlock;
> +
> + if (text_start && text_end) {
> + struct vm_area_struct *text_vma;
> +
> + text_vma = vma_lookup(mm, text_start);
> + if (!(text_vma->vm_flags & VM_EXEC))
> + goto err_unlock;
> +
> + if (PAGE_ALIGN(text_end) != text_vma->vm_end)
> + goto err_unlock;
> + } else {
> + struct vm_area_struct *vma, *text_vma = NULL;
> + VMA_ITERATOR(vmi, mm, 0);
> +
> + for_each_vma(vmi, vma) {
> + if (vma->vm_file != sframe_vma->vm_file ||
> + !(vma->vm_flags & VM_EXEC))
> + continue;
> +
> + if (text_vma) {
> + pr_warn_once("%s[%d]: multiple EXEC segments unsupported\n",
> + current->comm, current->pid);
Is this something that fundamentally can't be supported by the SFrame
format, or just an implementation simplification?
It's not illegal to have an executable with multiple VM_EXEC segments,
no? Should this really be a pr_warn_once() then?
> + goto err_unlock;
> + }
> +
> + text_vma = vma;
> + }
> +
> + if (!text_vma)
> + goto err_unlock;
> +
> + text_start = text_vma->vm_start;
> + text_end = text_vma->vm_end;
> + }
> +
> + mmap_read_unlock(mm);
> +
> + return __sframe_add_section(sframe_addr, text_start, text_end);
> +
[...]
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-28 21:47 ` [PATCH v3 11/19] unwind: Add deferred user space unwinding API Josh Poimboeuf
` (3 preceding siblings ...)
2024-10-29 13:56 ` Peter Zijlstra
@ 2024-10-29 23:32 ` Andrii Nakryiko
2024-10-30 6:10 ` Josh Poimboeuf
4 siblings, 1 reply; 119+ messages in thread
From: Andrii Nakryiko @ 2024-10-29 23:32 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Mon, Oct 28, 2024 at 2:48 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> Add unwind_user_deferred() which allows callers to schedule task work to
> unwind the user space stack before returning to user space. This solves
> several problems for its callers:
>
> - Ensure the unwind happens in task context even if the caller may be
> running in interrupt context.
>
> - Only do the unwind once, even if called multiple times either by the
> same caller or multiple callers.
>
> - Create a "context context" cookie which allows trace post-processing
> to correlate kernel unwinds/traces with the user unwind.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> ---
> include/linux/entry-common.h | 3 +
> include/linux/sched.h | 5 +
> include/linux/unwind_user.h | 56 ++++++++++
> kernel/fork.c | 4 +
> kernel/unwind/user.c | 199 +++++++++++++++++++++++++++++++++++
> 5 files changed, 267 insertions(+)
>
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index 1e50cdb83ae5..efbe8f964f31 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -12,6 +12,7 @@
> #include <linux/resume_user_mode.h>
> #include <linux/tick.h>
> #include <linux/kmsan.h>
> +#include <linux/unwind_user.h>
>
> #include <asm/entry-common.h>
>
> @@ -111,6 +112,8 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
> CT_WARN_ON(__ct_state() != CT_STATE_USER);
> user_exit_irqoff();
>
> + unwind_enter_from_user_mode();
> +
> instrumentation_begin();
> kmsan_unpoison_entry_regs(regs);
> trace_hardirqs_off_finish();
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5007a8e2d640..31b6f1d763ef 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -47,6 +47,7 @@
> #include <linux/livepatch_sched.h>
> #include <linux/uidgid_types.h>
> #include <asm/kmap_size.h>
> +#include <linux/unwind_user.h>
>
> /* task_struct member predeclarations (sorted alphabetically): */
> struct audit_context;
> @@ -1592,6 +1593,10 @@ struct task_struct {
> struct user_event_mm *user_event_mm;
> #endif
>
> +#ifdef CONFIG_UNWIND_USER
> + struct unwind_task_info unwind_task_info;
this is quite a lot of memory to pay on each task, a lot of which a)
might not have sframe and b) might not need stack unwinding during
their lifetime.
It can be a pointer and allocated in copy_process(), no? Though
ideally this should be initialized lazily, if possible.
> +#endif
> +
> /*
> * New fields for task_struct should be added above here, so that
> * they are included in the randomized portion of task_struct.
[...]
> +static void unwind_user_task_work(struct callback_head *head)
> +{
> + struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
> + struct task_struct *task = container_of(info, struct task_struct, unwind_task_info);
> + void *privs[UNWIND_MAX_CALLBACKS];
> + struct unwind_stacktrace trace;
> + unsigned long pending;
> + u64 cookie = 0;
> + int i;
> +
> + BUILD_BUG_ON(UNWIND_MAX_CALLBACKS > 32);
> +
> + if (WARN_ON_ONCE(task != current))
> + return;
> +
> + if (WARN_ON_ONCE(!info->ctx_cookie || !info->pending_callbacks))
> + return;
> +
> + scoped_guard(irqsave) {
> + pending = info->pending_callbacks;
> + cookie = info->ctx_cookie;
> +
> + info->pending_callbacks = 0;
> + info->ctx_cookie = 0;
> + memcpy(privs, info->privs, sizeof(void *) * UNWIND_MAX_CALLBACKS);
> + }
> +
> + if (!info->entries) {
> + info->entries = kmalloc(UNWIND_MAX_ENTRIES * sizeof(long),
> + GFP_KERNEL);
> + if (!info->entries)
> + return;
uhm... can we notify callbacks that stack capture failed? otherwise
we'd need some extra timeouts and other complications if we are
*waiting* for this callback to be called
> + }
> +
> + trace.entries = info->entries;
> + trace.nr = 0;
> + unwind_user(&trace, UNWIND_MAX_ENTRIES);
> +
[...]
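To the point above about notifying callbacks on failure: one possible
shape, as a rough sketch against the quoted function (the do_callbacks
label is hypothetical, standing in for the callback loop elided above),
would be to deliver an empty trace instead of silently bailing:

	if (!info->entries) {
		info->entries = kmalloc(UNWIND_MAX_ENTRIES * sizeof(long),
					GFP_KERNEL);
		if (!info->entries) {
			/* don't leave waiters hanging: report "no stack" */
			trace.entries = NULL;
			trace.nr = 0;
			goto do_callbacks;
		}
	}

	trace.entries = info->entries;
	trace.nr = 0;
	unwind_user(&trace, UNWIND_MAX_ENTRIES);

do_callbacks:
	/* ... invoke the pending callbacks as before ... */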
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-10-29 23:32 ` Andrii Nakryiko
@ 2024-10-30 5:53 ` Josh Poimboeuf
2024-10-31 20:57 ` Andrii Nakryiko
0 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-30 5:53 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Tue, Oct 29, 2024 at 04:32:40PM -0700, Andrii Nakryiko wrote:
> It feels like this patch is trying to do too much. There is both new
> UAPI introduction, and SFrame format definition, and unwinder
> integration, etc, etc. Do you think it can be split further into more
> focused smaller patches?
True, let me see if I can split it up.
> > +
> > + if ((eppnt->p_flags & PF_X) && k < start_code)
> > + start_code = k;
> > +
> > + if ((eppnt->p_flags & PF_X) && k + eppnt->p_filesz > end_code)
> > + end_code = k + eppnt->p_filesz;
> > + break;
> > + }
> > + case PT_GNU_SFRAME:
> > + sframe_phdr = eppnt;
>
> if I understand correctly, there has to be only one sframe, is that
> right? Should we validate that?
Yes, there shouldn't be more than one PT_GNU_SFRAME for the executable
itself. I can validate that.
> > + break;
> > }
> > }
> >
> > + if (sframe_phdr)
> > + sframe_add_section(load_addr + sframe_phdr->p_vaddr,
> > + start_code, end_code);
> > +
>
> no error checking?
Good point. I remember discussing this with some people at Cauldron/LPC,
I just forgot to do it!
Right now it does all the validation at unwind, which could really slow
things down unnecessarily if the sframe isn't valid.
> > +#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
> > +
> > +#define INIT_MM_SFRAME .sframe_mt = MTREE_INIT(sframe_mt, 0),
> > +
> > +extern void sframe_free_mm(struct mm_struct *mm);
> > +
> > +/* text_start, text_end, file_name are optional */
>
> what file_name? was that an extra argument that got removed?
Indeed, that was for some old code.
> > case PR_RISCV_SET_ICACHE_FLUSH_CTX:
> > error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
> > break;
> > + case PR_ADD_SFRAME:
> > + if (arg5)
> > + return -EINVAL;
> > + error = sframe_add_section(arg2, arg3, arg4);
>
> wouldn't it be better to make this interface extendable from the get
> go? Instead of passing 3 arguments with fixed meaning, why not pass a
> pointer to an extendable binary struct like seems to be the trend
> nowadays with nicely extensible APIs. See [0] for one such example
> (specifically, struct procmap_query). Seems more prudent, as we'll
> most probably will be adding flags, options, extra information, etc)
>
> [0] https://lore.kernel.org/linux-mm/20240627170900.1672542-3-andrii@kernel.org/
This ioctl interface was admittedly hacked together. I was hoping
somebody would suggest something better :-) I'll take a look.
> > +static int find_fde(struct sframe_section *sec, unsigned long ip,
> > + struct sframe_fde *fde)
> > +{
> > + struct sframe_fde __user *first, *last, *found = NULL;
> > + u32 ip_off, func_off_low = 0, func_off_high = -1;
> > +
> > + ip_off = ip - sec->sframe_addr;
>
> what if ip_off is larger than 4GB? ELF section can be bigger than 4GB, right?
That's baked into sframe v2.
> and also, does it mean that SFrame doesn't support executables with
> text bigger than 4GB?
Yes, but is that a realistic concern?
> > + } else {
> > + struct vm_area_struct *vma, *text_vma = NULL;
> > + VMA_ITERATOR(vmi, mm, 0);
> > +
> > + for_each_vma(vmi, vma) {
> > + if (vma->vm_file != sframe_vma->vm_file ||
> > + !(vma->vm_flags & VM_EXEC))
> > + continue;
> > +
> > + if (text_vma) {
> > + pr_warn_once("%s[%d]: multiple EXEC segments unsupported\n",
> > + current->comm, current->pid);
>
> is this just something that fundamentally can't be supported by SFrame
> format? Or just an implementation simplification?
It's a simplification I suppose.
> It's not illegal to have an executable with multiple VM_EXEC segments,
> no? Should this be a pr_warn_once() then?
I don't know, is it allowed? I've never seen it in practice. The
pr_warn_once() is not reporting that it's illegal but rather that this
corner case actually exists and maybe needs to be looked at.
--
Josh
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-29 23:32 ` Andrii Nakryiko
@ 2024-10-30 6:10 ` Josh Poimboeuf
2024-10-31 21:22 ` Andrii Nakryiko
0 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-30 6:10 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Tue, Oct 29, 2024 at 04:32:46PM -0700, Andrii Nakryiko wrote:
> > struct audit_context;
> > @@ -1592,6 +1593,10 @@ struct task_struct {
> > struct user_event_mm *user_event_mm;
> > #endif
> >
> > +#ifdef CONFIG_UNWIND_USER
> > + struct unwind_task_info unwind_task_info;
>
> this is quite a lot of memory to pay on each task, a lot of which a)
> might not have sframe
Frame pointers are also supported.
> and b) might not need stack unwinding during their lifetime.
Right, I'm not super happy about that.
> It can be a pointer and allocated in copy_process(), no?
> Though ideally this should be initialized lazily, if possible.
Problem is, the unwinder doesn't know in advance which tasks will be
unwound.
Its first clue is unwind_user_register(); would it make sense for the
caller to clarify whether all tasks need to be unwound or only a
specific subset?
Its second clue is unwind_user_deferred(), which is called for the task
itself. But by then it's too late because it needs to access the
per-task data from (potentially) irq context so it can't do a lazy
allocation.
I'm definitely open to ideas...
> > + if (!info->entries) {
> > + info->entries = kmalloc(UNWIND_MAX_ENTRIES * sizeof(long),
> > + GFP_KERNEL);
> > + if (!info->entries)
> > + return;
>
> uhm... can we notify callbacks that stack capture failed? otherwise
> we'd need some extra timeouts and other complications if we are
> *waiting* for this callback to be called
Hm, do you actually plan to wait for the callback?
I assume this is BPF, can you give some high-level detail about how it
will use these interfaces?
--
Josh
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-29 18:20 ` Peter Zijlstra
@ 2024-10-30 6:17 ` Steven Rostedt
2024-10-30 14:03 ` Peter Zijlstra
0 siblings, 1 reply; 119+ messages in thread
From: Steven Rostedt @ 2024-10-30 6:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mathieu Desnoyers, Josh Poimboeuf, x86, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski
On Tue, 29 Oct 2024 19:20:32 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> The 48:16 bit split gives you uniqueness for around 78 hours at 1GHz.
Are you saying that there will be one system call per nanosecond? This
number is incremented only when a task enters the kernel from user
space *and* requests a stack trace. If that happens 1000 times a
second, that would still be around 9000 years.
>
> But seriously, perf doesn't need this. It really only needs a sequence
> number if you care to stitch over a LOST packet (and I can't say I care
> about that case much) -- and doing that right doesn't really take much
> at all.
Perf may not care because it has a unique descriptor per task, right?
Where it can already know what events are associated to a task. But
that's just a unique characteristic of perf. The unwinder should give an
identifier for every user space stack trace that it will produce and
pass that back to the tracer when it requests a stack trace but it
cannot yet be performed. This identifier is what we are calling a
context cookie. Then when it wants the stack trace, the unwinder will
give the tracer the stack trace along with the identifier
(context-cookie) that this stack trace was for in the past.
It definitely belongs with the unwinder logic.
-- Steve
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-29 18:34 ` Josh Poimboeuf
@ 2024-10-30 13:44 ` Mathieu Desnoyers
2024-10-30 17:47 ` Josh Poimboeuf
0 siblings, 1 reply; 119+ messages in thread
From: Mathieu Desnoyers @ 2024-10-30 13:44 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Peter Zijlstra, x86, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski
On 2024-10-29 14:34, Josh Poimboeuf wrote:
> On Tue, Oct 29, 2024 at 01:47:59PM -0400, Mathieu Desnoyers wrote:
>>>> If ftrace needs brain damage like this, can't we push this to the user?
>>>>
>>>> That is, do away with the per-cpu sequence crap, and add a per-task
>>>> counter that is incremented for every return-to-userspace.
>>>
>>> That would definitely make things easier for me, though IIRC Steven and
>>> Mathieu had some concerns about TID wrapping over days/months/years.
>>>
>>> With that mindset I suppose the per-CPU counter could also wrap, though
>>> that could be mitigated by making the cookie a struct with more bits.
>>>
>>
>> AFAIR, the scheme we discussed in Prague was different than the
>> implementation here.
>
> It does differ a bit. I'll explain why below.
>
>> We discussed having a free-running counter per-cpu, and combining it
>> with the cpu number as top (or low) bits, to effectively make a 64-bit
>> value that is unique across the entire system, but without requiring a
>> global counter with its associated cache line bouncing.
>>
>> Here is part where the implementation here differs from our discussion:
>> I recall we discussed keeping a snapshot of the counter value within
>> the task struct of the thread. So we only snapshot the per-cpu value
>> on first use after entering the kernel, and after that we use the same
>> per-cpu value snapshot (from task struct) up until return to userspace.
>> We clear the task struct cookie snapshot on return to userspace.
>
> Right, so adding some details to this, what I remember specifically
> deciding on:
>
> - In unwind_user_deferred(), if task cookie is 0, it snapshots the
> per-cpu counter, stores the old value in the task cookie, and
> increments the new value (with the CPU # in the top 16 bits).
>
> - Future calls to unwind_user_deferred() see the task cookie is set and
> reuse the same cookie.
>
> - Then the task work (which runs right before exiting the kernel)
> clears the task cookie (with interrupts disabled), does the unwind,
> and calls the callbacks. It clears the task cookie so that any
> future calls to unwind_user_deferred() (either before exiting the
> kernel or after next entry) are guaranteed to get an unwind.
This is where I think we should improve the logic.
If unwind_user_deferred() is called while an unwind is already in
progress or has just completed, I don't think we want to issue another
stack walk: it would be redundant with the one already in progress or
completed.
What we'd want is to make sure that the cookie returned to that
unwind_user_deferred() is the same as the cookie associated with the
in-progress or completed stack unwind. This way, trace analysis can
look at the surrounding events (before and after) the unwind request
and find the associated call stack.
>
> That's what I had implemented originally. It had a major performance
> issue, particularly for long stacks (bash sometimes had 300+ stack
> frames in my testing).
That's probably because long stacks require a lot of work (user page
accesses) to be done when stack walking, which increases the likelihood
of re-issuing unwind_user_deferred() while the stack walk is being done.
That's basically a lack-of-progress issue: a sufficiently long stack
walk with a sufficiently frequent unwind_user_deferred() trigger could
make the system stall forever trying to service stack walks again
and again. That's bad.
>
> The task work runs with interrupts enabled. So if another PMU interrupt
> and call to unwind_user_deferred() happens after the task work clears
> the task cookie but before kernel exit, a new cookie is generated and an
> additional task work is scheduled. For long stacks, performance gets
> really bad, dominated by all the extra unnecessary unwinding.
What you want here is to move the point where you clear the task
cookie to _after_ completion of stack unwind. There are a few ways
this can be done:
A) Clear the task cookie in the task_work() right after the
unwind_user_deferred() is completed. Downside: if some long task work
happen to be done after the stack walk, a new unwind_user_deferred()
could be issued again and we may end up looping forever taking stack
unwind and never actually making forward progress.
B) Clear the task cookie after the exit_to_user_mode_loop is done,
before returning to user-space, while interrupts are disabled.
C) Clear the task cookie when entering kernel space again.
I think (B) and (C) could be interesting approaches, perhaps with
(B) slightly more interesting because it cleans up after itself
before returning to userspace. Also (B) allows us to return to a
purely lazy scheme where there is nothing to do when entering the
kernel. This is therefore more efficient in terms of cache locality,
because we can expect our per-task state to be cache hot when
returning to userspace, but not necessarily when entering into
the kernel.
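To make (B) concrete, a rough sketch against the generic entry code
(purely illustrative; the exact hook point is the open question here):

static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
{
	unsigned long ti_work = read_thread_flags();

	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
		ti_work = exit_to_user_mode_loop(regs, ti_work);

	arch_exit_to_user_mode_prepare(regs, ti_work);

	/* irqs are off and all task work has run: safe to clear */
	current->unwind_task_info.ctx_cookie = 0;
}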
>
> So I changed the design a bit:
>
> - Increment a per-cpu counter at kernel entry before interrupts are
> enabled.
>
> - In unwind_user_deferred(), if task cookie is 0, it sets the task
> cookie based on the per-cpu counter, like before. However if this
> cookie was the last one used by this callback+task, it skips the
> callback altogether.
>
> So it eliminates duplicate unwinds except for the CPU migration case.
This sounds complicated and fragile. And why would we care about
duplicated unwinds on CPU migration?
What prevents us from moving the per-task cookie clearing to after
exit_to_user_mode_loop instead? Then there is no need to do a per-cpu
counter increment on every kernel entry and we can go back to a lazy
scheme.
>
> If I change the entry code to increment a per-task counter instead of a
> per-cpu counter then this problem goes away. I was just concerned about
> the performance impacts of doing that on every entry.
Moving from per-cpu to per-task makes this cookie task-specific and not
global anymore, I don't think we want this for a stack walking
infrastructure meant to be used by various tracers. Also a global cookie
is more robust and does not depend on guaranteeing that all the
trace data is present to guarantee current thread ID accuracy and
thus that cookies match between deferred unwind request and their
fulfillment.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-30 6:17 ` Steven Rostedt
@ 2024-10-30 14:03 ` Peter Zijlstra
2024-10-30 19:58 ` Steven Rostedt
0 siblings, 1 reply; 119+ messages in thread
From: Peter Zijlstra @ 2024-10-30 14:03 UTC (permalink / raw)
To: Steven Rostedt
Cc: Mathieu Desnoyers, Josh Poimboeuf, x86, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski
On Wed, Oct 30, 2024 at 02:17:22AM -0400, Steven Rostedt wrote:
> On Tue, 29 Oct 2024 19:20:32 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > The 48:16 bit split gives you uniqueness for around 78 hours at 1GHz.
>
> Are you saying that there will be one system call per nanosecond? This
> number is incremented only when a task enters the kernel from user
> space *and* requests a stack trace. If that happens 1000 times a
> second, that would still be around 9000 years.
We used to be able to do well over a million syscalls a second. I'm not
exactly sure where we are now, the whole speculation shit-show hurt
things quite badly.
> >
> > But seriously, perf doesn't need this. It really only needs a sequence
> > number if you care to stitch over a LOST packet (and I can't say I care
> > about that case much) -- and doing that right doesn't really take much
> > at all.
>
> Perf may not care because it has a unique descriptor per task, right?
> Where it can already know what events are associated to a task. But
> that's just a unique characteristic of perf. The unwinder should give an
> identifier for every user space stack trace that it will produce and
> pass that back to the tracer when it requests a stack trace but it
> cannot yet be performed. This identifier is what we are calling a
> context cookie. Then when it wants the stack trace, the unwinder will
> give the tracer the stack trace along with the identifier
> (context-cookie) that this stack trace was for in the past.
You're designing things inside out again. You should add functionality
by adding layers.
Pass a void * into the 'request-unwind' and have the 'do-unwind'
callback get that same pointer. Then anybody that needs identifiers to
figure out where things came from can stuff something in there.
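Something like this, say (hypothetical signatures):

struct unwind_work {
	struct callback_head	head;
	/* called from task context with the captured trace */
	void (*func)(struct unwind_stacktrace *trace, void *data);
};

/* @data is opaque to the unwinder and handed back to @work->func */
int unwind_user_request(struct unwind_work *work, void *data);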
* Re: [PATCH v3 02/19] x86/asm: Avoid emitting DWARF CFI for non-VDSO
2024-10-28 21:47 ` [PATCH v3 02/19] x86/asm: Avoid emitting DWARF CFI for non-VDSO Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
@ 2024-10-30 17:19 ` Jens Remus
2024-10-30 17:51 ` Josh Poimboeuf
1 sibling, 1 reply; 119+ messages in thread
From: Jens Remus @ 2024-10-30 17:19 UTC (permalink / raw)
To: Josh Poimboeuf, x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On 28.10.2024 22:47, Josh Poimboeuf wrote:
> VDSO is the only part of the "kernel" using DWARF CFI directives. For
> the kernel proper, ensure the CFI_* macros don't do anything.
>
> These macros aren't yet being used outside of VDSO, so there's no
> functional change.
...
> diff --git a/arch/x86/include/asm/dwarf2.h b/arch/x86/include/asm/dwarf2.h
> index 430fca13bb56..b1aa3fcd5bca 100644
> --- a/arch/x86/include/asm/dwarf2.h
> +++ b/arch/x86/include/asm/dwarf2.h
...
> @@ -29,13 +39,24 @@
> * useful to anyone. Note we should not use this directive if we
> * ever decide to enable DWARF unwinding at runtime.
> */
> +
> .cfi_sections .debug_frame
> -#else
> - /*
> - * For the vDSO, emit both runtime unwind information and debug
> - * symbols for the .dbg file.
> - */
> - .cfi_sections .eh_frame, .debug_frame
> -#endif
> +
> +#define CFI_STARTPROC
> +#define CFI_ENDPROC
> +#define CFI_DEF_CFA
> +#define CFI_DEF_CFA_REGISTER
> +#define CFI_DEF_CFA_OFFSET
> +#define CFI_ADJUST_CFA_OFFSET
> +#define CFI_OFFSET
> +#define CFI_REL_OFFSET
> +#define CFI_REGISTER
> +#define CFI_RESTORE
> +#define CFI_REMEMBER_STATE
> +#define CFI_RESTORE_STATE
> +#define CFI_UNDEFINED
> +#define CFI_ESCAPE
Don't all of the above need to be defined to "#", so that they change
any parameters into an assembler comment? For example:
#define CFI_STARTPROC #
> +
> +#endif /* !BUILD_VDSO */
>
> #endif /* _ASM_X86_DWARF2_H */
Thanks and regards,
Jens
--
Jens Remus
Linux on Z Development (D3303) and z/VSE Support
+49-7031-16-1128 Office
jremus@de.ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des
Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der
Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-30 13:44 ` Mathieu Desnoyers
@ 2024-10-30 17:47 ` Josh Poimboeuf
2024-10-30 17:55 ` Josh Poimboeuf
2024-10-30 18:25 ` Josh Poimboeuf
0 siblings, 2 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-30 17:47 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Peter Zijlstra, x86, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski
On Wed, Oct 30, 2024 at 09:44:14AM -0400, Mathieu Desnoyers wrote:
> What you want here is to move the point where you clear the task
> cookie to _after_ completion of stack unwind. There are a few ways
> this can be done:
>
> A) Clear the task cookie in the task_work() right after the
> unwind_user_deferred() is completed. Downside: if some long task work
> happen to be done after the stack walk, a new unwind_user_deferred()
> could be issued again and we may end up looping forever taking stack
> unwind and never actually making forward progress.
>
> B) Clear the task cookie after the exit_to_user_mode_loop is done,
> before returning to user-space, while interrupts are disabled.
Problem is, if another tracer calls unwind_user_deferred() for the first
time, after the task work but before the task cookie gets cleared, it
will see the cookie is non-zero and will fail to schedule another task
work. So its callback never gets called.
> > If I change the entry code to increment a per-task counter instead of a
> > per-cpu counter then this problem goes away. I was just concerned about
> > the performance impacts of doing that on every entry.
>
> Moving from per-cpu to per-task makes this cookie task-specific and not
> global anymore, I don't think we want this for a stack walking
> infrastructure meant to be used by various tracers. Also a global cookie
> is more robust and does not depend on guaranteeing that all the
> trace data is present to guarantee current thread ID accuracy and
> thus that cookies match between deferred unwind request and their
> fulfillment.
I don't disagree. What I meant was, on entry (or exit), increment the
task cookie *with* the CPU bits included.
Or as you suggested previously, the "cookie" just be a struct with two
fields: CPU # and per-task entry counter.
--
Josh
* Re: [PATCH v3 02/19] x86/asm: Avoid emitting DWARF CFI for non-VDSO
2024-10-30 17:19 ` Jens Remus
@ 2024-10-30 17:51 ` Josh Poimboeuf
0 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-30 17:51 UTC (permalink / raw)
To: Jens Remus
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Wed, Oct 30, 2024 at 06:19:43PM +0100, Jens Remus wrote:
> > +#define CFI_STARTPROC
> > +#define CFI_ENDPROC
> > +#define CFI_DEF_CFA
> > +#define CFI_DEF_CFA_REGISTER
> > +#define CFI_DEF_CFA_OFFSET
> > +#define CFI_ADJUST_CFA_OFFSET
> > +#define CFI_OFFSET
> > +#define CFI_REL_OFFSET
> > +#define CFI_REGISTER
> > +#define CFI_RESTORE
> > +#define CFI_REMEMBER_STATE
> > +#define CFI_RESTORE_STATE
> > +#define CFI_UNDEFINED
> > +#define CFI_ESCAPE
>
> Don't all of the above need to be defined to "#", so that they change any
> parameters into an assembler comment? For example:
>
> #define CFI_STARTPROC #
Indeed. Or maybe they could be .macros which ignore the arguments.
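e.g. something like this (untested):

	.macro cfi_ignore args:vararg
	.endm

#define CFI_STARTPROC		cfi_ignore
#define CFI_ENDPROC		cfi_ignore
#define CFI_DEF_CFA		cfi_ignore
	/* ...and so on for the rest of the CFI_* macros */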
--
Josh
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-30 17:47 ` Josh Poimboeuf
@ 2024-10-30 17:55 ` Josh Poimboeuf
2024-10-30 18:25 ` Josh Poimboeuf
1 sibling, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-30 17:55 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Peter Zijlstra, x86, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski
On Wed, Oct 30, 2024 at 10:47:55AM -0700, Josh Poimboeuf wrote:
> > Moving from per-cpu to per-task makes this cookie task-specific and not
> > global anymore, I don't think we want this for a stack walking
> > infrastructure meant to be used by various tracers. Also a global cookie
> > is more robust and does not depend on guaranteeing that all the
> > trace data is present to guarantee current thread ID accuracy and
> > thus that cookies match between deferred unwind request and their
> > fulfillment.
>
> I don't disagree. What I meant was, on entry (or exit), increment the
> task cookie *with* the CPU bits included.
>
> Or as you suggested previously, the "cookie" just be a struct with two
> fields: CPU # and per-task entry counter.
Never mind, that wouldn't work... Two tasks could have identical
per-task counters while being scheduled on the same CPU.
--
Josh
* Re: [PATCH v3 06/19] x86/vdso: Enable sframe generation in VDSO
2024-10-28 21:47 ` [PATCH v3 06/19] x86/vdso: Enable sframe generation in VDSO Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
@ 2024-10-30 18:20 ` Jens Remus
2024-10-30 19:17 ` Josh Poimboeuf
1 sibling, 1 reply; 119+ messages in thread
From: Jens Remus @ 2024-10-30 18:20 UTC (permalink / raw)
To: Josh Poimboeuf, x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Heiko Carstens, Vasily Gorbik
On 28.10.2024 22:47, Josh Poimboeuf wrote:
> Enable sframe generation in the VDSO library so kernel and user space
> can unwind through it.
...
Applying similar changes to s390 and using the current binutils release without SFrame support on s390 results in build errors.
AFAIK the kernel has a minimum binutils requirement of 2.25 [1] and the assembler option "--gsframe" was introduced with 2.40.
> diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
> index c9216ac4fb1e..75ae9e093a2d 100644
> --- a/arch/x86/entry/vdso/Makefile
> +++ b/arch/x86/entry/vdso/Makefile
> @@ -47,13 +47,15 @@ quiet_cmd_vdso2c = VDSO2C $@
> $(obj)/vdso-image-%.c: $(obj)/vdso%.so.dbg $(obj)/vdso%.so $(obj)/vdso2c FORCE
> $(call if_changed,vdso2c)
>
> +SFRAME_CFLAGS := $(call as-option,-Wa$(comma)--gsframe,)
I have the impression this test might not work as expected. On s390 the assembler accepts option --gsframe and only generates an error when the assembler code actually contains any CFI directives:
$ cat test-nocfi.a
la %r1,42
$ as --gsframe test-nocfi.a
$ echo $?
0
$ cat test-cfi.a
.cfi_startproc
.cfi_endproc
$ as --gsframe test-cfi.a
test-cfi.a: Assembler messages:
test-cfi.a: Error: .sframe not supported for target
$ echo $?
1
But the following assembler code triggers an error:
$ cat test-sframe.a
.cfi_sections .sframe
.cfi_startproc
.cfi_endproc
$ as test-sframe.a
test-sframe.a: Assembler messages:
test-sframe.a: Error: .sframe not supported for target
$ echo $?
1
Maybe the following would be an alternative test in the Makefile?
SFRAME_CFLAGS := $(call as-instr,.cfi_sections .sframe\n.cfi_startproc\n.cfi_endproc,-DCONFIG_AS_SFRAME=1)
ifneq ($(SFRAME_CFLAGS),)
SFRAME_CFLAGS += -Wa,--gsframe
endif
> diff --git a/arch/x86/include/asm/dwarf2.h b/arch/x86/include/asm/dwarf2.h
> index b1aa3fcd5bca..1a49492817a1 100644
> --- a/arch/x86/include/asm/dwarf2.h
> +++ b/arch/x86/include/asm/dwarf2.h
> @@ -12,8 +12,11 @@
> * For the vDSO, emit both runtime unwind information and debug
> * symbols for the .dbg file.
> */
> -
> +#ifdef __x86_64__
#if defined(__x86_64__) && defined(CONFIG_AS_SFRAME)
> + .cfi_sections .eh_frame, .debug_frame, .sframe
> +#else
> .cfi_sections .eh_frame, .debug_frame
> +#endif
>
> #define CFI_STARTPROC .cfi_startproc
> #define CFI_ENDPROC .cfi_endproc
[1]: https://docs.kernel.org/process/changes.html
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303) and z/VSE Support
+49-7031-16-1128 Office
jremus@de.ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-30 17:47 ` Josh Poimboeuf
2024-10-30 17:55 ` Josh Poimboeuf
@ 2024-10-30 18:25 ` Josh Poimboeuf
1 sibling, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-30 18:25 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Peter Zijlstra, x86, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski
On Wed, Oct 30, 2024 at 10:47:55AM -0700, Josh Poimboeuf wrote:
> On Wed, Oct 30, 2024 at 09:44:14AM -0400, Mathieu Desnoyers wrote:
> > What you want here is to move the point where you clear the task
> > cookie to _after_ completion of stack unwind. There are a few ways
> > this can be done:
> >
> > A) Clear the task cookie in the task_work() right after the
> > unwind_user_deferred() is completed. Downside: if some long task work
> > happen to be done after the stack walk, a new unwind_user_deferred()
> > could be issued again and we may end up looping forever taking stack
> > unwind and never actually making forward progress.
> >
> > B) Clear the task cookie after the exit_to_user_mode_loop is done,
> > before returning to user-space, while interrupts are disabled.
>
> Problem is, if another tracer calls unwind_user_deferred() for the first
> time, after the task work but before the task cookie gets cleared, it
> will see the cookie is non-zero and will fail to schedule another task
> work. So its callback never gets called.
How about we:
- clear task cookie on entry or exit from user space
- use a different variable (rather than clearing of task cookie) to
communicate whether task work is pending
- keep track of which callbacks have been called for this cookie
something like so?
int unwind_user_deferred(struct unwind_callback *callback, u64 *ctx_cookie, void *data)
{
	struct unwind_task_info *info = &current->unwind_task_info;
	int idx = callback->idx;
	u64 cookie;

	if (WARN_ON_ONCE(in_nmi()))
		return -EINVAL;

	if (!current->mm)
		return -EINVAL;

	guard(irqsave)();

	cookie = info->ctx_cookie;

	/*
	 * If this is the first call from any caller since the most recent
	 * entry from user space, initialize the task context cookie.
	 */
	if (!cookie) {
		u64 cpu = raw_smp_processor_id();
		u64 ctx_ctr;

		__this_cpu_inc(unwind_ctx_ctr);
		ctx_ctr = __this_cpu_read(unwind_ctx_ctr);

		cookie = ctx_to_cookie(cpu, ctx_ctr);
		info->ctx_cookie = cookie;

	} else {
		if (cookie == info->last_cookies[idx]) {
			/*
			 * The callback has already been called with this
			 * unwind, nothing to do.
			 */
			goto done;
		}

		if (info->pending_callbacks & (1 << idx)) {
			/* callback already scheduled */
			goto done;
		}
	}

	info->pending_callbacks |= (1 << idx);
	info->privs[idx] = data;
	info->last_cookies[idx] = cookie;

	if (!info->task_pending) {
		info->task_pending = 1;	/* cleared by unwind_user_task_work() */
		task_work_add(current, &info->work, TWA_RESUME);
	}

done:
	if (ctx_cookie)
		*ctx_cookie = cookie;

	return 0;
}
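For reference, ctx_to_cookie() would presumably be something like this
(assuming the 16-bit CPU / 48-bit counter split discussed earlier in
the thread):

static inline u64 ctx_to_cookie(u64 cpu, u64 ctx)
{
	BUILD_BUG_ON(NR_CPUS > (1 << 16));
	return (cpu << 48) | (ctx & ((1ULL << 48) - 1));
}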
--
Josh
* Re: [PATCH v3 06/19] x86/vdso: Enable sframe generation in VDSO
2024-10-30 18:20 ` Jens Remus
@ 2024-10-30 19:17 ` Josh Poimboeuf
0 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-30 19:17 UTC (permalink / raw)
To: Jens Remus
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Heiko Carstens, Vasily Gorbik
On Wed, Oct 30, 2024 at 07:20:08PM +0100, Jens Remus wrote:
> > diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
> > index c9216ac4fb1e..75ae9e093a2d 100644
> > --- a/arch/x86/entry/vdso/Makefile
> > +++ b/arch/x86/entry/vdso/Makefile
> > @@ -47,13 +47,15 @@ quiet_cmd_vdso2c = VDSO2C $@
> > $(obj)/vdso-image-%.c: $(obj)/vdso%.so.dbg $(obj)/vdso%.so $(obj)/vdso2c FORCE
> > $(call if_changed,vdso2c)
> > +SFRAME_CFLAGS := $(call as-option,-Wa$(comma)--gsframe,)
>
> I have the impression this test might not work as expected.
I suspect it works fine on x86-64, since that was the first arch
supported, so --gsframe being supported also means x86-64 is supported.
But yeah, other arches (and x86-32) are a different story.
> Maybe the following would be an alternative test in the Makefile?
>
> SFRAME_CFLAGS := $(call as-instr,.cfi_sections .sframe\n.cfi_startproc\n.cfi_endproc,-DCONFIG_AS_SFRAME=1)
> ifneq ($(SFRAME_CFLAGS),)
> SFRAME_CFLAGS += -Wa,--gsframe
> endif
Looks good, though the ifneq isn't needed:
SFRAME_CFLAGS := $(call as-instr, \
		.cfi_sections .sframe\n.cfi_startproc\n.cfi_endproc, \
		-Wa$(comma)--gsframe -DCONFIG_AS_SFRAME=1)
Though, if multiple arches are going to be using that, maybe it should
just be a config option:
diff --git a/arch/Kconfig b/arch/Kconfig
index 33449485eafd..676dd45a7255 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -446,6 +446,9 @@ config HAVE_UNWIND_USER_SFRAME
bool
select UNWIND_USER
+config AS_SFRAME
+ def_bool $(as-instr,.cfi_sections .sframe\n.cfi_startproc\n.cfi_endproc)
+
config HAVE_PERF_CALLCHAIN_DEFERRED
bool
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-30 14:03 ` Peter Zijlstra
@ 2024-10-30 19:58 ` Steven Rostedt
2024-10-30 20:48 ` Josh Poimboeuf
0 siblings, 1 reply; 119+ messages in thread
From: Steven Rostedt @ 2024-10-30 19:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mathieu Desnoyers, Josh Poimboeuf, x86, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski
On Wed, 30 Oct 2024 15:03:24 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> You're designing things inside out again. You should add functionality
> by adding layers.
>
> Pass a void * into the 'request-unwind' and have the 'do-unwind'
> callback get that same pointer. Then anybody that needs identifiers to
> figure out where things came from can stuff something in there.
What the hell should I do with a void pointer? I want a stack trace, I
ask for one, it gives me an identifier for it, then at the end when it
does my callback it gives me the stack trace I asked for with the
identifier that it belongs to.
The identifier should be part of the unwinder code as it has to do with
the stack trace and has nothing to do with tracing.
-- Steve
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-30 19:58 ` Steven Rostedt
@ 2024-10-30 20:48 ` Josh Poimboeuf
0 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-30 20:48 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Mathieu Desnoyers, x86, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski
On Wed, Oct 30, 2024 at 03:58:16PM -0400, Steven Rostedt wrote:
> On Wed, 30 Oct 2024 15:03:24 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > You're designing things inside out again. You should add functionality
> > by adding layers.
> >
> > Pass a void * into the 'request-unwind' and have the 'do-unwind'
> > callback get that same pointer. Then anybody that needs identifiers to
> > figure out where things came from can stuff something in there.
>
> What the hell should I do with a void pointer? I want a stack trace, I
> ask for one, it gives me an identifier for it, then at the end when it
> does my callback it gives me the stack trace I asked for with the
> identifier that it belongs to.
>
> The identifier should be part of the unwinder code as it has to do
> with the stack trace and has nothing to do with tracing.
If at least most of the users would find the cookie useful -- which it
sounds like they would, maybe even including perf -- then it makes sense
to integrate that into the unwinder. It also helps the unwinder
disambiguate which callbacks have been called and whether the stack has
already been unwound.
That said, I'm also implementing the void pointer thing, as AFAICT perf
needs to pass the perf_event pointer to the callback. There's no reason
we can't have both.
--
Josh
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-10-30 5:53 ` Josh Poimboeuf
@ 2024-10-31 20:57 ` Andrii Nakryiko
2024-10-31 21:00 ` Nick Desaulniers
` (2 more replies)
0 siblings, 3 replies; 119+ messages in thread
From: Andrii Nakryiko @ 2024-10-31 20:57 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Tue, Oct 29, 2024 at 10:53 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Tue, Oct 29, 2024 at 04:32:40PM -0700, Andrii Nakryiko wrote:
> > It feels like this patch is trying to do too much. There is both new
> > UAPI introduction, and SFrame format definition, and unwinder
> > integration, etc, etc. Do you think it can be split further into more
> > focused smaller patches?
>
> True, let me see if I can split it up.
>
> > > +
> > > + if ((eppnt->p_flags & PF_X) && k < start_code)
> > > + start_code = k;
> > > +
> > > + if ((eppnt->p_flags & PF_X) && k + eppnt->p_filesz > end_code)
> > > + end_code = k + eppnt->p_filesz;
> > > + break;
> > > + }
> > > + case PT_GNU_SFRAME:
> > > + sframe_phdr = eppnt;
> >
> > if I understand correctly, there has to be only one sframe, is that
> > right? Should we validate that?
>
> Yes, there shouldn't be more than one PT_GNU_SFRAME for the executable
> itself. I can validate that.
>
> > > + break;
> > > }
> > > }
> > >
> > > + if (sframe_phdr)
> > > + sframe_add_section(load_addr + sframe_phdr->p_vaddr,
> > > + start_code, end_code);
> > > +
> >
> > no error checking?
>
> Good point. I remember discussing this with some people at Cauldron/LPC,
> I just forgot to do it!
>
> Right now it does all the validation at unwind, which could really slow
> things down unnecessarily if the sframe isn't valid.
>
> > > +#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
> > > +
> > > +#define INIT_MM_SFRAME .sframe_mt = MTREE_INIT(sframe_mt, 0),
> > > +
> > > +extern void sframe_free_mm(struct mm_struct *mm);
> > > +
> > > +/* text_start, text_end, file_name are optional */
> >
> > what file_name? was that an extra argument that got removed?
>
> Indeed, that was for some old code.
>
> > > case PR_RISCV_SET_ICACHE_FLUSH_CTX:
> > > error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
> > > break;
> > > + case PR_ADD_SFRAME:
> > > + if (arg5)
> > > + return -EINVAL;
> > > + error = sframe_add_section(arg2, arg3, arg4);
> >
> > wouldn't it be better to make this interface extendable from the get
> > go? Instead of passing 3 arguments with fixed meaning, why not pass a
> > pointer to an extendable binary struct like seems to be the trend
> > nowadays with nicely extensible APIs. See [0] for one such example
> > (specifically, struct procmap_query). Seems more prudent, as we'll
> > most probably will be adding flags, options, extra information, etc)
> >
> > [0] https://lore.kernel.org/linux-mm/20240627170900.1672542-3-andrii@kernel.org/
>
> This ioctl interface was admittedly hacked together. I was hoping
> somebody would suggest something better :-) I'll take a look.
>
> > > +static int find_fde(struct sframe_section *sec, unsigned long ip,
> > > + struct sframe_fde *fde)
> > > +{
> > > + struct sframe_fde __user *first, *last, *found = NULL;
> > > + u32 ip_off, func_off_low = 0, func_off_high = -1;
> > > +
> > > + ip_off = ip - sec->sframe_addr;
> >
> > what if ip_off is larger than 4GB? ELF section can be bigger than 4GB, right?
>
> That's baked into sframe v2.
I believe we do have large production binaries with more than 4GB of
text, what are we going to do about them? It would be interesting to
hear sframe people's opinion. Adding such a far-reaching new format in
2024 with these limitations is kind of sad. At the very least maybe we
should allow some form of chaining sframe definitions to cover more
than 4GB segments? Please CC relevant folks, I'm wondering what
they're thinking about this.
>
> > and also, does it mean that SFrame doesn't support executables with
> > text bigger than 4GB?
>
> Yes, but is that a realistic concern?
See above, yes. You'd be surprised. As somewhat corroborating
evidence, there were tons of problems and churn (within at least Meta)
with DWARF not supporting more than 2GB sizes, so yes, this is not an
abstract problem for sure. Modern production applications can be
ridiculously big.
>
> > > + } else {
> > > + struct vm_area_struct *vma, *text_vma = NULL;
> > > + VMA_ITERATOR(vmi, mm, 0);
> > > +
> > > + for_each_vma(vmi, vma) {
> > > + if (vma->vm_file != sframe_vma->vm_file ||
> > > + !(vma->vm_flags & VM_EXEC))
> > > + continue;
> > > +
> > > + if (text_vma) {
> > > + pr_warn_once("%s[%d]: multiple EXEC segments unsupported\n",
> > > + current->comm, current->pid);
> >
> > is this just something that fundamentally can't be supported by SFrame
> > format? Or just an implementation simplification?
>
> It's a simplification I suppose.
That's a rather random limitation, IMO... How hard would it be to not
make that assumption?
>
> > It's not illegal to have an executable with multiple VM_EXEC segments,
> > no? Should this be a pr_warn_once() then?
>
> I don't know, is it allowed? I've never seen it in practice. The
I'm pretty sure you can do that with a custom linker script, at the
very least. Normally this probably won't happen, but I don't think
Linux dictates how many executable VMAs an application can have. And
it probably just naturally happens for JIT-ted applications (Java, Go,
etc).
Linux kernel itself has two executable segments, for instance (though
kernel is special, of course, but still).
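For example, a custom linker script fragment along these lines gives an
executable a second R+X PT_LOAD (illustrative only):

PHDRS {
	text  PT_LOAD FLAGS(5);		/* PF_R | PF_X */
	text2 PT_LOAD FLAGS(5);		/* PF_R | PF_X */
}

SECTIONS {
	.text  : { *(.text .text.*) } :text
	.text2 : { *(.text.hot .text.hot.*) } :text2
}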
> pr_warn_once() is not reporting that it's illegal but rather that this
> corner case actually exists and maybe needs to be looked at.
This warn() will be logged across millions of machines in the fleet,
triggering alarms, with people looking at it and making custom internal
patches to disable the known-to-happen warning. Why do we need all this?
This is an issue that is trivial to trigger by a user process that's not
doing anything illegal. Why?
>
> --
> Josh
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-10-31 20:57 ` Andrii Nakryiko
@ 2024-10-31 21:00 ` Nick Desaulniers
2024-10-31 21:38 ` Indu Bhagat
2024-10-31 23:03 ` Josh Poimboeuf
2 siblings, 0 replies; 119+ messages in thread
From: Nick Desaulniers @ 2024-10-31 21:00 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Josh Poimboeuf, x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Arthur Eubanks
On Thu, Oct 31, 2024 at 1:57 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Oct 29, 2024 at 10:53 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> >
> > On Tue, Oct 29, 2024 at 04:32:40PM -0700, Andrii Nakryiko wrote:
> > > > +static int find_fde(struct sframe_section *sec, unsigned long ip,
> > > > + struct sframe_fde *fde)
> > > > +{
> > > > + struct sframe_fde __user *first, *last, *found = NULL;
> > > > + u32 ip_off, func_off_low = 0, func_off_high = -1;
> > > > +
> > > > + ip_off = ip - sec->sframe_addr;
> > >
> > > what if ip_off is larger than 4GB? ELF section can be bigger than 4GB, right?
> >
> > That's baked into sframe v2.
>
> I believe we do have large production binaries with more than 4GB of
> text, what are we going to do about them? It would be interesting to
> hear sframe people's opinion. Adding such a far-reaching new format in
> 2024 with these limitations is kind of sad. At the very least maybe we
> should allow some form of chaining sframe definitions to cover more
> than 4GB segments? Please CC relevant folks, I'm wondering what
> they're thinking about this.
FWIW, Google has such large binaries, too.
https://llvm.org/devmtg/2023-10/slides/quicktalks/Eubanks-CompromisesWithLargeX86-64Binaries.pdf
--
Thanks,
~Nick Desaulniers
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-30 6:10 ` Josh Poimboeuf
@ 2024-10-31 21:22 ` Andrii Nakryiko
2024-10-31 23:13 ` Josh Poimboeuf
0 siblings, 1 reply; 119+ messages in thread
From: Andrii Nakryiko @ 2024-10-31 21:22 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Tue, Oct 29, 2024 at 11:10 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Tue, Oct 29, 2024 at 04:32:46PM -0700, Andrii Nakryiko wrote:
> > > struct audit_context;
> > > @@ -1592,6 +1593,10 @@ struct task_struct {
> > > struct user_event_mm *user_event_mm;
> > > #endif
> > >
> > > +#ifdef CONFIG_UNWIND_USER
> > > + struct unwind_task_info unwind_task_info;
> >
> > this is quite a lot of memory to pay on each task, a lot of which a)
> > might not have sframe
>
> Frame pointers are also supported.
>
> > and b) might not need stack unwinding during their lifetime.
>
> Right, I'm not super happy about that.
>
> > It can be a pointer and allocated in copy_process(), no?
> > Though ideally this should be initialized lazily, if possible.
>
> Problem is, the unwinder doesn't know in advance which tasks will be
> unwound.
>
> Its first clue is unwind_user_register(); would it make sense for the
> caller to clarify whether all tasks need to be unwound or only a
> specific subset?
>
> Its second clue is unwind_user_deferred(), which is called for the task
> itself. But by then it's too late because it needs to access the
> per-task data from (potentially) irq context so it can't do a lazy
> allocation.
>
> I'm definitely open to ideas...
The laziest thing would be to perform GFP_ATOMIC allocation, and if
that fails, oops, too bad, no stack trace for you (but, generally
speaking, no big deal). Advantages are clear, though, right? Single
pointer in task_struct, which most of the time will be NULL, so no
unnecessary overheads.
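Something like this, perhaps (rough sketch; assumes task_struct carries
just a single unwind_task_info pointer):

static struct unwind_task_info *unwind_task_info_get(struct task_struct *task)
{
	struct unwind_task_info *info = READ_ONCE(task->unwind_task_info);

	if (info)
		return info;

	/* may be called from irq context, so no GFP_KERNEL here */
	info = kzalloc(sizeof(*info), GFP_ATOMIC);
	if (!info)
		return NULL;	/* oops, no stack trace for you */

	if (cmpxchg(&task->unwind_task_info, NULL, info)) {
		/* lost the race with a concurrent allocation */
		kfree(info);
		info = READ_ONCE(task->unwind_task_info);
	}

	return info;
}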
>
> > > + if (!info->entries) {
> > > + info->entries = kmalloc(UNWIND_MAX_ENTRIES * sizeof(long),
> > > + GFP_KERNEL);
> > > + if (!info->entries)
> > > + return;
> >
> > uhm... can we notify callbacks that stack capture failed? otherwise
> > we'd need some extra timeouts and other complications if we are
> > *waiting* for this callback to be called
>
> Hm, do you actually plan to wait for the callback?
>
> I assume this is BPF, can you give some high-level detail about how it
> will use these interfaces?
So I have a presentation about integrating async stack trace capture
into BPF at the last LSF/MM/BPF; LWN.net has an article ([0]) which you
might want to skim through, not sure if that will be helpful.
But very briefly, the way I see it, we'll have a workflow similar to
the following:
1. BPF program will call some new API to request stack trace capture:
bpf_get_stack_async(...)
- it will pass a reference to BPF ringbuf into which captured stack
trace should go
- this API will return a unique ID (similar to your cookie, but I'll
need a unique cookie for each bpf_get_stack_async() call, even
if all of them capture the same stack trace; so what Mathieu is
proposing with coalescing all requests and triggering one callback
isn't very convenient in this case, but we can probably work around
that in a BPF-specific way)
2. bpf_get_stack_async() calls unwind_user_deferred() and it expects
one of two outcomes:
- either successful capture, at which point we'll submit data into the
BPF ringbuf with that unique ID for the user to correlate whatever data
was captured synchronously with the stack trace that arrives later
- OR we failed to get a stack trace, and we'll still put a small
piece of information for the user to know that the stack trace is never
coming.
It's the last point that's important for making usability so much
simpler, avoiding unnecessary custom timeouts and stuff like that.
Regardless of whether the stack trace capture succeeds or not, the user
is guaranteed to get a "notification" about the outcome.
Hope this helps.
But basically, if I called unwind_user_deferred(), I expect to get
some callback, guaranteed, with the result or failure. The only thing
that's not guaranteed (and which makes timeouts bad) is *when* this
will happen. Because stack trace capture can be arbitrarily delayed
and stuff. That's fine, but that also shows why timeout is tricky and
necessarily fragile.
[0] https://lwn.net/Articles/978736/
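For example (sketch only -- bpf_get_stack_async() is the proposed API
from above and doesn't exist yet, so everything here is hypothetical):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 1 << 20);
} stacks SEC(".maps");

SEC("perf_event")
int on_sample(struct bpf_perf_event_data *ctx)
{
	/* request a deferred user stack capture; returns a unique ID */
	u64 id = bpf_get_stack_async(ctx, &stacks, 0);

	/*
	 * Emit the synchronous part of the sample tagged with 'id'; the
	 * stack trace (or a failure record) arrives in 'stacks' later,
	 * keyed by the same ID.
	 */
	bpf_printk("sample id=%llu", id);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";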
>
> --
> Josh
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-10-31 20:57 ` Andrii Nakryiko
2024-10-31 21:00 ` Nick Desaulniers
@ 2024-10-31 21:38 ` Indu Bhagat
2024-11-01 18:38 ` Andrii Nakryiko
2024-10-31 23:03 ` Josh Poimboeuf
2 siblings, 1 reply; 119+ messages in thread
From: Indu Bhagat @ 2024-10-31 21:38 UTC (permalink / raw)
To: Andrii Nakryiko, Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On 10/31/24 1:57 PM, Andrii Nakryiko wrote:
> On Tue, Oct 29, 2024 at 10:53 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>>
>> On Tue, Oct 29, 2024 at 04:32:40PM -0700, Andrii Nakryiko wrote:
>>> It feels like this patch is trying to do too much. There is both new
>>> UAPI introduction, and SFrame format definition, and unwinder
>>> integration, etc, etc. Do you think it can be split further into more
>>> focused smaller patches?
>>
>> True, let me see if I can split it up.
>>
>>>> +
>>>> + if ((eppnt->p_flags & PF_X) && k < start_code)
>>>> + start_code = k;
>>>> +
>>>> + if ((eppnt->p_flags & PF_X) && k + eppnt->p_filesz > end_code)
>>>> + end_code = k + eppnt->p_filesz;
>>>> + break;
>>>> + }
>>>> + case PT_GNU_SFRAME:
>>>> + sframe_phdr = eppnt;
>>>
>>> if I understand correctly, there has to be only one sframe, is that
>>> right? Should we validate that?
>>
>> Yes, there shouldn't be more than one PT_GNU_SFRAME for the executable
>> itself. I can validate that.
>>
>>>> + break;
>>>> }
>>>> }
>>>>
>>>> + if (sframe_phdr)
>>>> + sframe_add_section(load_addr + sframe_phdr->p_vaddr,
>>>> + start_code, end_code);
>>>> +
>>>
>>> no error checking?
>>
>> Good point. I remember discussing this with some people at Cauldron/LPC,
>> I just forgot to do it!
>>
>> Right now it does all the validation at unwind, which could really slow
>> things down unnecessarily if the sframe isn't valid.
>>
>>>> +#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
>>>> +
>>>> +#define INIT_MM_SFRAME .sframe_mt = MTREE_INIT(sframe_mt, 0),
>>>> +
>>>> +extern void sframe_free_mm(struct mm_struct *mm);
>>>> +
>>>> +/* text_start, text_end, file_name are optional */
>>>
>>> what file_name? was that an extra argument that got removed?
>>
>> Indeed, that was for some old code.
>>
>>>> case PR_RISCV_SET_ICACHE_FLUSH_CTX:
>>>> error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
>>>> break;
>>>> + case PR_ADD_SFRAME:
>>>> + if (arg5)
>>>> + return -EINVAL;
>>>> + error = sframe_add_section(arg2, arg3, arg4);
>>>
>>> wouldn't it be better to make this interface extendable from the get
>>> go? Instead of passing 3 arguments with fixed meaning, why not pass a
>>> pointer to an extendable binary struct like seems to be the trend
>>> nowadays with nicely extensible APIs. See [0] for one such example
>>> (specifically, struct procmap_query). Seems more prudent, as we'll
>>> most probably be adding flags, options, extra information, etc.)
>>>
>>> [0] https://lore.kernel.org/linux-mm/20240627170900.1672542-3-andrii@kernel.org/
>>
>> This ioctl interface was admittedly hacked together. I was hoping
>> somebody would suggest something better :-) I'll take a look.
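For reference, the extensible-struct pattern being suggested looks
roughly like this (struct sframe_reg and its fields are hypothetical,
modeled on procmap_query, not a proposed ABI):

struct sframe_reg {
        __u64 size;             /* sizeof(struct sframe_reg), for compat */
        __u64 flags;            /* must be 0 for now */
        __u64 sframe_addr;      /* in-memory address of the .sframe section */
        __u64 text_start;       /* optional */
        __u64 text_end;         /* optional */
};

/* prctl(PR_ADD_SFRAME, &reg, 0, 0, 0): the kernel copies 'size' bytes,
 * rejects unknown flags, and can grow the struct in later versions. */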
>>
>>>> +static int find_fde(struct sframe_section *sec, unsigned long ip,
>>>> + struct sframe_fde *fde)
>>>> +{
>>>> + struct sframe_fde __user *first, *last, *found = NULL;
>>>> + u32 ip_off, func_off_low = 0, func_off_high = -1;
>>>> +
>>>> + ip_off = ip - sec->sframe_addr;
>>>
>>> what if ip_off is larger than 4GB? ELF section can be bigger than 4GB, right?
>>
>> That's baked into sframe v2.
>
> I believe we do have large production binaries with more than 4GB of
> text, what are we going to do about them? It would be interesting to
> hear sframe people's opinion. Adding such a far-reaching new format in
> 2024 with these limitations is kind of sad. At the very least maybe we
> should allow some form of chaining sframe definitions to cover more
> than 4GB segments? Please CC relevant folks, I'm wondering what
> they're thinking about this.
>
SFrame V2 does have that limitation. We can try to have a 64-bit
representation for the 'ip' in the SFrame FDE and conditionalize it
somehow (say, with a flag in the header) so as to not bloat the majority
of applications.
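Purely as an illustration, that could look something like this (the
flag name, its value, and the struct layout are hypothetical, not part
of any published SFrame spec):

/* hypothetical v3 header flag: FDEs carry 64-bit start offsets */
#define SFRAME_F_FDE_IP64       0x8

struct sframe_fde64 {                   /* hypothetical wide FDE */
        int64_t  func_start_off;        /* int32_t in v2, hence the 4GB cap */
        uint32_t func_size;
        uint32_t fre_off;
        uint32_t num_fres;
        uint8_t  info;
        uint8_t  rep_size;
        uint16_t padding;
};

Most binaries would keep the compact 32-bit FDEs; only huge text
segments would pay for the wider encoding.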
>>
>>> and also, does it mean that SFrame doesn't support executables with
>>> text bigger than 4GB?
>>
>> Yes, but is that a realistic concern?
>
> See above, yes. You'd be surprised. As somewhat corroborating
> evidence, there were tons of problems and churn (within at least Meta)
> with DWARF not supporting more than 2GB sizes, so yes, this is not an
> abstract problem for sure. Modern production applications can be
> ridiculously big.
>
>>
>>>> + } else {
>>>> + struct vm_area_struct *vma, *text_vma = NULL;
>>>> + VMA_ITERATOR(vmi, mm, 0);
>>>> +
>>>> + for_each_vma(vmi, vma) {
>>>> + if (vma->vm_file != sframe_vma->vm_file ||
>>>> + !(vma->vm_flags & VM_EXEC))
>>>> + continue;
>>>> +
>>>> + if (text_vma) {
>>>> + pr_warn_once("%s[%d]: multiple EXEC segments unsupported\n",
>>>> + current->comm, current->pid);
>>>
>>> is this just something that fundamentally can't be supported by SFrame
>>> format? Or just an implementation simplification?
>>
>> It's a simplification I suppose.
>
> That's a rather random limitation, IMO... How hard would it be to not
> make that assumption?
>
>>
>>> It's not illegal to have an executable with multiple VM_EXEC segments,
>>> no? Should this be a pr_warn_once() then?
>>
>> I don't know, is it allowed? I've never seen it in practice. The
>
> I'm pretty sure you can do that with a custom linker script, at the
> very least. Normally this probably won't happen, but I don't think
> Linux dictates how many executable VMAs an application can have. And
> it probably just naturally happens for JIT-ted applications (Java, Go,
> etc).
>
> Linux kernel itself has two executable segments, for instance (though
> kernel is special, of course, but still).
>
>> pr_warn_once() is not reporting that it's illegal but rather that this
>> corner case actually exists and maybe needs to be looked at.
>
> This warn() will be logged across millions of machines in the fleet,
> triggering alarms, people looking at this, making custom internal
> patches to disable the known-to-happen warn. Why do we need all this?
> This is an issue that is trivial to trigger by user process that's not
> doing anything illegal. Why?
>
>>
>> --
>> Josh
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-10-31 20:57 ` Andrii Nakryiko
2024-10-31 21:00 ` Nick Desaulniers
2024-10-31 21:38 ` Indu Bhagat
@ 2024-10-31 23:03 ` Josh Poimboeuf
2024-11-01 18:34 ` Andrii Nakryiko
2024-11-01 19:09 ` Segher Boessenkool
2 siblings, 2 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-31 23:03 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Thu, Oct 31, 2024 at 01:57:10PM -0700, Andrii Nakryiko wrote:
> > > what if ip_off is larger than 4GB? ELF section can be bigger than 4GB, right?
> >
> > That's baked into sframe v2.
>
> I believe we do have large production binaries with more than 4GB of
> text, what are we going to do about them? It would be interesting to
> hear sframe people's opinion. Adding such a far-reaching new format in
> 2024 with these limitations is kind of sad. At the very least maybe we
> should allow some form of chaining sframe definitions to cover more
> than 4GB segments? Please CC relevant folks, I'm wondering what
> they're thinking about this.
Personally I find the idea of a single 4GB+ text segment pretty
surprising as I've never seen anything even close to that.
Anyway it's iterative development and not everybody's requirements are
clear from day 1. Which is why we're discussing it now. I think there
are already plans to do an sframe v3.
> > > > + if (text_vma) {
> > > > + pr_warn_once("%s[%d]: multiple EXEC segments unsupported\n",
> > > > + current->comm, current->pid);
> > >
> > > is this just something that fundamentally can't be supported by SFrame
> > > format? Or just an implementation simplification?
> >
> > It's a simplification I suppose.
>
> That's a rather random limitation, IMO... How hard would it be to not
> make that assumption?
It's definitely not random, there's no need to complicate the code if
this condition doesn't exist.
> > > It's not illegal to have an executable with multiple VM_EXEC segments,
> > > no? Should this be a pr_warn_once() then?
> >
> > I don't know, is it allowed? I've never seen it in practice. The
>
> I'm pretty sure you can do that with a custom linker script, at the
> very least. Normally this probably won't happen, but I don't think
> Linux dictates how many executable VMAs an application can have.
> And it probably just naturally happens for JIT-ted applications (Java,
> Go, etc).
Actually I just double checked and even the kernel's ELF loader assumes
that each executable has only a single text start+end address pair.
> > pr_warn_once() is not reporting that it's illegal but rather that this
> > corner case actually exists and maybe needs to be looked at.
>
> This warn() will be logged across millions of machines in the fleet,
> triggering alarms, people looking at this, making custom internal
> patches to disable the known-to-happen warn. Why do we need all this?
> This is an issue that is trivial to trigger by user process that's not
> doing anything illegal. Why?
There's no point in adding complexity to support some hypothetical. I
can remove the printk though.
--
Josh
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-31 21:22 ` Andrii Nakryiko
@ 2024-10-31 23:13 ` Josh Poimboeuf
2024-10-31 23:28 ` Andrii Nakryiko
0 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-10-31 23:13 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Thu, Oct 31, 2024 at 02:22:48PM -0700, Andrii Nakryiko wrote:
> > Problem is, the unwinder doesn't know in advance which tasks will be
> > unwound.
> >
> > Its first clue is unwind_user_register(), would it make sense for the
> > caller to clarify whether all tasks need to be unwound or only a
> > specific subset?
> >
> > Its second clue is unwind_user_deferred(), which is called for the task
> > itself. But by then it's too late because it needs to access the
> > per-task data from (potentially) irq context so it can't do a lazy
> > allocation.
> >
> > I'm definitely open to ideas...
>
> The laziest thing would be to perform GFP_ATOMIC allocation, and if
> that fails, oops, too bad, no stack trace for you (but, generally
> speaking, no big deal). Advantages are clear, though, right? Single
> pointer in task_struct, which most of the time will be NULL, so no
> unnecessary overheads.
GFP_ATOMIC is limited, I don't think we want the unwinder to trigger
OOM.
> It's the last point that's important for making usability so much
> simpler, avoiding unnecessary custom timeouts and stuff like that.
> Regardless of whether stack trace capture succeeds or not, the user is
> guaranteed to get a "notification" about the outcome.
>
> Hope this helps.
>
> But basically, if I called unwind_user_deferred(), I expect to get
> some callback, guaranteed, with the result or failure. The only thing
> that's not guaranteed (and which makes timeouts bad) is *when* this
> will happen. Because stack trace capture can be arbitrarily delayed
> and stuff. That's fine, but that also shows why timeouts are tricky and
> necessarily fragile.
That sounds reasonable. In the OOM error case I can just pass a small
(stack allocated) one-entry trace with only regs->ip.
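Roughly like so (type and field names illustrative, not necessarily
what the next revision will use):

/* allocation failed: still honor the guaranteed callback with a
 * minimal stack-allocated trace containing only the IP */
unsigned long entry = instruction_pointer(regs);
struct unwind_stacktrace trace = {
        .entries        = &entry,
        .nr             = 1,
};
callback(current, &trace, cookie);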
--
Josh
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-31 23:13 ` Josh Poimboeuf
@ 2024-10-31 23:28 ` Andrii Nakryiko
2024-11-01 17:41 ` Josh Poimboeuf
0 siblings, 1 reply; 119+ messages in thread
From: Andrii Nakryiko @ 2024-10-31 23:28 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Thu, Oct 31, 2024 at 4:13 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Thu, Oct 31, 2024 at 02:22:48PM -0700, Andrii Nakryiko wrote:
> > > Problem is, the unwinder doesn't know in advance which tasks will be
> > > unwound.
> > >
> > > Its first clue is unwind_user_register(), would it make sense for the
> > > caller to clarify whether all tasks need to be unwound or only a
> > > specific subset?
> > >
> > > Its second clue is unwind_user_deferred(), which is called for the task
> > > itself. But by then it's too late because it needs to access the
> > > per-task data from (potentially) irq context so it can't do a lazy
> > > allocation.
> > >
> > > I'm definitely open to ideas...
> >
> > The laziest thing would be to perform GFP_ATOMIC allocation, and if
> > that fails, oops, too bad, no stack trace for you (but, generally
> > speaking, no big deal). Advantages are clear, though, right? Single
> > pointer in task_struct, which most of the time will be NULL, so no
> > unnecessary overheads.
>
> GFP_ATOMIC is limited, I don't think we want the unwinder to trigger
> OOM.
>
So all task_structs on the system using 104 bytes more, *permanently*
and *unconditionally*, is not a concern, but lazy GFP_ATOMIC
allocation when you actually need it is?
> > It's the last point that's important for making usability so much
> > simpler, avoiding unnecessary custom timeouts and stuff like that.
> > Regardless of whether stack trace capture succeeds or not, the user is
> > guaranteed to get a "notification" about the outcome.
> >
> > Hope this helps.
> >
> > But basically, if I called unwind_user_deferred(), I expect to get
> > some callback, guaranteed, with the result or failure. The only thing
> > that's not guaranteed (and which makes timeouts bad) is *when* this
> > will happen. Because stack trace capture can be arbitrarily delayed
> > and stuff. That's fine, but that also shows why timeouts are tricky and
> > necessarily fragile.
>
> That sounds reasonable. In the OOM error case I can just pass a small
> (stack allocated) one-entry trace with only regs->ip.
>
SGTM
> --
> Josh
>
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-10-31 23:28 ` Andrii Nakryiko
@ 2024-11-01 17:41 ` Josh Poimboeuf
2024-11-01 18:05 ` Andrii Nakryiko
0 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-11-01 17:41 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Thu, Oct 31, 2024 at 04:28:08PM -0700, Andrii Nakryiko wrote:
> So all task_structs on the system using 104 bytes more, *permanently*
Either way it's permanent, we don't know when to free it until the task
struct is freed...
> and *unconditionally*, is not a concern
Of course it's a concern, that's why we're looking for something
better...
> but lazy GFP_ATOMIC allocation when you actually need it is?
We don't want to dip into the GFP_ATOMIC emergency reserves, those are
kept for more important things.
Actually, I think I can just use GFP_NOWAIT here.
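A minimal sketch of that lazy path (struct and field names are
illustrative):

struct unwind_task_info *info = current->unwind_info;

if (!info) {
        /* first deferred-unwind request for this task: allocate
         * lazily; GFP_NOWAIT fails fast rather than dipping into
         * atomic reserves or triggering reclaim */
        info = kzalloc(sizeof(*info), GFP_NOWAIT);
        if (!info)
                return -ENOMEM; /* fall back to the one-entry trace */
        current->unwind_info = info;
}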
--
Josh
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
2024-11-01 17:41 ` Josh Poimboeuf
@ 2024-11-01 18:05 ` Andrii Nakryiko
0 siblings, 0 replies; 119+ messages in thread
From: Andrii Nakryiko @ 2024-11-01 18:05 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Fri, Nov 1, 2024 at 10:41 AM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Thu, Oct 31, 2024 at 04:28:08PM -0700, Andrii Nakryiko wrote:
> > So all task_structs on the system using 104 bytes more, *permanently*
>
> Either way it's permanent, we don't know when to free it until the task
> struct is freed...
>
I'm not sure if we are arguing for the sake of arguing at this point
:) Yes, for *those tasks* for which we at least once requested a stack
trace, that memory will stay, sure. But there are normally tons of
threads that are almost completely idle and/or use so little CPU that
they won't ever be caught in the profiler, so their stack trace will
never be requested.
Sure, you can come up with a use case where you'll just go over each
task and ask for a stack trace for each of them, but that's not a
common case.
So, sorry, but no, I don't agree that these are equivalent things.
Lazy memory allocation is a must, IMO.
> > and *unconditionally*, is not a concern
>
> Of course it's a concern, that's why we're looking for something
> better...
>
> > but lazy GFP_ATOMIC allocation when you actually need it is?
>
> We don't want to dip into the GFP_ATOMIC emergency reserves, those are
> kept for more important things.
>
> Actually, I think I can just use GFP_NOWAIT here.
Whatever semantics works for being called from NMI (even if it can fail).
>
> --
> Josh
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-10-31 23:03 ` Josh Poimboeuf
@ 2024-11-01 18:34 ` Andrii Nakryiko
2024-11-01 19:29 ` Josh Poimboeuf
2024-11-01 19:09 ` Segher Boessenkool
1 sibling, 1 reply; 119+ messages in thread
From: Andrii Nakryiko @ 2024-11-01 18:34 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski, bpf
+cc bpf@, where people care about stack traces and profiling as well
(please cc bpf@vger.kernel.org on future revisions; I'm sure a
bunch of folks would appreciate it and have something useful to say)
On Thu, Oct 31, 2024 at 4:03 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Thu, Oct 31, 2024 at 01:57:10PM -0700, Andrii Nakryiko wrote:
> > > > what if ip_off is larger than 4GB? ELF section can be bigger than 4GB, right?
> > >
> > > That's baked into sframe v2.
> >
> > I believe we do have large production binaries with more than 4GB of
> > text, what are we going to do about them? It would be interesting to
> > hear sframe people's opinion. Adding such a far-reaching new format in
> > 2024 with these limitations is kind of sad. At the very least maybe we
> > should allow some form of chaining sframe definitions to cover more
> > than 4GB segments? Please CC relevant folks, I'm wondering what
> > they're thinking about this.
>
> Personally I find the idea of a single 4GB+ text segment pretty
> surprising as I've never seen anything even close to that.
I grabbed one of Meta's production servers running one of the big-ish
services (I don't know if it's even the largest, most probably not).
Here are the first few entries from /proc/pid/maps belonging to the
main binary:
00200000-170ad000 r--p 00000000 07:01 5
172ac000-498e7000 r-xp 16eac000 07:01 5
49ae7000-49b8b000 r--p 494e7000 07:01 5
49d8b000-4a228000 rw-p 4958b000 07:01 5
4a228000-4c677000 rw-p 00000000 00:00 0
4c800000-4ca00000 r-xp 49c00000 07:01 5
4ca00000-4f600000 r-xp 49e00000 07:01 5
4f600000-5b270000 r-xp 4ca00000 07:01 5
A few observations:
1) There are 4 executable segments in just the first 8 entries.
2) Their total size is already approaching 1.5GB:
>>> ((0x170ad000 - 0x200000) + (0x5b270000 - 0x4f600000) + (0x498e7000 - 0x172ac000)) / 1024 / 1024
1361.34375
I don't know about you, but in my experience things like code size
tend to just grow over time; they rarely shrink (and even that usually
requires a tremendous and focused effort).
>
> Anyway it's iterative development and not everybody's requirements are
> clear from day 1. Which is why we're discussing it now. I think there
> are already plans to do an sframe v3.
Of course, which is why I'm providing this feedback. But it would be
nice to avoid having to support a zoo of versions if we already know
there are practical limitations that we are not that far from hitting.
>
> > > > > + if (text_vma) {
> > > > > + pr_warn_once("%s[%d]: multiple EXEC segments unsupported\n",
> > > > > + current->comm, current->pid);
> > > >
> > > > is this just something that fundamentally can't be supported by SFrame
> > > > format? Or just an implementation simplification?
> > >
> > > It's a simplification I suppose.
> >
> > That's a rather random limitation, IMO... How hard would it be to not
> > make that assumption?
>
> It's definitely not random, there's no need to complicate the code if
> this condition doesn't exist.
Sorry, I'm probably dense and missing something. But from the example
process above, isn't this check violated already? Or it's two
different things? Not sure, honestly.
>
> > > > It's not illegal to have an executable with multiple VM_EXEC segments,
> > > > no? Should this be a pr_warn_once() then?
> > >
> > > I don't know, is it allowed? I've never seen it in practice. The
> >
> > I'm pretty sure you can do that with a custom linker script, at the
> > very least. Normally this probably won't happen, but I don't think
> > Linux dictates how many executable VMAs an application can have.
> > And it probably just naturally happens for JIT-ted applications (Java,
> > Go, etc).
>
> Actually I just double checked and even the kernel's ELF loader assumes
> that each executable has only a single text start+end address pair.
See above, very confused by such assumptions, but I'm hoping we are
talking about two different things here.
>
> > > pr_warn_once() is not reporting that it's illegal but rather that this
> > > corner case actually exists and maybe needs to be looked at.
> >
> > This warn() will be logged across millions of machines in the fleet,
> > triggering alarms, people looking at this, making custom internal
> > patches to disable the known-to-happen warn. Why do we need all this?
> > This is an issue that is trivial to trigger by user process that's not
> > doing anything illegal. Why?
>
> There's no point in adding complexity to support some hypothetical. I
> can remove the printk though.
We are talking about fundamental things like format for supporting
frame pointer-less stack trace capture. It will take years to adopt
SFrame everywhere, so I think it's prudent to think a bit ahead beyond
just saying "no real application should need more than 4GB text", IMO.
>
> --
> Josh
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-10-31 21:38 ` Indu Bhagat
@ 2024-11-01 18:38 ` Andrii Nakryiko
2024-11-01 18:47 ` Steven Rostedt
2024-11-03 0:07 ` Indu Bhagat
0 siblings, 2 replies; 119+ messages in thread
From: Andrii Nakryiko @ 2024-11-01 18:38 UTC (permalink / raw)
To: Indu Bhagat
Cc: Josh Poimboeuf, x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski, bpf
On Thu, Oct 31, 2024 at 2:38 PM Indu Bhagat <indu.bhagat@oracle.com> wrote:
>
> On 10/31/24 1:57 PM, Andrii Nakryiko wrote:
> > On Tue, Oct 29, 2024 at 10:53 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> >>
> >> On Tue, Oct 29, 2024 at 04:32:40PM -0700, Andrii Nakryiko wrote:
> >>> It feels like this patch is trying to do too much. There is both new
> >>> UAPI introduction, and SFrame format definition, and unwinder
> >>> integration, etc, etc. Do you think it can be split further into more
> >>> focused smaller patches?
> >>
> >> True, let me see if I can split it up.
> >>
> >>>> +
> >>>> + if ((eppnt->p_flags & PF_X) && k < start_code)
> >>>> + start_code = k;
> >>>> +
> >>>> + if ((eppnt->p_flags & PF_X) && k + eppnt->p_filesz > end_code)
> >>>> + end_code = k + eppnt->p_filesz;
> >>>> + break;
> >>>> + }
> >>>> + case PT_GNU_SFRAME:
> >>>> + sframe_phdr = eppnt;
> >>>
> >>> if I understand correctly, there has to be only one sframe, is that
> >>> right? Should we validate that?
> >>
> >> Yes, there shouldn't be more than one PT_GNU_SFRAME for the executable
> >> itself. I can validate that.
> >>
> >>>> + break;
> >>>> }
> >>>> }
> >>>>
> >>>> + if (sframe_phdr)
> >>>> + sframe_add_section(load_addr + sframe_phdr->p_vaddr,
> >>>> + start_code, end_code);
> >>>> +
> >>>
> >>> no error checking?
> >>
> >> Good point. I remember discussing this with some people at Cauldron/LPC,
> >> I just forgot to do it!
> >>
> >> Right now it does all the validation at unwind, which could really slow
> >> things down unnecessarily if the sframe isn't valid.
> >>
> >>>> +#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
> >>>> +
> >>>> +#define INIT_MM_SFRAME .sframe_mt = MTREE_INIT(sframe_mt, 0),
> >>>> +
> >>>> +extern void sframe_free_mm(struct mm_struct *mm);
> >>>> +
> >>>> +/* text_start, text_end, file_name are optional */
> >>>
> >>> what file_name? was that an extra argument that got removed?
> >>
> >> Indeed, that was for some old code.
> >>
> >>>> case PR_RISCV_SET_ICACHE_FLUSH_CTX:
> >>>> error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
> >>>> break;
> >>>> + case PR_ADD_SFRAME:
> >>>> + if (arg5)
> >>>> + return -EINVAL;
> >>>> + error = sframe_add_section(arg2, arg3, arg4);
> >>>
> >>> wouldn't it be better to make this interface extendable from the get
> >>> go? Instead of passing 3 arguments with fixed meaning, why not pass a
> >>> pointer to an extendable binary struct like seems to be the trend
> >>> nowadays with nicely extensible APIs. See [0] for one such example
> >>> (specifically, struct procmap_query). Seems more prudent, as we'll
> >>> most probably be adding flags, options, extra information, etc.)
> >>>
> >>> [0] https://lore.kernel.org/linux-mm/20240627170900.1672542-3-andrii@kernel.org/
> >>
> >> This ioctl interface was admittedly hacked together. I was hoping
> >> somebody would suggest something better :-) I'll take a look.
> >>
> >>>> +static int find_fde(struct sframe_section *sec, unsigned long ip,
> >>>> + struct sframe_fde *fde)
> >>>> +{
> >>>> + struct sframe_fde __user *first, *last, *found = NULL;
> >>>> + u32 ip_off, func_off_low = 0, func_off_high = -1;
> >>>> +
> >>>> + ip_off = ip - sec->sframe_addr;
> >>>
> >>> what if ip_off is larger than 4GB? ELF section can be bigger than 4GB, right?
> >>
> >> That's baked into sframe v2.
> >
> > I believe we do have large production binaries with more than 4GB of
> > text, what are we going to do about them? It would be interesting to
> > hear sframe people's opinion. Adding such a far-reaching new format in
> > 2024 with these limitations is kind of sad. At the very least maybe we
> > should allow some form of chaining sframe definitions to cover more
> > than 4GB segments? Please CC relevant folks, I'm wondering what
> > they're thinking about this.
> >
>
> SFrame V2 does have that limitation. We can try to have a 64-bit
> representation for the 'ip' in the SFrame FDE and conditionalize it
> somehow (say, with a flag in the header) so as to not bloat the majority
> of applications.
Hi Indu,
I think that's prudent if we believe that SFrame is the solution here.
See my reply to Josh. Real-world binaries already approach the 4GB limit, and
things are not going to shrink in the years to come. So yeah, probably
we need some adjustments to the format to at least allow 64-bit
offsets (though trying to stick to 32-bit as much as possible, of
course, if they work).
I'm not really familiar with the nuances of the format just yet, so
can't really provide anything more useful at this point. What would be
the gold-standard reference for the SFrame format to familiarize myself
thoroughly?
BTW, I wanted to ask. Are there any plans to add SFrame support to
Clang as well? It feels like without that there is no future for
SFrame as a general-purpose solution for stack traces.
>
> >>
> >>> and also, does it mean that SFrame doesn't support executables with
> >>> text bigger than 4GB?
> >>
> >> Yes, but is that a realistic concern?
> >
> > See above, yes. You'd be surprised. As somewhat corroborating
> > evidence, there were tons of problems and churn (within at least Meta)
> > with DWARF not supporting more than 2GB sizes, so yes, this is not an
> > abstract problem for sure. Modern production applications can be
> > ridiculously big.
> >
> >>
> >>>> + } else {
> >>>> + struct vm_area_struct *vma, *text_vma = NULL;
> >>>> + VMA_ITERATOR(vmi, mm, 0);
> >>>> +
> >>>> + for_each_vma(vmi, vma) {
> >>>> + if (vma->vm_file != sframe_vma->vm_file ||
> >>>> + !(vma->vm_flags & VM_EXEC))
> >>>> + continue;
> >>>> +
> >>>> + if (text_vma) {
> >>>> + pr_warn_once("%s[%d]: multiple EXEC segments unsupported\n",
> >>>> + current->comm, current->pid);
> >>>
> >>> is this just something that fundamentally can't be supported by SFrame
> >>> format? Or just an implementation simplification?
> >>
> >> It's a simplification I suppose.
> >
> > That's a rather random limitation, IMO... How hard would it be to not
> > make that assumption?
> >
> >>
> >>> It's not illegal to have an executable with multiple VM_EXEC segments,
> >>> no? Should this be a pr_warn_once() then?
> >>
> >> I don't know, is it allowed? I've never seen it in practice. The
> >
> > I'm pretty sure you can do that with a custom linker script, at the
> > very least. Normally this probably won't happen, but I don't think
> > Linux dictates how many executable VMAs an application can have. And
> > it probably just naturally happens for JIT-ted applications (Java, Go,
> > etc).
> >
> > Linux kernel itself has two executable segments, for instance (though
> > kernel is special, of course, but still).
> >
> >> pr_warn_once() is not reporting that it's illegal but rather that this
> >> corner case actually exists and maybe needs to be looked at.
> >
> > This warn() will be logged across millions of machines in the fleet,
> > triggering alarms, people looking at this, making custom internal
> > patches to disable the known-to-happen warn. Why do we need all this?
> > This is an issue that is trivial to trigger by user process that's not
> > doing anything illegal. Why?
> >
> >>
> >> --
> >> Josh
>
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-11-01 18:38 ` Andrii Nakryiko
@ 2024-11-01 18:47 ` Steven Rostedt
2024-11-01 18:54 ` Andrii Nakryiko
2024-11-03 0:07 ` Indu Bhagat
1 sibling, 1 reply; 119+ messages in thread
From: Steven Rostedt @ 2024-11-01 18:47 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Indu Bhagat, Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski, bpf
On Fri, 1 Nov 2024 11:38:47 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> BTW, I wanted to ask. Are there any plans to add SFrame support to
> Clang as well? It feels like without that there is no future for
> SFrame as a general-purpose solution for stack traces.
We want to use SFrames inside Google, and having Clang support it is a
requirement for that. I'm working on getting people to support it in Clang.
-- Steve
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-11-01 18:47 ` Steven Rostedt
@ 2024-11-01 18:54 ` Andrii Nakryiko
0 siblings, 0 replies; 119+ messages in thread
From: Andrii Nakryiko @ 2024-11-01 18:54 UTC (permalink / raw)
To: Steven Rostedt
Cc: Indu Bhagat, Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski, bpf
On Fri, Nov 1, 2024 at 11:46 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 1 Nov 2024 11:38:47 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> > BTW, I wanted to ask. Are there any plans to add SFrame support to
> > Clang as well? It feels like without that there is no future for
> > SFrame as a general-purpose solution for stack traces.
>
> We want to use SFrames inside Google, and having Clang support it is a
> requirement for that. I'm working on getting people to support it in Clang.
>
Nice, good to hear!
> -- Steve
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-10-31 23:03 ` Josh Poimboeuf
2024-11-01 18:34 ` Andrii Nakryiko
@ 2024-11-01 19:09 ` Segher Boessenkool
2024-11-01 19:33 ` Josh Poimboeuf
1 sibling, 1 reply; 119+ messages in thread
From: Segher Boessenkool @ 2024-11-01 19:09 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Andrii Nakryiko, x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Thu, Oct 31, 2024 at 04:03:13PM -0700, Josh Poimboeuf wrote:
> Personally I find the idea of a single 4GB+ text segment pretty
> surprising as I've never seen anything even close to that.
It is pretty common I'm afraid.
> Actually I just double checked and even the kernel's ELF loader assumes
> that each executable has only a single text start+end address pair.
Huh? What makes you think that? There can be many executable PT_LOAD
segments in each and every binary.
Segher
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-11-01 18:34 ` Andrii Nakryiko
@ 2024-11-01 19:29 ` Josh Poimboeuf
2024-11-01 19:44 ` Andrii Nakryiko
0 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-11-01 19:29 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski, bpf
On Fri, Nov 01, 2024 at 11:34:48AM -0700, Andrii Nakryiko wrote:
> 00200000-170ad000 r--p 00000000 07:01 5
> 172ac000-498e7000 r-xp 16eac000 07:01 5
> 49ae7000-49b8b000 r--p 494e7000 07:01 5
> 49d8b000-4a228000 rw-p 4958b000 07:01 5
> 4a228000-4c677000 rw-p 00000000 00:00 0
> 4c800000-4ca00000 r-xp 49c00000 07:01 5
> 4ca00000-4f600000 r-xp 49e00000 07:01 5
> 4f600000-5b270000 r-xp 4ca00000 07:01 5
>
> Sorry, I'm probably dense and missing something. But from the example
> process above, isn't this check violated already? Or it's two
> different things? Not sure, honestly.
It's hard to tell exactly what's going on, did you strip the file names?
The sframe limitation is per file, not per address space. I assume
these are one file:
> 172ac000-498e7000 r-xp 16eac000 07:01 5
and these are another:
> 4c800000-4ca00000 r-xp 49c00000 07:01 5
> 4ca00000-4f600000 r-xp 49e00000 07:01 5
> 4f600000-5b270000 r-xp 4ca00000 07:01 5
Multiple mappings for a single file are fine, as long as they're
contiguous.
> > Actually I just double checked and even the kernel's ELF loader assumes
> > that each executable has only a single text start+end address pair.
>
> See above, very confused by such assumptions, but I'm hoping we are
> talking about two different things here.
The "contiguous text" thing seems enforced by the kernel for
executables. However it doesn't manage shared libraries, those are
mapped by the loader, e.g. /lib64/ld-linux-x86-64.so.2.
At a quick glance I can't tell if /lib64/ld-linux-x86-64.so.2 enforces
that.
> > There's no point in adding complexity to support some hypothetical. I
> > can remove the printk though.
>
> We are talking about fundamental things like format for supporting
> frame pointer-less stack trace capture. It will take years to adopt
> SFrame everywhere, so I think it's prudent to think a bit ahead beyond
> just saying "no real application should need more than 4GB text", IMO.
I don't think anybody is saying that...
--
Josh
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-11-01 19:09 ` Segher Boessenkool
@ 2024-11-01 19:33 ` Josh Poimboeuf
2024-11-01 19:35 ` Josh Poimboeuf
0 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-11-01 19:33 UTC (permalink / raw)
To: Segher Boessenkool
Cc: Andrii Nakryiko, x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Fri, Nov 01, 2024 at 02:09:08PM -0500, Segher Boessenkool wrote:
> On Thu, Oct 31, 2024 at 04:03:13PM -0700, Josh Poimboeuf wrote:
> > Actually I just double checked and even the kernel's ELF loader assumes
> > that each executable has only a single text start+end address pair.
>
> Huh? What makes you think that? There can be many executable PT_LOAD
> segments in each and every binary.
Right, but for executables (not shared libraries) the kernel seems to
assume they're contiguous? See the 'start_code' and 'end_code'
variables in load_elf_binary() load_elf_interp().
--
Josh
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-11-01 19:33 ` Josh Poimboeuf
@ 2024-11-01 19:35 ` Josh Poimboeuf
2024-11-01 19:48 ` Josh Poimboeuf
0 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-11-01 19:35 UTC (permalink / raw)
To: Segher Boessenkool
Cc: Andrii Nakryiko, x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Fri, Nov 01, 2024 at 12:33:07PM -0700, Josh Poimboeuf wrote:
> On Fri, Nov 01, 2024 at 02:09:08PM -0500, Segher Boessenkool wrote:
> > On Thu, Oct 31, 2024 at 04:03:13PM -0700, Josh Poimboeuf wrote:
> > > Actually I just double checked and even the kernel's ELF loader assumes
> > > that each executable has only a single text start+end address pair.
> >
> > Huh? What makes you think that? There can be many executable PT_LOAD
> > segments in each and every binary.
>
> Right, but for executables (not shared libraries) the kernel seems to
> assume they're contiguous? See the 'start_code' and 'end_code'
> variables in load_elf_binary() load_elf_interp().
Typo, see load_elf_binary (not load_elf_interp).
--
Josh
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-11-01 19:29 ` Josh Poimboeuf
@ 2024-11-01 19:44 ` Andrii Nakryiko
2024-11-01 19:46 ` Andrii Nakryiko
0 siblings, 1 reply; 119+ messages in thread
From: Andrii Nakryiko @ 2024-11-01 19:44 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski, bpf
On Fri, Nov 1, 2024 at 12:29 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Fri, Nov 01, 2024 at 11:34:48AM -0700, Andrii Nakryiko wrote:
> > 00200000-170ad000 r--p 00000000 07:01 5
> > 172ac000-498e7000 r-xp 16eac000 07:01 5
> > 49ae7000-49b8b000 r--p 494e7000 07:01 5
> > 49d8b000-4a228000 rw-p 4958b000 07:01 5
> > 4a228000-4c677000 rw-p 00000000 00:00 0
> > 4c800000-4ca00000 r-xp 49c00000 07:01 5
> > 4ca00000-4f600000 r-xp 49e00000 07:01 5
> > 4f600000-5b270000 r-xp 4ca00000 07:01 5
> >
> > Sorry, I'm probably dense and missing something. But from the example
> > process above, isn't this check violated already? Or it's two
> > different things? Not sure, honestly.
>
> It's hard to tell exactly what's going on, did you strip the file names?
Yes, I did, of course. But as I said, they all belong to the same main
binary of the process.
>
> The sframe limitation is per file, not per address space. I assume
> these are one file:
>
> > 172ac000-498e7000 r-xp 16eac000 07:01 5
>
> and these are another:
>
> > 4c800000-4ca00000 r-xp 49c00000 07:01 5
> > 4ca00000-4f600000 r-xp 49e00000 07:01 5
> > 4f600000-5b270000 r-xp 4ca00000 07:01 5
>
> Multiple mappings for a single file is fine, as long as they're
> contiguous.
No, all of what I posted above belongs to the same file (except
"4a228000-4c677000 rw-p 00000000 00:00 0", which doesn't have an
associated file, but I suspect it originally was part of this file; we
do some tricks with re-mmap()'ing stuff due to huge page usage).
>
> > > Actually I just double checked and even the kernel's ELF loader assumes
> > > that each executable has only a single text start+end address pair.
> >
> > See above, very confused by such assumptions, but I'm hoping we are
> > talking about two different things here.
>
> The "contiguous text" thing seems enforced by the kernel for
> executables. However it doesn't manage shared libraries, those are
> mapped by the loader, e.g. /lib64/ld-linux-x86-64.so.2.
>
> At a quick glance I can't tell if /lib64/ld-linux-x86-64.so.2 enforces
> that.
>
> > > There's no point in adding complexity to support some hypothetical. I
> > > can remove the printk though.
> >
> > We are talking about fundamental things like format for supporting
> > frame pointer-less stack trace capture. It will take years to adopt
> > SFrame everywhere, so I think it's prudent to think a bit ahead beyond
> > just saying "no real application should need more than 4GB text", IMO.
>
> I don't think anybody is saying that...
>
> --
> Josh
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-11-01 19:44 ` Andrii Nakryiko
@ 2024-11-01 19:46 ` Andrii Nakryiko
2024-11-01 19:51 ` Josh Poimboeuf
0 siblings, 1 reply; 119+ messages in thread
From: Andrii Nakryiko @ 2024-11-01 19:46 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski, bpf
On Fri, Nov 1, 2024 at 12:44 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Nov 1, 2024 at 12:29 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> >
> > On Fri, Nov 01, 2024 at 11:34:48AM -0700, Andrii Nakryiko wrote:
> > > 00200000-170ad000 r--p 00000000 07:01 5
> > > 172ac000-498e7000 r-xp 16eac000 07:01 5
> > > 49ae7000-49b8b000 r--p 494e7000 07:01 5
> > > 49d8b000-4a228000 rw-p 4958b000 07:01 5
> > > 4a228000-4c677000 rw-p 00000000 00:00 0
> > > 4c800000-4ca00000 r-xp 49c00000 07:01 5
> > > 4ca00000-4f600000 r-xp 49e00000 07:01 5
> > > 4f600000-5b270000 r-xp 4ca00000 07:01 5
> > >
I should have maybe posted this in this form:
00200000-170ad000 r--p 00000000 07:01 5 /packages/obfuscated_file
172ac000-498e7000 r-xp 16eac000 07:01 5 /packages/obfuscated_file
49ae7000-49b8b000 r--p 494e7000 07:01 5 /packages/obfuscated_file
49d8b000-4a228000 rw-p 4958b000 07:01 5 /packages/obfuscated_file
4a228000-4c677000 rw-p 00000000 00:00 0
4c800000-4ca00000 r-xp 49c00000 07:01 5 /packages/obfuscated_file
4ca00000-4f600000 r-xp 49e00000 07:01 5 /packages/obfuscated_file
4f600000-5b270000 r-xp 4ca00000 07:01 5 /packages/obfuscated_file
Those paths are pointing to the same binary.
> > > Sorry, I'm probably dense and missing something. But from the example
> > > process above, isn't this check violated already? Or it's two
> > > different things? Not sure, honestly.
> >
> > It's hard to tell exactly what's going on, did you strip the file names?
>
> Yes, I did, of course. But as I said, they all belong to the same main
> binary of the process.
>
> >
> > The sframe limitation is per file, not per address space. I assume
> > these are one file:
> >
> > > 172ac000-498e7000 r-xp 16eac000 07:01 5
> >
> > and these are another:
> >
> > > 4c800000-4ca00000 r-xp 49c00000 07:01 5
> > > 4ca00000-4f600000 r-xp 49e00000 07:01 5
> > > 4f600000-5b270000 r-xp 4ca00000 07:01 5
> >
> > Multiple mappings for a single file is fine, as long as they're
> > contiguous.
>
> No, all of what I posted above belongs to the same file (except
> "4a228000-4c677000 rw-p 00000000 00:00 0", which doesn't have an
> associated file, but I suspect it originally was part of this file; we
> do some tricks with re-mmap()'ing stuff due to huge page usage).
>
> >
> > > > Actually I just double checked and even the kernel's ELF loader assumes
> > > > that each executable has only a single text start+end address pair.
> > >
> > > See above, very confused by such assumptions, but I'm hoping we are
> > > talking about two different things here.
> >
> > The "contiguous text" thing seems enforced by the kernel for
> > executables. However it doesn't manage shared libraries, those are
> > mapped by the loader, e.g. /lib64/ld-linux-x86-64.so.2.
> >
> > At a quick glance I can't tell if /lib64/ld-linux-x86-64.so.2 enforces
> > that.
> >
> > > > There's no point in adding complexity to support some hypothetical. I
> > > > can remove the printk though.
> > >
> > > We are talking about fundamental things like format for supporting
> > > frame pointer-less stack trace capture. It will take years to adopt
> > > SFrame everywhere, so I think it's prudent to think a bit ahead beyond
> > > just saying "no real application should need more than 4GB text", IMO.
> >
> > I don't think anybody is saying that...
> >
> > --
> > Josh
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-11-01 19:35 ` Josh Poimboeuf
@ 2024-11-01 19:48 ` Josh Poimboeuf
2024-11-01 21:35 ` Segher Boessenkool
0 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-11-01 19:48 UTC (permalink / raw)
To: Segher Boessenkool
Cc: Andrii Nakryiko, x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Fri, Nov 01, 2024 at 12:35:13PM -0700, Josh Poimboeuf wrote:
> On Fri, Nov 01, 2024 at 12:33:07PM -0700, Josh Poimboeuf wrote:
> > On Fri, Nov 01, 2024 at 02:09:08PM -0500, Segher Boessenkool wrote:
> > > On Thu, Oct 31, 2024 at 04:03:13PM -0700, Josh Poimboeuf wrote:
> > > > Actually I just double checked and even the kernel's ELF loader assumes
> > > > that each executable has only a single text start+end address pair.
> > >
> > > Huh? What makes you think that? There can be many executable PT_LOAD
> > > segments in each and every binary.
> >
> > Right, but for executables (not shared libraries) the kernel seems to
> > assume they're contiguous? See the 'start_code' and 'end_code'
> > variables in load_elf_binary() load_elf_interp().
>
> Typo, see load_elf_binary (not load_elf_interp).
Hm, actually AFAICT that's only for reporting things in sysfs/proc. So
maybe it's assumed but not really "enforced".
--
Josh
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-11-01 19:46 ` Andrii Nakryiko
@ 2024-11-01 19:51 ` Josh Poimboeuf
0 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-11-01 19:51 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski, bpf
On Fri, Nov 01, 2024 at 12:46:09PM -0700, Andrii Nakryiko wrote:
> On Fri, Nov 1, 2024 at 12:44 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, Nov 1, 2024 at 12:29 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > >
> > > On Fri, Nov 01, 2024 at 11:34:48AM -0700, Andrii Nakryiko wrote:
> > > > 00200000-170ad000 r--p 00000000 07:01 5
> > > > 172ac000-498e7000 r-xp 16eac000 07:01 5
> > > > 49ae7000-49b8b000 r--p 494e7000 07:01 5
> > > > 49d8b000-4a228000 rw-p 4958b000 07:01 5
> > > > 4a228000-4c677000 rw-p 00000000 00:00 0
> > > > 4c800000-4ca00000 r-xp 49c00000 07:01 5
> > > > 4ca00000-4f600000 r-xp 49e00000 07:01 5
> > > > 4f600000-5b270000 r-xp 4ca00000 07:01 5
> > > >
>
> I should have maybe posted this in this form:
>
> 00200000-170ad000 r--p 00000000 07:01 5 /packages/obfuscated_file
> 172ac000-498e7000 r-xp 16eac000 07:01 5 /packages/obfuscated_file
> 49ae7000-49b8b000 r--p 494e7000 07:01 5 /packages/obfuscated_file
> 49d8b000-4a228000 rw-p 4958b000 07:01 5 /packages/obfuscated_file
> 4a228000-4c677000 rw-p 00000000 00:00 0
> 4c800000-4ca00000 r-xp 49c00000 07:01 5 /packages/obfuscated_file
> 4ca00000-4f600000 r-xp 49e00000 07:01 5 /packages/obfuscated_file
> 4f600000-5b270000 r-xp 4ca00000 07:01 5 /packages/obfuscated_file
>
> Those paths are pointing to the same binary.
Ok, thanks for sharing that. I'll add in support for noncontiguous
text.
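For what it's worth, one way that could look, reusing the per-mm maple
tree from the quoted patch (a sketch only, not the final code):

/* index every executable mapping of the file in the mm's sframe
 * maple tree instead of requiring one contiguous text range */
for_each_vma(vmi, vma) {
        if (vma->vm_file != sframe_vma->vm_file ||
            !(vma->vm_flags & VM_EXEC))
                continue;

        ret = mtree_insert_range(&mm->sframe_mt, vma->vm_start,
                                 vma->vm_end - 1, sec, GFP_KERNEL);
        if (ret)
                break;
}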
--
Josh
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-11-01 19:48 ` Josh Poimboeuf
@ 2024-11-01 21:35 ` Segher Boessenkool
0 siblings, 0 replies; 119+ messages in thread
From: Segher Boessenkool @ 2024-11-01 21:35 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Andrii Nakryiko, x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
Hi!
On Fri, Nov 01, 2024 at 12:48:00PM -0700, Josh Poimboeuf wrote:
> On Fri, Nov 01, 2024 at 12:35:13PM -0700, Josh Poimboeuf wrote:
> > On Fri, Nov 01, 2024 at 12:33:07PM -0700, Josh Poimboeuf wrote:
> > > On Fri, Nov 01, 2024 at 02:09:08PM -0500, Segher Boessenkool wrote:
> > > > On Thu, Oct 31, 2024 at 04:03:13PM -0700, Josh Poimboeuf wrote:
> > > > > Actually I just double checked and even the kernel's ELF loader assumes
> > > > > that each executable has only a single text start+end address pair.
> > > >
> > > > Huh? What makes you think that? There can be many executable PT_LOAD
> > > > segments in each and every binary.
> > >
> > > Right, but for executables (not shared libraries) the kernel seems to
> > > assume they're contiguous? See the 'start_code' and 'end_code'
> > > variables in load_elf_binary() load_elf_interp().
> >
> > Typo, see load_elf_binary (not load_elf_interp).
>
> Hm, actually AFAICT that's only for reporting things in sysfs/proc. So
> maybe it's assumed but not really "enforced".
Yes, this is copied to mm->start_code (etc.). This isn't used for
anything very important; it seems to be a leftover from when we
only had really simple binfmts? For a.out it did actually make sense,
for example :-)
Segher
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-11-01 18:38 ` Andrii Nakryiko
2024-11-01 18:47 ` Steven Rostedt
@ 2024-11-03 0:07 ` Indu Bhagat
1 sibling, 0 replies; 119+ messages in thread
From: Indu Bhagat @ 2024-11-03 0:07 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Josh Poimboeuf, x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski, bpf
On 11/1/24 11:38 AM, Andrii Nakryiko wrote:
> On Thu, Oct 31, 2024 at 2:38 PM Indu Bhagat <indu.bhagat@oracle.com> wrote:
>>
>> On 10/31/24 1:57 PM, Andrii Nakryiko wrote:
>>> On Tue, Oct 29, 2024 at 10:53 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>>>>
>>>> On Tue, Oct 29, 2024 at 04:32:40PM -0700, Andrii Nakryiko wrote:
>>>>> It feels like this patch is trying to do too much. There is both new
>>>>> UAPI introduction, and SFrame format definition, and unwinder
>>>>> integration, etc, etc. Do you think it can be split further into more
>>>>> focused smaller patches?
>>>>
>>>> True, let me see if I can split it up.
>>>>
>>>>>> +
>>>>>> + if ((eppnt->p_flags & PF_X) && k < start_code)
>>>>>> + start_code = k;
>>>>>> +
>>>>>> + if ((eppnt->p_flags & PF_X) && k + eppnt->p_filesz > end_code)
>>>>>> + end_code = k + eppnt->p_filesz;
>>>>>> + break;
>>>>>> + }
>>>>>> + case PT_GNU_SFRAME:
>>>>>> + sframe_phdr = eppnt;
>>>>>
>>>>> if I understand correctly, there has to be only one sframe, is that
>>>>> right? Should we validate that?
>>>>
>>>> Yes, there shouldn't be more than one PT_GNU_SFRAME for the executable
>>>> itself. I can validate that.
>>>>
>>>>>> + break;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> + if (sframe_phdr)
>>>>>> + sframe_add_section(load_addr + sframe_phdr->p_vaddr,
>>>>>> + start_code, end_code);
>>>>>> +
>>>>>
>>>>> no error checking?
>>>>
>>>> Good point. I remember discussing this with some people at Cauldron/LPC,
>>>> I just forgot to do it!
>>>>
>>>> Right now it does all the validation at unwind, which could really slow
>>>> things down unnecessarily if the sframe isn't valid.
>>>>
>>>>>> +#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
>>>>>> +
>>>>>> +#define INIT_MM_SFRAME .sframe_mt = MTREE_INIT(sframe_mt, 0),
>>>>>> +
>>>>>> +extern void sframe_free_mm(struct mm_struct *mm);
>>>>>> +
>>>>>> +/* text_start, text_end, file_name are optional */
>>>>>
>>>>> what file_name? was that an extra argument that got removed?
>>>>
>>>> Indeed, that was for some old code.
>>>>
>>>>>> case PR_RISCV_SET_ICACHE_FLUSH_CTX:
>>>>>> error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
>>>>>> break;
>>>>>> + case PR_ADD_SFRAME:
>>>>>> + if (arg5)
>>>>>> + return -EINVAL;
>>>>>> + error = sframe_add_section(arg2, arg3, arg4);
>>>>>
>>>>> wouldn't it be better to make this interface extendable from the get
>>>>> go? Instead of passing 3 arguments with fixed meaning, why not pass a
>>>>> pointer to an extendable binary struct like seems to be the trend
>>>>> nowadays with nicely extensible APIs. See [0] for one such example
>>>>> (specifically, struct procmap_query). Seems more prudent, as we'll
>>>>> most probably be adding flags, options, extra information, etc.)
>>>>>
>>>>> [0] https://lore.kernel.org/linux-mm/20240627170900.1672542-3-andrii@kernel.org/
>>>>
>>>> This ioctl interface was admittedly hacked together. I was hoping
>>>> somebody would suggest something better :-) I'll take a look.
>>>>
>>>>>> +static int find_fde(struct sframe_section *sec, unsigned long ip,
>>>>>> + struct sframe_fde *fde)
>>>>>> +{
>>>>>> + struct sframe_fde __user *first, *last, *found = NULL;
>>>>>> + u32 ip_off, func_off_low = 0, func_off_high = -1;
>>>>>> +
>>>>>> + ip_off = ip - sec->sframe_addr;
>>>>>
>>>>> what if ip_off is larger than 4GB? ELF section can be bigger than 4GB, right?
>>>>
>>>> That's baked into sframe v2.
>>>
>>> I believe we do have large production binaries with more than 4GB of
>>> text, what are we going to do about them? It would be interesting to
>>> hear sframe people's opinion. Adding such a far-reaching new format in
>>> 2024 with these limitations is kind of sad. At the very least maybe we
>>> should allow some form of chaining sframe definitions to cover more
>>> than 4GB segments? Please CC relevant folks, I'm wondering what
>>> they're thinking about this.
>>>
>>
>> SFrame V2 does have that limitation. We can try to have 64-bit
>> representation for the 'ip' in the SFrame FDE and conditionalize it
>> somehow (say, with a flag in the header) so as to not bloat the majority
>> of applications.
>
> Hi Indu,
>
> I think that's prudent if we believe that SFrame is the solution here.
> See my reply to Josh. Real-world binaries already approach the 4GB limit, and
> things are not going to shrink in the years to come. So yeah, probably
> we need some adjustments to the format to at least allow 64-bit
> offsets (though trying to stick to 32-bit as much as possible, of
> course, if they work).
>
> I'm not really familiar with the nuances of the format just yet, so
> can't really provide anything more useful at this point. What would be
> the gold-standard reference for the SFrame format to familiarize myself
> thoroughly?
>
There are some links on the SFrame wiki that can be helpful:
https://sourceware.org/binutils/wiki/sframe
> BTW, I wanted to ask. Are there any plans to add SFrame support to
> Clang as well? It feels like without that there is no future for
> SFrame as a general-purpose solution for stack traces.
>
>>
>>>>
>>>>> and also, does it mean that SFrame doesn't support executables with
>>>>> text bigger than 4GB?
>>>>
>>>> Yes, but is that a realistic concern?
>>>
>>> See above, yes. You'd be surprised. As somewhat corroborating
>>> evidence, there were tons of problems and churn (within at least Meta)
>>> with DWARF not supporting more than 2GB sizes, so yes, this is not an
>>> abstract problem for sure. Modern production applications can be
>>> ridiculously big.
>>>
>>>>
>>>>>> + } else {
>>>>>> + struct vm_area_struct *vma, *text_vma = NULL;
>>>>>> + VMA_ITERATOR(vmi, mm, 0);
>>>>>> +
>>>>>> + for_each_vma(vmi, vma) {
>>>>>> + if (vma->vm_file != sframe_vma->vm_file ||
>>>>>> + !(vma->vm_flags & VM_EXEC))
>>>>>> + continue;
>>>>>> +
>>>>>> + if (text_vma) {
>>>>>> + pr_warn_once("%s[%d]: multiple EXEC segments unsupported\n",
>>>>>> + current->comm, current->pid);
>>>>>
>>>>> is this just something that fundamentally can't be supported by SFrame
>>>>> format? Or just an implementation simplification?
>>>>
>>>> It's a simplification I suppose.
>>>
>>> That's a rather random limitation, IMO... How hard would it be to not
>>> make that assumption?
>>>
>>>>
>>>>> It's not illegal to have an executable with multiple VM_EXEC segments,
>>>>> no? Should this be a pr_warn_once() then?
>>>>
>>>> I don't know, is it allowed? I've never seen it in practice. The
>>>
>>> I'm pretty sure you can do that with a custom linker script, at the
>>> very least. Normally this probably won't happen, but I don't think
>>> Linux dictates how many executable VMAs an application can have. And
>>> it probably just naturally happens for JIT-ted applications (Java, Go,
>>> etc).
>>>
>>> The Linux kernel itself has two executable segments, for instance
>>> (though the kernel is special, of course, but still).
>>>
>>>> pr_warn_once() is not reporting that it's illegal but rather that this
>>>> corner case actually exists and maybe needs to be looked at.
>>>
>>> This warn() will be logged across millions of machines in the fleet,
>>> triggering alarms, prompting people to look into it, and leading to
>>> custom internal patches that disable the known-to-happen warn. Why do
>>> we need all this? This is an issue that is trivial for a user process
>>> to trigger without doing anything illegal. Why?
>>>
>>>>
>>>> --
>>>> Josh
>>
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-10-28 21:47 ` [PATCH v3 09/19] unwind: Introduce sframe user space unwinding Josh Poimboeuf
` (2 preceding siblings ...)
2024-10-29 23:32 ` Andrii Nakryiko
@ 2024-11-05 17:40 ` Steven Rostedt
2024-11-05 17:45 ` Steven Rostedt
2024-11-06 17:04 ` Jens Remus
` (3 subsequent siblings)
7 siblings, 1 reply; 119+ messages in thread
From: Steven Rostedt @ 2024-11-05 17:40 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Mon, 28 Oct 2024 14:47:56 -0700
Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 06dc4a57ba78..434c548f0837 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -47,6 +47,7 @@
> #include <linux/dax.h>
> #include <linux/uaccess.h>
> #include <linux/rseq.h>
> +#include <linux/sframe.h>
> #include <asm/param.h>
> #include <asm/page.h>
>
> @@ -633,11 +634,13 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
> unsigned long no_base, struct elf_phdr *interp_elf_phdata,
> struct arch_elf_state *arch_state)
> {
> - struct elf_phdr *eppnt;
> + struct elf_phdr *eppnt, *sframe_phdr = NULL;
> unsigned long load_addr = 0;
> int load_addr_set = 0;
> unsigned long error = ~0UL;
> unsigned long total_size;
> + unsigned long start_code = ~0UL;
> + unsigned long end_code = 0;
> int i;
>
> /* First of all, some simple consistency checks */
> @@ -659,7 +662,8 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
>
> eppnt = interp_elf_phdata;
> for (i = 0; i < interp_elf_ex->e_phnum; i++, eppnt++) {
> - if (eppnt->p_type == PT_LOAD) {
> + switch (eppnt->p_type) {
> + case PT_LOAD: {
> int elf_type = MAP_PRIVATE;
> int elf_prot = make_prot(eppnt->p_flags, arch_state,
> true, true);
> @@ -688,7 +692,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
> /*
> * Check to see if the section's size will overflow the
> * allowed task size. Note that p_filesz must always be
> - * <= p_memsize so it's only necessary to check p_memsz.
> + * <= p_memsz so it's only necessary to check p_memsz.
> */
> k = load_addr + eppnt->p_vaddr;
> if (BAD_ADDR(k) ||
> @@ -698,9 +702,24 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
> error = -ENOMEM;
> goto out;
> }
> +
> + if ((eppnt->p_flags & PF_X) && k < start_code)
> + start_code = k;
> +
> + if ((eppnt->p_flags & PF_X) && k + eppnt->p_filesz > end_code)
> + end_code = k + eppnt->p_filesz;
> + break;
> + }
> + case PT_GNU_SFRAME:
> + sframe_phdr = eppnt;
> + break;
> }
> }
>
> + if (sframe_phdr)
> + sframe_add_section(load_addr + sframe_phdr->p_vaddr,
> + start_code, end_code);
> +
> error = load_addr;
> out:
> return error;
> @@ -823,7 +842,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
> int first_pt_load = 1;
> unsigned long error;
> struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
> - struct elf_phdr *elf_property_phdata = NULL;
> + struct elf_phdr *elf_property_phdata = NULL, *sframe_phdr = NULL;
> unsigned long elf_brk;
> int retval, i;
> unsigned long elf_entry;
> @@ -931,6 +950,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
> executable_stack = EXSTACK_DISABLE_X;
> break;
>
> + case PT_GNU_SFRAME:
> + sframe_phdr = elf_ppnt;
You need to save the p_vaddr here and not the pointer.
> + break;
> +
> case PT_LOPROC ... PT_HIPROC:
> retval = arch_elf_pt_proc(elf_ex, elf_ppnt,
> bprm->file, false,
> @@ -1321,6 +1344,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
> task_pid_nr(current), retval);
> }
Before this code we have:
kfree(elf_phdata);
And I added:
if (sframe_phdr)
trace_printk("after sframe vaddr=%x\n", sframe_phdr->p_vaddr);
kfree(elf_phdata);
if (sframe_phdr)
trace_printk("after sframe vaddr=%x\n", sframe_phdr->p_vaddr);
Which produced:
scan-fs-940 [007] ..... 16.091081: bprint: load_elf_binary: after sframe vaddr=2298
scan-fs-940 [007] ..... 16.091083: bprint: load_elf_binary: after sframe vaddr=0
I was wondering why it wasn't working.
-- Steve
>
> + if (sframe_phdr)
> + sframe_add_section(load_bias + sframe_phdr->p_vaddr,
> + start_code, end_code);
> +
> regs = current_pt_regs();
> #ifdef ELF_PLAT_INIT
> /*
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 381d22eba088..6e7561c1a5fc 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1052,6 +1052,9 @@ struct mm_struct {
> #endif
> } lru_gen;
> #endif /* CONFIG_LRU_GEN_WALKS_MMU */
> +#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
> + struct maple_tree sframe_mt;
> +#endif
> } __randomize_layout;
>
> /*
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-11-05 17:40 ` Steven Rostedt
@ 2024-11-05 17:45 ` Steven Rostedt
0 siblings, 0 replies; 119+ messages in thread
From: Steven Rostedt @ 2024-11-05 17:45 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Tue, 5 Nov 2024 12:40:53 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 28 Oct 2024 14:47:56 -0700
> Josh Poimboeuf <jpoimboe@kernel.org> wrote:
<linux-trace-kernel@vger.kerne.org>: Host or domain name not found. Name
service error for name=vger.kerne.org type=AAAA: Host not found
Hmm, no wonder this didn't show up in patchwork :-/
Please fix in your next version.
Thanks,
-- Steve
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 15/19] perf: Add deferred user callchains
2024-10-28 21:47 ` [PATCH v3 15/19] perf: Add deferred user callchains Josh Poimboeuf
2024-10-28 21:48 ` Josh Poimboeuf
2024-10-29 14:06 ` Peter Zijlstra
@ 2024-11-06 9:45 ` Jens Remus
2 siblings, 0 replies; 119+ messages in thread
From: Jens Remus @ 2024-11-06 9:45 UTC (permalink / raw)
To: Josh Poimboeuf, x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Heiko Carstens, Vasily Gorbik
On 28.10.2024 22:48, Josh Poimboeuf wrote:
...
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index ebf143aa427b..bf97b2fa8a9c 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
...
> @@ -6955,6 +6958,53 @@ static void perf_pending_irq(struct irq_work *entry)
> perf_swevent_put_recursion_context(rctx);
> }
>
> +static void perf_pending_unwind_irq(struct irq_work *entry)
> +{
> + struct perf_event *event = container_of(entry, struct perf_event, pending_unwind_irq);
> +
> + if (event->pending_unwind) {
> + unwind_user_deferred(&perf_unwind_callback_cb, NULL, event);
> + event->pending_unwind = 0;
> + }
> +}
> +
> +struct perf_callchain_deferred_event {
> + struct perf_event_header header;
> + u64 ctx_cookie;
This introduces ctx_cookie in the struct used to produce deferred events,
but omits it from the struct used to consume them in the perf tool. That
causes the ctx_cookie value to erroneously get interpreted as nr (the
number of IPs) in perf:
Core was generated by `perf record -F 99 --call-graph fp /opt/binutils-sframe2/bin/objdump --sframe /opt/binutils-sframe2/bin/objdump'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __GI_memcpy () at ../sysdeps/s390/memcpy-z900.S:209
209 mvc 0(256,%r1),0(%r3)
[Current thread is 1 (Thread 0x3ff8bb5fe80 (LWP 16554))]
(gdb) bt
#0 __GI_memcpy () at ../sysdeps/s390/memcpy-z900.S:209
#1 0x00000000012ad0ca in sample__merge_deferred_callchain (sample_orig=0x3ffd2ff53c8, sample_callchain=0x3ffd2ff5b18) at util/callchain.c:1853
...
(gdb) p/x sample_callchain->callchain->nr
$2 = 0x489cb
With debug output from perf_event_callchain_deferred() (see below):
DEBUG: perf_event_callchain_deferred: ctx_cookie=0x00000000000489cb, nr=2
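For reference, the consumer side would presumably need a matching layout;
a sketch mirroring the producer struct above (the struct name and field
layout here are assumed, following the perf tool's event conventions):
	struct perf_record_callchain_deferred {
		struct perf_event_header header;
		__u64			 ctx_cookie;	/* currently missing */
		__u64			 nr;
		__u64			 ips[];
	};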
> + u64 nr;
> + u64 ips[];
> +};
> +
> +static void perf_event_callchain_deferred(struct unwind_stacktrace *trace,
> + u64 ctx_cookie, void *_data)
> +{
> + struct perf_callchain_deferred_event deferred_event;
> + u64 callchain_context = PERF_CONTEXT_USER;
> + struct perf_output_handle handle;
> + struct perf_event *event = _data;
> + struct perf_sample_data data;
> + u64 nr = trace->nr + 1 /* callchain_context */;
> +
> + deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
> + deferred_event.header.misc = PERF_RECORD_MISC_USER;
> + deferred_event.header.size = sizeof(deferred_event) + (nr * sizeof(u64));
> +
> + deferred_event.ctx_cookie = ctx_cookie;
> + deferred_event.nr = nr;
> +
> + perf_event_header__init_id(&deferred_event.header, &data, event);
pr_info_ratelimited("DEBUG: perf_event_callchain_deferred: ctx_cookie=0x%016llx, nr=%llu\n",
deferred_event.ctx_cookie, deferred_event.nr);
> +
> + if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
> + return;
> +
> + perf_output_put(&handle, deferred_event);
> + perf_output_put(&handle, callchain_context);
> + perf_output_copy(&handle, trace->entries, trace->nr * sizeof(u64));
> + perf_event__output_id_sample(event, &handle, &data);
> +
> + perf_output_end(&handle);
> +}
> +
> static void perf_pending_task(struct callback_head *head)
> {
> struct perf_event *event = container_of(head, struct perf_event, pending_task);
...
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303) and z/VSE Support
+49-7031-16-1128 Office
jremus@de.ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-10-28 21:47 ` [PATCH v3 09/19] unwind: Introduce sframe user space unwinding Josh Poimboeuf
` (3 preceding siblings ...)
2024-11-05 17:40 ` Steven Rostedt
@ 2024-11-06 17:04 ` Jens Remus
2024-11-07 8:25 ` Weinan Liu
` (2 subsequent siblings)
7 siblings, 0 replies; 119+ messages in thread
From: Jens Remus @ 2024-11-06 17:04 UTC (permalink / raw)
To: Josh Poimboeuf, x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Heiko Carstens, Vasily Gorbik
On 28.10.2024 22:47, Josh Poimboeuf wrote:
...
> diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
...
> +#define __SFRAME_GET_USER(out, user_ptr, type) \
> +({ \
> + type __tmp; \
> + if (get_user(__tmp, (type __user *)user_ptr)) \
> + return -EFAULT; \
> + user_ptr += sizeof(__tmp); \
> + out = __tmp; \
> +})
> +
> +#define SFRAME_GET_USER(out, user_ptr, size) \
> +({ \
> + switch (size) { \
> + case 1: \
> + __SFRAME_GET_USER(out, user_ptr, u8); \
> + break; \
> + case 2: \
> + __SFRAME_GET_USER(out, user_ptr, u16); \
> + break; \
> + case 4: \
> + __SFRAME_GET_USER(out, user_ptr, u32); \
> + break; \
> + default: \
> + return -EINVAL; \
> + } \
> +})
> +
> +static unsigned char fre_type_to_size(unsigned char fre_type)
> +{
> + if (fre_type > 2)
> + return 0;
> + return 1 << fre_type;
> +}
> +
> +static unsigned char offset_size_enum_to_size(unsigned char off_size)
> +{
> + if (off_size > 2)
> + return 0;
> + return 1 << off_size;
> +}
...
> +static int find_fre(struct sframe_section *sec, struct sframe_fde *fde,
> + unsigned long ip, struct unwind_user_frame *frame)
> +{
> + unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
> + unsigned char fre_type = SFRAME_FUNC_FRE_TYPE(fde->info);
> + unsigned char offset_count, offset_size;
> + s32 cfa_off, ra_off, fp_off, ip_off;
> + void __user *f, *last_f = NULL;
> + unsigned char addr_size;
> + u32 last_fre_ip_off = 0;
> + u8 fre_info = 0;
> + int i;
> +
> + addr_size = fre_type_to_size(fre_type);
> + if (!addr_size)
> + return -EINVAL;
> +
> + ip_off = ip - (sec->sframe_addr + fde->start_addr);
> +
> + f = (void __user *)sec->fres_addr + fde->fres_off;
> +
> + for (i = 0; i < fde->fres_num; i++) {
> + u32 fre_ip_off;
> +
> + SFRAME_GET_USER(fre_ip_off, f, addr_size);
> +
> + if (fre_ip_off < last_fre_ip_off)
> + return -EINVAL;
> +
> + last_fre_ip_off = fre_ip_off;
> +
> + if (fde_type == SFRAME_FDE_TYPE_PCINC) {
> + if (ip_off < fre_ip_off)
> + break;
> + } else {
> + /* SFRAME_FDE_TYPE_PCMASK */
> + if (ip_off % fde->rep_size < fre_ip_off)
> + break;
> + }
> +
> + SFRAME_GET_USER(fre_info, f, 1);
> +
> + offset_count = SFRAME_FRE_OFFSET_COUNT(fre_info);
> + offset_size = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(fre_info));
> +
> + if (!offset_count || !offset_size)
> + return -EINVAL;
> +
> + last_f = f;
> + f += offset_count * offset_size;
> + }
> +
> + if (!last_f)
> + return -EINVAL;
> +
> + f = last_f;
> +
> + SFRAME_GET_USER(cfa_off, f, offset_size);
SFRAME_GET_USER() does not work for the signed SFrame CFA offset.
> + offset_count--;
> +
> + ra_off = sec->ra_off;
> + if (!ra_off) {
> + if (!offset_count--)
> + return -EINVAL;
> +
> + SFRAME_GET_USER(ra_off, f, offset_size);
Likewise for the signed SFrame RA offset.
Excerpt from my added trace. Note the RA_off=65488 (unsigned) = 0xFFD0 =
-48 (signed):
unwind_user_next: WARNING: RA could not be obtained from user space
(IP=0x000003ffbb5f4218, CFA=0x000003ffc22f8f10, RA_off=65488)
Excerpt from perf script:
3ffbb5f4218 internal_fnwmatch+0x558 (/usr/lib64/libc.so.6)
Excerpts from objdump -wt --sframe:
00000000000f3cc0 l F .text 000000000000195c
internal_fnwmatch
func idx [1715]: pc = 0xf3cc0, size = 6492 bytes
STARTPC CFA FP RA INFO
00000000000f3cc0 sp+160 u u (1*1B)
00000000000f3cc6 sp+160 c-72 c-48 (3*1B)
00000000000f3cd0 sp+4256 c-72 c-48 (3*2B)
00000000000f3cdc sp+8352 c-72 c-48 (3*2B)
00000000000f3ce8 sp+10792 c-72 c-48 (3*2B)
00000000000f3f7e sp+160 u u (1*1B)
00000000000f3f80 sp+10792 c-72 c-48 (3*2B)
> + }
> +
> + fp_off = sec->fp_off;
> + if (!fp_off && offset_count) {
> + offset_count--;
> + SFRAME_GET_USER(fp_off, f, offset_size);
Likewise for the signed SFrame FP offset.
> + }
> +
> + if (offset_count)
> + return -EINVAL;
> +
> + frame->cfa_off = cfa_off;
> + frame->ra_off = ra_off;
> + frame->fp_off = fp_off;
> + frame->use_fp = SFRAME_FRE_CFA_BASE_REG_ID(fre_info) == SFRAME_BASE_REG_FP;
> +
> + return 0;
> +}
...
I have verified that reintroducing and using SFRAME_GET_USER_SIGNED()
works correctly.
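For reference, a sketch of the signed variant I used, modeled on the
unsigned macros above; reading into a signed type makes the final
assignment sign-extend the 1-, 2- and 4-byte offsets:
	#define __SFRAME_GET_USER_SIGNED(out, user_ptr, type)		\
	({								\
		type __tmp;						\
		if (get_user(__tmp, (type __user *)user_ptr))		\
			return -EFAULT;					\
		user_ptr += sizeof(__tmp);				\
		out = __tmp;						\
	})
	#define SFRAME_GET_USER_SIGNED(out, user_ptr, size)		\
	({								\
		switch (size) {						\
		case 1:							\
			__SFRAME_GET_USER_SIGNED(out, user_ptr, s8);	\
			break;						\
		case 2:							\
			__SFRAME_GET_USER_SIGNED(out, user_ptr, s16);	\
			break;						\
		case 4:							\
			__SFRAME_GET_USER_SIGNED(out, user_ptr, s32);	\
			break;						\
		default:						\
			return -EINVAL;					\
		}							\
	})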
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303) and z/VSE Support
+49-7031-16-1128 Office
jremus@de.ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des
Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der
Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-10-28 21:47 ` [PATCH v3 09/19] unwind: Introduce sframe user space unwinding Josh Poimboeuf
` (4 preceding siblings ...)
2024-11-06 17:04 ` Jens Remus
@ 2024-11-07 8:25 ` Weinan Liu
2024-11-07 16:59 ` Jens Remus
2024-11-13 15:56 ` Jens Remus
7 siblings, 0 replies; 119+ messages in thread
From: Weinan Liu @ 2024-11-07 8:25 UTC (permalink / raw)
To: jpoimboe
Cc: acme, adrian.hunter, alexander.shishkin, andrii.nakryiko, broonie,
fweimer, indu.bhagat, irogers, jolsa, jordalgo, jremus,
linux-kernel, linux-perf-users, linux-toolchains,
linux-trace-kernel, luto, mark.rutland, mathieu.desnoyers, mingo,
namhyung, peterz, rostedt, sam, x86
> diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
...
> +static int find_fre(struct sframe_section *sec, struct sframe_fde *fde,
> + unsigned long ip, struct unwind_user_frame *frame)
> +{
> + unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
> + unsigned char fre_type = SFRAME_FUNC_FRE_TYPE(fde->info);
> + unsigned char offset_count, offset_size;
> + s32 cfa_off, ra_off, fp_off, ip_off;
> + void __user *f, *last_f = NULL;
> + unsigned char addr_size;
> + u32 last_fre_ip_off = 0;
> + u8 fre_info = 0;
> + int i;
> +
> + addr_size = fre_type_to_size(fre_type);
> + if (!addr_size)
> + return -EINVAL;
> +
> + ip_off = ip - (sec->sframe_addr + fde->start_addr);
nit: Since we already know whether the ip_off should be masked or not, I think we don't have to check fde_type and mask the ip_off every time:
ip_off = (fde_type == SFRAME_FDE_TYPE_PCINC) ? ip_off : ip_off % fde->rep_size;
> +
> + f = (void __user *)sec->fres_addr + fde->fres_off;
> +
> + for (i = 0; i < fde->fres_num; i++) {
> + u32 fre_ip_off;
> +
> + SFRAME_GET_USER(fre_ip_off, f, addr_size);
> +
> + if (fre_ip_off < last_fre_ip_off)
> + return -EINVAL;
> +
> + last_fre_ip_off = fre_ip_off;
> +
> + if (fde_type == SFRAME_FDE_TYPE_PCINC) {
> + if (ip_off < fre_ip_off)
> + break;
> + } else {
> + /* SFRAME_FDE_TYPE_PCMASK */
> + if (ip_off % fde->rep_size < fre_ip_off)
> + break;
> + }
> +
> + SFRAME_GET_USER(fre_info, f, 1);
> +
> + offset_count = SFRAME_FRE_OFFSET_COUNT(fre_info);
> + offset_size = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(fre_info));
> +
> + if (!offset_count || !offset_size)
> + return -EINVAL;
> +
> + last_f = f;
> + f += offset_count * offset_size;
> + }
> +
> + if (!last_f)
> + return -EINVAL;
> +
> + f = last_f;
> +
> + SFRAME_GET_USER(cfa_off, f, offset_size);
> + offset_count--;
> +
> + ra_off = sec->ra_off;
> + if (!ra_off) {
> + if (!offset_count--)
> + return -EINVAL;
> +
> + SFRAME_GET_USER(ra_off, f, offset_size);
> + }
> +
> + fp_off = sec->fp_off;
> + if (!fp_off && offset_count) {
> + offset_count--;
> + SFRAME_GET_USER(fp_off, f, offset_size);
> + }
> +
> + if (offset_count)
> + return -EINVAL;
> +
> + frame->cfa_off = cfa_off;
> + frame->ra_off = ra_off;
> + frame->fp_off = fp_off;
> + frame->use_fp = SFRAME_FRE_CFA_BASE_REG_ID(fre_info) == SFRAME_BASE_REG_FP;
> +
> + return 0;
> +}
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-10-28 21:47 ` [PATCH v3 09/19] unwind: Introduce sframe user space unwinding Josh Poimboeuf
` (5 preceding siblings ...)
2024-11-07 8:25 ` Weinan Liu
@ 2024-11-07 16:59 ` Jens Remus
2024-11-13 20:50 ` Steven Rostedt
2024-11-13 15:56 ` Jens Remus
7 siblings, 1 reply; 119+ messages in thread
From: Jens Remus @ 2024-11-07 16:59 UTC (permalink / raw)
To: Josh Poimboeuf, x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Heiko Carstens, Vasily Gorbik
On 28.10.2024 22:47, Josh Poimboeuf wrote:
...
> diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
...
> +static int find_fde(struct sframe_section *sec, unsigned long ip,
> + struct sframe_fde *fde)
> +{
> + struct sframe_fde __user *first, *last, *found = NULL;
> + u32 ip_off, func_off_low = 0, func_off_high = -1;
> +
> + ip_off = ip - sec->sframe_addr;
> +
> + first = (void __user *)sec->fdes_addr;
> + last = first + sec->fdes_nr;
Could it be that this needs to be:
last = first + sec->fdes_nr - 1;
> + while (first <= last) {
> + struct sframe_fde __user *mid;
> + u32 func_off;
> +
> + mid = first + ((last - first) / 2);
> +
> + if (get_user(func_off, (s32 __user *)mid))
> + return -EFAULT;
> +
> + if (ip_off >= func_off) {
> + /* validate sort order */
> + if (func_off < func_off_low)
> + return -EINVAL;
Otherwise I run into this when the IP is within the function whose FDE
is the last one in the .sframe section:
find_fde: IP=0x000000000110fbcc: ERROR: func_off < func_off_low
(func_off=196608, func_off_low=4294224904)
110fbcc dump_sframe+0x2ec (/opt/binutils-sframe2/bin/objdump)
func idx [2275]: pc = 0x110f8e0, size = 3310 bytes <dump_sframe>
STARTPC CFA FP RA INFO
000000000110f8e0 sp+160 u u (1*1B)
000000000110f8e6 sp+160 c-72 c-48 (3*1B)
000000000110f8f6 sp+632 c-72 c-48 (3*1B)
000000000110fa82 sp+160 u u (1*1B)
000000000110fa88 sp+632 c-72 c-48 (3*1B)
0000000001110486 sp+160 u u (1*1B)
000000000111048c sp+632 c-72 c-48 (3*1B)
0000000001110574 sp+160 u u (1*1B)
000000000111057a sp+632 c-72 c-48 (3*1B)
> +
> + func_off_low = func_off;
> +
> + found = mid;
> + first = mid + 1;
> + } else {
> + /* validate sort order */
> + if (func_off > func_off_high)
> + return -EINVAL;
> +
> + func_off_high = func_off;
> +
> + last = mid - 1;
> + }
> + }
> +
> + if (!found)
> + return -EINVAL;
> +
> + if (copy_from_user(fde, found, sizeof(*fde)))
> + return -EFAULT;
> +
> + /* check for gaps */
> + if (ip_off < fde->start_addr || ip_off >= fde->start_addr + fde->size)
> + return -EINVAL;
> +
> + return 0;
> +}
Thanks and regards,
Jens
--
Jens Remus
Linux on Z Development (D3303) and z/VSE Support
+49-7031-16-1128 Office
jremus@de.ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des
Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der
Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-10-28 21:47 ` [PATCH v3 09/19] unwind: Introduce sframe user space unwinding Josh Poimboeuf
` (6 preceding siblings ...)
2024-11-07 16:59 ` Jens Remus
@ 2024-11-13 15:56 ` Jens Remus
2024-11-13 20:50 ` Steven Rostedt
7 siblings, 1 reply; 119+ messages in thread
From: Jens Remus @ 2024-11-13 15:56 UTC (permalink / raw)
To: Josh Poimboeuf, x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On 28.10.2024 22:47, Josh Poimboeuf wrote:
> diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
> @@ -68,7 +83,12 @@ int unwind_user_start(struct unwind_user_state *state)
> return -EINVAL;
> }
>
> - state->type = UNWIND_USER_TYPE_FP;
> + if (current_has_sframe())
> + state->type = UNWIND_USER_TYPE_SFRAME;
> + else if (IS_ENABLED(CONFIG_UNWIND_USER_FP))
The test must be for CONFIG_HAVE_UNWIND_USER_FP. :-)
> + state->type = UNWIND_USER_TYPE_FP;
> + else
> + state->type = UNWIND_USER_TYPE_NONE;
>
> state->sp = user_stack_pointer(regs);
> state->ip = instruction_pointer(regs);
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303) and z/VSE Support
+49-7031-16-1128 Office
jremus@de.ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des
Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der
Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-11-07 16:59 ` Jens Remus
@ 2024-11-13 20:50 ` Steven Rostedt
2024-11-13 21:15 ` Josh Poimboeuf
0 siblings, 1 reply; 119+ messages in thread
From: Steven Rostedt @ 2024-11-13 20:50 UTC (permalink / raw)
To: Jens Remus
Cc: Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Heiko Carstens, Vasily Gorbik
On Thu, 7 Nov 2024 17:59:08 +0100
Jens Remus <jremus@linux.ibm.com> wrote:
> On 28.10.2024 22:47, Josh Poimboeuf wrote:
> ...
> > diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
> ...
> > +static int find_fde(struct sframe_section *sec, unsigned long ip,
> > + struct sframe_fde *fde)
> > +{
> > + struct sframe_fde __user *first, *last, *found = NULL;
> > + u32 ip_off, func_off_low = 0, func_off_high = -1;
> > +
> > + ip_off = ip - sec->sframe_addr;
> > +
> > + first = (void __user *)sec->fdes_addr;
> > + last = first + sec->fdes_nr;
>
> Could it be that this needs to be:
>
> last = first + sec->fdes_nr - 1;
Yep, I discovered the same issue.
-- Steve
>
> > + while (first <= last) {
> > + struct sframe_fde __user *mid;
> > + u32 func_off;
> > +
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-11-13 15:56 ` Jens Remus
@ 2024-11-13 20:50 ` Steven Rostedt
2024-11-13 21:13 ` Josh Poimboeuf
0 siblings, 1 reply; 119+ messages in thread
From: Steven Rostedt @ 2024-11-13 20:50 UTC (permalink / raw)
To: Jens Remus
Cc: Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Wed, 13 Nov 2024 16:56:25 +0100
Jens Remus <jremus@linux.ibm.com> wrote:
> On 28.10.2024 22:47, Josh Poimboeuf wrote:
>
> > diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
>
> > @@ -68,7 +83,12 @@ int unwind_user_start(struct unwind_user_state *state)
> > return -EINVAL;
> > }
> >
> > - state->type = UNWIND_USER_TYPE_FP;
> > + if (current_has_sframe())
> > + state->type = UNWIND_USER_TYPE_SFRAME;
> > + else if (IS_ENABLED(CONFIG_UNWIND_USER_FP))
>
> The test must be for CONFIG_HAVE_UNWIND_USER_FP. :-)
Yep, that too.
Thanks,
-- Steve
>
> > + state->type = UNWIND_USER_TYPE_FP;
> > + else
> > + state->type = UNWIND_USER_TYPE_NONE;
> >
> > state->sp = user_stack_pointer(regs);
> > state->ip = instruction_pointer(regs);
>
> Regards,
> Jens
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-11-13 20:50 ` Steven Rostedt
@ 2024-11-13 21:13 ` Josh Poimboeuf
0 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-11-13 21:13 UTC (permalink / raw)
To: Steven Rostedt
Cc: Jens Remus, x86, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
On Wed, Nov 13, 2024 at 03:50:58PM -0500, Steven Rostedt wrote:
> On Wed, 13 Nov 2024 16:56:25 +0100
> Jens Remus <jremus@linux.ibm.com> wrote:
>
> > On 28.10.2024 22:47, Josh Poimboeuf wrote:
> >
> > > diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
> >
> > > @@ -68,7 +83,12 @@ int unwind_user_start(struct unwind_user_state *state)
> > > return -EINVAL;
> > > }
> > >
> > > - state->type = UNWIND_USER_TYPE_FP;
> > > + if (current_has_sframe())
> > > + state->type = UNWIND_USER_TYPE_SFRAME;
> > > + else if (IS_ENABLED(CONFIG_UNWIND_USER_FP))
> >
> > The test must be for CONFIG_HAVE_UNWIND_USER_FP. :-)
>
> Yep, that too.
I also found this one, so that makes three of us!
It's too bad IS_ENABLED() doesn't catch typos.
--
Josh
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-11-13 20:50 ` Steven Rostedt
@ 2024-11-13 21:15 ` Josh Poimboeuf
2024-11-13 22:13 ` Steven Rostedt
0 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-11-13 21:15 UTC (permalink / raw)
To: Steven Rostedt
Cc: Jens Remus, x86, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Heiko Carstens, Vasily Gorbik
On Wed, Nov 13, 2024 at 03:50:40PM -0500, Steven Rostedt wrote:
> On Thu, 7 Nov 2024 17:59:08 +0100
> Jens Remus <jremus@linux.ibm.com> wrote:
>
> > On 28.10.2024 22:47, Josh Poimboeuf wrote:
> > ...
> > > diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
> > ...
> > > +static int find_fde(struct sframe_section *sec, unsigned long ip,
> > > + struct sframe_fde *fde)
> > > +{
> > > + struct sframe_fde __user *first, *last, *found = NULL;
> > > + u32 ip_off, func_off_low = 0, func_off_high = -1;
> > > +
> > > + ip_off = ip - sec->sframe_addr;
> > > +
> > > + first = (void __user *)sec->fdes_addr;
> > > + last = first + sec->fdes_nr;
> >
> > Could it be that this needs to be:
> >
> > last = first + sec->fdes_nr - 1;
>
> Yep, I discovered the same issue.
Indeed, thanks.
--
Josh
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-11-13 21:15 ` Josh Poimboeuf
@ 2024-11-13 22:13 ` Steven Rostedt
2024-11-13 22:21 ` Steven Rostedt
2024-11-14 9:57 ` Jens Remus
0 siblings, 2 replies; 119+ messages in thread
From: Steven Rostedt @ 2024-11-13 22:13 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Jens Remus, x86, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Heiko Carstens, Vasily Gorbik
On Wed, 13 Nov 2024 13:15:35 -0800
Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> On Wed, Nov 13, 2024 at 03:50:40PM -0500, Steven Rostedt wrote:
> > On Thu, 7 Nov 2024 17:59:08 +0100
> > Jens Remus <jremus@linux.ibm.com> wrote:
> >
> > > On 28.10.2024 22:47, Josh Poimboeuf wrote:
> > > ...
> > > > diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
> > > ...
> > > > +static int find_fde(struct sframe_section *sec, unsigned long ip,
> > > > + struct sframe_fde *fde)
> > > > +{
> > > > + struct sframe_fde __user *first, *last, *found = NULL;
> > > > + u32 ip_off, func_off_low = 0, func_off_high = -1;
> > > > +
> > > > + ip_off = ip - sec->sframe_addr;
> > > > +
> > > > + first = (void __user *)sec->fdes_addr;
> > > > + last = first + sec->fdes_nr;
> > >
> > > Could it be that this needs to be:
> > >
> > > last = first + sec->fdes_nr - 1;
> >
> > Yep, I discovered the same issue.
>
> Indeed, thanks.
>
BTW, the following changes were needed to make it work for me:
-- Steve
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 434c548f0837..64cc3c1188ca 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -842,7 +842,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
int first_pt_load = 1;
unsigned long error;
struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
- struct elf_phdr *elf_property_phdata = NULL, *sframe_phdr = NULL;
+ struct elf_phdr *elf_property_phdata = NULL;
+ unsigned long sframe_vaddr = 0;
unsigned long elf_brk;
int retval, i;
unsigned long elf_entry;
@@ -951,7 +952,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
break;
case PT_GNU_SFRAME:
- sframe_phdr = elf_ppnt;
+ sframe_vaddr = elf_ppnt->p_vaddr;
break;
case PT_LOPROC ... PT_HIPROC:
@@ -1344,8 +1345,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
task_pid_nr(current), retval);
}
- if (sframe_phdr)
- sframe_add_section(load_bias + sframe_phdr->p_vaddr,
+ if (sframe_vaddr)
+ sframe_add_section(load_bias + sframe_vaddr,
start_code, end_code);
regs = current_pt_regs();
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index 933e47696e29..ca4ef0b72772 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -73,15 +73,15 @@ static int find_fde(struct sframe_section *sec, unsigned long ip,
struct sframe_fde *fde)
{
struct sframe_fde __user *first, *last, *found = NULL;
- u32 ip_off, func_off_low = 0, func_off_high = -1;
+ s32 ip_off, func_off_low = INT_MIN, func_off_high = INT_MAX;
ip_off = ip - sec->sframe_addr;
first = (void __user *)sec->fdes_addr;
- last = first + sec->fdes_nr;
+ last = first + sec->fdes_nr - 1;
while (first <= last) {
struct sframe_fde __user *mid;
- u32 func_off;
+ s32 func_off;
mid = first + ((last - first) / 2);
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 11aadfade005..d9cd820150c5 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -97,7 +97,7 @@ int unwind_user_start(struct unwind_user_state *state)
if (current_has_sframe())
state->type = UNWIND_USER_TYPE_SFRAME;
- else if (IS_ENABLED(CONFIG_UNWIND_USER_FP))
+ else if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
state->type = UNWIND_USER_TYPE_FP;
else
state->type = UNWIND_USER_TYPE_NONE;
@@ -138,7 +138,7 @@ int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries)
static u64 ctx_to_cookie(u64 cpu, u64 ctx)
{
BUILD_BUG_ON(NR_CPUS > 65535);
- return (ctx & ((1UL << 48) - 1)) | cpu;
+ return (ctx & ((1UL << 48) - 1)) | (cpu << 48);
}
/*
^ permalink raw reply related [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-11-13 22:13 ` Steven Rostedt
@ 2024-11-13 22:21 ` Steven Rostedt
2024-11-13 22:25 ` Steven Rostedt
2024-11-14 9:57 ` Jens Remus
1 sibling, 1 reply; 119+ messages in thread
From: Steven Rostedt @ 2024-11-13 22:21 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Jens Remus, x86, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Heiko Carstens, Vasily Gorbik
[ This reply fixes the linux-trace-kernel email :-p ]
On Wed, 13 Nov 2024 17:13:26 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> BTW, the following changes were needed to make it work for me:
>
> -- Steve
>
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 434c548f0837..64cc3c1188ca 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -842,7 +842,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
> int first_pt_load = 1;
> unsigned long error;
> struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
> - struct elf_phdr *elf_property_phdata = NULL, *sframe_phdr = NULL;
> + struct elf_phdr *elf_property_phdata = NULL;
> + unsigned long sframe_vaddr = 0;
Could not just save the pointer to the sframe phd, as it gets freed before we need it.
> unsigned long elf_brk;
> int retval, i;
> unsigned long elf_entry;
> @@ -951,7 +952,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
> break;
>
> case PT_GNU_SFRAME:
> - sframe_phdr = elf_ppnt;
> + sframe_vaddr = elf_ppnt->p_vaddr;
> break;
>
> case PT_LOPROC ... PT_HIPROC:
> @@ -1344,8 +1345,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
> task_pid_nr(current), retval);
> }
>
> - if (sframe_phdr)
> - sframe_add_section(load_bias + sframe_phdr->p_vaddr,
> + if (sframe_vaddr)
> + sframe_add_section(load_bias + sframe_vaddr,
> start_code, end_code);
>
> regs = current_pt_regs();
> diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
> index 933e47696e29..ca4ef0b72772 100644
> --- a/kernel/unwind/sframe.c
> +++ b/kernel/unwind/sframe.c
> @@ -73,15 +73,15 @@ static int find_fde(struct sframe_section *sec, unsigned long ip,
> struct sframe_fde *fde)
> {
> struct sframe_fde __user *first, *last, *found = NULL;
> - u32 ip_off, func_off_low = 0, func_off_high = -1;
> + s32 ip_off, func_off_low = INT_MIN, func_off_high = INT_MAX;
The ip_off is a signed it. I wrote a program to dump out the sframe section
of files, and I had:
ffffed88: (1020) size: 16 off: 146 num: 2 info: 1 rep:16
ffffed98: (1030) size: 336 off: 154 num: 2 info:17 rep:16
ffffefe1: (1279) size: 113 off: 0 num: 4 info: 0 rep: 0
fffff052: (12ea) size: 54 off: 15 num: 3 info: 0 rep: 0
fffff088: (1320) size: 167 off: 26 num: 3 info: 0 rep: 0
fffff12f: (13c7) size: 167 off: 37 num: 4 info: 0 rep: 0
fffff1d6: (146e) size: 167 off: 52 num: 4 info: 0 rep: 0
fffff27d: (1515) size: 22 off: 67 num: 4 info: 0 rep: 0
fffff293: (152b) size: 141 off: 82 num: 4 info: 0 rep: 0
fffff320: (15b8) size: 81 off: 97 num: 4 info: 0 rep: 0
fffff371: (1609) size: 671 off: 112 num: 4 info: 1 rep: 0
fffff610: (18a8) size: 171 off: 131 num: 4 info: 0 rep: 0
The above turns was created by a loop of:
fde = (void *)sframes + sizeof(*sframes) + sframes->sfh_fdeoff;
for (s = 0; s < sframes->sfh_num_fdes; s++, fde++) {
printf("\t%x: (%lx) size:%8u off:%8u num:%8u info:%2u rep:%2u\n",
fde->sfde_func_start_address,
fde->sfde_func_start_address + shdr->sh_offset,
fde->sfde_func_size,
fde->sfde_func_start_fre_off,
fde->sfde_func_num_fres,
fde->sfde_func_info,
fde->sfde_func_rep_size);
}
As you can see, all the ip_off are negative.
>
> ip_off = ip - sec->sframe_addr;
>
> first = (void __user *)sec->fdes_addr;
> - last = first + sec->fdes_nr;
> + last = first + sec->fdes_nr - 1;
The above was mentioned before.
> while (first <= last) {
> struct sframe_fde __user *mid;
> - u32 func_off;
> + s32 func_off;
>
> mid = first + ((last - first) / 2);
>
> diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
> index 11aadfade005..d9cd820150c5 100644
> --- a/kernel/unwind/user.c
> +++ b/kernel/unwind/user.c
> @@ -97,7 +97,7 @@ int unwind_user_start(struct unwind_user_state *state)
>
> if (current_has_sframe())
> state->type = UNWIND_USER_TYPE_SFRAME;
> - else if (IS_ENABLED(CONFIG_UNWIND_USER_FP))
> + else if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
This was mentioned too.
> state->type = UNWIND_USER_TYPE_FP;
> else
> state->type = UNWIND_USER_TYPE_NONE;
> @@ -138,7 +138,7 @@ int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries)
> static u64 ctx_to_cookie(u64 cpu, u64 ctx)
> {
> BUILD_BUG_ON(NR_CPUS > 65535);
> - return (ctx & ((1UL << 48) - 1)) | cpu;
> + return (ctx & ((1UL << 48) - 1)) | (cpu << 48);
And so was this.
-- Steve
> }
>
> /*
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-11-13 22:21 ` Steven Rostedt
@ 2024-11-13 22:25 ` Steven Rostedt
0 siblings, 0 replies; 119+ messages in thread
From: Steven Rostedt @ 2024-11-13 22:25 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Jens Remus, x86, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Heiko Carstens, Vasily Gorbik
On Wed, 13 Nov 2024 17:21:18 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> > diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> > index 434c548f0837..64cc3c1188ca 100644
> > --- a/fs/binfmt_elf.c
> > +++ b/fs/binfmt_elf.c
> > @@ -842,7 +842,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
> > int first_pt_load = 1;
> > unsigned long error;
> > struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
> > - struct elf_phdr *elf_property_phdata = NULL, *sframe_phdr = NULL;
> > + struct elf_phdr *elf_property_phdata = NULL;
> > + unsigned long sframe_vaddr = 0;
>
> Could not just save the pointer to the sframe phd, as it gets freed before we need it.
^^^
phdr
> > diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
> > index 933e47696e29..ca4ef0b72772 100644
> > --- a/kernel/unwind/sframe.c
> > +++ b/kernel/unwind/sframe.c
> > @@ -73,15 +73,15 @@ static int find_fde(struct sframe_section *sec, unsigned long ip,
> > struct sframe_fde *fde)
> > {
> > struct sframe_fde __user *first, *last, *found = NULL;
> > - u32 ip_off, func_off_low = 0, func_off_high = -1;
> > + s32 ip_off, func_off_low = INT_MIN, func_off_high = INT_MAX;
>
> The ip_off is a signed it. I wrote a program to dump out the sframe section
^^
int
> of files, and I had:
>
> ffffed88: (1020) size: 16 off: 146 num: 2 info: 1 rep:16
> ffffed98: (1030) size: 336 off: 154 num: 2 info:17 rep:16
> ffffefe1: (1279) size: 113 off: 0 num: 4 info: 0 rep: 0
> fffff052: (12ea) size: 54 off: 15 num: 3 info: 0 rep: 0
> fffff088: (1320) size: 167 off: 26 num: 3 info: 0 rep: 0
> fffff12f: (13c7) size: 167 off: 37 num: 4 info: 0 rep: 0
> fffff1d6: (146e) size: 167 off: 52 num: 4 info: 0 rep: 0
> fffff27d: (1515) size: 22 off: 67 num: 4 info: 0 rep: 0
> fffff293: (152b) size: 141 off: 82 num: 4 info: 0 rep: 0
> fffff320: (15b8) size: 81 off: 97 num: 4 info: 0 rep: 0
> fffff371: (1609) size: 671 off: 112 num: 4 info: 1 rep: 0
> fffff610: (18a8) size: 171 off: 131 num: 4 info: 0 rep: 0
>
> The above turns was created by a loop of:
^^^^^^^^^
items were
No idea why I typed that :-p
I can't blame jetlag anymore.
-- Steve
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
2024-11-13 22:13 ` Steven Rostedt
2024-11-13 22:21 ` Steven Rostedt
@ 2024-11-14 9:57 ` Jens Remus
1 sibling, 0 replies; 119+ messages in thread
From: Jens Remus @ 2024-11-14 9:57 UTC (permalink / raw)
To: Steven Rostedt, Josh Poimboeuf
Cc: x86, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Mathieu Desnoyers,
Florian Weimer, Andy Lutomirski, Heiko Carstens, Vasily Gorbik
On 13.11.2024 23:13, Steven Rostedt wrote:
> On Wed, 13 Nov 2024 13:15:35 -0800
> BTW, the following changes were needed to make it work for me:
> diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
> index 933e47696e29..ca4ef0b72772 100644
> --- a/kernel/unwind/sframe.c
> +++ b/kernel/unwind/sframe.c
> @@ -73,15 +73,15 @@ static int find_fde(struct sframe_section *sec, unsigned long ip,
> struct sframe_fde *fde)
> {
> struct sframe_fde __user *first, *last, *found = NULL;
> - u32 ip_off, func_off_low = 0, func_off_high = -1;
> + s32 ip_off, func_off_low = INT_MIN, func_off_high = INT_MAX;
Coincidentally I was experimenting with exactly the same changes, except
that I used S32_MIN and S32_MAX. Any preference?
>
> ip_off = ip - sec->sframe_addr;
>
> first = (void __user *)sec->fdes_addr;
> - last = first + sec->fdes_nr;
> + last = first + sec->fdes_nr - 1;
> while (first <= last) {
> struct sframe_fde __user *mid;
> - u32 func_off;
> + s32 func_off;
>
> mid = first + ((last - first) / 2);
>
Thanks and regards,
Jens
--
Jens Remus
Linux on Z Development (D3303) and z/VSE Support
+49-7031-16-1128 Office
jremus@de.ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des
Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der
Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 07/19] unwind: Add user space unwinding API
2024-10-28 21:47 ` [PATCH v3 07/19] unwind: Add user space unwinding API Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
@ 2024-12-06 10:29 ` Jens Remus
2024-12-09 20:54 ` Josh Poimboeuf
1 sibling, 1 reply; 119+ messages in thread
From: Jens Remus @ 2024-12-06 10:29 UTC (permalink / raw)
To: Josh Poimboeuf, x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Heiko Carstens, Vasily Gorbik
On 28.10.2024 22:47, Josh Poimboeuf wrote:
> Introduce a user space unwinder API which provides a generic way to
> unwind user stacks.
...
> diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
...
> +int unwind_user_next(struct unwind_user_state *state)
> +{
> + struct unwind_user_frame _frame;
> + struct unwind_user_frame *frame = &_frame;
> + unsigned long prev_ip, cfa, fp, ra = 0;
> +
> + if (state->done)
> + return -EINVAL;
> +
> + prev_ip = state->ip;
> +
> + switch (state->type) {
> + case UNWIND_USER_TYPE_FP:
> + frame = &fp_frame;
> + break;
> + default:
> + BUG();
> + }
> +
> + cfa = (frame->use_fp ? state->fp : state->sp) + frame->cfa_off;
> +
> + if (frame->ra_off && get_user(ra, (unsigned long __user *)(cfa + frame->ra_off)))
> + goto the_end;
> +
> + if (ra == prev_ip)
> + goto the_end;
This seems too restrictive to me, as it effectively prevents
unwinding from recursive functions, e.g. glibc's internal merge sort
msort_with_tmp():
$ perf record -F 9999 --call-graph fp /usr/bin/objdump -wdWF /usr/bin/objdump
$ perf script
...
objdump 8314 236064.515562: 100010 task-clock:ppp:
100630a compare_symbols+0x2a (/usr/bin/objdump)
3ffb9e58e7c msort_with_tmp.part.0+0x15c (/usr/lib64/libc.so.6)
3ffb9e58d76 msort_with_tmp.part.0+0x56 (/usr/lib64/libc.so.6)
[unwinding unexpectedly stops]
Would it be an option to only stop unwinding if both the IP and SP do
not change?
if (sp == prev_sp && ra == prev_ra)
goto the_end;
> +
> + if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
> + goto the_end;
> +
> + state->sp = cfa;
> + state->ip = ra;
> + if (frame->fp_off)
> + state->fp = fp;
> +
> + return 0;
> +
> +the_end:
> + state->done = true;
> + return -EINVAL;
> +}
...
Thanks and regards,
Jens
--
Jens Remus
Linux on Z Development (D3303) and z/VSE Support
+49-7031-16-1128 Office
jremus@de.ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 07/19] unwind: Add user space unwinding API
2024-12-06 10:29 ` Jens Remus
@ 2024-12-09 20:54 ` Josh Poimboeuf
2024-12-11 14:53 ` Jens Remus
0 siblings, 1 reply; 119+ messages in thread
From: Josh Poimboeuf @ 2024-12-09 20:54 UTC (permalink / raw)
To: Jens Remus
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Heiko Carstens, Vasily Gorbik
On Fri, Dec 06, 2024 at 11:29:21AM +0100, Jens Remus wrote:
> On 28.10.2024 22:47, Josh Poimboeuf wrote:
> > + if (ra == prev_ip)
> > + goto the_end;
>
> This seems too restrictive to me, as it effectively prevents
> unwinding from recursive functions, e.g. glibc's internal merge sort
> msort_with_tmp():
>
> $ perf record -F 9999 --call-graph fp /usr/bin/objdump -wdWF /usr/bin/objdump
> $ perf script
> ...
> objdump 8314 236064.515562: 100010 task-clock:ppp:
> 100630a compare_symbols+0x2a (/usr/bin/objdump)
> 3ffb9e58e7c msort_with_tmp.part.0+0x15c (/usr/lib64/libc.so.6)
> 3ffb9e58d76 msort_with_tmp.part.0+0x56 (/usr/lib64/libc.so.6)
> [unwinding unexpectedly stops]
>
> Would it be an option to only stop unwinding if both the IP and SP do
> not change?
>
> if (sp == prev_sp && ra == prev_ra)
> goto the_end;
Good point, I've already fixed that for the next version (not yet
posted). I believe the only thing we really need to check here is that
the unwind is heading in the right direction:
if (cfa <= state->sp)
goto the_end;
--
Josh
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 07/19] unwind: Add user space unwinding API
2024-12-09 20:54 ` Josh Poimboeuf
@ 2024-12-11 14:53 ` Jens Remus
2024-12-11 17:48 ` Josh Poimboeuf
0 siblings, 1 reply; 119+ messages in thread
From: Jens Remus @ 2024-12-11 14:53 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Heiko Carstens, Vasily Gorbik
On 09.12.2024 21:54, Josh Poimboeuf wrote:
> On Fri, Dec 06, 2024 at 11:29:21AM +0100, Jens Remus wrote:
>> On 28.10.2024 22:47, Josh Poimboeuf wrote:
>>> + if (ra == prev_ip)
>>> + goto the_end;
>>
>> This seems too restrictive to me, as it effectively prevents
>> unwinding from recursive functions, e.g. glibc's internal merge sort
>> msort_with_tmp():
>>
>> $ perf record -F 9999 --call-graph fp /usr/bin/objdump -wdWF /usr/bin/objdump
>> $ perf script
>> ...
>> objdump 8314 236064.515562: 100010 task-clock:ppp:
>> 100630a compare_symbols+0x2a (/usr/bin/objdump)
>> 3ffb9e58e7c msort_with_tmp.part.0+0x15c (/usr/lib64/libc.so.6)
>> 3ffb9e58d76 msort_with_tmp.part.0+0x56 (/usr/lib64/libc.so.6)
>> [unwinding unexpectedly stops]
>>
>> Would it be an option to only stop unwinding if both the IP and SP do
>> not change?
>>
>> if (sp == prev_sp && ra == prev_ra)
>> goto the_end;
>
> Good point, I've already fixed that for the next version (not yet
> posted). I believe the only thing we really need to check here is that
> the unwind is heading in the right direction:
>
> if (cfa <= state->sp)
> goto the_end;
Assuming the x86 definition of the CFA (CFA == SP at call site) this
translates into:
if (sp <= state->sp)
goto the_end;
That won't work for architectures that pass the return address in a
register instead of on the stack, such as s390. At least in the
topmost frame the unwound SP may be unchanged, for instance when in
the function prologue or in a leaf function.
One of my patches for s390 support introduces a state->first flag,
indicating whether it is the topmost user space frame. Using that
your check could be extended to:
if ((state->first && sp < state->sp) || (!state->first && sp <= state->sp))
goto the_end;
Which could be simplified to:
if (sp <= state->sp - state->first)
goto the_end;
Btw. neither would work for architectures with an upwards-growing
stack, such as hppa. Not sure if that needs to be considered.
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303) and z/VSE Support
+49-7031-16-1128 Office
jremus@de.ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
^ permalink raw reply [flat|nested] 119+ messages in thread
* Re: [PATCH v3 07/19] unwind: Add user space unwinding API
2024-12-11 14:53 ` Jens Remus
@ 2024-12-11 17:48 ` Josh Poimboeuf
0 siblings, 0 replies; 119+ messages in thread
From: Josh Poimboeuf @ 2024-12-11 17:48 UTC (permalink / raw)
To: Jens Remus
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Heiko Carstens, Vasily Gorbik
On Wed, Dec 11, 2024 at 03:53:26PM +0100, Jens Remus wrote:
> On 09.12.2024 21:54, Josh Poimboeuf wrote:
> > if (cfa <= state->sp)
> > goto the_end;
>
> Assuming the x86 definition of the CFA (CFA == SP at call site) this
> translates into:
>
> if (sp <= state->sp)
> goto the_end;
>
> That won't work for architectures that pass the return address in a
> register instead of on the stack, such as s390. At least in the
> topmost frame the unwound SP may be unchanged. For instance when in
> the function prologue or when in a leaf function.
>
> One of my patches for s390 support introduces a state->first flag,
> indicating whether it is the topmost user space frame. Using that,
> your check could be extended to:
>
> if ((state->first && sp < state->sp) || (!state->first && sp <= state->sp))
> goto the_end;
>
> Which could be simplified to:
>
> if (sp <= state->sp - state->first)
> goto the_end;
Since my patches are x86-only, how about I leave the "sp <= state->sp"
check and then you add something like that in your patches on top?
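
Roughly this kind of split, as a sketch only (the helper names are
invented here, this is not what's queued): x86 keeps the strict check
and s390 overrides it for the topmost frame:

  #include <stdbool.h>
  #include <stdio.h>

  /*
   * Stand-in for the unwinder state; the 'first' flag is from Jens'
   * proposed s390 patches, everything else is illustrative.
   */
  struct unwind_user_state {
          unsigned long sp;
          bool first;
  };

  /* x86 default: the new CFA must be strictly above the previous SP. */
  static bool progress_default(const struct unwind_user_state *state,
                               unsigned long cfa)
  {
          return cfa > state->sp;
  }

  /*
   * Possible s390 override: tolerate an unchanged SP in the topmost
   * frame (prologue, or leaf function with the return address in %r14).
   */
  static bool progress_s390(const struct unwind_user_state *state,
                            unsigned long cfa)
  {
          return state->first ? cfa >= state->sp : cfa > state->sp;
  }

  int main(void)
  {
          struct unwind_user_state state = { .sp = 0x1000, .first = true };

          /*
           * s390 leaf function sampled in the topmost frame: the
           * unwound CFA equals the current SP.
           */
          printf("x86 default stops: %d\n",
                 !progress_default(&state, 0x1000));
          printf("s390 continues:    %d\n",
                 progress_s390(&state, 0x1000));
          return 0;
  }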
> Btw. neither would work for architectures with an upwards-growing
> stack, such as hppa. Not sure if that needs to be considered.
I don't think that's needed unless and until sframe becomes supported
for such an arch.
--
Josh
Thread overview: 119+ messages
2024-10-28 21:47 [PATCH v3 00/19] unwind, perf: sframe user space unwinding Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 01/19] x86/vdso: Fix DWARF generation for getrandom() Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 02/19] x86/asm: Avoid emitting DWARF CFI for non-VDSO Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
2024-10-30 17:19 ` Jens Remus
2024-10-30 17:51 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 03/19] x86/asm: Fix VDSO DWARF generation with kernel IBT enabled Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 04/19] x86/vdso: Use SYM_FUNC_{START,END} in __kernel_vsyscall() Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 05/19] x86/vdso: Use CFI macros in __vdso_sgx_enter_enclave() Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 06/19] x86/vdso: Enable sframe generation in VDSO Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
2024-10-30 18:20 ` Jens Remus
2024-10-30 19:17 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 07/19] unwind: Add user space unwinding API Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
2024-12-06 10:29 ` Jens Remus
2024-12-09 20:54 ` Josh Poimboeuf
2024-12-11 14:53 ` Jens Remus
2024-12-11 17:48 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 08/19] unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_FP Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
2024-10-29 13:13 ` Peter Zijlstra
2024-10-29 16:31 ` Josh Poimboeuf
2024-10-29 18:08 ` Peter Zijlstra
2024-10-28 21:47 ` [PATCH v3 09/19] unwind: Introduce sframe user space unwinding Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
2024-10-29 13:27 ` Peter Zijlstra
2024-10-29 16:50 ` Josh Poimboeuf
2024-10-29 18:10 ` Peter Zijlstra
2024-10-29 23:32 ` Andrii Nakryiko
2024-10-30 5:53 ` Josh Poimboeuf
2024-10-31 20:57 ` Andrii Nakryiko
2024-10-31 21:00 ` Nick Desaulniers
2024-10-31 21:38 ` Indu Bhagat
2024-11-01 18:38 ` Andrii Nakryiko
2024-11-01 18:47 ` Steven Rostedt
2024-11-01 18:54 ` Andrii Nakryiko
2024-11-03 0:07 ` Indu Bhagat
2024-10-31 23:03 ` Josh Poimboeuf
2024-11-01 18:34 ` Andrii Nakryiko
2024-11-01 19:29 ` Josh Poimboeuf
2024-11-01 19:44 ` Andrii Nakryiko
2024-11-01 19:46 ` Andrii Nakryiko
2024-11-01 19:51 ` Josh Poimboeuf
2024-11-01 19:09 ` Segher Boessenkool
2024-11-01 19:33 ` Josh Poimboeuf
2024-11-01 19:35 ` Josh Poimboeuf
2024-11-01 19:48 ` Josh Poimboeuf
2024-11-01 21:35 ` Segher Boessenkool
2024-11-05 17:40 ` Steven Rostedt
2024-11-05 17:45 ` Steven Rostedt
2024-11-06 17:04 ` Jens Remus
2024-11-07 8:25 ` Weinan Liu
2024-11-07 16:59 ` Jens Remus
2024-11-13 20:50 ` Steven Rostedt
2024-11-13 21:15 ` Josh Poimboeuf
2024-11-13 22:13 ` Steven Rostedt
2024-11-13 22:21 ` Steven Rostedt
2024-11-13 22:25 ` Steven Rostedt
2024-11-14 9:57 ` Jens Remus
2024-11-13 15:56 ` Jens Remus
2024-11-13 20:50 ` Steven Rostedt
2024-11-13 21:13 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 10/19] unwind/x86: Enable CONFIG_HAVE_UNWIND_USER_SFRAME Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
2024-10-29 13:14 ` Peter Zijlstra
2024-10-28 21:47 ` [PATCH v3 11/19] unwind: Add deferred user space unwinding API Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
2024-10-29 13:48 ` Peter Zijlstra
2024-10-29 16:51 ` Josh Poimboeuf
2024-10-29 13:49 ` Peter Zijlstra
2024-10-29 17:05 ` Josh Poimboeuf
2024-10-29 18:11 ` Peter Zijlstra
2024-10-29 13:56 ` Peter Zijlstra
2024-10-29 17:17 ` Josh Poimboeuf
2024-10-29 17:47 ` Mathieu Desnoyers
2024-10-29 18:20 ` Peter Zijlstra
2024-10-30 6:17 ` Steven Rostedt
2024-10-30 14:03 ` Peter Zijlstra
2024-10-30 19:58 ` Steven Rostedt
2024-10-30 20:48 ` Josh Poimboeuf
2024-10-29 18:34 ` Josh Poimboeuf
2024-10-30 13:44 ` Mathieu Desnoyers
2024-10-30 17:47 ` Josh Poimboeuf
2024-10-30 17:55 ` Josh Poimboeuf
2024-10-30 18:25 ` Josh Poimboeuf
2024-10-29 23:32 ` Andrii Nakryiko
2024-10-30 6:10 ` Josh Poimboeuf
2024-10-31 21:22 ` Andrii Nakryiko
2024-10-31 23:13 ` Josh Poimboeuf
2024-10-31 23:28 ` Andrii Nakryiko
2024-11-01 17:41 ` Josh Poimboeuf
2024-11-01 18:05 ` Andrii Nakryiko
2024-10-28 21:47 ` [PATCH v3 12/19] perf: Remove get_perf_callchain() 'init_nr' argument Josh Poimboeuf
2024-10-28 21:47 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 13/19] perf: Remove get_perf_callchain() 'crosstask' argument Josh Poimboeuf
2024-10-28 21:48 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 14/19] perf: Simplify get_perf_callchain() user logic Josh Poimboeuf
2024-10-28 21:48 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 15/19] perf: Add deferred user callchains Josh Poimboeuf
2024-10-28 21:48 ` Josh Poimboeuf
2024-10-29 14:06 ` Peter Zijlstra
2024-11-06 9:45 ` Jens Remus
2024-10-28 21:47 ` [PATCH v3 16/19] perf tools: Minimal CALLCHAIN_DEFERRED support Josh Poimboeuf
2024-10-28 21:48 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 17/19] perf record: Enable defer_callchain for user callchains Josh Poimboeuf
2024-10-28 21:48 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 18/19] perf script: Display PERF_RECORD_CALLCHAIN_DEFERRED Josh Poimboeuf
2024-10-28 21:48 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 19/19] perf tools: Merge deferred user callchains Josh Poimboeuf
2024-10-28 21:48 ` Josh Poimboeuf
2024-10-28 21:47 ` [PATCH v3 00/19] unwind, perf: sframe user space unwinding Josh Poimboeuf
2024-10-28 21:54 ` Josh Poimboeuf
2024-10-28 23:55 ` Josh Poimboeuf
2024-10-29 14:08 ` Peter Zijlstra