linux-toolchains.vger.kernel.org archive mirror
* [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
@ 2024-09-13 23:02 Josh Poimboeuf
  2024-09-13 23:02 ` [PATCH v2 01/11] unwind: Introduce generic user space unwinding interface Josh Poimboeuf
                   ` (13 more replies)
  0 siblings, 14 replies; 44+ messages in thread
From: Josh Poimboeuf @ 2024-09-13 23:02 UTC (permalink / raw)
  To: x86
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James

This is a new version of sframe user space unwinding + perf deferred
callchains.  Better late than never.

Unfortunately I didn't get a chance to do any testing with this one yet
as I'm rushing it out the door before GNU Cauldron starts.
Conference-driven development ;-)

Plus I don't have the perf tool piece working yet so I don't have a way
of doing end-to-end testing at the moment anyway.

In other words, please don't merge this yet.

Namhyung, if you're still available to write a perf tool patch which
integrates with this, that would be great.  Otherwise I could give it a
try.

Steven, let me know if this would interface ok with your anticipated
tracing usage.

v2:
- rebase on v6.11-rc7
- reorganize the patches to add sframe first
- change to sframe v2
- add new perf event type: PERF_RECORD_CALLCHAIN_DEFERRED
- add new perf attribute: defer_callchain

v1: https://lore.kernel.org/cover.1699487758.git.jpoimboe@kernel.org

Some distros have started compiling frame pointers into all their
packages to enable the kernel to do system-wide profiling of user space.
Unfortunately that creates a runtime performance penalty across the
entire system.  Using DWARF (or .eh_frame) instead isn't feasible
because of complexity and slowness.

For in-kernel unwinding we solved this problem with the creation of the
ORC unwinder for x86_64.  Similarly, for user space the GNU assembler
has created the SFrame ("Simple Frame") v2 format starting with binutils
2.41.

These patches add support for unwinding user space from the kernel using
SFrame with perf.  It should be easy to add user unwinding support for
other components like ftrace.

There were two main challenges:

1) Finding .sframe sections in shared/dlopened libraries

   The kernel has no visibility into the contents of shared libraries.
   This was solved by adding a PR_ADD_SFRAME option to prctl() which
   allows the runtime linker to manually provide the in-memory address
   of an .sframe section to the kernel.

2) Dealing with page faults

   Keeping all binaries' sframe data pinned would likely waste a lot of
   memory.  Instead, read it from user space on demand.  That can't be
   done from perf NMI context due to page faults, so defer the unwind to
   the next user exit.  Since the NMI handler doesn't do exit work,
   self-IPI and then schedule task work to be run on exit from the IPI.

Special thanks to Indu for the original concept, and to Steven and Peter
for helping a lot with the design.  And to Steven for letting me do it ;-)

Josh Poimboeuf (11):
  unwind: Introduce generic user space unwinding interface
  unwind/x86: Add HAVE_USER_UNWIND
  unwind: Introduce SFrame user space unwinding
  unwind/x86/64: Add HAVE_USER_UNWIND_SFRAME
  perf/x86: Use user_unwind interface
  perf: Remove get_perf_callchain() 'init_nr' argument
  perf: Remove get_perf_callchain() 'crosstask' argument
  perf: Simplify get_perf_callchain() user logic
  perf: Introduce deferred user callchains
  perf/x86: Add HAVE_PERF_CALLCHAIN_DEFERRED
  perf/x86: Enable SFrame unwinding for deferred user callchains

 arch/Kconfig                       |   9 +
 arch/x86/Kconfig                   |   3 +
 arch/x86/events/core.c             |  76 +++---
 arch/x86/include/asm/user_unwind.h |  11 +
 fs/binfmt_elf.c                    |  47 +++-
 include/linux/mm_types.h           |   3 +
 include/linux/perf_event.h         |  11 +-
 include/linux/sframe.h             |  46 ++++
 include/linux/user_unwind.h        |  32 +++
 include/uapi/linux/elf.h           |   1 +
 include/uapi/linux/perf_event.h    |  21 +-
 include/uapi/linux/prctl.h         |   3 +
 kernel/Makefile                    |   1 +
 kernel/bpf/stackmap.c              |   8 +-
 kernel/events/callchain.c          |  48 ++--
 kernel/events/core.c               |  82 +++++-
 kernel/fork.c                      |  10 +
 kernel/sys.c                       |  11 +
 kernel/unwind/Makefile             |   2 +
 kernel/unwind/sframe.c             | 420 +++++++++++++++++++++++++++++
 kernel/unwind/sframe.h             | 215 +++++++++++++++
 kernel/unwind/user.c               |  95 +++++++
 mm/init-mm.c                       |   4 +-
 23 files changed, 1086 insertions(+), 73 deletions(-)
 create mode 100644 arch/x86/include/asm/user_unwind.h
 create mode 100644 include/linux/sframe.h
 create mode 100644 include/linux/user_unwind.h
 create mode 100644 kernel/unwind/Makefile
 create mode 100644 kernel/unwind/sframe.c
 create mode 100644 kernel/unwind/sframe.h
 create mode 100644 kernel/unwind/user.c

-- 
2.46.0



* [PATCH v2 01/11] unwind: Introduce generic user space unwinding interface
  2024-09-13 23:02 [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains Josh Poimboeuf
@ 2024-09-13 23:02 ` Josh Poimboeuf
  2024-09-13 23:02 ` [PATCH v2 02/11] unwind/x86: Add HAVE_USER_UNWIND Josh Poimboeuf
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Josh Poimboeuf @ 2024-09-13 23:02 UTC (permalink / raw)
  To: x86
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James

Introduce a user space unwinder interface which will provide a generic
way for architectures to unwind different user space stack frame types.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/Kconfig                |  3 ++
 include/linux/user_unwind.h | 31 ++++++++++++++
 kernel/Makefile             |  1 +
 kernel/unwind/Makefile      |  1 +
 kernel/unwind/user.c        | 81 +++++++++++++++++++++++++++++++++++++
 5 files changed, 117 insertions(+)
 create mode 100644 include/linux/user_unwind.h
 create mode 100644 kernel/unwind/Makefile
 create mode 100644 kernel/unwind/user.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 975dd22a2dbd..b1002b2da331 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -425,6 +425,9 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
 	  It uses the same command line parameters, and sysctl interface,
 	  as the generic hardlockup detectors.
 
+config HAVE_USER_UNWIND
+	bool
+
 config HAVE_PERF_REGS
 	bool
 	help
diff --git a/include/linux/user_unwind.h b/include/linux/user_unwind.h
new file mode 100644
index 000000000000..0a19ac6c92b2
--- /dev/null
+++ b/include/linux/user_unwind.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_USER_UNWIND_H
+#define _LINUX_USER_UNWIND_H
+
+#include <linux/types.h>
+
+enum user_unwind_type {
+	USER_UNWIND_TYPE_AUTO,
+	USER_UNWIND_TYPE_FP,
+};
+
+struct user_unwind_frame {
+	s32 cfa_off;
+	s32 ra_off;
+	s32 fp_off;
+	bool use_fp;
+};
+
+struct user_unwind_state {
+	unsigned long ip, sp, fp;
+	enum user_unwind_type type;
+	bool done;
+};
+
+extern int user_unwind_start(struct user_unwind_state *state, enum user_unwind_type);
+extern int user_unwind_next(struct user_unwind_state *state);
+
+#define for_each_user_frame(state, type) \
+	for (user_unwind_start(state, type); !(state)->done; user_unwind_next(state))
+
+#endif /* _LINUX_USER_UNWIND_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 3c13240dfc9f..259581df82dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,6 +50,7 @@ obj-y += rcu/
 obj-y += livepatch/
 obj-y += dma/
 obj-y += entry/
+obj-y += unwind/
 obj-$(CONFIG_MODULES) += module/
 
 obj-$(CONFIG_KCMP) += kcmp.o
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
new file mode 100644
index 000000000000..eb466d6a3295
--- /dev/null
+++ b/kernel/unwind/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_HAVE_USER_UNWIND) += user.o
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
new file mode 100644
index 000000000000..5d16f9604a61
--- /dev/null
+++ b/kernel/unwind/user.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Generic interface for unwinding user space
+ *
+ * Copyright (C) 2024 Josh Poimboeuf <jpoimboe@kernel.org>
+ */
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+#include <linux/user_unwind.h>
+#include <linux/uaccess.h>
+#include <asm/user_unwind.h>
+
+static struct user_unwind_frame fp_frame = {
+	ARCH_INIT_USER_FP_FRAME
+};
+
+int user_unwind_next(struct user_unwind_state *state)
+{
+	struct user_unwind_frame _frame;
+	struct user_unwind_frame *frame = &_frame;
+	unsigned long cfa, fp, ra = 0;
+	int ret = -EINVAL;
+
+	if (state->done)
+		return -EINVAL;
+
+	switch (state->type) {
+	case USER_UNWIND_TYPE_FP:
+		frame = &fp_frame;
+		break;
+	default:
+		BUG();
+	}
+
+	cfa = (frame->use_fp ? state->fp : state->sp) + frame->cfa_off;
+
+	if (frame->ra_off && get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
+		goto the_end;
+
+	if (frame->fp_off && get_user(fp, (unsigned long *)(cfa + frame->fp_off)))
+		goto the_end;
+
+	state->sp = cfa;
+	state->ip = ra;
+	if (frame->fp_off)
+		state->fp = fp;
+
+	return 0;
+
+the_end:
+	state->done = true;
+	return ret;
+}
+
+int user_unwind_start(struct user_unwind_state *state,
+		      enum user_unwind_type type)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+
+	memset(state, 0, sizeof(*state));
+
+	if (!current->mm) {
+		state->done = true;
+		return -EINVAL;
+	}
+
+	switch (type) {
+	case USER_UNWIND_TYPE_AUTO:
+	case USER_UNWIND_TYPE_FP:
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	state->sp = user_stack_pointer(regs);
+	state->ip = instruction_pointer(regs);
+	state->fp = frame_pointer(regs);
+
+	return user_unwind_next(state);
+}
-- 
2.46.0



* [PATCH v2 02/11] unwind/x86: Add HAVE_USER_UNWIND
  2024-09-13 23:02 [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains Josh Poimboeuf
  2024-09-13 23:02 ` [PATCH v2 01/11] unwind: Introduce generic user space unwinding interface Josh Poimboeuf
@ 2024-09-13 23:02 ` Josh Poimboeuf
  2024-09-16 11:46   ` Peter Zijlstra
  2024-09-13 23:02 ` [PATCH v2 03/11] unwind: Introduce SFrame user space unwinding Josh Poimboeuf
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Josh Poimboeuf @ 2024-09-13 23:02 UTC (permalink / raw)
  To: x86
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James

Use ARCH_INIT_USER_FP_FRAME to describe how frame pointers are unwound
on x86, and enable HAVE_USER_UNWIND accordingly so the user unwind
interfaces can be used.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/include/asm/user_unwind.h | 11 +++++++++++
 2 files changed, 12 insertions(+)
 create mode 100644 arch/x86/include/asm/user_unwind.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 007bab9f2a0e..266edff59058 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -286,6 +286,7 @@ config X86
 	select HAVE_UACCESS_VALIDATION		if HAVE_OBJTOOL
 	select HAVE_UNSTABLE_SCHED_CLOCK
 	select HAVE_USER_RETURN_NOTIFIER
+	select HAVE_USER_UNWIND
 	select HAVE_GENERIC_VDSO
 	select VDSO_GETRANDOM			if X86_64
 	select HOTPLUG_PARALLEL			if SMP && X86_64
diff --git a/arch/x86/include/asm/user_unwind.h b/arch/x86/include/asm/user_unwind.h
new file mode 100644
index 000000000000..8c509c65cfb5
--- /dev/null
+++ b/arch/x86/include/asm/user_unwind.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_USER_UNWIND_H
+#define _ASM_X86_USER_UNWIND_H
+
+#define ARCH_INIT_USER_FP_FRAME							\
+	.ra_off		= (s32)sizeof(long) * -1,				\
+	.cfa_off	= (s32)sizeof(long) * 2,				\
+	.fp_off		= (s32)sizeof(long) * -2,				\
+	.use_fp		= true,
+
+#endif /* _ASM_X86_USER_UNWIND_H */
-- 
2.46.0



* [PATCH v2 03/11] unwind: Introduce SFrame user space unwinding
  2024-09-13 23:02 [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains Josh Poimboeuf
  2024-09-13 23:02 ` [PATCH v2 01/11] unwind: Introduce generic user space unwinding interface Josh Poimboeuf
  2024-09-13 23:02 ` [PATCH v2 02/11] unwind/x86: Add HAVE_USER_UNWIND Josh Poimboeuf
@ 2024-09-13 23:02 ` Josh Poimboeuf
  2024-09-14 11:23   ` Steven Rostedt
  2024-10-23 13:59   ` Jens Remus
  2024-09-13 23:02 ` [PATCH v2 04/11] unwind/x86/64: Add HAVE_USER_UNWIND_SFRAME Josh Poimboeuf
                   ` (10 subsequent siblings)
  13 siblings, 2 replies; 44+ messages in thread
From: Josh Poimboeuf @ 2024-09-13 23:02 UTC (permalink / raw)
  To: x86
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James

Some distros have started compiling frame pointers into all their
packages to enable the kernel to do system-wide profiling of user space.
Unfortunately that creates a runtime performance penalty across the
entire system.  Using DWARF instead isn't feasible due to the complexity
it would add to the kernel.

For in-kernel unwinding we solved this problem with the creation of the
ORC unwinder for x86_64.  Similarly, for user space the GNU assembler
generates the SFrame format, with SFrame v2 emitted starting with
binutils 2.41.
SFrame is a simpler version of .eh_frame which gets placed in the
.sframe section.

Add support for unwinding user space using SFrame.

More information about SFrame can be found here:

  - https://lwn.net/Articles/932209/
  - https://lwn.net/Articles/940686/
  - https://sourceware.org/binutils/docs/sframe-spec.html

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/Kconfig                |   3 +
 fs/binfmt_elf.c             |  47 +++-
 include/linux/mm_types.h    |   3 +
 include/linux/sframe.h      |  46 ++++
 include/linux/user_unwind.h |   1 +
 include/uapi/linux/elf.h    |   1 +
 include/uapi/linux/prctl.h  |   3 +
 kernel/fork.c               |  10 +
 kernel/sys.c                |  11 +
 kernel/unwind/Makefile      |   1 +
 kernel/unwind/sframe.c      | 420 ++++++++++++++++++++++++++++++++++++
 kernel/unwind/sframe.h      | 215 ++++++++++++++++++
 kernel/unwind/user.c        |  14 ++
 mm/init-mm.c                |   4 +-
 14 files changed, 774 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/sframe.h
 create mode 100644 kernel/unwind/sframe.c
 create mode 100644 kernel/unwind/sframe.h

diff --git a/arch/Kconfig b/arch/Kconfig
index b1002b2da331..ff5d5bc5f947 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -428,6 +428,9 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
 config HAVE_USER_UNWIND
 	bool
 
+config HAVE_USER_UNWIND_SFRAME
+	bool
+
 config HAVE_PERF_REGS
 	bool
 	help
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 19fa49cd9907..923aed390f2e 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -47,6 +47,7 @@
 #include <linux/dax.h>
 #include <linux/uaccess.h>
 #include <linux/rseq.h>
+#include <linux/sframe.h>
 #include <asm/param.h>
 #include <asm/page.h>
 
@@ -633,11 +634,13 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
 		unsigned long no_base, struct elf_phdr *interp_elf_phdata,
 		struct arch_elf_state *arch_state)
 {
-	struct elf_phdr *eppnt;
+	struct elf_phdr *eppnt, *sframe_phdr = NULL;
 	unsigned long load_addr = 0;
 	int load_addr_set = 0;
 	unsigned long error = ~0UL;
 	unsigned long total_size;
+	unsigned long start_code = ~0UL;
+	unsigned long end_code = 0;
 	int i;
 
 	/* First of all, some simple consistency checks */
@@ -659,7 +662,8 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
 
 	eppnt = interp_elf_phdata;
 	for (i = 0; i < interp_elf_ex->e_phnum; i++, eppnt++) {
-		if (eppnt->p_type == PT_LOAD) {
+		switch (eppnt->p_type) {
+		case PT_LOAD: {
 			int elf_type = MAP_PRIVATE;
 			int elf_prot = make_prot(eppnt->p_flags, arch_state,
 						 true, true);
@@ -688,7 +692,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
 			/*
 			 * Check to see if the section's size will overflow the
 			 * allowed task size. Note that p_filesz must always be
-			 * <= p_memsize so it's only necessary to check p_memsz.
+			 * <= p_memsz so it's only necessary to check p_memsz.
 			 */
 			k = load_addr + eppnt->p_vaddr;
 			if (BAD_ADDR(k) ||
@@ -698,7 +702,28 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
 				error = -ENOMEM;
 				goto out;
 			}
+
+			if ((eppnt->p_flags & PF_X) && k < start_code)
+				start_code = k;
+
+			if ((eppnt->p_flags & PF_X) && k + eppnt->p_filesz > end_code)
+				end_code = k + eppnt->p_filesz;
+			break;
 		}
+		case PT_GNU_SFRAME:
+			sframe_phdr = eppnt;
+			break;
+		}
+	}
+
+	if (sframe_phdr) {
+		struct sframe_file sfile = {
+			.sframe_addr	= load_addr + sframe_phdr->p_vaddr,
+			.text_start	= start_code,
+			.text_end	= end_code,
+		};
+
+		__sframe_add_section(&sfile);
 	}
 
 	error = load_addr;
@@ -823,7 +848,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	int first_pt_load = 1;
 	unsigned long error;
 	struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
-	struct elf_phdr *elf_property_phdata = NULL;
+	struct elf_phdr *elf_property_phdata = NULL, *sframe_phdr = NULL;
 	unsigned long elf_brk;
 	int retval, i;
 	unsigned long elf_entry;
@@ -931,6 +956,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
 				executable_stack = EXSTACK_DISABLE_X;
 			break;
 
+		case PT_GNU_SFRAME:
+			sframe_phdr = elf_ppnt;
+			break;
+
 		case PT_LOPROC ... PT_HIPROC:
 			retval = arch_elf_pt_proc(elf_ex, elf_ppnt,
 						  bprm->file, false,
@@ -1316,6 +1345,16 @@ static int load_elf_binary(struct linux_binprm *bprm)
 				MAP_FIXED | MAP_PRIVATE, 0);
 	}
 
+	if (sframe_phdr) {
+		struct sframe_file sfile = {
+			.sframe_addr	= load_bias + sframe_phdr->p_vaddr,
+			.text_start	= start_code,
+			.text_end	= end_code,
+		};
+
+		__sframe_add_section(&sfile);
+	}
+
 	regs = current_pt_regs();
 #ifdef ELF_PLAT_INIT
 	/*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 485424979254..1aee78cbea33 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1019,6 +1019,9 @@ struct mm_struct {
 #endif
 		} lru_gen;
 #endif /* CONFIG_LRU_GEN_WALKS_MMU */
+#ifdef CONFIG_HAVE_USER_UNWIND_SFRAME
+		struct maple_tree sframe_mt;
+#endif
 	} __randomize_layout;
 
 	/*
diff --git a/include/linux/sframe.h b/include/linux/sframe.h
new file mode 100644
index 000000000000..3a44f76929e2
--- /dev/null
+++ b/include/linux/sframe.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SFRAME_H
+#define _LINUX_SFRAME_H
+
+#include <linux/mm_types.h>
+
+struct sframe_file {
+	unsigned long sframe_addr, text_start, text_end;
+};
+
+struct user_unwind_frame;
+
+#ifdef CONFIG_HAVE_USER_UNWIND_SFRAME
+
+#define INIT_MM_SFRAME .sframe_mt = MTREE_INIT(sframe_mt, 0)
+
+extern void sframe_free_mm(struct mm_struct *mm);
+
+extern int __sframe_add_section(struct sframe_file *file);
+extern int sframe_add_section(unsigned long sframe_addr, unsigned long text_start, unsigned long text_end);
+extern int sframe_remove_section(unsigned long sframe_addr);
+extern int sframe_find(unsigned long ip, struct user_unwind_frame *frame);
+
+static inline bool current_has_sframe(void)
+{
+	struct mm_struct *mm = current->mm;
+
+	return mm && !mtree_empty(&mm->sframe_mt);
+}
+
+#else /* !CONFIG_HAVE_USER_UNWIND_SFRAME */
+
+#define INIT_MM_SFRAME
+
+static inline void sframe_free_mm(struct mm_struct *mm) {}
+
+static inline int __sframe_add_section(struct sframe_file *file) { return -EINVAL; }
+static inline int sframe_add_section(unsigned long sframe_addr, unsigned long text_start, unsigned long text_end) { return -EINVAL; }
+static inline int sframe_remove_section(unsigned long sframe_addr) { return -EINVAL; }
+static inline int sframe_find(unsigned long ip, struct user_unwind_frame *frame) { return -EINVAL; }
+
+static inline bool current_has_sframe(void) { return false; }
+
+#endif /* CONFIG_HAVE_USER_UNWIND_SFRAME */
+
+#endif /* _LINUX_SFRAME_H */
diff --git a/include/linux/user_unwind.h b/include/linux/user_unwind.h
index 0a19ac6c92b2..8003f9d35405 100644
--- a/include/linux/user_unwind.h
+++ b/include/linux/user_unwind.h
@@ -7,6 +7,7 @@
 enum user_unwind_type {
 	USER_UNWIND_TYPE_AUTO,
 	USER_UNWIND_TYPE_FP,
+	USER_UNWIND_TYPE_SFRAME,
 };
 
 struct user_unwind_frame {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index b54b313bcf07..b2aca31e1a49 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -39,6 +39,7 @@ typedef __s64	Elf64_Sxword;
 #define PT_GNU_STACK	(PT_LOOS + 0x474e551)
 #define PT_GNU_RELRO	(PT_LOOS + 0x474e552)
 #define PT_GNU_PROPERTY	(PT_LOOS + 0x474e553)
+#define PT_GNU_SFRAME	(PT_LOOS + 0x474e554)
 
 
 /* ARM MTE memory tag segment type */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 35791791a879..69511077c910 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -328,4 +328,7 @@ struct prctl_mm_map {
 # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC	0x10 /* Clear the aspect on exec */
 # define PR_PPC_DEXCR_CTRL_MASK		0x1f
 
+#define PR_ADD_SFRAME			74
+#define PR_REMOVE_SFRAME		75
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index cc760491f201..a216f091edfb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -104,6 +104,7 @@
 #include <linux/rseq.h>
 #include <uapi/linux/pidfd.h>
 #include <linux/pidfs.h>
+#include <linux/sframe.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -923,6 +924,7 @@ void __mmdrop(struct mm_struct *mm)
 	mm_pasid_drop(mm);
 	mm_destroy_cid(mm);
 	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
+	sframe_free_mm(mm);
 
 	free_mm(mm);
 }
@@ -1249,6 +1251,13 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
 #endif
 }
 
+static void mm_init_sframe(struct mm_struct *mm)
+{
+#ifdef CONFIG_HAVE_USER_UNWIND_SFRAME
+	mt_init(&mm->sframe_mt);
+#endif
+}
+
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	struct user_namespace *user_ns)
 {
@@ -1280,6 +1289,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm->pmd_huge_pte = NULL;
 #endif
 	mm_init_uprobes_state(mm);
+	mm_init_sframe(mm);
 	hugetlb_count_init(mm);
 
 	if (current->mm) {
diff --git a/kernel/sys.c b/kernel/sys.c
index 3a2df1bd9f64..e4d2b64f4ae4 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -64,6 +64,7 @@
 #include <linux/rcupdate.h>
 #include <linux/uidgid.h>
 #include <linux/cred.h>
+#include <linux/sframe.h>
 
 #include <linux/nospec.h>
 
@@ -2782,6 +2783,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_RISCV_SET_ICACHE_FLUSH_CTX:
 		error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
 		break;
+	case PR_ADD_SFRAME:
+		if (arg5)
+			return -EINVAL;
+		error = sframe_add_section(arg2, arg3, arg4);
+		break;
+	case PR_REMOVE_SFRAME:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = sframe_remove_section(arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
index eb466d6a3295..6f202c5840cf 100644
--- a/kernel/unwind/Makefile
+++ b/kernel/unwind/Makefile
@@ -1 +1,2 @@
 obj-$(CONFIG_HAVE_USER_UNWIND) += user.o
+obj-$(CONFIG_HAVE_USER_UNWIND_SFRAME) += sframe.o
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
new file mode 100644
index 000000000000..3e4d29e737a1
--- /dev/null
+++ b/kernel/unwind/sframe.c
@@ -0,0 +1,420 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/srcu.h>
+#include <linux/uaccess.h>
+#include <linux/mm.h>
+#include <linux/sframe.h>
+#include <linux/user_unwind.h>
+
+#include "sframe.h"
+
+#define SFRAME_FILENAME_LEN 32
+
+struct sframe_section {
+	struct rcu_head rcu;
+
+	unsigned long sframe_addr;
+	unsigned long text_addr;
+
+	unsigned long fdes_addr;
+	unsigned long fres_addr;
+	unsigned int  fdes_nr;
+	signed char ra_off, fp_off;
+};
+
+DEFINE_STATIC_SRCU(sframe_srcu);
+
+#define __SFRAME_GET_USER(out, user_ptr, type)				\
+({									\
+	type __tmp;							\
+	if (get_user(__tmp, (type *)user_ptr))				\
+		return -EFAULT;						\
+	user_ptr += sizeof(__tmp);					\
+	out = __tmp;							\
+})
+
+#define SFRAME_GET_USER_SIGNED(out, user_ptr, size)			\
+({									\
+	switch (size) {							\
+	case 1:								\
+		__SFRAME_GET_USER(out, user_ptr, s8);			\
+		break;							\
+	case 2:								\
+		__SFRAME_GET_USER(out, user_ptr, s16);			\
+		break;							\
+	case 4:								\
+		__SFRAME_GET_USER(out, user_ptr, s32);			\
+		break;							\
+	default:							\
+		return -EINVAL;						\
+	}								\
+})
+
+#define SFRAME_GET_USER_UNSIGNED(out, user_ptr, size)			\
+({									\
+	switch (size) {							\
+	case 1:								\
+		__SFRAME_GET_USER(out, user_ptr, u8);			\
+		break;							\
+	case 2:								\
+		__SFRAME_GET_USER(out, user_ptr, u16);			\
+		break;							\
+	case 4:								\
+		__SFRAME_GET_USER(out, user_ptr, u32);			\
+		break;							\
+	default:							\
+		return -EINVAL;						\
+	}								\
+})
+
+static unsigned char fre_type_to_size(unsigned char fre_type)
+{
+	if (fre_type > 2)
+		return 0;
+	return 1 << fre_type;
+}
+
+static unsigned char offset_size_enum_to_size(unsigned char off_size)
+{
+	if (off_size > 2)
+		return 0;
+	return 1 << off_size;
+}
+
+static int find_fde(struct sframe_section *sec, unsigned long ip,
+		    struct sframe_fde *fde)
+{
+	s32 func_off, ip_off;
+	struct sframe_fde __user *first, *last, *mid, *found = NULL;
+
+	ip_off = ip - sec->sframe_addr;
+
+	first = (void *)sec->fdes_addr;
+	last = first + sec->fdes_nr - 1;
+	while (first <= last) {
+		mid = first + ((last - first) / 2);
+		if (get_user(func_off, (s32 *)mid))
+			return -EFAULT;
+		if (ip_off >= func_off) {
+			found = mid;
+			first = mid + 1;
+		} else {
+			last = mid - 1;
+		}
+	}
+
+	if (!found)
+		return -EINVAL;
+
+	if (copy_from_user(fde, found, sizeof(*fde)))
+		return -EFAULT;
+
+	return 0;
+}
+
+static int find_fre(struct sframe_section *sec, struct sframe_fde *fde,
+		    unsigned long ip, struct user_unwind_frame *frame)
+{
+	unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
+	unsigned char fre_type = SFRAME_FUNC_FRE_TYPE(fde->info);
+	s32 fre_ip_off, cfa_off, ra_off, fp_off, ip_off;
+	unsigned char offset_count, offset_size;
+	unsigned char addr_size;
+	void __user *f, *last_f = NULL;
+	u8 fre_info;
+	int i;
+
+	addr_size = fre_type_to_size(fre_type);
+	if (!addr_size)
+		return -EINVAL;
+
+	ip_off = ip - sec->sframe_addr - fde->start_addr;
+
+	f = (void *)sec->fres_addr + fde->fres_off;
+
+	for (i = 0; i < fde->fres_num; i++) {
+
+		SFRAME_GET_USER_UNSIGNED(fre_ip_off, f, addr_size);
+
+		if (fde_type == SFRAME_FDE_TYPE_PCINC) {
+			if (fre_ip_off > ip_off)
+				break;
+		} else {
+			/* SFRAME_FDE_TYPE_PCMASK */
+			if (ip_off % fde->rep_size < fre_ip_off)
+				break;
+		}
+
+		SFRAME_GET_USER_UNSIGNED(fre_info, f, 1);
+
+		offset_count = SFRAME_FRE_OFFSET_COUNT(fre_info);
+		offset_size  = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(fre_info));
+
+		if (!offset_count || !offset_size)
+			return -EINVAL;
+
+		last_f = f;
+		f += offset_count * offset_size;
+	}
+
+	if (!last_f)
+		return -EINVAL;
+
+	f = last_f;
+
+	SFRAME_GET_USER_UNSIGNED(cfa_off, f, offset_size);
+	offset_count--;
+
+	ra_off = sec->ra_off;
+	if (!ra_off) {
+		if (!offset_count--)
+			return -EINVAL;
+		SFRAME_GET_USER_SIGNED(ra_off, f, offset_size);
+	}
+
+	fp_off = sec->fp_off;
+	if (!fp_off && offset_count) {
+		offset_count--;
+		SFRAME_GET_USER_SIGNED(fp_off, f, offset_size);
+	}
+
+	if (offset_count)
+		return -EINVAL;
+
+	frame->cfa_off = cfa_off;
+	frame->ra_off = ra_off;
+	frame->fp_off = fp_off;
+	frame->use_fp = SFRAME_FRE_CFA_BASE_REG_ID(fre_info) == SFRAME_BASE_REG_FP;
+
+	return 0;
+}
+
+int sframe_find(unsigned long ip, struct user_unwind_frame *frame)
+{
+	struct mm_struct *mm = current->mm;
+	struct sframe_section *sec;
+	struct sframe_fde fde;
+	int srcu_idx;
+	int ret = -EINVAL;
+
+	srcu_idx = srcu_read_lock(&sframe_srcu);
+
+	sec = mtree_load(&mm->sframe_mt, ip);
+	if (!sec) {
+		srcu_read_unlock(&sframe_srcu, srcu_idx);
+		return -EINVAL;
+	}
+
+	ret = find_fde(sec, ip, &fde);
+	if (ret)
+		goto err_unlock;
+
+	ret = find_fre(sec, &fde, ip, frame);
+	if (ret)
+		goto err_unlock;
+
+	srcu_read_unlock(&sframe_srcu, srcu_idx);
+	return 0;
+
+err_unlock:
+	srcu_read_unlock(&sframe_srcu, srcu_idx);
+	return ret;
+}
+
+static int get_sframe_file(unsigned long sframe_addr, struct sframe_file *file)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *sframe_vma, *text_vma, *vma;
+	VMA_ITERATOR(vmi, mm, 0);
+
+	mmap_read_lock(mm);
+
+	sframe_vma = vma_lookup(mm, sframe_addr);
+	if (!sframe_vma || !sframe_vma->vm_file)
+		goto err_unlock;
+
+	text_vma = NULL;
+
+	for_each_vma(vmi, vma) {
+		if (vma->vm_file != sframe_vma->vm_file)
+			continue;
+		if (vma->vm_flags & VM_EXEC) {
+			if (text_vma) {
+				/*
+				 * Multiple EXEC segments in a single file
+				 * aren't currently supported, is that a thing?
+				 */
+				mmap_read_unlock(mm);
+				pr_warn_once("unsupported multiple EXEC segments in task %s[%d]\n",
+					     current->comm, current->pid);
+				return -EINVAL;
+			}
+			text_vma = vma;
+		}
+	}
+
+	if (!text_vma)
+		goto err_unlock;
+
+	file->sframe_addr	= sframe_addr;
+	file->text_start	= text_vma->vm_start;
+	file->text_end		= text_vma->vm_end;
+
+	mmap_read_unlock(mm);
+	return 0;
+
+err_unlock:
+	mmap_read_unlock(mm);
+	return -EINVAL;
+}
+
+static int validate_sframe_addrs(struct sframe_file *file)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *text_vma;
+
+	mmap_read_lock(mm);
+
+	if (!vma_lookup(mm, file->sframe_addr))
+		goto err_unlock;
+
+	text_vma = vma_lookup(mm, file->text_start);
+	if (!text_vma || !(text_vma->vm_flags & VM_EXEC))
+		goto err_unlock;
+
+	if (vma_lookup(mm, file->text_end - 1) != text_vma)
+		goto err_unlock;
+
+	mmap_read_unlock(mm);
+	return 0;
+
+err_unlock:
+	mmap_read_unlock(mm);
+	return -EINVAL;
+}
+
+int __sframe_add_section(struct sframe_file *file)
+{
+	struct maple_tree *sframe_mt = &current->mm->sframe_mt;
+	struct sframe_section *sec;
+	struct sframe_header shdr;
+	unsigned long header_end;
+	int ret;
+
+	if (copy_from_user(&shdr, (void *)file->sframe_addr, sizeof(shdr)))
+		return -EFAULT;
+
+	if (shdr.preamble.magic != SFRAME_MAGIC ||
+	    shdr.preamble.version != SFRAME_VERSION_2 ||
+	    !(shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
+	    shdr.auxhdr_len || !shdr.num_fdes || !shdr.num_fres ||
+	    shdr.fdes_off > shdr.fres_off) {
+		/*
+		 * Either binutils < 2.41, corrupt sframe header, or
+		 * unsupported feature.
+		 */
+		pr_warn_once("bad sframe header in task %s[%d]\n",
+			     current->comm, current->pid);
+		return -EINVAL;
+	}
+
+	header_end = file->sframe_addr + SFRAME_HDR_SIZE(shdr);
+
+	sec = kmalloc(sizeof(*sec), GFP_KERNEL);
+	if (!sec)
+		return -ENOMEM;
+
+	sec->sframe_addr	= file->sframe_addr;
+	sec->text_addr		= file->text_start;
+	sec->fdes_addr		= header_end + shdr.fdes_off;
+	sec->fres_addr		= header_end + shdr.fres_off;
+	sec->fdes_nr		= shdr.num_fdes;
+	sec->ra_off		= shdr.cfa_fixed_ra_offset;
+	sec->fp_off		= shdr.cfa_fixed_fp_offset;
+
+	ret = mtree_insert_range(sframe_mt, file->text_start, file->text_end,
+				 sec, GFP_KERNEL);
+	if (ret) {
+		kfree(sec);
+		return ret;
+	}
+
+	return 0;
+}
+
+int sframe_add_section(unsigned long sframe_addr, unsigned long text_start, unsigned long text_end)
+{
+	struct sframe_file file;
+	int ret;
+
+	if (!text_start || !text_end) {
+		ret = get_sframe_file(sframe_addr, &file);
+		if (ret)
+			return ret;
+	} else {
+		/*
+		 * This is mainly for generated code, for which the text isn't
+		 * file-backed, so the user has to supply the text bounds.
+		 */
+		file.sframe_addr	= sframe_addr;
+		file.text_start		= text_start;
+		file.text_end		= text_end;
+		ret = validate_sframe_addrs(&file);
+		if (ret)
+			return ret;
+	}
+
+	return __sframe_add_section(&file);
+}
+
+static void sframe_free_rcu(struct rcu_head *rcu)
+{
+	struct sframe_section *sec = container_of(rcu, struct sframe_section, rcu);
+
+	kfree(sec);
+}
+
+static int __sframe_remove_section(struct mm_struct *mm,
+				   struct sframe_section *sec)
+{
+	struct sframe_section *s;
+
+	s = mtree_erase(&mm->sframe_mt, sec->text_addr);
+	if (!s || WARN_ON_ONCE(s != sec))
+		return -EINVAL;
+
+	call_srcu(&sframe_srcu, &sec->rcu, sframe_free_rcu);
+
+	return 0;
+}
+
+int sframe_remove_section(unsigned long sframe_addr)
+{
+	struct mm_struct *mm = current->mm;
+	struct sframe_section *sec;
+	unsigned long index = 0;
+
+	sec = mtree_load(&mm->sframe_mt, sframe_addr);
+	if (!sec)
+		return -EINVAL;
+
+	mt_for_each(&mm->sframe_mt, sec, index, ULONG_MAX) {
+		if (sec->sframe_addr == sframe_addr)
+			return __sframe_remove_section(mm, sec);
+	}
+
+	return -EINVAL;
+}
+
+void sframe_free_mm(struct mm_struct *mm)
+{
+	struct sframe_section *sec;
+	unsigned long index = 0;
+
+	if (!mm)
+		return;
+
+	mt_for_each(&mm->sframe_mt, sec, index, ULONG_MAX)
+		kfree(sec);
+
+	mtree_destroy(&mm->sframe_mt);
+}
diff --git a/kernel/unwind/sframe.h b/kernel/unwind/sframe.h
new file mode 100644
index 000000000000..aa468d6f1f4a
--- /dev/null
+++ b/kernel/unwind/sframe.h
@@ -0,0 +1,215 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ *
+ * This file contains definitions for the SFrame stack tracing format, which is
+ * documented at https://sourceware.org/binutils/docs
+ */
+#ifndef _SFRAME_H
+#define _SFRAME_H
+
+#include <linux/types.h>
+
+#define SFRAME_VERSION_1	1
+#define SFRAME_VERSION_2	2
+#define SFRAME_MAGIC		0xdee2
+
+/* Function Descriptor Entries are sorted on PC. */
+#define SFRAME_F_FDE_SORTED	0x1
+/* Frame-pointer based stack tracing. Defined, but not set. */
+#define SFRAME_F_FRAME_POINTER	0x2
+
+#define SFRAME_CFA_FIXED_FP_INVALID 0
+#define SFRAME_CFA_FIXED_RA_INVALID 0
+
+/* Supported ABIs/Arch. */
+#define SFRAME_ABI_AARCH64_ENDIAN_BIG	    1 /* AARCH64 big endian. */
+#define SFRAME_ABI_AARCH64_ENDIAN_LITTLE    2 /* AARCH64 little endian. */
+#define SFRAME_ABI_AMD64_ENDIAN_LITTLE	    3 /* AMD64 little endian. */
+
+/* SFrame FRE types. */
+#define SFRAME_FRE_TYPE_ADDR1	0
+#define SFRAME_FRE_TYPE_ADDR2	1
+#define SFRAME_FRE_TYPE_ADDR4	2
+
+/*
+ * SFrame Function Descriptor Entry types.
+ *
+ * The SFrame format has two possible representations for functions. The
+ * choice of which type to use is made according to the instruction patterns
+ * in the relevant program stub.
+ */
+
+/* Unwinders perform a (PC >= FRE_START_ADDR) comparison to look up a matching FRE. */
+#define SFRAME_FDE_TYPE_PCINC	0
+/*
+ * Unwinders perform a (PC & FRE_START_ADDR_AS_MASK >= FRE_START_ADDR_AS_MASK)
+ * comparison to look up a matching FRE. Typical use cases are pltN entries,
+ * trampolines, etc.
+ */
+#define SFRAME_FDE_TYPE_PCMASK	1
+
+/**
+ * struct sframe_preamble - SFrame Preamble.
+ * @magic: Magic number (SFRAME_MAGIC).
+ * @version: Format version number (SFRAME_VERSION).
+ * @flags: Various flags.
+ */
+struct sframe_preamble {
+	u16 magic;
+	u8  version;
+	u8  flags;
+} __packed;
+
+/**
+ * struct sframe_header - SFrame Header.
+ * @preamble: SFrame preamble.
+ * @abi_arch: Identify the arch (including endianness) and ABI.
+ * @cfa_fixed_fp_offset: Offset for the Frame Pointer (FP) from CFA may be
+ *	  fixed for some ABIs (e.g., in AMD64 when -fno-omit-frame-pointer is
+ *	  used). When fixed, this field specifies the fixed stack frame offset
+ *	  and the individual FREs do not need to track it. When not fixed, it
+ *	  is set to SFRAME_CFA_FIXED_FP_INVALID, and the individual FREs may
+ *	  provide the applicable stack frame offset, if any.
+ * @cfa_fixed_ra_offset: Offset for the Return Address from CFA is fixed for
+ *	  some ABIs. When fixed, this field specifies the fixed stack frame
+ *	  offset and the individual FREs do not need to track it. When not
+ *	  fixed, it is set to SFRAME_CFA_FIXED_RA_INVALID.
+ * @auxhdr_len: Number of bytes making up the auxiliary header, if any.
+ *	  Some ABIs/arches may, in the future, use this space to extend the
+ *	  information in the SFrame header. The auxiliary header is contained
+ *	  in the bytes sequentially following the sframe_header.
+ * @num_fdes: Number of SFrame FDEs in this SFrame section.
+ * @num_fres: Number of SFrame Frame Row Entries.
+ * @fre_len:  Number of bytes in the SFrame Frame Row Entry section.
+ * @fdes_off: Offset of SFrame Function Descriptor Entry section.
+ * @fres_off: Offset of SFrame Frame Row Entry section.
+ */
+struct sframe_header {
+	struct sframe_preamble preamble;
+	u8  abi_arch;
+	s8  cfa_fixed_fp_offset;
+	s8  cfa_fixed_ra_offset;
+	u8  auxhdr_len;
+	u32 num_fdes;
+	u32 num_fres;
+	u32 fre_len;
+	u32 fdes_off;
+	u32 fres_off;
+} __packed;
+
+#define SFRAME_HDR_SIZE(sframe_hdr)	\
+	((sizeof(struct sframe_header) + (sframe_hdr).auxhdr_len))
+
+/* Two possible keys for executable (instruction) pointers signing. */
+#define SFRAME_AARCH64_PAUTH_KEY_A    0 /* Key A. */
+#define SFRAME_AARCH64_PAUTH_KEY_B    1 /* Key B. */
+
+/**
+ * struct sframe_fde - SFrame Function Descriptor Entry.
+ * @start_addr: Function start address. Encoded as a signed offset,
+ *	  relative to the current FDE.
+ * @size: Size of the function in bytes.
+ * @fres_off: Offset of the first SFrame Frame Row Entry of the function,
+ *	  relative to the beginning of the SFrame Frame Row Entry sub-section.
+ * @fres_num: Number of frame row entries for the function.
+ * @info: Additional information for deciphering the stack trace
+ *	  information for the function. Contains information about SFrame FRE
+ *	  type, SFrame FDE type, PAC authorization A/B key, etc.
+ * @rep_size: Block size for SFRAME_FDE_TYPE_PCMASK
+ * @padding: Unused
+ */
+struct sframe_fde {
+	s32 start_addr;
+	u32 size;
+	u32 fres_off;
+	u32 fres_num;
+	u8  info;
+	u8  rep_size;
+	u16 padding;
+} __packed;
+
+/*
+ * 'func_info' in SFrame FDE contains additional information for deciphering
+ * the stack trace information for the function. In V1, the information is
+ * organized as follows:
+ *   - 4-bits: Identify the FRE type used for the function.
+ *   - 1-bit: Identify the FDE type of the function - mask or inc.
+ *   - 1-bit: PAC authorization A/B key (aarch64).
+ *   - 2-bits: Unused.
+ * ---------------------------------------------------------------------
+ * |  Unused  |  PAC auth A/B key (aarch64) |  FDE type |   FRE type   |
+ * |          |        Unused (amd64)       |           |              |
+ * ---------------------------------------------------------------------
+ * 8          6                             5           4              0
+ */
+
+/* Note: Set PAC auth key to SFRAME_AARCH64_PAUTH_KEY_A by default.  */
+#define SFRAME_FUNC_INFO(fde_type, fre_enc_type) \
+	(((SFRAME_AARCH64_PAUTH_KEY_A & 0x1) << 5) | \
+	 (((fde_type) & 0x1) << 4) | ((fre_enc_type) & 0xf))
+
+#define SFRAME_FUNC_FRE_TYPE(data)	  ((data) & 0xf)
+#define SFRAME_FUNC_FDE_TYPE(data)	  (((data) >> 4) & 0x1)
+#define SFRAME_FUNC_PAUTH_KEY(data)	  (((data) >> 5) & 0x1)
+
+/*
+ * Size of stack frame offsets in an SFrame Frame Row Entry. A single
+ * SFrame FRE has all offsets of the same size. Offset size may vary
+ * across frame row entries.
+ */
+#define SFRAME_FRE_OFFSET_1B	  0
+#define SFRAME_FRE_OFFSET_2B	  1
+#define SFRAME_FRE_OFFSET_4B	  2
+
+/* An SFrame Frame Row Entry can be SP or FP based.  */
+#define SFRAME_BASE_REG_FP	0
+#define SFRAME_BASE_REG_SP	1
+
+/*
+ * The index at which a specific offset is presented in the variable length
+ * bytes of an FRE.
+ */
+#define SFRAME_FRE_CFA_OFFSET_IDX   0
+/*
+ * The RA stack offset, if present, will always be at index 1 in the variable
+ * length bytes of the FRE.
+ */
+#define SFRAME_FRE_RA_OFFSET_IDX    1
+/*
+ * The FP stack offset may appear at offset 1 or 2, depending on the ABI as RA
+ * may or may not be tracked.
+ */
+#define SFRAME_FRE_FP_OFFSET_IDX    2
+
+/*
+ * 'fre_info' in SFrame FRE contains information about:
+ *   - 1 bit: base reg for CFA
+ *   - 4 bits: Number of offsets (N). A value of up to 3 is allowed to track
+ *   all three of CFA, FP and RA (fixed implicit order).
+ *   - 2 bits: information about size of the offsets (S) in bytes.
+ *     Valid values are SFRAME_FRE_OFFSET_1B, SFRAME_FRE_OFFSET_2B,
+ *     SFRAME_FRE_OFFSET_4B
+ *   - 1 bit: Mangled RA state bit (aarch64 only).
+ * ---------------------------------------------------------------
+ * | Mangled-RA (aarch64) |  Size of   |   Number of  | base_reg |
+ * |  Unused (amd64)      |  offsets   |    offsets   |          |
+ * ---------------------------------------------------------------
+ * 8                      7             5             1          0
+ */
+
+/* Note: Set mangled_ra_p to zero by default. */
+#define SFRAME_FRE_INFO(base_reg_id, offset_num, offset_size) \
+	(((0 & 0x1) << 7) | (((offset_size) & 0x3) << 5) | \
+	 (((offset_num) & 0xf) << 1) | ((base_reg_id) & 0x1))
+
+/* Set the mangled_ra_p bit as indicated. */
+#define SFRAME_FRE_INFO_UPDATE_MANGLED_RA_P(mangled_ra_p, fre_info) \
+	((((mangled_ra_p) & 0x1) << 7) | ((fre_info) & 0x7f))
+
+#define SFRAME_FRE_CFA_BASE_REG_ID(data)	  ((data) & 0x1)
+#define SFRAME_FRE_OFFSET_COUNT(data)		  (((data) >> 1) & 0xf)
+#define SFRAME_FRE_OFFSET_SIZE(data)		  (((data) >> 5) & 0x3)
+#define SFRAME_FRE_MANGLED_RA_P(data)		  (((data) >> 7) & 0x1)
+
+#endif /* _SFRAME_H */
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 5d16f9604a61..3a7b14cf522b 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -8,6 +8,7 @@
 #include <linux/sched.h>
 #include <linux/sched/task_stack.h>
 #include <linux/user_unwind.h>
+#include <linux/sframe.h>
 #include <linux/uaccess.h>
 #include <asm/user_unwind.h>
 
@@ -29,6 +30,11 @@ int user_unwind_next(struct user_unwind_state *state)
 	case USER_UNWIND_TYPE_FP:
 		frame = &fp_frame;
 		break;
+	case USER_UNWIND_TYPE_SFRAME:
+		ret = sframe_find(state->ip, frame);
+		if (ret)
+			goto the_end;
+		break;
 	default:
 		BUG();
 	}
@@ -57,6 +63,7 @@ int user_unwind_start(struct user_unwind_state *state,
 		      enum user_unwind_type type)
 {
 	struct pt_regs *regs = task_pt_regs(current);
+	bool sframe_possible = current_has_sframe();
 
 	memset(state, 0, sizeof(*state));
 
@@ -67,6 +74,13 @@ int user_unwind_start(struct user_unwind_state *state,
 
 	switch (type) {
 	case USER_UNWIND_TYPE_AUTO:
+		state->type = sframe_possible ? USER_UNWIND_TYPE_SFRAME :
+						USER_UNWIND_TYPE_FP;
+		break;
+	case USER_UNWIND_TYPE_SFRAME:
+		if (!sframe_possible)
+			return -EINVAL;
+		break;
 	case USER_UNWIND_TYPE_FP:
 		break;
 	default:
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 24c809379274..c4c6af046778 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -11,6 +11,7 @@
 #include <linux/atomic.h>
 #include <linux/user_namespace.h>
 #include <linux/iommu.h>
+#include <linux/sframe.h>
 #include <asm/mmu.h>
 
 #ifndef INIT_MM_CONTEXT
@@ -44,7 +45,8 @@ struct mm_struct init_mm = {
 #endif
 	.user_ns	= &init_user_ns,
 	.cpu_bitmap	= CPU_BITS_NONE,
-	INIT_MM_CONTEXT(init_mm)
+	INIT_MM_CONTEXT(init_mm),
+	INIT_MM_SFRAME,
 };
 
 void setup_initial_init_mm(void *start_code, void *end_code,
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v2 04/11] unwind/x86/64: Add HAVE_USER_UNWIND_SFRAME
  2024-09-13 23:02 [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains Josh Poimboeuf
                   ` (2 preceding siblings ...)
  2024-09-13 23:02 ` [PATCH v2 03/11] unwind: Introduce SFrame user space unwinding Josh Poimboeuf
@ 2024-09-13 23:02 ` Josh Poimboeuf
  2024-09-13 23:02 ` [PATCH v2 05/11] perf/x86: Use user_unwind interface Josh Poimboeuf
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Josh Poimboeuf @ 2024-09-13 23:02 UTC (permalink / raw)
  To: x86
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James

Binutils 2.41 supports generating sframe v2 for x86_64.  It works well
in testing, so enable it.

NOTE: An out-of-tree glibc patch is still needed to enable setting
PR_ADD_SFRAME for shared libraries and dlopens.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 266edff59058..0b3d7c72b65b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -287,6 +287,7 @@ config X86
 	select HAVE_UNSTABLE_SCHED_CLOCK
 	select HAVE_USER_RETURN_NOTIFIER
 	select HAVE_USER_UNWIND
+	select HAVE_USER_UNWIND_SFRAME		if X86_64
 	select HAVE_GENERIC_VDSO
 	select VDSO_GETRANDOM			if X86_64
 	select HOTPLUG_PARALLEL			if SMP && X86_64
-- 
2.46.0



* [PATCH v2 05/11] perf/x86: Use user_unwind interface
  2024-09-13 23:02 [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains Josh Poimboeuf
                   ` (3 preceding siblings ...)
  2024-09-13 23:02 ` [PATCH v2 04/11] unwind/x86/64: Add HAVE_USER_UNWIND_SFRAME Josh Poimboeuf
@ 2024-09-13 23:02 ` Josh Poimboeuf
  2024-09-16  6:48   ` kernel test robot
  2024-09-13 23:02 ` [PATCH v2 06/11] perf: Remove get_perf_callchain() 'init_nr' argument Josh Poimboeuf
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Josh Poimboeuf @ 2024-09-13 23:02 UTC (permalink / raw)
  To: x86
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James

Simplify __perf_callchain_user() and prepare to enable deferred sframe
unwinding by switching to the generic user unwind interface.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/x86/events/core.c | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index be01823b1bb4..e82aadf99d9b 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -29,6 +29,7 @@
 #include <linux/device.h>
 #include <linux/nospec.h>
 #include <linux/static_call.h>
+#include <linux/user_unwind.h>
 
 #include <asm/apic.h>
 #include <asm/stacktrace.h>
@@ -2862,8 +2863,7 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
 void
 perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
 {
-	struct stack_frame frame;
-	const struct stack_frame __user *fp;
+	struct user_unwind_state state;
 
 	if (perf_guest_state()) {
 		/* TODO: We don't support guest os callchain now */
@@ -2876,8 +2876,6 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
 	if (regs->flags & (X86_VM_MASK | PERF_EFLAGS_VM))
 		return;
 
-	fp = (void __user *)regs->bp;
-
 	perf_callchain_store(entry, regs->ip);
 
 	if (!nmi_uaccess_okay())
@@ -2887,18 +2885,12 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
 		return;
 
 	pagefault_disable();
-	while (entry->nr < entry->max_stack) {
-		if (!valid_user_frame(fp, sizeof(frame)))
-			break;
 
-		if (__get_user(frame.next_frame, &fp->next_frame))
+	for_each_user_frame(&state, USER_UNWIND_TYPE_FP) {
+		if (perf_callchain_store(entry, state.ip))
 			break;
-		if (__get_user(frame.return_address, &fp->return_address))
-			break;
-
-		perf_callchain_store(entry, frame.return_address);
-		fp = (void __user *)frame.next_frame;
 	}
+
 	pagefault_enable();
 }
 
-- 
2.46.0



* [PATCH v2 06/11] perf: Remove get_perf_callchain() 'init_nr' argument
  2024-09-13 23:02 [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains Josh Poimboeuf
                   ` (4 preceding siblings ...)
  2024-09-13 23:02 ` [PATCH v2 05/11] perf/x86: Use user_unwind interface Josh Poimboeuf
@ 2024-09-13 23:02 ` Josh Poimboeuf
  2024-09-13 23:02 ` [PATCH v2 07/11] perf: Remove get_perf_callchain() 'crosstask' argument Josh Poimboeuf
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Josh Poimboeuf @ 2024-09-13 23:02 UTC (permalink / raw)
  To: x86
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James, Namhyung Kim

The 'init_nr' argument has double duty: it's used to initialize both the
number of contexts and the number of stack entries.  That's confusing
and the callers always pass zero anyway.  Hard code the zero.

Acked-by: Namhyung Kim <Namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 include/linux/perf_event.h |  2 +-
 kernel/bpf/stackmap.c      |  4 ++--
 kernel/events/callchain.c  | 12 ++++++------
 kernel/events/core.c       |  2 +-
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1a8942277dda..4365bb0684a5 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1540,7 +1540,7 @@ DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
 extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern struct perf_callchain_entry *
-get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
+get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		   u32 max_stack, bool crosstask, bool add_mark);
 extern int get_callchain_buffers(int max_stack);
 extern void put_callchain_buffers(void);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index c99f8e5234ac..0f922e43b524 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -297,7 +297,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	if (max_depth > sysctl_perf_event_max_stack)
 		max_depth = sysctl_perf_event_max_stack;
 
-	trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
+	trace = get_perf_callchain(regs, kernel, user, max_depth,
 				   false, false);
 
 	if (unlikely(!trace))
@@ -432,7 +432,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 	else if (kernel && task)
 		trace = get_callchain_entry_for_task(task, max_depth);
 	else
-		trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
+		trace = get_perf_callchain(regs, kernel, user, max_depth,
 					   crosstask, false);
 	if (unlikely(!trace))
 		goto err_fault;
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 8a47e52a454f..83834203e144 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -216,7 +216,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
 }
 
 struct perf_callchain_entry *
-get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
+get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		   u32 max_stack, bool crosstask, bool add_mark)
 {
 	struct perf_callchain_entry *entry;
@@ -227,11 +227,11 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
 	if (!entry)
 		return NULL;
 
-	ctx.entry     = entry;
-	ctx.max_stack = max_stack;
-	ctx.nr	      = entry->nr = init_nr;
-	ctx.contexts       = 0;
-	ctx.contexts_maxed = false;
+	ctx.entry		= entry;
+	ctx.max_stack		= max_stack;
+	ctx.nr			= entry->nr = 0;
+	ctx.contexts		= 0;
+	ctx.contexts_maxed	= false;
 
 	if (kernel && !user_mode(regs)) {
 		if (add_mark)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8a6c6bbcd658..8038cd4d981b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7694,7 +7694,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 	if (!kernel && !user)
 		return &__empty_callchain;
 
-	callchain = get_perf_callchain(regs, 0, kernel, user,
+	callchain = get_perf_callchain(regs, kernel, user,
 				       max_stack, crosstask, true);
 	return callchain ?: &__empty_callchain;
 }
-- 
2.46.0



* [PATCH v2 07/11] perf: Remove get_perf_callchain() 'crosstask' argument
  2024-09-13 23:02 [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains Josh Poimboeuf
                   ` (5 preceding siblings ...)
  2024-09-13 23:02 ` [PATCH v2 06/11] perf: Remove get_perf_callchain() 'init_nr' argument Josh Poimboeuf
@ 2024-09-13 23:02 ` Josh Poimboeuf
  2024-09-13 23:02 ` [PATCH v2 08/11] perf: Simplify get_perf_callchain() user logic Josh Poimboeuf
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Josh Poimboeuf @ 2024-09-13 23:02 UTC (permalink / raw)
  To: x86
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James

get_perf_callchain() doesn't support cross-task unwinding, so it doesn't
make much sense to have 'crosstask' as an argument.  Instead, have
perf_callchain() adjust 'user' accordingly.

Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 include/linux/perf_event.h | 2 +-
 kernel/bpf/stackmap.c      | 5 ++---
 kernel/events/callchain.c  | 6 +-----
 kernel/events/core.c       | 8 ++++----
 4 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 4365bb0684a5..64f9efe19553 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1541,7 +1541,7 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark);
+		   u32 max_stack, bool add_mark);
 extern int get_callchain_buffers(int max_stack);
 extern void put_callchain_buffers(void);
 extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 0f922e43b524..ff6f0ef7ba5d 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -297,8 +297,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	if (max_depth > sysctl_perf_event_max_stack)
 		max_depth = sysctl_perf_event_max_stack;
 
-	trace = get_perf_callchain(regs, kernel, user, max_depth,
-				   false, false);
+	trace = get_perf_callchain(regs, kernel, user, max_depth, false);
 
 	if (unlikely(!trace))
 		/* couldn't fetch the stack trace */
@@ -433,7 +432,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 		trace = get_callchain_entry_for_task(task, max_depth);
 	else
 		trace = get_perf_callchain(regs, kernel, user, max_depth,
-					   crosstask, false);
+					   false);
 	if (unlikely(!trace))
 		goto err_fault;
 
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 83834203e144..655fb25a725b 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -217,7 +217,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
 
 struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark)
+		   u32 max_stack, bool add_mark)
 {
 	struct perf_callchain_entry *entry;
 	struct perf_callchain_entry_ctx ctx;
@@ -248,9 +248,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		}
 
 		if (regs) {
-			if (crosstask)
-				goto exit_put;
-
 			if (add_mark)
 				perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
 
@@ -260,7 +257,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		}
 	}
 
-exit_put:
 	put_callchain_entry(rctx);
 
 	return entry;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8038cd4d981b..19fd7bd38ecf 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7686,16 +7686,16 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 {
 	bool kernel = !event->attr.exclude_callchain_kernel;
 	bool user   = !event->attr.exclude_callchain_user;
-	/* Disallow cross-task user callchains. */
-	bool crosstask = event->ctx->task && event->ctx->task != current;
 	const u32 max_stack = event->attr.sample_max_stack;
 	struct perf_callchain_entry *callchain;
 
+	/* Disallow cross-task user callchains. */
+	user &= !event->ctx->task || event->ctx->task == current;
+
 	if (!kernel && !user)
 		return &__empty_callchain;
 
-	callchain = get_perf_callchain(regs, kernel, user,
-				       max_stack, crosstask, true);
+	callchain = get_perf_callchain(regs, kernel, user, max_stack, true);
 	return callchain ?: &__empty_callchain;
 }
 
-- 
2.46.0



* [PATCH v2 08/11] perf: Simplify get_perf_callchain() user logic
  2024-09-13 23:02 [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains Josh Poimboeuf
                   ` (6 preceding siblings ...)
  2024-09-13 23:02 ` [PATCH v2 07/11] perf: Remove get_perf_callchain() 'crosstask' argument Josh Poimboeuf
@ 2024-09-13 23:02 ` Josh Poimboeuf
  2024-09-13 23:02 ` [PATCH v2 09/11] perf: Introduce deferred user callchains Josh Poimboeuf
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Josh Poimboeuf @ 2024-09-13 23:02 UTC (permalink / raw)
  To: x86
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James

Simplify the get_perf_callchain() user logic a bit.  task_pt_regs()
should never be NULL.

Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 kernel/events/callchain.c | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 655fb25a725b..2278402b7ac9 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -241,22 +241,20 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 
 	if (user) {
 		if (!user_mode(regs)) {
-			if  (current->mm)
-				regs = task_pt_regs(current);
-			else
-				regs = NULL;
+			if (!current->mm)
+				goto exit_put;
+			regs = task_pt_regs(current);
 		}
 
-		if (regs) {
-			if (add_mark)
-				perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
+		if (add_mark)
+			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
 
-			start_entry_idx = entry->nr;
-			perf_callchain_user(&ctx, regs);
-			fixup_uretprobe_trampoline_entries(entry, start_entry_idx);
-		}
+		start_entry_idx = entry->nr;
+		perf_callchain_user(&ctx, regs);
+		fixup_uretprobe_trampoline_entries(entry, start_entry_idx);
 	}
 
+exit_put:
 	put_callchain_entry(rctx);
 
 	return entry;
-- 
2.46.0



* [PATCH v2 09/11] perf: Introduce deferred user callchains
  2024-09-13 23:02 [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains Josh Poimboeuf
                   ` (7 preceding siblings ...)
  2024-09-13 23:02 ` [PATCH v2 08/11] perf: Simplify get_perf_callchain() user logic Josh Poimboeuf
@ 2024-09-13 23:02 ` Josh Poimboeuf
  2024-09-17 22:07   ` Namhyung Kim
  2024-09-13 23:02 ` [PATCH v2 10/11] perf/x86: Add HAVE_PERF_CALLCHAIN_DEFERRED Josh Poimboeuf
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 44+ messages in thread
From: Josh Poimboeuf @ 2024-09-13 23:02 UTC (permalink / raw)
  To: x86
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James

Instead of attempting to unwind user space from the NMI handler, defer
it to run in task context by sending a self-IPI and then scheduling the
unwind to run in the IRQ's exit task work before returning to user space.

This allows the user stack page to be paged in if needed, avoids
duplicate unwinds for kernel-bound workloads, and prepares for SFrame
unwinding (so .sframe sections can be paged in on demand).

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/Kconfig                    |  3 ++
 include/linux/perf_event.h      |  9 +++-
 include/uapi/linux/perf_event.h | 21 ++++++++-
 kernel/bpf/stackmap.c           |  5 +--
 kernel/events/callchain.c       | 12 +++++-
 kernel/events/core.c            | 76 ++++++++++++++++++++++++++++++++-
 6 files changed, 119 insertions(+), 7 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index ff5d5bc5f947..0629c1aa2a5c 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -431,6 +431,9 @@ config HAVE_USER_UNWIND
 config HAVE_USER_UNWIND_SFRAME
 	bool
 
+config HAVE_PERF_CALLCHAIN_DEFERRED
+	bool
+
 config HAVE_PERF_REGS
 	bool
 	help
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 64f9efe19553..a617aad2851c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -787,6 +787,7 @@ struct perf_event {
 	struct callback_head		pending_task;
 	unsigned int			pending_work;
 	struct rcuwait			pending_work_wait;
+	unsigned int			pending_callchain;
 
 	atomic_t			event_limit;
 
@@ -1541,12 +1542,18 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool add_mark);
+		   u32 max_stack, bool add_mark, bool defer_user);
 extern int get_callchain_buffers(int max_stack);
 extern void put_callchain_buffers(void);
 extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
 extern void put_callchain_entry(int rctx);
 
+#ifdef CONFIG_HAVE_PERF_CALLCHAIN_DEFERRED
+extern void perf_callchain_user_deferred(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
+#else
+static inline void perf_callchain_user_deferred(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs) {}
+#endif
+
 extern int sysctl_perf_event_max_stack;
 extern int sysctl_perf_event_max_contexts_per_stack;
 
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 4842c36fdf80..a7f875eb29dd 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -460,7 +460,8 @@ struct perf_event_attr {
 				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec :  1, /* event is removed from task on exec */
 				sigtrap        :  1, /* send synchronous SIGTRAP on event */
-				__reserved_1   : 26;
+				defer_callchain:  1, /* generate PERF_RECORD_CALLCHAIN_DEFERRED records */
+				__reserved_1   : 25;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -1217,6 +1218,23 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_AUX_OUTPUT_HW_ID		= 21,
 
+	/*
+	 * This user callchain capture was deferred until shortly before
+	 * returning to user space.  Previous samples would have kernel
+	 * callchains only and they need to be stitched with this to make full
+	 * callchains.
+	 *
+	 * TODO: do PERF_SAMPLE_{REGS,STACK}_USER also need deferral?
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u64				nr;
+	 *	u64				ips[nr];
+	 *	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_CALLCHAIN_DEFERRED		= 22,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
@@ -1247,6 +1265,7 @@ enum perf_callchain_context {
 	PERF_CONTEXT_HV			= (__u64)-32,
 	PERF_CONTEXT_KERNEL		= (__u64)-128,
 	PERF_CONTEXT_USER		= (__u64)-512,
+	PERF_CONTEXT_USER_DEFERRED	= (__u64)-640,
 
 	PERF_CONTEXT_GUEST		= (__u64)-2048,
 	PERF_CONTEXT_GUEST_KERNEL	= (__u64)-2176,
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index ff6f0ef7ba5d..9d58a84e553a 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -297,8 +297,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	if (max_depth > sysctl_perf_event_max_stack)
 		max_depth = sysctl_perf_event_max_stack;
 
-	trace = get_perf_callchain(regs, kernel, user, max_depth, false);
-
+	trace = get_perf_callchain(regs, kernel, user, max_depth, false, false);
 	if (unlikely(!trace))
 		/* couldn't fetch the stack trace */
 		return -EFAULT;
@@ -432,7 +431,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 		trace = get_callchain_entry_for_task(task, max_depth);
 	else
 		trace = get_perf_callchain(regs, kernel, user, max_depth,
-					   false);
+					   false, false);
 	if (unlikely(!trace))
 		goto err_fault;
 
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 2278402b7ac9..883f0ac9ef3a 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -217,7 +217,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
 
 struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool add_mark)
+		   u32 max_stack, bool add_mark, bool defer_user)
 {
 	struct perf_callchain_entry *entry;
 	struct perf_callchain_entry_ctx ctx;
@@ -246,6 +246,16 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 			regs = task_pt_regs(current);
 		}
 
+		if (defer_user) {
+			/*
+			 * Foretell the coming of a
+			 * PERF_RECORD_CALLCHAIN_DEFERRED sample which can be
+			 * stitched to this one.
+			 */
+			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
+			goto exit_put;
+		}
+
 		if (add_mark)
 			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 19fd7bd38ecf..5fc7c5156287 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6835,6 +6835,12 @@ static void perf_pending_irq(struct irq_work *entry)
 	struct perf_event *event = container_of(entry, struct perf_event, pending_irq);
 	int rctx;
 
+	if (!is_software_event(event)) {
+		if (event->pending_callchain)
+			task_work_add(current, &event->pending_task, TWA_RESUME);
+		return;
+	}
+
 	/*
 	 * If we 'fail' here, that's OK, it means recursion is already disabled
 	 * and we won't recurse 'further'.
@@ -6854,11 +6860,70 @@ static void perf_pending_irq(struct irq_work *entry)
 		perf_swevent_put_recursion_context(rctx);
 }
 
+struct perf_callchain_deferred_event {
+	struct perf_event_header	header;
+	struct perf_callchain_entry	callchain;
+};
+
+#define PERF_CALLCHAIN_DEFERRED_EVENT_SIZE				\
+	sizeof(struct perf_callchain_deferred_event) +			\
+	(sizeof(__u64) * 1) + /* PERF_CONTEXT_USER */			\
+	(sizeof(__u64) * PERF_MAX_STACK_DEPTH)
+
+static void perf_event_callchain_deferred(struct perf_event *event)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	struct perf_callchain_entry *callchain;
+	struct perf_output_handle handle;
+	struct perf_sample_data data;
+	unsigned char buf[PERF_CALLCHAIN_DEFERRED_EVENT_SIZE];
+	struct perf_callchain_entry_ctx ctx;
+	struct perf_callchain_deferred_event *deferred_event;
+
+	deferred_event = (void *)&buf;
+
+	callchain = &deferred_event->callchain;
+	callchain->nr = 0;
+
+	ctx.entry		= callchain;
+	ctx.max_stack		= MIN(event->attr.sample_max_stack,
+				      PERF_MAX_STACK_DEPTH);
+	ctx.nr			= 0;
+	ctx.contexts		= 0;
+	ctx.contexts_maxed	= false;
+
+	perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
+	perf_callchain_user_deferred(&ctx, regs);
+
+	deferred_event->header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
+	deferred_event->header.misc = 0;
+	deferred_event->header.size = sizeof(*deferred_event) +
+				      (callchain->nr * sizeof(u64));
+
+	perf_event_header__init_id(&deferred_event->header, &data, event);
+
+	if (perf_output_begin(&handle, &data, event,
+			      deferred_event->header.size))
+		return;
+
+	perf_output_copy(&handle, deferred_event, deferred_event->header.size);
+	perf_event__output_id_sample(event, &handle, &data);
+	perf_output_end(&handle);
+}
+
 static void perf_pending_task(struct callback_head *head)
 {
 	struct perf_event *event = container_of(head, struct perf_event, pending_task);
 	int rctx;
 
+	if (!is_software_event(event)) {
+		if (event->pending_callchain) {
+			perf_event_callchain_deferred(event);
+			event->pending_callchain = 0;
+		}
+		return;
+	}
+
 	/*
 	 * All accesses to the event must belong to the same implicit RCU read-side
 	 * critical section as the ->pending_work reset. See comment in
@@ -7688,6 +7753,8 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 	bool user   = !event->attr.exclude_callchain_user;
 	const u32 max_stack = event->attr.sample_max_stack;
 	struct perf_callchain_entry *callchain;
+	bool defer_user = IS_ENABLED(CONFIG_HAVE_PERF_CALLCHAIN_DEFERRED) &&
+			  event->attr.defer_callchain;
 
 	/* Disallow cross-task user callchains. */
 	user &= !event->ctx->task || event->ctx->task == current;
@@ -7695,7 +7762,14 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 	if (!kernel && !user)
 		return &__empty_callchain;
 
-	callchain = get_perf_callchain(regs, kernel, user, max_stack, true);
+	callchain = get_perf_callchain(regs, kernel, user, max_stack, true,
+				       defer_user);
+
+	if (user && defer_user && !event->pending_callchain) {
+		event->pending_callchain = 1;
+		irq_work_queue(&event->pending_irq);
+	}
+
 	return callchain ?: &__empty_callchain;
 }
 
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v2 10/11] perf/x86: Add HAVE_PERF_CALLCHAIN_DEFERRED
  2024-09-13 23:02 [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains Josh Poimboeuf
                   ` (8 preceding siblings ...)
  2024-09-13 23:02 ` [PATCH v2 09/11] perf: Introduce deferred user callchains Josh Poimboeuf
@ 2024-09-13 23:02 ` Josh Poimboeuf
  2024-09-13 23:02 ` [PATCH v2 11/11] perf/x86: Enable SFrame unwinding for deferred user callchains Josh Poimboeuf
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Josh Poimboeuf @ 2024-09-13 23:02 UTC (permalink / raw)
  To: x86
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James

Enable deferred user space unwinding on x86.  Frame pointers are still
the default for now.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/x86/Kconfig       |  1 +
 arch/x86/events/core.c | 52 +++++++++++++++++++++++++++---------------
 2 files changed, 34 insertions(+), 19 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0b3d7c72b65b..24d9373cc5e6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -265,6 +265,7 @@ config X86
 	select HAVE_PERF_EVENTS_NMI
 	select HAVE_HARDLOCKUP_DETECTOR_PERF	if PERF_EVENTS && HAVE_PERF_EVENTS_NMI
 	select HAVE_PCI
+	select HAVE_PERF_CALLCHAIN_DEFERRED
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select MMU_GATHER_RCU_TABLE_FREE	if PARAVIRT
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e82aadf99d9b..d6ea265d9aa8 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2821,8 +2821,8 @@ static unsigned long get_segment_base(unsigned int segment)
 
 #include <linux/compat.h>
 
-static inline int
-perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *entry)
+static int __perf_callchain_user32(struct pt_regs *regs,
+				   struct perf_callchain_entry_ctx *entry)
 {
 	/* 32-bit process in 64-bit kernel. */
 	unsigned long ss_base, cs_base;
@@ -2836,7 +2836,6 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
 	ss_base = get_segment_base(regs->ss);
 
 	fp = compat_ptr(ss_base + regs->bp);
-	pagefault_disable();
 	while (entry->nr < entry->max_stack) {
 		if (!valid_user_frame(fp, sizeof(frame)))
 			break;
@@ -2849,19 +2848,18 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
 		perf_callchain_store(entry, cs_base + frame.return_address);
 		fp = compat_ptr(ss_base + frame.next_frame);
 	}
-	pagefault_enable();
 	return 1;
 }
-#else
-static inline int
-perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *entry)
+#else /* !CONFIG_IA32_EMULATION */
+static int __perf_callchain_user32(struct pt_regs *regs,
+				   struct perf_callchain_entry_ctx *entry)
 {
-    return 0;
+	return 0;
 }
-#endif
+#endif /* CONFIG_IA32_EMULATION */
 
-void
-perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
+static void __perf_callchain_user(struct perf_callchain_entry_ctx *entry,
+				  struct pt_regs *regs, bool atomic)
 {
 	struct user_unwind_state state;
 
@@ -2878,20 +2876,36 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
 
 	perf_callchain_store(entry, regs->ip);
 
-	if (!nmi_uaccess_okay())
-		return;
+	if (atomic) {
+		if (!nmi_uaccess_okay())
+			return;
+		pagefault_disable();
+	}
 
-	if (perf_callchain_user32(regs, entry))
-		return;
-
-	pagefault_disable();
+	if (__perf_callchain_user32(regs, entry))
+		goto done;
 
 	for_each_user_frame(&state, USER_UNWIND_TYPE_FP) {
 		if (perf_callchain_store(entry, state.ip))
-			break;
+			goto done;
 	}
 
-	pagefault_enable();
+done:
+	if (atomic)
+		pagefault_enable();
+}
+
+
+void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
+			 struct pt_regs *regs)
+{
+	return __perf_callchain_user(entry, regs, true);
+}
+
+void perf_callchain_user_deferred(struct perf_callchain_entry_ctx *entry,
+				  struct pt_regs *regs)
+{
+	return __perf_callchain_user(entry, regs, false);
 }
 
 /*
-- 
2.46.0



* [PATCH v2 11/11] perf/x86: Enable SFrame unwinding for deferred user callchains
  2024-09-13 23:02 [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains Josh Poimboeuf
                   ` (9 preceding siblings ...)
  2024-09-13 23:02 ` [PATCH v2 10/11] perf/x86: Add HAVE_PERF_CALLCHAIN_DEFERRED Josh Poimboeuf
@ 2024-09-13 23:02 ` Josh Poimboeuf
  2024-09-14 12:12 ` [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains Steven Rostedt
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 44+ messages in thread
From: Josh Poimboeuf @ 2024-09-13 23:02 UTC (permalink / raw)
  To: x86
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James

Use SFrame for deferred user callchains, if available.

Non-deferred user callchains still need to use frame pointers, as SFrame
is likely to fault when it pages in the .sframe section.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/x86/events/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index d6ea265d9aa8..d618c50865d3 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2861,6 +2861,7 @@ static int __perf_callchain_user32(struct pt_regs *regs,
 static void __perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 				  struct pt_regs *regs, bool atomic)
 {
+	enum user_unwind_type unwind_type = USER_UNWIND_TYPE_AUTO;
 	struct user_unwind_state state;
 
 	if (perf_guest_state()) {
@@ -2879,13 +2880,14 @@ static void __perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 	if (atomic) {
 		if (!nmi_uaccess_okay())
 			return;
+		unwind_type = USER_UNWIND_TYPE_FP;
 		pagefault_disable();
 	}
 
 	if (__perf_callchain_user32(regs, entry))
 		goto done;
 
-	for_each_user_frame(&state, USER_UNWIND_TYPE_FP) {
+	for_each_user_frame(&state, unwind_type) {
 		if (perf_callchain_store(entry, state.ip))
 			goto done;
 	}
-- 
2.46.0



* Re: [PATCH v2 03/11] unwind: Introduce SFrame user space unwinding
  2024-09-13 23:02 ` [PATCH v2 03/11] unwind: Introduce SFrame user space unwinding Josh Poimboeuf
@ 2024-09-14 11:23   ` Steven Rostedt
  2024-10-01 18:20     ` Indu Bhagat
  2024-10-23 13:59   ` Jens Remus
  1 sibling, 1 reply; 44+ messages in thread
From: Steven Rostedt @ 2024-09-14 11:23 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James

On Sat, 14 Sep 2024 01:02:05 +0200
Josh Poimboeuf <jpoimboe@kernel.org> wrote:

> diff --git a/include/linux/sframe.h b/include/linux/sframe.h
> new file mode 100644
> index 000000000000..3a44f76929e2
> --- /dev/null
> +++ b/include/linux/sframe.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_SFRAME_H
> +#define _LINUX_SFRAME_H
> +
> +#include <linux/mm_types.h>
> +
> +struct sframe_file {
> +	unsigned long sframe_addr, text_start, text_end;

Please make each entry a separate line:

	unsigned long		sframe_addr;
	unsigned long		text_start;
	unsigned long		text_end;

It may be fine for declaring variables like this in a function, but it
should not be done in a structure.

> +};
> +
> +struct user_unwind_frame;
> +
> +#ifdef CONFIG_HAVE_USER_UNWIND_SFRAME
> +
> +#define INIT_MM_SFRAME .sframe_mt = MTREE_INIT(sframe_mt, 0)
> +
> +extern void sframe_free_mm(struct mm_struct *mm);
> +
> +extern int __sframe_add_section(struct sframe_file *file);
> +extern int sframe_add_section(unsigned long sframe_addr, unsigned long text_start, unsigned long text_end);
> +extern int sframe_remove_section(unsigned long sframe_addr);
> +extern int sframe_find(unsigned long ip, struct user_unwind_frame *frame);
> +
> +static inline bool current_has_sframe(void)
> +{
> +	struct mm_struct *mm = current->mm;
> +
> +	return mm && !mtree_empty(&mm->sframe_mt);
> +}
> +
> +#else /* !CONFIG_HAVE_USER_UNWIND_SFRAME */
> +
> +#define INIT_MM_SFRAME
> +
> +static inline void sframe_free_mm(struct mm_struct *mm) {}
> +
> +static inline int __sframe_add_section(struct sframe_file *file) { return -EINVAL; }
> +static inline int sframe_add_section(unsigned long sframe_addr, unsigned long text_start, unsigned long text_end) { return -EINVAL; }
> +static inline int sframe_remove_section(unsigned long sframe_addr) { return -EINVAL; }
> +static inline int sframe_find(unsigned long ip, struct user_unwind_frame *frame) { return -EINVAL; }
> +
> +static inline bool current_has_sframe(void) { return false; }
> +
> +#endif /* CONFIG_HAVE_USER_UNWIND_SFRAME */
> +
> +#endif /* _LINUX_SFRAME_H */
> diff --git a/include/linux/user_unwind.h b/include/linux/user_unwind.h
> index 0a19ac6c92b2..8003f9d35405 100644
> --- a/include/linux/user_unwind.h
> +++ b/include/linux/user_unwind.h
> @@ -7,6 +7,7 @@
>  enum user_unwind_type {
>  	USER_UNWIND_TYPE_AUTO,
>  	USER_UNWIND_TYPE_FP,
> +	USER_UNWIND_TYPE_SFRAME,
>  };
>  
>  struct user_unwind_frame {
> diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
> index b54b313bcf07..b2aca31e1a49 100644
> --- a/include/uapi/linux/elf.h
> +++ b/include/uapi/linux/elf.h
> @@ -39,6 +39,7 @@ typedef __s64	Elf64_Sxword;
>  #define PT_GNU_STACK	(PT_LOOS + 0x474e551)
>  #define PT_GNU_RELRO	(PT_LOOS + 0x474e552)
>  #define PT_GNU_PROPERTY	(PT_LOOS + 0x474e553)
> +#define PT_GNU_SFRAME	(PT_LOOS + 0x474e554)
>  
>  
>  /* ARM MTE memory tag segment type */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 35791791a879..69511077c910 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -328,4 +328,7 @@ struct prctl_mm_map {
>  # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC	0x10 /* Clear the aspect on exec */
>  # define PR_PPC_DEXCR_CTRL_MASK		0x1f
>  
> +#define PR_ADD_SFRAME			74
> +#define PR_REMOVE_SFRAME		75
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index cc760491f201..a216f091edfb 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -104,6 +104,7 @@
>  #include <linux/rseq.h>
>  #include <uapi/linux/pidfd.h>
>  #include <linux/pidfs.h>
> +#include <linux/sframe.h>
>  
>  #include <asm/pgalloc.h>
>  #include <linux/uaccess.h>
> @@ -923,6 +924,7 @@ void __mmdrop(struct mm_struct *mm)
>  	mm_pasid_drop(mm);
>  	mm_destroy_cid(mm);
>  	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
> +	sframe_free_mm(mm);
>  
>  	free_mm(mm);
>  }
> @@ -1249,6 +1251,13 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
>  #endif
>  }
>  
> +static void mm_init_sframe(struct mm_struct *mm)
> +{
> +#ifdef CONFIG_HAVE_USER_UNWIND_SFRAME
> +	mt_init(&mm->sframe_mt);
> +#endif
> +}
> +
>  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>  	struct user_namespace *user_ns)
>  {
> @@ -1280,6 +1289,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>  	mm->pmd_huge_pte = NULL;
>  #endif
>  	mm_init_uprobes_state(mm);
> +	mm_init_sframe(mm);
>  	hugetlb_count_init(mm);
>  
>  	if (current->mm) {
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 3a2df1bd9f64..e4d2b64f4ae4 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -64,6 +64,7 @@
>  #include <linux/rcupdate.h>
>  #include <linux/uidgid.h>
>  #include <linux/cred.h>
> +#include <linux/sframe.h>
>  
>  #include <linux/nospec.h>
>  
> @@ -2782,6 +2783,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  	case PR_RISCV_SET_ICACHE_FLUSH_CTX:
>  		error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
>  		break;
> +	case PR_ADD_SFRAME:
> +		if (arg5)
> +			return -EINVAL;
> +		error = sframe_add_section(arg2, arg3, arg4);
> +		break;
> +	case PR_REMOVE_SFRAME:
> +		if (arg3 || arg4 || arg5)
> +			return -EINVAL;
> +		error = sframe_remove_section(arg2);
> +		break;
>  	default:
>  		error = -EINVAL;
>  		break;
> diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
> index eb466d6a3295..6f202c5840cf 100644
> --- a/kernel/unwind/Makefile
> +++ b/kernel/unwind/Makefile
> @@ -1 +1,2 @@
>  obj-$(CONFIG_HAVE_USER_UNWIND) += user.o
> +obj-$(CONFIG_HAVE_USER_UNWIND_SFRAME) += sframe.o
> diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
> new file mode 100644
> index 000000000000..3e4d29e737a1
> --- /dev/null
> +++ b/kernel/unwind/sframe.c
> @@ -0,0 +1,420 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/srcu.h>
> +#include <linux/uaccess.h>
> +#include <linux/mm.h>
> +#include <linux/sframe.h>
> +#include <linux/user_unwind.h>
> +
> +#include "sframe.h"
> +
> +#define SFRAME_FILENAME_LEN 32
> +
> +struct sframe_section {
> +	struct rcu_head rcu;
> +
> +	unsigned long sframe_addr;
> +	unsigned long text_addr;
> +
> +	unsigned long fdes_addr;
> +	unsigned long fres_addr;
> +	unsigned int  fdes_nr;
> +	signed char ra_off, fp_off;
> +};
> +
> +DEFINE_STATIC_SRCU(sframe_srcu);
> +
> +#define __SFRAME_GET_USER(out, user_ptr, type)				\
> +({									\
> +	type __tmp;							\
> +	if (get_user(__tmp, (type *)user_ptr))				\
> +		return -EFAULT;						\
> +	user_ptr += sizeof(__tmp);					\
> +	out = __tmp;							\
> +})
> +
> +#define SFRAME_GET_USER_SIGNED(out, user_ptr, size)			\
> +({									\
> +	switch (size) {							\
> +	case 1:								\
> +		__SFRAME_GET_USER(out, user_ptr, s8);			\
> +		break;							\
> +	case 2:								\
> +		__SFRAME_GET_USER(out, user_ptr, s16);			\
> +		break;							\
> +	case 4:								\
> +		__SFRAME_GET_USER(out, user_ptr, s32);			\
> +		break;							\
> +	default:							\
> +		return -EINVAL;						\
> +	}								\
> +})
> +
> +#define SFRAME_GET_USER_UNSIGNED(out, user_ptr, size)			\
> +({									\
> +	switch (size) {							\
> +	case 1:								\
> +		__SFRAME_GET_USER(out, user_ptr, u8);			\
> +		break;							\
> +	case 2:								\
> +		__SFRAME_GET_USER(out, user_ptr, u16);			\
> +		break;							\
> +	case 4:								\
> +		__SFRAME_GET_USER(out, user_ptr, u32);			\
> +		break;							\
> +	default:							\
> +		return -EINVAL;						\
> +	}								\
> +})
> +
> +static unsigned char fre_type_to_size(unsigned char fre_type)
> +{
> +	if (fre_type > 2)
> +		return 0;
> +	return 1 << fre_type;
> +}
> +
> +static unsigned char offset_size_enum_to_size(unsigned char off_size)
> +{
> +	if (off_size > 2)
> +		return 0;
> +	return 1 << off_size;
> +}
> +
> +static int find_fde(struct sframe_section *sec, unsigned long ip,
> +		    struct sframe_fde *fde)
> +{
> +	s32 func_off, ip_off;
> +	struct sframe_fde __user *first, *last, *mid, *found;

Need to initialize found = NULL;

> +
> +	ip_off = ip - sec->sframe_addr;
> +
> +	first = (void *)sec->fdes_addr;
> +	last = first + sec->fdes_nr;
> +	while (first <= last) {

So we trust user space to have this table sorted?

> +		mid = first + ((last - first) / 2);
> +		if (get_user(func_off, (s32 *)mid))
> +			return -EFAULT;
> +		if (ip_off >= func_off) {
> +			found = mid;

If it's not sorted, this can return the wrong value.

We should have some check that has something like:

	s32 low_func_off = 0, high_func_off = 0;

			if (low_func_off && low_func_off > func_off)
				return -EINVAL;	
			low_func_off = func_off;

> +			first = mid + 1;
> +		} else
			{ /* Note, this needs braces anyway: per kernel
			     style, if one branch of an if/else has braces,
			     the other needs them too */

			if (high_func_off && high_func_off < func_off)
				return -EINVAL;

			high_func_off = func_off;

> +			last = mid - 1;
		}

> +	}
> +
> +	if (!found)
> +		return -EINVAL;
> +
> +	if (copy_from_user(fde, found, sizeof(*fde)))
> +		return -EFAULT;
> +
> +	return 0;
> +}

-- Steve


* Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
  2024-09-13 23:02 [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains Josh Poimboeuf
                   ` (10 preceding siblings ...)
  2024-09-13 23:02 ` [PATCH v2 11/11] perf/x86: Enable SFrame unwinding for deferred user callchains Josh Poimboeuf
@ 2024-09-14 12:12 ` Steven Rostedt
  2024-09-15 11:11   ` Josh Poimboeuf
  2024-09-14 19:37 ` Namhyung Kim
  2024-10-23 13:22 ` Jens Remus
  13 siblings, 1 reply; 44+ messages in thread
From: Steven Rostedt @ 2024-09-14 12:12 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Mathieu Desnoyers

On Sat, 14 Sep 2024 01:02:02 +0200
Josh Poimboeuf <jpoimboe@kernel.org> wrote:

> This is a new version of sframe user space unwinding + perf deferred
> callchains.  Better late than never.
> 
> Unfortunately I didn't get a chance to do any testing with this one yet
> as I'm rushing it out the door before GNU Cauldron starts.
> Conference-driven development ;-)

Thanks for posting this!

BTW, next time, can you also Cc linux-trace-kernel@vger.kernel.org.

> 
> Plus I don't have the perf tool piece working yet so I don't have a way
> of doing end-to-end testing at the moment anyway.
> 
> In other words, please don't merge this yet.
> 
> Namhyung, if you're still available to write a perf tool patch which
> integrates with this, that would be great.  Otherwise I could give it a
> try.
> 
> Steven, let me know if this would interface ok with your anticipated
> tracing usage.

This is a very good starting point for me. A couple of notes.

I think the unwinder should have an interface itself that provides the
deferred unwinding, instead of having all the tracers implement their
own.

The user space unwinder (regardless of sframe or not) should provide a
way to say "I want a user space stack trace for here, but don't do it
yet. Just give me a cookie and then call my callback function with the
stack and cookie before going back to user space".

That is, we should have an interface like:

typedef void (unwinder_callback_t)(struct user_space_stack *, u64 cookie);

struct unwinder_client {
	struct list_head	list;
	unwinder_callback_t	callback;
};


struct unwinder_client my_unwinder = {
	.callback = mycallback;
};

	register_unwinder(&my_unwinder);



Then a tracer could do this in NMI context:

	/* Get the current cookie of the current task */
	cookie = unwinder_get_cookie();

	/* Tell the unwinder to call the callbacks */
	unwinder_trigger();

	/* then the tracer can save off that cookie, do a kernel stack trace
	 * and save the cookie with that stack trace.
	 */


On return to user space, it will take the ptrace slow path and call all
registered callbacks with the cookie, regardless of whether the tracer
asked for it or not.

	/* Save the stack somewhere, could even allocate memory to do so */
	list_for_each_entry(callback, &unwinder_callback, list) {
		callback->callback(userstack, cookie);
	}
	update_next_cookie();

The cookie is a unique number for every entry into the kernel by all
tasks.  It is then up to each tracer's callback to decide whether to
save the trace or ignore it.  That is, if perf and ftrace are both
registered but only perf triggered the callback, perf's callback saves
the stack because the cookie matches one it recorded, while ftrace's
callback ignores the trace because it did not ask for one.  In other
words, each tracer needs to save off the cookies it asked for so its
callback knows whether to do something or not.

The cookie can be generated by this:

static u64 next_cookie;

u64 unwinder_get_cookie(void)
{
	u64 next;

	if (current->unwinder_cookie)
		return current->unwinder_cookie;

	next = next_cookie;
	while (!try_cmpxchg64(&next_cookie, &next, next + 1))
		;

	next++;

	/* Check for races with NMI */
	if (!cmpxchg(&current->unwinder_cookie, 0, next))
		next = current->unwinder_cookie;

	return next;
}

When the task goes back to user space and does all the callbacks, it
would reset the task cookie.

	current->unwinder_cookie = 0;

Now the loop of callbacks should also catch whether unwinder_trigger()
was called again (from NMI), in which case it may need to redo all the
backtraces with the new cookie.

That will need some thought about handling that race.

This is just an idea to start discussion.

-- Steve


* Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
  2024-09-13 23:02 [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains Josh Poimboeuf
                   ` (11 preceding siblings ...)
  2024-09-14 12:12 ` [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains Steven Rostedt
@ 2024-09-14 19:37 ` Namhyung Kim
  2024-10-23 13:22 ` Jens Remus
  13 siblings, 0 replies; 44+ messages in thread
From: Namhyung Kim @ 2024-09-14 19:37 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James

Hello Josh,

On Sat, Sep 14, 2024 at 01:02:02AM +0200, Josh Poimboeuf wrote:
> This is a new version of sframe user space unwinding + perf deferred
> callchains.  Better late than never.
> 
> Unfortunately I didn't get a chance to do any testing with this one yet
> as I'm rushing it out the door before GNU Cauldron starts.
> Conference-driven development ;-)

lol

> 
> Plus I don't have the perf tool piece working yet so I don't have a way
> of doing end-to-end testing at the moment anyway.
> 
> In other words, please don't merge this yet.
> 
> Namhyung, if you're still available to write a perf tool patch which
> integrates with this, that would be great.  Otherwise I could give it a
> try.

Sure, I'll take a look and post the perf tools patches.

Thanks,
Namhyung


* Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
  2024-09-14 12:12 ` [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains Steven Rostedt
@ 2024-09-15 11:11   ` Josh Poimboeuf
  2024-09-15 11:38     ` Steven Rostedt
  2024-09-16 14:08     ` Peter Zijlstra
  0 siblings, 2 replies; 44+ messages in thread
From: Josh Poimboeuf @ 2024-09-15 11:11 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: x86, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Mathieu Desnoyers

On Sat, Sep 14, 2024 at 08:12:46AM -0400, Steven Rostedt wrote:
> I think the unwinder should have an interface itself that provides the
> deferred unwinding, instead of having all the tracers to implement
> their own.
> 
> The user space unwinder (regardless of sframe or not) should provide a
> way to say "I want a user space stack trace for here, but don't do it
> yet. Just give me a cookie and then call my callback function with the
> stack and cookie before going back to user space".

We (Steven, Mathieu and I) have been discussing this at GNU Cauldron and
I think we're in basic agreement on this.

I think the biggest tweak we decided on is that the context id (aka
"cookie") would be percpu.  Its initial value is (cpuid << 48).  It gets
incremented for every entry from user space.

> That is, we should have an interface like:
> 
> typedef void (unwinder_callback_t)(struct user_space_stack *, u64 cookie);

> struct unwinder_client {
> 	struct list_head	list;
> 	unwinder_callback_t	callback;

I assume we want to allow tracers to pick sframes or FP (or auto).  If
so we would need to add a user_unwind_type enum to this struct.

Then the question is, what to do if tracer A wants sframe and tracer B
wants FP?  I'm thinking it's fine to allow that.  I assume the "multiple
tracers unwinding user space" case isn't realistic so any extra overhead
from these cases is the user's fault?

The unwinder would need two sets of callbacks and task work functions:
one for sframe, one for FP.

The tracer would need to pass its &my_unwinder struct to
unwinder_trigger() so the unwinder knows which task work function to
activate.


I'm thinking we also need a 'max_entries' field.  No need to keep
unwinding if we've already reached the tracer's max.  If there are
multiple callbacks then we'd have to get the max of the maxes, but
that's easy enough.
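As a rough sketch of that "max of the maxes" step (all names here are illustrative, not the actual kernel interface):

```c
#include <stddef.h>

struct unwinder_client {
	unsigned int max_entries;	/* illustrative field */
	/* ... callback, list linkage, etc. */
};

/*
 * The unwinder only needs to walk as deep as the largest max
 * requested by any registered client.
 */
static unsigned int max_of_maxes(const struct unwinder_client *clients,
				 size_t n)
{
	unsigned int max = 0;

	for (size_t i = 0; i < n; i++)
		if (clients[i].max_entries > max)
			max = clients[i].max_entries;
	return max;
}
```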


Mathieu had requested passing an opaque void * from the NMI to the
callback, but I don't think that's possible with this scheme since the
unwinder wouldn't know which callback to give it to.  Instead the tracer
would have to keep track of its own data associated with the given
cookie.

-- 
Josh



* Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
  2024-09-15 11:11   ` Josh Poimboeuf
@ 2024-09-15 11:38     ` Steven Rostedt
  2024-09-16 14:08     ` Peter Zijlstra
  1 sibling, 0 replies; 44+ messages in thread
From: Steven Rostedt @ 2024-09-15 11:38 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Mathieu Desnoyers

On Sun, 15 Sep 2024 13:11:11 +0200
Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> On Sat, Sep 14, 2024 at 08:12:46AM -0400, Steven Rostedt wrote:
> > I think the unwinder should have an interface itself that provides the
> > deferred unwinding, instead of having all the tracers implement
> > their own.
> > 
> > The user space unwinder (regardless of sframe or not) should provide a
> > way to say "I want a user space stack trace for here, but don't do it
> > yet. Just give me a cookie and then call my callback function with the
> > stack and cookie before going back to user space".  
> 
> We (Steven, Mathieu and I) have been discussing this at GNU Cauldron and
> I think we're in basic agreement on this.
> 
> I think the biggest tweak we decided on is that the context id (aka
> "cookie") would be percpu.  Its initial value is (cpuid << 48).  It gets
> incremented for every entry from user space.

Well, in concept it is incremented for every entry from user space, but
in practice it should only be incremented the first time it is
referenced after entering the kernel from user space. That is, if
nothing asks for a stack trace, then nothing is incremented.
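A minimal user-space sketch of that lazy assignment might look like the
following (names are illustrative, not the actual kernel interface; the
real counter would be per-CPU and the cookie would live in the task
struct):

```c
#include <stdint.h>

struct task_ctx {
	uint64_t cookie;	/* 0: no cookie assigned this entry */
};

static uint64_t percpu_counter;	/* would be per-CPU in the kernel */

/* Called only when a tracer actually asks for a deferred unwind. */
static uint64_t get_context_cookie(struct task_ctx *t, unsigned int cpu)
{
	if (!t->cookie)
		t->cookie = ((uint64_t)cpu << 48) | ++percpu_counter;
	return t->cookie;
}

/* Called on return to user space, after the callbacks have run. */
static void clear_context_cookie(struct task_ctx *t)
{
	t->cookie = 0;
}
```

Repeated requests within one kernel entry see the same cookie; the
counter only moves when someone asks.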

> 
> > That is, we should have an interface like:
> > 
> > typedef void (unwinder_callback_t)(struct user_space_stack *, u64 cookie);  
> 
> > struct unwinder_client {
> > 	struct list_head	list;
> > 	unwinder_callback_t	callback;  
> 
> I assume we want to allow tracers to pick sframes or FP (or auto).  If
> so we would need to add a user_unwind_type enum to this struct.

Actually, I don't think they should care. A stack trace should just be
passed to the callback. How that stack trace is created is up to the
unwinder: sframe should be used if it's available, otherwise it falls
back to the frame pointer. If neither is available, the callback can be
called with NULL for the stack trace and the tracer can figure out what
to do via pt_regs.
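That fallback order could be sketched as follows (hypothetical names,
purely illustrative):

```c
#include <stdbool.h>

enum unwind_src { UNWIND_SFRAME, UNWIND_FP, UNWIND_NONE };

/*
 * Pick the best available unwind method: sframe first, then frame
 * pointers, then nothing (the callback gets NULL and can fall back
 * to pt_regs).
 */
static enum unwind_src pick_unwind_src(bool have_sframe, bool have_fp)
{
	if (have_sframe)
		return UNWIND_SFRAME;
	if (have_fp)
		return UNWIND_FP;
	return UNWIND_NONE;
}
```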

> 
> Then the question is, what to do if tracer A wants sframe and tracer B
> wants FP?  I'm thinking it's fine to allow that.  I assume the "multiple
> tracers unwinding user space" case isn't realistic so any extra overhead
> from these cases is the user's fault?
> 
> The unwinder would need two sets of callbacks and task work functions:
> one for sframe, one for FP.
> 
> The tracer would need to pass its &my_unwinder struct to
> unwinder_trigger() so the unwinder knows which task work function to
> activate.
> 
> 
> I'm thinking we also need a 'max_entries' field.  No need to keep
> unwinding if we've already reached the tracer's max.  If there are
> multiple callbacks then we'd have to get the max of the maxes, but
> that's easy enough.

We could set a max when the tracer registers a callback. It can be -1
to mean "no limit".

I was talking with Mathieu at lunch, and we talked about having the
stack information pointer be part of the task_struct.

That is, the first time the unwinder is triggered and it's doing the
unwind (either sframe or fp), it will allocate memory to store the
stack trace information. This information is what is sent to the
tracers' callbacks. Again, the tracers do not care how the stack frame
was created. This memory is then saved in the task struct (freed when
the task exits) and reused for future stack traces.

If a stack trace could not be created, the callback just gets NULL.
That tells the tracer that nothing is available for the user space
stack trace, and it can try to do something else. But that's not
something the unwinder should care about.
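A user-space sketch of that allocate-once-and-reuse buffer (hypothetical
names; in the kernel the pointer would hang off task_struct and be freed
at task exit):

```c
#include <stdlib.h>

struct unwind_cache {
	unsigned long *entries;
	unsigned int max;	/* capacity, in entries */
};

/*
 * First call allocates; later calls reuse the same buffer, growing
 * it only if a larger trace is requested.
 */
static unsigned long *unwind_cache_get(struct unwind_cache *c,
				       unsigned int max)
{
	if (!c->entries || c->max < max) {
		free(c->entries);
		c->entries = malloc(max * sizeof(*c->entries));
		c->max = c->entries ? max : 0;
	}
	return c->entries;
}
```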

> 
> 
> Mathieu had requested passing an opaque void * from the NMI to the
> callback, but I don't think that's possible with this scheme since the
> unwinder wouldn't know which callback to give it to.  Instead the tracer
> would have to keep track of its own data associated with the given
> cookie.
> 

Not sure what that means, but from NMI, I would see the tracer doing:


	u64 cookie;

	cookie = unwinder_trigger();


And then the tracer is responsible for doing something with that cookie.
This would also trigger the task_work() so that, when the ptrace path
is taken on the way back to user space, the unwinder will try to create
a stack trace (from sframe or fp), saving it in allocated memory that
is stored via a pointer on the task_struct so it only needs to be
allocated once (or reallocated if a stack trace is bigger than what was
previously allocated).

Then all the tracer callbacks will be called with the stack trace and
the cookie. The tracer will decide whether to use the stack trace for
that cookie or just ignore it. But the tracer shouldn't care about
sframes or frame pointers. That's an implementation detail of the
unwinder.

-- Steve


* Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
  2024-09-16 18:15         ` Peter Zijlstra
@ 2024-09-16  0:15           ` Mathieu Desnoyers
  2024-09-16  0:33             ` Mathieu Desnoyers
  2024-09-16 22:46           ` Steven Rostedt
  1 sibling, 1 reply; 44+ messages in thread
From: Mathieu Desnoyers @ 2024-09-16  0:15 UTC (permalink / raw)
  To: Peter Zijlstra, Josh Poimboeuf
  Cc: Steven Rostedt, x86, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James

On 2024-09-16 20:15, Peter Zijlstra wrote:
[...]
> The point being that it is possible to wrap one CPU into the id space of
> another CPU. It is not trivial, but someone who wants to can make it
> happen.

I agree that the overflow of the free-running counter bleeding into the
CPU numbers is something we want to prevent. We don't care if this
counter overflows after days/months/years for the sake of correlation
within a system call, but we do care that this free-running counter
could be made to overflow after a very long time while the system runs,
at which point the CPU numbers get mixed up, which leads to short-term
correlation issues.

I would recommend this layout for this 64-bit value instead:

low-bits: cpu number
high-bits: free-running counter

This way, we eliminate any carry from overflow into the cpu number bits.
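As a sketch, with the CPU number in the low bits a counter wrap stays
confined to the counter field (the bit widths here are illustrative):

```c
#include <stdint.h>

#define COOKIE_CPU_BITS	16
#define COOKIE_CPU_MASK	((1ULL << COOKIE_CPU_BITS) - 1)

static inline uint64_t make_cookie(uint64_t counter, unsigned int cpu)
{
	/*
	 * A counter overflow shifts out the top bits; the CPU bits
	 * are never disturbed by a carry.
	 */
	return (counter << COOKIE_CPU_BITS) | (cpu & COOKIE_CPU_MASK);
}

static inline unsigned int cookie_cpu(uint64_t cookie)
{
	return cookie & COOKIE_CPU_MASK;
}
```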

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



* Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
  2024-09-16  0:15           ` Mathieu Desnoyers
@ 2024-09-16  0:33             ` Mathieu Desnoyers
  2024-09-17  0:37               ` Mathieu Desnoyers
  0 siblings, 1 reply; 44+ messages in thread
From: Mathieu Desnoyers @ 2024-09-16  0:33 UTC (permalink / raw)
  To: Peter Zijlstra, Josh Poimboeuf
  Cc: Steven Rostedt, x86, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James

On 2024-09-16 02:15, Mathieu Desnoyers wrote:
> On 2024-09-16 20:15, Peter Zijlstra wrote:
> [...]
>> The point being that it is possible to wrap one CPU into the id space of
>> another CPU. It is not trivial, but someone who wants to can make it
>> happen.
> 
> I agree that the overflow of the free-running counter bleeding into the
> CPU numbers is something we want to prevent. We don't care if this
> counter overflows after day/months/years for sake of correlation
> within a system call, but we do care about the fact that this
> free-running counter could be made to overflow after a very long
> time while the system runs, and then we reach a state where the
> CPU numbers are mixed up, which leads to short-term correlation
> issues.
> 
> I would recommend this layout for this 64-bit value instead:
> 
> low-bits: cpu number
> high-bits: free-running counter
> 
> This way, we eliminate any carry from overflow into the cpu number bits.

Even better: AFAIR from the discussion I had with Steven and Josh, we intend
to have the cookie stored to/read from the task struct with interrupts off,
we can simply do:

struct stackwalk_cookie {
	uint64_t counter;	/* free running per-cpu counter value */
	int cpu;		/* cpu on which the counter was snapshotted. */
};

So we don't have to deal with any overflow trickiness, there is no need for
bit-shifting, and we get a full 64-bit free-running counter independently of
the number of CPUs.
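With the two-field form, matching a deferred trace back to its request
is a plain field comparison; a sketch (the equality helper is
hypothetical, the struct mirrors the one proposed above):

```c
#include <stdbool.h>
#include <stdint.h>

struct stackwalk_cookie {
	uint64_t counter;	/* free running per-cpu counter value */
	int cpu;		/* cpu on which the counter was snapshotted */
};

static inline bool stackwalk_cookie_equal(struct stackwalk_cookie a,
					  struct stackwalk_cookie b)
{
	return a.counter == b.counter && a.cpu == b.cpu;
}
```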

Thanks,

Mathieu



> 
> Thanks,
> 
> Mathieu
> 
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



* Re: [PATCH v2 05/11] perf/x86: Use user_unwind interface
  2024-09-13 23:02 ` [PATCH v2 05/11] perf/x86: Use user_unwind interface Josh Poimboeuf
@ 2024-09-16  6:48   ` kernel test robot
  2024-09-17 22:01     ` Namhyung Kim
  0 siblings, 1 reply; 44+ messages in thread
From: kernel test robot @ 2024-09-16  6:48 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: oe-lkp, lkp, linux-perf-users, linux-kernel, x86, Peter Zijlstra,
	Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
	Indu Bhagat, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Ian Rogers, Adrian Hunter, Mark Brown,
	linux-toolchains, Jordan Rome, Sam James, oliver.sang



Hello,

kernel test robot noticed "kernel_BUG_at_kernel/unwind/user.c" on:

commit: 164c5ae4072303c9eb4e263115a5e70d5a3cc052 ("[PATCH v2 05/11] perf/x86: Use user_unwind interface")
url: https://github.com/intel-lab-lkp/linux/commits/Josh-Poimboeuf/unwind-Introduce-generic-user-space-unwinding-interface/20240914-070619
base: https://git.kernel.org/cgit/linux/kernel/git/perf/perf-tools-next.git perf-tools-next
patch link: https://lore.kernel.org/all/daf3f59e0d14ee11b45ad6735b8a211a3c7534dc.1726268190.git.jpoimboe@kernel.org/
patch subject: [PATCH v2 05/11] perf/x86: Use user_unwind interface

in testcase: fsmark
version: fsmark-x86_64-2628be5-1_20240224
with following parameters:

	iterations: 1x
	nr_threads: 64t
	disk: 1BRD_48G
	fs: xfs
	filesize: 4M
	test_size: 24G
	sync_method: NoSync
	cpufreq_governor: performance



compiler: gcc-12
test machine: 96 threads 2 sockets Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz (Cascade Lake) with 128G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202409161428.3cc6c1e1-oliver.sang@intel.com


[   53.036370][    C3] ------------[ cut here ]------------
[   53.036374][    C3] kernel BUG at kernel/unwind/user.c:39!
[   53.036381][    C3] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[   53.036385][    C3] CPU: 3 UID: 0 PID: 1317 Comm: sed Tainted: G S                 6.11.0-rc6-00501-g164c5ae40723 #1
[   53.036388][    C3] Tainted: [S]=CPU_OUT_OF_SPEC
[   53.036389][    C3] Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.0D.01.0286.011120190816 01/11/2019
[ 53.036390][ C3] RIP: 0010:user_unwind_next (kernel/unwind/user.c:39) 
[ 53.036400][ C3] Code: 01 eb c1 48 01 f0 e8 a0 15 ea 00 85 c0 75 ee 48 89 73 08 4c 89 03 8b 41 08 85 c0 74 a5 48 89 53 10 eb 9f bf ea ff ff ff eb 9a <0f> 0b e8 ba b9 ea 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e 0f
All code
========
   0:	01 eb                	add    %ebp,%ebx
   2:	c1 48 01 f0          	rorl   $0xf0,0x1(%rax)
   6:	e8 a0 15 ea 00       	callq  0xea15ab
   b:	85 c0                	test   %eax,%eax
   d:	75 ee                	jne    0xfffffffffffffffd
   f:	48 89 73 08          	mov    %rsi,0x8(%rbx)
  13:	4c 89 03             	mov    %r8,(%rbx)
  16:	8b 41 08             	mov    0x8(%rcx),%eax
  19:	85 c0                	test   %eax,%eax
  1b:	74 a5                	je     0xffffffffffffffc2
  1d:	48 89 53 10          	mov    %rdx,0x10(%rbx)
  21:	eb 9f                	jmp    0xffffffffffffffc2
  23:	bf ea ff ff ff       	mov    $0xffffffea,%edi
  28:	eb 9a                	jmp    0xffffffffffffffc4
  2a:*	0f 0b                	ud2    		<-- trapping instruction
  2c:	e8 ba b9 ea 00       	callq  0xeab9eb
  31:	66 66 2e 0f 1f 84 00 	data16 nopw %cs:0x0(%rax,%rax,1)
  38:	00 00 00 00 
  3c:	66                   	data16
  3d:	66                   	data16
  3e:	2e                   	cs
  3f:	0f                   	.byte 0xf

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2    
   2:	e8 ba b9 ea 00       	callq  0xeab9c1
   7:	66 66 2e 0f 1f 84 00 	data16 nopw %cs:0x0(%rax,%rax,1)
   e:	00 00 00 00 
  12:	66                   	data16
  13:	66                   	data16
  14:	2e                   	cs
  15:	0f                   	.byte 0xf
[   53.036402][    C3] RSP: 0000:ffffc9000b38b938 EFLAGS: 00010093
[   53.036404][    C3] RAX: 0000000000000000 RBX: ffffc9000b38b960 RCX: ffff888230156880
[   53.036405][    C3] RDX: 000055b666c6ef3c RSI: 0000000000000001 RDI: ffffc9000b38b960
[   53.036407][    C3] RBP: ffffc9000b38bf58 R08: 0000000000000000 R09: 0000000000000000
[   53.036408][    C3] R10: ffffc9000b38bf58 R11: 0000000000000000 R12: ffff8881fcc88000
[   53.036409][    C3] R13: 000000000000007f R14: ffffc9000b38bf58 R15: 0000000000000000
[   53.036411][    C3] FS:  00007f9da3761800(0000) GS:ffff88903eb80000(0000) knlGS:0000000000000000
[   53.036412][    C3] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   53.036413][    C3] CR2: 00007f4a5dbb53d8 CR3: 0000000216b96005 CR4: 00000000007706f0
[   53.036414][    C3] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   53.036415][    C3] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   53.036416][    C3] PKRU: 55555554
[   53.036417][    C3] Call Trace:
[   53.036419][    C3]  <TASK>
[ 53.036422][ C3] ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447) 
[ 53.036429][ C3] ? do_trap (arch/x86/kernel/traps.c:114 arch/x86/kernel/traps.c:155) 
[ 53.036433][ C3] ? user_unwind_next (kernel/unwind/user.c:39) 
[ 53.036435][ C3] ? do_error_trap (arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:176) 
[ 53.036437][ C3] ? user_unwind_next (kernel/unwind/user.c:39) 
[ 53.036439][ C3] ? exc_invalid_op (arch/x86/kernel/traps.c:267) 
[ 53.036446][ C3] ? user_unwind_next (kernel/unwind/user.c:39) 
[ 53.036448][ C3] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:621) 
[ 53.036455][ C3] ? user_unwind_next (kernel/unwind/user.c:39) 
[ 53.036457][ C3] perf_callchain_user (include/linux/uaccess.h:233 include/linux/uaccess.h:260 arch/x86/events/core.c:2894) 
[ 53.036460][ C3] get_perf_callchain (kernel/events/callchain.c:184 kernel/events/callchain.c:259) 
[ 53.036465][ C3] perf_callchain (kernel/events/core.c:7693) 
[ 53.036469][ C3] setup_pebs_fixed_sample_data (include/linux/perf_event.h:1237 arch/x86/events/intel/ds.c:1772) 
[ 53.036473][ C3] intel_pmu_drain_pebs_nhm (arch/x86/events/intel/ds.c:2201 arch/x86/events/intel/ds.c:2378) 
[ 53.036477][ C3] handle_pmi_common (arch/x86/events/intel/core.c:3066) 
[ 53.036480][ C3] ? __intel_pmu_enable_all+0x28/0xf0 
[ 53.036484][ C3] ? perf_rotate_context (kernel/events/core.c:1154 kernel/events/core.c:1150 kernel/events/core.c:4322) 
[ 53.036486][ C3] ? __pfx_perf_mux_hrtimer_handler (kernel/events/core.c:1082) 
[ 53.036488][ C3] ? ktime_get (kernel/time/timekeeping.c:195 (discriminator 4) kernel/time/timekeeping.c:395 (discriminator 4) kernel/time/timekeeping.c:403 (discriminator 4) kernel/time/timekeeping.c:850 (discriminator 4)) 
[ 53.036492][ C3] intel_pmu_handle_irq (arch/x86/include/asm/msr.h:86 arch/x86/include/asm/msr.h:133 arch/x86/events/intel/core.c:2488 arch/x86/events/intel/core.c:3186) 
[ 53.036493][ C3] perf_event_nmi_handler (arch/x86/events/core.c:1748 arch/x86/events/core.c:1734) 
[ 53.036499][ C3] nmi_handle (arch/x86/kernel/nmi.c:151 (discriminator 7)) 
[ 53.036500][ C3] default_do_nmi (arch/x86/kernel/nmi.c:352) 
[ 53.036503][ C3] exc_nmi (arch/x86/kernel/nmi.c:546) 
[ 53.036504][ C3] asm_exc_nmi (arch/x86/entry/entry_64.S:1198) 
[   53.036507][    C3] RIP: 0033:0x55b666c6ef3c
[ 53.036509][ C3] Code: 75 10 48 c7 c1 ff ff ff ff 41 b8 10 00 00 00 ba 01 00 00 00 e8 75 91 00 00 49 8b 4d 08 48 89 c7 eb 89 0f 1f 40 00 44 09 71 08 <48> 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 4c 3b 62 10 7d 08 4c
All code
========
   0:	75 10                	jne    0x12
   2:	48 c7 c1 ff ff ff ff 	mov    $0xffffffffffffffff,%rcx
   9:	41 b8 10 00 00 00    	mov    $0x10,%r8d
   f:	ba 01 00 00 00       	mov    $0x1,%edx
  14:	e8 75 91 00 00       	callq  0x918e
  19:	49 8b 4d 08          	mov    0x8(%r13),%rcx
  1d:	48 89 c7             	mov    %rax,%rdi
  20:	eb 89                	jmp    0xffffffffffffffab
  22:	0f 1f 40 00          	nopl   0x0(%rax)
  26:	44 09 71 08          	or     %r14d,0x8(%rcx)
  2a:*	48 83 c4 18          	add    $0x18,%rsp		<-- trapping instruction
  2e:	5b                   	pop    %rbx
  2f:	5d                   	pop    %rbp
  30:	41 5c                	pop    %r12
  32:	41 5d                	pop    %r13
  34:	41 5e                	pop    %r14
  36:	41 5f                	pop    %r15
  38:	c3                   	retq   
  39:	4c 3b 62 10          	cmp    0x10(%rdx),%r12
  3d:	7d 08                	jge    0x47
  3f:	4c                   	rex.WR

Code starting with the faulting instruction
===========================================
   0:	48 83 c4 18          	add    $0x18,%rsp
   4:	5b                   	pop    %rbx
   5:	5d                   	pop    %rbp
   6:	41 5c                	pop    %r12
   8:	41 5d                	pop    %r13
   a:	41 5e                	pop    %r14
   c:	41 5f                	pop    %r15
   e:	c3                   	retq   
   f:	4c 3b 62 10          	cmp    0x10(%rdx),%r12
  13:	7d 08                	jge    0x1d
  15:	4c                   	rex.WR


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240916/202409161428.3cc6c1e1-oliver.sang@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* Re: [PATCH v2 02/11] unwind/x86: Add HAVE_USER_UNWIND
  2024-09-13 23:02 ` [PATCH v2 02/11] unwind/x86: Add HAVE_USER_UNWIND Josh Poimboeuf
@ 2024-09-16 11:46   ` Peter Zijlstra
  2024-10-20  8:09     ` Josh Poimboeuf
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2024-09-16 11:46 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James

On Sat, Sep 14, 2024 at 01:02:04AM +0200, Josh Poimboeuf wrote:
> Use ARCH_INIT_USER_FP_FRAME to describe how frame pointers are unwound
> on x86, and enable HAVE_USER_UNWIND accordingly so the user unwind
> interfaces can be used.
> 
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> ---
>  arch/x86/Kconfig                   |  1 +
>  arch/x86/include/asm/user_unwind.h | 11 +++++++++++
>  2 files changed, 12 insertions(+)
>  create mode 100644 arch/x86/include/asm/user_unwind.h
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 007bab9f2a0e..266edff59058 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -286,6 +286,7 @@ config X86
>  	select HAVE_UACCESS_VALIDATION		if HAVE_OBJTOOL
>  	select HAVE_UNSTABLE_SCHED_CLOCK
>  	select HAVE_USER_RETURN_NOTIFIER
> +	select HAVE_USER_UNWIND
>  	select HAVE_GENERIC_VDSO
>  	select VDSO_GETRANDOM			if X86_64
>  	select HOTPLUG_PARALLEL			if SMP && X86_64
> diff --git a/arch/x86/include/asm/user_unwind.h b/arch/x86/include/asm/user_unwind.h
> new file mode 100644
> index 000000000000..8c509c65cfb5
> --- /dev/null
> +++ b/arch/x86/include/asm/user_unwind.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_USER_UNWIND_H
> +#define _ASM_X86_USER_UNWIND_H
> +
> +#define ARCH_INIT_USER_FP_FRAME							\
> +	.ra_off		= (s32)sizeof(long) * -1,				\
> +	.cfa_off	= (s32)sizeof(long) * 2,				\
> +	.fp_off		= (s32)sizeof(long) * -2,				\
> +	.use_fp		= true,

What about compat crap?


* Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
  2024-09-15 11:11   ` Josh Poimboeuf
  2024-09-15 11:38     ` Steven Rostedt
@ 2024-09-16 14:08     ` Peter Zijlstra
  2024-09-16 15:39       ` Josh Poimboeuf
  2024-09-16 16:03       ` Steven Rostedt
  1 sibling, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2024-09-16 14:08 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Steven Rostedt, x86, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Mathieu Desnoyers

On Sun, Sep 15, 2024 at 01:11:11PM +0200, Josh Poimboeuf wrote:
> On Sat, Sep 14, 2024 at 08:12:46AM -0400, Steven Rostedt wrote:
> > I think the unwinder should have an interface itself that provides the
> > deferred unwinding, instead of having all the tracers implement
> > their own.
> > 
> > The user space unwinder (regardless of sframe or not) should provide a
> > way to say "I want a user space stack trace for here, but don't do it
> > yet. Just give me a cookie and then call my callback function with the
> > stack and cookie before going back to user space".
> 
> We (Steven, Mathieu and I) have been discussing this at GNU Cauldron and
> I think we're in basic agreement on this.
> 
> I think the biggest tweak we decided on is that the context id (aka
> "cookie") would be percpu.  Its initial value is (cpuid << 48).  It gets
> incremented for every entry from user space.

Why? What's the purpose of the cookie? This scheme seems unsound: pin
yourself on CPU0 and trigger 1<<48 unwinds while keeping CPU1 idle.

> > That is, we should have an interface like:
> > 
> > typedef void (unwinder_callback_t)(struct user_space_stack *, u64 cookie);

Just make it a void* and let the consumer figure it out.


* Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
  2024-09-16 14:08     ` Peter Zijlstra
@ 2024-09-16 15:39       ` Josh Poimboeuf
  2024-09-16 18:15         ` Peter Zijlstra
  2024-10-03  2:31         ` Steven Rostedt
  2024-09-16 16:03       ` Steven Rostedt
  1 sibling, 2 replies; 44+ messages in thread
From: Josh Poimboeuf @ 2024-09-16 15:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, x86, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Mathieu Desnoyers

On Mon, Sep 16, 2024 at 04:08:56PM +0200, Peter Zijlstra wrote:
> On Sun, Sep 15, 2024 at 01:11:11PM +0200, Josh Poimboeuf wrote:
> > On Sat, Sep 14, 2024 at 08:12:46AM -0400, Steven Rostedt wrote:
> > > I think the unwinder should have an interface itself that provides the
> > > deferred unwinding, instead of having all the tracers implement
> > > their own.
> > > 
> > > The user space unwinder (regardless of sframe or not) should provide a
> > > way to say "I want a user space stack trace for here, but don't do it
> > > yet. Just give me a cookie and then call my callback function with the
> > > stack and cookie before going back to user space".
> > 
> > We (Steven, Mathieu and I) have been discussing this at GNU Cauldron and
> > I think we're in basic agreement on this.
> > 
> > I think the biggest tweak we decided on is that the context id (aka
> > "cookie") would be percpu.  Its initial value is (cpuid << 48).  It gets
> > incremented for every entry from user space.
> 
> Why? What's the purpose of the cookie?

The cookie is incremented once per entry from userspace, when needed.

It's a unique id used by the tracers which is emitted with both kernel
and user stacktrace samples so the userspace tool can stitch them back
together.  Even if you have multiple tracers triggering at the same time
they can all share a single user trace.

Completely untested patch here that I hacked up today:

 git://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git sframe-2.1

To avoid races, unwind_user_deferred() can't be called from NMI.  The
tracers will need to trigger irq_work to call it.

> This scheme seems unsound

It may not be communicated well in this thread.  We had much discussion
at Cauldron the last few days.  The patch may be the best way to
communicate it at this point, though I'm sure I missed some details.

> pin yourself on CPU0 and trigger 1<<48 unwinds while keeping CPU1 idle.

Hm???

> > > That is, we should have an interface like:
> > > 
> > > typedef void (unwinder_callback_t)(struct user_space_stack *, u64 cookie);
> 
> Just make it a void* and let the consumer figure it out.

I think it makes sense to manage the cookie in the deferred unwinder
since the need to correlate the stack traces is an inherent part of the
deferral.

-- 
Josh


* Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
  2024-09-16 14:08     ` Peter Zijlstra
  2024-09-16 15:39       ` Josh Poimboeuf
@ 2024-09-16 16:03       ` Steven Rostedt
  1 sibling, 0 replies; 44+ messages in thread
From: Steven Rostedt @ 2024-09-16 16:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Poimboeuf, x86, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Mathieu Desnoyers

On Mon, 16 Sep 2024 16:08:56 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > 
> > I think the biggest tweak we decided on is that the context id (aka
> > "cookie") would be percpu.  Its initial value is (cpuid << 48).  It gets
> > incremented for every entry from user space.  
> 
> Why? What's the purpose of the cookie? This scheme seems unsound, pin
> yourself on CPU0 and trigger 1<<48 unwinds while keeping CPU1 idle.
> 
> > > That is, we should have an interface like:
> > > 
> > > typedef void (unwinder_callback_t)(struct user_space_stack *, u64 cookie);  
> 
> Just make it a void* and let the consumer figure it out.

Let me try to explain the rationale for the "cookie". It's actually a "context_cookie".
The cookie is unique for every time user space enters the kernel.

Basically it's a "tag" (we could call it that too, but "cookie" seemed
more appropriate as it follows what web browsers do). The idea is this:

 * You're in interrupt context and want a user space back trace, and
   request to have a user stack trace when the task goes back to user
   space. Now the unwinder will return the current "context cookie" and
   will use that same cookie when it creates the user stack on exit
   back to user space.

* If this happens in a long system call that does many stack traces
  (recording the kernel stack and also wanting the user stack), the
  user trace is not recorded at the time it is requested. We need a way
  to tag each such event with the user stack trace that will be
  recorded when going back to user space. We can't pin the task to the
  CPU when we want to profile it. The cookie/tag needs to be the same
  for every request that happens within a single entry into the kernel.
  It must be different for another entry into the kernel, as the user
  stack will be different then.

  Basically, think of each cookie as representing a single user stack.

* When going back to user space, the ptrace path is taken and the user
  stack trace is recorded. The unwinder will then call the tracers that
  requested it, passing the stack trace and the cookie that represents
  it.

Here's an example:

system_call() {

	<interrupt> trigger user stack trace: assign cookie 123
		tracer records kernel stack trace and adds cookie 123.

	<interrupt> trigger user stack trace: assign cookie 123
		tracer records kernel stack trace and adds cookie 123.

	<interrupt> trigger user stack trace: assign cookie 123
		tracer records kernel stack trace and adds cookie 123.


	<return to user space, take ptrace path>

	<record user stack trace>
	call all the tracers with user stack trace with cookie 123.
}

system_call() {

	<interrupt> trigger user stack trace: assign cookie 124
		tracer records kernel stack trace and adds cookie 124.

	<return to user space, take ptrace path>

	<record user stack trace>
	call all the tracers with user stack trace with cookie 124.
}


Then the tracers can post process the events and append the user stack
trace with cookie 123 on top of the kernel stack events that had cookie
123, and append the user stack trace with cookie 124 on top of the
kernel stack events that had cookie 124.
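In post-processing, that stitching step amounts to a lookup by cookie;
a sketch with hypothetical record types:

```c
#include <stddef.h>
#include <stdint.h>

struct kernel_sample { uint64_t cookie; /* + kernel stack data */ };
struct user_stack    { uint64_t cookie; /* + user stack data */ };

/*
 * Find the user stack recorded for the same kernel entry as a given
 * kernel sample, or NULL if none was recorded.
 */
static const struct user_stack *
find_user_stack(const struct user_stack *stacks, size_t n, uint64_t cookie)
{
	for (size_t i = 0; i < n; i++)
		if (stacks[i].cookie == cookie)
			return &stacks[i];
	return NULL;
}
```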

-- Steve


* Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
  2024-09-16 15:39       ` Josh Poimboeuf
@ 2024-09-16 18:15         ` Peter Zijlstra
  2024-09-16  0:15           ` Mathieu Desnoyers
  2024-09-16 22:46           ` Steven Rostedt
  2024-10-03  2:31         ` Steven Rostedt
  1 sibling, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2024-09-16 18:15 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Steven Rostedt, x86, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Mathieu Desnoyers

On Mon, Sep 16, 2024 at 05:39:53PM +0200, Josh Poimboeuf wrote:

> The cookie is incremented once per entry from userspace, when needed.
> 
> It's a unique id used by the tracers which is emitted with both kernel
> and user stacktrace samples so the userspace tool can stitch them back
> together.  Even if you have multiple tracers triggering at the same time
> they can all share a single user trace.

But perf doesn't need this at all, right? It knows that the next deferred
trace for that tid will be the one.

> > This scheme seems unsound

> > pin yourself on CPU0 and trigger 1<<48 unwinds while keeping CPU1 idle.
> 
> Hm???

The point being that it is possible to wrap one CPU into the id space of
another CPU. It is not trivial, but someone who wants to can make it
happen.

Combined I don't see the need to force this into this scheme, carry an
opaque (void*) pointer and let the user do whatever it wants, perf for
instance can pass NULL and not do anything like this.
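The opaque-pointer interface suggested here could look roughly like the following sketch. All names are hypothetical, and perf would simply pass NULL as the data pointer:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch of an opaque-pointer interface: the requester hands the
 * unwinder a callback plus a void *, and gets both back when the
 * deferred user stack trace is ready.  All names are hypothetical.
 */
struct unwind_request {
	void (*callback)(const unsigned long *entries, size_t nr, void *data);
	void *data;		/* opaque; may be NULL (e.g. for perf) */
};

/* Deliver a completed trace back to whoever requested it. */
static void unwind_deliver(const struct unwind_request *req,
			   const unsigned long *entries, size_t nr)
{
	req->callback(entries, nr, req->data);
}

/* Example callback: accumulates the number of frames delivered. */
static void count_frames(const unsigned long *entries, size_t nr, void *data)
{
	(void)entries;
	*(size_t *)data += nr;
}
```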


* Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
  2024-09-16 18:15         ` Peter Zijlstra
  2024-09-16  0:15           ` Mathieu Desnoyers
@ 2024-09-16 22:46           ` Steven Rostedt
  2024-09-17 21:58             ` Namhyung Kim
  1 sibling, 1 reply; 44+ messages in thread
From: Steven Rostedt @ 2024-09-16 22:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Poimboeuf, x86, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Mathieu Desnoyers

On Mon, 16 Sep 2024 20:15:45 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Sep 16, 2024 at 05:39:53PM +0200, Josh Poimboeuf wrote:
> 
> > The cookie is incremented once per entry from userspace, when needed.
> > 
> > It's a unique id used by the tracers which is emitted with both kernel
> > and user stacktrace samples so the userspace tool can stitch them back
> > together.  Even if you have multiple tracers triggering at the same time
> > they can all share a single user trace.  
> 
> But perf doesn't need this at all, right? It knows that the next deferred
> trace for that tid will be the one.

Is that because perf uses per task buffers? Will perf know this if it
uses a global buffer? What does perf do with "-a"?

> 
> > > This scheme seems unsound  
> 
> > > pin yourself on CPU0 and trigger 1<<48 unwinds while keeping CPU1 idle.  
> > 
> > Hm???  
> 
> The point being that it is possible to wrap one CPU into the id space of
> another CPU. It is not trivial, but someone who wants to can make it
> happen.
> 
> Combined I don't see the need to force this into this scheme, carry an
> opaque (void*) pointer and let the user do whatever it wants, perf for
> instance can pass NULL and not do anything like this.

That would put a huge burden on those tracers. Specifically ftrace.
Just because perf doesn't need it, it doesn't mean it can't be part of
the infrastructure. The cookie isn't even per-cpu. The counter is just
generated per cpu to avoid races. The cookie will have the CPU as part
of the bits, plus the per-cpu counter. This will give us a unique value
across all CPUs without having to do any atomic operations. It's just
that the cookie must be unique for every time a task enters the kernel,
and stay the same while it is in the kernel.
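As a rough sketch of that layout (the 48/16 bit split and the names are illustrative; a real implementation would use true per-CPU data with preemption disabled):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the cookie layout described above: the CPU number in the
 * high bits plus a free-running per-CPU counter in the low bits,
 * giving a system-wide unique value without atomic operations.
 * The 48/16 split is illustrative.
 */
#define COOKIE_CPU_SHIFT	48

static uint64_t percpu_counter[2];	/* stand-in for real per-CPU data */

static uint64_t make_cookie(unsigned int cpu)
{
	/* Caller is assumed to have preemption disabled on @cpu. */
	uint64_t ctr = ++percpu_counter[cpu] &
		       ((1ULL << COOKIE_CPU_SHIFT) - 1);

	return ((uint64_t)cpu << COOKIE_CPU_SHIFT) | ctr;
}
```

The mask keeps a counter overflow from bleeding into the CPU bits, which is the concern Mathieu raises elsewhere in this thread.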

So if the task enters the kernel on CPU 1 and a trace is requested, the
tracer still needs to save that event when the request occurred (for
example, the kernel stack trace). It needs a unique identifier that
will be attached to the user stack trace when it is made, so that post
processing can tell where to append that stack.

Then the task can migrate to several other CPUs, but it will still have
the same cookie as when it received it on the first CPU. The cookie
only gets updated on the task when it goes back to user space.

ftrace allows for several instances that could be filtering on the same
or different tasks. One instance could filter 3 tasks, another 2 tasks,
where one of those tasks is in both instances. The user stack
infrastructure should give a way to easily map the request to the stack
via some kind of identifier.

What would perf do on dropped events? Say a system call happened, perf
requested a user stack trace, then the events were dropped, and when it
started again, it was in a system call, how would perf know it is in
the same system call? I guess the callback when the dropped user stack
trace occurred would tell perf to reset. Still, it adds a pretty big
burden on the tracers to handle these corner cases, whereas a simple
identifier is much simpler and more robust.

-- Steve


* Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
  2024-09-16  0:33             ` Mathieu Desnoyers
@ 2024-09-17  0:37               ` Mathieu Desnoyers
  0 siblings, 0 replies; 44+ messages in thread
From: Mathieu Desnoyers @ 2024-09-17  0:37 UTC (permalink / raw)
  To: Peter Zijlstra, Josh Poimboeuf
  Cc: Steven Rostedt, x86, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James

On 2024-09-16 02:33, Mathieu Desnoyers wrote:
> On 2024-09-16 02:15, Mathieu Desnoyers wrote:
>> On 2024-09-16 20:15, Peter Zijlstra wrote:
>> [...]
>>> The point being that it is possible to wrap one CPU into the id space of
>>> another CPU. It is not trivial, but someone who wants to can make it
>>> happen.
>>
>> I agree that the overflow of the free-running counter bleeding into the
>> CPU numbers is something we want to prevent. We don't care if this
>> counter overflows after day/months/years for sake of correlation
>> within a system call, but we do care about the fact that this
>> free-running counter could be made to overflow after a very long
>> time while the system runs, and then we reach a state where the
>> CPU numbers are mixed up, which leads to short-term correlation
>> issues.
>>
>> I would recommend this layout for this 64-bit value instead:
>>
>> low-bits: cpu number
>> high-bits: free-running counter
>>
>> This way, we eliminate any carry from overflow into the cpu number bits.
> 
> Even better: AFAIR from the discussion I had with Steven and Josh, we 
> intend
> to have the cookie stored to/read from the task struct with interrupts off,
> we can simply do:
> 
> struct stackwalk_cookie {
>      uint64_t counter;    /* free running per-cpu counter value */
>      int cpu;        /* cpu on which the counter was snapshot. */
> };
> 
> So we don't have to deal with any overflow trickiness, there is no need for
> bit-shifting, and we get a full 64-bit free-running counter 
> independently of
> the number of CPUs.
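A minimal user-space sketch of this struct-based scheme, with the per-CPU counter simulated by a plain array, might look like:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct stackwalk_cookie {
	uint64_t counter;	/* free-running per-CPU counter value */
	int cpu;		/* CPU on which the counter was snapshot */
};

static uint64_t percpu_ctr[2];	/* stand-in for real per-CPU counters */

/* Snapshot a new cookie; assumes interrupts are off on @cpu. */
static struct stackwalk_cookie cookie_snapshot(int cpu)
{
	struct stackwalk_cookie c = {
		.counter = ++percpu_ctr[cpu],
		.cpu = cpu,
	};
	return c;
}

/* Two events belong to the same kernel entry iff both fields match. */
static bool cookie_equal(struct stackwalk_cookie a, struct stackwalk_cookie b)
{
	return a.counter == b.counter && a.cpu == b.cpu;
}
```

Because the CPU number is a separate field, counters on different CPUs can never collide, with no bit-shifting or overflow masking needed.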

Sorry, my laptop had the wrong date when I sent this email. Re-sending to
make sure it gets seen in the correct order time-wise.

Thanks,

Mathieu

> 
> Thanks,
> 
> Mathieu
> 
> 
> 
>>
>> Thanks,
>>
>> Mathieu
>>
>>
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



* Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
  2024-09-16 22:46           ` Steven Rostedt
@ 2024-09-17 21:58             ` Namhyung Kim
  2024-09-18  5:14               ` Mathieu Desnoyers
  0 siblings, 1 reply; 44+ messages in thread
From: Namhyung Kim @ 2024-09-17 21:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Josh Poimboeuf, x86, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Mathieu Desnoyers

Hello,

On Mon, Sep 16, 2024 at 06:46:45PM -0400, Steven Rostedt wrote:
> On Mon, 16 Sep 2024 20:15:45 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Mon, Sep 16, 2024 at 05:39:53PM +0200, Josh Poimboeuf wrote:
> > 
> > > The cookie is incremented once per entry from userspace, when needed.
> > > 
> > > It's a unique id used by the tracers which is emitted with both kernel
> > > and user stacktrace samples so the userspace tool can stitch them back
> > > together.  Even if you have multiple tracers triggering at the same time
> > > they can all share a single user trace.  
> > 
> > But perf doesn't need this at all, right? It knows that the next deferred
> > trace for that tid will be the one.

Well.. technically you can sample without a tid.  But I'm not sure how
useful it'd be to collect callchains without one.

> 
> Is that because perf uses per task buffers? Will perf know this if it
> uses a global buffer? What does perf do with "-a"?

Then it'd use per-cpu ring buffers.  But each sample would have a pid/tid
pair and a timestamp, so the perf tools can match it with a deferred callchain.

Thanks,
Namhyung


* Re: [PATCH v2 05/11] perf/x86: Use user_unwind interface
  2024-09-16  6:48   ` kernel test robot
@ 2024-09-17 22:01     ` Namhyung Kim
  0 siblings, 0 replies; 44+ messages in thread
From: Namhyung Kim @ 2024-09-17 22:01 UTC (permalink / raw)
  To: kernel test robot
  Cc: Josh Poimboeuf, oe-lkp, lkp, linux-perf-users, linux-kernel, x86,
	Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Mark Brown, linux-toolchains, Jordan Rome, Sam James

Hello,

On Mon, Sep 16, 2024 at 02:48:35PM +0800, kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed "kernel_BUG_at_kernel/unwind/user.c" on:
> 
> commit: 164c5ae4072303c9eb4e263115a5e70d5a3cc052 ("[PATCH v2 05/11] perf/x86: Use user_unwind interface")
> url: https://github.com/intel-lab-lkp/linux/commits/Josh-Poimboeuf/unwind-Introduce-generic-user-space-unwinding-interface/20240914-070619
> base: https://git.kernel.org/cgit/linux/kernel/git/perf/perf-tools-next.git perf-tools-next
> patch link: https://lore.kernel.org/all/daf3f59e0d14ee11b45ad6735b8a211a3c7534dc.1726268190.git.jpoimboe@kernel.org/
> patch subject: [PATCH v2 05/11] perf/x86: Use user_unwind interface
> 
> in testcase: fsmark
> version: fsmark-x86_64-2628be5-1_20240224
> with following parameters:
> 
> 	iterations: 1x
> 	nr_threads: 64t
> 	disk: 1BRD_48G
> 	fs: xfs
> 	filesize: 4M
> 	test_size: 24G
> 	sync_method: NoSync
> 	cpufreq_governor: performance
> 
> 
> 
> compiler: gcc-12
> test machine: 96 threads 2 sockets Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz (Cascade Lake) with 128G memory
> 
> (please refer to attached dmesg/kmsg for entire log/backtrace)
> 
> 
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202409161428.3cc6c1e1-oliver.sang@intel.com
> 
> 
> [   53.036370][    C3] ------------[ cut here ]------------
> [   53.036374][    C3] kernel BUG at kernel/unwind/user.c:39!

I also noticed this and the following patch would fix it.

Thanks,
Namhyung


diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 3a7b14cf522b4139..3c8f1deb6d34ec37 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -80,8 +80,10 @@ int user_unwind_start(struct user_unwind_state *state,
 	case USER_UNWIND_TYPE_SFRAME:
 		if (!sframe_possible)
 			return -EINVAL;
+		state->type = type;
 		break;
 	case USER_UNWIND_TYPE_FP:
+		state->type = type;
 		break;
 	default:
 		return -EINVAL;


* Re: [PATCH v2 09/11] perf: Introduce deferred user callchains
  2024-09-13 23:02 ` [PATCH v2 09/11] perf: Introduce deferred user callchains Josh Poimboeuf
@ 2024-09-17 22:07   ` Namhyung Kim
  0 siblings, 0 replies; 44+ messages in thread
From: Namhyung Kim @ 2024-09-17 22:07 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James

On Sat, Sep 14, 2024 at 01:02:11AM +0200, Josh Poimboeuf wrote:
> Instead of attempting to unwind user space from the NMI handler, defer
> it to run in task context by sending a self-IPI and then scheduling the
> unwind to run in the IRQ's exit task work before returning to user space.
> 
> This allows the user stack page to be paged in if needed, avoids
> duplicate unwinds for kernel-bound workloads, and prepares for SFrame
> unwinding (so .sframe sections can be paged in on demand).
> 
> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> ---
[SNIP]  
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 19fd7bd38ecf..5fc7c5156287 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6854,11 +6860,70 @@ static void perf_pending_irq(struct irq_work *entry)
>  		perf_swevent_put_recursion_context(rctx);
>  }
>  
> +struct perf_callchain_deferred_event {
> +	struct perf_event_header	header;
> +	struct perf_callchain_entry	callchain;
> +};
> +
> +#define PERF_CALLCHAIN_DEFERRED_EVENT_SIZE				\
> +	sizeof(struct perf_callchain_deferred_event) +			\
> +	(sizeof(__u64) * 1) + /* PERF_CONTEXT_USER */			\
> +	(sizeof(__u64) * PERF_MAX_STACK_DEPTH)
> +
> +static void perf_event_callchain_deferred(struct perf_event *event)
> +{
> +	struct pt_regs *regs = task_pt_regs(current);
> +	struct perf_callchain_entry *callchain;
> +	struct perf_output_handle handle;
> +	struct perf_sample_data data;
> +	unsigned char buf[PERF_CALLCHAIN_DEFERRED_EVENT_SIZE];
> +	struct perf_callchain_entry_ctx ctx;
> +	struct perf_callchain_deferred_event *deferred_event;
> +
> +	deferred_event = (void *)&buf;
> +
> +	callchain = &deferred_event->callchain;
> +	callchain->nr = 0;
> +
> +	ctx.entry		= callchain;
> +	ctx.max_stack		= MIN(event->attr.sample_max_stack,
> +				      PERF_MAX_STACK_DEPTH);
> +	ctx.nr			= 0;
> +	ctx.contexts		= 0;
> +	ctx.contexts_maxed	= false;
> +
> +	perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
> +	perf_callchain_user_deferred(&ctx, regs);
> +
> +	deferred_event->header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
> +	deferred_event->header.misc = 0;

I think we can use PERF_RECORD_MISC_USER here as it's about user
callchains.

> +	deferred_event->header.size = sizeof(*deferred_event) +
> +				      (callchain->nr * sizeof(u64));
> +
> +	perf_event_header__init_id(&deferred_event->header, &data, event);
> +
> +	if (perf_output_begin(&handle, &data, event,
> +			      deferred_event->header.size))
> +		return;
> +
> +	perf_output_copy(&handle, deferred_event, deferred_event->header.size);

You should not copy the whole event size because it also contains the
id_sample parts below.  Maybe something like this instead?

	perf_output_put(&handle, *deferred_event);
	__output_copy(&handle, callchain->ip, callchain->nr * sizeof(u64));

Thanks,
Namhyung


> +	perf_event__output_id_sample(event, &handle, &data);
> +	perf_output_end(&handle);
> +}
> +
>  static void perf_pending_task(struct callback_head *head)
>  {
>  	struct perf_event *event = container_of(head, struct perf_event, pending_task);
>  	int rctx;
>  
> +	if (!is_software_event(event)) {
> +		if (event->pending_callchain) {
> +			perf_event_callchain_deferred(event);
> +			event->pending_callchain = 0;
> +		}
> +		return;
> +	}
> +
>  	/*
>  	 * All accesses to the event must belong to the same implicit RCU read-side
>  	 * critical section as the ->pending_work reset. See comment in
> @@ -7688,6 +7753,8 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
>  	bool user   = !event->attr.exclude_callchain_user;
>  	const u32 max_stack = event->attr.sample_max_stack;
>  	struct perf_callchain_entry *callchain;
> +	bool defer_user = IS_ENABLED(CONFIG_HAVE_PERF_CALLCHAIN_DEFERRED) &&
> +			  event->attr.defer_callchain;
>  
>  	/* Disallow cross-task user callchains. */
>  	user &= !event->ctx->task || event->ctx->task == current;
> @@ -7695,7 +7762,14 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
>  	if (!kernel && !user)
>  		return &__empty_callchain;
>  
> -	callchain = get_perf_callchain(regs, kernel, user, max_stack, true);
> +	callchain = get_perf_callchain(regs, kernel, user, max_stack, true,
> +				       defer_user);
> +
> +	if (user && defer_user && !event->pending_callchain) {
> +		event->pending_callchain = 1;
> +		irq_work_queue(&event->pending_irq);
> +	}
> +
>  	return callchain ?: &__empty_callchain;
>  }
>  
> -- 
> 2.46.0
> 


* Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
  2024-09-17 21:58             ` Namhyung Kim
@ 2024-09-18  5:14               ` Mathieu Desnoyers
  0 siblings, 0 replies; 44+ messages in thread
From: Mathieu Desnoyers @ 2024-09-18  5:14 UTC (permalink / raw)
  To: Namhyung Kim, Steven Rostedt
  Cc: Peter Zijlstra, Josh Poimboeuf, x86, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James

On 2024-09-17 23:58, Namhyung Kim wrote:
> Hello,
> 
> On Mon, Sep 16, 2024 at 06:46:45PM -0400, Steven Rostedt wrote:
>> On Mon, 16 Sep 2024 20:15:45 +0200
>> Peter Zijlstra <peterz@infradead.org> wrote:
>>
>>> On Mon, Sep 16, 2024 at 05:39:53PM +0200, Josh Poimboeuf wrote:
>>>
>>>> The cookie is incremented once per entry from userspace, when needed.
>>>>
>>>> It's a unique id used by the tracers which is emitted with both kernel
>>>> and user stacktrace samples so the userspace tool can stitch them back
>>>> together.  Even if you have multiple tracers triggering at the same time
>>>> they can all share a single user trace.
>>>
>>> But perf doesn't need this at all, right? It knows that the next deferred
>>> trace for that tid will be the one.
> 
> Well.. technically you can sample without tid.  But I'm not sure how
> much it'd be useful if you collect callchains without tid.
> 
>>
>> Is that because perf uses per task buffers? Will perf know this if it
>> uses a global buffer? What does perf do with "-a"?
> 
> Then it'd use per-cpu ring buffers.  But each sample would have a pid/tid
> pair and a timestamp, so the perf tools can match it with a deferred callchain.

That semantic correlation based on trace information should work fine if
you do not miss important events in the trace. What the unique id cookie
provides is robustness against confusion that can arise when events are
discarded.

Discarded events happen when the event throughput is too high for the
ring buffer to handle. The following type of confusion can then arise:

- If you miss a stack trace event and then a stack-sample-request
event, then the post-processing tools can incorrectly infer causality
between a prior stack-sample-request and a following stack trace for
the same tid.

- If you miss even more information about the end/beginning of lifetime
of two threads, post-processing can be confused and associate a prior
stack-sample-request from tid=N with a later stack trace for tid=N where
N was re-used due to an exit/clone sequence.

Saving the unique id cookie along with the stack-sample-request and with
the stack trace allows more robust causality relationship between the
two. Showing reliable causality information may not be super important
for profiling use-cases, but it is important for event tracers.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



* Re: [PATCH v2 03/11] unwind: Introduce SFrame user space unwinding
  2024-09-14 11:23   ` Steven Rostedt
@ 2024-10-01 18:20     ` Indu Bhagat
  2024-10-01 18:36       ` Steven Rostedt
  0 siblings, 1 reply; 44+ messages in thread
From: Indu Bhagat @ 2024-10-01 18:20 UTC (permalink / raw)
  To: Steven Rostedt, Josh Poimboeuf
  Cc: x86, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Ian Rogers, Adrian Hunter, linux-perf-users,
	Mark Brown, linux-toolchains, Jordan Rome, Sam James

On 9/14/24 4:23 AM, Steven Rostedt wrote:
> On Sat, 14 Sep 2024 01:02:05 +0200
> Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> 
>> diff --git a/include/linux/sframe.h b/include/linux/sframe.h
>> new file mode 100644
>> index 000000000000..3a44f76929e2
>> --- /dev/null
>> +++ b/include/linux/sframe.h
>> @@ -0,0 +1,46 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_SFRAME_H
>> +#define _LINUX_SFRAME_H
>> +
>> +#include <linux/mm_types.h>
>> +
>> +struct sframe_file {
>> +	unsigned long sframe_addr, text_start, text_end;
> 
> Please make each entry a separate line:
> 
> 	unsigned long		sframe_addr;
> 	unsigned long		text_start;
> 	unsigned long		text_end;
> 
> It may be fine for declaring variables like this in a function, but it
> should not be done in a structure.
> 
>> +};
>> +
>> +struct user_unwind_frame;
>> +
>> +#ifdef CONFIG_HAVE_USER_UNWIND_SFRAME
>> +
>> +#define INIT_MM_SFRAME .sframe_mt = MTREE_INIT(sframe_mt, 0)
>> +
>> +extern void sframe_free_mm(struct mm_struct *mm);
>> +
>> +extern int __sframe_add_section(struct sframe_file *file);
>> +extern int sframe_add_section(unsigned long sframe_addr, unsigned long text_start, unsigned long text_end);
>> +extern int sframe_remove_section(unsigned long sframe_addr);
>> +extern int sframe_find(unsigned long ip, struct user_unwind_frame *frame);
>> +
>> +static inline bool current_has_sframe(void)
>> +{
>> +	struct mm_struct *mm = current->mm;
>> +
>> +	return mm && !mtree_empty(&mm->sframe_mt);
>> +}
>> +
>> +#else /* !CONFIG_HAVE_USER_UNWIND_SFRAME */
>> +
>> +#define INIT_MM_SFRAME
>> +
>> +static inline void sframe_free_mm(struct mm_struct *mm) {}
>> +
>> +static inline int __sframe_add_section(struct sframe_file *file) { return -EINVAL; }
>> +static inline int sframe_add_section(unsigned long sframe_addr, unsigned long text_start, unsigned long text_end) { return -EINVAL; }
>> +static inline int sframe_remove_section(unsigned long sframe_addr) { return -EINVAL; }
>> +static inline int sframe_find(unsigned long ip, struct user_unwind_frame *frame) { return -EINVAL; }
>> +
>> +static inline bool current_has_sframe(void) { return false; }
>> +
>> +#endif /* CONFIG_HAVE_USER_UNWIND_SFRAME */
>> +
>> +#endif /* _LINUX_SFRAME_H */
>> diff --git a/include/linux/user_unwind.h b/include/linux/user_unwind.h
>> index 0a19ac6c92b2..8003f9d35405 100644
>> --- a/include/linux/user_unwind.h
>> +++ b/include/linux/user_unwind.h
>> @@ -7,6 +7,7 @@
>>   enum user_unwind_type {
>>   	USER_UNWIND_TYPE_AUTO,
>>   	USER_UNWIND_TYPE_FP,
>> +	USER_UNWIND_TYPE_SFRAME,
>>   };
>>   
>>   struct user_unwind_frame {
>> diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
>> index b54b313bcf07..b2aca31e1a49 100644
>> --- a/include/uapi/linux/elf.h
>> +++ b/include/uapi/linux/elf.h
>> @@ -39,6 +39,7 @@ typedef __s64	Elf64_Sxword;
>>   #define PT_GNU_STACK	(PT_LOOS + 0x474e551)
>>   #define PT_GNU_RELRO	(PT_LOOS + 0x474e552)
>>   #define PT_GNU_PROPERTY	(PT_LOOS + 0x474e553)
>> +#define PT_GNU_SFRAME	(PT_LOOS + 0x474e554)
>>   
>>   
>>   /* ARM MTE memory tag segment type */
>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>> index 35791791a879..69511077c910 100644
>> --- a/include/uapi/linux/prctl.h
>> +++ b/include/uapi/linux/prctl.h
>> @@ -328,4 +328,7 @@ struct prctl_mm_map {
>>   # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC	0x10 /* Clear the aspect on exec */
>>   # define PR_PPC_DEXCR_CTRL_MASK		0x1f
>>   
>> +#define PR_ADD_SFRAME			74
>> +#define PR_REMOVE_SFRAME		75
>> +
>>   #endif /* _LINUX_PRCTL_H */
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index cc760491f201..a216f091edfb 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -104,6 +104,7 @@
>>   #include <linux/rseq.h>
>>   #include <uapi/linux/pidfd.h>
>>   #include <linux/pidfs.h>
>> +#include <linux/sframe.h>
>>   
>>   #include <asm/pgalloc.h>
>>   #include <linux/uaccess.h>
>> @@ -923,6 +924,7 @@ void __mmdrop(struct mm_struct *mm)
>>   	mm_pasid_drop(mm);
>>   	mm_destroy_cid(mm);
>>   	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
>> +	sframe_free_mm(mm);
>>   
>>   	free_mm(mm);
>>   }
>> @@ -1249,6 +1251,13 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
>>   #endif
>>   }
>>   
>> +static void mm_init_sframe(struct mm_struct *mm)
>> +{
>> +#ifdef CONFIG_HAVE_USER_UNWIND_SFRAME
>> +	mt_init(&mm->sframe_mt);
>> +#endif
>> +}
>> +
>>   static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>>   	struct user_namespace *user_ns)
>>   {
>> @@ -1280,6 +1289,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>>   	mm->pmd_huge_pte = NULL;
>>   #endif
>>   	mm_init_uprobes_state(mm);
>> +	mm_init_sframe(mm);
>>   	hugetlb_count_init(mm);
>>   
>>   	if (current->mm) {
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index 3a2df1bd9f64..e4d2b64f4ae4 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -64,6 +64,7 @@
>>   #include <linux/rcupdate.h>
>>   #include <linux/uidgid.h>
>>   #include <linux/cred.h>
>> +#include <linux/sframe.h>
>>   
>>   #include <linux/nospec.h>
>>   
>> @@ -2782,6 +2783,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>>   	case PR_RISCV_SET_ICACHE_FLUSH_CTX:
>>   		error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
>>   		break;
>> +	case PR_ADD_SFRAME:
>> +		if (arg5)
>> +			return -EINVAL;
>> +		error = sframe_add_section(arg2, arg3, arg4);
>> +		break;
>> +	case PR_REMOVE_SFRAME:
>> +		if (arg3 || arg4 || arg5)
>> +			return -EINVAL;
>> +		error = sframe_remove_section(arg2);
>> +		break;
>>   	default:
>>   		error = -EINVAL;
>>   		break;
>> diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
>> index eb466d6a3295..6f202c5840cf 100644
>> --- a/kernel/unwind/Makefile
>> +++ b/kernel/unwind/Makefile
>> @@ -1 +1,2 @@
>>   obj-$(CONFIG_HAVE_USER_UNWIND) += user.o
>> +obj-$(CONFIG_HAVE_USER_UNWIND_SFRAME) += sframe.o
>> diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
>> new file mode 100644
>> index 000000000000..3e4d29e737a1
>> --- /dev/null
>> +++ b/kernel/unwind/sframe.c
>> @@ -0,0 +1,420 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/srcu.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/mm.h>
>> +#include <linux/sframe.h>
>> +#include <linux/user_unwind.h>
>> +
>> +#include "sframe.h"
>> +
>> +#define SFRAME_FILENAME_LEN 32
>> +
>> +struct sframe_section {
>> +	struct rcu_head rcu;
>> +
>> +	unsigned long sframe_addr;
>> +	unsigned long text_addr;
>> +
>> +	unsigned long fdes_addr;
>> +	unsigned long fres_addr;
>> +	unsigned int  fdes_nr;
>> +	signed char ra_off, fp_off;
>> +};
>> +
>> +DEFINE_STATIC_SRCU(sframe_srcu);
>> +
>> +#define __SFRAME_GET_USER(out, user_ptr, type)				\
>> +({									\
>> +	type __tmp;							\
>> +	if (get_user(__tmp, (type *)user_ptr))				\
>> +		return -EFAULT;						\
>> +	user_ptr += sizeof(__tmp);					\
>> +	out = __tmp;							\
>> +})
>> +
>> +#define SFRAME_GET_USER_SIGNED(out, user_ptr, size)			\
>> +({									\
>> +	switch (size) {							\
>> +	case 1:								\
>> +		__SFRAME_GET_USER(out, user_ptr, s8);			\
>> +		break;							\
>> +	case 2:								\
>> +		__SFRAME_GET_USER(out, user_ptr, s16);			\
>> +		break;							\
>> +	case 4:								\
>> +		__SFRAME_GET_USER(out, user_ptr, s32);			\
>> +		break;							\
>> +	default:							\
>> +		return -EINVAL;						\
>> +	}								\
>> +})
>> +
>> +#define SFRAME_GET_USER_UNSIGNED(out, user_ptr, size)			\
>> +({									\
>> +	switch (size) {							\
>> +	case 1:								\
>> +		__SFRAME_GET_USER(out, user_ptr, u8);			\
>> +		break;							\
>> +	case 2:								\
>> +		__SFRAME_GET_USER(out, user_ptr, u16);			\
>> +		break;							\
>> +	case 4:								\
>> +		__SFRAME_GET_USER(out, user_ptr, u32);			\
>> +		break;							\
>> +	default:							\
>> +		return -EINVAL;						\
>> +	}								\
>> +})
>> +
>> +static unsigned char fre_type_to_size(unsigned char fre_type)
>> +{
>> +	if (fre_type > 2)
>> +		return 0;
>> +	return 1 << fre_type;
>> +}
>> +
>> +static unsigned char offset_size_enum_to_size(unsigned char off_size)
>> +{
>> +	if (off_size > 2)
>> +		return 0;
>> +	return 1 << off_size;
>> +}
>> +
>> +static int find_fde(struct sframe_section *sec, unsigned long ip,
>> +		    struct sframe_fde *fde)
>> +{
>> +	s32 func_off, ip_off;
>> +	struct sframe_fde __user *first, *last, *mid, *found;
> 
> Need to initialize found = NULL;
> 
>> +
>> +	ip_off = ip - sec->sframe_addr;
>> +
>> +	first = (void *)sec->fdes_addr;
>> +	last = first + sec->fdes_nr;
>> +	while (first <= last) {
> 
> So we trust user space to have this table sorted?
> 

GNU ld creates this table sorted when linking .sframe sections and 
sets SFRAME_F_FDE_SORTED in the flags of the output .sframe section.  In 
the current patch, I see that __sframe_add_section() includes a check 
for SFRAME_F_FDE_SORTED before admitting SFrame sections.

So proceeding here with the assumption that the SFrame FDE list is 
sorted should work fine.

>> +		mid = first + ((last - first) / 2);
>> +		if (get_user(func_off, (s32 *)mid))
>> +			return -EFAULT;
>> +		if (ip_off >= func_off) {
>> +			found = mid;
> 
> If it's not sorted, this can return the wrong value.
> 
> We should have some check that has something like:
> 
> 	s32 low_func_off = 0, high_func_off = 0;
> 
> 			if (low_func_off && low_func_off > func_off)
> 				return -EINVAL;	
> 			low_func_off = func_off;
> 
>> +			first = mid + 1;
>> +		} else
> 			{ /* Note, this needs a bracket anyway, because
> 			     rules are, if one if block has a bracket,
> 			     the other needs one too */
> 
> 			if (high_func_off && high_func_off < func_off)
> 				return -EINVAL;
> 
> 			high_func_off = func_off;
> 
>> +			last = mid - 1;
> 		}
> 
>> +	}
>> +
>> +	if (!found)
>> +		return -EINVAL;
>> +
>> +	if (copy_from_user(fde, found, sizeof(*fde)))
>> +		return -EFAULT;
>> +
>> +	return 0;
>> +}
> 
> -- Steve



* Re: [PATCH v2 03/11] unwind: Introduce SFrame user space unwinding
  2024-10-01 18:20     ` Indu Bhagat
@ 2024-10-01 18:36       ` Steven Rostedt
  2024-10-02  8:18         ` Florian Weimer
  0 siblings, 1 reply; 44+ messages in thread
From: Steven Rostedt @ 2024-10-01 18:36 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James

On Tue, 1 Oct 2024 11:20:35 -0700
Indu Bhagat <indu.bhagat@oracle.com> wrote:

> > So we trust user space to have this table sorted?
> >   
> 
> GNU ld will create this table sorted when linking .sframe sections and 
> will set SFRAME_F_FDE_SORTED in flags in the output .sframe section.  In 
> the current patch, I see the __sframe_add_section () includes a check 
> for SFRAME_F_FDE_SORTED for admitting SFrame sections.
> 
> So proceeding here with the assumption that the SFrame FDE list is 
> sorted should work fine.

No, not at all! We *cannot trust* user space. This could lead to a security
hole if we assume it's sorted. The kernel must not trust anything it
receives from user space, because an attacker will be looking for ways to
confuse the kernel in order to exploit it.

When I look at code that reads user space, I do not look at it as if it
were made by the compiler. I look at it as if it were made by someone
that's trying to find ways to crack the system. Every read from user space
*must* be validated *every* time it's read. It can not even validate it
once and then think its immutable (unless the kernel actually made it
immutable).

-- Steve


* Re: [PATCH v2 03/11] unwind: Introduce SFrame user space unwinding
  2024-10-01 18:36       ` Steven Rostedt
@ 2024-10-02  8:18         ` Florian Weimer
  2024-10-02 14:05           ` Steven Rostedt
  0 siblings, 1 reply; 44+ messages in thread
From: Florian Weimer @ 2024-10-02  8:18 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Indu Bhagat, Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James

* Steven Rostedt:

> On Tue, 1 Oct 2024 11:20:35 -0700
> Indu Bhagat <indu.bhagat@oracle.com> wrote:
>
>> > So we trust user space to have this table sorted?
>> >   
>> 
>> GNU ld will create this table sorted when linking .sframe sections and 
>> will set SFRAME_F_FDE_SORTED in flags in the output .sframe section.  In 
>> the current patch, I see the __sframe_add_section () includes a check 
>> for SFRAME_F_FDE_SORTED for admitting SFrame sections.
>> 
>> So proceeding here with the assumption that the SFrame FDE list is 
>> sorted should work fine.
>
> No not at all! We *cannot trust* user space. This could lead to a security
> hole if we assume it's sorted. The kernel must not trust anything it
> receives from user space. Because an attacker will be looking for ways to
> confuse the kernel to exploit it.

I don't quite understand, sorry.

Doing a binary search on an unordered table fails to find some entries
that could be discovered by a linear scan.  But an attacker could just
as well use an incomplete table from the start.  So assuming an ordered
table seems rather unlikely to introduce additional problems.  (Given
the lack of a formal threat model, it's impossible to make more precise
claims in either direction.)

Thanks,
Florian



* Re: [PATCH v2 03/11] unwind: Introduce SFrame user space unwinding
  2024-10-02  8:18         ` Florian Weimer
@ 2024-10-02 14:05           ` Steven Rostedt
  0 siblings, 0 replies; 44+ messages in thread
From: Steven Rostedt @ 2024-10-02 14:05 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Indu Bhagat, Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James

On Wed, 02 Oct 2024 10:18:21 +0200
Florian Weimer <fweimer@redhat.com> wrote:
> 
> I don't quite understand, sorry.
> 
> Doing a binary search on an unordered table fails to find some entries
> that could be discovered by a linear scan.  But an attacker could just
> as well use an incomplete table from the start.  So assuming an ordered
> table seems rather unlikely to introduce additional problems.  (Given
> the lack of a formal threat model, it's impossible to make more precise
> claims in either direction.)

Basically, the idea is that if anything is out of place, scrap the
entire process. An unordered table can give unpredictable results, which
could later be used as a gadget. If the kernel expects a sorted table
and it turns out not to be sorted, it should automatically flag the
table as corrupt and stop all processing.

The kernel doesn't need to scan the entire table on every lookup to see
whether it is sorted; that would defeat the point of it being sorted in
the first place. But it can check that the values converge towards a
correct answer. All it needs to do is keep track of the current highest
and lowest values seen, and if it finds something outside that range,
abort immediately.

The effort needed to validate is very low, so it should be done.
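The scheme described above can be sketched in plain C. This is a
user-space toy model, not kernel code: the FDE table is a local array
instead of user memory, and all names are illustrative. It tracks the
lowest and highest function offsets probed so far and rejects the table
as unsorted as soon as a probe falls outside that narrowing window:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Toy model of a binary search over a table that user space *claims*
 * is sorted. Rather than verifying the whole table up front, remember
 * the values already probed: probes that move right must be
 * non-decreasing and probes that move left non-increasing. Any
 * violation proves the table is unsorted, so treat it as corrupt.
 */
static int find_fde_idx(const int32_t *offs, int nr, int32_t ip_off)
{
	int first = 0, last = nr - 1, found = -1;
	int32_t low = 0, high = 0;
	int have_low = 0, have_high = 0;

	while (first <= last) {
		int mid = first + (last - first) / 2;
		int32_t func_off = offs[mid];

		if (ip_off >= func_off) {
			if (have_low && low > func_off)
				return -EINVAL;	/* table not sorted */
			low = func_off;
			have_low = 1;
			found = mid;
			first = mid + 1;
		} else {
			if (have_high && high < func_off)
				return -EINVAL;	/* table not sorted */
			high = func_off;
			have_high = 1;
			last = mid - 1;
		}
	}

	return found >= 0 ? found : -EINVAL;
}
```

A properly sorted table resolves normally, while a table whose probe
sequence decreases on the way right is rejected without scanning every
entry.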

-- Steve



* Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
  2024-09-16 15:39       ` Josh Poimboeuf
  2024-09-16 18:15         ` Peter Zijlstra
@ 2024-10-03  2:31         ` Steven Rostedt
  2024-10-03  2:37           ` Josh Poimboeuf
  1 sibling, 1 reply; 44+ messages in thread
From: Steven Rostedt @ 2024-10-03  2:31 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, x86, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Mathieu Desnoyers

On Mon, 16 Sep 2024 17:39:53 +0200
Josh Poimboeuf <jpoimboe@kernel.org> wrote:

> Completely untested patch here that I hacked up today:

I can tell it wasn't tested ;-)

> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git sframe-2.1
> 
> To avoid races, unwind_user_deferred() can't be called from NMI.  The
> tracers will need to trigger irq_work to call it.

diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 7edb0833fe46..c56cf5d564df 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -177,7 +181,8 @@ int unwind_user_deferred(struct unwind_callback *callback, u64 *ctx_cookie)
 
 		cookie = __this_cpu_read(ctx_ctr);
 		cookie &= ((1UL << 48) - 1);
-		cookie |= ((cpu << 48) + 1);
+		cookie |= cpu << 48;
+		cookie++;
 		__this_cpu_write(ctx_ctr, cookie);
 
 		current->unwind_ctx_cookie = cookie;

As the cookie never got incremented.
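The difference is easy to demonstrate in a user-space model (names are
illustrative and the per-CPU counter is reduced to a plain variable):
OR-ing in ((cpu << 48) + 1) can only ever set bit 0, so repeated calls
on the same CPU keep returning an identical cookie, while the fixed
sequence advances the low 48-bit counter each time:

```c
#include <assert.h>
#include <stdint.h>

/*
 * User-space model of the cookie generation above. The cookie packs
 * the CPU number into the top 16 bits and a counter into the low 48.
 */
static uint64_t next_cookie_buggy(uint64_t *ctx_ctr, uint64_t cpu)
{
	uint64_t cookie = *ctx_ctr;

	cookie &= (1ULL << 48) - 1;
	cookie |= (cpu << 48) + 1;	/* bug: OR can only set bit 0 */
	*ctx_ctr = cookie;
	return cookie;
}

static uint64_t next_cookie_fixed(uint64_t *ctx_ctr, uint64_t cpu)
{
	uint64_t cookie = *ctx_ctr;

	cookie &= (1ULL << 48) - 1;
	cookie |= cpu << 48;
	cookie++;			/* counter advances every call */
	*ctx_ctr = cookie;
	return cookie;
}
```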

That was just one issue. Things are still not working but I'll debug the
rest later.

-- Steve


* Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
  2024-10-03  2:31         ` Steven Rostedt
@ 2024-10-03  2:37           ` Josh Poimboeuf
  2024-10-03 14:56             ` Steven Rostedt
  0 siblings, 1 reply; 44+ messages in thread
From: Josh Poimboeuf @ 2024-10-03  2:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, x86, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Mathieu Desnoyers

On Wed, Oct 02, 2024 at 10:31:25PM -0400, Steven Rostedt wrote:
> +++ b/kernel/unwind/user.c
> @@ -177,7 +181,8 @@ int unwind_user_deferred(struct unwind_callback *callback, u64 *ctx_cookie)
>  
>  		cookie = __this_cpu_read(ctx_ctr);
>  		cookie &= ((1UL << 48) - 1);
> -		cookie |= ((cpu << 48) + 1);
> +		cookie |= cpu << 48;
> +		cookie++;
>  		__this_cpu_write(ctx_ctr, cookie);
>  
>  		current->unwind_ctx_cookie = cookie;
> 
> As the cookie never got incremented.

Ha, that might help ;-)

> That was just one issue. Things are still not working but I'll debug the
> rest later.

If you don't want to run on the bleeding edge, I'll be testing it soon
myself once I get the coding done for my v3.

-- 
Josh


* Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
  2024-10-03  2:37           ` Josh Poimboeuf
@ 2024-10-03 14:56             ` Steven Rostedt
  0 siblings, 0 replies; 44+ messages in thread
From: Steven Rostedt @ 2024-10-03 14:56 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, x86, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Mathieu Desnoyers

On Wed, 2 Oct 2024 19:37:02 -0700
Josh Poimboeuf <jpoimboe@kernel.org> wrote:

> > That was just one issue. Things are still not working but I'll debug the
> > rest later.  
> 
> If you don't want to run on the bleeding edge, I'll be testing it soon
> myself once I get the coding done for my v3.

Yeah, I'll wait for that. I have some code working for ftrace on top of
this, but I also promised Mathieu I would start looking at his faultable
trace points, so I should do that before working more on sframe.

-- Steve


* Re: [PATCH v2 02/11] unwind/x86: Add HAVE_USER_UNWIND
  2024-09-16 11:46   ` Peter Zijlstra
@ 2024-10-20  8:09     ` Josh Poimboeuf
  0 siblings, 0 replies; 44+ messages in thread
From: Josh Poimboeuf @ 2024-10-20  8:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
	linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James

On Mon, Sep 16, 2024 at 01:46:05PM +0200, Peter Zijlstra wrote:
> > +#define ARCH_INIT_USER_FP_FRAME							\
> > +	.ra_off		= (s32)sizeof(long) * -1,				\
> > +	.cfa_off	= (s32)sizeof(long) * 2,				\
> > +	.fp_off		= (s32)sizeof(long) * -2,				\
> > +	.use_fp		= true,
> 
> What about compat crap?

SFrame doesn't support x86 compat, so I guess we don't need to worry
about that, at least for now.

-- 
Josh


* Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
  2024-09-13 23:02 [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains Josh Poimboeuf
                   ` (12 preceding siblings ...)
  2024-09-14 19:37 ` Namhyung Kim
@ 2024-10-23 13:22 ` Jens Remus
  2024-10-24  2:22   ` Steven Rostedt
  2024-10-27 17:24   ` Josh Poimboeuf
  13 siblings, 2 replies; 44+ messages in thread
From: Jens Remus @ 2024-10-23 13:22 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James, x86, Heiko Carstens, Ilya Leoshkevich,
	Vasily Gorbik

Hello Josh!

On 14.09.2024 01:02, Josh Poimboeuf wrote:
> This is a new version of sframe user space unwinding + perf deferred
> callchains.  Better late than never.

...

> These patches add support for unwinding user space from the kernel using
> SFrame with perf.  It should be easy to add user unwinding support for
> other components like ftrace.

...

We are looking forward to implementing support for unwinding user space
using SFrame in kernel/perf on s390. One major concern is that your x86
implementation currently relies on a fallback to frame pointer
unwinding. On s390, frame pointer unwinding is unsupported, due to the
lack of a proper s390x ABI [1] specification and of compiler support.
In theory there would be an s390-specific alternative of unwinding
using the backchain (compiler option -mbackchain), but it has
limitations and there is currently no distribution where user space is
built with backchain.

How much of an issue would it be if s390 could not provide an unwinding
fallback implementation? Do you see a possibility of getting away
without one?

For s390 support of unwinding using SFrame we would need to make a few
changes to your generic unwinding framework in the kernel:

- Support for architectures that do not define CFA == SP at callsite:
   On s390 the CFA is defined as SP at callsite +160. The stack pointer
   (SP) therefore unwinds as SP = CFA - 160. For that we would introduce
   e.g. a sp_val_off field (SP value offset from CFA) in struct
   user_unwind_frame that would default to 0 on all architectures except
   s390.

- Support for architectures where RA is not necessarily saved on stack:
   On s390 the return address (RA) is not saved (on stack) at function
   entry. In leaf functions it is not necessarily saved at all.

- Support for architectures where RA/FP are saved in registers in leaf
   functions:
   On s390 the frame pointer (FP) and return address (RA) registers can
   be saved in other registers when in leaf functions. In the SFrame
   format we would encode the DWARF register number as offset using the
   least-significant bit as indication: offset = (regnum << 1) | 1.
   Therefore we would need to extend your generic unwinding framework to
   support FP and RA to be restored from registers.
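The proposed encoding could look roughly like this sketch (the helper
names are hypothetical, not from the patch set). Since stack offsets
are multiples of the register slot size and therefore always even,
their least-significant bit is free to act as the register/offset
discriminator:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch of the proposed s390 encoding: an odd value
 * carries a DWARF register number, encoded as (regnum << 1) | 1;
 * even values remain ordinary CFA-relative stack offsets.
 */
struct save_loc {
	bool in_reg;	/* true: restore RA/FP from a register */
	int32_t regnum;	/* DWARF register number when in_reg */
	int32_t off;	/* CFA-relative stack offset otherwise */
};

static int32_t encode_regnum(int32_t regnum)
{
	return (regnum << 1) | 1;
}

static struct save_loc decode_loc(int32_t val)
{
	struct save_loc loc = { 0 };

	if (val & 1) {
		loc.in_reg = true;
		loc.regnum = val >> 1;
	} else {
		loc.off = val;
	}
	return loc;
}
```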

[1]: s390x ELF ABI supplement,
      https://github.com/IBM/s390x-abi/releases

Thanks and regards,
Jens
-- 
Jens Remus
Linux on Z Development (D3303) and z/VSE Support
+49-7031-16-1128 Office
jremus@de.ibm.com




* Re: [PATCH v2 03/11] unwind: Introduce SFrame user space unwinding
  2024-09-13 23:02 ` [PATCH v2 03/11] unwind: Introduce SFrame user space unwinding Josh Poimboeuf
  2024-09-14 11:23   ` Steven Rostedt
@ 2024-10-23 13:59   ` Jens Remus
  2024-10-27 17:49     ` Josh Poimboeuf
  1 sibling, 1 reply; 44+ messages in thread
From: Jens Remus @ 2024-10-23 13:59 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James, x86, Heiko Carstens, Ilya Leoshkevich,
	Vasily Gorbik

Hello Josh!

On 14.09.2024 01:02, Josh Poimboeuf wrote:
> Some distros have started compiling frame pointers into all their
> packages to enable the kernel to do system-wide profiling of user space.
> Unfortunately that creates a runtime performance penalty across the
> entire system.  Using DWARF instead isn't feasible due to the complexity
> it would add to the kernel.
> 
> For in-kernel unwinding we solved this problem with the creation of the
> ORC unwinder for x86_64.  Similarly, for user space the GNU assembler
> has created the SFrame format starting with binutils 2.41 for SFrame v2.
> SFrame is a simpler version of .eh_frame which gets placed in the
> .sframe section.
> 
> Add support for unwinding user space using SFrame.
> 
> More information about SFrame can be found here:
> 
>    - https://lwn.net/Articles/932209/
>    - https://lwn.net/Articles/940686/
>    - https://sourceware.org/binutils/docs/sframe-spec.html
> 
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> ---
>   arch/Kconfig                |   3 +
>   fs/binfmt_elf.c             |  47 +++-
>   include/linux/mm_types.h    |   3 +
>   include/linux/sframe.h      |  46 ++++
>   include/linux/user_unwind.h |   1 +
>   include/uapi/linux/elf.h    |   1 +
>   include/uapi/linux/prctl.h  |   3 +
>   kernel/fork.c               |  10 +
>   kernel/sys.c                |  11 +
>   kernel/unwind/Makefile      |   1 +
>   kernel/unwind/sframe.c      | 420 ++++++++++++++++++++++++++++++++++++
>   kernel/unwind/sframe.h      | 215 ++++++++++++++++++
>   kernel/unwind/user.c        |  14 ++
>   mm/init-mm.c                |   4 +-
>   14 files changed, 774 insertions(+), 5 deletions(-)
>   create mode 100644 include/linux/sframe.h
>   create mode 100644 kernel/unwind/sframe.c
>   create mode 100644 kernel/unwind/sframe.h
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index b1002b2da331..ff5d5bc5f947 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -428,6 +428,9 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
>   config HAVE_USER_UNWIND
>   	bool
>   
> +config HAVE_USER_UNWIND_SFRAME
> +	bool

I found that this SFrame user space unwinder depends on your generic
user space unwinding framework. Does this warrant adding:

	depends on HAVE_USER_UNWIND

> +
>   config HAVE_PERF_REGS
>   	bool
>   	help
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 19fa49cd9907..923aed390f2e 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -47,6 +47,7 @@
>   #include <linux/dax.h>
>   #include <linux/uaccess.h>
>   #include <linux/rseq.h>
> +#include <linux/sframe.h>
>   #include <asm/param.h>
>   #include <asm/page.h>
>   
> @@ -633,11 +634,13 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
>   		unsigned long no_base, struct elf_phdr *interp_elf_phdata,
>   		struct arch_elf_state *arch_state)
>   {
> -	struct elf_phdr *eppnt;
> +	struct elf_phdr *eppnt, *sframe_phdr = NULL;
>   	unsigned long load_addr = 0;
>   	int load_addr_set = 0;
>   	unsigned long error = ~0UL;
>   	unsigned long total_size;
> +	unsigned long start_code = ~0UL;
> +	unsigned long end_code = 0;
>   	int i;
>   
>   	/* First of all, some simple consistency checks */
> @@ -659,7 +662,8 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
>   
>   	eppnt = interp_elf_phdata;
>   	for (i = 0; i < interp_elf_ex->e_phnum; i++, eppnt++) {
> -		if (eppnt->p_type == PT_LOAD) {
> +		switch (eppnt->p_type) {
> +		case PT_LOAD: {
>   			int elf_type = MAP_PRIVATE;
>   			int elf_prot = make_prot(eppnt->p_flags, arch_state,
>   						 true, true);
> @@ -688,7 +692,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
>   			/*
>   			 * Check to see if the section's size will overflow the
>   			 * allowed task size. Note that p_filesz must always be
> -			 * <= p_memsize so it's only necessary to check p_memsz.
> +			 * <= p_memsz so it's only necessary to check p_memsz.
>   			 */
>   			k = load_addr + eppnt->p_vaddr;
>   			if (BAD_ADDR(k) ||
> @@ -698,7 +702,28 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
>   				error = -ENOMEM;
>   				goto out;
>   			}
> +
> +			if ((eppnt->p_flags & PF_X) && k < start_code)
> +				start_code = k;
> +
> +			if ((eppnt->p_flags & PF_X) && k + eppnt->p_filesz > end_code)
> +				end_code = k + eppnt->p_filesz;
> +			break;
>   		}
> +		case PT_GNU_SFRAME:
> +			sframe_phdr = eppnt;
> +			break;
> +		}
> +	}
> +
> +	if (sframe_phdr) {
> +		struct sframe_file sfile = {
> +			.sframe_addr	= load_addr + sframe_phdr->p_vaddr,
> +			.text_start	= start_code,
> +			.text_end	= end_code,
> +		};
> +
> +		__sframe_add_section(&sfile);
>   	}
>   
>   	error = load_addr;
> @@ -823,7 +848,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
>   	int first_pt_load = 1;
>   	unsigned long error;
>   	struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
> -	struct elf_phdr *elf_property_phdata = NULL;
> +	struct elf_phdr *elf_property_phdata = NULL, *sframe_phdr = NULL;
>   	unsigned long elf_brk;
>   	int retval, i;
>   	unsigned long elf_entry;
> @@ -931,6 +956,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
>   				executable_stack = EXSTACK_DISABLE_X;
>   			break;
>   
> +		case PT_GNU_SFRAME:
> +			sframe_phdr = elf_ppnt;
> +			break;
> +
>   		case PT_LOPROC ... PT_HIPROC:
>   			retval = arch_elf_pt_proc(elf_ex, elf_ppnt,
>   						  bprm->file, false,
> @@ -1316,6 +1345,16 @@ static int load_elf_binary(struct linux_binprm *bprm)
>   				MAP_FIXED | MAP_PRIVATE, 0);
>   	}
>   
> +	if (sframe_phdr) {
> +		struct sframe_file sfile = {
> +			.sframe_addr	= load_bias + sframe_phdr->p_vaddr,
> +			.text_start	= start_code,
> +			.text_end	= end_code,
> +		};
> +
> +		__sframe_add_section(&sfile);
> +	}
> +
>   	regs = current_pt_regs();
>   #ifdef ELF_PLAT_INIT
>   	/*
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 485424979254..1aee78cbea33 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1019,6 +1019,9 @@ struct mm_struct {
>   #endif
>   		} lru_gen;
>   #endif /* CONFIG_LRU_GEN_WALKS_MMU */
> +#ifdef CONFIG_HAVE_USER_UNWIND_SFRAME
> +		struct maple_tree sframe_mt;
> +#endif
>   	} __randomize_layout;
>   
>   	/*
> diff --git a/include/linux/sframe.h b/include/linux/sframe.h
> new file mode 100644
> index 000000000000..3a44f76929e2
> --- /dev/null
> +++ b/include/linux/sframe.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_SFRAME_H
> +#define _LINUX_SFRAME_H
> +
> +#include <linux/mm_types.h>
> +
> +struct sframe_file {
> +	unsigned long sframe_addr, text_start, text_end;
> +};
> +
> +struct user_unwind_frame;
> +
> +#ifdef CONFIG_HAVE_USER_UNWIND_SFRAME
> +
> +#define INIT_MM_SFRAME .sframe_mt = MTREE_INIT(sframe_mt, 0)
> +
> +extern void sframe_free_mm(struct mm_struct *mm);
> +
> +extern int __sframe_add_section(struct sframe_file *file);
> +extern int sframe_add_section(unsigned long sframe_addr, unsigned long text_start, unsigned long text_end);
> +extern int sframe_remove_section(unsigned long sframe_addr);
> +extern int sframe_find(unsigned long ip, struct user_unwind_frame *frame);
> +
> +static inline bool current_has_sframe(void)
> +{
> +	struct mm_struct *mm = current->mm;
> +
> +	return mm && !mtree_empty(&mm->sframe_mt);
> +}
> +
> +#else /* !CONFIG_HAVE_USER_UNWIND_SFRAME */
> +
> +#define INIT_MM_SFRAME
> +
> +static inline void sframe_free_mm(struct mm_struct *mm) {}
> +
> +static inline int __sframe_add_section(struct sframe_file *file) { return -EINVAL; }
> +static inline int sframe_add_section(unsigned long sframe_addr, unsigned long text_start, unsigned long text_end) { return -EINVAL; }
> +static inline int sframe_remove_section(unsigned long sframe_addr) { return -EINVAL; }
> +static inline int sframe_find(unsigned long ip, struct user_unwind_frame *frame) { return -EINVAL; }
> +
> +static inline bool current_has_sframe(void) { return false; }
> +
> +#endif /* CONFIG_HAVE_USER_UNWIND_SFRAME */
> +
> +#endif /* _LINUX_SFRAME_H */
> diff --git a/include/linux/user_unwind.h b/include/linux/user_unwind.h
> index 0a19ac6c92b2..8003f9d35405 100644
> --- a/include/linux/user_unwind.h
> +++ b/include/linux/user_unwind.h
> @@ -7,6 +7,7 @@
>   enum user_unwind_type {
>   	USER_UNWIND_TYPE_AUTO,
>   	USER_UNWIND_TYPE_FP,
> +	USER_UNWIND_TYPE_SFRAME,
>   };
>   
>   struct user_unwind_frame {
> diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
> index b54b313bcf07..b2aca31e1a49 100644
> --- a/include/uapi/linux/elf.h
> +++ b/include/uapi/linux/elf.h
> @@ -39,6 +39,7 @@ typedef __s64	Elf64_Sxword;
>   #define PT_GNU_STACK	(PT_LOOS + 0x474e551)
>   #define PT_GNU_RELRO	(PT_LOOS + 0x474e552)
>   #define PT_GNU_PROPERTY	(PT_LOOS + 0x474e553)
> +#define PT_GNU_SFRAME	(PT_LOOS + 0x474e554)
>   
>   
>   /* ARM MTE memory tag segment type */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 35791791a879..69511077c910 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -328,4 +328,7 @@ struct prctl_mm_map {
>   # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC	0x10 /* Clear the aspect on exec */
>   # define PR_PPC_DEXCR_CTRL_MASK		0x1f
>   
> +#define PR_ADD_SFRAME			74
> +#define PR_REMOVE_SFRAME		75
> +
>   #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index cc760491f201..a216f091edfb 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -104,6 +104,7 @@
>   #include <linux/rseq.h>
>   #include <uapi/linux/pidfd.h>
>   #include <linux/pidfs.h>
> +#include <linux/sframe.h>
>   
>   #include <asm/pgalloc.h>
>   #include <linux/uaccess.h>
> @@ -923,6 +924,7 @@ void __mmdrop(struct mm_struct *mm)
>   	mm_pasid_drop(mm);
>   	mm_destroy_cid(mm);
>   	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
> +	sframe_free_mm(mm);
>   
>   	free_mm(mm);
>   }
> @@ -1249,6 +1251,13 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
>   #endif
>   }
>   
> +static void mm_init_sframe(struct mm_struct *mm)
> +{
> +#ifdef CONFIG_HAVE_USER_UNWIND_SFRAME
> +	mt_init(&mm->sframe_mt);
> +#endif
> +}
> +
>   static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>   	struct user_namespace *user_ns)
>   {
> @@ -1280,6 +1289,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>   	mm->pmd_huge_pte = NULL;
>   #endif
>   	mm_init_uprobes_state(mm);
> +	mm_init_sframe(mm);
>   	hugetlb_count_init(mm);
>   
>   	if (current->mm) {
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 3a2df1bd9f64..e4d2b64f4ae4 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -64,6 +64,7 @@
>   #include <linux/rcupdate.h>
>   #include <linux/uidgid.h>
>   #include <linux/cred.h>
> +#include <linux/sframe.h>
>   
>   #include <linux/nospec.h>
>   
> @@ -2782,6 +2783,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>   	case PR_RISCV_SET_ICACHE_FLUSH_CTX:
>   		error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
>   		break;
> +	case PR_ADD_SFRAME:
> +		if (arg5)
> +			return -EINVAL;
> +		error = sframe_add_section(arg2, arg3, arg4);
> +		break;
> +	case PR_REMOVE_SFRAME:
> +		if (arg3 || arg4 || arg5)
> +			return -EINVAL;
> +		error = sframe_remove_section(arg2);
> +		break;
>   	default:
>   		error = -EINVAL;
>   		break;
> diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
> index eb466d6a3295..6f202c5840cf 100644
> --- a/kernel/unwind/Makefile
> +++ b/kernel/unwind/Makefile
> @@ -1 +1,2 @@
>   obj-$(CONFIG_HAVE_USER_UNWIND) += user.o
> +obj-$(CONFIG_HAVE_USER_UNWIND_SFRAME) += sframe.o
> diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
> new file mode 100644
> index 000000000000..3e4d29e737a1
> --- /dev/null
> +++ b/kernel/unwind/sframe.c
> @@ -0,0 +1,420 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/srcu.h>
> +#include <linux/uaccess.h>
> +#include <linux/mm.h>
> +#include <linux/sframe.h>
> +#include <linux/user_unwind.h>
> +
> +#include "sframe.h"
> +
> +#define SFRAME_FILENAME_LEN 32
> +
> +struct sframe_section {
> +	struct rcu_head rcu;
> +
> +	unsigned long sframe_addr;
> +	unsigned long text_addr;
> +
> +	unsigned long fdes_addr;
> +	unsigned long fres_addr;
> +	unsigned int  fdes_nr;
> +	signed char ra_off, fp_off;
> +};
> +
> +DEFINE_STATIC_SRCU(sframe_srcu);
> +
> +#define __SFRAME_GET_USER(out, user_ptr, type)				\
> +({									\
> +	type __tmp;							\
> +	if (get_user(__tmp, (type *)user_ptr))				\
> +		return -EFAULT;						\
> +	user_ptr += sizeof(__tmp);					\
> +	out = __tmp;							\
> +})
> +
> +#define SFRAME_GET_USER_SIGNED(out, user_ptr, size)			\
> +({									\
> +	switch (size) {							\
> +	case 1:								\
> +		__SFRAME_GET_USER(out, user_ptr, s8);			\
> +		break;							\
> +	case 2:								\
> +		__SFRAME_GET_USER(out, user_ptr, s16);			\
> +		break;							\
> +	case 4:								\
> +		__SFRAME_GET_USER(out, user_ptr, s32);			\
> +		break;							\
> +	default:							\
> +		return -EINVAL;						\
> +	}								\
> +})
> +
> +#define SFRAME_GET_USER_UNSIGNED(out, user_ptr, size)			\
> +({									\
> +	switch (size) {							\
> +	case 1:								\
> +		__SFRAME_GET_USER(out, user_ptr, u8);			\
> +		break;							\
> +	case 2:								\
> +		__SFRAME_GET_USER(out, user_ptr, u16);			\
> +		break;							\
> +	case 4:								\
> +		__SFRAME_GET_USER(out, user_ptr, u32);			\
> +		break;							\
> +	default:							\
> +		return -EINVAL;						\
> +	}								\
> +})
> +
> +static unsigned char fre_type_to_size(unsigned char fre_type)
> +{
> +	if (fre_type > 2)
> +		return 0;
> +	return 1 << fre_type;
> +}
> +
> +static unsigned char offset_size_enum_to_size(unsigned char off_size)
> +{
> +	if (off_size > 2)
> +		return 0;
> +	return 1 << off_size;
> +}
> +
> +static int find_fde(struct sframe_section *sec, unsigned long ip,
> +		    struct sframe_fde *fde)
> +{
> +	s32 func_off, ip_off;
> +	struct sframe_fde __user *first, *last, *mid, *found;
> +
> +	ip_off = ip - sec->sframe_addr;
> +
> +	first = (void *)sec->fdes_addr;
> +	last = first + sec->fdes_nr;
> +	while (first <= last) {
> +		mid = first + ((last - first) / 2);
> +		if (get_user(func_off, (s32 *)mid))
> +			return -EFAULT;
> +		if (ip_off >= func_off) {
> +			found = mid;
> +			first = mid + 1;
> +		} else
> +			last = mid - 1;
> +	}
> +
> +	if (!found)
> +		return -EINVAL;
> +
> +	if (copy_from_user(fde, found, sizeof(*fde)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
> +static int find_fre(struct sframe_section *sec, struct sframe_fde *fde,
> +		    unsigned long ip, struct user_unwind_frame *frame)
> +{
> +	unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
> +	unsigned char fre_type = SFRAME_FUNC_FRE_TYPE(fde->info);
> +	s32 fre_ip_off, cfa_off, ra_off, fp_off, ip_off;

Doesn't fre_ip_off need to be u32 (see also below)? The SFrame format 
specification states the FRE sfre_start_address is either u8, u16, or u32:
https://sourceware.org/binutils/docs/sframe-spec.html#The-SFrame-FRE-Types
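A minimal illustration of the signedness hazard (hypothetical helper
names): the FRE start address is unsigned in the SFrame format, so
squeezing it into an s32 inverts range comparisons once the raw value
reaches 0x80000000:

```c
#include <assert.h>
#include <stdint.h>

/*
 * The same "does this FRE start at or before the ip?" test, once with
 * the value wrongly stored as s32 and once correctly as u32.
 */
static int fre_starts_at_or_before_signed(int32_t fre_ip_off, int32_t ip_off)
{
	return fre_ip_off <= ip_off;
}

static int fre_starts_at_or_before_unsigned(uint32_t fre_ip_off, uint32_t ip_off)
{
	return fre_ip_off <= ip_off;
}
```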

> +	unsigned char offset_count, offset_size;
> +	unsigned char addr_size;
> +	void __user *f, *last_f;
> +	u8 fre_info;
> +	int i;
> +
> +	addr_size = fre_type_to_size(fre_type);
> +	if (!addr_size)
> +		return -EINVAL;
> +
> +	ip_off = ip - sec->sframe_addr - fde->start_addr;
> +
> +	f = (void *)sec->fres_addr + fde->fres_off;
> +
> +	for (i = 0; i < fde->fres_num; i++) {
> +
> +		SFRAME_GET_USER_UNSIGNED(fre_ip_off, f, addr_size);

You already use SFRAME_GET_USER_UNSIGNED() to read it.

> +
> +		if (fde_type == SFRAME_FDE_TYPE_PCINC) {
> +			if (fre_ip_off > ip_off)
> +				break;
> +		} else {
> +			/* SFRAME_FDE_TYPE_PCMASK */
> +			if (ip_off % fde->rep_size < fre_ip_off)
> +				break;
> +		}
> +
> +		SFRAME_GET_USER_UNSIGNED(fre_info, f, 1);
> +
> +		offset_count = SFRAME_FRE_OFFSET_COUNT(fre_info);
> +		offset_size  = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(fre_info));
> +
> +		if (!offset_count || !offset_size)
> +			return -EINVAL;
> +
> +		last_f = f;
> +		f += offset_count * offset_size;
> +	}
> +
> +	if (!last_f)
> +		return -EINVAL;
> +
> +	f = last_f;
> +
> +	SFRAME_GET_USER_UNSIGNED(cfa_off, f, offset_size);

As far as I know, the CFA offset from the CFA base register is signed in the
SFrame file format. See Binutils include/sframe-api.h, 
sframe_fre_get_cfa_offset(). Therefore use SFRAME_GET_USER_SIGNED(). 
Both cfa_off and struct user_unwind_frame cfa_off are already defined as 
s32.
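The distinction matters for negative offsets; a one-byte sketch with
illustrative helpers:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Why the read must be sign-extending: a CFA-relative offset of -16 is
 * stored as the byte 0xF0. Zero-extending yields 240; sign-extending
 * recovers the intended -16.
 */
static int32_t widen_zero_extend(uint8_t b)
{
	return (int32_t)b;
}

static int32_t widen_sign_extend(uint8_t b)
{
	return (int32_t)(int8_t)b;
}
```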

> +	offset_count--;
> +
> +	ra_off = sec->ra_off;
> +	if (!ra_off) {

On s390 there is no fixed RA offset from CFA ...

> +		if (!offset_count--)
> +			return -EINVAL;

... and the RA is not necessarily saved (e.g. at function entry).
But we can address this when sending patches for s390 support.

> +		SFRAME_GET_USER_SIGNED(ra_off, f, offset_size);
> +	}
> +
> +	fp_off = sec->fp_off;
> +	if (!fp_off && offset_count) {
> +		offset_count--;
> +		SFRAME_GET_USER_SIGNED(fp_off, f, offset_size);
> +	}
> +
> +	if (offset_count)
> +		return -EINVAL;
> +
> +	frame->cfa_off = cfa_off;
> +	frame->ra_off = ra_off;
> +	frame->fp_off = fp_off;
> +	frame->use_fp = SFRAME_FRE_CFA_BASE_REG_ID(fre_info) == SFRAME_BASE_REG_FP;
> +
> +	return 0;
> +}
> +
> +int sframe_find(unsigned long ip, struct user_unwind_frame *frame)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct sframe_section *sec;
> +	struct sframe_fde fde;
> +	int srcu_idx;
> +	int ret = -EINVAL;
> +
> +	srcu_idx = srcu_read_lock(&sframe_srcu);
> +
> +	sec = mtree_load(&mm->sframe_mt, ip);
> +	if (!sec) {
> +		srcu_read_unlock(&sframe_srcu, srcu_idx);
> +		return -EINVAL;
> +	}
> +
> +
> +	ret = find_fde(sec, ip, &fde);
> +	if (ret)
> +		goto err_unlock;
> +
> +	ret = find_fre(sec, &fde, ip, frame);
> +	if (ret)
> +		goto err_unlock;
> +
> +	srcu_read_unlock(&sframe_srcu, srcu_idx);
> +	return 0;
> +
> +err_unlock:
> +	srcu_read_unlock(&sframe_srcu, srcu_idx);
> +	return ret;
> +}
> +
> +static int get_sframe_file(unsigned long sframe_addr, struct sframe_file *file)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct vm_area_struct *sframe_vma, *text_vma, *vma;
> +	VMA_ITERATOR(vmi, mm, 0);
> +
> +	mmap_read_lock(mm);
> +
> +	sframe_vma = vma_lookup(mm, sframe_addr);
> +	if (!sframe_vma || !sframe_vma->vm_file)
> +		goto err_unlock;
> +
> +	text_vma = NULL;
> +
> +	for_each_vma(vmi, vma) {
> +		if (vma->vm_file != sframe_vma->vm_file)
> +			continue;
> +		if (vma->vm_flags & VM_EXEC) {
> +			if (text_vma) {
> +				/*
> +				 * Multiple EXEC segments in a single file
> +				 * aren't currently supported, is that a thing?
> +				 */
> +				mmap_read_unlock(mm);
> +				pr_warn_once("unsupported multiple EXEC segments in task %s[%d]\n",
> +					     current->comm, current->pid);
> +				return -EINVAL;
> +			}
> +			text_vma = vma;
> +		}
> +	}
> +
> +	file->sframe_addr	= sframe_addr;
> +	file->text_start	= text_vma->vm_start;
> +	file->text_end		= text_vma->vm_end;
> +
> +	mmap_read_unlock(mm);
> +	return 0;
> +
> +err_unlock:
> +	mmap_read_unlock(mm);
> +	return -EINVAL;
> +}
> +
> +static int validate_sframe_addrs(struct sframe_file *file)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct vm_area_struct *text_vma;
> +
> +	mmap_read_lock(mm);
> +
> +	if (!vma_lookup(mm, file->sframe_addr))
> +		goto err_unlock;
> +
> +	text_vma = vma_lookup(mm, file->text_start);
> +	if (!(text_vma->vm_flags & VM_EXEC))
> +		goto err_unlock;
> +
> +	if (vma_lookup(mm, file->text_end-1) != text_vma)
> +		goto err_unlock;
> +
> +	mmap_read_unlock(mm);
> +	return 0;
> +
> +err_unlock:
> +	mmap_read_unlock(mm);
> +	return -EINVAL;
> +}
> +
> +int __sframe_add_section(struct sframe_file *file)
> +{
> +	struct maple_tree *sframe_mt = &current->mm->sframe_mt;
> +	struct sframe_section *sec;
> +	struct sframe_header shdr;
> +	unsigned long header_end;
> +	int ret;
> +
> +	if (copy_from_user(&shdr, (void *)file->sframe_addr, sizeof(shdr)))
> +		return -EFAULT;
> +
> +	if (shdr.preamble.magic != SFRAME_MAGIC ||
> +	    shdr.preamble.version != SFRAME_VERSION_2 ||
> +	    !(shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
> +	    shdr.auxhdr_len || !shdr.num_fdes || !shdr.num_fres ||
> +	    shdr.fdes_off > shdr.fres_off) {
> +		/*
> +		 * Either binutils < 2.41, corrupt sframe header, or
> +		 * unsupported feature.
> +		 * */
> +		pr_warn_once("bad sframe header in task %s[%d]\n",
> +			     current->comm, current->pid);
> +		return -EINVAL;
> +	}
> +
> +	header_end = file->sframe_addr + SFRAME_HDR_SIZE(shdr);
> +
> +	sec = kmalloc(sizeof(*sec), GFP_KERNEL);
> +	if (!sec)
> +		return -ENOMEM;
> +
> +	sec->sframe_addr	= file->sframe_addr;
> +	sec->text_addr		= file->text_start;
> +	sec->fdes_addr		= header_end + shdr.fdes_off;
> +	sec->fres_addr		= header_end + shdr.fres_off;
> +	sec->fdes_nr		= shdr.num_fdes;
> +	sec->ra_off		= shdr.cfa_fixed_ra_offset;
> +	sec->fp_off		= shdr.cfa_fixed_fp_offset;
> +
> +	ret = mtree_insert_range(sframe_mt, file->text_start, file->text_end,
> +				 sec, GFP_KERNEL);
> +	if (ret) {
> +		kfree(sec);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +int sframe_add_section(unsigned long sframe_addr, unsigned long text_start, unsigned long text_end)
> +{
> +	struct sframe_file file;
> +	int ret;
> +
> +	if (!text_start || !text_end) {
> +		ret = get_sframe_file(sframe_addr, &file);
> +		if (ret)
> +			return ret;
> +	} else {
> +		/*
> +		 * This is mainly for generated code, for which the text isn't
> +		 * file-backed so the user has to give the text bounds.
> +		 */
> +		file.sframe_addr	= sframe_addr;
> +		file.text_start		= text_start;
> +		file.text_end		= text_end;
> +		ret = validate_sframe_addrs(&file);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return __sframe_add_section(&file);
> +}
> +
> +static void sframe_free_rcu(struct rcu_head *rcu)
> +{
> +	struct sframe_section *sec = container_of(rcu, struct sframe_section, rcu);
> +
> +	kfree(sec);
> +}
> +
> +static int __sframe_remove_section(struct mm_struct *mm,
> +				   struct sframe_section *sec)
> +{
> +	struct sframe_section *s;
> +
> +	s = mtree_erase(&mm->sframe_mt, sec->text_addr);
> +	if (!s || WARN_ON_ONCE(s != sec))
> +		return -EINVAL;
> +
> +	call_srcu(&sframe_srcu, &sec->rcu, sframe_free_rcu);
> +
> +	return 0;
> +}
> +
> +int sframe_remove_section(unsigned long sframe_addr)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct sframe_section *sec;
> +	unsigned long index = 0;
> +
> +	sec = mtree_load(&mm->sframe_mt, sframe_addr);
> +	if (!sec)
> +		return -EINVAL;
> +
> +	mt_for_each(&mm->sframe_mt, sec, index, ULONG_MAX) {
> +		if (sec->sframe_addr == sframe_addr)
> +			return __sframe_remove_section(mm, sec);
> +	}
> +
> +	return -EINVAL;
> +}
> +
> +void sframe_free_mm(struct mm_struct *mm)
> +{
> +	struct sframe_section *sec;
> +	unsigned long index = 0;
> +
> +	if (!mm)
> +		return;
> +
> +	mt_for_each(&mm->sframe_mt, sec, index, ULONG_MAX)
> +		kfree(sec);
> +
> +	mtree_destroy(&mm->sframe_mt);
> +}
> diff --git a/kernel/unwind/sframe.h b/kernel/unwind/sframe.h
> new file mode 100644
> index 000000000000..aa468d6f1f4a
> --- /dev/null
> +++ b/kernel/unwind/sframe.h
> @@ -0,0 +1,215 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Copyright (C) 2023, Oracle and/or its affiliates.
> + *
> + * This file contains definitions for the SFrame stack tracing format, which is
> + * documented at https://sourceware.org/binutils/docs
> + */
> +#ifndef _SFRAME_H
> +#define _SFRAME_H
> +
> +#include <linux/types.h>
> +
> +#define SFRAME_VERSION_1	1
> +#define SFRAME_VERSION_2	2
> +#define SFRAME_MAGIC		0xdee2
> +
> +/* Function Descriptor Entries are sorted on PC. */
> +#define SFRAME_F_FDE_SORTED	0x1
> +/* Frame-pointer based stack tracing. Defined, but not set. */
> +#define SFRAME_F_FRAME_POINTER	0x2
> +
> +#define SFRAME_CFA_FIXED_FP_INVALID 0
> +#define SFRAME_CFA_FIXED_RA_INVALID 0
> +
> +/* Supported ABIs/Arch. */
> +#define SFRAME_ABI_AARCH64_ENDIAN_BIG	    1 /* AARCH64 big endian. */
> +#define SFRAME_ABI_AARCH64_ENDIAN_LITTLE    2 /* AARCH64 little endian. */
> +#define SFRAME_ABI_AMD64_ENDIAN_LITTLE	    3 /* AMD64 little endian. */
> +
> +/* SFrame FRE types. */
> +#define SFRAME_FRE_TYPE_ADDR1	0
> +#define SFRAME_FRE_TYPE_ADDR2	1
> +#define SFRAME_FRE_TYPE_ADDR4	2
> +
> +/*
> + * SFrame Function Descriptor Entry types.
> + *
> + * The SFrame format has two possible representations for functions. The
> + * choice of which type to use is made according to the instruction patterns
> + * in the relevant program stub.
> + */
> +
> +/* Unwinders perform a (PC >= FRE_START_ADDR) to look up a matching FRE. */
> +#define SFRAME_FDE_TYPE_PCINC	0
> +/*
> + * Unwinders perform a (PC & FRE_START_ADDR_AS_MASK >= FRE_START_ADDR_AS_MASK)
> + * to look up a matching FRE. Typical usecases are pltN entries, trampolines
> + * etc.
> + */
> +#define SFRAME_FDE_TYPE_PCMASK	1
> +
> +/**
> + * struct sframe_preamble - SFrame Preamble.
> + * @magic: Magic number (SFRAME_MAGIC).
> + * @version: Format version number (SFRAME_VERSION).
> + * @flags: Various flags.
> + */
> +struct sframe_preamble {
> +	u16 magic;
> +	u8  version;
> +	u8  flags;
> +} __packed;
> +
> +/**
> + * struct sframe_header - SFrame Header.
> + * @preamble: SFrame preamble.
> + * @abi_arch: Identify the arch (including endianness) and ABI.
> + * @cfa_fixed_fp_offset: Offset for the Frame Pointer (FP) from CFA may be
> + *	  fixed  for some ABIs ((e.g, in AMD64 when -fno-omit-frame-pointer is

Nit: Two consecutive spaces in "fixed  for".

> + *	  used). When fixed, this field specifies the fixed stack frame offset
> + *	  and the individual FREs do not need to track it. When not fixed, it
> + *	  is set to SFRAME_CFA_FIXED_FP_INVALID, and the individual FREs may
> + *	  provide the applicable stack frame offset, if any.
> + * @cfa_fixed_ra_offset: Offset for the Return Address from CFA is fixed for
> + *	  some ABIs. When fixed, this field specifies the fixed stack frame
> + *	  offset and the individual FREs do not need to track it. When not
> + *	  fixed, it is set to SFRAME_CFA_FIXED_FP_INVALID.
> + * @auxhdr_len: Number of bytes making up the auxiliary header, if any.
> + *	  Some ABI/arch, in the future, may use this space for extending the
> + *	  information in SFrame header. Auxiliary header is contained in bytes
> + *	  sequentially following the sframe_header.
> + * @num_fdes: Number of SFrame FDEs in this SFrame section.
> + * @num_fres: Number of SFrame Frame Row Entries.
> + * @fre_len:  Number of bytes in the SFrame Frame Row Entry section.
> + * @fdes_off: Offset of SFrame Function Descriptor Entry section.
> + * @fres_off: Offset of SFrame Frame Row Entry section.
> + */
> +struct sframe_header {
> +	struct sframe_preamble preamble;
> +	u8  abi_arch;
> +	s8  cfa_fixed_fp_offset;
> +	s8  cfa_fixed_ra_offset;
> +	u8  auxhdr_len;
> +	u32 num_fdes;
> +	u32 num_fres;
> +	u32 fre_len;
> +	u32 fdes_off;
> +	u32 fres_off;
> +} __packed;
> +
> +#define SFRAME_HDR_SIZE(sframe_hdr)	\
> +	((sizeof(struct sframe_header) + (sframe_hdr).auxhdr_len))
> +
> +/* Two possible keys for executable (instruction) pointers signing. */
> +#define SFRAME_AARCH64_PAUTH_KEY_A    0 /* Key A. */
> +#define SFRAME_AARCH64_PAUTH_KEY_B    1 /* Key B. */
> +
> +/**
> + * struct sframe_fde - SFrame Function Descriptor Entry.
> + * @start_addr: Function start address. Encoded as a signed offset,
> + *	  relative to the current FDE.
> + * @size: Size of the function in bytes.
> + * @fres_off: Offset of the first SFrame Frame Row Entry of the function,
> + *	  relative to the beginning of the SFrame Frame Row Entry sub-section.
> + * @fres_num: Number of frame row entries for the function.
> + * @info: Additional information for deciphering the stack trace
> + *	  information for the function. Contains information about SFrame FRE
> + *	  type, SFrame FDE type, PAC authorization A/B key, etc.
> + * @rep_size: Block size for SFRAME_FDE_TYPE_PCMASK
> + * @padding: Unused
> + */
> +struct sframe_fde {
> +	s32 start_addr;
> +	u32 size;
> +	u32 fres_off;
> +	u32 fres_num;
> +	u8  info;
> +	u8  rep_size;
> +	u16 padding;
> +} __packed;
> +
> +/*
> + * 'func_info' in SFrame FDE contains additional information for deciphering
> + * the stack trace information for the function. In V1, the information is
> + * organized as follows:
> + *   - 4-bits: Identify the FRE type used for the function.
> + *   - 1-bit: Identify the FDE type of the function - mask or inc.
> + *   - 1-bit: PAC authorization A/B key (aarch64).
> + *   - 2-bits: Unused.
> + * ---------------------------------------------------------------------
> + * |  Unused  |  PAC auth A/B key (aarch64) |  FDE type |   FRE type   |
> + * |          |        Unused (amd64)       |           |              |
> + * ---------------------------------------------------------------------
> + * 8          6                             5           4              0
> + */
> +
> +/* Note: Set PAC auth key to SFRAME_AARCH64_PAUTH_KEY_A by default.  */
> +#define SFRAME_FUNC_INFO(fde_type, fre_enc_type) \
> +	(((SFRAME_AARCH64_PAUTH_KEY_A & 0x1) << 5) | \
> +	 (((fde_type) & 0x1) << 4) | ((fre_enc_type) & 0xf))
> +
> +#define SFRAME_FUNC_FRE_TYPE(data)	  ((data) & 0xf)
> +#define SFRAME_FUNC_FDE_TYPE(data)	  (((data) >> 4) & 0x1)
> +#define SFRAME_FUNC_PAUTH_KEY(data)	  (((data) >> 5) & 0x1)
> +
> +/*
> + * Size of stack frame offsets in an SFrame Frame Row Entry. A single
> + * SFrame FRE has all offsets of the same size. Offset size may vary
> + * across frame row entries.
> + */
> +#define SFRAME_FRE_OFFSET_1B	  0
> +#define SFRAME_FRE_OFFSET_2B	  1
> +#define SFRAME_FRE_OFFSET_4B	  2
> +
> +/* An SFrame Frame Row Entry can be SP or FP based.  */
> +#define SFRAME_BASE_REG_FP	0
> +#define SFRAME_BASE_REG_SP	1
> +
> +/*
> + * The index at which a specific offset is presented in the variable length
> + * bytes of an FRE.
> + */
> +#define SFRAME_FRE_CFA_OFFSET_IDX   0
> +/*
> + * The RA stack offset, if present, will always be at index 1 in the variable
> + * length bytes of the FRE.
> + */
> +#define SFRAME_FRE_RA_OFFSET_IDX    1
> +/*
> + * The FP stack offset may appear at offset 1 or 2, depending on the ABI as RA
> + * may or may not be tracked.
> + */
> +#define SFRAME_FRE_FP_OFFSET_IDX    2
> +
> +/*
> + * 'fre_info' in SFrame FRE contains information about:
> + *   - 1 bit: base reg for CFA
> + *   - 4 bits: Number of offsets (N). A value of up to 3 is allowed to track
> + *   all three of CFA, FP and RA (fixed implicit order).
> + *   - 2 bits: information about size of the offsets (S) in bytes.
> + *     Valid values are SFRAME_FRE_OFFSET_1B, SFRAME_FRE_OFFSET_2B,
> + *     SFRAME_FRE_OFFSET_4B
> + *   - 1 bit: Mangled RA state bit (aarch64 only).
> + * ---------------------------------------------------------------
> + * | Mangled-RA (aarch64) |  Size of   |   Number of  | base_reg |
> + * |  Unused (amd64)      |  offsets   |    offsets   |          |
> + * ---------------------------------------------------------------
> + * 8                      7             5             1          0
> + */
> +
> +/* Note: Set mangled_ra_p to zero by default. */
> +#define SFRAME_FRE_INFO(base_reg_id, offset_num, offset_size) \
> +	(((0 & 0x1) << 7) | (((offset_size) & 0x3) << 5) | \
> +	 (((offset_num) & 0xf) << 1) | ((base_reg_id) & 0x1))
> +
> +/* Set the mangled_ra_p bit as indicated. */
> +#define SFRAME_FRE_INFO_UPDATE_MANGLED_RA_P(mangled_ra_p, fre_info) \
> +	((((mangled_ra_p) & 0x1) << 7) | ((fre_info) & 0x7f))
> +
> +#define SFRAME_FRE_CFA_BASE_REG_ID(data)	  ((data) & 0x1)
> +#define SFRAME_FRE_OFFSET_COUNT(data)		  (((data) >> 1) & 0xf)
> +#define SFRAME_FRE_OFFSET_SIZE(data)		  (((data) >> 5) & 0x3)
> +#define SFRAME_FRE_MANGLED_RA_P(data)		  (((data) >> 7) & 0x1)
> +
> +#endif /* _SFRAME_H */
> diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
> index 5d16f9604a61..3a7b14cf522b 100644
> --- a/kernel/unwind/user.c
> +++ b/kernel/unwind/user.c
> @@ -8,6 +8,7 @@
>   #include <linux/sched.h>
>   #include <linux/sched/task_stack.h>
>   #include <linux/user_unwind.h>
> +#include <linux/sframe.h>
>   #include <linux/uaccess.h>
>   #include <asm/user_unwind.h>
>   
> @@ -29,6 +30,11 @@ int user_unwind_next(struct user_unwind_state *state)
>   	case USER_UNWIND_TYPE_FP:
>   		frame = &fp_frame;
>   		break;
> +	case USER_UNWIND_TYPE_SFRAME:
> +		ret = sframe_find(state->ip, frame);
> +		if (ret)
> +			goto the_end;
> +		break;
>   	default:
>   		BUG();
>   	}
> @@ -57,6 +63,7 @@ int user_unwind_start(struct user_unwind_state *state,
>   		      enum user_unwind_type type)
>   {
>   	struct pt_regs *regs = task_pt_regs(current);
> +	bool sframe_possible = current_has_sframe();
>   
>   	memset(state, 0, sizeof(*state));
>   
> @@ -67,6 +74,13 @@ int user_unwind_start(struct user_unwind_state *state,
>   
>   	switch (type) {
>   	case USER_UNWIND_TYPE_AUTO:
> +		state->type = sframe_possible ? USER_UNWIND_TYPE_SFRAME :
> +						USER_UNWIND_TYPE_FP;
> +		break;
> +	case USER_UNWIND_TYPE_SFRAME:
> +		if (!sframe_possible)
> +			return -EINVAL;
> +		break;
>   	case USER_UNWIND_TYPE_FP:
>   		break;
>   	default:
> diff --git a/mm/init-mm.c b/mm/init-mm.c
> index 24c809379274..c4c6af046778 100644
> --- a/mm/init-mm.c
> +++ b/mm/init-mm.c
> @@ -11,6 +11,7 @@
>   #include <linux/atomic.h>
>   #include <linux/user_namespace.h>
>   #include <linux/iommu.h>
> +#include <linux/sframe.h>
>   #include <asm/mmu.h>
>   
>   #ifndef INIT_MM_CONTEXT
> @@ -44,7 +45,8 @@ struct mm_struct init_mm = {
>   #endif
>   	.user_ns	= &init_user_ns,
>   	.cpu_bitmap	= CPU_BITS_NONE,
> -	INIT_MM_CONTEXT(init_mm)
> +	INIT_MM_CONTEXT(init_mm),
> +	INIT_MM_SFRAME,

This does not compile on s390, as INIT_MM_CONTEXT() is defined with a 
trailing comma, leading to two consecutive commas. This was already 
reported by the kernel test robot for other architectures.
Likewise, if INIT_MM_SFRAME expands to nothing, there would be two 
consecutive commas.

>   };
>   
>   void setup_initial_init_mm(void *start_code, void *end_code,

Regards,
Jens
-- 
Jens Remus
Linux on Z Development (D3303) and z/VSE Support
+49-7031-16-1128 Office
jremus@de.ibm.com

IBM

IBM Deutschland Research & Development GmbH; Vorsitzender des 
Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der 
Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
  2024-10-23 13:22 ` Jens Remus
@ 2024-10-24  2:22   ` Steven Rostedt
  2024-10-27 17:24   ` Josh Poimboeuf
  1 sibling, 0 replies; 44+ messages in thread
From: Steven Rostedt @ 2024-10-24  2:22 UTC (permalink / raw)
  To: Jens Remus
  Cc: Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James, x86, Heiko Carstens, Ilya Leoshkevich,
	Vasily Gorbik

On Wed, 23 Oct 2024 15:22:35 +0200
Jens Remus <jremus@linux.ibm.com> wrote:

> We are looking forward to implementing support for unwinding of user
> space using SFrame in kernel/perf on s390. One major concern is that
> your x86 implementation currently relies on a fallback to unwinding
> using frame pointers. On s390, unwinding using frame pointers is
> unsupported, because of the lack of a proper s390x ABI [1] specification
> and compiler support. In theory there is an s390-specific alternative of
> unwinding using the backchain (compiler option -mbackchain), but this
> has limitations and there is currently no distribution where user space
> is built with backchain.
> 
> How much of an issue would it be if s390 could not provide an unwinding
> fallback implementation? Do you see a possibility to get away without
> one?

Yes. Even with x86, there's no guarantee that the applications will
have frame pointers available. Basically it just returns a stack trace
of one frame (the IP where user space entered the kernel).

> 
> For s390 support of unwinding using SFrame we would need to make a few
> changes to your generic unwinding framework in the kernel:
> 
> - Support for architectures that do not define CFA == SP at callsite:
>    On s390 the CFA is defined as SP at callsite +160. The stack pointer
>    (SP) therefore unwinds as SP = CFA - 160. For that we would introduce
>    e.g. a sp_val_off field (SP value offset from CFA) in struct
>    user_unwind_frame that would default to 0 on all architectures except
>    s390.
> 
> - Support for architectures where RA is not necessarily saved on stack:
>    On s390 the return address (RA) is not saved (on stack) at function
>    entry. In leaf functions it is not necessarily saved at all.
> 
> - Support for architectures where RA/FP are saved in registers in leaf
>    functions:
>    On s390 the frame pointer (FP) and return address (RA) registers can
>    be saved in other registers when in leaf functions. In the SFrame
>    format we would encode the DWARF register number as offset using the
>    least-significant bit as indication: offset = (regnum << 1) | 1.
>    Therefore we would need to extend your generic unwinding framework to
>    support FP and RA to be restored from registers.
> 
> [1]: s390x ELF ABI supplement,
>       https://github.com/IBM/s390x-abi/releases

Note that Indu (who's on the Cc and is also the author of sframes) gave
a talk at GNU Cauldron about s390 support. I'm assuming that her new
sframe specification will cover all of this. Then we will have to
implement it.

-- Steve



* Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains
  2024-10-23 13:22 ` Jens Remus
  2024-10-24  2:22   ` Steven Rostedt
@ 2024-10-27 17:24   ` Josh Poimboeuf
  1 sibling, 0 replies; 44+ messages in thread
From: Josh Poimboeuf @ 2024-10-27 17:24 UTC (permalink / raw)
  To: Jens Remus
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James, x86, Heiko Carstens, Ilya Leoshkevich,
	Vasily Gorbik

On Wed, Oct 23, 2024 at 03:22:35PM +0200, Jens Remus wrote:
> We are looking forward to implementing support for unwinding of user space
> using SFrame in kernel/perf on s390. One major concern is that your x86
> implementation currently relies on a fallback to unwinding using frame
> pointers. On s390, unwinding using frame pointers is unsupported, because
> of the lack of a proper s390x ABI [1] specification and compiler support.
> In theory there is an s390-specific alternative of unwinding using the
> backchain (compiler option -mbackchain), but this has limitations and
> there is currently no distribution where user space is built with
> backchain.
> 
> How much of an issue would it be if s390 could not provide an unwinding
> fallback implementation? Do you see a possibility to get away without
> one?

No problem, I'll make the fallback dependent on CONFIG_HAVE_UNWIND_USER_FP.

> For s390 support of unwinding using SFrame we would need to make a few
> changes to your generic unwinding framework in the kernel:

Also no problem, I'm sure there will need to be tweaks going forward.

Thanks for looking at it!  v3 will be posted soon.

-- 
Josh



* Re: [PATCH v2 03/11] unwind: Introduce SFrame user space unwinding
  2024-10-23 13:59   ` Jens Remus
@ 2024-10-27 17:49     ` Josh Poimboeuf
  0 siblings, 0 replies; 44+ messages in thread
From: Josh Poimboeuf @ 2024-10-27 17:49 UTC (permalink / raw)
  To: Jens Remus
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James, x86, Heiko Carstens, Ilya Leoshkevich,
	Vasily Gorbik

On Wed, Oct 23, 2024 at 03:59:53PM +0200, Jens Remus wrote:
> > +++ b/arch/Kconfig
> > @@ -428,6 +428,9 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
> >   config HAVE_USER_UNWIND
> >   	bool
> > +config HAVE_USER_UNWIND_SFRAME
> > +	bool
> 
> I found that this SFrame userspace unwinder depends on your generic
> userspace unwinding framework. Does this warrant adding:

Based on your other suggestion I now have:

config UNWIND_USER
	bool

config HAVE_UNWIND_USER_FP
	bool
	select UNWIND_USER

config HAVE_UNWIND_USER_SFRAME
	bool
	select UNWIND_USER

The arches set HAVE_*, which in turn sets UNWIND_USER.

> > +static int find_fre(struct sframe_section *sec, struct sframe_fde *fde,
> > +		    unsigned long ip, struct user_unwind_frame *frame)
> > +{
> > +	unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
> > +	unsigned char fre_type = SFRAME_FUNC_FRE_TYPE(fde->info);
> > +	s32 fre_ip_off, cfa_off, ra_off, fp_off, ip_off;
> 
> Doesn't fre_ip_off need to be u32 (see also below)? The SFrame format
> specification states the FRE sfre_start_address is either u8, u16, or u32:
> https://sourceware.org/binutils/docs/sframe-spec.html#The-SFrame-FRE-Types

Indeed, I also noticed that.

> > +	SFRAME_GET_USER_UNSIGNED(cfa_off, f, offset_size);
> 
> As far as I know the CFA offset from the CFA base register is signed in
> the SFrame file format. See Binutils include/sframe-api.h,
> sframe_fre_get_cfa_offset(). Therefore use SFRAME_GET_USER_SIGNED(). Both
> cfa_off and struct user_unwind_frame cfa_off are already defined as s32.

Good catch.  There's no need to have separate SIGNED/UNSIGNED variants
anyway, I'll simplify that into SFRAME_GET_USER().

> > +++ b/mm/init-mm.c
> > @@ -11,6 +11,7 @@
> >   #include <linux/atomic.h>
> >   #include <linux/user_namespace.h>
> >   #include <linux/iommu.h>
> > +#include <linux/sframe.h>
> >   #include <asm/mmu.h>
> >   #ifndef INIT_MM_CONTEXT
> > @@ -44,7 +45,8 @@ struct mm_struct init_mm = {
> >   #endif
> >   	.user_ns	= &init_user_ns,
> >   	.cpu_bitmap	= CPU_BITS_NONE,
> > -	INIT_MM_CONTEXT(init_mm)
> > +	INIT_MM_CONTEXT(init_mm),
> > +	INIT_MM_SFRAME,
> 
> This does not compile on s390, as INIT_MM_CONTEXT() is defined with a
> trailing comma, leading to two consecutive commas. Already reported by the
> kernel test robot for other architectures.
> Same if INIT_MM_SFRAME expands into nothing there would be two consecutive
> commas.

Yeah, I saw those build errors and I've got that fixed now.

Thanks for the review!

-- 
Josh



end of thread, other threads:[~2024-10-27 17:49 UTC | newest]

Thread overview: 44+ messages
2024-09-13 23:02 [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains Josh Poimboeuf
2024-09-13 23:02 ` [PATCH v2 01/11] unwind: Introduce generic user space unwinding interface Josh Poimboeuf
2024-09-13 23:02 ` [PATCH v2 02/11] unwind/x86: Add HAVE_USER_UNWIND Josh Poimboeuf
2024-09-16 11:46   ` Peter Zijlstra
2024-10-20  8:09     ` Josh Poimboeuf
2024-09-13 23:02 ` [PATCH v2 03/11] unwind: Introduce SFrame user space unwinding Josh Poimboeuf
2024-09-14 11:23   ` Steven Rostedt
2024-10-01 18:20     ` Indu Bhagat
2024-10-01 18:36       ` Steven Rostedt
2024-10-02  8:18         ` Florian Weimer
2024-10-02 14:05           ` Steven Rostedt
2024-10-23 13:59   ` Jens Remus
2024-10-27 17:49     ` Josh Poimboeuf
2024-09-13 23:02 ` [PATCH v2 04/11] unwind/x86/64: Add HAVE_USER_UNWIND_SFRAME Josh Poimboeuf
2024-09-13 23:02 ` [PATCH v2 05/11] perf/x86: Use user_unwind interface Josh Poimboeuf
2024-09-16  6:48   ` kernel test robot
2024-09-17 22:01     ` Namhyung Kim
2024-09-13 23:02 ` [PATCH v2 06/11] perf: Remove get_perf_callchain() 'init_nr' argument Josh Poimboeuf
2024-09-13 23:02 ` [PATCH v2 07/11] perf: Remove get_perf_callchain() 'crosstask' argument Josh Poimboeuf
2024-09-13 23:02 ` [PATCH v2 08/11] perf: Simplify get_perf_callchain() user logic Josh Poimboeuf
2024-09-13 23:02 ` [PATCH v2 09/11] perf: Introduce deferred user callchains Josh Poimboeuf
2024-09-17 22:07   ` Namhyung Kim
2024-09-13 23:02 ` [PATCH v2 10/11] perf/x86: Add HAVE_PERF_CALLCHAIN_DEFERRED Josh Poimboeuf
2024-09-13 23:02 ` [PATCH v2 11/11] perf/x86: Enable SFrame unwinding for deferred user callchains Josh Poimboeuf
2024-09-14 12:12 ` [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains Steven Rostedt
2024-09-15 11:11   ` Josh Poimboeuf
2024-09-15 11:38     ` Steven Rostedt
2024-09-16 14:08     ` Peter Zijlstra
2024-09-16 15:39       ` Josh Poimboeuf
2024-09-16 18:15         ` Peter Zijlstra
2024-09-16  0:15           ` Mathieu Desnoyers
2024-09-16  0:33             ` Mathieu Desnoyers
2024-09-17  0:37               ` Mathieu Desnoyers
2024-09-16 22:46           ` Steven Rostedt
2024-09-17 21:58             ` Namhyung Kim
2024-09-18  5:14               ` Mathieu Desnoyers
2024-10-03  2:31         ` Steven Rostedt
2024-10-03  2:37           ` Josh Poimboeuf
2024-10-03 14:56             ` Steven Rostedt
2024-09-16 16:03       ` Steven Rostedt
2024-09-14 19:37 ` Namhyung Kim
2024-10-23 13:22 ` Jens Remus
2024-10-24  2:22   ` Steven Rostedt
2024-10-27 17:24   ` Josh Poimboeuf
