* [PATCH v4 00/39] unwind, perf: sframe user space unwinding
@ 2025-01-22 2:30 Josh Poimboeuf
From: Josh Poimboeuf @ 2025-01-22 2:30 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
This took a bit longer than expected. I fell into some rabbit holes
chasing a number of subtle bugs. I ended up rewriting the deferral code
several times. But I think the end result is much better.
The deferral request has a new interface, which helps make the
implementation MUCH simpler and less fragile. As a bonus it's now
possible for the request implementation to be NMI-safe.
The interface is similar to {task,irq}_work. The caller owns an
unwind_work struct:
struct unwind_work {
	struct callback_head work;
	unwind_callback_t func;
	int pending;
};
For perf, struct unwind_work is embedded in struct perf_event. For
ftrace maybe it would live in task_struct?
The unwind_work can be passed to the following functions:
void unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
int unwind_deferred_request(struct unwind_work *work, u64 *cookie);
bool unwind_deferred_cancel(struct task_struct *task, struct unwind_work *work);
If unwind_deferred_request() returns success, the callback is
guaranteed. If the callback is already pending, it returns an error,
but the returned *cookie is still valid if it's nonzero.
Questions:
- Peter, I'm not sure how well this works with Intel PEBS. This just
uses the original task regs; is that a problem?
- Namhyung, I rebased your perf tool patches on the new missing
feature validation code, do the patches still look sane?
For testing with user space, here are the latest binutils fixes:
1785837a2570 ("ld: fix PR/32297")
938fb512184d ("ld: fix wrong SFrame info for lazy IBT PLT")
47c88752f9ad ("ld: generate SFrame stack trace info for .plt.got")
An out-of-tree glibc patch is also needed -- will attach in a reply.
Code also available at
git://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git sframe-v4
v4:
- split up patches better [Andrii]
- add callback guarantee [Andrii]
- support multiple non-contiguous elf text segments [Andrii]
- sframe section validation [Andrii]
- x86 compat mode support [Peter]
- implement guard(mmap_read_lock) [Peter]
- synchronize callback with perf event lifetime [Peter]
- detect toolchain sframe support with CONFIG_SFRAME_AS [Jens]
- get vdso working (with updated glibc patches) [Jens]
- rebase perf tool on new missing feature validation code
- brand new deferred interface and implementation
- make unwind_deferred_request() NMI-safe
- sframe debugging infrastructure
- fix some task_work bugs
- enclose multiple user copies in single STAC/CLAC pair for performance
- much banging head on wall, refactoring, simplification
- fix a lot of bugs
Previous revisions
------------------
v3:
https://lore.kernel.org/cover.1730150953.git.jpoimboe@kernel.org
- move the "deferred" logic out of perf and into unwind_user with new
unwind_user_deferred() interface [Steven, Mathieu]
- add more sframe sanity checks [Steven]
- make frame pointers optional depending on arch [Jens]
- fix perf event output [Namhyung]
- include Namhyung's perf tool patches
- enable sframe generation in VDSO
- fix build errors [robot]
v2:
https://lore.kernel.org/cover.1726268190.git.jpoimboe@kernel.org
- rebase on v6.11-rc7
- reorganize the patches to add sframe first
- change to sframe v2
- add new perf event type: PERF_RECORD_CALLCHAIN_DEFERRED
- add new perf attribute: defer_callchain
v1:
https://lore.kernel.org/cover.1699487758.git.jpoimboe@kernel.org
Original description
--------------------
Some distros have started compiling frame pointers into all their
packages to enable the kernel to do system-wide profiling of user space.
Unfortunately that creates a runtime performance penalty across the
entire system. Using DWARF (or .eh_frame) instead isn't feasible
because of complexity and slowness.
For in-kernel unwinding we solved this problem with the creation of the
ORC unwinder for x86_64. Similarly, for user space the GNU assembler
has created the SFrame ("Simple Frame") v2 format starting with binutils
2.41.
These patches add support for unwinding user space from the kernel using
SFrame with perf. It should be easy to add user unwinding support for
other components like ftrace.
There were two main challenges:
1) Finding .sframe sections in shared/dlopened libraries
The kernel has no visibility into the contents of shared libraries.
This was solved by adding a PR_ADD_SFRAME option to prctl() which
allows the runtime linker to manually provide the in-memory address
of an .sframe section to the kernel.
2) Dealing with page faults
Keeping all binaries' sframe data pinned would likely waste a lot of
memory. Instead, read it from user space on demand. That can't be
done from perf NMI context due to page faults, so defer the unwind to
the next user exit. Since the NMI handler doesn't do exit work,
self-IPI and then schedule task work to be run on exit from the IPI.
Special thanks to Indu for the original concept, and to Steven and Peter
for helping a lot with the design. And to Steven for letting me do it ;-)
Josh Poimboeuf (35):
task_work: Fix TWA_NMI_CURRENT error handling
task_work: Fix TWA_NMI_CURRENT race with __schedule()
mm: Add guard for mmap_read_lock
x86/vdso: Fix DWARF generation for getrandom()
x86/asm: Avoid emitting DWARF CFI for non-VDSO
x86/asm: Fix VDSO DWARF generation with kernel IBT enabled
x86/vdso: Use SYM_FUNC_{START,END} in __kernel_vsyscall()
x86/vdso: Use CFI macros in __vdso_sgx_enter_enclave()
x86/vdso: Enable sframe generation in VDSO
x86/uaccess: Add unsafe_copy_from_user() implementation
unwind_user: Add user space unwinding API
unwind_user: Add frame pointer support
unwind_user/x86: Enable frame pointer unwinding on x86
perf/x86: Rename get_segment_base() and make it global
unwind_user: Add compat mode frame pointer support
unwind_user/x86: Enable compat mode frame pointer unwinding on x86
unwind_user/sframe: Add support for reading .sframe headers
unwind_user/sframe: Store sframe section data in per-mm maple tree
unwind_user/sframe: Add support for reading .sframe contents
unwind_user/sframe: Detect .sframe sections in executables
unwind_user/sframe: Add prctl() interface for registering .sframe
sections
unwind_user/sframe: Wire up unwind_user to sframe
unwind_user/sframe/x86: Enable sframe unwinding on x86
unwind_user/sframe: Remove .sframe section on detected corruption
unwind_user/sframe: Show file name in debug output
unwind_user/sframe: Enable debugging in uaccess regions
unwind_user/sframe: Add .sframe validation option
unwind_user/deferred: Add deferred unwinding interface
unwind_user/deferred: Add unwind cache
unwind_user/deferred: Make unwind deferral requests NMI-safe
perf: Remove get_perf_callchain() 'init_nr' argument
perf: Remove get_perf_callchain() 'crosstask' argument
perf: Simplify get_perf_callchain() user logic
perf: Skip user unwind if !current->mm
perf: Support deferred user callchains
Namhyung Kim (4):
perf tools: Minimal CALLCHAIN_DEFERRED support
perf record: Enable defer_callchain for user callchains
perf script: Display PERF_RECORD_CALLCHAIN_DEFERRED
perf tools: Merge deferred user callchains
arch/Kconfig | 40 ++
arch/x86/Kconfig | 3 +
arch/x86/entry/vdso/Makefile | 10 +-
arch/x86/entry/vdso/vdso-layout.lds.S | 5 +-
arch/x86/entry/vdso/vdso32/system_call.S | 10 +-
arch/x86/entry/vdso/vgetrandom-chacha.S | 3 +-
arch/x86/entry/vdso/vsgx.S | 19 +-
arch/x86/events/core.c | 10 +-
arch/x86/include/asm/dwarf2.h | 54 +-
arch/x86/include/asm/linkage.h | 29 +-
arch/x86/include/asm/mmu.h | 2 +-
arch/x86/include/asm/perf_event.h | 2 +
arch/x86/include/asm/uaccess.h | 39 +-
arch/x86/include/asm/unwind_user.h | 61 +++
arch/x86/include/asm/unwind_user_types.h | 17 +
arch/x86/include/asm/vdso.h | 1 -
fs/binfmt_elf.c | 49 +-
include/asm-generic/Kbuild | 2 +
include/asm-generic/unwind_user.h | 24 +
include/asm-generic/unwind_user_types.h | 9 +
include/linux/entry-common.h | 3 +
include/linux/mm_types.h | 3 +
include/linux/mmap_lock.h | 2 +
include/linux/perf_event.h | 15 +-
include/linux/sched.h | 5 +
include/linux/sframe.h | 56 ++
include/linux/unwind_deferred.h | 52 ++
include/linux/unwind_deferred_types.h | 17 +
include/linux/unwind_user.h | 15 +
include/linux/unwind_user_types.h | 36 ++
include/uapi/linux/elf.h | 1 +
include/uapi/linux/perf_event.h | 19 +-
include/uapi/linux/prctl.h | 5 +-
kernel/Makefile | 1 +
kernel/bpf/stackmap.c | 14 +-
kernel/events/callchain.c | 47 +-
kernel/events/core.c | 112 +++-
kernel/fork.c | 14 +
kernel/sys.c | 9 +
kernel/task_work.c | 67 ++-
kernel/unwind/Makefile | 2 +
kernel/unwind/deferred.c | 266 ++++++++++
kernel/unwind/sframe.c | 595 ++++++++++++++++++++++
kernel/unwind/sframe.h | 71 +++
kernel/unwind/sframe_debug.h | 95 ++++
kernel/unwind/user.c | 146 ++++++
mm/init-mm.c | 2 +
tools/include/uapi/linux/perf_event.h | 19 +-
tools/lib/perf/include/perf/event.h | 7 +
tools/perf/Documentation/perf-script.txt | 5 +
tools/perf/builtin-script.c | 92 ++++
tools/perf/util/callchain.c | 24 +
tools/perf/util/callchain.h | 3 +
tools/perf/util/event.c | 1 +
tools/perf/util/evlist.c | 1 +
tools/perf/util/evlist.h | 1 +
tools/perf/util/evsel.c | 39 ++
tools/perf/util/evsel.h | 1 +
tools/perf/util/machine.c | 1 +
tools/perf/util/perf_event_attr_fprintf.c | 1 +
tools/perf/util/sample.h | 3 +-
tools/perf/util/session.c | 78 +++
tools/perf/util/tool.c | 2 +
tools/perf/util/tool.h | 4 +-
64 files changed, 2208 insertions(+), 133 deletions(-)
create mode 100644 arch/x86/include/asm/unwind_user.h
create mode 100644 arch/x86/include/asm/unwind_user_types.h
create mode 100644 include/asm-generic/unwind_user.h
create mode 100644 include/asm-generic/unwind_user_types.h
create mode 100644 include/linux/sframe.h
create mode 100644 include/linux/unwind_deferred.h
create mode 100644 include/linux/unwind_deferred_types.h
create mode 100644 include/linux/unwind_user.h
create mode 100644 include/linux/unwind_user_types.h
create mode 100644 kernel/unwind/Makefile
create mode 100644 kernel/unwind/deferred.c
create mode 100644 kernel/unwind/sframe.c
create mode 100644 kernel/unwind/sframe.h
create mode 100644 kernel/unwind/sframe_debug.h
create mode 100644 kernel/unwind/user.c
--
2.48.1
* [PATCH v4 01/39] task_work: Fix TWA_NMI_CURRENT error handling
From: Josh Poimboeuf @ 2025-01-22 2:30 UTC (permalink / raw)
It's possible for irq_work_queue() to fail if the work has already been
claimed. That can happen if a TWA_NMI_CURRENT task work is requested
before a previous TWA_NMI_CURRENT IRQ work on the same CPU has gotten a
chance to run.
The error has to be checked before the write to task->task_works. Also,
the try_cmpxchg() loop isn't needed in NMI context. The TWA_NMI_CURRENT
case really is special, so keep things simple by keeping its code all
together in one place.
Fixes: 466e4d801cd4 ("task_work: Add TWA_NMI_CURRENT as an additional notify mode.")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
kernel/task_work.c | 39 ++++++++++++++++++++++++++-------------
1 file changed, 26 insertions(+), 13 deletions(-)
diff --git a/kernel/task_work.c b/kernel/task_work.c
index c969f1f26be5..92024a8bfe12 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -58,25 +58,38 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
int flags = notify & TWA_FLAGS;
notify &= ~TWA_FLAGS;
+
if (notify == TWA_NMI_CURRENT) {
- if (WARN_ON_ONCE(task != current))
+ if (WARN_ON_ONCE(!in_nmi() || task != current))
return -EINVAL;
if (!IS_ENABLED(CONFIG_IRQ_WORK))
return -EINVAL;
- } else {
- /*
- * Record the work call stack in order to print it in KASAN
- * reports.
- *
- * Note that stack allocation can fail if TWAF_NO_ALLOC flag
- * is set and new page is needed to expand the stack buffer.
- */
- if (flags & TWAF_NO_ALLOC)
- kasan_record_aux_stack_noalloc(work);
- else
- kasan_record_aux_stack(work);
+#ifdef CONFIG_IRQ_WORK
+ head = task->task_works;
+ if (unlikely(head == &work_exited))
+ return -ESRCH;
+
+ if (!irq_work_queue(this_cpu_ptr(&irq_work_NMI_resume)))
+ return -EBUSY;
+
+ work->next = head;
+ task->task_works = work;
+#endif
+ return 0;
}
+ /*
+ * Record the work call stack in order to print it in KASAN
+ * reports.
+ *
+ * Note that stack allocation can fail if TWAF_NO_ALLOC flag
+ * is set and new page is needed to expand the stack buffer.
+ */
+ if (flags & TWAF_NO_ALLOC)
+ kasan_record_aux_stack_noalloc(work);
+ else
+ kasan_record_aux_stack(work);
+
head = READ_ONCE(task->task_works);
do {
if (unlikely(head == &work_exited))
--
2.48.1
* [PATCH v4 02/39] task_work: Fix TWA_NMI_CURRENT race with __schedule()
From: Josh Poimboeuf @ 2025-01-22 2:30 UTC (permalink / raw)
If TWA_NMI_CURRENT task work is queued from an NMI triggered while
running in __schedule() with IRQs disabled, task_work_set_notify_irq()
ends up inadvertently running on the next scheduled task. So the
original task doesn't get its TIF_NOTIFY_RESUME flag set and the task
work may get delayed indefinitely, or may not get to run at all.
__schedule()
	// disable irqs
	<NMI>
		task_work_add(current, work, TWA_NMI_CURRENT);
	</NMI>
	// current = next;
	// enable irqs
	<IRQ>
		task_work_set_notify_irq()
			test_and_set_tsk_thread_flag(current, TIF_NOTIFY_RESUME); // wrong task!
	</IRQ>
// original task skips task work on its next return to user (or exit!)
Fix it by storing the task pointer along with the irq_work struct and
passing that task to set_notify_resume().
Fixes: 466e4d801cd4 ("task_work: Add TWA_NMI_CURRENT as an additional notify mode.")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
kernel/task_work.c | 30 +++++++++++++++++++++---------
1 file changed, 21 insertions(+), 9 deletions(-)
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 92024a8bfe12..f17447f69843 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -7,12 +7,23 @@
static struct callback_head work_exited; /* all we need is ->next == NULL */
#ifdef CONFIG_IRQ_WORK
+
+struct nmi_irq_work {
+ struct irq_work work;
+ struct task_struct *task;
+};
+
static void task_work_set_notify_irq(struct irq_work *entry)
{
- test_and_set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+ struct nmi_irq_work *work = container_of(entry, struct nmi_irq_work, work);
+
+ set_notify_resume(work->task);
}
-static DEFINE_PER_CPU(struct irq_work, irq_work_NMI_resume) =
- IRQ_WORK_INIT_HARD(task_work_set_notify_irq);
+
+static DEFINE_PER_CPU(struct nmi_irq_work, nmi_irq_work) = {
+ .work = IRQ_WORK_INIT_HARD(task_work_set_notify_irq),
+};
+
#endif
/**
@@ -65,15 +76,21 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
if (!IS_ENABLED(CONFIG_IRQ_WORK))
return -EINVAL;
#ifdef CONFIG_IRQ_WORK
+{
+ struct nmi_irq_work *irq_work = this_cpu_ptr(&nmi_irq_work);
+
head = task->task_works;
if (unlikely(head == &work_exited))
return -ESRCH;
- if (!irq_work_queue(this_cpu_ptr(&irq_work_NMI_resume)))
+ if (!irq_work_queue(&irq_work->work))
return -EBUSY;
+ irq_work->task = current;
+
work->next = head;
task->task_works = work;
+}
#endif
return 0;
}
@@ -109,11 +126,6 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
case TWA_SIGNAL_NO_IPI:
__set_notify_signal(task);
break;
-#ifdef CONFIG_IRQ_WORK
- case TWA_NMI_CURRENT:
- irq_work_queue(this_cpu_ptr(&irq_work_NMI_resume));
- break;
-#endif
default:
WARN_ON_ONCE(1);
break;
--
2.48.1
* [PATCH v4 03/39] mm: Add guard for mmap_read_lock
From: Josh Poimboeuf @ 2025-01-22 2:30 UTC (permalink / raw)
This is the new way of doing things. Converting all existing
mmap_read_lock users is an exercise left for the reader ;-)
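As a sketch, a converted call site could look like this (hypothetical example; this patch converts no existing users):

```c
/* before: explicit lock/unlock pair */
mmap_read_lock(mm);
vma = find_vma(mm, addr);
mmap_read_unlock(mm);

/* after: the guard releases mmap_lock automatically at scope exit */
scoped_guard(mmap_read_lock, mm) {
	vma = find_vma(mm, addr);
}
```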
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
include/linux/mmap_lock.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index de9dc20b01ba..c971c4617060 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -182,4 +182,6 @@ static inline int mmap_lock_is_contended(struct mm_struct *mm)
return rwsem_is_contended(&mm->mmap_lock);
}
+DEFINE_GUARD(mmap_read_lock, struct mm_struct *, mmap_read_lock(_T), mmap_read_unlock(_T))
+
#endif /* _LINUX_MMAP_LOCK_H */
--
2.48.1
* [PATCH v4 04/39] x86/vdso: Fix DWARF generation for getrandom()
From: Josh Poimboeuf @ 2025-01-22 2:30 UTC (permalink / raw)
Add CFI annotations to the VDSO implementation of getrandom() so it will
have valid DWARF unwinding metadata.
Fixes: 33385150ac45 ("x86: vdso: Wire up getrandom() vDSO implementation")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/entry/vdso/vgetrandom-chacha.S | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/x86/entry/vdso/vgetrandom-chacha.S b/arch/x86/entry/vdso/vgetrandom-chacha.S
index bcba5639b8ee..cc82da9216fb 100644
--- a/arch/x86/entry/vdso/vgetrandom-chacha.S
+++ b/arch/x86/entry/vdso/vgetrandom-chacha.S
@@ -4,7 +4,7 @@
*/
#include <linux/linkage.h>
-#include <asm/frame.h>
+#include <asm/dwarf2.h>
.section .rodata, "a"
.align 16
@@ -22,7 +22,7 @@ CONSTANTS: .octa 0x6b20657479622d323320646e61707865
* rcx: number of 64-byte blocks to write to output
*/
SYM_FUNC_START(__arch_chacha20_blocks_nostack)
-
+ CFI_STARTPROC
.set output, %rdi
.set key, %rsi
.set counter, %rdx
@@ -175,4 +175,5 @@ SYM_FUNC_START(__arch_chacha20_blocks_nostack)
pxor temp,temp
ret
+ CFI_ENDPROC
SYM_FUNC_END(__arch_chacha20_blocks_nostack)
--
2.48.1
* [PATCH v4 05/39] x86/asm: Avoid emitting DWARF CFI for non-VDSO
From: Josh Poimboeuf @ 2025-01-22 2:30 UTC (permalink / raw)
It was decided years ago that .cfi_* annotations aren't maintainable in
the kernel. They were replaced by objtool unwind hints. For the kernel
proper, ensure the CFI_* macros don't do anything.
On the other hand the VDSO library *does* use them, so user space can
unwind through it.
Make sure these macros only work for VDSO. They aren't actually being
used outside of VDSO anyway, so there's no functional change.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/include/asm/dwarf2.h | 51 ++++++++++++++++++++++++-----------
1 file changed, 35 insertions(+), 16 deletions(-)
diff --git a/arch/x86/include/asm/dwarf2.h b/arch/x86/include/asm/dwarf2.h
index 430fca13bb56..b195b3c8677e 100644
--- a/arch/x86/include/asm/dwarf2.h
+++ b/arch/x86/include/asm/dwarf2.h
@@ -6,6 +6,15 @@
#warning "asm/dwarf2.h should be only included in pure assembly files"
#endif
+#ifdef BUILD_VDSO
+
+ /*
+ * For the vDSO, emit both runtime unwind information and debug
+ * symbols for the .dbg file.
+ */
+
+ .cfi_sections .eh_frame, .debug_frame
+
#define CFI_STARTPROC .cfi_startproc
#define CFI_ENDPROC .cfi_endproc
#define CFI_DEF_CFA .cfi_def_cfa
@@ -21,21 +30,31 @@
#define CFI_UNDEFINED .cfi_undefined
#define CFI_ESCAPE .cfi_escape
-#ifndef BUILD_VDSO
- /*
- * Emit CFI data in .debug_frame sections, not .eh_frame sections.
- * The latter we currently just discard since we don't do DWARF
- * unwinding at runtime. So only the offline DWARF information is
- * useful to anyone. Note we should not use this directive if we
- * ever decide to enable DWARF unwinding at runtime.
- */
- .cfi_sections .debug_frame
-#else
- /*
- * For the vDSO, emit both runtime unwind information and debug
- * symbols for the .dbg file.
- */
- .cfi_sections .eh_frame, .debug_frame
-#endif
+#else /* !BUILD_VDSO */
+
+/*
+ * On x86, these macros aren't used outside VDSO. As well they shouldn't be:
+ * they're fragile and very difficult to maintain.
+ */
+
+.macro nocfi args:vararg
+.endm
+
+#define CFI_STARTPROC nocfi
+#define CFI_ENDPROC nocfi
+#define CFI_DEF_CFA nocfi
+#define CFI_DEF_CFA_REGISTER nocfi
+#define CFI_DEF_CFA_OFFSET nocfi
+#define CFI_ADJUST_CFA_OFFSET nocfi
+#define CFI_OFFSET nocfi
+#define CFI_REL_OFFSET nocfi
+#define CFI_REGISTER nocfi
+#define CFI_RESTORE nocfi
+#define CFI_REMEMBER_STATE nocfi
+#define CFI_RESTORE_STATE nocfi
+#define CFI_UNDEFINED nocfi
+#define CFI_ESCAPE nocfi
+
+#endif /* !BUILD_VDSO */
#endif /* _ASM_X86_DWARF2_H */
--
2.48.1
* [PATCH v4 06/39] x86/asm: Fix VDSO DWARF generation with kernel IBT enabled
From: Josh Poimboeuf @ 2025-01-22 2:30 UTC (permalink / raw)
The DWARF .cfi_startproc annotation needs to be at the very beginning of
a function. But with kernel IBT that doesn't happen as ENDBR is
sneakily embedded in SYM_FUNC_START. As a result the DWARF unwinding
info is wrong at the beginning of all the VDSO functions.
Fix it by adding CFI_STARTPROC and CFI_ENDPROC to SYM_FUNC_START_* and
SYM_FUNC_END respectively. Note this only affects VDSO, as the CFI_*
macros are empty for the kernel proper.
Fixes: c4691712b546 ("x86/linkage: Add ENDBR to SYM_FUNC_START*()")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/entry/vdso/vdso-layout.lds.S | 2 +-
arch/x86/entry/vdso/vgetrandom-chacha.S | 2 --
arch/x86/entry/vdso/vsgx.S | 4 ----
arch/x86/include/asm/linkage.h | 29 +++++++++++++++++++------
arch/x86/include/asm/vdso.h | 1 -
5 files changed, 23 insertions(+), 15 deletions(-)
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S
index 872947c1004c..506c9800a5aa 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -1,5 +1,5 @@
/* SPDX-License-Identifier: GPL-2.0 */
-#include <asm/vdso.h>
+#include <asm/page_types.h>
#include <asm/vdso/vsyscall.h>
/*
diff --git a/arch/x86/entry/vdso/vgetrandom-chacha.S b/arch/x86/entry/vdso/vgetrandom-chacha.S
index cc82da9216fb..a33212594731 100644
--- a/arch/x86/entry/vdso/vgetrandom-chacha.S
+++ b/arch/x86/entry/vdso/vgetrandom-chacha.S
@@ -22,7 +22,6 @@ CONSTANTS: .octa 0x6b20657479622d323320646e61707865
* rcx: number of 64-byte blocks to write to output
*/
SYM_FUNC_START(__arch_chacha20_blocks_nostack)
- CFI_STARTPROC
.set output, %rdi
.set key, %rsi
.set counter, %rdx
@@ -175,5 +174,4 @@ SYM_FUNC_START(__arch_chacha20_blocks_nostack)
pxor temp,temp
ret
- CFI_ENDPROC
SYM_FUNC_END(__arch_chacha20_blocks_nostack)
diff --git a/arch/x86/entry/vdso/vsgx.S b/arch/x86/entry/vdso/vsgx.S
index 37a3d4c02366..c0342238c976 100644
--- a/arch/x86/entry/vdso/vsgx.S
+++ b/arch/x86/entry/vdso/vsgx.S
@@ -24,8 +24,6 @@
.section .text, "ax"
SYM_FUNC_START(__vdso_sgx_enter_enclave)
- /* Prolog */
- .cfi_startproc
push %rbp
.cfi_adjust_cfa_offset 8
.cfi_rel_offset %rbp, 0
@@ -143,8 +141,6 @@ SYM_FUNC_START(__vdso_sgx_enter_enclave)
jle .Lout
jmp .Lenter_enclave
- .cfi_endproc
-
_ASM_VDSO_EXTABLE_HANDLE(.Lenclu_eenter_eresume, .Lhandle_exception)
SYM_FUNC_END(__vdso_sgx_enter_enclave)
diff --git a/arch/x86/include/asm/linkage.h b/arch/x86/include/asm/linkage.h
index dc31b13b87a0..2866d57ef907 100644
--- a/arch/x86/include/asm/linkage.h
+++ b/arch/x86/include/asm/linkage.h
@@ -40,6 +40,10 @@
#ifdef __ASSEMBLY__
+#ifndef LINKER_SCRIPT
+#include <asm/dwarf2.h>
+#endif
+
#if defined(CONFIG_MITIGATION_RETHUNK) && !defined(__DISABLE_EXPORTS) && !defined(BUILD_VDSO)
#define RET jmp __x86_return_thunk
#else /* CONFIG_MITIGATION_RETPOLINE */
@@ -112,40 +116,51 @@
# define SYM_FUNC_ALIAS_MEMFUNC SYM_FUNC_ALIAS
#endif
+#define __SYM_FUNC_START \
+ CFI_STARTPROC ASM_NL \
+ ENDBR
+
+#define __SYM_FUNC_END \
+ CFI_ENDPROC ASM_NL
+
/* SYM_TYPED_FUNC_START -- use for indirectly called globals, w/ CFI type */
#define SYM_TYPED_FUNC_START(name) \
SYM_TYPED_START(name, SYM_L_GLOBAL, SYM_F_ALIGN) \
- ENDBR
+ __SYM_FUNC_START
/* SYM_FUNC_START -- use for global functions */
#define SYM_FUNC_START(name) \
SYM_START(name, SYM_L_GLOBAL, SYM_F_ALIGN) \
- ENDBR
+ __SYM_FUNC_START
/* SYM_FUNC_START_NOALIGN -- use for global functions, w/o alignment */
#define SYM_FUNC_START_NOALIGN(name) \
SYM_START(name, SYM_L_GLOBAL, SYM_A_NONE) \
- ENDBR
+ __SYM_FUNC_START
/* SYM_FUNC_START_LOCAL -- use for local functions */
#define SYM_FUNC_START_LOCAL(name) \
SYM_START(name, SYM_L_LOCAL, SYM_F_ALIGN) \
- ENDBR
+ __SYM_FUNC_START
/* SYM_FUNC_START_LOCAL_NOALIGN -- use for local functions, w/o alignment */
#define SYM_FUNC_START_LOCAL_NOALIGN(name) \
SYM_START(name, SYM_L_LOCAL, SYM_A_NONE) \
- ENDBR
+ __SYM_FUNC_START
/* SYM_FUNC_START_WEAK -- use for weak functions */
#define SYM_FUNC_START_WEAK(name) \
SYM_START(name, SYM_L_WEAK, SYM_F_ALIGN) \
- ENDBR
+ __SYM_FUNC_START
/* SYM_FUNC_START_WEAK_NOALIGN -- use for weak functions, w/o alignment */
#define SYM_FUNC_START_WEAK_NOALIGN(name) \
SYM_START(name, SYM_L_WEAK, SYM_A_NONE) \
- ENDBR
+ __SYM_FUNC_START
+
+#define SYM_FUNC_END(name) \
+ __SYM_FUNC_END \
+ SYM_END(name, SYM_T_FUNC)
#endif /* _ASM_X86_LINKAGE_H */
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index d7f6592b74a9..0111c349bbc5 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -2,7 +2,6 @@
#ifndef _ASM_X86_VDSO_H
#define _ASM_X86_VDSO_H
-#include <asm/page_types.h>
#include <linux/linkage.h>
#include <linux/init.h>
--
2.48.1
* [PATCH v4 07/39] x86/vdso: Use SYM_FUNC_{START,END} in __kernel_vsyscall()
From: Josh Poimboeuf @ 2025-01-22 2:30 UTC (permalink / raw)
Use SYM_FUNC_{START,END} instead of all the boilerplate. No functional
change.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/entry/vdso/vdso32/system_call.S | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/arch/x86/entry/vdso/vdso32/system_call.S b/arch/x86/entry/vdso/vdso32/system_call.S
index d33c6513fd2c..bdc576548240 100644
--- a/arch/x86/entry/vdso/vdso32/system_call.S
+++ b/arch/x86/entry/vdso/vdso32/system_call.S
@@ -9,11 +9,7 @@
#include <asm/alternative.h>
.text
- .globl __kernel_vsyscall
- .type __kernel_vsyscall,@function
- ALIGN
-__kernel_vsyscall:
- CFI_STARTPROC
+SYM_FUNC_START(__kernel_vsyscall)
/*
* Reshuffle regs so that all of any of the entry instructions
* will preserve enough state.
@@ -79,7 +75,5 @@ SYM_INNER_LABEL(int80_landing_pad, SYM_L_GLOBAL)
CFI_RESTORE ecx
CFI_ADJUST_CFA_OFFSET -4
RET
- CFI_ENDPROC
-
- .size __kernel_vsyscall,.-__kernel_vsyscall
+SYM_FUNC_END(__kernel_vsyscall)
.previous
--
2.48.1
* [PATCH v4 08/39] x86/vdso: Use CFI macros in __vdso_sgx_enter_enclave()
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (6 preceding siblings ...)
2025-01-22 2:30 ` [PATCH v4 07/39] x86/vdso: Use SYM_FUNC_{START,END} in __kernel_vsyscall() Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 09/39] x86/vdso: Enable sframe generation in VDSO Josh Poimboeuf
` (32 subsequent siblings)
40 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
Use the CFI macros instead of the raw .cfi_* directives to be consistent
with the rest of the VDSO asm. It's also easier on the eyes.
No functional changes.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/entry/vdso/vsgx.S | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/arch/x86/entry/vdso/vsgx.S b/arch/x86/entry/vdso/vsgx.S
index c0342238c976..8d7b8eb45c50 100644
--- a/arch/x86/entry/vdso/vsgx.S
+++ b/arch/x86/entry/vdso/vsgx.S
@@ -24,13 +24,14 @@
.section .text, "ax"
SYM_FUNC_START(__vdso_sgx_enter_enclave)
+ SYM_F_ALIGN
push %rbp
- .cfi_adjust_cfa_offset 8
- .cfi_rel_offset %rbp, 0
+ CFI_ADJUST_CFA_OFFSET 8
+ CFI_REL_OFFSET %rbp, 0
mov %rsp, %rbp
- .cfi_def_cfa_register %rbp
+ CFI_DEF_CFA_REGISTER %rbp
push %rbx
- .cfi_rel_offset %rbx, -8
+ CFI_REL_OFFSET %rbx, -8
mov %ecx, %eax
.Lenter_enclave:
@@ -77,13 +78,11 @@ SYM_FUNC_START(__vdso_sgx_enter_enclave)
.Lout:
pop %rbx
leave
- .cfi_def_cfa %rsp, 8
+ CFI_DEF_CFA %rsp, 8
RET
- /* The out-of-line code runs with the pre-leave stack frame. */
- .cfi_def_cfa %rbp, 16
-
.Linvalid_input:
+ CFI_DEF_CFA %rbp, 16
mov $(-EINVAL), %eax
jmp .Lout
--
2.48.1
* [PATCH v4 09/39] x86/vdso: Enable sframe generation in VDSO
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (7 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 08/39] x86/vdso: Use CFI macros in __vdso_sgx_enter_enclave() Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-24 16:00 ` Jens Remus
2025-01-24 16:30 ` Jens Remus
2025-01-22 2:31 ` [PATCH v4 10/39] x86/uaccess: Add unsafe_copy_from_user() implementation Josh Poimboeuf
` (31 subsequent siblings)
40 siblings, 2 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
Enable sframe generation in the VDSO library so kernel and user space
can unwind through it.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/Kconfig | 3 +++
arch/x86/entry/vdso/Makefile | 10 +++++++---
arch/x86/entry/vdso/vdso-layout.lds.S | 3 +++
arch/x86/include/asm/dwarf2.h | 5 ++++-
4 files changed, 17 insertions(+), 4 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 6682b2a53e34..65228c78fef0 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -435,6 +435,9 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
It uses the same command line parameters, and sysctl interface,
as the generic hardlockup detectors.
+config AS_SFRAME
+ def_bool $(as-instr,.cfi_sections .sframe\n.cfi_startproc\n.cfi_endproc)
+
config HAVE_PERF_REGS
bool
help
diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index c9216ac4fb1e..478de89029d1 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -47,13 +47,17 @@ quiet_cmd_vdso2c = VDSO2C $@
$(obj)/vdso-image-%.c: $(obj)/vdso%.so.dbg $(obj)/vdso%.so $(obj)/vdso2c FORCE
$(call if_changed,vdso2c)
+ifdef CONFIG_AS_SFRAME
+SFRAME_CFLAGS := -Wa$(comma)-gsframe
+endif
+
#
# Don't omit frame pointers for ease of userspace debugging, but do
# optimize sibling calls.
#
CFL := $(PROFILING) -mcmodel=small -fPIC -O2 -fasynchronous-unwind-tables -m64 \
$(filter -g%,$(KBUILD_CFLAGS)) -fno-stack-protector \
- -fno-omit-frame-pointer -foptimize-sibling-calls \
+ -fno-omit-frame-pointer $(SFRAME_CFLAGS) -foptimize-sibling-calls \
-DDISABLE_BRANCH_PROFILING -DBUILD_VDSO
ifdef CONFIG_MITIGATION_RETPOLINE
@@ -63,7 +67,7 @@ endif
endif
$(vobjs): KBUILD_CFLAGS := $(filter-out $(PADDING_CFLAGS) $(CC_FLAGS_LTO) $(CC_FLAGS_CFI) $(RANDSTRUCT_CFLAGS) $(GCC_PLUGINS_CFLAGS) $(RETPOLINE_CFLAGS),$(KBUILD_CFLAGS)) $(CFL)
-$(vobjs): KBUILD_AFLAGS += -DBUILD_VDSO
+$(vobjs): KBUILD_AFLAGS += -DBUILD_VDSO $(SFRAME_CFLAGS)
#
# vDSO code runs in userspace and -pg doesn't help with profiling anyway.
@@ -104,7 +108,7 @@ $(obj)/%-x32.o: $(obj)/%.o FORCE
targets += vdsox32.lds $(vobjx32s-y)
-$(obj)/%.so: OBJCOPYFLAGS := -S --remove-section __ex_table
+$(obj)/%.so: OBJCOPYFLAGS := -g --remove-section __ex_table
$(obj)/%.so: $(obj)/%.so.dbg FORCE
$(call if_changed,objcopy)
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S
index 506c9800a5aa..4dcde4747b07 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -63,6 +63,7 @@ SECTIONS
.eh_frame_hdr : { *(.eh_frame_hdr) } :text :eh_frame_hdr
.eh_frame : { KEEP (*(.eh_frame)) } :text
+ .sframe : { *(.sframe) } :text :sframe
/*
* Text is well-separated from actual data: there's plenty of
@@ -91,6 +92,7 @@ SECTIONS
* Very old versions of ld do not recognize this name token; use the constant.
*/
#define PT_GNU_EH_FRAME 0x6474e550
+#define PT_GNU_SFRAME 0x6474e554
/*
* We must supply the ELF program headers explicitly to get just one
@@ -102,4 +104,5 @@ PHDRS
dynamic PT_DYNAMIC FLAGS(4); /* PF_R */
note PT_NOTE FLAGS(4); /* PF_R */
eh_frame_hdr PT_GNU_EH_FRAME;
+ sframe PT_GNU_SFRAME;
}
diff --git a/arch/x86/include/asm/dwarf2.h b/arch/x86/include/asm/dwarf2.h
index b195b3c8677e..1c354f648505 100644
--- a/arch/x86/include/asm/dwarf2.h
+++ b/arch/x86/include/asm/dwarf2.h
@@ -12,8 +12,11 @@
* For the vDSO, emit both runtime unwind information and debug
* symbols for the .dbg file.
*/
-
+#ifdef __x86_64__
+ .cfi_sections .eh_frame, .debug_frame, .sframe
+#else
.cfi_sections .eh_frame, .debug_frame
+#endif
#define CFI_STARTPROC .cfi_startproc
#define CFI_ENDPROC .cfi_endproc
--
2.48.1
* [PATCH v4 10/39] x86/uaccess: Add unsafe_copy_from_user() implementation
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (8 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 09/39] x86/vdso: Enable sframe generation in VDSO Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 11/39] unwind_user: Add user space unwinding API Josh Poimboeuf
` (30 subsequent siblings)
40 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
Add an x86 implementation of unsafe_copy_from_user() similar to the
existing unsafe_copy_to_user().
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/include/asm/uaccess.h | 39 +++++++++++++++++++++++++---------
1 file changed, 29 insertions(+), 10 deletions(-)
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 3a7755c1a441..a3148865bc57 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -599,23 +599,42 @@ _label: \
* We want the unsafe accessors to always be inlined and use
* the error labels - thus the macro games.
*/
-#define unsafe_copy_loop(dst, src, len, type, label) \
+#define unsafe_copy_to_user_loop(dst, src, len, type, label) \
while (len >= sizeof(type)) { \
- unsafe_put_user(*(type *)(src),(type __user *)(dst),label); \
+ unsafe_put_user(*(type *)(src), (type __user *)(dst), label); \
dst += sizeof(type); \
src += sizeof(type); \
len -= sizeof(type); \
}
-#define unsafe_copy_to_user(_dst,_src,_len,label) \
+#define unsafe_copy_to_user(_dst, _src, _len, label) \
do { \
- char __user *__ucu_dst = (_dst); \
- const char *__ucu_src = (_src); \
- size_t __ucu_len = (_len); \
- unsafe_copy_loop(__ucu_dst, __ucu_src, __ucu_len, u64, label); \
- unsafe_copy_loop(__ucu_dst, __ucu_src, __ucu_len, u32, label); \
- unsafe_copy_loop(__ucu_dst, __ucu_src, __ucu_len, u16, label); \
- unsafe_copy_loop(__ucu_dst, __ucu_src, __ucu_len, u8, label); \
+ void __user *__dst = (_dst); \
+ const void *__src = (_src); \
+ size_t __len = (_len); \
+ unsafe_copy_to_user_loop(__dst, __src, __len, u64, label); \
+ unsafe_copy_to_user_loop(__dst, __src, __len, u32, label); \
+ unsafe_copy_to_user_loop(__dst, __src, __len, u16, label); \
+ unsafe_copy_to_user_loop(__dst, __src, __len, u8, label); \
+} while (0)
+
+#define unsafe_copy_from_user_loop(dst, src, len, type, label) \
+ while (len >= sizeof(type)) { \
+ unsafe_get_user(*(type *)(dst), (type __user *)(src), label); \
+ dst += sizeof(type); \
+ src += sizeof(type); \
+ len -= sizeof(type); \
+ }
+
+#define unsafe_copy_from_user(_dst, _src, _len, label) \
+do { \
+ void *__dst = (_dst); \
+ void __user *__src = (_src); \
+ size_t __len = (_len); \
+ unsafe_copy_from_user_loop(__dst, __src, __len, u64, label); \
+ unsafe_copy_from_user_loop(__dst, __src, __len, u32, label); \
+ unsafe_copy_from_user_loop(__dst, __src, __len, u16, label); \
+ unsafe_copy_from_user_loop(__dst, __src, __len, u8, label); \
} while (0)
#ifdef CONFIG_CC_HAS_ASM_GOTO_OUTPUT
--
2.48.1
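The descending chunk-size pattern these macros implement can be sketched in plain user-space C. This is an illustrative analogue only: `unsafe_get_user()`/`unsafe_put_user()` and their fault labels are stood in for by `memcpy()`, so no fault handling is modeled.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/*
 * Mirror of the unsafe_copy_*_user loop structure: consume the buffer
 * in the widest chunks first (u64, then u32, u16, u8).  memcpy() stands
 * in for the per-chunk unsafe_get_user()/unsafe_put_user() accessors.
 */
#define COPY_CHUNK_LOOP(dst, src, len, type)		\
	while ((len) >= sizeof(type)) {			\
		memcpy((dst), (src), sizeof(type));	\
		(dst) += sizeof(type);			\
		(src) += sizeof(type);			\
		(len) -= sizeof(type);			\
	}

static void copy_chunked(void *_dst, const void *_src, size_t _len)
{
	unsigned char *dst = _dst;
	const unsigned char *src = _src;
	size_t len = _len;

	COPY_CHUNK_LOOP(dst, src, len, uint64_t);
	COPY_CHUNK_LOOP(dst, src, len, uint32_t);
	COPY_CHUNK_LOOP(dst, src, len, uint16_t);
	COPY_CHUNK_LOOP(dst, src, len, uint8_t);
}
```

A 13-byte copy, for example, decomposes into one 8-byte, one 4-byte, and one 1-byte access.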
* [PATCH v4 11/39] unwind_user: Add user space unwinding API
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (9 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 10/39] x86/uaccess: Add unsafe_copy_from_user() implementation Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-24 16:41 ` Jens Remus
` (2 more replies)
2025-01-22 2:31 ` [PATCH v4 12/39] unwind_user: Add frame pointer support Josh Poimboeuf
` (29 subsequent siblings)
40 siblings, 3 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
Introduce a generic API for unwinding user stacks.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/Kconfig | 3 ++
include/linux/unwind_user.h | 15 ++++++++
include/linux/unwind_user_types.h | 31 ++++++++++++++++
kernel/Makefile | 1 +
kernel/unwind/Makefile | 1 +
kernel/unwind/user.c | 59 +++++++++++++++++++++++++++++++
6 files changed, 110 insertions(+)
create mode 100644 include/linux/unwind_user.h
create mode 100644 include/linux/unwind_user_types.h
create mode 100644 kernel/unwind/Makefile
create mode 100644 kernel/unwind/user.c
diff --git a/arch/Kconfig b/arch/Kconfig
index 65228c78fef0..c6fa2b3ecbc6 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -435,6 +435,9 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
It uses the same command line parameters, and sysctl interface,
as the generic hardlockup detectors.
+config UNWIND_USER
+ bool
+
config AS_SFRAME
def_bool $(as-instr,.cfi_sections .sframe\n.cfi_startproc\n.cfi_endproc)
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
new file mode 100644
index 000000000000..aa7923c1384f
--- /dev/null
+++ b/include/linux/unwind_user.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_H
+#define _LINUX_UNWIND_USER_H
+
+#include <linux/unwind_user_types.h>
+
+int unwind_user_start(struct unwind_user_state *state);
+int unwind_user_next(struct unwind_user_state *state);
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
+
+#define for_each_user_frame(state) \
+ for (unwind_user_start((state)); !(state)->done; unwind_user_next((state)))
+
+#endif /* _LINUX_UNWIND_USER_H */
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
new file mode 100644
index 000000000000..6ed1b4ae74e1
--- /dev/null
+++ b/include/linux/unwind_user_types.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_TYPES_H
+#define _LINUX_UNWIND_USER_TYPES_H
+
+#include <linux/types.h>
+
+enum unwind_user_type {
+ UNWIND_USER_TYPE_NONE,
+};
+
+struct unwind_stacktrace {
+ unsigned int nr;
+ unsigned long *entries;
+};
+
+struct unwind_user_frame {
+ s32 cfa_off;
+ s32 ra_off;
+ s32 fp_off;
+ bool use_fp;
+};
+
+struct unwind_user_state {
+ unsigned long ip;
+ unsigned long sp;
+ unsigned long fp;
+ enum unwind_user_type type;
+ bool done;
+};
+
+#endif /* _LINUX_UNWIND_USER_TYPES_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 87866b037fbe..6cb4b0e02a34 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,6 +50,7 @@ obj-y += rcu/
obj-y += livepatch/
obj-y += dma/
obj-y += entry/
+obj-y += unwind/
obj-$(CONFIG_MODULES) += module/
obj-$(CONFIG_KCMP) += kcmp.o
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
new file mode 100644
index 000000000000..349ce3677526
--- /dev/null
+++ b/kernel/unwind/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_UNWIND_USER) += user.o
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
new file mode 100644
index 000000000000..456539635e49
--- /dev/null
+++ b/kernel/unwind/user.c
@@ -0,0 +1,59 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Generic interfaces for unwinding user space
+ */
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+#include <linux/unwind_user.h>
+
+int unwind_user_next(struct unwind_user_state *state)
+{
+ struct unwind_user_frame _frame;
+ struct unwind_user_frame *frame = &_frame;
+ unsigned long cfa = 0, fp, ra = 0;
+
+ /* no implementation yet */
+ return -EINVAL;
+}
+
+int unwind_user_start(struct unwind_user_state *state)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+
+ memset(state, 0, sizeof(*state));
+
+ if (!current->mm || !user_mode(regs)) {
+ state->done = true;
+ return -EINVAL;
+ }
+
+ state->type = UNWIND_USER_TYPE_NONE;
+
+ state->ip = instruction_pointer(regs);
+ state->sp = user_stack_pointer(regs);
+ state->fp = frame_pointer(regs);
+
+ return 0;
+}
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries)
+{
+ struct unwind_user_state state;
+
+ trace->nr = 0;
+
+ if (!max_entries)
+ return -EINVAL;
+
+ if (!current->mm)
+ return 0;
+
+ for_each_user_frame(&state) {
+ trace->entries[trace->nr++] = state.ip;
+ if (trace->nr >= max_entries)
+ break;
+ }
+
+ return 0;
+}
--
2.48.1
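The iteration contract of the new API — unwind_user_start() initializes the state, unwind_user_next() advances it, and `done` terminates for_each_user_frame() — can be mocked in user space. Everything below is a hypothetical stand-in: the "frames" are a canned array rather than a real stack, and the names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * User-space mock of the unwind_user iteration contract: start()
 * initializes the state, next() advances it, and ->done ends the loop.
 */
struct mock_state {
	unsigned long ip;
	size_t idx;
	bool done;
};

static const unsigned long mock_frames[] = { 0x401000, 0x401200, 0x401400 };
#define NFRAMES (sizeof(mock_frames) / sizeof(mock_frames[0]))

static int mock_start(struct mock_state *s)
{
	s->idx = 0;
	s->ip = mock_frames[0];
	s->done = false;
	return 0;
}

static int mock_next(struct mock_state *s)
{
	if (++s->idx >= NFRAMES) {
		s->done = true;
		return -1;
	}
	s->ip = mock_frames[s->idx];
	return 0;
}

/* Same shape as for_each_user_frame(). */
#define for_each_mock_frame(s) \
	for (mock_start(s); !(s)->done; mock_next(s))
```

The loop body runs once per frame, exactly as unwind_user() collects `state.ip` into the stacktrace entries.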
* [PATCH v4 12/39] unwind_user: Add frame pointer support
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (10 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 11/39] unwind_user: Add user space unwinding API Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-24 17:59 ` Andrii Nakryiko
2025-01-22 2:31 ` [PATCH v4 13/39] unwind_user/x86: Enable frame pointer unwinding on x86 Josh Poimboeuf
` (28 subsequent siblings)
40 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
Add optional support for user space frame pointer unwinding. If
supported, the arch needs to enable CONFIG_HAVE_UNWIND_USER_FP and
define ARCH_INIT_USER_FP_FRAME.
By encoding the frame offsets in struct unwind_user_frame, much of this
code can also be reused for future unwinder implementations like sframe.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/Kconfig | 4 +++
include/asm-generic/unwind_user.h | 9 ++++++
include/linux/unwind_user_types.h | 1 +
kernel/unwind/user.c | 49 +++++++++++++++++++++++++++++--
4 files changed, 60 insertions(+), 3 deletions(-)
create mode 100644 include/asm-generic/unwind_user.h
diff --git a/arch/Kconfig b/arch/Kconfig
index c6fa2b3ecbc6..cf996cbb8142 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -438,6 +438,10 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
config UNWIND_USER
bool
+config HAVE_UNWIND_USER_FP
+ bool
+ select UNWIND_USER
+
config AS_SFRAME
def_bool $(as-instr,.cfi_sections .sframe\n.cfi_startproc\n.cfi_endproc)
diff --git a/include/asm-generic/unwind_user.h b/include/asm-generic/unwind_user.h
new file mode 100644
index 000000000000..832425502fb3
--- /dev/null
+++ b/include/asm-generic/unwind_user.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_UNWIND_USER_H
+#define _ASM_GENERIC_UNWIND_USER_H
+
+#ifndef ARCH_INIT_USER_FP_FRAME
+ #define ARCH_INIT_USER_FP_FRAME
+#endif
+
+#endif /* _ASM_GENERIC_UNWIND_USER_H */
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 6ed1b4ae74e1..65bd070eb6b0 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -6,6 +6,7 @@
enum unwind_user_type {
UNWIND_USER_TYPE_NONE,
+ UNWIND_USER_TYPE_FP,
};
struct unwind_stacktrace {
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 456539635e49..73fd4e150dfd 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -6,6 +6,18 @@
#include <linux/sched.h>
#include <linux/sched/task_stack.h>
#include <linux/unwind_user.h>
+#include <linux/uaccess.h>
+#include <asm/unwind_user.h>
+
+static struct unwind_user_frame fp_frame = {
+ ARCH_INIT_USER_FP_FRAME
+};
+
+static inline bool fp_state(struct unwind_user_state *state)
+{
+ return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP) &&
+ state->type == UNWIND_USER_TYPE_FP;
+}
int unwind_user_next(struct unwind_user_state *state)
{
@@ -13,8 +25,36 @@ int unwind_user_next(struct unwind_user_state *state)
struct unwind_user_frame *frame = &_frame;
unsigned long cfa = 0, fp, ra = 0;
- /* no implementation yet */
- return -EINVAL;
+ if (state->done)
+ return -EINVAL;
+
+ if (fp_state(state))
+ frame = &fp_frame;
+ else
+ goto the_end;
+
+ cfa = (frame->use_fp ? state->fp : state->sp) + frame->cfa_off;
+
+ /* stack going in wrong direction? */
+ if (cfa <= state->sp)
+ goto the_end;
+
+ if (get_user(ra, (unsigned long __user *)(cfa + frame->ra_off)))
+ goto the_end;
+
+ if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
+ goto the_end;
+
+ state->ip = ra;
+ state->sp = cfa;
+ if (frame->fp_off)
+ state->fp = fp;
+
+ return 0;
+
+the_end:
+ state->done = true;
+ return -EINVAL;
}
int unwind_user_start(struct unwind_user_state *state)
@@ -28,7 +68,10 @@ int unwind_user_start(struct unwind_user_state *state)
return -EINVAL;
}
- state->type = UNWIND_USER_TYPE_NONE;
+ if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+ state->type = UNWIND_USER_TYPE_FP;
+ else
+ state->type = UNWIND_USER_TYPE_NONE;
state->ip = instruction_pointer(regs);
state->sp = user_stack_pointer(regs);
--
2.48.1
* [PATCH v4 13/39] unwind_user/x86: Enable frame pointer unwinding on x86
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (11 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 12/39] unwind_user: Add frame pointer support Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 14/39] perf/x86: Rename get_segment_base() and make it global Josh Poimboeuf
` (27 subsequent siblings)
40 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
Use ARCH_INIT_USER_FP_FRAME to describe how frame pointers are unwound
on x86, and enable CONFIG_HAVE_UNWIND_USER_FP accordingly so the
unwind_user interfaces can be used.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/unwind_user.h | 11 +++++++++++
2 files changed, 12 insertions(+)
create mode 100644 arch/x86/include/asm/unwind_user.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ef6cfea9df73..f938b957a927 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -291,6 +291,7 @@ config X86
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL
select HAVE_UNSTABLE_SCHED_CLOCK
+ select HAVE_UNWIND_USER_FP if X86_64
select HAVE_USER_RETURN_NOTIFIER
select HAVE_GENERIC_VDSO
select VDSO_GETRANDOM if X86_64
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
new file mode 100644
index 000000000000..8597857bf896
--- /dev/null
+++ b/arch/x86/include/asm/unwind_user.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_UNWIND_USER_H
+#define _ASM_X86_UNWIND_USER_H
+
+#define ARCH_INIT_USER_FP_FRAME \
+ .cfa_off = (s32)sizeof(long) * 2, \
+ .ra_off = (s32)sizeof(long) * -1, \
+ .fp_off = (s32)sizeof(long) * -2, \
+ .use_fp = true,
+
+#endif /* _ASM_X86_UNWIND_USER_H */
--
2.48.1
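The offsets in ARCH_INIT_USER_FP_FRAME encode the standard x86-64 `push %rbp; mov %rsp, %rbp` prologue: CFA = fp + 16, return address at CFA - 8, caller's frame pointer at CFA - 16. A user-space sketch of the resulting walk, assuming a 64-bit build and a synthetic stack array instead of real task memory:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Walk a synthetic x86-64 frame-pointer chain using the
 * ARCH_INIT_USER_FP_FRAME offsets: CFA = fp + 2*sizeof(long),
 * RA at CFA - sizeof(long), saved FP at CFA - 2*sizeof(long).
 */
static size_t walk_fp_chain(unsigned long fp, unsigned long *ips, size_t max)
{
	size_t n = 0;

	while (fp && n < max) {
		unsigned long cfa = fp + 2 * sizeof(long);
		unsigned long ra  = *(unsigned long *)(cfa - sizeof(long));
		unsigned long nfp = *(unsigned long *)(cfa - 2 * sizeof(long));

		ips[n++] = ra;
		if (nfp <= fp)	/* the stack must grow down */
			break;
		fp = nfp;
	}
	return n;
}
```

The `nfp <= fp` check mirrors the "stack going in wrong direction?" sanity test in unwind_user_next(), applied here to the saved frame pointer.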
* [PATCH v4 14/39] perf/x86: Rename get_segment_base() and make it global
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (12 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 13/39] unwind_user/x86: Enable frame pointer unwinding on x86 Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-22 12:51 ` Peter Zijlstra
2025-01-24 20:09 ` Steven Rostedt
2025-01-22 2:31 ` [PATCH v4 15/39] unwind_user: Add compat mode frame pointer support Josh Poimboeuf
` (26 subsequent siblings)
40 siblings, 2 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
get_segment_base() will be used by the unwind_user code, so make it
global and rename it so it doesn't conflict with a KVM function of the
same name.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/events/core.c | 10 +++++-----
arch/x86/include/asm/perf_event.h | 2 ++
2 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index c75c482d4c52..23ac6343cf86 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2790,7 +2790,7 @@ valid_user_frame(const void __user *fp, unsigned long size)
return __access_ok(fp, size);
}
-static unsigned long get_segment_base(unsigned int segment)
+unsigned long segment_base_address(unsigned int segment)
{
struct desc_struct *desc;
unsigned int idx = segment >> 3;
@@ -2874,8 +2874,8 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
if (user_64bit_mode(regs))
return 0;
- cs_base = get_segment_base(regs->cs);
- ss_base = get_segment_base(regs->ss);
+ cs_base = segment_base_address(regs->cs);
+ ss_base = segment_base_address(regs->ss);
fp = compat_ptr(ss_base + regs->bp);
pagefault_disable();
@@ -2994,11 +2994,11 @@ static unsigned long code_segment_base(struct pt_regs *regs)
return 0x10 * regs->cs;
if (user_mode(regs) && regs->cs != __USER_CS)
- return get_segment_base(regs->cs);
+ return segment_base_address(regs->cs);
#else
if (user_mode(regs) && !user_64bit_mode(regs) &&
regs->cs != __USER32_CS)
- return get_segment_base(regs->cs);
+ return segment_base_address(regs->cs);
#endif
return 0;
}
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index d95f902acc52..75956c68356f 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -639,4 +639,6 @@ static __always_inline void perf_lopwr_cb(bool lopwr_in)
#define arch_perf_out_copy_user copy_from_user_nmi
+unsigned long segment_base_address(unsigned int segment);
+
#endif /* _ASM_X86_PERF_EVENT_H */
--
2.48.1
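For background on what segment_base_address() computes: a protected-mode segment base is scattered across three fields of the descriptor, and the function reassembles it (the kernel does this via get_desc_base() on a struct desc_struct). A minimal sketch of that bit layout, using a hypothetical struct that mirrors the SDM field order:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of an x86 segment descriptor's scattered base fields
 * (cf. the kernel's struct desc_struct / get_desc_base()).
 */
struct seg_desc {
	uint16_t limit0;
	uint16_t base0;		/* base bits  0..15 */
	uint8_t  base1;		/* base bits 16..23 */
	uint8_t  access;	/* type/S/DPL/P byte (unused here) */
	uint8_t  limit1_flags;
	uint8_t  base2;		/* base bits 24..31 */
};

static uint32_t seg_desc_base(const struct seg_desc *d)
{
	return (uint32_t)d->base0 |
	       ((uint32_t)d->base1 << 16) |
	       ((uint32_t)d->base2 << 24);
}
```

Reassembling 0x5678, 0x34, and 0x12 from the three fields yields the linear base 0x12345678.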
* [PATCH v4 15/39] unwind_user: Add compat mode frame pointer support
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (13 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 14/39] perf/x86: Rename get_segment_base() and make it global Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 16/39] unwind_user/x86: Enable compat mode frame pointer unwinding on x86 Josh Poimboeuf
` (25 subsequent siblings)
40 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
Add optional support for user space compat mode frame pointer unwinding.
If supported, the arch needs to enable CONFIG_HAVE_UNWIND_USER_COMPAT_FP
and define ARCH_INIT_USER_COMPAT_FP_FRAME.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/Kconfig | 4 +++
include/asm-generic/Kbuild | 2 ++
include/asm-generic/unwind_user.h | 15 +++++++++++
include/asm-generic/unwind_user_types.h | 9 +++++++
include/linux/unwind_user_types.h | 3 +++
kernel/unwind/user.c | 36 ++++++++++++++++++++++---
6 files changed, 65 insertions(+), 4 deletions(-)
create mode 100644 include/asm-generic/unwind_user_types.h
diff --git a/arch/Kconfig b/arch/Kconfig
index cf996cbb8142..f1f7a3857c97 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -442,6 +442,10 @@ config HAVE_UNWIND_USER_FP
bool
select UNWIND_USER
+config HAVE_UNWIND_USER_COMPAT_FP
+ bool
+ depends on HAVE_UNWIND_USER_FP
+
config AS_SFRAME
def_bool $(as-instr,.cfi_sections .sframe\n.cfi_startproc\n.cfi_endproc)
diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
index 1b43c3a77012..2f3e4e2d8610 100644
--- a/include/asm-generic/Kbuild
+++ b/include/asm-generic/Kbuild
@@ -58,6 +58,8 @@ mandatory-y += tlbflush.h
mandatory-y += topology.h
mandatory-y += trace_clock.h
mandatory-y += uaccess.h
+mandatory-y += unwind_user.h
+mandatory-y += unwind_user_types.h
mandatory-y += vermagic.h
mandatory-y += vga.h
mandatory-y += video.h
diff --git a/include/asm-generic/unwind_user.h b/include/asm-generic/unwind_user.h
index 832425502fb3..385638ce4aec 100644
--- a/include/asm-generic/unwind_user.h
+++ b/include/asm-generic/unwind_user.h
@@ -2,8 +2,23 @@
#ifndef _ASM_GENERIC_UNWIND_USER_H
#define _ASM_GENERIC_UNWIND_USER_H
+#include <asm/unwind_user_types.h>
+
#ifndef ARCH_INIT_USER_FP_FRAME
#define ARCH_INIT_USER_FP_FRAME
#endif
+#ifndef ARCH_INIT_USER_COMPAT_FP_FRAME
+ #define ARCH_INIT_USER_COMPAT_FP_FRAME
+ #define in_compat_mode(regs) false
+#endif
+
+#ifndef arch_unwind_user_init
+static inline void arch_unwind_user_init(struct unwind_user_state *state, struct pt_regs *regs) {}
+#endif
+
+#ifndef arch_unwind_user_next
+static inline void arch_unwind_user_next(struct unwind_user_state *state) {}
+#endif
+
#endif /* _ASM_GENERIC_UNWIND_USER_H */
diff --git a/include/asm-generic/unwind_user_types.h b/include/asm-generic/unwind_user_types.h
new file mode 100644
index 000000000000..ee803de7c998
--- /dev/null
+++ b/include/asm-generic/unwind_user_types.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_UNWIND_USER_TYPES_H
+#define _ASM_GENERIC_UNWIND_USER_TYPES_H
+
+#ifndef arch_unwind_user_state
+struct arch_unwind_user_state {};
+#endif
+
+#endif /* _ASM_GENERIC_UNWIND_USER_TYPES_H */
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 65bd070eb6b0..3ec4a097a3dd 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -3,10 +3,12 @@
#define _LINUX_UNWIND_USER_TYPES_H
#include <linux/types.h>
+#include <asm/unwind_user_types.h>
enum unwind_user_type {
UNWIND_USER_TYPE_NONE,
UNWIND_USER_TYPE_FP,
+ UNWIND_USER_TYPE_COMPAT_FP,
};
struct unwind_stacktrace {
@@ -25,6 +27,7 @@ struct unwind_user_state {
unsigned long ip;
unsigned long sp;
unsigned long fp;
+ struct arch_unwind_user_state arch;
enum unwind_user_type type;
bool done;
};
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 73fd4e150dfd..92963f129c6a 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -13,12 +13,32 @@ static struct unwind_user_frame fp_frame = {
ARCH_INIT_USER_FP_FRAME
};
+static struct unwind_user_frame compat_fp_frame = {
+ ARCH_INIT_USER_COMPAT_FP_FRAME
+};
+
static inline bool fp_state(struct unwind_user_state *state)
{
return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP) &&
state->type == UNWIND_USER_TYPE_FP;
}
+static inline bool compat_state(struct unwind_user_state *state)
+{
+ return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) &&
+ state->type == UNWIND_USER_TYPE_COMPAT_FP;
+}
+
+#define UNWIND_GET_USER_LONG(to, from, state) \
+({ \
+ int __ret; \
+ if (compat_state(state)) \
+ __ret = get_user(to, (u32 __user *)(from)); \
+ else \
+ __ret = get_user(to, (u64 __user *)(from)); \
+ __ret; \
+})
+
int unwind_user_next(struct unwind_user_state *state)
{
struct unwind_user_frame _frame;
@@ -28,7 +48,9 @@ int unwind_user_next(struct unwind_user_state *state)
if (state->done)
return -EINVAL;
- if (fp_state(state))
+ if (compat_state(state))
+ frame = &compat_fp_frame;
+ else if (fp_state(state))
frame = &fp_frame;
else
goto the_end;
@@ -39,10 +61,10 @@ int unwind_user_next(struct unwind_user_state *state)
if (cfa <= state->sp)
goto the_end;
- if (get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
+ if (UNWIND_GET_USER_LONG(ra, cfa + frame->ra_off, state))
goto the_end;
- if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
+ if (frame->fp_off && UNWIND_GET_USER_LONG(fp, cfa + frame->fp_off, state))
goto the_end;
state->ip = ra;
@@ -50,6 +72,8 @@ int unwind_user_next(struct unwind_user_state *state)
if (frame->fp_off)
state->fp = fp;
+ arch_unwind_user_next(state);
+
return 0;
the_end:
@@ -68,7 +92,9 @@ int unwind_user_start(struct unwind_user_state *state)
return -EINVAL;
}
- if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+ if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) && in_compat_mode(regs))
+ state->type = UNWIND_USER_TYPE_COMPAT_FP;
+ else if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
state->type = UNWIND_USER_TYPE_FP;
else
state->type = UNWIND_USER_TYPE_NONE;
@@ -77,6 +103,8 @@ int unwind_user_start(struct unwind_user_state *state)
state->sp = user_stack_pointer(regs);
state->fp = frame_pointer(regs);
+ arch_unwind_user_init(state, regs);
+
return 0;
}
--
2.48.1
^ permalink raw reply related [flat|nested] 161+ messages in thread
* [PATCH v4 16/39] unwind_user/x86: Enable compat mode frame pointer unwinding on x86
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (14 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 15/39] unwind_user: Add compat mode frame pointer support Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers Josh Poimboeuf
` (24 subsequent siblings)
40 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
Use ARCH_INIT_USER_COMPAT_FP_FRAME to describe how compat mode frame
pointers are unwound on x86, and implement the hooks needed to add the
segment base addresses. Enable HAVE_UNWIND_USER_COMPAT_FP if the kernel
has IA32 emulation compiled in.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/unwind_user.h | 50 ++++++++++++++++++++++++
arch/x86/include/asm/unwind_user_types.h | 17 ++++++++
3 files changed, 68 insertions(+)
create mode 100644 arch/x86/include/asm/unwind_user_types.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f938b957a927..08c44db0fefb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -291,6 +291,7 @@ config X86
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL
select HAVE_UNSTABLE_SCHED_CLOCK
+ select HAVE_UNWIND_USER_COMPAT_FP if IA32_EMULATION
select HAVE_UNWIND_USER_FP if X86_64
select HAVE_USER_RETURN_NOTIFIER
select HAVE_GENERIC_VDSO
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
index 8597857bf896..bb1148111259 100644
--- a/arch/x86/include/asm/unwind_user.h
+++ b/arch/x86/include/asm/unwind_user.h
@@ -2,10 +2,60 @@
#ifndef _ASM_X86_UNWIND_USER_H
#define _ASM_X86_UNWIND_USER_H
+#include <linux/unwind_user_types.h>
+#include <asm/ptrace.h>
+#include <asm/perf_event.h>
+
#define ARCH_INIT_USER_FP_FRAME \
.cfa_off = (s32)sizeof(long) * 2, \
.ra_off = (s32)sizeof(long) * -1, \
.fp_off = (s32)sizeof(long) * -2, \
.use_fp = true,
+#ifdef CONFIG_IA32_EMULATION
+
+#define ARCH_INIT_USER_COMPAT_FP_FRAME \
+ .cfa_off = (s32)sizeof(u32) * 2, \
+ .ra_off = (s32)sizeof(u32) * -1, \
+ .fp_off = (s32)sizeof(u32) * -2, \
+ .use_fp = true,
+
+#define in_compat_mode(regs) !user_64bit_mode(regs)
+
+static inline void arch_unwind_user_init(struct unwind_user_state *state,
+ struct pt_regs *regs)
+{
+ unsigned long cs_base, ss_base;
+
+ if (state->type != UNWIND_USER_TYPE_COMPAT_FP)
+ return;
+
+ scoped_guard(irqsave) {
+ cs_base = segment_base_address(regs->cs);
+ ss_base = segment_base_address(regs->ss);
+ }
+
+ state->arch.cs_base = cs_base;
+ state->arch.ss_base = ss_base;
+
+ state->ip += cs_base;
+ state->sp += ss_base;
+ state->fp += ss_base;
+}
+#define arch_unwind_user_init arch_unwind_user_init
+
+static inline void arch_unwind_user_next(struct unwind_user_state *state)
+{
+ if (state->type != UNWIND_USER_TYPE_COMPAT_FP)
+ return;
+
+ state->ip += state->arch.cs_base;
+ state->fp += state->arch.ss_base;
+}
+#define arch_unwind_user_next arch_unwind_user_next
+
+#endif /* CONFIG_IA32_EMULATION */
+
+#include <asm-generic/unwind_user.h>
+
#endif /* _ASM_X86_UNWIND_USER_H */
diff --git a/arch/x86/include/asm/unwind_user_types.h b/arch/x86/include/asm/unwind_user_types.h
new file mode 100644
index 000000000000..d7074dc5f0ce
--- /dev/null
+++ b/arch/x86/include/asm/unwind_user_types.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_UNWIND_USER_TYPES_H
+#define _ASM_UNWIND_USER_TYPES_H
+
+#ifdef CONFIG_IA32_EMULATION
+
+struct arch_unwind_user_state {
+ unsigned long ss_base;
+ unsigned long cs_base;
+};
+#define arch_unwind_user_state arch_unwind_user_state
+
+#endif /* CONFIG_IA32_EMULATION */
+
+#include <asm-generic/unwind_user_types.h>
+
+#endif /* _ASM_UNWIND_USER_TYPES_H */
--
2.48.1

* [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (15 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 16/39] unwind_user/x86: Enable compat mode frame pointer unwinding on x86 Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-24 18:00 ` Andrii Nakryiko
2025-01-22 2:31 ` [PATCH v4 18/39] unwind_user/sframe: Store sframe section data in per-mm maple tree Josh Poimboeuf
` (23 subsequent siblings)
40 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
In preparation for unwinding user space stacks with sframe, add basic
sframe compile infrastructure and support for reading the .sframe
section header.
sframe_add_section() reads the header and unconditionally returns an
error, so it's not very useful yet. A subsequent patch will improve
that.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/Kconfig | 3 +
include/linux/sframe.h | 36 +++++++++++
kernel/unwind/Makefile | 3 +-
kernel/unwind/sframe.c | 136 +++++++++++++++++++++++++++++++++++++++++
kernel/unwind/sframe.h | 71 +++++++++++++++++++++
5 files changed, 248 insertions(+), 1 deletion(-)
create mode 100644 include/linux/sframe.h
create mode 100644 kernel/unwind/sframe.c
create mode 100644 kernel/unwind/sframe.h
diff --git a/arch/Kconfig b/arch/Kconfig
index f1f7a3857c97..23edd0e4e16a 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -446,6 +446,9 @@ config HAVE_UNWIND_USER_COMPAT_FP
bool
depends on HAVE_UNWIND_USER_FP
+config HAVE_UNWIND_USER_SFRAME
+ bool
+
config AS_SFRAME
def_bool $(as-instr,.cfi_sections .sframe\n.cfi_startproc\n.cfi_endproc)
diff --git a/include/linux/sframe.h b/include/linux/sframe.h
new file mode 100644
index 000000000000..3bfaf21869c2
--- /dev/null
+++ b/include/linux/sframe.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SFRAME_H
+#define _LINUX_SFRAME_H
+
+#include <linux/mm_types.h>
+#include <linux/unwind_user_types.h>
+
+#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
+
+struct sframe_section {
+ unsigned long sframe_start;
+ unsigned long sframe_end;
+ unsigned long text_start;
+ unsigned long text_end;
+
+ unsigned long fdes_start;
+ unsigned long fres_start;
+ unsigned long fres_end;
+ unsigned int num_fdes;
+
+ signed char ra_off;
+ signed char fp_off;
+};
+
+extern int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
+ unsigned long text_start, unsigned long text_end);
+extern int sframe_remove_section(unsigned long sframe_addr);
+
+#else /* !CONFIG_HAVE_UNWIND_USER_SFRAME */
+
+static inline int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end, unsigned long text_start, unsigned long text_end) { return -ENOSYS; }
+static inline int sframe_remove_section(unsigned long sframe_addr) { return -ENOSYS; }
+
+#endif /* CONFIG_HAVE_UNWIND_USER_SFRAME */
+
+#endif /* _LINUX_SFRAME_H */
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
index 349ce3677526..f70380d7a6a6 100644
--- a/kernel/unwind/Makefile
+++ b/kernel/unwind/Makefile
@@ -1 +1,2 @@
- obj-$(CONFIG_UNWIND_USER) += user.o
+ obj-$(CONFIG_UNWIND_USER) += user.o
+ obj-$(CONFIG_HAVE_UNWIND_USER_SFRAME) += sframe.o
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
new file mode 100644
index 000000000000..20287f795b36
--- /dev/null
+++ b/kernel/unwind/sframe.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Userspace sframe access functions
+ */
+
+#define pr_fmt(fmt) "sframe: " fmt
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/srcu.h>
+#include <linux/uaccess.h>
+#include <linux/mm.h>
+#include <linux/string_helpers.h>
+#include <linux/sframe.h>
+#include <linux/unwind_user_types.h>
+
+#include "sframe.h"
+
+#define dbg(fmt, ...) \
+ pr_debug("%s (%d): " fmt, current->comm, current->pid, ##__VA_ARGS__)
+
+static void free_section(struct sframe_section *sec)
+{
+ kfree(sec);
+}
+
+static int sframe_read_header(struct sframe_section *sec)
+{
+ unsigned long header_end, fdes_start, fdes_end, fres_start, fres_end;
+ struct sframe_header shdr;
+ unsigned int num_fdes;
+
+ if (copy_from_user(&shdr, (void __user *)sec->sframe_start, sizeof(shdr))) {
+ dbg("header usercopy failed\n");
+ return -EFAULT;
+ }
+
+ if (shdr.preamble.magic != SFRAME_MAGIC ||
+ shdr.preamble.version != SFRAME_VERSION_2 ||
+ !(shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
+ shdr.auxhdr_len) {
+ dbg("bad/unsupported sframe header\n");
+ return -EINVAL;
+ }
+
+ if (!shdr.num_fdes || !shdr.num_fres) {
+ dbg("no fde/fre entries\n");
+ return -EINVAL;
+ }
+
+ header_end = sec->sframe_start + SFRAME_HEADER_SIZE(shdr);
+ if (header_end >= sec->sframe_end) {
+ dbg("header doesn't fit in section\n");
+ return -EINVAL;
+ }
+
+ num_fdes = shdr.num_fdes;
+ fdes_start = header_end + shdr.fdes_off;
+ fdes_end = fdes_start + (num_fdes * sizeof(struct sframe_fde));
+
+ fres_start = header_end + shdr.fres_off;
+ fres_end = fres_start + shdr.fre_len;
+
+ if (fres_start < fdes_end || fres_end > sec->sframe_end) {
+ dbg("inconsistent fde/fre offsets\n");
+ return -EINVAL;
+ }
+
+ sec->num_fdes = num_fdes;
+ sec->fdes_start = fdes_start;
+ sec->fres_start = fres_start;
+ sec->fres_end = fres_end;
+
+ sec->ra_off = shdr.cfa_fixed_ra_offset;
+ sec->fp_off = shdr.cfa_fixed_fp_offset;
+
+ return 0;
+}
+
+int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
+ unsigned long text_start, unsigned long text_end)
+{
+	struct maple_tree *sframe_mt = &current->mm->sframe_mt;
+ struct vm_area_struct *sframe_vma, *text_vma;
+ struct mm_struct *mm = current->mm;
+ struct sframe_section *sec;
+ int ret;
+
+ if (!sframe_start || !sframe_end || !text_start || !text_end) {
+ dbg("zero-length sframe/text address\n");
+ return -EINVAL;
+ }
+
+ scoped_guard(mmap_read_lock, mm) {
+ sframe_vma = vma_lookup(mm, sframe_start);
+ if (!sframe_vma || sframe_end > sframe_vma->vm_end) {
+ dbg("bad sframe address (0x%lx - 0x%lx)\n",
+ sframe_start, sframe_end);
+ return -EINVAL;
+ }
+
+ text_vma = vma_lookup(mm, text_start);
+ if (!text_vma ||
+ !(text_vma->vm_flags & VM_EXEC) ||
+ text_end > text_vma->vm_end) {
+ dbg("bad text address (0x%lx - 0x%lx)\n",
+ text_start, text_end);
+ return -EINVAL;
+ }
+ }
+
+ sec = kzalloc(sizeof(*sec), GFP_KERNEL);
+ if (!sec)
+ return -ENOMEM;
+
+ sec->sframe_start = sframe_start;
+ sec->sframe_end = sframe_end;
+ sec->text_start = text_start;
+ sec->text_end = text_end;
+
+ ret = sframe_read_header(sec);
+ if (ret)
+ goto err_free;
+
+ /* TODO nowhere to store it yet - just free it and return an error */
+ ret = -ENOSYS;
+
+err_free:
+ free_section(sec);
+ return ret;
+}
+
+int sframe_remove_section(unsigned long sframe_start)
+{
+ return -ENOSYS;
+}
diff --git a/kernel/unwind/sframe.h b/kernel/unwind/sframe.h
new file mode 100644
index 000000000000..e9bfccfaf5b4
--- /dev/null
+++ b/kernel/unwind/sframe.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * From https://www.sourceware.org/binutils/docs/sframe-spec.html
+ */
+#ifndef _SFRAME_H
+#define _SFRAME_H
+
+#include <linux/types.h>
+
+#define SFRAME_VERSION_1 1
+#define SFRAME_VERSION_2 2
+#define SFRAME_MAGIC 0xdee2
+
+#define SFRAME_F_FDE_SORTED 0x1
+#define SFRAME_F_FRAME_POINTER 0x2
+
+#define SFRAME_ABI_AARCH64_ENDIAN_BIG 1
+#define SFRAME_ABI_AARCH64_ENDIAN_LITTLE 2
+#define SFRAME_ABI_AMD64_ENDIAN_LITTLE 3
+
+#define SFRAME_FDE_TYPE_PCINC 0
+#define SFRAME_FDE_TYPE_PCMASK 1
+
+struct sframe_preamble {
+ u16 magic;
+ u8 version;
+ u8 flags;
+} __packed;
+
+struct sframe_header {
+ struct sframe_preamble preamble;
+ u8 abi_arch;
+ s8 cfa_fixed_fp_offset;
+ s8 cfa_fixed_ra_offset;
+ u8 auxhdr_len;
+ u32 num_fdes;
+ u32 num_fres;
+ u32 fre_len;
+ u32 fdes_off;
+ u32 fres_off;
+} __packed;
+
+#define SFRAME_HEADER_SIZE(header) \
+ ((sizeof(struct sframe_header) + header.auxhdr_len))
+
+#define SFRAME_AARCH64_PAUTH_KEY_A 0
+#define SFRAME_AARCH64_PAUTH_KEY_B 1
+
+struct sframe_fde {
+ s32 start_addr;
+ u32 func_size;
+ u32 fres_off;
+ u32 fres_num;
+ u8 info;
+ u8 rep_size;
+ u16 padding;
+} __packed;
+
+#define SFRAME_FUNC_FRE_TYPE(data) (data & 0xf)
+#define SFRAME_FUNC_FDE_TYPE(data) ((data >> 4) & 0x1)
+#define SFRAME_FUNC_PAUTH_KEY(data) ((data >> 5) & 0x1)
+
+#define SFRAME_BASE_REG_FP 0
+#define SFRAME_BASE_REG_SP 1
+
+#define SFRAME_FRE_CFA_BASE_REG_ID(data) (data & 0x1)
+#define SFRAME_FRE_OFFSET_COUNT(data) ((data >> 1) & 0xf)
+#define SFRAME_FRE_OFFSET_SIZE(data) ((data >> 5) & 0x3)
+#define SFRAME_FRE_MANGLED_RA_P(data) ((data >> 7) & 0x1)
+
+#endif /* _SFRAME_H */
--
2.48.1
* [PATCH v4 18/39] unwind_user/sframe: Store sframe section data in per-mm maple tree
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (16 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents Josh Poimboeuf
` (22 subsequent siblings)
40 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
Associate an sframe section with its mm by adding it to a per-mm maple
tree which is indexed by the corresponding text address range. A single
sframe section can be associated with multiple text ranges.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/include/asm/mmu.h | 2 +-
include/linux/mm_types.h | 3 +++
include/linux/sframe.h | 13 +++++++++
kernel/fork.c | 10 +++++++
kernel/unwind/sframe.c | 55 +++++++++++++++++++++++++++++++++++---
mm/init-mm.c | 2 ++
6 files changed, 81 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index ce4677b8b735..12ea831978cc 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -73,7 +73,7 @@ typedef struct {
.context = { \
.ctx_id = 1, \
.lock = __MUTEX_INITIALIZER(mm.context.lock), \
- }
+ },
void leave_mm(void);
#define leave_mm leave_mm
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 332cee285662..aba0d67fda9a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1064,6 +1064,9 @@ struct mm_struct {
#endif
} lru_gen;
#endif /* CONFIG_LRU_GEN_WALKS_MMU */
+#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
+ struct maple_tree sframe_mt;
+#endif
} __randomize_layout;
/*
diff --git a/include/linux/sframe.h b/include/linux/sframe.h
index 3bfaf21869c2..ff4b9d1dbd00 100644
--- a/include/linux/sframe.h
+++ b/include/linux/sframe.h
@@ -22,14 +22,27 @@ struct sframe_section {
signed char fp_off;
};
+#define INIT_MM_SFRAME .sframe_mt = MTREE_INIT(sframe_mt, 0),
+extern void sframe_free_mm(struct mm_struct *mm);
+
extern int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
unsigned long text_start, unsigned long text_end);
extern int sframe_remove_section(unsigned long sframe_addr);
+static inline bool current_has_sframe(void)
+{
+ struct mm_struct *mm = current->mm;
+
+ return mm && !mtree_empty(&mm->sframe_mt);
+}
+
#else /* !CONFIG_HAVE_UNWIND_USER_SFRAME */
+#define INIT_MM_SFRAME
+static inline void sframe_free_mm(struct mm_struct *mm) {}
static inline int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end, unsigned long text_start, unsigned long text_end) { return -ENOSYS; }
static inline int sframe_remove_section(unsigned long sframe_addr) { return -ENOSYS; }
+static inline bool current_has_sframe(void) { return false; }
#endif /* CONFIG_HAVE_UNWIND_USER_SFRAME */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b301180fd41..88753f8bbdd3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -105,6 +105,7 @@
#include <uapi/linux/pidfd.h>
#include <linux/pidfs.h>
#include <linux/tick.h>
+#include <linux/sframe.h>
#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -925,6 +926,7 @@ void __mmdrop(struct mm_struct *mm)
mm_pasid_drop(mm);
mm_destroy_cid(mm);
percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
+ sframe_free_mm(mm);
free_mm(mm);
}
@@ -1252,6 +1254,13 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
#endif
}
+static void mm_init_sframe(struct mm_struct *mm)
+{
+#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
+ mt_init(&mm->sframe_mt);
+#endif
+}
+
static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
struct user_namespace *user_ns)
{
@@ -1283,6 +1292,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm->pmd_huge_pte = NULL;
#endif
mm_init_uprobes_state(mm);
+ mm_init_sframe(mm);
hugetlb_count_init(mm);
if (current->mm) {
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index 20287f795b36..fa7d87ffd00a 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -122,15 +122,64 @@ int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
if (ret)
goto err_free;
- /* TODO nowhere to store it yet - just free it and return an error */
- ret = -ENOSYS;
+ ret = mtree_insert_range(sframe_mt, sec->text_start, sec->text_end, sec, GFP_KERNEL);
+ if (ret) {
+ dbg("mtree_insert_range failed: text=%lx-%lx\n",
+ sec->text_start, sec->text_end);
+ goto err_free;
+ }
+
+ return 0;
err_free:
free_section(sec);
return ret;
}
+static int __sframe_remove_section(struct mm_struct *mm,
+ struct sframe_section *sec)
+{
+ if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
+ dbg("mtree_erase failed: text=%lx\n", sec->text_start);
+ return -EINVAL;
+ }
+
+ free_section(sec);
+
+ return 0;
+}
+
int sframe_remove_section(unsigned long sframe_start)
{
- return -ENOSYS;
+ struct mm_struct *mm = current->mm;
+ struct sframe_section *sec;
+ unsigned long index = 0;
+ bool found = false;
+ int ret = 0;
+
+ mt_for_each(&mm->sframe_mt, sec, index, ULONG_MAX) {
+ if (sec->sframe_start == sframe_start) {
+ found = true;
+ ret |= __sframe_remove_section(mm, sec);
+ }
+ }
+
+ if (!found || ret)
+ return -EINVAL;
+
+ return 0;
+}
+
+void sframe_free_mm(struct mm_struct *mm)
+{
+ struct sframe_section *sec;
+ unsigned long index = 0;
+
+ if (!mm)
+ return;
+
+ mt_for_each(&mm->sframe_mt, sec, index, ULONG_MAX)
+ free_section(sec);
+
+ mtree_destroy(&mm->sframe_mt);
}
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 24c809379274..feb01fcd32f0 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -11,6 +11,7 @@
#include <linux/atomic.h>
#include <linux/user_namespace.h>
#include <linux/iommu.h>
+#include <linux/sframe.h>
#include <asm/mmu.h>
#ifndef INIT_MM_CONTEXT
@@ -45,6 +46,7 @@ struct mm_struct init_mm = {
.user_ns = &init_user_ns,
.cpu_bitmap = CPU_BITS_NONE,
INIT_MM_CONTEXT(init_mm)
+ INIT_MM_SFRAME
};
void setup_initial_init_mm(void *start_code, void *end_code,
--
2.48.1
* [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (17 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 18/39] unwind_user/sframe: Store sframe section data in per-mm maple tree Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-24 16:36 ` Jens Remus
` (3 more replies)
2025-01-22 2:31 ` [PATCH v4 20/39] unwind_user/sframe: Detect .sframe sections in executables Josh Poimboeuf
` (21 subsequent siblings)
40 siblings, 4 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
In preparation for using sframe to unwind user space stacks, add an
sframe_find() interface for finding the sframe information associated
with a given text address.
For performance, use user_read_access_begin() and the corresponding
unsafe_*() accessors. Note that use of pr_debug() in uaccess-enabled
regions would break noinstr validation, so there aren't any debug
messages yet. Those will be added in a subsequent commit.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
include/linux/sframe.h | 5 +
kernel/unwind/sframe.c | 295 ++++++++++++++++++++++++++++++++++-
kernel/unwind/sframe_debug.h | 35 +++++
3 files changed, 331 insertions(+), 4 deletions(-)
create mode 100644 kernel/unwind/sframe_debug.h
diff --git a/include/linux/sframe.h b/include/linux/sframe.h
index ff4b9d1dbd00..2e70085a1e89 100644
--- a/include/linux/sframe.h
+++ b/include/linux/sframe.h
@@ -3,11 +3,14 @@
#define _LINUX_SFRAME_H
#include <linux/mm_types.h>
+#include <linux/srcu.h>
#include <linux/unwind_user_types.h>
#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
struct sframe_section {
+ struct rcu_head rcu;
+
unsigned long sframe_start;
unsigned long sframe_end;
unsigned long text_start;
@@ -28,6 +31,7 @@ extern void sframe_free_mm(struct mm_struct *mm);
extern int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
unsigned long text_start, unsigned long text_end);
extern int sframe_remove_section(unsigned long sframe_addr);
+extern int sframe_find(unsigned long ip, struct unwind_user_frame *frame);
static inline bool current_has_sframe(void)
{
@@ -42,6 +46,7 @@ static inline bool current_has_sframe(void)
static inline void sframe_free_mm(struct mm_struct *mm) {}
static inline int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end, unsigned long text_start, unsigned long text_end) { return -ENOSYS; }
static inline int sframe_remove_section(unsigned long sframe_addr) { return -ENOSYS; }
+static inline int sframe_find(unsigned long ip, struct unwind_user_frame *frame) { return -ENOSYS; }
static inline bool current_has_sframe(void) { return false; }
#endif /* CONFIG_HAVE_UNWIND_USER_SFRAME */
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index fa7d87ffd00a..1a35615a361e 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -15,9 +15,287 @@
#include <linux/unwind_user_types.h>
#include "sframe.h"
+#include "sframe_debug.h"
-#define dbg(fmt, ...) \
- pr_debug("%s (%d): " fmt, current->comm, current->pid, ##__VA_ARGS__)
+struct sframe_fre {
+ unsigned int size;
+ s32 ip_off;
+ s32 cfa_off;
+ s32 ra_off;
+ s32 fp_off;
+ u8 info;
+};
+
+DEFINE_STATIC_SRCU(sframe_srcu);
+
+static __always_inline unsigned char fre_type_to_size(unsigned char fre_type)
+{
+ if (fre_type > 2)
+ return 0;
+ return 1 << fre_type;
+}
+
+static __always_inline unsigned char offset_size_enum_to_size(unsigned char off_size)
+{
+ if (off_size > 2)
+ return 0;
+ return 1 << off_size;
+}
+
+static __always_inline int __read_fde(struct sframe_section *sec,
+ unsigned int fde_num,
+ struct sframe_fde *fde)
+{
+ unsigned long fde_addr, ip;
+
+ fde_addr = sec->fdes_start + (fde_num * sizeof(struct sframe_fde));
+ unsafe_copy_from_user(fde, (void __user *)fde_addr,
+ sizeof(struct sframe_fde), Efault);
+
+ ip = sec->sframe_start + fde->start_addr;
+ if (ip < sec->text_start || ip > sec->text_end)
+ return -EINVAL;
+
+ return 0;
+
+Efault:
+ return -EFAULT;
+}
+
+static __always_inline int __find_fde(struct sframe_section *sec,
+ unsigned long ip,
+ struct sframe_fde *fde)
+{
+ s32 ip_off, func_off_low = S32_MIN, func_off_high = S32_MAX;
+ struct sframe_fde __user *first, *low, *high, *found = NULL;
+ int ret;
+
+ ip_off = ip - sec->sframe_start;
+
+ first = (void __user *)sec->fdes_start;
+ low = first;
+ high = first + sec->num_fdes - 1;
+
+ while (low <= high) {
+ struct sframe_fde __user *mid;
+ s32 func_off;
+
+ mid = low + ((high - low) / 2);
+
+ unsafe_get_user(func_off, (s32 __user *)mid, Efault);
+
+ if (ip_off >= func_off) {
+ if (func_off < func_off_low)
+ return -EFAULT;
+
+ func_off_low = func_off;
+
+ found = mid;
+ low = mid + 1;
+ } else {
+ if (func_off > func_off_high)
+ return -EFAULT;
+
+ func_off_high = func_off;
+
+ high = mid - 1;
+ }
+ }
+
+ if (!found)
+ return -EINVAL;
+
+ ret = __read_fde(sec, found - first, fde);
+ if (ret)
+ return ret;
+
+ /* make sure it's not in a gap */
+ if (ip_off < fde->start_addr || ip_off >= fde->start_addr + fde->func_size)
+ return -EINVAL;
+
+ return 0;
+
+Efault:
+ return -EFAULT;
+}
+
+#define __UNSAFE_GET_USER_INC(to, from, type, label) \
+({ \
+ type __to; \
+ unsafe_get_user(__to, (type __user *)from, label); \
+ from += sizeof(__to); \
+ to = (typeof(to))__to; \
+})
+
+#define UNSAFE_GET_USER_INC(to, from, size, label) \
+({ \
+ switch (size) { \
+ case 1: \
+ __UNSAFE_GET_USER_INC(to, from, u8, label); \
+ break; \
+ case 2: \
+ __UNSAFE_GET_USER_INC(to, from, u16, label); \
+ break; \
+ case 4: \
+ __UNSAFE_GET_USER_INC(to, from, u32, label); \
+ break; \
+ default: \
+ return -EFAULT; \
+ } \
+})
+
+static __always_inline int __read_fre(struct sframe_section *sec,
+ struct sframe_fde *fde,
+ unsigned long fre_addr,
+ struct sframe_fre *fre)
+{
+ unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
+ unsigned char fre_type = SFRAME_FUNC_FRE_TYPE(fde->info);
+ unsigned char offset_count, offset_size;
+ s32 ip_off, cfa_off, ra_off, fp_off;
+ unsigned long cur = fre_addr;
+ unsigned char addr_size;
+ u8 info;
+
+ addr_size = fre_type_to_size(fre_type);
+ if (!addr_size)
+ return -EFAULT;
+
+ if (fre_addr + addr_size + 1 > sec->fres_end)
+ return -EFAULT;
+
+ UNSAFE_GET_USER_INC(ip_off, cur, addr_size, Efault);
+ if (fde_type == SFRAME_FDE_TYPE_PCINC && ip_off > fde->func_size)
+ return -EFAULT;
+
+ UNSAFE_GET_USER_INC(info, cur, 1, Efault);
+ offset_count = SFRAME_FRE_OFFSET_COUNT(info);
+ offset_size = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(info));
+ if (!offset_count || !offset_size)
+ return -EFAULT;
+
+ if (cur + (offset_count * offset_size) > sec->fres_end)
+ return -EFAULT;
+
+ fre->size = addr_size + 1 + (offset_count * offset_size);
+
+ UNSAFE_GET_USER_INC(cfa_off, cur, offset_size, Efault);
+ offset_count--;
+
+ ra_off = sec->ra_off;
+ if (!ra_off) {
+ if (!offset_count--)
+ return -EFAULT;
+
+ UNSAFE_GET_USER_INC(ra_off, cur, offset_size, Efault);
+ }
+
+ fp_off = sec->fp_off;
+ if (!fp_off && offset_count) {
+ offset_count--;
+ UNSAFE_GET_USER_INC(fp_off, cur, offset_size, Efault);
+ }
+
+ if (offset_count)
+ return -EFAULT;
+
+ fre->ip_off = ip_off;
+ fre->cfa_off = cfa_off;
+ fre->ra_off = ra_off;
+ fre->fp_off = fp_off;
+ fre->info = info;
+
+ return 0;
+
+Efault:
+ return -EFAULT;
+}
+
+static __always_inline int __find_fre(struct sframe_section *sec,
+ struct sframe_fde *fde, unsigned long ip,
+ struct unwind_user_frame *frame)
+{
+ unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
+ struct sframe_fre *fre, *prev_fre = NULL;
+ struct sframe_fre fres[2];
+ unsigned long fre_addr;
+ bool which = false;
+ unsigned int i;
+ s32 ip_off;
+
+ ip_off = (s32)(ip - sec->sframe_start) - fde->start_addr;
+
+ if (fde_type == SFRAME_FDE_TYPE_PCMASK)
+ ip_off %= fde->rep_size;
+
+ fre_addr = sec->fres_start + fde->fres_off;
+
+ for (i = 0; i < fde->fres_num; i++) {
+ int ret;
+
+ /*
+ * Alternate between the two fre_addr[] entries for 'fre' and
+ * 'prev_fre'.
+ */
+ fre = which ? fres : fres + 1;
+ which = !which;
+
+ ret = __read_fre(sec, fde, fre_addr, fre);
+ if (ret)
+ return ret;
+
+ fre_addr += fre->size;
+
+ if (prev_fre && fre->ip_off <= prev_fre->ip_off)
+ return -EFAULT;
+
+ if (fre->ip_off > ip_off)
+ break;
+
+ prev_fre = fre;
+ }
+
+ if (!prev_fre)
+ return -EINVAL;
+ fre = prev_fre;
+
+ frame->cfa_off = fre->cfa_off;
+ frame->ra_off = fre->ra_off;
+ frame->fp_off = fre->fp_off;
+ frame->use_fp = SFRAME_FRE_CFA_BASE_REG_ID(fre->info) == SFRAME_BASE_REG_FP;
+
+ return 0;
+}
+
+int sframe_find(unsigned long ip, struct unwind_user_frame *frame)
+{
+ struct mm_struct *mm = current->mm;
+ struct sframe_section *sec;
+ struct sframe_fde fde;
+ int ret;
+
+ if (!mm)
+ return -EINVAL;
+
+ guard(srcu)(&sframe_srcu);
+
+ sec = mtree_load(&mm->sframe_mt, ip);
+ if (!sec)
+ return -EINVAL;
+
+ if (!user_read_access_begin((void __user *)sec->sframe_start,
+ sec->sframe_end - sec->sframe_start))
+ return -EFAULT;
+
+ ret = __find_fde(sec, ip, &fde);
+ if (ret)
+ goto end;
+
+ ret = __find_fre(sec, &fde, ip, frame);
+end:
+ user_read_access_end();
+ return ret;
+}
static void free_section(struct sframe_section *sec)
{
@@ -119,8 +397,10 @@ int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
sec->text_end = text_end;
ret = sframe_read_header(sec);
- if (ret)
+ if (ret) {
+ dbg_print_header(sec);
goto err_free;
+ }
ret = mtree_insert_range(sframe_mt, sec->text_start, sec->text_end, sec, GFP_KERNEL);
if (ret) {
@@ -136,6 +416,13 @@ int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
return ret;
}
+static void sframe_free_srcu(struct rcu_head *rcu)
+{
+ struct sframe_section *sec = container_of(rcu, struct sframe_section, rcu);
+
+ free_section(sec);
+}
+
static int __sframe_remove_section(struct mm_struct *mm,
struct sframe_section *sec)
{
@@ -144,7 +431,7 @@ static int __sframe_remove_section(struct mm_struct *mm,
return -EINVAL;
}
- free_section(sec);
+ call_srcu(&sframe_srcu, &sec->rcu, sframe_free_srcu);
return 0;
}
diff --git a/kernel/unwind/sframe_debug.h b/kernel/unwind/sframe_debug.h
new file mode 100644
index 000000000000..055c8c8fae24
--- /dev/null
+++ b/kernel/unwind/sframe_debug.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SFRAME_DEBUG_H
+#define _SFRAME_DEBUG_H
+
+#include <linux/sframe.h>
+#include "sframe.h"
+
+#ifdef CONFIG_DYNAMIC_DEBUG
+
+#define dbg(fmt, ...) \
+ pr_debug("%s (%d): " fmt, current->comm, current->pid, ##__VA_ARGS__)
+
+static __always_inline void dbg_print_header(struct sframe_section *sec)
+{
+ unsigned long fdes_end;
+
+ fdes_end = sec->fdes_start + (sec->num_fdes * sizeof(struct sframe_fde));
+
+ dbg("SEC: sframe:0x%lx-0x%lx text:0x%lx-0x%lx "
+ "fdes:0x%lx-0x%lx fres:0x%lx-0x%lx "
+ "ra_off:%d fp_off:%d\n",
+ sec->sframe_start, sec->sframe_end, sec->text_start, sec->text_end,
+ sec->fdes_start, fdes_end, sec->fres_start, sec->fres_end,
+ sec->ra_off, sec->fp_off);
+}
+
+#else /* !CONFIG_DYNAMIC_DEBUG */
+
+#define dbg(args...) no_printk(args)
+
+static inline void dbg_print_header(struct sframe_section *sec) {}
+
+#endif /* !CONFIG_DYNAMIC_DEBUG */
+
+#endif /* _SFRAME_DEBUG_H */
--
2.48.1
^ permalink raw reply related [flat|nested] 161+ messages in thread
* [PATCH v4 20/39] unwind_user/sframe: Detect .sframe sections in executables
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC
When loading an ELF executable or interpreter, automatically detect a
PT_GNU_SFRAME segment and associate its .sframe section with the mm_struct.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
fs/binfmt_elf.c | 49 +++++++++++++++++++++++++++++++++++++---
include/uapi/linux/elf.h | 1 +
2 files changed, 47 insertions(+), 3 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 106f0e8af177..90cd745e5bd6 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -47,6 +47,7 @@
#include <linux/dax.h>
#include <linux/uaccess.h>
#include <linux/rseq.h>
+#include <linux/sframe.h>
#include <asm/param.h>
#include <asm/page.h>
@@ -629,6 +630,21 @@ static inline int make_prot(u32 p_flags, struct arch_elf_state *arch_state,
return arch_elf_adjust_prot(prot, arch_state, has_interp, is_interp);
}
+static void elf_add_sframe(struct elf_phdr *text, struct elf_phdr *sframe,
+ unsigned long base_addr)
+{
+ unsigned long sframe_start, sframe_end, text_start, text_end;
+
+ sframe_start = base_addr + sframe->p_vaddr;
+ sframe_end = sframe_start + sframe->p_memsz;
+
+ text_start = base_addr + text->p_vaddr;
+ text_end = text_start + text->p_memsz;
+
+ /* Ignore return value, sframe section isn't critical */
+ sframe_add_section(sframe_start, sframe_end, text_start, text_end);
+}
+
/* This is much more generalized than the library routine read function,
so we keep this separate. Technically the library read function
is only provided so that we can read a.out libraries that have
@@ -639,7 +655,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
unsigned long no_base, struct elf_phdr *interp_elf_phdata,
struct arch_elf_state *arch_state)
{
- struct elf_phdr *eppnt;
+ struct elf_phdr *eppnt, *sframe_phdr = NULL;
unsigned long load_addr = 0;
int load_addr_set = 0;
unsigned long error = ~0UL;
@@ -665,7 +681,8 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
eppnt = interp_elf_phdata;
for (i = 0; i < interp_elf_ex->e_phnum; i++, eppnt++) {
- if (eppnt->p_type == PT_LOAD) {
+ switch (eppnt->p_type) {
+ case PT_LOAD: {
int elf_type = MAP_PRIVATE;
int elf_prot = make_prot(eppnt->p_flags, arch_state,
true, true);
@@ -704,6 +721,20 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
error = -ENOMEM;
goto out;
}
+ break;
+ }
+ case PT_GNU_SFRAME:
+ sframe_phdr = eppnt;
+ break;
+ }
+ }
+
+ if (sframe_phdr) {
+ eppnt = interp_elf_phdata;
+ for (i = 0; i < interp_elf_ex->e_phnum; i++, eppnt++) {
+ if (eppnt->p_flags & PF_X) {
+ elf_add_sframe(eppnt, sframe_phdr, load_addr);
+ }
}
}
@@ -829,7 +860,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
int first_pt_load = 1;
unsigned long error;
struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
- struct elf_phdr *elf_property_phdata = NULL;
+ struct elf_phdr *elf_property_phdata = NULL, *sframe_phdr = NULL;
unsigned long elf_brk;
int retval, i;
unsigned long elf_entry;
@@ -937,6 +968,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
executable_stack = EXSTACK_DISABLE_X;
break;
+ case PT_GNU_SFRAME:
+ sframe_phdr = elf_ppnt;
+ break;
+
case PT_LOPROC ... PT_HIPROC:
retval = arch_elf_pt_proc(elf_ex, elf_ppnt,
bprm->file, false,
@@ -1227,6 +1262,14 @@ static int load_elf_binary(struct linux_binprm *bprm)
elf_brk = k;
}
+ if (sframe_phdr) {
+ for (i = 0, elf_ppnt = elf_phdata;
+ i < elf_ex->e_phnum; i++, elf_ppnt++) {
+ if ((elf_ppnt->p_flags & PF_X))
+ elf_add_sframe(elf_ppnt, sframe_phdr, load_bias);
+ }
+ }
+
e_entry = elf_ex->e_entry + load_bias;
phdr_addr += load_bias;
elf_brk += load_bias;
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index b44069d29cec..026978cddc2e 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -39,6 +39,7 @@ typedef __s64 Elf64_Sxword;
#define PT_GNU_STACK (PT_LOOS + 0x474e551)
#define PT_GNU_RELRO (PT_LOOS + 0x474e552)
#define PT_GNU_PROPERTY (PT_LOOS + 0x474e553)
+#define PT_GNU_SFRAME (PT_LOOS + 0x474e554)
/* ARM MTE memory tag segment type */
--
2.48.1
* [PATCH v4 21/39] unwind_user/sframe: Add prctl() interface for registering .sframe sections
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC
The kernel doesn't have direct visibility to the ELF contents of shared
libraries. Add some prctl() interfaces which allow glibc to tell the
kernel where to find .sframe sections.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
include/uapi/linux/prctl.h | 5 ++++-
kernel/sys.c | 9 +++++++++
2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 5c6080680cb2..4a52e3f9ccc9 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -351,6 +351,9 @@ struct prctl_mm_map {
* configuration. All bits may be locked via this call, including
* undefined bits.
*/
-#define PR_LOCK_SHADOW_STACK_STATUS 76
+#define PR_LOCK_SHADOW_STACK_STATUS 76
+
+#define PR_ADD_SFRAME 77
+#define PR_REMOVE_SFRAME 78
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index c4c701c6f0b4..414dfd6ee9fa 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -64,6 +64,7 @@
#include <linux/rcupdate.h>
#include <linux/uidgid.h>
#include <linux/cred.h>
+#include <linux/sframe.h>
#include <linux/nospec.h>
@@ -2809,6 +2810,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
error = arch_lock_shadow_stack_status(me, arg2);
break;
+ case PR_ADD_SFRAME:
+ error = sframe_add_section(arg2, arg3, arg4, arg5);
+ break;
+ case PR_REMOVE_SFRAME:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = sframe_remove_section(arg2);
+ break;
default:
error = -EINVAL;
break;
--
2.48.1
* [PATCH v4 22/39] unwind_user/sframe: Wire up unwind_user to sframe
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC
Now that the sframe infrastructure is fully in place, make it work by
hooking it up to the unwind_user interface.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/Kconfig | 1 +
include/linux/unwind_user_types.h | 1 +
kernel/unwind/user.c | 22 +++++++++++++++++++---
3 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 23edd0e4e16a..12a3b73cbe66 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -448,6 +448,7 @@ config HAVE_UNWIND_USER_COMPAT_FP
config HAVE_UNWIND_USER_SFRAME
bool
+ select UNWIND_USER
config AS_SFRAME
def_bool $(as-instr,.cfi_sections .sframe\n.cfi_startproc\n.cfi_endproc)
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 3ec4a097a3dd..5558558948b7 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -9,6 +9,7 @@ enum unwind_user_type {
UNWIND_USER_TYPE_NONE,
UNWIND_USER_TYPE_FP,
UNWIND_USER_TYPE_COMPAT_FP,
+ UNWIND_USER_TYPE_SFRAME,
};
struct unwind_stacktrace {
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 92963f129c6a..fc0c75da81f6 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -6,6 +6,7 @@
#include <linux/sched.h>
#include <linux/sched/task_stack.h>
#include <linux/unwind_user.h>
+#include <linux/sframe.h>
#include <linux/uaccess.h>
#include <asm/unwind_user.h>
@@ -29,6 +30,12 @@ static inline bool compat_state(struct unwind_user_state *state)
state->type == UNWIND_USER_TYPE_COMPAT_FP;
}
+static inline bool sframe_state(struct unwind_user_state *state)
+{
+ return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_SFRAME) &&
+ state->type == UNWIND_USER_TYPE_SFRAME;
+}
+
#define UNWIND_GET_USER_LONG(to, from, state) \
({ \
int __ret; \
@@ -48,12 +55,19 @@ int unwind_user_next(struct unwind_user_state *state)
if (state->done)
return -EINVAL;
- if (compat_state(state))
+ if (compat_state(state)) {
frame = &compat_fp_frame;
- else if (fp_state(state))
+ } else if (sframe_state(state)) {
+ if (sframe_find(state->ip, frame)) {
+ if (!IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+ goto the_end;
+ frame = &fp_frame;
+ }
+ } else if (fp_state(state)) {
frame = &fp_frame;
- else
+ } else {
goto the_end;
+ }
cfa = (frame->use_fp ? state->fp : state->sp) + frame->cfa_off;
@@ -94,6 +108,8 @@ int unwind_user_start(struct unwind_user_state *state)
if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) && in_compat_mode(regs))
state->type = UNWIND_USER_TYPE_COMPAT_FP;
+ else if (current_has_sframe())
+ state->type = UNWIND_USER_TYPE_SFRAME;
else if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
state->type = UNWIND_USER_TYPE_FP;
else
--
2.48.1
* [PATCH v4 23/39] unwind_user/sframe/x86: Enable sframe unwinding on x86
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC
The x86 sframe 2.0 implementation works fairly well, starting with
binutils 2.41 (though some bugs are being fixed in later versions).
Enable it.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)
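Whether an installed binutils can emit SFrame data can be checked by hand with the same probe the AS_SFRAME Kconfig symbol uses; a sketch, assuming GNU `as` and `readelf` are on PATH:

```shell
# Probe the assembler the way the AS_SFRAME Kconfig symbol does:
# binutils >= 2.41 accepts .cfi_sections .sframe and emits an .sframe section.
printf '.cfi_sections .sframe\n.cfi_startproc\n.cfi_endproc\n' > /tmp/sframe-probe.s
if as -o /tmp/sframe-probe.o /tmp/sframe-probe.s 2>/dev/null &&
   readelf -S /tmp/sframe-probe.o 2>/dev/null | grep -q '\.sframe'; then
	echo "SFrame supported"
else
	echo "SFrame not supported"
fi
```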
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 08c44db0fefb..1016f8f80447 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -293,6 +293,7 @@ config X86
select HAVE_UNSTABLE_SCHED_CLOCK
select HAVE_UNWIND_USER_COMPAT_FP if IA32_EMULATION
select HAVE_UNWIND_USER_FP if X86_64
+ select HAVE_UNWIND_USER_SFRAME if X86_64
select HAVE_USER_RETURN_NOTIFIER
select HAVE_GENERIC_VDSO
select VDSO_GETRANDOM if X86_64
--
2.48.1
* [PATCH v4 24/39] unwind_user/sframe: Remove .sframe section on detected corruption
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC
To avoid repeated attempts to use a bad .sframe section, remove it
when the first sign of corruption is detected.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
kernel/unwind/sframe.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index 1a35615a361e..66b920441692 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -294,6 +294,10 @@ int sframe_find(unsigned long ip, struct unwind_user_frame *frame)
ret = __find_fre(sec, &fde, ip, frame);
end:
user_read_access_end();
+
+ if (ret == -EFAULT)
+ WARN_ON_ONCE(sframe_remove_section(sec->sframe_start));
+
return ret;
}
--
2.48.1
* [PATCH v4 25/39] unwind_user/sframe: Show file name in debug output
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC
When debugging sframe issues, the error messages aren't all that helpful
without knowing which file the corresponding .sframe section belongs to.
Prefix the debug output strings with the file name.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
include/linux/sframe.h | 4 +++-
kernel/unwind/sframe.c | 23 ++++++++++++--------
kernel/unwind/sframe_debug.h | 41 ++++++++++++++++++++++++++++++------
3 files changed, 52 insertions(+), 16 deletions(-)
diff --git a/include/linux/sframe.h b/include/linux/sframe.h
index 2e70085a1e89..18a04d574090 100644
--- a/include/linux/sframe.h
+++ b/include/linux/sframe.h
@@ -10,7 +10,9 @@
struct sframe_section {
struct rcu_head rcu;
-
+#ifdef CONFIG_DYNAMIC_DEBUG
+ const char *filename;
+#endif
unsigned long sframe_start;
unsigned long sframe_end;
unsigned long text_start;
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index 66b920441692..f463123f9afe 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -295,14 +295,17 @@ int sframe_find(unsigned long ip, struct unwind_user_frame *frame)
end:
user_read_access_end();
- if (ret == -EFAULT)
+ if (ret == -EFAULT) {
+ dbg_sec("removing bad .sframe section\n");
WARN_ON_ONCE(sframe_remove_section(sec->sframe_start));
+ }
return ret;
}
static void free_section(struct sframe_section *sec)
{
+ dbg_free(sec);
kfree(sec);
}
@@ -313,7 +316,7 @@ static int sframe_read_header(struct sframe_section *sec)
unsigned int num_fdes;
if (copy_from_user(&shdr, (void __user *)sec->sframe_start, sizeof(shdr))) {
- dbg("header usercopy failed\n");
+ dbg_sec("header usercopy failed\n");
return -EFAULT;
}
@@ -321,18 +324,18 @@ static int sframe_read_header(struct sframe_section *sec)
shdr.preamble.version != SFRAME_VERSION_2 ||
!(shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
shdr.auxhdr_len) {
- dbg("bad/unsupported sframe header\n");
+ dbg_sec("bad/unsupported sframe header\n");
return -EINVAL;
}
if (!shdr.num_fdes || !shdr.num_fres) {
- dbg("no fde/fre entries\n");
+ dbg_sec("no fde/fre entries\n");
return -EINVAL;
}
header_end = sec->sframe_start + SFRAME_HEADER_SIZE(shdr);
if (header_end >= sec->sframe_end) {
- dbg("header doesn't fit in section\n");
+ dbg_sec("header doesn't fit in section\n");
return -EINVAL;
}
@@ -344,7 +347,7 @@ static int sframe_read_header(struct sframe_section *sec)
fres_end = fres_start + shdr.fre_len;
if (fres_start < fdes_end || fres_end > sec->sframe_end) {
- dbg("inconsistent fde/fre offsets\n");
+ dbg_sec("inconsistent fde/fre offsets\n");
return -EINVAL;
}
@@ -400,6 +403,8 @@ int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
sec->text_start = text_start;
sec->text_end = text_end;
+ dbg_init(sec);
+
ret = sframe_read_header(sec);
if (ret) {
dbg_print_header(sec);
@@ -408,8 +413,8 @@ int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
ret = mtree_insert_range(sframe_mt, sec->text_start, sec->text_end, sec, GFP_KERNEL);
if (ret) {
- dbg("mtree_insert_range failed: text=%lx-%lx\n",
- sec->text_start, sec->text_end);
+ dbg_sec("mtree_insert_range failed: text=%lx-%lx\n",
+ sec->text_start, sec->text_end);
goto err_free;
}
@@ -431,7 +436,7 @@ static int __sframe_remove_section(struct mm_struct *mm,
struct sframe_section *sec)
{
if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
- dbg("mtree_erase failed: text=%lx\n", sec->text_start);
+ dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
return -EINVAL;
}
diff --git a/kernel/unwind/sframe_debug.h b/kernel/unwind/sframe_debug.h
index 055c8c8fae24..4d121cdbb760 100644
--- a/kernel/unwind/sframe_debug.h
+++ b/kernel/unwind/sframe_debug.h
@@ -10,26 +10,55 @@
#define dbg(fmt, ...) \
pr_debug("%s (%d): " fmt, current->comm, current->pid, ##__VA_ARGS__)
+#define dbg_sec(fmt, ...) \
+ dbg("%s: " fmt, sec->filename, ##__VA_ARGS__)
+
static __always_inline void dbg_print_header(struct sframe_section *sec)
{
unsigned long fdes_end;
fdes_end = sec->fdes_start + (sec->num_fdes * sizeof(struct sframe_fde));
- dbg("SEC: sframe:0x%lx-0x%lx text:0x%lx-0x%lx "
- "fdes:0x%lx-0x%lx fres:0x%lx-0x%lx "
- "ra_off:%d fp_off:%d\n",
- sec->sframe_start, sec->sframe_end, sec->text_start, sec->text_end,
- sec->fdes_start, fdes_end, sec->fres_start, sec->fres_end,
- sec->ra_off, sec->fp_off);
+ dbg_sec("SEC: sframe:0x%lx-0x%lx text:0x%lx-0x%lx "
+ "fdes:0x%lx-0x%lx fres:0x%lx-0x%lx "
+ "ra_off:%d fp_off:%d\n",
+ sec->sframe_start, sec->sframe_end, sec->text_start, sec->text_end,
+ sec->fdes_start, fdes_end, sec->fres_start, sec->fres_end,
+ sec->ra_off, sec->fp_off);
+}
+
+static inline void dbg_init(struct sframe_section *sec)
+{
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+
+ guard(mmap_read_lock)(mm);
+ vma = vma_lookup(mm, sec->sframe_start);
+ if (!vma)
+ sec->filename = kstrdup("(vma gone???)", GFP_KERNEL);
+ else if (vma->vm_file)
+ sec->filename = kstrdup_quotable_file(vma->vm_file, GFP_KERNEL);
+ else if (!vma->vm_mm)
+ sec->filename = kstrdup("(vdso)", GFP_KERNEL);
+ else
+ sec->filename = kstrdup("(anonymous)", GFP_KERNEL);
+}
+
+static inline void dbg_free(struct sframe_section *sec)
+{
+ kfree(sec->filename);
}
#else /* !CONFIG_DYNAMIC_DEBUG */
#define dbg(args...) no_printk(args)
+#define dbg_sec(args...) no_printk(args)
static inline void dbg_print_header(struct sframe_section *sec) {}
+static inline void dbg_init(struct sframe_section *sec) {}
+static inline void dbg_free(struct sframe_section *sec) {}
+
#endif /* !CONFIG_DYNAMIC_DEBUG */
#endif /* _SFRAME_DEBUG_H */
--
2.48.1
* [PATCH v4 26/39] unwind_user/sframe: Enable debugging in uaccess regions
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC
Objtool warns about calling pr_debug() from uaccess-enabled regions, and
rightfully so. Add a dbg_sec_uaccess() macro which temporarily disables
uaccess before doing the dynamic printk, and use that to add debug
messages throughout the uaccess-enabled regions.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
kernel/unwind/sframe.c | 59 ++++++++++++++++++++++++++++--------
kernel/unwind/sframe_debug.h | 31 +++++++++++++++++++
2 files changed, 77 insertions(+), 13 deletions(-)
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index f463123f9afe..a2ca26b952d3 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -53,12 +53,15 @@ static __always_inline int __read_fde(struct sframe_section *sec,
sizeof(struct sframe_fde), Efault);
ip = sec->sframe_start + fde->start_addr;
- if (ip < sec->text_start || ip > sec->text_end)
+ if (ip < sec->text_start || ip > sec->text_end) {
+ dbg_sec_uaccess("bad fde num %d\n", fde_num);
return -EINVAL;
+ }
return 0;
Efault:
+ dbg_sec_uaccess("fde %d usercopy failed\n", fde_num);
return -EFAULT;
}
@@ -85,16 +88,22 @@ static __always_inline int __find_fde(struct sframe_section *sec,
unsafe_get_user(func_off, (s32 __user *)mid, Efault);
if (ip_off >= func_off) {
- if (func_off < func_off_low)
+ if (func_off < func_off_low) {
+ dbg_sec_uaccess("fde %u not sorted\n",
+ (unsigned int)(mid - first));
return -EFAULT;
+ }
func_off_low = func_off;
found = mid;
low = mid + 1;
} else {
- if (func_off > func_off_high)
+ if (func_off > func_off_high) {
+ dbg_sec_uaccess("fde %u not sorted\n",
+ (unsigned int)(mid - first));
return -EFAULT;
+ }
func_off_high = func_off;
@@ -140,6 +149,8 @@ static __always_inline int __find_fde(struct sframe_section *sec,
__UNSAFE_GET_USER_INC(to, from, u32, label); \
break; \
default: \
+ dbg_sec_uaccess("%d: bad UNSAFE_GET_USER_INC size %u\n",\
+ __LINE__, size); \
return -EFAULT; \
} \
})
@@ -158,24 +169,34 @@ static __always_inline int __read_fre(struct sframe_section *sec,
u8 info;
addr_size = fre_type_to_size(fre_type);
- if (!addr_size)
+ if (!addr_size) {
+ dbg_sec_uaccess("bad addr_size in fde info %u\n", fde->info);
return -EFAULT;
+ }
- if (fre_addr + addr_size + 1 > sec->fres_end)
+ if (fre_addr + addr_size + 1 > sec->fres_end) {
+ dbg_sec_uaccess("fre addr+info goes past end of subsection\n");
return -EFAULT;
+ }
UNSAFE_GET_USER_INC(ip_off, cur, addr_size, Efault);
- if (fde_type == SFRAME_FDE_TYPE_PCINC && ip_off > fde->func_size)
+ if (fde_type == SFRAME_FDE_TYPE_PCINC && ip_off > fde->func_size) {
+ dbg_sec_uaccess("fre starts past end of function: ip_off=0x%x, func_size=0x%x\n",
+ ip_off, fde->func_size);
return -EFAULT;
+ }
UNSAFE_GET_USER_INC(info, cur, 1, Efault);
offset_count = SFRAME_FRE_OFFSET_COUNT(info);
offset_size = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(info));
- if (!offset_count || !offset_size)
+ if (!offset_count || !offset_size) {
+ dbg_sec_uaccess("zero offset_count or size in fre info %u\n", info);
return -EFAULT;
-
- if (cur + (offset_count * offset_size) > sec->fres_end)
+ }
+ if (cur + (offset_count * offset_size) > sec->fres_end) {
+ dbg_sec_uaccess("fre goes past end of subsection\n");
return -EFAULT;
+ }
fre->size = addr_size + 1 + (offset_count * offset_size);
@@ -184,8 +205,10 @@ static __always_inline int __read_fre(struct sframe_section *sec,
ra_off = sec->ra_off;
if (!ra_off) {
- if (!offset_count--)
+ if (!offset_count--) {
+ dbg_sec_uaccess("zero offset_count, can't find ra_off\n");
return -EFAULT;
+ }
UNSAFE_GET_USER_INC(ra_off, cur, offset_size, Efault);
}
@@ -196,8 +219,10 @@ static __always_inline int __read_fre(struct sframe_section *sec,
UNSAFE_GET_USER_INC(fp_off, cur, offset_size, Efault);
}
- if (offset_count)
+ if (offset_count) {
+ dbg_sec_uaccess("non-zero offset_count after reading fre\n");
return -EFAULT;
+ }
fre->ip_off = ip_off;
fre->cfa_off = cfa_off;
@@ -208,6 +233,7 @@ static __always_inline int __read_fre(struct sframe_section *sec,
return 0;
Efault:
+ dbg_sec_uaccess("fre usercopy failed\n");
return -EFAULT;
}
@@ -241,13 +267,20 @@ static __always_inline int __find_fre(struct sframe_section *sec,
which = !which;
ret = __read_fre(sec, fde, fre_addr, fre);
- if (ret)
+ if (ret) {
+ dbg_sec_uaccess("fde addr 0x%x: __read_fre(%u) failed\n",
+ fde->start_addr, i);
+ dbg_print_fde_uaccess(sec, fde);
return ret;
+ }
fre_addr += fre->size;
- if (prev_fre && fre->ip_off <= prev_fre->ip_off)
+ if (prev_fre && fre->ip_off <= prev_fre->ip_off) {
+ dbg_sec_uaccess("fde addr 0x%x: fre %u not sorted\n",
+ fde->start_addr, i);
return -EFAULT;
+ }
if (fre->ip_off > ip_off)
break;
diff --git a/kernel/unwind/sframe_debug.h b/kernel/unwind/sframe_debug.h
index 4d121cdbb760..3bb3c5574aee 100644
--- a/kernel/unwind/sframe_debug.h
+++ b/kernel/unwind/sframe_debug.h
@@ -13,6 +13,26 @@
#define dbg_sec(fmt, ...) \
dbg("%s: " fmt, sec->filename, ##__VA_ARGS__)
+#define __dbg_sec_descriptor(fmt, ...) \
+ __dynamic_pr_debug(&descriptor, "sframe: %s: " fmt, \
+ sec->filename, ##__VA_ARGS__)
+
+/*
+ * To avoid breaking uaccess rules, temporarily disable uaccess
+ * before calling printk.
+ */
+#define dbg_sec_uaccess(fmt, ...) \
+({ \
+ DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt); \
+ if (DYNAMIC_DEBUG_BRANCH(descriptor)) { \
+ user_read_access_end(); \
+ __dbg_sec_descriptor(fmt, ##__VA_ARGS__); \
+ BUG_ON(!user_read_access_begin( \
+ (void __user *)sec->sframe_start, \
+ sec->sframe_end - sec->sframe_start)); \
+ } \
+})
+
static __always_inline void dbg_print_header(struct sframe_section *sec)
{
unsigned long fdes_end;
@@ -27,6 +47,15 @@ static __always_inline void dbg_print_header(struct sframe_section *sec)
sec->ra_off, sec->fp_off);
}
+static __always_inline void dbg_print_fde_uaccess(struct sframe_section *sec,
+ struct sframe_fde *fde)
+{
+ dbg_sec_uaccess("FDE: start_addr:0x%x func_size:0x%x "
+ "fres_off:0x%x fres_num:%d info:%u rep_size:%u\n",
+ fde->start_addr, fde->func_size,
+ fde->fres_off, fde->fres_num, fde->info, fde->rep_size);
+}
+
static inline void dbg_init(struct sframe_section *sec)
{
struct mm_struct *mm = current->mm;
@@ -53,8 +82,10 @@ static inline void dbg_free(struct sframe_section *sec)
#define dbg(args...) no_printk(args)
#define dbg_sec(args...) no_printk(args)
+#define dbg_sec_uaccess(args...) no_printk(args)
static inline void dbg_print_header(struct sframe_section *sec) {}
+static inline void dbg_print_fde_uaccess(struct sframe_section *sec, struct sframe_fde *fde) {}
static inline void dbg_init(struct sframe_section *sec) {}
static inline void dbg_free(struct sframe_section *sec) {}
--
2.48.1
* [PATCH v4 27/39] unwind_user/sframe: Add .sframe validation option
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC
Add a debug feature that validates an entire .sframe section when the
file is first loaded rather than on demand.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
arch/Kconfig | 19 ++++++++++
kernel/unwind/sframe.c | 81 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 100 insertions(+)
diff --git a/arch/Kconfig b/arch/Kconfig
index 12a3b73cbe66..b3676605bab6 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -453,6 +453,25 @@ config HAVE_UNWIND_USER_SFRAME
config AS_SFRAME
def_bool $(as-instr,.cfi_sections .sframe\n.cfi_startproc\n.cfi_endproc)
+config SFRAME_VALIDATION
+ bool "Enable .sframe section debugging"
+ depends on HAVE_UNWIND_USER_SFRAME
+ depends on DYNAMIC_DEBUG
+ help
+ When adding an .sframe section for a task, validate the entire
+ section immediately rather than on demand.
+
+ This is a debug feature which is helpful for rooting out .sframe
+ section issues. If the .sframe section is corrupt, it will fail to
+ load immediately, with more information provided in dynamic printks.
+
+ This has a significant page cache footprint due to its reading of the
+ entire .sframe section for every loaded executable and shared
+ library. Also, it's done for all processes, even those which don't
+ get stack traced by the kernel. Not recommended for general use.
+
+ If unsure, say N.
+
config HAVE_PERF_REGS
bool
help
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index a2ca26b952d3..bba14c5fe0f5 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -336,6 +336,83 @@ int sframe_find(unsigned long ip, struct unwind_user_frame *frame)
return ret;
}
+#ifdef CONFIG_SFRAME_VALIDATION
+
+static __always_inline int __sframe_validate_section(struct sframe_section *sec)
+{
+ unsigned long prev_ip = 0;
+ unsigned int i;
+
+ for (i = 0; i < sec->num_fdes; i++) {
+ struct sframe_fre *fre, *prev_fre = NULL;
+ unsigned long ip, fre_addr;
+ struct sframe_fde fde;
+ struct sframe_fre fres[2];
+ bool which = false;
+ unsigned int j;
+ int ret;
+
+ ret = __read_fde(sec, i, &fde);
+ if (ret)
+ return ret;
+
+ ip = sec->sframe_start + fde.start_addr;
+ if (ip <= prev_ip) {
+ dbg_sec_uaccess("fde %u not sorted\n", i);
+ return -EFAULT;
+ }
+ prev_ip = ip;
+
+ fre_addr = sec->fres_start + fde.fres_off;
+ for (j = 0; j < fde.fres_num; j++) {
+ int ret;
+
+ fre = which ? fres : fres + 1;
+ which = !which;
+
+ ret = __read_fre(sec, &fde, fre_addr, fre);
+ if (ret) {
+ dbg_sec_uaccess("fde %u: __read_fre(%u) failed\n", i, j);
+ dbg_print_fde_uaccess(sec, &fde);
+ return ret;
+ }
+
+ fre_addr += fre->size;
+
+ if (prev_fre && fre->ip_off <= prev_fre->ip_off) {
+ dbg_sec_uaccess("fde %u: fre %u not sorted\n", i, j);
+ return -EFAULT;
+ }
+
+ prev_fre = fre;
+ }
+ }
+
+ return 0;
+}
+
+static int sframe_validate_section(struct sframe_section *sec)
+{
+ int ret;
+
+ if (!user_read_access_begin((void __user *)sec->sframe_start,
+ sec->sframe_end - sec->sframe_start)) {
+ dbg_sec("section usercopy failed\n");
+ return -EFAULT;
+ }
+
+ ret = __sframe_validate_section(sec);
+ user_read_access_end();
+ return ret;
+}
+
+#else /* !CONFIG_SFRAME_VALIDATION */
+
+static int sframe_validate_section(struct sframe_section *sec) { return 0; }
+
+#endif /* !CONFIG_SFRAME_VALIDATION */
+
+
static void free_section(struct sframe_section *sec)
{
dbg_free(sec);
@@ -444,6 +521,10 @@ int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
goto err_free;
}
+ ret = sframe_validate_section(sec);
+ if (ret)
+ goto err_free;
+
ret = mtree_insert_range(sframe_mt, sec->text_start, sec->text_end, sec, GFP_KERNEL);
if (ret) {
dbg_sec("mtree_insert_range failed: text=%lx-%lx\n",
--
2.48.1
^ permalink raw reply related [flat|nested] 161+ messages in thread
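As an aside, the core invariant the validator above enforces (FDEs strictly sorted by start address, so sframe_find() can binary-search them) is easy to exercise in plain user-space C. The `struct fde` type and `first_unsorted_fde()` helper below are hypothetical stand-ins for illustration, not the kernel's types:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's sframe FDE type. */
struct fde {
	unsigned long start_addr;
};

/*
 * Mirror of the validator's ordering check: FDEs must be strictly
 * sorted by start address.  Returns the index of the first
 * out-of-order entry, or -1 if the table is sorted.
 */
static int first_unsorted_fde(const struct fde *fdes, size_t n)
{
	unsigned long prev_ip = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned long ip = fdes[i].start_addr;

		if (ip <= prev_ip)
			return (int)i;
		prev_ip = ip;
	}
	return -1;
}
```

Like the kernel check, this rejects duplicates as well as inversions; in the kernel the compared value is the nonzero section base plus the FDE offset, so a zero first address cannot legitimately occur.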
* [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (26 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 27/39] unwind_user/sframe: Add .sframe validation option Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-22 13:37 ` Peter Zijlstra
` (3 more replies)
2025-01-22 2:31 ` [PATCH v4 29/39] unwind_user/deferred: Add unwind cache Josh Poimboeuf
` (12 subsequent siblings)
40 siblings, 4 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
Add an interface for scheduling task work to unwind the user space stack
before returning to user space. This solves several problems for its
callers:
- Ensure the unwind happens in task context even if the caller may be
running in NMI or interrupt context.
- Avoid duplicate unwinds, whether called multiple times by the same
caller or by different callers.
- Create a "context cookie" which allows trace post-processing to
correlate kernel unwinds/traces with the user unwind.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
include/linux/entry-common.h | 2 +
include/linux/sched.h | 5 +
include/linux/unwind_deferred.h | 46 +++++++
include/linux/unwind_deferred_types.h | 10 ++
kernel/fork.c | 4 +
kernel/unwind/Makefile | 2 +-
kernel/unwind/deferred.c | 178 ++++++++++++++++++++++++++
7 files changed, 246 insertions(+), 1 deletion(-)
create mode 100644 include/linux/unwind_deferred.h
create mode 100644 include/linux/unwind_deferred_types.h
create mode 100644 kernel/unwind/deferred.c
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index fc61d0205c97..fb2b27154fee 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -12,6 +12,7 @@
#include <linux/resume_user_mode.h>
#include <linux/tick.h>
#include <linux/kmsan.h>
+#include <linux/unwind_deferred.h>
#include <asm/entry-common.h>
@@ -111,6 +112,7 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
CT_WARN_ON(__ct_state() != CT_STATE_USER);
user_exit_irqoff();
+ unwind_enter_from_user_mode();
instrumentation_begin();
kmsan_unpoison_entry_regs(regs);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 64934e0830af..042a95f4f6e6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -46,6 +46,7 @@
#include <linux/rv.h>
#include <linux/livepatch_sched.h>
#include <linux/uidgid_types.h>
+#include <linux/unwind_deferred_types.h>
#include <asm/kmap_size.h>
/* task_struct member predeclarations (sorted alphabetically): */
@@ -1603,6 +1604,10 @@ struct task_struct {
struct user_event_mm *user_event_mm;
#endif
+#ifdef CONFIG_UNWIND_USER
+ struct unwind_task_info unwind_info;
+#endif
+
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
new file mode 100644
index 000000000000..741f409f0d1f
--- /dev/null
+++ b/include/linux/unwind_deferred.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_DEFERRED_H
+#define _LINUX_UNWIND_USER_DEFERRED_H
+
+#include <linux/task_work.h>
+#include <linux/unwind_user.h>
+#include <linux/unwind_deferred_types.h>
+
+struct unwind_work;
+
+typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stacktrace *trace, u64 cookie);
+
+struct unwind_work {
+ struct callback_head work;
+ unwind_callback_t func;
+ int pending;
+};
+
+#ifdef CONFIG_UNWIND_USER
+
+void unwind_task_init(struct task_struct *task);
+void unwind_task_free(struct task_struct *task);
+
+void unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
+int unwind_deferred_request(struct unwind_work *work, u64 *cookie);
+bool unwind_deferred_cancel(struct task_struct *task, struct unwind_work *work);
+
+static __always_inline void unwind_enter_from_user_mode(void)
+{
+ current->unwind_info.cookie = 0;
+}
+
+#else /* !CONFIG_UNWIND_USER */
+
+static inline void unwind_task_init(struct task_struct *task) {}
+static inline void unwind_task_free(struct task_struct *task) {}
+
+static inline void unwind_deferred_init(struct unwind_work *work, unwind_callback_t func) {}
+static inline int unwind_deferred_request(struct unwind_work *work, u64 *cookie) { return -ENOSYS; }
+static inline bool unwind_deferred_cancel(struct task_struct *task, struct unwind_work *work) { return false; }
+
+static inline void unwind_enter_from_user_mode(void) {}
+
+#endif /* !CONFIG_UNWIND_USER */
+
+#endif /* _LINUX_UNWIND_USER_DEFERRED_H */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
new file mode 100644
index 000000000000..9749824aea09
--- /dev/null
+++ b/include/linux/unwind_deferred_types.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+
+struct unwind_task_info {
+ unsigned long *entries;
+ u64 cookie;
+};
+
+#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 88753f8bbdd3..c9a954af72a1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -106,6 +106,7 @@
#include <linux/pidfs.h>
#include <linux/tick.h>
#include <linux/sframe.h>
+#include <linux/unwind_deferred.h>
#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -973,6 +974,7 @@ void __put_task_struct(struct task_struct *tsk)
WARN_ON(refcount_read(&tsk->usage));
WARN_ON(tsk == current);
+ unwind_task_free(tsk);
sched_ext_free(tsk);
io_uring_free(tsk);
cgroup_free(tsk);
@@ -2370,6 +2372,8 @@ __latent_entropy struct task_struct *copy_process(
p->bpf_ctx = NULL;
#endif
+ unwind_task_init(p);
+
/* Perform scheduler related setup. Assign this task to a CPU. */
retval = sched_fork(clone_flags, p);
if (retval)
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
index f70380d7a6a6..146038165865 100644
--- a/kernel/unwind/Makefile
+++ b/kernel/unwind/Makefile
@@ -1,2 +1,2 @@
- obj-$(CONFIG_UNWIND_USER) += user.o
+ obj-$(CONFIG_UNWIND_USER) += user.o deferred.o
obj-$(CONFIG_HAVE_UNWIND_USER_SFRAME) += sframe.o
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
new file mode 100644
index 000000000000..f0dbe4069247
--- /dev/null
+++ b/kernel/unwind/deferred.c
@@ -0,0 +1,178 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Deferred user space unwinding
+ */
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+#include <linux/sframe.h>
+#include <linux/slab.h>
+#include <linux/task_work.h>
+#include <linux/mm.h>
+#include <linux/unwind_deferred.h>
+
+#define UNWIND_MAX_ENTRIES 512
+
+/* entry-from-user counter */
+static DEFINE_PER_CPU(u64, unwind_ctx_ctr);
+
+/*
+ * The context cookie is a unique identifier which allows post-processing to
+ * correlate kernel trace(s) with user unwinds. The high 16 bits are the CPU
+ * id; the lower 48 bits are a per-CPU entry counter.
+ */
+static u64 ctx_to_cookie(u64 cpu, u64 ctx)
+{
+ BUILD_BUG_ON(NR_CPUS > 65535);
+ return (ctx & ((1UL << 48) - 1)) | (cpu << 48);
+}
+
+/*
+ * Read the task context cookie, first initializing it if this is the first
+ * call to get_cookie() since the most recent entry from user.
+ */
+static u64 get_cookie(struct unwind_task_info *info)
+{
+ u64 ctx_ctr;
+ u64 cookie;
+ u64 cpu;
+
+ guard(irqsave)();
+
+ cookie = info->cookie;
+ if (cookie)
+ return cookie;
+
+
+ cpu = raw_smp_processor_id();
+ ctx_ctr = __this_cpu_inc_return(unwind_ctx_ctr);
+ info->cookie = ctx_to_cookie(cpu, ctx_ctr);
+
+ return info->cookie;
+
+}
+
+static void unwind_deferred_task_work(struct callback_head *head)
+{
+ struct unwind_work *work = container_of(head, struct unwind_work, work);
+ struct unwind_task_info *info = &current->unwind_info;
+ struct unwind_stacktrace trace;
+ u64 cookie;
+
+ if (WARN_ON_ONCE(!work->pending))
+ return;
+
+ /*
+ * From here on out, the callback must always be called, even if it's
+ * just an empty trace.
+ */
+
+ cookie = get_cookie(info);
+
+ /* Check for task exit path. */
+ if (!current->mm)
+ goto do_callback;
+
+ if (!info->entries) {
+ info->entries = kmalloc(UNWIND_MAX_ENTRIES * sizeof(long),
+ GFP_KERNEL);
+ if (!info->entries)
+ goto do_callback;
+ }
+
+ trace.entries = info->entries;
+ trace.nr = 0;
+ unwind_user(&trace, UNWIND_MAX_ENTRIES);
+
+do_callback:
+ work->func(work, &trace, cookie);
+ work->pending = 0;
+}
+
+/*
+ * Schedule a user space unwind to be done in task work before exiting the
+ * kernel.
+ *
+ * The returned *cookie is a unique identifier for the current task entry
+ * context. Its value will also be passed to the callback function. It can be
+ * used to stitch kernel and user stack traces together in post-processing.
+ *
+ * It's valid to call this function multiple times for the same @work within
+ * the same task entry context. Each call will return the same cookie. If the
+ * callback is already pending, an error will be returned along with the
+ * cookie. If the callback is not pending because it has already been
+ * previously called for the same entry context, it will be called again with
+ * the same stack trace and cookie.
+ *
+ * Thus there are three possible return scenarios:
+ *
+ * * return != 0, *cookie == 0: the operation failed, no pending callback.
+ *
+ * * return != 0, *cookie != 0: the callback is already pending. The cookie
+ * can still be used to correlate with the pending callback.
+ *
+ * * return == 0, *cookie != 0: the callback queued successfully. The
+ * callback is guaranteed to be called with the given cookie.
+ */
+int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
+{
+ struct unwind_task_info *info = &current->unwind_info;
+ int ret;
+
+ *cookie = 0;
+
+ if (WARN_ON_ONCE(in_nmi()))
+ return -EINVAL;
+
+ if (!current->mm || !user_mode(task_pt_regs(current)))
+ return -EINVAL;
+
+ guard(irqsave)();
+
+ *cookie = get_cookie(info);
+
+ /* callback already pending? */
+ if (work->pending)
+ return -EEXIST;
+
+ ret = task_work_add(current, &work->work, TWA_RESUME);
+ if (WARN_ON_ONCE(ret))
+ return ret;
+
+ work->pending = 1;
+
+ return 0;
+}
+
+bool unwind_deferred_cancel(struct task_struct *task, struct unwind_work *work)
+{
+ bool ret;
+
+ ret = task_work_cancel(task, &work->work);
+ if (ret)
+ work->pending = 0;
+
+ return ret;
+}
+
+void unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
+{
+ memset(work, 0, sizeof(*work));
+
+ init_task_work(&work->work, unwind_deferred_task_work);
+ work->func = func;
+}
+
+void unwind_task_init(struct task_struct *task)
+{
+ struct unwind_task_info *info = &task->unwind_info;
+
+ memset(info, 0, sizeof(*info));
+}
+
+void unwind_task_free(struct task_struct *task)
+{
+ struct unwind_task_info *info = &task->unwind_info;
+
+ kfree(info->entries);
+}
--
2.48.1
^ permalink raw reply related [flat|nested] 161+ messages in thread
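The cookie layout described above (CPU id in the high 16 bits, per-CPU entry counter in the low 48) can be sanity-checked with a user-space sketch. `ctx_to_cookie()` mirrors the patch; the decode helpers are hypothetical additions for post-processing, not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of the patch's ctx_to_cookie(): counter in the low 48
 * bits, CPU id in the high 16 bits. */
static uint64_t ctx_to_cookie(uint64_t cpu, uint64_t ctx)
{
	return (ctx & ((1ULL << 48) - 1)) | (cpu << 48);
}

/* Hypothetical decode helpers a trace post-processor might use. */
static uint64_t cookie_to_cpu(uint64_t cookie)
{
	return cookie >> 48;
}

static uint64_t cookie_to_ctx(uint64_t cookie)
{
	return cookie & ((1ULL << 48) - 1);
}
```

The 48-bit mask means the per-CPU counter silently wraps into the low bits after 2^48 entries from user space, which the design accepts as practically unreachable.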
* [PATCH v4 29/39] unwind_user/deferred: Add unwind cache
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (27 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-22 13:57 ` Peter Zijlstra
2025-01-22 2:31 ` [PATCH v4 30/39] unwind_user/deferred: Make unwind deferral requests NMI-safe Josh Poimboeuf
` (11 subsequent siblings)
40 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
Cache the results of the unwind to ensure the unwind is only performed
once, even when called by multiple tracers.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
include/linux/unwind_deferred_types.h | 8 +++++++-
kernel/unwind/deferred.c | 26 ++++++++++++++++++++------
2 files changed, 27 insertions(+), 7 deletions(-)
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 9749824aea09..6f71a06329fb 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -2,8 +2,14 @@
#ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
-struct unwind_task_info {
+struct unwind_cache {
unsigned long *entries;
+ unsigned int nr_entries;
+ u64 cookie;
+};
+
+struct unwind_task_info {
+ struct unwind_cache cache;
u64 cookie;
};
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index f0dbe4069247..2f38055cce48 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -56,6 +56,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
{
struct unwind_work *work = container_of(head, struct unwind_work, work);
struct unwind_task_info *info = &current->unwind_info;
+ struct unwind_cache *cache = &info->cache;
struct unwind_stacktrace trace;
u64 cookie;
@@ -73,17 +74,30 @@ static void unwind_deferred_task_work(struct callback_head *head)
if (!current->mm)
goto do_callback;
- if (!info->entries) {
- info->entries = kmalloc(UNWIND_MAX_ENTRIES * sizeof(long),
- GFP_KERNEL);
- if (!info->entries)
+ if (!cache->entries) {
+ cache->entries = kmalloc(UNWIND_MAX_ENTRIES * sizeof(long),
+ GFP_KERNEL);
+ if (!cache->entries)
goto do_callback;
}
- trace.entries = info->entries;
+ trace.entries = cache->entries;
+
+ if (cookie == cache->cookie) {
+ /*
+ * The user stack has already been previously unwound in this
+ * entry context. Skip the unwind and use the cache.
+ */
+ trace.nr = cache->nr_entries;
+ goto do_callback;
+ }
+
trace.nr = 0;
unwind_user(&trace, UNWIND_MAX_ENTRIES);
+ cache->cookie = cookie;
+ cache->nr_entries = trace.nr;
+
do_callback:
work->func(work, &trace, cookie);
work->pending = 0;
@@ -174,5 +188,5 @@ void unwind_task_free(struct task_struct *task)
{
struct unwind_task_info *info = &task->unwind_info;
- kfree(info->entries);
+ kfree(info->cache.entries);
}
--
2.48.1
^ permalink raw reply related [flat|nested] 161+ messages in thread
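The caching scheme above is, at heart, memoization keyed on the entry-context cookie: the expensive unwind runs once per context, and later tracers in the same context reuse the result. A minimal user-space model (all names hypothetical; the real code also caches the trace entries themselves, and valid cookies are never zero):

```c
#include <assert.h>
#include <stdint.h>

struct cache {
	uint64_t cookie;	/* context the cached result belongs to */
	int nr_entries;		/* cached trace length */
};

static int unwind_calls;	/* how many "real" unwinds ran */

static int fake_unwind(void)
{
	unwind_calls++;
	return 42;		/* pretend trace length */
}

/* One unwind per cookie; repeat requests in the same entry
 * context reuse the cached result. */
static int get_trace(struct cache *c, uint64_t cookie)
{
	if (c->cookie == cookie)
		return c->nr_entries;

	c->nr_entries = fake_unwind();
	c->cookie = cookie;
	return c->nr_entries;
}
```

Because the cookie changes on every entry from user space, a stale cache can never be returned for a new context; it is simply overwritten on the next miss.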
* [PATCH v4 30/39] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (28 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 29/39] unwind_user/deferred: Add unwind cache Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-22 14:15 ` Peter Zijlstra
2025-01-22 14:24 ` Peter Zijlstra
2025-01-22 2:31 ` [PATCH v4 31/39] perf: Remove get_perf_callchain() 'init_nr' argument Josh Poimboeuf
` (10 subsequent siblings)
40 siblings, 2 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
Make unwind_deferred_request() NMI-safe so tracers in NMI context can
call it to get the cookie immediately rather than having to do the fragile
"schedule irq work and then call unwind_deferred_request()" dance.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
include/linux/entry-common.h | 1 +
include/linux/unwind_deferred.h | 6 ++
include/linux/unwind_deferred_types.h | 1 +
kernel/unwind/deferred.c | 106 ++++++++++++++++++++++----
4 files changed, 98 insertions(+), 16 deletions(-)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index fb2b27154fee..e9b8c145f480 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -363,6 +363,7 @@ static __always_inline void exit_to_user_mode(void)
lockdep_hardirqs_on_prepare();
instrumentation_end();
+ unwind_exit_to_user_mode();
user_enter_irqoff();
arch_exit_to_user_mode();
lockdep_hardirqs_on(CALLER_ADDR0);
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 741f409f0d1f..22269f4d2392 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -30,6 +30,11 @@ static __always_inline void unwind_enter_from_user_mode(void)
current->unwind_info.cookie = 0;
}
+static __always_inline void unwind_exit_to_user_mode(void)
+{
+ current->unwind_info.cookie = 0;
+}
+
#else /* !CONFIG_UNWIND_USER */
static inline void unwind_task_init(struct task_struct *task) {}
@@ -40,6 +45,7 @@ static inline int unwind_deferred_request(struct task_struct *task, struct unwin
static inline bool unwind_deferred_cancel(struct task_struct *task, struct unwind_work *work) { return false; }
static inline void unwind_enter_from_user_mode(void) {}
+static inline void unwind_exit_to_user_mode(void) {}
#endif /* !CONFIG_UNWIND_USER */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 6f71a06329fb..c535cca6534b 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -11,6 +11,7 @@ struct unwind_cache {
struct unwind_task_info {
struct unwind_cache cache;
u64 cookie;
+ u64 nmi_cookie;
};
#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 2f38055cce48..939c94abaa50 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -29,27 +29,49 @@ static u64 ctx_to_cookie(u64 cpu, u64 ctx)
/*
* Read the task context cookie, first initializing it if this is the first
- * call to get_cookie() since the most recent entry from user.
+ * call to get_cookie() since the most recent entry from user. This has to be
+ * done carefully to coordinate with unwind_deferred_request_nmi().
*/
static u64 get_cookie(struct unwind_task_info *info)
{
u64 ctx_ctr;
u64 cookie;
- u64 cpu;
guard(irqsave)();
- cookie = info->cookie;
+ cookie = READ_ONCE(info->cookie);
if (cookie)
return cookie;
+ ctx_ctr = __this_cpu_read(unwind_ctx_ctr);
- cpu = raw_smp_processor_id();
- ctx_ctr = __this_cpu_inc_return(unwind_ctx_ctr);
- info->cookie = ctx_to_cookie(cpu, ctx_ctr);
+ /* Read ctx_ctr before info->nmi_cookie */
+ barrier();
+
+ cookie = READ_ONCE(info->nmi_cookie);
+ if (cookie) {
+ /*
+ * This is the first call to get_cookie() since an NMI handler
+ * first wrote it to info->nmi_cookie. Sync it.
+ */
+ WRITE_ONCE(info->cookie, cookie);
+ WRITE_ONCE(info->nmi_cookie, 0);
+ return cookie;
+ }
+
+ /*
+ * Write info->cookie. It's ok to race with an NMI here. The value of
+ * the cookie is based on ctx_ctr from before the NMI could have
+ * incremented it. The result will be the same even if cookie or
+ * ctx_ctr end up getting written twice.
+ */
+ cookie = ctx_to_cookie(raw_smp_processor_id(), ctx_ctr + 1);
+ WRITE_ONCE(info->cookie, cookie);
+ WRITE_ONCE(info->nmi_cookie, 0);
+ barrier();
+ __this_cpu_write(unwind_ctx_ctr, ctx_ctr + 1);
return cookie;
-
}
static void unwind_deferred_task_work(struct callback_head *head)
@@ -100,7 +122,52 @@ static void unwind_deferred_task_work(struct callback_head *head)
do_callback:
work->func(work, &trace, cookie);
- work->pending = 0;
+ WRITE_ONCE(work->pending, 0);
+}
+
+static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *cookie)
+{
+ struct unwind_task_info *info = &current->unwind_info;
+ bool inited_cookie = false;
+ int ret;
+
+ *cookie = info->cookie;
+ if (!*cookie) {
+ /*
+ * This is the first unwind request since the most recent entry
+ * from user. Initialize the task cookie.
+ *
+ * Don't write to info->cookie directly, otherwise it may get
+ * cleared if the NMI occurred in the kernel during early entry
+ * or late exit before the task work gets to run. Instead, use
+ * info->nmi_cookie which gets synced later by get_cookie().
+ */
+ if (!info->nmi_cookie) {
+ u64 cpu = raw_smp_processor_id();
+ u64 ctx_ctr;
+
+ ctx_ctr = __this_cpu_inc_return(unwind_ctx_ctr);
+ info->nmi_cookie = ctx_to_cookie(cpu, ctx_ctr);
+
+ inited_cookie = true;
+ }
+
+ *cookie = info->nmi_cookie;
+
+ } else if (work->pending) {
+ return -EEXIST;
+ }
+
+ ret = task_work_add(current, &work->work, TWA_NMI_CURRENT);
+ if (ret) {
+ if (inited_cookie)
+ info->nmi_cookie = 0;
+ return ret;
+ }
+
+ work->pending = 1;
+
+ return 0;
}
/*
@@ -131,29 +198,36 @@ static void unwind_deferred_task_work(struct callback_head *head)
int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
{
struct unwind_task_info *info = &current->unwind_info;
+ int pending;
int ret;
*cookie = 0;
- if (WARN_ON_ONCE(in_nmi()))
- return -EINVAL;
-
if (!current->mm || !user_mode(task_pt_regs(current)))
return -EINVAL;
+ if (in_nmi())
+ return unwind_deferred_request_nmi(work, cookie);
+
guard(irqsave)();
*cookie = get_cookie(info);
/* callback already pending? */
- if (work->pending)
+ pending = READ_ONCE(work->pending);
+ if (pending)
return -EEXIST;
- ret = task_work_add(current, &work->work, TWA_RESUME);
- if (WARN_ON_ONCE(ret))
- return ret;
+ /* Claim the work unless an NMI just now swooped in to do so. */
+ if (!try_cmpxchg(&work->pending, &pending, 1))
+ return -EEXIST;
- work->pending = 1;
+ /* The work has been claimed, now schedule it. */
+ ret = task_work_add(current, &work->work, TWA_RESUME);
+ if (WARN_ON_ONCE(ret)) {
+ WRITE_ONCE(work->pending, 0);
+ return ret;
+ }
return 0;
}
--
2.48.1
^ permalink raw reply related [flat|nested] 161+ messages in thread
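The race that the try_cmpxchg() above closes can be modeled with C11 atomics: only one requester may win the 0 to 1 transition on the pending flag; a concurrent claimant (an NMI, in the patch) loses the compare-exchange and sees "already pending". A user-space sketch under those assumptions (hypothetical names; the kernel returns -EEXIST and clears pending from the task-work callback):

```c
#include <assert.h>
#include <stdatomic.h>

/* Claim a work item: succeed only on the 0 -> 1 transition. */
static int claim(atomic_int *pending)
{
	int expected = 0;

	if (atomic_load(pending))
		return -1;	/* callback already pending */

	/* A concurrent claimant may have won since the load;
	 * the compare-exchange decides. */
	if (!atomic_compare_exchange_strong(pending, &expected, 1))
		return -1;	/* lost the race */

	return 0;
}

/* Completion path: hand the work back for reuse. */
static void complete(atomic_int *pending)
{
	atomic_store(pending, 0);
}
```

The plain-load fast path is just an optimization; correctness rests entirely on the single atomic compare-exchange.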
* [PATCH v4 31/39] perf: Remove get_perf_callchain() 'init_nr' argument
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (29 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 30/39] unwind_user/deferred: Make unwind deferral requests NMI-safe Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 32/39] perf: Remove get_perf_callchain() 'crosstask' argument Josh Poimboeuf
` (9 subsequent siblings)
40 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Namhyung Kim
The 'init_nr' argument has double duty: it's used to initialize both the
number of contexts and the number of stack entries. That's confusing
and the callers always pass zero anyway. Hard code the zero.
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
include/linux/perf_event.h | 2 +-
kernel/bpf/stackmap.c | 4 ++--
kernel/events/callchain.c | 12 ++++++------
kernel/events/core.c | 2 +-
4 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index cb99ec8c9e96..4c8ff7258c6a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1589,7 +1589,7 @@ DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
extern struct perf_callchain_entry *
-get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
+get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
u32 max_stack, bool crosstask, bool add_mark);
extern int get_callchain_buffers(int max_stack);
extern void put_callchain_buffers(void);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 3615c06b7dfa..ec3a57a5fba1 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -314,7 +314,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
if (max_depth > sysctl_perf_event_max_stack)
max_depth = sysctl_perf_event_max_stack;
- trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
+ trace = get_perf_callchain(regs, kernel, user, max_depth,
false, false);
if (unlikely(!trace))
@@ -451,7 +451,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
else if (kernel && task)
trace = get_callchain_entry_for_task(task, max_depth);
else
- trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
+ trace = get_perf_callchain(regs, kernel, user, max_depth,
crosstask, false);
if (unlikely(!trace) || trace->nr < skip) {
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 8a47e52a454f..83834203e144 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -216,7 +216,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
}
struct perf_callchain_entry *
-get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
+get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
u32 max_stack, bool crosstask, bool add_mark)
{
struct perf_callchain_entry *entry;
@@ -227,11 +227,11 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
if (!entry)
return NULL;
- ctx.entry = entry;
- ctx.max_stack = max_stack;
- ctx.nr = entry->nr = init_nr;
- ctx.contexts = 0;
- ctx.contexts_maxed = false;
+ ctx.entry = entry;
+ ctx.max_stack = max_stack;
+ ctx.nr = entry->nr = 0;
+ ctx.contexts = 0;
+ ctx.contexts_maxed = false;
if (kernel && !user_mode(regs)) {
if (add_mark)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 065f9188b44a..ebe457bacf96 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7801,7 +7801,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
if (!kernel && !user)
return &__empty_callchain;
- callchain = get_perf_callchain(regs, 0, kernel, user,
+ callchain = get_perf_callchain(regs, kernel, user,
max_stack, crosstask, true);
return callchain ?: &__empty_callchain;
}
--
2.48.1
^ permalink raw reply related [flat|nested] 161+ messages in thread
* [PATCH v4 32/39] perf: Remove get_perf_callchain() 'crosstask' argument
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (30 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 31/39] perf: Remove get_perf_callchain() 'init_nr' argument Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-24 18:13 ` Andrii Nakryiko
2025-01-22 2:31 ` [PATCH v4 33/39] perf: Simplify get_perf_callchain() user logic Josh Poimboeuf
` (8 subsequent siblings)
40 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
get_perf_callchain() doesn't support cross-task unwinding, so it doesn't
make much sense to have 'crosstask' as an argument.
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
include/linux/perf_event.h | 2 +-
kernel/bpf/stackmap.c | 12 ++++--------
kernel/events/callchain.c | 6 +-----
kernel/events/core.c | 9 +++++----
4 files changed, 11 insertions(+), 18 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 4c8ff7258c6a..1563dc2cd979 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1590,7 +1590,7 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
extern struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
- u32 max_stack, bool crosstask, bool add_mark);
+ u32 max_stack, bool add_mark);
extern int get_callchain_buffers(int max_stack);
extern void put_callchain_buffers(void);
extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index ec3a57a5fba1..ee9701337912 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -314,8 +314,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
if (max_depth > sysctl_perf_event_max_stack)
max_depth = sysctl_perf_event_max_stack;
- trace = get_perf_callchain(regs, kernel, user, max_depth,
- false, false);
+ trace = get_perf_callchain(regs, kernel, user, max_depth, false);
if (unlikely(!trace))
/* couldn't fetch the stack trace */
@@ -430,10 +429,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
if (task && user && !user_mode(regs))
goto err_fault;
- /* get_perf_callchain does not support crosstask user stack walking
- * but returns an empty stack instead of NULL.
- */
- if (crosstask && user) {
+ /* get_perf_callchain() does not support crosstask stack walking */
+ if (crosstask) {
err = -EOPNOTSUPP;
goto clear;
}
@@ -451,8 +448,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
else if (kernel && task)
trace = get_callchain_entry_for_task(task, max_depth);
else
- trace = get_perf_callchain(regs, kernel, user, max_depth,
- crosstask, false);
+ trace = get_perf_callchain(regs, kernel, user, max_depth, false);
if (unlikely(!trace) || trace->nr < skip) {
if (may_fault)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 83834203e144..655fb25a725b 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -217,7 +217,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
- u32 max_stack, bool crosstask, bool add_mark)
+ u32 max_stack, bool add_mark)
{
struct perf_callchain_entry *entry;
struct perf_callchain_entry_ctx ctx;
@@ -248,9 +248,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
}
if (regs) {
- if (crosstask)
- goto exit_put;
-
if (add_mark)
perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
@@ -260,7 +257,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
}
}
-exit_put:
put_callchain_entry(rctx);
return entry;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ebe457bacf96..99f0f28feeb5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7793,16 +7793,17 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
{
bool kernel = !event->attr.exclude_callchain_kernel;
bool user = !event->attr.exclude_callchain_user;
- /* Disallow cross-task user callchains. */
- bool crosstask = event->ctx->task && event->ctx->task != current;
const u32 max_stack = event->attr.sample_max_stack;
struct perf_callchain_entry *callchain;
if (!kernel && !user)
return &__empty_callchain;
- callchain = get_perf_callchain(regs, kernel, user,
- max_stack, crosstask, true);
+ /* Disallow cross-task callchains. */
+ if (event->ctx->task && event->ctx->task != current)
+ return &__empty_callchain;
+
+ callchain = get_perf_callchain(regs, kernel, user, max_stack, true);
return callchain ?: &__empty_callchain;
}
--
2.48.1
^ permalink raw reply related [flat|nested] 161+ messages in thread
* [PATCH v4 33/39] perf: Simplify get_perf_callchain() user logic
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (31 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 32/39] perf: Remove get_perf_callchain() 'crosstask' argument Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 34/39] perf: Skip user unwind if !current->mm Josh Poimboeuf
` (7 subsequent siblings)
40 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
Simplify the get_perf_callchain() user logic a bit. task_pt_regs()
should never be NULL.
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
kernel/events/callchain.c | 20 +++++++++-----------
1 file changed, 9 insertions(+), 11 deletions(-)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 655fb25a725b..2278402b7ac9 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -241,22 +241,20 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
if (user) {
if (!user_mode(regs)) {
- if (current->mm)
- regs = task_pt_regs(current);
- else
- regs = NULL;
+ if (!current->mm)
+ goto exit_put;
+ regs = task_pt_regs(current);
}
- if (regs) {
- if (add_mark)
- perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
+ if (add_mark)
+ perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
- start_entry_idx = entry->nr;
- perf_callchain_user(&ctx, regs);
- fixup_uretprobe_trampoline_entries(entry, start_entry_idx);
- }
+ start_entry_idx = entry->nr;
+ perf_callchain_user(&ctx, regs);
+ fixup_uretprobe_trampoline_entries(entry, start_entry_idx);
}
+exit_put:
put_callchain_entry(rctx);
return entry;
--
2.48.1
^ permalink raw reply related [flat|nested] 161+ messages in thread
* [PATCH v4 34/39] perf: Skip user unwind if !current->mm
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (32 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 33/39] perf: Simplify get_perf_callchain() user logic Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-22 14:29 ` Peter Zijlstra
2025-01-22 2:31 ` [PATCH v4 35/39] perf: Support deferred user callchains Josh Poimboeuf
` (6 subsequent siblings)
40 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
If the task doesn't have any memory, there's no stack to unwind.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
kernel/events/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 99f0f28feeb5..a886bb83f4d0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7792,7 +7792,7 @@ struct perf_callchain_entry *
perf_callchain(struct perf_event *event, struct pt_regs *regs)
{
bool kernel = !event->attr.exclude_callchain_kernel;
- bool user = !event->attr.exclude_callchain_user;
+ bool user = !event->attr.exclude_callchain_user && current->mm;
const u32 max_stack = event->attr.sample_max_stack;
struct perf_callchain_entry *callchain;
--
2.48.1
^ permalink raw reply related [flat|nested] 161+ messages in thread
* [PATCH v4 35/39] perf: Support deferred user callchains
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (33 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 34/39] perf: Skip user unwind if !current->mm Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 36/39] perf tools: Minimal CALLCHAIN_DEFERRED support Josh Poimboeuf
` (5 subsequent siblings)
40 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
Use the new unwind_deferred_request() interface (if available) to defer
unwinds to task context. This allows the use of .sframe (if available)
and also prevents duplicate userspace unwinds.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
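[Not part of the original mail; a userspace sketch of the PERF_RECORD_CALLCHAIN_DEFERRED wire layout documented in the uapi comment below: a perf_event_header, then nr, then nr entries where the first is the PERF_CONTEXT_USER marker, matching what perf_event_callchain_deferred() emits. The struct and helper names are illustrative; only the byte layout mirrors the patch.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PERF_CONTEXT_USER ((uint64_t)-512)

/* same layout as struct perf_event_header */
struct hdr { uint32_t type; uint16_t misc; uint16_t size; };

/* serialize one deferred-callchain record into buf, return bytes written */
static size_t emit_deferred(uint8_t *buf, const uint64_t *ips, uint64_t n)
{
	uint64_t nr = n + 1;			/* +1 for the context marker */
	uint64_t ctx = PERF_CONTEXT_USER;
	size_t off = 0;
	struct hdr h;

	h.type = 22;				/* PERF_RECORD_CALLCHAIN_DEFERRED */
	h.misc = 0x2;				/* PERF_RECORD_MISC_USER */
	h.size = sizeof(h) + sizeof(nr) + nr * sizeof(uint64_t);

	memcpy(buf + off, &h, sizeof(h));	  off += sizeof(h);
	memcpy(buf + off, &nr, sizeof(nr));	  off += sizeof(nr);
	memcpy(buf + off, &ctx, sizeof(ctx));	  off += sizeof(ctx);
	memcpy(buf + off, ips, n * sizeof(*ips)); off += n * sizeof(*ips);
	return off;
}
```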
---
arch/Kconfig | 3 +
include/linux/perf_event.h | 13 +++-
include/uapi/linux/perf_event.h | 19 ++++-
kernel/bpf/stackmap.c | 6 +-
kernel/events/callchain.c | 11 ++-
kernel/events/core.c | 103 +++++++++++++++++++++++++-
tools/include/uapi/linux/perf_event.h | 19 ++++-
7 files changed, 166 insertions(+), 8 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index b3676605bab6..83ab94af46ca 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -472,6 +472,9 @@ config SFRAME_VALIDATION
If unsure, say N.
+config HAVE_PERF_CALLCHAIN_DEFERRED
+ bool
+
config HAVE_PERF_REGS
bool
help
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1563dc2cd979..7fd54e4d2084 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -62,6 +62,7 @@ struct perf_guest_info_callbacks {
#include <linux/security.h>
#include <linux/static_call.h>
#include <linux/lockdep.h>
+#include <linux/unwind_deferred.h>
#include <asm/local.h>
struct perf_callchain_entry {
@@ -833,6 +834,10 @@ struct perf_event {
unsigned int pending_work;
struct rcuwait pending_work_wait;
+ struct unwind_work pending_unwind_work;
+ struct rcuwait pending_unwind_wait;
+ unsigned int pending_unwind_callback;
+
atomic_t event_limit;
/* address range filters */
@@ -1590,12 +1595,18 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
extern struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
- u32 max_stack, bool add_mark);
+ u32 max_stack, bool add_mark, bool defer_user);
extern int get_callchain_buffers(int max_stack);
extern void put_callchain_buffers(void);
extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
extern void put_callchain_entry(int rctx);
+#ifdef CONFIG_HAVE_PERF_CALLCHAIN_DEFERRED
+extern void perf_callchain_user_deferred(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
+#else
+static inline void perf_callchain_user_deferred(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs) {}
+#endif
+
extern int sysctl_perf_event_max_stack;
extern int sysctl_perf_event_max_contexts_per_stack;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 0524d541d4e3..16307be57de9 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -460,7 +460,8 @@ struct perf_event_attr {
inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
remove_on_exec : 1, /* event is removed from task on exec */
sigtrap : 1, /* send synchronous SIGTRAP on event */
- __reserved_1 : 26;
+ defer_callchain: 1, /* generate PERF_RECORD_CALLCHAIN_DEFERRED records */
+ __reserved_1 : 25;
union {
__u32 wakeup_events; /* wakeup every n events */
@@ -1226,6 +1227,21 @@ enum perf_event_type {
*/
PERF_RECORD_AUX_OUTPUT_HW_ID = 21,
+ /*
+ * This user callchain capture was deferred until shortly before
+ * returning to user space. Previous samples would have kernel
+ * callchains only and they need to be stitched with this to make full
+ * callchains.
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u64 nr;
+ * u64 ips[nr];
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_CALLCHAIN_DEFERRED = 22,
+
PERF_RECORD_MAX, /* non-ABI */
};
@@ -1256,6 +1272,7 @@ enum perf_callchain_context {
PERF_CONTEXT_HV = (__u64)-32,
PERF_CONTEXT_KERNEL = (__u64)-128,
PERF_CONTEXT_USER = (__u64)-512,
+ PERF_CONTEXT_USER_DEFERRED = (__u64)-640,
PERF_CONTEXT_GUEST = (__u64)-2048,
PERF_CONTEXT_GUEST_KERNEL = (__u64)-2176,
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index ee9701337912..f073ebaf9c30 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -314,8 +314,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
if (max_depth > sysctl_perf_event_max_stack)
max_depth = sysctl_perf_event_max_stack;
- trace = get_perf_callchain(regs, kernel, user, max_depth, false);
-
+ trace = get_perf_callchain(regs, kernel, user, max_depth, false, false);
if (unlikely(!trace))
/* couldn't fetch the stack trace */
return -EFAULT;
@@ -448,7 +447,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
else if (kernel && task)
trace = get_callchain_entry_for_task(task, max_depth);
else
- trace = get_perf_callchain(regs, kernel, user, max_depth, false);
+ trace = get_perf_callchain(regs, kernel, user, max_depth,
+ false, false);
if (unlikely(!trace) || trace->nr < skip) {
if (may_fault)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 2278402b7ac9..eeb15ba0137f 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -217,7 +217,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
- u32 max_stack, bool add_mark)
+ u32 max_stack, bool add_mark, bool defer_user)
{
struct perf_callchain_entry *entry;
struct perf_callchain_entry_ctx ctx;
@@ -246,6 +246,15 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
regs = task_pt_regs(current);
}
+ if (defer_user) {
+ /*
+ * Foretell the coming of PERF_RECORD_CALLCHAIN_DEFERRED
+ * which can be stitched to this one.
+ */
+ perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
+ goto exit_put;
+ }
+
if (add_mark)
perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a886bb83f4d0..32603bbd797d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -55,6 +55,7 @@
#include <linux/pgtable.h>
#include <linux/buildid.h>
#include <linux/task_work.h>
+#include <linux/unwind_deferred.h>
#include "internal.h"
@@ -5312,11 +5313,37 @@ static void perf_pending_task_sync(struct perf_event *event)
rcuwait_wait_event(&event->pending_work_wait, !event->pending_work, TASK_UNINTERRUPTIBLE);
}
+static void perf_pending_unwind_sync(struct perf_event *event)
+{
+ might_sleep();
+
+ if (!event->pending_unwind_callback)
+ return;
+
+ /*
+ * If the task is queued to the current task's queue, we
+ * obviously can't wait for it to complete. Simply cancel it.
+ */
+ if (unwind_deferred_cancel(current, &event->pending_unwind_work)) {
+ event->pending_unwind_callback = 0;
+ local_dec(&event->ctx->nr_no_switch_fast);
+ return;
+ }
+
+ /*
+ * All accesses related to the event are within the same RCU section in
+ * perf_event_callchain_deferred(). The RCU grace period before the
+ * event is freed will make sure all those accesses are complete by then.
+ */
+ rcuwait_wait_event(&event->pending_unwind_wait, !event->pending_unwind_callback, TASK_UNINTERRUPTIBLE);
+}
+
static void _free_event(struct perf_event *event)
{
irq_work_sync(&event->pending_irq);
irq_work_sync(&event->pending_disable_irq);
perf_pending_task_sync(event);
+ perf_pending_unwind_sync(event);
unaccount_event(event);
@@ -6933,6 +6960,61 @@ static void perf_pending_irq(struct irq_work *entry)
perf_swevent_put_recursion_context(rctx);
}
+
+struct perf_callchain_deferred_event {
+ struct perf_event_header header;
+ u64 nr;
+ u64 ips[];
+};
+
+static void perf_event_callchain_deferred(struct unwind_work *work, struct unwind_stacktrace *trace, u64 cookie)
+{
+ struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
+ struct perf_callchain_deferred_event deferred_event;
+ u64 callchain_context = PERF_CONTEXT_USER;
+ struct perf_output_handle handle;
+ struct perf_sample_data data;
+ u64 nr = trace->nr + 1; /* +1 == callchain_context */
+
+ if (WARN_ON_ONCE(!event->pending_unwind_callback))
+ return;
+
+ /*
+ * All accesses to the event must belong to the same implicit RCU
+ * read-side critical section as the ->pending_unwind_callback reset.
+ * See comment in perf_pending_unwind_sync().
+ */
+ rcu_read_lock();
+
+ if (!current->mm)
+ goto out;
+
+ deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
+ deferred_event.header.misc = PERF_RECORD_MISC_USER;
+ deferred_event.header.size = sizeof(deferred_event) + (nr * sizeof(u64));
+
+ deferred_event.nr = nr;
+
+ perf_event_header__init_id(&deferred_event.header, &data, event);
+
+ if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
+ goto out;
+
+ perf_output_put(&handle, deferred_event);
+ perf_output_put(&handle, callchain_context);
+ perf_output_copy(&handle, trace->entries, trace->nr * sizeof(u64));
+ perf_event__output_id_sample(event, &handle, &data);
+
+ perf_output_end(&handle);
+
+out:
+ event->pending_unwind_callback = 0;
+ local_dec(&event->ctx->nr_no_switch_fast);
+ rcuwait_wake_up(&event->pending_unwind_wait);
+
+ rcu_read_unlock();
+}
+
static void perf_pending_task(struct callback_head *head)
{
struct perf_event *event = container_of(head, struct perf_event, pending_task);
@@ -7795,6 +7877,8 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
bool user = !event->attr.exclude_callchain_user && current->mm;
const u32 max_stack = event->attr.sample_max_stack;
struct perf_callchain_entry *callchain;
+ bool defer_user = IS_ENABLED(CONFIG_UNWIND_USER) && user &&
+ event->attr.defer_callchain;
if (!kernel && !user)
return &__empty_callchain;
@@ -7803,7 +7887,21 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
if (event->ctx->task && event->ctx->task != current)
return &__empty_callchain;
- callchain = get_perf_callchain(regs, kernel, user, max_stack, true);
+ if (defer_user && !event->pending_unwind_callback) {
+ u64 cookie;
+
+ if (!unwind_deferred_request(&event->pending_unwind_work, &cookie)) {
+ event->pending_unwind_callback = 1;
+ local_inc(&event->ctx->nr_no_switch_fast);
+ }
+
+ if (!cookie)
+ defer_user = false;
+ }
+
+ callchain = get_perf_callchain(regs, kernel, user, max_stack, true,
+ defer_user);
+
return callchain ?: &__empty_callchain;
}
@@ -12225,6 +12323,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
init_task_work(&event->pending_task, perf_pending_task);
rcuwait_init(&event->pending_work_wait);
+ unwind_deferred_init(&event->pending_unwind_work, perf_event_callchain_deferred);
+ rcuwait_init(&event->pending_unwind_wait);
+
mutex_init(&event->mmap_mutex);
raw_spin_lock_init(&event->addr_filters.lock);
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 0524d541d4e3..16307be57de9 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -460,7 +460,8 @@ struct perf_event_attr {
inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
remove_on_exec : 1, /* event is removed from task on exec */
sigtrap : 1, /* send synchronous SIGTRAP on event */
- __reserved_1 : 26;
+ defer_callchain: 1, /* generate PERF_RECORD_CALLCHAIN_DEFERRED records */
+ __reserved_1 : 25;
union {
__u32 wakeup_events; /* wakeup every n events */
@@ -1226,6 +1227,21 @@ enum perf_event_type {
*/
PERF_RECORD_AUX_OUTPUT_HW_ID = 21,
+ /*
+ * This user callchain capture was deferred until shortly before
+ * returning to user space. Previous samples would have kernel
+ * callchains only and they need to be stitched with this to make full
+ * callchains.
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u64 nr;
+ * u64 ips[nr];
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_CALLCHAIN_DEFERRED = 22,
+
PERF_RECORD_MAX, /* non-ABI */
};
@@ -1256,6 +1272,7 @@ enum perf_callchain_context {
PERF_CONTEXT_HV = (__u64)-32,
PERF_CONTEXT_KERNEL = (__u64)-128,
PERF_CONTEXT_USER = (__u64)-512,
+ PERF_CONTEXT_USER_DEFERRED = (__u64)-640,
PERF_CONTEXT_GUEST = (__u64)-2048,
PERF_CONTEXT_GUEST_KERNEL = (__u64)-2176,
--
2.48.1
^ permalink raw reply related [flat|nested] 161+ messages in thread
* [PATCH v4 36/39] perf tools: Minimal CALLCHAIN_DEFERRED support
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (34 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 35/39] perf: Support deferred user callchains Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 37/39] perf record: Enable defer_callchain for user callchains Josh Poimboeuf
` (4 subsequent siblings)
40 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
From: Namhyung Kim <namhyung@kernel.org>
Add a new event type for deferred callchains and a new callback for the
struct perf_tool. For now it doesn't actually handle the deferred
callchains; it just marks the sample if it has PERF_CONTEXT_USER_DEFERRED
in the callchain array.
With this change, perf report can at least dump the raw data. Enabling
attr.defer_callchain requires the next commit, but if you already have a
data file, it'll show the following result.
$ perf report -D
...
0x5fe0@perf.data [0x40]: event: 22
.
. ... raw event: size 64 bytes
. 0000: 16 00 00 00 02 00 40 00 02 00 00 00 00 00 00 00 ......@.........
. 0010: 00 fe ff ff ff ff ff ff 4b d3 3f 25 45 7f 00 00 ........K.?%E...
. 0020: 21 03 00 00 21 03 00 00 43 02 12 ab 05 00 00 00 !...!...C.......
. 0030: 00 00 00 00 00 00 00 00 09 00 00 00 00 00 00 00 ................
0 24344920643 0x5fe0 [0x40]: PERF_RECORD_CALLCHAIN_DEFERRED(IP, 0x2): 801/801: 0
... FP chain: nr:2
..... 0: fffffffffffffe00
..... 1: 00007f45253fd34b
: unhandled!
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
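[Not part of the original mail; a small userspace sketch of the check this patch adds to evsel__parse_sample(): a sample's callchain is flagged as deferred when its last entry is the PERF_CONTEXT_USER_DEFERRED marker left by the kernel. The helper name is illustrative.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PERF_CONTEXT_KERNEL		((uint64_t)-128)
#define PERF_CONTEXT_USER_DEFERRED	((uint64_t)-640)

/* mirrors the data->deferred_callchain condition in evsel__parse_sample() */
static bool callchain_is_deferred(const uint64_t *ips, uint64_t nr)
{
	return nr >= 1 && ips[nr - 1] == PERF_CONTEXT_USER_DEFERRED;
}
```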
---
tools/lib/perf/include/perf/event.h | 7 +++++++
tools/perf/util/event.c | 1 +
tools/perf/util/evsel.c | 15 +++++++++++++++
tools/perf/util/machine.c | 1 +
tools/perf/util/perf_event_attr_fprintf.c | 1 +
tools/perf/util/sample.h | 3 ++-
tools/perf/util/session.c | 17 +++++++++++++++++
tools/perf/util/tool.c | 1 +
tools/perf/util/tool.h | 3 ++-
9 files changed, 47 insertions(+), 2 deletions(-)
diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
index 37bb7771d914..f643a6a2b9fc 100644
--- a/tools/lib/perf/include/perf/event.h
+++ b/tools/lib/perf/include/perf/event.h
@@ -151,6 +151,12 @@ struct perf_record_switch {
__u32 next_prev_tid;
};
+struct perf_record_callchain_deferred {
+ struct perf_event_header header;
+ __u64 nr;
+ __u64 ips[];
+};
+
struct perf_record_header_attr {
struct perf_event_header header;
struct perf_event_attr attr;
@@ -494,6 +500,7 @@ union perf_event {
struct perf_record_read read;
struct perf_record_throttle throttle;
struct perf_record_sample sample;
+ struct perf_record_callchain_deferred callchain_deferred;
struct perf_record_bpf_event bpf;
struct perf_record_ksymbol ksymbol;
struct perf_record_text_poke_event text_poke;
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index aac96d5d1917..8cdec373db44 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -58,6 +58,7 @@ static const char *perf_event__names[] = {
[PERF_RECORD_CGROUP] = "CGROUP",
[PERF_RECORD_TEXT_POKE] = "TEXT_POKE",
[PERF_RECORD_AUX_OUTPUT_HW_ID] = "AUX_OUTPUT_HW_ID",
+ [PERF_RECORD_CALLCHAIN_DEFERRED] = "CALLCHAIN_DEFERRED",
[PERF_RECORD_HEADER_ATTR] = "ATTR",
[PERF_RECORD_HEADER_EVENT_TYPE] = "EVENT_TYPE",
[PERF_RECORD_HEADER_TRACING_DATA] = "TRACING_DATA",
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index d22c5df1701e..09b9735f2fb1 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2817,6 +2817,18 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
data->data_src = PERF_MEM_DATA_SRC_NONE;
data->vcpu = -1;
+ if (event->header.type == PERF_RECORD_CALLCHAIN_DEFERRED) {
+ const u64 max_callchain_nr = UINT64_MAX / sizeof(u64);
+
+ data->callchain = (struct ip_callchain *)&event->callchain_deferred.nr;
+ if (data->callchain->nr > max_callchain_nr)
+ return -EFAULT;
+
+ if (evsel->core.attr.sample_id_all)
+ perf_evsel__parse_id_sample(evsel, event, data);
+ return 0;
+ }
+
if (event->header.type != PERF_RECORD_SAMPLE) {
if (!evsel->core.attr.sample_id_all)
return 0;
@@ -2947,6 +2959,9 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
if (data->callchain->nr > max_callchain_nr)
return -EFAULT;
sz = data->callchain->nr * sizeof(u64);
+ if (evsel->core.attr.defer_callchain && data->callchain->nr >= 1 &&
+ data->callchain->ips[data->callchain->nr - 1] == PERF_CONTEXT_USER_DEFERRED)
+ data->deferred_callchain = true;
OVERFLOW_CHECK(array, sz, max_size);
array = (void *)array + sz;
}
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 27d5345d2b30..9da467886bc6 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -2087,6 +2087,7 @@ static int add_callchain_ip(struct thread *thread,
*cpumode = PERF_RECORD_MISC_KERNEL;
break;
case PERF_CONTEXT_USER:
+ case PERF_CONTEXT_USER_DEFERRED:
*cpumode = PERF_RECORD_MISC_USER;
break;
default:
diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index 59fbbba79697..113845b35110 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -321,6 +321,7 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
PRINT_ATTRf(inherit_thread, p_unsigned);
PRINT_ATTRf(remove_on_exec, p_unsigned);
PRINT_ATTRf(sigtrap, p_unsigned);
+ PRINT_ATTRf(defer_callchain, p_unsigned);
PRINT_ATTRn("{ wakeup_events, wakeup_watermark }", wakeup_events, p_unsigned, false);
PRINT_ATTRf(bp_type, p_unsigned);
diff --git a/tools/perf/util/sample.h b/tools/perf/util/sample.h
index 70b2c3135555..010659dc80f8 100644
--- a/tools/perf/util/sample.h
+++ b/tools/perf/util/sample.h
@@ -108,7 +108,8 @@ struct perf_sample {
u16 p_stage_cyc;
u16 retire_lat;
};
- bool no_hw_idx; /* No hw_idx collected in branch_stack */
+ bool no_hw_idx; /* No hw_idx collected in branch_stack */
+ bool deferred_callchain; /* Has deferred user callchains */
char insn[MAX_INSN];
void *raw_data;
struct ip_callchain *callchain;
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 507e6cba9545..493070180279 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -714,6 +714,7 @@ static perf_event__swap_op perf_event__swap_ops[] = {
[PERF_RECORD_CGROUP] = perf_event__cgroup_swap,
[PERF_RECORD_TEXT_POKE] = perf_event__text_poke_swap,
[PERF_RECORD_AUX_OUTPUT_HW_ID] = perf_event__all64_swap,
+ [PERF_RECORD_CALLCHAIN_DEFERRED] = perf_event__all64_swap,
[PERF_RECORD_HEADER_ATTR] = perf_event__hdr_attr_swap,
[PERF_RECORD_HEADER_EVENT_TYPE] = perf_event__event_type_swap,
[PERF_RECORD_HEADER_TRACING_DATA] = perf_event__tracing_data_swap,
@@ -1107,6 +1108,19 @@ static void dump_sample(struct evsel *evsel, union perf_event *event,
sample_read__printf(sample, evsel->core.attr.read_format);
}
+static void dump_deferred_callchain(struct evsel *evsel, union perf_event *event,
+ struct perf_sample *sample)
+{
+ if (!dump_trace)
+ return;
+
+ printf("(IP, 0x%x): %d/%d: %#" PRIx64 "\n",
+ event->header.misc, sample->pid, sample->tid, sample->ip);
+
+ if (evsel__has_callchain(evsel))
+ callchain__printf(evsel, sample);
+}
+
static void dump_read(struct evsel *evsel, union perf_event *event)
{
struct perf_record_read *read_event = &event->read;
@@ -1337,6 +1351,9 @@ static int machines__deliver_event(struct machines *machines,
return tool->text_poke(tool, event, sample, machine);
case PERF_RECORD_AUX_OUTPUT_HW_ID:
return tool->aux_output_hw_id(tool, event, sample, machine);
+ case PERF_RECORD_CALLCHAIN_DEFERRED:
+ dump_deferred_callchain(evsel, event, sample);
+ return tool->callchain_deferred(tool, event, sample, evsel, machine);
default:
++evlist->stats.nr_unknown_events;
return -1;
diff --git a/tools/perf/util/tool.c b/tools/perf/util/tool.c
index 3b7f390f26eb..e78f16de912e 100644
--- a/tools/perf/util/tool.c
+++ b/tools/perf/util/tool.c
@@ -259,6 +259,7 @@ void perf_tool__init(struct perf_tool *tool, bool ordered_events)
tool->read = process_event_sample_stub;
tool->throttle = process_event_stub;
tool->unthrottle = process_event_stub;
+ tool->callchain_deferred = process_event_sample_stub;
tool->attr = process_event_synth_attr_stub;
tool->event_update = process_event_synth_event_update_stub;
tool->tracing_data = process_event_synth_tracing_data_stub;
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index db1c7642b0d1..9987bbde6d5e 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -42,7 +42,8 @@ enum show_feature_header {
struct perf_tool {
event_sample sample,
- read;
+ read,
+ callchain_deferred;
event_op mmap,
mmap2,
comm,
--
2.48.1
^ permalink raw reply related [flat|nested] 161+ messages in thread
* [PATCH v4 37/39] perf record: Enable defer_callchain for user callchains
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (35 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 36/39] perf tools: Minimal CALLCHAIN_DEFERRED support Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 38/39] perf script: Display PERF_RECORD_CALLCHAIN_DEFERRED Josh Poimboeuf
` (3 subsequent siblings)
40 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
From: Namhyung Kim <namhyung@kernel.org>
And add the missing feature detection logic to clear the flag on old
kernels.
$ perf record -g -vv true
...
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0 (PERF_COUNT_HW_CPU_CYCLES)
{ sample_period, sample_freq } 4000
sample_type IP|TID|TIME|CALLCHAIN|PERIOD
read_format ID|LOST
disabled 1
inherit 1
mmap 1
comm 1
freq 1
enable_on_exec 1
task 1
sample_id_all 1
mmap2 1
comm_exec 1
ksymbol 1
bpf_event 1
defer_callchain 1
------------------------------------------------------------
sys_perf_event_open: pid 162755 cpu 0 group_fd -1 flags 0x8
sys_perf_event_open failed, error -22
switching off deferred callchain support
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
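[Not part of the original mail; a userspace sketch of the probe-and-fall-back pattern this patch follows. The tool sets defer_callchain optimistically, and when a trial open fails it records the feature as missing and clears the bit, like evsel__detect_missing_features() and evsel__disable_missing_features(). try_open() is an illustrative stand-in for has_attr_feature()/sys_perf_event_open(), not a real API.]

```c
#include <assert.h>
#include <stdbool.h>

struct attr { bool defer_callchain; };

static bool kernel_supports_defer;	/* simulated kernel capability */
static bool missing_defer_callchain;	/* like perf_missing_features */

/* stand-in probe: old kernels reject unknown attr bits with -EINVAL */
static bool try_open(const struct attr *a)
{
	return !a->defer_callchain || kernel_supports_defer;
}

static bool open_event(struct attr *a)
{
	if (!try_open(a)) {
		/* "switching off deferred callchain support" */
		missing_defer_callchain = true;
		a->defer_callchain = false;
	}
	return try_open(a);
}
```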
---
tools/perf/util/evsel.c | 24 ++++++++++++++++++++++++
tools/perf/util/evsel.h | 1 +
2 files changed, 25 insertions(+)
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 09b9735f2fb1..e3be3cc7632d 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -998,6 +998,14 @@ static void __evsel__config_callchain(struct evsel *evsel, struct record_opts *o
}
}
+ if (param->record_mode == CALLCHAIN_FP && !attr->exclude_callchain_user) {
+ /*
+ * Enable deferred callchains optimistically. It'll be switched
+ * off later if the kernel doesn't support it.
+ */
+ attr->defer_callchain = 1;
+ }
+
if (function) {
pr_info("Disabling user space callchains for function trace event.\n");
attr->exclude_callchain_user = 1;
@@ -2038,6 +2046,8 @@ static int __evsel__prepare_open(struct evsel *evsel, struct perf_cpu_map *cpus,
static void evsel__disable_missing_features(struct evsel *evsel)
{
+ if (perf_missing_features.defer_callchain)
+ evsel->core.attr.defer_callchain = 0;
if (perf_missing_features.inherit_sample_read && evsel->core.attr.inherit &&
(evsel->core.attr.sample_type & PERF_SAMPLE_READ))
evsel->core.attr.inherit = 0;
@@ -2244,6 +2254,15 @@ static bool evsel__detect_missing_features(struct evsel *evsel)
/* Please add new feature detection here. */
+ attr.defer_callchain = true;
+ attr.sample_type = PERF_SAMPLE_CALLCHAIN;
+ if (has_attr_feature(&attr, /*flags=*/0))
+ goto found;
+ perf_missing_features.defer_callchain = true;
+ pr_debug2("switching off deferred callchain support\n");
+ attr.defer_callchain = false;
+ attr.sample_type = 0;
+
attr.inherit = true;
attr.sample_type = PERF_SAMPLE_READ;
if (has_attr_feature(&attr, /*flags=*/0))
@@ -2355,6 +2374,11 @@ static bool evsel__detect_missing_features(struct evsel *evsel)
errno = old_errno;
check:
+ if (evsel->core.attr.defer_callchain &&
+ evsel->core.attr.sample_type & PERF_SAMPLE_CALLCHAIN &&
+ perf_missing_features.defer_callchain)
+ return true;
+
if (evsel->core.attr.inherit &&
(evsel->core.attr.sample_type & PERF_SAMPLE_READ) &&
perf_missing_features.inherit_sample_read)
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 04934a7af174..b90a970f9a30 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -206,6 +206,7 @@ struct perf_missing_features {
bool read_lost;
bool branch_counters;
bool inherit_sample_read;
+ bool defer_callchain;
};
extern struct perf_missing_features perf_missing_features;
--
2.48.1
* [PATCH v4 38/39] perf script: Display PERF_RECORD_CALLCHAIN_DEFERRED
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (36 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 37/39] perf record: Enable defer_callchain for user callchains Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 39/39] perf tools: Merge deferred user callchains Josh Poimboeuf
` (2 subsequent siblings)
40 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
From: Namhyung Kim <namhyung@kernel.org>
Handle the deferred callchains in the script output.
$ perf script
perf 801 [000] 18.031793: 1 cycles:P:
ffffffff91a14c36 __intel_pmu_enable_all.isra.0+0x56 ([kernel.kallsyms])
ffffffff91d373e9 perf_ctx_enable+0x39 ([kernel.kallsyms])
ffffffff91d36af7 event_function+0xd7 ([kernel.kallsyms])
ffffffff91d34222 remote_function+0x42 ([kernel.kallsyms])
ffffffff91c1ebe1 generic_exec_single+0x61 ([kernel.kallsyms])
ffffffff91c1edac smp_call_function_single+0xec ([kernel.kallsyms])
ffffffff91d37a9d event_function_call+0x10d ([kernel.kallsyms])
ffffffff91d33557 perf_event_for_each_child+0x37 ([kernel.kallsyms])
ffffffff91d47324 _perf_ioctl+0x204 ([kernel.kallsyms])
ffffffff91d47c43 perf_ioctl+0x33 ([kernel.kallsyms])
ffffffff91e2f216 __x64_sys_ioctl+0x96 ([kernel.kallsyms])
ffffffff9265f1ae do_syscall_64+0x9e ([kernel.kallsyms])
ffffffff92800130 entry_SYSCALL_64+0xb0 ([kernel.kallsyms])
perf 801 [000] 18.031814: DEFERRED CALLCHAIN
7fb5fc22034b __GI___ioctl+0x3b (/usr/lib/x86_64-linux-gnu/libc.so.6)
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
tools/perf/builtin-script.c | 89 +++++++++++++++++++++++++++++++++++++
1 file changed, 89 insertions(+)
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 9e47905f75a6..2b9085fa18bd 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -2541,6 +2541,93 @@ static int process_sample_event(const struct perf_tool *tool,
return ret;
}
+static int process_deferred_sample_event(const struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct evsel *evsel,
+ struct machine *machine)
+{
+ struct perf_script *scr = container_of(tool, struct perf_script, tool);
+ struct perf_event_attr *attr = &evsel->core.attr;
+ struct evsel_script *es = evsel->priv;
+ unsigned int type = output_type(attr->type);
+ struct addr_location al;
+ FILE *fp = es->fp;
+ int ret = 0;
+
+ if (output[type].fields == 0)
+ return 0;
+
+ /* Set thread to NULL to indicate addr_al and al are not initialized */
+ addr_location__init(&al);
+
+ if (perf_time__ranges_skip_sample(scr->ptime_range, scr->range_num,
+ sample->time)) {
+ goto out_put;
+ }
+
+ if (debug_mode) {
+ if (sample->time < last_timestamp) {
+ pr_err("Samples misordered, previous: %" PRIu64
+ " this: %" PRIu64 "\n", last_timestamp,
+ sample->time);
+ nr_unordered++;
+ }
+ last_timestamp = sample->time;
+ goto out_put;
+ }
+
+ if (filter_cpu(sample))
+ goto out_put;
+
+ if (machine__resolve(machine, &al, sample) < 0) {
+ pr_err("problem processing %d event, skipping it.\n",
+ event->header.type);
+ ret = -1;
+ goto out_put;
+ }
+
+ if (al.filtered)
+ goto out_put;
+
+ if (!show_event(sample, evsel, al.thread, &al, NULL))
+ goto out_put;
+
+ if (evswitch__discard(&scr->evswitch, evsel))
+ goto out_put;
+
+ perf_sample__fprintf_start(scr, sample, al.thread, evsel,
+ PERF_RECORD_CALLCHAIN_DEFERRED, fp);
+ fprintf(fp, "DEFERRED CALLCHAIN");
+
+ if (PRINT_FIELD(IP)) {
+ struct callchain_cursor *cursor = NULL;
+
+ if (symbol_conf.use_callchain && sample->callchain) {
+ cursor = get_tls_callchain_cursor();
+ if (thread__resolve_callchain(al.thread, cursor, evsel,
+ sample, NULL, NULL,
+ scripting_max_stack)) {
+ pr_info("cannot resolve deferred callchains\n");
+ cursor = NULL;
+ }
+ }
+
+ fputc(cursor ? '\n' : ' ', fp);
+ sample__fprintf_sym(sample, &al, 0, output[type].print_ip_opts,
+ cursor, symbol_conf.bt_stop_list, fp);
+ }
+
+ fprintf(fp, "\n");
+
+ if (verbose > 0)
+ fflush(fp);
+
+out_put:
+ addr_location__exit(&al);
+ return ret;
+}
+
// Used when scr->per_event_dump is not set
static struct evsel_script es_stdout;
@@ -4326,6 +4413,7 @@ int cmd_script(int argc, const char **argv)
perf_tool__init(&script.tool, !unsorted_dump);
script.tool.sample = process_sample_event;
+ script.tool.callchain_deferred = process_deferred_sample_event;
script.tool.mmap = perf_event__process_mmap;
script.tool.mmap2 = perf_event__process_mmap2;
script.tool.comm = perf_event__process_comm;
@@ -4352,6 +4440,7 @@ int cmd_script(int argc, const char **argv)
script.tool.throttle = process_throttle_event;
script.tool.unthrottle = process_throttle_event;
script.tool.ordering_requires_timestamps = true;
+ script.tool.merge_deferred_callchains = false;
session = perf_session__new(&data, &script.tool);
if (IS_ERR(session))
return PTR_ERR(session);
--
2.48.1
* [PATCH v4 39/39] perf tools: Merge deferred user callchains
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (37 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 38/39] perf script: Display PERF_RECORD_CALLCHAIN_DEFERRED Josh Poimboeuf
@ 2025-01-22 2:31 ` Josh Poimboeuf
2025-01-22 2:35 ` [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
2025-01-22 16:13 ` Steven Rostedt
40 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:31 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
From: Namhyung Kim <namhyung@kernel.org>
Save samples with deferred callchains in a separate list and deliver
them after merging in the user callchains. If users don't want to
merge, they can set tool->merge_deferred_callchains to false to prevent
this behavior.
With the previous result, perf script now shows the merged callchains.
$ perf script
perf 801 [000] 18.031793: 1 cycles:P:
ffffffff91a14c36 __intel_pmu_enable_all.isra.0+0x56 ([kernel.kallsyms])
ffffffff91d373e9 perf_ctx_enable+0x39 ([kernel.kallsyms])
ffffffff91d36af7 event_function+0xd7 ([kernel.kallsyms])
ffffffff91d34222 remote_function+0x42 ([kernel.kallsyms])
ffffffff91c1ebe1 generic_exec_single+0x61 ([kernel.kallsyms])
ffffffff91c1edac smp_call_function_single+0xec ([kernel.kallsyms])
ffffffff91d37a9d event_function_call+0x10d ([kernel.kallsyms])
ffffffff91d33557 perf_event_for_each_child+0x37 ([kernel.kallsyms])
ffffffff91d47324 _perf_ioctl+0x204 ([kernel.kallsyms])
ffffffff91d47c43 perf_ioctl+0x33 ([kernel.kallsyms])
ffffffff91e2f216 __x64_sys_ioctl+0x96 ([kernel.kallsyms])
ffffffff9265f1ae do_syscall_64+0x9e ([kernel.kallsyms])
ffffffff92800130 entry_SYSCALL_64+0xb0 ([kernel.kallsyms])
7fb5fc22034b __GI___ioctl+0x3b (/usr/lib/x86_64-linux-gnu/libc.so.6)
...
The old output can be obtained using the --no-merge-callchains option.
Also, perf report now shows the user callchain entry at the end.
$ perf report --no-children --percent-limit=0 --stdio -q -S __intel_pmu_enable_all.isra.0
# symbol: __intel_pmu_enable_all.isra.0
0.00% perf [kernel.kallsyms]
|
---__intel_pmu_enable_all.isra.0
perf_ctx_enable
event_function
remote_function
generic_exec_single
smp_call_function_single
event_function_call
perf_event_for_each_child
_perf_ioctl
perf_ioctl
__x64_sys_ioctl
do_syscall_64
entry_SYSCALL_64
__GI___ioctl
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
tools/perf/Documentation/perf-script.txt | 5 ++
tools/perf/builtin-script.c | 5 +-
tools/perf/util/callchain.c | 24 +++++++++
tools/perf/util/callchain.h | 3 ++
tools/perf/util/evlist.c | 1 +
tools/perf/util/evlist.h | 1 +
tools/perf/util/session.c | 63 +++++++++++++++++++++++-
tools/perf/util/tool.c | 1 +
tools/perf/util/tool.h | 1 +
9 files changed, 102 insertions(+), 2 deletions(-)
diff --git a/tools/perf/Documentation/perf-script.txt b/tools/perf/Documentation/perf-script.txt
index b72866ef270b..69f018b3d199 100644
--- a/tools/perf/Documentation/perf-script.txt
+++ b/tools/perf/Documentation/perf-script.txt
@@ -518,6 +518,11 @@ include::itrace.txt[]
The known limitations include exception handing such as
setjmp/longjmp will have calls/returns not match.
+--merge-callchains::
+ Enable merging deferred user callchains if available. This is the
+ default behavior. If you want to see separate CALLCHAIN_DEFERRED
+ records for some reason, use --no-merge-callchains explicitly.
+
:GMEXAMPLECMD: script
:GMEXAMPLESUBCMD:
include::guest-files.txt[]
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 2b9085fa18bd..d18ada14a83a 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -4032,6 +4032,7 @@ int cmd_script(int argc, const char **argv)
bool header_only = false;
bool script_started = false;
bool unsorted_dump = false;
+ bool merge_deferred_callchains = true;
char *rec_script_path = NULL;
char *rep_script_path = NULL;
struct perf_session *session;
@@ -4185,6 +4186,8 @@ int cmd_script(int argc, const char **argv)
"Guest code can be found in hypervisor process"),
OPT_BOOLEAN('\0', "stitch-lbr", &script.stitch_lbr,
"Enable LBR callgraph stitching approach"),
+ OPT_BOOLEAN('\0', "merge-callchains", &merge_deferred_callchains,
+ "Enable merge deferred user callchains"),
OPTS_EVSWITCH(&script.evswitch),
OPT_END()
};
@@ -4440,7 +4443,7 @@ int cmd_script(int argc, const char **argv)
script.tool.throttle = process_throttle_event;
script.tool.unthrottle = process_throttle_event;
script.tool.ordering_requires_timestamps = true;
- script.tool.merge_deferred_callchains = false;
+ script.tool.merge_deferred_callchains = merge_deferred_callchains;
session = perf_session__new(&data, &script.tool);
if (IS_ERR(session))
return PTR_ERR(session);
diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c
index 0c7564747a14..d1114491c3da 100644
--- a/tools/perf/util/callchain.c
+++ b/tools/perf/util/callchain.c
@@ -1832,3 +1832,27 @@ int sample__for_each_callchain_node(struct thread *thread, struct evsel *evsel,
}
return 0;
}
+
+int sample__merge_deferred_callchain(struct perf_sample *sample_orig,
+ struct perf_sample *sample_callchain)
+{
+ u64 nr_orig = sample_orig->callchain->nr - 1;
+ u64 nr_deferred = sample_callchain->callchain->nr;
+ struct ip_callchain *callchain;
+
+ callchain = calloc(1 + nr_orig + nr_deferred, sizeof(u64));
+ if (callchain == NULL) {
+ sample_orig->deferred_callchain = false;
+ return -ENOMEM;
+ }
+
+ callchain->nr = nr_orig + nr_deferred;
+ /* copy except for the last PERF_CONTEXT_USER_DEFERRED */
+ memcpy(callchain->ips, sample_orig->callchain->ips, nr_orig * sizeof(u64));
+ /* copy deferred user callchains */
+ memcpy(&callchain->ips[nr_orig], sample_callchain->callchain->ips,
+ nr_deferred * sizeof(u64));
+
+ sample_orig->callchain = callchain;
+ return 0;
+}
diff --git a/tools/perf/util/callchain.h b/tools/perf/util/callchain.h
index 86ed9e4d04f9..89785125ed25 100644
--- a/tools/perf/util/callchain.h
+++ b/tools/perf/util/callchain.h
@@ -317,4 +317,7 @@ int sample__for_each_callchain_node(struct thread *thread, struct evsel *evsel,
struct perf_sample *sample, int max_stack,
bool symbols, callchain_iter_fn cb, void *data);
+int sample__merge_deferred_callchain(struct perf_sample *sample_orig,
+ struct perf_sample *sample_callchain);
+
#endif /* __PERF_CALLCHAIN_H */
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index f0dd174e2deb..39a43980f6aa 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -82,6 +82,7 @@ void evlist__init(struct evlist *evlist, struct perf_cpu_map *cpus,
evlist->ctl_fd.ack = -1;
evlist->ctl_fd.pos = -1;
evlist->nr_br_cntr = -1;
+ INIT_LIST_HEAD(&evlist->deferred_samples);
}
struct evlist *evlist__new(void)
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index adddb1db1ad2..f78275af1553 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -84,6 +84,7 @@ struct evlist {
int pos; /* index at evlist core object to check signals */
} ctl_fd;
struct event_enable_timer *eet;
+ struct list_head deferred_samples;
};
struct evsel_str_handler {
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 493070180279..e02e69ce2f77 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1266,6 +1266,56 @@ static int evlist__deliver_sample(struct evlist *evlist, const struct perf_tool
per_thread);
}
+struct deferred_event {
+ struct list_head list;
+ union perf_event *event;
+};
+
+static int evlist__deliver_deferred_samples(struct evlist *evlist,
+ const struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct machine *machine)
+{
+ struct deferred_event *de, *tmp;
+ struct evsel *evsel;
+ int ret = 0;
+
+ if (!tool->merge_deferred_callchains) {
+ evsel = evlist__id2evsel(evlist, sample->id);
+ return tool->callchain_deferred(tool, event, sample,
+ evsel, machine);
+ }
+
+ list_for_each_entry_safe(de, tmp, &evlist->deferred_samples, list) {
+ struct perf_sample orig_sample;
+
+ ret = evlist__parse_sample(evlist, de->event, &orig_sample);
+ if (ret < 0) {
+ pr_err("failed to parse original sample\n");
+ break;
+ }
+
+ if (sample->tid != orig_sample.tid)
+ continue;
+
+ evsel = evlist__id2evsel(evlist, orig_sample.id);
+ sample__merge_deferred_callchain(&orig_sample, sample);
+ ret = evlist__deliver_sample(evlist, tool, de->event,
+ &orig_sample, evsel, machine);
+
+ if (orig_sample.deferred_callchain)
+ free(orig_sample.callchain);
+
+ list_del(&de->list);
+ free(de);
+
+ if (ret)
+ break;
+ }
+ return ret;
+}
+
static int machines__deliver_event(struct machines *machines,
struct evlist *evlist,
union perf_event *event,
@@ -1294,6 +1344,16 @@ static int machines__deliver_event(struct machines *machines,
return 0;
}
dump_sample(evsel, event, sample, perf_env__arch(machine->env));
+ if (sample->deferred_callchain && tool->merge_deferred_callchains) {
+ struct deferred_event *de = malloc(sizeof(*de));
+
+ if (de == NULL)
+ return -ENOMEM;
+
+ de->event = event;
+ list_add_tail(&de->list, &evlist->deferred_samples);
+ return 0;
+ }
return evlist__deliver_sample(evlist, tool, event, sample, evsel, machine);
case PERF_RECORD_MMAP:
return tool->mmap(tool, event, sample, machine);
@@ -1353,7 +1413,8 @@ static int machines__deliver_event(struct machines *machines,
return tool->aux_output_hw_id(tool, event, sample, machine);
case PERF_RECORD_CALLCHAIN_DEFERRED:
dump_deferred_callchain(evsel, event, sample);
- return tool->callchain_deferred(tool, event, sample, evsel, machine);
+ return evlist__deliver_deferred_samples(evlist, tool, event,
+ sample, machine);
default:
++evlist->stats.nr_unknown_events;
return -1;
diff --git a/tools/perf/util/tool.c b/tools/perf/util/tool.c
index e78f16de912e..385043e06627 100644
--- a/tools/perf/util/tool.c
+++ b/tools/perf/util/tool.c
@@ -238,6 +238,7 @@ void perf_tool__init(struct perf_tool *tool, bool ordered_events)
tool->cgroup_events = false;
tool->no_warn = false;
tool->show_feat_hdr = SHOW_FEAT_NO_HEADER;
+ tool->merge_deferred_callchains = true;
tool->sample = process_event_sample_stub;
tool->mmap = process_event_stub;
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index 9987bbde6d5e..d06580478ab1 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -87,6 +87,7 @@ struct perf_tool {
bool cgroup_events;
bool no_warn;
bool dont_split_sample_group;
+ bool merge_deferred_callchains;
enum show_feature_header show_feat_hdr;
};
--
2.48.1
* Re: [PATCH v4 00/39] unwind, perf: sframe user space unwinding
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (38 preceding siblings ...)
2025-01-22 2:31 ` [PATCH v4 39/39] perf tools: Merge deferred user callchains Josh Poimboeuf
@ 2025-01-22 2:35 ` Josh Poimboeuf
2025-01-22 16:13 ` Steven Rostedt
40 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 2:35 UTC (permalink / raw)
To: x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, Jan 21, 2025 at 06:30:52PM -0800, Josh Poimboeuf wrote:
> For testing with user space, here are the latest binutils fixes:
>
> 1785837a2570 ("ld: fix PR/32297")
> 938fb512184d ("ld: fix wrong SFrame info for lazy IBT PLT")
> 47c88752f9ad ("ld: generate SFrame stack trace info for .plt.got")
>
> An out-of-tree glibc patch is also needed -- will attach in a reply.
Latest out-of-tree glibc patch below:
diff --git a/elf/dl-load.c b/elf/dl-load.c
index e986d7faab..5a593c2126 100644
--- a/elf/dl-load.c
+++ b/elf/dl-load.c
@@ -29,6 +29,7 @@
#include <bits/wordsize.h>
#include <sys/mman.h>
#include <sys/param.h>
+#include <sys/prctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <gnu/lib-names.h>
@@ -87,6 +88,9 @@ struct filebuf
#define STRING(x) __STRING (x)
+#ifndef PT_GNU_SFRAME
+#define PT_GNU_SFRAME 0x6474e554
+#endif
/* This is the decomposed LD_LIBRARY_PATH search path. */
struct r_search_path_struct __rtld_env_path_list attribute_relro;
@@ -1186,6 +1190,11 @@ _dl_map_object_from_fd (const char *name, const char *origname, int fd,
l->l_relro_addr = ph->p_vaddr;
l->l_relro_size = ph->p_memsz;
break;
+
+ case PT_GNU_SFRAME:
+ l->l_sframe_start = ph->p_vaddr;
+ l->l_sframe_end = ph->p_vaddr + ph->p_memsz;
+ break;
}
if (__glibc_unlikely (nloadcmds == 0))
@@ -1236,6 +1245,26 @@ _dl_map_object_from_fd (const char *name, const char *origname, int fd,
l->l_map_start = l->l_map_end = 0;
goto lose;
}
+
+#define PR_ADD_SFRAME 77
+ if (l->l_sframe_start != 0)
+ {
+ l->l_sframe_start += l->l_addr;
+ l->l_sframe_end += l->l_addr;
+
+ for (size_t i = 0; i < nloadcmds; i++)
+ {
+ struct loadcmd *c = &loadcmds[i];
+
+ if (c->prot & PROT_EXEC)
+ {
+ ElfW(Addr) text_start = l->l_addr + c->mapstart;
+ ElfW(Addr) text_end = l->l_addr + c->mapend;
+
+ __prctl(PR_ADD_SFRAME, l->l_sframe_start, l->l_sframe_end, text_start, text_end);
+ }
+ }
+ }
}
if (l->l_ld != NULL)
diff --git a/elf/dl-unmap-segments.h b/elf/dl-unmap-segments.h
index f16f4d7ded..dd14162e00 100644
--- a/elf/dl-unmap-segments.h
+++ b/elf/dl-unmap-segments.h
@@ -21,14 +21,20 @@
#include <link.h>
#include <sys/mman.h>
+#include <sys/prctl.h>
/* _dl_map_segments ensures that any whole pages in gaps between segments
are filled in with PROT_NONE mappings. So we can just unmap the whole
range in one fell swoop. */
+#define PR_REMOVE_SFRAME 78
+
static __always_inline void
_dl_unmap_segments (struct link_map *l)
{
+ if (l->l_sframe_start != 0)
+ __prctl(PR_REMOVE_SFRAME, l->l_sframe_start, NULL, NULL, NULL);
+
__munmap ((void *) l->l_map_start, l->l_map_end - l->l_map_start);
}
diff --git a/elf/setup-vdso.h b/elf/setup-vdso.h
index 888e1e4897..2a6bb9b944 100644
--- a/elf/setup-vdso.h
+++ b/elf/setup-vdso.h
@@ -16,6 +16,11 @@
License along with the GNU C Library; if not, see
<https://www.gnu.org/licenses/>. */
+#include <sys/prctl.h>
+#ifndef PT_GNU_SFRAME
+#define PT_GNU_SFRAME 0x6474e554
+#endif
+
static inline void __attribute__ ((always_inline))
setup_vdso (struct link_map *main_map __attribute__ ((unused)),
struct link_map ***first_preload __attribute__ ((unused)))
@@ -52,6 +57,14 @@ setup_vdso (struct link_map *main_map __attribute__ ((unused)),
if (ph->p_vaddr + ph->p_memsz >= l->l_map_end)
l->l_map_end = ph->p_vaddr + ph->p_memsz;
}
+ else if (ph->p_type == PT_GNU_SFRAME)
+ {
+ if (! l->l_sframe_start)
+ {
+ l->l_sframe_start = ph->p_vaddr;
+ l->l_sframe_end = ph->p_vaddr + ph->p_memsz;
+ }
+ }
else
/* There must be no TLS segment. */
assert (ph->p_type != PT_TLS);
@@ -74,6 +87,15 @@ setup_vdso (struct link_map *main_map __attribute__ ((unused)),
l->l_local_scope[0]->r_nlist = 1;
l->l_local_scope[0]->r_list = &l->l_real;
+#define PR_ADD_SFRAME 77
+ if (l->l_sframe_start != 0)
+ {
+ l->l_sframe_start += l->l_addr;
+ l->l_sframe_end += l->l_addr;
+
+ __prctl(PR_ADD_SFRAME, l->l_sframe_start, l->l_sframe_end, l->l_addr, l->l_map_end);
+ }
+
/* Now that we have the info handy, use the DSO image's soname
so this object can be looked up by name. */
if (l->l_info[DT_SONAME] != NULL)
diff --git a/include/link.h b/include/link.h
index 5ed445d5a6..e94390b29e 100644
--- a/include/link.h
+++ b/include/link.h
@@ -345,6 +345,9 @@ struct link_map
ElfW(Addr) l_relro_addr;
size_t l_relro_size;
+ ElfW(Addr) l_sframe_start;
+ ElfW(Addr) l_sframe_end;
+
unsigned long long int l_serial;
};
* Re: [PATCH v4 02/39] task_work: Fix TWA_NMI_CURRENT race with __schedule()
2025-01-22 2:30 ` [PATCH v4 02/39] task_work: Fix TWA_NMI_CURRENT race with __schedule() Josh Poimboeuf
@ 2025-01-22 12:23 ` Peter Zijlstra
2025-01-22 12:42 ` Peter Zijlstra
1 sibling, 0 replies; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-22 12:23 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, Jan 21, 2025 at 06:30:54PM -0800, Josh Poimboeuf wrote:
> @@ -109,11 +126,6 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
> case TWA_SIGNAL_NO_IPI:
> __set_notify_signal(task);
> break;
> -#ifdef CONFIG_IRQ_WORK
> - case TWA_NMI_CURRENT:
> - irq_work_queue(this_cpu_ptr(&irq_work_NMI_resume));
> - break;
> -#endif
> default:
> WARN_ON_ONCE(1);
> break;
Shouldn't this go to in the previous patch?
* Re: [PATCH v4 01/39] task_work: Fix TWA_NMI_CURRENT error handling
2025-01-22 2:30 ` [PATCH v4 01/39] task_work: Fix TWA_NMI_CURRENT error handling Josh Poimboeuf
@ 2025-01-22 12:28 ` Peter Zijlstra
2025-01-22 20:47 ` Josh Poimboeuf
2025-04-22 16:14 ` Steven Rostedt
0 siblings, 2 replies; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-22 12:28 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, Jan 21, 2025 at 06:30:53PM -0800, Josh Poimboeuf wrote:
> It's possible for irq_work_queue() to fail if the work has already been
> claimed. That can happen if a TWA_NMI_CURRENT task work is requested
> before a previous TWA_NMI_CURRENT IRQ work on the same CPU has gotten a
> chance to run.
I'm confused, if it fails then it's already pending, and we'll get the
notification already. You can still add the work.
> The error has to be checked before the write to task->task_works. Also
> the try_cmpxchg() loop isn't needed in NMI context. The TWA_NMI_CURRENT
> case really is special, keep things simple by keeping its code all
> together in one place.
NMIs can nest, consider #DB (which is NMI like) doing task_work_add()
and getting interrupted with NMI doing the same.
Might all this be fallout from trying to fix that schedule() bug from
the next patch, because as is, I don't see it.
* Re: [PATCH v4 02/39] task_work: Fix TWA_NMI_CURRENT race with __schedule()
2025-01-22 2:30 ` [PATCH v4 02/39] task_work: Fix TWA_NMI_CURRENT race with __schedule() Josh Poimboeuf
2025-01-22 12:23 ` Peter Zijlstra
@ 2025-01-22 12:42 ` Peter Zijlstra
2025-01-22 21:03 ` Josh Poimboeuf
2025-04-22 16:15 ` Steven Rostedt
1 sibling, 2 replies; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-22 12:42 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, Jan 21, 2025 at 06:30:54PM -0800, Josh Poimboeuf wrote:
> If TWA_NMI_CURRENT task work is queued from an NMI triggered while
> running in __schedule() with IRQs disabled, task_work_set_notify_irq()
> ends up inadvertently running on the next scheduled task. So the
> original task doesn't get its TIF_NOTIFY_RESUME flag set and the task
> work may get delayed indefinitely, or may not get to run at all.
>
> __schedule()
> // disable irqs
> <NMI>
> task_work_add(current, work, TWA_NMI_CURRENT);
> </NMI>
> // current = next;
> // enable irqs
> <IRQ>
> task_work_set_notify_irq()
> test_and_set_tsk_thread_flag(current, TIF_NOTIFY_RESUME); // wrong task!
> </IRQ>
> // original task skips task work on its next return to user (or exit!)
>
> Fix it by storing the task pointer along with the irq_work struct and
> passing that task to set_notify_resume().
So I'm a little confused, isn't something like this sufficient?
If we hit before schedule(), all just works as expected, if we hit after
schedule(), the task will already have the TIF flag set, and we'll hit
the return to user path once it gets scheduled again.
---
diff --git a/kernel/task_work.c b/kernel/task_work.c
index c969f1f26be5..155549c017b2 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -9,7 +9,12 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
#ifdef CONFIG_IRQ_WORK
static void task_work_set_notify_irq(struct irq_work *entry)
{
- test_and_set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+ /*
+ * no-op IPI
+ *
+ * TWA_NMI_CURRENT will already have set the TIF flag, all
+ * this interrupt does is tickle the return-to-user path.
+ */
}
static DEFINE_PER_CPU(struct irq_work, irq_work_NMI_resume) =
IRQ_WORK_INIT_HARD(task_work_set_notify_irq);
@@ -98,6 +103,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
break;
#ifdef CONFIG_IRQ_WORK
case TWA_NMI_CURRENT:
+ set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
irq_work_queue(this_cpu_ptr(&irq_work_NMI_resume));
break;
#endif
* Re: [PATCH v4 14/39] perf/x86: Rename get_segment_base() and make it global
2025-01-22 2:31 ` [PATCH v4 14/39] perf/x86: Rename get_segment_base() and make it global Josh Poimboeuf
@ 2025-01-22 12:51 ` Peter Zijlstra
2025-01-22 21:37 ` Josh Poimboeuf
2025-01-24 20:09 ` Steven Rostedt
1 sibling, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-22 12:51 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, Jan 21, 2025 at 06:31:06PM -0800, Josh Poimboeuf wrote:
> get_segment_base() will be used by the unwind_user code, so make it
> global and rename it so it doesn't conflict with a KVM function of the
> same name.
Should it not also get moved out of the perf code in this case?
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-22 2:31 ` [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface Josh Poimboeuf
@ 2025-01-22 13:37 ` Peter Zijlstra
2025-01-22 14:16 ` Peter Zijlstra
2025-01-22 21:38 ` Josh Poimboeuf
2025-01-22 13:44 ` Peter Zijlstra
` (2 subsequent siblings)
3 siblings, 2 replies; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-22 13:37 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, Jan 21, 2025 at 06:31:20PM -0800, Josh Poimboeuf wrote:
> +/*
> + * The context cookie is a unique identifier which allows post-processing to
> + * correlate kernel trace(s) with user unwinds. The high 12 bits are the CPU
s/12/16/ ?
> + * id; the lower 48 bits are a per-CPU entry counter.
> + */
> +static u64 ctx_to_cookie(u64 cpu, u64 ctx)
> +{
> + BUILD_BUG_ON(NR_CPUS > 65535);
> + return (ctx & ((1UL << 48) - 1)) | (cpu << 48);
> +}
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-22 2:31 ` [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface Josh Poimboeuf
2025-01-22 13:37 ` Peter Zijlstra
@ 2025-01-22 13:44 ` Peter Zijlstra
2025-01-22 21:52 ` Josh Poimboeuf
2025-01-22 20:13 ` Mathieu Desnoyers
2025-01-24 16:35 ` Jens Remus
3 siblings, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-22 13:44 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, Jan 21, 2025 at 06:31:20PM -0800, Josh Poimboeuf wrote:
> +/* entry-from-user counter */
> +static DEFINE_PER_CPU(u64, unwind_ctx_ctr);
AFAICT from the below, this thing does *not* count entry-from-user. It
might count a subset, but I need to stare longer.
> +/*
> + * Read the task context cookie, first initializing it if this is the first
> + * call to get_cookie() since the most recent entry from user.
> + */
> +static u64 get_cookie(struct unwind_task_info *info)
> +{
> + u64 ctx_ctr;
> + u64 cookie;
> + u64 cpu;
> +
> + guard(irqsave)();
> +
> + cookie = info->cookie;
> + if (cookie)
> + return cookie;
> +
> +
> + cpu = raw_smp_processor_id();
> + ctx_ctr = __this_cpu_inc_return(unwind_ctx_ctr);
> + info->cookie = ctx_to_cookie(cpu, ctx_ctr);
> +
> + return cookie;
> +
> +}
> +
> +static void unwind_deferred_task_work(struct callback_head *head)
> +{
> + cookie = get_cookie(info);
> +}
> +
> +int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
> +{
> + *cookie = get_cookie(info);
> +}
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 29/39] unwind_user/deferred: Add unwind cache
2025-01-22 2:31 ` [PATCH v4 29/39] unwind_user/deferred: Add unwind cache Josh Poimboeuf
@ 2025-01-22 13:57 ` Peter Zijlstra
2025-01-22 22:36 ` Josh Poimboeuf
0 siblings, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-22 13:57 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, Jan 21, 2025 at 06:31:21PM -0800, Josh Poimboeuf wrote:
> Cache the results of the unwind to ensure the unwind is only performed
> once, even when called by multiple tracers.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> ---
> include/linux/unwind_deferred_types.h | 8 +++++++-
> kernel/unwind/deferred.c | 26 ++++++++++++++++++++------
> 2 files changed, 27 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
> index 9749824aea09..6f71a06329fb 100644
> --- a/include/linux/unwind_deferred_types.h
> +++ b/include/linux/unwind_deferred_types.h
> @@ -2,8 +2,14 @@
> #ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
> #define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
>
> -struct unwind_task_info {
> +struct unwind_cache {
> unsigned long *entries;
> + unsigned int nr_entries;
> + u64 cookie;
> +};
If you make the return to user path clear nr_entries you don't need a
second cookie field I think.
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 30/39] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-01-22 2:31 ` [PATCH v4 30/39] unwind_user/deferred: Make unwind deferral requests NMI-safe Josh Poimboeuf
@ 2025-01-22 14:15 ` Peter Zijlstra
2025-01-22 22:49 ` Josh Poimboeuf
2025-01-22 14:24 ` Peter Zijlstra
1 sibling, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-22 14:15 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, Jan 21, 2025 at 06:31:22PM -0800, Josh Poimboeuf wrote:
> diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
> index 2f38055cce48..939c94abaa50 100644
> --- a/kernel/unwind/deferred.c
> +++ b/kernel/unwind/deferred.c
> @@ -29,27 +29,49 @@ static u64 ctx_to_cookie(u64 cpu, u64 ctx)
>
> /*
> * Read the task context cookie, first initializing it if this is the first
> - * call to get_cookie() since the most recent entry from user.
> + * call to get_cookie() since the most recent entry from user. This has to be
> + * done carefully to coordinate with unwind_deferred_request_nmi().
> */
> static u64 get_cookie(struct unwind_task_info *info)
> {
> u64 ctx_ctr;
> u64 cookie;
> - u64 cpu;
>
> guard(irqsave)();
>
> - cookie = info->cookie;
> + cookie = READ_ONCE(info->cookie);
> if (cookie)
> return cookie;
>
> + ctx_ctr = __this_cpu_read(unwind_ctx_ctr);
>
> - cpu = raw_smp_processor_id();
> - ctx_ctr = __this_cpu_inc_return(unwind_ctx_ctr);
> - info->cookie = ctx_to_cookie(cpu, ctx_ctr);
> + /* Read ctx_ctr before info->nmi_cookie */
> + barrier();
> +
> + cookie = READ_ONCE(info->nmi_cookie);
> + if (cookie) {
> + /*
> + * This is the first call to get_cookie() since an NMI handler
> + * first wrote it to info->nmi_cookie. Sync it.
> + */
> + WRITE_ONCE(info->cookie, cookie);
> + WRITE_ONCE(info->nmi_cookie, 0);
> + return cookie;
> + }
> +
> + /*
> + * Write info->cookie. It's ok to race with an NMI here. The value of
> + * the cookie is based on ctx_ctr from before the NMI could have
> + * incremented it. The result will be the same even if cookie or
> + * ctx_ctr end up getting written twice.
> + */
> + cookie = ctx_to_cookie(raw_smp_processor_id(), ctx_ctr + 1);
> + WRITE_ONCE(info->cookie, cookie);
> + WRITE_ONCE(info->nmi_cookie, 0);
> + barrier();
> + __this_cpu_write(unwind_ctx_ctr, ctx_ctr + 1);
>
> return cookie;
> -
> }
Oh gawd. Can we please do something simple like:
guard(irqsave)();
cpu = raw_smp_processor_id();
ctr = __this_cpu_read(unwind_ctx_cnt);
cookie = READ_ONCE(current->unwind_info.cookie);
do {
if (cookie)
return cookie;
cookie = ctx_to_cookie(cpu, ctr+1);
} while (!try_cmpxchg64(&current->unwind_info.cookie, &cookie, cookie));
__this_cpu_write(unwind_ctx_ctr, ctr+1);
return cookie;
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-22 13:37 ` Peter Zijlstra
@ 2025-01-22 14:16 ` Peter Zijlstra
2025-01-22 22:51 ` Josh Poimboeuf
2025-01-22 21:38 ` Josh Poimboeuf
1 sibling, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-22 14:16 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, Jan 22, 2025 at 02:37:30PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2025 at 06:31:20PM -0800, Josh Poimboeuf wrote:
> > +/*
> > + * The context cookie is a unique identifier which allows post-processing to
> > + * correlate kernel trace(s) with user unwinds. The high 12 bits are the CPU
>
> s/12/16/ ?
>
> > + * id; the lower 48 bits are a per-CPU entry counter.
> > + */
> > +static u64 ctx_to_cookie(u64 cpu, u64 ctx)
> > +{
> > + BUILD_BUG_ON(NR_CPUS > 65535);
> > + return (ctx & ((1UL << 48) - 1)) | (cpu << 48);
> > +}
Also, I have to note that 0 is a valid return value here, which will
give a ton of fun.
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 30/39] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-01-22 2:31 ` [PATCH v4 30/39] unwind_user/deferred: Make unwind deferral requests NMI-safe Josh Poimboeuf
2025-01-22 14:15 ` Peter Zijlstra
@ 2025-01-22 14:24 ` Peter Zijlstra
2025-01-22 22:52 ` Josh Poimboeuf
1 sibling, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-22 14:24 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, Jan 21, 2025 at 06:31:22PM -0800, Josh Poimboeuf wrote:
> +static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *cookie)
> +{
> + struct unwind_task_info *info = &current->unwind_info;
> + bool inited_cookie = false;
> + int ret;
> +
> + *cookie = info->cookie;
> + if (!*cookie) {
> + /*
> + * This is the first unwind request since the most recent entry
> + * from user. Initialize the task cookie.
> + *
> + * Don't write to info->cookie directly, otherwise it may get
> + * cleared if the NMI occurred in the kernel during early entry
> + * or late exit before the task work gets to run. Instead, use
> + * info->nmi_cookie which gets synced later by get_cookie().
> + */
> + if (!info->nmi_cookie) {
> + u64 cpu = raw_smp_processor_id();
> + u64 ctx_ctr;
> +
> + ctx_ctr = __this_cpu_inc_return(unwind_ctx_ctr);
__this_cpu_inc_return() is *NOT* NMI safe IIRC.
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 34/39] perf: Skip user unwind if !current->mm
2025-01-22 2:31 ` [PATCH v4 34/39] perf: Skip user unwind if !current->mm Josh Poimboeuf
@ 2025-01-22 14:29 ` Peter Zijlstra
2025-01-22 23:08 ` Josh Poimboeuf
0 siblings, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-22 14:29 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, Jan 21, 2025 at 06:31:26PM -0800, Josh Poimboeuf wrote:
> If the task doesn't have any memory, there's no stack to unwind.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> ---
> kernel/events/core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 99f0f28feeb5..a886bb83f4d0 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7792,7 +7792,7 @@ struct perf_callchain_entry *
> perf_callchain(struct perf_event *event, struct pt_regs *regs)
> {
> bool kernel = !event->attr.exclude_callchain_kernel;
> - bool user = !event->attr.exclude_callchain_user;
> + bool user = !event->attr.exclude_callchain_user && current->mm;
What about things like io_uring helpers, don't they keep ->mm but are
never in userspace?
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 00/39] unwind, perf: sframe user space unwinding
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
` (39 preceding siblings ...)
2025-01-22 2:35 ` [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
@ 2025-01-22 16:13 ` Steven Rostedt
40 siblings, 0 replies; 161+ messages in thread
From: Steven Rostedt @ 2025-01-22 16:13 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, 21 Jan 2025 18:30:52 -0800
Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> The interface is similar to {task,irq}_work. The caller owns an
> unwind_work struct:
>
> struct unwind_work {
> struct callback_head work;
> unwind_callback_t func;
> int pending;
> };
>
> For perf, struct unwind_work is embedded in struct perf_event. For
> ftrace maybe it would live in task_struct?
Hmm, this is going to be difficult, as I don't want to add more to a task
struct as it's already too bloated as is. I'll have to think about this a bit.
-- Steve
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-22 2:31 ` [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface Josh Poimboeuf
2025-01-22 13:37 ` Peter Zijlstra
2025-01-22 13:44 ` Peter Zijlstra
@ 2025-01-22 20:13 ` Mathieu Desnoyers
2025-01-23 4:05 ` Josh Poimboeuf
2025-01-24 16:35 ` Jens Remus
3 siblings, 1 reply; 161+ messages in thread
From: Mathieu Desnoyers @ 2025-01-22 20:13 UTC (permalink / raw)
To: Josh Poimboeuf, x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski, Masami Hiramatsu,
Weinan Liu
On 2025-01-21 21:31, Josh Poimboeuf wrote:
> Add an interface for scheduling task work to unwind the user space stack
> before returning to user space. This solves several problems for its
> callers:
>
> - Ensure the unwind happens in task context even if the caller may be
> running in NMI or interrupt context.
>
> - Avoid duplicate unwinds, whether called multiple times by the same
> caller or by different callers.
>
> - Create a "context cookie" which allows trace post-processing to
> correlate kernel unwinds/traces with the user unwind.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> ---
> include/linux/entry-common.h | 2 +
> include/linux/sched.h | 5 +
> include/linux/unwind_deferred.h | 46 +++++++
> include/linux/unwind_deferred_types.h | 10 ++
> kernel/fork.c | 4 +
> kernel/unwind/Makefile | 2 +-
> kernel/unwind/deferred.c | 178 ++++++++++++++++++++++++++
> 7 files changed, 246 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/unwind_deferred.h
> create mode 100644 include/linux/unwind_deferred_types.h
> create mode 100644 kernel/unwind/deferred.c
>
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index fc61d0205c97..fb2b27154fee 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -12,6 +12,7 @@
> #include <linux/resume_user_mode.h>
> #include <linux/tick.h>
> #include <linux/kmsan.h>
> +#include <linux/unwind_deferred.h>
>
> #include <asm/entry-common.h>
>
> @@ -111,6 +112,7 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
>
> CT_WARN_ON(__ct_state() != CT_STATE_USER);
> user_exit_irqoff();
> + unwind_enter_from_user_mode();
>
> instrumentation_begin();
> kmsan_unpoison_entry_regs(regs);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 64934e0830af..042a95f4f6e6 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -46,6 +46,7 @@
> #include <linux/rv.h>
> #include <linux/livepatch_sched.h>
> #include <linux/uidgid_types.h>
> +#include <linux/unwind_deferred_types.h>
> #include <asm/kmap_size.h>
>
> /* task_struct member predeclarations (sorted alphabetically): */
> @@ -1603,6 +1604,10 @@ struct task_struct {
> struct user_event_mm *user_event_mm;
> #endif
>
> +#ifdef CONFIG_UNWIND_USER
> + struct unwind_task_info unwind_info;
> +#endif
> +
> /*
> * New fields for task_struct should be added above here, so that
> * they are included in the randomized portion of task_struct.
> diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
> new file mode 100644
> index 000000000000..741f409f0d1f
> --- /dev/null
> +++ b/include/linux/unwind_deferred.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_UNWIND_USER_DEFERRED_H
> +#define _LINUX_UNWIND_USER_DEFERRED_H
> +
> +#include <linux/task_work.h>
> +#include <linux/unwind_user.h>
> +#include <linux/unwind_deferred_types.h>
> +
> +struct unwind_work;
> +
> +typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stacktrace *trace, u64 cookie);
> +
> +struct unwind_work {
> + struct callback_head work;
> + unwind_callback_t func;
> + int pending;
> +};
This is a lot of information to keep around per instance.
I'm not sure it would be OK to have a single unwind_work per perf-event
for perf. I suspect it may need to be per perf-event X per-task if a
perf-event can be associated to more than a single task (not sure ?).
For LTTng, we'd have to consider something similar because of multi-session
support. Either we'd have one unwind_work per-session X per-task, or we'd
need to multiplex this internally within LTTng-modules. None of this is
ideal in terms of memory footprint.
We should look at what part of this information can be made static/global
and what part is task-local, so we minimize the amount of redundant data
per-task (memory footprint).
AFAIU, most of that unwind_work information is global:
- work,
- func,
And could be registered dynamically by the tracer when it enables
tracing with an interest on stack walking.
At registration, we can allocate a descriptor ID (with a limited bounded
max number, configurable). This would associate a work+func to a given
ID, and keep track of this in a global table (indexed by ID).
I suspect that the only thing we really want to keep track of per-task
is the pending bit, and what is the ID of the unwind_work associated.
This could be kept, per-task, in either:
- a bitmap of pending bits, indexed by ID, or
- an array of pending IDs.
Unregistration of unwind_work could iterate on all tasks and clear the
pending bit or ID associated with the unregistered work, to make sure
we don't trigger unrelated work after a re-use.
> +
> +#ifdef CONFIG_UNWIND_USER
> +
> +void unwind_task_init(struct task_struct *task);
> +void unwind_task_free(struct task_struct *task);
> +
> +void unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
> +int unwind_deferred_request(struct unwind_work *work, u64 *cookie);
> +bool unwind_deferred_cancel(struct task_struct *task, struct unwind_work *work);
> +
> +static __always_inline void unwind_enter_from_user_mode(void)
> +{
> + current->unwind_info.cookie = 0;
> +}
> +
> +#else /* !CONFIG_UNWIND_USER */
> +
> +static inline void unwind_task_init(struct task_struct *task) {}
> +static inline void unwind_task_free(struct task_struct *task) {}
> +
> +static inline void unwind_deferred_init(struct unwind_work *work, unwind_callback_t func) {}
> +static inline int unwind_deferred_request(struct task_struct *task, struct unwind_work *work, u64 *cookie) { return -ENOSYS; }
> +static inline bool unwind_deferred_cancel(struct task_struct *task, struct unwind_work *work) { return false; }
> +
> +static inline void unwind_enter_from_user_mode(void) {}
> +
> +#endif /* !CONFIG_UNWIND_USER */
> +
> +#endif /* _LINUX_UNWIND_USER_DEFERRED_H */
> diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
> new file mode 100644
> index 000000000000..9749824aea09
> --- /dev/null
> +++ b/include/linux/unwind_deferred_types.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
> +#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
> +
> +struct unwind_task_info {
> + unsigned long *entries;
> + u64 cookie;
> +};
> +
> +#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 88753f8bbdd3..c9a954af72a1 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -106,6 +106,7 @@
> #include <linux/pidfs.h>
> #include <linux/tick.h>
> #include <linux/sframe.h>
> +#include <linux/unwind_deferred.h>
>
> #include <asm/pgalloc.h>
> #include <linux/uaccess.h>
> @@ -973,6 +974,7 @@ void __put_task_struct(struct task_struct *tsk)
> WARN_ON(refcount_read(&tsk->usage));
> WARN_ON(tsk == current);
>
> + unwind_task_free(tsk);
> sched_ext_free(tsk);
> io_uring_free(tsk);
> cgroup_free(tsk);
> @@ -2370,6 +2372,8 @@ __latent_entropy struct task_struct *copy_process(
> p->bpf_ctx = NULL;
> #endif
>
> + unwind_task_init(p);
> +
> /* Perform scheduler related setup. Assign this task to a CPU. */
> retval = sched_fork(clone_flags, p);
> if (retval)
> diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
> index f70380d7a6a6..146038165865 100644
> --- a/kernel/unwind/Makefile
> +++ b/kernel/unwind/Makefile
> @@ -1,2 +1,2 @@
> - obj-$(CONFIG_UNWIND_USER) += user.o
> + obj-$(CONFIG_UNWIND_USER) += user.o deferred.o
> obj-$(CONFIG_HAVE_UNWIND_USER_SFRAME) += sframe.o
> diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
> new file mode 100644
> index 000000000000..f0dbe4069247
> --- /dev/null
> +++ b/kernel/unwind/deferred.c
> @@ -0,0 +1,178 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> +* Deferred user space unwinding
> +*/
> +#include <linux/kernel.h>
> +#include <linux/sched.h>
> +#include <linux/sched/task_stack.h>
> +#include <linux/sframe.h>
> +#include <linux/slab.h>
> +#include <linux/task_work.h>
> +#include <linux/mm.h>
> +#include <linux/unwind_deferred.h>
> +
> +#define UNWIND_MAX_ENTRIES 512
> +
> +/* entry-from-user counter */
> +static DEFINE_PER_CPU(u64, unwind_ctx_ctr);
> +
> +/*
> + * The context cookie is a unique identifier which allows post-processing to
> + * correlate kernel trace(s) with user unwinds. The high 12 bits are the CPU
> + * id; the lower 48 bits are a per-CPU entry counter.
> + */
> +static u64 ctx_to_cookie(u64 cpu, u64 ctx)
> +{
> + BUILD_BUG_ON(NR_CPUS > 65535);
2^12 = 4k, not 64k. Perhaps you mean to reserve 16 bits
for cpu numbers ?
> + return (ctx & ((1UL << 48) - 1)) | (cpu << 48);
Perhaps use ilog2(NR_CPUS) instead for the number of bits to use
rather than hard code 12 ?
> +}
> +
> +/*
> + * Read the task context cookie, first initializing it if this is the first
> + * call to get_cookie() since the most recent entry from user.
> + */
> +static u64 get_cookie(struct unwind_task_info *info)
> +{
> + u64 ctx_ctr;
> + u64 cookie;
> + u64 cpu;
> +
> + guard(irqsave)();
> +
> + cookie = info->cookie;
> + if (cookie)
> + return cookie;
> +
> +
> + cpu = raw_smp_processor_id();
> + ctx_ctr = __this_cpu_inc_return(unwind_ctx_ctr);
> + info->cookie = ctx_to_cookie(cpu, ctx_ctr);
> +
> + return cookie;
> +
> +}
> +
> +static void unwind_deferred_task_work(struct callback_head *head)
> +{
> + struct unwind_work *work = container_of(head, struct unwind_work, work);
> + struct unwind_task_info *info = &current->unwind_info;
> + struct unwind_stacktrace trace;
> + u64 cookie;
> +
> + if (WARN_ON_ONCE(!work->pending))
> + return;
> +
> + /*
> + * From here on out, the callback must always be called, even if it's
> + * just an empty trace.
> + */
> +
> + cookie = get_cookie(info);
> +
> + /* Check for task exit path. */
> + if (!current->mm)
> + goto do_callback;
> +
> + if (!info->entries) {
> + info->entries = kmalloc(UNWIND_MAX_ENTRIES * sizeof(long),
> + GFP_KERNEL);
> + if (!info->entries)
> + goto do_callback;
> + }
> +
> + trace.entries = info->entries;
> + trace.nr = 0;
> + unwind_user(&trace, UNWIND_MAX_ENTRIES);
> +
> +do_callback:
> + work->func(work, &trace, cookie);
> + work->pending = 0;
> +}
> +
> +/*
> + * Schedule a user space unwind to be done in task work before exiting the
> + * kernel.
> + *
> + * The returned cookie output is a unique identifer for the current task entry
identifier
Thanks,
Mathieu
> + * context. Its value will also be passed to the callback function. It can be
> + * used to stitch kernel and user stack traces together in post-processing.
> + *
> + * It's valid to call this function multiple times for the same @work within
> + * the same task entry context. Each call will return the same cookie. If the
> + * callback is already pending, an error will be returned along with the
> + * cookie. If the callback is not pending because it has already been
> + * previously called for the same entry context, it will be called again with
> + * the same stack trace and cookie.
> + *
> + * Thus are three possible return scenarios:
> + *
> + * * return != 0, *cookie == 0: the operation failed, no pending callback.
> + *
> + * * return != 0, *cookie != 0: the callback is already pending. The cookie
> + * can still be used to correlate with the pending callback.
> + *
> + * * return == 0, *cookie != 0: the callback queued successfully. The
> + * callback is guaranteed to be called with the given cookie.
> + */
> +int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
> +{
> + struct unwind_task_info *info = &current->unwind_info;
> + int ret;
> +
> + *cookie = 0;
> +
> + if (WARN_ON_ONCE(in_nmi()))
> + return -EINVAL;
> +
> + if (!current->mm || !user_mode(task_pt_regs(current)))
> + return -EINVAL;
> +
> + guard(irqsave)();
> +
> + *cookie = get_cookie(info);
> +
> + /* callback already pending? */
> + if (work->pending)
> + return -EEXIST;
> +
> + ret = task_work_add(current, &work->work, TWA_RESUME);
> + if (WARN_ON_ONCE(ret))
> + return ret;
> +
> + work->pending = 1;
> +
> + return 0;
> +}
> +
> +bool unwind_deferred_cancel(struct task_struct *task, struct unwind_work *work)
> +{
> + bool ret;
> +
> + ret = task_work_cancel(task, &work->work);
> + if (ret)
> + work->pending = 0;
> +
> + return ret;
> +}
> +
> +void unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
> +{
> + memset(work, 0, sizeof(*work));
> +
> + init_task_work(&work->work, unwind_deferred_task_work);
> + work->func = func;
> +}
> +
> +void unwind_task_init(struct task_struct *task)
> +{
> + struct unwind_task_info *info = &task->unwind_info;
> +
> + memset(info, 0, sizeof(*info));
> +}
> +
> +void unwind_task_free(struct task_struct *task)
> +{
> + struct unwind_task_info *info = &task->unwind_info;
> +
> + kfree(info->entries);
> +}
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 01/39] task_work: Fix TWA_NMI_CURRENT error handling
2025-01-22 12:28 ` Peter Zijlstra
@ 2025-01-22 20:47 ` Josh Poimboeuf
2025-01-23 8:14 ` Peter Zijlstra
2025-04-22 16:14 ` Steven Rostedt
1 sibling, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 20:47 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, Jan 22, 2025 at 01:28:21PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2025 at 06:30:53PM -0800, Josh Poimboeuf wrote:
> > It's possible for irq_work_queue() to fail if the work has already been
> > claimed. That can happen if a TWA_NMI_CURRENT task work is requested
> > before a previous TWA_NMI_CURRENT IRQ work on the same CPU has gotten a
> > chance to run.
>
> I'm confused, if it fails then it's already pending, and we'll get the
> notification already. You can still add the work.
Yeah, I suppose that makes sense. If the pending irq_work is already
going to set TIF_NOTIFY_RESUME anyway, there's no need to do that again.
> > The error has to be checked before the write to task->task_works. Also
> > the try_cmpxchg() loop isn't needed in NMI context. The TWA_NMI_CURRENT
> > case really is special, keep things simple by keeping its code all
> > together in one place.
>
> NMIs can nest,
Just for my understanding: for nested NMIs, the entry code basically
queues up the next NMI, so the C handler (exc_nmi) can't nest. Right?
> consider #DB (which is NMI like)
What exactly do you mean by "NMI like"? Is it because a #DB might be
basically running in NMI context, if the NMI hit a breakpoint?
> doing task_work_add() and getting interrupted with NMI doing the same.
How exactly would that work? At least with my patch the #DB wouldn't be
able to use TWA_NMI_CURRENT unless in_nmi() were true due to NMI hitting
a breakpoint. In which case a nested NMI wouldn't actually nest, it
would get "queued" by the entry code.
But yeah, I do see how the reverse can be true: somebody sets a
breakpoint in task_work, right where it's fiddling with the list head.
NMI calls task_work_add(TWA_NMI_CURRENT), triggering the #DB, which also
calls task_work_add().
--
Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 02/39] task_work: Fix TWA_NMI_CURRENT race with __schedule()
2025-01-22 12:42 ` Peter Zijlstra
@ 2025-01-22 21:03 ` Josh Poimboeuf
2025-01-22 22:14 ` Josh Poimboeuf
2025-04-22 16:15 ` Steven Rostedt
1 sibling, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 21:03 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, Jan 22, 2025 at 01:42:28PM +0100, Peter Zijlstra wrote:
> So I'm a little confused, isn't something like this sufficient?
>
> If we hit before schedule(), all just works as expected, if we hit after
> schedule(), the task will already have the TIF flag set, and we'll hit
> the return to user path once it gets scheduled again.
>
> ---
> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index c969f1f26be5..155549c017b2 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -9,7 +9,12 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
> #ifdef CONFIG_IRQ_WORK
> static void task_work_set_notify_irq(struct irq_work *entry)
> {
> - test_and_set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
> + /*
> + * no-op IPI
> + *
> + * TWA_NMI_CURRENT will already have set the TIF flag, all
> + * this interrupt does it tickle the return-to-user path.
> + */
> }
> static DEFINE_PER_CPU(struct irq_work, irq_work_NMI_resume) =
> IRQ_WORK_INIT_HARD(task_work_set_notify_irq);
> @@ -98,6 +103,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
> break;
> #ifdef CONFIG_IRQ_WORK
> case TWA_NMI_CURRENT:
> + set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
> irq_work_queue(this_cpu_ptr(&irq_work_NMI_resume));
> break;
> #endif
Yeah, that looks so much better...
The self-IPI is only needed when the NMI happened in user space, right?
Would it make sense to have an optimized version of that?
--
Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 14/39] perf/x86: Rename get_segment_base() and make it global
2025-01-22 12:51 ` Peter Zijlstra
@ 2025-01-22 21:37 ` Josh Poimboeuf
0 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 21:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, Jan 22, 2025 at 01:51:54PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2025 at 06:31:06PM -0800, Josh Poimboeuf wrote:
> > get_segment_base() will be used by the unwind_user code, so make it
> > global and rename it so it doesn't conflict with a KVM function of the
> > same name.
>
> Should it not also get moved out of the perf code in this case?
True, I'll try to find it a better home.
--
Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-22 13:37 ` Peter Zijlstra
2025-01-22 14:16 ` Peter Zijlstra
@ 2025-01-22 21:38 ` Josh Poimboeuf
1 sibling, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 21:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, Jan 22, 2025 at 02:37:30PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2025 at 06:31:20PM -0800, Josh Poimboeuf wrote:
> > +/*
> > + * The context cookie is a unique identifier which allows post-processing to
> > + * correlate kernel trace(s) with user unwinds. The high 12 bits are the CPU
>
> s/12/16/ ?
Oops. Code is right, comment is wrong.
>
> > + * id; the lower 48 bits are a per-CPU entry counter.
> > + */
> > +static u64 ctx_to_cookie(u64 cpu, u64 ctx)
> > +{
> > + BUILD_BUG_ON(NR_CPUS > 65535);
> > + return (ctx & ((1UL << 48) - 1)) | (cpu << 48);
> > +}
--
Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-22 13:44 ` Peter Zijlstra
@ 2025-01-22 21:52 ` Josh Poimboeuf
0 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 21:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, Jan 22, 2025 at 02:44:20PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2025 at 06:31:20PM -0800, Josh Poimboeuf wrote:
>
> > +/* entry-from-user counter */
> > +static DEFINE_PER_CPU(u64, unwind_ctx_ctr);
>
> AFAICT from the below, this thing does *not* count entry-from-user. It
> might count a subset, but I need to stare longer.
Right, it's a subset. Something like so:
/*
* This is a unique percpu identifier for a given task entry context.
* Conceptually, it's incremented every time the CPU enters the kernel from
* user space, so that each "entry context" on the CPU gets a unique ID. In
* reality, as an optimization, it's only incremented on demand for the first
* deferred unwind request after a given entry-from-user.
*
* It's combined with the CPU id to make a systemwide-unique "context cookie".
*/
static DEFINE_PER_CPU(u64, unwind_ctx_ctr);
--
Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 02/39] task_work: Fix TWA_NMI_CURRENT race with __schedule()
2025-01-22 21:03 ` Josh Poimboeuf
@ 2025-01-22 22:14 ` Josh Poimboeuf
2025-01-23 8:15 ` Peter Zijlstra
0 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 22:14 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, Jan 22, 2025 at 01:03:42PM -0800, Josh Poimboeuf wrote:
> The self-IPI is only needed when the NMI happened in user space, right?
> Would it make sense to have an optimized version of that?
Actually, maybe not, that could be tricky if the NMI hits in the kernel
after task work runs.
--
Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 29/39] unwind_user/deferred: Add unwind cache
2025-01-22 13:57 ` Peter Zijlstra
@ 2025-01-22 22:36 ` Josh Poimboeuf
2025-01-23 8:31 ` Peter Zijlstra
0 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 22:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, Jan 22, 2025 at 02:57:00PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2025 at 06:31:21PM -0800, Josh Poimboeuf wrote:
> > Cache the results of the unwind to ensure the unwind is only performed
> > once, even when called by multiple tracers.
> >
> > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> > ---
> > include/linux/unwind_deferred_types.h | 8 +++++++-
> > kernel/unwind/deferred.c | 26 ++++++++++++++++++++------
> > 2 files changed, 27 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
> > index 9749824aea09..6f71a06329fb 100644
> > --- a/include/linux/unwind_deferred_types.h
> > +++ b/include/linux/unwind_deferred_types.h
> > @@ -2,8 +2,14 @@
> > #ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
> > #define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
> >
> > -struct unwind_task_info {
> > +struct unwind_cache {
> > unsigned long *entries;
> > + unsigned int nr_entries;
> > + u64 cookie;
> > +};
>
> If you make the return to user path clear nr_entries you don't need a
> second cookie field I think.
But if the NMI happens late in the exit-to-user path, with IRQs
disabled, right before nr_entries gets cleared, the cache won't get
used in the task work.
However I think we can clear it on entry-from-user.
--
Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 30/39] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-01-22 14:15 ` Peter Zijlstra
@ 2025-01-22 22:49 ` Josh Poimboeuf
2025-01-23 8:40 ` Peter Zijlstra
0 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 22:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, Jan 22, 2025 at 03:15:05PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2025 at 06:31:22PM -0800, Josh Poimboeuf wrote:
> Oh gawd. Can we please do something simple like:
>
> guard(irqsave)();
> cpu = raw_smp_processor_id();
> ctr = __this_cpu_read(unwind_ctx_ctr);
Don't you need a compiler barrier here? __this_cpu_read() doesn't have
one.
> cookie = READ_ONCE(current->unwind_info.cookie);
> do {
> if (cookie)
> return cookie;
> cookie = ctx_to_cookie(cpu, ctr+1);
> } while (!try_cmpxchg64(&current->unwind_info.cookie, &cookie, cookie));
> __this_cpu_write(unwind_ctx_ctr, ctr+1);
> return cookie;
I was trying to avoid the overhead of the cmpxchg.
But also, the nmi_cookie is still needed for the case where the NMI
arrives before info->cookie gets cleared by early entry-from-user.
--
Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-22 14:16 ` Peter Zijlstra
@ 2025-01-22 22:51 ` Josh Poimboeuf
2025-01-23 8:17 ` Peter Zijlstra
0 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 22:51 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, Jan 22, 2025 at 03:16:16PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2025 at 02:37:30PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 21, 2025 at 06:31:20PM -0800, Josh Poimboeuf wrote:
> > > +/*
> > > + * The context cookie is a unique identifier which allows post-processing to
> > > + * correlate kernel trace(s) with user unwinds. The high 12 bits are the CPU
> >
> > s/12/16/ ?
> >
> > > + * id; the lower 48 bits are a per-CPU entry counter.
> > > + */
> > > +static u64 ctx_to_cookie(u64 cpu, u64 ctx)
> > > +{
> > > + BUILD_BUG_ON(NR_CPUS > 65535);
> > > + return (ctx & ((1UL << 48) - 1)) | (cpu << 48);
> > > +}
>
> Also, I have to note that 0 is a valid return value here, which will
> give a ton of fun.
The ctx_ctr is always incremented before calling this, so 0 isn't a
valid cookie.
--
Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 30/39] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-01-22 14:24 ` Peter Zijlstra
@ 2025-01-22 22:52 ` Josh Poimboeuf
2025-01-23 8:42 ` Peter Zijlstra
0 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 22:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, Jan 22, 2025 at 03:24:18PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2025 at 06:31:22PM -0800, Josh Poimboeuf wrote:
> > +static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *cookie)
> > +{
> > + struct unwind_task_info *info = &current->unwind_info;
> > + bool inited_cookie = false;
> > + int ret;
> > +
> > + *cookie = info->cookie;
> > + if (!*cookie) {
> > + /*
> > + * This is the first unwind request since the most recent entry
> > + * from user. Initialize the task cookie.
> > + *
> > + * Don't write to info->cookie directly, otherwise it may get
> > + * cleared if the NMI occurred in the kernel during early entry
> > + * or late exit before the task work gets to run. Instead, use
> > + * info->nmi_cookie which gets synced later by get_cookie().
> > + */
> > + if (!info->nmi_cookie) {
> > + u64 cpu = raw_smp_processor_id();
> > + u64 ctx_ctr;
> > +
> > + ctx_ctr = __this_cpu_inc_return(unwind_ctx_ctr);
>
> __this_cpu_inc_return() is *NOT* NMI safe IIRC.
Hm, I guess I was only looking at x86.
--
Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 34/39] perf: Skip user unwind if !current->mm
2025-01-22 14:29 ` Peter Zijlstra
@ 2025-01-22 23:08 ` Josh Poimboeuf
2025-01-23 8:44 ` Peter Zijlstra
0 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-22 23:08 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, Jan 22, 2025 at 03:29:10PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2025 at 06:31:26PM -0800, Josh Poimboeuf wrote:
> > If the task doesn't have any memory, there's no stack to unwind.
> >
> > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> > ---
> > kernel/events/core.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 99f0f28feeb5..a886bb83f4d0 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -7792,7 +7792,7 @@ struct perf_callchain_entry *
> > perf_callchain(struct perf_event *event, struct pt_regs *regs)
> > {
> > bool kernel = !event->attr.exclude_callchain_kernel;
> > - bool user = !event->attr.exclude_callchain_user;
> > + bool user = !event->attr.exclude_callchain_user && current->mm;
>
> What about things like io_uring helpers, don't they keep ->mm but are
> never in userspace?
Hm, that's news to me. At least this patch doesn't make things worse in
that regard.
IIRC, user_regs(task_pt_regs(current)) returns false for kthreads
because the regs area is cleared. We could add in a user_regs() check?
--
Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-22 20:13 ` Mathieu Desnoyers
@ 2025-01-23 4:05 ` Josh Poimboeuf
2025-01-23 8:25 ` Peter Zijlstra
0 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-23 4:05 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski, Masami Hiramatsu,
Weinan Liu
On Wed, Jan 22, 2025 at 03:13:10PM -0500, Mathieu Desnoyers wrote:
> > +struct unwind_work {
> > + struct callback_head work;
> > + unwind_callback_t func;
> > + int pending;
> > +};
>
> This is a lot of information to keep around per instance.
>
> I'm not sure it would be OK to have a single unwind_work per perf-event
> for perf. I suspect it may need to be per perf-event X per-task if a
> perf-event can be associated to more than a single task (not sure ?).
For "perf record -g <command>", it seems to be one event per task.
Incidentally this is the mode where I did my perf testing :-/
But looking at it now, a global "perf record -g" appears to use one
event per CPU. So if a task requests an unwind and then schedules out
before returning to user space, any subsequent tasks trying to unwind on
that CPU would be blocked until the original task returned to user. So
yeah, that's definitely a problem.
Actually a per-CPU unwind_work descriptor could conceivably work if we
were able to unwind at schedule() time.
But Steve pointed out that wouldn't work so well if the task isn't in
RUNNING state.
However... would it be a horrible idea for 'next' to unwind 'prev' after
the context switch???
> For LTTng, we'd have to consider something similar because of multi-session
> support. Either we'd have one unwind_work per-session X per-task, or we'd
> need to multiplex this internally within LTTng-modules. None of this is
> ideal in terms of memory footprint.
>
> We should look at what part of this information can be made static/global
> and what part is task-local, so we minimize the amount of redundant data
> per-task (memory footprint).
>
> AFAIU, most of that unwind_work information is global:
>
> - work,
> - func,
>
> And could be registered dynamically by the tracer when it enables
> tracing with an interest on stack walking.
>
> At registration, we can allocate a descriptor ID (with a limited bounded
> max number, configurable). This would associate a work+func to a given
> ID, and keep track of this in a global table (indexed by ID).
>
> I suspect that the only thing we really want to keep track of per-task
> is the pending bit, and what is the ID of the unwind_work associated.
> This could be kept, per-task, in either:
>
> - a bitmap of pending bits, indexed by ID, or
> - an array of pending IDs.
That's basically what I was doing before. The per-task state also had:
- 'struct callback_head work' for doing the task work. A single work
function was used to multiplex the callbacks, as opposed to the
current patches where each descriptor gets its own separate
task_work.
- 'void *privs[UNWIND_MAX_CALLBACKS]' opaque data pointers. Maybe
some callbacks don't need that, but perf needed it for the 'event'
pointer. For 32 max callbacks that's 256 bytes per task.
- 'u64 last_cookies[UNWIND_MAX_CALLBACKS]' to prevent a callback from
getting called twice. But actually that may have been overkill, it
should be fine to call the callback again with the cached stack
trace. The tracer could instead have its own policy for how to
handle dupes.
- 'unsigned int work_pending' to designate whether the task_work is
pending. Also probably not necessary, the pending bits could serve
the same purpose.
So it had more concurrency to deal with, to handle the extra per-task
state.
It also had a global array of callbacks, which used a mutex and SRCU to
coordinate between the register/unregister and the task work.
Another major issue was that it wasn't NMI-safe due to all the shared
state. So a tracer in NMI would have to schedule an IRQ to call
unwind_deferred_request(). Not only is that a pain for the tracers,
it's problematic in other ways:
- If the NMI occurred in schedule() with IRQs disabled, the IRQ would
actually interrupt the 'next' task. So the caller would have to
stash a 'task' pointer for the IRQ handler to read and pass to
unwind_deferred_request(). (similar to the task_work bug I found)
- Thus the deferred unwind interface would need to handle requests
from non-current, introducing a new set of concurrency issues.
- Also, while a tracer in NMI can unwind the kernel stack and send
that to a ring buffer immediately, it can't store the cookie along
with it, so there lie more tracer headaches.
Once I changed the interface to get rid of the global nastiness, all
those problems went away.
Of course that now introduces the new problem that each tracer (or
tracing event) needs some kind of per-task state. But otherwise this
new interface really simplifies things a *lot*.
Anyway, I don't have a good answer at the moment. Will marinate on it.
Maybe we could do something like allocate the unwind_work (or some
equivalent) on demand at the time of unwind request using GFP_NOWAIT or
GFP_ATOMIC or some such, then free it during the task work?
> Unregistration of unwind_work could iterate on all tasks and clear the
> pending bit or ID associated with the unregistered work, to make sure
> we don't trigger unrelated work after a re-use.
What the old unregister code did was to remove it from the global
callbacks array (with the careful use of mutex+SRCU to coordinate with
the task work). Then synchronize_srcu() before returning.
> > +/*
> > + * The context cookie is a unique identifier which allows post-processing to
> > + * correlate kernel trace(s) with user unwinds. The high 12 bits are the CPU
> > + * id; the lower 48 bits are a per-CPU entry counter.
> > + */
> > +static u64 ctx_to_cookie(u64 cpu, u64 ctx)
> > +{
> > + BUILD_BUG_ON(NR_CPUS > 65535);
>
> 2^12 = 4k, not 64k. Perhaps you mean to reserve 16 bits
> for cpu numbers ?
Yeah, here the code is right but the comment is wrong. It actually does
use 16 bits.
> > + return (ctx & ((1UL << 48) - 1)) | (cpu << 48);
>
> Perhaps use ilog2(NR_CPUS) instead for the number of bits to use
> rather than hard code 12 ?
I'm thinking I'd rather keep it simple by hard-coding the # of bits, so
as to avoid any surprises caused by edge cases.
--
Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 01/39] task_work: Fix TWA_NMI_CURRENT error handling
2025-01-22 20:47 ` Josh Poimboeuf
@ 2025-01-23 8:14 ` Peter Zijlstra
2025-01-23 17:15 ` Josh Poimboeuf
0 siblings, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-23 8:14 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, Jan 22, 2025 at 12:47:20PM -0800, Josh Poimboeuf wrote:
> On Wed, Jan 22, 2025 at 01:28:21PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 21, 2025 at 06:30:53PM -0800, Josh Poimboeuf wrote:
> > > It's possible for irq_work_queue() to fail if the work has already been
> > > claimed. That can happen if a TWA_NMI_CURRENT task work is requested
> > > before a previous TWA_NMI_CURRENT IRQ work on the same CPU has gotten a
> > > chance to run.
> >
> > I'm confused, if it fails then it's already pending, and we'll get the
> > notification already. You can still add the work.
>
> Yeah, I suppose that makes sense. If the pending irq_work is already
> going to set TIF_NOTIFY_RESUME anyway, there's no need to do that again.
>
> > > The error has to be checked before the write to task->task_works. Also
> > > the try_cmpxchg() loop isn't needed in NMI context. The TWA_NMI_CURRENT
> > > case really is special, keep things simple by keeping its code all
> > > together in one place.
> >
> > NMIs can nest,
>
> Just for my understanding: for nested NMIs, the entry code basically
> queues up the next NMI, so the C handler (exc_nmi) can't nest. Right?
>
> > consider #DB (which is NMI like)
>
> What exactly do you mean by "NMI like"? Is it because a #DB might be
> basically running in NMI context, if the NMI hit a breakpoint?
No, #DB, #BP and such are considered NMI (and will have in_nmi() true)
because they can trigger anywhere, including sections where IRQs are
disabled.
> > doing task_work_add() and getting interrupted with NMI doing the same.
>
> How exactly would that work? At least with my patch the #DB wouldn't be
> able to use TWA_NMI_CURRENT unless in_nmi() were true
It is, see exc_debug_kernel() doing irqentry_nmi_enter().
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 02/39] task_work: Fix TWA_NMI_CURRENT race with __schedule()
2025-01-22 22:14 ` Josh Poimboeuf
@ 2025-01-23 8:15 ` Peter Zijlstra
0 siblings, 0 replies; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-23 8:15 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, Jan 22, 2025 at 02:14:30PM -0800, Josh Poimboeuf wrote:
> On Wed, Jan 22, 2025 at 01:03:42PM -0800, Josh Poimboeuf wrote:
> > The self-IPI is only needed when the NMI happened in user space, right?
> > Would it make sense to have an optimized version of that?
>
> Actually, maybe not, that could be tricky if the NMI hits in the kernel
> after task work runs.
Right, I was going to say, lets keep this as simple as possible :-)
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-22 22:51 ` Josh Poimboeuf
@ 2025-01-23 8:17 ` Peter Zijlstra
2025-01-23 18:30 ` Josh Poimboeuf
0 siblings, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-23 8:17 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, Jan 22, 2025 at 02:51:27PM -0800, Josh Poimboeuf wrote:
> On Wed, Jan 22, 2025 at 03:16:16PM +0100, Peter Zijlstra wrote:
> > On Wed, Jan 22, 2025 at 02:37:30PM +0100, Peter Zijlstra wrote:
> > > On Tue, Jan 21, 2025 at 06:31:20PM -0800, Josh Poimboeuf wrote:
> > > > +/*
> > > > + * The context cookie is a unique identifier which allows post-processing to
> > > > + * correlate kernel trace(s) with user unwinds. The high 12 bits are the CPU
> > >
> > > s/12/16/ ?
> > >
> > > > + * id; the lower 48 bits are a per-CPU entry counter.
> > > > + */
> > > > +static u64 ctx_to_cookie(u64 cpu, u64 ctx)
> > > > +{
> > > > + BUILD_BUG_ON(NR_CPUS > 65535);
> > > > + return (ctx & ((1UL << 48) - 1)) | (cpu << 48);
> > > > +}
> >
> > Also, I have to note that 0 is a valid return value here, which will
> > give a ton of fun.
>
> The ctx_ctr is always incremented before calling this, so 0 isn't a
> valid cookie.
Right, so that's the problem. You're considering 0 an invalid cookie,
but ctx_to_cookie(0, 1<<48) will be a 0 cookie.
That thing *will* wrap.
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-23 4:05 ` Josh Poimboeuf
@ 2025-01-23 8:25 ` Peter Zijlstra
2025-01-23 18:43 ` Josh Poimboeuf
0 siblings, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-23 8:25 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Mathieu Desnoyers, x86, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski, Masami Hiramatsu,
Weinan Liu
On Wed, Jan 22, 2025 at 08:05:33PM -0800, Josh Poimboeuf wrote:
> However... would it be a horrible idea for 'next' to unwind 'prev' after
> the context switch???
The idea isn't terrible, but it will be all sorta of tricky.
The big immediate problem is that the CPU doing the context switch
loses control over prev at:
__schedule()
context_switch()
finish_task_switch()
finish_task()
smp_store_release(&prev->on_cpu, 0);
And this is before we drop rq->lock.
The instruction after that store, another CPU is free to claim the task
and run with it. Notably, another CPU might already be spin-waiting on
that state, trying to wake the task back up.
By the time we get to a schedulable context, @prev is completely out of
bounds.
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 29/39] unwind_user/deferred: Add unwind cache
2025-01-22 22:36 ` Josh Poimboeuf
@ 2025-01-23 8:31 ` Peter Zijlstra
2025-01-23 18:45 ` Josh Poimboeuf
0 siblings, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-23 8:31 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, Jan 22, 2025 at 02:36:25PM -0800, Josh Poimboeuf wrote:
> On Wed, Jan 22, 2025 at 02:57:00PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 21, 2025 at 06:31:21PM -0800, Josh Poimboeuf wrote:
> > > Cache the results of the unwind to ensure the unwind is only performed
> > > once, even when called by multiple tracers.
> > >
> > > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> > > ---
> > > include/linux/unwind_deferred_types.h | 8 +++++++-
> > > kernel/unwind/deferred.c | 26 ++++++++++++++++++++------
> > > 2 files changed, 27 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
> > > index 9749824aea09..6f71a06329fb 100644
> > > --- a/include/linux/unwind_deferred_types.h
> > > +++ b/include/linux/unwind_deferred_types.h
> > > @@ -2,8 +2,14 @@
> > > #ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
> > > #define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
> > >
> > > -struct unwind_task_info {
> > > +struct unwind_cache {
> > > unsigned long *entries;
> > > + unsigned int nr_entries;
> > > + u64 cookie;
> > > +};
> >
> > If you make the return to user path clear nr_entries you don't need a
> > second cookie field I think.
>
> But if the NMI happens late in the exit-to-user path, with IRQs
> disabled, right before nr_entries gets cleared, the cache won't get
> used in the task work.
>
> However I think we can clear it on entry-from-user.
Return to user runs with interrupts disabled; if an NMI hits that, it
will have to set TIF_NOTIFY_RESUME again and queue the IRQ work thing.
That self-IPI will hit the moment we do IRET (which is what re-enables
interrupts) and we're going back into the kernel.
Anyway, I suppose that is a long way of saying that you should be able
to do this on return to user.
But yes, enter-from-user should work too.
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 30/39] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-01-22 22:49 ` Josh Poimboeuf
@ 2025-01-23 8:40 ` Peter Zijlstra
2025-01-23 19:48 ` Josh Poimboeuf
0 siblings, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-23 8:40 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, Jan 22, 2025 at 02:49:02PM -0800, Josh Poimboeuf wrote:
> On Wed, Jan 22, 2025 at 03:15:05PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 21, 2025 at 06:31:22PM -0800, Josh Poimboeuf wrote:
> > Oh gawd. Can we please do something simple like:
> >
> > guard(irqsave)();
> > cpu = raw_smp_processor_id();
> > ctr = __this_cpu_read(unwind_ctx_ctr);
>
> Don't you need a compiler barrier here? __this_cpu_read() doesn't have
> one.
What for?
> > cookie = READ_ONCE(current->unwind_info.cookie);
> > do {
> > if (cookie)
> > return cookie;
> > cookie = ctx_to_cookie(cpu, ctr+1);
> > } while (!try_cmpxchg64(&current->unwind_info.cookie, &cookie, cookie));
> > __this_cpu_write(unwind_ctx_ctr, ctr+1);
> > return cookie;
>
> I was trying to avoid the overhead of the cmpxchg.
We're going to be doing userspace stack unwinding, I don't think
overhead is a real concern.
> But also, the nmi_cookie is still needed for the case where the NMI
> arrives before info->cookie gets cleared by early entry-from-user.
So how about we clear cookie (and set nr_entries to -1) at
return-to-user, after we've done the work loop and have interrupts
disabled until we hit userspace.
Any NMI that hits there will have to cause another entry anyway.
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 30/39] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-01-22 22:52 ` Josh Poimboeuf
@ 2025-01-23 8:42 ` Peter Zijlstra
0 siblings, 0 replies; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-23 8:42 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, Jan 22, 2025 at 02:52:57PM -0800, Josh Poimboeuf wrote:
> On Wed, Jan 22, 2025 at 03:24:18PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 21, 2025 at 06:31:22PM -0800, Josh Poimboeuf wrote:
> > > +static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *cookie)
> > > +{
> > > + struct unwind_task_info *info = &current->unwind_info;
> > > + bool inited_cookie = false;
> > > + int ret;
> > > +
> > > + *cookie = info->cookie;
> > > + if (!*cookie) {
> > > + /*
> > > + * This is the first unwind request since the most recent entry
> > > + * from user. Initialize the task cookie.
> > > + *
> > > + * Don't write to info->cookie directly, otherwise it may get
> > > + * cleared if the NMI occurred in the kernel during early entry
> > > + * or late exit before the task work gets to run. Instead, use
> > > + * info->nmi_cookie which gets synced later by get_cookie().
> > > + */
> > > + if (!info->nmi_cookie) {
> > > + u64 cpu = raw_smp_processor_id();
> > > + u64 ctx_ctr;
> > > +
> > > + ctx_ctr = __this_cpu_inc_return(unwind_ctx_ctr);
> >
> > __this_cpu_inc_return() is *NOT* NMI safe IIRC.
>
> Hm, I guess I was only looking at x86.
:-), right, so x86 is rather special here; the various RISC platforms are
what you should be looking at, e.g. arm64.
* Re: [PATCH v4 34/39] perf: Skip user unwind if !current->mm
2025-01-22 23:08 ` Josh Poimboeuf
@ 2025-01-23 8:44 ` Peter Zijlstra
0 siblings, 0 replies; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-23 8:44 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, Jan 22, 2025 at 03:08:38PM -0800, Josh Poimboeuf wrote:
> On Wed, Jan 22, 2025 at 03:29:10PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 21, 2025 at 06:31:26PM -0800, Josh Poimboeuf wrote:
> > > If the task doesn't have any memory, there's no stack to unwind.
> > >
> > > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> > > ---
> > > kernel/events/core.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > index 99f0f28feeb5..a886bb83f4d0 100644
> > > --- a/kernel/events/core.c
> > > +++ b/kernel/events/core.c
> > > @@ -7792,7 +7792,7 @@ struct perf_callchain_entry *
> > > perf_callchain(struct perf_event *event, struct pt_regs *regs)
> > > {
> > > bool kernel = !event->attr.exclude_callchain_kernel;
> > > - bool user = !event->attr.exclude_callchain_user;
> > > + bool user = !event->attr.exclude_callchain_user && current->mm;
> >
> > What about things like io_uring helpers, don't they keep ->mm but are
> > never in userspace?
>
> Hm, that's news to me. At least this patch doesn't make things worse in
> that regard.
>
> IIRC, user_regs(task_pt_regs(current)) returns false for kthreads
> because the regs area is cleared. We could add in a user_regs() check?
Possibly, I've not dug into the io_worker thing in much detail yet.
* Re: [PATCH v4 01/39] task_work: Fix TWA_NMI_CURRENT error handling
2025-01-23 8:14 ` Peter Zijlstra
@ 2025-01-23 17:15 ` Josh Poimboeuf
2025-01-23 22:19 ` Peter Zijlstra
0 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-23 17:15 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Thu, Jan 23, 2025 at 09:14:03AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2025 at 12:47:20PM -0800, Josh Poimboeuf wrote:
> > What exactly do you mean by "NMI like"? Is it because a #DB might be
> > basically running in NMI context, if the NMI hit a breakpoint?
>
> No, #DB, #BP and such are considered NMI (and will have in_nmi() true)
> because they can trigger anywhere, including sections where IRQs are
> disabled.
So:
- while exceptions are technically not NMI, they're "NMI" because they
can occur in NMI or IRQ-disabled regions
- such "NMI" exceptions can be preempted by NMIs and "NMIs"
- NMIs can be preempted by "NMIs" but not NMIs (except in entry code!)
... did I get all that right? Not subtle at all!
I feel like in_nmi() needs a comment explaining all that nonobviousness.
--
Josh
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-23 8:17 ` Peter Zijlstra
@ 2025-01-23 18:30 ` Josh Poimboeuf
2025-01-23 21:58 ` Peter Zijlstra
0 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-23 18:30 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Thu, Jan 23, 2025 at 09:17:18AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2025 at 02:51:27PM -0800, Josh Poimboeuf wrote:
> > On Wed, Jan 22, 2025 at 03:16:16PM +0100, Peter Zijlstra wrote:
> > The ctx_ctr is always incremented before calling this, so 0 isn't a
> > valid cookie.
>
> Right, so that's the problem. You're considering 0 an invalid cookie,
> but ctx_to_cookie(0, 1<<48) will be a 0 cookie.
>
> That thing *will* wrap.
Well, yes, after N years of sustained very high syscall activity on CPU
0, with stack tracing enabled, in which multiple tracer unwind requests
happen to occur in the same entry context where ctx_ctr wrapped, one of
the tracers might get an invalid cookie.
I can double-increment the counter when it's ((1UL << 48) - 1). Or use
some other bit for "cookie valid".
--
Josh
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-23 8:25 ` Peter Zijlstra
@ 2025-01-23 18:43 ` Josh Poimboeuf
2025-01-23 22:13 ` Peter Zijlstra
0 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-23 18:43 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mathieu Desnoyers, x86, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski, Masami Hiramatsu,
Weinan Liu
On Thu, Jan 23, 2025 at 09:25:34AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2025 at 08:05:33PM -0800, Josh Poimboeuf wrote:
>
> > However... would it be a horrible idea for 'next' to unwind 'prev' after
> > the context switch???
>
> > The idea isn't terrible, but it will be all sorts of tricky.
>
> The big immediate problem is that the CPU doing the context switch
> > loses control over prev at:
>
> __schedule()
> context_switch()
> finish_task_switch()
> finish_task()
> smp_store_release(&prev->on_cpu, 0);
>
> And this is before we drop rq->lock.
>
> > The instruction after that store, another CPU is free to claim the task
> and run with it. Notably, another CPU might already be spin waiting on
> that state, trying to wake the task back up.
>
> By the time we get to a schedulable context, @prev is completely out of
> bounds.
Could unwind_deferred_request() call migrate_disable() or so?
How bad would it be to set some bit in @prev to prevent it from getting
rescheduled until the unwind from @next has been done? Unfortunately
two tasks would be blocked on the unwind instead of one.
BTW, this might be useful for another reason. In Steve's sframe meeting
yesterday there was some talk of BPF needing to unwind from
sched-switch, without having to wait indefinitely for @prev to get
rescheduled and return to user.
--
Josh
* Re: [PATCH v4 29/39] unwind_user/deferred: Add unwind cache
2025-01-23 8:31 ` Peter Zijlstra
@ 2025-01-23 18:45 ` Josh Poimboeuf
0 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-23 18:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Thu, Jan 23, 2025 at 09:31:31AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2025 at 02:36:25PM -0800, Josh Poimboeuf wrote:
> > On Wed, Jan 22, 2025 at 02:57:00PM +0100, Peter Zijlstra wrote:
> > > On Tue, Jan 21, 2025 at 06:31:21PM -0800, Josh Poimboeuf wrote:
> > But if the NMI happens late in the exit-to-user path, with IRQs
> > disabled, right before nr_entries gets cleared, the cache won't get
> > used in the task work.
> >
> > However I think we can clear it on entry-from-user.
>
> Return to user runs with interrupts disabled, if an NMI hits that, it
> will have to set TIF_NOTIFY_RESUME again and queue the IRQ work thing.
> That self-IPI will hit the moment we do IRET (which is what re-enables
> interrupts) and we're going back into the kernel.
>
> Anyway, I suppose that is a long way of saying that you should be able
> to do this on return to user.
Indeed, I knew that but somehow overlooked the fact that the IRQ would
clear the cookie so the cache wouldn't be usable anyway.
--
Josh
* Re: [PATCH v4 30/39] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-01-23 8:40 ` Peter Zijlstra
@ 2025-01-23 19:48 ` Josh Poimboeuf
2025-01-23 19:54 ` Josh Poimboeuf
2025-01-23 22:17 ` Peter Zijlstra
0 siblings, 2 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-23 19:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Thu, Jan 23, 2025 at 09:40:26AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2025 at 02:49:02PM -0800, Josh Poimboeuf wrote:
> > On Wed, Jan 22, 2025 at 03:15:05PM +0100, Peter Zijlstra wrote:
> > > On Tue, Jan 21, 2025 at 06:31:22PM -0800, Josh Poimboeuf wrote:
> > > Oh gawd. Can we please do something simple like:
> > >
> > > guard(irqsave)();
> > > cpu = raw_smp_processor_id();
> > > ctr = __this_cpu_read(unwind_ctx_ctr);
> >
> > Don't you need a compiler barrier here? __this_cpu_read() doesn't have
> > one.
>
> What for?
Hm, I guess it's not needed for this one.
> > > cookie = READ_ONCE(current->unwind_info.cookie);
> > > do {
> > > if (cookie)
> > > return cookie;
> > > cookie = ctx_to_cookie(cpu, ctr+1);
> > > } while (!try_cmpxchg64(&current->unwind_info.cookie, &cookie, cookie));
Should not the 2nd argument be &zero?
> > > __this_cpu_write(unwind_ctx_ctr, ctr+1);
> > > return cookie;
> > But also, the nmi_cookie is still needed for the case where the NMI
> > arrives before info->cookie gets cleared by early entry-from-user.
>
> So how about we clear cookie (and set nr_entries to -1) at
I think we could set nr_entries to 0 instead of -1?
> return-to-user, after we've done the work loop and have interrupts
> disabled until we hit userspace.
>
> Any NMI that hits there will have to cause another entry anyway.
But there's a cookie mismatch:
// return-to-user: IRQs disabled
<NMI>
current->unwind_info.cookie = 0x1234
</NMI>
unwind_exit_to_user_mode()
current->unwind_info.cookie = 0
IRET
<IRQ>
task_work()
callback(@cookie=WRONG)
--
Josh
* Re: [PATCH v4 30/39] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-01-23 19:48 ` Josh Poimboeuf
@ 2025-01-23 19:54 ` Josh Poimboeuf
2025-01-23 22:17 ` Peter Zijlstra
1 sibling, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-23 19:54 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Thu, Jan 23, 2025 at 11:48:10AM -0800, Josh Poimboeuf wrote:
> On Thu, Jan 23, 2025 at 09:40:26AM +0100, Peter Zijlstra wrote:
> > On Wed, Jan 22, 2025 at 02:49:02PM -0800, Josh Poimboeuf wrote:
> > > But also, the nmi_cookie is still needed for the case where the NMI
> > > arrives before info->cookie gets cleared by early entry-from-user.
> >
> > So how about we clear cookie (and set nr_entries to -1) at
>
> I think we could set nr_entries to 0 instead of -1?
>
> > return-to-user, after we've done the work loop and have interrupts
> > disabled until we hit userspace.
> >
> > Any NMI that hits there will have to cause another entry anyway.
>
> But there's a cookie mismatch:
>
> // return-to-user: IRQs disabled
> <NMI>
> current->unwind_info.cookie = 0x1234
> </NMI>
> unwind_exit_to_user_mode()
> current->unwind_info.cookie = 0
> IRET
> <IRQ>
> task_work()
> callback(@cookie=WRONG)
Though, assuming we're keeping the unwind_work struct, there's a simpler
alternative to nmi_cookie: store the cookie in the unwind_work. Then
the task work can just use that instead of current->unwind_info.cookie.
--
Josh
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-23 18:30 ` Josh Poimboeuf
@ 2025-01-23 21:58 ` Peter Zijlstra
0 siblings, 0 replies; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-23 21:58 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Thu, Jan 23, 2025 at 10:30:56AM -0800, Josh Poimboeuf wrote:
> On Thu, Jan 23, 2025 at 09:17:18AM +0100, Peter Zijlstra wrote:
> > On Wed, Jan 22, 2025 at 02:51:27PM -0800, Josh Poimboeuf wrote:
> > > On Wed, Jan 22, 2025 at 03:16:16PM +0100, Peter Zijlstra wrote:
> > > The ctx_ctr is always incremented before calling this, so 0 isn't a
> > > valid cookie.
> >
> > Right, so that's the problem. You're considering 0 an invalid cookie,
> > but ctx_to_cookie(0, 1<<48) will be a 0 cookie.
> >
> > That thing *will* wrap.
>
> Well, yes, after N years of sustained very high syscall activity on CPU
> 0, with stack tracing enabled, in which multiple tracer unwind requests
> happen to occur in the same entry context where ctx_ctr wrapped, one of
> the tracers might get an invalid cookie.
>
> I can double-increment the counter when it's ((1UL << 48) - 1). Or use
> some other bit for "cookie valid".
Right, steal one bit from the counter and make it always 1. A 47-bit
wraparound should be fine.
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-23 18:43 ` Josh Poimboeuf
@ 2025-01-23 22:13 ` Peter Zijlstra
2025-01-24 21:58 ` Steven Rostedt
0 siblings, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-23 22:13 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Mathieu Desnoyers, x86, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski, Masami Hiramatsu,
Weinan Liu
On Thu, Jan 23, 2025 at 10:43:05AM -0800, Josh Poimboeuf wrote:
> On Thu, Jan 23, 2025 at 09:25:34AM +0100, Peter Zijlstra wrote:
> > On Wed, Jan 22, 2025 at 08:05:33PM -0800, Josh Poimboeuf wrote:
> >
> > > However... would it be a horrible idea for 'next' to unwind 'prev' after
> > > the context switch???
> >
> > The idea isn't terrible, but it will be all sorts of tricky.
> >
> > The big immediate problem is that the CPU doing the context switch
> > loses control over prev at:
> >
> > __schedule()
> > context_switch()
> > finish_task_switch()
> > finish_task()
> > smp_store_release(&prev->on_cpu, 0);
> >
> > And this is before we drop rq->lock.
> >
> > The instruction after that store, another CPU is free to claim the task
> > and run with it. Notably, another CPU might already be spin waiting on
> > that state, trying to wake the task back up.
> >
> > By the time we get to a schedulable context, @prev is completely out of
> > bounds.
>
> Could unwind_deferred_request() call migrate_disable() or so?
That's pretty vile... and might cause performance issues. You really
don't want things to magically start behaving differently just because
you're tracing.
> How bad would it be to set some bit in @prev to prevent it from getting
> rescheduled until the unwind from @next has been done? Unfortunately
> two tasks would be blocked on the unwind instead of one.
Yeah, not going to happen. Those paths are complicated enough as is.
> BTW, this might be useful for another reason. In Steve's sframe meeting
> yesterday there was some talk of BPF needing to unwind from
> sched-switch, without having to wait indefinitely for @prev to get
> rescheduled and return to user.
-EPONIES, you cannot take faults from the middle of schedule(). They can
always use the best effort FP unwind we have today.
* Re: [PATCH v4 30/39] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-01-23 19:48 ` Josh Poimboeuf
2025-01-23 19:54 ` Josh Poimboeuf
@ 2025-01-23 22:17 ` Peter Zijlstra
2025-01-23 23:34 ` Josh Poimboeuf
1 sibling, 1 reply; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-23 22:17 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Thu, Jan 23, 2025 at 11:48:07AM -0800, Josh Poimboeuf wrote:
> > > > cookie = READ_ONCE(current->unwind_info.cookie);
> > > > do {
> > > > if (cookie)
> > > > return cookie;
> > > > cookie = ctx_to_cookie(cpu, ctr+1);
> > > > } while (!try_cmpxchg64(&current->unwind_info.cookie, &cookie, cookie));
>
> Should not the 2nd argument be &zero?
This I suppose
cookie = READ_ONCE(current->unwind_info.cookie);
do {
if (cookie)
return cookie;
} while (!try_cmpxchg64(&current->unwind_info.cookie, &cookie,
ctx_to_cookie(cpu, ctr+1)));
* Re: [PATCH v4 01/39] task_work: Fix TWA_NMI_CURRENT error handling
2025-01-23 17:15 ` Josh Poimboeuf
@ 2025-01-23 22:19 ` Peter Zijlstra
0 siblings, 0 replies; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-23 22:19 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Thu, Jan 23, 2025 at 09:15:09AM -0800, Josh Poimboeuf wrote:
> On Thu, Jan 23, 2025 at 09:14:03AM +0100, Peter Zijlstra wrote:
> > On Wed, Jan 22, 2025 at 12:47:20PM -0800, Josh Poimboeuf wrote:
> > > What exactly do you mean by "NMI like"? Is it because a #DB might be
> > > basically running in NMI context, if the NMI hit a breakpoint?
> >
> > No, #DB, #BP and such are considered NMI (and will have in_nmi() true)
> > because they can trigger anywhere, including sections where IRQs are
> > disabled.
>
> So:
>
> - while exceptions are technically not NMI, they're "NMI" because they
> can occur in NMI or IRQ-disabled regions
>
> - such "NMI" exceptions can be preempted by NMIs and "NMIs"
>
> - NMIs can be preempted by "NMIs" but not NMIs (except in entry code!)
>
> ... did I get all that right? Not subtle at all!
Yeah, sounds about right :-)
* Re: [PATCH v4 30/39] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-01-23 22:17 ` Peter Zijlstra
@ 2025-01-23 23:34 ` Josh Poimboeuf
2025-01-24 11:58 ` Peter Zijlstra
0 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-23 23:34 UTC (permalink / raw)
To: Peter Zijlstra
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Thu, Jan 23, 2025 at 11:17:34PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 23, 2025 at 11:48:07AM -0800, Josh Poimboeuf wrote:
> > > > > cookie = READ_ONCE(current->unwind_info.cookie);
> > > > > do {
> > > > > if (cookie)
> > > > > return cookie;
> > > > > cookie = ctx_to_cookie(cpu, ctr+1);
> > > > > } while (!try_cmpxchg64(&current->unwind_info.cookie, &cookie, cookie));
> >
> > Should not the 2nd argument be &zero?
>
> This I suppose
>
> cookie = READ_ONCE(current->unwind_info.cookie);
> do {
> if (cookie)
> return cookie;
> } while (!try_cmpxchg64(&current->unwind_info.cookie, &cookie,
> ctx_to_cookie(cpu, ctr+1)));
Here's what I have at the moment. unwind_deferred_request() is vastly
simplified thanks to your suggestions, no more NMI-specific muck.
work->pending is replaced with work->cookie so the task work can use it
in case it crossed the cookie-clearing boundary (return-to-user with
IRQs disabled).
static inline bool nmi_supported(void)
{
return IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG);
}
/*
* Read the task context cookie, initializing it if this is the first request
* since the most recent entry from user.
*/
static u64 get_cookie(void)
{
struct unwind_task_info *info = &current->unwind_info;
u64 ctr, cookie, old_cookie = 0;
unsigned int cpu;
cookie = READ_ONCE(info->cookie);
if (cookie)
return cookie;
guard(irqsave)();
cpu = raw_smp_processor_id();
ctr = __this_cpu_read(unwind_ctx_ctr);
cookie = ctx_to_cookie(cpu, ++ctr);
/* Make sure the cookie is non-zero. */
if (!cookie)
cookie = ctx_to_cookie(cpu, ++ctr);
/* Write the task cookie unless an NMI just now swooped in to do so. */
if (!try_cmpxchg64(&info->cookie, &old_cookie, cookie))
return old_cookie;
/* Success, update the context counter. */
__this_cpu_write(unwind_ctx_ctr, ctr);
return cookie;
}
/*
* Schedule a user space unwind to be done in task work before exiting the
* kernel.
*
* The returned cookie output is a unique identifier for the current task entry
* context. Its value will also be passed to the callback function. It can be
* used to stitch kernel and user stack traces together in post-processing.
*
* It's valid to call this function multiple times for the same @work within
* the same task entry context. Each call will return the same cookie. If the
* callback is already pending, an error will be returned along with the
* cookie. If the callback is not pending because it has already been
* previously called for the same entry context, it will be called again with
* the same stack trace and cookie.
*
* Thus there are three possible return scenarios:
*
* * return != 0, *cookie == 0: the operation failed, no pending callback.
*
* * return != 0, *cookie != 0: the callback is already pending. The cookie
* can still be used to correlate with the pending callback.
*
* * return == 0, *cookie != 0: the callback queued successfully. The
* callback is guaranteed to be called with the given cookie.
*/
int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
{
enum task_work_notify_mode mode = in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME;
u64 zero = 0;
int ret;
*cookie = 0;
if (!current->mm || !user_mode(task_pt_regs(current)))
return -EINVAL;
if (!nmi_supported() && in_nmi())
return -EINVAL;
/* Return a valid @cookie even if the callback is already pending. */
*cookie = READ_ONCE(work->cookie);
if (*cookie)
return -EEXIST;
*cookie = get_cookie();
/* Claim the work unless an NMI just now swooped in to do so. */
if (!try_cmpxchg64(&work->cookie, &zero, *cookie))
return -EEXIST;
/* The work has been claimed, now schedule it. */
ret = task_work_add(current, &work->work, mode);
if (WARN_ON_ONCE(ret)) {
WRITE_ONCE(work->cookie, 0);
return ret;
}
return 0;
}
--
Josh
* Re: [PATCH v4 30/39] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-01-23 23:34 ` Josh Poimboeuf
@ 2025-01-24 11:58 ` Peter Zijlstra
0 siblings, 0 replies; 161+ messages in thread
From: Peter Zijlstra @ 2025-01-24 11:58 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Thu, Jan 23, 2025 at 03:34:45PM -0800, Josh Poimboeuf wrote:
> Here's what I have at the moment. unwind_deferred_request() is vastly
> simplified thanks to your suggestions, no more NMI-specific muck.
> work->pending is replaced with work->cookie so the task work can use it
> in case it crossed the cookie-clearing boundary (return-to-user with
> IRQs disabled).
That looks a lot saner, thanks! I'll be waiting for the next cycle, not
sure I can track all the changes in me head.
* Re: [PATCH v4 09/39] x86/vdso: Enable sframe generation in VDSO
2025-01-22 2:31 ` [PATCH v4 09/39] x86/vdso: Enable sframe generation in VDSO Josh Poimboeuf
@ 2025-01-24 16:00 ` Jens Remus
2025-01-24 16:43 ` Josh Poimboeuf
2025-01-24 16:30 ` Jens Remus
1 sibling, 1 reply; 161+ messages in thread
From: Jens Remus @ 2025-01-24 16:00 UTC (permalink / raw)
To: Josh Poimboeuf, x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Vasily Gorbik
On 22.01.2025 03:31, Josh Poimboeuf wrote:
> Enable sframe generation in the VDSO library so kernel and user space
> can unwind through it.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
...
> diff --git a/arch/x86/include/asm/dwarf2.h b/arch/x86/include/asm/dwarf2.h
> index b195b3c8677e..1c354f648505 100644
> --- a/arch/x86/include/asm/dwarf2.h
> +++ b/arch/x86/include/asm/dwarf2.h
> @@ -12,8 +12,11 @@
> * For the vDSO, emit both runtime unwind information and debug
> * symbols for the .dbg file.
> */
> -
Nit: Deleted blank line you introduced in "[PATCH v4 05/39] x86/asm:
Avoid emitting DWARF CFI for non-VDSO".
> +#ifdef __x86_64__
#if defined(__x86_64__) && defined(CONFIG_AS_SFRAME)
AFAIK the kernel has a minimum binutils requirement of 2.25 [1]
and assembler option "--gsframe" as well as directive
".cfi_sections .sframe" were introduced with 2.40.
> + .cfi_sections .eh_frame, .debug_frame, .sframe
> +#else
> .cfi_sections .eh_frame, .debug_frame
> +#endif
>
> #define CFI_STARTPROC .cfi_startproc
> #define CFI_ENDPROC .cfi_endproc
[1]: https://docs.kernel.org/process/changes.html
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
+49-7031-16-1128 Office
jremus@de.ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
* Re: [PATCH v4 05/39] x86/asm: Avoid emitting DWARF CFI for non-VDSO
2025-01-22 2:30 ` [PATCH v4 05/39] x86/asm: Avoid emitting DWARF CFI for non-VDSO Josh Poimboeuf
@ 2025-01-24 16:08 ` Jens Remus
2025-01-24 16:47 ` Josh Poimboeuf
0 siblings, 1 reply; 161+ messages in thread
From: Jens Remus @ 2025-01-24 16:08 UTC (permalink / raw)
To: Josh Poimboeuf, x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Vasily Gorbik
On 22.01.2025 03:30, Josh Poimboeuf wrote:
> It was decided years ago that .cfi_* annotations aren't maintainable in
> the kernel. They were replaced by objtool unwind hints. For the kernel
> proper, ensure the CFI_* macros don't do anything.
>
> On the other hand the VDSO library *does* use them, so user space can
> unwind through it.
>
> Make sure these macros only work for VDSO. They aren't actually being
> used outside of VDSO anyway, so there's no functional change.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> diff --git a/arch/x86/include/asm/dwarf2.h b/arch/x86/include/asm/dwarf2.h
> -#ifndef BUILD_VDSO
> - /*
> - * Emit CFI data in .debug_frame sections, not .eh_frame sections.
> - * The latter we currently just discard since we don't do DWARF
> - * unwinding at runtime. So only the offline DWARF information is
> - * useful to anyone. Note we should not use this directive if we
> - * ever decide to enable DWARF unwinding at runtime.
> - */
> - .cfi_sections .debug_frame
> -#else
> - /*
> - * For the vDSO, emit both runtime unwind information and debug
> - * symbols for the .dbg file.
> - */
> - .cfi_sections .eh_frame, .debug_frame
> -#endif
> +#else /* !BUILD_VDSO */
> +
Did you remove ".cfi_sections .debug_frame" on purpose from the
!BUILD_VDSO path compared to V3? Presumably to not only not emit
DWARF CFI from assembler, but any source?
> +/*
> + * On x86, these macros aren't used outside VDSO. As well they shouldn't be:
> + * they're fragile and very difficult to maintain.
> + */
Thanks and regards,
Jens
--
Jens Remus
* Re: [PATCH v4 09/39] x86/vdso: Enable sframe generation in VDSO
2025-01-22 2:31 ` [PATCH v4 09/39] x86/vdso: Enable sframe generation in VDSO Josh Poimboeuf
2025-01-24 16:00 ` Jens Remus
@ 2025-01-24 16:30 ` Jens Remus
2025-01-24 16:56 ` Josh Poimboeuf
1 sibling, 1 reply; 161+ messages in thread
From: Jens Remus @ 2025-01-24 16:30 UTC (permalink / raw)
To: Josh Poimboeuf, x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Vasily Gorbik
On 22.01.2025 03:31, Josh Poimboeuf wrote:
> Enable sframe generation in the VDSO library so kernel and user space
> can unwind through it.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
> @@ -47,13 +47,17 @@ quiet_cmd_vdso2c = VDSO2C $@
> $(obj)/vdso-image-%.c: $(obj)/vdso%.so.dbg $(obj)/vdso%.so $(obj)/vdso2c FORCE
> $(call if_changed,vdso2c)
>
> +#ifdef CONFIG_AS_SFRAME
> +SFRAME_CFLAGS := -Wa$(comma)-gsframe
> +#endif
> +
You probably erroneously mixed up C preprocessor and Makefile syntax? :-)
ifeq ($(CONFIG_AS_SFRAME),y)
SFRAME_CFLAGS := -Wa,--gsframe
endif
$(comma) does not appear to be required in this context.
Regards,
Jens
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-22 2:31 ` [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface Josh Poimboeuf
` (2 preceding siblings ...)
2025-01-22 20:13 ` Mathieu Desnoyers
@ 2025-01-24 16:35 ` Jens Remus
2025-01-24 16:57 ` Josh Poimboeuf
3 siblings, 1 reply; 161+ messages in thread
From: Jens Remus @ 2025-01-24 16:35 UTC (permalink / raw)
To: Josh Poimboeuf, x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Vasily Gorbik
On 22.01.2025 03:31, Josh Poimboeuf wrote:
> diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
> +#ifdef CONFIG_UNWIND_USER
> +
> +void unwind_task_init(struct task_struct *task);
> +void unwind_task_free(struct task_struct *task);
> +
> +void unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
> +int unwind_deferred_request(struct unwind_work *work, u64 *cookie);
> +bool unwind_deferred_cancel(struct task_struct *task, struct unwind_work *work);
> +
> +static __always_inline void unwind_enter_from_user_mode(void)
> +{
> + current->unwind_info.cookie = 0;
> +}
> +
> +#else /* !CONFIG_UNWIND_USER */
> +
> +static inline void unwind_task_init(struct task_struct *task) {}
> +static inline void unwind_task_free(struct task_struct *task) {}
> +
> +static inline void unwind_deferred_init(struct unwind_work *work, unwind_callback_t func) {}
> +static inline int unwind_deferred_request(struct task_struct *task, struct unwind_work *work, u64 *cookie) { return -ENOSYS; }
static inline int unwind_deferred_request(struct unwind_work *work, u64 *cookie) { return -ENOSYS; }
Otherwise this does not compile on architectures that do not enable
UNWIND_USER.
> +static inline bool unwind_deferred_cancel(struct task_struct *task, struct unwind_work *work) { return false; }
> +
> +static inline void unwind_enter_from_user_mode(void) {}
> +
> +#endif /* !CONFIG_UNWIND_USER */
Regards,
Jens
* Re: [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents
2025-01-22 2:31 ` [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents Josh Poimboeuf
@ 2025-01-24 16:36 ` Jens Remus
2025-01-24 17:07 ` Josh Poimboeuf
2025-01-24 18:02 ` Andrii Nakryiko
` (2 subsequent siblings)
3 siblings, 1 reply; 161+ messages in thread
From: Jens Remus @ 2025-01-24 16:36 UTC (permalink / raw)
To: Josh Poimboeuf, x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On 22.01.2025 03:31, Josh Poimboeuf wrote:
> diff --git a/include/linux/sframe.h b/include/linux/sframe.h
> @@ -3,11 +3,14 @@
> #define _LINUX_SFRAME_H
>
> #include <linux/mm_types.h>
> +#include <linux/srcu.h>
> #include <linux/unwind_user_types.h>
>
> #ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
>
> struct sframe_section {
> + struct rcu_head rcu;
> +
Nit: You are adding a blank line, that you later remove with
"[PATCH v4 25/39] unwind_user/sframe: Show file name in debug output".
> unsigned long sframe_start;
> unsigned long sframe_end;
> unsigned long text_start;
Regards,
Jens
* Re: [PATCH v4 11/39] unwind_user: Add user space unwinding API
2025-01-22 2:31 ` [PATCH v4 11/39] unwind_user: Add user space unwinding API Josh Poimboeuf
@ 2025-01-24 16:41 ` Jens Remus
2025-01-24 17:09 ` Josh Poimboeuf
2025-01-24 17:59 ` Andrii Nakryiko
2025-01-24 20:02 ` Steven Rostedt
2 siblings, 1 reply; 161+ messages in thread
From: Jens Remus @ 2025-01-24 16:41 UTC (permalink / raw)
To: Josh Poimboeuf, x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Vasily Gorbik
On 22.01.2025 03:31, Josh Poimboeuf wrote:
> diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
> @@ -0,0 +1,59 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> +* Generic interfaces for unwinding user space
> +*/
> +#include <linux/kernel.h>
> +#include <linux/sched.h>
> +#include <linux/sched/task_stack.h>
> +#include <linux/unwind_user.h>
> +
> +int unwind_user_next(struct unwind_user_state *state)
> +{
> + struct unwind_user_frame _frame;
> + struct unwind_user_frame *frame = &_frame;
> + unsigned long cfa = 0, fp, ra = 0;
Why are cfa and ra initialized to zero? Where is that important in
subsequent patches?
"[PATCH v4 12/39] unwind_user: Add frame pointer support" does either
unconditionally set both cfa and ra or bail out.
> +
> + /* no implementation yet */
> + -EINVAL;
> +}
Thanks and regards,
Jens
* Re: [PATCH v4 09/39] x86/vdso: Enable sframe generation in VDSO
2025-01-24 16:00 ` Jens Remus
@ 2025-01-24 16:43 ` Josh Poimboeuf
2025-01-24 16:53 ` Josh Poimboeuf
2025-04-22 17:44 ` Steven Rostedt
0 siblings, 2 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-24 16:43 UTC (permalink / raw)
To: Jens Remus
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Vasily Gorbik
On Fri, Jan 24, 2025 at 05:00:27PM +0100, Jens Remus wrote:
> On 22.01.2025 03:31, Josh Poimboeuf wrote:
> > diff --git a/arch/x86/include/asm/dwarf2.h b/arch/x86/include/asm/dwarf2.h
> > index b195b3c8677e..1c354f648505 100644
> > --- a/arch/x86/include/asm/dwarf2.h
> > +++ b/arch/x86/include/asm/dwarf2.h
> > @@ -12,8 +12,11 @@
> > * For the vDSO, emit both runtime unwind information and debug
> > * symbols for the .dbg file.
> > */
> > -
>
> Nit: Deleted blank line you introduced in "[PATCH v4 05/39] x86/asm:
> Avoid emitting DWARF CFI for non-VDSO".
Indeed.
> > +#ifdef __x86_64__
>
> #if defined(__x86_64__) && defined(CONFIG_AS_SFRAME)
>
> AFAIK the kernel has a minimum binutils requirement of 2.25 [1]
> and assembler option "--gsframe" as well as directive
> ".cfi_sections .sframe" were introduced with 2.40.
True, I'll change it to just '#ifdef CONFIG_AS_SFRAME' since that's what
really matters (and 32-bit doesn't support it anyway).
> > + .cfi_sections .eh_frame, .debug_frame, .sframe
> > +#else
> > .cfi_sections .eh_frame, .debug_frame
> > +#endif
> > #define CFI_STARTPROC .cfi_startproc
> > #define CFI_ENDPROC .cfi_endproc
--
Josh
* Re: [PATCH v4 05/39] x86/asm: Avoid emitting DWARF CFI for non-VDSO
2025-01-24 16:08 ` Jens Remus
@ 2025-01-24 16:47 ` Josh Poimboeuf
0 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-24 16:47 UTC (permalink / raw)
To: Jens Remus
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Vasily Gorbik
On Fri, Jan 24, 2025 at 05:08:57PM +0100, Jens Remus wrote:
> On 22.01.2025 03:30, Josh Poimboeuf wrote:
> > -#ifndef BUILD_VDSO
> > - /*
> > - * Emit CFI data in .debug_frame sections, not .eh_frame sections.
> > - * The latter we currently just discard since we don't do DWARF
> > - * unwinding at runtime. So only the offline DWARF information is
> > - * useful to anyone. Note we should not use this directive if we
> > - * ever decide to enable DWARF unwinding at runtime.
> > - */
> > - .cfi_sections .debug_frame
> > -#else
> > - /*
> > - * For the vDSO, emit both runtime unwind information and debug
> > - * symbols for the .dbg file.
> > - */
> > - .cfi_sections .eh_frame, .debug_frame
> > -#endif
> > +#else /* !BUILD_VDSO */
> > +
>
> Did you remove ".cfi_sections .debug_frame" on purpose from the
> !BUILD_VDSO path compared to V3?
Yes, since non-VDSO assembly files won't be emitting any .cfi, there's
no .debug_frame to output anyway.
> Presumably to not only not emit DWARF CFI from assembler, but any
> source?
This only impacts assembly files, notice the __ASSEMBLY__ check at the
top of the file.
--
Josh
* Re: [PATCH v4 09/39] x86/vdso: Enable sframe generation in VDSO
2025-01-24 16:43 ` Josh Poimboeuf
@ 2025-01-24 16:53 ` Josh Poimboeuf
2025-04-22 17:44 ` Steven Rostedt
1 sibling, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-24 16:53 UTC (permalink / raw)
To: Jens Remus
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Vasily Gorbik
On Fri, Jan 24, 2025 at 08:43:35AM -0800, Josh Poimboeuf wrote:
> On Fri, Jan 24, 2025 at 05:00:27PM +0100, Jens Remus wrote:
> > On 22.01.2025 03:31, Josh Poimboeuf wrote:
> > > +#ifdef __x86_64__
> >
> > #if defined(__x86_64__) && defined(CONFIG_AS_SFRAME)
> >
> > AFAIK the kernel has a minimum binutils requirement of 2.25 [1]
> > and assembler option "--gsframe" as well as directive
> > ".cfi_sections .sframe" were introduced with 2.40.
>
> True, I'll change it to just '#ifdef CONFIG_AS_SFRAME' since that's what
> really matters (and 32-bit doesn't support it anyway).
Actually, just CONFIG_AS_SFRAME didn't work, as 64-bit still compiles some
32-bit asm files. I'll go with your suggestion.
--
Josh
* Re: [PATCH v4 09/39] x86/vdso: Enable sframe generation in VDSO
2025-01-24 16:30 ` Jens Remus
@ 2025-01-24 16:56 ` Josh Poimboeuf
0 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-24 16:56 UTC (permalink / raw)
To: Jens Remus
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Vasily Gorbik
On Fri, Jan 24, 2025 at 05:30:36PM +0100, Jens Remus wrote:
> On 22.01.2025 03:31, Josh Poimboeuf wrote:
> > Enable sframe generation in the VDSO library so kernel and user space
> > can unwind through it.
> >
> > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
>
> > diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
>
> > @@ -47,13 +47,17 @@ quiet_cmd_vdso2c = VDSO2C $@
> > $(obj)/vdso-image-%.c: $(obj)/vdso%.so.dbg $(obj)/vdso%.so $(obj)/vdso2c FORCE
> > $(call if_changed,vdso2c)
> > +#ifdef CONFIG_AS_SFRAME
> > +SFRAME_CFLAGS := -Wa$(comma)-gsframe
> > +#endif
> > +
>
> You probably erroneously mixed up C preprocessor and Makefile syntax? :-)
>
> ifeq ($(CONFIG_AS_SFRAME),y)
> SFRAME_CFLAGS := -Wa,--gsframe
> endif
>
> $(comma) does not appear to be required in this context.
Yeah, before it was in a Makefile macro, which needed $(comma) so the
comma wouldn't be parsed as a macro argument separator.
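For anyone curious about the distinction: inside a make macro call a
literal comma splits the arguments, so it has to be hidden in a variable;
in a plain assignment it can be written directly. A minimal illustration
(hypothetical fragment, not the actual kernel Makefile):

```make
comma := ,

# Inside $(call ...), a literal ',' separates macro arguments:
#   $(call cc-option,-Wa,--gsframe)        -> two args: "-Wa" and "--gsframe"
#   $(call cc-option,-Wa$(comma)--gsframe) -> one arg:  "-Wa,--gsframe"

# A plain assignment does no argument parsing, so the comma is literal:
ifeq ($(CONFIG_AS_SFRAME),y)
SFRAME_CFLAGS := -Wa,--gsframe
endif
```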
--
Josh
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-24 16:35 ` Jens Remus
@ 2025-01-24 16:57 ` Josh Poimboeuf
0 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-24 16:57 UTC (permalink / raw)
To: Jens Remus
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Vasily Gorbik
On Fri, Jan 24, 2025 at 05:35:37PM +0100, Jens Remus wrote:
> On 22.01.2025 03:31, Josh Poimboeuf wrote:
> > +static inline void unwind_deferred_init(struct unwind_work *work, unwind_callback_t func) {}
> > +static inline int unwind_deferred_request(struct task_struct *task, struct unwind_work *work, u64 *cookie) { return -ENOSYS; }
>
> static inline int unwind_deferred_request(struct unwind_work *work, u64 *cookie) { return -ENOSYS; }
>
> Otherwise this does not compile on architectures that do not enable
> UNWIND_USER.
Yeah, bots have reported that also, thanks.
--
Josh
* Re: [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents
2025-01-24 16:36 ` Jens Remus
@ 2025-01-24 17:07 ` Josh Poimboeuf
0 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-24 17:07 UTC (permalink / raw)
To: Jens Remus
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Fri, Jan 24, 2025 at 05:36:38PM +0100, Jens Remus wrote:
> On 22.01.2025 03:31, Josh Poimboeuf wrote:
>
> > diff --git a/include/linux/sframe.h b/include/linux/sframe.h
>
> > @@ -3,11 +3,14 @@
> > #define _LINUX_SFRAME_H
> > #include <linux/mm_types.h>
> > +#include <linux/srcu.h>
> > #include <linux/unwind_user_types.h>
> > #ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
> > struct sframe_section {
> > + struct rcu_head rcu;
> > +
>
> Nit: You are adding a blank line, that you later remove with
> "[PATCH v4 25/39] unwind_user/sframe: Show file name in debug output".
I suppose that was intentional. The original blank line created visual
separation between the rcu head and the sframe values. The later patch
instead sort of uses the ifdef to keep some separation? But yeah, I'll
keep the blank lines for consistency. <shrug>
struct sframe_section {
struct rcu_head rcu;
#ifdef CONFIG_DYNAMIC_DEBUG
const char *filename;
#endif
unsigned long sframe_start;
unsigned long sframe_end;
unsigned long text_start;
unsigned long text_end;
unsigned long fdes_start;
unsigned long fres_start;
unsigned long fres_end;
unsigned int num_fdes;
signed char ra_off;
signed char fp_off;
};
--
Josh
* Re: [PATCH v4 11/39] unwind_user: Add user space unwinding API
2025-01-24 16:41 ` Jens Remus
@ 2025-01-24 17:09 ` Josh Poimboeuf
0 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-24 17:09 UTC (permalink / raw)
To: Jens Remus
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Vasily Gorbik
On Fri, Jan 24, 2025 at 05:41:29PM +0100, Jens Remus wrote:
> On 22.01.2025 03:31, Josh Poimboeuf wrote:
> > +int unwind_user_next(struct unwind_user_state *state)
> > +{
> > + struct unwind_user_frame _frame;
> > + struct unwind_user_frame *frame = &_frame;
> > + unsigned long cfa = 0, fp, ra = 0;
>
> Why are cfa and ra initialized to zero? Where is that important in
> subsequent patches?
>
> "[PATCH v4 12/39] unwind_user: Add frame pointer support" does either
> unconditionally set both cfa and ra or bail out.
Right, probably leftovers from some previous iteration. I'll drop those.
--
Josh
* Re: [PATCH v4 11/39] unwind_user: Add user space unwinding API
2025-01-22 2:31 ` [PATCH v4 11/39] unwind_user: Add user space unwinding API Josh Poimboeuf
2025-01-24 16:41 ` Jens Remus
@ 2025-01-24 17:59 ` Andrii Nakryiko
2025-01-24 18:08 ` Josh Poimboeuf
2025-01-24 20:02 ` Steven Rostedt
2 siblings, 1 reply; 161+ messages in thread
From: Andrii Nakryiko @ 2025-01-24 17:59 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, Jan 21, 2025 at 6:32 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> Introduce a generic API for unwinding user stacks.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> ---
> arch/Kconfig | 3 ++
> include/linux/unwind_user.h | 15 ++++++++
> include/linux/unwind_user_types.h | 31 ++++++++++++++++
> kernel/Makefile | 1 +
> kernel/unwind/Makefile | 1 +
> kernel/unwind/user.c | 59 +++++++++++++++++++++++++++++++
> 6 files changed, 110 insertions(+)
> create mode 100644 include/linux/unwind_user.h
> create mode 100644 include/linux/unwind_user_types.h
> create mode 100644 kernel/unwind/Makefile
> create mode 100644 kernel/unwind/user.c
>
[...]
> --- /dev/null
> +++ b/kernel/unwind/user.c
> @@ -0,0 +1,59 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> +* Generic interfaces for unwinding user space
> +*/
> +#include <linux/kernel.h>
> +#include <linux/sched.h>
> +#include <linux/sched/task_stack.h>
> +#include <linux/unwind_user.h>
> +
> +int unwind_user_next(struct unwind_user_state *state)
> +{
> + struct unwind_user_frame _frame;
> + struct unwind_user_frame *frame = &_frame;
> + unsigned long cfa = 0, fp, ra = 0;
wouldn't all the above generate compilation warnings about unused
variables, potentially breaking bisection?
> +
> + /* no implementation yet */
> + -EINVAL;
return missing?
> +}
[...]
* Re: [PATCH v4 12/39] unwind_user: Add frame pointer support
2025-01-22 2:31 ` [PATCH v4 12/39] unwind_user: Add frame pointer support Josh Poimboeuf
@ 2025-01-24 17:59 ` Andrii Nakryiko
2025-01-24 18:16 ` Josh Poimboeuf
0 siblings, 1 reply; 161+ messages in thread
From: Andrii Nakryiko @ 2025-01-24 17:59 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, Jan 21, 2025 at 6:32 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> Add optional support for user space frame pointer unwinding. If
> supported, the arch needs to enable CONFIG_HAVE_UNWIND_USER_FP and
> define ARCH_INIT_USER_FP_FRAME.
>
> By encoding the frame offsets in struct unwind_user_frame, much of this
> code can also be reused for future unwinder implementations like sframe.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> ---
> arch/Kconfig | 4 +++
> include/asm-generic/unwind_user.h | 9 ++++++
> include/linux/unwind_user_types.h | 1 +
> kernel/unwind/user.c | 49 +++++++++++++++++++++++++++++--
> 4 files changed, 60 insertions(+), 3 deletions(-)
> create mode 100644 include/asm-generic/unwind_user.h
>
Do you plan to reuse this logic for stack unwinding done by perf
subsystem in perf_callchain_user()? See is_uprobe_at_func_entry()
parts and also fixup_uretprobe_trampoline_entries() for some of the
quirks that have to be taken into account when doing frame
pointer-based unwinding. It would be great not to lose those in this
new reimplementation.
Not sure what's the best way to avoid duplicating the logic, but I
thought I'd bring that up.
> diff --git a/arch/Kconfig b/arch/Kconfig
> index c6fa2b3ecbc6..cf996cbb8142 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -438,6 +438,10 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
> config UNWIND_USER
> bool
>
> +config HAVE_UNWIND_USER_FP
> + bool
> + select UNWIND_USER
> +
> config AS_SFRAME
> def_bool $(as-instr,.cfi_sections .sframe\n.cfi_startproc\n.cfi_endproc)
>
[...]
* Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-01-22 2:31 ` [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers Josh Poimboeuf
@ 2025-01-24 18:00 ` Andrii Nakryiko
2025-01-24 19:21 ` Josh Poimboeuf
2025-01-24 20:31 ` Indu Bhagat
0 siblings, 2 replies; 161+ messages in thread
From: Andrii Nakryiko @ 2025-01-24 18:00 UTC (permalink / raw)
To: Josh Poimboeuf, Indu Bhagat
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, Jan 21, 2025 at 6:32 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> In preparation for unwinding user space stacks with sframe, add basic
> sframe compile infrastructure and support for reading the .sframe
> section header.
>
> sframe_add_section() reads the header and unconditionally returns an
> error, so it's not very useful yet. A subsequent patch will improve
> that.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> ---
> arch/Kconfig | 3 +
> include/linux/sframe.h | 36 +++++++++++
> kernel/unwind/Makefile | 3 +-
> kernel/unwind/sframe.c | 136 +++++++++++++++++++++++++++++++++++++++++
> kernel/unwind/sframe.h | 71 +++++++++++++++++++++
> 5 files changed, 248 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/sframe.h
> create mode 100644 kernel/unwind/sframe.c
> create mode 100644 kernel/unwind/sframe.h
>
[...]
> +
> +extern int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
> + unsigned long text_start, unsigned long text_end);
> +extern int sframe_remove_section(unsigned long sframe_addr);
> +
> +#else /* !CONFIG_HAVE_UNWIND_USER_SFRAME */
> +
> +static inline int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end, unsigned long text_start, unsigned long text_end) { return -ENOSYS; }
nit: very-very long, wrap it?
> +static inline int sframe_remove_section(unsigned long sframe_addr) { return -ENOSYS; }
> +
> +#endif /* CONFIG_HAVE_UNWIND_USER_SFRAME */
> +
> +#endif /* _LINUX_SFRAME_H */
[...]
> +static int sframe_read_header(struct sframe_section *sec)
> +{
> + unsigned long header_end, fdes_start, fdes_end, fres_start, fres_end;
> + struct sframe_header shdr;
> + unsigned int num_fdes;
> +
> + if (copy_from_user(&shdr, (void __user *)sec->sframe_start, sizeof(shdr))) {
> + dbg("header usercopy failed\n");
> + return -EFAULT;
> + }
> +
> + if (shdr.preamble.magic != SFRAME_MAGIC ||
> + shdr.preamble.version != SFRAME_VERSION_2 ||
> + !(shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
probably more a question for Indu, but why is this sorting not
mandatory and part of the SFrame "standard"? How would non-sorted
FDEs realistically work in practice? Ain't nobody got time to sort them
just to unwind the stack...
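For context, the sorted-FDE guarantee is what makes an O(log n) lookup by
start address possible at unwind time; roughly something like this
(simplified sketch with stand-in types, ignoring the PCMASK repetitive-FDE
case):

```c
#include <stdint.h>

/* Simplified stand-in for struct sframe_fde: just the fields the
 * lookup needs. */
struct fde {
	int32_t  start_addr;
	uint32_t func_size;
};

/* Return the index of the FDE covering ip, or -1 if none.  Relies on
 * fdes[] being sorted by start_addr, which is exactly what the
 * SFRAME_F_FDE_SORTED flag promises. */
static int find_fde(const struct fde *fdes, int nr, int32_t ip)
{
	int lo = 0, hi = nr - 1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (ip < fdes[mid].start_addr)
			hi = mid - 1;
		else if (ip >= fdes[mid].start_addr + (int32_t)fdes[mid].func_size)
			lo = mid + 1;
		else
			return mid;
	}
	return -1;
}
```

Without the sorted flag, the only fallback would be a linear scan over
every FDE on each unwind step.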
> + shdr.auxhdr_len) {
> + dbg("bad/unsupported sframe header\n");
> + return -EINVAL;
> + }
> +
> + if (!shdr.num_fdes || !shdr.num_fres) {
given SFRAME_F_FRAME_POINTER in the header, is it really that
nonsensical and illegal to have zero FDEs/FREs? Maybe we should allow
that?
> + dbg("no fde/fre entries\n");
> + return -EINVAL;
> + }
> +
> + header_end = sec->sframe_start + SFRAME_HEADER_SIZE(shdr);
> + if (header_end >= sec->sframe_end) {
if we allow zero FDEs/FREs, header_end == sec->sframe_end is legal, right?
> + dbg("header doesn't fit in section\n");
> + return -EINVAL;
> + }
> +
> + num_fdes = shdr.num_fdes;
> + fdes_start = header_end + shdr.fdes_off;
> + fdes_end = fdes_start + (num_fdes * sizeof(struct sframe_fde));
> +
> + fres_start = header_end + shdr.fres_off;
> + fres_end = fres_start + shdr.fre_len;
> +
maybe use check_add_overflow() in all the above calculations; at least
on 32-bit arches this can all overflow, and it's not clear whether the
sanity check below detects all possible overflows
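A sketch of what that could look like, using the compiler builtins that
the kernel's check_mul_overflow()/check_add_overflow() helpers from
<linux/overflow.h> wrap (hypothetical helper name, not the actual patch):

```c
#include <stdint.h>

/* Hypothetical sketch: compute the end of the FDE array while
 * rejecting headers whose sizes would wrap a 32-bit address space.
 * In the kernel this would be check_mul_overflow()/check_add_overflow(),
 * which wrap these builtins. */
static int fde_range_end(uint32_t fdes_start, uint32_t num_fdes,
			 uint32_t fde_size, uint32_t *fdes_end)
{
	uint32_t bytes;

	if (__builtin_mul_overflow(num_fdes, fde_size, &bytes) ||
	    __builtin_add_overflow(fdes_start, bytes, fdes_end))
		return -1;	/* would wrap: reject the section */

	return 0;
}
```

A crafted header with a huge num_fdes then fails the range computation
itself instead of wrapping around and slipping past the later bounds
check.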
> + if (fres_start < fdes_end || fres_end > sec->sframe_end) {
> + dbg("inconsistent fde/fre offsets\n");
> + return -EINVAL;
> + }
> +
> + sec->num_fdes = num_fdes;
> + sec->fdes_start = fdes_start;
> + sec->fres_start = fres_start;
> + sec->fres_end = fres_end;
> +
> + sec->ra_off = shdr.cfa_fixed_ra_offset;
> + sec->fp_off = shdr.cfa_fixed_fp_offset;
> +
> + return 0;
> +}
> +
[...]
> diff --git a/kernel/unwind/sframe.h b/kernel/unwind/sframe.h
> new file mode 100644
> index 000000000000..e9bfccfaf5b4
> --- /dev/null
> +++ b/kernel/unwind/sframe.h
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * From https://www.sourceware.org/binutils/docs/sframe-spec.html
> + */
> +#ifndef _SFRAME_H
> +#define _SFRAME_H
> +
> +#include <linux/types.h>
> +
> +#define SFRAME_VERSION_1 1
> +#define SFRAME_VERSION_2 2
> +#define SFRAME_MAGIC 0xdee2
> +
> +#define SFRAME_F_FDE_SORTED 0x1
> +#define SFRAME_F_FRAME_POINTER 0x2
> +
> +#define SFRAME_ABI_AARCH64_ENDIAN_BIG 1
> +#define SFRAME_ABI_AARCH64_ENDIAN_LITTLE 2
> +#define SFRAME_ABI_AMD64_ENDIAN_LITTLE 3
> +
> +#define SFRAME_FDE_TYPE_PCINC 0
> +#define SFRAME_FDE_TYPE_PCMASK 1
> +
> +struct sframe_preamble {
> + u16 magic;
> + u8 version;
> + u8 flags;
> +} __packed;
> +
> +struct sframe_header {
> + struct sframe_preamble preamble;
> + u8 abi_arch;
> + s8 cfa_fixed_fp_offset;
> + s8 cfa_fixed_ra_offset;
> + u8 auxhdr_len;
> + u32 num_fdes;
> + u32 num_fres;
> + u32 fre_len;
> + u32 fdes_off;
> + u32 fres_off;
> +} __packed;
> +
> +#define SFRAME_HEADER_SIZE(header) \
> + ((sizeof(struct sframe_header) + header.auxhdr_len))
> +
> +#define SFRAME_AARCH64_PAUTH_KEY_A 0
> +#define SFRAME_AARCH64_PAUTH_KEY_B 1
> +
> +struct sframe_fde {
> + s32 start_addr;
> + u32 func_size;
> + u32 fres_off;
> + u32 fres_num;
> + u8 info;
> + u8 rep_size;
> + u16 padding;
> +} __packed;
I couldn't understand from SFrame itself, but why do sframe_header,
sframe_preamble, and sframe_fde have to be marked __packed, if it's
all naturally aligned (intentionally and by design)?..
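A quick check does confirm the natural-alignment point: with these field
sizes, __packed changes neither the size nor the field offsets; it only
pins the on-disk layout regardless of ABI padding rules, and drops the
struct's alignment requirement to 1 (sketch with stand-in names mirroring
sframe_preamble):

```c
#include <stdint.h>

/* Stand-ins mirroring struct sframe_preamble with and without
 * __attribute__((packed)): same fields, same natural alignment. */
struct preamble_packed {
	uint16_t magic;
	uint8_t  version;
	uint8_t  flags;
} __attribute__((packed));

struct preamble_plain {
	uint16_t magic;
	uint8_t  version;
	uint8_t  flags;
};
```

On common ABIs both are 4 bytes with identical offsets; the practical
difference is alignof() (1 vs. 2), which matters when the struct is
overlaid at an arbitrary offset in a mapped file.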
> +
> +#define SFRAME_FUNC_FRE_TYPE(data) (data & 0xf)
> +#define SFRAME_FUNC_FDE_TYPE(data) ((data >> 4) & 0x1)
> +#define SFRAME_FUNC_PAUTH_KEY(data) ((data >> 5) & 0x1)
> +
> +#define SFRAME_BASE_REG_FP 0
> +#define SFRAME_BASE_REG_SP 1
> +
> +#define SFRAME_FRE_CFA_BASE_REG_ID(data) (data & 0x1)
> +#define SFRAME_FRE_OFFSET_COUNT(data) ((data >> 1) & 0xf)
> +#define SFRAME_FRE_OFFSET_SIZE(data) ((data >> 5) & 0x3)
> +#define SFRAME_FRE_MANGLED_RA_P(data) ((data >> 7) & 0x1)
> +
> +#endif /* _SFRAME_H */
> --
> 2.48.1
>
* Re: [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents
2025-01-22 2:31 ` [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents Josh Poimboeuf
2025-01-24 16:36 ` Jens Remus
@ 2025-01-24 18:02 ` Andrii Nakryiko
2025-01-24 21:41 ` Josh Poimboeuf
2025-01-30 15:07 ` Indu Bhagat
2025-01-30 15:47 ` Jens Remus
3 siblings, 1 reply; 161+ messages in thread
From: Andrii Nakryiko @ 2025-01-24 18:02 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, Jan 21, 2025 at 6:32 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> In preparation for using sframe to unwind user space stacks, add an
> sframe_find() interface for finding the sframe information associated
> with a given text address.
>
> For performance, use user_read_access_begin() and the corresponding
> unsafe_*() accessors. Note that use of pr_debug() in uaccess-enabled
> regions would break noinstr validation, so there aren't any debug
> messages yet. That will be added in a subsequent commit.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> ---
> include/linux/sframe.h | 5 +
> kernel/unwind/sframe.c | 295 ++++++++++++++++++++++++++++++++++-
> kernel/unwind/sframe_debug.h | 35 +++++
> 3 files changed, 331 insertions(+), 4 deletions(-)
> create mode 100644 kernel/unwind/sframe_debug.h
>
[...]
> +
> +static __always_inline int __read_fde(struct sframe_section *sec,
> + unsigned int fde_num,
> + struct sframe_fde *fde)
> +{
> + unsigned long fde_addr, ip;
> +
> + fde_addr = sec->fdes_start + (fde_num * sizeof(struct sframe_fde));
> + unsafe_copy_from_user(fde, (void __user *)fde_addr,
> + sizeof(struct sframe_fde), Efault);
> +
> + ip = sec->sframe_start + fde->start_addr;
> + if (ip < sec->text_start || ip > sec->text_end)
ip >= sec->text_end ? ip == sec->text_end doesn't make sense, no?
> + return -EINVAL;
> +
> + return 0;
> +
> +Efault:
> + return -EFAULT;
> +}
> +
[...]
> +static __always_inline int __read_fre(struct sframe_section *sec,
> + struct sframe_fde *fde,
> + unsigned long fre_addr,
> + struct sframe_fre *fre)
> +{
> + unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
> + unsigned char fre_type = SFRAME_FUNC_FRE_TYPE(fde->info);
> + unsigned char offset_count, offset_size;
> + s32 ip_off, cfa_off, ra_off, fp_off;
> + unsigned long cur = fre_addr;
> + unsigned char addr_size;
> + u8 info;
> +
> + addr_size = fre_type_to_size(fre_type);
> + if (!addr_size)
> + return -EFAULT;
> +
> + if (fre_addr + addr_size + 1 > sec->fres_end)
nit: isn't this the same as `fre_addr + addr_size >= sec->fres_end` ?
> + return -EFAULT;
> +
> + UNSAFE_GET_USER_INC(ip_off, cur, addr_size, Efault);
> + if (fde_type == SFRAME_FDE_TYPE_PCINC && ip_off > fde->func_size)
is ip_off == fde->func_size allowable?
> + return -EFAULT;
> +
> + UNSAFE_GET_USER_INC(info, cur, 1, Efault);
> + offset_count = SFRAME_FRE_OFFSET_COUNT(info);
> + offset_size = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(info));
> + if (!offset_count || !offset_size)
> + return -EFAULT;
> +
> + if (cur + (offset_count * offset_size) > sec->fres_end)
offset_count * offset_size done in u8 can overflow, no? maybe upcast
to unsigned long or use check_add_overflow?
> + return -EFAULT;
> +
> + fre->size = addr_size + 1 + (offset_count * offset_size);
> +
> + UNSAFE_GET_USER_INC(cfa_off, cur, offset_size, Efault);
> + offset_count--;
> +
> + ra_off = sec->ra_off;
> + if (!ra_off) {
> + if (!offset_count--)
> + return -EFAULT;
> +
> + UNSAFE_GET_USER_INC(ra_off, cur, offset_size, Efault);
> + }
> +
> + fp_off = sec->fp_off;
> + if (!fp_off && offset_count) {
> + offset_count--;
> + UNSAFE_GET_USER_INC(fp_off, cur, offset_size, Efault);
> + }
> +
> + if (offset_count)
> + return -EFAULT;
> +
> + fre->ip_off = ip_off;
> + fre->cfa_off = cfa_off;
> + fre->ra_off = ra_off;
> + fre->fp_off = fp_off;
> + fre->info = info;
> +
> + return 0;
> +
> +Efault:
> + return -EFAULT;
> +}
> +
> +static __always_inline int __find_fre(struct sframe_section *sec,
> + struct sframe_fde *fde, unsigned long ip,
> + struct unwind_user_frame *frame)
> +{
> + unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
> + struct sframe_fre *fre, *prev_fre = NULL;
> + struct sframe_fre fres[2];
you only need prev_fre->ip_off, so why all this `which` and `fres[2]`
business if all you need is prev_fre_off and a bool whether you have
prev_fre at all?
> + unsigned long fre_addr;
> + bool which = false;
> + unsigned int i;
> + s32 ip_off;
> +
> + ip_off = (s32)(ip - sec->sframe_start) - fde->start_addr;
> +
> + if (fde_type == SFRAME_FDE_TYPE_PCMASK)
> + ip_off %= fde->rep_size;
did you check that fde->rep_size is not zero?
> +
> + fre_addr = sec->fres_start + fde->fres_off;
> +
> + for (i = 0; i < fde->fres_num; i++) {
why not binary search? seems more logical to guard against cases with
lots of FREs and be pretty fast in the common case anyway.
> + int ret;
> +
> + /*
> + * Alternate between the two fre_addr[] entries for 'fre' and
> + * 'prev_fre'.
> + */
> + fre = which ? fres : fres + 1;
> + which = !which;
> +
> + ret = __read_fre(sec, fde, fre_addr, fre);
> + if (ret)
> + return ret;
> +
> + fre_addr += fre->size;
> +
> + if (prev_fre && fre->ip_off <= prev_fre->ip_off)
> + return -EFAULT;
> +
> + if (fre->ip_off > ip_off)
> + break;
> +
> + prev_fre = fre;
> + }
> +
> + if (!prev_fre)
> + return -EINVAL;
> + fre = prev_fre;
> +
> + frame->cfa_off = fre->cfa_off;
> + frame->ra_off = fre->ra_off;
> + frame->fp_off = fre->fp_off;
> + frame->use_fp = SFRAME_FRE_CFA_BASE_REG_ID(fre->info) == SFRAME_BASE_REG_FP;
> +
> + return 0;
> +}
> +
[...]
* Re: [PATCH v4 11/39] unwind_user: Add user space unwinding API
2025-01-24 17:59 ` Andrii Nakryiko
@ 2025-01-24 18:08 ` Josh Poimboeuf
0 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-24 18:08 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Fri, Jan 24, 2025 at 09:59:28AM -0800, Andrii Nakryiko wrote:
> On Tue, Jan 21, 2025 at 6:32 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > +int unwind_user_next(struct unwind_user_state *state)
> > +{
> > + struct unwind_user_frame _frame;
> > + struct unwind_user_frame *frame = &_frame;
> > + unsigned long cfa = 0, fp, ra = 0;
>
> wouldn't all the above generate compilation warnings about unused
> variables, potentially breaking bisection?
It doesn't break bisection because no arches support UNWIND_USER yet so
this can't yet be compiled.
But yeah, it's shoddy and I fixed it after Jens' comment about the
unnecessary zero-initialized variables.
>
> > +
> > + /* no implementation yet */
> > + -EINVAL;
>
> return missing?
heh :-)
--
Josh
* Re: [PATCH v4 32/39] perf: Remove get_perf_callchain() 'crosstask' argument
2025-01-22 2:31 ` [PATCH v4 32/39] perf: Remove get_perf_callchain() 'crosstask' argument Josh Poimboeuf
@ 2025-01-24 18:13 ` Andrii Nakryiko
2025-01-24 22:00 ` Josh Poimboeuf
0 siblings, 1 reply; 161+ messages in thread
From: Andrii Nakryiko @ 2025-01-24 18:13 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, Jan 21, 2025 at 6:32 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> get_perf_callchain() doesn't support cross-task unwinding, so it doesn't
> make much sense to have 'crosstask' as an argument.
>
> Acked-by: Namhyung Kim <namhyung@kernel.org>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> ---
> include/linux/perf_event.h | 2 +-
> kernel/bpf/stackmap.c | 12 ++++--------
> kernel/events/callchain.c | 6 +-----
> kernel/events/core.c | 9 +++++----
> 4 files changed, 11 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 4c8ff7258c6a..1563dc2cd979 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1590,7 +1590,7 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
> extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
> extern struct perf_callchain_entry *
> get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> - u32 max_stack, bool crosstask, bool add_mark);
> + u32 max_stack, bool add_mark);
> extern int get_callchain_buffers(int max_stack);
> extern void put_callchain_buffers(void);
> extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index ec3a57a5fba1..ee9701337912 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -314,8 +314,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
> if (max_depth > sysctl_perf_event_max_stack)
> max_depth = sysctl_perf_event_max_stack;
>
> - trace = get_perf_callchain(regs, kernel, user, max_depth,
> - false, false);
> + trace = get_perf_callchain(regs, kernel, user, max_depth, false);
>
> if (unlikely(!trace))
> /* couldn't fetch the stack trace */
> @@ -430,10 +429,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
> if (task && user && !user_mode(regs))
> goto err_fault;
>
> - /* get_perf_callchain does not support crosstask user stack walking
> - * but returns an empty stack instead of NULL.
> - */
> - if (crosstask && user) {
> + /* get_perf_callchain() does not support crosstask stack walking */
> + if (crosstask) {
crosstask stack trace is supported for kernel stack traces (see
get_callchain_entry_for_task() call), so this is breaking that case
> err = -EOPNOTSUPP;
> goto clear;
> }
> @@ -451,8 +448,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
> else if (kernel && task)
> trace = get_callchain_entry_for_task(task, max_depth);
> else
> - trace = get_perf_callchain(regs, kernel, user, max_depth,
> - crosstask, false);
> + trace = get_perf_callchain(regs, kernel, user, max_depth,false);
nit: missing space
>
> if (unlikely(!trace) || trace->nr < skip) {
> if (may_fault)
> diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> index 83834203e144..655fb25a725b 100644
> --- a/kernel/events/callchain.c
> +++ b/kernel/events/callchain.c
[...]
* Re: [PATCH v4 12/39] unwind_user: Add frame pointer support
2025-01-24 17:59 ` Andrii Nakryiko
@ 2025-01-24 18:16 ` Josh Poimboeuf
2025-04-24 13:41 ` Steven Rostedt
0 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-24 18:16 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Fri, Jan 24, 2025 at 09:59:37AM -0800, Andrii Nakryiko wrote:
> On Tue, Jan 21, 2025 at 6:32 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> >
> > Add optional support for user space frame pointer unwinding. If
> > supported, the arch needs to enable CONFIG_HAVE_UNWIND_USER_FP and
> > define ARCH_INIT_USER_FP_FRAME.
> >
> > By encoding the frame offsets in struct unwind_user_frame, much of this
> > code can also be reused for future unwinder implementations like sframe.
> >
> > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> > ---
> > arch/Kconfig | 4 +++
> > include/asm-generic/unwind_user.h | 9 ++++++
> > include/linux/unwind_user_types.h | 1 +
> > kernel/unwind/user.c | 49 +++++++++++++++++++++++++++++--
> > 4 files changed, 60 insertions(+), 3 deletions(-)
> > create mode 100644 include/asm-generic/unwind_user.h
> >
>
> Do you plan to reuse this logic for stack unwinding done by perf
> subsystem in perf_callchain_user()? See is_uprobe_at_func_entry()
> parts and also fixup_uretprobe_trampoline_entries() for some of the
> quirks that have to be taken into account when doing frame
> pointer-based unwinding. It would be great not to lose those in this
> new reimplementation.
>
> Not sure what's the best way to avoid duplicating the logic, but I
> thought I'd bring that up.
Indeed! That was on the todo list and somehow evaporated.
--
Josh
* Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-01-24 18:00 ` Andrii Nakryiko
@ 2025-01-24 19:21 ` Josh Poimboeuf
2025-01-24 20:13 ` Steven Rostedt
2025-01-24 22:13 ` Indu Bhagat
2025-01-24 20:31 ` Indu Bhagat
1 sibling, 2 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-24 19:21 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Indu Bhagat, x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Fri, Jan 24, 2025 at 10:00:52AM -0800, Andrii Nakryiko wrote:
> On Tue, Jan 21, 2025 at 6:32 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > +static inline int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end, unsigned long text_start, unsigned long text_end) { return -ENOSYS; }
>
> nit: very-very long, wrap it?
That was intentional as it's just an empty stub, but yeah, maybe 160
chars is a bit much.
> > + if (shdr.preamble.magic != SFRAME_MAGIC ||
> > + shdr.preamble.version != SFRAME_VERSION_2 ||
> > + !(shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
>
> probably more a question to Indu, but why is this sorting not
> mandatory and part of SFrame "standard"? How realistically non-sorted
> FDEs would work in practice? Ain't nobody got time to sort them just
> to unwind the stack...
No idea...
> > + if (!shdr.num_fdes || !shdr.num_fres) {
>
> given SFRAME_F_FRAME_POINTER in the header, is it really that
> nonsensical and illegal to have zero FDEs/FREs? Maybe we should allow
> that?
It would seem a bit silly to create an empty .sframe section just to set
that SFRAME_F_FRAME_POINTER bit. Regardless, there's nothing the kernel
can do with that.
> > + dbg("no fde/fre entries\n");
> > + return -EINVAL;
> > + }
> > +
> > + header_end = sec->sframe_start + SFRAME_HEADER_SIZE(shdr);
> > + if (header_end >= sec->sframe_end) {
>
> if we allow zero FDEs/FREs, header_end == sec->sframe_end is legal, right?
I suppose so, but again I'm not seeing any reason to support that.
> > + dbg("header doesn't fit in section\n");
> > + return -EINVAL;
> > + }
> > +
> > + num_fdes = shdr.num_fdes;
> > + fdes_start = header_end + shdr.fdes_off;
> > + fdes_end = fdes_start + (num_fdes * sizeof(struct sframe_fde));
> > +
> > + fres_start = header_end + shdr.fres_off;
> > + fres_end = fres_start + shdr.fre_len;
> > +
>
> maybe use check_add_overflow() in all the above calculation, at least
> on 32-bit arches this all can overflow and it's not clear if below
> sanity check detects all possible overflows
Ok, I'll look into it.
> > +struct sframe_preamble {
> > + u16 magic;
> > + u8 version;
> > + u8 flags;
> > +} __packed;
> > +
> > +struct sframe_header {
> > + struct sframe_preamble preamble;
> > + u8 abi_arch;
> > + s8 cfa_fixed_fp_offset;
> > + s8 cfa_fixed_ra_offset;
> > + u8 auxhdr_len;
> > + u32 num_fdes;
> > + u32 num_fres;
> > + u32 fre_len;
> > + u32 fdes_off;
> > + u32 fres_off;
> > +} __packed;
> > +
> > +struct sframe_fde {
> > + s32 start_addr;
> > + u32 func_size;
> > + u32 fres_off;
> > + u32 fres_num;
> > + u8 info;
> > + u8 rep_size;
> > + u16 padding;
> > +} __packed;
>
> I couldn't understand from SFrame itself, but why do sframe_header,
> sframe_preamble, and sframe_fde have to be marked __packed, if it's
> all naturally aligned (intentionally and by design)?..
Right, but the spec says they're all packed. Maybe the point is that
some future sframe version is free to introduce unaligned fields.
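As a side note on that point, for a naturally aligned layout like the quoted sframe_preamble, marking the struct __packed does not change its size or field offsets; it only drops the alignment requirement (which is what would permit future unaligned fields). A quick userspace sketch, with illustrative struct names that mirror the quoted header:

```c
#include <assert.h>
#include <stdint.h>

/* Same field layout as the quoted sframe_preamble, without __packed. */
struct preamble_natural {
	uint16_t magic;
	uint8_t  version;
	uint8_t  flags;
};

/* The same layout with the attribute the spec mandates. */
struct preamble_packed {
	uint16_t magic;
	uint8_t  version;
	uint8_t  flags;
} __attribute__((packed));
```

Compiling this and comparing sizeof/_Alignof shows both structs occupy 4 bytes; only the alignment of the packed variant differs.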
--
Josh
* Re: [PATCH v4 11/39] unwind_user: Add user space unwinding API
2025-01-22 2:31 ` [PATCH v4 11/39] unwind_user: Add user space unwinding API Josh Poimboeuf
2025-01-24 16:41 ` Jens Remus
2025-01-24 17:59 ` Andrii Nakryiko
@ 2025-01-24 20:02 ` Steven Rostedt
2025-01-24 22:05 ` Josh Poimboeuf
2 siblings, 1 reply; 161+ messages in thread
From: Steven Rostedt @ 2025-01-24 20:02 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, 21 Jan 2025 18:31:03 -0800
Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> +int unwind_user_next(struct unwind_user_state *state)
> +{
> + struct unwind_user_frame _frame;
> + struct unwind_user_frame *frame = &_frame;
> + unsigned long cfa = 0, fp, ra = 0;
> +
> + /* no implementation yet */
> + -EINVAL;
> +}
> +
> +int unwind_user_start(struct unwind_user_state *state)
> +{
> + struct pt_regs *regs = task_pt_regs(current);
> +
> + memset(state, 0, sizeof(*state));
> +
> + if (!current->mm || !user_mode(regs)) {
> + state->done = true;
> + return -EINVAL;
> + }
> +
> + state->type = UNWIND_USER_TYPE_NONE;
> +
> + state->ip = instruction_pointer(regs);
> + state->sp = user_stack_pointer(regs);
> + state->fp = frame_pointer(regs);
> +
> + return 0;
> +}
> +
I know this is just an introduction to the interface, but this should
really have kerneldoc attached to it, as I have no idea what these are
supposed to be doing. This patch is meaningless without it. The change log
is useless too.
-- Steve
> +int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries)
> +{
> + struct unwind_user_state state;
> +
> + trace->nr = 0;
> +
> + if (!max_entries)
> + return -EINVAL;
> +
> + if (!current->mm)
> + return 0;
> +
> + for_each_user_frame(&state) {
> + trace->entries[trace->nr++] = state.ip;
> + if (trace->nr >= max_entries)
> + break;
> + }
> +
> + return 0;
> +}
* Re: [PATCH v4 14/39] perf/x86: Rename get_segment_base() and make it global
2025-01-22 2:31 ` [PATCH v4 14/39] perf/x86: Rename get_segment_base() and make it global Josh Poimboeuf
2025-01-22 12:51 ` Peter Zijlstra
@ 2025-01-24 20:09 ` Steven Rostedt
2025-01-24 22:06 ` Josh Poimboeuf
1 sibling, 1 reply; 161+ messages in thread
From: Steven Rostedt @ 2025-01-24 20:09 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, 21 Jan 2025 18:31:06 -0800
Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index c75c482d4c52..23ac6343cf86 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2790,7 +2790,7 @@ valid_user_frame(const void __user *fp, unsigned long size)
> return __access_ok(fp, size);
> }
>
> -static unsigned long get_segment_base(unsigned int segment)
> +unsigned long segment_base_address(unsigned int segment)
> {
> struct desc_struct *desc;
> unsigned int idx = segment >> 3;
As this requires interrupts disabled, and if you do move this out of this
file, you probably should add:
lockdep_assert_irqs_disabled();
-- Steve
* Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-01-24 19:21 ` Josh Poimboeuf
@ 2025-01-24 20:13 ` Steven Rostedt
2025-01-24 22:39 ` Josh Poimboeuf
2025-01-24 22:13 ` Indu Bhagat
1 sibling, 1 reply; 161+ messages in thread
From: Steven Rostedt @ 2025-01-24 20:13 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Andrii Nakryiko, Indu Bhagat, x86, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Fri, 24 Jan 2025 11:21:59 -0800
Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > given SFRAME_F_FRAME_POINTER in the header, is it really that
> > nonsensical and illegal to have zero FDEs/FREs? Maybe we should allow
> > that?
>
> It would seem a bit silly to create an empty .sframe section just to set
> that SFRAME_F_FRAME_POINTER bit. Regardless, there's nothing the kernel
> can do with that.
>
> > > + dbg("no fde/fre entries\n");
> > > + return -EINVAL;
> > > + }
> > > +
> > > + header_end = sec->sframe_start + SFRAME_HEADER_SIZE(shdr);
> > > + if (header_end >= sec->sframe_end) {
> >
> > if we allow zero FDEs/FREs, header_end == sec->sframe_end is legal, right?
>
> I suppose so, but again I'm not seeing any reason to support that.
Hmm, could that be useful for implementing a way to dynamically grow or
shrink an sframe because of jits? I'm just thinking about placeholders or
something.
-- Steve
* Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-01-24 18:00 ` Andrii Nakryiko
2025-01-24 19:21 ` Josh Poimboeuf
@ 2025-01-24 20:31 ` Indu Bhagat
1 sibling, 0 replies; 161+ messages in thread
From: Indu Bhagat @ 2025-01-24 20:31 UTC (permalink / raw)
To: Andrii Nakryiko, Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On 1/24/25 10:00 AM, Andrii Nakryiko wrote:
> On Tue, Jan 21, 2025 at 6:32 PM Josh Poimboeuf<jpoimboe@kernel.org> wrote:
>> In preparation for unwinding user space stacks with sframe, add basic
>> sframe compile infrastructure and support for reading the .sframe
>> section header.
>>
>> sframe_add_section() reads the header and unconditionally returns an
>> error, so it's not very useful yet. A subsequent patch will improve
>> that.
>>
>> Signed-off-by: Josh Poimboeuf<jpoimboe@kernel.org>
>> ---
>> arch/Kconfig | 3 +
>> include/linux/sframe.h | 36 +++++++++++
>> kernel/unwind/Makefile | 3 +-
>> kernel/unwind/sframe.c | 136 +++++++++++++++++++++++++++++++++++++++++
>> kernel/unwind/sframe.h | 71 +++++++++++++++++++++
>> 5 files changed, 248 insertions(+), 1 deletion(-)
>> create mode 100644 include/linux/sframe.h
>> create mode 100644 kernel/unwind/sframe.c
>> create mode 100644 kernel/unwind/sframe.h
>>
> [...]
>
>> +
>> +extern int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
>> + unsigned long text_start, unsigned long text_end);
>> +extern int sframe_remove_section(unsigned long sframe_addr);
>> +
>> +#else /* !CONFIG_HAVE_UNWIND_USER_SFRAME */
>> +
>> +static inline int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end, unsigned long text_start, unsigned long text_end) { return -ENOSYS; }
> nit: very-very long, wrap it?
>
>> +static inline int sframe_remove_section(unsigned long sframe_addr) { return -ENOSYS; }
>> +
>> +#endif /* CONFIG_HAVE_UNWIND_USER_SFRAME */
>> +
>> +#endif /* _LINUX_SFRAME_H */
> [...]
>
>> +static int sframe_read_header(struct sframe_section *sec)
>> +{
>> + unsigned long header_end, fdes_start, fdes_end, fres_start, fres_end;
>> + struct sframe_header shdr;
>> + unsigned int num_fdes;
>> +
>> + if (copy_from_user(&shdr, (void __user *)sec->sframe_start, sizeof(shdr))) {
>> + dbg("header usercopy failed\n");
>> + return -EFAULT;
>> + }
>> +
>> + if (shdr.preamble.magic != SFRAME_MAGIC ||
>> + shdr.preamble.version != SFRAME_VERSION_2 ||
>> + !(shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
> probably more a question to Indu, but why is this sorting not
> mandatory and part of SFrame "standard"? How realistically non-sorted
> FDEs would work in practice? Ain't nobody got time to sort them just
> to unwind the stack...
It is not worthwhile for the assembler (even wasteful, as it adds to
build time for nothing) to generate an .sframe section with FDEs sorted
by function start PC, because the final order is decided by the linker
as it merges all input sections.
That's one reason the specification already needs to allow
SFRAME_F_FDE_SORTED to be unset in a section. I can also see how not
making the sorting mandatory may be necessary for the JIT use case.
FWIW, for non-JIT environments, non-sorted FDEs are not expected in
linked binaries; such a thing does not seem to be useful in practice.
Hope that helps
Indu
* Re: [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents
2025-01-24 18:02 ` Andrii Nakryiko
@ 2025-01-24 21:41 ` Josh Poimboeuf
2025-01-28 0:39 ` Andrii Nakryiko
2025-01-30 19:51 ` Weinan Liu
0 siblings, 2 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-24 21:41 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Fri, Jan 24, 2025 at 10:02:46AM -0800, Andrii Nakryiko wrote:
> On Tue, Jan 21, 2025 at 6:32 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > +static __always_inline int __read_fde(struct sframe_section *sec,
> > + unsigned int fde_num,
> > + struct sframe_fde *fde)
> > +{
> > + unsigned long fde_addr, ip;
> > +
> > + fde_addr = sec->fdes_start + (fde_num * sizeof(struct sframe_fde));
> > + unsafe_copy_from_user(fde, (void __user *)fde_addr,
> > + sizeof(struct sframe_fde), Efault);
> > +
> > + ip = sec->sframe_start + fde->start_addr;
> > + if (ip < sec->text_start || ip > sec->text_end)
>
> ip >= sec->text_end ? ip == sec->text_end doesn't make sense, no?
Believe it or not, this may have been intentional: 'text_end' can
actually be a valid stack return address if the last instruction of the
section is a call to a noreturn function. The unwinder still needs to
know the state of the stack at that point.
That said, there would be no reason to put an FDE at the end. So yeah,
it should be '>='.
> > +static __always_inline int __read_fre(struct sframe_section *sec,
> > + struct sframe_fde *fde,
> > + unsigned long fre_addr,
> > + struct sframe_fre *fre)
> > +{
> > + unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
> > + unsigned char fre_type = SFRAME_FUNC_FRE_TYPE(fde->info);
> > + unsigned char offset_count, offset_size;
> > + s32 ip_off, cfa_off, ra_off, fp_off;
> > + unsigned long cur = fre_addr;
> > + unsigned char addr_size;
> > + u8 info;
> > +
> > + addr_size = fre_type_to_size(fre_type);
> > + if (!addr_size)
> > + return -EFAULT;
> > +
> > + if (fre_addr + addr_size + 1 > sec->fres_end)
>
> nit: isn't this the same as `fre_addr + addr_size >= sec->fres_end` ?
Yes, but that would further obfuscate the fact that this is validating
that the next two reads don't go past sec->fres_end:
UNSAFE_GET_USER_INC(ip_off, cur, addr_size, Efault);
UNSAFE_GET_USER_INC(info, cur, 1, Efault);
The explicit "1" is for that second read.
I'll do something like SFRAME_FRE_INFO_SIZE instead of 1 to make that
clearer.
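In isolation, the check being discussed can be modeled like this; a hedged userspace sketch (function and constant names are illustrative, not the patch code):

```c
#include <assert.h>
#include <stdint.h>

/* The "1" in the quoted check is this one info byte. */
#define SFRAME_FRE_INFO_SIZE 1

/*
 * Return nonzero if reading 'addr_size' bytes (the ip offset) followed
 * by one info byte, starting at 'fre_addr', stays within fres_end.
 * Mirrors: !(fre_addr + addr_size + 1 > sec->fres_end)
 */
static int fre_read_fits(uint64_t fre_addr, uint8_t addr_size,
			 uint64_t fres_end)
{
	return fre_addr + addr_size + SFRAME_FRE_INFO_SIZE <= fres_end;
}
```

Naming the constant makes it obvious the check covers both of the subsequent UNSAFE_GET_USER_INC() reads, not just the ip offset.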
> > + return -EFAULT;
> > +
> > + UNSAFE_GET_USER_INC(ip_off, cur, addr_size, Efault);
> > + if (fde_type == SFRAME_FDE_TYPE_PCINC && ip_off > fde->func_size)
>
> is ip_off == fde->func_size allowable?
This was another one where I was probably thinking about the whole "last
insn could be a call to noreturn" scenario.
But even in that case, the frame would be unchanged after the call, so
there shouldn't need to be an FRE there.
In fact, for the unwinder to deal with that scenario, for a call frame
(as opposed to an interrupted one), it should always subtract one from
the return address before calling into sframe so that it points to the
calling instruction. </me makes note to go do that>
So yeah, this looks like another off-by-one, caused by overthinking yet
not thinking enough.
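The return-address adjustment mentioned above is a common unwinder convention; a minimal sketch (names hypothetical, not the patch code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * For a call frame, subtract one from the return address before the
 * table lookup so it resolves to the calling instruction itself. If
 * the call is the last instruction of the function (e.g. a call to a
 * noreturn function), the raw return address may point past the end
 * of the function; the interrupted-frame case keeps the ip as-is.
 */
static uint64_t unwind_lookup_ip(uint64_t ip, bool is_call_frame)
{
	return is_call_frame ? ip - 1 : ip;
}
```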
> > + return -EFAULT;
> > +
> > + UNSAFE_GET_USER_INC(info, cur, 1, Efault);
> > + offset_count = SFRAME_FRE_OFFSET_COUNT(info);
> > + offset_size = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(info));
> > + if (!offset_count || !offset_size)
> > + return -EFAULT;
> > +
> > + if (cur + (offset_count * offset_size) > sec->fres_end)
>
> offset_count * offset_size done in u8 can overflow, no? maybe upcast
> to unsigned long or use check_add_overflow?
offset_size is <= 2 as returned by offset_size_enum_to_size().
offset_count is expected to be <= 3, enforced by the !offset_count check
at the bottom.
An overflow here would be harmless as it would be caught by the
!offset_count anyway. Though I also notice offset_count isn't big
enough to hold the 2-byte SFRAME_FRE_OFFSET_COUNT() value. Which is
harmless for the same reason, but yeah I'll make offset_count an
unsigned int.
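For reference, an overflow-proof version of this bound could be sketched as below; this is a userspace model using the GCC/Clang builtins that the kernel's check_mul_overflow()/check_add_overflow() helpers wrap (names illustrative):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return nonzero if 'offset_count' offsets of 'offset_size' bytes each,
 * starting at 'cur', fit below fres_end, rejecting any intermediate
 * overflow instead of letting it wrap.
 */
static int fre_offsets_fit(uint64_t cur, unsigned int offset_count,
			   unsigned int offset_size, uint64_t fres_end)
{
	uint64_t bytes, end;

	if (__builtin_mul_overflow((uint64_t)offset_count, offset_size, &bytes))
		return 0;
	if (__builtin_add_overflow(cur, bytes, &end))
		return 0;
	return end <= fres_end;
}
```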
> > +static __always_inline int __find_fre(struct sframe_section *sec,
> > + struct sframe_fde *fde, unsigned long ip,
> > + struct unwind_user_frame *frame)
> > +{
> > + unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
> > + struct sframe_fre *fre, *prev_fre = NULL;
> > + struct sframe_fre fres[2];
>
> you only need prev_fre->ip_off, so why all this `which` and `fres[2]`
> business if all you need is prev_fre_off and a bool whether you have
> prev_fre at all?
So this cleverness probably needs a comment. prev_fre is actually
needed after the loop:
> > + if (!prev_fre)
> > + return -EINVAL;
> > + fre = prev_fre;
In the body of the loop, prev_fre is a tentative match, unless the next
fre also matches, in which case that one becomes the new tentative
match.
I'll add a comment. Also it'll probably be less confusing if I rename
"prev_fre" to "fre", and "fre" to "next_fre".
> > + unsigned long fre_addr;
> > + bool which = false;
> > + unsigned int i;
> > + s32 ip_off;
> > +
> > + ip_off = (s32)(ip - sec->sframe_start) - fde->start_addr;
> > +
> > + if (fde_type == SFRAME_FDE_TYPE_PCMASK)
> > + ip_off %= fde->rep_size;
>
> did you check that fde->rep_size is not zero?
I did not, good catch!
> > + fre_addr = sec->fres_start + fde->fres_off;
> > +
> > + for (i = 0; i < fde->fres_num; i++) {
>
> why not binary search? seem more logical to guard against cases with
> lots of FREs and be pretty fast in common case anyways.
That would be nice, but the FREs are variable-sized and you don't know
how big one is until you start reading it.
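The constraint can be seen in a toy model of length-prefixed records (illustrative, not the FRE encoding itself): the offset of record i depends on the sizes of records 0..i-1, so random access for a binary search is impossible without an auxiliary index.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Each record starts with a 1-byte length that includes the length byte
 * itself. Reaching record 'idx' requires walking every record before
 * it. Returns the byte offset of record 'idx', or -1 on malformed or
 * truncated input.
 */
static long nth_record_offset(const uint8_t *buf, size_t len,
			      unsigned int idx)
{
	size_t off = 0;
	unsigned int i;

	for (i = 0; i < idx; i++) {
		if (off >= len || buf[off] == 0)
			return -1;
		off += buf[off];	/* length prefix includes itself */
	}
	return off >= len ? -1 : (long)off;
}
```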
--
Josh
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-23 22:13 ` Peter Zijlstra
@ 2025-01-24 21:58 ` Steven Rostedt
2025-01-24 22:46 ` Josh Poimboeuf
0 siblings, 1 reply; 161+ messages in thread
From: Steven Rostedt @ 2025-01-24 21:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Josh Poimboeuf, Mathieu Desnoyers, x86, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski, Masami Hiramatsu,
Weinan Liu
On Thu, 23 Jan 2025 23:13:26 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> -EPONIES, you cannot take faults from the middle of schedule(). They can
> always use the best effort FP unwind we have today.
Agreed.
Now the only thing I could think of is a flag that gets set so the task does
the stack trace when it comes out of the scheduler. It doesn't need to do
the stack trace before it schedules. As it did just schedule, wherever it
scheduled from must have been in a schedulable context.
That is, kind of like the task_work flag for entering user space and
exiting the kernel, could we have a sched_work flag to run after being
scheduled back (exiting schedule())? Since the task has been picked to run,
it will not cause latency for other tasks. The work will be done in its
context. As far as the task's accounting goes, this is no different from
doing it on the way back to user space. Heck, it would only need to do this
once if it didn't go back to user space, as the user space stack would be
the same. That is, if it gets scheduled multiple times, this would only
happen on the first instance until it leaves the kernel.
[ trigger stack trace - set sched_work ]
schedule() {
context_switch() -> CPU runs some other task
<- gets scheduled back onto the CPU
[..]
/* preemption enabled ... */
if (sched_work) {
do stack trace() // can schedule here but
// calls a schedule function that does not
// do sched_work to prevent recursion
}
}
Could something like this work?
-- Steve
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 32/39] perf: Remove get_perf_callchain() 'crosstask' argument
2025-01-24 18:13 ` Andrii Nakryiko
@ 2025-01-24 22:00 ` Josh Poimboeuf
2025-01-28 0:39 ` Andrii Nakryiko
0 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-24 22:00 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Fri, Jan 24, 2025 at 10:13:23AM -0800, Andrii Nakryiko wrote:
> On Tue, Jan 21, 2025 at 6:32 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > @@ -430,10 +429,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
> > if (task && user && !user_mode(regs))
> > goto err_fault;
> >
> > - /* get_perf_callchain does not support crosstask user stack walking
> > - * but returns an empty stack instead of NULL.
> > - */
> > - if (crosstask && user) {
> > + /* get_perf_callchain() does not support crosstask stack walking */
> > + if (crosstask) {
>
> crosstask stack trace is supported for kernel stack traces (see
> get_callchain_entry_for_task() call), so this is breaking that case
Oh I see, thanks.
BTW, that seems dubious, does it do anything to ensure the task isn't
running? Otherwise the unwind is going to be a wild ride.
--
Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 11/39] unwind_user: Add user space unwinding API
2025-01-24 20:02 ` Steven Rostedt
@ 2025-01-24 22:05 ` Josh Poimboeuf
0 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-24 22:05 UTC (permalink / raw)
To: Steven Rostedt
Cc: x86, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Fri, Jan 24, 2025 at 03:02:11PM -0500, Steven Rostedt wrote:
> On Tue, 21 Jan 2025 18:31:03 -0800
> Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > +int unwind_user_start(struct unwind_user_state *state)
> > +{
> > + struct pt_regs *regs = task_pt_regs(current);
> > +
> > + memset(state, 0, sizeof(*state));
> > +
> > + if (!current->mm || !user_mode(regs)) {
> > + state->done = true;
> > + return -EINVAL;
> > + }
> > +
> > + state->type = UNWIND_USER_TYPE_NONE;
> > +
> > + state->ip = instruction_pointer(regs);
> > + state->sp = user_stack_pointer(regs);
> > + state->fp = frame_pointer(regs);
> > +
> > + return 0;
> > +}
> > +
>
> I know this is just an introductory of the interface, but this should
> really have kerneldoc attached to it, as I have no idea what these are
> supposed to be doing. This patch is meaningless without it. The change log
> is useless too.
Yeah, sure.
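For example, the kerneldoc might look something like this (wording is a guess at the intent, not taken from the patch):

```c
/**
 * unwind_user_start - begin an unwind of the current task's user stack
 * @state: unwind state, initialized by this function
 *
 * Capture the user space IP, SP and FP from the task's saved pt_regs as
 * the starting frame for subsequent unwind_user_next() calls.
 *
 * Return: 0 on success, or -EINVAL if the task has no user context
 * (e.g. a kernel thread, or regs not from user mode).
 */
```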
--
Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 14/39] perf/x86: Rename get_segment_base() and make it global
2025-01-24 20:09 ` Steven Rostedt
@ 2025-01-24 22:06 ` Josh Poimboeuf
0 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-24 22:06 UTC (permalink / raw)
To: Steven Rostedt
Cc: x86, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Fri, Jan 24, 2025 at 03:09:27PM -0500, Steven Rostedt wrote:
> On Tue, 21 Jan 2025 18:31:06 -0800
> Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> > index c75c482d4c52..23ac6343cf86 100644
> > --- a/arch/x86/events/core.c
> > +++ b/arch/x86/events/core.c
> > @@ -2790,7 +2790,7 @@ valid_user_frame(const void __user *fp, unsigned long size)
> > return __access_ok(fp, size);
> > }
> >
> > -static unsigned long get_segment_base(unsigned int segment)
> > +unsigned long segment_base_address(unsigned int segment)
> > {
> > struct desc_struct *desc;
> > unsigned int idx = segment >> 3;
>
> As this requires interrupts disabled, and if you do move this out of this
> file, you probably should add:
>
> lockdep_assert_irqs_disabled();
Indeed...
--
Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-01-24 19:21 ` Josh Poimboeuf
2025-01-24 20:13 ` Steven Rostedt
@ 2025-01-24 22:13 ` Indu Bhagat
2025-01-28 1:10 ` Andrii Nakryiko
1 sibling, 1 reply; 161+ messages in thread
From: Indu Bhagat @ 2025-01-24 22:13 UTC (permalink / raw)
To: Josh Poimboeuf, Andrii Nakryiko
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On 1/24/25 11:21 AM, Josh Poimboeuf wrote:
> On Fri, Jan 24, 2025 at 10:00:52AM -0800, Andrii Nakryiko wrote:
>> On Tue, Jan 21, 2025 at 6:32 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>>> +static inline int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end, unsigned long text_start, unsigned long text_end) { return -ENOSYS; }
>>
>> nit: very-very long, wrap it?
>
> That was intentional as it's just an empty stub, but yeah, maybe 160
> chars is a bit much.
>
>>> + if (shdr.preamble.magic != SFRAME_MAGIC ||
>>> + shdr.preamble.version != SFRAME_VERSION_2 ||
>>> + !(shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
>>
>> probably more a question to Indu, but why is this sorting not
>> mandatory and part of SFrame "standard"? How realistically non-sorted
>> FDEs would work in practice? Ain't nobody got time to sort them just
>> to unwind the stack...
>
> No idea...
>
>>> + if (!shdr.num_fdes || !shdr.num_fres) {
>>
>> given SFRAME_F_FRAME_POINTER in the header, is it really that
>> nonsensical and illegal to have zero FDEs/FREs? Maybe we should allow
>> that?
>
> It would seem a bit silly to create an empty .sframe section just to set
> that SFRAME_F_FRAME_POINTER bit. Regardless, there's nothing the kernel
> can do with that.
>
Yes, in theory, it is allowed (as per the specification) to have an
SFrame section with zero FDEs/FREs. But since such a section will not
be useful, I share the opinion that it makes sense to disallow it in
the current unwinding contexts, for now (the JIT use case may change
things later).
The SFRAME_F_FRAME_POINTER flag is currently not set by GAS/GNU ld at all.
>>> + dbg("no fde/fre entries\n");
>>> + return -EINVAL;
>>> + }
>>> +
>>> + header_end = sec->sframe_start + SFRAME_HEADER_SIZE(shdr);
>>> + if (header_end >= sec->sframe_end) {
>>
>> if we allow zero FDEs/FREs, header_end == sec->sframe_end is legal, right?
>
> I suppose so, but again I'm not seeing any reason to support that.
>
>>> + dbg("header doesn't fit in section\n");
>>> + return -EINVAL;
>>> + }
>>> +
>>> + num_fdes = shdr.num_fdes;
>>> + fdes_start = header_end + shdr.fdes_off;
>>> + fdes_end = fdes_start + (num_fdes * sizeof(struct sframe_fde));
>>> +
>>> + fres_start = header_end + shdr.fres_off;
>>> + fres_end = fres_start + shdr.fre_len;
>>> +
>>
>> maybe use check_add_overflow() in all the above calculation, at least
>> on 32-bit arches this all can overflow and it's not clear if below
>> sanity check detects all possible overflows
>
> Ok, I'll look into it.
>
>>> +struct sframe_preamble {
>>> + u16 magic;
>>> + u8 version;
>>> + u8 flags;
>>> +} __packed;
>>> +
>>> +struct sframe_header {
>>> + struct sframe_preamble preamble;
>>> + u8 abi_arch;
>>> + s8 cfa_fixed_fp_offset;
>>> + s8 cfa_fixed_ra_offset;
>>> + u8 auxhdr_len;
>>> + u32 num_fdes;
>>> + u32 num_fres;
>>> + u32 fre_len;
>>> + u32 fdes_off;
>>> + u32 fres_off;
>>> +} __packed;
>>> +
>>> +struct sframe_fde {
>>> + s32 start_addr;
>>> + u32 func_size;
>>> + u32 fres_off;
>>> + u32 fres_num;
>>> + u8 info;
>>> + u8 rep_size;
>>> + u16 padding;
>>> +} __packed;
>>
>> I couldn't understand from SFrame itself, but why do sframe_header,
>> sframe_preamble, and sframe_fde have to be marked __packed, if it's
>> all naturally aligned (intentionally and by design)?..
>
> Right, but the spec says they're all packed. Maybe the point is that
> some future sframe version is free to introduce unaligned fields.
>
SFrame specification aims to keep SFrame header and SFrame FDE members
at aligned boundaries in future versions.
Only SFrame FRE related accesses may have unaligned accesses.
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-01-24 20:13 ` Steven Rostedt
@ 2025-01-24 22:39 ` Josh Poimboeuf
0 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-24 22:39 UTC (permalink / raw)
To: Steven Rostedt
Cc: Andrii Nakryiko, Indu Bhagat, x86, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Fri, Jan 24, 2025 at 03:13:40PM -0500, Steven Rostedt wrote:
> On Fri, 24 Jan 2025 11:21:59 -0800
> Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> > > given SFRAME_F_FRAME_POINTER in the header, is it really that
> > > nonsensical and illegal to have zero FDEs/FREs? Maybe we should allow
> > > that?
> >
> > It would seem a bit silly to create an empty .sframe section just to set
> > that SFRAME_F_FRAME_POINTER bit. Regardless, there's nothing the kernel
> > can do with that.
> >
> > > > + dbg("no fde/fre entries\n");
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > > > + header_end = sec->sframe_start + SFRAME_HEADER_SIZE(shdr);
> > > > + if (header_end >= sec->sframe_end) {
> > >
> > > if we allow zero FDEs/FREs, header_end == sec->sframe_end is legal, right?
> >
> > I suppose so, but again I'm not seeing any reason to support that.
>
Hmm, could that be useful for implementing a way to dynamically grow or
shrink an sframe because of JITs? I'm just thinking about placeholders or
something.
Maybe?
I was thinking the kernel would have sframe_section placeholders for JIT
code sections, so when sframe_find() retrieves the struct for a given
IP, it sees the JIT flag is set along with a pointer to the in-memory
shared "sframe section", then goes to read that to get the corresponding
sframe entry (insert erratic hand waving).
It's still early days but it's quite possible the in-memory "sframe
section" formats might end up looking pretty different from the .sframe
file section spec.
--
Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-24 21:58 ` Steven Rostedt
@ 2025-01-24 22:46 ` Josh Poimboeuf
2025-01-24 22:50 ` Josh Poimboeuf
0 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-24 22:46 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Mathieu Desnoyers, x86, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski, Masami Hiramatsu,
Weinan Liu
On Fri, Jan 24, 2025 at 04:58:03PM -0500, Steven Rostedt wrote:
> On Thu, 23 Jan 2025 23:13:26 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > -EPONIES, you cannot take faults from the middle of schedule(). They can
> > always use the best effort FP unwind we have today.
>
> Agreed.
>
> Now the only thing I could think of is a flag that gets set so the task does
> the stack trace when it comes out of the scheduler. It doesn't need to do
> the stack trace before it schedules. As it did just schedule, wherever it
> scheduled from must have been in a schedulable context.
>
> That is, kind of like the task_work flag for entering user space and
> exiting the kernel, could we have a sched_work flag to run after being
> scheduled back (exiting schedule())? Since the task has been picked to run,
> it will not cause latency for other tasks. The work will be done in its
> context. As far as the task's accounting goes, this is no different from
> doing it on the way back to user space. Heck, it would only need to do this
> once if it didn't go back to user space, as the user space stack would be
> the same. That is, if it gets scheduled multiple times, this would only
> happen on the first instance until it leaves the kernel.
>
>
> [ trigger stack trace - set sched_work ]
>
> schedule() {
> context_switch() -> CPU runs some other task
> <- gets scheduled back onto the CPU
> [..]
> /* preemption enabled ... */
> if (sched_work) {
> do stack trace() // can schedule here but
> // calls a schedule function that does not
> // do sched_work to prevent recursion
> }
> }
>
> Could something like this work?
Yeah, this is basically a more fleshed out version of what I was trying
to propose.
One additional wrinkle is that if @prev wakes up on another CPU while
@next is unwinding it, the unwind goes haywire. So that would maybe
need to be prevented.
--
Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-24 22:46 ` Josh Poimboeuf
@ 2025-01-24 22:50 ` Josh Poimboeuf
2025-01-24 23:57 ` Steven Rostedt
0 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-24 22:50 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Mathieu Desnoyers, x86, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski, Masami Hiramatsu,
Weinan Liu
On Fri, Jan 24, 2025 at 02:46:48PM -0800, Josh Poimboeuf wrote:
> On Fri, Jan 24, 2025 at 04:58:03PM -0500, Steven Rostedt wrote:
> > Now the only thing I could think of is a flag that gets set so the task does
> > the stack trace when it comes out of the scheduler. It doesn't need to do
> > the stack trace before it schedules. As it did just schedule, wherever it
> > scheduled from must have been in a schedulable context.
> >
> > That is, kind of like the task_work flag for entering user space and
> > exiting the kernel, could we have a sched_work flag to run after being
> > scheduled back (exiting schedule())? Since the task has been picked to run,
> > it will not cause latency for other tasks. The work will be done in its
> > context. As far as the task's accounting goes, this is no different from
> > doing it on the way back to user space. Heck, it would only need to do this
> > once if it didn't go back to user space, as the user space stack would be
> > the same. That is, if it gets scheduled multiple times, this would only
> > happen on the first instance until it leaves the kernel.
> >
> >
> > [ trigger stack trace - set sched_work ]
> >
> > schedule() {
> > context_switch() -> CPU runs some other task
> > <- gets scheduled back onto the CPU
> > [..]
> > /* preemption enabled ... */
> > if (sched_work) {
> > do stack trace() // can schedule here but
> > // calls a schedule function that does not
> > // do sched_work to prevent recursion
> > }
> > }
> >
> > Could something like this work?
>
> Yeah, this is basically a more fleshed out version of what I was trying
> to propose.
>
> One additional wrinkle is that if @prev wakes up on another CPU while
> @next is unwinding it, the unwind goes haywire. So that would maybe
> need to be prevented.
Hm, reading this again I'm wondering if you're actually proposing that
the unwind happens on @prev after it gets rescheduled sometime in the
future? Does that actually solve the issue? What if it doesn't get
rescheduled within a reasonable amount of time?
--
Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-24 22:50 ` Josh Poimboeuf
@ 2025-01-24 23:57 ` Steven Rostedt
2025-01-30 20:21 ` Steven Rostedt
0 siblings, 1 reply; 161+ messages in thread
From: Steven Rostedt @ 2025-01-24 23:57 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Peter Zijlstra, Mathieu Desnoyers, x86, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski, Masami Hiramatsu,
Weinan Liu
On Fri, 24 Jan 2025 14:50:11 -0800
Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> Hm, reading this again I'm wondering if you're actually proposing that
> the unwind happens on @prev after it gets rescheduled sometime in the
> future? Does that actually solve the issue? What if it doesn't get
> rescheduled within a reasonable amount of time?
Correct, it would be prev doing the unwinding and not next, but only once
prev is scheduled back onto the CPU. That way it's only blocking itself.
The use case people had for this was measuring the time a task spends off
the CPU. They can't get that time until the task schedules back anyway.
The complaint was that it could be a very long system call, with lots of
sleeps, and they couldn't do the processing in the meantime.
I can go back and ask, but I'm pretty sure doing the unwind when a task
comes back to the CPU would be sufficient.
-- Steve
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents
2025-01-24 21:41 ` Josh Poimboeuf
@ 2025-01-28 0:39 ` Andrii Nakryiko
2025-01-28 10:50 ` Jens Remus
2025-01-28 10:54 ` Jens Remus
2025-01-30 19:51 ` Weinan Liu
1 sibling, 2 replies; 161+ messages in thread
From: Andrii Nakryiko @ 2025-01-28 0:39 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Fri, Jan 24, 2025 at 1:41 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Fri, Jan 24, 2025 at 10:02:46AM -0800, Andrii Nakryiko wrote:
> > On Tue, Jan 21, 2025 at 6:32 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > > +static __always_inline int __read_fde(struct sframe_section *sec,
> > > + unsigned int fde_num,
> > > + struct sframe_fde *fde)
> > > +{
> > > + unsigned long fde_addr, ip;
> > > +
> > > + fde_addr = sec->fdes_start + (fde_num * sizeof(struct sframe_fde));
> > > + unsafe_copy_from_user(fde, (void __user *)fde_addr,
> > > + sizeof(struct sframe_fde), Efault);
> > > +
> > > + ip = sec->sframe_start + fde->start_addr;
> > > + if (ip < sec->text_start || ip > sec->text_end)
> >
> > ip >= sec->text_end ? ip == sec->text_end doesn't make sense, no?
>
> Believe it or not, this may have been intentional: 'text_end' can
> actually be a valid stack return address if the last instruction of the
> section is a call to a noreturn function. The unwinder still needs to
> know the state of the stack at that point.
>
TIL...
> That said, there would be no reason to put an FDE at the end. So yeah,
> it should be '>='.
>
> > > +static __always_inline int __read_fre(struct sframe_section *sec,
> > > + struct sframe_fde *fde,
> > > + unsigned long fre_addr,
> > > + struct sframe_fre *fre)
> > > +{
> > > + unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
> > > + unsigned char fre_type = SFRAME_FUNC_FRE_TYPE(fde->info);
> > > + unsigned char offset_count, offset_size;
> > > + s32 ip_off, cfa_off, ra_off, fp_off;
> > > + unsigned long cur = fre_addr;
> > > + unsigned char addr_size;
> > > + u8 info;
> > > +
> > > + addr_size = fre_type_to_size(fre_type);
> > > + if (!addr_size)
> > > + return -EFAULT;
> > > +
> > > + if (fre_addr + addr_size + 1 > sec->fres_end)
> >
> > nit: isn't this the same as `fre_addr + addr_size >= sec->fres_end` ?
>
> Yes, but that would further obfuscate the fact that this is validating
> that the next two reads don't go past sec->fres_end:
>
> UNSAFE_GET_USER_INC(ip_off, cur, addr_size, Efault);
> UNSAFE_GET_USER_INC(info, cur, 1, Efault);
>
> The explicit "1" is for that second read.
>
> I'll do something like SFRAME_FRE_INFO_SIZE instead of 1 to make that
> clearer.
yeah, I definitely didn't connect 1 with the size of fre info byte,
named #define makes sense, thanks!
>
> > > + return -EFAULT;
> > > +
> > > + UNSAFE_GET_USER_INC(ip_off, cur, addr_size, Efault);
> > > + if (fde_type == SFRAME_FDE_TYPE_PCINC && ip_off > fde->func_size)
> >
> > is ip_off == fde->func_size allowable?
>
> This was another one where I was probably thinking about the whole "last
> insn could be a call to noreturn" scenario.
>
> But even in that case, the frame would be unchanged after the call, so
> there shouldn't need to be an FRE there.
>
> In fact, for the unwinder to deal with that scenario, for a call frame
> (as opposed to an interrupted one), it should always subtract one from
> the return address before calling into sframe so that it points to the
> calling instruction. </me makes note to go do that>
>
> So yeah, this looks like another off-by-one, caused by overthinking yet
> not thinking enough.
>
> > > + return -EFAULT;
> > > +
> > > + UNSAFE_GET_USER_INC(info, cur, 1, Efault);
> > > + offset_count = SFRAME_FRE_OFFSET_COUNT(info);
> > > + offset_size = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(info));
> > > + if (!offset_count || !offset_size)
> > > + return -EFAULT;
> > > +
> > > + if (cur + (offset_count * offset_size) > sec->fres_end)
> >
> > offset_count * offset_size done in u8 can overflow, no? maybe upcast
> > to unsigned long or use check_add_overflow?
>
> offset_size is <= 2 as returned by offset_size_enum_to_size().
>
> offset_count is expected to be <= 3, enforced by the !offset_count check
> at the bottom.
>
> An overflow here would be harmless as it would be caught by the
> !offset_count anyway. Though I also notice offset_count isn't big
> enough to hold the 2-byte SFRAME_FRE_OFFSET_COUNT() value. Which is
> harmless for the same reason, but yeah I'll make offset_count an
> unsigned int.
yeah, the less whoever is reading has to worry about overflows, the
better, thanks!
>
> > > +static __always_inline int __find_fre(struct sframe_section *sec,
> > > + struct sframe_fde *fde, unsigned long ip,
> > > + struct unwind_user_frame *frame)
> > > +{
> > > + unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
> > > + struct sframe_fre *fre, *prev_fre = NULL;
> > > + struct sframe_fre fres[2];
> >
> > you only need prev_fre->ip_off, so why all this `which` and `fres[2]`
> > business if all you need is prev_fre_off and a bool whether you have
> > prev_fre at all?
>
> So this cleverness probably needs a comment. prev_fre is actually
> needed after the loop:
>
> > > + if (!prev_fre)
> > > + return -EINVAL;
> > > + fre = prev_fre;
>
> In the body of the loop, prev_fre is a tentative match, unless the next
> fre also matches, in which case that one becomes the new tentative
> match.
>
> I'll add a comment. Also it'll probably be less confusing if I rename
> "prev_fre" to "fre", and "fre" to "next_fre".
ah, yeah, I missed that, you are right
>
> > > + unsigned long fre_addr;
> > > + bool which = false;
> > > + unsigned int i;
> > > + s32 ip_off;
> > > +
> > > + ip_off = (s32)(ip - sec->sframe_start) - fde->start_addr;
> > > +
> > > + if (fde_type == SFRAME_FDE_TYPE_PCMASK)
> > > + ip_off %= fde->rep_size;
> >
> > did you check that fde->rep_size is not zero?
>
> I did not, good catch!
>
> > > + fre_addr = sec->fres_start + fde->fres_off;
> > > +
> > > + for (i = 0; i < fde->fres_num; i++) {
> >
> > why not binary search? seem more logical to guard against cases with
> > lots of FREs and be pretty fast in common case anyways.
>
> That would be nice, but the FREs are variable-sized and you don't know
> how big one is until you start reading it.
ah, another non-obvious thing, yeah... do you think it's worth fixing
this and making FREs binary searchable in v3?
Indu, do you have some stats on distribution of FRE count per FDE in practice?
Tbh, FRE format is bothering me quite a lot... but let's discuss that
in another thread with you and Indu
>
> --
> Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 32/39] perf: Remove get_perf_callchain() 'crosstask' argument
2025-01-24 22:00 ` Josh Poimboeuf
@ 2025-01-28 0:39 ` Andrii Nakryiko
0 siblings, 0 replies; 161+ messages in thread
From: Andrii Nakryiko @ 2025-01-28 0:39 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Fri, Jan 24, 2025 at 2:00 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Fri, Jan 24, 2025 at 10:13:23AM -0800, Andrii Nakryiko wrote:
> > On Tue, Jan 21, 2025 at 6:32 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > > @@ -430,10 +429,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
> > > if (task && user && !user_mode(regs))
> > > goto err_fault;
> > >
> > > - /* get_perf_callchain does not support crosstask user stack walking
> > > - * but returns an empty stack instead of NULL.
> > > - */
> > > - if (crosstask && user) {
> > > + /* get_perf_callchain() does not support crosstask stack walking */
> > > + if (crosstask) {
> >
> > crosstask stack trace is supported for kernel stack traces (see
> > get_callchain_entry_for_task() call), so this is breaking that case
>
> Oh I see, thanks.
>
> BTW, that seems dubious, does it do anything to ensure the task isn't
> running? Otherwise the unwind is going to be a wild ride.
Yeah, I think it's very speculative and doesn't pause the task in any
way (just makes sure it doesn't go away). We just rely on
stack_trace_save_tsk() -> arch_stack_walk(), which just optimistically
tries to unwind, it seems.
It's still useful and if the user is prepared to handle a potentially
garbage stack trace, why not. People do a similar thing for user space
stack traces (with custom BPF code), and it's very useful (even if not
"reliable" by any means).
>
> --
> Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-01-24 22:13 ` Indu Bhagat
@ 2025-01-28 1:10 ` Andrii Nakryiko
2025-01-29 2:02 ` Josh Poimboeuf
` (2 more replies)
0 siblings, 3 replies; 161+ messages in thread
From: Andrii Nakryiko @ 2025-01-28 1:10 UTC (permalink / raw)
To: Indu Bhagat, Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Fri, Jan 24, 2025 at 2:14 PM Indu Bhagat <indu.bhagat@oracle.com> wrote:
>
> On 1/24/25 11:21 AM, Josh Poimboeuf wrote:
> > On Fri, Jan 24, 2025 at 10:00:52AM -0800, Andrii Nakryiko wrote:
> >> On Tue, Jan 21, 2025 at 6:32 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> >>> +static inline int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end, unsigned long text_start, unsigned long text_end) { return -ENOSYS; }
> >>
> >> nit: very-very long, wrap it?
> >
> > That was intentional as it's just an empty stub, but yeah, maybe 160
> > chars is a bit much.
> >
> >>> + if (shdr.preamble.magic != SFRAME_MAGIC ||
> >>> + shdr.preamble.version != SFRAME_VERSION_2 ||
> >>> + !(shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
> >>
> >> probably more a question to Indu, but why is this sorting not
> >> mandatory and part of SFrame "standard"? How realistically non-sorted
> >> FDEs would work in practice? Ain't nobody got time to sort them just
> >> to unwind the stack...
> >
> > No idea...
> >
> >>> + if (!shdr.num_fdes || !shdr.num_fres) {
> >>
> >> given SFRAME_F_FRAME_POINTER in the header, is it really that
> >> nonsensical and illegal to have zero FDEs/FREs? Maybe we should allow
> >> that?
> >
> > It would seem a bit silly to create an empty .sframe section just to set
> > that SFRAME_F_FRAME_POINTER bit. Regardless, there's nothing the kernel
> > can do with that.
> >
>
> Yes, in theory, it is allowed (as per the specification) to have an
> SFrame section with zero number of FDEs/FREs. But since such a section
> will not be useful, I share the opinion that it makes sense to disallow
> it in the current unwinding contexts, for now (JIT usecase may change
> things later).
>
I disagree, actually. If it's a legal thing, it shouldn't be randomly
rejected. If we later make use of it, we'd have to worry about
accidentally causing problems on older kernels that arbitrarily rejected
empty FDE tables just because they didn't seem useful at some point
(despite causing no issues).
> SFRAME_F_FRAME_POINTER flag is not being set currently by GAS/GNU ld at all.
>
> >>> + dbg("no fde/fre entries\n");
> >>> + return -EINVAL;
> >>> + }
> >>> +
> >>> + header_end = sec->sframe_start + SFRAME_HEADER_SIZE(shdr);
> >>> + if (header_end >= sec->sframe_end) {
> >>
> >> if we allow zero FDEs/FREs, header_end == sec->sframe_end is legal, right?
> >
> > I suppose so, but again I'm not seeing any reason to support that.
Let's invert this. Is there any reason why it shouldn't be supported? ;)
> >
> >>> + dbg("header doesn't fit in section\n");
> >>> + return -EINVAL;
> >>> + }
> >>> +
> >>> + num_fdes = shdr.num_fdes;
> >>> + fdes_start = header_end + shdr.fdes_off;
> >>> + fdes_end = fdes_start + (num_fdes * sizeof(struct sframe_fde));
> >>> +
> >>> + fres_start = header_end + shdr.fres_off;
> >>> + fres_end = fres_start + shdr.fre_len;
> >>> +
> >>
> >> maybe use check_add_overflow() in all the above calculation, at least
> >> on 32-bit arches this all can overflow and it's not clear if below
> >> sanity check detects all possible overflows
> >
> > Ok, I'll look into it.
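For reference, the kernel's check_add_overflow()/check_mul_overflow() wrap the compiler builtins, so the bounds computation above could be hardened roughly like this. This is only a sketch; the function name and parameters are illustrative, not the actual patch code:

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: validate an FDE/FRE sub-range with explicit
 * overflow checks, in the spirit of the kernel's check_add_overflow(),
 * shown here via the underlying __builtin_*_overflow() primitives.
 */
static bool sframe_range_ok(unsigned long header_end, uint32_t off,
			    uint32_t count, size_t entry_size,
			    unsigned long section_end)
{
	unsigned long start, len, end;

	if (__builtin_add_overflow(header_end, off, &start))
		return false;
	if (__builtin_mul_overflow((unsigned long)count, entry_size, &len))
		return false;
	if (__builtin_add_overflow(start, len, &end))
		return false;

	/* Only now is the comparison against the section end meaningful. */
	return end <= section_end;
}
```

On 32-bit, `fdes_start + num_fdes * sizeof(struct sframe_fde)` can wrap and still pass a naive `end > sframe_end` check; the builtins catch that before the comparison.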
> >
> >>> +struct sframe_preamble {
> >>> + u16 magic;
> >>> + u8 version;
> >>> + u8 flags;
> >>> +} __packed;
> >>> +
> >>> +struct sframe_header {
> >>> + struct sframe_preamble preamble;
> >>> + u8 abi_arch;
> >>> + s8 cfa_fixed_fp_offset;
> >>> + s8 cfa_fixed_ra_offset;
> >>> + u8 auxhdr_len;
> >>> + u32 num_fdes;
> >>> + u32 num_fres;
> >>> + u32 fre_len;
> >>> + u32 fdes_off;
> >>> + u32 fres_off;
> >>> +} __packed;
> >>> +
> >>> +struct sframe_fde {
> >>> + s32 start_addr;
> >>> + u32 func_size;
> >>> + u32 fres_off;
> >>> + u32 fres_num;
> >>> + u8 info;
> >>> + u8 rep_size;
> >>> + u16 padding;
> >>> +} __packed;
> >>
> >> I couldn't understand from SFrame itself, but why do sframe_header,
> >> sframe_preamble, and sframe_fde have to be marked __packed, if it's
> >> all naturally aligned (intentionally and by design)?..
> >
> > Right, but the spec says they're all packed. Maybe the point is that
> > some future sframe version is free to introduce unaligned fields.
> >
>
> SFrame specification aims to keep SFrame header and SFrame FDE members
> at aligned boundaries in future versions.
>
> Only SFrame FRE related accesses may have unaligned accesses.
Yeah, and it's actually bothering me quite a lot :) I have a tentative
proposal, maybe we can discuss this for SFrame v3? Let me briefly
outline the idea.
So, currently in v2, FREs within FDEs use an array-of-structs layout.
If we use pseudo-C type definitions, it would be something like this
for FDE + its FREs:
struct FDE_and_FREs {
struct sframe_func_desc_entry fde_metadata;
union FRE {
struct FRE8 {
u8 sfre_start_address;
u8 sfre_info;
u8|u16|u32 offsets[M];
}
struct FRE16 {
u16 sfre_start_address;
u16 sfre_info;
u8|u16|u32 offsets[M];
}
struct FRE32 {
u32 sfre_start_address;
u32 sfre_info;
u8|u16|u32 offsets[M];
}
} fres[N] __packed;
};
where all fres[i]s are one of those FRE8/FRE16/FRE32, so start
addresses have the same size, but each FRE has potentially different
offsets sizing, so there is no common alignment, and so everything has
to be packed and unaligned.
But what if we take a struct-of-arrays approach and represent it more like:
struct FDE_and_FREs {
struct sframe_func_desc_entry fde_metadata;
u8|u16|u32 start_addrs[N]; /* can extend to u64 as well */
u8 sfre_infos[N];
u8 offsets8[M8];
u16 offsets16[M16] __aligned(2);
u32 offsets32[M32] __aligned(4);
/* we can naturally extend to support also u64 offsets */
};
i.e., we split all FRE records into their three constituents: start
addresses, info bytes, and then each FRE can fall into either 8-, 16-,
or 32-bit offsets "bucket". We collect all the offsets, depending on
their size, into these aligned offsets{8,16,32} arrays (with natural
extension to 64 bits, if necessary), with at most wasting 1-3 bytes to
ensure proper alignment everywhere.
Note, at this point we need to decide if we want to make FREs binary
searchable or not.
If not, we don't really need anything extra. As we process each
start_addrs[i] and sfre_infos[i] to find matching FRE, we keep track
of how many 8-, 16-, and 32-bit offsets already processed FREs
consumed, and when we find the right one, we know exactly the starting
index within offset{8,16,32}. Done.
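A minimal sketch of that sequential scan, with made-up struct and field names (the real format would derive offset count/size from each FRE's info byte; here they're faked as parallel arrays):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative struct-of-arrays FDE view; not any real SFrame version. */
struct soa_fde {
	const uint32_t *start_addrs;	/* N sorted start addresses */
	const uint8_t *offset_counts;	/* offsets per FRE (from info byte) */
	const uint8_t *offset_sizes;	/* 1, 2, or 4 (from info byte) */
	size_t num_fres;
};

/*
 * Find the FRE covering 'addr' (last one with start <= addr); return its
 * index and the starting index within its offsets{8,16,32} bucket.
 * O(N) scan, no extra per-FRE index needed.
 */
static bool soa_find_fre(const struct soa_fde *fde, uint32_t addr,
			 size_t *fre_idx, size_t *bucket_idx)
{
	size_t consumed[3] = { 0, 0, 0 };	/* 8-, 16-, 32-bit buckets */
	bool found = false;

	for (size_t i = 0; i < fde->num_fres; i++) {
		int b = fde->offset_sizes[i] == 1 ? 0 :
			fde->offset_sizes[i] == 2 ? 1 : 2;

		if (fde->start_addrs[i] > addr)
			break;
		*fre_idx = i;
		*bucket_idx = consumed[b];	/* offsets consumed so far */
		found = true;
		consumed[b] += fde->offset_counts[i];
	}
	return found;
}
```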
But if we were to make FREs binary searchable, we need to basically
have an index of offset pointers to quickly find offsetsX[j] position
corresponding to FRE #i. For that, we can have an extra array right
next to start_addrs, "semantically parallel" to it:
u8|u16|u32 start_addrs[N];
u8|u16|u32 offset_idxs[N];
where start_addrs[i] corresponds to offset_idxs[i], and offset_idxs[i]
points to the first offset corresponding to FRE #i in offsetX[] array
(depending on FRE's "bitness"). This is a bit more storage for this
offset index, but for FDEs with lots of FREs this might be a
worthwhile tradeoff.
A few points:
a) we can decide this "binary searchability" per-FDE, and for FDEs
with 1-2-3 FREs not bother, while those with more FREs would be
searchable ones with index. So we can combine both fast lookups,
natural alignment of on-disk format, and compactness. The presence of
index is just another bit in FDE metadata.
b) bitness of offset_idxs[] can be coupled with bitness of
start_addrs (for simplicity), or could be completely independent and
identified by FDE's metadata (2 more bits to define this just like
start_addr bitness is defined). Independent probably would be my
preference, with linker (or whoever will be producing .sframe data)
can pick the smallest bitness that is sufficient to represent
everything.
Yes, it's a bit more complicated to draw and explain, but everything
will be nicely aligned, extensible to 64 bits, and (optionally at
least) binary searchable. Implementation-wise on the kernel side it
shouldn't be significantly more involved. Maybe the compiler would
need to be a bit smarter when producing FDE data, but it's no rocket
science.
Thoughts?
* Re: [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents
2025-01-28 0:39 ` Andrii Nakryiko
@ 2025-01-28 10:50 ` Jens Remus
2025-01-29 2:04 ` Josh Poimboeuf
2025-01-28 10:54 ` Jens Remus
1 sibling, 1 reply; 161+ messages in thread
From: Jens Remus @ 2025-01-28 10:50 UTC (permalink / raw)
To: Andrii Nakryiko, Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Mathieu Desnoyers,
Florian Weimer, Andy Lutomirski, Masami Hiramatsu, Weinan Liu,
Heiko Carstens, Vasily Gorbik
On 28.01.2025 01:39, Andrii Nakryiko wrote:
> On Fri, Jan 24, 2025 at 1:41 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>> On Fri, Jan 24, 2025 at 10:02:46AM -0800, Andrii Nakryiko wrote:
>>> On Tue, Jan 21, 2025 at 6:32 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>>>> + UNSAFE_GET_USER_INC(info, cur, 1, Efault);
>>>> + offset_count = SFRAME_FRE_OFFSET_COUNT(info);
>>>> + offset_size = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(info));
>>>> + if (!offset_count || !offset_size)
>>>> + return -EFAULT;
>>>> +
>>>> + if (cur + (offset_count * offset_size) > sec->fres_end)
>>>
>>> offset_count * offset_size done in u8 can overflow, no? maybe upcast
>>> to unsigned long or use check_add_overflow?
The maximum offset_count * offset_size is 15 * 4 = 60 if I am not wrong:
>> offset_size is <= 2 as returned by offset_size_enum_to_size().
SFrame V2 FRE offset sizes are either 1, 2, or 4 bytes. This is also
reflected in offset_size_enum_to_size().
>> offset_count is expected to be <= 3, enforced by the !offset_count check
>> at the bottom.
SFrame V2 FRE offset count is 4 bits unsigned, so 0 <= offset_count <= 15.
>> An overflow here would be harmless as it would be caught by the
>> !offset_count anyway. Though I also notice offset_count isn't big
>> enough to hold the 2-byte SFRAME_FRE_OFFSET_COUNT() value. Which is
>> harmless for the same reason, but yeah I'll make offset_count an
>> unsigned int.
As mentioned above the FRE offset count is 4 bits, not 2 bytes. This is
also reflected in SFRAME_FRE_OFFSET_COUNT().
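For concreteness, the info-byte layout Jens describes (4-bit offset count, 2-bit offset-size enum) can be sketched as below. The bit positions are modeled on the binutils sframe.h accessors and should be treated as an assumption to verify against the SFrame specification:

```c
#include <stdint.h>

/* Sketch of SFrame V2 FRE info-byte decoding; bit layout assumed. */
#define FRE_CFA_BASE_REG_ID(info)	((info) & 0x1)
#define FRE_OFFSET_COUNT(info)		(((info) >> 1) & 0xf)	/* 0..15 */
#define FRE_OFFSET_SIZE(info)		(((info) >> 5) & 0x3)	/* size enum */
#define FRE_MANGLED_RA_P(info)		(((info) >> 7) & 0x1)

static unsigned int offset_size_enum_to_size(unsigned int fre_type)
{
	switch (fre_type) {
	case 0: return 1;
	case 1: return 2;
	case 2: return 4;
	default: return 0;	/* invalid encoding */
	}
}
```

So the worst case is indeed 15 offsets of 4 bytes each, i.e. 60 bytes, which fits comfortably in an `unsigned int` without overflow.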
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
+49-7031-16-1128 Office
jremus@de.ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
* Re: [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents
2025-01-28 0:39 ` Andrii Nakryiko
2025-01-28 10:50 ` Jens Remus
@ 2025-01-28 10:54 ` Jens Remus
1 sibling, 0 replies; 161+ messages in thread
From: Jens Remus @ 2025-01-28 10:54 UTC (permalink / raw)
To: Andrii Nakryiko, Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Mathieu Desnoyers,
Florian Weimer, Andy Lutomirski, Masami Hiramatsu, Weinan Liu
On 28.01.2025 01:39, Andrii Nakryiko wrote:
> On Fri, Jan 24, 2025 at 1:41 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>> On Fri, Jan 24, 2025 at 10:02:46AM -0800, Andrii Nakryiko wrote:
>>> On Tue, Jan 21, 2025 at 6:32 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>>>> + fre_addr = sec->fres_start + fde->fres_off;
>>>> +
>>>> + for (i = 0; i < fde->fres_num; i++) {
>>>
>>> why not binary search? seem more logical to guard against cases with
>>> lots of FREs and be pretty fast in common case anyways.
>>
>> That would be nice, but the FREs are variable-sized and you don't know
>> how big one is until you start reading it.
>
> ah, another non-obvious thing, yeah... do you think it's worth fixing
> this and making FREs binary searchable in v3?
>
> Indu, do you have some stats on distribution of FRE count per FDE in practice?
>
> Tbh, FRE format is bothering me quite a lot... but let's discuss that
> in another thread with you and Indu
I would be interested to be part of that discussion. I could share some
preliminary stats from my s390 work.
Thanks and regards,
Jens
* Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-01-28 1:10 ` Andrii Nakryiko
@ 2025-01-29 2:02 ` Josh Poimboeuf
2025-01-30 0:02 ` Andrii Nakryiko
` (2 more replies)
2025-01-30 21:21 ` Indu Bhagat
2025-02-05 11:01 ` Jens Remus
2 siblings, 3 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-29 2:02 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Indu Bhagat, x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Mon, Jan 27, 2025 at 05:10:27PM -0800, Andrii Nakryiko wrote:
> > Yes, in theory, it is allowed (as per the specification) to have an
> > SFrame section with zero number of FDEs/FREs. But since such a section
> > will not be useful, I share the opinion that it makes sense to disallow
> > it in the current unwinding contexts, for now (JIT usecase may change
> > things later).
> >
>
> I disagree, actually. If it's a legal thing, it shouldn't be randomly
> rejected. If we later make use of that, we'd have to worry not to
> accidentally cause problems on older kernels that arbitrarily rejected
> empty FDE just because it didn't make sense at some point (without
> causing any issues).
If such older kernels don't do anything with the section anyway, what's
the point of pretending they do?
Returning an error would actually make more sense as it communicates
that the kernel doesn't support whatever hypothetical thing you're
trying to do with 0 FDEs.
> > SFRAME_F_FRAME_POINTER flag is not being set currently by GAS/GNU ld at all.
> >
> > >>> + dbg("no fde/fre entries\n");
> > >>> + return -EINVAL;
> > >>> + }
> > >>> +
> > >>> + header_end = sec->sframe_start + SFRAME_HEADER_SIZE(shdr);
> > >>> + if (header_end >= sec->sframe_end) {
> > >>
> > >> if we allow zero FDEs/FREs, header_end == sec->sframe_end is legal, right?
> > >
> > > I suppose so, but again I'm not seeing any reason to support that.
>
> Let's invert this. Is there any reason why it shouldn't be supported? ;)
It's simple, we don't add code to "support" some vague hypothetical.
For whatever definition of "support", since there's literally nothing
the kernel can do with that.
> But what if we take a struct-of-arrays approach and represent it more like:
>
> struct FDE_and_FREs {
> struct sframe_func_desc_entry fde_metadata;
> u8|u16|u32 start_addrs[N]; /* can extend to u64 as well */
> u8 sfre_infos[N];
> u8 offsets8[M8];
> u16 offsets16[M16] __aligned(2);
> u32 offsets32[M32] __aligned(4);
> /* we can naturally extend to support also u64 offsets */
> };
>
> i.e., we split all FRE records into their three constituents: start
> addresses, info bytes, and then each FRE can fall into either 8-, 16-,
> or 32-bit offsets "bucket". We collect all the offsets, depending on
> their size, into these aligned offsets{8,16,32} arrays (with natural
> extension to 64 bits, if necessary), with at most wasting 1-3 bytes to
> ensure proper alignment everywhere.
Makes sense. Though I also have another idea below.
> Note, at this point we need to decide if we want to make FREs binary
> searchable or not.
>
> If not, we don't really need anything extra. As we process each
> start_addrs[i] and sfre_infos[i] to find matching FRE, we keep track
> of how many 8-, 16-, and 32-bit offsets already processed FREs
> consumed, and when we find the right one, we know exactly the starting
> index within offset{8,16,32}. Done.
>
> But if we were to make FREs binary searchable, we need to basically
> have an index of offset pointers to quickly find offsetsX[j] position
> corresponding to FRE #i. For that, we can have an extra array right
> next to start_addrs, "semantically parallel" to it:
>
> u8|u16|u32 start_addrs[N];
> u8|u16|u32 offset_idxs[N];
Binary search would definitely help. I did a crude histogram for "FREs
per FDE" for a few binaries on my test system:
gdb (the biggest binary on my test system):
10th Percentile: 1
20th Percentile: 1
30th Percentile: 1
40th Percentile: 3
50th Percentile: 5
60th Percentile: 8
70th Percentile: 11
80th Percentile: 14
90th Percentile: 16
100th Percentile: 472
bash:
10th Percentile: 1
20th Percentile: 1
30th Percentile: 3
40th Percentile: 5
50th Percentile: 7
60th Percentile: 9
70th Percentile: 12
80th Percentile: 16
90th Percentile: 17
100th Percentile: 46
glibc:
10th Percentile: 1
20th Percentile: 1
30th Percentile: 1
40th Percentile: 1
50th Percentile: 4
60th Percentile: 6
70th Percentile: 9
80th Percentile: 14
90th Percentile: 16
100th Percentile: 44
libpython:
10th Percentile: 1
20th Percentile: 3
30th Percentile: 4
40th Percentile: 6
50th Percentile: 8
60th Percentile: 11
70th Percentile: 12
80th Percentile: 16
90th Percentile: 20
100th Percentile: 112
So binary search would help in a lot of cases.
However, if we're going that route, we might want to even consider a
completely revamped data layout. For example:
One insight is that the vast majority of (cfa, fp, ra) tuples aren't
unique. They could be deduped by storing the unique tuples in a
standalone 'fre_data' array which is referenced by another
address-specific array.
struct fre_data {
s8|s16|s32 cfa, fp, ra;
u8 info;
};
struct fre_data fre_data[num_fre_data];
The storage sizes of cfa/fp/ra can be a constant specified in the global
sframe header. It's fine all being the same size as it looks like this
array wouldn't typically be more than a few hundred entries anyway.
Then there would be an array of sorted FRE entries which reference the
fre_data[] entries:
struct fre {
s32|s64 start_address;
u8|u16 fre_data_idx;
} __packed;
struct fre fres[num_fres];
(For alignment reasons that should probably be two separate arrays, even
though not ideal for cache locality)
Here again the field storage sizes would be specified in the global
sframe header.
Note FDEs aren't even needed here as the unwinder doesn't need to know
when a function begins/ends. The only info needed by the unwinder is
just the fre_data struct. So a simple binary search of fres[] is all
that's really needed.
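A sketch of that lookup over the deduplicated layout (all names and field widths are illustrative; the real format would select them via the global header, and `fres[]` would likely be split into parallel arrays for alignment):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical deduplicated unwind data: unique (cfa, fp, ra) tuples. */
struct fre_data {
	int32_t cfa, fp, ra;
	uint8_t info;
};

/* Sorted per-address entries referencing fre_data[]. */
struct fre {
	int32_t start_address;	/* sorted ascending */
	uint16_t fre_data_idx;	/* index into fre_data[] */
};

/* Binary search: return data for the last FRE with start_address <= addr. */
static const struct fre_data *fre_lookup(const struct fre *fres, size_t n,
					 const struct fre_data *data,
					 int32_t addr)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (fres[mid].start_address <= addr)
			lo = mid + 1;
		else
			hi = mid;
	}
	if (!lo)
		return NULL;	/* addr precedes all FREs */
	return &data[fres[lo - 1].fre_data_idx];
}
```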
But wait, there's more...
The binary search could be made significantly faster using a small fast
lookup array which is indexed evenly across the text address offset
space, similar to what ORC does:
u32 fre_chunks[num_chunks];
The text address space (starting at offset 0) can be split into
'num_chunks' chunks of size 'chunk_size'. The value of
fre_chunks[offset/chunk_size] is an index into the fres[] array.
Taking my gdb binary as an example:
.text is 6417058 bytes, with 146997 total sframe FREs. Assuming a chunk
size of 1024, fre_chunks[] needs 6417058/1024 = 6267 entries.
For unwinding at text offset 0x400000, the index into fre_chunks[] would
be 0x400000/1024 = 4096. If fre_chunks[4096] == 96074 and
fre_chunks[4096+1] == 96098, you need only do a binary search of the 24
entries between fres[96074] && fres[96098] rather than searching the
entire 146997-entry array.
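The chunk-table narrowing described above can be sketched as follows; the names and the 1024-byte chunk size are assumptions for illustration:

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE 1024

/*
 * ORC-style fast lookup: fre_chunks[c] holds the index of the first FRE
 * at or after text offset c * CHUNK_SIZE.  A lookup then only needs to
 * binary-search the fres[] slice [*lo, *hi) between adjacent entries.
 */
static void fre_search_range(const uint32_t *fre_chunks, size_t num_chunks,
			     size_t num_fres, uint64_t text_offset,
			     uint32_t *lo, uint32_t *hi)
{
	size_t c = text_offset / CHUNK_SIZE;

	*lo = fre_chunks[c];
	*hi = (c + 1 < num_chunks) ? fre_chunks[c + 1] : (uint32_t)num_fres;
}
```

With the gdb numbers above (fre_chunks[4096] == 96074, fre_chunks[4097] == 96098), the narrowed slice is 24 entries, i.e. about log2(24) ≈ 4.6 binary-search iterations.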
.sframe size calculation:
374 unique fre_data entries (out of 146997 total FREs!)
= 374 * (2 * 3) = 2244 bytes
146997 fre entries
= 146997 * (4 + 2) = 881982 bytes
.text size 6417058 (chunk_size = 1024, num_chunks=6267)
= 6267 * 4 = 25068 bytes
Approximate total .sframe size would be 2244 + 881982 + 25068 = ~888k,
plus negligible header size. Which is smaller than the v2 .sframe on my
gdb binary (985k).
With the chunked lookup table, the avg lookup is:
log2(146997/6267) = ~4.5 iterations
whereas a full binary search would be:
log2(146997) = 17 iterations
So assuming I got all that math right, it's over 3x faster and the
binary is smaller (or at least should be roughly comparable).
Of course the downside is it's an all new format. Presumably the linker
would need to do more work than it's currently doing, e.g., find all the
duplicates and set up the data accordingly.
--
Josh
* Re: [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents
2025-01-28 10:50 ` Jens Remus
@ 2025-01-29 2:04 ` Josh Poimboeuf
0 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-01-29 2:04 UTC (permalink / raw)
To: Jens Remus
Cc: Andrii Nakryiko, x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Mathieu Desnoyers,
Florian Weimer, Andy Lutomirski, Masami Hiramatsu, Weinan Liu,
Heiko Carstens, Vasily Gorbik
On Tue, Jan 28, 2025 at 11:50:25AM +0100, Jens Remus wrote:
> On 28.01.2025 01:39, Andrii Nakryiko wrote:
> > On Fri, Jan 24, 2025 at 1:41 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > > On Fri, Jan 24, 2025 at 10:02:46AM -0800, Andrii Nakryiko wrote:
> > > > On Tue, Jan 21, 2025 at 6:32 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> > > > > + UNSAFE_GET_USER_INC(info, cur, 1, Efault);
> > > > > + offset_count = SFRAME_FRE_OFFSET_COUNT(info);
> > > > > + offset_size = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(info));
> > > > > + if (!offset_count || !offset_size)
> > > > > + return -EFAULT;
> > > > > +
> > > > > + if (cur + (offset_count * offset_size) > sec->fres_end)
> > > >
> > > > offset_count * offset_size done in u8 can overflow, no? maybe upcast
> > > > to unsigned long or use check_add_overflow?
>
> The maximum offset_count * offset_size is 15 * 4 = 60 if I am not wrong:
>
> > > offset_size is <= 2 as returned by offset_size_enum_to_size().
>
> SFrame V2 FRE offset sizes are either 1, 2, or 4 bytes. This is also
> reflected in offset_size_enum_to_size().
>
> > > offset_count is expected to be <= 3, enforced by the !offset_count check
> > > at the bottom.
>
> SFrame V2 FRE offset count is 4 bits unsigned, so 0 <= offset_count <= 15.
> > > An overflow here would be harmless as it would be caught by the
> > > !offset_count anyway. Though I also notice offset_count isn't big
> > > enough to hold the 2-byte SFRAME_FRE_OFFSET_COUNT() value. Which is
> > > harmless for the same reason, but yeah I'll make offset_count an
> > > unsigned int.
>
> As mentioned above the FRE offset count is 4 bits, not 2 bytes. This is
> also reflected in SFRAME_FRE_OFFSET_COUNT().
You are right on both counts, not sure what I was smoking that day.
--
Josh
* Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-01-29 2:02 ` Josh Poimboeuf
@ 2025-01-30 0:02 ` Andrii Nakryiko
2025-02-04 18:26 ` Josh Poimboeuf
2025-01-30 21:39 ` Indu Bhagat
2025-02-05 13:56 ` Jens Remus
2 siblings, 1 reply; 161+ messages in thread
From: Andrii Nakryiko @ 2025-01-30 0:02 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Indu Bhagat, x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, Jan 28, 2025 at 6:02 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Mon, Jan 27, 2025 at 05:10:27PM -0800, Andrii Nakryiko wrote:
> > > Yes, in theory, it is allowed (as per the specification) to have an
> > > SFrame section with zero number of FDEs/FREs. But since such a section
> > > will not be useful, I share the opinion that it makes sense to disallow
> > > it in the current unwinding contexts, for now (JIT usecase may change
> > > things later).
> > >
> >
> > I disagree, actually. If it's a legal thing, it shouldn't be randomly
> > rejected. If we later make use of that, we'd have to worry not to
> > accidentally cause problems on older kernels that arbitrarily rejected
> > empty FDE just because it didn't make sense at some point (without
> > causing any issues).
>
> If such older kernels don't do anything with the section anyway, what's
> the point of pretending they do?
>
> Returning an error would actually make more sense as it communicates
> that the kernel doesn't support whatever hypothetical thing you're
> trying to do with 0 FDEs.
>
> > > SFRAME_F_FRAME_POINTER flag is not being set currently by GAS/GNU ld at all.
> > >
> > > >>> + dbg("no fde/fre entries\n");
> > > >>> + return -EINVAL;
> > > >>> + }
> > > >>> +
> > > >>> + header_end = sec->sframe_start + SFRAME_HEADER_SIZE(shdr);
> > > >>> + if (header_end >= sec->sframe_end) {
> > > >>
> > > >> if we allow zero FDEs/FREs, header_end == sec->sframe_end is legal, right?
> > > >
> > > > I suppose so, but again I'm not seeing any reason to support that.
> >
> > Let's invert this. Is there any reason why it shouldn't be supported? ;)
>
> It's simple, we don't add code to "support" some vague hypothetical.
>
> For whatever definition of "support", since there's literally nothing
> the kernel can do with that.
If under the format specification it's legal, there is no reason for
the kernel to proactively reject it, IMO. But ok, whatever.
>
> > But what if we take a struct-of-arrays approach and represent it more like:
> >
> > struct FDE_and_FREs {
> > struct sframe_func_desc_entry fde_metadata;
> > u8|u16|u32 start_addrs[N]; /* can extend to u64 as well */
> > u8 sfre_infos[N];
> > u8 offsets8[M8];
> > u16 offsets16[M16] __aligned(2);
> > u32 offsets32[M32] __aligned(4);
> > /* we can naturally extend to support also u64 offsets */
> > };
> >
> > i.e., we split all FRE records into their three constituents: start
> > addresses, info bytes, and then each FRE can fall into either 8-, 16-,
> > or 32-bit offsets "bucket". We collect all the offsets, depending on
> > their size, into these aligned offsets{8,16,32} arrays (with natural
> > extension to 64 bits, if necessary), with at most wasting 1-3 bytes to
> > ensure proper alignment everywhere.
>
> Makes sense. Though I also have another idea below.
>
> > Note, at this point we need to decide if we want to make FREs binary
> > searchable or not.
> >
> > If not, we don't really need anything extra. As we process each
> > start_addrs[i] and sfre_infos[i] to find matching FRE, we keep track
> > of how many 8-, 16-, and 32-bit offsets already processed FREs
> > consumed, and when we find the right one, we know exactly the starting
> > index within offset{8,16,32}. Done.
> >
> > But if we were to make FREs binary searchable, we need to basically
> > have an index of offset pointers to quickly find offsetsX[j] position
> > corresponding to FRE #i. For that, we can have an extra array right
> > next to start_addrs, "semantically parallel" to it:
> >
> > u8|u16|u32 start_addrs[N];
> > u8|u16|u32 offset_idxs[N];
>
> Binary search would definitely help. I did a crude histogram for "FREs
> per FDE" for a few binaries on my test system:
>
> gdb (the biggest binary on my test system):
>
> 10th Percentile: 1
> 20th Percentile: 1
> 30th Percentile: 1
> 40th Percentile: 3
> 50th Percentile: 5
> 60th Percentile: 8
> 70th Percentile: 11
> 80th Percentile: 14
> 90th Percentile: 16
> 100th Percentile: 472
>
> bash:
>
> 10th Percentile: 1
> 20th Percentile: 1
> 30th Percentile: 3
> 40th Percentile: 5
> 50th Percentile: 7
> 60th Percentile: 9
> 70th Percentile: 12
> 80th Percentile: 16
> 90th Percentile: 17
> 100th Percentile: 46
>
> glibc:
>
> 10th Percentile: 1
> 20th Percentile: 1
> 30th Percentile: 1
> 40th Percentile: 1
> 50th Percentile: 4
> 60th Percentile: 6
> 70th Percentile: 9
> 80th Percentile: 14
> 90th Percentile: 16
> 100th Percentile: 44
>
> libpython:
>
> 10th Percentile: 1
> 20th Percentile: 3
> 30th Percentile: 4
> 40th Percentile: 6
> 50th Percentile: 8
> 60th Percentile: 11
> 70th Percentile: 12
> 80th Percentile: 16
> 90th Percentile: 20
> 100th Percentile: 112
>
> So binary search would help in a lot of cases.
yep, agreed, seems like a worthwhile thing to support, given the stats
(I suspect big production C++ applications might be even worse in this
regard)
>
> However, if we're going that route, we might want to even consider a
> completely revamped data layout. For example:
>
> One insight is that the vast majority of (cfa, fp, ra) tuples aren't
> unique. They could be deduped by storing the unique tuples in a
> standalone 'fre_data' array which is referenced by another
> address-specific array.
>
> struct fre_data {
> s8|s16|s32 cfa, fp, ra;
> u8 info;
> };
> struct fre_data fre_data[num_fre_data];
>
> The storage sizes of cfa/fp/ra can be a constant specified in the global
> sframe header. It's fine all being the same size as it looks like this
> array wouldn't typically be more than a few hundred entries anyway.
>
> Then there would be an array of sorted FRE entries which reference the
> fre_data[] entries:
>
> struct fre {
> s32|s64 start_address;
> u8|u16 fre_data_idx;
even u16 seems dangerous, I'd use u32, not sure it's worth limiting
the format just to 64K unique combinations
>
> } __packed;
> struct fre fres[num_fres];
>
> (For alignment reasons that should probably be two separate arrays, even
> though not ideal for cache locality)
>
> Here again the field storage sizes would be specified in the global
> sframe header.
>
> Note FDEs aren't even needed here as the unwinder doesn't need to know
> when a function begins/ends. The only info needed by the unwinder is
> just the fre_data struct. So a simple binary search of fres[] is all
> that's really needed.
>
> But wait, there's more...
>
> The binary search could be made significantly faster using a small fast
> lookup array which is indexed evenly across the text address offset
> space, similar to what ORC does:
>
> u32 fre_chunks[num_chunks];
>
> The text address space (starting at offset 0) can be split into
> 'num_chunks' chunks of size 'chunk_size'. The value of
> fre_chunks[offset/chunk_size] is an index into the fres[] array.
>
> Taking my gdb binary as an example:
>
> .text is 6417058 bytes, with 146997 total sframe FREs. Assuming a chunk
> size of 1024, fre_chunks[] needs 6417058/1024 = 6267 entries.
>
> For unwinding at text offset 0x400000, the index into fre_chunks[] would
> be 0x400000/1024 = 4096. If fre_chunks[4096] == 96074 and
> fre_chunks[4096+1] == 96098, you need only do a binary search of the 24
> entries between fres[96074] && fres[96098] rather than searching the
> entire 146997-entry array.
>
> .sframe size calculation:
>
> 374 unique fre_data entries (out of 146997 total FREs!)
> = 374 * (2 * 3) = 2244 bytes
>
> 146997 fre entries
> = 146997 * (4 + 2) = 881982 bytes
>
> .text size 6417058 (chunk_size = 1024, num_chunks=6267)
> = 6267 * 4 = 25068 bytes
>
> Approximate total .sframe size would be 2244 + 881982 + 25068 = ~888k,
> plus negligible header size. Which is smaller than the v2 .sframe on my
> gdb binary (985k).
>
> With the chunked lookup table, the avg lookup is:
>
> log2(146997/6267) = ~4.5 iterations
>
> whereas a full binary search would be:
>
> log2(146997) = 17 iterations
>
> So assuming I got all that math right, it's over 3x faster and the
> binary is smaller (or at least should be roughly comparable).
I'm not sure about this chunked lookup approach for arbitrary user
space applications. Those executable sections can be a) big and b)
discontiguous. E.g., one of the production binaries I looked at. Here
are its three main executable sections:
...
[17] .bolt.org.text PROGBITS 000000000b00e640 0ae0d640
0000000011ad621c 0000000000000000 AX 0 0 64
...
[48] .text PROGBITS 000000001e600000 1ce00000
0000000000775dd8 0000000000000000 AX 0 0 2097152
[49] .text.cold PROGBITS 000000001ed75e00 1d575e00
00000000007d3271 0000000000000000 AX 0 0 64
...
Total text size is about 300MB:
>>> 0x0000000000775dd8 + 0x00000000007d3271 + 0x0000000011ad621c
312603237
Section #17 ends at:
>>> hex(0x0000000011ad621c + 0x000000000b00e640)
'0x1cae485c'
While .text starts at 000000001e600000, so we have a gap of ~28MB:
>>> 0x000000001e600000 - 0x1cae485c
28424100
So unless we do something more clever to support multiple
discontiguous chunks, this seems like a bad fit for user space.
I think having all this just binary searchable is already a big win
anyways and should be plenty fast, no?
>
> Of course the downside is it's an all new format. Presumably the linker
> would need to do more work than it's currently doing, e.g., find all the
> duplicates and set up the data accordingly.
I do like the idea of deduplicating those records and just indexing
them, as in practice this should probably be much more compact. So it
might be worth looking into this some more. I'll defer to Indu.
>
> --
> Josh
* Re: [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents
2025-01-22 2:31 ` [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents Josh Poimboeuf
2025-01-24 16:36 ` Jens Remus
2025-01-24 18:02 ` Andrii Nakryiko
@ 2025-01-30 15:07 ` Indu Bhagat
2025-02-04 18:38 ` Josh Poimboeuf
2025-01-30 15:47 ` Jens Remus
3 siblings, 1 reply; 161+ messages in thread
From: Indu Bhagat @ 2025-01-30 15:07 UTC (permalink / raw)
To: Josh Poimboeuf, x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On 1/21/25 6:31 PM, Josh Poimboeuf wrote:
> In preparation for using sframe to unwind user space stacks, add an
> sframe_find() interface for finding the sframe information associated
> with a given text address.
>
> For performance, use user_read_access_begin() and the corresponding
> unsafe_*() accessors. Note that use of pr_debug() in uaccess-enabled
> regions would break noinstr validation, so there aren't any debug
> messages yet. That will be added in a subsequent commit.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> ---
> include/linux/sframe.h | 5 +
> kernel/unwind/sframe.c | 295 ++++++++++++++++++++++++++++++++++-
> kernel/unwind/sframe_debug.h | 35 +++++
> 3 files changed, 331 insertions(+), 4 deletions(-)
> create mode 100644 kernel/unwind/sframe_debug.h
>
> diff --git a/include/linux/sframe.h b/include/linux/sframe.h
> index ff4b9d1dbd00..2e70085a1e89 100644
> --- a/include/linux/sframe.h
> +++ b/include/linux/sframe.h
> @@ -3,11 +3,14 @@
> #define _LINUX_SFRAME_H
>
> #include <linux/mm_types.h>
> +#include <linux/srcu.h>
> #include <linux/unwind_user_types.h>
>
> #ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
>
> struct sframe_section {
> + struct rcu_head rcu;
> +
> unsigned long sframe_start;
> unsigned long sframe_end;
> unsigned long text_start;
> @@ -28,6 +31,7 @@ extern void sframe_free_mm(struct mm_struct *mm);
> extern int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
> unsigned long text_start, unsigned long text_end);
> extern int sframe_remove_section(unsigned long sframe_addr);
> +extern int sframe_find(unsigned long ip, struct unwind_user_frame *frame);
>
> static inline bool current_has_sframe(void)
> {
> @@ -42,6 +46,7 @@ static inline bool current_has_sframe(void)
> static inline void sframe_free_mm(struct mm_struct *mm) {}
> static inline int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end, unsigned long text_start, unsigned long text_end) { return -ENOSYS; }
> static inline int sframe_remove_section(unsigned long sframe_addr) { return -ENOSYS; }
> +static inline int sframe_find(unsigned long ip, struct unwind_user_frame *frame) { return -ENOSYS; }
> static inline bool current_has_sframe(void) { return false; }
>
> #endif /* CONFIG_HAVE_UNWIND_USER_SFRAME */
> diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
> index fa7d87ffd00a..1a35615a361e 100644
> --- a/kernel/unwind/sframe.c
> +++ b/kernel/unwind/sframe.c
> @@ -15,9 +15,287 @@
> #include <linux/unwind_user_types.h>
>
> #include "sframe.h"
> +#include "sframe_debug.h"
>
> -#define dbg(fmt, ...) \
> - pr_debug("%s (%d): " fmt, current->comm, current->pid, ##__VA_ARGS__)
> +struct sframe_fre {
> + unsigned int size;
> + s32 ip_off;
> + s32 cfa_off;
> + s32 ra_off;
> + s32 fp_off;
> + u8 info;
> +};
> +
> +DEFINE_STATIC_SRCU(sframe_srcu);
> +
> +static __always_inline unsigned char fre_type_to_size(unsigned char fre_type)
> +{
> + if (fre_type > 2)
> + return 0;
> + return 1 << fre_type;
> +}
> +
> +static __always_inline unsigned char offset_size_enum_to_size(unsigned char off_size)
> +{
> + if (off_size > 2)
> + return 0;
> + return 1 << off_size;
> +}
> +
> +static __always_inline int __read_fde(struct sframe_section *sec,
> + unsigned int fde_num,
> + struct sframe_fde *fde)
> +{
> + unsigned long fde_addr, ip;
> +
> + fde_addr = sec->fdes_start + (fde_num * sizeof(struct sframe_fde));
> + unsafe_copy_from_user(fde, (void __user *)fde_addr,
> + sizeof(struct sframe_fde), Efault);
> +
> + ip = sec->sframe_start + fde->start_addr;
> + if (ip < sec->text_start || ip > sec->text_end)
> + return -EINVAL;
> +
> + return 0;
> +
> +Efault:
> + return -EFAULT;
> +}
> +
> +static __always_inline int __find_fde(struct sframe_section *sec,
> + unsigned long ip,
> + struct sframe_fde *fde)
> +{
> + s32 ip_off, func_off_low = S32_MIN, func_off_high = S32_MAX;
> + struct sframe_fde __user *first, *low, *high, *found = NULL;
> + int ret;
> +
> + ip_off = ip - sec->sframe_start;
> +
> + first = (void __user *)sec->fdes_start;
> + low = first;
> + high = first + sec->num_fdes - 1;
> +
> + while (low <= high) {
> + struct sframe_fde __user *mid;
> + s32 func_off;
> +
> + mid = low + ((high - low) / 2);
> +
> + unsafe_get_user(func_off, (s32 __user *)mid, Efault);
> +
> + if (ip_off >= func_off) {
> + if (func_off < func_off_low)
> + return -EFAULT;
> +
> + func_off_low = func_off;
> +
> + found = mid;
> + low = mid + 1;
> + } else {
> + if (func_off > func_off_high)
> + return -EFAULT;
> +
> + func_off_high = func_off;
> +
> + high = mid - 1;
> + }
> + }
> +
> + if (!found)
> + return -EINVAL;
> +
> + ret = __read_fde(sec, found - first, fde);
> + if (ret)
> + return ret;
> +
> + /* make sure it's not in a gap */
> + if (ip_off < fde->start_addr || ip_off >= fde->start_addr + fde->func_size)
> + return -EINVAL;
> +
> + return 0;
> +
> +Efault:
> + return -EFAULT;
> +}
> +
> +#define __UNSAFE_GET_USER_INC(to, from, type, label) \
> +({ \
> + type __to; \
> + unsafe_get_user(__to, (type __user *)from, label); \
> + from += sizeof(__to); \
> + to = (typeof(to))__to; \
> +})
> +
> +#define UNSAFE_GET_USER_INC(to, from, size, label) \
> +({ \
> + switch (size) { \
> + case 1: \
> + __UNSAFE_GET_USER_INC(to, from, u8, label); \
> + break; \
> + case 2: \
> + __UNSAFE_GET_USER_INC(to, from, u16, label); \
> + break; \
> + case 4: \
> + __UNSAFE_GET_USER_INC(to, from, u32, label); \
> + break; \
> + default: \
> + return -EFAULT; \
> + } \
> +})
> +
> +static __always_inline int __read_fre(struct sframe_section *sec,
> + struct sframe_fde *fde,
> + unsigned long fre_addr,
> + struct sframe_fre *fre)
> +{
> + unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
> + unsigned char fre_type = SFRAME_FUNC_FRE_TYPE(fde->info);
> + unsigned char offset_count, offset_size;
> + s32 ip_off, cfa_off, ra_off, fp_off;
> + unsigned long cur = fre_addr;
> + unsigned char addr_size;
> + u8 info;
> +
> + addr_size = fre_type_to_size(fre_type);
> + if (!addr_size)
> + return -EFAULT;
> +
> + if (fre_addr + addr_size + 1 > sec->fres_end)
> + return -EFAULT;
> +
> + UNSAFE_GET_USER_INC(ip_off, cur, addr_size, Efault);
> + if (fde_type == SFRAME_FDE_TYPE_PCINC && ip_off > fde->func_size)
> + return -EFAULT;
> +
> + UNSAFE_GET_USER_INC(info, cur, 1, Efault);
> + offset_count = SFRAME_FRE_OFFSET_COUNT(info);
> + offset_size = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(info));
> + if (!offset_count || !offset_size)
> + return -EFAULT;
> +
> + if (cur + (offset_count * offset_size) > sec->fres_end)
> + return -EFAULT;
> +
> + fre->size = addr_size + 1 + (offset_count * offset_size);
> +
> + UNSAFE_GET_USER_INC(cfa_off, cur, offset_size, Efault);
> + offset_count--;
> +
> + ra_off = sec->ra_off;
> + if (!ra_off) {
> + if (!offset_count--)
> + return -EFAULT;
> +
> + UNSAFE_GET_USER_INC(ra_off, cur, offset_size, Efault);
> + }
> +
> + fp_off = sec->fp_off;
> + if (!fp_off && offset_count) {
> + offset_count--;
> + UNSAFE_GET_USER_INC(fp_off, cur, offset_size, Efault);
> + }
> +
> + if (offset_count)
> + return -EFAULT;
> +
> + fre->ip_off = ip_off;
> + fre->cfa_off = cfa_off;
> + fre->ra_off = ra_off;
> + fre->fp_off = fp_off;
> + fre->info = info;
> +
> + return 0;
> +
> +Efault:
> + return -EFAULT;
> +}
> +
> +static __always_inline int __find_fre(struct sframe_section *sec,
> + struct sframe_fde *fde, unsigned long ip,
> + struct unwind_user_frame *frame)
> +{
> + unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
> + struct sframe_fre *fre, *prev_fre = NULL;
> + struct sframe_fre fres[2];
> + unsigned long fre_addr;
> + bool which = false;
> + unsigned int i;
> + s32 ip_off;
> +
> + ip_off = (s32)(ip - sec->sframe_start) - fde->start_addr;
> +
> + if (fde_type == SFRAME_FDE_TYPE_PCMASK)
> + ip_off %= fde->rep_size;
> +
> + fre_addr = sec->fres_start + fde->fres_off;
> +
> + for (i = 0; i < fde->fres_num; i++) {
> + int ret;
> +
> + /*
> + * Alternate between the two fre_addr[] entries for 'fre' and
> + * 'prev_fre'.
> + */
> + fre = which ? fres : fres + 1;
> + which = !which;
> +
> + ret = __read_fre(sec, fde, fre_addr, fre);
> + if (ret)
> + return ret;
> +
It should be possible to read only the ip_off and info from the FRE and
defer the reading of offsets (as done in __read_fre) until later, when
you actually need the offsets. See below.
We can find the relevant FRE with the following pieces of information:
- ip_off
- fre_size (this will mean we need to read the uint8_t info in the FRE)
> + fre_addr += fre->size;
> +
> + if (prev_fre && fre->ip_off <= prev_fre->ip_off)
> + return -EFAULT;
> +
> + if (fre->ip_off > ip_off)
> + break;
> +
> + prev_fre = fre;
> + }
> +
> + if (!prev_fre)
> + return -EINVAL;
> + fre = prev_fre;
> +
(Invoke __read_fre here, now that we know this FRE is the one we are
looking for.)
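A rough sketch of that deferred-read idea, modeled in Python over a toy FRE8 byte layout (addr_size = 1; the info-word bit positions follow the v2 layout, but the helper names and buffer format here are mine, not the kernel code):

```python
def fre_size(addr_size, info):
    # total FRE size: start address + info byte + offsets
    offset_count = (info >> 1) & 0xf         # FRE offset count bits
    offset_size = 1 << ((info >> 5) & 0x3)   # FRE offset size bits
    return addr_size + 1 + offset_count * offset_size

def find_fre_pos(buf, ip_off, addr_size=1):
    # Scan FREs reading only ip_off + info; the offsets of the matching
    # FRE can be read afterwards, once, from the returned position.
    pos, match = 0, None
    while pos < len(buf):
        fre_ip = buf[pos]              # addr_size == 1 in this toy layout
        info = buf[pos + addr_size]
        if fre_ip > ip_off:
            break
        match = pos                    # tentative match, offsets unread
        pos += fre_size(addr_size, info)
    return match
```

Only the winning FRE's offsets would ever get copied from user space, saving offset_count * offset_size worth of reads for every skipped FRE.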
> + frame->cfa_off = fre->cfa_off;
> + frame->ra_off = fre->ra_off;
> + frame->fp_off = fre->fp_off;
> + frame->use_fp = SFRAME_FRE_CFA_BASE_REG_ID(fre->info) == SFRAME_BASE_REG_FP;
> +
> + return 0;
> +}
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents
2025-01-22 2:31 ` [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents Josh Poimboeuf
` (2 preceding siblings ...)
2025-01-30 15:07 ` Indu Bhagat
@ 2025-01-30 15:47 ` Jens Remus
2025-02-04 18:51 ` Josh Poimboeuf
3 siblings, 1 reply; 161+ messages in thread
From: Jens Remus @ 2025-01-30 15:47 UTC (permalink / raw)
To: Josh Poimboeuf, x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Vasily Gorbik
On 22.01.2025 03:31, Josh Poimboeuf wrote:
> In preparation for using sframe to unwind user space stacks, add an
> sframe_find() interface for finding the sframe information associated
> with a given text address.
>
> For performance, use user_read_access_begin() and the corresponding
> unsafe_*() accessors. Note that use of pr_debug() in uaccess-enabled
> regions would break noinstr validation, so there aren't any debug
> messages yet. That will be added in a subsequent commit.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
> +struct sframe_fre {
> + unsigned int size;
> + s32 ip_off;
The IP offset (from function start) in the SFrame V2 FRE is unsigned:
u32 ip_off;
> + s32 cfa_off;
> + s32 ra_off;
> + s32 fp_off;
> + u8 info;
> +};
...
> +#define __UNSAFE_GET_USER_INC(to, from, type, label) \
> +({ \
> + type __to; \
> + unsafe_get_user(__to, (type __user *)from, label); \
> + from += sizeof(__to); \
> + to = (typeof(to))__to; \
> +})
> +
> +#define UNSAFE_GET_USER_INC(to, from, size, label) \
> +({ \
> + switch (size) { \
> + case 1: \
> + __UNSAFE_GET_USER_INC(to, from, u8, label); \
> + break; \
> + case 2: \
> + __UNSAFE_GET_USER_INC(to, from, u16, label); \
> + break; \
> + case 4: \
> + __UNSAFE_GET_USER_INC(to, from, u32, label); \
> + break; \
> + default: \
> + return -EFAULT; \
> + } \
> +})
This does not work for the signed SFrame fields, such as the FRE CFA,
RA, and FP offsets, as it does not perform the required sign extension.
One option would be to rename to UNSAFE_GET_USER_UNSIGNED_INC() and
re-introduce UNSAFE_GET_USER_SIGNED_INC() using s8, s16, and s32.
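The difference is easy to demonstrate: a one-byte offset encoding -8 (0xf8) read through an unsigned type and then stored into an s32 comes out as 248 instead of -8 (sketch; Python's signed/unsigned byte decoding stands in for the u8 vs. s8 casts):

```python
raw = b'\xf8'  # one-byte SFrame offset encoding -8

as_u8 = int.from_bytes(raw, 'little', signed=False)  # what the u8 path yields
as_s8 = int.from_bytes(raw, 'little', signed=True)   # with sign extension

print(as_u8, as_s8)  # 248 -8
```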
> +static __always_inline int __read_fre(struct sframe_section *sec,
> + struct sframe_fde *fde,
> + unsigned long fre_addr,
> + struct sframe_fre *fre)
> +{
> + unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
> + unsigned char fre_type = SFRAME_FUNC_FRE_TYPE(fde->info);
> + unsigned char offset_count, offset_size;
> + s32 ip_off, cfa_off, ra_off, fp_off;
The IP offset (from function start) in the SFrame V2 FRE is unsigned:
u32 ip_off;
s32 cfa_off, ra_off, fp_off;
> + unsigned long cur = fre_addr;
> + unsigned char addr_size;
> + u8 info;
> +
> + addr_size = fre_type_to_size(fre_type);
> + if (!addr_size)
> + return -EFAULT;
> +
> + if (fre_addr + addr_size + 1 > sec->fres_end)
> + return -EFAULT;
> +
> + UNSAFE_GET_USER_INC(ip_off, cur, addr_size, Efault);
Ok: The SFrame V2 FRE IP offset is unsigned u8, u16, or u32.
> + if (fde_type == SFRAME_FDE_TYPE_PCINC && ip_off > fde->func_size)
> + return -EFAULT;
> +
> + UNSAFE_GET_USER_INC(info, cur, 1, Efault);
Ok: The SFrame V2 FRE info word is one byte of data.
> + offset_count = SFRAME_FRE_OFFSET_COUNT(info);
> + offset_size = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(info));
> + if (!offset_count || !offset_size)
> + return -EFAULT;
> +
> + if (cur + (offset_count * offset_size) > sec->fres_end)
> + return -EFAULT;
> +
> + fre->size = addr_size + 1 + (offset_count * offset_size);
> +
> + UNSAFE_GET_USER_INC(cfa_off, cur, offset_size, Efault);
Issue: The SFrame V2 FRE CFA offset is signed s8, s16, or s32. Sign
extension required when storing in s32.
> + offset_count--;
> +
> + ra_off = sec->ra_off;
> + if (!ra_off) {
> + if (!offset_count--)
> + return -EFAULT;
> +
> + UNSAFE_GET_USER_INC(ra_off, cur, offset_size, Efault);
Issue: The SFrame V2 FRE RA offset is signed s8, s16, or s32. Sign
extension required when storing in s32.
> + }
> +
> + fp_off = sec->fp_off;
> + if (!fp_off && offset_count) {
> + offset_count--;
> + UNSAFE_GET_USER_INC(fp_off, cur, offset_size, Efault);
Issue: The SFrame V2 FRE FP offset is signed s8, s16, or s32. Sign
extension required when storing in s32.
> + }
> +
> + if (offset_count)
> + return -EFAULT;
> +
> + fre->ip_off = ip_off;
> + fre->cfa_off = cfa_off;
> + fre->ra_off = ra_off;
> + fre->fp_off = fp_off;
> + fre->info = info;
> +
> + return 0;
> +
> +Efault:
> + return -EFAULT;
> +}
> +
> +static __always_inline int __find_fre(struct sframe_section *sec,
> + struct sframe_fde *fde, unsigned long ip,
> + struct unwind_user_frame *frame)
> +{
> + unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
> + struct sframe_fre *fre, *prev_fre = NULL;
> + struct sframe_fre fres[2];
> + unsigned long fre_addr;
> + bool which = false;
> + unsigned int i;
> + s32 ip_off;
> +
> + ip_off = (s32)(ip - sec->sframe_start) - fde->start_addr;
The IP offset (from function start) in the SFrame V2 FRE is unsigned.
The function start address offset (from .sframe section begin) is signed.
Therefore:
u32 ip_off;
ip_off = ip - (sec->sframe_start + fde->start_addr);
> +
> + if (fde_type == SFRAME_FDE_TYPE_PCMASK)
> + ip_off %= fde->rep_size;
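For context, SFRAME_FDE_TYPE_PCMASK is used for repeating code blocks (e.g. PLT stubs), where one set of FREs describes every repetition; the modulo maps any IP in the region back onto the canonical block (sketch, function name is mine):

```python
def pcmask_ip_off(ip, func_start, rep_size):
    # Map an IP anywhere in the repeating region to its offset within
    # the canonical block described by the FDE's FREs.
    return (ip - func_start) % rep_size
```

E.g. with 16-byte PLT stubs, an IP 4 bytes into the third stub resolves to the same FRE offset as an IP 4 bytes into the first.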
Following is a patch with the suggested changes that were required to
make unwinding of user space using SFrame work on s390:
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index bba14c5fe0f5..ea2d491ea68f 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -19,7 +19,7 @@
struct sframe_fre {
unsigned int size;
- s32 ip_off;
+ u32 ip_off;
s32 cfa_off;
s32 ra_off;
s32 fp_off;
@@ -136,7 +136,26 @@ static __always_inline int __find_fde(struct sframe_section *sec,
to = (typeof(to))__to; \
})
-#define UNSAFE_GET_USER_INC(to, from, size, label) \
+#define UNSAFE_GET_USER_SIGNED_INC(to, from, size, label) \
+({ \
+ switch (size) { \
+ case 1: \
+ __UNSAFE_GET_USER_INC(to, from, s8, label); \
+ break; \
+ case 2: \
+ __UNSAFE_GET_USER_INC(to, from, s16, label); \
+ break; \
+ case 4: \
+ __UNSAFE_GET_USER_INC(to, from, s32, label); \
+ break; \
+ default: \
+ dbg_sec_uaccess("%d: bad UNSAFE_GET_USER_SIGNED_INC size %u\n",\
+ __LINE__, size); \
+ return -EFAULT; \
+ } \
+})
+
+#define UNSAFE_GET_USER_UNSIGNED_INC(to, from, size, label) \
({ \
switch (size) { \
case 1: \
@@ -149,7 +168,7 @@ static __always_inline int __find_fde(struct sframe_section *sec,
__UNSAFE_GET_USER_INC(to, from, u32, label); \
break; \
default: \
- dbg_sec_uaccess("%d: bad UNSAFE_GET_USER_INC size %u\n",\
+ dbg_sec_uaccess("%d: bad UNSAFE_GET_USER_UNSIGNED_INC size %u\n",\
__LINE__, size); \
return -EFAULT; \
} \
@@ -163,7 +182,8 @@ static __always_inline int __read_fre(struct sframe_section *sec,
unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
unsigned char fre_type = SFRAME_FUNC_FRE_TYPE(fde->info);
unsigned char offset_count, offset_size;
- s32 ip_off, cfa_off, ra_off, fp_off;
+ u32 ip_off;
+ s32 cfa_off, ra_off, fp_off;
unsigned long cur = fre_addr;
unsigned char addr_size;
u8 info;
@@ -179,14 +199,14 @@ static __always_inline int __read_fre(struct sframe_section *sec,
return -EFAULT;
}
- UNSAFE_GET_USER_INC(ip_off, cur, addr_size, Efault);
+ UNSAFE_GET_USER_UNSIGNED_INC(ip_off, cur, addr_size, Efault);
if (fde_type == SFRAME_FDE_TYPE_PCINC && ip_off > fde->func_size) {
dbg_sec_uaccess("fre starts past end of function: ip_off=0x%x, func_size=0x%x\n",
ip_off, fde->func_size);
return -EFAULT;
}
- UNSAFE_GET_USER_INC(info, cur, 1, Efault);
+ UNSAFE_GET_USER_UNSIGNED_INC(info, cur, 1, Efault);
offset_count = SFRAME_FRE_OFFSET_COUNT(info);
offset_size = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(info));
if (!offset_count || !offset_size) {
@@ -200,7 +220,7 @@ static __always_inline int __read_fre(struct sframe_section *sec,
fre->size = addr_size + 1 + (offset_count * offset_size);
- UNSAFE_GET_USER_INC(cfa_off, cur, offset_size, Efault);
+ UNSAFE_GET_USER_SIGNED_INC(cfa_off, cur, offset_size, Efault);
offset_count--;
ra_off = sec->ra_off;
@@ -210,13 +230,13 @@ static __always_inline int __read_fre(struct sframe_section *sec,
return -EFAULT;
}
- UNSAFE_GET_USER_INC(ra_off, cur, offset_size, Efault);
+ UNSAFE_GET_USER_SIGNED_INC(ra_off, cur, offset_size, Efault);
}
fp_off = sec->fp_off;
if (!fp_off && offset_count) {
offset_count--;
- UNSAFE_GET_USER_INC(fp_off, cur, offset_size, Efault);
+ UNSAFE_GET_USER_SIGNED_INC(fp_off, cur, offset_size, Efault);
}
if (offset_count) {
@@ -247,9 +267,9 @@ static __always_inline int __find_fre(struct sframe_section *sec,
unsigned long fre_addr;
bool which = false;
unsigned int i;
- s32 ip_off;
+ u32 ip_off;
- ip_off = (s32)(ip - sec->sframe_start) - fde->start_addr;
+ ip_off = ip - (sec->sframe_start + fde->start_addr);
if (fde_type == SFRAME_FDE_TYPE_PCMASK)
ip_off %= fde->rep_size;
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
+49-7031-16-1128 Office
jremus@de.ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
^ permalink raw reply related [flat|nested] 161+ messages in thread
* Re: [PATCH v4 25/39] unwind_user/sframe: Show file name in debug output
2025-01-22 2:31 ` [PATCH v4 25/39] unwind_user/sframe: Show file name in debug output Josh Poimboeuf
@ 2025-01-30 16:17 ` Jens Remus
2025-02-04 19:10 ` Josh Poimboeuf
0 siblings, 1 reply; 161+ messages in thread
From: Jens Remus @ 2025-01-30 16:17 UTC (permalink / raw)
To: Josh Poimboeuf, x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Vasily Gorbik
On 22.01.2025 03:31, Josh Poimboeuf wrote:
> When debugging sframe issues, the error messages aren't all that helpful
> without knowing what file a corresponding .sframe section belongs to.
> Prefix debug output strings with the file name.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> diff --git a/kernel/unwind/sframe_debug.h b/kernel/unwind/sframe_debug.h
> +static inline void dbg_init(struct sframe_section *sec)
> +{
> + struct mm_struct *mm = current->mm;
> + struct vm_area_struct *vma;
> +
> + guard(mmap_read_lock)(mm);
> + vma = vma_lookup(mm, sec->sframe_start);
> + if (!vma)
> + sec->filename = kstrdup("(vma gone???)", GFP_KERNEL);
> + else if (vma->vm_file)
> + sec->filename = kstrdup_quotable_file(vma->vm_file, GFP_KERNEL);
> + else if (!vma->vm_mm)
This condition does not appear to work for vdso on s390. The following
does:
else if (in_range(sec->sframe_start, current->mm->context.vdso_base, vdso_text_size()))
> + sec->filename = kstrdup("(vdso)", GFP_KERNEL);
> + else
> + sec->filename = kstrdup("(anonymous)", GFP_KERNEL);
> +}
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
+49-7031-16-1128 Office
jremus@de.ibm.com
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 26/39] unwind_user/sframe: Enable debugging in uaccess regions
2025-01-22 2:31 ` [PATCH v4 26/39] unwind_user/sframe: Enable debugging in uaccess regions Josh Poimboeuf
@ 2025-01-30 16:38 ` Jens Remus
2025-02-04 19:33 ` Josh Poimboeuf
0 siblings, 1 reply; 161+ messages in thread
From: Jens Remus @ 2025-01-30 16:38 UTC (permalink / raw)
To: Josh Poimboeuf, x86
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Alexander Gordeev
On 22.01.2025 03:31, Josh Poimboeuf wrote:
> Objtool warns about calling pr_debug() from uaccess-enabled regions, and
> rightfully so. Add a dbg_sec_uaccess() macro which temporarily disables
> uaccess before doing the dynamic printk, and use that to add debug
> messages throughout the uaccess-enabled regions.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
> @@ -53,12 +53,15 @@ static __always_inline int __read_fde(struct sframe_section *sec,
> sizeof(struct sframe_fde), Efault);
>
> ip = sec->sframe_start + fde->start_addr;
> - if (ip < sec->text_start || ip > sec->text_end)
> + if (ip < sec->text_start || ip > sec->text_end) {
> + dbg_sec_uaccess("bad fde num %d\n", fde_num);
> return -EINVAL;
> + }
>
> return 0;
>
> Efault:
> + dbg_sec_uaccess("fde %d usercopy failed\n", fde_num);
> return -EFAULT;
> }
Add a similar debug message for SFrame FDE user copy failures?
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
@@ -125,6 +125,7 @@ static __always_inline int __find_fde(struct sframe_section *sec,
return 0;
Efault:
+ dbg_sec_uaccess("fde usercopy failed\n");
return -EFAULT;
}
Printing the IP is probably not an option due to security concerns?
Printing the CFA, FP, and RA offsets is too much traffic? To debug
issues on s390 I had to add tons of additional debug messages to make
sense of what was actually going on.
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
+49-7031-16-1128 Office
jremus@de.ibm.com
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents
2025-01-24 21:41 ` Josh Poimboeuf
2025-01-28 0:39 ` Andrii Nakryiko
@ 2025-01-30 19:51 ` Weinan Liu
2025-02-04 19:42 ` Josh Poimboeuf
1 sibling, 1 reply; 161+ messages in thread
From: Weinan Liu @ 2025-01-30 19:51 UTC (permalink / raw)
To: jpoimboe
Cc: acme, adrian.hunter, alexander.shishkin, andrii.nakryiko, broonie,
fweimer, indu.bhagat, irogers, jolsa, jordalgo, jremus,
linux-kernel, linux-perf-users, linux-toolchains,
linux-trace-kernel, luto, mark.rutland, mathieu.desnoyers,
mhiramat, mingo, namhyung, peterz, rostedt, sam, wnliu, x86
On Fri, Jan 24, 2025 at 1:41 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> On Fri, Jan 24, 2025 at 10:02:46AM -0800, Andrii Nakryiko wrote:
> > On Tue, Jan 21, 2025 at 6:32 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > > +static __always_inline int __find_fre(struct sframe_section *sec,
> > > + struct sframe_fde *fde, unsigned long ip,
> > > + struct unwind_user_frame *frame)
> > > +{
> > > + unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
> > > + struct sframe_fre *fre, *prev_fre = NULL;
> > > + struct sframe_fre fres[2];
> >
> > you only need prev_fre->ip_off, so why all this `which` and `fres[2]`
> > business if all you need is prev_fre_off and a bool whether you have
> > prev_fre at all?
>
> So this cleverness probably needs a comment. prev_fre is actually
> needed after the loop:
>
> > > + if (!prev_fre)
> > > + return -EINVAL;
> > > + fre = prev_fre;
>
> In the body of the loop, prev_fre is a tentative match, unless the next
> fre also matches, in which case that one becomes the new tentative
> match.
>
> I'll add a comment. Also it'll probably be less confusing if I rename
> "prev_fre" to "fre", and "fre" to "next_fre".
>
Nit: swap() might be a simpler way to alternate pointers between two
fre_addr[] entries.
For example,
static __always_inline int __find_fre(struct sframe_section *sec,
struct sframe_fde *fde, unsigned long ip,
struct unwind_user_frame *frame)
{
/* initialize fres[] with invalid values */
struct sframe_fre fres[2] = {0};
struct sframe_fre *fre = &fres[1], *next_fre = fres;
for (i = 0; i < fde->fres_num; i++) {
swap(fre, next_fre);
ret = __read_fre(sec, fde, fre_addr, fre);
...
if (fre->ip_off > ip_off)
break;
}
if (fre->size == 0)
return -EINVAL;
...
}
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-24 23:57 ` Steven Rostedt
@ 2025-01-30 20:21 ` Steven Rostedt
2025-02-05 2:25 ` Josh Poimboeuf
0 siblings, 1 reply; 161+ messages in thread
From: Steven Rostedt @ 2025-01-30 20:21 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Peter Zijlstra, Mathieu Desnoyers, x86, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski, Masami Hiramatsu,
Weinan Liu
On Fri, 24 Jan 2025 18:57:44 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Fri, 24 Jan 2025 14:50:11 -0800
> Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> > Hm, reading this again I'm wondering if you're actually proposing that
> > the unwind happens on @prev after it gets rescheduled sometime in the
> > future? Does that actually solve the issue? What if doesn't get
> > rescheduled within a reasonable amount of time?
>
> Correct, it would be prev that would be doing the unwinding and not next.
> But when prev is scheduled back onto the CPU. That way it's only blocking
> itself.
>
> The use case that people were doing this with was measuring the time a task
> is off the CPU. It can't get that time until the task schedules back
> anyway. What the complaint was about was that it could be a very long
> system call, with lots of sleeps and they couldn't do the processing.
>
> I can go back and ask, but I'm pretty sure doing the unwind when a task
> comes back to the CPU would be sufficient.
>
Coming back to this. It would be fine if we could do the back trace when
we come back from the scheduler, so it should not be an issue if the task
even has to schedule again to fault in the sframe information.
I was also wondering whether the unwinder even needs to keep track of who
requested the back trace, rather than just that someone did. Then it would
only take a single flag in the task struct to trigger the back trace.
Return the "cookie" to the tracer that requested the back trace, and when
the back trace is done, just call all callbacks with that cookie. Let the
tracer decide if it wants to record the back trace or ignore it based on
the cookie.
That is, each tracer would need to keep track of the cookies it cares
about: if other tracers request stack traces on tasks this tracer doesn't
care about, it has to handle being called for stack traces it has no
interest in. That said, if you want to trace all tasks, you can just
ignore the cookies and record the traces.
-- Steve
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-01-28 1:10 ` Andrii Nakryiko
2025-01-29 2:02 ` Josh Poimboeuf
@ 2025-01-30 21:21 ` Indu Bhagat
2025-02-04 19:59 ` Josh Poimboeuf
2025-02-05 23:16 ` Andrii Nakryiko
2025-02-05 11:01 ` Jens Remus
2 siblings, 2 replies; 161+ messages in thread
From: Indu Bhagat @ 2025-01-30 21:21 UTC (permalink / raw)
To: Andrii Nakryiko, Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On 1/27/25 5:10 PM, Andrii Nakryiko wrote:
>>>>> +struct sframe_preamble {
>>>>> + u16 magic;
>>>>> + u8 version;
>>>>> + u8 flags;
>>>>> +} __packed;
>>>>> +
>>>>> +struct sframe_header {
>>>>> + struct sframe_preamble preamble;
>>>>> + u8 abi_arch;
>>>>> + s8 cfa_fixed_fp_offset;
>>>>> + s8 cfa_fixed_ra_offset;
>>>>> + u8 auxhdr_len;
>>>>> + u32 num_fdes;
>>>>> + u32 num_fres;
>>>>> + u32 fre_len;
>>>>> + u32 fdes_off;
>>>>> + u32 fres_off;
>>>>> +} __packed;
>>>>> +
>>>>> +struct sframe_fde {
>>>>> + s32 start_addr;
>>>>> + u32 func_size;
>>>>> + u32 fres_off;
>>>>> + u32 fres_num;
>>>>> + u8 info;
>>>>> + u8 rep_size;
>>>>> + u16 padding;
>>>>> +} __packed;
>>>> I couldn't understand from SFrame itself, but why do sframe_header,
>>>> sframe_preamble, and sframe_fde have to be marked __packed, if it's
>>>> all naturally aligned (intentionally and by design)?..
>>> Right, but the spec says they're all packed. Maybe the point is that
>>> some future sframe version is free to introduce unaligned fields.
>>>
>> SFrame specification aims to keep SFrame header and SFrame FDE members
>> at aligned boundaries in future versions.
>>
>> Only SFrame FRE related accesses may have unaligned accesses.
> Yeah, and it's actually bothering me quite a lot 🙂 I have a tentative
> proposal, maybe we can discuss this for SFrame v3? Let me briefly
> outline the idea.
>
I looked at the idea below. It could work wrt unaligned accesses.
Speaking of unaligned accesses, I will ask away: is the reason for
avoiding unaligned accesses the performance hit, or are there other
practical reasons?
> So, currently in v2, FREs within FDEs use an array-of-structs layout.
> If we use pseudo-C type definitions, it would be something like this
> for FDE + its FREs:
>
> struct FDE_and_FREs {
> struct sframe_func_desc_entry fde_metadata;
>
> union FRE {
> struct FRE8 {
> u8 sfre_start_address;
> u8 sfre_info;
> u8|u16|u32 offsets[M];
> }
> struct FRE16 {
> u16 sfre_start_address;
> u16 sfre_info;
> u8|u16|u32 offsets[M];
> }
> struct FRE32 {
> u32 sfre_start_address;
> u32 sfre_info;
> u8|u16|u32 offsets[M];
> }
> } fres[N] __packed;
> };
>
> where all fres[i]s are one of those FRE8/FRE16/FRE32, so start
> addresses have the same size, but each FRE has potentially different
> offsets sizing, so there is no common alignment, and so everything has
> to be packed and unaligned.
>
> But what if we take a struct-of-arrays approach and represent it more like:
>
> struct FDE_and_FREs {
> struct sframe_func_desc_entry fde_metadata;
> u8|u16|u32 start_addrs[N]; /* can extend to u64 as well */
> u8 sfre_infos[N];
> u8 offsets8[M8];
> u16 offsets16[M16] __aligned(2);
> u32 offsets32[M32] __aligned(4);
> /* we can naturally extend to support also u64 offsets */
> };
>
> i.e., we split all FRE records into their three constituents: start
> addresses, info bytes, and then each FRE can fall into either 8-, 16-,
> or 32-bit offsets "bucket". We collect all the offsets, depending on
> their size, into these aligned offsets{8,16,32} arrays (with natural
> extension to 64 bits, if necessary), with at most wasting 1-3 bytes to
> ensure proper alignment everywhere.
>
> Note, at this point we need to decide if we want to make FREs binary
> searchable or not.
>
> If not, we don't really need anything extra. As we process each
> start_addrs[i] and sfre_infos[i] to find matching FRE, we keep track
> of how many 8-, 16-, and 32-bit offsets already processed FREs
> consumed, and when we find the right one, we know exactly the starting
> index within offset{8,16,32}. Done.
>
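A minimal sketch of that running-offset scan, with invented field names and an invented info-byte encoding (two low bits for the offset width class, the next four bits for the offset count; neither is part of any published SFrame layout):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical struct-of-arrays FRE layout.  All names and encodings
 * here are invented for illustration only.
 *
 * Offset width class per FRE: 0 = u8, 1 = u16, 2 = u32.
 */
struct soa_fres {
	size_t		num;
	const uint32_t	*start_addrs;	/* one per FRE, sorted */
	const uint8_t	*sfre_infos;	/* per-FRE info byte */
	/* per-width offset pools, each naturally aligned */
	const uint8_t	*offsets8;
	const uint16_t	*offsets16;
	const uint32_t	*offsets32;
};

#define FRE_SIZE_CLASS(info)	((info) & 0x3)		/* invented encoding */
#define FRE_NUM_OFFSETS(info)	(((info) >> 2) & 0xf)	/* invented encoding */

/*
 * Linear scan: find the last FRE whose start address is <= ip_off,
 * tracking how many offsets of each width the earlier FREs consumed,
 * so we know where the match's offsets begin in its pool.
 */
static int soa_find_fre(const struct soa_fres *f, uint32_t ip_off,
			size_t *match, size_t *pool_idx)
{
	size_t consumed[3] = { 0, 0, 0 };
	int found = 0;

	for (size_t i = 0; i < f->num; i++) {
		uint8_t info = f->sfre_infos[i];

		if (f->start_addrs[i] > ip_off)
			break;

		*match = i;
		*pool_idx = consumed[FRE_SIZE_CLASS(info)];
		found = 1;

		consumed[FRE_SIZE_CLASS(info)] += FRE_NUM_OFFSETS(info);
	}

	return found ? 0 : -1;
}
```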
> But if we were to make FREs binary searchable, we need to basically
> have an index of offset pointers to quickly find offsetsX[j] position
> corresponding to FRE #i. For that, we can have an extra array right
> next to start_addrs, "semantically parallel" to it:
>
> u8|u16|u32 start_addrs[N];
> u8|u16|u32 offset_idxs[N];
>
> where start_addrs[i] corresponds to offset_idxs[i], and offset_idxs[i]
> points to the first offset corresponding to FRE #i in offsetX[] array
> (depending on FRE's "bitness"). This is a bit more storage for this
> offset index, but for FDEs with lots of FREs this might be a
> worthwhile tradeoff.
>
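With such an index, the lookup itself reduces to finding the last FRE whose start address is at or below the target offset, e.g. with a plain binary search (helper name and types are assumptions, not an existing API):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Find the index of the last FRE with start_addrs[i] <= ip_off in a
 * sorted array; returns -1 if ip_off precedes all FREs.  The caller
 * would then read offset_idxs[i] to locate the FRE's offsets directly.
 */
static ptrdiff_t fre_bsearch(const uint32_t *start_addrs, size_t num,
			     uint32_t ip_off)
{
	size_t lo = 0, hi = num;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (start_addrs[mid] <= ip_off)
			lo = mid + 1;
		else
			hi = mid;
	}

	return (ptrdiff_t)lo - 1;
}
```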
> Few points:
> a) we can decide this "binary searchability" per-FDE, and for FDEs
> with 1-2-3 FREs not bother, while those with more FREs would be
> searchable ones with index. So we can combine both fast lookups,
> natural alignment of on-disk format, and compactness. The presence of
> index is just another bit in FDE metadata.
I have been going back and forth on this one. So there seem to be the
following options here:
#1. Make "binary searchability" a per-FDE decision.
#2. Make "binary searchability" a per-section decision (I expect
aarch64 to have a very low number of FREs per FDE).
#3. Bake "binary searchability" into the SFrame FRE specification.
So it's always ON for all FDEs. The advantage is that it makes stack
tracers simpler to implement with less code.
I do think #2, #3 appear simpler in concept.
> b) bitness of offset_idxs[] can be coupled with bitness of
> start_addrs (for simplicity), or could be completely independent and
> identified by FDE's metadata (2 more bits to define this just like
> start_addr bitness is defined). Independent probably would be my
> preference, with linker (or whoever will be producing .sframe data)
> can pick the smallest bitness that is sufficient to represent
> everything.
>
ATM, GAS does apply special logic to decide the bitness of start_addrs
per function, and ld just uses that info. Coupling the bitness of
offset_idx with bitness of start_addrs will be easy (or _easier_ I
think), but for now, I leave it as "should be doable" :)
> Yes, it's a bit more complicated to draw and explain, but everything
> will be nicely aligned, extensible to 64 bits, and (optionally at
> least) binary searchable. Implementation-wise on the kernel side it
> shouldn't be significantly more involved. Maybe the compiler would
> need to be a bit smarter when producing FDE data, but it's no rocket
> science.
>
> Thoughts?
Combining the requirements from your email and Josh's follow up:
- No unaligned accesses
- Sorted FREs
I would put compaction as a "good to have" requirement. It appears to
me that any compaction will mean a sort of post-processing, which will
interfere with the JIT usecase.
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-01-29 2:02 ` Josh Poimboeuf
2025-01-30 0:02 ` Andrii Nakryiko
@ 2025-01-30 21:39 ` Indu Bhagat
2025-02-05 0:57 ` Josh Poimboeuf
2025-02-05 13:56 ` Jens Remus
2 siblings, 1 reply; 161+ messages in thread
From: Indu Bhagat @ 2025-01-30 21:39 UTC (permalink / raw)
To: Josh Poimboeuf, Andrii Nakryiko
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On 1/28/25 6:02 PM, Josh Poimboeuf wrote:
> On Mon, Jan 27, 2025 at 05:10:27PM -0800, Andrii Nakryiko wrote:
>>> Yes, in theory, it is allowed (as per the specification) to have an
>>> SFrame section with zero number of FDEs/FREs. But since such a section
>>> will not be useful, I share the opinion that it makes sense to disallow
>>> it in the current unwinding contexts, for now (JIT usecase may change
>>> things later).
>>>
>>
>> I disagree, actually. If it's a legal thing, it shouldn't be randomly
>> rejected. If we later make use of that, we'd have to worry not to
>> accidentally cause problems on older kernels that arbitrarily rejected
>> empty FDE just because it didn't make sense at some point (without
>> causing any issues).
>
> If such older kernels don't do anything with the section anyway, what's
> the point of pretending they do?
>
> Returning an error would actually make more sense as it communicates
> that the kernel doesn't support whatever hypothetical thing you're
> trying to do with 0 FDEs.
>
>>> SFRAME_F_FRAME_POINTER flag is not being set currently by GAS/GNU ld at all.
>>>
>>>>>> + dbg("no fde/fre entries\n");
>>>>>> + return -EINVAL;
>>>>>> + }
>>>>>> +
>>>>>> + header_end = sec->sframe_start + SFRAME_HEADER_SIZE(shdr);
>>>>>> + if (header_end >= sec->sframe_end) {
>>>>>
>>>>> if we allow zero FDEs/FREs, header_end == sec->sframe_end is legal, right?
>>>>
>>>> I suppose so, but again I'm not seeing any reason to support that.
>>
>> Let's invert this. Is there any reason why it shouldn't be supported? ;)
>
> It's simple, we don't add code to "support" some vague hypothetical.
>
> For whatever definition of "support", since there's literally nothing
> the kernel can do with that.
>
>> But what if we take a struct-of-arrays approach and represent it more like:
>>
>> struct FDE_and_FREs {
>> struct sframe_func_desc_entry fde_metadata;
>> u8|u16|u32 start_addrs[N]; /* can extend to u64 as well */
>> u8 sfre_infos[N];
>> u8 offsets8[M8];
>> u16 offsets16[M16] __aligned(2);
>> u32 offsets32[M32] __aligned(4);
>> /* we can naturally extend to support also u64 offsets */
>> };
>>
>> i.e., we split all FRE records into their three constituents: start
>> addresses, info bytes, and then each FRE can fall into either 8-, 16-,
>> or 32-bit offsets "bucket". We collect all the offsets, depending on
>> their size, into these aligned offsets{8,16,32} arrays (with natural
>> extension to 64 bits, if necessary), with at most wasting 1-3 bytes to
>> ensure proper alignment everywhere.
>
> Makes sense. Though I also have another idea below.
>
>> Note, at this point we need to decide if we want to make FREs binary
>> searchable or not.
>>
>> If not, we don't really need anything extra. As we process each
>> start_addrs[i] and sfre_infos[i] to find matching FRE, we keep track
>> of how many 8-, 16-, and 32-bit offsets already processed FREs
>> consumed, and when we find the right one, we know exactly the starting
>> index within offset{8,16,32}. Done.
>>
>> But if we were to make FREs binary searchable, we need to basically
>> have an index of offset pointers to quickly find offsetsX[j] position
>> corresponding to FRE #i. For that, we can have an extra array right
>> next to start_addrs, "semantically parallel" to it:
>>
>> u8|u16|u32 start_addrs[N];
>> u8|u16|u32 offset_idxs[N];
>
> Binary search would definitely help. I did a crude histogram for "FREs
> per FDE" for a few binaries on my test system:
>
> gdb (the biggest binary on my test system):
>
> 10th Percentile: 1
> 20th Percentile: 1
> 30th Percentile: 1
> 40th Percentile: 3
> 50th Percentile: 5
> 60th Percentile: 8
> 70th Percentile: 11
> 80th Percentile: 14
> 90th Percentile: 16
> 100th Percentile: 472
>
> bash:
>
> 10th Percentile: 1
> 20th Percentile: 1
> 30th Percentile: 3
> 40th Percentile: 5
> 50th Percentile: 7
> 60th Percentile: 9
> 70th Percentile: 12
> 80th Percentile: 16
> 90th Percentile: 17
> 100th Percentile: 46
>
> glibc:
>
> 10th Percentile: 1
> 20th Percentile: 1
> 30th Percentile: 1
> 40th Percentile: 1
> 50th Percentile: 4
> 60th Percentile: 6
> 70th Percentile: 9
> 80th Percentile: 14
> 90th Percentile: 16
> 100th Percentile: 44
>
> libpython:
>
> 10th Percentile: 1
> 20th Percentile: 3
> 30th Percentile: 4
> 40th Percentile: 6
> 50th Percentile: 8
> 60th Percentile: 11
> 70th Percentile: 12
> 80th Percentile: 16
> 90th Percentile: 20
> 100th Percentile: 112
>
> So binary search would help in a lot of cases.
>
Thanks for gathering this.
I suspect on aarch64, the numbers will be very different (leaning
towards a very low number of FREs per FDE).
Making SFrame FREs amenable to binary search can be targeted. Both your
and Andrii's proposal do address that...
> However, if we're going that route, we might want to even consider a
> completely revamped data layout. For example:
>
> One insight is that the vast majority of (cfa, fp, ra) tuples aren't
> unique. They could be deduped by storing the unique tuples in a
> standalone 'fre_data' array which is referenced by another
> address-specific array.
>
> struct fre_data {
> s8|s16|s32 cfa, fp, ra;
> u8 info;
> };
> struct fre_data fre_data[num_fre_data];
>
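A toy sketch of the deduplication ("interning") step a linker or JIT could perform when emitting such a table; the struct layout and names are invented for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Invented type mirroring the deduped (cfa, fp, ra, info) tuple above. */
struct fre_data {
	int16_t cfa, fp, ra;
	uint8_t info;
};

/*
 * Append 'd' to the unique-tuple table unless an identical entry already
 * exists; either way, return the index the per-address array should store.
 * (Field-wise comparison avoids relying on struct padding contents.)
 */
static uint16_t fre_data_intern(struct fre_data *table, size_t *num,
				const struct fre_data *d)
{
	size_t idx;

	for (idx = 0; idx < *num; idx++) {
		if (table[idx].cfa == d->cfa && table[idx].fp == d->fp &&
		    table[idx].ra == d->ra && table[idx].info == d->info)
			return (uint16_t)idx;
	}

	table[idx] = *d;
	*num = idx + 1;
	return (uint16_t)idx;
}
```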
We had the same observation at the time of SFrame V1. And this method
of compaction (deduped tuples) was brain-stormed a bit. Back then, the
costs were thought to be:
- more work at build time.
- an additional data access once the FRE is found (as there is
indirection).
So it was really compaction at the costs above. We did steer towards
simplicity, and the SFrame FRE stands as it is today.
The difference in the pros and cons now from then:
- pros: helps mitigate unaligned accesses
- cons: interferes slightly with the design goal of efficient
addition and removal of stack trace information per function for JIT.
Think of "removal" as the set of actions necessary for addressing
fragmentation in SFrame section data in the JIT usecase.
> The storage sizes of cfa/fp/ra can be a constant specified in the global
> sframe header. It's fine all being the same size as it looks like this
> array wouldn't typically be more than a few hundred entries anyway.
>
> Then there would be an array of sorted FRE entries which reference the
> fre_data[] entries:
>
> struct fre {
> s32|s64 start_address;
> u8|u16 fre_data_idx;
>
> } __packed;
> struct fre fres[num_fres];
>
> (For alignment reasons that should probably be two separate arrays, even
> though not ideal for cache locality)
>
> Here again the field storage sizes would be specified in the global
> sframe header.
>
> Note FDEs aren't even needed here as the unwinder doesn't need to know
> when a function begins/ends. The only info needed by the unwinder is
> just the fre_data struct. So a simple binary search of fres[] is all
> that's really needed.
>
Splitting out information (start_address) to an FDE (as done in V1/V2)
has the benefit that a job like relocating information is proportional
to O(NumFunctions).
In the case above, IIUC, where the proposal puts start_address in the
FRE, these costs will be (much) higher.
In addition, not being able to identify stack trace information per
function will affect the JIT usecase. We need to able to mark stack
trace information stale for functions in JIT environment.
I think the first conceptual landing point in the information layout
should be a per-function entry.
> But wait, there's more...
>
> The binary search could be made significantly faster using a small fast
> lookup array which is indexed evenly across the text address offset
> space, similar to what ORC does:
>
> u32 fre_chunks[num_chunks];
>
> The text address space (starting at offset 0) can be split into
> 'num_chunks' chunks of size 'chunk_size'. The value of
> fre_chunks[offset/chunk_size] is an index into the fres[] array.
>
> Taking my gdb binary as an example:
>
> .text is 6417058 bytes, with 146997 total sframe FREs. Assuming a chunk
> size of 1024, fre_chunks[] needs 6417058/1024 = 6267 entries.
>
> For unwinding at text offset 0x400000, the index into fre_chunks[] would
> be 0x400000/1024 = 4096. If fre_chunks[4096] == 96074 and
> fre_chunks[4096+1] == 96098, you need only do a binary search of the 24
> entries between fres[96074] && fres[96098] rather than searching the
> entire 146997-entry array.
>
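A sketch of that chunked lookup, assuming fre_chunks[c] holds the index of the first FRE at or above text offset c * chunk_size (all names here are invented for illustration):

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE 1024	/* illustrative; the real size would be a header field */

/*
 * ORC-style fast lookup: narrow the search to one chunk's FRE range,
 * then binary-search for the last FRE with fre_addrs[i] <= ip_off.
 * If no FRE starts inside the chunk, lo == hi and we correctly fall
 * back to the last FRE of an earlier chunk.  The caller must ensure
 * ip_off >= fre_addrs[0].
 */
static size_t chunked_find_fre(const uint32_t *fre_addrs, size_t num_fres,
			       const uint32_t *fre_chunks, size_t num_chunks,
			       uint32_t ip_off)
{
	size_t chunk = ip_off / CHUNK_SIZE;
	size_t lo = fre_chunks[chunk];
	size_t hi = (chunk + 1 < num_chunks) ? fre_chunks[chunk + 1] : num_fres;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (fre_addrs[mid] <= ip_off)
			lo = mid + 1;
		else
			hi = mid;
	}

	return lo - 1;
}
```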
> .sframe size calculation:
>
> 374 unique fre_data entries (out of 146997 total FREs!)
> = 374 * (2 * 3) = 2244 bytes
>
> 146997 fre entries
> = 146997 * (4 + 2) = 881982 bytes
>
> .text size 6417058 (chunk_size = 1024, num_chunks=6267)
> = 6267 * 4 = 25068 bytes
>
> Approximate total .sframe size would be 2244 + 881982 + 25068 = ~888k,
> plus negligible header size. Which is smaller than the v2 .sframe on my
> gdb binary (985k).
>
> With the chunked lookup table, the avg lookup is:
>
> log2(146997/6267) = ~4.5 iterations
>
> whereas a full binary search would be:
>
> log2(146997) = 17 iterations
>
> So assuming I got all that math right, it's over 3x faster and the
> binary is smaller (or at least should be roughly comparable).
>
> Of course the downside is it's an all new format. Presumably the linker
> would need to do more work than it's currently doing, e.g., find all the
> duplicates and set up the data accordingly.
>
* Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-01-30 0:02 ` Andrii Nakryiko
@ 2025-02-04 18:26 ` Josh Poimboeuf
0 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-02-04 18:26 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Indu Bhagat, x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, Jan 29, 2025 at 04:02:34PM -0800, Andrii Nakryiko wrote:
> On Tue, Jan 28, 2025 at 6:02 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> I'm not sure about this chunked lookup approach for arbitrary user
> space applications. Those executable sections can be a) big and b)
> discontiguous. E.g., one of the production binaries I looked at. Here
> are its three main executable sections:
>
> ...
> [17] .bolt.org.text PROGBITS 000000000b00e640 0ae0d640
> 0000000011ad621c 0000000000000000 AX 0 0 64
> ...
> [48] .text PROGBITS 000000001e600000 1ce00000
> 0000000000775dd8 0000000000000000 AX 0 0 2097152
> [49] .text.cold PROGBITS 000000001ed75e00 1d575e00
> 00000000007d3271 0000000000000000 AX 0 0 64
> ...
>
> Total text size is about 300MB:
> >>> 0x0000000000775dd8 + 0x00000000007d3271 + 0x0000000011ad621c
> 312603237
>
> Section #17 ends at:
>
> >>> hex(0x0000000011ad621c + 0x000000000b00e640)
> '0x1cae485c'
>
> While .text starts at 000000001e600000, so we have a gap of ~28MB:
>
> >>> 0x000000001e600000 - 0x1cae485c
> 28424100
>
> So unless we do something more clever to support multiple
> discontiguous chunks, this seems like a bad fit for user space.
Nothing clever needed, we could just have multiple sframe sections, each
one with a pointer to its text segment. That would also have the
benefit of allowing the sframe data to be much more compact for the
noncontiguous cases.
> I think having all this just binary searchable is already a big win
> anyways and should be plenty fast, no?
SFrame is trying to compete with frame pointers, which are MUCH faster.
3-4x faster in my testing, not including the page faults (which tend to
only affect performance in the very beginning).
--
Josh
* Re: [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents
2025-01-30 15:07 ` Indu Bhagat
@ 2025-02-04 18:38 ` Josh Poimboeuf
0 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-02-04 18:38 UTC (permalink / raw)
To: Indu Bhagat
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Thu, Jan 30, 2025 at 07:07:32AM -0800, Indu Bhagat wrote:
> On 1/21/25 6:31 PM, Josh Poimboeuf wrote:
> > + for (i = 0; i < fde->fres_num; i++) {
> > + int ret;
> > +
> > + /*
> > + * Alternate between the two fre_addr[] entries for 'fre' and
> > + * 'prev_fre'.
> > + */
> > + fre = which ? fres : fres + 1;
> > + which = !which;
> > +
> > + ret = __read_fre(sec, fde, fre_addr, fre);
> > + if (ret)
> > + return ret;
> > +
>
> It should be possible to only read the ip_off and info from FRE and defer
> the reading of offsets (as done in __read_fre) until later when you do need
> the offsets. See below.
>
> We can find the relevant FRE with the following pieces of information:
> - ip_off
> - fre_size (this will mean we need to read the uin8_t info in the FRE)
Indeed, I'll skip reading the offsets until after the loop.
--
Josh
* Re: [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents
2025-01-30 15:47 ` Jens Remus
@ 2025-02-04 18:51 ` Josh Poimboeuf
2025-02-05 9:47 ` Jens Remus
0 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-02-04 18:51 UTC (permalink / raw)
To: Jens Remus
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Vasily Gorbik
On Thu, Jan 30, 2025 at 04:47:00PM +0100, Jens Remus wrote:
> On 22.01.2025 03:31, Josh Poimboeuf wrote:
> > +struct sframe_fre {
> > + unsigned int size;
> > + s32 ip_off;
>
> The IP offset (from function start) in the SFrame V2 FDE is unsigned:
>
> u32 ip_off;
Indeed.
> > +#define __UNSAFE_GET_USER_INC(to, from, type, label) \
> > +({ \
> > + type __to; \
> > + unsafe_get_user(__to, (type __user *)from, label); \
> > + from += sizeof(__to); \
> > + to = (typeof(to))__to; \
> > +})
> > +
> > +#define UNSAFE_GET_USER_INC(to, from, size, label) \
> > +({ \
> > + switch (size) { \
> > + case 1: \
> > + __UNSAFE_GET_USER_INC(to, from, u8, label); \
> > + break; \
> > + case 2: \
> > + __UNSAFE_GET_USER_INC(to, from, u16, label); \
> > + break; \
> > + case 4: \
> > + __UNSAFE_GET_USER_INC(to, from, u32, label); \
> > + break; \
> > + default: \
> > + return -EFAULT; \
> > + } \
> > +})
>
> This does not work for the signed SFrame fields, such as the FRE CFA,
> RA, and FP offsets, as it does not perform the required sign extension.
> One option would be to rename to UNSAFE_GET_USER_UNSIGNED_INC() and
> re-introduce UNSAFE_GET_USER_SIGNED_INC() using s8, s16, and s32.
See the following line in __UNSAFE_GET_USER_INC():
to = (typeof(to))__to;
Does that not do the sign extension?
--
Josh
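For reference, the two cast chains can be compared in isolation: with an unsigned intermediate, the byte 0xff is zero-extended to 255 before the final cast, while a signed intermediate sign-extends it to -1 (helper names invented for illustration, independent of the uaccess machinery):

```c
#include <stdint.h>

/* Mirrors: type __to; ...; to = (typeof(to))__to;  with type = u8 */
static int32_t read_via_unsigned(uint8_t raw)
{
	uint8_t tmp = raw;	/* value is already non-negative here */

	return (int32_t)tmp;	/* no sign extension possible */
}

/* Same chain with a signed intermediate (type = s8) */
static int32_t read_via_signed(uint8_t raw)
{
	int8_t tmp = (int8_t)raw;

	return (int32_t)tmp;	/* sign-extends */
}
```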
* Re: [PATCH v4 25/39] unwind_user/sframe: Show file name in debug output
2025-01-30 16:17 ` Jens Remus
@ 2025-02-04 19:10 ` Josh Poimboeuf
2025-02-05 10:04 ` Jens Remus
0 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-02-04 19:10 UTC (permalink / raw)
To: Jens Remus
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Vasily Gorbik
On Thu, Jan 30, 2025 at 05:17:33PM +0100, Jens Remus wrote:
> On 22.01.2025 03:31, Josh Poimboeuf wrote:
> > When debugging sframe issues, the error messages aren't all that helpful
> > without knowing what file a corresponding .sframe section belongs to.
> > Prefix debug output strings with the file name.
> >
> > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
>
> > diff --git a/kernel/unwind/sframe_debug.h b/kernel/unwind/sframe_debug.h
>
> > +static inline void dbg_init(struct sframe_section *sec)
> > +{
> > + struct mm_struct *mm = current->mm;
> > + struct vm_area_struct *vma;
> > +
> > + guard(mmap_read_lock)(mm);
> > + vma = vma_lookup(mm, sec->sframe_start);
> > + if (!vma)
> > + sec->filename = kstrdup("(vma gone???)", GFP_KERNEL);
> > + else if (vma->vm_file)
> > + sec->filename = kstrdup_quotable_file(vma->vm_file, GFP_KERNEL);
> > + else if (!vma->vm_mm)
>
> This condition does not appear to work for vdso on s390.
I had a fix for this which somehow got dropped:
diff --git a/kernel/unwind/sframe_debug.h b/kernel/unwind/sframe_debug.h
index 3bb3c5574aee..045e9c0b16c9 100644
--- a/kernel/unwind/sframe_debug.h
+++ b/kernel/unwind/sframe_debug.h
@@ -67,6 +67,10 @@ static inline void dbg_init(struct sframe_section *sec)
sec->filename = kstrdup("(vma gone???)", GFP_KERNEL);
else if (vma->vm_file)
sec->filename = kstrdup_quotable_file(vma->vm_file, GFP_KERNEL);
+ else if (vma->vm_ops && vma->vm_ops->name)
+ sec->filename = kstrdup(vma->vm_ops->name(vma), GFP_KERNEL);
+ else if (arch_vma_name(vma))
+ sec->filename = kstrdup(arch_vma_name(vma), GFP_KERNEL);
else if (!vma->vm_mm)
sec->filename = kstrdup("(vdso)", GFP_KERNEL);
else
* Re: [PATCH v4 26/39] unwind_user/sframe: Enable debugging in uaccess regions
2025-01-30 16:38 ` Jens Remus
@ 2025-02-04 19:33 ` Josh Poimboeuf
0 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-02-04 19:33 UTC (permalink / raw)
To: Jens Remus
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Alexander Gordeev
On Thu, Jan 30, 2025 at 05:38:24PM +0100, Jens Remus wrote:
> Add a similar debug message for SFrame FDE user copy failures?
>
> diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
>
> @@ -125,6 +125,7 @@ static __always_inline int __find_fde(struct sframe_section *sec,
> return 0;
>
> Efault:
> + dbg_sec_uaccess("fde usercopy failed\n");
> return -EFAULT;
> }
Indeed.
> Printing the IP is probably not an option due to security concerns?
> Printing the the CFA, FP, and RA offsets is too much traffic? To debug
> issues on s390 I had to add tons of additional debug messages to make
> sense of what was actually going on.
I guess it depends on what you're trying to debug. These messages are
intended to help diagnose problems with the section format. It gets
unloaded if any of these errors are detected, so it helps to try to
communicate why that happened.
So yeah, they're intended to be very low traffic, only used for errors
where an .sframe section is getting unloaded (i.e., all the -EFAULTs).
Corrupt CFA/FP/RA offset values are harder to detect, that could also be
related to some other issue like stack corruption.
--
Josh
* Re: [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents
2025-01-30 19:51 ` Weinan Liu
@ 2025-02-04 19:42 ` Josh Poimboeuf
0 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-02-04 19:42 UTC (permalink / raw)
To: Weinan Liu
Cc: acme, adrian.hunter, alexander.shishkin, andrii.nakryiko, broonie,
fweimer, indu.bhagat, irogers, jolsa, jordalgo, jremus,
linux-kernel, linux-perf-users, linux-toolchains,
linux-trace-kernel, luto, mark.rutland, mathieu.desnoyers,
mhiramat, mingo, namhyung, peterz, rostedt, sam, x86
On Thu, Jan 30, 2025 at 07:51:15PM +0000, Weinan Liu wrote:
> Nit: swap() might be a simpler way to alternate pointers between two
> fre_addr[] entries.
>
> For example,
>
> static __always_inline int __find_fre(struct sframe_section *sec,
> struct sframe_fde *fde, unsigned long ip,
> struct unwind_user_frame *frame)
> {
> /* initialize fres[] with invalid values */
> struct sframe_fre fres[2] = {0};
> struct sframe_fre *fre = &fres[1], *prev_fre = fres;
>
> for (i = 0; i < fde->fres_num; i++) {
> swap(fre, prev_fre);
> ret = __read_fre(sec, fde, fre_addr, fre);
Problem is, if it breaks out early here on the first iteration:
> if (fre->ip_off > ip_off)
> break;
> }
>
> if (fre->size == 0)
> return -EINVAL;
Then fre isn't valid even though it has a nonzero size.
--
Josh
* Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-01-30 21:21 ` Indu Bhagat
@ 2025-02-04 19:59 ` Josh Poimboeuf
2025-02-05 23:16 ` Andrii Nakryiko
1 sibling, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-02-04 19:59 UTC (permalink / raw)
To: Indu Bhagat
Cc: Andrii Nakryiko, x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Thu, Jan 30, 2025 at 01:21:21PM -0800, Indu Bhagat wrote:
> > Yeah, and it's actually bothering me quite a lot 🙂 I have a tentative
> > proposal, maybe we can discuss this for SFrame v3? Let me briefly
> > outline the idea.
> >
>
> I looked at the idea below. It could work wrt unaligned accesses.
>
> Speaking of unaligned accesses, I will ask away: Is the reason to avoid
> unaligned accesses performance hit or are there other practical reasons to
> it ?
I think performance is the main concern, though there are still some CPU
arches out there which don't support unaligned accesses.
> Combining the requirements from your email and Josh's follow up:
> - No unaligned accesses
> - Sorted FREs
>
> I would put compaction as a "good to have" requirement. It appears to me
> that any compaction will mean a sort of post-processing which will interfere
> with JIT usecase.
I think we should still consider the fast lookup table. We might want
to prototype something just to see what the speedup looks like. Similar
to compaction, it could just be an optional feature implemented by the
linker which JIT doesn't have to use.
--
Josh
* Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-01-30 21:39 ` Indu Bhagat
@ 2025-02-05 0:57 ` Josh Poimboeuf
2025-02-06 1:10 ` Indu Bhagat
0 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-02-05 0:57 UTC (permalink / raw)
To: Indu Bhagat
Cc: Andrii Nakryiko, x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Thu, Jan 30, 2025 at 01:39:52PM -0800, Indu Bhagat wrote:
> On 1/28/25 6:02 PM, Josh Poimboeuf wrote:
> > However, if we're going that route, we might want to even consider a
> > completely revamped data layout. For example:
> >
> > One insight is that the vast majority of (cfa, fp, ra) tuples aren't
> > unique. They could be deduped by storing the unique tuples in a
> > standalone 'fre_data' array which is referenced by another
> > address-specific array.
> >
> > struct fre_data {
> > s8|s16|s32 cfa, fp, ra;
> > u8 info;
> > };
> > struct fre_data fre_data[num_fre_data];
> >
>
> We had the same observation at the time of SFrame V1. And this method of
> compaction (deduped tuples) was brain-stormed a bit. Back then, the costs
> were thought to be:
> - more work at build time.
> - an additional data access once the FRE is found (as there is
> indirection).
>
> So it was really compaction at the costs above. We did steer towards
> simplicity, and the SFrame FRE stands as it is today.
>
> The difference in the pros and cons now from then:
> - pros: helps mitigate unaligned accesses
> - cons: interferes slightly with the design goal of efficient addition and
> removal of stack trace information per function for JIT. Think of "removal"
> as the set of actions necessary for addressing fragmentation in SFrame
> section data in the JIT usecase.
If fre_data[] is allowed to have duplicates then the deduping could be
optional.
> > Note FDEs aren't even needed here as the unwinder doesn't need to know
> > when a function begins/ends. The only info needed by the unwinder is
> > just the fre_data struct. So a simple binary search of fres[] is all
> > that's really needed.
>
> Splitting out information (start_address) to an FDE (as done in V1/V2) has
> the benefit that a job like relocating information is proportional to
> O(NumFunctions).
>
> In the case above, IIUC, where the proposal puts start_address in the FRE,
> these costs will be (much) higher.
I'm not sure I follow; is this referring to the link-time work of
sorting things?
> In addition, not being able to identify stack trace information per function
> will affect the JIT usecase. We need to able to mark stack trace
> information stale for functions in JIT environment.
Maybe, though it's hard to really say how any of these changes would
affect JIT without knowing what those interfaces are going to look like.
--
Josh
* Re: [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface
2025-01-30 20:21 ` Steven Rostedt
@ 2025-02-05 2:25 ` Josh Poimboeuf
0 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-02-05 2:25 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Mathieu Desnoyers, x86, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Jens Remus, Florian Weimer, Andy Lutomirski, Masami Hiramatsu,
Weinan Liu
On Thu, Jan 30, 2025 at 03:21:36PM -0500, Steven Rostedt wrote:
> Coming back from this. It would be fine if we could do the back trace when
> we come back from the scheduler, so it should not be an issue if the task
> even has to schedule again to fault in the sframe information.
So there would be two callback hook points:
- schedule() after enabling preemption
- task work
and the first one wins?
> I was also wondering if the unwinder doesn't keep track of who requested
> the back trace, just that someone did. Then it would just take a single
> flag in the task struct to do the back trace. Return the "cookie" to the
> tracer that requested the back trace, and when you do the back trace, just
> call all callbacks with that cookie. Let the tracer decide if it wants to
> record the back trace or ignore it based on the cookie.
>
> That is, the tracers would need to keep track of the cookies that it cares
> about, as if there's other tracers asking for stack traces on tasks that
> this tracer doesn't care about it needs to handle being called when it
> doesn't care about the stack trace. That said, if you want to trace all
> tasks, you can just ignore the cookies and record the traces.
Easy enough for the unwinder, but IIUC each tracer would have to
maintain a global list of pending cookies (and corresponding ptrs to
perf_event, trace_array, etc)? Would that not create a lot of
contention?
Seems like there really needs to be some kind of per-task or per-request
state.
--
Josh
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents
2025-02-04 18:51 ` Josh Poimboeuf
@ 2025-02-05 9:47 ` Jens Remus
2025-02-07 21:06 ` Josh Poimboeuf
0 siblings, 1 reply; 161+ messages in thread
From: Jens Remus @ 2025-02-05 9:47 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Vasily Gorbik
On 04.02.2025 19:51, Josh Poimboeuf wrote:
> On Thu, Jan 30, 2025 at 04:47:00PM +0100, Jens Remus wrote:
>> On 22.01.2025 03:31, Josh Poimboeuf wrote:
>>> +#define __UNSAFE_GET_USER_INC(to, from, type, label) \
>>> +({ \
>>> + type __to; \
>>> + unsafe_get_user(__to, (type __user *)from, label); \
>>> + from += sizeof(__to); \
>>> + to = (typeof(to))__to; \
>>> +})
>>> +
>>> +#define UNSAFE_GET_USER_INC(to, from, size, label) \
>>> +({ \
>>> + switch (size) { \
>>> + case 1: \
>>> + __UNSAFE_GET_USER_INC(to, from, u8, label); \
>>> + break; \
>>> + case 2: \
>>> + __UNSAFE_GET_USER_INC(to, from, u16, label); \
>>> + break; \
>>> + case 4: \
>>> + __UNSAFE_GET_USER_INC(to, from, u32, label); \
>>> + break; \
>>> + default: \
>>> + return -EFAULT; \
>>> + } \
>>> +})
>>
>> This does not work for the signed SFrame fields, such as the FRE CFA,
>> RA, and FP offsets, as it does not perform the required sign extension.
>> One option would be to rename to UNSAFE_GET_USER_UNSIGNED_INC() and
>> re-introduce UNSAFE_GET_USER_SIGNED_INC() using s8, s16, and s32.
>
> See the following line in __UNSAFE_GET_USER_INC():
>
> to = (typeof(to))__to;
>
> Does that not do the sign extension?
No. In practice with my proposed changes reverted and the following
debugging code added:
@@ -293,6 +293,10 @@ static __always_inline int __find_fre(struct sframe_section *sec,
return -EINVAL;
fre = prev_fre;
+ dbg_sec_uaccess("fre: ip_off=%u, cfa_off=%d, ra_off=%d, fp_off=%d, use_fp=%s, sp_val_off=%d\n",
+ fre->ip_off, fre->cfa_off, fre->ra_off, fre->fp_off,
+ SFRAME_FRE_CFA_BASE_REG_ID(fre->info) == SFRAME_BASE_REG_FP ? "y" : "n",
+ sframe_sp_val_off());
Excerpt from dmesg:
sframe: /usr/lib/ld64.so.1: fre: ip_off=16, cfa_off=440, ra_off=208, fp_off=184, use_fp=n, sp_val_off=-160
sframe: /usr/lib/ld64.so.1: fre: ip_off=2600, cfa_off=672, ra_off=208, fp_off=184, use_fp=y, sp_val_off=-160
sframe: /usr/lib/ld64.so.1: fre: ip_off=10, cfa_off=368, ra_off=0, fp_off=0, use_fp=n, sp_val_off=-160
sframe: /usr/lib/ld64.so.1: fre: ip_off=722, cfa_off=672, ra_off=208, fp_off=184, use_fp=y, sp_val_off=-160
On s390 the register save slots have negative offsets from CFA (due to
the CFA being defined as SP at call site + 160). The RA, if saved,
would be saved at CFA-48 on the stack. I.e. ra_off=-48 instead of
ra_off=208 would have been correct.
208 = 0xd0 (unsigned) = -48 (signed)
Looking at the code:
UNSAFE_GET_USER_INC(ra_off, cur, offset_size, Efault);
With offset_size=1 expands into:
__UNSAFE_GET_USER_INC(/*to=*/ra_off, /*from=*/cur, /*type=*/u8, /*label=*/Efault);
Expands into:
{
	u8 __to;
	unsafe_get_user(__to, (u8 __user *)cur, Efault);
	cur += sizeof(__to);
	ra_off = (typeof(ra_off))__to;
}
The issue is that on the last line __to is u8 instead of s8, so the
u8 value is converted to s32 without sign extension. __to would need
to be s8, or be cast to s8, for sign extension to take place.
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
+49-7031-16-1128 Office
jremus@de.ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 25/39] unwind_user/sframe: Show file name in debug output
2025-02-04 19:10 ` Josh Poimboeuf
@ 2025-02-05 10:04 ` Jens Remus
0 siblings, 0 replies; 161+ messages in thread
From: Jens Remus @ 2025-02-05 10:04 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Vasily Gorbik
On 04.02.2025 20:10, Josh Poimboeuf wrote:
> On Thu, Jan 30, 2025 at 05:17:33PM +0100, Jens Remus wrote:
>> On 22.01.2025 03:31, Josh Poimboeuf wrote:
>>> diff --git a/kernel/unwind/sframe_debug.h b/kernel/unwind/sframe_debug.h
>>
>>> +static inline void dbg_init(struct sframe_section *sec)
>>> +{
>>> + struct mm_struct *mm = current->mm;
>>> + struct vm_area_struct *vma;
>>> +
>>> + guard(mmap_read_lock)(mm);
>>> + vma = vma_lookup(mm, sec->sframe_start);
>>> + if (!vma)
>>> + sec->filename = kstrdup("(vma gone???)", GFP_KERNEL);
>>> + else if (vma->vm_file)
>>> + sec->filename = kstrdup_quotable_file(vma->vm_file, GFP_KERNEL);
>>> + else if (!vma->vm_mm)
>>
>> This condition does not appear to work for vdso on s390.
>
> I had a fix for this which somehow got dropped:
>
> diff --git a/kernel/unwind/sframe_debug.h b/kernel/unwind/sframe_debug.h
> index 3bb3c5574aee..045e9c0b16c9 100644
> --- a/kernel/unwind/sframe_debug.h
> +++ b/kernel/unwind/sframe_debug.h
> @@ -67,6 +67,10 @@ static inline void dbg_init(struct sframe_section *sec)
> sec->filename = kstrdup("(vma gone???)", GFP_KERNEL);
> else if (vma->vm_file)
> sec->filename = kstrdup_quotable_file(vma->vm_file, GFP_KERNEL);
> + else if (vma->vm_ops && vma->vm_ops->name)
> + sec->filename = kstrdup(vma->vm_ops->name(vma), GFP_KERNEL);
> + else if (arch_vma_name(vma))
> + sec->filename = kstrdup(arch_vma_name(vma), GFP_KERNEL);
> else if (!vma->vm_mm)
> sec->filename = kstrdup("(vdso)", GFP_KERNEL);
> else
Thanks! Your patch works fine on s390:
test_vdso 1087 104.721004: 10000 task-clock:ppp:
3ffd11f1702 __cvdso_clock_gettime_data.constprop.0+0x52 ([vdso])
3ffd11f14c6 __kernel_clock_gettime+0x16 ([vdso])
3ff8f6f113e clock_gettime@@GLIBC_2.17+0x1e (/usr/lib64/libc.so.6)
3ff8f6e20e8 __gettimeofday+0x38 (/usr/lib64/libc.so.6)
100066e main+0x1e (/root/test/vdso/test_vdso)
3ff8f63459c __libc_start_call_main+0x8c (/usr/lib64/libc.so.6)
3ff8f63469e __libc_start_main@@GLIBC_2.34+0xae (/usr/lib64/libc.so.6)
100073a _start+0x3a (/root/test/vdso/test_vdso)
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
+49-7031-16-1128 Office
jremus@de.ibm.com
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-01-28 1:10 ` Andrii Nakryiko
2025-01-29 2:02 ` Josh Poimboeuf
2025-01-30 21:21 ` Indu Bhagat
@ 2025-02-05 11:01 ` Jens Remus
2025-02-05 23:05 ` Andrii Nakryiko
2 siblings, 1 reply; 161+ messages in thread
From: Jens Remus @ 2025-02-05 11:01 UTC (permalink / raw)
To: Andrii Nakryiko, Indu Bhagat, Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Mathieu Desnoyers,
Florian Weimer, Andy Lutomirski, Masami Hiramatsu, Weinan Liu
On 28.01.2025 02:10, Andrii Nakryiko wrote:
> So, currently in v2, FREs within FDEs use an array-of-structs layout.
> If we use preudo-C type definitions, it would be something like this
> for FDE + its FREs:
>
> struct FDE_and_FREs {
> struct sframe_func_desc_entry fde_metadata;
>
> union FRE {
> struct FRE8 {
> u8 sfre_start_address;
> u8 sfre_info;
> u8|u16|u32 offsets[M];
> }
> struct FRE16 {
> u16 sfre_start_address;
> u16 sfre_info;
> u8|u16|u32 offsets[M];
> }
> struct FRE32 {
> u32 sfre_start_address;
> u32 sfre_info;
> u8|u16|u32 offsets[M];
> }
> } fres[N] __packed;
> };
>
> where all fres[i]s are one of those FRE8/FRE16/FRE32, so start
> addresses have the same size, but each FRE has potentially different
> offsets sizing, so there is no common alignment, and so everything has
> to be packed and unaligned.
Just for clarification of the SFrame V2 format, as there may be some
misconception. Using pseudo-C type definition:
struct sframe_fre8 {
u8 sfre_start_address;
u8 sfre_info;
s8|s16|s32 offsets[M];
};
struct sframe_fre16 {
u16 sfre_start_address;
u8 sfre_info;
s8|s16|s32 offsets[M];
};
struct sframe_fre32 {
u32 sfre_start_address;
u8 sfre_info;
s8|s16|s32 offsets[M];
};
struct sframe_section {
/* Headers. */
struct sframe_preamble preamble;
struct sframe_header header;
struct sframe_auxiliary_header auxhdr;
/* FDEs. */
struct sframe_fde fdes[N_FDE];
/* FRE8s / FRE16s / FRE32s per FDE. */
struct sframe_fre{8|16|32} fres_fde1[N_FRE_FDE1] __packed;
...
struct sframe_fre{8|16|32} fres_fdeN[N_FRE_FDEN] __packed;
};
Where:
- fdes[] can be binary searched: All fdes[i] are of equal size and
sorted on start address.
- Each fdes[i] points at its fres_fdei[].
- fres_fdei[] cannot be binary searched: For each fdes[i] they are
one of those FRE8/FRE16/FRE32, so start addresses have the same
size, but each FRE has potentially different offsets sizing, so
there is no common alignment, and so everything has to be packed
and unaligned.
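The property that fdes[] is binary searchable (all entries equal-sized and sorted on start address) can be sketched in user-space C. The FDE field names follow the v2 layout quoted earlier in the thread, but this is an illustrative sketch, not the kernel implementation:

```c
#include <stddef.h>
#include <stdint.h>

/* SFrame v2-style FDE: fixed size, sorted on start_addr. */
struct sframe_fde {
	int32_t start_addr;	/* sorted search key */
	uint32_t func_size;
	uint32_t fres_off;	/* offset of this FDE's FRE block */
	uint32_t fres_num;
	uint8_t info;
	uint8_t rep_size;
	uint16_t padding;
};

/*
 * Binary search for the FDE covering 'addr': find the last entry with
 * start_addr <= addr, then check addr falls within its function range.
 * Returns NULL for addresses in a hole between functions.
 */
static const struct sframe_fde *find_fde(const struct sframe_fde *fdes,
					 size_t nr, int32_t addr)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (fdes[mid].start_addr <= addr)
			lo = mid + 1;
		else
			hi = mid;
	}
	if (!lo)
		return NULL;	/* addr precedes the first function */

	const struct sframe_fde *fde = &fdes[lo - 1];

	return (uint32_t)(addr - fde->start_addr) < fde->func_size ? fde : NULL;
}
```

Note the range check at the end is exactly what makes the holes discussed elsewhere in this thread representable: an address past the end of one function but before the start of the next matches no FDE.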
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
+49-7031-16-1128 Office
jremus@de.ibm.com
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-01-29 2:02 ` Josh Poimboeuf
2025-01-30 0:02 ` Andrii Nakryiko
2025-01-30 21:39 ` Indu Bhagat
@ 2025-02-05 13:56 ` Jens Remus
2025-02-07 21:13 ` Josh Poimboeuf
2 siblings, 1 reply; 161+ messages in thread
From: Jens Remus @ 2025-02-05 13:56 UTC (permalink / raw)
To: Josh Poimboeuf, Andrii Nakryiko
Cc: Indu Bhagat, x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Mathieu Desnoyers,
Florian Weimer, Andy Lutomirski, Masami Hiramatsu, Weinan Liu
On 29.01.2025 03:02, Josh Poimboeuf wrote:
> Note FDEs aren't even needed here as the unwinder doesn't need to know
> when a function begins/ends. The only info needed by the unwinder is
> just the fre_data struct. So a simple binary search of fres[] is all
> that's really needed.
In SFrame V2 FDEs specify ranges bound by function start address and
length. FREs in contrast specify open ranges bounded by start address.
Their effect ends either when the next FRE comes into effect or when
their FDE range ends.
This concept enables holes in the .text section which do not have any
valid FDE/FRE information associated.
Your proposal lacks a mechanism to replicate those holes.
It could be FDEs with a flag (or no offsets?) that specifies their
range has no valid information.
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
+49-7031-16-1128 Office
jremus@de.ibm.com
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-02-05 11:01 ` Jens Remus
@ 2025-02-05 23:05 ` Andrii Nakryiko
0 siblings, 0 replies; 161+ messages in thread
From: Andrii Nakryiko @ 2025-02-05 23:05 UTC (permalink / raw)
To: Jens Remus
Cc: Indu Bhagat, Josh Poimboeuf, x86, Peter Zijlstra, Steven Rostedt,
Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Mathieu Desnoyers,
Florian Weimer, Andy Lutomirski, Masami Hiramatsu, Weinan Liu
On Wed, Feb 5, 2025 at 3:02 AM Jens Remus <jremus@linux.ibm.com> wrote:
>
> On 28.01.2025 02:10, Andrii Nakryiko wrote:
>
> > So, currently in v2, FREs within FDEs use an array-of-structs layout.
> > If we use preudo-C type definitions, it would be something like this
> > for FDE + its FREs:
> >
> > struct FDE_and_FREs {
> > struct sframe_func_desc_entry fde_metadata;
> >
> > union FRE {
> > struct FRE8 {
> > u8 sfre_start_address;
> > u8 sfre_info;
> > u8|u16|u32 offsets[M];
> > }
> > struct FRE16 {
> > u16 sfre_start_address;
> > u16 sfre_info;
> > u8|u16|u32 offsets[M];
> > }
> > struct FRE32 {
> > u32 sfre_start_address;
> > u32 sfre_info;
> > u8|u16|u32 offsets[M];
> > }
> > } fres[N] __packed;
> > };
> >
> > where all fres[i]s are one of those FRE8/FRE16/FRE32, so start
> > addresses have the same size, but each FRE has potentially different
> > offsets sizing, so there is no common alignment, and so everything has
> > to be packed and unaligned.
>
> Just for clarification of the SFrame V2 format, as there may be some
> misconception. Using pseudo-C type definition:
>
> struct sframe_fre8 {
> u8 sfre_start_address;
> u8 sfre_info;
> s8|s16|s32 offsets[M];
> };
> struct sframe_fre16 {
> u16 sfre_start_address;
> u8 sfre_info;
> s8|s16|s32 offsets[M];
> };
> struct sframe_fre32 {
> u32 sfre_start_address;
> u8 sfre_info;
> s8|s16|s32 offsets[M];
> };
>
yeah, sorry, copy/paste error for those sfre_info, it's always u8
> struct sframe_section {
> /* Headers. */
> struct sframe_preamble preamble;
> struct sframe_header header;
> struct sframe_auxiliary_header auxhdr;
>
> /* FDEs. */
> struct sframe_fde fdes[N_FDE];
>
> /* FRE8s / FRE16s / FRE32s per FDE. */
> struct sframe_fre{8|16|32} fres_fde1[N_FRE_FDE1] __packed;
> ...
> struct sframe_fre{8|16|32} fres_fdeN[N_FRE_FDEN] __packed;
> };
>
> Where:
> - fdes[] can be binary searched: All fdes[i] are of equal size and
> sorted on start address.
> - Each fdes[i] points at its fres_fdei[].
> - fres_fdei[] cannot be binary searched: For each fdes[i] they are
> one of those FRE8/FRE16/FRE32, so start addresses have the same
> size, but each FRE has potentially different offsets sizing, so
> there is no common alignment, and so everything has to be packed
> and unaligned.
>
yep
> Regards,
> Jens
> --
> Jens Remus
> Linux on Z Development (D3303)
> +49-7031-16-1128 Office
> jremus@de.ibm.com
>
>
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-01-30 21:21 ` Indu Bhagat
2025-02-04 19:59 ` Josh Poimboeuf
@ 2025-02-05 23:16 ` Andrii Nakryiko
1 sibling, 0 replies; 161+ messages in thread
From: Andrii Nakryiko @ 2025-02-05 23:16 UTC (permalink / raw)
To: Indu Bhagat
Cc: Josh Poimboeuf, x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Thu, Jan 30, 2025 at 1:21 PM Indu Bhagat <indu.bhagat@oracle.com> wrote:
>
> On 1/27/25 5:10 PM, Andrii Nakryiko wrote:
> >>>>> +struct sframe_preamble {
> >>>>> + u16 magic;
> >>>>> + u8 version;
> >>>>> + u8 flags;
> >>>>> +} __packed;
> >>>>> +
> >>>>> +struct sframe_header {
> >>>>> + struct sframe_preamble preamble;
> >>>>> + u8 abi_arch;
> >>>>> + s8 cfa_fixed_fp_offset;
> >>>>> + s8 cfa_fixed_ra_offset;
> >>>>> + u8 auxhdr_len;
> >>>>> + u32 num_fdes;
> >>>>> + u32 num_fres;
> >>>>> + u32 fre_len;
> >>>>> + u32 fdes_off;
> >>>>> + u32 fres_off;
> >>>>> +} __packed;
> >>>>> +
> >>>>> +struct sframe_fde {
> >>>>> + s32 start_addr;
> >>>>> + u32 func_size;
> >>>>> + u32 fres_off;
> >>>>> + u32 fres_num;
> >>>>> + u8 info;
> >>>>> + u8 rep_size;
> >>>>> + u16 padding;
> >>>>> +} __packed;
> >>>> I couldn't understand from SFrame itself, but why do sframe_header,
> >>>> sframe_preamble, and sframe_fde have to be marked __packed, if it's
> >>>> all naturally aligned (intentionally and by design)?..
> >>> Right, but the spec says they're all packed. Maybe the point is that
> >>> some future sframe version is free to introduce unaligned fields.
> >>>
> >> SFrame specification aims to keep SFrame header and SFrame FDE members
> >> at aligned boundaries in future versions.
> >>
> >> Only SFrame FRE related accesses may have unaligned accesses.
> > Yeah, and it's actually bothering me quite a lot 🙂 I have a tentative
> > proposal, maybe we can discuss this for SFrame v3? Let me briefly
> > outline the idea.
> >
>
> I looked at the idea below. It could work wrt unaligned accesses.
>
> Speaking of unaligned accesses, I will ask away: Is the reason to avoid
> unaligned accesses performance hit or are there other practical reasons
> to it ?
Performance hit on architectures like x86-64 that do support
unaligned, but it's actually a CPU error for some other architectures,
so you'd need to code with that in mind, making local aligned copies,
etc. In general, I'd say it's a bit of a red flag that a format that
is meant to be memory-mapped (effectively) and used without
pre-processing requires dealing with unaligned accesses. So if we can
fix that, that would be a win.
>
> > So, currently in v2, FREs within FDEs use an array-of-structs layout.
> > If we use preudo-C type definitions, it would be something like this
> > for FDE + its FREs:
> >
> > struct FDE_and_FREs {
> > struct sframe_func_desc_entry fde_metadata;
> >
> > union FRE {
> > struct FRE8 {
> > u8 sfre_start_address;
> > u8 sfre_info;
> > u8|u16|u32 offsets[M];
> > }
> > struct FRE16 {
> > u16 sfre_start_address;
> > u16 sfre_info;
> > u8|u16|u32 offsets[M];
> > }
> > struct FRE32 {
> > u32 sfre_start_address;
> > u32 sfre_info;
> > u8|u16|u32 offsets[M];
> > }
> > } fres[N] __packed;
> > };
> >
> > where all fres[i]s are one of those FRE8/FRE16/FRE32, so start
> > addresses have the same size, but each FRE has potentially different
> > offsets sizing, so there is no common alignment, and so everything has
> > to be packed and unaligned.
> >
> > But what if we take a struct-of-arrays approach and represent it more like:
> >
> > struct FDE_and_FREs {
> > struct sframe_func_desc_entry fde_metadata;
> > u8|u16|u32 start_addrs[N]; /* can extend to u64 as well */
> > u8 sfre_infos[N];
> > u8 offsets8[M8];
> > u16 offsets16[M16] __aligned(2);
> > u32 offsets32[M32] __aligned(4);
> > /* we can naturally extend to support also u64 offsets */
> > };
> >
> > i.e., we split all FRE records into their three constituents: start
> > addresses, info bytes, and then each FRE can fall into either 8-, 16-,
> > or 32-bit offsets "bucket". We collect all the offsets, depending on
> > their size, into these aligned offsets{8,16,32} arrays (with natural
> > extension to 64 bits, if necessary), with at most wasting 1-3 bytes to
> > ensure proper alignment everywhere.
> >
> > Note, at this point we need to decide if we want to make FREs binary
> > searchable or not.
> >
> > If not, we don't really need anything extra. As we process each
> > start_addrs[i] and sfre_infos[i] to find matching FRE, we keep track
> > of how many 8-, 16-, and 32-bit offsets already processed FREs
> > consumed, and when we find the right one, we know exactly the starting
> > index within offset{8,16,32}. Done.
> >
> > But if we were to make FREs binary searchable, we need to basically
> > have an index of offset pointers to quickly find offsetsX[j] position
> > corresponding to FRE #i. For that, we can have an extra array right
> > next to start_addrs, "semantically parallel" to it:
> >
> > u8|u16|u32 start_addrs[N];
> > u8|u16|u32 offset_idxs[N];
> >
> > where start_addrs[i] corresponds to offset_idxs[i], and offset_idxs[i]
> > points to the first offset corresponding to FRE #i in offsetX[] array
> > (depending on FRE's "bitness"). This is a bit more storage for this
> > offset index, but for FDEs with lots of FREs this might be a
> > worthwhile tradeoff.
> >
> > Few points:
> > a) we can decide this "binary searchability" per-FDE, and for FDEs
> > with 1-2-3 FREs not bother, while those with more FREs would be
> > searchable ones with index. So we can combine both fast lookups,
> > natural alignment of on-disk format, and compactness. The presence of
> > index is just another bit in FDE metadata.
>
> I have been going back and forth on this one. So there seem to be the
> following options here:
> #1. Make "binary searchability" a per-FDE decision.
> #2. Make "binary searchability" a per-section decision (I expect
> aarch64 to have very low number of FREs per FDE).
> #3. Bake "binary searchability" into the SFrame FRE specification.
> So its always ON for all FDEs. The advantage is that it makes stack
> tracers simpler to implement with less code.
>
> I do think #2, #3 appear simpler in concept.
Whichever makes it easier across the entire stack (compiler, linker,
kernel/unwinder). As long as binary searchability is possible,
especially for FDEs with lots of FREs. Making it per-FDE just allows
picking the most compact (but still with good performance) representation.
>
> > b) bitness of offset_idxs[] can be coupled with bitness of
> > start_addrs (for simplicity), or could be completely independent and
> > identified by FDE's metadata (2 more bits to define this just like
> > start_addr bitness is defined). Independent probably would be my
> > preference, with linker (or whoever will be producing .sframe data)
> > can pick the smallest bitness that is sufficient to represent
> > everything.
> >
>
> ATM, GAS does apply special logic to decide the bitness of start_addrs
> per function, and ld just uses that info. Coupling the bitness of
> offset_idx with bitness of start_addrs will be easy (or _easier_ I
> think), but for now, I leave it as "should be doable" :)
Those offsets are relative to the FDE's start_addr, right? So, generally
speaking, they should usually be small? If my understanding is correct,
then yeah, coupling is probably ok.
>
> > Yes, it's a bit more complicated to draw and explain, but everything
> > will be nicely aligned, extensible to 64 bits, and (optionally at
> > least) binary searchable. Implementation-wise on the kernel side it
> > shouldn't be significantly more involved. Maybe the compiler would
> > need to be a bit smarter when producing FDE data, but it's no rocket
> > science.
> >
> > Thoughts?
>
> Combining the requirements from your email and Josh's follow up:
> - No unaligned accesses
> - Sorted FREs
>
> I would put compaction as a "good to have" requirement. It appears to
> me that any compaction will mean a sort of post-processing which will
> interfere with JIT usecase.
>
sgtm
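The struct-of-arrays lookup discussed above (the non-searchable, linear-scan variant) can be sketched in user-space C. All names, the info-byte encoding, and the "first offset is the CFA offset" convention are invented for the example; they are not the SFrame on-disk format:

```c
#include <stddef.h>
#include <stdint.h>

/* One FDE's FRE data, split into parallel ("struct of arrays") arrays. */
struct fde_fres {
	size_t nr_fres;
	const uint32_t *start_addrs;	/* sorted, function-relative */
	const uint8_t *infos;		/* per-FRE: offset count + size class */
	const int8_t *offsets8;		/* all 8-bit offsets, in FRE order */
	const int16_t *offsets16;	/* all 16-bit offsets, in FRE order */
	const int32_t *offsets32;	/* all 32-bit offsets, in FRE order */
};

#define FRE_INFO_NR(info)	((info) & 0x0f)		/* number of offsets */
#define FRE_INFO_SZ(info)	(((info) >> 4) & 0x3)	/* 0: s8, 1: s16, 2: s32 */

/*
 * Scan start_addrs[] for the last FRE covering ip_off, tracking how many
 * 8/16/32-bit offsets the preceding FREs consumed so we know where the
 * matching FRE's offsets begin in the per-size arrays.
 */
static int find_cfa_off(const struct fde_fres *f, uint32_t ip_off,
			int32_t *cfa_off)
{
	size_t n8 = 0, n16 = 0, n32 = 0;	/* offsets consumed so far */
	size_t m8 = 0, m16 = 0, m32 = 0;	/* consumed before the match */
	long match = -1;

	for (size_t i = 0; i < f->nr_fres; i++) {
		uint8_t info = f->infos[i];

		if (f->start_addrs[i] > ip_off)
			break;
		match = i;
		m8 = n8;
		m16 = n16;
		m32 = n32;
		switch (FRE_INFO_SZ(info)) {
		case 0: n8  += FRE_INFO_NR(info); break;
		case 1: n16 += FRE_INFO_NR(info); break;
		case 2: n32 += FRE_INFO_NR(info); break;
		}
	}
	if (match < 0)
		return -1;

	/* By this example's convention, the first offset is the CFA offset. */
	switch (FRE_INFO_SZ(f->infos[match])) {
	case 0: *cfa_off = f->offsets8[m8];   return 0;
	case 1: *cfa_off = f->offsets16[m16]; return 0;
	case 2: *cfa_off = f->offsets32[m32]; return 0;
	}
	return -1;
}
```

The binary-searchable variant would replace the running n8/n16/n32 counters with the per-FRE offset_idxs[] index array described above.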
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-02-05 0:57 ` Josh Poimboeuf
@ 2025-02-06 1:10 ` Indu Bhagat
0 siblings, 0 replies; 161+ messages in thread
From: Indu Bhagat @ 2025-02-06 1:10 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Andrii Nakryiko, x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On 2/4/25 4:57 PM, Josh Poimboeuf wrote:
> On Thu, Jan 30, 2025 at 01:39:52PM -0800, Indu Bhagat wrote:
>> On 1/28/25 6:02 PM, Josh Poimboeuf wrote:
>>> However, if we're going that route, we might want to even consider a
>>> completely revamped data layout. For example:
>>>
>>> One insight is that the vast majority of (cfa, fp, ra) tuples aren't
>>> unique. They could be deduped by storing the unique tuples in a
>>> standalone 'fre_data' array which is referenced by another
>>> address-specific array.
>>>
>>> struct fre_data {
>>> s8|s16|s32 cfa, fp, ra;
>>> u8 info;
>>> };
>>> struct fre_data fre_data[num_fre_data];
>>>
>>
>> We had the same observation at the time of SFrame V1. And this method of
>> compaction (deduped tuples) was brain-stormed a bit. Back then, the costs
>> were thought to be:
>> - more work at build time.
>> - an additional data access once the FRE is found (as there is
>> indirection).
>>
>> So it was really compaction at the costs above. We did steer towards
>> simplicity and the SFrame FRE is what it stands today.
>>
>> The difference in the pros and cons now from then:
>> - pros: helps mitigate unaligned accesses
>> - cons: interferes slightly with the design goal of efficient addition and
>> removal of stack trace information per function for JIT. Think "removal" as
>> the set of actions necessary for addressing fragmentation in SFrame section
>> data in JIT usecase.
>
> If fre_data[] is allowed to have duplicates then the deduping could be
> optional.
>
>>> Note FDEs aren't even needed here as the unwinder doesn't need to know
>>> when a function begins/ends. The only info needed by the unwinder is
>>> just the fre_data struct. So a simple binary search of fres[] is all
>>> that's really needed.
>>
>> Splitting out information (start_address) to an FDE (as done in V1/V2) has
>> the benefit that a job like relocating information is proportional to
>> O(NumFunctions).
>>
>> In the case above, IIUC, where the proposal puts start_address in the FRE,
>> these costs will be (much) higher.
>
> I'm not sure I follow, is this referring to the link-time work of
> sorting things?
>
I meant the work of tracking the start address of each function. This
could be done at link-time, as is done in most cases.
But it also depends on the case: e.g., the kernel module loader will
need to apply these relocations in the .rela.sframe section...
If the granularity is finer than a function, more relocations will need
to be applied.
>> In addition, not being able to identify stack trace information per function
> >> will affect the JIT usecase. We need to be able to mark stack trace
>> information stale for functions in JIT environment.
>
> Maybe, though it's hard to really say how any of these changes would
> affect JIT without knowing what those interfaces are going to look like.
>
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents
2025-02-05 9:47 ` Jens Remus
@ 2025-02-07 21:06 ` Josh Poimboeuf
2025-02-10 15:56 ` Jens Remus
0 siblings, 1 reply; 161+ messages in thread
From: Josh Poimboeuf @ 2025-02-07 21:06 UTC (permalink / raw)
To: Jens Remus
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Vasily Gorbik
On Wed, Feb 05, 2025 at 10:47:58AM +0100, Jens Remus wrote:
> UNSAFE_GET_USER_INC(ra_off, cur, offset_size, Efault);
>
> With offset_size=1 expands into:
>
> __UNSAFE_GET_USER_INC(/*to=*/ra_off, /*from=*/cur, /*type=*/u8, /*label=*/Efault);
>
> Expands into:
>
> {
> 	u8 __to;
> 	unsafe_get_user(__to, (u8 __user *)cur, Efault);
> 	cur += sizeof(__to);
> 	ra_off = (typeof(ra_off))__to;
> }
>
> The issue is that on the last line __to is u8 instead of s8 and thus
> u8 gets casted to s32, which is performed without sign extension. __to
> would need to be s8 or get casted to s8 for sign extension to take
> place.
Ah, I get it now, thanks for spelling that out for me.
Here's what I have at the moment:
#define ___UNSAFE_GET_USER_INC(to, from, type, label)			\
({									\
	type __to;							\
	unsafe_get_user(__to, (type __user *)from, label);		\
	from += sizeof(__to);						\
	to = __to;							\
})

#define __UNSAFE_GET_USER_INC(to, from, size, label, u_or_s)		\
({									\
	switch (size) {							\
	case 1:								\
		___UNSAFE_GET_USER_INC(to, from, u_or_s##8, label);	\
		break;							\
	case 2:								\
		___UNSAFE_GET_USER_INC(to, from, u_or_s##16, label);	\
		break;							\
	case 4:								\
		___UNSAFE_GET_USER_INC(to, from, u_or_s##32, label);	\
		break;							\
	default:							\
		dbg_sec_uaccess("%d: bad unsafe_get_user() size %u\n",	\
				__LINE__, size);			\
		return -EFAULT;						\
	}								\
})

#define UNSAFE_GET_USER_UNSIGNED_INC(to, from, size, label)		\
	__UNSAFE_GET_USER_INC(to, from, size, label, u)

#define UNSAFE_GET_USER_SIGNED_INC(to, from, size, label)		\
	__UNSAFE_GET_USER_INC(to, from, size, label, s)

#define UNSAFE_GET_USER_INC(to, from, size, label)			\
	_Generic(to,							\
		 u8:  UNSAFE_GET_USER_UNSIGNED_INC(to, from, size, label), \
		 u16: UNSAFE_GET_USER_UNSIGNED_INC(to, from, size, label), \
		 u32: UNSAFE_GET_USER_UNSIGNED_INC(to, from, size, label), \
		 s8:  UNSAFE_GET_USER_SIGNED_INC(to, from, size, label), \
		 s16: UNSAFE_GET_USER_SIGNED_INC(to, from, size, label), \
		 s32: UNSAFE_GET_USER_SIGNED_INC(to, from, size, label))
^ permalink raw reply [flat|nested] 161+ messages in thread
* Re: [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers
2025-02-05 13:56 ` Jens Remus
@ 2025-02-07 21:13 ` Josh Poimboeuf
0 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-02-07 21:13 UTC (permalink / raw)
To: Jens Remus
Cc: Andrii Nakryiko, Indu Bhagat, x86, Peter Zijlstra, Steven Rostedt,
Ingo Molnar, Arnaldo Carvalho de Melo, linux-kernel, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Mathieu Desnoyers,
Florian Weimer, Andy Lutomirski, Masami Hiramatsu, Weinan Liu
On Wed, Feb 05, 2025 at 02:56:36PM +0100, Jens Remus wrote:
> On 29.01.2025 03:02, Josh Poimboeuf wrote:
>
> > Note FDEs aren't even needed here as the unwinder doesn't need to know
> > when a function begins/ends. The only info needed by the unwinder is
> > just the fre_data struct. So a simple binary search of fres[] is all
> > that's really needed.
>
> In SFrame V2, FDEs specify ranges bounded by function start address and
> length. FREs, in contrast, specify open ranges bounded only by a start
> address: their effect ends either when the next FRE comes into effect or
> when their FDE range ends.
> This concept enables holes in the .text section that have no valid
> FDE/FRE information associated with them.
>
> Your proposal lacks a mechanism to replicate those holes. It could be
> FDEs with a flag (or no offsets?) specifying that their range has no
> valid information.
In ORC, a hole is simply specified by an ORC entry (aka "FRE") of type
UNDEFINED.
That could be done here as well: the linker would replace a gap with an
"undefined" FRE which could be identified either by setting a bit in the
FRE header, or by using a special fre_data[] entry. For example,
fre_data index 0 could just be a placeholder for the undefined FRE type.
--
Josh
* Re: [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents
2025-02-07 21:06 ` Josh Poimboeuf
@ 2025-02-10 15:56 ` Jens Remus
0 siblings, 0 replies; 161+ messages in thread
From: Jens Remus @ 2025-02-10 15:56 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: x86, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Vasily Gorbik
On 07.02.2025 22:06, Josh Poimboeuf wrote:
> On Wed, Feb 05, 2025 at 10:47:58AM +0100, Jens Remus wrote:
>> UNSAFE_GET_USER_INC(ra_off, cur, offset_size, Efault);
>>
>> With offset_size=1 expands into:
>>
>> __UNSAFE_GET_USER_INC(/*to=*/ra_off, /*from=*/cur, /*type=*/u8, /*label=*/Efault);
>>
>> Expands into:
>>
>> {
>> u8 __to;
>> unsafe_get_user(__to, (u8 __user *)cur, Efault);
>> cur += sizeof(__to);
>> ra_off = (typeof(ra_off))__to;
>> }
>>
>> The issue is that on the last line __to is u8 instead of s8, and thus
>> u8 gets cast to s32, which is done without sign extension. __to
>> would need to be s8, or be cast to s8, for sign extension to take
>> place.
>
> Ah, I get it now, thanks for spelling that out for me.
>
> Here's what I have at the moment:
Thanks! Using your new UNSAFE_GET_USER_INC() in all places works great
on s390 when resolving the duplicate macro names (see below).
>
> #define __UNSAFE_GET_USER_INC(to, from, type, label) \
> ({ \
> type __to; \
> unsafe_get_user(__to, (type __user *)from, label); \
> from += sizeof(__to); \
> to = __to; \
> })
>
> #define __UNSAFE_GET_USER_INC(to, from, size, label, u_or_s) \
That does not compile. One of the macros needs to be renamed.
CC kernel/unwind/sframe.o
kernel/unwind/sframe.c:141:9: warning: "__UNSAFE_GET_USER_INC" redefined
141 | #define __UNSAFE_GET_USER_INC(to, from, size, label, u_or_s) \
| ^~~~~~~~~~~~~~~~~~~~~
kernel/unwind/sframe.c:133:9: note: this is the location of the previous definition
133 | #define __UNSAFE_GET_USER_INC(to, from, type, label) \
| ^~~~~~~~~~~~~~~~~~~~~
> ({ \
> switch (size) { \
> case 1: \
> __UNSAFE_GET_USER_INC(to, from, u_or_s##8, label); \
> break; \
> case 2: \
> __UNSAFE_GET_USER_INC(to, from, u_or_s##16, label); \
> break; \
> case 4: \
> __UNSAFE_GET_USER_INC(to, from, u_or_s##32, label); \
> break; \
> default: \
> dbg_sec_uaccess("%d: bad unsafe_get_user() size %u\n", \
> __LINE__, size); \
> return -EFAULT; \
> } \
> })
>
> #define UNSAFE_GET_USER_UNSIGNED_INC(to, from, size, label) \
> __UNSAFE_GET_USER_INC(to, from, size, label, u)
>
> #define UNSAFE_GET_USER_SIGNED_INC(to, from, size, label) \
> __UNSAFE_GET_USER_INC(to, from, size, label, s)
>
> #define UNSAFE_GET_USER_INC(to, from, size, label) \
> _Generic(to, \
> u8: UNSAFE_GET_USER_UNSIGNED_INC(to, from, size, label), \
> u16: UNSAFE_GET_USER_UNSIGNED_INC(to, from, size, label), \
> u32: UNSAFE_GET_USER_UNSIGNED_INC(to, from, size, label), \
> s8: UNSAFE_GET_USER_SIGNED_INC(to, from, size, label), \
> s16: UNSAFE_GET_USER_SIGNED_INC(to, from, size, label), \
> s32: UNSAFE_GET_USER_SIGNED_INC(to, from, size, label))
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
+49-7031-16-1128 Office
jremus@de.ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
* Re: [PATCH v4 01/39] task_work: Fix TWA_NMI_CURRENT error handling
2025-01-22 12:28 ` Peter Zijlstra
2025-01-22 20:47 ` Josh Poimboeuf
@ 2025-04-22 16:14 ` Steven Rostedt
1 sibling, 0 replies; 161+ messages in thread
From: Steven Rostedt @ 2025-04-22 16:14 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Josh Poimboeuf, x86, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
/me is resurrecting the sframe work
On Wed, 22 Jan 2025 13:28:21 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> > The error has to be checked before the write to task->task_works. Also
> > the try_cmpxchg() loop isn't needed in NMI context. The TWA_NMI_CURRENT
> > case really is special, keep things simple by keeping its code all
> > together in one place.
>
> NMIs can nest, consider #DB (which is NMI like) doing task_work_add()
> and getting interrupted with NMI doing the same.
>
I was looking at this patch and was thinking that the try_cmpxchg() is still
needed for the race against other CPUs.
Another CPU can add a task_work for this task, right?
If so, then even though the NMI adding the task work can't be interrupted
(considering there's no #DB and such), it still can clobber an update done
for this task's task_work from another CPU.
Is this patch still even needed?
-- Steve
* Re: [PATCH v4 02/39] task_work: Fix TWA_NMI_CURRENT race with __schedule()
2025-01-22 12:42 ` Peter Zijlstra
2025-01-22 21:03 ` Josh Poimboeuf
@ 2025-04-22 16:15 ` Steven Rostedt
2025-04-22 17:20 ` Josh Poimboeuf
1 sibling, 1 reply; 161+ messages in thread
From: Steven Rostedt @ 2025-04-22 16:15 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Josh Poimboeuf, x86, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Wed, 22 Jan 2025 13:42:28 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> If we hit before schedule(), all just works as expected; if we hit after
> schedule(), the task will already have the TIF flag set, and we'll hit
> the return-to-user path once it gets scheduled again.
>
> ---
> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index c969f1f26be5..155549c017b2 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -9,7 +9,12 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
> #ifdef CONFIG_IRQ_WORK
> static void task_work_set_notify_irq(struct irq_work *entry)
> {
> - test_and_set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
> + /*
> + * no-op IPI
> + *
> + * TWA_NMI_CURRENT will already have set the TIF flag, all
> + * this interrupt does is tickle the return-to-user path.
> + */
> }
> static DEFINE_PER_CPU(struct irq_work, irq_work_NMI_resume) =
> IRQ_WORK_INIT_HARD(task_work_set_notify_irq);
> @@ -98,6 +103,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
> break;
> #ifdef CONFIG_IRQ_WORK
> case TWA_NMI_CURRENT:
> + set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
> irq_work_queue(this_cpu_ptr(&irq_work_NMI_resume));
> break;
> #endif
Does this patch replace patches 1 and 2?
If so, Peter, can you give me a Signed-off-by?
-- Steve
* Re: [PATCH v4 02/39] task_work: Fix TWA_NMI_CURRENT race with __schedule()
2025-04-22 16:15 ` Steven Rostedt
@ 2025-04-22 17:20 ` Josh Poimboeuf
0 siblings, 0 replies; 161+ messages in thread
From: Josh Poimboeuf @ 2025-04-22 17:20 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, x86, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-kernel, Indu Bhagat, Mark Rutland, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
Sam James, linux-trace-kernel, Andrii Nakryiko, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Tue, Apr 22, 2025 at 12:15:41PM -0400, Steven Rostedt wrote:
> On Wed, 22 Jan 2025 13:42:28 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > If we hit before schedule(), all just works as expected; if we hit after
> > schedule(), the task will already have the TIF flag set, and we'll hit
> > the return-to-user path once it gets scheduled again.
> >
> > ---
> > diff --git a/kernel/task_work.c b/kernel/task_work.c
> > index c969f1f26be5..155549c017b2 100644
> > --- a/kernel/task_work.c
> > +++ b/kernel/task_work.c
> > @@ -9,7 +9,12 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
> > #ifdef CONFIG_IRQ_WORK
> > static void task_work_set_notify_irq(struct irq_work *entry)
> > {
> > - test_and_set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
> > + /*
> > + * no-op IPI
> > + *
> > + * TWA_NMI_CURRENT will already have set the TIF flag, all
> > + * this interrupt does is tickle the return-to-user path.
> > + */
> > }
> > static DEFINE_PER_CPU(struct irq_work, irq_work_NMI_resume) =
> > IRQ_WORK_INIT_HARD(task_work_set_notify_irq);
> > @@ -98,6 +103,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
> > break;
> > #ifdef CONFIG_IRQ_WORK
> > case TWA_NMI_CURRENT:
> > + set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
> > irq_work_queue(this_cpu_ptr(&irq_work_NMI_resume));
> > break;
> > #endif
>
> Does this patch replace patches 1 and 2?
Indeed it does.
> If so, Peter, can you give me a Signed-off-by?
>
> -- Steve
--
Josh
* Re: [PATCH v4 09/39] x86/vdso: Enable sframe generation in VDSO
2025-01-24 16:43 ` Josh Poimboeuf
2025-01-24 16:53 ` Josh Poimboeuf
@ 2025-04-22 17:44 ` Steven Rostedt
1 sibling, 0 replies; 161+ messages in thread
From: Steven Rostedt @ 2025-04-22 17:44 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Jens Remus, x86, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Andrii Nakryiko,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu, Heiko Carstens, Vasily Gorbik
On Fri, 24 Jan 2025 08:43:32 -0800
Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> On Fri, Jan 24, 2025 at 05:00:27PM +0100, Jens Remus wrote:
> > On 22.01.2025 03:31, Josh Poimboeuf wrote:
> > > diff --git a/arch/x86/include/asm/dwarf2.h b/arch/x86/include/asm/dwarf2.h
> > > index b195b3c8677e..1c354f648505 100644
> > > --- a/arch/x86/include/asm/dwarf2.h
> > > +++ b/arch/x86/include/asm/dwarf2.h
> > > @@ -12,8 +12,11 @@
> > > * For the vDSO, emit both runtime unwind information and debug
> > > * symbols for the .dbg file.
> > > */
> > > -
> >
> > Nit: Deleted blank line you introduced in "[PATCH v4 05/39] x86/asm:
> > Avoid emitting DWARF CFI for non-VDSO".
>
> Indeed.
Note, I'm getting ready to send out an update of Josh's patches.
This particular nit isn't really an issue, as #ifdefs do sometimes replace
blank lines.
Just mentioning this so you don't think I'm ignoring it when I post the
patches.
-- Steve
* Re: [PATCH v4 12/39] unwind_user: Add frame pointer support
2025-01-24 18:16 ` Josh Poimboeuf
@ 2025-04-24 13:41 ` Steven Rostedt
0 siblings, 0 replies; 161+ messages in thread
From: Steven Rostedt @ 2025-04-24 13:41 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Andrii Nakryiko, x86, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-kernel, Indu Bhagat, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
Jordan Rome, Sam James, linux-trace-kernel, Jens Remus,
Mathieu Desnoyers, Florian Weimer, Andy Lutomirski,
Masami Hiramatsu, Weinan Liu
On Fri, 24 Jan 2025 10:16:10 -0800
Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > Do you plan to reuse this logic for stack unwinding done by perf
> > subsystem in perf_callchain_user()? See is_uprobe_at_func_entry()
> > parts and also fixup_uretprobe_trampoline_entries() for some of the
> > quirks that have to be taken into account when doing frame
> > pointer-based unwinding. It would be great not to lose those in this
> > new reimplementation.
> >
> > Not sure what's the best way to avoid duplicating the logic, but I
> > thought I'd bring that up.
>
> Indeed! That was on the todo list and somehow evaporated.
I'm getting ready to post an update of these patches, but I want to mention
that this has not been addressed, and I'm replying here to make sure it
stays on the radar.
-- Steve
end of thread, other threads:[~2025-04-24 13:39 UTC | newest]
Thread overview: 161+ messages (download: mbox.gz / follow: Atom feed)
2025-01-22 2:30 [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
2025-01-22 2:30 ` [PATCH v4 01/39] task_work: Fix TWA_NMI_CURRENT error handling Josh Poimboeuf
2025-01-22 12:28 ` Peter Zijlstra
2025-01-22 20:47 ` Josh Poimboeuf
2025-01-23 8:14 ` Peter Zijlstra
2025-01-23 17:15 ` Josh Poimboeuf
2025-01-23 22:19 ` Peter Zijlstra
2025-04-22 16:14 ` Steven Rostedt
2025-01-22 2:30 ` [PATCH v4 02/39] task_work: Fix TWA_NMI_CURRENT race with __schedule() Josh Poimboeuf
2025-01-22 12:23 ` Peter Zijlstra
2025-01-22 12:42 ` Peter Zijlstra
2025-01-22 21:03 ` Josh Poimboeuf
2025-01-22 22:14 ` Josh Poimboeuf
2025-01-23 8:15 ` Peter Zijlstra
2025-04-22 16:15 ` Steven Rostedt
2025-04-22 17:20 ` Josh Poimboeuf
2025-01-22 2:30 ` [PATCH v4 03/39] mm: Add guard for mmap_read_lock Josh Poimboeuf
2025-01-22 2:30 ` [PATCH v4 04/39] x86/vdso: Fix DWARF generation for getrandom() Josh Poimboeuf
2025-01-22 2:30 ` [PATCH v4 05/39] x86/asm: Avoid emitting DWARF CFI for non-VDSO Josh Poimboeuf
2025-01-24 16:08 ` Jens Remus
2025-01-24 16:47 ` Josh Poimboeuf
2025-01-22 2:30 ` [PATCH v4 06/39] x86/asm: Fix VDSO DWARF generation with kernel IBT enabled Josh Poimboeuf
2025-01-22 2:30 ` [PATCH v4 07/39] x86/vdso: Use SYM_FUNC_{START,END} in __kernel_vsyscall() Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 08/39] x86/vdso: Use CFI macros in __vdso_sgx_enter_enclave() Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 09/39] x86/vdso: Enable sframe generation in VDSO Josh Poimboeuf
2025-01-24 16:00 ` Jens Remus
2025-01-24 16:43 ` Josh Poimboeuf
2025-01-24 16:53 ` Josh Poimboeuf
2025-04-22 17:44 ` Steven Rostedt
2025-01-24 16:30 ` Jens Remus
2025-01-24 16:56 ` Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 10/39] x86/uaccess: Add unsafe_copy_from_user() implementation Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 11/39] unwind_user: Add user space unwinding API Josh Poimboeuf
2025-01-24 16:41 ` Jens Remus
2025-01-24 17:09 ` Josh Poimboeuf
2025-01-24 17:59 ` Andrii Nakryiko
2025-01-24 18:08 ` Josh Poimboeuf
2025-01-24 20:02 ` Steven Rostedt
2025-01-24 22:05 ` Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 12/39] unwind_user: Add frame pointer support Josh Poimboeuf
2025-01-24 17:59 ` Andrii Nakryiko
2025-01-24 18:16 ` Josh Poimboeuf
2025-04-24 13:41 ` Steven Rostedt
2025-01-22 2:31 ` [PATCH v4 13/39] unwind_user/x86: Enable frame pointer unwinding on x86 Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 14/39] perf/x86: Rename get_segment_base() and make it global Josh Poimboeuf
2025-01-22 12:51 ` Peter Zijlstra
2025-01-22 21:37 ` Josh Poimboeuf
2025-01-24 20:09 ` Steven Rostedt
2025-01-24 22:06 ` Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 15/39] unwind_user: Add compat mode frame pointer support Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 16/39] unwind_user/x86: Enable compat mode frame pointer unwinding on x86 Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 17/39] unwind_user/sframe: Add support for reading .sframe headers Josh Poimboeuf
2025-01-24 18:00 ` Andrii Nakryiko
2025-01-24 19:21 ` Josh Poimboeuf
2025-01-24 20:13 ` Steven Rostedt
2025-01-24 22:39 ` Josh Poimboeuf
2025-01-24 22:13 ` Indu Bhagat
2025-01-28 1:10 ` Andrii Nakryiko
2025-01-29 2:02 ` Josh Poimboeuf
2025-01-30 0:02 ` Andrii Nakryiko
2025-02-04 18:26 ` Josh Poimboeuf
2025-01-30 21:39 ` Indu Bhagat
2025-02-05 0:57 ` Josh Poimboeuf
2025-02-06 1:10 ` Indu Bhagat
2025-02-05 13:56 ` Jens Remus
2025-02-07 21:13 ` Josh Poimboeuf
2025-01-30 21:21 ` Indu Bhagat
2025-02-04 19:59 ` Josh Poimboeuf
2025-02-05 23:16 ` Andrii Nakryiko
2025-02-05 11:01 ` Jens Remus
2025-02-05 23:05 ` Andrii Nakryiko
2025-01-24 20:31 ` Indu Bhagat
2025-01-22 2:31 ` [PATCH v4 18/39] unwind_user/sframe: Store sframe section data in per-mm maple tree Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 19/39] unwind_user/sframe: Add support for reading .sframe contents Josh Poimboeuf
2025-01-24 16:36 ` Jens Remus
2025-01-24 17:07 ` Josh Poimboeuf
2025-01-24 18:02 ` Andrii Nakryiko
2025-01-24 21:41 ` Josh Poimboeuf
2025-01-28 0:39 ` Andrii Nakryiko
2025-01-28 10:50 ` Jens Remus
2025-01-29 2:04 ` Josh Poimboeuf
2025-01-28 10:54 ` Jens Remus
2025-01-30 19:51 ` Weinan Liu
2025-02-04 19:42 ` Josh Poimboeuf
2025-01-30 15:07 ` Indu Bhagat
2025-02-04 18:38 ` Josh Poimboeuf
2025-01-30 15:47 ` Jens Remus
2025-02-04 18:51 ` Josh Poimboeuf
2025-02-05 9:47 ` Jens Remus
2025-02-07 21:06 ` Josh Poimboeuf
2025-02-10 15:56 ` Jens Remus
2025-01-22 2:31 ` [PATCH v4 20/39] unwind_user/sframe: Detect .sframe sections in executables Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 21/39] unwind_user/sframe: Add prctl() interface for registering .sframe sections Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 22/39] unwind_user/sframe: Wire up unwind_user to sframe Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 23/39] unwind_user/sframe/x86: Enable sframe unwinding on x86 Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 24/39] unwind_user/sframe: Remove .sframe section on detected corruption Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 25/39] unwind_user/sframe: Show file name in debug output Josh Poimboeuf
2025-01-30 16:17 ` Jens Remus
2025-02-04 19:10 ` Josh Poimboeuf
2025-02-05 10:04 ` Jens Remus
2025-01-22 2:31 ` [PATCH v4 26/39] unwind_user/sframe: Enable debugging in uaccess regions Josh Poimboeuf
2025-01-30 16:38 ` Jens Remus
2025-02-04 19:33 ` Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 27/39] unwind_user/sframe: Add .sframe validation option Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 28/39] unwind_user/deferred: Add deferred unwinding interface Josh Poimboeuf
2025-01-22 13:37 ` Peter Zijlstra
2025-01-22 14:16 ` Peter Zijlstra
2025-01-22 22:51 ` Josh Poimboeuf
2025-01-23 8:17 ` Peter Zijlstra
2025-01-23 18:30 ` Josh Poimboeuf
2025-01-23 21:58 ` Peter Zijlstra
2025-01-22 21:38 ` Josh Poimboeuf
2025-01-22 13:44 ` Peter Zijlstra
2025-01-22 21:52 ` Josh Poimboeuf
2025-01-22 20:13 ` Mathieu Desnoyers
2025-01-23 4:05 ` Josh Poimboeuf
2025-01-23 8:25 ` Peter Zijlstra
2025-01-23 18:43 ` Josh Poimboeuf
2025-01-23 22:13 ` Peter Zijlstra
2025-01-24 21:58 ` Steven Rostedt
2025-01-24 22:46 ` Josh Poimboeuf
2025-01-24 22:50 ` Josh Poimboeuf
2025-01-24 23:57 ` Steven Rostedt
2025-01-30 20:21 ` Steven Rostedt
2025-02-05 2:25 ` Josh Poimboeuf
2025-01-24 16:35 ` Jens Remus
2025-01-24 16:57 ` Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 29/39] unwind_user/deferred: Add unwind cache Josh Poimboeuf
2025-01-22 13:57 ` Peter Zijlstra
2025-01-22 22:36 ` Josh Poimboeuf
2025-01-23 8:31 ` Peter Zijlstra
2025-01-23 18:45 ` Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 30/39] unwind_user/deferred: Make unwind deferral requests NMI-safe Josh Poimboeuf
2025-01-22 14:15 ` Peter Zijlstra
2025-01-22 22:49 ` Josh Poimboeuf
2025-01-23 8:40 ` Peter Zijlstra
2025-01-23 19:48 ` Josh Poimboeuf
2025-01-23 19:54 ` Josh Poimboeuf
2025-01-23 22:17 ` Peter Zijlstra
2025-01-23 23:34 ` Josh Poimboeuf
2025-01-24 11:58 ` Peter Zijlstra
2025-01-22 14:24 ` Peter Zijlstra
2025-01-22 22:52 ` Josh Poimboeuf
2025-01-23 8:42 ` Peter Zijlstra
2025-01-22 2:31 ` [PATCH v4 31/39] perf: Remove get_perf_callchain() 'init_nr' argument Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 32/39] perf: Remove get_perf_callchain() 'crosstask' argument Josh Poimboeuf
2025-01-24 18:13 ` Andrii Nakryiko
2025-01-24 22:00 ` Josh Poimboeuf
2025-01-28 0:39 ` Andrii Nakryiko
2025-01-22 2:31 ` [PATCH v4 33/39] perf: Simplify get_perf_callchain() user logic Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 34/39] perf: Skip user unwind if !current->mm Josh Poimboeuf
2025-01-22 14:29 ` Peter Zijlstra
2025-01-22 23:08 ` Josh Poimboeuf
2025-01-23 8:44 ` Peter Zijlstra
2025-01-22 2:31 ` [PATCH v4 35/39] perf: Support deferred user callchains Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 36/39] perf tools: Minimal CALLCHAIN_DEFERRED support Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 37/39] perf record: Enable defer_callchain for user callchains Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 38/39] perf script: Display PERF_RECORD_CALLCHAIN_DEFERRED Josh Poimboeuf
2025-01-22 2:31 ` [PATCH v4 39/39] perf tools: Merge deferred user callchains Josh Poimboeuf
2025-01-22 2:35 ` [PATCH v4 00/39] unwind, perf: sframe user space unwinding Josh Poimboeuf
2025-01-22 16:13 ` Steven Rostedt