* [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure
@ 2025-05-13 22:34 Steven Rostedt
  2025-05-13 22:34 ` [PATCH v9 01/13] unwind_user: Add user space unwinding API Steven Rostedt
                   ` (13 more replies)
  0 siblings, 14 replies; 23+ messages in thread
From: Steven Rostedt @ 2025-05-13 22:34 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andrii Nakryiko


This series does not make any user-space-visible changes.
It only adds the necessary infrastructure for the deferred unwinder.

 Based on tip/master: d60119b82d8871c66563f8657bb3e550a80234de

This series has modifications in x86, so I would like it to go through the x86
tree. Preferably it can go into this merge window so that we can focus on
getting perf and ftrace to work on top of it.

Perf exposes a lot of this interface to user space, as the perf tool needs
to handle the merging of the stacks. I figured it would be better to get
the kernel side mostly done first, and then work out the kinks in the code
between user and kernel.

As there is no exposure to user space at this time, if something is found
wrong with it, it can be fixed without worrying about breaking the API.

I ran all these patches through my tests and they passed
(I had to add the fix for CONFIG_MODULE in tip/master:
 https://lore.kernel.org/all/20250513025839.495755-1-ebiggers@kernel.org/).

Changes since v8: https://lore.kernel.org/all/20250509164524.448387100@goodmis.org/

- The patches posted here have not been updated since v8.

- Removed updates to perf proper

  This series only creates the unwind infrastructure so that perf and
  ftrace can be worked on simultaneously without dependencies on each
  other.

- Discussion about using guard for SRCU

   Andrii Nakryiko brought up using guard for SRCU around the
   list iteration, but it was decided that just using the normal
   methods was fine for this use case.


Josh Poimboeuf (9):
      unwind_user: Add user space unwinding API
      unwind_user: Add frame pointer support
      unwind_user/x86: Enable frame pointer unwinding on x86
      perf/x86: Rename and move get_segment_base() and make it global
      unwind_user: Add compat mode frame pointer support
      unwind_user/x86: Enable compat mode frame pointer unwinding on x86
      unwind_user/deferred: Add unwind cache
      unwind_user/deferred: Add deferred unwinding interface
      unwind_user/deferred: Make unwind deferral requests NMI-safe

Steven Rostedt (4):
      unwind_user/deferred: Add unwind_deferred_trace()
      unwind deferred: Use bitmask to determine which callbacks to call
      unwind deferred: Use SRCU unwind_deferred_task_work()
      unwind: Clear unwind_mask on exit back to user space

----
 MAINTAINERS                              |   8 +
 arch/Kconfig                             |  11 +
 arch/x86/Kconfig                         |   2 +
 arch/x86/events/core.c                   |  44 +---
 arch/x86/include/asm/ptrace.h            |   2 +
 arch/x86/include/asm/unwind_user.h       |  61 +++++
 arch/x86/include/asm/unwind_user_types.h |  17 ++
 arch/x86/kernel/ptrace.c                 |  38 ++++
 include/asm-generic/Kbuild               |   2 +
 include/asm-generic/unwind_user.h        |  24 ++
 include/asm-generic/unwind_user_types.h  |   9 +
 include/linux/entry-common.h             |   2 +
 include/linux/sched.h                    |   6 +
 include/linux/unwind_deferred.h          |  72 ++++++
 include/linux/unwind_deferred_types.h    |  17 ++
 include/linux/unwind_user.h              |  15 ++
 include/linux/unwind_user_types.h        |  35 +++
 kernel/Makefile                          |   1 +
 kernel/fork.c                            |   4 +
 kernel/unwind/Makefile                   |   1 +
 kernel/unwind/deferred.c                 | 367 +++++++++++++++++++++++++++++++
 kernel/unwind/user.c                     | 130 +++++++++++
 22 files changed, 829 insertions(+), 39 deletions(-)
 create mode 100644 arch/x86/include/asm/unwind_user.h
 create mode 100644 arch/x86/include/asm/unwind_user_types.h
 create mode 100644 include/asm-generic/unwind_user.h
 create mode 100644 include/asm-generic/unwind_user_types.h
 create mode 100644 include/linux/unwind_deferred.h
 create mode 100644 include/linux/unwind_deferred_types.h
 create mode 100644 include/linux/unwind_user.h
 create mode 100644 include/linux/unwind_user_types.h
 create mode 100644 kernel/unwind/Makefile
 create mode 100644 kernel/unwind/deferred.c
 create mode 100644 kernel/unwind/user.c


* [PATCH v9 01/13] unwind_user: Add user space unwinding API
  2025-05-13 22:34 [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
@ 2025-05-13 22:34 ` Steven Rostedt
  2025-05-13 22:34 ` [PATCH v9 02/13] unwind_user: Add frame pointer support Steven Rostedt
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Steven Rostedt @ 2025-05-13 22:34 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andrii Nakryiko

From: Josh Poimboeuf <jpoimboe@kernel.org>

Introduce a generic API for unwinding user stacks.

In order to expand user space unwinding to handle more complex
scenarios, such as deferred unwinding and reading user space information,
create a generic interface that all architectures supporting the
various unwinding methods can use.

This is an alternative to the simple stack_trace_save_user() API for
handling user space stack traces. It does not replace that interface;
rather, this interface will be used to expand the functionality of user
space stack walking.

None of the structures introduced will be exposed to user space tooling.
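
For illustration, a walk of the current task's user stack with this API
could look like the following (a minimal sketch, not part of this patch;
the caller supplies the entries buffer):

	unsigned long entries[64];
	struct unwind_stacktrace trace = { .entries = entries };

	if (!unwind_user(&trace, 64)) {
		for (unsigned int i = 0; i < trace.nr; i++)
			pr_info("user frame: 0x%lx\n", trace.entries[i]);
	}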

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 MAINTAINERS                       |  8 +++++
 arch/Kconfig                      |  3 ++
 include/linux/unwind_user.h       | 15 +++++++++
 include/linux/unwind_user_types.h | 31 +++++++++++++++++
 kernel/Makefile                   |  1 +
 kernel/unwind/Makefile            |  1 +
 kernel/unwind/user.c              | 55 +++++++++++++++++++++++++++++++
 7 files changed, 114 insertions(+)
 create mode 100644 include/linux/unwind_user.h
 create mode 100644 include/linux/unwind_user_types.h
 create mode 100644 kernel/unwind/Makefile
 create mode 100644 kernel/unwind/user.c

diff --git a/MAINTAINERS b/MAINTAINERS
index d978af532430..57e86405a8af 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -25373,6 +25373,14 @@ F:	Documentation/driver-api/uio-howto.rst
 F:	drivers/uio/
 F:	include/linux/uio_driver.h
 
+USERSPACE STACK UNWINDING
+M:	Josh Poimboeuf <jpoimboe@kernel.org>
+M:	Steven Rostedt <rostedt@goodmis.org>
+S:	Maintained
+F:	include/linux/unwind*.h
+F:	kernel/unwind/
+
+
 UTIL-LINUX PACKAGE
 M:	Karel Zak <kzak@redhat.com>
 L:	util-linux@vger.kernel.org
diff --git a/arch/Kconfig b/arch/Kconfig
index b0adb665041f..ccbcead9fac0 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -435,6 +435,9 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
 	  It uses the same command line parameters, and sysctl interface,
 	  as the generic hardlockup detectors.
 
+config UNWIND_USER
+	bool
+
 config HAVE_PERF_REGS
 	bool
 	help
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
new file mode 100644
index 000000000000..aa7923c1384f
--- /dev/null
+++ b/include/linux/unwind_user.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_H
+#define _LINUX_UNWIND_USER_H
+
+#include <linux/unwind_user_types.h>
+
+int unwind_user_start(struct unwind_user_state *state);
+int unwind_user_next(struct unwind_user_state *state);
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
+
+#define for_each_user_frame(state) \
+	for (unwind_user_start((state)); !(state)->done; unwind_user_next((state)))
+
+#endif /* _LINUX_UNWIND_USER_H */
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
new file mode 100644
index 000000000000..6ed1b4ae74e1
--- /dev/null
+++ b/include/linux/unwind_user_types.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_TYPES_H
+#define _LINUX_UNWIND_USER_TYPES_H
+
+#include <linux/types.h>
+
+enum unwind_user_type {
+	UNWIND_USER_TYPE_NONE,
+};
+
+struct unwind_stacktrace {
+	unsigned int	nr;
+	unsigned long	*entries;
+};
+
+struct unwind_user_frame {
+	s32 cfa_off;
+	s32 ra_off;
+	s32 fp_off;
+	bool use_fp;
+};
+
+struct unwind_user_state {
+	unsigned long ip;
+	unsigned long sp;
+	unsigned long fp;
+	enum unwind_user_type type;
+	bool done;
+};
+
+#endif /* _LINUX_UNWIND_USER_TYPES_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 434929de17ef..5a2b2be2a32d 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -55,6 +55,7 @@ obj-y += rcu/
 obj-y += livepatch/
 obj-y += dma/
 obj-y += entry/
+obj-y += unwind/
 obj-$(CONFIG_MODULES) += module/
 
 obj-$(CONFIG_KCMP) += kcmp.o
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
new file mode 100644
index 000000000000..349ce3677526
--- /dev/null
+++ b/kernel/unwind/Makefile
@@ -0,0 +1 @@
+ obj-$(CONFIG_UNWIND_USER) += user.o
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
new file mode 100644
index 000000000000..d30449328981
--- /dev/null
+++ b/kernel/unwind/user.c
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Generic interfaces for unwinding user space
+ */
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+#include <linux/unwind_user.h>
+
+int unwind_user_next(struct unwind_user_state *state)
+{
+	/* no implementation yet */
+	return -EINVAL;
+}
+
+int unwind_user_start(struct unwind_user_state *state)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+
+	memset(state, 0, sizeof(*state));
+
+	if ((current->flags & PF_KTHREAD) || !user_mode(regs)) {
+		state->done = true;
+		return -EINVAL;
+	}
+
+	state->type = UNWIND_USER_TYPE_NONE;
+
+	state->ip = instruction_pointer(regs);
+	state->sp = user_stack_pointer(regs);
+	state->fp = frame_pointer(regs);
+
+	return 0;
+}
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries)
+{
+	struct unwind_user_state state;
+
+	trace->nr = 0;
+
+	if (!max_entries)
+		return -EINVAL;
+
+	if (current->flags & PF_KTHREAD)
+		return 0;
+
+	for_each_user_frame(&state) {
+		trace->entries[trace->nr++] = state.ip;
+		if (trace->nr >= max_entries)
+			break;
+	}
+
+	return 0;
+}
-- 
2.47.2




* [PATCH v9 02/13] unwind_user: Add frame pointer support
  2025-05-13 22:34 [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
  2025-05-13 22:34 ` [PATCH v9 01/13] unwind_user: Add user space unwinding API Steven Rostedt
@ 2025-05-13 22:34 ` Steven Rostedt
  2025-05-13 22:34 ` [PATCH v9 03/13] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Steven Rostedt @ 2025-05-13 22:34 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andrii Nakryiko

From: Josh Poimboeuf <jpoimboe@kernel.org>

Add optional support for user space frame pointer unwinding.  If
supported, the arch needs to enable CONFIG_HAVE_UNWIND_USER_FP and
define ARCH_INIT_USER_FP_FRAME.

By encoding the frame offsets in struct unwind_user_frame, much of this
code can also be reused for future unwinder implementations like sframe.
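
For illustration, each unwind step applies the encoded offsets as follows
(a sketch of the logic unwind_user_next() below implements):

	/* Canonical frame address of the current frame: */
	cfa = (frame->use_fp ? state->fp : state->sp) + frame->cfa_off;
	/* Return address and caller's frame pointer, read from user space: */
	get_user(ra, (unsigned long __user *)(cfa + frame->ra_off));
	get_user(fp, (unsigned long __user *)(cfa + frame->fp_off));

The next frame's (ip, sp, fp) then become (ra, cfa, fp), and the walk ends
when a user access faults or the stack stops growing toward higher
addresses.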

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 arch/Kconfig                      |  4 +++
 include/asm-generic/unwind_user.h |  9 ++++++
 include/linux/unwind_user_types.h |  1 +
 kernel/unwind/user.c              | 51 +++++++++++++++++++++++++++++--
 4 files changed, 63 insertions(+), 2 deletions(-)
 create mode 100644 include/asm-generic/unwind_user.h

diff --git a/arch/Kconfig b/arch/Kconfig
index ccbcead9fac0..0e3844c0e200 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -438,6 +438,10 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
 config UNWIND_USER
 	bool
 
+config HAVE_UNWIND_USER_FP
+	bool
+	select UNWIND_USER
+
 config HAVE_PERF_REGS
 	bool
 	help
diff --git a/include/asm-generic/unwind_user.h b/include/asm-generic/unwind_user.h
new file mode 100644
index 000000000000..832425502fb3
--- /dev/null
+++ b/include/asm-generic/unwind_user.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_UNWIND_USER_H
+#define _ASM_GENERIC_UNWIND_USER_H
+
+#ifndef ARCH_INIT_USER_FP_FRAME
+ #define ARCH_INIT_USER_FP_FRAME
+#endif
+
+#endif /* _ASM_GENERIC_UNWIND_USER_H */
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 6ed1b4ae74e1..65bd070eb6b0 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -6,6 +6,7 @@
 
 enum unwind_user_type {
 	UNWIND_USER_TYPE_NONE,
+	UNWIND_USER_TYPE_FP,
 };
 
 struct unwind_stacktrace {
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index d30449328981..0671a81494d3 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -6,10 +6,54 @@
 #include <linux/sched.h>
 #include <linux/sched/task_stack.h>
 #include <linux/unwind_user.h>
+#include <linux/uaccess.h>
+#include <asm/unwind_user.h>
+
+static struct unwind_user_frame fp_frame = {
+	ARCH_INIT_USER_FP_FRAME
+};
+
+static inline bool fp_state(struct unwind_user_state *state)
+{
+	return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP) &&
+	       state->type == UNWIND_USER_TYPE_FP;
+}
 
 int unwind_user_next(struct unwind_user_state *state)
 {
-	/* no implementation yet */
+	struct unwind_user_frame _frame;
+	struct unwind_user_frame *frame = &_frame;
+	unsigned long cfa = 0, fp, ra = 0;
+
+	if (state->done)
+		return -EINVAL;
+
+	if (fp_state(state))
+		frame = &fp_frame;
+	else
+		goto the_end;
+
+	cfa = (frame->use_fp ? state->fp : state->sp) + frame->cfa_off;
+
+	/* stack going in wrong direction? */
+	if (cfa <= state->sp)
+		goto the_end;
+
+	if (get_user(ra, (unsigned long __user *)(cfa + frame->ra_off)))
+		goto the_end;
+
+	if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
+		goto the_end;
+
+	state->ip = ra;
+	state->sp = cfa;
+	if (frame->fp_off)
+		state->fp = fp;
+
+	return 0;
+
+the_end:
+	state->done = true;
 	return -EINVAL;
 }
 
@@ -24,7 +68,10 @@ int unwind_user_start(struct unwind_user_state *state)
 		return -EINVAL;
 	}
 
-	state->type = UNWIND_USER_TYPE_NONE;
+	if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+		state->type = UNWIND_USER_TYPE_FP;
+	else
+		state->type = UNWIND_USER_TYPE_NONE;
 
 	state->ip = instruction_pointer(regs);
 	state->sp = user_stack_pointer(regs);
-- 
2.47.2




* [PATCH v9 03/13] unwind_user/x86: Enable frame pointer unwinding on x86
  2025-05-13 22:34 [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
  2025-05-13 22:34 ` [PATCH v9 01/13] unwind_user: Add user space unwinding API Steven Rostedt
  2025-05-13 22:34 ` [PATCH v9 02/13] unwind_user: Add frame pointer support Steven Rostedt
@ 2025-05-13 22:34 ` Steven Rostedt
  2025-05-13 22:34 ` [PATCH v9 04/13] perf/x86: Rename and move get_segment_base() and make it global Steven Rostedt
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Steven Rostedt @ 2025-05-13 22:34 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andrii Nakryiko

From: Josh Poimboeuf <jpoimboe@kernel.org>

Use ARCH_INIT_USER_FP_FRAME to describe how frame pointers are unwound
on x86, and enable CONFIG_HAVE_UNWIND_USER_FP accordingly so the
unwind_user interfaces can be used.
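
Concretely, these offsets encode the standard x86-64 frame layout, where
RBP points at the saved RBP (a sketch; all values relative to the
computed CFA = RBP + 16):

	CFA      = RBP + 16	/* cfa_off = sizeof(long) *  2, use_fp */
	CFA -  8 = RBP +  8	/* ra_off  = sizeof(long) * -1: return address */
	CFA - 16 = RBP		/* fp_off  = sizeof(long) * -2: saved RBP */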

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/include/asm/unwind_user.h | 11 +++++++++++
 2 files changed, 12 insertions(+)
 create mode 100644 arch/x86/include/asm/unwind_user.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 891a69b308cb..456e085263e1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -301,6 +301,7 @@ config X86
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UACCESS_VALIDATION		if HAVE_OBJTOOL
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select HAVE_UNWIND_USER_FP		if X86_64
 	select HAVE_USER_RETURN_NOTIFIER
 	select HAVE_GENERIC_VDSO
 	select VDSO_GETRANDOM			if X86_64
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
new file mode 100644
index 000000000000..8597857bf896
--- /dev/null
+++ b/arch/x86/include/asm/unwind_user.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_UNWIND_USER_H
+#define _ASM_X86_UNWIND_USER_H
+
+#define ARCH_INIT_USER_FP_FRAME							\
+	.cfa_off	= (s32)sizeof(long) *  2,				\
+	.ra_off		= (s32)sizeof(long) * -1,				\
+	.fp_off		= (s32)sizeof(long) * -2,				\
+	.use_fp		= true,
+
+#endif /* _ASM_X86_UNWIND_USER_H */
-- 
2.47.2




* [PATCH v9 04/13] perf/x86: Rename and move get_segment_base() and make it global
  2025-05-13 22:34 [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (2 preceding siblings ...)
  2025-05-13 22:34 ` [PATCH v9 03/13] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
@ 2025-05-13 22:34 ` Steven Rostedt
  2025-05-13 22:34 ` [PATCH v9 05/13] unwind_user: Add compat mode frame pointer support Steven Rostedt
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Steven Rostedt @ 2025-05-13 22:34 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andrii Nakryiko

From: Josh Poimboeuf <jpoimboe@kernel.org>

get_segment_base() will be used by the unwind_user code, so make it
global and rename it so it doesn't conflict with a KVM function of the
same name.

As the function is no longer specific to perf, move it to ptrace.c as that
seems to be a better location for a generic function like this.
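
A caller sketch (mirroring how a later patch in this series uses it; the
function now asserts that IRQs are disabled, so callers wrap it
accordingly):

	unsigned long cs_base, ss_base;

	scoped_guard(irqsave) {
		cs_base = segment_base_address(regs->cs);
		ss_base = segment_base_address(regs->ss);
	}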

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 arch/x86/events/core.c        | 44 ++++-------------------------------
 arch/x86/include/asm/ptrace.h |  2 ++
 arch/x86/kernel/ptrace.c      | 38 ++++++++++++++++++++++++++++++
 3 files changed, 45 insertions(+), 39 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 2e10dcf897c5..cc6329235b68 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -43,6 +43,7 @@
 #include <asm/ldt.h>
 #include <asm/unwind.h>
 #include <asm/uprobes.h>
+#include <asm/ptrace.h>
 #include <asm/ibt.h>
 
 #include "perf_event.h"
@@ -2809,41 +2810,6 @@ valid_user_frame(const void __user *fp, unsigned long size)
 	return __access_ok(fp, size);
 }
 
-static unsigned long get_segment_base(unsigned int segment)
-{
-	struct desc_struct *desc;
-	unsigned int idx = segment >> 3;
-
-	if ((segment & SEGMENT_TI_MASK) == SEGMENT_LDT) {
-#ifdef CONFIG_MODIFY_LDT_SYSCALL
-		struct ldt_struct *ldt;
-
-		/*
-		 * If we're not in a valid context with a real (not just lazy)
-		 * user mm, then don't even try.
-		 */
-		if (!nmi_uaccess_okay())
-			return 0;
-
-		/* IRQs are off, so this synchronizes with smp_store_release */
-		ldt = smp_load_acquire(&current->mm->context.ldt);
-		if (!ldt || idx >= ldt->nr_entries)
-			return 0;
-
-		desc = &ldt->entries[idx];
-#else
-		return 0;
-#endif
-	} else {
-		if (idx >= GDT_ENTRIES)
-			return 0;
-
-		desc = raw_cpu_ptr(gdt_page.gdt) + idx;
-	}
-
-	return get_desc_base(desc);
-}
-
 #ifdef CONFIG_UPROBES
 /*
  * Heuristic-based check if uprobe is installed at the function entry.
@@ -2900,8 +2866,8 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
 	if (user_64bit_mode(regs))
 		return 0;
 
-	cs_base = get_segment_base(regs->cs);
-	ss_base = get_segment_base(regs->ss);
+	cs_base = segment_base_address(regs->cs);
+	ss_base = segment_base_address(regs->ss);
 
 	fp = compat_ptr(ss_base + regs->bp);
 	pagefault_disable();
@@ -3020,11 +2986,11 @@ static unsigned long code_segment_base(struct pt_regs *regs)
 		return 0x10 * regs->cs;
 
 	if (user_mode(regs) && regs->cs != __USER_CS)
-		return get_segment_base(regs->cs);
+		return segment_base_address(regs->cs);
 #else
 	if (user_mode(regs) && !user_64bit_mode(regs) &&
 	    regs->cs != __USER32_CS)
-		return get_segment_base(regs->cs);
+		return segment_base_address(regs->cs);
 #endif
 	return 0;
 }
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 50f75467f73d..59357ec98e52 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -314,6 +314,8 @@ static __always_inline bool regs_irqs_disabled(struct pt_regs *regs)
 	return !(regs->flags & X86_EFLAGS_IF);
 }
 
+unsigned long segment_base_address(unsigned int segment);
+
 /* Query offset/name of register from its name/offset */
 extern int regs_query_register_offset(const char *name);
 extern const char *regs_query_register_name(unsigned int offset);
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 095f04bdabdc..81353a09701b 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -41,6 +41,7 @@
 #include <asm/syscall.h>
 #include <asm/fsgsbase.h>
 #include <asm/io_bitmap.h>
+#include <asm/mmu_context.h>
 
 #include "tls.h"
 
@@ -339,6 +340,43 @@ static int set_segment_reg(struct task_struct *task,
 
 #endif	/* CONFIG_X86_32 */
 
+unsigned long segment_base_address(unsigned int segment)
+{
+	struct desc_struct *desc;
+	unsigned int idx = segment >> 3;
+
+	lockdep_assert_irqs_disabled();
+
+	if ((segment & SEGMENT_TI_MASK) == SEGMENT_LDT) {
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+		struct ldt_struct *ldt;
+
+		/*
+		 * If we're not in a valid context with a real (not just lazy)
+		 * user mm, then don't even try.
+		 */
+		if (!nmi_uaccess_okay())
+			return 0;
+
+		/* IRQs are off, so this synchronizes with smp_store_release */
+		ldt = smp_load_acquire(&current->mm->context.ldt);
+		if (!ldt || idx >= ldt->nr_entries)
+			return 0;
+
+		desc = &ldt->entries[idx];
+#else
+		return 0;
+#endif
+	} else {
+		if (idx >= GDT_ENTRIES)
+			return 0;
+
+		desc = raw_cpu_ptr(gdt_page.gdt) + idx;
+	}
+
+	return get_desc_base(desc);
+}
+
 static unsigned long get_flags(struct task_struct *task)
 {
 	unsigned long retval = task_pt_regs(task)->flags;
-- 
2.47.2




* [PATCH v9 05/13] unwind_user: Add compat mode frame pointer support
  2025-05-13 22:34 [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (3 preceding siblings ...)
  2025-05-13 22:34 ` [PATCH v9 04/13] perf/x86: Rename and move get_segment_base() and make it global Steven Rostedt
@ 2025-05-13 22:34 ` Steven Rostedt
  2025-05-13 22:34 ` [PATCH v9 06/13] unwind_user/x86: Enable compat mode frame pointer unwinding on x86 Steven Rostedt
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Steven Rostedt @ 2025-05-13 22:34 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andrii Nakryiko

From: Josh Poimboeuf <jpoimboe@kernel.org>

Add optional support for user space compat mode frame pointer unwinding.
If supported, the arch needs to enable CONFIG_HAVE_UNWIND_USER_COMPAT_FP
and define ARCH_INIT_USER_COMPAT_FP_FRAME.
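
For reference, an architecture opting in provides definitions along these
lines (matching what the x86 patch later in this series adds):

	#define ARCH_INIT_USER_COMPAT_FP_FRAME				\
		.cfa_off	= (s32)sizeof(u32) *  2,		\
		.ra_off		= (s32)sizeof(u32) * -1,		\
		.fp_off		= (s32)sizeof(u32) * -2,		\
		.use_fp		= true,

	#define in_compat_mode(regs) !user_64bit_mode(regs)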

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 arch/Kconfig                            |  4 +++
 include/asm-generic/Kbuild              |  2 ++
 include/asm-generic/unwind_user.h       | 15 +++++++++++
 include/asm-generic/unwind_user_types.h |  9 +++++++
 include/linux/unwind_user_types.h       |  3 +++
 kernel/unwind/user.c                    | 36 ++++++++++++++++++++++---
 6 files changed, 65 insertions(+), 4 deletions(-)
 create mode 100644 include/asm-generic/unwind_user_types.h

diff --git a/arch/Kconfig b/arch/Kconfig
index 0e3844c0e200..dbb1cc89e040 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -442,6 +442,10 @@ config HAVE_UNWIND_USER_FP
 	bool
 	select UNWIND_USER
 
+config HAVE_UNWIND_USER_COMPAT_FP
+	bool
+	depends on HAVE_UNWIND_USER_FP
+
 config HAVE_PERF_REGS
 	bool
 	help
diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
index 8675b7b4ad23..b797a2434396 100644
--- a/include/asm-generic/Kbuild
+++ b/include/asm-generic/Kbuild
@@ -59,6 +59,8 @@ mandatory-y += tlbflush.h
 mandatory-y += topology.h
 mandatory-y += trace_clock.h
 mandatory-y += uaccess.h
+mandatory-y += unwind_user.h
+mandatory-y += unwind_user_types.h
 mandatory-y += vermagic.h
 mandatory-y += vga.h
 mandatory-y += video.h
diff --git a/include/asm-generic/unwind_user.h b/include/asm-generic/unwind_user.h
index 832425502fb3..385638ce4aec 100644
--- a/include/asm-generic/unwind_user.h
+++ b/include/asm-generic/unwind_user.h
@@ -2,8 +2,23 @@
 #ifndef _ASM_GENERIC_UNWIND_USER_H
 #define _ASM_GENERIC_UNWIND_USER_H
 
+#include <asm/unwind_user_types.h>
+
 #ifndef ARCH_INIT_USER_FP_FRAME
  #define ARCH_INIT_USER_FP_FRAME
 #endif
 
+#ifndef ARCH_INIT_USER_COMPAT_FP_FRAME
+ #define ARCH_INIT_USER_COMPAT_FP_FRAME
+ #define in_compat_mode(regs) false
+#endif
+
+#ifndef arch_unwind_user_init
+static inline void arch_unwind_user_init(struct unwind_user_state *state, struct pt_regs *regs) {}
+#endif
+
+#ifndef arch_unwind_user_next
+static inline void arch_unwind_user_next(struct unwind_user_state *state) {}
+#endif
+
 #endif /* _ASM_GENERIC_UNWIND_USER_H */
diff --git a/include/asm-generic/unwind_user_types.h b/include/asm-generic/unwind_user_types.h
new file mode 100644
index 000000000000..ee803de7c998
--- /dev/null
+++ b/include/asm-generic/unwind_user_types.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_UNWIND_USER_TYPES_H
+#define _ASM_GENERIC_UNWIND_USER_TYPES_H
+
+#ifndef arch_unwind_user_state
+struct arch_unwind_user_state {};
+#endif
+
+#endif /* _ASM_GENERIC_UNWIND_USER_TYPES_H */
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 65bd070eb6b0..3ec4a097a3dd 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -3,10 +3,12 @@
 #define _LINUX_UNWIND_USER_TYPES_H
 
 #include <linux/types.h>
+#include <asm/unwind_user_types.h>
 
 enum unwind_user_type {
 	UNWIND_USER_TYPE_NONE,
 	UNWIND_USER_TYPE_FP,
+	UNWIND_USER_TYPE_COMPAT_FP,
 };
 
 struct unwind_stacktrace {
@@ -25,6 +27,7 @@ struct unwind_user_state {
 	unsigned long ip;
 	unsigned long sp;
 	unsigned long fp;
+	struct arch_unwind_user_state arch;
 	enum unwind_user_type type;
 	bool done;
 };
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 0671a81494d3..635cc04bb299 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -13,12 +13,32 @@ static struct unwind_user_frame fp_frame = {
 	ARCH_INIT_USER_FP_FRAME
 };
 
+static struct unwind_user_frame compat_fp_frame = {
+	ARCH_INIT_USER_COMPAT_FP_FRAME
+};
+
 static inline bool fp_state(struct unwind_user_state *state)
 {
 	return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP) &&
 	       state->type == UNWIND_USER_TYPE_FP;
 }
 
+static inline bool compat_state(struct unwind_user_state *state)
+{
+	return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) &&
+	       state->type == UNWIND_USER_TYPE_COMPAT_FP;
+}
+
+#define UNWIND_GET_USER_LONG(to, from, state)				\
+({									\
+	int __ret;							\
+	if (compat_state(state))					\
+		__ret = get_user(to, (u32 __user *)(from));		\
+	else								\
+		__ret = get_user(to, (u64 __user *)(from));		\
+	__ret;								\
+})
+
 int unwind_user_next(struct unwind_user_state *state)
 {
 	struct unwind_user_frame _frame;
@@ -28,7 +48,9 @@ int unwind_user_next(struct unwind_user_state *state)
 	if (state->done)
 		return -EINVAL;
 
-	if (fp_state(state))
+	if (compat_state(state))
+		frame = &compat_fp_frame;
+	else if (fp_state(state))
 		frame = &fp_frame;
 	else
 		goto the_end;
@@ -39,10 +61,10 @@ int unwind_user_next(struct unwind_user_state *state)
 	if (cfa <= state->sp)
 		goto the_end;
 
-	if (get_user(ra, (unsigned long __user *)(cfa + frame->ra_off)))
+	if (UNWIND_GET_USER_LONG(ra, cfa + frame->ra_off, state))
 		goto the_end;
 
-	if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
+	if (frame->fp_off && UNWIND_GET_USER_LONG(fp, cfa + frame->fp_off, state))
 		goto the_end;
 
 	state->ip = ra;
@@ -50,6 +72,8 @@ int unwind_user_next(struct unwind_user_state *state)
 	if (frame->fp_off)
 		state->fp = fp;
 
+	arch_unwind_user_next(state);
+
 	return 0;
 
 the_end:
@@ -68,7 +92,9 @@ int unwind_user_start(struct unwind_user_state *state)
 		return -EINVAL;
 	}
 
-	if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+	if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) && in_compat_mode(regs))
+		state->type = UNWIND_USER_TYPE_COMPAT_FP;
+	else if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
 		state->type = UNWIND_USER_TYPE_FP;
 	else
 		state->type = UNWIND_USER_TYPE_NONE;
@@ -77,6 +103,8 @@ int unwind_user_start(struct unwind_user_state *state)
 	state->sp = user_stack_pointer(regs);
 	state->fp = frame_pointer(regs);
 
+	arch_unwind_user_init(state, regs);
+
 	return 0;
 }
 
-- 
2.47.2




* [PATCH v9 06/13] unwind_user/x86: Enable compat mode frame pointer unwinding on x86
  2025-05-13 22:34 [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (4 preceding siblings ...)
  2025-05-13 22:34 ` [PATCH v9 05/13] unwind_user: Add compat mode frame pointer support Steven Rostedt
@ 2025-05-13 22:34 ` Steven Rostedt
  2025-05-13 22:34 ` [PATCH v9 07/13] unwind_user/deferred: Add unwind_deferred_trace() Steven Rostedt
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Steven Rostedt @ 2025-05-13 22:34 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andrii Nakryiko

From: Josh Poimboeuf <jpoimboe@kernel.org>

Use ARCH_INIT_USER_COMPAT_FP_FRAME to describe how frame pointers are
unwound on x86, and implement the hooks needed to add the segment base
addresses.  Enable HAVE_UNWIND_USER_COMPAT_FP if the system has compat
mode compiled in.
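
The hooks are needed because in IA32 compat mode the register values are
segment-relative, so the unwinder has to materialize linear addresses
before dereferencing them. A sketch of the translation implemented below:

	/* At unwind start (arch_unwind_user_init): */
	state->ip += cs_base;		/* code is %cs-relative  */
	state->sp += ss_base;		/* stack is %ss-relative */
	state->fp += ss_base;

	/* After each step (arch_unwind_user_next), as the words read off
	 * the user stack are again segment-relative: */
	state->ip += state->arch.cs_base;
	state->fp += state->arch.ss_base;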

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 arch/x86/Kconfig                         |  1 +
 arch/x86/include/asm/unwind_user.h       | 50 ++++++++++++++++++++++++
 arch/x86/include/asm/unwind_user_types.h | 17 ++++++++
 3 files changed, 68 insertions(+)
 create mode 100644 arch/x86/include/asm/unwind_user_types.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 456e085263e1..90f055956fce 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -301,6 +301,7 @@ config X86
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UACCESS_VALIDATION		if HAVE_OBJTOOL
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select HAVE_UNWIND_USER_COMPAT_FP	if IA32_EMULATION
 	select HAVE_UNWIND_USER_FP		if X86_64
 	select HAVE_USER_RETURN_NOTIFIER
 	select HAVE_GENERIC_VDSO
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
index 8597857bf896..bb1148111259 100644
--- a/arch/x86/include/asm/unwind_user.h
+++ b/arch/x86/include/asm/unwind_user.h
@@ -2,10 +2,60 @@
 #ifndef _ASM_X86_UNWIND_USER_H
 #define _ASM_X86_UNWIND_USER_H
 
+#include <linux/unwind_user_types.h>
+#include <asm/ptrace.h>
+#include <asm/perf_event.h>
+
 #define ARCH_INIT_USER_FP_FRAME							\
 	.cfa_off	= (s32)sizeof(long) *  2,				\
 	.ra_off		= (s32)sizeof(long) * -1,				\
 	.fp_off		= (s32)sizeof(long) * -2,				\
 	.use_fp		= true,
 
+#ifdef CONFIG_IA32_EMULATION
+
+#define ARCH_INIT_USER_COMPAT_FP_FRAME						\
+	.cfa_off	= (s32)sizeof(u32)  *  2,				\
+	.ra_off		= (s32)sizeof(u32)  * -1,				\
+	.fp_off		= (s32)sizeof(u32)  * -2,				\
+	.use_fp		= true,
+
+#define in_compat_mode(regs) !user_64bit_mode(regs)
+
+static inline void arch_unwind_user_init(struct unwind_user_state *state,
+					 struct pt_regs *regs)
+{
+	unsigned long cs_base, ss_base;
+
+	if (state->type != UNWIND_USER_TYPE_COMPAT_FP)
+		return;
+
+	scoped_guard(irqsave) {
+		cs_base = segment_base_address(regs->cs);
+		ss_base = segment_base_address(regs->ss);
+	}
+
+	state->arch.cs_base = cs_base;
+	state->arch.ss_base = ss_base;
+
+	state->ip += cs_base;
+	state->sp += ss_base;
+	state->fp += ss_base;
+}
+#define arch_unwind_user_init arch_unwind_user_init
+
+static inline void arch_unwind_user_next(struct unwind_user_state *state)
+{
+	if (state->type != UNWIND_USER_TYPE_COMPAT_FP)
+		return;
+
+	state->ip += state->arch.cs_base;
+	state->fp += state->arch.ss_base;
+}
+#define arch_unwind_user_next arch_unwind_user_next
+
+#endif /* CONFIG_IA32_EMULATION */
+
+#include <asm-generic/unwind_user.h>
+
 #endif /* _ASM_X86_UNWIND_USER_H */
diff --git a/arch/x86/include/asm/unwind_user_types.h b/arch/x86/include/asm/unwind_user_types.h
new file mode 100644
index 000000000000..d7074dc5f0ce
--- /dev/null
+++ b/arch/x86/include/asm/unwind_user_types.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_UNWIND_USER_TYPES_H
+#define _ASM_UNWIND_USER_TYPES_H
+
+#ifdef CONFIG_IA32_EMULATION
+
+struct arch_unwind_user_state {
+	unsigned long ss_base;
+	unsigned long cs_base;
+};
+#define arch_unwind_user_state arch_unwind_user_state
+
+#endif /* CONFIG_IA32_EMULATION */
+
+#include <asm-generic/unwind_user_types.h>
+
+#endif /* _ASM_UNWIND_USER_TYPES_H */
-- 
2.47.2




* [PATCH v9 07/13] unwind_user/deferred: Add unwind_deferred_trace()
  2025-05-13 22:34 [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (5 preceding siblings ...)
  2025-05-13 22:34 ` [PATCH v9 06/13] unwind_user/x86: Enable compat mode frame pointer unwinding on x86 Steven Rostedt
@ 2025-05-13 22:34 ` Steven Rostedt
  2025-05-13 22:34 ` [PATCH v9 08/13] unwind_user/deferred: Add unwind cache Steven Rostedt
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Steven Rostedt @ 2025-05-13 22:34 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andrii Nakryiko

From: Steven Rostedt <rostedt@goodmis.org>

Add a function, unwind_deferred_trace(), that retrieves a user space stack
trace; it must be called from a faultable context. A tracer can call it
when a task is about to enter user space, or has just come back from user
space and still has interrupts enabled.

This code is based on Josh Poimboeuf's deferred unwinding work:

Link: https://lore.kernel.org/all/6052e8487746603bdb29b65f4033e739092d9925.1737511963.git.jpoimboe@kernel.org/
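
A usage sketch (a hypothetical tracer hook running in a faultable context,
e.g. just before the task returns to user space; record_user_frame() is
made up for illustration):

	struct unwind_stacktrace trace;

	if (!unwind_deferred_trace(&trace)) {
		for (unsigned int i = 0; i < trace.nr; i++)
			record_user_frame(trace.entries[i]);
	}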

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/sched.h                 |  5 +++
 include/linux/unwind_deferred.h       | 24 +++++++++++
 include/linux/unwind_deferred_types.h |  9 ++++
 kernel/fork.c                         |  4 ++
 kernel/unwind/Makefile                |  2 +-
 kernel/unwind/deferred.c              | 60 +++++++++++++++++++++++++++
 6 files changed, 103 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/unwind_deferred.h
 create mode 100644 include/linux/unwind_deferred_types.h
 create mode 100644 kernel/unwind/deferred.c

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4ecc0c6b1cb0..a1e1c07cadfb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -47,6 +47,7 @@
 #include <linux/livepatch_sched.h>
 #include <linux/uidgid_types.h>
 #include <linux/tracepoint-defs.h>
+#include <linux/unwind_deferred_types.h>
 #include <asm/kmap_size.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
@@ -1646,6 +1647,10 @@ struct task_struct {
 	struct user_event_mm		*user_event_mm;
 #endif
 
+#ifdef CONFIG_UNWIND_USER
+	struct unwind_task_info		unwind_info;
+#endif
+
 	/* CPU-specific state of this task: */
 	struct thread_struct		thread;
 
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
new file mode 100644
index 000000000000..5064ebe38c4f
--- /dev/null
+++ b/include/linux/unwind_deferred.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_DEFERRED_H
+#define _LINUX_UNWIND_USER_DEFERRED_H
+
+#include <linux/unwind_user.h>
+#include <linux/unwind_deferred_types.h>
+
+#ifdef CONFIG_UNWIND_USER
+
+void unwind_task_init(struct task_struct *task);
+void unwind_task_free(struct task_struct *task);
+
+int unwind_deferred_trace(struct unwind_stacktrace *trace);
+
+#else /* !CONFIG_UNWIND_USER */
+
+static inline void unwind_task_init(struct task_struct *task) {}
+static inline void unwind_task_free(struct task_struct *task) {}
+
+static inline int unwind_deferred_trace(struct unwind_stacktrace *trace) { return -ENOSYS; }
+
+#endif /* !CONFIG_UNWIND_USER */
+
+#endif /* _LINUX_UNWIND_USER_DEFERRED_H */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
new file mode 100644
index 000000000000..aa32db574e43
--- /dev/null
+++ b/include/linux/unwind_deferred_types.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+
+struct unwind_task_info {
+	unsigned long		*entries;
+};
+
+#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 1f5d8083eeb2..b767727c6c06 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -105,6 +105,7 @@
 #include <uapi/linux/pidfd.h>
 #include <linux/pidfs.h>
 #include <linux/tick.h>
+#include <linux/unwind_deferred.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -991,6 +992,7 @@ void __put_task_struct(struct task_struct *tsk)
 	WARN_ON(refcount_read(&tsk->usage));
 	WARN_ON(tsk == current);
 
+	unwind_task_free(tsk);
 	sched_ext_free(tsk);
 	io_uring_free(tsk);
 	cgroup_free(tsk);
@@ -2404,6 +2406,8 @@ __latent_entropy struct task_struct *copy_process(
 	p->bpf_ctx = NULL;
 #endif
 
+	unwind_task_init(p);
+
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
 	if (retval)
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
index 349ce3677526..6752ac96d7e2 100644
--- a/kernel/unwind/Makefile
+++ b/kernel/unwind/Makefile
@@ -1 +1 @@
- obj-$(CONFIG_UNWIND_USER) += user.o
+ obj-$(CONFIG_UNWIND_USER)		+= user.o deferred.o
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
new file mode 100644
index 000000000000..0bafb95e6336
--- /dev/null
+++ b/kernel/unwind/deferred.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Deferred user space unwinding
+ */
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/unwind_deferred.h>
+
+#define UNWIND_MAX_ENTRIES 512
+
+/**
+ * unwind_deferred_trace - Produce a user stacktrace in faultable context
+ * @trace: The descriptor that will store the user stacktrace
+ *
+ * This must be called in a known faultable context (usually when entering
+ * or exiting user space). Depending on the available implementations,
+ * @trace will be loaded with the addresses of the user space stacktrace,
+ * if it can be found.
+ *
+ * Return: 0 on success and negative on error
+ *         On success @trace will contain the user space stacktrace
+ */
+int unwind_deferred_trace(struct unwind_stacktrace *trace)
+{
+	struct unwind_task_info *info = &current->unwind_info;
+
+	/* Should always be called from faultable context */
+	might_fault();
+
+	if (current->flags & PF_EXITING)
+		return -EINVAL;
+
+	if (!info->entries) {
+		info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
+					      GFP_KERNEL);
+		if (!info->entries)
+			return -ENOMEM;
+	}
+
+	trace->nr = 0;
+	trace->entries = info->entries;
+	unwind_user(trace, UNWIND_MAX_ENTRIES);
+
+	return 0;
+}
+
+void unwind_task_init(struct task_struct *task)
+{
+	struct unwind_task_info *info = &task->unwind_info;
+
+	memset(info, 0, sizeof(*info));
+}
+
+void unwind_task_free(struct task_struct *task)
+{
+	struct unwind_task_info *info = &task->unwind_info;
+
+	kfree(info->entries);
+}
-- 
2.47.2




* [PATCH v9 08/13] unwind_user/deferred: Add unwind cache
  2025-05-13 22:34 [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (6 preceding siblings ...)
  2025-05-13 22:34 ` [PATCH v9 07/13] unwind_user/deferred: Add unwind_deferred_trace() Steven Rostedt
@ 2025-05-13 22:34 ` Steven Rostedt
  2025-05-13 22:34 ` [PATCH v9 09/13] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Steven Rostedt @ 2025-05-13 22:34 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andrii Nakryiko

From: Josh Poimboeuf <jpoimboe@kernel.org>

Cache the results of the unwind to ensure the unwind is only performed
once, even when called by multiple tracers.

The cache's nr_entries gets cleared every time the task exits the kernel.
When a stacktrace is requested, nr_entries gets set to the number of
entries in the stacktrace. If another stacktrace is requested while
nr_entries is not zero, the cache already holds the stacktrace that would
be retrieved, so the unwind is skipped and the cached entries are handed
to the caller.
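
The effect, sketched (two tracers requesting a trace within the same
kernel entry; only the first call walks the user stack):

	struct unwind_stacktrace t1, t2;

	unwind_deferred_trace(&t1);	/* unwinds and fills the cache */
	unwind_deferred_trace(&t2);	/* reuses the cache: t2.nr == t1.nr,
					 * same entries array */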

Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/entry-common.h          |  2 ++
 include/linux/unwind_deferred.h       |  8 ++++++++
 include/linux/unwind_deferred_types.h |  7 ++++++-
 kernel/unwind/deferred.c              | 26 ++++++++++++++++++++------
 4 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index f94f3fdf15fc..6e850c9d3f0c 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -12,6 +12,7 @@
 #include <linux/resume_user_mode.h>
 #include <linux/tick.h>
 #include <linux/kmsan.h>
+#include <linux/unwind_deferred.h>
 
 #include <asm/entry-common.h>
 #include <asm/syscall.h>
@@ -362,6 +363,7 @@ static __always_inline void exit_to_user_mode(void)
 	lockdep_hardirqs_on_prepare();
 	instrumentation_end();
 
+	unwind_exit_to_user_mode();
 	user_enter_irqoff();
 	arch_exit_to_user_mode();
 	lockdep_hardirqs_on(CALLER_ADDR0);
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 5064ebe38c4f..7d6cb2ffd084 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -12,6 +12,12 @@ void unwind_task_free(struct task_struct *task);
 
 int unwind_deferred_trace(struct unwind_stacktrace *trace);
 
+static __always_inline void unwind_exit_to_user_mode(void)
+{
+	if (unlikely(current->unwind_info.cache))
+		current->unwind_info.cache->nr_entries = 0;
+}
+
 #else /* !CONFIG_UNWIND_USER */
 
 static inline void unwind_task_init(struct task_struct *task) {}
@@ -19,6 +25,8 @@ static inline void unwind_task_free(struct task_struct *task) {}
 
 static inline int unwind_deferred_trace(struct unwind_stacktrace *trace) { return -ENOSYS; }
 
+static inline void unwind_exit_to_user_mode(void) {}
+
 #endif /* !CONFIG_UNWIND_USER */
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_H */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index aa32db574e43..db5b54b18828 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -2,8 +2,13 @@
 #ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
 #define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
 
+struct unwind_cache {
+	unsigned int		nr_entries;
+	unsigned long		entries[];
+};
+
 struct unwind_task_info {
-	unsigned long		*entries;
+	struct unwind_cache	*cache;
 };
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 0bafb95e6336..e3913781c8c6 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -24,6 +24,7 @@
 int unwind_deferred_trace(struct unwind_stacktrace *trace)
 {
 	struct unwind_task_info *info = &current->unwind_info;
+	struct unwind_cache *cache;
 
 	/* Should always be called from faultable context */
 	might_fault();
@@ -31,17 +32,30 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
 	if (current->flags & PF_EXITING)
 		return -EINVAL;
 
-	if (!info->entries) {
-		info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
-					      GFP_KERNEL);
-		if (!info->entries)
+	if (!info->cache) {
+		info->cache = kzalloc(struct_size(cache, entries, UNWIND_MAX_ENTRIES),
+				      GFP_KERNEL);
+		if (!info->cache)
 			return -ENOMEM;
 	}
 
+	cache = info->cache;
+	trace->entries = cache->entries;
+
+	if (cache->nr_entries) {
+		/*
+		 * The user stack has already been previously unwound in this
+		 * entry context.  Skip the unwind and use the cache.
+		 */
+		trace->nr = cache->nr_entries;
+		return 0;
+	}
+
 	trace->nr = 0;
-	trace->entries = info->entries;
 	unwind_user(trace, UNWIND_MAX_ENTRIES);
 
+	cache->nr_entries = trace->nr;
+
 	return 0;
 }
 
@@ -56,5 +70,5 @@ void unwind_task_free(struct task_struct *task)
 {
 	struct unwind_task_info *info = &task->unwind_info;
 
-	kfree(info->entries);
+	kfree(info->cache);
 }
-- 
2.47.2




* [PATCH v9 09/13] unwind_user/deferred: Add deferred unwinding interface
  2025-05-13 22:34 [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (7 preceding siblings ...)
  2025-05-13 22:34 ` [PATCH v9 08/13] unwind_user/deferred: Add unwind cache Steven Rostedt
@ 2025-05-13 22:34 ` Steven Rostedt
  2025-05-13 22:34 ` [PATCH v9 10/13] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Steven Rostedt @ 2025-05-13 22:34 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andrii Nakryiko

From: Josh Poimboeuf <jpoimboe@kernel.org>

Add an interface for scheduling task work to unwind the user space stack
before returning to user space. This solves several problems for its
callers:

  - Ensure the unwind happens in task context even if the caller may be
    running in NMI or interrupt context.

  - Avoid duplicate unwinds, whether called multiple times by the same
    caller or by different callers.

  - Take a timestamp on the first request made after the task entered
    the kernel. This will be returned to the calling function
    along with the stack trace when the task leaves the kernel. This
    timestamp can be used to correlate kernel unwinds/traces with the user
    unwind.

The timestamp is created to detect when the stacktrace is the same. It is
generated the first time a user space stacktrace is requested after the
task enters the kernel.

The timestamp is passed to the caller on request, and when the stacktrace is
generated upon returning to user space, the requester's callback is called
with the timestamp as well as the stacktrace.
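
A fuller usage sketch (the tracer-side names are hypothetical; the calls
are the interfaces added below):

	static void my_unwind_cb(struct unwind_work *work,
				 struct unwind_stacktrace *trace, u64 timestamp)
	{
		/* Runs from task work just before the task returns to user
		 * space; stitch @trace to events recorded with @timestamp. */
	}

	static struct unwind_work my_work;

	/* At registration: */
	unwind_deferred_init(&my_work, my_unwind_cb);

	/* In an event handler (IRQ context is fine; NMI support comes in
	 * the next patch): */
	u64 ts;
	int ret = unwind_deferred_request(&my_work, &ts);
	/* ret == 0: callback queued; ret == 1: already queued for this
	 * entry context; ret < 0: error. @ts is the same for every
	 * request within one entry context. */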

Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/unwind_deferred.h       |  18 ++++
 include/linux/unwind_deferred_types.h |   3 +
 kernel/unwind/deferred.c              | 131 +++++++++++++++++++++++++-
 3 files changed, 151 insertions(+), 1 deletion(-)

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 7d6cb2ffd084..a384eef719a3 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -2,9 +2,19 @@
 #ifndef _LINUX_UNWIND_USER_DEFERRED_H
 #define _LINUX_UNWIND_USER_DEFERRED_H
 
+#include <linux/task_work.h>
 #include <linux/unwind_user.h>
 #include <linux/unwind_deferred_types.h>
 
+struct unwind_work;
+
+typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stacktrace *trace, u64 timestamp);
+
+struct unwind_work {
+	struct list_head		list;
+	unwind_callback_t		func;
+};
+
 #ifdef CONFIG_UNWIND_USER
 
 void unwind_task_init(struct task_struct *task);
@@ -12,10 +22,15 @@ void unwind_task_free(struct task_struct *task);
 
 int unwind_deferred_trace(struct unwind_stacktrace *trace);
 
+int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
+int unwind_deferred_request(struct unwind_work *work, u64 *timestamp);
+void unwind_deferred_cancel(struct unwind_work *work);
+
 static __always_inline void unwind_exit_to_user_mode(void)
 {
 	if (unlikely(current->unwind_info.cache))
 		current->unwind_info.cache->nr_entries = 0;
+	current->unwind_info.timestamp = 0;
 }
 
 #else /* !CONFIG_UNWIND_USER */
@@ -24,6 +39,9 @@ static inline void unwind_task_init(struct task_struct *task) {}
 static inline void unwind_task_free(struct task_struct *task) {}
 
 static inline int unwind_deferred_trace(struct unwind_stacktrace *trace) { return -ENOSYS; }
+static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func) { return -ENOSYS; }
+static inline int unwind_deferred_request(struct unwind_work *work, u64 *timestamp) { return -ENOSYS; }
+static inline void unwind_deferred_cancel(struct unwind_work *work) {}
 
 static inline void unwind_exit_to_user_mode(void) {}
 
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index db5b54b18828..5df264cf81ad 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -9,6 +9,9 @@ struct unwind_cache {
 
 struct unwind_task_info {
 	struct unwind_cache	*cache;
+	struct callback_head	work;
+	u64			timestamp;
+	int			pending;
 };
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index e3913781c8c6..b76c704ddc6d 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -2,13 +2,35 @@
 /*
  * Deferred user space unwinding
  */
+#include <linux/sched/task_stack.h>
+#include <linux/unwind_deferred.h>
+#include <linux/sched/clock.h>
+#include <linux/task_work.h>
 #include <linux/kernel.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
-#include <linux/unwind_deferred.h>
+#include <linux/mm.h>
 
 #define UNWIND_MAX_ENTRIES 512
 
+/* Guards adding to and reading the list of callbacks */
+static DEFINE_MUTEX(callback_mutex);
+static LIST_HEAD(callbacks);
+
+/*
+ * Read the task context timestamp, if this is the first caller then
+ * it will set the timestamp.
+ */
+static u64 get_timestamp(struct unwind_task_info *info)
+{
+	lockdep_assert_irqs_disabled();
+
+	if (!info->timestamp)
+		info->timestamp = local_clock();
+
+	return info->timestamp;
+}
+
 /**
  * unwind_deferred_trace - Produce a user stacktrace in faultable context
  * @trace: The descriptor that will store the user stacktrace
@@ -59,11 +81,117 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
 	return 0;
 }
 
+static void unwind_deferred_task_work(struct callback_head *head)
+{
+	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
+	struct unwind_stacktrace trace;
+	struct unwind_work *work;
+	u64 timestamp;
+
+	if (WARN_ON_ONCE(!info->pending))
+		return;
+
+	/* Allow work to come in again */
+	WRITE_ONCE(info->pending, 0);
+
+	/*
+	 * From here on out, the callback must always be called, even if it's
+	 * just an empty trace.
+	 */
+	trace.nr = 0;
+	trace.entries = NULL;
+
+	unwind_deferred_trace(&trace);
+
+	timestamp = info->timestamp;
+
+	guard(mutex)(&callback_mutex);
+	list_for_each_entry(work, &callbacks, list) {
+		work->func(work, &trace, timestamp);
+	}
+}
+
+/**
+ * unwind_deferred_request - Request a user stacktrace on task exit
+ * @work: Unwind descriptor requesting the trace
+ * @timestamp: The time stamp of the first request made for this task
+ *
+ * Schedule a user space unwind to be done in task work before exiting the
+ * kernel.
+ *
+ * The returned @timestamp output is the timestamp of the very first request
+ * for a user space stacktrace for this task since it entered the kernel.
+ * It can be from a request by any caller of this infrastructure.
+ * Its value will also be passed to the callback function.  It can be
+ * used to stitch kernel and user stack traces together in post-processing.
+ *
+ * It's valid to call this function multiple times for the same @work within
+ * the same task entry context.  Each call will return the same timestamp
+ * while the task hasn't left the kernel. If the callback is not pending because
+ * it has already been previously called for the same entry context, it will be
+ * called again with the same stack trace and timestamp.
+ *
+ * Return: 1 if the callback was already queued.
+ *         0 if the callback was successfully queued.
+ *         Negative if there's an error.
+ *         @timestamp holds the timestamp of the first request by any user
+ */
+int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
+{
+	struct unwind_task_info *info = &current->unwind_info;
+	int ret;
+
+	*timestamp = 0;
+
+	if (WARN_ON_ONCE(in_nmi()))
+		return -EINVAL;
+
+	if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
+	    !user_mode(task_pt_regs(current)))
+		return -EINVAL;
+
+	guard(irqsave)();
+
+	*timestamp = get_timestamp(info);
+
+	/* callback already pending? */
+	if (info->pending)
+		return 1;
+
+	/* The work has been claimed, now schedule it. */
+	ret = task_work_add(current, &info->work, TWA_RESUME);
+	if (WARN_ON_ONCE(ret))
+		return ret;
+
+	info->pending = 1;
+	return 0;
+}
+
+void unwind_deferred_cancel(struct unwind_work *work)
+{
+	if (!work)
+		return;
+
+	guard(mutex)(&callback_mutex);
+	list_del(&work->list);
+}
+
+int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
+{
+	memset(work, 0, sizeof(*work));
+
+	guard(mutex)(&callback_mutex);
+	list_add(&work->list, &callbacks);
+	work->func = func;
+	return 0;
+}
+
 void unwind_task_init(struct task_struct *task)
 {
 	struct unwind_task_info *info = &task->unwind_info;
 
 	memset(info, 0, sizeof(*info));
+	init_task_work(&info->work, unwind_deferred_task_work);
 }
 
 void unwind_task_free(struct task_struct *task)
@@ -71,4 +199,5 @@ void unwind_task_free(struct task_struct *task)
 	struct unwind_task_info *info = &task->unwind_info;
 
 	kfree(info->cache);
+	task_work_cancel(task, &info->work);
 }
-- 
2.47.2




* [PATCH v9 10/13] unwind_user/deferred: Make unwind deferral requests NMI-safe
  2025-05-13 22:34 [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (8 preceding siblings ...)
  2025-05-13 22:34 ` [PATCH v9 09/13] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
@ 2025-05-13 22:34 ` Steven Rostedt
  2025-05-13 22:34 ` [PATCH v9 11/13] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Steven Rostedt @ 2025-05-13 22:34 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andrii Nakryiko

From: Josh Poimboeuf <jpoimboe@kernel.org>

Make unwind_deferred_request() NMI-safe so tracers in NMI context can
call it and safely request a user space stacktrace when the task exits
the kernel.

An "nmi_timestamp" field is added to the unwind_task_info that gets
updated by NMIs so as not to race with setting the info->timestamp.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/unwind_deferred_types.h |  1 +
 kernel/unwind/deferred.c              | 91 ++++++++++++++++++++++++---
 2 files changed, 84 insertions(+), 8 deletions(-)

diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 5df264cf81ad..ae27a02234b8 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -11,6 +11,7 @@ struct unwind_task_info {
 	struct unwind_cache	*cache;
 	struct callback_head	work;
 	u64			timestamp;
+	u64			nmi_timestamp;
 	int			pending;
 };
 
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index b76c704ddc6d..238cd97079ec 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -25,8 +25,27 @@ static u64 get_timestamp(struct unwind_task_info *info)
 {
 	lockdep_assert_irqs_disabled();
 
-	if (!info->timestamp)
-		info->timestamp = local_clock();
+	/*
+	 * Note, the timestamp is generated on the first request.
+	 * If one already exists here, then it is earlier than
+	 * this request, which means that this request will be
+	 * valid for the stacktrace.
+	 */
+	if (!info->timestamp) {
+		WRITE_ONCE(info->timestamp, local_clock());
+		barrier();
+		/*
+		 * If an NMI came in and set a timestamp, it means that
+		 * it happened before this timestamp was set (otherwise
+		 * the NMI would have used this one). Use the NMI timestamp
+		 * instead.
+		 */
+		if (unlikely(info->nmi_timestamp)) {
+			WRITE_ONCE(info->timestamp, info->nmi_timestamp);
+			barrier();
+			WRITE_ONCE(info->nmi_timestamp, 0);
+		}
+	}
 
 	return info->timestamp;
 }
@@ -103,6 +122,13 @@ static void unwind_deferred_task_work(struct callback_head *head)
 
 	unwind_deferred_trace(&trace);
 
+	/* Check if the timestamp was only set by NMI */
+	if (info->nmi_timestamp) {
+		WRITE_ONCE(info->timestamp, info->nmi_timestamp);
+		barrier();
+		WRITE_ONCE(info->nmi_timestamp, 0);
+	}
+
 	timestamp = info->timestamp;
 
 	guard(mutex)(&callback_mutex);
@@ -111,6 +137,48 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	}
 }
 
+static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
+{
+	struct unwind_task_info *info = &current->unwind_info;
+	bool inited_timestamp = false;
+	int ret;
+
+	/* Always use the nmi_timestamp first */
+	*timestamp = info->nmi_timestamp ? : info->timestamp;
+
+	if (!*timestamp) {
+		/*
+		 * This is the first unwind request since the most recent entry
+		 * from user space. Initialize the task timestamp.
+		 *
+		 * Don't write to info->timestamp directly, otherwise it may race
+		 * with an interruption of get_timestamp().
+		 */
+		info->nmi_timestamp = local_clock();
+		*timestamp = info->nmi_timestamp;
+		inited_timestamp = true;
+	}
+
+	if (info->pending)
+		return 1;
+
+	ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
+	if (ret) {
+		/*
+		 * If this request initialized nmi_timestamp but failed
+		 * to queue the work, there's no guarantee the timestamp
+		 * will ever be used. Set it back to zero.
+		 */
+		if (inited_timestamp)
+			info->nmi_timestamp = 0;
+		return ret;
+	}
+
+	info->pending = 1;
+
+	return 0;
+}
+
 /**
  * unwind_deferred_request - Request a user stacktrace on exit back to user space
  * @work: Unwind descriptor requesting the trace
@@ -139,31 +207,38 @@ static void unwind_deferred_task_work(struct callback_head *head)
 int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 {
 	struct unwind_task_info *info = &current->unwind_info;
+	int pending;
 	int ret;
 
 	*timestamp = 0;
 
-	if (WARN_ON_ONCE(in_nmi()))
-		return -EINVAL;
-
 	if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
 	    !user_mode(task_pt_regs(current)))
 		return -EINVAL;
 
+	if (in_nmi())
+		return unwind_deferred_request_nmi(work, timestamp);
+
 	guard(irqsave)();
 
 	*timestamp = get_timestamp(info);
 
 	/* callback already pending? */
-	if (info->pending)
+	pending = READ_ONCE(info->pending);
+	if (pending)
+		return 1;
+
+	/* Claim the work unless an NMI just now swooped in to do so. */
+	if (!try_cmpxchg(&info->pending, &pending, 1))
 		return 1;
 
 	/* The work has been claimed, now schedule it. */
 	ret = task_work_add(current, &info->work, TWA_RESUME);
-	if (WARN_ON_ONCE(ret))
+	if (WARN_ON_ONCE(ret)) {
+		WRITE_ONCE(info->pending, 0);
 		return ret;
+	}
 
-	info->pending = 1;
 	return 0;
 }
 
-- 
2.47.2
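
[ A hedged sketch of what this enables; the handler name and
  emit_kernel_sample() below are invented, while unwind_deferred_request()
  now detects in_nmi() itself: ]

/* Hypothetical PMU-style overflow handler that runs in NMI context. */
static void my_nmi_handler(void)
{
	u64 timestamp;
	int ret;

	/*
	 * Safe in NMI after this patch: the request path branches to
	 * unwind_deferred_request_nmi() and queues the task_work with
	 * TWA_NMI_CURRENT.
	 */
	ret = unwind_deferred_request(&my_unwind_work, &timestamp);
	if (ret < 0)
		return;		/* could not defer; skip the user stack */

	/* Tag the kernel-side sample so it can be stitched up later */
	emit_kernel_sample(timestamp);
}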



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v9 11/13] unwind deferred: Use bitmask to determine which callbacks to call
  2025-05-13 22:34 [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (9 preceding siblings ...)
  2025-05-13 22:34 ` [PATCH v9 10/13] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
@ 2025-05-13 22:34 ` Steven Rostedt
  2025-05-13 22:34 ` [PATCH v9 12/13] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Steven Rostedt @ 2025-05-13 22:34 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andrii Nakryiko

From: Steven Rostedt <rostedt@goodmis.org>

In order to know which registered callback requested a stacktrace when
the task goes back to user space, add a bitmask for all registered
tracers. The bitmask is the size of a long, which means that on a 32 bit
machine, it can have at most 32 registered tracers, and on 64 bit, it can
have at most 64 registered tracers. This should not be an issue as there
should not be more than 10 (unless BPF can abuse this?).

When a tracer registers with unwind_deferred_init() it will get a bit
number assigned to it. When a tracer requests a stacktrace, it will have
its bit set within the task_struct. When the task returns back to user
space, it will call the callbacks for all the registered tracers where
their bits are set in the task's mask.

When a tracer is removed by unwind_deferred_cancel(), the associated bit is
cleared in all current tasks, just in case another tracer gets registered
immediately afterward and then gets its callback called unexpectedly.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/sched.h           |  1 +
 include/linux/unwind_deferred.h |  1 +
 kernel/unwind/deferred.c        | 46 ++++++++++++++++++++++++++++-----
 3 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a1e1c07cadfb..d3ee0c5405d6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1649,6 +1649,7 @@ struct task_struct {
 
 #ifdef CONFIG_UNWIND_USER
 	struct unwind_task_info		unwind_info;
+	unsigned long			unwind_mask;
 #endif
 
 	/* CPU-specific state of this task: */
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index a384eef719a3..1789c3624723 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -13,6 +13,7 @@ typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stackt
 struct unwind_work {
 	struct list_head		list;
 	unwind_callback_t		func;
+	int				bit;
 };
 
 #ifdef CONFIG_UNWIND_USER
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 238cd97079ec..7ae0bec5b36a 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -16,6 +16,7 @@
 /* Guards adding to and reading the list of callbacks */
 static DEFINE_MUTEX(callback_mutex);
 static LIST_HEAD(callbacks);
+static unsigned long unwind_mask;
 
 /*
  * Read the task context timestamp, if this is the first caller then
@@ -106,6 +107,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	struct unwind_stacktrace trace;
 	struct unwind_work *work;
 	u64 timestamp;
+	struct task_struct *task = current;
 
 	if (WARN_ON_ONCE(!info->pending))
 		return;
@@ -133,7 +135,10 @@ static void unwind_deferred_task_work(struct callback_head *head)
 
 	guard(mutex)(&callback_mutex);
 	list_for_each_entry(work, &callbacks, list) {
-		work->func(work, &trace, timestamp);
+		if (task->unwind_mask & (1UL << work->bit)) {
+			work->func(work, &trace, timestamp);
+			clear_bit(work->bit, &current->unwind_mask);
+		}
 	}
 }
 
@@ -159,9 +164,12 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
 		inited_timestamp = true;
 	}
 
-	if (info->pending)
+	if (current->unwind_mask & (1UL << work->bit))
 		return 1;
 
+	if (info->pending)
+		goto out;
+
 	ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
 	if (ret) {
 		/*
@@ -175,8 +183,8 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
 	}
 
 	info->pending = 1;
-
-	return 0;
+out:
+	return test_and_set_bit(work->bit, &current->unwind_mask);
 }
 
 /**
@@ -223,14 +231,18 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 
 	*timestamp = get_timestamp(info);
 
+	/* This is already queued */
+	if (current->unwind_mask & (1UL << work->bit))
+		return 1;
+
 	/* callback already pending? */
 	pending = READ_ONCE(info->pending);
 	if (pending)
-		return 1;
+		goto out;
 
 	/* Claim the work unless an NMI just now swooped in to do so. */
 	if (!try_cmpxchg(&info->pending, &pending, 1))
-		return 1;
+		goto out;
 
 	/* The work has been claimed, now schedule it. */
 	ret = task_work_add(current, &info->work, TWA_RESUME);
@@ -239,16 +251,27 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 		return ret;
 	}
 
-	return 0;
+ out:
+	return test_and_set_bit(work->bit, &current->unwind_mask);
 }
 
 void unwind_deferred_cancel(struct unwind_work *work)
 {
+	struct task_struct *g, *t;
+
 	if (!work)
 		return;
 
 	guard(mutex)(&callback_mutex);
 	list_del(&work->list);
+
+	clear_bit(work->bit, &unwind_mask);
+
+	guard(rcu)();
+	/* Clear this bit from all threads */
+	for_each_process_thread(g, t) {
+		clear_bit(work->bit, &t->unwind_mask);
+	}
 }
 
 int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
@@ -256,6 +279,14 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 	memset(work, 0, sizeof(*work));
 
 	guard(mutex)(&callback_mutex);
+
+	/* See if there's a bit in the mask available */
+	if (unwind_mask == ~0UL)
+		return -EBUSY;
+
+	work->bit = ffz(unwind_mask);
+	unwind_mask |= 1UL << work->bit;
+
 	list_add(&work->list, &callbacks);
 	work->func = func;
 	return 0;
@@ -267,6 +298,7 @@ void unwind_task_init(struct task_struct *task)
 
 	memset(info, 0, sizeof(*info));
 	init_task_work(&info->work, unwind_deferred_task_work);
+	task->unwind_mask = 0;
 }
 
 void unwind_task_free(struct task_struct *task)
-- 
2.47.2
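
[ The bit assignment in unwind_deferred_init() is a first-free-bit
  allocator. A standalone user space model of the idea, with the kernel's
  ffz() emulated via __builtin_ctzl(): ]

#include <stdio.h>

static unsigned long used_mask;

/* First zero bit, like the kernel's ffz(); caller ensures one exists */
static int first_zero_bit(unsigned long mask)
{
	return __builtin_ctzl(~mask);
}

static int alloc_bit(void)
{
	int bit;

	if (used_mask == ~0UL)
		return -1;	/* the kernel code returns -EBUSY here */

	bit = first_zero_bit(used_mask);
	used_mask |= 1UL << bit;
	return bit;
}

static void free_bit(int bit)
{
	used_mask &= ~(1UL << bit);
}

int main(void)
{
	int a = alloc_bit();	/* 0 */
	int b = alloc_bit();	/* 1 */

	free_bit(a);
	/* The freed bit is handed out again, as in unwind_deferred_cancel() */
	printf("%d %d %d\n", a, b, alloc_bit());	/* 0 1 0 */
	return 0;
}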



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v9 12/13] unwind deferred: Use SRCU unwind_deferred_task_work()
  2025-05-13 22:34 [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (10 preceding siblings ...)
  2025-05-13 22:34 ` [PATCH v9 11/13] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
@ 2025-05-13 22:34 ` Steven Rostedt
  2025-05-13 22:34 ` [PATCH v9 13/13] unwind: Clear unwind_mask on exit back to user space Steven Rostedt
  2025-05-14 17:27 ` [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
  13 siblings, 0 replies; 23+ messages in thread
From: Steven Rostedt @ 2025-05-13 22:34 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andrii Nakryiko

From: Steven Rostedt <rostedt@goodmis.org>

Instead of using the callback_mutex to protect the linked list of callbacks
in unwind_deferred_task_work(), use SRCU. This function is called every
time a task that had a stack trace requested exits back to user space.
This can happen for many tasks on several CPUs at the same time. A mutex
is a bottleneck there and can cause contention that slows down performance.

As the callbacks themselves are allowed to sleep, regular RCU cannot be
used to protect the list. Instead use SRCU, as that still allows the
callbacks to sleep and the list can be read without needing to hold the
callback_mutex.

Link: https://lore.kernel.org/all/ca9bd83a-6c80-4ee0-a83c-224b9d60b755@efficios.com/

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Discussion about using guard for SRCU:
   https://lore.kernel.org/20250509165155.628873521@goodmis.org

   Andrii Nakryiko brought up using guard for SRCU around the
   list iteration, but it was decided that just using the normal
   methods were fine for this use case.

 kernel/unwind/deferred.c | 33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 7ae0bec5b36a..5d6976ee648f 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -13,10 +13,11 @@
 
 #define UNWIND_MAX_ENTRIES 512
 
-/* Guards adding to and reading the list of callbacks */
+/* Guards adding to or removing from the list of callbacks */
 static DEFINE_MUTEX(callback_mutex);
 static LIST_HEAD(callbacks);
 static unsigned long unwind_mask;
+DEFINE_STATIC_SRCU(unwind_srcu);
 
 /*
  * Read the task context timestamp, if this is the first caller then
@@ -108,6 +109,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	struct unwind_work *work;
 	u64 timestamp;
 	struct task_struct *task = current;
+	int idx;
 
 	if (WARN_ON_ONCE(!info->pending))
 		return;
@@ -133,13 +135,15 @@ static void unwind_deferred_task_work(struct callback_head *head)
 
 	timestamp = info->timestamp;
 
-	guard(mutex)(&callback_mutex);
-	list_for_each_entry(work, &callbacks, list) {
+	idx = srcu_read_lock(&unwind_srcu);
+	list_for_each_entry_srcu(work, &callbacks, list,
+				 srcu_read_lock_held(&unwind_srcu)) {
 		if (task->unwind_mask & (1UL << work->bit)) {
 			work->func(work, &trace, timestamp);
 			clear_bit(work->bit, &current->unwind_mask);
 		}
 	}
+	srcu_read_unlock(&unwind_srcu, idx);
 }
 
 static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
@@ -216,6 +220,7 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 {
 	struct unwind_task_info *info = &current->unwind_info;
 	int pending;
+	int bit;
 	int ret;
 
 	*timestamp = 0;
@@ -227,12 +232,17 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 	if (in_nmi())
 		return unwind_deferred_request_nmi(work, timestamp);
 
+	/* Do not allow cancelled works to request again */
+	bit = READ_ONCE(work->bit);
+	if (WARN_ON_ONCE(bit < 0))
+		return -EINVAL;
+
 	guard(irqsave)();
 
 	*timestamp = get_timestamp(info);
 
 	/* This is already queued */
-	if (current->unwind_mask & (1UL << work->bit))
+	if (current->unwind_mask & (1UL << bit))
 		return 1;
 
 	/* callback already pending? */
@@ -258,19 +268,26 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 void unwind_deferred_cancel(struct unwind_work *work)
 {
 	struct task_struct *g, *t;
+	int bit;
 
 	if (!work)
 		return;
 
 	guard(mutex)(&callback_mutex);
-	list_del(&work->list);
+	list_del_rcu(&work->list);
+	bit = work->bit;
+
+	/* Do not allow any more requests and prevent callbacks */
+	work->bit = -1;
+
+	clear_bit(bit, &unwind_mask);
 
-	clear_bit(work->bit, &unwind_mask);
+	synchronize_srcu(&unwind_srcu);
 
 	guard(rcu)();
 	/* Clear this bit from all threads */
 	for_each_process_thread(g, t) {
-		clear_bit(work->bit, &t->unwind_mask);
+		clear_bit(bit, &t->unwind_mask);
 	}
 }
 
@@ -287,7 +304,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 	work->bit = ffz(unwind_mask);
 	unwind_mask |= 1UL << work->bit;
 
-	list_add(&work->list, &callbacks);
+	list_add_rcu(&work->list, &callbacks);
 	work->func = func;
 	return 0;
 }
-- 
2.47.2
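
[ Schematically, the reader/updater split this patch applies; a sketch
  using the same SRCU primitives, not the exact kernel code: ]

/* Reader side: may sleep while iterating, never blocks other readers. */
static void call_all(struct unwind_stacktrace *trace, u64 timestamp)
{
	struct unwind_work *work;
	int idx;

	idx = srcu_read_lock(&unwind_srcu);
	list_for_each_entry_srcu(work, &callbacks, list,
				 srcu_read_lock_held(&unwind_srcu))
		work->func(work, trace, timestamp);
	srcu_read_unlock(&unwind_srcu, idx);
}

/* Updater side: the mutex serializes updaters, SRCU protects readers. */
static void remove_one(struct unwind_work *work)
{
	mutex_lock(&callback_mutex);
	list_del_rcu(&work->list);
	mutex_unlock(&callback_mutex);

	/* Wait for all in-flight readers before the entry can be reused */
	synchronize_srcu(&unwind_srcu);
}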



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v9 13/13] unwind: Clear unwind_mask on exit back to user space
  2025-05-13 22:34 [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (11 preceding siblings ...)
  2025-05-13 22:34 ` [PATCH v9 12/13] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
@ 2025-05-13 22:34 ` Steven Rostedt
  2025-05-14 17:27 ` [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
  13 siblings, 0 replies; 23+ messages in thread
From: Steven Rostedt @ 2025-05-13 22:34 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andrii Nakryiko

From: Steven Rostedt <rostedt@goodmis.org>

When testing the deferred unwinder by attaching deferred user space
stacktraces to events, a live lock happened. This was when the deferred
unwinding was added to the irqs_disabled event, which happens after the
task_work callbacks are called and before the task goes back to user
space.

The event callback would be registered when irqs were disabled, the
task_work would trigger, call the callback for this work and clear the
work's bit. Then before getting back to user space, irqs would be disabled
again, the event triggered again, and a new task_work registered. This
caused an infinite loop and the system hung.

To prevent this, clear the bits at the very last moment before going back
to user space and when instrumentation is disabled. That is in
unwind_exit_to_user_mode().

Move the pending bit from a value on the task_struct to the most
significant bit of the unwind_mask (saves space on the task_struct). This
will allow modifying the pending bit along with the work bits atomically.

Instead of clearing a work's bit after its callback is called, it is
delayed until exit. If the work is requested again, the task_work is not
queued again and the requester is notified that its callback has already
been called (via the UNWIND_ALREADY_EXECUTED return value).

The pending bit is cleared before calling the callback functions but the
current work bits remain. If one of the called works requests another
unwind, it will not trigger a new task_work if its bit is still present
in the task's unwind_mask.

If a new work makes a request, then it will set both the pending bit and
its own bit but clear the other work bits so that their callbacks do not
get called again.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/unwind_deferred.h       | 23 ++++++-
 include/linux/unwind_deferred_types.h |  1 -
 kernel/unwind/deferred.c              | 96 +++++++++++++++++++--------
 3 files changed, 90 insertions(+), 30 deletions(-)

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 1789c3624723..b3c8703fcc22 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -18,6 +18,14 @@ struct unwind_work {
 
 #ifdef CONFIG_UNWIND_USER
 
+#define UNWIND_PENDING_BIT	(BITS_PER_LONG - 1)
+#define UNWIND_PENDING		(1UL << UNWIND_PENDING_BIT)
+
+enum {
+	UNWIND_ALREADY_PENDING	= 1,
+	UNWIND_ALREADY_EXECUTED	= 2,
+};
+
 void unwind_task_init(struct task_struct *task);
 void unwind_task_free(struct task_struct *task);
 
@@ -29,7 +37,20 @@ void unwind_deferred_cancel(struct unwind_work *work);
 
 static __always_inline void unwind_exit_to_user_mode(void)
 {
-	if (unlikely(current->unwind_info.cache))
+	unsigned long bits;
+
+	/* Was there any unwinding? */
+	if (likely(!current->unwind_mask))
+		return;
+
+	bits = current->unwind_mask;
+	do {
+		/* Will a task_work run again before going back to user space? */
+		if (bits & UNWIND_PENDING)
+			return;
+	} while (!try_cmpxchg(&current->unwind_mask, &bits, 0UL));
+
+	if (likely(current->unwind_info.cache))
 		current->unwind_info.cache->nr_entries = 0;
 	current->unwind_info.timestamp = 0;
 }
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index ae27a02234b8..28811a9d4262 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -12,7 +12,6 @@ struct unwind_task_info {
 	struct callback_head	work;
 	u64			timestamp;
 	u64			nmi_timestamp;
-	int			pending;
 };
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 5d6976ee648f..b0289a695b92 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -19,6 +19,11 @@ static LIST_HEAD(callbacks);
 static unsigned long unwind_mask;
 DEFINE_STATIC_SRCU(unwind_srcu);
 
+static inline bool unwind_pending(struct task_struct *task)
+{
+	return test_bit(UNWIND_PENDING_BIT, &task->unwind_mask);
+}
+
 /*
  * Read the task context timestamp, if this is the first caller then
  * it will set the timestamp.
@@ -107,15 +112,18 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
 	struct unwind_stacktrace trace;
 	struct unwind_work *work;
+	unsigned long bits;
 	u64 timestamp;
 	struct task_struct *task = current;
 	int idx;
 
-	if (WARN_ON_ONCE(!info->pending))
+	if (WARN_ON_ONCE(!unwind_pending(task)))
 		return;
 
-	/* Allow work to come in again */
-	WRITE_ONCE(info->pending, 0);
+	/* Clear the pending bit, but keep the current work bits */
+	bits = READ_ONCE(task->unwind_mask);
+	while (!try_cmpxchg(&task->unwind_mask, &bits, bits & ~UNWIND_PENDING))
+		;
 
 	/*
 	 * From here on out, the callback must always be called, even if it's
@@ -138,10 +146,8 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	idx = srcu_read_lock(&unwind_srcu);
 	list_for_each_entry_srcu(work, &callbacks, list,
 				 srcu_read_lock_held(&unwind_srcu)) {
-		if (task->unwind_mask & (1UL << work->bit)) {
+		if (bits & (1UL << work->bit))
 			work->func(work, &trace, timestamp);
-			clear_bit(work->bit, &current->unwind_mask);
-		}
 	}
 	srcu_read_unlock(&unwind_srcu, idx);
 }
@@ -168,10 +174,13 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
 		inited_timestamp = true;
 	}
 
-	if (current->unwind_mask & (1UL << work->bit))
-		return 1;
+	/* Is this already queued */
+	if (current->unwind_mask & (1UL << work->bit)) {
+		return unwind_pending(current) ? UNWIND_ALREADY_PENDING :
+			UNWIND_ALREADY_EXECUTED;
+	}
 
-	if (info->pending)
+	if (unwind_pending(current))
 		goto out;
 
 	ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
@@ -186,9 +195,17 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
 		return ret;
 	}
 
-	info->pending = 1;
+	/*
+	 * This is the first to set the PENDING_BIT; clear all other
+	 * bits, as any other bit set has already had its callback
+	 * called, and those callbacks should not be called again
+	 * because of this new callback. If they request another
+	 * callback, then they will get a new one.
+	 */
+	current->unwind_mask = UNWIND_PENDING;
 out:
-	return test_and_set_bit(work->bit, &current->unwind_mask);
+	return test_and_set_bit(work->bit, &current->unwind_mask) ?
+		UNWIND_ALREADY_PENDING : 0;
 }
 
 /**
@@ -211,15 +228,17 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
  * it has already been called for the same entry context, it will be
  * called again with the same stack trace and timestamp.
  *
- * Return: 1 if the callback was already queued.
- *         0 if the callback was successfully queued.
+ * Return: 0 if the callback was successfully queued.
+ *         UNWIND_ALREADY_PENDING if the callback was already queued.
+ *         UNWIND_ALREADY_EXECUTED if the callback was already called
+ *                (and will not be called again)
  *         Negative if there's an error.
  *         @timestamp holds the timestamp of the first request by any user
  */
 int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 {
 	struct unwind_task_info *info = &current->unwind_info;
-	int pending;
+	unsigned long old, bits;
 	int bit;
 	int ret;
 
@@ -241,28 +260,49 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 
 	*timestamp = get_timestamp(info);
 
-	/* This is already queued */
-	if (current->unwind_mask & (1UL << bit))
-		return 1;
+	old = READ_ONCE(current->unwind_mask);
+
+	/* Is this already queued */
+	if (old & (1UL << bit)) {
+		/*
+		 * If pending is not set, it means this work's callback
+		 * was already called.
+		 */
+		return old & UNWIND_PENDING ? UNWIND_ALREADY_PENDING :
+			UNWIND_ALREADY_EXECUTED;
+	}
 
-	/* callback already pending? */
-	pending = READ_ONCE(info->pending);
-	if (pending)
+	if (unwind_pending(current))
 		goto out;
 
-	/* Claim the work unless an NMI just now swooped in to do so. */
-	if (!try_cmpxchg(&info->pending, &pending, 1))
+	/*
+	 * This is the first to enable another task_work for this task since
+	 * the task entered the kernel, or had already called the callbacks.
+	 * Set only the bit for this work and clear all others, as they have
+	 * already had their callbacks called and do not need to be called
+	 * again because of this work.
+	 */
+	bits = UNWIND_PENDING | (1UL << bit);
+
+	/*
+	 * If the cmpxchg() fails, it means that an NMI came in and set
+	 * the pending bit as well as cleared the other bits. Just
+	 * jump to setting the bit for this work.
+	 */
+	if (!try_cmpxchg(&current->unwind_mask, &old, bits))
 		goto out;
 
 	/* The work has been claimed, now schedule it. */
 	ret = task_work_add(current, &info->work, TWA_RESUME);
-	if (WARN_ON_ONCE(ret)) {
-		WRITE_ONCE(info->pending, 0);
-		return ret;
-	}
+
+	if (WARN_ON_ONCE(ret))
+		WRITE_ONCE(current->unwind_mask, 0);
+
+	return ret;
 
  out:
-	return test_and_set_bit(work->bit, &current->unwind_mask);
+	return test_and_set_bit(work->bit, &current->unwind_mask) ?
+		UNWIND_ALREADY_PENDING : 0;
 }
 
 void unwind_deferred_cancel(struct unwind_work *work)
@@ -298,7 +338,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 	guard(mutex)(&callback_mutex);
 
 	/* See if there's a bit in the mask available */
-	if (unwind_mask == ~0UL)
+	if (unwind_mask == ~(UNWIND_PENDING))
 		return -EBUSY;
 
 	work->bit = ffz(unwind_mask);
-- 
2.47.2
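
[ The exit-path clearing is a standard compare-and-exchange retry loop; a
  user space C analogue (names invented, C11 atomics standing in for
  try_cmpxchg(), and a 64-bit long assumed): ]

#include <stdatomic.h>

#define PENDING		(1UL << 63)	/* mirrors UNWIND_PENDING */

static _Atomic unsigned long mask;

/* Clear all work bits, unless a callback is still pending. */
static void clear_on_exit(void)
{
	unsigned long bits = atomic_load(&mask);

	do {
		/* A task_work will still run; leave the bits for it */
		if (bits & PENDING)
			return;
		/* On failure, bits is reloaded with the current value */
	} while (!atomic_compare_exchange_weak(&mask, &bits, 0UL));
}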



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure
  2025-05-13 22:34 [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (12 preceding siblings ...)
  2025-05-13 22:34 ` [PATCH v9 13/13] unwind: Clear unwind_mask on exit back to user space Steven Rostedt
@ 2025-05-14 17:27 ` Steven Rostedt
  2025-05-16 23:39   ` Namhyung Kim
  13 siblings, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2025-05-14 17:27 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andrii Nakryiko

On Tue, 13 May 2025 18:34:35 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> This has modifications in x86 and I would like it to go through the x86
> tree. Preferably it can go into this merge window so we can focus on getting
> perf and ftrace to work on top of this.

I think it may be best for me to remove the two x86 specific patches, and
rebuild the ftrace work on top of it. For testing, I'll just keep those two
patches in my tree locally, but then I can get this moving for this merge
window.

Next merge window, we can spend more time on getting the perf API working
properly.

-- Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure
  2025-05-14 17:27 ` [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
@ 2025-05-16 23:39   ` Namhyung Kim
  2025-05-19 15:33     ` Steven Rostedt
  2025-05-20 23:26     ` Masami Hiramatsu
  0 siblings, 2 replies; 23+ messages in thread
From: Namhyung Kim @ 2025-05-16 23:39 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
	Jiri Olsa, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andrii Nakryiko

Hi Steve,

On Wed, May 14, 2025 at 01:27:20PM -0400, Steven Rostedt wrote:
> On Tue, 13 May 2025 18:34:35 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > This has modifications in x86 and I would like it to go through the x86
> > tree. Preferably it can go into this merge window so we can focus on getting
> > perf and ftrace to work on top of this.
> 
> I think it may be best for me to remove the two x86 specific patches, and
> rebuild the ftrace work on top of it. For testing, I'll just keep those two
> patches in my tree locally, but then I can get this moving for this merge
> window.

Maybe I asked this before but I don't remember if I got the answer. :)
How does it handle task exits as it won't go to userspace?  I guess it'll
lose user callstacks for exit syscalls and other termination paths.

Similarly, it will miss user callstacks in the samples at the end of
profiling if the target tasks remain in the kernel (or they sleep).
It looks like a fundamental limitation of the deferred callchains.

Thanks,
Namhyung

> 
> Next merge window, we can spend more time on getting the perf API working
> properly.
> 
> -- Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure
  2025-05-16 23:39   ` Namhyung Kim
@ 2025-05-19 15:33     ` Steven Rostedt
  2025-05-20  9:35       ` Ingo Molnar
  2025-05-20 23:26     ` Masami Hiramatsu
  1 sibling, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2025-05-19 15:33 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
	Jiri Olsa, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andrii Nakryiko

On Fri, 16 May 2025 16:39:56 -0700
Namhyung Kim <namhyung@kernel.org> wrote:

> Hi Steve,
> 
> On Wed, May 14, 2025 at 01:27:20PM -0400, Steven Rostedt wrote:
> > On Tue, 13 May 2025 18:34:35 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> >   
> > > This has modifications in x86 and I would like it to go through the x86
> > > tree. Preferably it can go into this merge window so we can focus on getting
> > > perf and ftrace to work on top of this.  
> > 
> > I think it may be best for me to remove the two x86 specific patches, and
> > rebuild the ftrace work on top of it. For testing, I'll just keep those two
> > patches in my tree locally, but then I can get this moving for this merge
> > window.  
> 
> Maybe I asked this before but I don't remember if I got the answer. :)
> How does it handle task exits as it won't go to userspace?  I guess it'll
> lose user callstacks for exit syscalls and other termination paths.
> 
> Similarly, it will miss user callstacks in the samples at the end of
> profiling if the target tasks remain in the kernel (or they sleep).
> It looks like a fundamental limitation of the deferred callchains.
> 

Ah, I think I forgot about that. I believe the exit path can also be a
faultable path. All it needs is a hook to do the exit. Is there any
"task work" clean up on exit? I need to take a look.

-- Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure
  2025-05-19 15:33     ` Steven Rostedt
@ 2025-05-20  9:35       ` Ingo Molnar
  2025-05-20 15:57         ` Steven Rostedt
  0 siblings, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2025-05-20  9:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Namhyung Kim, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Jiri Olsa, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andrii Nakryiko


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 16 May 2025 16:39:56 -0700
> Namhyung Kim <namhyung@kernel.org> wrote:
> 
> > Hi Steve,
> > 
> > On Wed, May 14, 2025 at 01:27:20PM -0400, Steven Rostedt wrote:
> > > On Tue, 13 May 2025 18:34:35 -0400
> > > Steven Rostedt <rostedt@goodmis.org> wrote:
> > >   
> > > > This has modifications in x86 and I would like it to go through the x86
> > > > tree. Preferably it can go into this merge window so we can focus on getting
> > > > perf and ftrace to work on top of this.  
> > > 
> > > I think it may be best for me to remove the two x86 specific patches, and
> > > rebuild the ftrace work on top of it. For testing, I'll just keep those two
> > > patches in my tree locally, but then I can get this moving for this merge
> > > window.  
> > 
> > Maybe I asked this before but I don't remember if I got the answer. :)
> > How does it handle task exits as it won't go to userspace?  I guess it'll
> > lose user callstacks for exit syscalls and other termination paths.
> > 
> > Similarly, it will miss user callstacks in the samples at the end of
> > profiling if the target tasks remain in the kernel (or they sleep).
> > It looks like a fundamental limitation of the deferred callchains.
> > 
> 
> Ah, I think I forgot about that. I believe the exit path can also be a
> faultable path. All it needs is a hook to do the exit. Is there any
> "task work" clean up on exit? I need to take a look.

Could you please not rush this facility into v6.16? It barely had any 
design review so far, and I'm still not entirely sure about the 
approach.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure
  2025-05-20  9:35       ` Ingo Molnar
@ 2025-05-20 15:57         ` Steven Rostedt
  2025-05-20 16:29           ` Steven Rostedt
  0 siblings, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2025-05-20 15:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Namhyung Kim, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Jiri Olsa, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andrii Nakryiko, Jose E. Marchesi,
	Indu Bhagat

On Tue, 20 May 2025 11:35:56 +0200
Ingo Molnar <mingo@kernel.org> wrote:

> > Ah, I think I forgot about that. I believe the exit path can also be a
> > faultable path. All it needs is a hook to do the exit. Is there any
> > "task work" clean up on exit? I need to take a look.  
> 
> Could you please not rush this facility into v6.16? It barely had any 
> design review so far, and I'm still not entirely sure about the 
> approach.

Hi Ingo,

Note, there has been a lot of discussion on this approach, although it's
been mostly at conferences and in meetings. At GNU Cauldron in September
2024 (before Plumbers) Josh, Mathieu and myself discussed it in quite
detail. I've then been hosting a monthly meeting with engineers from
Google, Red Hat, EffiOS, Oracle, Microsoft and others (I can invite you to
it if you would like). There are actually two meetings (one in an Asia
friendly timezone and another in a European friendly timezone). The first
patches from this went out in October, 2024:

  https://lore.kernel.org/all/cover.1730150953.git.jpoimboe@kernel.org/

There wasn't much discussion on it, although Peter did reply, and I believe
that we did address all of his concerns.

Josh then changed the approach from what we originally discussed, which was
to just have each tracer attach a task_work to the task it wants a trace
from, thinking that would be good enough. But Mathieu and I found that it
doesn't work: even a perf event cannot handle this, as it would need to
keep track of several tasks that may migrate.

Let me state the goal of this work, as it started from a session at the
2022 Tracing Summit in London. We needed a way to get reliable user space
stack traces without using frame pointers. We discussed various methods,
including using eh_frame but they all had issues. We concluded that it
would be nice to have an ORC unwinder (technically it's called a stack
walker), for user space.

Then in Nov 2022, I read about "sframes" which was exactly what we were
looking for:

   https://www.phoronix.com/news/GNU-Binutils-SFrame

At FOSDEM 2023, I asked Jose Marchesi (a GCC maintainer) about sframes, and
it just happened to be one of his employees that created it (Indu Bhagat).
At Kernel Recipes 2023, Brendan Gregg, during his talk, asked wouldn't it
be great to have user space stack walking without needing to run everything
with frame pointers. I raised my hand and asked if he had heard about
"sframes", which he did not. I explained what it was and he was very
interested. After that talk, Josh Poimboeuf came up to me and asked if he
could implement this, which I agreed.

One thing that is needed for sframes is that it has to be done in a
faultable context. The sframe sections are like ORC, in that they have lookup
tables, but they live in the user space address space, and because they can be
large, they can't be locked into memory. This brings up the deferred
unwinding aspect.

At the LSFMMBPF 2023 conference, Indu and myself ran a session on sframes
to get a better idea on how to implement this. The perf user space stack
tracing was mentioned, where if frame pointers are not there, perf may copy
thousands of bytes of the user space stack into the perf buffer. If there's
a long system call, it may do this several times, and because the user
space stack does not change while the task is in the kernel, this is
thousands of bytes of duplicate data. I asked Jiri Olsa "why not just defer
the stack trace and only make a single entry", and I believe he replied "we
are thinking about it", but nothing further came of it.

Now, there's a serious push to get sframes upstream, and that will take a
few steps. These are:

1) Create a new user unwind stack call that is expected to be called in
   faultable context. If a tracer (perf, ftrace, BPF or whatever) knows
   it's in a faultable context and wants a trace, it can simply ask for it.

   It was also asked to have a user space system call that can get this
   trace so that a function in a user space application can see what is
   calling it.

2) Create a deferred stack unwinding infrastructure that can be used by
   many clients (perf, ftrace, BPF or whatever) and called in any context
   (interrupt or NMI).

   As the unwind stack call needs to be in a faultable context, and it is
   very common to want both a kernel stack trace along with a user space
   stack trace and this can happen in an NMI, allow the kernel stack trace
   to be executed and delay the user stack trace.

   The accounting for this is not easy, as it has a many-to-many
   relationship. You could have perf, ftrace and BPF all asking for a
   delayed stack trace and they all need to be called. But each could be
   wanting a stack trace from a different set of tasks. Keeping track of
   which tracer gets a callback for which task is where this patch set
   comes in. Or at least just the infrastructure part.

3) Add sframes.

   The final part is to get this working with sframes. Where a distro
   builds all its applications with sframes enabled and then perf, ftrace,
   BPF or whatever gets access to it through this interface.

There's quite a lot of momentum today behind work being built on the
expectation that we get to step 3. There's no use in adding sframes to
applications if the kernel can't read them. The only way to read them is to
have this deferred infrastructure.

We've discussed the current design quite a bit, but until there are actual
users starting to build on top of it, all the corner cases may not come
out. That's why I'm suggesting that we just get the basic infrastructure
in this merge window, where it's not fully enabled (there are no users of
it); we can then have several users build on top of it in the next merge
window to see if that finds anything that breaks.

As perf has the biggest user space ABI, where the delayed stack trace may
be in a different event buffer than where the kernel stack trace occurred
(due to migration), it's the one I'm most concerned about getting right.
As once it is exposed to user space, it can never change. That's the one I
do want to focus on the most. But it shouldn't delay getting the non user
space visible aspect of the kernel side moving forward. If we find an
issue, it can always be changed because it doesn't affect user space.

I've been testing this on the ftrace side and so far, everything works
fine. But the ftrace side has the deferred trace always in the same
instance buffer as where it was triggered, as it doesn't depend on user
space API for that.

I've also been running this on perf and it too is working well. But I don't
know the perf interface well enough to make sure there isn't other corner
cases that I may be missing.

For 6.16, I would just like to get the common infrastructure in the kernel
so there's no dependency on different tracers. This patch set adds the
changes but does not enable it yet. This way, perf, ftrace and BPF can all
work to build on top of the changes without needing to share code.

There are only two patches that touch x86 here. I have another patch series
that removes them and implements this for ftrace, but because no
architecture has it enabled, it's just dead code. But it would also allow
to build on top of the infrastructure where any architecture could enable
it. Kind of like what PREEMPT_RT did.

-- Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure
  2025-05-20 15:57         ` Steven Rostedt
@ 2025-05-20 16:29           ` Steven Rostedt
  0 siblings, 0 replies; 23+ messages in thread
From: Steven Rostedt @ 2025-05-20 16:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Namhyung Kim, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Jiri Olsa, Thomas Gleixner, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Andrii Nakryiko, Jose E. Marchesi,
	Indu Bhagat

On Tue, 20 May 2025 11:57:21 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> For 6.16, I would just like to get the common infrastructure in the kernel
> so there's no dependency on different tracers. This patch set adds the
> changes but does not enable it yet. This way, perf, ftrace and BPF can all
> work to build on top of the changes without needing to share code.

Another thing that would work for me is not to push it this merge window,
but if you could make a separate branch based on top of v6.15 when it is
released, that has this code, where other subsystems could work on top of
that, would be great!

-- Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure
  2025-05-16 23:39   ` Namhyung Kim
  2025-05-19 15:33     ` Steven Rostedt
@ 2025-05-20 23:26     ` Masami Hiramatsu
  2025-05-20 23:55       ` Steven Rostedt
  1 sibling, 1 reply; 23+ messages in thread
From: Masami Hiramatsu @ 2025-05-20 23:26 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Andrii Nakryiko

On Fri, 16 May 2025 16:39:56 -0700
Namhyung Kim <namhyung@kernel.org> wrote:

> Hi Steve,
> 
> On Wed, May 14, 2025 at 01:27:20PM -0400, Steven Rostedt wrote:
> > On Tue, 13 May 2025 18:34:35 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > This has modifications in x86 and I would like it to go through the x86
> > > tree. Preferably it can go into this merge window so we can focus on getting
> > > perf and ftrace to work on top of this.
> > 
> > I think it may be best for me to remove the two x86 specific patches, and
> > rebuild the ftrace work on top of it. For testing, I'll just keep those two
> > patches in my tree locally, but then I can get this moving for this merge
> > window.
> 
> Maybe I asked this before but I don't remember if I got the answer. :)
> How does it handle task exits as it won't go to userspace?  I guess it'll
> lose user callstacks for exit syscalls and other termination paths.
> 
> Similarly, it will miss user callstacks in the samples at the end of
> profiling if the target tasks remain in the kernel (or they sleep).
> It looks like a fundamental limitation of the deferred callchains.

Can we use a hybrid approach for this case?
It might be more balanced (from the performance point of view) to save
the full stack in a classic way only in this case, rather than faulting
on process exit or doing file access just to load the sframe.

Thanks,

> 
> Thanks,
> Namhyung
> 
> > 
> > Next merge window, we can spend more time on getting the perf API working
> > properly.
> > 
> > -- Steve


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure
  2025-05-20 23:26     ` Masami Hiramatsu
@ 2025-05-20 23:55       ` Steven Rostedt
  2025-05-21 16:50         ` Steven Rostedt
  0 siblings, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2025-05-20 23:55 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Namhyung Kim, linux-kernel, linux-trace-kernel, bpf, x86,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
	Jiri Olsa, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andrii Nakryiko

On Wed, 21 May 2025 08:26:05 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > Maybe I asked this before but I don't remember if I got the answer. :)
> > How does it handle task exits as it won't go to userspace?  I guess it'll
> > lose user callstacks for exit syscalls and other termination paths.

I just checked, and the good news is that task_work does indeed get called
when a task exits. The bad news is that it happens after do_exit() cleans
up the task's "mm" structure via exit_mm(). Which means that current->mm is
NULL :-p

There's a proposal to move trace_sched_process_exit() to before exit_mm().
If that happens, we could make that tracepoint a "faultable" tracepoint and
then the unwind infrastructure could attach to it and do the unwinding from
that tracepoint.

> > 
> > Similarly, it will miss user callstacks in the samples at the end of
> > profiling if the target tasks remain in the kernel (or they sleep).
> > It looks like a fundamental limitation of the deferred callchains.  

Yes that is a limitation.

> 
> Can we use a hybrid approach for this case?
> It might be more balanced (from the performance point of view) to save
> the full stack in a classic way only in this case, rather than faulting
> on process exit or doing file access just to load the sframe.

Another approach is that the tool (like perf) could request to take the
user space stack trace every time a task enters the kernel via a system
call.

-- Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure
  2025-05-20 23:55       ` Steven Rostedt
@ 2025-05-21 16:50         ` Steven Rostedt
  0 siblings, 0 replies; 23+ messages in thread
From: Steven Rostedt @ 2025-05-21 16:50 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Namhyung Kim, linux-kernel, linux-trace-kernel, bpf, x86,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
	Jiri Olsa, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andrii Nakryiko

On Tue, 20 May 2025 19:55:49 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> There's a proposal to move trace_sched_process_exit() to before exit_mm().
> If that happens, we could make that tracepoint a "faultable" tracepoint and
> then the unwind infrastructure could attach to it and do the unwinding from
> that tracepoint.

The below patch does work. It's just a PoC and would need to be broken up
and also cleaned up.

I created a TRACE_EVENT_FAULTABLE() that is basically just a
TRACE_EVENT_SYSCALL(), and used that for the sched_process_exit tracepoint.

I then had the unwinder attach to that tracepoint when the first unwind
callback is registered.

I had to change the check in the trace from testing PF_EXITING to just
checking whether current->mm is NULL.

But this does work for the exiting of a task:

-- Steve

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index a351763e6965..eb98bb61126e 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -617,6 +617,8 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 #define TRACE_EVENT_SYSCALL(name, proto, args, struct, assign,	\
 			    print, reg, unreg)			\
 	DECLARE_TRACE_SYSCALL(name, PARAMS(proto), PARAMS(args))
+#define TRACE_EVENT_FAULTABLE(name, proto, args, struct, assign, print)	\
+	DECLARE_TRACE_SYSCALL(name, PARAMS(proto), PARAMS(args))
 
 #define TRACE_EVENT_FLAGS(event, flag)
 
diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
index ed52d0506c69..b228424744fd 100644
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -50,6 +50,10 @@
 #define TRACE_EVENT_SYSCALL(name, proto, args, struct, assign, print, reg, unreg) \
 	DEFINE_TRACE_SYSCALL(name, reg, unreg, PARAMS(proto), PARAMS(args))
 
+#undef TRACE_EVENT_FAULTABLE
+#define TRACE_EVENT_FAULTABLE(name, proto, args, struct, assign, print) \
+	DEFINE_TRACE_SYSCALL(name, NULL, NULL, PARAMS(proto), PARAMS(args))
+
 #undef TRACE_EVENT_NOP
 #define TRACE_EVENT_NOP(name, proto, args, struct, assign, print)
 
@@ -125,6 +129,7 @@
 #undef TRACE_EVENT_FN
 #undef TRACE_EVENT_FN_COND
 #undef TRACE_EVENT_SYSCALL
+#undef TRACE_EVENT_FAULTABLE
 #undef TRACE_EVENT_CONDITION
 #undef TRACE_EVENT_NOP
 #undef DEFINE_EVENT_NOP
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 3bec9fb73a36..c6d7894970e3 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -326,13 +326,13 @@ DEFINE_EVENT(sched_process_template, sched_process_free,
 	     TP_ARGS(p));
 
 /*
- * Tracepoint for a task exiting.
+ * Tracepoint for a task exiting (allows faulting)
  * Note, it's a superset of sched_process_template and should be kept
  * compatible as much as possible. sched_process_exits has an extra
  * `group_dead` argument, so sched_process_template can't be used,
  * unfortunately, just like sched_migrate_task above.
  */
-TRACE_EVENT(sched_process_exit,
+TRACE_EVENT_FAULTABLE(sched_process_exit,
 
 	TP_PROTO(struct task_struct *p, bool group_dead),
 
diff --git a/include/trace/trace_events.h b/include/trace/trace_events.h
index 4f22136fd465..0ed57e7906d1 100644
--- a/include/trace/trace_events.h
+++ b/include/trace/trace_events.h
@@ -55,6 +55,16 @@
 			     PARAMS(print));		       \
 	DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
 
+#undef TRACE_EVENT_FAULTABLE
+#define TRACE_EVENT_FAULTABLE(name, proto, args, tstruct, assign, print) \
+	DECLARE_EVENT_SYSCALL_CLASS(name,		       \
+			     PARAMS(proto),		       \
+			     PARAMS(args),		       \
+			     PARAMS(tstruct),		       \
+			     PARAMS(assign),		       \
+			     PARAMS(print));		       \
+	DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));
+
 #include "stages/stage1_struct_define.h"
 
 #undef DECLARE_EVENT_CLASS
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 63d0237bad3e..7aad471f2887 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -11,6 +11,8 @@
 #include <linux/slab.h>
 #include <linux/mm.h>
 
+#include <trace/events/sched.h>
+
 #define UNWIND_MAX_ENTRIES 512
 
 /* Guards adding to or removing from the list of callbacks */
@@ -77,7 +79,7 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
 	/* Should always be called from faultable context */
 	might_fault();
 
-	if (current->flags & PF_EXITING)
+	if (!current->mm)
 		return -EINVAL;
 
 	if (!info->cache) {
@@ -107,14 +109,14 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
 	return 0;
 }
 
-static void unwind_deferred_task_work(struct callback_head *head)
+static void process_unwind_deferred(void)
 {
-	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
+	struct task_struct *task = current;
+	struct unwind_task_info *info = &task->unwind_info;
 	struct unwind_stacktrace trace;
 	struct unwind_work *work;
 	unsigned long bits;
 	u64 timestamp;
-	struct task_struct *task = current;
 	int idx;
 
 	if (WARN_ON_ONCE(!unwind_pending(task)))
@@ -152,6 +155,21 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	srcu_read_unlock(&unwind_srcu, idx);
 }
 
+static void unwind_deferred_task_work(struct callback_head *head)
+{
+	process_unwind_deferred();
+}
+
+static void unwind_deferred_callback(void *data, struct task_struct *p, bool group_dead)
+{
+	if (!unwind_pending(p))
+		return;
+
+	process_unwind_deferred();
+
+	task_work_cancel(p, &p->unwind_info.work);
+}
+
 static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
 {
 	struct unwind_task_info *info = &current->unwind_info;
@@ -329,6 +347,10 @@ void unwind_deferred_cancel(struct unwind_work *work)
 	for_each_process_thread(g, t) {
 		clear_bit(bit, &t->unwind_mask);
 	}
+
+	/* Is this the last registered unwinding? */
+	if (!unwind_mask)
+		unregister_trace_sched_process_exit(unwind_deferred_callback, NULL);
 }
 
 int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
@@ -341,6 +363,15 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 	if (unwind_mask == ~(UNWIND_PENDING))
 		return -EBUSY;
 
+	/* Is this the first registered unwinding? */
+	if (!unwind_mask) {
+		int ret;
+
+		ret = register_trace_sched_process_exit(unwind_deferred_callback, NULL);
+		if (ret < 0)
+			return ret;
+	}
+
 	work->bit = ffz(unwind_mask);
 	unwind_mask |= 1UL << work->bit;
 


^ permalink raw reply related	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2025-05-21 16:50 UTC | newest]

Thread overview: 23+ messages
2025-05-13 22:34 [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
2025-05-13 22:34 ` [PATCH v9 01/13] unwind_user: Add user space unwinding API Steven Rostedt
2025-05-13 22:34 ` [PATCH v9 02/13] unwind_user: Add frame pointer support Steven Rostedt
2025-05-13 22:34 ` [PATCH v9 03/13] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
2025-05-13 22:34 ` [PATCH v9 04/13] perf/x86: Rename and move get_segment_base() and make it global Steven Rostedt
2025-05-13 22:34 ` [PATCH v9 05/13] unwind_user: Add compat mode frame pointer support Steven Rostedt
2025-05-13 22:34 ` [PATCH v9 06/13] unwind_user/x86: Enable compat mode frame pointer unwinding on x86 Steven Rostedt
2025-05-13 22:34 ` [PATCH v9 07/13] unwind_user/deferred: Add unwind_deferred_trace() Steven Rostedt
2025-05-13 22:34 ` [PATCH v9 08/13] unwind_user/deferred: Add unwind cache Steven Rostedt
2025-05-13 22:34 ` [PATCH v9 09/13] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
2025-05-13 22:34 ` [PATCH v9 10/13] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
2025-05-13 22:34 ` [PATCH v9 11/13] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
2025-05-13 22:34 ` [PATCH v9 12/13] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
2025-05-13 22:34 ` [PATCH v9 13/13] unwind: Clear unwind_mask on exit back to user space Steven Rostedt
2025-05-14 17:27 ` [PATCH v9 00/13] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
2025-05-16 23:39   ` Namhyung Kim
2025-05-19 15:33     ` Steven Rostedt
2025-05-20  9:35       ` Ingo Molnar
2025-05-20 15:57         ` Steven Rostedt
2025-05-20 16:29           ` Steven Rostedt
2025-05-20 23:26     ` Masami Hiramatsu
2025-05-20 23:55       ` Steven Rostedt
2025-05-21 16:50         ` Steven Rostedt
