* [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure
@ 2025-05-02 16:47 Steven Rostedt
  2025-05-02 16:47 ` [PATCH v7 01/17] unwind_user: Add user space unwinding API Steven Rostedt
                   ` (17 more replies)
  0 siblings, 18 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-05-02 16:47 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim


[ Shortened the Cc list to just those who maintain this code ]

This series does not make any user-space-visible changes.
It only adds the necessary infrastructure for the deferred unwinder
and makes a few helpful cleanups to perf.

 Based on tip/master: 252d33c92dbc23bcc1e662a889787c09a02eeccc

Peter,

Would you be willing to take this series? I'd like to get this part into
the kernel in the upcoming merge window so that we can then focus on
getting perf and ftrace to use it in the merge window after that.

Perf exposes a lot of the interface to user space, as the perf tool needs
to handle the merging of the stacks. I figured it would be better to get
the kernel side mostly done first, and then work out the kinks in the code
between user space and the kernel.

Are you OK with this?

This series combines the non-user-interface portions of:

 [v7] perf: Deferred unwinding of user space stack traces
 https://lore.kernel.org/linux-trace-kernel/20250430195746.827125963@goodmis.org/

which had no changes, with:

 [v6] perf: Deferred unwinding of user space stack traces for per CPU events
 https://lore.kernel.org/linux-trace-kernel/20250501013202.997535180@goodmis.org/

With the following changes:

- Have unwind_deferred_request() return positive if already queued

- Check (current->flags & (PF_KTHREAD | PF_EXITING)) in
  unwind_deferred_request(), as the task_work will fail to be added in the
  exit code.

Hence, this is called v7.

Josh Poimboeuf (13):
      unwind_user: Add user space unwinding API
      unwind_user: Add frame pointer support
      unwind_user/x86: Enable frame pointer unwinding on x86
      perf/x86: Rename and move get_segment_base() and make it global
      unwind_user: Add compat mode frame pointer support
      unwind_user/x86: Enable compat mode frame pointer unwinding on x86
      unwind_user/deferred: Add unwind cache
      unwind_user/deferred: Add deferred unwinding interface
      unwind_user/deferred: Make unwind deferral requests NMI-safe
      perf: Remove get_perf_callchain() init_nr argument
      perf: Have get_perf_callchain() return NULL if crosstask and user are set
      perf: Simplify get_perf_callchain() user logic
      perf: Skip user unwind if the task is a kernel thread

Steven Rostedt (4):
      unwind_user/deferred: Add unwind_deferred_trace()
      unwind deferred: Use bitmask to determine which callbacks to call
      unwind deferred: Use SRCU in unwind_deferred_task_work()
      perf: Use current->flags & PF_KTHREAD instead of current->mm == NULL

----
 MAINTAINERS                              |   8 +
 arch/Kconfig                             |  11 +
 arch/x86/Kconfig                         |   2 +
 arch/x86/events/core.c                   |  44 +---
 arch/x86/include/asm/ptrace.h            |   2 +
 arch/x86/include/asm/unwind_user.h       |  61 ++++++
 arch/x86/include/asm/unwind_user_types.h |  17 ++
 arch/x86/kernel/ptrace.c                 |  38 ++++
 include/asm-generic/Kbuild               |   2 +
 include/asm-generic/unwind_user.h        |  24 +++
 include/asm-generic/unwind_user_types.h  |   9 +
 include/linux/entry-common.h             |   2 +
 include/linux/perf_event.h               |   2 +-
 include/linux/sched.h                    |   6 +
 include/linux/unwind_deferred.h          |  50 +++++
 include/linux/unwind_deferred_types.h    |  18 ++
 include/linux/unwind_user.h              |  15 ++
 include/linux/unwind_user_types.h        |  35 ++++
 kernel/Makefile                          |   1 +
 kernel/bpf/stackmap.c                    |   4 +-
 kernel/events/callchain.c                |  38 ++--
 kernel/events/core.c                     |   7 +-
 kernel/fork.c                            |   4 +
 kernel/unwind/Makefile                   |   1 +
 kernel/unwind/deferred.c                 | 349 +++++++++++++++++++++++++++++++
 kernel/unwind/user.c                     | 130 ++++++++++++
 26 files changed, 815 insertions(+), 65 deletions(-)
 create mode 100644 arch/x86/include/asm/unwind_user.h
 create mode 100644 arch/x86/include/asm/unwind_user_types.h
 create mode 100644 include/asm-generic/unwind_user.h
 create mode 100644 include/asm-generic/unwind_user_types.h
 create mode 100644 include/linux/unwind_deferred.h
 create mode 100644 include/linux/unwind_deferred_types.h
 create mode 100644 include/linux/unwind_user.h
 create mode 100644 include/linux/unwind_user_types.h
 create mode 100644 kernel/unwind/Makefile
 create mode 100644 kernel/unwind/deferred.c
 create mode 100644 kernel/unwind/user.c


* [PATCH v7 01/17] unwind_user: Add user space unwinding API
  2025-05-02 16:47 [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
@ 2025-05-02 16:47 ` Steven Rostedt
  2025-05-04  9:30   ` Ingo Molnar
  2025-05-02 16:47 ` [PATCH v7 02/17] unwind_user: Add frame pointer support Steven Rostedt
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 26+ messages in thread
From: Steven Rostedt @ 2025-05-02 16:47 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim

From: Josh Poimboeuf <jpoimboe@kernel.org>

Introduce a generic API for unwinding user stacks.

In order to expand user space unwinding to handle more complex scenarios,
such as deferred unwinding and reading user space information, create a
generic interface that all architectures supporting the various unwinding
methods can use.

This is an alternative to the simple stack_trace_save_user() API for
handling user space stack traces. It does not replace that interface;
rather, this new interface will be used to expand the functionality of
user space stack walking.
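
As a rough sketch of the intended usage (a hypothetical caller, not part
of this patch; the buffer size of 64 is illustrative):

	unsigned long entries[64];
	struct unwind_stacktrace trace = {
		.nr		= 0,
		.entries	= entries,
	};

	/* Walk the current task's user stack, at most 64 frames deep */
	if (!unwind_user(&trace, 64)) {
		unsigned int i;

		for (i = 0; i < trace.nr; i++)
			pr_info("user frame: %lx\n", trace.entries[i]);
	}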

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 MAINTAINERS                       |  8 +++++
 arch/Kconfig                      |  3 ++
 include/linux/unwind_user.h       | 15 +++++++++
 include/linux/unwind_user_types.h | 31 +++++++++++++++++
 kernel/Makefile                   |  1 +
 kernel/unwind/Makefile            |  1 +
 kernel/unwind/user.c              | 55 +++++++++++++++++++++++++++++++
 7 files changed, 114 insertions(+)
 create mode 100644 include/linux/unwind_user.h
 create mode 100644 include/linux/unwind_user_types.h
 create mode 100644 kernel/unwind/Makefile
 create mode 100644 kernel/unwind/user.c

diff --git a/MAINTAINERS b/MAINTAINERS
index fedcbcba8397..f94b8d05543d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -25308,6 +25308,14 @@ F:	Documentation/driver-api/uio-howto.rst
 F:	drivers/uio/
 F:	include/linux/uio_driver.h
 
+USERSPACE STACK UNWINDING
+M:	Josh Poimboeuf <jpoimboe@kernel.org>
+M:	Steven Rostedt <rostedt@goodmis.org>
+S:	Maintained
+F:	include/linux/unwind*.h
+F:	kernel/unwind/
+
+
 UTIL-LINUX PACKAGE
 M:	Karel Zak <kzak@redhat.com>
 L:	util-linux@vger.kernel.org
diff --git a/arch/Kconfig b/arch/Kconfig
index b0adb665041f..ccbcead9fac0 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -435,6 +435,9 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
 	  It uses the same command line parameters, and sysctl interface,
 	  as the generic hardlockup detectors.
 
+config UNWIND_USER
+	bool
+
 config HAVE_PERF_REGS
 	bool
 	help
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
new file mode 100644
index 000000000000..aa7923c1384f
--- /dev/null
+++ b/include/linux/unwind_user.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_H
+#define _LINUX_UNWIND_USER_H
+
+#include <linux/unwind_user_types.h>
+
+int unwind_user_start(struct unwind_user_state *state);
+int unwind_user_next(struct unwind_user_state *state);
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
+
+#define for_each_user_frame(state) \
+	for (unwind_user_start((state)); !(state)->done; unwind_user_next((state)))
+
+#endif /* _LINUX_UNWIND_USER_H */
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
new file mode 100644
index 000000000000..6ed1b4ae74e1
--- /dev/null
+++ b/include/linux/unwind_user_types.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_TYPES_H
+#define _LINUX_UNWIND_USER_TYPES_H
+
+#include <linux/types.h>
+
+enum unwind_user_type {
+	UNWIND_USER_TYPE_NONE,
+};
+
+struct unwind_stacktrace {
+	unsigned int	nr;
+	unsigned long	*entries;
+};
+
+struct unwind_user_frame {
+	s32 cfa_off;
+	s32 ra_off;
+	s32 fp_off;
+	bool use_fp;
+};
+
+struct unwind_user_state {
+	unsigned long ip;
+	unsigned long sp;
+	unsigned long fp;
+	enum unwind_user_type type;
+	bool done;
+};
+
+#endif /* _LINUX_UNWIND_USER_TYPES_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 434929de17ef..5a2b2be2a32d 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -55,6 +55,7 @@ obj-y += rcu/
 obj-y += livepatch/
 obj-y += dma/
 obj-y += entry/
+obj-y += unwind/
 obj-$(CONFIG_MODULES) += module/
 
 obj-$(CONFIG_KCMP) += kcmp.o
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
new file mode 100644
index 000000000000..349ce3677526
--- /dev/null
+++ b/kernel/unwind/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_UNWIND_USER) += user.o
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
new file mode 100644
index 000000000000..d30449328981
--- /dev/null
+++ b/kernel/unwind/user.c
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Generic interfaces for unwinding user space
+ */
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+#include <linux/unwind_user.h>
+
+int unwind_user_next(struct unwind_user_state *state)
+{
+	/* no implementation yet */
+	return -EINVAL;
+}
+
+int unwind_user_start(struct unwind_user_state *state)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+
+	memset(state, 0, sizeof(*state));
+
+	if ((current->flags & PF_KTHREAD) || !user_mode(regs)) {
+		state->done = true;
+		return -EINVAL;
+	}
+
+	state->type = UNWIND_USER_TYPE_NONE;
+
+	state->ip = instruction_pointer(regs);
+	state->sp = user_stack_pointer(regs);
+	state->fp = frame_pointer(regs);
+
+	return 0;
+}
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries)
+{
+	struct unwind_user_state state;
+
+	trace->nr = 0;
+
+	if (!max_entries)
+		return -EINVAL;
+
+	if (current->flags & PF_KTHREAD)
+		return 0;
+
+	for_each_user_frame(&state) {
+		trace->entries[trace->nr++] = state.ip;
+		if (trace->nr >= max_entries)
+			break;
+	}
+
+	return 0;
+}
-- 
2.47.2




* [PATCH v7 02/17] unwind_user: Add frame pointer support
  2025-05-02 16:47 [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
  2025-05-02 16:47 ` [PATCH v7 01/17] unwind_user: Add user space unwinding API Steven Rostedt
@ 2025-05-02 16:47 ` Steven Rostedt
  2025-05-02 16:47 ` [PATCH v7 03/17] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-05-02 16:47 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim

From: Josh Poimboeuf <jpoimboe@kernel.org>

Add optional support for user space frame pointer unwinding.  If
supported, the arch needs to enable CONFIG_HAVE_UNWIND_USER_FP and
define ARCH_INIT_USER_FP_FRAME.

By encoding the frame offsets in struct unwind_user_frame, much of this
code can also be reused for future unwinder implementations like sframe.
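
The resulting generic unwind step then reduces to simple pointer
arithmetic; roughly (simplified from unwind_user_next() below, with the
error checking omitted):

	/* Compute the canonical frame address of the current frame */
	cfa = (frame->use_fp ? state->fp : state->sp) + frame->cfa_off;

	/* Read the return address and previous frame pointer from it */
	get_user(ra, (unsigned long __user *)(cfa + frame->ra_off));
	get_user(fp, (unsigned long __user *)(cfa + frame->fp_off));

	state->ip = ra;
	state->sp = cfa;
	state->fp = fp;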

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 arch/Kconfig                      |  4 +++
 include/asm-generic/unwind_user.h |  9 ++++++
 include/linux/unwind_user_types.h |  1 +
 kernel/unwind/user.c              | 51 +++++++++++++++++++++++++++++--
 4 files changed, 63 insertions(+), 2 deletions(-)
 create mode 100644 include/asm-generic/unwind_user.h

diff --git a/arch/Kconfig b/arch/Kconfig
index ccbcead9fac0..0e3844c0e200 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -438,6 +438,10 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
 config UNWIND_USER
 	bool
 
+config HAVE_UNWIND_USER_FP
+	bool
+	select UNWIND_USER
+
 config HAVE_PERF_REGS
 	bool
 	help
diff --git a/include/asm-generic/unwind_user.h b/include/asm-generic/unwind_user.h
new file mode 100644
index 000000000000..832425502fb3
--- /dev/null
+++ b/include/asm-generic/unwind_user.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_UNWIND_USER_H
+#define _ASM_GENERIC_UNWIND_USER_H
+
+#ifndef ARCH_INIT_USER_FP_FRAME
+ #define ARCH_INIT_USER_FP_FRAME
+#endif
+
+#endif /* _ASM_GENERIC_UNWIND_USER_H */
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 6ed1b4ae74e1..65bd070eb6b0 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -6,6 +6,7 @@
 
 enum unwind_user_type {
 	UNWIND_USER_TYPE_NONE,
+	UNWIND_USER_TYPE_FP,
 };
 
 struct unwind_stacktrace {
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index d30449328981..0671a81494d3 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -6,10 +6,54 @@
 #include <linux/sched.h>
 #include <linux/sched/task_stack.h>
 #include <linux/unwind_user.h>
+#include <linux/uaccess.h>
+#include <asm/unwind_user.h>
+
+static struct unwind_user_frame fp_frame = {
+	ARCH_INIT_USER_FP_FRAME
+};
+
+static inline bool fp_state(struct unwind_user_state *state)
+{
+	return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP) &&
+	       state->type == UNWIND_USER_TYPE_FP;
+}
 
 int unwind_user_next(struct unwind_user_state *state)
 {
-	/* no implementation yet */
+	struct unwind_user_frame _frame;
+	struct unwind_user_frame *frame = &_frame;
+	unsigned long cfa = 0, fp, ra = 0;
+
+	if (state->done)
+		return -EINVAL;
+
+	if (fp_state(state))
+		frame = &fp_frame;
+	else
+		goto the_end;
+
+	cfa = (frame->use_fp ? state->fp : state->sp) + frame->cfa_off;
+
+	/* stack going in wrong direction? */
+	if (cfa <= state->sp)
+		goto the_end;
+
+	if (get_user(ra, (unsigned long __user *)(cfa + frame->ra_off)))
+		goto the_end;
+
+	if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
+		goto the_end;
+
+	state->ip = ra;
+	state->sp = cfa;
+	if (frame->fp_off)
+		state->fp = fp;
+
+	return 0;
+
+the_end:
+	state->done = true;
 	return -EINVAL;
 }
 
@@ -24,7 +68,10 @@ int unwind_user_start(struct unwind_user_state *state)
 		return -EINVAL;
 	}
 
-	state->type = UNWIND_USER_TYPE_NONE;
+	if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+		state->type = UNWIND_USER_TYPE_FP;
+	else
+		state->type = UNWIND_USER_TYPE_NONE;
 
 	state->ip = instruction_pointer(regs);
 	state->sp = user_stack_pointer(regs);
-- 
2.47.2




* [PATCH v7 03/17] unwind_user/x86: Enable frame pointer unwinding on x86
  2025-05-02 16:47 [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
  2025-05-02 16:47 ` [PATCH v7 01/17] unwind_user: Add user space unwinding API Steven Rostedt
  2025-05-02 16:47 ` [PATCH v7 02/17] unwind_user: Add frame pointer support Steven Rostedt
@ 2025-05-02 16:47 ` Steven Rostedt
  2025-05-02 16:47 ` [PATCH v7 04/17] perf/x86: Rename and move get_segment_base() and make it global Steven Rostedt
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-05-02 16:47 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim

From: Josh Poimboeuf <jpoimboe@kernel.org>

Use ARCH_INIT_USER_FP_FRAME to describe how frame pointers are unwound
on x86, and enable CONFIG_HAVE_UNWIND_USER_FP accordingly so the
unwind_user interfaces can be used.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/include/asm/unwind_user.h | 11 +++++++++++
 2 files changed, 12 insertions(+)
 create mode 100644 arch/x86/include/asm/unwind_user.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4c33c644b92d..a6e529dc4550 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -301,6 +301,7 @@ config X86
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UACCESS_VALIDATION		if HAVE_OBJTOOL
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select HAVE_UNWIND_USER_FP		if X86_64
 	select HAVE_USER_RETURN_NOTIFIER
 	select HAVE_GENERIC_VDSO
 	select VDSO_GETRANDOM			if X86_64
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
new file mode 100644
index 000000000000..8597857bf896
--- /dev/null
+++ b/arch/x86/include/asm/unwind_user.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_UNWIND_USER_H
+#define _ASM_X86_UNWIND_USER_H
+
+#define ARCH_INIT_USER_FP_FRAME							\
+	.cfa_off	= (s32)sizeof(long) *  2,				\
+	.ra_off		= (s32)sizeof(long) * -1,				\
+	.fp_off		= (s32)sizeof(long) * -2,				\
+	.use_fp		= true,
+
+#endif /* _ASM_X86_UNWIND_USER_H */
-- 
2.47.2




* [PATCH v7 04/17] perf/x86: Rename and move get_segment_base() and make it global
  2025-05-02 16:47 [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (2 preceding siblings ...)
  2025-05-02 16:47 ` [PATCH v7 03/17] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
@ 2025-05-02 16:47 ` Steven Rostedt
  2025-05-02 16:47 ` [PATCH v7 05/17] unwind_user: Add compat mode frame pointer support Steven Rostedt
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-05-02 16:47 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim

From: Josh Poimboeuf <jpoimboe@kernel.org>

get_segment_base() will be used by the unwind_user code, so make it
global and rename it so it doesn't conflict with a KVM function of the
same name.

As the function is no longer specific to perf, move it to ptrace.c as that
seems to be a better location for a generic function like this.
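
As a hedged sketch of the kind of use the unwinder has for it (an
illustrative fragment, not taken from this patch), the segment base
converts a compat-mode segmented address into a linear one:

	/* Compute a linear stack pointer for a compat-mode task */
	unsigned long ss_base = segment_base_address(regs->ss);
	unsigned long linear_sp = ss_base + (u32)regs->sp;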

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 arch/x86/events/core.c        | 44 ++++-------------------------------
 arch/x86/include/asm/ptrace.h |  2 ++
 arch/x86/kernel/ptrace.c      | 38 ++++++++++++++++++++++++++++++
 3 files changed, 45 insertions(+), 39 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 2e10dcf897c5..cc6329235b68 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -43,6 +43,7 @@
 #include <asm/ldt.h>
 #include <asm/unwind.h>
 #include <asm/uprobes.h>
+#include <asm/ptrace.h>
 #include <asm/ibt.h>
 
 #include "perf_event.h"
@@ -2809,41 +2810,6 @@ valid_user_frame(const void __user *fp, unsigned long size)
 	return __access_ok(fp, size);
 }
 
-static unsigned long get_segment_base(unsigned int segment)
-{
-	struct desc_struct *desc;
-	unsigned int idx = segment >> 3;
-
-	if ((segment & SEGMENT_TI_MASK) == SEGMENT_LDT) {
-#ifdef CONFIG_MODIFY_LDT_SYSCALL
-		struct ldt_struct *ldt;
-
-		/*
-		 * If we're not in a valid context with a real (not just lazy)
-		 * user mm, then don't even try.
-		 */
-		if (!nmi_uaccess_okay())
-			return 0;
-
-		/* IRQs are off, so this synchronizes with smp_store_release */
-		ldt = smp_load_acquire(&current->mm->context.ldt);
-		if (!ldt || idx >= ldt->nr_entries)
-			return 0;
-
-		desc = &ldt->entries[idx];
-#else
-		return 0;
-#endif
-	} else {
-		if (idx >= GDT_ENTRIES)
-			return 0;
-
-		desc = raw_cpu_ptr(gdt_page.gdt) + idx;
-	}
-
-	return get_desc_base(desc);
-}
-
 #ifdef CONFIG_UPROBES
 /*
  * Heuristic-based check if uprobe is installed at the function entry.
@@ -2900,8 +2866,8 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
 	if (user_64bit_mode(regs))
 		return 0;
 
-	cs_base = get_segment_base(regs->cs);
-	ss_base = get_segment_base(regs->ss);
+	cs_base = segment_base_address(regs->cs);
+	ss_base = segment_base_address(regs->ss);
 
 	fp = compat_ptr(ss_base + regs->bp);
 	pagefault_disable();
@@ -3020,11 +2986,11 @@ static unsigned long code_segment_base(struct pt_regs *regs)
 		return 0x10 * regs->cs;
 
 	if (user_mode(regs) && regs->cs != __USER_CS)
-		return get_segment_base(regs->cs);
+		return segment_base_address(regs->cs);
 #else
 	if (user_mode(regs) && !user_64bit_mode(regs) &&
 	    regs->cs != __USER32_CS)
-		return get_segment_base(regs->cs);
+		return segment_base_address(regs->cs);
 #endif
 	return 0;
 }
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 50f75467f73d..59357ec98e52 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -314,6 +314,8 @@ static __always_inline bool regs_irqs_disabled(struct pt_regs *regs)
 	return !(regs->flags & X86_EFLAGS_IF);
 }
 
+unsigned long segment_base_address(unsigned int segment);
+
 /* Query offset/name of register from its name/offset */
 extern int regs_query_register_offset(const char *name);
 extern const char *regs_query_register_name(unsigned int offset);
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 095f04bdabdc..81353a09701b 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -41,6 +41,7 @@
 #include <asm/syscall.h>
 #include <asm/fsgsbase.h>
 #include <asm/io_bitmap.h>
+#include <asm/mmu_context.h>
 
 #include "tls.h"
 
@@ -339,6 +340,43 @@ static int set_segment_reg(struct task_struct *task,
 
 #endif	/* CONFIG_X86_32 */
 
+unsigned long segment_base_address(unsigned int segment)
+{
+	struct desc_struct *desc;
+	unsigned int idx = segment >> 3;
+
+	lockdep_assert_irqs_disabled();
+
+	if ((segment & SEGMENT_TI_MASK) == SEGMENT_LDT) {
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+		struct ldt_struct *ldt;
+
+		/*
+		 * If we're not in a valid context with a real (not just lazy)
+		 * user mm, then don't even try.
+		 */
+		if (!nmi_uaccess_okay())
+			return 0;
+
+		/* IRQs are off, so this synchronizes with smp_store_release */
+		ldt = smp_load_acquire(&current->mm->context.ldt);
+		if (!ldt || idx >= ldt->nr_entries)
+			return 0;
+
+		desc = &ldt->entries[idx];
+#else
+		return 0;
+#endif
+	} else {
+		if (idx >= GDT_ENTRIES)
+			return 0;
+
+		desc = raw_cpu_ptr(gdt_page.gdt) + idx;
+	}
+
+	return get_desc_base(desc);
+}
+
 static unsigned long get_flags(struct task_struct *task)
 {
 	unsigned long retval = task_pt_regs(task)->flags;
-- 
2.47.2




* [PATCH v7 05/17] unwind_user: Add compat mode frame pointer support
  2025-05-02 16:47 [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (3 preceding siblings ...)
  2025-05-02 16:47 ` [PATCH v7 04/17] perf/x86: Rename and move get_segment_base() and make it global Steven Rostedt
@ 2025-05-02 16:47 ` Steven Rostedt
  2025-05-02 16:47 ` [PATCH v7 06/17] unwind_user/x86: Enable compat mode frame pointer unwinding on x86 Steven Rostedt
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-05-02 16:47 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim

From: Josh Poimboeuf <jpoimboe@kernel.org>

Add optional support for user space compat mode frame pointer unwinding.
If supported, the arch needs to enable CONFIG_HAVE_UNWIND_USER_COMPAT_FP
and define ARCH_INIT_USER_COMPAT_FP_FRAME.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 arch/Kconfig                            |  4 +++
 include/asm-generic/Kbuild              |  2 ++
 include/asm-generic/unwind_user.h       | 15 +++++++++++
 include/asm-generic/unwind_user_types.h |  9 +++++++
 include/linux/unwind_user_types.h       |  3 +++
 kernel/unwind/user.c                    | 36 ++++++++++++++++++++++---
 6 files changed, 65 insertions(+), 4 deletions(-)
 create mode 100644 include/asm-generic/unwind_user_types.h

diff --git a/arch/Kconfig b/arch/Kconfig
index 0e3844c0e200..dbb1cc89e040 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -442,6 +442,10 @@ config HAVE_UNWIND_USER_FP
 	bool
 	select UNWIND_USER
 
+config HAVE_UNWIND_USER_COMPAT_FP
+	bool
+	depends on HAVE_UNWIND_USER_FP
+
 config HAVE_PERF_REGS
 	bool
 	help
diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
index 8675b7b4ad23..b797a2434396 100644
--- a/include/asm-generic/Kbuild
+++ b/include/asm-generic/Kbuild
@@ -59,6 +59,8 @@ mandatory-y += tlbflush.h
 mandatory-y += topology.h
 mandatory-y += trace_clock.h
 mandatory-y += uaccess.h
+mandatory-y += unwind_user.h
+mandatory-y += unwind_user_types.h
 mandatory-y += vermagic.h
 mandatory-y += vga.h
 mandatory-y += video.h
diff --git a/include/asm-generic/unwind_user.h b/include/asm-generic/unwind_user.h
index 832425502fb3..385638ce4aec 100644
--- a/include/asm-generic/unwind_user.h
+++ b/include/asm-generic/unwind_user.h
@@ -2,8 +2,23 @@
 #ifndef _ASM_GENERIC_UNWIND_USER_H
 #define _ASM_GENERIC_UNWIND_USER_H
 
+#include <asm/unwind_user_types.h>
+
 #ifndef ARCH_INIT_USER_FP_FRAME
  #define ARCH_INIT_USER_FP_FRAME
 #endif
 
+#ifndef ARCH_INIT_USER_COMPAT_FP_FRAME
+ #define ARCH_INIT_USER_COMPAT_FP_FRAME
+ #define in_compat_mode(regs) false
+#endif
+
+#ifndef arch_unwind_user_init
+static inline void arch_unwind_user_init(struct unwind_user_state *state, struct pt_regs *regs) {}
+#endif
+
+#ifndef arch_unwind_user_next
+static inline void arch_unwind_user_next(struct unwind_user_state *state) {}
+#endif
+
 #endif /* _ASM_GENERIC_UNWIND_USER_H */
diff --git a/include/asm-generic/unwind_user_types.h b/include/asm-generic/unwind_user_types.h
new file mode 100644
index 000000000000..ee803de7c998
--- /dev/null
+++ b/include/asm-generic/unwind_user_types.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_UNWIND_USER_TYPES_H
+#define _ASM_GENERIC_UNWIND_USER_TYPES_H
+
+#ifndef arch_unwind_user_state
+struct arch_unwind_user_state {};
+#endif
+
+#endif /* _ASM_GENERIC_UNWIND_USER_TYPES_H */
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 65bd070eb6b0..3ec4a097a3dd 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -3,10 +3,12 @@
 #define _LINUX_UNWIND_USER_TYPES_H
 
 #include <linux/types.h>
+#include <asm/unwind_user_types.h>
 
 enum unwind_user_type {
 	UNWIND_USER_TYPE_NONE,
 	UNWIND_USER_TYPE_FP,
+	UNWIND_USER_TYPE_COMPAT_FP,
 };
 
 struct unwind_stacktrace {
@@ -25,6 +27,7 @@ struct unwind_user_state {
 	unsigned long ip;
 	unsigned long sp;
 	unsigned long fp;
+	struct arch_unwind_user_state arch;
 	enum unwind_user_type type;
 	bool done;
 };
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 0671a81494d3..635cc04bb299 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -13,12 +13,32 @@ static struct unwind_user_frame fp_frame = {
 	ARCH_INIT_USER_FP_FRAME
 };
 
+static struct unwind_user_frame compat_fp_frame = {
+	ARCH_INIT_USER_COMPAT_FP_FRAME
+};
+
 static inline bool fp_state(struct unwind_user_state *state)
 {
 	return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP) &&
 	       state->type == UNWIND_USER_TYPE_FP;
 }
 
+static inline bool compat_state(struct unwind_user_state *state)
+{
+	return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) &&
+	       state->type == UNWIND_USER_TYPE_COMPAT_FP;
+}
+
+#define UNWIND_GET_USER_LONG(to, from, state)				\
+({									\
+	int __ret;							\
+	if (compat_state(state))					\
+		__ret = get_user(to, (u32 __user *)(from));		\
+	else								\
+		__ret = get_user(to, (u64 __user *)(from));		\
+	__ret;								\
+})
+
 int unwind_user_next(struct unwind_user_state *state)
 {
 	struct unwind_user_frame _frame;
@@ -28,7 +48,9 @@ int unwind_user_next(struct unwind_user_state *state)
 	if (state->done)
 		return -EINVAL;
 
-	if (fp_state(state))
+	if (compat_state(state))
+		frame = &compat_fp_frame;
+	else if (fp_state(state))
 		frame = &fp_frame;
 	else
 		goto the_end;
@@ -39,10 +61,10 @@ int unwind_user_next(struct unwind_user_state *state)
 	if (cfa <= state->sp)
 		goto the_end;
 
-	if (get_user(ra, (unsigned long __user *)(cfa + frame->ra_off)))
+	if (UNWIND_GET_USER_LONG(ra, cfa + frame->ra_off, state))
 		goto the_end;
 
-	if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
+	if (frame->fp_off && UNWIND_GET_USER_LONG(fp, cfa + frame->fp_off, state))
 		goto the_end;
 
 	state->ip = ra;
@@ -50,6 +72,8 @@ int unwind_user_next(struct unwind_user_state *state)
 	if (frame->fp_off)
 		state->fp = fp;
 
+	arch_unwind_user_next(state);
+
 	return 0;
 
 the_end:
@@ -68,7 +92,9 @@ int unwind_user_start(struct unwind_user_state *state)
 		return -EINVAL;
 	}
 
-	if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+	if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) && in_compat_mode(regs))
+		state->type = UNWIND_USER_TYPE_COMPAT_FP;
+	else if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
 		state->type = UNWIND_USER_TYPE_FP;
 	else
 		state->type = UNWIND_USER_TYPE_NONE;
@@ -77,6 +103,8 @@ int unwind_user_start(struct unwind_user_state *state)
 	state->sp = user_stack_pointer(regs);
 	state->fp = frame_pointer(regs);
 
+	arch_unwind_user_init(state, regs);
+
 	return 0;
 }
 
-- 
2.47.2




* [PATCH v7 06/17] unwind_user/x86: Enable compat mode frame pointer unwinding on x86
  2025-05-02 16:47 [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (4 preceding siblings ...)
  2025-05-02 16:47 ` [PATCH v7 05/17] unwind_user: Add compat mode frame pointer support Steven Rostedt
@ 2025-05-02 16:47 ` Steven Rostedt
  2025-05-02 16:47 ` [PATCH v7 07/17] unwind_user/deferred: Add unwind_deferred_trace() Steven Rostedt
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-05-02 16:47 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim

From: Josh Poimboeuf <jpoimboe@kernel.org>

Use ARCH_INIT_USER_COMPAT_FP_FRAME to describe how frame pointers are
unwound on x86, and implement the hooks needed to add the segment base
addresses.  Enable HAVE_UNWIND_USER_COMPAT_FP if the system has compat
mode compiled in.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 arch/x86/Kconfig                         |  1 +
 arch/x86/include/asm/unwind_user.h       | 50 ++++++++++++++++++++++++
 arch/x86/include/asm/unwind_user_types.h | 17 ++++++++
 3 files changed, 68 insertions(+)
 create mode 100644 arch/x86/include/asm/unwind_user_types.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a6e529dc4550..ee81e06cabca 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -301,6 +301,7 @@ config X86
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UACCESS_VALIDATION		if HAVE_OBJTOOL
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select HAVE_UNWIND_USER_COMPAT_FP	if IA32_EMULATION
 	select HAVE_UNWIND_USER_FP		if X86_64
 	select HAVE_USER_RETURN_NOTIFIER
 	select HAVE_GENERIC_VDSO
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
index 8597857bf896..bb1148111259 100644
--- a/arch/x86/include/asm/unwind_user.h
+++ b/arch/x86/include/asm/unwind_user.h
@@ -2,10 +2,60 @@
 #ifndef _ASM_X86_UNWIND_USER_H
 #define _ASM_X86_UNWIND_USER_H
 
+#include <linux/unwind_user_types.h>
+#include <asm/ptrace.h>
+#include <asm/perf_event.h>
+
 #define ARCH_INIT_USER_FP_FRAME							\
 	.cfa_off	= (s32)sizeof(long) *  2,				\
 	.ra_off		= (s32)sizeof(long) * -1,				\
 	.fp_off		= (s32)sizeof(long) * -2,				\
 	.use_fp		= true,
 
+#ifdef CONFIG_IA32_EMULATION
+
+#define ARCH_INIT_USER_COMPAT_FP_FRAME						\
+	.cfa_off	= (s32)sizeof(u32)  *  2,				\
+	.ra_off		= (s32)sizeof(u32)  * -1,				\
+	.fp_off		= (s32)sizeof(u32)  * -2,				\
+	.use_fp		= true,
+
+#define in_compat_mode(regs) !user_64bit_mode(regs)
+
+static inline void arch_unwind_user_init(struct unwind_user_state *state,
+					 struct pt_regs *regs)
+{
+	unsigned long cs_base, ss_base;
+
+	if (state->type != UNWIND_USER_TYPE_COMPAT_FP)
+		return;
+
+	scoped_guard(irqsave) {
+		cs_base = segment_base_address(regs->cs);
+		ss_base = segment_base_address(regs->ss);
+	}
+
+	state->arch.cs_base = cs_base;
+	state->arch.ss_base = ss_base;
+
+	state->ip += cs_base;
+	state->sp += ss_base;
+	state->fp += ss_base;
+}
+#define arch_unwind_user_init arch_unwind_user_init
+
+static inline void arch_unwind_user_next(struct unwind_user_state *state)
+{
+	if (state->type != UNWIND_USER_TYPE_COMPAT_FP)
+		return;
+
+	state->ip += state->arch.cs_base;
+	state->fp += state->arch.ss_base;
+}
+#define arch_unwind_user_next arch_unwind_user_next
+
+#endif /* CONFIG_IA32_EMULATION */
+
+#include <asm-generic/unwind_user.h>
+
 #endif /* _ASM_X86_UNWIND_USER_H */
diff --git a/arch/x86/include/asm/unwind_user_types.h b/arch/x86/include/asm/unwind_user_types.h
new file mode 100644
index 000000000000..d7074dc5f0ce
--- /dev/null
+++ b/arch/x86/include/asm/unwind_user_types.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_UNWIND_USER_TYPES_H
+#define _ASM_UNWIND_USER_TYPES_H
+
+#ifdef CONFIG_IA32_EMULATION
+
+struct arch_unwind_user_state {
+	unsigned long ss_base;
+	unsigned long cs_base;
+};
+#define arch_unwind_user_state arch_unwind_user_state
+
+#endif /* CONFIG_IA32_EMULATION */
+
+#include <asm-generic/unwind_user_types.h>
+
+#endif /* _ASM_UNWIND_USER_TYPES_H */
-- 
2.47.2




* [PATCH v7 07/17] unwind_user/deferred: Add unwind_deferred_trace()
  2025-05-02 16:47 [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (5 preceding siblings ...)
  2025-05-02 16:47 ` [PATCH v7 06/17] unwind_user/x86: Enable compat mode frame pointer unwinding on x86 Steven Rostedt
@ 2025-05-02 16:47 ` Steven Rostedt
  2025-05-02 16:47 ` [PATCH v7 08/17] unwind_user/deferred: Add unwind cache Steven Rostedt
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-05-02 16:47 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim

From: Steven Rostedt <rostedt@goodmis.org>

Add a function that must be called inside a faultable context and that
will retrieve a user space stack trace. The function
unwind_deferred_trace() can be called by a tracer when a task is about to
enter user space, or has just come back from user space and has interrupts
enabled.
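
A minimal sketch of a caller (hypothetical; record_user_frame() is a
stand-in for whatever the tracer does with the entries):

	struct unwind_stacktrace trace;

	/* Must run in a faultable, task context */
	if (!unwind_deferred_trace(&trace)) {
		unsigned int i;

		for (i = 0; i < trace.nr; i++)
			record_user_frame(trace.entries[i]);
	}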

This code is based on Josh Poimboeuf's deferred unwinding code:

Link: https://lore.kernel.org/all/6052e8487746603bdb29b65f4033e739092d9925.1737511963.git.jpoimboe@kernel.org/

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/sched.h                 |  5 +++
 include/linux/unwind_deferred.h       | 24 ++++++++++++++
 include/linux/unwind_deferred_types.h |  9 +++++
 kernel/fork.c                         |  4 +++
 kernel/unwind/Makefile                |  2 +-
 kernel/unwind/deferred.c              | 48 +++++++++++++++++++++++++++
 6 files changed, 91 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/unwind_deferred.h
 create mode 100644 include/linux/unwind_deferred_types.h
 create mode 100644 kernel/unwind/deferred.c

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4ecc0c6b1cb0..a1e1c07cadfb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -47,6 +47,7 @@
 #include <linux/livepatch_sched.h>
 #include <linux/uidgid_types.h>
 #include <linux/tracepoint-defs.h>
+#include <linux/unwind_deferred_types.h>
 #include <asm/kmap_size.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
@@ -1646,6 +1647,10 @@ struct task_struct {
 	struct user_event_mm		*user_event_mm;
 #endif
 
+#ifdef CONFIG_UNWIND_USER
+	struct unwind_task_info		unwind_info;
+#endif
+
 	/* CPU-specific state of this task: */
 	struct thread_struct		thread;
 
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
new file mode 100644
index 000000000000..5064ebe38c4f
--- /dev/null
+++ b/include/linux/unwind_deferred.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_DEFERRED_H
+#define _LINUX_UNWIND_USER_DEFERRED_H
+
+#include <linux/unwind_user.h>
+#include <linux/unwind_deferred_types.h>
+
+#ifdef CONFIG_UNWIND_USER
+
+void unwind_task_init(struct task_struct *task);
+void unwind_task_free(struct task_struct *task);
+
+int unwind_deferred_trace(struct unwind_stacktrace *trace);
+
+#else /* !CONFIG_UNWIND_USER */
+
+static inline void unwind_task_init(struct task_struct *task) {}
+static inline void unwind_task_free(struct task_struct *task) {}
+
+static inline int unwind_deferred_trace(struct unwind_stacktrace *trace) { return -ENOSYS; }
+
+#endif /* !CONFIG_UNWIND_USER */
+
+#endif /* _LINUX_UNWIND_USER_DEFERRED_H */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
new file mode 100644
index 000000000000..aa32db574e43
--- /dev/null
+++ b/include/linux/unwind_deferred_types.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+
+struct unwind_task_info {
+	unsigned long		*entries;
+};
+
+#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index c4b26cd8998b..8c79c7c2c553 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -105,6 +105,7 @@
 #include <uapi/linux/pidfd.h>
 #include <linux/pidfs.h>
 #include <linux/tick.h>
+#include <linux/unwind_deferred.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -991,6 +992,7 @@ void __put_task_struct(struct task_struct *tsk)
 	WARN_ON(refcount_read(&tsk->usage));
 	WARN_ON(tsk == current);
 
+	unwind_task_free(tsk);
 	sched_ext_free(tsk);
 	io_uring_free(tsk);
 	cgroup_free(tsk);
@@ -2395,6 +2397,8 @@ __latent_entropy struct task_struct *copy_process(
 	p->bpf_ctx = NULL;
 #endif
 
+	unwind_task_init(p);
+
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
 	if (retval)
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
index 349ce3677526..6752ac96d7e2 100644
--- a/kernel/unwind/Makefile
+++ b/kernel/unwind/Makefile
@@ -1 +1 @@
-obj-$(CONFIG_UNWIND_USER) += user.o
+obj-$(CONFIG_UNWIND_USER)		+= user.o deferred.o
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
new file mode 100644
index 000000000000..5a3789e38c00
--- /dev/null
+++ b/kernel/unwind/deferred.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Deferred user space unwinding
+ */
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/unwind_deferred.h>
+
+#define UNWIND_MAX_ENTRIES 512
+
+int unwind_deferred_trace(struct unwind_stacktrace *trace)
+{
+	struct unwind_task_info *info = &current->unwind_info;
+
+	/* Should always be called from faultable context */
+	might_fault();
+
+	if (current->flags & PF_EXITING)
+		return -EINVAL;
+
+	if (!info->entries) {
+		info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
+					      GFP_KERNEL);
+		if (!info->entries)
+			return -ENOMEM;
+	}
+
+	trace->nr = 0;
+	trace->entries = info->entries;
+	unwind_user(trace, UNWIND_MAX_ENTRIES);
+
+	return 0;
+}
+
+void unwind_task_init(struct task_struct *task)
+{
+	struct unwind_task_info *info = &task->unwind_info;
+
+	memset(info, 0, sizeof(*info));
+}
+
+void unwind_task_free(struct task_struct *task)
+{
+	struct unwind_task_info *info = &task->unwind_info;
+
+	kfree(info->entries);
+}
-- 
2.47.2




* [PATCH v7 08/17] unwind_user/deferred: Add unwind cache
  2025-05-02 16:47 [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (6 preceding siblings ...)
  2025-05-02 16:47 ` [PATCH v7 07/17] unwind_user/deferred: Add unwind_deferred_trace() Steven Rostedt
@ 2025-05-02 16:47 ` Steven Rostedt
  2025-05-04  9:37   ` Ingo Molnar
  2025-05-02 16:47 ` [PATCH v7 09/17] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 26+ messages in thread
From: Steven Rostedt @ 2025-05-02 16:47 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim

From: Josh Poimboeuf <jpoimboe@kernel.org>

Cache the results of the unwind to ensure the unwind is only performed
once, even when called by multiple tracers.

The cache's nr_entries gets cleared every time the task exits the kernel.
When a stacktrace is requested, nr_entries gets set to the number of
entries in the stacktrace. If another stacktrace is requested while
nr_entries is not zero, the cache already holds the same stacktrace that
would be retrieved, so the unwind is not performed again and the cached
entries are handed to the caller.
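
In other words, the lifecycle is (a conceptual sketch, not literal code
from this patch):

	/* On exit to user space (unwind_exit_to_user_mode()): */
	cache->nr_entries = 0;

	/* First unwind_deferred_trace() call after entering the kernel: */
	unwind_user(trace, UNWIND_MAX_ENTRIES);
	cache->nr_entries = trace->nr;

	/* Any later request in the same entry context reuses the cache: */
	trace->nr = cache->nr_entries;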

Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/entry-common.h          |  2 ++
 include/linux/unwind_deferred.h       |  7 +++++++
 include/linux/unwind_deferred_types.h |  7 ++++++-
 kernel/unwind/deferred.c              | 27 ++++++++++++++++++++-------
 4 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index f94f3fdf15fc..6e850c9d3f0c 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -12,6 +12,7 @@
 #include <linux/resume_user_mode.h>
 #include <linux/tick.h>
 #include <linux/kmsan.h>
+#include <linux/unwind_deferred.h>
 
 #include <asm/entry-common.h>
 #include <asm/syscall.h>
@@ -362,6 +363,7 @@ static __always_inline void exit_to_user_mode(void)
 	lockdep_hardirqs_on_prepare();
 	instrumentation_end();
 
+	unwind_exit_to_user_mode();
 	user_enter_irqoff();
 	arch_exit_to_user_mode();
 	lockdep_hardirqs_on(CALLER_ADDR0);
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 5064ebe38c4f..c2d760e5e257 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -12,6 +12,11 @@ void unwind_task_free(struct task_struct *task);
 
 int unwind_deferred_trace(struct unwind_stacktrace *trace);
 
+static __always_inline void unwind_exit_to_user_mode(void)
+{
+	current->unwind_info.cache.nr_entries = 0;
+}
+
 #else /* !CONFIG_UNWIND_USER */
 
 static inline void unwind_task_init(struct task_struct *task) {}
@@ -19,6 +24,8 @@ static inline void unwind_task_free(struct task_struct *task) {}
 
 static inline int unwind_deferred_trace(struct unwind_stacktrace *trace) { return -ENOSYS; }
 
+static inline void unwind_exit_to_user_mode(void) {}
+
 #endif /* !CONFIG_UNWIND_USER */
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_H */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index aa32db574e43..b3b7389ee6eb 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -2,8 +2,13 @@
 #ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
 #define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
 
-struct unwind_task_info {
+struct unwind_cache {
 	unsigned long		*entries;
+	unsigned int		nr_entries;
+};
+
+struct unwind_task_info {
+	struct unwind_cache	cache;
 };
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 5a3789e38c00..89ed04b1c527 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -12,6 +12,7 @@
 int unwind_deferred_trace(struct unwind_stacktrace *trace)
 {
 	struct unwind_task_info *info = &current->unwind_info;
+	struct unwind_cache *cache = &info->cache;
 
 	/* Should always be called from faultable context */
 	might_fault();
@@ -19,17 +20,29 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
 	if (current->flags & PF_EXITING)
 		return -EINVAL;
 
-	if (!info->entries) {
-		info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
-					      GFP_KERNEL);
-		if (!info->entries)
-			return -ENOMEM;
+	if (!cache->entries) {
+		cache->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
+					       GFP_KERNEL);
+		if (!cache->entries)
+			return -ENOMEM;
+	}
+
+	trace->entries = cache->entries;
+
+	if (cache->nr_entries) {
+		/*
+		 * The user stack has already been previously unwound in this
+		 * entry context.  Skip the unwind and use the cache.
+		 */
+		trace->nr = cache->nr_entries;
+		return 0;
 	}
 
 	trace->nr = 0;
-	trace->entries = info->entries;
 	unwind_user(trace, UNWIND_MAX_ENTRIES);
 
+	cache->nr_entries = trace->nr;
+
 	return 0;
 }
 
@@ -44,5 +57,5 @@ void unwind_task_free(struct task_struct *task)
 {
 	struct unwind_task_info *info = &task->unwind_info;
 
-	kfree(info->entries);
+	kfree(info->cache.entries);
 }
-- 
2.47.2




* [PATCH v7 09/17] unwind_user/deferred: Add deferred unwinding interface
  2025-05-02 16:47 [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (7 preceding siblings ...)
  2025-05-02 16:47 ` [PATCH v7 08/17] unwind_user/deferred: Add unwind cache Steven Rostedt
@ 2025-05-02 16:47 ` Steven Rostedt
  2025-05-02 16:47 ` [PATCH v7 10/17] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-05-02 16:47 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim

From: Josh Poimboeuf <jpoimboe@kernel.org>

Add an interface for scheduling task work to unwind the user space stack
before returning to user space. This solves several problems for its
callers:

  - Ensure the unwind happens in task context even if the caller may be
    running in NMI or interrupt context.

  - Avoid duplicate unwinds, whether called multiple times by the same
    caller or by different callers.

  - Create a "context cookie" which allows trace post-processing to
    correlate kernel unwinds/traces with the user unwind.

The concept of a "cookie" is introduced to detect when the stacktrace
would be the same. A cookie is generated the first time a user space
stacktrace is requested after the task enters the kernel. As the
stacktrace is saved on the task_struct while the task is in the kernel, if
another request comes in and the cookie is still the same, the saved
stacktrace is used and does not have to be regenerated.

The cookie is passed to the caller on request, and when the stacktrace is
generated upon returning to user space, the requester's callback is called
with the cookie as well as the stacktrace.
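
A hedged sketch of how a tracer might use the interface (the callback and
the record_*() helpers are hypothetical stand-ins):

	static void my_unwind_cb(struct unwind_work *work,
				 struct unwind_stacktrace *trace, u64 cookie)
	{
		unsigned int i;

		/* Runs in task context just before returning to user space */
		for (i = 0; i < trace->nr; i++)
			record_user_frame(cookie, trace->entries[i]);
	}

	static struct unwind_work my_work;

	/* Once, at tracer init: */
	unwind_deferred_init(&my_work, my_unwind_cb);

	/* Later, from the event context (e.g. an interrupt): */
	u64 cookie;

	if (unwind_deferred_request(&my_work, &cookie) >= 0)
		record_kernel_sample(cookie);	/* stitch by cookie later */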

Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v6: https://lore.kernel.org/20250424192612.505622711@goodmis.org

- Have unwind_deferred_request() return positive if already queued

- Check (current->flags & (PF_KTHREAD | PF_EXITING)) in
  unwind_deferred_request(), as the task_work will fail to be added in the
  exit code.

 include/linux/unwind_deferred.h       |  18 +++
 include/linux/unwind_deferred_types.h |   3 +
 kernel/unwind/deferred.c              | 165 +++++++++++++++++++++++++-
 3 files changed, 185 insertions(+), 1 deletion(-)

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index c2d760e5e257..d36784cae658 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -2,9 +2,19 @@
 #ifndef _LINUX_UNWIND_USER_DEFERRED_H
 #define _LINUX_UNWIND_USER_DEFERRED_H
 
+#include <linux/task_work.h>
 #include <linux/unwind_user.h>
 #include <linux/unwind_deferred_types.h>
 
+struct unwind_work;
+
+typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stacktrace *trace, u64 cookie);
+
+struct unwind_work {
+	struct list_head		list;
+	unwind_callback_t		func;
+};
+
 #ifdef CONFIG_UNWIND_USER
 
 void unwind_task_init(struct task_struct *task);
@@ -12,9 +22,14 @@ void unwind_task_free(struct task_struct *task);
 
 int unwind_deferred_trace(struct unwind_stacktrace *trace);
 
+int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
+int unwind_deferred_request(struct unwind_work *work, u64 *cookie);
+void unwind_deferred_cancel(struct unwind_work *work);
+
 static __always_inline void unwind_exit_to_user_mode(void)
 {
 	current->unwind_info.cache.nr_entries = 0;
+	current->unwind_info.cookie = 0;
 }
 
 #else /* !CONFIG_UNWIND_USER */
@@ -23,6 +38,9 @@ static inline void unwind_task_init(struct task_struct *task) {}
 static inline void unwind_task_free(struct task_struct *task) {}
 
 static inline int unwind_deferred_trace(struct unwind_stacktrace *trace) { return -ENOSYS; }
+static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func) { return -ENOSYS; }
+static inline int unwind_deferred_request(struct unwind_work *work, u64 *cookie) { return -ENOSYS; }
+static inline void unwind_deferred_cancel(struct unwind_work *work) {}
 
 static inline void unwind_exit_to_user_mode(void) {}
 
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index b3b7389ee6eb..33373c32c221 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -9,6 +9,9 @@ struct unwind_cache {
 
 struct unwind_task_info {
 	struct unwind_cache	cache;
+	u64			cookie;
+	struct callback_head	work;
+	int			pending;
 };
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 89ed04b1c527..b93ad97daf94 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -2,13 +2,72 @@
 /*
  * Deferred user space unwinding
  */
+#include <linux/sched/task_stack.h>
+#include <linux/unwind_deferred.h>
+#include <linux/task_work.h>
 #include <linux/kernel.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
-#include <linux/unwind_deferred.h>
+#include <linux/mm.h>
 
 #define UNWIND_MAX_ENTRIES 512
 
+/*
+ * This is a unique percpu identifier for a given task entry context.
+ * Conceptually, it's incremented every time the CPU enters the kernel from
+ * user space, so that each "entry context" on the CPU gets a unique ID.  In
+ * reality, as an optimization, it's only incremented on demand for the first
+ * deferred unwind request after a given entry-from-user.
+ *
+ * It's combined with the CPU id to make a systemwide-unique "context cookie".
+ */
+static DEFINE_PER_CPU(u64, unwind_ctx_ctr);
+
+/* Guards adding to and reading the list of callbacks */
+static DEFINE_MUTEX(callback_mutex);
+static LIST_HEAD(callbacks);
+
+/*
+ * The context cookie is a unique identifier that is assigned to a user
+ * space stacktrace. As the user space stacktrace remains the same while
+ * the task is in the kernel, the cookie is an identifier for the stacktrace.
+ * It is possible, however, for the stacktrace to get another cookie if another
+ * request is made after the cookie was cleared and before reentering user
+ * space.
+ *
+ * The high 16 bits are the CPU id; the lower 48 bits are a per-CPU entry
+ * counter shifted left by one and or'd with 1 (to prevent it from ever being
+ * zero).
+ */
+static u64 ctx_to_cookie(u64 cpu, u64 ctx)
+{
+	BUILD_BUG_ON(NR_CPUS > 65535);
+	return ((ctx << 1) & ((1UL << 48) - 1)) | (cpu << 48) | 1;
+}
+
+/*
+ * Read the task context cookie, first initializing it if this is the first
+ * call to get_cookie() since the most recent entry from user.
+ */
+static u64 get_cookie(struct unwind_task_info *info)
+{
+	u64 ctx_ctr;
+	u64 cookie;
+	u64 cpu;
+
+	guard(irqsave)();
+
+	cookie = info->cookie;
+	if (cookie)
+		return cookie;
+
+	cpu = raw_smp_processor_id();
+	ctx_ctr = __this_cpu_inc_return(unwind_ctx_ctr);
+	info->cookie = ctx_to_cookie(cpu, ctx_ctr);
+
+	return info->cookie;
+}
+
 int unwind_deferred_trace(struct unwind_stacktrace *trace)
 {
 	struct unwind_task_info *info = &current->unwind_info;
@@ -46,11 +105,114 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
 	return 0;
 }
 
+static void unwind_deferred_task_work(struct callback_head *head)
+{
+	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
+	struct unwind_stacktrace trace;
+	struct unwind_work *work;
+	u64 cookie;
+
+	if (WARN_ON_ONCE(!info->pending))
+		return;
+
+	/* Allow work to come in again */
+	WRITE_ONCE(info->pending, 0);
+
+	/*
+	 * From here on out, the callback must always be called, even if it's
+	 * just an empty trace.
+	 */
+	trace.nr = 0;
+	trace.entries = NULL;
+
+	unwind_deferred_trace(&trace);
+
+	cookie = get_cookie(info);
+
+	guard(mutex)(&callback_mutex);
+	list_for_each_entry(work, &callbacks, list) {
+		work->func(work, &trace, cookie);
+	}
+	barrier();
+	/* If another task work is pending, reuse the cookie and stack trace */
+	if (!READ_ONCE(info->pending))
+		WRITE_ONCE(info->cookie, 0);
+}
+
+/*
+ * Schedule a user space unwind to be done in task work before exiting the
+ * kernel.
+ *
+ * The cookie output is a unique identifier for the current task entry
+ * context.  Its value will also be passed to the callback function.  It can be
+ * used to stitch kernel and user stack traces together in post-processing.
+ *
+ * It's valid to call this function multiple times for the same @work within
+ * the same task entry context.  Each call will return the same cookie.
+ * If the callback is not pending because it has already been previously called
+ * for the same entry context, it will be called again with the same stack trace
+ * and cookie.
+ *
+ * Returns 1 if the callback was already queued.
+ *         0 if the callback will be called when the task exits to user space.
+ *         Negative if there's an error.
+ */
+int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
+{
+	struct unwind_task_info *info = &current->unwind_info;
+	int ret;
+
+	*cookie = 0;
+
+	if (WARN_ON_ONCE(in_nmi()))
+		return -EINVAL;
+
+	if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
+	    !user_mode(task_pt_regs(current)))
+		return -EINVAL;
+
+	guard(irqsave)();
+
+	*cookie = get_cookie(info);
+
+	/* callback already pending? */
+	if (info->pending)
+		return 1;
+
+	/* The work has been claimed, now schedule it. */
+	ret = task_work_add(current, &info->work, TWA_RESUME);
+	if (WARN_ON_ONCE(ret))
+		return ret;
+
+	info->pending = 1;
+	return 0;
+}
+
+void unwind_deferred_cancel(struct unwind_work *work)
+{
+	if (!work)
+		return;
+
+	guard(mutex)(&callback_mutex);
+	list_del(&work->list);
+}
+
+int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
+{
+	memset(work, 0, sizeof(*work));
+
+	guard(mutex)(&callback_mutex);
+	list_add(&work->list, &callbacks);
+	work->func = func;
+	return 0;
+}
+
 void unwind_task_init(struct task_struct *task)
 {
 	struct unwind_task_info *info = &task->unwind_info;
 
 	memset(info, 0, sizeof(*info));
+	init_task_work(&info->work, unwind_deferred_task_work);
 }
 
 void unwind_task_free(struct task_struct *task)
@@ -58,4 +220,5 @@ void unwind_task_free(struct task_struct *task)
 	struct unwind_task_info *info = &task->unwind_info;
 
 	kfree(info->cache.entries);
+	task_work_cancel(task, &info->work);
 }
-- 
2.47.2


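For reference, the cookie layout that ctx_to_cookie() produces can be read
back as in the sketch below. The decode helpers are hypothetical and not
part of the patch; only the bit layout comes from ctx_to_cookie() above.

/*
 * Hypothetical inverse of ctx_to_cookie() (not in the patch):
 * bits 63..48 hold the CPU number, bits 47..1 hold the low bits of the
 * per-CPU context counter, and bit 0 is always set so that a valid
 * cookie is never zero.
 */
static inline u64 cookie_to_cpu(u64 cookie)
{
	return cookie >> 48;
}

static inline u64 cookie_to_ctx(u64 cookie)
{
	return (cookie & ((1UL << 48) - 1)) >> 1;
}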

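And a minimal sketch of how a tracer might consume this interface. The
"my_tracer" names below are made up; only unwind_deferred_init(),
unwind_deferred_request() and the callback signature come from the patch.

static struct unwind_work my_unwind_work;

/* Called from task_work with the user stack trace before returning to user space */
static void my_tracer_callback(struct unwind_work *work,
			       struct unwind_stacktrace *trace, u64 cookie)
{
	/*
	 * Record trace->entries[0 .. trace->nr - 1], tagged with @cookie,
	 * so post-processing can stitch it to earlier kernel stack traces
	 * that recorded the same cookie.
	 */
}

static int my_tracer_start(void)
{
	return unwind_deferred_init(&my_unwind_work, my_tracer_callback);
}

/* In the tracer's event hook (not NMI context until the next patch): */
static void my_tracer_event(void)
{
	u64 cookie;
	int ret;

	ret = unwind_deferred_request(&my_unwind_work, &cookie);
	/*
	 * ret < 0:  error (kernel thread, exiting task, ...)
	 * ret > 0:  already queued for this entry context; @cookie matches
	 *           what the pending callback will receive.
	 * ret == 0: the callback was newly scheduled via task_work.
	 */
}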
^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 10/17] unwind_user/deferred: Make unwind deferral requests NMI-safe
  2025-05-02 16:47 [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (8 preceding siblings ...)
  2025-05-02 16:47 ` [PATCH v7 09/17] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
@ 2025-05-02 16:47 ` Steven Rostedt
  2025-05-02 16:47 ` [PATCH v7 11/17] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-05-02 16:47 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim

From: Josh Poimboeuf <jpoimboe@kernel.org>

Make unwind_deferred_request() NMI-safe so tracers in NMI context can
call it to get the cookie immediately rather than having to do the fragile
"schedule irq work and then call unwind_deferred_request()" dance.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v6: https://lore.kernel.org/20250424192612.669992559@goodmis.org

- Have unwind_deferred_request() return positive if already queued.

 include/linux/unwind_deferred_types.h |   1 +
 kernel/unwind/deferred.c              | 100 ++++++++++++++++++++++----
 2 files changed, 89 insertions(+), 12 deletions(-)

diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 33373c32c221..8f47d77ddda0 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -10,6 +10,7 @@ struct unwind_cache {
 struct unwind_task_info {
 	struct unwind_cache	cache;
 	u64			cookie;
+	u64			nmi_cookie;
 	struct callback_head	work;
 	int			pending;
 };
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index b93ad97daf94..d86ea82a8915 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -47,23 +47,47 @@ static u64 ctx_to_cookie(u64 cpu, u64 ctx)
 
 /*
  * Read the task context cookie, first initializing it if this is the first
- * call to get_cookie() since the most recent entry from user.
+ * call to get_cookie() since the most recent entry from user.  This has to be
+ * done carefully to coordinate with unwind_deferred_request_nmi().
  */
 static u64 get_cookie(struct unwind_task_info *info)
 {
 	u64 ctx_ctr;
 	u64 cookie;
-	u64 cpu;
 
 	guard(irqsave)();
 
-	cookie = info->cookie;
+	cookie = READ_ONCE(info->cookie);
 	if (cookie)
 		return cookie;
 
-	cpu = raw_smp_processor_id();
-	ctx_ctr = __this_cpu_inc_return(unwind_ctx_ctr);
-	info->cookie = ctx_to_cookie(cpu, ctx_ctr);
+	ctx_ctr = __this_cpu_read(unwind_ctx_ctr);
+
+	/* Read ctx_ctr before info->nmi_cookie */
+	barrier();
+
+	cookie = READ_ONCE(info->nmi_cookie);
+	if (cookie) {
+		/*
+		 * This is the first call to get_cookie() since an NMI handler
+		 * first wrote it to info->nmi_cookie.  Sync it.
+		 */
+		WRITE_ONCE(info->cookie, cookie);
+		WRITE_ONCE(info->nmi_cookie, 0);
+		return cookie;
+	}
+
+	/*
+	 * Write info->cookie.  It's ok to race with an NMI here.  The value of
+	 * the cookie is based on ctx_ctr from before the NMI could have
+	 * incremented it.  The result will be the same even if cookie or
+	 * ctx_ctr end up getting written twice.
+	 */
+	cookie = ctx_to_cookie(raw_smp_processor_id(), ctx_ctr + 1);
+	WRITE_ONCE(info->cookie, cookie);
+	WRITE_ONCE(info->nmi_cookie, 0);
+	barrier();
+	__this_cpu_write(unwind_ctx_ctr, ctx_ctr + 1);
 
 	return info->cookie;
 }
@@ -139,6 +163,51 @@ static void unwind_deferred_task_work(struct callback_head *head)
 		WRITE_ONCE(info->cookie, 0);
 }
 
+static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *cookie)
+{
+	struct unwind_task_info *info = &current->unwind_info;
+	bool inited_cookie = false;
+	int ret;
+
+	*cookie = info->cookie;
+	if (!*cookie) {
+		/*
+		 * This is the first unwind request since the most recent entry
+		 * from user.  Initialize the task cookie.
+		 *
+		 * Don't write to info->cookie directly, otherwise it may get
+		 * cleared if the NMI occurred in the kernel during early entry
+		 * or late exit before the task work gets to run.  Instead, use
+		 * info->nmi_cookie which gets synced later by get_cookie().
+		 */
+		if (!info->nmi_cookie) {
+			u64 cpu = raw_smp_processor_id();
+			u64 ctx_ctr;
+
+			ctx_ctr = __this_cpu_inc_return(unwind_ctx_ctr);
+			info->nmi_cookie = ctx_to_cookie(cpu, ctx_ctr);
+
+			inited_cookie = true;
+		}
+
+		*cookie = info->nmi_cookie;
+	}
+
+	if (info->pending)
+		return 1;
+
+	ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
+	if (ret) {
+		if (inited_cookie)
+			info->nmi_cookie = 0;
+		return ret;
+	}
+
+	info->pending = 1;
+
+	return 0;
+}
+
 /*
  * Schedule a user space unwind to be done in task work before exiting the
  * kernel.
@@ -160,31 +229,38 @@ static void unwind_deferred_task_work(struct callback_head *head)
 int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 {
 	struct unwind_task_info *info = &current->unwind_info;
+	int pending;
 	int ret;
 
 	*cookie = 0;
 
-	if (WARN_ON_ONCE(in_nmi()))
-		return -EINVAL;
-
 	if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
 	    !user_mode(task_pt_regs(current)))
 		return -EINVAL;
 
+	if (in_nmi())
+		return unwind_deferred_request_nmi(work, cookie);
+
 	guard(irqsave)();
 
 	*cookie = get_cookie(info);
 
 	/* callback already pending? */
-	if (info->pending)
+	pending = READ_ONCE(info->pending);
+	if (pending)
+		return 1;
+
+	/* Claim the work unless an NMI just now swooped in to do so. */
+	if (!try_cmpxchg(&info->pending, &pending, 1))
 		return 1;
 
 	/* The work has been claimed, now schedule it. */
 	ret = task_work_add(current, &info->work, TWA_RESUME);
-	if (WARN_ON_ONCE(ret))
+	if (WARN_ON_ONCE(ret)) {
+		WRITE_ONCE(info->pending, 0);
 		return ret;
+	}
 
-	info->pending = 1;
 	return 0;
 }
 
-- 
2.47.2


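The heart of the NMI safety above is that normal context must atomically
claim info->pending, since an NMI can land between its read and its write.
Below is a standalone illustration of the claim pattern in plain C with
compiler atomics; it is not kernel code, just the same idea in miniature.

#include <stdatomic.h>
#include <stdbool.h>

static _Atomic int pending;

/*
 * Mirrors try_cmpxchg(&info->pending, &pending, 1): only one context
 * (the task, or the NMI that interrupted it) wins the claim and gets
 * to schedule the task_work; the loser sees it already pending.
 */
static bool claim_pending(void)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&pending, &expected, 1);
}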

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 11/17] unwind deferred: Use bitmask to determine which callbacks to call
  2025-05-02 16:47 [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (9 preceding siblings ...)
  2025-05-02 16:47 ` [PATCH v7 10/17] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
@ 2025-05-02 16:47 ` Steven Rostedt
  2025-05-09  1:36   ` Steven Rostedt
  2025-05-02 16:47 ` [PATCH v7 12/17] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 26+ messages in thread
From: Steven Rostedt @ 2025-05-02 16:47 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim

From: Steven Rostedt <rostedt@goodmis.org>

In order to know which registered callback requested a stacktrace before
the task goes back to user space, add a bitmask for all registered
tracers. The bitmask is the size of a long, which means that on a 32-bit
machine it can have at most 32 registered tracers, and on a 64-bit machine
at most 64 registered tracers. This should not be an issue as there
should not be more than 10 (unless BPF can abuse this?).

When a tracer registers with unwind_deferred_init() it will get a bit
number assigned to it. When a tracer requests a stacktrace, it will have
its bit set within the task_struct. When the task returns to user
space, the callbacks will be called for all the registered tracers whose
bits are set in the task's mask.

When a tracer is removed by unwind_deferred_cancel(), the associated bit
is cleared in all current tasks, just in case another tracer gets
registered immediately afterward and would otherwise get its callback
called unexpectedly.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v6: https://lore.kernel.org/20250424192612.844558089@goodmis.org

- Have unwind_deferred_request() return positive if already queued.

 include/linux/sched.h           |  1 +
 include/linux/unwind_deferred.h |  1 +
 kernel/unwind/deferred.c        | 46 ++++++++++++++++++++++++++++-----
 3 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a1e1c07cadfb..d3ee0c5405d6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1649,6 +1649,7 @@ struct task_struct {
 
 #ifdef CONFIG_UNWIND_USER
 	struct unwind_task_info		unwind_info;
+	unsigned long			unwind_mask;
 #endif
 
 	/* CPU-specific state of this task: */
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index d36784cae658..719a7cfb3164 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -13,6 +13,7 @@ typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stackt
 struct unwind_work {
 	struct list_head		list;
 	unwind_callback_t		func;
+	int				bit;
 };
 
 #ifdef CONFIG_UNWIND_USER
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index d86ea82a8915..716393dff810 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -26,6 +26,7 @@ static DEFINE_PER_CPU(u64, unwind_ctx_ctr);
 /* Guards adding to and reading the list of callbacks */
 static DEFINE_MUTEX(callback_mutex);
 static LIST_HEAD(callbacks);
+static unsigned long unwind_mask;
 
 /*
  * The context cookie is a unique identifier that is assigned to a user
@@ -134,6 +135,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
 	struct unwind_stacktrace trace;
 	struct unwind_work *work;
+	struct task_struct *task = current;
 	u64 cookie;
 
 	if (WARN_ON_ONCE(!info->pending))
@@ -155,7 +157,10 @@ static void unwind_deferred_task_work(struct callback_head *head)
 
 	guard(mutex)(&callback_mutex);
 	list_for_each_entry(work, &callbacks, list) {
-		work->func(work, &trace, cookie);
+		if (task->unwind_mask & (1UL << work->bit)) {
+			work->func(work, &trace, cookie);
+			clear_bit(work->bit, &current->unwind_mask);
+		}
 	}
 	barrier();
 	/* If another task work is pending, reuse the cookie and stack trace */
@@ -193,9 +198,12 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *cookie)
 		*cookie = info->nmi_cookie;
 	}
 
-	if (info->pending)
+	if (current->unwind_mask & (1UL << work->bit))
 		return 1;
 
+	if (info->pending)
+		goto out;
+
 	ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
 	if (ret) {
 		if (inited_cookie)
@@ -204,8 +212,8 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *cookie)
 	}
 
 	info->pending = 1;
-
-	return 0;
+out:
+	return test_and_set_bit(work->bit, &current->unwind_mask);
 }
 
 /*
@@ -245,14 +253,18 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 
 	*cookie = get_cookie(info);
 
+	/* This is already queued */
+	if (current->unwind_mask & (1UL << work->bit))
+		return 1;
+
 	/* callback already pending? */
 	pending = READ_ONCE(info->pending);
 	if (pending)
-		return 1;
+		goto out;
 
 	/* Claim the work unless an NMI just now swooped in to do so. */
 	if (!try_cmpxchg(&info->pending, &pending, 1))
-		return 1;
+		goto out;
 
 	/* The work has been claimed, now schedule it. */
 	ret = task_work_add(current, &info->work, TWA_RESUME);
@@ -261,16 +273,27 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 		return ret;
 	}
 
-	return 0;
+ out:
+	return test_and_set_bit(work->bit, &current->unwind_mask);
 }
 
 void unwind_deferred_cancel(struct unwind_work *work)
 {
+	struct task_struct *g, *t;
+
 	if (!work)
 		return;
 
 	guard(mutex)(&callback_mutex);
 	list_del(&work->list);
+
+	clear_bit(work->bit, &unwind_mask);
+
+	guard(rcu)();
+	/* Clear this bit from all threads */
+	for_each_process_thread(g, t) {
+		clear_bit(work->bit, &t->unwind_mask);
+	}
 }
 
 int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
@@ -278,6 +301,14 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 	memset(work, 0, sizeof(*work));
 
 	guard(mutex)(&callback_mutex);
+
+	/* See if there's a bit in the mask available */
+	if (unwind_mask == ~0UL)
+		return -EBUSY;
+
+	work->bit = ffz(unwind_mask);
+	unwind_mask |= 1UL << work->bit;
+
 	list_add(&work->list, &callbacks);
 	work->func = func;
 	return 0;
@@ -289,6 +320,7 @@ void unwind_task_init(struct task_struct *task)
 
 	memset(info, 0, sizeof(*info));
 	init_task_work(&info->work, unwind_deferred_task_work);
+	task->unwind_mask = 0;
 }
 
 void unwind_task_free(struct task_struct *task)
-- 
2.47.2


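A standalone illustration of the bit allocation that unwind_deferred_init()
performs with ffz(). This is plain C, not kernel code; the open-coded loop
stands in for ffz().

#include <stdio.h>

static unsigned long unwind_mask;	/* which tracer bits are taken */

static int alloc_tracer_bit(void)
{
	unsigned long bit;

	if (unwind_mask == ~0UL)
		return -1;		/* no free slot (-EBUSY in the patch) */

	/* find the first zero bit, as ffz() does */
	for (bit = 0; unwind_mask & (1UL << bit); bit++)
		;

	unwind_mask |= 1UL << bit;
	return bit;
}

int main(void)
{
	printf("tracer A gets bit %d\n", alloc_tracer_bit());	/* 0 */
	printf("tracer B gets bit %d\n", alloc_tracer_bit());	/* 1 */
	return 0;
}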

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 12/17] unwind deferred: Use SRCU unwind_deferred_task_work()
  2025-05-02 16:47 [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (10 preceding siblings ...)
  2025-05-02 16:47 ` [PATCH v7 11/17] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
@ 2025-05-02 16:47 ` Steven Rostedt
  2025-05-02 16:47 ` [PATCH v7 13/17] perf: Remove get_perf_callchain() init_nr argument Steven Rostedt
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-05-02 16:47 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim

From: Steven Rostedt <rostedt@goodmis.org>

Instead of using the callback_mutex to protect the linked list of callbacks
in unwind_deferred_task_work(), use SRCU. This function gets called every
time a task with a requested stack trace exits back to user space.
This can happen for many tasks on several CPUs at the same time. A mutex
is a bottleneck and can cause a bit of contention and slow down performance.

As the callbacks themselves are allowed to sleep, regular RCU cannot be
used to protect the list. Instead use SRCU, as that still allows the
callbacks to sleep and the list can be read without needing to hold the
callback_mutex.

Link: https://lore.kernel.org/all/ca9bd83a-6c80-4ee0-a83c-224b9d60b755@efficios.com/

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/unwind/deferred.c | 35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 716393dff810..5f98ac5e3a1b 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -23,10 +23,11 @@
  */
 static DEFINE_PER_CPU(u64, unwind_ctx_ctr);
 
-/* Guards adding to and reading the list of callbacks */
+/* Guards adding to or removing from the list of callbacks */
 static DEFINE_MUTEX(callback_mutex);
 static LIST_HEAD(callbacks);
 static unsigned long unwind_mask;
+DEFINE_STATIC_SRCU(unwind_srcu);
 
 /*
  * The context cookie is a unique identifier that is assigned to a user
@@ -137,6 +138,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	struct unwind_work *work;
 	struct task_struct *task = current;
 	u64 cookie;
+	int idx;
 
 	if (WARN_ON_ONCE(!info->pending))
 		return;
@@ -155,14 +157,16 @@ static void unwind_deferred_task_work(struct callback_head *head)
 
 	cookie = get_cookie(info);
 
-	guard(mutex)(&callback_mutex);
-	list_for_each_entry(work, &callbacks, list) {
+	idx = srcu_read_lock(&unwind_srcu);
+	list_for_each_entry_srcu(work, &callbacks, list,
+				 srcu_read_lock_held(&unwind_srcu)) {
 		if (task->unwind_mask & (1UL << work->bit)) {
 			work->func(work, &trace, cookie);
 			clear_bit(work->bit, &current->unwind_mask);
 		}
 	}
-	barrier();
+	srcu_read_unlock(&unwind_srcu, idx);
+
 	/* If another task work is pending, reuse the cookie and stack trace */
 	if (!READ_ONCE(info->pending))
 		WRITE_ONCE(info->cookie, 0);
@@ -238,6 +242,7 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 {
 	struct unwind_task_info *info = &current->unwind_info;
 	int pending;
+	int bit;
 	int ret;
 
 	*cookie = 0;
@@ -249,12 +254,17 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 	if (in_nmi())
 		return unwind_deferred_request_nmi(work, cookie);
 
+	/* Do not allow cancelled works to request again */
+	bit = READ_ONCE(work->bit);
+	if (WARN_ON_ONCE(bit < 0))
+		return -EINVAL;
+
 	guard(irqsave)();
 
 	*cookie = get_cookie(info);
 
 	/* This is already queued */
-	if (current->unwind_mask & (1UL << work->bit))
+	if (current->unwind_mask & (1UL << bit))
 		return 1;
 
 	/* callback already pending? */
@@ -280,19 +290,26 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 void unwind_deferred_cancel(struct unwind_work *work)
 {
 	struct task_struct *g, *t;
+	int bit;
 
 	if (!work)
 		return;
 
 	guard(mutex)(&callback_mutex);
-	list_del(&work->list);
+	list_del_rcu(&work->list);
+	bit = work->bit;
+
+	/* Do not allow any more requests and prevent callbacks */
+	work->bit = -1;
+
+	clear_bit(bit, &unwind_mask);
 
-	clear_bit(work->bit, &unwind_mask);
+	synchronize_srcu(&unwind_srcu);
 
 	guard(rcu)();
 	/* Clear this bit from all threads */
 	for_each_process_thread(g, t) {
-		clear_bit(work->bit, &t->unwind_mask);
+		clear_bit(bit, &t->unwind_mask);
 	}
 }
 
@@ -309,7 +326,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 	work->bit = ffz(unwind_mask);
 	unwind_mask |= 1UL << work->bit;
 
-	list_add(&work->list, &callbacks);
+	list_add_rcu(&work->list, &callbacks);
 	work->func = func;
 	return 0;
 }
-- 
2.47.2


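Condensed, the reader/updater split after this patch looks like the sketch
below. Details and declarations are elided; these are the kernel SRCU
primitives used in the diff above, not new code.

	/* Reader side: the task_work callback walk, may sleep in work->func() */
	idx = srcu_read_lock(&unwind_srcu);
	list_for_each_entry_srcu(work, &callbacks, list,
				 srcu_read_lock_held(&unwind_srcu))
		work->func(work, &trace, cookie);
	srcu_read_unlock(&unwind_srcu, idx);

	/* Updater side: registration/removal, still serialized by the mutex */
	guard(mutex)(&callback_mutex);
	list_del_rcu(&work->list);
	synchronize_srcu(&unwind_srcu);	/* wait for all current readers */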

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 13/17] perf: Remove get_perf_callchain() init_nr argument
  2025-05-02 16:47 [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (11 preceding siblings ...)
  2025-05-02 16:47 ` [PATCH v7 12/17] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
@ 2025-05-02 16:47 ` Steven Rostedt
  2025-05-02 16:48 ` [PATCH v7 14/17] perf: Have get_perf_callchain() return NULL if crosstask and user are set Steven Rostedt
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-05-02 16:47 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim, Namhyung Kim

From: Josh Poimboeuf <jpoimboe@kernel.org>

The 'init_nr' argument has double duty: it's used to initialize both the
number of contexts and the number of stack entries.  That's confusing
and the callers always pass zero anyway.  Hard code the zero.

Acked-by: Namhyung Kim <Namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/perf_event.h |  2 +-
 kernel/bpf/stackmap.c      |  4 ++--
 kernel/events/callchain.c  | 12 ++++++------
 kernel/events/core.c       |  2 +-
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 947ad12dfdbe..3cc0b0ea0afa 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1651,7 +1651,7 @@ DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
 extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern struct perf_callchain_entry *
-get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
+get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		   u32 max_stack, bool crosstask, bool add_mark);
 extern int get_callchain_buffers(int max_stack);
 extern void put_callchain_buffers(void);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 3615c06b7dfa..ec3a57a5fba1 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -314,7 +314,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	if (max_depth > sysctl_perf_event_max_stack)
 		max_depth = sysctl_perf_event_max_stack;
 
-	trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
+	trace = get_perf_callchain(regs, kernel, user, max_depth,
 				   false, false);
 
 	if (unlikely(!trace))
@@ -451,7 +451,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 	else if (kernel && task)
 		trace = get_callchain_entry_for_task(task, max_depth);
 	else
-		trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
+		trace = get_perf_callchain(regs, kernel, user, max_depth,
 					   crosstask, false);
 
 	if (unlikely(!trace) || trace->nr < skip) {
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 6c83ad674d01..b0f5bd228cd8 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -217,7 +217,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
 }
 
 struct perf_callchain_entry *
-get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
+get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		   u32 max_stack, bool crosstask, bool add_mark)
 {
 	struct perf_callchain_entry *entry;
@@ -228,11 +228,11 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
 	if (!entry)
 		return NULL;
 
-	ctx.entry     = entry;
-	ctx.max_stack = max_stack;
-	ctx.nr	      = entry->nr = init_nr;
-	ctx.contexts       = 0;
-	ctx.contexts_maxed = false;
+	ctx.entry		= entry;
+	ctx.max_stack		= max_stack;
+	ctx.nr			= entry->nr = 0;
+	ctx.contexts		= 0;
+	ctx.contexts_maxed	= false;
 
 	if (kernel && !user_mode(regs)) {
 		if (add_mark)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 05136e835042..0bac8cae08a8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8110,7 +8110,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 	if (!kernel && !user)
 		return &__empty_callchain;
 
-	callchain = get_perf_callchain(regs, 0, kernel, user,
+	callchain = get_perf_callchain(regs, kernel, user,
 				       max_stack, crosstask, true);
 	return callchain ?: &__empty_callchain;
 }
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 14/17] perf: Have get_perf_callchain() return NULL if crosstask and user are set
  2025-05-02 16:47 [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (12 preceding siblings ...)
  2025-05-02 16:47 ` [PATCH v7 13/17] perf: Remove get_perf_callchain() init_nr argument Steven Rostedt
@ 2025-05-02 16:48 ` Steven Rostedt
  2025-05-02 16:48 ` [PATCH v7 15/17] perf: Use current->flags & PF_KTHREAD instead of current->mm == NULL Steven Rostedt
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-05-02 16:48 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim

From: Josh Poimboeuf <jpoimboe@kernel.org>

get_perf_callchain() doesn't support cross-task unwinding for user space
stacks, so have it return NULL if both the crosstask and user arguments
are set.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/events/callchain.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index b0f5bd228cd8..abf258913ab6 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -224,6 +224,10 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 	struct perf_callchain_entry_ctx ctx;
 	int rctx, start_entry_idx;
 
+	/* crosstask is not supported for user stacks */
+	if (crosstask && user)
+		return NULL;
+
 	entry = get_callchain_entry(&rctx);
 	if (!entry)
 		return NULL;
@@ -249,9 +253,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		}
 
 		if (regs) {
-			if (crosstask)
-				goto exit_put;
-
 			if (add_mark)
 				perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
 
@@ -261,7 +262,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		}
 	}
 
-exit_put:
 	put_callchain_entry(rctx);
 
 	return entry;
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 15/17] perf: Use current->flags & PF_KTHREAD instead of current->mm == NULL
  2025-05-02 16:47 [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (13 preceding siblings ...)
  2025-05-02 16:48 ` [PATCH v7 14/17] perf: Have get_perf_callchain() return NULL if crosstask and user are set Steven Rostedt
@ 2025-05-02 16:48 ` Steven Rostedt
  2025-05-02 16:48 ` [PATCH v7 16/17] perf: Simplify get_perf_callchain() user logic Steven Rostedt
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-05-02 16:48 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim

From: Steven Rostedt <rostedt@goodmis.org>

To determine if a task is a kernel thread or not, it is more reliable to
use (current->flags & PF_KTHREAD) than to rely on current->mm being NULL.
That is because some kernel tasks (io_uring helpers) may have a non-NULL mm.

Link: https://lore.kernel.org/linux-trace-kernel/20250424163607.GE18306@noisy.programming.kicks-ass.net/

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/events/callchain.c | 6 +++---
 kernel/events/core.c      | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index abf258913ab6..cda145dc11bd 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -246,10 +246,10 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 
 	if (user) {
 		if (!user_mode(regs)) {
-			if  (current->mm)
-				regs = task_pt_regs(current);
-			else
+			if (current->flags & PF_KTHREAD)
 				regs = NULL;
+			else
+				regs = task_pt_regs(current);
 		}
 
 		if (regs) {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0bac8cae08a8..7c7d5a27c568 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7989,7 +7989,7 @@ static u64 perf_virt_to_phys(u64 virt)
 		 * Try IRQ-safe get_user_page_fast_only first.
 		 * If failed, leave phys_addr as 0.
 		 */
-		if (current->mm != NULL) {
+		if (!(current->flags & PF_KTHREAD)) {
 			struct page *p;
 
 			pagefault_disable();
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 16/17] perf: Simplify get_perf_callchain() user logic
  2025-05-02 16:47 [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (14 preceding siblings ...)
  2025-05-02 16:48 ` [PATCH v7 15/17] perf: Use current->flags & PF_KTHREAD instead of current->mm == NULL Steven Rostedt
@ 2025-05-02 16:48 ` Steven Rostedt
  2025-05-02 16:48 ` [PATCH v7 17/17] perf: Skip user unwind if the task is a kernel thread Steven Rostedt
  2025-05-04  9:41 ` [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Ingo Molnar
  17 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-05-02 16:48 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim

From: Josh Poimboeuf <jpoimboe@kernel.org>

Simplify the get_perf_callchain() user logic a bit.  task_pt_regs()
should never be NULL.

Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/events/callchain.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index cda145dc11bd..2798c0c9f782 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -247,21 +247,19 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 	if (user) {
 		if (!user_mode(regs)) {
 			if (current->flags & PF_KTHREAD)
-				regs = NULL;
-			else
-				regs = task_pt_regs(current);
+				goto exit_put;
+			regs = task_pt_regs(current);
 		}
 
-		if (regs) {
-			if (add_mark)
-				perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
+		if (add_mark)
+			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
 
-			start_entry_idx = entry->nr;
-			perf_callchain_user(&ctx, regs);
-			fixup_uretprobe_trampoline_entries(entry, start_entry_idx);
-		}
+		start_entry_idx = entry->nr;
+		perf_callchain_user(&ctx, regs);
+		fixup_uretprobe_trampoline_entries(entry, start_entry_idx);
 	}
 
+exit_put:
 	put_callchain_entry(rctx);
 
 	return entry;
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 17/17] perf: Skip user unwind if the task is a kernel thread.
  2025-05-02 16:47 [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (15 preceding siblings ...)
  2025-05-02 16:48 ` [PATCH v7 16/17] perf: Simplify get_perf_callchain() user logic Steven Rostedt
@ 2025-05-02 16:48 ` Steven Rostedt
  2025-05-04  9:41 ` [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Ingo Molnar
  17 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-05-02 16:48 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim

From: Josh Poimboeuf <jpoimboe@kernel.org>

If the task is not a user thread, there's no user stack to unwind.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/events/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7c7d5a27c568..02e52df7a02e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8101,7 +8101,8 @@ struct perf_callchain_entry *
 perf_callchain(struct perf_event *event, struct pt_regs *regs)
 {
 	bool kernel = !event->attr.exclude_callchain_kernel;
-	bool user   = !event->attr.exclude_callchain_user;
+	bool user   = !event->attr.exclude_callchain_user &&
+		!(current->flags & PF_KTHREAD);
 	/* Disallow cross-task user callchains. */
 	bool crosstask = event->ctx->task && event->ctx->task != current;
 	const u32 max_stack = event->attr.sample_max_stack;
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 01/17] unwind_user: Add user space unwinding API
  2025-05-02 16:47 ` [PATCH v7 01/17] unwind_user: Add user space unwinding API Steven Rostedt
@ 2025-05-04  9:30   ` Ingo Molnar
  2025-05-04 16:43     ` Steven Rostedt
  0 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2025-05-04  9:30 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, x86, Jiri Olsa,
	Namhyung Kim


* Steven Rostedt <rostedt@goodmis.org> wrote:

> +++ b/include/linux/unwind_user_types.h
> @@ -0,0 +1,31 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_UNWIND_USER_TYPES_H
> +#define _LINUX_UNWIND_USER_TYPES_H
> +
> +#include <linux/types.h>
> +
> +enum unwind_user_type {
> +	UNWIND_USER_TYPE_NONE,
> +};
> +
> +struct unwind_stacktrace {
> +	unsigned int	nr;
> +	unsigned long	*entries;
> +};
> +
> +struct unwind_user_frame {
> +	s32 cfa_off;
> +	s32 ra_off;
> +	s32 fp_off;
> +	bool use_fp;
> +};
> +
> +struct unwind_user_state {
> +	unsigned long ip;
> +	unsigned long sp;
> +	unsigned long fp;
> +	enum unwind_user_type type;
> +	bool done;
> +};

Will any of these types be visible to tooling via ABIs?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 08/17] unwind_user/deferred: Add unwind cache
  2025-05-02 16:47 ` [PATCH v7 08/17] unwind_user/deferred: Add unwind cache Steven Rostedt
@ 2025-05-04  9:37   ` Ingo Molnar
  2025-05-04 16:21     ` Steven Rostedt
  0 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2025-05-04  9:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, x86, Jiri Olsa,
	Namhyung Kim


* Steven Rostedt <rostedt@goodmis.org> wrote:

> -struct unwind_task_info {
> +struct unwind_cache {
>  	unsigned long		*entries;
> +	unsigned int		nr_entries;
> +};
> +
> +struct unwind_task_info {
> +	struct unwind_cache	cache;
>  };

> @@ -19,17 +20,29 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
>  	if (current->flags & PF_EXITING)
>  		return -EINVAL;
>  
> -       if (!info->entries) {
> -               info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
> -                                             GFP_KERNEL);
> -               if (!info->entries)
> -		       return -ENOMEM;
> +	if (!cache->entries) {
> +		cache->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
> +					       GFP_KERNEL);
> +		if (!cache->entries)
> +			return -ENOMEM;
> +        }

That's just sloppy: why isn't the unwind_cache a pointer to begin with, 
with the dynamically allocated object containing ::nr_entries?

Also, the code has whitespace damage.

> +
> +	trace->entries = cache->entries;
> +
> +	if (cache->nr_entries) {
> +               /*
> +                * The user stack has already been previously unwound in this
> +                * entry context.  Skip the unwind and use the cache.
> +                */
> +               trace->nr = cache->nr_entries;
> +               return 0;

Whitespace damage here too. Maybe in other patches as well.

Please don't rush this series without first reviewing it carefully ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure
  2025-05-02 16:47 [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (16 preceding siblings ...)
  2025-05-02 16:48 ` [PATCH v7 17/17] perf: Skip user unwind if the task is a kernel thread Steven Rostedt
@ 2025-05-04  9:41 ` Ingo Molnar
  2025-05-04 16:32   ` Steven Rostedt
  17 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2025-05-04  9:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, x86, Jiri Olsa,
	Namhyung Kim


* Steven Rostedt <rostedt@goodmis.org> wrote:

>       unwind_user: Add user space unwinding API
>       unwind_user: Add frame pointer support
>       unwind_user/x86: Enable frame pointer unwinding on x86
>       perf/x86: Rename and move get_segment_base() and make it global
>       unwind_user: Add compat mode frame pointer support
>       unwind_user/x86: Enable compat mode frame pointer unwinding on x86
>       unwind_user/deferred: Add unwind cache

What is the cost of 'caching' here? Will we double-buffer the tracing 
data before it reaches its single primary tooling user, with no use of 
any actual 'caching', which will be the scenario in like 99.9% of the 
everyday usecases when this facility is used?

>       unwind_user/deferred: Add deferred unwinding interface
>       unwind_user/deferred: Make unwind deferral requests NMI-safe
>       perf: Remove get_perf_callchain() init_nr argument
>       perf: Have get_perf_callchain() return NULL if crosstask and user are set
>       perf: Simplify get_perf_callchain() user logic
>       perf: Skip user unwind if the task is a kernel thread.

Please don't leave periods in titles.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 08/17] unwind_user/deferred: Add unwind cache
  2025-05-04  9:37   ` Ingo Molnar
@ 2025-05-04 16:21     ` Steven Rostedt
  0 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-05-04 16:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, x86, Jiri Olsa,
	Namhyung Kim

On Sun, 4 May 2025 11:37:54 +0200
Ingo Molnar <mingo@kernel.org> wrote:

Hi Ingo,

> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > -struct unwind_task_info {
> > +struct unwind_cache {
> >  	unsigned long		*entries;
> > +	unsigned int		nr_entries;
> > +};
> > +
> > +struct unwind_task_info {
> > +	struct unwind_cache	cache;
> >  };  
> 
> > @@ -19,17 +20,29 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
> >  	if (current->flags & PF_EXITING)
> >  		return -EINVAL;
> >  
> > -       if (!info->entries) {
> > -               info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
> > -                                             GFP_KERNEL);
> > -               if (!info->entries)
> > -		       return -ENOMEM;
> > +	if (!cache->entries) {
> > +		cache->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
> > +					       GFP_KERNEL);
> > +		if (!cache->entries)
> > +			return -ENOMEM;
> > +        }  
> 
> That's just sloppy: why isn't the unwind_cache a pointer to begin with, 
> with the dynamically allocated object containing ::nr_entries?

Basically you want?

struct unwind_task_info {
	struct unwind_cache	*cache;
};

Then have:

struct unwind_cache {
	int			nr_entries;
	unsigned long		entries[];
};

And allocate the unwind_cache to include the size (using the dynamic
structure macro helpers)?

That makes sense to me.

> 
> Also, the code has whitespace damage.
> 
> > +
> > +	trace->entries = cache->entries;
> > +
> > +	if (cache->nr_entries) {
> > +               /*
> > +                * The user stack has already been previously
> > unwound in this
> > +                * entry context.  Skip the unwind and use the
> > cache.
> > +                */
> > +               trace->nr = cache->nr_entries;
> > +               return 0;  
> 
> Whitespace damage here too. Maybe in other patches as well.
> 
> Please don't rush this series without first reviewing it carefully ...

Hmm, for whitespace damage I usually rely on git show to show me where
the whitespace damage is. But if there are no tabs, it doesn't show. The
whitespace damage came from when I pulled in Josh's work and rebased it
on the latest kernel. There were conflicts, which were resolved by doing
some cut and paste from .rej files, and somehow that added spaces
instead of tabs.

Peter caught this in the beginning, and I thought I had fixed all the
locations that had that whitespace issue. This patch hasn't been
touched much during the various versions.

I'll do a search for spaces to see if there's more in any of the other
patches.

Thanks!

-- Steve


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure
  2025-05-04  9:41 ` [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Ingo Molnar
@ 2025-05-04 16:32   ` Steven Rostedt
  0 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-05-04 16:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, x86, Jiri Olsa,
	Namhyung Kim

On Sun, 4 May 2025 11:41:56 +0200
Ingo Molnar <mingo@kernel.org> wrote:

> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> >       unwind_user: Add user space unwinding API
> >       unwind_user: Add frame pointer support
> >       unwind_user/x86: Enable frame pointer unwinding on x86
> >       perf/x86: Rename and move get_segment_base() and make it global
> >       unwind_user: Add compat mode frame pointer support
> >       unwind_user/x86: Enable compat mode frame pointer unwinding on x86
> >       unwind_user/deferred: Add unwind cache  
> 
> What is the cost of 'caching' here? Will we double-buffer the tracing 
> data before it reaches its single primary tooling user, with no use of 
> any actual 'caching', which will be scenario in like 99.9% of the 
> everyday usecases when this facility is used?

I'm sorry, I may not understand the question here.

The cache doesn't add any extra buffer. The previous patch (Add unwind
deferred trace) allocates "entries" the first time a trace is done to
save the user stacktrace into the buffer. It will not free the entries
(until exit of the task) to avoid having to allocate the entries
again.

If for some reason an interrupt happens while it is recording the trace
and the interrupt requests another trace, then without the cache it
would do the work of walking the user stack again.

The "cache" code simply keeps information around to know that the
current trace is still valid, and that it doesn't need to do the work
of walking the user stack to produce the stack again.

> 
> >       unwind_user/deferred: Add deferred unwinding interface
> >       unwind_user/deferred: Make unwind deferral requests NMI-safe
> >       perf: Remove get_perf_callchain() init_nr argument
> >       perf: Have get_perf_callchain() return NULL if crosstask and user are set
> >       perf: Simplify get_perf_callchain() user logic
> >       perf: Skip user unwind if the task is a kernel thread.  
> 
> Please don't leave periods in titles.

OK, will fix.

Thanks for looking at this Ingo!

-- Steve

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 01/17] unwind_user: Add user space unwinding API
  2025-05-04  9:30   ` Ingo Molnar
@ 2025-05-04 16:43     ` Steven Rostedt
  2025-05-04 17:53       ` Josh Poimboeuf
  0 siblings, 1 reply; 26+ messages in thread
From: Steven Rostedt @ 2025-05-04 16:43 UTC (permalink / raw)
  To: Ingo Molnar, Josh Poimboeuf
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim

On Sun, 4 May 2025 11:30:32 +0200
Ingo Molnar <mingo@kernel.org> wrote:

> > +struct unwind_user_state {
> > +	unsigned long ip;
> > +	unsigned long sp;
> > +	unsigned long fp;
> > +	enum unwind_user_type type;
> > +	bool done;
> > +};  
> 
> Will any of these types be visible to tooling via ABIs?

Not that I'm aware of. Josh can confirm. Josh?

With the exception of BTF (which exposes pretty much all structures), I
believe this is completely internal to the generic code of the unwinder.

-- Steve

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 01/17] unwind_user: Add user space unwinding API
  2025-05-04 16:43     ` Steven Rostedt
@ 2025-05-04 17:53       ` Josh Poimboeuf
  0 siblings, 0 replies; 26+ messages in thread
From: Josh Poimboeuf @ 2025-05-04 17:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, linux-kernel, linux-trace-kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim

On Sun, May 04, 2025 at 12:43:30PM -0400, Steven Rostedt wrote:
> On Sun, 4 May 2025 11:30:32 +0200
> Ingo Molnar <mingo@kernel.org> wrote:
> 
> > > +struct unwind_user_state {
> > > +	unsigned long ip;
> > > +	unsigned long sp;
> > > +	unsigned long fp;
> > > +	enum unwind_user_type type;
> > > +	bool done;
> > > +};  
> > 
> > Will any of these types be visible to tooling via ABIs?
> 
> Not that I'm aware of. Josh can confirm. Josh?
> 
> With the exception of BTF (which exposes pretty much all structures), I
> believe this is completely internal for the generic code of the unwinder.

Correct, these types are strictly internal to the kernel.

-- 
Josh

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 11/17] unwind deferred: Use bitmask to determine which callbacks to call
  2025-05-02 16:47 ` [PATCH v7 11/17] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
@ 2025-05-09  1:36   ` Steven Rostedt
  0 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2025-05-09  1:36 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, x86, Jiri Olsa, Namhyung Kim

On Fri, 02 May 2025 12:47:57 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> @@ -134,6 +135,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
>  	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
>  	struct unwind_stacktrace trace;
>  	struct unwind_work *work;
> +	struct task_struct *task = current;
>  	u64 cookie;
>  
>  	if (WARN_ON_ONCE(!info->pending))
> @@ -155,7 +157,10 @@ static void unwind_deferred_task_work(struct callback_head *head)
>  
>  	guard(mutex)(&callback_mutex);
>  	list_for_each_entry(work, &callbacks, list) {
> -		work->func(work, &trace, cookie);
> +		if (task->unwind_mask & (1UL << work->bit)) {
> +			work->func(work, &trace, cookie);
> +			clear_bit(work->bit, &current->unwind_mask);
> +		}
>  	}
>  	barrier();
>  	/* If another task work is pending, reuse the cookie and stack trace */

So testing this code I hit a live lock. What happened was that I enabled
the flag that asks for a user space callback after every event, and also
enabled all events, on a kernel that had irq_disable as an event. The
system hung.

What happened was that after the trace was recorded, on the way back to
user space, interrupts were disabled again, and another request for a
callback was made. This caused another task_work to be triggered, which
would do the callback again; then on the way back to user space
interrupts were disabled once more, another request for a user space
stack trace was made, the task_work was triggered again, and ... wash,
rinse, repeat!

To fix this, I decided to move the "pending" bit into the task->unwind_mask
(the most significant bit). I also moved the clearing of the work bits into
unwind_exit_to_user_mode():

static __always_inline void unwind_exit_to_user_mode(void)
{
	unsigned long bits;

	/* Was there any unwinding? */
	if (likely(!current->unwind_mask))
		return;

	bits = current->unwind_mask;
	do {
		/* Is a task_work going to run again before going back */
		if (bits & UNWIND_PENDING)
			return;
	} while (!try_cmpxchg(&current->unwind_mask, &bits, 0UL));

	if (likely(current->unwind_info.cache))
		current->unwind_info.cache->nr_entries = 0;
	current->unwind_info.timestamp = 0;
}

The idea is that current->unwind_mask is only set if an unwind was
requested and delivered. If it's not set, no unwind was done and there's
nothing left to do, so just exit the routine.

Then, if unwind_mask has PENDING set, it means that a task work is going
to be executed again before going back to user space, so exit. Otherwise,
try to clear all the bits (using try_cmpxchg() in case an NMI comes in).

Finally, clear all the data normally.

Now if a tracer requests a callback, it will get only one callback. If it
requests another one on the way out, it will be told that it has already
requested one (and that it was already delivered). If a tracer really wants
to request another one, and it knows it will not cause an infinite recursion
by doing so, then I could add an API that lets the unwinder do that,
possibly by adding a "force" argument to the request function.

Otherwise, the bit for the tracer stays set until the task goes back to
user space, or until another tracer requests its first stacktrace, which
sets the pending bit and clears all other bits that were set previously.

Having the pending bit be part of the unwind_mask allows the pending bit
to be updated atomically with the other bits.

Here's the patch that implements this:

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 1789c3624723..db7a8d5d6040 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -18,6 +18,9 @@ struct unwind_work {
 
 #ifdef CONFIG_UNWIND_USER
 
+#define UNWIND_PENDING_BIT	(BITS_PER_LONG - 1)
+#define UNWIND_PENDING		(1UL << UNWIND_PENDING_BIT)
+
 void unwind_task_init(struct task_struct *task);
 void unwind_task_free(struct task_struct *task);
 
@@ -29,7 +32,20 @@ void unwind_deferred_cancel(struct unwind_work *work);
 
 static __always_inline void unwind_exit_to_user_mode(void)
 {
-	if (unlikely(current->unwind_info.cache))
+	unsigned long bits;
+
+	/* Was there any unwinding? */
+	if (likely(!current->unwind_mask))
+		return;
+
+	bits = current->unwind_mask;
+	do {
+		/* Is a task_work going to run again before going back */
+		if (bits & UNWIND_PENDING)
+			return;
+	} while (!try_cmpxchg(&current->unwind_mask, &bits, 0UL));
+
+	if (likely(current->unwind_info.cache))
 		current->unwind_info.cache->nr_entries = 0;
 	current->unwind_info.timestamp = 0;
 }
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index ae27a02234b8..28811a9d4262 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -12,7 +12,6 @@ struct unwind_task_info {
 	struct callback_head	work;
 	u64			timestamp;
 	u64			nmi_timestamp;
-	int			pending;
 };
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 6ffed486bd7b..637ab6491bc5 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -19,6 +19,11 @@ static LIST_HEAD(callbacks);
 static unsigned long unwind_mask;
 DEFINE_STATIC_SRCU(unwind_srcu);
 
+static inline bool unwind_pending(struct task_struct *task)
+{
+	return test_bit(UNWIND_PENDING_BIT, &task->unwind_mask);
+}
+
 /*
  * Read the task context timestamp, if this is the first caller then
  * it will set the timestamp.
@@ -99,11 +104,11 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	struct task_struct *task = current;
 	int idx;
 
-	if (WARN_ON_ONCE(!info->pending))
+	if (WARN_ON_ONCE(!unwind_pending(task)))
 		return;
 
 	/* Allow work to come in again */
-	WRITE_ONCE(info->pending, 0);
+	clear_bit(UNWIND_PENDING_BIT, &task->unwind_mask);
 
 	/*
 	 * From here on out, the callback must always be called, even if it's
@@ -126,10 +131,8 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	idx = srcu_read_lock(&unwind_srcu);
 	list_for_each_entry_srcu(work, &callbacks, list,
 				 srcu_read_lock_held(&unwind_srcu)) {
-		if (task->unwind_mask & (1UL << work->bit)) {
+		if (task->unwind_mask & (1UL << work->bit))
 			work->func(work, &trace, timestamp);
-			clear_bit(work->bit, &current->unwind_mask);
-		}
 	}
 	srcu_read_unlock(&unwind_srcu, idx);
 }
@@ -156,10 +159,11 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
 		inited_timestamp = true;
 	}
 
+	/* Is this already queued */
 	if (current->unwind_mask & (1UL << work->bit))
 		return 1;
 
-	if (info->pending)
+	if (unwind_pending(current))
 		goto out;
 
 	ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
@@ -174,7 +178,13 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
 		return ret;
 	}
 
-	info->pending = 1;
+	/*
+	 * This is the first to set the PENDING_BIT, clear all others
+	 * as any other bit has already had their callback called, and
+	 * those callbacks should not be called again because of this
+	 * new callback.
+	 */
+	current->unwind_mask = UNWIND_PENDING;
 out:
 	return test_and_set_bit(work->bit, &current->unwind_mask);
 }
@@ -202,7 +212,7 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
 int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 {
 	struct unwind_task_info *info = &current->unwind_info;
-	int pending;
+	unsigned long old, bits;
 	int bit;
 	int ret;
 
@@ -224,25 +234,30 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 
 	*timestamp = get_timestamp(info);
 
-	/* This is already queued */
+	/* Is this already queued */
 	if (current->unwind_mask & (1UL << bit))
 		return 1;
 
-	/* callback already pending? */
-	pending = READ_ONCE(info->pending);
-	if (pending)
+	old = current->unwind_mask;
+	barrier();
+
+	if (unwind_pending(current))
 		goto out;
 
-	/* Claim the work unless an NMI just now swooped in to do so. */
-	if (!try_cmpxchg(&info->pending, &pending, 1))
+	/* This is the first to set the pending bit since the task entered the kernel */
+	bits = UNWIND_PENDING | (1UL << bit);
+
+	/* callback already pending? */
+	if (!try_cmpxchg(&current->unwind_mask, &old, bits))
 		goto out;
 
 	/* The work has been claimed, now schedule it. */
 	ret = task_work_add(current, &info->work, TWA_RESUME);
-	if (WARN_ON_ONCE(ret)) {
-		WRITE_ONCE(info->pending, 0);
-		return ret;
-	}
+
+	if (WARN_ON_ONCE(ret))
+		WRITE_ONCE(current->unwind_mask, 0);
+
+	return ret;
 
  out:
 	return test_and_set_bit(work->bit, &current->unwind_mask);
@@ -281,7 +296,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 	guard(mutex)(&callback_mutex);
 
 	/* See if there's a bit in the mask available */
-	if (unwind_mask == ~0UL)
+	if (unwind_mask == ~(UNWIND_PENDING))
 		return -EBUSY;
 
 	work->bit = ffz(unwind_mask);

-- Steve

^ permalink raw reply related	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2025-05-09  1:36 UTC | newest]

Thread overview: 26+ messages
-- links below jump to the message on this page --
2025-05-02 16:47 [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
2025-05-02 16:47 ` [PATCH v7 01/17] unwind_user: Add user space unwinding API Steven Rostedt
2025-05-04  9:30   ` Ingo Molnar
2025-05-04 16:43     ` Steven Rostedt
2025-05-04 17:53       ` Josh Poimboeuf
2025-05-02 16:47 ` [PATCH v7 02/17] unwind_user: Add frame pointer support Steven Rostedt
2025-05-02 16:47 ` [PATCH v7 03/17] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
2025-05-02 16:47 ` [PATCH v7 04/17] perf/x86: Rename and move get_segment_base() and make it global Steven Rostedt
2025-05-02 16:47 ` [PATCH v7 05/17] unwind_user: Add compat mode frame pointer support Steven Rostedt
2025-05-02 16:47 ` [PATCH v7 06/17] unwind_user/x86: Enable compat mode frame pointer unwinding on x86 Steven Rostedt
2025-05-02 16:47 ` [PATCH v7 07/17] unwind_user/deferred: Add unwind_deferred_trace() Steven Rostedt
2025-05-02 16:47 ` [PATCH v7 08/17] unwind_user/deferred: Add unwind cache Steven Rostedt
2025-05-04  9:37   ` Ingo Molnar
2025-05-04 16:21     ` Steven Rostedt
2025-05-02 16:47 ` [PATCH v7 09/17] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
2025-05-02 16:47 ` [PATCH v7 10/17] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
2025-05-02 16:47 ` [PATCH v7 11/17] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
2025-05-09  1:36   ` Steven Rostedt
2025-05-02 16:47 ` [PATCH v7 12/17] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
2025-05-02 16:47 ` [PATCH v7 13/17] perf: Remove get_perf_callchain() init_nr argument Steven Rostedt
2025-05-02 16:48 ` [PATCH v7 14/17] perf: Have get_perf_callchain() return NULL if crosstask and user are set Steven Rostedt
2025-05-02 16:48 ` [PATCH v7 15/17] perf: Use current->flags & PF_KTHREAD instead of current->mm == NULL Steven Rostedt
2025-05-02 16:48 ` [PATCH v7 16/17] perf: Simplify get_perf_callchain() user logic Steven Rostedt
2025-05-02 16:48 ` [PATCH v7 17/17] perf: Skip user unwind if the task is a kernel thread Steven Rostedt
2025-05-04  9:41 ` [PATCH v7 00/17] unwind_user: perf: x86: Deferred unwinding infrastructure Ingo Molnar
2025-05-04 16:32   ` Steven Rostedt
