* [PATCH v13 00/14] unwind_user: x86: Deferred unwinding infrastructure
@ 2025-07-08 1:22 Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 01/14] unwind_user: Add user space unwinding API Steven Rostedt
` (13 more replies)
0 siblings, 14 replies; 45+ messages in thread
From: Steven Rostedt @ 2025-07-08 1:22 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
This is the first patch series of a set that will make it possible to use
SFrames[1] in the Linux kernel. A quick recap of the motivation for doing
this:
Currently, the only way to get a user space stack trace from a stack
walk (as opposed to copying a large amount of the user stack into the kernel
ring buffer) is to use frame pointers. This has a few issues. The biggest
one is that compiling frame pointers into every application and library
has been shown to cause performance overhead.
Another issue is that the format of the frames may not always be consistent
between different compilers, and some architectures (s390) have no defined
format for a reliable stack walk. The only way to perform user space
profiling on these architectures is to copy the user stack into the kernel
buffer.
SFrame is now supported in GNU binutils and will soon also be supported
by LLVM. SFrame acts more like ORC, and lives in the ELF executable
file as its own section. Like ORC, it has two tables: the first table is
sorted by instruction pointer (IP). Looking up the current IP in the first
table leads to an entry in the second table, which describes where the
return address of the current function is located. That return address can
then be looked up in the first table to find the return address of its
caller, and so on. This performs a user space stack walk.
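The two-table walk described above can be sketched in plain C. This is a toy model under made-up names (ip_entry, off_entry, walk()) with a fake word-indexed stack standing in for user memory; it is not the real SFrame on-disk format, which encodes considerably more.

```c
#include <stdint.h>

/*
 * Toy model of the two-table lookup described above: the first table
 * is sorted by instruction pointer (IP), and its entry points into a
 * second table saying where (relative to a fake stack position) the
 * return address of the covering function lives.  Names and layout
 * are illustrative only; the real SFrame format is far richer.
 */
struct ip_entry {		/* first table: sorted by start IP */
	uintptr_t ip_start;
	int off_idx;		/* index into the second table */
};

struct off_entry {		/* second table */
	int ra_off;		/* return address offset, in words */
};

/* Binary search for the highest entry whose start IP is <= @ip */
int lookup_ip(const struct ip_entry *tbl, int n, uintptr_t ip)
{
	int lo = 0, hi = n - 1, found = -1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (tbl[mid].ip_start <= ip) {
			found = mid;
			lo = mid + 1;
		} else {
			hi = mid - 1;
		}
	}
	return found;
}

/* Repeat the IP -> offset -> return address lookup to walk the stack */
int walk(const struct ip_entry *ips, int nips, const struct off_entry *offs,
	 const uintptr_t *stack, uintptr_t ip, int max, uintptr_t *trace)
{
	int nr = 0, sp = 0;

	while (ip && nr < max) {
		int idx = lookup_ip(ips, nips, ip);

		if (idx < 0)
			break;
		trace[nr++] = ip;
		ip = stack[sp + offs[ips[idx].off_idx].ra_off];
		sp += offs[ips[idx].off_idx].ra_off + 1; /* step past the RA */
	}
	return nr;
}
```

The same lookup-then-follow loop is what makes the walk cheap at trace time: each frame costs one binary search plus one read of the user stack.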
Now because the SFrame section lives in the ELF file, it needs to be faulted
into memory when it is used. This means that walking the user space stack
requires a faultable context. As profilers like perf request a stack
trace in interrupt or NMI context, the walk cannot be done when it is
requested. Instead it must be deferred until it is safe to fault in user
space. One place this is known to be safe is when the task is about to return
back to user space.
Josh originally wrote the PoC of this code; the last version he posted
was back in January:
https://lore.kernel.org/all/cover.1737511963.git.jpoimboe@kernel.org/
That series contained everything: new faultable user space stack walking
code, deferring the stack walk, implementing SFrames, fixing up x86 (the
VDSO), and both the kernel and user space sides of perf to make it work.
But Josh ran out of time to work on it and I picked it up. As there are
several parts to his series, and parts that do not depend on each other,
I broke it out into separate series.
This series contains only the core infrastructure that all the rest needs.
Of the 14 patches, only 2 are x86 specific. The rest is simply the unwinding
code that s390 can build against. I also moved the 2 x86-specific patches
to the end of the series.
Since multiple tracers (like perf, ftrace, BPF, etc.) can attach to the
deferred unwinder, and each of these tracers can attach to some or all
of the tasks to trace, there is a many-to-many relationship. This relationship
needs to be established in interrupt or NMI context, so it cannot rely on any
allocation. To handle this, a bitmask is used. There is a global bitmask the
size of a long, in which a single bit is allocated when a tracer registers for
deferred stack traces. The task struct also has a bitmask: when a
request comes in from one of the tracers for a deferred stack trace, the
corresponding bit for that tracer is set in its mask. As one of the bits
represents that a request has been made, at most 31 tracers on 32-bit
systems, or 63 on 64-bit systems, may be registered at a given time.
This should not be an issue, as a perf application or ftrace instance
should each request only one bit. BPF should also use only one bit and
handle any multiplexing for its users.
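The bit allocation can be sketched as follows, under hypothetical names (tracer_register(), tracer_unregister()) and with compiler atomic builtins standing in for the kernel's cmpxchg(); bit 0 plays the role of the reserved "request pending" bit mentioned above:

```c
#include <limits.h>

/*
 * Hypothetical sketch of tracer registration: one global bitmask the
 * size of a long, one bit per registered tracer.  Bit 0 is reserved
 * here to represent "a request is pending", which is why at most
 * BITS_PER_LONG - 1 tracers can register.  Compiler atomic builtins
 * stand in for the kernel's cmpxchg().
 */
#define NBITS		((int)(sizeof(unsigned long) * CHAR_BIT))
#define RESERVED	1UL	/* low bit: request-pending marker */

static unsigned long tracer_mask = RESERVED;

/* Claim the first free bit; returns the bit number, or -1 if full */
int tracer_register(void)
{
	for (int bit = 1; bit < NBITS; bit++) {
		unsigned long old = tracer_mask;

		if (old & (1UL << bit))
			continue;
		/* cmpxchg so concurrent registrations cannot collide */
		if (__atomic_compare_exchange_n(&tracer_mask, &old,
						old | (1UL << bit), 0,
						__ATOMIC_SEQ_CST,
						__ATOMIC_SEQ_CST))
			return bit;
		bit--;		/* lost a race; retry this bit */
	}
	return -1;
}

/* Release a tracer's bit on unregister */
void tracer_unregister(int bit)
{
	__atomic_and_fetch(&tracer_mask, ~(1UL << bit), __ATOMIC_SEQ_CST);
}
```

Because registration is bounded by the word size, no allocation is ever needed from the NMI-safe request path.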
When the first request for a deferred stack trace is made for a task, a
unique cookie is generated. This cookie is used as the identifier for the
user space stack trace. As the user space stack trace does not change while
the task is in the kernel, requests that come in after the first one and
before the task returns to user space get the same cookie. If events are
dropped, and the dropped events span a task entering user space and coming
back into the kernel, the new stack trace taken on the next return to user
space should not be used with the events from before the drop.
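As described in the v13 changes below, the cookie is built from two 32-bit words, the CPU it was generated on and a per-CPU counter, so only a 32-bit cmpxchg is ever needed. A rough, non-authoritative sketch, with smp_processor_id() and the per-CPU counter mocked:

```c
#include <stdint.h>

/*
 * Rough sketch of the cookie layout described in the v13 changes:
 * the top 32 bits identify the CPU the cookie was generated on and
 * the bottom 32 bits are a per-CPU counter, so only a 32-bit
 * cmpxchg is ever required.  The per-CPU machinery is mocked with a
 * plain array; the actual kernel code differs.
 */
#define NR_FAKE_CPUS	4

static uint32_t cookie_cnt[NR_FAKE_CPUS];	/* stand-in per-CPU counter */

static uint32_t fake_smp_processor_id(void)
{
	return 1;	/* pretend we always run on CPU 1 */
}

/* Generate a cookie unique to this task's current kernel entry */
uint64_t get_cookie(void)
{
	uint32_t cpu = fake_smp_processor_id();
	uint32_t cnt = ++cookie_cnt[cpu];	/* per-CPU: no cross-CPU race */

	return ((uint64_t)cpu << 32) | cnt;
}
```

Since the counter is per-CPU, two cookies generated on different CPUs can never collide, and a dropped-event window shows up as a counter gap.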
When a tracer makes a request, it gets this cookie, and the bit for the
requesting tracer is set in the task's bitmask. A task_work is used to have
the task run the callbacks before it goes back to user space. When it does,
the task scans its bitmask and calls the callbacks of all the tracers whose
bits are set. Each callback receives the user space stack trace as
well as the cookie that was used. It's up to the tracer whether to use the
cookie to map the user space stack trace back to the events where it was
requested.
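The request-then-callback flow can be modeled roughly as below. task_work and the unwinder itself are omitted, and all names (request_deferred(), do_deferred_callbacks()) are made up for illustration:

```c
#include <stdint.h>
#include <limits.h>

/*
 * Hypothetical model of the deferral flow: a request (possibly from
 * NMI context) only sets the requesting tracer's bit in the task's
 * mask; on the way back to user space, one stack walk is done and
 * every tracer whose bit is set gets its callback with the cookie.
 * task_work and the unwinder are omitted; names are made up.
 */
typedef void (*unwind_cb)(uint64_t cookie);

#define NBITS	((int)(sizeof(unsigned long) * CHAR_BIT))

unsigned long task_unwind_mask;		/* per-task in the real code */
static unwind_cb callbacks[NBITS];

/* NMI-safe part: only record which tracer wants a stack trace */
void request_deferred(int bit, unwind_cb cb)
{
	callbacks[bit] = cb;
	__atomic_or_fetch(&task_unwind_mask, 1UL << bit, __ATOMIC_SEQ_CST);
}

/* Task context, just before returning to user space */
void do_deferred_callbacks(uint64_t cookie)
{
	unsigned long mask =
		__atomic_exchange_n(&task_unwind_mask, 0, __ATOMIC_SEQ_CST);

	/* a single (faultable) user stack walk would happen here */
	for (int bit = 0; bit < NBITS; bit++) {
		if (mask & (1UL << bit))
			callbacks[bit](cookie);
	}
}

uint64_t seen_cookie;			/* written by demo_cb() */

void demo_cb(uint64_t cookie)
{
	seen_cookie = cookie;
}
```

The key property is that the expensive, faultable walk happens once per kernel entry no matter how many tracers asked for it.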
That's the basic idea. Obviously there's more to it than the above
explanation, but each patch explains what it is doing, and it is broken up
step by step.
I run two SFrame meetings a month (one in an Asia-friendly timezone and
the other in a Europe-friendly one). We have developers from Google, Oracle,
Red Hat, IBM, EfficiOS, Meta, Microsoft, and more that attend. (If anyone is
interested in attending, let me know.) I have been running these since
December of 2024.
Last year at GNU Cauldron, a few of us got together to discuss the design.
We are pretty confident that the current design is sound. We have
working code on top of this and have been testing it.
Since the s390 folks want to start working on this (they already have
SFrame patches from working on the prototypes), I would like this series
to be a separate branch based on top of v6.16-rc2. Then all the subsystems
that want to build on top of it can do so, as there's no real dependency
between them.
I have more patches on top of this series that add perf support, ftrace
support, SFrame support, and the x86 fixups (for the VDSO). Each of those
patch series can be worked on independently, but they all depend on this
series (although the x86-specific patches at the end aren't necessarily
needed, at least for other architectures).
Please review, and if you are happy with them, let's get them into a branch
that we all can use. I'm happy to take it in my tree if I can get acks on the
x86 code. Or it can be in the tip tree as a separate branch on top of 6.16-rc4
and I'll just base my work on top of that. Doesn't matter either way.
[1] https://sourceware.org/binutils/wiki/sframe
The code for this series is located here:
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
unwind/core
Changes since v12: https://lore.kernel.org/linux-trace-kernel/20250701005321.942306427@goodmis.org/
- Make unwind_user_start() and unwind_user_next() static. There's no
reason that they need to be used by other files.
- Move for_each_user_frame() macro into user.c
- Remove extra parentheses around start in for_each_user_frame() macro
(Mathieu Desnoyers)
- Added a test, when use_fp is set, to make sure fp is not below sp (Jens Remus)
- Make sure the address read is word aligned (Linus Torvalds)
- With new alignment check, updated to handle compat mode.
- Replaced the timestamp with the generated cookie logic again. Instead of
using a 64 bit word where the CPU part of the cookie is just 12 bits,
make it two 32 bit words, where one word is the CPU the cookie is
generated on and the other is just a per CPU counter. This allows
using just a 32 bit cmpxchg, which works on all archs that have an
NMI-safe cmpxchg.
- Now that the timestamp has been replaced by a cookie that uses only a 32
bit cmpxchg(), this code just checks if the architecture has a safe
cmpxchg that can be used in NMI and doesn't do the 64 bit check.
Only the pending value is converted to local_t.
- Removed no longer used local.h headers from unwind_deferred_types.h
Josh Poimboeuf (7):
unwind_user: Add user space unwinding API
unwind_user: Add frame pointer support
unwind_user: Add compat mode frame pointer support
unwind_user/deferred: Add unwind cache
unwind_user/deferred: Add deferred unwinding interface
unwind_user/x86: Enable frame pointer unwinding on x86
unwind_user/x86: Enable compat mode frame pointer unwinding on x86
Steven Rostedt (7):
unwind_user/deferred: Add unwind_user_faultable()
unwind_user/deferred: Make unwind deferral requests NMI-safe
unwind deferred: Use bitmask to determine which callbacks to call
unwind deferred: Use SRCU unwind_deferred_task_work()
unwind: Clear unwind_mask on exit back to user space
unwind: Add USED bit to only have one conditional on way back to user space
unwind: Finish up unwind when a task exits
----
MAINTAINERS | 8 +
arch/Kconfig | 11 +
arch/x86/Kconfig | 2 +
arch/x86/include/asm/unwind_user.h | 42 ++++
arch/x86/include/asm/unwind_user_types.h | 17 ++
arch/x86/kernel/stacktrace.c | 28 +++
include/asm-generic/Kbuild | 2 +
include/asm-generic/unwind_user.h | 5 +
include/asm-generic/unwind_user_types.h | 5 +
include/linux/entry-common.h | 2 +
include/linux/sched.h | 5 +
include/linux/unwind_deferred.h | 79 +++++++
include/linux/unwind_deferred_types.h | 26 +++
include/linux/unwind_user.h | 39 ++++
include/linux/unwind_user_types.h | 39 ++++
kernel/Makefile | 1 +
kernel/exit.c | 2 +
kernel/fork.c | 4 +
kernel/unwind/Makefile | 1 +
kernel/unwind/deferred.c | 363 +++++++++++++++++++++++++++++++
kernel/unwind/user.c | 147 +++++++++++++
21 files changed, 828 insertions(+)
create mode 100644 arch/x86/include/asm/unwind_user.h
create mode 100644 arch/x86/include/asm/unwind_user_types.h
create mode 100644 include/asm-generic/unwind_user.h
create mode 100644 include/asm-generic/unwind_user_types.h
create mode 100644 include/linux/unwind_deferred.h
create mode 100644 include/linux/unwind_deferred_types.h
create mode 100644 include/linux/unwind_user.h
create mode 100644 include/linux/unwind_user_types.h
create mode 100644 kernel/unwind/Makefile
create mode 100644 kernel/unwind/deferred.c
create mode 100644 kernel/unwind/user.c
^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH v13 01/14] unwind_user: Add user space unwinding API
2025-07-08 1:22 [PATCH v13 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
@ 2025-07-08 1:22 ` Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 02/14] unwind_user: Add frame pointer support Steven Rostedt
` (12 subsequent siblings)
13 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2025-07-08 1:22 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Josh Poimboeuf <jpoimboe@kernel.org>
Introduce a generic API for unwinding user stacks.
In order to expand user space unwinding to handle more complex
scenarios, such as deferred unwinding and reading user space information,
create a generic interface that all architectures supporting the various
unwinding methods can use.
This is an alternative to the simple stack_trace_save_user() API for
handling user space stack traces. It does not replace that
interface, but will be used to expand the functionality of
user space stack walking.
None of the structures introduced will be exposed to user space tooling.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v12: https://lore.kernel.org/20250701005450.721228270@goodmis.org
- Make unwind_user_start() and unwind_user_next() static. There's no
reason that they need to be used by other files.
- Move for_each_user_frame() macro into user.c
- Remove extra parentheses around start in for_each_user_frame() macro
(Mathieu Desnoyers)
MAINTAINERS | 8 +++++
arch/Kconfig | 3 ++
include/linux/unwind_user.h | 9 +++++
include/linux/unwind_user_types.h | 31 +++++++++++++++++
kernel/Makefile | 1 +
kernel/unwind/Makefile | 1 +
kernel/unwind/user.c | 58 +++++++++++++++++++++++++++++++
7 files changed, 111 insertions(+)
create mode 100644 include/linux/unwind_user.h
create mode 100644 include/linux/unwind_user_types.h
create mode 100644 kernel/unwind/Makefile
create mode 100644 kernel/unwind/user.c
diff --git a/MAINTAINERS b/MAINTAINERS
index fad6cb025a19..370d780fd5f8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -25928,6 +25928,14 @@ F: Documentation/driver-api/uio-howto.rst
F: drivers/uio/
F: include/linux/uio_driver.h
+USERSPACE STACK UNWINDING
+M: Josh Poimboeuf <jpoimboe@kernel.org>
+M: Steven Rostedt <rostedt@goodmis.org>
+S: Maintained
+F: include/linux/unwind*.h
+F: kernel/unwind/
+
+
UTIL-LINUX PACKAGE
M: Karel Zak <kzak@redhat.com>
L: util-linux@vger.kernel.org
diff --git a/arch/Kconfig b/arch/Kconfig
index a3308a220f86..ea59e5d7cc69 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -435,6 +435,9 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
It uses the same command line parameters, and sysctl interface,
as the generic hardlockup detectors.
+config UNWIND_USER
+ bool
+
config HAVE_PERF_REGS
bool
help
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
new file mode 100644
index 000000000000..1d77bed8de2c
--- /dev/null
+++ b/include/linux/unwind_user.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_H
+#define _LINUX_UNWIND_USER_H
+
+#include <linux/unwind_user_types.h>
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
+
+#endif /* _LINUX_UNWIND_USER_H */
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
new file mode 100644
index 000000000000..6ed1b4ae74e1
--- /dev/null
+++ b/include/linux/unwind_user_types.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_TYPES_H
+#define _LINUX_UNWIND_USER_TYPES_H
+
+#include <linux/types.h>
+
+enum unwind_user_type {
+ UNWIND_USER_TYPE_NONE,
+};
+
+struct unwind_stacktrace {
+ unsigned int nr;
+ unsigned long *entries;
+};
+
+struct unwind_user_frame {
+ s32 cfa_off;
+ s32 ra_off;
+ s32 fp_off;
+ bool use_fp;
+};
+
+struct unwind_user_state {
+ unsigned long ip;
+ unsigned long sp;
+ unsigned long fp;
+ enum unwind_user_type type;
+ bool done;
+};
+
+#endif /* _LINUX_UNWIND_USER_TYPES_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 32e80dd626af..541186050251 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -55,6 +55,7 @@ obj-y += rcu/
obj-y += livepatch/
obj-y += dma/
obj-y += entry/
+obj-y += unwind/
obj-$(CONFIG_MODULES) += module/
obj-$(CONFIG_KCMP) += kcmp.o
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
new file mode 100644
index 000000000000..349ce3677526
--- /dev/null
+++ b/kernel/unwind/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_UNWIND_USER) += user.o
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
new file mode 100644
index 000000000000..d0cf3ee2706d
--- /dev/null
+++ b/kernel/unwind/user.c
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Generic interfaces for unwinding user space
+ */
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+#include <linux/unwind_user.h>
+
+#define for_each_user_frame(state) \
+ for (unwind_user_start(state); !(state)->done; unwind_user_next(state))
+
+static int unwind_user_next(struct unwind_user_state *state)
+{
+ /* no implementation yet */
+ return -EINVAL;
+}
+
+static int unwind_user_start(struct unwind_user_state *state)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+
+ memset(state, 0, sizeof(*state));
+
+ if ((current->flags & PF_KTHREAD) || !user_mode(regs)) {
+ state->done = true;
+ return -EINVAL;
+ }
+
+ state->type = UNWIND_USER_TYPE_NONE;
+
+ state->ip = instruction_pointer(regs);
+ state->sp = user_stack_pointer(regs);
+ state->fp = frame_pointer(regs);
+
+ return 0;
+}
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries)
+{
+ struct unwind_user_state state;
+
+ trace->nr = 0;
+
+ if (!max_entries)
+ return -EINVAL;
+
+ if (current->flags & PF_KTHREAD)
+ return 0;
+
+ for_each_user_frame(&state) {
+ trace->entries[trace->nr++] = state.ip;
+ if (trace->nr >= max_entries)
+ break;
+ }
+
+ return 0;
+}
--
2.47.2
* [PATCH v13 02/14] unwind_user: Add frame pointer support
2025-07-08 1:22 [PATCH v13 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 01/14] unwind_user: Add user space unwinding API Steven Rostedt
@ 2025-07-08 1:22 ` Steven Rostedt
2025-07-09 10:01 ` Jens Remus
2025-07-08 1:22 ` [PATCH v13 03/14] unwind_user: Add compat mode " Steven Rostedt
` (11 subsequent siblings)
13 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2025-07-08 1:22 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Josh Poimboeuf <jpoimboe@kernel.org>
Add optional support for user space frame pointer unwinding. If
supported, the arch needs to enable CONFIG_HAVE_UNWIND_USER_FP and
define ARCH_INIT_USER_FP_FRAME.
By encoding the frame offsets in struct unwind_user_frame, much of this
code can also be reused for future unwinder implementations like sframe.
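As a non-authoritative sketch of how those encoded offsets drive one step of the walk, the snippet below uses a plain word-indexed array in place of get_user() on the user stack; the frame_desc fields mirror cfa_off/ra_off/fp_off/use_fp, and the test offsets match a typical x86-64 frame-pointer layout (the real values come from the arch's ARCH_INIT_USER_FP_FRAME):

```c
#include <stdint.h>

/*
 * Sketch of one step of the frame-pointer walk driven by the
 * offsets in struct unwind_user_frame.  A plain word-indexed array
 * stands in for get_user() on the user stack.  Illustrative only.
 */
struct frame_desc {
	int32_t cfa_off;	/* CFA = (FP or SP) + cfa_off */
	int32_t ra_off;		/* return address at CFA + ra_off */
	int32_t fp_off;		/* previous FP at CFA + fp_off */
	_Bool use_fp;
};

struct walk_state {
	uintptr_t ip, sp, fp;
};

/* Advance @st one frame; mem[] is a fake user stack indexed in words */
int fp_step(struct walk_state *st, const struct frame_desc *f,
	    const uintptr_t *mem)
{
	uintptr_t cfa;

	if (f->use_fp && st->fp < st->sp)	/* FP must not be below SP */
		return -1;

	cfa = (f->use_fp ? st->fp : st->sp) + f->cfa_off;
	if (cfa <= st->sp)			/* stack must move upward */
		return -1;

	/* offsets are in bytes; the fake stack is indexed in words */
	st->ip = mem[(cfa + f->ra_off) / sizeof(uintptr_t)];
	st->fp = mem[(cfa + f->fp_off) / sizeof(uintptr_t)];
	st->sp = cfa;
	return 0;
}
```

A future SFrame unwinder would only have to produce a different frame_desc per IP; the stepping logic stays the same.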
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v12: https://lore.kernel.org/20250701005450.888492528@goodmis.org
- Added a test, when use_fp is set, to make sure fp is not below sp (Jens Remus)
- Make sure the address read is word aligned (Linus Torvalds)
arch/Kconfig | 4 ++
include/asm-generic/Kbuild | 1 +
include/asm-generic/unwind_user.h | 5 +++
include/linux/unwind_user.h | 5 +++
include/linux/unwind_user_types.h | 1 +
kernel/unwind/user.c | 65 ++++++++++++++++++++++++++++++-
6 files changed, 79 insertions(+), 2 deletions(-)
create mode 100644 include/asm-generic/unwind_user.h
diff --git a/arch/Kconfig b/arch/Kconfig
index ea59e5d7cc69..8e3fd723bd74 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -438,6 +438,10 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
config UNWIND_USER
bool
+config HAVE_UNWIND_USER_FP
+ bool
+ select UNWIND_USER
+
config HAVE_PERF_REGS
bool
help
diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
index 8675b7b4ad23..295c94a3ccc1 100644
--- a/include/asm-generic/Kbuild
+++ b/include/asm-generic/Kbuild
@@ -59,6 +59,7 @@ mandatory-y += tlbflush.h
mandatory-y += topology.h
mandatory-y += trace_clock.h
mandatory-y += uaccess.h
+mandatory-y += unwind_user.h
mandatory-y += vermagic.h
mandatory-y += vga.h
mandatory-y += video.h
diff --git a/include/asm-generic/unwind_user.h b/include/asm-generic/unwind_user.h
new file mode 100644
index 000000000000..b8882b909944
--- /dev/null
+++ b/include/asm-generic/unwind_user.h
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_UNWIND_USER_H
+#define _ASM_GENERIC_UNWIND_USER_H
+
+#endif /* _ASM_GENERIC_UNWIND_USER_H */
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index 1d77bed8de2c..7f7282516bf5 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -3,6 +3,11 @@
#define _LINUX_UNWIND_USER_H
#include <linux/unwind_user_types.h>
+#include <asm/unwind_user.h>
+
+#ifndef ARCH_INIT_USER_FP_FRAME
+ #define ARCH_INIT_USER_FP_FRAME
+#endif
int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 6ed1b4ae74e1..65bd070eb6b0 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -6,6 +6,7 @@
enum unwind_user_type {
UNWIND_USER_TYPE_NONE,
+ UNWIND_USER_TYPE_FP,
};
struct unwind_stacktrace {
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index d0cf3ee2706d..62b3ef37d71b 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -6,13 +6,71 @@
#include <linux/sched.h>
#include <linux/sched/task_stack.h>
#include <linux/unwind_user.h>
+#include <linux/uaccess.h>
+
+static struct unwind_user_frame fp_frame = {
+ ARCH_INIT_USER_FP_FRAME
+};
+
+static inline bool fp_state(struct unwind_user_state *state)
+{
+ return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP) &&
+ state->type == UNWIND_USER_TYPE_FP;
+}
#define for_each_user_frame(state) \
for (unwind_user_start(state); !(state)->done; unwind_user_next(state))
static int unwind_user_next(struct unwind_user_state *state)
{
- /* no implementation yet */
+ struct unwind_user_frame *frame;
+ unsigned long cfa = 0, fp, ra = 0;
+ unsigned int shift;
+
+ if (state->done)
+ return -EINVAL;
+
+ if (fp_state(state))
+ frame = &fp_frame;
+ else
+ goto done;
+
+ if (frame->use_fp) {
+ if (state->fp < state->sp)
+ goto done;
+ cfa = state->fp;
+ } else {
+ cfa = state->sp;
+ }
+
+ /* Get the Canonical Frame Address (CFA) */
+ cfa += frame->cfa_off;
+
+ /* stack going in wrong direction? */
+ if (cfa <= state->sp)
+ goto done;
+
+ /* Make sure that the address is word aligned */
+ shift = sizeof(long) == 4 ? 2 : 3;
+ if ((cfa + frame->ra_off) & ((1 << shift) - 1))
+ goto done;
+
+ /* Find the Return Address (RA) */
+ if (get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
+ goto done;
+
+ if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
+ goto done;
+
+ state->ip = ra;
+ state->sp = cfa;
+ if (frame->fp_off)
+ state->fp = fp;
+
+ return 0;
+
+done:
+ state->done = true;
return -EINVAL;
}
@@ -27,7 +85,10 @@ static int unwind_user_start(struct unwind_user_state *state)
return -EINVAL;
}
- state->type = UNWIND_USER_TYPE_NONE;
+ if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+ state->type = UNWIND_USER_TYPE_FP;
+ else
+ state->type = UNWIND_USER_TYPE_NONE;
state->ip = instruction_pointer(regs);
state->sp = user_stack_pointer(regs);
--
2.47.2
* [PATCH v13 03/14] unwind_user: Add compat mode frame pointer support
2025-07-08 1:22 [PATCH v13 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 01/14] unwind_user: Add user space unwinding API Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 02/14] unwind_user: Add frame pointer support Steven Rostedt
@ 2025-07-08 1:22 ` Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 04/14] unwind_user/deferred: Add unwind_user_faultable() Steven Rostedt
` (10 subsequent siblings)
13 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2025-07-08 1:22 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Josh Poimboeuf <jpoimboe@kernel.org>
Add optional support for user space compat mode frame pointer unwinding.
If supported, the arch needs to enable CONFIG_HAVE_UNWIND_USER_COMPAT_FP
and define ARCH_INIT_USER_COMPAT_FP_FRAME.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v12: https://lore.kernel.org/20250701005451.055982038@goodmis.org
- With new alignment check, updated to handle compat mode.
arch/Kconfig | 4 +++
include/asm-generic/Kbuild | 1 +
include/asm-generic/unwind_user_types.h | 5 ++++
include/linux/unwind_user.h | 5 ++++
include/linux/unwind_user_types.h | 7 +++++
kernel/unwind/user.c | 34 +++++++++++++++++++++----
6 files changed, 51 insertions(+), 5 deletions(-)
create mode 100644 include/asm-generic/unwind_user_types.h
diff --git a/arch/Kconfig b/arch/Kconfig
index 8e3fd723bd74..2c41d3072910 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -442,6 +442,10 @@ config HAVE_UNWIND_USER_FP
bool
select UNWIND_USER
+config HAVE_UNWIND_USER_COMPAT_FP
+ bool
+ depends on HAVE_UNWIND_USER_FP
+
config HAVE_PERF_REGS
bool
help
diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
index 295c94a3ccc1..b797a2434396 100644
--- a/include/asm-generic/Kbuild
+++ b/include/asm-generic/Kbuild
@@ -60,6 +60,7 @@ mandatory-y += topology.h
mandatory-y += trace_clock.h
mandatory-y += uaccess.h
mandatory-y += unwind_user.h
+mandatory-y += unwind_user_types.h
mandatory-y += vermagic.h
mandatory-y += vga.h
mandatory-y += video.h
diff --git a/include/asm-generic/unwind_user_types.h b/include/asm-generic/unwind_user_types.h
new file mode 100644
index 000000000000..f568b82e52cd
--- /dev/null
+++ b/include/asm-generic/unwind_user_types.h
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_UNWIND_USER_TYPES_H
+#define _ASM_GENERIC_UNWIND_USER_TYPES_H
+
+#endif /* _ASM_GENERIC_UNWIND_USER_TYPES_H */
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index 7f7282516bf5..834b643afd3a 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -9,6 +9,11 @@
#define ARCH_INIT_USER_FP_FRAME
#endif
+#ifndef ARCH_INIT_USER_COMPAT_FP_FRAME
+ #define ARCH_INIT_USER_COMPAT_FP_FRAME
+ #define in_compat_mode(regs) false
+#endif
+
int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
#endif /* _LINUX_UNWIND_USER_H */
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 65bd070eb6b0..0b6563951ca4 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -3,10 +3,16 @@
#define _LINUX_UNWIND_USER_TYPES_H
#include <linux/types.h>
+#include <asm/unwind_user_types.h>
+
+#ifndef arch_unwind_user_state
+struct arch_unwind_user_state {};
+#endif
enum unwind_user_type {
UNWIND_USER_TYPE_NONE,
UNWIND_USER_TYPE_FP,
+ UNWIND_USER_TYPE_COMPAT_FP,
};
struct unwind_stacktrace {
@@ -25,6 +31,7 @@ struct unwind_user_state {
unsigned long ip;
unsigned long sp;
unsigned long fp;
+ struct arch_unwind_user_state arch;
enum unwind_user_type type;
bool done;
};
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 62b3ef37d71b..03775191447c 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -12,6 +12,10 @@ static struct unwind_user_frame fp_frame = {
ARCH_INIT_USER_FP_FRAME
};
+static struct unwind_user_frame compat_fp_frame = {
+ ARCH_INIT_USER_COMPAT_FP_FRAME
+};
+
static inline bool fp_state(struct unwind_user_state *state)
{
return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP) &&
@@ -21,6 +25,22 @@ static inline bool fp_state(struct unwind_user_state *state)
#define for_each_user_frame(state) \
for (unwind_user_start(state); !(state)->done; unwind_user_next(state))
+static inline bool compat_fp_state(struct unwind_user_state *state)
+{
+ return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) &&
+ state->type == UNWIND_USER_TYPE_COMPAT_FP;
+}
+
+#define unwind_get_user_long(to, from, state) \
+({ \
+ int __ret; \
+ if (compat_fp_state(state)) \
+ __ret = get_user(to, (u32 __user *)(from)); \
+ else \
+ __ret = get_user(to, (unsigned long __user *)(from)); \
+ __ret; \
+})
+
static int unwind_user_next(struct unwind_user_state *state)
{
struct unwind_user_frame *frame;
@@ -30,7 +50,9 @@ static int unwind_user_next(struct unwind_user_state *state)
if (state->done)
return -EINVAL;
- if (fp_state(state))
+ if (compat_fp_state(state))
+ frame = &compat_fp_frame;
+ else if (fp_state(state))
frame = &fp_frame;
else
goto done;
@@ -51,15 +73,15 @@ static int unwind_user_next(struct unwind_user_state *state)
goto done;
/* Make sure that the address is word aligned */
- shift = sizeof(long) == 4 ? 2 : 3;
+ shift = sizeof(long) == 4 || compat_fp_state(state) ? 2 : 3;
if ((cfa + frame->ra_off) & ((1 << shift) - 1))
goto done;
/* Find the Return Address (RA) */
- if (get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
+ if (unwind_get_user_long(ra, cfa + frame->ra_off, state))
goto done;
- if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
+ if (frame->fp_off && unwind_get_user_long(fp, cfa + frame->fp_off, state))
goto done;
state->ip = ra;
@@ -85,7 +107,9 @@ static int unwind_user_start(struct unwind_user_state *state)
return -EINVAL;
}
- if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+ if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) && in_compat_mode(regs))
+ state->type = UNWIND_USER_TYPE_COMPAT_FP;
+ else if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
state->type = UNWIND_USER_TYPE_FP;
else
state->type = UNWIND_USER_TYPE_NONE;
--
2.47.2
* [PATCH v13 04/14] unwind_user/deferred: Add unwind_user_faultable()
2025-07-08 1:22 [PATCH v13 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (2 preceding siblings ...)
2025-07-08 1:22 ` [PATCH v13 03/14] unwind_user: Add compat mode " Steven Rostedt
@ 2025-07-08 1:22 ` Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 05/14] unwind_user/deferred: Add unwind cache Steven Rostedt
` (9 subsequent siblings)
13 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2025-07-08 1:22 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Steven Rostedt <rostedt@goodmis.org>
Add a new API to retrieve a user space callstack, called
unwind_user_faultable(). The difference between this user space stack
tracer and the current one is that this must be called from a faultable
context, as it may use routines that access user space data which needs
to be faulted in.
It can safely be called when entering or exiting a system call, as user
space memory can still be faulted in there.
This code is based on Josh Poimboeuf's deferred unwinding code:
Link: https://lore.kernel.org/all/6052e8487746603bdb29b65f4033e739092d9925.1737511963.git.jpoimboe@kernel.org/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/sched.h | 5 +++
include/linux/unwind_deferred.h | 24 +++++++++++
include/linux/unwind_deferred_types.h | 9 ++++
kernel/fork.c | 4 ++
kernel/unwind/Makefile | 2 +-
kernel/unwind/deferred.c | 60 +++++++++++++++++++++++++++
6 files changed, 103 insertions(+), 1 deletion(-)
create mode 100644 include/linux/unwind_deferred.h
create mode 100644 include/linux/unwind_deferred_types.h
create mode 100644 kernel/unwind/deferred.c
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4f78a64beb52..59fdf7d9bb1e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -46,6 +46,7 @@
#include <linux/rv.h>
#include <linux/uidgid_types.h>
#include <linux/tracepoint-defs.h>
+#include <linux/unwind_deferred_types.h>
#include <asm/kmap_size.h>
/* task_struct member predeclarations (sorted alphabetically): */
@@ -1654,6 +1655,10 @@ struct task_struct {
struct user_event_mm *user_event_mm;
#endif
+#ifdef CONFIG_UNWIND_USER
+ struct unwind_task_info unwind_info;
+#endif
+
/* CPU-specific state of this task: */
struct thread_struct thread;
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
new file mode 100644
index 000000000000..a5f6e8f8a1a2
--- /dev/null
+++ b/include/linux/unwind_deferred.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_DEFERRED_H
+#define _LINUX_UNWIND_USER_DEFERRED_H
+
+#include <linux/unwind_user.h>
+#include <linux/unwind_deferred_types.h>
+
+#ifdef CONFIG_UNWIND_USER
+
+void unwind_task_init(struct task_struct *task);
+void unwind_task_free(struct task_struct *task);
+
+int unwind_user_faultable(struct unwind_stacktrace *trace);
+
+#else /* !CONFIG_UNWIND_USER */
+
+static inline void unwind_task_init(struct task_struct *task) {}
+static inline void unwind_task_free(struct task_struct *task) {}
+
+static inline int unwind_user_faultable(struct unwind_stacktrace *trace) { return -ENOSYS; }
+
+#endif /* !CONFIG_UNWIND_USER */
+
+#endif /* _LINUX_UNWIND_USER_DEFERRED_H */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
new file mode 100644
index 000000000000..aa32db574e43
--- /dev/null
+++ b/include/linux/unwind_deferred_types.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+
+struct unwind_task_info {
+ unsigned long *entries;
+};
+
+#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 1ee8eb11f38b..3341d50c61f2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -105,6 +105,7 @@
#include <uapi/linux/pidfd.h>
#include <linux/pidfs.h>
#include <linux/tick.h>
+#include <linux/unwind_deferred.h>
#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -732,6 +733,7 @@ void __put_task_struct(struct task_struct *tsk)
WARN_ON(refcount_read(&tsk->usage));
WARN_ON(tsk == current);
+ unwind_task_free(tsk);
sched_ext_free(tsk);
io_uring_free(tsk);
cgroup_free(tsk);
@@ -2135,6 +2137,8 @@ __latent_entropy struct task_struct *copy_process(
p->bpf_ctx = NULL;
#endif
+ unwind_task_init(p);
+
/* Perform scheduler related setup. Assign this task to a CPU. */
retval = sched_fork(clone_flags, p);
if (retval)
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
index 349ce3677526..eae37bea54fd 100644
--- a/kernel/unwind/Makefile
+++ b/kernel/unwind/Makefile
@@ -1 +1 @@
- obj-$(CONFIG_UNWIND_USER) += user.o
+ obj-$(CONFIG_UNWIND_USER) += user.o deferred.o
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
new file mode 100644
index 000000000000..a0badbeb3cc1
--- /dev/null
+++ b/kernel/unwind/deferred.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Deferred user space unwinding
+ */
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/unwind_deferred.h>
+
+#define UNWIND_MAX_ENTRIES 512
+
+/**
+ * unwind_user_faultable - Produce a user stacktrace in faultable context
+ * @trace: The descriptor that will store the user stacktrace
+ *
+ * This must be called in a known faultable context (usually when entering
+ * or exiting user space). Depending on the available implementations
+ * the @trace will be loaded with the addresses of the user space stacktrace
+ * if it can be found.
+ *
+ * Return: 0 on success and negative on error
+ * On success @trace will contain the user space stacktrace
+ */
+int unwind_user_faultable(struct unwind_stacktrace *trace)
+{
+	struct unwind_task_info *info = &current->unwind_info;
+
+ /* Should always be called from faultable context */
+ might_fault();
+
+ if (current->flags & PF_EXITING)
+ return -EINVAL;
+
+ if (!info->entries) {
+ info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
+ GFP_KERNEL);
+ if (!info->entries)
+ return -ENOMEM;
+ }
+
+ trace->nr = 0;
+ trace->entries = info->entries;
+ unwind_user(trace, UNWIND_MAX_ENTRIES);
+
+ return 0;
+}
+
+void unwind_task_init(struct task_struct *task)
+{
+ struct unwind_task_info *info = &task->unwind_info;
+
+ memset(info, 0, sizeof(*info));
+}
+
+void unwind_task_free(struct task_struct *task)
+{
+ struct unwind_task_info *info = &task->unwind_info;
+
+ kfree(info->entries);
+}
--
2.47.2
* [PATCH v13 05/14] unwind_user/deferred: Add unwind cache
2025-07-08 1:22 [PATCH v13 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (3 preceding siblings ...)
2025-07-08 1:22 ` [PATCH v13 04/14] unwind_user/deferred: Add unwind_user_faultable() Steven Rostedt
@ 2025-07-08 1:22 ` Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 06/14] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
` (8 subsequent siblings)
13 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2025-07-08 1:22 UTC (permalink / raw)
From: Josh Poimboeuf <jpoimboe@kernel.org>
Cache the results of the unwind to ensure the unwind is only performed
once, even when called by multiple tracers.
The cache's nr_entries is cleared every time the task exits the kernel.
When a stacktrace is requested, nr_entries is set to the number of
entries in the stacktrace. If another stacktrace is requested while
nr_entries is non-zero, the cache already holds the stacktrace that
would be retrieved, so the unwind is not performed again and the cached
entries are handed to the caller.
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/entry-common.h | 2 ++
include/linux/unwind_deferred.h | 8 +++++++
include/linux/unwind_deferred_types.h | 7 +++++-
kernel/unwind/deferred.c | 31 +++++++++++++++++++++------
4 files changed, 40 insertions(+), 8 deletions(-)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index f94f3fdf15fc..8908b8eeb99b 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -12,6 +12,7 @@
#include <linux/resume_user_mode.h>
#include <linux/tick.h>
#include <linux/kmsan.h>
+#include <linux/unwind_deferred.h>
#include <asm/entry-common.h>
#include <asm/syscall.h>
@@ -362,6 +363,7 @@ static __always_inline void exit_to_user_mode(void)
lockdep_hardirqs_on_prepare();
instrumentation_end();
+ unwind_reset_info();
user_enter_irqoff();
arch_exit_to_user_mode();
lockdep_hardirqs_on(CALLER_ADDR0);
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index a5f6e8f8a1a2..baacf4a1eb4c 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -12,6 +12,12 @@ void unwind_task_free(struct task_struct *task);
int unwind_user_faultable(struct unwind_stacktrace *trace);
+static __always_inline void unwind_reset_info(void)
+{
+ if (unlikely(current->unwind_info.cache))
+ current->unwind_info.cache->nr_entries = 0;
+}
+
#else /* !CONFIG_UNWIND_USER */
static inline void unwind_task_init(struct task_struct *task) {}
@@ -19,6 +25,8 @@ static inline void unwind_task_free(struct task_struct *task) {}
static inline int unwind_user_faultable(struct unwind_stacktrace *trace) { return -ENOSYS; }
+static inline void unwind_reset_info(void) {}
+
#endif /* !CONFIG_UNWIND_USER */
#endif /* _LINUX_UNWIND_USER_DEFERRED_H */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index aa32db574e43..db5b54b18828 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -2,8 +2,13 @@
#ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+struct unwind_cache {
+ unsigned int nr_entries;
+ unsigned long entries[];
+};
+
struct unwind_task_info {
- unsigned long *entries;
+ struct unwind_cache *cache;
};
#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index a0badbeb3cc1..96368a5aa522 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -4,10 +4,13 @@
*/
#include <linux/kernel.h>
#include <linux/sched.h>
+#include <linux/sizes.h>
#include <linux/slab.h>
#include <linux/unwind_deferred.h>
-#define UNWIND_MAX_ENTRIES 512
+/* Make the cache fit in a 4K page */
+#define UNWIND_MAX_ENTRIES \
+ ((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
/**
* unwind_user_faultable - Produce a user stacktrace in faultable context
@@ -24,6 +27,7 @@
int unwind_user_faultable(struct unwind_stacktrace *trace)
{
struct unwind_task_info *info = &current->unwind_info;
+ struct unwind_cache *cache;
/* Should always be called from faultable context */
might_fault();
@@ -31,17 +35,30 @@ int unwind_user_faultable(struct unwind_stacktrace *trace)
if (current->flags & PF_EXITING)
return -EINVAL;
- if (!info->entries) {
- info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
- GFP_KERNEL);
- if (!info->entries)
+ if (!info->cache) {
+ info->cache = kzalloc(struct_size(cache, entries, UNWIND_MAX_ENTRIES),
+ GFP_KERNEL);
+ if (!info->cache)
return -ENOMEM;
}
+ cache = info->cache;
+ trace->entries = cache->entries;
+
+ if (cache->nr_entries) {
+ /*
+ * The user stack has already been previously unwound in this
+ * entry context. Skip the unwind and use the cache.
+ */
+ trace->nr = cache->nr_entries;
+ return 0;
+ }
+
trace->nr = 0;
- trace->entries = info->entries;
unwind_user(trace, UNWIND_MAX_ENTRIES);
+ cache->nr_entries = trace->nr;
+
return 0;
}
@@ -56,5 +73,5 @@ void unwind_task_free(struct task_struct *task)
{
struct unwind_task_info *info = &task->unwind_info;
- kfree(info->entries);
+ kfree(info->cache);
}
--
2.47.2
* [PATCH v13 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-07-08 1:22 [PATCH v13 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (4 preceding siblings ...)
2025-07-08 1:22 ` [PATCH v13 05/14] unwind_user/deferred: Add unwind cache Steven Rostedt
@ 2025-07-08 1:22 ` Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
` (7 subsequent siblings)
13 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2025-07-08 1:22 UTC (permalink / raw)
From: Josh Poimboeuf <jpoimboe@kernel.org>
Add an interface for scheduling task work to unwind the user space stack
before returning to user space. This solves several problems for its
callers:
- Ensure the unwind happens in task context even if the caller may be
running in interrupt context.
- Avoid duplicate unwinds, whether called multiple times by the same
caller or by different callers.
- Create a "context cookie" which allows trace post-processing to
correlate kernel unwinds/traces with the user unwind.
A "cookie" is introduced to detect when the stacktrace would be the
same. A cookie is generated the first time a user space stacktrace is
requested after the task enters the kernel. As the stacktrace is saved
on the task_struct while the task is in the kernel, if another request
comes in and the cookie is still the same, the saved stacktrace is used
and does not have to be regenerated.
The cookie is passed to the caller on request, and when the stacktrace
is generated upon returning to user space, the requester's callback is
called with the cookie as well as the stacktrace. The cookie is cleared
when the task goes back to user space. Note, this currently adds another
conditional to the unwind_reset_info() path that is always called when
returning to user space, but future changes will collapse this back to a
single conditional.
A global list is created and protected by a global mutex that holds
tracers that register with the unwind infrastructure. The number of
registered tracers will be limited in future changes. Each perf program or
ftrace instance will register its own descriptor to use for deferred
unwind stack traces.
Note, the function unwind_deferred_task_work() that gets called when
returning to user space uses a global mutex for synchronization, which
will be a big bottleneck. It will be replaced by SRCU, but that change
adds some complex synchronization that deserves its own commit.
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v12: https://lore.kernel.org/20250701005451.571473750@goodmis.org
- Replaced the timestamp with the generated cookie logic again. Instead of
using a 64 bit word where the CPU part of the cookie is just 12 bits,
make it two 32 bit words, where the CPU that the cookie is generated on
is one word and the second word is just a per cpu counter. This allows
for just using a 32 bit cmpxchg which works on all archs that have safe
NMI cmpxchg.
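The two 32-bit word layout described in the changelog can be sketched in
userspace C. This is a simplified model (model_make_cookie() is
hypothetical and ignores the cmpxchg race handling and per-CPU storage
of the real get_cookie() in the diff below):

```c
#include <stdint.h>

/*
 * Userspace copy of the id layout: the CPU in one 32-bit word, a
 * counter in the other, readable as one 64-bit cookie.
 */
union unwind_task_id {
	struct {
		uint32_t cpu;
		uint32_t cnt;
	};
	uint64_t id;
};

/*
 * Simplified model of get_cookie(): the counter advances by 2 and is
 * OR'd with 1, and the CPU is stored plus 1, so neither word is ever
 * zero and the 64-bit cookie is never zero.
 */
static uint64_t model_make_cookie(uint32_t cpu, uint32_t *percpu_ctr)
{
	union unwind_task_id id;

	*percpu_ctr += 2;
	id.cnt = *percpu_ctr | 1;
	id.cpu = cpu + 1;

	return id.id;
}
```

Successive entry contexts on the same CPU thus always get distinct,
non-zero cookies.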
include/linux/unwind_deferred.h | 24 ++++
include/linux/unwind_deferred_types.h | 12 ++
kernel/unwind/deferred.c | 159 +++++++++++++++++++++++++-
3 files changed, 194 insertions(+), 1 deletion(-)
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index baacf4a1eb4c..14efd8c027aa 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -2,9 +2,19 @@
#ifndef _LINUX_UNWIND_USER_DEFERRED_H
#define _LINUX_UNWIND_USER_DEFERRED_H
+#include <linux/task_work.h>
#include <linux/unwind_user.h>
#include <linux/unwind_deferred_types.h>
+struct unwind_work;
+
+typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stacktrace *trace, u64 cookie);
+
+struct unwind_work {
+ struct list_head list;
+ unwind_callback_t func;
+};
+
#ifdef CONFIG_UNWIND_USER
void unwind_task_init(struct task_struct *task);
@@ -12,8 +22,19 @@ void unwind_task_free(struct task_struct *task);
int unwind_user_faultable(struct unwind_stacktrace *trace);
+int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
+int unwind_deferred_request(struct unwind_work *work, u64 *cookie);
+void unwind_deferred_cancel(struct unwind_work *work);
+
static __always_inline void unwind_reset_info(void)
{
+ if (unlikely(current->unwind_info.id.id))
+ current->unwind_info.id.id = 0;
+ /*
+ * As unwind_user_faultable() can be called directly and
+ * depends on nr_entries being cleared on exit to user,
+ * this needs to be a separate conditional.
+ */
if (unlikely(current->unwind_info.cache))
current->unwind_info.cache->nr_entries = 0;
}
@@ -24,6 +45,9 @@ static inline void unwind_task_init(struct task_struct *task) {}
static inline void unwind_task_free(struct task_struct *task) {}
static inline int unwind_user_faultable(struct unwind_stacktrace *trace) { return -ENOSYS; }
+static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func) { return -ENOSYS; }
+static inline int unwind_deferred_request(struct unwind_work *work, u64 *timestamp) { return -ENOSYS; }
+static inline void unwind_deferred_cancel(struct unwind_work *work) {}
static inline void unwind_reset_info(void) {}
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index db5b54b18828..79b4f8cece53 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -7,8 +7,20 @@ struct unwind_cache {
unsigned long entries[];
};
+
+union unwind_task_id {
+ struct {
+ u32 cpu;
+ u32 cnt;
+ };
+ u64 id;
+};
+
struct unwind_task_info {
struct unwind_cache *cache;
+ struct callback_head work;
+ union unwind_task_id id;
+ int pending;
};
#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 96368a5aa522..b1faaa55e5d5 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -2,16 +2,66 @@
/*
* Deferred user space unwinding
*/
+#include <linux/sched/task_stack.h>
+#include <linux/unwind_deferred.h>
+#include <linux/sched/clock.h>
+#include <linux/task_work.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/sizes.h>
#include <linux/slab.h>
-#include <linux/unwind_deferred.h>
+#include <linux/mm.h>
/* Make the cache fit in a 4K page */
#define UNWIND_MAX_ENTRIES \
((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
+/* Guards adding to and reading the list of callbacks */
+static DEFINE_MUTEX(callback_mutex);
+static LIST_HEAD(callbacks);
+
+/*
+ * This is a unique percpu identifier for a given task entry context.
+ * Conceptually, it's incremented every time the CPU enters the kernel from
+ * user space, so that each "entry context" on the CPU gets a unique ID. In
+ * reality, as an optimization, it's only incremented on demand for the first
+ * deferred unwind request after a given entry-from-user.
+ *
+ * It's combined with the CPU id to make a systemwide-unique "context cookie".
+ */
+static DEFINE_PER_CPU(u32, unwind_ctx_ctr);
+
+/*
+ * The context cookie is a unique identifier that is assigned to a user
+ * space stacktrace. As the user space stacktrace remains the same while
+ * the task is in the kernel, the cookie is an identifier for the stacktrace.
+ * Although it is possible for the stacktrace to get another cookie if another
+ * request is made after the cookie was cleared and before reentering user
+ * space.
+ */
+static u64 get_cookie(struct unwind_task_info *info)
+{
+ u32 cpu_cnt;
+ u32 cnt;
+ u32 old = 0;
+
+ if (info->id.cpu)
+ return info->id.id;
+
+ cpu_cnt = __this_cpu_read(unwind_ctx_ctr);
+ cpu_cnt += 2;
+ cnt = cpu_cnt | 1; /* Always make non zero */
+
+ if (try_cmpxchg(&info->id.cnt, &old, cnt)) {
+ /* Update the per cpu counter */
+ __this_cpu_write(unwind_ctx_ctr, cpu_cnt);
+ }
+	/* Interrupts are disabled, the CPU will always be the same */
+ info->id.cpu = smp_processor_id() + 1; /* Must be non zero */
+
+ return info->id.id;
+}
+
/**
* unwind_user_faultable - Produce a user stacktrace in faultable context
* @trace: The descriptor that will store the user stacktrace
@@ -62,11 +112,117 @@ int unwind_user_faultable(struct unwind_stacktrace *trace)
return 0;
}
+static void unwind_deferred_task_work(struct callback_head *head)
+{
+ struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
+ struct unwind_stacktrace trace;
+ struct unwind_work *work;
+ u64 cookie;
+
+ if (WARN_ON_ONCE(!info->pending))
+ return;
+
+ /* Allow work to come in again */
+ WRITE_ONCE(info->pending, 0);
+
+ /*
+ * From here on out, the callback must always be called, even if it's
+ * just an empty trace.
+ */
+ trace.nr = 0;
+ trace.entries = NULL;
+
+ unwind_user_faultable(&trace);
+
+ cookie = info->id.id;
+
+ guard(mutex)(&callback_mutex);
+ list_for_each_entry(work, &callbacks, list) {
+ work->func(work, &trace, cookie);
+ }
+}
+
+/**
+ * unwind_deferred_request - Request a user stacktrace on task exit
+ * @work: Unwind descriptor requesting the trace
+ * @cookie: The cookie of the first request made for this task
+ *
+ * Schedule a user space unwind to be done in task work before exiting the
+ * kernel.
+ *
+ * The returned @cookie output is the generated cookie of the very first
+ * request for a user space stacktrace for this task since it entered the
+ * kernel. It can be from a request by any caller of this infrastructure.
+ * Its value will also be passed to the callback function. It can be
+ * used to stitch kernel and user stack traces together in post-processing.
+ *
+ * It's valid to call this function multiple times for the same @work within
+ * the same task entry context. Each call will return the same cookie
+ * while the task hasn't left the kernel. If the callback is not pending
+ * because it has already been previously called for the same entry context,
+ * it will be called again with the same stack trace and cookie.
+ *
+ * Return: 1 if the callback was already queued.
+ * 0 if the callback successfully was queued.
+ * Negative if there's an error.
+ * @cookie holds the cookie of the first request by any user
+ */
+int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
+{
+	struct unwind_task_info *info = &current->unwind_info;
+ int ret;
+
+ *cookie = 0;
+
+ if (WARN_ON_ONCE(in_nmi()))
+ return -EINVAL;
+
+ if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
+ !user_mode(task_pt_regs(current)))
+ return -EINVAL;
+
+ guard(irqsave)();
+
+ *cookie = get_cookie(info);
+
+ /* callback already pending? */
+ if (info->pending)
+ return 1;
+
+ /* The work has been claimed, now schedule it. */
+ ret = task_work_add(current, &info->work, TWA_RESUME);
+ if (WARN_ON_ONCE(ret))
+ return ret;
+
+ info->pending = 1;
+ return 0;
+}
+
+void unwind_deferred_cancel(struct unwind_work *work)
+{
+ if (!work)
+ return;
+
+ guard(mutex)(&callback_mutex);
+ list_del(&work->list);
+}
+
+int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
+{
+ memset(work, 0, sizeof(*work));
+
+ guard(mutex)(&callback_mutex);
+ list_add(&work->list, &callbacks);
+ work->func = func;
+ return 0;
+}
+
void unwind_task_init(struct task_struct *task)
{
struct unwind_task_info *info = &task->unwind_info;
memset(info, 0, sizeof(*info));
+ init_task_work(&info->work, unwind_deferred_task_work);
}
void unwind_task_free(struct task_struct *task)
@@ -74,4 +230,5 @@ void unwind_task_free(struct task_struct *task)
struct unwind_task_info *info = &task->unwind_info;
kfree(info->cache);
+ task_work_cancel(task, &info->work);
}
--
2.47.2
* [PATCH v13 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-07-08 1:22 [PATCH v13 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (5 preceding siblings ...)
2025-07-08 1:22 ` [PATCH v13 06/14] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
@ 2025-07-08 1:22 ` Steven Rostedt
2025-07-14 13:29 ` Peter Zijlstra
2025-07-08 1:22 ` [PATCH v13 08/14] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
` (6 subsequent siblings)
13 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2025-07-08 1:22 UTC (permalink / raw)
From: Steven Rostedt <rostedt@goodmis.org>
Make unwind_deferred_request() NMI-safe so tracers in NMI context can
call it and safely request a user space stacktrace when the task exits.
Note, this is only allowed for architectures that implement a safe
cmpxchg. If an architecture requests a deferred stack trace from NMI
context that does not support a safe NMI cmpxchg, it will get an -EINVAL.
For those architectures, another method (perhaps an irqwork) would be
needed to request a deferred user space stack trace. That can be dealt
with later if one of these architectures requires this feature.
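The claim step that makes the request NMI-safe can be modeled with a C11
atomic in userspace. This is only a sketch of the 0 -> 1 claim pattern
(model_claim_pending() is a stand-in; the kernel patch below uses
local_try_cmpxchg() on a local_t, not a C11 atomic):

```c
#include <stdatomic.h>

/*
 * Model of the claim step in unwind_deferred_request(): only one
 * context wins the 0 -> 1 transition on the pending flag; every
 * other context sees the work as already claimed.
 */
static int model_claim_pending(atomic_long *pending)
{
	long expected = atomic_load(pending);

	if (expected)
		return 1;	/* callback already queued */

	/* An "NMI" may swoop in between the load and this cmpxchg. */
	if (!atomic_compare_exchange_strong(pending, &expected, 1))
		return 1;

	return 0;		/* claimed: caller schedules task work */
}
```

Whichever context wins the compare-and-exchange goes on to call
task_work_add(); the loser simply reports that the callback is already
pending.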
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v12: https://lore.kernel.org/20250701005451.737614486@goodmis.org
- Now that the timestamp has been replaced by a cookie that uses only a 32
bit cmpxchg(), this code just checks if the architecture has a safe
cmpxchg that can be used in NMI and doesn't do the 64 bit check.
Only the pending value is converted to local_t.
include/linux/unwind_deferred_types.h | 4 +-
kernel/unwind/deferred.c | 56 ++++++++++++++++++++++-----
2 files changed, 49 insertions(+), 11 deletions(-)
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 79b4f8cece53..cd95ed1c8610 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -2,6 +2,8 @@
#ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+#include <asm/local.h>
+
struct unwind_cache {
unsigned int nr_entries;
unsigned long entries[];
@@ -20,7 +22,7 @@ struct unwind_task_info {
struct unwind_cache *cache;
struct callback_head work;
union unwind_task_id id;
- int pending;
+ local_t pending;
};
#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index b1faaa55e5d5..2417e4ebbc82 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -12,6 +12,31 @@
#include <linux/slab.h>
#include <linux/mm.h>
+/*
+ * For requesting a deferred user space stack trace from NMI context
+ * the architecture must support a safe cmpxchg in NMI context.
+ * For those architectures that do not have that, then it cannot ask
+ * for a deferred user space stack trace from an NMI context. If it
+ * does, then it will get -EINVAL.
+ */
+#if defined(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG)
+# define CAN_USE_IN_NMI 1
+static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
+{
+ u32 old = 0;
+
+ return try_cmpxchg(&info->id.cnt, &old, cnt);
+}
+#else
+# define CAN_USE_IN_NMI 0
+/* When NMIs are not allowed, this always succeeds */
+static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
+{
+ info->id.cnt = cnt;
+ return true;
+}
+#endif
+
/* Make the cache fit in a 4K page */
#define UNWIND_MAX_ENTRIES \
((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
@@ -43,7 +68,6 @@ static u64 get_cookie(struct unwind_task_info *info)
{
u32 cpu_cnt;
u32 cnt;
- u32 old = 0;
if (info->id.cpu)
return info->id.id;
@@ -52,7 +76,7 @@ static u64 get_cookie(struct unwind_task_info *info)
cpu_cnt += 2;
cnt = cpu_cnt | 1; /* Always make non zero */
- if (try_cmpxchg(&info->id.cnt, &old, cnt)) {
+ if (try_assign_cnt(info, cnt)) {
/* Update the per cpu counter */
__this_cpu_write(unwind_ctx_ctr, cpu_cnt);
}
@@ -119,11 +143,11 @@ static void unwind_deferred_task_work(struct callback_head *head)
struct unwind_work *work;
u64 cookie;
- if (WARN_ON_ONCE(!info->pending))
+ if (WARN_ON_ONCE(!local_read(&info->pending)))
return;
/* Allow work to come in again */
- WRITE_ONCE(info->pending, 0);
+ local_set(&info->pending, 0);
/*
* From here on out, the callback must always be called, even if it's
@@ -170,31 +194,43 @@ static void unwind_deferred_task_work(struct callback_head *head)
int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
{
struct unwind_task_info *info = &current->unwind_info;
+ long pending;
int ret;
*cookie = 0;
- if (WARN_ON_ONCE(in_nmi()))
- return -EINVAL;
-
if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
!user_mode(task_pt_regs(current)))
return -EINVAL;
+ /* NMI requires having safe cmpxchg operations */
+ if (!CAN_USE_IN_NMI && in_nmi())
+ return -EINVAL;
+
guard(irqsave)();
*cookie = get_cookie(info);
/* callback already pending? */
- if (info->pending)
+ pending = local_read(&info->pending);
+ if (pending)
return 1;
+ if (CAN_USE_IN_NMI) {
+ /* Claim the work unless an NMI just now swooped in to do so. */
+ if (!local_try_cmpxchg(&info->pending, &pending, 1))
+ return 1;
+ } else {
+ local_set(&info->pending, 1);
+ }
+
/* The work has been claimed, now schedule it. */
ret = task_work_add(current, &info->work, TWA_RESUME);
- if (WARN_ON_ONCE(ret))
+ if (WARN_ON_ONCE(ret)) {
+ local_set(&info->pending, 0);
return ret;
+ }
- info->pending = 1;
return 0;
}
--
2.47.2
* [PATCH v13 08/14] unwind deferred: Use bitmask to determine which callbacks to call
2025-07-08 1:22 [PATCH v13 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (6 preceding siblings ...)
2025-07-08 1:22 ` [PATCH v13 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
@ 2025-07-08 1:22 ` Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 09/14] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
` (5 subsequent siblings)
13 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2025-07-08 1:22 UTC (permalink / raw)
From: Steven Rostedt <rostedt@goodmis.org>
In order to know which registered callback requested a stacktrace when
the task goes back to user space, add a bitmask to keep track of all
registered tracers. The bitmask is the size of a long, which means that
a 32 bit machine can have at most 32 registered tracers and a 64 bit
machine at most 64. This should not be an issue, as there should not be
more than 10 (unless BPF can abuse this?).
When a tracer registers with unwind_deferred_init() it will get a bit
number assigned to it. When a tracer requests a stacktrace, it will have
its bit set within the task_struct. When the task returns back to user
space, it will call the callbacks for all the registered tracers where
their bits are set in the task's mask.
When a tracer is removed by the unwind_deferred_cancel() all current tasks
will clear the associated bit, just in case another tracer gets registered
immediately afterward and then gets their callback called unexpectedly.
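The bit bookkeeping can be sketched as a userspace model (model_ffz(),
model_register_tracer() and model_unregister_tracer() are stand-ins for
the kernel's ffz(), unwind_deferred_init() and unwind_deferred_cancel();
the callback list and locking are omitted):

```c
#include <limits.h>

static unsigned long model_unwind_mask;	/* registered tracers */

/* Find first zero bit, like the kernel's ffz(). */
static int model_ffz(unsigned long mask)
{
	int bit;

	for (bit = 0; bit < (int)(sizeof(mask) * CHAR_BIT); bit++) {
		if (!(mask & (1UL << bit)))
			return bit;
	}
	return -1;
}

/* Model of unwind_deferred_init(): claim the first free bit. */
static int model_register_tracer(void)
{
	int bit;

	if (model_unwind_mask == ~0UL)
		return -1;	/* all bits taken (-EBUSY) */

	bit = model_ffz(model_unwind_mask);
	model_unwind_mask |= 1UL << bit;
	return bit;
}

/* Model of unwind_deferred_cancel() releasing a tracer's bit. */
static void model_unregister_tracer(int bit)
{
	model_unwind_mask &= ~(1UL << bit);
}
```

A freed bit is immediately reusable, which is exactly why the cancel
path in the patch must also clear that bit from every task's mask.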
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/unwind_deferred.h | 1 +
include/linux/unwind_deferred_types.h | 1 +
kernel/unwind/deferred.c | 36 ++++++++++++++++++++++++---
3 files changed, 34 insertions(+), 4 deletions(-)
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 14efd8c027aa..12bffdb0648e 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -13,6 +13,7 @@ typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stackt
struct unwind_work {
struct list_head list;
unwind_callback_t func;
+ int bit;
};
#ifdef CONFIG_UNWIND_USER
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index cd95ed1c8610..7a03a8672b0d 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -21,6 +21,7 @@ union unwind_task_id {
struct unwind_task_info {
struct unwind_cache *cache;
struct callback_head work;
+ unsigned long unwind_mask;
union unwind_task_id id;
local_t pending;
};
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 2417e4ebbc82..5edb648b7de7 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -44,6 +44,7 @@ static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
/* Guards adding to and reading the list of callbacks */
static DEFINE_MUTEX(callback_mutex);
static LIST_HEAD(callbacks);
+static unsigned long unwind_mask;
/*
* This is a unique percpu identifier for a given task entry context.
@@ -162,7 +163,10 @@ static void unwind_deferred_task_work(struct callback_head *head)
guard(mutex)(&callback_mutex);
list_for_each_entry(work, &callbacks, list) {
- work->func(work, &trace, cookie);
+ if (test_bit(work->bit, &info->unwind_mask)) {
+ work->func(work, &trace, cookie);
+ clear_bit(work->bit, &info->unwind_mask);
+ }
}
}
@@ -211,15 +215,19 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
*cookie = get_cookie(info);
+ /* This is already queued */
+ if (test_bit(work->bit, &info->unwind_mask))
+ return 1;
+
/* callback already pending? */
pending = local_read(&info->pending);
if (pending)
- return 1;
+ goto out;
if (CAN_USE_IN_NMI) {
/* Claim the work unless an NMI just now swooped in to do so. */
if (!local_try_cmpxchg(&info->pending, &pending, 1))
- return 1;
+ goto out;
} else {
local_set(&info->pending, 1);
}
@@ -231,16 +239,27 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
return ret;
}
- return 0;
+ out:
+ return test_and_set_bit(work->bit, &info->unwind_mask);
}
void unwind_deferred_cancel(struct unwind_work *work)
{
+ struct task_struct *g, *t;
+
if (!work)
return;
guard(mutex)(&callback_mutex);
list_del(&work->list);
+
+ __clear_bit(work->bit, &unwind_mask);
+
+ guard(rcu)();
+ /* Clear this bit from all threads */
+ for_each_process_thread(g, t) {
+ clear_bit(work->bit, &t->unwind_info.unwind_mask);
+ }
}
int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
@@ -248,6 +267,14 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
memset(work, 0, sizeof(*work));
guard(mutex)(&callback_mutex);
+
+ /* See if there's a bit in the mask available */
+ if (unwind_mask == ~0UL)
+ return -EBUSY;
+
+ work->bit = ffz(unwind_mask);
+ __set_bit(work->bit, &unwind_mask);
+
list_add(&work->list, &callbacks);
work->func = func;
return 0;
@@ -259,6 +286,7 @@ void unwind_task_init(struct task_struct *task)
memset(info, 0, sizeof(*info));
init_task_work(&info->work, unwind_deferred_task_work);
+ info->unwind_mask = 0;
}
void unwind_task_free(struct task_struct *task)
--
2.47.2
^ permalink raw reply related [flat|nested] 45+ messages in thread
* [PATCH v13 09/14] unwind deferred: Use SRCU unwind_deferred_task_work()
2025-07-08 1:22 [PATCH v13 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (7 preceding siblings ...)
2025-07-08 1:22 ` [PATCH v13 08/14] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
@ 2025-07-08 1:22 ` Steven Rostedt
2025-07-14 13:56 ` Peter Zijlstra
2025-07-08 1:22 ` [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space Steven Rostedt
` (4 subsequent siblings)
13 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2025-07-08 1:22 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Steven Rostedt <rostedt@goodmis.org>
Instead of using the callback_mutex to protect the linked list of callbacks
in unwind_deferred_task_work(), use SRCU. This function is called every
time a task that has a requested stack trace exits back to user space,
which can happen for many tasks on several CPUs at the same time. A mutex
here is a bottleneck that can cause contention and slow down performance.
As the callbacks themselves are allowed to sleep, regular RCU cannot be
used to protect the list. Instead use SRCU, as it still allows the
callbacks to sleep while the list can be read without needing to hold the
callback_mutex.
Link: https://lore.kernel.org/all/ca9bd83a-6c80-4ee0-a83c-224b9d60b755@efficios.com/
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/unwind/deferred.c | 35 ++++++++++++++++++++++++++---------
1 file changed, 26 insertions(+), 9 deletions(-)
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 5edb648b7de7..9aed9866f460 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -41,10 +41,11 @@ static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
#define UNWIND_MAX_ENTRIES \
((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
-/* Guards adding to and reading the list of callbacks */
+/* Guards adding to or removing from the list of callbacks */
static DEFINE_MUTEX(callback_mutex);
static LIST_HEAD(callbacks);
static unsigned long unwind_mask;
+DEFINE_STATIC_SRCU(unwind_srcu);
/*
* This is a unique percpu identifier for a given task entry context.
@@ -143,6 +144,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
struct unwind_stacktrace trace;
struct unwind_work *work;
u64 cookie;
+ int idx;
if (WARN_ON_ONCE(!local_read(&info->pending)))
return;
@@ -161,13 +163,15 @@ static void unwind_deferred_task_work(struct callback_head *head)
cookie = info->id.id;
- guard(mutex)(&callback_mutex);
- list_for_each_entry(work, &callbacks, list) {
+ idx = srcu_read_lock(&unwind_srcu);
+ list_for_each_entry_srcu(work, &callbacks, list,
+ srcu_read_lock_held(&unwind_srcu)) {
if (test_bit(work->bit, &info->unwind_mask)) {
work->func(work, &trace, cookie);
clear_bit(work->bit, &info->unwind_mask);
}
}
+ srcu_read_unlock(&unwind_srcu, idx);
}
/**
@@ -199,6 +203,7 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
{
struct unwind_task_info *info = &current->unwind_info;
long pending;
+ int bit;
int ret;
*cookie = 0;
@@ -211,12 +216,17 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
if (!CAN_USE_IN_NMI && in_nmi())
return -EINVAL;
+ /* Do not allow cancelled works to request again */
+ bit = READ_ONCE(work->bit);
+ if (WARN_ON_ONCE(bit < 0))
+ return -EINVAL;
+
guard(irqsave)();
*cookie = get_cookie(info);
/* This is already queued */
- if (test_bit(work->bit, &info->unwind_mask))
+ if (test_bit(bit, &info->unwind_mask))
return 1;
/* callback already pending? */
@@ -240,25 +250,32 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
}
out:
- return test_and_set_bit(work->bit, &info->unwind_mask);
+ return test_and_set_bit(bit, &info->unwind_mask);
}
void unwind_deferred_cancel(struct unwind_work *work)
{
struct task_struct *g, *t;
+ int bit;
if (!work)
return;
guard(mutex)(&callback_mutex);
- list_del(&work->list);
+ list_del_rcu(&work->list);
+ bit = work->bit;
+
+ /* Do not allow any more requests and prevent callbacks */
+ work->bit = -1;
+
+ __clear_bit(bit, &unwind_mask);
- __clear_bit(work->bit, &unwind_mask);
+ synchronize_srcu(&unwind_srcu);
guard(rcu)();
/* Clear this bit from all threads */
for_each_process_thread(g, t) {
- clear_bit(work->bit, &t->unwind_info.unwind_mask);
+ clear_bit(bit, &t->unwind_info.unwind_mask);
}
}
@@ -275,7 +292,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
work->bit = ffz(unwind_mask);
__set_bit(work->bit, &unwind_mask);
- list_add(&work->list, &callbacks);
+ list_add_rcu(&work->list, &callbacks);
work->func = func;
return 0;
}
--
2.47.2
* [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
2025-07-08 1:22 [PATCH v13 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (8 preceding siblings ...)
2025-07-08 1:22 ` [PATCH v13 09/14] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
@ 2025-07-08 1:22 ` Steven Rostedt
2025-07-15 10:29 ` Peter Zijlstra
2025-07-08 1:22 ` [PATCH v13 11/14] unwind: Add USED bit to only have one conditional on way " Steven Rostedt
` (3 subsequent siblings)
13 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2025-07-08 1:22 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Steven Rostedt <rostedt@goodmis.org>
When testing the deferred unwinder by attaching deferred user space
stacktraces to events, a livelock happened. This occurred when the
deferred unwinding was added to the irqs_disabled event, which triggers
after the task_work callbacks are called and before the task goes back to
user space.
The event callback would be registered while irqs were disabled, the
task_work would trigger, call the callback for this work, and clear the
work's bit. Then, before getting back to user space, irqs would be
disabled again, the event would trigger again, and a new task_work would
be registered. This caused an infinite loop and the system hung.
To prevent this, clear the bits at the very last moment before going back
to user space and when instrumentation is disabled. That is in
unwind_exit_to_user_mode().
Move the pending bit from a value on the task_struct to the most
significant bit of the unwind_mask (saves space on the task_struct). This
will allow modifying the pending bit along with the work bits atomically.
Instead of clearing a work's bit after its callback is called, the
clearing is delayed until exit. If the work is requested again, the
task_work is not queued again and the work is notified that its callback
has already been called for this task (via the UNWIND_ALREADY_EXECUTED
return value).
The pending bit is cleared before calling the callback functions, but the
current work bits remain set. If one of the called works makes a request
again, it will not trigger a new task_work because its bit is still
present in the task's unwind_mask.
If a new work makes a request, it will set both the pending bit and its
own bit, but clear the other work bits so that their callbacks do not get
called again.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v12: https://lore.kernel.org/20250701005452.242933931@goodmis.org
- Removed no longer used local.h headers from unwind_deferred_types.h
include/linux/unwind_deferred.h | 25 +++++++--
include/linux/unwind_deferred_types.h | 3 --
kernel/unwind/deferred.c | 76 ++++++++++++++++++---------
3 files changed, 74 insertions(+), 30 deletions(-)
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 12bffdb0648e..587e120c0fd6 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -18,6 +18,14 @@ struct unwind_work {
#ifdef CONFIG_UNWIND_USER
+#define UNWIND_PENDING_BIT (BITS_PER_LONG - 1)
+#define UNWIND_PENDING BIT(UNWIND_PENDING_BIT)
+
+enum {
+ UNWIND_ALREADY_PENDING = 1,
+ UNWIND_ALREADY_EXECUTED = 2,
+};
+
void unwind_task_init(struct task_struct *task);
void unwind_task_free(struct task_struct *task);
@@ -29,15 +37,26 @@ void unwind_deferred_cancel(struct unwind_work *work);
static __always_inline void unwind_reset_info(void)
{
- if (unlikely(current->unwind_info.id.id))
+ struct unwind_task_info *info = &current->unwind_info;
+ unsigned long bits;
+
+ /* Was there any unwinding? */
+ if (unlikely(info->unwind_mask)) {
+ bits = info->unwind_mask;
+ do {
+ /* Is a task_work going to run again before going back */
+ if (bits & UNWIND_PENDING)
+ return;
+ } while (!try_cmpxchg(&info->unwind_mask, &bits, 0UL));
current->unwind_info.id.id = 0;
+ }
/*
* As unwind_user_faultable() can be called directly and
* depends on nr_entries being cleared on exit to user,
* this needs to be a separate conditional.
*/
- if (unlikely(current->unwind_info.cache))
- current->unwind_info.cache->nr_entries = 0;
+ if (unlikely(info->cache))
+ info->cache->nr_entries = 0;
}
#else /* !CONFIG_UNWIND_USER */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 7a03a8672b0d..db6c65daf185 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -2,8 +2,6 @@
#ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
-#include <asm/local.h>
-
struct unwind_cache {
unsigned int nr_entries;
unsigned long entries[];
@@ -23,7 +21,6 @@ struct unwind_task_info {
struct callback_head work;
unsigned long unwind_mask;
union unwind_task_id id;
- local_t pending;
};
#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 9aed9866f460..256458f3eafe 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -47,6 +47,11 @@ static LIST_HEAD(callbacks);
static unsigned long unwind_mask;
DEFINE_STATIC_SRCU(unwind_srcu);
+static inline bool unwind_pending(struct unwind_task_info *info)
+{
+ return test_bit(UNWIND_PENDING_BIT, &info->unwind_mask);
+}
+
/*
* This is a unique percpu identifier for a given task entry context.
* Conceptually, it's incremented every time the CPU enters the kernel from
@@ -143,14 +148,17 @@ static void unwind_deferred_task_work(struct callback_head *head)
struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
struct unwind_stacktrace trace;
struct unwind_work *work;
+ unsigned long bits;
u64 cookie;
int idx;
- if (WARN_ON_ONCE(!local_read(&info->pending)))
+ if (WARN_ON_ONCE(!unwind_pending(info)))
return;
- /* Allow work to come in again */
- local_set(&info->pending, 0);
+ /* Clear pending bit but make sure to have the current bits */
+ bits = READ_ONCE(info->unwind_mask);
+ while (!try_cmpxchg(&info->unwind_mask, &bits, bits & ~UNWIND_PENDING))
+ ;
/*
* From here on out, the callback must always be called, even if it's
@@ -166,10 +174,8 @@ static void unwind_deferred_task_work(struct callback_head *head)
idx = srcu_read_lock(&unwind_srcu);
list_for_each_entry_srcu(work, &callbacks, list,
srcu_read_lock_held(&unwind_srcu)) {
- if (test_bit(work->bit, &info->unwind_mask)) {
+ if (test_bit(work->bit, &bits))
work->func(work, &trace, cookie);
- clear_bit(work->bit, &info->unwind_mask);
- }
}
srcu_read_unlock(&unwind_srcu, idx);
}
@@ -194,15 +200,17 @@ static void unwind_deferred_task_work(struct callback_head *head)
* because it has already been previously called for the same entry context,
* it will be called again with the same stack trace and cookie.
*
- * Return: 1 if the the callback was already queued.
- * 0 if the callback successfully was queued.
+ * Return: 0 if the callback successfully was queued.
+ * UNWIND_ALREADY_PENDING if the callback was already queued.
+ * UNWIND_ALREADY_EXECUTED if the callback was already called
+ * (and will not be called again)
* Negative if there's an error.
* @cookie holds the cookie of the first request by any user
*/
int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
{
struct unwind_task_info *info = &current->unwind_info;
- long pending;
+ unsigned long old, bits;
int bit;
int ret;
@@ -225,32 +233,52 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
*cookie = get_cookie(info);
- /* This is already queued */
- if (test_bit(bit, &info->unwind_mask))
- return 1;
+ old = READ_ONCE(info->unwind_mask);
+
+ /* Is this already queued */
+ if (test_bit(bit, &old)) {
+ /*
+ * If pending is not set, it means this work's callback
+ * was already called.
+ */
+ return old & UNWIND_PENDING ? UNWIND_ALREADY_PENDING :
+ UNWIND_ALREADY_EXECUTED;
+ }
- /* callback already pending? */
- pending = local_read(&info->pending);
- if (pending)
+ if (unwind_pending(info))
goto out;
+ /*
+ * This is the first to enable another task_work for this task since
+ * the task entered the kernel, or had already called the callbacks.
+ * Set only the bit for this work and clear all others as they have
+ * already had their callbacks called, and do not need to call them
+ * again because of this work.
+ */
+ bits = UNWIND_PENDING | BIT(bit);
+
+ /*
+ * If the cmpxchg() fails, it means that an NMI came in and set
+ * the pending bit as well as cleared the other bits. Just
+ * jump to setting the bit for this work.
+ */
if (CAN_USE_IN_NMI) {
- /* Claim the work unless an NMI just now swooped in to do so. */
- if (!local_try_cmpxchg(&info->pending, &pending, 1))
+ if (!try_cmpxchg(&info->unwind_mask, &old, bits))
goto out;
} else {
- local_set(&info->pending, 1);
+ info->unwind_mask = bits;
}
/* The work has been claimed, now schedule it. */
ret = task_work_add(current, &info->work, TWA_RESUME);
- if (WARN_ON_ONCE(ret)) {
- local_set(&info->pending, 0);
- return ret;
- }
+ if (WARN_ON_ONCE(ret))
+ WRITE_ONCE(info->unwind_mask, 0);
+
+ return ret;
out:
- return test_and_set_bit(bit, &info->unwind_mask);
+ return test_and_set_bit(bit, &info->unwind_mask) ?
+ UNWIND_ALREADY_PENDING : 0;
}
void unwind_deferred_cancel(struct unwind_work *work)
@@ -286,7 +314,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
guard(mutex)(&callback_mutex);
/* See if there's a bit in the mask available */
- if (unwind_mask == ~0UL)
+ if (unwind_mask == ~(UNWIND_PENDING))
return -EBUSY;
work->bit = ffz(unwind_mask);
--
2.47.2
* [PATCH v13 11/14] unwind: Add USED bit to only have one conditional on way back to user space
2025-07-08 1:22 [PATCH v13 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (9 preceding siblings ...)
2025-07-08 1:22 ` [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space Steven Rostedt
@ 2025-07-08 1:22 ` Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 12/14] unwind: Finish up unwind when a task exits Steven Rostedt
` (2 subsequent siblings)
13 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2025-07-08 1:22 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Steven Rostedt <rostedt@goodmis.org>
On the way back to user space, the function unwind_reset_info() is called
unconditionally (and is always inlined). It currently has two
conditionals. One checks the unwind_mask, which is set whenever a
deferred trace is requested, and is used to know that the mask needs to
be cleared. The other checks if the cache has been allocated, and if so
resets nr_entries so that the unwinder knows it needs to do the work to
get a new user space stack trace again (this is only done once per entry
into the kernel).
Use one of the bits in the unwind_mask as a "USED" bit that gets set
whenever a trace is created. This makes it possible to check only the
unwind_mask in unwind_reset_info() to know if there is work to do, and
eliminates a conditional that runs every time the task goes back to user
space.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/unwind_deferred.h | 14 +++++++-------
kernel/unwind/deferred.c | 5 ++++-
2 files changed, 11 insertions(+), 8 deletions(-)
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 587e120c0fd6..376bfd50ff75 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -21,6 +21,10 @@ struct unwind_work {
#define UNWIND_PENDING_BIT (BITS_PER_LONG - 1)
#define UNWIND_PENDING BIT(UNWIND_PENDING_BIT)
+/* Set if the unwinding was used (directly or deferred) */
+#define UNWIND_USED_BIT (UNWIND_PENDING_BIT - 1)
+#define UNWIND_USED BIT(UNWIND_USED_BIT)
+
enum {
UNWIND_ALREADY_PENDING = 1,
UNWIND_ALREADY_EXECUTED = 2,
@@ -49,14 +53,10 @@ static __always_inline void unwind_reset_info(void)
return;
} while (!try_cmpxchg(&info->unwind_mask, &bits, 0UL));
current->unwind_info.id.id = 0;
+
+ if (unlikely(info->cache))
+ info->cache->nr_entries = 0;
}
- /*
- * As unwind_user_faultable() can be called directly and
- * depends on nr_entries being cleared on exit to user,
- * this needs to be a separate conditional.
- */
- if (unlikely(info->cache))
- info->cache->nr_entries = 0;
}
#else /* !CONFIG_UNWIND_USER */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 256458f3eafe..9299974b6562 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -140,6 +140,9 @@ int unwind_user_faultable(struct unwind_stacktrace *trace)
cache->nr_entries = trace->nr;
+ /* Clear nr_entries on way back to user space */
+ set_bit(UNWIND_USED_BIT, &info->unwind_mask);
+
return 0;
}
@@ -314,7 +317,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
guard(mutex)(&callback_mutex);
/* See if there's a bit in the mask available */
- if (unwind_mask == ~(UNWIND_PENDING))
+ if (unwind_mask == ~(UNWIND_PENDING|UNWIND_USED))
return -EBUSY;
work->bit = ffz(unwind_mask);
--
2.47.2
* [PATCH v13 12/14] unwind: Finish up unwind when a task exits
2025-07-08 1:22 [PATCH v13 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (10 preceding siblings ...)
2025-07-08 1:22 ` [PATCH v13 11/14] unwind: Add USED bit to only have one conditional on way " Steven Rostedt
@ 2025-07-08 1:22 ` Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 13/14] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 14/14] unwind_user/x86: Enable compat mode " Steven Rostedt
13 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2025-07-08 1:22 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Steven Rostedt <rostedt@goodmis.org>
In do_exit(), when a task is exiting, if an unwind is requested and the
deferred user stacktrace is handled via the task_work, the task_work
callback is called after exit_mm() has run. This means that the user
stack trace will not be retrieved and an empty stack is created.
Instead, add a function unwind_deferred_task_exit() and call it just
before exit_mm() so that the unwinder can call the requested callbacks
while the user space stack is still available.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/unwind_deferred.h | 3 +++
kernel/exit.c | 2 ++
kernel/unwind/deferred.c | 23 ++++++++++++++++++++---
3 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 376bfd50ff75..a9d5b100d6b2 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -39,6 +39,8 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
int unwind_deferred_request(struct unwind_work *work, u64 *cookie);
void unwind_deferred_cancel(struct unwind_work *work);
+void unwind_deferred_task_exit(struct task_struct *task);
+
static __always_inline void unwind_reset_info(void)
{
struct unwind_task_info *info = &current->unwind_info;
@@ -69,6 +71,7 @@ static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback
static inline int unwind_deferred_request(struct unwind_work *work, u64 *timestamp) { return -ENOSYS; }
static inline void unwind_deferred_cancel(struct unwind_work *work) {}
+static inline void unwind_deferred_task_exit(struct task_struct *task) {}
static inline void unwind_reset_info(void) {}
#endif /* !CONFIG_UNWIND_USER */
diff --git a/kernel/exit.c b/kernel/exit.c
index bb184a67ac73..1d8c8ac33c4f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -68,6 +68,7 @@
#include <linux/rethook.h>
#include <linux/sysfs.h>
#include <linux/user_events.h>
+#include <linux/unwind_deferred.h>
#include <linux/uaccess.h>
#include <linux/pidfs.h>
@@ -938,6 +939,7 @@ void __noreturn do_exit(long code)
tsk->exit_code = code;
taskstats_exit(tsk, group_dead);
+ unwind_deferred_task_exit(tsk);
trace_sched_process_exit(tsk, group_dead);
/*
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 9299974b6562..039e12700d49 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -113,7 +113,7 @@ int unwind_user_faultable(struct unwind_stacktrace *trace)
/* Should always be called from faultable context */
might_fault();
- if (current->flags & PF_EXITING)
+ if (!current->mm)
return -EINVAL;
if (!info->cache) {
@@ -146,9 +146,9 @@ int unwind_user_faultable(struct unwind_stacktrace *trace)
return 0;
}
-static void unwind_deferred_task_work(struct callback_head *head)
+static void process_unwind_deferred(struct task_struct *task)
{
- struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
+ struct unwind_task_info *info = &task->unwind_info;
struct unwind_stacktrace trace;
struct unwind_work *work;
unsigned long bits;
@@ -183,6 +183,23 @@ static void unwind_deferred_task_work(struct callback_head *head)
srcu_read_unlock(&unwind_srcu, idx);
}
+static void unwind_deferred_task_work(struct callback_head *head)
+{
+ process_unwind_deferred(current);
+}
+
+void unwind_deferred_task_exit(struct task_struct *task)
+{
+ struct unwind_task_info *info = &current->unwind_info;
+
+ if (!unwind_pending(info))
+ return;
+
+ process_unwind_deferred(task);
+
+ task_work_cancel(task, &info->work);
+}
+
/**
* unwind_deferred_request - Request a user stacktrace on task exit
* @work: Unwind descriptor requesting the trace
--
2.47.2
* [PATCH v13 13/14] unwind_user/x86: Enable frame pointer unwinding on x86
2025-07-08 1:22 [PATCH v13 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (11 preceding siblings ...)
2025-07-08 1:22 ` [PATCH v13 12/14] unwind: Finish up unwind when a task exits Steven Rostedt
@ 2025-07-08 1:22 ` Steven Rostedt
2025-07-11 8:43 ` David Laight
2025-07-08 1:22 ` [PATCH v13 14/14] unwind_user/x86: Enable compat mode " Steven Rostedt
13 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2025-07-08 1:22 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Josh Poimboeuf <jpoimboe@kernel.org>
Use ARCH_INIT_USER_FP_FRAME to describe how frame pointers are unwound
on x86, and enable CONFIG_HAVE_UNWIND_USER_FP accordingly so the
unwind_user interfaces can be used.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/unwind_user.h | 11 +++++++++++
2 files changed, 12 insertions(+)
create mode 100644 arch/x86/include/asm/unwind_user.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 71019b3b54ea..5862433c81e1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -302,6 +302,7 @@ config X86
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL
select HAVE_UNSTABLE_SCHED_CLOCK
+ select HAVE_UNWIND_USER_FP if X86_64
select HAVE_USER_RETURN_NOTIFIER
select HAVE_GENERIC_VDSO
select VDSO_GETRANDOM if X86_64
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
new file mode 100644
index 000000000000..8597857bf896
--- /dev/null
+++ b/arch/x86/include/asm/unwind_user.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_UNWIND_USER_H
+#define _ASM_X86_UNWIND_USER_H
+
+#define ARCH_INIT_USER_FP_FRAME \
+ .cfa_off = (s32)sizeof(long) * 2, \
+ .ra_off = (s32)sizeof(long) * -1, \
+ .fp_off = (s32)sizeof(long) * -2, \
+ .use_fp = true,
+
+#endif /* _ASM_X86_UNWIND_USER_H */
--
2.47.2
* [PATCH v13 14/14] unwind_user/x86: Enable compat mode frame pointer unwinding on x86
2025-07-08 1:22 [PATCH v13 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (12 preceding siblings ...)
2025-07-08 1:22 ` [PATCH v13 13/14] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
@ 2025-07-08 1:22 ` Steven Rostedt
13 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2025-07-08 1:22 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Josh Poimboeuf <jpoimboe@kernel.org>
Use ARCH_INIT_USER_COMPAT_FP_FRAME to describe how frame pointers are
unwound on x86, and implement the hooks needed to add the segment base
addresses. Enable HAVE_UNWIND_USER_COMPAT_FP if the system has compat
mode compiled in.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/unwind_user.h | 31 ++++++++++++++++++++++++
arch/x86/include/asm/unwind_user_types.h | 17 +++++++++++++
arch/x86/kernel/stacktrace.c | 28 +++++++++++++++++++++
include/linux/unwind_user.h | 20 +++++++++++++++
kernel/unwind/user.c | 4 +++
6 files changed, 101 insertions(+)
create mode 100644 arch/x86/include/asm/unwind_user_types.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5862433c81e1..17d4094c821b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -302,6 +302,7 @@ config X86
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL
select HAVE_UNSTABLE_SCHED_CLOCK
+ select HAVE_UNWIND_USER_COMPAT_FP if IA32_EMULATION
select HAVE_UNWIND_USER_FP if X86_64
select HAVE_USER_RETURN_NOTIFIER
select HAVE_GENERIC_VDSO
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
index 8597857bf896..19634a73612d 100644
--- a/arch/x86/include/asm/unwind_user.h
+++ b/arch/x86/include/asm/unwind_user.h
@@ -2,10 +2,41 @@
#ifndef _ASM_X86_UNWIND_USER_H
#define _ASM_X86_UNWIND_USER_H
+#include <linux/unwind_user_types.h>
+
#define ARCH_INIT_USER_FP_FRAME \
.cfa_off = (s32)sizeof(long) * 2, \
.ra_off = (s32)sizeof(long) * -1, \
.fp_off = (s32)sizeof(long) * -2, \
.use_fp = true,
+#ifdef CONFIG_IA32_EMULATION
+
+#define ARCH_INIT_USER_COMPAT_FP_FRAME \
+ .cfa_off = (s32)sizeof(u32) * 2, \
+ .ra_off = (s32)sizeof(u32) * -1, \
+ .fp_off = (s32)sizeof(u32) * -2, \
+ .use_fp = true,
+
+#define in_compat_mode(regs) !user_64bit_mode(regs)
+
+void arch_unwind_user_init(struct unwind_user_state *state,
+ struct pt_regs *regs);
+
+static inline void arch_unwind_user_next(struct unwind_user_state *state)
+{
+ if (state->type != UNWIND_USER_TYPE_COMPAT_FP)
+ return;
+
+ state->ip += state->arch.cs_base;
+ state->fp += state->arch.ss_base;
+}
+
+#define arch_unwind_user_init arch_unwind_user_init
+#define arch_unwind_user_next arch_unwind_user_next
+
+#endif /* CONFIG_IA32_EMULATION */
+
+#include <asm-generic/unwind_user.h>
+
#endif /* _ASM_X86_UNWIND_USER_H */
diff --git a/arch/x86/include/asm/unwind_user_types.h b/arch/x86/include/asm/unwind_user_types.h
new file mode 100644
index 000000000000..f93d535f900e
--- /dev/null
+++ b/arch/x86/include/asm/unwind_user_types.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_UNWIND_USER_TYPES_H
+#define _ASM_X86_UNWIND_USER_TYPES_H
+
+#ifdef CONFIG_IA32_EMULATION
+
+struct arch_unwind_user_state {
+ unsigned long ss_base;
+ unsigned long cs_base;
+};
+#define arch_unwind_user_state arch_unwind_user_state
+
+#endif /* CONFIG_IA32_EMULATION */
+
+#include <asm-generic/unwind_user_types.h>
+
+#endif /* _ASM_X86_UNWIND_USER_TYPES_H */
diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index ee117fcf46ed..8ef9d8c71df9 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -9,7 +9,10 @@
#include <linux/stacktrace.h>
#include <linux/export.h>
#include <linux/uaccess.h>
+#include <asm/unwind_user.h>
#include <asm/stacktrace.h>
+#include <asm/insn.h>
+#include <asm/insn-eval.h>
#include <asm/unwind.h>
void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
@@ -128,3 +131,28 @@ void arch_stack_walk_user(stack_trace_consume_fn consume_entry, void *cookie,
}
}
+#ifdef CONFIG_IA32_EMULATION
+void arch_unwind_user_init(struct unwind_user_state *state,
+ struct pt_regs *regs)
+{
+ unsigned long cs_base, ss_base;
+
+ if (state->type != UNWIND_USER_TYPE_COMPAT_FP)
+ return;
+
+ cs_base = insn_get_seg_base(regs, INAT_SEG_REG_CS);
+ ss_base = insn_get_seg_base(regs, INAT_SEG_REG_SS);
+
+ if (cs_base == -1)
+ cs_base = 0;
+ if (ss_base == -1)
+ ss_base = 0;
+
+ state->arch.cs_base = cs_base;
+ state->arch.ss_base = ss_base;
+
+ state->ip += cs_base;
+ state->sp += ss_base;
+ state->fp += ss_base;
+}
+#endif /* CONFIG_IA32_EMULATION */
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index 834b643afd3a..8a4af0214ecb 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -14,6 +14,26 @@
#define in_compat_mode(regs) false
#endif
+/*
+ * If an architecture needs to initialize the unwind state for a
+ * specific reason, for example to handle compat mode differently,
+ * it can supply its own arch_unwind_user_init() function and define
+ * a macro of the same name to override this stub.
+ */
+#ifndef arch_unwind_user_init
+static inline void arch_unwind_user_init(struct unwind_user_state *state, struct pt_regs *regs) {}
+#endif
+
+/*
+ * If an architecture requires further updates to the state between
+ * stack frames, it can supply its own arch_unwind_user_next()
+ * function and define a macro of the same name to override this stub;
+ * it is called between reading stack frames during the user space
+ * stack walk.
+ */
+#ifndef arch_unwind_user_next
+static inline void arch_unwind_user_next(struct unwind_user_state *state) {}
+#endif
+
int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
#endif /* _LINUX_UNWIND_USER_H */
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 03775191447c..249d9e32fad7 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -89,6 +89,8 @@ static int unwind_user_next(struct unwind_user_state *state)
if (frame->fp_off)
state->fp = fp;
+ arch_unwind_user_next(state);
+
return 0;
done:
@@ -118,6 +120,8 @@ static int unwind_user_start(struct unwind_user_state *state)
state->sp = user_stack_pointer(regs);
state->fp = frame_pointer(regs);
+ arch_unwind_user_init(state, regs);
+
return 0;
}
--
2.47.2
^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH v13 02/14] unwind_user: Add frame pointer support
2025-07-08 1:22 ` [PATCH v13 02/14] unwind_user: Add frame pointer support Steven Rostedt
@ 2025-07-09 10:01 ` Jens Remus
2025-07-10 12:28 ` Jens Remus
2025-07-10 15:21 ` Steven Rostedt
0 siblings, 2 replies; 45+ messages in thread
From: Jens Remus @ 2025-07-09 10:01 UTC (permalink / raw)
To: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Linus Torvalds, Andrew Morton, Jens Axboe,
Florian Weimer, Sam James, Heiko Carstens, Vasily Gorbik
On 08.07.2025 03:22, Steven Rostedt wrote:
> From: Josh Poimboeuf <jpoimboe@kernel.org>
>
> Add optional support for user space frame pointer unwinding. If
> supported, the arch needs to enable CONFIG_HAVE_UNWIND_USER_FP and
> define ARCH_INIT_USER_FP_FRAME.
>
> By encoding the frame offsets in struct unwind_user_frame, much of this
> code can also be reused for future unwinder implementations like sframe.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
> @@ -6,13 +6,71 @@
> #include <linux/sched.h>
> #include <linux/sched/task_stack.h>
> #include <linux/unwind_user.h>
> +#include <linux/uaccess.h>
> +
> +static struct unwind_user_frame fp_frame = {
> + ARCH_INIT_USER_FP_FRAME
> +};
> +
> +static inline bool fp_state(struct unwind_user_state *state)
> +{
> + return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP) &&
> + state->type == UNWIND_USER_TYPE_FP;
> +}
>
> #define for_each_user_frame(state) \
> for (unwind_user_start(state); !(state)->done; unwind_user_next(state))
>
> static int unwind_user_next(struct unwind_user_state *state)
> {
> - /* no implementation yet */
> + struct unwind_user_frame *frame;
> + unsigned long cfa = 0, fp, ra = 0;
> + unsigned int shift;
> +
> + if (state->done)
> + return -EINVAL;
> +
> + if (fp_state(state))
> + frame = &fp_frame;
> + else
> + goto done;
> +
> + if (frame->use_fp) {
> + if (state->fp < state->sp)
if (state->fp <= state->sp)
I meanwhile came to the conclusion that for architectures, such as s390,
where SP at function entry == SP at call site, the FP may be equal to
the SP. At least for the brief period where the FP has been setup and
stack allocation did not yet take place. For most architectures this
can probably only occur in the topmost frame. For s390 the FP is setup
after static stack allocation, so -fno-omit-frame-pointer would enforce
FP==SP in any frame that does not perform dynamic stack allocation.
> + goto done;
> + cfa = state->fp;
> + } else {
> + cfa = state->sp;
> + }
> +
> + /* Get the Canonical Frame Address (CFA) */
> + cfa += frame->cfa_off;
> +
> + /* stack going in wrong direction? */
> + if (cfa <= state->sp)
> + goto done;
> +
> + /* Make sure that the address is word aligned */
> + shift = sizeof(long) == 4 ? 2 : 3;
> + if ((cfa + frame->ra_off) & ((1 << shift) - 1))
> + goto done;
Do all architectures/ABI mandate register stack save slots to be aligned?
s390 does.
> +
> + /* Find the Return Address (RA) */
> + if (get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
> + goto done;
> +
Why not validate the FP stack save slot address as well?
> + if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
> + goto done;
> +
> + state->ip = ra;
> + state->sp = cfa;
> + if (frame->fp_off)
> + state->fp = fp;
> +
> + return 0;
> +
> +done:
> + state->done = true;
> return -EINVAL;
> }
Thanks and regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
+49-7031-16-1128 Office
jremus@de.ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
* Re: [PATCH v13 02/14] unwind_user: Add frame pointer support
2025-07-09 10:01 ` Jens Remus
@ 2025-07-10 12:28 ` Jens Remus
2025-07-10 15:21 ` Steven Rostedt
1 sibling, 0 replies; 45+ messages in thread
From: Jens Remus @ 2025-07-10 12:28 UTC (permalink / raw)
To: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Linus Torvalds, Andrew Morton, Jens Axboe,
Florian Weimer, Sam James, Heiko Carstens, Vasily Gorbik
On 09.07.2025 12:01, Jens Remus wrote:
> On 08.07.2025 03:22, Steven Rostedt wrote:
>> From: Josh Poimboeuf <jpoimboe@kernel.org>
>> diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
>> static int unwind_user_next(struct unwind_user_state *state)
>> {
>> - /* no implementation yet */
>> + struct unwind_user_frame *frame;
>> + unsigned long cfa = 0, fp, ra = 0;
>> + unsigned int shift;
>> +
>> + if (state->done)
>> + return -EINVAL;
>> +
>> + if (fp_state(state))
>> + frame = &fp_frame;
>> + else
>> + goto done;
>> +
>> + if (frame->use_fp) {
>> + if (state->fp < state->sp)
The initial check above is correct. I got the logic wrong. Sorry for
the fuss! Do not change the check to what I came up with yesterday:
> if (state->fp <= state->sp)
>
The s390 particularity below, that FP may be equal to SP in any frame,
is only allowed with the initial check.
> I meanwhile came to the conclusion that for architectures, such as s390,
> where SP at function entry == SP at call site, the FP may be equal to
> the SP. At least for the brief period where the FP has been setup and
> stack allocation did not yet take place. For most architectures this
> can probably only occur in the topmost frame. For s390 the FP is setup
> after static stack allocation, so -fno-omit-frame-pointer would enforce
> FP==SP in any frame that does not perform dynamic stack allocation.
>
>> + goto done;
>> + cfa = state->fp;
>> + } else {
>> + cfa = state->sp;
>> + }
Regards,
Jens
* Re: [PATCH v13 02/14] unwind_user: Add frame pointer support
2025-07-09 10:01 ` Jens Remus
2025-07-10 12:28 ` Jens Remus
@ 2025-07-10 15:21 ` Steven Rostedt
2025-07-10 15:41 ` Jens Remus
1 sibling, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2025-07-10 15:21 UTC (permalink / raw)
To: Jens Remus
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Linus Torvalds, Andrew Morton, Jens Axboe,
Florian Weimer, Sam James, Heiko Carstens, Vasily Gorbik
On Wed, 9 Jul 2025 12:01:14 +0200
Jens Remus <jremus@linux.ibm.com> wrote:
> > static int unwind_user_next(struct unwind_user_state *state)
> > {
> > - /* no implementation yet */
> > + struct unwind_user_frame *frame;
> > + unsigned long cfa = 0, fp, ra = 0;
> > + unsigned int shift;
> > +
> > + if (state->done)
> > + return -EINVAL;
> > +
> > + if (fp_state(state))
> > + frame = &fp_frame;
> > + else
> > + goto done;
> > +
> > + if (frame->use_fp) {
> > + if (state->fp < state->sp)
>
> if (state->fp <= state->sp)
>
> I meanwhile came to the conclusion that for architectures, such as s390,
> where SP at function entry == SP at call site, the FP may be equal to
> the SP. At least for the brief period where the FP has been setup and
> stack allocation did not yet take place. For most architectures this
> can probably only occur in the topmost frame. For s390 the FP is setup
> after static stack allocation, so -fno-omit-frame-pointer would enforce
> FP==SP in any frame that does not perform dynamic stack allocation.
From your latest email, I take it I can ignore the above?
>
> > + goto done;
> > + cfa = state->fp;
> > + } else {
> > + cfa = state->sp;
> > + }
> > +
> > + /* Get the Canonical Frame Address (CFA) */
> > + cfa += frame->cfa_off;
> > +
> > + /* stack going in wrong direction? */
> > + if (cfa <= state->sp)
> > + goto done;
> > +
> > + /* Make sure that the address is word aligned */
> > + shift = sizeof(long) == 4 ? 2 : 3;
> > + if ((cfa + frame->ra_off) & ((1 << shift) - 1))
> > + goto done;
>
> Do all architectures/ABI mandate register stack save slots to be aligned?
> s390 does.
I believe so.
>
> > +
> > + /* Find the Return Address (RA) */
> > + if (get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
> > + goto done;
> > +
>
> Why not validate the FP stack save slot address as well?
You mean to validate cfa + frame->fp_off?
Isn't cfa the only real variable here? That is, if cfa + frame->ra_off
works, wouldn't the same go for frame->fp_off, as both frame->ra_off
and frame->fp_off are constants set by the architecture, and should be
word aligned.
-- Steve
>
> > + if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
> > + goto done;
> > +
* Re: [PATCH v13 02/14] unwind_user: Add frame pointer support
2025-07-10 15:21 ` Steven Rostedt
@ 2025-07-10 15:41 ` Jens Remus
2025-07-10 17:08 ` Steven Rostedt
0 siblings, 1 reply; 45+ messages in thread
From: Jens Remus @ 2025-07-10 15:41 UTC (permalink / raw)
To: Steven Rostedt
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Linus Torvalds, Andrew Morton, Jens Axboe,
Florian Weimer, Sam James, Heiko Carstens, Vasily Gorbik
On 10.07.2025 17:21, Steven Rostedt wrote:
> On Wed, 9 Jul 2025 12:01:14 +0200
> Jens Remus <jremus@linux.ibm.com> wrote:
>>> + if (frame->use_fp) {
>>> + if (state->fp < state->sp)
>>
>> if (state->fp <= state->sp)
>>
>> I meanwhile came to the conclusion that for architectures, such as s390,
>> where SP at function entry == SP at call site, the FP may be equal to
>> the SP. At least for the brief period where the FP has been setup and
>> stack allocation did not yet take place. For most architectures this
>> can probably only occur in the topmost frame. For s390 the FP is setup
>> after static stack allocation, so -fno-omit-frame-pointer would enforce
>> FP==SP in any frame that does not perform dynamic stack allocation.
>
> From your latest email, I take it I can ignore the above?
Correct.
>>> + /* Make sure that the address is word aligned */
>>> + shift = sizeof(long) == 4 ? 2 : 3;
>>> + if ((cfa + frame->ra_off) & ((1 << shift) - 1))
>>> + goto done;
>>
>> Do all architectures/ABI mandate register stack save slots to be aligned?
>> s390 does.
>
> I believe so.
>
>>
>>> +
>>> + /* Find the Return Address (RA) */
>>> + if (get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
>>> + goto done;
>>> +
>>
>> Why not validate the FP stack save slot address as well?
>
> You mean to validate cfa + frame->fp_off?
Yes.
> Isn't cfa the only real variable here? That is, if cfa + frame->ra_off
> works, wouldn't the same go for frame->fp_off, as both frame->ra_off
> and frame->fp_off are constants set by the architecture, and should be
> word aligned.
cfa + frame->ra_off could be aligned by chance. So could
cfa + frame->fp_off be as well of course.
On s390 the CFA must be aligned (as the SP must be aligned) and the
FP and RA offsets from CFA must be aligned, as pointers / 64-bit integers
(such as 64-bit register values) must be aligned as well.
So the CFA (and/or offset), FP offset, and RA offset could be validated
individually. Not sure if that would be over engineering though.
>>> + if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
>>> + goto done;
Regards,
Jens
* Re: [PATCH v13 02/14] unwind_user: Add frame pointer support
2025-07-10 15:41 ` Jens Remus
@ 2025-07-10 17:08 ` Steven Rostedt
2025-07-14 12:52 ` Jens Remus
0 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2025-07-10 17:08 UTC (permalink / raw)
To: Jens Remus
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Linus Torvalds, Andrew Morton, Jens Axboe,
Florian Weimer, Sam James, Heiko Carstens, Vasily Gorbik
On Thu, 10 Jul 2025 17:41:36 +0200
Jens Remus <jremus@linux.ibm.com> wrote:
> cfa + frame->ra_off could be aligned by chance. So could
> cfa + frame->fp_off be as well of course.
>
> On s390 the CFA must be aligned (as the SP must be aligned) and the
> FP and RA offsets from CFA must be aligned, as pointer / 64-bit integers
> (such as 64-bit register values) must be aligned as well.
>
> So the CFA (and/or offset), FP offset, and RA offset could be validated
> individually. Not sure if that would be over engineering though.
I wonder if we should just validate that cfa is aligned? Would that work?
I would think that ra_off and fp_off should be aligned as well and if
cfa is aligned then it would still be aligned when adding those offsets.
-- Steve
* Re: [PATCH v13 13/14] unwind_user/x86: Enable frame pointer unwinding on x86
2025-07-08 1:22 ` [PATCH v13 13/14] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
@ 2025-07-11 8:43 ` David Laight
2025-07-11 16:11 ` Steven Rostedt
0 siblings, 1 reply; 45+ messages in thread
From: David Laight @ 2025-07-11 8:43 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Mon, 07 Jul 2025 21:22:52 -0400
Steven Rostedt <rostedt@kernel.org> wrote:
> From: Josh Poimboeuf <jpoimboe@kernel.org>
>
> Use ARCH_INIT_USER_FP_FRAME to describe how frame pointers are unwound
> on x86, and enable CONFIG_HAVE_UNWIND_USER_FP accordingly so the
> unwind_user interfaces can be used.
How is that going to work?
Pretty much all x86 userspace is compiled with bp as a general
purpose register not a frame pointer.
David
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/unwind_user.h | 11 +++++++++++
> 2 files changed, 12 insertions(+)
> create mode 100644 arch/x86/include/asm/unwind_user.h
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 71019b3b54ea..5862433c81e1 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -302,6 +302,7 @@ config X86
> select HAVE_SYSCALL_TRACEPOINTS
> select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL
> select HAVE_UNSTABLE_SCHED_CLOCK
> + select HAVE_UNWIND_USER_FP if X86_64
> select HAVE_USER_RETURN_NOTIFIER
> select HAVE_GENERIC_VDSO
> select VDSO_GETRANDOM if X86_64
> diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
> new file mode 100644
> index 000000000000..8597857bf896
> --- /dev/null
> +++ b/arch/x86/include/asm/unwind_user.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_UNWIND_USER_H
> +#define _ASM_X86_UNWIND_USER_H
> +
> +#define ARCH_INIT_USER_FP_FRAME \
> + .cfa_off = (s32)sizeof(long) * 2, \
> + .ra_off = (s32)sizeof(long) * -1, \
> + .fp_off = (s32)sizeof(long) * -2, \
> + .use_fp = true,
> +
> +#endif /* _ASM_X86_UNWIND_USER_H */
* Re: [PATCH v13 13/14] unwind_user/x86: Enable frame pointer unwinding on x86
2025-07-11 8:43 ` David Laight
@ 2025-07-11 16:11 ` Steven Rostedt
0 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2025-07-11 16:11 UTC (permalink / raw)
To: David Laight
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
On Fri, 11 Jul 2025 09:43:21 +0100
David Laight <david.laight.linux@gmail.com> wrote:
> On Mon, 07 Jul 2025 21:22:52 -0400
> Steven Rostedt <rostedt@kernel.org> wrote:
>
> > From: Josh Poimboeuf <jpoimboe@kernel.org>
> >
> > Use ARCH_INIT_USER_FP_FRAME to describe how frame pointers are unwound
> > on x86, and enable CONFIG_HAVE_UNWIND_USER_FP accordingly so the
> > unwind_user interfaces can be used.
>
> How is that going to work?
> Pretty much all x86 userspace is compiled with bp as a general
> purpose register not a frame pointer.
Which is where this patch set comes in...
https://lore.kernel.org/linux-trace-kernel/20250701184939.026626626@goodmis.org
-- Steve
* Re: [PATCH v13 02/14] unwind_user: Add frame pointer support
2025-07-10 17:08 ` Steven Rostedt
@ 2025-07-14 12:52 ` Jens Remus
0 siblings, 0 replies; 45+ messages in thread
From: Jens Remus @ 2025-07-14 12:52 UTC (permalink / raw)
To: Steven Rostedt
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Linus Torvalds, Andrew Morton, Jens Axboe,
Florian Weimer, Sam James, Heiko Carstens, Vasily Gorbik
On 10.07.2025 19:08, Steven Rostedt wrote:
> On Thu, 10 Jul 2025 17:41:36 +0200
> Jens Remus <jremus@linux.ibm.com> wrote:
>
>> cfa + frame->ra_off could be aligned by chance. So could
>> cfa + frame->fp_off be as well of course.
>>
>> On s390 the CFA must be aligned (as the SP must be aligned) and the
>> FP and RA offsets from CFA must be aligned, as pointer / 64-bit integers
>> (such as 64-bit register values) must be aligned as well.
>>
>> So the CFA (and/or offset), FP offset, and RA offset could be validated
>> individually. Not sure if that would be over engineering though.
>
> I wonder if we should just validate that cfa is aligned? Would that work?
>
> I would think that ra_off and fp_off should be aligned as well and if
> cfa is aligned then it would still be aligned when adding those offsets.
Makes sense, if the base assumption is that the SFrame information is
valid and the primary intent is to check that the used CFA base register
(i.e. SP or FP) was aligned.
Regards,
Jens
* Re: [PATCH v13 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-07-08 1:22 ` [PATCH v13 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
@ 2025-07-14 13:29 ` Peter Zijlstra
2025-07-14 14:19 ` Steven Rostedt
0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2025-07-14 13:29 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer, Sam James
On Mon, Jul 07, 2025 at 09:22:46PM -0400, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
>
> Make unwind_deferred_request() NMI-safe so tracers in NMI context can
> call it and safely request a user space stacktrace when the task exits.
>
> Note, this is only allowed for architectures that implement a safe
> cmpxchg. If an architecture requests a deferred stack trace from NMI
> context that does not support a safe NMI cmpxchg, it will get an -EINVAL.
> For those architectures, they would need another method (perhaps an
> irqwork), to request a deferred user space stack trace. That can be dealt
> with later if one of these architectures requires this feature.
>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
How's this instead?
---
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -12,6 +12,40 @@
#include <linux/slab.h>
#include <linux/mm.h>
+/*
+ * For requesting a deferred user space stack trace from NMI context
+ * the architecture must support a safe cmpxchg in NMI context.
+ * For those architectures that do not have that, then it cannot ask
+ * for a deferred user space stack trace from an NMI context. If it
+ * does, then it will get -EINVAL.
+ */
+#ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
+#define UNWIND_NMI_SAFE 1
+static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
+{
+ u32 zero = 0;
+ return try_cmpxchg(&info->id.cnt, &zero, cnt);
+}
+static inline bool test_and_set_pending(struct unwind_task_info *info)
+{
+ return info->pending || cmpxchg_local(&info->pending, 0, 1);
+}
+#else
+#define UNWIND_NMI_SAFE 0
+/* When NMIs are not allowed, this always succeeds */
+static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
+{
+ info->id.cnt = cnt;
+ return true;
+}
+static inline bool test_and_set_pending(struct unwind_task_info *info)
+{
+ int pending = info->pending;
+ info->pending = 1;
+ return pending;
+}
+#endif /* CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG */
+
/* Make the cache fit in a 4K page */
#define UNWIND_MAX_ENTRIES \
((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
@@ -41,21 +75,16 @@ static DEFINE_PER_CPU(u32, unwind_ctx_ct
*/
static u64 get_cookie(struct unwind_task_info *info)
{
- u32 cpu_cnt;
- u32 cnt;
- u32 old = 0;
+ u32 cnt = 1;
if (info->id.cpu)
return info->id.id;
- cpu_cnt = __this_cpu_read(unwind_ctx_ctr);
- cpu_cnt += 2;
- cnt = cpu_cnt | 1; /* Always make non zero */
-
- if (try_cmpxchg(&info->id.cnt, &old, cnt)) {
- /* Update the per cpu counter */
- __this_cpu_write(unwind_ctx_ctr, cpu_cnt);
- }
+ /* LSB is always set to ensure 0 is an invalid value. */
+ cnt |= __this_cpu_read(unwind_ctx_ctr) + 2;
+ if (try_assign_cnt(info, cnt))
+ __this_cpu_write(unwind_ctx_ctr, cnt);
+
/* Interrupts are disabled, the CPU will always be same */
info->id.cpu = smp_processor_id() + 1; /* Must be non zero */
@@ -174,27 +203,29 @@ int unwind_deferred_request(struct unwin
*cookie = 0;
- if (WARN_ON_ONCE(in_nmi()))
- return -EINVAL;
-
if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
!user_mode(task_pt_regs(current)))
return -EINVAL;
+ /* NMI requires having safe cmpxchg operations */
+ if (WARN_ON_ONCE(!UNWIND_NMI_SAFE && in_nmi()))
+ return -EINVAL;
+
guard(irqsave)();
*cookie = get_cookie(info);
/* callback already pending? */
- if (info->pending)
+ if (test_and_set_pending(info))
return 1;
/* The work has been claimed, now schedule it. */
ret = task_work_add(current, &info->work, TWA_RESUME);
- if (WARN_ON_ONCE(ret))
+ if (WARN_ON_ONCE(ret)) {
+ WRITE_ONCE(info->pending, 0);
return ret;
+ }
- info->pending = 1;
return 0;
}
* Re: [PATCH v13 09/14] unwind deferred: Use SRCU unwind_deferred_task_work()
2025-07-08 1:22 ` [PATCH v13 09/14] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
@ 2025-07-14 13:56 ` Peter Zijlstra
2025-07-14 14:21 ` Steven Rostedt
0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2025-07-14 13:56 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer, Sam James
On Mon, Jul 07, 2025 at 09:22:48PM -0400, Steven Rostedt wrote:
> @@ -143,6 +144,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
> struct unwind_stacktrace trace;
> struct unwind_work *work;
> u64 cookie;
> + int idx;
>
> if (WARN_ON_ONCE(!local_read(&info->pending)))
> return;
> @@ -161,13 +163,15 @@ static void unwind_deferred_task_work(struct callback_head *head)
>
> cookie = info->id.id;
>
> - guard(mutex)(&callback_mutex);
> - list_for_each_entry(work, &callbacks, list) {
> + idx = srcu_read_lock(&unwind_srcu);
> + list_for_each_entry_srcu(work, &callbacks, list,
> + srcu_read_lock_held(&unwind_srcu)) {
> if (test_bit(work->bit, &info->unwind_mask)) {
> work->func(work, &trace, cookie);
> clear_bit(work->bit, &info->unwind_mask);
> }
> }
> + srcu_read_unlock(&unwind_srcu, idx);
> }
Please; something like so:
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -524,4 +524,9 @@ DEFINE_LOCK_GUARD_1(srcu, struct srcu_st
srcu_read_unlock(_T->lock, _T->idx),
int idx)
+DEFINE_LOCK_GUARD_1(srcu_lite, struct srcu_struct,
+ _T->idx = srcu_read_lock_lite(_T->lock),
+ srcu_read_unlock_lite(_T->lock, _T->idx),
+ int idx)
+
#endif
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -165,7 +165,7 @@ static void unwind_deferred_task_work(st
cookie = info->id.id;
- guard(mutex)(&callback_mutex);
+ guard(srcu_lite)(&unwind_srcu);
list_for_each_entry(work, &callbacks, list) {
work->func(work, &trace, cookie);
}
* Re: [PATCH v13 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-07-14 13:29 ` Peter Zijlstra
@ 2025-07-14 14:19 ` Steven Rostedt
2025-07-14 15:05 ` Peter Zijlstra
0 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2025-07-14 14:19 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Mon, 14 Jul 2025 15:29:36 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> +#ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
> +#define UNWIND_NMI_SAFE 1
> +static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
> +{
> + u32 zero = 0;
> + return try_cmpxchg(&info->id.cnt, &zero, cnt);
> +}
> +static inline bool test_and_set_pending(struct unwind_task_info *info)
> +{
> + return info->pending || cmpxchg_local(&info->pending, 0, 1);
> +}
> +#else
Patch 10 moves the pending bit into the unwind_mask as it needs to be
in sync with the different tracer requests. I'm not sure how this
change will interact with that.
-- Steve
* Re: [PATCH v13 09/14] unwind deferred: Use SRCU unwind_deferred_task_work()
2025-07-14 13:56 ` Peter Zijlstra
@ 2025-07-14 14:21 ` Steven Rostedt
2025-07-14 15:03 ` Peter Zijlstra
0 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2025-07-14 14:21 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Mon, 14 Jul 2025 15:56:38 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> Please; something like so:
>
> --- a/include/linux/srcu.h
> +++ b/include/linux/srcu.h
> @@ -524,4 +524,9 @@ DEFINE_LOCK_GUARD_1(srcu, struct srcu_st
> srcu_read_unlock(_T->lock, _T->idx),
> int idx)
>
> +DEFINE_LOCK_GUARD_1(srcu_lite, struct srcu_struct,
> + _T->idx = srcu_read_lock_lite(_T->lock),
> + srcu_read_unlock_lite(_T->lock, _T->idx),
> + int idx)
> +
> #endif
> --- a/kernel/unwind/deferred.c
> +++ b/kernel/unwind/deferred.c
> @@ -165,7 +165,7 @@ static void unwind_deferred_task_work(st
>
> cookie = info->id.id;
>
> - guard(mutex)(&callback_mutex);
> + guard(srcu_lite)(&unwind_srcu);
> list_for_each_entry(work, &callbacks, list) {
> work->func(work, &trace, cookie);
> }
I think I'd rather have a scoped_guard() here. One thing that bothers me
about the guard() logic is that it could easily start to "leak"
protection. That is, the unwind_srcu is only needed for walking the
list. The reason I chose to open code the protection, is because I
wanted to distinctly denote where the end of the protection was.
-- Steve
* Re: [PATCH v13 09/14] unwind deferred: Use SRCU unwind_deferred_task_work()
2025-07-14 14:21 ` Steven Rostedt
@ 2025-07-14 15:03 ` Peter Zijlstra
0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2025-07-14 15:03 UTC (permalink / raw)
To: Steven Rostedt
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Mon, Jul 14, 2025 at 10:21:40AM -0400, Steven Rostedt wrote:
> On Mon, 14 Jul 2025 15:56:38 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > Please; something like so:
> >
> > --- a/include/linux/srcu.h
> > +++ b/include/linux/srcu.h
> > @@ -524,4 +524,9 @@ DEFINE_LOCK_GUARD_1(srcu, struct srcu_st
> > srcu_read_unlock(_T->lock, _T->idx),
> > int idx)
> >
> > +DEFINE_LOCK_GUARD_1(srcu_lite, struct srcu_struct,
> > + _T->idx = srcu_read_lock_lite(_T->lock),
> > + srcu_read_unlock_lite(_T->lock, _T->idx),
> > + int idx)
> > +
> > #endif
> > --- a/kernel/unwind/deferred.c
> > +++ b/kernel/unwind/deferred.c
> > @@ -165,7 +165,7 @@ static void unwind_deferred_task_work(st
> >
> > cookie = info->id.id;
> >
> > - guard(mutex)(&callback_mutex);
> > + guard(srcu_lite)(&unwind_srcu);
> > list_for_each_entry(work, &callbacks, list) {
> > work->func(work, &trace, cookie);
> > }
>
> I think I'd rather have a scoped_guard() here. One thing that bothers me
> about the guard() logic is that it could easily start to "leak"
> protection. That is, the unwind_srcu is only needed for walking the
> list. The reason I chose to open code the protection is that I
> wanted to distinctly mark where the protection ends.
Sure. But the point was more to:
- use srcu_lite; and,
- use guards
* Re: [PATCH v13 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-07-14 14:19 ` Steven Rostedt
@ 2025-07-14 15:05 ` Peter Zijlstra
2025-07-14 15:11 ` Steven Rostedt
0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2025-07-14 15:05 UTC (permalink / raw)
To: Steven Rostedt
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Mon, Jul 14, 2025 at 10:19:19AM -0400, Steven Rostedt wrote:
> On Mon, 14 Jul 2025 15:29:36 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > +#ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
> > +#define UNWIND_NMI_SAFE 1
> > +static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
> > +{
> > + u32 zero = 0;
> > + return try_cmpxchg(&info->id.cnt, &zero, cnt);
> > +}
> > +static inline bool test_and_set_pending(struct unwind_task_info *info)
> > +{
> > + return info->pending || cmpxchg_local(&info->pending, 0, 1);
> > +}
> > +#el
>
> Patch 10 moves the pending bit into the unwind_mask as it needs to be
> in sync with the different tracer requests. I'm not sure how this
> change will interact with that.
Urgh; so I hate reviewing code you're ripping out in the next patch :-(
Let me go stare at that.
* Re: [PATCH v13 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-07-14 15:05 ` Peter Zijlstra
@ 2025-07-14 15:11 ` Steven Rostedt
2025-07-15 9:09 ` Peter Zijlstra
0 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2025-07-14 15:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Mon, 14 Jul 2025 17:05:16 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> Urgh; so I hate reviewing code you're ripping out in the next patch :-(
Sorry. It just happened to be developed that way. Patch 10 came about
to fix a bug that was triggered with the current method.
>
> Let me go stare at that.
Thanks!
-- Steve
* Re: [PATCH v13 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-07-14 15:11 ` Steven Rostedt
@ 2025-07-15 9:09 ` Peter Zijlstra
2025-07-15 12:35 ` Steven Rostedt
0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2025-07-15 9:09 UTC (permalink / raw)
To: Steven Rostedt
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Mon, Jul 14, 2025 at 11:11:58AM -0400, Steven Rostedt wrote:
> On Mon, 14 Jul 2025 17:05:16 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > Urgh; so I hate reviewing code you're ripping out in the next patch :-(
>
> Sorry. It just happened to be developed that way. Patch 10 came about
> to fix a bug that was triggered with the current method.
Sure; but then you rework the series such that the bug never happened,
so reviewers don't go insane from the back and forth or stumble over
the same bug you then fix later.
You should know this.
I'm going to not stare at email for some 3 weeks soon; I strongly
suggest you take this time to fix up this series to not suffer nonsense
like this.
* Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
2025-07-08 1:22 ` [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space Steven Rostedt
@ 2025-07-15 10:29 ` Peter Zijlstra
2025-07-15 12:49 ` Steven Rostedt
` (3 more replies)
0 siblings, 4 replies; 45+ messages in thread
From: Peter Zijlstra @ 2025-07-15 10:29 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer, Sam James
On Mon, Jul 07, 2025 at 09:22:49PM -0400, Steven Rostedt wrote:
> diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
> index 12bffdb0648e..587e120c0fd6 100644
> --- a/include/linux/unwind_deferred.h
> +++ b/include/linux/unwind_deferred.h
> @@ -18,6 +18,14 @@ struct unwind_work {
>
> #ifdef CONFIG_UNWIND_USER
>
> +#define UNWIND_PENDING_BIT (BITS_PER_LONG - 1)
> +#define UNWIND_PENDING BIT(UNWIND_PENDING_BIT)
Since it really didn't matter what bit you took, why not take bit 0?
> +
> +enum {
> + UNWIND_ALREADY_PENDING = 1,
> + UNWIND_ALREADY_EXECUTED = 2,
> +};
> +
> void unwind_task_init(struct task_struct *task);
> void unwind_task_free(struct task_struct *task);
>
> @@ -29,15 +37,26 @@ void unwind_deferred_cancel(struct unwind_work *work);
>
> static __always_inline void unwind_reset_info(void)
> {
> - if (unlikely(current->unwind_info.id.id))
> + struct unwind_task_info *info = &current->unwind_info;
> + unsigned long bits;
> +
> + /* Was there any unwinding? */
> + if (unlikely(info->unwind_mask)) {
> + bits = info->unwind_mask;
> + do {
> + /* Is a task_work going to run again before going back */
> + if (bits & UNWIND_PENDING)
> + return;
> + } while (!try_cmpxchg(&info->unwind_mask, &bits, 0UL));
> current->unwind_info.id.id = 0;
> + }
> /*
> * As unwind_user_faultable() can be called directly and
> * depends on nr_entries being cleared on exit to user,
> * this needs to be a separate conditional.
> */
> - if (unlikely(current->unwind_info.cache))
> - current->unwind_info.cache->nr_entries = 0;
> + if (unlikely(info->cache))
> + info->cache->nr_entries = 0;
> }
>
> #else /* !CONFIG_UNWIND_USER */
> diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
> index 9aed9866f460..256458f3eafe 100644
> --- a/kernel/unwind/deferred.c
> +++ b/kernel/unwind/deferred.c
> @@ -47,6 +47,11 @@ static LIST_HEAD(callbacks);
> static unsigned long unwind_mask;
> DEFINE_STATIC_SRCU(unwind_srcu);
>
> +static inline bool unwind_pending(struct unwind_task_info *info)
> +{
> + return test_bit(UNWIND_PENDING_BIT, &info->unwind_mask);
> +}
> +
> /*
> * This is a unique percpu identifier for a given task entry context.
> * Conceptually, it's incremented every time the CPU enters the kernel from
> @@ -143,14 +148,17 @@ static void unwind_deferred_task_work(struct callback_head *head)
> struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
> struct unwind_stacktrace trace;
> struct unwind_work *work;
> + unsigned long bits;
> u64 cookie;
> int idx;
>
> - if (WARN_ON_ONCE(!local_read(&info->pending)))
> + if (WARN_ON_ONCE(!unwind_pending(info)))
> return;
>
> - /* Allow work to come in again */
> - local_set(&info->pending, 0);
> + /* Clear pending bit but make sure to have the current bits */
> + bits = READ_ONCE(info->unwind_mask);
> + while (!try_cmpxchg(&info->unwind_mask, &bits, bits & ~UNWIND_PENDING))
> + ;
We have:
bits = atomic_long_fetch_andnot(UNWIND_PENDING, &info->unwind_mask);
for that. A fair number of architecture can actually do this better than
a cmpxchg loop.
>
> /*
> * From here on out, the callback must always be called, even if it's
> @@ -166,10 +174,8 @@ static void unwind_deferred_task_work(struct callback_head *head)
> idx = srcu_read_lock(&unwind_srcu);
> list_for_each_entry_srcu(work, &callbacks, list,
> srcu_read_lock_held(&unwind_srcu)) {
> - if (test_bit(work->bit, &info->unwind_mask)) {
> + if (test_bit(work->bit, &bits))
> work->func(work, &trace, cookie);
> - clear_bit(work->bit, &info->unwind_mask);
> - }
> }
> srcu_read_unlock(&unwind_srcu, idx);
> }
> @@ -194,15 +200,17 @@ static void unwind_deferred_task_work(struct callback_head *head)
> * because it has already been previously called for the same entry context,
> * it will be called again with the same stack trace and cookie.
> *
> - * Return: 1 if the the callback was already queued.
> - * 0 if the callback successfully was queued.
> + * Return: 0 if the callback successfully was queued.
> + * UNWIND_ALREADY_PENDING if the the callback was already queued.
> + * UNWIND_ALREADY_EXECUTED if the callback was already called
> + * (and will not be called again)
> * Negative if there's an error.
> * @cookie holds the cookie of the first request by any user
> */
Lots of babbling in the Changelog, but no real elucidation as to why you
need this second return value.
AFAICT it serves no real purpose; the users of this function should not
care. The only difference is that the unwind reference (your cookie)
becomes a backward reference instead of a forward reference. But why
would anybody care?
Whatever tool is ultimately in charge of gluing humpty^Wstacktraces back
together again should have no problem with this.
> int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
> {
> struct unwind_task_info *info = &current->unwind_info;
> - long pending;
> + unsigned long old, bits;
> int bit;
> int ret;
>
> @@ -225,32 +233,52 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
>
> *cookie = get_cookie(info);
>
> - /* This is already queued */
> - if (test_bit(bit, &info->unwind_mask))
> - return 1;
> + old = READ_ONCE(info->unwind_mask);
> +
> + /* Is this already queued */
> + if (test_bit(bit, &old)) {
> + /*
> + * If pending is not set, it means this work's callback
> + * was already called.
> + */
> + return old & UNWIND_PENDING ? UNWIND_ALREADY_PENDING :
> + UNWIND_ALREADY_EXECUTED;
> + }
>
> - /* callback already pending? */
> - pending = local_read(&info->pending);
> - if (pending)
> + if (unwind_pending(info))
> goto out;
>
> + /*
> + * This is the first to enable another task_work for this task since
> + * the task entered the kernel, or had already called the callbacks.
> + * Set only the bit for this work and clear all others as they have
> + * already had their callbacks called, and do not need to call them
> + * again because of this work.
> + */
> + bits = UNWIND_PENDING | BIT(bit);
> +
> + /*
> + * If the cmpxchg() fails, it means that an NMI came in and set
> + * the pending bit as well as cleared the other bits. Just
> + * jump to setting the bit for this work.
> + */
> if (CAN_USE_IN_NMI) {
> - /* Claim the work unless an NMI just now swooped in to do so. */
> - if (!local_try_cmpxchg(&info->pending, &pending, 1))
> + if (!try_cmpxchg(&info->unwind_mask, &old, bits))
> goto out;
> } else {
> - local_set(&info->pending, 1);
> + info->unwind_mask = bits;
> }
>
> /* The work has been claimed, now schedule it. */
> ret = task_work_add(current, &info->work, TWA_RESUME);
> - if (WARN_ON_ONCE(ret)) {
> - local_set(&info->pending, 0);
> - return ret;
> - }
>
> + if (WARN_ON_ONCE(ret))
> + WRITE_ONCE(info->unwind_mask, 0);
> +
> + return ret;
> out:
> - return test_and_set_bit(bit, &info->unwind_mask);
> + return test_and_set_bit(bit, &info->unwind_mask) ?
> + UNWIND_ALREADY_PENDING : 0;
> }
This is some of the most horrifyingly confused code I've seen in a
while.
Please just slow down and think for a minute.
The below is the last four patches rolled into one. Not been near a
compiler.
---
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -524,4 +524,8 @@ DEFINE_LOCK_GUARD_1(srcu, struct srcu_st
srcu_read_unlock(_T->lock, _T->idx),
int idx)
+DEFINE_LOCK_GUARD_1(srcu_lite, struct srcu_struct,
+ _T->idx = srcu_read_lock_lite(_T->lock),
+ srcu_read_unlock_lite(_T->lock, _T->idx),
+ int idx)
#endif
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -13,10 +13,14 @@ typedef void (*unwind_callback_t)(struct
struct unwind_work {
struct list_head list;
unwind_callback_t func;
+ int bit;
};
#ifdef CONFIG_UNWIND_USER
+#define UNWIND_PENDING_BIT 0
+#define UNWIND_PENDING BIT(UNWIND_PENDING_BIT)
+
void unwind_task_init(struct task_struct *task);
void unwind_task_free(struct task_struct *task);
@@ -28,15 +32,26 @@ void unwind_deferred_cancel(struct unwin
static __always_inline void unwind_reset_info(void)
{
- if (unlikely(current->unwind_info.id.id))
+ struct unwind_task_info *info = &current->unwind_info;
+ unsigned long bits;
+
+ /* Was there any unwinding? */
+ if (unlikely(info->unwind_mask)) {
+ bits = raw_atomic_long_read(&info->unwind_mask);
+ do {
+ /* Is a task_work going to run again before going back */
+ if (bits & UNWIND_PENDING)
+ return;
+ } while (!raw_atomic_long_try_cmpxchg(&info->unwind_mask, &bits, 0UL));
current->unwind_info.id.id = 0;
+ }
/*
* As unwind_user_faultable() can be called directly and
* depends on nr_entries being cleared on exit to user,
* this needs to be a separate conditional.
*/
- if (unlikely(current->unwind_info.cache))
- current->unwind_info.cache->nr_entries = 0;
+ if (unlikely(info->cache))
+ info->cache->nr_entries = 0;
}
#else /* !CONFIG_UNWIND_USER */
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -2,6 +2,8 @@
#ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+#include <linux/atomic.h>
+
struct unwind_cache {
unsigned int nr_entries;
unsigned long entries[];
@@ -19,8 +21,8 @@ union unwind_task_id {
struct unwind_task_info {
struct unwind_cache *cache;
struct callback_head work;
+ atomic_long_t unwind_mask;
union unwind_task_id id;
- int pending;
};
#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -12,13 +12,39 @@
#include <linux/slab.h>
#include <linux/mm.h>
+/*
+ * For requesting a deferred user space stack trace from NMI context
+ * the architecture must support a safe cmpxchg in NMI context.
+ * For those architectures that do not have that, then it cannot ask
+ * for a deferred user space stack trace from an NMI context. If it
+ * does, then it will get -EINVAL.
+ */
+#ifdef CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG
+#define UNWIND_NMI_SAFE 1
+static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
+{
+ u32 zero = 0;
+ return try_cmpxchg(&info->id.cnt, &zero, cnt);
+}
+#else
+#define UNWIND_NMI_SAFE 0
+/* When NMIs are not allowed, this always succeeds */
+static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
+{
+ info->id.cnt = cnt;
+ return true;
+}
+#endif /* CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG */
+
/* Make the cache fit in a 4K page */
#define UNWIND_MAX_ENTRIES \
((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
-/* Guards adding to and reading the list of callbacks */
+/* Guards adding to or removing from the list of callbacks */
static DEFINE_MUTEX(callback_mutex);
static LIST_HEAD(callbacks);
+static unsigned long unwind_mask;
+DEFINE_STATIC_SRCU(unwind_srcu);
/*
* This is a unique percpu identifier for a given task entry context.
@@ -41,21 +67,16 @@ static DEFINE_PER_CPU(u32, unwind_ctx_ct
*/
static u64 get_cookie(struct unwind_task_info *info)
{
- u32 cpu_cnt;
- u32 cnt;
- u32 old = 0;
+ u32 cnt = 1;
if (info->id.cpu)
return info->id.id;
- cpu_cnt = __this_cpu_read(unwind_ctx_ctr);
- cpu_cnt += 2;
- cnt = cpu_cnt | 1; /* Always make non zero */
-
- if (try_cmpxchg(&info->id.cnt, &old, cnt)) {
- /* Update the per cpu counter */
- __this_cpu_write(unwind_ctx_ctr, cpu_cnt);
- }
+ /* LSB is always set to ensure 0 is an invalid value. */
+ cnt |= __this_cpu_read(unwind_ctx_ctr) + 2;
+ if (try_assign_cnt(info, cnt))
+ __this_cpu_write(unwind_ctx_ctr, cnt);
+
/* Interrupts are disabled, the CPU will always be same */
info->id.cpu = smp_processor_id() + 1; /* Must be non zero */
@@ -117,13 +138,13 @@ static void unwind_deferred_task_work(st
struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
struct unwind_stacktrace trace;
struct unwind_work *work;
+ unsigned long bits;
u64 cookie;
- if (WARN_ON_ONCE(!info->pending))
+ if (WARN_ON_ONCE(!unwind_pending(info)))
return;
- /* Allow work to come in again */
- WRITE_ONCE(info->pending, 0);
+ bits = atomic_long_fetch_andnot(UNWIND_PENDING, &info->unwind_mask);
/*
* From here on out, the callback must always be called, even if it's
@@ -136,9 +157,11 @@ static void unwind_deferred_task_work(st
cookie = info->id.id;
- guard(mutex)(&callback_mutex);
- list_for_each_entry(work, &callbacks, list) {
- work->func(work, &trace, cookie);
+ guard(srcu_lite)(&unwind_srcu);
+ list_for_each_entry_srcu(work, &callbacks, list,
+ srcu_read_lock_held(&unwind_srcu)) {
+ if (test_bit(work->bit, &bits))
+ work->func(work, &trace, cookie);
}
}
@@ -162,7 +185,7 @@ static void unwind_deferred_task_work(st
* because it has already been previously called for the same entry context,
* it will be called again with the same stack trace and cookie.
*
- * Return: 1 if the the callback was already queued.
+ * Return: 1 if the callback was already queued.
* 0 if the callback successfully was queued.
* Negative if there's an error.
* @cookie holds the cookie of the first request by any user
@@ -170,41 +193,62 @@ static void unwind_deferred_task_work(st
int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
{
struct unwind_task_info *info = &current->unwind_info;
- int ret;
+ unsigned long bits, mask;
+ int bit, ret;
*cookie = 0;
- if (WARN_ON_ONCE(in_nmi()))
- return -EINVAL;
-
if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
!user_mode(task_pt_regs(current)))
return -EINVAL;
+ /* NMI requires having safe cmpxchg operations */
+ if (WARN_ON_ONCE(!UNWIND_NMI_SAFE && in_nmi()))
+ return -EINVAL;
+
+ /* Do not allow cancelled works to request again */
+ bit = READ_ONCE(work->bit);
+ if (WARN_ON_ONCE(bit < 0))
+ return -EINVAL;
+
guard(irqsave)();
*cookie = get_cookie(info);
- /* callback already pending? */
- if (info->pending)
+ bits = UNWIND_PENDING | BIT(bit);
+ mask = atomic_long_fetch_or(bits, &info->unwind_mask);
+ if (mask & bits)
return 1;
/* The work has been claimed, now schedule it. */
ret = task_work_add(current, &info->work, TWA_RESUME);
if (WARN_ON_ONCE(ret))
- return ret;
-
- info->pending = 1;
- return 0;
+ atomic_long_set(&info->unwind_mask, 0);
+
+ return ret;
}
void unwind_deferred_cancel(struct unwind_work *work)
{
+ struct task_struct *g, *t;
+ int bit;
+
if (!work)
return;
guard(mutex)(&callback_mutex);
- list_del(&work->list);
+ list_del_rcu(&work->list);
+ bit = work->bit;
+
+ /* Do not allow any more requests and prevent callbacks */
+ work->bit = -1;
+
+ __clear_bit(bit, &unwind_mask);
+
+ synchronize_srcu(&unwind_srcu);
+
+ guard(rcu)();
+ /* Clear this bit from all threads */
+ for_each_process_thread(g, t)
+ atomic_long_andnot(BIT(bit), &t->unwind_info.unwind_mask);
}
int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
@@ -212,7 +256,15 @@ int unwind_deferred_init(struct unwind_w
memset(work, 0, sizeof(*work));
guard(mutex)(&callback_mutex);
- list_add(&work->list, &callbacks);
+
+ /* See if there's a bit in the mask available */
+ if (unwind_mask == ~UNWIND_PENDING)
+ return -EBUSY;
+
+ work->bit = ffz(unwind_mask);
+ __set_bit(work->bit, &unwind_mask);
+
+ list_add_rcu(&work->list, &callbacks);
work->func = func;
return 0;
}
* Re: [PATCH v13 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-07-15 9:09 ` Peter Zijlstra
@ 2025-07-15 12:35 ` Steven Rostedt
0 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2025-07-15 12:35 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Tue, 15 Jul 2025 11:09:55 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jul 14, 2025 at 11:11:58AM -0400, Steven Rostedt wrote:
> > On Mon, 14 Jul 2025 17:05:16 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > Urgh; so I hate reviewing code you're ripping out in the next patch :-(
> >
> > Sorry. It just happened to be developed that way. Patch 10 came about
> > to fix a bug that was triggered with the current method.
>
> Sure; but then you rework the series such that the bug never happened,
> so reviewers don't go insane from the back and forth or stumble over
> the same bug you then fix later.
>
> You should know this.
The bug was actually with the next patch (#8), which uses the bitmask to
know which tracer requested a callback. Patch 8 cleared the bit after the
callbacks were called. The bug that was triggered was when the tracer set
an event to do a user space stack trace on an event that is called between
the task_work and going back to user space. It triggered an infinite loop
because the bit would get set again and trigger another task_work!
I can merge patches 8 and 10, but that still would not have affected this
patch, and would likely have led to the same confusion.
>
> I'm going to not stare at email for some 3 weeks soon; I strongly
> suggest you take this time to fix up this series to not suffer nonsense
> like this.
>
Sure, I'll take a deep look at your review and work on the next series to
hopefully address each of your concerns.
Thanks!
-- Steve
* Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
2025-07-15 10:29 ` Peter Zijlstra
@ 2025-07-15 12:49 ` Steven Rostedt
2025-07-15 18:06 ` Steven Rostedt
2025-07-15 19:01 ` Peter Zijlstra
2025-07-15 17:20 ` Steven Rostedt
` (2 subsequent siblings)
3 siblings, 2 replies; 45+ messages in thread
From: Steven Rostedt @ 2025-07-15 12:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Tue, 15 Jul 2025 12:29:12 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jul 07, 2025 at 09:22:49PM -0400, Steven Rostedt wrote:
> > diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
> > index 12bffdb0648e..587e120c0fd6 100644
> > --- a/include/linux/unwind_deferred.h
> > +++ b/include/linux/unwind_deferred.h
> > @@ -18,6 +18,14 @@ struct unwind_work {
> >
> > #ifdef CONFIG_UNWIND_USER
> >
> > +#define UNWIND_PENDING_BIT (BITS_PER_LONG - 1)
> > +#define UNWIND_PENDING BIT(UNWIND_PENDING_BIT)
>
> Since it really didn't matter what bit you took, why not take bit 0?
I was worried about it affecting the global unwind_mask test, but I guess
bit zero works too. In fact, I think I could just set the PENDING and USED
(see next patch) bits in the global unwind mask as being already "used" and
then change:
/* See if there's a bit in the mask available */
if (unwind_mask == ~(UNWIND_PENDING|UNWIND_USED))
return -EBUSY;
Back to:
/* See if there's a bit in the mask available */
if (unwind_mask == ~0UL)
return -EBUSY;
>
> > /*
> > * This is a unique percpu identifier for a given task entry context.
> > * Conceptually, it's incremented every time the CPU enters the kernel from
> > @@ -143,14 +148,17 @@ static void unwind_deferred_task_work(struct callback_head *head)
> > struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
> > struct unwind_stacktrace trace;
> > struct unwind_work *work;
> > + unsigned long bits;
> > u64 cookie;
> > int idx;
> >
> > - if (WARN_ON_ONCE(!local_read(&info->pending)))
> > + if (WARN_ON_ONCE(!unwind_pending(info)))
> > return;
> >
> > - /* Allow work to come in again */
> > - local_set(&info->pending, 0);
> > + /* Clear pending bit but make sure to have the current bits */
> > + bits = READ_ONCE(info->unwind_mask);
> > + while (!try_cmpxchg(&info->unwind_mask, &bits, bits & ~UNWIND_PENDING))
> > + ;
>
> We have:
>
> bits = atomic_long_fetch_andnot(UNWIND_PENDING, &info->unwind_mask);
>
> for that. A fair number of architecture can actually do this better than
> a cmpxchg loop.
Thanks, I didn't know about that one.
>
> >
> > /*
> > * From here on out, the callback must always be called, even if it's
> > @@ -166,10 +174,8 @@ static void unwind_deferred_task_work(struct callback_head *head)
> > idx = srcu_read_lock(&unwind_srcu);
> > list_for_each_entry_srcu(work, &callbacks, list,
> > srcu_read_lock_held(&unwind_srcu)) {
> > - if (test_bit(work->bit, &info->unwind_mask)) {
> > + if (test_bit(work->bit, &bits))
> > work->func(work, &trace, cookie);
> > - clear_bit(work->bit, &info->unwind_mask);
> > - }
> > }
> > srcu_read_unlock(&unwind_srcu, idx);
> > }
> > @@ -194,15 +200,17 @@ static void unwind_deferred_task_work(struct callback_head *head)
> > * because it has already been previously called for the same entry context,
> > * it will be called again with the same stack trace and cookie.
> > *
> > - * Return: 1 if the the callback was already queued.
> > - * 0 if the callback successfully was queued.
> > + * Return: 0 if the callback successfully was queued.
> > + * UNWIND_ALREADY_PENDING if the the callback was already queued.
> > + * UNWIND_ALREADY_EXECUTED if the callback was already called
> > + * (and will not be called again)
> > * Negative if there's an error.
> > * @cookie holds the cookie of the first request by any user
> > */
>
> Lots of babbling in the Changelog, but no real elucidation as to why you
> need this second return value.
>
> AFAICT it serves no real purpose; the users of this function should not
> care. The only difference is that the unwind reference (your cookie)
> becomes a backward reference instead of a forward reference. But why
> would anybody care?
Older versions of the code required it. I think I can remove it now.
>
> Whatever tool is ultimately in charge of gluing humpty^Wstacktraces back
> together again should have no problem with this.
>
> > int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
> > {
> > struct unwind_task_info *info = &current->unwind_info;
> > - long pending;
> > + unsigned long old, bits;
> > int bit;
> > int ret;
> >
> > @@ -225,32 +233,52 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
> >
> > *cookie = get_cookie(info);
> >
> > - /* This is already queued */
> > - if (test_bit(bit, &info->unwind_mask))
> > - return 1;
> > + old = READ_ONCE(info->unwind_mask);
> > +
> > + /* Is this already queued */
> > + if (test_bit(bit, &old)) {
> > + /*
> > + * If pending is not set, it means this work's callback
> > + * was already called.
> > + */
> > + return old & UNWIND_PENDING ? UNWIND_ALREADY_PENDING :
> > + UNWIND_ALREADY_EXECUTED;
> > + }
> >
> > - /* callback already pending? */
> > - pending = local_read(&info->pending);
> > - if (pending)
> > + if (unwind_pending(info))
> > goto out;
> >
> > + /*
> > + * This is the first to enable another task_work for this task since
> > + * the task entered the kernel, or had already called the callbacks.
> > + * Set only the bit for this work and clear all others as they have
> > + * already had their callbacks called, and do not need to call them
> > + * again because of this work.
> > + */
> > + bits = UNWIND_PENDING | BIT(bit);
> > +
> > + /*
> > + * If the cmpxchg() fails, it means that an NMI came in and set
> > + * the pending bit as well as cleared the other bits. Just
> > + * jump to setting the bit for this work.
> > + */
> > if (CAN_USE_IN_NMI) {
> > - /* Claim the work unless an NMI just now swooped in to do so. */
> > - if (!local_try_cmpxchg(&info->pending, &pending, 1))
> > + if (!try_cmpxchg(&info->unwind_mask, &old, bits))
> > goto out;
> > } else {
> > - local_set(&info->pending, 1);
> > + info->unwind_mask = bits;
> > }
> >
> > /* The work has been claimed, now schedule it. */
> > ret = task_work_add(current, &info->work, TWA_RESUME);
> > - if (WARN_ON_ONCE(ret)) {
> > - local_set(&info->pending, 0);
> > - return ret;
> > - }
> >
> > + if (WARN_ON_ONCE(ret))
> > + WRITE_ONCE(info->unwind_mask, 0);
> > +
> > + return ret;
> > out:
> > - return test_and_set_bit(bit, &info->unwind_mask);
> > + return test_and_set_bit(bit, &info->unwind_mask) ?
> > + UNWIND_ALREADY_PENDING : 0;
> > }
>
> This is some of the most horrifyingly confused code I've seen in a
> while.
>
> Please just slow down and think for a minute.
>
> The below is the last four patches rolled into one. Not been near a
> compiler.
Are you recommending that I fold those patches into one?
I'm fine with that. Note, part of the reason things are broken up this
way is that I took Josh's code and built on top of his work. When I do
that, I try not to modify someone else's code, and instead make each
stage its own building block.
Also, it follows the way I tend to review code, which is to take the
entire patch set, apply it, then look at each patch compared to the
final result.
That probably explains why my patch series is confusing for you, as it
was written more for the way I review. Sorry about that.
-- Steve
* Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
2025-07-15 10:29 ` Peter Zijlstra
2025-07-15 12:49 ` Steven Rostedt
@ 2025-07-15 17:20 ` Steven Rostedt
2025-07-15 19:07 ` Peter Zijlstra
2025-07-15 22:01 ` Steven Rostedt
2025-07-16 18:26 ` Steven Rostedt
3 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2025-07-15 17:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Tue, 15 Jul 2025 12:29:12 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> @@ -170,41 +193,62 @@ static void unwind_deferred_task_work(st
> int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
> {
> struct unwind_task_info *info = &current->unwind_info;
> - int ret;
> + unsigned long bits, mask;
> + int bit, ret;
>
> *cookie = 0;
>
> - if (WARN_ON_ONCE(in_nmi()))
> - return -EINVAL;
> -
> if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
> !user_mode(task_pt_regs(current)))
> return -EINVAL;
>
> + /* NMI requires having safe cmpxchg operations */
> + if (WARN_ON_ONCE(!UNWIND_NMI_SAFE && in_nmi()))
> + return -EINVAL;
I don't think we want to have a WARN_ON() here, as the perf series tries
to first do the deferred unwinding, and if that fails, it falls back to
its old method.
By having a WARN_ON(), we need to make perf aware of this limitation
too. Do we want to do that?
-- Steve
* Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
2025-07-15 12:49 ` Steven Rostedt
@ 2025-07-15 18:06 ` Steven Rostedt
2025-07-15 18:10 ` Steven Rostedt
` (2 more replies)
2025-07-15 19:01 ` Peter Zijlstra
1 sibling, 3 replies; 45+ messages in thread
From: Steven Rostedt @ 2025-07-15 18:06 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Tue, 15 Jul 2025 08:49:32 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> > > *
> > > - * Return: 1 if the the callback was already queued.
> > > - * 0 if the callback successfully was queued.
> > > + * Return: 0 if the callback successfully was queued.
> > > + * UNWIND_ALREADY_PENDING if the the callback was already queued.
> > > + * UNWIND_ALREADY_EXECUTED if the callback was already called
> > > + * (and will not be called again)
> > > * Negative if there's an error.
> > > * @cookie holds the cookie of the first request by any user
> > > */
> >
> > Lots of babbling in the Changelog, but no real elucidation as to why you
> > need this second return value.
> >
> > AFAICT it serves no real purpose; the users of this function should not
> > care. The only difference is that the unwind reference (your cookie)
> > becomes a backward reference instead of a forward reference. But why
> > would anybody care?
>
> Older versions of the code required it. I think I can remove it now.
Ah it is still used in the perf code:
perf_callchain() has:
if (defer_user) {
int ret = deferred_request(event);
if (!ret)
local_inc(&event->ctx->nr_no_switch_fast);
else if (ret < 0)
defer_user = false;
}
Where deferred_request() is a static function that returns the result
of the unwind request. If it is zero, the callback will be called; if it
is greater than zero, it has already been called; and if it is negative,
there was an error (and perf falls back to the old method).
It looks like when the callback is called it expects nr_no_switch_fast
to be incremented and it will decrement it. This is directly from
Josh's patch and I don't know perf well enough to know if that update
to nr_no_switch_fast is needed.
If it's not needed, we can just return 0 on success and negative on
failure. What do you think?
Here's the original patch:
https://lore.kernel.org/all/20250708020050.928524258@kernel.org/
-- Steve
* Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
2025-07-15 18:06 ` Steven Rostedt
@ 2025-07-15 18:10 ` Steven Rostedt
2025-07-15 18:26 ` Steven Rostedt
2025-07-15 19:04 ` Peter Zijlstra
2 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2025-07-15 18:10 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Tue, 15 Jul 2025 14:06:50 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> Ah it is still used in the perf code:
Either way, what I'll do is to remove this special return value for this
series, and add it back in the perf series if needed.
This is one of the problems that arises when you take a series with
lots of changes and try to break it apart. You will always miss
something that is no longer needed in one series because the change was
made for another.
Working on each one to make sure this all works, it starts to blend together :-p
-- Steve
* Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
2025-07-15 18:06 ` Steven Rostedt
2025-07-15 18:10 ` Steven Rostedt
@ 2025-07-15 18:26 ` Steven Rostedt
2025-07-15 19:04 ` Peter Zijlstra
2 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2025-07-15 18:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Tue, 15 Jul 2025 14:06:50 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> > > > + * Return: 0 if the callback successfully was queued.
> > > > + * UNWIND_ALREADY_PENDING if the the callback was already queued.
> > > > + * UNWIND_ALREADY_EXECUTED if the callback was already called
> > > > + * (and will not be called again)
> > > > * Negative if there's an error.
> > > > * @cookie holds the cookie of the first request by any user
> > > > */
> > >
> > > Lots of babbling in the Changelog, but no real elucidation as to why you
> > > need this second return value.
> > >
> > > AFAICT it serves no real purpose; the users of this function should not
> > > care. The only difference is that the unwind reference (your cookie)
> > > becomes a backward reference instead of a forward reference. But why
> > > would anybody care?
> >
> > Older versions of the code required it. I think I can remove it now.
>
> Ah it is still used in the perf code:
>
> perf_callchain() has:
>
> if (defer_user) {
> int ret = deferred_request(event);
> if (!ret)
> local_inc(&event->ctx->nr_no_switch_fast);
Hmm, I guess this could work if it returned non-zero for both already
queued and already executed. So it doesn't need to be two different
values.
-- Steve
> else if (ret < 0)
> defer_user = false;
> }
* Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
2025-07-15 12:49 ` Steven Rostedt
2025-07-15 18:06 ` Steven Rostedt
@ 2025-07-15 19:01 ` Peter Zijlstra
1 sibling, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2025-07-15 19:01 UTC (permalink / raw)
To: Steven Rostedt
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Tue, Jul 15, 2025 at 08:49:32AM -0400, Steven Rostedt wrote:
> > The below is the last four patches rolled into one. Not been near a
> > compiler.
>
> Are you recommending that I fold those patches into one?
Not particularly; but given the terrible back and forth, the only sane
way to 'show' my changes is from patch 6 onwards folded. If I were to
diff against patch 9 it'd be a shitshow.
At the very least the SRCU thing ought to be broken back out. Not sure
how many pieces it can reasonably be broken into, see what works.
* Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
2025-07-15 18:06 ` Steven Rostedt
2025-07-15 18:10 ` Steven Rostedt
2025-07-15 18:26 ` Steven Rostedt
@ 2025-07-15 19:04 ` Peter Zijlstra
2 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2025-07-15 19:04 UTC (permalink / raw)
To: Steven Rostedt
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Tue, Jul 15, 2025 at 02:06:50PM -0400, Steven Rostedt wrote:
> On Tue, 15 Jul 2025 08:49:32 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > > > *
> > > > - * Return: 1 if the the callback was already queued.
> > > > - * 0 if the callback successfully was queued.
> > > > + * Return: 0 if the callback successfully was queued.
> > > > + * UNWIND_ALREADY_PENDING if the the callback was already queued.
> > > > + * UNWIND_ALREADY_EXECUTED if the callback was already called
> > > > + * (and will not be called again)
> > > > * Negative if there's an error.
> > > > * @cookie holds the cookie of the first request by any user
> > > > */
> > >
> > > Lots of babbling in the Changelog, but no real elucidation as to why you
> > > need this second return value.
> > >
> > > AFAICT it serves no real purpose; the users of this function should not
> > > care. The only difference is that the unwind reference (your cookie)
> > > becomes a backward reference instead of a forward reference. But why
> > > would anybody care?
> >
> > Older versions of the code required it. I think I can remove it now.
>
> Ah it is still used in the perf code:
>
> perf_callchain() has:
>
> if (defer_user) {
> int ret = deferred_request(event);
> if (!ret)
> local_inc(&event->ctx->nr_no_switch_fast);
> else if (ret < 0)
> defer_user = false;
> }
>
> Where deferred_requests() is as static function that returns the result
> of the unwind request. If it is zero, it means the callback will be
> called, if it is greater than zero it means it has already been called,
> and negative is an error (and use the old method).
>
> It looks like when the callback is called it expects nr_no_switch_fast
> to be incremented and it will decrement it. This is directly from
> Josh's patch and I don't know perf well enough to know if that update
> to nr_no_switch_fast is needed.
>
> If it's not needed, we can just return 0 on success and negative on
> failure. What do you think?
I'm yet again confused. I don't see this code differentiate between 1
and 2 return values (those PENDING and EXECUTED).
Anyway, fundamentally I don't think there is a problem with backward
references as opposed to the normal forward references.
So leave it out for now.
* Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
2025-07-15 17:20 ` Steven Rostedt
@ 2025-07-15 19:07 ` Peter Zijlstra
0 siblings, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2025-07-15 19:07 UTC (permalink / raw)
To: Steven Rostedt
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Tue, Jul 15, 2025 at 01:20:16PM -0400, Steven Rostedt wrote:
> On Tue, 15 Jul 2025 12:29:12 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > @@ -170,41 +193,62 @@ static void unwind_deferred_task_work(st
> > int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
> > {
> > struct unwind_task_info *info = &current->unwind_info;
> > - int ret;
> > + unsigned long bits, mask;
> > + int bit, ret;
> >
> > *cookie = 0;
> >
> > - if (WARN_ON_ONCE(in_nmi()))
> > - return -EINVAL;
> > -
> > if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
> > !user_mode(task_pt_regs(current)))
> > return -EINVAL;
> >
> > + /* NMI requires having safe cmpxchg operations */
> > + if (WARN_ON_ONCE(!UNWIND_NMI_SAFE && in_nmi()))
> > + return -EINVAL;
>
> I don't think we want to have a WARN_ON() here as the perf series tries
> to first do the deferred unwinding and if that fails, it will go back
> to it's old method.
The thing is, I don't think we have an architecture that supports NMIs
and does not have NMI safe cmpxchg. And if we do have one such -- I
don't think it has perf; perf very much assumes cmpxchg is NMI safe.
Calling this from NMI context and not having an NMI safe cmpxchg is very
much a dodgy use case. Please leave the WARN, if it ever triggers, we'll
look at who manages and deal with it then.
* Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
2025-07-15 10:29 ` Peter Zijlstra
2025-07-15 12:49 ` Steven Rostedt
2025-07-15 17:20 ` Steven Rostedt
@ 2025-07-15 22:01 ` Steven Rostedt
2025-07-16 18:26 ` Steven Rostedt
3 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2025-07-15 22:01 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Tue, 15 Jul 2025 12:29:12 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> The below is the last four patches rolled into one. Not been near a
> compiler.
And it shows ;-)
> @@ -117,13 +138,13 @@ static void unwind_deferred_task_work(st
> struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
> struct unwind_stacktrace trace;
> struct unwind_work *work;
> + unsigned long bits;
> u64 cookie;
>
> - if (WARN_ON_ONCE(!info->pending))
> + if (WARN_ON_ONCE(!unwind_pending(info)))
> return;
>
> - /* Allow work to come in again */
> - WRITE_ONCE(info->pending, 0);
> + bits = atomic_long_fetch_andnot(UNWIND_PENDING, &info->unwind_mask);
I may need to do what other parts of the kernel have done and turn the
above into:
bits = atomic_long_fetch_andnot(UNWIND_PENDING, (atomic_long_t *)&info->unwind_mask);
As there are other bit manipulations that the atomic_long API does not
take care of, and it's making the code more confusing. When I looked to
see how other users of atomic_long_andnot() did things, most just
typecast the value to use that function :-/
-- Steve
>
> /*
> * From here on out, the callback must always be called, even if it's
> @@ -136,9 +157,11 @@ static void unwind_deferred_task_work(st
>
> cookie = info->id.id;
>
> - guard(mutex)(&callback_mutex);
> - list_for_each_entry(work, &callbacks, list) {
> - work->func(work, &trace, cookie);
> + guard(srcu_lite)(&unwind_srcu);
> + list_for_each_entry_srcu(work, &callbacks, list,
> + srcu_read_lock_held(&unwind_srcu)) {
> + if (test_bit(work->bit, &bits))
> + work->func(work, &trace, cookie);
> }
> }
>
> @@ -162,7 +185,7 @@ static void unwind_deferred_task_work(st
> * because it has already been previously called for the same entry context,
> * it will be called again with the same stack trace and cookie.
> *
> - * Return: 1 if the the callback was already queued.
> + * Return: 1 if the callback was already queued.
> * 0 if the callback successfully was queued.
> * Negative if there's an error.
> * @cookie holds the cookie of the first request by any user
* Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
2025-07-15 10:29 ` Peter Zijlstra
` (2 preceding siblings ...)
2025-07-15 22:01 ` Steven Rostedt
@ 2025-07-16 18:26 ` Steven Rostedt
2025-07-16 18:33 ` Steven Rostedt
3 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2025-07-16 18:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Tue, 15 Jul 2025 12:29:12 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jul 07, 2025 at 09:22:49PM -0400, Steven Rostedt wrote:
> >
> > + /*
> > + * This is the first to enable another task_work for this task since
> > + * the task entered the kernel, or had already called the callbacks.
> > + * Set only the bit for this work and clear all others as they have
> > + * already had their callbacks called, and do not need to call them
> > + * again because of this work.
> > + */
> > + bits = UNWIND_PENDING | BIT(bit);
> > +
> > + /*
> > + * If the cmpxchg() fails, it means that an NMI came in and set
> > + * the pending bit as well as cleared the other bits. Just
> > + * jump to setting the bit for this work.
> > + */
> > if (CAN_USE_IN_NMI) {
> > - /* Claim the work unless an NMI just now swooped in to do so. */
> > - if (!local_try_cmpxchg(&info->pending, &pending, 1))
> > + if (!try_cmpxchg(&info->unwind_mask, &old, bits))
> > goto out;
> > } else {
> > - local_set(&info->pending, 1);
> > + info->unwind_mask = bits;
> > }
> >
> > /* The work has been claimed, now schedule it. */
> > ret = task_work_add(current, &info->work, TWA_RESUME);
> > - if (WARN_ON_ONCE(ret)) {
> > - local_set(&info->pending, 0);
> > - return ret;
> > - }
> >
> > + if (WARN_ON_ONCE(ret))
> > + WRITE_ONCE(info->unwind_mask, 0);
> > +
> > + return ret;
> > out:
> > - return test_and_set_bit(bit, &info->unwind_mask);
> > + return test_and_set_bit(bit, &info->unwind_mask) ?
> > + UNWIND_ALREADY_PENDING : 0;
> > }
>
> This is some of the most horrifyingly confused code I've seen in a
> while.
>
> Please just slow down and think for a minute.
>
> The below is the last four patches rolled into one. Not been near a
> compiler.
The above is still needed as is (explained below).
> @@ -170,41 +193,62 @@ static void unwind_deferred_task_work(st
> int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
> {
> struct unwind_task_info *info = &current->unwind_info;
> - int ret;
> + unsigned long bits, mask;
> + int bit, ret;
>
> *cookie = 0;
>
> - if (WARN_ON_ONCE(in_nmi()))
> - return -EINVAL;
> -
> if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
> !user_mode(task_pt_regs(current)))
> return -EINVAL;
>
> + /* NMI requires having safe cmpxchg operations */
> + if (WARN_ON_ONCE(!UNWIND_NMI_SAFE && in_nmi()))
> + return -EINVAL;
> +
> + /* Do not allow cancelled works to request again */
> + bit = READ_ONCE(work->bit);
> + if (WARN_ON_ONCE(bit < 0))
> + return -EINVAL;
> +
> guard(irqsave)();
>
> *cookie = get_cookie(info);
>
> - /* callback already pending? */
> - if (info->pending)
> + bits = UNWIND_PENDING | BIT(bit);
> + mask = atomic_long_fetch_or(bits, &info->unwind_mask);
> + if (mask & bits)
> return 1;
So the fetch_or() isn't good enough for what needs to be done, which is
why the code above is the way it is.
We have this scenario:
perf and ftrace are both tracing the same task. perf with bit 1 and
ftrace with bit 2. Let's say there's even another perf program
running that registered bit 3.
perf requests a deferred callback, and info->unwind_mask gets bit 1
and the pending bit set.
The task is exiting to user space and calls perf's callback and
clears the pending bit but keeps perf's bit set as it was already
called, and doesn't need to be called again even if perf requests a
new stacktrace before the task gets back to user space.
Now before the task gets back to user space, ftrace requests the
deferred trace. To do so, it must set the pending bit and its bit,
but it must also clear the perf bit as it should not call perf's
callback again.
The atomic_long_fetch_or() above will set ftrace's bit but will not
clear perf's bit, and the perf callback will get called a second time
even though perf never requested another callback.
This is why the code at the top has:
bits = UNWIND_PENDING | BIT(bit);
/*
* If the cmpxchg() fails, it means that an NMI came in and set
* the pending bit as well as cleared the other bits. Just
* jump to setting the bit for this work.
*/
if (CAN_USE_IN_NMI) {
/* Claim the work unless an NMI just now swooped in to do so. */
-		if (!local_try_cmpxchg(&info->pending, &pending, 1))
+		if (!try_cmpxchg(&info->unwind_mask, &old, bits))
goto out;
That cmpxchg() clears out any of the old bits if pending isn't set. Now
if an NMI came in and the other perf process requested a callback, it
would set its own bit plus the pending bit and then ftrace only needs
to jump to the end and do the test_and_set on its bit.
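The difference between a plain fetch_or() and the cmpxchg() publish can be sketched with a user-space model. The bit assignments and the single-threaded stand-ins for the atomics below are assumptions for illustration only:

```c
#include <assert.h>

#define BIT(n)          (1UL << (n))
#define UNWIND_PENDING  BIT(0)	/* assumed bit position */

/* Single-threaded model of atomic_long_fetch_or(): stale bits survive. */
static unsigned long request_fetch_or(unsigned long *mask, int bit)
{
	unsigned long old = *mask;

	*mask |= UNWIND_PENDING | BIT(bit);
	return old;
}

/*
 * Single-threaded model of the try_cmpxchg() path: when PENDING is not
 * set, publish only PENDING | BIT(bit), dropping already-serviced bits;
 * if PENDING is already set, just set this work's bit.
 */
static void request_cmpxchg(unsigned long *mask, int bit)
{
	if (!(*mask & UNWIND_PENDING))
		*mask = UNWIND_PENDING | BIT(bit);
	else
		*mask |= BIT(bit);
}
```

Starting from the state in the scenario above, with perf (bit 1) already serviced and PENDING clear, a fetch_or() request from ftrace (bit 2) leaves perf's stale bit set, while the cmpxchg() version clears it.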
-- Steve
>
> /* The work has been claimed, now schedule it. */
> ret = task_work_add(current, &info->work, TWA_RESUME);
> if (WARN_ON_ONCE(ret))
> - return ret;
> -
> - info->pending = 1;
> - return 0;
> + atomic_long_set(0, &info->unwind_mask);
> }
>
* Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
2025-07-16 18:26 ` Steven Rostedt
@ 2025-07-16 18:33 ` Steven Rostedt
2025-07-16 19:25 ` Steven Rostedt
0 siblings, 1 reply; 45+ messages in thread
From: Steven Rostedt @ 2025-07-16 18:33 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Wed, 16 Jul 2025 14:26:09 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> Now before the task gets back to user space, ftrace requests the
> deferred trace. To do so, it must set the pending bit and its bit,
> but it must also clear the perf bit as it should not call perf's
> callback again.
After ftrace clears the bits, it is possible that the first perf
program will request again and this time it will get another callback
with the same cookie. But at least it has a request between the last
callback and the next one.
That is, it would have:
[Task enters kernel]
request -> add cookie
request -> add cookie
[..]
callback -> add trace + cookie
[ftrace clears bits]
request -> add cookie
callback -> add trace + cookie
[Task exits back to user space]
Which shouldn't be too confusing. But if we just do the fetch_or and it
didn't request a new trace, it would have:
[Task enters kernel]
request -> add cookie
request -> add cookie
[..]
callback -> add trace + cookie
[ftrace clears bits]
callback -> add trace + cookie
[Task exits back to user space]
Where there's two callbacks written to the perf buffer for the same
request.
Maybe this isn't a problem, but I was trying to avoid a tracer getting
extra callbacks due to other tracers affecting the state.
-- Steve
* Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space
2025-07-16 18:33 ` Steven Rostedt
@ 2025-07-16 19:25 ` Steven Rostedt
0 siblings, 0 replies; 45+ messages in thread
From: Steven Rostedt @ 2025-07-16 19:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Wed, 16 Jul 2025 14:33:52 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> [Task enters kernel]
> request -> add cookie
> request -> add cookie
> [..]
> callback -> add trace + cookie
> [ftrace clears bits]
> callback -> add trace + cookie
> [Task exits back to user space]
Another solution could be to add another unwind mask of completed
callbacks. Then we could use the fetch_or(). I guess that could look
like this:
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 15045999c5e2..0124865aaab4 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -61,8 +61,10 @@ static __always_inline void unwind_reset_info(void)
} while (!try_cmpxchg(&info->unwind_mask, &bits, 0UL));
current->unwind_info.id.id = 0;
- if (unlikely(info->cache))
+ if (unlikely(info->cache)) {
info->cache->nr_entries = 0;
+ info->cache->unwind_completed = 0;
+ }
}
}
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 5dc9cda141ff..33b62ac25c86 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -3,6 +3,7 @@
#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
struct unwind_cache {
+ unsigned long unwind_completed;
unsigned int nr_entries;
unsigned long entries[];
};
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 60cc71062f86..9a3e06ee9d63 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -170,13 +170,19 @@ static void process_unwind_deferred(struct task_struct *task)
unwind_user_faultable(&trace);
+ if (info->cache)
+ bits &= ~(info->cache->unwind_completed);
+
cookie = info->id.id;
guard(srcu_lite)(&unwind_srcu);
list_for_each_entry_srcu(work, &callbacks, list,
srcu_read_lock_held(&unwind_srcu)) {
- if (test_bit(work->bit, &bits))
+ if (test_bit(work->bit, &bits)) {
work->func(work, &trace, cookie);
+ if (info->cache)
+ info->cache->unwind_completed |= BIT(work->bit);
+ }
}
}
Instead of adding another long word to the task struct, I just use the
unwind_cache that gets allocated on first use.
I think this can work. I'll switch over to this, and then I can use the
fetch_or() and there should be no extra callbacks, even if an
already-called callback is requested again after another tracer's
request has triggered a new task work.
I'll update this patch to use the fetch_or() (and fold it into the
bitmask patch), and make the above a separate patch that just gets rid
of spurious callbacks.
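The filtering that diff performs can be reduced to two one-line helpers. The function names below are stand-ins for illustration, not the kernel API:

```c
#include <assert.h>

#define BIT(n) (1UL << (n))

/* Which requested callbacks should actually run, given what completed. */
static unsigned long callbacks_to_run(unsigned long bits,
				      unsigned long completed)
{
	return bits & ~completed;
}

/* Record a callback as completed so a later pass skips it. */
static void mark_completed(unsigned long *completed, int bit)
{
	*completed |= BIT(bit);
}
```

With this, a second task-work pass triggered by ftrace (bit 2) no longer re-runs perf's callback (bit 1), because perf's bit is masked out by unwind_completed.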
-- Steve
end of thread, other threads: [~2025-07-16 19:25 UTC | newest]
Thread overview: 45+ messages
2025-07-08 1:22 [PATCH v13 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 01/14] unwind_user: Add user space unwinding API Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 02/14] unwind_user: Add frame pointer support Steven Rostedt
2025-07-09 10:01 ` Jens Remus
2025-07-10 12:28 ` Jens Remus
2025-07-10 15:21 ` Steven Rostedt
2025-07-10 15:41 ` Jens Remus
2025-07-10 17:08 ` Steven Rostedt
2025-07-14 12:52 ` Jens Remus
2025-07-08 1:22 ` [PATCH v13 03/14] unwind_user: Add compat mode " Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 04/14] unwind_user/deferred: Add unwind_user_faultable() Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 05/14] unwind_user/deferred: Add unwind cache Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 06/14] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
2025-07-14 13:29 ` Peter Zijlstra
2025-07-14 14:19 ` Steven Rostedt
2025-07-14 15:05 ` Peter Zijlstra
2025-07-14 15:11 ` Steven Rostedt
2025-07-15 9:09 ` Peter Zijlstra
2025-07-15 12:35 ` Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 08/14] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 09/14] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
2025-07-14 13:56 ` Peter Zijlstra
2025-07-14 14:21 ` Steven Rostedt
2025-07-14 15:03 ` Peter Zijlstra
2025-07-08 1:22 ` [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to user space Steven Rostedt
2025-07-15 10:29 ` Peter Zijlstra
2025-07-15 12:49 ` Steven Rostedt
2025-07-15 18:06 ` Steven Rostedt
2025-07-15 18:10 ` Steven Rostedt
2025-07-15 18:26 ` Steven Rostedt
2025-07-15 19:04 ` Peter Zijlstra
2025-07-15 19:01 ` Peter Zijlstra
2025-07-15 17:20 ` Steven Rostedt
2025-07-15 19:07 ` Peter Zijlstra
2025-07-15 22:01 ` Steven Rostedt
2025-07-16 18:26 ` Steven Rostedt
2025-07-16 18:33 ` Steven Rostedt
2025-07-16 19:25 ` Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 11/14] unwind: Add USED bit to only have one conditional on way " Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 12/14] unwind: Finish up unwind when a task exits Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 13/14] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
2025-07-11 8:43 ` David Laight
2025-07-11 16:11 ` Steven Rostedt
2025-07-08 1:22 ` [PATCH v13 14/14] unwind_user/x86: Enable compat mode " Steven Rostedt