* [PATCH v10 00/14] unwind_user: x86: Deferred unwinding infrastructure
From: Steven Rostedt @ 2025-06-11 0:54 UTC
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton
Hi Peter and Ingo,
This is the first patch series of a set that will make it possible to use
SFrames[1] in the Linux kernel. A quick recap of the motivation for doing
this:
Currently, the only way to get a user space stack trace from an actual stack
walk (as opposed to copying a large amount of the user stack into the kernel
ring buffer) is to use frame pointers. This has a few issues. The biggest
one is that compiling frame pointers into every application and library has
been shown to cause noticeable performance overhead.
Another issue is that the format of the frames may not always be consistent
between compilers, and some architectures (s390) have no defined format for
doing a reliable stack walk at all. On those architectures, the only way to
perform user space profiling is to copy the user stack into the kernel
buffer.
SFrame is now supported in GNU binutils and will soon be supported by LLVM
as well. SFrame acts much like ORC and lives in the ELF executable file as
its own section. Like ORC, it has two tables. The first table is sorted by
instruction pointer (IP): looking up the current IP in it leads to an entry
in the second table, which describes where on the stack the return address
of the current function is located. That return address can in turn be
looked up in the first table to find its caller's entry, and so on. This
performs a user space stack walk.
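To make the two-table lookup concrete, here is a minimal sketch in C. The
structure and field names are hypothetical and only model the scheme
described above; the real SFrame format is more compact and more involved:

struct sf_func_entry {		/* table 1: sorted by function start IP */
	unsigned long start_ip;
	unsigned long end_ip;
	unsigned int row;	/* index into table 2 */
};

struct sf_frame_row {		/* table 2: frame description */
	int cfa_off;		/* SP/FP offset to the frame's CFA */
	int ra_off;		/* CFA offset to the return address */
};

/* Binary search table 1 for the entry covering @ip. */
static const struct sf_func_entry *
sf_lookup(const struct sf_func_entry *tbl, unsigned int nr, unsigned long ip)
{
	unsigned int lo = 0, hi = nr;

	while (lo < hi) {
		unsigned int mid = lo + (hi - lo) / 2;

		if (ip < tbl[mid].start_ip)
			hi = mid;
		else if (ip >= tbl[mid].end_ip)
			lo = mid + 1;
		else
			return &tbl[mid];
	}
	return NULL;	/* IP not covered, stop the walk */
}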
Now, because the SFrame section lives in the ELF file, it needs to be
faulted into memory when it is used. This means that walking the user space
stack requires a faultable context. As profilers like perf request a stack
trace in interrupt or NMI context, the walk cannot be performed at the time
of the request. Instead, it must be deferred until it is safe to fault in
user space memory. One place this is known to be safe is when the task is
about to return back to user space.
Josh originally wrote the PoC of this code; the last version he posted was
back in January:
https://lore.kernel.org/all/cover.1737511963.git.jpoimboe@kernel.org/
That series contained everything: new faultable user space stack walking
code, deferring of the stack walk, the SFrame implementation, x86 fixups
(VDSO), and even both the kernel and user space sides of perf to make it
work. Josh then ran out of time to work on it and I picked it up. As there
are several parts to his series that do not depend on each other, I also
broke it up into separate series.
This series contains only the core infrastructure that all the rest needs.
Of the 14 patches, only 3 are x86 specific; the rest is simply the unwinding
code that s390 can build against. I moved the 3 x86 specific patches to the
end of the series.
Since multiple tracers (perf, ftrace, BPF, etc.) can attach to the deferred
unwinder, and each of these tracers can attach to some or all of the tasks
to trace, there is a many-to-many relationship. This relationship may need
to be established in interrupt or NMI context, so it cannot rely on any
allocation. To handle this, a bitmask is used. There is a global bitmask of
size long in which a single bit is allocated when a tracer registers for
deferred stack traces. The task struct also has a bitmask: when a request
for a deferred stack trace comes in from one of the tracers, the
corresponding bit for that tracer is set in the task's mask. As one of the
bits is reserved to represent that a request has been made, at most 31
tracers on 32 bit systems, or 63 on 64 bit systems, may be registered at a
given time. This should not be an issue, as only one perf application or
ftrace instance should request a bit, and BPF should likewise use a single
bit and handle any multiplexing for its users.
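A minimal sketch of that bit allocation, with hypothetical names (locking
elided; the actual patches handle the reserved pending bit differently):

/* One global mask; one bit per registered tracer. */
static unsigned long unwind_mask;

static int unwind_bit_alloc(void)
{
	/* Bit 0 is reserved to flag a pending request. */
	for (int bit = 1; bit < BITS_PER_LONG; bit++) {
		if (!(unwind_mask & BIT(bit))) {
			unwind_mask |= BIT(bit);
			return bit;	/* this tracer's bit */
		}
	}
	return -EBUSY;	/* all 31 (or 63) tracer slots taken */
}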
When the first request for a deferred stack trace is made for a task, a
timestamp is taken. This timestamp is used as the identifier for the user
space stack trace. As the user space stack trace does not change while the
task is in the kernel, requests that come in after the first request and
before the task goes back to user space will get the same timestamp. The
timestamp also serves the purpose of knowing how far back a given user space
stack trace goes: if events were dropped, and the dropped events span the
task returning to user space and coming back into the kernel, the new stack
trace taken on the next exit to user space must not be associated with the
events from before the drop.
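In other words, the logic boils down to the sketch below (this matches the
get_timestamp() helper added in patch 6; the timestamp is cleared again when
the task returns to user space):

/*
 * The first request in a kernel-entry window stamps the task;
 * all later requests in the same window see the same value.
 */
static u64 get_timestamp(struct unwind_task_info *info)
{
	if (!info->timestamp)
		info->timestamp = local_clock();
	return info->timestamp;
}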
When a tracer makes a request, it gets this timestamp back, and the bit for
the requesting tracer is set in the task's bitmask. A task work is used to
have the task run the callbacks before it goes back to user space. When it
does, it scans its bitmask and calls the callbacks of all the tracers that
have their representing bit set. Each callback receives the user space stack
trace as well as the timestamp that was used.
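A sketch of that exit-to-user fan-out, simplified from patch 8 (locking and
clearing of the mask elided):

/*
 * Deliver the trace and timestamp to every tracer whose bit is set
 * in the task's mask.
 */
static void unwind_fanout(unsigned long task_mask,
			  struct unwind_stacktrace *trace, u64 timestamp)
{
	struct unwind_work *work;

	list_for_each_entry(work, &callbacks, list) {
		if (task_mask & BIT(work->bit))
			work->func(work, trace, timestamp);
	}
}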
That's the basic idea. Obviously there's more to it than the above
explanation, but each patch explains what it is doing, and the work is
broken up step by step.
I fully tested these patches in my own test suite, which did find minor
bugs that I have fixed.
I run two SFrame meetings a month (one in an Asia-friendly timezone and the
other in a Europe-friendly one). We have developers from Google, Oracle,
Red Hat, IBM, EfficiOS, Meta, Microsoft, and more that attend. (If anyone is
interested in attending, let me know.) I have been running these since
December of 2024. Last year at GNU Cauldron, a few of us got together to
discuss the design. We are pretty confident that the current design is
sound. We have working code on top of this and have been testing it.
Since the s390 folks want to start working on this (they already have
SFrame patches from working on the prototypes), I would like this series to
be a separate branch based on top of v6.16-rc2. Then all the subsystems that
want to build on top of it can do so, as there's no real dependency between
them.
I have more patches on top of this series that add perf support, ftrace
support, SFrame support, and the x86 fixups (for the VDSO). Each of those
patch series can be worked on independently, but they all depend on this
series (although the x86 specific patches at the end aren't necessarily
needed, at least not by other architectures).
Please review, and if you are happy with them, let's get them into a branch
that we can all use. I'm happy to take it in my tree if I can get acks on
the x86 code. Or it can be in the tip tree as a separate branch on top of
6.16-rc2 (or rc3 if someone finds some issues), and I'll just base my work
on top of that. It doesn't matter either way.
Thanks!
-- Steve
[1] https://sourceware.org/binutils/wiki/sframe
Changes since v9: https://lore.kernel.org/linux-trace-kernel/20250513223435.636200356@goodmis.org/
- As an asm-generic header is not included when an architecture defines its
own version of that header, having more than one #ifndef check that sets
default values does not work when those checks live in the asm-generic
header but the architecture header does not define all the values.
o Move the ARCH_INIT_USER_FP_FRAME check to linux/unwind_user.h
o Have linux/unwind_user.h include asm/unwind_user.h so that C files do not
have to include the asm header directly
o Remove unnecessary frame initialization
o Added unwind_user.h to asm-generic/Kbuild
o Move the #ifndef arch_unwind_user_state check to linux/unwind_user_types.h
o Move the following to linux/unwind_user.h:
#ifndef ARCH_INIT_USER_COMPAT_FP_FRAME
#ifndef arch_unwind_user_init
#ifndef arch_unwind_user_next
- Changed UNWIND_GET_USER_LONG() to use "unsigned long" instead of u64 as
this can be called on 32 bit architectures and just because
"compat_state()" returns false doesn't mean that the value is 64 bit.
- Check for ret < 0 instead of just ret != 0 from return code of
task_work_add(). Don't want to just assume it's less than zero as it
needs to return a negative on error.
- Use BIT() macro for bit setting and testing.
- Moved the "unwind_mask" from the task_struct into the task->unwind_info
structure.
- Fix compare with ~UNWIND_PENDING_BIT to be ~UNWIND_PENDING
- Remove unneeded include of perf_event.h
This series can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
unwind/core
Josh Poimboeuf (9):
unwind_user: Add user space unwinding API
unwind_user: Add frame pointer support
unwind_user: Add compat mode frame pointer support
unwind_user/deferred: Add unwind cache
unwind_user/deferred: Add deferred unwinding interface
unwind_user/deferred: Make unwind deferral requests NMI-safe
unwind_user/x86: Enable frame pointer unwinding on x86
perf/x86: Rename and move get_segment_base() and make it global
unwind_user/x86: Enable compat mode frame pointer unwinding on x86
Steven Rostedt (5):
unwind_user/deferred: Add unwind_deferred_trace()
unwind deferred: Use bitmask to determine which callbacks to call
unwind deferred: Use SRCU unwind_deferred_task_work()
unwind: Clear unwind_mask on exit back to user space
unwind: Finish up unwind when a task exits
----
MAINTAINERS | 8 +
arch/Kconfig | 11 +
arch/x86/Kconfig | 2 +
arch/x86/events/core.c | 44 +---
arch/x86/include/asm/ptrace.h | 2 +
arch/x86/include/asm/unwind_user.h | 60 +++++
arch/x86/include/asm/unwind_user_types.h | 17 ++
arch/x86/kernel/ptrace.c | 38 +++
include/asm-generic/Kbuild | 2 +
include/asm-generic/unwind_user.h | 5 +
include/asm-generic/unwind_user_types.h | 5 +
include/linux/entry-common.h | 2 +
include/linux/sched.h | 5 +
include/linux/unwind_deferred.h | 75 ++++++
include/linux/unwind_deferred_types.h | 18 ++
include/linux/unwind_user.h | 33 +++
include/linux/unwind_user_types.h | 39 ++++
kernel/Makefile | 1 +
kernel/exit.c | 2 +
kernel/fork.c | 4 +
kernel/unwind/Makefile | 1 +
kernel/unwind/deferred.c | 383 +++++++++++++++++++++++++++++++
kernel/unwind/user.c | 128 +++++++++++
23 files changed, 846 insertions(+), 39 deletions(-)
create mode 100644 arch/x86/include/asm/unwind_user.h
create mode 100644 arch/x86/include/asm/unwind_user_types.h
create mode 100644 include/asm-generic/unwind_user.h
create mode 100644 include/asm-generic/unwind_user_types.h
create mode 100644 include/linux/unwind_deferred.h
create mode 100644 include/linux/unwind_deferred_types.h
create mode 100644 include/linux/unwind_user.h
create mode 100644 include/linux/unwind_user_types.h
create mode 100644 kernel/unwind/Makefile
create mode 100644 kernel/unwind/deferred.c
create mode 100644 kernel/unwind/user.c
* [PATCH v10 01/14] unwind_user: Add user space unwinding API
From: Steven Rostedt @ 2025-06-11 0:54 UTC
From: Josh Poimboeuf <jpoimboe@kernel.org>
Introduce a generic API for unwinding user stacks.
In order to expand user space unwinding to handle more complex scenarios,
such as deferred unwinding and reading user space information, create a
generic interface that all architectures supporting the various unwinding
methods can use.
This is an alternative to the simple stack_trace_save_user() API for
handling user space stack traces. It does not replace that interface;
rather, this interface will be used to expand the functionality of user
space stack walking.
None of the structures introduced will be exposed to user space tooling.
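As a usage sketch, a caller in (faultable) task context supplies the entry
storage and invokes unwind_user(); the buffer size and the printing are
illustrative only:

	unsigned long buf[64];
	struct unwind_stacktrace trace = { .entries = buf };

	if (!unwind_user(&trace, ARRAY_SIZE(buf))) {
		for (unsigned int i = 0; i < trace.nr; i++)
			pr_info("user frame %u: 0x%lx\n", i, trace.entries[i]);
	}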
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
MAINTAINERS | 8 +++++
arch/Kconfig | 3 ++
include/linux/unwind_user.h | 15 +++++++++
include/linux/unwind_user_types.h | 31 +++++++++++++++++
kernel/Makefile | 1 +
kernel/unwind/Makefile | 1 +
kernel/unwind/user.c | 55 +++++++++++++++++++++++++++++++
7 files changed, 114 insertions(+)
create mode 100644 include/linux/unwind_user.h
create mode 100644 include/linux/unwind_user_types.h
create mode 100644 kernel/unwind/Makefile
create mode 100644 kernel/unwind/user.c
diff --git a/MAINTAINERS b/MAINTAINERS
index a92290fffa16..8617f87bceed 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -25885,6 +25885,14 @@ F: Documentation/driver-api/uio-howto.rst
F: drivers/uio/
F: include/linux/uio_driver.h
+USERSPACE STACK UNWINDING
+M: Josh Poimboeuf <jpoimboe@kernel.org>
+M: Steven Rostedt <rostedt@goodmis.org>
+S: Maintained
+F: include/linux/unwind*.h
+F: kernel/unwind/
+
+
UTIL-LINUX PACKAGE
M: Karel Zak <kzak@redhat.com>
L: util-linux@vger.kernel.org
diff --git a/arch/Kconfig b/arch/Kconfig
index a3308a220f86..ea59e5d7cc69 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -435,6 +435,9 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
It uses the same command line parameters, and sysctl interface,
as the generic hardlockup detectors.
+config UNWIND_USER
+ bool
+
config HAVE_PERF_REGS
bool
help
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
new file mode 100644
index 000000000000..aa7923c1384f
--- /dev/null
+++ b/include/linux/unwind_user.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_H
+#define _LINUX_UNWIND_USER_H
+
+#include <linux/unwind_user_types.h>
+
+int unwind_user_start(struct unwind_user_state *state);
+int unwind_user_next(struct unwind_user_state *state);
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
+
+#define for_each_user_frame(state) \
+ for (unwind_user_start((state)); !(state)->done; unwind_user_next((state)))
+
+#endif /* _LINUX_UNWIND_USER_H */
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
new file mode 100644
index 000000000000..6ed1b4ae74e1
--- /dev/null
+++ b/include/linux/unwind_user_types.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_TYPES_H
+#define _LINUX_UNWIND_USER_TYPES_H
+
+#include <linux/types.h>
+
+enum unwind_user_type {
+ UNWIND_USER_TYPE_NONE,
+};
+
+struct unwind_stacktrace {
+ unsigned int nr;
+ unsigned long *entries;
+};
+
+struct unwind_user_frame {
+ s32 cfa_off;
+ s32 ra_off;
+ s32 fp_off;
+ bool use_fp;
+};
+
+struct unwind_user_state {
+ unsigned long ip;
+ unsigned long sp;
+ unsigned long fp;
+ enum unwind_user_type type;
+ bool done;
+};
+
+#endif /* _LINUX_UNWIND_USER_TYPES_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 32e80dd626af..541186050251 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -55,6 +55,7 @@ obj-y += rcu/
obj-y += livepatch/
obj-y += dma/
obj-y += entry/
+obj-y += unwind/
obj-$(CONFIG_MODULES) += module/
obj-$(CONFIG_KCMP) += kcmp.o
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
new file mode 100644
index 000000000000..349ce3677526
--- /dev/null
+++ b/kernel/unwind/Makefile
@@ -0,0 +1 @@
+ obj-$(CONFIG_UNWIND_USER) += user.o
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
new file mode 100644
index 000000000000..d30449328981
--- /dev/null
+++ b/kernel/unwind/user.c
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Generic interfaces for unwinding user space
+ */
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+#include <linux/unwind_user.h>
+
+int unwind_user_next(struct unwind_user_state *state)
+{
+ /* no implementation yet */
+ return -EINVAL;
+}
+
+int unwind_user_start(struct unwind_user_state *state)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+
+ memset(state, 0, sizeof(*state));
+
+ if ((current->flags & PF_KTHREAD) || !user_mode(regs)) {
+ state->done = true;
+ return -EINVAL;
+ }
+
+ state->type = UNWIND_USER_TYPE_NONE;
+
+ state->ip = instruction_pointer(regs);
+ state->sp = user_stack_pointer(regs);
+ state->fp = frame_pointer(regs);
+
+ return 0;
+}
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries)
+{
+ struct unwind_user_state state;
+
+ trace->nr = 0;
+
+ if (!max_entries)
+ return -EINVAL;
+
+ if (current->flags & PF_KTHREAD)
+ return 0;
+
+ for_each_user_frame(&state) {
+ trace->entries[trace->nr++] = state.ip;
+ if (trace->nr >= max_entries)
+ break;
+ }
+
+ return 0;
+}
--
2.47.2
* [PATCH v10 02/14] unwind_user: Add frame pointer support
From: Steven Rostedt @ 2025-06-11 0:54 UTC
From: Josh Poimboeuf <jpoimboe@kernel.org>
Add optional support for user space frame pointer unwinding. If
supported, the arch needs to enable CONFIG_HAVE_UNWIND_USER_FP and
define ARCH_INIT_USER_FP_FRAME.
By encoding the frame offsets in struct unwind_user_frame, much of this
code can also be reused for future unwinder implementations like sframe.
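For reference, the x86 patch later in this series fills these offsets in
roughly as below, following the standard "push %rbp; mov %rsp,%rbp"
prologue where the CFA sits two words above the saved frame pointer:

#define ARCH_INIT_USER_FP_FRAME			\
	.cfa_off = (s32)sizeof(long) *  2,	\
	.ra_off  = (s32)sizeof(long) * -1,	\
	.fp_off  = (s32)sizeof(long) * -2,	\
	.use_fp  = true,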
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v9: https://lore.kernel.org/linux-trace-kernel/20250513223551.290698040@goodmis.org/
As an asm-generic header is not included when an architecture defines its
own version of that header, having more than one #ifndef check that sets
default values does not work when those checks live in the asm-generic
header but the architecture header does not define all the values.
- Move the ARCH_INIT_USER_FP_FRAME check to linux/unwind_user.h
- Have linux/unwind_user.h include asm/unwind_user.h so that C files do not
have to include the asm header directly
- Remove unnecessary frame initialization
- Added unwind_user.h to asm-generic/Kbuild
arch/Kconfig | 4 +++
include/asm-generic/Kbuild | 1 +
include/asm-generic/unwind_user.h | 5 ++++
include/linux/unwind_user.h | 5 ++++
include/linux/unwind_user_types.h | 1 +
kernel/unwind/user.c | 49 +++++++++++++++++++++++++++++--
6 files changed, 63 insertions(+), 2 deletions(-)
create mode 100644 include/asm-generic/unwind_user.h
diff --git a/arch/Kconfig b/arch/Kconfig
index ea59e5d7cc69..8e3fd723bd74 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -438,6 +438,10 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
config UNWIND_USER
bool
+config HAVE_UNWIND_USER_FP
+ bool
+ select UNWIND_USER
+
config HAVE_PERF_REGS
bool
help
diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
index 8675b7b4ad23..295c94a3ccc1 100644
--- a/include/asm-generic/Kbuild
+++ b/include/asm-generic/Kbuild
@@ -59,6 +59,7 @@ mandatory-y += tlbflush.h
mandatory-y += topology.h
mandatory-y += trace_clock.h
mandatory-y += uaccess.h
+mandatory-y += unwind_user.h
mandatory-y += vermagic.h
mandatory-y += vga.h
mandatory-y += video.h
diff --git a/include/asm-generic/unwind_user.h b/include/asm-generic/unwind_user.h
new file mode 100644
index 000000000000..b8882b909944
--- /dev/null
+++ b/include/asm-generic/unwind_user.h
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_UNWIND_USER_H
+#define _ASM_GENERIC_UNWIND_USER_H
+
+#endif /* _ASM_GENERIC_UNWIND_USER_H */
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index aa7923c1384f..a405111c41b0 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -3,6 +3,11 @@
#define _LINUX_UNWIND_USER_H
#include <linux/unwind_user_types.h>
+#include <asm/unwind_user.h>
+
+#ifndef ARCH_INIT_USER_FP_FRAME
+ #define ARCH_INIT_USER_FP_FRAME
+#endif
int unwind_user_start(struct unwind_user_state *state);
int unwind_user_next(struct unwind_user_state *state);
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 6ed1b4ae74e1..65bd070eb6b0 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -6,6 +6,7 @@
enum unwind_user_type {
UNWIND_USER_TYPE_NONE,
+ UNWIND_USER_TYPE_FP,
};
struct unwind_stacktrace {
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index d30449328981..4fc550356b33 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -6,10 +6,52 @@
#include <linux/sched.h>
#include <linux/sched/task_stack.h>
#include <linux/unwind_user.h>
+#include <linux/uaccess.h>
+
+static struct unwind_user_frame fp_frame = {
+ ARCH_INIT_USER_FP_FRAME
+};
+
+static inline bool fp_state(struct unwind_user_state *state)
+{
+ return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP) &&
+ state->type == UNWIND_USER_TYPE_FP;
+}
int unwind_user_next(struct unwind_user_state *state)
{
- /* no implementation yet */
+ struct unwind_user_frame *frame;
+ unsigned long cfa = 0, fp, ra = 0;
+
+ if (state->done)
+ return -EINVAL;
+
+ if (fp_state(state))
+ frame = &fp_frame;
+ else
+ goto the_end;
+
+ cfa = (frame->use_fp ? state->fp : state->sp) + frame->cfa_off;
+
+ /* stack going in wrong direction? */
+ if (cfa <= state->sp)
+ goto the_end;
+
+ if (get_user(ra, (unsigned long __user *)(cfa + frame->ra_off)))
+ goto the_end;
+
+ if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
+ goto the_end;
+
+ state->ip = ra;
+ state->sp = cfa;
+ if (frame->fp_off)
+ state->fp = fp;
+
+ return 0;
+
+the_end:
+ state->done = true;
return -EINVAL;
}
@@ -24,7 +66,10 @@ int unwind_user_start(struct unwind_user_state *state)
return -EINVAL;
}
- state->type = UNWIND_USER_TYPE_NONE;
+ if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+ state->type = UNWIND_USER_TYPE_FP;
+ else
+ state->type = UNWIND_USER_TYPE_NONE;
state->ip = instruction_pointer(regs);
state->sp = user_stack_pointer(regs);
--
2.47.2
* [PATCH v10 03/14] unwind_user: Add compat mode frame pointer support
From: Steven Rostedt @ 2025-06-11 0:54 UTC
From: Josh Poimboeuf <jpoimboe@kernel.org>
Add optional support for user space compat mode frame pointer unwinding.
If supported, the arch needs to enable CONFIG_HAVE_UNWIND_USER_COMPAT_FP
and define ARCH_INIT_USER_COMPAT_FP_FRAME.
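For x86, the compat variant later in the series looks roughly like the
64-bit frame with u32-sized stack slots, with in_compat_mode() keying off
the saved register state (exact definitions may differ):

#define ARCH_INIT_USER_COMPAT_FP_FRAME		\
	.cfa_off = (s32)sizeof(u32) *  2,	\
	.ra_off  = (s32)sizeof(u32) * -1,	\
	.fp_off  = (s32)sizeof(u32) * -2,	\
	.use_fp  = true,

#define in_compat_mode(regs) !user_64bit_mode(regs)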
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v9: https://lore.kernel.org/linux-trace-kernel/20250513223551.459986355@goodmis.org/
As an asm-generic header is not included when an architecture defines its
own version of that header, having more than one #ifndef check that sets
default values does not work when those checks live in the asm-generic
header but the architecture header does not define all the values.
- Move the #ifndef arch_unwind_user_state check to linux/unwind_user_types.h
- Move the following to linux/unwind_user.h:
#ifndef ARCH_INIT_USER_COMPAT_FP_FRAME
#ifndef arch_unwind_user_init
#ifndef arch_unwind_user_next
- Changed UNWIND_GET_USER_LONG() to use "unsigned long" instead of u64 as
this can be called on 32 bit architectures and just because
"compat_state()" returns false doesn't mean that the value is 64 bit.
arch/Kconfig | 4 +++
include/asm-generic/Kbuild | 1 +
include/asm-generic/unwind_user_types.h | 5 ++++
include/linux/unwind_user.h | 13 +++++++++
include/linux/unwind_user_types.h | 7 +++++
kernel/unwind/user.c | 36 ++++++++++++++++++++++---
6 files changed, 62 insertions(+), 4 deletions(-)
create mode 100644 include/asm-generic/unwind_user_types.h
diff --git a/arch/Kconfig b/arch/Kconfig
index 8e3fd723bd74..2c41d3072910 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -442,6 +442,10 @@ config HAVE_UNWIND_USER_FP
bool
select UNWIND_USER
+config HAVE_UNWIND_USER_COMPAT_FP
+ bool
+ depends on HAVE_UNWIND_USER_FP
+
config HAVE_PERF_REGS
bool
help
diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
index 295c94a3ccc1..b797a2434396 100644
--- a/include/asm-generic/Kbuild
+++ b/include/asm-generic/Kbuild
@@ -60,6 +60,7 @@ mandatory-y += topology.h
mandatory-y += trace_clock.h
mandatory-y += uaccess.h
mandatory-y += unwind_user.h
+mandatory-y += unwind_user_types.h
mandatory-y += vermagic.h
mandatory-y += vga.h
mandatory-y += video.h
diff --git a/include/asm-generic/unwind_user_types.h b/include/asm-generic/unwind_user_types.h
new file mode 100644
index 000000000000..f568b82e52cd
--- /dev/null
+++ b/include/asm-generic/unwind_user_types.h
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_UNWIND_USER_TYPES_H
+#define _ASM_GENERIC_UNWIND_USER_TYPES_H
+
+#endif /* _ASM_GENERIC_UNWIND_USER_TYPES_H */
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index a405111c41b0..c70da8f7e54c 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -9,6 +9,19 @@
#define ARCH_INIT_USER_FP_FRAME
#endif
+#ifndef ARCH_INIT_USER_COMPAT_FP_FRAME
+ #define ARCH_INIT_USER_COMPAT_FP_FRAME
+ #define in_compat_mode(regs) false
+#endif
+
+#ifndef arch_unwind_user_init
+static inline void arch_unwind_user_init(struct unwind_user_state *state, struct pt_regs *reg) {}
+#endif
+
+#ifndef arch_unwind_user_next
+static inline void arch_unwind_user_next(struct unwind_user_state *state) {}
+#endif
+
int unwind_user_start(struct unwind_user_state *state);
int unwind_user_next(struct unwind_user_state *state);
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 65bd070eb6b0..0b6563951ca4 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -3,10 +3,16 @@
#define _LINUX_UNWIND_USER_TYPES_H
#include <linux/types.h>
+#include <asm/unwind_user_types.h>
+
+#ifndef arch_unwind_user_state
+struct arch_unwind_user_state {};
+#endif
enum unwind_user_type {
UNWIND_USER_TYPE_NONE,
UNWIND_USER_TYPE_FP,
+ UNWIND_USER_TYPE_COMPAT_FP,
};
struct unwind_stacktrace {
@@ -25,6 +31,7 @@ struct unwind_user_state {
unsigned long ip;
unsigned long sp;
unsigned long fp;
+ struct arch_unwind_user_state arch;
enum unwind_user_type type;
bool done;
};
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 4fc550356b33..29e1f497a26e 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -12,12 +12,32 @@ static struct unwind_user_frame fp_frame = {
ARCH_INIT_USER_FP_FRAME
};
+static struct unwind_user_frame compat_fp_frame = {
+ ARCH_INIT_USER_COMPAT_FP_FRAME
+};
+
static inline bool fp_state(struct unwind_user_state *state)
{
return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP) &&
state->type == UNWIND_USER_TYPE_FP;
}
+static inline bool compat_state(struct unwind_user_state *state)
+{
+ return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) &&
+ state->type == UNWIND_USER_TYPE_COMPAT_FP;
+}
+
+#define UNWIND_GET_USER_LONG(to, from, state) \
+({ \
+ int __ret; \
+ if (compat_state(state)) \
+ __ret = get_user(to, (u32 __user *)(from)); \
+ else \
+ __ret = get_user(to, (unsigned long __user *)(from)); \
+ __ret; \
+})
+
int unwind_user_next(struct unwind_user_state *state)
{
struct unwind_user_frame *frame;
@@ -26,7 +46,9 @@ int unwind_user_next(struct unwind_user_state *state)
if (state->done)
return -EINVAL;
- if (fp_state(state))
+ if (compat_state(state))
+ frame = &compat_fp_frame;
+ else if (fp_state(state))
frame = &fp_frame;
else
goto the_end;
@@ -37,10 +59,10 @@ int unwind_user_next(struct unwind_user_state *state)
if (cfa <= state->sp)
goto the_end;
- if (get_user(ra, (unsigned long __user *)(cfa + frame->ra_off)))
+ if (UNWIND_GET_USER_LONG(ra, cfa + frame->ra_off, state))
goto the_end;
- if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
+ if (frame->fp_off && UNWIND_GET_USER_LONG(fp, cfa + frame->fp_off, state))
goto the_end;
state->ip = ra;
@@ -48,6 +70,8 @@ int unwind_user_next(struct unwind_user_state *state)
if (frame->fp_off)
state->fp = fp;
+ arch_unwind_user_next(state);
+
return 0;
the_end:
@@ -66,7 +90,9 @@ int unwind_user_start(struct unwind_user_state *state)
return -EINVAL;
}
- if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+ if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) && in_compat_mode(regs))
+ state->type = UNWIND_USER_TYPE_COMPAT_FP;
+ else if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
state->type = UNWIND_USER_TYPE_FP;
else
state->type = UNWIND_USER_TYPE_NONE;
@@ -75,6 +101,8 @@ int unwind_user_start(struct unwind_user_state *state)
state->sp = user_stack_pointer(regs);
state->fp = frame_pointer(regs);
+ arch_unwind_user_init(state, regs);
+
return 0;
}
--
2.47.2
* [PATCH v10 04/14] unwind_user/deferred: Add unwind_deferred_trace()
From: Steven Rostedt @ 2025-06-11 0:54 UTC
From: Steven Rostedt <rostedt@goodmis.org>
Add a function, unwind_deferred_trace(), that must be called from a
faultable context and retrieves a user space stack trace. It can be called
by a tracer when a task is about to enter user space, or when it has just
come back from user space and still has interrupts enabled.
This code is based on Josh Poimboeuf's deferred unwinding code:
Link: https://lore.kernel.org/all/6052e8487746603bdb29b65f4033e739092d9925.1737511963.git.jpoimboe@kernel.org/
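A hypothetical caller would look something like the sketch below, e.g. a
tracer running its own task work just before the task returns to user space
(record_user_frame() is a made-up consumer):

static void my_tracer_task_work(struct callback_head *head)
{
	struct unwind_stacktrace trace;

	if (unwind_deferred_trace(&trace) < 0)
		return;

	for (unsigned int i = 0; i < trace.nr; i++)
		record_user_frame(trace.entries[i]);
}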
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/sched.h | 5 +++
include/linux/unwind_deferred.h | 24 +++++++++++
include/linux/unwind_deferred_types.h | 9 ++++
kernel/fork.c | 4 ++
kernel/unwind/Makefile | 2 +-
kernel/unwind/deferred.c | 60 +++++++++++++++++++++++++++
6 files changed, 103 insertions(+), 1 deletion(-)
create mode 100644 include/linux/unwind_deferred.h
create mode 100644 include/linux/unwind_deferred_types.h
create mode 100644 kernel/unwind/deferred.c
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4f78a64beb52..59fdf7d9bb1e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -46,6 +46,7 @@
#include <linux/rv.h>
#include <linux/uidgid_types.h>
#include <linux/tracepoint-defs.h>
+#include <linux/unwind_deferred_types.h>
#include <asm/kmap_size.h>
/* task_struct member predeclarations (sorted alphabetically): */
@@ -1654,6 +1655,10 @@ struct task_struct {
struct user_event_mm *user_event_mm;
#endif
+#ifdef CONFIG_UNWIND_USER
+ struct unwind_task_info unwind_info;
+#endif
+
/* CPU-specific state of this task: */
struct thread_struct thread;
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
new file mode 100644
index 000000000000..5064ebe38c4f
--- /dev/null
+++ b/include/linux/unwind_deferred.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_DEFERRED_H
+#define _LINUX_UNWIND_USER_DEFERRED_H
+
+#include <linux/unwind_user.h>
+#include <linux/unwind_deferred_types.h>
+
+#ifdef CONFIG_UNWIND_USER
+
+void unwind_task_init(struct task_struct *task);
+void unwind_task_free(struct task_struct *task);
+
+int unwind_deferred_trace(struct unwind_stacktrace *trace);
+
+#else /* !CONFIG_UNWIND_USER */
+
+static inline void unwind_task_init(struct task_struct *task) {}
+static inline void unwind_task_free(struct task_struct *task) {}
+
+static inline int unwind_deferred_trace(struct unwind_stacktrace *trace) { return -ENOSYS; }
+
+#endif /* !CONFIG_UNWIND_USER */
+
+#endif /* _LINUX_UNWIND_USER_DEFERRED_H */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
new file mode 100644
index 000000000000..aa32db574e43
--- /dev/null
+++ b/include/linux/unwind_deferred_types.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+
+struct unwind_task_info {
+ unsigned long *entries;
+};
+
+#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 1ee8eb11f38b..3341d50c61f2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -105,6 +105,7 @@
#include <uapi/linux/pidfd.h>
#include <linux/pidfs.h>
#include <linux/tick.h>
+#include <linux/unwind_deferred.h>
#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -732,6 +733,7 @@ void __put_task_struct(struct task_struct *tsk)
WARN_ON(refcount_read(&tsk->usage));
WARN_ON(tsk == current);
+ unwind_task_free(tsk);
sched_ext_free(tsk);
io_uring_free(tsk);
cgroup_free(tsk);
@@ -2135,6 +2137,8 @@ __latent_entropy struct task_struct *copy_process(
p->bpf_ctx = NULL;
#endif
+ unwind_task_init(p);
+
/* Perform scheduler related setup. Assign this task to a CPU. */
retval = sched_fork(clone_flags, p);
if (retval)
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
index 349ce3677526..6752ac96d7e2 100644
--- a/kernel/unwind/Makefile
+++ b/kernel/unwind/Makefile
@@ -1 +1 @@
- obj-$(CONFIG_UNWIND_USER) += user.o
+ obj-$(CONFIG_UNWIND_USER) += user.o deferred.o
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
new file mode 100644
index 000000000000..0bafb95e6336
--- /dev/null
+++ b/kernel/unwind/deferred.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Deferred user space unwinding
+ */
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/unwind_deferred.h>
+
+#define UNWIND_MAX_ENTRIES 512
+
+/**
+ * unwind_deferred_trace - Produce a user stacktrace in faultable context
+ * @trace: The descriptor that will store the user stacktrace
+ *
+ * This must be called in a known faultable context (usually when entering
+ * or exiting user space). Depending on the available implementations
+ * the @trace will be loaded with the addresses of the user space stacktrace
+ * if it can be found.
+ *
+ * Return: 0 on success and negative on error
+ * On success @trace will contain the user space stacktrace
+ */
+int unwind_deferred_trace(struct unwind_stacktrace *trace)
+{
+ struct unwind_task_info *info = &current->unwind_info;
+
+ /* Should always be called from faultable context */
+ might_fault();
+
+ if (current->flags & PF_EXITING)
+ return -EINVAL;
+
+ if (!info->entries) {
+ info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
+ GFP_KERNEL);
+ if (!info->entries)
+ return -ENOMEM;
+ }
+
+ trace->nr = 0;
+ trace->entries = info->entries;
+ unwind_user(trace, UNWIND_MAX_ENTRIES);
+
+ return 0;
+}
+
+void unwind_task_init(struct task_struct *task)
+{
+ struct unwind_task_info *info = &task->unwind_info;
+
+ memset(info, 0, sizeof(*info));
+}
+
+void unwind_task_free(struct task_struct *task)
+{
+ struct unwind_task_info *info = &task->unwind_info;
+
+ kfree(info->entries);
+}
--
2.47.2
* [PATCH v10 05/14] unwind_user/deferred: Add unwind cache
From: Steven Rostedt @ 2025-06-11 0:54 UTC
From: Josh Poimboeuf <jpoimboe@kernel.org>
Cache the results of the unwind to ensure the unwind is only performed
once, even when called by multiple tracers.
The cache's nr_entries gets cleared every time the task exits the kernel.
When a stacktrace is requested, nr_entries gets set to the number of
entries in the stacktrace. If another stacktrace is requested and
nr_entries is not zero, the cache already holds the stacktrace that would
be retrieved, so it is not processed again and the cached entries are
handed to the caller.
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/entry-common.h | 2 ++
include/linux/unwind_deferred.h | 8 ++++++++
include/linux/unwind_deferred_types.h | 7 ++++++-
kernel/unwind/deferred.c | 26 ++++++++++++++++++++------
4 files changed, 36 insertions(+), 7 deletions(-)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index f94f3fdf15fc..6e850c9d3f0c 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -12,6 +12,7 @@
#include <linux/resume_user_mode.h>
#include <linux/tick.h>
#include <linux/kmsan.h>
+#include <linux/unwind_deferred.h>
#include <asm/entry-common.h>
#include <asm/syscall.h>
@@ -362,6 +363,7 @@ static __always_inline void exit_to_user_mode(void)
lockdep_hardirqs_on_prepare();
instrumentation_end();
+ unwind_exit_to_user_mode();
user_enter_irqoff();
arch_exit_to_user_mode();
lockdep_hardirqs_on(CALLER_ADDR0);
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 5064ebe38c4f..7d6cb2ffd084 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -12,6 +12,12 @@ void unwind_task_free(struct task_struct *task);
int unwind_deferred_trace(struct unwind_stacktrace *trace);
+static __always_inline void unwind_exit_to_user_mode(void)
+{
+ if (unlikely(current->unwind_info.cache))
+ current->unwind_info.cache->nr_entries = 0;
+}
+
#else /* !CONFIG_UNWIND_USER */
static inline void unwind_task_init(struct task_struct *task) {}
@@ -19,6 +25,8 @@ static inline void unwind_task_free(struct task_struct *task) {}
static inline int unwind_deferred_trace(struct unwind_stacktrace *trace) { return -ENOSYS; }
+static inline void unwind_exit_to_user_mode(void) {}
+
#endif /* !CONFIG_UNWIND_USER */
#endif /* _LINUX_UNWIND_USER_DEFERRED_H */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index aa32db574e43..db5b54b18828 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -2,8 +2,13 @@
#ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+struct unwind_cache {
+ unsigned int nr_entries;
+ unsigned long entries[];
+};
+
struct unwind_task_info {
- unsigned long *entries;
+ struct unwind_cache *cache;
};
#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 0bafb95e6336..e3913781c8c6 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -24,6 +24,7 @@
int unwind_deferred_trace(struct unwind_stacktrace *trace)
{
struct unwind_task_info *info = &current->unwind_info;
+ struct unwind_cache *cache;
/* Should always be called from faultable context */
might_fault();
@@ -31,17 +32,30 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
if (current->flags & PF_EXITING)
return -EINVAL;
- if (!info->entries) {
- info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
- GFP_KERNEL);
- if (!info->entries)
+ if (!info->cache) {
+ info->cache = kzalloc(struct_size(cache, entries, UNWIND_MAX_ENTRIES),
+ GFP_KERNEL);
+ if (!info->cache)
return -ENOMEM;
}
+ cache = info->cache;
+ trace->entries = cache->entries;
+
+ if (cache->nr_entries) {
+ /*
+ * The user stack has already been previously unwound in this
+ * entry context. Skip the unwind and use the cache.
+ */
+ trace->nr = cache->nr_entries;
+ return 0;
+ }
+
trace->nr = 0;
- trace->entries = info->entries;
unwind_user(trace, UNWIND_MAX_ENTRIES);
+ cache->nr_entries = trace->nr;
+
return 0;
}
@@ -56,5 +70,5 @@ void unwind_task_free(struct task_struct *task)
{
struct unwind_task_info *info = &task->unwind_info;
- kfree(info->entries);
+ kfree(info->cache);
}
--
2.47.2
* [PATCH v10 06/14] unwind_user/deferred: Add deferred unwinding interface
From: Steven Rostedt @ 2025-06-11 0:54 UTC
From: Josh Poimboeuf <jpoimboe@kernel.org>
Add an interface for scheduling task work to unwind the user space stack
before returning to user space. This solves several problems for its
callers:
- Ensure the unwind happens in task context even if the caller may be
running in interrupt context.
- Avoid duplicate unwinds, whether called multiple times by the same
caller or by different callers.
- Take a timestamp when the first request comes in since the task
entered the kernel. This will be returned to the calling function
along with the stack trace when the task leaves the kernel. This
timestamp can be used to correlate kernel unwinds/traces with the user
unwind.
The timestamp makes it possible to detect that two requests refer to the
same stacktrace. It is generated the first time a user space stacktrace is
requested after the task enters the kernel.
The timestamp is passed to the caller on request, and when the stacktrace is
generated upon returning to user space, the requester's callback is called
with the timestamp as well as the stacktrace.
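A hypothetical tracer would use the interface along these lines
(emit_user_stack() and the event hook are made up for illustration):

static struct unwind_work my_work;

/* Runs in task context just before the task returns to user space. */
static void my_unwind_cb(struct unwind_work *work,
			 struct unwind_stacktrace *trace, u64 timestamp)
{
	emit_user_stack(trace, timestamp);
}

static int my_tracer_init(void)
{
	return unwind_deferred_init(&my_work, my_unwind_cb);
}

/* May run in interrupt context. */
static void my_event_handler(void)
{
	u64 timestamp;

	/*
	 * Tag the kernel event with @timestamp; the matching user
	 * stacktrace arrives later via my_unwind_cb().
	 */
	unwind_deferred_request(&my_work, &timestamp);
}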
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/unwind_deferred.h | 18 ++++
include/linux/unwind_deferred_types.h | 3 +
kernel/unwind/deferred.c | 131 +++++++++++++++++++++++++-
3 files changed, 151 insertions(+), 1 deletion(-)
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 7d6cb2ffd084..a384eef719a3 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -2,9 +2,19 @@
#ifndef _LINUX_UNWIND_USER_DEFERRED_H
#define _LINUX_UNWIND_USER_DEFERRED_H
+#include <linux/task_work.h>
#include <linux/unwind_user.h>
#include <linux/unwind_deferred_types.h>
+struct unwind_work;
+
+typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stacktrace *trace, u64 timestamp);
+
+struct unwind_work {
+ struct list_head list;
+ unwind_callback_t func;
+};
+
#ifdef CONFIG_UNWIND_USER
void unwind_task_init(struct task_struct *task);
@@ -12,10 +22,15 @@ void unwind_task_free(struct task_struct *task);
int unwind_deferred_trace(struct unwind_stacktrace *trace);
+int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
+int unwind_deferred_request(struct unwind_work *work, u64 *timestamp);
+void unwind_deferred_cancel(struct unwind_work *work);
+
static __always_inline void unwind_exit_to_user_mode(void)
{
if (unlikely(current->unwind_info.cache))
current->unwind_info.cache->nr_entries = 0;
+ current->unwind_info.timestamp = 0;
}
#else /* !CONFIG_UNWIND_USER */
@@ -24,6 +39,9 @@ static inline void unwind_task_init(struct task_struct *task) {}
static inline void unwind_task_free(struct task_struct *task) {}
static inline int unwind_deferred_trace(struct unwind_stacktrace *trace) { return -ENOSYS; }
+static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func) { return -ENOSYS; }
+static inline int unwind_deferred_request(struct unwind_work *work, u64 *timestamp) { return -ENOSYS; }
+static inline void unwind_deferred_cancel(struct unwind_work *work) {}
static inline void unwind_exit_to_user_mode(void) {}
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index db5b54b18828..5df264cf81ad 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -9,6 +9,9 @@ struct unwind_cache {
struct unwind_task_info {
struct unwind_cache *cache;
+ struct callback_head work;
+ u64 timestamp;
+ int pending;
};
#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index e3913781c8c6..b76c704ddc6d 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -2,13 +2,35 @@
/*
* Deferred user space unwinding
*/
+#include <linux/sched/task_stack.h>
+#include <linux/unwind_deferred.h>
+#include <linux/sched/clock.h>
+#include <linux/task_work.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/slab.h>
-#include <linux/unwind_deferred.h>
+#include <linux/mm.h>
#define UNWIND_MAX_ENTRIES 512
+/* Guards adding to and reading the list of callbacks */
+static DEFINE_MUTEX(callback_mutex);
+static LIST_HEAD(callbacks);
+
+/*
+ * Read the task context timestamp, if this is the first caller then
+ * it will set the timestamp.
+ */
+static u64 get_timestamp(struct unwind_task_info *info)
+{
+ lockdep_assert_irqs_disabled();
+
+ if (!info->timestamp)
+ info->timestamp = local_clock();
+
+ return info->timestamp;
+}
+
/**
* unwind_deferred_trace - Produce a user stacktrace in faultable context
* @trace: The descriptor that will store the user stacktrace
@@ -59,11 +81,117 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
return 0;
}
+static void unwind_deferred_task_work(struct callback_head *head)
+{
+ struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
+ struct unwind_stacktrace trace;
+ struct unwind_work *work;
+ u64 timestamp;
+
+ if (WARN_ON_ONCE(!info->pending))
+ return;
+
+ /* Allow work to come in again */
+ WRITE_ONCE(info->pending, 0);
+
+ /*
+ * From here on out, the callback must always be called, even if it's
+ * just an empty trace.
+ */
+ trace.nr = 0;
+ trace.entries = NULL;
+
+ unwind_deferred_trace(&trace);
+
+ timestamp = info->timestamp;
+
+ guard(mutex)(&callback_mutex);
+ list_for_each_entry(work, &callbacks, list) {
+ work->func(work, &trace, timestamp);
+ }
+}
+
+/**
+ * unwind_deferred_request - Request a user stacktrace on task exit
+ * @work: Unwind descriptor requesting the trace
+ * @timestamp: The time stamp of the first request made for this task
+ *
+ * Schedule a user space unwind to be done in task work before exiting the
+ * kernel.
+ *
+ * The returned @timestamp output is the timestamp of the very first request
+ * for a user space stacktrace for this task since it entered the kernel.
+ * It can be from a request by any caller of this infrastructure.
+ * Its value will also be passed to the callback function. It can be
+ * used to stitch kernel and user stack traces together in post-processing.
+ *
+ * It's valid to call this function multiple times for the same @work within
+ * the same task entry context. Each call will return the same timestamp
+ * while the task hasn't left the kernel. If the callback is not pending because
+ * it has already been previously called for the same entry context, it will be
+ * called again with the same stack trace and timestamp.
+ *
+ * Return: 1 if the callback was already queued.
+ * 0 if the callback successfully was queued.
+ * Negative if there's an error.
+ * @timestamp holds the timestamp of the first request by any user
+ */
+int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
+{
+ struct unwind_task_info *info = &current->unwind_info;
+ int ret;
+
+ *timestamp = 0;
+
+ if (WARN_ON_ONCE(in_nmi()))
+ return -EINVAL;
+
+ if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
+ !user_mode(task_pt_regs(current)))
+ return -EINVAL;
+
+ guard(irqsave)();
+
+ *timestamp = get_timestamp(info);
+
+ /* callback already pending? */
+ if (info->pending)
+ return 1;
+
+ /* The work has been claimed, now schedule it. */
+ ret = task_work_add(current, &info->work, TWA_RESUME);
+ if (WARN_ON_ONCE(ret))
+ return ret;
+
+ info->pending = 1;
+ return 0;
+}
+
+void unwind_deferred_cancel(struct unwind_work *work)
+{
+ if (!work)
+ return;
+
+ guard(mutex)(&callback_mutex);
+ list_del(&work->list);
+}
+
+int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
+{
+ memset(work, 0, sizeof(*work));
+
+ guard(mutex)(&callback_mutex);
+ list_add(&work->list, &callbacks);
+ work->func = func;
+ return 0;
+}
+
void unwind_task_init(struct task_struct *task)
{
struct unwind_task_info *info = &task->unwind_info;
memset(info, 0, sizeof(*info));
+ init_task_work(&info->work, unwind_deferred_task_work);
}
void unwind_task_free(struct task_struct *task)
@@ -71,4 +199,5 @@ void unwind_task_free(struct task_struct *task)
struct unwind_task_info *info = &task->unwind_info;
kfree(info->cache);
+ task_work_cancel(task, &info->work);
}
--
2.47.2
* [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
From: Steven Rostedt @ 2025-06-11 0:54 UTC
From: Josh Poimboeuf <jpoimboe@kernel.org>
Make unwind_deferred_request() NMI-safe so tracers in NMI context can
call it and safely request a user space stacktrace when the task exits.
A "nmi_timestamp" is added to the unwind_task_info that gets updated by
NMIs to not race with setting the info->timestamp.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v9: https://lore.kernel.org/linux-trace-kernel/20250513223552.636076711@goodmis.org/
- Check for ret < 0 instead of just ret != 0 from return code of
task_work_add(). Don't want to just assume it's less than zero as it
needs to return a negative on error.
include/linux/unwind_deferred_types.h | 1 +
kernel/unwind/deferred.c | 91 ++++++++++++++++++++++++---
2 files changed, 84 insertions(+), 8 deletions(-)
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 5df264cf81ad..ae27a02234b8 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -11,6 +11,7 @@ struct unwind_task_info {
struct unwind_cache *cache;
struct callback_head work;
u64 timestamp;
+ u64 nmi_timestamp;
int pending;
};
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index b76c704ddc6d..88c867c32c01 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -25,8 +25,27 @@ static u64 get_timestamp(struct unwind_task_info *info)
{
lockdep_assert_irqs_disabled();
- if (!info->timestamp)
- info->timestamp = local_clock();
+ /*
+ * Note, the timestamp is generated on the first request.
+ * If it exists here, then the timestamp is earlier than
+ * this request and it means that this request will be
+ * valid for the stacktrace.
+ */
+ if (!info->timestamp) {
+ WRITE_ONCE(info->timestamp, local_clock());
+ barrier();
+ /*
+ * If an NMI came in and set a timestamp, it means that
+ * it happened before this timestamp was set (otherwise
+ * the NMI would have used this one). Use the NMI timestamp
+ * instead.
+ */
+ if (unlikely(info->nmi_timestamp)) {
+ WRITE_ONCE(info->timestamp, info->nmi_timestamp);
+ barrier();
+ WRITE_ONCE(info->nmi_timestamp, 0);
+ }
+ }
return info->timestamp;
}
@@ -103,6 +122,13 @@ static void unwind_deferred_task_work(struct callback_head *head)
unwind_deferred_trace(&trace);
+ /* Check if the timestamp was only set by NMI */
+ if (info->nmi_timestamp) {
+ WRITE_ONCE(info->timestamp, info->nmi_timestamp);
+ barrier();
+ WRITE_ONCE(info->nmi_timestamp, 0);
+ }
+
timestamp = info->timestamp;
guard(mutex)(&callback_mutex);
@@ -111,6 +137,48 @@ static void unwind_deferred_task_work(struct callback_head *head)
}
}
+static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
+{
+ struct unwind_task_info *info = &current->unwind_info;
+ bool inited_timestamp = false;
+ int ret;
+
+ /* Always use the nmi_timestamp first */
+ *timestamp = info->nmi_timestamp ? : info->timestamp;
+
+ if (!*timestamp) {
+ /*
+ * This is the first unwind request since the most recent entry
+ * from user space. Initialize the task timestamp.
+ *
+ * Don't write to info->timestamp directly, otherwise it may race
+ * with an interruption of get_timestamp().
+ */
+ info->nmi_timestamp = local_clock();
+ *timestamp = info->nmi_timestamp;
+ inited_timestamp = true;
+ }
+
+ if (info->pending)
+ return 1;
+
+ ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
+ if (ret < 0) {
+ /*
+ * If this set nmi_timestamp and is not using it,
+ * there's no guarantee that it will be used.
+ * Set it back to zero.
+ */
+ if (inited_timestamp)
+ info->nmi_timestamp = 0;
+ return ret;
+ }
+
+ info->pending = 1;
+
+ return 0;
+}
+
/**
* unwind_deferred_request - Request a user stacktrace on task exit
* @work: Unwind descriptor requesting the trace
@@ -139,31 +207,38 @@ static void unwind_deferred_task_work(struct callback_head *head)
int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
{
struct unwind_task_info *info = &current->unwind_info;
+ int pending;
int ret;
*timestamp = 0;
- if (WARN_ON_ONCE(in_nmi()))
- return -EINVAL;
-
if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
!user_mode(task_pt_regs(current)))
return -EINVAL;
+ if (in_nmi())
+ return unwind_deferred_request_nmi(work, timestamp);
+
guard(irqsave)();
*timestamp = get_timestamp(info);
/* callback already pending? */
- if (info->pending)
+ pending = READ_ONCE(info->pending);
+ if (pending)
+ return 1;
+
+ /* Claim the work unless an NMI just now swooped in to do so. */
+ if (!try_cmpxchg(&info->pending, &pending, 1))
return 1;
/* The work has been claimed, now schedule it. */
ret = task_work_add(current, &info->work, TWA_RESUME);
- if (WARN_ON_ONCE(ret))
+ if (WARN_ON_ONCE(ret)) {
+ WRITE_ONCE(info->pending, 0);
return ret;
+ }
- info->pending = 1;
return 0;
}
--
2.47.2
^ permalink raw reply related [flat|nested] 79+ messages in thread
* [PATCH v10 08/14] unwind deferred: Use bitmask to determine which callbacks to call
2025-06-11 0:54 [PATCH v10 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (6 preceding siblings ...)
2025-06-11 0:54 ` [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
@ 2025-06-11 0:54 ` Steven Rostedt
2025-06-20 8:15 ` Peter Zijlstra
2025-06-11 0:54 ` [PATCH v10 09/14] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
` (6 subsequent siblings)
14 siblings, 1 reply; 79+ messages in thread
From: Steven Rostedt @ 2025-06-11 0:54 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton
From: Steven Rostedt <rostedt@goodmis.org>
In order to know which registered callback requested a stacktrace when
the task goes back to user space, add a bitmask to keep track of all
registered tracers. The bitmask is the size of a long, which means that
a 32-bit machine can have at most 32 registered tracers and a 64-bit
machine at most 64. This should not be an issue, as there should not be
more than 10 (unless BPF can abuse this?).
When a tracer registers with unwind_deferred_init() it will get a bit
number assigned to it. When a tracer requests a stacktrace, it will have
its bit set within the task_struct. When the task returns back to user
space, it will call the callbacks for all the registered tracers where
their bits are set in the task's mask.
When a tracer is removed via unwind_deferred_cancel(), the associated
bit is cleared in all current tasks, just in case another tracer gets
registered immediately afterward and would otherwise get its callback
called unexpectedly.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v9: https://lore.kernel.org/linux-trace-kernel/20250513223552.804390728@goodmis.org/
- Use BIT() macro for bit setting and testing.
- Moved the "unwind_mask" from the task_struct into the task->unwind_info
structure.
include/linux/unwind_deferred.h | 1 +
include/linux/unwind_deferred_types.h | 1 +
kernel/unwind/deferred.c | 45 ++++++++++++++++++++++-----
3 files changed, 40 insertions(+), 7 deletions(-)
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index a384eef719a3..1789c3624723 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -13,6 +13,7 @@ typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stackt
struct unwind_work {
struct list_head list;
unwind_callback_t func;
+ int bit;
};
#ifdef CONFIG_UNWIND_USER
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index ae27a02234b8..780b00c07208 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -10,6 +10,7 @@ struct unwind_cache {
struct unwind_task_info {
struct unwind_cache *cache;
struct callback_head work;
+ unsigned long unwind_mask;
u64 timestamp;
u64 nmi_timestamp;
int pending;
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 88c867c32c01..268afae31ba4 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -16,6 +16,7 @@
/* Guards adding to and reading the list of callbacks */
static DEFINE_MUTEX(callback_mutex);
static LIST_HEAD(callbacks);
+static unsigned long unwind_mask;
/*
* Read the task context timestamp, if this is the first caller then
@@ -133,7 +134,10 @@ static void unwind_deferred_task_work(struct callback_head *head)
guard(mutex)(&callback_mutex);
list_for_each_entry(work, &callbacks, list) {
- work->func(work, &trace, timestamp);
+ if (info->unwind_mask & BIT(work->bit)) {
+ work->func(work, &trace, timestamp);
+ clear_bit(work->bit, &info->unwind_mask);
+ }
}
}
@@ -159,9 +163,12 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
inited_timestamp = true;
}
- if (info->pending)
+ if (info->unwind_mask & BIT(work->bit))
return 1;
+ if (info->pending)
+ goto out;
+
ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
if (ret < 0) {
/*
@@ -175,8 +182,8 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
}
info->pending = 1;
-
- return 0;
+out:
+ return test_and_set_bit(work->bit, &info->unwind_mask);
}
/**
@@ -223,14 +230,18 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
*timestamp = get_timestamp(info);
+ /* This is already queued */
+ if (info->unwind_mask & BIT(work->bit))
+ return 1;
+
/* callback already pending? */
pending = READ_ONCE(info->pending);
if (pending)
- return 1;
+ goto out;
/* Claim the work unless an NMI just now swooped in to do so. */
if (!try_cmpxchg(&info->pending, &pending, 1))
- return 1;
+ goto out;
/* The work has been claimed, now schedule it. */
ret = task_work_add(current, &info->work, TWA_RESUME);
@@ -239,16 +250,27 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
return ret;
}
- return 0;
+ out:
+ return test_and_set_bit(work->bit, &info->unwind_mask);
}
void unwind_deferred_cancel(struct unwind_work *work)
{
+ struct task_struct *g, *t;
+
if (!work)
return;
guard(mutex)(&callback_mutex);
list_del(&work->list);
+
+ clear_bit(work->bit, &unwind_mask);
+
+ guard(rcu)();
+ /* Clear this bit from all threads */
+ for_each_process_thread(g, t) {
+ clear_bit(work->bit, &t->unwind_info.unwind_mask);
+ }
}
int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
@@ -256,6 +278,14 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
memset(work, 0, sizeof(*work));
guard(mutex)(&callback_mutex);
+
+ /* See if there's a bit in the mask available */
+ if (unwind_mask == ~0UL)
+ return -EBUSY;
+
+ work->bit = ffz(unwind_mask);
+ unwind_mask |= BIT(work->bit);
+
list_add(&work->list, &callbacks);
work->func = func;
return 0;
@@ -267,6 +297,7 @@ void unwind_task_init(struct task_struct *task)
memset(info, 0, sizeof(*info));
init_task_work(&info->work, unwind_deferred_task_work);
+ info->unwind_mask = 0;
}
void unwind_task_free(struct task_struct *task)
--
2.47.2
^ permalink raw reply related [flat|nested] 79+ messages in thread
* [PATCH v10 09/14] unwind deferred: Use SRCU unwind_deferred_task_work()
2025-06-11 0:54 [PATCH v10 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (7 preceding siblings ...)
2025-06-11 0:54 ` [PATCH v10 08/14] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
@ 2025-06-11 0:54 ` Steven Rostedt
2025-06-11 0:54 ` [PATCH v10 10/14] unwind: Clear unwind_mask on exit back to user space Steven Rostedt
` (5 subsequent siblings)
14 siblings, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-11 0:54 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton
From: Steven Rostedt <rostedt@goodmis.org>
Instead of using the callback_mutex to protect the linked list of
callbacks in unwind_deferred_task_work(), use SRCU. This function gets
called every time a task that has a pending stack trace request returns
to user space, which can happen for many tasks on several CPUs at the
same time. A mutex there is a bottleneck and can cause contention that
slows down performance.
As the callbacks themselves are allowed to sleep, regular RCU cannot be
used to protect the list. Instead use SRCU, as that still allows the
callbacks to sleep and the list can be read without needing to hold the
callback_mutex.
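The resulting pattern is roughly the following (a sketch of the locking
only; the real change, including the per-work bit checks, is in the diff
below):

	/* Read side -- the callbacks are allowed to sleep: */
	idx = srcu_read_lock(&unwind_srcu);
	list_for_each_entry_srcu(work, &callbacks, list,
				 srcu_read_lock_held(&unwind_srcu))
		work->func(work, &trace, timestamp);
	srcu_read_unlock(&unwind_srcu, idx);

	/* Update side -- still serialized by callback_mutex: */
	list_del_rcu(&work->list);
	synchronize_srcu(&unwind_srcu);	/* wait for all readers to finish */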
Link: https://lore.kernel.org/all/ca9bd83a-6c80-4ee0-a83c-224b9d60b755@efficios.com/
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/unwind/deferred.c | 33 +++++++++++++++++++++++++--------
1 file changed, 25 insertions(+), 8 deletions(-)
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 268afae31ba4..e44538a2be6c 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -13,10 +13,11 @@
#define UNWIND_MAX_ENTRIES 512
-/* Guards adding to and reading the list of callbacks */
+/* Guards adding to or removing from the list of callbacks */
static DEFINE_MUTEX(callback_mutex);
static LIST_HEAD(callbacks);
static unsigned long unwind_mask;
+DEFINE_STATIC_SRCU(unwind_srcu);
/*
* Read the task context timestamp, if this is the first caller then
@@ -107,6 +108,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
struct unwind_stacktrace trace;
struct unwind_work *work;
u64 timestamp;
+ int idx;
if (WARN_ON_ONCE(!info->pending))
return;
@@ -132,13 +134,15 @@ static void unwind_deferred_task_work(struct callback_head *head)
timestamp = info->timestamp;
- guard(mutex)(&callback_mutex);
- list_for_each_entry(work, &callbacks, list) {
+ idx = srcu_read_lock(&unwind_srcu);
+ list_for_each_entry_srcu(work, &callbacks, list,
+ srcu_read_lock_held(&unwind_srcu)) {
if (info->unwind_mask & BIT(work->bit)) {
work->func(work, &trace, timestamp);
clear_bit(work->bit, &info->unwind_mask);
}
}
+ srcu_read_unlock(&unwind_srcu, idx);
}
static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
@@ -215,6 +219,7 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
{
struct unwind_task_info *info = &current->unwind_info;
int pending;
+ int bit;
int ret;
*timestamp = 0;
@@ -226,12 +231,17 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
if (in_nmi())
return unwind_deferred_request_nmi(work, timestamp);
+ /* Do not allow cancelled works to request again */
+ bit = READ_ONCE(work->bit);
+ if (WARN_ON_ONCE(bit < 0))
+ return -EINVAL;
+
guard(irqsave)();
*timestamp = get_timestamp(info);
/* This is already queued */
- if (info->unwind_mask & BIT(work->bit))
+ if (info->unwind_mask & BIT(bit))
return 1;
/* callback already pending? */
@@ -257,19 +267,26 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
void unwind_deferred_cancel(struct unwind_work *work)
{
struct task_struct *g, *t;
+ int bit;
if (!work)
return;
guard(mutex)(&callback_mutex);
- list_del(&work->list);
+ list_del_rcu(&work->list);
+ bit = work->bit;
+
+ /* Do not allow any more requests and prevent callbacks */
+ work->bit = -1;
+
+ clear_bit(bit, &unwind_mask);
- clear_bit(work->bit, &unwind_mask);
+ synchronize_srcu(&unwind_srcu);
guard(rcu)();
/* Clear this bit from all threads */
for_each_process_thread(g, t) {
- clear_bit(work->bit, &t->unwind_info.unwind_mask);
+ clear_bit(bit, &t->unwind_info.unwind_mask);
}
}
@@ -286,7 +303,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
work->bit = ffz(unwind_mask);
unwind_mask |= BIT(work->bit);
- list_add(&work->list, &callbacks);
+ list_add_rcu(&work->list, &callbacks);
work->func = func;
return 0;
}
--
2.47.2
^ permalink raw reply related [flat|nested] 79+ messages in thread
* [PATCH v10 10/14] unwind: Clear unwind_mask on exit back to user space
2025-06-11 0:54 [PATCH v10 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (8 preceding siblings ...)
2025-06-11 0:54 ` [PATCH v10 09/14] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
@ 2025-06-11 0:54 ` Steven Rostedt
2025-06-11 0:54 ` [PATCH v10 11/14] unwind: Finish up unwind when a task exits Steven Rostedt
` (4 subsequent siblings)
14 siblings, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-11 0:54 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton
From: Steven Rostedt <rostedt@goodmis.org>
When testing the deferred unwinder by attaching deferred user space
stacktraces to events, a livelock happened. This occurred when the
deferred unwinding was attached to the irqs_disabled event, which
triggers after the task_work callbacks are called and before the task
goes back to user space.
The event callback would be registered while irqs were disabled; the
task_work would trigger, call the callback for this work, and clear the
work's bit. Then, before getting back to user space, irqs would be
disabled again, the event would trigger again, and a new task_work would
be registered. This caused an infinite loop and the system hung.
To prevent this, clear the bits at the very last moment before going back
to user space and when instrumentation is disabled. That is in
unwind_exit_to_user_mode().
Move the pending bit from a value on the task_struct to the most
significant bit of the unwind_mask (saves space on the task_struct). This
will allow modifying the pending bit along with the work bits atomically.
Instead of clearing a work's bit after its callback is called, it is
delayed until exit. If the work is requested again, the task_work is not
queued again and the work will be notified that its callback has already
been called (via the UNWIND_ALREADY_EXECUTED return value).
The pending bit is cleared before calling the callback functions but the
current work bits remain. If one of the called works requests again, it
will not trigger a task_work if its bit is still present in the task's
unwind_mask.
If a different work makes a new request, it will set both the pending bit
and its own bit, but clear the other work bits so that their callbacks do
not get called again.
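From a requester's point of view, the return values are then handled
roughly like this (an illustrative sketch, "my_work" being a made-up
tracer descriptor):

	ret = unwind_deferred_request(&my_work, &timestamp);
	switch (ret) {
	case 0:				/* this request claimed the task_work */
	case UNWIND_ALREADY_PENDING:	/* another request already queued it */
		/* expect the callback with this timestamp before exit */
		break;
	case UNWIND_ALREADY_EXECUTED:
		/* callback already ran in this entry context; no new one */
		break;
	default:
		/* ret < 0: error, no callback is coming */
		break;
	}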
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v9: https://lore.kernel.org/all/20250513223553.143567998@goodmis.org/
- Fix compare with ~UNWIND_PENDING_BIT to be ~UNWIND_PENDING
- Use BIT() macro for bit setting and testing
include/linux/unwind_deferred.h | 28 +++++++-
include/linux/unwind_deferred_types.h | 1 -
kernel/unwind/deferred.c | 96 +++++++++++++++++++--------
3 files changed, 93 insertions(+), 32 deletions(-)
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 1789c3624723..426e21457606 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -18,6 +18,14 @@ struct unwind_work {
#ifdef CONFIG_UNWIND_USER
+#define UNWIND_PENDING_BIT (BITS_PER_LONG - 1)
+#define UNWIND_PENDING BIT(UNWIND_PENDING_BIT)
+
+enum {
+ UNWIND_ALREADY_PENDING = 1,
+ UNWIND_ALREADY_EXECUTED = 2,
+};
+
void unwind_task_init(struct task_struct *task);
void unwind_task_free(struct task_struct *task);
@@ -29,9 +37,23 @@ void unwind_deferred_cancel(struct unwind_work *work);
static __always_inline void unwind_exit_to_user_mode(void)
{
- if (unlikely(current->unwind_info.cache))
- current->unwind_info.cache->nr_entries = 0;
- current->unwind_info.timestamp = 0;
+ struct unwind_task_info *info = &current->unwind_info;
+ unsigned long bits;
+
+ /* Was there any unwinding? */
+ if (likely(!info->unwind_mask))
+ return;
+
+ bits = info->unwind_mask;
+ do {
+ /* Is a task_work going to run again before going back to user space? */
+ if (bits & UNWIND_PENDING)
+ return;
+ } while (!try_cmpxchg(&info->unwind_mask, &bits, 0UL));
+
+ if (likely(info->cache))
+ info->cache->nr_entries = 0;
+ info->timestamp = 0;
}
#else /* !CONFIG_UNWIND_USER */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 780b00c07208..f384e7f45783 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -13,7 +13,6 @@ struct unwind_task_info {
unsigned long unwind_mask;
u64 timestamp;
u64 nmi_timestamp;
- int pending;
};
#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index e44538a2be6c..8a6caaae04d3 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -19,6 +19,11 @@ static LIST_HEAD(callbacks);
static unsigned long unwind_mask;
DEFINE_STATIC_SRCU(unwind_srcu);
+static inline bool unwind_pending(struct unwind_task_info *info)
+{
+ return test_bit(UNWIND_PENDING_BIT, &info->unwind_mask);
+}
+
/*
* Read the task context timestamp, if this is the first caller then
* it will set the timestamp.
@@ -107,14 +112,17 @@ static void unwind_deferred_task_work(struct callback_head *head)
struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
struct unwind_stacktrace trace;
struct unwind_work *work;
+ unsigned long bits;
u64 timestamp;
int idx;
- if (WARN_ON_ONCE(!info->pending))
+ if (WARN_ON_ONCE(!unwind_pending(info)))
return;
- /* Allow work to come in again */
- WRITE_ONCE(info->pending, 0);
+ /* Clear pending bit but make sure to have the current bits */
+ bits = READ_ONCE(info->unwind_mask);
+ while (!try_cmpxchg(&info->unwind_mask, &bits, bits & ~UNWIND_PENDING))
+ ;
/*
* From here on out, the callback must always be called, even if it's
@@ -137,10 +145,8 @@ static void unwind_deferred_task_work(struct callback_head *head)
idx = srcu_read_lock(&unwind_srcu);
list_for_each_entry_srcu(work, &callbacks, list,
srcu_read_lock_held(&unwind_srcu)) {
- if (info->unwind_mask & BIT(work->bit)) {
+ if (bits & BIT(work->bit))
work->func(work, &trace, timestamp);
- clear_bit(work->bit, &info->unwind_mask);
- }
}
srcu_read_unlock(&unwind_srcu, idx);
}
@@ -167,10 +173,13 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
inited_timestamp = true;
}
- if (info->unwind_mask & BIT(work->bit))
- return 1;
+ /* Is this already queued */
+ if (info->unwind_mask & BIT(work->bit)) {
+ return unwind_pending(info) ? UNWIND_ALREADY_PENDING :
+ UNWIND_ALREADY_EXECUTED;
+ }
- if (info->pending)
+ if (unwind_pending(info))
goto out;
ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
@@ -185,9 +194,17 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
return ret;
}
- info->pending = 1;
+ /*
+ * This is the first to set the PENDING_BIT; clear all other bits,
+ * as any work whose bit was already set has had its callback
+ * called, and those callbacks should not be called again because
+ * of this new request. If those works request another callback,
+ * they will get a new one.
+ */
+ info->unwind_mask = UNWIND_PENDING;
out:
- return test_and_set_bit(work->bit, &info->unwind_mask);
+ return test_and_set_bit(work->bit, &info->unwind_mask) ?
+ UNWIND_ALREADY_PENDING : 0;
}
/**
@@ -210,15 +227,17 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
* it has already been previously called for the same entry context, it will be
* called again with the same stack trace and timestamp.
*
- * Return: 1 if the the callback was already queued.
- * 0 if the callback successfully was queued.
+ * Return: 0 if the callback was successfully queued.
+ * UNWIND_ALREADY_PENDING if the callback was already queued.
+ * UNWIND_ALREADY_EXECUTED if the callback was already called
+ * (and will not be called again)
* Negative if there's an error.
* @timestamp holds the timestamp of the first request by any user
*/
int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
{
struct unwind_task_info *info = &current->unwind_info;
- int pending;
+ unsigned long old, bits;
int bit;
int ret;
@@ -240,28 +259,49 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
*timestamp = get_timestamp(info);
- /* This is already queued */
- if (info->unwind_mask & BIT(bit))
- return 1;
+ old = READ_ONCE(info->unwind_mask);
+
+ /* Is this already queued */
+ if (old & BIT(bit)) {
+ /*
+ * If pending is not set, it means this work's callback
+ * was already called.
+ */
+ return old & UNWIND_PENDING ? UNWIND_ALREADY_PENDING :
+ UNWIND_ALREADY_EXECUTED;
+ }
- /* callback already pending? */
- pending = READ_ONCE(info->pending);
- if (pending)
+ if (unwind_pending(info))
goto out;
- /* Claim the work unless an NMI just now swooped in to do so. */
- if (!try_cmpxchg(&info->pending, &pending, 1))
+ /*
+ * This is the first request to enable a task_work for this task
+ * since the task entered the kernel, or since the callbacks were
+ * last called. Set only the bit for this work and clear all others,
+ * as they have already had their callbacks called and do not need
+ * to be called again because of this work.
+ */
+ bits = UNWIND_PENDING | BIT(bit);
+
+ /*
+ * If the cmpxchg() fails, it means that an NMI came in and set
+ * the pending bit as well as cleared the other bits. Just
+ * jump to setting the bit for this work.
+ */
+ if (!try_cmpxchg(&info->unwind_mask, &old, bits))
goto out;
/* The work has been claimed, now schedule it. */
ret = task_work_add(current, &info->work, TWA_RESUME);
- if (WARN_ON_ONCE(ret)) {
- WRITE_ONCE(info->pending, 0);
- return ret;
- }
+
+ if (WARN_ON_ONCE(ret))
+ WRITE_ONCE(info->unwind_mask, 0);
+
+ return ret;
out:
- return test_and_set_bit(work->bit, &info->unwind_mask);
+ return test_and_set_bit(work->bit, &info->unwind_mask) ?
+ UNWIND_ALREADY_PENDING : 0;
}
void unwind_deferred_cancel(struct unwind_work *work)
@@ -297,7 +337,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
guard(mutex)(&callback_mutex);
/* See if there's a bit in the mask available */
- if (unwind_mask == ~0UL)
+ if (unwind_mask == ~(UNWIND_PENDING))
return -EBUSY;
work->bit = ffz(unwind_mask);
--
2.47.2
^ permalink raw reply related [flat|nested] 79+ messages in thread
* [PATCH v10 11/14] unwind: Finish up unwind when a task exits
2025-06-11 0:54 [PATCH v10 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (9 preceding siblings ...)
2025-06-11 0:54 ` [PATCH v10 10/14] unwind: Clear unwind_mask on exit back to user space Steven Rostedt
@ 2025-06-11 0:54 ` Steven Rostedt
2025-06-11 0:54 ` [PATCH v10 12/14] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
` (3 subsequent siblings)
14 siblings, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-11 0:54 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton
From: Steven Rostedt <rostedt@goodmis.org>
In do_exit(), when a task is exiting, if an unwind was requested and the
user stacktrace was deferred via task_work, the task_work callback is
called after exit_mm(). By then the mm is gone, so the user stack trace
cannot be retrieved and an empty stack is recorded.
Instead, add a function unwind_deferred_task_exit() and call it just
before exit_mm() so that the unwinder can still call the requested
callbacks with the user space stack.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/unwind_deferred.h | 4 +++-
kernel/exit.c | 2 ++
kernel/unwind/deferred.c | 23 ++++++++++++++++++++---
3 files changed, 25 insertions(+), 4 deletions(-)
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 426e21457606..bf0cc0477b2e 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -35,6 +35,8 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
int unwind_deferred_request(struct unwind_work *work, u64 *timestamp);
void unwind_deferred_cancel(struct unwind_work *work);
+void unwind_deferred_task_exit(struct task_struct *task);
+
static __always_inline void unwind_exit_to_user_mode(void)
{
struct unwind_task_info *info = &current->unwind_info;
@@ -65,7 +67,7 @@ static inline int unwind_deferred_trace(struct unwind_stacktrace *trace) { retur
static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func) { return -ENOSYS; }
static inline int unwind_deferred_request(struct unwind_work *work, u64 *timestamp) { return -ENOSYS; }
static inline void unwind_deferred_cancel(struct unwind_work *work) {}
-
+static inline void unwind_deferred_task_exit(struct task_struct *task) {}
static inline void unwind_exit_to_user_mode(void) {}
#endif /* !CONFIG_UNWIND_USER */
diff --git a/kernel/exit.c b/kernel/exit.c
index bd743900354c..6599f9518436 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -68,6 +68,7 @@
#include <linux/rethook.h>
#include <linux/sysfs.h>
#include <linux/user_events.h>
+#include <linux/unwind_deferred.h>
#include <linux/uaccess.h>
#include <linux/pidfs.h>
@@ -938,6 +939,7 @@ void __noreturn do_exit(long code)
tsk->exit_code = code;
taskstats_exit(tsk, group_dead);
+ unwind_deferred_task_exit(tsk);
trace_sched_process_exit(tsk, group_dead);
exit_mm();
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 8a6caaae04d3..6c95f484568e 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -77,7 +77,7 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
/* Should always be called from faultable context */
might_fault();
- if (current->flags & PF_EXITING)
+ if (!current->mm)
return -EINVAL;
if (!info->cache) {
@@ -107,9 +107,9 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
return 0;
}
-static void unwind_deferred_task_work(struct callback_head *head)
+static void process_unwind_deferred(struct task_struct *task)
{
- struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
+ struct unwind_task_info *info = &task->unwind_info;
struct unwind_stacktrace trace;
struct unwind_work *work;
unsigned long bits;
@@ -151,6 +151,23 @@ static void unwind_deferred_task_work(struct callback_head *head)
srcu_read_unlock(&unwind_srcu, idx);
}
+static void unwind_deferred_task_work(struct callback_head *head)
+{
+ process_unwind_deferred(current);
+}
+
+void unwind_deferred_task_exit(struct task_struct *task)
+{
+ struct unwind_task_info *info = &current->unwind_info;
+
+ if (!unwind_pending(info))
+ return;
+
+ process_unwind_deferred(task);
+
+ task_work_cancel(task, &info->work);
+}
+
static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
{
struct unwind_task_info *info = &current->unwind_info;
--
2.47.2
^ permalink raw reply related [flat|nested] 79+ messages in thread
* [PATCH v10 12/14] unwind_user/x86: Enable frame pointer unwinding on x86
2025-06-11 0:54 [PATCH v10 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (10 preceding siblings ...)
2025-06-11 0:54 ` [PATCH v10 11/14] unwind: Finish up unwind when a task exits Steven Rostedt
@ 2025-06-11 0:54 ` Steven Rostedt
2025-06-11 0:54 ` [PATCH v10 13/14] perf/x86: Rename and move get_segment_base() and make it global Steven Rostedt
` (2 subsequent siblings)
14 siblings, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-11 0:54 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton
From: Josh Poimboeuf <jpoimboe@kernel.org>
Use ARCH_INIT_USER_FP_FRAME to describe how frame pointers are unwound
on x86, and enable CONFIG_HAVE_UNWIND_USER_FP accordingly so the
unwind_user interfaces can be used.
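With sizeof(long) == 8, this encodes the standard x86-64 frame layout
after the "push %rbp; mov %rsp, %rbp" prologue:

	cfa = fp + 16		/* the caller's SP from before the call */
	ra is at cfa - 8	/* the pushed return address */
	fp is at cfa - 16	/* the caller's saved %rbp */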
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/unwind_user.h | 11 +++++++++++
2 files changed, 12 insertions(+)
create mode 100644 arch/x86/include/asm/unwind_user.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 340e5468980e..2cdb5cf91541 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -302,6 +302,7 @@ config X86
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL
select HAVE_UNSTABLE_SCHED_CLOCK
+ select HAVE_UNWIND_USER_FP if X86_64
select HAVE_USER_RETURN_NOTIFIER
select HAVE_GENERIC_VDSO
select VDSO_GETRANDOM if X86_64
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
new file mode 100644
index 000000000000..8597857bf896
--- /dev/null
+++ b/arch/x86/include/asm/unwind_user.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_UNWIND_USER_H
+#define _ASM_X86_UNWIND_USER_H
+
+#define ARCH_INIT_USER_FP_FRAME \
+ .cfa_off = (s32)sizeof(long) * 2, \
+ .ra_off = (s32)sizeof(long) * -1, \
+ .fp_off = (s32)sizeof(long) * -2, \
+ .use_fp = true,
+
+#endif /* _ASM_X86_UNWIND_USER_H */
--
2.47.2
^ permalink raw reply related [flat|nested] 79+ messages in thread
* [PATCH v10 13/14] perf/x86: Rename and move get_segment_base() and make it global
2025-06-11 0:54 [PATCH v10 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (11 preceding siblings ...)
2025-06-11 0:54 ` [PATCH v10 12/14] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
@ 2025-06-11 0:54 ` Steven Rostedt
2025-06-11 0:54 ` [PATCH v10 14/14] unwind_user/x86: Enable compat mode frame pointer unwinding on x86 Steven Rostedt
2025-06-12 21:44 ` [PATCH v10 00/14] unwind_user: x86: Deferred unwinding infrastructure Andrii Nakryiko
14 siblings, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-11 0:54 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton
From: Josh Poimboeuf <jpoimboe@kernel.org>
get_segment_base() will be used by the unwind_user code, so make it
global and rename it to segment_base_address() so it doesn't conflict with
a KVM function of the same name.
As the function is no longer specific to perf, move it to ptrace.c as that
seems to be a better location for a generic function like this.
Also add a lockdep_assert_irqs_disabled() to make sure it's always called
with interrupts disabled.
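For reference, the compat frame pointer patch later in this series uses
it like this (the assertion is why the call sites disable interrupts):

	scoped_guard(irqsave) {
		cs_base = segment_base_address(regs->cs);
		ss_base = segment_base_address(regs->ss);
	}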
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
arch/x86/events/core.c | 44 ++++-------------------------------
arch/x86/include/asm/ptrace.h | 2 ++
arch/x86/kernel/ptrace.c | 38 ++++++++++++++++++++++++++++++
3 files changed, 45 insertions(+), 39 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 7610f26dfbd9..2f2ec84f2a14 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -43,6 +43,7 @@
#include <asm/ldt.h>
#include <asm/unwind.h>
#include <asm/uprobes.h>
+#include <asm/ptrace.h>
#include <asm/ibt.h>
#include "perf_event.h"
@@ -2808,41 +2809,6 @@ valid_user_frame(const void __user *fp, unsigned long size)
return __access_ok(fp, size);
}
-static unsigned long get_segment_base(unsigned int segment)
-{
- struct desc_struct *desc;
- unsigned int idx = segment >> 3;
-
- if ((segment & SEGMENT_TI_MASK) == SEGMENT_LDT) {
-#ifdef CONFIG_MODIFY_LDT_SYSCALL
- struct ldt_struct *ldt;
-
- /*
- * If we're not in a valid context with a real (not just lazy)
- * user mm, then don't even try.
- */
- if (!nmi_uaccess_okay())
- return 0;
-
- /* IRQs are off, so this synchronizes with smp_store_release */
- ldt = smp_load_acquire(&current->mm->context.ldt);
- if (!ldt || idx >= ldt->nr_entries)
- return 0;
-
- desc = &ldt->entries[idx];
-#else
- return 0;
-#endif
- } else {
- if (idx >= GDT_ENTRIES)
- return 0;
-
- desc = raw_cpu_ptr(gdt_page.gdt) + idx;
- }
-
- return get_desc_base(desc);
-}
-
#ifdef CONFIG_UPROBES
/*
* Heuristic-based check if uprobe is installed at the function entry.
@@ -2899,8 +2865,8 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
if (user_64bit_mode(regs))
return 0;
- cs_base = get_segment_base(regs->cs);
- ss_base = get_segment_base(regs->ss);
+ cs_base = segment_base_address(regs->cs);
+ ss_base = segment_base_address(regs->ss);
fp = compat_ptr(ss_base + regs->bp);
pagefault_disable();
@@ -3019,11 +2985,11 @@ static unsigned long code_segment_base(struct pt_regs *regs)
return 0x10 * regs->cs;
if (user_mode(regs) && regs->cs != __USER_CS)
- return get_segment_base(regs->cs);
+ return segment_base_address(regs->cs);
#else
if (user_mode(regs) && !user_64bit_mode(regs) &&
regs->cs != __USER32_CS)
- return get_segment_base(regs->cs);
+ return segment_base_address(regs->cs);
#endif
return 0;
}
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 50f75467f73d..59357ec98e52 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -314,6 +314,8 @@ static __always_inline bool regs_irqs_disabled(struct pt_regs *regs)
return !(regs->flags & X86_EFLAGS_IF);
}
+unsigned long segment_base_address(unsigned int segment);
+
/* Query offset/name of register from its name/offset */
extern int regs_query_register_offset(const char *name);
extern const char *regs_query_register_name(unsigned int offset);
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 095f04bdabdc..81353a09701b 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -41,6 +41,7 @@
#include <asm/syscall.h>
#include <asm/fsgsbase.h>
#include <asm/io_bitmap.h>
+#include <asm/mmu_context.h>
#include "tls.h"
@@ -339,6 +340,43 @@ static int set_segment_reg(struct task_struct *task,
#endif /* CONFIG_X86_32 */
+unsigned long segment_base_address(unsigned int segment)
+{
+ struct desc_struct *desc;
+ unsigned int idx = segment >> 3;
+
+ lockdep_assert_irqs_disabled();
+
+ if ((segment & SEGMENT_TI_MASK) == SEGMENT_LDT) {
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+ struct ldt_struct *ldt;
+
+ /*
+ * If we're not in a valid context with a real (not just lazy)
+ * user mm, then don't even try.
+ */
+ if (!nmi_uaccess_okay())
+ return 0;
+
+ /* IRQs are off, so this synchronizes with smp_store_release */
+ ldt = smp_load_acquire(&current->mm->context.ldt);
+ if (!ldt || idx >= ldt->nr_entries)
+ return 0;
+
+ desc = &ldt->entries[idx];
+#else
+ return 0;
+#endif
+ } else {
+ if (idx >= GDT_ENTRIES)
+ return 0;
+
+ desc = raw_cpu_ptr(gdt_page.gdt) + idx;
+ }
+
+ return get_desc_base(desc);
+}
+
static unsigned long get_flags(struct task_struct *task)
{
unsigned long retval = task_pt_regs(task)->flags;
--
2.47.2
^ permalink raw reply related [flat|nested] 79+ messages in thread
* [PATCH v10 14/14] unwind_user/x86: Enable compat mode frame pointer unwinding on x86
2025-06-11 0:54 [PATCH v10 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (12 preceding siblings ...)
2025-06-11 0:54 ` [PATCH v10 13/14] perf/x86: Rename and move get_segment_base() and make it global Steven Rostedt
@ 2025-06-11 0:54 ` Steven Rostedt
2025-06-12 21:44 ` [PATCH v10 00/14] unwind_user: x86: Deferred unwinding infrastructure Andrii Nakryiko
14 siblings, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-11 0:54 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton
From: Josh Poimboeuf <jpoimboe@kernel.org>
Use ARCH_INIT_USER_COMPAT_FP_FRAME to describe how frame pointers are
unwound on x86, and implement the hooks needed to add the segment base
addresses. Enable HAVE_UNWIND_USER_COMPAT_FP if the system has compat
mode compiled in.
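The net effect on the unwind state is roughly this (a condensed sketch
of the two hooks added below):

	/* arch_unwind_user_init(): rebase the 32-bit registers once */
	state->ip += cs_base;	/* IP is relative to the code segment */
	state->sp += ss_base;	/* SP and FP are relative to the stack segment */
	state->fp += ss_base;

	/* arch_unwind_user_next(): rebase each further unwound frame */
	state->ip += state->arch.cs_base;
	state->fp += state->arch.ss_base;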
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v9: https://lore.kernel.org/linux-trace-kernel/20250513223551.966925463@goodmis.org/
- Remove unneeded include of perf_event.h
arch/x86/Kconfig | 1 +
arch/x86/include/asm/unwind_user.h | 49 ++++++++++++++++++++++++
arch/x86/include/asm/unwind_user_types.h | 17 ++++++++
3 files changed, 67 insertions(+)
create mode 100644 arch/x86/include/asm/unwind_user_types.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2cdb5cf91541..3f7bdc9e3cec 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -302,6 +302,7 @@ config X86
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL
select HAVE_UNSTABLE_SCHED_CLOCK
+ select HAVE_UNWIND_USER_COMPAT_FP if IA32_EMULATION
select HAVE_UNWIND_USER_FP if X86_64
select HAVE_USER_RETURN_NOTIFIER
select HAVE_GENERIC_VDSO
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
index 8597857bf896..43f8554c1d70 100644
--- a/arch/x86/include/asm/unwind_user.h
+++ b/arch/x86/include/asm/unwind_user.h
@@ -2,10 +2,59 @@
#ifndef _ASM_X86_UNWIND_USER_H
#define _ASM_X86_UNWIND_USER_H
+#include <linux/unwind_user_types.h>
+#include <asm/ptrace.h>
+
#define ARCH_INIT_USER_FP_FRAME \
.cfa_off = (s32)sizeof(long) * 2, \
.ra_off = (s32)sizeof(long) * -1, \
.fp_off = (s32)sizeof(long) * -2, \
.use_fp = true,
+#ifdef CONFIG_IA32_EMULATION
+
+#define ARCH_INIT_USER_COMPAT_FP_FRAME \
+ .cfa_off = (s32)sizeof(u32) * 2, \
+ .ra_off = (s32)sizeof(u32) * -1, \
+ .fp_off = (s32)sizeof(u32) * -2, \
+ .use_fp = true,
+
+#define in_compat_mode(regs) !user_64bit_mode(regs)
+
+static inline void arch_unwind_user_init(struct unwind_user_state *state,
+ struct pt_regs *regs)
+{
+ unsigned long cs_base, ss_base;
+
+ if (state->type != UNWIND_USER_TYPE_COMPAT_FP)
+ return;
+
+ scoped_guard(irqsave) {
+ cs_base = segment_base_address(regs->cs);
+ ss_base = segment_base_address(regs->ss);
+ }
+
+ state->arch.cs_base = cs_base;
+ state->arch.ss_base = ss_base;
+
+ state->ip += cs_base;
+ state->sp += ss_base;
+ state->fp += ss_base;
+}
+#define arch_unwind_user_init arch_unwind_user_init
+
+static inline void arch_unwind_user_next(struct unwind_user_state *state)
+{
+ if (state->type != UNWIND_USER_TYPE_COMPAT_FP)
+ return;
+
+ state->ip += state->arch.cs_base;
+ state->fp += state->arch.ss_base;
+}
+#define arch_unwind_user_next arch_unwind_user_next
+
+#endif /* CONFIG_IA32_EMULATION */
+
+#include <asm-generic/unwind_user.h>
+
#endif /* _ASM_X86_UNWIND_USER_H */
diff --git a/arch/x86/include/asm/unwind_user_types.h b/arch/x86/include/asm/unwind_user_types.h
new file mode 100644
index 000000000000..d7074dc5f0ce
--- /dev/null
+++ b/arch/x86/include/asm/unwind_user_types.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_UNWIND_USER_TYPES_H
+#define _ASM_UNWIND_USER_TYPES_H
+
+#ifdef CONFIG_IA32_EMULATION
+
+struct arch_unwind_user_state {
+ unsigned long ss_base;
+ unsigned long cs_base;
+};
+#define arch_unwind_user_state arch_unwind_user_state
+
+#endif /* CONFIG_IA32_EMULATION */
+
+#include <asm-generic/unwind_user_types.h>
+
+#endif /* _ASM_UNWIND_USER_TYPES_H */
--
2.47.2
^ permalink raw reply related [flat|nested] 79+ messages in thread
* Re: [PATCH v10 00/14] unwind_user: x86: Deferred unwinding infrastructure
2025-06-11 0:54 [PATCH v10 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (13 preceding siblings ...)
2025-06-11 0:54 ` [PATCH v10 14/14] unwind_user/x86: Enable compat mode frame pointer unwinding on x86 Steven Rostedt
@ 2025-06-12 21:44 ` Andrii Nakryiko
2025-06-12 22:02 ` Josh Poimboeuf
14 siblings, 1 reply; 79+ messages in thread
From: Andrii Nakryiko @ 2025-06-12 21:44 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton
On Tue, Jun 10, 2025 at 6:03 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
>
> Hi Peter and Ingo,
>
> This is the first patch series of a set that will make it possible to be able
> to use SFrames[1] in the Linux kernel. A quick recap of the motivation for
> doing this.
>
> Currently the only way to get a user space stack trace from a stack
> walk (and not just copying large amount of user stack into the kernel
> ring buffer) is to use frame pointers. This has a few issues. The biggest
> one is that compiling frame pointers into every application and library
> has been shown to cause performance overhead.
>
> Another issue is that the format of the frames may not always be consistent
> between different compilers and some architectures (s390) has no defined
> format to do a reliable stack walk. The only way to perform user space
> profiling on these architectures is to copy the user stack into the kernel
> buffer.
>
> SFrames is now supported in gcc binutils and soon will also be supported
> by LLVM. SFrames acts more like ORC, and lives in the ELF executable
Is there any upstream PR or discussion for SFrames support in LLVM to
keep track of?
> file as its own section. Like ORC it has two tables where the first table
> is sorted by instruction pointers (IP) and using the current IP and finding
> it's entry in the first table, it will take you to the second table which
> will tell you where the return address of the current function is located
> and then you can use that address to look it up in the first table to find
> the return address of that function, and so on. This performs a user
> space stack walk.
>
[...]
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 00/14] unwind_user: x86: Deferred unwinding infrastructure
2025-06-12 21:44 ` [PATCH v10 00/14] unwind_user: x86: Deferred unwinding infrastructure Andrii Nakryiko
@ 2025-06-12 22:02 ` Josh Poimboeuf
2025-06-12 23:30 ` Andrii Nakryiko
0 siblings, 1 reply; 79+ messages in thread
From: Josh Poimboeuf @ 2025-06-12 22:02 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton
On Thu, Jun 12, 2025 at 02:44:18PM -0700, Andrii Nakryiko wrote:
> On Tue, Jun 10, 2025 at 6:03 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> >
> > Hi Peter and Ingo,
> >
> > This is the first patch series of a set that will make it possible to be able
> > to use SFrames[1] in the Linux kernel. A quick recap of the motivation for
> > doing this.
> >
> > Currently the only way to get a user space stack trace from a stack
> > walk (and not just copying large amount of user stack into the kernel
> > ring buffer) is to use frame pointers. This has a few issues. The biggest
> > one is that compiling frame pointers into every application and library
> > has been shown to cause performance overhead.
> >
> > Another issue is that the format of the frames may not always be consistent
> > between different compilers and some architectures (s390) has no defined
> > format to do a reliable stack walk. The only way to perform user space
> > profiling on these architectures is to copy the user stack into the kernel
> > buffer.
> >
> > SFrames is now supported in gcc binutils and soon will also be supported
> > by LLVM. SFrames acts more like ORC, and lives in the ELF executable
>
> Is there any upstream PR or discussion for SFrames support in LLVM to
> keep track of?
https://github.com/llvm/llvm-project/issues/64449
--
Josh
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 00/14] unwind_user: x86: Deferred unwinding infrastructure
2025-06-12 22:02 ` Josh Poimboeuf
@ 2025-06-12 23:30 ` Andrii Nakryiko
0 siblings, 0 replies; 79+ messages in thread
From: Andrii Nakryiko @ 2025-06-12 23:30 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton
On Thu, Jun 12, 2025 at 3:02 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Thu, Jun 12, 2025 at 02:44:18PM -0700, Andrii Nakryiko wrote:
> > On Tue, Jun 10, 2025 at 6:03 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > >
> > > Hi Peter and Ingo,
> > >
> > > This is the first patch series of a set that will make it possible to be able
> > > to use SFrames[1] in the Linux kernel. A quick recap of the motivation for
> > > doing this.
> > >
> > > Currently the only way to get a user space stack trace from a stack
> > > walk (and not just copying large amount of user stack into the kernel
> > > ring buffer) is to use frame pointers. This has a few issues. The biggest
> > > one is that compiling frame pointers into every application and library
> > > has been shown to cause performance overhead.
> > >
> > > Another issue is that the format of the frames may not always be consistent
> > > between different compilers and some architectures (s390) has no defined
> > > format to do a reliable stack walk. The only way to perform user space
> > > profiling on these architectures is to copy the user stack into the kernel
> > > buffer.
> > >
> > > SFrames is now supported in gcc binutils and soon will also be supported
> > > by LLVM. SFrames acts more like ORC, and lives in the ELF executable
> >
> > Is there any upstream PR or discussion for SFrames support in LLVM to
> > keep track of?
>
> https://github.com/llvm/llvm-project/issues/64449
Great, thank you!
>
> --
> Josh
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 03/14] unwind_user: Add compat mode frame pointer support
2025-06-11 0:54 ` [PATCH v10 03/14] unwind_user: Add compat mode " Steven Rostedt
@ 2025-06-18 13:46 ` Peter Zijlstra
2025-06-18 15:10 ` Steven Rostedt
2025-06-18 13:47 ` Peter Zijlstra
1 sibling, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-18 13:46 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Tue, Jun 10, 2025 at 08:54:24PM -0400, Steven Rostedt wrote:
> diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
> index 4fc550356b33..29e1f497a26e 100644
> --- a/kernel/unwind/user.c
> +++ b/kernel/unwind/user.c
> @@ -12,12 +12,32 @@ static struct unwind_user_frame fp_frame = {
> ARCH_INIT_USER_FP_FRAME
> };
>
> +static struct unwind_user_frame compat_fp_frame = {
> + ARCH_INIT_USER_COMPAT_FP_FRAME
> +};
> +
> static inline bool fp_state(struct unwind_user_state *state)
> {
> return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP) &&
> state->type == UNWIND_USER_TYPE_FP;
> }
>
> +static inline bool compat_state(struct unwind_user_state *state)
Consistency would mandate this thing be called: compat_fp_state().
> +{
> + return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) &&
> + state->type == UNWIND_USER_TYPE_COMPAT_FP;
> +}
> +
> +#define UNWIND_GET_USER_LONG(to, from, state) \
Do we have to shout this?
> +({ \
> + int __ret; \
> + if (compat_state(state)) \
> + __ret = get_user(to, (u32 __user *)(from)); \
> + else \
> + __ret = get_user(to, (unsigned long __user *)(from)); \
> + __ret; \
> +})
> +
> int unwind_user_next(struct unwind_user_state *state)
> {
> struct unwind_user_frame *frame;
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 03/14] unwind_user: Add compat mode frame pointer support
2025-06-11 0:54 ` [PATCH v10 03/14] unwind_user: Add compat mode " Steven Rostedt
2025-06-18 13:46 ` Peter Zijlstra
@ 2025-06-18 13:47 ` Peter Zijlstra
2025-06-18 15:18 ` Steven Rostedt
1 sibling, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-18 13:47 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Tue, Jun 10, 2025 at 08:54:24PM -0400, Steven Rostedt wrote:
> +#ifndef arch_unwind_user_init
> +static inline void arch_unwind_user_init(struct unwind_user_state *state, struct pt_regs *reg) {}
> +#endif
> +
> +#ifndef arch_unwind_user_next
> +static inline void arch_unwind_user_next(struct unwind_user_state *state) {}
> +#endif
The purpose of these arch hooks is so far mysterious. No comments, no
changelog, no nothing.
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 01/14] unwind_user: Add user space unwinding API
2025-06-11 0:54 ` [PATCH v10 01/14] unwind_user: Add user space unwinding API Steven Rostedt
@ 2025-06-18 13:49 ` Peter Zijlstra
0 siblings, 0 replies; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-18 13:49 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Tue, Jun 10, 2025 at 08:54:22PM -0400, Steven Rostedt wrote:
> + state->ip = instruction_pointer(regs);
> + state->sp = user_stack_pointer(regs);
> + state->fp = frame_pointer(regs);
That user_ thing sticks out like a sore thumb. I realize this is
pre-existing naming, but urgh..
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 02/14] unwind_user: Add frame pointer support
2025-06-11 0:54 ` [PATCH v10 02/14] unwind_user: Add frame pointer support Steven Rostedt
@ 2025-06-18 13:52 ` Peter Zijlstra
2025-06-18 15:09 ` Steven Rostedt
0 siblings, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-18 13:52 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Tue, Jun 10, 2025 at 08:54:23PM -0400, Steven Rostedt wrote:
> int unwind_user_next(struct unwind_user_state *state)
> {
> + struct unwind_user_frame *frame;
> + unsigned long cfa = 0, fp, ra = 0;
> +
> + if (state->done)
> + return -EINVAL;
> +
> + if (fp_state(state))
> + frame = &fp_frame;
> + else
> + goto the_end;
> +
> + cfa = (frame->use_fp ? state->fp : state->sp) + frame->cfa_off;
> +
> + /* stack going in wrong direction? */
> + if (cfa <= state->sp)
> + goto the_end;
> +
> + if (get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
> + goto the_end;
> +
> + if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
> + goto the_end;
> +
> + state->ip = ra;
> + state->sp = cfa;
> + if (frame->fp_off)
> + state->fp = fp;
> +
> + return 0;
> +
> +the_end:
> + state->done = true;
> return -EINVAL;
> }
I'm thinking 'the_end' might be better named 'done' ?
Also, CFA here is Call-Frame-Address and RA Return-Address ?
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 04/14] unwind_user/deferred: Add unwind_deferred_trace()
2025-06-11 0:54 ` [PATCH v10 04/14] unwind_user/deferred: Add unwind_deferred_trace() Steven Rostedt
@ 2025-06-18 13:54 ` Peter Zijlstra
2025-06-18 13:59 ` Peter Zijlstra
` (2 subsequent siblings)
3 siblings, 0 replies; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-18 13:54 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Tue, Jun 10, 2025 at 08:54:25PM -0400, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
>
> Add a function that must be called inside a faultable context that will
> retrieve a user space stack trace. The function unwind_deferred_trace()
> can be called by a tracer when a task is about to enter user space, or has
> just come back from user space and has interrupts enabled.
This is word salad; I really can't make much of it.
Let me go read the patch to see if that makes more sense.
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 04/14] unwind_user/deferred: Add unwind_deferred_trace()
2025-06-11 0:54 ` [PATCH v10 04/14] unwind_user/deferred: Add unwind_deferred_trace() Steven Rostedt
2025-06-18 13:54 ` Peter Zijlstra
@ 2025-06-18 13:59 ` Peter Zijlstra
2025-06-18 15:20 ` Steven Rostedt
2025-06-18 14:01 ` Peter Zijlstra
2025-06-18 14:02 ` Peter Zijlstra
3 siblings, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-18 13:59 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Tue, Jun 10, 2025 at 08:54:25PM -0400, Steven Rostedt wrote:
> diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
> index 349ce3677526..6752ac96d7e2 100644
> --- a/kernel/unwind/Makefile
> +++ b/kernel/unwind/Makefile
> @@ -1 +1 @@
> - obj-$(CONFIG_UNWIND_USER) += user.o
> + obj-$(CONFIG_UNWIND_USER) += user.o deferred.o
We really needed that extra whitespace? :-)
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 04/14] unwind_user/deferred: Add unwind_deferred_trace()
2025-06-11 0:54 ` [PATCH v10 04/14] unwind_user/deferred: Add unwind_deferred_trace() Steven Rostedt
2025-06-18 13:54 ` Peter Zijlstra
2025-06-18 13:59 ` Peter Zijlstra
@ 2025-06-18 14:01 ` Peter Zijlstra
2025-06-18 15:23 ` Steven Rostedt
2025-06-18 14:02 ` Peter Zijlstra
3 siblings, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-18 14:01 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Tue, Jun 10, 2025 at 08:54:25PM -0400, Steven Rostedt wrote:
> +#define UNWIND_MAX_ENTRIES 512
The reason this is 512 is so that you end up with a whole page below?
> +int unwind_deferred_trace(struct unwind_stacktrace *trace)
> +{
> + struct unwind_task_info *info = &current->unwind_info;
> +
> + /* Should always be called from faultable context */
> + might_fault();
> +
> + if (current->flags & PF_EXITING)
> + return -EINVAL;
> +
> + if (!info->entries) {
> + info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
> + GFP_KERNEL);
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 04/14] unwind_user/deferred: Add unwind_deferred_trace()
2025-06-11 0:54 ` [PATCH v10 04/14] unwind_user/deferred: Add unwind_deferred_trace() Steven Rostedt
` (2 preceding siblings ...)
2025-06-18 14:01 ` Peter Zijlstra
@ 2025-06-18 14:02 ` Peter Zijlstra
2025-06-18 15:29 ` Steven Rostedt
3 siblings, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-18 14:02 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Tue, Jun 10, 2025 at 08:54:25PM -0400, Steven Rostedt wrote:
> +/**
> + * unwind_deferred_trace - Produce a user stacktrace in faultable context
> + * @trace: The descriptor that will store the user stacktrace
> + *
> + * This must be called in a known faultable context (usually when entering
> + * or exiting user space). Depending on the available implementations
> + * the @trace will be loaded with the addresses of the user space stacktrace
> + * if it can be found.
I am confused -- why would we ever want to call this on exiting
user-space, or rather kernel entry?
I thought the whole point was to request a user trace while in-kernel,
and defer that to return-to-user.
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 05/14] unwind_user/deferred: Add unwind cache
2025-06-11 0:54 ` [PATCH v10 05/14] unwind_user/deferred: Add unwind cache Steven Rostedt
@ 2025-06-18 14:13 ` Peter Zijlstra
2025-06-18 15:33 ` Steven Rostedt
0 siblings, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-18 14:13 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index f94f3fdf15fc..6e850c9d3f0c 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -12,6 +12,7 @@
> #include <linux/resume_user_mode.h>
> #include <linux/tick.h>
> #include <linux/kmsan.h>
> +#include <linux/unwind_deferred.h>
>
> #include <asm/entry-common.h>
> #include <asm/syscall.h>
> @@ -362,6 +363,7 @@ static __always_inline void exit_to_user_mode(void)
> lockdep_hardirqs_on_prepare();
> instrumentation_end();
>
> + unwind_exit_to_user_mode();
So I was expecting this to do the actual unwind, and was about to go
yell this is the wrong place for that.
But this is not that. Perhaps find a better name like:
unwind_clear_cache() or so?
> user_enter_irqoff();
> arch_exit_to_user_mode();
> lockdep_hardirqs_on(CALLER_ADDR0);
> diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
> index aa32db574e43..db5b54b18828 100644
> --- a/include/linux/unwind_deferred_types.h
> +++ b/include/linux/unwind_deferred_types.h
> @@ -2,8 +2,13 @@
> #ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
> #define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
>
> +struct unwind_cache {
> + unsigned int nr_entries;
> + unsigned long entries[];
> +};
> +
> struct unwind_task_info {
> - unsigned long *entries;
> + struct unwind_cache *cache;
> };
>
> #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
> diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
> index 0bafb95e6336..e3913781c8c6 100644
> --- a/kernel/unwind/deferred.c
> +++ b/kernel/unwind/deferred.c
> @@ -24,6 +24,7 @@
> int unwind_deferred_trace(struct unwind_stacktrace *trace)
> {
> struct unwind_task_info *info = &current->unwind_info;
> + struct unwind_cache *cache;
>
> /* Should always be called from faultable context */
> might_fault();
> @@ -31,17 +32,30 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
> if (current->flags & PF_EXITING)
> return -EINVAL;
>
> - if (!info->entries) {
> - info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
> - GFP_KERNEL);
> - if (!info->entries)
> + if (!info->cache) {
> + info->cache = kzalloc(struct_size(cache, entries, UNWIND_MAX_ENTRIES),
> + GFP_KERNEL);
And now you're one 'long' larger than a page. Surely that's a crap size
for an allocator?
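(The size math in question, as a userspace sketch mirroring the struct
from the patch -- assuming 8-byte longs, 4 KiB pages and power-of-two
kmalloc buckets:)
#include <stdio.h>
#include <stddef.h>

struct unwind_cache {
	unsigned int nr_entries;
	unsigned long entries[];
};

int main(void)
{
	/* nr_entries is only 4 bytes, but entries[] needs 8-byte
	 * alignment, so the header costs a full long. */
	size_t sz = offsetof(struct unwind_cache, entries) +
		    512 * sizeof(unsigned long);

	printf("allocation size = %zu\n", sz);	/* 4104: one long past 4096 */
	/* kmalloc() has no 4104-byte bucket; the request would be served
	 * from the 8 KiB bucket, wasting almost a page per task. */
	return 0;
}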
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-06-11 0:54 ` [PATCH v10 06/14] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
@ 2025-06-18 14:20 ` Peter Zijlstra
2025-06-18 15:37 ` Steven Rostedt
2025-06-18 18:46 ` Peter Zijlstra
1 sibling, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-18 14:20 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Tue, Jun 10, 2025 at 08:54:27PM -0400, Steven Rostedt wrote:
> From: Josh Poimboeuf <jpoimboe@kernel.org>
>
> Add an interface for scheduling task work to unwind the user space stack
> before returning to user space. This solves several problems for its
> callers:
>
> - Ensure the unwind happens in task context even if the caller may be
> running in interrupt context.
>
> - Avoid duplicate unwinds, whether called multiple times by the same
> caller or by different callers.
>
> - Take a timestamp when the first request comes in since the task
> entered the kernel. This will be returned to the calling function
> along with the stack trace when the task leaves the kernel. This
> timestamp can be used to correlate kernel unwinds/traces with the user
> unwind.
>
> The timestamp is created to detect when the stacktrace is the same. It is
> generated the first time a user space stacktrace is requested after the
> task enters the kernel.
>
> The timestamp is passed to the caller on request, and when the stacktrace is
> generated upon returning to user space, it calls the requester's callback
> with the timestamp as well as the stacktrace.
This whole story hinges on there being a high resolution time-stamp
available... Good thing we killed x86 !TSC support when we did. You sure
there's no other architectures you're interested in that lack a high res
time source?
What about two CPUs managing to request an unwind at exactly the same
time?
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 02/14] unwind_user: Add frame pointer support
2025-06-18 13:52 ` Peter Zijlstra
@ 2025-06-18 15:09 ` Steven Rostedt
2025-06-23 16:31 ` Indu Bhagat
0 siblings, 1 reply; 79+ messages in thread
From: Steven Rostedt @ 2025-06-18 15:09 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Wed, 18 Jun 2025 15:52:01 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> > +the_end:
> > + state->done = true;
> > return -EINVAL;
> > }
>
> I'm thinking 'the_end' might be better named 'done' ?
I thought it was cute ;-) (BTW, I didn't name it).
But sure, I can update it to be something more common.
>
> Also, CFA here is Call-Frame-Address and RA Return-Address ?
I believe so. Do you want me to add a comment?
-- Steve
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 03/14] unwind_user: Add compat mode frame pointer support
2025-06-18 13:46 ` Peter Zijlstra
@ 2025-06-18 15:10 ` Steven Rostedt
2025-06-18 17:52 ` Linus Torvalds
0 siblings, 1 reply; 79+ messages in thread
From: Steven Rostedt @ 2025-06-18 15:10 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Wed, 18 Jun 2025 15:46:41 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jun 10, 2025 at 08:54:24PM -0400, Steven Rostedt wrote:
> > diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
> > index 4fc550356b33..29e1f497a26e 100644
> > --- a/kernel/unwind/user.c
> > +++ b/kernel/unwind/user.c
> > @@ -12,12 +12,32 @@ static struct unwind_user_frame fp_frame = {
> > ARCH_INIT_USER_FP_FRAME
> > };
> >
> > +static struct unwind_user_frame compat_fp_frame = {
> > + ARCH_INIT_USER_COMPAT_FP_FRAME
> > +};
> > +
> > static inline bool fp_state(struct unwind_user_state *state)
> > {
> > return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP) &&
> > state->type == UNWIND_USER_TYPE_FP;
> > }
> >
> > +static inline bool compat_state(struct unwind_user_state *state)
>
> Consistency would mandate this thing be called: compat_fp_state().
Sure. Will update it.
>
> > +{
> > + return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) &&
> > + state->type == UNWIND_USER_TYPE_COMPAT_FP;
> > +}
> > +
> > +#define UNWIND_GET_USER_LONG(to, from, state) \
>
> Do we have to shout this?
Don't we usually shout macros?
-- Steve
>
> > +({ \
> > + int __ret; \
> > + if (compat_state(state)) \
> > + __ret = get_user(to, (u32 __user *)(from)); \
> > + else \
> > + __ret = get_user(to, (unsigned long __user *)(from)); \
> > + __ret; \
> > +})
> > +
> > int unwind_user_next(struct unwind_user_state *state)
> > {
> > struct unwind_user_frame *frame;
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 03/14] unwind_user: Add compat mode frame pointer support
2025-06-18 13:47 ` Peter Zijlstra
@ 2025-06-18 15:18 ` Steven Rostedt
2025-06-19 7:51 ` Peter Zijlstra
0 siblings, 1 reply; 79+ messages in thread
From: Steven Rostedt @ 2025-06-18 15:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Wed, 18 Jun 2025 15:47:58 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jun 10, 2025 at 08:54:24PM -0400, Steven Rostedt wrote:
>
> > +#ifndef arch_unwind_user_init
> > +static inline void arch_unwind_user_init(struct unwind_user_state *state, struct pt_regs *reg) {}
> > +#endif
> > +
> > +#ifndef arch_unwind_user_next
> > +static inline void arch_unwind_user_next(struct unwind_user_state *state) {}
> > +#endif
>
> The purpose of these arch hooks is so far mysterious. No comments, no
> changelog, no nothing.
I'll add comments.
It's used later in the x86 compat code to allow the architecture to do any
special initialization or to handle moving to the next frame.
From patch 14:
+#define in_compat_mode(regs) !user_64bit_mode(regs)
+
+static inline void arch_unwind_user_init(struct unwind_user_state *state,
+ struct pt_regs *regs)
+{
+ unsigned long cs_base, ss_base;
+
+ if (state->type != UNWIND_USER_TYPE_COMPAT_FP)
+ return;
+
+ scoped_guard(irqsave) {
+ cs_base = segment_base_address(regs->cs);
+ ss_base = segment_base_address(regs->ss);
+ }
+
+ state->arch.cs_base = cs_base;
+ state->arch.ss_base = ss_base;
+
+ state->ip += cs_base;
+ state->sp += ss_base;
+ state->fp += ss_base;
+}
+#define arch_unwind_user_init arch_unwind_user_init
+
+static inline void arch_unwind_user_next(struct unwind_user_state *state)
+{
+ if (state->type != UNWIND_USER_TYPE_COMPAT_FP)
+ return;
+
+ state->ip += state->arch.cs_base;
+ state->fp += state->arch.ss_base;
+}
+#define arch_unwind_user_next arch_unwind_user_next
-- Steve
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 04/14] unwind_user/deferred: Add unwind_deferred_trace()
2025-06-18 13:59 ` Peter Zijlstra
@ 2025-06-18 15:20 ` Steven Rostedt
2025-06-18 16:26 ` Steven Rostedt
0 siblings, 1 reply; 79+ messages in thread
From: Steven Rostedt @ 2025-06-18 15:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Wed, 18 Jun 2025 15:59:07 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jun 10, 2025 at 08:54:25PM -0400, Steven Rostedt wrote:
> > diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
> > index 349ce3677526..6752ac96d7e2 100644
> > --- a/kernel/unwind/Makefile
> > +++ b/kernel/unwind/Makefile
> > @@ -1 +1 @@
> > - obj-$(CONFIG_UNWIND_USER) += user.o
> > + obj-$(CONFIG_UNWIND_USER) += user.o deferred.o
>
> We really needed that extra whitespace? :-)
Oops, I have no idea how that happened. Especially since emacs doesn't do
tabs well.
-- Steve
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 04/14] unwind_user/deferred: Add unwind_deferred_trace()
2025-06-18 14:01 ` Peter Zijlstra
@ 2025-06-18 15:23 ` Steven Rostedt
0 siblings, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-18 15:23 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Wed, 18 Jun 2025 16:01:11 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jun 10, 2025 at 08:54:25PM -0400, Steven Rostedt wrote:
>
> > +#define UNWIND_MAX_ENTRIES 512
>
> The reason this is 512 is so that you end up with a whole page below?
Possibly. We could probably even make that configurable. Perhaps just use
sysctl_perf_event_max_contexts_per_stack ?
Josh, any comments about why you picked this number?
-- Steve
>
> > +int unwind_deferred_trace(struct unwind_stacktrace *trace)
> > +{
> > + struct unwind_task_info *info = &current->unwind_info;
> > +
> > + /* Should always be called from faultable context */
> > + might_fault();
> > +
> > + if (current->flags & PF_EXITING)
> > + return -EINVAL;
> > +
> > + if (!info->entries) {
> > + info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
> > + GFP_KERNEL);
>
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 04/14] unwind_user/deferred: Add unwind_deferred_trace()
2025-06-18 14:02 ` Peter Zijlstra
@ 2025-06-18 15:29 ` Steven Rostedt
2025-06-19 7:54 ` Peter Zijlstra
0 siblings, 1 reply; 79+ messages in thread
From: Steven Rostedt @ 2025-06-18 15:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Wed, 18 Jun 2025 16:02:47 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jun 10, 2025 at 08:54:25PM -0400, Steven Rostedt wrote:
> > +/**
> > + * unwind_deferred_trace - Produce a user stacktrace in faultable context
> > + * @trace: The descriptor that will store the user stacktrace
> > + *
> > + * This must be called in a known faultable context (usually when entering
> > + * or exiting user space). Depending on the available implementations
> > + * the @trace will be loaded with the addresses of the user space stacktrace
> > + * if it can be found.
>
> I am confused -- why would we ever want to call this on exiting
> user-space, or rather kernel entry?
>
> I thought the whole point was to request a user trace while in-kernel,
> and defer that to return-to-user.
This code was broken out of the deferred unwind code to be more
standalone. Actually, it should be renamed to unwind_faultable_trace() or
something to denote that it must be called from a faultable context.
When Josh made perf use the task_work directly, it used this function to do
the trace as it handled the deferring.
Note, a request from the gcc folks is to add a system call that gives the
user space application a backtrace from its current location. This can be
handy for debugging as it would be similar to how we use dump_stack().
This function would be used for that.
-- Steve
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 05/14] unwind_user/deferred: Add unwind cache
2025-06-18 14:13 ` Peter Zijlstra
@ 2025-06-18 15:33 ` Steven Rostedt
2025-06-19 7:56 ` Peter Zijlstra
0 siblings, 1 reply; 79+ messages in thread
From: Steven Rostedt @ 2025-06-18 15:33 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Wed, 18 Jun 2025 16:13:45 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> > diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> > index f94f3fdf15fc..6e850c9d3f0c 100644
> > --- a/include/linux/entry-common.h
> > +++ b/include/linux/entry-common.h
> > @@ -12,6 +12,7 @@
> > #include <linux/resume_user_mode.h>
> > #include <linux/tick.h>
> > #include <linux/kmsan.h>
> > +#include <linux/unwind_deferred.h>
> >
> > #include <asm/entry-common.h>
> > #include <asm/syscall.h>
> > @@ -362,6 +363,7 @@ static __always_inline void exit_to_user_mode(void)
> > lockdep_hardirqs_on_prepare();
> > instrumentation_end();
> >
> > + unwind_exit_to_user_mode();
>
> So I was expecting this to do the actual unwind, and was about to go
> yell this is the wrong place for that.
>
> But this is not that. Perhaps find a better name like:
> unwind_clear_cache() or so?
Sure.
How about unwind_reset_info()?
As it's not going to just clear the cache but also reset the trace info
(like the timestamp and such).
>
> > user_enter_irqoff();
> > arch_exit_to_user_mode();
> > lockdep_hardirqs_on(CALLER_ADDR0);
>
>
> > diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
> > index aa32db574e43..db5b54b18828 100644
> > --- a/include/linux/unwind_deferred_types.h
> > +++ b/include/linux/unwind_deferred_types.h
> > @@ -2,8 +2,13 @@
> > #ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
> > #define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
> >
> > +struct unwind_cache {
> > + unsigned int nr_entries;
> > + unsigned long entries[];
> > +};
> > +
> > struct unwind_task_info {
> > - unsigned long *entries;
> > + struct unwind_cache *cache;
> > };
> >
> > #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
> > diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
> > index 0bafb95e6336..e3913781c8c6 100644
> > --- a/kernel/unwind/deferred.c
> > +++ b/kernel/unwind/deferred.c
> > @@ -24,6 +24,7 @@
> > int unwind_deferred_trace(struct unwind_stacktrace *trace)
> > {
> > struct unwind_task_info *info = &current->unwind_info;
> > + struct unwind_cache *cache;
> >
> > /* Should always be called from faultable context */
> > might_fault();
> > @@ -31,17 +32,30 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
> > if (current->flags & PF_EXITING)
> > return -EINVAL;
> >
> > - if (!info->entries) {
> > - info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
> > - GFP_KERNEL);
> > - if (!info->entries)
> > + if (!info->cache) {
> > + info->cache = kzalloc(struct_size(cache, entries, UNWIND_MAX_ENTRIES),
> > + GFP_KERNEL);
>
> And now you're one 'long' larger than a page. Surely that's a crap size
> for an allocator?
Bah, Ingo suggested to put the counter in the allocation and I didn't think
about the size going over the page. Good catch!
Since we can end up allocating one of these per task, it may be good to
make this into a kmem_cache.
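A minimal sketch of what that could look like (names assumed, not the
code that was merged):
static struct kmem_cache *unwind_cachep;

static int __init unwind_cache_init(void)
{
	size_t size = sizeof(struct unwind_cache) +
		      UNWIND_MAX_ENTRIES * sizeof(unsigned long);

	/* A dedicated slab packs these odd-sized objects back to back
	 * instead of rounding each one up to the next kmalloc bucket. */
	unwind_cachep = kmem_cache_create("unwind_cache", size, 0, 0, NULL);
	return unwind_cachep ? 0 : -ENOMEM;
}
early_initcall(unwind_cache_init);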
-- Steve
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-06-18 14:20 ` Peter Zijlstra
@ 2025-06-18 15:37 ` Steven Rostedt
2025-06-18 15:38 ` Steven Rostedt
2025-06-19 8:01 ` Peter Zijlstra
0 siblings, 2 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-18 15:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Wed, 18 Jun 2025 16:20:00 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> > The timestamp is passed to the caller on request, and when the stacktrace is
> > generated upon returning to user space, it calls the requester's callback
> > with the timestamp as well as the stacktrace.
>
> This whole story hinges on there being a high resolution time-stamp
> available... Good thing we killed x86 !TSC support when we did. You sure
> there's no other architectures you're interested in that lack a high res
> time source?
>
> What about two CPUs managing to request an unwind at exactly the same
> time?
It's mapped to a task. As long as each timestamp is unique for a task it
should be fine. The trace can record the current->pid along with the
timestamp to map to the unique user space stack trace.
As for resolution, we're fine as long as there can't be two system calls
back to back within the same timestamp. Otherwise, yeah, we have an issue.
-- Steve
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-06-18 15:37 ` Steven Rostedt
@ 2025-06-18 15:38 ` Steven Rostedt
2025-06-19 8:01 ` Peter Zijlstra
1 sibling, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-18 15:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Wed, 18 Jun 2025 11:37:06 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> > What about two CPUs managing to request an unwind at exactly the same
> > time?
>
> It's mapped to a task. As long as each timestamp is unique for a task it
> should be fine. As the trace can record the current->pid along with the
> timestamp to map to the unique user space stack trace.
>
> As for resolution, as long as there can't be two system calls back to back
> within the same time stamp. Otherwise, yeah, we have an issue.
I'll add a comment that states this as a constraint.
-- Steve
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 04/14] unwind_user/deferred: Add unwind_deferred_trace()
2025-06-18 15:20 ` Steven Rostedt
@ 2025-06-18 16:26 ` Steven Rostedt
0 siblings, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-18 16:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Wed, 18 Jun 2025 11:20:15 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> > > +++ b/kernel/unwind/Makefile
> > > @@ -1 +1 @@
> > > - obj-$(CONFIG_UNWIND_USER) += user.o
> > > + obj-$(CONFIG_UNWIND_USER) += user.o deferred.o
> >
> > We really needed that extra whitespace? :-)
>
> Oops, I have no idea how that happened. Especially since emacs doesn't do
> tabs well.
I will replace the two tabs with a single tab. That still changes the
whitespace, as it replaces a single space with a tab, but tabs appear to
be more consistent with other Makefiles.
-- Steve
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 03/14] unwind_user: Add compat mode frame pointer support
2025-06-18 15:10 ` Steven Rostedt
@ 2025-06-18 17:52 ` Linus Torvalds
2025-06-18 18:37 ` Steven Rostedt
0 siblings, 1 reply; 79+ messages in thread
From: Linus Torvalds @ 2025-06-18 17:52 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton
On Wed, 18 Jun 2025 at 08:10, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Don't we usually shout macros?
Not really. It's more about "we shout about macros when they _behave_
like macros".
So typically we do all upper-case for things that are constants
(whether enums or macros, actually) or when they don't act like normal
functions.
But when it's not particularly important that it's a macro, and it
could have been a function (but maybe it was just simpler to use a
macro for whatever reason), we typically don't use all upper-case.
In this case, I have to agree with PeterZ that this just looks odd:
- if (get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
+ if (UNWIND_GET_USER_LONG(ra, cfa + frame->ra_off, state))
why is UNWIND_GET_USER_LONG() so loud when it just replaces "get_user()"?
Note that "get_user()" itself is a macro, and is lower-case, even
though you couldn't actually do it as a function (because it changes
its first argument in place).
Linus
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 03/14] unwind_user: Add compat mode frame pointer support
2025-06-18 17:52 ` Linus Torvalds
@ 2025-06-18 18:37 ` Steven Rostedt
0 siblings, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-18 18:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton
On Wed, 18 Jun 2025 10:52:22 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> In this case, I have to agree with PeterZ that this just looks odd:
>
> - if (get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
> + if (UNWIND_GET_USER_LONG(ra, cfa + frame->ra_off, state))
>
> why is UNWIND_GET_USER_LONG() so loud when it just replaces "get_user()"?
>
> Note that "get_user()" itself is a macro, and is lower-case, even
> though you couldn't actually do it as a function (because it changes
> its first argument in place).
OK, I'll make it lower case.
Thanks!
-- Steve
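(For illustration, the lower-cased macro would presumably keep the v10
body unchanged, picking up the compat_fp_state() rename from earlier in
the thread; the exact name is an assumption:)
#define unwind_get_user_long(to, from, state)				\
({									\
	int __ret;							\
	if (compat_fp_state(state))					\
		__ret = get_user(to, (u32 __user *)(from));		\
	else								\
		__ret = get_user(to, (unsigned long __user *)(from));	\
	__ret;								\
})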
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-06-11 0:54 ` [PATCH v10 06/14] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
2025-06-18 14:20 ` Peter Zijlstra
@ 2025-06-18 18:46 ` Peter Zijlstra
2025-06-18 19:09 ` Steven Rostedt
2025-06-24 22:36 ` Steven Rostedt
1 sibling, 2 replies; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-18 18:46 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
> +struct unwind_work;
> +
> +typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stacktrace *trace, u64 timestamp);
> +
> +struct unwind_work {
> + struct list_head list;
Does this really need to be a list? Single linked list like
callback_head not good enough?
> + unwind_callback_t func;
> +};
> +
> #ifdef CONFIG_UNWIND_USER
>
> void unwind_task_init(struct task_struct *task);
> @@ -12,10 +22,15 @@ void unwind_task_free(struct task_struct *task);
>
> int unwind_deferred_trace(struct unwind_stacktrace *trace);
>
> +int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
> +int unwind_deferred_request(struct unwind_work *work, u64 *timestamp);
> +void unwind_deferred_cancel(struct unwind_work *work);
> +
> static __always_inline void unwind_exit_to_user_mode(void)
> {
> if (unlikely(current->unwind_info.cache))
> current->unwind_info.cache->nr_entries = 0;
> + current->unwind_info.timestamp = 0;
Surely clearing that timestamp is only relevant when there is a cache
around? Better to not add this unconditional write to the exit path.
> }
>
> #else /* !CONFIG_UNWIND_USER */
> @@ -24,6 +39,9 @@ static inline void unwind_task_init(struct task_struct *task) {}
> static inline void unwind_task_free(struct task_struct *task) {}
>
> static inline int unwind_deferred_trace(struct unwind_stacktrace *trace) { return -ENOSYS; }
> +static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func) { return -ENOSYS; }
> +static inline int unwind_deferred_request(struct unwind_work *work, u64 *timestamp) { return -ENOSYS; }
> +static inline void unwind_deferred_cancel(struct unwind_work *work) {}
>
> static inline void unwind_exit_to_user_mode(void) {}
>
> diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
> index db5b54b18828..5df264cf81ad 100644
> --- a/include/linux/unwind_deferred_types.h
> +++ b/include/linux/unwind_deferred_types.h
> @@ -9,6 +9,9 @@ struct unwind_cache {
>
> struct unwind_task_info {
> struct unwind_cache *cache;
> + struct callback_head work;
> + u64 timestamp;
> + int pending;
> };
>
> #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
> diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
> index e3913781c8c6..b76c704ddc6d 100644
> --- a/kernel/unwind/deferred.c
> +++ b/kernel/unwind/deferred.c
> @@ -2,13 +2,35 @@
> /*
> * Deferred user space unwinding
> */
> +#include <linux/sched/task_stack.h>
> +#include <linux/unwind_deferred.h>
> +#include <linux/sched/clock.h>
> +#include <linux/task_work.h>
> #include <linux/kernel.h>
> #include <linux/sched.h>
> #include <linux/slab.h>
> -#include <linux/unwind_deferred.h>
> +#include <linux/mm.h>
>
> #define UNWIND_MAX_ENTRIES 512
>
> +/* Guards adding to and reading the list of callbacks */
> +static DEFINE_MUTEX(callback_mutex);
> +static LIST_HEAD(callbacks);
Global state.. smells like failure.
> +/*
> + * Read the task context timestamp, if this is the first caller then
> + * it will set the timestamp.
> + */
> +static u64 get_timestamp(struct unwind_task_info *info)
> +{
> + lockdep_assert_irqs_disabled();
> +
> + if (!info->timestamp)
> + info->timestamp = local_clock();
> +
> + return info->timestamp;
> +}
> +
> /**
> * unwind_deferred_trace - Produce a user stacktrace in faultable context
> * @trace: The descriptor that will store the user stacktrace
> @@ -59,11 +81,117 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
> return 0;
> }
>
> +static void unwind_deferred_task_work(struct callback_head *head)
> +{
> + struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
> + struct unwind_stacktrace trace;
> + struct unwind_work *work;
> + u64 timestamp;
> +
> + if (WARN_ON_ONCE(!info->pending))
> + return;
> +
> + /* Allow work to come in again */
> + WRITE_ONCE(info->pending, 0);
> +
> + /*
> + * From here on out, the callback must always be called, even if it's
> + * just an empty trace.
> + */
> + trace.nr = 0;
> + trace.entries = NULL;
> +
> + unwind_deferred_trace(&trace);
> +
> + timestamp = info->timestamp;
> +
> + guard(mutex)(&callback_mutex);
> + list_for_each_entry(work, &callbacks, list) {
> + work->func(work, &trace, timestamp);
> + }
So now you're globally serializing all return-to-user instances. How is
that not a problem?
> +}
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-06-18 18:46 ` Peter Zijlstra
@ 2025-06-18 19:09 ` Steven Rostedt
2025-06-18 19:36 ` Steven Rostedt
2025-06-19 7:50 ` Peter Zijlstra
2025-06-24 22:36 ` Steven Rostedt
1 sibling, 2 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-18 19:09 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Wed, 18 Jun 2025 20:46:20 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> > +struct unwind_work;
> > +
> > +typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stacktrace *trace, u64 timestamp);
> > +
> > +struct unwind_work {
> > + struct list_head list;
>
> Does this really need to be a list? Single linked list like
> callback_head not good enough?
Doesn't a list head make it easier to remove without having to iterate the
list?
>
> > + unwind_callback_t func;
> > +};
> > +
> > #ifdef CONFIG_UNWIND_USER
> >
> > void unwind_task_init(struct task_struct *task);
> > @@ -12,10 +22,15 @@ void unwind_task_free(struct task_struct *task);
> >
> > int unwind_deferred_trace(struct unwind_stacktrace *trace);
> >
> > +int unwind_deferred_init(struct unwind_work *work, unwind_callback_t
> > func); +int unwind_deferred_request(struct unwind_work *work, u64
> > *timestamp); +void unwind_deferred_cancel(struct unwind_work *work);
> > +
> > static __always_inline void unwind_exit_to_user_mode(void)
> > {
> > if (unlikely(current->unwind_info.cache))
> > current->unwind_info.cache->nr_entries = 0;
> > + current->unwind_info.timestamp = 0;
>
> Surely clearing that timestamp is only relevant when there is a cache
> around? Better to not add this unconditional write to the exit path.
That's actually not quite true. If the allocation fails, we still want to
clear the timestamp. But later patches add more data to check and it does
exit out early if there have been no requests:
{
struct unwind_task_info *info = &current->unwind_info;
unsigned long bits;
/* Was there any unwinding? */
if (likely(!info->unwind_mask))
return;
bits = info->unwind_mask;
do {
/* Is a task_work going to run again before going back */
if (bits & UNWIND_PENDING)
return;
} while (!try_cmpxchg(&info->unwind_mask, &bits, 0UL));
if (likely(info->cache))
info->cache->nr_entries = 0;
info->timestamp = 0;
}
But for better reviewing, I could add a comment in this patch that states
that this will eventually exit out early when it does more work.
>
> > }
> >
> > #else /* !CONFIG_UNWIND_USER */
> > @@ -24,6 +39,9 @@ static inline void unwind_task_init(struct
> > task_struct *task) {} static inline void unwind_task_free(struct
> > task_struct *task) {}
> > static inline int unwind_deferred_trace(struct unwind_stacktrace
> > *trace) { return -ENOSYS; } +static inline int
> > unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
> > { return -ENOSYS; } +static inline int unwind_deferred_request(struct
> > unwind_work *work, u64 *timestamp) { return -ENOSYS; } +static inline
> > void unwind_deferred_cancel(struct unwind_work *work) {} static inline
> > void unwind_exit_to_user_mode(void) {}
> > diff --git a/include/linux/unwind_deferred_types.h
> > b/include/linux/unwind_deferred_types.h index
> > db5b54b18828..5df264cf81ad 100644 ---
> > a/include/linux/unwind_deferred_types.h +++
> > b/include/linux/unwind_deferred_types.h @@ -9,6 +9,9 @@ struct
> > unwind_cache {
> > struct unwind_task_info {
> > struct unwind_cache *cache;
> > + struct callback_head work;
> > + u64 timestamp;
> > + int pending;
> > };
> >
> > #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
> > diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
> > index e3913781c8c6..b76c704ddc6d 100644
> > --- a/kernel/unwind/deferred.c
> > +++ b/kernel/unwind/deferred.c
> > @@ -2,13 +2,35 @@
> > /*
> > * Deferred user space unwinding
> > */
> > +#include <linux/sched/task_stack.h>
> > +#include <linux/unwind_deferred.h>
> > +#include <linux/sched/clock.h>
> > +#include <linux/task_work.h>
> > #include <linux/kernel.h>
> > #include <linux/sched.h>
> > #include <linux/slab.h>
> > -#include <linux/unwind_deferred.h>
> > +#include <linux/mm.h>
> >
> > #define UNWIND_MAX_ENTRIES 512
> >
> > +/* Guards adding to and reading the list of callbacks */
> > +static DEFINE_MUTEX(callback_mutex);
> > +static LIST_HEAD(callbacks);
>
> Global state.. smells like failure.
Yes, the unwind infrastructure is global, as that is how tasks know which
tracers' callbacks to call.
>
> > +/*
> > + * Read the task context timestamp, if this is the first caller then
> > + * it will set the timestamp.
> > + */
> > +static u64 get_timestamp(struct unwind_task_info *info)
> > +{
> > + lockdep_assert_irqs_disabled();
> > +
> > + if (!info->timestamp)
> > + info->timestamp = local_clock();
> > +
> > + return info->timestamp;
> > +}
> > +
> > /**
> > * unwind_deferred_trace - Produce a user stacktrace in faultable
> > context
> > * @trace: The descriptor that will store the user stacktrace
> > @@ -59,11 +81,117 @@ int unwind_deferred_trace(struct unwind_stacktrace
> > *trace) return 0;
> > }
> >
> > +static void unwind_deferred_task_work(struct callback_head *head)
> > +{
> > + struct unwind_task_info *info = container_of(head, struct
> > unwind_task_info, work);
> > + struct unwind_stacktrace trace;
> > + struct unwind_work *work;
> > + u64 timestamp;
> > +
> > + if (WARN_ON_ONCE(!info->pending))
> > + return;
> > +
> > + /* Allow work to come in again */
> > + WRITE_ONCE(info->pending, 0);
> > +
> > + /*
> > + * From here on out, the callback must always be called, even
> > if it's
> > + * just an empty trace.
> > + */
> > + trace.nr = 0;
> > + trace.entries = NULL;
> > +
> > + unwind_deferred_trace(&trace);
> > +
> > + timestamp = info->timestamp;
> > +
> > + guard(mutex)(&callback_mutex);
> > + list_for_each_entry(work, &callbacks, list) {
> > + work->func(work, &trace, timestamp);
> > + }
>
> So now you're globally serializing all return-to-user instances. How is
> that not a problem?
It was the original way we did things. The next patch changes this to SRCU.
But that requires a bit more care, so to keep the series broken up I
preferred not to add that logic here and instead made it a separate patch.
For better reviewing, I'll add a comment here that says:
/* TODO switch this global lock to SRCU */
-- Steve
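(A sketch of the SRCU-protected walk that patch 9 switches to -- the
srcu struct and helper names here are assumptions:)
DEFINE_STATIC_SRCU(unwind_srcu);

static void unwind_run_callbacks(struct unwind_stacktrace *trace, u64 timestamp)
{
	struct unwind_work *work;
	int idx;

	/* Readers never block each other; unwind_deferred_cancel() can
	 * synchronize_srcu() before a work entry goes away. */
	idx = srcu_read_lock(&unwind_srcu);
	list_for_each_entry_srcu(work, &callbacks, list,
				 srcu_read_lock_held(&unwind_srcu))
		work->func(work, trace, timestamp);
	srcu_read_unlock(&unwind_srcu, idx);
}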
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-06-18 19:09 ` Steven Rostedt
@ 2025-06-18 19:36 ` Steven Rostedt
2025-06-19 7:50 ` Peter Zijlstra
1 sibling, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-18 19:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Wed, 18 Jun 2025 15:09:15 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> > So now you're globally serializing all return-to-user instances. How is
> > that not a problem?
>
> It was the original way we did things. The next patch changes this to SRCU.
> But it requires a bit more care. For breaking up the series, I preferred
> not to add that logic and make it a separate patch.
It's not the next patch but patch 9 (three away).
-- Steve
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-06-18 19:09 ` Steven Rostedt
2025-06-18 19:36 ` Steven Rostedt
@ 2025-06-19 7:50 ` Peter Zijlstra
2025-06-19 8:56 ` Steven Rostedt
1 sibling, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-19 7:50 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Wed, Jun 18, 2025 at 03:09:15PM -0400, Steven Rostedt wrote:
> On Wed, 18 Jun 2025 20:46:20 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > +struct unwind_work;
> > > +
> > > +typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stacktrace *trace, u64 timestamp);
> > > +
> > > +struct unwind_work {
> > > + struct list_head list;
> >
> > Does this really need to be a list? Single linked list like
> > callback_head not good enough?
>
> Doesn't a list head make it easier to remove without having to iterate the
> list?
Yeah, but why would you ever want to remove it? You asked for an unwind,
you get an unwind, no?
> > > static __always_inline void unwind_exit_to_user_mode(void)
> > > {
> > > if (unlikely(current->unwind_info.cache))
> > > current->unwind_info.cache->nr_entries = 0;
> > > + current->unwind_info.timestamp = 0;
> >
> > Surely clearing that timestamp is only relevant when there is a cache
> > around? Better to not add this unconditional write to the exit path.
>
> That's actually not quite true. If the allocation fails, we still want to
> clear the timestamp. But later patches add more data to check and it does
> exit out if there's been no requests:
Well, you could put in an error value on alloc fail or somesuch. Then
it's non-zero.
> But for better reviewing, I could add a comment in this patch that states
> that this will eventually exit out early when it does more work.
You're making this really hard to review, why not do it right from the
get-go?
> > > +/* Guards adding to and reading the list of callbacks */
> > > +static DEFINE_MUTEX(callback_mutex);
> > > +static LIST_HEAD(callbacks);
> >
> > Global state.. smells like failure.
>
> Yes, the unwind infrastructure is global, as it is the way tasks know what
> tracer's callbacks to call.
Well, that's apparently how you've set it up. I don't immediately see
this has to be like this.
And there's no comments no nothing.
I don't see why you can't have something like:
struct unwind_work {
struct callback_head task_work;
void *data;
void (*func)(struct unwind_work *work, void *data);
};
void unwind_task_work_func(struct callback_head *task_work)
{
struct unwind_work *uw = container_of(task_work, struct unwind_work, task_work);
// do actual unwind
uw->func(uw, uw->data);
}
or something along those lines. No global state involved.
> > > + guard(mutex)(&callback_mutex);
> > > + list_for_each_entry(work, &callbacks, list) {
> > > + work->func(work, &trace, timestamp);
> > > + }
> >
> > So now you're globally serializing all return-to-user instances. How is
> > that not a problem?
>
> It was the original way we did things. The next patch changes this to SRCU.
> But it requires a bit more care. For breaking up the series, I preferred
> not to add that logic and make it a separate patch.
>
> For better reviewing, I'll add a comment here that says:
>
> /* TODO switch this global lock to SRCU */
Oh ffs :-(
So splitting up patches is for ease of review, but now you're making
splits that make review harder, how does that make sense?
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 03/14] unwind_user: Add compat mode frame pointer support
2025-06-18 15:18 ` Steven Rostedt
@ 2025-06-19 7:51 ` Peter Zijlstra
2025-06-19 8:39 ` Steven Rostedt
0 siblings, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-19 7:51 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Wed, Jun 18, 2025 at 11:18:40AM -0400, Steven Rostedt wrote:
> On Wed, 18 Jun 2025 15:47:58 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Tue, Jun 10, 2025 at 08:54:24PM -0400, Steven Rostedt wrote:
> >
> > > +#ifndef arch_unwind_user_init
> > > +static inline void arch_unwind_user_init(struct unwind_user_state *state, struct pt_regs *reg) {}
> > > +#endif
> > > +
> > > +#ifndef arch_unwind_user_next
> > > +static inline void arch_unwind_user_next(struct unwind_user_state *state) {}
> > > +#endif
> >
> > The purpose of these arch hooks is so far mysterious. No comments, no
> > changelog, no nothing.
>
> I'll add comments.
How about you introduce the hooks when they're actually needed instead?
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 04/14] unwind_user/deferred: Add unwind_deferred_trace()
2025-06-18 15:29 ` Steven Rostedt
@ 2025-06-19 7:54 ` Peter Zijlstra
2025-06-19 8:44 ` Steven Rostedt
0 siblings, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-19 7:54 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Wed, Jun 18, 2025 at 11:29:39AM -0400, Steven Rostedt wrote:
> Note, a request from the gcc folks is to add a system call that gives the
> user space application a backtrace from its current location. This can be
> handy for debugging as it would be similar to how we use dump_stack().
That makes very little sense to me; apps can typically unwind themselves
just fine, no? In fact, they can use DWARFs and all that.
Also, how about we don't make things complicated and not confuse comments
with things like this? Focus on the deferred stuff (that's what these
patches are about) -- and then return-to-user is the one and only place
that makes sense.
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 05/14] unwind_user/deferred: Add unwind cache
2025-06-18 15:33 ` Steven Rostedt
@ 2025-06-19 7:56 ` Peter Zijlstra
2025-06-19 8:47 ` Steven Rostedt
0 siblings, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-19 7:56 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Wed, Jun 18, 2025 at 11:33:59AM -0400, Steven Rostedt wrote:
> On Wed, 18 Jun 2025 16:13:45 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> > > + info->cache = kzalloc(struct_size(cache, entries, UNWIND_MAX_ENTRIES),
> > > + GFP_KERNEL);
> >
> > And now you're one 'long' larger than a page. Surely that's a crap size
> > for an allocator?
>
> Bah, Ingo suggested to put the counter in the allocation and I didn't think
> about the size going over the page. Good catch!
>
> Since it can make one per task, it may be good to make this into a
> kmemcache.
Well, the trivial solution is to make it 511 and call it a day. Don't
make things complicated if you don't have to.
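(The arithmetic behind 511, and a sketch of deriving the cap from the
page size rather than hard-coding it -- assuming 8-byte longs and 4 KiB
pages:)
/* sizeof(struct unwind_cache) is 8: the 4-byte nr_entries is padded
 * out to the 8-byte alignment of the flexible entries[] array.
 *
 *	(4096 - 8) / 8 = 511 entries
 *
 * so struct_size() lands on exactly one page instead of one long over.
 */
#define UNWIND_MAX_ENTRIES \
	((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))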
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-06-18 15:37 ` Steven Rostedt
2025-06-18 15:38 ` Steven Rostedt
@ 2025-06-19 8:01 ` Peter Zijlstra
2025-06-19 8:49 ` Steven Rostedt
1 sibling, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-19 8:01 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Wed, Jun 18, 2025 at 11:37:06AM -0400, Steven Rostedt wrote:
> On Wed, 18 Jun 2025 16:20:00 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > The timestamp is passed to the caller on request, and when the stacktrace is
> > > generated upon returning to user space, it calls the requester's callback
> > > with the timestamp as well as the stacktrace.
> >
> > This whole story hinges on there being a high resolution time-stamp
> > available... Good thing we killed x86 !TSC support when we did. You sure
> > there's no other architectures you're interested in that lack a high res
> > time source?
> >
> > What about two CPUs managing to request an unwind at exactly the same
> > time?
>
> It's mapped to a task. As long as each timestamp is unique for a task it
> should be fine. As the trace can record the current->pid along with the
> timestamp to map to the unique user space stack trace.
>
> As for resolution, as long as there can't be two system calls back to back
> within the same time stamp. Otherwise, yeah, we have an issue.
Well, up until very recently, jiffies was the fallback clock on x86. You
can get PID reuse in a jiffy if you push things hard.
Most archs seem to have some higher res clock these days, but I would
not be surprised if among the museum pieces we still support there's
some crap lying in wait.
If you want to rely on consecutive system calls never seeing the same
timestamp, let alone PID reuse in the same timestamp -- for some generic
infrastructure -- you need to go audit all the arch code.
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-06-11 0:54 ` [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
@ 2025-06-19 8:34 ` Peter Zijlstra
2025-06-19 8:37 ` Steven Rostedt
2025-06-19 8:57 ` Peter Zijlstra
1 sibling, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-19 8:34 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Tue, Jun 10, 2025 at 08:54:28PM -0400, Steven Rostedt wrote:
> From: Josh Poimboeuf <jpoimboe@kernel.org>
>
> Make unwind_deferred_request() NMI-safe so tracers in NMI context can
> call it and safely request a user space stacktrace when the task exits.
>
> A "nmi_timestamp" is added to the unwind_task_info that gets updated by
> NMIs to not race with setting the info->timestamp.
I feel this is missing something... or I am.
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> Changes since v9: https://lore.kernel.org/linux-trace-kernel/20250513223552.636076711@goodmis.org/
>
> - Check for ret < 0 instead of just ret != 0 from return code of
> task_work_add(). Don't want to just assume it's less than zero as it
> needs to return a negative on error.
>
> include/linux/unwind_deferred_types.h | 1 +
> kernel/unwind/deferred.c | 91 ++++++++++++++++++++++++---
> 2 files changed, 84 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
> index 5df264cf81ad..ae27a02234b8 100644
> --- a/include/linux/unwind_deferred_types.h
> +++ b/include/linux/unwind_deferred_types.h
> @@ -11,6 +11,7 @@ struct unwind_task_info {
> struct unwind_cache *cache;
> struct callback_head work;
> u64 timestamp;
> + u64 nmi_timestamp;
> int pending;
> };
>
> diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
> index b76c704ddc6d..88c867c32c01 100644
> --- a/kernel/unwind/deferred.c
> +++ b/kernel/unwind/deferred.c
> @@ -25,8 +25,27 @@ static u64 get_timestamp(struct unwind_task_info *info)
> {
> lockdep_assert_irqs_disabled();
>
> - if (!info->timestamp)
> - info->timestamp = local_clock();
> + /*
> + * Note, the timestamp is generated on the first request.
> + * If it exists here, then the timestamp is earlier than
> + * this request and it means that this request will be
> + * valid for the stacktrace.
> + */
> + if (!info->timestamp) {
> + WRITE_ONCE(info->timestamp, local_clock());
> + barrier();
> + /*
> + * If an NMI came in and set a timestamp, it means that
> + * it happened before this timestamp was set (otherwise
> + * the NMI would have used this one). Use the NMI timestamp
> + * instead.
> + */
> + if (unlikely(info->nmi_timestamp)) {
> + WRITE_ONCE(info->timestamp, info->nmi_timestamp);
> + barrier();
> + WRITE_ONCE(info->nmi_timestamp, 0);
> + }
> + }
>
> return info->timestamp;
> }
> @@ -103,6 +122,13 @@ static void unwind_deferred_task_work(struct callback_head *head)
>
> unwind_deferred_trace(&trace);
>
> + /* Check if the timestamp was only set by NMI */
> + if (info->nmi_timestamp) {
> + WRITE_ONCE(info->timestamp, info->nmi_timestamp);
> + barrier();
> + WRITE_ONCE(info->nmi_timestamp, 0);
> + }
> +
> timestamp = info->timestamp;
>
> guard(mutex)(&callback_mutex);
> @@ -111,6 +137,48 @@ static void unwind_deferred_task_work(struct callback_head *head)
> }
> }
>
> +static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
> +{
> + struct unwind_task_info *info = &current->unwind_info;
> + bool inited_timestamp = false;
> + int ret;
> +
> + /* Always use the nmi_timestamp first */
> + *timestamp = info->nmi_timestamp ? : info->timestamp;
> +
> + if (!*timestamp) {
> + /*
> + * This is the first unwind request since the most recent entry
> + * from user space. Initialize the task timestamp.
> + *
> + * Don't write to info->timestamp directly, otherwise it may race
> + * with an interruption of get_timestamp().
> + */
> + info->nmi_timestamp = local_clock();
> + *timestamp = info->nmi_timestamp;
> + inited_timestamp = true;
> + }
> +
> + if (info->pending)
> + return 1;
> +
> + ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
> + if (ret < 0) {
> + /*
> + * If this set nmi_timestamp and is not using it,
> + * there's no guarantee that it will be used.
> + * Set it back to zero.
> + */
> + if (inited_timestamp)
> + info->nmi_timestamp = 0;
> + return ret;
> + }
> +
> + info->pending = 1;
> +
> + return 0;
> +}
So what's the actual problem here, something like this:
if (!info->timestamp)
<NMI>
if (!info->timestamp)
info->timestamp = local_clock(); /* Ta */
</NMI>
info->timestamp = local_clock(); /* Tb */
And now info has Tb which is after Ta, which was recorded for the NMI
request?
Why can't we cmpxchg_local() the thing and avoid this horrible stuff?
static u64 get_timestamp(struct unwind_task_info *info)
{
u64 new, old = info->timestamp;
if (old)
return old;
new = local_clock();
old = cmpxchg_local(&info->timestamp, old, new);
if (old)
return old;
return new;
}
Seems simple enough; what's wrong with it?
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-06-19 8:34 ` Peter Zijlstra
@ 2025-06-19 8:37 ` Steven Rostedt
2025-06-19 8:44 ` Peter Zijlstra
0 siblings, 1 reply; 79+ messages in thread
From: Steven Rostedt @ 2025-06-19 8:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Thu, 19 Jun 2025 10:34:15 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> Why can't we cmpxchg_local() the thing and avoid this horrible stuff?
>
> static u64 get_timestamp(struct unwind_task_info *info)
> {
> u64 new, old = info->timestamp;
>
> if (old)
> return old;
>
> new = local_clock();
> old = cmpxchg_local(&info->timestamp, old, new);
> if (old)
> return old;
> return new;
> }
>
> Seems simple enough; what's wrong with it?
It's a 64 bit number where most 32 bit architectures don't have any
decent cmpxchg on 64 bit values. That's given me hell in the ring
buffer code :-p
-- Steve
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 03/14] unwind_user: Add compat mode frame pointer support
2025-06-19 7:51 ` Peter Zijlstra
@ 2025-06-19 8:39 ` Steven Rostedt
0 siblings, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-19 8:39 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Thu, 19 Jun 2025 09:51:03 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > The purpose of these arch hooks is so far mysterious. No comments, no
> > > changelog, no nothing.
> >
> > I'll add comments.
>
> How about you introduce the hooks when they're actually needed instead?
OK. Then I'll move the hooks to the later patch.
-- Steve
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-06-19 8:37 ` Steven Rostedt
@ 2025-06-19 8:44 ` Peter Zijlstra
2025-06-19 8:48 ` Peter Zijlstra
0 siblings, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-19 8:44 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Thu, Jun 19, 2025 at 04:37:33AM -0400, Steven Rostedt wrote:
> On Thu, 19 Jun 2025 10:34:15 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > Why can't we cmpxchg_local() the thing and avoid this horrible stuff?
> >
> > static u64 get_timestamp(struct unwind_task_info *info)
> > {
> > u64 new, old = info->timestamp;
> >
> > if (old)
> > return old;
> >
> > new = local_clock();
> > old = cmpxchg_local(&info->timestamp, old, new);
> > if (old)
> > return old;
> > return new;
> > }
> >
> > Seems simple enough; what's wrong with it?
>
> It's a 64 bit number where most 32 bit architectures don't have any
> decent cmpxchg on 64 bit values. That's given me hell in the ring
> buffer code :-p
Do we really have to support 32bit?
But IIRC a previous version of all this had a syscall counter. If you
make this a per task syscall counter, unsigned long is plenty.
I suppose that was dropped because adding that counter increment to all
syscalls blows. But if you really want to support 32bit, that might be a
fallback.
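
A minimal sketch of that fallback, assuming a hypothetical per-task
syscall_seq counter bumped on syscall entry and a cookie field in
unwind_task_info:

static unsigned long get_cookie(struct unwind_task_info *info)
{
        /* First requester since entering the kernel claims the cookie */
        if (!info->cookie)
                info->cookie = current->syscall_seq;
        return info->cookie;
}

Even if an NMI nests in the middle, both writes store the same value
for this kernel entry, so no cmpxchg is needed.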
Luckily, x86 dropped support for !CMPXCHG8B right along with !TSC. So on
x86 we good with timestamps, even on 32bit.
* Re: [PATCH v10 04/14] unwind_user/deferred: Add unwind_deferred_trace()
2025-06-19 7:54 ` Peter Zijlstra
@ 2025-06-19 8:44 ` Steven Rostedt
0 siblings, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-19 8:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Thu, 19 Jun 2025 09:54:17 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Jun 18, 2025 at 11:29:39AM -0400, Steven Rostedt wrote:
>
> > Note, a request from the gcc folks is to add a system call that gives the
> > user space application a backtrace from its current location. This can be
> > handy for debugging as it would be similar to how we use dump_stack().
>
> That makes very little sense to me; apps can typically unwind themselves
> just fine, no? In fact, they can use DWARFs and all that.
Not really. It can be done in gdb, but doing it from a running app means that
the app needs a full DWARF parser, and access to the ELF file it's running
from.
>
> Also, how about we don't make thing complicated and not confuse comments
> with things like this? Focus on the deferred stuff (that's what these
> patches are about) -- and then return-to-user is the one and only place
> that makes sense.
The change log had:
Add a function that must be called inside a faultable context that will
retrieve a user space stack trace. The function unwind_deferred_trace()
can be called by a tracer when a task is about to enter user space, or has
just come back from user space and has interrupts enabled.
It doesn't mention the backtrace thing. It only makes a statement that
it needs to be done in a faultable context. I renamed the function to:
unwind_user_faultable()
And updated the change log to:
unwind_user/deferred: Add unwind_user_faultable()
Add a new API to retrieve a user space callstack called
unwind_user_faultable(). The difference between this user space stack
tracer and the current one is that this must be called from a faultable
context, as it may use routines to access user space data that needs to
be faulted in.
It can be safely called from entering or exiting a system call as the code
The explanation is that it must be called in a faultable context. It
doesn't add any more policy than that (like it having to be deferred).
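
A rough sketch of a caller, with the unwind_user_faultable() signature
and the unwind_stacktrace field names assumed, and record_trace() being
a hypothetical consumer:

static void trace_user_stack(void)
{
        struct unwind_stacktrace trace;

        /* Faultable context: the unwinder may page in user data */
        if (unwind_user_faultable(&trace) == 0)
                record_trace(trace.entries, trace.nr);
}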
-- Steve
* Re: [PATCH v10 05/14] unwind_user/deferred: Add unwind cache
2025-06-19 7:56 ` Peter Zijlstra
@ 2025-06-19 8:47 ` Steven Rostedt
2025-06-19 9:04 ` Peter Zijlstra
0 siblings, 1 reply; 79+ messages in thread
From: Steven Rostedt @ 2025-06-19 8:47 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Thu, 19 Jun 2025 09:56:11 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> Well, the trivial solution is to make it 511 and call it a day. Don't
> make things complicated if you don't have to.
I don't know if this is more complicated, but it should make it fit
nicely in a page:
/* Make the cache fit in a page */
#define UNWIND_MAX_ENTRIES \
        ((PAGE_SIZE - sizeof(struct unwind_cache)) / sizeof(long))
-- Steve
* Re: [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-06-19 8:44 ` Peter Zijlstra
@ 2025-06-19 8:48 ` Peter Zijlstra
2025-06-19 9:10 ` Steven Rostedt
0 siblings, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-19 8:48 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Thu, Jun 19, 2025 at 10:44:27AM +0200, Peter Zijlstra wrote:
> Luckily, x86 dropped support for !CMPXCHG8B right along with !TSC. So on
> x86 we good with timestamps, even on 32bit.
Well, not entirely true, local_clock() is not guaranteed monotonic. So
you might be in for quite a bit of hurt if you rely on that.
* Re: [PATCH v10 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-06-19 8:01 ` Peter Zijlstra
@ 2025-06-19 8:49 ` Steven Rostedt
0 siblings, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-19 8:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Thu, 19 Jun 2025 10:01:08 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> If you want to rely on consecutive system calls never seeing the same
> timestamp, let alone PID reuse in the same timestamp -- for some generic
> infrastructure -- you need to go audit all the arch code.
I can drop the timestamp and go back to the original "cookie"
generation that guaranteed unique ids for something like 2^48 system
calls per task.
I thought the timestamps just made it easier, but having to audit every
arch may make that not the case.
-- Steve
* Re: [PATCH v10 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-06-19 7:50 ` Peter Zijlstra
@ 2025-06-19 8:56 ` Steven Rostedt
2025-06-19 9:11 ` Peter Zijlstra
0 siblings, 1 reply; 79+ messages in thread
From: Steven Rostedt @ 2025-06-19 8:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Thu, 19 Jun 2025 09:50:08 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Jun 18, 2025 at 03:09:15PM -0400, Steven Rostedt wrote:
> > On Wed, 18 Jun 2025 20:46:20 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > > +struct unwind_work;
> > > > +
> > > > +typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stacktrace *trace, u64 timestamp);
> > > > +
> > > > +struct unwind_work {
> > > > + struct list_head list;
> > >
> > > Does this really need to be a list? Single linked list like
> > > callback_head not good enough?
> >
> > Doesn't a list head make it easier to remove without having to iterate the
> > list?
>
> Yeah, but why would you ever want to remove it? You asked for an unwind,
> you get an unwind, no?
No, it's not unique per tracing infrastructure, but per tracing instance.
That is, per perf program, or per ftrace instance. It needs to be
removed.
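
That is, each instance registers and unregisters its own unwind_work;
roughly, with hypothetical instance names and using the
unwind_deferred_init()/unwind_deferred_cancel() API from these patches:

static void instance_callback(struct unwind_work *work,
                              struct unwind_stacktrace *trace, u64 timestamp)
{
        /* consume the deferred user stack trace for this instance */
}

static struct unwind_work instance_work;

static int instance_enable(void)
{
        return unwind_deferred_init(&instance_work, instance_callback);
}

static void instance_disable(void)
{
        /* Tearing down the instance must remove its callback */
        unwind_deferred_cancel(&instance_work);
}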
>
> > > > static __always_inline void unwind_exit_to_user_mode(void)
> > > > {
> > > > if (unlikely(current->unwind_info.cache))
> > > > current->unwind_info.cache->nr_entries = 0;
> > > > + current->unwind_info.timestamp = 0;
> > >
> > > Surely clearing that timestamp is only relevant when there is a cache
> > > around? Better to not add this unconditional write to the exit path.
> >
> > That's actually not quite true. If the allocation fails, we still want to
> > clear the timestamp. But later patches add more data to check and it does
> > exit out if there's been no requests:
>
> Well, you could put in an error value on alloc fail or somesuch. Then
> its non-zero.
OK.
>
> > But for better reviewing, I could add a comment in this patch that states
> > that this will eventually exit out early when it does more work.
>
> You're making this really hard to review, why not do it right from the
> get-go?
Because the value that is to be checked isn't here yet.
>
> > > > +/* Guards adding to and reading the list of callbacks */
> > > > +static DEFINE_MUTEX(callback_mutex);
> > > > +static LIST_HEAD(callbacks);
> > >
> > > Global state.. smells like failure.
> >
> > Yes, the unwind infrastructure is global, as it is the way tasks know what
> > tracer's callbacks to call.
>
> Well, that's apparently how you've set it up. I don't immediately see
> this has to be like this.
>
> And there's no comments no nothing.
>
> I don't see why you can't have something like:
>
> struct unwind_work {
>         struct callback_head task_work;
>         void *data;
>         void (*func)(struct unwind_work *work, void *data);
> };
>
> void unwind_task_work_func(struct callback_head *task_work)
> {
>         struct unwind_work *uw = container_of(task_work, struct unwind_work, task_work);
>
>         // do actual unwind
>
>         uw->func(uw, uw->data);
> }
>
> or something along those lines. No global state involved.
We have a many to many relationship here where a task_work doesn't work.
That is, you can have a tracer that expects callbacks from several
tasks at the same time, while some of those tasks need to send
callbacks to several different tracers.
Later patches add a bitmask to every task that gets set to know which
tracer to call back.
Since the number of tracers that can be called back is fixed to the
number of bits in a long (for the bitmask), I can get rid of the linked
list and make it into an array. That would make this easier.
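
A sketch of that array layout, with names assumed for illustration:

#define UNWIND_MAX_WORKS        BITS_PER_LONG

static struct unwind_work *unwind_works[UNWIND_MAX_WORKS];

static void unwind_do_callbacks(struct unwind_stacktrace *trace, u64 timestamp)
{
        unsigned long bits = current->unwind_info.unwind_mask;
        int bit;

        /* Only call back the tracers whose bit is set for this task */
        for_each_set_bit(bit, &bits, UNWIND_MAX_WORKS)
                unwind_works[bit]->func(unwind_works[bit], trace, timestamp);
}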
>
>
> > > > + guard(mutex)(&callback_mutex);
> > > > + list_for_each_entry(work, &callbacks, list) {
> > > > + work->func(work, &trace, timestamp);
> > > > + }
> > >
> > > So now you're globally serializing all return-to-user instances. How is
> > > that not a problem?
> >
> > It was the original way we did things. The next patch changes this to SRCU.
> > But it requires a bit more care. For breaking up the series, I preferred
> > not to add that logic and make it a separate patch.
> >
> > For better reviewing, I'll add a comment here that says:
> >
> > /* TODO switch this global lock to SRCU */
>
> Oh ffs :-(
>
> So splitting up patches is for ease of review, but now you're making
> splits that make review harder, how does that make sense?
Actually, a comment isn't the right place, I should have mentioned this
in the change log.
-- Steve
* Re: [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-06-11 0:54 ` [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
2025-06-19 8:34 ` Peter Zijlstra
@ 2025-06-19 8:57 ` Peter Zijlstra
2025-06-19 9:07 ` Steven Rostedt
1 sibling, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-19 8:57 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Tue, Jun 10, 2025 at 08:54:28PM -0400, Steven Rostedt wrote:
> +static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
> +{
> +        struct unwind_task_info *info = &current->unwind_info;
> +        bool inited_timestamp = false;
> +        int ret;
> +
> +        /* Always use the nmi_timestamp first */
> +        *timestamp = info->nmi_timestamp ? : info->timestamp;
> +
> +        if (!*timestamp) {
> +                /*
> +                 * This is the first unwind request since the most recent entry
> +                 * from user space. Initialize the task timestamp.
> +                 *
> +                 * Don't write to info->timestamp directly, otherwise it may race
> +                 * with an interruption of get_timestamp().
> +                 */
> +                info->nmi_timestamp = local_clock();
> +                *timestamp = info->nmi_timestamp;
> +                inited_timestamp = true;
> +        }
> +
> +        if (info->pending)
> +                return 1;
> +
> +        ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
> +        if (ret < 0) {
> +                /*
> +                 * If this set nmi_timestamp and is not using it,
> +                 * there's no guarantee that it will be used.
> +                 * Set it back to zero.
> +                 */
> +                if (inited_timestamp)
> +                        info->nmi_timestamp = 0;
> +                return ret;
> +        }
> +
> +        info->pending = 1;
> +
> +        return 0;
> +}
> +
>  /**
>   * unwind_deferred_request - Request a user stacktrace on task exit
>   * @work: Unwind descriptor requesting the trace
> @@ -139,31 +207,38 @@ static void unwind_deferred_task_work(struct callback_head *head)
>  int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
>  {
>          struct unwind_task_info *info = &current->unwind_info;
> +        int pending;
>          int ret;
>
>          *timestamp = 0;
>
>          if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
>              !user_mode(task_pt_regs(current)))
>                  return -EINVAL;
>
> +        if (in_nmi())
> +                return unwind_deferred_request_nmi(work, timestamp);
So nested NMI is a thing -- AFAICT this is broken in the face of nested
NMI.
Specifically, we mark all exceptions that can happen with IRQs disabled
as NMI like (so that they don't go about taking locks etc.).
So imagine you're in #DB, you're asking for an unwind, you do the above
dance and get hit with NMI.
Then you get the NMI setting nmi_timestamp, and #DB overwriting it with
a later value, and you're back up the creek without no paddles.
Mix that with local_clock() only being monotonic on a single CPU. And
you ask for an unwind on CPU0, get migrated to CPU1 which, for the sake
of argument, will be behind, and see a timestamp 'far' in the future.
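
A sketch of where this argument leads: publish the timestamp with a
full-width cmpxchg so that only the outermost context (task, IRQ, #DB
or NMI) can win, and every nesting level reads the same value back:

static u64 get_timestamp(struct unwind_task_info *info)
{
        u64 new, old = READ_ONCE(info->timestamp);

        if (old)
                return old;

        new = local_clock();
        /* NMI-safe only on archs with a true 64-bit cmpxchg */
        old = cmpxchg(&info->timestamp, 0, new);
        return old ?: new;
}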
* Re: [PATCH v10 05/14] unwind_user/deferred: Add unwind cache
2025-06-19 8:47 ` Steven Rostedt
@ 2025-06-19 9:04 ` Peter Zijlstra
2025-06-19 9:12 ` Steven Rostedt
0 siblings, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-19 9:04 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Thu, Jun 19, 2025 at 04:47:14AM -0400, Steven Rostedt wrote:
> On Thu, 19 Jun 2025 09:56:11 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > Well, the trivial solution is to make it 511 and call it a day. Don't
> > make things complicated if you don't have to.
>
> I don't know if this is more complicated, but it should make it fit
> nicely in a page:
>
> /* Make the cache fit in a page */
> #define UNWIND_MAX_ENTRIES \
> ((PAGE_SIZE - sizeof(struct unwind_cache)) / sizeof(long))
Right, that's the fancy way of spelling 511 :-) Except on 32bit, where
it now spells 1023 instead.
Did you want that bitness difference?
Also, you ready for some *reaaally* big numbers on Power/ARM with 64K
pages? :-)
* Re: [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-06-19 8:57 ` Peter Zijlstra
@ 2025-06-19 9:07 ` Steven Rostedt
2025-06-19 9:32 ` Peter Zijlstra
0 siblings, 1 reply; 79+ messages in thread
From: Steven Rostedt @ 2025-06-19 9:07 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On June 19, 2025 4:57:17 AM EDT, Peter Zijlstra <peterz@infradead.org> wrote:
>On Tue, Jun 10, 2025 at 08:54:28PM -0400, Steven Rostedt wrote:
>
>>
>> + info->nmi_timestamp = local_clock();
>> + *timestamp = info->nmi_timestamp;
>> + inited_timestamp = true;
>> + }
>> +
>> + if (info->pending)
>> + return 1;
>> +
>> + ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
>> + if (ret < 0) {
>> + /*
>> + * If this set nmi_timestamp and is not using it,
>> + * there's no guarantee that it will be used.
>> + * Set it back to zero.
>> + */
>> + if (inited_timestamp)
>> + info->nmi_timestamp = 0;
>> + return ret;
>> + }
>> +
>> + info->pending = 1;
>> +
>> + return 0;
>> +}
>> +
>> /**
>> * unwind_deferred_request - Request a user stacktrace on task exit
>> * @work: Unwind descriptor requesting the trace
>> @@ -139,31 +207,38 @@ static void unwind_deferred_task_work(struct callback_head *head)
>> int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
>> {
>>         struct unwind_task_info *info = &current->unwind_info;
>> + int pending;
>> int ret;
>>
>> *timestamp = 0;
>>
>> if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
>> !user_mode(task_pt_regs(current)))
>> return -EINVAL;
>>
>> + if (in_nmi())
>> + return unwind_deferred_request_nmi(work, timestamp);
>
>So nested NMI is a thing -- AFAICT this is broken in the face of nested
>NMI.
>
>Specifically, we mark all exceptions that can happen with IRQs disabled
>as NMI like (so that they don't go about taking locks etc.).
>
>So imagine you're in #DB, you're asking for an unwind, you do the above
>dance and get hit with NMI.
Does #DB make in_nmi() true? If that's the case then we do need to handle that.
-- Steve
>
>Then you get the NMI setting nmi_timestamp, and #DB overwriting it with
>a later value, and you're back up the creek without no paddles.
>
>
>Mix that with local_clock() only being monotonic on a single CPU. And
>you ask for an unwind on CPU0, get migrated to CPU1 which, for the sake
>of argument, will be behind, and see a timestamp 'far' in the future.
>
* Re: [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-06-19 8:48 ` Peter Zijlstra
@ 2025-06-19 9:10 ` Steven Rostedt
2025-06-19 9:24 ` Peter Zijlstra
0 siblings, 1 reply; 79+ messages in thread
From: Steven Rostedt @ 2025-06-19 9:10 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On June 19, 2025 4:48:13 AM EDT, Peter Zijlstra <peterz@infradead.org> wrote:
>On Thu, Jun 19, 2025 at 10:44:27AM +0200, Peter Zijlstra wrote:
>
>> Luckily, x86 dropped support for !CMPXCHG8B right along with !TSC. So on
>> x86 we good with timestamps, even on 32bit.
>
>Well, not entirely true, local_clock() is not guaranteed monotonic. So
>you might be in for quite a bit of hurt if you rely on that.
>
As long as it is monotonic per task. If it is not, then pretty much all tracers that use it are broken.
-- Steve
* Re: [PATCH v10 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-06-19 8:56 ` Steven Rostedt
@ 2025-06-19 9:11 ` Peter Zijlstra
2025-06-24 14:03 ` Steven Rostedt
0 siblings, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-19 9:11 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Thu, Jun 19, 2025 at 04:56:59AM -0400, Steven Rostedt wrote:
> We have a many to many relationship here where a task_work doesn't work.
>
> That is, you can have a tracer that expects callbacks from several
> tasks at the same time, as well as some of those tasks expect to send a
> callback to different tracers.
>
> Later patches add a bitmask to every task that gets set to know which
> trace to use.
>
> Since the number of tracers that can be called back is fixed to the
> number of bits in long (for the bitmask), I can get rid of the link
> list and make it into an array. That would make this easier.
So something sketching this design decision might be useful. Perhaps a
comment in the file itself?
I feel much of this complication stems from the fact you're wanting to
make this perhaps too generic.
* Re: [PATCH v10 05/14] unwind_user/deferred: Add unwind cache
2025-06-19 9:04 ` Peter Zijlstra
@ 2025-06-19 9:12 ` Steven Rostedt
0 siblings, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-19 9:12 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On June 19, 2025 5:04:36 AM EDT, Peter Zijlstra <peterz@infradead.org> wrote:
>On Thu, Jun 19, 2025 at 04:47:14AM -0400, Steven Rostedt wrote:
>> On Thu, 19 Jun 2025 09:56:11 +0200
>> Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> > Well, the trivial solution is to make it 511 and call it a day. Don't
>> > make things complicated if you don't have to.
>>
>> I don't know if this is more complicated, but it should make it fit
>> nicely in a page:
>>
>> /* Make the cache fit in a page */
>> #define UNWIND_MAX_ENTRIES \
>> ((PAGE_SIZE - sizeof(struct unwind_cache)) / sizeof(long))
>
>Right, that's the fancy way of spelling 511 :-) Except on 32bit, where
>it now spells 1023 instead.
>
>Did you want that bitness difference?
>
>Also, you ready for some *reaaally* big numbers on Power/ARM with 64K
>pages? :-)
Bah, yeah, I need to use a size and not just PAGE_SIZE.
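
Something like the following would do it, assuming a fixed 4K budget
via SZ_4K from <linux/sizes.h> rather than PAGE_SIZE:

/* Cap the cache at 4K worth of entries, independent of PAGE_SIZE */
#define UNWIND_MAX_ENTRIES \
        ((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))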
Thanks,
-- Steve
* Re: [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-06-19 9:10 ` Steven Rostedt
@ 2025-06-19 9:24 ` Peter Zijlstra
0 siblings, 0 replies; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-19 9:24 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Thu, Jun 19, 2025 at 05:10:20AM -0400, Steven Rostedt wrote:
>
>
> On June 19, 2025 4:48:13 AM EDT, Peter Zijlstra <peterz@infradead.org> wrote:
> >On Thu, Jun 19, 2025 at 10:44:27AM +0200, Peter Zijlstra wrote:
> >
> >> Luckily, x86 dropped support for !CMPXCHG8B right along with !TSC. So on
> >> x86 we good with timestamps, even on 32bit.
> >
> >Well, not entirely true, local_clock() is not guaranteed monotonic. So
> >you might be in for quite a bit of hurt if you rely on that.
> >
>
> As long as it is monotonic per task. If it is not, then pretty much all tracers that use it are broken.
It is monotonic per CPU. It says so in the comment.
The inter-CPU drift is bounded to a tick or something.
The trade-off is that it can be the same value for the majority of
that tick.
The way that thing is set up is that we use GTOD (HPET if your TSC is
buggered) snapshots at ticks, set up a window to the next tick, and fill
out with TSC deltas and a (local) monotonicity filter.
So if TSC is really wild, it can hit that window boundary real quick,
get stuck there until the next tick.
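
Very roughly, and with hypothetical names (the real code lives in
kernel/sched/clock.c):

static u64 window_clock(struct clock_data *cd)
{
        /* GTOD snapshot taken at the last tick, plus a TSC delta */
        u64 now = cd->tick_gtod + tsc_delta_ns(cd);

        /* Clamp into the window that ends at the next tick... */
        now = min_t(u64, now, cd->tick_gtod + TICK_NSEC);
        /* ...and filter so the per-CPU clock never goes backwards */
        cd->clock = max_t(u64, cd->clock, now);
        return cd->clock;
}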
Some of the early chips had the TSC affected by DVFS: change the CPU
speed and the TSC speed changes along with it. We sorta try and
compensate for that.
Anyway, welcome to the wonderful world of trying to tell time on x86 :-(
Today, most x86_64 chips made in the last few years will have relatively
sane TSC, but still we get the rare random case it gets marked unstable
(they're becoming few and far between though).
x86 is one of the worst architectures in this regard -- but IIRC there
were a few others out there. Also, virt... let's not talk about virt.
* Re: [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-06-19 9:07 ` Steven Rostedt
@ 2025-06-19 9:32 ` Peter Zijlstra
2025-06-19 9:34 ` Peter Zijlstra
2025-06-19 9:42 ` Steven Rostedt
0 siblings, 2 replies; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-19 9:32 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Thu, Jun 19, 2025 at 05:07:10AM -0400, Steven Rostedt wrote:
> Does #DB make in_nmi() true? If that's the case then we do need to handle that.
Yes: #DF, #MC, #BP (int3), #DB and NMI all have in_nmi() true.
Ignoring #DF because that's mostly game over, you can get them all
nested for up to 4 (you're well aware of the normal NMI recursion
crap).
Then there is the SEV #VC stuff, which is also NMI like. So if you're a
CoCo-nut, you can perhaps get it up to 5.
* Re: [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-06-19 9:32 ` Peter Zijlstra
@ 2025-06-19 9:34 ` Peter Zijlstra
2025-06-19 9:42 ` Steven Rostedt
1 sibling, 0 replies; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-19 9:34 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Thu, Jun 19, 2025 at 11:32:26AM +0200, Peter Zijlstra wrote:
> On Thu, Jun 19, 2025 at 05:07:10AM -0400, Steven Rostedt wrote:
>
> > Does #DB make in_nmi() true? If that's the case then we do need to handle that.
>
> Yes: #DF, #MC, #BP (int3), #DB and NMI all have in_nmi() true.
Note: these are all the from-kernel parts of those exceptions. The
from-user side is significantly different.
> Ignoring #DF because that's mostly game over, you can get them all
> nested for up to 4 (you're well aware of the normal NMI recursion
> crap).
>
> Then there is the SEV #VC stuff, which is also NMI like. So if you're a
> CoCo-nut, you can perhaps get it up to 5.
* Re: [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-06-19 9:32 ` Peter Zijlstra
2025-06-19 9:34 ` Peter Zijlstra
@ 2025-06-19 9:42 ` Steven Rostedt
2025-06-19 9:45 ` Peter Zijlstra
1 sibling, 1 reply; 79+ messages in thread
From: Steven Rostedt @ 2025-06-19 9:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On June 19, 2025 5:32:26 AM EDT, Peter Zijlstra <peterz@infradead.org> wrote:
>On Thu, Jun 19, 2025 at 05:07:10AM -0400, Steven Rostedt wrote:
>
>> Does #DB make in_nmi() true? If that's the case then we do need to handle that.
>
>Yes: #DF, #MC, #BP (int3), #DB and NMI all have in_nmi() true.
>
>Ignoring #DF because that's mostly game over, you can get them all
>nested for up to 4 (you're well aware of the normal NMI recursion
>crap).
We probably can implement this with stacked counters.
>Then there is the SEV #VC stuff, which is also NMI like. So if you're a
>CoCo-nut, you can perhaps get it up to 5.
The rest of the tracing infrastructure goes 4 deep and hasn't had issues, so 4 is probably sufficient.
-- Steve
* Re: [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-06-19 9:42 ` Steven Rostedt
@ 2025-06-19 9:45 ` Peter Zijlstra
2025-06-19 10:19 ` Steven Rostedt
0 siblings, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-19 9:45 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Thu, Jun 19, 2025 at 05:42:31AM -0400, Steven Rostedt wrote:
>
>
> On June 19, 2025 5:32:26 AM EDT, Peter Zijlstra <peterz@infradead.org> wrote:
> >On Thu, Jun 19, 2025 at 05:07:10AM -0400, Steven Rostedt wrote:
> >
> >> Does #DB make in_nmi() true? If that's the case then we do need to handle that.
> >
> >Yes: #DF, #MC, #BP (int3), #DB and NMI all have in_nmi() true.
> >
> >Ignoring #DF because that's mostly game over, you can get them all
> >nested for up to 4 (you're well aware of the normal NMI recursion
> >crap).
>
> We probably can implement this with stacked counters.
I would seriously consider dropping support for anything that can't do
cmpxchg at the width you need.
* Re: [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-06-19 9:45 ` Peter Zijlstra
@ 2025-06-19 10:19 ` Steven Rostedt
2025-06-19 10:39 ` Peter Zijlstra
0 siblings, 1 reply; 79+ messages in thread
From: Steven Rostedt @ 2025-06-19 10:19 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On June 19, 2025 5:45:05 AM EDT, Peter Zijlstra <peterz@infradead.org> wrote:
>On Thu, Jun 19, 2025 at 05:42:31AM -0400, Steven Rostedt wrote:
>>
>>
>> On June 19, 2025 5:32:26 AM EDT, Peter Zijlstra <peterz@infradead.org> wrote:
>> >On Thu, Jun 19, 2025 at 05:07:10AM -0400, Steven Rostedt wrote:
>> >
>> >> Does #DB make in_nmi() true? If that's the case then we do need to handle that.
>> >
>> >Yes: #DF, #MC, #BP (int3), #DB and NMI all have in_nmi() true.
>> >
>> >Ignoring #DF because that's mostly game over, you can get them all
>> >nested for up to 4 (you're well aware of the normal NMI recursion
>> >crap).
>>
>> We probably can implement this with stacked counters.
>
>I would seriously consider dropping support for anything that can't do
>cmpxchg at the width you need.
That may be something we can do as it's a new feature and unlike the ftrace ring buffer, it won't be a regression not to support them.
We currently care about x86-64, arm64, ppc 64 and s390. I'm assuming they all have a proper 64 bit cmpxchg.
-- Steve
* Re: [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-06-19 10:19 ` Steven Rostedt
@ 2025-06-19 10:39 ` Peter Zijlstra
2025-06-19 13:04 ` Steven Rostedt
0 siblings, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-19 10:39 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Thu, Jun 19, 2025 at 06:19:28AM -0400, Steven Rostedt wrote:
> We currently care about x86-64, arm64, ppc 64 and s390. I'm assuming
> they all have a proper 64 bit cmpxchg.
They do. The only 64bit architecture that does not is HPPA IIRC.
* Re: [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-06-19 10:39 ` Peter Zijlstra
@ 2025-06-19 13:04 ` Steven Rostedt
0 siblings, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-19 13:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Thu, 19 Jun 2025 12:39:51 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Jun 19, 2025 at 06:19:28AM -0400, Steven Rostedt wrote:
>
> > We currently care about x86-64, arm64, ppc 64 and s390. I'm assuming
> > they all have a proper 64 bit cmpxchg.
>
> They do. The only 64bit architecture that does not is HPPA IIRC.
It appears that I refactored the ring buffer code to get rid of all 64
bit cmpxchg(), but I do have this in the code:
        if ((!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) ||
             IS_ENABLED(CONFIG_GENERIC_ATOMIC64)) &&
            (unlikely(in_nmi()))) {
                return NULL;
        }
We could do something similar, in the function that asks for a deferred
stack trace:
        /* NMI requires having safe 64 bit cmpxchg operations */
        if ((!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) ||
             !IS_ENABLED(CONFIG_64BIT)) && in_nmi())
                return -EINVAL;
The only thing not supported would be requesting a deferred stack trace
from NMI context when a 64 bit cmpxchg is not available. There's no
reason to not support the rest of the functionality.
I'll have to wrap the cmpxchg too, so that it's skipped and the update
is just done directly; for these archs, NMI is not an issue and
interrupts will be disabled, so no cmpxchg is needed.
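
A sketch of such a wrapper, assuming the NMI case was already filtered
out for these configs as above:

#if defined(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) && defined(CONFIG_64BIT)
# define unwind_cmpxchg64(p, old, new)  cmpxchg(p, old, new)
#else
/* Never reached from NMI; interrupts are disabled, so a plain
 * read-modify-write is sufficient here */
static inline u64 unwind_cmpxchg64(u64 *p, u64 old, u64 new)
{
        u64 cur = *p;

        if (cur == old)
                *p = new;
        return cur;
}
#endif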
-- Steve
* Re: [PATCH v10 08/14] unwind deferred: Use bitmask to determine which callbacks to call
2025-06-11 0:54 ` [PATCH v10 08/14] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
@ 2025-06-20 8:15 ` Peter Zijlstra
2025-06-24 14:55 ` Steven Rostedt
0 siblings, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-20 8:15 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Tue, Jun 10, 2025 at 08:54:29PM -0400, Steven Rostedt wrote:
>  void unwind_deferred_cancel(struct unwind_work *work)
>  {
> +        struct task_struct *g, *t;
> +
>          if (!work)
>                  return;
>
>          guard(mutex)(&callback_mutex);
>          list_del(&work->list);
> +
> +        clear_bit(work->bit, &unwind_mask);
atomic bitop
> +
> +        guard(rcu)();
> +        /* Clear this bit from all threads */
> +        for_each_process_thread(g, t) {
> +                clear_bit(work->bit, &t->unwind_info.unwind_mask);
> +        }
>  }
>
>  int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
> @@ -256,6 +278,14 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
>          memset(work, 0, sizeof(*work));
>
>          guard(mutex)(&callback_mutex);
> +
> +        /* See if there's a bit in the mask available */
> +        if (unwind_mask == ~0UL)
> +                return -EBUSY;
> +
> +        work->bit = ffz(unwind_mask);
> +        unwind_mask |= BIT(work->bit);
regular or
> +
>          list_add(&work->list, &callbacks);
>          work->func = func;
>          return 0;
> @@ -267,6 +297,7 @@ void unwind_task_init(struct task_struct *task)
>
>          memset(info, 0, sizeof(*info));
>          init_task_work(&info->work, unwind_deferred_task_work);
> +        info->unwind_mask = 0;
>  }
Which is somewhat inconsistent;
__clear_bit()/__set_bit()
or:
unwind_mask &= ~BIT() / unwind_mask |= BIT()
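
For instance, spelled consistently with the double-underscore variants,
since both sides run under callback_mutex (sketch):

        /* in unwind_deferred_init(), under callback_mutex */
        work->bit = ffz(unwind_mask);
        __set_bit(work->bit, &unwind_mask);

        /* in unwind_deferred_cancel(), under callback_mutex */
        __clear_bit(work->bit, &unwind_mask);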
* Re: [PATCH v10 02/14] unwind_user: Add frame pointer support
2025-06-18 15:09 ` Steven Rostedt
@ 2025-06-23 16:31 ` Indu Bhagat
2025-06-24 20:30 ` Steven Rostedt
0 siblings, 1 reply; 79+ messages in thread
From: Indu Bhagat @ 2025-06-23 16:31 UTC (permalink / raw)
To: Steven Rostedt, Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton
On 6/18/25 8:09 AM, Steven Rostedt wrote:
> On Wed, 18 Jun 2025 15:52:01 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
>>> +the_end:
>>> + state->done = true;
>>> return -EINVAL;
>>> }
>>
>> I'm thinking 'the_end' might be better named 'done' ?
>
> I thought it was cute ;-) (BTW, I didn't name it).
>
> But sure, I can update it to be something more common.
>
>>
>> Also, CFA here is Call-Frame-Address and RA Return-Address ?
>
> I believe so. Do you want me to add a comment?
>
If a comment is added, Canonical Frame Address would be more appropriate.
* Re: [PATCH v10 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-06-19 9:11 ` Peter Zijlstra
@ 2025-06-24 14:03 ` Steven Rostedt
0 siblings, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-24 14:03 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Thu, 19 Jun 2025 11:11:21 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> I feel much of this complication stems from the fact you're wanting to
> make this perhaps too generic.
I want to make it work for perf, ftrace, LTTng and BPF, where each has
their own requirements, which tends to force making it generic. :-/
-- Steve
* Re: [PATCH v10 08/14] unwind deferred: Use bitmask to determine which callbacks to call
2025-06-20 8:15 ` Peter Zijlstra
@ 2025-06-24 14:55 ` Steven Rostedt
2025-06-24 15:00 ` Peter Zijlstra
0 siblings, 1 reply; 79+ messages in thread
From: Steven Rostedt @ 2025-06-24 14:55 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Fri, 20 Jun 2025 10:15:42 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jun 10, 2025 at 08:54:29PM -0400, Steven Rostedt wrote:
>
>
> > void unwind_deferred_cancel(struct unwind_work *work)
> > {
> > + struct task_struct *g, *t;
> > +
> > if (!work)
> > return;
> >
> > guard(mutex)(&callback_mutex);
> > list_del(&work->list);
> > +
> > + clear_bit(work->bit, &unwind_mask);
>
> atomic bitop
Yeah, it just seemed cleaner than: unwind_mask &= ~BIT(work->bit);
It's not needed, as the update of unwind_mask is done while holding
callback_mutex.
>
> > +
> > + guard(rcu)();
> > + /* Clear this bit from all threads */
> > + for_each_process_thread(g, t) {
> > + clear_bit(work->bit, &t->unwind_info.unwind_mask);
> > + }
> > }
> >
> > int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
> > @@ -256,6 +278,14 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
> > memset(work, 0, sizeof(*work));
> >
> > guard(mutex)(&callback_mutex);
> > +
> > + /* See if there's a bit in the mask available */
> > + if (unwind_mask == ~0UL)
> > + return -EBUSY;
> > +
> > + work->bit = ffz(unwind_mask);
> > + unwind_mask |= BIT(work->bit);
>
> regular or
>
> > +
> > list_add(&work->list, &callbacks);
> > work->func = func;
> > return 0;
> > @@ -267,6 +297,7 @@ void unwind_task_init(struct task_struct *task)
> >
> > memset(info, 0, sizeof(*info));
> > init_task_work(&info->work, unwind_deferred_task_work);
> > + info->unwind_mask = 0;
> > }
>
> Which is somewhat inconsistent;
>
> __clear_bit()/__set_bit()
Hmm, are the above non-atomic?
>
> or:
>
> unwind_mask &= ~BIT() / unwind_mask |= BIT()
although, because the update is always guarded, this may be the better
approach, as it shows there's no atomic needed.
-- Steve
* Re: [PATCH v10 08/14] unwind deferred: Use bitmask to determine which callbacks to call
2025-06-24 14:55 ` Steven Rostedt
@ 2025-06-24 15:00 ` Peter Zijlstra
2025-06-24 16:36 ` Steven Rostedt
0 siblings, 1 reply; 79+ messages in thread
From: Peter Zijlstra @ 2025-06-24 15:00 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Tue, Jun 24, 2025 at 10:55:38AM -0400, Steven Rostedt wrote:
> > Which is somewhat inconsistent;
> >
> > __clear_bit()/__set_bit()
>
> Hmm, are the above non-atomic?
Yes, ctags or any other code browser of your choice should get you to
their definition, which has a comment explaining the non-atomicity of
them.
* Re: [PATCH v10 08/14] unwind deferred: Use bitmask to determine which callbacks to call
2025-06-24 15:00 ` Peter Zijlstra
@ 2025-06-24 16:36 ` Steven Rostedt
0 siblings, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-24 16:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Tue, 24 Jun 2025 17:00:21 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jun 24, 2025 at 10:55:38AM -0400, Steven Rostedt wrote:
>
> > > Which is somewhat inconsistent;
> > >
> > > __clear_bit()/__set_bit()
> >
> > Hmm, are the above non-atomic?
>
> Yes, ctags or any other code browser of you choice should get you to
> their definition, which has a comment explaining the non-atomicy of
> them.
Bah, I did do a TAGS lookup (emacs) to find them, but I just saw the
macro magic and totally missed the comment above them saying:

/*
 * The following macros are non-atomic versions of their non-underscored
 * counterparts.
 */
:-p
-- Steve
* Re: [PATCH v10 02/14] unwind_user: Add frame pointer support
2025-06-23 16:31 ` Indu Bhagat
@ 2025-06-24 20:30 ` Steven Rostedt
0 siblings, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-24 20:30 UTC (permalink / raw)
To: Indu Bhagat
Cc: Peter Zijlstra, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Mon, 23 Jun 2025 09:31:17 -0700
Indu Bhagat <indu.bhagat@oracle.com> wrote:
> >> Also, CFA here is Call-Frame-Address and RA Return-Address ?
> >
> > I believe so. Do you want me to add a comment?
> >
>
> If a comment is added, Canonical Frame Address will be more appropriate.
Thanks Indu,
Updated.
-- Steve
* Re: [PATCH v10 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-06-18 18:46 ` Peter Zijlstra
2025-06-18 19:09 ` Steven Rostedt
@ 2025-06-24 22:36 ` Steven Rostedt
1 sibling, 0 replies; 79+ messages in thread
From: Steven Rostedt @ 2025-06-24 22:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton
On Wed, 18 Jun 2025 20:46:20 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> > static __always_inline void unwind_exit_to_user_mode(void)
> > {
> > if (unlikely(current->unwind_info.cache))
> > current->unwind_info.cache->nr_entries = 0;
> > + current->unwind_info.timestamp = 0;
>
> Surely clearing that timestamp is only relevant when there is a cache
> around? Better to not add this unconditional write to the exit path.
>
> > }
Note, the timestamp could be there if the cache failed to allocate, and
we would still want to clear the timestamp. I did turn this into:
static __always_inline void unwind_reset_info(void)
{
        /* Exit out early if this was never used */
        if (likely(!current->unwind_info.timestamp))
                return;

        if (current->unwind_info.cache)
                current->unwind_info.cache->nr_entries = 0;
        current->unwind_info.timestamp = 0;
}
For this patch, but later patches will give us the bitmask to check.
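
A sketch of where that ends up once the per-task unwind_mask from the
later patches is in place:

static __always_inline void unwind_reset_info(void)
{
        /* Exit out early if no tracer requested anything this entry */
        if (likely(!current->unwind_info.unwind_mask))
                return;

        if (current->unwind_info.cache)
                current->unwind_info.cache->nr_entries = 0;
        current->unwind_info.unwind_mask = 0;
        current->unwind_info.timestamp = 0;
}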
-- Steve
Thread overview: 79+ messages
2025-06-11 0:54 [PATCH v10 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
2025-06-11 0:54 ` [PATCH v10 01/14] unwind_user: Add user space unwinding API Steven Rostedt
2025-06-18 13:49 ` Peter Zijlstra
2025-06-11 0:54 ` [PATCH v10 02/14] unwind_user: Add frame pointer support Steven Rostedt
2025-06-18 13:52 ` Peter Zijlstra
2025-06-18 15:09 ` Steven Rostedt
2025-06-23 16:31 ` Indu Bhagat
2025-06-24 20:30 ` Steven Rostedt
2025-06-11 0:54 ` [PATCH v10 03/14] unwind_user: Add compat mode " Steven Rostedt
2025-06-18 13:46 ` Peter Zijlstra
2025-06-18 15:10 ` Steven Rostedt
2025-06-18 17:52 ` Linus Torvalds
2025-06-18 18:37 ` Steven Rostedt
2025-06-18 13:47 ` Peter Zijlstra
2025-06-18 15:18 ` Steven Rostedt
2025-06-19 7:51 ` Peter Zijlstra
2025-06-19 8:39 ` Steven Rostedt
2025-06-11 0:54 ` [PATCH v10 04/14] unwind_user/deferred: Add unwind_deferred_trace() Steven Rostedt
2025-06-18 13:54 ` Peter Zijlstra
2025-06-18 13:59 ` Peter Zijlstra
2025-06-18 15:20 ` Steven Rostedt
2025-06-18 16:26 ` Steven Rostedt
2025-06-18 14:01 ` Peter Zijlstra
2025-06-18 15:23 ` Steven Rostedt
2025-06-18 14:02 ` Peter Zijlstra
2025-06-18 15:29 ` Steven Rostedt
2025-06-19 7:54 ` Peter Zijlstra
2025-06-19 8:44 ` Steven Rostedt
2025-06-11 0:54 ` [PATCH v10 05/14] unwind_user/deferred: Add unwind cache Steven Rostedt
2025-06-18 14:13 ` Peter Zijlstra
2025-06-18 15:33 ` Steven Rostedt
2025-06-19 7:56 ` Peter Zijlstra
2025-06-19 8:47 ` Steven Rostedt
2025-06-19 9:04 ` Peter Zijlstra
2025-06-19 9:12 ` Steven Rostedt
2025-06-11 0:54 ` [PATCH v10 06/14] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
2025-06-18 14:20 ` Peter Zijlstra
2025-06-18 15:37 ` Steven Rostedt
2025-06-18 15:38 ` Steven Rostedt
2025-06-19 8:01 ` Peter Zijlstra
2025-06-19 8:49 ` Steven Rostedt
2025-06-18 18:46 ` Peter Zijlstra
2025-06-18 19:09 ` Steven Rostedt
2025-06-18 19:36 ` Steven Rostedt
2025-06-19 7:50 ` Peter Zijlstra
2025-06-19 8:56 ` Steven Rostedt
2025-06-19 9:11 ` Peter Zijlstra
2025-06-24 14:03 ` Steven Rostedt
2025-06-24 22:36 ` Steven Rostedt
2025-06-11 0:54 ` [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
2025-06-19 8:34 ` Peter Zijlstra
2025-06-19 8:37 ` Steven Rostedt
2025-06-19 8:44 ` Peter Zijlstra
2025-06-19 8:48 ` Peter Zijlstra
2025-06-19 9:10 ` Steven Rostedt
2025-06-19 9:24 ` Peter Zijlstra
2025-06-19 8:57 ` Peter Zijlstra
2025-06-19 9:07 ` Steven Rostedt
2025-06-19 9:32 ` Peter Zijlstra
2025-06-19 9:34 ` Peter Zijlstra
2025-06-19 9:42 ` Steven Rostedt
2025-06-19 9:45 ` Peter Zijlstra
2025-06-19 10:19 ` Steven Rostedt
2025-06-19 10:39 ` Peter Zijlstra
2025-06-19 13:04 ` Steven Rostedt
2025-06-11 0:54 ` [PATCH v10 08/14] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
2025-06-20 8:15 ` Peter Zijlstra
2025-06-24 14:55 ` Steven Rostedt
2025-06-24 15:00 ` Peter Zijlstra
2025-06-24 16:36 ` Steven Rostedt
2025-06-11 0:54 ` [PATCH v10 09/14] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
2025-06-11 0:54 ` [PATCH v10 10/14] unwind: Clear unwind_mask on exit back to user space Steven Rostedt
2025-06-11 0:54 ` [PATCH v10 11/14] unwind: Finish up unwind when a task exits Steven Rostedt
2025-06-11 0:54 ` [PATCH v10 12/14] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
2025-06-11 0:54 ` [PATCH v10 13/14] perf/x86: Rename and move get_segment_base() and make it global Steven Rostedt
2025-06-11 0:54 ` [PATCH v10 14/14] unwind_user/x86: Enable compat mode frame pointer unwinding on x86 Steven Rostedt
2025-06-12 21:44 ` [PATCH v10 00/14] unwind_user: x86: Deferred unwinding infrastructure Andrii Nakryiko
2025-06-12 22:02 ` Josh Poimboeuf
2025-06-12 23:30 ` Andrii Nakryiko