* [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure
@ 2025-07-01 0:53 Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 01/14] unwind_user: Add user space unwinding API Steven Rostedt
` (14 more replies)
0 siblings, 15 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-01 0:53 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer
[
UPDATE: Florian Weimer is looking into having Fedora built with SFrame
support, so that once this becomes available in the kernel, it will
also be usable in Fedora.
]
This is the first patch series of a set that will make it possible to use
SFrame[1] in the Linux kernel. A quick recap of the motivation for doing
this:
Currently the only way to get a user space stack trace from a stack
walk (and not just copying a large amount of user stack into the kernel
ring buffer) is to use frame pointers. This has a few issues. The biggest
one is that compiling frame pointers into every application and library
has been shown to cause noticeable performance overhead.
Another issue is that the format of the frames may not always be consistent
between different compilers, and some architectures (s390) have no defined
format for doing a reliable stack walk. The only way to perform user space
profiling on those architectures is to copy the user stack into the kernel
buffer.
SFrame is now supported in GNU binutils and will soon also be supported
by LLVM. SFrame acts much like ORC and lives in the ELF executable file
as its own section. Like ORC, it has two tables. The first table is sorted
by instruction pointer (IP): looking up the current IP in it leads to an
entry in the second table, which describes where the return address of the
current function is located. That return address can then be looked up in
the first table to find its caller's entry, and so on. This performs a
user space stack walk.
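The two-table lookup can be sketched roughly as follows. This is a
simplified illustration, NOT the actual SFrame on-disk format; the
structure names, fields, and sizes here are invented for the example:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified sketch of an SFrame-style two-table lookup. The real format
 * differs; these structures are invented for illustration only.
 */
struct fde {                 /* table 1: sorted by starting IP */
	uint64_t ip_start;
	uint32_t fre_idx;    /* index into table 2 */
};

struct fre {                 /* table 2: how to locate the return address */
	int32_t cfa_off;     /* offset from SP/FP to the CFA */
	int32_t ra_off;      /* offset from the CFA to the saved RA */
};

/* Binary search table 1 for the last entry with ip_start <= ip. */
static int lookup_fde(const struct fde *tbl, int nr, uint64_t ip)
{
	int lo = 0, hi = nr - 1, found = -1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (tbl[mid].ip_start <= ip) {
			found = mid;
			lo = mid + 1;
		} else {
			hi = mid - 1;
		}
	}
	return found;
}
```

Each located entry tells the unwinder where the return address of the
current frame lives; that address then becomes the next IP to look up,
repeating until the walk terminates.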
Now, because the SFrame section lives in the ELF file, it needs to be
faulted into memory when it is used. This means that walking the user space
stack requires being in a faultable context. As profilers like perf request
a stack trace in interrupt or NMI context, the walk cannot be done at the
time it is requested. Instead, it must be deferred until it is safe to
fault in user space. One place this is known to be safe is when the task
is about to return back to user space.
Josh originally wrote the PoC of this code; the last version he posted was
back in January:
https://lore.kernel.org/all/cover.1737511963.git.jpoimboe@kernel.org/
That series contained everything from adding the new faultable user space
stack walking code, deferring the stack walk, implementing SFrame, and
fixing up x86 (the VDSO), to both the kernel and user space sides of perf
to make it all work. But Josh ran out of time to work on it, and I picked
it up. As there are several parts to that series, I also broke it up,
especially since parts of it do not depend on each other.
This series contains only the core infrastructure that all the rest needs.
Of the 14 patches, only 3 are x86 specific. The rest is simply the unwinding
code that s390 can build against. I also moved the 3 x86 specific patches
to the end of the series.
Since multiple tracers (like perf, ftrace, BPF, etc.) can attach to the
deferred unwinder, and each of these tracers can attach to some or all of
the tasks to trace, there is a many-to-many relationship. This relationship
needs to be established in interrupt or NMI context, so it cannot rely on
any allocation. To handle this, bitmasks are used. There is a global bitmask
of size long, from which a single bit is allocated when a tracer registers
for deferred stack traces. The task struct also has a bitmask: when a
request comes in from one of the tracers for a deferred stack trace, the
corresponding bit for that tracer is set in the task's mask. As one of the
bits represents that a request has been made, at most 31 tracers on 32-bit
systems, or 63 on 64-bit systems, may be registered at a given time. This
should not be an issue, as only one perf application or ftrace instance
should request a bit, and BPF should also use only one bit and handle any
multiplexing for its users.
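The bit-allocation scheme above can be sketched like this. The names are
invented for the example, the reserved bit is hypothetical, and the real
kernel code uses atomic bit operations; this only shows the idea:

```c
#include <assert.h>

/*
 * Illustrative sketch of the bitmask scheme. Bit 0 is reserved here
 * (e.g. for "a request has been made"), leaving 31 or 63 bits for
 * tracers depending on the word size.
 */
#define NR_MASK_BITS (sizeof(long) * 8)

static unsigned long tracers_registered = 1UL;	/* bit 0 reserved */

/* Allocate one bit for a registering tracer; -1 if none are free. */
static int tracer_register(void)
{
	for (unsigned int bit = 0; bit < NR_MASK_BITS; bit++) {
		if (!(tracers_registered & (1UL << bit))) {
			tracers_registered |= 1UL << bit;
			return (int)bit;
		}
	}
	return -1;
}

/* A request from a tracer sets that tracer's bit in the task's mask. */
static void task_request_unwind(unsigned long *task_mask, int bit)
{
	*task_mask |= 1UL << bit;
}
```

Because registration only hands out bits and requests only set them, no
allocation is ever needed in interrupt or NMI context.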
When the first request for a deferred stack trace is made for a task, a
timestamp is taken. This timestamp is used as the identifier for the user
space stack trace. As the user space stack trace does not change while the
task is in the kernel, requests that come in after the first request and
before the task goes back to user space get the same timestamp. The
timestamp also serves the purpose of knowing how far back a given user
space stack trace goes. If events are dropped, and the dropped events span
a task entering user space and coming back into the kernel, the new stack
trace taken when the task next returns to user space should not be
associated with the events from before the drop happened.
When a tracer makes a request, it gets this timestamp, and the bit for the
requesting tracer is set in the task's bitmask. A task_work is used to have
the task run the callbacks before it goes back to user space. When it does,
it scans its bitmask and calls the callbacks of all the tracers whose bits
are set. Each callback receives the user space stack trace as well as the
timestamp that was used.
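The dispatch on the way back to user space boils down to scanning the mask
and invoking the registered callbacks. Again, just a sketch with invented
names; the real code (in kernel/unwind/deferred.c) differs in detail and
also passes the stack trace itself to each callback:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TRACERS 8

/* Each tracer's callback receives the timestamp identifying the trace. */
typedef void (*unwind_callback)(uint64_t timestamp);

static uint64_t last_ts;
static int ncalls;

static void demo_callback(uint64_t timestamp)
{
	last_ts = timestamp;
	ncalls++;
}

/* On return to user space: call every tracer whose bit is set. */
static void unwind_dispatch(unsigned long pending,
			    unwind_callback cbs[MAX_TRACERS], uint64_t ts)
{
	for (int bit = 0; bit < MAX_TRACERS; bit++) {
		if ((pending & (1UL << bit)) && cbs[bit])
			cbs[bit](ts);
	}
}
```

Every pending tracer sees the same timestamp, so trace consumers can match
the deferred user stack to the kernel events recorded earlier.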
That's the basic idea. Obviously there's more to it than the above
explanation, but each patch explains what it does, and the work is broken
up step by step.
I run two SFrame meetings a month (one in an Asia-friendly timezone and the
other in a Europe-friendly one). We have developers from Google, Oracle,
Red Hat, IBM, EfficiOS, Meta, Microsoft, and more that attend. (If anyone
is interested in attending, let me know.) I have been running these since
December of 2024. Last year at the GNU Cauldron, a few of us got together
to discuss the design. We are pretty confident that the current design is
sound. We have working code on top of this and have been testing it.
Since the s390 folks want to start working on this (they already have
SFrame patches from working on the prototypes), I would like this series
to be a separate branch based on top of v6.16-rc2. Then all the subsystems
that want to build on top of it can do so, as there's no real dependency
between them.
I have more patches on top of this series that add perf support, ftrace
support, SFrame support, and the x86 fixups (for the VDSO). Each of those
patch series can be worked on independently, but they all depend on this
series (although the x86 specific patches at the end aren't necessarily
needed, at least for other architectures).
Please review, and if you are happy with them, let's get them into a branch
that we all can use. I'm happy to take it in my tree if I can get acks on
the x86 code. Or it can be in the tip tree as a separate branch on top of
6.16-rc4 and I'll just base my work on top of that. It doesn't matter
either way.
This is based on top of v6.16-rc4 and the code is here:
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git unwind/core
Head SHA1: 649fe8a37fbb8bc7eb1d420630523feb4f44d1d7
Changes since v11: https://lore.kernel.org/linux-trace-kernel/20250625225600.555017347@goodmis.org/
- Add a USED bit to the task's unwind_mask to record whether the faultable
user stack function was used. This allows checking only one value on the
way back to user space to know whether more work has to be done.
- Fix header macro protection name to include X86 (Ingo Molnar)
- Use insn_get_seg_base() to get segment registers instead of using the
function perf uses and making it global. Also, as that function does not
appear to require interrupts to be disabled, the scoped_guard(irqsave)
is removed.
- Check the return code of insn_get_seg_base() for the unlikely event that
it returns an invalid value (-1).
- Moved arch_unwind_user_init() into stacktrace.c because, to use
insn_get_seg_base(), it must include insn-eval.h, which defines
pt_regs_offset(). That name is also used in the generic perf code as an
array, and including it in the header file causes a build conflict.
- Update the comments explaining arch_unwind_user_init/next to note that
macros with those names must be defined if they are to be used.
Josh Poimboeuf (7):
unwind_user: Add user space unwinding API
unwind_user: Add frame pointer support
unwind_user: Add compat mode frame pointer support
unwind_user/deferred: Add unwind cache
unwind_user/deferred: Add deferred unwinding interface
unwind_user/x86: Enable frame pointer unwinding on x86
unwind_user/x86: Enable compat mode frame pointer unwinding on x86
Steven Rostedt (7):
unwind_user/deferred: Add unwind_user_faultable()
unwind_user/deferred: Make unwind deferral requests NMI-safe
unwind deferred: Use bitmask to determine which callbacks to call
unwind deferred: Use SRCU unwind_deferred_task_work()
unwind: Clear unwind_mask on exit back to user space
unwind: Add USED bit to only have one conditional on way back to user space
unwind: Finish up unwind when a task exits
----
MAINTAINERS | 8 +
arch/Kconfig | 11 +
arch/x86/Kconfig | 2 +
arch/x86/include/asm/unwind_user.h | 42 ++++
arch/x86/include/asm/unwind_user_types.h | 17 ++
arch/x86/kernel/stacktrace.c | 28 +++
include/asm-generic/Kbuild | 2 +
include/asm-generic/unwind_user.h | 5 +
include/asm-generic/unwind_user_types.h | 5 +
include/linux/entry-common.h | 2 +
include/linux/sched.h | 5 +
include/linux/unwind_deferred.h | 79 +++++++
include/linux/unwind_deferred_types.h | 20 ++
include/linux/unwind_user.h | 45 ++++
include/linux/unwind_user_types.h | 39 ++++
kernel/Makefile | 1 +
kernel/exit.c | 2 +
kernel/fork.c | 4 +
kernel/unwind/Makefile | 1 +
kernel/unwind/deferred.c | 357 +++++++++++++++++++++++++++++++
kernel/unwind/user.c | 130 +++++++++++
21 files changed, 805 insertions(+)
create mode 100644 arch/x86/include/asm/unwind_user.h
create mode 100644 arch/x86/include/asm/unwind_user_types.h
create mode 100644 include/asm-generic/unwind_user.h
create mode 100644 include/asm-generic/unwind_user_types.h
create mode 100644 include/linux/unwind_deferred.h
create mode 100644 include/linux/unwind_deferred_types.h
create mode 100644 include/linux/unwind_user.h
create mode 100644 include/linux/unwind_user_types.h
create mode 100644 kernel/unwind/Makefile
create mode 100644 kernel/unwind/deferred.c
create mode 100644 kernel/unwind/user.c
^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH v12 01/14] unwind_user: Add user space unwinding API
2025-07-01 0:53 [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
@ 2025-07-01 0:53 ` Steven Rostedt
2025-07-04 17:58 ` Mathieu Desnoyers
2025-07-04 18:20 ` Mathieu Desnoyers
2025-07-01 0:53 ` [PATCH v12 02/14] unwind_user: Add frame pointer support Steven Rostedt
` (13 subsequent siblings)
14 siblings, 2 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-01 0:53 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer
From: Josh Poimboeuf <jpoimboe@kernel.org>
Introduce a generic API for unwinding user stacks.
In order to expand user space unwinding to handle more complex scenarios,
such as deferred unwinding and reading user space information, create a
generic interface that all architectures supporting the various unwinding
methods can use.
This is an alternative to the simple stack_trace_save_user() API for
handling user space stack traces. It does not replace that interface, but
it will be used to expand the functionality of user space stack walking.
None of the structures introduced will be exposed to user space tooling.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
MAINTAINERS | 8 +++++
arch/Kconfig | 3 ++
include/linux/unwind_user.h | 15 +++++++++
include/linux/unwind_user_types.h | 31 +++++++++++++++++
kernel/Makefile | 1 +
kernel/unwind/Makefile | 1 +
kernel/unwind/user.c | 55 +++++++++++++++++++++++++++++++
7 files changed, 114 insertions(+)
create mode 100644 include/linux/unwind_user.h
create mode 100644 include/linux/unwind_user_types.h
create mode 100644 kernel/unwind/Makefile
create mode 100644 kernel/unwind/user.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 4bac4ea21b64..ed5705c4f7d9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -25924,6 +25924,14 @@ F: Documentation/driver-api/uio-howto.rst
F: drivers/uio/
F: include/linux/uio_driver.h
+USERSPACE STACK UNWINDING
+M: Josh Poimboeuf <jpoimboe@kernel.org>
+M: Steven Rostedt <rostedt@goodmis.org>
+S: Maintained
+F: include/linux/unwind*.h
+F: kernel/unwind/
+
+
UTIL-LINUX PACKAGE
M: Karel Zak <kzak@redhat.com>
L: util-linux@vger.kernel.org
diff --git a/arch/Kconfig b/arch/Kconfig
index a3308a220f86..ea59e5d7cc69 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -435,6 +435,9 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
It uses the same command line parameters, and sysctl interface,
as the generic hardlockup detectors.
+config UNWIND_USER
+ bool
+
config HAVE_PERF_REGS
bool
help
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
new file mode 100644
index 000000000000..aa7923c1384f
--- /dev/null
+++ b/include/linux/unwind_user.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_H
+#define _LINUX_UNWIND_USER_H
+
+#include <linux/unwind_user_types.h>
+
+int unwind_user_start(struct unwind_user_state *state);
+int unwind_user_next(struct unwind_user_state *state);
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
+
+#define for_each_user_frame(state) \
+ for (unwind_user_start((state)); !(state)->done; unwind_user_next((state)))
+
+#endif /* _LINUX_UNWIND_USER_H */
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
new file mode 100644
index 000000000000..6ed1b4ae74e1
--- /dev/null
+++ b/include/linux/unwind_user_types.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_TYPES_H
+#define _LINUX_UNWIND_USER_TYPES_H
+
+#include <linux/types.h>
+
+enum unwind_user_type {
+ UNWIND_USER_TYPE_NONE,
+};
+
+struct unwind_stacktrace {
+ unsigned int nr;
+ unsigned long *entries;
+};
+
+struct unwind_user_frame {
+ s32 cfa_off;
+ s32 ra_off;
+ s32 fp_off;
+ bool use_fp;
+};
+
+struct unwind_user_state {
+ unsigned long ip;
+ unsigned long sp;
+ unsigned long fp;
+ enum unwind_user_type type;
+ bool done;
+};
+
+#endif /* _LINUX_UNWIND_USER_TYPES_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 32e80dd626af..541186050251 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -55,6 +55,7 @@ obj-y += rcu/
obj-y += livepatch/
obj-y += dma/
obj-y += entry/
+obj-y += unwind/
obj-$(CONFIG_MODULES) += module/
obj-$(CONFIG_KCMP) += kcmp.o
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
new file mode 100644
index 000000000000..349ce3677526
--- /dev/null
+++ b/kernel/unwind/Makefile
@@ -0,0 +1 @@
+ obj-$(CONFIG_UNWIND_USER) += user.o
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
new file mode 100644
index 000000000000..d30449328981
--- /dev/null
+++ b/kernel/unwind/user.c
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+* Generic interfaces for unwinding user space
+*/
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+#include <linux/unwind_user.h>
+
+int unwind_user_next(struct unwind_user_state *state)
+{
+ /* no implementation yet */
+ return -EINVAL;
+}
+
+int unwind_user_start(struct unwind_user_state *state)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+
+ memset(state, 0, sizeof(*state));
+
+ if ((current->flags & PF_KTHREAD) || !user_mode(regs)) {
+ state->done = true;
+ return -EINVAL;
+ }
+
+ state->type = UNWIND_USER_TYPE_NONE;
+
+ state->ip = instruction_pointer(regs);
+ state->sp = user_stack_pointer(regs);
+ state->fp = frame_pointer(regs);
+
+ return 0;
+}
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries)
+{
+ struct unwind_user_state state;
+
+ trace->nr = 0;
+
+ if (!max_entries)
+ return -EINVAL;
+
+ if (current->flags & PF_KTHREAD)
+ return 0;
+
+ for_each_user_frame(&state) {
+ trace->entries[trace->nr++] = state.ip;
+ if (trace->nr >= max_entries)
+ break;
+ }
+
+ return 0;
+}
--
2.47.2
* [PATCH v12 02/14] unwind_user: Add frame pointer support
2025-07-01 0:53 [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 01/14] unwind_user: Add user space unwinding API Steven Rostedt
@ 2025-07-01 0:53 ` Steven Rostedt
2025-07-01 2:10 ` Linus Torvalds
2025-07-01 0:53 ` [PATCH v12 03/14] unwind_user: Add compat mode " Steven Rostedt
` (12 subsequent siblings)
14 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2025-07-01 0:53 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer
From: Josh Poimboeuf <jpoimboe@kernel.org>
Add optional support for user space frame pointer unwinding. If
supported, the arch needs to enable CONFIG_HAVE_UNWIND_USER_FP and
define ARCH_INIT_USER_FP_FRAME.
By encoding the frame offsets in struct unwind_user_frame, much of this
code can also be reused for future unwinder implementations like sframe.
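As a sketch of what such a frame description enables, the walk below
follows a synthetic rbp-style chain using offsets in the conventional
x86-64 layout. The offset values, structure name, and helper are
assumptions for illustration; the real ARCH_INIT_USER_FP_FRAME definition
comes later in the series, and the real code reads user memory with
get_user() rather than an array:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors struct unwind_user_frame from the patch; the offset values
 * below assume the conventional x86-64 rbp chain (illustration only). */
struct frame_desc {
	int32_t cfa_off;	/* CFA = (use_fp ? fp : sp) + cfa_off */
	int32_t ra_off;		/* return address lives at CFA + ra_off */
	int32_t fp_off;		/* caller's fp saved at CFA + fp_off */
	bool use_fp;
};

static const struct frame_desc fp_frame = {
	.cfa_off = 16,		/* saved rbp + return address */
	.ra_off  = -8,
	.fp_off  = -16,
	.use_fp  = true,
};

/* Walk a synthetic stack image (an array indexed by address/8) instead
 * of real user memory, mimicking the unwind_user_next() loop. */
static int walk(const uint64_t *stack, uint64_t sp, uint64_t fp,
		uint64_t *ips, int max)
{
	int nr = 0;

	while (nr < max) {
		uint64_t cfa = (fp_frame.use_fp ? fp : sp) + fp_frame.cfa_off;
		uint64_t ra;

		if (cfa <= sp)		/* stack going the wrong way: stop */
			break;
		ra = stack[(cfa + fp_frame.ra_off) / 8];
		if (!ra)		/* end of the synthetic chain */
			break;
		ips[nr++] = ra;
		sp = cfa;
		fp = stack[(cfa + fp_frame.fp_off) / 8];
	}
	return nr;
}
```

Swapping in a different frame_desc (e.g. one derived from SFrame data)
changes where the walker looks without changing the loop itself, which is
the reuse the commit message refers to.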
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
arch/Kconfig | 4 +++
include/asm-generic/Kbuild | 1 +
include/asm-generic/unwind_user.h | 5 +++
include/linux/unwind_user.h | 5 +++
include/linux/unwind_user_types.h | 1 +
kernel/unwind/user.c | 51 +++++++++++++++++++++++++++++--
6 files changed, 65 insertions(+), 2 deletions(-)
create mode 100644 include/asm-generic/unwind_user.h
diff --git a/arch/Kconfig b/arch/Kconfig
index ea59e5d7cc69..8e3fd723bd74 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -438,6 +438,10 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
config UNWIND_USER
bool
+config HAVE_UNWIND_USER_FP
+ bool
+ select UNWIND_USER
+
config HAVE_PERF_REGS
bool
help
diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
index 8675b7b4ad23..295c94a3ccc1 100644
--- a/include/asm-generic/Kbuild
+++ b/include/asm-generic/Kbuild
@@ -59,6 +59,7 @@ mandatory-y += tlbflush.h
mandatory-y += topology.h
mandatory-y += trace_clock.h
mandatory-y += uaccess.h
+mandatory-y += unwind_user.h
mandatory-y += vermagic.h
mandatory-y += vga.h
mandatory-y += video.h
diff --git a/include/asm-generic/unwind_user.h b/include/asm-generic/unwind_user.h
new file mode 100644
index 000000000000..b8882b909944
--- /dev/null
+++ b/include/asm-generic/unwind_user.h
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_UNWIND_USER_H
+#define _ASM_GENERIC_UNWIND_USER_H
+
+#endif /* _ASM_GENERIC_UNWIND_USER_H */
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index aa7923c1384f..a405111c41b0 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -3,6 +3,11 @@
#define _LINUX_UNWIND_USER_H
#include <linux/unwind_user_types.h>
+#include <asm/unwind_user.h>
+
+#ifndef ARCH_INIT_USER_FP_FRAME
+ #define ARCH_INIT_USER_FP_FRAME
+#endif
int unwind_user_start(struct unwind_user_state *state);
int unwind_user_next(struct unwind_user_state *state);
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 6ed1b4ae74e1..65bd070eb6b0 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -6,6 +6,7 @@
enum unwind_user_type {
UNWIND_USER_TYPE_NONE,
+ UNWIND_USER_TYPE_FP,
};
struct unwind_stacktrace {
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index d30449328981..1201d655654a 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -6,10 +6,54 @@
#include <linux/sched.h>
#include <linux/sched/task_stack.h>
#include <linux/unwind_user.h>
+#include <linux/uaccess.h>
+
+static struct unwind_user_frame fp_frame = {
+ ARCH_INIT_USER_FP_FRAME
+};
+
+static inline bool fp_state(struct unwind_user_state *state)
+{
+ return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP) &&
+ state->type == UNWIND_USER_TYPE_FP;
+}
int unwind_user_next(struct unwind_user_state *state)
{
- /* no implementation yet */
+ struct unwind_user_frame *frame;
+ unsigned long cfa = 0, fp, ra = 0;
+
+ if (state->done)
+ return -EINVAL;
+
+ if (fp_state(state))
+ frame = &fp_frame;
+ else
+ goto done;
+
+ /* Get the Canonical Frame Address (CFA) */
+ cfa = (frame->use_fp ? state->fp : state->sp) + frame->cfa_off;
+
+ /* stack going in wrong direction? */
+ if (cfa <= state->sp)
+ goto done;
+
+ /* Find the Return Address (RA) */
+ if (get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
+ goto done;
+
+ if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
+ goto done;
+
+ state->ip = ra;
+ state->sp = cfa;
+ if (frame->fp_off)
+ state->fp = fp;
+
+ return 0;
+
+done:
+ state->done = true;
return -EINVAL;
}
@@ -24,7 +68,10 @@ int unwind_user_start(struct unwind_user_state *state)
return -EINVAL;
}
- state->type = UNWIND_USER_TYPE_NONE;
+ if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+ state->type = UNWIND_USER_TYPE_FP;
+ else
+ state->type = UNWIND_USER_TYPE_NONE;
state->ip = instruction_pointer(regs);
state->sp = user_stack_pointer(regs);
--
2.47.2
* [PATCH v12 03/14] unwind_user: Add compat mode frame pointer support
2025-07-01 0:53 [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 01/14] unwind_user: Add user space unwinding API Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 02/14] unwind_user: Add frame pointer support Steven Rostedt
@ 2025-07-01 0:53 ` Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 04/14] unwind_user/deferred: Add unwind_user_faultable() Steven Rostedt
` (11 subsequent siblings)
14 siblings, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-01 0:53 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer
From: Josh Poimboeuf <jpoimboe@kernel.org>
Add optional support for user space compat mode frame pointer unwinding.
If supported, the arch needs to enable CONFIG_HAVE_UNWIND_USER_COMPAT_FP
and define ARCH_INIT_USER_COMPAT_FP_FRAME.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
arch/Kconfig | 4 ++++
include/asm-generic/Kbuild | 1 +
include/asm-generic/unwind_user_types.h | 5 ++++
include/linux/unwind_user.h | 5 ++++
include/linux/unwind_user_types.h | 7 ++++++
kernel/unwind/user.c | 32 +++++++++++++++++++++----
6 files changed, 50 insertions(+), 4 deletions(-)
create mode 100644 include/asm-generic/unwind_user_types.h
diff --git a/arch/Kconfig b/arch/Kconfig
index 8e3fd723bd74..2c41d3072910 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -442,6 +442,10 @@ config HAVE_UNWIND_USER_FP
bool
select UNWIND_USER
+config HAVE_UNWIND_USER_COMPAT_FP
+ bool
+ depends on HAVE_UNWIND_USER_FP
+
config HAVE_PERF_REGS
bool
help
diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
index 295c94a3ccc1..b797a2434396 100644
--- a/include/asm-generic/Kbuild
+++ b/include/asm-generic/Kbuild
@@ -60,6 +60,7 @@ mandatory-y += topology.h
mandatory-y += trace_clock.h
mandatory-y += uaccess.h
mandatory-y += unwind_user.h
+mandatory-y += unwind_user_types.h
mandatory-y += vermagic.h
mandatory-y += vga.h
mandatory-y += video.h
diff --git a/include/asm-generic/unwind_user_types.h b/include/asm-generic/unwind_user_types.h
new file mode 100644
index 000000000000..f568b82e52cd
--- /dev/null
+++ b/include/asm-generic/unwind_user_types.h
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_UNWIND_USER_TYPES_H
+#define _ASM_GENERIC_UNWIND_USER_TYPES_H
+
+#endif /* _ASM_GENERIC_UNWIND_USER_TYPES_H */
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index a405111c41b0..ac007363820a 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -9,6 +9,11 @@
#define ARCH_INIT_USER_FP_FRAME
#endif
+#ifndef ARCH_INIT_USER_COMPAT_FP_FRAME
+ #define ARCH_INIT_USER_COMPAT_FP_FRAME
+ #define in_compat_mode(regs) false
+#endif
+
int unwind_user_start(struct unwind_user_state *state);
int unwind_user_next(struct unwind_user_state *state);
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 65bd070eb6b0..0b6563951ca4 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -3,10 +3,16 @@
#define _LINUX_UNWIND_USER_TYPES_H
#include <linux/types.h>
+#include <asm/unwind_user_types.h>
+
+#ifndef arch_unwind_user_state
+struct arch_unwind_user_state {};
+#endif
enum unwind_user_type {
UNWIND_USER_TYPE_NONE,
UNWIND_USER_TYPE_FP,
+ UNWIND_USER_TYPE_COMPAT_FP,
};
struct unwind_stacktrace {
@@ -25,6 +31,7 @@ struct unwind_user_state {
unsigned long ip;
unsigned long sp;
unsigned long fp;
+ struct arch_unwind_user_state arch;
enum unwind_user_type type;
bool done;
};
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 1201d655654a..3a0ac4346f5b 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -12,12 +12,32 @@ static struct unwind_user_frame fp_frame = {
ARCH_INIT_USER_FP_FRAME
};
+static struct unwind_user_frame compat_fp_frame = {
+ ARCH_INIT_USER_COMPAT_FP_FRAME
+};
+
static inline bool fp_state(struct unwind_user_state *state)
{
return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP) &&
state->type == UNWIND_USER_TYPE_FP;
}
+static inline bool compat_fp_state(struct unwind_user_state *state)
+{
+ return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) &&
+ state->type == UNWIND_USER_TYPE_COMPAT_FP;
+}
+
+#define unwind_get_user_long(to, from, state) \
+({ \
+ int __ret; \
+ if (compat_fp_state(state)) \
+ __ret = get_user(to, (u32 __user *)(from)); \
+ else \
+ __ret = get_user(to, (unsigned long __user *)(from)); \
+ __ret; \
+})
+
int unwind_user_next(struct unwind_user_state *state)
{
struct unwind_user_frame *frame;
@@ -26,7 +46,9 @@ int unwind_user_next(struct unwind_user_state *state)
if (state->done)
return -EINVAL;
- if (fp_state(state))
+ if (compat_fp_state(state))
+ frame = &compat_fp_frame;
+ else if (fp_state(state))
frame = &fp_frame;
else
goto done;
@@ -39,10 +61,10 @@ int unwind_user_next(struct unwind_user_state *state)
goto done;
/* Find the Return Address (RA) */
- if (get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
+ if (unwind_get_user_long(ra, cfa + frame->ra_off, state))
goto done;
- if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
+ if (frame->fp_off && unwind_get_user_long(fp, cfa + frame->fp_off, state))
goto done;
state->ip = ra;
@@ -68,7 +90,9 @@ int unwind_user_start(struct unwind_user_state *state)
return -EINVAL;
}
- if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+ if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) && in_compat_mode(regs))
+ state->type = UNWIND_USER_TYPE_COMPAT_FP;
+ else if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
state->type = UNWIND_USER_TYPE_FP;
else
state->type = UNWIND_USER_TYPE_NONE;
--
2.47.2
* [PATCH v12 04/14] unwind_user/deferred: Add unwind_user_faultable()
2025-07-01 0:53 [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (2 preceding siblings ...)
2025-07-01 0:53 ` [PATCH v12 03/14] unwind_user: Add compat mode " Steven Rostedt
@ 2025-07-01 0:53 ` Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 05/14] unwind_user/deferred: Add unwind cache Steven Rostedt
` (10 subsequent siblings)
14 siblings, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-01 0:53 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer
From: Steven Rostedt <rostedt@goodmis.org>
Add a new API to retrieve a user space callstack, called
unwind_user_faultable(). The difference between this user space stack
tracer and the current one is that this one must be called from a
faultable context, as it may use routines that access user space data
that needs to be faulted in.
It can safely be called when entering or exiting a system call, as user
space can still be faulted in there.
This code is based on Josh Poimboeuf's deferred unwinding work:
Link: https://lore.kernel.org/all/6052e8487746603bdb29b65f4033e739092d9925.1737511963.git.jpoimboe@kernel.org/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/sched.h | 5 +++
include/linux/unwind_deferred.h | 24 +++++++++++
include/linux/unwind_deferred_types.h | 9 ++++
kernel/fork.c | 4 ++
kernel/unwind/Makefile | 2 +-
kernel/unwind/deferred.c | 60 +++++++++++++++++++++++++++
6 files changed, 103 insertions(+), 1 deletion(-)
create mode 100644 include/linux/unwind_deferred.h
create mode 100644 include/linux/unwind_deferred_types.h
create mode 100644 kernel/unwind/deferred.c
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4f78a64beb52..59fdf7d9bb1e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -46,6 +46,7 @@
#include <linux/rv.h>
#include <linux/uidgid_types.h>
#include <linux/tracepoint-defs.h>
+#include <linux/unwind_deferred_types.h>
#include <asm/kmap_size.h>
/* task_struct member predeclarations (sorted alphabetically): */
@@ -1654,6 +1655,10 @@ struct task_struct {
struct user_event_mm *user_event_mm;
#endif
+#ifdef CONFIG_UNWIND_USER
+ struct unwind_task_info unwind_info;
+#endif
+
/* CPU-specific state of this task: */
struct thread_struct thread;
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
new file mode 100644
index 000000000000..a5f6e8f8a1a2
--- /dev/null
+++ b/include/linux/unwind_deferred.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_DEFERRED_H
+#define _LINUX_UNWIND_USER_DEFERRED_H
+
+#include <linux/unwind_user.h>
+#include <linux/unwind_deferred_types.h>
+
+#ifdef CONFIG_UNWIND_USER
+
+void unwind_task_init(struct task_struct *task);
+void unwind_task_free(struct task_struct *task);
+
+int unwind_user_faultable(struct unwind_stacktrace *trace);
+
+#else /* !CONFIG_UNWIND_USER */
+
+static inline void unwind_task_init(struct task_struct *task) {}
+static inline void unwind_task_free(struct task_struct *task) {}
+
+static inline int unwind_user_faultable(struct unwind_stacktrace *trace) { return -ENOSYS; }
+
+#endif /* !CONFIG_UNWIND_USER */
+
+#endif /* _LINUX_UNWIND_USER_DEFERRED_H */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
new file mode 100644
index 000000000000..aa32db574e43
--- /dev/null
+++ b/include/linux/unwind_deferred_types.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+
+struct unwind_task_info {
+ unsigned long *entries;
+};
+
+#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 1ee8eb11f38b..3341d50c61f2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -105,6 +105,7 @@
#include <uapi/linux/pidfd.h>
#include <linux/pidfs.h>
#include <linux/tick.h>
+#include <linux/unwind_deferred.h>
#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -732,6 +733,7 @@ void __put_task_struct(struct task_struct *tsk)
WARN_ON(refcount_read(&tsk->usage));
WARN_ON(tsk == current);
+ unwind_task_free(tsk);
sched_ext_free(tsk);
io_uring_free(tsk);
cgroup_free(tsk);
@@ -2135,6 +2137,8 @@ __latent_entropy struct task_struct *copy_process(
p->bpf_ctx = NULL;
#endif
+ unwind_task_init(p);
+
/* Perform scheduler related setup. Assign this task to a CPU. */
retval = sched_fork(clone_flags, p);
if (retval)
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
index 349ce3677526..eae37bea54fd 100644
--- a/kernel/unwind/Makefile
+++ b/kernel/unwind/Makefile
@@ -1 +1 @@
- obj-$(CONFIG_UNWIND_USER) += user.o
+ obj-$(CONFIG_UNWIND_USER) += user.o deferred.o
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
new file mode 100644
index 000000000000..a0badbeb3cc1
--- /dev/null
+++ b/kernel/unwind/deferred.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Deferred user space unwinding
+ */
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/unwind_deferred.h>
+
+#define UNWIND_MAX_ENTRIES 512
+
+/**
+ * unwind_user_faultable - Produce a user stacktrace in faultable context
+ * @trace: The descriptor that will store the user stacktrace
+ *
+ * This must be called in a known faultable context (usually when entering
+ * or exiting user space). Depending on the available implementations
+ * the @trace will be loaded with the addresses of the user space stacktrace
+ * if it can be found.
+ *
+ * Return: 0 on success and negative on error
+ * On success @trace will contain the user space stacktrace
+ */
+int unwind_user_faultable(struct unwind_stacktrace *trace)
+{
+ struct unwind_task_info *info = &current->unwind_info;
+
+ /* Should always be called from faultable context */
+ might_fault();
+
+ if (current->flags & PF_EXITING)
+ return -EINVAL;
+
+ if (!info->entries) {
+ info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
+ GFP_KERNEL);
+ if (!info->entries)
+ return -ENOMEM;
+ }
+
+ trace->nr = 0;
+ trace->entries = info->entries;
+ unwind_user(trace, UNWIND_MAX_ENTRIES);
+
+ return 0;
+}
+
+void unwind_task_init(struct task_struct *task)
+{
+ struct unwind_task_info *info = &task->unwind_info;
+
+ memset(info, 0, sizeof(*info));
+}
+
+void unwind_task_free(struct task_struct *task)
+{
+ struct unwind_task_info *info = &task->unwind_info;
+
+ kfree(info->entries);
+}
--
2.47.2
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH v12 05/14] unwind_user/deferred: Add unwind cache
2025-07-01 0:53 [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (3 preceding siblings ...)
2025-07-01 0:53 ` [PATCH v12 04/14] unwind_user/deferred: Add unwind_user_faultable() Steven Rostedt
@ 2025-07-01 0:53 ` Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
` (9 subsequent siblings)
14 siblings, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-01 0:53 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer
From: Josh Poimboeuf <jpoimboe@kernel.org>
Cache the results of the unwind to ensure the unwind is only performed
once, even when called by multiple tracers.
The cache's nr_entries gets cleared every time the task exits the kernel.
When a stacktrace is requested, nr_entries gets set to the number of
entries in the stacktrace. If another stacktrace is requested while
nr_entries is non-zero, the cache already holds the same stacktrace that
would be retrieved, so the unwind is not performed again and the cached
entries are handed to the caller.
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/entry-common.h | 2 ++
include/linux/unwind_deferred.h | 8 +++++++
include/linux/unwind_deferred_types.h | 7 +++++-
kernel/unwind/deferred.c | 31 +++++++++++++++++++++------
4 files changed, 40 insertions(+), 8 deletions(-)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index f94f3fdf15fc..8908b8eeb99b 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -12,6 +12,7 @@
#include <linux/resume_user_mode.h>
#include <linux/tick.h>
#include <linux/kmsan.h>
+#include <linux/unwind_deferred.h>
#include <asm/entry-common.h>
#include <asm/syscall.h>
@@ -362,6 +363,7 @@ static __always_inline void exit_to_user_mode(void)
lockdep_hardirqs_on_prepare();
instrumentation_end();
+ unwind_reset_info();
user_enter_irqoff();
arch_exit_to_user_mode();
lockdep_hardirqs_on(CALLER_ADDR0);
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index a5f6e8f8a1a2..baacf4a1eb4c 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -12,6 +12,12 @@ void unwind_task_free(struct task_struct *task);
int unwind_user_faultable(struct unwind_stacktrace *trace);
+static __always_inline void unwind_reset_info(void)
+{
+ if (unlikely(current->unwind_info.cache))
+ current->unwind_info.cache->nr_entries = 0;
+}
+
#else /* !CONFIG_UNWIND_USER */
static inline void unwind_task_init(struct task_struct *task) {}
@@ -19,6 +25,8 @@ static inline void unwind_task_free(struct task_struct *task) {}
static inline int unwind_user_faultable(struct unwind_stacktrace *trace) { return -ENOSYS; }
+static inline void unwind_reset_info(void) {}
+
#endif /* !CONFIG_UNWIND_USER */
#endif /* _LINUX_UNWIND_USER_DEFERRED_H */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index aa32db574e43..db5b54b18828 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -2,8 +2,13 @@
#ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+struct unwind_cache {
+ unsigned int nr_entries;
+ unsigned long entries[];
+};
+
struct unwind_task_info {
- unsigned long *entries;
+ struct unwind_cache *cache;
};
#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index a0badbeb3cc1..96368a5aa522 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -4,10 +4,13 @@
*/
#include <linux/kernel.h>
#include <linux/sched.h>
+#include <linux/sizes.h>
#include <linux/slab.h>
#include <linux/unwind_deferred.h>
-#define UNWIND_MAX_ENTRIES 512
+/* Make the cache fit in a 4K page */
+#define UNWIND_MAX_ENTRIES \
+ ((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
/**
* unwind_user_faultable - Produce a user stacktrace in faultable context
@@ -24,6 +27,7 @@
int unwind_user_faultable(struct unwind_stacktrace *trace)
{
struct unwind_task_info *info = &current->unwind_info;
+ struct unwind_cache *cache;
/* Should always be called from faultable context */
might_fault();
@@ -31,17 +35,30 @@ int unwind_user_faultable(struct unwind_stacktrace *trace)
if (current->flags & PF_EXITING)
return -EINVAL;
- if (!info->entries) {
- info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
- GFP_KERNEL);
- if (!info->entries)
+ if (!info->cache) {
+ info->cache = kzalloc(struct_size(cache, entries, UNWIND_MAX_ENTRIES),
+ GFP_KERNEL);
+ if (!info->cache)
return -ENOMEM;
}
+ cache = info->cache;
+ trace->entries = cache->entries;
+
+ if (cache->nr_entries) {
+ /*
+ * The user stack has already been previously unwound in this
+ * entry context. Skip the unwind and use the cache.
+ */
+ trace->nr = cache->nr_entries;
+ return 0;
+ }
+
trace->nr = 0;
- trace->entries = info->entries;
unwind_user(trace, UNWIND_MAX_ENTRIES);
+ cache->nr_entries = trace->nr;
+
return 0;
}
@@ -56,5 +73,5 @@ void unwind_task_free(struct task_struct *task)
{
struct unwind_task_info *info = &task->unwind_info;
- kfree(info->entries);
+ kfree(info->cache);
}
--
2.47.2
* [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-07-01 0:53 [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (4 preceding siblings ...)
2025-07-01 0:53 ` [PATCH v12 05/14] unwind_user/deferred: Add unwind cache Steven Rostedt
@ 2025-07-01 0:53 ` Steven Rostedt
2025-07-02 16:36 ` Peter Zijlstra
2025-07-01 0:53 ` [PATCH v12 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
` (8 subsequent siblings)
14 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2025-07-01 0:53 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer
From: Josh Poimboeuf <jpoimboe@kernel.org>
Add an interface for scheduling task work to unwind the user space stack
before returning to user space. This solves several problems for its
callers:
- Ensure the unwind happens in task context even if the caller may be
running in interrupt context.
- Avoid duplicate unwinds, whether called multiple times by the same
caller or by different callers.
- Take a timestamp when the first request comes in after the task
entered the kernel. This timestamp will be returned to the calling
function along with the stack trace when the task leaves the kernel.
The timestamp can be used to correlate kernel unwinds/traces with the
user unwind. For this to work properly, the architecture must have a
local_clock() resolution that guarantees a different timestamp per
task system call.
The timestamp is used to detect when two requests refer to the same
stacktrace. It is generated the first time a user space stacktrace is
requested after the task enters the kernel.
The timestamp is handed back to the caller at request time, and when the
stacktrace is generated upon returning to user space, the requester's
callback is invoked with both the timestamp and the stacktrace. The
timestamp is cleared when the task goes back to user space. Note, this
currently adds another conditional to the unwind_reset_info() path that
is always called when returning to user space, but future changes will
put this back to a single conditional.
A global list is created and protected by a global mutex that holds
tracers that register with the unwind infrastructure. The number of
registered tracers will be limited in future changes. Each perf program or
ftrace instance will register its own descriptor to use for deferred
unwind stack traces.
Note, the function unwind_deferred_task_work() that gets called when
returning to user space uses a global mutex for synchronization, which
will cause a big bottleneck. This will be replaced by SRCU, but that
change adds some complex synchronization that deserves its own commit.
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v11: https://lore.kernel.org/20250625225715.825831885@goodmis.org
- Still need to clear cache->nr_entries in unwind_reset_info even if
timestamp is zero. This is because unwind_user_faultable() can be called
directly and it requires nr_entries to be zeroed but it does not touch
the timestamp.
include/linux/unwind_deferred.h | 24 +++++
include/linux/unwind_deferred_types.h | 3 +
kernel/unwind/deferred.c | 139 +++++++++++++++++++++++++-
3 files changed, 165 insertions(+), 1 deletion(-)
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index baacf4a1eb4c..c6548e8d64d1 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -2,9 +2,19 @@
#ifndef _LINUX_UNWIND_USER_DEFERRED_H
#define _LINUX_UNWIND_USER_DEFERRED_H
+#include <linux/task_work.h>
#include <linux/unwind_user.h>
#include <linux/unwind_deferred_types.h>
+struct unwind_work;
+
+typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stacktrace *trace, u64 timestamp);
+
+struct unwind_work {
+ struct list_head list;
+ unwind_callback_t func;
+};
+
#ifdef CONFIG_UNWIND_USER
void unwind_task_init(struct task_struct *task);
@@ -12,8 +22,19 @@ void unwind_task_free(struct task_struct *task);
int unwind_user_faultable(struct unwind_stacktrace *trace);
+int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
+int unwind_deferred_request(struct unwind_work *work, u64 *timestamp);
+void unwind_deferred_cancel(struct unwind_work *work);
+
static __always_inline void unwind_reset_info(void)
{
+ if (unlikely(current->unwind_info.timestamp))
+ current->unwind_info.timestamp = 0;
+ /*
+ * As unwind_user_faultable() can be called directly and
+ * depends on nr_entries being cleared on exit to user,
+ * this needs to be a separate conditional.
+ */
if (unlikely(current->unwind_info.cache))
current->unwind_info.cache->nr_entries = 0;
}
@@ -24,6 +45,9 @@ static inline void unwind_task_init(struct task_struct *task) {}
static inline void unwind_task_free(struct task_struct *task) {}
static inline int unwind_user_faultable(struct unwind_stacktrace *trace) { return -ENOSYS; }
+static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func) { return -ENOSYS; }
+static inline int unwind_deferred_request(struct unwind_work *work, u64 *timestamp) { return -ENOSYS; }
+static inline void unwind_deferred_cancel(struct unwind_work *work) {}
static inline void unwind_reset_info(void) {}
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index db5b54b18828..5df264cf81ad 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -9,6 +9,9 @@ struct unwind_cache {
struct unwind_task_info {
struct unwind_cache *cache;
+ struct callback_head work;
+ u64 timestamp;
+ int pending;
};
#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 96368a5aa522..d5f2c004a5b0 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -2,16 +2,43 @@
/*
* Deferred user space unwinding
*/
+#include <linux/sched/task_stack.h>
+#include <linux/unwind_deferred.h>
+#include <linux/sched/clock.h>
+#include <linux/task_work.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/sizes.h>
#include <linux/slab.h>
-#include <linux/unwind_deferred.h>
+#include <linux/mm.h>
/* Make the cache fit in a 4K page */
#define UNWIND_MAX_ENTRIES \
((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
+/* Guards adding to and reading the list of callbacks */
+static DEFINE_MUTEX(callback_mutex);
+static LIST_HEAD(callbacks);
+
+/*
+ * Read the task context timestamp, if this is the first caller then
+ * it will set the timestamp.
+ *
+ * For this to work properly, the timestamp (local_clock()) must
+ * have a resolution that will guarantee a different timestamp
+ * every time a task makes a system call. That is, two short
+ * system calls back to back must have a different timestamp.
+ */
+static u64 get_timestamp(struct unwind_task_info *info)
+{
+ lockdep_assert_irqs_disabled();
+
+ if (!info->timestamp)
+ info->timestamp = local_clock();
+
+ return info->timestamp;
+}
+
/**
* unwind_user_faultable - Produce a user stacktrace in faultable context
* @trace: The descriptor that will store the user stacktrace
@@ -62,11 +89,120 @@ int unwind_user_faultable(struct unwind_stacktrace *trace)
return 0;
}
+static void unwind_deferred_task_work(struct callback_head *head)
+{
+ struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
+ struct unwind_stacktrace trace;
+ struct unwind_work *work;
+ u64 timestamp;
+
+ if (WARN_ON_ONCE(!info->pending))
+ return;
+
+ /* Allow work to come in again */
+ WRITE_ONCE(info->pending, 0);
+
+ /*
+ * From here on out, the callback must always be called, even if it's
+ * just an empty trace.
+ */
+ trace.nr = 0;
+ trace.entries = NULL;
+
+ unwind_user_faultable(&trace);
+
+ timestamp = info->timestamp;
+
+ guard(mutex)(&callback_mutex);
+ list_for_each_entry(work, &callbacks, list) {
+ work->func(work, &trace, timestamp);
+ }
+}
+
+/**
+ * unwind_deferred_request - Request a user stacktrace on task exit
+ * @work: Unwind descriptor requesting the trace
+ * @timestamp: The time stamp of the first request made for this task
+ *
+ * Schedule a user space unwind to be done in task work before exiting the
+ * kernel.
+ *
+ * The returned @timestamp output is the timestamp of the very first request
+ * for a user space stacktrace for this task since it entered the kernel.
+ * It can be from a request by any caller of this infrastructure.
+ * Its value will also be passed to the callback function. It can be
+ * used to stitch kernel and user stack traces together in post-processing.
+ *
+ * Note, the architecture must have a local_clock() implementation that
+ * guarantees a different timestamp per task system call.
+ *
+ * It's valid to call this function multiple times for the same @work within
+ * the same task entry context. Each call will return the same timestamp
+ * while the task hasn't left the kernel. If the callback is not pending because
+ * it has already been previously called for the same entry context, it will be
+ * called again with the same stack trace and timestamp.
+ *
+ * Return: 1 if the callback was already queued.
+ * 0 if the callback was successfully queued.
+ * Negative if there's an error.
+ * @timestamp holds the timestamp of the first request by any user
+ */
+int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
+{
+ struct unwind_task_info *info = &current->unwind_info;
+ int ret;
+
+ *timestamp = 0;
+
+ if (WARN_ON_ONCE(in_nmi()))
+ return -EINVAL;
+
+ if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
+ !user_mode(task_pt_regs(current)))
+ return -EINVAL;
+
+ guard(irqsave)();
+
+ *timestamp = get_timestamp(info);
+
+ /* callback already pending? */
+ if (info->pending)
+ return 1;
+
+ /* The work has been claimed, now schedule it. */
+ ret = task_work_add(current, &info->work, TWA_RESUME);
+ if (WARN_ON_ONCE(ret))
+ return ret;
+
+ info->pending = 1;
+ return 0;
+}
+
+void unwind_deferred_cancel(struct unwind_work *work)
+{
+ if (!work)
+ return;
+
+ guard(mutex)(&callback_mutex);
+ list_del(&work->list);
+}
+
+int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
+{
+ memset(work, 0, sizeof(*work));
+
+ guard(mutex)(&callback_mutex);
+ list_add(&work->list, &callbacks);
+ work->func = func;
+ return 0;
+}
+
void unwind_task_init(struct task_struct *task)
{
struct unwind_task_info *info = &task->unwind_info;
memset(info, 0, sizeof(*info));
+ init_task_work(&info->work, unwind_deferred_task_work);
}
void unwind_task_free(struct task_struct *task)
@@ -74,4 +210,5 @@ void unwind_task_free(struct task_struct *task)
struct unwind_task_info *info = &task->unwind_info;
kfree(info->cache);
+ task_work_cancel(task, &info->work);
}
--
2.47.2
* [PATCH v12 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-07-01 0:53 [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (5 preceding siblings ...)
2025-07-01 0:53 ` [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
@ 2025-07-01 0:53 ` Steven Rostedt
2025-07-02 15:53 ` Jens Remus
2025-07-01 0:53 ` [PATCH v12 08/14] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
` (7 subsequent siblings)
14 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2025-07-01 0:53 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer
From: Steven Rostedt <rostedt@goodmis.org>
Make unwind_deferred_request() NMI-safe so tracers in NMI context can
call it and safely request a user space stacktrace when the task exits.
Note, this is only allowed for architectures that implement a safe 64 bit
cmpxchg, which rules out some 32 bit architectures and even some 64 bit
ones. If a deferred stack trace is requested from NMI context on an
architecture that does not support a safe NMI 64 bit cmpxchg, the request
will fail with -EINVAL. Those architectures would need another method
(perhaps an irq_work) to request a deferred user space stack trace. That
can be dealt with later if one of these architectures requires this
feature.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/unwind_deferred.h | 4 +-
include/linux/unwind_deferred_types.h | 7 ++-
kernel/unwind/deferred.c | 74 ++++++++++++++++++++++-----
3 files changed, 69 insertions(+), 16 deletions(-)
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index c6548e8d64d1..73f6cac53530 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -28,8 +28,8 @@ void unwind_deferred_cancel(struct unwind_work *work);
static __always_inline void unwind_reset_info(void)
{
- if (unlikely(current->unwind_info.timestamp))
- current->unwind_info.timestamp = 0;
+ if (unlikely(local64_read(&current->unwind_info.timestamp)))
+ local64_set(&current->unwind_info.timestamp, 0);
/*
* As unwind_user_faultable() can be called directly and
* depends on nr_entries being cleared on exit to user,
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 5df264cf81ad..0d722e877473 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -2,6 +2,9 @@
#ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+#include <asm/local64.h>
+#include <asm/local.h>
+
struct unwind_cache {
unsigned int nr_entries;
unsigned long entries[];
@@ -10,8 +13,8 @@ struct unwind_cache {
struct unwind_task_info {
struct unwind_cache *cache;
struct callback_head work;
- u64 timestamp;
- int pending;
+ local64_t timestamp;
+ local_t pending;
};
#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index d5f2c004a5b0..dd36e58c8cad 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -12,6 +12,35 @@
#include <linux/slab.h>
#include <linux/mm.h>
+/*
+ * For requesting a deferred user space stack trace from NMI context
+ * the architecture must support a 64bit safe cmpxchg in NMI context.
+ * For those architectures that do not have that, then it cannot ask
+ * for a deferred user space stack trace from an NMI context. If it
+ * does, then it will get -EINVAL.
+ */
+#if defined(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) && \
+ !defined(CONFIG_GENERIC_ATOMIC64)
+# define CAN_USE_IN_NMI 1
+static inline u64 assign_timestamp(struct unwind_task_info *info,
+ u64 timestamp)
+{
+ u64 old = 0;
+ if (!local64_try_cmpxchg(&info->timestamp, &old, timestamp))
+ timestamp = old;
+ return timestamp;
+}
+#else
+# define CAN_USE_IN_NMI 0
+static inline u64 assign_timestamp(struct unwind_task_info *info,
+ u64 timestamp)
+{
+ /* For archs that do not allow NMI here */
+ local64_set(&info->timestamp, timestamp);
+ return timestamp;
+}
+#endif
+
/* Make the cache fit in a 4K page */
#define UNWIND_MAX_ENTRIES \
((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
@@ -31,12 +60,21 @@ static LIST_HEAD(callbacks);
*/
static u64 get_timestamp(struct unwind_task_info *info)
{
+ u64 timestamp;
+
lockdep_assert_irqs_disabled();
- if (!info->timestamp)
- info->timestamp = local_clock();
+ /*
+ * Note, the timestamp is generated on the first request.
+ * If it exists here, then the timestamp is earlier than
+ * this request and it means that this request will be
+ * valid for the stacktrace.
+ */
+ timestamp = local64_read(&info->timestamp);
+ if (timestamp)
+ return timestamp;
- return info->timestamp;
+ return assign_timestamp(info, local_clock());
}
/**
@@ -96,11 +134,11 @@ static void unwind_deferred_task_work(struct callback_head *head)
struct unwind_work *work;
u64 timestamp;
- if (WARN_ON_ONCE(!info->pending))
+ if (WARN_ON_ONCE(!local_read(&info->pending)))
return;
/* Allow work to come in again */
- WRITE_ONCE(info->pending, 0);
+ local_set(&info->pending, 0);
/*
* From here on out, the callback must always be called, even if it's
@@ -111,7 +149,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
unwind_user_faultable(&trace);
- timestamp = info->timestamp;
+ timestamp = local64_read(&info->timestamp);
guard(mutex)(&callback_mutex);
list_for_each_entry(work, &callbacks, list) {
@@ -150,31 +188,43 @@ static void unwind_deferred_task_work(struct callback_head *head)
int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
{
struct unwind_task_info *info = &current->unwind_info;
+ long pending;
int ret;
*timestamp = 0;
- if (WARN_ON_ONCE(in_nmi()))
- return -EINVAL;
-
if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
!user_mode(task_pt_regs(current)))
return -EINVAL;
+ /* NMI requires having safe 64 bit cmpxchg operations */
+ if (!CAN_USE_IN_NMI && in_nmi())
+ return -EINVAL;
+
guard(irqsave)();
*timestamp = get_timestamp(info);
/* callback already pending? */
- if (info->pending)
+ pending = local_read(&info->pending);
+ if (pending)
return 1;
+ if (CAN_USE_IN_NMI) {
+ /* Claim the work unless an NMI just now swooped in to do so. */
+ if (!local_try_cmpxchg(&info->pending, &pending, 1))
+ return 1;
+ } else {
+ local_set(&info->pending, 1);
+ }
+
/* The work has been claimed, now schedule it. */
ret = task_work_add(current, &info->work, TWA_RESUME);
- if (WARN_ON_ONCE(ret))
+ if (WARN_ON_ONCE(ret)) {
+ local_set(&info->pending, 0);
return ret;
+ }
- info->pending = 1;
return 0;
}
--
2.47.2
* [PATCH v12 08/14] unwind deferred: Use bitmask to determine which callbacks to call
2025-07-01 0:53 [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (6 preceding siblings ...)
2025-07-01 0:53 ` [PATCH v12 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
@ 2025-07-01 0:53 ` Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 09/14] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
` (6 subsequent siblings)
14 siblings, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-01 0:53 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer
From: Steven Rostedt <rostedt@goodmis.org>
In order to know which registered callback requested a stacktrace for when
the task goes back to user space, add a bitmask to keep track of all
registered tracers. The bitmask is the size of a long, which means that on a
32 bit machine, it can have at most 32 registered tracers, and on 64 bit,
it can have at most 64 registered tracers. This should not be an issue as
there should not be more than 10 (unless BPF can abuse this?).
When a tracer registers with unwind_deferred_init() it will get a bit
number assigned to it. When a tracer requests a stacktrace, it will have
its bit set within the task_struct. When the task returns back to user
space, it will call the callbacks for all the registered tracers where
their bits are set in the task's mask.
When a tracer is removed via unwind_deferred_cancel(), all current tasks
will clear the associated bit, just in case another tracer gets registered
immediately afterward and then gets their callback called unexpectedly.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/unwind_deferred.h | 1 +
include/linux/unwind_deferred_types.h | 1 +
kernel/unwind/deferred.c | 36 ++++++++++++++++++++++++---
3 files changed, 34 insertions(+), 4 deletions(-)
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 73f6cac53530..538b4b7968dc 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -13,6 +13,7 @@ typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stackt
struct unwind_work {
struct list_head list;
unwind_callback_t func;
+ int bit;
};
#ifdef CONFIG_UNWIND_USER
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 0d722e877473..5863bf4eb436 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -13,6 +13,7 @@ struct unwind_cache {
struct unwind_task_info {
struct unwind_cache *cache;
struct callback_head work;
+ unsigned long unwind_mask;
local64_t timestamp;
local_t pending;
};
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index dd36e58c8cad..6c558d00ff41 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -48,6 +48,7 @@ static inline u64 assign_timestamp(struct unwind_task_info *info,
/* Guards adding to and reading the list of callbacks */
static DEFINE_MUTEX(callback_mutex);
static LIST_HEAD(callbacks);
+static unsigned long unwind_mask;
/*
* Read the task context timestamp, if this is the first caller then
@@ -153,7 +154,10 @@ static void unwind_deferred_task_work(struct callback_head *head)
guard(mutex)(&callback_mutex);
list_for_each_entry(work, &callbacks, list) {
- work->func(work, &trace, timestamp);
+ if (test_bit(work->bit, &info->unwind_mask)) {
+ work->func(work, &trace, timestamp);
+ clear_bit(work->bit, &info->unwind_mask);
+ }
}
}
@@ -205,15 +209,19 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
*timestamp = get_timestamp(info);
+ /* This is already queued */
+ if (test_bit(work->bit, &info->unwind_mask))
+ return 1;
+
/* callback already pending? */
pending = local_read(&info->pending);
if (pending)
- return 1;
+ goto out;
if (CAN_USE_IN_NMI) {
/* Claim the work unless an NMI just now swooped in to do so. */
if (!local_try_cmpxchg(&info->pending, &pending, 1))
- return 1;
+ goto out;
} else {
local_set(&info->pending, 1);
}
@@ -225,16 +233,27 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
return ret;
}
- return 0;
+ out:
+ return test_and_set_bit(work->bit, &info->unwind_mask);
}
void unwind_deferred_cancel(struct unwind_work *work)
{
+ struct task_struct *g, *t;
+
if (!work)
return;
guard(mutex)(&callback_mutex);
list_del(&work->list);
+
+ __clear_bit(work->bit, &unwind_mask);
+
+ guard(rcu)();
+ /* Clear this bit from all threads */
+ for_each_process_thread(g, t) {
+ clear_bit(work->bit, &t->unwind_info.unwind_mask);
+ }
}
int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
@@ -242,6 +261,14 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
memset(work, 0, sizeof(*work));
guard(mutex)(&callback_mutex);
+
+ /* See if there's a bit in the mask available */
+ if (unwind_mask == ~0UL)
+ return -EBUSY;
+
+ work->bit = ffz(unwind_mask);
+ __set_bit(work->bit, &unwind_mask);
+
list_add(&work->list, &callbacks);
work->func = func;
return 0;
@@ -253,6 +280,7 @@ void unwind_task_init(struct task_struct *task)
memset(info, 0, sizeof(*info));
init_task_work(&info->work, unwind_deferred_task_work);
+ info->unwind_mask = 0;
}
void unwind_task_free(struct task_struct *task)
--
2.47.2
^ permalink raw reply related [flat|nested] 53+ messages in thread
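The bit allocation done in unwind_deferred_init() above can be sketched in plain userspace C. The open-coded `ffz_ul()` and the helper names are illustrative assumptions for this sketch; the kernel provides ffz() and returns -EBUSY instead of -1.

```c
#include <limits.h>

/* Sketch: each registered unwind_work claims the first zero bit in a
 * global mask, and releases it on cancel. */
static unsigned long unwind_mask;

/* find first zero bit (the kernel's ffz()) */
static int ffz_ul(unsigned long mask)
{
	int bit;

	for (bit = 0; bit < (int)(sizeof(mask) * CHAR_BIT); bit++) {
		if (!(mask & (1UL << bit)))
			return bit;
	}
	return -1;
}

static int claim_bit(void)
{
	int bit;

	if (unwind_mask == ~0UL)
		return -1;	/* no free bit: -EBUSY in the kernel */
	bit = ffz_ul(unwind_mask);
	unwind_mask |= 1UL << bit;
	return bit;
}

static void release_bit(int bit)
{
	unwind_mask &= ~(1UL << bit);
}
```

The same bit number then indexes each task's per-task unwind_mask, so the task_work can tell which registered works still need their callback.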
* [PATCH v12 09/14] unwind deferred: Use SRCU in unwind_deferred_task_work()
2025-07-01 0:53 [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (7 preceding siblings ...)
2025-07-01 0:53 ` [PATCH v12 08/14] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
@ 2025-07-01 0:53 ` Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 10/14] unwind: Clear unwind_mask on exit back to user space Steven Rostedt
` (5 subsequent siblings)
14 siblings, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-01 0:53 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer
From: Steven Rostedt <rostedt@goodmis.org>
Instead of using the callback_mutex to protect the linked list of callbacks
in unwind_deferred_task_work(), use SRCU. This function is called every
time a task that has to record a requested stack trace returns to user
space, which can happen for many tasks on several CPUs at the same time. A
mutex here is a bottleneck that can cause contention and hurt performance.
As the callbacks themselves are allowed to sleep, regular RCU cannot be
used to protect the list. Instead use SRCU, as that still allows the
callbacks to sleep and the list can be read without needing to hold the
callback_mutex.
Link: https://lore.kernel.org/all/ca9bd83a-6c80-4ee0-a83c-224b9d60b755@efficios.com/
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/unwind/deferred.c | 35 ++++++++++++++++++++++++++---------
1 file changed, 26 insertions(+), 9 deletions(-)
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 6c558d00ff41..7309c9e0e57a 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -45,10 +45,11 @@ static inline u64 assign_timestamp(struct unwind_task_info *info,
#define UNWIND_MAX_ENTRIES \
((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
-/* Guards adding to and reading the list of callbacks */
+/* Guards adding to or removing from the list of callbacks */
static DEFINE_MUTEX(callback_mutex);
static LIST_HEAD(callbacks);
static unsigned long unwind_mask;
+DEFINE_STATIC_SRCU(unwind_srcu);
/*
* Read the task context timestamp, if this is the first caller then
@@ -134,6 +135,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
struct unwind_stacktrace trace;
struct unwind_work *work;
u64 timestamp;
+ int idx;
if (WARN_ON_ONCE(!local_read(&info->pending)))
return;
@@ -152,13 +154,15 @@ static void unwind_deferred_task_work(struct callback_head *head)
timestamp = local64_read(&info->timestamp);
- guard(mutex)(&callback_mutex);
- list_for_each_entry(work, &callbacks, list) {
+ idx = srcu_read_lock(&unwind_srcu);
+ list_for_each_entry_srcu(work, &callbacks, list,
+ srcu_read_lock_held(&unwind_srcu)) {
if (test_bit(work->bit, &info->unwind_mask)) {
work->func(work, &trace, timestamp);
clear_bit(work->bit, &info->unwind_mask);
}
}
+ srcu_read_unlock(&unwind_srcu, idx);
}
/**
@@ -193,6 +197,7 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
{
struct unwind_task_info *info = &current->unwind_info;
long pending;
+ int bit;
int ret;
*timestamp = 0;
@@ -205,12 +210,17 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
if (!CAN_USE_IN_NMI && in_nmi())
return -EINVAL;
+ /* Do not allow cancelled works to request again */
+ bit = READ_ONCE(work->bit);
+ if (WARN_ON_ONCE(bit < 0))
+ return -EINVAL;
+
guard(irqsave)();
*timestamp = get_timestamp(info);
/* This is already queued */
- if (test_bit(work->bit, &info->unwind_mask))
+ if (test_bit(bit, &info->unwind_mask))
return 1;
/* callback already pending? */
@@ -234,25 +244,32 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
}
out:
- return test_and_set_bit(work->bit, &info->unwind_mask);
+ return test_and_set_bit(bit, &info->unwind_mask);
}
void unwind_deferred_cancel(struct unwind_work *work)
{
struct task_struct *g, *t;
+ int bit;
if (!work)
return;
guard(mutex)(&callback_mutex);
- list_del(&work->list);
+ list_del_rcu(&work->list);
+ bit = work->bit;
+
+ /* Do not allow any more requests and prevent callbacks */
+ work->bit = -1;
+
+ __clear_bit(bit, &unwind_mask);
- __clear_bit(work->bit, &unwind_mask);
+ synchronize_srcu(&unwind_srcu);
guard(rcu)();
/* Clear this bit from all threads */
for_each_process_thread(g, t) {
- clear_bit(work->bit, &t->unwind_info.unwind_mask);
+ clear_bit(bit, &t->unwind_info.unwind_mask);
}
}
@@ -269,7 +286,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
work->bit = ffz(unwind_mask);
__set_bit(work->bit, &unwind_mask);
- list_add(&work->list, &callbacks);
+ list_add_rcu(&work->list, &callbacks);
work->func = func;
return 0;
}
--
2.47.2
* [PATCH v12 10/14] unwind: Clear unwind_mask on exit back to user space
2025-07-01 0:53 [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (8 preceding siblings ...)
2025-07-01 0:53 ` [PATCH v12 09/14] unwind deferred: Use SRCU in unwind_deferred_task_work() Steven Rostedt
@ 2025-07-01 0:53 ` Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 11/14] unwind: Add USED bit to only have one conditional on way " Steven Rostedt
` (4 subsequent siblings)
14 siblings, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-01 0:53 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer
From: Steven Rostedt <rostedt@goodmis.org>
When testing the deferred unwinder by attaching deferred user space
stacktraces to events, a livelock happened. This occurred when the deferred
unwinding was added to the irqs_disabled event, which triggers after the
task_work callbacks are called and before the task goes back to user
space.
The event callback would be registered while irqs were disabled, the
task_work would trigger, call the callback for this work, and clear the
work's bit. Then, before getting back to user space, irqs would be disabled
again, the event would trigger again, and a new task_work would be
registered. This caused an infinite loop and the system hung.
To prevent this, clear the bits at the very last moment before going back
to user space and when instrumentation is disabled. That is in
unwind_exit_to_user_mode().
Move the pending bit from a value on the task_struct to the most
significant bit of the unwind_mask (saves space on the task_struct). This
will allow modifying the pending bit along with the work bits atomically.
Instead of clearing a work's bit right after its callback is called, the
clearing is delayed until exit. If the work is requested again in the same
entry context, the task_work is not queued again and the requester is
notified that its callback has already been called (via the
UNWIND_ALREADY_EXECUTED return value).
The pending bit is cleared before calling the callback functions but the
current work bits remain. If one of the called works registers again, it
will not trigger a task_work if its bit is still present in the task's
unwind_mask.
If a new work registers, then it will set both the pending bit and its own
bit but clear the other work bits so that their callbacks do not get
called again.
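The pending-bit handling described above can be sketched in plain userspace C with C11 atomics. This is an illustrative model, not the kernel code: the helper names, the use of `<stdatomic.h>`, and the simplified two-function split are assumptions made for the sketch; the kernel uses `try_cmpxchg()` on the task's `unwind_mask`.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch: the PENDING flag lives in the most significant bit of the
 * same word as the per-work bits, so both can be updated atomically. */
#define BITS_PER_LONG		(sizeof(unsigned long) * CHAR_BIT)
#define UNWIND_PENDING_BIT	(BITS_PER_LONG - 1)
#define UNWIND_PENDING		(1UL << UNWIND_PENDING_BIT)

static _Atomic unsigned long unwind_mask;

/* First requester since kernel entry: set PENDING plus only its own
 * bit, clearing bits of works whose callbacks already ran. */
static bool claim_pending(int bit)
{
	unsigned long old = atomic_load(&unwind_mask);

	if (old & UNWIND_PENDING)
		return false;	/* task_work is already queued */
	return atomic_compare_exchange_strong(&unwind_mask, &old,
					      UNWIND_PENDING | (1UL << bit));
}

/* task_work side: clear PENDING but keep the current work bits, so a
 * re-request by a called work does not queue another task_work. */
static unsigned long clear_pending(void)
{
	unsigned long bits = atomic_load(&unwind_mask);

	while (!atomic_compare_exchange_weak(&unwind_mask, &bits,
					     bits & ~UNWIND_PENDING))
		;
	return bits & ~UNWIND_PENDING;
}
```

The single-word layout is the design point: a work's bit and the pending state change in one compare-exchange, so an NMI racing with the task context cannot observe one without the other.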
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v11: https://lore.kernel.org/20250625225716.505389511@goodmis.org
- Still clear info->cache->nr_entries in unwind_reset_info()
If unwind_user_faultable() is called directly (like perf will do)
it still needs to clear the nr_entries as calling the function
directly does not set the bit.
Later code will change this so it has only one conditional to check
when not enabled.
include/linux/unwind_deferred.h | 25 +++++++--
include/linux/unwind_deferred_types.h | 1 -
kernel/unwind/deferred.c | 76 ++++++++++++++++++---------
3 files changed, 74 insertions(+), 28 deletions(-)
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 538b4b7968dc..d25a72fb21ef 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -18,6 +18,14 @@ struct unwind_work {
#ifdef CONFIG_UNWIND_USER
+#define UNWIND_PENDING_BIT (BITS_PER_LONG - 1)
+#define UNWIND_PENDING BIT(UNWIND_PENDING_BIT)
+
+enum {
+ UNWIND_ALREADY_PENDING = 1,
+ UNWIND_ALREADY_EXECUTED = 2,
+};
+
void unwind_task_init(struct task_struct *task);
void unwind_task_free(struct task_struct *task);
@@ -29,15 +37,26 @@ void unwind_deferred_cancel(struct unwind_work *work);
static __always_inline void unwind_reset_info(void)
{
- if (unlikely(local64_read(&current->unwind_info.timestamp)))
+ struct unwind_task_info *info = &current->unwind_info;
+ unsigned long bits;
+
+ /* Was there any unwinding? */
+ if (unlikely(info->unwind_mask)) {
+ bits = info->unwind_mask;
+ do {
+ /* Is a task_work going to run again before going back */
+ if (bits & UNWIND_PENDING)
+ return;
+ } while (!try_cmpxchg(&info->unwind_mask, &bits, 0UL));
local64_set(&current->unwind_info.timestamp, 0);
+ }
/*
* As unwind_user_faultable() can be called directly and
* depends on nr_entries being cleared on exit to user,
* this needs to be a separate conditional.
*/
- if (unlikely(current->unwind_info.cache))
- current->unwind_info.cache->nr_entries = 0;
+ if (unlikely(info->cache))
+ info->cache->nr_entries = 0;
}
#else /* !CONFIG_UNWIND_USER */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 5863bf4eb436..4308367f1887 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -15,7 +15,6 @@ struct unwind_task_info {
struct callback_head work;
unsigned long unwind_mask;
local64_t timestamp;
- local_t pending;
};
#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 7309c9e0e57a..e7e4442926d3 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -51,6 +51,11 @@ static LIST_HEAD(callbacks);
static unsigned long unwind_mask;
DEFINE_STATIC_SRCU(unwind_srcu);
+static inline bool unwind_pending(struct unwind_task_info *info)
+{
+ return test_bit(UNWIND_PENDING_BIT, &info->unwind_mask);
+}
+
/*
* Read the task context timestamp, if this is the first caller then
* it will set the timestamp.
@@ -134,14 +139,17 @@ static void unwind_deferred_task_work(struct callback_head *head)
struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
struct unwind_stacktrace trace;
struct unwind_work *work;
+ unsigned long bits;
u64 timestamp;
int idx;
- if (WARN_ON_ONCE(!local_read(&info->pending)))
+ if (WARN_ON_ONCE(!unwind_pending(info)))
return;
- /* Allow work to come in again */
- local_set(&info->pending, 0);
+ /* Clear pending bit but make sure to have the current bits */
+ bits = READ_ONCE(info->unwind_mask);
+ while (!try_cmpxchg(&info->unwind_mask, &bits, bits & ~UNWIND_PENDING))
+ ;
/*
* From here on out, the callback must always be called, even if it's
@@ -157,10 +165,8 @@ static void unwind_deferred_task_work(struct callback_head *head)
idx = srcu_read_lock(&unwind_srcu);
list_for_each_entry_srcu(work, &callbacks, list,
srcu_read_lock_held(&unwind_srcu)) {
- if (test_bit(work->bit, &info->unwind_mask)) {
+ if (test_bit(work->bit, &bits))
work->func(work, &trace, timestamp);
- clear_bit(work->bit, &info->unwind_mask);
- }
}
srcu_read_unlock(&unwind_srcu, idx);
}
@@ -188,15 +194,17 @@ static void unwind_deferred_task_work(struct callback_head *head)
* it has already been previously called for the same entry context, it will be
* called again with the same stack trace and timestamp.
*
- * Return: 1 if the the callback was already queued.
- * 0 if the callback successfully was queued.
+ * Return: 0 if the callback successfully was queued.
+ * UNWIND_ALREADY_PENDING if the callback was already queued.
+ * UNWIND_ALREADY_EXECUTED if the callback was already called
+ * (and will not be called again)
* Negative if there's an error.
* @timestamp holds the timestamp of the first request by any user
*/
int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
{
struct unwind_task_info *info = &current->unwind_info;
- long pending;
+ unsigned long old, bits;
int bit;
int ret;
@@ -219,32 +227,52 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
*timestamp = get_timestamp(info);
- /* This is already queued */
- if (test_bit(bit, &info->unwind_mask))
- return 1;
+ old = READ_ONCE(info->unwind_mask);
+
+ /* Is this already queued */
+ if (test_bit(bit, &old)) {
+ /*
+ * If pending is not set, it means this work's callback
+ * was already called.
+ */
+ return old & UNWIND_PENDING ? UNWIND_ALREADY_PENDING :
+ UNWIND_ALREADY_EXECUTED;
+ }
- /* callback already pending? */
- pending = local_read(&info->pending);
- if (pending)
+ if (unwind_pending(info))
goto out;
+ /*
+ * This is the first to enable another task_work for this task since
+ * the task entered the kernel, or had already called the callbacks.
+ * Set only the bit for this work and clear all others as they have
+ * already had their callbacks called, and do not need to call them
+ * again because of this work.
+ */
+ bits = UNWIND_PENDING | BIT(bit);
+
+ /*
+ * If the cmpxchg() fails, it means that an NMI came in and set
+ * the pending bit as well as cleared the other bits. Just
+ * jump to setting the bit for this work.
+ */
if (CAN_USE_IN_NMI) {
- /* Claim the work unless an NMI just now swooped in to do so. */
- if (!local_try_cmpxchg(&info->pending, &pending, 1))
+ if (!try_cmpxchg(&info->unwind_mask, &old, bits))
goto out;
} else {
- local_set(&info->pending, 1);
+ info->unwind_mask = bits;
}
/* The work has been claimed, now schedule it. */
ret = task_work_add(current, &info->work, TWA_RESUME);
- if (WARN_ON_ONCE(ret)) {
- local_set(&info->pending, 0);
- return ret;
- }
+ if (WARN_ON_ONCE(ret))
+ WRITE_ONCE(info->unwind_mask, 0);
+
+ return ret;
out:
- return test_and_set_bit(bit, &info->unwind_mask);
+ return test_and_set_bit(bit, &info->unwind_mask) ?
+ UNWIND_ALREADY_PENDING : 0;
}
void unwind_deferred_cancel(struct unwind_work *work)
@@ -280,7 +308,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
guard(mutex)(&callback_mutex);
/* See if there's a bit in the mask available */
- if (unwind_mask == ~0UL)
+ if (unwind_mask == ~(UNWIND_PENDING))
return -EBUSY;
work->bit = ffz(unwind_mask);
--
2.47.2
* [PATCH v12 11/14] unwind: Add USED bit to only have one conditional on way back to user space
2025-07-01 0:53 [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (9 preceding siblings ...)
2025-07-01 0:53 ` [PATCH v12 10/14] unwind: Clear unwind_mask on exit back to user space Steven Rostedt
@ 2025-07-01 0:53 ` Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 12/14] unwind: Finish up unwind when a task exits Steven Rostedt
` (3 subsequent siblings)
14 siblings, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-01 0:53 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer
From: Steven Rostedt <rostedt@goodmis.org>
On the way back to user space, the function unwind_reset_info() is called
unconditionally (but always inlined). It currently has two conditionals.
One checks the unwind_mask, which is set whenever a deferred trace is
requested and is used to know that the mask needs to be cleared. The other
checks if the cache has been allocated, and if so, resets nr_entries so
that the unwinder knows it needs to do the work to get a new user space
stack trace again (the unwind is only performed once per entry into the
kernel).
Use one of the bits in the unwind mask as a "USED" bit that gets set
whenever a trace is created. This will make it possible to only check the
unwind_mask in the unwind_reset_info() to know if it needs to do work or
not and eliminates a conditional that happens every time the task goes
back to user space.
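The consolidated fast path can be modeled in userspace C. This is a hedged sketch of the logic, not the kernel implementation: `reset_info()`, the global `nr_entries` standing in for `info->cache->nr_entries`, and the C11 atomics are assumptions of the model.

```c
#include <limits.h>
#include <stdatomic.h>

#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define UNWIND_PENDING	(1UL << (BITS_PER_LONG - 1))
#define UNWIND_USED	(1UL << (BITS_PER_LONG - 2))

static _Atomic unsigned long unwind_mask;
static int nr_entries = 3;	/* stands in for info->cache->nr_entries */

/* Sketch of the unwind_reset_info() fast path: a single test of
 * unwind_mask decides whether anything needs resetting on exit to
 * user space. */
static void reset_info(void)
{
	unsigned long bits = atomic_load(&unwind_mask);

	if (!bits)		/* common case: nothing was unwound */
		return;
	do {
		if (bits & UNWIND_PENDING)	/* task_work runs again */
			return;
	} while (!atomic_compare_exchange_weak(&unwind_mask, &bits, 0UL));

	/* USED implies a trace was cached, so invalidate it */
	nr_entries = 0;
}
```

Since UNWIND_USED is set whenever a trace is created (including by a direct unwind_user_faultable() call), the zero-mask test covers both the "mask needs clearing" and "cache needs clearing" cases in one conditional.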
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/unwind_deferred.h | 14 +++++++-------
kernel/unwind/deferred.c | 5 ++++-
2 files changed, 11 insertions(+), 8 deletions(-)
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index d25a72fb21ef..a1c62097f142 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -21,6 +21,10 @@ struct unwind_work {
#define UNWIND_PENDING_BIT (BITS_PER_LONG - 1)
#define UNWIND_PENDING BIT(UNWIND_PENDING_BIT)
+/* Set if the unwinding was used (directly or deferred) */
+#define UNWIND_USED_BIT (UNWIND_PENDING_BIT - 1)
+#define UNWIND_USED BIT(UNWIND_USED_BIT)
+
enum {
UNWIND_ALREADY_PENDING = 1,
UNWIND_ALREADY_EXECUTED = 2,
@@ -49,14 +53,10 @@ static __always_inline void unwind_reset_info(void)
return;
} while (!try_cmpxchg(&info->unwind_mask, &bits, 0UL));
local64_set(¤t->unwind_info.timestamp, 0);
+
+ if (unlikely(info->cache))
+ info->cache->nr_entries = 0;
}
- /*
- * As unwind_user_faultable() can be called directly and
- * depends on nr_entries being cleared on exit to user,
- * this needs to be a separate conditional.
- */
- if (unlikely(info->cache))
- info->cache->nr_entries = 0;
}
#else /* !CONFIG_UNWIND_USER */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index e7e4442926d3..5ab9b9045ae5 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -131,6 +131,9 @@ int unwind_user_faultable(struct unwind_stacktrace *trace)
cache->nr_entries = trace->nr;
+ /* Clear nr_entries on way back to user space */
+ set_bit(UNWIND_USED_BIT, &info->unwind_mask);
+
return 0;
}
@@ -308,7 +311,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
guard(mutex)(&callback_mutex);
/* See if there's a bit in the mask available */
- if (unwind_mask == ~(UNWIND_PENDING))
+ if (unwind_mask == ~(UNWIND_PENDING|UNWIND_USED))
return -EBUSY;
work->bit = ffz(unwind_mask);
--
2.47.2
* [PATCH v12 12/14] unwind: Finish up unwind when a task exits
2025-07-01 0:53 [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (10 preceding siblings ...)
2025-07-01 0:53 ` [PATCH v12 11/14] unwind: Add USED bit to only have one conditional on way " Steven Rostedt
@ 2025-07-01 0:53 ` Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 13/14] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
` (2 subsequent siblings)
14 siblings, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-01 0:53 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer
From: Steven Rostedt <rostedt@goodmis.org>
In do_exit(), when a task is exiting, if an unwind was requested and the
user stacktrace was deferred via task_work, the task_work callback is
called after exit_mm(). This means that the user stack trace can no longer
be retrieved and an empty stack is recorded.
Instead, add a function unwind_deferred_task_exit() and call it just
before exit_mm() so that the unwinder can call the requested callbacks
with the user space stack.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/unwind_deferred.h | 3 +++
kernel/exit.c | 2 ++
kernel/unwind/deferred.c | 23 ++++++++++++++++++++---
3 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index a1c62097f142..44654e6149ec 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -39,6 +39,8 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
int unwind_deferred_request(struct unwind_work *work, u64 *timestamp);
void unwind_deferred_cancel(struct unwind_work *work);
+void unwind_deferred_task_exit(struct task_struct *task);
+
static __always_inline void unwind_reset_info(void)
{
struct unwind_task_info *info = &current->unwind_info;
@@ -69,6 +71,7 @@ static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback
static inline int unwind_deferred_request(struct unwind_work *work, u64 *timestamp) { return -ENOSYS; }
static inline void unwind_deferred_cancel(struct unwind_work *work) {}
+static inline void unwind_deferred_task_exit(struct task_struct *task) {}
static inline void unwind_reset_info(void) {}
#endif /* !CONFIG_UNWIND_USER */
diff --git a/kernel/exit.c b/kernel/exit.c
index bb184a67ac73..1d8c8ac33c4f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -68,6 +68,7 @@
#include <linux/rethook.h>
#include <linux/sysfs.h>
#include <linux/user_events.h>
+#include <linux/unwind_deferred.h>
#include <linux/uaccess.h>
#include <linux/pidfs.h>
@@ -938,6 +939,7 @@ void __noreturn do_exit(long code)
tsk->exit_code = code;
taskstats_exit(tsk, group_dead);
+ unwind_deferred_task_exit(tsk);
trace_sched_process_exit(tsk, group_dead);
/*
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 5ab9b9045ae5..9ec1e74c6469 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -104,7 +104,7 @@ int unwind_user_faultable(struct unwind_stacktrace *trace)
/* Should always be called from faultable context */
might_fault();
- if (current->flags & PF_EXITING)
+ if (!current->mm)
return -EINVAL;
if (!info->cache) {
@@ -137,9 +137,9 @@ int unwind_user_faultable(struct unwind_stacktrace *trace)
return 0;
}
-static void unwind_deferred_task_work(struct callback_head *head)
+static void process_unwind_deferred(struct task_struct *task)
{
- struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
+ struct unwind_task_info *info = &task->unwind_info;
struct unwind_stacktrace trace;
struct unwind_work *work;
unsigned long bits;
@@ -174,6 +174,23 @@ static void unwind_deferred_task_work(struct callback_head *head)
srcu_read_unlock(&unwind_srcu, idx);
}
+static void unwind_deferred_task_work(struct callback_head *head)
+{
+ process_unwind_deferred(current);
+}
+
+void unwind_deferred_task_exit(struct task_struct *task)
+{
+ struct unwind_task_info *info = &current->unwind_info;
+
+ if (!unwind_pending(info))
+ return;
+
+ process_unwind_deferred(task);
+
+ task_work_cancel(task, &info->work);
+}
+
/**
* unwind_deferred_request - Request a user stacktrace on task exit
* @work: Unwind descriptor requesting the trace
--
2.47.2
* [PATCH v12 13/14] unwind_user/x86: Enable frame pointer unwinding on x86
2025-07-01 0:53 [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (11 preceding siblings ...)
2025-07-01 0:53 ` [PATCH v12 12/14] unwind: Finish up unwind when a task exits Steven Rostedt
@ 2025-07-01 0:53 ` Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 14/14] unwind_user/x86: Enable compat mode " Steven Rostedt
2025-07-01 2:06 ` [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure Linus Torvalds
14 siblings, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-01 0:53 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer
From: Josh Poimboeuf <jpoimboe@kernel.org>
Use ARCH_INIT_USER_FP_FRAME to describe how frame pointers are unwound
on x86, and enable CONFIG_HAVE_UNWIND_USER_FP accordingly so the
unwind_user interfaces can be used.
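The offsets in ARCH_INIT_USER_FP_FRAME (cfa = fp + 2 words, return address one word below the CFA, previous fp two words below it) can be exercised against a hand-built fake stack. This is a demonstration of the arithmetic only, assuming a 64-bit word; the `walk()` helper is invented for the sketch and is not the kernel unwinder.

```c
/* Fake user stack in an array: with frame pointers, *fp holds the
 * previous fp (cfa - 2 words) and *(fp + 1 word) holds the return
 * address (cfa - 1 word), matching ARCH_INIT_USER_FP_FRAME on x86-64. */
static unsigned long stack[6];

static int walk(unsigned long fp, unsigned long *ras, int max)
{
	int n = 0;

	while (fp && n < max) {
		unsigned long *frame = (unsigned long *)fp;

		ras[n++] = frame[1];	/* ra_off: cfa - sizeof(long) */
		fp = frame[0];		/* fp_off: cfa - 2*sizeof(long) */
	}
	return n;
}
```

A walk starting at the innermost frame follows the saved-fp chain until it reads a NULL previous fp, collecting one return address per frame.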
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/unwind_user.h | 11 +++++++++++
2 files changed, 12 insertions(+)
create mode 100644 arch/x86/include/asm/unwind_user.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 71019b3b54ea..5862433c81e1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -302,6 +302,7 @@ config X86
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL
select HAVE_UNSTABLE_SCHED_CLOCK
+ select HAVE_UNWIND_USER_FP if X86_64
select HAVE_USER_RETURN_NOTIFIER
select HAVE_GENERIC_VDSO
select VDSO_GETRANDOM if X86_64
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
new file mode 100644
index 000000000000..8597857bf896
--- /dev/null
+++ b/arch/x86/include/asm/unwind_user.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_UNWIND_USER_H
+#define _ASM_X86_UNWIND_USER_H
+
+#define ARCH_INIT_USER_FP_FRAME \
+ .cfa_off = (s32)sizeof(long) * 2, \
+ .ra_off = (s32)sizeof(long) * -1, \
+ .fp_off = (s32)sizeof(long) * -2, \
+ .use_fp = true,
+
+#endif /* _ASM_X86_UNWIND_USER_H */
--
2.47.2
* [PATCH v12 14/14] unwind_user/x86: Enable compat mode frame pointer unwinding on x86
2025-07-01 0:53 [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (12 preceding siblings ...)
2025-07-01 0:53 ` [PATCH v12 13/14] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
@ 2025-07-01 0:53 ` Steven Rostedt
2025-07-01 2:06 ` [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure Linus Torvalds
14 siblings, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-01 0:53 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer
From: Josh Poimboeuf <jpoimboe@kernel.org>
Use ARCH_INIT_USER_COMPAT_FP_FRAME to describe how frame pointers are
unwound on x86, and implement the hooks needed to add the segment base
addresses. Enable HAVE_UNWIND_USER_COMPAT_FP if the system has compat
mode compiled in.
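The segment-base adjustment can be sketched in userspace C. The struct layout and helper names here are illustrative stand-ins for the patch's `arch_unwind_user_init()`/`arch_unwind_user_next()`; the `-1` check mirrors the handling of an invalid return from insn_get_seg_base().

```c
/* Sketch: in IA32 compat mode, addresses read during the 32-bit frame
 * walk are segment-relative, so the CS/SS bases are added back in. */
struct state {
	unsigned long ip, sp, fp;
	unsigned long cs_base, ss_base;
};

static void compat_init(struct state *s, unsigned long cs_base,
			unsigned long ss_base)
{
	/* a base of -1 means "could not be determined": fall back to 0 */
	s->cs_base = (cs_base == (unsigned long)-1) ? 0 : cs_base;
	s->ss_base = (ss_base == (unsigned long)-1) ? 0 : ss_base;
	s->ip += s->cs_base;
	s->sp += s->ss_base;
	s->fp += s->ss_base;
}

static void compat_next(struct state *s)
{
	/* the ip/fp just read from the previous frame are also
	 * segment-relative and need the same rebasing */
	s->ip += s->cs_base;
	s->fp += s->ss_base;
}
```

Code addresses are rebased against CS while stack addresses (sp, fp) are rebased against SS, which is why the init hook records both bases in the arch state for use by the per-frame hook.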
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v11: https://lore.kernel.org/20250625225717.187191105@goodmis.org
- Fix header macro protection name to include X86 (Ingo Molnar)
- Use insn_get_seg_base() to get segment registers instead of using the
function perf uses and making it global. Also as that function doesn't
look to have a requirement to disable interrupts, the scoped_guard(irqsave)
is removed.
- Check return code of insn_get_seg_base() for the unlikely event that it
returns invalid (-1).
- Moved arch_unwind_user_init() into stacktrace.c as to use
insn_get_seg_base(), it must include insn-eval.h that defines
pt_regs_offset(), but that is also used in the perf generic code as an
array and if it is included in the header file, it causes a build
conflict.
- Update the comments that explain arch_unwind_user_init/next that a macro
needs to be defined with those names if they are going to be used.
arch/x86/Kconfig | 1 +
arch/x86/include/asm/unwind_user.h | 31 ++++++++++++++++++++++++
arch/x86/include/asm/unwind_user_types.h | 17 +++++++++++++
arch/x86/kernel/stacktrace.c | 28 +++++++++++++++++++++
include/linux/unwind_user.h | 20 +++++++++++++++
kernel/unwind/user.c | 4 +++
6 files changed, 101 insertions(+)
create mode 100644 arch/x86/include/asm/unwind_user_types.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5862433c81e1..17d4094c821b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -302,6 +302,7 @@ config X86
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL
select HAVE_UNSTABLE_SCHED_CLOCK
+ select HAVE_UNWIND_USER_COMPAT_FP if IA32_EMULATION
select HAVE_UNWIND_USER_FP if X86_64
select HAVE_USER_RETURN_NOTIFIER
select HAVE_GENERIC_VDSO
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
index 8597857bf896..19634a73612d 100644
--- a/arch/x86/include/asm/unwind_user.h
+++ b/arch/x86/include/asm/unwind_user.h
@@ -2,10 +2,41 @@
#ifndef _ASM_X86_UNWIND_USER_H
#define _ASM_X86_UNWIND_USER_H
+#include <linux/unwind_user_types.h>
+
#define ARCH_INIT_USER_FP_FRAME \
.cfa_off = (s32)sizeof(long) * 2, \
.ra_off = (s32)sizeof(long) * -1, \
.fp_off = (s32)sizeof(long) * -2, \
.use_fp = true,
+#ifdef CONFIG_IA32_EMULATION
+
+#define ARCH_INIT_USER_COMPAT_FP_FRAME \
+ .cfa_off = (s32)sizeof(u32) * 2, \
+ .ra_off = (s32)sizeof(u32) * -1, \
+ .fp_off = (s32)sizeof(u32) * -2, \
+ .use_fp = true,
+
+#define in_compat_mode(regs) !user_64bit_mode(regs)
+
+void arch_unwind_user_init(struct unwind_user_state *state,
+ struct pt_regs *regs);
+
+static inline void arch_unwind_user_next(struct unwind_user_state *state)
+{
+ if (state->type != UNWIND_USER_TYPE_COMPAT_FP)
+ return;
+
+ state->ip += state->arch.cs_base;
+ state->fp += state->arch.ss_base;
+}
+
+#define arch_unwind_user_init arch_unwind_user_init
+#define arch_unwind_user_next arch_unwind_user_next
+
+#endif /* CONFIG_IA32_EMULATION */
+
+#include <asm-generic/unwind_user.h>
+
#endif /* _ASM_X86_UNWIND_USER_H */
diff --git a/arch/x86/include/asm/unwind_user_types.h b/arch/x86/include/asm/unwind_user_types.h
new file mode 100644
index 000000000000..f93d535f900e
--- /dev/null
+++ b/arch/x86/include/asm/unwind_user_types.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_UNWIND_USER_TYPES_H
+#define _ASM_X86_UNWIND_USER_TYPES_H
+
+#ifdef CONFIG_IA32_EMULATION
+
+struct arch_unwind_user_state {
+ unsigned long ss_base;
+ unsigned long cs_base;
+};
+#define arch_unwind_user_state arch_unwind_user_state
+
+#endif /* CONFIG_IA32_EMULATION */
+
+#include <asm-generic/unwind_user_types.h>
+
+#endif /* _ASM_UNWIND_USER_TYPES_H */
diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index ee117fcf46ed..8ef9d8c71df9 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -9,7 +9,10 @@
#include <linux/stacktrace.h>
#include <linux/export.h>
#include <linux/uaccess.h>
+#include <asm/unwind_user.h>
#include <asm/stacktrace.h>
+#include <asm/insn.h>
+#include <asm/insn-eval.h>
#include <asm/unwind.h>
void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
@@ -128,3 +131,28 @@ void arch_stack_walk_user(stack_trace_consume_fn consume_entry, void *cookie,
}
}
+#ifdef CONFIG_IA32_EMULATION
+void arch_unwind_user_init(struct unwind_user_state *state,
+ struct pt_regs *regs)
+{
+ unsigned long cs_base, ss_base;
+
+ if (state->type != UNWIND_USER_TYPE_COMPAT_FP)
+ return;
+
+ cs_base = insn_get_seg_base(regs, INAT_SEG_REG_CS);
+ ss_base = insn_get_seg_base(regs, INAT_SEG_REG_SS);
+
+ if (cs_base == -1)
+ cs_base = 0;
+ if (ss_base == -1)
+ ss_base = 0;
+
+ state->arch.cs_base = cs_base;
+ state->arch.ss_base = ss_base;
+
+ state->ip += cs_base;
+ state->sp += ss_base;
+ state->fp += ss_base;
+}
+#endif /* CONFIG_IA32_EMULATION */
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index ac007363820a..b57b68215c6f 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -14,6 +14,26 @@
#define in_compat_mode(regs) false
#endif
+/*
+ * If an architecture needs to initialize the state for a specific
+ * reason, for example, it may need to do something different
+ * in compat mode, it can define a macro named arch_unwind_user_init
+ * with the name of the function that will perform this initialization.
+ */
+#ifndef arch_unwind_user_init
+static inline void arch_unwind_user_init(struct unwind_user_state *state, struct pt_regs *regs) {}
+#endif
+
+/*
+ * If an architecture requires some more updates to the state between
+ * stack frames, it can define a macro named arch_unwind_user_next
+ * with the name of the function that will update the state between
+ * reading stack frames during the user space stack walk.
+ */
+#ifndef arch_unwind_user_next
+static inline void arch_unwind_user_next(struct unwind_user_state *state) {}
+#endif
+
int unwind_user_start(struct unwind_user_state *state);
int unwind_user_next(struct unwind_user_state *state);
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 3a0ac4346f5b..2bb7995c3f23 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -72,6 +72,8 @@ int unwind_user_next(struct unwind_user_state *state)
if (frame->fp_off)
state->fp = fp;
+ arch_unwind_user_next(state);
+
return 0;
done:
@@ -101,6 +103,8 @@ int unwind_user_start(struct unwind_user_state *state)
state->sp = user_stack_pointer(regs);
state->fp = frame_pointer(regs);
+ arch_unwind_user_init(state, regs);
+
return 0;
}
--
2.47.2
^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure
2025-07-01 0:53 [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
` (13 preceding siblings ...)
2025-07-01 0:53 ` [PATCH v12 14/14] unwind_user/x86: Enable compat mode " Steven Rostedt
@ 2025-07-01 2:06 ` Linus Torvalds
2025-07-01 2:45 ` Steven Rostedt
14 siblings, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2025-07-01 2:06 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer
On Mon, 30 Jun 2025 at 17:54, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> This is the first patch series of a set that will make it possible to be able
> to use SFrames[1] in the Linux kernel. A quick recap of the motivation for
> doing this.
You have a '[1]' to indicate there's a link to what SFrames are.
But no such link actually exists in this email. Hmm?
Linus
* Re: [PATCH v12 02/14] unwind_user: Add frame pointer support
2025-07-01 0:53 ` [PATCH v12 02/14] unwind_user: Add frame pointer support Steven Rostedt
@ 2025-07-01 2:10 ` Linus Torvalds
2025-07-01 2:56 ` Steven Rostedt
2025-07-01 4:46 ` Florian Weimer
0 siblings, 2 replies; 53+ messages in thread
From: Linus Torvalds @ 2025-07-01 2:10 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer
On Mon, 30 Jun 2025 at 17:54, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> + /* stack going in wrong direction? */
> + if (cfa <= state->sp)
> + goto done;
I suspect this should do a lot more testing.
> + /* Find the Return Address (RA) */
> + if (get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
> + goto done;
> +
> + if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
> + goto done;
.. and this should check the frame for validity too. At a minimum it
should be properly aligned, but things like "it had better be above
the current frame" to avoid having some loop would seem to be a good
idea.
Maybe even check that it's the same vma?
Linus
* Re: [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure
2025-07-01 2:06 ` [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure Linus Torvalds
@ 2025-07-01 2:45 ` Steven Rostedt
2025-07-01 22:49 ` Kees Cook
0 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2025-07-01 2:45 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer
On Mon, 30 Jun 2025 19:06:12 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Mon, 30 Jun 2025 at 17:54, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > This is the first patch series of a set that will make it possible to be able
> > to use SFrames[1] in the Linux kernel. A quick recap of the motivation for
> > doing this.
>
> You have a '[1]' to indicate there's a link to what SFrames are.
>
> But no such link actually exists in this email. Hmm?
>
Oops. I cut and pasted from v11:
https://lore.kernel.org/linux-trace-kernel/20250625225600.555017347@goodmis.org/
But ended the cut too early. I stopped at my sig:
---
Thanks!
-- Steve
[1] https://sourceware.org/binutils/wiki/sframe
-- Steve
* Re: [PATCH v12 02/14] unwind_user: Add frame pointer support
2025-07-01 2:10 ` Linus Torvalds
@ 2025-07-01 2:56 ` Steven Rostedt
2025-07-01 3:05 ` Steven Rostedt
2025-07-01 15:36 ` Jens Remus
2025-07-01 4:46 ` Florian Weimer
1 sibling, 2 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-01 2:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer
On Mon, 30 Jun 2025 19:10:09 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Mon, 30 Jun 2025 at 17:54, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > + /* stack going in wrong direction? */
> > + if (cfa <= state->sp)
> > + goto done;
>
> I suspect this should do a lot more testing.
Sure.
>
> > + /* Find the Return Address (RA) */
> > + if (get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
> > + goto done;
> > +
> > + if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
> > + goto done;
>
> .. and this should check the frame for validity too. At a minimum it
> should be properly aligned, but things like "it had better be above
> the current frame" to avoid having some loop would seem to be a good
> idea.
Makes sense.
>
> Maybe even check that it's the same vma?
Hmm, I'll call on Jens Remus and ask: can s390 do anything wacky here,
where something that isn't allowed on other architectures may be allowed
there? I know s390 has some strange stack usage.
-- Steve
* Re: [PATCH v12 02/14] unwind_user: Add frame pointer support
2025-07-01 2:56 ` Steven Rostedt
@ 2025-07-01 3:05 ` Steven Rostedt
2025-07-01 15:36 ` Jens Remus
1 sibling, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-01 3:05 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer, Kees Cook
On Mon, 30 Jun 2025 22:56:03 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 30 Jun 2025 19:10:09 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > On Mon, 30 Jun 2025 at 17:54, Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > + /* stack going in wrong direction? */
> > > + if (cfa <= state->sp)
> > > + goto done;
> >
> > I suspect this should do a lot more testing.
>
> Sure.
Adding Kees too.
Kees,
I'd like to get some security eyes on this code to take a look at it. As it
is making decisions on input from user space, I'd like to have more
security folks looking at this to make sure that some malicious task can't
set up its stack in such a way that it can exploit something here.
The parsing of the sframe code (latest version not yet posted) will need a
similar audit.
Thanks,
-- Steve
>
> >
> > > + /* Find the Return Address (RA) */
> > > + if (get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
> > > + goto done;
> > > +
> > > + if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
> > > + goto done;
> >
> > .. and this should check the frame for validity too. At a minimum it
> > should be properly aligned, but things like "it had better be above
> > the current frame" to avoid having some loop would seem to be a good
> > idea.
>
> Makes sense.
>
> >
> > Maybe even check that it's the same vma?
>
> Hmm, I'll call on Jens Remus and ask: can s390 do anything wacky here,
> where something that isn't allowed on other architectures may be allowed
> there? I know s390 has some strange stack usage.
>
> -- Steve
* Re: [PATCH v12 02/14] unwind_user: Add frame pointer support
2025-07-01 2:10 ` Linus Torvalds
2025-07-01 2:56 ` Steven Rostedt
@ 2025-07-01 4:46 ` Florian Weimer
2025-07-01 12:14 ` Steven Rostedt
1 sibling, 1 reply; 53+ messages in thread
From: Florian Weimer @ 2025-07-01 4:46 UTC (permalink / raw)
To: Linus Torvalds
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Andrew Morton, Jens Axboe
* Linus Torvalds:
> On Mon, 30 Jun 2025 at 17:54, Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> + /* stack going in wrong direction? */
>> + if (cfa <= state->sp)
>> + goto done;
>
> I suspect this should do a lot more testing.
>
>> + /* Find the Return Address (RA) */
>> + if (get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
>> + goto done;
>> +
>> + if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
>> + goto done;
>
> .. and this should check the frame for validity too. At a minimum it
> should be properly aligned, but things like "it had better be above
> the current frame" to avoid having some loop would seem to be a good
> idea.
I don't think SFrame as-is requires stacks to be contiguous. Maybe
there could be a per-frame flag that indicates whether a stack switch is
expected?
Thanks,
Florian
* Re: [PATCH v12 02/14] unwind_user: Add frame pointer support
2025-07-01 4:46 ` Florian Weimer
@ 2025-07-01 12:14 ` Steven Rostedt
0 siblings, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-01 12:14 UTC (permalink / raw)
To: Florian Weimer
Cc: Linus Torvalds, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Andrew Morton, Jens Axboe
On Tue, 01 Jul 2025 06:46:14 +0200
Florian Weimer <fweimer@redhat.com> wrote:
> * Linus Torvalds:
>
> > On Mon, 30 Jun 2025 at 17:54, Steven Rostedt <rostedt@goodmis.org> wrote:
> >>
> >> + /* stack going in wrong direction? */
> >> + if (cfa <= state->sp)
> >> + goto done;
> >
> > I suspect this should do a lot more testing.
> >
> >> + /* Find the Return Address (RA) */
> >> + if (get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
> >> + goto done;
> >> +
> >> + if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
> >> + goto done;
> >
> > .. and this should check the frame for validity too. At a minimum it
> > should be properly aligned, but things like "it had better be above
> > the current frame" to avoid having some loop would seem to be a good
> > idea.
>
> I don't think SFrame as-is requires stacks to be contiguous. Maybe
> there could be a per-frame flag that indicates whether a stack switch is
> expected?
Looking at the current code of perf, it appears to only check that the
address is valid to read from user space. Perhaps that's the only check
needed here too?
Now this cannot turn into an infinite loop, as the code has:
for_each_user_frame(&state) {
trace->entries[trace->nr++] = state.ip;
if (trace->nr >= max_entries)
break;
}
Where
#define for_each_user_frame(state) \
for (unwind_user_start((state)); !(state)->done; unwind_user_next((state)))
It will stop at "max_entries" even if the user space tries to make it go
forever. max_entries is either 511 (on 64 bit) or 1023 (on 32 bit), as it
is defined as:
/* Make the cache fit in a 4K page */
#define UNWIND_MAX_ENTRIES \
((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
Now, perhaps we need to verify that the cfa is indeed canonical, but what
other tests do we need to perform?
In the future, this code will also be handling JIT and possibly interpreted
code. How much do we really want to limit the stack walking due to checks?
-- Steve
* Re: [PATCH v12 02/14] unwind_user: Add frame pointer support
2025-07-01 2:56 ` Steven Rostedt
2025-07-01 3:05 ` Steven Rostedt
@ 2025-07-01 15:36 ` Jens Remus
2025-07-02 23:50 ` Steven Rostedt
1 sibling, 1 reply; 53+ messages in thread
From: Jens Remus @ 2025-07-01 15:36 UTC (permalink / raw)
To: Steven Rostedt
Cc: Linus Torvalds, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Andrew Morton, Jens Axboe, Florian Weimer
On 01.07.2025 04:56, Steven Rostedt wrote:
> On Mon, 30 Jun 2025 19:10:09 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> On Mon, 30 Jun 2025 at 17:54, Steven Rostedt <rostedt@goodmis.org> wrote:
>>>
>>> + /* stack going in wrong direction? */
>>> + if (cfa <= state->sp)
>>> + goto done;
>>
>> I suspect this should do a lot more testing.
>
> Sure.
The above assumes:
curr_frame_sp = state->sp
prev_frame_sp = cfa // for arches that define CFA as SP at call site
On s390 the prev_frame_sp may be equal to curr_frame_sp for the topmost
frame, as long as the topmost function did not allocate any stack. For
instance when early in the prologue or when in a leaf function that does
not require any stack space. My s390 sframe support patches would
therefore currently change above check to:
/* stack going in wrong direction? */
if (sp <= state->sp - topmost)
goto done;
I assume that should be the case for all architectures whose function
call instruction does not modify the SP, unlike x86-64's CALL does.
>>> + if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
>>> + goto done;
>>
>> .. and this should check the frame for validity too. At a minimum it
>> should be properly aligned, but things like "it had better be above
>> the current frame" to avoid having some loop would seem to be a good
>> idea.
>
> Makes sense.
On s390 the FP (register) value does not necessarily need to be above
the SP value, as the s390x ELF ABI does only designate a "preferred" FP
register, so that the FP register may be used for other purposes, when a
FP is not required in a function.
So the output FP value cannot be checked on s390. But the input FP
value could probably be checked as follows before computing the CFA:
if (frame->use_fp && state->fp < state->sp)
goto done;
/* Get the Canonical Frame Address (CFA) */
cfa = (frame->use_fp ? state->fp : state->sp) + frame->cfa_off;
>> Maybe even check that it's the same vma?
>
> Hmm, I'll call on Jens Remus and ask: can s390 do anything wacky here,
> where something that isn't allowed on other architectures may be allowed
> there? I know s390 has some strange stack usage.
On s390, if a FP is required for dynamic stack allocation, it is only
initialized as late as possible, that is usually after static stack
allocation. Therefore SP == FP is rather seldom the case.
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
+49-7031-16-1128 Office
jremus@de.ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
* Re: [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure
2025-07-01 2:45 ` Steven Rostedt
@ 2025-07-01 22:49 ` Kees Cook
2025-07-01 23:26 ` Steven Rostedt
0 siblings, 1 reply; 53+ messages in thread
From: Kees Cook @ 2025-07-01 22:49 UTC (permalink / raw)
To: Steven Rostedt
Cc: Linus Torvalds, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Andrew Morton, Jens Axboe,
Florian Weimer
On Mon, Jun 30, 2025 at 10:45:39PM -0400, Steven Rostedt wrote:
> On Mon, 30 Jun 2025 19:06:12 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > On Mon, 30 Jun 2025 at 17:54, Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > This is the first patch series of a set that will make it possible to be able
> > > to use SFrames[1] in the Linux kernel. A quick recap of the motivation for
> > > doing this.
> >
> > You have a '[1]' to indicate there's a link to what SFrames are.
> [...]
> [1] https://sourceware.org/binutils/wiki/sframe
Okay, I've read the cover letter and this wiki page, but I am dense: why
does the _kernel_ want to do this? Shouldn't it only be userspace that
cares about userspace unwinding? I don't use perf, ftrace, and ebpf
enough to make this obvious to me, I guess. ;)
--
Kees Cook
* Re: [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure
2025-07-01 22:49 ` Kees Cook
@ 2025-07-01 23:26 ` Steven Rostedt
2025-07-02 14:56 ` Kees Cook
0 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2025-07-01 23:26 UTC (permalink / raw)
To: Kees Cook
Cc: Linus Torvalds, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Andrew Morton, Jens Axboe,
Florian Weimer
On Tue, 1 Jul 2025 15:49:23 -0700
Kees Cook <kees@kernel.org> wrote:
> On Mon, Jun 30, 2025 at 10:45:39PM -0400, Steven Rostedt wrote:
> > On Mon, 30 Jun 2025 19:06:12 -0700
> > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > > On Mon, 30 Jun 2025 at 17:54, Steven Rostedt <rostedt@goodmis.org> wrote:
> > > >
> > > > This is the first patch series of a set that will make it possible to be able
> > > > to use SFrames[1] in the Linux kernel. A quick recap of the motivation for
> > > > doing this.
> > >
> > > You have a '[1]' to indicate there's a link to what SFrames are.
> > [...]
> > [1] https://sourceware.org/binutils/wiki/sframe
>
> Okay, I've read the cover letter and this wiki page, but I am dense: why
> does the _kernel_ want to do this? Shouldn't it only be userspace that
> cares about userspace unwinding? I don't use perf, ftrace, and ebpf
> enough to make this obvious to me, I guess. ;)
>
It's how perf does profiling. It needs to walk the user space stack to see
what functions are being called. Ftrace can do the same thing, but it is not
used as much because it doesn't have the tooling (yet) to figure out what the
user space addresses mean (but I'm working on fixing that).
And BPF has commands that it can do, but I don't know BPF enough to comment.
The big user is perf with profiling. It currently uses frame pointers, but
because of the way frame pointers are set up, it misses a lot of the leaf
functions when the interrupt triggers (a problem SFrames does not have).
Also, if frame pointers are not configured, perf may just copy
thousands of bytes of the user space stack into the kernel ring buffer and
then parse it later (this isn't used often due to the overhead).
Then there's s390 that doesn't have frame pointers and only has the copy of
thousands of bytes to do any meaningful user space profiling.
Note, this has been a long-standing issue: back in 2022 we had a BoF on
this, looking for something like ORC in user space, as it would solve lots
of our issues. Then in December of that same year, we heard about SFrames.
At Kernel Recipes in 2023, Brendan Gregg said during his talk that there
needs to be a better way to do profiling of user space from the
kernel without frame pointers. I mentioned SFrames and he was quite excited
to hear about it. That's also when Josh, who was in attendance, asked
if he could do the implementation of it in the kernel!
Anyway, yeah, it's something that has a ton of interest, as it's the way
for tools like perf to give nice graphs of where user space bottlenecks
exist.
-- Steve
* Re: [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure
2025-07-01 23:26 ` Steven Rostedt
@ 2025-07-02 14:56 ` Kees Cook
2025-07-02 16:20 ` Mathieu Desnoyers
0 siblings, 1 reply; 53+ messages in thread
From: Kees Cook @ 2025-07-02 14:56 UTC (permalink / raw)
To: Steven Rostedt
Cc: Linus Torvalds, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Andrew Morton, Jens Axboe,
Florian Weimer
On Tue, Jul 01, 2025 at 07:26:19PM -0400, Steven Rostedt wrote:
> On Tue, 1 Jul 2025 15:49:23 -0700 Kees Cook <kees@kernel.org> wrote:
> > Okay, I've read the cover letter and this wiki page, but I am dense: why
> > does the _kernel_ want to do this? Shouldn't it only be userspace that
> > cares about userspace unwinding? I don't use perf, ftrace, and ebpf
> > enough to make this obvious to me, I guess. ;)
> [...]
> Anyway, yeah, it's something that has a ton of interest, as it's the way
> for tools like perf to give nice graphs of where user space bottlenecks
> exist.
Right! Yeah, I know it's very wanted -- I wasn't saying "don't do this in
the kernel", but quite literally, "*I* am missing something about why
this is so important." :) And thank you, now I see: the sampling-based
profiling of userspace must happen via the kernel.
--
Kees Cook
* Re: [PATCH v12 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-07-01 0:53 ` [PATCH v12 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
@ 2025-07-02 15:53 ` Jens Remus
2025-07-02 19:11 ` Steven Rostedt
0 siblings, 1 reply; 53+ messages in thread
From: Jens Remus @ 2025-07-02 15:53 UTC (permalink / raw)
To: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Linus Torvalds, Andrew Morton, Jens Axboe,
Florian Weimer, Heiko Carstens, Vasily Gorbik
Hello Steve!
On 01.07.2025 02:53, Steven Rostedt wrote:
> Make unwind_deferred_request() NMI-safe so tracers in NMI context can
> call it and safely request a user space stacktrace when the task exits.
> diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
> @@ -2,6 +2,9 @@
> #ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
> #define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
>
> +#include <asm/local64.h>
> +#include <asm/local.h>
This creates the following circular dependency, which breaks the build on
s390 whenever local64.h is included first, because local64_t
is not yet defined when unwind_deferred_types.h gets included down the
line:
Circular dependency:
include/linux/unwind_deferred_types.h
arch/<arch>/include/generated/asm/local64.h
include/asm-generic/local64.h
include/linux/percpu.h
include/linux/sched.h
include/linux/unwind_deferred_types.h
Compile error on s390:
CC net/ipv4/proc.o
In file included from ./include/linux/sched.h:49,
from ./include/linux/percpu.h:12,
from ./include/asm-generic/local64.h:5,
from ./arch/s390/include/generated/asm/local64.h:1,
from ./include/linux/u64_stats_sync.h:71,
from ./include/net/snmp.h:47,
from ./include/net/netns/mib.h:5,
from ./include/net/net_namespace.h:17,
from net/ipv4/proc.c:31:
./include/linux/unwind_deferred_types.h:17:9: error: unknown type name ‘local64_t’; did you mean ‘local_t’?
17 | local64_t timestamp;
| ^~~~~~~~~
| local_t
Reason it fails on s390:
net/ipv4/proc.c
+- include/net/net_namespace.h
+- include/net/netns/mib.h
+- include/net/snmp.h
+- include/linux/u64_stats_sync.h
+- arch/s390/include/generated/asm/local64.h <-- ISSUE: local64.h gets included before unwind_deferred_types.h
+- include/asm-generic/local64.h
+- include/linux/percpu.h
+- include/linux/sched.h
+- include/linux/unwind_deferred_types.h <-- ERROR: local64_t not yet defined
Reason it does not fail on x86:
net/ipv4/proc.c
+- include/net/net_namespace.h
+- include/linux/workqueue.h
| +- include/linux/timer.h
| +- include/linux/ktime.h
| +- include/linux/jiffies.h
| +- include/linux/time.h
| +- include/linux/time32.h
| +- include/linux/timex.h
| +- arch/x86/include/asm/timex.h
| +- arch/x86/include/asm/tsc.h
| +- arch/x86/include/asm/msr.h
| +- include/linux/percpu.h
| +- include/linux/sched.h
| +- include/linux/unwind_deferred_types.h <-- OK: unwind_deferred_types.h gets included before local64.h
| +- arch/x86/include/generated/asm/local64.h
| +- include/asm-generic/local64.h
| [ +- include/linux/percpu.h ]
+- include/net/netns/mib.h
+- include/net/snmp.h
+- include/linux/u64_stats_sync.h
+- arch/x86/include/generated/asm/local64.h <-- OK: local64.h comes after unwind_deferred_types.h
[ +- include/asm-generic/local64.h ]
[ +- include/linux/percpu.h ]
[ +- include/linux/sched.h ]
[ +- include/linux/unwind_deferred_types.h ]
My colleague Heiko Carstens (on Cc) suggested the following potential fix
that breaks the circular dependency, provided local[64].h does not ever
need percpu.h. At least I verified that defconfig x86 and s390 builds
still work.
diff --git a/include/asm-generic/local.h b/include/asm-generic/local.h
index 7f97018df66f..3e7ce6a9e18e 100644
--- a/include/asm-generic/local.h
+++ b/include/asm-generic/local.h
@@ -2,7 +2,6 @@
#ifndef _ASM_GENERIC_LOCAL_H
#define _ASM_GENERIC_LOCAL_H
-#include <linux/percpu.h>
#include <linux/atomic.h>
#include <asm/types.h>
diff --git a/include/asm-generic/local64.h b/include/asm-generic/local64.h
index 14963a7a6253..1f9af89916cb 100644
--- a/include/asm-generic/local64.h
+++ b/include/asm-generic/local64.h
@@ -2,7 +2,6 @@
#ifndef _ASM_GENERIC_LOCAL64_H
#define _ASM_GENERIC_LOCAL64_H
-#include <linux/percpu.h>
#include <asm/types.h>
> +
> struct unwind_cache {
> unsigned int nr_entries;
> unsigned long entries[];
> @@ -10,8 +13,8 @@ struct unwind_cache {
> struct unwind_task_info {
> struct unwind_cache *cache;
> struct callback_head work;
> - u64 timestamp;
> - int pending;
> + local64_t timestamp;
> + local_t pending;
> };
>
> #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
* Re: [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure
2025-07-02 14:56 ` Kees Cook
@ 2025-07-02 16:20 ` Mathieu Desnoyers
0 siblings, 0 replies; 53+ messages in thread
From: Mathieu Desnoyers @ 2025-07-02 16:20 UTC (permalink / raw)
To: Kees Cook, Steven Rostedt
Cc: Linus Torvalds, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer
On 2025-07-02 10:56, Kees Cook wrote:
> On Tue, Jul 01, 2025 at 07:26:19PM -0400, Steven Rostedt wrote:
>> On Tue, 1 Jul 2025 15:49:23 -0700 Kees Cook <kees@kernel.org> wrote:
>>> Okay, I've read the cover letter and this wiki page, but I am dense: why
>>> does the _kernel_ want to do this? Shouldn't it only be userspace that
>>> cares about userspace unwinding? I don't use perf, ftrace, and ebpf
>>> enough to make this obvious to me, I guess. ;)
>> [...]
>> Anyway, yeah, it's something that has a ton of interest, as it's the way
>> for tools like perf to give nice graphs of where user space bottlenecks
>> exist.
>
> Right! Yeah, I know it's very wanted -- I wasn't saying "don't this in
> the kernel", but quite literally, "*I* am missing something about why
> this is so important." :) And thank you, now I see: the sampling-based
> profiling of userspace must happen via the kernel.
>
I should add that once we have this in place for perf sampling,
it enables the following additional use-cases:
- Sample stack traces from specific tracepoints, e.g. system call
entry. This allows associating the kernel tracepoints with their
userspace call chain (causality) without requiring tracing of
userspace.
- Implement a system call that allows a userspace thread to use the
kernel stack sampling facilities on itself rather than reimplement
the stack walk and sframe registry handling + decoding on the
userspace side.
For this last point, it's only relevant because the infrastructure
will already be in place for stack sampling from the kernel. So we'd
eliminate duplication by allowing its use from userspace as well.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-07-01 0:53 ` [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
@ 2025-07-02 16:36 ` Peter Zijlstra
2025-07-02 16:42 ` Steven Rostedt
0 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2025-07-02 16:36 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer
On Mon, Jun 30, 2025 at 08:53:27PM -0400, Steven Rostedt wrote:
> +/*
> + * Read the task context timestamp, if this is the first caller then
> + * it will set the timestamp.
> + *
> + * For this to work properly, the timestamp (local_clock()) must
> + * have a resolution that will guarantee a different timestamp
> + * every time a task makes a system call. That is, two short
> + * system calls back to back must have a different timestamp.
> + */
> +static u64 get_timestamp(struct unwind_task_info *info)
> +{
> + lockdep_assert_irqs_disabled();
> +
> + if (!info->timestamp)
> + info->timestamp = local_clock();
> +
> + return info->timestamp;
> +}
I'm very hesitant about this. Modern hardware can do this, but older
hardware (think Intel Core and AMD Bulldozer etc.) might struggle with
it. They don't have a stable TSC and as such will use the magic in
kernel/sched/clock.c, which can get stuck on a window edge for a
little bit and re-use timestamps.
* Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-07-02 16:36 ` Peter Zijlstra
@ 2025-07-02 16:42 ` Steven Rostedt
2025-07-02 16:56 ` Linus Torvalds
0 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2025-07-02 16:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer
On Wed, 2 Jul 2025 18:36:09 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> > +static u64 get_timestamp(struct unwind_task_info *info)
> > +{
> > + lockdep_assert_irqs_disabled();
> > +
> > + if (!info->timestamp)
> > + info->timestamp = local_clock();
> > +
> > + return info->timestamp;
> > +}
>
> I'm very hesitant about this. Modern hardware can do this, but older
> hardware (think Intel Core and AMD Bulldozer etc hardware) might
> struggle with this. They don't have stable TSC and as such will use
> the magic in kernel/sched/clock.c; which can get stuck on a window
> edge for a little bit and re-use timestamps.
Well, the idea of using timestamps came from Microsoft as that's what
they do for a similar feature. It just seemed like an easier approach.
But I could definitely go back to the "cookie" idea that is just a per
cpu counter (with the CPU number as part of the cookie).
Since the timestamp is likely not going to be as useful here as it is
for Microsoft, as there's no guarantee that the clock it uses is the
same clock used by the tracer asking for the unwind, the cookie
approach may indeed be better.
-- Steve
* Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-07-02 16:42 ` Steven Rostedt
@ 2025-07-02 16:56 ` Linus Torvalds
2025-07-02 17:26 ` Steven Rostedt
0 siblings, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2025-07-02 16:56 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer
On Wed, 2 Jul 2025 at 09:42, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> As the timestamp is likely not going to be as useful as it is with
> Microsoft as there's no guarantee that the timestamp counter used is
> the same as the timestamp used by the tracer asking for this, the cookie
> approach may indeed be better.
I think having just a percpu counter is probably the safest thing to
do, since the main reason just seems to be "correlate with the user
event". Using some kind of "real time" for correlation purposes seems
like a bad idea from any portability standpoint, considering just how
many broken timers we've seen across pretty much every architecture
out there.
Also, does it actually have to be entirely unique? IOW, a 32-bit
counter (or even less) might be sufficient if there's some guarantee
that processing happens before the counter wraps around? Again - for
correlation purposes, just *how* many outstanding events can you have
that aren't ordered by other things too?
I'm sure people want to also get some kind of rough time idea, but
don't most perf events have them simply because people want time
information for _informatioal_ reasons, rather than to correlate two
events?
Linus
* Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-07-02 16:56 ` Linus Torvalds
@ 2025-07-02 17:26 ` Steven Rostedt
2025-07-02 17:48 ` Steven Rostedt
0 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2025-07-02 17:26 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer
On Wed, 2 Jul 2025 09:56:39 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Also, does it actually have to be entirely unique? IOW, a 32-bit
> counter (or even less) might be sufficient if there's some guarantee
> that processing happens before the counter wraps around? Again - for
> correlation purposes, just *how* many outstanding events can you have
> that aren't ordered by other things too?
>
> I'm sure people want to also get some kind of rough time idea, but
> don't most perf events have them simply because people want time
> information for _informatioal_ reasons, rather than to correlate two
> events?
And it only needs to be unique per thread per system call. The real
reason for this identifier is for lost events. As I explained in the
perf patchset, the issue is this:
In case of dropped events, we could have the case of:
system_call() {
<nmi> {
take kernel stack trace
ask for deferred trace.
[EVENTS START DROPPING HERE]
}
Call deferred callback to record trace [ BUT IS DROPPED ]
}
system_call() {
<nmi> {
take kernel stack trace
ask for deferred trace [ STILL DROPPING ]
}
[ READER CATCHES UP AND STARTS READING EVENTS AGAIN]
Call deferred callback to record trace
}
The user space tool will see the kernel stack traces of the first
system call, then it will see events dropped, and then it will see the
deferred user space stack trace of the second call.
The identifier is only there to provide uniqueness for that one thread,
to let the tracer know whether the deferred trace can be tied to events
from before it lost them.
We figured a single 32 bit counter would be good enough when we first
discussed this idea, but we wanted per CPU counters to avoid cache
contention every time a CPU incremented the counter. But each CPU
would then need an identifier so that a migrating task will not get
the same identifier for a different system call just because it
migrated. We used 16 bits for the CPU, figuring 32K CPUs would last us
some time into the future. We then chose a 64 bit number, leaving 48
bits for uniqueness, which is plenty.
If we use 32 bits, that would give us 32K of unique system calls, and
it does seem possible that on a busy system a tracer could lose 32K
system calls before it gets going again. But we could still use it
anyway, as the likelihood of losing exactly 32K system calls and then
starting tracing back up again is vanishingly small. And even if it
happens, the worst that it will do is make the tracer mistake which
user space stack trace goes with which event. If you are tracing that
many events, this will likely be in the noise.
So I'm fine with making this a 32 bit counter using 16 bits for the CPU
and 16 bits for per thread uniqueness.
-- Steve
* Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-07-02 17:26 ` Steven Rostedt
@ 2025-07-02 17:48 ` Steven Rostedt
2025-07-02 18:21 ` Linus Torvalds
0 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2025-07-02 17:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer
On Wed, 2 Jul 2025 13:26:05 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> So I'm fine with making this a 32 bit counter using 16 bits for the CPU
> and 16 bits for per thread uniqueness.
To still be able to use a 32 bit cmpxchg (for racing with an NMI), we
could break this number up into two 32 bit words. One with the CPU that
it was created on, and one with the per_cpu counter:
union unwind_task_id {
	struct {
		u32	cpu;
		u32	cnt;
	};
	u64	id;
};
static DEFINE_PER_CPU(u32, unwind_ctx_ctr);

static u64 get_cookie(struct unwind_task_info *info)
{
	u32 cpu_cnt;
	u32 cnt;
	u32 old = 0;

	if (info->id.cpu)
		return info->id.id;

	cpu_cnt = __this_cpu_read(unwind_ctx_ctr);
	cpu_cnt += 2;
	cnt = cpu_cnt | 1; /* Always make non zero */

	if (try_cmpxchg(&info->id.cnt, &old, cnt)) {
		/* Update the per cpu counter */
		__this_cpu_write(unwind_ctx_ctr, cpu_cnt);
	}
	/* Interrupts are disabled, the CPU will always be the same */
	info->id.cpu = smp_processor_id() + 1; /* Must be non zero */

	return info->id.id;
}
When leaving the kernel it does:
info->id.id = 0;
-- Steve
* Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-07-02 17:48 ` Steven Rostedt
@ 2025-07-02 18:21 ` Linus Torvalds
2025-07-02 18:47 ` Mathieu Desnoyers
2025-07-02 18:57 ` Steven Rostedt
0 siblings, 2 replies; 53+ messages in thread
From: Linus Torvalds @ 2025-07-02 18:21 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer
On Wed, 2 Jul 2025 at 10:49, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> To still be able to use a 32 bit cmpxchg (for racing with an NMI), we
> could break this number up into two 32 bit words. One with the CPU that
> it was created on, and one with the per_cpu counter:
Do you actually even need a cpu number at all?
If this is per-thread, maybe just a per-thread counter would be good?
And you already *have* that per-thread thing, in
'current->unwind_info'.
And the work is using task_work, so the worker callback is also per-thread.
Also, is racing with NMI even a thing for the sequence number? I would
expect that the only thing that NMI would race with is the 'pending'
field, not the sequence number.
IOW, I think the logic could be
- check 'pending' non-atomically, just because it's cheap
- do a try_cmpxchg() on pending to actually deal with nmi races
Actually, there are no SMP issues, just instruction atomicity - so a
'local_try_cmpxchg()' would be sufficient, but that's a 'long' not a
'u32' ;^(
- now you are exclusive for that thread, you're done, no more need
for any atomic counter or percpu things
And then the next step is to just say "pending is the low bit of the
id word" and having a separate 31-bit counter that gets incremented by
"get_cookie()".
So then you end up with something like
// New name for 'get_timestamp()'
get_current_cookie() { return current->unwind_info.cookie; }
// New name for 'get_cookie()':
// 31-bit counter by just leaving bit #0 alone
get_new_cookie() { current->unwind_info.cookie += 2; }
and then unwind_deferred_request() would do something like
unwind_deferred_request()
{
	int old, new;

	if (current->unwind_info.id)
		return 1;

	guard(irqsave)();
	// For NMI, if we race with 'get_new_cookie()'
	// we don't care if we get the old or the new one
	old = 0; new = get_current_cookie() | 1;
	if (!try_cmpxchg(&current->unwind_info.id, &old, new))
		return 1;
	.. now schedule the thing with that cookie set.
Hmm?
But I didn't actually look at the users, so maybe I'm missing some
reason why you want to have a separate per-cpu value.
Or maybe I missed something else entirely, and the above is complete
garbage and the ramblings of an insane mind.
It happens.
Off to get more coffee.
Linus
* Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-07-02 18:21 ` Linus Torvalds
@ 2025-07-02 18:47 ` Mathieu Desnoyers
2025-07-02 19:05 ` Steven Rostedt
2025-07-02 18:57 ` Steven Rostedt
1 sibling, 1 reply; 53+ messages in thread
From: Mathieu Desnoyers @ 2025-07-02 18:47 UTC (permalink / raw)
To: Linus Torvalds, Steven Rostedt
Cc: Peter Zijlstra, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Andrew Morton,
Jens Axboe, Florian Weimer
On 2025-07-02 14:21, Linus Torvalds wrote:
> On Wed, 2 Jul 2025 at 10:49, Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> To still be able to use a 32 bit cmpxchg (for racing with an NMI), we
>> could break this number up into two 32 bit words. One with the CPU that
>> it was created on, and one with the per_cpu counter:
>
> Do you actually even need a cpu number at all?
>
> If this is per-thread, maybe just a per-thread counter would be good?
> And you already *have* that per-thread thing, in
> 'current->unwind_info'.
AFAIR, one of the goals here is to save the cookie into the trace
to allow trace post-processing to link the event triggering the
unwinding with the deferred unwinding data.
In order to make the trace analysis results reliable, we'd like
to avoid the following causes of uncertainty, which would
mistakenly cause the post-processing analysis to associate
a stack trace with the wrong event:
- Thread ID re-use (exit + clone/fork),
- Thread migration,
- Events discarded (e.g. buffer full) causing missing
thread lifetime events or missing unwind-related events.
Unless I'm missing something, the per-thread counter would have
issues with thread ID re-use during the trace lifetime.
One possibility to solve this would be to introduce a thread
identifier (e.g. 64-bit thread ID value) which is unique
across the entire kernel lifetime. This approach would actually
be useful for other use-cases as well, but a 64-bit ID is not
as compact as the CPU number, so it is somewhat wasteful in
terms of trace bandwidth.
Hence the alternative we came up with, which is to combine the
CPU number and a per-CPU counter to have a cheap way to keep
track of a globally unique counter using per-CPU partitioning.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-07-02 18:21 ` Linus Torvalds
2025-07-02 18:47 ` Mathieu Desnoyers
@ 2025-07-02 18:57 ` Steven Rostedt
1 sibling, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-02 18:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer
On Wed, 2 Jul 2025 11:21:34 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, 2 Jul 2025 at 10:49, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > To still be able to use a 32 bit cmpxchg (for racing with an NMI), we
> > could break this number up into two 32 bit words. One with the CPU that
> > it was created on, and one with the per_cpu counter:
>
> Do you actually even need a cpu number at all?
>
> If this is per-thread, maybe just a per-thread counter would be good?
> And you already *have* that per-thread thing, in
> 'current->unwind_info'.
I just hate to increase the task struct even more. I'm trying to keep
in the task struct only the data I can't put elsewhere.
>
> And the work is using task_work, so the worker callback is also per-thread.
>
> Also, is racing with NMI even a thing for the sequence number? I would
> expect that the only thing that NMI would race with is the 'pending'
> field, not the sequence number.
The sequence number is updated the first time this is called after a
task enters the kernel; every call after that returns the same number.
That's because this unique number identifies the user space stack
trace, which doesn't change while the task is in the kernel. Hence the
id gets updated each time the task enters the kernel and not after
that.
>
> IOW, I think the logic could be
>
> - check 'pending' non-atomically, just because it's cheap
Note, later patches move the "pending" bit into a mask that is used to
figure out what callbacks to call.
>
> - do a try_cmpxchg() on pending to actually deal with nmi races
>
> Actually, there are no SMP issues, just instruction atomicity - so a
> 'local_try_cmpxchg() would be sufficient, but that's a 'long' not a
> 'u32' ;^(
Yeah, later patches do try to use more local_try_cmpxchg() at different
parts. Even for the timestamp.
>
> - now you are exclusive for that thread, you're done, no more need
> for any atomic counter or percpu things
The trick with the NMI race is that this needs to return the cookie to
both callers, and the cookie must be the same number for both.
>
> And then the next step is to just say "pending is the low bit of the
> id word" and having a separate 31-bit counter that gets incremented by
> "get_cookie()".
>
> So then you end up with something like
>
> // New name for 'get_timestamp()'
> get_current_cookie() { return current->unwind_info.cookie; }
> // New name for 'get_cookie()':
> // 31-bit counter by just leaving bit #0 alone
> get_new_cookie() { current->unwind_info.cookie += 2; }
>
> and then unwind_deferred_request() would do something like
>
> unwind_deferred_request()
> {
> int old, new;
>
> if (current->unwind_info.id)
> return 1;
>
> guard(irqsave)();
> // For NMI, if we race with 'get_new_cookie()'
> // we don't care if we get the old or the new one
> old = 0; new = get_current_cookie() | 1;
> if (!try_cmpxchg(&current->unwind_info.id, &old, new))
> return 1;
> .. now schedule the thing with that cookie set.
>
> Hmm?
>
> But I didn't actually look at the users, so maybe I'm missing some
> reason why you want to have a separate per-cpu value.
Now we could just have the counter be the 32 bit cookie (old timestamp)
in the task struct, and have bit 0 indicate whether it is valid or not.
static u32 get_cookie(struct unwind_task_info *info)
{
	u32 cnt = READ_ONCE(info->id);
	u32 new_cnt;

	if (cnt & 1)
		return cnt;

	new_cnt = cnt + 3;
	if (try_cmpxchg(&info->id, &cnt, new_cnt))
		return new_cnt;

	return cnt; /* return info->id; would work too */
}
Then when going back to user space:
info->id &= ~1;
Now the counter is stored in the task struct and no other info is needed.
>
> Or maybe I missed something else entirely, and the above is complete
> garbage and the ramblings of a insane mind.
>
> It happens.
>
> Off to get more coffee.
Enjoy, I just grabbed my last cup of the day.
-- Steve
* Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-07-02 18:47 ` Mathieu Desnoyers
@ 2025-07-02 19:05 ` Steven Rostedt
2025-07-02 19:12 ` Mathieu Desnoyers
0 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2025-07-02 19:05 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, Peter Zijlstra, linux-kernel, linux-trace-kernel,
bpf, x86, Masami Hiramatsu, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer
On Wed, 2 Jul 2025 14:47:10 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> AFAIR, one of the goals here is to save the cookie into the trace
> to allow trace post-processing to link the event triggering the
> unwinding with the deferred unwinding data.
>
> In order to make the trace analysis results reliable, we'd like
> to avoid the following causes of uncertainty, which would
> mistakenly cause the post-processing analysis to associate
> a stack trace with the wrong event:
>
> - Thread ID re-use (exit + clone/fork),
> - Thread migration,
> - Events discarded (e.g. buffer full) causing missing
> thread lifetime events or missing unwind-related events.
>
> Unless I'm missing something, the per-thread counter would have
> issues with thread ID re-use during the trace lifetime.
But you are missing one more thing that the trace can use, and that's
the time sequence. As soon as the same thread has a new id you can
assume all the older user space traces are not applicable for any new
events for that thread, or any other thread with the same thread ID.
Thus the only issue that can truly be a problem is if you have missed
events where the thread ID wraps around. I guess that could be possible
if a long running task finally exits and its thread ID is reused
immediately. Is that a common occurrence?
-- Steve.
* Re: [PATCH v12 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-07-02 15:53 ` Jens Remus
@ 2025-07-02 19:11 ` Steven Rostedt
0 siblings, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-02 19:11 UTC (permalink / raw)
To: Jens Remus
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer, Heiko Carstens,
Vasily Gorbik
On Wed, 2 Jul 2025 17:53:05 +0200
Jens Remus <jremus@linux.ibm.com> wrote:
> > @@ -2,6 +2,9 @@
> > #ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
> > #define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
> >
> > +#include <asm/local64.h>
> > +#include <asm/local.h>
>
> This creates the following circular dependency, that breaks the build on
> s390 as follows, whenever local64.h is included first, so that local64_t
> is not yet defined when unwind_deferred_types.h gets included down the
> line:
As per the discussion on patch 6, this may not be an issue in the next
version. I'm looking to get rid of 64 bit cmpxchg.
-- Steve
* Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-07-02 19:05 ` Steven Rostedt
@ 2025-07-02 19:12 ` Mathieu Desnoyers
2025-07-02 19:21 ` Steven Rostedt
0 siblings, 1 reply; 53+ messages in thread
From: Mathieu Desnoyers @ 2025-07-02 19:12 UTC (permalink / raw)
To: Steven Rostedt
Cc: Linus Torvalds, Peter Zijlstra, linux-kernel, linux-trace-kernel,
bpf, x86, Masami Hiramatsu, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer
On 2025-07-02 15:05, Steven Rostedt wrote:
> On Wed, 2 Jul 2025 14:47:10 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>
>> AFAIR, one of the goals here is to save the cookie into the trace
>> to allow trace post-processing to link the event triggering the
>> unwinding with the deferred unwinding data.
>>
>> In order to make the trace analysis results reliable, we'd like
>> to avoid the following causes of uncertainty, which would
>> mistakenly cause the post-processing analysis to associate
>> a stack trace with the wrong event:
>>
>> - Thread ID re-use (exit + clone/fork),
>> - Thread migration,
>> - Events discarded (e.g. buffer full) causing missing
>> thread lifetime events or missing unwind-related events.
>>
>> Unless I'm missing something, the per-thread counter would have
>> issues with thread ID re-use during the trace lifetime.
>
> But you are missing one more thing that the trace can use, and that's
> the time sequence. As soon as the same thread has a new id you can
> assume all the older user space traces are not applicable for any new
> events for that thread, or any other thread with the same thread ID.
In order for the scheme you describe to work, you need:
- instrumentation of task lifetime (exit/fork+clone),
- be sure that the events related to that instrumentation were not
dropped.
I'm not sure about ftrace, but in LTTng enabling instrumentation of
task lifetime is entirely up to the user.
And even if it's enabled, events can be discarded (e.g. buffer full).
>
> Thus the only issue that can truly be a problem is if you have missed
> events where thread id wraps around. I guess that could be possible if
> a long running task finally exits and it's thread id is reused
> immediately. Is that a common occurrence?
You just need a combination of thread ID re-use and either no
instrumentation of task lifetime or events discarded to trigger this.
Even if it's not so frequent, at large scale and in production, I
suspect that this will happen quite often.
You don't even need the task IDs to be re-used very quickly for this to
be an issue.
Thanks,
Mathieu
>
> -- Steve.
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-07-02 19:12 ` Mathieu Desnoyers
@ 2025-07-02 19:21 ` Steven Rostedt
2025-07-02 19:36 ` Steven Rostedt
2025-07-02 19:43 ` Mathieu Desnoyers
0 siblings, 2 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-02 19:21 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, Peter Zijlstra, linux-kernel, linux-trace-kernel,
bpf, x86, Masami Hiramatsu, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer
On Wed, 2 Jul 2025 15:12:45 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > But you are missing one more thing that the trace can use, and that's
> > the time sequence. As soon as the same thread has a new id you can
> > assume all the older user space traces are not applicable for any new
> > events for that thread, or any other thread with the same thread ID.
>
> In order for the scheme you describe to work, you need:
>
> - instrumentation of task lifetime (exit/fork+clone),
> - be sure that the events related to that instrumentation were not
> dropped.
>
> I'm not sure about ftrace, but in LTTng enabling instrumentation of
> task lifetime is entirely up to the user.
This has nothing to do with task lifetime. If you see a deferred request
with id of 1 from task 8888, and then later you see either a deferred
request or a stack trace with an id other than 1 for task 8888, you can
then say all events before now are no longer eligible for new deferred
stack traces.
>
> And even if it's enabled, events can be discarded (e.g. buffer full).
The only losing case is this: you see a deferred request with id 1 for
task 8888, then you start dropping all events; task 8888 exits and a
new one appears with task id 8888 that also makes a deferred request
with id 1; then you start picking up events again and see a deferred
stack trace for the new task 8888 whose id is 1.
But other than that exact scenario, it should not get confused.
>
> >
> > Thus the only issue that can truly be a problem is if you have missed
> > events where thread id wraps around. I guess that could be possible if
> > a long running task finally exits and it's thread id is reused
> > immediately. Is that a common occurrence?
>
> You just need a combination of thread ID re-use and either no
> instrumentation of task lifetime or events discarded to trigger this.
Again, once you see a new request with a different id for the same
task, you don't need to worry about the older traces. You don't even
need to look at fork and exit events.
> Even if it's not so frequent, at large scale and in production, I
> suspect that this will happen quite often.
Really? As I explained above?
-- Steve
* Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-07-02 19:21 ` Steven Rostedt
@ 2025-07-02 19:36 ` Steven Rostedt
2025-07-02 19:40 ` Steven Rostedt
2025-07-02 19:43 ` Mathieu Desnoyers
1 sibling, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2025-07-02 19:36 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, Peter Zijlstra, linux-kernel, linux-trace-kernel,
bpf, x86, Masami Hiramatsu, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer
On Wed, 2 Jul 2025 15:21:11 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> The only case is if you see a deferred request with id 1 for task 8888,
> then you start dropping all events and that task 8888 exits and a new
> one appears with task id 8888 where it too has a deferred request with
> id 1 then you start picking up events again and see a deferred stack
> trace for the new task 8888 where its id is 1, you lose.
And if we want to fix that, we could make the cookie 64 bit again, and
set the timestamp on the first time it is used for the trace.
union unwind_task_id {
	struct {
		u32	task_id;
		u32	cnt;
	};
	u64	id;
};
static u64 get_cookie(struct unwind_task_info *info)
{
	u32 cnt = READ_ONCE(info->id.cnt);
	u32 new_cnt;

	if (cnt & 1)
		return info->id.id;

	if (unlikely(!info->id.task_id)) {
		u32 task_id = local_clock();
		u32 old = 0;

		/* If this raced and lost, keep the value already set */
		if (!try_cmpxchg(&info->id.task_id, &old, task_id))
			task_id = old;
	}

	new_cnt = cnt + 3;
	if (!try_cmpxchg(&info->id.cnt, &cnt, new_cnt))
		new_cnt = cnt; /* try_cmpxchg() updated cnt to the current value */

	return info->id.id;
}
So now each task will have its own id, and even if a thread ID wraps
around, the cookie will never be the same, as fork sets info->id to
zero.
Yes, the local_clock() can wrap around, but having all of those match
up to cause an issue is extremely unlikely, and even if it happens, the
worst that it causes is that a user space stack trace will be
associated with the wrong events.
-- Steve
* Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-07-02 19:36 ` Steven Rostedt
@ 2025-07-02 19:40 ` Steven Rostedt
2025-07-02 19:48 ` Mathieu Desnoyers
0 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2025-07-02 19:40 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, Peter Zijlstra, linux-kernel, linux-trace-kernel,
bpf, x86, Masami Hiramatsu, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer
On Wed, 2 Jul 2025 15:36:00 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> union unwind_task_id {
> 	struct {
> 		u32	task_id;
> 		u32	cnt;
> 	};
> 	u64	id;
> };
>
> static u64 get_cookie(struct unwind_task_info *info)
> {
> 	u32 cnt = READ_ONCE(info->id.cnt);
> 	u32 new_cnt;
>
> 	if (cnt & 1)
> 		return info->id.id;
>
> 	if (unlikely(!info->id.task_id)) {
> 		u32 task_id = local_clock();
>
> 		cnt = 0;
> 		if (try_cmpxchg(&info->id.task_id, &cnt, task_id))
> 			task_id = cnt;
> 	}
>
> 	new_cnt = cnt + 3;
> 	if (try_cmpxchg(&info->id.cnt, &cnt, new_cnt))
> 		new_cnt = cnt; /* try_cmpxchg() expects something */
>
> 	return info->id.id;
> }
Honestly I think this is way overkill. What I would do is have the
cookie saved in the event be 64 bit, but we can start with the
simple 32 bit solution, keeping the top 32 bits zero. If this does
indeed become an issue in the future, we could fix it with a 64 bit
number. By making sure all the exposed "cookies" are 64 bit, it should
not break anything. The cookie is just supposed to be a random unique
number that associates a request with its deferred user space stack
trace.
With any cookies exposed to user space being 64 bits, this should not
be hard to address in the future.
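As a rough illustration of that layout (a sketch only; the helper name is
made up and not from the patches), the internal counter stays 32 bit while
every cookie handed to user space is already 64 bit with the top half zero:

```c
#include <stdint.h>

/*
 * Hypothetical sketch: the in-kernel cookie is a 32-bit per-task
 * counter for now, but the value exposed to user space is always
 * 64 bit with the upper 32 bits zero.  If 32 bits ever become a
 * problem, the upper half can carry more state later without
 * changing the user-visible width of the cookie.
 */
static uint64_t exposed_cookie(uint32_t internal_cnt)
{
	return (uint64_t)internal_cnt;	/* top 32 bits stay zero for now */
}
```

Because tooling only ever sees the 64-bit value, widening the internal
representation later would not change the ABI.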
-- Steve
* Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-07-02 19:21 ` Steven Rostedt
2025-07-02 19:36 ` Steven Rostedt
@ 2025-07-02 19:43 ` Mathieu Desnoyers
2025-07-02 19:51 ` Steven Rostedt
1 sibling, 1 reply; 53+ messages in thread
From: Mathieu Desnoyers @ 2025-07-02 19:43 UTC (permalink / raw)
To: Steven Rostedt
Cc: Linus Torvalds, Peter Zijlstra, linux-kernel, linux-trace-kernel,
bpf, x86, Masami Hiramatsu, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer
On 2025-07-02 15:21, Steven Rostedt wrote:
> On Wed, 2 Jul 2025 15:12:45 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>>> But you are missing one more thing that the trace can use, and that's
>>> the time sequence. As soon as the same thread has a new id you can
>>> assume all the older user space traces are not applicable for any new
>>> events for that thread, or any other thread with the same thread ID.
>>
>> In order for the scheme you describe to work, you need:
>>
>> - instrumentation of task lifetime (exit/fork+clone),
>> - be sure that the events related to that instrumentation were not
>> dropped.
>>
>> I'm not sure about ftrace, but in LTTng enabling instrumentation of
>> task lifetime is entirely up to the user.
>
> Has nothing to do with task lifetime. If you see a deferred request
> with id of 1 from task 8888, and then later you see either a deferred
> request or a stack trace with an id other than 1 for task 8888, you can
> then say all events before now are no longer eligible for new deferred
> stack traces.
>
>>
>> And even if it's enabled, events can be discarded (e.g. buffer full).
>
> The only case is if you see a deferred request with id 1 for task 8888,
> then you start dropping all events and that task 8888 exits and a new
> one appears with task id 8888 where it too has a deferred request with
> id 1; then you start picking up events again and see a deferred stack
> trace for the new task 8888 where its id is 1, and you lose.
>
> But other than that exact scenario, it should not get confused.
Correct.
>
>>
>>>
>>> Thus the only issue that can truly be a problem is if you have missed
>>> events where thread id wraps around. I guess that could be possible if
>>> a long-running task finally exits and its thread id is reused
>>> immediately. Is that a common occurrence?
>>
>> You just need a combination of thread ID re-use and either no
>> instrumentation of task lifetime or events discarded to trigger this.
>
> Again, it's seeing a new request with another id for the same task; you
> don't need to worry about it. You don't even need to look at fork and
> exit events.
The reason why instrumentation of exit/{fork,clone} is useful is to
know when a thread ID is re-used.
>
>> Even if it's not so frequent, at large scale and in production, I
>> suspect that this will happen quite often.
>
> Really? As I explained above?
Note that all newly forked threads will likely start counting near 0.
So chances are that for short-lived threads most of the counter values
will be in a low range.
So all you need is thread ID re-use for two threads which happen to use
the deferred cookies within low-value ranges to hit this.
From my perspective, making trace analysis results reliable is the most
basic guarantee tooling should provide in order to make it trusted by
users. So I am tempted to err towards robustness rather than take
shortcuts because "it does not happen often".
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-07-02 19:40 ` Steven Rostedt
@ 2025-07-02 19:48 ` Mathieu Desnoyers
2025-07-02 20:10 ` Steven Rostedt
0 siblings, 1 reply; 53+ messages in thread
From: Mathieu Desnoyers @ 2025-07-02 19:48 UTC (permalink / raw)
To: Steven Rostedt
Cc: Linus Torvalds, Peter Zijlstra, linux-kernel, linux-trace-kernel,
bpf, x86, Masami Hiramatsu, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer
On 2025-07-02 15:40, Steven Rostedt wrote:
> On Wed, 2 Jul 2025 15:36:00 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
>> union unwind_task_id {
>> 	struct {
>> 		u32	task_id;
>> 		u32	cnt;
>> 	};
>> 	u64	id;
>> };
>>
>> static u64 get_cookie(struct unwind_task_info *info)
>> {
>> 	u32 cnt = READ_ONCE(info->id.cnt);
>> 	u32 new_cnt;
>>
>> 	if (cnt & 1)
>> 		return info->id.id;
>>
>> 	if (unlikely(!info->id.task_id)) {
>> 		u32 task_id = local_clock();
>>
>> 		cnt = 0;
>> 		if (try_cmpxchg(&info->id.task_id, &cnt, task_id))
>> 			task_id = cnt;
>> 	}
>>
>> 	new_cnt = cnt + 3;
>> 	if (try_cmpxchg(&info->id.cnt, &cnt, new_cnt))
>> 		new_cnt = cnt; /* try_cmpxchg() expects something */
>>
>> 	return info->id.id;
>> }
>
> Honestly I think this is way overkill. What I would do, is to have the
> cookie saved in the event be 64 bit, but we can start with the
> simple 32 bit solution keeping the top 32 bits zeros. If this does
> indeed become an issue in the future, we could fix it with a 64 bit
> number. By making sure all the exposed "cookies" are 64 bit, it should
> not break anything. The cookie is just supposed to be a random unique
> number that associates a request with its deferred user space stack
> trace.
>
> With any exposed cookies to user space being 64 bits, this should not
> be an issue to address in the future.
FWIW, I liked your idea of making the cookie 64-bit with:
- 32-bit cpu number,
- 32-bit per-CPU free running counter.
This is simple, works on 32-bit archs, and it does not overflow as often
as time LSB because it counts execution contexts.
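A minimal sketch of that cookie layout (illustrative only; in the kernel
the counter would be a real per-CPU variable updated with this_cpu_inc(),
and the names here are made up):

```c
#include <stdint.h>

/*
 * Hypothetical sketch of a 64-bit cookie built from the CPU number
 * and a per-CPU free-running counter.  A plain array stands in for
 * real per-CPU data; no locking or preemption handling is shown.
 */
#define NR_CPUS_SKETCH	4

static uint32_t percpu_cnt[NR_CPUS_SKETCH];

static uint64_t make_cookie(uint32_t cpu)
{
	uint32_t cnt = ++percpu_cnt[cpu];	/* free running, may wrap */

	return ((uint64_t)cpu << 32) | cnt;
}
```

Since two requests on the same CPU always see different counter values
until the 32-bit counter wraps, collisions require ~4 billion requests on
one CPU between the two uses of a cookie.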
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-07-02 19:43 ` Mathieu Desnoyers
@ 2025-07-02 19:51 ` Steven Rostedt
0 siblings, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-02 19:51 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, Peter Zijlstra, linux-kernel, linux-trace-kernel,
bpf, x86, Masami Hiramatsu, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer
On Wed, 2 Jul 2025 15:43:08 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> From my perspective, making trace analysis results reliable is the most
> basic guarantee tooling should provide in order to make it trusted by
> users. So I am tempted to err towards robustness rather than take
> shortcuts because "it does not happen often".
Another solution, which I'm thinking of doing for perf, is to simply
state: a deferred stack trace does not go to any event from before
events were dropped.
That is, even if the stack trace can be associated, if events are dropped
before it gets out of the kernel, just say: "Sorry, the events before
the dropped events lost their user stack trace."
That may be better, as having no data is better than incorrect data.
-- Steve
* Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface
2025-07-02 19:48 ` Mathieu Desnoyers
@ 2025-07-02 20:10 ` Steven Rostedt
0 siblings, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-02 20:10 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, Peter Zijlstra, linux-kernel, linux-trace-kernel,
bpf, x86, Masami Hiramatsu, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer
On Wed, 2 Jul 2025 15:48:16 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> FWIW, I liked your idea of making the cookie 64-bit with:
>
> - 32-bit cpu number,
> - 32-bit per-CPU free running counter.
>
> This is simple, works on 32-bit archs, and it does not overflow as often
> as time LSB because it counts execution contexts.
I have no problem with implementing that. But Linus had his concerns.
If he's OK with that version, then I'll go with that.
-- Steve
* Re: [PATCH v12 02/14] unwind_user: Add frame pointer support
2025-07-01 15:36 ` Jens Remus
@ 2025-07-02 23:50 ` Steven Rostedt
2025-07-03 16:21 ` Jens Remus
0 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2025-07-02 23:50 UTC (permalink / raw)
To: Jens Remus
Cc: Linus Torvalds, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Andrew Morton, Jens Axboe, Florian Weimer
On Tue, 1 Jul 2025 17:36:55 +0200
Jens Remus <jremus@linux.ibm.com> wrote:
> On s390 the prev_frame_sp may be equal to curr_frame_sp for the topmost
> frame, as long as the topmost function did not allocate any stack. For
> instance when early in the prologue or when in a leaf function that does
> not require any stack space. My s390 sframe support patches would
> therefore currently change above check to:
>
> 	/* stack going in wrong direction? */
> 	if (sp <= state->sp - topmost)
> 		goto done;
How do you calculate "topmost" then?
Is it another field you add to "state"?
-- Steve
* Re: [PATCH v12 02/14] unwind_user: Add frame pointer support
2025-07-02 23:50 ` Steven Rostedt
@ 2025-07-03 16:21 ` Jens Remus
2025-07-07 21:28 ` Steven Rostedt
0 siblings, 1 reply; 53+ messages in thread
From: Jens Remus @ 2025-07-03 16:21 UTC (permalink / raw)
To: Steven Rostedt
Cc: Linus Torvalds, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Andrew Morton, Jens Axboe, Florian Weimer
On 03.07.2025 01:50, Steven Rostedt wrote:
> On Tue, 1 Jul 2025 17:36:55 +0200
> Jens Remus <jremus@linux.ibm.com> wrote:
>
>> On s390 the prev_frame_sp may be equal to curr_frame_sp for the topmost
>> frame, as long as the topmost function did not allocate any stack. For
>> instance when early in the prologue or when in a leaf function that does
>> not require any stack space. My s390 sframe support patches would
>> therefore currently change above check to:
>>
>> 	/* stack going in wrong direction? */
>> 	if (sp <= state->sp - topmost)
>> 		goto done;
>
> How do you calculate "topmost" then?
>
> Is it another field you add to "state"?
Correct. It is a boolean set to true in unwind_user_start() and set to
false in unwind_user_next() when updating the state.
I assume most architectures need the above change, as their SP at function
entry should be equal to the SP at the call site (unlike x86-64, due to CALL).
s390 also needs this information to allow restoring of FP/RA saved in
other registers (instead of on the stack) only for the topmost frame.
For any other frame arbitrary register contents would not be available,
as user unwind only unwinds SP, FP, and RA.
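To make the equality case concrete, here is a hedged sketch of such a
check (struct and function names are assumed for illustration; the real
struct unwind_user_state differs):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the unwind state, for illustration only. */
struct sketch_state {
	uint64_t sp;	/* SP of the frame just unwound */
	bool topmost;	/* true until the first unwind_user_next() */
};

/*
 * For the topmost frame the previous SP may legitimately equal the
 * current SP (a leaf function, or early in the prologue, with no
 * stack allocated), so subtracting the flag (1 or 0) from the bound
 * permits equality for that one frame only.
 */
static bool stack_wrong_direction(const struct sketch_state *state,
				  uint64_t sp)
{
	return sp <= state->sp - state->topmost;
}
```

With topmost set, sp == state->sp passes; once the first frame has been
consumed, an unchanged SP is again treated as going the wrong way.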
I would post my s390 sframe support patches as RFC once you have
provided a merged sframe branch as discussed in:
https://lore.kernel.org/all/20250702124737.565934b5@batman.local.home/
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
+49-7031-16-1128 Office
jremus@de.ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
* Re: [PATCH v12 01/14] unwind_user: Add user space unwinding API
2025-07-01 0:53 ` [PATCH v12 01/14] unwind_user: Add user space unwinding API Steven Rostedt
@ 2025-07-04 17:58 ` Mathieu Desnoyers
2025-07-04 18:20 ` Mathieu Desnoyers
1 sibling, 0 replies; 53+ messages in thread
From: Mathieu Desnoyers @ 2025-07-04 17:58 UTC (permalink / raw)
To: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer
On 2025-06-30 20:53, Steven Rostedt wrote:
[...]
> diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
> new file mode 100644
> index 000000000000..aa7923c1384f
> --- /dev/null
> +++ b/include/linux/unwind_user.h
[...]
> +
> +int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
> +
> +#define for_each_user_frame(state) \
> + for (unwind_user_start((state)); !(state)->done; unwind_user_next((state)))
You can remove the () around "state" when it's already surrounded by parentheses:
+#define for_each_user_frame(state) \
+ for (unwind_user_start(state); !(state)->done; unwind_user_next(state))
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [PATCH v12 01/14] unwind_user: Add user space unwinding API
2025-07-01 0:53 ` [PATCH v12 01/14] unwind_user: Add user space unwinding API Steven Rostedt
2025-07-04 17:58 ` Mathieu Desnoyers
@ 2025-07-04 18:20 ` Mathieu Desnoyers
2025-07-07 19:42 ` Steven Rostedt
1 sibling, 1 reply; 53+ messages in thread
From: Mathieu Desnoyers @ 2025-07-04 18:20 UTC (permalink / raw)
To: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer
On 2025-06-30 20:53, Steven Rostedt wrote:
> From: Josh Poimboeuf <jpoimboe@kernel.org>
>
> Introduce a generic API for unwinding user stacks.
>
> In order to expand user space unwinding to be able to handle more complex
> scenarios, such as deferred unwinding and reading user space information,
> create a generic interface that all architectures can use that support the
> various unwinding methods.
>
> This is an alternative method for handling user space stack traces from
> the simple stack_trace_save_user() API. This does not replace that
> interface, but this interface will be used to expand the functionality of
> user space stack walking.
>
> None of the structures introduced will be exposed to user space tooling.
Would it be possible to make those unwind APIs EXPORT_SYMBOL_GPL
so they are available for GPL kernel modules?
Thanks,
Mathieu
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> MAINTAINERS | 8 +++++
> arch/Kconfig | 3 ++
> include/linux/unwind_user.h | 15 +++++++++
> include/linux/unwind_user_types.h | 31 +++++++++++++++++
> kernel/Makefile | 1 +
> kernel/unwind/Makefile | 1 +
> kernel/unwind/user.c | 55 +++++++++++++++++++++++++++++++
> 7 files changed, 114 insertions(+)
> create mode 100644 include/linux/unwind_user.h
> create mode 100644 include/linux/unwind_user_types.h
> create mode 100644 kernel/unwind/Makefile
> create mode 100644 kernel/unwind/user.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 4bac4ea21b64..ed5705c4f7d9 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -25924,6 +25924,14 @@ F: Documentation/driver-api/uio-howto.rst
> F: drivers/uio/
> F: include/linux/uio_driver.h
>
> +USERSPACE STACK UNWINDING
> +M: Josh Poimboeuf <jpoimboe@kernel.org>
> +M: Steven Rostedt <rostedt@goodmis.org>
> +S: Maintained
> +F: include/linux/unwind*.h
> +F: kernel/unwind/
> +
> +
> UTIL-LINUX PACKAGE
> M: Karel Zak <kzak@redhat.com>
> L: util-linux@vger.kernel.org
> diff --git a/arch/Kconfig b/arch/Kconfig
> index a3308a220f86..ea59e5d7cc69 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -435,6 +435,9 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
> It uses the same command line parameters, and sysctl interface,
> as the generic hardlockup detectors.
>
> +config UNWIND_USER
> + bool
> +
> config HAVE_PERF_REGS
> bool
> help
> diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
> new file mode 100644
> index 000000000000..aa7923c1384f
> --- /dev/null
> +++ b/include/linux/unwind_user.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_UNWIND_USER_H
> +#define _LINUX_UNWIND_USER_H
> +
> +#include <linux/unwind_user_types.h>
> +
> +int unwind_user_start(struct unwind_user_state *state);
> +int unwind_user_next(struct unwind_user_state *state);
> +
> +int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
> +
> +#define for_each_user_frame(state) \
> + for (unwind_user_start((state)); !(state)->done; unwind_user_next((state)))
> +
> +#endif /* _LINUX_UNWIND_USER_H */
[...]
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [PATCH v12 01/14] unwind_user: Add user space unwinding API
2025-07-04 18:20 ` Mathieu Desnoyers
@ 2025-07-07 19:42 ` Steven Rostedt
2025-07-07 21:01 ` Steven Rostedt
0 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2025-07-07 19:42 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer
On Fri, 4 Jul 2025 14:20:54 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > None of the structures introduced will be exposed to user space tooling.
>
> Would it be possible to make those unwind APIs EXPORT_SYMBOL_GPL
> so they are available for GPL kernel modules ?
I'm OK with that, but others tend to complain about EXPORT_SYMBOL_GPL
for functions not used by modules in the kernel. But I personally feel
that LTTng should get an exception for that rule ;-)
-- Steve
* Re: [PATCH v12 01/14] unwind_user: Add user space unwinding API
2025-07-07 19:42 ` Steven Rostedt
@ 2025-07-07 21:01 ` Steven Rostedt
0 siblings, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-07 21:01 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Josh Poimboeuf, Peter Zijlstra, Ingo Molnar, Jiri Olsa,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer
On Mon, 7 Jul 2025 15:42:45 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Fri, 4 Jul 2025 14:20:54 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> > > None of the structures introduced will be exposed to user space tooling.
> >
> > Would it be possible to make those unwind APIs EXPORT_SYMBOL_GPL
> > so they are available for GPL kernel modules ?
>
> I'm OK with that, but others tend to complain about EXPORT_SYMBOL_GPL
> for functions not used by modules in the kernel. But I personally feel
> that LTTng should get an exception for that rule ;-)
I just noticed that this was to patch 1. The functions here probably
shouldn't be exported as they are more internal to the infrastructure.
In fact, I think I'll move that macro into the user.c code. I don't
think it should be used outside that file. And the
unwind_user_start/next() could also be static functions.
-- Steve
* Re: [PATCH v12 02/14] unwind_user: Add frame pointer support
2025-07-03 16:21 ` Jens Remus
@ 2025-07-07 21:28 ` Steven Rostedt
0 siblings, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2025-07-07 21:28 UTC (permalink / raw)
To: Jens Remus
Cc: Linus Torvalds, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Andrew Morton, Jens Axboe, Florian Weimer
On Thu, 3 Jul 2025 18:21:10 +0200
Jens Remus <jremus@linux.ibm.com> wrote:
> >> /* stack going in wrong direction? */
> >> if (sp <= state->sp - topmost)
> >> goto done;
> >
> > How do you calculate "topmost" then?
> >
> > Is it another field you add to "state"?
>
> Correct. It is a boolean set to true in unwind_user_start() and set to
> false in unwind_user_next() when updating the state.
So it's subtracting 1 or zero, so that the topmost frame can be equal?
Well, that would need a bit of commenting.
>
> I assume most architectures need above change, as their SP at function
> entry should be equal to the SP at call site (unlike x86-64 due to CALL).
>
> s390 also needs this information to allow restoring of FP/RA saved in
> other registers (instead of on the stack) only for the topmost frame.
> For any other frame arbitrary register contents would not be available,
> as user unwind only unwinds SP, FP, and RA.
>
> I would post my s390 sframe support patches as RFC once you have
> provided a merged sframe branch as discussed in:
> https://lore.kernel.org/all/20250702124737.565934b5@batman.local.home/
I did have a merge branch on my repo. But I was hoping to see your code
so that I can add this to this patch before having to post again. But
now I'm posting without this change, as I don't want to screw it up. I
think I know what it it looks like, but it would be better to see what
you did to make sure what I envision is correct.
Oh well. I'll post v13 without it.
-- Steve
end of thread, other threads:[~2025-07-07 21:28 UTC | newest]
Thread overview: 53+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-01 0:53 [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 01/14] unwind_user: Add user space unwinding API Steven Rostedt
2025-07-04 17:58 ` Mathieu Desnoyers
2025-07-04 18:20 ` Mathieu Desnoyers
2025-07-07 19:42 ` Steven Rostedt
2025-07-07 21:01 ` Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 02/14] unwind_user: Add frame pointer support Steven Rostedt
2025-07-01 2:10 ` Linus Torvalds
2025-07-01 2:56 ` Steven Rostedt
2025-07-01 3:05 ` Steven Rostedt
2025-07-01 15:36 ` Jens Remus
2025-07-02 23:50 ` Steven Rostedt
2025-07-03 16:21 ` Jens Remus
2025-07-07 21:28 ` Steven Rostedt
2025-07-01 4:46 ` Florian Weimer
2025-07-01 12:14 ` Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 03/14] unwind_user: Add compat mode " Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 04/14] unwind_user/deferred: Add unwind_user_faultable() Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 05/14] unwind_user/deferred: Add unwind cache Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
2025-07-02 16:36 ` Peter Zijlstra
2025-07-02 16:42 ` Steven Rostedt
2025-07-02 16:56 ` Linus Torvalds
2025-07-02 17:26 ` Steven Rostedt
2025-07-02 17:48 ` Steven Rostedt
2025-07-02 18:21 ` Linus Torvalds
2025-07-02 18:47 ` Mathieu Desnoyers
2025-07-02 19:05 ` Steven Rostedt
2025-07-02 19:12 ` Mathieu Desnoyers
2025-07-02 19:21 ` Steven Rostedt
2025-07-02 19:36 ` Steven Rostedt
2025-07-02 19:40 ` Steven Rostedt
2025-07-02 19:48 ` Mathieu Desnoyers
2025-07-02 20:10 ` Steven Rostedt
2025-07-02 19:43 ` Mathieu Desnoyers
2025-07-02 19:51 ` Steven Rostedt
2025-07-02 18:57 ` Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
2025-07-02 15:53 ` Jens Remus
2025-07-02 19:11 ` Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 08/14] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 09/14] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 10/14] unwind: Clear unwind_mask on exit back to user space Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 11/14] unwind: Add USED bit to only have one conditional on way " Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 12/14] unwind: Finish up unwind when a task exits Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 13/14] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
2025-07-01 0:53 ` [PATCH v12 14/14] unwind_user/x86: Enable compat mode " Steven Rostedt
2025-07-01 2:06 ` [PATCH v12 00/14] unwind_user: x86: Deferred unwinding infrastructure Linus Torvalds
2025-07-01 2:45 ` Steven Rostedt
2025-07-01 22:49 ` Kees Cook
2025-07-01 23:26 ` Steven Rostedt
2025-07-02 14:56 ` Kees Cook
2025-07-02 16:20 ` Mathieu Desnoyers