* [PATCH v8 00/18] unwind_user: perf: x86: Deferred unwinding infrastructure
@ 2025-05-09 16:45 Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 01/18] unwind_user: Add user space unwinding API Steven Rostedt
` (17 more replies)
0 siblings, 18 replies; 27+ messages in thread
From: Steven Rostedt @ 2025-05-09 16:45 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim
This series does not make any user space visible changes.
It only adds the necessary infrastructure of the deferred unwinder
and makes a few helpful cleanups to perf.
Based off of tip/master: 864fb128fda9392e40bbb2055460cc648d26c51e
Peter and Ingo,
Would you be willing to take this series? I'd like to get this part
into the kernel in the upcoming merge window so that we can then focus
on getting perf and ftrace to use it in the following one.
Perf exposes a lot of the interface to user space, as the perf tool
needs to handle the merging of the stacks. I figured it would be better
to get the kernel side mostly done first and then work out the kinks of
the code between user and kernel.
Are you OK with this?
I'd like to get this in the kernel for this merge window so that
Perf, ftrace, BPF and even LTTng can build on top of it simultaneously in
the next merge window. If something is found wrong with it, then it can
still be updated as no user space API has been exposed yet.
Changes since v7: https://lore.kernel.org/linux-trace-kernel/20250502164746.178864972@goodmis.org/
- Allocate unwind_cache as a structure and not just its entries
(Ingo Molnar)
- Fixed white space issues
(Ingo Molnar)
- Removed period from one of the commit subjects
(Ingo Molnar)
- Use a timestamp instead of a "cookie"
Instead of using a "cookie", the first request now records
local_clock(), and that timestamp serves the same purpose the
cookie did before. The caller can use it as a cookie for the
task, as in the previous patches, or record it in the user
stacktrace event. If the requested event also carries the
local_clock() timestamp, the timestamp can be used to tell
whether a stacktrace is valid when events have been dropped.
If one call stack is dropped, and the task returns to user
space and then re-enters the kernel and generates another call
stack, the tooling needs to know that the new stacktrace is not
valid for the previous request.
- Added kerneldoc to unwind_deferred_trace()
- Updated comments to kerneldoc for unwind_deferred_request()
- Clear the unwind_mask on exit.
If a tracer requests a deferred stacktrace after it has already
received one for the current entry into the kernel, it will not
trigger another stacktrace.
The unwind request now returns:
0 - successfully queued the callback
UNWIND_ALREADY_PENDING - queued by someone else
UNWIND_ALREADY_EXECUTED - not queued but was called previously
Negative - An error occurred
If perf or another tracer knows that triggering another stacktrace
will not cause an infinite loop, the code can be modified to allow
a tracer to request another trace even if it has already received
the current stacktrace.
Josh Poimboeuf (13):
unwind_user: Add user space unwinding API
unwind_user: Add frame pointer support
unwind_user/x86: Enable frame pointer unwinding on x86
perf/x86: Rename and move get_segment_base() and make it global
unwind_user: Add compat mode frame pointer support
unwind_user/x86: Enable compat mode frame pointer unwinding on x86
unwind_user/deferred: Add unwind cache
unwind_user/deferred: Add deferred unwinding interface
unwind_user/deferred: Make unwind deferral requests NMI-safe
perf: Remove get_perf_callchain() init_nr argument
perf: Have get_perf_callchain() return NULL if crosstask and user are set
perf: Simplify get_perf_callchain() user logic
perf: Skip user unwind if the task is a kernel thread
Steven Rostedt (5):
unwind_user/deferred: Add unwind_deferred_trace()
unwind deferred: Use bitmask to determine which callbacks to call
unwind deferred: Use SRCU unwind_deferred_task_work()
unwind: Clear unwind_mask on exit back to user space
perf: Use current->flags & PF_KTHREAD instead of current->mm == NULL
----
MAINTAINERS | 8 +
arch/Kconfig | 11 +
arch/x86/Kconfig | 2 +
arch/x86/events/core.c | 44 +---
arch/x86/include/asm/ptrace.h | 2 +
arch/x86/include/asm/unwind_user.h | 61 +++++
arch/x86/include/asm/unwind_user_types.h | 17 ++
arch/x86/kernel/ptrace.c | 38 ++++
include/asm-generic/Kbuild | 2 +
include/asm-generic/unwind_user.h | 24 ++
include/asm-generic/unwind_user_types.h | 9 +
include/linux/entry-common.h | 2 +
include/linux/perf_event.h | 2 +-
include/linux/sched.h | 6 +
include/linux/unwind_deferred.h | 72 ++++++
include/linux/unwind_deferred_types.h | 17 ++
include/linux/unwind_user.h | 15 ++
include/linux/unwind_user_types.h | 35 +++
kernel/Makefile | 1 +
kernel/bpf/stackmap.c | 4 +-
kernel/events/callchain.c | 38 ++--
kernel/events/core.c | 7 +-
kernel/fork.c | 4 +
kernel/unwind/Makefile | 1 +
kernel/unwind/deferred.c | 367 +++++++++++++++++++++++++++++++
kernel/unwind/user.c | 130 +++++++++++
26 files changed, 854 insertions(+), 65 deletions(-)
create mode 100644 arch/x86/include/asm/unwind_user.h
create mode 100644 arch/x86/include/asm/unwind_user_types.h
create mode 100644 include/asm-generic/unwind_user.h
create mode 100644 include/asm-generic/unwind_user_types.h
create mode 100644 include/linux/unwind_deferred.h
create mode 100644 include/linux/unwind_deferred_types.h
create mode 100644 include/linux/unwind_user.h
create mode 100644 include/linux/unwind_user_types.h
create mode 100644 kernel/unwind/Makefile
create mode 100644 kernel/unwind/deferred.c
create mode 100644 kernel/unwind/user.c
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v8 01/18] unwind_user: Add user space unwinding API
2025-05-09 16:45 [PATCH v8 00/18] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
@ 2025-05-09 16:45 ` Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 02/18] unwind_user: Add frame pointer support Steven Rostedt
` (16 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2025-05-09 16:45 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim
From: Josh Poimboeuf <jpoimboe@kernel.org>
Introduce a generic API for unwinding user stacks.
In order to expand user space unwinding to handle more complex
scenarios, such as deferred unwinding and reading user space
information, create a generic interface that all architectures
supporting the various unwinding methods can use.
This is an alternative to the simple stack_trace_save_user() API for
handling user space stack traces. It does not replace that interface;
rather, this interface will be used to expand the functionality of
user space stack walking.
None of the structures introduced will be exposed to user space tooling.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
MAINTAINERS | 8 +++++
arch/Kconfig | 3 ++
include/linux/unwind_user.h | 15 +++++++++
include/linux/unwind_user_types.h | 31 +++++++++++++++++
kernel/Makefile | 1 +
kernel/unwind/Makefile | 1 +
kernel/unwind/user.c | 55 +++++++++++++++++++++++++++++++
7 files changed, 114 insertions(+)
create mode 100644 include/linux/unwind_user.h
create mode 100644 include/linux/unwind_user_types.h
create mode 100644 kernel/unwind/Makefile
create mode 100644 kernel/unwind/user.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 800680e11431..ff1af7d36e77 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -25316,6 +25316,14 @@ F: Documentation/driver-api/uio-howto.rst
F: drivers/uio/
F: include/linux/uio_driver.h
+USERSPACE STACK UNWINDING
+M: Josh Poimboeuf <jpoimboe@kernel.org>
+M: Steven Rostedt <rostedt@goodmis.org>
+S: Maintained
+F: include/linux/unwind*.h
+F: kernel/unwind/
+
+
UTIL-LINUX PACKAGE
M: Karel Zak <kzak@redhat.com>
L: util-linux@vger.kernel.org
diff --git a/arch/Kconfig b/arch/Kconfig
index b0adb665041f..ccbcead9fac0 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -435,6 +435,9 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
It uses the same command line parameters, and sysctl interface,
as the generic hardlockup detectors.
+config UNWIND_USER
+ bool
+
config HAVE_PERF_REGS
bool
help
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
new file mode 100644
index 000000000000..aa7923c1384f
--- /dev/null
+++ b/include/linux/unwind_user.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_H
+#define _LINUX_UNWIND_USER_H
+
+#include <linux/unwind_user_types.h>
+
+int unwind_user_start(struct unwind_user_state *state);
+int unwind_user_next(struct unwind_user_state *state);
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
+
+#define for_each_user_frame(state) \
+ for (unwind_user_start((state)); !(state)->done; unwind_user_next((state)))
+
+#endif /* _LINUX_UNWIND_USER_H */
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
new file mode 100644
index 000000000000..6ed1b4ae74e1
--- /dev/null
+++ b/include/linux/unwind_user_types.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_TYPES_H
+#define _LINUX_UNWIND_USER_TYPES_H
+
+#include <linux/types.h>
+
+enum unwind_user_type {
+ UNWIND_USER_TYPE_NONE,
+};
+
+struct unwind_stacktrace {
+ unsigned int nr;
+ unsigned long *entries;
+};
+
+struct unwind_user_frame {
+ s32 cfa_off;
+ s32 ra_off;
+ s32 fp_off;
+ bool use_fp;
+};
+
+struct unwind_user_state {
+ unsigned long ip;
+ unsigned long sp;
+ unsigned long fp;
+ enum unwind_user_type type;
+ bool done;
+};
+
+#endif /* _LINUX_UNWIND_USER_TYPES_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 434929de17ef..5a2b2be2a32d 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -55,6 +55,7 @@ obj-y += rcu/
obj-y += livepatch/
obj-y += dma/
obj-y += entry/
+obj-y += unwind/
obj-$(CONFIG_MODULES) += module/
obj-$(CONFIG_KCMP) += kcmp.o
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
new file mode 100644
index 000000000000..349ce3677526
--- /dev/null
+++ b/kernel/unwind/Makefile
@@ -0,0 +1 @@
+ obj-$(CONFIG_UNWIND_USER) += user.o
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
new file mode 100644
index 000000000000..d30449328981
--- /dev/null
+++ b/kernel/unwind/user.c
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+* Generic interfaces for unwinding user space
+*/
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+#include <linux/unwind_user.h>
+
+int unwind_user_next(struct unwind_user_state *state)
+{
+ /* no implementation yet */
+ return -EINVAL;
+}
+
+int unwind_user_start(struct unwind_user_state *state)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+
+ memset(state, 0, sizeof(*state));
+
+ if ((current->flags & PF_KTHREAD) || !user_mode(regs)) {
+ state->done = true;
+ return -EINVAL;
+ }
+
+ state->type = UNWIND_USER_TYPE_NONE;
+
+ state->ip = instruction_pointer(regs);
+ state->sp = user_stack_pointer(regs);
+ state->fp = frame_pointer(regs);
+
+ return 0;
+}
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries)
+{
+ struct unwind_user_state state;
+
+ trace->nr = 0;
+
+ if (!max_entries)
+ return -EINVAL;
+
+ if (current->flags & PF_KTHREAD)
+ return 0;
+
+ for_each_user_frame(&state) {
+ trace->entries[trace->nr++] = state.ip;
+ if (trace->nr >= max_entries)
+ break;
+ }
+
+ return 0;
+}
--
2.47.2
* [PATCH v8 02/18] unwind_user: Add frame pointer support
2025-05-09 16:45 [PATCH v8 00/18] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 01/18] unwind_user: Add user space unwinding API Steven Rostedt
@ 2025-05-09 16:45 ` Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 03/18] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
` (15 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2025-05-09 16:45 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim
From: Josh Poimboeuf <jpoimboe@kernel.org>
Add optional support for user space frame pointer unwinding. If
supported, the arch needs to enable CONFIG_HAVE_UNWIND_USER_FP and
define ARCH_INIT_USER_FP_FRAME.
By encoding the frame offsets in struct unwind_user_frame, much of this
code can also be reused for future unwinder implementations like sframe.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
arch/Kconfig | 4 +++
include/asm-generic/unwind_user.h | 9 ++++++
include/linux/unwind_user_types.h | 1 +
kernel/unwind/user.c | 51 +++++++++++++++++++++++++++++--
4 files changed, 63 insertions(+), 2 deletions(-)
create mode 100644 include/asm-generic/unwind_user.h
diff --git a/arch/Kconfig b/arch/Kconfig
index ccbcead9fac0..0e3844c0e200 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -438,6 +438,10 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
config UNWIND_USER
bool
+config HAVE_UNWIND_USER_FP
+ bool
+ select UNWIND_USER
+
config HAVE_PERF_REGS
bool
help
diff --git a/include/asm-generic/unwind_user.h b/include/asm-generic/unwind_user.h
new file mode 100644
index 000000000000..832425502fb3
--- /dev/null
+++ b/include/asm-generic/unwind_user.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_UNWIND_USER_H
+#define _ASM_GENERIC_UNWIND_USER_H
+
+#ifndef ARCH_INIT_USER_FP_FRAME
+ #define ARCH_INIT_USER_FP_FRAME
+#endif
+
+#endif /* _ASM_GENERIC_UNWIND_USER_H */
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 6ed1b4ae74e1..65bd070eb6b0 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -6,6 +6,7 @@
enum unwind_user_type {
UNWIND_USER_TYPE_NONE,
+ UNWIND_USER_TYPE_FP,
};
struct unwind_stacktrace {
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index d30449328981..0671a81494d3 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -6,10 +6,54 @@
#include <linux/sched.h>
#include <linux/sched/task_stack.h>
#include <linux/unwind_user.h>
+#include <linux/uaccess.h>
+#include <asm/unwind_user.h>
+
+static struct unwind_user_frame fp_frame = {
+ ARCH_INIT_USER_FP_FRAME
+};
+
+static inline bool fp_state(struct unwind_user_state *state)
+{
+ return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP) &&
+ state->type == UNWIND_USER_TYPE_FP;
+}
int unwind_user_next(struct unwind_user_state *state)
{
- /* no implementation yet */
+ struct unwind_user_frame _frame;
+ struct unwind_user_frame *frame = &_frame;
+ unsigned long cfa = 0, fp, ra = 0;
+
+ if (state->done)
+ return -EINVAL;
+
+ if (fp_state(state))
+ frame = &fp_frame;
+ else
+ goto the_end;
+
+ cfa = (frame->use_fp ? state->fp : state->sp) + frame->cfa_off;
+
+ /* stack going in wrong direction? */
+ if (cfa <= state->sp)
+ goto the_end;
+
+ if (get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
+ goto the_end;
+
+ if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
+ goto the_end;
+
+ state->ip = ra;
+ state->sp = cfa;
+ if (frame->fp_off)
+ state->fp = fp;
+
+ return 0;
+
+the_end:
+ state->done = true;
return -EINVAL;
}
@@ -24,7 +68,10 @@ int unwind_user_start(struct unwind_user_state *state)
return -EINVAL;
}
- state->type = UNWIND_USER_TYPE_NONE;
+ if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+ state->type = UNWIND_USER_TYPE_FP;
+ else
+ state->type = UNWIND_USER_TYPE_NONE;
state->ip = instruction_pointer(regs);
state->sp = user_stack_pointer(regs);
--
2.47.2
* [PATCH v8 03/18] unwind_user/x86: Enable frame pointer unwinding on x86
2025-05-09 16:45 [PATCH v8 00/18] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 01/18] unwind_user: Add user space unwinding API Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 02/18] unwind_user: Add frame pointer support Steven Rostedt
@ 2025-05-09 16:45 ` Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 04/18] perf/x86: Rename and move get_segment_base() and make it global Steven Rostedt
` (14 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2025-05-09 16:45 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim
From: Josh Poimboeuf <jpoimboe@kernel.org>
Use ARCH_INIT_USER_FP_FRAME to describe how frame pointers are unwound
on x86, and enable CONFIG_HAVE_UNWIND_USER_FP accordingly so the
unwind_user interfaces can be used.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/unwind_user.h | 11 +++++++++++
2 files changed, 12 insertions(+)
create mode 100644 arch/x86/include/asm/unwind_user.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4c33c644b92d..a6e529dc4550 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -301,6 +301,7 @@ config X86
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL
select HAVE_UNSTABLE_SCHED_CLOCK
+ select HAVE_UNWIND_USER_FP if X86_64
select HAVE_USER_RETURN_NOTIFIER
select HAVE_GENERIC_VDSO
select VDSO_GETRANDOM if X86_64
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
new file mode 100644
index 000000000000..8597857bf896
--- /dev/null
+++ b/arch/x86/include/asm/unwind_user.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_UNWIND_USER_H
+#define _ASM_X86_UNWIND_USER_H
+
+#define ARCH_INIT_USER_FP_FRAME \
+ .cfa_off = (s32)sizeof(long) * 2, \
+ .ra_off = (s32)sizeof(long) * -1, \
+ .fp_off = (s32)sizeof(long) * -2, \
+ .use_fp = true,
+
+#endif /* _ASM_X86_UNWIND_USER_H */
--
2.47.2
* [PATCH v8 04/18] perf/x86: Rename and move get_segment_base() and make it global
2025-05-09 16:45 [PATCH v8 00/18] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
` (2 preceding siblings ...)
2025-05-09 16:45 ` [PATCH v8 03/18] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
@ 2025-05-09 16:45 ` Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 05/18] unwind_user: Add compat mode frame pointer support Steven Rostedt
` (13 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2025-05-09 16:45 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim
From: Josh Poimboeuf <jpoimboe@kernel.org>
get_segment_base() will be used by the unwind_user code, so make it
global and rename it so it doesn't conflict with a KVM function of the
same name.
As the function is no longer specific to perf, move it to ptrace.c as that
seems to be a better location for a generic function like this.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
arch/x86/events/core.c | 44 ++++-------------------------------
arch/x86/include/asm/ptrace.h | 2 ++
arch/x86/kernel/ptrace.c | 38 ++++++++++++++++++++++++++++++
3 files changed, 45 insertions(+), 39 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 2e10dcf897c5..cc6329235b68 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -43,6 +43,7 @@
#include <asm/ldt.h>
#include <asm/unwind.h>
#include <asm/uprobes.h>
+#include <asm/ptrace.h>
#include <asm/ibt.h>
#include "perf_event.h"
@@ -2809,41 +2810,6 @@ valid_user_frame(const void __user *fp, unsigned long size)
return __access_ok(fp, size);
}
-static unsigned long get_segment_base(unsigned int segment)
-{
- struct desc_struct *desc;
- unsigned int idx = segment >> 3;
-
- if ((segment & SEGMENT_TI_MASK) == SEGMENT_LDT) {
-#ifdef CONFIG_MODIFY_LDT_SYSCALL
- struct ldt_struct *ldt;
-
- /*
- * If we're not in a valid context with a real (not just lazy)
- * user mm, then don't even try.
- */
- if (!nmi_uaccess_okay())
- return 0;
-
- /* IRQs are off, so this synchronizes with smp_store_release */
- ldt = smp_load_acquire(&current->mm->context.ldt);
- if (!ldt || idx >= ldt->nr_entries)
- return 0;
-
- desc = &ldt->entries[idx];
-#else
- return 0;
-#endif
- } else {
- if (idx >= GDT_ENTRIES)
- return 0;
-
- desc = raw_cpu_ptr(gdt_page.gdt) + idx;
- }
-
- return get_desc_base(desc);
-}
-
#ifdef CONFIG_UPROBES
/*
* Heuristic-based check if uprobe is installed at the function entry.
@@ -2900,8 +2866,8 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
if (user_64bit_mode(regs))
return 0;
- cs_base = get_segment_base(regs->cs);
- ss_base = get_segment_base(regs->ss);
+ cs_base = segment_base_address(regs->cs);
+ ss_base = segment_base_address(regs->ss);
fp = compat_ptr(ss_base + regs->bp);
pagefault_disable();
@@ -3020,11 +2986,11 @@ static unsigned long code_segment_base(struct pt_regs *regs)
return 0x10 * regs->cs;
if (user_mode(regs) && regs->cs != __USER_CS)
- return get_segment_base(regs->cs);
+ return segment_base_address(regs->cs);
#else
if (user_mode(regs) && !user_64bit_mode(regs) &&
regs->cs != __USER32_CS)
- return get_segment_base(regs->cs);
+ return segment_base_address(regs->cs);
#endif
return 0;
}
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 50f75467f73d..59357ec98e52 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -314,6 +314,8 @@ static __always_inline bool regs_irqs_disabled(struct pt_regs *regs)
return !(regs->flags & X86_EFLAGS_IF);
}
+unsigned long segment_base_address(unsigned int segment);
+
/* Query offset/name of register from its name/offset */
extern int regs_query_register_offset(const char *name);
extern const char *regs_query_register_name(unsigned int offset);
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 095f04bdabdc..81353a09701b 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -41,6 +41,7 @@
#include <asm/syscall.h>
#include <asm/fsgsbase.h>
#include <asm/io_bitmap.h>
+#include <asm/mmu_context.h>
#include "tls.h"
@@ -339,6 +340,43 @@ static int set_segment_reg(struct task_struct *task,
#endif /* CONFIG_X86_32 */
+unsigned long segment_base_address(unsigned int segment)
+{
+ struct desc_struct *desc;
+ unsigned int idx = segment >> 3;
+
+ lockdep_assert_irqs_disabled();
+
+ if ((segment & SEGMENT_TI_MASK) == SEGMENT_LDT) {
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+ struct ldt_struct *ldt;
+
+ /*
+ * If we're not in a valid context with a real (not just lazy)
+ * user mm, then don't even try.
+ */
+ if (!nmi_uaccess_okay())
+ return 0;
+
+ /* IRQs are off, so this synchronizes with smp_store_release */
+ ldt = smp_load_acquire(&current->mm->context.ldt);
+ if (!ldt || idx >= ldt->nr_entries)
+ return 0;
+
+ desc = &ldt->entries[idx];
+#else
+ return 0;
+#endif
+ } else {
+ if (idx >= GDT_ENTRIES)
+ return 0;
+
+ desc = raw_cpu_ptr(gdt_page.gdt) + idx;
+ }
+
+ return get_desc_base(desc);
+}
+
static unsigned long get_flags(struct task_struct *task)
{
unsigned long retval = task_pt_regs(task)->flags;
--
2.47.2
* [PATCH v8 05/18] unwind_user: Add compat mode frame pointer support
2025-05-09 16:45 [PATCH v8 00/18] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
` (3 preceding siblings ...)
2025-05-09 16:45 ` [PATCH v8 04/18] perf/x86: Rename and move get_segment_base() and make it global Steven Rostedt
@ 2025-05-09 16:45 ` Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 06/18] unwind_user/x86: Enable compat mode frame pointer unwinding on x86 Steven Rostedt
` (12 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2025-05-09 16:45 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim
From: Josh Poimboeuf <jpoimboe@kernel.org>
Add optional support for user space compat mode frame pointer unwinding.
If supported, the arch needs to enable CONFIG_HAVE_UNWIND_USER_COMPAT_FP
and define ARCH_INIT_USER_COMPAT_FP_FRAME.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
arch/Kconfig | 4 +++
include/asm-generic/Kbuild | 2 ++
include/asm-generic/unwind_user.h | 15 +++++++++++
include/asm-generic/unwind_user_types.h | 9 +++++++
include/linux/unwind_user_types.h | 3 +++
kernel/unwind/user.c | 36 ++++++++++++++++++++++---
6 files changed, 65 insertions(+), 4 deletions(-)
create mode 100644 include/asm-generic/unwind_user_types.h
diff --git a/arch/Kconfig b/arch/Kconfig
index 0e3844c0e200..dbb1cc89e040 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -442,6 +442,10 @@ config HAVE_UNWIND_USER_FP
bool
select UNWIND_USER
+config HAVE_UNWIND_USER_COMPAT_FP
+ bool
+ depends on HAVE_UNWIND_USER_FP
+
config HAVE_PERF_REGS
bool
help
diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
index 8675b7b4ad23..b797a2434396 100644
--- a/include/asm-generic/Kbuild
+++ b/include/asm-generic/Kbuild
@@ -59,6 +59,8 @@ mandatory-y += tlbflush.h
mandatory-y += topology.h
mandatory-y += trace_clock.h
mandatory-y += uaccess.h
+mandatory-y += unwind_user.h
+mandatory-y += unwind_user_types.h
mandatory-y += vermagic.h
mandatory-y += vga.h
mandatory-y += video.h
diff --git a/include/asm-generic/unwind_user.h b/include/asm-generic/unwind_user.h
index 832425502fb3..385638ce4aec 100644
--- a/include/asm-generic/unwind_user.h
+++ b/include/asm-generic/unwind_user.h
@@ -2,8 +2,23 @@
#ifndef _ASM_GENERIC_UNWIND_USER_H
#define _ASM_GENERIC_UNWIND_USER_H
+#include <asm/unwind_user_types.h>
+
#ifndef ARCH_INIT_USER_FP_FRAME
#define ARCH_INIT_USER_FP_FRAME
#endif
+#ifndef ARCH_INIT_USER_COMPAT_FP_FRAME
+ #define ARCH_INIT_USER_COMPAT_FP_FRAME
+ #define in_compat_mode(regs) false
+#endif
+
+#ifndef arch_unwind_user_init
+static inline void arch_unwind_user_init(struct unwind_user_state *state, struct pt_regs *reg) {}
+#endif
+
+#ifndef arch_unwind_user_next
+static inline void arch_unwind_user_next(struct unwind_user_state *state) {}
+#endif
+
#endif /* _ASM_GENERIC_UNWIND_USER_H */
diff --git a/include/asm-generic/unwind_user_types.h b/include/asm-generic/unwind_user_types.h
new file mode 100644
index 000000000000..ee803de7c998
--- /dev/null
+++ b/include/asm-generic/unwind_user_types.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_UNWIND_USER_TYPES_H
+#define _ASM_GENERIC_UNWIND_USER_TYPES_H
+
+#ifndef arch_unwind_user_state
+struct arch_unwind_user_state {};
+#endif
+
+#endif /* _ASM_GENERIC_UNWIND_USER_TYPES_H */
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 65bd070eb6b0..3ec4a097a3dd 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -3,10 +3,12 @@
#define _LINUX_UNWIND_USER_TYPES_H
#include <linux/types.h>
+#include <asm/unwind_user_types.h>
enum unwind_user_type {
UNWIND_USER_TYPE_NONE,
UNWIND_USER_TYPE_FP,
+ UNWIND_USER_TYPE_COMPAT_FP,
};
struct unwind_stacktrace {
@@ -25,6 +27,7 @@ struct unwind_user_state {
unsigned long ip;
unsigned long sp;
unsigned long fp;
+ struct arch_unwind_user_state arch;
enum unwind_user_type type;
bool done;
};
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 0671a81494d3..635cc04bb299 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -13,12 +13,32 @@ static struct unwind_user_frame fp_frame = {
ARCH_INIT_USER_FP_FRAME
};
+static struct unwind_user_frame compat_fp_frame = {
+ ARCH_INIT_USER_COMPAT_FP_FRAME
+};
+
static inline bool fp_state(struct unwind_user_state *state)
{
return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP) &&
state->type == UNWIND_USER_TYPE_FP;
}
+static inline bool compat_state(struct unwind_user_state *state)
+{
+ return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) &&
+ state->type == UNWIND_USER_TYPE_COMPAT_FP;
+}
+
+#define UNWIND_GET_USER_LONG(to, from, state) \
+({ \
+ int __ret; \
+ if (compat_state(state)) \
+ __ret = get_user(to, (u32 __user *)(from)); \
+ else \
+ __ret = get_user(to, (u64 __user *)(from)); \
+ __ret; \
+})
+
int unwind_user_next(struct unwind_user_state *state)
{
struct unwind_user_frame _frame;
@@ -28,7 +48,9 @@ int unwind_user_next(struct unwind_user_state *state)
if (state->done)
return -EINVAL;
- if (fp_state(state))
+ if (compat_state(state))
+ frame = &compat_fp_frame;
+ else if (fp_state(state))
frame = &fp_frame;
else
goto the_end;
@@ -39,10 +61,10 @@ int unwind_user_next(struct unwind_user_state *state)
if (cfa <= state->sp)
goto the_end;
- if (get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
+ if (UNWIND_GET_USER_LONG(ra, cfa + frame->ra_off, state))
goto the_end;
- if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
+ if (frame->fp_off && UNWIND_GET_USER_LONG(fp, cfa + frame->fp_off, state))
goto the_end;
state->ip = ra;
@@ -50,6 +72,8 @@ int unwind_user_next(struct unwind_user_state *state)
if (frame->fp_off)
state->fp = fp;
+ arch_unwind_user_next(state);
+
return 0;
the_end:
@@ -68,7 +92,9 @@ int unwind_user_start(struct unwind_user_state *state)
return -EINVAL;
}
- if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+ if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) && in_compat_mode(regs))
+ state->type = UNWIND_USER_TYPE_COMPAT_FP;
+ else if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
state->type = UNWIND_USER_TYPE_FP;
else
state->type = UNWIND_USER_TYPE_NONE;
@@ -77,6 +103,8 @@ int unwind_user_start(struct unwind_user_state *state)
state->sp = user_stack_pointer(regs);
state->fp = frame_pointer(regs);
+ arch_unwind_user_init(state, regs);
+
return 0;
}
--
2.47.2
* [PATCH v8 06/18] unwind_user/x86: Enable compat mode frame pointer unwinding on x86
2025-05-09 16:45 [PATCH v8 00/18] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
` (4 preceding siblings ...)
2025-05-09 16:45 ` [PATCH v8 05/18] unwind_user: Add compat mode frame pointer support Steven Rostedt
@ 2025-05-09 16:45 ` Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 07/18] unwind_user/deferred: Add unwind_deferred_trace() Steven Rostedt
` (11 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2025-05-09 16:45 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim
From: Josh Poimboeuf <jpoimboe@kernel.org>
Use ARCH_INIT_USER_COMPAT_FP_FRAME to describe how frame pointers are
unwound on x86, and implement the hooks needed to add the segment base
addresses. Enable HAVE_UNWIND_USER_COMPAT_FP if the system has compat
mode compiled in.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/unwind_user.h | 50 ++++++++++++++++++++++++
arch/x86/include/asm/unwind_user_types.h | 17 ++++++++
3 files changed, 68 insertions(+)
create mode 100644 arch/x86/include/asm/unwind_user_types.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a6e529dc4550..ee81e06cabca 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -301,6 +301,7 @@ config X86
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UACCESS_VALIDATION if HAVE_OBJTOOL
select HAVE_UNSTABLE_SCHED_CLOCK
+ select HAVE_UNWIND_USER_COMPAT_FP if IA32_EMULATION
select HAVE_UNWIND_USER_FP if X86_64
select HAVE_USER_RETURN_NOTIFIER
select HAVE_GENERIC_VDSO
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
index 8597857bf896..bb1148111259 100644
--- a/arch/x86/include/asm/unwind_user.h
+++ b/arch/x86/include/asm/unwind_user.h
@@ -2,10 +2,60 @@
#ifndef _ASM_X86_UNWIND_USER_H
#define _ASM_X86_UNWIND_USER_H
+#include <linux/unwind_user_types.h>
+#include <asm/ptrace.h>
+#include <asm/perf_event.h>
+
#define ARCH_INIT_USER_FP_FRAME \
.cfa_off = (s32)sizeof(long) * 2, \
.ra_off = (s32)sizeof(long) * -1, \
.fp_off = (s32)sizeof(long) * -2, \
.use_fp = true,
+#ifdef CONFIG_IA32_EMULATION
+
+#define ARCH_INIT_USER_COMPAT_FP_FRAME \
+ .cfa_off = (s32)sizeof(u32) * 2, \
+ .ra_off = (s32)sizeof(u32) * -1, \
+ .fp_off = (s32)sizeof(u32) * -2, \
+ .use_fp = true,
+
+#define in_compat_mode(regs) !user_64bit_mode(regs)
+
+static inline void arch_unwind_user_init(struct unwind_user_state *state,
+ struct pt_regs *regs)
+{
+ unsigned long cs_base, ss_base;
+
+ if (state->type != UNWIND_USER_TYPE_COMPAT_FP)
+ return;
+
+ scoped_guard(irqsave) {
+ cs_base = segment_base_address(regs->cs);
+ ss_base = segment_base_address(regs->ss);
+ }
+
+ state->arch.cs_base = cs_base;
+ state->arch.ss_base = ss_base;
+
+ state->ip += cs_base;
+ state->sp += ss_base;
+ state->fp += ss_base;
+}
+#define arch_unwind_user_init arch_unwind_user_init
+
+static inline void arch_unwind_user_next(struct unwind_user_state *state)
+{
+ if (state->type != UNWIND_USER_TYPE_COMPAT_FP)
+ return;
+
+ state->ip += state->arch.cs_base;
+ state->fp += state->arch.ss_base;
+}
+#define arch_unwind_user_next arch_unwind_user_next
+
+#endif /* CONFIG_IA32_EMULATION */
+
+#include <asm-generic/unwind_user.h>
+
#endif /* _ASM_X86_UNWIND_USER_H */
diff --git a/arch/x86/include/asm/unwind_user_types.h b/arch/x86/include/asm/unwind_user_types.h
new file mode 100644
index 000000000000..d7074dc5f0ce
--- /dev/null
+++ b/arch/x86/include/asm/unwind_user_types.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_UNWIND_USER_TYPES_H
+#define _ASM_UNWIND_USER_TYPES_H
+
+#ifdef CONFIG_IA32_EMULATION
+
+struct arch_unwind_user_state {
+ unsigned long ss_base;
+ unsigned long cs_base;
+};
+#define arch_unwind_user_state arch_unwind_user_state
+
+#endif /* CONFIG_IA32_EMULATION */
+
+#include <asm-generic/unwind_user_types.h>
+
+#endif /* _ASM_UNWIND_USER_TYPES_H */
--
2.47.2
* [PATCH v8 07/18] unwind_user/deferred: Add unwind_deferred_trace()
2025-05-09 16:45 [PATCH v8 00/18] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
` (5 preceding siblings ...)
2025-05-09 16:45 ` [PATCH v8 06/18] unwind_user/x86: Enable compat mode frame pointer unwinding on x86 Steven Rostedt
@ 2025-05-09 16:45 ` Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 08/18] unwind_user/deferred: Add unwind cache Steven Rostedt
` (10 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2025-05-09 16:45 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim
From: Steven Rostedt <rostedt@goodmis.org>
Add a function, unwind_deferred_trace(), that retrieves a user space stack
trace. It must be called from a faultable context; a tracer can call it
when a task is about to enter user space, or has just come back from user
space and has interrupts enabled.
This code is based on Josh Poimboeuf's deferred unwinding work:
Link: https://lore.kernel.org/all/6052e8487746603bdb29b65f4033e739092d9925.1737511963.git.jpoimboe@kernel.org/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v7: https://lore.kernel.org/20250502165008.564172398@goodmis.org
- Fixed whitespace issues
- Added kerneldoc to unwind_deferred_trace()
include/linux/sched.h | 5 +++
include/linux/unwind_deferred.h | 24 +++++++++++
include/linux/unwind_deferred_types.h | 9 ++++
kernel/fork.c | 4 ++
kernel/unwind/Makefile | 2 +-
kernel/unwind/deferred.c | 60 +++++++++++++++++++++++++++
6 files changed, 103 insertions(+), 1 deletion(-)
create mode 100644 include/linux/unwind_deferred.h
create mode 100644 include/linux/unwind_deferred_types.h
create mode 100644 kernel/unwind/deferred.c
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4ecc0c6b1cb0..a1e1c07cadfb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -47,6 +47,7 @@
#include <linux/livepatch_sched.h>
#include <linux/uidgid_types.h>
#include <linux/tracepoint-defs.h>
+#include <linux/unwind_deferred_types.h>
#include <asm/kmap_size.h>
/* task_struct member predeclarations (sorted alphabetically): */
@@ -1646,6 +1647,10 @@ struct task_struct {
struct user_event_mm *user_event_mm;
#endif
+#ifdef CONFIG_UNWIND_USER
+ struct unwind_task_info unwind_info;
+#endif
+
/* CPU-specific state of this task: */
struct thread_struct thread;
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
new file mode 100644
index 000000000000..5064ebe38c4f
--- /dev/null
+++ b/include/linux/unwind_deferred.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_DEFERRED_H
+#define _LINUX_UNWIND_USER_DEFERRED_H
+
+#include <linux/unwind_user.h>
+#include <linux/unwind_deferred_types.h>
+
+#ifdef CONFIG_UNWIND_USER
+
+void unwind_task_init(struct task_struct *task);
+void unwind_task_free(struct task_struct *task);
+
+int unwind_deferred_trace(struct unwind_stacktrace *trace);
+
+#else /* !CONFIG_UNWIND_USER */
+
+static inline void unwind_task_init(struct task_struct *task) {}
+static inline void unwind_task_free(struct task_struct *task) {}
+
+static inline int unwind_deferred_trace(struct unwind_stacktrace *trace) { return -ENOSYS; }
+
+#endif /* !CONFIG_UNWIND_USER */
+
+#endif /* _LINUX_UNWIND_USER_DEFERRED_H */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
new file mode 100644
index 000000000000..aa32db574e43
--- /dev/null
+++ b/include/linux/unwind_deferred_types.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+
+struct unwind_task_info {
+ unsigned long *entries;
+};
+
+#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index c4b26cd8998b..8c79c7c2c553 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -105,6 +105,7 @@
#include <uapi/linux/pidfd.h>
#include <linux/pidfs.h>
#include <linux/tick.h>
+#include <linux/unwind_deferred.h>
#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -991,6 +992,7 @@ void __put_task_struct(struct task_struct *tsk)
WARN_ON(refcount_read(&tsk->usage));
WARN_ON(tsk == current);
+ unwind_task_free(tsk);
sched_ext_free(tsk);
io_uring_free(tsk);
cgroup_free(tsk);
@@ -2395,6 +2397,8 @@ __latent_entropy struct task_struct *copy_process(
p->bpf_ctx = NULL;
#endif
+ unwind_task_init(p);
+
/* Perform scheduler related setup. Assign this task to a CPU. */
retval = sched_fork(clone_flags, p);
if (retval)
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
index 349ce3677526..6752ac96d7e2 100644
--- a/kernel/unwind/Makefile
+++ b/kernel/unwind/Makefile
@@ -1 +1 @@
- obj-$(CONFIG_UNWIND_USER) += user.o
+ obj-$(CONFIG_UNWIND_USER) += user.o deferred.o
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
new file mode 100644
index 000000000000..0bafb95e6336
--- /dev/null
+++ b/kernel/unwind/deferred.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Deferred user space unwinding
+ */
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/unwind_deferred.h>
+
+#define UNWIND_MAX_ENTRIES 512
+
+/**
+ * unwind_deferred_trace - Produce a user stacktrace in faultable context
+ * @trace: The descriptor that will store the user stacktrace
+ *
+ * This must be called in a known faultable context (usually when entering
+ * or exiting user space). Depending on the available implementations
+ * the @trace will be loaded with the addresses of the user space stacktrace
+ * if it can be found.
+ *
+ * Return: 0 on success and negative on error
+ * On success @trace will contain the user space stacktrace
+ */
+int unwind_deferred_trace(struct unwind_stacktrace *trace)
+{
+ struct unwind_task_info *info = &current->unwind_info;
+
+ /* Should always be called from faultable context */
+ might_fault();
+
+ if (current->flags & PF_EXITING)
+ return -EINVAL;
+
+ if (!info->entries) {
+ info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
+ GFP_KERNEL);
+ if (!info->entries)
+ return -ENOMEM;
+ }
+
+ trace->nr = 0;
+ trace->entries = info->entries;
+ unwind_user(trace, UNWIND_MAX_ENTRIES);
+
+ return 0;
+}
+
+void unwind_task_init(struct task_struct *task)
+{
+ struct unwind_task_info *info = &task->unwind_info;
+
+ memset(info, 0, sizeof(*info));
+}
+
+void unwind_task_free(struct task_struct *task)
+{
+ struct unwind_task_info *info = &task->unwind_info;
+
+ kfree(info->entries);
+}
--
2.47.2
* [PATCH v8 08/18] unwind_user/deferred: Add unwind cache
2025-05-09 16:45 [PATCH v8 00/18] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
` (6 preceding siblings ...)
2025-05-09 16:45 ` [PATCH v8 07/18] unwind_user/deferred: Add unwind_deferred_trace() Steven Rostedt
@ 2025-05-09 16:45 ` Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 09/18] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
` (9 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2025-05-09 16:45 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim
From: Josh Poimboeuf <jpoimboe@kernel.org>
Cache the results of the unwind to ensure the unwind is only performed
once, even when called by multiple tracers.
The cache's nr_entries gets cleared every time the task exits the kernel.
When a stacktrace is requested, nr_entries gets set to the number of
entries in the stacktrace. If another stacktrace is requested while
nr_entries is not zero, the cache already holds the stacktrace that would
be retrieved, so the unwind is skipped and the cached entries are handed
to the caller.
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v7: https://lore.kernel.org/20250502165008.734340489@goodmis.org
- Allocate unwind_cache as a structure and not just its entries
(Ingo Molnar)
- Fixed white space issues
(Ingo Molnar)
include/linux/entry-common.h | 2 ++
include/linux/unwind_deferred.h | 8 ++++++++
include/linux/unwind_deferred_types.h | 7 ++++++-
kernel/unwind/deferred.c | 26 ++++++++++++++++++++------
4 files changed, 36 insertions(+), 7 deletions(-)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index f94f3fdf15fc..6e850c9d3f0c 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -12,6 +12,7 @@
#include <linux/resume_user_mode.h>
#include <linux/tick.h>
#include <linux/kmsan.h>
+#include <linux/unwind_deferred.h>
#include <asm/entry-common.h>
#include <asm/syscall.h>
@@ -362,6 +363,7 @@ static __always_inline void exit_to_user_mode(void)
lockdep_hardirqs_on_prepare();
instrumentation_end();
+ unwind_exit_to_user_mode();
user_enter_irqoff();
arch_exit_to_user_mode();
lockdep_hardirqs_on(CALLER_ADDR0);
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 5064ebe38c4f..7d6cb2ffd084 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -12,6 +12,12 @@ void unwind_task_free(struct task_struct *task);
int unwind_deferred_trace(struct unwind_stacktrace *trace);
+static __always_inline void unwind_exit_to_user_mode(void)
+{
+ if (unlikely(current->unwind_info.cache))
+ current->unwind_info.cache->nr_entries = 0;
+}
+
#else /* !CONFIG_UNWIND_USER */
static inline void unwind_task_init(struct task_struct *task) {}
@@ -19,6 +25,8 @@ static inline void unwind_task_free(struct task_struct *task) {}
static inline int unwind_deferred_trace(struct unwind_stacktrace *trace) { return -ENOSYS; }
+static inline void unwind_exit_to_user_mode(void) {}
+
#endif /* !CONFIG_UNWIND_USER */
#endif /* _LINUX_UNWIND_USER_DEFERRED_H */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index aa32db574e43..db5b54b18828 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -2,8 +2,13 @@
#ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+struct unwind_cache {
+ unsigned int nr_entries;
+ unsigned long entries[];
+};
+
struct unwind_task_info {
- unsigned long *entries;
+ struct unwind_cache *cache;
};
#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 0bafb95e6336..e3913781c8c6 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -24,6 +24,7 @@
int unwind_deferred_trace(struct unwind_stacktrace *trace)
{
struct unwind_task_info *info = &current->unwind_info;
+ struct unwind_cache *cache;
/* Should always be called from faultable context */
might_fault();
@@ -31,17 +32,30 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
if (current->flags & PF_EXITING)
return -EINVAL;
- if (!info->entries) {
- info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
- GFP_KERNEL);
- if (!info->entries)
+ if (!info->cache) {
+ info->cache = kzalloc(struct_size(cache, entries, UNWIND_MAX_ENTRIES),
+ GFP_KERNEL);
+ if (!info->cache)
return -ENOMEM;
}
+ cache = info->cache;
+ trace->entries = cache->entries;
+
+ if (cache->nr_entries) {
+ /*
+ * The user stack has already been previously unwound in this
+ * entry context. Skip the unwind and use the cache.
+ */
+ trace->nr = cache->nr_entries;
+ return 0;
+ }
+
trace->nr = 0;
- trace->entries = info->entries;
unwind_user(trace, UNWIND_MAX_ENTRIES);
+ cache->nr_entries = trace->nr;
+
return 0;
}
@@ -56,5 +70,5 @@ void unwind_task_free(struct task_struct *task)
{
struct unwind_task_info *info = &task->unwind_info;
- kfree(info->entries);
+ kfree(info->cache);
}
--
2.47.2
* [PATCH v8 09/18] unwind_user/deferred: Add deferred unwinding interface
2025-05-09 16:45 [PATCH v8 00/18] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
` (7 preceding siblings ...)
2025-05-09 16:45 ` [PATCH v8 08/18] unwind_user/deferred: Add unwind cache Steven Rostedt
@ 2025-05-09 16:45 ` Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 10/18] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
` (8 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2025-05-09 16:45 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim
From: Josh Poimboeuf <jpoimboe@kernel.org>
Add an interface for scheduling task work to unwind the user space stack
before returning to user space. This solves several problems for its
callers:
- Ensure the unwind happens in task context even if the caller may be
running in NMI or interrupt context.
- Avoid duplicate unwinds, whether called multiple times by the same
caller or by different callers.
- Take a timestamp when the first request comes in since the task
entered the kernel. This will be returned to the calling function
along with the stack trace when the task leaves the kernel. This
timestamp can be used to correlate kernel unwinds/traces with the user
unwind.
The timestamp is used to detect whether two requests refer to the same
stacktrace. It is generated the first time a user space stacktrace is
requested after the task enters the kernel.
The timestamp is passed to the caller on request, and when the stacktrace
is generated upon returning to user space, the requester's callback is
invoked with the timestamp as well as the stacktrace.
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v7: https://lore.kernel.org/20250502165008.904786447@goodmis.org
- Use a timestamp instead of a "cookie"
- Updated comments to kerneldoc for unwind_deferred_request()
include/linux/unwind_deferred.h | 18 ++++
include/linux/unwind_deferred_types.h | 3 +
kernel/unwind/deferred.c | 131 +++++++++++++++++++++++++-
3 files changed, 151 insertions(+), 1 deletion(-)
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 7d6cb2ffd084..a384eef719a3 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -2,9 +2,19 @@
#ifndef _LINUX_UNWIND_USER_DEFERRED_H
#define _LINUX_UNWIND_USER_DEFERRED_H
+#include <linux/task_work.h>
#include <linux/unwind_user.h>
#include <linux/unwind_deferred_types.h>
+struct unwind_work;
+
+typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stacktrace *trace, u64 timestamp);
+
+struct unwind_work {
+ struct list_head list;
+ unwind_callback_t func;
+};
+
#ifdef CONFIG_UNWIND_USER
void unwind_task_init(struct task_struct *task);
@@ -12,10 +22,15 @@ void unwind_task_free(struct task_struct *task);
int unwind_deferred_trace(struct unwind_stacktrace *trace);
+int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
+int unwind_deferred_request(struct unwind_work *work, u64 *timestamp);
+void unwind_deferred_cancel(struct unwind_work *work);
+
static __always_inline void unwind_exit_to_user_mode(void)
{
if (unlikely(current->unwind_info.cache))
current->unwind_info.cache->nr_entries = 0;
+ current->unwind_info.timestamp = 0;
}
#else /* !CONFIG_UNWIND_USER */
@@ -24,6 +39,9 @@ static inline void unwind_task_init(struct task_struct *task) {}
static inline void unwind_task_free(struct task_struct *task) {}
static inline int unwind_deferred_trace(struct unwind_stacktrace *trace) { return -ENOSYS; }
+static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func) { return -ENOSYS; }
+static inline int unwind_deferred_request(struct unwind_work *work, u64 *timestamp) { return -ENOSYS; }
+static inline void unwind_deferred_cancel(struct unwind_work *work) {}
static inline void unwind_exit_to_user_mode(void) {}
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index db5b54b18828..5df264cf81ad 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -9,6 +9,9 @@ struct unwind_cache {
struct unwind_task_info {
struct unwind_cache *cache;
+ struct callback_head work;
+ u64 timestamp;
+ int pending;
};
#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index e3913781c8c6..b76c704ddc6d 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -2,13 +2,35 @@
/*
* Deferred user space unwinding
*/
+#include <linux/sched/task_stack.h>
+#include <linux/unwind_deferred.h>
+#include <linux/sched/clock.h>
+#include <linux/task_work.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/slab.h>
-#include <linux/unwind_deferred.h>
+#include <linux/mm.h>
#define UNWIND_MAX_ENTRIES 512
+/* Guards adding to and reading the list of callbacks */
+static DEFINE_MUTEX(callback_mutex);
+static LIST_HEAD(callbacks);
+
+/*
+ * Read the task context timestamp, if this is the first caller then
+ * it will set the timestamp.
+ */
+static u64 get_timestamp(struct unwind_task_info *info)
+{
+ lockdep_assert_irqs_disabled();
+
+ if (!info->timestamp)
+ info->timestamp = local_clock();
+
+ return info->timestamp;
+}
+
/**
* unwind_deferred_trace - Produce a user stacktrace in faultable context
* @trace: The descriptor that will store the user stacktrace
@@ -59,11 +81,117 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
return 0;
}
+static void unwind_deferred_task_work(struct callback_head *head)
+{
+ struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
+ struct unwind_stacktrace trace;
+ struct unwind_work *work;
+ u64 timestamp;
+
+ if (WARN_ON_ONCE(!info->pending))
+ return;
+
+ /* Allow work to come in again */
+ WRITE_ONCE(info->pending, 0);
+
+ /*
+ * From here on out, the callback must always be called, even if it's
+ * just an empty trace.
+ */
+ trace.nr = 0;
+ trace.entries = NULL;
+
+ unwind_deferred_trace(&trace);
+
+ timestamp = info->timestamp;
+
+ guard(mutex)(&callback_mutex);
+ list_for_each_entry(work, &callbacks, list) {
+ work->func(work, &trace, timestamp);
+ }
+}
+
+/**
+ * unwind_deferred_request - Request a user stacktrace on task exit
+ * @work: Unwind descriptor requesting the trace
+ * @timestamp: The time stamp of the first request made for this task
+ *
+ * Schedule a user space unwind to be done in task work before exiting the
+ * kernel.
+ *
+ * The returned @timestamp output is the timestamp of the very first request
+ * for a user space stacktrace for this task since it entered the kernel.
+ * It can be from a request by any caller of this infrastructure.
+ * Its value will also be passed to the callback function. It can be
+ * used to stitch kernel and user stack traces together in post-processing.
+ *
+ * It's valid to call this function multiple times for the same @work within
+ * the same task entry context. Each call will return the same timestamp
+ * while the task hasn't left the kernel. If the callback is not pending
+ * because it already ran for this entry context, it will be called again
+ * with the same stack trace and timestamp.
+ *
+ * Return: 1 if the callback was already queued.
+ * 0 if the callback successfully was queued.
+ * Negative if there's an error.
+ * @timestamp holds the timestamp of the first request by any user
+ */
+int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
+{
+ struct unwind_task_info *info = &current->unwind_info;
+ int ret;
+
+ *timestamp = 0;
+
+ if (WARN_ON_ONCE(in_nmi()))
+ return -EINVAL;
+
+ if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
+ !user_mode(task_pt_regs(current)))
+ return -EINVAL;
+
+ guard(irqsave)();
+
+ *timestamp = get_timestamp(info);
+
+ /* callback already pending? */
+ if (info->pending)
+ return 1;
+
+ /* The work has been claimed, now schedule it. */
+ ret = task_work_add(current, &info->work, TWA_RESUME);
+ if (WARN_ON_ONCE(ret))
+ return ret;
+
+ info->pending = 1;
+ return 0;
+}
+
+void unwind_deferred_cancel(struct unwind_work *work)
+{
+ if (!work)
+ return;
+
+ guard(mutex)(&callback_mutex);
+ list_del(&work->list);
+}
+
+int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
+{
+ memset(work, 0, sizeof(*work));
+
+ guard(mutex)(&callback_mutex);
+ list_add(&work->list, &callbacks);
+ work->func = func;
+ return 0;
+}
+
void unwind_task_init(struct task_struct *task)
{
struct unwind_task_info *info = &task->unwind_info;
memset(info, 0, sizeof(*info));
+ init_task_work(&info->work, unwind_deferred_task_work);
}
void unwind_task_free(struct task_struct *task)
@@ -71,4 +199,5 @@ void unwind_task_free(struct task_struct *task)
struct unwind_task_info *info = &task->unwind_info;
kfree(info->cache);
+ task_work_cancel(task, &info->work);
}
--
2.47.2
* [PATCH v8 10/18] unwind_user/deferred: Make unwind deferral requests NMI-safe
2025-05-09 16:45 [PATCH v8 00/18] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
` (8 preceding siblings ...)
2025-05-09 16:45 ` [PATCH v8 09/18] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
@ 2025-05-09 16:45 ` Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 11/18] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
` (7 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2025-05-09 16:45 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim
From: Josh Poimboeuf <jpoimboe@kernel.org>
Make unwind_deferred_request() NMI-safe so tracers in NMI context can
call it and safely request a user space stacktrace when the task exits.
An "nmi_timestamp" field is added to unwind_task_info; it is updated by
NMIs so that they do not race with setting info->timestamp.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v7: https://lore.kernel.org/20250502165009.069806229@goodmis.org
- Updated to use timestamp instead of cookie
include/linux/unwind_deferred_types.h | 1 +
kernel/unwind/deferred.c | 91 ++++++++++++++++++++++++---
2 files changed, 84 insertions(+), 8 deletions(-)
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 5df264cf81ad..ae27a02234b8 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -11,6 +11,7 @@ struct unwind_task_info {
struct unwind_cache *cache;
struct callback_head work;
u64 timestamp;
+ u64 nmi_timestamp;
int pending;
};
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index b76c704ddc6d..238cd97079ec 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -25,8 +25,27 @@ static u64 get_timestamp(struct unwind_task_info *info)
{
lockdep_assert_irqs_disabled();
- if (!info->timestamp)
- info->timestamp = local_clock();
+ /*
+ * Note, the timestamp is generated on the first request.
+ * If it exists here, then the timestamp is earlier than
+ * this request and it means that this request will be
+ * valid for the stacktrace.
+ */
+ if (!info->timestamp) {
+ WRITE_ONCE(info->timestamp, local_clock());
+ barrier();
+ /*
+ * If an NMI came in and set a timestamp, it means that
+ * it happened before this timestamp was set (otherwise
+ * the NMI would have used this one). Use the NMI timestamp
+ * instead.
+ */
+ if (unlikely(info->nmi_timestamp)) {
+ WRITE_ONCE(info->timestamp, info->nmi_timestamp);
+ barrier();
+ WRITE_ONCE(info->nmi_timestamp, 0);
+ }
+ }
return info->timestamp;
}
@@ -103,6 +122,13 @@ static void unwind_deferred_task_work(struct callback_head *head)
unwind_deferred_trace(&trace);
+ /* Check if the timestamp was only set by NMI */
+ if (info->nmi_timestamp) {
+ WRITE_ONCE(info->timestamp, info->nmi_timestamp);
+ barrier();
+ WRITE_ONCE(info->nmi_timestamp, 0);
+ }
+
timestamp = info->timestamp;
guard(mutex)(&callback_mutex);
@@ -111,6 +137,48 @@ static void unwind_deferred_task_work(struct callback_head *head)
}
}
+static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
+{
+ struct unwind_task_info *info = &current->unwind_info;
+ bool inited_timestamp = false;
+ int ret;
+
+ /* Always use the nmi_timestamp first */
+ *timestamp = info->nmi_timestamp ? : info->timestamp;
+
+ if (!*timestamp) {
+ /*
+ * This is the first unwind request since the most recent entry
+ * from user space. Initialize the task timestamp.
+ *
+ * Don't write to info->timestamp directly, otherwise it may race
+ * with an interruption of get_timestamp().
+ */
+ info->nmi_timestamp = local_clock();
+ *timestamp = info->nmi_timestamp;
+ inited_timestamp = true;
+ }
+
+ if (info->pending)
+ return 1;
+
+ ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
+ if (ret) {
+ /*
+ * If this request initialized nmi_timestamp but the work was
+ * not queued, there is no guarantee it will ever be consumed.
+ * Set it back to zero.
+ */
+ if (inited_timestamp)
+ info->nmi_timestamp = 0;
+ return ret;
+ }
+
+ info->pending = 1;
+
+ return 0;
+}
+
/**
* unwind_deferred_request - Request a user stacktrace on task exit
* @work: Unwind descriptor requesting the trace
@@ -139,31 +207,38 @@ static void unwind_deferred_task_work(struct callback_head *head)
int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
{
struct unwind_task_info *info = &current->unwind_info;
+ int pending;
int ret;
*timestamp = 0;
- if (WARN_ON_ONCE(in_nmi()))
- return -EINVAL;
-
if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
!user_mode(task_pt_regs(current)))
return -EINVAL;
+ if (in_nmi())
+ return unwind_deferred_request_nmi(work, timestamp);
+
guard(irqsave)();
*timestamp = get_timestamp(info);
/* callback already pending? */
- if (info->pending)
+ pending = READ_ONCE(info->pending);
+ if (pending)
+ return 1;
+
+ /* Claim the work unless an NMI just now swooped in to do so. */
+ if (!try_cmpxchg(&info->pending, &pending, 1))
return 1;
/* The work has been claimed, now schedule it. */
ret = task_work_add(current, &info->work, TWA_RESUME);
- if (WARN_ON_ONCE(ret))
+ if (WARN_ON_ONCE(ret)) {
+ WRITE_ONCE(info->pending, 0);
return ret;
+ }
- info->pending = 1;
return 0;
}
--
2.47.2
* [PATCH v8 11/18] unwind deferred: Use bitmask to determine which callbacks to call
2025-05-09 16:45 [PATCH v8 00/18] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
` (9 preceding siblings ...)
2025-05-09 16:45 ` [PATCH v8 10/18] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
@ 2025-05-09 16:45 ` Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 12/18] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
` (6 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2025-05-09 16:45 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim
From: Steven Rostedt <rostedt@goodmis.org>
In order to know which registered callbacks requested a stacktrace when
the task goes back to user space, add a bitmask covering all registered
tracers. The bitmask is the size of a long, which means that on a 32 bit
machine it can hold at most 32 registered tracers, and on 64 bit at most
64. This should not be an issue, as there should not be more than 10
(unless BPF can abuse this?).
When a tracer registers with unwind_deferred_init() it will get a bit
number assigned to it. When a tracer requests a stacktrace, it will have
its bit set within the task_struct. When the task returns back to user
space, it will call the callbacks for all the registered tracers where
their bits are set in the task's mask.
When a tracer is removed by unwind_deferred_cancel(), all current tasks
clear the associated bit, in case another tracer gets registered
immediately afterward and would otherwise have its callback called
unexpectedly.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/sched.h | 1 +
include/linux/unwind_deferred.h | 1 +
kernel/unwind/deferred.c | 46 ++++++++++++++++++++++++++++-----
3 files changed, 41 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a1e1c07cadfb..d3ee0c5405d6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1649,6 +1649,7 @@ struct task_struct {
#ifdef CONFIG_UNWIND_USER
struct unwind_task_info unwind_info;
+ unsigned long unwind_mask;
#endif
/* CPU-specific state of this task: */
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index a384eef719a3..1789c3624723 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -13,6 +13,7 @@ typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stackt
struct unwind_work {
struct list_head list;
unwind_callback_t func;
+ int bit;
};
#ifdef CONFIG_UNWIND_USER
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 238cd97079ec..7ae0bec5b36a 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -16,6 +16,7 @@
/* Guards adding to and reading the list of callbacks */
static DEFINE_MUTEX(callback_mutex);
static LIST_HEAD(callbacks);
+static unsigned long unwind_mask;
/*
* Read the task context timestamp, if this is the first caller then
@@ -106,6 +107,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
struct unwind_stacktrace trace;
struct unwind_work *work;
u64 timestamp;
+ struct task_struct *task = current;
if (WARN_ON_ONCE(!info->pending))
return;
@@ -133,7 +135,10 @@ static void unwind_deferred_task_work(struct callback_head *head)
guard(mutex)(&callback_mutex);
list_for_each_entry(work, &callbacks, list) {
- work->func(work, &trace, timestamp);
+ if (task->unwind_mask & (1UL << work->bit)) {
+ work->func(work, &trace, timestamp);
+ clear_bit(work->bit, &current->unwind_mask);
+ }
}
}
@@ -159,9 +164,12 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
inited_timestamp = true;
}
- if (info->pending)
+ if (current->unwind_mask & (1UL << work->bit))
return 1;
+ if (info->pending)
+ goto out;
+
ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
if (ret) {
/*
@@ -175,8 +183,8 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
}
info->pending = 1;
-
- return 0;
+out:
+ return test_and_set_bit(work->bit, &current->unwind_mask);
}
/**
@@ -223,14 +231,18 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
*timestamp = get_timestamp(info);
+ /* This is already queued */
+ if (current->unwind_mask & (1UL << work->bit))
+ return 1;
+
/* callback already pending? */
pending = READ_ONCE(info->pending);
if (pending)
- return 1;
+ goto out;
/* Claim the work unless an NMI just now swooped in to do so. */
if (!try_cmpxchg(&info->pending, &pending, 1))
- return 1;
+ goto out;
/* The work has been claimed, now schedule it. */
ret = task_work_add(current, &info->work, TWA_RESUME);
@@ -239,16 +251,27 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
return ret;
}
- return 0;
+ out:
+ return test_and_set_bit(work->bit, &current->unwind_mask);
}
void unwind_deferred_cancel(struct unwind_work *work)
{
+ struct task_struct *g, *t;
+
if (!work)
return;
guard(mutex)(&callback_mutex);
list_del(&work->list);
+
+ clear_bit(work->bit, &unwind_mask);
+
+ guard(rcu)();
+ /* Clear this bit from all threads */
+ for_each_process_thread(g, t) {
+ clear_bit(work->bit, &t->unwind_mask);
+ }
}
int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
@@ -256,6 +279,14 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
memset(work, 0, sizeof(*work));
guard(mutex)(&callback_mutex);
+
+ /* See if there's a bit in the mask available */
+ if (unwind_mask == ~0UL)
+ return -EBUSY;
+
+ work->bit = ffz(unwind_mask);
+ unwind_mask |= 1UL << work->bit;
+
list_add(&work->list, &callbacks);
work->func = func;
return 0;
@@ -267,6 +298,7 @@ void unwind_task_init(struct task_struct *task)
memset(info, 0, sizeof(*info));
init_task_work(&info->work, unwind_deferred_task_work);
+ task->unwind_mask = 0;
}
void unwind_task_free(struct task_struct *task)
--
2.47.2
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v8 12/18] unwind deferred: Use SRCU in unwind_deferred_task_work()
2025-05-09 16:45 [PATCH v8 00/18] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
` (10 preceding siblings ...)
2025-05-09 16:45 ` [PATCH v8 11/18] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
@ 2025-05-09 16:45 ` Steven Rostedt
2025-05-09 21:49 ` Andrii Nakryiko
2025-05-09 16:45 ` [PATCH v8 13/18] unwind: Clear unwind_mask on exit back to user space Steven Rostedt
` (5 subsequent siblings)
17 siblings, 1 reply; 27+ messages in thread
From: Steven Rostedt @ 2025-05-09 16:45 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim
From: Steven Rostedt <rostedt@goodmis.org>
Instead of using the callback_mutex to protect the linked list of callbacks
in unwind_deferred_task_work(), use SRCU. This function is called every
time a task with a requested stack trace to record exits back to user
space, which can happen for many tasks on several CPUs at the same time. A
mutex there is a bottleneck that can cause contention and hurt performance.
As the callbacks themselves are allowed to sleep, regular RCU can not be
used to protect the list. Instead use SRCU, as that still allows the
callbacks to sleep and the list can be read without needing to hold the
callback_mutex.
Link: https://lore.kernel.org/all/ca9bd83a-6c80-4ee0-a83c-224b9d60b755@efficios.com/
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/unwind/deferred.c | 33 +++++++++++++++++++++++++--------
1 file changed, 25 insertions(+), 8 deletions(-)
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 7ae0bec5b36a..5d6976ee648f 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -13,10 +13,11 @@
#define UNWIND_MAX_ENTRIES 512
-/* Guards adding to and reading the list of callbacks */
+/* Guards adding to or removing from the list of callbacks */
static DEFINE_MUTEX(callback_mutex);
static LIST_HEAD(callbacks);
static unsigned long unwind_mask;
+DEFINE_STATIC_SRCU(unwind_srcu);
/*
* Read the task context timestamp, if this is the first caller then
@@ -108,6 +109,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
struct unwind_work *work;
u64 timestamp;
struct task_struct *task = current;
+ int idx;
if (WARN_ON_ONCE(!info->pending))
return;
@@ -133,13 +135,15 @@ static void unwind_deferred_task_work(struct callback_head *head)
timestamp = info->timestamp;
- guard(mutex)(&callback_mutex);
- list_for_each_entry(work, &callbacks, list) {
+ idx = srcu_read_lock(&unwind_srcu);
+ list_for_each_entry_srcu(work, &callbacks, list,
+ srcu_read_lock_held(&unwind_srcu)) {
if (task->unwind_mask & (1UL << work->bit)) {
work->func(work, &trace, timestamp);
clear_bit(work->bit, &current->unwind_mask);
}
}
+ srcu_read_unlock(&unwind_srcu, idx);
}
static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
@@ -216,6 +220,7 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
{
struct unwind_task_info *info = &current->unwind_info;
int pending;
+ int bit;
int ret;
*timestamp = 0;
@@ -227,12 +232,17 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
if (in_nmi())
return unwind_deferred_request_nmi(work, timestamp);
+ /* Do not allow cancelled works to request again */
+ bit = READ_ONCE(work->bit);
+ if (WARN_ON_ONCE(bit < 0))
+ return -EINVAL;
+
guard(irqsave)();
*timestamp = get_timestamp(info);
/* This is already queued */
- if (current->unwind_mask & (1UL << work->bit))
+ if (current->unwind_mask & (1UL << bit))
return 1;
/* callback already pending? */
@@ -258,19 +268,26 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
void unwind_deferred_cancel(struct unwind_work *work)
{
struct task_struct *g, *t;
+ int bit;
if (!work)
return;
guard(mutex)(&callback_mutex);
- list_del(&work->list);
+ list_del_rcu(&work->list);
+ bit = work->bit;
+
+ /* Do not allow any more requests and prevent callbacks */
+ work->bit = -1;
+
+ clear_bit(bit, &unwind_mask);
- clear_bit(work->bit, &unwind_mask);
+ synchronize_srcu(&unwind_srcu);
guard(rcu)();
/* Clear this bit from all threads */
for_each_process_thread(g, t) {
- clear_bit(work->bit, &t->unwind_mask);
+ clear_bit(bit, &t->unwind_mask);
}
}
@@ -287,7 +304,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
work->bit = ffz(unwind_mask);
unwind_mask |= 1UL << work->bit;
- list_add(&work->list, &callbacks);
+ list_add_rcu(&work->list, &callbacks);
work->func = func;
return 0;
}
--
2.47.2
* [PATCH v8 13/18] unwind: Clear unwind_mask on exit back to user space
2025-05-09 16:45 [PATCH v8 00/18] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
` (11 preceding siblings ...)
2025-05-09 16:45 ` [PATCH v8 12/18] unwind deferred: Use SRCU in unwind_deferred_task_work() Steven Rostedt
@ 2025-05-09 16:45 ` Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 14/18] perf: Remove get_perf_callchain() init_nr argument Steven Rostedt
` (4 subsequent siblings)
17 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2025-05-09 16:45 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim
From: Steven Rostedt <rostedt@goodmis.org>
When testing the deferred unwinder by attaching deferred user space
stacktraces to events, a livelock happened. This was when the deferred
unwinding was added to the irqs_disabled event, which triggers after the
task_work callbacks are called and before the task goes back to user
space.
The event callback would be registered while irqs were disabled, the
task_work would trigger, call the callback for this work, and clear the
work's bit. Then, before getting back to user space, irqs would be disabled
again, the event would trigger again, and a new task_work would be
registered. This caused an infinite loop and the system hung.
To prevent this, clear the bits at the very last moment before going back
to user space and when instrumentation is disabled. That is in
unwind_exit_to_user_mode().
Move the pending bit from a value on the task_struct to the most
significant bit of the unwind_mask (saves space on the task_struct). This
will allow modifying the pending bit along with the work bits atomically.
Instead of clearing a work's bit after its callback is called, it is
delayed until exit. If the work is requested again, the task_work is not
queued again and the work is notified that its callback has already been
called (via the UNWIND_ALREADY_EXECUTED return value).
The pending bit is cleared before calling the callback functions but the
current work bits remain. If one of the called works registers again, it
will not trigger a task_work if its bit is still present in the task's
unwind_mask.
If a new work registers, then it will set both the pending bit and its own
bit but clear the other work bits so that their callbacks do not get
called again.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/unwind_deferred.h | 23 ++++++-
include/linux/unwind_deferred_types.h | 1 -
kernel/unwind/deferred.c | 96 +++++++++++++++++++--------
3 files changed, 90 insertions(+), 30 deletions(-)
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 1789c3624723..b3c8703fcc22 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -18,6 +18,14 @@ struct unwind_work {
#ifdef CONFIG_UNWIND_USER
+#define UNWIND_PENDING_BIT (BITS_PER_LONG - 1)
+#define UNWIND_PENDING (1UL << UNWIND_PENDING_BIT)
+
+enum {
+ UNWIND_ALREADY_PENDING = 1,
+ UNWIND_ALREADY_EXECUTED = 2,
+};
+
void unwind_task_init(struct task_struct *task);
void unwind_task_free(struct task_struct *task);
@@ -29,7 +37,20 @@ void unwind_deferred_cancel(struct unwind_work *work);
static __always_inline void unwind_exit_to_user_mode(void)
{
- if (unlikely(current->unwind_info.cache))
+ unsigned long bits;
+
+ /* Was there any unwinding? */
+ if (likely(!current->unwind_mask))
+ return;
+
+ bits = current->unwind_mask;
+ do {
+ /* Is a task_work going to run again before going back to user space? */
+ if (bits & UNWIND_PENDING)
+ return;
+ } while (!try_cmpxchg(&current->unwind_mask, &bits, 0UL));
+
+ if (likely(current->unwind_info.cache))
current->unwind_info.cache->nr_entries = 0;
current->unwind_info.timestamp = 0;
}
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index ae27a02234b8..28811a9d4262 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -12,7 +12,6 @@ struct unwind_task_info {
struct callback_head work;
u64 timestamp;
u64 nmi_timestamp;
- int pending;
};
#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 5d6976ee648f..b0289a695b92 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -19,6 +19,11 @@ static LIST_HEAD(callbacks);
static unsigned long unwind_mask;
DEFINE_STATIC_SRCU(unwind_srcu);
+static inline bool unwind_pending(struct task_struct *task)
+{
+ return test_bit(UNWIND_PENDING_BIT, &task->unwind_mask);
+}
+
/*
* Read the task context timestamp, if this is the first caller then
* it will set the timestamp.
@@ -107,15 +112,18 @@ static void unwind_deferred_task_work(struct callback_head *head)
struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
struct unwind_stacktrace trace;
struct unwind_work *work;
+ unsigned long bits;
u64 timestamp;
struct task_struct *task = current;
int idx;
- if (WARN_ON_ONCE(!info->pending))
+ if (WARN_ON_ONCE(!unwind_pending(task)))
return;
- /* Allow work to come in again */
- WRITE_ONCE(info->pending, 0);
+ /* Clear pending bit but make sure to have the current bits */
+ bits = READ_ONCE(task->unwind_mask);
+ while (!try_cmpxchg(&task->unwind_mask, &bits, bits & ~UNWIND_PENDING))
+ ;
/*
* From here on out, the callback must always be called, even if it's
@@ -138,10 +146,8 @@ static void unwind_deferred_task_work(struct callback_head *head)
idx = srcu_read_lock(&unwind_srcu);
list_for_each_entry_srcu(work, &callbacks, list,
srcu_read_lock_held(&unwind_srcu)) {
- if (task->unwind_mask & (1UL << work->bit)) {
+ if (bits & (1UL << work->bit))
work->func(work, &trace, timestamp);
- clear_bit(work->bit, &current->unwind_mask);
- }
}
srcu_read_unlock(&unwind_srcu, idx);
}
@@ -168,10 +174,13 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
inited_timestamp = true;
}
- if (current->unwind_mask & (1UL << work->bit))
- return 1;
+ /* Is this already queued */
+ if (current->unwind_mask & (1UL << work->bit)) {
+ return unwind_pending(current) ? UNWIND_ALREADY_PENDING :
+ UNWIND_ALREADY_EXECUTED;
+ }
- if (info->pending)
+ if (unwind_pending(current))
goto out;
ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
@@ -186,9 +195,17 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
return ret;
}
- info->pending = 1;
+ /*
+ * This is the first to set the PENDING_BIT, clear all others
+ * as any other bit has already had their callback called, and
+ * those callbacks should not be called again because of this
+ * new callback. If they request another callback, then they
+ * will get a new one.
+ */
+ current->unwind_mask = UNWIND_PENDING;
out:
- return test_and_set_bit(work->bit, &current->unwind_mask);
+ return test_and_set_bit(work->bit, &current->unwind_mask) ?
+ UNWIND_ALREADY_PENDING : 0;
}
/**
@@ -211,15 +228,17 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
* it has already been previously called for the same entry context, it will be
* called again with the same stack trace and timestamp.
*
- * Return: 1 if the the callback was already queued.
- * 0 if the callback successfully was queued.
+ * Return: 0 if the callback successfully was queued.
+ * UNWIND_ALREADY_PENDING if the callback was already queued.
+ * UNWIND_ALREADY_EXECUTED if the callback was already called
+ * (and will not be called again)
* Negative if there's an error.
* @timestamp holds the timestamp of the first request by any user
*/
int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
{
struct unwind_task_info *info = &current->unwind_info;
- int pending;
+ unsigned long old, bits;
int bit;
int ret;
@@ -241,28 +260,49 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
*timestamp = get_timestamp(info);
- /* This is already queued */
- if (current->unwind_mask & (1UL << bit))
- return 1;
+ old = READ_ONCE(current->unwind_mask);
+
+ /* Is this already queued */
+ if (old & (1UL << bit)) {
+ /*
+ * If pending is not set, it means this work's callback
+ * was already called.
+ */
+ return old & UNWIND_PENDING ? UNWIND_ALREADY_PENDING :
+ UNWIND_ALREADY_EXECUTED;
+ }
- /* callback already pending? */
- pending = READ_ONCE(info->pending);
- if (pending)
+ if (unwind_pending(current))
goto out;
- /* Claim the work unless an NMI just now swooped in to do so. */
- if (!try_cmpxchg(&info->pending, &pending, 1))
+ /*
+ * This is the first to enable another task_work for this task since
+ * the task entered the kernel, or had already called the callbacks.
+ * Set only the bit for this work and clear all others as they have
+ * already had their callbacks called, and do not need to call them
+ * again because of this work.
+ */
+ bits = UNWIND_PENDING | (1UL << bit);
+
+ /*
+ * If the cmpxchg() fails, it means that an NMI came in and set
+ * the pending bit as well as cleared the other bits. Just
+ * jump to setting the bit for this work.
+ */
+ if (!try_cmpxchg(&current->unwind_mask, &old, bits))
goto out;
/* The work has been claimed, now schedule it. */
ret = task_work_add(current, &info->work, TWA_RESUME);
- if (WARN_ON_ONCE(ret)) {
- WRITE_ONCE(info->pending, 0);
- return ret;
- }
+
+ if (WARN_ON_ONCE(ret))
+ WRITE_ONCE(current->unwind_mask, 0);
+
+ return ret;
out:
- return test_and_set_bit(work->bit, &current->unwind_mask);
+ return test_and_set_bit(work->bit, &current->unwind_mask) ?
+ UNWIND_ALREADY_PENDING : 0;
}
void unwind_deferred_cancel(struct unwind_work *work)
@@ -298,7 +338,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
guard(mutex)(&callback_mutex);
/* See if there's a bit in the mask available */
- if (unwind_mask == ~0UL)
+ if (unwind_mask == ~(UNWIND_PENDING))
return -EBUSY;
work->bit = ffz(unwind_mask);
--
2.47.2
* [PATCH v8 14/18] perf: Remove get_perf_callchain() init_nr argument
2025-05-09 16:45 [PATCH v8 00/18] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
` (12 preceding siblings ...)
2025-05-09 16:45 ` [PATCH v8 13/18] unwind: Clear unwind_mask on exit back to user space Steven Rostedt
@ 2025-05-09 16:45 ` Steven Rostedt
2025-05-09 17:11 ` Alexei Starovoitov
2025-05-09 16:45 ` [PATCH v8 15/18] perf: Have get_perf_callchain() return NULL if crosstask and user are set Steven Rostedt
` (3 subsequent siblings)
17 siblings, 1 reply; 27+ messages in thread
From: Steven Rostedt @ 2025-05-09 16:45 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Namhyung Kim
From: Josh Poimboeuf <jpoimboe@kernel.org>
The 'init_nr' argument has double duty: it's used to initialize both the
number of contexts and the number of stack entries. That's confusing
and the callers always pass zero anyway. Hard code the zero.
Acked-by: Namhyung Kim <Namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/perf_event.h | 2 +-
kernel/bpf/stackmap.c | 4 ++--
kernel/events/callchain.c | 12 ++++++------
kernel/events/core.c | 2 +-
4 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 947ad12dfdbe..3cc0b0ea0afa 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1651,7 +1651,7 @@ DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
extern struct perf_callchain_entry *
-get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
+get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
u32 max_stack, bool crosstask, bool add_mark);
extern int get_callchain_buffers(int max_stack);
extern void put_callchain_buffers(void);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 3615c06b7dfa..ec3a57a5fba1 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -314,7 +314,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
if (max_depth > sysctl_perf_event_max_stack)
max_depth = sysctl_perf_event_max_stack;
- trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
+ trace = get_perf_callchain(regs, kernel, user, max_depth,
false, false);
if (unlikely(!trace))
@@ -451,7 +451,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
else if (kernel && task)
trace = get_callchain_entry_for_task(task, max_depth);
else
- trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
+ trace = get_perf_callchain(regs, kernel, user, max_depth,
crosstask, false);
if (unlikely(!trace) || trace->nr < skip) {
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 6c83ad674d01..b0f5bd228cd8 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -217,7 +217,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
}
struct perf_callchain_entry *
-get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
+get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
u32 max_stack, bool crosstask, bool add_mark)
{
struct perf_callchain_entry *entry;
@@ -228,11 +228,11 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
if (!entry)
return NULL;
- ctx.entry = entry;
- ctx.max_stack = max_stack;
- ctx.nr = entry->nr = init_nr;
- ctx.contexts = 0;
- ctx.contexts_maxed = false;
+ ctx.entry = entry;
+ ctx.max_stack = max_stack;
+ ctx.nr = entry->nr = 0;
+ ctx.contexts = 0;
+ ctx.contexts_maxed = false;
if (kernel && !user_mode(regs)) {
if (add_mark)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 05136e835042..0bac8cae08a8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8110,7 +8110,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
if (!kernel && !user)
return &__empty_callchain;
- callchain = get_perf_callchain(regs, 0, kernel, user,
+ callchain = get_perf_callchain(regs, kernel, user,
max_stack, crosstask, true);
return callchain ?: &__empty_callchain;
}
--
2.47.2
* [PATCH v8 15/18] perf: Have get_perf_callchain() return NULL if crosstask and user are set
2025-05-09 16:45 [PATCH v8 00/18] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
` (13 preceding siblings ...)
2025-05-09 16:45 ` [PATCH v8 14/18] perf: Remove get_perf_callchain() init_nr argument Steven Rostedt
@ 2025-05-09 16:45 ` Steven Rostedt
2025-05-09 21:53 ` Andrii Nakryiko
2025-05-09 16:45 ` [PATCH v8 16/18] perf: Use current->flags & PF_KTHREAD instead of current->mm == NULL Steven Rostedt
` (2 subsequent siblings)
17 siblings, 1 reply; 27+ messages in thread
From: Steven Rostedt @ 2025-05-09 16:45 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim
From: Josh Poimboeuf <jpoimboe@kernel.org>
get_perf_callchain() doesn't support cross-task unwinding for user space
stacks, so have it return NULL if both the crosstask and user arguments
are set.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/events/callchain.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index b0f5bd228cd8..abf258913ab6 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -224,6 +224,10 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
struct perf_callchain_entry_ctx ctx;
int rctx, start_entry_idx;
+ /* crosstask is not supported for user stacks */
+ if (crosstask && user)
+ return NULL;
+
entry = get_callchain_entry(&rctx);
if (!entry)
return NULL;
@@ -249,9 +253,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
}
if (regs) {
- if (crosstask)
- goto exit_put;
-
if (add_mark)
perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
@@ -261,7 +262,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
}
}
-exit_put:
put_callchain_entry(rctx);
return entry;
--
2.47.2
* [PATCH v8 16/18] perf: Use current->flags & PF_KTHREAD instead of current->mm == NULL
2025-05-09 16:45 [PATCH v8 00/18] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
` (14 preceding siblings ...)
2025-05-09 16:45 ` [PATCH v8 15/18] perf: Have get_perf_callchain() return NULL if crosstask and user are set Steven Rostedt
@ 2025-05-09 16:45 ` Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 17/18] perf: Simplify get_perf_callchain() user logic Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 18/18] perf: Skip user unwind if the task is a kernel thread Steven Rostedt
17 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2025-05-09 16:45 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim
From: Steven Rostedt <rostedt@goodmis.org>
To determine if a task is a kernel thread or not, it is more reliable to
use (current->flags & PF_KTHREAD) than to rely on current->mm being NULL.
That is because some kernel threads (such as io_uring helpers) may have an mm set.
Link: https://lore.kernel.org/linux-trace-kernel/20250424163607.GE18306@noisy.programming.kicks-ass.net/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/events/callchain.c | 6 +++---
kernel/events/core.c | 2 +-
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index abf258913ab6..cda145dc11bd 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -246,10 +246,10 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
if (user) {
if (!user_mode(regs)) {
- if (current->mm)
- regs = task_pt_regs(current);
- else
+ if (current->flags & PF_KTHREAD)
regs = NULL;
+ else
+ regs = task_pt_regs(current);
}
if (regs) {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0bac8cae08a8..7c7d5a27c568 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7989,7 +7989,7 @@ static u64 perf_virt_to_phys(u64 virt)
* Try IRQ-safe get_user_page_fast_only first.
* If failed, leave phys_addr as 0.
*/
- if (current->mm != NULL) {
+ if (!(current->flags & PF_KTHREAD)) {
struct page *p;
pagefault_disable();
--
2.47.2
* [PATCH v8 17/18] perf: Simplify get_perf_callchain() user logic
2025-05-09 16:45 [PATCH v8 00/18] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
` (15 preceding siblings ...)
2025-05-09 16:45 ` [PATCH v8 16/18] perf: Use current->flags & PF_KTHREAD instead of current->mm == NULL Steven Rostedt
@ 2025-05-09 16:45 ` Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 18/18] perf: Skip user unwind if the task is a kernel thread Steven Rostedt
17 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2025-05-09 16:45 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim
From: Josh Poimboeuf <jpoimboe@kernel.org>
Simplify the get_perf_callchain() user logic a bit. task_pt_regs()
should never be NULL.
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/events/callchain.c | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index cda145dc11bd..2798c0c9f782 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -247,21 +247,19 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
if (user) {
if (!user_mode(regs)) {
if (current->flags & PF_KTHREAD)
- regs = NULL;
- else
- regs = task_pt_regs(current);
+ goto exit_put;
+ regs = task_pt_regs(current);
}
- if (regs) {
- if (add_mark)
- perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
+ if (add_mark)
+ perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
- start_entry_idx = entry->nr;
- perf_callchain_user(&ctx, regs);
- fixup_uretprobe_trampoline_entries(entry, start_entry_idx);
- }
+ start_entry_idx = entry->nr;
+ perf_callchain_user(&ctx, regs);
+ fixup_uretprobe_trampoline_entries(entry, start_entry_idx);
}
+exit_put:
put_callchain_entry(rctx);
return entry;
--
2.47.2
* [PATCH v8 18/18] perf: Skip user unwind if the task is a kernel thread
2025-05-09 16:45 [PATCH v8 00/18] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
` (16 preceding siblings ...)
2025-05-09 16:45 ` [PATCH v8 17/18] perf: Simplify get_perf_callchain() user logic Steven Rostedt
@ 2025-05-09 16:45 ` Steven Rostedt
17 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2025-05-09 16:45 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim
From: Josh Poimboeuf <jpoimboe@kernel.org>
If the task is not a user thread, there's no user stack to unwind.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/events/core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7c7d5a27c568..02e52df7a02e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8101,7 +8101,8 @@ struct perf_callchain_entry *
perf_callchain(struct perf_event *event, struct pt_regs *regs)
{
bool kernel = !event->attr.exclude_callchain_kernel;
- bool user = !event->attr.exclude_callchain_user;
+ bool user = !event->attr.exclude_callchain_user &&
+ !(current->flags & PF_KTHREAD);
/* Disallow cross-task user callchains. */
bool crosstask = event->ctx->task && event->ctx->task != current;
const u32 max_stack = event->attr.sample_max_stack;
--
2.47.2
* Re: [PATCH v8 14/18] perf: Remove get_perf_callchain() init_nr argument
2025-05-09 16:45 ` [PATCH v8 14/18] perf: Remove get_perf_callchain() init_nr argument Steven Rostedt
@ 2025-05-09 17:11 ` Alexei Starovoitov
0 siblings, 0 replies; 27+ messages in thread
From: Alexei Starovoitov @ 2025-05-09 17:11 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, linux-trace-kernel, bpf, X86 ML, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim
On Fri, May 9, 2025 at 10:07 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> From: Josh Poimboeuf <jpoimboe@kernel.org>
>
> The 'init_nr' argument has double duty: it's used to initialize both the
> number of contexts and the number of stack entries. That's confusing
> and the callers always pass zero anyway. Hard code the zero.
>
> Acked-by: Namhyung Kim <Namhyung@kernel.org>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> include/linux/perf_event.h | 2 +-
> kernel/bpf/stackmap.c | 4 ++--
Acked-by: Alexei Starovoitov <ast@kernel.org>
* Re: [PATCH v8 12/18] unwind deferred: Use SRCU unwind_deferred_task_work()
2025-05-09 16:45 ` [PATCH v8 12/18] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
@ 2025-05-09 21:49 ` Andrii Nakryiko
2025-05-10 13:41 ` Steven Rostedt
0 siblings, 1 reply; 27+ messages in thread
From: Andrii Nakryiko @ 2025-05-09 21:49 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim
On Fri, May 9, 2025 at 9:54 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> From: Steven Rostedt <rostedt@goodmis.org>
>
> Instead of using the callback_mutex to protect the linked list of callbacks
> in unwind_deferred_task_work(), use SRCU instead. This function is called
> every time a task that has a requested stack trace to record exits back to
> user space. This can happen for many tasks on several CPUs at the same
> time, so a mutex is a bottleneck and can cause contention that slows
> things down.
>
> As the callbacks themselves are allowed to sleep, regular RCU cannot be
> used to protect the list. Instead, use SRCU, as that still allows the
> callbacks to sleep while the list can be read without needing to hold the
> callback_mutex.
>
> Link: https://lore.kernel.org/all/ca9bd83a-6c80-4ee0-a83c-224b9d60b755@efficios.com/
>
> Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> kernel/unwind/deferred.c | 33 +++++++++++++++++++++++++--------
> 1 file changed, 25 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
> index 7ae0bec5b36a..5d6976ee648f 100644
> --- a/kernel/unwind/deferred.c
> +++ b/kernel/unwind/deferred.c
> @@ -13,10 +13,11 @@
>
> #define UNWIND_MAX_ENTRIES 512
>
> -/* Guards adding to and reading the list of callbacks */
> +/* Guards adding to or removing from the list of callbacks */
> static DEFINE_MUTEX(callback_mutex);
> static LIST_HEAD(callbacks);
> static unsigned long unwind_mask;
> +DEFINE_STATIC_SRCU(unwind_srcu);
>
> /*
> * Read the task context timestamp, if this is the first caller then
> @@ -108,6 +109,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
> struct unwind_work *work;
> u64 timestamp;
> struct task_struct *task = current;
> + int idx;
>
> if (WARN_ON_ONCE(!info->pending))
> return;
> @@ -133,13 +135,15 @@ static void unwind_deferred_task_work(struct callback_head *head)
>
> timestamp = info->timestamp;
>
> - guard(mutex)(&callback_mutex);
> - list_for_each_entry(work, &callbacks, list) {
> + idx = srcu_read_lock(&unwind_srcu);
nit: you could have used guard(srcu)(&unwind_srcu) ?
> + list_for_each_entry_srcu(work, &callbacks, list,
> + srcu_read_lock_held(&unwind_srcu)) {
> if (task->unwind_mask & (1UL << work->bit)) {
> work->func(work, &trace, timestamp);
> clear_bit(work->bit, &current->unwind_mask);
> }
> }
> + srcu_read_unlock(&unwind_srcu, idx);
> }
>
> static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
[...]
* Re: [PATCH v8 15/18] perf: Have get_perf_callchain() return NULL if crosstask and user are set
2025-05-09 16:45 ` [PATCH v8 15/18] perf: Have get_perf_callchain() return NULL if crosstask and user are set Steven Rostedt
@ 2025-05-09 21:53 ` Andrii Nakryiko
2025-05-10 13:46 ` Steven Rostedt
2025-05-10 17:59 ` Josh Poimboeuf
0 siblings, 2 replies; 27+ messages in thread
From: Andrii Nakryiko @ 2025-05-09 21:53 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim
On Fri, May 9, 2025 at 9:52 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> From: Josh Poimboeuf <jpoimboe@kernel.org>
>
> get_perf_callchain() doesn't support cross-task unwinding for user space
> stacks, so have it return NULL if both the crosstask and user arguments are
> set.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> kernel/events/callchain.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> index b0f5bd228cd8..abf258913ab6 100644
> --- a/kernel/events/callchain.c
> +++ b/kernel/events/callchain.c
> @@ -224,6 +224,10 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> struct perf_callchain_entry_ctx ctx;
> int rctx, start_entry_idx;
>
> + /* crosstask is not supported for user stacks */
> + if (crosstask && user)
> + return NULL;
I think get_perf_callchain() supports requesting both user and kernel
stack traces, and if it's crosstask, you can still get kernel (but not
user) stack, if I'm reading the code correctly.
So by just returning NULL early you will change this behavior, no?
> +
> entry = get_callchain_entry(&rctx);
> if (!entry)
> return NULL;
> @@ -249,9 +253,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> }
>
> if (regs) {
> - if (crosstask)
> - goto exit_put;
> -
> if (add_mark)
> perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
>
> @@ -261,7 +262,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> }
> }
>
> -exit_put:
> put_callchain_entry(rctx);
>
> return entry;
> --
> 2.47.2
>
>
>
* Re: [PATCH v8 12/18] unwind deferred: Use SRCU unwind_deferred_task_work()
2025-05-09 21:49 ` Andrii Nakryiko
@ 2025-05-10 13:41 ` Steven Rostedt
2025-05-12 16:17 ` Andrii Nakryiko
0 siblings, 1 reply; 27+ messages in thread
From: Steven Rostedt @ 2025-05-10 13:41 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim
On Fri, 9 May 2025 14:49:37 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> > @@ -133,13 +135,15 @@ static void unwind_deferred_task_work(struct callback_head *head)
> >
> > timestamp = info->timestamp;
> >
> > - guard(mutex)(&callback_mutex);
> > - list_for_each_entry(work, &callbacks, list) {
> > + idx = srcu_read_lock(&unwind_srcu);
>
> nit: you could have used guard(srcu)(&unwind_srcu) ?
Then it would be scoped_guard(), as it is only needed for the list. I
prefer using guard() when most of the function is being protected. Here
it's just the list and nothing else.

One issue I have with guard() is that it tends to "leak". That is, if you
use it to protect only one thing and then add more code after what you are
protecting, the guard ends up protecting more than it needs to.

If anything, I would make the loop into its own function with the guard();
then it is more obvious. But for now, I think it's fine as is, unless
others prefer the switch.
-- Steve
>
> > + list_for_each_entry_srcu(work, &callbacks, list,
> > + srcu_read_lock_held(&unwind_srcu)) {
> > if (task->unwind_mask & (1UL << work->bit)) {
> > work->func(work, &trace, timestamp);
> > clear_bit(work->bit, &current->unwind_mask);
> > }
> > }
> > + srcu_read_unlock(&unwind_srcu, idx);
> > }
> >
* Re: [PATCH v8 15/18] perf: Have get_perf_callchain() return NULL if crosstask and user are set
2025-05-09 21:53 ` Andrii Nakryiko
@ 2025-05-10 13:46 ` Steven Rostedt
2025-05-10 17:59 ` Josh Poimboeuf
1 sibling, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2025-05-10 13:46 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim
On Fri, 9 May 2025 14:53:38 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> > @@ -224,6 +224,10 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> > struct perf_callchain_entry_ctx ctx;
> > int rctx, start_entry_idx;
> >
> > + /* crosstask is not supported for user stacks */
> > + if (crosstask && user)
> > + return NULL;
>
> I think get_perf_callchain() supports requesting both user and kernel
> stack traces, and if it's crosstask, you can still get kernel (but not
> user) stack, if I'm reading the code correctly.
>
> So by just returning NULL early you will change this behavior, no?
Basically you are saying that one could ask for a kernel/user stack trace
with crosstask enabled and still just get the kernel trace?
If this is the case, then I think it may be best to remove patches 15-18
from this series and work on them in the "perf specific" series, as this
doesn't have anything to do with the unwind infrastructure itself.
Actually, patch 14 doesn't either, so I may move that one too (and keep the
acks to it).
Thanks,
-- Steve
>
> > +
> > entry = get_callchain_entry(&rctx);
> > if (!entry)
> > return NULL;
* Re: [PATCH v8 15/18] perf: Have get_perf_callchain() return NULL if crosstask and user are set
2025-05-09 21:53 ` Andrii Nakryiko
2025-05-10 13:46 ` Steven Rostedt
@ 2025-05-10 17:59 ` Josh Poimboeuf
2025-05-12 22:27 ` Andrii Nakryiko
1 sibling, 1 reply; 27+ messages in thread
From: Josh Poimboeuf @ 2025-05-10 17:59 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim
On Fri, May 09, 2025 at 02:53:38PM -0700, Andrii Nakryiko wrote:
> On Fri, May 9, 2025 at 9:52 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > From: Josh Poimboeuf <jpoimboe@kernel.org>
> >
> > get_perf_callchain() doesn't support cross-task unwinding for user space
> > stacks, so have it return NULL if both the crosstask and user arguments are
> > set.
> >
> > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> > Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> > ---
> > kernel/events/callchain.c | 8 ++++----
> > 1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> > index b0f5bd228cd8..abf258913ab6 100644
> > --- a/kernel/events/callchain.c
> > +++ b/kernel/events/callchain.c
> > @@ -224,6 +224,10 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> > struct perf_callchain_entry_ctx ctx;
> > int rctx, start_entry_idx;
> >
> > + /* crosstask is not supported for user stacks */
> > + if (crosstask && user)
> > + return NULL;
>
> I think get_perf_callchain() supports requesting both user and kernel
> stack traces, and if it's crosstask, you can still get kernel (but not
> user) stack, if I'm reading the code correctly.
>
> So by just returning NULL early you will change this behavior, no?
Yeah, that does seem like a bug.
Though crosstask in general is dubious, even for kernel stacks.
If the task is running while you're unwinding it, hilarity ensues.
There are guardrails in place, so it should be safe; it may just produce
nonsense. But maybe the callers don't need perfection.
But also, it would seem to be a bad idea to allow one task to spy on
what another task's kernelspace is doing. Does unpriv BPF allow that?
Though it seems even 'cat /proc/self/stack' is a privileged operation
these days, does unpriv BPF allow that as well?
--
Josh
* Re: [PATCH v8 12/18] unwind deferred: Use SRCU unwind_deferred_task_work()
2025-05-10 13:41 ` Steven Rostedt
@ 2025-05-12 16:17 ` Andrii Nakryiko
0 siblings, 0 replies; 27+ messages in thread
From: Andrii Nakryiko @ 2025-05-12 16:17 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim
On Sat, May 10, 2025 at 6:41 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 9 May 2025 14:49:37 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> > > @@ -133,13 +135,15 @@ static void unwind_deferred_task_work(struct callback_head *head)
> > >
> > > timestamp = info->timestamp;
> > >
> > > - guard(mutex)(&callback_mutex);
> > > - list_for_each_entry(work, &callbacks, list) {
> > > + idx = srcu_read_lock(&unwind_srcu);
> >
> > nit: you could have used guard(srcu)(&unwind_srcu) ?
>
> Then it would be a scope_guard() as it is only needed for the list. I
> prefer using guard() when it is most of the function that is being
> protected. Here it's just the list and nothing else.
>
> One issue I have with guard() is that it tends to "leak". That is, if you
> use it to protect only one thing and then add more after what you are
> protecting, then the guard ends up protecting more than it needs to.
>
Yep, makes sense. I just noticed the use of guard() for a mutex before,
so assumed the same could be done for SRCU, but what you are saying
makes sense, no problem.
> If anything, I would make the loop into its own function with the guard()
> then it is more obvious. But for now, I think it's fine as is, unless
> others prefer the switch.
>
> -- Steve
>
> >
> > > + list_for_each_entry_srcu(work, &callbacks, list,
> > > + srcu_read_lock_held(&unwind_srcu)) {
> > > if (task->unwind_mask & (1UL << work->bit)) {
> > > work->func(work, &trace, timestamp);
> > > clear_bit(work->bit, &current->unwind_mask);
> > > }
> > > }
> > > + srcu_read_unlock(&unwind_srcu, idx);
> > > }
> > >
* Re: [PATCH v8 15/18] perf: Have get_perf_callchain() return NULL if crosstask and user are set
2025-05-10 17:59 ` Josh Poimboeuf
@ 2025-05-12 22:27 ` Andrii Nakryiko
0 siblings, 0 replies; 27+ messages in thread
From: Andrii Nakryiko @ 2025-05-12 22:27 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim
On Sat, May 10, 2025 at 10:59 AM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Fri, May 09, 2025 at 02:53:38PM -0700, Andrii Nakryiko wrote:
> > On Fri, May 9, 2025 at 9:52 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > From: Josh Poimboeuf <jpoimboe@kernel.org>
> > >
> > > get_perf_callchain() doesn't support cross-task unwinding for user space
> > > stacks, so have it return NULL if both the crosstask and user arguments are
> > > set.
> > >
> > > Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> > > Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> > > ---
> > > kernel/events/callchain.c | 8 ++++----
> > > 1 file changed, 4 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> > > index b0f5bd228cd8..abf258913ab6 100644
> > > --- a/kernel/events/callchain.c
> > > +++ b/kernel/events/callchain.c
> > > @@ -224,6 +224,10 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> > > struct perf_callchain_entry_ctx ctx;
> > > int rctx, start_entry_idx;
> > >
> > > + /* crosstask is not supported for user stacks */
> > > + if (crosstask && user)
> > > + return NULL;
> >
> > I think get_perf_callchain() supports requesting both user and kernel
> > stack traces, and if it's crosstask, you can still get kernel (but not
> > user) stack, if I'm reading the code correctly.
> >
> > So by just returning NULL early you will change this behavior, no?
>
> Yeah, that does seem like a bug.
>
> Though crosstask in general is dubious, even for kernel stacks.
>
> If the task is running while you're unwinding it, hilarity ensues.
> There are guardrails in place, so it should be safe, it may just produce
> nonsense. But maybe the callers don't need perfection.
>
> But also, it would seem to be a bad idea to allow one task to spy on
> what another task's kernelspace is doing. Does unpriv BPF allow that?
No, you need CAP_PERFMON to be able to capture stack traces, so we are
fine from that POV.
But also note that BPF itself doesn't allow requesting both user and
kernel stack traces in one go. So the issue I pointed out is not
really related to how BPF is using get_perf_callchain().
>
> Though it seems even 'cat /proc/self/stack' is a privileged operation
> these days, does unpriv BPF allow that as well?
>
> --
> Josh
end of thread, other threads: [~2025-05-12 22:27 UTC | newest]
Thread overview: 27+ messages
2025-05-09 16:45 [PATCH v8 00/18] unwind_user: perf: x86: Deferred unwinding infrastructure Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 01/18] unwind_user: Add user space unwinding API Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 02/18] unwind_user: Add frame pointer support Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 03/18] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 04/18] perf/x86: Rename and move get_segment_base() and make it global Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 05/18] unwind_user: Add compat mode frame pointer support Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 06/18] unwind_user/x86: Enable compat mode frame pointer unwinding on x86 Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 07/18] unwind_user/deferred: Add unwind_deferred_trace() Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 08/18] unwind_user/deferred: Add unwind cache Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 09/18] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 10/18] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 11/18] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 12/18] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
2025-05-09 21:49 ` Andrii Nakryiko
2025-05-10 13:41 ` Steven Rostedt
2025-05-12 16:17 ` Andrii Nakryiko
2025-05-09 16:45 ` [PATCH v8 13/18] unwind: Clear unwind_mask on exit back to user space Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 14/18] perf: Remove get_perf_callchain() init_nr argument Steven Rostedt
2025-05-09 17:11 ` Alexei Starovoitov
2025-05-09 16:45 ` [PATCH v8 15/18] perf: Have get_perf_callchain() return NULL if crosstask and user are set Steven Rostedt
2025-05-09 21:53 ` Andrii Nakryiko
2025-05-10 13:46 ` Steven Rostedt
2025-05-10 17:59 ` Josh Poimboeuf
2025-05-12 22:27 ` Andrii Nakryiko
2025-05-09 16:45 ` [PATCH v8 16/18] perf: Use current->flags & PF_KTHREAD instead of current->mm == NULL Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 17/18] perf: Simplify get_perf_callchain() user logic Steven Rostedt
2025-05-09 16:45 ` [PATCH v8 18/18] perf: Skip user unwind if the task is a kernel thread Steven Rostedt