* [RESEND][PATCH v15 1/4] unwind deferred: Add unwind_user_get_cookie() API
2025-09-08 17:14 [RESEND][PATCH v15 0/4] perf: Support the deferred unwinding infrastructure Steven Rostedt
@ 2025-09-08 17:14 ` Steven Rostedt
2025-09-08 17:14 ` [RESEND][PATCH v15 2/4] perf: Support deferred user callchains Steven Rostedt
` (4 subsequent siblings)
5 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2025-09-08 17:14 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Florian Weimer, Sam James, Kees Cook,
Carlos O'Donell
From: Steven Rostedt <rostedt@goodmis.org>
Add an unwind_user_get_cookie() API that allows a subsystem to retrieve
the cookie of the current user context. This can be used by perf to
attach a cookie in its per task deferred unwinding code, which does not
go through the deferred unwind request logic.
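As a rough sketch of the intended use (the record structure and helper
below are illustrative only; unwind_user_get_cookie() is the only piece
that comes from this patch):
#include <linux/types.h>
#include <linux/unwind_deferred.h>
/* Illustrative only: a made-up record type for a subsystem's own use */
struct my_defer_record {
	u64	cookie;		/* identifies the user context of this request */
	u64	data;
};
static void my_subsystem_mark_request(struct my_defer_record *rec, u64 data)
{
	/*
	 * The cookie is non-zero and stays the same for the whole time the
	 * task is in the kernel for this entry from user space. A record
	 * emitted later, when the task returns to user space, can carry
	 * the same cookie so the two can be matched in post-processing.
	 */
	rec->cookie = unwind_user_get_cookie();
	rec->data = data;
}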
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/unwind_deferred.h | 5 +++++
kernel/unwind/deferred.c | 21 +++++++++++++++++++++
2 files changed, 26 insertions(+)
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 26122d00708a..ce507495972c 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -41,6 +41,8 @@ void unwind_deferred_cancel(struct unwind_work *work);
void unwind_deferred_task_exit(struct task_struct *task);
+u64 unwind_user_get_cookie(void);
+
static __always_inline void unwind_reset_info(void)
{
struct unwind_task_info *info = &current->unwind_info;
@@ -76,6 +78,9 @@ static inline void unwind_deferred_cancel(struct unwind_work *work) {}
static inline void unwind_deferred_task_exit(struct task_struct *task) {}
static inline void unwind_reset_info(void) {}
+/* Must be non-zero */
+static inline u64 unwind_user_get_cookie(void) { return (u64)-1; }
+
#endif /* !CONFIG_UNWIND_USER */
#endif /* _LINUX_UNWIND_USER_DEFERRED_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index dc6040aae3ee..90f90e30000a 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -94,6 +94,27 @@ static u64 get_cookie(struct unwind_task_info *info)
return info->id.id;
}
+/**
+ * unwind_user_get_cookie - Get the current user context cookie
+ *
+ * This is used to get a unique context cookie for the current task.
+ * Every time a task enters the kernel it has a new context. If
+ * a subsystem needs to have a unique identifier for that context for
+ * the current task, it can call this function to retrieve a unique
+ * cookie for that task context.
+ *
+ * Returns: A unique identifier for the current task user context.
+ */
+u64 unwind_user_get_cookie(void)
+{
+ struct unwind_task_info *info = &current->unwind_info;
+
+ guard(irqsave)();
+ /* Make sure to clear the info->id.id when exiting the kernel */
+ set_bit(UNWIND_USED_BIT, &info->unwind_mask);
+ return get_cookie(info);
+}
+
/**
* unwind_user_faultable - Produce a user stacktrace in faultable context
* @trace: The descriptor that will store the user stacktrace
--
2.50.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [RESEND][PATCH v15 2/4] perf: Support deferred user callchains
2025-09-08 17:14 [RESEND][PATCH v15 0/4] perf: Support the deferred unwinding infrastructure Steven Rostedt
2025-09-08 17:14 ` [RESEND][PATCH v15 1/4] unwind deferred: Add unwind_user_get_cookie() API Steven Rostedt
@ 2025-09-08 17:14 ` Steven Rostedt
2025-09-23 9:19 ` Peter Zijlstra
` (2 more replies)
2025-09-08 17:14 ` [RESEND][PATCH v15 3/4] perf: Have the deferred request record the user context cookie Steven Rostedt
` (3 subsequent siblings)
5 siblings, 3 replies; 25+ messages in thread
From: Steven Rostedt @ 2025-09-08 17:14 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Florian Weimer, Sam James, Kees Cook,
Carlos O'Donell
From: Josh Poimboeuf <jpoimboe@kernel.org>
If the faultable user space unwinder is available (the one that will be
used for sframes), have perf be able to utilize it. Currently all user
stack traces are done at the request site. This mostly happens in
interrupt or NMI context, where user space is only accessible if it is
currently present in memory. It is possible that the user stack was
swapped out and is not present, but more importantly the use of sframes
will require faulting in user pages, which is not possible from
interrupt context. Instead, add a framework that delays the reading of
the user space stack until the task goes back to user space, where
faulting in pages is possible. This is also advantageous as the user
space stack does not change while the task is in the kernel, and it
removes duplicate user stack entries for a long running system call
being profiled.
A new perf context is created called PERF_CONTEXT_USER_DEFERRED. It is
added to the kernel callchain, usually when an interrupt or NMI is
triggered (but it can be added to any callchain). When a deferred unwind
is required, a task_work (pending_unwind_work) is queued on the task,
and the callchain that is recorded immediately for the kernel is
appended with PERF_CONTEXT_USER_DEFERRED.
When the task exits to user space and the task_work handler is
triggered, it will execute the user stack unwinding and record the user
stack trace. This user stack trace goes into a new perf record type
called PERF_RECORD_CALLCHAIN_DEFERRED. The perf user space tool then
needs to attach this stack trace to each of the previous kernel
callchains for that task that contain the PERF_CONTEXT_USER_DEFERRED
context.
As struct unwind_stacktrace stores its entries as "unsigned long", and
they are copied directly into struct perf_callchain_entry, whose "ip"
field is defined as u64, deferred callchains are currently only allowed
on 64-bit architectures. This could change in the future if there is
demand for it on 32-bit architectures.
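For illustration only, the stitching the perf tool is expected to do
could look roughly like this (the marker value mirrors the uapi
addition below; the helper itself is not code from the perf tool):
#include <stdint.h>
#include <string.h>
#define PERF_CONTEXT_USER_DEFERRED	((uint64_t)-640)
/*
 * Build a full callchain from a sample whose callchain ends with
 * PERF_CONTEXT_USER_DEFERRED and a later PERF_RECORD_CALLCHAIN_DEFERRED
 * from the same task. The deferred ips[] already start with
 * PERF_CONTEXT_USER followed by the user frames. 'out' must have room
 * for sample_nr + deferred_nr entries. Returns the entries written.
 */
static uint64_t stitch_deferred(const uint64_t *sample_ips, uint64_t sample_nr,
				const uint64_t *deferred_ips, uint64_t deferred_nr,
				uint64_t *out)
{
	uint64_t n = 0;

	/* Copy the kernel part, dropping the deferral marker */
	for (uint64_t i = 0; i < sample_nr; i++) {
		if (sample_ips[i] == PERF_CONTEXT_USER_DEFERRED)
			break;
		out[n++] = sample_ips[i];
	}

	/* Append the deferred user part */
	memcpy(out + n, deferred_ips, deferred_nr * sizeof(*out));
	return n + deferred_nr;
}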
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/perf_event.h | 7 +-
include/uapi/linux/perf_event.h | 20 +++-
kernel/bpf/stackmap.c | 4 +-
kernel/events/callchain.c | 11 +-
kernel/events/core.c | 156 +++++++++++++++++++++++++-
tools/include/uapi/linux/perf_event.h | 20 +++-
6 files changed, 210 insertions(+), 8 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fd1d91017b99..1527afa952f7 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -53,6 +53,7 @@
#include <linux/security.h>
#include <linux/static_call.h>
#include <linux/lockdep.h>
+#include <linux/unwind_deferred.h>
#include <asm/local.h>
@@ -880,6 +881,10 @@ struct perf_event {
struct callback_head pending_task;
unsigned int pending_work;
+ unsigned int pending_unwind_callback;
+ struct callback_head pending_unwind_work;
+ struct rcuwait pending_unwind_wait;
+
atomic_t event_limit;
/* address range filters */
@@ -1720,7 +1725,7 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
extern struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
- u32 max_stack, bool crosstask, bool add_mark);
+ u32 max_stack, bool crosstask, bool add_mark, bool defer_user);
extern int get_callchain_buffers(int max_stack);
extern void put_callchain_buffers(void);
extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 78a362b80027..20b8f890113b 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -463,7 +463,8 @@ struct perf_event_attr {
inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
remove_on_exec : 1, /* event is removed from task on exec */
sigtrap : 1, /* send synchronous SIGTRAP on event */
- __reserved_1 : 26;
+ defer_callchain: 1, /* generate PERF_RECORD_CALLCHAIN_DEFERRED records */
+ __reserved_1 : 25;
union {
__u32 wakeup_events; /* wake up every n events */
@@ -1239,6 +1240,22 @@ enum perf_event_type {
*/
PERF_RECORD_AUX_OUTPUT_HW_ID = 21,
+ /*
+ * This user callchain capture was deferred until shortly before
+ * returning to user space. Previous samples would have kernel
+ * callchains only and they need to be stitched with this to make full
+ * callchains.
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u64 cookie;
+ * u64 nr;
+ * u64 ips[nr];
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_CALLCHAIN_DEFERRED = 22,
+
PERF_RECORD_MAX, /* non-ABI */
};
@@ -1269,6 +1286,7 @@ enum perf_callchain_context {
PERF_CONTEXT_HV = (__u64)-32,
PERF_CONTEXT_KERNEL = (__u64)-128,
PERF_CONTEXT_USER = (__u64)-512,
+ PERF_CONTEXT_USER_DEFERRED = (__u64)-640,
PERF_CONTEXT_GUEST = (__u64)-2048,
PERF_CONTEXT_GUEST_KERNEL = (__u64)-2176,
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index ec3a57a5fba1..339f7cbbcf36 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -315,7 +315,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
max_depth = sysctl_perf_event_max_stack;
trace = get_perf_callchain(regs, kernel, user, max_depth,
- false, false);
+ false, false, false);
if (unlikely(!trace))
/* couldn't fetch the stack trace */
@@ -452,7 +452,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
trace = get_callchain_entry_for_task(task, max_depth);
else
trace = get_perf_callchain(regs, kernel, user, max_depth,
- crosstask, false);
+ crosstask, false, false);
if (unlikely(!trace) || trace->nr < skip) {
if (may_fault)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 808c0d7a31fa..d0e0da66a164 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -218,7 +218,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
- u32 max_stack, bool crosstask, bool add_mark)
+ u32 max_stack, bool crosstask, bool add_mark, bool defer_user)
{
struct perf_callchain_entry *entry;
struct perf_callchain_entry_ctx ctx;
@@ -251,6 +251,15 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
regs = task_pt_regs(current);
}
+ if (defer_user) {
+ /*
+ * Foretell the coming of PERF_RECORD_CALLCHAIN_DEFERRED
+ * which can be stitched to this one.
+ */
+ perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
+ goto exit_put;
+ }
+
if (add_mark)
perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 28de3baff792..37e684edbc8a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5582,6 +5582,95 @@ static bool exclusive_event_installable(struct perf_event *event,
return true;
}
+static void perf_pending_unwind_sync(struct perf_event *event)
+{
+ might_sleep();
+
+ if (!event->pending_unwind_callback)
+ return;
+
+ /*
+ * If the work is queued on the current task's task_work list, we
+ * obviously can't wait for it to complete. Simply cancel it.
+ */
+ if (task_work_cancel(current, &event->pending_unwind_work)) {
+ event->pending_unwind_callback = 0;
+ local_dec(&event->ctx->nr_no_switch_fast);
+ return;
+ }
+
+ /*
+ * All accesses related to the event are within the same RCU section in
+ * perf_event_callchain_deferred(). The RCU grace period before the
+ * event is freed will make sure all those accesses are complete by then.
+ */
+ rcuwait_wait_event(&event->pending_unwind_wait, !event->pending_unwind_callback, TASK_UNINTERRUPTIBLE);
+}
+
+struct perf_callchain_deferred_event {
+ struct perf_event_header header;
+ u64 cookie;
+ u64 nr;
+ u64 ips[];
+};
+
+static void perf_event_callchain_deferred(struct callback_head *work)
+{
+ struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
+ struct perf_callchain_deferred_event deferred_event;
+ u64 callchain_context = PERF_CONTEXT_USER;
+ struct unwind_stacktrace trace;
+ struct perf_output_handle handle;
+ struct perf_sample_data data;
+ u64 nr;
+
+ if (!event->pending_unwind_callback)
+ return;
+
+ if (unwind_user_faultable(&trace) < 0)
+ goto out;
+
+ /*
+ * All accesses to the event must belong to the same implicit RCU
+ * read-side critical section as the ->pending_unwind_callback reset.
+ * See comment in perf_pending_unwind_sync().
+ */
+ guard(rcu)();
+
+ if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
+ goto out;
+
+ nr = trace.nr + 1 ; /* '+1' == callchain_context */
+
+ deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
+ deferred_event.header.misc = PERF_RECORD_MISC_USER;
+ deferred_event.header.size = sizeof(deferred_event) + (nr * sizeof(u64));
+
+ deferred_event.nr = nr;
+ deferred_event.cookie = unwind_user_get_cookie();
+
+ perf_event_header__init_id(&deferred_event.header, &data, event);
+
+ if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
+ goto out;
+
+ perf_output_put(&handle, deferred_event);
+ perf_output_put(&handle, callchain_context);
+ /* trace.entries[] are not guaranteed to be 64bit */
+ for (int i = 0; i < trace.nr; i++) {
+ u64 entry = trace.entries[i];
+ perf_output_put(&handle, entry);
+ }
+ perf_event__output_id_sample(event, &handle, &data);
+
+ perf_output_end(&handle);
+
+out:
+ event->pending_unwind_callback = 0;
+ local_dec(&event->ctx->nr_no_switch_fast);
+ rcuwait_wake_up(&event->pending_unwind_wait);
+}
+
static void perf_free_addr_filters(struct perf_event *event);
/* vs perf_event_alloc() error */
@@ -5649,6 +5738,7 @@ static void _free_event(struct perf_event *event)
{
irq_work_sync(&event->pending_irq);
irq_work_sync(&event->pending_disable_irq);
+ perf_pending_unwind_sync(event);
unaccount_event(event);
@@ -8194,6 +8284,46 @@ static u64 perf_get_page_size(unsigned long addr)
static struct perf_callchain_entry __empty_callchain = { .nr = 0, };
+/*
+ * Returns:
+* > 0 : if already queued.
+ * 0 : if it performed the queuing
+ * < 0 : if it did not get queued.
+ */
+static int deferred_request(struct perf_event *event)
+{
+ struct callback_head *work = &event->pending_unwind_work;
+ int pending;
+ int ret;
+
+ /* Only defer for task events */
+ if (!event->ctx->task)
+ return -EINVAL;
+
+ if ((current->flags & (PF_KTHREAD | PF_USER_WORKER)) ||
+ !user_mode(task_pt_regs(current)))
+ return -EINVAL;
+
+ guard(irqsave)();
+
+ /* callback already pending? */
+ pending = READ_ONCE(event->pending_unwind_callback);
+ if (pending)
+ return 1;
+
+ /* Claim the work unless an NMI just now swooped in to do so. */
+ if (!try_cmpxchg(&event->pending_unwind_callback, &pending, 1))
+ return 1;
+
+ /* The work has been claimed, now schedule it. */
+ ret = task_work_add(current, work, TWA_RESUME);
+ if (WARN_ON_ONCE(ret)) {
+ WRITE_ONCE(event->pending_unwind_callback, 0);
+ return ret;
+ }
+ return 0;
+}
+
struct perf_callchain_entry *
perf_callchain(struct perf_event *event, struct pt_regs *regs)
{
@@ -8204,6 +8334,9 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
bool crosstask = event->ctx->task && event->ctx->task != current;
const u32 max_stack = event->attr.sample_max_stack;
struct perf_callchain_entry *callchain;
+ /* perf currently only supports deferred in 64bit */
+ bool defer_user = IS_ENABLED(CONFIG_UNWIND_USER) && user &&
+ event->attr.defer_callchain;
if (!current->mm)
user = false;
@@ -8211,8 +8344,21 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
if (!kernel && !user)
return &__empty_callchain;
- callchain = get_perf_callchain(regs, kernel, user,
- max_stack, crosstask, true);
+ /* Disallow cross-task callchains. */
+ if (event->ctx->task && event->ctx->task != current)
+ return &__empty_callchain;
+
+ if (defer_user) {
+ int ret = deferred_request(event);
+ if (!ret)
+ local_inc(&event->ctx->nr_no_switch_fast);
+ else if (ret < 0)
+ defer_user = false;
+ }
+
+ callchain = get_perf_callchain(regs, kernel, user, max_stack,
+ crosstask, true, defer_user);
+
return callchain ?: &__empty_callchain;
}
@@ -12882,6 +13028,8 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
event->pending_disable_irq = IRQ_WORK_INIT_HARD(perf_pending_disable);
init_task_work(&event->pending_task, perf_pending_task);
+ rcuwait_init(&event->pending_unwind_wait);
+
mutex_init(&event->mmap_mutex);
raw_spin_lock_init(&event->addr_filters.lock);
@@ -13050,6 +13198,10 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
if (err)
return ERR_PTR(err);
+ if (event->attr.defer_callchain)
+ init_task_work(&event->pending_unwind_work,
+ perf_event_callchain_deferred);
+
/* symmetric to unaccount_event() in _free_event() */
account_event(event);
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 78a362b80027..20b8f890113b 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -463,7 +463,8 @@ struct perf_event_attr {
inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
remove_on_exec : 1, /* event is removed from task on exec */
sigtrap : 1, /* send synchronous SIGTRAP on event */
- __reserved_1 : 26;
+ defer_callchain: 1, /* generate PERF_RECORD_CALLCHAIN_DEFERRED records */
+ __reserved_1 : 25;
union {
__u32 wakeup_events; /* wake up every n events */
@@ -1239,6 +1240,22 @@ enum perf_event_type {
*/
PERF_RECORD_AUX_OUTPUT_HW_ID = 21,
+ /*
+ * This user callchain capture was deferred until shortly before
+ * returning to user space. Previous samples would have kernel
+ * callchains only and they need to be stitched with this to make full
+ * callchains.
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u64 cookie;
+ * u64 nr;
+ * u64 ips[nr];
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_CALLCHAIN_DEFERRED = 22,
+
PERF_RECORD_MAX, /* non-ABI */
};
@@ -1269,6 +1286,7 @@ enum perf_callchain_context {
PERF_CONTEXT_HV = (__u64)-32,
PERF_CONTEXT_KERNEL = (__u64)-128,
PERF_CONTEXT_USER = (__u64)-512,
+ PERF_CONTEXT_USER_DEFERRED = (__u64)-640,
PERF_CONTEXT_GUEST = (__u64)-2048,
PERF_CONTEXT_GUEST_KERNEL = (__u64)-2176,
--
2.50.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [RESEND][PATCH v15 2/4] perf: Support deferred user callchains
2025-09-08 17:14 ` [RESEND][PATCH v15 2/4] perf: Support deferred user callchains Steven Rostedt
@ 2025-09-23 9:19 ` Peter Zijlstra
2025-09-23 9:35 ` Steven Rostedt
2025-09-23 10:01 ` [RESEND][PATCH v15 2/4] perf: Support deferred user callchains Peter Zijlstra
2025-09-23 10:32 ` Peter Zijlstra
2 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2025-09-23 9:19 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Jens Remus, Linus Torvalds, Andrew Morton, Florian Weimer,
Sam James, Kees Cook, Carlos O'Donell
On Mon, Sep 08, 2025 at 01:14:14PM -0400, Steven Rostedt wrote:
> +static void perf_event_callchain_deferred(struct callback_head *work)
> +{
> + struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
> + struct perf_callchain_deferred_event deferred_event;
> + u64 callchain_context = PERF_CONTEXT_USER;
> + struct unwind_stacktrace trace;
> + struct perf_output_handle handle;
> + struct perf_sample_data data;
> + u64 nr;
> +
> + if (!event->pending_unwind_callback)
> + return;
> +
> + if (unwind_user_faultable(&trace) < 0)
> + goto out;
This is broken. Because:
> +
> + /*
> + * All accesses to the event must belong to the same implicit RCU
> + * read-side critical section as the ->pending_unwind_callback reset.
> + * See comment in perf_pending_unwind_sync().
> + */
> + guard(rcu)();
Here you start a guard, that lasts until close of function..
> +
> + if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
> + goto out;
> +
> + nr = trace.nr + 1 ; /* '+1' == callchain_context */
> +
> + deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
> + deferred_event.header.misc = PERF_RECORD_MISC_USER;
> + deferred_event.header.size = sizeof(deferred_event) + (nr * sizeof(u64));
> +
> + deferred_event.nr = nr;
> + deferred_event.cookie = unwind_user_get_cookie();
> +
> + perf_event_header__init_id(&deferred_event.header, &data, event);
> +
> + if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
> + goto out;
> +
> + perf_output_put(&handle, deferred_event);
> + perf_output_put(&handle, callchain_context);
> + /* trace.entries[] are not guaranteed to be 64bit */
> + for (int i = 0; i < trace.nr; i++) {
> + u64 entry = trace.entries[i];
> + perf_output_put(&handle, entry);
> + }
> + perf_event__output_id_sample(event, &handle, &data);
> +
> + perf_output_end(&handle);
> +
> +out:
Which very much includes here, so your goto jumps into a scope, which is
not permitted.
GCC can fail to warn on this, but clang will consistently fail to
compile this. Surely the robot would've told you by now -- even if
you're not using clang yourself.
> + event->pending_unwind_callback = 0;
> + local_dec(&event->ctx->nr_no_switch_fast);
> + rcuwait_wake_up(&event->pending_unwind_wait);
> +}
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RESEND][PATCH v15 2/4] perf: Support deferred user callchains
2025-09-23 9:19 ` Peter Zijlstra
@ 2025-09-23 9:35 ` Steven Rostedt
2025-09-23 9:38 ` Peter Zijlstra
0 siblings, 1 reply; 25+ messages in thread
From: Steven Rostedt @ 2025-09-23 9:35 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Jens Remus, Linus Torvalds, Andrew Morton, Florian Weimer,
Sam James, Kees Cook, Carlos O'Donell
On Tue, 23 Sep 2025 11:19:35 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Sep 08, 2025 at 01:14:14PM -0400, Steven Rostedt wrote:
>
> > +static void perf_event_callchain_deferred(struct callback_head *work)
> > +{
> > + struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
> > + struct perf_callchain_deferred_event deferred_event;
> > + u64 callchain_context = PERF_CONTEXT_USER;
> > + struct unwind_stacktrace trace;
> > + struct perf_output_handle handle;
> > + struct perf_sample_data data;
> > + u64 nr;
> > +
> > + if (!event->pending_unwind_callback)
> > + return;
> > +
> > + if (unwind_user_faultable(&trace) < 0)
> > + goto out;
>
> This is broken. Because:
>
> > +
> > + /*
> > + * All accesses to the event must belong to the same implicit RCU
> > + * read-side critical section as the ->pending_unwind_callback reset.
> > + * See comment in perf_pending_unwind_sync().
> > + */
> > + guard(rcu)();
>
> Here you start a guard, that lasts until close of function..
>
> > +
> > + if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
> > + goto out;
> > +
> > + nr = trace.nr + 1 ; /* '+1' == callchain_context */
> > +
> > + deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
> > + deferred_event.header.misc = PERF_RECORD_MISC_USER;
> > + deferred_event.header.size = sizeof(deferred_event) + (nr * sizeof(u64));
> > +
> > + deferred_event.nr = nr;
> > + deferred_event.cookie = unwind_user_get_cookie();
> > +
> > + perf_event_header__init_id(&deferred_event.header, &data, event);
> > +
> > + if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
> > + goto out;
> > +
> > + perf_output_put(&handle, deferred_event);
> > + perf_output_put(&handle, callchain_context);
> > + /* trace.entries[] are not guaranteed to be 64bit */
> > + for (int i = 0; i < trace.nr; i++) {
> > + u64 entry = trace.entries[i];
> > + perf_output_put(&handle, entry);
> > + }
> > + perf_event__output_id_sample(event, &handle, &data);
> > +
> > + perf_output_end(&handle);
> > +
> > +out:
>
> Which very much includes here, so your goto jumps into a scope, which is
> not permitted.
Nice catch.
>
> GCC can fail to warn on this, but clang will consistently fail to
> compile this. Surely the robot would've told you by now -- even if
> you're not using clang yourself.
Unfortunately it hasn't :-(
I need to start building with clang more often.
I even pushed this to a git tree. Not sure why it didn't get flagged.
-- Steve
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RESEND][PATCH v15 2/4] perf: Support deferred user callchains
2025-09-23 9:35 ` Steven Rostedt
@ 2025-09-23 9:38 ` Peter Zijlstra
2025-09-23 10:28 ` [RESEND][PATCH v15 2/4] perf: Support deferred user callchains Steven Rostedt
0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2025-09-23 9:38 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Jens Remus, Linus Torvalds, Andrew Morton, Florian Weimer,
Sam James, Kees Cook, Carlos O'Donell
On Tue, Sep 23, 2025 at 05:35:15AM -0400, Steven Rostedt wrote:
> I even pushed this to a git tree. Not sure why it didn't get flagged.
I've been looking at this... how do I enable CONFIG_UNWIND_USER?
I suspect the problem is that its impossible to actually compile/use
this code.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RESEND][PATCH v15 2/4] perf: Support deferred user callchains
2025-09-23 9:38 ` Peter Zijlstra
@ 2025-09-23 10:28 ` Steven Rostedt
2025-09-23 10:35 ` Peter Zijlstra
0 siblings, 1 reply; 25+ messages in thread
From: Steven Rostedt @ 2025-09-23 10:28 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Jens Remus, Linus Torvalds, Andrew Morton, Florian Weimer,
Sam James, Kees Cook, Carlos O'Donell
On Tue, 23 Sep 2025 11:38:21 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Sep 23, 2025 at 05:35:15AM -0400, Steven Rostedt wrote:
>
> > I even pushed this to a git tree. Not sure why it didn't get flagged.
>
> I've been looking at this... how do I enable CONFIG_UNWIND_USER?
Hmm, maybe that's why it wasn't flagged.
>
> I suspect the problem is that its impossible to actually compile/use
> this code.
It needs an arch to enable it. Here's the x86 patches that I was hoping
would get in too:
https://lore.kernel.org/linux-trace-kernel/20250827193644.527334838@kernel.org/
Hmm, but I had a branch that applied all the necessary patches :-/
-- Steve
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RESEND][PATCH v15 2/4] perf: Support deferred user callchains
2025-09-23 10:28 ` [RESEND][PATCH v15 2/4] perf: Support deferred user callchains Steven Rostedt
@ 2025-09-23 10:35 ` Peter Zijlstra
0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2025-09-23 10:35 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Jens Remus, Linus Torvalds, Andrew Morton, Florian Weimer,
Sam James, Kees Cook, Carlos O'Donell
On Tue, Sep 23, 2025 at 06:28:48AM -0400, Steven Rostedt wrote:
> On Tue, 23 Sep 2025 11:38:21 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Tue, Sep 23, 2025 at 05:35:15AM -0400, Steven Rostedt wrote:
> >
> > > I even pushed this to a git tree. Not sure why it didn't get flagged.
> >
> > I've been looking at this... how do I enable CONFIG_UNWIND_USER?
>
> Hmm, maybe that's why it wasn't flagged.
>
> >
> > I suspect the problem is that its impossible to actually compile/use
> > this code.
>
> It needs an arch to enable it. Here's the x86 patches that I was hoping
> would get in too:
>
> https://lore.kernel.org/linux-trace-kernel/20250827193644.527334838@kernel.org/
Hmm, let me go find that.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RESEND][PATCH v15 2/4] perf: Support deferred user callchains
2025-09-08 17:14 ` [RESEND][PATCH v15 2/4] perf: Support deferred user callchains Steven Rostedt
2025-09-23 9:19 ` Peter Zijlstra
@ 2025-09-23 10:01 ` Peter Zijlstra
2025-09-23 10:32 ` Peter Zijlstra
2 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2025-09-23 10:01 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Jens Remus, Linus Torvalds, Andrew Morton, Florian Weimer,
Sam James, Kees Cook, Carlos O'Donell
On Mon, Sep 08, 2025 at 01:14:14PM -0400, Steven Rostedt wrote:
> + /*
> + * All accesses related to the event are within the same RCU section in
> + * perf_event_callchain_deferred(). The RCU grace period before the
> + * event is freed will make sure all those accesses are complete by then.
> + */
> + rcuwait_wait_event(&event->pending_unwind_wait, !event->pending_unwind_callback, TASK_UNINTERRUPTIBLE);
You need a narrower terminal, this is again excessive. I mostly code
with my screen split in 4 columns (3 if I can't find my glasses), and
that gets me around 90 character columns.
> + if (event->attr.defer_callchain)
> + init_task_work(&event->pending_unwind_work,
> + perf_event_callchain_deferred);
And let me hand you a bucket of {}, I've heard they're getting expensive
over in the US due to this tariff nonsense :-)
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RESEND][PATCH v15 2/4] perf: Support deferred user callchains
2025-09-08 17:14 ` [RESEND][PATCH v15 2/4] perf: Support deferred user callchains Steven Rostedt
2025-09-23 9:19 ` Peter Zijlstra
2025-09-23 10:01 ` [RESEND][PATCH v15 2/4] perf: Support deferred user callchains Peter Zijlstra
@ 2025-09-23 10:32 ` Peter Zijlstra
2025-09-23 12:36 ` Steven Rostedt
2025-10-03 19:56 ` Steven Rostedt
2 siblings, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2025-09-23 10:32 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Jens Remus, Linus Torvalds, Andrew Morton, Florian Weimer,
Sam James, Kees Cook, Carlos O'Donell
On Mon, Sep 08, 2025 at 01:14:14PM -0400, Steven Rostedt wrote:
> +struct perf_callchain_deferred_event {
> + struct perf_event_header header;
> + u64 cookie;
> + u64 nr;
> + u64 ips[];
> +};
> +
> +static void perf_event_callchain_deferred(struct callback_head *work)
> +{
> + struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
> + struct perf_callchain_deferred_event deferred_event;
> + u64 callchain_context = PERF_CONTEXT_USER;
> + struct unwind_stacktrace trace;
> + struct perf_output_handle handle;
> + struct perf_sample_data data;
> + u64 nr;
> +
> + if (!event->pending_unwind_callback)
> + return;
> +
> + if (unwind_user_faultable(&trace) < 0)
> + goto out;
> +
> + /*
> + * All accesses to the event must belong to the same implicit RCU
> + * read-side critical section as the ->pending_unwind_callback reset.
> + * See comment in perf_pending_unwind_sync().
> + */
> + guard(rcu)();
> +
> + if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
> + goto out;
> +
> + nr = trace.nr + 1 ; /* '+1' == callchain_context */
> +
> + deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
> + deferred_event.header.misc = PERF_RECORD_MISC_USER;
> + deferred_event.header.size = sizeof(deferred_event) + (nr * sizeof(u64));
> +
> + deferred_event.nr = nr;
> + deferred_event.cookie = unwind_user_get_cookie();
> +
> + perf_event_header__init_id(&deferred_event.header, &data, event);
> +
> + if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
> + goto out;
> +
> + perf_output_put(&handle, deferred_event);
> + perf_output_put(&handle, callchain_context);
> + /* trace.entries[] are not guaranteed to be 64bit */
> + for (int i = 0; i < trace.nr; i++) {
> + u64 entry = trace.entries[i];
> + perf_output_put(&handle, entry);
> + }
> + perf_event__output_id_sample(event, &handle, &data);
> +
> + perf_output_end(&handle);
> +
> +out:
> + event->pending_unwind_callback = 0;
> + local_dec(&event->ctx->nr_no_switch_fast);
> + rcuwait_wake_up(&event->pending_unwind_wait);
> +}
> +
> +/*
> + * Returns:
> +* > 0 : if already queued.
> + * 0 : if it performed the queuing
> + * < 0 : if it did not get queued.
> + */
> +static int deferred_request(struct perf_event *event)
> +{
> + struct callback_head *work = &event->pending_unwind_work;
> + int pending;
> + int ret;
> +
> + /* Only defer for task events */
> + if (!event->ctx->task)
> + return -EINVAL;
> +
> + if ((current->flags & (PF_KTHREAD | PF_USER_WORKER)) ||
> + !user_mode(task_pt_regs(current)))
> + return -EINVAL;
> +
> + guard(irqsave)();
> +
> + /* callback already pending? */
> + pending = READ_ONCE(event->pending_unwind_callback);
> + if (pending)
> + return 1;
> +
> + /* Claim the work unless an NMI just now swooped in to do so. */
> + if (!try_cmpxchg(&event->pending_unwind_callback, &pending, 1))
> + return 1;
> +
> + /* The work has been claimed, now schedule it. */
> + ret = task_work_add(current, work, TWA_RESUME);
> + if (WARN_ON_ONCE(ret)) {
> + WRITE_ONCE(event->pending_unwind_callback, 0);
> + return ret;
> + }
> + return 0;
> +}
So the thing that stands out is that you're not actually using the
unwind infrastructure you've previously created. Things like: struct
unwind_work, unwind_deferred_{init,request,cancel}() all go unused, and
instead you seem to have built a parallel set, with similar bugs to the
ones I just had to fix in the unwind_deferred things :/
I'm also not much of a fan of nr_no_switch_fast, and the fact that this
patch is limited to per-task events, and you're then adding another 300+
lines of code to support per-cpu events later on.
Fundamentally we only have one stack-trace per task at any one point. We
can have many events per task and many more per-cpu. Let us stick a
struct unwind_work in task_struct and have the perf callback function
use perf_iterate_sb() to find all events that want delivery or so (or we
can add another per perf_event_context list for this purpose).
But duplicating all this seems 'unfortunate'.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RESEND][PATCH v15 2/4] perf: Support deferred user callchains
2025-09-23 10:32 ` Peter Zijlstra
@ 2025-09-23 12:36 ` Steven Rostedt
2025-10-03 19:56 ` Steven Rostedt
1 sibling, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2025-09-23 12:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Jens Remus, Linus Torvalds, Andrew Morton, Florian Weimer,
Sam James, Kees Cook, Carlos O'Donell
On Tue, 23 Sep 2025 12:32:13 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> So the thing that stands out is that you're not actually using the
> unwind infrastructure you've previously created. Things like: struct
> unwind_work, unwind_deferred_{init,request,cancel}() all go unused, and
> instead you seem to have build a parallel set, with similar bugs to the
> ones I just had to fix in the unwind_deferred things :/
>
> I'm also not much of a fan of nr_no_switch_fast, and the fact that this
> patch is limited to per-task events, and you're then adding another 300+
> lines of code to support per-cpu events later on.
>
> Fundamentally we only have one stack-trace per task at any one point. We
> can have many events per task and many more per-cpu. Let us stick a
> struct unwind_work in task_struct and have the perf callback function
> use perf_iterate_sb() to find all events that want delivery or so (or we
> can add another per perf_event_context list for this purpose).
>
> But duplicating all this seems 'unfortunate'.
We could remove this and have perf only use the CPU version. That may
be better in the long run anyway, as it gets rid of the duplication. In
fact that was the original plan we had, but since Josh wrote this patch
thinking it was all that perf needed (which ended not being the case),
I still kept it in. But I believe this will work just the same as the
CPU tracing which uses all the other infrastructure.
-- Steve
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RESEND][PATCH v15 2/4] perf: Support deferred user callchains
2025-09-23 10:32 ` Peter Zijlstra
2025-09-23 12:36 ` Steven Rostedt
@ 2025-10-03 19:56 ` Steven Rostedt
1 sibling, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2025-10-03 19:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Florian Weimer, Sam James, Kees Cook, Carlos O'Donell
On Tue, 23 Sep 2025 12:32:13 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> I'm also not much of a fan of nr_no_switch_fast, and the fact that this
> patch is limited to per-task events, and you're then adding another 300+
> lines of code to support per-cpu events later on.
BTW, I'm not exactly sure what the purpose of "nr_no_switch_fast" is.
Josh had it in his patches and I kept it.
I'm almost done with my next version that moved a lot of the "follow task"
work into the deferred unwind infrastructure, which drastically simplified
this patch.
But I still have this "nr_no_switch_fast" increment when a request is
successfully made and decremented when the stacktrace is executed. In the
task switch perf code there's:
/* PMIs are disabled; ctx->nr_no_switch_fast is stable. */
if (local_read(&ctx->nr_no_switch_fast) ||
local_read(&next_ctx->nr_no_switch_fast)) {
/*
* Must not swap out ctx when there's pending
* events that rely on the ctx->task relation.
*
* Likewise, when a context contains inherit +
* SAMPLE_READ events they should be switched
* out using the slow path so that they are
* treated as if they were distinct contexts.
*/
raw_spin_unlock(&next_ctx->lock);
rcu_read_unlock();
goto inside_switch;
}
Is this mostly to do with PMU counters? Here there is a relation to the
task and the event, but that's just that the task is going to have a
deferred stack trace.
Can I safely drop this counter?
-- Steve
^ permalink raw reply [flat|nested] 25+ messages in thread
* [RESEND][PATCH v15 3/4] perf: Have the deferred request record the user context cookie
2025-09-08 17:14 [RESEND][PATCH v15 0/4] perf: Support the deferred unwinding infrastructure Steven Rostedt
2025-09-08 17:14 ` [RESEND][PATCH v15 1/4] unwind deferred: Add unwind_user_get_cookie() API Steven Rostedt
2025-09-08 17:14 ` [RESEND][PATCH v15 2/4] perf: Support deferred user callchains Steven Rostedt
@ 2025-09-08 17:14 ` Steven Rostedt
2025-09-08 17:14 ` [RESEND][PATCH v15 4/4] perf: Support deferred user callchains for per CPU events Steven Rostedt
` (2 subsequent siblings)
5 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2025-09-08 17:14 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Florian Weimer, Sam James, Kees Cook,
Carlos O'Donell
From: Steven Rostedt <rostedt@goodmis.org>
When a request for a deferred unwind is made, record the cookie
associated with the user context in the event that represents that
request. It is added after the PERF_CONTEXT_USER_DEFERRED entry in the
callchain. That perf context is a marker of where to add the associated
user space stack trace in the callchain. Adding the cookie after that
marker does not affect the stitching of the callchain, as it will be
overwritten by the user space stack in the perf tool.
The cookie will be used to match against the cookie that is saved when
the deferred callchain is recorded. The perf tool can use the cookie
saved at the request to know if the callchain that was recorded when
the task went back to user space is for that event. If events were
dropped after the request was made, dropping both the user stack trace
that was recorded when the task went back to user space and a new
request made after the task re-entered the kernel, and recording then
resumed and captured a new callchain going back to user space, that
callchain would not be for the initial request. The cookie matching
prevents this scenario from producing a mis-stitched callchain.
The cookie prevents:
record kernel stack trace with PERF_CONTEXT_USER_DEFERRED
[ dropped events starts here ]
record user stack trace - DROPPED
[enters user space ]
[exits user space back to the kernel ]
record kernel stack trace with PERF_CONTEXT_USER_DEFERRED - DROPPED!
[ events stop being dropped here ]
record user stack trace
Without a differentiating "cookie" identifier, the user space tool will
incorrectly attach the last recorded user stack trace to the first kernel
stack trace with the PERF_CONTEXT_USER_DEFERRED, as using the TID is not
enough to identify this situation.
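A sketch of the matching rule the tool is expected to apply
(illustrative; only the marker value and the cookie placement come from
this series):
#include <stdbool.h>
#include <stdint.h>
#define PERF_CONTEXT_USER_DEFERRED	((uint64_t)-640)
/*
 * The request stores the context cookie right after the
 * PERF_CONTEXT_USER_DEFERRED marker in the sampled callchain, and the
 * PERF_RECORD_CALLCHAIN_DEFERRED record carries its own cookie. Only
 * stitch the two when the cookies agree; otherwise the deferred trace
 * belongs to a different kernel entry (e.g. records were dropped in
 * between).
 */
static bool deferred_matches(const uint64_t *sample_ips, uint64_t sample_nr,
			     uint64_t deferred_cookie)
{
	for (uint64_t i = 0; i + 1 < sample_nr; i++) {
		if (sample_ips[i] == PERF_CONTEXT_USER_DEFERRED)
			return sample_ips[i + 1] == deferred_cookie;
	}
	return false;
}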
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/perf_event.h | 2 +-
include/uapi/linux/perf_event.h | 5 +++++
kernel/bpf/stackmap.c | 4 ++--
kernel/events/callchain.c | 9 ++++++---
kernel/events/core.c | 11 +++++++----
tools/include/uapi/linux/perf_event.h | 5 +++++
6 files changed, 26 insertions(+), 10 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1527afa952f7..c8eefbc9ce51 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1725,7 +1725,7 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
extern struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
- u32 max_stack, bool crosstask, bool add_mark, bool defer_user);
+ u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie);
extern int get_callchain_buffers(int max_stack);
extern void put_callchain_buffers(void);
extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 20b8f890113b..79232e85a8fc 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -1282,6 +1282,11 @@ enum perf_bpf_event_type {
#define PERF_MAX_STACK_DEPTH 127
#define PERF_MAX_CONTEXTS_PER_STACK 8
+/*
+ * The PERF_CONTEXT_USER_DEFERRED has two items (context and cookie)
+ */
+#define PERF_DEFERRED_ITEMS 2
+
enum perf_callchain_context {
PERF_CONTEXT_HV = (__u64)-32,
PERF_CONTEXT_KERNEL = (__u64)-128,
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 339f7cbbcf36..ef6021111fe3 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -315,7 +315,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
max_depth = sysctl_perf_event_max_stack;
trace = get_perf_callchain(regs, kernel, user, max_depth,
- false, false, false);
+ false, false, 0);
if (unlikely(!trace))
/* couldn't fetch the stack trace */
@@ -452,7 +452,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
trace = get_callchain_entry_for_task(task, max_depth);
else
trace = get_perf_callchain(regs, kernel, user, max_depth,
- crosstask, false, false);
+ crosstask, false, 0);
if (unlikely(!trace) || trace->nr < skip) {
if (may_fault)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index d0e0da66a164..b9c7e00725d6 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -218,7 +218,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
- u32 max_stack, bool crosstask, bool add_mark, bool defer_user)
+ u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie)
{
struct perf_callchain_entry *entry;
struct perf_callchain_entry_ctx ctx;
@@ -251,12 +251,15 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
regs = task_pt_regs(current);
}
- if (defer_user) {
+ if (defer_cookie) {
/*
* Foretell the coming of PERF_RECORD_CALLCHAIN_DEFERRED
- * which can be stitched to this one.
+ * which can be stitched to this one, and add
+ * the cookie after it (it will be cut off when the
+ * user stack is copied to the callchain).
*/
perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
+ perf_callchain_store_context(&ctx, defer_cookie);
goto exit_put;
}
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 37e684edbc8a..db4ca7e4afb1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8290,7 +8290,7 @@ static struct perf_callchain_entry __empty_callchain = { .nr = 0, };
* 0 : if it performed the queuing
* < 0 : if it did not get queued.
*/
-static int deferred_request(struct perf_event *event)
+static int deferred_request(struct perf_event *event, u64 *defer_cookie)
{
struct callback_head *work = &event->pending_unwind_work;
int pending;
@@ -8306,6 +8306,8 @@ static int deferred_request(struct perf_event *event)
guard(irqsave)();
+ *defer_cookie = unwind_user_get_cookie();
+
/* callback already pending? */
pending = READ_ONCE(event->pending_unwind_callback);
if (pending)
@@ -8334,6 +8336,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
bool crosstask = event->ctx->task && event->ctx->task != current;
const u32 max_stack = event->attr.sample_max_stack;
struct perf_callchain_entry *callchain;
+ u64 defer_cookie = 0;
/* perf currently only supports deferred in 64bit */
bool defer_user = IS_ENABLED(CONFIG_UNWIND_USER) && user &&
event->attr.defer_callchain;
@@ -8349,15 +8352,15 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
return &__empty_callchain;
if (defer_user) {
- int ret = deferred_request(event);
+ int ret = deferred_request(event, &defer_cookie);
if (!ret)
local_inc(&event->ctx->nr_no_switch_fast);
else if (ret < 0)
- defer_user = false;
+ defer_cookie = 0;
}
callchain = get_perf_callchain(regs, kernel, user, max_stack,
- crosstask, true, defer_user);
+ crosstask, true, defer_cookie);
return callchain ?: &__empty_callchain;
}
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 20b8f890113b..79232e85a8fc 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -1282,6 +1282,11 @@ enum perf_bpf_event_type {
#define PERF_MAX_STACK_DEPTH 127
#define PERF_MAX_CONTEXTS_PER_STACK 8
+/*
+ * The PERF_CONTEXT_USER_DEFERRED has two items (context and cookie)
+ */
+#define PERF_DEFERRED_ITEMS 2
+
enum perf_callchain_context {
PERF_CONTEXT_HV = (__u64)-32,
PERF_CONTEXT_KERNEL = (__u64)-128,
--
2.50.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [RESEND][PATCH v15 4/4] perf: Support deferred user callchains for per CPU events
2025-09-08 17:14 [RESEND][PATCH v15 0/4] perf: Support the deferred unwinding infrastructure Steven Rostedt
` (2 preceding siblings ...)
2025-09-08 17:14 ` [RESEND][PATCH v15 3/4] perf: Have the deferred request record the user context cookie Steven Rostedt
@ 2025-09-08 17:14 ` Steven Rostedt
2025-09-08 17:21 ` [RESEND][PATCH v15 0/4] perf: Support the deferred unwinding infrastructure Steven Rostedt
2025-09-18 11:46 ` Peter Zijlstra
5 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2025-09-08 17:14 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
Andrew Morton, Florian Weimer, Sam James, Kees Cook,
Carlos O'Donell
From: Steven Rostedt <rostedt@goodmis.org>
The deferred unwinder works fine for task events (events that trace only
a specific task), as it can use a task_work from an interrupt or NMI,
and when the task goes back to user space it will call the event's
callback to do the deferred unwinding.
But for per CPU events things are not so simple. When a per CPU event
wants a deferred unwinding to occur, it cannot simply use a task_work,
as there's a many to many relationship. If the task migrates and another
task is scheduled in, the per CPU event may want a deferred unwinding on
that task as well, and the task that migrated may have the event on the
CPU it migrated to wanting to unwind it too. Each CPU may need unwinding
from more than one task, and each task may have requests from many CPUs.
The main issue is that, from the kernel's point of view, there's
currently nothing that associates a per CPU event on one CPU with the
per CPU events that cover the other CPUs for a given process. To the
kernel, they are all just individual event buffers. This is problematic
if a deferred request is made on one CPU and the task migrates to
another CPU, where the deferred user stack trace will be performed. The
kernel needs to know which CPU buffer, belonging to the same process
that initiated the deferred request, it should add the trace to.
To solve this, when a per CPU event is created with the defer_callchain
attribute set, it will do a lookup in a global list
(unwind_deferred_list) for a perf_unwind_deferred descriptor whose id
matches the PID of the current task's group_leader (the process ID
shared by all threads of a process).
If it is not found, then it will create one and add it to the global list.
This descriptor contains an array of all possible CPUs, where each element
is a perf_unwind_cpu descriptor.
The perf_unwind_cpu descriptor has a list of all the per CPU events that
are tracing the CPU that corresponds to its index in the array, where
the events belong to a task that has the same group_leader.
It also has a processing bit and rcuwait to handle removal.
For each occupied perf_unwind_cpu descriptor in the array, the
perf_unwind_deferred descriptor increments its nr_cpu_events. When a
perf_unwind_cpu descriptor becomes empty, nr_cpu_events is decremented.
This is used to know when to free the perf_unwind_deferred descriptor,
as once nr_cpu_events reaches zero it is no longer referenced.
Finally, the perf_unwind_deferred descriptor has an id that holds the
PID of the group_leader of the tasks that the events were created by.
When a second (or more) per CPU event is created for which the
perf_unwind_deferred descriptor already exists, it just adds itself to
the perf_unwind_cpu array of that descriptor, updating the necessary
counter. This is used to map different per CPU events to each other
based on their group leader PID.
Each of these perf_unwind_deferred descriptors has an unwind_work that
registers with the deferred unwind infrastructure via
unwind_deferred_init(), where it also registers a callback to
perf_event_deferred_cpu().
Now when a per CPU event requests a deferred unwinding, it calls
unwind_deferred_request() with the associated perf_unwind_deferred
descriptor. It is expected that the program that uses this has events on
all CPUs, as the deferred trace may not be recorded on the CPU event
that requested it. That is, the task may migrate, and its user stack
trace will be recorded on the CPU event of the CPU that it exits back to
user space on.
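A rough user space sketch of the expected setup, with error handling
trimmed (the defer_callchain bit only exists with the uapi header from
this series; the event type and sample period are arbitrary):
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>
/*
 * Open one sampling event per CPU, system wide, asking for deferred
 * user callchains. The PERF_RECORD_CALLCHAIN_DEFERRED may be emitted on
 * whichever CPU the task returns to user space on, so every CPU needs
 * an event with the bit set.
 */
static int open_deferred_events(int *fds, int nr_cpus)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_CPU_CLOCK;
	attr.sample_type = PERF_SAMPLE_CALLCHAIN | PERF_SAMPLE_TID;
	attr.sample_period = 100000;
	attr.defer_callchain = 1;	/* only with this series' header */

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		/* pid == -1, cpu >= 0: profile every task on that CPU */
		fds[cpu] = syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
		if (fds[cpu] < 0)
			return -1;
	}
	return 0;
}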
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/perf_event.h | 4 +
kernel/events/core.c | 320 +++++++++++++++++++++++++++++++++----
2 files changed, 295 insertions(+), 29 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c8eefbc9ce51..0edc7ad4c914 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -733,6 +733,7 @@ struct swevent_hlist {
struct bpf_prog;
struct perf_cgroup;
struct perf_buffer;
+struct perf_unwind_deferred;
struct pmu_event_list {
raw_spinlock_t lock;
@@ -885,6 +886,9 @@ struct perf_event {
struct callback_head pending_unwind_work;
struct rcuwait pending_unwind_wait;
+ struct perf_unwind_deferred *unwind_deferred;
+ struct list_head unwind_list;
+
atomic_t event_limit;
/* address range filters */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index db4ca7e4afb1..303ab50eca8b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5582,10 +5582,193 @@ static bool exclusive_event_installable(struct perf_event *event,
return true;
}
+/* Holds a list of per CPU events that registered for deferred unwinding */
+struct perf_unwind_cpu {
+ struct list_head list;
+ struct rcuwait pending_unwind_wait;
+ int processing;
+};
+
+struct perf_unwind_deferred {
+ struct list_head list;
+ struct unwind_work unwind_work;
+ struct perf_unwind_cpu __rcu *cpu_events;
+ struct rcu_head rcu_head;
+ int nr_cpu_events;
+ int id;
+};
+
+static DEFINE_MUTEX(unwind_deferred_mutex);
+static LIST_HEAD(unwind_deferred_list);
+
+static void perf_event_deferred_cpu(struct unwind_work *work,
+ struct unwind_stacktrace *trace, u64 cookie);
+
+/*
+ * Add a per CPU event.
+ *
+ * The deferred callstack can happen on a different CPU than what was
+ * requested. If one CPU event requests a deferred callstack, but the
+ * tasks migrates, it will execute on a different CPU and save the
+ * stack trace to that CPU event.
+ *
+ * In order to map all the CPU events with the same application,
+ * use the current->group_leader->pid as the identifier of what
+ * events share the same program.
+ *
+ * A perf_unwind_deferred descriptor is created for each unique
+ * group_leader pid, and all the events that have the same group_leader
+ * pid will be linked to the same deferred descriptor.
+ *
+ * If there's no descriptor that matches the current group_leader pid,
+ * one will be created.
+ */
+static int perf_add_unwind_deferred(struct perf_event *event)
+{
+ struct perf_unwind_deferred *defer;
+ struct perf_unwind_cpu *cpu_events;
+ int id = current->group_leader->pid;
+ bool found = false;
+ int ret = 0;
+
+ if (event->cpu < 0)
+ return -EINVAL;
+
+ guard(mutex)(&unwind_deferred_mutex);
+
+ list_for_each_entry(defer, &unwind_deferred_list, list) {
+ if (defer->id == id) {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found) {
+ defer = kzalloc(sizeof(*defer), GFP_KERNEL);
+ if (!defer)
+ return -ENOMEM;
+ list_add(&defer->list, &unwind_deferred_list);
+ defer->id = id;
+ }
+
+ /*
+ * The deferred descriptor has an array for every CPU.
+ * Each entry in this array is a linked list of all the CPU
+ * events for the corresponding CPU. This is a quick way to
+ * find the associated event for a given CPU in
+ * perf_event_deferred_cpu().
+ */
+ if (!defer->nr_cpu_events) {
+ cpu_events = kcalloc(num_possible_cpus(),
+ sizeof(*cpu_events),
+ GFP_KERNEL);
+ if (!cpu_events) {
+ ret = -ENOMEM;
+ goto free;
+ }
+ for (int cpu = 0; cpu < num_possible_cpus(); cpu++) {
+ rcuwait_init(&cpu_events[cpu].pending_unwind_wait);
+ INIT_LIST_HEAD(&cpu_events[cpu].list);
+ }
+
+ rcu_assign_pointer(defer->cpu_events, cpu_events);
+
+ ret = unwind_deferred_init(&defer->unwind_work,
+ perf_event_deferred_cpu);
+ if (ret)
+ goto free;
+ }
+ cpu_events = rcu_dereference_protected(defer->cpu_events,
+ lockdep_is_held(&unwind_deferred_mutex));
+
+ /*
+ * The defer->nr_cpu_events is the number of non-empty
+ * lists in the cpu_events array. If the list
+ * being added to is already occupied, the nr_cpu_events does
+ * not need to get incremented.
+ */
+ if (list_empty(&cpu_events[event->cpu].list))
+ defer->nr_cpu_events++;
+ list_add_tail_rcu(&event->unwind_list, &cpu_events[event->cpu].list);
+
+ event->unwind_deferred = defer;
+ return 0;
+free:
+ /* Nothing to do if there was already an existing event attached */
+ if (found)
+ return ret;
+
+ list_del(&defer->list);
+ kfree(cpu_events);
+ kfree(defer);
+ return ret;
+}
+
+static void free_unwind_deferred_rcu(struct rcu_head *head)
+{
+ struct perf_unwind_cpu *cpu_events;
+ struct perf_unwind_deferred *defer =
+ container_of(head, struct perf_unwind_deferred, rcu_head);
+
+ WARN_ON_ONCE(defer->nr_cpu_events);
+ /*
+ * This is called by call_rcu() and there are no more
+ * references to cpu_events.
+ */
+ cpu_events = rcu_dereference_protected(defer->cpu_events, true);
+ kfree(cpu_events);
+ kfree(defer);
+}
+
+static void perf_remove_unwind_deferred(struct perf_event *event)
+{
+ struct perf_unwind_deferred *defer = event->unwind_deferred;
+ struct perf_unwind_cpu *cpu_events, *cpu_unwind;
+
+ if (!defer)
+ return;
+
+ guard(mutex)(&unwind_deferred_mutex);
+ list_del_rcu(&event->unwind_list);
+
+ cpu_events = rcu_dereference_protected(defer->cpu_events,
+ lockdep_is_held(&unwind_deferred_mutex));
+ cpu_unwind = &cpu_events[event->cpu];
+
+ if (list_empty(&cpu_unwind->list)) {
+ defer->nr_cpu_events--;
+ if (!defer->nr_cpu_events)
+ unwind_deferred_cancel(&defer->unwind_work);
+ }
+
+ event->unwind_deferred = NULL;
+
+ /*
+ * Make sure perf_event_deferred_cpu() is done with this event.
+ * That function will set cpu_unwind->processing and then
+ * call smp_mb() before iterating the list of its events.
+ * If the event's unwind_deferred is NULL, it will be skipped.
+ * The smp_mb() in that function matches the mb() in
+ * rcuwait_wait_event().
+ */
+ rcuwait_wait_event(&cpu_unwind->pending_unwind_wait,
+ !cpu_unwind->processing, TASK_UNINTERRUPTIBLE);
+
+ /* Is this still being used by other per CPU events? */
+ if (defer->nr_cpu_events)
+ return;
+
+ list_del(&defer->list);
+ /* The defer->cpu_events is protected by RCU */
+ call_rcu(&defer->rcu_head, free_unwind_deferred_rcu);
+}
+
static void perf_pending_unwind_sync(struct perf_event *event)
{
might_sleep();
+ perf_remove_unwind_deferred(event);
+
if (!event->pending_unwind_callback)
return;
@@ -5614,63 +5797,119 @@ struct perf_callchain_deferred_event {
u64 ips[];
};
-static void perf_event_callchain_deferred(struct callback_head *work)
+static void perf_event_callchain_deferred(struct perf_event *event,
+ struct unwind_stacktrace *trace,
+ u64 cookie)
{
- struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
struct perf_callchain_deferred_event deferred_event;
u64 callchain_context = PERF_CONTEXT_USER;
- struct unwind_stacktrace trace;
struct perf_output_handle handle;
struct perf_sample_data data;
u64 nr;
- if (!event->pending_unwind_callback)
- return;
-
- if (unwind_user_faultable(&trace) < 0)
- goto out;
-
- /*
- * All accesses to the event must belong to the same implicit RCU
- * read-side critical section as the ->pending_unwind_callback reset.
- * See comment in perf_pending_unwind_sync().
- */
- guard(rcu)();
-
if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
- goto out;
+ return;
- nr = trace.nr + 1 ; /* '+1' == callchain_context */
+ nr = trace->nr + 1 ; /* '+1' == callchain_context */
deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
deferred_event.header.misc = PERF_RECORD_MISC_USER;
deferred_event.header.size = sizeof(deferred_event) + (nr * sizeof(u64));
deferred_event.nr = nr;
- deferred_event.cookie = unwind_user_get_cookie();
+ deferred_event.cookie = cookie;
perf_event_header__init_id(&deferred_event.header, &data, event);
if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
- goto out;
+ return;
perf_output_put(&handle, deferred_event);
perf_output_put(&handle, callchain_context);
- /* trace.entries[] are not guaranteed to be 64bit */
- for (int i = 0; i < trace.nr; i++) {
- u64 entry = trace.entries[i];
+ /* trace->entries[] are not guaranteed to be 64bit */
+ for (int i = 0; i < trace->nr; i++) {
+ u64 entry = trace->entries[i];
perf_output_put(&handle, entry);
}
perf_event__output_id_sample(event, &handle, &data);
perf_output_end(&handle);
+}
+
+/* Deferred unwinding callback for task specific events */
+static void perf_event_deferred_task(struct callback_head *work)
+{
+ struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
+ struct unwind_stacktrace trace;
+
+ if (!event->pending_unwind_callback)
+ return;
+
+ if (unwind_user_faultable(&trace) >= 0) {
+ u64 cookie = unwind_user_get_cookie();
+
+ /*
+ * All accesses to the event must belong to the same implicit RCU
+ * read-side critical section as the ->pending_unwind_callback reset.
+ * See comment in perf_pending_unwind_sync().
+ */
+ guard(rcu)();
+ perf_event_callchain_deferred(event, &trace, cookie);
+ }
-out:
event->pending_unwind_callback = 0;
local_dec(&event->ctx->nr_no_switch_fast);
rcuwait_wake_up(&event->pending_unwind_wait);
}
+/*
+ * Deferred unwinding callback for per CPU events.
+ * Note, the request for the deferred unwinding may have happened
+ * on a different CPU.
+ */
+static void perf_event_deferred_cpu(struct unwind_work *work,
+ struct unwind_stacktrace *trace, u64 cookie)
+{
+ struct perf_unwind_deferred *defer =
+ container_of(work, struct perf_unwind_deferred, unwind_work);
+ struct perf_unwind_cpu *cpu_events, *cpu_unwind;
+ struct perf_event *event;
+ int cpu;
+
+ guard(rcu)();
+ guard(preempt)();
+
+ cpu = smp_processor_id();
+ cpu_events = rcu_dereference(defer->cpu_events);
+ cpu_unwind = &cpu_events[cpu];
+
+ WRITE_ONCE(cpu_unwind->processing, 1);
+ /*
+ * Make sure the above is seen before the event->unwind_deferred
+ * is checked. This matches the mb() in rcuwait_wait_event() in
+ * perf_remove_unwind_deferred().
+ */
+ smp_mb();
+
+ list_for_each_entry_rcu(event, &cpu_unwind->list, unwind_list) {
+ /* If unwind_deferred is NULL the event is going away */
+ if (unlikely(!event->unwind_deferred))
+ continue;
+ perf_event_callchain_deferred(event, trace, cookie);
+ /* Only the first CPU event gets the trace */
+ break;
+ }
+
+ /*
+ * The perf_event_callchain_deferred() must finish before setting
+ * cpu_unwind->processing to zero. This is also to synchronize
+ * with the rcuwait in perf_remove_unwind_deferred().
+ */
+ smp_mb();
+ WRITE_ONCE(cpu_unwind->processing, 0);
+ rcuwait_wake_up(&cpu_unwind->pending_unwind_wait);
+}
+
static void perf_free_addr_filters(struct perf_event *event);
/* vs perf_event_alloc() error */
@@ -8284,6 +8523,17 @@ static u64 perf_get_page_size(unsigned long addr)
static struct perf_callchain_entry __empty_callchain = { .nr = 0, };
+
+static int deferred_unwind_request(struct perf_unwind_deferred *defer,
+ u64 *defer_cookie)
+{
+ /*
+ * Returns 0 for queued, 1 for already queued or executed,
+ * and negative on error.
+ */
+ return unwind_deferred_request(&defer->unwind_work, defer_cookie);
+}
+
/*
* Returns:
* > 0 : if already queued.
@@ -8293,17 +8543,22 @@ static struct perf_callchain_entry __empty_callchain = { .nr = 0, };
static int deferred_request(struct perf_event *event, u64 *defer_cookie)
{
struct callback_head *work = &event->pending_unwind_work;
+ struct perf_unwind_deferred *defer;
int pending;
int ret;
- /* Only defer for task events */
- if (!event->ctx->task)
- return -EINVAL;
-
if ((current->flags & (PF_KTHREAD | PF_USER_WORKER)) ||
!user_mode(task_pt_regs(current)))
return -EINVAL;
+ defer = READ_ONCE(event->unwind_deferred);
+ if (defer)
+ return deferred_unwind_request(defer, defer_cookie);
+
+ /* Per CPU events should have had unwind_deferred set! */
+ if (WARN_ON_ONCE(!event->ctx->task))
+ return -EINVAL;
+
guard(irqsave)();
*defer_cookie = unwind_user_get_cookie();
@@ -13197,13 +13452,20 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
}
}
+ /* Setup unwind deferring for per CPU events */
+ if (event->attr.defer_callchain && !task) {
+ err = perf_add_unwind_deferred(event);
+ if (err)
+ return ERR_PTR(err);
+ }
+
err = security_perf_event_alloc(event);
if (err)
return ERR_PTR(err);
if (event->attr.defer_callchain)
init_task_work(&event->pending_unwind_work,
- perf_event_callchain_deferred);
+ perf_event_deferred_task);
/* symmetric to unaccount_event() in _free_event() */
account_event(event);
--
2.50.1
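To make the group_leader based mapping above concrete, here is a hedged
user-space sketch (not part of the patch): a tool opening one per CPU event
for each CPU from a single process. Every event is then created with the
same current->group_leader->pid on the kernel side and maps to a single
perf_unwind_deferred descriptor. The defer_callchain bit is assumed to be
the attr flag added earlier in this series.

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_deferred_cpu_event(int cpu)
{
	struct perf_event_attr attr = {
		.type		= PERF_TYPE_SOFTWARE,
		.config		= PERF_COUNT_SW_CPU_CLOCK,
		.size		= sizeof(attr),
		.sample_period	= 100000,
		.sample_type	= PERF_SAMPLE_CALLCHAIN,
		.defer_callchain = 1,	/* assumed UAPI bit from this series */
	};

	/* pid == -1, cpu >= 0: a per CPU event owned by this tool */
	return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
}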
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [RESEND][PATCH v15 0/4] perf: Support the deferred unwinding infrastructure
2025-09-08 17:14 [RESEND][PATCH v15 0/4] perf: Support the deferred unwinding infrastructure Steven Rostedt
` (3 preceding siblings ...)
2025-09-08 17:14 ` [RESEND][PATCH v15 4/4] perf: Support deferred user callchains for per CPU events Steven Rostedt
@ 2025-09-08 17:21 ` Steven Rostedt
2025-09-16 14:41 ` Steven Rostedt
2025-09-18 11:46 ` Peter Zijlstra
5 siblings, 1 reply; 25+ messages in thread
From: Steven Rostedt @ 2025-09-08 17:21 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Jens Remus, Linus Torvalds, Andrew Morton, Florian Weimer,
Sam James, Kees Cook, Carlos O'Donell
Peter, can you take a look at these patches please. I believe you're the
only one that really maintains this code today.
-- Steve
On Mon, 08 Sep 2025 13:14:12 -0400
Steven Rostedt <rostedt@kernel.org> wrote:
> [
> This is simply a resend of version 15 of this patch series
> but with only the kernel changes. I'm separating out the user space
> changes to their own series.
> The original v15 is here:
> https://lore.kernel.org/linux-trace-kernel/20250825180638.877627656@kernel.org/
> ]
>
> This patch set is based off of perf/core of the tip tree:
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
>
> To run this series, you can checkout this repo that has this series as well as the above:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git unwind/perf-test
>
> This series implements the perf interface to use deferred user space stack
> tracing.
>
> Patch 1 adds a new API interface to the user unwinder logic to allow perf to
> get the current context cookie for its task event tracing. Perf's task event
> tracing maps a single task per perf event buffer and it follows the task
> around, so it only needs to implement its own task_work to do the deferred
> stack trace. Because dropped events can still leave it not knowing which
> user stack trace belongs to which kernel stack, having a cookie that
> uniquely identifies each user space stack trace makes it possible to know
> which kernel stack to append it to.
>
> Patch 2 adds the per task deferred stack traces to perf. It adds a new event
> type called PERF_RECORD_CALLCHAIN_DEFERRED that is recorded when a task is
> about to go back to user space and happens in a location that pages may be
> faulted in. It also adds a new callchain context called PERF_CONTEXT_USER_DEFERRED
> that is used as a place holder in a kernel callchain to append the deferred
> user space stack trace to.
>
> Patch 3 adds the user stack trace context cookie in the kernel callchain right
> after the PERF_CONTEXT_USER_DEFERRED context so that the user space side can
> map the request to the deferred user space stack trace.
>
> Patch 4 adds support for the per CPU perf events that will allow the kernel to
> associate each of the per CPU perf event buffers to a single application. This
> is needed so that when a request for a deferred stack trace happens on a task
> that then migrates to another CPU, it will know which CPU buffer to use to
> record the stack trace on. It is possible to have more than one perf user tool
> running and a request made by one perf tool should have the deferred trace go
> to the same perf tool's perf CPU event buffer. A global list of all the
> descriptors representing each perf tool that is using deferred stack tracing
> is created to manage this.
>
>
> Josh Poimboeuf (1):
> perf: Support deferred user callchains
>
> Steven Rostedt (3):
> unwind deferred: Add unwind_user_get_cookie() API
> perf: Have the deferred request record the user context cookie
> perf: Support deferred user callchains for per CPU events
>
> ----
> include/linux/perf_event.h | 11 +-
> include/linux/unwind_deferred.h | 5 +
> include/uapi/linux/perf_event.h | 25 +-
> kernel/bpf/stackmap.c | 4 +-
> kernel/events/callchain.c | 14 +-
> kernel/events/core.c | 421 +++++++++++++++++++++++++++++++++-
> kernel/unwind/deferred.c | 21 ++
> tools/include/uapi/linux/perf_event.h | 25 +-
> 8 files changed, 518 insertions(+), 8 deletions(-)
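The cookie matching described for patch 3 can be sketched from the consumer
side. This is a hedged illustration only: PERF_CONTEXT_USER_DEFERRED is
assumed to be the callchain context value added by this series' UAPI
change, and the actual splicing of the deferred ips[] is left to the tool.

#include <stdint.h>
#include <linux/perf_event.h>	/* PERF_CONTEXT_USER_DEFERRED (added by this series) */

/*
 * Scan a sample's callchain for the PERF_CONTEXT_USER_DEFERRED marker and
 * return the cookie stored right after it, or 0 when the sample has no
 * deferred user part. The tool would stash the sample under this cookie
 * and append the ips[] of the matching PERF_RECORD_CALLCHAIN_DEFERRED
 * record once it arrives.
 */
static uint64_t sample_deferred_cookie(const uint64_t *chain, uint64_t nr)
{
	for (uint64_t i = 0; i + 1 < nr; i++) {
		if (chain[i] == PERF_CONTEXT_USER_DEFERRED)
			return chain[i + 1];
	}
	return 0;
}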
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RESEND][PATCH v15 0/4] perf: Support the deferred unwinding infrastructure
2025-09-08 17:21 ` [RESEND][PATCH v15 0/4] perf: Support the deferred unwinding infrastructure Steven Rostedt
@ 2025-09-16 14:41 ` Steven Rostedt
0 siblings, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2025-09-16 14:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Jens Remus, Linus Torvalds, Andrew Morton, Florian Weimer,
Sam James, Kees Cook, Carlos O'Donell
Peter,
It's been over 3 weeks since the original was sent. And last week I
broke it up to only hold the kernel changes. Can you please take a look at
it?
I have updated the user space side with Namhyung Kim's updates:
https://lore.kernel.org/all/20250908175319.841517121@kernel.org/
Also, the two patches to enable deferred unwinding in x86 have been ignored
for almost three weeks as well:
https://lore.kernel.org/linux-trace-kernel/20250827193644.527334838@kernel.org/
-- Steve
On Mon, 8 Sep 2025 13:21:06 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> Peter, can you take a look at these patches please. I believe you're the
> only one that really maintains this code today.
>
> -- Steve
>
>
> On Mon, 08 Sep 2025 13:14:12 -0400
> Steven Rostedt <rostedt@kernel.org> wrote:
>
> > [
> > This is simply a resend of version 15 of this patch series
> > but with only the kernel changes. I'm separating out the user space
> > changes to their own series.
> > The original v15 is here:
> > https://lore.kernel.org/linux-trace-kernel/20250825180638.877627656@kernel.org/
> > ]
> >
> > This patch set is based off of perf/core of the tip tree:
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
> >
> > To run this series, you can checkout this repo that has this series as well as the above:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git unwind/perf-test
> >
> > This series implements the perf interface to use deferred user space stack
> > tracing.
> >
> > Patch 1 adds a new API interface to the user unwinder logic to allow perf to
> > get the current context cookie for its task event tracing. Perf's task event
> > tracing maps a single task per perf event buffer and it follows the task
> > around, so it only needs to implement its own task_work to do the deferred
> > stack trace. Because dropped events can still leave it not knowing which
> > user stack trace belongs to which kernel stack, having a cookie that
> > uniquely identifies each user space stack trace makes it possible to know
> > which kernel stack to append it to.
> >
> > Patch 2 adds the per task deferred stack traces to perf. It adds a new event
> > type called PERF_RECORD_CALLCHAIN_DEFERRED that is recorded when a task is
> > about to go back to user space and happens in a location that pages may be
> > faulted in. It also adds a new callchain context called PERF_CONTEXT_USER_DEFERRED
> > that is used as a place holder in a kernel callchain to append the deferred
> > user space stack trace to.
> >
> > Patch 3 adds the user stack trace context cookie in the kernel callchain right
> > after the PERF_CONTEXT_USER_DEFERRED context so that the user space side can
> > map the request to the deferred user space stack trace.
> >
> > Patch 4 adds support for the per CPU perf events that will allow the kernel to
> > associate each of the per CPU perf event buffers to a single application. This
> > is needed so that when a request for a deferred stack trace happens on a task
> > that then migrates to another CPU, it will know which CPU buffer to use to
> > record the stack trace on. It is possible to have more than one perf user tool
> > running and a request made by one perf tool should have the deferred trace go
> > to the same perf tool's perf CPU event buffer. A global list of all the
> > descriptors representing each perf tool that is using deferred stack tracing
> > is created to manage this.
> >
> >
> > Josh Poimboeuf (1):
> > perf: Support deferred user callchains
> >
> > Steven Rostedt (3):
> > unwind deferred: Add unwind_user_get_cookie() API
> > perf: Have the deferred request record the user context cookie
> > perf: Support deferred user callchains for per CPU events
> >
> > ----
> > include/linux/perf_event.h | 11 +-
> > include/linux/unwind_deferred.h | 5 +
> > include/uapi/linux/perf_event.h | 25 +-
> > kernel/bpf/stackmap.c | 4 +-
> > kernel/events/callchain.c | 14 +-
> > kernel/events/core.c | 421 +++++++++++++++++++++++++++++++++-
> > kernel/unwind/deferred.c | 21 ++
> > tools/include/uapi/linux/perf_event.h | 25 +-
> > 8 files changed, 518 insertions(+), 8 deletions(-)
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RESEND][PATCH v15 0/4] perf: Support the deferred unwinding infrastructure
2025-09-08 17:14 [RESEND][PATCH v15 0/4] perf: Support the deferred unwinding infrastructure Steven Rostedt
` (4 preceding siblings ...)
2025-09-08 17:21 ` [RESEND][PATCH v15 0/4] perf: Support the deferred unwinding infrastructure Steven Rostedt
@ 2025-09-18 11:46 ` Peter Zijlstra
2025-09-18 15:18 ` Steven Rostedt
5 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2025-09-18 11:46 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Jens Remus, Linus Torvalds, Andrew Morton, Florian Weimer,
Sam James, Kees Cook, Carlos O'Donell
So I started looking at this, but given I've never seen the deferred unwind
bits that got merged, I have to look at that first.
Headers want something like so.. Let me read the rest.
---
include/linux/unwind_deferred.h | 38 +++++++++++++++++++----------------
include/linux/unwind_deferred_types.h | 2 ++
2 files changed, 23 insertions(+), 17 deletions(-)
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 26122d00708a..5d51a3f2f8ec 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -8,7 +8,8 @@
struct unwind_work;
-typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stacktrace *trace, u64 cookie);
+typedef void (*unwind_callback_t)(struct unwind_work *work,
+ struct unwind_stacktrace *trace, u64 cookie);
struct unwind_work {
struct list_head list;
@@ -44,22 +45,22 @@ void unwind_deferred_task_exit(struct task_struct *task);
static __always_inline void unwind_reset_info(void)
{
struct unwind_task_info *info = &current->unwind_info;
- unsigned long bits;
+ unsigned long bits = info->unwind_mask;
/* Was there any unwinding? */
- if (unlikely(info->unwind_mask)) {
- bits = info->unwind_mask;
- do {
- /* Is a task_work going to run again before going back */
- if (bits & UNWIND_PENDING)
- return;
- } while (!try_cmpxchg(&info->unwind_mask, &bits, 0UL));
- current->unwind_info.id.id = 0;
+ if (likely(!bits))
+ return;
- if (unlikely(info->cache)) {
- info->cache->nr_entries = 0;
- info->cache->unwind_completed = 0;
- }
+ do {
+ /* Is a task_work going to run again before going back */
+ if (bits & UNWIND_PENDING)
+ return;
+ } while (!try_cmpxchg(&info->unwind_mask, &bits, 0UL));
+ current->unwind_info.id.id = 0;
+
+ if (unlikely(info->cache)) {
+ info->cache->nr_entries = 0;
+ info->cache->unwind_completed = 0;
}
}
@@ -68,9 +69,12 @@ static __always_inline void unwind_reset_info(void)
static inline void unwind_task_init(struct task_struct *task) {}
static inline void unwind_task_free(struct task_struct *task) {}
-static inline int unwind_user_faultable(struct unwind_stacktrace *trace) { return -ENOSYS; }
-static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func) { return -ENOSYS; }
-static inline int unwind_deferred_request(struct unwind_work *work, u64 *timestamp) { return -ENOSYS; }
+static inline int unwind_user_faultable(struct unwind_stacktrace *trace)
+{ return -ENOSYS; }
+static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
+{ return -ENOSYS; }
+static inline int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
+{ return -ENOSYS; }
static inline void unwind_deferred_cancel(struct unwind_work *work) {}
static inline void unwind_deferred_task_exit(struct task_struct *task) {}
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 33b62ac25c86..29452ff49859 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -2,6 +2,8 @@
#ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+#include <linux/types.h>
+
struct unwind_cache {
unsigned long unwind_completed;
unsigned int nr_entries;
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [RESEND][PATCH v15 0/4] perf: Support the deferred unwinding infrastructure
2025-09-18 11:46 ` Peter Zijlstra
@ 2025-09-18 15:18 ` Steven Rostedt
2025-09-18 17:24 ` Peter Zijlstra
0 siblings, 1 reply; 25+ messages in thread
From: Steven Rostedt @ 2025-09-18 15:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Florian Weimer, Sam James, Kees Cook, Carlos O'Donell
On Thu, 18 Sep 2025 13:46:10 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> So I started looking at this, but given I never seen the deferred unwind
> bits that got merged I have to look at that first.
>
> Headers want something like so.. Let me read the rest.
>
> ---
> include/linux/unwind_deferred.h | 38 +++++++++++++++++++----------------
> include/linux/unwind_deferred_types.h | 2 ++
> 2 files changed, 23 insertions(+), 17 deletions(-)
Would you like to send a formal patch with this? I'd actually break it into
two patches. One to clean up the long lines, and the other to change the
logic.
-- Steve
>
> diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
> index 26122d00708a..5d51a3f2f8ec 100644
> --- a/include/linux/unwind_deferred.h
> +++ b/include/linux/unwind_deferred.h
> @@ -8,7 +8,8 @@
>
> struct unwind_work;
>
> -typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stacktrace *trace, u64 cookie);
> +typedef void (*unwind_callback_t)(struct unwind_work *work,
> + struct unwind_stacktrace *trace, u64 cookie);
>
> struct unwind_work {
> struct list_head list;
> @@ -44,22 +45,22 @@ void unwind_deferred_task_exit(struct task_struct *task);
> static __always_inline void unwind_reset_info(void)
> {
> struct unwind_task_info *info = &current->unwind_info;
> - unsigned long bits;
> + unsigned long bits = info->unwind_mask;
>
> /* Was there any unwinding? */
> - if (unlikely(info->unwind_mask)) {
> - bits = info->unwind_mask;
> - do {
> - /* Is a task_work going to run again before going back */
> - if (bits & UNWIND_PENDING)
> - return;
> - } while (!try_cmpxchg(&info->unwind_mask, &bits, 0UL));
> - current->unwind_info.id.id = 0;
> + if (likely(!bits))
> + return;
>
> - if (unlikely(info->cache)) {
> - info->cache->nr_entries = 0;
> - info->cache->unwind_completed = 0;
> - }
> + do {
> + /* Is a task_work going to run again before going back */
> + if (bits & UNWIND_PENDING)
> + return;
> + } while (!try_cmpxchg(&info->unwind_mask, &bits, 0UL));
> + current->unwind_info.id.id = 0;
> +
> + if (unlikely(info->cache)) {
> + info->cache->nr_entries = 0;
> + info->cache->unwind_completed = 0;
> }
> }
>
> @@ -68,9 +69,12 @@ static __always_inline void unwind_reset_info(void)
> static inline void unwind_task_init(struct task_struct *task) {}
> static inline void unwind_task_free(struct task_struct *task) {}
>
> -static inline int unwind_user_faultable(struct unwind_stacktrace *trace) { return -ENOSYS; }
> -static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func) { return -ENOSYS; }
> -static inline int unwind_deferred_request(struct unwind_work *work, u64 *timestamp) { return -ENOSYS; }
> +static inline int unwind_user_faultable(struct unwind_stacktrace *trace)
> +{ return -ENOSYS; }
> +static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
> +{ return -ENOSYS; }
> +static inline int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
> +{ return -ENOSYS; }
> static inline void unwind_deferred_cancel(struct unwind_work *work) {}
>
> static inline void unwind_deferred_task_exit(struct task_struct *task) {}
> diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
> index 33b62ac25c86..29452ff49859 100644
> --- a/include/linux/unwind_deferred_types.h
> +++ b/include/linux/unwind_deferred_types.h
> @@ -2,6 +2,8 @@
> #ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
> #define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
>
> +#include <linux/types.h>
> +
> struct unwind_cache {
> unsigned long unwind_completed;
> unsigned int nr_entries;
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RESEND][PATCH v15 0/4] perf: Support the deferred unwinding infrastructure
2025-09-18 15:18 ` Steven Rostedt
@ 2025-09-18 17:24 ` Peter Zijlstra
2025-09-18 17:32 ` Peter Zijlstra
0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2025-09-18 17:24 UTC (permalink / raw)
To: Steven Rostedt
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Florian Weimer, Sam James, Kees Cook, Carlos O'Donell
On Thu, Sep 18, 2025 at 11:18:53AM -0400, Steven Rostedt wrote:
> On Thu, 18 Sep 2025 13:46:10 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > So I started looking at this, but given I never seen the deferred unwind
> > bits that got merged I have to look at that first.
> >
> > Headers want something like so.. Let me read the rest.
> >
> > ---
> > include/linux/unwind_deferred.h | 38 +++++++++++++++++++----------------
> > include/linux/unwind_deferred_types.h | 2 ++
> > 2 files changed, 23 insertions(+), 17 deletions(-)
>
> Would you like to send a formal patch with this? I'd actually break it into
> two patches. One to clean up the long lines, and the other to change the
> logic.
Sure, I'll collect the lot while I go through it and whip something up
when I'm done. For now, I'll just shoot a few questions your way.
So we have:
do_syscall_64()
... do stuff ...
syscall_exit_to_user_mode(regs)
syscall_exit_to_user_mode_work(regs)
syscall_exit_work()
exit_to_user_mode_prepare()
exit_to_user_mode_loop()
resume_user_mode_work()
task_work_run()
exit_to_user_mode()
unwind_reset_info();
user_enter_irqoff();
arch_exit_to_user_mode();
lockdep_hardirqs_on();
SYSRET/IRET
and
DEFINE_IDTENTRY*()
irqentry_enter();
... stuff ...
irqentry_exit()
irqentry_exit_to_user_mode()
exit_to_user_mode_prepare()
exit_to_user_mode_loop();
resume_user_mode_work()
task_work_run()
exit_to_user_mode()
unwind_reset_info();
...
IRET
Now, task_work_run() is in the exit_to_user_mode_loop() which is notably
*before* exit_to_user_mode() which does the unwind_reset_info().
What happens if we get an NMI requesting an unwind after
unwind_reset_info() while still very much being in the kernel on the way
out?
What is the purpose of unwind_deferred_task_exit()? This is called from
do_exit(), only slightly before it does exit_task_work(), which runs all
pending task_work. Is there something that justifies the manual run and
cancel instead of just leaving it to sit in task_work and having it run
naturally? If so, that most certainly deserves a comment.
A similar question for unwind_task_free(): where exactly is it relevant?
Where does it acquire a task_work that is not otherwise already run on
exit?
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RESEND][PATCH v15 0/4] perf: Support the deferred unwinding infrastructure
2025-09-18 17:24 ` Peter Zijlstra
@ 2025-09-18 17:32 ` Peter Zijlstra
2025-09-18 19:10 ` Steven Rostedt
0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2025-09-18 17:32 UTC (permalink / raw)
To: Steven Rostedt
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Florian Weimer, Sam James, Kees Cook, Carlos O'Donell
On Thu, Sep 18, 2025 at 07:24:14PM +0200, Peter Zijlstra wrote:
> So we have:
>
> do_syscall_64()
> ... do stuff ...
> syscall_exit_to_user_mode(regs)
> syscall_exit_to_user_mode_work(regs)
> syscall_exit_work()
> exit_to_user_mode_prepare()
> exit_to_user_mode_loop()
> resume_user_mode_work()
> task_work_run()
> exit_to_user_mode()
> unwind_reset_info();
> user_enter_irqoff();
> arch_exit_to_user_mode();
> lockdep_hardirqs_on();
> SYSRET/IRET
>
>
> and
>
> DEFINE_IDTENTRY*()
> irqentry_enter();
> ... stuff ...
> irqentry_exit()
> irqentry_exit_to_user_mode()
> exit_to_user_mode_prepare()
> exit_to_user_mode_loop();
> resume_user_mode_work()
> task_work_run()
> exit_to_user_mode()
> unwind_reset_info();
> ...
> IRET
>
> Now, task_work_run() is in the exit_to_user_mode_loop() which is notably
> *before* exit_to_user_mode() which does the unwind_reset_info().
>
> What happens if we get an NMI requesting an unwind after
> unwind_reset_info() while still very much being in the kernel on the way
> out?
AFAICT it will try and do a task_work_add(TWA_RESUME) from NMI context,
and this will fail horribly.
If you do something like:
twa_mode = in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME;
task_work_add(foo, twa_mode);
it might actually work.
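A minimal sketch of how that suggestion could be wired into the point where
the deferred unwinder queues its task_work. The wrapper name and the
callback_head argument are illustrative assumptions, not the merged
kernel/unwind/deferred.c code:

#include <linux/task_work.h>
#include <linux/preempt.h>
#include <linux/sched.h>

static int queue_deferred_unwind(struct callback_head *work)
{
	/* NMI context cannot kick the resume path directly */
	enum task_work_notify_mode mode =
		in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME;

	return task_work_add(current, work, mode);
}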
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RESEND][PATCH v15 0/4] perf: Support the deferred unwinding infrastructure
2025-09-18 17:32 ` Peter Zijlstra
@ 2025-09-18 19:10 ` Steven Rostedt
2025-09-19 23:34 ` Josh Poimboeuf
0 siblings, 1 reply; 25+ messages in thread
From: Steven Rostedt @ 2025-09-18 19:10 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Florian Weimer, Sam James, Kees Cook, Carlos O'Donell
On Thu, 18 Sep 2025 19:32:20 +0200
Peter Zijlstra <peterz@infradead.org> wrote:
> > Now, task_work_run() is in the exit_to_user_mode_loop() which is notably
> > *before* exit_to_user_mode() which does the unwind_reset_info().
> >
> > What happens if we get an NMI requesting an unwind after
> > unwind_reset_info() while still very much being in the kernel on the way
> > out?
>
> AFAICT it will try and do a task_work_add(TWA_RESUME) from NMI context,
> and this will fail horribly.
>
> If you do something like:
>
> twa_mode = in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME;
> task_work_add(foo, twa_mode);
>
> it might actually work.
Ah, the comment for TWA_RESUME didn't express this restriction.
That does look like it would work the way I expected task_work to
handle this case.
-- Steve
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RESEND][PATCH v15 0/4] perf: Support the deferred unwinding infrastructure
2025-09-18 19:10 ` Steven Rostedt
@ 2025-09-19 23:34 ` Josh Poimboeuf
2025-09-21 23:33 ` Steven Rostedt
2025-09-22 7:23 ` Peter Zijlstra
0 siblings, 2 replies; 25+ messages in thread
From: Josh Poimboeuf @ 2025-09-19 23:34 UTC (permalink / raw)
To: Steven Rostedt
Cc: Peter Zijlstra, Steven Rostedt, linux-kernel, linux-trace-kernel,
bpf, x86, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Florian Weimer, Sam James, Kees Cook, Carlos O'Donell
On Thu, Sep 18, 2025 at 03:10:18PM -0400, Steven Rostedt wrote:
> On Thu, 18 Sep 2025 19:32:20 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > Now, task_work_run() is in the exit_to_user_mode_loop() which is notably
> > > *before* exit_to_user_mode() which does the unwind_reset_info().
> > >
> > > What happens if we get an NMI requesting an unwind after
> > > unwind_reset_info() while still very much being in the kernel on the way
> > > out?
> >
> > AFAICT it will try and do a task_work_add(TWA_RESUME) from NMI context,
> > and this will fail horribly.
> >
> > If you do something like:
> >
> > twa_mode = in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME;
> > task_work_add(foo, twa_mode);
> >
> > it might actually work.
>
> Ah, the comment for TWA_RESUME didn't express this restriction.
>
> That does look like it would work the way I expected task_work to
> handle this case.
BTW, I remember Peter had a fix for TWA_NMI_CURRENT, I guess it got lost
in the shuffle or did something else happen in the meantime?
https://lore.kernel.org/20250122124228.GO7145@noisy.programming.kicks-ass.net
--
Josh
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RESEND][PATCH v15 0/4] perf: Support the deferred unwinding infrastructure
2025-09-19 23:34 ` Josh Poimboeuf
@ 2025-09-21 23:33 ` Steven Rostedt
2025-09-22 7:23 ` Peter Zijlstra
1 sibling, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2025-09-21 23:33 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Peter Zijlstra, Steven Rostedt, linux-kernel, linux-trace-kernel,
bpf, x86, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Florian Weimer, Sam James, Kees Cook, Carlos O'Donell
On Fri, 19 Sep 2025 16:34:02 -0700
Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > That does look like it would work the way I expected task_work to
> > handle this case.
>
> BTW, I remember Peter had a fix for TWA_NMI_CURRENT, I guess it got lost
> in the shuffle or did something else happen in the meantime?
>
> https://lore.kernel.org/20250122124228.GO7145@noisy.programming.kicks-ass.net
Yeah, it did get lost in the shuffle. I took the code from your git
tree, but missed the comments that were made to the patches you sent to
the mailing list. :-p
Thanks for pointing this out!
-- Steve
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RESEND][PATCH v15 0/4] perf: Support the deferred unwinding infrastructure
2025-09-19 23:34 ` Josh Poimboeuf
2025-09-21 23:33 ` Steven Rostedt
@ 2025-09-22 7:23 ` Peter Zijlstra
2025-09-22 14:17 ` Peter Zijlstra
1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2025-09-22 7:23 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Steven Rostedt, Steven Rostedt, linux-kernel, linux-trace-kernel,
bpf, x86, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Florian Weimer, Sam James, Kees Cook, Carlos O'Donell
On Fri, Sep 19, 2025 at 04:34:02PM -0700, Josh Poimboeuf wrote:
> On Thu, Sep 18, 2025 at 03:10:18PM -0400, Steven Rostedt wrote:
> > On Thu, 18 Sep 2025 19:32:20 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > > Now, task_work_run() is in the exit_to_user_mode_loop() which is notably
> > > > *before* exit_to_user_mode() which does the unwind_reset_info().
> > > >
> > > > What happens if we get an NMI requesting an unwind after
> > > > unwind_reset_info() while still very much being in the kernel on the way
> > > > out?
> > >
> > > AFAICT it will try and do a task_work_add(TWA_RESUME) from NMI context,
> > > and this will fail horribly.
> > >
> > > If you do something like:
> > >
> > > twa_mode = in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME;
> > > task_work_add(foo, twa_mode);
> > >
> > > it might actually work.
> >
> > Ah, the comment for TWA_RESUME didn't express this restriction.
> >
> > That does look like it would work the way I expected task_work to
> > handle this case.
>
> BTW, I remember Peter had a fix for TWA_NMI_CURRENT, I guess it got lost
> in the shuffle or did something else happen in the meantime?
>
> https://lore.kernel.org/20250122124228.GO7145@noisy.programming.kicks-ass.net
Oh, yeah, I had completely forgotten about all that :-)
I'll go stick it in the pile. Thanks!
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RESEND][PATCH v15 0/4] perf: Support the deferred unwinding infrastructure
2025-09-22 7:23 ` Peter Zijlstra
@ 2025-09-22 14:17 ` Peter Zijlstra
0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2025-09-22 14:17 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Steven Rostedt, Steven Rostedt, linux-kernel, linux-trace-kernel,
bpf, x86, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Florian Weimer, Sam James, Kees Cook, Carlos O'Donell
On Mon, Sep 22, 2025 at 09:23:07AM +0200, Peter Zijlstra wrote:
> I'll go stick it in the pile. Thanks!
git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git unwind/cleanup
I'll let the robot have a chew and post tomorrow or so.
^ permalink raw reply [flat|nested] 25+ messages in thread