* [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
@ 2025-10-07 21:40 Steven Rostedt
  2025-10-07 21:40 ` [PATCH v16 1/4] unwind: Add interface to allow tracing a single task Steven Rostedt
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-10-07 21:40 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
	Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

This is based on top of tip/perf/core commit: 6d48436560e91be85

Then I added the patches from Peter Zijlstra:

    https://lore.kernel.org/all/20250924075948.579302904@infradead.org/

This series implements the perf interface to use deferred user space stack
tracing.

The patches for the user space side should still work with this series:

  https://lore.kernel.org/linux-trace-kernel/20250908175319.841517121@kernel.org

Patch 1 updates the deferred unwinding infrastructure. It adds a new
function called: unwind_deferred_task_init(). This is used when a tracer
(perf) only needs to follow a single task. The descriptor returned can
be used the same way as the descriptor returned by unwind_deferred_init(),
but the tracer must only use it on one task at a time.

Patch 2 adds the per task deferred stack traces to perf. It adds a new
event type called PERF_RECORD_CALLCHAIN_DEFERRED that is recorded when a
task is about to go back to user space, at a location where pages may be
faulted in. It also adds a new callchain context called
PERF_CONTEXT_USER_DEFERRED that is used as a placeholder in a kernel
callchain, marking where the deferred user space stack trace is to be
appended.

Patch 3 adds the user context cookie to the kernel callchain right after
the PERF_CONTEXT_USER_DEFERRED entry so that the user space side can map
the request to the deferred user space stack trace.
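
As an illustration of patches 2 and 3, a sample that requests a deferred
user unwind carries a callchain ending with the placeholder and the
cookie:

  PERF_RECORD_SAMPLE
    callchain: PERF_CONTEXT_KERNEL, <kernel ips ...>,
               PERF_CONTEXT_USER_DEFERRED, <cookie>

The user portion arrives later, when the task returns to user space,
roughly as:

  PERF_RECORD_CALLCHAIN_DEFERRED
    cookie: <same cookie as above>
    nr:     number of user entries + 1
    ips[]:  PERF_CONTEXT_USER, <user ips ...>

The user space side (the series linked above) replaces the
PERF_CONTEXT_USER_DEFERRED entry and the cookie with the ips[] of the
deferred record whose cookie matches.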

Patch 4 adds support for per CPU perf events, which allows the kernel to
associate each of the per CPU perf event buffers with a single
application. This is needed so that when a deferred stack trace is
requested on a task that then migrates to another CPU, the kernel knows
which CPU buffer to record the stack trace in. More than one perf user
tool may be running, and a request made by one perf tool must have the
deferred trace go to that same tool's per CPU event buffer. A global list
of the descriptors representing each perf tool that is using deferred
stack tracing is created to manage this.

Changes since v15: https://lore.kernel.org/linux-trace-kernel/20250825180638.877627656@kernel.org/

- The main update was that I moved the code that does single task
  deferred stack tracing into the unwind code. That allowed it to reuse
  the code for tracing all tasks, and simplified the perf code in doing
  so.

  The first patch updates the unwind deferred code to have this
  infrastructure. It only added a new function:
    unwind_deferred_task_init()
  This is the same as unwind_deferred_init(), but it is used when the
  tracer will only trace a single task. The descriptor returned has its
  own task_work callback, and any number of these works can be used,
  unlike the limited set that the "all task" deferred unwinding allows.

- The new code also removed the need to expose the generation of the
  cookie.

Josh Poimboeuf (1):
      perf: Support deferred user callchains

Steven Rostedt (3):
      unwind: Add interface to allow tracing a single task
      perf: Have the deferred request record the user context cookie
      perf: Support deferred user callchains for per CPU events

----
 include/linux/perf_event.h            |   9 +-
 include/linux/unwind_deferred.h       |  15 ++
 include/uapi/linux/perf_event.h       |  25 ++-
 kernel/bpf/stackmap.c                 |   4 +-
 kernel/events/callchain.c             |  14 +-
 kernel/events/core.c                  | 362 +++++++++++++++++++++++++++++++++-
 kernel/unwind/deferred.c              | 283 ++++++++++++++++++++++----
 tools/include/uapi/linux/perf_event.h |  25 ++-
 8 files changed, 686 insertions(+), 51 deletions(-)


* [PATCH v16 1/4] unwind: Add interface to allow tracing a single task
  2025-10-07 21:40 [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure Steven Rostedt
@ 2025-10-07 21:40 ` Steven Rostedt
  2025-10-07 21:40 ` [PATCH v16 2/4] perf: Support deferred user callchains Steven Rostedt
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-10-07 21:40 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
	Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

From: Steven Rostedt <rostedt@goodmis.org>

If a tracer (namely perf) is only tracing a single task, it doesn't need
the full functionality of the deferred stacktrace infrastructure. That
infrastructure only supports a limited number of users, as it needs to
handle multiple tracers that can trace multiple tasks at the same time,
creating a many-to-many relationship.

But a tracer that is tracing a single task creates a one-to-many
relationship (a tracer tracing a single task, and a task that can have
several tracers tracing it). This allows data to be allocated at
initialization time so that the tracer can use its own task_work data
structures to attach to the task.

Add a new interface called unwind_deferred_task_init() that works like
unwind_deferred_init(), but is used when the tracer will only ever trace
a single task at a time.

The unwind_work descriptor that is initialized during this init function
now has a struct callback_head field that is used to attach itself to a
task_work. The work->bit for this task is set to the UNWIND_PENDING_BIT
(but defined as UNWIND_TASK) to differentiate it from unwind_works that
are tracing any task, as their work->bit will be one of the allocated
bits in the unwind_mask.

The rest of the calls are the same. That is, the unwind_deferred_request()
and unwind_deferred_cancel() are called on this, and these functions will
know that this unwind_work descriptor traces a single task. Even the
callback function works the same.

If a tracer tries to use this unwind_work on multiple tasks at the same
time, it will simply fail to attach to the second task while a deferred
stacktrace is still pending on the first, and a WARN_ON is produced.
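
A rough sketch of the intended usage from a tracer's point of view (this
is not part of the patch; my_tracer, my_callback() and my_tracer_record()
are made-up names):

  #include <linux/unwind_deferred.h>

  struct my_tracer {
          struct unwind_work      unwind_work;
          /* other per-tracer state */
  };

  /* Called when the traced task is about to return to user space */
  static void my_callback(struct unwind_work *work,
                          struct unwind_stacktrace *trace, u64 cookie)
  {
          struct my_tracer *t = container_of(work, struct my_tracer, unwind_work);

          /* Pages may be faulted in here */
          my_tracer_record(t, trace->entries, trace->nr, cookie);
  }

  /* Attach time (only a single task will ever be traced): */
  static int my_tracer_attach(struct my_tracer *t)
  {
          return unwind_deferred_task_init(&t->unwind_work, my_callback);
  }

  /* From interrupt/NMI context, while the traced task is current: */
  static int my_tracer_sample(struct my_tracer *t)
  {
          u64 cookie;

          return unwind_deferred_request(&t->unwind_work, &cookie);
  }

  /* Detach time: */
  static void my_tracer_detach(struct my_tracer *t)
  {
          unwind_deferred_cancel(&t->unwind_work);
  }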

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/unwind_deferred.h |  15 ++
 kernel/unwind/deferred.c        | 283 +++++++++++++++++++++++++++-----
 2 files changed, 255 insertions(+), 43 deletions(-)

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index f4743c8cff4c..6f0f04ba538d 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -2,6 +2,7 @@
 #ifndef _LINUX_UNWIND_USER_DEFERRED_H
 #define _LINUX_UNWIND_USER_DEFERRED_H
 
+#include <linux/rcuwait.h>
 #include <linux/task_work.h>
 #include <linux/unwind_user.h>
 #include <linux/unwind_deferred_types.h>
@@ -15,6 +16,9 @@ typedef void (*unwind_callback_t)(struct unwind_work *work,
 struct unwind_work {
 	struct list_head		list;
 	unwind_callback_t		func;
+	struct callback_head 		work;
+	struct task_struct		*task;
+	struct rcuwait			wait;
 	int				bit;
 };
 
@@ -32,11 +36,22 @@ enum {
 	UNWIND_USED		= BIT(UNWIND_USED_BIT)
 };
 
+/*
+ * UNWIND_PENDING is set in the task's info->unwind_mask when
+ * a deferred unwind is requested on that task. If the unwind
+ * descriptor is used only to trace a specific task, its bit
+ * is the UNWIND_PENDING_BIT. This gets set as the work->bit
+ * and is used to distinguish unwind_work descriptors that trace
+ * a single task from those that trace all tasks.
+ */
+#define UNWIND_TASK	UNWIND_PENDING_BIT
+
 void unwind_task_init(struct task_struct *task);
 void unwind_task_free(struct task_struct *task);
 
 int unwind_user_faultable(struct unwind_stacktrace *trace);
 
+int unwind_deferred_task_init(struct unwind_work *work, unwind_callback_t func);
 int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
 int unwind_deferred_request(struct unwind_work *work, u64 *cookie);
 void unwind_deferred_cancel(struct unwind_work *work);
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index ceeeff562302..f34b60713a4b 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -44,6 +44,7 @@ static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
 /* Guards adding to or removing from the list of callbacks */
 static DEFINE_MUTEX(callback_mutex);
 static LIST_HEAD(callbacks);
+static LIST_HEAD(task_callbacks);
 
 #define RESERVED_BITS	(UNWIND_PENDING | UNWIND_USED)
 
@@ -155,12 +156,18 @@ static void process_unwind_deferred(struct task_struct *task)
 	unsigned long bits;
 	u64 cookie;
 
-	if (WARN_ON_ONCE(!unwind_pending(info)))
-		return;
-
 	/* Clear pending bit but make sure to have the current bits */
 	bits = atomic_long_fetch_andnot(UNWIND_PENDING,
 					&info->unwind_mask);
+
+	/* Remove the callbacks that were already completed */
+	if (info->cache)
+		bits &= ~(info->cache->unwind_completed);
+
+	/* If all callbacks have already been done, there's nothing to do */
+	if (!bits)
+		return;
+
 	/*
 	 * From here on out, the callback must always be called, even if it's
 	 * just an empty trace.
@@ -170,9 +177,6 @@ static void process_unwind_deferred(struct task_struct *task)
 
 	unwind_user_faultable(&trace);
 
-	if (info->cache)
-		bits &= ~(info->cache->unwind_completed);
-
 	cookie = info->id.id;
 
 	guard(srcu)(&unwind_srcu);
@@ -186,11 +190,95 @@ static void process_unwind_deferred(struct task_struct *task)
 	}
 }
 
-static void unwind_deferred_task_work(struct callback_head *head)
+/* Callback for an unwind work that traces all tasks */
+static void unwind_deferred_work(struct callback_head *head)
 {
 	process_unwind_deferred(current);
 }
 
+/* Get the trace for an unwind work that traces a single task */
+static void get_deferred_task_stacktrace(struct task_struct *task,
+					 struct unwind_stacktrace *trace,
+					 u64 *cookie, bool clear_pending)
+{
+	struct unwind_task_info *info = &task->unwind_info;
+
+	if (clear_pending)
+		atomic_long_andnot(UNWIND_PENDING, &info->unwind_mask);
+
+	trace->nr = 0;
+	trace->entries = NULL;
+
+	unwind_user_faultable(trace);
+
+	*cookie = info->id.id;
+}
+
+/* Callback for an unwind work that only traces this task */
+static void unwind_deferred_task_work(struct callback_head *head)
+{
+	struct unwind_work *work = container_of(head, struct unwind_work, work);
+	struct unwind_task_info *info = &current->unwind_info;
+	struct unwind_stacktrace trace;
+	u64 cookie;
+
+	guard(srcu)(&unwind_srcu);
+
+	/* Always clear the pending bit when this is called */
+	atomic_long_andnot(UNWIND_PENDING, &info->unwind_mask);
+
+	/* Is this work being canceled? */
+	if (unlikely(work->bit < 0))
+		work->task = NULL;
+
+	if (!work->task)
+		goto out;
+
+	/*
+	 * From here on out, the callback must always be called, even if it's
+	 * just an empty trace.
+	 */
+	get_deferred_task_stacktrace(current, &trace, &cookie, false);
+	work->func(work, &trace, cookie);
+	work->task = NULL;
+out:
+	/* Synchronize with cancel_unwind_task() */
+	rcuwait_wake_up(&work->wait);
+}
+
+/* Flush any pending work for an exiting task */
+static void process_unwind_tasks(struct task_struct *task)
+{
+	struct unwind_stacktrace trace;
+	struct unwind_work *work;
+	u64 cookie = 0;
+
+	guard(srcu)(&unwind_srcu);
+
+	/* The task is exiting, flush any pending per task unwind works */
+	list_for_each_entry_srcu(work, &task_callbacks, list,
+				 srcu_read_lock_held(&unwind_srcu)) {
+		if (work->task != task)
+			continue;
+
+		/* There may be waiters in cancel_unwind_task() */
+		if (work->bit < 0)
+			goto wakeup;
+
+		task_work_cancel(task, &work->work);
+
+		/* Only need to get the trace once */
+		if (!cookie)
+			get_deferred_task_stacktrace(task, &trace,
+						     &cookie, true);
+		work->func(work, &trace, cookie);
+wakeup:
+		work->task = NULL;
+		/* Synchronize with cancel_unwind_task() */
+		rcuwait_wake_up(&work->wait);
+	}
+}
+
 void unwind_deferred_task_exit(struct task_struct *task)
 {
 	struct unwind_task_info *info = &current->unwind_info;
@@ -199,10 +287,80 @@ void unwind_deferred_task_exit(struct task_struct *task)
 		return;
 
 	process_unwind_deferred(task);
+	process_unwind_tasks(task);
 
 	task_work_cancel(task, &info->work);
 }
 
+static int queue_unwind_task(struct unwind_work *work, int twa_mode,
+			     struct unwind_task_info *info)
+{
+	struct task_struct *task = READ_ONCE(work->task);
+	int ret;
+
+	if (task) {
+		/* Did the tracer break its contract? */
+		WARN_ON_ONCE(task != current);
+		return 1;
+	}
+
+	if (!try_cmpxchg(&work->task, &task, current))
+		return 1;
+
+	/* The work has been claimed, now schedule it. */
+	ret = task_work_add(current, &work->work, twa_mode);
+
+	if (WARN_ON_ONCE(ret))
+		work->task = NULL;
+	else
+		atomic_long_or(UNWIND_PENDING, &info->unwind_mask);
+
+	return ret;
+}
+
+static int queue_unwind_work(struct unwind_work *work, int twa_mode,
+			     struct unwind_task_info *info)
+{
+	unsigned long bit = BIT(work->bit);
+	unsigned long old, bits;
+	int ret;
+
+	/* Check if the unwind_work only traces this task */
+	if (work->bit == UNWIND_TASK)
+		return queue_unwind_task(work, twa_mode, info);
+
+	old = atomic_long_read(&info->unwind_mask);
+
+	/* Is this already queued or executed */
+	if (old & bit)
+		return 1;
+
+	/*
+	 * This work's bit hasn't been set yet. Now set it with the PENDING
+	 * bit and fetch the current value of unwind_mask. If either the
+	 * work's bit or PENDING was already set, then this is already queued
+	 * to have a callback.
+	 */
+	bits = UNWIND_PENDING | bit;
+	old = atomic_long_fetch_or(bits, &info->unwind_mask);
+	if (old & bits) {
+		/*
+		 * If the work's bit was set, whatever set it had better
+		 * have also set pending and queued a callback.
+		 */
+		WARN_ON_ONCE(!(old & UNWIND_PENDING));
+		return old & bit;
+	}
+
+	/* The work has been claimed, now schedule it. */
+	ret = task_work_add(current, &info->work, twa_mode);
+
+	if (WARN_ON_ONCE(ret))
+		atomic_long_set(&info->unwind_mask, 0);
+
+	return ret;
+}
+
 /**
  * unwind_deferred_request - Request a user stacktrace on task kernel exit
  * @work: Unwind descriptor requesting the trace
@@ -232,9 +390,6 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 {
 	struct unwind_task_info *info = &current->unwind_info;
 	int twa_mode = TWA_RESUME;
-	unsigned long old, bits;
-	unsigned long bit;
-	int ret;
 
 	*cookie = 0;
 
@@ -254,47 +409,45 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 	}
 
 	/* Do not allow cancelled works to request again */
-	bit = READ_ONCE(work->bit);
-	if (WARN_ON_ONCE(bit < 0))
+	if (WARN_ON_ONCE(READ_ONCE(work->bit) < 0))
 		return -EINVAL;
 
-	/* Only need the mask now */
-	bit = BIT(bit);
-
 	guard(irqsave)();
 
 	*cookie = get_cookie(info);
 
-	old = atomic_long_read(&info->unwind_mask);
+	return queue_unwind_work(work, twa_mode, info);
+}
 
-	/* Is this already queued or executed */
-	if (old & bit)
-		return 1;
+static void cancel_unwind_task(struct unwind_work *work)
+{
+	struct task_struct *task;
 
-	/*
-	 * This work's bit hasn't been set yet. Now set it with the PENDING
-	 * bit and fetch the current value of unwind_mask. If ether the
-	 * work's bit or PENDING was already set, then this is already queued
-	 * to have a callback.
-	 */
-	bits = UNWIND_PENDING | bit;
-	old = atomic_long_fetch_or(bits, &info->unwind_mask);
-	if (old & bits) {
+	task = READ_ONCE(work->task);
+
+	if (!task || !task_work_cancel(task, &work->work)) {
 		/*
-		 * If the work's bit was set, whatever set it had better
-		 * have also set pending and queued a callback.
+		 * If the task_work_cancel() fails to cancel, it could mean that
+		 * the task_work is just about to execute. This needs to wait
+		 * until the work->func() is finished before returning.
+		 * This is required because the SRCU section may not have been
+		 * entered yet, and the synchronize_srcu() will not wait for it.
 		 */
-		WARN_ON_ONCE(!(old & UNWIND_PENDING));
-		return old & bit;
+		if (task) {
+			rcuwait_wait_event(&work->wait, work->task == NULL,
+				   TASK_UNINTERRUPTIBLE);
+		}
 	}
 
-	/* The work has been claimed, now schedule it. */
-	ret = task_work_add(current, &info->work, twa_mode);
-
-	if (WARN_ON_ONCE(ret))
-		atomic_long_set(&info->unwind_mask, 0);
+	/*
+	 * Needed to protect loop in process_unwind_tasks().
+	 * This also guarantees that unwind_deferred_task_work() is
+	 * completely done and the work structure is no longer referenced.
+	 */
+	synchronize_srcu(&unwind_srcu);
 
-	return ret;
+	/* Still set task to NULL if task_work_cancel() succeeded */
+	work->task = NULL;
 }
 
 void unwind_deferred_cancel(struct unwind_work *work)
@@ -307,16 +460,24 @@ void unwind_deferred_cancel(struct unwind_work *work)
 
 	bit = work->bit;
 
-	/* No work should be using a reserved bit */
-	if (WARN_ON_ONCE(BIT(bit) & RESERVED_BITS))
+	/* Was it initialized? */
+	if (!bit)
 		return;
 
-	guard(mutex)(&callback_mutex);
-	list_del_rcu(&work->list);
+	scoped_guard(mutex, &callback_mutex) {
+		list_del_rcu(&work->list);
+	}
 
 	/* Do not allow any more requests and prevent callbacks */
 	work->bit = -1;
 
+	if (bit == UNWIND_TASK)
+		return cancel_unwind_task(work);
+
+	/* No work should be using a reserved bit */
+	if (WARN_ON_ONCE(BIT(bit) & RESERVED_BITS))
+		return;
+
 	__clear_bit(bit, &unwind_mask);
 
 	synchronize_srcu(&unwind_srcu);
@@ -330,6 +491,17 @@ void unwind_deferred_cancel(struct unwind_work *work)
 	}
 }
 
+/**
+ * unwind_deferred_init - Init unwind_work that traces any task
+ * @work: The unwind_work descriptor to initialize
+ * @func: The callback function that will have the stacktrace
+ *
+ * Initialize a work that can trace any task. There's only a limited
+ * number of these that can be allocated.
+ *
+ * Returns 0 on success or -EBUSY if the limit of these unwind_works has
+ *   been exceeded.
+ */
 int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 {
 	memset(work, 0, sizeof(*work));
@@ -348,12 +520,37 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 	return 0;
 }
 
+/**
+ * unwind_deferred_task_init - Init unwind_work that traces a single task
+ * @work: The unwind_work descriptor to initialize
+ * @func: The callback function that will have the stacktrace
+ *
+ * Initialize a work that will always trace only a single task. It is
+ * up to the caller to make sure that unwind_deferred_request()
+ * will always be called on the same task for the @work descriptor.
+ *
+ * Note, unlike unwind_deferred_init(), there is no limit on the number
+ * of these works that can be initialized and used.
+ */
+int unwind_deferred_task_init(struct unwind_work *work, unwind_callback_t func)
+{
+	memset(work, 0, sizeof(*work));
+	work->bit = UNWIND_TASK;
+	init_task_work(&work->work, unwind_deferred_task_work);
+	work->func = func;
+	rcuwait_init(&work->wait);
+
+	guard(mutex)(&callback_mutex);
+	list_add_rcu(&work->list, &task_callbacks);
+	return 0;
+}
+
 void unwind_task_init(struct task_struct *task)
 {
 	struct unwind_task_info *info = &task->unwind_info;
 
 	memset(info, 0, sizeof(*info));
-	init_task_work(&info->work, unwind_deferred_task_work);
+	init_task_work(&info->work, unwind_deferred_work);
 	atomic_long_set(&info->unwind_mask, 0);
 }
 
-- 
2.50.1




* [PATCH v16 2/4] perf: Support deferred user callchains
  2025-10-07 21:40 [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure Steven Rostedt
  2025-10-07 21:40 ` [PATCH v16 1/4] unwind: Add interface to allow tracing a single task Steven Rostedt
@ 2025-10-07 21:40 ` Steven Rostedt
  2025-10-07 21:40 ` [PATCH v16 3/4] perf: Have the deferred request record the user context cookie Steven Rostedt
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-10-07 21:40 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
	Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

From: Josh Poimboeuf <jpoimboe@kernel.org>

If the user fault unwind is available (the one that will be used for
sframes), have perf be able to utilize it. Currently all user stack
traces are done at the request site. This mostly happens in interrupt or
NMI context, where user space is only accessible if it is currently
present in memory. It is possible that the user stack was swapped out
and is not present, but mostly the use of sframes will require faulting
in user pages, which will not be possible from interrupt context.
Instead, add a framework that delays the reading of the user space stack
until the task goes back to user space, where faulting in pages is
possible. This is also advantageous as the user space stack doesn't
change while in the kernel, and it also removes duplicate entries of the
user space stack when a long running system call is being profiled.

A new perf context is created called PERF_CONTEXT_USER_DEFERRED. It is
added to the kernel callchain, usually when an interrupt or NMI is
triggered (but can be added to any callchain). When a deferred unwind is
required, it uses the new deferred unwind infrastructure.

When tracing a single task and a user stack trace is required, perf
calls unwind_deferred_request(). This triggers a task_work that, when
the task exits kernel space, calls the perf function
perf_event_deferred_task() with the user stacktrace and a cookie (an
identifier for that stack trace).

This user stack trace will go into a new perf record type called
PERF_RECORD_CALLCHAIN_DEFERRED. The perf user space will need to attach
this stack trace to each of the previous kernel callchains for that task
that contain the PERF_CONTEXT_USER_DEFERRED context.
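
A hedged sketch of how user space could request this for a task event
(assumes the updated uapi header from this patch; the sampling setup is
only an example and error handling is omitted):

  #include <linux/perf_event.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int open_deferred_task_event(pid_t tid)
  {
          struct perf_event_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.type = PERF_TYPE_HARDWARE;
          attr.config = PERF_COUNT_HW_CPU_CYCLES;
          attr.sample_type = PERF_SAMPLE_CALLCHAIN | PERF_SAMPLE_TID;
          attr.sample_period = 100000;
          attr.defer_callchain = 1; /* request PERF_RECORD_CALLCHAIN_DEFERRED */

          /* Task event: pid = tid, cpu = -1 */
          return syscall(SYS_perf_event_open, &attr, tid, -1, -1, 0);
  }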

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v15: https://lore.kernel.org/20250825180801.727927527@kernel.org

- Peter Zijlstra pointed out that the code mostly duplicated the unwind
  infrastructure, and had the same bugs it had.
  The unwind infrastructure was updated to allow a tracer to use it for a
  single task. The perf code now uses that, which greatly simplifies
  this version over the previous one.

 include/linux/perf_event.h            |   5 +-
 include/uapi/linux/perf_event.h       |  20 ++++-
 kernel/bpf/stackmap.c                 |   4 +-
 kernel/events/callchain.c             |  11 ++-
 kernel/events/core.c                  | 110 +++++++++++++++++++++++++-
 tools/include/uapi/linux/perf_event.h |  20 ++++-
 6 files changed, 162 insertions(+), 8 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fd1d91017b99..152e3dacff98 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -53,6 +53,7 @@
 #include <linux/security.h>
 #include <linux/static_call.h>
 #include <linux/lockdep.h>
+#include <linux/unwind_deferred.h>
 
 #include <asm/local.h>
 
@@ -880,6 +881,8 @@ struct perf_event {
 	struct callback_head		pending_task;
 	unsigned int			pending_work;
 
+	struct unwind_work		unwind_work;
+
 	atomic_t			event_limit;
 
 	/* address range filters */
@@ -1720,7 +1723,7 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark);
+		   u32 max_stack, bool crosstask, bool add_mark, bool defer_user);
 extern int get_callchain_buffers(int max_stack);
 extern void put_callchain_buffers(void);
 extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 78a362b80027..20b8f890113b 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -463,7 +463,8 @@ struct perf_event_attr {
 				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec :  1, /* event is removed from task on exec */
 				sigtrap        :  1, /* send synchronous SIGTRAP on event */
-				__reserved_1   : 26;
+				defer_callchain:  1, /* generate PERF_RECORD_CALLCHAIN_DEFERRED records */
+				__reserved_1   : 25;
 
 	union {
 		__u32		wakeup_events;	  /* wake up every n events */
@@ -1239,6 +1240,22 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_AUX_OUTPUT_HW_ID		= 21,
 
+	/*
+	 * This user callchain capture was deferred until shortly before
+	 * returning to user space.  Previous samples would have kernel
+	 * callchains only and they need to be stitched with this to make full
+	 * callchains.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u64				cookie;
+	 *	u64				nr;
+	 *	u64				ips[nr];
+	 *	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_CALLCHAIN_DEFERRED		= 22,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
@@ -1269,6 +1286,7 @@ enum perf_callchain_context {
 	PERF_CONTEXT_HV				= (__u64)-32,
 	PERF_CONTEXT_KERNEL			= (__u64)-128,
 	PERF_CONTEXT_USER			= (__u64)-512,
+	PERF_CONTEXT_USER_DEFERRED		= (__u64)-640,
 
 	PERF_CONTEXT_GUEST			= (__u64)-2048,
 	PERF_CONTEXT_GUEST_KERNEL		= (__u64)-2176,
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index ec3a57a5fba1..339f7cbbcf36 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -315,7 +315,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 		max_depth = sysctl_perf_event_max_stack;
 
 	trace = get_perf_callchain(regs, kernel, user, max_depth,
-				   false, false);
+				   false, false, false);
 
 	if (unlikely(!trace))
 		/* couldn't fetch the stack trace */
@@ -452,7 +452,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 		trace = get_callchain_entry_for_task(task, max_depth);
 	else
 		trace = get_perf_callchain(regs, kernel, user, max_depth,
-					   crosstask, false);
+					   crosstask, false, false);
 
 	if (unlikely(!trace) || trace->nr < skip) {
 		if (may_fault)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 808c0d7a31fa..d0e0da66a164 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -218,7 +218,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
 
 struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark)
+		   u32 max_stack, bool crosstask, bool add_mark, bool defer_user)
 {
 	struct perf_callchain_entry *entry;
 	struct perf_callchain_entry_ctx ctx;
@@ -251,6 +251,15 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 			regs = task_pt_regs(current);
 		}
 
+		if (defer_user) {
+			/*
+			 * Foretell the coming of PERF_RECORD_CALLCHAIN_DEFERRED
+			 * which can be stitched to this one.
+			 */
+			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
+			goto exit_put;
+		}
+
 		if (add_mark)
 			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 28de3baff792..be94b437e7e0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5582,6 +5582,67 @@ static bool exclusive_event_installable(struct perf_event *event,
 	return true;
 }
 
+static void perf_pending_unwind_sync(struct perf_event *event)
+{
+	struct unwind_work *work = &event->unwind_work;
+
+	unwind_deferred_cancel(work);
+}
+
+struct perf_callchain_deferred_event {
+	struct perf_event_header	header;
+	u64				cookie;
+	u64				nr;
+	u64				ips[];
+};
+
+static void perf_event_callchain_deferred(struct perf_event *event,
+					  struct unwind_stacktrace *trace,
+					  u64 cookie)
+{
+	struct perf_callchain_deferred_event deferred_event;
+	u64 callchain_context = PERF_CONTEXT_USER;
+	struct perf_output_handle handle;
+	struct perf_sample_data data;
+	u64 nr;
+
+	nr = trace->nr + 1; /* '+1' == callchain_context */
+
+	deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
+	deferred_event.header.misc = PERF_RECORD_MISC_USER;
+	deferred_event.header.size = sizeof(deferred_event) + (nr * sizeof(u64));
+
+	deferred_event.nr = nr;
+	deferred_event.cookie = cookie;
+
+	perf_event_header__init_id(&deferred_event.header, &data, event);
+
+	if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
+		return;
+
+	perf_output_put(&handle, deferred_event);
+	perf_output_put(&handle, callchain_context);
+	/* trace->entries[] are not guaranteed to be 64bit */
+	for (int i = 0; i < trace->nr; i++) {
+		u64 entry = trace->entries[i];
+		perf_output_put(&handle, entry);
+	}
+	perf_event__output_id_sample(event, &handle, &data);
+
+	perf_output_end(&handle);
+}
+
+/* Deferred unwinding callback for task specific events */
+static void perf_event_deferred_task(struct unwind_work *work,
+				     struct unwind_stacktrace *trace, u64 cookie)
+{
+	struct perf_event *event = container_of(work, struct perf_event, unwind_work);
+
+	perf_event_callchain_deferred(event, trace, cookie);
+
+	local_dec(&event->ctx->nr_no_switch_fast);
+}
+
 static void perf_free_addr_filters(struct perf_event *event);
 
 /* vs perf_event_alloc() error */
@@ -5649,6 +5710,7 @@ static void _free_event(struct perf_event *event)
 {
 	irq_work_sync(&event->pending_irq);
 	irq_work_sync(&event->pending_disable_irq);
+	perf_pending_unwind_sync(event);
 
 	unaccount_event(event);
 
@@ -8194,6 +8256,28 @@ static u64 perf_get_page_size(unsigned long addr)
 
 static struct perf_callchain_entry __empty_callchain = { .nr = 0, };
 
+/*
+ * Returns:
+*     > 0 : if already queued.
+ *      0 : if it performed the queuing
+ *    < 0 : if it did not get queued.
+ */
+static int deferred_request(struct perf_event *event)
+{
+	struct unwind_work *work = &event->unwind_work;
+	u64 cookie;
+
+	/* Only defer for task events */
+	if (!event->ctx->task)
+		return -EINVAL;
+
+	if ((current->flags & (PF_KTHREAD | PF_USER_WORKER)) ||
+	    !user_mode(task_pt_regs(current)))
+		return -EINVAL;
+
+	return unwind_deferred_request(work, &cookie);
+}
+
 struct perf_callchain_entry *
 perf_callchain(struct perf_event *event, struct pt_regs *regs)
 {
@@ -8204,6 +8288,8 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 	bool crosstask = event->ctx->task && event->ctx->task != current;
 	const u32 max_stack = event->attr.sample_max_stack;
 	struct perf_callchain_entry *callchain;
+	bool defer_user = IS_ENABLED(CONFIG_UNWIND_USER) && user &&
+			  event->attr.defer_callchain;
 
 	if (!current->mm)
 		user = false;
@@ -8211,8 +8297,21 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 	if (!kernel && !user)
 		return &__empty_callchain;
 
-	callchain = get_perf_callchain(regs, kernel, user,
-				       max_stack, crosstask, true);
+	/* Disallow cross-task callchains. */
+	if (event->ctx->task && event->ctx->task != current)
+		return &__empty_callchain;
+
+	if (defer_user) {
+		int ret = deferred_request(event);
+		if (!ret)
+			local_inc(&event->ctx->nr_no_switch_fast);
+		else if (ret < 0)
+			defer_user = false;
+	}
+
+	callchain = get_perf_callchain(regs, kernel, user, max_stack,
+				       crosstask, true, defer_user);
+
 	return callchain ?: &__empty_callchain;
 }
 
@@ -13046,6 +13145,13 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 		}
 	}
 
+	if (event->attr.defer_callchain) {
+		if (task) {
+			err = unwind_deferred_task_init(&event->unwind_work,
+							perf_event_deferred_task);
+		}
+	}
+
 	err = security_perf_event_alloc(event);
 	if (err)
 		return ERR_PTR(err);
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 78a362b80027..20b8f890113b 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -463,7 +463,8 @@ struct perf_event_attr {
 				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec :  1, /* event is removed from task on exec */
 				sigtrap        :  1, /* send synchronous SIGTRAP on event */
-				__reserved_1   : 26;
+				defer_callchain:  1, /* generate PERF_RECORD_CALLCHAIN_DEFERRED records */
+				__reserved_1   : 25;
 
 	union {
 		__u32		wakeup_events;	  /* wake up every n events */
@@ -1239,6 +1240,22 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_AUX_OUTPUT_HW_ID		= 21,
 
+	/*
+	 * This user callchain capture was deferred until shortly before
+	 * returning to user space.  Previous samples would have kernel
+	 * callchains only and they need to be stitched with this to make full
+	 * callchains.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u64				cookie;
+	 *	u64				nr;
+	 *	u64				ips[nr];
+	 *	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_CALLCHAIN_DEFERRED		= 22,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
@@ -1269,6 +1286,7 @@ enum perf_callchain_context {
 	PERF_CONTEXT_HV				= (__u64)-32,
 	PERF_CONTEXT_KERNEL			= (__u64)-128,
 	PERF_CONTEXT_USER			= (__u64)-512,
+	PERF_CONTEXT_USER_DEFERRED		= (__u64)-640,
 
 	PERF_CONTEXT_GUEST			= (__u64)-2048,
 	PERF_CONTEXT_GUEST_KERNEL		= (__u64)-2176,
-- 
2.50.1




* [PATCH v16 3/4] perf: Have the deferred request record the user context cookie
  2025-10-07 21:40 [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure Steven Rostedt
  2025-10-07 21:40 ` [PATCH v16 1/4] unwind: Add interface to allow tracing a single task Steven Rostedt
  2025-10-07 21:40 ` [PATCH v16 2/4] perf: Support deferred user callchains Steven Rostedt
@ 2025-10-07 21:40 ` Steven Rostedt
  2025-10-07 21:40 ` [PATCH v16 4/4] perf: Support deferred user callchains for per CPU events Steven Rostedt
  2025-10-23 15:00 ` [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure Peter Zijlstra
  4 siblings, 0 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-10-07 21:40 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
	Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

From: Steven Rostedt <rostedt@goodmis.org>

When a request for a deferred unwind is made, have the cookie associated
with the user context recorded in the event that represents that
request. It is added after the PERF_CONTEXT_USER_DEFERRED entry in the
callchain. That perf context is a marker of where to add the associated
user space stack trace in the callchain. Adding the cookie after that
marker does not affect the appending of the callchain, as the cookie
will be overwritten by the user space stack in the perf tool.

The cookie will be used to match the cookie that is saved when the
deferred callchain is recorded. The perf tool can use the cookie saved
at the request to know whether the callchain that was recorded when the
task went back to user space is for that event. Consider the case where
events are dropped after the request was made: the user callchain
recorded when the task went back to user space is dropped, the task
comes back into the kernel, a new request is also dropped, and then
recording resumes and a new user callchain is recorded on the next
return to user space. That callchain is not for the initial request, and
the cookie matching prevents it from being attached to it.

The cookie prevents:

  record kernel stack trace with PERF_CONTEXT_USER_DEFERRED

  [ dropped events starts here ]

  record user stack trace - DROPPED

  [enters user space ]
  [exits user space back to the kernel ]

  record kernel stack trace with PERF_CONTEXT_USER_DEFERRED - DROPPED!

  [ events stop being dropped here ]

  record user stack trace

Without a differentiating "cookie" identifier, the user space tool will
incorrectly attach the last recorded user stack trace to the first kernel
stack trace with the PERF_CONTEXT_USER_DEFERRED, as using the TID is not
enough to identify this situation.
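
A minimal sketch (not the actual perf tool code; sample_deferred_cookie()
is a made-up helper) of how a consumer could pull the cookie out of a
kernel-only sample for matching:

  #include <linux/perf_event.h>

  /*
   * Return the cookie stored right after PERF_CONTEXT_USER_DEFERRED in a
   * sample's callchain, or 0 if no deferred user stack was requested.
   */
  static __u64 sample_deferred_cookie(const __u64 *ips, __u64 nr)
  {
          for (__u64 i = 0; i + 1 < nr; i++) {
                  if (ips[i] == PERF_CONTEXT_USER_DEFERRED)
                          return ips[i + 1];
          }
          return 0;
  }

A later PERF_RECORD_CALLCHAIN_DEFERRED is stitched into the sample only
when both the TID and this cookie match; otherwise (as in the dropped
events scenario above) the user trace belongs to a different request and
must not be attached.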

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/perf_event.h            |  2 +-
 include/uapi/linux/perf_event.h       |  5 +++++
 kernel/bpf/stackmap.c                 |  4 ++--
 kernel/events/callchain.c             |  9 ++++++---
 kernel/events/core.c                  | 12 ++++++------
 tools/include/uapi/linux/perf_event.h |  5 +++++
 6 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 152e3dacff98..a0f95f751b44 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1723,7 +1723,7 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark, bool defer_user);
+		   u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie);
 extern int get_callchain_buffers(int max_stack);
 extern void put_callchain_buffers(void);
 extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 20b8f890113b..79232e85a8fc 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -1282,6 +1282,11 @@ enum perf_bpf_event_type {
 #define PERF_MAX_STACK_DEPTH			127
 #define PERF_MAX_CONTEXTS_PER_STACK		  8
 
+/*
+ * The PERF_CONTEXT_USER_DEFERRED has two items (context and cookie)
+ */
+#define PERF_DEFERRED_ITEMS			2
+
 enum perf_callchain_context {
 	PERF_CONTEXT_HV				= (__u64)-32,
 	PERF_CONTEXT_KERNEL			= (__u64)-128,
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 339f7cbbcf36..ef6021111fe3 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -315,7 +315,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 		max_depth = sysctl_perf_event_max_stack;
 
 	trace = get_perf_callchain(regs, kernel, user, max_depth,
-				   false, false, false);
+				   false, false, 0);
 
 	if (unlikely(!trace))
 		/* couldn't fetch the stack trace */
@@ -452,7 +452,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 		trace = get_callchain_entry_for_task(task, max_depth);
 	else
 		trace = get_perf_callchain(regs, kernel, user, max_depth,
-					   crosstask, false, false);
+					   crosstask, false, 0);
 
 	if (unlikely(!trace) || trace->nr < skip) {
 		if (may_fault)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index d0e0da66a164..b9c7e00725d6 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -218,7 +218,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
 
 struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark, bool defer_user)
+		   u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie)
 {
 	struct perf_callchain_entry *entry;
 	struct perf_callchain_entry_ctx ctx;
@@ -251,12 +251,15 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 			regs = task_pt_regs(current);
 		}
 
-		if (defer_user) {
+		if (defer_cookie) {
 			/*
 			 * Foretell the coming of PERF_RECORD_CALLCHAIN_DEFERRED
-			 * which can be stitched to this one.
+			 * which can be stitched to this one, and add
+			 * the cookie after it (it will be cut off when the
+			 * user stack is copied to the callchain).
 			 */
 			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
+			perf_callchain_store_context(&ctx, defer_cookie);
 			goto exit_put;
 		}
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index be94b437e7e0..f003a1f9497a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8262,10 +8262,9 @@ static struct perf_callchain_entry __empty_callchain = { .nr = 0, };
  *      0 : if it performed the queuing
  *    < 0 : if it did not get queued.
  */
-static int deferred_request(struct perf_event *event)
+static int deferred_request(struct perf_event *event, u64 *defer_cookie)
 {
 	struct unwind_work *work = &event->unwind_work;
-	u64 cookie;
 
 	/* Only defer for task events */
 	if (!event->ctx->task)
@@ -8275,7 +8274,7 @@ static int deferred_request(struct perf_event *event)
 	    !user_mode(task_pt_regs(current)))
 		return -EINVAL;
 
-	return unwind_deferred_request(work, &cookie);
+	return unwind_deferred_request(work, defer_cookie);
 }
 
 struct perf_callchain_entry *
@@ -8288,6 +8287,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 	bool crosstask = event->ctx->task && event->ctx->task != current;
 	const u32 max_stack = event->attr.sample_max_stack;
 	struct perf_callchain_entry *callchain;
+	u64 defer_cookie = 0;
 	bool defer_user = IS_ENABLED(CONFIG_UNWIND_USER) && user &&
 			  event->attr.defer_callchain;
 
@@ -8302,15 +8302,15 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 		return &__empty_callchain;
 
 	if (defer_user) {
-		int ret = deferred_request(event);
+		int ret = deferred_request(event, &defer_cookie);
 		if (!ret)
 			local_inc(&event->ctx->nr_no_switch_fast);
 		else if (ret < 0)
-			defer_user = false;
+			defer_cookie = 0;
 	}
 
 	callchain = get_perf_callchain(regs, kernel, user, max_stack,
-				       crosstask, true, defer_user);
+				       crosstask, true, defer_cookie);
 
 	return callchain ?: &__empty_callchain;
 }
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 20b8f890113b..79232e85a8fc 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -1282,6 +1282,11 @@ enum perf_bpf_event_type {
 #define PERF_MAX_STACK_DEPTH			127
 #define PERF_MAX_CONTEXTS_PER_STACK		  8
 
+/*
+ * The PERF_CONTEXT_USER_DEFERRED has two items (context and cookie)
+ */
+#define PERF_DEFERRED_ITEMS			2
+
 enum perf_callchain_context {
 	PERF_CONTEXT_HV				= (__u64)-32,
 	PERF_CONTEXT_KERNEL			= (__u64)-128,
-- 
2.50.1




* [PATCH v16 4/4] perf: Support deferred user callchains for per CPU events
  2025-10-07 21:40 [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure Steven Rostedt
                   ` (2 preceding siblings ...)
  2025-10-07 21:40 ` [PATCH v16 3/4] perf: Have the deferred request record the user context cookie Steven Rostedt
@ 2025-10-07 21:40 ` Steven Rostedt
  2025-10-23 15:00 ` [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure Peter Zijlstra
  4 siblings, 0 replies; 22+ messages in thread
From: Steven Rostedt @ 2025-10-07 21:40 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
	Andrew Morton, Florian Weimer, Sam James, Kees Cook,
	Carlos O'Donell

From: Steven Rostedt <rostedt@goodmis.org>

The deferred unwinder works fine for task events (events that trace only a
specific task), as it can use a task_work from an interrupt or NMI and
when the task goes back to user space it will call the event's callback to
do the deferred unwinding.

But for per CPU events things are not so simple. When a per CPU event
wants a deferred unwinding to occur, it cannot simply use a task_work,
as there's a many-to-many relationship. If the task migrates and another
task is scheduled in, the per CPU event may want a deferred unwinding on
that new task as well, and the task that migrated may have the other
CPU's event wanting to unwind it too. Each CPU may need unwinding from
more than one task, and each task may have requests from many CPUs.

The main issue is that, from the kernel's point of view, there's
currently nothing that associates a per CPU event for one CPU with the
per CPU events that cover the other CPUs for a given process. To the
kernel, they are all just individual event buffers. This is problematic
if a deferred request is made on one CPU and the task migrates to
another CPU, where the deferred user stack trace will be performed. The
kernel needs to know which CPU buffer to add the trace to, namely the
one belonging to the same process that initiated the deferred request.

To solve this, when a per CPU event is created with the defer_callchain
attribute set, it does a lookup in a global list (unwind_deferred_list)
for a perf_unwind_deferred descriptor whose id matches the PID of the
current task's group_leader (the process ID shared by all the threads of
a process).

If it is not found, then it will create one and add it to the global list.
This descriptor contains an array of all possible CPUs, where each element
is a perf_unwind_cpu descriptor.

Each perf_unwind_cpu descriptor has a list of all the per CPU events
that are tracing the CPU corresponding to its index in the array, where
the events belong to tasks that share the same group_leader. It also has
a processing flag and an rcuwait to handle removal.

For each occupied perf_unwind_cpu descriptor in the array, the
perf_unwind_deferred descriptor increments its nr_cpu_events. When a
perf_unwind_cpu descriptor becomes empty, nr_cpu_events is decremented.
This is used to know when to free the perf_unwind_deferred descriptor,
as when it becomes empty, it is no longer referenced.

Finally, the perf_unwind_deferred descriptor has an id that holds the
PID of the group_leader of the tasks that created the events.

When a second (or subsequent) per CPU event is created and the
perf_unwind_deferred descriptor already exists, it simply adds itself to
the perf_unwind_cpu array of that descriptor, updating the necessary
counter. This is what maps different per CPU events to each other, based
on their creator's group leader PID.

Each of these perf_unwind_deferred descriptors has an unwind_work that
is registered with the deferred unwind infrastructure via
unwind_deferred_init(), which also registers the
perf_event_deferred_cpu() callback.

Now when a per CPU event requests a deferred unwinding, it calls
unwind_deferred_request() with the unwind_work of the associated
perf_unwind_deferred descriptor. It is expected that the program using
this has events on all CPUs, as the deferred trace may not be recorded
on the CPU event that requested it. That is, the task may migrate, and
its user stack trace will be recorded on the CPU event of the CPU that
it exits back to user space on.
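
A hedged sketch of the expected user space setup
(open_deferred_cpu_events() is a made-up helper; attr is assumed to
already have defer_callchain and the desired sampling configuration set,
and error handling is omitted). All the events are opened by the same
tool process, which is what the group_leader PID lookup above keys on:

  #include <linux/perf_event.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Open one system-wide event per CPU; fds[] must have nr_cpus entries */
  static void open_deferred_cpu_events(struct perf_event_attr *attr,
                                       int nr_cpus, int *fds)
  {
          for (int cpu = 0; cpu < nr_cpus; cpu++) {
                  /* Per CPU event: pid = -1, cpu = N */
                  fds[cpu] = syscall(SYS_perf_event_open, attr, -1, cpu, -1, 0);
          }
  }

Whichever CPU the profiled task exits to user space on, its deferred
callchain lands in that CPU's buffer from this same set of events.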

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/perf_event.h |   4 +
 kernel/events/core.c       | 260 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 260 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index a0f95f751b44..b6b7cd6d67a5 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -733,6 +733,7 @@ struct swevent_hlist {
 struct bpf_prog;
 struct perf_cgroup;
 struct perf_buffer;
+struct perf_unwind_deferred;
 
 struct pmu_event_list {
 	raw_spinlock_t			lock;
@@ -883,6 +884,9 @@ struct perf_event {
 
 	struct unwind_work		unwind_work;
 
+	struct perf_unwind_deferred	*unwind_deferred;
+	struct list_head		unwind_list;
+
 	atomic_t			event_limit;
 
 	/* address range filters */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f003a1f9497a..f3e48cc4b32d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5582,10 +5582,192 @@ static bool exclusive_event_installable(struct perf_event *event,
 	return true;
 }
 
+/* Holds a list of per CPU events that registered for deferred unwinding */
+struct perf_unwind_cpu {
+	struct list_head	list;
+	struct rcuwait		pending_unwind_wait;
+	int			processing;
+};
+
+struct perf_unwind_deferred {
+	struct list_head		list;
+	struct unwind_work		unwind_work;
+	struct perf_unwind_cpu __rcu	*cpu_events;
+	struct rcu_head			rcu_head;
+	int				nr_cpu_events;
+	int				id;
+};
+
+static DEFINE_MUTEX(unwind_deferred_mutex);
+static LIST_HEAD(unwind_deferred_list);
+
+static void perf_event_deferred_cpu(struct unwind_work *work,
+				    struct unwind_stacktrace *trace, u64 cookie);
+
+/*
+ * Add a per CPU event.
+ *
+ * The deferred callstack can happen on a different CPU than what was
+ * requested. If one CPU event requests a deferred callstack, but the
+ * tasks migrates, it will execute on a different CPU and save the
+ * stack trace to that CPU event.
+ *
+ * In order to map all the CPU events with the same application,
+ * use the current->group_leader->pid as the identifier of what
+ * events share the same program.
+ *
+ * A perf_unwind_deferred descriptor is created for each unique
+ * group_leader pid, and all the events that have the same group_leader
+ * pid will be linked to the same deferred descriptor.
+ *
+ * If there's no descriptor that matches the current group_leader pid,
+ * one will be created.
+ */
+static int perf_add_unwind_deferred(struct perf_event *event)
+{
+	struct perf_unwind_deferred *defer;
+	struct perf_unwind_cpu *cpu_events;
+	int id = current->group_leader->pid;
+	bool found = false;
+	int ret = 0;
+
+	if (event->cpu < 0)
+		return -EINVAL;
+
+	guard(mutex)(&unwind_deferred_mutex);
+
+	list_for_each_entry(defer, &unwind_deferred_list, list) {
+		if (defer->id == id) {
+			found = true;
+			break;
+		}
+	}
+
+	if (!found) {
+		defer = kzalloc(sizeof(*defer), GFP_KERNEL);
+		if (!defer)
+			return -ENOMEM;
+		list_add(&defer->list, &unwind_deferred_list);
+		defer->id = id;
+	}
+
+	/*
+	 * The deferred descriptor has an array for every CPU.
+	 * Each entry in this array is a linked list of all the CPU
+	 * events for the corresponding CPU. This is a quick way to
+	 * find the associated event for a given CPU in
+	 * perf_event_deferred_cpu().
+	 */
+	if (!defer->nr_cpu_events) {
+		cpu_events = kcalloc(num_possible_cpus(),
+				     sizeof(*cpu_events),
+				     GFP_KERNEL);
+		if (!cpu_events) {
+			ret = -ENOMEM;
+			goto free;
+		}
+		for (int cpu = 0; cpu < num_possible_cpus(); cpu++) {
+			rcuwait_init(&cpu_events[cpu].pending_unwind_wait);
+			INIT_LIST_HEAD(&cpu_events[cpu].list);
+		}
+
+		rcu_assign_pointer(defer->cpu_events, cpu_events);
+
+		ret = unwind_deferred_init(&defer->unwind_work,
+					   perf_event_deferred_cpu);
+		if (ret)
+			goto free;
+	}
+	cpu_events = rcu_dereference_protected(defer->cpu_events,
+				lockdep_is_held(&unwind_deferred_mutex));
+
+	/*
+	 * The defer->nr_cpu_events is the count of the number
+	 * of non-empty lists in the cpu_events array. If the list
+	 * being added to is already occupied, the nr_cpu_events does
+	 * not need to get incremented.
+	 */
+	if (list_empty(&cpu_events[event->cpu].list))
+		defer->nr_cpu_events++;
+	list_add_tail_rcu(&event->unwind_list, &cpu_events[event->cpu].list);
+
+	event->unwind_deferred = defer;
+	return 0;
+free:
+	/* Nothing to do if there was already an existing event attached */
+	if (found)
+		return ret;
+
+	list_del(&defer->list);
+	kfree(cpu_events);
+	kfree(defer);
+	return ret;
+}
+
+static void free_unwind_deferred_rcu(struct rcu_head *head)
+{
+	struct perf_unwind_cpu *cpu_events;
+	struct perf_unwind_deferred *defer =
+		container_of(head, struct perf_unwind_deferred, rcu_head);
+
+	WARN_ON_ONCE(defer->nr_cpu_events);
+	/*
+	 * This is called by call_rcu() and there are no more
+	 * references to cpu_events.
+	 */
+	cpu_events = rcu_dereference_protected(defer->cpu_events, true);
+	kfree(cpu_events);
+	kfree(defer);
+}
+
+static void perf_remove_unwind_deferred(struct perf_event *event)
+{
+	struct perf_unwind_deferred *defer = event->unwind_deferred;
+	struct perf_unwind_cpu *cpu_events, *cpu_unwind;
+
+	if (!defer)
+		return;
+
+	guard(mutex)(&unwind_deferred_mutex);
+	list_del_rcu(&event->unwind_list);
+
+	cpu_events = rcu_dereference_protected(defer->cpu_events,
+				lockdep_is_held(&unwind_deferred_mutex));
+	cpu_unwind = &cpu_events[event->cpu];
+
+	if (list_empty(&cpu_unwind->list)) {
+		defer->nr_cpu_events--;
+		if (!defer->nr_cpu_events)
+			unwind_deferred_cancel(&defer->unwind_work);
+	}
+
+	event->unwind_deferred = NULL;
+
+	/*
+	 * Make sure perf_event_deferred_cpu() is done with this event.
+	 * That function will set cpu_unwind->processing and then
+	 * call smp_mb() before iterating the list of its events.
+	 * If the event's unwind_deferred is NULL, it will be skipped.
+	 * The smp_mb() in that function matches the mb() in
+	 * rcuwait_wait_event().
+	 */
+	rcuwait_wait_event(&cpu_unwind->pending_unwind_wait,
+				   !cpu_unwind->processing, TASK_UNINTERRUPTIBLE);
+
+	/* Is this still being used by other per CPU events? */
+	if (defer->nr_cpu_events)
+		return;
+
+	list_del(&defer->list);
+	/* The defer->cpu_events is protected by RCU */
+	call_rcu(&defer->rcu_head, free_unwind_deferred_rcu);
+}
+
 static void perf_pending_unwind_sync(struct perf_event *event)
 {
 	struct unwind_work *work = &event->unwind_work;
 
+	perf_remove_unwind_deferred(event);
 	unwind_deferred_cancel(work);
 }
 
@@ -5643,6 +5825,54 @@ static void perf_event_deferred_task(struct unwind_work *work,
 	local_dec(&event->ctx->nr_no_switch_fast);
 }
 
+/*
+ * Deferred unwinding callback for per CPU events.
+ * Note, the request for the deferred unwinding may have happened
+ * on a different CPU.
+ */
+static void perf_event_deferred_cpu(struct unwind_work *work,
+				    struct unwind_stacktrace *trace, u64 cookie)
+{
+	struct perf_unwind_deferred *defer =
+		container_of(work, struct perf_unwind_deferred, unwind_work);
+	struct perf_unwind_cpu *cpu_events, *cpu_unwind;
+	struct perf_event *event;
+	int cpu;
+
+	guard(rcu)();
+	guard(preempt)();
+
+	cpu = smp_processor_id();
+	cpu_events = rcu_dereference(defer->cpu_events);
+	cpu_unwind = &cpu_events[cpu];
+
+	WRITE_ONCE(cpu_unwind->processing, 1);
+	/*
+	 * Make sure the above is seen before the event->unwind_deferred
+	 * is checked. This matches the mb() in rcuwait_wait_event() in
+	 * perf_remove_unwind_deferred().
+	 */
+	smp_mb();
+
+	list_for_each_entry_rcu(event, &cpu_unwind->list, unwind_list) {
+		/* If unwind_deferred is NULL the event is going away */
+		if (unlikely(!event->unwind_deferred))
+			continue;
+		perf_event_callchain_deferred(event, trace, cookie);
+		/* Only the first CPU event gets the trace */
+		break;
+	}
+
+	/*
+	 * The perf_event_callchain_deferred() must finish before setting
+	 * cpu_unwind->processing to zero. This is also to synchronize
+	 * with the rcuwait in perf_remove_unwind_deferred().
+	 */
+	smp_mb();
+	WRITE_ONCE(cpu_unwind->processing, 0);
+	rcuwait_wake_up(&cpu_unwind->pending_unwind_wait);
+}
+
 static void perf_free_addr_filters(struct perf_event *event);
 
 /* vs perf_event_alloc() error */
@@ -8256,6 +8486,22 @@ static u64 perf_get_page_size(unsigned long addr)
 
 static struct perf_callchain_entry __empty_callchain = { .nr = 0, };
 
+
+static int deferred_unwind_request(struct perf_unwind_deferred *defer,
+				   u64 *defer_cookie)
+{
+	int ret;
+
+	ret = unwind_deferred_request(&defer->unwind_work, defer_cookie);
+
+	/*
+	 * Return 1 on success or negative on error.
+	 * Do not return zero, so as not to increment nr_no_switch_fast.
+	 * That's not needed here.
+	 */
+	return ret < 0 ? ret : 1;
+}
+
 /*
  * Returns:
 *     > 0 : if already queued.
@@ -8265,15 +8511,16 @@ static struct perf_callchain_entry __empty_callchain = { .nr = 0, };
 static int deferred_request(struct perf_event *event, u64 *defer_cookie)
 {
 	struct unwind_work *work = &event->unwind_work;
-
-	/* Only defer for task events */
-	if (!event->ctx->task)
-		return -EINVAL;
+	struct perf_unwind_deferred *defer;
 
 	if ((current->flags & (PF_KTHREAD | PF_USER_WORKER)) ||
 	    !user_mode(task_pt_regs(current)))
 		return -EINVAL;
 
+	defer = READ_ONCE(event->unwind_deferred);
+	if (defer)
+		return deferred_unwind_request(defer, defer_cookie);
+
 	return unwind_deferred_request(work, defer_cookie);
 }
 
@@ -13149,6 +13396,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 		if (task) {
 			err = unwind_deferred_task_init(&event->unwind_work,
 							perf_event_deferred_task);
+		} else {
+			/* Setup unwind deferring for per CPU events */
+			err = perf_add_unwind_deferred(event);
+			if (err)
+				return ERR_PTR(err);
 		}
 	}
 
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
  2025-10-07 21:40 [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure Steven Rostedt
                   ` (3 preceding siblings ...)
  2025-10-07 21:40 ` [PATCH v16 4/4] perf: Support deferred user callchains for per CPU events Steven Rostedt
@ 2025-10-23 15:00 ` Peter Zijlstra
  2025-10-23 16:40   ` Steven Rostedt
  2025-10-24  9:29   ` Peter Zijlstra
  4 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2025-10-23 15:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Linus Torvalds, Andrew Morton, Florian Weimer,
	Sam James, Kees Cook, Carlos O'Donell

On Tue, Oct 07, 2025 at 05:40:08PM -0400, Steven Rostedt wrote:

>  include/linux/perf_event.h            |   9 +-
>  include/linux/unwind_deferred.h       |  15 ++
>  include/uapi/linux/perf_event.h       |  25 ++-
>  kernel/bpf/stackmap.c                 |   4 +-
>  kernel/events/callchain.c             |  14 +-
>  kernel/events/core.c                  | 362 +++++++++++++++++++++++++++++++++-
>  kernel/unwind/deferred.c              | 283 ++++++++++++++++++++++----
>  tools/include/uapi/linux/perf_event.h |  25 ++-
>  8 files changed, 686 insertions(+), 51 deletions(-)

After staring at this some, I mostly threw it all out and wrote the
below.

I also have some hackery on the userspace patches to go along with this,
and it all sits in my unwind/cleanup branch.

Trouble is, pretty much every unwind is 510 entries long -- this cannot
be right. I'm sure there's a silly mistake in unwind/user.c but I'm too
tired to find it just now. I'll try again tomorrow.
  
---
 include/linux/perf_event.h            |    2 
 include/linux/unwind_deferred.h       |   12 -----
 include/linux/unwind_deferred_types.h |   13 +++++
 include/uapi/linux/perf_event.h       |   21 ++++++++-
 kernel/bpf/stackmap.c                 |    4 -
 kernel/events/callchain.c             |   14 +++++-
 kernel/events/core.c                  |   79 +++++++++++++++++++++++++++++++++-
 tools/include/uapi/linux/perf_event.h |   21 ++++++++-
 8 files changed, 146 insertions(+), 20 deletions(-)

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1720,7 +1720,7 @@ extern void perf_callchain_user(struct p
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark);
+		   u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie);
 extern int get_callchain_buffers(int max_stack);
 extern void put_callchain_buffers(void);
 extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -6,18 +6,6 @@
 #include <linux/unwind_user.h>
 #include <linux/unwind_deferred_types.h>
 
-struct unwind_work;
-
-typedef void (*unwind_callback_t)(struct unwind_work *work,
-				  struct unwind_stacktrace *trace,
-				  u64 cookie);
-
-struct unwind_work {
-	struct list_head		list;
-	unwind_callback_t		func;
-	int				bit;
-};
-
 #ifdef CONFIG_UNWIND_USER
 
 enum {
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -39,4 +39,17 @@ struct unwind_task_info {
 	union unwind_task_id	id;
 };
 
+struct unwind_work;
+struct unwind_stacktrace;
+
+typedef void (*unwind_callback_t)(struct unwind_work *work,
+				  struct unwind_stacktrace *trace,
+				  u64 cookie);
+
+struct unwind_work {
+	struct list_head		list;
+	unwind_callback_t		func;
+	int				bit;
+};
+
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -463,7 +463,9 @@ struct perf_event_attr {
 				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec :  1, /* event is removed from task on exec */
 				sigtrap        :  1, /* send synchronous SIGTRAP on event */
-				__reserved_1   : 26;
+				defer_callchain:  1, /* request PERF_RECORD_CALLCHAIN_DEFERRED records */
+				defer_output   :  1, /* output PERF_RECORD_CALLCHAIN_DEFERRED records */
+				__reserved_1   : 24;
 
 	union {
 		__u32		wakeup_events;	  /* wake up every n events */
@@ -1239,6 +1241,22 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_AUX_OUTPUT_HW_ID		= 21,
 
+	/*
+	 * This user callchain capture was deferred until shortly before
+	 * returning to user space.  Previous samples would have kernel
+	 * callchains only and they need to be stitched with this to make full
+	 * callchains.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u64				cookie;
+	 *	u64				nr;
+	 *	u64				ips[nr];
+	 *	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_CALLCHAIN_DEFERRED		= 22,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
@@ -1269,6 +1287,7 @@ enum perf_callchain_context {
 	PERF_CONTEXT_HV				= (__u64)-32,
 	PERF_CONTEXT_KERNEL			= (__u64)-128,
 	PERF_CONTEXT_USER			= (__u64)-512,
+	PERF_CONTEXT_USER_DEFERRED		= (__u64)-640,
 
 	PERF_CONTEXT_GUEST			= (__u64)-2048,
 	PERF_CONTEXT_GUEST_KERNEL		= (__u64)-2176,
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -315,7 +315,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_re
 		max_depth = sysctl_perf_event_max_stack;
 
 	trace = get_perf_callchain(regs, kernel, user, max_depth,
-				   false, false);
+				   false, false, 0);
 
 	if (unlikely(!trace))
 		/* couldn't fetch the stack trace */
@@ -452,7 +452,7 @@ static long __bpf_get_stack(struct pt_re
 		trace = get_callchain_entry_for_task(task, max_depth);
 	else
 		trace = get_perf_callchain(regs, kernel, user, max_depth,
-					   crosstask, false);
+					   crosstask, false, 0);
 
 	if (unlikely(!trace) || trace->nr < skip) {
 		if (may_fault)
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -218,7 +218,7 @@ static void fixup_uretprobe_trampoline_e
 
 struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark)
+		   u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie)
 {
 	struct perf_callchain_entry *entry;
 	struct perf_callchain_entry_ctx ctx;
@@ -251,6 +251,18 @@ get_perf_callchain(struct pt_regs *regs,
 			regs = task_pt_regs(current);
 		}
 
+		if (defer_cookie) {
+			/*
+			 * Foretell the coming of PERF_RECORD_CALLCHAIN_DEFERRED
+			 * which can be stitched to this one, and add
+			 * the cookie after it (it will be cut off when the
+			 * user stack is copied to the callchain).
+			 */
+			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
+			perf_callchain_store_context(&ctx, defer_cookie);
+			goto exit_put;
+		}
+
 		if (add_mark)
 			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
 
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -56,6 +56,7 @@
 #include <linux/buildid.h>
 #include <linux/task_work.h>
 #include <linux/percpu-rwsem.h>
+#include <linux/unwind_deferred.h>
 
 #include "internal.h"
 
@@ -8200,6 +8201,8 @@ static u64 perf_get_page_size(unsigned l
 
 static struct perf_callchain_entry __empty_callchain = { .nr = 0, };
 
+static struct unwind_work perf_unwind_work;
+
 struct perf_callchain_entry *
 perf_callchain(struct perf_event *event, struct pt_regs *regs)
 {
@@ -8208,8 +8211,11 @@ perf_callchain(struct perf_event *event,
 		!(current->flags & (PF_KTHREAD | PF_USER_WORKER));
 	/* Disallow cross-task user callchains. */
 	bool crosstask = event->ctx->task && event->ctx->task != current;
+	bool defer_user = IS_ENABLED(CONFIG_UNWIND_USER) && user &&
+			  event->attr.defer_callchain;
 	const u32 max_stack = event->attr.sample_max_stack;
 	struct perf_callchain_entry *callchain;
+	u64 defer_cookie;
 
 	if (!current->mm)
 		user = false;
@@ -8217,8 +8223,13 @@ perf_callchain(struct perf_event *event,
 	if (!kernel && !user)
 		return &__empty_callchain;
 
-	callchain = get_perf_callchain(regs, kernel, user,
-				       max_stack, crosstask, true);
+	if (!(user && defer_user && !crosstask &&
+	      unwind_deferred_request(&perf_unwind_work, &defer_cookie) >= 0))
+		defer_cookie = 0;
+
+	callchain = get_perf_callchain(regs, kernel, user, max_stack,
+				       crosstask, true, defer_cookie);
+
 	return callchain ?: &__empty_callchain;
 }
 
@@ -10003,6 +10014,67 @@ void perf_event_bpf_event(struct bpf_pro
 	perf_iterate_sb(perf_event_bpf_output, &bpf_event, NULL);
 }
 
+struct perf_callchain_deferred_event {
+	struct unwind_stacktrace *trace;
+	struct {
+		struct perf_event_header	header;
+		u64				cookie;
+		u64				nr;
+		u64				ips[];
+	} event;
+};
+
+static void perf_callchain_deferred_output(struct perf_event *event, void *data)
+{
+	struct perf_callchain_deferred_event *deferred_event = data;
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	int ret, size = deferred_event->event.header.size;
+
+	if (!event->attr.defer_output)
+		return;
+
+	/* XXX do we really need sample_id_all for this ??? */
+	perf_event_header__init_id(&deferred_event->event.header, &sample, event);
+
+	ret = perf_output_begin(&handle, &sample, event,
+				deferred_event->event.header.size);
+	if (ret)
+		goto out;
+
+	perf_output_put(&handle, deferred_event->event);
+	for (int i = 0; i < deferred_event->trace->nr; i++) {
+		u64 entry = deferred_event->trace->entries[i];
+		perf_output_put(&handle, entry);
+	}
+	perf_event__output_id_sample(event, &handle, &sample);
+
+	perf_output_end(&handle);
+out:
+	deferred_event->event.header.size = size;
+}
+
+/* Deferred unwinding callback for task specific events */
+static void perf_unwind_deferred_callback(struct unwind_work *work,
+					 struct unwind_stacktrace *trace, u64 cookie)
+{
+	struct perf_callchain_deferred_event deferred_event = {
+		.trace = trace,
+		.event = {
+			.header = {
+				.type = PERF_RECORD_CALLCHAIN_DEFERRED,
+				.misc = PERF_RECORD_MISC_USER,
+				.size = sizeof(deferred_event.event) +
+					(trace->nr * sizeof(u64)),
+			},
+			.cookie = cookie,
+			.nr = trace->nr,
+		},
+	};
+
+	perf_iterate_sb(perf_callchain_deferred_output, &deferred_event, NULL);
+}
+
 struct perf_text_poke_event {
 	const void		*old_bytes;
 	const void		*new_bytes;
@@ -14799,6 +14871,9 @@ void __init perf_event_init(void)
 
 	idr_init(&pmu_idr);
 
+	unwind_deferred_init(&perf_unwind_work,
+			     perf_unwind_deferred_callback);
+
 	perf_event_init_all_cpus();
 	init_srcu_struct(&pmus_srcu);
 	perf_pmu_register(&perf_swevent, "software", PERF_TYPE_SOFTWARE);
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -463,7 +463,9 @@ struct perf_event_attr {
 				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec :  1, /* event is removed from task on exec */
 				sigtrap        :  1, /* send synchronous SIGTRAP on event */
-				__reserved_1   : 26;
+				defer_callchain:  1, /* request PERF_RECORD_CALLCHAIN_DEFERRED records */
+				defer_output   :  1, /* output PERF_RECORD_CALLCHAIN_DEFERRED records */
+				__reserved_1   : 24;
 
 	union {
 		__u32		wakeup_events;	  /* wake up every n events */
@@ -1239,6 +1241,22 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_AUX_OUTPUT_HW_ID		= 21,
 
+	/*
+	 * This user callchain capture was deferred until shortly before
+	 * returning to user space.  Previous samples would have kernel
+	 * callchains only and they need to be stitched with this to make full
+	 * callchains.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u64				cookie;
+	 *	u64				nr;
+	 *	u64				ips[nr];
+	 *	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_CALLCHAIN_DEFERRED		= 22,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
@@ -1269,6 +1287,7 @@ enum perf_callchain_context {
 	PERF_CONTEXT_HV				= (__u64)-32,
 	PERF_CONTEXT_KERNEL			= (__u64)-128,
 	PERF_CONTEXT_USER			= (__u64)-512,
+	PERF_CONTEXT_USER_DEFERRED		= (__u64)-640,
 
 	PERF_CONTEXT_GUEST			= (__u64)-2048,
 	PERF_CONTEXT_GUEST_KERNEL		= (__u64)-2176,

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
  2025-10-23 15:00 ` [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure Peter Zijlstra
@ 2025-10-23 16:40   ` Steven Rostedt
  2025-10-24  8:26     ` Peter Zijlstra
  2025-10-24  9:29   ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2025-10-23 16:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
	Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell

On Thu, 23 Oct 2025 17:00:02 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> +/* Deferred unwinding callback for task specific events */
> +static void perf_unwind_deferred_callback(struct unwind_work *work,
> +					 struct unwind_stacktrace *trace, u64 cookie)
> +{
> +	struct perf_callchain_deferred_event deferred_event = {
> +		.trace = trace,
> +		.event = {
> +			.header = {
> +				.type = PERF_RECORD_CALLCHAIN_DEFERRED,
> +				.misc = PERF_RECORD_MISC_USER,
> +				.size = sizeof(deferred_event.event) +
> +					(trace->nr * sizeof(u64)),
> +			},
> +			.cookie = cookie,
> +			.nr = trace->nr,
> +		},
> +	};
> +
> +	perf_iterate_sb(perf_callchain_deferred_output, &deferred_event, NULL);
> +}
> +

So "perf_iterate_sb()" was the key point I was missing. I'm guessing it's
basically a demultiplexer that distributes events to all the requestors?

If I had known this, I would have done it completely differently.

-- Steve

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
  2025-10-23 16:40   ` Steven Rostedt
@ 2025-10-24  8:26     ` Peter Zijlstra
  2025-10-24 12:58       ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2025-10-24  8:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
	Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell

On Thu, Oct 23, 2025 at 12:40:57PM -0400, Steven Rostedt wrote:
> On Thu, 23 Oct 2025 17:00:02 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > +/* Deferred unwinding callback for task specific events */
> > +static void perf_unwind_deferred_callback(struct unwind_work *work,
> > +					 struct unwind_stacktrace *trace, u64 cookie)
> > +{
> > +	struct perf_callchain_deferred_event deferred_event = {
> > +		.trace = trace,
> > +		.event = {
> > +			.header = {
> > +				.type = PERF_RECORD_CALLCHAIN_DEFERRED,
> > +				.misc = PERF_RECORD_MISC_USER,
> > +				.size = sizeof(deferred_event.event) +
> > +					(trace->nr * sizeof(u64)),
> > +			},
> > +			.cookie = cookie,
> > +			.nr = trace->nr,
> > +		},
> > +	};
> > +
> > +	perf_iterate_sb(perf_callchain_deferred_output, &deferred_event, NULL);
> > +}
> > +
> 
> So "perf_iterate_sb()" was the key point I was missing. I'm guessing it's
> basically a demultiplexer that distributes events to all the requestors?

A superset. Basically every event in the relevant context that 'wants'
it.

It is what we use for all traditional side-band events (hence the _sb
naming) like mmap, task creation/exit, etc.

I was under the impression the perf tool would create one software dummy
event to listen specifically for these events per buffer, but alas, when
I looked at the tool this does not appear to be the case.

As a result it is possible to receive these events multiple times. And
since that is a problem that needs to be solved anyway, I didn't think
it 'relevant' in this case.
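
For illustration only (not from Steve's series or the diff above): a minimal
sketch of what such a per-buffer dummy listener could look like on the
userspace side, assuming a uapi perf_event.h that already carries the
defer_output bit proposed above; open_defer_sb_event() is a made-up helper
name:

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_defer_sb_event(int cpu)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_DUMMY;
	attr.exclude_kernel = 1;
	attr.sample_id_all = 1;
	attr.defer_output = 1;	/* assumed: new attr bit from the patch above */

	/* pid == -1: one dummy event per CPU buffer, as 'perf record -a' does */
	return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
}

The sampling events themselves would then only set defer_callchain, so the
PERF_RECORD_CALLCHAIN_DEFERRED records land once per buffer instead of once
per event.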

> If I had known this, I would have done it completely differently.

I did mention it here:

  https://lkml.kernel.org/r/20250923103213.GD3419281@noisy.programming.kicks-ass.net

Anyway, no worries. Onwards to figuring out WTF the unwinder doesn't
seem to terminate properly.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
  2025-10-23 15:00 ` [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure Peter Zijlstra
  2025-10-23 16:40   ` Steven Rostedt
@ 2025-10-24  9:29   ` Peter Zijlstra
  2025-10-24 10:41     ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2025-10-24  9:29 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Linus Torvalds, Andrew Morton, Florian Weimer,
	Sam James, Kees Cook, Carlos O'Donell

On Thu, Oct 23, 2025 at 05:00:02PM +0200, Peter Zijlstra wrote:

> Trouble is, pretty much every unwind is 510 entries long -- this cannot
> be right. I'm sure there's a silly mistake in unwind/user.c but I'm too
> tired to find it just now. I'll try again tomorrow.

PEBKAC

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
  2025-10-24  9:29   ` Peter Zijlstra
@ 2025-10-24 10:41     ` Peter Zijlstra
  2025-10-24 13:58       ` Jens Remus
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2025-10-24 10:41 UTC (permalink / raw)
  To: Steven Rostedt, jremus
  Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Linus Torvalds, Andrew Morton, Florian Weimer,
	Sam James, Kees Cook, Carlos O'Donell

On Fri, Oct 24, 2025 at 11:29:26AM +0200, Peter Zijlstra wrote:
> On Thu, Oct 23, 2025 at 05:00:02PM +0200, Peter Zijlstra wrote:
> 
> > Trouble is, pretty much every unwind is 510 entries long -- this cannot
> > be right. I'm sure there's a silly mistake in unwind/user.c but I'm too
> > tired to find it just now. I'll try again tomorrow.
> 
> PEBKAC

Anyway, while staring at this, I noted that the perf userspace unwind
code has a few bits that are missing from the new shiny thing.

How about something like so? This adds an optional arch specific unwinder
at the very highest priority (bit 0) and uses that to do a few extra
bits before disabling itself and falling back to whatever lower prio
unwinder to do the actual unwinding.

---
 arch/x86/events/core.c             |   40 ---------------------------
 arch/x86/include/asm/unwind_user.h |    4 ++
 arch/x86/include/asm/uprobes.h     |    9 ++++++
 arch/x86/kernel/unwind_user.c      |   53 +++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/uprobes.c          |   32 ++++++++++++++++++++++
 include/linux/unwind_user_types.h  |    5 ++-
 kernel/unwind/user.c               |    7 ++++
 7 files changed, 109 insertions(+), 41 deletions(-)

--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2845,46 +2845,6 @@ static unsigned long get_segment_base(un
 	return get_desc_base(desc);
 }
 
-#ifdef CONFIG_UPROBES
-/*
- * Heuristic-based check if uprobe is installed at the function entry.
- *
- * Under assumption of user code being compiled with frame pointers,
- * `push %rbp/%ebp` is a good indicator that we indeed are.
- *
- * Similarly, `endbr64` (assuming 64-bit mode) is also a common pattern.
- * If we get this wrong, captured stack trace might have one extra bogus
- * entry, but the rest of stack trace will still be meaningful.
- */
-static bool is_uprobe_at_func_entry(struct pt_regs *regs)
-{
-	struct arch_uprobe *auprobe;
-
-	if (!current->utask)
-		return false;
-
-	auprobe = current->utask->auprobe;
-	if (!auprobe)
-		return false;
-
-	/* push %rbp/%ebp */
-	if (auprobe->insn[0] == 0x55)
-		return true;
-
-	/* endbr64 (64-bit only) */
-	if (user_64bit_mode(regs) && is_endbr((u32 *)auprobe->insn))
-		return true;
-
-	return false;
-}
-
-#else
-static bool is_uprobe_at_func_entry(struct pt_regs *regs)
-{
-	return false;
-}
-#endif /* CONFIG_UPROBES */
-
 #ifdef CONFIG_IA32_EMULATION
 
 #include <linux/compat.h>
--- a/arch/x86/include/asm/unwind_user.h
+++ b/arch/x86/include/asm/unwind_user.h
@@ -8,4 +8,8 @@
 	.fp_off		= -2*(ws),			\
 	.use_fp		= true,
 
+#define HAVE_UNWIND_USER_ARCH 1
+
+extern int unwind_user_next_arch(struct unwind_user_state *state);
+
 #endif /* _ASM_X86_UNWIND_USER_H */
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -62,4 +62,13 @@ struct arch_uprobe_task {
 	unsigned int			saved_tf;
 };
 
+#ifdef CONFIG_UPROBES
+extern bool is_uprobe_at_func_entry(struct pt_regs *regs);
+#else
+static inline bool is_uprobe_at_func_entry(struct pt_regs *regs)
+{
+	return false;
+}
+#endif /* CONFIG_UPROBES */
+
 #endif	/* _ASM_UPROBES_H */
--- /dev/null
+++ b/arch/x86/kernel/unwind_user.c
@@ -0,0 +1,53 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/unwind_user.h>
+#include <linux/uprobes.h>
+#include <linux/uaccess.h>
+#include <linux/sched/task_stack.h>
+#include <asm/processor.h>
+#include <asm/tlbflush.h>
+
+int unwind_user_next_arch(struct unwind_user_state *state)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+
+	/* only once, on the first iteration */
+	state->available_types &= ~UNWIND_USER_TYPE_ARCH;
+
+	/* We don't know how to unwind VM86 stacks. */
+	if (regs->flags & X86_VM_MASK) {
+		state->done = true;
+		return 0;
+	}
+
+	/*
+	 * If we are called from uprobe handler, and we are indeed at the very
+	 * entry to user function (which is normally a `push %rbp` instruction,
+	 * under assumption of application being compiled with frame pointers),
+	 * we should read return address from *regs->sp before proceeding
+	 * to follow frame pointers, otherwise we'll skip immediate caller
+	 * as %rbp is not yet setup.
+	 */
+	if (!is_uprobe_at_func_entry(regs))
+		return -EINVAL;
+
+#ifdef CONFIG_COMPAT
+	if (state->ws == sizeof(int)) {
+		unsigned int retaddr;
+		int ret = get_user(retaddr, (unsigned int __user *)regs->sp);
+		if (ret)
+			return ret;
+
+		state->ip = retaddr;
+		return 0;
+	}
+#endif
+	unsigned long retaddr;
+	int ret = get_user(retaddr, (unsigned long __user *)regs->sp);
+	if (ret)
+		return ret;
+
+	state->ip = retaddr;
+	return 0;
+}
+
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1791,3 +1791,35 @@ bool arch_uretprobe_is_alive(struct retu
 	else
 		return regs->sp <= ret->stack;
 }
+
+/*
+ * Heuristic-based check if uprobe is installed at the function entry.
+ *
+ * Under assumption of user code being compiled with frame pointers,
+ * `push %rbp/%ebp` is a good indicator that we indeed are.
+ *
+ * Similarly, `endbr64` (assuming 64-bit mode) is also a common pattern.
+ * If we get this wrong, captured stack trace might have one extra bogus
+ * entry, but the rest of stack trace will still be meaningful.
+ */
+bool is_uprobe_at_func_entry(struct pt_regs *regs)
+{
+	struct arch_uprobe *auprobe;
+
+	if (!current->utask)
+		return false;
+
+	auprobe = current->utask->auprobe;
+	if (!auprobe)
+		return false;
+
+	/* push %rbp/%ebp */
+	if (auprobe->insn[0] == 0x55)
+		return true;
+
+	/* endbr64 (64-bit only) */
+	if (user_64bit_mode(regs) && is_endbr((u32 *)auprobe->insn))
+		return true;
+
+	return false;
+}
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -3,13 +3,15 @@
 #define _LINUX_UNWIND_USER_TYPES_H
 
 #include <linux/types.h>
+#include <linux/bits.h>
 
 /*
  * Unwind types, listed in priority order: lower numbers are attempted first if
  * available.
  */
 enum unwind_user_type_bits {
-	UNWIND_USER_TYPE_FP_BIT =		0,
+	UNWIND_USER_TYPE_ARCH_BIT = 0,
+	UNWIND_USER_TYPE_FP_BIT,
 
 	NR_UNWIND_USER_TYPE_BITS,
 };
@@ -17,6 +19,7 @@ enum unwind_user_type_bits {
 enum unwind_user_type {
 	/* Type "none" for the start of stack walk iteration. */
 	UNWIND_USER_TYPE_NONE =			0,
+	UNWIND_USER_TYPE_ARCH =			BIT(UNWIND_USER_TYPE_ARCH_BIT),
 	UNWIND_USER_TYPE_FP =			BIT(UNWIND_USER_TYPE_FP_BIT),
 };
 
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -79,6 +79,10 @@ static int unwind_user_next(struct unwin
 
 		state->current_type = type;
 		switch (type) {
+		case UNWIND_USER_TYPE_ARCH:
+			if (!unwind_user_next_arch(state))
+				return 0;
+			continue;
 		case UNWIND_USER_TYPE_FP:
 			if (!unwind_user_next_fp(state))
 				return 0;
@@ -107,6 +111,9 @@ static int unwind_user_start(struct unwi
 		return -EINVAL;
 	}
 
+	if (HAVE_UNWIND_USER_ARCH)
+		state->available_types |= UNWIND_USER_TYPE_ARCH;
+
 	if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
 		state->available_types |= UNWIND_USER_TYPE_FP;
 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
  2025-10-24  8:26     ` Peter Zijlstra
@ 2025-10-24 12:58       ` Peter Zijlstra
  2025-10-26  0:58         ` Namhyung Kim
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2025-10-24 12:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
	Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
	Florian Weimer, Sam James, Kees Cook, Carlos O'Donell


Arnaldo, Namhyung,

On Fri, Oct 24, 2025 at 10:26:56AM +0200, Peter Zijlstra wrote:

> > So "perf_iterate_sb()" was the key point I was missing. I'm guessing it's
> > basically a demultiplexer that distributes events to all the requestors?
> 
> A superset. Basically every event in the relevant context that 'wants'
> it.
> 
> It is what we use for all traditional side-band events (hence the _sb
> naming) like mmap, task creation/exit, etc.
> 
> I was under the impression the perf tool would create one software dummy
> event to listen specifically for these events per buffer, but alas, when
> I looked at the tool this does not appear to be the case.
> 
> As a result it is possible to receive these events multiple times. And
> since that is a problem that needs to be solved anyway, I didn't think
> it 'relevant' in this case.

When I use:

  perf record -ag -e cycles -e instructions

I get:

# event : name = cycles, , id = { }, type = 0 (PERF_TYPE_HARDWARE), size = 136, config = 0 (PERF_COUNT_HW_CPU_CYCLES), { sample_period, sample_freq } = 2000, sample_type = IP|TID|TIME|CALLCHAIN|CPU|PERIOD|IDENTIFIER, read_format = ID|LOST, disabled = 1, freq = 1, sample_id_all = 1, defer_callchain = 1
# event : name = instructions, , id = { }, type = 0 (PERF_TYPE_HARDWARE), size = 136, config = 0x1 (PERF_COUNT_HW_INSTRUCTIONS), { sample_period, sample_freq } = 2000, sample_type = IP|TID|TIME|CALLCHAIN|CPU|PERIOD|IDENTIFIER, read_format = ID|LOST, disabled = 1, freq = 1, sample_id_all = 1, defer_callchain = 1
# event : name = dummy:u, , id = { }, type = 1 (PERF_TYPE_SOFTWARE), size = 136, config = 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq } = 1, sample_type = IP|TID|TIME|CPU|IDENTIFIER, read_format = ID|LOST, exclude_kernel = 1, exclude_hv = 1, mmap = 1, comm = 1, task = 1, sample_id_all = 1, exclude_guest = 1, mmap2 = 1, comm_exec = 1, ksymbol = 1, bpf_event = 1, build_id = 1, defer_output = 1

And we have this dummy event I spoke of above; and it has defer_output
set, none of the others do. This is what I expected.

*However*, when I use:

  perf record -g -e cycles -e instruction

I get:

# event : name = cycles, , id = { }, type = 0 (PERF_TYPE_HARDWARE), size = 136, config = 0 (PERF_COUNT_HW_CPU_CYCLES), { sample_period, sample_freq } = 2000, sample_type = IP|TID|TIME|CALLCHAIN|ID|PERIOD, read_format = ID|LOST, disabled = 1, inherit = 1, mmap = 1, comm = 1, freq = 1, enable_on_exec = 1, task = 1, sample_id_all = 1, mmap2 = 1, comm_exec = 1, ksymbol = 1, bpf_event = 1, build_id = 1, defer_callchain = 1, defer_output = 1
# event : name = instructions, , id = { }, type = 0 (PERF_TYPE_HARDWARE), size = 136, config = 0x1 (PERF_COUNT_HW_INSTRUCTIONS), { sample_period, sample_freq } = 2000, sample_type = IP|TID|TIME|CALLCHAIN|ID|PERIOD, read_format = ID|LOST, disabled = 1, inherit = 1, freq = 1, enable_on_exec = 1, sample_id_all = 1, defer_callchain = 1

Which doesn't have a dummy event. Notably the first real event has
defer_output set (and all the other sideband stuff like mmap, comm,
etc.).

Is there a reason the !cpu mode doesn't have the dummy event? Anyway, it
should all work; it's just an unexpected inconsistency that confused me.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
  2025-10-24 10:41     ` Peter Zijlstra
@ 2025-10-24 13:58       ` Jens Remus
  2025-10-24 14:08         ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Remus @ 2025-10-24 13:58 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Linus Torvalds, Andrew Morton, Florian Weimer, Sam James,
	Kees Cook, Carlos O'Donell, Heiko Carstens, Vasily Gorbik

Hello Peter!

On 10/24/2025 12:41 PM, Peter Zijlstra wrote:
> On Fri, Oct 24, 2025 at 11:29:26AM +0200, Peter Zijlstra wrote:
>> On Thu, Oct 23, 2025 at 05:00:02PM +0200, Peter Zijlstra wrote:
>>
>>> Trouble is, pretty much every unwind is 510 entries long -- this cannot
>>> be right. I'm sure there's a silly mistake in unwind/user.c but I'm too
>>> tired to find it just now. I'll try again tomorrow.
>>
>> PEBKAC
> 
> Anyway, while staring at this, I noted that the perf userspace unwind
> code has a few bits that are missing from the new shiny thing.
> 
> How about something like so? This adds an optional arch specific unwinder
> at the very highest priority (bit 0) and uses that to do a few extra
> bits before disabling itself and falling back to whatever lower prio
> unwinder to do the actual unwinding.

unwind user sframe does not need any of this special handling, because
it knows for each IP whether the SP or FP is the CFA base register
and whether the FP and RA have been saved.

Isn't this actually specific to unwind user fp?  If the IP is at
function entry, then the FP has not been setup yet.  I think unwind user
fp could handle this using an arch specific is_uprobe_at_func_entry() to
determine whether to use a new frame_fp_entry instead of frame_fp.  For
x86 the following frame_fp_entry should work, if I am not wrong:

#define ARCH_INIT_USER_FP_ENTRY_FRAME(ws)	\
	.cfa_off	=  1*(ws),		\
	.ra_off		= -1*(ws),		\
	.fp_off		= 0,			\
	.use_fp		= false,

The following roughly outlines the required changes:

diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c

-static int unwind_user_next_fp(struct unwind_user_state *state)
+static int unwind_user_next_common(struct unwind_user_state *state,
+                                  const struct unwind_user_frame *frame,
+                                  struct pt_regs *regs)

@@ -71,6 +83,7 @@ static int unwind_user_next_common(struct unwind_user_state *state,
        state->sp = sp;
        if (frame->fp_off)
                state->fp = fp;
+       state->topmost = false;
        return 0;
 }
@@ -154,6 +167,7 @@ static int unwind_user_start(struct unwind_user_state *state)
        state->sp = user_stack_pointer(regs);
        state->fp = frame_pointer(regs);
        state->ws = compat_user_mode(regs) ? sizeof(int) : sizeof(long);
+       state->topmost = true;

        return 0;
 }

static int unwind_user_next_fp(struct unwind_user_state *state)
{
	const struct unwind_user_frame fp_frame = {
		ARCH_INIT_USER_FP_FRAME(state->ws)
	};
	const struct unwind_user_frame fp_entry_frame = {
		ARCH_INIT_USER_FP_ENTRY_FRAME(state->ws)
	};
	struct pt_regs *regs = task_pt_regs(current);

	if (state->topmost && is_uprobe_at_func_entry(regs))
		return unwind_user_next_common(state, &fp_entry_frame, regs);
	else
		return unwind_user_next_common(state, &fp_frame, regs);
}

diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
@@ -43,6 +43,7 @@ struct unwind_user_state {
        unsigned int                            ws;
        enum unwind_user_type                   current_type;
        unsigned int                            available_types;
+       bool                                    topmost;
        bool                                    done;
 };

What do you think?

> +++ b/arch/x86/kernel/unwind_user.c
> @@ -0,0 +1,53 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#include <linux/unwind_user.h>
> +#include <linux/uprobes.h>
> +#include <linux/uaccess.h>
> +#include <linux/sched/task_stack.h>
> +#include <asm/processor.h>
> +#include <asm/tlbflush.h>
> +
> +int unwind_user_next_arch(struct unwind_user_state *state)
> +{
> +	struct pt_regs *regs = task_pt_regs(current);
> +
> +	/* only once, on the first iteration */
> +	state->available_types &= ~UNWIND_USER_TYPE_ARCH;
> +
> +	/* We don't know how to unwind VM86 stacks. */
> +	if (regs->flags & X86_VM_MASK) {
> +		state->done = true;
> +		return 0;
> +	}
> +
> +	/*
> +	 * If we are called from uprobe handler, and we are indeed at the very
> +	 * entry to user function (which is normally a `push %rbp` instruction,
> +	 * under assumption of application being compiled with frame pointers),
> +	 * we should read return address from *regs->sp before proceeding
> +	 * to follow frame pointers, otherwise we'll skip immediate caller
> +	 * as %rbp is not yet setup.
> +	 */
> +	if (!is_uprobe_at_func_entry(regs))
> +		return -EINVAL;
> +
> +#ifdef CONFIG_COMPAT
> +	if (state->ws == sizeof(int)) {
> +		unsigned int retaddr;
> +		int ret = get_user(retaddr, (unsigned int __user *)regs->sp);
> +		if (ret)
> +			return ret;
> +
> +		state->ip = retaddr;
> +		return 0;
> +	}
> +#endif
> +	unsigned long retaddr;
> +	int ret = get_user(retaddr, (unsigned long __user *)regs->sp);
> +	if (ret)
> +		return ret;
> +
> +	state->ip = retaddr;
> +	return 0;
> +}

The above would then not be needed, as the unwind_user_next_fp() logic would
do the right thing.

> +++ b/arch/x86/kernel/uprobes.c
> @@ -1791,3 +1791,35 @@ bool arch_uretprobe_is_alive(struct retu
>  	else
>  		return regs->sp <= ret->stack;
>  }
> +
> +/*
> + * Heuristic-based check if uprobe is installed at the function entry.
> + *
> + * Under assumption of user code being compiled with frame pointers,
> + * `push %rbp/%ebp` is a good indicator that we indeed are.
> + *
> + * Similarly, `endbr64` (assuming 64-bit mode) is also a common pattern.
> + * If we get this wrong, captured stack trace might have one extra bogus
> + * entry, but the rest of stack trace will still be meaningful.
> + */
> +bool is_uprobe_at_func_entry(struct pt_regs *regs)
> +{
> +	struct arch_uprobe *auprobe;
> +
> +	if (!current->utask)
> +		return false;
> +
> +	auprobe = current->utask->auprobe;
> +	if (!auprobe)
> +		return false;
> +
> +	/* push %rbp/%ebp */
> +	if (auprobe->insn[0] == 0x55)
> +		return true;
> +
> +	/* endbr64 (64-bit only) */
> +	if (user_64bit_mode(regs) && is_endbr((u32 *)auprobe->insn))
> +		return true;
> +
> +	return false;
> +}
Regards,
Jens
-- 
Jens Remus
Linux on Z Development (D3303)
+49-7031-16-1128 Office
jremus@de.ibm.com

IBM

IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
  2025-10-24 13:58       ` Jens Remus
@ 2025-10-24 14:08         ` Peter Zijlstra
  2025-10-24 14:51           ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2025-10-24 14:08 UTC (permalink / raw)
  To: Jens Remus
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
	Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
	Sam James, Kees Cook, Carlos O'Donell, Heiko Carstens,
	Vasily Gorbik

On Fri, Oct 24, 2025 at 03:58:20PM +0200, Jens Remus wrote:
> Hello Peter!
> 
> On 10/24/2025 12:41 PM, Peter Zijlstra wrote:
> > On Fri, Oct 24, 2025 at 11:29:26AM +0200, Peter Zijlstra wrote:
> >> On Thu, Oct 23, 2025 at 05:00:02PM +0200, Peter Zijlstra wrote:
> >>
> >>> Trouble is, pretty much every unwind is 510 entries long -- this cannot
> >>> be right. I'm sure there's a silly mistake in unwind/user.c but I'm too
> >>> tired to find it just now. I'll try again tomorrow.
> >>
> >> PEBKAC
> > 
> > Anyway, while staring at this, I noted that the perf userspace unwind
> > code has a few bits that are missing from the new shiny thing.
> > 
> > How about something like so? This adds an optional arch specific unwinder
> > at the very highest priority (bit 0) and uses that to do a few extra
> > bits before disabling itself and falling back to whatever lower prio
> > unwinder to do the actual unwinding.
> 
> unwind user sframe does not need any of this special handling, because
> it knows for each IP whether the SP or FP is the CFA base register
> and whether the FP and RA have been saved.

It still can't unwind VM86 stacks. But yes, it should do lots better
with that start of function hack.

> Isn't this actually specific to unwind user fp?  If the IP is at
> function entry, then the FP has not been setup yet.  I think unwind user
> fp could handle this using an arch specific is_uprobe_at_func_entry() to
> determine whether to use a new frame_fp_entry instead of frame_fp.  For
> x86 the following frame_fp_entry should work, if I am not wrong:
> 
> #define ARCH_INIT_USER_FP_ENTRY_FRAME(ws)	\
> 	.cfa_off	=  1*(ws),		\
> 	.ra_off		= -1*(ws),		\
> 	.fp_off		= 0,			\
> 	.use_fp		= false,
> 
> Following roughly outlines the required changes:
> 
> diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
> 
> -static int unwind_user_next_fp(struct unwind_user_state *state)
> +static int unwind_user_next_common(struct unwind_user_state *state,
> +                                  const struct unwind_user_frame *frame,
> +                                  struct pt_regs *regs)
> 
> @@ -71,6 +83,7 @@ static int unwind_user_next_common(struct unwind_user_state *state,
>         state->sp = sp;
>         if (frame->fp_off)
>                 state->fp = fp;
> +       state->topmost = false;
>         return 0;
>  }
> @@ -154,6 +167,7 @@ static int unwind_user_start(struct unwind_user_state *state)
>         state->sp = user_stack_pointer(regs);
>         state->fp = frame_pointer(regs);
>         state->ws = compat_user_mode(regs) ? sizeof(int) : sizeof(long);
> +       state->topmost = true;
> 
>         return 0;
>  }
> 
> static int unwind_user_next_fp(struct unwind_user_state *state)
> {
> 	const struct unwind_user_frame fp_frame = {
> 		ARCH_INIT_USER_FP_FRAME(state->ws)
> 	};
> 	const struct unwind_user_frame fp_entry_frame = {
> 		ARCH_INIT_USER_FP_ENTRY_FRAME(state->ws)
> 	};
> 	struct pt_regs *regs = task_pt_regs(current);
> 
> 	if (state->topmost && is_uprobe_at_func_entry(regs))
> 		return unwind_user_next_common(state, &fp_entry_frame, regs);
> 	else
> 		return unwind_user_next_common(state, &fp_frame, regs);
> }
> 
> diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
> @@ -43,6 +43,7 @@ struct unwind_user_state {
>         unsigned int                            ws;
>         enum unwind_user_type                   current_type;
>         unsigned int                            available_types;
> +       bool                                    topmost;
>         bool                                    done;
>  };
> 
> What do you think?

Yeah, I suppose that should work. Let me rework things accordingly.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
  2025-10-24 14:08         ` Peter Zijlstra
@ 2025-10-24 14:51           ` Peter Zijlstra
  2025-10-24 14:54             ` Peter Zijlstra
                               ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Peter Zijlstra @ 2025-10-24 14:51 UTC (permalink / raw)
  To: Jens Remus
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
	Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
	Sam James, Kees Cook, Carlos O'Donell, Heiko Carstens,
	Vasily Gorbik

On Fri, Oct 24, 2025 at 04:08:15PM +0200, Peter Zijlstra wrote:

> Yeah, I suppose that should work. Let me rework things accordingly.

---
Subject: unwind_user/x86: Teach FP unwind about start of function
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri Oct 24 12:31:10 CEST 2025

When userspace is interrupted at the start of a function, before we
get a chance to complete the frame, unwind will miss one caller.

X86 has a uprobe specific fixup for this, add bits to the generic
unwinder to support this.

Suggested-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/events/core.c             |   40 -------------------------------------
 arch/x86/include/asm/unwind_user.h |   12 +++++++++++
 arch/x86/include/asm/uprobes.h     |    9 ++++++++
 arch/x86/kernel/uprobes.c          |   32 +++++++++++++++++++++++++++++
 include/linux/unwind_user_types.h  |    1 
 kernel/unwind/user.c               |   35 ++++++++++++++++++++++++--------
 6 files changed, 80 insertions(+), 49 deletions(-)

--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2845,46 +2845,6 @@ static unsigned long get_segment_base(un
 	return get_desc_base(desc);
 }
 
-#ifdef CONFIG_UPROBES
-/*
- * Heuristic-based check if uprobe is installed at the function entry.
- *
- * Under assumption of user code being compiled with frame pointers,
- * `push %rbp/%ebp` is a good indicator that we indeed are.
- *
- * Similarly, `endbr64` (assuming 64-bit mode) is also a common pattern.
- * If we get this wrong, captured stack trace might have one extra bogus
- * entry, but the rest of stack trace will still be meaningful.
- */
-static bool is_uprobe_at_func_entry(struct pt_regs *regs)
-{
-	struct arch_uprobe *auprobe;
-
-	if (!current->utask)
-		return false;
-
-	auprobe = current->utask->auprobe;
-	if (!auprobe)
-		return false;
-
-	/* push %rbp/%ebp */
-	if (auprobe->insn[0] == 0x55)
-		return true;
-
-	/* endbr64 (64-bit only) */
-	if (user_64bit_mode(regs) && is_endbr((u32 *)auprobe->insn))
-		return true;
-
-	return false;
-}
-
-#else
-static bool is_uprobe_at_func_entry(struct pt_regs *regs)
-{
-	return false;
-}
-#endif /* CONFIG_UPROBES */
-
 #ifdef CONFIG_IA32_EMULATION
 
 #include <linux/compat.h>
--- a/arch/x86/include/asm/unwind_user.h
+++ b/arch/x86/include/asm/unwind_user.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_UNWIND_USER_H
 
 #include <asm/ptrace.h>
+#include <asm/uprobes.h>
 
 #define ARCH_INIT_USER_FP_FRAME(ws)			\
 	.cfa_off	=  2*(ws),			\
@@ -10,6 +11,12 @@
 	.fp_off		= -2*(ws),			\
 	.use_fp		= true,
 
+#define ARCH_INIT_USER_FP_ENTRY_FRAME(ws)		\
+	.cfa_off	=  1*(ws),			\
+	.ra_off		= -1*(ws),			\
+	.fp_off		= 0,				\
+	.use_fp		= false,
+
 static inline int unwind_user_word_size(struct pt_regs *regs)
 {
 	/* We can't unwind VM86 stacks */
@@ -22,4 +29,9 @@ static inline int unwind_user_word_size(
 	return sizeof(long);
 }
 
+static inline bool unwind_user_at_function_start(struct pt_regs *regs)
+{
+	return is_uprobe_at_func_entry(regs);
+}
+
 #endif /* _ASM_X86_UNWIND_USER_H */
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -62,4 +62,13 @@ struct arch_uprobe_task {
 	unsigned int			saved_tf;
 };
 
+#ifdef CONFIG_UPROBES
+extern bool is_uprobe_at_func_entry(struct pt_regs *regs);
+#else
+static inline bool is_uprobe_at_func_entry(struct pt_regs *regs)
+{
+	return false;
+}
+#endif /* CONFIG_UPROBES */
+
 #endif	/* _ASM_UPROBES_H */
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1791,3 +1791,35 @@ bool arch_uretprobe_is_alive(struct retu
 	else
 		return regs->sp <= ret->stack;
 }
+
+/*
+ * Heuristic-based check if uprobe is installed at the function entry.
+ *
+ * Under assumption of user code being compiled with frame pointers,
+ * `push %rbp/%ebp` is a good indicator that we indeed are.
+ *
+ * Similarly, `endbr64` (assuming 64-bit mode) is also a common pattern.
+ * If we get this wrong, captured stack trace might have one extra bogus
+ * entry, but the rest of stack trace will still be meaningful.
+ */
+bool is_uprobe_at_func_entry(struct pt_regs *regs)
+{
+	struct arch_uprobe *auprobe;
+
+	if (!current->utask)
+		return false;
+
+	auprobe = current->utask->auprobe;
+	if (!auprobe)
+		return false;
+
+	/* push %rbp/%ebp */
+	if (auprobe->insn[0] == 0x55)
+		return true;
+
+	/* endbr64 (64-bit only) */
+	if (user_64bit_mode(regs) && is_endbr((u32 *)auprobe->insn))
+		return true;
+
+	return false;
+}
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -39,6 +39,7 @@ struct unwind_user_state {
 	unsigned int				ws;
 	enum unwind_user_type			current_type;
 	unsigned int				available_types;
+	bool					topmost;
 	bool					done;
 };
 
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -26,14 +26,12 @@ get_user_word(unsigned long *word, unsig
 	return get_user(*word, addr);
 }
 
-static int unwind_user_next_fp(struct unwind_user_state *state)
+static int unwind_user_next_common(struct unwind_user_state *state,
+				   const struct unwind_user_frame *frame)
 {
-	const struct unwind_user_frame frame = {
-		ARCH_INIT_USER_FP_FRAME(state->ws)
-	};
 	unsigned long cfa, fp, ra;
 
-	if (frame.use_fp) {
+	if (frame->use_fp) {
 		if (state->fp < state->sp)
 			return -EINVAL;
 		cfa = state->fp;
@@ -42,7 +40,7 @@ static int unwind_user_next_fp(struct un
 	}
 
 	/* Get the Canonical Frame Address (CFA) */
-	cfa += frame.cfa_off;
+	cfa += frame->cfa_off;
 
 	/* stack going in wrong direction? */
 	if (cfa <= state->sp)
@@ -53,19 +51,37 @@ static int unwind_user_next_fp(struct un
 		return -EINVAL;
 
 	/* Find the Return Address (RA) */
-	if (get_user_word(&ra, cfa, frame.ra_off, state->ws))
+	if (get_user_word(&ra, cfa, frame->ra_off, state->ws))
 		return -EINVAL;
 
-	if (frame.fp_off && get_user_word(&fp, cfa, frame.fp_off, state->ws))
+	if (frame->fp_off && get_user_word(&fp, cfa, frame->fp_off, state->ws))
 		return -EINVAL;
 
 	state->ip = ra;
 	state->sp = cfa;
-	if (frame.fp_off)
+	if (frame->fp_off)
 		state->fp = fp;
+	state->topmost = false;
 	return 0;
 }
 
+static int unwind_user_next_fp(struct unwind_user_state *state)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+
+	const struct unwind_user_frame fp_frame = {
+		ARCH_INIT_USER_FP_FRAME(state->ws)
+	};
+	const struct unwind_user_frame fp_entry_frame = {
+		ARCH_INIT_USER_FP_ENTRY_FRAME(state->ws)
+	};
+
+	if (state->topmost && unwind_user_at_function_start(regs))
+		return unwind_user_next_common(state, &fp_entry_frame);
+
+	return unwind_user_next_common(state, &fp_frame);
+}
+
 static int unwind_user_next(struct unwind_user_state *state)
 {
 	unsigned long iter_mask = state->available_types;
@@ -118,6 +134,7 @@ static int unwind_user_start(struct unwi
 		state->done = true;
 		return -EINVAL;
 	}
+	state->topmost = true;
 
 	return 0;
 }

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
  2025-10-24 14:51           ` Peter Zijlstra
@ 2025-10-24 14:54             ` Peter Zijlstra
  2025-10-24 14:57               ` Peter Zijlstra
  2025-10-24 15:09             ` Jens Remus
  2025-11-04 11:22             ` Florian Weimer
  2 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2025-10-24 14:54 UTC (permalink / raw)
  To: Jens Remus
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
	Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
	Sam James, Kees Cook, Carlos O'Donell, Heiko Carstens,
	Vasily Gorbik

On Fri, Oct 24, 2025 at 04:51:56PM +0200, Peter Zijlstra wrote:

> --- a/arch/x86/include/asm/unwind_user.h
> +++ b/arch/x86/include/asm/unwind_user.h
> @@ -3,6 +3,7 @@
>  #define _ASM_X86_UNWIND_USER_H
>  
>  #include <asm/ptrace.h>
> +#include <asm/uprobes.h>
>  
>  #define ARCH_INIT_USER_FP_FRAME(ws)			\
>  	.cfa_off	=  2*(ws),			\
> @@ -10,6 +11,12 @@
>  	.fp_off		= -2*(ws),			\
>  	.use_fp		= true,
>  
> +#define ARCH_INIT_USER_FP_ENTRY_FRAME(ws)		\
> +	.cfa_off	=  1*(ws),			\
> +	.ra_off		= -1*(ws),			\
> +	.fp_off		= 0,				\
> +	.use_fp		= false,
> +
>  static inline int unwind_user_word_size(struct pt_regs *regs)
>  {
>  	/* We can't unwind VM86 stacks */
> @@ -22,4 +29,9 @@ static inline int unwind_user_word_size(
>  	return sizeof(long);
>  }
>  
> +static inline bool unwind_user_at_function_start(struct pt_regs *regs)
> +{
> +	return is_uprobe_at_func_entry(regs);
> +}
> +
>  #endif /* _ASM_X86_UNWIND_USER_H */

> --- a/include/linux/unwind_user_types.h
> +++ b/include/linux/unwind_user_types.h
> @@ -39,6 +39,7 @@ struct unwind_user_state {
>  	unsigned int				ws;
>  	enum unwind_user_type			current_type;
>  	unsigned int				available_types;
> +	bool					topmost;
>  	bool					done;
>  };
>  
> --- a/kernel/unwind/user.c
> +++ b/kernel/unwind/user.c

>  
> +static int unwind_user_next_fp(struct unwind_user_state *state)
> +{
> +	struct pt_regs *regs = task_pt_regs(current);
> +
> +	const struct unwind_user_frame fp_frame = {
> +		ARCH_INIT_USER_FP_FRAME(state->ws)
> +	};
> +	const struct unwind_user_frame fp_entry_frame = {
> +		ARCH_INIT_USER_FP_ENTRY_FRAME(state->ws)
> +	};
> +
> +	if (state->topmost && unwind_user_at_function_start(regs))
> +		return unwind_user_next_common(state, &fp_entry_frame);
> +
> +	return unwind_user_next_common(state, &fp_frame);
> +}
> +
>  static int unwind_user_next(struct unwind_user_state *state)
>  {
>  	unsigned long iter_mask = state->available_types;
> @@ -118,6 +134,7 @@ static int unwind_user_start(struct unwi
>  		state->done = true;
>  		return -EINVAL;
>  	}
> +	state->topmost = true;
>  
>  	return 0;
>  }

And right before sending this, I realized we could do the
unwind_user_at_function_start() in unwind_user_start() and set something
like state->entry = true instead of topmost.

That saves having to do task_pt_regs() in unwind_user_next_fp().

Does that make sense?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
  2025-10-24 14:54             ` Peter Zijlstra
@ 2025-10-24 14:57               ` Peter Zijlstra
  2025-10-24 15:10                 ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2025-10-24 14:57 UTC (permalink / raw)
  To: Jens Remus
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
	Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
	Sam James, Kees Cook, Carlos O'Donell, Heiko Carstens,
	Vasily Gorbik

On Fri, Oct 24, 2025 at 04:54:02PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 24, 2025 at 04:51:56PM +0200, Peter Zijlstra wrote:
> 
> > --- a/arch/x86/include/asm/unwind_user.h
> > +++ b/arch/x86/include/asm/unwind_user.h
> > @@ -3,6 +3,7 @@
> >  #define _ASM_X86_UNWIND_USER_H
> >  
> >  #include <asm/ptrace.h>
> > +#include <asm/uprobes.h>
> >  
> >  #define ARCH_INIT_USER_FP_FRAME(ws)			\
> >  	.cfa_off	=  2*(ws),			\
> > @@ -10,6 +11,12 @@
> >  	.fp_off		= -2*(ws),			\
> >  	.use_fp		= true,
> >  
> > +#define ARCH_INIT_USER_FP_ENTRY_FRAME(ws)		\
> > +	.cfa_off	=  1*(ws),			\
> > +	.ra_off		= -1*(ws),			\
> > +	.fp_off		= 0,				\
> > +	.use_fp		= false,
> > +
> >  static inline int unwind_user_word_size(struct pt_regs *regs)
> >  {
> >  	/* We can't unwind VM86 stacks */
> > @@ -22,4 +29,9 @@ static inline int unwind_user_word_size(
> >  	return sizeof(long);
> >  }
> >  
> > +static inline bool unwind_user_at_function_start(struct pt_regs *regs)
> > +{
> > +	return is_uprobe_at_func_entry(regs);
> > +}
> > +
> >  #endif /* _ASM_X86_UNWIND_USER_H */
> 
> > --- a/include/linux/unwind_user_types.h
> > +++ b/include/linux/unwind_user_types.h
> > @@ -39,6 +39,7 @@ struct unwind_user_state {
> >  	unsigned int				ws;
> >  	enum unwind_user_type			current_type;
> >  	unsigned int				available_types;
> > +	bool					topmost;
> >  	bool					done;
> >  };
> >  
> > --- a/kernel/unwind/user.c
> > +++ b/kernel/unwind/user.c
> 
> >  
> > +static int unwind_user_next_fp(struct unwind_user_state *state)
> > +{
> > +	struct pt_regs *regs = task_pt_regs(current);
> > +
> > +	const struct unwind_user_frame fp_frame = {
> > +		ARCH_INIT_USER_FP_FRAME(state->ws)
> > +	};
> > +	const struct unwind_user_frame fp_entry_frame = {
> > +		ARCH_INIT_USER_FP_ENTRY_FRAME(state->ws)
> > +	};
> > +
> > +	if (state->topmost && unwind_user_at_function_start(regs))
> > +		return unwind_user_next_common(state, &fp_entry_frame);
> > +
> > +	return unwind_user_next_common(state, &fp_frame);
> > +}
> > +
> >  static int unwind_user_next(struct unwind_user_state *state)
> >  {
> >  	unsigned long iter_mask = state->available_types;
> > @@ -118,6 +134,7 @@ static int unwind_user_start(struct unwi
> >  		state->done = true;
> >  		return -EINVAL;
> >  	}
> > +	state->topmost = true;
> >  
> >  	return 0;
> >  }
> 
> And right before sending this; I realized we could do the
> unwind_user_at_function_start() in unwind_user_start() and set something
> like state->entry = true instead of topmost.
> 
> That saves having to do task_pt_regs() in unwind_user_next_fp().
> 
> Does that make sense?

Urgh, that makes us call that weird hack for sframe too, which isn't
needed. Oh well, ignore this.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
  2025-10-24 14:51           ` Peter Zijlstra
  2025-10-24 14:54             ` Peter Zijlstra
@ 2025-10-24 15:09             ` Jens Remus
  2025-10-24 15:11               ` Peter Zijlstra
  2025-11-04 11:22             ` Florian Weimer
  2 siblings, 1 reply; 22+ messages in thread
From: Jens Remus @ 2025-10-24 15:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
	Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
	Sam James, Kees Cook, Carlos O'Donell, Heiko Carstens,
	Vasily Gorbik

Hello Peter,

very nice!

On 10/24/2025 4:51 PM, Peter Zijlstra wrote:

> Subject: unwind_user/x86: Teach FP unwind about start of function
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri Oct 24 12:31:10 CEST 2025
> 
> When userspace is interrupted at the start of a function, before we
> get a chance to complete the frame, unwind will miss one caller.
> 
> X86 has a uprobe specific fixup for this, add bits to the generic
> unwinder to support this.
> 
> Suggested-by: Jens Remus <jremus@linux.ibm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

> +++ b/kernel/unwind/user.c

> +static int unwind_user_next_fp(struct unwind_user_state *state)
> +{
> +	struct pt_regs *regs = task_pt_regs(current);
> +
> +	const struct unwind_user_frame fp_frame = {
> +		ARCH_INIT_USER_FP_FRAME(state->ws)
> +	};
> +	const struct unwind_user_frame fp_entry_frame = {
> +		ARCH_INIT_USER_FP_ENTRY_FRAME(state->ws)
> +	};
> +
> +	if (state->topmost && unwind_user_at_function_start(regs))
> +		return unwind_user_next_common(state, &fp_entry_frame);

IIUC this will cause kernel/unwind/user.c to fail to compile on
architectures that will support HAVE_UNWIND_USER_SFRAME but not
HAVE_UNWIND_USER_FP (such as s390), and thus do not need to implement
unwind_user_at_function_start().

Either s390 would need to supply a dummy unwind_user_at_function_start(),
or the unwind user sframe series needs to address this and supply a
dummy one when FP is not enabled, so that the code compiles with only
SFRAME enabled.

What do you think?

> +
> +	return unwind_user_next_common(state, &fp_frame);
> +}
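
A minimal sketch of the first option above (a dummy helper so the generic
code still compiles when only SFRAME is available); the placement and the
CONFIG_HAVE_UNWIND_USER_FP guard are assumptions, not something taken from
the series:

/* e.g. in the arch's asm/unwind_user.h, or as a generic fallback */
#ifndef CONFIG_HAVE_UNWIND_USER_FP
static inline bool unwind_user_at_function_start(struct pt_regs *regs)
{
	/* No FP unwind on this arch, the entry-frame fixup never applies. */
	return false;
}
#endif
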
Thanks and regards,
Jens
-- 
Jens Remus
Linux on Z Development (D3303)
+49-7031-16-1128 Office
jremus@de.ibm.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
  2025-10-24 14:57               ` Peter Zijlstra
@ 2025-10-24 15:10                 ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2025-10-24 15:10 UTC (permalink / raw)
  To: Jens Remus
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
	Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
	Sam James, Kees Cook, Carlos O'Donell, Heiko Carstens,
	Vasily Gorbik

On Fri, Oct 24, 2025 at 04:57:35PM +0200, Peter Zijlstra wrote:

> Urgh, that makes us call that weird hack for sframe too, which isn't
> needed. Oh well, ignore this.

I've decided to stop tinkering for today and pushed out the lot into:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core

It seems to build and work on the one test build I did, so fingers
crossed.

If there is anything you want changed, please holler, I'll not push to
tip until at least Monday anyway.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
  2025-10-24 15:09             ` Jens Remus
@ 2025-10-24 15:11               ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2025-10-24 15:11 UTC (permalink / raw)
  To: Jens Remus
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar,
	Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
	Sam James, Kees Cook, Carlos O'Donell, Heiko Carstens,
	Vasily Gorbik

On Fri, Oct 24, 2025 at 05:09:02PM +0200, Jens Remus wrote:
> Hello Peter,
> 
> very nice!
> 
> On 10/24/2025 4:51 PM, Peter Zijlstra wrote:
> 
> > Subject: unwind_user/x86: Teach FP unwind about start of function
> > From: Peter Zijlstra <peterz@infradead.org>
> > Date: Fri Oct 24 12:31:10 CEST 2025
> > 
> > When userspace is interrupted at the start of a function, before we
> > get a chance to complete the frame, unwind will miss one caller.
> > 
> > X86 has a uprobe specific fixup for this, add bits to the generic
> > unwinder to support this.
> > 
> > Suggested-by: Jens Remus <jremus@linux.ibm.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> > +++ b/kernel/unwind/user.c
> 
> > +static int unwind_user_next_fp(struct unwind_user_state *state)
> > +{
> > +	struct pt_regs *regs = task_pt_regs(current);
> > +
> > +	const struct unwind_user_frame fp_frame = {
> > +		ARCH_INIT_USER_FP_FRAME(state->ws)
> > +	};
> > +	const struct unwind_user_frame fp_entry_frame = {
> > +		ARCH_INIT_USER_FP_ENTRY_FRAME(state->ws)
> > +	};
> > +
> > +	if (state->topmost && unwind_user_at_function_start(regs))
> > +		return unwind_user_next_common(state, &fp_entry_frame);
> 
> IIUC this will cause kernel/unwind/user.c to fail to compile on
> architectures that will support HAVE_UNWIND_USER_SFRAME but not
> HAVE_UNWIND_USER_FP (such as s390), and thus do not need to implement
> unwind_user_at_function_start().
> 
> Either s390 would need to supply a dummy unwind_user_at_function_start(),
> or the unwind user sframe series needs to address this and supply a
> dummy one when FP is not enabled, so that the code compiles with only
> SFRAME enabled.
> 
> What do you think?

I'll make it conditional on HAVE_UNWIND_USER_FP -- but tomorrow or so.
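
One possible shape of that conditional, as a sketch only (the eventual
patch may look different, and whether any fallback is needed at all when
the FP unwinder type is never selected is an open question); the function
body is the one from the quoted patch:

/* kernel/unwind/user.c */
#ifdef CONFIG_HAVE_UNWIND_USER_FP
static int unwind_user_next_fp(struct unwind_user_state *state)
{
	struct pt_regs *regs = task_pt_regs(current);

	const struct unwind_user_frame fp_frame = {
		ARCH_INIT_USER_FP_FRAME(state->ws)
	};
	const struct unwind_user_frame fp_entry_frame = {
		ARCH_INIT_USER_FP_ENTRY_FRAME(state->ws)
	};

	if (state->topmost && unwind_user_at_function_start(regs))
		return unwind_user_next_common(state, &fp_entry_frame);

	return unwind_user_next_common(state, &fp_frame);
}
#else
static int unwind_user_next_fp(struct unwind_user_state *state)
{
	return -EINVAL;		/* FP unwind not available on this arch */
}
#endif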

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
  2025-10-24 12:58       ` Peter Zijlstra
@ 2025-10-26  0:58         ` Namhyung Kim
  0 siblings, 0 replies; 22+ messages in thread
From: Namhyung Kim @ 2025-10-26  0:58 UTC (permalink / raw)
  To: Peter Zijlstra, Adrian Hunter
  Cc: Steven Rostedt, Steven Rostedt, linux-kernel, linux-trace-kernel,
	bpf, x86, Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Linus Torvalds, Andrew Morton, Florian Weimer,
	Sam James, Kees Cook, Carlos O'Donell

Hi Peter,

On Fri, Oct 24, 2025 at 02:58:41PM +0200, Peter Zijlstra wrote:
> 
> Arnaldo, Namhyung,
> 
> On Fri, Oct 24, 2025 at 10:26:56AM +0200, Peter Zijlstra wrote:
> 
> > > So "perf_iterate_sb()" was the key point I was missing. I'm guessing it's
> > > basically a demultiplexer that distributes events to all the requestors?
> > 
> > A superset. Basically every event in the relevant context that 'wants'
> > it.
> > 
> > It is what we use for all traditional side-band events (hence the _sb
> > naming) like mmap, task creation/exit, etc.
> > 
> > I was under the impression the perf tool would create one software dummy
> > event per buffer to listen specifically for these events, but alas, when
> > I looked at the tool this does not appear to be the case.
> > 
> > As a result it is possible to receive these events multiple times. And
> > since that is a problem that needs to be solved anyway, I didn't think
> > it was 'relevant' in this case.
> 
> When I use:
> 
>   perf record -ag -e cycles -e instructions
> 
> I get:
> 
> # event : name = cycles, , id = { }, type = 0 (PERF_TYPE_HARDWARE), size = 136, config = 0 (PERF_COUNT_HW_CPU_CYCLES), { sample_period, sample_freq } = 2000, sample_type = IP|TID|TIME|CALLCHAIN|CPU|PERIOD|IDENTIFIER, read_format = ID|LOST, disabled = 1, freq = 1, sample_id_all = 1, defer_callchain = 1
> # event : name = instructions, , id = { }, type = 0 (PERF_TYPE_HARDWARE), size = 136, config = 0x1 (PERF_COUNT_HW_INSTRUCTIONS), { sample_period, sample_freq } = 2000, sample_type = IP|TID|TIME|CALLCHAIN|CPU|PERIOD|IDENTIFIER, read_format = ID|LOST, disabled = 1, freq = 1, sample_id_all = 1, defer_callchain = 1
> # event : name = dummy:u, , id = { }, type = 1 (PERF_TYPE_SOFTWARE), size = 136, config = 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq } = 1, sample_type = IP|TID|TIME|CPU|IDENTIFIER, read_format = ID|LOST, exclude_kernel = 1, exclude_hv = 1, mmap = 1, comm = 1, task = 1, sample_id_all = 1, exclude_guest = 1, mmap2 = 1, comm_exec = 1, ksymbol = 1, bpf_event = 1, build_id = 1, defer_output = 1
> 
> And we have this dummy event I spoke of above; and it has defer_output
> set, none of the others do. This is what I expected.
> 
> *However*, when I use:
> 
>   perf record -g -e cycles -e instructions
> 
> I get:
> 
> # event : name = cycles, , id = { }, type = 0 (PERF_TYPE_HARDWARE), size = 136, config = 0 (PERF_COUNT_HW_CPU_CYCLES), { sample_period, sample_freq } = 2000, sample_type = IP|TID|TIME|CALLCHAIN|ID|PERIOD, read_format = ID|LOST, disabled = 1, inherit = 1, mmap = 1, comm = 1, freq = 1, enable_on_exec = 1, task = 1, sample_id_all = 1, mmap2 = 1, comm_exec = 1, ksymbol = 1, bpf_event = 1, build_id = 1, defer_callchain = 1, defer_output = 1
> # event : name = instructions, , id = { }, type = 0 (PERF_TYPE_HARDWARE), size = 136, config = 0x1 (PERF_COUNT_HW_INSTRUCTIONS), { sample_period, sample_freq } = 2000, sample_type = IP|TID|TIME|CALLCHAIN|ID|PERIOD, read_format = ID|LOST, disabled = 1, inherit = 1, freq = 1, enable_on_exec = 1, sample_id_all = 1, defer_callchain = 1
> 
> Which doesn't have a dummy event. Notably the first real event has
> defer_output set (and all the other sideband stuff like mmap, comm,
> etc.).
> 
> Is there a reason the !cpu mode doesn't have the dummy event? Anyway, it
> should all work; it's just an unexpected inconsistency that confused me.

Right, I don't remember why.  I think there's no reason to do it for
system-wide mode only.

Adrian, do you have any idea?  I have a vague memory that you worked on
this in the past.
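
For context, a minimal sketch of the kind of per-buffer dummy side-band
event being discussed, with the attr values mirroring the dump above; the
defer_output bit is new in this series and not in mainline uapi headers,
so it is only indicated in a comment, and the helper name is made up:

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_sideband_dummy(int cpu)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_DUMMY;
	attr.sample_period = 1;
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
			   PERF_SAMPLE_TIME | PERF_SAMPLE_CPU |
			   PERF_SAMPLE_IDENTIFIER;
	attr.read_format = PERF_FORMAT_ID | PERF_FORMAT_LOST;
	attr.exclude_kernel = 1;
	attr.exclude_hv = 1;
	attr.mmap = 1;
	attr.comm = 1;
	attr.task = 1;
	attr.sample_id_all = 1;
	/* attr.defer_output = 1;	new bit from this series */

	/* pid = -1: per-CPU event, as with perf record -a */
	return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
}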

Thanks,
Namhyung


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
  2025-10-24 14:51           ` Peter Zijlstra
  2025-10-24 14:54             ` Peter Zijlstra
  2025-10-24 15:09             ` Jens Remus
@ 2025-11-04 11:22             ` Florian Weimer
  2025-11-05 13:08               ` Peter Zijlstra
  2 siblings, 1 reply; 22+ messages in thread
From: Florian Weimer @ 2025-11-04 11:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jens Remus, Steven Rostedt, linux-kernel, linux-trace-kernel, bpf,
	x86, Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Linus Torvalds, Andrew Morton, Sam James,
	Kees Cook, Carlos O'Donell, Heiko Carstens, Vasily Gorbik

* Peter Zijlstra:

> +/*
> + * Heuristic-based check if uprobe is installed at the function entry.
> + *
> + * Under assumption of user code being compiled with frame pointers,
> + * `push %rbp/%ebp` is a good indicator that we indeed are.
> + *
> + * Similarly, `endbr64` (assuming 64-bit mode) is also a common pattern.
> + * If we get this wrong, captured stack trace might have one extra bogus
> + * entry, but the rest of stack trace will still be meaningful.
> + */
> +bool is_uprobe_at_func_entry(struct pt_regs *regs)

Is this specifically for uprobes?  Wouldn't it make sense to tell the
kernel when the uprobe is installed whether the frame pointer has been
set up at this point?  Userspace can typically figure this out easily
enough (it's not much more difficult than finding the address of the
function).

Thanks,
Florian


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure
  2025-11-04 11:22             ` Florian Weimer
@ 2025-11-05 13:08               ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2025-11-05 13:08 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Jens Remus, Steven Rostedt, linux-kernel, linux-trace-kernel, bpf,
	x86, Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Linus Torvalds, Andrew Morton, Sam James,
	Kees Cook, Carlos O'Donell, Heiko Carstens, Vasily Gorbik

On Tue, Nov 04, 2025 at 12:22:01PM +0100, Florian Weimer wrote:
> * Peter Zijlstra:
> 
> > +/*
> > + * Heuristic-based check if uprobe is installed at the function entry.
> > + *
> > + * Under assumption of user code being compiled with frame pointers,
> > + * `push %rbp/%ebp` is a good indicator that we indeed are.
> > + *
> > + * Similarly, `endbr64` (assuming 64-bit mode) is also a common pattern.
> > + * If we get this wrong, captured stack trace might have one extra bogus
> > + * entry, but the rest of stack trace will still be meaningful.
> > + */
> > +bool is_uprobe_at_func_entry(struct pt_regs *regs)
> 
> Is this specifically for uprobes?  Wouldn't it make sense to tell the
> kernel when the uprobe is installed whether the frame pointer has been
> set up at this point?  Userspace can typically figure this out easily
> enough (it's not much more difficult to find the address of the
> function).

Yeah, I suppose so. Not sure the actual user interface for this allows
for that. Someone would have to dig into that a bit.
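
For illustration, a sketch of the kind of heuristic that comment describes;
the byte patterns are the standard x86 encodings (0x55 for push %rbp,
f3 0f 1e fa for endbr64), but the helper name is made up and the in-tree
code works off the uprobe's copied instruction rather than a raw read of
user memory, so treat this purely as a simplification:

#include <linux/ptrace.h>
#include <linux/uaccess.h>

static bool looks_like_func_entry(struct pt_regs *regs)
{
	u8 insn[4];

	if (copy_from_user_nofault(insn,
				   (void __user *)instruction_pointer(regs),
				   sizeof(insn)))
		return false;

	/* push %rbp / %ebp */
	if (insn[0] == 0x55)
		return true;

	/* endbr64 */
	if (insn[0] == 0xf3 && insn[1] == 0x0f &&
	    insn[2] == 0x1e && insn[3] == 0xfa)
		return true;

	return false;
}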

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2025-11-05 13:08 UTC | newest]

Thread overview: 22+ messages
2025-10-07 21:40 [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure Steven Rostedt
2025-10-07 21:40 ` [PATCH v16 1/4] unwind: Add interface to allow tracing a single task Steven Rostedt
2025-10-07 21:40 ` [PATCH v16 2/4] perf: Support deferred user callchains Steven Rostedt
2025-10-07 21:40 ` [PATCH v16 3/4] perf: Have the deferred request record the user context cookie Steven Rostedt
2025-10-07 21:40 ` [PATCH v16 4/4] perf: Support deferred user callchains for per CPU events Steven Rostedt
2025-10-23 15:00 ` [PATCH v16 0/4] perf: Support the deferred unwinding infrastructure Peter Zijlstra
2025-10-23 16:40   ` Steven Rostedt
2025-10-24  8:26     ` Peter Zijlstra
2025-10-24 12:58       ` Peter Zijlstra
2025-10-26  0:58         ` Namhyung Kim
2025-10-24  9:29   ` Peter Zijlstra
2025-10-24 10:41     ` Peter Zijlstra
2025-10-24 13:58       ` Jens Remus
2025-10-24 14:08         ` Peter Zijlstra
2025-10-24 14:51           ` Peter Zijlstra
2025-10-24 14:54             ` Peter Zijlstra
2025-10-24 14:57               ` Peter Zijlstra
2025-10-24 15:10                 ` Peter Zijlstra
2025-10-24 15:09             ` Jens Remus
2025-10-24 15:11               ` Peter Zijlstra
2025-11-04 11:22             ` Florian Weimer
2025-11-05 13:08               ` Peter Zijlstra
