linux-perf-users.vger.kernel.org archive mirror
* [PATCH v6 0/5] perf: Deferred unwinding of user space stack traces for per CPU events
@ 2025-05-01  1:32 Steven Rostedt
  2025-05-01  1:32 ` [PATCH v6 1/5] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Steven Rostedt @ 2025-05-01  1:32 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring

This is v6 of:

  https://lore.kernel.org/linux-trace-kernel/20250424192456.851953422@goodmis.org/

But this version only adds the deferred unwind interface that perf needs; it
does not include the ftrace code.

This series is based on top of:

 https://lore.kernel.org/linux-trace-kernel/20250430195746.827125963@goodmis.org/

The above patch series adds deferred unwinding for task events, but not
for per CPU events. A task event traces only a single task and can therefore
use a task_work to trigger its own callback. Per CPU events do not have that
luxury: a single per CPU event can request a deferred user space stacktrace
for several tasks before receiving any of the deferred stacktraces.

To solve this, per CPU events will use the extended interface of the
deferred unwinder that ftrace will use. This includes the new API (a brief
usage sketch follows the list):

  unwind_deferred_init()
  unwind_deferred_request()
  unwind_deferred_cancel()
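
A minimal usage sketch of this API (the names my_unwind_work and
my_unwind_callback are illustrative only; the signatures are the ones added
in patch 1):

    static struct unwind_work my_unwind_work;

    /* Runs in task context just before the task returns to user space */
    static void my_unwind_callback(struct unwind_work *work,
                                   struct unwind_stacktrace *trace, u64 cookie)
    {
            /* trace->entries[0 .. trace->nr - 1] holds the user stacktrace */
    }

    /* Register once: */
    unwind_deferred_init(&my_unwind_work, my_unwind_callback);

    /* From interrupt context (or NMI, after patch 2), in the tracer's handler: */
    u64 cookie;
    int ret = unwind_deferred_request(&my_unwind_work, &cookie);
    /* ret < 0: error; ret > 0: already queued for this entry; ret == 0: queued */

    /* Unregister when done: */
    unwind_deferred_cancel(&my_unwind_work);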


What perf now does is:

When a new per CPU event is created, a global list of descriptors is
searched, where each descriptor maps to the group_leader of the tasks that
created these events. The PID of current's group_leader is used as the key.
If a descriptor is found, the event is simply added to it. If not, a new
descriptor is created.

This descriptor has an array sized by the number of possible CPUs, holding
per CPU descriptors. Each of these CPU descriptors has a linked list that
holds the per CPU events that were created and want deferred unwinding.

The group_leader descriptor has an unwind_work descriptor that it registers
with the unwind deferred infrastructure via unwind_deferred_init().
Each event within this descriptor has a pointer back to the descriptor.
When a request is made from interrupt context to have a deferred unwind
happen, unwind_deferred_request() is called with the unwind_work of the
group_leader descriptor.

When the task returns to user space, the callback associated with the
group_leader descriptor is called, and that callback passes the user space
stacktrace to the event attached to the current CPU's entry in the
descriptor's CPU array.

When these events are freed, they are removed from this descriptor, and
when the last event is removed, the descriptor is freed.
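
For reference, the two descriptors added in patch 5 look like this (field
names are taken verbatim from that patch; the comments are added here only
as a summary):

  /* One entry per possible CPU inside the group_leader descriptor */
  struct perf_unwind_cpu {
          struct list_head        list;   /* per CPU events wanting deferred unwinds */
          struct rcuwait          pending_unwind_wait;
          int                     processing;
  };

  /* One per group_leader PID, on the global unwind_deferred_list */
  struct perf_unwind_deferred {
          struct list_head        list;           /* entry in unwind_deferred_list */
          struct unwind_work      unwind_work;    /* registered via unwind_deferred_init() */
          struct perf_unwind_cpu  *cpu_events;    /* array sized num_possible_cpus() */
          int                     nr_cpu_events;
          int                     id;             /* PID of the creating task's group_leader */
  };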

I've tested this, and it appears to work fine. All the per CPU events that
the perf tool creates share a single descriptor (keyed by their
group_leader), so they do not overflow the maximum number of unwind works
that can be attached to the unwind deferred infrastructure.

This is based on v5 of the unwind code mentioned above. Changes since
then include:

- Have unwind_deferred_request() return positive if already queued

- Check (current->flags & (PF_KTHREAD | PF_EXITING)) in
  unwind_deferred_request(), as the task_work will fail to be added in the
  exit code.

- Use SRCU to protect the list of callbacks when a task returns instead of
  using a global mutex. (Mathieu Desnoyers)

- Does not include ftrace update

- Includes perf per CPU events using this infrastructure


Josh Poimboeuf (2):
      unwind_user/deferred: Add deferred unwinding interface
      unwind_user/deferred: Make unwind deferral requests NMI-safe

Steven Rostedt (3):
      unwind deferred: Use bitmask to determine which callbacks to call
      unwind deferred: Use SRCU unwind_deferred_task_work()
      perf: Support deferred user callchains for per CPU events

----
 include/linux/perf_event.h            |   5 +
 include/linux/sched.h                 |   1 +
 include/linux/unwind_deferred.h       |  19 +++
 include/linux/unwind_deferred_types.h |   4 +
 kernel/events/core.c                  | 226 +++++++++++++++++++++++---
 kernel/unwind/deferred.c              | 290 +++++++++++++++++++++++++++++++++-
 6 files changed, 519 insertions(+), 26 deletions(-)


* [PATCH v6 1/5] unwind_user/deferred: Add deferred unwinding interface
  2025-05-01  1:32 [PATCH v6 0/5] perf: Deferred unwinding of user space stack traces for per CPU events Steven Rostedt
@ 2025-05-01  1:32 ` Steven Rostedt
  2025-05-01  1:32 ` [PATCH v6 2/5] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Steven Rostedt @ 2025-05-01  1:32 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring

From: Josh Poimboeuf <jpoimboe@kernel.org>

Add an interface for scheduling task work to unwind the user space stack
before returning to user space. This solves several problems for its
callers:

  - Ensure the unwind happens in task context even if the caller may be
    running in NMI or interrupt context.

  - Avoid duplicate unwinds, whether called multiple times by the same
    caller or by different callers.

  - Create a "context cookie" which allows trace post-processing to
    correlate kernel unwinds/traces with the user unwind.

A concept of a "cookie" is created to detect when the stacktrace is the
same. A cookie is generated the first time a user space stacktrace is
requested after the task enters the kernel. As the stacktrace is saved on
the task_struct while the task is in the kernel, if another request comes
in, if the cookie is still the same, it will use the saved stacktrace,
and not have to regenerate one.

The cookie is passed to the caller on request, and when the stacktrace is
generated upon returning to user space, it call the requester's callback
with the cookie as well as the stacktrace.
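
As a concrete illustration of the cookie layout implemented by
ctx_to_cookie() below (taking, say, CPU 2 and a per-CPU entry counter of 5):

  cookie = ((ctx << 1) & ((1UL << 48) - 1)) | (cpu << 48) | 1
         = (5 << 1) | (2UL << 48) | 1
         = 0x000200000000000b
  /* high 16 bits: CPU id; low 48 bits: counter << 1, with bit 0 always set */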

Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v5: https://lore.kernel.org/20250424192612.505622711@goodmis.org

- Have unwind_deferred_request() return positive if already queued

- Check (current->flags & (PF_KTHREAD | PF_EXITING)) in
  unwind_deferred_request(), as the task_work will fail to be added in the
  exit code.

 include/linux/unwind_deferred.h       |  18 +++
 include/linux/unwind_deferred_types.h |   3 +
 kernel/unwind/deferred.c              | 165 +++++++++++++++++++++++++-
 3 files changed, 185 insertions(+), 1 deletion(-)

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index c2d760e5e257..d36784cae658 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -2,9 +2,19 @@
 #ifndef _LINUX_UNWIND_USER_DEFERRED_H
 #define _LINUX_UNWIND_USER_DEFERRED_H
 
+#include <linux/task_work.h>
 #include <linux/unwind_user.h>
 #include <linux/unwind_deferred_types.h>
 
+struct unwind_work;
+
+typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stacktrace *trace, u64 cookie);
+
+struct unwind_work {
+	struct list_head		list;
+	unwind_callback_t		func;
+};
+
 #ifdef CONFIG_UNWIND_USER
 
 void unwind_task_init(struct task_struct *task);
@@ -12,9 +22,14 @@ void unwind_task_free(struct task_struct *task);
 
 int unwind_deferred_trace(struct unwind_stacktrace *trace);
 
+int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
+int unwind_deferred_request(struct unwind_work *work, u64 *cookie);
+void unwind_deferred_cancel(struct unwind_work *work);
+
 static __always_inline void unwind_exit_to_user_mode(void)
 {
 	current->unwind_info.cache.nr_entries = 0;
+	current->unwind_info.cookie = 0;
 }
 
 #else /* !CONFIG_UNWIND_USER */
@@ -23,6 +38,9 @@ static inline void unwind_task_init(struct task_struct *task) {}
 static inline void unwind_task_free(struct task_struct *task) {}
 
 static inline int unwind_deferred_trace(struct unwind_stacktrace *trace) { return -ENOSYS; }
+static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func) { return -ENOSYS; }
+static inline int unwind_deferred_request(struct unwind_work *work, u64 *cookie) { return -ENOSYS; }
+static inline void unwind_deferred_cancel(struct unwind_work *work) {}
 
 static inline void unwind_exit_to_user_mode(void) {}
 
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index b3b7389ee6eb..33373c32c221 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -9,6 +9,9 @@ struct unwind_cache {
 
 struct unwind_task_info {
 	struct unwind_cache	cache;
+	u64			cookie;
+	struct callback_head	work;
+	int			pending;
 };
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 89ed04b1c527..b93ad97daf94 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -2,13 +2,72 @@
 /*
  * Deferred user space unwinding
  */
+#include <linux/sched/task_stack.h>
+#include <linux/unwind_deferred.h>
+#include <linux/task_work.h>
 #include <linux/kernel.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
-#include <linux/unwind_deferred.h>
+#include <linux/mm.h>
 
 #define UNWIND_MAX_ENTRIES 512
 
+/*
+ * This is a unique percpu identifier for a given task entry context.
+ * Conceptually, it's incremented every time the CPU enters the kernel from
+ * user space, so that each "entry context" on the CPU gets a unique ID.  In
+ * reality, as an optimization, it's only incremented on demand for the first
+ * deferred unwind request after a given entry-from-user.
+ *
+ * It's combined with the CPU id to make a systemwide-unique "context cookie".
+ */
+static DEFINE_PER_CPU(u64, unwind_ctx_ctr);
+
+/* Guards adding to and reading the list of callbacks */
+static DEFINE_MUTEX(callback_mutex);
+static LIST_HEAD(callbacks);
+
+/*
+ * The context cookie is a unique identifier that is assigned to a user
+ * space stacktrace. As the user space stacktrace remains the same while
+ * the task is in the kernel, the cookie is an identifier for the stacktrace.
+ * Although it is possible for the stacktrace to get another cookie if another
+ * request is made after the cookie was cleared and before reentering user
+ * space.
+ *
+ * The high 16 bits are the CPU id; the lower 48 bits are a per-CPU entry
+ * counter shifted left by one and or'd with 1 (to prevent it from ever being
+ * zero).
+ */
+static u64 ctx_to_cookie(u64 cpu, u64 ctx)
+{
+	BUILD_BUG_ON(NR_CPUS > 65535);
+	return ((ctx << 1) & ((1UL << 48) - 1)) | (cpu << 48) | 1;
+}
+
+/*
+ * Read the task context cookie, first initializing it if this is the first
+ * call to get_cookie() since the most recent entry from user.
+ */
+static u64 get_cookie(struct unwind_task_info *info)
+{
+	u64 ctx_ctr;
+	u64 cookie;
+	u64 cpu;
+
+	guard(irqsave)();
+
+	cookie = info->cookie;
+	if (cookie)
+		return cookie;
+
+	cpu = raw_smp_processor_id();
+	ctx_ctr = __this_cpu_inc_return(unwind_ctx_ctr);
+	info->cookie = ctx_to_cookie(cpu, ctx_ctr);
+
+	return info->cookie;
+}
+
 int unwind_deferred_trace(struct unwind_stacktrace *trace)
 {
 	struct unwind_task_info *info = &current->unwind_info;
@@ -46,11 +105,114 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
 	return 0;
 }
 
+static void unwind_deferred_task_work(struct callback_head *head)
+{
+	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
+	struct unwind_stacktrace trace;
+	struct unwind_work *work;
+	u64 cookie;
+
+	if (WARN_ON_ONCE(!info->pending))
+		return;
+
+	/* Allow work to come in again */
+	WRITE_ONCE(info->pending, 0);
+
+	/*
+	 * From here on out, the callback must always be called, even if it's
+	 * just an empty trace.
+	 */
+	trace.nr = 0;
+	trace.entries = NULL;
+
+	unwind_deferred_trace(&trace);
+
+	cookie = get_cookie(info);
+
+	guard(mutex)(&callback_mutex);
+	list_for_each_entry(work, &callbacks, list) {
+		work->func(work, &trace, cookie);
+	}
+	barrier();
+	/* If another task work is pending, reuse the cookie and stack trace */
+	if (!READ_ONCE(info->pending))
+		WRITE_ONCE(info->cookie, 0);
+}
+
+/*
+ * Schedule a user space unwind to be done in task work before exiting the
+ * kernel.
+ *
+ * The returned cookie output is a unique identifier for the current task entry
+ * context.  Its value will also be passed to the callback function.  It can be
+ * used to stitch kernel and user stack traces together in post-processing.
+ *
+ * It's valid to call this function multiple times for the same @work within
+ * the same task entry context.  Each call will return the same cookie.
+ * If the callback is not pending because it has already been previously called
+ * for the same entry context, it will be called again with the same stack trace
+ * and cookie.
+ *
+ * Returns 1 if the callback was already queued.
+ *         0 if the callback will be called when the task returns to user space
+ *         Negative if there's an error.
+ */
+int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
+{
+	struct unwind_task_info *info = &current->unwind_info;
+	int ret;
+
+	*cookie = 0;
+
+	if (WARN_ON_ONCE(in_nmi()))
+		return -EINVAL;
+
+	if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
+	    !user_mode(task_pt_regs(current)))
+		return -EINVAL;
+
+	guard(irqsave)();
+
+	*cookie = get_cookie(info);
+
+	/* callback already pending? */
+	if (info->pending)
+		return 1;
+
+	/* The work has been claimed, now schedule it. */
+	ret = task_work_add(current, &info->work, TWA_RESUME);
+	if (WARN_ON_ONCE(ret))
+		return ret;
+
+	info->pending = 1;
+	return 0;
+}
+
+void unwind_deferred_cancel(struct unwind_work *work)
+{
+	if (!work)
+		return;
+
+	guard(mutex)(&callback_mutex);
+	list_del(&work->list);
+}
+
+int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
+{
+	memset(work, 0, sizeof(*work));
+
+	guard(mutex)(&callback_mutex);
+	list_add(&work->list, &callbacks);
+	work->func = func;
+	return 0;
+}
+
 void unwind_task_init(struct task_struct *task)
 {
 	struct unwind_task_info *info = &task->unwind_info;
 
 	memset(info, 0, sizeof(*info));
+	init_task_work(&info->work, unwind_deferred_task_work);
 }
 
 void unwind_task_free(struct task_struct *task)
@@ -58,4 +220,5 @@ void unwind_task_free(struct task_struct *task)
 	struct unwind_task_info *info = &task->unwind_info;
 
 	kfree(info->cache.entries);
+	task_work_cancel(task, &info->work);
 }
-- 
2.47.2




* [PATCH v6 2/5] unwind_user/deferred: Make unwind deferral requests NMI-safe
  2025-05-01  1:32 [PATCH v6 0/5] perf: Deferred unwinding of user space stack traces for per CPU events Steven Rostedt
  2025-05-01  1:32 ` [PATCH v6 1/5] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
@ 2025-05-01  1:32 ` Steven Rostedt
  2025-05-01  1:32 ` [PATCH v6 3/5] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Steven Rostedt @ 2025-05-01  1:32 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring

From: Josh Poimboeuf <jpoimboe@kernel.org>

Make unwind_deferred_request() NMI-safe so tracers in NMI context can
call it to get the cookie immediately rather than have to do the fragile
"schedule irq work and then call unwind_deferred_request()" dance.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v5: https://lore.kernel.org/20250424192612.669992559@goodmis.org

- Have unwind_deferred_request() return positive if already queued.

 include/linux/unwind_deferred_types.h |   1 +
 kernel/unwind/deferred.c              | 100 ++++++++++++++++++++++----
 2 files changed, 89 insertions(+), 12 deletions(-)

diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 33373c32c221..8f47d77ddda0 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -10,6 +10,7 @@ struct unwind_cache {
 struct unwind_task_info {
 	struct unwind_cache	cache;
 	u64			cookie;
+	u64			nmi_cookie;
 	struct callback_head	work;
 	int			pending;
 };
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index b93ad97daf94..d86ea82a8915 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -47,23 +47,47 @@ static u64 ctx_to_cookie(u64 cpu, u64 ctx)
 
 /*
  * Read the task context cookie, first initializing it if this is the first
- * call to get_cookie() since the most recent entry from user.
+ * call to get_cookie() since the most recent entry from user.  This has to be
+ * done carefully to coordinate with unwind_deferred_request_nmi().
  */
 static u64 get_cookie(struct unwind_task_info *info)
 {
 	u64 ctx_ctr;
 	u64 cookie;
-	u64 cpu;
 
 	guard(irqsave)();
 
-	cookie = info->cookie;
+	cookie = READ_ONCE(info->cookie);
 	if (cookie)
 		return cookie;
 
-	cpu = raw_smp_processor_id();
-	ctx_ctr = __this_cpu_inc_return(unwind_ctx_ctr);
-	info->cookie = ctx_to_cookie(cpu, ctx_ctr);
+	ctx_ctr = __this_cpu_read(unwind_ctx_ctr);
+
+	/* Read ctx_ctr before info->nmi_cookie */
+	barrier();
+
+	cookie = READ_ONCE(info->nmi_cookie);
+	if (cookie) {
+		/*
+		 * This is the first call to get_cookie() since an NMI handler
+		 * first wrote it to info->nmi_cookie.  Sync it.
+		 */
+		WRITE_ONCE(info->cookie, cookie);
+		WRITE_ONCE(info->nmi_cookie, 0);
+		return cookie;
+	}
+
+	/*
+	 * Write info->cookie.  It's ok to race with an NMI here.  The value of
+	 * the cookie is based on ctx_ctr from before the NMI could have
+	 * incremented it.  The result will be the same even if cookie or
+	 * ctx_ctr end up getting written twice.
+	 */
+	cookie = ctx_to_cookie(raw_smp_processor_id(), ctx_ctr + 1);
+	WRITE_ONCE(info->cookie, cookie);
+	WRITE_ONCE(info->nmi_cookie, 0);
+	barrier();
+	__this_cpu_write(unwind_ctx_ctr, ctx_ctr + 1);
 
 	return info->cookie;
 }
@@ -139,6 +163,51 @@ static void unwind_deferred_task_work(struct callback_head *head)
 		WRITE_ONCE(info->cookie, 0);
 }
 
+static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *cookie)
+{
+	struct unwind_task_info *info = &current->unwind_info;
+	bool inited_cookie = false;
+	int ret;
+
+	*cookie = info->cookie;
+	if (!*cookie) {
+		/*
+		 * This is the first unwind request since the most recent entry
+		 * from user.  Initialize the task cookie.
+		 *
+		 * Don't write to info->cookie directly, otherwise it may get
+		 * cleared if the NMI occurred in the kernel during early entry
+		 * or late exit before the task work gets to run.  Instead, use
+		 * info->nmi_cookie which gets synced later by get_cookie().
+		 */
+		if (!info->nmi_cookie) {
+			u64 cpu = raw_smp_processor_id();
+			u64 ctx_ctr;
+
+			ctx_ctr = __this_cpu_inc_return(unwind_ctx_ctr);
+			info->nmi_cookie = ctx_to_cookie(cpu, ctx_ctr);
+
+			inited_cookie = true;
+		}
+
+		*cookie = info->nmi_cookie;
+	}
+
+	if (info->pending)
+		return 1;
+
+	ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
+	if (ret) {
+		if (inited_cookie)
+			info->nmi_cookie = 0;
+		return ret;
+	}
+
+	info->pending = 1;
+
+	return 0;
+}
+
 /*
  * Schedule a user space unwind to be done in task work before exiting the
  * kernel.
@@ -160,31 +229,38 @@ static void unwind_deferred_task_work(struct callback_head *head)
 int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 {
 	struct unwind_task_info *info = &current->unwind_info;
+	int pending;
 	int ret;
 
 	*cookie = 0;
 
-	if (WARN_ON_ONCE(in_nmi()))
-		return -EINVAL;
-
 	if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
 	    !user_mode(task_pt_regs(current)))
 		return -EINVAL;
 
+	if (in_nmi())
+		return unwind_deferred_request_nmi(work, cookie);
+
 	guard(irqsave)();
 
 	*cookie = get_cookie(info);
 
 	/* callback already pending? */
-	if (info->pending)
+	pending = READ_ONCE(info->pending);
+	if (pending)
+		return 1;
+
+	/* Claim the work unless an NMI just now swooped in to do so. */
+	if (!try_cmpxchg(&info->pending, &pending, 1))
 		return 1;
 
 	/* The work has been claimed, now schedule it. */
 	ret = task_work_add(current, &info->work, TWA_RESUME);
-	if (WARN_ON_ONCE(ret))
+	if (WARN_ON_ONCE(ret)) {
+		WRITE_ONCE(info->pending, 0);
 		return ret;
+	}
 
-	info->pending = 1;
 	return 0;
 }
 
-- 
2.47.2




* [PATCH v6 3/5] unwind deferred: Use bitmask to determine which callbacks to call
  2025-05-01  1:32 [PATCH v6 0/5] perf: Deferred unwinding of user space stack traces for per CPU events Steven Rostedt
  2025-05-01  1:32 ` [PATCH v6 1/5] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
  2025-05-01  1:32 ` [PATCH v6 2/5] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
@ 2025-05-01  1:32 ` Steven Rostedt
  2025-05-01  1:32 ` [PATCH v6 4/5] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
  2025-05-01  1:32 ` [PATCH v6 5/5] perf: Support deferred user callchains for per CPU events Steven Rostedt
  4 siblings, 0 replies; 10+ messages in thread
From: Steven Rostedt @ 2025-05-01  1:32 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring

From: Steven Rostedt <rostedt@goodmis.org>

In order to know which registered callbacks requested a stacktrace for when
the task goes back to user space, add a bitmask covering all registered
tracers. The bitmask is the size of a long, which means that on a 32 bit
machine there can be at most 32 registered tracers, and on 64 bit, at most
64 registered tracers. This should not be an issue as there should not be
more than 10 (unless BPF can abuse this?).

When a tracer registers with unwind_deferred_init() it will get a bit
number assigned to it. When a tracer requests a stacktrace, it will have
its bit set within the task_struct. When the task returns back to user
space, it will call the callbacks for all the registered tracers where
their bits are set in the task's mask.

When a tracer is removed by unwind_deferred_cancel(), the associated bit is
cleared in all current tasks, just in case another tracer gets registered
immediately afterward and would otherwise get its callback called
unexpectedly.
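
The resulting lifecycle of a tracer's bit, condensed from the diff below
(locking and error handling elided):

  /* unwind_deferred_init(): allocate the first free bit */
  work->bit = ffz(unwind_mask);
  unwind_mask |= 1UL << work->bit;

  /* unwind_deferred_request(): mark this tracer as wanting a callback */
  test_and_set_bit(work->bit, &current->unwind_mask);

  /* unwind_deferred_task_work(): call only the tracers that asked */
  if (task->unwind_mask & (1UL << work->bit)) {
          work->func(work, &trace, cookie);
          clear_bit(work->bit, &task->unwind_mask);
  }

  /* unwind_deferred_cancel(): release the bit and clear it in every task */
  clear_bit(work->bit, &unwind_mask);
  for_each_process_thread(g, t)
          clear_bit(work->bit, &t->unwind_mask);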

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v5: https://lore.kernel.org/20250424192612.844558089@goodmis.org

- Have unwind_deferred_request() return positive if already queued.

 include/linux/sched.h           |  1 +
 include/linux/unwind_deferred.h |  1 +
 kernel/unwind/deferred.c        | 46 ++++++++++++++++++++++++++++-----
 3 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a1e1c07cadfb..d3ee0c5405d6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1649,6 +1649,7 @@ struct task_struct {
 
 #ifdef CONFIG_UNWIND_USER
 	struct unwind_task_info		unwind_info;
+	unsigned long			unwind_mask;
 #endif
 
 	/* CPU-specific state of this task: */
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index d36784cae658..719a7cfb3164 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -13,6 +13,7 @@ typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stackt
 struct unwind_work {
 	struct list_head		list;
 	unwind_callback_t		func;
+	int				bit;
 };
 
 #ifdef CONFIG_UNWIND_USER
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index d86ea82a8915..716393dff810 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -26,6 +26,7 @@ static DEFINE_PER_CPU(u64, unwind_ctx_ctr);
 /* Guards adding to and reading the list of callbacks */
 static DEFINE_MUTEX(callback_mutex);
 static LIST_HEAD(callbacks);
+static unsigned long unwind_mask;
 
 /*
  * The context cookie is a unique identifier that is assigned to a user
@@ -134,6 +135,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
 	struct unwind_stacktrace trace;
 	struct unwind_work *work;
+	struct task_struct *task = current;
 	u64 cookie;
 
 	if (WARN_ON_ONCE(!info->pending))
@@ -155,7 +157,10 @@ static void unwind_deferred_task_work(struct callback_head *head)
 
 	guard(mutex)(&callback_mutex);
 	list_for_each_entry(work, &callbacks, list) {
-		work->func(work, &trace, cookie);
+		if (task->unwind_mask & (1UL << work->bit)) {
+			work->func(work, &trace, cookie);
+			clear_bit(work->bit, &current->unwind_mask);
+		}
 	}
 	barrier();
 	/* If another task work is pending, reuse the cookie and stack trace */
@@ -193,9 +198,12 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *cookie)
 		*cookie = info->nmi_cookie;
 	}
 
-	if (info->pending)
+	if (current->unwind_mask & (1UL << work->bit))
 		return 1;
 
+	if (info->pending)
+		goto out;
+
 	ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
 	if (ret) {
 		if (inited_cookie)
@@ -204,8 +212,8 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *cookie)
 	}
 
 	info->pending = 1;
-
-	return 0;
+out:
+	return test_and_set_bit(work->bit, &current->unwind_mask);
 }
 
 /*
@@ -245,14 +253,18 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 
 	*cookie = get_cookie(info);
 
+	/* This is already queued */
+	if (current->unwind_mask & (1UL << work->bit))
+		return 1;
+
 	/* callback already pending? */
 	pending = READ_ONCE(info->pending);
 	if (pending)
-		return 1;
+		goto out;
 
 	/* Claim the work unless an NMI just now swooped in to do so. */
 	if (!try_cmpxchg(&info->pending, &pending, 1))
-		return 1;
+		goto out;
 
 	/* The work has been claimed, now schedule it. */
 	ret = task_work_add(current, &info->work, TWA_RESUME);
@@ -261,16 +273,27 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 		return ret;
 	}
 
-	return 0;
+ out:
+	return test_and_set_bit(work->bit, &current->unwind_mask);
 }
 
 void unwind_deferred_cancel(struct unwind_work *work)
 {
+	struct task_struct *g, *t;
+
 	if (!work)
 		return;
 
 	guard(mutex)(&callback_mutex);
 	list_del(&work->list);
+
+	clear_bit(work->bit, &unwind_mask);
+
+	guard(rcu)();
+	/* Clear this bit from all threads */
+	for_each_process_thread(g, t) {
+		clear_bit(work->bit, &t->unwind_mask);
+	}
 }
 
 int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
@@ -278,6 +301,14 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 	memset(work, 0, sizeof(*work));
 
 	guard(mutex)(&callback_mutex);
+
+	/* See if there's a bit in the mask available */
+	if (unwind_mask == ~0UL)
+		return -EBUSY;
+
+	work->bit = ffz(unwind_mask);
+	unwind_mask |= 1UL << work->bit;
+
 	list_add(&work->list, &callbacks);
 	work->func = func;
 	return 0;
@@ -289,6 +320,7 @@ void unwind_task_init(struct task_struct *task)
 
 	memset(info, 0, sizeof(*info));
 	init_task_work(&info->work, unwind_deferred_task_work);
+	task->unwind_mask = 0;
 }
 
 void unwind_task_free(struct task_struct *task)
-- 
2.47.2




* [PATCH v6 4/5] unwind deferred: Use SRCU unwind_deferred_task_work()
  2025-05-01  1:32 [PATCH v6 0/5] perf: Deferred unwinding of user space stack traces for per CPU events Steven Rostedt
                   ` (2 preceding siblings ...)
  2025-05-01  1:32 ` [PATCH v6 3/5] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
@ 2025-05-01  1:32 ` Steven Rostedt
  2025-05-01  1:32 ` [PATCH v6 5/5] perf: Support deferred user callchains for per CPU events Steven Rostedt
  4 siblings, 0 replies; 10+ messages in thread
From: Steven Rostedt @ 2025-05-01  1:32 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring

From: Steven Rostedt <rostedt@goodmis.org>

Instead of using the callback_mutex to protect the linked list of callbacks
in unwind_deferred_task_work(), use SRCU. This function is called every time
a task returns to user space with a requested stack trace pending, which can
happen for many tasks on several CPUs at the same time. A mutex there is a
bottleneck that can cause contention and slow down performance.

As the callbacks themselves are allowed to sleep, regular RCU can not be
used to protect the list. Instead use SRCU, as that still allows the
callbacks to sleep and the list can be read without needing to hold the
callback_mutex.
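
The resulting read side and write side, condensed from the diff below:

  /* Readers (unwind_deferred_task_work(), callbacks may sleep): */
  idx = srcu_read_lock(&unwind_srcu);
  list_for_each_entry_srcu(work, &callbacks, list,
                           srcu_read_lock_held(&unwind_srcu)) {
          ...
  }
  srcu_read_unlock(&unwind_srcu, idx);

  /* Writers (init/cancel) still serialize on callback_mutex and use
   * list_add_rcu()/list_del_rcu(), with synchronize_srcu() on cancel. */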

Link: https://lore.kernel.org/all/ca9bd83a-6c80-4ee0-a83c-224b9d60b755@efficios.com/

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/unwind/deferred.c | 35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 716393dff810..5f98ac5e3a1b 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -23,10 +23,11 @@
  */
 static DEFINE_PER_CPU(u64, unwind_ctx_ctr);
 
-/* Guards adding to and reading the list of callbacks */
+/* Guards adding to or removing from the list of callbacks */
 static DEFINE_MUTEX(callback_mutex);
 static LIST_HEAD(callbacks);
 static unsigned long unwind_mask;
+DEFINE_STATIC_SRCU(unwind_srcu);
 
 /*
  * The context cookie is a unique identifier that is assigned to a user
@@ -137,6 +138,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	struct unwind_work *work;
 	struct task_struct *task = current;
 	u64 cookie;
+	int idx;
 
 	if (WARN_ON_ONCE(!info->pending))
 		return;
@@ -155,14 +157,16 @@ static void unwind_deferred_task_work(struct callback_head *head)
 
 	cookie = get_cookie(info);
 
-	guard(mutex)(&callback_mutex);
-	list_for_each_entry(work, &callbacks, list) {
+	idx = srcu_read_lock(&unwind_srcu);
+	list_for_each_entry_srcu(work, &callbacks, list,
+				 srcu_read_lock_held(&unwind_srcu)) {
 		if (task->unwind_mask & (1UL << work->bit)) {
 			work->func(work, &trace, cookie);
 			clear_bit(work->bit, &current->unwind_mask);
 		}
 	}
-	barrier();
+	srcu_read_unlock(&unwind_srcu, idx);
+
 	/* If another task work is pending, reuse the cookie and stack trace */
 	if (!READ_ONCE(info->pending))
 		WRITE_ONCE(info->cookie, 0);
@@ -238,6 +242,7 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 {
 	struct unwind_task_info *info = &current->unwind_info;
 	int pending;
+	int bit;
 	int ret;
 
 	*cookie = 0;
@@ -249,12 +254,17 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 	if (in_nmi())
 		return unwind_deferred_request_nmi(work, cookie);
 
+	/* Do not allow cancelled works to request again */
+	bit = READ_ONCE(work->bit);
+	if (WARN_ON_ONCE(bit < 0))
+		return -EINVAL;
+
 	guard(irqsave)();
 
 	*cookie = get_cookie(info);
 
 	/* This is already queued */
-	if (current->unwind_mask & (1UL << work->bit))
+	if (current->unwind_mask & (1UL << bit))
 		return 1;
 
 	/* callback already pending? */
@@ -280,19 +290,26 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 void unwind_deferred_cancel(struct unwind_work *work)
 {
 	struct task_struct *g, *t;
+	int bit;
 
 	if (!work)
 		return;
 
 	guard(mutex)(&callback_mutex);
-	list_del(&work->list);
+	list_del_rcu(&work->list);
+	bit = work->bit;
+
+	/* Do not allow any more requests and prevent callbacks */
+	work->bit = -1;
+
+	clear_bit(bit, &unwind_mask);
 
-	clear_bit(work->bit, &unwind_mask);
+	synchronize_srcu(&unwind_srcu);
 
 	guard(rcu)();
 	/* Clear this bit from all threads */
 	for_each_process_thread(g, t) {
-		clear_bit(work->bit, &t->unwind_mask);
+		clear_bit(bit, &t->unwind_mask);
 	}
 }
 
@@ -309,7 +326,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 	work->bit = ffz(unwind_mask);
 	unwind_mask |= 1UL << work->bit;
 
-	list_add(&work->list, &callbacks);
+	list_add_rcu(&work->list, &callbacks);
 	work->func = func;
 	return 0;
 }
-- 
2.47.2




* [PATCH v6 5/5] perf: Support deferred user callchains for per CPU events
  2025-05-01  1:32 [PATCH v6 0/5] perf: Deferred unwinding of user space stack traces for per CPU events Steven Rostedt
                   ` (3 preceding siblings ...)
  2025-05-01  1:32 ` [PATCH v6 4/5] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
@ 2025-05-01  1:32 ` Steven Rostedt
  2025-05-01 20:14   ` Namhyung Kim
  2025-05-01 20:20   ` Namhyung Kim
  4 siblings, 2 replies; 10+ messages in thread
From: Steven Rostedt @ 2025-05-01  1:32 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring

From: Steven Rostedt <rostedt@goodmis.org>

The deferred unwinder works fine for task events (events that trace only a
specific task), as it can use a task_work from an interrupt or NMI and
when the task goes back to user space it will call the event's callback to
do the deferred unwinding.

But for per CPU events things are not so simple. When a per CPU event wants
a deferred unwinding to occur, it cannot simply use a task_work, as there's
a many to many relationship. If the task migrates and another task is
scheduled in, the per CPU event may want a deferred unwinding on that new
task as well, while the task that migrated now has the other CPU's event
wanting to unwind it. Each CPU may need unwinding from more than one task,
and each task may have requests from many CPUs.

To solve this, when a per CPU event is created that has the defer_callchain
attribute set, it will do a lookup in a global list (unwind_deferred_list)
for a perf_unwind_deferred descriptor whose id matches the PID of the
current task's group_leader.

If it is not found, then it will create one and add it to the global list.
This descriptor contains an array of all possible CPUs, where each element
is a perf_unwind_cpu descriptor.

The perf_unwind_cpu descriptor has a list of all the per CPU events that
are tracing the CPU corresponding to its index in the array, where the
events belong to tasks that share the same group_leader.
It also has a processing bit and an rcuwait to handle removal.

For each occupied perf_unwind_cpu descriptor in the array, the
perf_unwind_deferred descriptor increments its nr_cpu_events. When a
perf_unwind_cpu descriptor becomes empty, nr_cpu_events is decremented.
This is used to know when to free the perf_unwind_deferred descriptor, as
when it becomes empty, it is no longer referenced.

Finally, the perf_unwind_deferred descriptor has an id that holds the PID
of the group_leader of the tasks that the events were created by.

When a second (or more) per CPU event is created and the
perf_unwind_deferred descriptor already exists, the event just adds itself
to the perf_unwind_cpu array of that descriptor, updating the necessary
counter.

Each of these perf_unwind_deferred descriptors has an unwind_work that is
registered with the deferred unwind infrastructure via
unwind_deferred_init(), which also registers perf_event_deferred_cpu() as
the callback.

Now when a per CPU event requests a deferred unwinding, it calls
unwind_deferred_request() with the unwind_work of the associated
perf_unwind_deferred descriptor. It is expected that the program using this
has events on all CPUs, as the deferred trace may not be delivered to the
CPU event that requested it. That is, the task may migrate, and its user
stack trace will be recorded on the event of the CPU that it exits back to
user space on.
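
A rough outline of the resulting request path (condensed from the diff
below; this is a call-flow sketch, not literal code):

  /* In the sampling interrupt/NMI, on whatever CPU the task is running: */
  deferred_request(event)
      -> unwind_deferred_request(&event->unwind_deferred->unwind_work, &cookie);

  /* Later, when that task returns to user space (possibly on another CPU): */
  perf_event_deferred_cpu(&defer->unwind_work, trace, cookie)
      cpu_unwind = &defer->cpu_events[smp_processor_id()];
      /* the first event on this CPU's list emits PERF_RECORD_CALLCHAIN_DEFERRED */
      perf_event_callchain_deferred(event, trace);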

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/perf_event.h |   5 +
 kernel/events/core.c       | 226 +++++++++++++++++++++++++++++++++----
 2 files changed, 206 insertions(+), 25 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 10603a8344d3..c12b4894c4e1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -683,6 +683,7 @@ struct swevent_hlist {
 struct bpf_prog;
 struct perf_cgroup;
 struct perf_buffer;
+struct perf_unwind_deferred;
 
 struct pmu_event_list {
 	raw_spinlock_t		lock;
@@ -835,6 +836,9 @@ struct perf_event {
 	struct callback_head		pending_unwind_work;
 	struct rcuwait			pending_unwind_wait;
 
+	struct perf_unwind_deferred	*unwind_deferred;
+	struct list_head		unwind_list;
+
 	atomic_t			event_limit;
 
 	/* address range filters */
@@ -875,6 +879,7 @@ struct perf_event {
 #ifdef CONFIG_SECURITY
 	void *security;
 #endif
+
 	struct list_head		sb_list;
 	struct list_head		pmu_list;
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a5d9c6220589..f0c3b8878276 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5537,10 +5537,128 @@ static bool exclusive_event_installable(struct perf_event *event,
 	return true;
 }
 
+/* Holds a list of per CPU events that registered for deferred unwinding */
+struct perf_unwind_cpu {
+	struct list_head	list;
+	struct rcuwait		pending_unwind_wait;
+	int			processing;
+};
+
+struct perf_unwind_deferred {
+	struct list_head	list;
+	struct unwind_work	unwind_work;
+	struct perf_unwind_cpu	*cpu_events;
+	int			nr_cpu_events;
+	int			id;
+};
+
+static DEFINE_MUTEX(unwind_deferred_mutex);
+static LIST_HEAD(unwind_deferred_list);
+
+static void perf_event_deferred_cpu(struct unwind_work *work,
+				    struct unwind_stacktrace *trace, u64 cookie);
+
+static int perf_add_unwind_deferred(struct perf_event *event)
+{
+	struct perf_unwind_deferred *defer;
+	int id = current->group_leader->pid;
+	bool found = false;
+	int ret = 0;
+
+	if (event->cpu < 0)
+		return -EINVAL;
+
+	guard(mutex)(&unwind_deferred_mutex);
+
+	list_for_each_entry(defer, &unwind_deferred_list, list) {
+		if (defer->id == id) {
+			found = true;
+			break;
+		}
+	}
+
+	if (!found) {
+		defer = kzalloc(sizeof(*defer), GFP_KERNEL);
+		if (!defer)
+			return -ENOMEM;
+		list_add(&defer->list, &unwind_deferred_list);
+		defer->id = id;
+	}
+
+	if (!defer->nr_cpu_events) {
+		defer->cpu_events = kcalloc(num_possible_cpus(),
+					    sizeof(*defer->cpu_events),
+					    GFP_KERNEL);
+		if (!defer->cpu_events) {
+			ret = -ENOMEM;
+			goto free;
+		}
+		for (int cpu = 0; cpu < num_possible_cpus(); cpu++) {
+			rcuwait_init(&defer->cpu_events[cpu].pending_unwind_wait);
+			INIT_LIST_HEAD(&defer->cpu_events[cpu].list);
+		}
+
+		ret = unwind_deferred_init(&defer->unwind_work,
+					   perf_event_deferred_cpu);
+		if (ret)
+			goto free;
+	}
+
+	if (list_empty(&defer->cpu_events[event->cpu].list))
+		defer->nr_cpu_events++;
+	list_add_tail_rcu(&event->unwind_list, &defer->cpu_events[event->cpu].list);
+
+	event->unwind_deferred = defer;
+	return 0;
+free:
+	if (found)
+		return ret;
+
+	list_del(&defer->list);
+	kfree(defer->cpu_events);
+	kfree(defer);
+	return ret;
+}
+
+static void perf_remove_unwind_deferred(struct perf_event *event)
+{
+	struct perf_unwind_deferred *defer = event->unwind_deferred;
+	struct perf_unwind_cpu *cpu_unwind;
+
+	if (!defer)
+		return;
+
+	guard(mutex)(&unwind_deferred_mutex);
+	list_del_rcu(&event->unwind_list);
+
+	cpu_unwind = &defer->cpu_events[event->cpu];
+
+	if (list_empty(&cpu_unwind->list)) {
+		defer->nr_cpu_events--;
+		if (!defer->nr_cpu_events)
+			unwind_deferred_cancel(&defer->unwind_work);
+	}
+	/* Make sure perf_event_deferred_cpu() is done with this event */
+	rcuwait_wait_event(&cpu_unwind->pending_unwind_wait,
+				   !cpu_unwind->processing, TASK_UNINTERRUPTIBLE);
+
+	event->unwind_deferred = NULL;
+
+	/* Is this still being used by other per CPU events? */
+	if (defer->nr_cpu_events)
+		return;
+
+	list_del(&defer->list);
+	kfree(defer->cpu_events);
+	kfree(defer);
+}
+
 static void perf_pending_unwind_sync(struct perf_event *event)
 {
 	might_sleep();
 
+	perf_remove_unwind_deferred(event);
+
 	if (!event->pending_unwind_callback)
 		return;
 
@@ -5568,33 +5686,19 @@ struct perf_callchain_deferred_event {
 	u64				ips[];
 };
 
-static void perf_event_callchain_deferred(struct callback_head *work)
+static void perf_event_callchain_deferred(struct perf_event *event,
+					  struct unwind_stacktrace *trace)
 {
-	struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
 	struct perf_callchain_deferred_event deferred_event;
 	u64 callchain_context = PERF_CONTEXT_USER;
-	struct unwind_stacktrace trace;
 	struct perf_output_handle handle;
 	struct perf_sample_data data;
 	u64 nr;
 
-	if (!event->pending_unwind_callback)
-		return;
-
-	if (unwind_deferred_trace(&trace) < 0)
-		goto out;
-
-	/*
-	 * All accesses to the event must belong to the same implicit RCU
-	 * read-side critical section as the ->pending_unwind_callback reset.
-	 * See comment in perf_pending_unwind_sync().
-	 */
-	guard(rcu)();
-
 	if (current->flags & PF_KTHREAD)
-		goto out;
+		return;
 
-	nr = trace.nr + 1 ; /* '+1' == callchain_context */
+	nr = trace->nr + 1 ; /* '+1' == callchain_context */
 
 	deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
 	deferred_event.header.misc = PERF_RECORD_MISC_USER;
@@ -5605,21 +5709,74 @@ static void perf_event_callchain_deferred(struct callback_head *work)
 	perf_event_header__init_id(&deferred_event.header, &data, event);
 
 	if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
-		goto out;
+		return;
 
 	perf_output_put(&handle, deferred_event);
 	perf_output_put(&handle, callchain_context);
-	perf_output_copy(&handle, trace.entries, trace.nr * sizeof(u64));
+	perf_output_copy(&handle, trace->entries, trace->nr * sizeof(u64));
 	perf_event__output_id_sample(event, &handle, &data);
 
 	perf_output_end(&handle);
+}
+
+/* Deferred unwinding callback for task specific events */
+static void perf_event_deferred_task(struct callback_head *work)
+{
+	struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
+	struct unwind_stacktrace trace;
+
+	if (!event->pending_unwind_callback)
+		return;
+
+	if (unwind_deferred_trace(&trace) >= 0) {
+
+		/*
+		 * All accesses to the event must belong to the same implicit RCU
+		 * read-side critical section as the ->pending_unwind_callback reset.
+		 * See comment in perf_pending_unwind_sync().
+		 */
+		guard(rcu)();
+		perf_event_callchain_deferred(event, &trace);
+	}
 
-out:
 	event->pending_unwind_callback = 0;
 	local_dec(&event->ctx->nr_no_switch_fast);
 	rcuwait_wake_up(&event->pending_unwind_wait);
 }
 
+/* Deferred unwinding callback for per CPU events */
+static void perf_event_deferred_cpu(struct unwind_work *work,
+				    struct unwind_stacktrace *trace, u64 cookie)
+{
+	struct perf_unwind_deferred *defer =
+		container_of(work, struct perf_unwind_deferred, unwind_work);
+	struct perf_unwind_cpu *cpu_unwind;
+	struct perf_event *event;
+	int cpu;
+
+	guard(rcu)();
+	guard(preempt)();
+
+	cpu = smp_processor_id();
+	cpu_unwind = &defer->cpu_events[cpu];
+
+	WRITE_ONCE(cpu_unwind->processing, 1);
+	/*
+	 * Make sure the above is seen for the rcuwait in
+	 * perf_remove_unwind_deferred() before iterating the loop.
+	 */
+	smp_mb();
+
+	list_for_each_entry_rcu(event, &cpu_unwind->list, unwind_list) {
+		perf_event_callchain_deferred(event, trace);
+		/* Only the first CPU event gets the trace */
+		break;
+	}
+
+	WRITE_ONCE(cpu_unwind->processing, 0);
+	rcuwait_wake_up(&cpu_unwind->pending_unwind_wait);
+}
+
 static void perf_free_addr_filters(struct perf_event *event);
 
 /* vs perf_event_alloc() error */
@@ -8198,6 +8355,15 @@ static int deferred_request_nmi(struct perf_event *event)
 	return 0;
 }
 
+static int deferred_unwind_request(struct perf_unwind_deferred *defer)
+{
+	u64 cookie;
+	int ret;
+
+	ret = unwind_deferred_request(&defer->unwind_work, &cookie);
+	return ret < 0 ? ret : 0;
+}
+
 /*
  * Returns:
 *     > 0 : if already queued.
@@ -8210,11 +8376,14 @@ static int deferred_request(struct perf_event *event)
 	int pending;
 	int ret;
 
-	/* Only defer for task events */
-	if (!event->ctx->task)
+	if ((current->flags & PF_KTHREAD) || !user_mode(task_pt_regs(current)))
 		return -EINVAL;
 
-	if ((current->flags & PF_KTHREAD) || !user_mode(task_pt_regs(current)))
+	if (event->unwind_deferred)
+		return deferred_unwind_request(event->unwind_deferred);
+
+	/* Per CPU events should have had unwind_deferred set! */
+	if (WARN_ON_ONCE(!event->ctx->task))
 		return -EINVAL;
 
 	if (in_nmi())
@@ -13100,13 +13269,20 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 		}
 	}
 
+	/* Setup unwind deferring for per CPU events */
+	if (event->attr.defer_callchain && !task) {
+		err = perf_add_unwind_deferred(event);
+		if (err)
+			return ERR_PTR(err);
+	}
+
 	err = security_perf_event_alloc(event);
 	if (err)
 		return ERR_PTR(err);
 
 	if (event->attr.defer_callchain)
 		init_task_work(&event->pending_unwind_work,
-			       perf_event_callchain_deferred);
+			       perf_event_deferred_task);
 
 	/* symmetric to unaccount_event() in _free_event() */
 	account_event(event);
-- 
2.47.2




* Re: [PATCH v6 5/5] perf: Support deferred user callchains for per CPU events
  2025-05-01  1:32 ` [PATCH v6 5/5] perf: Support deferred user callchains for per CPU events Steven Rostedt
@ 2025-05-01 20:14   ` Namhyung Kim
  2025-05-01 20:57     ` Steven Rostedt
  2025-05-01 20:20   ` Namhyung Kim
  1 sibling, 1 reply; 10+ messages in thread
From: Namhyung Kim @ 2025-05-01 20:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Josh Poimboeuf, x86,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Indu Bhagat, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James, Andrii Nakryiko, Jens Remus,
	Florian Weimer, Andy Lutomirski, Weinan Liu, Blake Jones,
	Beau Belgrave, Jose E. Marchesi, Alexander Aring

Hi Steve,

On Wed, Apr 30, 2025 at 09:32:07PM -0400, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
> 
> The deferred unwinder works fine for task events (events that trace only a
> specific task), as it can use a task_work from an interrupt or NMI and
> when the task goes back to user space it will call the event's callback to
> do the deferred unwinding.
> 
> But for per CPU events things are not so simple. When a per CPU event
> wants a deferred unwinding to occur, it can not simply use a task_work as
> there's a many to many relationship. If the task migrates and another task
> is scheduled in where the per CPU event wants a deferred unwinding to
> occur on that task as well, and the task that migrated to another CPU has
> that CPU's event want to unwind it too, each CPU may need unwinding from
> more than one task, and each task may has requests from many CPUs.
> 
> To solve this, when a per CPU event is created that has defer_callchain
> attribute set, it will do a lookup from a global list
> (unwind_deferred_list), for a perf_unwind_deferred descriptor that has the
> id that matches the PID of the current task's group_leader.

Nice, it'd work well with the perf tools at least.

> 
> If it is not found, then it will create one and add it to the global list.
> This descriptor contains an array of all possible CPUs, where each element
> is a perf_unwind_cpu descriptor.
> 
> The perf_unwind_cpu descriptor has a list of all the per CPU events that
> is tracing the matching CPU that corresponds to its index in the array,
> where the events belong to a task that has the same group_leader.
> It also has a processing bit and rcuwait to handle removal.
> 
> For each occupied perf_unwind_cpu descriptor in the array, the
> perf_deferred_unwind descriptor increments its nr_cpu_events. When a
> perf_unwind_cpu descriptor is empty, the nr_cpu_events is decremented.
> This is used to know when to free the perf_deferred_unwind descriptor, as
> when it become empty, it is no longer referenced.
> 
> Finally, the perf_deferred_unwind descriptor has an id that holds the PID
> of the group_leader for the tasks that the events were created by.
> 
> When a second (or more) per CPU event is created where the
> perf_deferred_unwind descriptor is already created, it just adds itself to
> the perf_unwind_cpu array of that descriptor. Updating the necessary
> counter.
> 
> Each of these perf_deferred_unwind descriptors have a unwind_work that
> registers with the deferred unwind infrastructure via
> unwind_deferred_init(), where it also registers a callback to
> perf_event_deferred_cpu().
> 
> Now when a per CPU event requests a deferred unwinding, it calls
> unwind_deferred_request() with the associated perf_deferred_unwind
> descriptor. It is expected that the program that uses this has events on
> all CPUs, as the deferred trace may not be called on the CPU event that
> requested it. That is, the task may migrate and its user stack trace will
> be recorded on the CPU event of the CPU that it exits back to user space
> on.
> 
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
>  include/linux/perf_event.h |   5 +
>  kernel/events/core.c       | 226 +++++++++++++++++++++++++++++++++----
>  2 files changed, 206 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 10603a8344d3..c12b4894c4e1 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -683,6 +683,7 @@ struct swevent_hlist {
>  struct bpf_prog;
>  struct perf_cgroup;
>  struct perf_buffer;
> +struct perf_unwind_deferred;
>  
>  struct pmu_event_list {
>  	raw_spinlock_t		lock;
> @@ -835,6 +836,9 @@ struct perf_event {
>  	struct callback_head		pending_unwind_work;
>  	struct rcuwait			pending_unwind_wait;
>  
> +	struct perf_unwind_deferred	*unwind_deferred;
> +	struct list_head		unwind_list;
> +
>  	atomic_t			event_limit;
>  
>  	/* address range filters */
> @@ -875,6 +879,7 @@ struct perf_event {
>  #ifdef CONFIG_SECURITY
>  	void *security;
>  #endif
> +
>  	struct list_head		sb_list;
>  	struct list_head		pmu_list;
>  
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index a5d9c6220589..f0c3b8878276 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5537,10 +5537,128 @@ static bool exclusive_event_installable(struct perf_event *event,
>  	return true;
>  }
>  
> +/* Holds a list of per CPU events that registered for deferred unwinding */
> +struct perf_unwind_cpu {
> +	struct list_head	list;
> +	struct rcuwait		pending_unwind_wait;
> +	int			processing;
> +};
> +
> +struct perf_unwind_deferred {
> +	struct list_head	list;
> +	struct unwind_work	unwind_work;
> +	struct perf_unwind_cpu	*cpu_events;
> +	int			nr_cpu_events;
> +	int			id;
> +};
> +
> +static DEFINE_MUTEX(unwind_deferred_mutex);
> +static LIST_HEAD(unwind_deferred_list);
> +
> +static void perf_event_deferred_cpu(struct unwind_work *work,
> +				    struct unwind_stacktrace *trace, u64 cookie);
> +
> +static int perf_add_unwind_deferred(struct perf_event *event)
> +{
> +	struct perf_unwind_deferred *defer;
> +	int id = current->group_leader->pid;
> +	bool found = false;
> +	int ret = 0;
> +
> +	if (event->cpu < 0)
> +		return -EINVAL;
> +
> +	guard(mutex)(&unwind_deferred_mutex);
> +
> +	list_for_each_entry(defer, &unwind_deferred_list, list) {
> +		if (defer->id == id) {
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	if (!found) {
> +		defer = kzalloc(sizeof(*defer), GFP_KERNEL);
> +		if (!defer)
> +			return -ENOMEM;
> +		list_add(&defer->list, &unwind_deferred_list);
> +		defer->id = id;
> +	}
> +
> +	if (!defer->nr_cpu_events) {
> +		defer->cpu_events = kcalloc(num_possible_cpus(),
> +					    sizeof(*defer->cpu_events),
> +					    GFP_KERNEL);
> +		if (!defer->cpu_events) {
> +			ret = -ENOMEM;
> +			goto free;
> +		}
> +		for (int cpu = 0; cpu < num_possible_cpus(); cpu++) {
> +			rcuwait_init(&defer->cpu_events[cpu].pending_unwind_wait);
> +			INIT_LIST_HEAD(&defer->cpu_events[cpu].list);
> +		}
> +
> +		ret = unwind_deferred_init(&defer->unwind_work,
> +					   perf_event_deferred_cpu);
> +		if (ret)
> +			goto free;
> +	}
> +
> +	if (list_empty(&defer->cpu_events[event->cpu].list))
> +		defer->nr_cpu_events++;
> +	list_add_tail_rcu(&event->unwind_list, &defer->cpu_events[event->cpu].list);
> +
> +	event->unwind_deferred = defer;
> +	return 0;
> +free:
> +	if (found)
> +		return ret;
> +
> +	list_del(&defer->list);
> +	kfree(defer->cpu_events);
> +	kfree(defer);
> +	return ret;
> +}
> +
> +static void perf_remove_unwind_deferred(struct perf_event *event)
> +{
> +	struct perf_unwind_deferred *defer = event->unwind_deferred;
> +	struct perf_unwind_cpu *cpu_unwind;
> +
> +	if (!defer)
> +		return;
> +
> +	guard(mutex)(&unwind_deferred_mutex);
> +	list_del_rcu(&event->unwind_list);
> +
> +	cpu_unwind = &defer->cpu_events[event->cpu];
> +
> +	if (list_empty(&cpu_unwind->list)) {
> +		defer->nr_cpu_events--;
> +		if (!defer->nr_cpu_events)
> +			unwind_deferred_cancel(&defer->unwind_work);
> +	}
> +	/* Make sure perf_event_deferred_cpu() is done with this event */
> +	rcuwait_wait_event(&cpu_unwind->pending_unwind_wait,
> +				   !cpu_unwind->processing, TASK_UNINTERRUPTIBLE);
> +
> +	event->unwind_deferred = NULL;
> +
> +	/* Is this still being used by other per CPU events? */
> +	if (defer->nr_cpu_events)
> +		return;
> +
> +	list_del(&defer->list);
> +	kfree(defer->cpu_events);
> +	kfree(defer);
> +}
> +
>  static void perf_pending_unwind_sync(struct perf_event *event)
>  {
>  	might_sleep();
>  
> +	perf_remove_unwind_deferred(event);
> +
>  	if (!event->pending_unwind_callback)
>  		return;
>  
> @@ -5568,33 +5686,19 @@ struct perf_callchain_deferred_event {
>  	u64				ips[];
>  };
>  
> -static void perf_event_callchain_deferred(struct callback_head *work)
> +static void perf_event_callchain_deferred(struct perf_event *event,
> +					  struct unwind_stacktrace *trace)
>  {
> -	struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
>  	struct perf_callchain_deferred_event deferred_event;
>  	u64 callchain_context = PERF_CONTEXT_USER;
> -	struct unwind_stacktrace trace;
>  	struct perf_output_handle handle;
>  	struct perf_sample_data data;
>  	u64 nr;
>  
> -	if (!event->pending_unwind_callback)
> -		return;
> -
> -	if (unwind_deferred_trace(&trace) < 0)
> -		goto out;
> -
> -	/*
> -	 * All accesses to the event must belong to the same implicit RCU
> -	 * read-side critical section as the ->pending_unwind_callback reset.
> -	 * See comment in perf_pending_unwind_sync().
> -	 */
> -	guard(rcu)();
> -
>  	if (current->flags & PF_KTHREAD)
> -		goto out;
> +		return;
>  
> -	nr = trace.nr + 1 ; /* '+1' == callchain_context */
> +	nr = trace->nr + 1 ; /* '+1' == callchain_context */
>  
>  	deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
>  	deferred_event.header.misc = PERF_RECORD_MISC_USER;
> @@ -5605,21 +5709,74 @@ static void perf_event_callchain_deferred(struct callback_head *work)
>  	perf_event_header__init_id(&deferred_event.header, &data, event);
>  
>  	if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
> -		goto out;
> +		return;
>  
>  	perf_output_put(&handle, deferred_event);
>  	perf_output_put(&handle, callchain_context);
> -	perf_output_copy(&handle, trace.entries, trace.nr * sizeof(u64));
> +	perf_output_copy(&handle, trace->entries, trace->nr * sizeof(u64));
>  	perf_event__output_id_sample(event, &handle, &data);
>  
>  	perf_output_end(&handle);
> +}
> +
> +/* Deferred unwinding callback for task specific events */
> +static void perf_event_deferred_task(struct callback_head *work)
> +{
> +	struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
> +	struct unwind_stacktrace trace;
> +
> +	if (!event->pending_unwind_callback)
> +		return;
> +
> +	if (unwind_deferred_trace(&trace) >= 0) {
> +
> +		/*
> +		 * All accesses to the event must belong to the same implicit RCU
> +		 * read-side critical section as the ->pending_unwind_callback reset.
> +		 * See comment in perf_pending_unwind_sync().
> +		 */
> +		guard(rcu)();
> +		perf_event_callchain_deferred(event, &trace);
> +	}
>  
> -out:
>  	event->pending_unwind_callback = 0;
>  	local_dec(&event->ctx->nr_no_switch_fast);
>  	rcuwait_wake_up(&event->pending_unwind_wait);
>  }
>  
> +/* Deferred unwinding callback for per CPU events */
> +static void perf_event_deferred_cpu(struct unwind_work *work,
> +				    struct unwind_stacktrace *trace, u64 cookie)
> +{
> +	struct perf_unwind_deferred *defer =
> +		container_of(work, struct perf_unwind_deferred, unwind_work);
> +	struct perf_unwind_cpu *cpu_unwind;
> +	struct perf_event *event;
> +	int cpu;
> +
> +	guard(rcu)();
> +	guard(preempt)();
> +
> +	cpu = smp_processor_id();
> +	cpu_unwind = &defer->cpu_events[cpu];
> +
> +	WRITE_ONCE(cpu_unwind->processing, 1);
> +	/*
> +	 * Make sure the above is seen for the rcuwait in
> +	 * perf_remove_unwind_deferred() before iterating the loop.
> +	 */
> +	smp_mb();
> +
> +	list_for_each_entry_rcu(event, &cpu_unwind->list, unwind_list) {
> +		perf_event_callchain_deferred(event, trace);
> +		/* Only the first CPU event gets the trace */
> +		break;

I guess this is to emit a single callchain record when more than one event
requested the deferred callchains for the same task, like:

  $ perf record -a -e cycles,instructions

right?


> +	}
> +
> +	WRITE_ONCE(cpu_unwind->processing, 0);
> +	rcuwait_wake_up(&cpu_unwind->pending_unwind_wait);
> +}
> +
>  static void perf_free_addr_filters(struct perf_event *event);
>  
>  /* vs perf_event_alloc() error */
> @@ -8198,6 +8355,15 @@ static int deferred_request_nmi(struct perf_event *event)
>  	return 0;
>  }
>  
> +static int deferred_unwind_request(struct perf_unwind_deferred *defer)
> +{
> +	u64 cookie;
> +	int ret;
> +
> +	ret = unwind_deferred_request(&defer->unwind_work, &cookie);
> +	return ret < 0 ? ret : 0;
> +}
> +
>  /*
>   * Returns:
>  *     > 0 : if already queued.
> @@ -8210,11 +8376,14 @@ static int deferred_request(struct perf_event *event)
>  	int pending;
>  	int ret;
>  
> -	/* Only defer for task events */
> -	if (!event->ctx->task)
> +	if ((current->flags & PF_KTHREAD) || !user_mode(task_pt_regs(current)))
>  		return -EINVAL;
>  
> -	if ((current->flags & PF_KTHREAD) || !user_mode(task_pt_regs(current)))
> +	if (event->unwind_deferred)
> +		return deferred_unwind_request(event->unwind_deferred);
> +
> +	/* Per CPU events should have had unwind_deferred set! */
> +	if (WARN_ON_ONCE(!event->ctx->task))
>  		return -EINVAL;
>  
>  	if (in_nmi())
> @@ -13100,13 +13269,20 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>  		}
>  	}
>  
> +	/* Setup unwind deferring for per CPU events */
> +	if (event->attr.defer_callchain && !task) {

As I said it should handle per-task and per-CPU events.  How about this?

	if (event->attr.defer_callchain) {
		if (event->cpu >= 0) {
			err = perf_add_unwind_deferred(event);
			if (err)
				return ERR_PTR(err);
		} else {
			init_task_work(&event->pending_unwind_work,
					perf_event_deferred_task);
		}
	}

> +		err = perf_add_unwind_deferred(event);
> +		if (err)
> +			return ERR_PTR(err);
> +	}
> +
>  	err = security_perf_event_alloc(event);
>  	if (err)
>  		return ERR_PTR(err);
>  
>  	if (event->attr.defer_callchain)
>  		init_task_work(&event->pending_unwind_work,
> -			       perf_event_callchain_deferred);
> +			       perf_event_deferred_task);

And then you can remove it here.

Thanks,
Namhyung

>  
>  	/* symmetric to unaccount_event() in _free_event() */
>  	account_event(event);
> -- 
> 2.47.2
> 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v6 5/5] perf: Support deferred user callchains for per CPU events
  2025-05-01  1:32 ` [PATCH v6 5/5] perf: Support deferred user callchains for per CPU events Steven Rostedt
  2025-05-01 20:14   ` Namhyung Kim
@ 2025-05-01 20:20   ` Namhyung Kim
  1 sibling, 0 replies; 10+ messages in thread
From: Namhyung Kim @ 2025-05-01 20:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Josh Poimboeuf, x86,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Indu Bhagat, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James, Andrii Nakryiko, Jens Remus,
	Florian Weimer, Andy Lutomirski, Weinan Liu, Blake Jones,
	Beau Belgrave, Jose E. Marchesi, Alexander Aring

On Wed, Apr 30, 2025 at 09:32:07PM -0400, Steven Rostedt wrote:
> @@ -13100,13 +13269,20 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>  		}
>  	}
>  
> +	/* Setup unwind deferring for per CPU events */
> +	if (event->attr.defer_callchain && !task) {
> +		err = perf_add_unwind_deferred(event);
> +		if (err)
> +			return ERR_PTR(err);
> +	}
> +
>  	err = security_perf_event_alloc(event);
>  	if (err)
>  		return ERR_PTR(err);

It seems perf_remove_unwind_deferred() is not called from
__free_event() - with double underscores.

Thanks,
Namhyung


>  
>  	if (event->attr.defer_callchain)
>  		init_task_work(&event->pending_unwind_work,
> -			       perf_event_callchain_deferred);
> +			       perf_event_deferred_task);
>  
>  	/* symmetric to unaccount_event() in _free_event() */
>  	account_event(event);
> -- 
> 2.47.2
> 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v6 5/5] perf: Support deferred user callchains for per CPU events
  2025-05-01 20:14   ` Namhyung Kim
@ 2025-05-01 20:57     ` Steven Rostedt
  2025-05-02 20:16       ` Namhyung Kim
  0 siblings, 1 reply; 10+ messages in thread
From: Steven Rostedt @ 2025-05-01 20:57 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Josh Poimboeuf, x86,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Indu Bhagat, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James, Andrii Nakryiko, Jens Remus,
	Florian Weimer, Andy Lutomirski, Weinan Liu, Blake Jones,
	Beau Belgrave, Jose E. Marchesi, Alexander Aring

On Thu, 1 May 2025 13:14:11 -0700
Namhyung Kim <namhyung@kernel.org> wrote:

> Hi Steve,
> 
> On Wed, Apr 30, 2025 at 09:32:07PM -0400, Steven Rostedt wrote:

> > To solve this, when a per CPU event is created that has defer_callchain
> > attribute set, it will do a lookup from a global list
> > (unwind_deferred_list), for a perf_unwind_deferred descriptor that has the
> > id that matches the PID of the current task's group_leader.  
> 
> Nice, it'd work well with the perf tools at least.

Cool!



> > +static void perf_event_deferred_cpu(struct unwind_work *work,
> > +				    struct unwind_stacktrace *trace, u64 cookie)
> > +{
> > +	struct perf_unwind_deferred *defer =
> > +		container_of(work, struct perf_unwind_deferred, unwind_work);
> > +	struct perf_unwind_cpu *cpu_unwind;
> > +	struct perf_event *event;
> > +	int cpu;
> > +
> > +	guard(rcu)();
> > +	guard(preempt)();
> > +
> > +	cpu = smp_processor_id();
> > +	cpu_unwind = &defer->cpu_events[cpu];
> > +
> > +	WRITE_ONCE(cpu_unwind->processing, 1);
> > +	/*
> > +	 * Make sure the above is seen for the rcuwait in
> > +	 * perf_remove_unwind_deferred() before iterating the loop.
> > +	 */
> > +	smp_mb();
> > +
> > +	list_for_each_entry_rcu(event, &cpu_unwind->list, unwind_list) {
> > +		perf_event_callchain_deferred(event, trace);
> > +		/* Only the first CPU event gets the trace */
> > +		break;  
> 
> I guess this is to emit a single callchain record when more than one event
> requested the deferred callchains for the same task, like:
> 
>   $ perf record -a -e cycles,instructions
> 
> right?

Yeah. If perf assigns more than one per CPU event, we only need one of
those events to record the deferred trace, not both of them.

But I keep a linked list so that if the program closes the first one and
keeps the second active, this will still work, as the first one would be
removed from the list, and the second one would pick up the tracing after
that.
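
Roughly, the two sides of that hand-off look like this (condensed sketch
only, not the actual patch code; the processing flag, memory barriers and
error handling are left out):

	/* Callback side: only the current head of the per CPU list records */
	list_for_each_entry_rcu(event, &cpu_unwind->list, unwind_list) {
		perf_event_callchain_deferred(event, trace);
		break;	/* one deferred record per CPU is enough */
	}

	/* Close side: unlinking the event makes the next one the new head */
	list_del_rcu(&event->unwind_list);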

> 
> 
> > +	}
> > +
> > +	WRITE_ONCE(cpu_unwind->processing, 0);
> > +	rcuwait_wake_up(&cpu_unwind->pending_unwind_wait);
> > +}
> > +
> >  static void perf_free_addr_filters(struct perf_event *event);
> >  
> >  /* vs perf_event_alloc() error */
> > @@ -8198,6 +8355,15 @@ static int deferred_request_nmi(struct perf_event *event)
> >  	return 0;
> >  }
> >  
> > +static int deferred_unwind_request(struct perf_unwind_deferred *defer)
> > +{
> > +	u64 cookie;
> > +	int ret;
> > +
> > +	ret = unwind_deferred_request(&defer->unwind_work, &cookie);
> > +	return ret < 0 ? ret : 0;
> > +}
> > +
> >  /*
> >   * Returns:
> >  *     > 0 : if already queued.
> > @@ -8210,11 +8376,14 @@ static int deferred_request(struct perf_event *event)
> >  	int pending;
> >  	int ret;
> >  
> > -	/* Only defer for task events */
> > -	if (!event->ctx->task)
> > +	if ((current->flags & PF_KTHREAD) || !user_mode(task_pt_regs(current)))
> >  		return -EINVAL;
> >  
> > -	if ((current->flags & PF_KTHREAD) || !user_mode(task_pt_regs(current)))
> > +	if (event->unwind_deferred)
> > +		return deferred_unwind_request(event->unwind_deferred);
> > +
> > +	/* Per CPU events should have had unwind_deferred set! */
> > +	if (WARN_ON_ONCE(!event->ctx->task))
> >  		return -EINVAL;
> >  
> >  	if (in_nmi())
> > @@ -13100,13 +13269,20 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> >  		}
> >  	}
> >  
> > +	/* Setup unwind deferring for per CPU events */
> > +	if (event->attr.defer_callchain && !task) {  
> 
> As I said it should handle per-task and per-CPU events.  How about this?

Hmm, I just added some printk()s in this code, and it seems that perf
record always creates per CPU events.

But if an event is per CPU and per task, will it still only trace that
task? It will never trace another task, right?

Because the way this is currently implemented is that the event that
requested the callback is the one that records it, even if it runs on
another CPU:

In defer_request_nmi():

	struct callback_head *work = &event->pending_unwind_work;
	int ret;

	if (event->pending_unwind_callback)
		return 1;

	ret = task_work_add(current, work, TWA_NMI_CURRENT);
	if (ret)
		return ret;

	event->pending_unwind_callback = 1;

The task_work_add() adds the work from the event's pending_unwind_work.

Now the callback will be:

static void perf_event_deferred_task(struct callback_head *work)
{
	struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);

// the above is the event that requested this. This may run on another CPU.

	struct unwind_stacktrace trace;

	if (!event->pending_unwind_callback)
		return;

	if (unwind_deferred_trace(&trace) >= 0) {

		/*
		 * All accesses to the event must belong to the same implicit RCU
		 * read-side critical section as the ->pending_unwind_callback reset.
		 * See comment in perf_pending_unwind_sync().
		 */
		guard(rcu)();
		perf_event_callchain_deferred(event, &trace);

// The above records the stack trace to that event.
// Again, this may happen on another CPU.

	}

	event->pending_unwind_callback = 0;
	local_dec(&event->ctx->nr_no_switch_fast);
	rcuwait_wake_up(&event->pending_unwind_wait);
}

Is it an issue to record into an event from a different CPU, if that
event is also only tracing a single task?

> 
> 	if (event->attr.defer_callchain) {
> 		if (event->cpu >= 0) {
> 			err = perf_add_unwind_deferred(event);
> 			if (err)
> 				return ERR_PTR(err);
> 		} else {
> 			init_task_work(&event->pending_unwind_work,
> 					perf_event_deferred_task);
> 		}
> 	}
> 
> > +		err = perf_add_unwind_deferred(event);
> > +		if (err)
> > +			return ERR_PTR(err);
> > +	}
> > +
> >  	err = security_perf_event_alloc(event);
> >  	if (err)
> >  		return ERR_PTR(err);
> >  
> >  	if (event->attr.defer_callchain)
> >  		init_task_work(&event->pending_unwind_work,
> > -			       perf_event_callchain_deferred);
> > +			       perf_event_deferred_task);  
> 
> And then you can remove it here.

There's nothing wrong with always initializing it. It will just never be
called.

What situation do we have where cpu is negative? What's the perf command?
Is there one?

> 
> Thanks,
> Namhyung
> 

Thanks for the review.

-- Steve

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v6 5/5] perf: Support deferred user callchains for per CPU events
  2025-05-01 20:57     ` Steven Rostedt
@ 2025-05-02 20:16       ` Namhyung Kim
  0 siblings, 0 replies; 10+ messages in thread
From: Namhyung Kim @ 2025-05-02 20:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Mathieu Desnoyers, Andrew Morton, Josh Poimboeuf, x86,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Indu Bhagat, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains,
	Jordan Rome, Sam James, Andrii Nakryiko, Jens Remus,
	Florian Weimer, Andy Lutomirski, Weinan Liu, Blake Jones,
	Beau Belgrave, Jose E. Marchesi, Alexander Aring

On Thu, May 01, 2025 at 04:57:30PM -0400, Steven Rostedt wrote:
> On Thu, 1 May 2025 13:14:11 -0700
> Namhyung Kim <namhyung@kernel.org> wrote:
> 
> > Hi Steve,
> > 
> > On Wed, Apr 30, 2025 at 09:32:07PM -0400, Steven Rostedt wrote:
> 
> > > To solve this, when a per CPU event is created that has defer_callchain
> > > attribute set, it will do a lookup from a global list
> > > (unwind_deferred_list), for a perf_unwind_deferred descriptor that has the
> > > id that matches the PID of the current task's group_leader.  
> > 
> > Nice, it'd work well with the perf tools at least.
> 
> Cool!
> 
> 
> 
> > > +static void perf_event_deferred_cpu(struct unwind_work *work,
> > > +				    struct unwind_stacktrace *trace, u64 cookie)
> > > +{
> > > +	struct perf_unwind_deferred *defer =
> > > +		container_of(work, struct perf_unwind_deferred, unwind_work);
> > > +	struct perf_unwind_cpu *cpu_unwind;
> > > +	struct perf_event *event;
> > > +	int cpu;
> > > +
> > > +	guard(rcu)();
> > > +	guard(preempt)();
> > > +
> > > +	cpu = smp_processor_id();
> > > +	cpu_unwind = &defer->cpu_events[cpu];
> > > +
> > > +	WRITE_ONCE(cpu_unwind->processing, 1);
> > > +	/*
> > > +	 * Make sure the above is seen for the rcuwait in
> > > +	 * perf_remove_unwind_deferred() before iterating the loop.
> > > +	 */
> > > +	smp_mb();
> > > +
> > > +	list_for_each_entry_rcu(event, &cpu_unwind->list, unwind_list) {
> > > +		perf_event_callchain_deferred(event, trace);
> > > +		/* Only the first CPU event gets the trace */
> > > +		break;  
> > 
> > I guess this is to emit a single callchain record when more than one event
> > requested the deferred callchains for the same task, like:
> > 
> >   $ perf record -a -e cycles,instructions
> > 
> > right?
> 
> Yeah. If perf assigns more than one per CPU event, we only need one of
> those events to record the deferred trace, not both of them.
> 
> But I keep a linked list so that if the program closes the first one and
> keeps the second active, this will still work, as the first one would be
> removed from the list, and the second one would pick up the tracing after
> that.

Makes sense.

> 
> > 
> > 
> > > +	}
> > > +
> > > +	WRITE_ONCE(cpu_unwind->processing, 0);
> > > +	rcuwait_wake_up(&cpu_unwind->pending_unwind_wait);
> > > +}
> > > +
> > >  static void perf_free_addr_filters(struct perf_event *event);
> > >  
> > >  /* vs perf_event_alloc() error */
> > > @@ -8198,6 +8355,15 @@ static int deferred_request_nmi(struct perf_event *event)
> > >  	return 0;
> > >  }
> > >  
> > > +static int deferred_unwind_request(struct perf_unwind_deferred *defer)
> > > +{
> > > +	u64 cookie;
> > > +	int ret;
> > > +
> > > +	ret = unwind_deferred_request(&defer->unwind_work, &cookie);
> > > +	return ret < 0 ? ret : 0;
> > > +}
> > > +
> > >  /*
> > >   * Returns:
> > >  *     > 0 : if already queued.
> > > @@ -8210,11 +8376,14 @@ static int deferred_request(struct perf_event *event)
> > >  	int pending;
> > >  	int ret;
> > >  
> > > -	/* Only defer for task events */
> > > -	if (!event->ctx->task)
> > > +	if ((current->flags & PF_KTHREAD) || !user_mode(task_pt_regs(current)))
> > >  		return -EINVAL;
> > >  
> > > -	if ((current->flags & PF_KTHREAD) || !user_mode(task_pt_regs(current)))
> > > +	if (event->unwind_deferred)
> > > +		return deferred_unwind_request(event->unwind_deferred);
> > > +
> > > +	/* Per CPU events should have had unwind_deferred set! */
> > > +	if (WARN_ON_ONCE(!event->ctx->task))
> > >  		return -EINVAL;
> > >  
> > >  	if (in_nmi())
> > > @@ -13100,13 +13269,20 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> > >  		}
> > >  	}
> > >  
> > > +	/* Setup unwind deferring for per CPU events */
> > > +	if (event->attr.defer_callchain && !task) {  
> > 
> > As I said it should handle per-task and per-CPU events.  How about this?
> 
> Hmm, I just added some printk()s in this code, and it seems that perf
> record always creates per CPU events.

Right, that's the default behavior.

> 
> But if an event is per CPU and per task, will it still only trace that
> task? It will never trace another task, right?

Yes, the event can be inherited by a child, but then the child will create a
new event, so each task will have its own events.

> 
> Because the way this is currently implemented is that the event that
> requested the callback is the one that records it, even if it runs on
> another CPU:
> 
> In defer_request_nmi():
> 
> 	struct callback_head *work = &event->pending_unwind_work;
> 	int ret;
> 
> 	if (event->pending_unwind_callback)
> 		return 1;
> 
> 	ret = task_work_add(current, work, TWA_NMI_CURRENT);
> 	if (ret)
> 		return ret;
> 
> 	event->pending_unwind_callback = 1;
> 
> The task_work_add() adds the work from the event's pending_unwind_work.
> 
> Now the callback will be:
> 
> static void perf_event_deferred_task(struct callback_head *work)
> {
> 	struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
> 
> // the above is the event that requested this. This may run on another CPU.
> 
> 	struct unwind_stacktrace trace;
> 
> 	if (!event->pending_unwind_callback)
> 		return;
> 
> 	if (unwind_deferred_trace(&trace) >= 0) {
> 
> 		/*
> 		 * All accesses to the event must belong to the same implicit RCU
> 		 * read-side critical section as the ->pending_unwind_callback reset.
> 		 * See comment in perf_pending_unwind_sync().
> 		 */
> 		guard(rcu)();
> 		perf_event_callchain_deferred(event, &trace);
> 
> // The above records the stack trace to that event.
> // Again, this may happen on another CPU.
> 
> 	}
> 
> 	event->pending_unwind_callback = 0;
> 	local_dec(&event->ctx->nr_no_switch_fast);
> 	rcuwait_wake_up(&event->pending_unwind_wait);
> }
> 
> Is it an issue to record into an event from a different CPU, if that
> event is also only tracing a single task?

IIUC it should be fine as long as you use the unwind descriptor logic
like in the per-CPU case.  The data should be written to the current
CPU's ring buffer for per-task and per-CPU events.

> 
> > 
> > 	if (event->attr.defer_callchain) {
> > 		if (event->cpu >= 0) {
> > 			err = perf_add_unwind_deferred(event);
> > 			if (err)
> > 				return ERR_PTR(err);
> > 		} else {
> > 			init_task_work(&event->pending_unwind_work,
> > 					perf_event_deferred_task);
> > 		}
> > 	}
> > 
> > > +		err = perf_add_unwind_deferred(event);
> > > +		if (err)
> > > +			return ERR_PTR(err);
> > > +	}
> > > +
> > >  	err = security_perf_event_alloc(event);
> > >  	if (err)
> > >  		return ERR_PTR(err);
> > >  
> > >  	if (event->attr.defer_callchain)
> > >  		init_task_work(&event->pending_unwind_work,
> > > -			       perf_event_callchain_deferred);
> > > +			       perf_event_deferred_task);  
> > 
> > And then you can remove it here.
> 
> There's nothing wrong with always initializing it. It will just never be
> called.

Ok.

> 
> What situation do we have where cpu is negative? What's the perf command?
> Is there one?

Yep, there's a --per-thread option for just per-task events.
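
For example (the workload name here is just a placeholder):

  $ perf record --per-thread -g -- ./myprog

That should open the events per-task with a negative cpu, i.e. hitting the
else branch in the snippet above.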

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread

Thread overview: 10+ messages
2025-05-01  1:32 [PATCH v6 0/5] perf: Deferred unwinding of user space stack traces for per CPU events Steven Rostedt
2025-05-01  1:32 ` [PATCH v6 1/5] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
2025-05-01  1:32 ` [PATCH v6 2/5] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
2025-05-01  1:32 ` [PATCH v6 3/5] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
2025-05-01  1:32 ` [PATCH v6 4/5] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
2025-05-01  1:32 ` [PATCH v6 5/5] perf: Support deferred user callchains for per CPU events Steven Rostedt
2025-05-01 20:14   ` Namhyung Kim
2025-05-01 20:57     ` Steven Rostedt
2025-05-02 20:16       ` Namhyung Kim
2025-05-01 20:20   ` Namhyung Kim
