* [PATCH v5 0/9] tracing: Deferred unwinding of user space stack traces
@ 2025-04-24 19:24 Steven Rostedt
  2025-04-24 19:24 ` [PATCH v5 1/9] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
                   ` (9 more replies)
  0 siblings, 10 replies; 16+ messages in thread
From: Steven Rostedt @ 2025-04-24 19:24 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring


I'm currently working on getting sframe support into the kernel.
Josh Poimboeuf did a lot of the hard work already, but he told me he doesn't
have time to continue it, so I'm picking it up where he left off.

His last series, v4, is here:

  https://lore.kernel.org/all/cover.1737511963.git.jpoimboe@kernel.org/

It covers a lot of topics as he found issues with other aspects of
the kernel that needed to be fixed for sframes to work properly.

This series focuses on implementing deferred unwinding for ftrace
(LTTng could use it as well).

This implements the three API functions that Josh had in his series:

  unwind_deferred_init()
  unwind_deferred_request()
  unwind_deferred_cancel()

The difference is that it does not add the task_work to the tracer's
unwind_work structure. Instead, it uses a global bitmask where each
registered tracer gets a bit. That means it can have at most 32 tracers
registered at a time on a 32-bit system, and 64 tracers on a 64-bit
system. In practice, there should not be more than 10, and even that is
a lot.

This is also why perf does not use this method, as it would register
a callback for pretty much every event, task, or CPU, and that goes
into the hundreds.

But for generic tracers that have a single entity tracing multiple
tasks, this works out well.

When a tracer registers with unwind_deferred_init(), an available bit
in the global mask is assigned to that tracer. If there are no more
bits available, -EBUSY is returned.

When a tracer requests a stacktrace on task exit back to user space,
it is given a cookie that is associated with that stacktrace. The tracer
can save that cookie into its buffer and use it to attach the stacktrace
when it comes back. The tracer's bit is set in the task structure's
unwind_mask, and when the task returns to user space, it iterates over
all registered tracers and, for each one whose bit is set, calls its
callback and clears the bit.
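
Below is a minimal sketch of how a tracer might hook into this interface.
The my_* names and the callback body are illustrative only and not part of
this series:

  #include <linux/unwind_deferred.h>

  static struct unwind_work my_unwind;

  static void my_unwind_cb(struct unwind_work *work,
                           struct unwind_stacktrace *trace, u64 cookie)
  {
          /*
           * Runs in task context just before the task returns to user
           * space.  Record trace->entries[0..trace->nr) together with
           * the cookie so post-processing can attach this stack to the
           * events that recorded the same cookie earlier.
           */
  }

  static int my_tracer_start(void)
  {
          /* Takes one bit in the global mask; -EBUSY if none are free */
          return unwind_deferred_init(&my_unwind, my_unwind_cb);
  }

  static void my_tracer_event_hook(void)
  {
          u64 cookie;

          /* Request the deferred unwind and save the cookie with the event */
          if (unwind_deferred_request(&my_unwind, &cookie) == 0) {
                  /* record cookie into the tracer's buffer */
          }
  }

  static void my_tracer_stop(void)
  {
          unwind_deferred_cancel(&my_unwind);
  }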

The last patches update the tracing subsystem to use this for its
global user space stack tracing per event (enabling it for individual
events is not supported yet). They create two new events: one records
the cookie when the stack trace is requested, and the other records the
user space stacktrace itself.

Since the callback is called in faultable context, it uses this opportunity
to look at the addresses in the stacktrace and convert them to where
they would be in the executable file (if found). It also records
the inode and device major/minor numbers into the trace, so that
post-processing can find the exact files the stack addresses belong to.

Josh Poimboeuf (3):
      unwind_user/deferred: Add deferred unwinding interface
      unwind_user/deferred: Make unwind deferral requests NMI-safe
      mm: Add guard for mmap_read_lock

Steven Rostedt (6):
      unwind deferred: Use bitmask to determine which callbacks to call
      tracing: Do not bother getting user space stacktraces for kernel threads
      tracing: Rename __dynamic_array() to __dynamic_field() for ftrace events
      tracing: Implement deferred user space stacktracing
      tracing: Have deferred user space stacktrace show file offsets
      tracing: Show inode and device major:minor in deferred user space stacktrace

----
 include/linux/entry-common.h          |   2 +-
 include/linux/mmap_lock.h             |   2 +
 include/linux/sched.h                 |   1 +
 include/linux/unwind_deferred.h       |  23 ++-
 include/linux/unwind_deferred_types.h |   4 +
 kernel/trace/trace.c                  | 138 +++++++++++++++++
 kernel/trace/trace.h                  |  14 +-
 kernel/trace/trace_entries.h          |  38 ++++-
 kernel/trace/trace_export.c           |  25 +++-
 kernel/trace/trace_output.c           |  99 ++++++++++++
 kernel/unwind/deferred.c              | 275 +++++++++++++++++++++++++++++++++-
 11 files changed, 610 insertions(+), 11 deletions(-)


* [PATCH v5 1/9] unwind_user/deferred: Add deferred unwinding interface
  2025-04-24 19:24 [PATCH v5 0/9] tracing: Deferred unwinding of user space stack traces Steven Rostedt
@ 2025-04-24 19:24 ` Steven Rostedt
  2025-04-24 19:24 ` [PATCH v5 2/9] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2025-04-24 19:24 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring

From: Josh Poimboeuf <jpoimboe@kernel.org>

Add an interface for scheduling task work to unwind the user space stack
before returning to user space. This solves several problems for its
callers:

  - Ensure the unwind happens in task context even if the caller may be
    running in NMI or interrupt context.

  - Avoid duplicate unwinds, whether called multiple times by the same
    caller or by different callers.

  - Create a "context cookie" which allows trace post-processing to
    correlate kernel unwinds/traces with the user unwind.

A concept of a "cookie" is created to detect when the stacktrace is the
same. A cookie is generated the first time a user space stacktrace is
requested after the task enters the kernel. As the stacktrace is saved on
the task_struct while the task is in the kernel, if another request comes
in while the cookie is still the same, the saved stacktrace is used and
does not have to be regenerated.

The cookie is passed to the caller on request, and when the stacktrace is
generated upon returning to user space, the requester's callback is called
with the cookie as well as the stacktrace.
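
As a hedged post-processing sketch (assuming the cookie layout used by
ctx_to_cookie() below is kept: CPU id in the high 16 bits, the per-CPU
counter shifted left by one in the low 48 bits, LSB always set), a tool
can split a cookie back into its parts:

  #include <linux/types.h>

  static inline unsigned int cookie_to_cpu(u64 cookie)
  {
          return cookie >> 48;            /* high 16 bits: CPU id */
  }

  static inline u64 cookie_to_ctx(u64 cookie)
  {
          /* low 48 bits, dropping the always-set LSB tag */
          return (cookie & ((1ULL << 48) - 1)) >> 1;
  }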

Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v4: https://lore.kernel.org/all/6052e8487746603bdb29b65f4033e739092d9925.1737511963.git.jpoimboe@kernel.org/

- Fixed comment where it said 12 bits but it should have been 16
 (Peter Zijlstra)

- Made the cookie LSB always set to 1 to make sure it never returns zero
 (Peter Zijlstra)

- Updated comment about unwind_ctx_ctr only being generated as needed.
  (Josh Poimboeuf)

 include/linux/entry-common.h          |   2 +-
 include/linux/unwind_deferred.h       |  22 +++-
 include/linux/unwind_deferred_types.h |   3 +
 kernel/unwind/deferred.c              | 163 +++++++++++++++++++++++++-
 4 files changed, 186 insertions(+), 4 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index fb2b27154fee..725ec0e87cdd 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -112,7 +112,6 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
 
 	CT_WARN_ON(__ct_state() != CT_STATE_USER);
 	user_exit_irqoff();
-	unwind_enter_from_user_mode();
 
 	instrumentation_begin();
 	kmsan_unpoison_entry_regs(regs);
@@ -363,6 +362,7 @@ static __always_inline void exit_to_user_mode(void)
 	lockdep_hardirqs_on_prepare();
 	instrumentation_end();
 
+	unwind_exit_to_user_mode();
 	user_enter_irqoff();
 	arch_exit_to_user_mode();
 	lockdep_hardirqs_on(CALLER_ADDR0);
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 54f1aa6caf29..d36784cae658 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -2,9 +2,19 @@
 #ifndef _LINUX_UNWIND_USER_DEFERRED_H
 #define _LINUX_UNWIND_USER_DEFERRED_H
 
+#include <linux/task_work.h>
 #include <linux/unwind_user.h>
 #include <linux/unwind_deferred_types.h>
 
+struct unwind_work;
+
+typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stacktrace *trace, u64 cookie);
+
+struct unwind_work {
+	struct list_head		list;
+	unwind_callback_t		func;
+};
+
 #ifdef CONFIG_UNWIND_USER
 
 void unwind_task_init(struct task_struct *task);
@@ -12,9 +22,14 @@ void unwind_task_free(struct task_struct *task);
 
 int unwind_deferred_trace(struct unwind_stacktrace *trace);
 
-static __always_inline void unwind_enter_from_user_mode(void)
+int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
+int unwind_deferred_request(struct unwind_work *work, u64 *cookie);
+void unwind_deferred_cancel(struct unwind_work *work);
+
+static __always_inline void unwind_exit_to_user_mode(void)
 {
 	current->unwind_info.cache.nr_entries = 0;
+	current->unwind_info.cookie = 0;
 }
 
 #else /* !CONFIG_UNWIND_USER */
@@ -23,8 +38,11 @@ static inline void unwind_task_init(struct task_struct *task) {}
 static inline void unwind_task_free(struct task_struct *task) {}
 
 static inline int unwind_deferred_trace(struct unwind_stacktrace *trace) { return -ENOSYS; }
+static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func) { return -ENOSYS; }
+static inline int unwind_deferred_request(struct unwind_work *work, u64 *cookie) { return -ENOSYS; }
+static inline void unwind_deferred_cancel(struct unwind_work *work) {}
 
-static inline void unwind_enter_from_user_mode(void) {}
+static inline void unwind_exit_to_user_mode(void) {}
 
 #endif /* !CONFIG_UNWIND_USER */
 
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index b3b7389ee6eb..33373c32c221 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -9,6 +9,9 @@ struct unwind_cache {
 
 struct unwind_task_info {
 	struct unwind_cache	cache;
+	u64			cookie;
+	struct callback_head	work;
+	int			pending;
 };
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 99d4d9e049cd..dc438c5f6618 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -2,13 +2,72 @@
 /*
  * Deferred user space unwinding
  */
+#include <linux/sched/task_stack.h>
+#include <linux/unwind_deferred.h>
+#include <linux/task_work.h>
 #include <linux/kernel.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
-#include <linux/unwind_deferred.h>
+#include <linux/mm.h>
 
 #define UNWIND_MAX_ENTRIES 512
 
+/*
+ * This is a unique percpu identifier for a given task entry context.
+ * Conceptually, it's incremented every time the CPU enters the kernel from
+ * user space, so that each "entry context" on the CPU gets a unique ID.  In
+ * reality, as an optimization, it's only incremented on demand for the first
+ * deferred unwind request after a given entry-from-user.
+ *
+ * It's combined with the CPU id to make a systemwide-unique "context cookie".
+ */
+static DEFINE_PER_CPU(u64, unwind_ctx_ctr);
+
+/* Guards adding to and reading the list of callbacks */
+static DEFINE_MUTEX(callback_mutex);
+static LIST_HEAD(callbacks);
+
+/*
+ * The context cookie is a unique identifier that is assigned to a user
+ * space stacktrace. As the user space stacktrace remains the same while
+ * the task is in the kernel, the cookie is an identifier for the stacktrace.
+ * Although it is possible for the stacktrace to get another cookie if another
+ * request is made after the cookie was cleared and before reentering user
+ * space.
+ *
+ * The high 16 bits are the CPU id; the lower 48 bits are a per-CPU entry
+ * counter shifted left by one and or'd with 1 (to prevent it from ever being
+ * zero).
+ */
+static u64 ctx_to_cookie(u64 cpu, u64 ctx)
+{
+	BUILD_BUG_ON(NR_CPUS > 65535);
+	return ((ctx << 1) & ((1UL << 48) - 1)) | (cpu << 48) | 1;
+}
+
+/*
+ * Read the task context cookie, first initializing it if this is the first
+ * call to get_cookie() since the most recent entry from user.
+ */
+static u64 get_cookie(struct unwind_task_info *info)
+{
+	u64 ctx_ctr;
+	u64 cookie;
+	u64 cpu;
+
+	guard(irqsave)();
+
+	cookie = info->cookie;
+	if (cookie)
+		return cookie;
+
+	cpu = raw_smp_processor_id();
+	ctx_ctr = __this_cpu_inc_return(unwind_ctx_ctr);
+	info->cookie = ctx_to_cookie(cpu, ctx_ctr);
+
+	return info->cookie;
+}
+
 int unwind_deferred_trace(struct unwind_stacktrace *trace)
 {
 	struct unwind_task_info *info = &current->unwind_info;
@@ -47,11 +106,112 @@ int unwind_deferred_trace(struct unwind_stacktrace *trace)
 	return 0;
 }
 
+static void unwind_deferred_task_work(struct callback_head *head)
+{
+	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
+	struct unwind_stacktrace trace;
+	struct unwind_work *work;
+	u64 cookie;
+
+	if (WARN_ON_ONCE(!info->pending))
+		return;
+
+	/* Allow work to come in again */
+	WRITE_ONCE(info->pending, 0);
+
+	/*
+	 * From here on out, the callback must always be called, even if it's
+	 * just an empty trace.
+	 */
+	trace.nr = 0;
+	trace.entries = NULL;
+
+	unwind_deferred_trace(&trace);
+
+	cookie = get_cookie(info);
+
+	guard(mutex)(&callback_mutex);
+	list_for_each_entry(work, &callbacks, list) {
+		work->func(work, &trace, cookie);
+	}
+	barrier();
+	/* If another task work is pending, reuse the cookie and stack trace */
+	if (!READ_ONCE(info->pending))
+		WRITE_ONCE(info->cookie, 0);
+}
+
+/*
+ * Schedule a user space unwind to be done in task work before exiting the
+ * kernel.
+ *
+ * The returned cookie output is a unique identifier for the current task entry
+ * context.  Its value will also be passed to the callback function.  It can be
+ * used to stitch kernel and user stack traces together in post-processing.
+ *
+ * It's valid to call this function multiple times for the same @work within
+ * the same task entry context.  Each call will return the same cookie.
+ * If the callback is not pending because it has already been previously called
+ * for the same entry context, it will be called again with the same stack trace
+ * and cookie.
+ *
+ * Returns 0 if the callback will be called when the task returns to user space,
+ *   Negative if there's an error.
+ */
+int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
+{
+	struct unwind_task_info *info = &current->unwind_info;
+	int ret;
+
+	*cookie = 0;
+
+	if (WARN_ON_ONCE(in_nmi()))
+		return -EINVAL;
+
+	if ((current->flags & PF_KTHREAD) || !user_mode(task_pt_regs(current)))
+		return -EINVAL;
+
+	guard(irqsave)();
+
+	*cookie = get_cookie(info);
+
+	/* callback already pending? */
+	if (info->pending)
+		return 0;
+
+	/* The work has been claimed, now schedule it. */
+	ret = task_work_add(current, &info->work, TWA_RESUME);
+	if (WARN_ON_ONCE(ret))
+		return ret;
+
+	info->pending = 1;
+	return 0;
+}
+
+void unwind_deferred_cancel(struct unwind_work *work)
+{
+	if (!work)
+		return;
+
+	guard(mutex)(&callback_mutex);
+	list_del(&work->list);
+}
+
+int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
+{
+	memset(work, 0, sizeof(*work));
+
+	guard(mutex)(&callback_mutex);
+	list_add(&work->list, &callbacks);
+	work->func = func;
+	return 0;
+}
+
 void unwind_task_init(struct task_struct *task)
 {
 	struct unwind_task_info *info = &task->unwind_info;
 
 	memset(info, 0, sizeof(*info));
+	init_task_work(&info->work, unwind_deferred_task_work);
 }
 
 void unwind_task_free(struct task_struct *task)
@@ -59,4 +219,5 @@ void unwind_task_free(struct task_struct *task)
 	struct unwind_task_info *info = &task->unwind_info;
 
 	kfree(info->cache.entries);
+	task_work_cancel(task, &info->work);
 }
-- 
2.47.2




* [PATCH v5 2/9] unwind_user/deferred: Make unwind deferral requests NMI-safe
  2025-04-24 19:24 [PATCH v5 0/9] tracing: Deferred unwinding of user space stack traces Steven Rostedt
  2025-04-24 19:24 ` [PATCH v5 1/9] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
@ 2025-04-24 19:24 ` Steven Rostedt
  2025-04-24 19:24 ` [PATCH v5 3/9] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2025-04-24 19:24 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring

From: Josh Poimboeuf <jpoimboe@kernel.org>

Make unwind_deferred_request() NMI-safe so tracers in NMI context can
call it to get the cookie immediately rather than have to do the fragile
"schedule irq work and then call unwind_deferred_request()" dance.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/unwind_deferred_types.h |   1 +
 kernel/unwind/deferred.c              | 100 ++++++++++++++++++++++----
 2 files changed, 89 insertions(+), 12 deletions(-)

diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 33373c32c221..8f47d77ddda0 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -10,6 +10,7 @@ struct unwind_cache {
 struct unwind_task_info {
 	struct unwind_cache	cache;
 	u64			cookie;
+	u64			nmi_cookie;
 	struct callback_head	work;
 	int			pending;
 };
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index dc438c5f6618..2afd197da2ef 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -47,23 +47,47 @@ static u64 ctx_to_cookie(u64 cpu, u64 ctx)
 
 /*
  * Read the task context cookie, first initializing it if this is the first
- * call to get_cookie() since the most recent entry from user.
+ * call to get_cookie() since the most recent entry from user.  This has to be
+ * done carefully to coordinate with unwind_deferred_request_nmi().
  */
 static u64 get_cookie(struct unwind_task_info *info)
 {
 	u64 ctx_ctr;
 	u64 cookie;
-	u64 cpu;
 
 	guard(irqsave)();
 
-	cookie = info->cookie;
+	cookie = READ_ONCE(info->cookie);
 	if (cookie)
 		return cookie;
 
-	cpu = raw_smp_processor_id();
-	ctx_ctr = __this_cpu_inc_return(unwind_ctx_ctr);
-	info->cookie = ctx_to_cookie(cpu, ctx_ctr);
+	ctx_ctr = __this_cpu_read(unwind_ctx_ctr);
+
+	/* Read ctx_ctr before info->nmi_cookie */
+	barrier();
+
+	cookie = READ_ONCE(info->nmi_cookie);
+	if (cookie) {
+		/*
+		 * This is the first call to get_cookie() since an NMI handler
+		 * first wrote it to info->nmi_cookie.  Sync it.
+		 */
+		WRITE_ONCE(info->cookie, cookie);
+		WRITE_ONCE(info->nmi_cookie, 0);
+		return cookie;
+	}
+
+	/*
+	 * Write info->cookie.  It's ok to race with an NMI here.  The value of
+	 * the cookie is based on ctx_ctr from before the NMI could have
+	 * incremented it.  The result will be the same even if cookie or
+	 * ctx_ctr end up getting written twice.
+	 */
+	cookie = ctx_to_cookie(raw_smp_processor_id(), ctx_ctr + 1);
+	WRITE_ONCE(info->cookie, cookie);
+	WRITE_ONCE(info->nmi_cookie, 0);
+	barrier();
+	__this_cpu_write(unwind_ctx_ctr, ctx_ctr + 1);
 
 	return info->cookie;
 }
@@ -140,6 +164,51 @@ static void unwind_deferred_task_work(struct callback_head *head)
 		WRITE_ONCE(info->cookie, 0);
 }
 
+static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *cookie)
+{
+	struct unwind_task_info *info = &current->unwind_info;
+	bool inited_cookie = false;
+	int ret;
+
+	*cookie = info->cookie;
+	if (!*cookie) {
+		/*
+		 * This is the first unwind request since the most recent entry
+		 * from user.  Initialize the task cookie.
+		 *
+		 * Don't write to info->cookie directly, otherwise it may get
+		 * cleared if the NMI occurred in the kernel during early entry
+		 * or late exit before the task work gets to run.  Instead, use
+		 * info->nmi_cookie which gets synced later by get_cookie().
+		 */
+		if (!info->nmi_cookie) {
+			u64 cpu = raw_smp_processor_id();
+			u64 ctx_ctr;
+
+			ctx_ctr = __this_cpu_inc_return(unwind_ctx_ctr);
+			info->nmi_cookie = ctx_to_cookie(cpu, ctx_ctr);
+
+			inited_cookie = true;
+		}
+
+		*cookie = info->nmi_cookie;
+	}
+
+	if (info->pending)
+		return 0;
+
+	ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
+	if (ret) {
+		if (inited_cookie)
+			info->nmi_cookie = 0;
+		return ret;
+	}
+
+	info->pending = 1;
+
+	return 0;
+}
+
 /*
  * Schedule a user space unwind to be done in task work before exiting the
  * kernel.
@@ -160,30 +229,37 @@ static void unwind_deferred_task_work(struct callback_head *head)
 int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 {
 	struct unwind_task_info *info = &current->unwind_info;
+	int pending;
 	int ret;
 
 	*cookie = 0;
 
-	if (WARN_ON_ONCE(in_nmi()))
-		return -EINVAL;
-
 	if ((current->flags & PF_KTHREAD) || !user_mode(task_pt_regs(current)))
 		return -EINVAL;
 
+	if (in_nmi())
+		return unwind_deferred_request_nmi(work, cookie);
+
 	guard(irqsave)();
 
 	*cookie = get_cookie(info);
 
 	/* callback already pending? */
-	if (info->pending)
+	pending = READ_ONCE(info->pending);
+	if (pending)
+		return 0;
+
+	/* Claim the work unless an NMI just now swooped in to do so. */
+	if (!try_cmpxchg(&info->pending, &pending, 1))
 		return 0;
 
 	/* The work has been claimed, now schedule it. */
 	ret = task_work_add(current, &info->work, TWA_RESUME);
-	if (WARN_ON_ONCE(ret))
+	if (WARN_ON_ONCE(ret)) {
+		WRITE_ONCE(info->pending, 0);
 		return ret;
+	}
 
-	info->pending = 1;
 	return 0;
 }
 
-- 
2.47.2




* [PATCH v5 3/9] unwind deferred: Use bitmask to determine which callbacks to call
  2025-04-24 19:24 [PATCH v5 0/9] tracing: Deferred unwinding of user space stack traces Steven Rostedt
  2025-04-24 19:24 ` [PATCH v5 1/9] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
  2025-04-24 19:24 ` [PATCH v5 2/9] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
@ 2025-04-24 19:24 ` Steven Rostedt
  2025-04-28 16:33   ` Mathieu Desnoyers
  2025-04-24 19:25 ` [PATCH v5 4/9] tracing: Do not bother getting user space stacktraces for kernel threads Steven Rostedt
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2025-04-24 19:24 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring

From: Steven Rostedt <rostedt@goodmis.org>

In order to know which registered callback requested a stacktrace for when
the task goes back to user space, add a bitmask for all registered
tracers. The bitmask is the size of long, which means that on a 32-bit
machine, it can have at most 32 registered tracers, and on 64-bit, it can
have at most 64 registered tracers. This should not be an issue as there
should not be more than 10 (unless BPF can abuse this?).

When a tracer registers with unwind_deferred_init() it will get a bit
number assigned to it. When a tracer requests a stacktrace, it will have
its bit set within the task_struct. When the task returns to user
space, the callbacks of all registered tracers whose bits are set in the
task's mask will be called.

When a tracer is removed by unwind_deferred_cancel(), all current tasks
will have the associated bit cleared, just in case another tracer gets
registered immediately afterward and would otherwise have its callback
called unexpectedly.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/sched.h           |  1 +
 include/linux/unwind_deferred.h |  1 +
 kernel/unwind/deferred.c        | 44 ++++++++++++++++++++++++++++++---
 3 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a1e1c07cadfb..d3ee0c5405d6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1649,6 +1649,7 @@ struct task_struct {
 
 #ifdef CONFIG_UNWIND_USER
 	struct unwind_task_info		unwind_info;
+	unsigned long			unwind_mask;
 #endif
 
 	/* CPU-specific state of this task: */
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index d36784cae658..719a7cfb3164 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -13,6 +13,7 @@ typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stackt
 struct unwind_work {
 	struct list_head		list;
 	unwind_callback_t		func;
+	int				bit;
 };
 
 #ifdef CONFIG_UNWIND_USER
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 2afd197da2ef..f505cb1766de 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -26,6 +26,7 @@ static DEFINE_PER_CPU(u64, unwind_ctx_ctr);
 /* Guards adding to and reading the list of callbacks */
 static DEFINE_MUTEX(callback_mutex);
 static LIST_HEAD(callbacks);
+static unsigned long unwind_mask;
 
 /*
  * The context cookie is a unique identifier that is assigned to a user
@@ -135,6 +136,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
 	struct unwind_stacktrace trace;
 	struct unwind_work *work;
+	struct task_struct *task = current;
 	u64 cookie;
 
 	if (WARN_ON_ONCE(!info->pending))
@@ -156,7 +158,10 @@ static void unwind_deferred_task_work(struct callback_head *head)
 
 	guard(mutex)(&callback_mutex);
 	list_for_each_entry(work, &callbacks, list) {
-		work->func(work, &trace, cookie);
+		if (task->unwind_mask & (1UL << work->bit)) {
+			work->func(work, &trace, cookie);
+			clear_bit(work->bit, &current->unwind_mask);
+		}
 	}
 	barrier();
 	/* If another task work is pending, reuse the cookie and stack trace */
@@ -194,9 +199,12 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *cookie)
 		*cookie = info->nmi_cookie;
 	}
 
-	if (info->pending)
+	if (current->unwind_mask & (1UL << work->bit))
 		return 0;
 
+	if (info->pending)
+		goto out;
+
 	ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
 	if (ret) {
 		if (inited_cookie)
@@ -205,6 +213,8 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *cookie)
 	}
 
 	info->pending = 1;
+ out:
+	set_bit(work->bit, &current->unwind_mask);
 
 	return 0;
 }
@@ -244,14 +254,18 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 
 	*cookie = get_cookie(info);
 
+	/* This is already queued */
+	if (current->unwind_mask & (1UL << work->bit))
+		return 0;
+
 	/* callback already pending? */
 	pending = READ_ONCE(info->pending);
 	if (pending)
-		return 0;
+		goto out;
 
 	/* Claim the work unless an NMI just now swooped in to do so. */
 	if (!try_cmpxchg(&info->pending, &pending, 1))
-		return 0;
+		goto out;
 
 	/* The work has been claimed, now schedule it. */
 	ret = task_work_add(current, &info->work, TWA_RESUME);
@@ -260,16 +274,29 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 		return ret;
 	}
 
+ out:
+	set_bit(work->bit, &current->unwind_mask);
+
 	return 0;
 }
 
 void unwind_deferred_cancel(struct unwind_work *work)
 {
+	struct task_struct *g, *t;
+
 	if (!work)
 		return;
 
 	guard(mutex)(&callback_mutex);
 	list_del(&work->list);
+
+	clear_bit(work->bit, &unwind_mask);
+
+	guard(rcu)();
+	/* Clear this bit from all threads */
+	for_each_process_thread(g, t) {
+		clear_bit(work->bit, &t->unwind_mask);
+	}
 }
 
 int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
@@ -277,6 +304,14 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 	memset(work, 0, sizeof(*work));
 
 	guard(mutex)(&callback_mutex);
+
+	/* See if there's a bit in the mask available */
+	if (unwind_mask == ~0UL)
+		return -EBUSY;
+
+	work->bit = ffz(unwind_mask);
+	unwind_mask |= 1UL << work->bit;
+
 	list_add(&work->list, &callbacks);
 	work->func = func;
 	return 0;
@@ -288,6 +323,7 @@ void unwind_task_init(struct task_struct *task)
 
 	memset(info, 0, sizeof(*info));
 	init_task_work(&info->work, unwind_deferred_task_work);
+	task->unwind_mask = 0;
 }
 
 void unwind_task_free(struct task_struct *task)
-- 
2.47.2




* [PATCH v5 4/9] tracing: Do not bother getting user space stacktraces for kernel threads
  2025-04-24 19:24 [PATCH v5 0/9] tracing: Deferred unwinding of user space stack traces Steven Rostedt
                   ` (2 preceding siblings ...)
  2025-04-24 19:24 ` [PATCH v5 3/9] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
@ 2025-04-24 19:25 ` Steven Rostedt
  2025-04-24 19:25 ` [PATCH v5 5/9] tracing: Rename __dynamic_array() to __dynamic_field() for ftrace events Steven Rostedt
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2025-04-24 19:25 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring

From: Steven Rostedt <rostedt@goodmis.org>

If a user space stacktrace is requested while running a kernel thread, just
return, as there is no user space stack to unwind.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8ddf6b17215c..523e98cd121d 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3087,6 +3087,10 @@ ftrace_trace_userstack(struct trace_array *tr,
 	if (!(tr->trace_flags & TRACE_ITER_USERSTACKTRACE))
 		return;
 
+	/* No point doing user space stacktraces on kernel threads */
+	if (current->flags & PF_KTHREAD)
+		return;
+
 	/*
 	 * NMIs can not handle page faults, even with fix ups.
 	 * The save user stack can (and often does) fault.
-- 
2.47.2




* [PATCH v5 5/9] tracing: Rename __dynamic_array() to __dynamic_field() for ftrace events
  2025-04-24 19:24 [PATCH v5 0/9] tracing: Deferred unwinding of user space stack traces Steven Rostedt
                   ` (3 preceding siblings ...)
  2025-04-24 19:25 ` [PATCH v5 4/9] tracing: Do not bother getting user space stacktraces for kernel threads Steven Rostedt
@ 2025-04-24 19:25 ` Steven Rostedt
  2025-04-24 19:25 ` [PATCH v5 6/9] tracing: Implement deferred user space stacktracing Steven Rostedt
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2025-04-24 19:25 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring

From: Steven Rostedt <rostedt@goodmis.org>

The ftrace events (like function, trace_print, etc) are created somewhat
manually and not via the TRACE_EVENT() or tracepoint magic macros. They
have their own macros.

Their dynamic fields were created with __dynamic_array(), but the output
is different from the __dynamic_array() used by TRACE_EVENT().

The TRACE_EVENT() __dynamic_array() creates the field like:

	field:__data_loc u8[] v_data;   offset:120;     size:4; signed:0;

Whereas the ftrace event is created as:

	field:char buf[];       offset:12;      size:0; signed:0;

The difference is that the ftrace field is defined as the rest of the size
of the event saved in the ring buffer. TRACE_EVENT() doesn't have such a
dynamic field, and its version saves a word that holds the offset into the
event where the field is stored, as well as its size.

For consistency rename the ftrace event macro to __dynamic_field(). This
way the ftrace event can also include a __dynamic_array() later that works
the same as the TRACE_EVENT() dynamic array.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.h         |  4 ++--
 kernel/trace/trace_entries.h | 10 +++++-----
 kernel/trace/trace_export.c  | 12 ++++++------
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 79be1995db44..3c733b9e7b32 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -92,8 +92,8 @@ enum trace_type {
 #undef __array_desc
 #define __array_desc(type, container, item, size)
 
-#undef __dynamic_array
-#define __dynamic_array(type, item)	type	item[];
+#undef __dynamic_field
+#define __dynamic_field(type, item)	type	item[];
 
 #undef __rel_dynamic_array
 #define __rel_dynamic_array(type, item)	type	item[];
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 4ef4df6623a8..7100d8f86011 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -63,7 +63,7 @@ FTRACE_ENTRY_REG(function, ftrace_entry,
 	F_STRUCT(
 		__field_fn(	unsigned long,		ip		)
 		__field_fn(	unsigned long,		parent_ip	)
-		__dynamic_array( unsigned long,		args		)
+		__dynamic_field( unsigned long,		args		)
 	),
 
 	F_printk(" %ps <-- %ps",
@@ -81,7 +81,7 @@ FTRACE_ENTRY(funcgraph_entry, ftrace_graph_ent_entry,
 		__field_struct(	struct ftrace_graph_ent,	graph_ent	)
 		__field_packed(	unsigned long,	graph_ent,	func		)
 		__field_packed(	unsigned int,	graph_ent,	depth		)
-		__dynamic_array(unsigned long,	args				)
+		__dynamic_field(unsigned long,	args				)
 	),
 
 	F_printk("--> %ps (%u)", (void *)__entry->func, __entry->depth)
@@ -259,7 +259,7 @@ FTRACE_ENTRY(bprint, bprint_entry,
 	F_STRUCT(
 		__field(	unsigned long,	ip	)
 		__field(	const char *,	fmt	)
-		__dynamic_array(	u32,	buf	)
+		__dynamic_field(	u32,	buf	)
 	),
 
 	F_printk("%ps: %s",
@@ -272,7 +272,7 @@ FTRACE_ENTRY_REG(print, print_entry,
 
 	F_STRUCT(
 		__field(	unsigned long,	ip	)
-		__dynamic_array(	char,	buf	)
+		__dynamic_field(	char,	buf	)
 	),
 
 	F_printk("%ps: %s",
@@ -287,7 +287,7 @@ FTRACE_ENTRY(raw_data, raw_data_entry,
 
 	F_STRUCT(
 		__field(	unsigned int,	id	)
-		__dynamic_array(	char,	buf	)
+		__dynamic_field(	char,	buf	)
 	),
 
 	F_printk("id:%04x %08x",
diff --git a/kernel/trace/trace_export.c b/kernel/trace/trace_export.c
index 1698fc22afa0..d9d41e3ba379 100644
--- a/kernel/trace/trace_export.c
+++ b/kernel/trace/trace_export.c
@@ -57,8 +57,8 @@ static int ftrace_event_register(struct trace_event_call *call,
 #undef __array_desc
 #define __array_desc(type, container, item, size)	type item[size];
 
-#undef __dynamic_array
-#define __dynamic_array(type, item)			type item[];
+#undef __dynamic_field
+#define __dynamic_field(type, item)			type item[];
 
 #undef F_STRUCT
 #define F_STRUCT(args...)				args
@@ -123,8 +123,8 @@ static void __always_unused ____ftrace_check_##name(void)		\
 #undef __array_desc
 #define __array_desc(_type, _container, _item, _len) __array(_type, _item, _len)
 
-#undef __dynamic_array
-#define __dynamic_array(_type, _item) {					\
+#undef __dynamic_field
+#define __dynamic_field(_type, _item) {					\
 	.type = #_type "[]", .name = #_item,				\
 	.size = 0, .align = __alignof__(_type),				\
 	is_signed_type(_type), .filter_type = FILTER_OTHER },
@@ -161,8 +161,8 @@ static struct trace_event_fields ftrace_event_fields_##name[] = {	\
 #undef __array_desc
 #define __array_desc(type, container, item, len)
 
-#undef __dynamic_array
-#define __dynamic_array(type, item)
+#undef __dynamic_field
+#define __dynamic_field(type, item)
 
 #undef F_printk
 #define F_printk(fmt, args...) __stringify(fmt) ", "  __stringify(args)
-- 
2.47.2




* [PATCH v5 6/9] tracing: Implement deferred user space stacktracing
  2025-04-24 19:24 [PATCH v5 0/9] tracing: Deferred unwinding of user space stack traces Steven Rostedt
                   ` (4 preceding siblings ...)
  2025-04-24 19:25 ` [PATCH v5 5/9] tracing: Rename __dynamic_array() to __dynamic_field() for ftrace events Steven Rostedt
@ 2025-04-24 19:25 ` Steven Rostedt
  2025-04-24 19:25 ` [PATCH v5 7/9] mm: Add guard for mmap_read_lock Steven Rostedt
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2025-04-24 19:25 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring

From: Steven Rostedt <rostedt@goodmis.org>

Use the unwind_deferred_*() interface to record user space stacks via
deferred unwinding. This creates two new ftrace events:

  user_unwind_cookie
  user_unwind_stack

The user_unwind_cookie event records into the ring buffer the cookie given
by unwind_deferred_request(), and the user_unwind_stack event records into
the ring buffer the user space stack as well as the cookie associated with
it.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.c         | 93 ++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.h         | 12 +++++
 kernel/trace/trace_entries.h | 24 ++++++++++
 kernel/trace/trace_export.c  | 23 +++++++++
 kernel/trace/trace_output.c  | 72 ++++++++++++++++++++++++++++
 5 files changed, 224 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 523e98cd121d..71340207321e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3077,6 +3077,66 @@ EXPORT_SYMBOL_GPL(trace_dump_stack);
 #ifdef CONFIG_USER_STACKTRACE_SUPPORT
 static DEFINE_PER_CPU(int, user_stack_count);
 
+static void trace_user_unwind_callback(struct unwind_work *unwind,
+				       struct unwind_stacktrace *trace,
+				       u64 ctx_cookie)
+{
+	struct trace_array *tr = container_of(unwind, struct trace_array, unwinder);
+	struct trace_buffer *buffer = tr->array_buffer.buffer;
+	struct userunwind_stack_entry *entry;
+	struct ring_buffer_event *event;
+	unsigned int trace_ctx;
+	unsigned long *caller;
+	unsigned int offset;
+	int len;
+	int i;
+
+	if (!(tr->trace_flags & TRACE_ITER_USERSTACKTRACE_DELAY))
+		return;
+
+	len = trace->nr * sizeof(unsigned long) + sizeof(*entry);
+
+	trace_ctx = tracing_gen_ctx();
+	event = __trace_buffer_lock_reserve(buffer, TRACE_USER_UNWIND_STACK,
+					    len, trace_ctx);
+	if (!event)
+		return;
+
+	entry	= ring_buffer_event_data(event);
+
+	entry->cookie = ctx_cookie;
+
+	offset = sizeof(*entry);
+	len = sizeof(unsigned long) * trace->nr;
+
+	entry->__data_loc_stack = offset | (len << 16);
+	caller = (void *)entry + offset;
+
+	for (i = 0; i < trace->nr; i++) {
+		caller[i] = trace->entries[i];
+	}
+
+	__buffer_unlock_commit(buffer, event);
+}
+
+static void
+ftrace_trace_userstack_delay(struct trace_array *tr,
+			     struct trace_buffer *buffer, unsigned int trace_ctx)
+{
+	struct userunwind_cookie_entry *entry;
+	struct ring_buffer_event *event;
+
+	event = __trace_buffer_lock_reserve(buffer, TRACE_USER_UNWIND_COOKIE,
+					    sizeof(*entry), trace_ctx);
+	if (!event)
+		return;
+	entry	= ring_buffer_event_data(event);
+
+	unwind_deferred_request(&tr->unwinder, &entry->cookie);
+
+	__buffer_unlock_commit(buffer, event);
+}
+
 static void
 ftrace_trace_userstack(struct trace_array *tr,
 		       struct trace_buffer *buffer, unsigned int trace_ctx)
@@ -3091,6 +3151,11 @@ ftrace_trace_userstack(struct trace_array *tr,
 	if (current->flags & PF_KTHREAD)
 		return;
 
+	if (tr->trace_flags & TRACE_ITER_USERSTACKTRACE_DELAY) {
+		ftrace_trace_userstack_delay(tr, buffer, trace_ctx);
+		return;
+	}
+
 	/*
 	 * NMIs can not handle page faults, even with fix ups.
 	 * The save user stack can (and often does) fault.
@@ -5189,6 +5254,17 @@ int trace_keep_overwrite(struct tracer *tracer, u32 mask, int set)
 	return 0;
 }
 
+static int update_unwind_deferred(struct trace_array *tr, int enabled)
+{
+	if (enabled) {
+		return unwind_deferred_init(&tr->unwinder,
+					    trace_user_unwind_callback);
+	} else {
+		unwind_deferred_cancel(&tr->unwinder);
+		return 0;
+	}
+}
+
 int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled)
 {
 	if ((mask == TRACE_ITER_RECORD_TGID) ||
@@ -5224,6 +5300,19 @@ int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled)
 		}
 	}
 
+	if (mask == TRACE_ITER_USERSTACKTRACE) {
+		if (tr->trace_flags & TRACE_ITER_USERSTACKTRACE_DELAY) {
+			int ret = update_unwind_deferred(tr, enabled);
+			if (ret < 0)
+				return ret;
+		}
+	}
+
+	if (mask == TRACE_ITER_USERSTACKTRACE_DELAY) {
+		if (tr->trace_flags & TRACE_ITER_USERSTACKTRACE)
+			update_unwind_deferred(tr, enabled);
+	}
+
 	if (enabled)
 		tr->trace_flags |= mask;
 	else
@@ -9890,6 +9979,10 @@ static int __remove_instance(struct trace_array *tr)
 	if (tr->ref > 1 || (tr->current_trace && tr->trace_ref))
 		return -EBUSY;
 
+	if ((tr->trace_flags & (TRACE_ITER_USERSTACKTRACE | TRACE_ITER_USERSTACKTRACE_DELAY)) ==
+	    (TRACE_ITER_USERSTACKTRACE | TRACE_ITER_USERSTACKTRACE_DELAY))
+		unwind_deferred_cancel(&tr->unwinder);
+
 	list_del(&tr->list);
 
 	/* Disable all the flags that were enabled coming in */
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 3c733b9e7b32..3f0941c9215c 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -8,6 +8,7 @@
 #include <linux/sched.h>
 #include <linux/clocksource.h>
 #include <linux/ring_buffer.h>
+#include <linux/unwind_deferred.h>
 #include <linux/mmiotrace.h>
 #include <linux/tracepoint.h>
 #include <linux/ftrace.h>
@@ -49,7 +50,10 @@ enum trace_type {
 	TRACE_GRAPH_ENT,
 	TRACE_GRAPH_RETADDR_ENT,
 	TRACE_USER_STACK,
+	/* trace-cmd manually adds blktrace after USER_STACK */
 	TRACE_BLK,
+	TRACE_USER_UNWIND_STACK,
+	TRACE_USER_UNWIND_COOKIE,
 	TRACE_BPUTS,
 	TRACE_HWLAT,
 	TRACE_OSNOISE,
@@ -92,6 +96,9 @@ enum trace_type {
 #undef __array_desc
 #define __array_desc(type, container, item, size)
 
+#undef __dynamic_array
+#define __dynamic_array(type, item)	u32	__data_loc_##item;
+
 #undef __dynamic_field
 #define __dynamic_field(type, item)	type	item[];
 
@@ -435,6 +442,7 @@ struct trace_array {
 	struct cond_snapshot	*cond_snapshot;
 #endif
 	struct trace_func_repeats	__percpu *last_func_repeats;
+	struct unwind_work	unwinder;
 	/*
 	 * On boot up, the ring buffer is set to the minimum size, so that
 	 * we do not waste memory on systems that are not using tracing.
@@ -526,6 +534,9 @@ extern void __ftrace_bad_type(void);
 		IF_ASSIGN(var, ent, struct ctx_switch_entry, 0);	\
 		IF_ASSIGN(var, ent, struct stack_entry, TRACE_STACK);	\
 		IF_ASSIGN(var, ent, struct userstack_entry, TRACE_USER_STACK);\
+		IF_ASSIGN(var, ent, struct userunwind_stack_entry, TRACE_USER_UNWIND_STACK);\
+		IF_ASSIGN(var, ent, struct userunwind_cookie_entry, TRACE_USER_UNWIND_COOKIE);\
+		IF_ASSIGN(var, ent, struct userstack_entry, TRACE_USER_STACK);\
 		IF_ASSIGN(var, ent, struct print_entry, TRACE_PRINT);	\
 		IF_ASSIGN(var, ent, struct bprint_entry, TRACE_BPRINT);	\
 		IF_ASSIGN(var, ent, struct bputs_entry, TRACE_BPUTS);	\
@@ -1356,6 +1367,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 		C(PRINTK,		"trace_printk"),	\
 		C(ANNOTATE,		"annotate"),		\
 		C(USERSTACKTRACE,	"userstacktrace"),	\
+		C(USERSTACKTRACE_DELAY,	"userstacktrace_delay"),\
 		C(SYM_USEROBJ,		"sym-userobj"),		\
 		C(PRINTK_MSGONLY,	"printk-msg-only"),	\
 		C(CONTEXT_INFO,		"context-info"),   /* Print pid/cpu/time */ \
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 7100d8f86011..752a99296c95 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -249,6 +249,30 @@ FTRACE_ENTRY(user_stack, userstack_entry,
 		 (void *)__entry->caller[6], (void *)__entry->caller[7])
 );
 
+FTRACE_ENTRY(user_unwind_stack, userunwind_stack_entry,
+
+	TRACE_USER_UNWIND_STACK,
+
+	F_STRUCT(
+		__field(		u64,		cookie	)
+		__dynamic_array(	unsigned long,	stack	)
+	),
+
+	F_printk("cookie=%lld\n%s", __entry->cookie,
+		 __print_dynamic_array(stack, sizeof(unsigned long)))
+);
+
+FTRACE_ENTRY(user_unwind_cookie, userunwind_cookie_entry,
+
+	TRACE_USER_UNWIND_COOKIE,
+
+	F_STRUCT(
+		__field(		u64,		cookie	)
+	),
+
+	F_printk("cookie=%lld", __entry->cookie)
+);
+
 /*
  * trace_printk entry:
  */
diff --git a/kernel/trace/trace_export.c b/kernel/trace/trace_export.c
index d9d41e3ba379..831999f84e2c 100644
--- a/kernel/trace/trace_export.c
+++ b/kernel/trace/trace_export.c
@@ -57,6 +57,9 @@ static int ftrace_event_register(struct trace_event_call *call,
 #undef __array_desc
 #define __array_desc(type, container, item, size)	type item[size];
 
+#undef __dynamic_array
+#define __dynamic_array(type, item)			u32 __data_loc_##item;
+
 #undef __dynamic_field
 #define __dynamic_field(type, item)			type item[];
 
@@ -66,6 +69,16 @@ static int ftrace_event_register(struct trace_event_call *call,
 #undef F_printk
 #define F_printk(fmt, args...) fmt, args
 
+/* Only used for ftrace event format output */
+static inline char * __print_dynamic_array(int array, size_t size)
+{
+	return NULL;
+}
+
+#undef __print_dynamic_array
+#define __print_dynamic_array(array, el_size)				\
+	__print_dynamic_array(__entry->__data_loc_##array, el_size)
+
 #undef FTRACE_ENTRY
 #define FTRACE_ENTRY(name, struct_name, id, tstruct, print)		\
 struct ____ftrace_##name {						\
@@ -74,6 +87,7 @@ struct ____ftrace_##name {						\
 static void __always_unused ____ftrace_check_##name(void)		\
 {									\
 	struct ____ftrace_##name *__entry = NULL;			\
+	struct trace_seq __maybe_unused *p = NULL;			\
 									\
 	/* force compile-time check on F_printk() */			\
 	printk(print);							\
@@ -123,6 +137,12 @@ static void __always_unused ____ftrace_check_##name(void)		\
 #undef __array_desc
 #define __array_desc(_type, _container, _item, _len) __array(_type, _item, _len)
 
+#undef __dynamic_array
+#define __dynamic_array(_type, _item) {					\
+	.type = "__data_loc " #_type "[]", .name = #_item,		\
+	.size = 4, .align = __alignof__(4),				\
+	is_signed_type(_type), .filter_type = FILTER_OTHER },
+
 #undef __dynamic_field
 #define __dynamic_field(_type, _item) {					\
 	.type = #_type "[]", .name = #_item,				\
@@ -161,6 +181,9 @@ static struct trace_event_fields ftrace_event_fields_##name[] = {	\
 #undef __array_desc
 #define __array_desc(type, container, item, len)
 
+#undef __dynamic_array
+#define __dynamic_array(type, item)
+
 #undef __dynamic_field
 #define __dynamic_field(type, item)
 
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index fee40ffbd490..e11911e5f7d0 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1374,6 +1374,58 @@ static struct trace_event trace_stack_event = {
 };
 
 /* TRACE_USER_STACK */
+static enum print_line_t trace_user_unwind_stack_print(struct trace_iterator *iter,
+						int flags, struct trace_event *event)
+{
+	struct userunwind_stack_entry *field;
+	struct trace_seq *s = &iter->seq;
+	unsigned long *caller;
+	unsigned int offset;
+	unsigned int len;
+	unsigned int caller_cnt;
+	unsigned int i;
+
+	trace_assign_type(field, iter->ent);
+
+	trace_seq_puts(s, "<user stack unwind>\n");
+
+	trace_seq_printf(s, "cookie=%llx\n", field->cookie);
+
+	/* The stack field is a dynamic pointer */
+	offset = field->__data_loc_stack;
+	len = offset >> 16;
+	offset = offset & 0xffff;
+	caller_cnt = len / sizeof(*caller);
+
+	caller = (void *)iter->ent + offset;
+
+	for (i = 0; i < caller_cnt; i++) {
+		unsigned long ip = caller[i];
+
+		if (!ip || trace_seq_has_overflowed(s))
+			break;
+
+		trace_seq_puts(s, " => ");
+		seq_print_user_ip(s, NULL, ip, flags);
+		trace_seq_putc(s, '\n');
+	}
+
+	return trace_handle_return(s);
+}
+
+static enum print_line_t trace_user_unwind_cookie_print(struct trace_iterator *iter,
+						 int flags, struct trace_event *event)
+{
+	struct userunwind_cookie_entry *field;
+	struct trace_seq *s = &iter->seq;
+
+	trace_assign_type(field, iter->ent);
+
+	trace_seq_printf(s, "cookie=%llx\n", field->cookie);
+
+	return trace_handle_return(s);
+}
+
 static enum print_line_t trace_user_stack_print(struct trace_iterator *iter,
 						int flags, struct trace_event *event)
 {
@@ -1417,6 +1469,24 @@ static enum print_line_t trace_user_stack_print(struct trace_iterator *iter,
 	return trace_handle_return(s);
 }
 
+static struct trace_event_functions trace_userunwind_stack_funcs = {
+	.trace		= trace_user_unwind_stack_print,
+};
+
+static struct trace_event trace_userunwind_stack_event = {
+	.type		= TRACE_USER_UNWIND_STACK,
+	.funcs		= &trace_userunwind_stack_funcs,
+};
+
+static struct trace_event_functions trace_userunwind_cookie_funcs = {
+	.trace		= trace_user_unwind_cookie_print,
+};
+
+static struct trace_event trace_userunwind_cookie_event = {
+	.type		= TRACE_USER_UNWIND_COOKIE,
+	.funcs		= &trace_userunwind_cookie_funcs,
+};
+
 static struct trace_event_functions trace_user_stack_funcs = {
 	.trace		= trace_user_stack_print,
 };
@@ -1816,6 +1886,8 @@ static struct trace_event *events[] __initdata = {
 	&trace_ctx_event,
 	&trace_wake_event,
 	&trace_stack_event,
+	&trace_userunwind_cookie_event,
+	&trace_userunwind_stack_event,
 	&trace_user_stack_event,
 	&trace_bputs_event,
 	&trace_bprint_event,
-- 
2.47.2




* [PATCH v5 7/9] mm: Add guard for mmap_read_lock
  2025-04-24 19:24 [PATCH v5 0/9] tracing: Deferred unwinding of user space stack traces Steven Rostedt
                   ` (5 preceding siblings ...)
  2025-04-24 19:25 ` [PATCH v5 6/9] tracing: Implement deferred user space stacktracing Steven Rostedt
@ 2025-04-24 19:25 ` Steven Rostedt
  2025-04-24 19:25 ` [PATCH v5 8/9] tracing: Have deferred user space stacktrace show file offsets Steven Rostedt
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2025-04-24 19:25 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring

From: Josh Poimboeuf <jpoimboe@kernel.org>

This is the new way of doing things.  Converting all existing
mmap_read_lock users is an exercise left for the reader ;-)
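
A hedged sketch of how the guard reads at a call site (the function and its
body here are made up for illustration); the lock is released automatically
when the scope is left, on any return path:

  #include <linux/mmap_lock.h>

  static void inspect_mm(struct mm_struct *mm)
  {
          guard(mmap_read_lock)(mm);

          /* mmap_lock is held for read for the rest of this scope */
  }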

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 include/linux/mmap_lock.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 4706c6769902..cfa5ab84054a 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -222,4 +222,6 @@ static inline int mmap_lock_is_contended(struct mm_struct *mm)
 	return rwsem_is_contended(&mm->mmap_lock);
 }
 
+DEFINE_GUARD(mmap_read_lock, struct mm_struct *, mmap_read_lock(_T), mmap_read_unlock(_T))
+
 #endif /* _LINUX_MMAP_LOCK_H */
-- 
2.47.2




* [PATCH v5 8/9] tracing: Have deferred user space stacktrace show file offsets
  2025-04-24 19:24 [PATCH v5 0/9] tracing: Deferred unwinding of user space stack traces Steven Rostedt
                   ` (6 preceding siblings ...)
  2025-04-24 19:25 ` [PATCH v5 7/9] mm: Add guard for mmap_read_lock Steven Rostedt
@ 2025-04-24 19:25 ` Steven Rostedt
  2025-04-24 19:25 ` [PATCH v5 9/9] tracing: Show inode and device major:minor in deferred user space stacktrace Steven Rostedt
  2025-04-24 19:29 ` [PATCH v5 0/9] tracing: Deferred unwinding of user space stack traces Steven Rostedt
  9 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2025-04-24 19:25 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring

From: Steven Rostedt <rostedt@goodmis.org>

Instead of showing the IP addresses of the user space stack trace, which
are wherever the code happened to be mapped in the process, show the
offsets of where they would be in the file.

Instead of:

       trace-cmd-1066    [007] .....    67.770256: <user stack unwind>
cookie=7000000000009
 =>  <00007fdbd0d421ca>
 =>  <00007fdbd0f3be27>
 =>  <00005635ece557e7>
 =>  <00005635ece559d3>
 =>  <00005635ece56523>
 =>  <00005635ece6479d>
 =>  <00005635ece64b01>
 =>  <00005635ece64bc0>
 =>  <00005635ece53b7e>
 =>  <00007fdbd0c6bca8>

These are the addresses of the functions in the virtual address space of
the process. Have it instead record:

       trace-cmd-1090    [003] .....   180.779876: <user stack unwind>
cookie=3000000000009
 =>  <00000000001001ca>
 =>  <000000000000ae27>
 =>  <00000000000107e7>
 =>  <00000000000109d3>
 =>  <0000000000011523>
 =>  <000000000001f79d>
 =>  <000000000001fb01>
 =>  <000000000001fbc0>
 =>  <000000000000eb7e>
 =>  <0000000000029ca8>

These are the offsets of the code within the file it was mapped from. To
find them, the mmap_read_lock is taken and the vma for each address is
looked up. Then what is recorded is simply:

  (addr - vma->vm_start) + (vma->vm_pgoff << PAGE_SHIFT);
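
As a worked illustration only (the vma values below are assumptions, chosen
to be consistent with the first entry of the two traces above, where
00007fdbd0d421ca becomes 00000000001001ca):

  unsigned long addr  = 0x7fdbd0d421caUL;	/* entry from the stack trace */
  unsigned long start = 0x7fdbd0c42000UL;	/* assumed vma->vm_start */
  unsigned long pgoff = 0;			/* assumed vma->vm_pgoff */

  unsigned long off = (addr - start) + (pgoff << PAGE_SHIFT);
  /* off == 0x1001ca */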

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 71340207321e..f9eb0f7d649c 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3085,18 +3085,27 @@ static void trace_user_unwind_callback(struct unwind_work *unwind,
 	struct trace_buffer *buffer = tr->array_buffer.buffer;
 	struct userunwind_stack_entry *entry;
 	struct ring_buffer_event *event;
+	struct mm_struct *mm = current->mm;
 	unsigned int trace_ctx;
+	struct vm_area_struct *vma = NULL;
 	unsigned long *caller;
 	unsigned int offset;
 	int len;
 	int i;
 
+	/* This should never happen */
+	if (!mm)
+		return;
+
 	if (!(tr->trace_flags & TRACE_ITER_USERSTACKTRACE_DELAY))
 		return;
 
 	len = trace->nr * sizeof(unsigned long) + sizeof(*entry);
 
 	trace_ctx = tracing_gen_ctx();
+
+	guard(mmap_read_lock)(mm);
+
 	event = __trace_buffer_lock_reserve(buffer, TRACE_USER_UNWIND_STACK,
 					    len, trace_ctx);
 	if (!event)
@@ -3113,7 +3122,16 @@ static void trace_user_unwind_callback(struct unwind_work *unwind,
 	caller = (void *)entry + offset;
 
 	for (i = 0; i < trace->nr; i++) {
-		caller[i] = trace->entries[i];
+		unsigned long addr = trace->entries[i];
+
+		if (!vma || addr < vma->vm_start || addr >= vma->vm_end)
+			vma = vma_lookup(mm, addr);
+
+		if (!vma) {
+			caller[i] = addr;
+			continue;
+		}
+		caller[i] = (addr - vma->vm_start) + (vma->vm_pgoff << PAGE_SHIFT);
 	}
 
 	__buffer_unlock_commit(buffer, event);
-- 
2.47.2




* [PATCH v5 9/9] tracing: Show inode and device major:minor in deferred user space stacktrace
  2025-04-24 19:24 [PATCH v5 0/9] tracing: Deferred unwinding of user space stack traces Steven Rostedt
                   ` (7 preceding siblings ...)
  2025-04-24 19:25 ` [PATCH v5 8/9] tracing: Have deferred user space stacktrace show file offsets Steven Rostedt
@ 2025-04-24 19:25 ` Steven Rostedt
  2025-04-24 19:29 ` [PATCH v5 0/9] tracing: Deferred unwinding of user space stack traces Steven Rostedt
  9 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2025-04-24 19:25 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring

From: Steven Rostedt <rostedt@goodmis.org>

The deferred user space stacktrace event already looks up the vma for each
address in the trace to get the file offset of that address, so it can also
report the file itself.

Add two more arrays to the user space stacktrace event: one for the inode
number and the other for the device major:minor number. Now the output
looks like this:

       trace-cmd-1108    [007] .....   240.253487: <user stack unwind>
cookie=7000000000009
 =>  <00000000001001ca> : 1340007 : 254:3
 =>  <000000000000ae27> : 1308548 : 254:3
 =>  <00000000000107e7> : 1440347 : 254:3
 =>  <00000000000109d3> : 1440347 : 254:3
 =>  <0000000000011523> : 1440347 : 254:3
 =>  <000000000001f79d> : 1440347 : 254:3
 =>  <000000000001fb01> : 1440347 : 254:3
 =>  <000000000001fbc0> : 1440347 : 254:3
 =>  <000000000000eb7e> : 1440347 : 254:3
 =>  <0000000000029ca8> : 1340007 : 254:3

User space tooling can use this information to get the actual functions
from the files.
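
As an illustration only (not part of this series), a post-processing tool
could match the recorded inode and major:minor numbers against a set of
candidate files with something like the sketch below; the helper name and
its arguments are made up for the example:

  #include <sys/types.h>
  #include <sys/stat.h>
  #include <sys/sysmacros.h>
  #include <stddef.h>

  /* Return the candidate file whose device and inode match a record. */
  static const char *match_file(const char *const *files, int nr_files,
                                unsigned int maj, unsigned int mnr,
                                unsigned long ino)
  {
          struct stat st;

          for (int i = 0; i < nr_files; i++) {
                  if (stat(files[i], &st) < 0)
                          continue;
                  if (major(st.st_dev) == maj && minor(st.st_dev) == mnr &&
                      st.st_ino == ino)
                          return files[i];
          }
          return NULL;
  }

The file offsets recorded by the previous patch can then be resolved
against that file, e.g. via its ELF program headers and symbol table.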

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.c         | 25 ++++++++++++++++++++++++-
 kernel/trace/trace_entries.h |  8 ++++++--
 kernel/trace/trace_output.c  | 27 +++++++++++++++++++++++++++
 3 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index f9eb0f7d649c..cb0255106b7f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3089,6 +3089,8 @@ static void trace_user_unwind_callback(struct unwind_work *unwind,
 	unsigned int trace_ctx;
 	struct vm_area_struct *vma = NULL;
 	unsigned long *caller;
+	unsigned long *inodes;
+	unsigned int *devs;
 	unsigned int offset;
 	int len;
 	int i;
@@ -3100,7 +3102,8 @@ static void trace_user_unwind_callback(struct unwind_work *unwind,
 	if (!(tr->trace_flags & TRACE_ITER_USERSTACKTRACE_DELAY))
 		return;
 
-	len = trace->nr * sizeof(unsigned long) + sizeof(*entry);
+	len = trace->nr * (sizeof(unsigned long) * 2 + sizeof(unsigned int))
+			   + sizeof(*entry);
 
 	trace_ctx = tracing_gen_ctx();
 
@@ -3121,6 +3124,15 @@ static void trace_user_unwind_callback(struct unwind_work *unwind,
 	entry->__data_loc_stack = offset | (len << 16);
 	caller = (void *)entry + offset;
 
+	offset += len;
+	entry->__data_loc_inodes = offset | (len << 16);
+	inodes = (void *)entry + offset;
+
+	offset += len;
+	len = sizeof(unsigned int) * trace->nr;
+	entry->__data_loc_dev = offset | (len << 16);
+	devs = (void *)entry + offset;
+
 	for (i = 0; i < trace->nr; i++) {
 		unsigned long addr = trace->entries[i];
 
@@ -3129,9 +3141,20 @@ static void trace_user_unwind_callback(struct unwind_work *unwind,
 
 		if (!vma) {
 			caller[i] = addr;
+			inodes[i] = 0;
+			devs[i] = 0;
 			continue;
 		}
+
 		caller[i] = (addr - vma->vm_start) + (vma->vm_pgoff << PAGE_SHIFT);
+
+		if (vma->vm_file && vma->vm_file->f_inode) {
+			inodes[i] = vma->vm_file->f_inode->i_ino;
+			devs[i] = vma->vm_file->f_inode->i_sb->s_dev;
+		} else {
+			inodes[i] = 0;
+			devs[i] = 0;
+		}
 	}
 
 	__buffer_unlock_commit(buffer, event);
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 752a99296c95..f2dc09405128 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -256,10 +256,14 @@ FTRACE_ENTRY(user_unwind_stack, userunwind_stack_entry,
 	F_STRUCT(
 		__field(		u64,		cookie	)
 		__dynamic_array(	unsigned long,	stack	)
+		__dynamic_array(	unsigned long,	inodes	)
+		__dynamic_array(	unsigned int,	dev	)
 	),
 
-	F_printk("cookie=%lld\n%s", __entry->cookie,
-		 __print_dynamic_array(stack, sizeof(unsigned long)))
+	F_printk("cookie=%lld\n%s%s%s", __entry->cookie,
+		 __print_dynamic_array(stack, sizeof(unsigned long)),
+		 __print_dynamic_array(inodes, sizeof(unsigned long)),
+		 __print_dynamic_array(dev, sizeof(unsigned long)))
 );
 
 FTRACE_ENTRY(user_unwind_cookie, userunwind_cookie_entry,
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index e11911e5f7d0..4bdbc6c48cdb 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1380,9 +1380,13 @@ static enum print_line_t trace_user_unwind_stack_print(struct trace_iterator *it
 	struct userunwind_stack_entry *field;
 	struct trace_seq *s = &iter->seq;
 	unsigned long *caller;
+	unsigned long *inodes;
+	unsigned int *devs;
 	unsigned int offset;
 	unsigned int len;
 	unsigned int caller_cnt;
+	unsigned int inode_cnt;
+	unsigned int dev_cnt;
 	unsigned int i;
 
 	trace_assign_type(field, iter->ent);
@@ -1399,6 +1403,21 @@ static enum print_line_t trace_user_unwind_stack_print(struct trace_iterator *it
 
 	caller = (void *)iter->ent + offset;
 
+	/* The inodes and devices are also dynamic pointers */
+	offset = field->__data_loc_inodes;
+	len = offset >> 16;
+	offset = offset & 0xffff;
+	inode_cnt = len / sizeof(*inodes);
+
+	inodes = (void *)iter->ent + offset;
+
+	offset = field->__data_loc_dev;
+	len = offset >> 16;
+	offset = offset & 0xffff;
+	dev_cnt = len / sizeof(*devs);
+
+	devs = (void *)iter->ent + offset;
+
 	for (i = 0; i < caller_cnt; i++) {
 		unsigned long ip = caller[i];
 
@@ -1407,6 +1426,14 @@ static enum print_line_t trace_user_unwind_stack_print(struct trace_iterator *it
 
 		trace_seq_puts(s, " => ");
 		seq_print_user_ip(s, NULL, ip, flags);
+
+		if (i < inode_cnt) {
+			trace_seq_printf(s, " : %ld", inodes[i]);
+			if (i < dev_cnt) {
+				trace_seq_printf(s, " : %d:%d",
+						 MAJOR(devs[i]), MINOR(devs[i]));
+			}
+		}
 		trace_seq_putc(s, '\n');
 	}
 
-- 
2.47.2




* Re: [PATCH v5 0/9] tracing: Deferred unwinding of user space stack traces
  2025-04-24 19:24 [PATCH v5 0/9] tracing: Deferred unwinding of user space stack traces Steven Rostedt
                   ` (8 preceding siblings ...)
  2025-04-24 19:25 ` [PATCH v5 9/9] tracing: Show inode and device major:minor in deferred user space stacktrace Steven Rostedt
@ 2025-04-24 19:29 ` Steven Rostedt
  9 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2025-04-24 19:29 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring

On Thu, 24 Apr 2025 15:24:56 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> This series focuses on implementing the deferred unwinding for ftrace
> (and LTTng could use it).

I forgot to say that this is based on top of this series, and does not yet
include sframe support.

  https://lore.kernel.org/all/20250424162529.686762589@goodmis.org/

-- Steve


* Re: [PATCH v5 3/9] unwind deferred: Use bitmask to determine which callbacks to call
  2025-04-24 19:24 ` [PATCH v5 3/9] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
@ 2025-04-28 16:33   ` Mathieu Desnoyers
  2025-04-28 16:56     ` Steven Rostedt
  0 siblings, 1 reply; 16+ messages in thread
From: Mathieu Desnoyers @ 2025-04-28 16:33 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Andrew Morton, Josh Poimboeuf,
	x86, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Indu Bhagat, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Ian Rogers, Adrian Hunter, linux-perf-users, Mark Brown,
	linux-toolchains, Jordan Rome, Sam James, Andrii Nakryiko,
	Jens Remus, Florian Weimer, Andy Lutomirski, Weinan Liu,
	Blake Jones, Beau Belgrave, Jose E. Marchesi, Alexander Aring

On 2025-04-24 15:24, Steven Rostedt wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
> 
> In order to know which registered callback requested a stacktrace for when
> the task goes back to user space, add a bitmask for all registered
> tracers. The bitmask is the size of log, which means that on a 32 bit

size of long

> machine, it can have at most 32 registered tracers, and on 64 bit, it can
> have at most 64 registered tracers. This should not be an issue as there
> should not be more than 10 (unless BPF can abuse this?).
> 
> When a tracer registers with unwind_deferred_init() it will get a bit
> number assigned to it. When a tracer requests a stacktrace, it will have
> its bit set within the task_struct. When the task returns back to user
> space, it will call the callbacks for all the registered tracers where
> their bits are set in the task's mask.
> 
> When a tracer is removed by the unwind_deferred_cancel() all current tasks
> will clear the associated bit, just in case another tracer gets registered
> immediately afterward and then gets their callback called unexpectedly.
> 
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
>   include/linux/sched.h           |  1 +
>   include/linux/unwind_deferred.h |  1 +
>   kernel/unwind/deferred.c        | 44 ++++++++++++++++++++++++++++++---
>   3 files changed, 42 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a1e1c07cadfb..d3ee0c5405d6 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1649,6 +1649,7 @@ struct task_struct {
>   
>   #ifdef CONFIG_UNWIND_USER
>   	struct unwind_task_info		unwind_info;
> +	unsigned long			unwind_mask;
>   #endif
>   
>   	/* CPU-specific state of this task: */
> diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
> index d36784cae658..719a7cfb3164 100644
> --- a/include/linux/unwind_deferred.h
> +++ b/include/linux/unwind_deferred.h
> @@ -13,6 +13,7 @@ typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stackt
>   struct unwind_work {
>   	struct list_head		list;
>   	unwind_callback_t		func;
> +	int				bit;

int or unsigned int ?

Rename "bit" to "requester_id" ?

>   };
>   
>   #ifdef CONFIG_UNWIND_USER
> diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
> index 2afd197da2ef..f505cb1766de 100644
> --- a/kernel/unwind/deferred.c
> +++ b/kernel/unwind/deferred.c
> @@ -26,6 +26,7 @@ static DEFINE_PER_CPU(u64, unwind_ctx_ctr);
>   /* Guards adding to and reading the list of callbacks */
>   static DEFINE_MUTEX(callback_mutex);
>   static LIST_HEAD(callbacks);
> +static unsigned long unwind_mask;

Perhaps "reserved_unwind_mask" ?

>   
>   /*
>    * The context cookie is a unique identifier that is assigned to a user
> @@ -135,6 +136,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
>   	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
>   	struct unwind_stacktrace trace;
>   	struct unwind_work *work;
> +	struct task_struct *task = current;
>   	u64 cookie;
>   
>   	if (WARN_ON_ONCE(!info->pending))
> @@ -156,7 +158,10 @@ static void unwind_deferred_task_work(struct callback_head *head)
>   
>   	guard(mutex)(&callback_mutex);
>   	list_for_each_entry(work, &callbacks, list) {
> -		work->func(work, &trace, cookie);
> +		if (task->unwind_mask & (1UL << work->bit)) {
> +			work->func(work, &trace, cookie);
> +			clear_bit(work->bit, &current->unwind_mask);
> +		}

You could change this list of callbacks for an array of pointers,
indexed by "requester_id".

Then you can do a for each bit on task->unwind_mask, and all bits
that match end up calling the callback for the matching array index.
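
Roughly something like this (untested sketch; "unwind_works" is a
hypothetical array replacing the callbacks list):

        unsigned int bit;

        for_each_set_bit(bit, &task->unwind_mask, BITS_PER_LONG) {
                struct unwind_work *work = unwind_works[bit];

                if (work)
                        work->func(work, &trace, cookie);
                clear_bit(bit, &task->unwind_mask);
        }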

>   	}
>   	barrier();
>   	/* If another task work is pending, reuse the cookie and stack trace */
> @@ -194,9 +199,12 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *cookie)
>   		*cookie = info->nmi_cookie;
>   	}
>   
> -	if (info->pending)
> +	if (current->unwind_mask & (1UL << work->bit))
>   		return 0;
>   
> +	if (info->pending)
> +		goto out;
> +
>   	ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
>   	if (ret) {
>   		if (inited_cookie)
> @@ -205,6 +213,8 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *cookie)
>   	}
>   
>   	info->pending = 1;
> + out:
> +	set_bit(work->bit, &current->unwind_mask);
>   
>   	return 0;
>   }
> @@ -244,14 +254,18 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
>   
>   	*cookie = get_cookie(info);
>   
> +	/* This is already queued */
> +	if (current->unwind_mask & (1UL << work->bit))
> +		return 0;
> +
>   	/* callback already pending? */
>   	pending = READ_ONCE(info->pending);
>   	if (pending)
> -		return 0;
> +		goto out;
>   
>   	/* Claim the work unless an NMI just now swooped in to do so. */
>   	if (!try_cmpxchg(&info->pending, &pending, 1))

Not that it necessarily matters performance wise here, but can this be a
try_cmpxchg_local if we're working on the task struct and only expecting
interruption from NMIs ?

> -		return 0;
> +		goto out;
>   
>   	/* The work has been claimed, now schedule it. */
>   	ret = task_work_add(current, &info->work, TWA_RESUME);
> @@ -260,16 +274,29 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
>   		return ret;
>   	}
>   
> + out:
> +	set_bit(work->bit, &current->unwind_mask);
> +
>   	return 0;
>   }
>   
>   void unwind_deferred_cancel(struct unwind_work *work)
>   {
> +	struct task_struct *g, *t;
> +
>   	if (!work)
>   		return;
>   
>   	guard(mutex)(&callback_mutex);
>   	list_del(&work->list);
> +
> +	clear_bit(work->bit, &unwind_mask);
> +
> +	guard(rcu)();
> +	/* Clear this bit from all threads */
> +	for_each_process_thread(g, t) {
> +		clear_bit(work->bit, &t->unwind_mask);
> +	}

It is enough to guard with RCU ? See syscall_regfunc() from
tracepoint.c where we do:

                 read_lock(&tasklist_lock);
                 for_each_process_thread(p, t) {
                         set_task_syscall_work(t, SYSCALL_TRACEPOINT);
                 }
                 read_unlock(&tasklist_lock);

To prevent concurrent fork from adding threads while we
iterate, thus opening the possibility of missing a clear
due to a concurrent fork + set bit.

Thanks,

Mathieu

>   }
>   
>   int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
> @@ -277,6 +304,14 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
>   	memset(work, 0, sizeof(*work));
>   
>   	guard(mutex)(&callback_mutex);
> +
> +	/* See if there's a bit in the mask available */
> +	if (unwind_mask == ~0UL)
> +		return -EBUSY;
> +
> +	work->bit = ffz(unwind_mask);
> +	unwind_mask |= 1UL << work->bit;
> +
>   	list_add(&work->list, &callbacks);
>   	work->func = func;
>   	return 0;
> @@ -288,6 +323,7 @@ void unwind_task_init(struct task_struct *task)
>   
>   	memset(info, 0, sizeof(*info));
>   	init_task_work(&info->work, unwind_deferred_task_work);
> +	task->unwind_mask = 0;
>   }
>   
>   void unwind_task_free(struct task_struct *task)


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


* Re: [PATCH v5 3/9] unwind deferred: Use bitmask to determine which callbacks to call
  2025-04-28 16:33   ` Mathieu Desnoyers
@ 2025-04-28 16:56     ` Steven Rostedt
  2025-04-28 18:00       ` Mathieu Desnoyers
  0 siblings, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2025-04-28 16:56 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Andrew Morton, Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring

On Mon, 28 Apr 2025 12:33:50 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> On 2025-04-24 15:24, Steven Rostedt wrote:
> > From: Steven Rostedt <rostedt@goodmis.org>
> > 
> > In order to know which registered callback requested a stacktrace for when
> > the task goes back to user space, add a bitmask for all registered
> > tracers. The bitmask is the size of log, which means that on a 32 bit  
> 
> size of long
> 

Sure


> > --- a/include/linux/unwind_deferred.h
> > +++ b/include/linux/unwind_deferred.h
> > @@ -13,6 +13,7 @@ typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stackt
> >   struct unwind_work {
> >   	struct list_head		list;
> >   	unwind_callback_t		func;
> > +	int				bit;  
> 
> int or unsigned int ?
> 
> Rename "bit" to "requester_id" ?

Perhaps just "id", as this is only internal and shouldn't be touched.

> 
> >   };
> >   
> >   #ifdef CONFIG_UNWIND_USER
> > diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
> > index 2afd197da2ef..f505cb1766de 100644
> > --- a/kernel/unwind/deferred.c
> > +++ b/kernel/unwind/deferred.c
> > @@ -26,6 +26,7 @@ static DEFINE_PER_CPU(u64, unwind_ctx_ctr);
> >   /* Guards adding to and reading the list of callbacks */
> >   static DEFINE_MUTEX(callback_mutex);
> >   static LIST_HEAD(callbacks);
> > +static unsigned long unwind_mask;  
> 
> Perhaps "reserved_unwind_mask" ?

Sure.

> 
> >   
> >   /*
> >    * The context cookie is a unique identifier that is assigned to a user
> > @@ -135,6 +136,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
> >   	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
> >   	struct unwind_stacktrace trace;
> >   	struct unwind_work *work;
> > +	struct task_struct *task = current;
> >   	u64 cookie;
> >   
> >   	if (WARN_ON_ONCE(!info->pending))
> > @@ -156,7 +158,10 @@ static void unwind_deferred_task_work(struct callback_head *head)
> >   
> >   	guard(mutex)(&callback_mutex);
> >   	list_for_each_entry(work, &callbacks, list) {
> > -		work->func(work, &trace, cookie);
> > +		if (task->unwind_mask & (1UL << work->bit)) {
> > +			work->func(work, &trace, cookie);
> > +			clear_bit(work->bit, &current->unwind_mask);
> > +		}  
> 
> You could change this list of callbacks for an array of pointers,
> indexed by "requester_id".
> 
> Then you can do a for each bit on task->unwind_mask, and all bits
> that match end up calling the callback for the matching array index.

Yeah, I thought of this, but that's just an optimization, and something I
probably will add as a separate patch.

> > @@ -244,14 +254,18 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
> >   
> >   	*cookie = get_cookie(info);
> >   
> > +	/* This is already queued */
> > +	if (current->unwind_mask & (1UL << work->bit))
> > +		return 0;
> > +
> >   	/* callback already pending? */
> >   	pending = READ_ONCE(info->pending);
> >   	if (pending)
> > -		return 0;
> > +		goto out;
> >   
> >   	/* Claim the work unless an NMI just now swooped in to do so. */
> >   	if (!try_cmpxchg(&info->pending, &pending, 1))  
> 
> Not that it necessarily matters performance wise here, but can this be a
> try_cmpxchg_local if we're working on the task struct and only expecting
> interruption from NMIs ?

Hmm, sure.

> 
> > -		return 0;
> > +		goto out;
> >   
> >   	/* The work has been claimed, now schedule it. */
> >   	ret = task_work_add(current, &info->work, TWA_RESUME);
> > @@ -260,16 +274,29 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
> >   		return ret;
> >   	}
> >   
> > + out:
> > +	set_bit(work->bit, &current->unwind_mask);
> > +
> >   	return 0;
> >   }
> >   
> >   void unwind_deferred_cancel(struct unwind_work *work)
> >   {
> > +	struct task_struct *g, *t;
> > +
> >   	if (!work)
> >   		return;
> >   
> >   	guard(mutex)(&callback_mutex);
> >   	list_del(&work->list);
> > +
> > +	clear_bit(work->bit, &unwind_mask);
> > +
> > +	guard(rcu)();
> > +	/* Clear this bit from all threads */
> > +	for_each_process_thread(g, t) {
> > +		clear_bit(work->bit, &t->unwind_mask);
> > +	}  
> 
> It is enough to guard with RCU ? See syscall_regfunc() from
> tracepoint.c where we do:
> 
>                  read_lock(&tasklist_lock);
>                  for_each_process_thread(p, t) {
>                          set_task_syscall_work(t, SYSCALL_TRACEPOINT);
>                  }
>                  read_unlock(&tasklist_lock);
> 
> To prevent concurrent fork from adding threads while we
> iterate, thus opening the possibility of missing a clear
> due to a concurrent fork + set bit.

A set_bit would only happen if the callback was live and accepting new
callback requests. It's a bug for a tracer to call unwind_deferred_cancel()
and then call unwind_deferred_request() (which would set the bit). We could
possibly set the tracer's unwind descriptor id to -1, and do a
WARN_ON_ONCE() in unwind_deferred_request() if the tracer's id is negative.
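
E.g. (rough sketch, error value just for illustration):

        /* unwind_deferred_cancel() */
        work->bit = -1;

        /* unwind_deferred_request() */
        if (WARN_ON_ONCE(work->bit < 0))
                return -EINVAL;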

The loop is called under the callback_mutex, where no new tracer could
register and be assigned that bit.

The RCU lock is just to make sure the current tasks that existed when the
loop started don't disappear before the loop ends.

-- Steve



> 
> Thanks,
> 
> Mathieu
> 
> >   }
> >   
> >   int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
> > @@ -277,6 +304,14 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
> >   	memset(work, 0, sizeof(*work));
> >   
> >   	guard(mutex)(&callback_mutex);
> > +
> > +	/* See if there's a bit in the mask available */
> > +	if (unwind_mask == ~0UL)
> > +		return -EBUSY;
> > +
> > +	work->bit = ffz(unwind_mask);
> > +	unwind_mask |= 1UL << work->bit;
> > +
> >   	list_add(&work->list, &callbacks);
> >   	work->func = func;
> >   	return 0;
> > @@ -288,6 +323,7 @@ void unwind_task_init(struct task_struct *task)
> >   
> >   	memset(info, 0, sizeof(*info));
> >   	init_task_work(&info->work, unwind_deferred_task_work);
> > +	task->unwind_mask = 0;
> >   }
> >   
> >   void unwind_task_free(struct task_struct *task)  
> 
> 



* Re: [PATCH v5 3/9] unwind deferred: Use bitmask to determine which callbacks to call
  2025-04-28 16:56     ` Steven Rostedt
@ 2025-04-28 18:00       ` Mathieu Desnoyers
  2025-04-28 18:12         ` Steven Rostedt
  0 siblings, 1 reply; 16+ messages in thread
From: Mathieu Desnoyers @ 2025-04-28 18:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Andrew Morton, Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring

On 2025-04-28 12:56, Steven Rostedt wrote:
> On Mon, 28 Apr 2025 12:33:50 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> On 2025-04-24 15:24, Steven Rostedt wrote:
>>> From: Steven Rostedt <rostedt@goodmis.org>
>>>
>>> In order to know which registered callback requested a stacktrace for when
>>> the task goes back to user space, add a bitmask for all registered
>>> tracers. The bitmask is the size of log, which means that on a 32 bit
>>
>> size of long
>>
> 
> Sure
> 
> 
>>> --- a/include/linux/unwind_deferred.h
>>> +++ b/include/linux/unwind_deferred.h
>>> @@ -13,6 +13,7 @@ typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stackt
>>>    struct unwind_work {
>>>    	struct list_head		list;
>>>    	unwind_callback_t		func;
>>> +	int				bit;
>>
>> int or unsigned int ?
>>
>> Rename "bit" to "requester_id" ?
> 
> Perhaps just "id", as this is only internal and shouldn't be touched.
> 
>>
>>>    };
>>>    
>>>    #ifdef CONFIG_UNWIND_USER
>>> diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
>>> index 2afd197da2ef..f505cb1766de 100644
>>> --- a/kernel/unwind/deferred.c
>>> +++ b/kernel/unwind/deferred.c
>>> @@ -26,6 +26,7 @@ static DEFINE_PER_CPU(u64, unwind_ctx_ctr);
>>>    /* Guards adding to and reading the list of callbacks */
>>>    static DEFINE_MUTEX(callback_mutex);
>>>    static LIST_HEAD(callbacks);
>>> +static unsigned long unwind_mask;
>>
>> Perhaps "reserved_unwind_mask" ?
> 
> Sure.
> 
>>
>>>    
>>>    /*
>>>     * The context cookie is a unique identifier that is assigned to a user
>>> @@ -135,6 +136,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
>>>    	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
>>>    	struct unwind_stacktrace trace;
>>>    	struct unwind_work *work;
>>> +	struct task_struct *task = current;
>>>    	u64 cookie;
>>>    
>>>    	if (WARN_ON_ONCE(!info->pending))
>>> @@ -156,7 +158,10 @@ static void unwind_deferred_task_work(struct callback_head *head)
>>>    
>>>    	guard(mutex)(&callback_mutex);
>>>    	list_for_each_entry(work, &callbacks, list) {
>>> -		work->func(work, &trace, cookie);
>>> +		if (task->unwind_mask & (1UL << work->bit)) {
>>> +			work->func(work, &trace, cookie);
>>> +			clear_bit(work->bit, &current->unwind_mask);
>>> +		}
>>
>> You could change this list of callbacks for an array of pointers,
>> indexed by "requester_id".
>>
>> Then you can do a for each bit on task->unwind_mask, and all bits
>> that match end up calling the callback for the matching array index.
> 
> Yeah, I thought of this, but that's just an optimization, and something I
> probably will add as a separate patch.
> 
>>> @@ -244,14 +254,18 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
>>>    
>>>    	*cookie = get_cookie(info);
>>>    
>>> +	/* This is already queued */
>>> +	if (current->unwind_mask & (1UL << work->bit))
>>> +		return 0;
>>> +
>>>    	/* callback already pending? */
>>>    	pending = READ_ONCE(info->pending);
>>>    	if (pending)
>>> -		return 0;
>>> +		goto out;
>>>    
>>>    	/* Claim the work unless an NMI just now swooped in to do so. */
>>>    	if (!try_cmpxchg(&info->pending, &pending, 1))
>>
>> Not that it necessarily matters performance wise here, but can this be a
>> try_cmpxchg_local if we're working on the task struct and only expecting
>> interruption from NMIs ?
> 
> Hmm, sure.
> 
>>
>>> -		return 0;
>>> +		goto out;
>>>    
>>>    	/* The work has been claimed, now schedule it. */
>>>    	ret = task_work_add(current, &info->work, TWA_RESUME);
>>> @@ -260,16 +274,29 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
>>>    		return ret;
>>>    	}
>>>    
>>> + out:
>>> +	set_bit(work->bit, &current->unwind_mask);
>>> +
>>>    	return 0;
>>>    }
>>>    
>>>    void unwind_deferred_cancel(struct unwind_work *work)
>>>    {
>>> +	struct task_struct *g, *t;
>>> +
>>>    	if (!work)
>>>    		return;
>>>    
>>>    	guard(mutex)(&callback_mutex);
>>>    	list_del(&work->list);
>>> +
>>> +	clear_bit(work->bit, &unwind_mask);
>>> +
>>> +	guard(rcu)();
>>> +	/* Clear this bit from all threads */
>>> +	for_each_process_thread(g, t) {
>>> +		clear_bit(work->bit, &t->unwind_mask);
>>> +	}
>>
>> It is enough to guard with RCU ? See syscall_regfunc() from
>> tracepoint.c where we do:
>>
>>                   read_lock(&tasklist_lock);
>>                   for_each_process_thread(p, t) {
>>                           set_task_syscall_work(t, SYSCALL_TRACEPOINT);
>>                   }
>>                   read_unlock(&tasklist_lock);
>>
>> To prevent concurrent fork from adding threads while we
>> iterate, thus opening the possibility of missing a clear
>> due to a concurrent fork + set bit.
> 
> A set_bit only would happen if the callback was live and accepting new
> callback requests. It's a bug for a tracer to call unwind_deferred_cancel()
> and then call unwind_deferred_request() (which would set the bit). We could
> possibly set the tracer's unwind descriptor id to -1, and do an
> WARN_ON_ONCE() in unwind_deferred_request() if the tracer's id is negative.
> 
> The loop is called under the callback_mutex, where no new tracer could
> register and be assigned that bit.

Ah, that's the piece I missed. The callback_mutex prevents reallocation
of the ID by unwind_deferred_init while iterating on the tasks.

One more comment: if we change the linked list for an array (or make the 
linked list an RCU list), can we remove the callback_mutex from
unwind_deferred_task_work by turning it into an RCU read-side ?

Then we just need to wait for a grace period before returning from
unwind_deferred_cancel, which then allows the caller to reclaim "work".

Taking the callback_mutex in unwind_deferred_task_work will end up being
the single thing that does a lot of cache line bouncing across CPUs when
hit heavily by tracers.
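
i.e. roughly (untested sketch of just the list handling; registration
would use list_add_rcu()):

        /* unwind_deferred_task_work() */
        guard(rcu)();
        list_for_each_entry_rcu(work, &callbacks, list) {
                if (task->unwind_mask & (1UL << work->bit)) {
                        work->func(work, &trace, cookie);
                        clear_bit(work->bit, &task->unwind_mask);
                }
        }

        /* unwind_deferred_cancel() */
        scoped_guard(mutex, &callback_mutex)
                list_del_rcu(&work->list);
        synchronize_rcu();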

Thanks,

Mathieu

> 
> The RCU lock is just to make sure the current tasks that existed when the
> loop started doesn't disappear before the loop ends.
> 
> -- Steve
> 
> 
> 
>>
>> Thanks,
>>
>> Mathieu
>>
>>>    }
>>>    
>>>    int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
>>> @@ -277,6 +304,14 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
>>>    	memset(work, 0, sizeof(*work));
>>>    
>>>    	guard(mutex)(&callback_mutex);
>>> +
>>> +	/* See if there's a bit in the mask available */
>>> +	if (unwind_mask == ~0UL)
>>> +		return -EBUSY;
>>> +
>>> +	work->bit = ffz(unwind_mask);
>>> +	unwind_mask |= 1UL << work->bit;
>>> +
>>>    	list_add(&work->list, &callbacks);
>>>    	work->func = func;
>>>    	return 0;
>>> @@ -288,6 +323,7 @@ void unwind_task_init(struct task_struct *task)
>>>    
>>>    	memset(info, 0, sizeof(*info));
>>>    	init_task_work(&info->work, unwind_deferred_task_work);
>>> +	task->unwind_mask = 0;
>>>    }
>>>    
>>>    void unwind_task_free(struct task_struct *task)
>>
>>
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


* Re: [PATCH v5 3/9] unwind deferred: Use bitmask to determine which callbacks to call
  2025-04-28 18:00       ` Mathieu Desnoyers
@ 2025-04-28 18:12         ` Steven Rostedt
  2025-04-28 18:13           ` Mathieu Desnoyers
  0 siblings, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2025-04-28 18:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Andrew Morton, Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring

On Mon, 28 Apr 2025 14:00:07 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> >>
> >> It is enough to guard with RCU ? See syscall_regfunc() from
> >> tracepoint.c where we do:
> >>
> >>                   read_lock(&tasklist_lock);
> >>                   for_each_process_thread(p, t) {
> >>                           set_task_syscall_work(t, SYSCALL_TRACEPOINT);
> >>                   }
> >>                   read_unlock(&tasklist_lock);
> >>
> >> To prevent concurrent fork from adding threads while we
> >> iterate, thus opening the possibility of missing a clear
> >> due to a concurrent fork + set bit.  
> > 
> > A set_bit only would happen if the callback was live and accepting new
> > callback requests. It's a bug for a tracer to call unwind_deferred_cancel()
> > and then call unwind_deferred_request() (which would set the bit). We could
> > possibly set the tracer's unwind descriptor id to -1, and do an
> > WARN_ON_ONCE() in unwind_deferred_request() if the tracer's id is negative.
> > 
> > The loop is called under the callback_mutex, where no new tracer could
> > register and be assigned that bit.  
> 
> Ah, that's the piece I missed. The callback_mutex prevents reallocation
> of the ID by unwind_deferred_init while iterating on the tasks.
> 
> One more comment: if we change the linked list for an array (or make the 
> linked list an RCU list), can we remove the callback_mutex from
> unwind_deferred_task_work by turning it into an RCU read-side ?
> 
> Then we just need to wait for a grace period before returning from
> unwind_deferred_cancel, which then allows the caller to reclaim "work".
> 
> Taking the callback_mutex in unwind_deferred_task_work will end up being
> the single thing that does a lot of cache line bouncing across CPUs when
> hit heavily by tracers.

I'm not against this, but again, that's an optimization. I want to keep the
initial code simple. And then add the more complex optimizations when this
is stable.

-- Steve


* Re: [PATCH v5 3/9] unwind deferred: Use bitmask to determine which callbacks to call
  2025-04-28 18:12         ` Steven Rostedt
@ 2025-04-28 18:13           ` Mathieu Desnoyers
  0 siblings, 0 replies; 16+ messages in thread
From: Mathieu Desnoyers @ 2025-04-28 18:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu, Mark Rutland,
	Andrew Morton, Josh Poimboeuf, x86, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Indu Bhagat, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains, Jordan Rome,
	Sam James, Andrii Nakryiko, Jens Remus, Florian Weimer,
	Andy Lutomirski, Weinan Liu, Blake Jones, Beau Belgrave,
	Jose E. Marchesi, Alexander Aring

On 2025-04-28 14:12, Steven Rostedt wrote:
> On Mon, 28 Apr 2025 14:00:07 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>>>>
>>>> It is enough to guard with RCU ? See syscall_regfunc() from
>>>> tracepoint.c where we do:
>>>>
>>>>                    read_lock(&tasklist_lock);
>>>>                    for_each_process_thread(p, t) {
>>>>                            set_task_syscall_work(t, SYSCALL_TRACEPOINT);
>>>>                    }
>>>>                    read_unlock(&tasklist_lock);
>>>>
>>>> To prevent concurrent fork from adding threads while we
>>>> iterate, thus opening the possibility of missing a clear
>>>> due to a concurrent fork + set bit.
>>>
>>> A set_bit only would happen if the callback was live and accepting new
>>> callback requests. It's a bug for a tracer to call unwind_deferred_cancel()
>>> and then call unwind_deferred_request() (which would set the bit). We could
>>> possibly set the tracer's unwind descriptor id to -1, and do an
>>> WARN_ON_ONCE() in unwind_deferred_request() if the tracer's id is negative.
>>>
>>> The loop is called under the callback_mutex, where no new tracer could
>>> register and be assigned that bit.
>>
>> Ah, that's the piece I missed. The callback_mutex prevents reallocation
>> of the ID by unwind_deferred_init while iterating on the tasks.
>>
>> One more comment: if we change the linked list for an array (or make the
>> linked list an RCU list), can we remove the callback_mutex from
>> unwind_deferred_task_work by turning it into an RCU read-side ?
>>
>> Then we just need to wait for a grace period before returning from
>> unwind_deferred_cancel, which then allows the caller to reclaim "work".
>>
>> Taking the callback_mutex in unwind_deferred_task_work will end up being
>> the single thing that does a lot of cache line bouncing across CPUs when
>> hit heavily by tracers.
> 
> I'm not against this, but again, that's an optimization. I want to keep the
> initial code simple. And then add the more complex optimizations when this
> is stable.

That works for me,

Thanks!

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



Thread overview: 16+ messages
2025-04-24 19:24 [PATCH v5 0/9] tracing: Deferred unwinding of user space stack traces Steven Rostedt
2025-04-24 19:24 ` [PATCH v5 1/9] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
2025-04-24 19:24 ` [PATCH v5 2/9] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
2025-04-24 19:24 ` [PATCH v5 3/9] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
2025-04-28 16:33   ` Mathieu Desnoyers
2025-04-28 16:56     ` Steven Rostedt
2025-04-28 18:00       ` Mathieu Desnoyers
2025-04-28 18:12         ` Steven Rostedt
2025-04-28 18:13           ` Mathieu Desnoyers
2025-04-24 19:25 ` [PATCH v5 4/9] tracing: Do not bother getting user space stacktraces for kernel threads Steven Rostedt
2025-04-24 19:25 ` [PATCH v5 5/9] tracing: Rename __dynamic_array() to __dynamic_field() for ftrace events Steven Rostedt
2025-04-24 19:25 ` [PATCH v5 6/9] tracing: Implement deferred user space stacktracing Steven Rostedt
2025-04-24 19:25 ` [PATCH v5 7/9] mm: Add guard for mmap_read_lock Steven Rostedt
2025-04-24 19:25 ` [PATCH v5 8/9] tracing: Have deferred user space stacktrace show file offsets Steven Rostedt
2025-04-24 19:25 ` [PATCH v5 9/9] tracing: Show inode and device major:minor in deferred user space stacktrace Steven Rostedt
2025-04-24 19:29 ` [PATCH v5 0/9] tracing: Deferred unwinding of user space stack traces Steven Rostedt
