linux-kernel.vger.kernel.org archive mirror
* [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure
@ 2025-06-25 22:56 Steven Rostedt
  2025-06-25 22:56 ` [PATCH v11 01/14] unwind_user: Add user space unwinding API Steven Rostedt
                   ` (14 more replies)
  0 siblings, 15 replies; 33+ messages in thread
From: Steven Rostedt @ 2025-06-25 22:56 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
	Jens Axboe

This is the first patch series of a set that will make it possible to use
SFrames[1] in the Linux kernel. A quick recap of the motivation for
doing this.

Currently the only way to get a user space stack trace from a stack
walk (and not just copying a large amount of the user stack into the kernel
ring buffer) is to use frame pointers. This has a few issues. The biggest
one is that compiling frame pointers into every application and library
has been shown to cause performance overhead.

Another issue is that the format of the frames may not always be consistent
between different compilers, and some architectures (s390) have no defined
format for doing a reliable stack walk. The only way to perform user space
profiling on those architectures is to copy the user stack into the kernel
buffer.

SFrame is now supported in GNU binutils and soon will also be supported
by LLVM. SFrame acts more like ORC, and lives in the ELF executable
file as its own section. Like ORC it has two tables. The first table is
sorted by instruction pointer (IP); looking up the current IP in it leads
to an entry in the second table, which describes where the return address
of the current function is located. That return address can then be looked
up in the first table to find the return address of the caller, and so on.
This performs a user space stack walk.
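The two-table lookup described above can be sketched roughly as below. This
is an illustrative model only, not the actual SFrame on-disk format (the
real FDE/FRE encoding is considerably more compact); all struct and
function names here are hypothetical:

```c
/* Hypothetical, simplified model of an SFrame-style two-table lookup.
 * The real SFrame FDE/FRE tables are more involved; names are illustrative. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ip_entry {		/* first table: sorted by function start IP */
	uint64_t start_ip;
	int info_idx;		/* index into the second table */
};

struct unwind_info {		/* second table: how to find CFA and RA */
	int32_t cfa_off;	/* CFA = SP + cfa_off */
	int32_t ra_off;		/* RA lives at CFA + ra_off */
};

/* Binary search for the last entry whose start_ip <= ip. */
static const struct unwind_info *lookup(const struct ip_entry *tbl, int n,
					const struct unwind_info *info,
					uint64_t ip)
{
	int lo = 0, hi = n - 1, found = -1;

	while (lo <= hi) {
		int mid = (lo + hi) / 2;

		if (tbl[mid].start_ip <= ip) {
			found = mid;
			lo = mid + 1;
		} else {
			hi = mid - 1;
		}
	}
	return found < 0 ? NULL : &info[tbl[found].info_idx];
}
```

The walk then repeats: compute the CFA from the current SP, read the return
address at CFA + ra_off, and feed that address back into lookup() for the
caller's frame.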

Now, because the SFrame section lives in the ELF file, it needs to be faulted
into memory when it is used. This means that walking the user space stack
requires being in a faultable context. As profilers like perf request a stack
trace in interrupt or NMI context, the walk cannot be done at the time it is
requested. Instead it must be deferred until it is safe to fault in user
space. One place this is known to be safe is when the task is about to return
back to user space.

Josh originally wrote the proof of concept for this code; the last version
he posted was back in January:

   https://lore.kernel.org/all/cover.1737511963.git.jpoimboe@kernel.org/

That series contained everything: adding new faultable user space
stack walking code, deferring the stack walk, implementing SFrames,
fixing up x86 (VDSO), and even both the kernel and user space sides
of perf to make it work. But Josh ran out of time to work on it and
I picked it up. As there are several parts to this series, I also broke
it out, especially since parts of his series do not depend
on each other.

This series contains only the core infrastructure that all the rest needs.
Of the 14 patches, only 3 are x86 specific. The rest is simply the unwinding
code that s390 can build against. I also moved the 3 x86 specific patches to
the end of the series.

Since multiple tracers (like perf, ftrace, bpf, etc) can attach to the
deferred unwinder, and each of these tracers can attach to some or all
of the tasks to trace, there is a many-to-many relationship. This relationship
needs to be established in interrupt or NMI context, so it cannot rely on any
allocation. To handle this, a bitmask is used. There is a global bitmask the
size of a long, in which a single bit is allocated when a tracer registers for
deferred stack traces. The task struct also has a bitmask; when a request
comes in from one of the tracers for a deferred stack trace, the bit
corresponding to that tracer is set in the task's mask. As one of the bits
represents that a request has been made, at most 31 tracers on 32 bit
systems or 63 on 64 bit systems may be registered at a given time.
This should not be an issue as only one perf application, or ftrace instance
should request a bit. BPF should also use only one bit and handle any
multiplexing for its users.
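The bit-allocation scheme above can be modeled in a few lines. This is a
simplified user space sketch, not the kernel code (which uses ffz() and
__set_bit() on unwind_mask under a mutex); PENDING_BIT, tracer_register()
and task_request() are hypothetical names:

```c
/* Illustrative model of the tracer-bit scheme: one bit in a global mask
 * per registered tracer, mirrored into a per-task request mask. */
#include <assert.h>

#define PENDING_BIT	0	/* reserved: "a request is pending" */

static unsigned long tracer_mask;	/* global: which bits are taken */

/* Allocate the first free bit, skipping the reserved pending bit.
 * At most BITS_PER_LONG - 1 tracers can register at once. */
static int tracer_register(void)
{
	for (int bit = PENDING_BIT + 1; bit < (int)(8 * sizeof(long)); bit++) {
		if (!(tracer_mask & (1UL << bit))) {
			tracer_mask |= 1UL << bit;
			return bit;
		}
	}
	return -1;	/* all tracer bits in use */
}

/* A request marks both the pending bit and the requesting tracer's bit. */
static void task_request(unsigned long *task_mask, int bit)
{
	*task_mask |= (1UL << PENDING_BIT) | (1UL << bit);
}
```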

When the first request for a deferred stack trace is made for a task, a
timestamp is taken. This timestamp is used as the identifier for the user
space stack trace. As the user space stack trace does not change while the
task is in the kernel, requests that come in after the first request and
before the task goes back to user space will get the same timestamp. The
timestamp also serves the purpose of knowing how far back a given user space
stack trace goes. If events are dropped, and the dropped events span a task
entering user space and coming back into the kernel, the new stack trace
taken on the next return to user space must not be associated with the events
from before the drop happened.
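The "first requester installs the timestamp" rule can be sketched with a
compare-exchange, mirroring the local64_try_cmpxchg() used in this series.
This is a user space sketch with C11 atomics; get_or_set_timestamp() is a
hypothetical name standing in for the kernel's assign_timestamp() path:

```c
/* Sketch: only the first caller (seeing 0) installs its timestamp;
 * anyone racing with it gets the winner's value back, so every
 * request in the same kernel entry sees the same identifier. */
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t task_timestamp;	/* 0 = not yet set */

static uint64_t get_or_set_timestamp(uint64_t now)
{
	uint64_t old = 0;

	/* CAS fails if someone else already set it; 'old' then holds
	 * the winner's timestamp, which we return instead of ours. */
	if (!atomic_compare_exchange_strong(&task_timestamp, &old, now))
		return old;
	return now;
}
```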

When a tracer makes a request, it gets this timestamp, and the bit for the
requesting tracer is set in the task's bitmask. A task_work is used to have
the task run the callbacks before it goes back to user space. When it does,
it scans its bitmask and calls the callbacks of all the tracers whose bit is
set. Each callback receives the user space stack trace as well as the
timestamp that was used.
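The callback pass on exit to user space can be sketched as below. This is a
simplification of process_unwind_deferred(): the real code walks an
SRCU-protected list of unwind_work entries, whereas the flat callback table
and the names here are hypothetical:

```c
/* Sketch: scan the task's request bitmask and invoke the callback of
 * every tracer whose bit is set, handing each the same timestamp. */
#include <assert.h>
#include <stdint.h>

#define MAX_TRACERS	(8 * sizeof(unsigned long))

typedef void (*unwind_cb)(int bit, uint64_t timestamp);

static unwind_cb callbacks[MAX_TRACERS];	/* indexed by tracer bit */

static void run_deferred_callbacks(unsigned long bits, uint64_t timestamp)
{
	for (unsigned int bit = 0; bit < MAX_TRACERS; bit++) {
		if ((bits & (1UL << bit)) && callbacks[bit])
			callbacks[bit](bit, timestamp);
	}
}
```

In the real code the trace itself is passed along too, and the bitmask is
snapshotted and cleared before the scan so a new request can come in while
the callbacks run.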

That's the basic idea. Obviously there's more to it than the above
explanation, but each patch explains what it is doing, and it is broken up
step by step.

I run two SFrame meetings once a month (one in an Asia-friendly timezone and
the other in a Europe-friendly one). We have developers from Google, Oracle,
Red Hat, IBM, EfficiOS, Meta, Microsoft, and more that attend. (If anyone is
interested in attending, let me know.) I have been running these since
December of 2024. Last year at GNU Cauldron, a few of us got together to
discuss the design and such. We are pretty confident that the current design
is sound. We have working code on top of this and have been testing it.

Since the s390 folks want to start working on this (they already have SFrame
patches from working on the prototypes), I would like this series
to be a separate branch based on top of v6.16-rc2. Then all the subsystems
that want to work on top of this can do so, as there's no real dependency
between them.

I have more patches on top of this series that add perf support, ftrace
support, SFrame support and the x86 fixups (for VDSO). Each of those
patch series can be worked on independently, but they all depend on this
series (although the x86 specific patches at the end aren't necessarily
needed, at least for other architectures).

Please review, and if you are happy with them, let's get them in a branch
that we all can use. I'm happy to take it in my tree if I can get acks on the
x86 code. Or it can be in the tip tree as a separate branch on top of 6.16-rc2
(or rc3 if someone finds some issues), and I'll just base my work on top of
that. Doesn't matter either way.

Thanks!

-- Steve


[1] https://sourceware.org/binutils/wiki/sframe

Changes since v10: https://lore.kernel.org/linux-trace-kernel/20250611005421.144238328@goodmis.org/

[ Full diff between v11 and 10 at end of this message. ]

- Rename "the_end" label to "done" in unwind_user_next() to be more consistent
  to what the kernel uses. (Peter Zijlstra)

- Add a comment about what cfa and ra stand for. (Peter Zijlstra)

- Rename compat_state() to compat_fp_state() for consistency. (Peter Zijlstra)

- Lowercase macro UNWIND_GET_USER_LONG() to unwind_get_user_long()
  (Peter Zijlstra and Linus Torvalds)

- Move the functions arch_unwind_user_init() and arch_unwind_user_next() to
  the later patches when they are used. (Peter Zijlstra).

- Updated the change log of ("unwind_user/deferred: Add
  unwind_user_faultable()")

- Removed extra tab in Makefile (Peter Zijlstra)

- Renamed unwind_deferred_trace() to unwind_user_faultable()

- Rename unwind_exit_to_user_mode() to unwind_reset_info() as it is called
  in a noinstr section and the current name looks like it does
  instrumentation (Peter Zijlstra)

- Make the allocated cache fit inside a 4K page. (Peter Zijlstra)

- Added comment stating that the clock used for the timestamp must have a
  resolution that guarantees that two system calls back to back will have
  different timestamps.

- Update change log to mention that each perf program and ftrace instance
  will register with the deferred unwinder. (Peter Zijlstra).

- Reworked to simply use a 64bit cmpxchg to update the timestamp.

- Switch timestamp to local64_t type and pending to local_t type.

- Use __clear_bit() and __set_bit() consistently with the global variable
  unwind_mask.  (Peter Zijlstra)

- Use clear_bit() and set_bit() consistently with the task unwind_mask,
  as it can race with NMIs.

- Use "bit" that was acquired by READ_ONCE() in test_and_set_bit() in
  unwind_deferred_request() instead of reading work->bit again.


Josh Poimboeuf (8):
      unwind_user: Add user space unwinding API
      unwind_user: Add frame pointer support
      unwind_user: Add compat mode frame pointer support
      unwind_user/deferred: Add unwind cache
      unwind_user/deferred: Add deferred unwinding interface
      unwind_user/x86: Enable frame pointer unwinding on x86
      perf/x86: Rename and move get_segment_base() and make it global
      unwind_user/x86: Enable compat mode frame pointer unwinding on x86

Steven Rostedt (6):
      unwind_user/deferred: Add unwind_user_faultable()
      unwind_user/deferred: Make unwind deferral requests NMI-safe
      unwind deferred: Use bitmask to determine which callbacks to call
      unwind deferred: Use SRCU unwind_deferred_task_work()
      unwind: Clear unwind_mask on exit back to user space
      unwind: Finish up unwind when a task exits

----
 MAINTAINERS                              |   8 +
 arch/Kconfig                             |  11 +
 arch/x86/Kconfig                         |   2 +
 arch/x86/events/core.c                   |  44 +---
 arch/x86/include/asm/ptrace.h            |   2 +
 arch/x86/include/asm/unwind_user.h       |  60 ++++++
 arch/x86/include/asm/unwind_user_types.h |  17 ++
 arch/x86/kernel/ptrace.c                 |  38 ++++
 include/asm-generic/Kbuild               |   2 +
 include/asm-generic/unwind_user.h        |   5 +
 include/asm-generic/unwind_user_types.h  |   5 +
 include/linux/entry-common.h             |   2 +
 include/linux/sched.h                    |   5 +
 include/linux/unwind_deferred.h          |  76 +++++++
 include/linux/unwind_deferred_types.h    |  20 ++
 include/linux/unwind_user.h              |  45 ++++
 include/linux/unwind_user_types.h        |  39 ++++
 kernel/Makefile                          |   1 +
 kernel/exit.c                            |   2 +
 kernel/fork.c                            |   4 +
 kernel/unwind/Makefile                   |   1 +
 kernel/unwind/deferred.c                 | 354 +++++++++++++++++++++++++++++++
 kernel/unwind/user.c                     | 130 ++++++++++++
 23 files changed, 834 insertions(+), 39 deletions(-)
 create mode 100644 arch/x86/include/asm/unwind_user.h
 create mode 100644 arch/x86/include/asm/unwind_user_types.h
 create mode 100644 include/asm-generic/unwind_user.h
 create mode 100644 include/asm-generic/unwind_user_types.h
 create mode 100644 include/linux/unwind_deferred.h
 create mode 100644 include/linux/unwind_deferred_types.h
 create mode 100644 include/linux/unwind_user.h
 create mode 100644 include/linux/unwind_user_types.h
 create mode 100644 kernel/unwind/Makefile
 create mode 100644 kernel/unwind/deferred.c
 create mode 100644 kernel/unwind/user.c

Diff between v10 and v11:


diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 6e850c9d3f0c..8908b8eeb99b 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -363,7 +363,7 @@ static __always_inline void exit_to_user_mode(void)
 	lockdep_hardirqs_on_prepare();
 	instrumentation_end();
 
-	unwind_exit_to_user_mode();
+	unwind_reset_info();
 	user_enter_irqoff();
 	arch_exit_to_user_mode();
 	lockdep_hardirqs_on(CALLER_ADDR0);
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index bf0cc0477b2e..5cfd09a8708f 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -29,7 +29,7 @@ enum {
 void unwind_task_init(struct task_struct *task);
 void unwind_task_free(struct task_struct *task);
 
-int unwind_deferred_trace(struct unwind_stacktrace *trace);
+int unwind_user_faultable(struct unwind_stacktrace *trace);
 
 int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
 int unwind_deferred_request(struct unwind_work *work, u64 *timestamp);
@@ -37,7 +37,7 @@ void unwind_deferred_cancel(struct unwind_work *work);
 
 void unwind_deferred_task_exit(struct task_struct *task);
 
-static __always_inline void unwind_exit_to_user_mode(void)
+static __always_inline void unwind_reset_info(void)
 {
 	struct unwind_task_info *info = &current->unwind_info;
 	unsigned long bits;
@@ -55,7 +55,7 @@ static __always_inline void unwind_exit_to_user_mode(void)
 
 	if (likely(info->cache))
 		info->cache->nr_entries = 0;
-	info->timestamp = 0;
+	local64_set(&current->unwind_info.timestamp, 0);
 }
 
 #else /* !CONFIG_UNWIND_USER */
@@ -63,12 +63,13 @@ static __always_inline void unwind_exit_to_user_mode(void)
 static inline void unwind_task_init(struct task_struct *task) {}
 static inline void unwind_task_free(struct task_struct *task) {}
 
-static inline int unwind_deferred_trace(struct unwind_stacktrace *trace) { return -ENOSYS; }
+static inline int unwind_user_faultable(struct unwind_stacktrace *trace) { return -ENOSYS; }
 static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func) { return -ENOSYS; }
 static inline int unwind_deferred_request(struct unwind_work *work, u64 *timestamp) { return -ENOSYS; }
 static inline void unwind_deferred_cancel(struct unwind_work *work) {}
+
 static inline void unwind_deferred_task_exit(struct task_struct *task) {}
-static inline void unwind_exit_to_user_mode(void) {}
+static inline void unwind_reset_info(void) {}
 
 #endif /* !CONFIG_UNWIND_USER */
 
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index f384e7f45783..4308367f1887 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -2,6 +2,9 @@
 #ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
 #define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
 
+#include <asm/local64.h>
+#include <asm/local.h>
+
 struct unwind_cache {
 	unsigned int		nr_entries;
 	unsigned long		entries[];
@@ -11,8 +14,7 @@ struct unwind_task_info {
 	struct unwind_cache	*cache;
 	struct callback_head	work;
 	unsigned long		unwind_mask;
-	u64			timestamp;
-	u64			nmi_timestamp;
+	local64_t		timestamp;
 };
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index c70da8f7e54c..46f995cda606 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -14,10 +14,22 @@
  #define in_compat_mode(regs) false
 #endif
 
+/*
+ * If an architecture needs to initialize the state for a specific
+ * reason, for example, it may need to do something different
+ * in compat mode, it can define arch_unwind_user_init to a
+ * function that will perform this initialization.
+ */
 #ifndef arch_unwind_user_init
 static inline void arch_unwind_user_init(struct unwind_user_state *state, struct pt_regs *reg) {}
 #endif
 
+/*
+ * If an architecture requires some more updates to the state between
+ * stack frames, it can define arch_unwind_user_next to a function
+ * that will update the state between reading stack frames during
+ * the user space stack walk.
+ */
 #ifndef arch_unwind_user_next
 static inline void arch_unwind_user_next(struct unwind_user_state *state) {}
 #endif
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
index 6752ac96d7e2..eae37bea54fd 100644
--- a/kernel/unwind/Makefile
+++ b/kernel/unwind/Makefile
@@ -1 +1 @@
- obj-$(CONFIG_UNWIND_USER)		+= user.o deferred.o
+ obj-$(CONFIG_UNWIND_USER)	+= user.o deferred.o
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 6c95f484568e..c783d273a2dc 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -8,10 +8,42 @@
 #include <linux/task_work.h>
 #include <linux/kernel.h>
 #include <linux/sched.h>
+#include <linux/sizes.h>
 #include <linux/slab.h>
 #include <linux/mm.h>
 
-#define UNWIND_MAX_ENTRIES 512
+/*
+ * For requesting a deferred user space stack trace from NMI context
+ * the architecture must support a 64bit safe cmpxchg in NMI context.
+ * For those architectures that do not have that, then it cannot ask
+ * for a deferred user space stack trace from an NMI context. If it
+ * does, then it will get -EINVAL.
+ */
+#if defined(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) && \
+	!defined(CONFIG_GENERIC_ATOMIC64)
+# define CAN_USE_IN_NMI		1
+static inline u64 assign_timestamp(struct unwind_task_info *info,
+				   u64 timestamp)
+{
+	u64 old = 0;
+	if (!local64_try_cmpxchg(&info->timestamp, &old, timestamp))
+		timestamp = old;
+	return timestamp;
+}
+#else
+# define CAN_USE_IN_NMI		0
+static inline u64 assign_timestamp(struct unwind_task_info *info,
+				   u64 timestamp)
+{
+	/* For archs that do not allow NMI here */
+	local64_set(&info->timestamp, timestamp);
+	return timestamp;
+}
+#endif
+
+/* Make the cache fit in a 4K page */
+#define UNWIND_MAX_ENTRIES					\
+	((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
 
 /* Guards adding to or removing from the list of callbacks */
 static DEFINE_MUTEX(callback_mutex);
@@ -27,9 +59,16 @@ static inline bool unwind_pending(struct unwind_task_info *info)
 /*
  * Read the task context timestamp, if this is the first caller then
  * it will set the timestamp.
+ *
+ * For this to work properly, the timestamp (local_clock()) must
+ * have a resolution that will guarantee a different timestamp
+ * everytime a task makes a system call. That is, two short
+ * system calls back to back must have a different timestamp.
  */
 static u64 get_timestamp(struct unwind_task_info *info)
 {
+	u64 timestamp;
+
 	lockdep_assert_irqs_disabled();
 
 	/*
@@ -38,27 +77,15 @@ static u64 get_timestamp(struct unwind_task_info *info)
 	 * this request and it means that this request will be
 	 * valid for the stracktrace.
 	 */
-	if (!info->timestamp) {
-		WRITE_ONCE(info->timestamp, local_clock());
-		barrier();
-		/*
-		 * If an NMI came in and set a timestamp, it means that
-		 * it happened before this timestamp was set (otherwise
-		 * the NMI would have used this one). Use the NMI timestamp
-		 * instead.
-		 */
-		if (unlikely(info->nmi_timestamp)) {
-			WRITE_ONCE(info->timestamp, info->nmi_timestamp);
-			barrier();
-			WRITE_ONCE(info->nmi_timestamp, 0);
-		}
-	}
+	timestamp = local64_read(&info->timestamp);
+	if (timestamp)
+		return timestamp;
 
-	return info->timestamp;
+	return assign_timestamp(info, local_clock());
 }
 
 /**
- * unwind_deferred_trace - Produce a user stacktrace in faultable context
+ * unwind_user_faultable - Produce a user stacktrace in faultable context
  * @trace: The descriptor that will store the user stacktrace
  *
  * This must be called in a known faultable context (usually when entering
@@ -69,7 +96,7 @@ static u64 get_timestamp(struct unwind_task_info *info)
  * Return: 0 on success and negative on error
  *         On success @trace will contain the user space stacktrace
  */
-int unwind_deferred_trace(struct unwind_stacktrace *trace)
+int unwind_user_faultable(struct unwind_stacktrace *trace)
 {
 	struct unwind_task_info *info = &current->unwind_info;
 	struct unwind_cache *cache;
@@ -131,21 +158,14 @@ static void process_unwind_deferred(struct task_struct *task)
 	trace.nr = 0;
 	trace.entries = NULL;
 
-	unwind_deferred_trace(&trace);
-
-	/* Check if the timestamp was only set by NMI */
-	if (info->nmi_timestamp) {
-		WRITE_ONCE(info->timestamp, info->nmi_timestamp);
-		barrier();
-		WRITE_ONCE(info->nmi_timestamp, 0);
-	}
+	unwind_user_faultable(&trace);
 
-	timestamp = info->timestamp;
+	timestamp = local64_read(&info->timestamp);
 
 	idx = srcu_read_lock(&unwind_srcu);
 	list_for_each_entry_srcu(work, &callbacks, list,
 				 srcu_read_lock_held(&unwind_srcu)) {
-		if (bits & BIT(work->bit))
+		if (test_bit(work->bit, &bits))
 			work->func(work, &trace, timestamp);
 	}
 	srcu_read_unlock(&unwind_srcu, idx);
@@ -168,62 +188,6 @@ void unwind_deferred_task_exit(struct task_struct *task)
 	task_work_cancel(task, &info->work);
 }
 
-static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
-{
-	struct unwind_task_info *info = &current->unwind_info;
-	bool inited_timestamp = false;
-	int ret;
-
-	/* Always use the nmi_timestamp first */
-	*timestamp = info->nmi_timestamp ? : info->timestamp;
-
-	if (!*timestamp) {
-		/*
-		 * This is the first unwind request since the most recent entry
-		 * from user space. Initialize the task timestamp.
-		 *
-		 * Don't write to info->timestamp directly, otherwise it may race
-		 * with an interruption of get_timestamp().
-		 */
-		info->nmi_timestamp = local_clock();
-		*timestamp = info->nmi_timestamp;
-		inited_timestamp = true;
-	}
-
-	/* Is this already queued */
-	if (info->unwind_mask & BIT(work->bit)) {
-		return unwind_pending(info) ? UNWIND_ALREADY_PENDING :
-			UNWIND_ALREADY_EXECUTED;
-	}
-
-	if (unwind_pending(info))
-		goto out;
-
-	ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
-	if (ret < 0) {
-		/*
-		 * If this set nmi_timestamp and is not using it,
-		 * there's no guarantee that it will be used.
-		 * Set it back to zero.
-		 */
-		if (inited_timestamp)
-			info->nmi_timestamp = 0;
-		return ret;
-	}
-
-	/*
-	 * This is the first to set the PENDING_BIT, clear all others
-	 * as any other bit has already had their callback called, and
-	 * those callbacks should not be called again because of this
-	 * new callback. If they request another callback, then they
-	 * will get a new one.
-	 */
-	info->unwind_mask = UNWIND_PENDING;
-out:
-	return test_and_set_bit(work->bit, &info->unwind_mask) ?
-		UNWIND_ALREADY_PENDING : 0;
-}
-
 /**
  * unwind_deferred_request - Request a user stacktrace on task exit
  * @work: Unwind descriptor requesting the trace
@@ -238,6 +202,9 @@ static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
  * Its value will also be passed to the callback function.  It can be
  * used to stitch kernel and user stack traces together in post-processing.
  *
+ * Note, the architecture must have a local_clock() implementation that
+ * guarantees a different timestamp per task systemcall.
+ *
  * It's valid to call this function multiple times for the same @work within
  * the same task entry context.  Each call will return the same timestamp
  * while the task hasn't left the kernel. If the callback is not pending because
@@ -264,8 +231,9 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 	    !user_mode(task_pt_regs(current)))
 		return -EINVAL;
 
-	if (in_nmi())
-		return unwind_deferred_request_nmi(work, timestamp);
+	/* NMI requires having safe 64 bit cmpxchg operations */
+	if (!CAN_USE_IN_NMI && in_nmi())
+		return -EINVAL;
 
 	/* Do not allow cancelled works to request again */
 	bit = READ_ONCE(work->bit);
@@ -279,7 +247,7 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 	old = READ_ONCE(info->unwind_mask);
 
 	/* Is this already queued */
-	if (old & BIT(bit)) {
+	if (test_bit(bit, &old)) {
 		/*
 		 * If pending is not set, it means this work's callback
 		 * was already called.
@@ -305,8 +273,12 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 	 * the pending bit as well as cleared the other bits. Just
 	 * jump to setting the bit for this work.
 	 */
-	if (!try_cmpxchg(&info->unwind_mask, &old, bits))
-		goto out;
+	if (CAN_USE_IN_NMI) {
+		if (!try_cmpxchg(&info->unwind_mask, &old, bits))
+			goto out;
+	} else {
+		info->unwind_mask = bits;
+	}
 
 	/* The work has been claimed, now schedule it. */
 	ret = task_work_add(current, &info->work, TWA_RESUME);
@@ -315,9 +287,8 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 		WRITE_ONCE(info->unwind_mask, 0);
 
 	return ret;
-
  out:
-	return test_and_set_bit(work->bit, &info->unwind_mask) ?
+	return test_and_set_bit(bit, &info->unwind_mask) ?
 		UNWIND_ALREADY_PENDING : 0;
 }
 
@@ -336,7 +307,7 @@ void unwind_deferred_cancel(struct unwind_work *work)
 	/* Do not allow any more requests and prevent callbacks */
 	work->bit = -1;
 
-	clear_bit(bit, &unwind_mask);
+	__clear_bit(bit, &unwind_mask);
 
 	synchronize_srcu(&unwind_srcu);
 
@@ -358,7 +329,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 		return -EBUSY;
 
 	work->bit = ffz(unwind_mask);
-	unwind_mask |= BIT(work->bit);
+	__set_bit(work->bit, &unwind_mask);
 
 	list_add_rcu(&work->list, &callbacks);
 	work->func = func;
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 29e1f497a26e..2bb7995c3f23 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -22,16 +22,16 @@ static inline bool fp_state(struct unwind_user_state *state)
 	       state->type == UNWIND_USER_TYPE_FP;
 }
 
-static inline bool compat_state(struct unwind_user_state *state)
+static inline bool compat_fp_state(struct unwind_user_state *state)
 {
 	return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) &&
 	       state->type == UNWIND_USER_TYPE_COMPAT_FP;
 }
 
-#define UNWIND_GET_USER_LONG(to, from, state)				\
+#define unwind_get_user_long(to, from, state)				\
 ({									\
 	int __ret;							\
-	if (compat_state(state))					\
+	if (compat_fp_state(state))					\
 		__ret = get_user(to, (u32 __user *)(from));		\
 	else								\
 		__ret = get_user(to, (unsigned long __user *)(from));	\
@@ -46,24 +46,26 @@ int unwind_user_next(struct unwind_user_state *state)
 	if (state->done)
 		return -EINVAL;
 
-	if (compat_state(state))
+	if (compat_fp_state(state))
 		frame = &compat_fp_frame;
 	else if (fp_state(state))
 		frame = &fp_frame;
 	else
-		goto the_end;
+		goto done;
 
+	/* Get the Canonical Frame Address (CFA) */
 	cfa = (frame->use_fp ? state->fp : state->sp) + frame->cfa_off;
 
 	/* stack going in wrong direction? */
 	if (cfa <= state->sp)
-		goto the_end;
+		goto done;
 
-	if (UNWIND_GET_USER_LONG(ra, cfa + frame->ra_off, state))
-		goto the_end;
+	/* Find the Return Address (RA) */
+	if (unwind_get_user_long(ra, cfa + frame->ra_off, state))
+		goto done;
 
-	if (frame->fp_off && UNWIND_GET_USER_LONG(fp, cfa + frame->fp_off, state))
-		goto the_end;
+	if (frame->fp_off && unwind_get_user_long(fp, cfa + frame->fp_off, state))
+		goto done;
 
 	state->ip = ra;
 	state->sp = cfa;
@@ -74,7 +76,7 @@ int unwind_user_next(struct unwind_user_state *state)
 
 	return 0;
 
-the_end:
+done:
 	state->done = true;
 	return -EINVAL;
 }


* [PATCH v11 01/14] unwind_user: Add user space unwinding API
  2025-06-25 22:56 [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
@ 2025-06-25 22:56 ` Steven Rostedt
  2025-06-25 22:56 ` [PATCH v11 02/14] unwind_user: Add frame pointer support Steven Rostedt
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Steven Rostedt @ 2025-06-25 22:56 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
	Jens Axboe

From: Josh Poimboeuf <jpoimboe@kernel.org>

Introduce a generic API for unwinding user stacks.

In order to expand user space unwinding to be able to handle more complex
scenarios, such as deferred unwinding and reading user space information,
create a generic interface that can be used by all architectures that
support the various unwinding methods.

This is an alternative to the simple stack_trace_save_user() API for
handling user space stack traces. It does not replace that interface, but
it will be used to expand the functionality of user space stack walking.

None of the structures introduced will be exposed to user space tooling.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 MAINTAINERS                       |  8 +++++
 arch/Kconfig                      |  3 ++
 include/linux/unwind_user.h       | 15 +++++++++
 include/linux/unwind_user_types.h | 31 +++++++++++++++++
 kernel/Makefile                   |  1 +
 kernel/unwind/Makefile            |  1 +
 kernel/unwind/user.c              | 55 +++++++++++++++++++++++++++++++
 7 files changed, 114 insertions(+)
 create mode 100644 include/linux/unwind_user.h
 create mode 100644 include/linux/unwind_user_types.h
 create mode 100644 kernel/unwind/Makefile
 create mode 100644 kernel/unwind/user.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 0c1d245bf7b8..a0676218545b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -25887,6 +25887,14 @@ F:	Documentation/driver-api/uio-howto.rst
 F:	drivers/uio/
 F:	include/linux/uio_driver.h
 
+USERSPACE STACK UNWINDING
+M:	Josh Poimboeuf <jpoimboe@kernel.org>
+M:	Steven Rostedt <rostedt@goodmis.org>
+S:	Maintained
+F:	include/linux/unwind*.h
+F:	kernel/unwind/
+
+
 UTIL-LINUX PACKAGE
 M:	Karel Zak <kzak@redhat.com>
 L:	util-linux@vger.kernel.org
diff --git a/arch/Kconfig b/arch/Kconfig
index a3308a220f86..ea59e5d7cc69 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -435,6 +435,9 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
 	  It uses the same command line parameters, and sysctl interface,
 	  as the generic hardlockup detectors.
 
+config UNWIND_USER
+	bool
+
 config HAVE_PERF_REGS
 	bool
 	help
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
new file mode 100644
index 000000000000..aa7923c1384f
--- /dev/null
+++ b/include/linux/unwind_user.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_H
+#define _LINUX_UNWIND_USER_H
+
+#include <linux/unwind_user_types.h>
+
+int unwind_user_start(struct unwind_user_state *state);
+int unwind_user_next(struct unwind_user_state *state);
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
+
+#define for_each_user_frame(state) \
+	for (unwind_user_start((state)); !(state)->done; unwind_user_next((state)))
+
+#endif /* _LINUX_UNWIND_USER_H */
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
new file mode 100644
index 000000000000..6ed1b4ae74e1
--- /dev/null
+++ b/include/linux/unwind_user_types.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_TYPES_H
+#define _LINUX_UNWIND_USER_TYPES_H
+
+#include <linux/types.h>
+
+enum unwind_user_type {
+	UNWIND_USER_TYPE_NONE,
+};
+
+struct unwind_stacktrace {
+	unsigned int	nr;
+	unsigned long	*entries;
+};
+
+struct unwind_user_frame {
+	s32 cfa_off;
+	s32 ra_off;
+	s32 fp_off;
+	bool use_fp;
+};
+
+struct unwind_user_state {
+	unsigned long ip;
+	unsigned long sp;
+	unsigned long fp;
+	enum unwind_user_type type;
+	bool done;
+};
+
+#endif /* _LINUX_UNWIND_USER_TYPES_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 32e80dd626af..541186050251 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -55,6 +55,7 @@ obj-y += rcu/
 obj-y += livepatch/
 obj-y += dma/
 obj-y += entry/
+obj-y += unwind/
 obj-$(CONFIG_MODULES) += module/
 
 obj-$(CONFIG_KCMP) += kcmp.o
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
new file mode 100644
index 000000000000..349ce3677526
--- /dev/null
+++ b/kernel/unwind/Makefile
@@ -0,0 +1 @@
+ obj-$(CONFIG_UNWIND_USER) += user.o
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
new file mode 100644
index 000000000000..d30449328981
--- /dev/null
+++ b/kernel/unwind/user.c
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+* Generic interfaces for unwinding user space
+*/
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+#include <linux/unwind_user.h>
+
+int unwind_user_next(struct unwind_user_state *state)
+{
+	/* no implementation yet */
+	return -EINVAL;
+}
+
+int unwind_user_start(struct unwind_user_state *state)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+
+	memset(state, 0, sizeof(*state));
+
+	if ((current->flags & PF_KTHREAD) || !user_mode(regs)) {
+		state->done = true;
+		return -EINVAL;
+	}
+
+	state->type = UNWIND_USER_TYPE_NONE;
+
+	state->ip = instruction_pointer(regs);
+	state->sp = user_stack_pointer(regs);
+	state->fp = frame_pointer(regs);
+
+	return 0;
+}
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries)
+{
+	struct unwind_user_state state;
+
+	trace->nr = 0;
+
+	if (!max_entries)
+		return -EINVAL;
+
+	if (current->flags & PF_KTHREAD)
+		return 0;
+
+	for_each_user_frame(&state) {
+		trace->entries[trace->nr++] = state.ip;
+		if (trace->nr >= max_entries)
+			break;
+	}
+
+	return 0;
+}
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v11 02/14] unwind_user: Add frame pointer support
  2025-06-25 22:56 [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
  2025-06-25 22:56 ` [PATCH v11 01/14] unwind_user: Add user space unwinding API Steven Rostedt
@ 2025-06-25 22:56 ` Steven Rostedt
  2025-06-25 22:56 ` [PATCH v11 03/14] unwind_user: Add compat mode " Steven Rostedt
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Steven Rostedt @ 2025-06-25 22:56 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
	Jens Axboe

From: Josh Poimboeuf <jpoimboe@kernel.org>

Add optional support for user space frame pointer unwinding.  If
supported, the arch needs to enable CONFIG_HAVE_UNWIND_USER_FP and
define ARCH_INIT_USER_FP_FRAME.

By encoding the frame offsets in struct unwind_user_frame, much of this
code can also be reused for future unwinder implementations like sframe.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v10: https://lore.kernel.org/20250611010428.092934995@goodmis.org

- Rename "the_end" label to "done" to be more consistent with what the
  kernel uses. (Peter Zijlstra)

- Add a comment about what cfa and ra stand for. (Peter Zijlstra)

 arch/Kconfig                      |  4 +++
 include/asm-generic/Kbuild        |  1 +
 include/asm-generic/unwind_user.h |  5 +++
 include/linux/unwind_user.h       |  5 +++
 include/linux/unwind_user_types.h |  1 +
 kernel/unwind/user.c              | 51 +++++++++++++++++++++++++++++--
 6 files changed, 65 insertions(+), 2 deletions(-)
 create mode 100644 include/asm-generic/unwind_user.h

diff --git a/arch/Kconfig b/arch/Kconfig
index ea59e5d7cc69..8e3fd723bd74 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -438,6 +438,10 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
 config UNWIND_USER
 	bool
 
+config HAVE_UNWIND_USER_FP
+	bool
+	select UNWIND_USER
+
 config HAVE_PERF_REGS
 	bool
 	help
diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
index 8675b7b4ad23..295c94a3ccc1 100644
--- a/include/asm-generic/Kbuild
+++ b/include/asm-generic/Kbuild
@@ -59,6 +59,7 @@ mandatory-y += tlbflush.h
 mandatory-y += topology.h
 mandatory-y += trace_clock.h
 mandatory-y += uaccess.h
+mandatory-y += unwind_user.h
 mandatory-y += vermagic.h
 mandatory-y += vga.h
 mandatory-y += video.h
diff --git a/include/asm-generic/unwind_user.h b/include/asm-generic/unwind_user.h
new file mode 100644
index 000000000000..b8882b909944
--- /dev/null
+++ b/include/asm-generic/unwind_user.h
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_UNWIND_USER_H
+#define _ASM_GENERIC_UNWIND_USER_H
+
+#endif /* _ASM_GENERIC_UNWIND_USER_H */
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index aa7923c1384f..a405111c41b0 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -3,6 +3,11 @@
 #define _LINUX_UNWIND_USER_H
 
 #include <linux/unwind_user_types.h>
+#include <asm/unwind_user.h>
+
+#ifndef ARCH_INIT_USER_FP_FRAME
+ #define ARCH_INIT_USER_FP_FRAME
+#endif
 
 int unwind_user_start(struct unwind_user_state *state);
 int unwind_user_next(struct unwind_user_state *state);
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 6ed1b4ae74e1..65bd070eb6b0 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -6,6 +6,7 @@
 
 enum unwind_user_type {
 	UNWIND_USER_TYPE_NONE,
+	UNWIND_USER_TYPE_FP,
 };
 
 struct unwind_stacktrace {
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index d30449328981..1201d655654a 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -6,10 +6,54 @@
 #include <linux/sched.h>
 #include <linux/sched/task_stack.h>
 #include <linux/unwind_user.h>
+#include <linux/uaccess.h>
+
+static struct unwind_user_frame fp_frame = {
+	ARCH_INIT_USER_FP_FRAME
+};
+
+static inline bool fp_state(struct unwind_user_state *state)
+{
+	return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP) &&
+	       state->type == UNWIND_USER_TYPE_FP;
+}
 
 int unwind_user_next(struct unwind_user_state *state)
 {
-	/* no implementation yet */
+	struct unwind_user_frame *frame;
+	unsigned long cfa = 0, fp, ra = 0;
+
+	if (state->done)
+		return -EINVAL;
+
+	if (fp_state(state))
+		frame = &fp_frame;
+	else
+		goto done;
+
+	/* Get the Canonical Frame Address (CFA) */
+	cfa = (frame->use_fp ? state->fp : state->sp) + frame->cfa_off;
+
+	/* stack going in wrong direction? */
+	if (cfa <= state->sp)
+		goto done;
+
+	/* Find the Return Address (RA) */
+	if (get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
+		goto done;
+
+	if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
+		goto done;
+
+	state->ip = ra;
+	state->sp = cfa;
+	if (frame->fp_off)
+		state->fp = fp;
+
+	return 0;
+
+done:
+	state->done = true;
 	return -EINVAL;
 }
 
@@ -24,7 +68,10 @@ int unwind_user_start(struct unwind_user_state *state)
 		return -EINVAL;
 	}
 
-	state->type = UNWIND_USER_TYPE_NONE;
+	if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+		state->type = UNWIND_USER_TYPE_FP;
+	else
+		state->type = UNWIND_USER_TYPE_NONE;
 
 	state->ip = instruction_pointer(regs);
 	state->sp = user_stack_pointer(regs);
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v11 03/14] unwind_user: Add compat mode frame pointer support
  2025-06-25 22:56 [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
  2025-06-25 22:56 ` [PATCH v11 01/14] unwind_user: Add user space unwinding API Steven Rostedt
  2025-06-25 22:56 ` [PATCH v11 02/14] unwind_user: Add frame pointer support Steven Rostedt
@ 2025-06-25 22:56 ` Steven Rostedt
  2025-06-25 22:56 ` [PATCH v11 04/14] unwind_user/deferred: Add unwind_user_faultable() Steven Rostedt
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Steven Rostedt @ 2025-06-25 22:56 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
	Jens Axboe

From: Josh Poimboeuf <jpoimboe@kernel.org>

Add optional support for user space compat mode frame pointer unwinding.
If supported, the arch needs to enable CONFIG_HAVE_UNWIND_USER_COMPAT_FP
and define ARCH_INIT_USER_COMPAT_FP_FRAME.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v10: https://lore.kernel.org/20250611010428.261095906@goodmis.org

- Rename compat_state() to compat_fp_state() for consistency. (Peter Zijlstra)

- Lowercase macro UNWIND_GET_USER_LONG() to unwind_get_user_long()
  (Peter Zijlstra and Linus Torvalds)

- Remove currently unused arch_unwind_user_init() and
  arch_unwind_user_next() and add them in the later patches when they are
  used. (Peter Zijlstra)

 arch/Kconfig                            |  4 ++++
 include/asm-generic/Kbuild              |  1 +
 include/asm-generic/unwind_user_types.h |  5 ++++
 include/linux/unwind_user.h             |  5 ++++
 include/linux/unwind_user_types.h       |  7 ++++++
 kernel/unwind/user.c                    | 32 +++++++++++++++++++++----
 6 files changed, 50 insertions(+), 4 deletions(-)
 create mode 100644 include/asm-generic/unwind_user_types.h

diff --git a/arch/Kconfig b/arch/Kconfig
index 8e3fd723bd74..2c41d3072910 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -442,6 +442,10 @@ config HAVE_UNWIND_USER_FP
 	bool
 	select UNWIND_USER
 
+config HAVE_UNWIND_USER_COMPAT_FP
+	bool
+	depends on HAVE_UNWIND_USER_FP
+
 config HAVE_PERF_REGS
 	bool
 	help
diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
index 295c94a3ccc1..b797a2434396 100644
--- a/include/asm-generic/Kbuild
+++ b/include/asm-generic/Kbuild
@@ -60,6 +60,7 @@ mandatory-y += topology.h
 mandatory-y += trace_clock.h
 mandatory-y += uaccess.h
 mandatory-y += unwind_user.h
+mandatory-y += unwind_user_types.h
 mandatory-y += vermagic.h
 mandatory-y += vga.h
 mandatory-y += video.h
diff --git a/include/asm-generic/unwind_user_types.h b/include/asm-generic/unwind_user_types.h
new file mode 100644
index 000000000000..f568b82e52cd
--- /dev/null
+++ b/include/asm-generic/unwind_user_types.h
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_UNWIND_USER_TYPES_H
+#define _ASM_GENERIC_UNWIND_USER_TYPES_H
+
+#endif /* _ASM_GENERIC_UNWIND_USER_TYPES_H */
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index a405111c41b0..ac007363820a 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -9,6 +9,11 @@
  #define ARCH_INIT_USER_FP_FRAME
 #endif
 
+#ifndef ARCH_INIT_USER_COMPAT_FP_FRAME
+ #define ARCH_INIT_USER_COMPAT_FP_FRAME
+ #define in_compat_mode(regs) false
+#endif
+
 int unwind_user_start(struct unwind_user_state *state);
 int unwind_user_next(struct unwind_user_state *state);
 
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 65bd070eb6b0..0b6563951ca4 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -3,10 +3,16 @@
 #define _LINUX_UNWIND_USER_TYPES_H
 
 #include <linux/types.h>
+#include <asm/unwind_user_types.h>
+
+#ifndef arch_unwind_user_state
+struct arch_unwind_user_state {};
+#endif
 
 enum unwind_user_type {
 	UNWIND_USER_TYPE_NONE,
 	UNWIND_USER_TYPE_FP,
+	UNWIND_USER_TYPE_COMPAT_FP,
 };
 
 struct unwind_stacktrace {
@@ -25,6 +31,7 @@ struct unwind_user_state {
 	unsigned long ip;
 	unsigned long sp;
 	unsigned long fp;
+	struct arch_unwind_user_state arch;
 	enum unwind_user_type type;
 	bool done;
 };
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 1201d655654a..3a0ac4346f5b 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -12,12 +12,32 @@ static struct unwind_user_frame fp_frame = {
 	ARCH_INIT_USER_FP_FRAME
 };
 
+static struct unwind_user_frame compat_fp_frame = {
+	ARCH_INIT_USER_COMPAT_FP_FRAME
+};
+
 static inline bool fp_state(struct unwind_user_state *state)
 {
 	return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP) &&
 	       state->type == UNWIND_USER_TYPE_FP;
 }
 
+static inline bool compat_fp_state(struct unwind_user_state *state)
+{
+	return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) &&
+	       state->type == UNWIND_USER_TYPE_COMPAT_FP;
+}
+
+#define unwind_get_user_long(to, from, state)				\
+({									\
+	int __ret;							\
+	if (compat_fp_state(state))					\
+		__ret = get_user(to, (u32 __user *)(from));		\
+	else								\
+		__ret = get_user(to, (unsigned long __user *)(from));	\
+	__ret;								\
+})
+
 int unwind_user_next(struct unwind_user_state *state)
 {
 	struct unwind_user_frame *frame;
@@ -26,7 +46,9 @@ int unwind_user_next(struct unwind_user_state *state)
 	if (state->done)
 		return -EINVAL;
 
-	if (fp_state(state))
+	if (compat_fp_state(state))
+		frame = &compat_fp_frame;
+	else if (fp_state(state))
 		frame = &fp_frame;
 	else
 		goto done;
@@ -39,10 +61,10 @@ int unwind_user_next(struct unwind_user_state *state)
 		goto done;
 
 	/* Find the Return Address (RA) */
-	if (get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
+	if (unwind_get_user_long(ra, cfa + frame->ra_off, state))
 		goto done;
 
-	if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
+	if (frame->fp_off && unwind_get_user_long(fp, cfa + frame->fp_off, state))
 		goto done;
 
 	state->ip = ra;
@@ -68,7 +90,9 @@ int unwind_user_start(struct unwind_user_state *state)
 		return -EINVAL;
 	}
 
-	if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+	if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) && in_compat_mode(regs))
+		state->type = UNWIND_USER_TYPE_COMPAT_FP;
+	else if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
 		state->type = UNWIND_USER_TYPE_FP;
 	else
 		state->type = UNWIND_USER_TYPE_NONE;
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v11 04/14] unwind_user/deferred: Add unwind_user_faultable()
  2025-06-25 22:56 [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (2 preceding siblings ...)
  2025-06-25 22:56 ` [PATCH v11 03/14] unwind_user: Add compat mode " Steven Rostedt
@ 2025-06-25 22:56 ` Steven Rostedt
  2025-06-25 22:56 ` [PATCH v11 05/14] unwind_user/deferred: Add unwind cache Steven Rostedt
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Steven Rostedt @ 2025-06-25 22:56 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
	Jens Axboe

From: Steven Rostedt <rostedt@goodmis.org>

Add a new API, unwind_user_faultable(), to retrieve a user space
callstack. It differs from the current user space stack tracer in that it
must be called from a faultable context, as it may use routines that
access user space data that needs to be faulted in.

It can safely be called when entering or exiting a system call, as user
space memory can still be faulted in at those points.

This code is based on Josh Poimboeuf's deferred unwinding work:

Link: https://lore.kernel.org/all/6052e8487746603bdb29b65f4033e739092d9925.1737511963.git.jpoimboe@kernel.org/

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v10: https://lore.kernel.org/20250611010428.433111891@goodmis.org

- Updated the change log

- Removed extra tab in Makefile (Peter Zijlstra)

- Renamed unwind_deferred_trace() to unwind_user_faultable()

 include/linux/sched.h                 |  5 +++
 include/linux/unwind_deferred.h       | 24 +++++++++++
 include/linux/unwind_deferred_types.h |  9 ++++
 kernel/fork.c                         |  4 ++
 kernel/unwind/Makefile                |  2 +-
 kernel/unwind/deferred.c              | 60 +++++++++++++++++++++++++++
 6 files changed, 103 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/unwind_deferred.h
 create mode 100644 include/linux/unwind_deferred_types.h
 create mode 100644 kernel/unwind/deferred.c

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4f78a64beb52..59fdf7d9bb1e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -46,6 +46,7 @@
 #include <linux/rv.h>
 #include <linux/uidgid_types.h>
 #include <linux/tracepoint-defs.h>
+#include <linux/unwind_deferred_types.h>
 #include <asm/kmap_size.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
@@ -1654,6 +1655,10 @@ struct task_struct {
 	struct user_event_mm		*user_event_mm;
 #endif
 
+#ifdef CONFIG_UNWIND_USER
+	struct unwind_task_info		unwind_info;
+#endif
+
 	/* CPU-specific state of this task: */
 	struct thread_struct		thread;
 
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
new file mode 100644
index 000000000000..a5f6e8f8a1a2
--- /dev/null
+++ b/include/linux/unwind_deferred.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_DEFERRED_H
+#define _LINUX_UNWIND_USER_DEFERRED_H
+
+#include <linux/unwind_user.h>
+#include <linux/unwind_deferred_types.h>
+
+#ifdef CONFIG_UNWIND_USER
+
+void unwind_task_init(struct task_struct *task);
+void unwind_task_free(struct task_struct *task);
+
+int unwind_user_faultable(struct unwind_stacktrace *trace);
+
+#else /* !CONFIG_UNWIND_USER */
+
+static inline void unwind_task_init(struct task_struct *task) {}
+static inline void unwind_task_free(struct task_struct *task) {}
+
+static inline int unwind_user_faultable(struct unwind_stacktrace *trace) { return -ENOSYS; }
+
+#endif /* !CONFIG_UNWIND_USER */
+
+#endif /* _LINUX_UNWIND_USER_DEFERRED_H */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
new file mode 100644
index 000000000000..aa32db574e43
--- /dev/null
+++ b/include/linux/unwind_deferred_types.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+
+struct unwind_task_info {
+	unsigned long		*entries;
+};
+
+#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 1ee8eb11f38b..3341d50c61f2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -105,6 +105,7 @@
 #include <uapi/linux/pidfd.h>
 #include <linux/pidfs.h>
 #include <linux/tick.h>
+#include <linux/unwind_deferred.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -732,6 +733,7 @@ void __put_task_struct(struct task_struct *tsk)
 	WARN_ON(refcount_read(&tsk->usage));
 	WARN_ON(tsk == current);
 
+	unwind_task_free(tsk);
 	sched_ext_free(tsk);
 	io_uring_free(tsk);
 	cgroup_free(tsk);
@@ -2135,6 +2137,8 @@ __latent_entropy struct task_struct *copy_process(
 	p->bpf_ctx = NULL;
 #endif
 
+	unwind_task_init(p);
+
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
 	if (retval)
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
index 349ce3677526..eae37bea54fd 100644
--- a/kernel/unwind/Makefile
+++ b/kernel/unwind/Makefile
@@ -1 +1 @@
- obj-$(CONFIG_UNWIND_USER) += user.o
+ obj-$(CONFIG_UNWIND_USER)	+= user.o deferred.o
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
new file mode 100644
index 000000000000..a0badbeb3cc1
--- /dev/null
+++ b/kernel/unwind/deferred.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Deferred user space unwinding
+ */
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/unwind_deferred.h>
+
+#define UNWIND_MAX_ENTRIES 512
+
+/**
+ * unwind_user_faultable - Produce a user stacktrace in faultable context
+ * @trace: The descriptor that will store the user stacktrace
+ *
+ * This must be called in a known faultable context (usually when entering
+ * or exiting user space). Depending on the available implementations
+ * the @trace will be loaded with the addresses of the user space stacktrace
+ * if it can be found.
+ *
+ * Return: 0 on success and negative on error
+ *         On success @trace will contain the user space stacktrace
+ */
+int unwind_user_faultable(struct unwind_stacktrace *trace)
+{
+	struct unwind_task_info *info = &current->unwind_info;
+
+	/* Should always be called from faultable context */
+	might_fault();
+
+	if (current->flags & PF_EXITING)
+		return -EINVAL;
+
+	if (!info->entries) {
+		info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
+					      GFP_KERNEL);
+		if (!info->entries)
+			return -ENOMEM;
+	}
+
+	trace->nr = 0;
+	trace->entries = info->entries;
+	unwind_user(trace, UNWIND_MAX_ENTRIES);
+
+	return 0;
+}
+
+void unwind_task_init(struct task_struct *task)
+{
+	struct unwind_task_info *info = &task->unwind_info;
+
+	memset(info, 0, sizeof(*info));
+}
+
+void unwind_task_free(struct task_struct *task)
+{
+	struct unwind_task_info *info = &task->unwind_info;
+
+	kfree(info->entries);
+}
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v11 05/14] unwind_user/deferred: Add unwind cache
  2025-06-25 22:56 [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (3 preceding siblings ...)
  2025-06-25 22:56 ` [PATCH v11 04/14] unwind_user/deferred: Add unwind_user_faultable() Steven Rostedt
@ 2025-06-25 22:56 ` Steven Rostedt
  2025-06-25 22:56 ` [PATCH v11 06/14] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Steven Rostedt @ 2025-06-25 22:56 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
	Jens Axboe

From: Josh Poimboeuf <jpoimboe@kernel.org>

Cache the results of the unwind to ensure the unwind is only performed
once, even when called by multiple tracers.

The cache's nr_entries gets cleared every time the task exits the kernel.
When a stacktrace is requested, nr_entries is set to the number of entries
in the stacktrace. If another stacktrace is requested while nr_entries is
non-zero, the cache already holds the stacktrace that would be produced,
so the unwind is skipped and the cached entries are handed to the caller.

Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v10: https://lore.kernel.org/20250611010428.603778772@goodmis.org

- Rename unwind_exit_to_user_mode() to unwind_reset_info() as it is called
  in a noinstr section and the current name looks like it does
  instrumentation (Peter Zijlstra)

- Make the allocated cache fit inside a 4K page. (Peter Zijlstra)

 include/linux/entry-common.h          |  2 ++
 include/linux/unwind_deferred.h       |  8 +++++++
 include/linux/unwind_deferred_types.h |  7 +++++-
 kernel/unwind/deferred.c              | 31 +++++++++++++++++++++------
 4 files changed, 40 insertions(+), 8 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index f94f3fdf15fc..8908b8eeb99b 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -12,6 +12,7 @@
 #include <linux/resume_user_mode.h>
 #include <linux/tick.h>
 #include <linux/kmsan.h>
+#include <linux/unwind_deferred.h>
 
 #include <asm/entry-common.h>
 #include <asm/syscall.h>
@@ -362,6 +363,7 @@ static __always_inline void exit_to_user_mode(void)
 	lockdep_hardirqs_on_prepare();
 	instrumentation_end();
 
+	unwind_reset_info();
 	user_enter_irqoff();
 	arch_exit_to_user_mode();
 	lockdep_hardirqs_on(CALLER_ADDR0);
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index a5f6e8f8a1a2..baacf4a1eb4c 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -12,6 +12,12 @@ void unwind_task_free(struct task_struct *task);
 
 int unwind_user_faultable(struct unwind_stacktrace *trace);
 
+static __always_inline void unwind_reset_info(void)
+{
+	if (unlikely(current->unwind_info.cache))
+		current->unwind_info.cache->nr_entries = 0;
+}
+
 #else /* !CONFIG_UNWIND_USER */
 
 static inline void unwind_task_init(struct task_struct *task) {}
@@ -19,6 +25,8 @@ static inline void unwind_task_free(struct task_struct *task) {}
 
 static inline int unwind_user_faultable(struct unwind_stacktrace *trace) { return -ENOSYS; }
 
+static inline void unwind_reset_info(void) {}
+
 #endif /* !CONFIG_UNWIND_USER */
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_H */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index aa32db574e43..db5b54b18828 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -2,8 +2,13 @@
 #ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
 #define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
 
+struct unwind_cache {
+	unsigned int		nr_entries;
+	unsigned long		entries[];
+};
+
 struct unwind_task_info {
-	unsigned long		*entries;
+	struct unwind_cache	*cache;
 };
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index a0badbeb3cc1..96368a5aa522 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -4,10 +4,13 @@
  */
 #include <linux/kernel.h>
 #include <linux/sched.h>
+#include <linux/sizes.h>
 #include <linux/slab.h>
 #include <linux/unwind_deferred.h>
 
-#define UNWIND_MAX_ENTRIES 512
+/* Make the cache fit in a 4K page */
+#define UNWIND_MAX_ENTRIES					\
+	((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
 
 /**
  * unwind_user_faultable - Produce a user stacktrace in faultable context
@@ -24,6 +27,7 @@
 int unwind_user_faultable(struct unwind_stacktrace *trace)
 {
 	struct unwind_task_info *info = &current->unwind_info;
+	struct unwind_cache *cache;
 
 	/* Should always be called from faultable context */
 	might_fault();
@@ -31,17 +35,30 @@ int unwind_user_faultable(struct unwind_stacktrace *trace)
 	if (current->flags & PF_EXITING)
 		return -EINVAL;
 
-	if (!info->entries) {
-		info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
-					      GFP_KERNEL);
-		if (!info->entries)
+	if (!info->cache) {
+		info->cache = kzalloc(struct_size(cache, entries, UNWIND_MAX_ENTRIES),
+				      GFP_KERNEL);
+		if (!info->cache)
 			return -ENOMEM;
 	}
 
+	cache = info->cache;
+	trace->entries = cache->entries;
+
+	if (cache->nr_entries) {
+		/*
+		 * The user stack has already been previously unwound in this
+		 * entry context.  Skip the unwind and use the cache.
+		 */
+		trace->nr = cache->nr_entries;
+		return 0;
+	}
+
 	trace->nr = 0;
-	trace->entries = info->entries;
 	unwind_user(trace, UNWIND_MAX_ENTRIES);
 
+	cache->nr_entries = trace->nr;
+
 	return 0;
 }
 
@@ -56,5 +73,5 @@ void unwind_task_free(struct task_struct *task)
 {
 	struct unwind_task_info *info = &task->unwind_info;
 
-	kfree(info->entries);
+	kfree(info->cache);
 }
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v11 06/14] unwind_user/deferred: Add deferred unwinding interface
  2025-06-25 22:56 [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (4 preceding siblings ...)
  2025-06-25 22:56 ` [PATCH v11 05/14] unwind_user/deferred: Add unwind cache Steven Rostedt
@ 2025-06-25 22:56 ` Steven Rostedt
  2025-06-26 16:48   ` Steven Rostedt
  2025-06-25 22:56 ` [PATCH v11 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 33+ messages in thread
From: Steven Rostedt @ 2025-06-25 22:56 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
	Jens Axboe

From: Josh Poimboeuf <jpoimboe@kernel.org>

Add an interface for scheduling task work to unwind the user space stack
before returning to user space. This solves several problems for its
callers:

  - Ensure the unwind happens in task context even if the caller may be
    running in interrupt context.

  - Avoid duplicate unwinds, whether called multiple times by the same
    caller or by different callers.

  - Take a timestamp when the first request comes in since the task
    entered the kernel. This will be returned to the calling function
    along with the stack trace when the task leaves the kernel. This
    timestamp can be used to correlate kernel unwinds/traces with the user
    unwind. For this to work properly, the architecture must have a
    local_clock() resolution that guarantees a different timestamp for
    each system call of a task.

The timestamp is used to detect when requested stacktraces would be
identical. It is generated the first time a user space stacktrace is
requested after the task enters the kernel.

The timestamp is passed to the caller on request, and when the stacktrace
is generated upon returning to user space, the requester's callback is
invoked with both the timestamp and the stacktrace.

A global list is created and protected by a global mutex that holds
tracers that register with the unwind infrastructure. The number of
registered tracers will be limited in future changes. Each perf program or
ftrace instance will register its own descriptor to use for deferred
unwind stack traces.

Note: unwind_deferred_task_work(), which gets called when returning to
user space, uses a global mutex for synchronization, which will cause a
big bottleneck. This will be replaced by SRCU, but that change adds some
complex synchronization that deserves its own commit.

Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v10: https://lore.kernel.org/20250611010428.770214773@goodmis.org

- Added comment stating that the clock used for the timestamp must have a
  resolution that guarantees two system calls made back to back will have
  different timestamps.

- Use timestamp being zero to exit out early in unwind_reset_info() so
  that there's no writes when not used. (Peter Zijlstra)

- Update change log to mention that each perf program and ftrace instance
  will register with the deferred unwinder. (Peter Zijlstra)

- Update change log to express that unwind_deferred_task_work() uses a
  global lock but will later be replaced with SRCU. (Peter Zijlstra)

 include/linux/unwind_deferred.h       |  24 ++++-
 include/linux/unwind_deferred_types.h |   3 +
 kernel/unwind/deferred.c              | 139 +++++++++++++++++++++++++-
 3 files changed, 164 insertions(+), 2 deletions(-)

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index baacf4a1eb4c..26011b413142 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -2,9 +2,19 @@
 #ifndef _LINUX_UNWIND_USER_DEFERRED_H
 #define _LINUX_UNWIND_USER_DEFERRED_H
 
+#include <linux/task_work.h>
 #include <linux/unwind_user.h>
 #include <linux/unwind_deferred_types.h>
 
+struct unwind_work;
+
+typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stacktrace *trace, u64 timestamp);
+
+struct unwind_work {
+	struct list_head		list;
+	unwind_callback_t		func;
+};
+
 #ifdef CONFIG_UNWIND_USER
 
 void unwind_task_init(struct task_struct *task);
@@ -12,10 +22,19 @@ void unwind_task_free(struct task_struct *task);
 
 int unwind_user_faultable(struct unwind_stacktrace *trace);
 
+int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
+int unwind_deferred_request(struct unwind_work *work, u64 *timestamp);
+void unwind_deferred_cancel(struct unwind_work *work);
+
 static __always_inline void unwind_reset_info(void)
 {
-	if (unlikely(current->unwind_info.cache))
+	/* Exit out early if this was never used */
+	if (likely(!current->unwind_info.timestamp))
+		return;
+
+	if (current->unwind_info.cache)
 		current->unwind_info.cache->nr_entries = 0;
+	current->unwind_info.timestamp = 0;
 }
 
 #else /* !CONFIG_UNWIND_USER */
@@ -24,6 +43,9 @@ static inline void unwind_task_init(struct task_struct *task) {}
 static inline void unwind_task_free(struct task_struct *task) {}
 
 static inline int unwind_user_faultable(struct unwind_stacktrace *trace) { return -ENOSYS; }
+static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func) { return -ENOSYS; }
+static inline int unwind_deferred_request(struct unwind_work *work, u64 *timestamp) { return -ENOSYS; }
+static inline void unwind_deferred_cancel(struct unwind_work *work) {}
 
 static inline void unwind_reset_info(void) {}
 
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index db5b54b18828..5df264cf81ad 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -9,6 +9,9 @@ struct unwind_cache {
 
 struct unwind_task_info {
 	struct unwind_cache	*cache;
+	struct callback_head	work;
+	u64			timestamp;
+	int			pending;
 };
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 96368a5aa522..d5f2c004a5b0 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -2,16 +2,43 @@
 /*
  * Deferred user space unwinding
  */
+#include <linux/sched/task_stack.h>
+#include <linux/unwind_deferred.h>
+#include <linux/sched/clock.h>
+#include <linux/task_work.h>
 #include <linux/kernel.h>
 #include <linux/sched.h>
 #include <linux/sizes.h>
 #include <linux/slab.h>
-#include <linux/unwind_deferred.h>
+#include <linux/mm.h>
 
 /* Make the cache fit in a 4K page */
 #define UNWIND_MAX_ENTRIES					\
 	((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
 
+/* Guards adding to and reading the list of callbacks */
+static DEFINE_MUTEX(callback_mutex);
+static LIST_HEAD(callbacks);
+
+/*
+ * Read the task context timestamp, if this is the first caller then
+ * it will set the timestamp.
+ *
+ * For this to work properly, the timestamp (local_clock()) must
+ * have a resolution that will guarantee a different timestamp
 + * every time a task makes a system call. That is, two short
+ * system calls back to back must have a different timestamp.
+ */
+static u64 get_timestamp(struct unwind_task_info *info)
+{
+	lockdep_assert_irqs_disabled();
+
+	if (!info->timestamp)
+		info->timestamp = local_clock();
+
+	return info->timestamp;
+}
+
 /**
  * unwind_user_faultable - Produce a user stacktrace in faultable context
  * @trace: The descriptor that will store the user stacktrace
@@ -62,11 +89,120 @@ int unwind_user_faultable(struct unwind_stacktrace *trace)
 	return 0;
 }
 
+static void unwind_deferred_task_work(struct callback_head *head)
+{
+	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
+	struct unwind_stacktrace trace;
+	struct unwind_work *work;
+	u64 timestamp;
+
+	if (WARN_ON_ONCE(!info->pending))
+		return;
+
+	/* Allow work to come in again */
+	WRITE_ONCE(info->pending, 0);
+
+	/*
+	 * From here on out, the callback must always be called, even if it's
+	 * just an empty trace.
+	 */
+	trace.nr = 0;
+	trace.entries = NULL;
+
+	unwind_user_faultable(&trace);
+
+	timestamp = info->timestamp;
+
+	guard(mutex)(&callback_mutex);
+	list_for_each_entry(work, &callbacks, list) {
+		work->func(work, &trace, timestamp);
+	}
+}
+
+/**
+ * unwind_deferred_request - Request a user stacktrace on task exit
+ * @work: Unwind descriptor requesting the trace
+ * @timestamp: The time stamp of the first request made for this task
+ *
+ * Schedule a user space unwind to be done in task work before exiting the
+ * kernel.
+ *
+ * The returned @timestamp output is the timestamp of the very first request
+ * for a user space stacktrace for this task since it entered the kernel.
+ * It can be from a request by any caller of this infrastructure.
+ * Its value will also be passed to the callback function.  It can be
+ * used to stitch kernel and user stack traces together in post-processing.
+ *
+ * Note, the architecture must have a local_clock() implementation that
 + * guarantees a different timestamp per task system call.
+ *
+ * It's valid to call this function multiple times for the same @work within
+ * the same task entry context.  Each call will return the same timestamp
+ * while the task hasn't left the kernel. If the callback is not pending because
+ * it has already been previously called for the same entry context, it will be
+ * called again with the same stack trace and timestamp.
+ *
 + * Return: 1 if the callback was already queued.
+ *         0 if the callback successfully was queued.
+ *         Negative if there's an error.
+ *         @timestamp holds the timestamp of the first request by any user
+ */
+int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
+{
+	struct unwind_task_info *info = &current->unwind_info;
+	int ret;
+
+	*timestamp = 0;
+
+	if (WARN_ON_ONCE(in_nmi()))
+		return -EINVAL;
+
+	if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
+	    !user_mode(task_pt_regs(current)))
+		return -EINVAL;
+
+	guard(irqsave)();
+
+	*timestamp = get_timestamp(info);
+
+	/* callback already pending? */
+	if (info->pending)
+		return 1;
+
+	/* The work has been claimed, now schedule it. */
+	ret = task_work_add(current, &info->work, TWA_RESUME);
+	if (WARN_ON_ONCE(ret))
+		return ret;
+
+	info->pending = 1;
+	return 0;
+}
+
+void unwind_deferred_cancel(struct unwind_work *work)
+{
+	if (!work)
+		return;
+
+	guard(mutex)(&callback_mutex);
+	list_del(&work->list);
+}
+
+int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
+{
+	memset(work, 0, sizeof(*work));
+
+	guard(mutex)(&callback_mutex);
+	list_add(&work->list, &callbacks);
+	work->func = func;
+	return 0;
+}
+
 void unwind_task_init(struct task_struct *task)
 {
 	struct unwind_task_info *info = &task->unwind_info;
 
 	memset(info, 0, sizeof(*info));
+	init_task_work(&info->work, unwind_deferred_task_work);
 }
 
 void unwind_task_free(struct task_struct *task)
@@ -74,4 +210,5 @@ void unwind_task_free(struct task_struct *task)
 	struct unwind_task_info *info = &task->unwind_info;
 
 	kfree(info->cache);
+	task_work_cancel(task, &info->work);
 }
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v11 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe
  2025-06-25 22:56 [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (5 preceding siblings ...)
  2025-06-25 22:56 ` [PATCH v11 06/14] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
@ 2025-06-25 22:56 ` Steven Rostedt
  2025-06-25 22:56 ` [PATCH v11 08/14] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Steven Rostedt @ 2025-06-25 22:56 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
	Jens Axboe

From: Steven Rostedt <rostedt@goodmis.org>

Make unwind_deferred_request() NMI-safe so tracers in NMI context can
call it and safely request a user space stacktrace when the task exits.

Note, this is only allowed for architectures that implement a safe 64 bit
cmpxchg, which rules out some 32 bit architectures and even some 64 bit
ones. If a deferred stack trace is requested from NMI context on an
architecture that does not support a safe NMI 64 bit cmpxchg, the request
will fail with -EINVAL. Those architectures would need another method
(perhaps an irqwork) to request a deferred user space stack trace. That
can be dealt with later if one of these architectures requires this feature.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v10: https://lore.kernel.org/20250611010428.938845449@goodmis.org

- Reworked to simply use a 64bit cmpxchg to update the timestamp.

- Removed Josh Poimboeuf's authorship as the commit is completely
  rewritten.

- Switch timestamp to local64_t type and pending to local_t type.

 include/linux/unwind_deferred.h       |  4 +-
 include/linux/unwind_deferred_types.h |  7 ++-
 kernel/unwind/deferred.c              | 74 ++++++++++++++++++++++-----
 3 files changed, 69 insertions(+), 16 deletions(-)

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 26011b413142..718637777649 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -29,12 +29,12 @@ void unwind_deferred_cancel(struct unwind_work *work);
 static __always_inline void unwind_reset_info(void)
 {
 	/* Exit out early if this was never used */
-	if (likely(!current->unwind_info.timestamp))
+	if (likely(!local64_read(&current->unwind_info.timestamp)))
 		return;
 
 	if (current->unwind_info.cache)
 		current->unwind_info.cache->nr_entries = 0;
-	current->unwind_info.timestamp = 0;
+	local64_set(&current->unwind_info.timestamp, 0);
 }
 
 #else /* !CONFIG_UNWIND_USER */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 5df264cf81ad..0d722e877473 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -2,6 +2,9 @@
 #ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
 #define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
 
+#include <asm/local64.h>
+#include <asm/local.h>
+
 struct unwind_cache {
 	unsigned int		nr_entries;
 	unsigned long		entries[];
@@ -10,8 +13,8 @@ struct unwind_cache {
 struct unwind_task_info {
 	struct unwind_cache	*cache;
 	struct callback_head	work;
-	u64			timestamp;
-	int			pending;
+	local64_t		timestamp;
+	local_t			pending;
 };
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index d5f2c004a5b0..dd36e58c8cad 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -12,6 +12,35 @@
 #include <linux/slab.h>
 #include <linux/mm.h>
 
+/*
+ * For requesting a deferred user space stack trace from NMI context
+ * the architecture must support a 64bit safe cmpxchg in NMI context.
 + * For those architectures that do not have that, a deferred user
 + * space stack trace cannot be requested from NMI context. If it
 + * is requested anyway, the request will get -EINVAL.
+ */
+#if defined(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) && \
+	!defined(CONFIG_GENERIC_ATOMIC64)
+# define CAN_USE_IN_NMI		1
+static inline u64 assign_timestamp(struct unwind_task_info *info,
+				   u64 timestamp)
+{
+	u64 old = 0;
+	if (!local64_try_cmpxchg(&info->timestamp, &old, timestamp))
+		timestamp = old;
+	return timestamp;
+}
+#else
+# define CAN_USE_IN_NMI		0
+static inline u64 assign_timestamp(struct unwind_task_info *info,
+				   u64 timestamp)
+{
+	/* For archs that do not allow NMI here */
+	local64_set(&info->timestamp, timestamp);
+	return timestamp;
+}
+#endif
+
 /* Make the cache fit in a 4K page */
 #define UNWIND_MAX_ENTRIES					\
 	((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
@@ -31,12 +60,21 @@ static LIST_HEAD(callbacks);
  */
 static u64 get_timestamp(struct unwind_task_info *info)
 {
+	u64 timestamp;
+
 	lockdep_assert_irqs_disabled();
 
-	if (!info->timestamp)
-		info->timestamp = local_clock();
+	/*
+	 * Note, the timestamp is generated on the first request.
+	 * If it exists here, then the timestamp is earlier than
+	 * this request and it means that this request will be
 + * valid for the stacktrace.
+	 */
+	timestamp = local64_read(&info->timestamp);
+	if (timestamp)
+		return timestamp;
 
-	return info->timestamp;
+	return assign_timestamp(info, local_clock());
 }
 
 /**
@@ -96,11 +134,11 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	struct unwind_work *work;
 	u64 timestamp;
 
-	if (WARN_ON_ONCE(!info->pending))
+	if (WARN_ON_ONCE(!local_read(&info->pending)))
 		return;
 
 	/* Allow work to come in again */
-	WRITE_ONCE(info->pending, 0);
+	local_set(&info->pending, 0);
 
 	/*
 	 * From here on out, the callback must always be called, even if it's
@@ -111,7 +149,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
 
 	unwind_user_faultable(&trace);
 
-	timestamp = info->timestamp;
+	timestamp = local64_read(&info->timestamp);
 
 	guard(mutex)(&callback_mutex);
 	list_for_each_entry(work, &callbacks, list) {
@@ -150,31 +188,43 @@ static void unwind_deferred_task_work(struct callback_head *head)
 int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 {
 	struct unwind_task_info *info = &current->unwind_info;
+	long pending;
 	int ret;
 
 	*timestamp = 0;
 
-	if (WARN_ON_ONCE(in_nmi()))
-		return -EINVAL;
-
 	if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
 	    !user_mode(task_pt_regs(current)))
 		return -EINVAL;
 
+	/* NMI requires having safe 64 bit cmpxchg operations */
+	if (!CAN_USE_IN_NMI && in_nmi())
+		return -EINVAL;
+
 	guard(irqsave)();
 
 	*timestamp = get_timestamp(info);
 
 	/* callback already pending? */
-	if (info->pending)
+	pending = local_read(&info->pending);
+	if (pending)
 		return 1;
 
+	if (CAN_USE_IN_NMI) {
+		/* Claim the work unless an NMI just now swooped in to do so. */
+		if (!local_try_cmpxchg(&info->pending, &pending, 1))
+			return 1;
+	} else {
+		local_set(&info->pending, 1);
+	}
+
 	/* The work has been claimed, now schedule it. */
 	ret = task_work_add(current, &info->work, TWA_RESUME);
-	if (WARN_ON_ONCE(ret))
+	if (WARN_ON_ONCE(ret)) {
+		local_set(&info->pending, 0);
 		return ret;
+	}
 
-	info->pending = 1;
 	return 0;
 }
 
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v11 08/14] unwind deferred: Use bitmask to determine which callbacks to call
  2025-06-25 22:56 [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (6 preceding siblings ...)
  2025-06-25 22:56 ` [PATCH v11 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
@ 2025-06-25 22:56 ` Steven Rostedt
  2025-06-25 22:56 ` [PATCH v11 09/14] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Steven Rostedt @ 2025-06-25 22:56 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
	Jens Axboe

From: Steven Rostedt <rostedt@goodmis.org>

In order to know which registered callback requested a stacktrace for when
the task goes back to user space, add a bitmask to keep track of all
registered tracers. The bitmask is the size of long, which means that on a
32 bit machine, it can have at most 32 registered tracers, and on 64 bit,
it can have at most 64 registered tracers. This should not be an issue as
there should not be more than 10 (unless BPF can abuse this?).

When a tracer registers with unwind_deferred_init() it will get a bit
number assigned to it. When a tracer requests a stacktrace, it will have
its bit set within the task_struct. When the task returns back to user
space, it will call the callbacks for all the registered tracers where
their bits are set in the task's mask.

When a tracer is removed by the unwind_deferred_cancel() all current tasks
will clear the associated bit, just in case another tracer gets registered
immediately afterward and then gets their callback called unexpectedly.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v10: https://lore.kernel.org/20250611010429.105907436@goodmis.org

- Use __clear_bit() and __set_bit() consistently with the global variable
  unwind_mask.  (Peter Zijlstra)

- Use clear_bit() and set_bit() consistently with the task unwind_mask,
  as it can race with NMIs.

 include/linux/unwind_deferred.h       |  1 +
 include/linux/unwind_deferred_types.h |  1 +
 kernel/unwind/deferred.c              | 36 ++++++++++++++++++++++++---
 3 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 718637777649..00656e903375 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -13,6 +13,7 @@ typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stackt
 struct unwind_work {
 	struct list_head		list;
 	unwind_callback_t		func;
+	int				bit;
 };
 
 #ifdef CONFIG_UNWIND_USER
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 0d722e877473..5863bf4eb436 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -13,6 +13,7 @@ struct unwind_cache {
 struct unwind_task_info {
 	struct unwind_cache	*cache;
 	struct callback_head	work;
+	unsigned long		unwind_mask;
 	local64_t		timestamp;
 	local_t			pending;
 };
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index dd36e58c8cad..6c558d00ff41 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -48,6 +48,7 @@ static inline u64 assign_timestamp(struct unwind_task_info *info,
 /* Guards adding to and reading the list of callbacks */
 static DEFINE_MUTEX(callback_mutex);
 static LIST_HEAD(callbacks);
+static unsigned long unwind_mask;
 
 /*
  * Read the task context timestamp, if this is the first caller then
@@ -153,7 +154,10 @@ static void unwind_deferred_task_work(struct callback_head *head)
 
 	guard(mutex)(&callback_mutex);
 	list_for_each_entry(work, &callbacks, list) {
-		work->func(work, &trace, timestamp);
+		if (test_bit(work->bit, &info->unwind_mask)) {
+			work->func(work, &trace, timestamp);
+			clear_bit(work->bit, &info->unwind_mask);
+		}
 	}
 }
 
@@ -205,15 +209,19 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 
 	*timestamp = get_timestamp(info);
 
+	/* This is already queued */
+	if (test_bit(work->bit, &info->unwind_mask))
+		return 1;
+
 	/* callback already pending? */
 	pending = local_read(&info->pending);
 	if (pending)
-		return 1;
+		goto out;
 
 	if (CAN_USE_IN_NMI) {
 		/* Claim the work unless an NMI just now swooped in to do so. */
 		if (!local_try_cmpxchg(&info->pending, &pending, 1))
-			return 1;
+			goto out;
 	} else {
 		local_set(&info->pending, 1);
 	}
@@ -225,16 +233,27 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 		return ret;
 	}
 
-	return 0;
+ out:
+	return test_and_set_bit(work->bit, &info->unwind_mask);
 }
 
 void unwind_deferred_cancel(struct unwind_work *work)
 {
+	struct task_struct *g, *t;
+
 	if (!work)
 		return;
 
 	guard(mutex)(&callback_mutex);
 	list_del(&work->list);
+
+	__clear_bit(work->bit, &unwind_mask);
+
+	guard(rcu)();
+	/* Clear this bit from all threads */
+	for_each_process_thread(g, t) {
+		clear_bit(work->bit, &t->unwind_info.unwind_mask);
+	}
 }
 
 int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
@@ -242,6 +261,14 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 	memset(work, 0, sizeof(*work));
 
 	guard(mutex)(&callback_mutex);
+
+	/* See if there's a bit in the mask available */
+	if (unwind_mask == ~0UL)
+		return -EBUSY;
+
+	work->bit = ffz(unwind_mask);
+	__set_bit(work->bit, &unwind_mask);
+
 	list_add(&work->list, &callbacks);
 	work->func = func;
 	return 0;
@@ -253,6 +280,7 @@ void unwind_task_init(struct task_struct *task)
 
 	memset(info, 0, sizeof(*info));
 	init_task_work(&info->work, unwind_deferred_task_work);
+	info->unwind_mask = 0;
 }
 
 void unwind_task_free(struct task_struct *task)
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v11 09/14] unwind deferred: Use SRCU unwind_deferred_task_work()
  2025-06-25 22:56 [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (7 preceding siblings ...)
  2025-06-25 22:56 ` [PATCH v11 08/14] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
@ 2025-06-25 22:56 ` Steven Rostedt
  2025-06-25 22:56 ` [PATCH v11 10/14] unwind: Clear unwind_mask on exit back to user space Steven Rostedt
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Steven Rostedt @ 2025-06-25 22:56 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
	Jens Axboe

From: Steven Rostedt <rostedt@goodmis.org>

Instead of using the callback_mutex to protect the linked list of callbacks
in unwind_deferred_task_work(), use SRCU instead. This gets called every
time a task that has a requested stack trace to record exits back to user
space. This can happen for many tasks on several CPUs at the same time. A
mutex is a bottleneck and can cause contention and slow down performance.

As the callbacks themselves are allowed to sleep, regular RCU cannot be
used to protect the list. Instead use SRCU, as that still allows the
callbacks to sleep and the list can be read without needing to hold the
callback_mutex.

Link: https://lore.kernel.org/all/ca9bd83a-6c80-4ee0-a83c-224b9d60b755@efficios.com/

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v10: https://lore.kernel.org/20250611010429.274682576@goodmis.org

- Use "bit" that was acquired by READ_ONCE() in test_and_set_bit() in
  unwind_deferred_request() instead of reading work->bit again.

 kernel/unwind/deferred.c | 35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 6c558d00ff41..7309c9e0e57a 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -45,10 +45,11 @@ static inline u64 assign_timestamp(struct unwind_task_info *info,
 #define UNWIND_MAX_ENTRIES					\
 	((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
 
-/* Guards adding to and reading the list of callbacks */
+/* Guards adding to or removing from the list of callbacks */
 static DEFINE_MUTEX(callback_mutex);
 static LIST_HEAD(callbacks);
 static unsigned long unwind_mask;
+DEFINE_STATIC_SRCU(unwind_srcu);
 
 /*
  * Read the task context timestamp, if this is the first caller then
@@ -134,6 +135,7 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	struct unwind_stacktrace trace;
 	struct unwind_work *work;
 	u64 timestamp;
+	int idx;
 
 	if (WARN_ON_ONCE(!local_read(&info->pending)))
 		return;
@@ -152,13 +154,15 @@ static void unwind_deferred_task_work(struct callback_head *head)
 
 	timestamp = local64_read(&info->timestamp);
 
-	guard(mutex)(&callback_mutex);
-	list_for_each_entry(work, &callbacks, list) {
+	idx = srcu_read_lock(&unwind_srcu);
+	list_for_each_entry_srcu(work, &callbacks, list,
+				 srcu_read_lock_held(&unwind_srcu)) {
 		if (test_bit(work->bit, &info->unwind_mask)) {
 			work->func(work, &trace, timestamp);
 			clear_bit(work->bit, &info->unwind_mask);
 		}
 	}
+	srcu_read_unlock(&unwind_srcu, idx);
 }
 
 /**
@@ -193,6 +197,7 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 {
 	struct unwind_task_info *info = &current->unwind_info;
 	long pending;
+	int bit;
 	int ret;
 
 	*timestamp = 0;
@@ -205,12 +210,17 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 	if (!CAN_USE_IN_NMI && in_nmi())
 		return -EINVAL;
 
+	/* Do not allow cancelled works to request again */
+	bit = READ_ONCE(work->bit);
+	if (WARN_ON_ONCE(bit < 0))
+		return -EINVAL;
+
 	guard(irqsave)();
 
 	*timestamp = get_timestamp(info);
 
 	/* This is already queued */
-	if (test_bit(work->bit, &info->unwind_mask))
+	if (test_bit(bit, &info->unwind_mask))
 		return 1;
 
 	/* callback already pending? */
@@ -234,25 +244,32 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 	}
 
  out:
-	return test_and_set_bit(work->bit, &info->unwind_mask);
+	return test_and_set_bit(bit, &info->unwind_mask);
 }
 
 void unwind_deferred_cancel(struct unwind_work *work)
 {
 	struct task_struct *g, *t;
+	int bit;
 
 	if (!work)
 		return;
 
 	guard(mutex)(&callback_mutex);
-	list_del(&work->list);
+	list_del_rcu(&work->list);
+	bit = work->bit;
+
+	/* Do not allow any more requests and prevent callbacks */
+	work->bit = -1;
+
+	__clear_bit(bit, &unwind_mask);
 
-	__clear_bit(work->bit, &unwind_mask);
+	synchronize_srcu(&unwind_srcu);
 
 	guard(rcu)();
 	/* Clear this bit from all threads */
 	for_each_process_thread(g, t) {
-		clear_bit(work->bit, &t->unwind_info.unwind_mask);
+		clear_bit(bit, &t->unwind_info.unwind_mask);
 	}
 }
 
@@ -269,7 +286,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 	work->bit = ffz(unwind_mask);
 	__set_bit(work->bit, &unwind_mask);
 
-	list_add(&work->list, &callbacks);
+	list_add_rcu(&work->list, &callbacks);
 	work->func = func;
 	return 0;
 }
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v11 10/14] unwind: Clear unwind_mask on exit back to user space
  2025-06-25 22:56 [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (8 preceding siblings ...)
  2025-06-25 22:56 ` [PATCH v11 09/14] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
@ 2025-06-25 22:56 ` Steven Rostedt
  2025-06-25 22:56 ` [PATCH v11 11/14] unwind: Finish up unwind when a task exits Steven Rostedt
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Steven Rostedt @ 2025-06-25 22:56 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
	Jens Axboe

From: Steven Rostedt <rostedt@goodmis.org>

When testing the deferred unwinder by attaching deferred user space
stacktraces to events, a livelock happened. This occurred when the
deferred unwinding was added to the irqs_disabled event, which triggers
after the task_work callbacks are called and before the task goes back
to user space.

The event callback would be registered when irqs were disabled, the
task_work would trigger, call the callback for this work and clear the
work's bit. Then before getting back to user space, irqs would be disabled
again, the event triggered again, and a new task_work registered. This
caused an infinite loop and the system hung.

To prevent this, clear the bits at the very last moment before going back
to user space and when instrumentation is disabled. That is in
unwind_exit_to_user_mode().

Move the pending bit from a value on the task_struct to the most
significant bit of the unwind_mask (saves space on the task_struct). This
will allow modifying the pending bit along with the work bits atomically.

Instead of clearing a work's bit after its callback is called, it is
delayed until exit. If the work is requested again, the task_work is not
queued again and the caller is notified that the callback has already been
called for this entry (via the UNWIND_ALREADY_EXECUTED return value).

The pending bit is cleared before calling the callback functions but the
current work bits remain. If one of the called works registers again, it
will not trigger a task_work if its bit is still present in the task's
unwind_mask.

If a new work registers, then it will set both the pending bit and its own
bit but clear the other work bits so that their callbacks do not get
called again.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/unwind_deferred.h       | 26 +++++++--
 include/linux/unwind_deferred_types.h |  1 -
 kernel/unwind/deferred.c              | 76 ++++++++++++++++++---------
 3 files changed, 74 insertions(+), 29 deletions(-)

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 00656e903375..97983a13ebde 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -18,6 +18,14 @@ struct unwind_work {
 
 #ifdef CONFIG_UNWIND_USER
 
+#define UNWIND_PENDING_BIT	(BITS_PER_LONG - 1)
+#define UNWIND_PENDING		BIT(UNWIND_PENDING_BIT)
+
+enum {
+	UNWIND_ALREADY_PENDING	= 1,
+	UNWIND_ALREADY_EXECUTED	= 2,
+};
+
 void unwind_task_init(struct task_struct *task);
 void unwind_task_free(struct task_struct *task);
 
@@ -29,12 +37,22 @@ void unwind_deferred_cancel(struct unwind_work *work);
 
 static __always_inline void unwind_reset_info(void)
 {
-	/* Exit out early if this was never used */
-	if (likely(!local64_read(&current->unwind_info.timestamp)))
+	struct unwind_task_info *info = &current->unwind_info;
+	unsigned long bits;
+
+	/* Was there any unwinding? */
+	if (likely(!info->unwind_mask))
 		return;
 
-	if (current->unwind_info.cache)
-		current->unwind_info.cache->nr_entries = 0;
+	bits = info->unwind_mask;
+	do {
+		/* Is a task_work going to run again before going back */
+		if (bits & UNWIND_PENDING)
+			return;
+	} while (!try_cmpxchg(&info->unwind_mask, &bits, 0UL));
+
+	if (likely(info->cache))
+		info->cache->nr_entries = 0;
 	local64_set(&current->unwind_info.timestamp, 0);
 }
 
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 5863bf4eb436..4308367f1887 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -15,7 +15,6 @@ struct unwind_task_info {
 	struct callback_head	work;
 	unsigned long		unwind_mask;
 	local64_t		timestamp;
-	local_t			pending;
 };
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 7309c9e0e57a..e7e4442926d3 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -51,6 +51,11 @@ static LIST_HEAD(callbacks);
 static unsigned long unwind_mask;
 DEFINE_STATIC_SRCU(unwind_srcu);
 
+static inline bool unwind_pending(struct unwind_task_info *info)
+{
+	return test_bit(UNWIND_PENDING_BIT, &info->unwind_mask);
+}
+
 /*
  * Read the task context timestamp, if this is the first caller then
  * it will set the timestamp.
@@ -134,14 +139,17 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
 	struct unwind_stacktrace trace;
 	struct unwind_work *work;
+	unsigned long bits;
 	u64 timestamp;
 	int idx;
 
-	if (WARN_ON_ONCE(!local_read(&info->pending)))
+	if (WARN_ON_ONCE(!unwind_pending(info)))
 		return;
 
-	/* Allow work to come in again */
-	local_set(&info->pending, 0);
+	/* Clear pending bit but make sure to have the current bits */
+	bits = READ_ONCE(info->unwind_mask);
+	while (!try_cmpxchg(&info->unwind_mask, &bits, bits & ~UNWIND_PENDING))
+		;
 
 	/*
 	 * From here on out, the callback must always be called, even if it's
@@ -157,10 +165,8 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	idx = srcu_read_lock(&unwind_srcu);
 	list_for_each_entry_srcu(work, &callbacks, list,
 				 srcu_read_lock_held(&unwind_srcu)) {
-		if (test_bit(work->bit, &info->unwind_mask)) {
+		if (test_bit(work->bit, &bits))
 			work->func(work, &trace, timestamp);
-			clear_bit(work->bit, &info->unwind_mask);
-		}
 	}
 	srcu_read_unlock(&unwind_srcu, idx);
 }
@@ -188,15 +194,17 @@ static void unwind_deferred_task_work(struct callback_head *head)
  * it has already been previously called for the same entry context, it will be
  * called again with the same stack trace and timestamp.
  *
- * Return: 1 if the the callback was already queued.
- *         0 if the callback successfully was queued.
+ * Return: 0 if the callback successfully was queued.
+ *         UNWIND_ALREADY_PENDING if the callback was already queued.
+ *         UNWIND_ALREADY_EXECUTED if the callback was already called
+ *                (and will not be called again)
  *         Negative if there's an error.
  *         @timestamp holds the timestamp of the first request by any user
  */
 int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 {
 	struct unwind_task_info *info = &current->unwind_info;
-	long pending;
+	unsigned long old, bits;
 	int bit;
 	int ret;
 
@@ -219,32 +227,52 @@ int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
 
 	*timestamp = get_timestamp(info);
 
-	/* This is already queued */
-	if (test_bit(bit, &info->unwind_mask))
-		return 1;
+	old = READ_ONCE(info->unwind_mask);
+
+	/* Is this already queued */
+	if (test_bit(bit, &old)) {
+		/*
+		 * If pending is not set, it means this work's callback
+		 * was already called.
+		 */
+		return old & UNWIND_PENDING ? UNWIND_ALREADY_PENDING :
+			UNWIND_ALREADY_EXECUTED;
+	}
 
-	/* callback already pending? */
-	pending = local_read(&info->pending);
-	if (pending)
+	if (unwind_pending(info))
 		goto out;
 
+	/*
+	 * This is the first request since the task entered the kernel,
+	 * or since the previous callbacks were called. Set PENDING and
+	 * only the bit for this work; clear all other bits, as their
+	 * callbacks have already been called and do not need to be
+	 * called again because of this work.
+	 */
+	bits = UNWIND_PENDING | BIT(bit);
+
+	/*
+	 * If the cmpxchg() fails, it means that an NMI came in and set
+	 * the pending bit as well as cleared the other bits. Just
+	 * jump to setting the bit for this work.
+	 */
 	if (CAN_USE_IN_NMI) {
-		/* Claim the work unless an NMI just now swooped in to do so. */
-		if (!local_try_cmpxchg(&info->pending, &pending, 1))
+		if (!try_cmpxchg(&info->unwind_mask, &old, bits))
 			goto out;
 	} else {
-		local_set(&info->pending, 1);
+		info->unwind_mask = bits;
 	}
 
 	/* The work has been claimed, now schedule it. */
 	ret = task_work_add(current, &info->work, TWA_RESUME);
-	if (WARN_ON_ONCE(ret)) {
-		local_set(&info->pending, 0);
-		return ret;
-	}
 
+	if (WARN_ON_ONCE(ret))
+		WRITE_ONCE(info->unwind_mask, 0);
+
+	return ret;
  out:
-	return test_and_set_bit(bit, &info->unwind_mask);
+	return test_and_set_bit(bit, &info->unwind_mask) ?
+		UNWIND_ALREADY_PENDING : 0;
 }
 
 void unwind_deferred_cancel(struct unwind_work *work)
@@ -280,7 +308,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 	guard(mutex)(&callback_mutex);
 
 	/* See if there's a bit in the mask available */
-	if (unwind_mask == ~0UL)
+	if (unwind_mask == ~(UNWIND_PENDING))
 		return -EBUSY;
 
 	work->bit = ffz(unwind_mask);
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v11 11/14] unwind: Finish up unwind when a task exits
  2025-06-25 22:56 [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (9 preceding siblings ...)
  2025-06-25 22:56 ` [PATCH v11 10/14] unwind: Clear unwind_mask on exit back to user space Steven Rostedt
@ 2025-06-25 22:56 ` Steven Rostedt
  2025-06-25 22:56 ` [PATCH v11 12/14] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Steven Rostedt @ 2025-06-25 22:56 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
	Jens Axboe

From: Steven Rostedt <rostedt@goodmis.org>

In do_exit(), when a task is exiting, if an unwind was requested and the
user stacktrace was deferred via task_work, the task_work callback runs
after exit_mm() has been called. By then the user stack can no longer be
read, so the trace comes back empty.

Instead, add a function unwind_deferred_task_exit() and call it just
before exit_mm() so that the unwinder can call the requested callbacks
with the user space stack.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/unwind_deferred.h |  3 +++
 kernel/exit.c                   |  2 ++
 kernel/unwind/deferred.c        | 23 ++++++++++++++++++++---
 3 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 97983a13ebde..5cfd09a8708f 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -35,6 +35,8 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
 int unwind_deferred_request(struct unwind_work *work, u64 *timestamp);
 void unwind_deferred_cancel(struct unwind_work *work);
 
+void unwind_deferred_task_exit(struct task_struct *task);
+
 static __always_inline void unwind_reset_info(void)
 {
 	struct unwind_task_info *info = &current->unwind_info;
@@ -66,6 +68,7 @@ static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback
 static inline int unwind_deferred_request(struct unwind_work *work, u64 *timestamp) { return -ENOSYS; }
 static inline void unwind_deferred_cancel(struct unwind_work *work) {}
 
+static inline void unwind_deferred_task_exit(struct task_struct *task) {}
 static inline void unwind_reset_info(void) {}
 
 #endif /* !CONFIG_UNWIND_USER */
diff --git a/kernel/exit.c b/kernel/exit.c
index bd743900354c..6599f9518436 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -68,6 +68,7 @@
 #include <linux/rethook.h>
 #include <linux/sysfs.h>
 #include <linux/user_events.h>
+#include <linux/unwind_deferred.h>
 #include <linux/uaccess.h>
 #include <linux/pidfs.h>
 
@@ -938,6 +939,7 @@ void __noreturn do_exit(long code)
 
 	tsk->exit_code = code;
 	taskstats_exit(tsk, group_dead);
+	unwind_deferred_task_exit(tsk);
 	trace_sched_process_exit(tsk, group_dead);
 
 	exit_mm();
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index e7e4442926d3..c783d273a2dc 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -104,7 +104,7 @@ int unwind_user_faultable(struct unwind_stacktrace *trace)
 	/* Should always be called from faultable context */
 	might_fault();
 
-	if (current->flags & PF_EXITING)
+	if (!current->mm)
 		return -EINVAL;
 
 	if (!info->cache) {
@@ -134,9 +134,9 @@ int unwind_user_faultable(struct unwind_stacktrace *trace)
 	return 0;
 }
 
-static void unwind_deferred_task_work(struct callback_head *head)
+static void process_unwind_deferred(struct task_struct *task)
 {
-	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
+	struct unwind_task_info *info = &task->unwind_info;
 	struct unwind_stacktrace trace;
 	struct unwind_work *work;
 	unsigned long bits;
@@ -171,6 +171,23 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	srcu_read_unlock(&unwind_srcu, idx);
 }
 
+static void unwind_deferred_task_work(struct callback_head *head)
+{
+	process_unwind_deferred(current);
+}
+
+void unwind_deferred_task_exit(struct task_struct *task)
+{
+	struct unwind_task_info *info = &task->unwind_info;
+
+	if (!unwind_pending(info))
+		return;
+
+	process_unwind_deferred(task);
+
+	task_work_cancel(task, &info->work);
+}
+
 /**
  * unwind_deferred_request - Request a user stacktrace on task exit
  * @work: Unwind descriptor requesting the trace
-- 
2.47.2




* [PATCH v11 12/14] unwind_user/x86: Enable frame pointer unwinding on x86
  2025-06-25 22:56 [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (10 preceding siblings ...)
  2025-06-25 22:56 ` [PATCH v11 11/14] unwind: Finish up unwind when a task exits Steven Rostedt
@ 2025-06-25 22:56 ` Steven Rostedt
  2025-06-25 22:56 ` [PATCH v11 13/14] perf/x86: Rename and move get_segment_base() and make it global Steven Rostedt
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 33+ messages in thread
From: Steven Rostedt @ 2025-06-25 22:56 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
	Jens Axboe

From: Josh Poimboeuf <jpoimboe@kernel.org>

Use ARCH_INIT_USER_FP_FRAME to describe how frame pointers are unwound
on x86, and enable CONFIG_HAVE_UNWIND_USER_FP accordingly so the
unwind_user interfaces can be used.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/include/asm/unwind_user.h | 11 +++++++++++
 2 files changed, 12 insertions(+)
 create mode 100644 arch/x86/include/asm/unwind_user.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 340e5468980e..2cdb5cf91541 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -302,6 +302,7 @@ config X86
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UACCESS_VALIDATION		if HAVE_OBJTOOL
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select HAVE_UNWIND_USER_FP		if X86_64
 	select HAVE_USER_RETURN_NOTIFIER
 	select HAVE_GENERIC_VDSO
 	select VDSO_GETRANDOM			if X86_64
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
new file mode 100644
index 000000000000..8597857bf896
--- /dev/null
+++ b/arch/x86/include/asm/unwind_user.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_UNWIND_USER_H
+#define _ASM_X86_UNWIND_USER_H
+
+#define ARCH_INIT_USER_FP_FRAME							\
+	.cfa_off	= (s32)sizeof(long) *  2,				\
+	.ra_off		= (s32)sizeof(long) * -1,				\
+	.fp_off		= (s32)sizeof(long) * -2,				\
+	.use_fp		= true,
+
+#endif /* _ASM_X86_UNWIND_USER_H */
-- 
2.47.2




* [PATCH v11 13/14] perf/x86: Rename and move get_segment_base() and make it global
  2025-06-25 22:56 [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (11 preceding siblings ...)
  2025-06-25 22:56 ` [PATCH v11 12/14] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
@ 2025-06-25 22:56 ` Steven Rostedt
  2025-06-26  8:44   ` Ingo Molnar
  2025-06-26 13:07   ` Peter Zijlstra
  2025-06-25 22:56 ` [PATCH v11 14/14] unwind_user/x86: Enable compat mode frame pointer unwinding on x86 Steven Rostedt
  2025-06-30 12:50 ` [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Florian Weimer
  14 siblings, 2 replies; 33+ messages in thread
From: Steven Rostedt @ 2025-06-25 22:56 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
	Jens Axboe

From: Josh Poimboeuf <jpoimboe@kernel.org>

get_segment_base() will be used by the unwind_user code, so make it
global and rename it to segment_base_address() so it doesn't conflict with
a KVM function of the same name.

As the function is no longer specific to perf, move it to ptrace.c as that
seems to be a better location for a generic function like this.

Also add a lockdep_assert_irqs_disabled() to make sure it's always called
with interrupts disabled.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 arch/x86/events/core.c        | 44 ++++-------------------------------
 arch/x86/include/asm/ptrace.h |  2 ++
 arch/x86/kernel/ptrace.c      | 38 ++++++++++++++++++++++++++++++
 3 files changed, 45 insertions(+), 39 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 7610f26dfbd9..2f2ec84f2a14 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -43,6 +43,7 @@
 #include <asm/ldt.h>
 #include <asm/unwind.h>
 #include <asm/uprobes.h>
+#include <asm/ptrace.h>
 #include <asm/ibt.h>
 
 #include "perf_event.h"
@@ -2808,41 +2809,6 @@ valid_user_frame(const void __user *fp, unsigned long size)
 	return __access_ok(fp, size);
 }
 
-static unsigned long get_segment_base(unsigned int segment)
-{
-	struct desc_struct *desc;
-	unsigned int idx = segment >> 3;
-
-	if ((segment & SEGMENT_TI_MASK) == SEGMENT_LDT) {
-#ifdef CONFIG_MODIFY_LDT_SYSCALL
-		struct ldt_struct *ldt;
-
-		/*
-		 * If we're not in a valid context with a real (not just lazy)
-		 * user mm, then don't even try.
-		 */
-		if (!nmi_uaccess_okay())
-			return 0;
-
-		/* IRQs are off, so this synchronizes with smp_store_release */
-		ldt = smp_load_acquire(&current->mm->context.ldt);
-		if (!ldt || idx >= ldt->nr_entries)
-			return 0;
-
-		desc = &ldt->entries[idx];
-#else
-		return 0;
-#endif
-	} else {
-		if (idx >= GDT_ENTRIES)
-			return 0;
-
-		desc = raw_cpu_ptr(gdt_page.gdt) + idx;
-	}
-
-	return get_desc_base(desc);
-}
-
 #ifdef CONFIG_UPROBES
 /*
  * Heuristic-based check if uprobe is installed at the function entry.
@@ -2899,8 +2865,8 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
 	if (user_64bit_mode(regs))
 		return 0;
 
-	cs_base = get_segment_base(regs->cs);
-	ss_base = get_segment_base(regs->ss);
+	cs_base = segment_base_address(regs->cs);
+	ss_base = segment_base_address(regs->ss);
 
 	fp = compat_ptr(ss_base + regs->bp);
 	pagefault_disable();
@@ -3019,11 +2985,11 @@ static unsigned long code_segment_base(struct pt_regs *regs)
 		return 0x10 * regs->cs;
 
 	if (user_mode(regs) && regs->cs != __USER_CS)
-		return get_segment_base(regs->cs);
+		return segment_base_address(regs->cs);
 #else
 	if (user_mode(regs) && !user_64bit_mode(regs) &&
 	    regs->cs != __USER32_CS)
-		return get_segment_base(regs->cs);
+		return segment_base_address(regs->cs);
 #endif
 	return 0;
 }
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 50f75467f73d..59357ec98e52 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -314,6 +314,8 @@ static __always_inline bool regs_irqs_disabled(struct pt_regs *regs)
 	return !(regs->flags & X86_EFLAGS_IF);
 }
 
+unsigned long segment_base_address(unsigned int segment);
+
 /* Query offset/name of register from its name/offset */
 extern int regs_query_register_offset(const char *name);
 extern const char *regs_query_register_name(unsigned int offset);
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 095f04bdabdc..81353a09701b 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -41,6 +41,7 @@
 #include <asm/syscall.h>
 #include <asm/fsgsbase.h>
 #include <asm/io_bitmap.h>
+#include <asm/mmu_context.h>
 
 #include "tls.h"
 
@@ -339,6 +340,43 @@ static int set_segment_reg(struct task_struct *task,
 
 #endif	/* CONFIG_X86_32 */
 
+unsigned long segment_base_address(unsigned int segment)
+{
+	struct desc_struct *desc;
+	unsigned int idx = segment >> 3;
+
+	lockdep_assert_irqs_disabled();
+
+	if ((segment & SEGMENT_TI_MASK) == SEGMENT_LDT) {
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+		struct ldt_struct *ldt;
+
+		/*
+		 * If we're not in a valid context with a real (not just lazy)
+		 * user mm, then don't even try.
+		 */
+		if (!nmi_uaccess_okay())
+			return 0;
+
+		/* IRQs are off, so this synchronizes with smp_store_release */
+		ldt = smp_load_acquire(&current->mm->context.ldt);
+		if (!ldt || idx >= ldt->nr_entries)
+			return 0;
+
+		desc = &ldt->entries[idx];
+#else
+		return 0;
+#endif
+	} else {
+		if (idx >= GDT_ENTRIES)
+			return 0;
+
+		desc = raw_cpu_ptr(gdt_page.gdt) + idx;
+	}
+
+	return get_desc_base(desc);
+}
+
 static unsigned long get_flags(struct task_struct *task)
 {
 	unsigned long retval = task_pt_regs(task)->flags;
-- 
2.47.2




* [PATCH v11 14/14] unwind_user/x86: Enable compat mode frame pointer unwinding on x86
  2025-06-25 22:56 [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (12 preceding siblings ...)
  2025-06-25 22:56 ` [PATCH v11 13/14] perf/x86: Rename and move get_segment_base() and make it global Steven Rostedt
@ 2025-06-25 22:56 ` Steven Rostedt
  2025-06-26  8:33   ` Ingo Molnar
  2025-06-30 12:50 ` [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Florian Weimer
  14 siblings, 1 reply; 33+ messages in thread
From: Steven Rostedt @ 2025-06-25 22:56 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
	Jens Axboe

From: Josh Poimboeuf <jpoimboe@kernel.org>

Use ARCH_INIT_USER_COMPAT_FP_FRAME to describe how frame pointers are
unwound on x86, and implement the hooks needed to add the segment base
addresses.  Enable HAVE_UNWIND_USER_COMPAT_FP if the system has compat
mode compiled in.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v10: https://lore.kernel.org/20250611010430.123232579@goodmis.org

- Moved calling the functions arch_unwind_user_init() and
  arch_unwind_user_next() to this commit as it's the first commit to use
  them. (Peter Zijlstra)

 arch/x86/Kconfig                         |  1 +
 arch/x86/include/asm/unwind_user.h       | 49 ++++++++++++++++++++++++
 arch/x86/include/asm/unwind_user_types.h | 17 ++++++++
 include/linux/unwind_user.h              | 20 ++++++++++
 kernel/unwind/user.c                     |  4 ++
 5 files changed, 91 insertions(+)
 create mode 100644 arch/x86/include/asm/unwind_user_types.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2cdb5cf91541..3f7bdc9e3cec 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -302,6 +302,7 @@ config X86
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UACCESS_VALIDATION		if HAVE_OBJTOOL
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select HAVE_UNWIND_USER_COMPAT_FP	if IA32_EMULATION
 	select HAVE_UNWIND_USER_FP		if X86_64
 	select HAVE_USER_RETURN_NOTIFIER
 	select HAVE_GENERIC_VDSO
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
index 8597857bf896..43f8554c1d70 100644
--- a/arch/x86/include/asm/unwind_user.h
+++ b/arch/x86/include/asm/unwind_user.h
@@ -2,10 +2,59 @@
 #ifndef _ASM_X86_UNWIND_USER_H
 #define _ASM_X86_UNWIND_USER_H
 
+#include <linux/unwind_user_types.h>
+#include <asm/ptrace.h>
+
 #define ARCH_INIT_USER_FP_FRAME							\
 	.cfa_off	= (s32)sizeof(long) *  2,				\
 	.ra_off		= (s32)sizeof(long) * -1,				\
 	.fp_off		= (s32)sizeof(long) * -2,				\
 	.use_fp		= true,
 
+#ifdef CONFIG_IA32_EMULATION
+
+#define ARCH_INIT_USER_COMPAT_FP_FRAME						\
+	.cfa_off	= (s32)sizeof(u32)  *  2,				\
+	.ra_off		= (s32)sizeof(u32)  * -1,				\
+	.fp_off		= (s32)sizeof(u32)  * -2,				\
+	.use_fp		= true,
+
+#define in_compat_mode(regs) !user_64bit_mode(regs)
+
+static inline void arch_unwind_user_init(struct unwind_user_state *state,
+					 struct pt_regs *regs)
+{
+	unsigned long cs_base, ss_base;
+
+	if (state->type != UNWIND_USER_TYPE_COMPAT_FP)
+		return;
+
+	scoped_guard(irqsave) {
+		cs_base = segment_base_address(regs->cs);
+		ss_base = segment_base_address(regs->ss);
+	}
+
+	state->arch.cs_base = cs_base;
+	state->arch.ss_base = ss_base;
+
+	state->ip += cs_base;
+	state->sp += ss_base;
+	state->fp += ss_base;
+}
+#define arch_unwind_user_init arch_unwind_user_init
+
+static inline void arch_unwind_user_next(struct unwind_user_state *state)
+{
+	if (state->type != UNWIND_USER_TYPE_COMPAT_FP)
+		return;
+
+	state->ip += state->arch.cs_base;
+	state->fp += state->arch.ss_base;
+}
+#define arch_unwind_user_next arch_unwind_user_next
+
+#endif /* CONFIG_IA32_EMULATION */
+
+#include <asm-generic/unwind_user.h>
+
 #endif /* _ASM_X86_UNWIND_USER_H */
diff --git a/arch/x86/include/asm/unwind_user_types.h b/arch/x86/include/asm/unwind_user_types.h
new file mode 100644
index 000000000000..d7074dc5f0ce
--- /dev/null
+++ b/arch/x86/include/asm/unwind_user_types.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_UNWIND_USER_TYPES_H
+#define _ASM_UNWIND_USER_TYPES_H
+
+#ifdef CONFIG_IA32_EMULATION
+
+struct arch_unwind_user_state {
+	unsigned long ss_base;
+	unsigned long cs_base;
+};
+#define arch_unwind_user_state arch_unwind_user_state
+
+#endif /* CONFIG_IA32_EMULATION */
+
+#include <asm-generic/unwind_user_types.h>
+
+#endif /* _ASM_UNWIND_USER_TYPES_H */
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index ac007363820a..46f995cda606 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -14,6 +14,26 @@
  #define in_compat_mode(regs) false
 #endif
 
+/*
+ * If an architecture needs to initialize the state for a specific
+ * reason, for example, it may need to do something different
+ * in compat mode, it can define arch_unwind_user_init to a
+ * function that will perform this initialization.
+ */
+#ifndef arch_unwind_user_init
+static inline void arch_unwind_user_init(struct unwind_user_state *state, struct pt_regs *reg) {}
+#endif
+
+/*
+ * If an architecture requires some more updates to the state between
+ * stack frames, it can define arch_unwind_user_next to a function
+ * that will update the state between reading stack frames during
+ * the user space stack walk.
+ */
+#ifndef arch_unwind_user_next
+static inline void arch_unwind_user_next(struct unwind_user_state *state) {}
+#endif
+
 int unwind_user_start(struct unwind_user_state *state);
 int unwind_user_next(struct unwind_user_state *state);
 
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 3a0ac4346f5b..2bb7995c3f23 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -72,6 +72,8 @@ int unwind_user_next(struct unwind_user_state *state)
 	if (frame->fp_off)
 		state->fp = fp;
 
+	arch_unwind_user_next(state);
+
 	return 0;
 
 done:
@@ -101,6 +103,8 @@ int unwind_user_start(struct unwind_user_state *state)
 	state->sp = user_stack_pointer(regs);
 	state->fp = frame_pointer(regs);
 
+	arch_unwind_user_init(state, regs);
+
 	return 0;
 }
 
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v11 14/14] unwind_user/x86: Enable compat mode frame pointer unwinding on x86
  2025-06-25 22:56 ` [PATCH v11 14/14] unwind_user/x86: Enable compat mode frame pointer unwinding on x86 Steven Rostedt
@ 2025-06-26  8:33   ` Ingo Molnar
  2025-06-26 12:12     ` Steven Rostedt
  0 siblings, 1 reply; 33+ messages in thread
From: Ingo Molnar @ 2025-06-26  8:33 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Jiri Olsa,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
	Andrew Morton, Jens Axboe


* Steven Rostedt <rostedt@goodmis.org> wrote:

> diff --git a/arch/x86/include/asm/unwind_user_types.h b/arch/x86/include/asm/unwind_user_types.h
> new file mode 100644
> index 000000000000..d7074dc5f0ce
> --- /dev/null
> +++ b/arch/x86/include/asm/unwind_user_types.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_UNWIND_USER_TYPES_H
> +#define _ASM_UNWIND_USER_TYPES_H

This is not the standard x86 header guard pattern ...

> +
> +#ifdef CONFIG_IA32_EMULATION
> +
> +struct arch_unwind_user_state {
> +	unsigned long ss_base;
> +	unsigned long cs_base;
> +};
> +#define arch_unwind_user_state arch_unwind_user_state

Ran out of newlines? ;-)

> +/*
> + * If an architecture needs to initialize the state for a specific
> + * reason, for example, it may need to do something different
> + * in compat mode, it can define arch_unwind_user_init to a
> + * function that will perform this initialization.

Please use 'func()' when referring to functions in comments.

> +/*
> + * If an architecture requires some more updates to the state between
> + * stack frames, it can define arch_unwind_user_next to a function
> + * that will update the state between reading stack frames during
> + * the user space stack walk.

Ditto.

Thanks,

	Ingo


* Re: [PATCH v11 13/14] perf/x86: Rename and move get_segment_base() and make it global
  2025-06-25 22:56 ` [PATCH v11 13/14] perf/x86: Rename and move get_segment_base() and make it global Steven Rostedt
@ 2025-06-26  8:44   ` Ingo Molnar
  2025-06-26 12:13     ` Steven Rostedt
  2025-06-26 13:07   ` Peter Zijlstra
  1 sibling, 1 reply; 33+ messages in thread
From: Ingo Molnar @ 2025-06-26  8:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Jiri Olsa,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
	Andrew Morton, Jens Axboe


* Steven Rostedt <rostedt@goodmis.org> wrote:

> From: Josh Poimboeuf <jpoimboe@kernel.org>
> 
> get_segment_base() will be used by the unwind_user code, so make it
> global and rename it to segment_base_address() so it doesn't conflict with
> a KVM function of the same name.

So if you make an x86-internal helper function global, please prefix it 
with x86_ or so:

	unsigned long x86_get_segment_base(unsigned int segment)

Keeping the _get name also keeps it within the nomenclature of the 
general segment descriptor API family:

	get_desc_base()
	set_desc_base()
	get_desc_limit()
	set_desc_limit()
  [x86_]get_segment_base()

> Also add a lockdep_assert_irqs_disabled() to make sure it's always 
> called with interrupts disabled.

Please make this a separate patch, this change gets hidden in the noise 
of the function movement and renaming otherwise, plus it also makes the 
title false and misleading:

   perf/x86: Rename and move get_segment_base() and make it global

Thanks,

	Ingo


* Re: [PATCH v11 14/14] unwind_user/x86: Enable compat mode frame pointer unwinding on x86
  2025-06-26  8:33   ` Ingo Molnar
@ 2025-06-26 12:12     ` Steven Rostedt
  2025-06-27 14:01       ` Steven Rostedt
  0 siblings, 1 reply; 33+ messages in thread
From: Steven Rostedt @ 2025-06-26 12:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Jiri Olsa,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
	Andrew Morton, Jens Axboe

On Thu, 26 Jun 2025 10:33:05 +0200
Ingo Molnar <mingo@kernel.org> wrote:

> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > diff --git a/arch/x86/include/asm/unwind_user_types.h b/arch/x86/include/asm/unwind_user_types.h
> > new file mode 100644
> > index 000000000000..d7074dc5f0ce
> > --- /dev/null
> > +++ b/arch/x86/include/asm/unwind_user_types.h
> > @@ -0,0 +1,17 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ASM_UNWIND_USER_TYPES_H
> > +#define _ASM_UNWIND_USER_TYPES_H  
> 
> This is not the standard x86 header guard pattern ...

Should it be:

#ifndef _ASM_X86_UNWIND_USER_TYPES_H

?

> 
> > +
> > +#ifdef CONFIG_IA32_EMULATION
> > +
> > +struct arch_unwind_user_state {
> > +	unsigned long ss_base;
> > +	unsigned long cs_base;
> > +};
> > +#define arch_unwind_user_state arch_unwind_user_state  
> 
> Ran out of newlines? ;-)

I believe Josh purposely kept the #define and the structure together
without a newline as one defines itself to be used in the generic code.

Do you prefer them to be separated by a newline?


> 
> > +/*
> > + * If an architecture needs to initialize the state for a specific
> > + * reason, for example, it may need to do something different
> > + * in compat mode, it can define arch_unwind_user_init to a
> > + * function that will perform this initialization.  
> 
> Please use 'func()' when referring to functions in comments.

You mean to use "arch_unwind_user_init()"?

> 
> > +/*
> > + * If an architecture requires some more updates to the state between
> > + * stack frames, it can define arch_unwind_user_next to a function
> > + * that will update the state between reading stack frames during
> > + * the user space stack walk.  
> 
> Ditto.

And this to have arch_unwind_user_next()?

I'll update.

Thanks for the review.

-- Steve


* Re: [PATCH v11 13/14] perf/x86: Rename and move get_segment_base() and make it global
  2025-06-26  8:44   ` Ingo Molnar
@ 2025-06-26 12:13     ` Steven Rostedt
  0 siblings, 0 replies; 33+ messages in thread
From: Steven Rostedt @ 2025-06-26 12:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Jiri Olsa,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
	Andrew Morton, Jens Axboe

On Thu, 26 Jun 2025 10:44:40 +0200
Ingo Molnar <mingo@kernel.org> wrote:

> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > From: Josh Poimboeuf <jpoimboe@kernel.org>
> > 
> > get_segment_base() will be used by the unwind_user code, so make it
> > global and rename it to segment_base_address() so it doesn't conflict with
> > a KVM function of the same name.  
> 
> So if you make an x86-internal helper function global, please prefix it 
> with x86_ or so:
> 
> 	unsigned long x86_get_segment_base(unsigned int segment)
> 
> Keeping the _get name also keeps it within the nomenclature of the 
> general segment descriptor API family:
> 
> 	get_desc_base()
> 	set_desc_base()
> 	get_desc_limit()
> 	set_desc_limit()
>   [x86_]get_segment_base()

Sounds good.

> 
> > Also add a lockdep_assert_irqs_disabled() to make sure it's always 
> > called with interrupts disabled.  
> 
> Please make this a separate patch, this change gets hidden in the noise 
> of the function movement and renaming otherwise, plus it also makes the 
> title false and misleading:
> 
>    perf/x86: Rename and move get_segment_base() and make it global
> 

I'll break it up.

Thanks for the review!

-- Steve


* Re: [PATCH v11 13/14] perf/x86: Rename and move get_segment_base() and make it global
  2025-06-25 22:56 ` [PATCH v11 13/14] perf/x86: Rename and move get_segment_base() and make it global Steven Rostedt
  2025-06-26  8:44   ` Ingo Molnar
@ 2025-06-26 13:07   ` Peter Zijlstra
  2025-06-26 15:13     ` Steven Rostedt
  1 sibling, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2025-06-26 13:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
	Andrew Morton, Jens Axboe

On Wed, Jun 25, 2025 at 06:56:13PM -0400, Steven Rostedt wrote:
> From: Josh Poimboeuf <jpoimboe@kernel.org>
> 
> get_segment_base() will be used by the unwind_user code, so make it
> global and rename it to segment_base_address() so it doesn't conflict with
> a KVM function of the same name.
> 
> As the function is no longer specific to perf, move it to ptrace.c as that
> seems to be a better location for a generic function like this.
> 
> Also add a lockdep_assert_irqs_disabled() to make sure it's always called
> with interrupts disabled.

FWIW, I recently found we have a second 'copy' of all this in
insn_get_seg_base() / get_desc().

It's all subtly different, but largely the same.




* Re: [PATCH v11 13/14] perf/x86: Rename and move get_segment_base() and make it global
  2025-06-26 13:07   ` Peter Zijlstra
@ 2025-06-26 15:13     ` Steven Rostedt
  0 siblings, 0 replies; 33+ messages in thread
From: Steven Rostedt @ 2025-06-26 15:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Ingo Molnar, Jiri Olsa,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
	Andrew Morton, Jens Axboe

On Thu, 26 Jun 2025 15:07:05 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> FWIW, I recently found we have a second 'copy' of all this in
> insn_get_seg_base() / get_desc().
> 
> It's all subtly different, but largely the same.

Should I just use that then?

Instead of:

		cs_base = segment_base_address(regs->cs);
		ss_base = segment_base_address(regs->ss);

Use:

		cs_base = insn_get_seg_base(regs, INAT_SEG_REG_CS);
		ss_base = insn_get_seg_base(regs, INAT_SEG_REG_SS);

As it is used in a few places in the x86 code already, I could then get rid
of this patch.

-- Steve


* Re: [PATCH v11 06/14] unwind_user/deferred: Add deferred unwinding interface
  2025-06-25 22:56 ` [PATCH v11 06/14] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
@ 2025-06-26 16:48   ` Steven Rostedt
  2025-06-26 20:34     ` Steven Rostedt
  0 siblings, 1 reply; 33+ messages in thread
From: Steven Rostedt @ 2025-06-26 16:48 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
	Jens Axboe

On Wed, 25 Jun 2025 18:56:06 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

>  static __always_inline void unwind_reset_info(void)
>  {
> -	if (unlikely(current->unwind_info.cache))
> +	/* Exit out early if this was never used */
> +	if (likely(!current->unwind_info.timestamp))
> +		return;

I found that this breaks perf's use of unwind_user_faultable() directly,
not relying on the deferred infrastructure (which it does when it traces
a single task and also needs to remove the separate in_nmi() code),
because this still requires nr_entries to be set to zero.

The clearing of the nr_entries has to be separate from the timestamp. To
prevent unneeded writes after the cache is allocated, should we check that
nr_entries is set before writing zero?

	if (current->unwind_info.cache && current->unwind_info.cache->nr_entries)
  		current->unwind_info.cache->nr_entries = 0;

?

-- Steve

> +
> +	if (current->unwind_info.cache)
>  		current->unwind_info.cache->nr_entries = 0;
> +	current->unwind_info.timestamp = 0;
>  }


* Re: [PATCH v11 06/14] unwind_user/deferred: Add deferred unwinding interface
  2025-06-26 16:48   ` Steven Rostedt
@ 2025-06-26 20:34     ` Steven Rostedt
  0 siblings, 0 replies; 33+ messages in thread
From: Steven Rostedt @ 2025-06-26 20:34 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel, bpf, x86
  Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
	Jens Axboe

On Thu, 26 Jun 2025 12:48:55 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> >  static __always_inline void unwind_reset_info(void)
> >  {
> > -	if (unlikely(current->unwind_info.cache))
> > +	/* Exit out early if this was never used */
> > +	if (likely(!current->unwind_info.timestamp))
> > +		return;  
> 
> I found that this breaks the use of perf using the unwind_user_faultable()
> directly and not relying on the deferred infrastructure (which it does when
> it traces a single task and also needs to remove the separate in_nmi()
> code). Because this still requires the nr_entries to be set to zero.
> 
> The clearing of the nr_entries has to be separate from the timestamp. To
> prevent unneeded writes after the cache is allocated, should we check the
> nr_entries is set before writing zero?
> 
> 	if (current->unwind_info.cache && current->unwind_info.cache->nr_entries)
>   		current->unwind_info.cache->nr_entries = 0;
> 
> ?

I just made this into:

 	if (current->unwind_info.cache)
   		current->unwind_info.cache->nr_entries = 0;

Later patches will add more here, and I added a new patch that adds a
USED bit to info->unwind_mask, which gets set whenever the stack trace is
used and this code needs to be executed. That makes the unwind_mask the
only condition that needs to be checked when it was never used.

-- Steve

From: Steven Rostedt <rostedt@goodmis.org>
Subject: [PATCH] unwind: Add USED bit to only have one conditional on way back
 to user space

On the way back to user space, the function unwind_reset_info() is called
unconditionally (but always inlined). It currently has two conditionals.
One checks the unwind_mask, which is set whenever a deferred trace is
called, and is used to know that the mask needs to be cleared. The other
checks if the cache has been allocated, and if so, it resets the
nr_entries so that the unwinder knows it needs to do the work to get a new
user space stack trace again (it only does it once per entering the
kernel).

Use one of the bits in the unwind mask as a "USED" bit that gets set
whenever a trace is created. This will make it possible to only check the
unwind_mask in unwind_reset_info() to know if it needs to do work or
not and eliminates a conditional that happens every time the task goes
back to user space.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 include/linux/unwind_deferred.h | 14 +++++++-------
 kernel/unwind/deferred.c        |  5 ++++-
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index e7bf133c5a20..4786acc0f087 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -21,6 +21,10 @@ struct unwind_work {
 #define UNWIND_PENDING_BIT	(BITS_PER_LONG - 1)
 #define UNWIND_PENDING		BIT(UNWIND_PENDING_BIT)
 
+/* Set if the unwinding was used (directly or deferred) */
+#define UNWIND_USED_BIT		(UNWIND_PENDING_BIT - 1)
+#define UNWIND_USED		BIT(UNWIND_USED_BIT)
+
 enum {
 	UNWIND_ALREADY_PENDING	= 1,
 	UNWIND_ALREADY_EXECUTED	= 2,
@@ -51,14 +55,10 @@ static __always_inline void unwind_reset_info(void)
 				return;
 		} while (!try_cmpxchg(&info->unwind_mask, &bits, 0UL));
 		local64_set(&current->unwind_info.timestamp, 0);
+
+		if (unlikely(info->cache))
+			info->cache->nr_entries = 0;
 	}
-	/*
-	 * As unwind_user_faultable() can be called directly and
-	 * depends on nr_entries being cleared on exit to user,
-	 * this needs to be a separate conditional.
-	 */
-	if (unlikely(info->cache))
-		info->cache->nr_entries = 0;
 }
 
 #else /* !CONFIG_UNWIND_USER */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index c783d273a2dc..9ec1e74c6469 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -131,6 +131,9 @@ int unwind_user_faultable(struct unwind_stacktrace *trace)
 
 	cache->nr_entries = trace->nr;
 
+	/* Clear nr_entries on way back to user space */
+	set_bit(UNWIND_USED_BIT, &info->unwind_mask);
+
 	return 0;
 }
 
@@ -325,7 +328,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 	guard(mutex)(&callback_mutex);
 
 	/* See if there's a bit in the mask available */
-	if (unwind_mask == ~(UNWIND_PENDING))
+	if (unwind_mask == ~(UNWIND_PENDING|UNWIND_USED))
 		return -EBUSY;
 
 	work->bit = ffz(unwind_mask);
-- 
2.47.2



* Re: [PATCH v11 14/14] unwind_user/x86: Enable compat mode frame pointer unwinding on x86
  2025-06-26 12:12     ` Steven Rostedt
@ 2025-06-27 14:01       ` Steven Rostedt
  2025-06-27 16:23         ` Linus Torvalds
  0 siblings, 1 reply; 33+ messages in thread
From: Steven Rostedt @ 2025-06-27 14:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Jiri Olsa,
	Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat,
	Jose E. Marchesi, Beau Belgrave, Jens Remus, Linus Torvalds,
	Andrew Morton, Jens Axboe

On Thu, 26 Jun 2025 08:12:20 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> 
> 
> >   
> > > +/*
> > > + * If an architecture needs to initialize the state for a specific
> > > + * reason, for example, it may need to do something different
> > > + * in compat mode, it can define arch_unwind_user_init to a
> > > + * function that will perform this initialization.    
> > 
> > Please use 'func()' when referring to functions in comments.  
> 
> You mean to use "arch_unwind_user_init()"?
> 
> >   
> > > +/*
> > > + * If an architecture requires some more updates to the state between
> > > + * stack frames, it can define arch_unwind_user_next to a function
> > > + * that will update the state between reading stack frames during
> > > + * the user space stack walk.    
> > 
> > Ditto.  
> 
> And this to have arch_unwind_user_next()?

I went to update these, then realized that they are not functions. As the
comment says, "it can define arch_unwind_user_next", that means it has to be:

  #define arch_unwind_user_next arch_unwind_user_next

That's not a function. It's just setting a macro named arch_unwind_user_next to
be arch_unwind_user_next. I think adding "()" to the end of that will be
confusing. I could update it to say:

  ... it can define a macro named arch_unwind_user_next with the name of the
  function that will update ...

Would that work?

I may even change the x86 code to be:

#define arch_unwind_user_next x86_unwind_user_next

As the function name doesn't have to be the same as the macro.

-- Steve


* Re: [PATCH v11 14/14] unwind_user/x86: Enable compat mode frame pointer unwinding on x86
  2025-06-27 14:01       ` Steven Rostedt
@ 2025-06-27 16:23         ` Linus Torvalds
  0 siblings, 0 replies; 33+ messages in thread
From: Linus Torvalds @ 2025-06-27 16:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Jiri Olsa, Namhyung Kim, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Andrew Morton, Jens Axboe

On Fri, 27 Jun 2025 at 07:01, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> That's not a function. It's just setting a macro named arch_unwind_user_next to
> be arch_unwind_user_next. I think adding "()" to the end of that will be
> confusing.

Yeah, we use the pattern

   #define abc abc

just to show "I have my own architecture-specific implementation for
this" without having to make up a *new* name for it.

[ We used to have things like "#define __arch_has_abc" instead, which
is just annoying particularly when people didn't even always agree on
the exact prefix. We still do, but we used to too. These days that
"this arch has" pattern is _mostly_ confined to config variables, I
think. ]

Adding parentheses not only makes that much more complicated - now you
need to match argument numbers etc - but can actually end up causing
real issues where you now can't use that 'abc' as a function pointer
any more.

That said, parentheses can also actually help catch mis-uses (ie maybe
you *cannot* use the function as a function pointer, exactly because
some architectures _only_ implement it as a macro), so it's not like
parentheses are necessarily always wrong, but in general, I think that

  #define abc abc

pattern is the simplest and best way for an architecture header file
to say "I have an implementation for this".

And obviously the reason we have to use macros for this is because C
doesn't have a way to test for symbols existing. Other languages do
have things like that (various levels of "reflection"), but in C
you're basically limited to the pre-processor.

             Linus


* Re: [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure
  2025-06-25 22:56 [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
                   ` (13 preceding siblings ...)
  2025-06-25 22:56 ` [PATCH v11 14/14] unwind_user/x86: Enable compat mode frame pointer unwinding on x86 Steven Rostedt
@ 2025-06-30 12:50 ` Florian Weimer
  2025-06-30 16:35   ` Namhyung Kim
                     ` (2 more replies)
  14 siblings, 3 replies; 33+ messages in thread
From: Florian Weimer @ 2025-06-30 12:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
	Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
	Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
	Linus Torvalds, Andrew Morton, Jens Axboe

* Steven Rostedt:

> SFrames is now supported in gcc binutils and soon will also be supported
> by LLVM.

Is the LLVM support discussed here?

  [RFC] Adding SFrame support to llvm
  <https://discourse.llvm.org/t/rfc-adding-sframe-support-to-llvm/86900>

Or is there a second effort?

> I have more patches on top of this series that add perf support, ftrace
> support, sframe support and the x86 fix ups (for VDSO). But each of those
> patch series can be worked on independently, but they all depend on this
> series (although the x86 specific patches at the end isn't necessarily
> needed, at least for other architectures).

Related to perf support: I'm writing up the SFrame change proposal for
Fedora, and I want to include testing instructions.  Any idea yet what a
typical “perf top” or “perf report” command line would look like?

Thanks,
Florian



* Re: [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure
  2025-06-30 12:50 ` [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Florian Weimer
@ 2025-06-30 16:35   ` Namhyung Kim
  2025-06-30 17:30   ` Steven Rostedt
  2025-07-02 11:44   ` Sam James
  2 siblings, 0 replies; 33+ messages in thread
From: Namhyung Kim @ 2025-06-30 16:35 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
	Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Thomas Gleixner,
	Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
	Jens Remus, Linus Torvalds, Andrew Morton, Jens Axboe

Hello,

On Mon, Jun 30, 2025 at 02:50:52PM +0200, Florian Weimer wrote:
> * Steven Rostedt:
> 
> > SFrames is now supported in gcc binutils and soon will also be supported
> > by LLVM.
> 
> Is the LLVM support discussed here?
> 
>   [RFC] Adding SFrame support to llvm
>   <https://discourse.llvm.org/t/rfc-adding-sframe-support-to-llvm/86900>
> 
> Or is there a second effort?
> 
> > I have more patches on top of this series that add perf support, ftrace
> > support, sframe support and the x86 fix ups (for VDSO). But each of those
> > patch series can be worked on independently, but they all depend on this
> > series (although the x86 specific patches at the end isn't necessarily
> > needed, at least for other architectures).
> 
> Related to perf support: I'm writing up the SFrame change proposal for
> Fedora, and I want to include testing instructions.  Any idea yet what a
> typical “perf top” or “perf report” command line would look like?

I think you can run "perf report -s dso,sym -g none" and it will show
"Children" and "Self" overheads.  If the userspace callchain works ok,
non-kernel entries (symbols starting with "[.]") will have more
'Children' overhead than 'Self'.

  $ perf record -g -- perf bench sched messaging
  
  $ perf report -s dso,sym -g none | grep -F -e Children -e '[.]' | head
  # Children      Self  Shared Object           Symbol
      63.09%     0.01%  perf                    [.] run_bench
      63.09%     0.00%  libc.so.6               [.] __libc_start_call_main
      63.09%     0.00%  perf                    [.] cmd_bench
      63.09%     0.00%  perf                    [.] handle_internal_command
      63.09%     0.00%  perf                    [.] main
      63.09%     0.00%  perf                    [.] run_argv
      63.09%     0.00%  perf                    [.] run_builtin
      63.02%     0.00%  perf                    [.] bench_sched_messaging
      62.79%     0.00%  perf                    [.] group

Thanks,
Namhyung



* Re: [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure
  2025-06-30 12:50 ` [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Florian Weimer
  2025-06-30 16:35   ` Namhyung Kim
@ 2025-06-30 17:30   ` Steven Rostedt
  2025-07-02 11:44   ` Sam James
  2 siblings, 0 replies; 33+ messages in thread
From: Steven Rostedt @ 2025-06-30 17:30 UTC (permalink / raw)
  To: Florian Weimer
  Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
	Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
	Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
	Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
	Linus Torvalds, Andrew Morton, Jens Axboe

On Mon, 30 Jun 2025 14:50:52 +0200
Florian Weimer <fweimer@redhat.com> wrote:

> * Steven Rostedt:
> 
> > SFrames is now supported in gcc binutils and soon will also be supported
> > by LLVM.  
> 
> Is the LLVM support discussed here?
> 
>   [RFC] Adding SFrame support to llvm
>   <https://discourse.llvm.org/t/rfc-adding-sframe-support-to-llvm/86900>
> 
> Or is there a second effort?

Not a second effort, but also discussed here:

 https://github.com/llvm/llvm-project/issues/64449

I know internally at Google, it's being worked on. One of the
motivations for getting LLVM to support sframes is to allow live kernel
patching on arm64. Live kernel patching is currently only supported on
x86 because it requires the ORC unwinder, which is Josh's creation (and
works basically the same way as sframes do) to provide reliable stack
traces in the kernel at run time. It's been stated that porting ORC to
other archs would be too much, but since sframes do basically the same
thing, and would support multiple architectures, then it makes sense to
use sframes instead of ORC.

> 
> > I have more patches on top of this series that add perf support, ftrace
> > support, sframe support and the x86 fix ups (for VDSO). But each of those
> > patch series can be worked on independently, but they all depend on this
> > series (although the x86 specific patches at the end isn't necessarily
> > needed, at least for other architectures).  
> 
> Related to perf support: I'm writing up the SFrame change proposal for
> Fedora, and I want to include testing instructions.  Any idea yet what a
> typical “perf top” or “perf report” command line would look like?

I'll be posting updated patches soon and will Cc you. I'll also include
git branches that contain the patches. You'll need the core patches
(what this patch set is), the perf updates and the sframe patches.

The perf patches contain both the kernel side and user space side to
update perf.

Note, to make sure sframes are working properly, I also add
"trace_printk()" into the code and make sure that the sframes code is
being executed and used (it falls back to frame pointers if they fail).

I'll post a patch that includes the trace_printk() that I use as well.
But obviously that wouldn't be something you would add to your
documentation. It's mostly FYI for you.

Hopefully I'll have them done either tonight or tomorrow.

-- Steve


* Re: [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure
  2025-06-30 12:50 ` [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Florian Weimer
  2025-06-30 16:35   ` Namhyung Kim
  2025-06-30 17:30   ` Steven Rostedt
@ 2025-07-02 11:44   ` Sam James
  2025-07-02 16:15     ` Steven Rostedt
  2025-07-02 17:10     ` Namhyung Kim
  2 siblings, 2 replies; 33+ messages in thread
From: Sam James @ 2025-07-02 11:44 UTC (permalink / raw)
  To: fweimer
  Cc: akpm, andrii, axboe, beaub, bpf, indu.bhagat, jemarch, jolsa,
	jpoimboe, jremus, linux-kernel, linux-trace-kernel,
	mathieu.desnoyers, mhiramat, mingo, namhyung, peterz, rostedt,
	tglx, torvalds, x86

I started to play around with this properly last night and it was
straightforward, fortunately.

Did initially attempt to backport to 6.15 but it was a victim of some
mm refactoring, so it didn't seem worth carrying on with that route.

Started a rough page with notes for myself (but corrections & such
welcome) at https://wiki.gentoo.org/wiki/Project:Toolchain/SFrame but
honestly, it's immediately obvious (and beautiful) when it's working
correctly. I've used Namhyung Kim's example from this thread but you can
see it easily with `perf top -g` too.

In one of the commit messages in the perf series, Steven also gave `perf
record -g -vv true` which was convenient for making sure it's correctly
discovered deferred unwinding support.

I plan on doing measurements next and doing some more playing once I've
built more userland with it.


* Re: [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure
  2025-07-02 11:44   ` Sam James
@ 2025-07-02 16:15     ` Steven Rostedt
  2025-07-02 17:14       ` Namhyung Kim
  2025-07-02 17:10     ` Namhyung Kim
  1 sibling, 1 reply; 33+ messages in thread
From: Steven Rostedt @ 2025-07-02 16:15 UTC (permalink / raw)
  To: Sam James
  Cc: fweimer, akpm, andrii, axboe, beaub, bpf, indu.bhagat, jemarch,
	jolsa, jpoimboe, jremus, linux-kernel, linux-trace-kernel,
	mathieu.desnoyers, mhiramat, mingo, namhyung, peterz, tglx,
	torvalds, x86

On Wed, 02 Jul 2025 12:44:51 +0100
Sam James <sam@gentoo.org> wrote:

> In one of the commit messages in the perf series, Steven also gave
> `perf record -g -vv true` which was convenient for making sure it's
> correctly discovered deferred unwinding support.

Although I posted the patch, the command "perf record -g -vv true" was
Namhyung's idea. Just wanted to give credit where credit was due.

-- Steve


* Re: [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure
  2025-07-02 11:44   ` Sam James
  2025-07-02 16:15     ` Steven Rostedt
@ 2025-07-02 17:10     ` Namhyung Kim
  1 sibling, 0 replies; 33+ messages in thread
From: Namhyung Kim @ 2025-07-02 17:10 UTC (permalink / raw)
  To: Sam James
  Cc: fweimer, akpm, andrii, axboe, beaub, bpf, indu.bhagat, jemarch,
	jolsa, jpoimboe, jremus, linux-kernel, linux-trace-kernel,
	mathieu.desnoyers, mhiramat, mingo, peterz, rostedt, tglx,
	torvalds, x86

Hello,

On Wed, Jul 02, 2025 at 12:44:51PM +0100, Sam James wrote:
> I started to play around with this properly last night and it was
> straightforward, fortunately.
> 
> Did initially attempt to backport to 6.15 but it was a victim of some
> mm refactoring and didn't seem worth to carry on w/ that route.
> 
> Started a rough page with notes for myself (but corrections & such
> welcome) at https://wiki.gentoo.org/wiki/Project:Toolchain/SFrame but
> honestly, it's immediately obvious (and beautiful) when it's working
> correctly. I've used Namhyung Kim's example from this thread but you can
> see it easily with `perf top -g` too.

I've looked at the page, but it doesn't seem to work well, unfortunately.
The working case has the symbols correct, but the overheads are the same.
It should have more 'Children' overhead than 'Self', like in the broken
case, because 'children = self + callchain'.

Thanks,
Namhyung

> 
> In one of the commit messages in the perf series, Steven also gave `perf
> record -g -vv true` which was convenient for making sure it's correctly
> discovered deferred unwinding support.
> 
> I plan on doing measurements next and doing some more playing once I've
> built more userland with it.


* Re: [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure
  2025-07-02 16:15     ` Steven Rostedt
@ 2025-07-02 17:14       ` Namhyung Kim
  2025-07-03  4:01         ` Sam James
  0 siblings, 1 reply; 33+ messages in thread
From: Namhyung Kim @ 2025-07-02 17:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Sam James, fweimer, akpm, andrii, axboe, beaub, bpf, indu.bhagat,
	jemarch, jolsa, jpoimboe, jremus, linux-kernel,
	linux-trace-kernel, mathieu.desnoyers, mhiramat, mingo, peterz,
	tglx, torvalds, x86

On Wed, Jul 02, 2025 at 12:15:02PM -0400, Steven Rostedt wrote:
> On Wed, 02 Jul 2025 12:44:51 +0100
> Sam James <sam@gentoo.org> wrote:
> 
> > In one of the commit messages in the perf series, Steven also gave
> > `perf record -g -vv true` which was convenient for making sure it's
> > correctly discovered deferred unwinding support.
> 
> Although I posted the patch, the command "perf record -g -vv true" was
> Namhyung's idea. Just wanted to give credit where credit was due.

Yep, it's to check if the perf tool asks the kernel for the deferred
callchain.  To check if the kernel returns the callchain properly:

  $ perf report -D | grep -A5 CALLCHAIN_DEFERRED

Thanks,
Namhyung



* Re: [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure
  2025-07-02 17:14       ` Namhyung Kim
@ 2025-07-03  4:01         ` Sam James
  0 siblings, 0 replies; 33+ messages in thread
From: Sam James @ 2025-07-03  4:01 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Steven Rostedt, fweimer, akpm, andrii, axboe, beaub, bpf,
	indu.bhagat, jemarch, jolsa, jpoimboe, jremus, linux-kernel,
	linux-trace-kernel, mathieu.desnoyers, mhiramat, mingo, peterz,
	tglx, torvalds, x86

Namhyung Kim <namhyung@kernel.org> writes:

> On Wed, Jul 02, 2025 at 12:15:02PM -0400, Steven Rostedt wrote:
>> On Wed, 02 Jul 2025 12:44:51 +0100
>> Sam James <sam@gentoo.org> wrote:
>> 
>> > In one of the commit messages in the perf series, Steven also gave
>> > `perf record -g -vv true` which was convenient for making sure it's
>> > correctly discovered deferred unwinding support.
>> 
>> Although I posted the patch, the command "perf record -g -vv true" was
>> Namhyung's idea. Just wanted to give credit where credit was due.
>
> Yep, it's to check if perf tool ask the deferred callchain to the
> kernel.  To check if the kernel returns the callchain properly is:
>
>   $ perf report -D | grep -A5 CALLCHAIN_DEFERRED

Thanks both. I'll update my notes and tinker more.

>
> Thanks,
> Namhyung

sam


end of thread, other threads:[~2025-07-03  4:01 UTC | newest]

Thread overview: 33+ messages
2025-06-25 22:56 [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Steven Rostedt
2025-06-25 22:56 ` [PATCH v11 01/14] unwind_user: Add user space unwinding API Steven Rostedt
2025-06-25 22:56 ` [PATCH v11 02/14] unwind_user: Add frame pointer support Steven Rostedt
2025-06-25 22:56 ` [PATCH v11 03/14] unwind_user: Add compat mode " Steven Rostedt
2025-06-25 22:56 ` [PATCH v11 04/14] unwind_user/deferred: Add unwind_user_faultable() Steven Rostedt
2025-06-25 22:56 ` [PATCH v11 05/14] unwind_user/deferred: Add unwind cache Steven Rostedt
2025-06-25 22:56 ` [PATCH v11 06/14] unwind_user/deferred: Add deferred unwinding interface Steven Rostedt
2025-06-26 16:48   ` Steven Rostedt
2025-06-26 20:34     ` Steven Rostedt
2025-06-25 22:56 ` [PATCH v11 07/14] unwind_user/deferred: Make unwind deferral requests NMI-safe Steven Rostedt
2025-06-25 22:56 ` [PATCH v11 08/14] unwind deferred: Use bitmask to determine which callbacks to call Steven Rostedt
2025-06-25 22:56 ` [PATCH v11 09/14] unwind deferred: Use SRCU unwind_deferred_task_work() Steven Rostedt
2025-06-25 22:56 ` [PATCH v11 10/14] unwind: Clear unwind_mask on exit back to user space Steven Rostedt
2025-06-25 22:56 ` [PATCH v11 11/14] unwind: Finish up unwind when a task exits Steven Rostedt
2025-06-25 22:56 ` [PATCH v11 12/14] unwind_user/x86: Enable frame pointer unwinding on x86 Steven Rostedt
2025-06-25 22:56 ` [PATCH v11 13/14] perf/x86: Rename and move get_segment_base() and make it global Steven Rostedt
2025-06-26  8:44   ` Ingo Molnar
2025-06-26 12:13     ` Steven Rostedt
2025-06-26 13:07   ` Peter Zijlstra
2025-06-26 15:13     ` Steven Rostedt
2025-06-25 22:56 ` [PATCH v11 14/14] unwind_user/x86: Enable compat mode frame pointer unwinding on x86 Steven Rostedt
2025-06-26  8:33   ` Ingo Molnar
2025-06-26 12:12     ` Steven Rostedt
2025-06-27 14:01       ` Steven Rostedt
2025-06-27 16:23         ` Linus Torvalds
2025-06-30 12:50 ` [PATCH v11 00/14] unwind_user: x86: Deferred unwinding infrastructure Florian Weimer
2025-06-30 16:35   ` Namhyung Kim
2025-06-30 17:30   ` Steven Rostedt
2025-07-02 11:44   ` Sam James
2025-07-02 16:15     ` Steven Rostedt
2025-07-02 17:14       ` Namhyung Kim
2025-07-03  4:01         ` Sam James
2025-07-02 17:10     ` Namhyung Kim
