* [PATCH v8 00/12] unwind_deferred: Implement sframe handling
@ 2025-07-08 2:11 Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 01/12] unwind_user/sframe: Add support for reading .sframe headers Steven Rostedt
` (11 more replies)
0 siblings, 12 replies; 34+ messages in thread
From: Steven Rostedt @ 2025-07-08 2:11 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
This code is based on top of:
https://lore.kernel.org/linux-trace-kernel/20250708012239.268642741@kernel.org/
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
unwind/core
This is the implementation of parsing the SFrame section in an ELF file.
It's a continuation of Josh's last work that can be found here:
https://lore.kernel.org/all/cover.1737511963.git.jpoimboe@kernel.org/
Currently the only way to get a user space stack trace from a stack
walk (and not just copying large amount of user stack into the kernel
ring buffer) is to use frame pointers. This has a few issues. The biggest
one is that compiling frame pointers into every application and library
has been shown to cause performance overhead.
Another issue is that the format of the frames may not always be consistent
between different compilers and some architectures (s390) has no defined
format to do a reliable stack walk. The only way to perform user space
profiling on these architectures is to copy the user stack into the kernel
buffer.
SFrames[1] is now supported in gcc binutils and soon will also be supported
by LLVM. SFrames acts more like ORC, and lives in the ELF executable
file as its own section. Like ORC it has two tables where the first table
is sorted by instruction pointers (IP) and using the current IP and finding
it's entry in the first table, it will take you to the second table which
will tell you where the return address of the current function is located
and then you can use that address to look it up in the first table to find
the return address of that function, and so on. This performs a user
space stack walk.
Now because the SFrame section lives in the ELF file it needs to be faulted
into memory when it is used. This means that walking the user space stack
requires being in a faultable context. As profilers like perf request a stack
trace in interrupt or NMI context, it cannot do the walking when it is
requested. Instead it must be deferred until it is safe to fault in user
space. One place this is known to be safe is when the task is about to return
back to user space.
This series makes the deferred unwind code implement SFrames.
[1] https://sourceware.org/binutils/wiki/sframe
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
unwind/sframe
The unwind/main branch is just unwind/sframe merged into unwind/perf
and can be used for testing with perf.
Changes since v7: https://lore.kernel.org/linux-trace-kernel/20250701184939.026626626@goodmis.org/
- Simply rebased on top of the latest unwind/core
Josh Poimboeuf (12):
unwind_user/sframe: Add support for reading .sframe headers
unwind_user/sframe: Store sframe section data in per-mm maple tree
x86/uaccess: Add unsafe_copy_from_user() implementation
unwind_user/sframe: Add support for reading .sframe contents
unwind_user/sframe: Detect .sframe sections in executables
unwind_user/sframe: Wire up unwind_user to sframe
unwind_user/sframe/x86: Enable sframe unwinding on x86
unwind_user/sframe: Remove .sframe section on detected corruption
unwind_user/sframe: Show file name in debug output
unwind_user/sframe: Enable debugging in uaccess regions
unwind_user/sframe: Add .sframe validation option
unwind_user/sframe: Add prctl() interface for registering .sframe sections
----
MAINTAINERS | 1 +
arch/Kconfig | 23 ++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/mmu.h | 2 +-
arch/x86/include/asm/uaccess.h | 39 ++-
fs/binfmt_elf.c | 49 ++-
include/linux/mm_types.h | 3 +
include/linux/sframe.h | 60 ++++
include/linux/unwind_user_types.h | 1 +
include/uapi/linux/elf.h | 1 +
include/uapi/linux/prctl.h | 6 +-
kernel/fork.c | 10 +
kernel/sys.c | 9 +
kernel/unwind/Makefile | 3 +-
kernel/unwind/sframe.c | 612 ++++++++++++++++++++++++++++++++++++++
kernel/unwind/sframe.h | 71 +++++
kernel/unwind/sframe_debug.h | 99 ++++++
kernel/unwind/user.c | 25 +-
mm/init-mm.c | 2 +
19 files changed, 998 insertions(+), 19 deletions(-)
create mode 100644 include/linux/sframe.h
create mode 100644 kernel/unwind/sframe.c
create mode 100644 kernel/unwind/sframe.h
create mode 100644 kernel/unwind/sframe_debug.h
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH v8 01/12] unwind_user/sframe: Add support for reading .sframe headers
2025-07-08 2:11 [PATCH v8 00/12] unwind_deferred: Implement sframe handling Steven Rostedt
@ 2025-07-08 2:11 ` Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 02/12] unwind_user/sframe: Store sframe section data in per-mm maple tree Steven Rostedt
` (10 subsequent siblings)
11 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2025-07-08 2:11 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Josh Poimboeuf <jpoimboe@kernel.org>
In preparation for unwinding user space stacks with sframe, add basic
sframe compile infrastructure and support for reading the .sframe
section header.
sframe_add_section() reads the header and unconditionally returns an
error, so it's not very useful yet. A subsequent patch will improve
that.
Link: https://lore.kernel.org/all/f27e8463783febfa0dabb0432a3dd6be8ad98412.1737511963.git.jpoimboe@kernel.org/
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
MAINTAINERS | 1 +
arch/Kconfig | 3 +
include/linux/sframe.h | 40 ++++++++++++
kernel/unwind/Makefile | 3 +-
kernel/unwind/sframe.c | 136 +++++++++++++++++++++++++++++++++++++++++
kernel/unwind/sframe.h | 71 +++++++++++++++++++++
6 files changed, 253 insertions(+), 1 deletion(-)
create mode 100644 include/linux/sframe.h
create mode 100644 kernel/unwind/sframe.c
create mode 100644 kernel/unwind/sframe.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 370d780fd5f8..a6a94d2d03b0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -25932,6 +25932,7 @@ USERSPACE STACK UNWINDING
M: Josh Poimboeuf <jpoimboe@kernel.org>
M: Steven Rostedt <rostedt@goodmis.org>
S: Maintained
+F: include/linux/sframe.h
F: include/linux/unwind*.h
F: kernel/unwind/
diff --git a/arch/Kconfig b/arch/Kconfig
index 2c41d3072910..c54d35e2f860 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -446,6 +446,9 @@ config HAVE_UNWIND_USER_COMPAT_FP
bool
depends on HAVE_UNWIND_USER_FP
+config HAVE_UNWIND_USER_SFRAME
+ bool
+
config HAVE_PERF_REGS
bool
help
diff --git a/include/linux/sframe.h b/include/linux/sframe.h
new file mode 100644
index 000000000000..0584f661f698
--- /dev/null
+++ b/include/linux/sframe.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SFRAME_H
+#define _LINUX_SFRAME_H
+
+#include <linux/mm_types.h>
+#include <linux/unwind_user_types.h>
+
+#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
+
+struct sframe_section {
+ unsigned long sframe_start;
+ unsigned long sframe_end;
+ unsigned long text_start;
+ unsigned long text_end;
+
+ unsigned long fdes_start;
+ unsigned long fres_start;
+ unsigned long fres_end;
+ unsigned int num_fdes;
+
+ signed char ra_off;
+ signed char fp_off;
+};
+
+extern int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
+ unsigned long text_start, unsigned long text_end);
+extern int sframe_remove_section(unsigned long sframe_addr);
+
+#else /* !CONFIG_HAVE_UNWIND_USER_SFRAME */
+
+static inline int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
+ unsigned long text_start, unsigned long text_end)
+{
+ return -ENOSYS;
+}
+static inline int sframe_remove_section(unsigned long sframe_addr) { return -ENOSYS; }
+
+#endif /* CONFIG_HAVE_UNWIND_USER_SFRAME */
+
+#endif /* _LINUX_SFRAME_H */
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
index eae37bea54fd..146038165865 100644
--- a/kernel/unwind/Makefile
+++ b/kernel/unwind/Makefile
@@ -1 +1,2 @@
- obj-$(CONFIG_UNWIND_USER) += user.o deferred.o
+ obj-$(CONFIG_UNWIND_USER) += user.o deferred.o
+ obj-$(CONFIG_HAVE_UNWIND_USER_SFRAME) += sframe.o
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
new file mode 100644
index 000000000000..20287f795b36
--- /dev/null
+++ b/kernel/unwind/sframe.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Userspace sframe access functions
+ */
+
+#define pr_fmt(fmt) "sframe: " fmt
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/srcu.h>
+#include <linux/uaccess.h>
+#include <linux/mm.h>
+#include <linux/string_helpers.h>
+#include <linux/sframe.h>
+#include <linux/unwind_user_types.h>
+
+#include "sframe.h"
+
+#define dbg(fmt, ...) \
+ pr_debug("%s (%d): " fmt, current->comm, current->pid, ##__VA_ARGS__)
+
+static void free_section(struct sframe_section *sec)
+{
+ kfree(sec);
+}
+
+static int sframe_read_header(struct sframe_section *sec)
+{
+ unsigned long header_end, fdes_start, fdes_end, fres_start, fres_end;
+ struct sframe_header shdr;
+ unsigned int num_fdes;
+
+ if (copy_from_user(&shdr, (void __user *)sec->sframe_start, sizeof(shdr))) {
+ dbg("header usercopy failed\n");
+ return -EFAULT;
+ }
+
+ if (shdr.preamble.magic != SFRAME_MAGIC ||
+ shdr.preamble.version != SFRAME_VERSION_2 ||
+ !(shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
+ shdr.auxhdr_len) {
+ dbg("bad/unsupported sframe header\n");
+ return -EINVAL;
+ }
+
+ if (!shdr.num_fdes || !shdr.num_fres) {
+ dbg("no fde/fre entries\n");
+ return -EINVAL;
+ }
+
+ header_end = sec->sframe_start + SFRAME_HEADER_SIZE(shdr);
+ if (header_end >= sec->sframe_end) {
+ dbg("header doesn't fit in section\n");
+ return -EINVAL;
+ }
+
+ num_fdes = shdr.num_fdes;
+ fdes_start = header_end + shdr.fdes_off;
+ fdes_end = fdes_start + (num_fdes * sizeof(struct sframe_fde));
+
+ fres_start = header_end + shdr.fres_off;
+ fres_end = fres_start + shdr.fre_len;
+
+ if (fres_start < fdes_end || fres_end > sec->sframe_end) {
+ dbg("inconsistent fde/fre offsets\n");
+ return -EINVAL;
+ }
+
+ sec->num_fdes = num_fdes;
+ sec->fdes_start = fdes_start;
+ sec->fres_start = fres_start;
+ sec->fres_end = fres_end;
+
+ sec->ra_off = shdr.cfa_fixed_ra_offset;
+ sec->fp_off = shdr.cfa_fixed_fp_offset;
+
+ return 0;
+}
+
+int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
+ unsigned long text_start, unsigned long text_end)
+{
+ struct maple_tree *sframe_mt = ¤t->mm->sframe_mt;
+ struct vm_area_struct *sframe_vma, *text_vma;
+ struct mm_struct *mm = current->mm;
+ struct sframe_section *sec;
+ int ret;
+
+ if (!sframe_start || !sframe_end || !text_start || !text_end) {
+ dbg("zero-length sframe/text address\n");
+ return -EINVAL;
+ }
+
+ scoped_guard(mmap_read_lock, mm) {
+ sframe_vma = vma_lookup(mm, sframe_start);
+ if (!sframe_vma || sframe_end > sframe_vma->vm_end) {
+ dbg("bad sframe address (0x%lx - 0x%lx)\n",
+ sframe_start, sframe_end);
+ return -EINVAL;
+ }
+
+ text_vma = vma_lookup(mm, text_start);
+ if (!text_vma ||
+ !(text_vma->vm_flags & VM_EXEC) ||
+ text_end > text_vma->vm_end) {
+ dbg("bad text address (0x%lx - 0x%lx)\n",
+ text_start, text_end);
+ return -EINVAL;
+ }
+ }
+
+ sec = kzalloc(sizeof(*sec), GFP_KERNEL);
+ if (!sec)
+ return -ENOMEM;
+
+ sec->sframe_start = sframe_start;
+ sec->sframe_end = sframe_end;
+ sec->text_start = text_start;
+ sec->text_end = text_end;
+
+ ret = sframe_read_header(sec);
+ if (ret)
+ goto err_free;
+
+ /* TODO nowhere to store it yet - just free it and return an error */
+ ret = -ENOSYS;
+
+err_free:
+ free_section(sec);
+ return ret;
+}
+
+int sframe_remove_section(unsigned long sframe_start)
+{
+ return -ENOSYS;
+}
diff --git a/kernel/unwind/sframe.h b/kernel/unwind/sframe.h
new file mode 100644
index 000000000000..e9bfccfaf5b4
--- /dev/null
+++ b/kernel/unwind/sframe.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * From https://www.sourceware.org/binutils/docs/sframe-spec.html
+ */
+#ifndef _SFRAME_H
+#define _SFRAME_H
+
+#include <linux/types.h>
+
+#define SFRAME_VERSION_1 1
+#define SFRAME_VERSION_2 2
+#define SFRAME_MAGIC 0xdee2
+
+#define SFRAME_F_FDE_SORTED 0x1
+#define SFRAME_F_FRAME_POINTER 0x2
+
+#define SFRAME_ABI_AARCH64_ENDIAN_BIG 1
+#define SFRAME_ABI_AARCH64_ENDIAN_LITTLE 2
+#define SFRAME_ABI_AMD64_ENDIAN_LITTLE 3
+
+#define SFRAME_FDE_TYPE_PCINC 0
+#define SFRAME_FDE_TYPE_PCMASK 1
+
+struct sframe_preamble {
+ u16 magic;
+ u8 version;
+ u8 flags;
+} __packed;
+
+struct sframe_header {
+ struct sframe_preamble preamble;
+ u8 abi_arch;
+ s8 cfa_fixed_fp_offset;
+ s8 cfa_fixed_ra_offset;
+ u8 auxhdr_len;
+ u32 num_fdes;
+ u32 num_fres;
+ u32 fre_len;
+ u32 fdes_off;
+ u32 fres_off;
+} __packed;
+
+#define SFRAME_HEADER_SIZE(header) \
+ ((sizeof(struct sframe_header) + header.auxhdr_len))
+
+#define SFRAME_AARCH64_PAUTH_KEY_A 0
+#define SFRAME_AARCH64_PAUTH_KEY_B 1
+
+struct sframe_fde {
+ s32 start_addr;
+ u32 func_size;
+ u32 fres_off;
+ u32 fres_num;
+ u8 info;
+ u8 rep_size;
+ u16 padding;
+} __packed;
+
+#define SFRAME_FUNC_FRE_TYPE(data) (data & 0xf)
+#define SFRAME_FUNC_FDE_TYPE(data) ((data >> 4) & 0x1)
+#define SFRAME_FUNC_PAUTH_KEY(data) ((data >> 5) & 0x1)
+
+#define SFRAME_BASE_REG_FP 0
+#define SFRAME_BASE_REG_SP 1
+
+#define SFRAME_FRE_CFA_BASE_REG_ID(data) (data & 0x1)
+#define SFRAME_FRE_OFFSET_COUNT(data) ((data >> 1) & 0xf)
+#define SFRAME_FRE_OFFSET_SIZE(data) ((data >> 5) & 0x3)
+#define SFRAME_FRE_MANGLED_RA_P(data) ((data >> 7) & 0x1)
+
+#endif /* _SFRAME_H */
--
2.47.2
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 02/12] unwind_user/sframe: Store sframe section data in per-mm maple tree
2025-07-08 2:11 [PATCH v8 00/12] unwind_deferred: Implement sframe handling Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 01/12] unwind_user/sframe: Add support for reading .sframe headers Steven Rostedt
@ 2025-07-08 2:11 ` Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 03/12] x86/uaccess: Add unsafe_copy_from_user() implementation Steven Rostedt
` (9 subsequent siblings)
11 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2025-07-08 2:11 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James, linux-mm
From: Josh Poimboeuf <jpoimboe@kernel.org>
Associate an sframe section with its mm by adding it to a per-mm maple
tree which is indexed by the corresponding text address range. A single
sframe section can be associated with multiple text ranges.
Cc: linux-mm@kvack.org
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
arch/x86/include/asm/mmu.h | 2 +-
include/linux/mm_types.h | 3 +++
include/linux/sframe.h | 13 +++++++++
kernel/fork.c | 10 +++++++
kernel/unwind/sframe.c | 55 +++++++++++++++++++++++++++++++++++---
mm/init-mm.c | 2 ++
6 files changed, 81 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 0fe9c569d171..227a32899a59 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -87,7 +87,7 @@ typedef struct {
.context = { \
.ctx_id = 1, \
.lock = __MUTEX_INITIALIZER(mm.context.lock), \
- }
+ },
void leave_mm(void);
#define leave_mm leave_mm
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d6b91e8a66d6..4296cabf4afa 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1206,6 +1206,9 @@ struct mm_struct {
#ifdef CONFIG_MM_ID
mm_id_t mm_id;
#endif /* CONFIG_MM_ID */
+#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
+ struct maple_tree sframe_mt;
+#endif
} __randomize_layout;
/*
diff --git a/include/linux/sframe.h b/include/linux/sframe.h
index 0584f661f698..73bf6f0b30c2 100644
--- a/include/linux/sframe.h
+++ b/include/linux/sframe.h
@@ -22,18 +22,31 @@ struct sframe_section {
signed char fp_off;
};
+#define INIT_MM_SFRAME .sframe_mt = MTREE_INIT(sframe_mt, 0),
+extern void sframe_free_mm(struct mm_struct *mm);
+
extern int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
unsigned long text_start, unsigned long text_end);
extern int sframe_remove_section(unsigned long sframe_addr);
+static inline bool current_has_sframe(void)
+{
+ struct mm_struct *mm = current->mm;
+
+ return mm && !mtree_empty(&mm->sframe_mt);
+}
+
#else /* !CONFIG_HAVE_UNWIND_USER_SFRAME */
+#define INIT_MM_SFRAME
+static inline void sframe_free_mm(struct mm_struct *mm) {}
static inline int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
unsigned long text_start, unsigned long text_end)
{
return -ENOSYS;
}
static inline int sframe_remove_section(unsigned long sframe_addr) { return -ENOSYS; }
+static inline bool current_has_sframe(void) { return false; }
#endif /* CONFIG_HAVE_UNWIND_USER_SFRAME */
diff --git a/kernel/fork.c b/kernel/fork.c
index 3341d50c61f2..e56daf4e546f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -106,6 +106,7 @@
#include <linux/pidfs.h>
#include <linux/tick.h>
#include <linux/unwind_deferred.h>
+#include <linux/sframe.h>
#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -687,6 +688,7 @@ void __mmdrop(struct mm_struct *mm)
mm_pasid_drop(mm);
mm_destroy_cid(mm);
percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
+ sframe_free_mm(mm);
free_mm(mm);
}
@@ -1024,6 +1026,13 @@ static void mmap_init_lock(struct mm_struct *mm)
#endif
}
+static void mm_init_sframe(struct mm_struct *mm)
+{
+#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
+ mt_init(&mm->sframe_mt);
+#endif
+}
+
static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
struct user_namespace *user_ns)
{
@@ -1053,6 +1062,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm->pmd_huge_pte = NULL;
#endif
mm_init_uprobes_state(mm);
+ mm_init_sframe(mm);
hugetlb_count_init(mm);
if (current->mm) {
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index 20287f795b36..fa7d87ffd00a 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -122,15 +122,64 @@ int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
if (ret)
goto err_free;
- /* TODO nowhere to store it yet - just free it and return an error */
- ret = -ENOSYS;
+ ret = mtree_insert_range(sframe_mt, sec->text_start, sec->text_end, sec, GFP_KERNEL);
+ if (ret) {
+ dbg("mtree_insert_range failed: text=%lx-%lx\n",
+ sec->text_start, sec->text_end);
+ goto err_free;
+ }
+
+ return 0;
err_free:
free_section(sec);
return ret;
}
+static int __sframe_remove_section(struct mm_struct *mm,
+ struct sframe_section *sec)
+{
+ if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
+ dbg("mtree_erase failed: text=%lx\n", sec->text_start);
+ return -EINVAL;
+ }
+
+ free_section(sec);
+
+ return 0;
+}
+
int sframe_remove_section(unsigned long sframe_start)
{
- return -ENOSYS;
+ struct mm_struct *mm = current->mm;
+ struct sframe_section *sec;
+ unsigned long index = 0;
+ bool found = false;
+ int ret = 0;
+
+ mt_for_each(&mm->sframe_mt, sec, index, ULONG_MAX) {
+ if (sec->sframe_start == sframe_start) {
+ found = true;
+ ret |= __sframe_remove_section(mm, sec);
+ }
+ }
+
+ if (!found || ret)
+ return -EINVAL;
+
+ return 0;
+}
+
+void sframe_free_mm(struct mm_struct *mm)
+{
+ struct sframe_section *sec;
+ unsigned long index = 0;
+
+ if (!mm)
+ return;
+
+ mt_for_each(&mm->sframe_mt, sec, index, ULONG_MAX)
+ free_section(sec);
+
+ mtree_destroy(&mm->sframe_mt);
}
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 4600e7605cab..b32fcf167cc2 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -11,6 +11,7 @@
#include <linux/atomic.h>
#include <linux/user_namespace.h>
#include <linux/iommu.h>
+#include <linux/sframe.h>
#include <asm/mmu.h>
#ifndef INIT_MM_CONTEXT
@@ -46,6 +47,7 @@ struct mm_struct init_mm = {
.user_ns = &init_user_ns,
.cpu_bitmap = CPU_BITS_NONE,
INIT_MM_CONTEXT(init_mm)
+ INIT_MM_SFRAME
};
void setup_initial_init_mm(void *start_code, void *end_code,
--
2.47.2
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 03/12] x86/uaccess: Add unsafe_copy_from_user() implementation
2025-07-08 2:11 [PATCH v8 00/12] unwind_deferred: Implement sframe handling Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 01/12] unwind_user/sframe: Add support for reading .sframe headers Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 02/12] unwind_user/sframe: Store sframe section data in per-mm maple tree Steven Rostedt
@ 2025-07-08 2:11 ` Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 04/12] unwind_user/sframe: Add support for reading .sframe contents Steven Rostedt
` (8 subsequent siblings)
11 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2025-07-08 2:11 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Josh Poimboeuf <jpoimboe@kernel.org>
Add an x86 implementation of unsafe_copy_from_user() similar to the
existing unsafe_copy_to_user().
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
arch/x86/include/asm/uaccess.h | 39 +++++++++++++++++++++++++---------
1 file changed, 29 insertions(+), 10 deletions(-)
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 3a7755c1a441..3caf02d0503e 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -599,7 +599,7 @@ _label: \
* We want the unsafe accessors to always be inlined and use
* the error labels - thus the macro games.
*/
-#define unsafe_copy_loop(dst, src, len, type, label) \
+#define unsafe_copy_to_user_loop(dst, src, len, type, label) \
while (len >= sizeof(type)) { \
unsafe_put_user(*(type *)(src),(type __user *)(dst),label); \
dst += sizeof(type); \
@@ -607,15 +607,34 @@ _label: \
len -= sizeof(type); \
}
-#define unsafe_copy_to_user(_dst,_src,_len,label) \
-do { \
- char __user *__ucu_dst = (_dst); \
- const char *__ucu_src = (_src); \
- size_t __ucu_len = (_len); \
- unsafe_copy_loop(__ucu_dst, __ucu_src, __ucu_len, u64, label); \
- unsafe_copy_loop(__ucu_dst, __ucu_src, __ucu_len, u32, label); \
- unsafe_copy_loop(__ucu_dst, __ucu_src, __ucu_len, u16, label); \
- unsafe_copy_loop(__ucu_dst, __ucu_src, __ucu_len, u8, label); \
+#define unsafe_copy_to_user(_dst, _src, _len, label) \
+do { \
+ void __user *__dst = (_dst); \
+ const void *__src = (_src); \
+ size_t __len = (_len); \
+ unsafe_copy_to_user_loop(__dst, __src, __len, u64, label); \
+ unsafe_copy_to_user_loop(__dst, __src, __len, u32, label); \
+ unsafe_copy_to_user_loop(__dst, __src, __len, u16, label); \
+ unsafe_copy_to_user_loop(__dst, __src, __len, u8, label); \
+} while (0)
+
+#define unsafe_copy_from_user_loop(dst, src, len, type, label) \
+ while (len >= sizeof(type)) { \
+ unsafe_get_user(*(type *)(dst), (type __user *)(src), label); \
+ dst += sizeof(type); \
+ src += sizeof(type); \
+ len -= sizeof(type); \
+ }
+
+#define unsafe_copy_from_user(_dst, _src, _len, label) \
+do { \
+ void *__dst = (_dst); \
+ void __user *__src = (_src); \
+ size_t __len = (_len); \
+ unsafe_copy_from_user_loop(__dst, __src, __len, u64, label); \
+ unsafe_copy_from_user_loop(__dst, __src, __len, u32, label); \
+ unsafe_copy_from_user_loop(__dst, __src, __len, u16, label); \
+ unsafe_copy_from_user_loop(__dst, __src, __len, u8, label); \
} while (0)
#ifdef CONFIG_CC_HAS_ASM_GOTO_OUTPUT
--
2.47.2
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 04/12] unwind_user/sframe: Add support for reading .sframe contents
2025-07-08 2:11 [PATCH v8 00/12] unwind_deferred: Implement sframe handling Steven Rostedt
` (2 preceding siblings ...)
2025-07-08 2:11 ` [PATCH v8 03/12] x86/uaccess: Add unsafe_copy_from_user() implementation Steven Rostedt
@ 2025-07-08 2:11 ` Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 05/12] unwind_user/sframe: Detect .sframe sections in executables Steven Rostedt
` (7 subsequent siblings)
11 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2025-07-08 2:11 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Josh Poimboeuf <jpoimboe@kernel.org>
In preparation for using sframe to unwind user space stacks, add an
sframe_find() interface for finding the sframe information associated
with a given text address.
For performance, use user_read_access_begin() and the corresponding
unsafe_*() accessors. Note that use of pr_debug() in uaccess-enabled
regions would break noinstr validation, so there aren't any debug
messages yet. That will be added in a subsequent commit.
Link: https://lore.kernel.org/all/77c0d1ec143bf2a53d66c4ecb190e7e0a576fbfd.1737511963.git.jpoimboe@kernel.org/
Link: https://lore.kernel.org/all/b35ca3a3-8de5-4d32-8d30-d4e562f6b0de@linux.ibm.com/
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/sframe.h | 5 +
kernel/unwind/sframe.c | 311 ++++++++++++++++++++++++++++++++++-
kernel/unwind/sframe_debug.h | 35 ++++
3 files changed, 347 insertions(+), 4 deletions(-)
create mode 100644 kernel/unwind/sframe_debug.h
diff --git a/include/linux/sframe.h b/include/linux/sframe.h
index 73bf6f0b30c2..9a72209696f9 100644
--- a/include/linux/sframe.h
+++ b/include/linux/sframe.h
@@ -3,11 +3,14 @@
#define _LINUX_SFRAME_H
#include <linux/mm_types.h>
+#include <linux/srcu.h>
#include <linux/unwind_user_types.h>
#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
struct sframe_section {
+ struct rcu_head rcu;
+
unsigned long sframe_start;
unsigned long sframe_end;
unsigned long text_start;
@@ -28,6 +31,7 @@ extern void sframe_free_mm(struct mm_struct *mm);
extern int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
unsigned long text_start, unsigned long text_end);
extern int sframe_remove_section(unsigned long sframe_addr);
+extern int sframe_find(unsigned long ip, struct unwind_user_frame *frame);
static inline bool current_has_sframe(void)
{
@@ -46,6 +50,7 @@ static inline int sframe_add_section(unsigned long sframe_start, unsigned long s
return -ENOSYS;
}
static inline int sframe_remove_section(unsigned long sframe_addr) { return -ENOSYS; }
+static inline int sframe_find(unsigned long ip, struct unwind_user_frame *frame) { return -ENOSYS; }
static inline bool current_has_sframe(void) { return false; }
#endif /* CONFIG_HAVE_UNWIND_USER_SFRAME */
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index fa7d87ffd00a..b10420d19840 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -15,9 +15,303 @@
#include <linux/unwind_user_types.h>
#include "sframe.h"
+#include "sframe_debug.h"
-#define dbg(fmt, ...) \
- pr_debug("%s (%d): " fmt, current->comm, current->pid, ##__VA_ARGS__)
+struct sframe_fre {
+ unsigned int size;
+ u32 ip_off;
+ s32 cfa_off;
+ s32 ra_off;
+ s32 fp_off;
+ u8 info;
+};
+
+DEFINE_STATIC_SRCU(sframe_srcu);
+
+static __always_inline unsigned char fre_type_to_size(unsigned char fre_type)
+{
+ if (fre_type > 2)
+ return 0;
+ return 1 << fre_type;
+}
+
+static __always_inline unsigned char offset_size_enum_to_size(unsigned char off_size)
+{
+ if (off_size > 2)
+ return 0;
+ return 1 << off_size;
+}
+
+static __always_inline int __read_fde(struct sframe_section *sec,
+ unsigned int fde_num,
+ struct sframe_fde *fde)
+{
+ unsigned long fde_addr, ip;
+
+ fde_addr = sec->fdes_start + (fde_num * sizeof(struct sframe_fde));
+ unsafe_copy_from_user(fde, (void __user *)fde_addr,
+ sizeof(struct sframe_fde), Efault);
+
+ ip = sec->sframe_start + fde->start_addr;
+ if (ip < sec->text_start || ip > sec->text_end)
+ return -EINVAL;
+
+ return 0;
+
+Efault:
+ return -EFAULT;
+}
+
+static __always_inline int __find_fde(struct sframe_section *sec,
+ unsigned long ip,
+ struct sframe_fde *fde)
+{
+ s32 ip_off, func_off_low = S32_MIN, func_off_high = S32_MAX;
+ struct sframe_fde __user *first, *low, *high, *found = NULL;
+ int ret;
+
+ ip_off = ip - sec->sframe_start;
+
+ first = (void __user *)sec->fdes_start;
+ low = first;
+ high = first + sec->num_fdes - 1;
+
+ while (low <= high) {
+ struct sframe_fde __user *mid;
+ s32 func_off;
+
+ mid = low + ((high - low) / 2);
+
+ unsafe_get_user(func_off, (s32 __user *)mid, Efault);
+
+ if (ip_off >= func_off) {
+ if (func_off < func_off_low)
+ return -EFAULT;
+
+ func_off_low = func_off;
+
+ found = mid;
+ low = mid + 1;
+ } else {
+ if (func_off > func_off_high)
+ return -EFAULT;
+
+ func_off_high = func_off;
+
+ high = mid - 1;
+ }
+ }
+
+ if (!found)
+ return -EINVAL;
+
+ ret = __read_fde(sec, found - first, fde);
+ if (ret)
+ return ret;
+
+ /* make sure it's not in a gap */
+ if (ip_off < fde->start_addr || ip_off >= fde->start_addr + fde->func_size)
+ return -EINVAL;
+
+ return 0;
+
+Efault:
+ return -EFAULT;
+}
+
+#define ____UNSAFE_GET_USER_INC(to, from, type, label) \
+({ \
+ type __to; \
+ unsafe_get_user(__to, (type __user *)from, label); \
+ from += sizeof(__to); \
+ to = __to; \
+})
+
+#define __UNSAFE_GET_USER_INC(to, from, size, label, u_or_s) \
+({ \
+ switch (size) { \
+ case 1: \
+ ____UNSAFE_GET_USER_INC(to, from, u_or_s##8, label); \
+ break; \
+ case 2: \
+ ____UNSAFE_GET_USER_INC(to, from, u_or_s##16, label); \
+ break; \
+ case 4: \
+ ____UNSAFE_GET_USER_INC(to, from, u_or_s##32, label); \
+ break; \
+ default: \
+ return -EFAULT; \
+ } \
+})
+
+#define UNSAFE_GET_USER_UNSIGNED_INC(to, from, size, label) \
+ __UNSAFE_GET_USER_INC(to, from, size, label, u)
+
+#define UNSAFE_GET_USER_SIGNED_INC(to, from, size, label) \
+ __UNSAFE_GET_USER_INC(to, from, size, label, s)
+
+#define UNSAFE_GET_USER_INC(to, from, size, label) \
+ _Generic(to, \
+ u8: UNSAFE_GET_USER_UNSIGNED_INC(to, from, size, label), \
+ u16: UNSAFE_GET_USER_UNSIGNED_INC(to, from, size, label), \
+ u32: UNSAFE_GET_USER_UNSIGNED_INC(to, from, size, label), \
+ s8: UNSAFE_GET_USER_SIGNED_INC(to, from, size, label), \
+ s16: UNSAFE_GET_USER_SIGNED_INC(to, from, size, label), \
+ s32: UNSAFE_GET_USER_SIGNED_INC(to, from, size, label))
+
+static __always_inline int __read_fre(struct sframe_section *sec,
+ struct sframe_fde *fde,
+ unsigned long fre_addr,
+ struct sframe_fre *fre)
+{
+ unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
+ unsigned char fre_type = SFRAME_FUNC_FRE_TYPE(fde->info);
+ unsigned char offset_count, offset_size;
+ s32 cfa_off, ra_off, fp_off;
+ unsigned long cur = fre_addr;
+ unsigned char addr_size;
+ u32 ip_off;
+ u8 info;
+
+ addr_size = fre_type_to_size(fre_type);
+ if (!addr_size)
+ return -EFAULT;
+
+ if (fre_addr + addr_size + 1 > sec->fres_end)
+ return -EFAULT;
+
+ UNSAFE_GET_USER_INC(ip_off, cur, addr_size, Efault);
+ if (fde_type == SFRAME_FDE_TYPE_PCINC && ip_off > fde->func_size)
+ return -EFAULT;
+
+ UNSAFE_GET_USER_INC(info, cur, 1, Efault);
+ offset_count = SFRAME_FRE_OFFSET_COUNT(info);
+ offset_size = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(info));
+ if (!offset_count || !offset_size)
+ return -EFAULT;
+
+ if (cur + (offset_count * offset_size) > sec->fres_end)
+ return -EFAULT;
+
+ fre->size = addr_size + 1 + (offset_count * offset_size);
+
+ UNSAFE_GET_USER_INC(cfa_off, cur, offset_size, Efault);
+ offset_count--;
+
+ ra_off = sec->ra_off;
+ if (!ra_off) {
+ if (!offset_count--)
+ return -EFAULT;
+
+ UNSAFE_GET_USER_INC(ra_off, cur, offset_size, Efault);
+ }
+
+ fp_off = sec->fp_off;
+ if (!fp_off && offset_count) {
+ offset_count--;
+ UNSAFE_GET_USER_INC(fp_off, cur, offset_size, Efault);
+ }
+
+ if (offset_count)
+ return -EFAULT;
+
+ fre->ip_off = ip_off;
+ fre->cfa_off = cfa_off;
+ fre->ra_off = ra_off;
+ fre->fp_off = fp_off;
+ fre->info = info;
+
+ return 0;
+
+Efault:
+ return -EFAULT;
+}
+
+static __always_inline int __find_fre(struct sframe_section *sec,
+ struct sframe_fde *fde, unsigned long ip,
+ struct unwind_user_frame *frame)
+{
+ unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
+ struct sframe_fre *fre, *prev_fre = NULL;
+ struct sframe_fre fres[2];
+ unsigned long fre_addr;
+ bool which = false;
+ unsigned int i;
+ u32 ip_off;
+
+ ip_off = ip - (sec->sframe_start + fde->start_addr);
+
+ if (fde_type == SFRAME_FDE_TYPE_PCMASK)
+ ip_off %= fde->rep_size;
+
+ fre_addr = sec->fres_start + fde->fres_off;
+
+ for (i = 0; i < fde->fres_num; i++) {
+ int ret;
+
+ /*
+ * Alternate between the two fre_addr[] entries for 'fre' and
+ * 'prev_fre'.
+ */
+ fre = which ? fres : fres + 1;
+ which = !which;
+
+ ret = __read_fre(sec, fde, fre_addr, fre);
+ if (ret)
+ return ret;
+
+ fre_addr += fre->size;
+
+ if (prev_fre && fre->ip_off <= prev_fre->ip_off)
+ return -EFAULT;
+
+ if (fre->ip_off > ip_off)
+ break;
+
+ prev_fre = fre;
+ }
+
+ if (!prev_fre)
+ return -EINVAL;
+ fre = prev_fre;
+
+ frame->cfa_off = fre->cfa_off;
+ frame->ra_off = fre->ra_off;
+ frame->fp_off = fre->fp_off;
+ frame->use_fp = SFRAME_FRE_CFA_BASE_REG_ID(fre->info) == SFRAME_BASE_REG_FP;
+
+ return 0;
+}
+
+int sframe_find(unsigned long ip, struct unwind_user_frame *frame)
+{
+ struct mm_struct *mm = current->mm;
+ struct sframe_section *sec;
+ struct sframe_fde fde;
+ int ret;
+
+ if (!mm)
+ return -EINVAL;
+
+ guard(srcu)(&sframe_srcu);
+
+ sec = mtree_load(&mm->sframe_mt, ip);
+ if (!sec)
+ return -EINVAL;
+
+ if (!user_read_access_begin((void __user *)sec->sframe_start,
+ sec->sframe_end - sec->sframe_start))
+ return -EFAULT;
+
+ ret = __find_fde(sec, ip, &fde);
+ if (ret)
+ goto end;
+
+ ret = __find_fre(sec, &fde, ip, frame);
+end:
+ user_read_access_end();
+ return ret;
+}
static void free_section(struct sframe_section *sec)
{
@@ -119,8 +413,10 @@ int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
sec->text_end = text_end;
ret = sframe_read_header(sec);
- if (ret)
+ if (ret) {
+ dbg_print_header(sec);
goto err_free;
+ }
ret = mtree_insert_range(sframe_mt, sec->text_start, sec->text_end, sec, GFP_KERNEL);
if (ret) {
@@ -136,6 +432,13 @@ int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
return ret;
}
+static void sframe_free_srcu(struct rcu_head *rcu)
+{
+ struct sframe_section *sec = container_of(rcu, struct sframe_section, rcu);
+
+ free_section(sec);
+}
+
static int __sframe_remove_section(struct mm_struct *mm,
struct sframe_section *sec)
{
@@ -144,7 +447,7 @@ static int __sframe_remove_section(struct mm_struct *mm,
return -EINVAL;
}
- free_section(sec);
+ call_srcu(&sframe_srcu, &sec->rcu, sframe_free_srcu);
return 0;
}
diff --git a/kernel/unwind/sframe_debug.h b/kernel/unwind/sframe_debug.h
new file mode 100644
index 000000000000..055c8c8fae24
--- /dev/null
+++ b/kernel/unwind/sframe_debug.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SFRAME_DEBUG_H
+#define _SFRAME_DEBUG_H
+
+#include <linux/sframe.h>
+#include "sframe.h"
+
+#ifdef CONFIG_DYNAMIC_DEBUG
+
+#define dbg(fmt, ...) \
+ pr_debug("%s (%d): " fmt, current->comm, current->pid, ##__VA_ARGS__)
+
+static __always_inline void dbg_print_header(struct sframe_section *sec)
+{
+ unsigned long fdes_end;
+
+ fdes_end = sec->fdes_start + (sec->num_fdes * sizeof(struct sframe_fde));
+
+ dbg("SEC: sframe:0x%lx-0x%lx text:0x%lx-0x%lx "
+ "fdes:0x%lx-0x%lx fres:0x%lx-0x%lx "
+ "ra_off:%d fp_off:%d\n",
+ sec->sframe_start, sec->sframe_end, sec->text_start, sec->text_end,
+ sec->fdes_start, fdes_end, sec->fres_start, sec->fres_end,
+ sec->ra_off, sec->fp_off);
+}
+
+#else /* !CONFIG_DYNAMIC_DEBUG */
+
+#define dbg(args...) no_printk(args)
+
+static inline void dbg_print_header(struct sframe_section *sec) {}
+
+#endif /* !CONFIG_DYNAMIC_DEBUG */
+
+#endif /* _SFRAME_DEBUG_H */
--
2.47.2
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 05/12] unwind_user/sframe: Detect .sframe sections in executables
2025-07-08 2:11 [PATCH v8 00/12] unwind_deferred: Implement sframe handling Steven Rostedt
` (3 preceding siblings ...)
2025-07-08 2:11 ` [PATCH v8 04/12] unwind_user/sframe: Add support for reading .sframe contents Steven Rostedt
@ 2025-07-08 2:11 ` Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 06/12] unwind_user/sframe: Wire up unwind_user to sframe Steven Rostedt
` (6 subsequent siblings)
11 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2025-07-08 2:11 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Josh Poimboeuf <jpoimboe@kernel.org>
When loading an ELF executable, automatically detect an .sframe section
and associate it with the mm_struct.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
fs/binfmt_elf.c | 49 +++++++++++++++++++++++++++++++++++++---
include/uapi/linux/elf.h | 1 +
2 files changed, 47 insertions(+), 3 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index a43363d593e5..e7128d026ec0 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -47,6 +47,7 @@
#include <linux/dax.h>
#include <linux/uaccess.h>
#include <linux/rseq.h>
+#include <linux/sframe.h>
#include <asm/param.h>
#include <asm/page.h>
@@ -622,6 +623,21 @@ static inline int make_prot(u32 p_flags, struct arch_elf_state *arch_state,
return arch_elf_adjust_prot(prot, arch_state, has_interp, is_interp);
}
+static void elf_add_sframe(struct elf_phdr *text, struct elf_phdr *sframe,
+ unsigned long base_addr)
+{
+ unsigned long sframe_start, sframe_end, text_start, text_end;
+
+ sframe_start = base_addr + sframe->p_vaddr;
+ sframe_end = sframe_start + sframe->p_memsz;
+
+ text_start = base_addr + text->p_vaddr;
+ text_end = text_start + text->p_memsz;
+
+ /* Ignore return value, sframe section isn't critical */
+ sframe_add_section(sframe_start, sframe_end, text_start, text_end);
+}
+
/* This is much more generalized than the library routine read function,
so we keep this separate. Technically the library read function
is only provided so that we can read a.out libraries that have
@@ -632,7 +648,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
unsigned long no_base, struct elf_phdr *interp_elf_phdata,
struct arch_elf_state *arch_state)
{
- struct elf_phdr *eppnt;
+ struct elf_phdr *eppnt, *sframe_phdr = NULL;
unsigned long load_addr = 0;
int load_addr_set = 0;
unsigned long error = ~0UL;
@@ -658,7 +674,8 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
eppnt = interp_elf_phdata;
for (i = 0; i < interp_elf_ex->e_phnum; i++, eppnt++) {
- if (eppnt->p_type == PT_LOAD) {
+ switch (eppnt->p_type) {
+ case PT_LOAD: {
int elf_type = MAP_PRIVATE;
int elf_prot = make_prot(eppnt->p_flags, arch_state,
true, true);
@@ -697,6 +714,20 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
error = -ENOMEM;
goto out;
}
+ break;
+ }
+ case PT_GNU_SFRAME:
+ sframe_phdr = eppnt;
+ break;
+ }
+ }
+
+ if (sframe_phdr) {
+ eppnt = interp_elf_phdata;
+ for (i = 0; i < interp_elf_ex->e_phnum; i++, eppnt++) {
+ if (eppnt->p_flags & PF_X) {
+ elf_add_sframe(eppnt, sframe_phdr, load_addr);
+ }
}
}
@@ -821,7 +852,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
int first_pt_load = 1;
unsigned long error;
struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
- struct elf_phdr *elf_property_phdata = NULL;
+ struct elf_phdr *elf_property_phdata = NULL, *sframe_phdr = NULL;
unsigned long elf_brk;
bool brk_moved = false;
int retval, i;
@@ -930,6 +961,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
executable_stack = EXSTACK_DISABLE_X;
break;
+ case PT_GNU_SFRAME:
+ sframe_phdr = elf_ppnt;
+ break;
+
case PT_LOPROC ... PT_HIPROC:
retval = arch_elf_pt_proc(elf_ex, elf_ppnt,
bprm->file, false,
@@ -1227,6 +1262,14 @@ static int load_elf_binary(struct linux_binprm *bprm)
elf_brk = k;
}
+ if (sframe_phdr) {
+ for (i = 0, elf_ppnt = elf_phdata;
+ i < elf_ex->e_phnum; i++, elf_ppnt++) {
+ if ((elf_ppnt->p_flags & PF_X))
+ elf_add_sframe(elf_ppnt, sframe_phdr, load_bias);
+ }
+ }
+
e_entry = elf_ex->e_entry + load_bias;
phdr_addr += load_bias;
elf_brk += load_bias;
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 819ded2d39de..92c16c94fca8 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -41,6 +41,7 @@ typedef __u16 Elf64_Versym;
#define PT_GNU_STACK (PT_LOOS + 0x474e551)
#define PT_GNU_RELRO (PT_LOOS + 0x474e552)
#define PT_GNU_PROPERTY (PT_LOOS + 0x474e553)
+#define PT_GNU_SFRAME (PT_LOOS + 0x474e554)
/* ARM MTE memory tag segment type */
--
2.47.2
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 06/12] unwind_user/sframe: Wire up unwind_user to sframe
2025-07-08 2:11 [PATCH v8 00/12] unwind_deferred: Implement sframe handling Steven Rostedt
` (4 preceding siblings ...)
2025-07-08 2:11 ` [PATCH v8 05/12] unwind_user/sframe: Detect .sframe sections in executables Steven Rostedt
@ 2025-07-08 2:11 ` Steven Rostedt
2025-07-08 19:58 ` Mathieu Desnoyers
2025-07-08 2:11 ` [PATCH v8 07/12] unwind_user/sframe/x86: Enable sframe unwinding on x86 Steven Rostedt
` (5 subsequent siblings)
11 siblings, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2025-07-08 2:11 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Josh Poimboeuf <jpoimboe@kernel.org>
Now that the sframe infrastructure is fully in place, make it work by
hooking it up to the unwind_user interface.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
arch/Kconfig | 1 +
include/linux/unwind_user_types.h | 1 +
kernel/unwind/user.c | 25 ++++++++++++++++++++++---
3 files changed, 24 insertions(+), 3 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index c54d35e2f860..0c6056ef13de 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -448,6 +448,7 @@ config HAVE_UNWIND_USER_COMPAT_FP
config HAVE_UNWIND_USER_SFRAME
bool
+ select UNWIND_USER
config HAVE_PERF_REGS
bool
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 0b6563951ca4..4d50476e950e 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -13,6 +13,7 @@ enum unwind_user_type {
UNWIND_USER_TYPE_NONE,
UNWIND_USER_TYPE_FP,
UNWIND_USER_TYPE_COMPAT_FP,
+ UNWIND_USER_TYPE_SFRAME,
};
struct unwind_stacktrace {
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 249d9e32fad7..6e7ca9f1293a 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -7,6 +7,7 @@
#include <linux/sched/task_stack.h>
#include <linux/unwind_user.h>
#include <linux/uaccess.h>
+#include <linux/sframe.h>
static struct unwind_user_frame fp_frame = {
ARCH_INIT_USER_FP_FRAME
@@ -31,6 +32,12 @@ static inline bool compat_fp_state(struct unwind_user_state *state)
state->type == UNWIND_USER_TYPE_COMPAT_FP;
}
+static inline bool sframe_state(struct unwind_user_state *state)
+{
+ return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_SFRAME) &&
+ state->type == UNWIND_USER_TYPE_SFRAME;
+}
+
#define unwind_get_user_long(to, from, state) \
({ \
int __ret; \
@@ -44,18 +51,28 @@ static inline bool compat_fp_state(struct unwind_user_state *state)
static int unwind_user_next(struct unwind_user_state *state)
{
struct unwind_user_frame *frame;
+ struct unwind_user_frame _frame;
unsigned long cfa = 0, fp, ra = 0;
unsigned int shift;
if (state->done)
return -EINVAL;
- if (compat_fp_state(state))
+ if (compat_fp_state(state)) {
frame = &compat_fp_frame;
- else if (fp_state(state))
+ } else if (sframe_state(state)) {
+ /* sframe expects the frame to be local storage */
+ frame = &_frame;
+ if (sframe_find(state->ip, frame)) {
+ if (!IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+ goto done;
+ frame = &fp_frame;
+ }
+ } else if (fp_state(state)) {
frame = &fp_frame;
- else
+ } else {
goto done;
+ }
if (frame->use_fp) {
if (state->fp < state->sp)
@@ -111,6 +128,8 @@ static int unwind_user_start(struct unwind_user_state *state)
if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) && in_compat_mode(regs))
state->type = UNWIND_USER_TYPE_COMPAT_FP;
+ else if (current_has_sframe())
+ state->type = UNWIND_USER_TYPE_SFRAME;
else if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
state->type = UNWIND_USER_TYPE_FP;
else
--
2.47.2
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 07/12] unwind_user/sframe/x86: Enable sframe unwinding on x86
2025-07-08 2:11 [PATCH v8 00/12] unwind_deferred: Implement sframe handling Steven Rostedt
` (5 preceding siblings ...)
2025-07-08 2:11 ` [PATCH v8 06/12] unwind_user/sframe: Wire up unwind_user to sframe Steven Rostedt
@ 2025-07-08 2:11 ` Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 08/12] unwind_user/sframe: Remove .sframe section on detected corruption Steven Rostedt
` (4 subsequent siblings)
11 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2025-07-08 2:11 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Josh Poimboeuf <jpoimboe@kernel.org>
The x86 sframe 2.0 implementation works fairly well, starting with
binutils 2.41 (though some bugs are getting fixed in later versions).
Enable it.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 17d4094c821b..8a382a6b9be3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -304,6 +304,7 @@ config X86
select HAVE_UNSTABLE_SCHED_CLOCK
select HAVE_UNWIND_USER_COMPAT_FP if IA32_EMULATION
select HAVE_UNWIND_USER_FP if X86_64
+ select HAVE_UNWIND_USER_SFRAME if X86_64
select HAVE_USER_RETURN_NOTIFIER
select HAVE_GENERIC_VDSO
select VDSO_GETRANDOM if X86_64
--
2.47.2
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 08/12] unwind_user/sframe: Remove .sframe section on detected corruption
2025-07-08 2:11 [PATCH v8 00/12] unwind_deferred: Implement sframe handling Steven Rostedt
` (6 preceding siblings ...)
2025-07-08 2:11 ` [PATCH v8 07/12] unwind_user/sframe/x86: Enable sframe unwinding on x86 Steven Rostedt
@ 2025-07-08 2:11 ` Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 09/12] unwind_user/sframe: Show file name in debug output Steven Rostedt
` (3 subsequent siblings)
11 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2025-07-08 2:11 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Josh Poimboeuf <jpoimboe@kernel.org>
To avoid continued attempted use of a bad .sframe section, remove it
on demand when the first sign of corruption is detected.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/unwind/sframe.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index b10420d19840..f246ead6c2a0 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -310,6 +310,10 @@ int sframe_find(unsigned long ip, struct unwind_user_frame *frame)
ret = __find_fre(sec, &fde, ip, frame);
end:
user_read_access_end();
+
+ if (ret == -EFAULT)
+ WARN_ON_ONCE(sframe_remove_section(sec->sframe_start));
+
return ret;
}
--
2.47.2
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 09/12] unwind_user/sframe: Show file name in debug output
2025-07-08 2:11 [PATCH v8 00/12] unwind_deferred: Implement sframe handling Steven Rostedt
` (7 preceding siblings ...)
2025-07-08 2:11 ` [PATCH v8 08/12] unwind_user/sframe: Remove .sframe section on detected corruption Steven Rostedt
@ 2025-07-08 2:11 ` Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 10/12] unwind_user/sframe: Enable debugging in uaccess regions Steven Rostedt
` (2 subsequent siblings)
11 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2025-07-08 2:11 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Josh Poimboeuf <jpoimboe@kernel.org>
When debugging sframe issues, the error messages aren't all that helpful
without knowing what file a corresponding .sframe section belongs to.
Prefix debug output strings with the file name.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/sframe.h | 4 +++-
kernel/unwind/sframe.c | 23 ++++++++++--------
kernel/unwind/sframe_debug.h | 45 +++++++++++++++++++++++++++++++-----
3 files changed, 56 insertions(+), 16 deletions(-)
diff --git a/include/linux/sframe.h b/include/linux/sframe.h
index 9a72209696f9..b79c5ec09229 100644
--- a/include/linux/sframe.h
+++ b/include/linux/sframe.h
@@ -10,7 +10,9 @@
struct sframe_section {
struct rcu_head rcu;
-
+#ifdef CONFIG_DYNAMIC_DEBUG
+ const char *filename;
+#endif
unsigned long sframe_start;
unsigned long sframe_end;
unsigned long text_start;
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index f246ead6c2a0..66d3ba3c8389 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -311,14 +311,17 @@ int sframe_find(unsigned long ip, struct unwind_user_frame *frame)
end:
user_read_access_end();
- if (ret == -EFAULT)
+ if (ret == -EFAULT) {
+ dbg_sec("removing bad .sframe section\n");
WARN_ON_ONCE(sframe_remove_section(sec->sframe_start));
+ }
return ret;
}
static void free_section(struct sframe_section *sec)
{
+ dbg_free(sec);
kfree(sec);
}
@@ -329,7 +332,7 @@ static int sframe_read_header(struct sframe_section *sec)
unsigned int num_fdes;
if (copy_from_user(&shdr, (void __user *)sec->sframe_start, sizeof(shdr))) {
- dbg("header usercopy failed\n");
+ dbg_sec("header usercopy failed\n");
return -EFAULT;
}
@@ -337,18 +340,18 @@ static int sframe_read_header(struct sframe_section *sec)
shdr.preamble.version != SFRAME_VERSION_2 ||
!(shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
shdr.auxhdr_len) {
- dbg("bad/unsupported sframe header\n");
+ dbg_sec("bad/unsupported sframe header\n");
return -EINVAL;
}
if (!shdr.num_fdes || !shdr.num_fres) {
- dbg("no fde/fre entries\n");
+ dbg_sec("no fde/fre entries\n");
return -EINVAL;
}
header_end = sec->sframe_start + SFRAME_HEADER_SIZE(shdr);
if (header_end >= sec->sframe_end) {
- dbg("header doesn't fit in section\n");
+ dbg_sec("header doesn't fit in section\n");
return -EINVAL;
}
@@ -360,7 +363,7 @@ static int sframe_read_header(struct sframe_section *sec)
fres_end = fres_start + shdr.fre_len;
if (fres_start < fdes_end || fres_end > sec->sframe_end) {
- dbg("inconsistent fde/fre offsets\n");
+ dbg_sec("inconsistent fde/fre offsets\n");
return -EINVAL;
}
@@ -416,6 +419,8 @@ int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
sec->text_start = text_start;
sec->text_end = text_end;
+ dbg_init(sec);
+
ret = sframe_read_header(sec);
if (ret) {
dbg_print_header(sec);
@@ -424,8 +429,8 @@ int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
ret = mtree_insert_range(sframe_mt, sec->text_start, sec->text_end, sec, GFP_KERNEL);
if (ret) {
- dbg("mtree_insert_range failed: text=%lx-%lx\n",
- sec->text_start, sec->text_end);
+ dbg_sec("mtree_insert_range failed: text=%lx-%lx\n",
+ sec->text_start, sec->text_end);
goto err_free;
}
@@ -447,7 +452,7 @@ static int __sframe_remove_section(struct mm_struct *mm,
struct sframe_section *sec)
{
if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
- dbg("mtree_erase failed: text=%lx\n", sec->text_start);
+ dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
return -EINVAL;
}
diff --git a/kernel/unwind/sframe_debug.h b/kernel/unwind/sframe_debug.h
index 055c8c8fae24..7794bf0bd78c 100644
--- a/kernel/unwind/sframe_debug.h
+++ b/kernel/unwind/sframe_debug.h
@@ -10,26 +10,59 @@
#define dbg(fmt, ...) \
pr_debug("%s (%d): " fmt, current->comm, current->pid, ##__VA_ARGS__)
+#define dbg_sec(fmt, ...) \
+ dbg("%s: " fmt, sec->filename, ##__VA_ARGS__)
+
static __always_inline void dbg_print_header(struct sframe_section *sec)
{
unsigned long fdes_end;
fdes_end = sec->fdes_start + (sec->num_fdes * sizeof(struct sframe_fde));
- dbg("SEC: sframe:0x%lx-0x%lx text:0x%lx-0x%lx "
- "fdes:0x%lx-0x%lx fres:0x%lx-0x%lx "
- "ra_off:%d fp_off:%d\n",
- sec->sframe_start, sec->sframe_end, sec->text_start, sec->text_end,
- sec->fdes_start, fdes_end, sec->fres_start, sec->fres_end,
- sec->ra_off, sec->fp_off);
+ dbg_sec("SEC: sframe:0x%lx-0x%lx text:0x%lx-0x%lx "
+ "fdes:0x%lx-0x%lx fres:0x%lx-0x%lx "
+ "ra_off:%d fp_off:%d\n",
+ sec->sframe_start, sec->sframe_end, sec->text_start, sec->text_end,
+ sec->fdes_start, fdes_end, sec->fres_start, sec->fres_end,
+ sec->ra_off, sec->fp_off);
+}
+
+static inline void dbg_init(struct sframe_section *sec)
+{
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+
+ guard(mmap_read_lock)(mm);
+ vma = vma_lookup(mm, sec->sframe_start);
+ if (!vma)
+ sec->filename = kstrdup("(vma gone???)", GFP_KERNEL);
+ else if (vma->vm_file)
+ sec->filename = kstrdup_quotable_file(vma->vm_file, GFP_KERNEL);
+ else if (vma->vm_ops && vma->vm_ops->name)
+ sec->filename = kstrdup(vma->vm_ops->name(vma), GFP_KERNEL);
+ else if (arch_vma_name(vma))
+ sec->filename = kstrdup(arch_vma_name(vma), GFP_KERNEL);
+ else if (!vma->vm_mm)
+ sec->filename = kstrdup("(vdso)", GFP_KERNEL);
+ else
+ sec->filename = kstrdup("(anonymous)", GFP_KERNEL);
+}
+
+static inline void dbg_free(struct sframe_section *sec)
+{
+ kfree(sec->filename);
}
#else /* !CONFIG_DYNAMIC_DEBUG */
#define dbg(args...) no_printk(args)
+#define dbg_sec(args... ) no_printk(args)
static inline void dbg_print_header(struct sframe_section *sec) {}
+static inline void dbg_init(struct sframe_section *sec) {}
+static inline void dbg_free(struct sframe_section *sec) {}
+
#endif /* !CONFIG_DYNAMIC_DEBUG */
#endif /* _SFRAME_DEBUG_H */
--
2.47.2
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 10/12] unwind_user/sframe: Enable debugging in uaccess regions
2025-07-08 2:11 [PATCH v8 00/12] unwind_deferred: Implement sframe handling Steven Rostedt
` (8 preceding siblings ...)
2025-07-08 2:11 ` [PATCH v8 09/12] unwind_user/sframe: Show file name in debug output Steven Rostedt
@ 2025-07-08 2:11 ` Steven Rostedt
2025-07-08 3:38 ` Linus Torvalds
2025-07-08 2:11 ` [PATCH v8 11/12] unwind_user/sframe: Add .sframe validation option Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 12/12] unwind_user/sframe: Add prctl() interface for registering .sframe sections Steven Rostedt
11 siblings, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2025-07-08 2:11 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Josh Poimboeuf <jpoimboe@kernel.org>
Objtool warns about calling pr_debug() from uaccess-enabled regions, and
rightfully so. Add a dbg_sec_uaccess() macro which temporarily disables
uaccess before doing the dynamic printk, and use that to add debug
messages throughout the uaccess-enabled regions.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/unwind/sframe.c | 60 ++++++++++++++++++++++++++++--------
kernel/unwind/sframe_debug.h | 31 +++++++++++++++++++
2 files changed, 78 insertions(+), 13 deletions(-)
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index 66d3ba3c8389..3972bce40fc7 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -53,12 +53,15 @@ static __always_inline int __read_fde(struct sframe_section *sec,
sizeof(struct sframe_fde), Efault);
ip = sec->sframe_start + fde->start_addr;
- if (ip < sec->text_start || ip > sec->text_end)
+ if (ip < sec->text_start || ip > sec->text_end) {
+ dbg_sec_uaccess("bad fde num %d\n", fde_num);
return -EINVAL;
+ }
return 0;
Efault:
+ dbg_sec_uaccess("fde %d usercopy failed\n", fde_num);
return -EFAULT;
}
@@ -85,16 +88,22 @@ static __always_inline int __find_fde(struct sframe_section *sec,
unsafe_get_user(func_off, (s32 __user *)mid, Efault);
if (ip_off >= func_off) {
- if (func_off < func_off_low)
+ if (func_off < func_off_low) {
+ dbg_sec_uaccess("fde %u not sorted\n",
+ (unsigned int)(mid - first));
return -EFAULT;
+ }
func_off_low = func_off;
found = mid;
low = mid + 1;
} else {
- if (func_off > func_off_high)
+ if (func_off > func_off_high) {
+ dbg_sec_uaccess("fde %u not sorted\n",
+ (unsigned int)(mid - first));
return -EFAULT;
+ }
func_off_high = func_off;
@@ -116,6 +125,7 @@ static __always_inline int __find_fde(struct sframe_section *sec,
return 0;
Efault:
+ dbg_sec_uaccess("fde usercopy failed\n");
return -EFAULT;
}
@@ -140,6 +150,8 @@ static __always_inline int __find_fde(struct sframe_section *sec,
____UNSAFE_GET_USER_INC(to, from, u_or_s##32, label); \
break; \
default: \
+ dbg_sec_uaccess("%d: bad UNSAFE_GET_USER_INC size %u\n",\
+ __LINE__, size); \
return -EFAULT; \
} \
})
@@ -174,24 +186,34 @@ static __always_inline int __read_fre(struct sframe_section *sec,
u8 info;
addr_size = fre_type_to_size(fre_type);
- if (!addr_size)
+ if (!addr_size) {
+ dbg_sec_uaccess("bad addr_size in fde info %u\n", fde->info);
return -EFAULT;
+ }
- if (fre_addr + addr_size + 1 > sec->fres_end)
+ if (fre_addr + addr_size + 1 > sec->fres_end) {
+ dbg_sec_uaccess("fre addr+info goes past end of subsection\n");
return -EFAULT;
+ }
UNSAFE_GET_USER_INC(ip_off, cur, addr_size, Efault);
- if (fde_type == SFRAME_FDE_TYPE_PCINC && ip_off > fde->func_size)
+ if (fde_type == SFRAME_FDE_TYPE_PCINC && ip_off > fde->func_size) {
+ dbg_sec_uaccess("fre starts past end of function: ip_off=0x%x, func_size=0x%x\n",
+ ip_off, fde->func_size);
return -EFAULT;
+ }
UNSAFE_GET_USER_INC(info, cur, 1, Efault);
offset_count = SFRAME_FRE_OFFSET_COUNT(info);
offset_size = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(info));
- if (!offset_count || !offset_size)
+ if (!offset_count || !offset_size) {
+ dbg_sec_uaccess("zero offset_count or size in fre info %u\n",info);
return -EFAULT;
-
- if (cur + (offset_count * offset_size) > sec->fres_end)
+ }
+ if (cur + (offset_count * offset_size) > sec->fres_end) {
+ dbg_sec_uaccess("fre goes past end of subsection\n");
return -EFAULT;
+ }
fre->size = addr_size + 1 + (offset_count * offset_size);
@@ -200,8 +222,10 @@ static __always_inline int __read_fre(struct sframe_section *sec,
ra_off = sec->ra_off;
if (!ra_off) {
- if (!offset_count--)
+ if (!offset_count--) {
+ dbg_sec_uaccess("zero offset_count, can't find ra_off\n");
return -EFAULT;
+ }
UNSAFE_GET_USER_INC(ra_off, cur, offset_size, Efault);
}
@@ -212,8 +236,10 @@ static __always_inline int __read_fre(struct sframe_section *sec,
UNSAFE_GET_USER_INC(fp_off, cur, offset_size, Efault);
}
- if (offset_count)
+ if (offset_count) {
+ dbg_sec_uaccess("non-zero offset_count after reading fre\n");
return -EFAULT;
+ }
fre->ip_off = ip_off;
fre->cfa_off = cfa_off;
@@ -224,6 +250,7 @@ static __always_inline int __read_fre(struct sframe_section *sec,
return 0;
Efault:
+ dbg_sec_uaccess("fre usercopy failed\n");
return -EFAULT;
}
@@ -257,13 +284,20 @@ static __always_inline int __find_fre(struct sframe_section *sec,
which = !which;
ret = __read_fre(sec, fde, fre_addr, fre);
- if (ret)
+ if (ret) {
+ dbg_sec_uaccess("fde addr 0x%x: __read_fre(%u) failed\n",
+ fde->start_addr, i);
+ dbg_print_fde_uaccess(sec, fde);
return ret;
+ }
fre_addr += fre->size;
- if (prev_fre && fre->ip_off <= prev_fre->ip_off)
+ if (prev_fre && fre->ip_off <= prev_fre->ip_off) {
+ dbg_sec_uaccess("fde addr 0x%x: fre %u not sorted\n",
+ fde->start_addr, i);
return -EFAULT;
+ }
if (fre->ip_off > ip_off)
break;
diff --git a/kernel/unwind/sframe_debug.h b/kernel/unwind/sframe_debug.h
index 7794bf0bd78c..045e9c0b16c9 100644
--- a/kernel/unwind/sframe_debug.h
+++ b/kernel/unwind/sframe_debug.h
@@ -13,6 +13,26 @@
#define dbg_sec(fmt, ...) \
dbg("%s: " fmt, sec->filename, ##__VA_ARGS__)
+#define __dbg_sec_descriptor(fmt, ...) \
+ __dynamic_pr_debug(&descriptor, "sframe: %s: " fmt, \
+ sec->filename, ##__VA_ARGS__)
+
+/*
+ * To avoid breaking uaccess rules, temporarily disable uaccess
+ * before calling printk.
+ */
+#define dbg_sec_uaccess(fmt, ...) \
+({ \
+ DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt); \
+ if (DYNAMIC_DEBUG_BRANCH(descriptor)) { \
+ user_read_access_end(); \
+ __dbg_sec_descriptor(fmt, ##__VA_ARGS__); \
+ BUG_ON(!user_read_access_begin( \
+ (void __user *)sec->sframe_start, \
+ sec->sframe_end - sec->sframe_start)); \
+ } \
+})
+
static __always_inline void dbg_print_header(struct sframe_section *sec)
{
unsigned long fdes_end;
@@ -27,6 +47,15 @@ static __always_inline void dbg_print_header(struct sframe_section *sec)
sec->ra_off, sec->fp_off);
}
+static __always_inline void dbg_print_fde_uaccess(struct sframe_section *sec,
+ struct sframe_fde *fde)
+{
+ dbg_sec_uaccess("FDE: start_addr:0x%x func_size:0x%x "
+ "fres_off:0x%x fres_num:%d info:%u rep_size:%u\n",
+ fde->start_addr, fde->func_size,
+ fde->fres_off, fde->fres_num, fde->info, fde->rep_size);
+}
+
static inline void dbg_init(struct sframe_section *sec)
{
struct mm_struct *mm = current->mm;
@@ -57,8 +86,10 @@ static inline void dbg_free(struct sframe_section *sec)
#define dbg(args...) no_printk(args)
#define dbg_sec(args... ) no_printk(args)
+#define dbg_sec_uaccess(args...) no_printk(args)
static inline void dbg_print_header(struct sframe_section *sec) {}
+static inline void dbg_print_fde_uaccess(struct sframe_section *sec, struct sframe_fde *fde) {}
static inline void dbg_init(struct sframe_section *sec) {}
static inline void dbg_free(struct sframe_section *sec) {}
--
2.47.2
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 11/12] unwind_user/sframe: Add .sframe validation option
2025-07-08 2:11 [PATCH v8 00/12] unwind_deferred: Implement sframe handling Steven Rostedt
` (9 preceding siblings ...)
2025-07-08 2:11 ` [PATCH v8 10/12] unwind_user/sframe: Enable debugging in uaccess regions Steven Rostedt
@ 2025-07-08 2:11 ` Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 12/12] unwind_user/sframe: Add prctl() interface for registering .sframe sections Steven Rostedt
11 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2025-07-08 2:11 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Josh Poimboeuf <jpoimboe@kernel.org>
Add a debug feature to validate all .sframe sections when first loading
the file rather than on demand.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
arch/Kconfig | 19 ++++++++++
kernel/unwind/sframe.c | 81 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 100 insertions(+)
diff --git a/arch/Kconfig b/arch/Kconfig
index 0c6056ef13de..86eec85cb898 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -450,6 +450,25 @@ config HAVE_UNWIND_USER_SFRAME
bool
select UNWIND_USER
+config SFRAME_VALIDATION
+ bool "Enable .sframe section debugging"
+ depends on HAVE_UNWIND_USER_SFRAME
+ depends on DYNAMIC_DEBUG
+ help
+ When adding an .sframe section for a task, validate the entire
+ section immediately rather than on demand.
+
+ This is a debug feature which is helpful for rooting out .sframe
+ section issues. If the .sframe section is corrupt, it will fail to
+ load immediately, with more information provided in dynamic printks.
+
+ This has a significant page cache footprint due to its reading of the
+ entire .sframe section for every loaded executable and shared
+ library. Also, it's done for all processes, even those which don't
+ get stack traced by the kernel. Not recommended for general use.
+
+ If unsure, say N.
+
config HAVE_PERF_REGS
bool
help
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index 3972bce40fc7..6159f072bdb6 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -353,6 +353,83 @@ int sframe_find(unsigned long ip, struct unwind_user_frame *frame)
return ret;
}
+#ifdef CONFIG_SFRAME_VALIDATION
+
+static __always_inline int __sframe_validate_section(struct sframe_section *sec)
+{
+ unsigned long prev_ip = 0;
+ unsigned int i;
+
+ for (i = 0; i < sec->num_fdes; i++) {
+ struct sframe_fre *fre, *prev_fre = NULL;
+ unsigned long ip, fre_addr;
+ struct sframe_fde fde;
+ struct sframe_fre fres[2];
+ bool which = false;
+ unsigned int j;
+ int ret;
+
+ ret = __read_fde(sec, i, &fde);
+ if (ret)
+ return ret;
+
+ ip = sec->sframe_start + fde.start_addr;
+ if (ip <= prev_ip) {
+ dbg_sec_uaccess("fde %u not sorted\n", i);
+ return -EFAULT;
+ }
+ prev_ip = ip;
+
+ fre_addr = sec->fres_start + fde.fres_off;
+ for (j = 0; j < fde.fres_num; j++) {
+ int ret;
+
+ fre = which ? fres : fres + 1;
+ which = !which;
+
+ ret = __read_fre(sec, &fde, fre_addr, fre);
+ if (ret) {
+ dbg_sec_uaccess("fde %u: __read_fre(%u) failed\n", i, j);
+ dbg_print_fde_uaccess(sec, &fde);
+ return ret;
+ }
+
+ fre_addr += fre->size;
+
+ if (prev_fre && fre->ip_off <= prev_fre->ip_off) {
+ dbg_sec_uaccess("fde %u: fre %u not sorted\n", i, j);
+ return -EFAULT;
+ }
+
+ prev_fre = fre;
+ }
+ }
+
+ return 0;
+}
+
+static int sframe_validate_section(struct sframe_section *sec)
+{
+ int ret;
+
+ if (!user_read_access_begin((void __user *)sec->sframe_start,
+ sec->sframe_end - sec->sframe_start)) {
+ dbg_sec("section usercopy failed\n");
+ return -EFAULT;
+ }
+
+ ret = __sframe_validate_section(sec);
+ user_read_access_end();
+ return ret;
+}
+
+#else /* !CONFIG_SFRAME_VALIDATION */
+
+static int sframe_validate_section(struct sframe_section *sec) { return 0; }
+
+#endif /* !CONFIG_SFRAME_VALIDATION */
+
+
static void free_section(struct sframe_section *sec)
{
dbg_free(sec);
@@ -461,6 +538,10 @@ int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
goto err_free;
}
+ ret = sframe_validate_section(sec);
+ if (ret)
+ goto err_free;
+
ret = mtree_insert_range(sframe_mt, sec->text_start, sec->text_end, sec, GFP_KERNEL);
if (ret) {
dbg_sec("mtree_insert_range failed: text=%lx-%lx\n",
--
2.47.2
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 12/12] unwind_user/sframe: Add prctl() interface for registering .sframe sections
2025-07-08 2:11 [PATCH v8 00/12] unwind_deferred: Implement sframe handling Steven Rostedt
` (10 preceding siblings ...)
2025-07-08 2:11 ` [PATCH v8 11/12] unwind_user/sframe: Add .sframe validation option Steven Rostedt
@ 2025-07-08 2:11 ` Steven Rostedt
11 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2025-07-08 2:11 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
Jens Axboe, Florian Weimer, Sam James
From: Josh Poimboeuf <jpoimboe@kernel.org>
The kernel doesn't have direct visibility to the ELF contents of shared
libraries. Add some prctl() interfaces which allow glibc to tell the
kernel where to find .sframe sections.
[
This adds an interface for prctl() for testing loading of sframes for
libraries. But this interface should really be a system call. This patch
is for testing purposes only and should not be applied to mainline.
]
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/uapi/linux/prctl.h | 6 +++++-
kernel/sys.c | 9 +++++++++
2 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 43dec6eed559..c575cf7151b1 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -351,7 +351,7 @@ struct prctl_mm_map {
* configuration. All bits may be locked via this call, including
* undefined bits.
*/
-#define PR_LOCK_SHADOW_STACK_STATUS 76
+#define PR_LOCK_SHADOW_STACK_STATUS 76
/*
* Controls the mode of timer_create() for CRIU restore operations.
@@ -371,4 +371,8 @@ struct prctl_mm_map {
# define PR_FUTEX_HASH_GET_SLOTS 2
# define PR_FUTEX_HASH_GET_IMMUTABLE 3
+/* SFRAME management */
+#define PR_ADD_SFRAME 79
+#define PR_REMOVE_SFRAME 80
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index adc0de0aa364..cf788e66dc86 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -65,6 +65,7 @@
#include <linux/rcupdate.h>
#include <linux/uidgid.h>
#include <linux/cred.h>
+#include <linux/sframe.h>
#include <linux/nospec.h>
@@ -2824,6 +2825,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_FUTEX_HASH:
error = futex_hash_prctl(arg2, arg3, arg4);
break;
+ case PR_ADD_SFRAME:
+ error = sframe_add_section(arg2, arg3, arg4, arg5);
+ break;
+ case PR_REMOVE_SFRAME:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = sframe_remove_section(arg2);
+ break;
default:
trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
error = -EINVAL;
--
2.47.2
^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH v8 10/12] unwind_user/sframe: Enable debugging in uaccess regions
2025-07-08 2:11 ` [PATCH v8 10/12] unwind_user/sframe: Enable debugging in uaccess regions Steven Rostedt
@ 2025-07-08 3:38 ` Linus Torvalds
2025-07-08 13:23 ` Steven Rostedt
0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2025-07-08 3:38 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, linux-trace-kernel, bpf, x86, Masami Hiramatsu,
Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Andrew Morton, Jens Axboe, Florian Weimer, Sam James
On Mon, 7 Jul 2025 at 19:12, Steven Rostedt <rostedt@kernel.org> wrote:
>
> From: Josh Poimboeuf <jpoimboe@kernel.org>
>
> Objtool warns about calling pr_debug() from uaccess-enabled regions, and
> rightfully so. Add a dbg_sec_uaccess() macro which temporarily disables
> uaccess before doing the dynamic printk, and use that to add debug
> messages throughout the uaccess-enabled regions.
Please kill this patch, and stop doing stupid things.
The whole AND ONLY point of using the unsafe user access macros is
performance. You are now actively breaking the whole point of them.
If you need to add debug printouts to user accesses, just don't use
the 'unsafe' ones.
Or add the debug to after the unsafe region has finished. Not this way.
This patch is disgusting, in other words. It's wrong. STOP IT.
Linus
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 10/12] unwind_user/sframe: Enable debugging in uaccess regions
2025-07-08 3:38 ` Linus Torvalds
@ 2025-07-08 13:23 ` Steven Rostedt
2025-07-08 14:34 ` Josh Poimboeuf
0 siblings, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2025-07-08 13:23 UTC (permalink / raw)
To: Linus Torvalds
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Jens Remus, Andrew Morton, Jens Axboe,
Florian Weimer, Sam James
On Mon, 7 Jul 2025 20:38:35 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Mon, 7 Jul 2025 at 19:12, Steven Rostedt <rostedt@kernel.org> wrote:
> This patch is disgusting, in other words. It's wrong. STOP IT.
>
No problem, I can easily drop it.
I just took Josh's PoC patches and posted them. This particular series
still needs a bit of work. That's one of the reasons I split it out of the
other series. The core series is at the stage for acceptance and looking
for feedback from other maintainers. This series is still a work in
progress. Others have asked me to post them so they can start playing with
it.
We still need to come up with a system call with a robust API to allow the
dynamic linker to tell the kernel where the SFrames are for the libraries
it loads. Hence the "DO NOT APPLY" patch at the end (which I just noticed the
subject text got dropped when I pulled it into git from patchwork and sent
out this version, at least the change long still suggest it shouldn't be
applied).
But I will remove this patch from the queue. I never even used this
debugging. What I did was inject trace_printk() all over the place to make
sure it was working as expected.
Thanks,
-- Steve
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 10/12] unwind_user/sframe: Enable debugging in uaccess regions
2025-07-08 13:23 ` Steven Rostedt
@ 2025-07-08 14:34 ` Josh Poimboeuf
2025-07-08 14:41 ` Steven Rostedt
2025-07-08 15:52 ` Linus Torvalds
0 siblings, 2 replies; 34+ messages in thread
From: Josh Poimboeuf @ 2025-07-08 14:34 UTC (permalink / raw)
To: Steven Rostedt
Cc: Linus Torvalds, Steven Rostedt, linux-kernel, linux-trace-kernel,
bpf, x86, Masami Hiramatsu, Mathieu Desnoyers, Peter Zijlstra,
Ingo Molnar, Jiri Olsa, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Jens Remus, Andrew Morton, Jens Axboe, Florian Weimer, Sam James
On Tue, Jul 08, 2025 at 09:23:51AM -0400, Steven Rostedt wrote:
> On Mon, 7 Jul 2025 20:38:35 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > On Mon, 7 Jul 2025 at 19:12, Steven Rostedt <rostedt@kernel.org> wrote:
>
> > This patch is disgusting, in other words. It's wrong. STOP IT.
> >
>
> No problem, I can easily drop it.
>
> I just took Josh's PoC patches and posted them. This particular series
> still needs a bit of work. That's one of the reasons I split it out of the
> other series. The core series is at the stage for acceptance and looking
> for feedback from other maintainers. This series is still a work in
> progress. Others have asked me to post them so they can start playing with
> it.
>
> We still need to come up with a system call with a robust API to allow the
> dynamic linker to tell the kernel where the SFrames are for the libraries
> it loads. Hence the "DO NOT APPLY" patch at the end (which I just noticed the
> subject text got dropped when I pulled it into git from patchwork and sent
> out this version, at least the change long still suggest it shouldn't be
> applied).
>
> But I will remove this patch from the queue. I never even used this
> debugging. What I did was inject trace_printk() all over the place to make
> sure it was working as expected.
I had found those debug printks really useful for debugging
corrupt/missing .sframe data, but yeah, this patch is ridiculously bad.
Sorry for putting that out into the world ;-)
And those are all error paths, so it's rather pointless to do that whole
uaccess disable/enable/disable dance.
So yeah, drop it for now and I can replace it with something better
later on.
--
Josh
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 10/12] unwind_user/sframe: Enable debugging in uaccess regions
2025-07-08 14:34 ` Josh Poimboeuf
@ 2025-07-08 14:41 ` Steven Rostedt
2025-07-08 15:53 ` Linus Torvalds
2025-07-08 15:52 ` Linus Torvalds
1 sibling, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2025-07-08 14:41 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Linus Torvalds, Steven Rostedt, linux-kernel, linux-trace-kernel,
bpf, x86, Masami Hiramatsu, Mathieu Desnoyers, Peter Zijlstra,
Ingo Molnar, Jiri Olsa, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Jens Remus, Andrew Morton, Jens Axboe, Florian Weimer, Sam James
On Tue, 8 Jul 2025 07:34:36 -0700
Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> I had found those debug printks really useful for debugging
> corrupt/missing .sframe data, but yeah, this patch is ridiculously bad.
> Sorry for putting that out into the world ;-)
>
> And those are all error paths, so it's rather pointless to do that whole
> uaccess disable/enable/disable dance.
>
> So yeah, drop it for now and I can replace it with something better
> later on.
Would something like this work? If someone enables the config to enable the
validation, I don't think we need dynamic printk to do it (as that requires
passing in the format directly and not via a pointer).
-- Steve
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index 0060cc576776..524738e2b823 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -321,11 +321,24 @@ int sframe_find(unsigned long ip, struct unwind_user_frame *frame)
#ifdef CONFIG_SFRAME_VALIDATION
-static __always_inline int __sframe_validate_section(struct sframe_section *sec)
+/* Used to save error messages in uaccess sections */
+struct err_msg {
+ const char *fmt;
+ int param1;
+ int param2;
+ long param3;
+ long param4;
+};
+
+static __always_inline
+int __sframe_validate_section(struct sframe_section *sec, struct err_msg *err)
{
unsigned long prev_ip = 0;
unsigned int i;
+/* current->comm, current->pid, sec->filename */
+#define ERR_HDR KERN_WARNING "%s (%d) %s: "
+
for (i = 0; i < sec->num_fdes; i++) {
struct sframe_fre *fre, *prev_fre = NULL;
unsigned long ip, fre_addr;
@@ -341,7 +354,8 @@ static __always_inline int __sframe_validate_section(struct sframe_section *sec)
ip = sec->sframe_start + fde.start_addr;
if (ip <= prev_ip) {
- dbg_sec_uaccess("fde %u not sorted\n", i);
+ err->fmt = ERR_HDR "fde %u not sorted\n";
+ err->param1 = i;
return -EFAULT;
}
prev_ip = ip;
@@ -355,15 +369,23 @@ static __always_inline int __sframe_validate_section(struct sframe_section *sec)
ret = __read_fre(sec, &fde, fre_addr, fre);
if (ret) {
- dbg_sec_uaccess("fde %u: __read_fre(%u) failed\n", i, j);
- dbg_print_fde_uaccess(sec, &fde);
+ err->fmt = ERR_HDR
+ "fde %u: __read_fre(%u) failed\n"
+ " frame_start=%lx frame_end=%lx\n";
+ err->param1 = i;
+ err->param2 = j;
+ err->param3 = (long)sec->sframe_start;
+ err->param4 = (long)sec->sframe_end;
return ret;
}
fre_addr += fre->size;
if (prev_fre && fre->ip_off <= prev_fre->ip_off) {
- dbg_sec_uaccess("fde %u: fre %u not sorted\n", i, j);
+ err->fmt = ERR_HDR
+ "fde %u: fre %u not sorted\n";
+ err->param1 = i;
+ err->param2 = j;
return -EFAULT;
}
@@ -376,16 +398,26 @@ static __always_inline int __sframe_validate_section(struct sframe_section *sec)
static int sframe_validate_section(struct sframe_section *sec)
{
+ struct err_msg err;
int ret;
+ memset(&err, 0, sizeof(err));
+
if (!user_read_access_begin((void __user *)sec->sframe_start,
sec->sframe_end - sec->sframe_start)) {
- dbg_sec("section usercopy failed\n");
+ pr_warn("%s (%d): section usercopy failed\n",
+ current->comm, current->pid);
return -EFAULT;
}
- ret = __sframe_validate_section(sec);
+ ret = __sframe_validate_section(sec, &err);
user_read_access_end();
+
+ if (ret) {
+ printk(err.fmt, current->comm, current->pid,
+ sec->filename, err.param1, err.param2,
+ err.param3, err.param4);
+ }
return ret;
}
^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH v8 10/12] unwind_user/sframe: Enable debugging in uaccess regions
2025-07-08 14:34 ` Josh Poimboeuf
2025-07-08 14:41 ` Steven Rostedt
@ 2025-07-08 15:52 ` Linus Torvalds
1 sibling, 0 replies; 34+ messages in thread
From: Linus Torvalds @ 2025-07-08 15:52 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Steven Rostedt, Steven Rostedt, linux-kernel, linux-trace-kernel,
bpf, x86, Masami Hiramatsu, Mathieu Desnoyers, Peter Zijlstra,
Ingo Molnar, Jiri Olsa, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Jens Remus, Andrew Morton, Jens Axboe, Florian Weimer, Sam James
On Tue, 8 Jul 2025 at 07:34, Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> I had found those debug printks really useful for debugging
> corrupt/missing .sframe data, but yeah, this patch is ridiculously bad.
> Sorry for putting that out into the world ;-)
I suspect that code that is still needs that level of debugging should
just not use the 'unsafe' user access helpers.
They really are meant for "this sequence turns into three CPU
instructions" kind of uses, and the "unsafe" part of the naming was
very much intended to be a "please don't use this unless you are being
very careful and limited" marker.
Now, I do think that the "goto label for exceptions" part of the
unsafe accessors can be very convenient, so maybe we should make
_that_ part of the interface more widely available. IOW, without the
whole user_read_access_begin/user_read_access_end dance?
That model is already used by "__{get,put}_kernel_nofault()", but I
think it's limited to just some unusual code in mm/maccess.c.
Linus
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 10/12] unwind_user/sframe: Enable debugging in uaccess regions
2025-07-08 14:41 ` Steven Rostedt
@ 2025-07-08 15:53 ` Linus Torvalds
2025-07-08 16:31 ` Steven Rostedt
0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2025-07-08 15:53 UTC (permalink / raw)
To: Steven Rostedt
Cc: Josh Poimboeuf, Steven Rostedt, linux-kernel, linux-trace-kernel,
bpf, x86, Masami Hiramatsu, Mathieu Desnoyers, Peter Zijlstra,
Ingo Molnar, Jiri Olsa, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Jens Remus, Andrew Morton, Jens Axboe, Florian Weimer, Sam James
On Tue, 8 Jul 2025 at 07:41, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Would something like this work? If someone enables the config to enable the
> validation, I don't think we need dynamic printk to do it (as that requires
> passing in the format directly and not via a pointer).
I really think you should just not use 'user_access_begin()" AT ALL if
you need to play these kinds of games.
Linus
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 10/12] unwind_user/sframe: Enable debugging in uaccess regions
2025-07-08 15:53 ` Linus Torvalds
@ 2025-07-08 16:31 ` Steven Rostedt
2025-07-08 18:57 ` Josh Poimboeuf
0 siblings, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2025-07-08 16:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Josh Poimboeuf, Steven Rostedt, linux-kernel, linux-trace-kernel,
bpf, x86, Masami Hiramatsu, Mathieu Desnoyers, Peter Zijlstra,
Ingo Molnar, Jiri Olsa, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Jens Remus, Andrew Morton, Jens Axboe, Florian Weimer, Sam James
On Tue, 8 Jul 2025 08:53:56 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, 8 Jul 2025 at 07:41, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > Would something like this work? If someone enables the config to enable the
> > validation, I don't think we need dynamic printk to do it (as that requires
> > passing in the format directly and not via a pointer).
>
> I really think you should just not use 'user_access_begin()" AT ALL if
> you need to play these kinds of games.
>
Looking at the code a bit deeper, I don't think we need to play these games
and still keep the user_read_access_begin().
The places that are more performance critical (where it reads the sframe
during normal stack walking during profiling) has no debug output, and
there's nothing there that needs to take it out of the user_read_access
area.
It's the validator that adds these hacks. I don't think it needs to. It can
just wrap the calls to the code that requires user_read_access and then
check the return value. The validator is just a debugging feature and
performance should not be an issue.
But I do think performance is something to care about during normal
operations where the one big user_read_access_begin() can help.
What about something like this? It adds "safe" versions of the user space
access functions and uses them only in the slow (we don't care about
performance) validator:
-- Steve
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index 0060cc576776..79ff3c0fc11f 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -321,7 +321,34 @@ int sframe_find(unsigned long ip, struct unwind_user_frame *frame)
#ifdef CONFIG_SFRAME_VALIDATION
-static __always_inline int __sframe_validate_section(struct sframe_section *sec)
+static int safe_read_fde(struct sframe_section *sec,
+ unsigned int fde_num, struct sframe_fde *fde)
+{
+ int ret;
+
+ if (!user_read_access_begin((void __user *)sec->sframe_start,
+ sec->sframe_end - sec->sframe_start))
+ return -EFAULT;
+ ret = __read_fde(sec, fde_num, fde);
+ user_read_access_end();
+ return ret;
+}
+
+static int safe_read_fre(struct sframe_section *sec,
+ struct sframe_fde *fde, unsigned long fre_addr,
+ struct sframe_fre *fre)
+{
+ int ret;
+
+ if (!user_read_access_begin((void __user *)sec->sframe_start,
+ sec->sframe_end - sec->sframe_start))
+ return -EFAULT;
+ ret = __read_fre(sec, fde, fre_addr, fre);
+ user_read_access_end();
+ return ret;
+}
+
+static int sframe_validate_section(struct sframe_section *sec)
{
unsigned long prev_ip = 0;
unsigned int i;
@@ -335,13 +362,13 @@ static __always_inline int __sframe_validate_section(struct sframe_section *sec)
unsigned int j;
int ret;
- ret = __read_fde(sec, i, &fde);
+ ret = safe_read_fde(sec, i, &fde);
if (ret)
return ret;
ip = sec->sframe_start + fde.start_addr;
if (ip <= prev_ip) {
- dbg_sec_uaccess("fde %u not sorted\n", i);
+ dbg_sec("fde %u not sorted\n", i);
return -EFAULT;
}
prev_ip = ip;
@@ -353,17 +380,20 @@ static __always_inline int __sframe_validate_section(struct sframe_section *sec)
fre = which ? fres : fres + 1;
which = !which;
- ret = __read_fre(sec, &fde, fre_addr, fre);
+ ret = safe_read_fre(sec, &fde, fre_addr, fre);
if (ret) {
- dbg_sec_uaccess("fde %u: __read_fre(%u) failed\n", i, j);
- dbg_print_fde_uaccess(sec, &fde);
+ dbg_sec("fde %u: __read_fre(%u) failed\n", i, j);
+ dbg_sec("FDE: start_addr:0x%x func_size:0x%x fres_off:0x%x fres_num:%d info:%u rep_size:%u\n",
+ fde.start_addr, fde.func_size,
+ fde.fres_off, fde.fres_num,
+ fde.info, fde.rep_size);
return ret;
}
fre_addr += fre->size;
if (prev_fre && fre->ip_off <= prev_fre->ip_off) {
- dbg_sec_uaccess("fde %u: fre %u not sorted\n", i, j);
+ dbg_sec("fde %u: fre %u not sorted\n", i, j);
return -EFAULT;
}
@@ -374,21 +404,6 @@ static __always_inline int __sframe_validate_section(struct sframe_section *sec)
return 0;
}
-static int sframe_validate_section(struct sframe_section *sec)
-{
- int ret;
-
- if (!user_read_access_begin((void __user *)sec->sframe_start,
- sec->sframe_end - sec->sframe_start)) {
- dbg_sec("section usercopy failed\n");
- return -EFAULT;
- }
-
- ret = __sframe_validate_section(sec);
- user_read_access_end();
- return ret;
-}
-
#else /* !CONFIG_SFRAME_VALIDATION */
static int sframe_validate_section(struct sframe_section *sec) { return 0; }
^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH v8 10/12] unwind_user/sframe: Enable debugging in uaccess regions
2025-07-08 16:31 ` Steven Rostedt
@ 2025-07-08 18:57 ` Josh Poimboeuf
0 siblings, 0 replies; 34+ messages in thread
From: Josh Poimboeuf @ 2025-07-08 18:57 UTC (permalink / raw)
To: Steven Rostedt
Cc: Linus Torvalds, Steven Rostedt, linux-kernel, linux-trace-kernel,
bpf, x86, Masami Hiramatsu, Mathieu Desnoyers, Peter Zijlstra,
Ingo Molnar, Jiri Olsa, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Jens Remus, Andrew Morton, Jens Axboe, Florian Weimer, Sam James
On Tue, Jul 08, 2025 at 12:31:20PM -0400, Steven Rostedt wrote:
> On Tue, 8 Jul 2025 08:53:56 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > On Tue, 8 Jul 2025 at 07:41, Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > Would something like this work? If someone enables the config to enable the
> > > validation, I don't think we need dynamic printk to do it (as that requires
> > > passing in the format directly and not via a pointer).
> >
> > I really think you should just not use 'user_access_begin()" AT ALL if
> > you need to play these kinds of games.
> >
>
> Looking at the code a bit deeper, I don't think we need to play these games
> and still keep the user_read_access_begin().
>
> The places that are more performance critical (where it reads the sframe
> during normal stack walking during profiling) has no debug output, and
> there's nothing there that needs to take it out of the user_read_access
> area.
>
> It's the validator that adds these hacks. I don't think it needs to. It can
> just wrap the calls to the code that requires user_read_access and then
> check the return value. The validator is just a debugging feature and
> performance should not be an issue.
>
> But I do think performance is something to care about during normal
> operations where the one big user_read_access_begin() can help.
>
> What about something like this? It adds "safe" versions of the user space
> access functions and uses them only in the slow (we don't care about
> performance) validator:
Looks good to me, thanks!
--
Josh
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/12] unwind_user/sframe: Wire up unwind_user to sframe
2025-07-08 2:11 ` [PATCH v8 06/12] unwind_user/sframe: Wire up unwind_user to sframe Steven Rostedt
@ 2025-07-08 19:58 ` Mathieu Desnoyers
2025-07-08 20:11 ` Steven Rostedt
0 siblings, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2025-07-08 19:58 UTC (permalink / raw)
To: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86
Cc: Masami Hiramatsu, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On 2025-07-07 22:11, Steven Rostedt wrote:
> From: Josh Poimboeuf <jpoimboe@kernel.org>
>
> Now that the sframe infrastructure is fully in place, make it work by
> hooking it up to the unwind_user interface.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> arch/Kconfig | 1 +
> include/linux/unwind_user_types.h | 1 +
> kernel/unwind/user.c | 25 ++++++++++++++++++++++---
> 3 files changed, 24 insertions(+), 3 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index c54d35e2f860..0c6056ef13de 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -448,6 +448,7 @@ config HAVE_UNWIND_USER_COMPAT_FP
>
> config HAVE_UNWIND_USER_SFRAME
> bool
> + select UNWIND_USER
>
> config HAVE_PERF_REGS
> bool
> diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
> index 0b6563951ca4..4d50476e950e 100644
> --- a/include/linux/unwind_user_types.h
> +++ b/include/linux/unwind_user_types.h
> @@ -13,6 +13,7 @@ enum unwind_user_type {
> UNWIND_USER_TYPE_NONE,
> UNWIND_USER_TYPE_FP,
> UNWIND_USER_TYPE_COMPAT_FP,
> + UNWIND_USER_TYPE_SFRAME,
> };
>
> struct unwind_stacktrace {
> diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
> index 249d9e32fad7..6e7ca9f1293a 100644
> --- a/kernel/unwind/user.c
> +++ b/kernel/unwind/user.c
> @@ -7,6 +7,7 @@
> #include <linux/sched/task_stack.h>
> #include <linux/unwind_user.h>
> #include <linux/uaccess.h>
> +#include <linux/sframe.h>
>
> static struct unwind_user_frame fp_frame = {
> ARCH_INIT_USER_FP_FRAME
> @@ -31,6 +32,12 @@ static inline bool compat_fp_state(struct unwind_user_state *state)
> state->type == UNWIND_USER_TYPE_COMPAT_FP;
> }
>
> +static inline bool sframe_state(struct unwind_user_state *state)
> +{
> + return IS_ENABLED(CONFIG_HAVE_UNWIND_USER_SFRAME) &&
> + state->type == UNWIND_USER_TYPE_SFRAME;
> +}
> +
> #define unwind_get_user_long(to, from, state) \
> ({ \
> int __ret; \
> @@ -44,18 +51,28 @@ static inline bool compat_fp_state(struct unwind_user_state *state)
> static int unwind_user_next(struct unwind_user_state *state)
> {
> struct unwind_user_frame *frame;
> + struct unwind_user_frame _frame;
> unsigned long cfa = 0, fp, ra = 0;
> unsigned int shift;
>
> if (state->done)
> return -EINVAL;
>
> - if (compat_fp_state(state))
> + if (compat_fp_state(state)) {
> frame = &compat_fp_frame;
> - else if (fp_state(state))
> + } else if (sframe_state(state)) {
> + /* sframe expects the frame to be local storage */
> + frame = &_frame;
> + if (sframe_find(state->ip, frame)) {
> + if (!IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
> + goto done;
> + frame = &fp_frame;
> + }
> + } else if (fp_state(state)) {
> frame = &fp_frame;
> - else
> + } else {
> goto done;
> + }
>
> if (frame->use_fp) {
> if (state->fp < state->sp)
> @@ -111,6 +128,8 @@ static int unwind_user_start(struct unwind_user_state *state)
>
> if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) && in_compat_mode(regs))
> state->type = UNWIND_USER_TYPE_COMPAT_FP;
> + else if (current_has_sframe())
> + state->type = UNWIND_USER_TYPE_SFRAME;
I think you'll want to update the state->type during the
traversal (in next()), because depending on whether
sframe is available for a given memory area of code
or not, the next() function can use either frame pointers
or sframe during the same traversal. It would be good
to know which is used after each specific call to next().
Thanks,
Mathieu
> else if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
> state->type = UNWIND_USER_TYPE_FP;
> else
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/12] unwind_user/sframe: Wire up unwind_user to sframe
2025-07-08 19:58 ` Mathieu Desnoyers
@ 2025-07-08 20:11 ` Steven Rostedt
2025-07-09 7:58 ` Jens Remus
0 siblings, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2025-07-08 20:11 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Jens Remus,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James
On Tue, 8 Jul 2025 15:58:56 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > @@ -111,6 +128,8 @@ static int unwind_user_start(struct unwind_user_state *state)
> >
> > if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) && in_compat_mode(regs))
> > state->type = UNWIND_USER_TYPE_COMPAT_FP;
> > + else if (current_has_sframe())
> > + state->type = UNWIND_USER_TYPE_SFRAME;
>
> I think you'll want to update the state->type during the
> traversal (in next()), because depending on whether
> sframe is available for a given memory area of code
> or not, the next() function can use either frame pointers
> or sframe during the same traversal. It would be good
> to know which is used after each specific call to next().
From my understanding this sets up what is available for the task at the
beginning.
So once we say "this task has sframes" it will try to use it every time. In
next we have:
if (compat_fp_state(state)) {
frame = &compat_fp_frame;
} else if (sframe_state(state)) {
/* sframe expects the frame to be local storage */
frame = &_frame;
if (sframe_find(state->ip, frame)) {
if (!IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
goto done;
frame = &fp_frame;
}
} else if (fp_state(state)) {
frame = &fp_frame;
} else {
goto done;
}
Where if sframe_find() fails and we switch over to frame pointers, if frame
pointers works, we can continue. But the next iteration, where the frame
pointer finds the previous ip, that ip may be in the sframe section again.
I've seen this work with my trace_printk()s. A function from code that is
running sframes calls into a library function that has frame pointers. The
walk walks through the frame pointers in the library, and when it hits the
code that has sframes, it starts using that again.
If we switched the state to just FP, it will never try to use sframes.
So this state is more about "what does this task have" than what was used
per iteration.
-- Steve
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/12] unwind_user/sframe: Wire up unwind_user to sframe
2025-07-08 20:11 ` Steven Rostedt
@ 2025-07-09 7:58 ` Jens Remus
2025-07-09 13:46 ` Mathieu Desnoyers
2025-07-10 15:30 ` Steven Rostedt
0 siblings, 2 replies; 34+ messages in thread
From: Jens Remus @ 2025-07-09 7:58 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer, Sam James,
Heiko Carstens, Vasily Gorbik
On 08.07.2025 22:11, Steven Rostedt wrote:
> On Tue, 8 Jul 2025 15:58:56 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>>> @@ -111,6 +128,8 @@ static int unwind_user_start(struct unwind_user_state *state)
>>>
>>> if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) && in_compat_mode(regs))
>>> state->type = UNWIND_USER_TYPE_COMPAT_FP;
>>> + else if (current_has_sframe())
>>> + state->type = UNWIND_USER_TYPE_SFRAME;
>>
>> I think you'll want to update the state->type during the
>> traversal (in next()), because depending on whether
>> sframe is available for a given memory area of code
>> or not, the next() function can use either frame pointers
>> or sframe during the same traversal. It would be good
>> to know which is used after each specific call to next().
>
> From my understanding this sets up what is available for the task at the
> beginning.
>
> So once we say "this task has sframes" it will try to use it every time. In
> next we have:
>
> if (compat_fp_state(state)) {
> frame = &compat_fp_frame;
> } else if (sframe_state(state)) {
> /* sframe expects the frame to be local storage */
> frame = &_frame;
> if (sframe_find(state->ip, frame)) {
> if (!IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
> goto done;
> frame = &fp_frame;
> }
> } else if (fp_state(state)) {
> frame = &fp_frame;
> } else {
> goto done;
> }
>
> Where if sframe_find() fails and we switch over to frame pointers, if frame
> pointers works, we can continue. But the next iteration, where the frame
> pointer finds the previous ip, that ip may be in the sframe section again.
>
> I've seen this work with my trace_printk()s. A function from code that is
> running sframes calls into a library function that has frame pointers. The
> walk walks through the frame pointers in the library, and when it hits the
> code that has sframes, it starts using that again.
I think Mathieu has a point, as unwind_user_next() calls the optional
architecture-specific arch_unwind_user_next() at the end. The x86
implementation does state->type specific processing (for
UNWIND_USER_TYPE_COMPAT_FP).
> If we switched the state to just FP, it will never try to use sframes.
>
> So this state is more about "what does this task have" than what was used
> per iteration.
While there is currently no fallback to UNWIND_USER_TYPE_COMPAT_FP that
would strictly require this, it could be useful to have both information.
Or the logic in unwind_user_start(), unwind_user_next(), and *_state()
may need to be adjusted so that state->type reflects the currently used
method, which unwind_user_next() determines and sets anew for every step.
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
+49-7031-16-1128 Office
jremus@de.ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/12] unwind_user/sframe: Wire up unwind_user to sframe
2025-07-09 7:58 ` Jens Remus
@ 2025-07-09 13:46 ` Mathieu Desnoyers
2025-07-09 13:51 ` Mathieu Desnoyers
2025-07-10 15:30 ` Steven Rostedt
1 sibling, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2025-07-09 13:46 UTC (permalink / raw)
To: Jens Remus, Steven Rostedt
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer, Sam James,
Heiko Carstens, Vasily Gorbik
On 2025-07-09 03:58, Jens Remus wrote:
> On 08.07.2025 22:11, Steven Rostedt wrote:
>> On Tue, 8 Jul 2025 15:58:56 -0400
>> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>
>>>> @@ -111,6 +128,8 @@ static int unwind_user_start(struct unwind_user_state *state)
>>>>
>>>> if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) && in_compat_mode(regs))
>>>> state->type = UNWIND_USER_TYPE_COMPAT_FP;
>>>> + else if (current_has_sframe())
>>>> + state->type = UNWIND_USER_TYPE_SFRAME;
>>>
>>> I think you'll want to update the state->type during the
>>> traversal (in next()), because depending on whether
>>> sframe is available for a given memory area of code
>>> or not, the next() function can use either frame pointers
>>> or sframe during the same traversal. It would be good
>>> to know which is used after each specific call to next().
>>
>> From my understanding this sets up what is available for the task at the
>> beginning.
>>
>> So once we say "this task has sframes" it will try to use it every time. In
>> next we have:
>>
>> if (compat_fp_state(state)) {
>> frame = &compat_fp_frame;
>> } else if (sframe_state(state)) {
>> /* sframe expects the frame to be local storage */
>> frame = &_frame;
>> if (sframe_find(state->ip, frame)) {
>> if (!IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
>> goto done;
>> frame = &fp_frame;
>> }
>> } else if (fp_state(state)) {
>> frame = &fp_frame;
>> } else {
>> goto done;
>> }
>>
>> Where if sframe_find() fails and we switch over to frame pointers, if frame
>> pointers works, we can continue. But the next iteration, where the frame
>> pointer finds the previous ip, that ip may be in the sframe section again.
>>
>> I've seen this work with my trace_printk()s. A function from code that is
>> running sframes calls into a library function that has frame pointers. The
>> walk walks through the frame pointers in the library, and when it hits the
>> code that has sframes, it starts using that again.
>
> I think Mathieu has a point, as unwind_user_next() calls the optional
> architecture-specific arch_unwind_user_next() at the end. The x86
> implementation does state->type specific processing (for
> UNWIND_USER_TYPE_COMPAT_FP).
>
>> If we switched the state to just FP, it will never try to use sframes.
>>
>> So this state is more about "what does this task have" than what was used
>> per iteration.
>
> While there is currently no fallback to UNWIND_USER_TYPE_COMPAT_FP that
> would strictly require this, it could be useful to have both information.
>
> Or the logic in unwind_user_start(), unwind_user_next(), and *_state()
> may need to be adjusted so that state->type reflects the currently used
> method, which unwind_user_next() determines and sets anew for every step.
I concur with Jens. I think we should keep track of both:
1) available unwind methods,
2) unwind method used for the current frame.
E.g.:
/*
* unwind types, listed in priority order: lower numbers are
* attempted first if available.
*/
enum unwind_user_type_bits {
UNWIND_USER_TYPE_SFRAME_BIT = 0,
UNWIND_USER_TYPE_FP_BIT = 1,
UNWIND_USER_TYPE_COMPAT_FP_BIT = 2,
_NR_UNWIND_USER_TYPE_BITS,
};
enum unwind_user_type {
UNWIND_USER_TYPE_NONE = 0,
UNWIND_USER_TYPE_SFRAME = (1U << UNWIND_USER_TYPE_SFRAME_BIT),
UNWIND_USER_TYPE_FP = (1U << UNWIND_USER_TYPE_FP_BIT),
UNWIND_USER_TYPE_COMPAT_FP = (1U << UNWIND_USER_TYPE_COMPAT_FP_BIT),
};
And have the following fields in struct unwind_user_state:
/* Unwind time used for the most recent unwind traversal iteration. */
enum unwind_user_type current_type;
/* Unwind types available in the current context. Bitmask of enum unwind_user_type. */
unsigned int available_types;
So as we end up adding stuff like registered JIT unwind info, we will
want to expand the "available types". And it makes sense to both keep
track of all available types (as a way to quickly know which mechanisms
we need to query for the current task) *and* to let the caller know
which unwind type was used for the current frame.
And AFAIU we'd be inserting a "jit unwind info" type between SFRAME and FP in
the future, because the jit unwind info would be more reliable than FP. This
would require that we bump the number for FP and COMPAT_FP, but that would
be OK because this is not ABI.
Thoughts ?
Thanks,
Mathieu
>
> Regards,
> Jens
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/12] unwind_user/sframe: Wire up unwind_user to sframe
2025-07-09 13:46 ` Mathieu Desnoyers
@ 2025-07-09 13:51 ` Mathieu Desnoyers
2025-07-09 14:06 ` Steven Rostedt
2025-07-10 9:26 ` Jens Remus
0 siblings, 2 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2025-07-09 13:51 UTC (permalink / raw)
To: Jens Remus, Steven Rostedt
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer, Sam James,
Heiko Carstens, Vasily Gorbik
On 2025-07-09 09:46, Mathieu Desnoyers wrote:
> On 2025-07-09 03:58, Jens Remus wrote:
>> On 08.07.2025 22:11, Steven Rostedt wrote:
>>> On Tue, 8 Jul 2025 15:58:56 -0400
>>> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>>
>>>>> @@ -111,6 +128,8 @@ static int unwind_user_start(struct
>>>>> unwind_user_state *state)
>>>>> if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_COMPAT_FP) &&
>>>>> in_compat_mode(regs))
>>>>> state->type = UNWIND_USER_TYPE_COMPAT_FP;
>>>>> + else if (current_has_sframe())
>>>>> + state->type = UNWIND_USER_TYPE_SFRAME;
>>>>
>>>> I think you'll want to update the state->type during the
>>>> traversal (in next()), because depending on whether
>>>> sframe is available for a given memory area of code
>>>> or not, the next() function can use either frame pointers
>>>> or sframe during the same traversal. It would be good
>>>> to know which is used after each specific call to next().
>>>
>>> From my understanding this sets up what is available for the task at
>>> the
>>> beginning.
>>>
>>> So once we say "this task has sframes" it will try to use it every
>>> time. In
>>> next we have:
>>>
>>> if (compat_fp_state(state)) {
>>> frame = &compat_fp_frame;
>>> } else if (sframe_state(state)) {
>>> /* sframe expects the frame to be local storage */
>>> frame = &_frame;
>>> if (sframe_find(state->ip, frame)) {
>>> if (!IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
>>> goto done;
>>> frame = &fp_frame;
>>> }
>>> } else if (fp_state(state)) {
>>> frame = &fp_frame;
>>> } else {
>>> goto done;
>>> }
>>>
>>> Where if sframe_find() fails and we switch over to frame pointers, if
>>> frame
>>> pointers works, we can continue. But the next iteration, where the frame
>>> pointer finds the previous ip, that ip may be in the sframe section
>>> again.
>>>
>>> I've seen this work with my trace_printk()s. A function from code
>>> that is
>>> running sframes calls into a library function that has frame
>>> pointers. The
>>> walk walks through the frame pointers in the library, and when it
>>> hits the
>>> code that has sframes, it starts using that again.
>>
>> I think Mathieu has a point, as unwind_user_next() calls the optional
>> architecture-specific arch_unwind_user_next() at the end. The x86
>> implementation does state->type specific processing (for
>> UNWIND_USER_TYPE_COMPAT_FP).
>>
>>> If we switched the state to just FP, it will never try to use sframes.
>>>
>>> So this state is more about "what does this task have" than what was
>>> used
>>> per iteration.
>>
>> While there is currently no fallback to UNWIND_USER_TYPE_COMPAT_FP that
>> would strictly require this, it could be useful to have both information.
>>
>> Or the logic in unwind_user_start(), unwind_user_next(), and *_state()
>> may need to be adjusted so that state->type reflects the currently used
>> method, which unwind_user_next() determines and sets anew for every step.
>
> I concur with Jens. I think we should keep track of both:
>
> 1) available unwind methods,
>
> 2) unwind method used for the current frame.
>
> E.g.:
>
> /*
> * unwind types, listed in priority order: lower numbers are
> * attempted first if available.
> */
> enum unwind_user_type_bits {
> UNWIND_USER_TYPE_SFRAME_BIT = 0,
> UNWIND_USER_TYPE_FP_BIT = 1,
> UNWIND_USER_TYPE_COMPAT_FP_BIT = 2,
>
> _NR_UNWIND_USER_TYPE_BITS,
> };
>
> enum unwind_user_type {
> UNWIND_USER_TYPE_NONE = 0,
> UNWIND_USER_TYPE_SFRAME = (1U << UNWIND_USER_TYPE_SFRAME_BIT),
> UNWIND_USER_TYPE_FP = (1U << UNWIND_USER_TYPE_FP_BIT),
> UNWIND_USER_TYPE_COMPAT_FP = (1U <<
> UNWIND_USER_TYPE_COMPAT_FP_BIT),
> };
>
> And have the following fields in struct unwind_user_state:
>
> /* Unwind time used for the most recent unwind traversal iteration. */
> enum unwind_user_type current_type;
>
> /* Unwind types available in the current context. Bitmask of enum
> unwind_user_type. */
> unsigned int available_types;
>
> So as we end up adding stuff like registered JIT unwind info, we will
> want to expand the "available types". And it makes sense to both keep
> track of all available types (as a way to quickly know which mechanisms
> we need to query for the current task) *and* to let the caller know
> which unwind type was used for the current frame.
>
> And AFAIU we'd be inserting a "jit unwind info" type between SFRAME and
> FP in
> the future, because the jit unwind info would be more reliable than FP.
> This
> would require that we bump the number for FP and COMPAT_FP, but that would
> be OK because this is not ABI.
>
> Thoughts ?
One use-case for giving the "current_type" to iteration callers is to
let end users know whether they should trust the frame info. If it
comes from sframe, then it should be pretty solid. However, if it comes
from frame pointers used as a fallback on a system that omits frame
pointers, the user should consider the resulting data with a high level
of skepticism.
Thanks,
Mathieu
>
> Thanks,
>
> Mathieu
>
>>
>> Regards,
>> Jens
>
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/12] unwind_user/sframe: Wire up unwind_user to sframe
2025-07-09 13:51 ` Mathieu Desnoyers
@ 2025-07-09 14:06 ` Steven Rostedt
2025-07-09 14:10 ` Mathieu Desnoyers
2025-07-10 8:03 ` Jens Remus
2025-07-10 9:26 ` Jens Remus
1 sibling, 2 replies; 34+ messages in thread
From: Steven Rostedt @ 2025-07-09 14:06 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Jens Remus, Steven Rostedt, linux-kernel, linux-trace-kernel, bpf,
x86, Masami Hiramatsu, Josh Poimboeuf, Peter Zijlstra,
Ingo Molnar, Jiri Olsa, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James, Heiko Carstens, Vasily Gorbik
On Wed, 9 Jul 2025 09:51:09 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> One use-case for giving the "current_type" to iteration callers is to
> let end users know whether they should trust the frame info. If it
> comes from sframe, then it should be pretty solid. However, if it comes
> from frame pointers used as a fallback on a system that omits frame
> pointers, the user should consider the resulting data with a high level
> of skepticism.
That would be in the trace sent to the callback. We could add something
like the '?' if it's not trusted.
But for now, until we have a use case that we are implementing, I want
to keep this simple, otherwise it will never get done. I don't want to
add features for hypothetical scenarios.
Currently, the traceback is just an array of addresses. But this could
change in the future. What we are discussing right now is the internal
functionality of the user unwind code where I have made most of theses
functions static.
The only external functions that get called during the iteration is the
architecture specific code. If that code needs to know the difference
between sframes and frame pointers then we can modify it, but until
then, I rather keep this as is.
Jens, is there something that the architecture code needs now? If so,
then lets fix it, otherwise lets do it when there is something. This
isn't user API, it can change in the future.
-- Steve
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/12] unwind_user/sframe: Wire up unwind_user to sframe
2025-07-09 14:06 ` Steven Rostedt
@ 2025-07-09 14:10 ` Mathieu Desnoyers
2025-07-09 14:29 ` Steven Rostedt
2025-07-10 8:03 ` Jens Remus
1 sibling, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2025-07-09 14:10 UTC (permalink / raw)
To: Steven Rostedt
Cc: Jens Remus, Steven Rostedt, linux-kernel, linux-trace-kernel, bpf,
x86, Masami Hiramatsu, Josh Poimboeuf, Peter Zijlstra,
Ingo Molnar, Jiri Olsa, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James, Heiko Carstens, Vasily Gorbik
On 2025-07-09 10:06, Steven Rostedt wrote:
> On Wed, 9 Jul 2025 09:51:09 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>> One use-case for giving the "current_type" to iteration callers is to
>> let end users know whether they should trust the frame info. If it
>> comes from sframe, then it should be pretty solid. However, if it comes
>> from frame pointers used as a fallback on a system that omits frame
>> pointers, the user should consider the resulting data with a high level
>> of skepticism.
>
> That would be in the trace sent to the callback. We could add something
> like the '?' if it's not trusted.
>
> But for now, until we have a use case that we are implementing, I want
> to keep this simple, otherwise it will never get done. I don't want to
> add features for hypothetical scenarios.
>
> Currently, the traceback is just an array of addresses. But this could
> change in the future. What we are discussing right now is the internal
> functionality of the user unwind code where I have made most of theses
> functions static.
>
> The only external functions that get called during the iteration is the
> architecture specific code. If that code needs to know the difference
> between sframes and frame pointers then we can modify it, but until
> then, I rather keep this as is.
Indeed it's only kernel internal API, but this is API that will be
expected by each architecture supporting unwind_user. Changing
this later on will cause a lot of friction and cross-architecture churn
compared to doing it right in the first place.
Thanks,
Mathieu
>
> Jens, is there something that the architecture code needs now? If so,
> then lets fix it, otherwise lets do it when there is something. This
> isn't user API, it can change in the future.
>
> -- Steve
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/12] unwind_user/sframe: Wire up unwind_user to sframe
2025-07-09 14:10 ` Mathieu Desnoyers
@ 2025-07-09 14:29 ` Steven Rostedt
2025-07-09 15:14 ` Mathieu Desnoyers
0 siblings, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2025-07-09 14:29 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Jens Remus, Steven Rostedt, linux-kernel, linux-trace-kernel, bpf,
x86, Masami Hiramatsu, Josh Poimboeuf, Peter Zijlstra,
Ingo Molnar, Jiri Olsa, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James, Heiko Carstens, Vasily Gorbik
On Wed, 9 Jul 2025 10:10:30 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> Indeed it's only kernel internal API, but this is API that will be
> expected by each architecture supporting unwind_user. Changing
> this later on will cause a lot of friction and cross-architecture churn
> compared to doing it right in the first place.
The changes you are suggesting is added info if an architecture needs
it. That is easy to do. All you need is to add an extra field in the
state structure and the architectures that need it can use it, and the
rest can ignore it.
Again, I'm not worried about it. If you want to send me a patch, feel
free, but I'm not doing this extra work, until I see a real problem.
-- Steve.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/12] unwind_user/sframe: Wire up unwind_user to sframe
2025-07-09 14:29 ` Steven Rostedt
@ 2025-07-09 15:14 ` Mathieu Desnoyers
0 siblings, 0 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2025-07-09 15:14 UTC (permalink / raw)
To: Steven Rostedt
Cc: Jens Remus, Steven Rostedt, linux-kernel, linux-trace-kernel, bpf,
x86, Masami Hiramatsu, Josh Poimboeuf, Peter Zijlstra,
Ingo Molnar, Jiri Olsa, Namhyung Kim, Thomas Gleixner,
Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi, Beau Belgrave,
Linus Torvalds, Andrew Morton, Jens Axboe, Florian Weimer,
Sam James, Heiko Carstens, Vasily Gorbik
On 2025-07-09 10:29, Steven Rostedt wrote:
> On Wed, 9 Jul 2025 10:10:30 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
>> Indeed it's only kernel internal API, but this is API that will be
>> expected by each architecture supporting unwind_user. Changing
>> this later on will cause a lot of friction and cross-architecture churn
>> compared to doing it right in the first place.
>
> The changes you are suggesting is added info if an architecture needs
> it. That is easy to do. All you need is to add an extra field in the
> state structure and the architectures that need it can use it, and the
> rest can ignore it.
>
> Again, I'm not worried about it. If you want to send me a patch, feel
> free, but I'm not doing this extra work, until I see a real problem.
OK, I'll do it. As I look into the existing state, the priority order
appears to be incorrect on compat mode: if we have both compat mode and
sframe available, COMPAT_FP is preferred. I think we want to favor using
sframe first. But then if you select "sframe" in start(), then you don't
have the "compat state" information for the compat-fp fallback.
So the "type" logic is all intermingled between "fp vs sframe" and
"compat vs non-compat" there. My changes will clean this up.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/12] unwind_user/sframe: Wire up unwind_user to sframe
2025-07-09 14:06 ` Steven Rostedt
2025-07-09 14:10 ` Mathieu Desnoyers
@ 2025-07-10 8:03 ` Jens Remus
1 sibling, 0 replies; 34+ messages in thread
From: Jens Remus @ 2025-07-10 8:03 UTC (permalink / raw)
To: Steven Rostedt, Mathieu Desnoyers
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer, Sam James,
Heiko Carstens, Vasily Gorbik
On 09.07.2025 16:06, Steven Rostedt wrote:
> But for now, until we have a use case that we are implementing, I want
> to keep this simple, otherwise it will never get done. I don't want to
> add features for hypothetical scenarios.
> Jens, is there something that the architecture code needs now? If so,
> then lets fix it, otherwise lets do it when there is something. This
> isn't user API, it can change in the future.
I am fine with your suggestion. My s390 sframe support does not require
any changes in this regard.
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
+49-7031-16-1128 Office
jremus@de.ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/12] unwind_user/sframe: Wire up unwind_user to sframe
2025-07-09 13:51 ` Mathieu Desnoyers
2025-07-09 14:06 ` Steven Rostedt
@ 2025-07-10 9:26 ` Jens Remus
1 sibling, 0 replies; 34+ messages in thread
From: Jens Remus @ 2025-07-10 9:26 UTC (permalink / raw)
To: Mathieu Desnoyers, Steven Rostedt
Cc: Steven Rostedt, linux-kernel, linux-trace-kernel, bpf, x86,
Masami Hiramatsu, Josh Poimboeuf, Peter Zijlstra, Ingo Molnar,
Jiri Olsa, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
Indu Bhagat, Jose E. Marchesi, Beau Belgrave, Linus Torvalds,
Andrew Morton, Jens Axboe, Florian Weimer, Sam James,
Heiko Carstens, Vasily Gorbik
On 09.07.2025 15:51, Mathieu Desnoyers wrote:
> On 2025-07-09 09:46, Mathieu Desnoyers wrote:
>> I concur with Jens. I think we should keep track of both:
>>
>> 1) available unwind methods,
>>
>> 2) unwind method used for the current frame.
>>
>> E.g.:
>>
>> /*
>> * unwind types, listed in priority order: lower numbers are
>> * attempted first if available.
>> */
>> enum unwind_user_type_bits {
>> UNWIND_USER_TYPE_SFRAME_BIT = 0,
>> UNWIND_USER_TYPE_FP_BIT = 1,
>> UNWIND_USER_TYPE_COMPAT_FP_BIT = 2,
>>
>> _NR_UNWIND_USER_TYPE_BITS,
>> };
>>
>> enum unwind_user_type {
>> UNWIND_USER_TYPE_NONE = 0,
>> UNWIND_USER_TYPE_SFRAME = (1U << UNWIND_USER_TYPE_SFRAME_BIT),
>> UNWIND_USER_TYPE_FP = (1U << UNWIND_USER_TYPE_FP_BIT),
>> UNWIND_USER_TYPE_COMPAT_FP = (1U << UNWIND_USER_TYPE_COMPAT_FP_BIT),
>> };
>>
>> And have the following fields in struct unwind_user_state:
>>
>> /* Unwind time used for the most recent unwind traversal iteration. */
>> enum unwind_user_type current_type;
>>
>> /* Unwind types available in the current context. Bitmask of enum unwind_user_type. */
>> unsigned int available_types;
>>
>> So as we end up adding stuff like registered JIT unwind info, we will
>> want to expand the "available types". And it makes sense to both keep
>> track of all available types (as a way to quickly know which mechanisms
>> we need to query for the current task) *and* to let the caller know
>> which unwind type was used for the current frame.
>>
>> And AFAIU we'd be inserting a "jit unwind info" type between SFRAME and FP in
>> the future, because the jit unwind info would be more reliable than FP. This
>> would require that we bump the number for FP and COMPAT_FP, but that would
>> be OK because this is not ABI.
>>
>> Thoughts ?
>
> One use-case for giving the "current_type" to iteration callers is to
> let end users know whether they should trust the frame info. If it
> comes from sframe, then it should be pretty solid. However, if it comes
> from frame pointers used as a fallback on a system that omits frame
> pointers, the user should consider the resulting data with a high level
> of skepticism.
The current_type may be different for every unwind step (frame). So
struct unwind_stacktrace would probably need the following added:
/* Unwind types used for taking the stack trace. */
unsigned int used_types;
So that the user of unwind_user() could decide whether to trust the
stack trace. But as Steve suggested in his reply, all of this could
be added later, once there is a need.
Regards,
Jens
--
Jens Remus
Linux on Z Development (D3303)
+49-7031-16-1128 Office
jremus@de.ibm.com
IBM
IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/12] unwind_user/sframe: Wire up unwind_user to sframe
2025-07-09 7:58 ` Jens Remus
2025-07-09 13:46 ` Mathieu Desnoyers
@ 2025-07-10 15:30 ` Steven Rostedt
2025-07-10 20:51 ` Steven Rostedt
1 sibling, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2025-07-10 15:30 UTC (permalink / raw)
To: Jens Remus
Cc: Mathieu Desnoyers, Steven Rostedt, linux-kernel,
linux-trace-kernel, bpf, x86, Masami Hiramatsu, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Linus Torvalds, Andrew Morton, Jens Axboe,
Florian Weimer, Sam James, Heiko Carstens, Vasily Gorbik
On Wed, 9 Jul 2025 09:58:26 +0200
Jens Remus <jremus@linux.ibm.com> wrote:
> I think Mathieu has a point, as unwind_user_next() calls the optional
> architecture-specific arch_unwind_user_next() at the end. The x86
> implementation does state->type specific processing (for
> UNWIND_USER_TYPE_COMPAT_FP).
I'm not too comfortable with the compat patches at this stage. I'm
thinking of separating out the compat patches, and just reject the
deferred unwind if the task is in compat mode (forcing perf or other
tracers to use whatever it uses today).
I'll take Mathieu's patches and merge them with Josh's, but make them a
separate series.
I'm aiming to get the core series into this merge window, and the less
complexity we have, the better. Then everything can be worked on
simultaneously in the next merge window as all the other patches will
not have any dependency on each other. They all have dependency on the
core set.
-- Steve
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 06/12] unwind_user/sframe: Wire up unwind_user to sframe
2025-07-10 15:30 ` Steven Rostedt
@ 2025-07-10 20:51 ` Steven Rostedt
0 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2025-07-10 20:51 UTC (permalink / raw)
To: Jens Remus
Cc: Mathieu Desnoyers, Steven Rostedt, linux-kernel,
linux-trace-kernel, bpf, x86, Masami Hiramatsu, Josh Poimboeuf,
Peter Zijlstra, Ingo Molnar, Jiri Olsa, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Linus Torvalds, Andrew Morton, Jens Axboe,
Florian Weimer, Sam James, Heiko Carstens, Vasily Gorbik
On Thu, 10 Jul 2025 11:30:39 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> I'm not too comfortable with the compat patches at this stage. I'm
> thinking of separating out the compat patches, and just reject the
> deferred unwind if the task is in compat mode (forcing perf or other
> tracers to use whatever it uses today).
>
> I'll take Mathieu's patches and merge them with Josh's, but make them a
> separate series.
So I'm removing the two compat patches from the series for now and plan
to replace it with this:
-- Steve
From: Steven Rostedt <rostedt@goodmis.org>
Subject: [PATCH] unwind deferred/x86: Do not defer stack tracing for compat
tasks
Currently compat tasks are not supported. If a deferred user space stack
trace is requested on a compat task, it should fail and return an error so
that the profiler can use an alternative approach (whatever it uses
today).
Add a arch_unwind_can_defer() macro that is called in
unwind_deferred_request(). Have x86 define it to a function that makes
sure that the current task is running in 64bit mode, and if it is not, it
returns false. This will cause unwind_deferred_request() to error out and
the caller can use the current method of user space stack tracing.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
arch/x86/include/asm/unwind_user.h | 11 +++++++++++
include/linux/unwind_deferred.h | 5 +++++
kernel/unwind/deferred.c | 3 +++
3 files changed, 19 insertions(+)
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
index 8597857bf896..220fd0a6e175 100644
--- a/arch/x86/include/asm/unwind_user.h
+++ b/arch/x86/include/asm/unwind_user.h
@@ -2,6 +2,17 @@
#ifndef _ASM_X86_UNWIND_USER_H
#define _ASM_X86_UNWIND_USER_H
+#ifdef CONFIG_IA32_EMULATION
+/* Currently compat mode is not supported for deferred stack trace */
+static inline bool arch_unwind_can_defer(void)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+
+ return user_64bit_mode(regs);
+}
+# define arch_unwind_can_defer arch_unwind_can_defer
+#endif /* CONFIG_IA32_EMULATION */
+
#define ARCH_INIT_USER_FP_FRAME \
.cfa_off = (s32)sizeof(long) * 2, \
.ra_off = (s32)sizeof(long) * -1, \
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index a9d5b100d6b2..6ba4fff066dd 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -16,6 +16,11 @@ struct unwind_work {
int bit;
};
+/* Architectures can add a test to not defer unwinding */
+#ifndef arch_unwind_can_defer
+# define arch_unwind_can_defer() (true)
+#endif
+
#ifdef CONFIG_UNWIND_USER
#define UNWIND_PENDING_BIT (BITS_PER_LONG - 1)
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 039e12700d49..745144e4717c 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -236,6 +236,9 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
*cookie = 0;
+ if (!arch_unwind_can_defer())
+ return -EINVAL;
+
if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
!user_mode(task_pt_regs(current)))
return -EINVAL;
--
2.47.2
^ permalink raw reply related [flat|nested] 34+ messages in thread
end of thread, other threads:[~2025-07-10 20:52 UTC | newest]
Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-08 2:11 [PATCH v8 00/12] unwind_deferred: Implement sframe handling Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 01/12] unwind_user/sframe: Add support for reading .sframe headers Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 02/12] unwind_user/sframe: Store sframe section data in per-mm maple tree Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 03/12] x86/uaccess: Add unsafe_copy_from_user() implementation Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 04/12] unwind_user/sframe: Add support for reading .sframe contents Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 05/12] unwind_user/sframe: Detect .sframe sections in executables Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 06/12] unwind_user/sframe: Wire up unwind_user to sframe Steven Rostedt
2025-07-08 19:58 ` Mathieu Desnoyers
2025-07-08 20:11 ` Steven Rostedt
2025-07-09 7:58 ` Jens Remus
2025-07-09 13:46 ` Mathieu Desnoyers
2025-07-09 13:51 ` Mathieu Desnoyers
2025-07-09 14:06 ` Steven Rostedt
2025-07-09 14:10 ` Mathieu Desnoyers
2025-07-09 14:29 ` Steven Rostedt
2025-07-09 15:14 ` Mathieu Desnoyers
2025-07-10 8:03 ` Jens Remus
2025-07-10 9:26 ` Jens Remus
2025-07-10 15:30 ` Steven Rostedt
2025-07-10 20:51 ` Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 07/12] unwind_user/sframe/x86: Enable sframe unwinding on x86 Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 08/12] unwind_user/sframe: Remove .sframe section on detected corruption Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 09/12] unwind_user/sframe: Show file name in debug output Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 10/12] unwind_user/sframe: Enable debugging in uaccess regions Steven Rostedt
2025-07-08 3:38 ` Linus Torvalds
2025-07-08 13:23 ` Steven Rostedt
2025-07-08 14:34 ` Josh Poimboeuf
2025-07-08 14:41 ` Steven Rostedt
2025-07-08 15:53 ` Linus Torvalds
2025-07-08 16:31 ` Steven Rostedt
2025-07-08 18:57 ` Josh Poimboeuf
2025-07-08 15:52 ` Linus Torvalds
2025-07-08 2:11 ` [PATCH v8 11/12] unwind_user/sframe: Add .sframe validation option Steven Rostedt
2025-07-08 2:11 ` [PATCH v8 12/12] unwind_user/sframe: Add prctl() interface for registering .sframe sections Steven Rostedt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).