* [PATCH v15 08/20] unwind_user: Stop when reaching an outermost frame
From: Jens Remus @ 2026-05-20 15:39 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James
Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
Jose E. Marchesi, Beau Belgrave, Florian Weimer,
Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
Ilya Leoshkevich
In-Reply-To: <20260520154004.3845823-1-jremus@linux.ibm.com>
Add an indication for an outermost frame to the unwind user frame
structure and stop unwinding when reaching an outermost frame.
This will be used by unwind user sframe, as SFrame may represent an
undefined return address as indication for an outermost frame.
Reviewed-by: Indu Bhagat <ibhagatgnu@gmail.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
arch/x86/include/asm/unwind_user.h | 6 ++++--
include/linux/unwind_user_types.h | 1 +
kernel/unwind/user.c | 6 ++++++
3 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
index 6e469044e4de..2dfb5ef11e36 100644
--- a/arch/x86/include/asm/unwind_user.h
+++ b/arch/x86/include/asm/unwind_user.h
@@ -23,13 +23,15 @@ static inline int unwind_user_word_size(struct pt_regs *regs)
.cfa_off = 2*(ws), \
.ra_off = -1*(ws), \
.fp_off = -2*(ws), \
- .use_fp = true,
+ .use_fp = true, \
+ .outermost = false,
#define ARCH_INIT_USER_FP_ENTRY_FRAME(ws) \
.cfa_off = 1*(ws), \
.ra_off = -1*(ws), \
.fp_off = 0, \
- .use_fp = false,
+ .use_fp = false, \
+ .outermost = false,
static inline bool unwind_user_at_function_start(struct pt_regs *regs)
{
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 43e4b160883f..616cc5ee4586 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -32,6 +32,7 @@ struct unwind_user_frame {
s32 ra_off;
s32 fp_off;
bool use_fp;
+ bool outermost;
};
struct unwind_user_state {
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 1fb272419733..fdb1001e3750 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -32,6 +32,12 @@ static int unwind_user_next_common(struct unwind_user_state *state,
{
unsigned long cfa, fp, ra;
+ /* Stop unwinding when reaching an outermost frame. */
+ if (frame->outermost) {
+ state->done = true;
+ return 0;
+ }
+
/* Get the Canonical Frame Address (CFA) */
if (frame->use_fp) {
if (state->fp < state->sp)
--
2.51.0
^ permalink raw reply related
* [PATCH v15 00/20] unwind_deferred: Implement sframe handling
From: Jens Remus @ 2026-05-20 15:39 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James
Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
Jose E. Marchesi, Beau Belgrave, Florian Weimer,
Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
Ilya Leoshkevich
This is the implementation of parsing the SFrame V3 stack trace information
from an .sframe section in an ELF file. It's a continuation of Josh's and
Steve's work that can be found here:
https://lore.kernel.org/all/cover.1737511963.git.jpoimboe@kernel.org/
https://lore.kernel.org/all/20250827201548.448472904@kernel.org/
Currently the only way to get a user space stack trace from a stack
walk (and not just copying large amount of user stack into the kernel
ring buffer) is to use frame pointers. This has a few issues. The biggest
one is that compiling frame pointers into every application and library
has been shown to cause performance overhead.
Another issue is that the format of the frames may not always be consistent
between different compilers and some architectures (s390) has no defined
format to do a reliable stack walk. The only way to perform user space
profiling on these architectures is to copy the user stack into the kernel
buffer.
SFrame [1] is now supported in binutils (x86-64, ARM64, and s390). There is
discussions going on about supporting SFrame in LLVM. SFrame acts more like
ORC, and lives in the ELF executable file as its own section. Like ORC it
has two tables where the first table is sorted by instruction pointers (IP)
and using the current IP and finding it's entry in the first table, it will
take you to the second table which will tell you where the return address
of the current function is located and then you can use that address to
look it up in the first table to find the return address of that function,
and so on. This performs a user space stack walk.
Now because the .sframe section lives in the ELF file it needs to be faulted
into memory when it is used. This means that walking the user space stack
requires being in a faultable context. As profilers like perf request a stack
trace in interrupt or NMI context, it cannot do the walking when it is
requested. Instead it must be deferred until it is safe to fault in user
space. One place this is known to be safe is when the task is about to return
back to user space.
This series makes the deferred unwind user code implement SFrame format V3
and enables it on x86-64.
[1]: https://sourceware.org/binutils/wiki/sframe
This series applies on top of v7.1-rc4 tag:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v7.1-rc4
The to be stack-traced user space programs (and libraries) need to be
built with the recent SFrame stack trace information format V3, as
generated by binutils 2.46+ with assembler option --gsframe-3.
Namhyung Kim's related perf tools deferred callchain support can be used
for testing ("perf record --call-graph fp,defer" and "perf report/script").
Changes in v15 (see patch notes for details):
- Rebase on v7.1-rc4.
- New patch to duplicate registered .sframe section data on clone/fork.
- Address Sashiko AI review feedback:
- Fix sframe end passed to mtree_insert_range().
- Fix outermost frame (FRE without datawords) handling.
- Use GFP_KERNEL_ACCOUNT instead of GFP_KERNEL.
- Improve text/sframe section start/end validation.
- Always use guard(srcu) when accessing struct sframe_section fields.
- Validate FDE repetition size for PCTYPE_MASK FDEs to be non-zero to
prevent division by zero.
- Only add sframe for text that is PT_LOAD in addition to PF_X.
- Use pr_debug_once() instead of WARN_ON_ONCE() to prevent user-
triggered warning/panic.
- Add support for SP/FP-based CFA recovery rules with
dereferencing.
- Reject FRE control word with reserved_p=1.
- x86-64: Fail unwind_user_get_reg() if !user_64bit_mode().
- Validate FDE PC type for supported values (i.e. PCTYPE_INC or
PCTYPE_MASK).
- Validate FDE function end against text end.
- Validate FDE's number of FREs to be less or equal to FDE's function
size, as each FRE must cover at least one byte. (Indu)
- Validate FRE function offset against FDE repetition size
for PCTYPE_MASK.
- Change type of struct sframe_fde_internal field fres_num to the one of
struct sframe_fda_v3 field fres_num.
- Return RC of sframe_init_[cfa_]rule_data() if bad RC.
- Normalize error code usage (.sframe is removed for all but ENOENT):
ENOENT: No sframe or no FDE for IP found
(FDE found but no FRE found is EINVAL)
EFAULT: Bad address
EINVAL: Invalid input or sframe
- Build-time checks for config options:
- 64BIT: SFrame V3 only supports 64-bit architectures.
- HAVE_EFFICIENT_UNALIGNED_ACCESS: Unaligned access to 16/32-bit
SFrame FRE fields and datawords using unsafe_get_user(). (Steven)
- Add pr_debug_once() when restoring CFA/FP/RA from an unsupported
register number.
Changes in v14 (see patch notes for details):
- Rebase on v7.1-rc2.
- Correct SFRAME_V3_FDE_TYPE_MASK value.
- Fix FDE function start address check in __read_fde().
- Rename SFrame V3 definitions accoring to final specification. (Indu)
- Improve comments on why UNWIND_USER_RULE_CFA_OFFSET is not
implemented. (Mark Rutland)
- Add/update/improve sframe debug messages.
- Add generic and arch-specific unwind_user.h to MAINTAINERS.
- Add arch-specific unwind_user_sframe.h to MAINTAINERS.
Changes in v13 (see patch notes for details):
- Add support for SFrame V3, including its new flexible FDEs. SFrame V2
is not supported.
Changes in v12 (see patch notes for details):
- Adjust to Peter's latest undwind user enhancements.
- Simplify logic by using an internal SFrame FDE representation, whose
FDE function start address field is an address instead of a PC-relative
offset (from FDE).
- Rename struct sframe_fre to sframe_fre_internal to align with
struct sframe_fde_internal.
- Remove unused pt_regs from unwind_user_next_common() and its
callers. (Peter)
- Simplify unwind_user_next_sframe(). (Peter)
- Fix a few checkpatch errors and warnings.
- Minor cleanups (e.g. move includes, fix indentation).
Changes in v11:
- Support for SFrame V2 PC-relative FDE function start address.
- Support for SFrame V2 representing RA undefined as indication for
outermost frames.
Patch 1 (new in v14), as a preparatory cleanup, adds the generic and
arch-specific unwind_user.h to MAINTAINERS.
Patches 2, 5, 12, and 19 have been updated to exclusively support the
latest SFrame V3 stack trace information format, that is generated by
binutils 2.46+. Old SFrame V2 sections get rejected with dynamic debug
message "bad/unsupported sframe header".
Patches 8 and 9 add support to unwind user (sframe) for outermost frames.
Patches 13-16 add support to unwind user (sframe) for the new SFrame V3
flexible FDEs.
Patch 17 improves the performance of searching the SFrame FRE for an IP.
Patch 18 (new in v15) duplicates registered .sframe section data on
clone/fork from the parent to the child process.
Patch 20 is for test purposes only and will get replaced by a new
syscall, that Steven is working on:
[RFC][PATCH] unwind: Add stacktrace_setup system call
https://lore.kernel.org/all/20260429114355.6c712e6a@gandalf.local.home/
Regards,
Jens
Jens Remus (9):
unwind_user: Add generic and arch-specific headers to MAINTAINERS
unwind_user: Stop when reaching an outermost frame
unwind_user/sframe: Add support for outermost frame indication
unwind_user: Enable archs that pass RA in a register
unwind_user: Flexible FP/RA recovery rules
unwind_user: Flexible CFA recovery rules
unwind_user/sframe: Add support for SFrame V3 flexible FDEs
unwind_user/sframe: Separate reading of FRE from reading of FRE data
words
unwind_user/sframe: Duplicate registered .sframe section data on
clone/fork
Josh Poimboeuf (11):
unwind_user/sframe: Add support for reading .sframe headers
unwind_user/sframe: Store .sframe section data in per-mm maple tree
x86/uaccess: Add unsafe_copy_from_user() implementation
unwind_user/sframe: Add support for reading .sframe contents
unwind_user/sframe: Detect .sframe sections in executables
unwind_user/sframe: Wire up unwind_user to sframe
unwind_user/sframe: Remove .sframe section on detected corruption
unwind_user/sframe: Show file name in debug output
unwind_user/sframe: Add .sframe validation option
unwind_user/sframe/x86: Enable sframe unwinding on x86
unwind_user/sframe: Add prctl() interface for registering .sframe
sections
MAINTAINERS | 4 +
arch/Kconfig | 23 +
arch/x86/Kconfig | 1 +
arch/x86/include/asm/mmu.h | 2 +-
arch/x86/include/asm/uaccess.h | 39 +-
arch/x86/include/asm/unwind_user.h | 74 +-
arch/x86/include/asm/unwind_user_sframe.h | 12 +
fs/binfmt_elf.c | 48 +-
include/linux/mm_types.h | 3 +
include/linux/sframe.h | 65 ++
include/linux/unwind_user.h | 20 +
include/linux/unwind_user_types.h | 50 +-
include/uapi/linux/elf.h | 1 +
include/uapi/linux/prctl.h | 4 +
kernel/fork.c | 10 +
kernel/sys.c | 9 +
kernel/unwind/Makefile | 3 +-
kernel/unwind/sframe.c | 949 ++++++++++++++++++++++
kernel/unwind/sframe.h | 88 ++
kernel/unwind/sframe_debug.h | 75 ++
kernel/unwind/user.c | 133 ++-
mm/init-mm.c | 2 +
mm/mmap.c | 9 +
23 files changed, 1586 insertions(+), 38 deletions(-)
create mode 100644 arch/x86/include/asm/unwind_user_sframe.h
create mode 100644 include/linux/sframe.h
create mode 100644 kernel/unwind/sframe.c
create mode 100644 kernel/unwind/sframe.h
create mode 100644 kernel/unwind/sframe_debug.h
--
2.51.0
^ permalink raw reply
* [PATCH v15 02/20] unwind_user/sframe: Add support for reading .sframe headers
From: Jens Remus @ 2026-05-20 15:39 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James
Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
Jose E. Marchesi, Beau Belgrave, Florian Weimer,
Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
Ilya Leoshkevich, Steven Rostedt (Google)
In-Reply-To: <20260520154004.3845823-1-jremus@linux.ibm.com>
From: Josh Poimboeuf <jpoimboe@kernel.org>
In preparation for unwinding user space stacks with sframe, add basic
sframe compile infrastructure and support for reading the .sframe
section header.
sframe_add_section() reads the header and unconditionally returns an
error, so it's not very useful yet. A subsequent patch will improve
that.
Link: https://lore.kernel.org/all/f27e8463783febfa0dabb0432a3dd6be8ad98412.1737511963.git.jpoimboe@kernel.org/
[ Jens Remus: Add support for SFrame V3. Add support for PC-relative
FDE function start offset. Cleanup includes and indentation. ]
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Indu Bhagat <ibhagatgnu@gmail.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
Notes (jremus):
Changes in v15:
- Improve text/sframe section start/end validation. (Sashiko AI)
- Use GFP_KERNEL_ACCOUNT instead of GFP_KERNEL (see
memory-allocation.rst, section "Get Free Page flags"). (Sashiko AI)
Changes in v14:
- Rename SFRAME_FDE_TYPE_REGULAR to SFRAME_FDE_TYPE_DEFAULT to match
SFrame V3 specification. (Indu)
- Correct SFRAME_V3_FDE_TYPE_MASK value.
Changes in v13:
- Update to SFrame V3:
- Add and use SFRAME_VERSION_3 definition.
- Add helper macros to access SFrame V3 FDE type.
- Rename SFRAME_FUNC_*() macros to SFRAME_FDE_*().
- Rename SFRAME_FDE_TYPE_PC* defines to SFRAME_FDE_PCTYPE_* and
SFRAME_FUNC_FDE_TYPE() macro to SFRAME_V3_FDE_PCTYPE().
- Reword OFFSET to DATAWORD in SFRAME_FRE_OFFSET_{COUNT|SIZE}()
macros.
- Rename version-specific SFRAME_*() macros to SFRAME_V3_*().
- Update struct sframe_fde and rename to sframe_fde_v3:
- Change field start_addr from s32 to s64 and rename to
func_start_off.
- Change field fres_num from u32 to u16.
- New field u8 info2.
- Remove u16 padding field.
- Split FDE into function descriptor entry (struct sframe_fde_v3) and
attributes (struct sframe_fde_v3).
- Rename macro parameter "data" to "info" to hint at fde/fre info
word and wrap it in parenthesis.
- Group SFRAME_* definitions so that related ones are together.
- Reword commit message (my changes).
MAINTAINERS | 1 +
arch/Kconfig | 3 +
include/linux/sframe.h | 37 +++++++++++
kernel/unwind/Makefile | 3 +-
kernel/unwind/sframe.c | 136 +++++++++++++++++++++++++++++++++++++++++
kernel/unwind/sframe.h | 81 ++++++++++++++++++++++++
6 files changed, 260 insertions(+), 1 deletion(-)
create mode 100644 include/linux/sframe.h
create mode 100644 kernel/unwind/sframe.c
create mode 100644 kernel/unwind/sframe.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 7434e9d7b33f..a9b42b67a88d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -27876,6 +27876,7 @@ M: Steven Rostedt <rostedt@goodmis.org>
S: Maintained
F: arch/*/include/asm/unwind_user.h
F: include/asm-generic/unwind_user.h
+F: include/linux/sframe.h
F: include/linux/unwind*.h
F: kernel/unwind/
diff --git a/arch/Kconfig b/arch/Kconfig
index e86880045158..94b2d5e8e529 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -486,6 +486,9 @@ config HAVE_UNWIND_USER_FP
bool
select UNWIND_USER
+config HAVE_UNWIND_USER_SFRAME
+ bool
+
config HAVE_PERF_REGS
bool
help
diff --git a/include/linux/sframe.h b/include/linux/sframe.h
new file mode 100644
index 000000000000..0642595534f9
--- /dev/null
+++ b/include/linux/sframe.h
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SFRAME_H
+#define _LINUX_SFRAME_H
+
+#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
+
+struct sframe_section {
+ unsigned long sframe_start;
+ unsigned long sframe_end;
+ unsigned long text_start;
+ unsigned long text_end;
+
+ unsigned long fdes_start;
+ unsigned long fres_start;
+ unsigned long fres_end;
+ unsigned int num_fdes;
+
+ signed char ra_off;
+ signed char fp_off;
+};
+
+extern int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
+ unsigned long text_start, unsigned long text_end);
+extern int sframe_remove_section(unsigned long sframe_addr);
+
+#else /* !CONFIG_HAVE_UNWIND_USER_SFRAME */
+
+static inline int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
+ unsigned long text_start, unsigned long text_end)
+{
+ return -ENOSYS;
+}
+static inline int sframe_remove_section(unsigned long sframe_addr) { return -ENOSYS; }
+
+#endif /* CONFIG_HAVE_UNWIND_USER_SFRAME */
+
+#endif /* _LINUX_SFRAME_H */
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
index eae37bea54fd..146038165865 100644
--- a/kernel/unwind/Makefile
+++ b/kernel/unwind/Makefile
@@ -1 +1,2 @@
- obj-$(CONFIG_UNWIND_USER) += user.o deferred.o
+ obj-$(CONFIG_UNWIND_USER) += user.o deferred.o
+ obj-$(CONFIG_HAVE_UNWIND_USER_SFRAME) += sframe.o
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
new file mode 100644
index 000000000000..d24e9d4f8bef
--- /dev/null
+++ b/kernel/unwind/sframe.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Userspace sframe access functions
+ */
+
+#define pr_fmt(fmt) "sframe: " fmt
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/srcu.h>
+#include <linux/uaccess.h>
+#include <linux/mm.h>
+#include <linux/string_helpers.h>
+#include <linux/sframe.h>
+#include <linux/unwind_user_types.h>
+
+#include "sframe.h"
+
+#define dbg(fmt, ...) \
+ pr_debug("%s (%d): " fmt, current->comm, current->pid, ##__VA_ARGS__)
+
+static void free_section(struct sframe_section *sec)
+{
+ kfree(sec);
+}
+
+static int sframe_read_header(struct sframe_section *sec)
+{
+ unsigned long header_end, fdes_start, fdes_end, fres_start, fres_end;
+ struct sframe_header shdr;
+ unsigned int num_fdes;
+
+ if (copy_from_user(&shdr, (void __user *)sec->sframe_start, sizeof(shdr))) {
+ dbg("header usercopy failed\n");
+ return -EFAULT;
+ }
+
+ if (shdr.preamble.magic != SFRAME_MAGIC ||
+ shdr.preamble.version != SFRAME_VERSION_3 ||
+ !(shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
+ !(shdr.preamble.flags & SFRAME_F_FDE_FUNC_START_PCREL) ||
+ shdr.auxhdr_len) {
+ dbg("bad/unsupported sframe header\n");
+ return -EINVAL;
+ }
+
+ if (!shdr.num_fdes || !shdr.num_fres) {
+ dbg("no fde/fre entries\n");
+ return -EINVAL;
+ }
+
+ header_end = sec->sframe_start + SFRAME_HEADER_SIZE(shdr);
+ if (header_end >= sec->sframe_end) {
+ dbg("header doesn't fit in section\n");
+ return -EINVAL;
+ }
+
+ num_fdes = shdr.num_fdes;
+ fdes_start = header_end + shdr.fdes_off;
+ fdes_end = fdes_start + (num_fdes * sizeof(struct sframe_fde_v3));
+
+ fres_start = header_end + shdr.fres_off;
+ fres_end = fres_start + shdr.fre_len;
+
+ if (fres_start < fdes_end || fres_end > sec->sframe_end) {
+ dbg("inconsistent fde/fre offsets\n");
+ return -EINVAL;
+ }
+
+ sec->num_fdes = num_fdes;
+ sec->fdes_start = fdes_start;
+ sec->fres_start = fres_start;
+ sec->fres_end = fres_end;
+
+ sec->ra_off = shdr.cfa_fixed_ra_offset;
+ sec->fp_off = shdr.cfa_fixed_fp_offset;
+
+ return 0;
+}
+
+int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
+ unsigned long text_start, unsigned long text_end)
+{
+ struct vm_area_struct *sframe_vma, *text_vma;
+ struct mm_struct *mm = current->mm;
+ struct sframe_section *sec;
+ int ret;
+
+ if (sframe_start >= sframe_end || text_start >= text_end) {
+ dbg("invalid sframe/text address\n");
+ return -EINVAL;
+ }
+
+ scoped_guard(mmap_read_lock, mm) {
+ sframe_vma = vma_lookup(mm, sframe_start);
+ if (!sframe_vma || sframe_end > sframe_vma->vm_end) {
+ dbg("bad sframe address (0x%lx - 0x%lx)\n",
+ sframe_start, sframe_end);
+ return -EINVAL;
+ }
+
+ text_vma = vma_lookup(mm, text_start);
+ if (!text_vma ||
+ !(text_vma->vm_flags & VM_EXEC) ||
+ text_end > text_vma->vm_end) {
+ dbg("bad text address (0x%lx - 0x%lx)\n",
+ text_start, text_end);
+ return -EINVAL;
+ }
+ }
+
+ sec = kzalloc(sizeof(*sec), GFP_KERNEL_ACCOUNT);
+ if (!sec)
+ return -ENOMEM;
+
+ sec->sframe_start = sframe_start;
+ sec->sframe_end = sframe_end;
+ sec->text_start = text_start;
+ sec->text_end = text_end;
+
+ ret = sframe_read_header(sec);
+ if (ret)
+ goto err_free;
+
+ /* TODO nowhere to store it yet - just free it and return an error */
+ ret = -ENOSYS;
+
+err_free:
+ free_section(sec);
+ return ret;
+}
+
+int sframe_remove_section(unsigned long sframe_start)
+{
+ return -ENOSYS;
+}
diff --git a/kernel/unwind/sframe.h b/kernel/unwind/sframe.h
new file mode 100644
index 000000000000..fc2908e92c7b
--- /dev/null
+++ b/kernel/unwind/sframe.h
@@ -0,0 +1,81 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * From https://www.sourceware.org/binutils/docs/sframe-spec.html
+ */
+#ifndef _SFRAME_H
+#define _SFRAME_H
+
+#include <linux/types.h>
+
+#define SFRAME_VERSION_1 1
+#define SFRAME_VERSION_2 2
+#define SFRAME_VERSION_3 3
+#define SFRAME_MAGIC 0xdee2
+
+#define SFRAME_F_FDE_SORTED 0x1
+#define SFRAME_F_FRAME_POINTER 0x2
+#define SFRAME_F_FDE_FUNC_START_PCREL 0x4
+
+#define SFRAME_ABI_AARCH64_ENDIAN_BIG 1
+#define SFRAME_ABI_AARCH64_ENDIAN_LITTLE 2
+#define SFRAME_ABI_AMD64_ENDIAN_LITTLE 3
+
+struct sframe_preamble {
+ u16 magic;
+ u8 version;
+ u8 flags;
+} __packed;
+
+struct sframe_header {
+ struct sframe_preamble preamble;
+ u8 abi_arch;
+ s8 cfa_fixed_fp_offset;
+ s8 cfa_fixed_ra_offset;
+ u8 auxhdr_len;
+ u32 num_fdes;
+ u32 num_fres;
+ u32 fre_len;
+ u32 fdes_off;
+ u32 fres_off;
+} __packed;
+
+#define SFRAME_HEADER_SIZE(header) \
+ ((sizeof(struct sframe_header) + (header).auxhdr_len))
+
+struct sframe_fde_v3 {
+ s64 func_start_off;
+ u32 func_size;
+ u32 fres_off;
+} __packed;
+
+struct sframe_fda_v3 {
+ u16 fres_num;
+ u8 info;
+ u8 info2;
+ u8 rep_size;
+} __packed;
+
+#define SFRAME_FDE_PCTYPE_INC 0
+#define SFRAME_FDE_PCTYPE_MASK 1
+
+#define SFRAME_AARCH64_PAUTH_KEY_A 0
+#define SFRAME_AARCH64_PAUTH_KEY_B 1
+
+#define SFRAME_V3_FDE_FRE_TYPE(info) ((info) & 0xf)
+#define SFRAME_V3_FDE_PCTYPE(info) (((info) >> 4) & 0x1)
+#define SFRAME_V3_AARCH64_FDE_PAUTH_KEY(info) (((info) >> 5) & 0x1)
+
+#define SFRAME_FDE_TYPE_DEFAULT 0
+
+#define SFRAME_V3_FDE_TYPE_MASK 0x1f
+#define SFRAME_V3_FDE_TYPE(info2) ((info2) & SFRAME_V3_FDE_TYPE_MASK)
+
+#define SFRAME_BASE_REG_FP 0
+#define SFRAME_BASE_REG_SP 1
+
+#define SFRAME_V3_FRE_CFA_BASE_REG_ID(info) ((info) & 0x1)
+#define SFRAME_V3_FRE_DATAWORD_COUNT(info) (((info) >> 1) & 0xf)
+#define SFRAME_V3_FRE_DATAWORD_SIZE(info) (((info) >> 5) & 0x3)
+#define SFRAME_V3_AARCH64_FRE_MANGLED_RA_P(info) (((info) >> 7) & 0x1)
+
+#endif /* _SFRAME_H */
--
2.51.0
^ permalink raw reply related
* [PATCH v15 10/20] unwind_user/sframe: Remove .sframe section on detected corruption
From: Jens Remus @ 2026-05-20 15:39 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James
Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
Jose E. Marchesi, Beau Belgrave, Florian Weimer,
Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
Ilya Leoshkevich, Steven Rostedt (Google)
In-Reply-To: <20260520154004.3845823-1-jremus@linux.ibm.com>
From: Josh Poimboeuf <jpoimboe@kernel.org>
To avoid continued attempted use of a bad .sframe section, remove it
on demand when the first sign of corruption is detected.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Indu Bhagat <ibhagatgnu@gmail.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
Notes (jremus):
Changes in v15:
- sframe_find(): Align to normalized error code usage and remove .sframe
for all but ENOENT. Also remove if user_read_access_begin() fails.
kernel/unwind/sframe.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index f723c1a32f90..02331956009a 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -360,16 +360,23 @@ int sframe_find(unsigned long ip, struct unwind_user_frame *frame)
return -ENOENT;
if (!user_read_access_begin((void __user *)sec->sframe_start,
- sec->sframe_end - sec->sframe_start))
- return -EFAULT;
+ sec->sframe_end - sec->sframe_start)) {
+ ret = -EFAULT;
+ goto end;
+ }
ret = __find_fde(sec, ip, &fde);
if (ret)
- goto end;
+ goto end_uaccess;
ret = __find_fre(sec, &fde, ip, frame);
-end:
+end_uaccess:
user_read_access_end();
+
+end:
+ if (ret && ret != -ENOENT)
+ WARN_ON_ONCE(sframe_remove_section(sec->sframe_start));
+
return ret;
}
--
2.51.0
^ permalink raw reply related
* [PATCH v15 16/20] unwind_user/sframe: Add support for SFrame V3 flexible FDEs
From: Jens Remus @ 2026-05-20 15:40 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James
Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
Jose E. Marchesi, Beau Belgrave, Florian Weimer,
Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
Ilya Leoshkevich
In-Reply-To: <20260520154004.3845823-1-jremus@linux.ibm.com>
SFrame V3 introduces flexible FDEs in addition to the regular FDEs.
The key difference is that flexible FDEs encode the CFA, RA, and FP
tracking information using two FRE data words, a control word and an
offset, or a single padding data word of zero (e.g. to represent FP
without RA tracking information).
The control word contains the following information:
- reg_p: Whether to use the register contents (reg_p=1) specified
by regnum or the CFA (reg_p=0) as base.
- deref_p: Whether to dereference.
- regnum: A DWARF register number.
The offset is added to the base (i.e. CFA or register contents). Then
the resulting address may optionally be dereferenced.
This enables the following flexible CFA and FP/RA recovery rules:
- CFA = register + offset // reg_p=1, deref_p=0
- CFA = *(register + offset) // reg_p=1, deref_p=1
- FP/RA = *(CFA + offset) // reg_p=0, deref_p=0
- FP/RA = register + offset // reg_p=1, deref_p=0
- FP/RA = *(register + offset) // reg_p=1, deref_p=1
Note that for the CFA a rule with reg_p=0 is invalid, as the value of
the CFA cannot be described using itself as base. For FP/RA a rule with
reg_p=0 and deref_p=0 and regnum=0 is invalid, as it that is equal to
the padding data word of zero.
Reviewed-by: Indu Bhagat <ibhagatgnu@gmail.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
Notes (jremus):
Changes in v15:
- __read_flex_fde_fre_datawords(): Add comment on FRE dataword RA/FP
location info decoding logic. (Sashiko AI)
- Fix outermost frame (FRE without datawords) handling to not cause
sframe_init_cfa_rule_data() and ultimately sframe_find() to fail
with -EINVAL. (Sashiko AI)
- sframe_init_[cfa_]rule_data(): Reject FRE control word with
reserved_p=1. (Sashiko AI)
- __find_fre(): Return RC of sframe_init_[cfa_]rule_data() if bad RC.
- Normalize error code usage (.sframe is removed for all but ENOENT):
ENOENT: No sframe or no FDE for IP found
(FDE found but no FRE is EINVAL)
EFAULT: Bad address
EINVAL: Invalid input or sframe
Changes in v14:
- Rename __read_regular_fre_datawords() to
__read_default_fre_datawords() to align to SFrame V3 specification
(default FRE).
- Rename SFRAME_FDE_TYPE_FLEXIBLE to SFRAME_FDE_TYPE_FLEX to match
SFrame V3 specification and adjust to rename of SFRAME_FDE_TYPE_*.
- Rename SFRAME_V3_FLEX_FDE_CTLWORD_*() to
SFRAME_V3_FLEX_FDE_CTRLWORD_*() to match SFrame V3 reference
implementation.
- Add arch/*/include/asm/unwind_user_sframe.h to MAINTAINERS.
MAINTAINERS | 1 +
kernel/unwind/sframe.c | 282 +++++++++++++++++++++++++++++++++--------
kernel/unwind/sframe.h | 6 +
3 files changed, 237 insertions(+), 52 deletions(-)
diff --git a/MAINTAINERS b/MAINTAINERS
index a9b42b67a88d..25f0b933511c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -27875,6 +27875,7 @@ M: Josh Poimboeuf <jpoimboe@kernel.org>
M: Steven Rostedt <rostedt@goodmis.org>
S: Maintained
F: arch/*/include/asm/unwind_user.h
+F: arch/*/include/asm/unwind_user_sframe.h
F: include/asm-generic/unwind_user.h
F: include/linux/sframe.h
F: include/linux/unwind*.h
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index 6187379750db..48709f0bafc7 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -12,6 +12,7 @@
#include <linux/mm.h>
#include <linux/string_helpers.h>
#include <linux/sframe.h>
+#include <asm/unwind_user_sframe.h>
#include <linux/unwind_user_types.h>
#include "sframe.h"
@@ -31,8 +32,11 @@ struct sframe_fde_internal {
struct sframe_fre_internal {
unsigned int size;
u32 ip_off;
+ u32 cfa_ctl;
s32 cfa_off;
+ u32 ra_ctl;
s32 ra_off;
+ u32 fp_ctl;
s32 fp_off;
u8 info;
};
@@ -200,16 +204,158 @@ static __always_inline int __find_fde(struct sframe_section *sec,
s32 : UNSAFE_GET_USER_SIGNED_INC(to, from, size, label), \
s64 : UNSAFE_GET_USER_SIGNED_INC(to, from, size, label))
+static __always_inline int
+__read_default_fre_datawords(struct sframe_section *sec,
+ struct sframe_fde_internal *fde,
+ unsigned long cur,
+ unsigned char dataword_count,
+ unsigned char dataword_size,
+ struct sframe_fre_internal *fre)
+{
+ s32 cfa_off, ra_off, fp_off;
+ unsigned int cfa_regnum;
+
+ UNSAFE_GET_USER_INC(cfa_off, cur, dataword_size, Efault);
+ dataword_count--;
+
+ ra_off = sec->ra_off;
+ if (!ra_off && dataword_count) {
+ dataword_count--;
+ UNSAFE_GET_USER_INC(ra_off, cur, dataword_size, Efault);
+ }
+
+ fp_off = sec->fp_off;
+ if (!fp_off && dataword_count) {
+ dataword_count--;
+ UNSAFE_GET_USER_INC(fp_off, cur, dataword_size, Efault);
+ }
+
+ if (dataword_count)
+ return -EINVAL;
+
+ cfa_regnum =
+ (SFRAME_V3_FRE_CFA_BASE_REG_ID(fre->info) == SFRAME_BASE_REG_FP) ?
+ SFRAME_REG_FP : SFRAME_REG_SP;
+
+ fre->cfa_ctl = (cfa_regnum << 3) | 1; /* regnum, deref_p=0, reg_p=1 */
+ fre->cfa_off = cfa_off;
+ fre->ra_ctl = ra_off ? 2 : 0; /* regnum=0, deref_p=(ra_off != 0), reg_p=0 */
+ fre->ra_off = ra_off;
+ fre->fp_ctl = fp_off ? 2 : 0; /* regnum=0, deref_p=(fp_off != 0), reg_p=0 */
+ fre->fp_off = fp_off;
+
+ return 0;
+
+Efault:
+ return -EFAULT;
+}
+
+static __always_inline int
+__read_flex_fde_fre_datawords(struct sframe_section *sec,
+ struct sframe_fde_internal *fde,
+ unsigned long cur,
+ unsigned char dataword_count,
+ unsigned char dataword_size,
+ struct sframe_fre_internal *fre)
+{
+ u32 cfa_ctl, ra_ctl, fp_ctl;
+ s32 cfa_off, ra_off, fp_off;
+
+ if (dataword_count < 2)
+ return -EINVAL;
+ UNSAFE_GET_USER_INC(cfa_ctl, cur, dataword_size, Efault);
+ UNSAFE_GET_USER_INC(cfa_off, cur, dataword_size, Efault);
+ dataword_count -= 2;
+
+ /*
+ * Each RA/FP location info consumes either two datawords
+ * (control word + offset) or one padding word substituting
+ * for that pair. Padding is only valid as substitution if
+ * followed by further non-padding location info. Therefore
+ * decoding only proceeds with at least two datawords. Any
+ * leftover trailing datawords are invalid and rejected by
+ * the final check.
+ */
+
+ ra_off = sec->ra_off;
+ ra_ctl = ra_off ? 2 : 0; /* regnum=0, deref_p=(ra_off != 0), reg_p=0 */
+ if (dataword_count >= 2) {
+ UNSAFE_GET_USER_INC(ra_ctl, cur, dataword_size, Efault);
+ dataword_count--;
+ if (ra_ctl) {
+ UNSAFE_GET_USER_INC(ra_off, cur, dataword_size, Efault);
+ dataword_count--;
+ } else {
+ /* Padding RA location info */
+ ra_ctl = ra_off ? 2 : 0; /* re-deduce (see above) */
+ }
+ }
+
+ fp_off = sec->fp_off;
+ fp_ctl = fp_off ? 2 : 0; /* regnum=0, deref_p=(fp_off != 0), reg_p=0 */
+ if (dataword_count >= 2) {
+ UNSAFE_GET_USER_INC(fp_ctl, cur, dataword_size, Efault);
+ dataword_count--;
+ if (fp_ctl) {
+ UNSAFE_GET_USER_INC(fp_off, cur, dataword_size, Efault);
+ dataword_count--;
+ } else {
+ /* Padding FP location info */
+ fp_ctl = fp_off ? 2 : 0; /* re-deduce (see above) */
+ }
+ }
+
+ /* Reject trailing padding or unknown extra datawords */
+ if (dataword_count)
+ return -EINVAL;
+
+ fre->cfa_ctl = cfa_ctl;
+ fre->cfa_off = cfa_off;
+ fre->ra_ctl = ra_ctl;
+ fre->ra_off = ra_off;
+ fre->fp_ctl = fp_ctl;
+ fre->fp_off = fp_off;
+
+ return 0;
+
+Efault:
+ return -EFAULT;
+}
+
+static __always_inline int
+__read_fre_datawords(struct sframe_section *sec,
+ struct sframe_fde_internal *fde,
+ unsigned long cur,
+ unsigned char dataword_count,
+ unsigned char dataword_size,
+ struct sframe_fre_internal *fre)
+{
+ unsigned char fde_type = SFRAME_V3_FDE_TYPE(fde->info2);
+
+ switch (fde_type) {
+ case SFRAME_FDE_TYPE_DEFAULT:
+ return __read_default_fre_datawords(sec, fde, cur,
+ dataword_count,
+ dataword_size,
+ fre);
+ case SFRAME_FDE_TYPE_FLEX:
+ return __read_flex_fde_fre_datawords(sec, fde, cur,
+ dataword_count,
+ dataword_size,
+ fre);
+ default:
+ return -EINVAL;
+ }
+}
+
static __always_inline int __read_fre(struct sframe_section *sec,
struct sframe_fde_internal *fde,
unsigned long fre_addr,
struct sframe_fre_internal *fre)
{
- unsigned char fde_type = SFRAME_V3_FDE_TYPE(fde->info2);
unsigned char fde_pctype = SFRAME_V3_FDE_PCTYPE(fde->info);
unsigned char fre_type = SFRAME_V3_FDE_FRE_TYPE(fde->info);
unsigned char dataword_count, dataword_size;
- s32 cfa_off, ra_off, fp_off;
unsigned long cur = fre_addr;
unsigned char addr_size;
u32 ip_off;
@@ -236,75 +382,101 @@ static __always_inline int __read_fre(struct sframe_section *sec,
if (cur + (dataword_count * dataword_size) > sec->fres_end)
return -EFAULT;
- /* TODO: Support for flexible FDEs not implemented yet. */
- if (fde_type != SFRAME_FDE_TYPE_DEFAULT)
- return -EINVAL;
+ fre->size = addr_size + 1 + (dataword_count * dataword_size);
+ fre->ip_off = ip_off;
+ fre->info = info;
if (!dataword_count) {
/*
- * A FRE without data words indicates RA undefined /
- * outermost frame.
+ * A FRE without datawords indicates an outermost
+ * frame. Zero-initialize CFA, RA, and FP location
+ * info, except for the CFA control word, so that
+ * neither sframe_init_cfa_rule_data() nor
+ * sframe_init_rule_data() fail.
*/
- cfa_off = 0;
- ra_off = 0;
- fp_off = 0;
- goto done;
- }
-
- UNSAFE_GET_USER_INC(cfa_off, cur, dataword_size, Efault);
- dataword_count--;
-
- ra_off = sec->ra_off;
- if (!ra_off && dataword_count) {
- dataword_count--;
- UNSAFE_GET_USER_INC(ra_off, cur, dataword_size, Efault);
- }
+ fre->cfa_ctl = (SFRAME_REG_SP << 3) | 1; /* regnum=SP, deref_p=0, reg_p=1 */
+ fre->cfa_off = 0;
+ fre->ra_ctl = 0;
+ fre->ra_off = 0;
+ fre->fp_ctl = 0;
+ fre->fp_off = 0;
- fp_off = sec->fp_off;
- if (!fp_off && dataword_count) {
- dataword_count--;
- UNSAFE_GET_USER_INC(fp_off, cur, dataword_size, Efault);
+ return 0;
}
- if (dataword_count)
- return -EINVAL;
-
-done:
- fre->size = addr_size + 1 + (dataword_count * dataword_size);
- fre->ip_off = ip_off;
- fre->cfa_off = cfa_off;
- fre->ra_off = ra_off;
- fre->fp_off = fp_off;
- fre->info = info;
-
- return 0;
+ return __read_fre_datawords(sec, fde, cur, dataword_count, dataword_size, fre);
Efault:
return -EFAULT;
}
-static __always_inline void
+static __always_inline int
sframe_init_cfa_rule_data(struct unwind_user_cfa_rule_data *cfa_rule_data,
- unsigned char fre_info,
- s32 offset)
+ u32 ctlword, s32 offset)
{
- if (SFRAME_V3_FRE_CFA_BASE_REG_ID(fre_info) == SFRAME_BASE_REG_FP)
- cfa_rule_data->rule = UNWIND_USER_CFA_RULE_FP_OFFSET;
- else
+ bool deref_p = SFRAME_V3_FLEX_FDE_CTRLWORD_DEREF_P(ctlword);
+ bool reg_p = SFRAME_V3_FLEX_FDE_CTRLWORD_REG_P(ctlword);
+ bool reserved_p = SFRAME_V3_FLEX_FDE_CTRLWORD_RESERVED_P(ctlword);
+ unsigned int regnum = SFRAME_V3_FLEX_FDE_CTRLWORD_REGNUM(ctlword);
+
+ if (reserved_p)
+ return -EINVAL;
+
+ /* CFA recovery rule must be register-based */
+ if (!reg_p)
+ return -EINVAL;
+
+ switch (regnum) {
+ case SFRAME_REG_SP:
cfa_rule_data->rule = UNWIND_USER_CFA_RULE_SP_OFFSET;
+ break;
+ case SFRAME_REG_FP:
+ cfa_rule_data->rule = UNWIND_USER_CFA_RULE_FP_OFFSET;
+ break;
+ default:
+ cfa_rule_data->rule = UNWIND_USER_CFA_RULE_REG_OFFSET;
+ cfa_rule_data->regnum = regnum;
+ }
+
+ if (deref_p)
+ cfa_rule_data->rule |= UNWIND_USER_RULE_DEREF;
+
cfa_rule_data->offset = offset;
+
+ return 0;
}
-static __always_inline void
+static __always_inline int
sframe_init_rule_data(struct unwind_user_rule_data *rule_data,
- s32 offset)
+ u32 ctlword, s32 offset)
{
- if (offset) {
- rule_data->rule = UNWIND_USER_RULE_CFA_OFFSET_DEREF;
- rule_data->offset = offset;
- } else {
+ bool deref_p = SFRAME_V3_FLEX_FDE_CTRLWORD_DEREF_P(ctlword);
+ bool reg_p = SFRAME_V3_FLEX_FDE_CTRLWORD_REG_P(ctlword);
+ bool reserved_p = SFRAME_V3_FLEX_FDE_CTRLWORD_RESERVED_P(ctlword);
+
+ if (!ctlword && !offset) {
rule_data->rule = UNWIND_USER_RULE_RETAIN;
+ return 0;
}
+
+ if (reserved_p)
+ return -EINVAL;
+
+ if (reg_p) {
+ unsigned int regnum = SFRAME_V3_FLEX_FDE_CTRLWORD_REGNUM(ctlword);
+
+ rule_data->rule = UNWIND_USER_RULE_REG_OFFSET;
+ rule_data->regnum = regnum;
+ } else {
+ rule_data->rule = UNWIND_USER_RULE_CFA_OFFSET;
+ }
+
+ if (deref_p)
+ rule_data->rule |= UNWIND_USER_RULE_DEREF;
+
+ rule_data->offset = offset;
+
+ return 0;
}
static __always_inline int __find_fre(struct sframe_section *sec,
@@ -356,9 +528,15 @@ static __always_inline int __find_fre(struct sframe_section *sec,
return -EINVAL;
fre = prev_fre;
- sframe_init_cfa_rule_data(&frame->cfa, fre->info, fre->cfa_off);
- sframe_init_rule_data(&frame->ra, fre->ra_off);
- sframe_init_rule_data(&frame->fp, fre->fp_off);
+ ret = sframe_init_cfa_rule_data(&frame->cfa, fre->cfa_ctl, fre->cfa_off);
+ if (ret)
+ return ret;
+ ret = sframe_init_rule_data(&frame->ra, fre->ra_ctl, fre->ra_off);
+ if (ret)
+ return ret;
+ ret = sframe_init_rule_data(&frame->fp, fre->fp_ctl, fre->fp_off);
+ if (ret)
+ return ret;
frame->outermost = SFRAME_V3_FRE_RA_UNDEFINED_P(fre->info);
return 0;
diff --git a/kernel/unwind/sframe.h b/kernel/unwind/sframe.h
index ed111fd0d702..1a2528e4b149 100644
--- a/kernel/unwind/sframe.h
+++ b/kernel/unwind/sframe.h
@@ -66,6 +66,7 @@ struct sframe_fda_v3 {
#define SFRAME_V3_AARCH64_FDE_PAUTH_KEY(info) (((info) >> 5) & 0x1)
#define SFRAME_FDE_TYPE_DEFAULT 0
+#define SFRAME_FDE_TYPE_FLEX 1
#define SFRAME_V3_FDE_TYPE_MASK 0x1f
#define SFRAME_V3_FDE_TYPE(info2) ((info2) & SFRAME_V3_FDE_TYPE_MASK)
@@ -79,4 +80,9 @@ struct sframe_fda_v3 {
#define SFRAME_V3_AARCH64_FRE_MANGLED_RA_P(info) (((info) >> 7) & 0x1)
#define SFRAME_V3_FRE_RA_UNDEFINED_P(info) (SFRAME_V3_FRE_DATAWORD_COUNT(info) == 0)
+#define SFRAME_V3_FLEX_FDE_CTRLWORD_REGNUM(data) (((data) >> 3) & 0x1f)
+#define SFRAME_V3_FLEX_FDE_CTRLWORD_RESERVED_P(data) (((data) >> 2) & 0x1)
+#define SFRAME_V3_FLEX_FDE_CTRLWORD_DEREF_P(data) (((data) >> 1) & 0x1)
+#define SFRAME_V3_FLEX_FDE_CTRLWORD_REG_P(data) ((data) & 0x1)
+
#endif /* _SFRAME_H */
--
2.51.0
^ permalink raw reply related
* [PATCH v15 13/20] unwind_user: Enable archs that pass RA in a register
From: Jens Remus @ 2026-05-20 15:39 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James
Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
Jose E. Marchesi, Beau Belgrave, Florian Weimer,
Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
Ilya Leoshkevich
In-Reply-To: <20260520154004.3845823-1-jremus@linux.ibm.com>
Not all architectures/ABIs pass the return address (RA) on the stack on
function entry, like x86-64 does due to its CALL instruction pushing
the RA onto the stack. Architectures/ABIs, such as s390, also do not
require the RA to be saved on the stack in the function prologue. In
particular, the RA may never be saved to the stack at all, such as in
leaf functions. Unwinding must therefore not assume the presence of a
RA saved on stack for the topmost frame.
Treat a RA offset from CFA of zero as indication that the RA is not
saved (on the stack). For the topmost frame treat it as indication that
the RA is in the link/RA register, such as on arm64 and s390, and obtain
it from there. For non-topmost frames treat it as error, as the RA must
be saved.
Additionally allow the SP to be unchanged in the topmost frame, for
architectures where SP at function entry == SP at call site, such as
arm64 and s390.
Note that treating a RA offset from CFA of zero as indication that
the RA is not saved on the stack additionally allows for architectures,
such as s390, where the frame pointer (FP) may be saved without the RA
being saved as well. Provided that such architectures represent this
in SFrame by encoding the "missing" RA offset using a padding RA offset
with a value of zero.
Reviewed-by: Indu Bhagat <ibhagatgnu@gmail.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
Notes (jremus):
Changes in v15:
- Define pr_fmt().
- unwind_user_get_ra_reg(): Use pr_debug_once() instead of
WARN_ON_ONCE() to prevent user-triggered warning/panic. (Sashiko AI)
- Reworded commit message. (Indu)
include/linux/unwind_user.h | 10 ++++++++++
kernel/unwind/sframe.c | 6 ++----
kernel/unwind/user.c | 20 ++++++++++++++++----
3 files changed, 28 insertions(+), 8 deletions(-)
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index 64618618febd..7bf58f23aa64 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -23,6 +23,16 @@ static inline bool unwind_user_at_function_start(struct pt_regs *regs)
#define unwind_user_at_function_start unwind_user_at_function_start
#endif
+#ifndef unwind_user_get_ra_reg
+static inline int unwind_user_get_ra_reg(unsigned long *val)
+{
+ pr_debug_once("%s (%d): unwind_user_get_ra_reg() not implemented\n",
+ current->comm, current->pid);
+ return -EINVAL;
+}
+#define unwind_user_get_ra_reg unwind_user_get_ra_reg
+#endif
+
int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
#endif /* _LINUX_UNWIND_USER_H */
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index b5f984fb2df2..4af533e9f980 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -255,10 +255,8 @@ static __always_inline int __read_fre(struct sframe_section *sec,
dataword_count--;
ra_off = sec->ra_off;
- if (!ra_off) {
- if (!dataword_count--)
- return -EINVAL;
-
+ if (!ra_off && dataword_count) {
+ dataword_count--;
UNSAFE_GET_USER_INC(ra_off, cur, dataword_size, Efault);
}
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index fdb1001e3750..afa7c6f6d9b4 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -2,6 +2,9 @@
/*
* Generic interfaces for unwinding user space
*/
+
+#define pr_fmt(fmt) "unwind_user: " fmt
+
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/sched/task_stack.h>
@@ -48,8 +51,12 @@ static int unwind_user_next_common(struct unwind_user_state *state,
}
cfa += frame->cfa_off;
- /* Make sure that stack is not going in wrong direction */
- if (cfa <= state->sp)
+ /*
+ * Make sure that stack is not going in wrong direction. Allow SP
+ * to be unchanged for the topmost frame, by subtracting topmost,
+ * which is either 0 or 1.
+ */
+ if (cfa <= state->sp - state->topmost)
return -EINVAL;
/* Make sure that the address is word aligned */
@@ -57,8 +64,13 @@ static int unwind_user_next_common(struct unwind_user_state *state,
return -EINVAL;
/* Get the Return Address (RA) */
- if (get_user_word(&ra, cfa, frame->ra_off, state->ws))
- return -EINVAL;
+ if (frame->ra_off) {
+ if (get_user_word(&ra, cfa, frame->ra_off, state->ws))
+ return -EINVAL;
+ } else {
+ if (!state->topmost || unwind_user_get_ra_reg(&ra))
+ return -EINVAL;
+ }
/* Get the Frame Pointer (FP) */
if (frame->fp_off && get_user_word(&fp, cfa, frame->fp_off, state->ws))
--
2.51.0
^ permalink raw reply related
* [PATCH v15 20/20] unwind_user/sframe: Add prctl() interface for registering .sframe sections
From: Jens Remus @ 2026-05-20 15:40 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James
Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
Jose E. Marchesi, Beau Belgrave, Florian Weimer,
Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
Ilya Leoshkevich, Steven Rostedt (Google)
In-Reply-To: <20260520154004.3845823-1-jremus@linux.ibm.com>
From: Josh Poimboeuf <jpoimboe@kernel.org>
The kernel doesn't have direct visibility to the ELF contents of shared
libraries. Add some prctl() interfaces which allow glibc to tell the
kernel where to find .sframe sections.
[
This adds an interface for prctl() for testing loading of sframes for
libraries. But this interface should really be a system call. This patch
is for testing purposes only and should not be applied to mainline.
]
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Indu Bhagat <ibhagatgnu@gmail.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
Notes (jremus):
Changes in v15:
- Fix rebase error (missing break). (Sashiko AI)
Changes in v14:
- Bump PR_ADD_SFRAME and PR_REMOVE_SFRAME.
include/uapi/linux/prctl.h | 4 ++++
kernel/sys.c | 9 +++++++++
2 files changed, 13 insertions(+)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index b6ec6f693719..bd0bf828b033 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -416,4 +416,8 @@ struct prctl_mm_map {
# define PR_CFI_DISABLE _BITUL(1)
# define PR_CFI_LOCK _BITUL(2)
+/* SFRAME management */
+#define PR_ADD_SFRAME 82
+#define PR_REMOVE_SFRAME 83
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 62e842055cc9..b0a9b1e3ccd7 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -65,6 +65,7 @@
#include <linux/rcupdate.h>
#include <linux/uidgid.h>
#include <linux/cred.h>
+#include <linux/sframe.h>
#include <linux/nospec.h>
@@ -2907,6 +2908,14 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
if (arg3 & PR_CFI_LOCK && !(arg3 & PR_CFI_DISABLE))
error = arch_prctl_lock_branch_landing_pad_state(me);
break;
+ case PR_ADD_SFRAME:
+ error = sframe_add_section(arg2, arg3, arg4, arg5);
+ break;
+ case PR_REMOVE_SFRAME:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = sframe_remove_section(arg2);
+ break;
default:
trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
error = -EINVAL;
--
2.51.0
^ permalink raw reply related
* [PATCH v15 14/20] unwind_user: Flexible FP/RA recovery rules
From: Jens Remus @ 2026-05-20 15:39 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James
Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
Jose E. Marchesi, Beau Belgrave, Florian Weimer,
Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
Ilya Leoshkevich
In-Reply-To: <20260520154004.3845823-1-jremus@linux.ibm.com>
To enable support for SFrame V3 flexible FDEs with a subsequent patch,
add support for the following flexible frame pointer (FP) and return
address (RA) recovery rules:
FP/RA = *(CFA + offset)
FP/RA = register + offset
FP/RA = *(register + offset)
Note that FP/RA recovery rules that use arbitrary register contents are
only valid when in the topmost frame, as their contents are otherwise
unknown.
This also enables unwinding of user space for architectures, such as
s390, that may save the frame pointer (FP) and/or return address (RA) in
other registers, for instance when in a leaf function.
Reviewed-by: Indu Bhagat <ibhagatgnu@gmail.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
Notes (jremus):
Changes in v15:
- Define dbg_once().
- unwind_user_get_reg(): Use pr_debug_once() instead of WARN_ON_ONCE()
to prevent user-triggered warning/panic. (Sashiko AI)
- unwind_user_next_common(): Handle UNWIND_USER_RULE_CFA_OFFSET for RA
and FP to use dbg_once() instead of WARN_ON_ONCE() to prevent user-
triggered warning/panic. (Sashiko AI)
Changes in v14:
- Improve comment on why UNWIND_USER_RULE_CFA_OFFSET is not implemented.
(Mark Rutland)
arch/x86/include/asm/unwind_user.h | 21 +++++++--
include/linux/unwind_user.h | 10 +++++
include/linux/unwind_user_types.h | 23 +++++++++-
kernel/unwind/sframe.c | 16 ++++++-
kernel/unwind/user.c | 70 +++++++++++++++++++++++++++---
5 files changed, 125 insertions(+), 15 deletions(-)
diff --git a/arch/x86/include/asm/unwind_user.h b/arch/x86/include/asm/unwind_user.h
index 2dfb5ef11e36..9c3417be4283 100644
--- a/arch/x86/include/asm/unwind_user.h
+++ b/arch/x86/include/asm/unwind_user.h
@@ -21,15 +21,26 @@ static inline int unwind_user_word_size(struct pt_regs *regs)
#define ARCH_INIT_USER_FP_FRAME(ws) \
.cfa_off = 2*(ws), \
- .ra_off = -1*(ws), \
- .fp_off = -2*(ws), \
+ .ra = { \
+ .rule = UNWIND_USER_RULE_CFA_OFFSET_DEREF,\
+ .offset = -1*(ws), \
+ }, \
+ .fp = { \
+ .rule = UNWIND_USER_RULE_CFA_OFFSET_DEREF,\
+ .offset = -2*(ws), \
+ }, \
.use_fp = true, \
.outermost = false,
#define ARCH_INIT_USER_FP_ENTRY_FRAME(ws) \
.cfa_off = 1*(ws), \
- .ra_off = -1*(ws), \
- .fp_off = 0, \
+ .ra = { \
+ .rule = UNWIND_USER_RULE_CFA_OFFSET_DEREF,\
+ .offset = -1*(ws), \
+ }, \
+ .fp = { \
+ .rule = UNWIND_USER_RULE_RETAIN,\
+ }, \
.use_fp = false, \
.outermost = false,
@@ -41,4 +52,6 @@ static inline bool unwind_user_at_function_start(struct pt_regs *regs)
#endif /* CONFIG_HAVE_UNWIND_USER_FP */
+#include <asm-generic/unwind_user.h>
+
#endif /* _ASM_X86_UNWIND_USER_H */
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index 7bf58f23aa64..6aca38f89ddd 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -33,6 +33,16 @@ static inline int unwind_user_get_ra_reg(unsigned long *val)
#define unwind_user_get_ra_reg unwind_user_get_ra_reg
#endif
+#ifndef unwind_user_get_reg
+static inline int unwind_user_get_reg(unsigned long *val, unsigned int regnum)
+{
+ pr_debug_once("%s (%d): unwind_user_get_reg(%u) not implemented\n",
+ current->comm, current->pid, regnum);
+ return -EINVAL;
+}
+#define unwind_user_get_reg unwind_user_get_reg
+#endif
+
int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
#endif /* _LINUX_UNWIND_USER_H */
diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
index 616cc5ee4586..0d02714a1b5d 100644
--- a/include/linux/unwind_user_types.h
+++ b/include/linux/unwind_user_types.h
@@ -27,10 +27,29 @@ struct unwind_stacktrace {
unsigned long *entries;
};
+#define UNWIND_USER_RULE_DEREF BIT(31)
+
+enum unwind_user_rule {
+ UNWIND_USER_RULE_RETAIN, /* entity = entity */
+ UNWIND_USER_RULE_CFA_OFFSET, /* entity = CFA + offset */
+ UNWIND_USER_RULE_REG_OFFSET, /* entity = register + offset */
+ /* DEREF variants */
+ UNWIND_USER_RULE_CFA_OFFSET_DEREF = /* entity = *(CFA + offset) */
+ UNWIND_USER_RULE_CFA_OFFSET | UNWIND_USER_RULE_DEREF,
+ UNWIND_USER_RULE_REG_OFFSET_DEREF = /* entity = *(register + offset) */
+ UNWIND_USER_RULE_REG_OFFSET | UNWIND_USER_RULE_DEREF,
+};
+
+struct unwind_user_rule_data {
+ enum unwind_user_rule rule;
+ s32 offset;
+ unsigned int regnum;
+};
+
struct unwind_user_frame {
s32 cfa_off;
- s32 ra_off;
- s32 fp_off;
+ struct unwind_user_rule_data ra;
+ struct unwind_user_rule_data fp;
bool use_fp;
bool outermost;
};
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index 4af533e9f980..e82d1dcdd471 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -283,6 +283,18 @@ static __always_inline int __read_fre(struct sframe_section *sec,
return -EFAULT;
}
+static __always_inline void
+sframe_init_rule_data(struct unwind_user_rule_data *rule_data,
+ s32 offset)
+{
+ if (offset) {
+ rule_data->rule = UNWIND_USER_RULE_CFA_OFFSET_DEREF;
+ rule_data->offset = offset;
+ } else {
+ rule_data->rule = UNWIND_USER_RULE_RETAIN;
+ }
+}
+
static __always_inline int __find_fre(struct sframe_section *sec,
struct sframe_fde_internal *fde,
unsigned long ip,
@@ -333,8 +345,8 @@ static __always_inline int __find_fre(struct sframe_section *sec,
fre = prev_fre;
frame->cfa_off = fre->cfa_off;
- frame->ra_off = fre->ra_off;
- frame->fp_off = fre->fp_off;
+ sframe_init_rule_data(&frame->ra, fre->ra_off);
+ sframe_init_rule_data(&frame->fp, fre->fp_off);
frame->use_fp = SFRAME_V3_FRE_CFA_BASE_REG_ID(fre->info) == SFRAME_BASE_REG_FP;
frame->outermost = SFRAME_V3_FRE_RA_UNDEFINED_P(fre->info);
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index afa7c6f6d9b4..c6a2abac78e0 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -12,6 +12,17 @@
#include <linux/uaccess.h>
#include <linux/sframe.h>
+#ifdef CONFIG_DYNAMIC_DEBUG
+
+#define dbg_once(fmt, ...) \
+ pr_debug_once("%s (%d): " fmt, current->comm, current->pid, ##__VA_ARGS__)
+
+#else /* !CONFIG_DYNAMIC_DEBUG */
+
+#define dbg_once(args...) no_printk(args)
+
+#endif /* !CONFIG_DYNAMIC_DEBUG */
+
#define for_each_user_frame(state) \
for (unwind_user_start(state); !(state)->done; unwind_user_next(state))
@@ -64,22 +75,67 @@ static int unwind_user_next_common(struct unwind_user_state *state,
return -EINVAL;
/* Get the Return Address (RA) */
- if (frame->ra_off) {
- if (get_user_word(&ra, cfa, frame->ra_off, state->ws))
- return -EINVAL;
- } else {
+ switch (frame->ra.rule) {
+ case UNWIND_USER_RULE_RETAIN:
if (!state->topmost || unwind_user_get_ra_reg(&ra))
return -EINVAL;
+ break;
+ case UNWIND_USER_RULE_CFA_OFFSET:
+ /*
+ * RA = CFA + offset does not make sense.
+ * A return address cannot legitimately be a stack address.
+ */
+ dbg_once("UNWIND_USER_RULE_CFA_OFFSET invalid for RA\n");
+ return -EINVAL;
+ case UNWIND_USER_RULE_CFA_OFFSET_DEREF:
+ ra = cfa + frame->ra.offset;
+ break;
+ case UNWIND_USER_RULE_REG_OFFSET:
+ case UNWIND_USER_RULE_REG_OFFSET_DEREF:
+ if (!state->topmost || unwind_user_get_reg(&ra, frame->ra.regnum))
+ return -EINVAL;
+ ra += frame->ra.offset;
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ return -EINVAL;
}
+ if (frame->ra.rule & UNWIND_USER_RULE_DEREF &&
+ get_user_word(&ra, ra, 0, state->ws))
+ return -EINVAL;
/* Get the Frame Pointer (FP) */
- if (frame->fp_off && get_user_word(&fp, cfa, frame->fp_off, state->ws))
+ switch (frame->fp.rule) {
+ case UNWIND_USER_RULE_RETAIN:
+ fp = state->fp;
+ break;
+ case UNWIND_USER_RULE_CFA_OFFSET:
+ /*
+ * FP = CFA + offset is currently not used for FP
+ * (e.g. SFrame cannot represent this rule).
+ */
+ dbg_once("UNWIND_USER_RULE_CFA_OFFSET unsupported for FP\n");
+ return -EINVAL;
+ case UNWIND_USER_RULE_CFA_OFFSET_DEREF:
+ fp = cfa + frame->fp.offset;
+ break;
+ case UNWIND_USER_RULE_REG_OFFSET:
+ case UNWIND_USER_RULE_REG_OFFSET_DEREF:
+ if (!state->topmost || unwind_user_get_reg(&fp, frame->fp.regnum))
+ return -EINVAL;
+ fp += frame->fp.offset;
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+ }
+ if (frame->fp.rule & UNWIND_USER_RULE_DEREF &&
+ get_user_word(&fp, fp, 0, state->ws))
return -EINVAL;
state->ip = ra;
state->sp = cfa;
- if (frame->fp_off)
- state->fp = fp;
+ state->fp = fp;
state->topmost = false;
return 0;
}
--
2.51.0
^ permalink raw reply related
* [PATCH v15 18/20] unwind_user/sframe: Duplicate registered .sframe section data on clone/fork
From: Jens Remus @ 2026-05-20 15:40 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James
Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
Jose E. Marchesi, Beau Belgrave, Florian Weimer,
Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
Ilya Leoshkevich
In-Reply-To: <20260520154004.3845823-1-jremus@linux.ibm.com>
When duplicating a process' virtual memory mappings also duplicate all
of its registered .sframe sections stored in the per-mm maple tree to
enable stacktracing using sframe of the child process.
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
include/linux/sframe.h | 5 ++++
kernel/unwind/sframe.c | 48 ++++++++++++++++++++++++++++++++++++
kernel/unwind/sframe_debug.h | 7 ++++++
mm/mmap.c | 9 +++++++
4 files changed, 69 insertions(+)
diff --git a/include/linux/sframe.h b/include/linux/sframe.h
index b79c5ec09229..91889b4fe3dd 100644
--- a/include/linux/sframe.h
+++ b/include/linux/sframe.h
@@ -28,6 +28,7 @@ struct sframe_section {
};
#define INIT_MM_SFRAME .sframe_mt = MTREE_INIT(sframe_mt, 0),
+extern int sframe_dup_mm(struct mm_struct *mm, struct mm_struct *oldmm);
extern void sframe_free_mm(struct mm_struct *mm);
extern int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
@@ -45,6 +46,10 @@ static inline bool current_has_sframe(void)
#else /* !CONFIG_HAVE_UNWIND_USER_SFRAME */
#define INIT_MM_SFRAME
+static inline int sframe_dup_mm(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+ return 0;
+}
static inline void sframe_free_mm(struct mm_struct *mm) {}
static inline int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
unsigned long text_start, unsigned long text_end)
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index ec8318977a2e..0661090dd5bc 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -886,6 +886,54 @@ int sframe_remove_section(unsigned long sframe_start)
return 0;
}
+static void __sframe_dup_section(struct sframe_section *sec, struct sframe_section *oldsec)
+{
+ sec->sframe_start = oldsec->sframe_start;
+ sec->sframe_end = oldsec->sframe_end;
+ sec->text_start = oldsec->text_start;
+ sec->text_end = oldsec->text_end;
+
+ sec->fdes_start = oldsec->fdes_start;
+ sec->fres_start = oldsec->fres_start;
+ sec->fres_end = oldsec->fres_end;
+ sec->num_fdes = oldsec->num_fdes;
+
+ sec->ra_off = oldsec->ra_off;
+ sec->fp_off = oldsec->fp_off;
+
+ dbg_dup(sec, oldsec);
+}
+
+int sframe_dup_mm(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+ struct sframe_section *sec, *oldsec;
+ unsigned long index = 0;
+ int ret;
+
+ guard(srcu)(&sframe_srcu);
+
+ mt_for_each(&oldmm->sframe_mt, oldsec, index, ULONG_MAX) {
+ sec = kzalloc(sizeof(*sec), GFP_KERNEL_ACCOUNT);
+ if (!sec)
+ return -ENOMEM;
+
+ __sframe_dup_section(sec, oldsec);
+
+ ret = mtree_insert_range(&mm->sframe_mt,
+ sec->text_start,
+ sec->text_end - 1,
+ sec, GFP_KERNEL_ACCOUNT);
+ if (ret)
+ goto err_free;
+ }
+
+ return 0;
+
+err_free:
+ free_section(sec);
+ return ret;
+}
+
void sframe_free_mm(struct mm_struct *mm)
{
struct sframe_section *sec;
diff --git a/kernel/unwind/sframe_debug.h b/kernel/unwind/sframe_debug.h
index a63e75cccc70..2503972155e8 100644
--- a/kernel/unwind/sframe_debug.h
+++ b/kernel/unwind/sframe_debug.h
@@ -48,6 +48,12 @@ static inline void dbg_init(struct sframe_section *sec)
sec->filename = kstrdup("(anonymous)", GFP_KERNEL_ACCOUNT);
}
+static inline void dbg_dup(struct sframe_section *sec, struct sframe_section *oldsec)
+{
+ if (oldsec->filename)
+ sec->filename = kstrdup(oldsec->filename, GFP_KERNEL_ACCOUNT);
+}
+
static inline void dbg_free(struct sframe_section *sec)
{
kfree(sec->filename);
@@ -61,6 +67,7 @@ static inline void dbg_free(struct sframe_section *sec)
static inline void dbg_print_header(struct sframe_section *sec) {}
static inline void dbg_init(struct sframe_section *sec) {}
+static inline void dbg_dup(struct sframe_section *sec, struct sframe_section *oldsec) {}
static inline void dbg_free(struct sframe_section *sec) {}
#endif /* !CONFIG_DYNAMIC_DEBUG */
diff --git a/mm/mmap.c b/mm/mmap.c
index 5754d1c36462..250e8bea73ef 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -48,6 +48,7 @@
#include <linux/sched/mm.h>
#include <linux/ksm.h>
#include <linux/memfd.h>
+#include <linux/sframe.h>
#include <linux/uaccess.h>
#include <asm/cacheflush.h>
@@ -1846,6 +1847,11 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
}
/* a new mm has just been created */
retval = arch_dup_mmap(oldmm, mm);
+#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
+ if (!retval) {
+ retval = sframe_dup_mm(mm, oldmm);
+ }
+#endif
loop_out:
vma_iter_free(&vmi);
if (!retval) {
@@ -1893,6 +1899,9 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
vm_unacct_memory(charge);
}
__mt_destroy(&mm->mm_mt);
+#ifdef CONFIG_HAVE_UNWIND_USER_SFRAME
+ sframe_free_mm(mm);
+#endif
/*
* The mm_struct is going to exit, but the locks will be dropped
* first. Set the mm_struct as unstable is advisable as it is
--
2.51.0
^ permalink raw reply related
* [PATCH v15 09/20] unwind_user/sframe: Add support for outermost frame indication
From: Jens Remus @ 2026-05-20 15:39 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James
Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
Jose E. Marchesi, Beau Belgrave, Florian Weimer,
Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
Ilya Leoshkevich
In-Reply-To: <20260520154004.3845823-1-jremus@linux.ibm.com>
SFrame may represent an undefined return address (RA) as SFrame FRE
without any offsets as indication for an outermost frame.
Reviewed-by: Indu Bhagat <ibhagatgnu@gmail.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
kernel/unwind/sframe.c | 15 ++++++++++++++-
kernel/unwind/sframe.h | 1 +
2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index a38f50a36363..f723c1a32f90 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -230,7 +230,7 @@ static __always_inline int __read_fre(struct sframe_section *sec,
UNSAFE_GET_USER_INC(info, cur, 1, Efault);
dataword_count = SFRAME_V3_FRE_DATAWORD_COUNT(info);
dataword_size = dataword_size_enum_to_size(SFRAME_V3_FRE_DATAWORD_SIZE(info));
- if (!dataword_count || !dataword_size)
+ if (!dataword_size)
return -EINVAL;
if (cur + (dataword_count * dataword_size) > sec->fres_end)
@@ -240,6 +240,17 @@ static __always_inline int __read_fre(struct sframe_section *sec,
if (fde_type != SFRAME_FDE_TYPE_DEFAULT)
return -EINVAL;
+ if (!dataword_count) {
+ /*
+ * A FRE without data words indicates RA undefined /
+ * outermost frame.
+ */
+ cfa_off = 0;
+ ra_off = 0;
+ fp_off = 0;
+ goto done;
+ }
+
UNSAFE_GET_USER_INC(cfa_off, cur, dataword_size, Efault);
dataword_count--;
@@ -260,6 +271,7 @@ static __always_inline int __read_fre(struct sframe_section *sec,
if (dataword_count)
return -EINVAL;
+done:
fre->size = addr_size + 1 + (dataword_count * dataword_size);
fre->ip_off = ip_off;
fre->cfa_off = cfa_off;
@@ -326,6 +338,7 @@ static __always_inline int __find_fre(struct sframe_section *sec,
frame->ra_off = fre->ra_off;
frame->fp_off = fre->fp_off;
frame->use_fp = SFRAME_V3_FRE_CFA_BASE_REG_ID(fre->info) == SFRAME_BASE_REG_FP;
+ frame->outermost = SFRAME_V3_FRE_RA_UNDEFINED_P(fre->info);
return 0;
}
diff --git a/kernel/unwind/sframe.h b/kernel/unwind/sframe.h
index fc2908e92c7b..ed111fd0d702 100644
--- a/kernel/unwind/sframe.h
+++ b/kernel/unwind/sframe.h
@@ -77,5 +77,6 @@ struct sframe_fda_v3 {
#define SFRAME_V3_FRE_DATAWORD_COUNT(info) (((info) >> 1) & 0xf)
#define SFRAME_V3_FRE_DATAWORD_SIZE(info) (((info) >> 5) & 0x3)
#define SFRAME_V3_AARCH64_FRE_MANGLED_RA_P(info) (((info) >> 7) & 0x1)
+#define SFRAME_V3_FRE_RA_UNDEFINED_P(info) (SFRAME_V3_FRE_DATAWORD_COUNT(info) == 0)
#endif /* _SFRAME_H */
--
2.51.0
^ permalink raw reply related
* [PATCH v15 11/20] unwind_user/sframe: Show file name in debug output
From: Jens Remus @ 2026-05-20 15:39 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James
Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
Jose E. Marchesi, Beau Belgrave, Florian Weimer,
Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
Ilya Leoshkevich, Steven Rostedt (Google)
In-Reply-To: <20260520154004.3845823-1-jremus@linux.ibm.com>
From: Josh Poimboeuf <jpoimboe@kernel.org>
When debugging sframe issues, the error messages aren't all that helpful
without knowing what file a corresponding .sframe section belongs to.
Prefix debug output strings with the file name.
[ Jens Remus: Fix checkpatch error "space prohibited before that close
parenthesis ')'". ]
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Indu Bhagat <ibhagatgnu@gmail.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
Notes (jremus):
Changes in v15:
- Use GFP_KERNEL_ACCOUNT instead of GFP_KERNEL (see
memory-allocation.rst, section "Get Free Page flags"). (Sashiko AI)
Changes in v14:
- Uppercase terms FDE and FRE in debug messages.
include/linux/sframe.h | 4 +++-
kernel/unwind/sframe.c | 23 ++++++++++--------
kernel/unwind/sframe_debug.h | 45 +++++++++++++++++++++++++++++++-----
3 files changed, 56 insertions(+), 16 deletions(-)
diff --git a/include/linux/sframe.h b/include/linux/sframe.h
index 9a72209696f9..b79c5ec09229 100644
--- a/include/linux/sframe.h
+++ b/include/linux/sframe.h
@@ -10,7 +10,9 @@
struct sframe_section {
struct rcu_head rcu;
-
+#ifdef CONFIG_DYNAMIC_DEBUG
+ const char *filename;
+#endif
unsigned long sframe_start;
unsigned long sframe_end;
unsigned long text_start;
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index 02331956009a..f931932bd34a 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -374,14 +374,17 @@ int sframe_find(unsigned long ip, struct unwind_user_frame *frame)
user_read_access_end();
end:
- if (ret && ret != -ENOENT)
+ if (ret && ret != -ENOENT) {
+ dbg_sec("removing bad .sframe section\n");
WARN_ON_ONCE(sframe_remove_section(sec->sframe_start));
+ }
return ret;
}
static void free_section(struct sframe_section *sec)
{
+ dbg_free(sec);
kfree(sec);
}
@@ -401,7 +404,7 @@ static int sframe_read_header(struct sframe_section *sec)
BUILD_BUG_ON(!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS));
if (copy_from_user(&shdr, (void __user *)sec->sframe_start, sizeof(shdr))) {
- dbg("header usercopy failed\n");
+ dbg_sec("header usercopy failed\n");
return -EFAULT;
}
@@ -410,18 +413,18 @@ static int sframe_read_header(struct sframe_section *sec)
!(shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
!(shdr.preamble.flags & SFRAME_F_FDE_FUNC_START_PCREL) ||
shdr.auxhdr_len) {
- dbg("bad/unsupported sframe header\n");
+ dbg_sec("bad/unsupported sframe header\n");
return -EINVAL;
}
if (!shdr.num_fdes || !shdr.num_fres) {
- dbg("no fde/fre entries\n");
+ dbg_sec("no FDE/FRE entries\n");
return -EINVAL;
}
header_end = sec->sframe_start + SFRAME_HEADER_SIZE(shdr);
if (header_end >= sec->sframe_end) {
- dbg("header doesn't fit in section\n");
+ dbg_sec("header doesn't fit in section\n");
return -EINVAL;
}
@@ -433,7 +436,7 @@ static int sframe_read_header(struct sframe_section *sec)
fres_end = fres_start + shdr.fre_len;
if (fres_start < fdes_end || fres_end > sec->sframe_end) {
- dbg("inconsistent fde/fre offsets\n");
+ dbg_sec("inconsistent FDE/FRE offsets\n");
return -EINVAL;
}
@@ -489,6 +492,8 @@ int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
sec->text_start = text_start;
sec->text_end = text_end;
+ dbg_init(sec);
+
ret = sframe_read_header(sec);
if (ret) {
dbg_print_header(sec);
@@ -498,8 +503,8 @@ int sframe_add_section(unsigned long sframe_start, unsigned long sframe_end,
ret = mtree_insert_range(sframe_mt, sec->text_start, sec->text_end - 1,
sec, GFP_KERNEL_ACCOUNT);
if (ret) {
- dbg("mtree_insert_range failed: text=%lx-%lx\n",
- sec->text_start, sec->text_end);
+ dbg_sec("mtree_insert_range failed: text=%lx-%lx\n",
+ sec->text_start, sec->text_end);
goto err_free;
}
@@ -521,7 +526,7 @@ static int __sframe_remove_section(struct mm_struct *mm,
struct sframe_section *sec)
{
if (!mtree_erase(&mm->sframe_mt, sec->text_start)) {
- dbg("mtree_erase failed: text=%lx\n", sec->text_start);
+ dbg_sec("mtree_erase failed: text=%lx\n", sec->text_start);
return -EINVAL;
}
diff --git a/kernel/unwind/sframe_debug.h b/kernel/unwind/sframe_debug.h
index 36352124cde8..a63e75cccc70 100644
--- a/kernel/unwind/sframe_debug.h
+++ b/kernel/unwind/sframe_debug.h
@@ -10,26 +10,59 @@
#define dbg(fmt, ...) \
pr_debug("%s (%d): " fmt, current->comm, current->pid, ##__VA_ARGS__)
+#define dbg_sec(fmt, ...) \
+ dbg("%s: " fmt, sec->filename, ##__VA_ARGS__)
+
static __always_inline void dbg_print_header(struct sframe_section *sec)
{
unsigned long fdes_end;
fdes_end = sec->fdes_start + (sec->num_fdes * sizeof(struct sframe_fde_v3));
- dbg("SEC: sframe:0x%lx-0x%lx text:0x%lx-0x%lx "
- "fdes:0x%lx-0x%lx fres:0x%lx-0x%lx "
- "ra_off:%d fp_off:%d\n",
- sec->sframe_start, sec->sframe_end, sec->text_start, sec->text_end,
- sec->fdes_start, fdes_end, sec->fres_start, sec->fres_end,
- sec->ra_off, sec->fp_off);
+ dbg_sec("SEC: sframe:0x%lx-0x%lx text:0x%lx-0x%lx "
+ "fdes:0x%lx-0x%lx fres:0x%lx-0x%lx "
+ "ra_off:%d fp_off:%d\n",
+ sec->sframe_start, sec->sframe_end, sec->text_start, sec->text_end,
+ sec->fdes_start, fdes_end, sec->fres_start, sec->fres_end,
+ sec->ra_off, sec->fp_off);
+}
+
+static inline void dbg_init(struct sframe_section *sec)
+{
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+
+ guard(mmap_read_lock)(mm);
+ vma = vma_lookup(mm, sec->sframe_start);
+ if (!vma)
+ sec->filename = kstrdup("(vma gone???)", GFP_KERNEL_ACCOUNT);
+ else if (vma->vm_file)
+ sec->filename = kstrdup_quotable_file(vma->vm_file, GFP_KERNEL_ACCOUNT);
+ else if (vma->vm_ops && vma->vm_ops->name)
+ sec->filename = kstrdup(vma->vm_ops->name(vma), GFP_KERNEL_ACCOUNT);
+ else if (arch_vma_name(vma))
+ sec->filename = kstrdup(arch_vma_name(vma), GFP_KERNEL_ACCOUNT);
+ else if (!vma->vm_mm)
+ sec->filename = kstrdup("(vdso)", GFP_KERNEL_ACCOUNT);
+ else
+ sec->filename = kstrdup("(anonymous)", GFP_KERNEL_ACCOUNT);
+}
+
+static inline void dbg_free(struct sframe_section *sec)
+{
+ kfree(sec->filename);
}
#else /* !CONFIG_DYNAMIC_DEBUG */
#define dbg(args...) no_printk(args)
+#define dbg_sec(args...) no_printk(args)
static inline void dbg_print_header(struct sframe_section *sec) {}
+static inline void dbg_init(struct sframe_section *sec) {}
+static inline void dbg_free(struct sframe_section *sec) {}
+
#endif /* !CONFIG_DYNAMIC_DEBUG */
#endif /* _SFRAME_DEBUG_H */
--
2.51.0
^ permalink raw reply related
* [PATCH v15 06/20] unwind_user/sframe: Detect .sframe sections in executables
From: Jens Remus @ 2026-05-20 15:39 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel, x86, Steven Rostedt,
Josh Poimboeuf, Indu Bhagat, Peter Zijlstra, Dylan Hatch,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Mathieu Desnoyers, Kees Cook, Sam James
Cc: Jens Remus, bpf, linux-mm, Namhyung Kim, Andrii Nakryiko,
Jose E. Marchesi, Beau Belgrave, Florian Weimer,
Carlos O'Donell, Masami Hiramatsu, Jiri Olsa,
Arnaldo Carvalho de Melo, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Heiko Carstens, Vasily Gorbik,
Ilya Leoshkevich, Steven Rostedt (Google)
In-Reply-To: <20260520154004.3845823-1-jremus@linux.ibm.com>
From: Josh Poimboeuf <jpoimboe@kernel.org>
When loading an ELF executable, automatically detect an .sframe section
and associate it with the mm_struct.
[ Jens Remus: Fix checkpatch warning "braces {} are not necessary for
single statement blocks". ]
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Indu Bhagat <ibhagatgnu@gmail.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
---
Notes (jremus):
Changes in v15:
- Only add sframe for text that is PT_LOAD in addition to PF_X.
(Sashiko AI)
fs/binfmt_elf.c | 48 +++++++++++++++++++++++++++++++++++++---
include/uapi/linux/elf.h | 1 +
2 files changed, 46 insertions(+), 3 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 16a56b6b3f6c..980a9f229cd1 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -48,6 +48,7 @@
#include <linux/uaccess.h>
#include <uapi/linux/rseq.h>
#include <linux/rseq.h>
+#include <linux/sframe.h>
#include <asm/param.h>
#include <asm/page.h>
@@ -637,6 +638,21 @@ static inline int make_prot(u32 p_flags, struct arch_elf_state *arch_state,
return arch_elf_adjust_prot(prot, arch_state, has_interp, is_interp);
}
+static void elf_add_sframe(struct elf_phdr *text, struct elf_phdr *sframe,
+ unsigned long base_addr)
+{
+ unsigned long sframe_start, sframe_end, text_start, text_end;
+
+ sframe_start = base_addr + sframe->p_vaddr;
+ sframe_end = sframe_start + sframe->p_memsz;
+
+ text_start = base_addr + text->p_vaddr;
+ text_end = text_start + text->p_memsz;
+
+ /* Ignore return value, sframe section isn't critical */
+ sframe_add_section(sframe_start, sframe_end, text_start, text_end);
+}
+
/* This is much more generalized than the library routine read function,
so we keep this separate. Technically the library read function
is only provided so that we can read a.out libraries that have
@@ -647,7 +663,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
unsigned long no_base, struct elf_phdr *interp_elf_phdata,
struct arch_elf_state *arch_state)
{
- struct elf_phdr *eppnt;
+ struct elf_phdr *eppnt, *sframe_phdr = NULL;
unsigned long load_addr = 0;
int load_addr_set = 0;
unsigned long error = ~0UL;
@@ -673,7 +689,8 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
eppnt = interp_elf_phdata;
for (i = 0; i < interp_elf_ex->e_phnum; i++, eppnt++) {
- if (eppnt->p_type == PT_LOAD) {
+ switch (eppnt->p_type) {
+ case PT_LOAD: {
int elf_type = MAP_PRIVATE;
int elf_prot = make_prot(eppnt->p_flags, arch_state,
true, true);
@@ -712,6 +729,19 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
error = -ENOMEM;
goto out;
}
+ break;
+ }
+ case PT_GNU_SFRAME:
+ sframe_phdr = eppnt;
+ break;
+ }
+ }
+
+ if (sframe_phdr) {
+ eppnt = interp_elf_phdata;
+ for (i = 0; i < interp_elf_ex->e_phnum; i++, eppnt++) {
+ if (eppnt->p_flags & PF_X && eppnt->p_type == PT_LOAD)
+ elf_add_sframe(eppnt, sframe_phdr, load_addr);
}
}
@@ -836,7 +866,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
int first_pt_load = 1;
unsigned long error;
struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
- struct elf_phdr *elf_property_phdata = NULL;
+ struct elf_phdr *elf_property_phdata = NULL, *sframe_phdr = NULL;
unsigned long elf_brk;
bool brk_moved = false;
int retval, i;
@@ -945,6 +975,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
executable_stack = EXSTACK_DISABLE_X;
break;
+ case PT_GNU_SFRAME:
+ sframe_phdr = elf_ppnt;
+ break;
+
case PT_LOPROC ... PT_HIPROC:
retval = arch_elf_pt_proc(elf_ex, elf_ppnt,
bprm->file, false,
@@ -1242,6 +1276,14 @@ static int load_elf_binary(struct linux_binprm *bprm)
elf_brk = k;
}
+ if (sframe_phdr) {
+ for (i = 0, elf_ppnt = elf_phdata;
+ i < elf_ex->e_phnum; i++, elf_ppnt++) {
+ if (elf_ppnt->p_flags & PF_X && elf_ppnt->p_type == PT_LOAD)
+ elf_add_sframe(elf_ppnt, sframe_phdr, load_bias);
+ }
+ }
+
e_entry = elf_ex->e_entry + load_bias;
phdr_addr += load_bias;
elf_brk += load_bias;
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index ee30dcd80901..e2a7dbed2e80 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -41,6 +41,7 @@ typedef __u16 Elf64_Versym;
#define PT_GNU_STACK (PT_LOOS + 0x474e551)
#define PT_GNU_RELRO (PT_LOOS + 0x474e552)
#define PT_GNU_PROPERTY (PT_LOOS + 0x474e553)
+#define PT_GNU_SFRAME (PT_LOOS + 0x474e554)
/* ARM MTE memory tag segment type */
--
2.51.0
^ permalink raw reply related
* Re: [PATCH v6 14/43] KVM: guest_memfd: Advertise KVM_SET_MEMORY_ATTRIBUTES2 ioctl
From: Fuad Tabba @ 2026-05-20 15:22 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-14-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Introduce KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES to advertise the
> availability of the KVM_SET_MEMORY_ATTRIBUTES2 ioctl.
>
> KVM_SET_MEMORY_ATTRIBUTES2 is a guest_memfd-scoped version of the existing
> KVM_SET_MEMORY_ATTRIBUTES VM ioctl. It allows userspace to manage memory
> attributes, such as KVM_MEMORY_ATTRIBUTE_PRIVATE, directly on a guest_memfd
> file descriptor.
>
> This new version uses struct kvm_memory_attributes2, which adds an
> error_offset field to the output. This allows KVM to return the specific
> offset that triggered an error, which is especially useful for handling
> EAGAIN results caused by transient page reference counts during attribute
> conversions.
>
> Update the KVM API documentation to define the new ioctl and its behavior,
> and add the necessary UAPI definitions and capability checks.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Suggested-by: Michael Roth <michael.roth@amd.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> Documentation/virt/kvm/api.rst | 78 +++++++++++++++++++++++++++++++++++++++++-
> include/uapi/linux/kvm.h | 2 ++
> virt/kvm/kvm_main.c | 5 +++
> 3 files changed, 84 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 52bbbb553ce10..55c2701d9ed49 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -117,7 +117,7 @@ description:
> x86 includes both i386 and x86_64.
>
> Type:
> - system, vm, or vcpu.
> + system, vm, vcpu or guest_memfd.
>
> Parameters:
> what parameters are accepted by the ioctl.
> @@ -6361,6 +6361,8 @@ S390:
> Returns -EINVAL if the VM has the KVM_VM_S390_UCONTROL flag set.
> Returns -EINVAL if called on a protected VM.
>
> +.. _KVM_SET_MEMORY_ATTRIBUTES:
> +
> 4.141 KVM_SET_MEMORY_ATTRIBUTES
> -------------------------------
>
> @@ -6553,6 +6555,80 @@ KVM_S390_KEYOP_SSKE
> Sets the storage key for the guest address ``guest_addr`` to the key
> specified in ``key``, returning the previous value in ``key``.
>
> +4.145 KVM_SET_MEMORY_ATTRIBUTES2
> +---------------------------------
> +
> +:Capability: KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES
> +:Architectures: all
> +:Type: guest_memfd ioctl
> +:Parameters: struct kvm_memory_attributes2 (in/out)
> +:Returns: 0 on success, <0 on error
> +
> +Errors:
> +
> + ========== ===============================================================
> + EINVAL The specified `offset` or `size` were invalid (e.g. not
> + page aligned, causes an overflow, or size is zero).
> + EFAULT The parameter address was invalid.
> + EAGAIN Some page within requested range had unexpected refcounts. The
> + offset of the page will be returned in `error_offset`.
> + ENOMEM Ran out of memory trying to track private/shared state
> + ========== ===============================================================
> +
> +KVM_SET_MEMORY_ATTRIBUTES2 is an extension to
> +KVM_SET_MEMORY_ATTRIBUTES that supports returning (writing) values to
> +userspace. The original (pre-extension) fields are shared with
> +KVM_SET_MEMORY_ATTRIBUTES identically.
> +
> +Attribute values are shared with KVM_SET_MEMORY_ATTRIBUTES.
> +
> +::
> +
> + struct kvm_memory_attributes2 {
> + /* in */
> + union {
> + __u64 address;
> + __u64 offset;
> + };
> + __u64 size;
> + __u64 attributes;
> + __u64 flags;
> + /* out */
> + __u64 error_offset;
> + __u64 reserved[11];
> + };
> +
> + #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
> +
> +Set attributes for a range of offsets within a guest_memfd to
> +KVM_MEMORY_ATTRIBUTE_PRIVATE to limit the specified guest_memfd backed
> +memory range for guest_use. Even if KVM_CAP_GUEST_MEMFD_MMAP is
> +supported, after a successful call to set
> +KVM_MEMORY_ATTRIBUTE_PRIVATE, the requested range will not be mappable
> +into host userspace and will only be mappable by the guest.
> +
> +To allow the range to be mappable into host userspace again, call
> +KVM_SET_MEMORY_ATTRIBUTES2 on the guest_memfd again with
> +KVM_MEMORY_ATTRIBUTE_PRIVATE unset.
> +
> +KVM does not directly manipulate the memory contents of pages during
> +attribute updates. However, the process of setting these attributes,
> +which includes operations such as unmapping pages from the host or
> +stage-2 page tables, may result in side effects on memory contents
> +that vary across different trusted firmware implementations.
> +
> +If this ioctl returns -EAGAIN, the offset of the page with unexpected
> +refcounts will be returned in `error_offset`. This can occur if there
> +are transient refcounts on the pages, taken by other parts of the
> +kernel.
> +
> +Userspace is expected to figure out how to remove all known refcounts
> +on the shared pages, such as refcounts taken by get_user_pages(), and
> +try the ioctl again. A possible source of these long term refcounts is
> +if the guest_memfd memory was pinned in IOMMU page tables.
> +
> +See also: :ref: `KVM_SET_MEMORY_ATTRIBUTES`.
> +
> .. _kvm_run:
>
> 5. The kvm_run structure
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 0b55258573d3d..f437fd0f1350c 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -996,6 +996,7 @@ struct kvm_enable_cap {
> #define KVM_CAP_S390_USER_OPEREXEC 246
> #define KVM_CAP_S390_KEYOP 247
> #define KVM_CAP_S390_VSIE_ESAMODE 248
> +#define KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES 249
>
> struct kvm_irq_routing_irqchip {
> __u32 irqchip;
> @@ -1648,6 +1649,7 @@ struct kvm_memory_attributes {
> __u64 flags;
> };
>
> +/* Available with KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES */
> #define KVM_SET_MEMORY_ATTRIBUTES2 _IOWR(KVMIO, 0xd2, struct kvm_memory_attributes2)
>
> struct kvm_memory_attributes2 {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4d7bf52b7b717..cec02d68d7039 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4972,6 +4972,11 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> return 1;
> case KVM_CAP_GUEST_MEMFD_FLAGS:
> return kvm_gmem_get_supported_flags(kvm);
> + case KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES:
> + if (vm_memory_attributes)
> + return 0;
> +
> + return kvm_supported_mem_attributes(kvm);
> #endif
> default:
> break;
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH v6 13/43] KVM: guest_memfd: Return early if range already has requested attributes
From: Fuad Tabba @ 2026-05-20 14:44 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-13-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Extract a helper out of kvm_gmem_range_is_private() that checks that a
> range has given attributes.
>
> Optimize setting memory attributes by returning early if all pages in the
> requested range already has the requested attributes.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> virt/kvm/guest_memfd.c | 33 +++++++++++++++++++++++----------
> 1 file changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index baf4b88dead1f..034b72b4947fb 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -86,6 +86,23 @@ static bool kvm_gmem_is_shared_mem(struct inode *inode, pgoff_t index)
> return !kvm_gmem_is_private_mem(inode, index);
> }
>
> +static bool kvm_gmem_range_has_attributes(struct maple_tree *mt,
> + pgoff_t index, size_t nr_pages,
> + u64 attributes)
> +{
> + pgoff_t end = index + nr_pages - 1;
> + void *entry;
> +
> + lockdep_assert(mt_lock_is_held(mt));
> +
> + mt_for_each(mt, entry, index, end) {
> + if (xa_to_value(entry) != attributes)
> + return false;
> + }
> +
> + return true;
> +}
> +
> static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
> pgoff_t index, struct folio *folio)
> {
> @@ -649,12 +666,15 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> pgoff_t end = start + nr_pages;
> struct maple_tree *mt;
> struct ma_state mas;
> - int r;
> + int r = 0;
>
> mt = &gi->attributes;
>
> filemap_invalidate_lock(mapping);
>
> + if (kvm_gmem_range_has_attributes(mt, start, nr_pages, attrs))
> + goto out;
> +
> mas_init(&mas, mt, start);
> r = kvm_gmem_mas_preallocate(&mas, attrs, start, nr_pages);
> if (r) {
> @@ -1140,20 +1160,13 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
> static bool kvm_gmem_range_is_private(struct gmem_inode *gi, pgoff_t index,
> size_t nr_pages, struct kvm *kvm, gfn_t gfn)
> {
> - pgoff_t end = index + nr_pages - 1;
> - void *entry;
> -
> if (vm_memory_attributes)
> return kvm_range_has_vm_memory_attributes(kvm, gfn, gfn + nr_pages,
> KVM_MEMORY_ATTRIBUTE_PRIVATE,
> KVM_MEMORY_ATTRIBUTE_PRIVATE);
>
> - mt_for_each(&gi->attributes, entry, index, end) {
> - if (xa_to_value(entry) != KVM_MEMORY_ATTRIBUTE_PRIVATE)
> - return false;
> - }
> -
> - return true;
> + return kvm_gmem_range_has_attributes(&gi->attributes, index, nr_pages,
> + KVM_MEMORY_ATTRIBUTE_PRIVATE);
> }
>
> static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH mm-unstable v17 03/14] mm/khugepaged: rework max_ptes_* handling with helper functions
From: David Hildenbrand (Arm) @ 2026-05-20 14:43 UTC (permalink / raw)
To: Nico Pache
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, Usama Arif
In-Reply-To: <CAA1CXcCD5ooRJonAVp2LvnoCrQwcs1-NsAYomXbHTVNSe5X0cw@mail.gmail.com>
>> Calculate maximum allowed empty PTEs or PTEs mapping the shared zeropage ... ?
>>
>>> + * PTEs for the given collapse operation.
>>
>> We usually indent here (second line of subject), I think. Same applies to the
>> other doc below.
>
> Hmm tbh I couldn't find a example of what you meant here. There are
> some that put a space between the first sentence and the @ list.
Yeah, we usually try to make it fit in a single line.
But nevermind, leave it as is.
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH v6 12/43] KVM: guest_memfd: Call arch invalidate hooks on conversion
From: Fuad Tabba @ 2026-05-20 14:30 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-12-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> When memory in guest_memfd is converted from private to shared, the
> platform-specific state associated with the guest-private pages must be
> invalidated or cleaned up.
>
> Iterate over the folios in the affected range and call the
> kvm_arch_gmem_invalidate() hook for each PFN range. This allows
> architectures to perform necessary teardown, such as updating hardware
> metadata or encryption states, before the pages are transitioned to the
> shared state.
>
> Invoke this helper after indicating to KVM's mmu code that an invalidation
> is in progress to stop in-flight page faults from succeeding.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Minor nit below, but lgtm.
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> virt/kvm/guest_memfd.c | 41 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 41 insertions(+)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 9d82642a025e9..baf4b88dead1f 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -603,6 +603,42 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> return safe;
> }
>
> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
> +{
> + struct folio_batch fbatch;
> + pgoff_t next = start;
> + int i;
> +
> + folio_batch_init(&fbatch);
> + while (filemap_get_folios(inode->i_mapping, &next, end - 1, &fbatch)) {
> + for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> + struct folio *folio = fbatch.folios[i];
> + pgoff_t start_index, end_index;
> + kvm_pfn_t start_pfn, end_pfn;
> +
> + start_index = max(start, folio->index);
> + end_index = min(end, folio_next_index(folio));
> + /*
> + * end_index is either in folio or points to
> + * the first page of the next folio. Hence,
> + * all pages in range [start_index, end_index)
> + * are contiguous.
> + */
> + start_pfn = folio_file_pfn(folio, start_index);
> + end_pfn = start_pfn + end_index - start_index;
> +
> + kvm_arch_gmem_invalidate(start_pfn, end_pfn);
> + }
> +
> + folio_batch_release(&fbatch);
> + cond_resched();
> + }
> +}
> +#else
> +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
> +#endif
> +
> static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> size_t nr_pages, uint64_t attrs,
> pgoff_t *err_index)
> @@ -643,7 +679,12 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> */
>
> kvm_gmem_invalidate_begin(inode, start, end);
> +
> + if (!to_private)
> + kvm_gmem_invalidate(inode, start, end);
> +
> mas_store_prealloc(&mas, xa_mk_value(attrs));
> +
Why the unrelated extra space?
> kvm_gmem_invalidate_end(inode, start, end);
> out:
> filemap_invalidate_unlock(mapping);
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH v6 11/43] KVM: guest_memfd: Ensure pages are not in use before conversion
From: Fuad Tabba @ 2026-05-20 14:28 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-11-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> When converting memory to private in guest_memfd, it is necessary to ensure
> that the pages are not currently being accessed by any other part of the
> kernel or userspace to avoid any current user writing to guest private
> memory.
>
> guest_memfd checks for unexpected refcounts to determine whether a page is
> still in use. The only expected refcounts after unmapping the range
> requested for conversion are those that are held by guest_memfd itself.
>
> Update the kvm_memory_attributes2 structure to include an error_offset
> field. This allows KVM to report the exact offset where a conversion
> failed to userspace. If the safety check fails, return -EAGAIN and copy
> the error_offset back to userspace so that it can potentially retry the
> operation or handle the failure gracefully.
>
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> include/uapi/linux/kvm.h | 3 ++-
> virt/kvm/guest_memfd.c | 65 ++++++++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 62 insertions(+), 6 deletions(-)
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index e6bbf68a83813..0b55258573d3d 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1658,7 +1658,8 @@ struct kvm_memory_attributes2 {
> __u64 size;
> __u64 attributes;
> __u64 flags;
> - __u64 reserved[12];
> + __u64 error_offset;
> + __u64 reserved[11];
> };
>
> #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 91e89b188f583..9d82642a025e9 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -572,9 +572,42 @@ static int kvm_gmem_mas_preallocate(struct ma_state *mas, u64 attributes,
> return mas_preallocate(mas, xa_mk_value(attributes), GFP_KERNEL);
> }
>
> +static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> + size_t nr_pages, pgoff_t *err_index)
> +{
> + struct address_space *mapping = inode->i_mapping;
> + const int filemap_get_folios_refcount = 1;
> + pgoff_t last = start + nr_pages - 1;
> + struct folio_batch fbatch;
> + bool safe = true;
> + int i;
> +
> + folio_batch_init(&fbatch);
> + while (safe && filemap_get_folios(mapping, &start, last, &fbatch)) {
> +
> + for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> + struct folio *folio = fbatch.folios[i];
> +
> + if (folio_ref_count(folio) !=
> + folio_nr_pages(folio) + filemap_get_folios_refcount) {
> + safe = false;
> + *err_index = folio->index;
> + break;
> + }
> + }
> +
> + folio_batch_release(&fbatch);
> + cond_resched();
> + }
> +
> + return safe;
> +}
> +
> static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> - size_t nr_pages, uint64_t attrs)
> + size_t nr_pages, uint64_t attrs,
> + pgoff_t *err_index)
> {
> + bool to_private = attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> struct address_space *mapping = inode->i_mapping;
> struct gmem_inode *gi = GMEM_I(inode);
> pgoff_t end = start + nr_pages;
> @@ -588,8 +621,21 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>
> mas_init(&mas, mt, start);
> r = kvm_gmem_mas_preallocate(&mas, attrs, start, nr_pages);
> - if (r)
> + if (r) {
> + *err_index = start;
> goto out;
> + }
> +
> + if (to_private) {
> + unmap_mapping_pages(mapping, start, nr_pages, false);
> +
> + if (!kvm_gmem_is_safe_for_conversion(inode, start, nr_pages,
> + err_index)) {
> + mas_destroy(&mas);
> + r = -EAGAIN;
> + goto out;
> + }
> + }
>
> /*
> * From this point on guest_memfd has performed necessary
> @@ -609,9 +655,10 @@ static long kvm_gmem_set_attributes(struct file *file, void __user *argp)
> struct gmem_file *f = file->private_data;
> struct inode *inode = file_inode(file);
> struct kvm_memory_attributes2 attrs;
> + pgoff_t err_index;
> size_t nr_pages;
> pgoff_t index;
> - int i;
> + int i, r;
>
> if (copy_from_user(&attrs, argp, sizeof(attrs)))
> return -EFAULT;
> @@ -635,8 +682,16 @@ static long kvm_gmem_set_attributes(struct file *file, void __user *argp)
>
> nr_pages = attrs.size >> PAGE_SHIFT;
> index = attrs.offset >> PAGE_SHIFT;
> - return __kvm_gmem_set_attributes(inode, index, nr_pages,
> - attrs.attributes);
> + r = __kvm_gmem_set_attributes(inode, index, nr_pages, attrs.attributes,
> + &err_index);
> + if (r) {
> + attrs.error_offset = ((uint64_t)err_index) << PAGE_SHIFT;
> +
> + if (copy_to_user(argp, &attrs, sizeof(attrs)))
> + return -EFAULT;
> + }
> +
> + return r;
> }
>
> static long kvm_gmem_ioctl(struct file *file, unsigned int ioctl,
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH v6 06/43] KVM: x86/mmu: Bug the VM if gmem attributes are queried to determine max mapping level
From: Sean Christopherson @ 2026-05-20 14:21 UTC (permalink / raw)
To: Fuad Tabba
Cc: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CA+EHjTxvLU4XDPXDXYXXWJES1OFQgN8VTRLMgCCNMwBE6Hk8tQ@mail.gmail.com>
On Wed, May 20, 2026, Fuad Tabba wrote:
> On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
> <devnull+ackerleytng.google.com@kernel.org> wrote:
> >
> > From: Ackerley Tng <ackerleytng@google.com>
> >
> > When the maximum mapping level is queried, KVM's MMU lock is held, and
> > while the MMU lock is held, guest_memfd cannot take the
> > filemap_invalidate_lock() to look up the current shared/private state of
> > the gfn, for these reasons:
> >
> > + The MMU lock is a spinlock or rwlock and cannot be held while taking a
> > lock that can sleep.
> > + In guest_memfd's code paths (such as truncate), the
> > filemap_invalidate_lock() is held while taking the MMU lock, and taking
> > the locks in reverse order would introduce a AB-BA deadlock.
> >
> > Currently, the maximum mapping level is only queried from guest_memfd in
> > the process of recovering huge pages, if dirty logging is disabled on a
> > memslot. Dirty logging is not currently supported for guest_memfd, and
> > guest_memfd memslots also cannot be updated.
> >
> > For now, bug the VM if guest_memfd needs to be queried to determine the
> > maximum mapping level. This guard can be removed if/when support is added.
> >
> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 9 +++++++++
> > 1 file changed, 9 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index a80a876ab4ad6..153bcc5369985 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3357,6 +3357,15 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
> > max_level = fault->max_level;
> > is_private = fault->is_private;
> > } else {
> > + /*
> > + * Memory attributes cannot be obtained from guest_memfd while
> > + * the MMU lock is held.
> > + */
> > + if (KVM_BUG_ON(static_call_query(__kvm_get_memory_attributes) ==
> > + kvm_gmem_get_memory_attributes, kvm)) {
> > + return 0;
> > + }
> > +
>
> This directly takes the address of kvm_gmem_get_memory_attributes,
> which is only compiled if CONFIG_KVM_GUEST_MEMFD=y. This breaks
> ARCH=i386.
And this bleeds guest_memfd implementation details into places they don't belong.
The right way to deal with this is to use lockdep_assert_not_held() in whatever
code mustn't run with mmu_lock held. E.g.
diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
index c9f155c2dc5c..3bea9c1137ef 100644
--- virt/kvm/guest_memfd.c
+++ virt/kvm/guest_memfd.c
@@ -547,6 +547,9 @@ unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
struct inode *inode;
+ /* Comment goes here. */
+ lockdep_assert_not_held(&kvm->mmu_lock);
+
/*
* If this gfn has no associated memslot, there's no chance of the gfn
* being backed by private memory, since guest_memfd must be used for
But I'm confused, because kvm_gmem_get_memory_attributes() doesn't actually take
filemap_invalidate_lock(), so what exactly is the problem?
> > max_level = PG_LEVEL_NUM;
> > is_private = kvm_mem_is_private(kvm, gfn);
> > }
> >
> > --
> > 2.54.0.563.g4f69b47b94-goog
> >
> >
^ permalink raw reply related
* Re: [PATCH v6 10/43] KVM: guest_memfd: Add base support for KVM_SET_MEMORY_ATTRIBUTES2
From: Fuad Tabba @ 2026-05-20 14:00 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-10-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Introduce base support for KVM_SET_MEMORY_ATTRIBUTES2 in guest_memfd, which
> just updates attributes tracked by guest_memfd.
>
> Validate input fields in general. Guard usage of KVM_SET_MEMORY_ATTRIBUTES2
> by making sure requested attributes are supported for this instance of kvm.
>
> A new KVM_SET_MEMORY_ATTRIBUTES2 is defined to support writes (unlike
> KVM_SET_MEMORY_ATTRIBUTES) in addition to reads so it can provide error
> details to userspace. This will be used in a later patch.
>
> The two ioctls use their corresponding structs with no overlap, but
> backward compatibility is baked in for future support of
> KVM_SET_MEMORY_ATTRIBUTES2 and struct kvm_memory_attributes2 in the VM
> ioctl.
>
> The process of setting memory attributes is set up such that the later half
> will not fail due to allocation. Any necessary checks are performed before
> the point of no return.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> Co-developed-by: Sean Christoperson <seanjc@google.com>
> Signed-off-by: Sean Christoperson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
/fuad
> ---
> include/uapi/linux/kvm.h | 13 ++++++
> virt/kvm/Kconfig | 1 +
> virt/kvm/guest_memfd.c | 114 +++++++++++++++++++++++++++++++++++++++++++++++
> virt/kvm/kvm_main.c | 12 +++++
> 4 files changed, 140 insertions(+)
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6c8afa2047bf3..e6bbf68a83813 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1648,6 +1648,19 @@ struct kvm_memory_attributes {
> __u64 flags;
> };
>
> +#define KVM_SET_MEMORY_ATTRIBUTES2 _IOWR(KVMIO, 0xd2, struct kvm_memory_attributes2)
> +
> +struct kvm_memory_attributes2 {
> + union {
> + __u64 address;
> + __u64 offset;
> + };
> + __u64 size;
> + __u64 attributes;
> + __u64 flags;
> + __u64 reserved[12];
> +};
> +
> #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
>
> #define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 3fea89c45cfb4..e371e079e2c50 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -109,6 +109,7 @@ config KVM_VM_MEMORY_ATTRIBUTES
>
> config KVM_GUEST_MEMFD
> select XARRAY_MULTI
> + select KVM_MEMORY_ATTRIBUTES
> bool
>
> config HAVE_KVM_ARCH_GMEM_PREPARE
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 4f7c4824c3a45..91e89b188f583 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -540,11 +540,125 @@ unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> }
> EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_memory_attributes);
>
> +/*
> + * Preallocate memory for attributes to be stored on a maple tree, pointed to
> + * by mas. Adjacent ranges with attributes identical to the new attributes
> + * will be merged. Also sets mas's bounds up for storing attributes.
> + *
> + * This maintains the invariant that ranges with the same attributes will
> + * always be merged.
> + */
> +static int kvm_gmem_mas_preallocate(struct ma_state *mas, u64 attributes,
> + pgoff_t start, size_t nr_pages)
> +{
> + pgoff_t end = start + nr_pages;
> + pgoff_t last = end - 1;
> + void *entry;
> +
> + /* Try extending range. entry is NULL on overflow/wrap-around. */
> + mas_set(mas, end);
> + entry = mas_find(mas, end);
> + if (entry && xa_to_value(entry) == attributes)
> + last = mas->last;
> +
> + if (start > 0) {
> + mas_set(mas, start - 1);
> + entry = mas_find(mas, start - 1);
> + if (entry && xa_to_value(entry) == attributes)
> + start = mas->index;
> + }
> +
> + mas_set_range(mas, start, last);
> + return mas_preallocate(mas, xa_mk_value(attributes), GFP_KERNEL);
> +}
> +
> +static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> + size_t nr_pages, uint64_t attrs)
> +{
> + struct address_space *mapping = inode->i_mapping;
> + struct gmem_inode *gi = GMEM_I(inode);
> + pgoff_t end = start + nr_pages;
> + struct maple_tree *mt;
> + struct ma_state mas;
> + int r;
> +
> + mt = &gi->attributes;
> +
> + filemap_invalidate_lock(mapping);
> +
> + mas_init(&mas, mt, start);
> + r = kvm_gmem_mas_preallocate(&mas, attrs, start, nr_pages);
> + if (r)
> + goto out;
> +
> + /*
> + * From this point on guest_memfd has performed necessary
> + * checks and can proceed to do guest-breaking changes.
> + */
> +
> + kvm_gmem_invalidate_begin(inode, start, end);
> + mas_store_prealloc(&mas, xa_mk_value(attrs));
> + kvm_gmem_invalidate_end(inode, start, end);
> +out:
> + filemap_invalidate_unlock(mapping);
> + return r;
> +}
> +
> +static long kvm_gmem_set_attributes(struct file *file, void __user *argp)
> +{
> + struct gmem_file *f = file->private_data;
> + struct inode *inode = file_inode(file);
> + struct kvm_memory_attributes2 attrs;
> + size_t nr_pages;
> + pgoff_t index;
> + int i;
> +
> + if (copy_from_user(&attrs, argp, sizeof(attrs)))
> + return -EFAULT;
> +
> + if (attrs.flags)
> + return -EINVAL;
> + for (i = 0; i < ARRAY_SIZE(attrs.reserved); i++) {
> + if (attrs.reserved[i])
> + return -EINVAL;
> + }
> + if (attrs.attributes & ~kvm_supported_mem_attributes(f->kvm))
> + return -EINVAL;
> + if (attrs.size == 0 || attrs.offset + attrs.size < attrs.offset)
> + return -EINVAL;
> + if (!PAGE_ALIGNED(attrs.offset) || !PAGE_ALIGNED(attrs.size))
> + return -EINVAL;
> +
> + if (attrs.offset >= i_size_read(inode) ||
> + attrs.offset + attrs.size > i_size_read(inode))
> + return -EINVAL;
> +
> + nr_pages = attrs.size >> PAGE_SHIFT;
> + index = attrs.offset >> PAGE_SHIFT;
> + return __kvm_gmem_set_attributes(inode, index, nr_pages,
> + attrs.attributes);
> +}
> +
> +static long kvm_gmem_ioctl(struct file *file, unsigned int ioctl,
> + unsigned long arg)
> +{
> + switch (ioctl) {
> + case KVM_SET_MEMORY_ATTRIBUTES2:
> + if (vm_memory_attributes)
> + return -ENOTTY;
> +
> + return kvm_gmem_set_attributes(file, (void __user *)arg);
> + default:
> + return -ENOTTY;
> + }
> +}
> +
> static struct file_operations kvm_gmem_fops = {
> .mmap = kvm_gmem_mmap,
> .open = generic_file_open,
> .release = kvm_gmem_release,
> .fallocate = kvm_gmem_fallocate,
> + .unlocked_ioctl = kvm_gmem_ioctl,
> };
>
> static int kvm_gmem_migrate_folio(struct address_space *mapping,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ff20e63143642..4d7bf52b7b717 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -110,6 +110,18 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
> EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_TRAMP(__kvm_get_memory_attributes));
> #endif
>
> +#define MEMORY_ATTRIBUTES_MATCH(one, two) \
> + static_assert(offsetof(struct kvm_memory_attributes, one) == \
> + offsetof(struct kvm_memory_attributes2, two)); \
> + static_assert(sizeof_field(struct kvm_memory_attributes, one) ==\
> + sizeof_field(struct kvm_memory_attributes2, two))
> +
> +/* Ensure the common parts of the two structs are identical. */
> +MEMORY_ATTRIBUTES_MATCH(address, address);
> +MEMORY_ATTRIBUTES_MATCH(size, size);
> +MEMORY_ATTRIBUTES_MATCH(attributes, attributes);
> +MEMORY_ATTRIBUTES_MATCH(flags, flags);
> +
> /*
> * Ordering of locks:
> *
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH v6 09/43] KVM: Move kvm_supported_mem_attributes() to kvm_host.h
From: Fuad Tabba @ 2026-05-20 13:53 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-9-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Move kvm_supported_mem_attributes() from kvm_main.c to kvm_host.h and
> make it a static inline function. This allows the helper to be used in
> other parts of the KVM subsystem outside of kvm_main.c. This helper will be
> used later by guest_memfd.
>
> No functional change intended.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> include/linux/kvm_host.h | 10 ++++++++++
> virt/kvm/kvm_main.c | 10 ----------
> 2 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 1deab76dc0a2c..f9ea95e33d050 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2529,6 +2529,16 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
> }
>
> #ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> +static inline u64 kvm_supported_mem_attributes(struct kvm *kvm)
> +{
> +#ifdef kvm_arch_has_private_mem
> + if (!kvm || kvm_arch_has_private_mem(kvm))
> + return KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +#endif
> +
> + return 0;
> +}
> +
> typedef unsigned long (kvm_get_memory_attributes_t)(struct kvm *kvm, gfn_t gfn);
> DECLARE_STATIC_CALL(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0a4024948711a..ff20e63143642 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2428,16 +2428,6 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>
> #ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> -static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> -{
> -#ifdef kvm_arch_has_private_mem
> - if (!kvm || kvm_arch_has_private_mem(kvm))
> - return KVM_MEMORY_ATTRIBUTE_PRIVATE;
> -#endif
> -
> - return 0;
> -}
> -
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
> {
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH v6 08/43] KVM: guest_memfd: Only prepare folios for private pages
From: Fuad Tabba @ 2026-05-20 13:51 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-8-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> All-shared guest_memfd used to be only supported for non-CoCo VMs where
> preparation doesn't apply. INIT_SHARED is about to be supported for
> non-CoCo VMs in a later patch in this series.
>
> In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
> guest_memfd in a later patch in this series.
>
> This means that the kvm fault handler may now call kvm_gmem_get_pfn() on a
> shared folio for a CoCo VM where preparation applies.
>
> Add a check to make sure that preparation is only performed for private
> folios.
>
> Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
> conversion to shared.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> virt/kvm/guest_memfd.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 9d025f518c025..4f7c4824c3a45 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -888,6 +888,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> int *max_order)
> {
> pgoff_t index = kvm_gmem_get_index(slot, gfn);
> + struct inode *inode;
> struct folio *folio;
> int r = 0;
>
> @@ -895,7 +896,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> if (!file)
> return -EFAULT;
>
> - filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> + inode = file_inode(file);
> + filemap_invalidate_lock_shared(inode->i_mapping);
>
> folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
> if (IS_ERR(folio)) {
> @@ -908,7 +910,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> folio_mark_uptodate(folio);
> }
>
> - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> + if (kvm_gmem_is_private_mem(inode, index))
> + r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>
> folio_unlock(folio);
>
> @@ -918,7 +921,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> folio_put(folio);
>
> out:
> - filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> + filemap_invalidate_unlock_shared(inode->i_mapping);
> return r;
> }
> EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH v6 07/43] KVM: guest_memfd: Update kvm_gmem_populate() to use gmem attributes
From: Fuad Tabba @ 2026-05-20 13:47 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-7-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> Update the guest_memfd populate() flow to pull memory attributes from the
> gmem instance instead of the VM when KVM is not configured to track
> shared/private status in the VM.
>
> Rename the per-VM API to make it clear that it retrieves per-VM
> attributes, i.e. is not suitable for use outside of flows that are
> specific to generic per-VM attributes.
>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
/fuad
> ---
> arch/x86/kvm/mmu/mmu.c | 2 +-
> include/linux/kvm_host.h | 14 +++++++++++++-
> virt/kvm/guest_memfd.c | 24 +++++++++++++++++++++---
> virt/kvm/kvm_main.c | 8 +++-----
> 4 files changed, 38 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 153bcc5369985..bfcf9be25598e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7997,7 +7997,7 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
> const unsigned long end = start + KVM_PAGES_PER_HPAGE(level);
>
> if (level == PG_LEVEL_2M)
> - return kvm_range_has_memory_attributes(kvm, start, end, ~0, attrs);
> + return kvm_range_has_vm_memory_attributes(kvm, start, end, ~0, attrs);
>
> for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
> if (hugepage_test_mixed(slot, gfn, level - 1) ||
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 28a54298d27db..1deab76dc0a2c 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2549,12 +2549,24 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> #endif
>
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> -bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> +extern bool vm_memory_attributes;
> +bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> unsigned long mask, unsigned long attrs);
> bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
> struct kvm_gfn_range *range);
> bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> struct kvm_gfn_range *range);
> +#else
> +#define vm_memory_attributes false
> +static inline bool kvm_range_has_vm_memory_attributes(struct kvm *kvm,
> + gfn_t start, gfn_t end,
> + unsigned long mask,
> + unsigned long attrs)
> +{
> + WARN_ONCE(1, "Unexpected call to kvm_range_has_vm_memory_attributes()");
> +
> + return false;
> +}
> #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>
> unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn);
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index f055e058a3f28..9d025f518c025 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -924,12 +924,31 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
>
> #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE
> +static bool kvm_gmem_range_is_private(struct gmem_inode *gi, pgoff_t index,
> + size_t nr_pages, struct kvm *kvm, gfn_t gfn)
> +{
> + pgoff_t end = index + nr_pages - 1;
> + void *entry;
> +
> + if (vm_memory_attributes)
> + return kvm_range_has_vm_memory_attributes(kvm, gfn, gfn + nr_pages,
> + KVM_MEMORY_ATTRIBUTE_PRIVATE,
> + KVM_MEMORY_ATTRIBUTE_PRIVATE);
> +
> + mt_for_each(&gi->attributes, entry, index, end) {
> + if (xa_to_value(entry) != KVM_MEMORY_ATTRIBUTE_PRIVATE)
> + return false;
> + }
> +
> + return true;
> +}
>
> static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> struct file *file, gfn_t gfn, struct page *src_page,
> kvm_gmem_populate_cb post_populate, void *opaque)
> {
> pgoff_t index = kvm_gmem_get_index(slot, gfn);
> + struct gmem_inode *gi;
> struct folio *folio;
> kvm_pfn_t pfn;
> int ret;
> @@ -944,9 +963,8 @@ static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
>
> folio_unlock(folio);
>
> - if (!kvm_range_has_memory_attributes(kvm, gfn, gfn + 1,
> - KVM_MEMORY_ATTRIBUTE_PRIVATE,
> - KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> + gi = GMEM_I(file_inode(file));
> + if (!kvm_gmem_range_is_private(gi, index, 1, kvm, gfn)) {
> ret = -EINVAL;
> goto out_put_folio;
> }
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4139e903f756a..0a4024948711a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -103,9 +103,7 @@ module_param(allow_unsafe_mappings, bool, 0444);
>
> #ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> -static bool vm_memory_attributes = true;
> -#else
> -#define vm_memory_attributes false
> +bool vm_memory_attributes = true;
> #endif
> DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
> EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
> @@ -2450,7 +2448,7 @@ static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
> * Returns true if _all_ gfns in the range [@start, @end) have attributes
> * such that the bits in @mask match @attrs.
> */
> -bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> +bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> unsigned long mask, unsigned long attrs)
> {
> XA_STATE(xas, &kvm->mem_attr_array, start);
> @@ -2584,7 +2582,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> mutex_lock(&kvm->slots_lock);
>
> /* Nothing to do if the entire range has the desired attributes. */
> - if (kvm_range_has_memory_attributes(kvm, start, end, ~0, attributes))
> + if (kvm_range_has_vm_memory_attributes(kvm, start, end, ~0, attributes))
> goto out_unlock;
>
> /*
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH v6 06/43] KVM: x86/mmu: Bug the VM if gmem attributes are queried to determine max mapping level
From: Fuad Tabba @ 2026-05-20 13:33 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-6-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> When the maximum mapping level is queried, KVM's MMU lock is held, and
> while the MMU lock is held, guest_memfd cannot take the
> filemap_invalidate_lock() to look up the current shared/private state of
> the gfn, for these reasons:
>
> + The MMU lock is a spinlock or rwlock and cannot be held while taking a
> lock that can sleep.
> + In guest_memfd's code paths (such as truncate), the
> filemap_invalidate_lock() is held while taking the MMU lock, and taking
> the locks in reverse order would introduce a AB-BA deadlock.
>
> Currently, the maximum mapping level is only queried from guest_memfd in
> the process of recovering huge pages, if dirty logging is disabled on a
> memslot. Dirty logging is not currently supported for guest_memfd, and
> guest_memfd memslots also cannot be updated.
>
> For now, bug the VM if guest_memfd needs to be queried to determine the
> maximum mapping level. This guard can be removed if/when support is added.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a80a876ab4ad6..153bcc5369985 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3357,6 +3357,15 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
> max_level = fault->max_level;
> is_private = fault->is_private;
> } else {
> + /*
> + * Memory attributes cannot be obtained from guest_memfd while
> + * the MMU lock is held.
> + */
> + if (KVM_BUG_ON(static_call_query(__kvm_get_memory_attributes) ==
> + kvm_gmem_get_memory_attributes, kvm)) {
> + return 0;
> + }
> +
This directly takes the address of kvm_gmem_get_memory_attributes,
which is only compiled if CONFIG_KVM_GUEST_MEMFD=y. This breaks
ARCH=i386.
Cheers,
/fuad
> max_level = PG_LEVEL_NUM;
> is_private = kvm_mem_is_private(kvm, gfn);
> }
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH v6 05/43] KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes
From: Fuad Tabba @ 2026-05-20 12:08 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-5-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Implement kvm_gmem_get_memory_attributes() for guest_memfd to allow the KVM
> core and architecture code to query per-GFN memory attributes.
>
> kvm_gmem_get_memory_attributes() finds the memory slot for a given GFN and
> queries the guest_memfd file's to determine if the page is marked as
> private.
>
> If vm_memory_attributes is not enabled, there is no shared/private tracking
> at the VM level. Install the guest_memfd implementation as long as
> guest_memfd is enabled to give guest_memfd a chance to respond on
> attributes.
>
> guest_memfd should look up attributes regardless of whether this memslot is
> gmem-only since attributes are now tracked by gmem regardless of whether
> mmap() is enabled.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
> include/linux/kvm_host.h | 2 ++
> virt/kvm/guest_memfd.c | 31 +++++++++++++++++++++++++++++++
> virt/kvm/kvm_main.c | 3 +++
> 3 files changed, 36 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c5ba2cb34e45c..28a54298d27db 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2557,6 +2557,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> struct kvm_gfn_range *range);
> #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>
> +unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn);
> +
> #ifdef CONFIG_KVM_GUEST_MEMFD
> int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 5011d38820d0d..f055e058a3f28 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -509,6 +509,37 @@ static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
> return 0;
> }
>
> +unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> +{
> + struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
> + struct inode *inode;
> +
> + /*
> + * If this gfn has no associated memslot, there's no chance of the gfn
> + * being backed by private memory, since guest_memfd must be used for
> + * private memory, and guest_memfd must be associated with some memslot.
> + */
> + if (!slot)
> + return 0;
> +
> + CLASS(gmem_get_file, file)(slot);
> + if (!file)
> + return 0;
> +
> + inode = file_inode(file);
> +
> + /*
> + * Rely on the maple tree's internal RCU lock to ensure a
> + * stable result. This result can become stale as soon as the
> + * lock is dropped, so the caller _must_ still protect
> + * consumption of private vs. shared by checking
> + * mmu_invalidate_retry_gfn() under mmu_lock to serialize
> + * against ongoing attribute updates.
> + */
> + return kvm_gmem_get_attributes(inode, kvm_gmem_get_index(slot, gfn));
> +}
Doesn't this imply that all consumers of kvm_mem_is_private() should
validate the result using mmu_lock and the invalidation sequence?
sev_handle_rmp_fault() calls kvm_mem_is_private() without holding
mmu_lock and without any retry mechanism. Is that a problem?
Cheers,
/fuad
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_memory_attributes);
> +
> static struct file_operations kvm_gmem_fops = {
> .mmap = kvm_gmem_mmap,
> .open = generic_file_open,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ee26f1d9b5fda..4139e903f756a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2653,6 +2653,9 @@ static void kvm_init_memory_attributes(void)
> if (vm_memory_attributes)
> static_call_update(__kvm_get_memory_attributes,
> kvm_get_vm_memory_attributes);
> + else if (IS_ENABLED(CONFIG_KVM_GUEST_MEMFD))
> + static_call_update(__kvm_get_memory_attributes,
> + kvm_gmem_get_memory_attributes);
> else
> static_call_update(__kvm_get_memory_attributes,
> (void *)__static_call_return0);
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
* Re: [PATCH v6 04/43] KVM: Stub in ability to disable per-VM memory attribute tracking
From: Fuad Tabba @ 2026-05-20 12:08 UTC (permalink / raw)
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-4-91ab5a8b19a4@google.com>
On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Introduce the basic infrastructure to allow per-VM memory attribute
> tracking to be disabled. This will be built-upon in a later patch, where a
> module param can disable per-VM memory attribute tracking.
>
> Split the Kconfig option into a base KVM_MEMORY_ATTRIBUTES and the
> existing KVM_VM_MEMORY_ATTRIBUTES. The base option provides the core
> plumbing, while the latter enables the full per-VM tracking via an xarray
> and the associated ioctls.
>
> kvm_get_memory_attributes() now performs a static call that either looks up
> kvm->mem_attr_array with CONFIG_KVM_VM_MEMORY_ATTRIBUTES is enabled, or
> just returns 0 otherwise. The static call can be patched depending on
> whether per-VM tracking is enabled by the CONFIG.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Cheers,
/fuad
> ---
> arch/x86/include/asm/kvm_host.h | 2 +-
> include/linux/kvm_host.h | 23 ++++++++++++---------
> virt/kvm/Kconfig | 4 ++++
> virt/kvm/kvm_main.c | 44 ++++++++++++++++++++++++++++++++++++++++-
> 4 files changed, 62 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 60b997764beef..c9aa50bcdac2d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2369,7 +2369,7 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> int tdp_max_root_level, int tdp_huge_page_level);
>
>
> -#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
> #endif
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7d079f9701346..c5ba2cb34e45c 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2528,19 +2528,15 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
> return slot->flags & KVM_MEMSLOT_GMEM_ONLY;
> }
>
> -#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> +typedef unsigned long (kvm_get_memory_attributes_t)(struct kvm *kvm, gfn_t gfn);
> +DECLARE_STATIC_CALL(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
> +
> static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> {
> - return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
> + return static_call(__kvm_get_memory_attributes)(kvm, gfn);
> }
>
> -bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> - unsigned long mask, unsigned long attrs);
> -bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
> - struct kvm_gfn_range *range);
> -bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> - struct kvm_gfn_range *range);
> -
> static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> {
> return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> @@ -2550,6 +2546,15 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> {
> return false;
> }
> +#endif
> +
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> + unsigned long mask, unsigned long attrs);
> +bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
> + struct kvm_gfn_range *range);
> +bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> + struct kvm_gfn_range *range);
> #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
>
> #ifdef CONFIG_KVM_GUEST_MEMFD
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 5119cb37145fc..3fea89c45cfb4 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -100,7 +100,11 @@ config KVM_ELIDE_TLB_FLUSH_IF_YOUNG
> config KVM_MMU_LOCKLESS_AGING
> bool
>
> +config KVM_MEMORY_ATTRIBUTES
> + bool
> +
> config KVM_VM_MEMORY_ATTRIBUTES
> + select KVM_MEMORY_ATTRIBUTES
> bool
>
> config KVM_GUEST_MEMFD
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index abb9cfa3eb04d..ee26f1d9b5fda 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -101,6 +101,17 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(halt_poll_ns_shrink);
> static bool __ro_after_init allow_unsafe_mappings;
> module_param(allow_unsafe_mappings, bool, 0444);
>
> +#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +static bool vm_memory_attributes = true;
> +#else
> +#define vm_memory_attributes false
> +#endif
> +DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
> +EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_TRAMP(__kvm_get_memory_attributes));
> +#endif
> +
> /*
> * Ordering of locks:
> *
> @@ -2418,7 +2429,7 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> }
> #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>
> -#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> {
> #ifdef kvm_arch_has_private_mem
> @@ -2429,6 +2440,12 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> return 0;
> }
>
> +#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> +static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
> +{
> + return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
> +}
> +
> /*
> * Returns true if _all_ gfns in the range [@start, @end) have attributes
> * such that the bits in @mask match @attrs.
> @@ -2625,7 +2642,24 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>
> return kvm_vm_set_mem_attributes(kvm, start, end, attrs->attributes);
> }
> +#else /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
> +static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
> +{
> + BUILD_BUG_ON(1);
> +}
> #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
> +static void kvm_init_memory_attributes(void)
> +{
> + if (vm_memory_attributes)
> + static_call_update(__kvm_get_memory_attributes,
> + kvm_get_vm_memory_attributes);
> + else
> + static_call_update(__kvm_get_memory_attributes,
> + (void *)__static_call_return0);
> +}
> +#else /* CONFIG_KVM_MEMORY_ATTRIBUTES */
> +static void kvm_init_memory_attributes(void) { }
> +#endif /* CONFIG_KVM_MEMORY_ATTRIBUTES */
>
> struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> {
> @@ -4925,6 +4959,9 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> return 1;
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> case KVM_CAP_MEMORY_ATTRIBUTES:
> + if (!vm_memory_attributes)
> + return 0;
> +
> return kvm_supported_mem_attributes(kvm);
> #endif
> #ifdef CONFIG_KVM_GUEST_MEMFD
> @@ -5331,6 +5368,10 @@ static long kvm_vm_ioctl(struct file *filp,
> case KVM_SET_MEMORY_ATTRIBUTES: {
> struct kvm_memory_attributes attrs;
>
> + r = -ENOTTY;
> + if (!vm_memory_attributes)
> + goto out;
> +
> r = -EFAULT;
> if (copy_from_user(&attrs, argp, sizeof(attrs)))
> goto out;
> @@ -6527,6 +6568,7 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
> kvm_preempt_ops.sched_in = kvm_sched_in;
> kvm_preempt_ops.sched_out = kvm_sched_out;
>
> + kvm_init_memory_attributes();
> kvm_init_debug();
>
> r = kvm_vfio_ops_init();
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox